
multi-gpu-tools's People

Contributors

ayushdg, jnke2016, nv-rliu, pentschev, rlratzel


multi-gpu-tools's Issues

Generalize pytest and benchmark scripts

The current scripts for running pytests and nightly benchmarks are cugraph-specific and may not apply more generally.

Raising this issue to track the status of generalizing the scripts in one of two ways:

  • Option 1: Generalize the 2 scripts so that cugraph can still use them, but without assumptions that are specific to the cugraph scripts or folder structure.
  • Option 2: Turn the 2 scripts into templates/examples of how such runners can be implemented, and have each repository that runs the pytests/benchmarks contain its own concrete runners.

Open to thoughts and discussion on the topic.

Nightly MNMG test runs need to provide dask logs from single-node tests too

Current nightly runs include the dask logs (for workers and scheduler) only for multi-node runs; single-node runs (2-8 GPUs) have no dask logs at all. This is because log locations are specified differently for LocalCUDACluster (i.e. single-node) than for a multi-node cluster. The dask logs are needed for debugging single-node MG test failures.
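One possible way to collect logs for a single-node run is to pull them from the running cluster and write them next to the other test artifacts. The sketch below is only an illustration: `write_dask_logs` and the file naming are hypothetical, and it assumes the scheduler log is an iterable of `(level, message)` pairs and the worker logs a mapping from worker address to such pairs.

```python
import re
from pathlib import Path


def write_dask_logs(scheduler_logs, worker_logs, out_dir):
    """Write scheduler and per-worker log entries to files under out_dir.

    scheduler_logs: iterable of (level, message) pairs (assumed shape).
    worker_logs: dict of worker address -> iterable of (level, message) pairs
                 (assumed shape).
    Returns the list of files written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    def dump(path, entries):
        path.write_text("".join(f"{level}: {message}\n" for level, message in entries))
        return path

    written = [dump(out / "scheduler.log", scheduler_logs)]
    for address, entries in worker_logs.items():
        # Worker addresses such as "tcp://127.0.0.1:4000" are not valid file names.
        safe = re.sub(r"[^A-Za-z0-9_.-]+", "_", address).strip("_")
        written.append(dump(out / f"worker_{safe}.log", entries))
    return written
```

With a live `distributed.Client`, the two inputs could come from `client.get_scheduler_logs()` and `client.get_worker_logs()`; the exact shape of their return values should be checked against the installed version.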

report generation step should also generate html for logs

The current create-html-reports.sh script is responsible for generating the required .html files used by project-specific reporting tools. For example, for cugraph, the generated files are uploaded to a server and a message that includes links to the files on the server is sent to team members.

The log files are currently not processed by this script; in the case of cugraph, they are uploaded as-is and the user's browser displays them as text. It would be relatively easy to also generate html for the logs, so that each line gets a line number with a link that can be shared to point at that specific line.

To summarize the request:

  • Add code to create-html-reports.sh to generate a .html file per log file that includes line numbers with hyperlinks that can be shared.
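A minimal sketch of the per-log conversion, written in Python rather than inside create-html-reports.sh itself. The function name `log_to_html` and the `L<number>` anchor scheme are assumptions made for illustration, not part of the existing scripts.

```python
import html


def log_to_html(log_text: str, title: str = "log") -> str:
    """Render plain log text as an HTML page with one linkable anchor per line."""
    rows = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        anchor = f"L{number}"
        # The line number links to its own anchor, so the URL can be copied and shared.
        rows.append(
            f'<span id="{anchor}"><a href="#{anchor}">{number:>6}</a>  '
            f"{html.escape(line)}</span>"
        )
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><pre>\n{body}\n</pre></body></html>\n"
    )
```

A link such as `some-log.html#L120` would then jump to line 120 of that log. Log content is escaped so that text resembling markup is displayed rather than interpreted.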

README needs updating

The README is in desperate need of updating in order for this repo to be adopted by more projects/users.

It needs the following examples:

  • Basic use by a "project" script that overrides the default config. This was the initial use case: a project like cuGraph or cuML would, for example, have a MNMG test-running script that used the mg-tools scripts to set up the MNMG environment, but with directories, conda env names, etc. specific to the project, which overrode the default config before the mg-tools scripts were called.
  • Using the scripts interactively from the CLI

Since the scripts are intended to be a "toolbox", they are meant to be called individually as part of a larger user workflow (these scripts could eventually be packaged as a conda install). Because of that, the README also needs the following documentation:

  • Description of the script hierarchy (in an ASCII tree-like format) which includes every script.
  • Section containing a brief one-line description of what each script does.
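The ASCII tree for the README could be generated rather than maintained by hand. The sketch below builds one from a list of relative paths; `render_tree` and the file names in the usage note are hypothetical placeholders, not the repo's actual layout.

```python
def render_tree(paths):
    """Render relative, '/'-separated paths as an ASCII tree, sorted by name."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})

    lines = []

    def walk(node, prefix):
        names = sorted(node)
        for index, name in enumerate(names):
            last = index == len(names) - 1
            lines.append(f"{prefix}{'`-- ' if last else '|-- '}{name}")
            # Children of the last entry need no vertical guide line.
            walk(node[name], prefix + ("    " if last else "|   "))

    walk(tree, "")
    return "\n".join(lines)
```

For example, `render_tree(["a/x.sh", "a/y.sh", "b.sh"])` lists `a` with its two scripts nested beneath it, followed by `b.sh`. The one-line description of each script could be appended to the corresponding tree line.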

[FEA] Add cluster setup scripts for different communication protocols/setups

The existing scripts should be updated to support different protocol options and make use of the different networking hardware present in a given cluster. For the time being I see 3 options:

  • TCP only
  • UCX + NVLink + no-ib
  • UCX + NVLink + IB

This could either be separated out into 3 different scripts that export the relevant variables and build the appropriate scheduler and worker arguments, or be implemented as different functions within the existing cluster startup scripts, with an additional parameter defining the cluster type.
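The second approach (a single entry point with a cluster-type parameter) could look roughly like the sketch below. The cluster-type names and the `cluster_options` helper are hypothetical. The keyword names are modeled on the UCX-related options that dask-cuda documents for `LocalCUDACluster`, and should be verified against the installed version before use.

```python
# Mapping from a hypothetical cluster-type name to communication options.
# Keyword names are assumptions modeled on dask-cuda's documented options.
CLUSTER_OPTIONS = {
    "tcp": {"protocol": "tcp"},
    "ucx-nvlink": {
        "protocol": "ucx",
        "enable_tcp_over_ucx": True,
        "enable_nvlink": True,
        "enable_infiniband": False,
    },
    "ucx-nvlink-ib": {
        "protocol": "ucx",
        "enable_tcp_over_ucx": True,
        "enable_nvlink": True,
        "enable_infiniband": True,
    },
}


def cluster_options(cluster_type):
    """Return a copy of the communication options for the given cluster type."""
    try:
        return dict(CLUSTER_OPTIONS[cluster_type])
    except KeyError:
        valid = ", ".join(sorted(CLUSTER_OPTIONS))
        raise ValueError(
            f"unknown cluster type {cluster_type!r}; expected one of: {valid}"
        ) from None
```

Keeping the three setups in one table means a new hardware configuration is a one-entry change, whereas three separate scripts would each need their own copy of the shared startup logic.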

Do others have a preference between these two approaches?

Happy to take this on and make the updates.

cc: @rlratzel
