multi-gpu-tools's Issues
Generalize pytest and benchmark scripts
The current scripts for running pytests and nightly benchmarks are cugraph specific and may not apply more generally.
Raising this issue to track the status of generalizing the scripts in one of two ways:
- Option 1: Generalize the 2 scripts in a way that can be utilized by cugraph, while also being general enough to not have assumptions that might be specific to cugraph scripts/folder structure
- Option 2: Generalize the 2 scripts to be more like a template/example of how they can be implemented and the repositories implementing the pytest/benchmark contain the concrete runners for these tests.
Open to thoughts and discussion on the topic.
Nightly MNMG test runs need to provide dask logs from single-node tests too
Current nightly runs only include the dask logs (for workers and scheduler) for multi-node runs; single-node runs (2-8 GPUs) have no dask logs at all. This is due to the difference in how log locations are specified for LocalCUDACluster (i.e. single-node) vs. a multi-node cluster. The dask logs are needed for debugging single-node MG test failures.
report generation step should also generate html for logs
The current create-html-reports.sh script is responsible for generating the required .html files used by project-specific reporting tools. For example, for cugraph, the generated files are uploaded to a server and a message that includes links to the files on the server is sent to team members.
The log files are currently not processed by the above script (in the case of cugraph, they are uploaded as-is and the user's browser displays them as text), but it would be relatively easy to also generate html for the logs, specifically so line numbers could be added with links that can be shared to specific lines.
To summarize the request:
- Add code to create-html-reports.sh to generate a .html file per log file that includes line numbers with hyperlinks that can be shared.
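As a rough illustration of the request above, here is a minimal Python sketch of the per-log conversion. It is not part of create-html-reports.sh; the function names `log_to_html` and `convert_logs`, the `L<n>` anchor scheme, and the `*.log` pattern are all assumptions made for the example.

```python
import html
from pathlib import Path


def log_to_html(log_text, title="log"):
    """Render log text as HTML with one linkable anchor per line."""
    rows = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        anchor = f"L{number}"  # hypothetical anchor naming scheme
        rows.append(
            f'<a id="{anchor}" href="#{anchor}">{number:>6}</a>  '
            f"{html.escape(line)}"
        )
    body = "\n".join(rows)
    return (
        "<!DOCTYPE html>\n"
        f'<html><head><meta charset="utf-8"><title>{html.escape(title)}</title></head>\n'
        f"<body><pre>\n{body}\n</pre></body></html>\n"
    )


def convert_logs(log_dir, pattern="*.log"):
    """Write <name>.html next to every log file in log_dir matching pattern."""
    written = []
    for log_path in sorted(Path(log_dir).glob(pattern)):
        html_path = log_path.with_name(log_path.name + ".html")
        # Replace undecodable bytes instead of failing on a damaged log.
        text = log_path.read_text(errors="replace")
        html_path.write_text(log_to_html(text, title=log_path.name))
        written.append(html_path)
    return written
```

With this scheme, a link ending in `#L120` would open the generated page at line 120 of the log. The log content is HTML-escaped so that angle brackets in log messages are shown as text rather than interpreted as markup.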
README needs updating
The README is in desperate need of updating in order for this repo to be adopted by more projects/users.
It needs the following examples:
- Basic use by a "project" script that overrides the default config. This was the initial use case, where a project like cuGraph or cuML would, for example, have a MNMG test running script that used the mg-tools scripts to set up the MNMG environment, but had directories, conda env names, etc. specific to their project that overrode the default config before calling the mg-tools scripts.
- Using the scripts interactively from the CLI
Since the scripts are intended to be a "toolbox", they are meant to be called individually as part of a larger user workflow (these scripts could eventually be packaged as a conda install). Because of that, the README also needs the following documentation:
- Description of the script hierarchy (in an ASCII tree-like format) which includes every script.
- Section containing a brief one-line description of what each script does.
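One way to keep the ASCII tree in the README from drifting out of date would be to generate it from a list of script paths. The following is a small sketch under that assumption; `render_tree` is a hypothetical helper, and the file names in the usage note are placeholders rather than the repo's actual scripts.

```python
def render_tree(paths, root="."):
    """Render a list of relative file paths as an ASCII tree."""
    # Build a nested dict: directories map to dicts, files map to empty dicts.
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip("/").split("/"):
            node = node.setdefault(part, {})

    lines = [root]

    def walk(node, prefix):
        names = sorted(node)
        for index, name in enumerate(names):
            last = index == len(names) - 1
            children = node[name]
            label = name + "/" if children else name
            lines.append(prefix + ("`-- " if last else "|-- ") + label)
            # Continue the vertical rule only while siblings remain below.
            walk(children, prefix + ("    " if last else "|   "))

    walk(tree, "")
    return "\n".join(lines)
```

For example, `render_tree(["run.sh", "lib/setup.sh", "lib/util.sh"], root="tools")` (placeholder names) produces a tree with `lib/` and its two scripts listed first, followed by `run.sh`. The input list could come from something like `git ls-files` filtered to script files.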
[FEA] Add cluster setup scripts for different communication protocols/setups
The existing scripts should be updated to support different protocol options and make use of the networking hardware present in a given cluster. For the time being I see 3 options:
- TCP only
- UCX + NVLink + no-ib
- UCX + NVLink + IB
This could either be separated out into 3 different scripts that export the relevant variables and build the appropriate scheduler and worker arguments, or be implemented as different functions within the existing cluster startup scripts, with an additional parameter defining the cluster type.
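To make the second approach concrete, here is a hedged Python sketch of a single function that takes a cluster-type parameter and returns scheduler and worker arguments. The names `ClusterType` and `build_cluster_args` are invented for the example. The flag names are assumed to correspond to the dask scheduler and dask-cuda worker command-line options and should be checked against the installed versions; any environment variables the real scripts would need to export are not covered.

```python
from enum import Enum


class ClusterType(Enum):
    """The three setups listed above (identifiers are hypothetical)."""
    TCP = "tcp"
    UCX_NVLINK = "ucx-nvlink"
    UCX_NVLINK_IB = "ucx-nvlink-ib"


def build_cluster_args(cluster_type):
    """Return (scheduler_args, worker_args) for the given cluster type.

    Flag names are assumptions to be verified against the dask and
    dask-cuda versions in use.
    """
    cluster_type = ClusterType(cluster_type)  # raises ValueError if unknown
    if cluster_type is ClusterType.TCP:
        return ["--protocol", "tcp"], []

    scheduler_args = ["--protocol", "ucx"]
    worker_args = ["--enable-tcp-over-ucx", "--enable-nvlink"]
    if cluster_type is ClusterType.UCX_NVLINK_IB:
        worker_args.append("--enable-infiniband")
    return scheduler_args, worker_args
```

Keeping the three setups in one function means the shared UCX arguments are written once and an unknown cluster type fails early, whereas three separate scripts would be easier to read in isolation but would duplicate the common parts.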
Do others have a preference between these two approaches?
Happy to take this on and make the updates.
cc: @rlratzel