Git Product home page Git Product logo

taccl's Introduction

TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches
Aashaka Shah, Vijay Chidambaram, Meghan Cowan, Saeed Maleki, Madan Musuvathi, Todd Mytkowicz, Jacob Nelson, Olli Saarikivi, Rachee Singh
NSDI 2023 [https://arxiv.org/pdf/2111.04867.pdf]

TACCL is a tool to generate algorithms for collective communication like AllGather, AllToAll, and AllReduce for any given hardware configuration. TACCL takes a human-in-loop approach to algorithm design in which a user provides communication sketches to guide the synthesis process. TACCL outputs TACCL-EF, an execution format that contains the schedule of GPU data transfers to efficiently implement the target collective. TACCL's schedules can be run by registering TACCL-EF files using the MSCCL tool stack.

Installation

We use Gurobi to solve the optimization problem. Please obtain a Gurobi license online and then proceed to install the Gurobi licensing tools as follows. Within an anaconda environment, run:

conda config --add channels http://conda.anaconda.org/gurobi
conda install -c conda-forge gurobi -y
<command to add Gurobi license>

Finally, run

pip install .

Usage

Generating the algorithm

To generate data transfer algorithm for a collective for a specific topology , please provide a topology file <topo-file.json> containing profiling information of links in the node and a sketch file <sketch.json> specifying the communication sketch. Using these, a JSON file of data transfer steps for the collective algorithm can be generated with the following command:

$ taccl solve <topo> <coll> --topology-file <topo-file.json> --sketch-file <sketch.json> -o <output.json>

For example, in order to generate an algorithm for Allgather used in the Evaluation for dgx2-sk-1, we run the following command:

$ cd taccl/examples
$ taccl solve DGX2 Allgather --topology-file ../taccl/examples/topo/topo-dgx2-1MB.json --sketch-file ../taccl/examples/sketch/sk1-dgx2-n2.json

To generate schedules for the combining collective AllReduce, we first obtain an AllGather algorithm and use it to generate the AllReduce algorithm. is the timestamp that is used to save the send_dict of AllGather algorithm and can be obtained from the suffix of the json algorithm file.

$ cd taccl/examples
$ taccl solve DGX2 Allgather --topology-file ../taccl/examples/topo/topo-dgx2-1MB.json --sketch-file ../taccl/examples/sketch/sk1-dgx2-n2.json
$ taccl combine DGX2 Allgather --topology-file ../taccl/examples/topo/topo-dgx2-1MB.json --sketch-file ../taccl/examples/sketch/sk1-dgx2-n2.json --ts <ts>

./commands.sh shows the commands with sketches and profiled topologies that can be used to obtain the results in our paper.

Providing topology

  • "<topo>" should be selected from the 'known_topologies' constructor and can be either of "custom", "HubAndSpoke", "DGX2", or "NDv2". ./taccl/topologies/generic.py defines the link connection matrix within a node for each <topo>. links[dst][src] is 1 if there is one link going from src to dst.
  • "<topo-file.json>" contains profiling information about the node, like latency and bandwidth costs of intra-node and inter-node transfers as well as number of GPUs and NICs in the node. Depending on the profiling information given in the topology-file provided by the user, the profile attributes alpha, betas, invbws, nics_per_node, remote_invbw, remote_alpha, and remote_beta are used to obtain a NodeTopology object. taccl/examples/topo directory gives examples of topology files for DGX-2 and NDv2 nodes that can be provided to --topology-file.

You can add new node topologies as a separate function in ./taccl/cli/known_topologies.py or use the "custom" option for topologies and provide all details of the topology in the user-provided topology-file.

Providing communication sketch

A communication sketch has the following three purposes:

  1. Create a logical topology that determines how different nodes are connected to each other
  2. Annotate links which form a switch
  3. Annotate symmetry planes in the topology

./taccl/examples/sketch directory gives examples of some communication sketches that can be used for NVIDIA DGX-2 and Azure NDv2 nodes.

Lowering to TACCL-EF

Once the algorithm is generated, we lower it into a TACCL-EF file. The number of instances <instances> determines the multiple we will use to increase the number of channels that are used to perform sends and receives. When there are already many threadblocks being used per GPU, you should use a single instance for the best performance.

$ taccl ncclize <output.json> --instances <instances>

Running with MSCCL

The TACCL-EF file can be used as input by the MSCCL runtime to actually run the algorithm on the hardware. Please follow the setup instructions in the MSCCL repository to run TACCL algorithms. Once MSCCL is setup, TACCL-EF files can be benchmarked using nccl-tests as follows:

$ mpirun -np <ngpus> -x LD_LIBRARY_PATH=msccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x MSCCL_XML_FILES=<taccl-ef> -x NCCL_ALGO=MSCCL,RING,TREE  nccl-tests/build/<nccl-test-binary> -b 128 -e 32MB -f 2 -g 1 -c 1 -n 100 -w 100 -G 100 -z 0

TACCL-EF files can also be registered in the MSCCL-toolkit to be used in frameworks like PyTorch and Tensorflow.

Guide to providing topology profiles

INPUT_GUIDE.md provides more details of how to specify different profiles for topologies.

Citation

Shah, A., Chidambaram, V., Cowan, M., Maleki, S., Musuvathi, M., Mytkowicz, T., Nelson, J. and Saarikivi, O., 2023. {TACCL}: Guiding Collective Algorithm Synthesis using Communication Sketches. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) (pp. 593-612).

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

taccl's People

Contributors

aashaka avatar microsoftopensource avatar saeedmaleki avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

taccl's Issues

Action required: self-attest your goal for this repository

It's time to review and renew the intent of this repository

An owner or administrator of this repository has previously indicated that this repository can not be migrate to GitHub inside Microsoft because it is going public, open source, or it is used to collaborate with external parties (customers, partners, suppliers, etc.).

Action

πŸ‘€ ✍️ In order to keep Microsoft secure, we require repository owners and administrators to review this repository and regularly renew the intent whether to opt-in or opt-out of migration to GitHub inside Microsoft which is specifically intended for private or internal projects.

❗Only users with admin permission in the repository are allowed to respond. Failure to provide a response will result to your repository getting automatically archived. πŸ”’

Instructions

❌ Opt-out of migration

If this repository can not be migrated to GitHub inside Microsoft, you can opt-out of migration by replying with a comment on this issue containing one of the following optout command options below.

@gimsvc optout --reason <staging|collaboration|delete|other>

Example: @gimsvc optout --reason staging

Options:

  • staging : My project will ship as Open Source
  • collaboration : Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete : This repository will be deleted because it is no longer needed.
  • other : Other reasons not specified

βœ… Opt-in to migrate

If the circumstances of this repository has changed and you decide that you need to migrate, then you can specify the optin command below. For example, the repository is no longer going public, open source or require external collaboration.

@gimsvc optin --date <target_migration_date in mm-dd-yyyy format>

Example: @gimsvc optin --date 03-15-2023

Click here for more information about optin and optout command options and examples

Opt-in

@gimsvc optin --date <target_migration_date>

When opting-in to migrate your repository, the --date option is required followed by your specified migration date using the format: mm-dd-yyyy

@gimsvc optin --date 03-15-2023

Opt-out

@gimsvc optout --reason <staging|collaboration|delete|other>

When opting-out of migration, you need to specify the --reason.

  • staging
    • My project will ship as Open Source
  • collaboration
    • Used for external or 3rd party collaboration with customers, partners, suppliers, etc.
  • delete
    • This repository will be deleted because it is no longer needed.
  • other
    • Other reasons not specified

Examples:

@gimsvc optout --reason staging

@gimsvc optout --reason collaboration

@gimsvc optout --reason delete

@gimsvc optout --reason other

Need more help? πŸ–οΈ

Alltoall on DGX2 returns infeasible

Hi,

I am trying to generate schedules for DGX2 for Alltoall collective following the README.
When I run the command, taccl solve DGX2 Alltoall --topology-file ../taccl/examples/topo/topo-dgx2-1MB.json --sketch-file ../taccl/examples/sketch/sk1-dgx2-n2.json, the tool raises value error saying the model is infeasible. The same command with Allgather run quickly and produces the schedule. Is there some changes to be made to the topology or sketch for Alltoall? The paper mentions that same sketch was used for both collectives.

Thanks
Siva

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    πŸ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. πŸ“ŠπŸ“ˆπŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❀️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.