
DeAR: decoupling the all-reduce primitive to accelerate distributed deep learning

Introduction

We propose a new optimization algorithm, DeAR, which decouples the all-reduce primitive into two operations so as to enable fine-grained scheduling without introducing extra communication overhead. This repository contains DeAR's source code, as well as a set of benchmarking scripts for evaluating the training performance of popular data-parallel distributed deep learning methods. Currently, it covers:

Optimization algorithms without Tensor Fusion

  • Wait-free backpropagation (WFBP), which pipelines the backward computations with gradient communications (see the sketch after this list).
  • ByteScheduler, which uses tensor partitioning and priority scheduling to overlap some communication tasks with feed-forward computation tasks.
  • DeAR w/o TF, which disables the tensor fusion technique by setting THRESHOLD=None and NUM_NEARBY_LAYERS=1.
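
The sketch below is our own illustration of the WFBP idea and is not the code in this repository; the helper name attach_wfbp_hooks is hypothetical, and it assumes torch.distributed has already been initialized. It launches an asynchronous all-reduce for each gradient as soon as autograd produces it, so communication overlaps with the rest of the backward pass.

import torch
import torch.distributed as dist

def attach_wfbp_hooks(model):
    """Illustrative WFBP: overlap gradient all-reduce with backpropagation."""
    handles = []

    def make_hook():
        def hook(grad):
            # Start an asynchronous all-reduce as soon as this gradient is ready.
            work = dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True)
            handles.append((work, grad))
            return grad
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook())

    def synchronize():
        # Call after loss.backward() and before optimizer.step().
        for work, grad in handles:
            work.wait()
            grad.div_(dist.get_world_size())  # average across workers
        handles.clear()

    return synchronize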

Optimization algorithms with Tensor Fusion

Deep Neural Networks

Installation

Prerequisites

Get the code

$ git clone https://github.com/lzhangbv/dear_pytorch.git
$ cd dear_pytorch
$ pip install -r requirements.txt
$ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod==0.21.3

If the pip installation fails, try upgrading pip via pip install --upgrade pip. If the Horovod installation with NCCL fails, please check the Horovod installation guide. To run ByteScheduler, please check its installation instructions; it was found to be compatible with PyTorch 1.4.

If you encounter other errors during installation, please check the install document (contributed by Haoxuan Yu). We recommend using the same software versions as reported in our paper (Section VI.A).

Configure the cluster settings

Before running the scripts, please carefully edit the configuration files in the configs directory.

  • configs/cluster*: configure the host files for MPI (see the example after this list)
  • configs/envs.conf: configure the cluster environments
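
For reference, an MPI host file typically lists one machine per line with its number of slots (GPUs). The host names below are placeholders, and the exact format expected by configs/cluster* may differ:

node1 slots=4
node2 slots=4
node3 slots=4
node4 slots=4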

Compile the communication package:

$ cd common/comm_core
$ bash compile.sh

Create a log folder in the dear_pytorch directory, e.g.,

$ mkdir -p logs/sc22-tf

Run benchmarks

  • The batch mode
$ python benchmarks.py

For different experimental settings, users can modify the DNN model, the batch size, the number of GPUs, and the network configurations in the benchmarks.py script.

  • The individual mode, e.g.,
$ cd dear
$ dnn=resnet50 bs=64 nworkers=64 ./horovod_mpi_cj.sh

Before running DeAR w/o tensor fusion, please set THRESHOLD=None and NUM_NEARBY_LAYERS=1 in DeAR's dopt_rsag.py script; for DeAR with tensor fusion, we use THRESHOLD=25MB by default. To support Bayesian optimization, please import dopt_rsag_bo and increase num-warmup-batches to at least 60 in DeAR's benchmark scripts so that the buffer size can be tuned.
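
For reference, the two settings look roughly like the following in dopt_rsag.py (an illustrative sketch; the actual layout and constant form in the file may differ):

# DeAR without tensor fusion:
THRESHOLD = None         # do not fuse gradients into larger buffers
NUM_NEARBY_LAYERS = 1    # communicate each layer separately

# DeAR with tensor fusion (the default) uses a 25 MB threshold instead, e.g.:
# THRESHOLD = 25 * 1024 * 1024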

DeAR Usage

The DeAR distributed optimizer can be used in much the same way as horovod.DistributedOptimizer().

import dear

# Initialize DeAR's communication backend.
dear.init()
... 
# Wrap any standard PyTorch optimizer with the DeAR distributed optimizer.
optimizer = optim.SGD(model.parameters(), ...)
optimizer = dear.DistributedOptimizer(optimizer, ...)
... 
# The training loop is unchanged; gradient communication is handled by DeAR.
for i, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
...
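
For intuition, DeAR builds on the fact that an all-reduce is equivalent to a reduce-scatter followed by an all-gather, and the two halves can then be scheduled independently. The sketch below shows that equivalence in plain torch.distributed; it is illustrative only (the helper name is ours, not part of the DeAR API) and assumes the tensor length is divisible by the number of workers.

import torch
import torch.distributed as dist

def decoupled_all_reduce(tensor):
    # Equivalent to dist.all_reduce(tensor), expressed as two primitives.
    world_size = dist.get_world_size()
    # Split the flat tensor into one contiguous chunk per worker.
    chunks = [c.contiguous() for c in tensor.chunk(world_size)]
    reduced = torch.empty_like(chunks[0])
    # Step 1: reduce-scatter, so each rank holds the sum of one chunk.
    dist.reduce_scatter(reduced, chunks, op=dist.ReduceOp.SUM)
    # Step 2: all-gather the reduced chunks to rebuild the full tensor.
    gathered = [torch.empty_like(reduced) for _ in range(world_size)]
    dist.all_gather(gathered, reduced)
    return torch.cat(gathered)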

DeAR Example

An example script for training on MNIST is provided:

$ bash mnist.sh

Paper

If you use this repository for your paper, please cite our work:

@article{zhang2023decoupling,
  title={Decoupling the All-Reduce Primitive for Accelerating Distributed Deep Learning},
  author={Zhang, Lin and Shi, Shaohuai and Chu, Xiaowen and Wang, Wei and Li, Bo and Liu, Chengjian},
  journal={arXiv preprint arXiv:2302.12445},
  year={2023}
}
