Quantization for Distributed Optimization

Explore the repository»
View Paper

tags: distributed optimization, large-scale machine learning, gradient compression, edge learning, federated learning, deep learning, pytorch

About The Project

Massive amounts of data have made the training of large-scale machine learning models on a single worker inefficient. Distributed machine learning methods such as Parallel-SGD have received significant interest as a solution to this problem. However, the performance of distributed systems does not scale linearly with the number of workers, due to the high network communication cost of synchronizing gradients and parameters. Researchers have proposed techniques such as quantization and sparsification to alleviate this problem by compressing the gradients. Most of these compression schemes produce compressed gradients that cannot be directly aggregated with efficient protocols such as all-reduce.

In this paper, we present a set of all-reduce compatible gradient compression algorithms - QSGDMaxNorm Quantization, QSGDMaxNormMultiScale Quantization, and their sparsified variants - which significantly reduce the communication overhead while maintaining the performance of vanilla SGD. We establish upper bounds on the variance introduced by the quantization schemes and prove their convergence for smooth convex functions. The proposed compression schemes can trade off between the communication cost and the rate of convergence.

We empirically evaluate the performance of the compression methods by training deep neural networks on the CIFAR10 dataset. We examine the performance of training the ResNet50 (computation-intensive) and VGG16 (communication-intensive) models with and without the compression methods, and compare how these methods scale with the number of workers. Our compression methods perform better than the built-in methods currently offered by the deep learning frameworks.
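To give a rough sense of the idea (the exact algorithms are defined in the paper; the function below is an illustrative sketch, not code from this repository), a max-norm style stochastic quantizer scales every worker's gradient by one globally agreed norm, so that the quantized integers from all workers share a common scale:

import torch

def maxnorm_quantize(grad: torch.Tensor, bits: int = 8) -> torch.Tensor:
    # Illustrative sketch: scale by a max-norm that, in the actual scheme, is
    # agreed globally across workers, so all quantized tensors share one scale.
    levels = 2 ** (bits - 1) - 1          # symmetric integer range [-levels, levels]
    scale = grad.abs().max()              # stand-in for the global max-norm
    if scale == 0:
        return torch.zeros_like(grad)
    normalized = grad / scale * levels
    lower = normalized.floor()
    # Stochastic rounding: round up with probability equal to the fractional
    # part, which keeps the quantizer unbiased.
    quantized = lower + torch.bernoulli(normalized - lower)
    return quantized * scale / levels     # dequantize back to the gradient scale

Because the scale is shared, the integer payloads can be aggregated with a single all-reduce and dequantized once afterwards, which is what makes such a scheme all-reduce compatible, unlike per-worker scaling.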

Built With

This project was built with

  • python v3.7.6
  • PyTorch v1.7.1
  • The environment used for developing this project is available at environment.yml.

Getting Started

Clone the repository to your local machine using:

git clone https://github.com/vineeths96/Gradient-Compression
cd Gradient-Compression/

Prerequisites

Create a new conda environment and install all the required libraries by running the following command:

conda env create -f environment.yml

The dataset used in this project (CIFAR-10) will be automatically downloaded and set up in the data directory during execution.
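For reference, the automatic download typically boils down to a torchvision call along the following lines (the root path and transform here are illustrative, not copied from trainer.py):

import torchvision
import torchvision.transforms as transforms

# Downloads CIFAR-10 into ./data on the first run; later runs reuse the cached files.
train_set = torchvision.datasets.CIFAR10(
    root="./data",
    train=True,
    download=True,
    transform=transforms.ToTensor(),
)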

Instructions to run

The training of the models can be performed on a distributed cluster with multiple machines and multiple worker GPUs. We use torch.distributed.launch to launch the distributed training. More information about the launch utility is available in the PyTorch documentation.

To launch distributed training on a single machine with multiple workers (GPUs),

python -m torch.distributed.launch --nproc_per_node=<num_gpus> trainer.py --local_world_size=<num_gpus> 

To launch distributed training on multiple machines with multiple workers (GPUs),

export NCCL_SOCKET_IFNAME=ens3

python -m torch.distributed.launch --nproc_per_node=<num_gpus> --nnodes=<num_machines> --node_rank=<node_rank> --master_addr=<master_address> --master_port=<master_port> trainer.py --local_world_size=<num_gpus>
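For context, a script launched this way typically initializes its process group from the environment variables set by the launcher. The snippet below is a minimal sketch of that setup (argument handling other than --local_world_size is illustrative; see trainer.py for the actual code):

import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)        # injected by torch.distributed.launch
parser.add_argument("--local_world_size", type=int, default=1)
args = parser.parse_args()

# The launcher sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE;
# NCCL is the usual backend for multi-GPU training.
dist.init_process_group(backend="nccl", init_method="env://")
torch.cuda.set_device(args.local_rank)
print(f"Rank {dist.get_rank()}/{dist.get_world_size()} using GPU {args.local_rank}")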

Model overview

We conducted experiments on the ResNet50 and VGG16 architectures. Refer to the original papers for more information about the models. We use publicly available implementations from GitHub for reproducing the models.
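As an illustration only (the repository uses publicly available CIFAR-10 implementations from GitHub rather than these ImageNet-oriented variants), the two architectures can be instantiated with ten output classes as follows:

import torchvision.models as models

# Illustrative instantiation for a 10-class problem such as CIFAR-10.
resnet50 = models.resnet50(num_classes=10)
vgg16 = models.vgg16(num_classes=10)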

Results

We highly recommend reading through the paper before proceeding to this section. The paper explains the different compression schemes we propose and contains many more analyses and results than what is presented here.

We begin with an explanation of the notation used for the plot legends in this section. AllReduce-SGD corresponds to the default gradient aggregation provided by PyTorch. QSGD-MN and GRandK-MN correspond to QSGDMaxNorm Quantization and GlobalRandKMaxNorm Compression, respectively; the precision (number of bits) used for the representation follows the name. QSGD-MN-TS and GRandK-MN-TS correspond to QSGDMaxNormMultiScale Quantization and GlobalRandKMaxNormMultiScale Compression, respectively, with two scales (TS) of compression; the precisions of the two scales follow the name. For the sparsified schemes, we choose K = 10000 for all the experiments. We compare our methods with PowerSGD, a recent all-reduce compatible gradient compression scheme, at Rank-1 and Rank-2 compression.
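To make the sparsified schemes concrete, the sketch below shows the core idea behind a GlobalRandK-style compressor: every worker draws the same K random indices (for example, from a seed shared per step), so the selected values form a dense K-vector that can be summed with a single all-reduce. The function name and details are illustrative, not code from this repository:

import torch

def global_randk_compress(grad: torch.Tensor, k: int = 10000, seed: int = 0):
    # All workers use the same seed for this step, so they pick identical indices.
    flat = grad.flatten()
    gen = torch.Generator().manual_seed(seed)
    indices = torch.randperm(flat.numel(), generator=gen)[:k].to(flat.device)
    values = flat[indices]               # dense k-vector, directly all-reduce friendly
    return indices, values

After the all-reduce, each worker scatters the averaged values back into a zero tensor at the shared indices to recover the aggregated sparse gradient.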

Figures (for ResNet50 and VGG16): loss curves, Top-1 accuracy curves, and scalability with the number of GPUs. See the repository for the plots.

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - [email protected]

Project Link: https://github.com/vineeths96/Gradient-Compression

