
distributed-resnet-tensorflow's Introduction

Distributed ResNet on the CIFAR and ImageNet datasets.

This repo contains code for distributed ResNet training and scripts to submit distributed tasks on a Slurm system, specifically for multiple machines that each have one GPU card. I use the official ResNet model provided by Google and wrap it with my distributed code using SyncReplicasOptimizer. Some modifications to the official r1.4 model were made to fit TensorFlow r1.3.
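As a rough illustration of what synchronous replica training means, the sketch below averages the gradients from all workers before applying a single update. It is plain Python with hypothetical names (`sync_step`, `worker_grads`); it is not the code in this repo and not the TensorFlow API.

```python
def sync_step(weights, worker_grads, lr):
    """Apply one synchronous update: average per-worker gradients, then step once."""
    num_workers = len(worker_grads)
    # Average each gradient component over all workers.
    avg = [sum(g[i] for g in worker_grads) / num_workers
           for i in range(len(weights))]
    # One shared update for all replicas.
    return [w - lr * g for w, g in zip(weights, avg)]


weights = [1.0, 2.0]
worker_grads = [[0.5, 1.0], [1.5, 3.0]]  # hypothetical gradients from 2 workers
new_weights = sync_step(weights, worker_grads, lr=0.5)
print(new_weights)  # [0.5, 1.0]
```

Every worker contributes to each step, so the slowest worker sets the pace of the whole job.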

Known problem:

I ran into the same problem with SyncReplicasOptimizer as mentioned in

github issue

stackoverflow

If you know how to fix this problem, please contact the author: Jiarui Fang ([email protected])

Results with this code:

  1. CIFAR-10. Global batch size = 128; evaluation results on the test data are as follows.

A. One CPU with 4 Titan Xp GPUs.

| CIFAR-10 Model | Horovod Best Precision | #node | Steps | Speed (stp/sec) |
|---|---|---|---|---|
| 50 layer | 93.3% | 4 | ~90k | 21.82 |

Each node is a P100 GPU.

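| CIFAR-10 Model | TF Best Precision | PS-WK | Steps | Speed (stp/sec) | Horovod Best Prec. | #node | Speed |
|---|---|---|---|---|---|---|---|
| 50 layer | 93.6% | local | ~80k | 13.94 | | | |
| 50 layer | 85.2% | 1ps-1wk | ~80k | 10.19 | | | |
| 50 layer | 86.4% | 2ps-4wk | ~80k | 20.3 | | | |
| 50 layer | 87.3% | 4ps-8wk | ~60k | 19.19 | - | 8 | 28.66 |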

The best eval precisions are shown in the following picture. Jumps in the curves are caused by restarting evaluation from a checkpoint, which loses the previous best precision value and shows up as a sudden drop in the curve.

[image]

The distributed versions reach lower eval accuracy than the results reported in TensorFlow Models Research.
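One possible way to avoid such jumps is to persist the best precision next to the checkpoints and reload it when evaluation restarts. The sketch below is only an illustration under that assumption; the function name `update_best_precision` and the JSON state file are hypothetical and not part of this repo.

```python
import json
import os
import tempfile


def update_best_precision(state_path, precision):
    """Return the best precision seen so far, persisting it across restarts."""
    best = 0.0
    if os.path.exists(state_path):
        # A restarted evaluator picks up the previously stored value here.
        with open(state_path) as f:
            best = json.load(f)["best_precision"]
    best = max(best, precision)
    with open(state_path, "w") as f:
        json.dump({"best_precision": best}, f)
    return best


# Hypothetical state file location, used only for this demo.
state_path = os.path.join(tempfile.mkdtemp(), "best_precision.json")
first = update_best_precision(state_path, 0.91)
# Simulated restart: a lower precision no longer resets the best value.
after_restart = update_best_precision(state_path, 0.85)
print(first, after_restart)
```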

  2. ImageNet. We set the global batch size to 128*8 = 1024, following the hyperparameter settings in Intel-Caffe, i.e. the sub-batch size is 128 for each node. An out-of-memory warning occurs with a sub-batch size of 128.

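| Model Layer | Batch Size | TF Best Precision | PS-WK | Steps | Speed (stp/sec) | Horovod Best Prec. | #node | Speed |
|---|---|---|---|---|---|---|---|---|
| 50 | 128 | 62.6% | 8-ps-8wk | ~76k | 0.93 | | | |
| 50 | 128 | 64.4% | 4-ps-8wk | ~75k | 0.90 | | | |
| 50 | 64 | - | 1-ps-1wk | - | 1.56 | | | |
| 50 | 32 | - | 1-ps-1wk | - | 2.20 | | | |
| 50 | 128 | - | 1-ps-1wk | - | 0.96 | | | |
| 50 | 128 | - | 8-ps-128wk | - | 0.285 | | | |
| 50 | 32 | - | 8-ps-128wk | - | 0.292 | | | |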

The ImageNet runs also get lower eval accuracy values.
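The batch size arithmetic for the 8-worker ImageNet runs can be sketched as follows. The throughput conversion assumes that each step consumes exactly one global batch, which is an assumption about how steps are counted, not something stated by the measurements; the helper names are hypothetical.

```python
def global_batch_size(sub_batch_size, num_workers):
    """Global batch size when every worker processes one sub-batch per step."""
    return sub_batch_size * num_workers


def images_per_second(steps_per_sec, global_batch):
    """Convert steps/sec to images/sec, assuming one global batch per step."""
    return steps_per_sec * global_batch


gb = global_batch_size(128, 8)
print(gb)  # 1024
# 0.93 stp/sec is the 8-ps-8wk row of the ImageNet results.
print(images_per_second(0.93, gb))
```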

Usage

Prerequisites

  1. Install TensorFlow and Bazel. I installed conda2 on Daint; Bazel and the other required packages are installed with virtualenv inside conda2.

  2. Download the ImageNet dataset to Daint. To avoid the error caused by the relative directory path not being recognized, make the following modification in download_and_preprocess_imagenet.sh. Replace

WORK_DIR="$0.runfiles/inception/inception"

with

WORK_DIR="$(realpath -s "$0").runfiles/inception/inception"

After a few days, you will see the following data in your data path. Because the file system on Daint does not support storing millions of files, you have to delete the raw-data directory.

  3. Download the CIFAR-10/CIFAR-100 datasets.
curl -o cifar-10-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
curl -o cifar-100-binary.tar.gz https://www.cs.toronto.edu/~kriz/cifar-100-binary.tar.gz

How to run:

$ cd scripts
# run locally for cifar10. It will launch 1 ps and 2 workers
$ sh submit_local_dist.sh
# run distributed for cifar
$ sh submit_cifar_daint_dist.sh <#server> <#worker> <#batch_size>
# run distributed for ImageNet
$ sh submit_imagenet_daint_dist.sh <#server> <#worker>

I leave one node for evaluation, so #worker should be the number of training workers plus one. For example, suppose you would like to launch a job with 2 ps and 4 workers and evaluate your model simultaneously on another node. By default, a ps and a worker are assigned to the same node.

$ cd scripts
$ sh submit_imagenet_daint_dist.sh 2 5
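The worker count passed to the script can be derived as in the sketch below. The helper `submit_args` is hypothetical and only reproduces the "training workers plus one evaluator" rule described above.

```python
def submit_args(num_ps, num_train_workers):
    """Return (#server, #worker), reserving one extra worker node for evaluation."""
    return num_ps, num_train_workers + 1


num_ps, num_workers = submit_args(2, 4)
command = "sh submit_imagenet_daint_dist.sh {} {}".format(num_ps, num_workers)
print(command)  # sh submit_imagenet_daint_dist.sh 2 5
```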

Related papers:

Identity Mappings in Deep Residual Networks

https://arxiv.org/pdf/1603.05027v2.pdf

Deep Residual Learning for Image Recognition

https://arxiv.org/pdf/1512.03385v1.pdf

Wide Residual Networks

https://arxiv.org/pdf/1605.07146v1.pdf
