
tensorflow_unet_scaling

GPU acceleration of Unet using Tensorflow native environment and Horovod-tensorflow

In the root directory you will find the scripts to run UNet (used for image semantic segmentation) implemented in Tensorflow using a GPU data parallel scheme with Horovod-tensorflow. In the directory tensorflow_native you will find the scripts to run the same distributed training jobs using the Tensorflow native environment without Horovod. The goal is to compare the parallelisation performance of Horovod Tensorflow vs native Tensorflow for a UNet algorithm. The data used here is an open microscopy dataset for semantic segmentation: DOI. These calculations have all been done on the RWTH high performance computing cluster, using Tesla V100 GPUs.

Virtual environment

To install Horovod for Tensorflow a virtual environment was created. The same environment is used for both the Tensorflow native and the Horovod trainings so that the results are comparable. Read this README to see how to set up the environment, and look at this script to see which software modules and environments need to be loaded before creating the virtual env and running the jobs. For both native Tensorflow and Horovod we use NCCL as the backend for collective communications. We also use Intel MPI for spawning the parallel processes.

Data parallel scheme

A data parallel scheme is a common method of deep learning parallelisation, often used for large internet-scale datasets. Scientific datasets are usually smaller but can be memory intensive, such as the high-resolution microscopy images in our case. Running deep learning models on these images leads to out-of-memory (OOM) errors when the batch size is increased, so a data parallel scheme can be helpful for our smaller dataset as well. In a data parallel scheme, the mini-batch size is fixed per GPU (worker) and is usually the batch size that maxes out the GPU memory; in our case it was 16. With more GPUs the effective batch size increases and therefore the run time decreases. The drawback is that a large effective batch size can eventually lead to poor convergence, and therefore the model metric (in our case intersection over union, IOU) may deteriorate.

To implement the data parallel scheme the following necessary steps have been taken:

Submission:

  • For native Tensorflow, during the submission process, some SLURM environment variables are set up which give us access to the size and ranks of the workers during training. Additionally, to run Tensorflow on more than one node, a TF_CONFIG env variable needs to be set that lists the IP addresses of all workers; the linked python file sets it up for our computing cluster (a simplified sketch is shown below). For Horovod Tensorflow no env variables need to be set up.
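
The following is a minimal sketch of how TF_CONFIG can be assembled from SLURM variables, assuming one worker per node and an arbitrary free port; the repo's linked python file handles the cluster-specific details.

```python
# Sketch only: build TF_CONFIG from SLURM variables before TensorFlow initialises.
import json
import os
import subprocess

# Expand the compressed SLURM nodelist (e.g. "node[01-02]") into hostnames.
nodelist = os.environ["SLURM_JOB_NODELIST"]
hosts = subprocess.run(
    ["scontrol", "show", "hostnames", nodelist],
    capture_output=True, text=True, check=True,
).stdout.split()

port = 12345  # assumed free port, shared by all workers
workers = [f"{host}:{port}" for host in hosts]

# Assuming one worker per node, the task index is simply the node index.
task_index = int(os.environ["SLURM_NODEID"])

os.environ["TF_CONFIG"] = json.dumps(
    {"cluster": {"worker": workers},
     "task": {"type": "worker", "index": task_index}}
)
```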

Training:

  • Strategy: In native Tensorflow, a “mirrored strategy” is used when the jobs run on one node and a “multi-worker mirrored strategy” when they run across multiple nodes. In Horovod Tensorflow there is no distinction between the two cases (see the sketches after this list).

  • Learning rate: A linear scaling rule is applied to the learning rate, i.e. the learning rate is multiplied by the number of workers (GPUs). In addition, an optional initial warm-up and a gradual schedule might help convergence. The warm-up is commented out in our case as it didn’t provide an improvement.

  • Datasets: To parallelise the datasets, “sharding” has been added. Note that the order of dataset operations (mapping, caching, sharding, …) matters when configuring the dataset for performance; here we have used them in an optimal order.

  • Model and optimisation: In native Tensorflow the model is built and compiled inside the scope of the relevant strategy (mirrored or multi-worker). In Horovod Tensorflow the optimiser is wrapped in the Horovod distributed optimizer.

  • Horovod-tensorflow/Keras requires some other minor edits to the training script that you can read here.
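
The two sketches below illustrate these steps; they are simplified stand-ins rather than the repo's actual training scripts: the model is a toy two-layer network instead of the UNet, the data is synthetic, and the base learning rate is an assumed value. First, the native Tensorflow variant with strategy selection, linear learning-rate scaling, and dataset sharding:

```python
# Sketch of the native-Tensorflow setup (simplified, not the repo's script).
import os
import numpy as np
import tensorflow as tf

# Single node -> mirrored strategy; multiple nodes -> multi-worker strategy,
# which reads the TF_CONFIG variable set at submission time.
if int(os.environ.get("SLURM_NNODES", "1")) > 1:
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
else:
    strategy = tf.distribute.MirroredStrategy()

per_gpu_batch = 16
global_batch = per_gpu_batch * strategy.num_replicas_in_sync

# Linear scaling rule: multiply the single-GPU learning rate by the worker count.
base_lr = 1e-4  # assumed single-GPU learning rate
scaled_lr = base_lr * strategy.num_replicas_in_sync

# Synthetic stand-in for the microscopy images and masks.
images = np.random.rand(64, 128, 128, 1).astype("float32")
masks = np.random.randint(0, 2, (64, 128, 128, 1)).astype("float32")

# Explicit auto-shard policy so each worker sees a different slice of the data;
# the order of the dataset operations matters for performance.
options = tf.data.Options()
options.experimental_distribute.auto_shard_policy = tf.data.experimental.AutoShardPolicy.DATA
dataset = (
    tf.data.Dataset.from_tensor_slices((images, masks))
    .cache()
    .shuffle(64)
    .batch(global_batch)
    .prefetch(tf.data.AUTOTUNE)
    .with_options(options)
)

# Model creation and compilation must happen inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                               input_shape=(128, 128, 1)),
        tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(scaled_lr),
                  loss="binary_crossentropy")

model.fit(dataset, epochs=2)
```

And the Horovod-Tensorflow/Keras counterpart, where the same steps reduce to initialising Horovod, pinning one GPU per process, scaling the learning rate, sharding the dataset manually, and wrapping the optimiser:

```python
# Sketch of the Horovod setup (simplified, not the repo's script).
import horovod.tensorflow.keras as hvd
import numpy as np
import tensorflow as tf

hvd.init()

# Pin each process to a single GPU.
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

per_gpu_batch = 16
base_lr = 1e-4                      # assumed single-GPU learning rate
scaled_lr = base_lr * hvd.size()    # linear scaling rule

# Each rank reads its own shard of the (synthetic) dataset.
images = np.random.rand(64, 128, 128, 1).astype("float32")
masks = np.random.randint(0, 2, (64, 128, 128, 1)).astype("float32")
dataset = (
    tf.data.Dataset.from_tensor_slices((images, masks))
    .shard(hvd.size(), hvd.rank())
    .cache()
    .shuffle(64)
    .batch(per_gpu_batch)
    .prefetch(tf.data.AUTOTUNE)
)

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(8, 3, padding="same", activation="relu",
                           input_shape=(128, 128, 1)),
    tf.keras.layers.Conv2D(1, 1, activation="sigmoid"),
])

# Wrap the optimiser so gradients are averaged across workers (NCCL allreduce).
optimizer = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(scaled_lr))
model.compile(optimizer=optimizer, loss="binary_crossentropy")

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # An optional learning-rate warm-up callback could be added here; warm-up is
    # commented out in our runs as it did not improve convergence.
]

# Only rank 0 prints the progress bar.
model.fit(dataset, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)
```

In both cases the per-GPU batch stays at 16, so the effective batch size grows linearly with the number of workers.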

Submission

  • The submission.sh/submission_hvd.sh files submit all jobs in a loop. For our data, 14 GPUs was the maximum number of GPUs used, which on our computing cluster corresponds to 7 nodes. The submission file adapts run_file.sh (containing the python script and its input arguments) and submit_file.sh (containing the submission script) for each job.

  • Jobs running on one node should be submitted without an MPI run. With more than one node, parallel MPI processes are spawned via the env variable $MPIEXEC in submit_file.sh.

  • Log files containing training times and metrics are copied to the logs folder in the root directory.

Notebook

This notebook has been used for post-processing of the log files. We use two metrics to judge the parallelisation performance. First, the deviation from an ideal linear speed-up, which reflects the extra computational cost introduced by parallelisation. Second, the model metric, here IOU, which might decrease in comparison with the 1-GPU case, as loss convergence can suffer in a data parallel scheme.
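
As a rough illustration of the first metric, the speed-up and parallel efficiency can be computed from the parsed timings along these lines; the numbers below are placeholders, not the measured values from the logs.

```python
# Sketch of the speed-up analysis; timings are placeholders, not measured results.
import numpy as np

gpus = np.array([1, 2, 4, 8, 14])                   # example GPU counts (up to 14 in our runs)
times = np.array([100.0, 55.0, 30.0, 18.0, 13.0])   # placeholder wall-times (minutes)

speedup = times[0] / times        # measured speed-up relative to 1 GPU
ideal = gpus / gpus[0]            # ideal linear speed-up
efficiency = speedup / ideal      # deviation from the ideal (1.0 = perfect scaling)

for n, s, e in zip(gpus, speedup, efficiency):
    print(f"{n:2d} GPUs: speed-up {s:5.2f}x, efficiency {e:4.2f}")
```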

In the figure below, we compare the scaling behaviour of UNet for native Tensorflow and Horovod Tensorflow. For native Tensorflow, the speed-up deviates rather quickly from the ideal speed-up, and the model metric (here IOU) deteriorates at higher GPU counts. Horovod Tensorflow seems to outperform native Tensorflow. We have repeated these calculations with different seeds and the behaviour observed here appears to be consistent.

Here you will find a similar analysis for UNet implemented in Pytorch.

