Comments (3)

alsrgv commented on May 9, 2024

Our primary motivation was to make it easy to take a single-GPU TensorFlow program and train it successfully, and faster, on many GPUs. This has two aspects: (1) how many modifications one has to make to a program to make it distributed, and how easy it is to run, and (2) how much faster it runs in distributed mode.

Internally, we found that people understand the MPI model, which requires minimal changes to source code (as described in the README), much more easily than they can set up regular Distributed TensorFlow.

To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, tf.train.Server(), tf.train.ClusterSpec(), tf.train.SyncReplicasOptimizer, tf.train.replica_device_setter(), etc.

We also found that the performance MPI and NCCL 2 deliver together is pretty good. We are still working on a publishable benchmark, but I can say that on 16 GPUs deployed across 4 servers connected with InfiniBand we got 15.6x scaling for Inception V3 and 13.8x scaling for VGG-16. We got slightly worse numbers on TCP.
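
To put those scaling numbers in perspective, here is a minimal sketch that converts them into scaling efficiency, assuming the usual definition (measured speedup divided by the number of GPUs). The function name `scaling_efficiency` is hypothetical and only for illustration; the inputs are the figures quoted above.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear scaling achieved (1.0 means perfect scaling).

    Assumes efficiency = measured speedup / number of GPUs.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Speedups quoted in the comment above: 16 GPUs across 4 servers on InfiniBand.
reported = {"Inception V3": 15.6, "VGG-16": 13.8}

for model, speedup in reported.items():
    efficiency = scaling_efficiency(speedup, 16)
    print(f"{model}: {speedup}x on 16 GPUs = {efficiency:.1%} of linear scaling")
```

Under that definition, Inception V3 comes out at roughly 97.5% of linear scaling and VGG-16 at roughly 86%.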

All in all, we wanted to give back to the TensorFlow community something that can help them train their models faster and with less effort.

rohskopf commented on May 9, 2024

Just to throw in some discussion: I think it's just because Distributed TensorFlow doesn't use MPI. MPI is convenient and powerful for high-performance computing, and this project brings MPI to TensorFlow.

alsrgv commented on May 9, 2024

Added motivation to README as per #10. Closing this issue.
