Comments (3)
Our primary motivation was to make it easy to take a single-GPU TensorFlow program and train it faster on many GPUs. This has two aspects: (1) how many modifications the program needs to become distributed and how easy it is to run, and (2) how much faster it runs in distributed mode.
Internally, we found that people understand the MPI model, which requires minimal changes to source code (as described in the README), much more easily than they can set up regular Distributed TensorFlow.
To give some perspective on that, this commit into our fork of TF Benchmarks shows how much code can be removed if one doesn't need to worry about towers, tf.train.Server(), tf.train.ClusterSpec(), tf.train.SyncReplicasOptimizer, tf.train.replica_device_setter(), etc.
We also found that the performance MPI and NCCL 2 deliver together is pretty good. We are still working on a publishable benchmark, but I can say that on 16 GPUs deployed across 4 servers connected with InfiniBand we got 15.6x scaling for Inception V3 and 13.8x scaling for VGG-16. The numbers on TCP were slightly worse.
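To put those scaling numbers in context, here is a small sketch that converts the reported speedups into scaling efficiency, i.e. the fraction of ideal linear speedup achieved. The helper name `scaling_efficiency` is made up for illustration and is not part of Horovod; the only inputs are the 16-GPU figures quoted above, which work out to roughly 97.5% for Inception V3 and about 86% for VGG-16.

```python
def scaling_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal (linear) speedup achieved on num_gpus GPUs.

    Hypothetical helper for illustration only; not a Horovod API.
    """
    return speedup / num_gpus


NUM_GPUS = 16  # 4 servers x 4 GPUs, as described in the comment above

# Speedups over a single GPU, as quoted in the comment above.
reported_speedups = {"Inception V3": 15.6, "VGG-16": 13.8}

for model, speedup in reported_speedups.items():
    efficiency = scaling_efficiency(speedup, NUM_GPUS)
    print(f"{model}: {speedup}x on {NUM_GPUS} GPUs = {efficiency:.1%} of linear scaling")
```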
All in all, we wanted to give back to the TensorFlow community something that helps people train their models faster and with less effort.
Just to add to the discussion: I think the main reason is simply that Distributed TensorFlow doesn't use MPI. MPI is convenient and powerful for high-performance computing, and this project brings MPI to TensorFlow.
Added motivation to README as per #10. Closing this issue.