Comments (13)
I truly agree that MPI is the way to go for synchronous data parallelism, and allreduce is an awesome technique for distributed training. The reason we came up with the low-level send/recv primitives in TF is that we have to support alternative parallelization mechanisms (async training, model parallelism, etc.). A good discussion comparing PS with allreduce can be found in M. Li's OSDI '14 paper.
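For readers new to the primitive: allreduce combines a value across all workers (e.g. summing gradients) and hands every worker the identical result, with no parameter server in the path. A minimal sketch of the semantics using mpi4py (just for illustration; Horovod does not require mpi4py):

```python
# Sketch of synchronous data parallelism via allreduce (mpi4py).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker computes gradients on its own shard of the batch...
local_grad = np.random.rand(4).astype(np.float32)  # stand-in for real gradients

# ...then Allreduce sums them across workers; every rank gets the same result.
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / comm.Get_size()  # average before applying the weight update
```

Run with e.g. `mpirun -np 4 python sketch.py` (script name illustrative).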
Hi @byronyi,
Thanks for reporting this issue!
NCCL has a bunch of knobs to tune various things, and turning off RDMA is one of them.
Can you try this setting?
$ export NCCL_IB_DISABLE=1
Thanks,
Alex
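If you launch Horovod through Open MPI, you can also forward the variable to every rank with Open MPI's -x flag (process count and script name below are illustrative):
$ mpirun -np 16 -x NCCL_IB_DISABLE=1 python train.py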
Good news: NVIDIA just released NCCL with RoCE support. I've updated the documentation with installation pointers and additional benchmarks.
Thanks for the prompt response. We should be able to test it soon, then.
Regarding the benchmark results you posted on the mailing list, e.g. the 13.8x speedup on VGG-16 using 16 GPUs: was RDMA enabled in that case?
Thanks,
Byron
Great!
Yes, 13.8x was with VGG-16 on 16 GPUs with RDMA (InfiniBand). With TCP we were getting 12.5x.
Can you share details about your GPU cluster: how many servers are you planning to use, how many GPUs per server and what kind of GPUs?
We are using a small RoCE cluster with 4 nodes. Each node is equipped with 4 K40m GPUs and one 40GbE ConnectX-3 Pro NIC.
Btw, I observe similar performance numbers with the grpc and grpc+gdr runtimes in the latest TF. The grpc runtime has long been known to have a performance bug, and it was fixed recently. On the latest master, we observe around an 11.6x speedup with grpc and a 13.6x speedup with grpc+gdr, all using a CPU-based PS on VGG-16.
Interest declaration: I'm the author of grpc+gdr :)
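For anyone who wants to try it: the runtime is chosen via the protocol argument of tf.train.Server. A sketch (the cluster layout below is a placeholder, not an actual setup):

```python
# Illustrative sketch: selecting the grpc+gdr runtime in distributed TF 1.x.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps": ["node0:2222"],                    # placeholder hosts
    "worker": ["node1:2222", "node2:2222"],
})

# protocol="grpc+gdr" enables the GPUDirect RDMA transport;
# the default "grpc" is the plain TCP-based runtime.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+gdr")
```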
Oh, that's great! Would be very interesting to see how Horovod fares on your hardware :-)
The primary reason we prefer the Horovod approach internally is the complexity that traditional distributed TensorFlow entails. There are many concepts to learn - tf.Server, tf.ClusterSpec, tf.replica_device_setter, tf.train.SyncReplicasOptimizer - and we found that it's very easy to make a mistake that gives you bad performance and a hard time figuring out where that bad performance comes from. You're also required to manually split your computation within a server across multiple GPUs by creating towers, which introduces its own problems with reducing variables.
As we learned how other companies and frameworks embrace MPI and similar computational environments, it became clear that there would be a big benefit to using the allreduce approach instead of the parameter-server approach, because it's much easier to comprehend and to apply to a typical single-GPU TensorFlow program. And so, this is where we ended up :-)
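To make the contrast concrete, here is roughly what that migration looks like with Horovod's TF 1.x API (a minimal sketch; the optimizer and learning rate are placeholders, not the exact code we use):

```python
# Minimal sketch: turning a single-GPU TF program into a Horovod one.
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU; MPI wires the processes together

# Pin each process to its own GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder optimizer: scale the learning rate by the worker count,
# then wrap the optimizer so gradients are averaged with allreduce.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
# `config` and `hooks` then go into e.g. tf.train.MonitoredTrainingSession(...).
```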
Agreed. Thanks for the pointer!
I guess this could serve as a tracking issue, since you suggest that NVIDIA will support RoCE in the near future. Hope this issue can be closed sooner rather than later :)
Yes, we all do :-)
I only got about a 6.3x speedup for most computer-vision base models using 8 1080 Ti GPUs on a single machine, with a CPU-based parameter server. So it's reasonable that 4-machine x 4-GPU distributed training using grpc+gdr only gets about an 11x speedup on 16 GPUs (6.3/8 ≈ 79% per-GPU efficiency on one machine, so ~11/16 ≈ 69% across machines is plausible). What are the single-machine benchmark results in your environment, @alsrgv?
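(For context, this kind of CPU-based parameter-server measurement is typically run with the standard tf_cnn_benchmarks script; the flags below are from that script, with illustrative values:)
$ python tf_cnn_benchmarks.py --model=vgg16 --num_gpus=8 --variable_update=parameter_server --local_parameter_device=cpu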
@suiyuan2009 I don't have 8x 1080 Ti, but I have one server with 8x P100 on NVLink.
On that server, I was getting 7.6x for Inception V3, 7.4x for ResNet-101, 7.8x for VGG-16.
Yay, that's good! I'm closing this.