
Comments (13)

byronyi commented on May 8, 2024

I fully agree that MPI is the way to go for synchronous data parallelism, and allreduce is an awesome technique for distributed training. The reason we came up with the low-level send/recv primitives in TF is that we have to support alternative parallelization mechanisms (async, model parallelism, etc.). A good deal of discussion comparing PS with allreduce can be found in M. Li's OSDI'14 paper.
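For readers unfamiliar with the pattern, here is a minimal sketch of gradient averaging via allreduce using mpi4py. This is purely illustrative (Horovod uses NCCL/MPI ring-allreduce internally rather than this naive call), and the random array stands in for a real gradient tensor:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker computes its own local gradient (placeholder data here).
local_grad = np.random.rand(4).astype(np.float32)

# Sum the gradients across all workers, then divide by the world size to average.
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / comm.Get_size()

Every worker ends up with the same avg_grad, which it can then apply to its local copy of the model.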


alsrgv commented on May 8, 2024

Hi @byronyi,

Thanks for reporting this issue!

NCCL has a bunch of knobs for tuning various things, and turning off RDMA is one of them.

Can you try this setting?

$ export NCCL_IB_DISABLE=1

Thanks,
Alex
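As a side note, when launching through Open MPI, such environment variables are typically forwarded to the worker processes with the -x flag; for example (the process count and script name are placeholders):

$ mpirun -np 16 -x NCCL_IB_DISABLE=1 python train.py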


alsrgv commented on May 8, 2024

Good news: NVIDIA just released NCCL with RoCE support. I've updated the documentation with installation pointers and additional benchmarks.
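As a rough illustration (the updated documentation is authoritative, and the NCCL path below is a placeholder), building Horovod against NCCL typically looks like:

$ export HOROVOD_NCCL_HOME=/usr/local/nccl
$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod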


byronyi commented on May 8, 2024

Thanks for the prompt response. We should be able to test it soon, then.

Regarding the benchmark results you posted to the mailing list, e.g. the 13.8x speedup on VGG-16 using 16 GPUs, was RDMA enabled in that case?

Thanks,
Byron


alsrgv commented on May 8, 2024

Great!

Yes, 13.8x was with VGG-16 on 16 GPUs with RDMA (InfiniBand). With TCP we were getting 12.5x.
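(In scaling-efficiency terms, that is roughly 13.8/16 ≈ 86% with RDMA versus 12.5/16 ≈ 78% over TCP.)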

Can you share details about your GPU cluster: how many servers are you planning to use, how many GPUs per server and what kind of GPUs?


byronyi commented on May 8, 2024

We are using a small RoCE cluster with 4 nodes. Each node is equipped with 4 K40m GPUs and one 40GbE ConnectX-3 Pro NIC.

Btw, I observe similar performance numbers with the grpc and grpc+gdr runtimes in the latest TF. The grpc runtime was known to have a performance bug for a long time, and it has been fixed recently. On the latest master, we observe around an 11.6x speedup with grpc and a 13.6x speedup with grpc+gdr, all using a CPU-based PS on VGG-16.
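For anyone who wants to reproduce the comparison, the transport is selected when constructing the TF 1.x server; a minimal sketch (the cluster addresses and job layout below are placeholders) looks like:

import tensorflow as tf

# Hypothetical one-PS, two-worker cluster; replace with real host:port pairs.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# protocol defaults to "grpc"; "grpc+gdr" enables the RDMA-based transport.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+gdr")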

Interest declaration: I'm the author of grpc+gdr :)


alsrgv commented on May 8, 2024

Oh, that's great! Would be very interesting to see how Horovod fares on your hardware :-)

The primary reason we prefer the Horovod approach internally is the complexity that traditional distributed TensorFlow entails. There are many concepts to learn - tf.Server, tf.ClusterSpec, tf.replica_device_setter, tf.train.SyncReplicasOptimizer - and we found that it's very easy to make a mistake that gives you bad performance and then a hard time figuring out why. You're also required to manually split your computation within a server across multiple GPUs by creating towers, which introduces its own variable-reduction problems.

As we learned how other companies and frameworks embrace MPI and similar computational environments, it became clear that there would be a big benefit to using the allreduce approach instead of the parameter server approach, because it's much easier to comprehend and to apply to a typical single-GPU TensorFlow program. And so, this is where we ended up :-)
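To make the contrast concrete, a single-GPU TF 1.x training script typically needs only a handful of Horovod-specific lines. This is a minimal sketch, with a toy loss and learning rate standing in for a real model:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy loss; in a real script this comes from your model/tower.
x = tf.Variable(tf.random_normal([10]))
loss = tf.reduce_sum(tf.square(x))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged via allreduce before being applied.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)

The same script is launched once per GPU with mpirun; no parameter servers or cluster specs are involved.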


alsrgv commented on May 8, 2024

Agreed. Thanks for the pointer!


byronyi commented on May 8, 2024

I guess this could serve as a tracking issue, since you suggest that NVIDIA will support RoCE in the near future. Hope this issue can be closed sooner rather than later :)


alsrgv commented on May 8, 2024

Yes, we all do :-)


suiyuan2009 commented on May 8, 2024

I only got about a 6.3x speedup for most computer vision base models using 8 1080 Ti GPUs on a single machine, with a CPU-based parameter server, so it's reasonable that distributed training on 4 machines with 4 GPUs each using grpc+gdr only gets about an 11x speedup on 16 GPUs. What are the single-machine benchmark results in your environment, @alsrgv?
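(That is roughly 6.3/8 ≈ 79% scaling efficiency within a single machine.)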


alsrgv commented on May 8, 2024

@suiyuan2009 I don't have 8x 1080TI, but I have one server with 8x P100 on NVLink.

On that server, I was getting 7.6x for Inception V3, 7.4x for ResNet-101, 7.8x for VGG-16.


byronyi commented on May 8, 2024

Yay that's good! I'm closing this.

