
Comments (13)

byronyi commented on May 8, 2024

I fully agree that MPI is the way to go for synchronous data parallelism, and allreduce is an awesome technique for distributed training. The reason we came up with the low-level send/recv primitives in TF is that we have to support alternative parallelization mechanisms (async, model parallelism, etc.). A good deal of discussion comparing PS with allreduce can be found in M. Li's OSDI'14 paper.
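For readers unfamiliar with the pattern, here is a minimal sketch of gradient averaging via allreduce using mpi4py. This is purely illustrative (Horovod uses NCCL/MPI ring-allreduce internally rather than this naive call), and the random array stands in for a real gradient tensor:

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each worker computes its own local gradient (placeholder data here).
local_grad = np.random.rand(4).astype(np.float32)

# Sum the gradients across all workers, then divide by the world size to average.
summed = np.empty_like(local_grad)
comm.Allreduce(local_grad, summed, op=MPI.SUM)
avg_grad = summed / comm.Get_size()

Every worker ends up with the same avg_grad, which it can then apply to its local copy of the model.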


alsrgv commented on May 8, 2024

Hi @byronyi,

Thanks for reporting this issue!

NCCL has a bunch of knobs for tuning various things, and turning off RDMA is one of them.

Can you try this setting?

$ export NCCL_IB_DISABLE=1

Thanks,
Alex
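As a side note, when launching through Open MPI, such environment variables are typically forwarded to the worker processes with the -x flag; for example (the process count and script name are placeholders):

$ mpirun -np 16 -x NCCL_IB_DISABLE=1 python train.py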


alsrgv commented on May 8, 2024

Good news: NVIDIA just released NCCL with RoCE support. I've updated the documentation with installation pointers and additional benchmarks.
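As a rough illustration (the updated documentation is authoritative, and the NCCL path below is a placeholder), building Horovod against NCCL typically looks like:

$ export HOROVOD_NCCL_HOME=/usr/local/nccl
$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod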


byronyi commented on May 8, 2024

Thanks for the prompt response. We should be able to test it soon, then.

Regarding the benchmark results you posted to the mailing list, e.g. the 13.8x speedup on VGG-16 using 16 GPUs, was RDMA enabled in that case?

Thanks,
Byron


alsrgv commented on May 8, 2024

Great!

Yes, 13.8x was with VGG-16 on 16 GPUs with RDMA (InfiniBand). With TCP we were getting 12.5x.
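(In scaling-efficiency terms, that is roughly 13.8/16 ≈ 86% with RDMA versus 12.5/16 ≈ 78% over TCP.)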

Can you share details about your GPU cluster: how many servers are you planning to use, how many GPUs per server and what kind of GPUs?


byronyi commented on May 8, 2024

We are using a small RoCE cluster with 4 nodes. Each node is equipped with 4 K40m GPUs and one 40GbE ConnectX-3 Pro NIC.

Btw, I observe similar performance numbers with the grpc and grpc+gdr runtimes in the latest TF. The grpc runtime was known to have a performance bug for a long time, and it has been fixed recently. On the latest master, we observe around an 11.6x speedup with grpc and a 13.6x speedup with grpc+gdr, all using a CPU-based PS on VGG-16.
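For anyone who wants to reproduce the comparison, the transport is selected when constructing the TF 1.x server; a minimal sketch (the cluster addresses and job layout below are placeholders) looks like:

import tensorflow as tf

# Hypothetical one-PS, two-worker cluster; replace with real host:port pairs.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# protocol defaults to "grpc"; "grpc+gdr" enables the RDMA-based transport.
server = tf.train.Server(cluster, job_name="worker", task_index=0,
                         protocol="grpc+gdr")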

Interest declaration: I'm the author of grpc+gdr :)


alsrgv commented on May 8, 2024

Oh, that's great! Would be very interesting to see how Horovod fares on your hardware :-)

The primary reason we prefer the Horovod approach internally is the complexity that traditional distributed TensorFlow entails. There are many concepts to learn - tf.Server, tf.ClusterSpec, tf.replica_device_setter, tf.train.SyncReplicasOptimizer - and we found that it's very easy to make a mistake that gives you bad performance and then a hard time figuring out why. You're also required to manually split your computation within a server across multiple GPUs by creating towers, which introduces its own variable-reduction problems.

As we learned how other companies and frameworks embrace MPI and similar computational environments, it became clear that there would be a big benefit to using the allreduce approach instead of the parameter server approach, because it's much easier to comprehend and to apply to a typical single-GPU TensorFlow program. And so, this is where we ended up :-)
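To make the contrast concrete, a single-GPU TF 1.x training script typically needs only a handful of Horovod-specific lines. This is a minimal sketch, with a toy loss and learning rate standing in for a real model:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to a single GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy loss; in a real script this comes from your model/tower.
x = tf.Variable(tf.random_normal([10]))
loss = tf.reduce_sum(tf.square(x))

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged via allreduce before being applied.
opt = tf.train.MomentumOptimizer(0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast initial variables from rank 0 so all workers start in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        sess.run(train_op)

The same script is launched once per GPU with mpirun; no parameter servers or cluster specs are involved.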


alsrgv commented on May 8, 2024

Agreed. Thanks for the pointer!


byronyi commented on May 8, 2024

I guess this could serve as a tracking issue, since you suggest that NVIDIA will support RoCE in the near future. Hope this issue can be closed sooner rather than later :)


alsrgv commented on May 8, 2024

Yes, we all do :-)


suiyuan2009 commented on May 8, 2024

I only got about a 6.3x speedup for most computer vision base models using 8 1080 Ti GPUs on a single machine, with a CPU-based parameter server, so it's reasonable that distributed training on 4 machines with 4 GPUs each using grpc+gdr only gets about an 11x speedup on 16 GPUs. What are the single-machine benchmark results in your environment, @alsrgv?
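(That is roughly 6.3/8 ≈ 79% scaling efficiency within a single machine.)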


alsrgv commented on May 8, 2024

@suiyuan2009 I don't have 8x 1080TI, but I have one server with 8x P100 on NVLink.

On that server, I was getting 7.6x for Inception V3, 7.4x for ResNet-101, 7.8x for VGG-16.


byronyi commented on May 8, 2024

Yay that's good! I'm closing this.

