Comments (10)

alsrgv commented on May 9, 2024

Hi @AlexanderYukhanov,

Thanks for your question!

A few questions:

  1. Is your Open MPI compiled with CUDA support (configured with --with-cuda)? It's required for HOROVOD_GPU_*=MPI.
  2. Do you have Tesla or Quadro GPU cards with GPUDirect drivers installed, or are you using consumer GPUs, such as GTX ***?
  3. What network cards are you using?
  4. Are you able to run any example MPI program, such as the OSU Benchmarks?
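
For question 1, one way to check is the parsable ompi_info output. This is a minimal sketch, assuming your Open MPI build exposes the mpi_built_with_cuda_support parameter (older releases may not report it at all); has_cuda_support is a hypothetical helper name used only for illustration:

```shell
# Hypothetical helper: reads `ompi_info --parsable --all` output on stdin
# and succeeds only if the build reports CUDA support as true.
has_cuda_support() {
    grep -q 'mpi_built_with_cuda_support:value:true'
}

# Only query ompi_info if it is actually installed on this machine.
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info --parsable --all | has_cuda_support; then
        echo "Open MPI reports CUDA support"
    else
        echo "Open MPI does not report CUDA support"
    fi
fi
```

If nothing is printed, ompi_info is not on the PATH, so the check could not run.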

We generally recommend using HOROVOD_GPU_ALLREDUCE=MPI only with specialized MPI implementations, such as Cray MPI, that are optimized for allreduce on GPU. In other cases, such as with Open MPI, you will be much better off installing NCCL and specifying HOROVOD_GPU_ALLREDUCE=NCCL.

HOROVOD_GPU_ALLGATHER=MPI and HOROVOD_GPU_BROADCAST=MPI should only be specified if you have Tesla or Quadro cards and GPUDirect drivers installed.

from horovod.

AlexanderYukhanov commented on May 9, 2024

Thank you for the quick response.

  1. I am using the standard MPI that comes with Ubuntu.
  2. I have a Tesla K80.
  3. I am running on an Azure VM.
  4. I am able to run other MPI-based frameworks, Horovod on two GPUs on the same VM, and simple tests like mpirun -H host,host echo hi :-)

I first tried pip install horovod without any HOROVOD_GPU_* environment variables but got the same results, so I decided to try the HOROVOD_GPU_*=MPI approach.


alsrgv commented on May 9, 2024

Let's try to debug together :-)

Can you share the output of:

  1. mpirun -version
  2. ifconfig -a

Can you try running a vanilla allreduce program:

  1. wget https://raw.githubusercontent.com/wesleykendall/mpitutorial/gh-pages/tutorials/mpi-reduce-and-allreduce/code/reduce_stddev.c
  2. mpicc reduce_stddev.c -o reduce_stddev -lm
  3. mpirun <your flags for multi-node> ./reduce_stddev 100000000


AlexanderYukhanov commented on May 9, 2024

Sure, I'm allocating a new cluster to run these tests.


AlexanderYukhanov commented on May 9, 2024

  1. mpirun (Open MPI) 1.10.2

  2. ifconfig -a

docker0 Link encap:Ethernet HWaddr 02:42:14:97:4c:f1
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

eth0 Link encap:Ethernet HWaddr 00:0d:3a:17:91:92
inet addr:10.0.0.4 Bcast:10.0.0.255 Mask:255.255.255.0
inet6 addr: fe80::20d:3aff:fe17:9192/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:92458 errors:0 dropped:0 overruns:0 frame:0
TX packets:13695 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:135609308 (135.6 MB) TX bytes:1379897 (1.3 MB)

lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:252 errors:0 dropped:0 overruns:0 frame:0
TX packets:252 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:17520 (17.5 KB) TX bytes:17520 (17.5 KB)

  3. Exactly the same issue: 100% CPU consumption in an active wait on gettimeofday and poll.


alsrgv commented on May 9, 2024

Aha, this is familiar. Can you try mpirun -mca btl_tcp_if_exclude docker0,lo ...?
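
If the interface names differ between machines, the exclude list can be built instead of typed by hand. This is a rough sketch, assuming Linux (interfaces listed under /sys/class/net) and assuming the only interfaces to skip are loopback and docker bridges; build_exclude_list is a hypothetical helper, not part of Open MPI:

```shell
# Hypothetical helper: reads interface names on stdin, one per line,
# keeps only loopback ("lo") and docker bridges ("docker0", "docker1", ...),
# and joins them with commas for use with btl_tcp_if_exclude.
build_exclude_list() {
    grep -E '^(lo|docker[0-9]+)$' | paste -sd, -
}

# Example with the interfaces from the ifconfig output in this thread.
printf 'docker0\neth0\nlo\n' | build_exclude_list
```

On a real node the names could come from ls /sys/class/net, and the result would be passed as the value of -mca btl_tcp_if_exclude.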


AlexanderYukhanov commented on May 9, 2024

Thanks a lot! That works.


AlexanderYukhanov commented on May 9, 2024

FYI, I am going to add a recipe for running Horovod on the new BatchAI service: https://github.com/Azure/BatchAI/tree/master/recipes

Thanks a lot for the help!


alsrgv commented on May 9, 2024

That's great! Please share a direct link to the recipe when it's ready, and I'll add it to the README :-)


AlexanderYukhanov commented on May 9, 2024

Here it is: https://github.com/Azure/BatchAI/tree/master/recipes/Horovod
