Comments (10)
Thanks for your question!
A few questions:
- Is your Open MPI compiled with `--enable-cuda` support? It's required for `HOROVOD_GPU_*=MPI`.
- Do you have Tesla or Quadro GPU cards and GPUDirect drivers installed, or are you using consumer GPUs, such as GTX ***?
- What network cards are you using?
- Are you able to run an example MPI program, such as the OSU Benchmarks?
We generally recommend using `HOROVOD_GPU_ALLREDUCE=MPI` only for specialized MPI implementations, such as Cray MPI, which are optimized for allreduce on GPUs. In other cases, such as with Open MPI, you will be much better off installing NCCL and specifying `HOROVOD_GPU_ALLREDUCE=NCCL`.
`HOROVOD_GPU_ALLGATHER=MPI` and `HOROVOD_GPU_BROADCAST=MPI` should only be specified if you have Tesla or Quadro cards and GPUDirect drivers installed.
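As a rough sketch, the NCCL route usually means setting Horovod's build-time environment variables before reinstalling; the NCCL path below is an assumption, so point it at wherever NCCL actually lives on your system (the install command itself is only echoed here, not run):

```shell
# Sketch only: build-time flags for an NCCL-backed Horovod install.
# /usr/local/nccl is a placeholder path; adjust to your NCCL location.
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_NCCL_HOME=/usr/local/nccl
# The actual install step (shown, not executed) would then be:
echo "pip install --no-cache-dir horovod  # built with HOROVOD_GPU_ALLREDUCE=$HOROVOD_GPU_ALLREDUCE"
```

`--no-cache-dir` matters here because pip would otherwise reuse a wheel built without the GPU flags.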
from horovod.
Thank you for the quick response.
- I am using the standard MPI that ships with Ubuntu.
- I have a Tesla K80.
- I am running on an Azure VM.
- I am able to run other MPI-based frameworks, Horovod on two GPUs on the same VM, and simple tests like `mpirun -H host,host echo hi` :-)
I first tried `pip install horovod` without any `HOROVOD_GPU_*` environment variables but had the same results, so I decided to try the `xxx=MPI` way.
Let's try to debug together :-)
Can you share the output of:
```
mpirun -version
ifconfig -a
```
Can you try running a vanilla allreduce program?
```
wget https://raw.githubusercontent.com/wesleykendall/mpitutorial/gh-pages/tutorials/mpi-reduce-and-allreduce/code/reduce_stddev.c
mpicc reduce_stddev.c -o reduce_stddev -lm
mpirun <your flags for multi-node> ./reduce_stddev 100000000
```
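As a quick sanity check of what `reduce_stddev` verifies, here is a serial sketch of the same standard-deviation formula, sqrt(mean(x²) − mean(x)²), in awk; the toy values 1..4 are made up for illustration and stand in for the data the benchmark spreads across ranks:

```shell
# Serial sanity check of the stddev formula the MPI benchmark distributes:
# stddev = sqrt(mean(x^2) - mean(x)^2), here over the toy values 1..4.
printf '1\n2\n3\n4\n' \
  | awk '{s += $1; ss += $1 * $1; n++}
         END {m = s / n; printf "%.4f\n", sqrt(ss / n - m * m)}'
# -> 1.1180
```

If the MPI run hangs or spins instead of printing a result, the problem is in the transport, not the math.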
Sure, allocating a new cluster to run these tests.
1. mpirun (Open MPI) 1.10.2
2. `ifconfig -a`:
```
docker0   Link encap:Ethernet  HWaddr 02:42:14:97:4c:f1
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth0      Link encap:Ethernet  HWaddr 00:0d:3a:17:91:92
          inet addr:10.0.0.4  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20d:3aff:fe17:9192/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:92458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13695 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:135609308 (135.6 MB)  TX bytes:1379897 (1.3 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:17520 (17.5 KB)  TX bytes:17520 (17.5 KB)
```
3. Exactly the same issue: 100% CPU consumption in an active wait on gettimeofday and poll.
Aha, this is familiar. Can you try `mpirun -mca btl_tcp_if_exclude docker0,lo ...`?
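For reference, a full command line with that flag might look like the sketch below; the host names and training script are placeholders, and the block only echoes the command rather than launching anything. The flag tells Open MPI's TCP transport to skip the Docker bridge (`docker0`) and loopback (`lo`) interfaces, whose addresses are not reachable from the other node:

```shell
# Sketch: keep Open MPI's TCP transport off the Docker bridge and loopback.
# host1, host2, and train.py are placeholders for your own setup.
MPIRUN_FLAGS="-mca btl_tcp_if_exclude docker0,lo"
echo "mpirun $MPIRUN_FLAGS -np 2 -H host1,host2 python train.py"
```

The symptom fits: ranks busy-wait in poll/gettimeofday while trying to connect over an interface that leads nowhere.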
Thanks a lot! That works.
FYI, I am going to add a recipe for running Horovod on the new BatchAI service: https://github.com/Azure/BatchAI/tree/master/recipes.
Thanks a lot for the help!
That's great! Please share a direct link to the recipe when it's ready, and I'll add it to the README :-)
Here it is: https://github.com/Azure/BatchAI/tree/master/recipes/Horovod