Comments (10)
Thanks for your question!
A few questions:
- Is your Open MPI compiled with `--enable-cuda` support? It's required for `HOROVOD_GPU_*=MPI`.
- Do you have Tesla or Quadro GPU cards and GPUDirect drivers installed, or are you using consumer GPUs, such as GTX ***?
- What network cards are you using?
- Are you able to run an example MPI program, such as the OSU Benchmarks?
We generally recommend using `HOROVOD_GPU_ALLREDUCE=MPI` only for specialized MPI implementations, such as Cray MPI, which are optimized for allreduce on GPUs. In other cases, such as with Open MPI, you will be much better off installing NCCL and specifying `HOROVOD_GPU_ALLREDUCE=NCCL`.
`HOROVOD_GPU_ALLGATHER=MPI` and `HOROVOD_GPU_BROADCAST=MPI` should only be specified if you have Tesla or Quadro cards and GPUDirect drivers installed.
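As a rough sketch, the NCCL route usually means setting Horovod's build-time environment variables before reinstalling; the NCCL path below is an assumption, so point it at wherever NCCL actually lives on your system (the install command itself is only echoed here, not run):

```shell
# Sketch only: build-time flags for an NCCL-backed Horovod install.
# /usr/local/nccl is a placeholder path; adjust to your NCCL location.
export HOROVOD_GPU_ALLREDUCE=NCCL
export HOROVOD_NCCL_HOME=/usr/local/nccl
# The actual install step (shown, not executed) would then be:
echo "pip install --no-cache-dir horovod  # built with HOROVOD_GPU_ALLREDUCE=$HOROVOD_GPU_ALLREDUCE"
```

`--no-cache-dir` matters here because pip would otherwise reuse a wheel built without the GPU flags.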
from horovod.
Thank you for the quick response.
- I am using the standard MPI that ships with Ubuntu.
- I have a Tesla K80.
- I am running on an Azure VM.
- I am able to run other MPI-based frameworks, Horovod on two GPUs on the same VM, and simple tests like `mpirun -H host,host echo hi` :-)
I first tried `pip install horovod` without any `HOROVOD_GPU_*` environment variables but had the same results, so I decided to try the `xxx=MPI` way.
Let's try to debug together :-)
Can you share the output of:
```
mpirun -version
ifconfig -a
```
Can you try running a vanilla allreduce program?
```
wget https://raw.githubusercontent.com/wesleykendall/mpitutorial/gh-pages/tutorials/mpi-reduce-and-allreduce/code/reduce_stddev.c
mpicc reduce_stddev.c -o reduce_stddev -lm
mpirun <your flags for multi-node> ./reduce_stddev 100000000
```
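As a quick sanity check of what `reduce_stddev` verifies, here is a serial sketch of the same standard-deviation formula, sqrt(mean(x²) − mean(x)²), in awk; the toy values 1..4 are made up for illustration and stand in for the data the benchmark spreads across ranks:

```shell
# Serial sanity check of the stddev formula the MPI benchmark distributes:
# stddev = sqrt(mean(x^2) - mean(x)^2), here over the toy values 1..4.
printf '1\n2\n3\n4\n' \
  | awk '{s += $1; ss += $1 * $1; n++}
         END {m = s / n; printf "%.4f\n", sqrt(ss / n - m * m)}'
# -> 1.1180
```

If the MPI run hangs or spins instead of printing a result, the problem is in the transport, not the math.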
Sure, allocating a new cluster to run these tests.
1. mpirun (Open MPI) 1.10.2
2. `ifconfig -a`:
```
docker0   Link encap:Ethernet  HWaddr 02:42:14:97:4c:f1
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth0      Link encap:Ethernet  HWaddr 00:0d:3a:17:91:92
          inet addr:10.0.0.4  Bcast:10.0.0.255  Mask:255.255.255.0
          inet6 addr: fe80::20d:3aff:fe17:9192/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:92458 errors:0 dropped:0 overruns:0 frame:0
          TX packets:13695 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:135609308 (135.6 MB)  TX bytes:1379897 (1.3 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:252 errors:0 dropped:0 overruns:0 frame:0
          TX packets:252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:17520 (17.5 KB)  TX bytes:17520 (17.5 KB)
```
3. Exactly the same issue: 100% CPU consumption in an active wait on gettimeofday and poll.
Aha, this is familiar. Can you try `mpirun -mca btl_tcp_if_exclude docker0,lo ...`?
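For reference, a full command line with that flag might look like the sketch below; the host names and training script are placeholders, and the block only echoes the command rather than launching anything. The flag tells Open MPI's TCP transport to skip the Docker bridge (`docker0`) and loopback (`lo`) interfaces, whose addresses are not reachable from the other node:

```shell
# Sketch: keep Open MPI's TCP transport off the Docker bridge and loopback.
# host1, host2, and train.py are placeholders for your own setup.
MPIRUN_FLAGS="-mca btl_tcp_if_exclude docker0,lo"
echo "mpirun $MPIRUN_FLAGS -np 2 -H host1,host2 python train.py"
```

The symptom fits: ranks busy-wait in poll/gettimeofday while trying to connect over an interface that leads nowhere.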
Thanks a lot! That works.
FYI, I am going to add a recipe for running Horovod on the new BatchAI service: https://github.com/Azure/BatchAI/tree/master/recipes.
Thanks a lot for the help!
That's great! Please share a direct link to the recipe when it's ready, and I'll add it to the README :-)
Here it is: https://github.com/Azure/BatchAI/tree/master/recipes/Horovod