Comments (12)
@pbalapra The example in the docs works on both CPU and GPU. Does that help?
from horovod.
@alsrgv Thanks for the quick answer! I tried to run it on 8 CPU nodes (1 MPI rank per node). The code runs without any issues. However, the values of hvd.local_rank() and hvd.size() are 0 and 1, respectively, for all ranks. Based on the code documentation, hvd.size() should be the number of GPUs (in my case, CPUs)? Is that correct?
@pbalapra Can you clarify which MPI are you using (Open MPI, MPICH, etc) and what is your mpirun command?
If you're doing mpirun -n 8 -H host1,host2,host3,host4,host5,host6,host7,host8 ..., then hvd.local_rank() should be 0 on all nodes because you'd run only one process per node, hvd.rank() should be the index of the node, and hvd.size() should be 8.
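The mapping can be illustrated with a small pure-Python sketch (`assign_ranks` is a hypothetical helper written for this comment, not part of the Horovod API) that mimics how rank, local_rank, and size follow from a host placement:

```python
# Illustration of Horovod's rank semantics, not Horovod code:
# rank is the process's global index in launch order, local_rank is its
# index among the processes on the same host, and size is the total count.

def assign_ranks(hosts):
    """hosts[i] is the host that process i runs on (in launch order)."""
    seen_per_host = {}
    placements = []
    for rank, host in enumerate(hosts):
        local_rank = seen_per_host.get(host, 0)
        seen_per_host[host] = local_rank + 1
        placements.append({'rank': rank,
                           'local_rank': local_rank,
                           'size': len(hosts)})
    return placements

# One process per node on 8 hosts, as with `mpirun -n 8 -H host1,...,host8`:
for p in assign_ranks(['host%d' % i for i in range(1, 9)]):
    print(p)  # local_rank is 0 everywhere, rank is the node index, size is 8
```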
@alsrgv I am using Cray's MPICH and launch the code using aprun -n 8 -N 1 python ./keras_mnist.py
Also, I installed tensorflow and keras with conda without building tensorflow from source with MPI option enabled. I assume this is not a problem. Please correct me if I am wrong.
@pbalapra I see! You are right, you don't need to rebuild TensorFlow. To confirm your symptoms, can you run this simple program as well:
from __future__ import print_function
import horovod.tensorflow as hvd
hvd.init()
print('local_rank=%d, rank=%d, size=%d' % (hvd.local_rank(), hvd.rank(), hvd.size()))
The output I get with Open MPI is:
$ mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py
local_rank=0, rank=2, size=4
local_rank=0, rank=0, size=4
local_rank=1, rank=1, size=4
local_rank=1, rank=3, size=4
Hi, @pbalapra,
@jbalma (from Cray) and I huddled on this and made a change to Horovod to enable Cray-MPI. Can you try to reinstall Horovod like this and let us know if it helps?
$ pip uninstall -y horovod
$ HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod
or if your cluster has GPUs:
$ pip uninstall -y horovod
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod
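For context, those HOROVOD_GPU_* variables just select MPI as the backend for Horovod's collective ops; the core op, allreduce, hands every worker the element-wise average of all workers' tensors. A minimal pure-Python sketch of that semantics (an illustration only, not Horovod's implementation):

```python
# Semantics of an averaging allreduce, as used by Horovod's
# DistributedOptimizer to combine gradients across workers.
# Plain Python illustration; real backends use MPI or NCCL.

def allreduce_average(per_worker_grads):
    """per_worker_grads: one list of gradient values per worker.
    Returns the element-wise average that every worker receives."""
    num_workers = len(per_worker_grads)
    return [sum(vals) / num_workers for vals in zip(*per_worker_grads)]

# Two workers, each contributing two gradient values:
print(allreduce_average([[1.0, 4.0], [3.0, 2.0]]))  # [2.0, 3.0]
```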
@alsrgv This works for me, but users might need to add "--user" at the end of the pip command to ensure it installs to their local /home/user/.local/lib/python2.7/site-packages, unless they're using a virtualenv. Also, if you are using GPUs, be sure to export LD_LIBRARY_PATH like this:
export LD_LIBRARY_PATH=$CUDATOOLKIT_HOME/lib64:$CUDNN_PATH/lib64:$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
I was able to build and run keras_mnist.py with Cray-MPICH using the following:
HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod --user
@jbalma and @alsrgv Fantastic! Thanks a lot for fixing this.
@jbalma Could you please provide a sample script for running Horovod on Cray with 16 nodes, each having 1 GPU?
@UditGupta10 I put an example script here for issue #516 for you to try. Let me know if it works for you.
I am running Horovod on CPU, using the MNIST TensorFlow example.
When I run your command:
mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py
I got this:
[ip-] Warning: could not find environment variable "LD_LIBRARY_PATH"
ssh: Could not resolve hostname host2: Name or service not known
ssh: Could not resolve hostname host1: Name or service not known
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

ORTE does not know how to route a message to the specified daemon
located on the indicated node:
  my node: ip-000-00-00-00
  target node: host2
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
@salemmohammed, could you open a new issue with your question?