Comments (12)
@pbalapra The example in the docs works on both CPU and GPU. Does that help?
from horovod.
@alsrgv Thanks for the quick answer! I tried to run it on 8 CPU nodes (1 MPI rank per node). The code runs without any issues. However, the values of hvd.local_rank() and hvd.size() are 0 and 1, respectively, for all ranks. Based on the code documentation, hvd.size() should be the number of GPUs (in my case, CPUs)? Is that correct?
@pbalapra Can you clarify which MPI are you using (Open MPI, MPICH, etc) and what is your mpirun command?
If you're doing mpirun -n 8 -H host1,host2,host3,host4,host5,host6,host7,host8 ..., then hvd.local_rank() should be 0 on all nodes because you'd run only one process per node, hvd.rank() should be the index of the node, and hvd.size() should be 8.
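The mapping can be illustrated with a small pure-Python sketch (`assign_ranks` is a hypothetical helper written for this comment, not part of the Horovod API) that mimics how rank, local_rank, and size follow from a host placement:

```python
# Illustration of Horovod's rank semantics, not Horovod code:
# rank is the process's global index in launch order, local_rank is its
# index among the processes on the same host, and size is the total count.

def assign_ranks(hosts):
    """hosts[i] is the host that process i runs on (in launch order)."""
    seen_per_host = {}
    placements = []
    for rank, host in enumerate(hosts):
        local_rank = seen_per_host.get(host, 0)
        seen_per_host[host] = local_rank + 1
        placements.append({'rank': rank,
                           'local_rank': local_rank,
                           'size': len(hosts)})
    return placements

# One process per node on 8 hosts, as with `mpirun -n 8 -H host1,...,host8`:
for p in assign_ranks(['host%d' % i for i in range(1, 9)]):
    print(p)  # local_rank is 0 everywhere, rank is the node index, size is 8
```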
@alsrgv I am using Cray's MPICH and launch the code using aprun -n 8 -N 1 python ./keras_mnist.py
Also, I installed tensorflow and keras with conda without building tensorflow from source with MPI option enabled. I assume this is not a problem. Please correct me if I am wrong.
@pbalapra I see! You are right, you don't need to rebuild TensorFlow. To confirm your symptoms, can you run this simple program as well:
from __future__ import print_function
import horovod.tensorflow as hvd
hvd.init()
print('local_rank=%d, rank=%d, size=%d' % (hvd.local_rank(), hvd.rank(), hvd.size()))
The output I get with Open MPI is:
$ mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py
local_rank=0, rank=2, size=4
local_rank=0, rank=0, size=4
local_rank=1, rank=1, size=4
local_rank=1, rank=3, size=4
Hi, @pbalapra,
@jbalma (from Cray) and I huddled on this and made a change to Horovod to enable Cray-MPI. Can you try to reinstall Horovod like this and let us know if it helps?
$ pip uninstall -y horovod
$ HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod
or if your cluster has GPUs:
$ pip uninstall -y horovod
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod
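For context, those HOROVOD_GPU_* variables just select MPI as the backend for Horovod's collective ops; the core op, allreduce, hands every worker the element-wise average of all workers' tensors. A minimal pure-Python sketch of that semantics (an illustration only, not Horovod's implementation):

```python
# Semantics of an averaging allreduce, as used by Horovod's
# DistributedOptimizer to combine gradients across workers.
# Plain Python illustration; real backends use MPI or NCCL.

def allreduce_average(per_worker_grads):
    """per_worker_grads: one list of gradient values per worker.
    Returns the element-wise average that every worker receives."""
    num_workers = len(per_worker_grads)
    return [sum(vals) / num_workers for vals in zip(*per_worker_grads)]

# Two workers, each contributing two gradient values:
print(allreduce_average([[1.0, 4.0], [3.0, 2.0]]))  # [2.0, 3.0]
```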
@alsrgv This works for me, but users might need to add "--user" at the end of the pip command to ensure it installs to their local /home/user/.local/lib/python2.7/site-packages, unless they're using a virtualenv. Also, if you are using GPUs, be sure to export LD_LIBRARY_PATH like this:
export LD_LIBRARY_PATH=$CUDATOOLKIT_HOME/lib64:$CUDNN_PATH/lib64:$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
I was able to build and run keras_mnist.py with Cray-MPICH using the following:
HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod --user
@jbalma and @alsrgv Fantastic! Thanks a lot for fixing this.
@jbalma Could you please provide a sample script for running Horovod on Cray with 16 nodes, each having 1 GPU?
@UditGupta10 I put an example script here for issue #516 for you to try. Let me know if it works for you.
I am running Horovod on CPU, using the MNIST TensorFlow example.
When I run your command:
mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py
I got this:
[ip-] Warning: could not find environment variable "LD_LIBRARY_PATH"
ssh: Could not resolve hostname host2: Name or service not known
ssh: Could not resolve hostname host1: Name or service not known
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

* compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).

ORTE does not know how to route a message to the specified daemon
located on the indicated node:
  my node: ip-000-00-00-00
  target node: host2
This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
@salemmohammed, could you open a new issue with your question?