Comments (12)

alsrgv commented on May 8, 2024

@pbalapra The example in the docs works on both CPU and GPU. Does that help?

from horovod.

pbalapra commented on May 8, 2024

@alsrgv Thanks for the quick answer! I tried running it on 8 CPU nodes (1 MPI rank per node). The code runs without any issues. However, hvd.local_rank() and hvd.size() return 0 and 1, respectively, on every rank. According to the documentation, hvd.size() should be the number of GPUs (in my case, CPUs). Is that correct?

alsrgv commented on May 8, 2024

@pbalapra Can you clarify which MPI you are using (Open MPI, MPICH, etc.) and what your mpirun command is?

If you're doing mpirun -n 8 -H host1,host2,host3,host4,host5,host6,host7,host8 ..., then hvd.local_rank() should be 0 on all nodes because you'd run only one process per node, hvd.rank() should be the index of the node, and hvd.size() should be 8.
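
The expected values can be worked out without running MPI at all. The following is a minimal sketch, not Horovod code: expected_layout is a hypothetical helper, and it assumes the launcher assigns ranks host by host, filling every slot on one host before moving to the next.

```python
def expected_layout(slots_per_host):
    """Return (host, rank, local_rank, size) for every process.

    Assumption: ranks are handed out host by host, so all slots on the
    first host get the lowest ranks. Other mapping policies exist.
    """
    size = sum(slots for _, slots in slots_per_host)
    layout = []
    rank = 0
    for host, slots in slots_per_host:
        # local_rank restarts at 0 on every host.
        for local_rank in range(slots):
            layout.append((host, rank, local_rank, size))
            rank += 1
    return layout


# Eight hosts with one process each, as in the mpirun -n 8 example.
hosts = [("host%d" % i, 1) for i in range(1, 9)]
for host, rank, local_rank, size in expected_layout(hosts):
    print("%s: local_rank=%d, rank=%d, size=%d" % (host, local_rank, rank, size))
```

With one slot per host, every process gets local_rank 0 and size 8, and only rank differs between hosts.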

pbalapra commented on May 8, 2024

@alsrgv I am using Cray’s MPICH and launch the code with aprun -n 8 -N 1 python ./keras_mnist.py

Also, I installed TensorFlow and Keras with conda, without building TensorFlow from source with the MPI option enabled. I assume this is not a problem. Please correct me if I am wrong.

alsrgv commented on May 8, 2024

@pbalapra I see! You are right, you don't need to rebuild TensorFlow. To confirm your symptoms, can you run this simple program as well:

from __future__ import print_function
import horovod.tensorflow as hvd

hvd.init()
print('local_rank=%d, rank=%d, size=%d' % (hvd.local_rank(), hvd.rank(), hvd.size()))

The output I get with Open MPI is:

$ mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py
local_rank=0, rank=2, size=4
local_rank=0, rank=0, size=4
local_rank=1, rank=1, size=4
local_rank=1, rank=3, size=4
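
When the output of debug.py is captured from many processes, the symptom reported earlier in this thread (every process reporting size=1) can be spotted mechanically. This is a rough sketch that only parses the text format printed by the script above; parse_debug_output and ranks_are_isolated are hypothetical helper names, not part of Horovod.

```python
import re

# Matches the line format printed by the debug script above.
LINE = re.compile(r"local_rank=(\d+), rank=(\d+), size=(\d+)")


def parse_debug_output(text):
    """Extract (local_rank, rank, size) tuples from captured output."""
    return [tuple(int(g) for g in m.groups()) for m in LINE.finditer(text)]


def ranks_are_isolated(text):
    """True when several processes ran but each one reports size=1."""
    procs = parse_debug_output(text)
    return len(procs) > 1 and all(size == 1 for _, _, size in procs)


healthy = """local_rank=0, rank=2, size=4
local_rank=0, rank=0, size=4
local_rank=1, rank=1, size=4
local_rank=1, rank=3, size=4"""

# Eight processes that each believe they are alone.
broken = "\n".join(["local_rank=0, rank=0, size=1"] * 8)

print(ranks_are_isolated(healthy))  # False
print(ranks_are_isolated(broken))   # True
```

A True result suggests that the processes were launched but are not communicating with each other.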

alsrgv commented on May 8, 2024

Hi, @pbalapra,

@jbalma (from Cray) and I huddled on this and made a change to Horovod to enable Cray-MPI. Can you try to reinstall Horovod like this and let us know if it helps?

$ pip uninstall -y horovod
$ HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod

or if your cluster has GPUs:

$ pip uninstall -y horovod
$ HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod

jbalma commented on May 8, 2024

@alsrgv This works for me, but users might need to add "--user" at the end of the pip command to ensure it installs to their local /home/user/.local/lib/python2.7/site-packages, unless they're using a virtualenv. Also, if you are using GPUs, be sure to export LD_LIBRARY_PATH like this:

export LD_LIBRARY_PATH=$CUDATOOLKIT_HOME/lib64:$CUDNN_PATH/lib64:$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH

I was able to build and run the keras_mnist.py with Cray-MPICH using the following:

HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI HOROVOD_MPICXX_SHOW="CC --cray-print-opts=all" pip install --no-cache-dir horovod --user
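
For scripted setups, the same install variants can be assembled in Python. This is only a sketch: horovod_install_command is a hypothetical helper, the environment variables and pip flags are the ones quoted in this thread, and calling pip through "python -m pip" is a choice made here rather than something the thread prescribes. The block builds the command but does not run it.

```python
import os
import sys


def horovod_install_command(gpu=False, user=False, base_env=None):
    """Build (env, cmd) for reinstalling Horovod against Cray MPI.

    gpu:  also set the HOROVOD_GPU_* variables to MPI.
    user: append --user, for installs outside a virtualenv.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HOROVOD_MPICXX_SHOW"] = "CC --cray-print-opts=all"
    if gpu:
        for op in ("ALLREDUCE", "ALLGATHER", "BROADCAST"):
            env["HOROVOD_GPU_" + op] = "MPI"
    cmd = [sys.executable, "-m", "pip", "install", "--no-cache-dir", "horovod"]
    if user:
        cmd.append("--user")
    return env, cmd


env, cmd = horovod_install_command(gpu=True, user=True)
print(" ".join(cmd))
# To actually install: subprocess.check_call(cmd, env=env)
```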

pbalapra commented on May 8, 2024

@jbalma and @alsrgv Fantastic! Thanks a lot for fixing this.

UditGupta10 commented on May 8, 2024

@jbalma Could you please provide a sample script for running Horovod on a Cray system with 16 nodes, each with 1 GPU?

jbalma commented on May 8, 2024

@UditGupta10 I put an example script here for issue #516 for you to try.

Let me know if it works for you.

salemmohammed commented on May 8, 2024

@alsrgv

I am running Horovod on CPU, using the MNIST TensorFlow example.

When I ran your command:
mpirun -np 4 -H host1:2,host2:2 -x LD_LIBRARY_PATH python debug.py

I got this:
[ip-] Warning: could not find environment variable "LD_LIBRARY_PATH"
ssh: Could not resolve hostname host2: Name or service not known
ssh: Could not resolve hostname host1: Name or service not known

ORTE was unable to reliably start one or more daemons.
This usually is caused by:

  • not finding the required libraries and/or binaries on
    one or more nodes. Please check your PATH and LD_LIBRARY_PATH
    settings, or configure OMPI with --enable-orterun-prefix-by-default

  • lack of authority to execute on one or more specified nodes.
    Please verify your allocation and authorities.

  • the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
    Please check with your sys admin to determine the correct location to use.

  • compilation of the orted with dynamic libraries when static are required
    (e.g., on Cray). Please check your configure cmd line and consider using
    one of the contrib/platform definitions for your system type.

  • an inability to create a connection back to mpirun due to a
    lack of common network interfaces and/or no route found between
    them. Please check network connectivity (including firewalls
    and network routing requirements).



ORTE does not know how to route a message to the specified daemon
located on the indicated node:

my node: ip-000-00-00-00
target node: host2

This is usually an internal programming error that should be
reported to the developers. In the meantime, a workaround may
be to set the MCA param routed=direct on the command line or
in your environment. We apologize for the problem.
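
The two ssh lines in this log say that the names host1 and host2 could not be resolved; in the earlier example command they stand in for real machine names. A check along the following lines could be run before calling mpirun. It is a sketch only: hosts_from_spec and unresolvable_hosts are hypothetical helpers, and fake_resolve with its node-a entry is made up so the example runs without network access.

```python
import socket


def hosts_from_spec(spec):
    """Split a -H value such as 'host1:2,host2:2' into host names."""
    return [entry.split(":")[0] for entry in spec.split(",") if entry]


def unresolvable_hosts(spec, resolve=socket.gethostbyname):
    """Return the host names in spec that the resolver cannot look up."""
    bad = []
    for host in hosts_from_spec(spec):
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            bad.append(host)
    return bad


def fake_resolve(host):
    """Stand-in resolver for the demo; knows one made-up host."""
    known = {"node-a": "10.0.0.1"}
    if host not in known:
        raise socket.gaierror("unknown host: " + host)
    return known[host]


print(unresolvable_hosts("host1:2,host2:2", resolve=fake_resolve))  # ['host1', 'host2']
```

Omitting the resolve argument uses the system resolver, so an empty result means every name in the -H list can be looked up from the launching machine.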

alsrgv commented on May 8, 2024

@salemmohammed, could you open a new issue with your question?
