Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Home Page: http://horovod.ai

License: Other

Python 67.23% C++ 28.34% Shell 0.77% CMake 2.24% Cuda 0.76% C 0.04% Dockerfile 0.58% Mustache 0.03%
tensorflow uber machine-learning machinelearning mpi baidu deep-learning deeplearning keras pytorch

horovod's Introduction



Horovod


Horovod is a distributed deep learning training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. The goal of Horovod is to make distributed deep learning fast and easy to use.

LF AI & Data

Horovod is hosted by the LF AI & Data Foundation (LF AI & Data). If you are a company that is deeply committed to using open source technologies in artificial intelligence, machine learning, and deep learning, and want to support the communities of open source projects in these domains, consider joining the LF AI & Data Foundation. For details about who's involved and how Horovod plays a role, read the Linux Foundation announcement.

Documentation

Why Horovod?

The primary motivation for this project is to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. This has two aspects:

  1. How much modification does one have to make to a program to make it distributed, and how easy is it to run it?
  2. How much faster would it run in distributed mode?

Internally at Uber, we found the MPI model to be much more straightforward and to require far fewer code changes than previous solutions such as Distributed TensorFlow with parameter servers. Once a training script has been written to scale with Horovod, it can run on a single GPU, multiple GPUs, or even multiple hosts without any further code changes. See the Usage section for more details.

In addition to being easy to use, Horovod is fast. Below is a chart of a benchmark run on 128 servers with 4 Pascal GPUs each, connected by a RoCE-capable 25 Gbit/s network:

512-GPU Benchmark

Horovod achieves 90% scaling efficiency for both Inception V3 and ResNet-101, and 68% scaling efficiency for VGG-16. See Benchmarks to find out how to reproduce these numbers.

While installing MPI and NCCL itself may seem like an extra hassle, it only needs to be done once by the team dealing with infrastructure, while everyone else in the company who builds the models can enjoy the simplicity of training them at scale.

Install

To install Horovod on Linux or macOS:

  1. Install CMake

  2. If you've installed TensorFlow from PyPI, make sure that g++-5 or above is installed. Starting with TensorFlow 2.10, a C++17-compliant compiler such as g++-8 or above is required.

    If you've installed PyTorch from PyPI, make sure that g++-5 or above is installed.

    If you've installed either package from Conda, make sure that the gxx_linux-64 Conda package is installed.

  3. Install the horovod pip package.

    To run on CPUs:

    $ pip install horovod

    To run on GPUs with NCCL:

    $ HOROVOD_GPU_OPERATIONS=NCCL pip install horovod

For more details on installing Horovod with GPU support, read Horovod on GPU.

For the full list of Horovod installation options, read the Installation Guide.

If you want to use MPI, read Horovod with MPI.

If you want to use Conda, read Building a Conda environment with GPU support for Horovod.

If you want to use Docker, read Horovod in Docker.

To compile Horovod from source, follow the instructions in the Contributor Guide.

Concepts

Horovod core principles are based on MPI concepts such as size, rank, local rank, allreduce, allgather, broadcast, and alltoall. See this page for more details.
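
As an illustrative sketch (not taken verbatim from the Horovod documentation), the snippet below shows how these concepts surface in the Python API: each worker reports its size, rank, and local rank, and an allreduce combines a tensor across all workers.

# Launch with, for example: horovodrun -np 2 python concepts.py
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# size: total number of workers; rank: global index of this worker;
# local rank: index of this worker within its host.
print("size=%d rank=%d local_rank=%d" % (hvd.size(), hvd.rank(), hvd.local_rank()))

# allreduce combines (by default, averages) a tensor across all workers; here
# each worker contributes its own rank. With TensorFlow v1 this builds a graph
# op that must be run in a session, as in the example later in this README.
avg_rank = hvd.allreduce(tf.constant(float(hvd.rank())))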

Supported frameworks

See these pages for Horovod examples and best practices:

Usage

To use Horovod, make the following additions to your program:

  1. Run hvd.init() to initialize Horovod.

  2. Pin each GPU to a single process to avoid resource contention.

    With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

  3. Scale the learning rate by the number of workers.

    Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

  4. Wrap the optimizer in hvd.DistributedOptimizer.

    The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.

  5. Broadcast the initial variable states from rank 0 to all other processes.

    This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.

  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.

Example using TensorFlow v1 (see the examples directory for full training examples):

import tensorflow as tf
import horovod.tensorflow as hvd


# Initialize Horovod
hvd.init()

# Pin GPU to be used to process local rank (one GPU per process)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
opt = tf.train.AdagradOptimizer(0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to broadcast variables from rank 0 to all other processes during
# initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
checkpoint_dir = '/tmp/train_logs' if hvd.rank() == 0 else None

# The MonitoredTrainingSession takes care of session initialization,
# restoring from a checkpoint, saving to a checkpoint, and closing when done
# or an error occurs.
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
  while not mon_sess.should_stop():
    # Perform synchronous training.
    mon_sess.run(train_op)
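
The same additions carry over to the other supported frameworks. The following is an unofficial sketch of the PyTorch equivalent; the model, data, and hyperparameters are placeholders rather than a recommended configuration:

import torch
import torch.nn.functional as F
import horovod.torch as hvd

# Initialize Horovod and pin each process to one GPU via its local rank.
hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Placeholder model; substitute your own.
model = torch.nn.Linear(10, 1)
if torch.cuda.is_available():
    model.cuda()

# Scale the learning rate by the number of workers.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers with allreduce.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state from rank 0 to all workers.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(100):
    # Placeholder random batch; in practice each worker reads its own shard of data.
    data, target = torch.randn(32, 10), torch.randn(32, 1)
    if torch.cuda.is_available():
        data, target = data.cuda(), target.cuda()
    optimizer.zero_grad()
    loss = F.mse_loss(model(data), target)
    loss.backward()
    optimizer.step()

# Save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    torch.save(model.state_dict(), '/tmp/model.pt')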

Running Horovod

The example commands below show how to run distributed training. See Run Horovod for more details, including RoCE/InfiniBand tweaks and tips for dealing with hangs.

  1. To run on a machine with 4 GPUs:

    $ horovodrun -np 4 -H localhost:4 python train.py
  2. To run on 4 machines with 4 GPUs each:

    $ horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py
  3. To run using Open MPI without the horovodrun wrapper, see Running Horovod with Open MPI.
  4. To run in Docker, see Horovod in Docker.
  5. To run on Kubernetes, see Helm Chart, Kubeflow MPI Operator, FfDL, and Polyaxon.
  6. To run on Spark, see Horovod on Spark.
  7. To run on Ray, see Horovod on Ray.
  8. To run in Singularity, see Singularity.
  9. To run on an LSF HPC cluster (e.g., Summit), see LSF.
  10. To run on Hadoop Yarn, see TonY.

Gloo

Gloo is an open source collective communications library developed by Facebook.

Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.

For environments that support both MPI and Gloo, you can choose to use Gloo at runtime by passing the --gloo argument to horovodrun:

$ horovodrun --gloo -np 2 python train.py

mpi4py

Horovod supports mixing and matching Horovod collectives with other MPI libraries, such as mpi4py, provided that the MPI library was built with multi-threading support.

You can check for MPI multi-threading support by querying the hvd.mpi_threads_supported() function.

import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Verify that MPI multi-threading is supported.
assert hvd.mpi_threads_supported()

from mpi4py import MPI
assert hvd.size() == MPI.COMM_WORLD.Get_size()

You can also initialize Horovod with an mpi4py sub-communicator, in which case each sub-communicator will run an independent Horovod training.

from mpi4py import MPI
import horovod.tensorflow as hvd

# Split COMM_WORLD into subcommunicators
subcomm = MPI.COMM_WORLD.Split(color=MPI.COMM_WORLD.rank % 2,
                               key=MPI.COMM_WORLD.rank)

# Initialize Horovod
hvd.init(comm=subcomm)

print('COMM_WORLD rank: %d, Horovod rank: %d' % (MPI.COMM_WORLD.rank, hvd.rank()))

Inference

Learn how to optimize your model for inference and remove Horovod operations from the graph here.

Tensor Fusion

One of the unique things about Horovod is its ability to interleave communication and computation coupled with the ability to batch small allreduce operations, which results in improved performance. We call this batching feature Tensor Fusion.

See here for full details and tweaking instructions.
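
As a rough illustration, Tensor Fusion is tuned through environment variables such as HOROVOD_FUSION_THRESHOLD (fusion buffer size in bytes) and HOROVOD_CYCLE_TIME (cycle time in milliseconds); the values below are arbitrary examples, not recommendations. These variables are usually set on the horovodrun or mpirun command line, but they can also be set before Horovod is initialized:

import os

# Illustrative values only; tune per the Tensor Fusion documentation.
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(64 * 1024 * 1024)  # 64 MB fusion buffer
os.environ["HOROVOD_CYCLE_TIME"] = "5"                          # 5 ms between fusion cycles

import horovod.tensorflow as hvd
hvd.init()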

Horovod Timeline

Horovod has the ability to record the timeline of its activity, called Horovod Timeline.

Horovod Timeline

Use Horovod timeline to analyze Horovod performance. See here for full details and usage instructions.

Automated Performance Tuning

Selecting the right values to efficiently make use of Tensor Fusion and other advanced Horovod features can involve a good amount of trial and error. We provide a system to automate this performance optimization process called autotuning, which you can enable with a single command line argument to horovodrun.

See here for full details and usage instructions.

Horovod Process Sets

Horovod allows you to concurrently run distinct collective operations in different groups of processes taking part in one distributed training. Set up hvd.process_set objects to make use of this capability.

See Process Sets for detailed instructions.
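
Below is a minimal sketch, assuming the hvd.ProcessSet class and the process_set argument to collectives described in the Process Sets documentation, that splits four workers into two groups and reduces within each group independently:

import tensorflow as tf
import horovod.tensorflow as hvd

# Two process sets: even ranks and odd ranks (assumes 4 workers).
even_set = hvd.ProcessSet([0, 2])
odd_set = hvd.ProcessSet([1, 3])

# Register the process sets when initializing Horovod.
hvd.init(process_sets=[even_set, odd_set])

# Each worker runs an allreduce only within its own process set.
my_set = even_set if hvd.rank() % 2 == 0 else odd_set
reduced = hvd.allreduce(tf.constant(float(hvd.rank())), process_set=my_set)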

Guides

  1. Run distributed training in Microsoft Azure using Batch AI and Horovod.
  2. Distributed model training using Horovod.

Send us links to any user guides you want to publish on this site

Troubleshooting

See Troubleshooting and submit a ticket if you can't find an answer.

Citation

Please cite Horovod in your publications if it helps your research:

@article{sergeev2018horovod,
  Author = {Alexander Sergeev and Mike Del Balso},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {Horovod: fast and easy distributed deep learning in {TensorFlow}},
  Year = {2018}
}

Publications

1. Sergeev, A., Del Balso, M. (2017) Meet Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow. Retrieved from https://eng.uber.com/horovod/

2. Sergeev, A. (2017) Horovod - Distributed TensorFlow Made Easy. Retrieved from https://www.slideshare.net/AlexanderSergeev4/horovod-distributed-tensorflow-made-easy

3. Sergeev, A., Del Balso, M. (2018) Horovod: fast and easy distributed deep learning in TensorFlow. Retrieved from arXiv:1802.05799

References

The Horovod source code was based on the Baidu tensorflow-allreduce repository written by Andrew Gibiansky and Joel Hestness. Their original work is described in the article Bringing HPC Techniques to Deep Learning.

Getting Involved

horovod's People

Contributors

aaron276h, abditag2, alsrgv, amogkam, apeforest, ashahab, chongxiaoc, enricomi, firejq, igorwilbert, irasit, karakusc, leewyang, leezu, liangz1, maxhgerlach, mrata, nvcastet, porterrf, rb-determined-ai, richardliaw, romerojosh, rongou, sblotner, shirosankaku, terrytangyuan, tgaddair, weichenxu123, yuxihu, zsh-thu


horovod's Issues

Is mpirun necessary to launch distributed training?

Hi,

Does distributed training with Horovod have to use mpirun to distribute the training processes?

I mean, could I start each process manually on its local machine and have Horovod discover the cluster or peer nodes to do the distributed computation?

For example:
On machine 1, I run "python train.py"
On machine 2, I also run "python train.py"

Is this OK?

AttributeError: 'NoneType' object has no attribute 'dtype'

Some models, like RFCN, can return variables without gradients:

  File "train.py", line 249, in <module>
    tf.app.run()
  File "/home/csq/.cache/bazel/_bazel_csq/0a3580d8ecd2fafe9cc9970e974f5dc4/execroot/source/bazel-out/release_links/lib/python_env/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "train.py", line 229, in main
    worker_job_name, is_chief, FLAGS.train_dir, FLAGS.trace_every_n_steps)
  File "/home/csq/src/models/research/object_detection/trainer.py", line 271, in train
    for (gradient, var) in gradients]
  File "/home/csq/.cache/bazel/_bazel_csq/0a3580d8ecd2fafe9cc9970e974f5dc4/execroot/source/bazel-out/release_links/lib/python_env/horovod/tensorflow/__init__.py", line 76, in allreduce
    horovod_size = tf.cast(size(), tensor.dtype)
AttributeError: 'NoneType' object has no attribute 'dtype'

I just worked around it by replacing compute_gradients() with:

gradients = (super(hvd.DistributedOptimizer, training_optimizer)
             .compute_gradients(total_loss))
gradients = [(g, v) for g, v in gradients if g is not None]
if hvd.size() > 1:
  with tf.name_scope(training_optimizer._name + "_Allreduce"):
    grads_and_vars = [(hvd.allreduce(gradient, device_dense=training_optimizer._device_dense,
                                     device_sparse=training_optimizer._device_sparse), var)
                      for (gradient, var) in gradients]

but I think adding

gradients = [(g, v) for g, v in gradients if g is not None]

after this line: https://github.com/uber/horovod/blob/64317537436ae4225a004043acaa37a205a18dc3/horovod/tensorflow/__init__.py#L166

could fix it for others.

undefined symbol: opal_cuda_check_bufs

After I installed Horovod and ran tensorflow_mnist.py, I got the following error.

python: symbol lookup error: /usr/local/lib/openmpi/mca_coll_cuda.so: undefined symbol: opal_cuda_check_bufs

Question: RDMA Horovod with NCCL

Hi,
A question on the first 16-GPU benchmark shown in the README: it says RoCE 25 Gb and "RDMA Horovod (allreduce on GPU with NCCL)". However, NCCL2 does not yet support RoCE RDMA, so is the 'RDMA Horovod' label a typo?

RDMA Horovod (allreduce on GPU with NCCL) 2,022.6 (15.0x) 1,746.2 (14.6x) 1,787.4 (13.7x)

By contrast, the second 64-GPU benchmark is labeled 'TCP Horovod (allreduce on GPU with NCCL)'.

Running distributed tensorflow_mnist.py on two machines hangs and does not show logs

Hi @alsrgv, I run tensorflow_mnist.py on two machines with:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:1,172.23.233.75:1 python tensorflow_mnist.py

After showing some logs, it hangs and cannot proceed. The processes on both machines show GPU memory usage, but no further logs are printed. The output before the hang is as follows:

2017-11-05 23:17:20.677867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate (GHz) 1.112 pciBusID 0000:83:00.0 Total memory: 22.40GiB Free memory: 22.29GiB 2017-11-05 23:17:20.677908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 2017-11-05 23:17:20.677915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 2017-11-05 23:17:20.677927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:83:00.0) 2017-11-05 23:18:24.170460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate (GHz) 1.112 pciBusID 0000:83:00.0 Total memory: 22.40GiB Free memory: 22.29GiB 2017-11-05 23:18:24.170521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 2017-11-05 23:18:24.170528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y 2017-11-05 23:18:24.170539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:83:00.0) INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-100 [msragpum13:53619] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix [msragpum13:53619] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages [msragpum13:53619] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port

However, if I just run on a single machine, there are no problems:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.75:2 python tensorflow_mnist.py
or
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:2 python tensorflow_mnist.py

both can work properly.

My system is Ubuntu 14.04 with TensorFlow 1.3.0 and Python 2.7; Horovod was installed following the install guides, using NCCL 2.0 and Open MPI 3.0. So what is the problem?

Are computation and communication overlapped?

@alsrgv
You are using AsyncOpKernel. Does that overlap computation and communication?
I don't know the life cycle of an AsyncOpKernel. Will TensorFlow wait for async ops to finish at the end of each iteration?

I am quite curious about horovod's internals. Thank you so much!

Reproduce the example benchmarks

Will you be making the benchmark programs available in the examples section (Resnet, Inception, VGGNet)? That would enable us to reproduce your numbers. Thanks.

multi-gpu seems to be slower than single gpu

I have tried running on a single GPU and on multiple GPUs, and multi-GPU seems to be slower with Horovod on the tensorflow_mnist.py example. Below are the outputs.
Here is the single-GPU run with CUDA_VISIBLE_DEVICES=1 python tensorflow_mnist.py:

INFO:tensorflow:loss = 2.30667, step = 1
INFO:tensorflow:loss = 2.29006, step = 11 (0.068 sec)
INFO:tensorflow:loss = 2.25554, step = 21 (0.052 sec)
INFO:tensorflow:loss = 2.1974, step = 31 (0.052 sec)
INFO:tensorflow:loss = 1.89438, step = 41 (0.051 sec)
INFO:tensorflow:loss = 1.82758, step = 51 (0.052 sec)
INFO:tensorflow:loss = 0.915703, step = 61 (0.049 sec)
INFO:tensorflow:loss = 1.77014, step = 71 (0.050 sec)
INFO:tensorflow:loss = 1.03163, step = 81 (0.049 sec)
INFO:tensorflow:loss = 0.936019, step = 91 (0.050 sec)
INFO:tensorflow:global_step/sec: 162.7
INFO:tensorflow:loss = 0.522648, step = 101 (0.048 sec)
INFO:tensorflow:loss = 0.573508, step = 111 (0.049 sec)
INFO:tensorflow:loss = 0.214133, step = 121 (0.049 sec)
INFO:tensorflow:loss = 0.757816, step = 131 (0.048 sec)
INFO:tensorflow:loss = 0.190643, step = 141 (0.047 sec)
INFO:tensorflow:loss = 0.213472, step = 151 (0.048 sec)
INFO:tensorflow:loss = 0.0751245, step = 161 (0.048 sec)
INFO:tensorflow:loss = 0.115839, step = 171 (0.048 sec)
INFO:tensorflow:loss = 0.0470715, step = 181 (0.048 sec)
INFO:tensorflow:loss = 0.173935, step = 191 (0.048 sec)
INFO:tensorflow:global_step/sec: 207.711
INFO:tensorflow:loss = 0.196011, step = 201 (0.048 sec)
INFO:tensorflow:loss = 0.0540665, step = 211 (0.049 sec)
INFO:tensorflow:loss = 0.157532, step = 221 (0.047 sec)
INFO:tensorflow:loss = 0.392506, step = 231 (0.048 sec)
INFO:tensorflow:loss = 0.216696, step = 241 (0.048 sec)
INFO:tensorflow:loss = 0.146749, step = 251 (0.048 sec)
INFO:tensorflow:loss = 0.0601752, step = 261 (0.047 sec)
INFO:tensorflow:loss = 0.255786, step = 271 (0.049 sec)
INFO:tensorflow:loss = 0.116301, step = 281 (0.048 sec)
INFO:tensorflow:loss = 0.226139, step = 291 (0.048 sec)

Here is the command for 2 GPUs, CUDA_VISIBLE_DEVICES=1,2 mpirun -np 2 python tensorflow_mnist.py,
and the outputs:

INFO:tensorflow:loss = 2.29663, step = 2
INFO:tensorflow:loss = 2.30619, step = 2
INFO:tensorflow:loss = 2.28995, step = 12 (0.295 sec)
INFO:tensorflow:loss = 2.2678, step = 12 (0.208 sec)
INFO:tensorflow:loss = 2.26281, step = 22 (0.189 sec)
INFO:tensorflow:loss = 2.2539, step = 22 (0.189 sec)
INFO:tensorflow:loss = 2.17149, step = 32 (0.190 sec)
INFO:tensorflow:loss = 2.18344, step = 32 (0.190 sec)
INFO:tensorflow:loss = 1.73136, step = 42 (0.190 sec)
INFO:tensorflow:loss = 1.66357, step = 42 (0.190 sec)
INFO:tensorflow:loss = 1.18375, step = 52 (0.190 sec)
INFO:tensorflow:loss = 1.1508, step = 52 (0.190 sec)
INFO:tensorflow:loss = 0.689034, step = 62 (0.190 sec)
INFO:tensorflow:loss = 0.60923, step = 62 (0.190 sec)
INFO:tensorflow:loss = 0.417928, step = 72 (0.189 sec)
INFO:tensorflow:loss = 0.389533, step = 72 (0.189 sec)
INFO:tensorflow:loss = 1.40314, step = 82 (0.194 sec)
INFO:tensorflow:loss = 1.60605, step = 82 (0.194 sec)
INFO:tensorflow:loss = 0.517931, step = 92 (0.198 sec)
INFO:tensorflow:loss = 0.588845, step = 92 (0.198 sec)
INFO:tensorflow:loss = 0.958909, step = 102 (0.190 sec)
INFO:tensorflow:global_step/sec: 49.6269
INFO:tensorflow:loss = 1.07172, step = 102 (0.191 sec)
INFO:tensorflow:loss = 0.209173, step = 112 (0.189 sec)
INFO:tensorflow:loss = 0.172761, step = 112 (0.188 sec)
INFO:tensorflow:loss = 0.282635, step = 122 (0.192 sec)
INFO:tensorflow:loss = 0.316832, step = 122 (0.192 sec)
INFO:tensorflow:loss = 0.235262, step = 132 (0.192 sec)
INFO:tensorflow:loss = 0.39039, step = 132 (0.192 sec)
INFO:tensorflow:loss = 0.249453, step = 142 (0.190 sec)
INFO:tensorflow:loss = 0.220639, step = 142 (0.190 sec)
INFO:tensorflow:loss = 0.188635, step = 152 (0.191 sec)
INFO:tensorflow:loss = 0.237106, step = 152 (0.191 sec)
INFO:tensorflow:loss = 0.0990839, step = 162 (0.196 sec)
INFO:tensorflow:loss = 0.298068, step = 162 (0.196 sec)
INFO:tensorflow:loss = 0.289399, step = 172 (0.190 sec)
INFO:tensorflow:loss = 0.209013, step = 172 (0.190 sec)
INFO:tensorflow:loss = 0.187031, step = 182 (0.195 sec)
INFO:tensorflow:loss = 0.325866, step = 182 (0.195 sec)
INFO:tensorflow:loss = 0.106716, step = 192 (0.193 sec)
INFO:tensorflow:loss = 0.120047, step = 192 (0.193 sec)
INFO:tensorflow:global_step/sec: 51.9471

hvd.allreduce() is not reducing tensors

In the Horovod source, in mpi_ops.py under the _allreduce function, it says:

   """An op which sums an input tensor over all the Horovod processes.
    The reduction operation is keyed by the name of the op. The tensor type and
    shape must be the same on all Horovod processes for a given name. The reduction
    will not start until all processes are ready to send and receive the tensor.
    Returns:
      A tensor of the same shape and type as `tensor`, summed across all
      processes.
    """

I made a minimal example to reproduce this description, reducing a tensor called const on each process into a tensor called reduced:

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

config = tf.ConfigProto()

# Rank
rank = hvd.rank()

# Constant tensor to reduce
const = tf.constant([1.,2.])
reduced = hvd.allreduce(const)

# Monitored session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
sess = tf.train.MonitoredTrainingSession(config=config, hooks=hooks)

# I expect a summed tensor -> [2,4]
print("rank: %d" % (rank), sess.run(reduced))

Saving this in a file called test.py, I run with the following:
mpirun -np 2 python allreduce.py

The result is:

('rank: 0', array([ 1.,  2.], dtype=float32))
('rank: 1', array([ 1.,  2.], dtype=float32))

This just returns the original tensor [1., 2.] that was supposed to be reduced. I expected the following:

('rank: 0', array([ 2.,  4.], dtype=float32))
('rank: 1', array([ 2.,  4.], dtype=float32))

Am I using this function incorrectly? I'm running on my laptop with 4 available processing units, and I'm using CPUs.

EDIT: Wow, I thought allreduce() returns the sum of tensors but it returns the average. It's working as expected now. Sorry for my foolishness.

[q] mpirun crash

Hi,

We're testing Horovod on CentOS 7.3 with NCCL 2, CUDA 8, and cuDNN 6.0, but it keeps crashing. We tested Open MPI 2.1.2, 3.0.0, and MVAPICH 2.2, but all crash with the same error message shown below.
NCCL 2 has been confirmed to work via nccl-tests.

Could you comment on the issue? Other details are shown below.

Best regards,

Byoungseon

Install command: HOROVOD_NCCL_HOME=$NCCL_HOME HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip3 install --no-cache-dir horovod
Run command: mpirun -n 2 python3 ./keras_mnist.py

Error message shown below:

2017-10-30 15:46:26.208991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:13:00.0, compute capability: 6.0)
[ava02:21329:1] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000068d1c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
3 0x000000000006926c mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
4 0x0000000000035250 killpg() ??:0
5 0x000000000014b9cc __memcpy_ssse3_back() :0
6 0x000000000038ad0e mv2_smp_fast_write_contig() ??:0
7 0x000000000036ac7f MPIDI_CH3_EagerContigShortSend() ??:0
8 0x000000000037518f MPID_Send() ??:0
9 0x000000000030dbaa MPIC_Send() ??:0
10 0x000000000007c372 MPIR_Bcast_binomial.isra.1() bcast.c:0
11 0x000000000007cd15 MPIR_Bcast_intra() ??:0
12 0x00000000000e738a MPIR_Bcast_index_tuned_intra_MV2() ??:0
13 0x00000000000e4eaf MPIR_Bcast_MV2() ??:0
14 0x000000000007cfe8 MPIR_Bcast_intra() ??:0
15 0x00000000000e738a MPIR_Bcast_index_tuned_intra_MV2() ??:0
16 0x00000000000e4eaf MPIR_Bcast_MV2() ??:0
17 0x000000000007d89b MPIR_Bcast_impl() ??:0
18 0x000000000007dea9 MPI_Bcast() ??:0
19 0x000000000001f1be _ZN7horovod10tensorflow12_GLOBAL__N_116PerformOperationERSt13unordered_mapISsNS1_16TensorTableEntryESt4hashISsESt8equal_toISsESaISt4pairIKSsS3_EEENS0_11MPIResponseE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1131
20 0x0000000000021391 _ZN7horovod10tensorflow12_GLOBAL__N_120BackgroundThreadLoopERNS1_18HorovodGlobalStateE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1431
21 0x0000000000021391 ~vector() /usr/include/c++/4.8.2/bits/stl_vector.h:416
22 0x0000000000021391 ~MPIResponse() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_message.h:93
23 0x0000000000021391 _ZN7horovod10tensorflow12_GLOBAL__N_120BackgroundThreadLoopERNS1_18HorovodGlobalStateE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1431
24 0x00000000000b5230 _ZNSt11this_thread11__sleep_forENSt6chrono8durationIlSt5ratioILl1ELl1EEEENS1_IlS2_ILl1ELl1000000000EEEE() ??:0
25 0x0000000000007dc5 start_thread() pthread_create.c:0
26 0x00000000000f776d __clone() ??:0

Listing tf local devices makes horovod go bananas.

See title.

Add this to the top of, for example, the Keras example script, and it will allocate a ton of processors and go OOM when using multiple devices.

from tensorflow.python.client import device_lib
LOCAL_DEVICES = device_lib.list_local_devices()

Release 🍌 with:

mpirun -np 4 python keras_mnist.py

Got error when installing it under windows 10

@alsrgv: I want to use it under Windows 10 (one of our tools will call Horovod, and that tool only supports Windows), and I got the following error messages when installing it. Could you please help me, or is there no chance of using it under Windows? Also, I couldn't use the options "HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod" as you mentioned.


C:\WINDOWS\system32>pip3 install --no-cache-dir horovod
Collecting horovod
Downloading horovod-0.9.10.tar.gz (64kB)
100% |████████████████████████████████| 71kB 1.5MB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command "c:\program files\python36\python.exe" -u -c "import setuptools, tokenize;file='C:\Users\BRIGHT1\AppData\Local\Temp\pip-build-e3i6bu64\horovod\setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\BRIGHT1\AppData\Local\Temp\pip-lt5q0wnv-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\horovod
copying horovod\__init__.py -> build\lib.win-amd64-3.6\horovod
creating build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\mpi_ops.py -> build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\mpi_ops_test.py -> build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\__init__.py -> build\lib.win-amd64-3.6\horovod\tensorflow
running build_ext
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_libs.cc
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO "/LIBPATH:c:\program files\python36\lib\site-packages\tensorflow\core" "/LIBPATH:c:\program files\python36\libs" "/LIBPATH:c:\program files\python36\PCbuild\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" tensorflow_framework.lib build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj /OUT:build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.dll
LINK : fatal error LNK1181: cannot open input file 'tensorflow_framework.lib'
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_libs.cc
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO "/LIBPATH:c:\program files\python36\lib\site-packages\tensorflow\core" "/LIBPATH:c:\program files\python36\libs" "/LIBPATH:c:\program files\python36\PCbuild\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj /OUT:build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.dll
Generating code
Finished generating code
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD -D_GLIBCXX_USE_CXX11_ABI=0 "-Ic:\program files\python36\lib\site-packages\tensorflow\include" "-Ic:\program files\python36\lib\site-packages\tensorflow\include/external/nsync/public" "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_abi.cc
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/message_lite.h(247): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/wire_format_lite_inl.h(865): warning C4267: 'argument': conversion from 'size_t' to 'google::protobuf::uint32', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\tensorflow/core/lib/gtl/manual_constructor.h(97): fatal error C1189: #error: "You must define TF_LIB_GTL_ALIGNED_CHAR_ARRAY for your compiler."
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD -D_GLIBCXX_USE_CXX11_ABI=1 "-Ic:\program files\python36\lib\site-packages\tensorflow\include" "-Ic:\program files\python36\lib\site-packages\tensorflow\include/external/nsync/public" "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_abi.cc
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/message_lite.h(247): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/wire_format_lite_inl.h(865): warning C4267: 'argument': conversion from 'size_t' to 'google::protobuf::uint32', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\tensorflow/core/lib/gtl/manual_constructor.h(97): fatal error C1189: #error: "You must define TF_LIB_GTL_ALIGNED_CHAR_ARRAY for your compiler."
error: Unable to determine CXX11 ABI to use with TensorFlow (see error above).

----------------------------------------

Error: 'TFOptimizer' object has no attribute 'lr'

Thanks a lot for this great framework. I got the error message "'TFOptimizer' object has no attribute 'lr'" when using the ReduceLROnPlateau callback from Keras. Could you please help me find a way to solve this with Horovod? Thank you very much in advance.

Any detailed documents about this package?

Hi,

Are there any detailed documents about this package, for example on how to use the parameters and how to save and restore a model?
Such as the code below:
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=10000),
         tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                    every_n_iter=10),
         ]

hvd.init() hangs on mpi gpu cluster

Hello,
I have a cluster of 2 gpu nodes.

  1. I installed horovod with the following command:
    HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --user --no-cache-dir horovod
  2. added print statements into original tensorflow_mnist.py sample before and after hvd.init() call
  3. When I run the sample on two nodes, I see the print statements before hvd.init() but do not see them after:
    _azbatch@1d4848ca651249409735f1aebc6a7e60000000:~$ mpirun -H 10.0.0.4,10.0.0.5 python /mnt/batch/tasks/shared/LS_root/mounts/external/tensorflow_mnist.py
    Initializing horovod

[[767,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: 1d4848ca651249409735f1aebc6a7e60000001

Another transport will be used instead, although this may result in
lower performance.

Initializing horovod
[1d4848ca651249409735f1aebc6a7e60000000:41232] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[1d4848ca651249409735f1aebc6a7e60000000:41232] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
^C_azbatch@1d4848ca651249409735f1aebc6a7e60000000:~$ mpirun -d -H 10.0.0.4,10.0.0.5 python /mnt/batch/tasks/shared/LS_root/mounts/external/tensorflow_mnist.py
[1d4848ca651249409735f1aebc6a7e60000000:42354] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/0/0
[1d4848ca651249409735f1aebc6a7e60000000:42354] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/0
[1d4848ca651249409735f1aebc6a7e60000000:42354] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0
[1d4848ca651249409735f1aebc6a7e60000000:42354] tmp: /tmp
[1d4848ca651249409735f1aebc6a7e60000000:42354] sess_dir_cleanup: job session dir does not exist
[1d4848ca651249409735f1aebc6a7e60000000:42354] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/0/0
[1d4848ca651249409735f1aebc6a7e60000000:42354] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/0
[1d4848ca651249409735f1aebc6a7e60000000:42354] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0
[1d4848ca651249409735f1aebc6a7e60000000:42354] tmp: /tmp
[1d4848ca651249409735f1aebc6a7e60000001:44075] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/0/1
[1d4848ca651249409735f1aebc6a7e60000001:44075] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/0
[1d4848ca651249409735f1aebc6a7e60000001:44075] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0
[1d4848ca651249409735f1aebc6a7e60000001:44075] tmp: /tmp
[1d4848ca651249409735f1aebc6a7e60000001:44075] sess_dir_cleanup: job session dir does not exist
[1d4848ca651249409735f1aebc6a7e60000001:44075] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/0/1
[1d4848ca651249409735f1aebc6a7e60000001:44075] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/0
[1d4848ca651249409735f1aebc6a7e60000001:44075] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0
[1d4848ca651249409735f1aebc6a7e60000001:44075] tmp: /tmp
Initializing horovod
[1d4848ca651249409735f1aebc6a7e60000001:44078] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/1/1
[1d4848ca651249409735f1aebc6a7e60000001:44078] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0/1693/1
[1d4848ca651249409735f1aebc6a7e60000001:44078] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000001_0
[1d4848ca651249409735f1aebc6a7e60000001:44078] tmp: /tmp

[[1693,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: 1d4848ca651249409735f1aebc6a7e60000001

Another transport will be used instead, although this may result in
lower performance.

Initializing horovod
[1d4848ca651249409735f1aebc6a7e60000000:42378] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/1/0
[1d4848ca651249409735f1aebc6a7e60000000:42378] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/1
[1d4848ca651249409735f1aebc6a7e60000000:42378] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0
[1d4848ca651249409735f1aebc6a7e60000000:42378] tmp: /tmp
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, 1d4848ca651249409735f1aebc6a7e60000000, /anaconda/envs/py35/bin/python, 42378)
(i, host, exe, pid) = (1, 10.0.0.5, /anaconda/envs/py35/bin/python, 44078)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[1d4848ca651249409735f1aebc6a7e60000000:42354] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[1d4848ca651249409735f1aebc6a7e60000000:42354] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

  4. I see Python processes on both nodes consuming 100% CPU.
  5. strace for both processes shows only calls to nanosleep({0, 1000000}, NULL).
  6. The sample runs perfectly well on a single node (on one or two GPUs).

Thank you,
Alex

Unable to install Horovod on GPUs

I have installed NCCL2 and OpenMPI but am unable to install Horovod on a GPU. As specified in the documentation I am using the following command:

HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda8.0_amd64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod

This results in the following output:

#HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda8.0_amd64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
Collecting horovod
Downloading horovod-0.9.10.tar.gz (64kB)
100% |████████████████████████████████| 71kB 510kB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command /opt/anaconda2/envs/tensorflow/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-61FZkn/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-ItuZD1-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops_test.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
running build_ext
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -ltensorflow_framework -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_cuda.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/opt/anaconda2/envs/tensorflow/lib -lcudart -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/usr/local/cuda/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_nccl.cc -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_nccl.o -L/usr/local/nccl_2.0.5-3+cuda8.0_amd64/lib -L/usr/local/nccl_2.0.5-3+cuda8.0_amd64/lib64 -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/opt/anaconda2/envs/tensorflow/lib -lnccl -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.so
building 'horovod.tensorflow.mpi_lib' extension
creating build/temp.linux-x86_64-2.7/horovod
creating build/temp.linux-x86_64-2.7/horovod/tensorflow
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c horovod/tensorflow/mpi_message.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_message.o -std=c++11 -fPIC -O2 -I/usr/local/openmpi/include -pthread -Wl,-rpath -Wl,/usr/local/openmpi/lib -Wl,--enable-new-dtags -L/usr/local/openmpi/lib -lmpi_cxx -lmpi -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c horovod/tensorflow/mpi_ops.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_ops.o -std=c++11 -fPIC -O2 -I/usr/local/openmpi/include -pthread -Wl,-rpath -Wl,/usr/local/openmpi/lib -Wl,--enable-new-dtags -L/usr/local/openmpi/lib -lmpi_cxx -lmpi -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
horovod/tensorflow/mpi_ops.cc: In function ‘int horovod::tensorflow::{anonymous}::GetDeviceID(tensorflow::OpKernelContext*)’:
horovod/tensorflow/mpi_ops.cc:1723:63: error: ‘const struct tensorflow::DeviceBase::GpuDeviceInfo’ has no member named ‘gpu_id’
device = context->device()->tensorflow_gpu_device_info()->gpu_id;
^
In file included from /opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:22:0,
from horovod/tensorflow/mpi_ops.cc:22:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/allocator.h: In member function ‘virtual std::size_t tensorflow::Allocator::RequestedSize(void*)’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/allocator.h:155:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
In file included from /opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:25:0,
from horovod/tensorflow/mpi_ops.cc:22:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h: In member function ‘virtual tensorflow::Allocator* tensorflow::DeviceBase::GetAllocator(tensorflow::AllocatorAttributes)’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h:152:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h: In member function ‘virtual const tensorflow::DeviceAttributes& tensorflow::DeviceBase::attributes() const’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h:183:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/opt/anaconda2/envs/tensorflow/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-61FZkn/horovod/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-ItuZD1-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-61FZkn/horovod/

How do I address these errors? Do I also need to make any changes to handle the warnings which state "valid for C/ObjC but not for C++"?

Running batch norm updates

Is there a simple way to run batch norm updates in addition to gradient updates? The industry practice is to run batch-norm update ops from a single tower only; therefore, it would be incorrect to add them to the training op in the provided sample code.

Error when saving a model

Hi,

I'm using Horovod to do distributed TF training.

In my code, it seems that my task fails at the beginning due to a model-saving error:

	 [[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:localhost/replica:0/task:0/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _arg_save/Const_0_0)]]

Caused by op 'save/MergeV2Checkpoints', defined at:
  File "/tmp/apprunner/.working/runtime/app/train.py", line 432, in <module>
    main()
  File "/tmp/apprunner/.working/runtime/app/train.py", line 395, in main
    checkpoint_dir=work_path) as sess:
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session
    self._scaffold.finalize()
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 203, in finalize
    self._saver = training_saver._get_saver_or_default()  # pylint: disable=protected-access
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 736, in _get_saver_or_default
    saver = Saver(sharded=True, allow_empty=True)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
    self.build()
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1170, in build
    restore_sequentially=self._restore_sequentially)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 685, in build
    save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 361, in _AddShardedSaveOps
    return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
    sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 185, in merge_v2_checkpoints
    delete_old_dirs=delete_old_dirs, name=name)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

My code is

...
    # BroadcastGlobalVariablesHook broadcasts variables from rank 0 to all other
    # processes during initialization.
    hooks = [hvd.BroadcastGlobalVariablesHook(0),
             tf.train.StopAtStepHook(last_step=8000000),
             tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
                                        every_n_iter=10),
            ]

    is_chief = (hvd.rank() == 0)
    # save_secs = args.save_seconds if is_chief else 800000000000000000
    save_secs = args.save_seconds if is_chief else None

    with tf.train.MonitoredTrainingSession(config=config,
                                           hooks=hooks,
                                           save_checkpoint_secs=save_secs,
                                           checkpoint_dir=work_path) as sess:

work_path is an HDFS path here.

Any idea about this issue?
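One commonly suggested pattern for this situation (a sketch, not from the original thread; it assumes the usual Horovod convention that only rank 0 writes checkpoints) is to disable the checkpoint directory entirely on non-chief ranks:

    # Hypothetical variant of the snippet above: only rank 0 gets a
    # checkpoint_dir, so the other ranks never build the saver ops at all.
    is_chief = (hvd.rank() == 0)
    with tf.train.MonitoredTrainingSession(
            config=config,
            hooks=hooks,
            save_checkpoint_secs=args.save_seconds if is_chief else None,
            checkpoint_dir=work_path if is_chief else None) as sess:
        ...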

Benchmark's scalability (weak/strong)

I take it the benchmarks in the README are based on weak scaling (i.e. the batch size per process stays the same regardless of the number of processes)? If it is weak scaling, you'd be solving a different problem than if you divided the batch size by the number of processes (strong scaling), which would also give you worse results.
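For illustration only (assumed base batch size of 64, not taken from the benchmarks), the per-worker batch size under the two regimes would be:

    import horovod.tensorflow as hvd
    hvd.init()

    base_batch = 64
    weak_batch = base_batch                  # per worker; global batch = 64 * hvd.size()
    strong_batch = base_batch // hvd.size()  # per worker; global batch stays 64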

`bool` is not in the list of allowed variables - cifar

I'm running a CIFAR-10 implementation adapted for Horovod on a GPU cluster (once with 4 GPUs and once with 8), and I get the following error.
The code works on my local machine when the Horovod calls are removed, so I don't think the problem is in the TensorFlow implementation. On the cluster, the MNIST example also runs perfectly, so the setup looks okay.

My guess is that the issue is with horovod_broadcast.

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[27685,1],0]
  Exit code:    1
--------------------------------------------------------------------------


 STDERR: INFO:tensorflow:Create CheckpointSaverHook.
Traceback (most recent call last):
  File "/data/hadoop-2.8.2.1/tmp/nm-local-dir/usercache/WLxuXLFw1KZBatf4JlKiQVx2fB-Nwe4CDQUcvyKfLbY/appcache/application_1508601191819_0234/container_e22_1508601191819_0234_01_000003/cifar_horovod-2.py", line 191, in <module>
    with tf.train.SingularMonitoredSession(hooks=hooks, config=config) as sess:
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 483, in __init__
    h.begin()
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/__init__.py", line 115, in begin
    self.bcast_op = broadcast_global_variables(self.root_rank)
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/__init__.py", line 90, in broadcast_global_variables
    for var in tf.global_variables()])
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/mpi_ops.py", line 187, in broadcast
    return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank)
  File "<string>", line 90, in horovod_broadcast
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 589, in apply_op
    param_name=input_name)
  File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
    ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'tensor' has DataType bool not in list of allowed values: uint8, int8, uint16, int16, int32, int64, float32, float64


	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)
17/11/06 17:38:16 INFO CoarseGrainedExecutorBackend: Got assigned task 11
17/11/06 17:38:16 INFO Executor: Running task 0.3 in stage 0.0 (TID 11)
17/11/06 17:38:16 INFO Executor: Executor is trying to kill task 0.3 in stage 0.0 (TID 11), reason: stage cancelled
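A possible workaround (a hedged sketch, not from the original thread) is to route boolean variables through a dtype that the broadcast op does accept and cast back afterwards:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    def broadcast_global_variables_with_bool(root_rank):
        # Sketch: horovod_broadcast in this version rejects DataType bool,
        # so round-trip bool variables through uint8.
        ops = []
        for var in tf.global_variables():
            if var.dtype.base_dtype == tf.bool:
                bcast = hvd.broadcast(tf.cast(var, tf.uint8), root_rank)
                ops.append(tf.assign(var, tf.cast(bcast, tf.bool)))
            else:
                ops.append(tf.assign(var, hvd.broadcast(var, root_rank)))
        return tf.group(*ops)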

How do I know GPUDirect is being used?

Hi, @alsrgv

I have installed the nv_peer_memory driver, and I also installed Horovod via

    HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip3 install --no-cache-dir horovod

But when I run the training task, how do I know whether GPUDirect is being used?

For instance,

root@ecb2de971add:/gang/distributed/allreduce/mnist# mpirun -np 4 --allow-run-as-root -bind-to none -oversubscribe -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca pml ob1  python3 keras_uber.py

2017-11-08 08:43:33.395738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:08:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.395794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 3, name: TITAN Xp, pci bus id: 0000:08:00.0, compute capability: 6.1)
2017-11-08 08:43:33.399529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:07:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.399567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 2, name: TITAN Xp, pci bus id: 0000:07:00.0, compute capability: 6.1)
2017-11-08 08:43:33.419007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.419047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2017-11-08 08:43:33.419182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.419221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: TITAN Xp, pci bus id: 0000:06:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
Train on 60000 samples, validate on 10000 samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
Epoch 1/3
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
ecb2de971add:35788:35963 [0] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35788:35963 [0] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35788:35963 [0] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35788:35963 [0] INFO Using internal Network IB
NCCL version 2.0.5 compiled with CUDA 9.0
ecb2de971add:35790:35969 [2] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35790:35969 [2] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35789:35964 [1] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35789:35964 [1] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35791:35960 [3] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35791:35960 [3] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35790:35969 [2] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35789:35964 [1] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35791:35960 [3] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35790:35969 [2] INFO Using internal Network IB
ecb2de971add:35789:35964 [1] INFO Using internal Network IB
ecb2de971add:35791:35960 [3] INFO Using internal Network IB
ecb2de971add:35788:35963 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35790:35969 [2] INFO CUDA Dev 2, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35789:35964 [1] INFO CUDA Dev 1, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35791:35960 [3] INFO CUDA Dev 3, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO Using 256 threads
ecb2de971add:35788:35963 [0] INFO [0] Ring 0 :    0   1   2   3
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO 2 -> 1 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO 3 -> 2 via P2P/IPC
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO 1 -> 0 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO 0 -> 3 via P2P/IPC
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35790:35969 [2] INFO 2 -> 3 via P2P/IPC
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35791:35960 [3] INFO 3 -> 0 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35789:35964 [1] INFO 1 -> 2 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
ecb2de971add:35788:35963 [0] INFO 0 -> 1 via P2P/IPC
59648/60000 [============================>.] - ETA: 0s - loss: 0.1823 - acc: 0.945259648/60000 [============================>.] - ETA: 0s - loss: 0.1812 - acc: 0.944259648/60000 [==========60000/60000 [==============================] - 7s - loss: 0.1787 - acc: 0.9450 - val_loss: 0.0445 - val_acc: 0.9856- loss: 0.1795 - acc: 0.94478331268
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1804 - acc: 0.9445 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1816 - acc: 0.9453 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1808 - acc: 0.9435 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
59648/60000 [============================>.] - ETA: 0s - loss: 0.0465 - acc: 0.986159648/60000 [============================>.] - ETA: 0s - loss: 0.0458 - acc: 0.986059648/60000 [==========60000/60000 [==============================] - 5s - loss: 0.0462 - acc: 0.9858 - val_loss: 0.0289 - val_acc: 0.9905- loss: 0.0464 - acc: 0.985644
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0458 - acc: 0.9860 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0465 - acc: 0.9861 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0462 - acc: 0.9857 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
59648/60000 [============================>.] - ETA: 0s - loss: 0.0302 - acc: 0.990559648/60000 [============================>.] - ETA: 0s - loss: 0.0300 - acc: 0.990659648/60000 [==========60000/60000 [==============================] - 5s - loss: 0.0291 - acc: 0.9907 - val_loss: 0.0274 - val_acc: 0.9911- loss: 0.0291 - acc: 0.9907
60000/60000 [==============================] - 5s - loss: 0.0302 - acc: 0.9905 - val_loss: 0.0274 - val_acc: 0.9911
60000/60000 [==============================] - 5s - loss: 0.0309 - acc: 0.9904 - val_loss: 0.0274 - val_acc: 0.9911
60000/60000 [==============================] - 5s - loss: 0.0299 - acc: 0.9906 - val_loss: 0.0274 - val_acc: 0.9911
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
--------------------------------------------------------------------------
The call to cuIpcCloseMemHandle failed. This is a warning and the program
will continue to run.
  cuIpcCloseMemHandle return value:   4
  address: 0x1090fc00000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[ecb2de971add:35789] Sleep on 35789

Running distributed tensorflow_mnist.py on Kubernetes, got a "TCP connection request" error

Command :

mpirun -np 2 \ 
        -x LD_LIBRARY_PATH \
        -H xuerq-horovod0:1,xuerq-horovod1:1 python tensorflow_mnist.py

Error log :

--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xuerq-horovod0
  Local PID:           8136
  Peer hostname:       xuerq-horovod1 ([[30832,1],1])
  Source IP of socket: 10.10.10.8
  Known IPs of peer:   
	10.1.102.5
--------------------------------------------------------------------------
[xuerq-horovod0:08130] 1 more process has sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[xuerq-horovod0:08130] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

But the following commands are OK:

mpirun -np 1 \ 
        -x LD_LIBRARY_PATH \
        -H xuerq-horovod0:1 python tensorflow_mnist.py

mpirun -np 1 \ 
        -x LD_LIBRARY_PATH \
        -H xuerq-horovod1:1 python tensorflow_mnist.py

Got similar error when installing Horovod under Ubuntu 16.04 (as I got under Windows 10)

@alsrgv: I got a similar error to the one I got under Windows 10; could you please help me? Thanks a lot in advance. I've installed tensorflow-gpu (version 1.3.0) and CUDA 8/cuDNN 6; it works well with Python 2.7 when I run the script without Horovod, and I have also installed Open MPI 3.0 and NCCL 2.0.

$sudo -H HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod -v
Converted retries value: Retry(total=5, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=5, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Converted retries value: Retry(total=5, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=5, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Collecting horovod
1 location(s) to search for versions of horovod:

error
Cleaning up...
Removing source in /tmp/pip-build-wrtVwH/horovod
Command "/usr/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-wrtVwH/horovod/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-7mUT3A-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-wrtVwH/horovod/
Exception information:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 209, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 335, in run
prefix=options.prefix_path,
File "/usr/lib/python2.7/dist-packages/pip/req/req_set.py", line 732, in install
**kwargs
File "/usr/lib/python2.7/dist-packages/pip/req/req_install.py", line 886, in install
spinner=spinner,
File "/usr/lib/python2.7/dist-packages/pip/utils/init.py", line 736, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-wrtVwH/horovod/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-7mUT3A-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-wrtVwH/horovod/
Converted retries value: Retry(total=0, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=0, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Converted retries value: Retry(total=0, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=0, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)

NCCL Error Symbol not found: _ncclAllReduce

I downloaded NCCL and installed it according to the Horovod GPU installation guide. I'm on macOS, so I was a little surprised to see everything work. I tried the Keras example and things worked like a charm the first time. Then, today, I went to install it on another Mac, and also reinstalled it on my original machine where it had been working, and now I get this error:

(gpu-tensorflow) tylers-iMac:horovod tyler$ python3 test-keras.python 
Using TensorFlow backend.
Traceback (most recent call last):
  File "test-keras.python", line 9, in <module>
    import horovod.tensorflow as hvd
  File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 34, in <module>
    from horovod.tensorflow.mpi_ops import size
  File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 74, in <module>
    ['HorovodAllgather', 'HorovodAllreduce'])
  File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 56, in _load_library
    library = load_library.load_op_library(filename)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 64, in load_op_library
    None, None, error_msg, error_code)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so, 6): Symbol not found: _ncclAllReduce
  Referenced from: /usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so
  Expected in: flat namespace
 in /usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so

I can't get it working again. I'm so confused and can't figure out what has changed.

Comparison with upcoming tensorflow.contrib.mpi_collectives

Just curious whether you have noticed a recent contribution here which added allreduce/allgather (which I think Horovod was derived from?) and a similar optimizer wrapper. I'm planning to benchmark both Horovod and the above. It would be great if you could share some insight in terms of features/flexibility/performance.

Thanks for the great library!

plan to support D-PSGD?

I have tested Horovod in my own TensorFlow code, and it shows a significant improvement!
Diving into the docs and implementation, I see that Horovod uses the ring-allreduce algorithm. Searching the web, I found a recently published method which claims to remove the gradient-averaging latency by communicating weights only with each worker's neighbors, and proves that the local weights still converge, with a convergence rate as fast as synchronous SGD. So I wonder: is this a method worth trying?

here is the publication:

https://arxiv.org/abs/1705.09056
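For intuition only, here is a toy sketch of the update rule described in that paper (assumptions: a ring topology simulated in a single process with NumPy; this is not a Horovod API):

    import numpy as np

    def dpsgd_step(weights, grads, rank, world_size, lr=0.1):
        # Each worker mixes its parameters with its two ring neighbors,
        # then applies its own local gradient -- no global allreduce.
        left, right = (rank - 1) % world_size, (rank + 1) % world_size
        mixed = (weights[left] + weights[rank] + weights[right]) / 3.0
        return mixed - lr * grads[rank]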

Horovod won't install on Mac OSX El Capitan

On Mac OSX El Capitan with Anaconda 3, pip install horovod gives the following error:

fatal error: 'unordered_map' file not found

This could be fixed if I had access to the compilation scripts, but in this case pip handles all of that and I don't know of a workaround. Can this be fixed?

Check whether GPUs are communicating directly

How can I check whether the models on different GPUs exchange gradients directly via nv_peer_mem?

I measured the running time of the Keras example with nv_peer_mem turned on and off. The two cases show no difference in running time, and there are no complaints from the code either.

To run on 4 machines with 1 GPU each using Open MPI

To run on 4 different machines that each have 1 GPU and are connected over a LAN, do I need to run the following command on just one of the systems, or on every computer, changing the order of the server IPs?

mpirun -np 4 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py

Order of broadcast ops' evaluation

./horovod/tensorflow/__init__.py:89

return tf.group(*[tf.assign(var, broadcast(var, root_rank))
                      for var in tf.global_variables()])

@alsrgv I don't think we can guarantee that the broadcast ops are evaluated by TensorFlow in the same order on every process across all platforms. Do you agree?
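If a deterministic creation order were needed, one way to get it (a sketch only, not Horovod's actual implementation) would be to sort the variables by name before building the group:

    import tensorflow as tf
    import horovod.tensorflow as hvd

    def broadcast_global_variables_sorted(root_rank):
        # Sort by variable name so every process builds the ops in the same order.
        ordered_vars = sorted(tf.global_variables(), key=lambda v: v.name)
        return tf.group(*[tf.assign(var, hvd.broadcast(var, root_rank))
                          for var in ordered_vars])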

Running the distributed MNIST example on Kubernetes fails with a "writev error Bad address(1)" error

So I ran this example on Kubernetes; the standalone version is fine, but the distributed version fails.
My run command is below (changing eth0 to flannel.1 gives the same error, and the other interfaces cannot run at all):

mpirun --allow-run-as-root -v --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 3 -x LD_LIBRARY_PATH -H hvd-1:1,hvd-2:1,hvd-3:1 python tensorflow_mnist.py

error msg:

Warning: Permanently added '[hvd-2]:33,[10.87.217.230]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.217.233]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.211]:33' (ECDSA) to the list of known hosts.
0
Extracting MNIST-data-0/train-images-idx3-ubyte.gz
2
Extracting MNIST-data-2/train-images-idx3-ubyte.gz
1
Extracting MNIST-data-1/train-images-idx3-ubyte.gz
Extracting MNIST-data-2/train-labels-idx1-ubyte.gz
Extracting MNIST-data-2/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-2/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz
0
0
0
2017-10-27 04:13:53.870673: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870718: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870727: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870733: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870738: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.875339: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.875404: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.875414: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.875419: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.875425: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.897845: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.897889: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.897898: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.897904: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.897909: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:54.028404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.028873: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.028986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-27 04:13:54.029016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.029024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-10-27 04:13:54.029039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-27 04:13:54.029476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-27 04:13:54.029507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.029515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-10-27 04:13:54.029528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-27 04:13:54.046652: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.047204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.21GiB
Free memory: 11.09GiB
2017-10-27 04:13:54.047236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.047244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y
2017-10-27 04:13:54.047259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
[node233][[9159,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1b0a43da00, 128)
	Bad address(1)

[node233:02924] pml_ob1_sendreq.c:191 FATAL

ifconfig in docker as follow:

cni0      Link encap:Ethernet  HWaddr 0a:58:c0:a8:2a:01
          inet addr:192.168.42.1  Bcast:0.0.0.0  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1450  Metric:1
          RX packets:1662 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3252 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:71609 (71.6 KB)  TX bytes:6438981 (6.4 MB)

docker0   Link encap:Ethernet  HWaddr 02:42:16:50:f4:e9
          inet addr:172.17.0.1  Bcast:0.0.0.0  Mask:255.255.0.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

eth0      Link encap:Ethernet  HWaddr 00:16:3e:00:2d:bb
          inet addr:10.87.217.233  Bcast:10.87.255.255  Mask:255.255.128.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21258550691 errors:0 dropped:0 overruns:0 frame:0
          TX packets:3273335789 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:29633764824437 (29.6 TB)  TX bytes:8441356508419 (8.4 TB)

flannel.1 Link encap:Ethernet  HWaddr 9e:ee:4a:ed:a1:93
          inet addr:192.168.42.0  Bcast:0.0.0.0  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1450  Metric:1
          RX packets:189364 errors:0 dropped:0 overruns:0 frame:0
          TX packets:189765 errors:0 dropped:132 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:11844463 (11.8 MB)  TX bytes:23939132 (23.9 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:7571360 errors:0 dropped:0 overruns:0 frame:0
          TX packets:7571360 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1
          RX bytes:2581387511 (2.5 GB)  TX bytes:2581387511 (2.5 GB)

Is there any way for me to figure out what went wrong? I don't know which address is bad.

Error when pip install horovod

After installing NCCL 2.0 and MPI, I used "HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod" to install Horovod, but it shows the error: horovod/tensorflow/mpi_ops.cc:802:79: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive].

The detailed log shows:

Collecting horovod
Downloading horovod-0.9.11.tar.gz (65kB)
100% |████████████████████████████████| 71kB 271kB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command /usr/anaconda2/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-g7B1qn/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-raRdo5-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops_test.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
running build_ext
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -ltensorflow_framework -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
/usr/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_cuda.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/usr/anaconda2/lib -lcudart -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_nccl.cc -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_nccl.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/usr/anaconda2/lib -lnccl -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.so
building 'horovod.tensorflow.mpi_lib' extension
creating build/temp.linux-x86_64-2.7/horovod
creating build/temp.linux-x86_64-2.7/horovod/tensorflow
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c horovod/tensorflow/mpi_message.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_message.o -std=c++11 -fPIC -O2 -I/usr/anaconda2/include -L/usr/anaconda2/lib -Wl,-rpath -Wl,/usr/anaconda2/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c horovod/tensorflow/mpi_ops.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_ops.o -std=c++11 -fPIC -O2 -I/usr/anaconda2/include -L/usr/anaconda2/lib -Wl,-rpath -Wl,/usr/anaconda2/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
horovod/tensorflow/mpi_ops.cc: In function ‘void horovod::tensorflow::{anonymous}::PerformOperation(horovod::tensorflow::{anonymous}::TensorTable&, horovod::tensorflow::MPIResponse)’:
horovod/tensorflow/mpi_ops.cc:802:79: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
recvcounts, displcmnts, dtype, MPI_COMM_WORLD);
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:633:5: error: initializing argument 1 of ‘int MPI_Allgatherv(void*, int, MPI_Datatype, void*, int*, int*, MPI_Datatype, MPI_Comm)’ [-fpermissive]
int MPI_Allgatherv(void*, int, MPI_Datatype, void*, int*, int*, MPI_Datatype, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1102:45: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_COMM_WORLD))
^
horovod/tensorflow/mpi_ops.cc:534:24: note: in definition of macro ‘MPI_CHECK’
auto mpi_result = (op);
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:639:5: error: initializing argument 1 of ‘int MPI_Allreduce(void*, void*, int, MPI_Datatype, MPI_Op, MPI_Comm)’ [-fpermissive]
int MPI_Allreduce(void*, void*, int, MPI_Datatype, MPI_Op, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc: In function ‘void horovod::tensorflow::{anonymous}::BackgroundThreadLoop(horovod::tensorflow::{anonymous}::HorovodGlobalState&)’:
horovod/tensorflow/mpi_ops.cc:1261:39: error: ‘MPI_COMM_TYPE_SHARED’ was not declared in this scope
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
^
horovod/tensorflow/mpi_ops.cc:1262:34: error: ‘MPI_Comm_split_type’ was not declared in this scope
&local_comm);
^
horovod/tensorflow/mpi_ops.cc:1324:40: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_message.c_str(), (int)encoded_message.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1425:43: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_response.c_str(), (int)encoded_response.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1442:41: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_response.c_str(), (int)encoded_response.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
error: command 'gcc' failed with exit status 1

----------------------------------------

Command "/usr/anaconda2/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-g7B1qn/horovod/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-raRdo5-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-g7B1qn/horovod/`

The system is Ubuntu 14.04 with Python 2.7.6, CUDA 8.0, cuDNN 6, and TensorFlow 1.3.0. Can you help figure out where the problem is?

Error when running examples/keras_mnist.py in Horovod v0.9.10

I ran examples/keras_mnist.py from horovod/examples.

mpirun --allow-run-as-root --mca btl_tcp_if_include eno1 -np 2 -x LD_LIBRARY_PATH -H 192.168.12.50:1,192.168.12.49:1 python /home/dyc/horovod/examples/keras_mnist.py

But I got the warning message below, and the run cannot continue.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: HorovodBroadcast_TFOptimizer_iterations_0 [ready ranks: 0], HorovodBroadcast_iterations_0 [ready ranks: 1]

What's wrong with it? Could you please help me find a way to resolve this warning? Thank you very much in advance.

Quick questions: feeding images to GPUs

Thanks for a great library.

I have 2 quick questions.

In the context of the Keras example, each GPU reads the same set of images in each epoch, possibly in different orders.

So the time it takes to complete an epoch is independent of the number of GPUs. Is this understanding correct?

It may happen that different GPUs read some of the same images in a batch. Is there an easy way to make sure that each GPU reads a different set of images?
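One simple approach (a sketch with an assumed `file_list` variable and a Keras/Horovod setup; not from the original thread) is to give each worker a disjoint shard of the input files:

    import horovod.keras as hvd

    hvd.init()
    # Every worker takes every hvd.size()-th file, offset by its rank, so the
    # shards are disjoint; with tf.data, dataset.shard() achieves the same thing.
    shard = file_list[hvd.rank()::hvd.size()]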

Intel C++ Compiler: Error when installing horovod with pip (Unable to determine CXX11 ABI)

I do pip install horovod.

    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
    
    compilation aborted for build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc (code 2)
    error: Unable to determine CXX11 ABI to use with TensorFlow (see error above).
    
    ----------------------------------------
Command "/beegfs/home/hd/hd_hd/hd_tn445/.venvs/tf_r1/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/hd_tn445_job_5405240/pip-build-o4EglG/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/hd_tn445_job_5405240/pip-qVrJY6-record/install-record.txt --single-version-externally-managed --compile --install-headers /beegfs/home/hd/hd_hd/hd_tn445/.venvs/tf_r1/include/site/python2.7/horovod" failed with error code 1 in /tmp/hd_tn445_job_5405240/pip-build-o4EglG/horovod/

Full log: horovod_install_error_log.txt

OS: Red Hat 4.8.5-16
Tensorflow version: 1.4.0
Openmpi version: 1.10
Compiler: Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.4.258

Unable to get Horovod to run successfully on multiple GPUs

I am currently running the InceptionV3 model on a Tesla K80 GPU. Here is the code snippet that works perfectly fine:

        file_list = build_data(sys.argv[1])
        train_size = math.ceil(0.8 * len(file_list))

        train_file_list = file_list[:train_size]
        test_file_list=file_list[train_size:]

        nbatches_train, mod = divmod(len(train_file_list), BATCH_SIZE)
        nbatches_valid, mod = divmod(len(test_file_list), BATCH_SIZE)

        # Initialise session
        os.environ["CUDA_VISIBLE_DEVICES"] = "0"
        sess = tf.Session()
        K.set_session(sess)
        K.set_learning_phase(1)


        model = create_inception_model(num_classes=4, input_shape=(HEIGHT, WIDTH, CHANNELS),
                                       fully_trainable=True)

        cb = []

        tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0,
                                 write_graph=True, write_images=True)
        cb.append(tbCallBack)

        train_generator = generator_from_data_path(train_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
        test_generator = generator_from_data_path(test_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)

        model.fit_generator(train_generator, epochs=EPOCHS,
                            steps_per_epoch=nbatches_train,callbacks=cb,
                            validation_data=test_generator,
                            validation_steps=nbatches_valid)

        serialize_mode(model, name='inception')

As always, training takes a long time, so I wanted to switch to Horovod to run it on multiple GPUs.
Here are the changes I made to the above code to try to make that happen:

Here is the addition to the model generation code:

opt = keras.optimizers.Adadelta()
opt = hvd.DistributedOptimizer(opt)

model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
return model

And here is the code where we are initializing Horovod to run the model:

        hvd.init()
        config = tf.ConfigProto()
        config.gpu_options.allow_growth = True
        config.gpu_options.visible_device_list = str(hvd.local_rank())
        K.set_session(tf.Session(config=config))


        file_list = build_data(sys.argv[1])
        train_size = math.ceil(0.8 * len(file_list))

        train_file_list = file_list[:train_size]
        test_file_list=file_list[train_size:]

        nbatches_train, mod = divmod(len(train_file_list), BATCH_SIZE)
        nbatches_valid, mod = divmod(len(test_file_list), BATCH_SIZE)

        model = create_inception_model(num_classes=4, input_shape=(HEIGHT, WIDTH, CHANNELS),
                                       fully_trainable=True)

        cb = []

        tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0,
                                 write_graph=True, write_images=True)
        if hvd.rank()  == 0:
            modelCheckPointCallBack = keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5')
            cb.append(modelCheckPointCallBack)

        cb.append(tbCallBack)

        train_generator = generator_from_data_path(train_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
        test_generator = generator_from_data_path(test_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)

        model.fit_generator(train_generator, epochs=EPOCHS,
                            steps_per_epoch=nbatches_train,callbacks=cb,
                            validation_data=test_generator,
                            validation_steps=nbatches_valid)

        serialize_mode(model, name='inception')

And this is what happens when I run Horovod with 4 GPUs:


--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
  python

Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------

But when I run watch nvidia-smi, I can see 4 GPUs:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                  Off |
| N/A   31C    P8    28W / 149W |     16MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                  Off |
| N/A   31C    P8    28W / 149W |      1MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:06.0 Off |                  Off |
| N/A   33C    P8    29W / 149W |      1MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:07.0 Off |                  Off |
| N/A   30C    P8    27W / 149W |      1MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1956      G   /usr/lib/xorg/Xorg                            15MiB |
+-----------------------------------------------------------------------------+

When I run the same program with 1 GPU, my nvidia-smi output looks like this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                  Off |
| N/A   42C    P0   140W / 149W |  11690MiB / 12205MiB |     82%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:05.0 Off |                  Off |
| N/A   36C    P0    56W / 149W |  11598MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:06.0 Off |                  Off |
| N/A   38C    P0    69W / 149W |  11598MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:07.0 Off |                  Off |
| N/A   35C    P0    55W / 149W |  11598MiB / 12205MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1956      G   /usr/lib/xorg/Xorg                            15MiB |
|    0      3447      C   python                                     11661MiB |
|    1      3447      C   python                                     11585MiB |
|    2      3447      C   python                                     11585MiB |
|    3      3447      C   python                                     11585MiB |
+-----------------------------------------------------------------------------+

A single python process (PID 3447) shows up on all 4 GPUs, but the training does not run in a distributed fashion.

How do I get this to run on a single machine with multiple GPUs?
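A hedged note on the two symptoms above. The "not enough slots" message comes from Open MPI's slot accounting rather than from the GPUs: the job has to be launched with mpirun -np 4 against a hostfile that declares at least 4 slots (for example a line such as localhost slots=4), or, with newer Horovod releases, via horovodrun -np 4 python train.py (train.py standing in for the script above). Separately, the memory appearing on all 4 GPUs in the single-process run is TensorFlow's default behaviour of grabbing every visible GPU; pinning each rank to hvd.local_rank(), as the snippet above already does, avoids that once multiple ranks are actually launched. The horovod.keras examples also add a broadcast callback so every worker starts from the same weights; below is a minimal sketch of that pattern, assuming horovod.keras is imported as hvd (the fit call is left commented out since the rest of the script is not shown):

import keras
import horovod.keras as hvd

# Initialize Horovod and wrap the optimizer, as in the snippet above.
hvd.init()
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta())

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start in sync.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
if hvd.rank() == 0:
    # Only one worker writes checkpoints, to avoid clobbering files.
    callbacks.append(keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5'))

# model.fit_generator(train_generator,
#                     steps_per_epoch=nbatches_train // hvd.size(),
#                     callbacks=callbacks, ...)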

motivation for this project

Hi, I'm curious about the motivation for this project, since TensorFlow itself provides support for distributed training. Is it because open-source Distributed TensorFlow is slow? If that is the case, what are the bottlenecks (e.g. gRPC?) and why? Are there benchmarks we can compare? Thanks.

confirming installation and setup when running examples in the code

Thanks for this interesting project; I appreciate the effort you have put in here. I am trying out a few things but have some doubts about whether I am doing things right.

I have 2 machines, each with 2 P100s and 16 GB of memory per GPU.
I am starting with examples/ and executing the command ~/.openmpi/bin/mpirun -np 4 -v -hostfile ~/hostfile -x NCCL_DEBUG=INFO ~/envs/horovod/bin/python tensorflow_word2vec.py

I can confirm that this kicks off the example across both machines in the hostfile and across all 4 GPUs in total.

But I have some doubts about whether I am missing something, as I am seeing higher execution times when running (a) a single GPU with Horovod compared to vanilla single-process TF, and (b) distributed Horovod compared to single-GPU Horovod execution.

So just wanted to confirm a few things:

  • whether the GPUs are being sufficiently utilized

(nvidia-smi seems to indicate that each of the GPUs is running the code and 7% of the memory is being used.)
From top:
%Cpu(s): 3.1 us, 0.8 sy, 0.0 ni, 96.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

 PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 20   0 29.199g 1.487g 421128 S 116.9  2.4   5:32.41 python
 20   0 29.199g 1.495g 421172 S 114.6  2.4   5:24.61 python
  • whether the GPUs are efficiently communicating with each other

As pointed out in the other ticket, I should see something in the logs about this, but I don't see anything. I can see the processes started across machines, but I am not sure if all the communication is good. The training runs to completion, and if I specify an incorrect host I do see errors, so I have a feeling that all the instances are communicating, but I am not sure (a) whether they are communicating properly and (b) whether they are communicating via libnccl2 (see the sketch after this list).

  • whether anything special needs to be done on the Open MPI side
    I basically downloaded Open MPI and did not do anything special to build it optimally for, say, GPUs or libnccl. Do I need to do anything additional to optimize for or leverage the GPUs?

  • training times increase as we add more GPUs

vanilla TensorFlow for the above example: ~3 sec
Horovod, single GPU: ~4 sec
Horovod with 2 GPUs on the same machine: ~9 sec
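On the libnccl2 question above, two hedged checks. At runtime, launching with -x NCCL_DEBUG=INFO, as in the command above, should print NCCL INFO lines (ring/topology setup) in the logs if NCCL is actually being used for the allreduce. At install time, recent Horovod releases can also report how they were built; a minimal sketch, assuming horovod.tensorflow:

import horovod.tensorflow as hvd

hvd.init()
if hvd.rank() == 0:
    # True only if the horovod package was compiled with NCCL / MPI support
    # (e.g. the NCCL GPU options described in the installation docs).
    print("NCCL built into Horovod:", hvd.nccl_built())
    print("MPI built into Horovod:", hvd.mpi_built())

As for the timings, the word2vec example is small enough that per-step communication overhead can plausibly dominate, so slower multi-GPU times on it do not necessarily indicate a broken setup.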

Benchmark description

The benchmarks in the README are not clear to me. Could you possibly clarify?

The single-GPU version has a benchmark of 133 for Inception.
The multi-GPU versions have Distributed TensorFlow at 10.4.
The Horovod results range from 14.4 to 15.6.

I presume, given the single-GPU result, that higher is worse. But you have bolded the highest result for Horovod, suggesting it is the best. These results would then suggest Distributed TensorFlow is far faster.

So, what are these numbers, and is higher better?

Syncing metrics between processes.

In the context of a job that reports a loss, each process reports the loss of the batch it has just processed. Does Horovod provide, or can you suggest, a good way of aggregating/syncing metrics or loss values across processes so they can be reported somewhere?
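One hedged approach: hvd.allreduce averages a tensor across all ranks by default, so a per-batch metric such as the loss can be wrapped in an allreduce and the synced value reported from rank 0 only. A minimal sketch using horovod.tensorflow, with a constant standing in for the real loss tensor:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Stand-in for the per-rank batch loss; in a real job this would be the
# model's loss tensor evaluated on that rank's batch.
local_loss = tf.constant(1.0 + hvd.rank())

# allreduce averages across ranks by default, so every rank gets the same value.
synced_loss = hvd.allreduce(local_loss)

with tf.Session() as sess:
    value = sess.run(synced_loss)
    if hvd.rank() == 0:
        print("loss averaged across ranks:", value)

For Keras users, newer Horovod releases also ship hvd.callbacks.MetricAverageCallback, which applies the same averaging to the metrics Keras reports at the end of each epoch.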

Parallel mini-batch scalability issue

I am performing batch gradient descent, where the gradients are averaged across all training examples to move forward in a single training epoch. This can easily be parallelized by splitting the original batch into "mini-batches", and each process calculates gradients associated with its mini-batch. The resulting mini-batch gradients can be averaged on the master process to then move forward with training. This implementation should be "embarrassingly parallel" since the gradients of N examples are calculated over P processes in N/P mini-batches. A good animation is from TensorFlow's synchronous training examples:

[animation: synchronous data-parallel training, from the TensorFlow documentation]

I want to perform this type of training on a CPU cluster and I'm very familiar with MPI, so I think Horovod is a better choice than Distributed TensorFlow. I am having a hard time setting up this task, however. My recent attempt involves optimizing a single layer neural network to reproduce the parabola y = x^2 (just for simplicity):

import tensorflow as tf
import numpy as np
import horovod.tensorflow as hvd
import time

# Helper: add a fully connected layer with optional activation
def add_layer(inputs, in_size, out_size, activation_function=None):
    w = tf.Variable(tf.random_normal([in_size, out_size]))
    b = tf.Variable(tf.random_normal([1, out_size]))
    Wx_plus_b = tf.matmul(inputs, w) + b
    if activation_function is None:
        outputs = Wx_plus_b
    else:
        outputs = activation_function(Wx_plus_b)
    return outputs

# Initialize Horovod
hvd.init()
rank = hvd.rank()

# Make up some data
N = 20000
x_data = np.linspace(-1, 1, N)[:, np.newaxis]
y_data = np.square(x_data)

# Make mini-batches, each with N/procs elements
procs = hvd.size()
x_split = np.split(x_data, procs)
y_split = np.split(y_data, procs)
x_batch = x_split[rank]
y_batch = y_split[rank]

# Define placeholder for x_data and y_data
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
# Add hidden layer
l1 = add_layer(xs, 1, 3, activation_function=tf.tanh)
# Add output layer
prediction = add_layer(l1, 3, 1, activation_function=None)
# Define loss
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction), reduction_indices=[1]))
# Define optimizer
opt = tf.train.GradientDescentOptimizer(0.01)

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss)
# Ask the optimizer to apply the gradients
train_opt = opt.apply_gradients(grads_and_vars)

# Monitored session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
sess = tf.train.MonitoredTrainingSession(hooks=hooks)

# Initialize timer
if rank == 0:
    start = time.time()

# Perform the training
for i in range(10000):
    # Training step
    sess.run(train_opt, feed_dict={xs: x_batch, ys: y_batch})
    if i % 100 == 0:
        print("Rank: %d" % (rank), "Epoch: %d" % (i), "Loss: %.5f" % (sess.run(loss, feed_dict={xs: x_data, ys: y_data})) )

# Output the final time
if rank == 0:
    end = time.time()
    print("Elapsed time: %.5f sec" % (end - start) )

If this is saved in a file called test.py, you can run with:
mpirun -np P python test.py
where P is the number of processes to use. This parallelizes the gradient calculation from N training examples to N/P examples, and then the gradients are averaged using hvd.allreduce in the distributed optimizer.
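As a side note on that last point, here is a minimal standalone sketch (independent of the script above) of roughly what DistributedOptimizer does under the hood: every gradient is averaged across ranks with hvd.allreduce before being applied, so all replicas stay in sync and no dedicated "master" process collects gradients. The tiny loss below is only for illustration:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

x = tf.Variable(1.0)
loss = tf.square(x - float(hvd.rank()))   # each rank sees a different loss
opt = tf.train.GradientDescentOptimizer(0.01)

# Average every gradient across ranks before applying it.
grads_and_vars = opt.compute_gradients(loss)
averaged = [(hvd.allreduce(g), v) for g, v in grads_and_vars if g is not None]
train_op = opt.apply_gradients(averaged)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)
    print("rank", hvd.rank(), "x after one averaged step:", sess.run(x))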

There are 2 issues that I noticed:
(1) Even though the mini-batches are being parallelized properly, the training actually gets slower as the number of processes increases. Is this simply due to overhead, since this is a rather simple example?
(2) There are P identical neural networks being trained simultaneously. I would like to have the separate processes calculate the gradients of their mini-batches and send them back to the master process to proceed with training. Is there a way to set this up using Horovod?

I was thinking maybe (2) was causing (1): since many identical neural nets are being trained at once, perhaps this is somehow slower?

Thanks for your time.

EDIT: Cleaned up code for more clarity.

The call to cuIpcCloseMemHandle failed. This is a warning and the program will continue to run.

root@ecb2de971add:/gang/distributed/allreduce/mnist# mpirun -np 2 --allow-run-as-root   python3 keras_uber.py
2017-11-08 08:24:44.036889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:24:44.036943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2017-11-08 08:24:44.037064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:24:44.037098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: TITAN Xp, pci bus id: 0000:06:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/6
Train on 60000 samples, validate on 10000 samples
Epoch 1/6
60000/60000 [==============================] - 6s - loss: 0.2183 - acc: 0.9329 - val_loss: 0.0478 - val_acc: 0.9846======>....] - ETA: 0s - loss: 0.2351 - acc: 0.9277728
Epoch 2/6
60000/60000 [==============================] - 6s - loss: 0.2194 - acc: 0.9332 - val_loss: 0.0478 - val_acc: 0.9846
Epoch 2/6
60000/60000 [==============================] - 5s - loss: 0.0694 - acc: 0.9796 - val_loss: 0.0365 - val_acc: 0.9877=========>.] - ETA: 0s - loss: 0.0695 - acc: 0.97951
Epoch 3/6
60000/60000 [==============================] - 5s - loss: 0.0685 - acc: 0.9793 - val_loss: 0.0365 - val_acc: 0.9877
Epoch 3/6
60000/60000 [==============================] - 5s - loss: 0.0499 - acc: 0.9849 - val_loss: 0.0302 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0499 - acc: 0.9849
60000/60000 [==============================] - 5s - loss: 0.0501 - acc: 0.9849 - val_loss: 0.0302 - val_acc: 0.9904
Epoch 4/6
Epoch 4/6
60000/60000 [==============================] - 5s - loss: 0.0395 - acc: 0.9881 - val_loss: 0.0289 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0410 - acc: 0.9872
Epoch 5/6
60000/60000 [==============================] - 5s - loss: 0.0409 - acc: 0.9872 - val_loss: 0.0289 - val_acc: 0.9904
Epoch 5/6
60000/60000 [==============================] - 5s - loss: 0.0330 - acc: 0.9898 - val_loss: 0.0294 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0330 - acc: 0.9898
Epoch 6/6
60000/60000 [==============================] - 5s - loss: 0.0330 - acc: 0.9900 - val_loss: 0.0294 - val_acc: 0.9904
Epoch 6/6
60000/60000 [==============================] - 5s - loss: 0.0280 - acc: 0.9910 - val_loss: 0.0270 - val_acc: 0.9913=========>.] - ETA: 0s - loss: 0.0281 - acc: 0.9910
60000/60000 [==============================] - 5s - loss: 0.0299 - acc: 0.9910 - val_loss: 0.0270 - val_acc: 0.9913
Test loss: 0.0269502004177
Test accuracy: 0.9913
Using TensorFlow backend.
Test loss: 0.0269502004177
Test accuracy: 0.9913
Using TensorFlow backend.
--------------------------------------------------------------------------
The call to cuIpcCloseMemHandle failed. This is a warning and the program
will continue to run.
  cuIpcCloseMemHandle return value:   4
  address: 0x1090fc00000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[ecb2de971add:17340] Sleep on 17340
[ecb2de971add:17334] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
[ecb2de971add:17334] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ecb2de971add:17340] Sleep on 17340

WARNING when running more than one process.

I've added Horovod to my project and found strange warnings like the one below.

WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.

It happens in 2 situations:

  1. 3 or more processes on 1 machine, with a simple network and simple data loading.
  2. 2 processes on 2 machines (1 on each), with a complex network and complex data loading; it takes about 10 to 20 seconds before the first batch is ready. I do some buffered preloading for data loading, so with a complex dataset it can take some time for the first batch to be ready. But the warning stays the same even if I make the buffer smaller so that the first batch only takes 1 to 2 seconds.

Is there anything I am missing?

I am adding Horovod to an old and large project, so it might take some time to make a demo that reproduces the problem, but I'll be glad to do that if necessary.
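A hedged note on what usually triggers this warning: it fires when the ranks do not all reach the same Horovod collective within the stall-check window (60 seconds, per the message above), for example because they run different numbers of steps per epoch, build slightly different graphs, or one rank sits in data loading much longer than the others. One way to rule out mismatched step counts is to agree on a single count across ranks, for example by broadcasting rank 0's value. A minimal sketch using horovod.tensorflow, where compute_local_steps is a hypothetical placeholder for however each rank derives its own step count:

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

def compute_local_steps():
    # Hypothetical placeholder: in a real job this would come from the size
    # of the rank-local data shard divided by the batch size.
    return 100 + hvd.rank()

# Broadcast rank 0's step count so every rank runs the same number of training
# steps and therefore submits the same number of allreduce operations.
steps_t = tf.constant(compute_local_steps(), dtype=tf.int32)
steps = hvd.broadcast(steps_t, root_rank=0)

with tf.Session() as sess:
    n_steps = sess.run(steps)
    # for _ in range(n_steps):
    #     sess.run(train_op)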
