horovod / horovod
Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
Home Page: http://horovod.ai
License: Other
I ran the example keras_mnist.py from horovod/examples:
mpirun --allow-run-as-root --mca btl_tcp_if_include eno1 -np 2 -x LD_LIBRARY_PATH -H 192.168.12.50:1,192.168.12.49:1 python /home/dyc/horovod/examples/keras_mnist.py
But I got the warning message below, and the run cannot continue.
WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock. Stalled ops: HorovodBroadcast_TFOptimizer_iterations_0 [ready ranks: 0], HorovodBroadcast_iterations_0 [ready ranks: 1]
What's wrong with it? Could you please help me find a way to resolve this warning? Thank you very much in advance.
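A note for anyone debugging this: the warning itself shows the two ranks submitting differently named broadcast tensors (HorovodBroadcast_TFOptimizer_iterations_0 on rank 0 vs. HorovodBroadcast_iterations_0 on rank 1), which can happen when the two hosts run different Keras/TensorFlow versions. A minimal check, assuming nothing beyond the mpirun setup already shown, is to print the versions on every host:
import socket
import tensorflow as tf
import keras

# Mismatched framework versions across hosts can lead ranks to name the
# broadcast tensors differently and stall as described in the warning.
print("%s: tensorflow=%s keras=%s"
      % (socket.gethostname(), tf.__version__, keras.__version__))
Launch it with the same mpirun invocation as above and compare the lines from the two hosts.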
Hi,
a question on the first 16-GPU benchmark shown in the README: it says RoCE 25Gb and "RDMA Horovod (allreduce on GPU with NCCL)". However, NCCL2 so far doesn't support RDMA over RoCE, so is 'RDMA Horovod' a typo?
RDMA Horovod (allreduce on GPU with NCCL) | 2,022.6 (15.0x) | 1,746.2 (14.6x) | 1,787.4 (13.7x)
By contrast, the second benchmark, on 64 GPUs, says 'TCP Horovod (allreduce on GPU with NCCL)'.
I have installed NCCL2 and Open MPI but am unable to install Horovod on a GPU machine. As specified in the documentation, I am using the following command:
HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda8.0_amd64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
This results in the following output:
#HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda8.0_amd64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
Collecting horovod
Downloading horovod-0.9.10.tar.gz (64kB)
100% |████████████████████████████████| 71kB 510kB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command /opt/anaconda2/envs/tensorflow/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-61FZkn/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-ItuZD1-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops_test.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
running build_ext
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -ltensorflow_framework -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o -L/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/core -L/opt/anaconda2/envs/tensorflow/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_cuda.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/opt/anaconda2/envs/tensorflow/lib -lcudart -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/usr/local/cuda/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_nccl.cc -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/opt/anaconda2/envs/tensorflow/lib -Wl,-rpath=/opt/anaconda2/envs/tensorflow/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_nccl.o -L/usr/local/nccl_2.0.5-3+cuda8.0_amd64/lib -L/usr/local/nccl_2.0.5-3+cuda8.0_amd64/lib64 -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/opt/anaconda2/envs/tensorflow/lib -lnccl -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.so
building 'horovod.tensorflow.mpi_lib' extension
creating build/temp.linux-x86_64-2.7/horovod
creating build/temp.linux-x86_64-2.7/horovod/tensorflow
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c horovod/tensorflow/mpi_message.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_message.o -std=c++11 -fPIC -O2 -I/usr/local/openmpi/include -pthread -Wl,-rpath -Wl,/usr/local/openmpi/lib -Wl,--enable-new-dtags -L/usr/local/openmpi/lib -lmpi_cxx -lmpi -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include -I/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/local/nccl_2.0.5-3+cuda8.0_amd64/include -I/opt/anaconda2/envs/tensorflow/include/python2.7 -c horovod/tensorflow/mpi_ops.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_ops.o -std=c++11 -fPIC -O2 -I/usr/local/openmpi/include -pthread -Wl,-rpath -Wl,/usr/local/openmpi/lib -Wl,--enable-new-dtags -L/usr/local/openmpi/lib -lmpi_cxx -lmpi -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
horovod/tensorflow/mpi_ops.cc: In function ‘int horovod::tensorflow::{anonymous}::GetDeviceID(tensorflow::OpKernelContext*)’:
horovod/tensorflow/mpi_ops.cc:1723:63: error: ‘const struct tensorflow::DeviceBase::GpuDeviceInfo’ has no member named ‘gpu_id’
device = context->device()->tensorflow_gpu_device_info()->gpu_id;
^
In file included from /opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:22:0,
from horovod/tensorflow/mpi_ops.cc:22:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/allocator.h: In member function ‘virtual std::size_t tensorflow::Allocator::RequestedSize(void*)’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/allocator.h:155:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
In file included from /opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/op_kernel.h:25:0,
from horovod/tensorflow/mpi_ops.cc:22:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h: In member function ‘virtual tensorflow::Allocator* tensorflow::DeviceBase::GetAllocator(tensorflow::AllocatorAttributes)’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h:152:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h: In member function ‘virtual const tensorflow::DeviceAttributes& tensorflow::DeviceBase::attributes() const’:
/opt/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/include/tensorflow/core/framework/device_base.h:183:3: warning: control reaches end of non-void function [-Wreturn-type]
}
^
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/opt/anaconda2/envs/tensorflow/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-61FZkn/horovod/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-ItuZD1-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-61FZkn/horovod/
How do I address these errors? Do I also need to make any changes to handle the warnings which state "valid for C/ObjC but not for C++"?
I take it the benchmarks in the README are based on weak scaling (i.e. the batch size per process stays the same regardless of the number of processes)? If so, you'd be solving a different problem than if you divided the batch size by the number of processes (strong scaling), which would also give you worse results.
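To make the distinction concrete, a small sketch of the batch-size arithmetic (the numbers are illustrative, not taken from the README):
# Weak vs. strong scaling, for a per-process batch size B and P processes.
def effective_batches(per_process_batch, num_processes):
    # Weak scaling: each process keeps its batch, so the global batch grows.
    weak_global = per_process_batch * num_processes
    # Strong scaling: the global batch is fixed, so each process gets a slice.
    strong_per_process = per_process_batch / float(num_processes)
    return weak_global, strong_per_process

print(effective_batches(64, 4))  # -> (256, 16.0)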
Hi,
We're testing Horovod on CentOS 7.3 with NCCL2, CUDA 8, and cuDNN 6.0, but it keeps crashing. We tested Open MPI 2.1.2 and 3.0.0, and MVAPICH 2.2, but all crash with the same error message, shown below.
NCCL2 has been confirmed working through nccl-tests.
Could you comment on the issue? Other details are shown below.
Best regards,
Byoungseon
Install command: HOROVOD_NCCL_HOME=$NCCL_HOME HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip3 install --no-cache-dir horovod
Run command: mpirun -n 2 python3 ./keras_mnist.py
Error message shown below:
2017-10-30 15:46:26.208991: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: Tesla P100-PCIE-16GB, pci bus id: 0000:13:00.0, compute capability: 6.0)
[ava02:21329:1] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000068d1c mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
3 0x000000000006926c mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
4 0x0000000000035250 killpg() ??:0
5 0x000000000014b9cc __memcpy_ssse3_back() :0
6 0x000000000038ad0e mv2_smp_fast_write_contig() ??:0
7 0x000000000036ac7f MPIDI_CH3_EagerContigShortSend() ??:0
8 0x000000000037518f MPID_Send() ??:0
9 0x000000000030dbaa MPIC_Send() ??:0
10 0x000000000007c372 MPIR_Bcast_binomial.isra.1() bcast.c:0
11 0x000000000007cd15 MPIR_Bcast_intra() ??:0
12 0x00000000000e738a MPIR_Bcast_index_tuned_intra_MV2() ??:0
13 0x00000000000e4eaf MPIR_Bcast_MV2() ??:0
14 0x000000000007cfe8 MPIR_Bcast_intra() ??:0
15 0x00000000000e738a MPIR_Bcast_index_tuned_intra_MV2() ??:0
16 0x00000000000e4eaf MPIR_Bcast_MV2() ??:0
17 0x000000000007d89b MPIR_Bcast_impl() ??:0
18 0x000000000007dea9 MPI_Bcast() ??:0
19 0x000000000001f1be _ZN7horovod10tensorflow12_GLOBAL__N_116PerformOperationERSt13unordered_mapISsNS1_16TensorTableEntryESt4hashISsESt8equal_toISsESaISt4pairIKSsS3_EEENS0_11MPIResponseE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1131
20 0x0000000000021391 _ZN7horovod10tensorflow12_GLOBAL__N_120BackgroundThreadLoopERNS1_18HorovodGlobalStateE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1431
21 0x0000000000021391 ~vector() /usr/include/c++/4.8.2/bits/stl_vector.h:416
22 0x0000000000021391 ~MPIResponse() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_message.h:93
23 0x0000000000021391 _ZN7horovod10tensorflow12_GLOBAL__N_120BackgroundThreadLoopERNS1_18HorovodGlobalStateE() /tmp/pip-build-7bvx75z5/horovod/horovod/tensorflow/mpi_ops.cc:1431
24 0x00000000000b5230 _ZNSt11this_thread11__sleep_forENSt6chrono8durationIlSt5ratioILl1ELl1EEEENS1_IlS2_ILl1ELl1000000000EEEE() ??:0
25 0x0000000000007dc5 start_thread() pthread_create.c:0
26 0x00000000000f776d __clone() ??:0
In a job that reports a loss, each process reports the loss of the batch it has just processed. Does Horovod provide, or can you suggest, a good way of aggregating/syncing metrics or loss values across processes so they can be reported somewhere?
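One approach, sketched here as an assumption rather than an official recipe: allreduce the scalar metric (hvd.allreduce averages across ranks by default, as a commenter discovers further down this page) and report it from a single rank. The values are illustrative:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
# Stand-in for this rank's batch loss (illustrative, not real training):
local_loss = tf.constant(0.1 * (hvd.rank() + 1))
# hvd.allreduce averages by default, so every process sees the mean:
mean_loss = hvd.allreduce(local_loss)
with tf.Session() as sess:
    if hvd.rank() == 0:  # report from one rank to avoid duplicate logs
        print("mean loss across ranks: %f" % sess.run(mean_loss))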
Guys, thanks for this interesting project. I appreciate the effort you have put in here. I am trying out a few things but have a few doubts about whether I am doing things right.
I have 2 machines, each with 2 P100s and 16 GB of memory per GPU.
I am starting with examples/, executing the command ~/.openmpi/bin/mpirun -np 4 -v -hostfile ~/hostfile -x NCCL_DEBUG=INFO ~/envs/horovod/bin/python tensorflow_word2vec.py
I can confirm that this kicks off the example across both machines in the host file and across the total of 4 GPUs.
But I have a few doubts about whether I am missing something, as I am seeing longer execution times when running a single GPU with Horovod compared to vanilla single-GPU TF, and for distributed Horovod vs. single-GPU Horovod execution.
So just wanted to confirm a few things:
(nvidia-smi seems to show that each of the GPUs is running the code and 7% of the memory is being used.)
From top:
%Cpu(s): 3.1 us, 0.8 sy, 0.0 ni, 96.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20 0 29.199g 1.487g 421128 S 116.9 2.4 5:32.41 python
20 0 29.199g 1.495g 421172 S 114.6 2.4 5:24.61 python
(As pointed out in the other ticket, I should see something in the logs about it, but I don't see anything. I can see the processes started across machines, but I am not sure if all the communication is good. The training runs to completion, and if I specify an incorrect host I do see errors, so I have a feeling that all the instances are communicating, but I am not sure whether (a) they are communicating properly and (b) they are communicating via libnccl2.)
Whether anything special needs to be done on the Open MPI side: I basically downloaded Open MPI and did not do anything special to build it optimally for, say, GPUs or libnccl. Do I need to do anything additional to optimize for or leverage the GPUs?
Training times increase as we add more GPUs:
vanilla tensor flow for above example ~ 3 sec
horovod single GPU ~ 4 sec
horovod with 2 GPU on same machine ~9 sec
Some models, like RFCN, can return variables without gradients:
File "train.py", line 249, in <module>
tf.app.run()
File "/home/csq/.cache/bazel/_bazel_csq/0a3580d8ecd2fafe9cc9970e974f5dc4/execroot/source/bazel-out/release_links/lib/python_env/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train.py", line 229, in main
worker_job_name, is_chief, FLAGS.train_dir, FLAGS.trace_every_n_steps)
File "/home/csq/src/models/research/object_detection/trainer.py", line 271, in train
for (gradient, var) in gradients]
File "/home/csq/.cache/bazel/_bazel_csq/0a3580d8ecd2fafe9cc9970e974f5dc4/execroot/source/bazel-out/release_links/lib/python_env/horovod/tensorflow/__init__.py", line 76, in allreduce
horovod_size = tf.cast(size(), tensor.dtype)
AttributeError: 'NoneType' object has no attribute 'dtype'
I just worked around it by replacing compute_gradients() with:
gradients = (super(hvd.DistributedOptimizer, training_optimizer)
.compute_gradients(total_loss))
gradients = [(g, v) for g, v in gradients if g is not None]
if hvd.size() > 1:
with tf.name_scope(training_optimizer._name + "_Allreduce"):
grads_and_vars = [(hvd.allreduce(gradient, device_dense=training_optimizer._device_dense,
device_sparse=training_optimizer._device_sparse), var)
for (gradient, var) in gradients]
but I think adding
gradients = [(g, v) for g, v in gradients if g is not None]
after this line: https://github.com/uber/horovod/blob/64317537436ae4225a004043acaa37a205a18dc3/horovod/tensorflow/__init__.py#L166
could fix it for others.
See title.
Add this to the top of, for example, the Keras example script, and it will allocate a ton of resources and go OOM when using multiple devices.
from tensorflow.python.client import device_lib
LOCAL_DEVICES = device_lib.list_local_devices()
Reproduce with:
mpirun -np 4 python keras_mnist.py
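For context, a minimal sketch of the per-process GPU pinning used in the Horovod examples; it keeps each training session on a single device, though list_local_devices() itself may still touch every GPU:
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
config = tf.ConfigProto()
# Pin this process to one GPU so its session doesn't grab every device:
config.gpu_options.visible_device_list = str(hvd.local_rank())
sess = tf.Session(config=config)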
I have tested Horovod in my own TensorFlow code, and it shows a significant improvement!
Diving into the docs and implementation, Horovod uses the ring-allreduce algorithm. Searching the web, I found a recently published method which claims to remove the gradient-averaging latency by communicating weights only with each neighbor; it proves that the local weights still converge, and that the convergence speed is as fast as synchronous SGD. So I wonder, is this a method worth trying?
Here is the publication:
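For concreteness, a toy sketch of the neighbor-averaging idea as I read that description, written with mpi4py rather than Horovod's API (which has no such primitive); everything here is illustrative:
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

w = np.random.rand(4)         # this worker's local weights
left = (rank - 1) % size      # ring neighbors
right = (rank + 1) % size
# Exchange weights with both neighbors (combined blocking send/recv):
w_left = comm.sendrecv(w, dest=right, source=left)
w_right = comm.sendrecv(w, dest=left, source=right)
w = (w + w_left + w_right) / 3.0  # local mixing step, no global allreduce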
I downloaded NCCL and installed it according to the Horovod GPU installation guide. I'm on macOS, so I was a little surprised to see everything work. I tried the Keras example and things worked like a charm, first time. Then today I went to install it on another Mac, and now on my original machine, where it was working, I get this error:
(gpu-tensorflow) tylers-iMac:horovod tyler$ python3 test-keras.python
Using TensorFlow backend.
Traceback (most recent call last):
File "test-keras.python", line 9, in <module>
import horovod.tensorflow as hvd
File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/__init__.py", line 34, in <module>
from horovod.tensorflow.mpi_ops import size
File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 74, in <module>
['HorovodAllgather', 'HorovodAllreduce'])
File "/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_ops.py", line 56, in _load_library
library = load_library.load_op_library(filename)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 64, in load_op_library
None, None, error_msg, error_code)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(/usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so, 6): Symbol not found: _ncclAllReduce
Referenced from: /usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so
Expected in: flat namespace
in /usr/local/lib/python3.6/site-packages/horovod/tensorflow/mpi_lib.cpython-36m-darwin.so
I can't get it working again. I'm so confused and can't figure out what has changed.
Hi, @alsrgv I run tensorflow_mnist.py on two machines with
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:1,172.23.233.75:1 python tensorflow_mnist.py
After showing some logs, it hangs and cannot proceed. The processes on both machines show GPU memory usage but print no further logs. The logs before the hang are as follows:
2017-11-05 23:17:20.677867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate (GHz) 1.112 pciBusID 0000:83:00.0 Total memory: 22.40GiB Free memory: 22.29GiB
2017-11-05 23:17:20.677908: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-05 23:17:20.677915: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-05 23:17:20.677927: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:83:00.0)
2017-11-05 23:18:24.170460: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: name: Tesla M40 24GB major: 5 minor: 2 memoryClockRate (GHz) 1.112 pciBusID 0000:83:00.0 Total memory: 22.40GiB Free memory: 22.29GiB
2017-11-05 23:18:24.170521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-11-05 23:18:24.170528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-11-05 23:18:24.170539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40 24GB, pci bus id: 0000:83:00.0)
INFO:tensorflow:Restoring parameters from ./checkpoints/model.ckpt-100
[msragpum13:53619] 1 more process has sent help message help-mpi-btl-openib.txt / default subnet prefix
[msragpum13:53619] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[msragpum13:53619] 1 more process has sent help message help-mpi-btl-openib-cpc-base.txt / no cpcs for port
However, if I just run on a single machine, it has no problems:
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.75:2 python tensorflow_mnist.py
or
mpirun -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:2 python tensorflow_mnist.py
both can work properly.
My system is Ubuntu 14.04, TensorFlow is 1.3.0, Python is 2.7, and Horovod was installed following the install guide, using NCCL 2.0 and Open MPI 3.0. So what's the problem?
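Worth noting: the openib help messages in the log suggest Open MPI is probing the InfiniBand/RoCE transport between the hosts. The first issue on this page pins the TCP interface for a similar two-host setup; the interface name below is a placeholder for whichever NIC actually connects the machines:
mpirun --mca btl_tcp_if_include <interface> -np 2 -x PATH -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES -H 172.23.233.77:1,172.23.233.75:1 python tensorflow_mnist.py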
NCCL 2 will even fail in local mode if there's a RoCE-capable NIC on the target system. See http://docs.nvidia.com/deeplearning/sdk/nccl-release-notes/rel_2.0.2.html
Just curious whether you guys noticed a recent contribution from here, which added allreduce/allgather (which I think Horovod derived from?) and a similar optimizer wrapper. I'm planning to benchmark both Horovod and the above. It would be great if you could share some insight in terms of features/flexibility/performance.
Thanks for the great library!
Hi,
Does distributed training with Horovod have to use mpirun to distribute the training processes?
I mean, could I run each process on its local machine manually, and then Horovod would find the cluster or peer nodes to do the distributed computation?
Such as,
In machine 1, I run "python train.py"
In machine 2, I also run "python train.py"
Is this OK?
@alsrgv: I got a similar error to the one I got under Windows 10; could you please help me? Thanks a lot in advance. I've installed tensorflow-gpu (version 1.3.0) and CUDA 8/cuDNN 6; it works well with Python 2.7 when I run the script without Horovod, and I also installed Open MPI 3.0/NCCL 2.0.
$sudo -H HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod -v
Converted retries value: Retry(total=5, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=5, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Converted retries value: Retry(total=5, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=5, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Collecting horovod
1 location(s) to search for versions of horovod:
https://pypi.python.org/simple/horovod/
Getting page https://pypi.python.org/simple/horovod/
Starting new HTTPS connection (1): pypi.python.org
"GET /simple/horovod/ HTTP/1.1" 200 915
Analyzing links from page https://pypi.python.org/simple/horovod/
Found link https://pypi.python.org/packages/07/bb/f2edeee902a6c5e345f5cc6d8ba3f5b6805506489da4bb6cf35665cccd95/horovod-0.9.5.tar.gz#md5=a883937ed44f95a98b434a3c13178aa0 (from https://pypi.python.org/simple/horovod/), version: 0.9.5
Found link https://pypi.python.org/packages/11/48/e56cb33974a7fc24252e304e7e60b7b892eb7c1093099493913880b6f4ce/horovod-0.9.8.tar.gz#md5=e981a37f7ee6c1f4ba5dcac637184e7a (from https://pypi.python.org/simple/horovod/), version: 0.9.8
Found link https://pypi.python.org/packages/2c/c4/cd8d9d1136465e6f23a535cc4c49fad24da92982f1275d7c38a564d5417b/horovod-0.9.9.tar.gz#md5=1d680e9ffe37552f4510c29e0fcfefc3 (from https://pypi.python.org/simple/horovod/), version: 0.9.9
Found link https://pypi.python.org/packages/58/8d/6a84cd3fa9bdeb96e6c162d26d05c6e594c38f788005f2d2f42ed2bfcfe5/horovod-0.9.1.tar.gz#md5=277e9c1bc7b459a49eb1aca7731ca5bd (from https://pypi.python.org/simple/horovod/), version: 0.9.1
Found link https://pypi.python.org/packages/7f/78/3c42b58fa37d5811eb2178e6d8f0e586b18fd883f8575a2da8a7bce9b02b/horovod-0.9.4.tar.gz#md5=d436ab4b0efa0734f270f4ba24d1d665 (from https://pypi.python.org/simple/horovod/), version: 0.9.4
Found link https://pypi.python.org/packages/8a/82/085978898e51938791c35949fb757905e746503369e898074f76f6044d32/horovod-0.9.3.tar.gz#md5=4f4dd52b7763ec7d2da0da55ad04aeff (from https://pypi.python.org/simple/horovod/), version: 0.9.3
Found link https://pypi.python.org/packages/8c/91/91495a609f7b4077ba4eeb00c8494bf29eb94de76e5a7a458cf57add7f0e/horovod-0.9.0.tar.gz#md5=1cd031e8d892cab8e2cf05cfe2caba55 (from https://pypi.python.org/simple/horovod/), version: 0.9.0
Found link https://pypi.python.org/packages/9c/62/3d37b3697a06e0c5ef97fd01f690f4b94509486df4f5401a65461e110d1c/horovod-0.9.7.tar.gz#md5=42c02e8e117303e35744fdce39d79ac5 (from https://pypi.python.org/simple/horovod/), version: 0.9.7
Found link https://pypi.python.org/packages/9c/be/5aebe7ecd7024e0ad12196127d22488a73defa3c1b6fac0eea20d7aa6115/horovod-0.9.6.tar.gz#md5=bc4ae17e261dffef0b9ce1e125336a75 (from https://pypi.python.org/simple/horovod/), version: 0.9.6
Found link https://pypi.python.org/packages/b1/83/ac3d9499fdf9173a5870d6fff0a58a4e35240a9c17ce556c3c59cb1a4533/horovod-0.9.10.tar.gz#md5=694ca17e718cd3d62831e23dbf1f5437 (from https://pypi.python.org/simple/horovod/), version: 0.9.10
Found link https://pypi.python.org/packages/fc/53/314d19c558f726e480489e97d576883b6bd79b7d811913216b30d7849b8b/horovod-0.9.2.tar.gz#md5=804dbcd20c1de82c7adb5213c4d0363c (from https://pypi.python.org/simple/horovod/), version: 0.9.2
Using version 0.9.10 (newest of versions: 0.9.0, 0.9.1, 0.9.2, 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.9.8, 0.9.9, 0.9.10)
"GET /packages/b1/83/ac3d9499fdf9173a5870d6fff0a58a4e35240a9c17ce556c3c59cb1a4533/horovod-0.9.10.tar.gz HTTP/1.1" 200 64842
Downloading horovod-0.9.10.tar.gz (64kB)
Downloading from URL https://pypi.python.org/packages/b1/83/ac3d9499fdf9173a5870d6fff0a58a4e35240a9c17ce556c3c59cb1a4533/horovod-0.9.10.tar.gz#md5=694ca17e718cd3d62831e23dbf1f5437 (from https://pypi.python.org/simple/horovod/)
100% |████████████████████████████████| 71kB 11.9MB/s
Running setup.py (path:/tmp/pip-build-wrtVwH/horovod/setup.py) egg_info for package horovod
Running command python setup.py egg_info
running egg_info
creating pip-egg-info/horovod.egg-info
writing pip-egg-info/horovod.egg-info/PKG-INFO
writing top-level names to pip-egg-info/horovod.egg-info/top_level.txt
writing dependency_links to pip-egg-info/horovod.egg-info/dependency_links.txt
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'pip-egg-info/horovod.egg-info/SOURCES.txt'
Source in /tmp/pip-build-wrtVwH/horovod has version 0.9.10, which satisfies requirement horovod from https://pypi.python.org/packages/b1/83/ac3d9499fdf9173a5870d6fff0a58a4e35240a9c17ce556c3c59cb1a4533/horovod-0.9.10.tar.gz#md5=694ca17e718cd3d62831e23dbf1f5437
Installing collected packages: horovod
Running setup.py install for horovod ... Running command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-wrtVwH/horovod/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-7mUT3A-record/install-record.txt --single-version-externally-managed --compile
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops_test.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
running build_ext
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -std=c++11 -I/usr/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/local/lib/python2.7/dist-packages/tensorflow/core -ltensorflow_framework -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
/usr/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -std=c++11 -I/usr/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/local/lib/python2.7/dist-packages/tensorflow/core -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fno-strict-aliasing -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -fPIC -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -I/usr/local/lib/python2.7/dist-packages/tensorflow/include -I/usr/local/lib/python2.7/dist-packages/tensorflow/include/external/nsync/public -I/usr/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
x86_64-linux-gnu-gcc -pthread -shared -Wl,-O1 -Wl,-Bsymbolic-functions -Wl,-Bsymbolic-functions -Wl,-z,relro -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security -Wl,-Bsymbolic-functions -Wl,-z,relro -Wdate-time -D_FORTIFY_SOURCE=2 -g -fstack-protector-strong -Wformat -Werror=format-security build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o -L/usr/local/lib/python2.7/dist-packages/tensorflow/core -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.so
mpicxx: error while loading shared libraries: libopen-pal.so.40: cannot open shared object file: No such file or directory
error: mpicxx -show failed (see error below), is MPI in $PATH?
Note: If your version of MPI has custom command to show compilation flags, please specify it via HOROVOD_MPICXX_SHOW environment variable.
Traceback (most recent call last):
File "/tmp/pip-build-wrtVwH/horovod/setup.py", line 107, in get_mpi_flags
shlex.split(show_command), universal_newlines=True).strip()
File "/usr/lib/python2.7/subprocess.py", line 574, in check_output
raise CalledProcessError(retcode, cmd, output=output)
CalledProcessError: Command '['mpicxx', '-show']' returned non-zero exit status 127
error
Cleaning up...
Removing source in /tmp/pip-build-wrtVwH/horovod
Command "/usr/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-wrtVwH/horovod/setup.py';exec(compile(getattr(tokenize, 'open', open)(file).read().replace('\r\n', '\n'), file, 'exec'))" install --record /tmp/pip-7mUT3A-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-wrtVwH/horovod/
Exception information:
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/pip/basecommand.py", line 209, in main
status = self.run(options, args)
File "/usr/lib/python2.7/dist-packages/pip/commands/install.py", line 335, in run
prefix=options.prefix_path,
File "/usr/lib/python2.7/dist-packages/pip/req/req_set.py", line 732, in install
**kwargs
File "/usr/lib/python2.7/dist-packages/pip/req/req_install.py", line 886, in install
spinner=spinner,
File "/usr/lib/python2.7/dist-packages/pip/utils/init.py", line 736, in call_subprocess
% (command_desc, proc.returncode, cwd))
InstallationError: Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-wrtVwH/horovod/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-7mUT3A-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-wrtVwH/horovod/
Converted retries value: Retry(total=0, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=0, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
Converted retries value: Retry(total=0, connect=None, read=None, redirect=None) -> Retry(total=Retry(total=0, connect=None, read=None, redirect=None), connect=None, read=None, redirect=None)
After I installed Horovod and ran tensorflow_mnist.py, I got the following error:
python: symbol lookup error: /usr/local/lib/openmpi/mca_coll_cuda.so: undefined symbol: opal_cuda_check_bufs
FYI, we have added a demo on how to run horovod on Azure Batch AI. https://github.com/Azure/BatchAI/tree/master/recipes/Horovod
Thanks,
Alex
Is there a simple way to run batch-norm updates in addition to gradient updates? The industry practice is to run batch-norm update ops from a single tower only; therefore, it would be incorrect to add the batch-norm updates to the training op as in the provided sample code.
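For reference, a minimal sketch of the standard single-process TF pattern for running batch-norm updates alongside the train step; how this should interact with Horovod's towers is exactly the open question above, so treat it as a starting point rather than an answer:
import tensorflow as tf

# Batch-norm layers register their moving-average updates in UPDATE_OPS;
# the control dependency runs them together with the train step.
x = tf.placeholder(tf.float32, [None, 8])
h = tf.layers.batch_normalization(x, training=True)
loss = tf.reduce_mean(tf.square(h))
opt = tf.train.GradientDescentOptimizer(0.01)
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = opt.minimize(loss)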
In the Horovod source, in mpi_ops.py, under the _allreduce function, it says:
"""An op which sums an input tensor over all the Horovod processes.
The reduction operation is keyed by the name of the op. The tensor type and
shape must be the same on all Horovod processes for a given name. The reduction
will not start until all processes are ready to send and receive the tensor.
Returns:
A tensor of the same shape and type as `tensor`, summed across all
processes.
"""
I made a minimal example to reproduce this description, by reducing a tensor called const on each process into a tensor called reduced:
import tensorflow as tf
import horovod.tensorflow as hvd
# Initialize Horovod
hvd.init()
config = tf.ConfigProto()
# Rank
rank = hvd.rank()
# Constant tensor to reduce
const = tf.constant([1.,2.])
reduced = hvd.allreduce(const)
# Monitored session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
sess = tf.train.MonitoredTrainingSession(config=config, hooks=hooks)
# I expect a summed tensor -> [2,4]
print("rank: %d" % (rank), sess.run(reduced))
Saving this in a file called test.py, I run with the following:
mpirun -np 2 python test.py
The result is:
('rank: 0', array([ 1., 2.], dtype=float32))
('rank: 1', array([ 1., 2.], dtype=float32))
This just returns the original tensor [1., 2.] that was supposed to be reduced. I expected the following:
('rank: 0', array([ 2., 4.], dtype=float32))
('rank: 1', array([ 2., 4.], dtype=float32))
Am I using this function incorrectly? I'm running on my laptop with 4 available processing units, and I'm using CPUs.
EDIT: Wow, I thought allreduce() returned the sum of the tensors, but it returns the average. It's working as expected now. Sorry for my foolishness.
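For anyone landing here who does want the sum: in Horovod releases of this era, allreduce accepts an average flag (check your version's signature), so extending the example above:
# Request a sum instead of the default mean; on 2 processes this yields [2., 4.]
summed = hvd.allreduce(const, average=False)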
I am performing batch gradient descent, where the gradients are averaged across all training examples to move forward in a single training epoch. This can easily be parallelized by splitting the original batch into "mini-batches", with each process calculating the gradients for its mini-batch. The resulting mini-batch gradients can be averaged on the master process to then move forward with training. This implementation should be "embarrassingly parallel", since the gradients of N examples are calculated over P processes in mini-batches of N/P examples. A good animation is in TensorFlow's synchronous training examples:
I want to perform this type of training on a CPU cluster, and I'm very familiar with MPI, so I think Horovod is a better choice than Distributed TensorFlow. I am having a hard time setting up this task, however. My recent attempt involves optimizing a single-layer neural network to reproduce the parabola y = x^2 (just for simplicity):
import tensorflow as tf
import numpy as np
import horovod.tensorflow as hvd
import time
# Function
def add_layer(inputs, in_size, out_size, activation_function=None):
w = tf.Variable(tf.random_normal([in_size, out_size]))
b = tf.Variable(tf.random_normal([1, out_size]))
Wx_plus_b = tf.matmul(inputs, w) + b
if activation_function is None:
outputs = Wx_plus_b
else:
outputs = activation_function(Wx_plus_b)
return outputs
# Initialize Horovod
hvd.init()
rank = hvd.rank()
# Make up some data
N = 20000
x_data = np.linspace(-1, 1, N)[:, np.newaxis]
y_data = np.square(x_data)
# Make mini-batches, each with N/procs elements
procs = hvd.size()
x_split = np.split(x_data, procs)
y_split = np.split(y_data, procs)
x_batch = x_split[rank]
y_batch = y_split[rank]
# Define placeholder for x_data and y_data
xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])
# Add hidden layer
l1 = add_layer(xs, 1, 3, activation_function=tf.tanh)
# Add output layer
prediction = add_layer(l1, 3, 1, activation_function=None)
# Define loss
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys-prediction), reduction_indices=[1]))
# Define optimizer
opt = tf.train.GradientDescentOptimizer(0.01)
# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(loss)
# Ask the optimizer to apply the gradients
train_opt = opt.apply_gradients(grads_and_vars)
# Monitored session
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
sess = tf.train.MonitoredTrainingSession(hooks=hooks)
# Initialize timer
if rank == 0:
start = time.time()
# Perform the training
for i in range(10000):
# Training step
sess.run(train_opt, feed_dict={xs: x_batch, ys: y_batch})
if i % 100 == 0:
print("Rank: %d" % (rank), "Epoch: %d" % (i), "Loss: %.5f" % (sess.run(loss, feed_dict={xs: x_data, ys: y_data})) )
# Output the final time
if rank == 0:
end = time.time()
print("Elapsed time: %.5f sec" % (end - start) )
If this is saved in a file called test.py, you can run it with:
mpirun -np P python test.py
where P is the number of processes to use. This parallelizes the gradient calculation from N training examples down to N/P examples per process, and then the gradients are averaged using hvd.allreduce in the distributed optimizer.
There are 2 issues that I noticed:
(1) Even though the mini-batches are being parallelized properly, the training actually gets slower as the number of processes increases. Is this simply due to overhead, since this is a rather simple example?
(2) There are P identical neural networks being trained simultaneously. I would like to have the separate processes calculate the gradients of their mini-batches and send them back to the master process to proceed with training. Is there a way to set this up using Horovod?
I was thinking maybe (2) was causing (1); since many identical neural nets are being trained at once, perhaps this is somehow slower?
Thanks for your time.
EDIT: Cleaned up code for more clarity.
Thanks
@alsrgv: When I use multiple GPUs, I try to store the best weights to use later, but I couldn't get the same results as with a single GPU. What is the proper way to do this with Horovod? Thanks a lot in advance.
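For what it's worth, a minimal sketch of the convention used in the Horovod examples: write weights from rank 0 only, so the processes don't race on the same file (the model and filename here are stand-ins):
import horovod.tensorflow as hvd
from keras.models import Sequential
from keras.layers import Dense

hvd.init()
model = Sequential([Dense(1, input_dim=4)])  # stand-in for the real model
# ... train as usual; ranks stay in sync while gradients are allreduced ...
if hvd.rank() == 0:
    model.save_weights('best_weights.h5')  # hypothetical filename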
Hey Alex! Thanks for adding __version__! Would you mind adding tags to the corresponding commits?
Hi, I'm curious about the motivation for this project, since TensorFlow itself provides support for distributed training. Is it because open-source distributed TensorFlow is slow? If so, what are the bottlenecks (e.g. gRPC?) and why? Are there benchmarks we can compare? Thanks.
root@ecb2de971add:/gang/distributed/allreduce/mnist# mpirun -np 2 --allow-run-as-root python3 keras_uber.py
2017-11-08 08:24:44.036889: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:24:44.036943: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2017-11-08 08:24:44.037064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:24:44.037098: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: TITAN Xp, pci bus id: 0000:06:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/6
Train on 60000 samples, validate on 10000 samples
Epoch 1/6
60000/60000 [==============================] - 6s - loss: 0.2183 - acc: 0.9329 - val_loss: 0.0478 - val_acc: 0.9846======>....] - ETA: 0s - loss: 0.2351 - acc: 0.9277728
Epoch 2/6
60000/60000 [==============================] - 6s - loss: 0.2194 - acc: 0.9332 - val_loss: 0.0478 - val_acc: 0.9846
Epoch 2/6
60000/60000 [==============================] - 5s - loss: 0.0694 - acc: 0.9796 - val_loss: 0.0365 - val_acc: 0.9877=========>.] - ETA: 0s - loss: 0.0695 - acc: 0.97951
Epoch 3/6
60000/60000 [==============================] - 5s - loss: 0.0685 - acc: 0.9793 - val_loss: 0.0365 - val_acc: 0.9877
Epoch 3/6
60000/60000 [==============================] - 5s - loss: 0.0499 - acc: 0.9849 - val_loss: 0.0302 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0499 - acc: 0.9849
60000/60000 [==============================] - 5s - loss: 0.0501 - acc: 0.9849 - val_loss: 0.0302 - val_acc: 0.9904
Epoch 4/6
Epoch 4/6
60000/60000 [==============================] - 5s - loss: 0.0395 - acc: 0.9881 - val_loss: 0.0289 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0410 - acc: 0.9872
Epoch 5/6
60000/60000 [==============================] - 5s - loss: 0.0409 - acc: 0.9872 - val_loss: 0.0289 - val_acc: 0.9904
Epoch 5/6
60000/60000 [==============================] - 5s - loss: 0.0330 - acc: 0.9898 - val_loss: 0.0294 - val_acc: 0.9904=========>.] - ETA: 0s - loss: 0.0330 - acc: 0.9898
Epoch 6/6
60000/60000 [==============================] - 5s - loss: 0.0330 - acc: 0.9900 - val_loss: 0.0294 - val_acc: 0.9904
Epoch 6/6
60000/60000 [==============================] - 5s - loss: 0.0280 - acc: 0.9910 - val_loss: 0.0270 - val_acc: 0.9913=========>.] - ETA: 0s - loss: 0.0281 - acc: 0.9910
60000/60000 [==============================] - 5s - loss: 0.0299 - acc: 0.9910 - val_loss: 0.0270 - val_acc: 0.9913
Test loss: 0.0269502004177
Test accuracy: 0.9913
Using TensorFlow backend.
Test loss: 0.0269502004177
Test accuracy: 0.9913
Using TensorFlow backend.
--------------------------------------------------------------------------
The call to cuIpcCloseMemHandle failed. This is a warning and the program
will continue to run.
cuIpcCloseMemHandle return value: 4
address: 0x1090fc00000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[ecb2de971add:17340] Sleep on 17340
[ecb2de971add:17334] 1 more process has sent help message help-mpi-common-cuda.txt / cuIpcCloseMemHandle failed
[ecb2de971add:17334] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ecb2de971add:17340] Sleep on 17340
After installing NCCL 2.0 and MPI, I used "HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod" to install Horovod, but it shows the error: horovod/tensorflow/mpi_ops.cc:802:79: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive].
The detailed log shows:
Collecting horovod
Downloading horovod-0.9.11.tar.gz (65kB)
100% |████████████████████████████████| 71kB 271kB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command /usr/anaconda2/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-g7B1qn/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/pip-raRdo5-record/install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-2.7
creating build/lib.linux-x86_64-2.7/horovod
copying horovod/__init__.py -> build/lib.linux-x86_64-2.7/horovod
creating build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/__init__.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops_test.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
copying horovod/tensorflow/mpi_ops.py -> build/lib.linux-x86_64-2.7/horovod/tensorflow
running build_ext
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -ltensorflow_framework -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
/usr/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_libs.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -D_GLIBCXX_USE_CXX11_ABI=0 -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.o -L/usr/anaconda2/lib/python2.7/site-packages/tensorflow/core -L/usr/anaconda2/lib -o build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_cuda.cc -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_cuda.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/usr/anaconda2/lib -lcudart -o build/temp.linux-x86_64-2.7/test_compile/test_cuda.so
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -std=c++11 -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c build/temp.linux-x86_64-2.7/test_compile/test_nccl.cc -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.o
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -shared -L/usr/anaconda2/lib -Wl,-rpath=/usr/anaconda2/lib,--no-as-needed build/temp.linux-x86_64-2.7/test_compile/test_nccl.o -L/usr/local/cuda/lib -L/usr/local/cuda/lib64 -L/usr/anaconda2/lib -lnccl -o build/temp.linux-x86_64-2.7/test_compile/test_nccl.so
building 'horovod.tensorflow.mpi_lib' extension
creating build/temp.linux-x86_64-2.7/horovod
creating build/temp.linux-x86_64-2.7/horovod/tensorflow
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c horovod/tensorflow/mpi_message.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_message.o -std=c++11 -fPIC -O2 -I/usr/anaconda2/include -L/usr/anaconda2/lib -Wl,-rpath -Wl,/usr/anaconda2/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
gcc -pthread -fno-strict-aliasing -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DHAVE_CUDA=1 -DHAVE_NCCL=1 -DHOROVOD_GPU_ALLREDUCE='N' -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include -I/usr/anaconda2/lib/python2.7/site-packages/tensorflow/include/external/nsync/public -I/usr/local/cuda/include -I/usr/anaconda2/include/python2.7 -c horovod/tensorflow/mpi_ops.cc -o build/temp.linux-x86_64-2.7/horovod/tensorflow/mpi_ops.o -std=c++11 -fPIC -O2 -I/usr/anaconda2/include -L/usr/anaconda2/lib -Wl,-rpath -Wl,/usr/anaconda2/lib -lmpichcxx -lmpich -lopa -lmpl -lrt -lpthread -D_GLIBCXX_USE_CXX11_ABI=0
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++ [enabled by default]
horovod/tensorflow/mpi_ops.cc: In function ‘void horovod::tensorflow::{anonymous}::PerformOperation(horovod::tensorflow::{anonymous}::TensorTable&, horovod::tensorflow::MPIResponse)’:
horovod/tensorflow/mpi_ops.cc:802:79: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
recvcounts, displcmnts, dtype, MPI_COMM_WORLD);
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:633:5: error: initializing argument 1 of ‘int MPI_Allgatherv(void*, int, MPI_Datatype, void*, int*, int*, MPI_Datatype, MPI_Comm)’ [-fpermissive]
int MPI_Allgatherv(void*, int, MPI_Datatype, void*, int*, int*, MPI_Datatype, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1102:45: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_COMM_WORLD))
^
horovod/tensorflow/mpi_ops.cc:534:24: note: in definition of macro ‘MPI_CHECK’
auto mpi_result = (op);
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:639:5: error: initializing argument 1 of ‘int MPI_Allreduce(void*, void*, int, MPI_Datatype, MPI_Op, MPI_Comm)’ [-fpermissive]
int MPI_Allreduce(void* , void*, int, MPI_Datatype, MPI_Op, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc: In function ‘void horovod::tensorflow::{anonymous}::BackgroundThreadLoop(horovod::tensorflow::{anonymous}::HorovodGlobalState&)’:
horovod/tensorflow/mpi_ops.cc:1261:39: error: ‘MPI_COMM_TYPE_SHARED’ was not declared in this scope
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
^
horovod/tensorflow/mpi_ops.cc:1262:34: error: ‘MPI_Comm_split_type’ was not declared in this scope
&local_comm);
^
horovod/tensorflow/mpi_ops.cc:1324:40: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_message.c_str(), (int)encoded_message.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1425:43: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_response.c_str(), (int)encoded_response.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
horovod/tensorflow/mpi_ops.cc:1442:41: error: invalid conversion from ‘const void*’ to ‘void*’ [-fpermissive]
MPI_Send(encoded_response.c_str(), (int)encoded_response.length() + 1,
^
In file included from horovod/tensorflow/mpi_ops.cc:38:0:
/usr/anaconda2/include/mpi.h:569:5: error: initializing argument 1 of ‘int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm)’ [-fpermissive]
int MPI_Send(void*, int, MPI_Datatype, int, int, MPI_Comm);
^
error: command 'gcc' failed with exit status 1
----------------------------------------
Command "/usr/anaconda2/bin/python -u -c "import setuptools, tokenize;file='/tmp/pip-build-g7B1qn/horovod/setup.py';f=getattr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record /tmp/pip-raRdo5-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-g7B1qn/horovod/`
The system is Ubuntu 14.04 with Python 2.7.6, CUDA 8.0, cuDNN 6, and TensorFlow 1.3.0. Can you help figure out where the problem is?
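A note on the errors above: "invalid conversion from 'const void*' to 'void*'" means the mpi.h being picked up (Anaconda's bundled MPICH, judging by -lmpich) predates the const-correct MPI-3 signatures, and MPI_Comm_split_type / MPI_COMM_TYPE_SHARED are MPI-3 features, so that MPI installation is simply too old for Horovod. One possible fix, sketched under the assumption that a recent Open MPI is installed under /usr/local (the paths are illustrative), is to put its compiler wrappers ahead of Anaconda's and rebuild:
export PATH=/usr/local/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
HOROVOD_NCCL_HOME=/usr/local/nccl_2.0.5-3+cuda8.0_amd64 HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod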
@alsrgv: I want to use Horovod under Windows 10 (one of our tools will call Horovod, and it only supports Windows), but I got the following error messages when installing it. Could you please help me, or is there no way to use it under Windows? I also couldn't use the options "HOROVOD_GPU_ALLREDUCE=MPI HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip install --no-cache-dir horovod" as you mentioned.
C:\WINDOWS\system32>pip3 install --no-cache-dir horovod
Collecting horovod
Downloading horovod-0.9.10.tar.gz (64kB)
100% |████████████████████████████████| 71kB 1.5MB/s
Installing collected packages: horovod
Running setup.py install for horovod ... error
Complete output from command "c:\program files\python36\python.exe" -u -c "import setuptools, tokenize;__file__='C:\Users\BRIGHT1\AppData\Local\Temp\pip-build-e3i6bu64\horovod\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\BRIGHT1\AppData\Local\Temp\pip-lt5q0wnv-record\install-record.txt --single-version-externally-managed --compile:
running install
running build
running build_py
creating build
creating build\lib.win-amd64-3.6
creating build\lib.win-amd64-3.6\horovod
copying horovod\__init__.py -> build\lib.win-amd64-3.6\horovod
creating build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\mpi_ops.py -> build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\mpi_ops_test.py -> build\lib.win-amd64-3.6\horovod\tensorflow
copying horovod\tensorflow\__init__.py -> build\lib.win-amd64-3.6\horovod\tensorflow
running build_ext
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_libs.cc
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO "/LIBPATH:c:\program files\python36\lib\site-packages\tensorflow\core" "/LIBPATH:c:\program files\python36\libs" "/LIBPATH:c:\program files\python36\PCbuild\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" tensorflow_framework.lib build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj /OUT:build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.dll
LINK : fatal error LNK1181: cannot open input file 'tensorflow_framework.lib'
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_libs.cc
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe /nologo /INCREMENTAL:NO /LTCG /DLL /MANIFEST:EMBED,ID=2 /MANIFESTUAC:NO "/LIBPATH:c:\program files\python36\lib\site-packages\tensorflow\core" "/LIBPATH:c:\program files\python36\libs" "/LIBPATH:c:\program files\python36\PCbuild\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\LIB\amd64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\ucrt\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\lib\um\x64" "/LIBPATH:C:\Program Files (x86)\Windows Kits\10\lib\10.0.10240.0\um\x64" build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.obj /OUT:build\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_libs.dll
Generating code
Finished generating code
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD -D_GLIBCXX_USE_CXX11_ABI=0 "-Ic:\program files\python36\lib\site-packages\tensorflow\include" "-Ic:\program files\python36\lib\site-packages\tensorflow\include/external/nsync/public" "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_abi.cc
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/message_lite.h(247): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/wire_format_lite_inl.h(865): warning C4267: 'argument': conversion from 'size_t' to 'google::protobuf::uint32', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\tensorflow/core/lib/gtl/manual_constructor.h(97): fatal error C1189: #error: "You must define TF_LIB_GTL_ALIGNED_CHAR_ARRAY for your compiler."
C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\cl.exe -std=c++11 /c /nologo /Ox /W3 /GL /DNDEBUG /MD -D_GLIBCXX_USE_CXX11_ABI=1 "-Ic:\program files\python36\lib\site-packages\tensorflow\include" "-Ic:\program files\python36\lib\site-packages\tensorflow\include/external/nsync/public" "-Ic:\program files\python36\include" "-Ic:\program files\python36\include" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\INCLUDE" "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\ATLMFC\INCLUDE" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\shared" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\um" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.10240.0\winrt" /EHsc /Tpbuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.cc /Fobuild\temp.win-amd64-3.6\Release\test_compile\test_tensorflow_abi.obj
cl : Command line warning D9002 : ignoring unknown option '-std=c++11'
test_tensorflow_abi.cc
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/message_lite.h(247): warning C4267: 'return': conversion from 'size_t' to 'int', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\google/protobuf/wire_format_lite_inl.h(865): warning C4267: 'argument': conversion from 'size_t' to 'google::protobuf::uint32', possible loss of data
c:\program files\python36\lib\site-packages\tensorflow\include\tensorflow/core/lib/gtl/manual_constructor.h(97): fatal error C1189: #error: "You must define TF_LIB_GTL_ALIGNED_CHAR_ARRAY for your compiler."
error: Unable to determine CXX11 ABI to use with TensorFlow (see error above).
----------------------------------------
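For what it's worth, both failures above (the missing tensorflow_framework.lib and the TF_LIB_GTL_ALIGNED_CHAR_ARRAY #error) come from trying to compile TensorFlow's C++ headers with MSVC, which the TF 1.x pip packages were not built to support, and Horovod had no Windows support at this point. Running inside a Linux VM or container is the practical route.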
I run pip install horovod.
.
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc(7): warning #2196: routine is both "inline" and "noinline"
compilation aborted for build/temp.linux-x86_64-2.7/test_compile/test_tensorflow_abi.cc (code 2)
error: Unable to determine CXX11 ABI to use with TensorFlow (see error above).
----------------------------------------
Command "/beegfs/home/hd/hd_hd/hd_tn445/.venvs/tf_r1/bin/python -u -c "import setuptools, tokenize;__file__='/tmp/hd_tn445_job_5405240/pip-build-o4EglG/horovod/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /tmp/hd_tn445_job_5405240/pip-qVrJY6-record/install-record.txt --single-version-externally-managed --compile --install-headers /beegfs/home/hd/hd_hd/hd_tn445/.venvs/tf_r1/include/site/python2.7/horovod" failed with error code 1 in /tmp/hd_tn445_job_5405240/pip-build-o4EglG/horovod/
Full log: horovod_install_error_log.txt
OS: Red Hat 4.8.5-16
Tensorflow version: 1.4.0
Openmpi version: 1.10
Compiler: Intel(R) C Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 16.0.4.258
./horovod/tensorflow/__init__.py:89
return tf.group(*[tf.assign(var, broadcast(var, root_rank))
for var in tf.global_variables()])
@alsrgv I don't think we can guarantee that the broadcast ops are evaluated by TensorFlow in the same order on every instance across all platforms. Do you agree?
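(Horovod does not actually depend on that order: a background thread on each rank reports which tensors are ready to a coordinator on rank 0, and a broadcast or allreduce is only executed once every rank has submitted the tensor with the same name. Differing evaluation orders therefore cause waiting rather than mismatched data, and the 60-second stall warning seen elsewhere in this thread is what fires when that wait goes on too long.)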
Hi, @alsrgv
I have a concern: I installed the nv_peer_memory driver and also installed Horovod via
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_ALLGATHER=MPI HOROVOD_GPU_BROADCAST=MPI pip3 install --no-cache-dir horovod
but when I launch the training task, how do I know that GPUDirect is actually being used?
For instance,
root@ecb2de971add:/gang/distributed/allreduce/mnist# mpirun -np 4 --allow-run-as-root -bind-to none -oversubscribe -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -mca pml ob1 python3 keras_uber.py
2017-11-08 08:43:33.395738: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:08:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.395794: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 3, name: TITAN Xp, pci bus id: 0000:08:00.0, compute capability: 6.1)
2017-11-08 08:43:33.399529: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:07:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.399567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 2, name: TITAN Xp, pci bus id: 0000:07:00.0, compute capability: 6.1)
2017-11-08 08:43:33.419007: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:04:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.419047: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: TITAN Xp, pci bus id: 0000:04:00.0, compute capability: 6.1)
2017-11-08 08:43:33.419182: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1031] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
totalMemory: 11.90GiB freeMemory: 11.74GiB
2017-11-08 08:43:33.419221: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] Creating TensorFlow device (/device:GPU:0) -> (device: 1, name: TITAN Xp, pci bus id: 0000:06:00.0, compute capability: 6.1)
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
Train on 60000 samples, validate on 10000 samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
Epoch 1/3
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
ecb2de971add:35788:35963 [0] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35788:35963 [0] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35788:35963 [0] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35788:35963 [0] INFO Using internal Network IB
NCCL version 2.0.5 compiled with CUDA 9.0
ecb2de971add:35790:35969 [2] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35790:35969 [2] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35789:35964 [1] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35789:35964 [1] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35791:35960 [3] INFO NET : Using interface eth0:172.17.0.8<0>
ecb2de971add:35791:35960 [3] INFO NET/IB : Using interface eth0 for sideband communication
ecb2de971add:35790:35969 [2] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35789:35964 [1] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35791:35960 [3] INFO NET/IB: [0] mlx4_0:1/IB
ecb2de971add:35790:35969 [2] INFO Using internal Network IB
ecb2de971add:35789:35964 [1] INFO Using internal Network IB
ecb2de971add:35791:35960 [3] INFO Using internal Network IB
ecb2de971add:35788:35963 [0] INFO CUDA Dev 0, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35790:35969 [2] INFO CUDA Dev 2, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35789:35964 [1] INFO CUDA Dev 1, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35791:35960 [3] INFO CUDA Dev 3, IB Ports : mlx4_0/1(PIX)
ecb2de971add:35788:35963 [0] INFO nvmlDeviceGetNvLinkCapability() failed: Not Supported
[the nvmlDeviceGetNvLinkCapability() warning above repeats many times for each of the four ranks]
ecb2de971add:35788:35963 [0] INFO Using 256 threads
ecb2de971add:35788:35963 [0] INFO [0] Ring 0 : 0 1 2 3
ecb2de971add:35790:35969 [2] INFO 2 -> 1 via P2P/IPC
ecb2de971add:35791:35960 [3] INFO 3 -> 2 via P2P/IPC
ecb2de971add:35789:35964 [1] INFO 1 -> 0 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO 0 -> 3 via P2P/IPC
ecb2de971add:35790:35969 [2] INFO 2 -> 3 via P2P/IPC
ecb2de971add:35791:35960 [3] INFO 3 -> 0 via P2P/IPC
ecb2de971add:35789:35964 [1] INFO 1 -> 2 via P2P/IPC
ecb2de971add:35788:35963 [0] INFO 0 -> 1 via P2P/IPC
60000/60000 [==============================] - 7s - loss: 0.1787 - acc: 0.9450 - val_loss: 0.0445 - val_acc: 0.9856
[interleaved progress-bar output from the other workers omitted]
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1804 - acc: 0.9445 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1816 - acc: 0.9453 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
60000/60000 [==============================] - 7s - loss: 0.1808 - acc: 0.9435 - val_loss: 0.0445 - val_acc: 0.9856
Epoch 2/3
60000/60000 [==============================] - 5s - loss: 0.0462 - acc: 0.9858 - val_loss: 0.0289 - val_acc: 0.9905
[interleaved progress-bar output from the other workers omitted]
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0458 - acc: 0.9860 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0465 - acc: 0.9861 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0462 - acc: 0.9857 - val_loss: 0.0289 - val_acc: 0.9905
Epoch 3/3
60000/60000 [==============================] - 5s - loss: 0.0291 - acc: 0.9907 - val_loss: 0.0274 - val_acc: 0.9911
[interleaved progress-bar output from the other workers omitted]
60000/60000 [==============================] - 5s - loss: 0.0302 - acc: 0.9905 - val_loss: 0.0274 - val_acc: 0.9911
60000/60000 [==============================] - 5s - loss: 0.0309 - acc: 0.9904 - val_loss: 0.0274 - val_acc: 0.9911
60000/60000 [==============================] - 5s - loss: 0.0299 - acc: 0.9906 - val_loss: 0.0274 - val_acc: 0.9911
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
Test loss: 0.0273595407882
Test accuracy: 0.9911
Using TensorFlow backend.
--------------------------------------------------------------------------
The call to cuIpcCloseMemHandle failed. This is a warning and the program
will continue to run.
cuIpcCloseMemHandle return value: 4
address: 0x1090fc00000
Check the cuda.h file for what the return value means. Perhaps a reboot
of the node will clear the problem.
--------------------------------------------------------------------------
[ecb2de971add:35789] Sleep on 35789
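Reading the NCCL log above: every ring connection is reported as "via P2P/IPC", i.e. intra-node CUDA peer-to-peer, so this single-node run never touches the network and nv_peer_memory (GPUDirect RDMA) is not exercised at all. GPUDirect RDMA only matters for inter-node transfers over InfiniBand; in a multi-node run with NCCL_DEBUG=INFO you would look for ring-setup lines naming the IB transport instead of P2P/IPC (the exact wording varies across NCCL versions).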
I have tried running on a single GPU and on multiple GPUs, and multi-GPU seems to be slower with Horovod on the tensorflow_mnist.py example. Below are the outputs.
Here is the single-GPU run with CUDA_VISIBLE_DEVICES=1 python tensorflow_mnist.py:
INFO:tensorflow:loss = 2.30667, step = 1
INFO:tensorflow:loss = 2.29006, step = 11 (0.068 sec)
INFO:tensorflow:loss = 2.25554, step = 21 (0.052 sec)
INFO:tensorflow:loss = 2.1974, step = 31 (0.052 sec)
INFO:tensorflow:loss = 1.89438, step = 41 (0.051 sec)
INFO:tensorflow:loss = 1.82758, step = 51 (0.052 sec)
INFO:tensorflow:loss = 0.915703, step = 61 (0.049 sec)
INFO:tensorflow:loss = 1.77014, step = 71 (0.050 sec)
INFO:tensorflow:loss = 1.03163, step = 81 (0.049 sec)
INFO:tensorflow:loss = 0.936019, step = 91 (0.050 sec)
INFO:tensorflow:global_step/sec: 162.7
INFO:tensorflow:loss = 0.522648, step = 101 (0.048 sec)
INFO:tensorflow:loss = 0.573508, step = 111 (0.049 sec)
INFO:tensorflow:loss = 0.214133, step = 121 (0.049 sec)
INFO:tensorflow:loss = 0.757816, step = 131 (0.048 sec)
INFO:tensorflow:loss = 0.190643, step = 141 (0.047 sec)
INFO:tensorflow:loss = 0.213472, step = 151 (0.048 sec)
INFO:tensorflow:loss = 0.0751245, step = 161 (0.048 sec)
INFO:tensorflow:loss = 0.115839, step = 171 (0.048 sec)
INFO:tensorflow:loss = 0.0470715, step = 181 (0.048 sec)
INFO:tensorflow:loss = 0.173935, step = 191 (0.048 sec)
INFO:tensorflow:global_step/sec: 207.711
INFO:tensorflow:loss = 0.196011, step = 201 (0.048 sec)
INFO:tensorflow:loss = 0.0540665, step = 211 (0.049 sec)
INFO:tensorflow:loss = 0.157532, step = 221 (0.047 sec)
INFO:tensorflow:loss = 0.392506, step = 231 (0.048 sec)
INFO:tensorflow:loss = 0.216696, step = 241 (0.048 sec)
INFO:tensorflow:loss = 0.146749, step = 251 (0.048 sec)
INFO:tensorflow:loss = 0.0601752, step = 261 (0.047 sec)
INFO:tensorflow:loss = 0.255786, step = 271 (0.049 sec)
INFO:tensorflow:loss = 0.116301, step = 281 (0.048 sec)
INFO:tensorflow:loss = 0.226139, step = 291 (0.048 sec)
Here is the command for 2 GPUs, CUDA_VISIBLE_DEVICES=1,2 mpirun -np 2 python tensorflow_mnist.py,
and the outputs:
INFO:tensorflow:loss = 2.29663, step = 2
INFO:tensorflow:loss = 2.30619, step = 2
INFO:tensorflow:loss = 2.28995, step = 12 (0.295 sec)
INFO:tensorflow:loss = 2.2678, step = 12 (0.208 sec)
INFO:tensorflow:loss = 2.26281, step = 22 (0.189 sec)
INFO:tensorflow:loss = 2.2539, step = 22 (0.189 sec)
INFO:tensorflow:loss = 2.17149, step = 32 (0.190 sec)
INFO:tensorflow:loss = 2.18344, step = 32 (0.190 sec)
INFO:tensorflow:loss = 1.73136, step = 42 (0.190 sec)
INFO:tensorflow:loss = 1.66357, step = 42 (0.190 sec)
INFO:tensorflow:loss = 1.18375, step = 52 (0.190 sec)
INFO:tensorflow:loss = 1.1508, step = 52 (0.190 sec)
INFO:tensorflow:loss = 0.689034, step = 62 (0.190 sec)
INFO:tensorflow:loss = 0.60923, step = 62 (0.190 sec)
INFO:tensorflow:loss = 0.417928, step = 72 (0.189 sec)
INFO:tensorflow:loss = 0.389533, step = 72 (0.189 sec)
INFO:tensorflow:loss = 1.40314, step = 82 (0.194 sec)
INFO:tensorflow:loss = 1.60605, step = 82 (0.194 sec)
INFO:tensorflow:loss = 0.517931, step = 92 (0.198 sec)
INFO:tensorflow:loss = 0.588845, step = 92 (0.198 sec)
INFO:tensorflow:loss = 0.958909, step = 102 (0.190 sec)
INFO:tensorflow:global_step/sec: 49.6269
INFO:tensorflow:loss = 1.07172, step = 102 (0.191 sec)
INFO:tensorflow:loss = 0.209173, step = 112 (0.189 sec)
INFO:tensorflow:loss = 0.172761, step = 112 (0.188 sec)
INFO:tensorflow:loss = 0.282635, step = 122 (0.192 sec)
INFO:tensorflow:loss = 0.316832, step = 122 (0.192 sec)
INFO:tensorflow:loss = 0.235262, step = 132 (0.192 sec)
INFO:tensorflow:loss = 0.39039, step = 132 (0.192 sec)
INFO:tensorflow:loss = 0.249453, step = 142 (0.190 sec)
INFO:tensorflow:loss = 0.220639, step = 142 (0.190 sec)
INFO:tensorflow:loss = 0.188635, step = 152 (0.191 sec)
INFO:tensorflow:loss = 0.237106, step = 152 (0.191 sec)
INFO:tensorflow:loss = 0.0990839, step = 162 (0.196 sec)
INFO:tensorflow:loss = 0.298068, step = 162 (0.196 sec)
INFO:tensorflow:loss = 0.289399, step = 172 (0.190 sec)
INFO:tensorflow:loss = 0.209013, step = 172 (0.190 sec)
INFO:tensorflow:loss = 0.187031, step = 182 (0.195 sec)
INFO:tensorflow:loss = 0.325866, step = 182 (0.195 sec)
INFO:tensorflow:loss = 0.106716, step = 192 (0.193 sec)
INFO:tensorflow:loss = 0.120047, step = 192 (0.193 sec)
INFO:tensorflow:global_step/sec: 51.9471
The benchmarks in the README are not clear to me. Could you possibly clarify?
The single-GPU version has a benchmark of 133 for Inception.
The multi-GPU versions have Distributed TensorFlow at 10.4.
The Horovod results range from 14.4 to 15.6.
Given the single-GPU result, I presume higher is worse. But you have bolded the highest result for Horovod, suggesting that is best. These results suggest Distributed TensorFlow is far faster.
So, what are the numbers, and is higher better?
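For reference, in the README table the first number in each cell is throughput in images per second (higher is better) and the parenthesized number is the speedup over a single GPU, e.g. "2,022.6 (15.0x)". Read that way, 133 is the single-GPU images/sec baseline, 10.4x is Distributed TensorFlow's speedup, and 14.4x to 15.6x is Horovod's, so the bolded Horovod results really are the fastest.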
I am currently running the InceptionV3 model on a Tesla K80 GPU. Here is the code snippet that works perfectly fine:
file_list = build_data(sys.argv[1])
train_size = math.ceil(0.8 * len(file_list))
train_file_list = file_list[:train_size]
test_file_list=file_list[train_size:]
nbatches_train, mod = divmod(len(train_file_list), BATCH_SIZE)
nbatches_valid, mod = divmod(len(test_file_list), BATCH_SIZE)
# Initialise session
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
sess = tf.Session()
K.set_session(sess)
K.set_learning_phase(1)
model = create_inception_model(num_classes=4, input_shape=(HEIGHT, WIDTH, CHANNELS),
fully_trainable=True)
cb = []
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0,
write_graph=True, write_images=True)
cb.append(tbCallBack)
train_generator = generator_from_data_path(train_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
test_generator = generator_from_data_path(test_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
model.fit_generator(train_generator, epochs=EPOCHS,
steps_per_epoch=nbatches_train,callbacks=cb,
validation_data=test_generator,
validation_steps=nbatches_valid)
serialize_mode(model, name='inception')
As always, training takes a long time, so I wanted to switch to Horovod and run it on multiple GPUs.
Here is the code I added to the above to try to make that happen.
First, the addition to the model-generation code:
opt = keras.optimizers.Adadelta()
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
return model
And here is the code where we are initializing Horovod to run the model:
hvd.init()
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())
K.set_session(tf.Session(config=config))
file_list = build_data(sys.argv[1])
train_size = math.ceil(0.8 * len(file_list))
train_file_list = file_list[:train_size]
test_file_list=file_list[train_size:]
nbatches_train, mod = divmod(len(train_file_list), BATCH_SIZE)
nbatches_valid, mod = divmod(len(test_file_list), BATCH_SIZE)
model = create_inception_model(num_classes=4, input_shape=(HEIGHT, WIDTH, CHANNELS),
fully_trainable=True)
cb = []
tbCallBack = TensorBoard(log_dir='./Graph', histogram_freq=0,
write_graph=True, write_images=True)
if hvd.rank() == 0:
    modelCheckPointCallBack = keras.callbacks.ModelCheckpoint('./checkpoint-{epoch}.h5')
    cb.append(modelCheckPointCallBack)
cb.append(tbCallBack)
train_generator = generator_from_data_path(train_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
test_generator = generator_from_data_path(test_file_list, batch_size=BATCH_SIZE, width=WIDTH, height=HEIGHT)
model.fit_generator(train_generator, epochs=EPOCHS,
steps_per_epoch=nbatches_train,callbacks=cb,
validation_data=test_generator,
validation_steps=nbatches_valid)
serialize_mode(model, name='inception')
And this is what happens when I run Horovod with 4 GPUs:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 4 slots
that were requested by the application:
python
Either request fewer slots for your application, or make more slots available
for use.
--------------------------------------------------------------------------
But when I run watch nvidia-smi, I see 4 GPUs:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | Off |
| N/A 31C P8 28W / 149W | 16MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:00:05.0 Off | Off |
| N/A 31C P8 28W / 149W | 1MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:00:06.0 Off | Off |
| N/A 33C P8 29W / 149W | 1MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:00:07.0 Off | Off |
| N/A 30C P8 27W / 149W | 1MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1956 G /usr/lib/xorg/Xorg 15MiB |
+-----------------------------------------------------------------------------+
When I run the same program with 1 GPU, my nvidia-smi output looks like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90 Driver Version: 384.90 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | Off |
| N/A 42C P0 140W / 149W | 11690MiB / 12205MiB | 82% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K80 Off | 00000000:00:05.0 Off | Off |
| N/A 36C P0 56W / 149W | 11598MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla K80 Off | 00000000:00:06.0 Off | Off |
| N/A 38C P0 69W / 149W | 11598MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla K80 Off | 00000000:00:07.0 Off | Off |
| N/A 35C P0 55W / 149W | 11598MiB / 12205MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1956 G /usr/lib/xorg/Xorg 15MiB |
| 0 3447 C python 11661MiB |
| 1 3447 C python 11585MiB |
| 2 3447 C python 11585MiB |
| 3 3447 C python 11585MiB |
+-----------------------------------------------------------------------------+
It creates 4 Python processes across the GPUs, but it does not run in a distributed fashion.
How do I get this to run on a single machine with multiple GPUs?
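The "not enough slots" message comes from Open MPI, not Horovod: without a hostfile, mpirun derives the default slot count from the host's cores, not from its GPUs, so -np 4 can fail even with four GPUs present. A minimal fix, assuming a single machine with four GPUs (the script name is illustrative):
mpirun -np 4 -H localhost:4 python inception_horovod.py
With config.gpu_options.visible_device_list = str(hvd.local_rank()) set as in the snippet above, each of the four ranks then pins itself to a different GPU, which also avoids the one-process-on-all-four-GPUs pattern visible in the second nvidia-smi listing.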
@alsrgv Could you please provide a CPU-only example for keras_mnist.py? Thanks...
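A minimal sketch of a CPU-only setup, assuming the structure of examples/keras_mnist.py and a Horovod version that ships the horovod.keras module; only the session setup changes, the model code stays the same:
import tensorflow as tf
import keras.backend as K
import horovod.keras as hvd

hvd.init()

# Hide GPUs from TensorFlow and cap the per-process thread pools so the MPI
# ranks do not fight over cores (the thread counts here are illustrative).
config = tf.ConfigProto(device_count={'GPU': 0},
                        intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=2)
K.set_session(tf.Session(config=config))

# ... build and compile the model exactly as in keras_mnist.py, wrapping the
# optimizer with hvd.DistributedOptimizer(opt) ...
Launched with a plain mpirun -np 2 python keras_mnist_cpu.py; as long as Horovod was built without HOROVOD_GPU_ALLREDUCE, the allreduce runs over MPI on the CPU.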
Command :
mpirun -np 2 \
-x LD_LIBRARY_PATH \
-H xuerq-horovod0:1,xuerq-horovod1:1 python tensorflow_mnist.py
Error log :
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected. This is highly unusual.
The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).
Local host: xuerq-horovod0
Local PID: 8136
Peer hostname: xuerq-horovod1 ([[30832,1],1])
Source IP of socket: 10.10.10.8
Known IPs of peer:
10.1.102.5
--------------------------------------------------------------------------
[xuerq-horovod0:08130] 1 more process has sent help message help-mpi-btl-tcp.txt / dropped inbound connection
[xuerq-horovod0:08130] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
But the following commands work fine:
mpirun -np 1 \
-x LD_LIBRARY_PATH \
-H xuerq-horovod0:1 python tensorflow_mnist.py
mpirun -np 1 \
-x LD_LIBRARY_PATH \
-H xuerq-horovod1:1 python tensorflow_mnist.py
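The help text is saying that xuerq-horovod1 has more than one IP address: it connected from 10.10.10.8, while MPI only knew about 10.1.102.5. Restricting Open MPI to the subnet the two hosts actually share usually resolves this; a sketch, assuming that subnet is 10.10.10.0/24 (adjust to your network):
mpirun -np 2 \
--mca btl_tcp_if_include 10.10.10.0/24 \
--mca oob_tcp_if_include 10.10.10.0/24 \
-x LD_LIBRARY_PATH \
-H xuerq-horovod0:1,xuerq-horovod1:1 python tensorflow_mnist.py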
I am running this example on Kubernetes; the standalone version works, but the distributed version fails.
My run command is below (changing eth0 to flannel.1 gives the same error, and the other interfaces cannot run at all):
mpirun --allow-run-as-root -v --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 3 -x LD_LIBRARY_PATH -H hvd-1:1,hvd-2:1,hvd-3:1 python tensorflow_mnist.py
error msg:
Warning: Permanently added '[hvd-2]:33,[10.87.217.230]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.217.233]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.211]:33' (ECDSA) to the list of known hosts.
0
Extracting MNIST-data-0/train-images-idx3-ubyte.gz
2
Extracting MNIST-data-2/train-images-idx3-ubyte.gz
1
Extracting MNIST-data-1/train-images-idx3-ubyte.gz
Extracting MNIST-data-2/train-labels-idx1-ubyte.gz
Extracting MNIST-data-2/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-1/train-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-0/train-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-images-idx3-ubyte.gz
Extracting MNIST-data-2/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-1/t10k-labels-idx1-ubyte.gz
Extracting MNIST-data-0/t10k-labels-idx1-ubyte.gz
0
0
0
2017-10-27 04:13:53.870673: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870718: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870727: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870733: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-27 04:13:53.870738: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
[the same five cpu_feature_guard warnings repeat for each of the three processes]
2017-10-27 04:13:54.028404: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.028873: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.028986: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-27 04:13:54.029016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.029024: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-27 04:13:54.029039: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-27 04:13:54.029476: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-27 04:13:54.029507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.029515: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-27 04:13:54.029528: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-27 04:13:54.046652: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-27 04:13:54.047204: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.21GiB
Free memory: 11.09GiB
2017-10-27 04:13:54.047236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-27 04:13:54.047244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-27 04:13:54.047259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
[node233][[9159,1],0][btl_tcp_frag.c:130:mca_btl_tcp_frag_send] mca_btl_tcp_frag_send: writev error (0x1b0a43da00, 128)
Bad address(1)
[node233:02924] pml_ob1_sendreq.c:191 FATAL
ifconfig inside Docker is as follows:
cni0 Link encap:Ethernet HWaddr 0a:58:c0:a8:2a:01
inet addr:192.168.42.1 Bcast:0.0.0.0 Mask:255.255.255.0
UP BROADCAST MULTICAST MTU:1450 Metric:1
RX packets:1662 errors:0 dropped:0 overruns:0 frame:0
TX packets:3252 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:71609 (71.6 KB) TX bytes:6438981 (6.4 MB)
docker0 Link encap:Ethernet HWaddr 02:42:16:50:f4:e9
inet addr:172.17.0.1 Bcast:0.0.0.0 Mask:255.255.0.0
UP BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
eth0 Link encap:Ethernet HWaddr 00:16:3e:00:2d:bb
inet addr:10.87.217.233 Bcast:10.87.255.255 Mask:255.255.128.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:21258550691 errors:0 dropped:0 overruns:0 frame:0
TX packets:3273335789 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:29633764824437 (29.6 TB) TX bytes:8441356508419 (8.4 TB)
flannel.1 Link encap:Ethernet HWaddr 9e:ee:4a:ed:a1:93
inet addr:192.168.42.0 Bcast:0.0.0.0 Mask:255.255.255.255
UP BROADCAST RUNNING MULTICAST MTU:1450 Metric:1
RX packets:189364 errors:0 dropped:0 overruns:0 frame:0
TX packets:189765 errors:0 dropped:132 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:11844463 (11.8 MB) TX bytes:23939132 (23.9 MB)
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:7571360 errors:0 dropped:0 overruns:0 frame:0
TX packets:7571360 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:2581387511 (2.5 GB) TX bytes:2581387511 (2.5 GB)
Is there any way for me to know what went wrong? I don't know which address is bad.
Hi,
Regarding how to use the parameters and how to save and restore models, are there any detailed documents for this package?
For example, for code like the one below:
hooks = [hvd.BroadcastGlobalVariablesHook(0),
tf.train.StopAtStepHook(last_step=10000),
tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
every_n_iter=10),
]
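The pattern the Horovod examples use: broadcast the initial (or restored) variables with the hook shown above, and let only rank 0 write checkpoints by giving it alone a checkpoint directory; restore then goes through MonitoredTrainingSession as usual, with the broadcast hook pushing the restored values to the other ranks. A sketch, where config and train_op stand in for your own session config and training op:
checkpoint_dir = './checkpoints' if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks,
                                       config=config) as mon_sess:
    while not mon_sess.should_stop():
        mon_sess.run(train_op)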
Hi professional team,
Is it possible to use Horovod distributed training on a single node with multiple GPUs?
Thanks for the kind support.
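Yes: Horovod treats N local processes the same way it treats N machines. On one node with, say, 4 GPUs you launch 4 ranks on localhost (train.py is illustrative):
mpirun -np 4 -H localhost:4 python train.py
Each rank then pins its own GPU with config.gpu_options.visible_device_list = str(hvd.local_rank()).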
On Mac OSX El Capitan with Anaconda 3, pip install horovod
gives the following error:
fatal error: 'unordered_map' file not found
This could be fixed if I had access to the compilation scripts, but in this case pip does all of that and I don't know a workaround. Can this be fixed?
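One workaround that has helped with this class of error (unverified for this exact setup): Anaconda Python on OS X is often built with an old deployment target, which makes distutils compile extensions against GNU libstdc++, where the C++11 <unordered_map> header is missing. Forcing a newer target so clang uses libc++ needs no access to the compilation scripts, since pip passes the environment through:
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ pip install --no-cache-dir horovod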
I've added Horovod to my project and found strange warnings like the one below.
WARNING: One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
It happens in two situations:
Is there anything I am missing?
I am adding Horovod to an old and large project, so it might take some time to make a demo that reproduces the problem, but I'll be glad to do that if necessary.
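For context, the warning fires when different ranks submit different (or differently named) tensors to Horovod's coordinator. A pattern worth checking in a large codebase, sketched with illustrative names: keep the collective path identical on every rank, and move rank-specific work out of it:
# every rank runs exactly the same graph ops that contain allreduce/broadcast...
_ = sess.run(train_op)

# ...while rank-specific extras (checkpointing, eval, logging) stay outside
if hvd.rank() == 0:
    saver.save(sess, checkpoint_path, global_step=step)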
To run on 4 different machines that each have 1 GPU and are connected by LAN, do I need to use the following command on just one of the systems, or run it on every computer, changing the order of the server IPs?
mpirun -np 4 -x LD_LIBRARY_PATH -H server1:4,server2:4,server3:4,server4:4 python train.py
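You run mpirun once, on any one of the machines; it starts the remote ranks over SSH, so nothing needs to be launched on the other computers and the order of the server IPs does not matter. Note that the number after each hostname is the slot count for that host, so with one GPU per machine it should be 1, not 4:
mpirun -np 4 -x LD_LIBRARY_PATH -H server1:1,server2:1,server3:1,server4:1 python train.py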
Will you be making the benchmark programs available in the examples section (Resnet, Inception, VGGNet)? That would enable us to reproduce your numbers. Thanks.
Hello,
I have a cluster of 2 GPU nodes.
[[767,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: 1d4848ca651249409735f1aebc6a7e60000001
[[1693,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: 1d4848ca651249409735f1aebc6a7e60000001
Initializing horovod
[1d4848ca651249409735f1aebc6a7e60000000:42378] procdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/1/0
[1d4848ca651249409735f1aebc6a7e60000000:42378] jobdir: /tmp/openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0/1693/1
[1d4848ca651249409735f1aebc6a7e60000000:42378] top: openmpi-sessions-_azbatch@1d4848ca651249409735f1aebc6a7e60000000_0
[1d4848ca651249409735f1aebc6a7e60000000:42378] tmp: /tmp
MPIR_being_debugged = 0
MPIR_debug_state = 1
MPIR_partial_attach_ok = 1
MPIR_i_am_starter = 0
MPIR_forward_output = 0
MPIR_proctable_size = 2
MPIR_proctable:
(i, host, exe, pid) = (0, 1d4848ca651249409735f1aebc6a7e60000000, /anaconda/envs/py35/bin/python, 42378)
(i, host, exe, pid) = (1, 10.0.0.5, /anaconda/envs/py35/bin/python, 44078)
MPIR_executable_path: NULL
MPIR_server_arguments: NULL
[1d4848ca651249409735f1aebc6a7e60000000:42354] 1 more process has sent help message help-mpi-btl-base.txt / btl:no-nics
[1d4848ca651249409735f1aebc6a7e60000000:42354] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
Thank you,
Alex
Hi,
I'm using Horovod to do distributed TF training.
In my code, it seems that the task failed at the beginning due to a model-saving error:
[[Node: save/MergeV2Checkpoints = MergeV2Checkpoints[delete_old_dirs=true, _device="/job:localhost/replica:0/task:0/cpu:0"](save/MergeV2Checkpoints/checkpoint_prefixes, _arg_save/Const_0_0)]]
Caused by op 'save/MergeV2Checkpoints', defined at:
File "/tmp/apprunner/.working/runtime/app/train.py", line 432, in <module>
main()
File "/tmp/apprunner/.working/runtime/app/train.py", line 395, in main
checkpoint_dir=work_path) as sess:
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
return self._sess_creator.create_session()
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
self.tf_sess = self._session_creator.create_session()
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session
self._scaffold.finalize()
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 203, in finalize
self._saver = training_saver._get_saver_or_default() # pylint: disable=protected-access
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 736, in _get_saver_or_default
saver = Saver(sharded=True, allow_empty=True)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1139, in __init__
self.build()
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1170, in build
restore_sequentially=self._restore_sequentially)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 685, in build
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 361, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 343, in _AddShardedSaveOpsForV2
sharded_prefixes, checkpoint_prefix, delete_old_dirs=True)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/ops/gen_io_ops.py", line 185, in merge_v2_checkpoints
delete_old_dirs=delete_old_dirs, name=name)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/tmp/apprunner/.working/runtime/python/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
My code is:
...
# BroadcastGlobalVariablesHook broadcasts variables from rank 0 to all other
# processes during initialization.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
tf.train.StopAtStepHook(last_step=8000000),
tf.train.LoggingTensorHook(tensors={'step': global_step, 'loss': loss},
every_n_iter=10),
]
is_chief = (hvd.rank() == 0)
# save_secs = args.save_seconds if is_chief else 800000000000000000
save_secs = args.save_seconds if is_chief else None
with tf.train.MonitoredTrainingSession(config=config,
hooks=hooks,
save_checkpoint_secs=save_secs,
checkpoint_dir=work_path) as sess:
work_path is an HDFS path here.
Any idea about this issue?
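For what it's worth, a minimal sketch of the pattern Horovod's own examples use, assuming the failure comes from non-chief ranks building saver/checkpoint ops against the same HDFS path: give only rank 0 a checkpoint_dir and pass None on the other ranks, instead of disabling saving through save_checkpoint_secs. (train_op is a hypothetical training op; hvd, tf, config, hooks, args and work_path are as in the snippet above.)

# Only the chief (rank 0) gets a checkpoint directory; the other ranks
# pass None, so no saver or CheckpointSaverHook is ever built for them.
checkpoint_dir = work_path if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(config=config,
                                       hooks=hooks,
                                       save_checkpoint_secs=args.save_seconds,
                                       checkpoint_dir=checkpoint_dir) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # hypothetical training op

When checkpoint_dir is None, MonitoredTrainingSession creates no checkpoint-saving machinery at all, so non-chief ranks never touch the HDFS path.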
How can I check whether the models on different GPUs exchange gradients directly via nv_peer_mem?
I measured the running time of the Keras example with nv_peer_mem turned on and off. The two cases show no difference in running time, and the code doesn't raise any complaints either.
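Not an authoritative answer, but two crude checks come to mind: run the job with NCCL_DEBUG=INFO exported to all ranks and read which transport NCCL reports for the rings, and verify the kernel module is actually loaded. A sketch of the module check (assumes a Linux host; this only shows the module is present, not that NCCL used GPUDirect for a given run):

# Check whether the nv_peer_mem kernel module is loaded on this host.
with open('/proc/modules') as f:
    loaded = any(line.split()[0] == 'nv_peer_mem' for line in f)
print('nv_peer_mem loaded:', loaded)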
Thanks a lot for this great framework. I got the error "'TFOptimizer' object has no attribute 'lr'" when using the ReduceLROnPlateau callback from Keras. Could you please help me find a way to solve this with Horovod? Thank you very much in advance.
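A hedged guess at the cause: ReduceLROnPlateau mutates self.model.optimizer.lr, which exists on native Keras optimizers but not on the TFOptimizer wrapper that Keras creates when a model is compiled with a raw tf.train optimizer. A minimal sketch of the usual workaround, compiling with a Keras optimizer wrapped by horovod.keras's DistributedOptimizer (the model and hyperparameters below are placeholders):

import keras
import horovod.keras as hvd

hvd.init()

# Native Keras optimizer, scaled by the number of workers, then wrapped so
# gradients are averaged across ranks; the wrapper keeps the `lr` attribute
# that ReduceLROnPlateau adjusts.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(lr=1.0 * hvd.size()))

model = keras.models.Sequential([keras.layers.Dense(10, activation='softmax',
                                                    input_shape=(784,))])
model.compile(loss='categorical_crossentropy', optimizer=opt)

callbacks = [
    # Sync initial weights from rank 0 to all other ranks.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Runs on every rank; the learning rate stays consistent across workers
    # as long as all ranks see the same monitored metric.
    keras.callbacks.ReduceLROnPlateau(monitor='loss', patience=3),
]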
Thanks for a great library.
I have 2 quick questions.
In the context of the Keras example, each GPU reads the same set of images in each epoch, possibly in a different order.
So the time it takes to complete an epoch is independent of the number of GPUs. Is this understanding correct?
Also, different GPUs may end up reading some of the same images within a batch. Is there an easy way to make sure that each GPU reads a different subset of the images?
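For what it's worth: in the Keras example every worker does iterate over the full dataset, so epoch wall-time doesn't shrink with more GPUs (only aggregate examples/sec does). One easy way to give each worker a disjoint subset is to shard the input pipeline by rank, sketched here with tf.data ('train.tfrecords' is a placeholder file name):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Each worker keeps every hvd.size()-th record, offset by its own rank, so
# the per-worker shards are disjoint across the job.
dataset = (tf.data.TFRecordDataset('train.tfrecords')  # placeholder input
           .shard(num_shards=hvd.size(), index=hvd.rank())
           .shuffle(buffer_size=10000)
           .batch(32))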
@alsrgv
You are using AsyncOpKernel. Does that overlap computation and communication?
I also don't know the lifecycle of an AsyncOpKernel: will TensorFlow wait for async ops to finish at the end of each iteration?
I am quite curious about Horovod's internals. Thank you so much!
I'm running a Cifar10 implementation adapted for Horovod on a GPU cluster (once with 4 GPUs, once with 8) and I get the following error.
The code works on the local machine once the Horovod calls are removed, so I don't think the problem is in the TensorFlow implementation itself. On the cluster, the mnist example also runs perfectly, so the setup looks okay.
My guess is that it's something with horovod_broadcast.
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[27685,1],0]
Exit code: 1
--------------------------------------------------------------------------
STDERR: INFO:tensorflow:Create CheckpointSaverHook.
Traceback (most recent call last):
File "/data/hadoop-2.8.2.1/tmp/nm-local-dir/usercache/WLxuXLFw1KZBatf4JlKiQVx2fB-Nwe4CDQUcvyKfLbY/appcache/application_1508601191819_0234/container_e22_1508601191819_0234_01_000003/cifar_horovod-2.py", line 191, in <module>
with tf.train.SingularMonitoredSession(hooks=hooks, config=config) as sess:
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 750, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 483, in __init__
h.begin()
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/__init__.py", line 115, in begin
self.bcast_op = broadcast_global_variables(self.root_rank)
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/__init__.py", line 90, in broadcast_global_variables
for var in tf.global_variables()])
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/horovod/tensorflow/mpi_ops.py", line 187, in broadcast
return MPI_LIB.horovod_broadcast(tensor, name=name, root_rank=root_rank)
File "<string>", line 90, in horovod_broadcast
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 589, in apply_op
param_name=input_name)
File "/srv/hops-gpu/anaconda/anaconda/envs/robin_allreduce/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'tensor' has DataType bool not in list of allowed values: uint8, int8, uint16, int16, int32, int64, float32, float64
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/11/06 17:38:16 INFO CoarseGrainedExecutorBackend: Got assigned task 11
17/11/06 17:38:16 INFO Executor: Running task 0.3 in stage 0.0 (TID 11)
17/11/06 17:38:16 INFO Executor: Executor is trying to kill task 0.3 in stage 0.0 (TID 11), reason: stage cancelled
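Regarding the TypeError above: the broadcast op in that Horovod build rejects bool tensors, and tf.global_variables() can contain boolean variables (for example flags created with tf.Variable(False, trainable=False)). A minimal workaround sketch, assuming the boolean variables don't actually need to be synchronized, is to mirror hvd.broadcast_global_variables() but filter them out:

import tensorflow as tf
import horovod.tensorflow as hvd

def broadcast_non_bool_globals(root_rank):
    # Same pattern as hvd.broadcast_global_variables(), minus the dtypes
    # the broadcast op cannot handle. Note: skipped bool variables will
    # NOT be synchronized across ranks.
    return tf.group(*[tf.assign(var, hvd.broadcast(var, root_rank))
                      for var in tf.global_variables()
                      if var.dtype.base_dtype != tf.bool])

bcast_op = broadcast_non_bool_globals(0)  # run once after initialization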