Comments (22)
@alsrgv thanks so much, I finally ran the example successfully!
my command:
mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py
I am so excited! I can try my own TensorFlow example now!
Thanks again to @alsrgv.
In the same environment, I ran the Open MPI example executable and it returned success:
$ mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 3 -x LD_LIBRARY_PATH -H hvd-1:1,hvd-2:1,hvd-3:1 ./hello_c
Warning: Permanently added '[hvd-1]:33,[10.87.217.233]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.211]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-2]:33,[10.87.217.230]:33' (ECDSA) to the list of known hosts.
Hello, world, I am 1 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
Hello, world, I am 2 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
Hello, world, I am 0 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
@ydp, can you try running the MPI Allreduce example from MPI Tutorial GitHub? I'm wondering if the example that you ran actually did any communication, or just printed out ranks and exited.
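For reference, a minimal allreduce test that forces real inter-rank communication might look roughly like the sketch below (illustrative only, not the actual mpitutorial code):

/* allreduce_test.c - check that MPI_Allreduce actually moves data between ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* each rank contributes rank + 1 */
    int global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* With N ranks, every rank should print the same sum N*(N+1)/2. */
    printf("rank %d of %d: allreduce sum = %d\n", rank, size, global);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc allreduce_test.c -o allreduce_test and launch it with the same mpirun flags as the hello_c run above; if the allreduce hangs or crashes while hello_c works, the problem is in the communication path rather than in process launch.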
@alsrgv here are my results for reduce_avg and reduce_stddev:
root@node212:~# mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 4 -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1,hvd-2:1,hvd-3:1 ./reduce_avg 100
Warning: Permanently added '[hvd-3]:33,[10.87.217.220]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.216.72]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-2]:33,[10.87.216.46]:33' (ECDSA) to the list of known hosts.
Local sum for process 0 - 54.682476, avg = 0.546825
Local sum for process 2 - 47.458782, avg = 0.474588
Local sum for process 3 - 51.602047, avg = 0.516020
Local sum for process 1 - 49.442955, avg = 0.494430
Total sum = 203.186264, avg = 0.507966
root@node212:~# mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 4 -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1,hvd-2:1,hvd-3:1 ./reduce_stddev 100
Warning: Permanently added '[hvd-2]:33,[10.87.216.46]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.220]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.216.72]:33' (ECDSA) to the list of known hosts.
Mean - 0.530262, Standard deviation = 0.293936
Since the error message is Bad Address, I modified the code in Open MPI and recompiled it. The code I added is here:
    /* Inside the writev retry loop of mca_btl_tcp_frag_send() in btl_tcp_frag.c. */
    cnt = writev(sd, frag->iov_ptr, frag->iov_cnt);

    /* Debug: log the peer address of this socket. */
    struct sockaddr_in name;
    socklen_t namelen = sizeof(name);   /* getpeername() expects a socklen_t, not size_t */
    if (getpeername(sd, (struct sockaddr *)&name, &namelen) == 0) {
        char ipaddr[16];
        memset(ipaddr, '\0', sizeof(ipaddr));
        strncpy(ipaddr, inet_ntoa(name.sin_addr), sizeof(ipaddr) - 1);
        BTL_ERROR(("getpeername %s", ipaddr));
    }

    /* Debug: log the beginning of the outgoing fragment. */
    char dt[1024];
    memset(dt, '\0', sizeof(dt));
    if (frag->iov_ptr[0].iov_base == NULL) {
        BTL_ERROR(("null data: "));
        return false;
    }
    /* Clamp the copy so a large iov_len cannot overflow dt. */
    strncpy(dt, frag->iov_ptr[0].iov_base,
            frag->iov_ptr[0].iov_len < sizeof(dt) - 1 ? frag->iov_ptr[0].iov_len : sizeof(dt) - 1);
    BTL_ERROR(("data: %s", dt));

    if (cnt < 0) {
        switch (opal_socket_errno) {
        case EINTR:
            continue;        /* interrupted: retry the writev */
        case EWOULDBLOCK:
            return false;    /* socket not ready; try again later */
        case EFAULT:
            BTL_ERROR(("mca_btl_tcp_frag_send: 0-writev error (%p, %lu)\n\t%s(%lu)\n",
                       frag->iov_ptr[0].iov_base, (unsigned long) frag->iov_ptr[0].iov_len,
                       strerror(opal_socket_errno), (unsigned long) frag->iov_cnt));
            frag->endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
            mca_btl_tcp_endpoint_close(frag->endpoint);
The output is:
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212:142138] *** Process received signal ***
[node212:142138] Signal: Segmentation fault (11)
[node212:142138] Signal code: Invalid permissions (2)
[node212:142138] Failing at address: 0x1b0f216200
[node212:142138] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7feb4313c390]
[node212:142138] [ 1] /lib/x86_64-linux-gnu/libc.so.6(strnlen+0x40)[0x7feb42dec900]
[node212:142138] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x9f53e)[0x7feb42e0053e]
[node212:142138] [ 3] /usr/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_frag_send+0xb5)[0x7fea6abbcef5]
[node212:142138] [ 4] /usr/lib/openmpi/mca_btl_tcp.so(+0x83f8)[0x7fea6abbb3f8]
[node212:142138] [ 5] /usr/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x7f3)[0x7feb10191503]
[node212:142138] [ 6] /usr/lib/libopen-pal.so.40(opal_progress+0x111)[0x7feb1014b6d1]
[node212:142138] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x7cd)[0x7fea6a795e9d]
[node212:142138] [ 8] /usr/lib/libmpi.so.40(ompi_coll_base_bcast_intra_split_bintree+0x76f)[0x7feb1076f21f]
[node212:142138] [ 9] /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x126)[0x7fea6a378776]
[node212:142138] [10] /usr/lib/libmpi.so.40(MPI_Bcast+0x1a9)[0x7feb10732679]
It seems to be communicating, but after a while it hits a segmentation fault.
I made a few other tests. Now I have 2 GPUs in one Docker container.
Running the command:
mpirun --allow-run-as-root -np 2 python tensorflow_mnist.py
gives the same error as across different Docker containers, but when I change it to -np 1, it runs successfully:
mpirun --allow-run-as-root -np 1 python tensorflow_mnist.py
So I guess something is wrong with communication between GPUs, not with communication between MPI nodes.
I found something here:
https://stackoverflow.com/questions/18070428/simple-mpi-send-and-recv-gives-segmentation-fault-11-and-invalid-permission-2
Maybe it is relevant, but I'm not sure how to solve it.
@ydp, did you by any chance install Horovod with HOROVOD_GPU_ALLREDUCE=MPI, HOROVOD_GPU_ALLGATHER=MPI, or HOROVOD_GPU_BROADCAST=MPI?
Yes, I installed it with those flags. Since I don't have NCCL or RDMA, I used MPI.
I am wondering if I need to build MPI with CUDA-aware support.
@ydp, HOROVOD_GPU_ALLREDUCE=MPI is intended for very special situations, e.g. for people running on Cray systems.
In most cases, you should simply follow the NCCL 2 + Horovod guide. NCCL 2 will give you better performance on GPUs than pure MPI.
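For reference, reinstalling Horovod against NCCL 2 usually looks something like the sketch below; the NCCL path is a placeholder, so point HOROVOD_NCCL_HOME at wherever NCCL 2 is actually installed:

pip uninstall -y horovod
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl2 pip install --no-cache-dir horovod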
@alsrgv I really appreciate your help, and sorry for my big mistake. I changed it to NCCL 2; running with -np 2 now succeeds, but running on 2 different Docker containers gives the following error:
2017-10-30 06:21:13.437817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:21:13.437825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:21:13.437838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-30 06:21:13.457447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-30 06:21:13.458057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-30 06:21:13.458089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:21:13.458099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:21:13.458114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23213,1],0]
Exit code: 1
--------------------------------------------------------------------------
My TensorFlow version is tensorflow-gpu (1.2.0).
@ydp, no worries. For your ncclCommInitRank error, take a look at this troubleshooting section.
Adding debug info shows the detailed error message:
2017-10-30 06:51:30.780619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:51:30.780627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:51:30.780641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
node224:4549:4609 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
node224:4549:4609 [0] INFO Using internal Network Socket
node224:4549:4609 [0] INFO NET : Using interface eth0:10.87.217.224<0>
node224:4549:4609 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node224:4549:4609 [0] INFO NET : Using interface flannel.1:192.168.33.0<0>
node224:4549:4609 [0] INFO NET : Using interface cni0:192.168.33.1<0>
node224:4549:4609 [0] INFO NET/Socket : 4 interfaces found
NCCL version 2.0.5 compiled with CUDA 9.0
node217:3339:3399 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
node217:3339:3399 [0] INFO Using internal Network Socket
node217:3339:3399 [0] INFO NET : Using interface eth0:10.87.217.217<0>
node217:3339:3399 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node217:3339:3399 [0] INFO NET : Using interface flannel.1:192.168.26.0<0>
node217:3339:3399 [0] INFO NET/Socket : 3 interfaces found
node224:4549:4609 [0] INFO Using 256 threads
node224:4549:4609 [0] INFO [0] Ring 0 : 0 1
node224:4549:4609 [0] INFO [0] Ring 1 : 0 1
node224:4549:4609 [0] INFO [0] Ring 2 : 0 1
node217:3339:3399 [0] INFO 0 -> 1 via NET/Socket/0
node224:4549:4609 [0] INFO 1 -> 0 via NET/Socket/0
node217:3339:3399 [0] INFO 0 -> 1 via NET/Socket/1
node224:4549:4609 [0] INFO 1 -> 0 via NET/Socket/1
node224:4549:4609 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
node224:4549:4609 [0] INFO transport/net_socket.cu:103 -> 2
node224:4549:4609 [0] INFO include/net.h:28 -> 2 [Net]
node224:4549:4609 [0] INFO transport/net.cu:254 -> 2
node224:4549:4609 [0] INFO init.cu:373 -> 2
node224:4549:4609 [0] INFO init.cu:432 -> 2
node217:3339:3399 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
node217:3339:3399 [0] INFO transport/net_socket.cu:103 -> 2
node217:3339:3399 [0] INFO include/net.h:28 -> 2 [Net]
node217:3339:3399 [0] INFO transport/net.cu:254 -> 2
node217:3339:3399 [0] INFO init.cu:373 -> 2
node217:3339:3399 [0] INFO init.cu:432 -> 2
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownErrorreturn self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[17273,1],0]
Exit code: 1
--------------------------------------------------------------------------
my command:
mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_DEBUG -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py
I don't know why it still found 4 interfaces even though I used btl_tcp_if_include.
Yes, NCCL found a few interfaces. NCCL interface selection is independent of MPI:
node217:3339:3399 [0] INFO NET : Using interface eth0:10.87.217.217<0>
node217:3339:3399 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node217:3339:3399 [0] INFO NET : Using interface flannel.1:192.168.26.0<0>
You can narrow it down to eth0 by adding -x NCCL_SOCKET_IFNAME=eth0 to your mpirun command, which will set the NCCL_SOCKET_IFNAME environment variable.
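For example, the command from the previous comment with that flag added would be:

mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_DEBUG -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py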
Closing this issue. Feel free to reopen if you have more questions.
I had the same problem. Finally I tried NCCL_SOCKET_IFNAME=^docker0 and everything works fine. docker0 should be excluded from both MPI (via --mca) and NCCL (via NCCL_SOCKET_IFNAME). I think running.md or troubleshooting.md should be updated to include this.
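For example, excluding docker0 at both layers might look roughly like this (a sketch; adjust interface and host names to your setup):

mpirun --allow-run-as-root --mca oob_tcp_if_exclude lo,docker0 --mca btl_tcp_if_exclude lo,docker0 -np 2 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=^docker0 -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py

Note that overriding the Open MPI exclude lists replaces the defaults, so the loopback interface should normally stay in them.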
@xy-xin, great find! I didn't know you could exclude interfaces for NCCL as well. Will add in next doc update.
Thanks @alsrgv .
BTW, here's my reference, FYI @alsrgv: https://devtalk.nvidia.com/default/topic/1023946/nccl-2-0-support-inter-node-communication-using-sockets-/