Comments (22)
@alsrgv thanks so much, I finally ran the example successfully!
my command:
mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_DEBUG -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py
I am so excited! I can try my own TensorFlow example now!
Thanks again to @alsrgv.
In the same environment, I ran the Open MPI example executable and it returned success:
$ mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 3 -x LD_LIBRARY_PATH -H hvd-1:1,hvd-2:1,hvd-3:1 ./hello_c
Warning: Permanently added '[hvd-1]:33,[10.87.217.233]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.211]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-2]:33,[10.87.217.230]:33' (ECDSA) to the list of known hosts.
Hello, world, I am 1 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
Hello, world, I am 2 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
Hello, world, I am 0 of 3, (Open MPI v3.0.0, package: Open MPI root@0b3bcef148b5 Distribution, ident: 3.0.0, repo rev: v3.0.0, Sep 12, 2017, 112)
@ydp, can you try running the MPI Allreduce example from MPI Tutorial GitHub? I'm wondering if the example that you ran actually did any communication, or just printed out ranks and exited.
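For reference, a minimal allreduce test that forces real inter-rank communication might look roughly like the sketch below (illustrative only, not the actual mpitutorial code):

/* allreduce_test.c - check that MPI_Allreduce actually moves data between ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int local = rank + 1;   /* each rank contributes rank + 1 */
    int global = 0;
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* With N ranks, every rank should print the same sum N*(N+1)/2. */
    printf("rank %d of %d: allreduce sum = %d\n", rank, size, global);
    MPI_Finalize();
    return 0;
}

Compile it with mpicc allreduce_test.c -o allreduce_test and launch it with the same mpirun flags as the hello_c run above; if the allreduce hangs or crashes while hello_c works, the problem is in the communication path rather than in process launch.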
@alsrgv here are my results for reduce_avg and reduce_stddev:
root@node212:~# mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 4 -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1,hvd-2:1,hvd-3:1 ./reduce_avg 100
Warning: Permanently added '[hvd-3]:33,[10.87.217.220]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.216.72]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-2]:33,[10.87.216.46]:33' (ECDSA) to the list of known hosts.
Local sum for process 0 - 54.682476, avg = 0.546825
Local sum for process 2 - 47.458782, avg = 0.474588
Local sum for process 3 - 51.602047, avg = 0.516020
Local sum for process 1 - 49.442955, avg = 0.494430
Total sum = 203.186264, avg = 0.507966
root@node212:~# mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 4 -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1,hvd-2:1,hvd-3:1 ./reduce_stddev 100
Warning: Permanently added '[hvd-2]:33,[10.87.216.46]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-3]:33,[10.87.217.220]:33' (ECDSA) to the list of known hosts.
Warning: Permanently added '[hvd-1]:33,[10.87.216.72]:33' (ECDSA) to the list of known hosts.
Mean - 0.530262, Standard deviation = 0.293936
Since the error message is Bad Address, I modified the code in Open MPI and recompiled it. The code I added is here:
    /* Inside the writev retry loop of mca_btl_tcp_frag_send() in btl_tcp_frag.c. */
    cnt = writev(sd, frag->iov_ptr, frag->iov_cnt);

    /* Debug: log the peer address of this socket. */
    struct sockaddr_in name;
    socklen_t namelen = sizeof(name);   /* getpeername() expects a socklen_t, not size_t */
    if (getpeername(sd, (struct sockaddr *)&name, &namelen) == 0) {
        char ipaddr[16];
        memset(ipaddr, '\0', sizeof(ipaddr));
        strncpy(ipaddr, inet_ntoa(name.sin_addr), sizeof(ipaddr) - 1);
        BTL_ERROR(("getpeername %s", ipaddr));
    }

    /* Debug: log the beginning of the outgoing fragment. */
    char dt[1024];
    memset(dt, '\0', sizeof(dt));
    if (frag->iov_ptr[0].iov_base == NULL) {
        BTL_ERROR(("null data: "));
        return false;
    }
    /* Clamp the copy so a large iov_len cannot overflow dt. */
    strncpy(dt, frag->iov_ptr[0].iov_base,
            frag->iov_ptr[0].iov_len < sizeof(dt) - 1 ? frag->iov_ptr[0].iov_len : sizeof(dt) - 1);
    BTL_ERROR(("data: %s", dt));

    if (cnt < 0) {
        switch (opal_socket_errno) {
        case EINTR:
            continue;        /* interrupted: retry the writev */
        case EWOULDBLOCK:
            return false;    /* socket not ready; try again later */
        case EFAULT:
            BTL_ERROR(("mca_btl_tcp_frag_send: 0-writev error (%p, %lu)\n\t%s(%lu)\n",
                       frag->iov_ptr[0].iov_base, (unsigned long) frag->iov_ptr[0].iov_len,
                       strerror(opal_socket_errno), (unsigned long) frag->iov_cnt));
            frag->endpoint->endpoint_state = MCA_BTL_TCP_FAILED;
            mca_btl_tcp_endpoint_close(frag->endpoint);
The output is:
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.46
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212][[28029,1],0][btl_tcp_frag.c:136:mca_btl_tcp_frag_send] data: A
[node212][[28029,1],0][btl_tcp_frag.c:127:mca_btl_tcp_frag_send] getpeername 10.87.216.72
[node212:142138] *** Process received signal ***
[node212:142138] Signal: Segmentation fault (11)
[node212:142138] Signal code: Invalid permissions (2)
[node212:142138] Failing at address: 0x1b0f216200
[node212:142138] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7feb4313c390]
[node212:142138] [ 1] /lib/x86_64-linux-gnu/libc.so.6(strnlen+0x40)[0x7feb42dec900]
[node212:142138] [ 2] /lib/x86_64-linux-gnu/libc.so.6(+0x9f53e)[0x7feb42e0053e]
[node212:142138] [ 3] /usr/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_frag_send+0xb5)[0x7fea6abbcef5]
[node212:142138] [ 4] /usr/lib/openmpi/mca_btl_tcp.so(+0x83f8)[0x7fea6abbb3f8]
[node212:142138] [ 5] /usr/lib/libopen-pal.so.40(opal_libevent2022_event_base_loop+0x7f3)[0x7feb10191503]
[node212:142138] [ 6] /usr/lib/libopen-pal.so.40(opal_progress+0x111)[0x7feb1014b6d1]
[node212:142138] [ 7] /usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x7cd)[0x7fea6a795e9d]
[node212:142138] [ 8] /usr/lib/libmpi.so.40(ompi_coll_base_bcast_intra_split_bintree+0x76f)[0x7feb1076f21f]
[node212:142138] [ 9] /usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_bcast_intra_dec_fixed+0x126)[0x7fea6a378776]
[node212:142138] [10] /usr/lib/libmpi.so.40(MPI_Bcast+0x1a9)[0x7feb10732679]
It seems to be communicating, but after a while it hits a segmentation fault.
I made a few other tests. Now I have 2 GPUs in one Docker container.
Running the command:
mpirun --allow-run-as-root -np 2 python tensorflow_mnist.py
gives the same error as across different Docker containers, but when I change it to -np 1, it runs successfully:
mpirun --allow-run-as-root -np 1 python tensorflow_mnist.py
So I guess something is wrong with communication between GPUs, not with communication between MPI nodes.
I found something here:
https://stackoverflow.com/questions/18070428/simple-mpi-send-and-recv-gives-segmentation-fault-11-and-invalid-permission-2
Maybe it is relevant, but I'm not sure how to solve it.
@ydp, did you by any chance install Horovod with HOROVOD_GPU_ALLREDUCE=MPI, HOROVOD_GPU_ALLGATHER=MPI, or HOROVOD_GPU_BROADCAST=MPI?
Yes, I installed it with those flags. Since I don't have NCCL or RDMA, I used MPI.
I am wondering if I need to build MPI with CUDA-aware support.
@ydp, HOROVOD_GPU_ALLREDUCE=MPI is intended for very special situations, e.g. for people running on Cray systems.
In most cases, you should simply follow the NCCL 2 + Horovod guide. NCCL 2 will give you better performance on GPUs than pure MPI.
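For reference, reinstalling Horovod against NCCL 2 usually looks something like the sketch below; the NCCL path is a placeholder, so point HOROVOD_NCCL_HOME at wherever NCCL 2 is actually installed:

pip uninstall -y horovod
HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_NCCL_HOME=/usr/local/nccl2 pip install --no-cache-dir horovod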
@alsrgv I really appreciate your help, and sorry for my big mistake. I changed it to NCCL 2; running with -np 2 now succeeds, but running on 2 different Docker containers gives the following error:
2017-10-30 06:21:13.437817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:21:13.437825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:21:13.437838: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
2017-10-30 06:21:13.457447: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-10-30 06:21:13.458057: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: Tesla M40
major: 5 minor: 2 memoryClockRate (GHz) 1.112
pciBusID 0000:00:08.0
Total memory: 11.95GiB
Free memory: 11.84GiB
2017-10-30 06:21:13.458089: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:21:13.458099: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:21:13.458114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[23213,1],0]
Exit code: 1
--------------------------------------------------------------------------
My TensorFlow version is tensorflow-gpu (1.2.0).
@ydp, no worries. For your ncclCommInitRank error, take a look at this troubleshooting section.
Adding debug info shows the detailed error message:
2017-10-30 06:51:30.780619: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-30 06:51:30.780627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-30 06:51:30.780641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla M40, pci bus id: 0000:00:08.0)
node224:4549:4609 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
node224:4549:4609 [0] INFO Using internal Network Socket
node224:4549:4609 [0] INFO NET : Using interface eth0:10.87.217.224<0>
node224:4549:4609 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node224:4549:4609 [0] INFO NET : Using interface flannel.1:192.168.33.0<0>
node224:4549:4609 [0] INFO NET : Using interface cni0:192.168.33.1<0>
node224:4549:4609 [0] INFO NET/Socket : 4 interfaces found
NCCL version 2.0.5 compiled with CUDA 9.0
node217:3339:3399 [0] misc/ibvwrap.cu:60 WARN Failed to open libibverbs.so[.1]
node217:3339:3399 [0] INFO Using internal Network Socket
node217:3339:3399 [0] INFO NET : Using interface eth0:10.87.217.217<0>
node217:3339:3399 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node217:3339:3399 [0] INFO NET : Using interface flannel.1:192.168.26.0<0>
node217:3339:3399 [0] INFO NET/Socket : 3 interfaces found
node224:4549:4609 [0] INFO Using 256 threads
node224:4549:4609 [0] INFO [0] Ring 0 : 0 1
node224:4549:4609 [0] INFO [0] Ring 1 : 0 1
node224:4549:4609 [0] INFO [0] Ring 2 : 0 1
node217:3339:3399 [0] INFO 0 -> 1 via NET/Socket/0
node224:4549:4609 [0] INFO 1 -> 0 via NET/Socket/0
node217:3339:3399 [0] INFO 0 -> 1 via NET/Socket/1
node224:4549:4609 [0] INFO 1 -> 0 via NET/Socket/1
node224:4549:4609 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
node224:4549:4609 [0] INFO transport/net_socket.cu:103 -> 2
node224:4549:4609 [0] INFO include/net.h:28 -> 2 [Net]
node224:4549:4609 [0] INFO transport/net.cu:254 -> 2
node224:4549:4609 [0] INFO init.cu:373 -> 2
node224:4549:4609 [0] INFO init.cu:432 -> 2
node217:3339:3399 [0] include/socket.h:185 WARN Call to connect failed : Connection refused
node217:3339:3399 [0] INFO transport/net_socket.cu:103 -> 2
node217:3339:3399 [0] INFO include/net.h:28 -> 2 [Net]
node217:3339:3399 [0] INFO transport/net.cu:254 -> 2
node217:3339:3399 [0] INFO init.cu:373 -> 2
node217:3339:3399 [0] INFO init.cu:432 -> 2
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
Traceback (most recent call last):
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 106, in main
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
mon_sess.run(train_op, feed_dict={image: image_, label: label_})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 505, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 952, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 798, in run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownErrorreturn self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
Caused by op u'DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0', defined at:
File "tensorflow_mnist.py", line 110, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "tensorflow_mnist.py", line 84, in main
train_op = opt.minimize(loss, global_step=global_step)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 315, in minimize
grad_loss=grad_loss)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 174, in compute_gradients
device_sparse=self._device_sparse)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py", line 76, in allreduce
summed_tensor = _allreduce(tensor)
File "/usr/local/lib/python2.7/dist-packages/horovod/tensorflow/mpi_ops.py", line 145, in _allreduce
return MPI_LIB.horovod_allreduce(tensor, name=name)
File "<string>", line 39, in horovod_allreduce
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
UnknownError (see above for traceback): ncclCommInitRank failed: unhandled system error
[[Node: DistributedRMSPropOptimizer_Allreduce/HorovodAllreduce_gradients_conv_layer1_Conv_convolution_grad_tuple_control_dependency_1_0 = HorovodAllreduce[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](gradients/conv_layer1/Conv/convolution_grad/tuple/control_dependency_1)]]
[[Node: DistributedRMSPropOptimizer/update/_254 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_445_DistributedRMSPropOptimizer/update", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[17273,1],0]
Exit code: 1
--------------------------------------------------------------------------
my command:
mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_DEBUG -x LD_LIBRARY_PATH -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py
I don't know why it still found 4 interfaces even though I used btl_tcp_if_include.
Yes, NCCL found a few interfaces. NCCL interface selection is independent of MPI:
node217:3339:3399 [0] INFO NET : Using interface eth0:10.87.217.217<0>
node217:3339:3399 [0] INFO NET : Using interface docker0:172.17.0.1<0>
node217:3339:3399 [0] INFO NET : Using interface flannel.1:192.168.26.0<0>
You can narrow it down to eth0 by adding -x NCCL_SOCKET_IFNAME=eth0 to your mpirun command, which will set the NCCL_SOCKET_IFNAME environment variable.
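For example, the command from the previous comment with that flag added would be:

mpirun --allow-run-as-root --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 -x NCCL_DEBUG -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=eth0 -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py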
Closing this issue. Feel free to reopen if you have more questions.
I had the same problem. Finally I tried NCCL_SOCKET_IFNAME=^docker0 and everything works fine. docker0 should be excluded from both MPI (via --mca) and NCCL (via NCCL_SOCKET_IFNAME). I think running.md or troubleshooting.md should be updated to include this.
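For example, excluding docker0 at both layers might look roughly like this (a sketch; adjust interface and host names to your setup):

mpirun --allow-run-as-root --mca oob_tcp_if_exclude lo,docker0 --mca btl_tcp_if_exclude lo,docker0 -np 2 -x LD_LIBRARY_PATH -x NCCL_SOCKET_IFNAME=^docker0 -H hvd-0:1,hvd-1:1 python tensorflow_mnist.py

Note that overriding the Open MPI exclude lists replaces the defaults, so the loopback interface should normally stay in them.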
@xy-xin, great find! I didn't know you could exclude interfaces for NCCL as well. Will add in next doc update.
Thanks @alsrgv .
BTW, here's my reference, FYI @alsrgv: https://devtalk.nvidia.com/default/topic/1023946/nccl-2-0-support-inter-node-communication-using-sockets-/