Comments (4)
The kungfu-run command should be executed on both machines.
from kungfu.
Yes, I can run it on both machines, but in cluster mode it hangs. Maybe the network interface has a problem.
Could you share the log from the other machine?
You can also turn on the debug log:
export KUNGFU_CONFIG_LOG_LEVEL=DEBUG
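For reference, the variable can also be set for a single invocation instead of being exported for the whole session (a minimal sketch, assuming a POSIX shell; `env` stands in here for the real `kungfu-run` command so the snippet is self-contained):

```shell
# Per-command environment assignment: only the child process sees the
# variable, the current shell is left untouched. In a real run you would
# replace `env` with the kungfu-run command line.
KUNGFU_CONFIG_LOG_LEVEL=DEBUG env | grep '^KUNGFU_CONFIG_LOG_LEVEL='
```
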
I fixed this issue.
The two servers have different NIC names, so I had misconfigured the -nic option.
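Since NIC names can differ between machines (e.g. eno1 on one box and enp0s31f6 on another; the second name is just an illustrative example), a quick sanity check before launching is to list each machine's interfaces and pass that machine's own name to -nic:

```shell
# List the network interface names known to this Linux host; pick the
# one that carries the cluster network and pass it to -nic on THIS host.
ls /sys/class/net
# e.g. on each machine:
#   kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic <local-nic> ...
```
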
Log from server 1:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.163/24, fe80::b9b2:6891:c63:5d72/64
[D] Using self=10.208.209.163
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.163.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.163.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.163.10000::stdout] [D] using name based hash
[10.208.209.163.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.163.10000::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 227.647µs
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint16 = np.dtype([("qint16", np.int16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint32 = np.dtype([("qint32", np.int32, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] np_resource = np.dtype([("resource", np.ubyte, 1)])
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Colocations handled automatically by placer.
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Use tf.cast instead.
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.696298: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.793487: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796763: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23bdbf0 executing computations on platform CUDA. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796773: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.817165: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818181: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23de620 executing computations on platform Host. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818192: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
[10.208.209.163.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.163.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.163.10000::stderr] totalMemory: 10.91GiB freeMemory: 10.35GiB
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
[10.208.209.163.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.163.10000::stdout] training
[10.208.209.163.10000::stderr] 2020-01-18 09:49:09.305317: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] training accuracy: 0.360000
[10.208.209.163.10000::stdout] validation accuracy: 0.329400
[10.208.209.163.10000::stdout] training accuracy: 0.900000
[10.208.209.163.10000::stdout] validation accuracy: 0.885000
[10.208.209.163.10000::stdout] training accuracy: 0.940000
[10.208.209.163.10000::stdout] validation accuracy: 0.896000
[10.208.209.163.10000::stdout] training accuracy: 0.920000
[10.208.209.163.10000::stdout] validation accuracy: 0.901700
[10.208.209.163.10000::stdout] test accuracy: 0.902400
[10.208.209.163.10000::stdout] [D] Server Closed
[D] #<10.208.209.163.10000> finished successfully
[I] all 1/3 local peers finished, took 21.875099576s
[D] kungfu-run finished, took 21.875377819s
Log from server 2:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.171/24, fe80::7af5:7968:e59f:de55/64
[nic] [2] docker0 :: 192.168.99.1/24, fe80::42:77ff:fe22:c7fd/64
[nic] [3] vetha6b3b40 :: fe80::f09b:a4ff:fe5f:fce0/64
[nic] [4] virbr0 :: 192.168.122.1/24
[nic] [5] virbr0-nic ::
[nic] [6] veth1598cdf :: fe80::28d0:b6ff:fe46:d0d1/64
[D] Using self=10.208.209.171
[I] will parallel run 2 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.171.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.171.10001::stdout] [D] listening: 0.0.0.0:10001
[10.208.209.171.10001::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10001::stdout] [D] using name based hash
[10.208.209.171.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10000::stdout] [D] using name based hash
[10.208.209.171.10001::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 69.678µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10001
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.163:10000> established after 1 trials, took 210.27µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.163:10000
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.171:10001> established after 1 trials, took 53.139µs
[10.208.209.171.10001::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10000::stderr] Instructions for updating:
[10.208.209.171.10000::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10001::stderr] Instructions for updating:
[10.208.209.171.10001::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.083367: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.094570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[10.208.209.171.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.171.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095188: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095222: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095250: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095277: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095332: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097481: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097507: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
[10.208.209.171.10000::stderr] Skipping registering GPU devices...
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097763: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.113296: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115491: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115507: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115511: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115549: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115724: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.118929: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119398: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a7f3e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119408: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.121817: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122286: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47ecba0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122296: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179944: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47fb3b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179954: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
[10.208.209.171.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stdout] training
[10.208.209.171.10001::stdout] training
[10.208.209.171.10000::stdout] training accuracy: 0.460000
[10.208.209.171.10001::stdout] training accuracy: 0.480000
[10.208.209.171.10001::stdout] validation accuracy: 0.329400
[10.208.209.171.10000::stdout] validation accuracy: 0.329400
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] training accuracy: 0.920000
[10.208.209.171.10001::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] training accuracy: 0.900000
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] validation accuracy: 0.896000
[10.208.209.171.10001::stdout] validation accuracy: 0.896000
[10.208.209.171.10000::stdout] training accuracy: 0.940000
[10.208.209.171.10001::stdout] training accuracy: 0.960000
[10.208.209.171.10000::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] test accuracy: 0.902400
[10.208.209.171.10000::stdout] test accuracy: 0.902400
[10.208.209.171.10001::stdout] [D] Server Closed
[10.208.209.171.10000::stdout] [D] Server Closed
[D] #<10.208.209.171.10001> finished successfully
[D] #<10.208.209.171.10000> finished successfully
[I] all 2/3 local peers finished, took 2.948737336s
[D] kungfu-run finished, took 2.949251703s
So it works well now.
Thanks!