Comments (4)
The kungfu-run command should be executed on both machines.
from kungfu.
Yes, I can run it on both machines, but in cluster mode it hangs. Maybe the network interface has a problem.
Could you share the log from the other machine?
You can also turn on the debug log:
export KUNGFU_CONFIG_LOG_LEVEL=DEBUG
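For reference, the variable can also be set for a single invocation instead of being exported for the whole session (a minimal sketch, assuming a POSIX shell; `env` stands in here for the real `kungfu-run` command so the snippet is self-contained):

```shell
# Per-command environment assignment: only the child process sees the
# variable, the current shell is left untouched. In a real run you would
# replace `env` with the kungfu-run command line.
KUNGFU_CONFIG_LOG_LEVEL=DEBUG env | grep '^KUNGFU_CONFIG_LOG_LEVEL='
```
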
I fixed this issue.
The two servers have different NIC names, so I had misconfigured the -nic option.
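Since NIC names can differ between machines (e.g. eno1 on one box and enp0s31f6 on another; the second name is just an illustrative example), a quick sanity check before launching is to list each machine's interfaces and pass that machine's own name to -nic:

```shell
# List the network interface names known to this Linux host; pick the
# one that carries the cluster network and pass it to -nic on THIS host.
ls /sys/class/net
# e.g. on each machine:
#   kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic <local-nic> ...
```
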
Log from server 1:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.163/24, fe80::b9b2:6891:c63:5d72/64
[D] Using self=10.208.209.163
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.163.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.163.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.163.10000::stdout] [D] using name based hash
[10.208.209.163.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.163.10000::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 227.647µs
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint8 = np.dtype([("qint8", np.int8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint16 = np.dtype([("qint16", np.int16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] _np_qint32 = np.dtype([("qint32", np.int32, 1)])
[10.208.209.163.10000::stderr] /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
[10.208.209.163.10000::stderr] np_resource = np.dtype([("resource", np.ubyte, 1)])
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:435: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Colocations handled automatically by placer.
[10.208.209.163.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
[10.208.209.163.10000::stderr] Instructions for updating:
[10.208.209.163.10000::stderr] Use tf.cast instead.
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.696298: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.793487: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796763: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23bdbf0 executing computations on platform CUDA. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.796773: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.817165: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3696000000 Hz
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818181: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x23de620 executing computations on platform Host. Devices:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818192: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
[10.208.209.163.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.163.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.163.10000::stderr] totalMemory: 10.91GiB freeMemory: 10.35GiB
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818704: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
[10.208.209.163.10000::stderr] 2020-01-18 09:49:08.818741: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10064 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
[10.208.209.163.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.163.10000::stdout] training
[10.208.209.163.10000::stderr] 2020-01-18 09:49:09.305317: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] training accuracy: 0.360000
[10.208.209.163.10000::stdout] validation accuracy: 0.329400
[10.208.209.163.10000::stdout] training accuracy: 0.900000
[10.208.209.163.10000::stdout] validation accuracy: 0.885000
[10.208.209.163.10000::stdout] training accuracy: 0.940000
[10.208.209.163.10000::stdout] validation accuracy: 0.896000
[10.208.209.163.10000::stdout] training accuracy: 0.920000
[10.208.209.163.10000::stdout] validation accuracy: 0.901700
[10.208.209.163.10000::stdout] test accuracy: 0.902400
[10.208.209.163.10000::stdout] [D] Server Closed
[D] #<10.208.209.163.10000> finished successfully
[I] all 1/3 local peers finished, took 21.875099576s
[D] kungfu-run finished, took 21.875377819s
Log from server 2:
kungfu-run -np 3 -H 10.208.209.163:1,10.208.209.171:2 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=3
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:2
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=DEBUG
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eno1 :: 10.208.209.171/24, fe80::7af5:7968:e59f:de55/64
[nic] [2] docker0 :: 192.168.99.1/24, fe80::42:77ff:fe22:c7fd/64
[nic] [3] vetha6b3b40 :: fe80::f09b:a4ff:fe5f:fce0/64
[nic] [4] virbr0 :: 192.168.122.1/24
[nic] [5] virbr0-nic ::
[nic] [6] veth1598cdf :: fe80::28d0:b6ff:fe46:d0d1/64
[D] Using self=10.208.209.171
[I] will parallel run 2 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
[10.208.209.171.10000::stdout] [D] listening: 0.0.0.0:10000
[10.208.209.171.10001::stdout] [D] listening: 0.0.0.0:10001
[10.208.209.171.10001::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10001::stdout] [D] using name based hash
[10.208.209.171.10000::stdout] [D] Kungfu::updateTo(10.208.209.163:10000,10.208.209.171:10000,10.208.209.171:10001), 3 peers
[10.208.209.171.10000::stdout] [D] using name based hash
[10.208.209.171.10001::stdout] [D] connection to #<10.208.209.171:10000> established after 1 trials, took 69.678µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.171:10001
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.163:10000> established after 1 trials, took 210.27µs
[10.208.209.171.10000::stdout] [D] got new connection of type Collective from: 10.208.209.163:10000
[10.208.209.171.10000::stdout] [D] connection to #<10.208.209.171:10001> established after 1 trials, took 53.139µs
[10.208.209.171.10001::stdout] [D] got new connection of type Collective from: 10.208.209.171:10000
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10000::stderr] Instructions for updating:
[10.208.209.171.10000::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:69: The name tf.train.GradientDescentOptimizer is deprecated. Please use tf.compat.v1.train.GradientDescentOptimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:7: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:8: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:9: The name tf.mod is deprecated. Please use tf.math.mod instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/compat/init.py:10: The name tf.train.SessionRunHook is deprecated. Please use tf.estimator.SessionRunHook instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:94: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /home/sondv7/.local/lib/python3.6/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
[10.208.209.171.10001::stderr] Instructions for updating:
[10.208.209.171.10001::stderr] If using Keras pass *_constraint arguments to layers.
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:101: The name tf.log is deprecated. Please use tf.math.log instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.083367: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.094570: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095126: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
[10.208.209.171.10000::stderr] name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
[10.208.209.171.10000::stderr] pciBusID: 0000:01:00.0
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095188: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095222: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095250: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095277: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095305: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.095332: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097481: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097507: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
[10.208.209.171.10000::stderr] Skipping registering GPU devices...
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.097763: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:208: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.113296: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115491: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115507: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115511: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: slave
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115534: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115546: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115549: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 430.40.0
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.115724: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.118929: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119398: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a7f3e0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10001::stderr] 2020-01-18 09:49:09.119408: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10001::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.121817: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122286: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47ecba0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.122296: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
[10.208.209.171.10001::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10001::stderr]
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179584: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179944: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47fb3b0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.179954: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180000: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
[10.208.209.171.10000::stderr] 2020-01-18 09:49:09.180004: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]
[10.208.209.171.10000::stdout] step_per_epoch: 333, 333 steps in total
[10.208.209.171.10000::stderr] WARNING:tensorflow:From examples/tf1_mnist_session.py:151: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stderr] WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/kungfu/tensorflow/initializer/init.py:27: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[10.208.209.171.10000::stderr]
[10.208.209.171.10000::stdout] training
[10.208.209.171.10001::stdout] training
[10.208.209.171.10000::stdout] training accuracy: 0.460000
[10.208.209.171.10001::stdout] training accuracy: 0.480000
[10.208.209.171.10001::stdout] validation accuracy: 0.329400
[10.208.209.171.10000::stdout] validation accuracy: 0.329400
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] training accuracy: 0.920000
[10.208.209.171.10001::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] validation accuracy: 0.885000
[10.208.209.171.10000::stdout] training accuracy: 0.900000
[10.208.209.171.10001::stdout] training accuracy: 0.880000
[10.208.209.171.10000::stdout] validation accuracy: 0.896000
[10.208.209.171.10001::stdout] validation accuracy: 0.896000
[10.208.209.171.10000::stdout] training accuracy: 0.940000
[10.208.209.171.10001::stdout] training accuracy: 0.960000
[10.208.209.171.10000::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] validation accuracy: 0.901700
[10.208.209.171.10001::stdout] test accuracy: 0.902400
[10.208.209.171.10000::stdout] test accuracy: 0.902400
[10.208.209.171.10001::stdout] [D] Server Closed
[10.208.209.171.10000::stdout] [D] Server Closed
[D] #<10.208.209.171.10001> finished successfully
[D] #<10.208.209.171.10000> finished successfully
[I] all 2/3 local peers finished, took 2.948737336s
[D] kungfu-run finished, took 2.949251703s
So it works well now.
Thanks!