lsds / kungfu
Fast and Adaptive Distributed Machine Learning for TensorFlow, PyTorch and MindSpore.
License: Apache License 2.0
When using the elastic hook, the global_step becomes useless, so perhaps KungFu should save the number of trained samples to the checkpoint and restore that variable before training resumes.
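A minimal sketch of that idea in plain Python (hypothetical file format and function names, not the KungFu API): persist the trained-sample count next to the model checkpoint so an elastic restart can resume from a sample count instead of a global_step that cluster resizing has invalidated.

```python
import json
import os
import tempfile

# Hypothetical sketch, not KungFu code: store training progress as a
# sample count alongside the checkpoint.
def save_progress(path, trained_samples):
    with open(path, 'w') as f:
        json.dump({'trained_samples': trained_samples}, f)

def restore_progress(path):
    # A fresh job has no progress file yet.
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)['trained_samples']

ckpt = os.path.join(tempfile.mkdtemp(), 'progress.json')
save_progress(ckpt, 153320)
print(restore_progress(ckpt))  # 153320
```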
Note: (A, B) means operator B has a dependency on operator A.
A TensorFlow dataflow:
(forward, loss)
(loss, gradients)
(gradients, reduced_gradients)
(reduced_gradients, optimizer)
(optimizer, forward)
.....
A controllable TensorFlow dataflow:
(forward, loss)
(loss, gradients)
(gradients, reduced_gradients)
(gradients, gradient_noise)
(reduced_gradients, optimizer)
(gradient_noise, controller)
(optimizer, controller)
(controller, forward)
......
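The dependency lists above can be checked with a tiny plain-Python topological sort (illustrative only, not TensorFlow); each pair (A, B) means operator B depends on operator A, and the back edge (controller, forward) that closes the training loop is dropped so one iteration can be ordered.

```python
from collections import defaultdict, deque

# Edges of the controllable dataflow, excluding the loop-closing
# (controller, forward) edge.
edges = [
    ('forward', 'loss'),
    ('loss', 'gradients'),
    ('gradients', 'reduced_gradients'),
    ('gradients', 'gradient_noise'),
    ('reduced_gradients', 'optimizer'),
    ('gradient_noise', 'controller'),
    ('optimizer', 'controller'),
]

def topo_order(edges):
    # Kahn's algorithm: repeatedly emit nodes with no unmet dependencies.
    deps = defaultdict(int)
    outs = defaultdict(list)
    nodes = set()
    for a, b in edges:
        outs[a].append(b)
        deps[b] += 1
        nodes.update((a, b))
    ready = deque(sorted(n for n in nodes if deps[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in outs[n]:
            deps[m] -= 1
            if deps[m] == 0:
                ready.append(m)
    return order

print(topo_order(edges))
# -> ['forward', 'loss', 'gradients', 'reduced_gradients',
#     'gradient_noise', 'optimizer', 'controller']
```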
/src/kungfu # valgrind ./bin/fake-agent
==25== Memcheck, a memory error detector
==25== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==25== Command: ./bin/fake-agent
==25==
==25== Thread 2:
==25== Invalid read of size 8
==25== at 0x4E5660D: runtime.argv_index (/usr/lib/go/src/runtime/runtime1.go:57)
==25== by 0x4E5660D: runtime.sysargs (/usr/lib/go/src/runtime/os_linux.go:206)
==25== by 0x4E669AA: runtime.args (/usr/lib/go/src/runtime/runtime1.go:63)
==25== by 0x4E82353: runtime.rt0_go (/usr/lib/go/src/runtime/asm_amd64.s:193)
==25== by 0x544E06F: ??? (in /src/kungfu/lib/libkungfu.so)
==25== Address 0x2a270388 is not stack'd, malloc'd or (recently) free'd
==25==
==25==
==25== Process terminating with default action of signal 11 (SIGSEGV)
==25== at 0x4E85EB1: runtime.raise (/usr/lib/go/src/runtime/sys_linux_amd64.s:150)
==25== by 0x4E6CBDA: runtime.dieFromSignal (/usr/lib/go/src/runtime/signal_unix.go:424)
==25== by 0x4E6D03C: runtime.sigfwdgo (/usr/lib/go/src/runtime/signal_unix.go:629)
==25== by 0x4E6C27F: runtime.sigtrampgo (/usr/lib/go/src/runtime/signal_unix.go:289)
==25== by 0x4E861A2: runtime.sigtramp (/usr/lib/go/src/runtime/sys_linux_amd64.s:357)
==25== by 0x4044E20: ??? (in /lib/ld-musl-x86_64.so.1)
==25==
==25== HEAP SUMMARY:
==25== in use at exit: 74,480 bytes in 11 blocks
==25== total heap usage: 13 allocs, 2 frees, 74,600 bytes allocated
==25==
==25== LEAK SUMMARY:
==25== definitely lost: 0 bytes in 0 blocks
==25== indirectly lost: 0 bytes in 0 blocks
==25== possibly lost: 0 bytes in 0 blocks
==25== still reachable: 74,480 bytes in 11 blocks
==25== suppressed: 0 bytes in 0 blocks
==25== Rerun with --leak-check=full to see details of leaked memory
==25==
==25== For lists of detected and suppressed errors, rerun with: -s
==25== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
To support partitioned gradient exchange with NCCL, we need to create different order groups for the different values of global_step % n_partitions. The following code snippet shows how we might achieve that using tf.cond:
#!/usr/bin/env python3
import tensorflow as tf

gs = tf.Variable(tf.zeros([]))
advance_gs = tf.assign(gs, gs + 1)
x = tf.Variable(tf.zeros([3]))


def create_run_part(i):
    # gs = 5n + i => x += i, i = 0, ..., 4
    return tf.cond(tf.equal(tf.mod(gs, 5), i),
                   lambda: tf.assign(x, x + i),
                   lambda: x)


def create_step_op():
    with tf.control_dependencies([advance_gs]):
        return tf.group([create_run_part(i) for i in range(5)])


step_op = create_step_op()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(15):
        t, v, _ = sess.run([gs, x, step_op])
        print('step %d, x is %s' % (t, v))
step 1, x is [1. 1. 1.]
step 2, x is [3. 3. 3.]
step 3, x is [6. 6. 6.]
step 4, x is [10. 10. 10.]
step 5, x is [10. 10. 10.]
step 6, x is [11. 11. 11.]
step 7, x is [13. 13. 13.]
step 8, x is [16. 16. 16.]
step 9, x is [20. 20. 20.]
step 10, x is [20. 20. 20.]
step 11, x is [21. 21. 21.]
step 12, x is [23. 23. 23.]
step 13, x is [26. 26. 26.]
step 14, x is [30. 30. 30.]
step 15, x is [30. 30. 30.]
TensorFlow 2.2 supports the following feature:
Support added for global sync BatchNormalization by using the newly added tf.keras.layers.experimental.SyncBatchNormalization layer. This layer will sync BatchNormalization statistics every step across all replicas taking part in sync training.
It would be great if KungFu can support it as well.
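For reference, the statistics synchronization can be sketched in plain Python (an illustration, not the TF or KungFu API): each replica contributes its local (sum, count), an all-reduce combines them, and every replica normalizes with the same global mean.

```python
# Sketch: sync BatchNormalization replaces per-replica batch statistics
# with globally all-reduced ones. The "all-reduce" is simulated here by
# summing over a list of replica batches.
replica_batches = [[1.0, 2.0], [3.0, 4.0]]  # one local batch per replica

# local statistics each replica would otherwise use
local_means = [sum(b) / len(b) for b in replica_batches]

# global statistics after an all-reduce of (sum, count)
total = sum(sum(b) for b in replica_batches)
count = sum(len(b) for b in replica_batches)
global_mean = total / count

print(local_means, global_mean)  # [1.5, 3.5] 2.5
```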
@lgarithm @luomai
You can test using your kungfu_benchmark.py with one line added:
args = parser.parse_args()
args.cuda = not args.no_cuda
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1
I tested this on an 8x V100 node; the results are as follows:
Configuration:
optimizer=sync-sgd
batch-size=64.
kungfu:
no xla:
[127.0.0.1.10000::stdout] Iter #3: 313.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #4: 312.8 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 316.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 311.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 314.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 315.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 313.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 313.8 +-2.4
with xla:
[127.0.0.1.10000::stdout] Iter #5: 230.2 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 230.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 231.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 229.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 230.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 230.1 +-1.5
horovod:
no xla:
Iter #35: 334.8 img/sec per GPU
Iter #36: 335.5 img/sec per GPU
Iter #37: 327.0 img/sec per GPU
Iter #38: 327.9 img/sec per GPU
Iter #39: 335.2 img/sec per GPU
Iter #40: 334.9 img/sec per GPU
Iter #41: 335.0 img/sec per GPU
Iter #42: 334.9 img/sec per GPU
Iter #43: 335.4 img/sec per GPU
Iter #44: 335.3 img/sec per GPU
Iter #45: 331.8 img/sec per GPU
Iter #46: 334.7 img/sec per GPU
Iter #47: 335.3 img/sec per GPU
Iter #48: 334.8 img/sec per GPU
Iter #49: 335.2 img/sec per GPU
Img/sec per GPU: 334.2 +-5.4
with xla:
Iter #39: 372.4 img/sec per GPU
Iter #40: 379.9 img/sec per GPU
Iter #41: 379.0 img/sec per GPU
Iter #42: 380.2 img/sec per GPU
Iter #43: 378.8 img/sec per GPU
Iter #44: 379.9 img/sec per GPU
Iter #45: 380.3 img/sec per GPU
Iter #46: 379.7 img/sec per GPU
Iter #47: 379.4 img/sec per GPU
Iter #48: 379.5 img/sec per GPU
Iter #49: 379.1 img/sec per GPU
Img/sec per GPU: 379.7 +-5.1
Following up on #294 (i.e. the scale-up case), I continue to test the scale-down case of a KungFu job in Kubernetes, and I found there may be a race condition when KungFu handles the scale-down case.

This is the time sequence of my test: scale up, then scale down.

B's final logs (L4 indicates the time of updating config.json):
[10.0.0.224.10003::stdout] [D] ingore unchanged proposal
[10.0.0.224.10003::stdout] [D] ignore update
I0607 20:13:22.713992 1 watch_host_file.go:65] update host file
I0607 20:13:22.714544 1 ma_fmk_kungfu.go:150] generated host file
[W] terminated trapped
[D] cancelled
[E] canceled: context canceled
[10.0.0.224.10006::stdout] [D] Server Closed
[D] Server Closed
[10.0.0.224.10004::stdout] [D] Server Closed
[10.0.0.224.10007::stdout] [D] Server Closed
[10.0.0.224.10001::stdout] [D] Server Closed
[10.0.0.224.10003::stdout] [D] Server Closed
[10.0.0.224.10005::stdout] [D] Server Closed
[10.0.0.224.10002::stdout] [D] Server Closed
[10.0.0.224.10000::stdout] [D] Server Closed
[I] stop watching
[D] Server Closed
[D] kungfu-run finished, took 2m22.50148318s
A's final logs (L4 indicates the time of updating config.json):
[10.0.1.26.10005::stdout] [D] ignore update
[10.0.1.26.10003::stdout] [D] ignore update
I0607 20:13:22.712071 1 watch_host_file.go:65] update host file
I0607 20:13:22.712325 1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_133) failed: runStrategies failed with 2 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: connection reset by peer, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_67) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_191) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_130) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_125) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_128) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_124) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_129) failed: runStrategies failed with 9 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_123) failed: runStrategies failed with 2 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_121) failed: runStrategies failed with 3 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_118) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_122) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_120) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_116) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_115) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
Currently, I have a workaround: the kungfu-manager process swallows the TERM signal and does not forward it to the kungfu-run process, so the KungFu processes exit by respecting the updated config.json (scale down). By the way, kungfu-run acted as the PID 1 process in the container (in my hang case), so it receives the TERM signal when I shut down the container.

I am not sure whether my workaround is the correct way to scale down the job, i.e. scaling down only by updating config.json and not sending a TERM or INT signal to the KungFu processes. There is still a drawback: users who trap the TERM signal in their Python scripts may no longer be able to do so with KungFu.
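That drawback can be illustrated with a minimal Python signal handler (a sketch; the handler body is hypothetical): a training script that traps SIGTERM to checkpoint before exit never gets the chance if a wrapper process swallows TERM instead of forwarding it.

```python
import os
import signal

# Sketch: a training script that traps SIGTERM to checkpoint before exit.
# If a wrapper swallows TERM instead of forwarding it, this never fires.
state = {'checkpointed': False}

def on_term(signum, frame):
    state['checkpointed'] = True  # e.g. save a checkpoint here
    print('saving checkpoint before exit')

signal.signal(signal.SIGTERM, on_term)
os.kill(os.getpid(), signal.SIGTERM)  # simulate TERM being delivered
```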
Check if sharding is enabled.
Following the README, I can run on the local node successfully:
xxx@master:/tmp/KungFu$ kungfu-run -np 2 python3 examples/tf1_mnist_session.py --data-dir=./mnist
...
[I] all 2/2 local peers finished, took 2.397370504s
But when running on a cluster, it hangs without any error:
@master:/tmp/KungFu$ kungfu-run -np 2 -H 10.208.209.163:1,10.208.209.171:1 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=2
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:1
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[nic] [0] lo :: 127.0.0.1/8
[nic] [1] eno1 :: 10.208.209.163/24
[nic] [2] docker0 :: 192.168.99.1/24
[nic] [3] br-fefb2fb37d81 :: 172.18.0.1/16
[cuda-env]: CUDA_VISIBLE_DEVICES=1
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]
Hi! With PairAveragingOptimizer, is it possible that two workers in different iterations average their models? Or do all workers stay in the same iteration during training?
The asynchronous collective communication layer also avoids having an expensive central coordinator, as used for invoking synchronous collective communication operations in existing systems such as Horovod.
I have read the papers on Horovod and KungFu, and I wonder why Horovod uses a central coordinator; I haven't found it described in the Horovod paper. Could you please give me some information about it, such as some code? I want to compare the difference.
Thanks! Have a nice day!
I'd like KungFu to provide a brief doc about the parameters. I have tried setting -init-version=-1 and omitting the -H param, but it seems kungfu-run can't handle it well:
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=64
[arg] [3]=-w
[arg] [4]=-config-server
[arg] [5]=file:///home/ma-user/user-job-dir/config.json
[arg] [6]=-nic
[arg] [7]=ib0
[arg] [8]=-init-version
[arg] [9]=-1
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification_xk.py
[arg] [12]=--num_clases=1001
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
exit on error: 169.254.143.141:38080 not in 127.0.0.1:38080 at 7037287:/home/work/KungFu/srcs/go/cmd/kungfu-run/kungfu-run.go:62
Happens when using []byte returned from bytes.Buffer
Use GPU events so that the stream is not blocked; use the CUDA Event API (the Record() and Wait() functions).
From the output, we can see that the dataset iterator has been rescheduled on the fly.
#!/usr/bin/env python3
import numpy as np
import tensorflow as tf


def main():
    batch_size = tf.Variable(tf.constant(10, tf.int64))
    n_workers = tf.Variable(tf.constant(4, tf.int64))
    shard_id = tf.Variable(tf.constant(1, tf.int64))

    def _update_topology():
        update_cluster = tf.assign(n_workers, n_workers + 1)
        update_batch_size = tf.assign(batch_size, batch_size + 2)
        with tf.control_dependencies([update_cluster]):
            return tf.group([
                tf.assign(shard_id, tf.mod(shard_id + 1, n_workers)),
                update_batch_size,
            ])

    n = 1024
    source = np.array(list(range(n)))
    it = tf.data.Dataset.from_tensor_slices(source)

    _update_topology_op = _update_topology()

    it = it.batch(batch_size)
    it = it.shard(n_workers, shard_id)
    it = it.make_initializable_iterator()
    _reschedule_op = it.initializer
    get_next = it.get_next()

    def _debug_info(sess):
        np, rank, bs = sess.run([n_workers, shard_id, batch_size])
        print('np=%d, rank=%d, batch size=%d' % (np, rank, bs))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _debug_info(sess)
        for stage in range(3):
            print('stage %d' % stage)
            sess.run(_update_topology_op)
            sess.run(_reschedule_op)
            _debug_info(sess)
            for step in range(3):
                v = sess.run(get_next)
                print('stage %d, step %d, %s (bs=%d)' %
                      (stage, step, v, len(v)))


main()
np=4, rank=1, batch size=10
stage 0
np=5, rank=2, batch size=12
stage 0, step 0, [24 25 26 27 28 29 30 31 32 33 34 35] (bs=12)
stage 0, step 1, [84 85 86 87 88 89 90 91 92 93 94 95] (bs=12)
stage 0, step 2, [144 145 146 147 148 149 150 151 152 153 154 155] (bs=12)
stage 1
np=6, rank=3, batch size=14
stage 1, step 0, [42 43 44 45 46 47 48 49 50 51 52 53 54 55] (bs=14)
stage 1, step 1, [126 127 128 129 130 131 132 133 134 135 136 137 138 139] (bs=14)
stage 1, step 2, [210 211 212 213 214 215 216 217 218 219 220 221 222 223] (bs=14)
stage 2
np=7, rank=4, batch size=16
stage 2, step 0, [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79] (bs=16)
stage 2, step 1, [176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191] (bs=16)
stage 2, step 2, [288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303] (bs=16)
In ResNet-50, I noticed that there exist tensors with names such as tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:2
and tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:1,
and the current code for allreduce (CPU and GPU) elides the last two characters of the tensor names (e.g., https://github.com/lsds/KungFu/blob/master/srcs/python/kungfu/ops.py#L46), which means that the key used during the all-reduce operation is not unique. This might explain why the number of tensors observed in-flight is less than the number of tensors before the operators are constructed. Could you please confirm whether this behaviour is expected?
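The suspected collision can be demonstrated in a few lines of plain Python (illustrative only; this just mimics the two-character truncation described above, not the actual ops.py code):

```python
# Truncating the last two characters of a tensor name (intended to drop
# an output suffix like ":0") maps the distinct outputs :1 and :2 of the
# same op to a single all-reduce key.
names = [
    'tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:1',
    'tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:2',
]
keys = [n[:-2] for n in names]   # the truncation described above
print(len(names), len(set(keys)))  # 2 1  -> two tensors, one key
```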
Hello! I am looking for a distributed training framework; did you try it on Windows? I am facing a linking issue with TensorFlow (possibly MinGW and MSVC linking conflicts).
Hi, I am testing the elastic capability of KungFu in a Kubernetes cluster (containers), and it turns out there is a tricky problem: sometimes KungFu can't connect to the newly created runner, and the job fails due to the following code:
KungFu/srcs/go/kungfu/peer/peer.go
Lines 190 to 196 in dda9542
These are my running logs:
[D] update to v1 with [16@2]{10.0.1.50:10000,10.0.1.50:10001,10.0.1.50:10002,10.0.1.50:10003,10.0.1.50:10004,10.0.1.50:10005,10.0.1.50:10006,10.0.1.50:10007,10.0.0.241:10000,10.0.0.241:10001,10.0.0.241:10002,10.0.0.241:10003,10.0.0.241:10004,10.0.0.241:10005,10.0.0.241:10006,10.0.0.241:10007}@{10.0.1.50:38080,10.0.0.241:38080}
[10.0.1.50.10005::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10000::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10001::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10006::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10002::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140552590385529:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 140247738442105:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10004::stdout] exit on error: par failed with 1 error: can't establish connection at 139984866718073:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10005::stdout] exit on error: par failed with 1 error: can't establish connection at 139962272162169:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139740905816441:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 140017202741625:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10006::stdout] exit on error: par failed with 1 error: can't establish connection at 140218611171705:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140089222893945:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[I] 10.0.1.50.10000 finished with error: exit status 1
exit on error: exit status 1 at 7069499:/cache/KungFu/srcs/go/kungfu/runner/watch.go:146
I observed that there is a time gap between updating the config.json file and the newly bootstrapped KungFu runner (i.e. the newly created container) becoming reachable, and I found the retry setting here:

KungFu/srcs/go/rchannel/connection/connection.go
Lines 90 to 93 in dda9542

and also:

KungFu/srcs/go/rchannel/connection/connection.go
Lines 137 to 146 in dda9542

It shows that the Controll msg only gets one attempt; if that fails, the job exits.

To clarify, I post the time sequence of my job. Finally, I am not sure whether we can simply increase the retry count for the Controll msg to solve my problem, or what the best practice for KungFu in Kubernetes is.
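The idea of retrying instead of failing on the first attempt can be sketched in plain Python (a hypothetical illustration, not the Go code linked above; function names and parameters are made up): keep dialing with a short delay to cover the gap before the new runner's container is actually listening.

```python
import time

# Hypothetical sketch: retry a control connection a few times with a
# delay instead of exiting after a single failed attempt.
def connect_with_retry(dial, attempts=5, delay=0.01):
    last = None
    for _ in range(attempts):
        try:
            return dial()
        except ConnectionError as err:
            last = err
            time.sleep(delay)
    raise last

# Toy dial that fails twice before the "runner" becomes reachable.
calls = {'n': 0}
def dial():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('connection refused')
    return 'connected'

print(connect_with_retry(dial))  # connected
```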
@luomai @marwage, when running the early BERT demo, a panic happens when updating from 1 peer to 8 peers. The configuration is the same as run.sh; the error is as follows:
updating to 1 peers: 169.254.128.33:13006
[I] updated to 1 peers: 169.254.128.33:13006
OK
[I] arrived at v11, new np=1, local: +0/-6, global: -0/14
[169.254.128.33.13000::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13000/model.ckpt.
[169.254.128.33.13003::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13003/model.ckpt.
[169.254.128.33.13001::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13001/model.ckpt.
[169.254.128.33.13002::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13002/model.ckpt.
[169.254.128.33.13006::stdout] sync offset 153320 -> 153320 on step 605
[169.254.128.33.13004::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13004/model.ckpt.
[169.254.128.33.13007::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13007/model.ckpt.
[169.254.128.33.13007::stderr] WARNING:tensorflow:From /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
[169.254.128.33.13007::stderr] Instructions for updating:
[169.254.128.33.13007::stderr] Use standard file APIs to delete files with this prefix.
[169.254.128.33.13003::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13001::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13004::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13002::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13000::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13007::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13003::stderr] INFO:tensorflow:Loss for final step: 1.1395919.
[169.254.128.33.13003::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13003::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.733748
[169.254.128.33.13001::stderr] INFO:tensorflow:Loss for final step: 1.4893956.
[169.254.128.33.13001::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13001::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.850651
[169.254.128.33.13004::stderr] INFO:tensorflow:Loss for final step: 1.357612.
[169.254.128.33.13004::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13004::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.928306
[169.254.128.33.13002::stderr] INFO:tensorflow:Loss for final step: 1.4187098.
[169.254.128.33.13002::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13002::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.183136
[169.254.128.33.13000::stderr] INFO:tensorflow:Loss for final step: 1.1474836.
[169.254.128.33.13000::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13000::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.285399
[169.254.128.33.13007::stderr] INFO:tensorflow:Loss for final step: 2.2803762.
[169.254.128.33.13007::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13007::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.593442
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 0.469528
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 3.75623
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41049
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.2839
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41522
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.3218
[169.254.128.33.13006::stderr] INFO:tensorflow:Saving checkpoints for 1390 into /cache/output/169.254.128.33-13006/model.ckpt.
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.1294
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 25.0352
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41119
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.2895
updating to 8 peers: 169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005
[I] updated to 8 peers: 169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005
OK
[E] full update detected: [1@2]{169.254.128.33:13006}@{169.254.128.33:38080,169.254.128.114:38080} -> [8@2]{169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005}@{169.254.128.33:38080,169.254.128.114:38080}
[I] arrived at v12, new np=8, local: +3/-1, global: -8/1
[169.254.128.33.13006::stderr] panic: runtime error: invalid memory address or nil pointer dereference
[169.254.128.33.13006::stderr] [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7fab3df583c5]
[169.254.128.33.13006::stderr]
[169.254.128.33.13006::stderr] goroutine 2706338 [running]:
[169.254.128.33.13006::stderr] encoding/binary.Write(0x0, 0x0, 0x7fab3e4a5300, 0x7fab3e6c5900, 0x7fab3e4166e0, 0xc0007fa548, 0x6, 0xc0007fa540)
[169.254.128.33.13006::stderr] /usr/local/go/src/encoding/binary/binary.go:342 +0x145
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/rchannel.(*messageHeader).WriteTo(0xc0b92adda8, 0x0, 0x0, 0x0, 0x0)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/message.go:84 +0x78
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/rchannel.(*tcpConnection).Send(0xc0006bccf0, 0x7fab3e10338a, 0x6, 0x17d, 0xc000320000, 0x17d, 0x17d, 0x0, 0x7fab00000000, 0x0, ...)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/connection.go:106 +0x16b
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/rchannel.(*Router).send(0xc000174db0, 0x94c0a9fe8072, 0x7fab3e10338a, 0x6, 0x17d, 0xc000320000, 0x17d, 0x17d, 0x0, 0x1, ...)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/router.go:53 +0xc0
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/rchannel.(*Router).Send(0xc000174db0, 0x94c0a9fe8072, 0x7fab3e10338a, 0x6, 0xc000320000, 0x17d, 0x17d, 0x320001, 0x17d, 0x17d)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/router.go:44 +0xeb
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/kungfu.(*Kungfu).propose.func1(0x94c0a9fe8072, 0x5a, 0x5a)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:204 +0xfc
[169.254.128.33.13006::stderr] github.com/lsds/KungFu/srcs/go/kungfu.par.func1(0xc00022c0f0, 0xc0002500c0, 0x2, 0x2, 0xc00031e230, 0x1, 0x94c0a9fe8072)
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:171 +0x3e
[169.254.128.33.13006::stderr] created by github.com/lsds/KungFu/srcs/go/kungfu.par
[169.254.128.33.13006::stderr] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:170 +0x10f
[I] 169.254.128.33.13006 finished with error: signal: aborted (core dumped)
exit on error: signal: aborted (core dumped)
Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!
It often happens when running on Azure with P100 GPUs:
Standard_NC6s_v2
Standard_NC12s_v2
Standard_NC24s_v2
I would like to give kungfu-run a different directory in which to write the log files.
The -logfile option logs something other than the per-peer logs such as 127.0.0.1.10000-stderr.log.
In the paper based on this project it is mentioned: "We implement an AP that adapts the batch size based on GNS when training the ResNet-56 model with the CIFAR-10 dataset."
Is this corresponding benchmark or adaptation policy available to the public? I'm having trouble finding it. I'm really interested in this portion of the project and look forward to seeing how GNS can be used to drive dynamic batch size selection.
I've looked through the tensorflow benchmarks and monitoring benchmarks but all I can find is:
Thanks!
Hi.
Distributed mode is slower than single mode.
In single mode it takes 2s for one epoch:
[10.208.209.163.10000::stdout] Epoch 1/50
[10.208.209.163.10000::stderr] 2020-02-01 09:32:09.231149: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] - 2s - loss: 0.3730 - sparse_categorical_accuracy: 0.8899 - val_loss: 0.1923 - val_sparse_categorical_accuracy: 0.9436
In distributed mode it takes 22s for one epoch:
[10.208.209.163.10000::stdout] Epoch 1/50
[10.208.209.163.10000::stderr] 2020-02-01 09:31:14.899848: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] - 22s - loss: 0.4644 - sparse_categorical_accuracy: 0.8648 - val_loss: 0.2567 - val_sparse_categorical_accuracy: 0.9226
We are using the default stream that TF gives us (context->op_device_context()->stream()). This is the computation stream, so our communication blocks graph computation and hurts performance. We should use the device-to-device communication stream instead.
Please find below the latencies measured for the synchronous case of P2P model request and store update. This is the current bottleneck of the system and must be resolved to improve its throughput.
[127.0.0.1/02/02-of-04::stdout] Model store update took 16.012414ms
[127.0.0.1/03/03-of-04::stdout] Model store update took 276.625µs
[127.0.0.1/02/02-of-04::stdout] Model store update took 176.388µs
[127.0.0.1/02/02-of-04::stdout] Request took 673.288µs
[127.0.0.1/03/03-of-04::stdout] Request took 15.947665ms
[127.0.0.1/00/00-of-04::stdout] Model store update took 3.733306ms
[127.0.0.1/01/01-of-04::stdout] Model store update took 175.905µs
[127.0.0.1/00/00-of-04::stdout] Request took 14.822951ms
[127.0.0.1/03/03-of-04::stdout] Model store update took 195.484µs
[127.0.0.1/01/01-of-04::stdout] Request took 20.67839ms
[127.0.0.1/02/02-of-04::stdout] Model store update took 216.966µs
[127.0.0.1/00/00-of-04::stdout] Model store update took 235.257µs
[127.0.0.1/01/01-of-04::stdout] Request took 833.436µs
Add a -hostfile flag (https://www.open-mpi.org/faq/?category=running#mpirun-hostfile) as an alternative to -H.
[127.0.0.1.10000::stderr] Traceback (most recent call last):
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 157, in <module>
[127.0.0.1.10000::stderr] main()
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 153, in main
[127.0.0.1.10000::stderr] train(args, model, device, optimizer, step_based_schedule)
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 71, in train
[127.0.0.1.10000::stderr] sync_model(model)
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 48, in sync_model
[127.0.0.1.10000::stderr] kf.broadcast_parameters(model.state_dict())
[127.0.0.1.10000::stderr] File "/home/wrk/anaconda3/envs/py36tf13/lib/python3.6/site-packages/kungfu/torch/ops/collective.py", line 43, in broadcast_parameters
[127.0.0.1.10000::stderr] h = inplace_broadcast_async_op(value, name)
[127.0.0.1.10000::stderr] File "/home/wrk/anaconda3/envs/py36tf13/lib/python3.6/site-packages/kungfu/torch/ops/collective.py", line 29, in inplace_broadcast_async_op
[127.0.0.1.10000::stderr] return broadcast_async_op_map[x.type()](x, x, x.type(), name)
[127.0.0.1.10000::stderr] KeyError: 'torch.FloatTensor'
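For reference, the failing line dispatches on x.type(); a CPU tensor reports 'torch.FloatTensor', which appears to be missing from broadcast_async_op_map (possibly only CUDA types are registered in this build). A minimal pure-Python reproduction of that failure mode, with a hypothetical stand-in table rather than KungFu's real one:

```python
# Hypothetical stand-in for the dispatch table in
# kungfu/torch/ops/collective.py; here only a CUDA type is registered.
broadcast_async_op_map = {
    'torch.cuda.FloatTensor': lambda x, y, t, name: None,  # placeholder op
}

def inplace_broadcast_async_op(tensor_type, name):
    # x.type() on a CPU tensor returns 'torch.FloatTensor', which has no
    # entry, raising the KeyError shown in the traceback above.
    return broadcast_async_op_map[tensor_type](None, None, tensor_type, name)

try:
    inplace_broadcast_async_op('torch.FloatTensor', 'model-sync')
except KeyError as err:
    print('KeyError:', err)  # prints: KeyError: 'torch.FloatTensor'
```

If that reading is correct, moving the model to the GPU before calling kf.broadcast_parameters (so the state_dict tensors report torch.cuda.FloatTensor) should avoid the missing key.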
After inspecting the code and trying out an implementation skeleton, I have two questions and one observation:
Router currently expects a message from a particular peer (one of Prevs in the gather graph, as required by the collective communication paradigm). We will need to adapt this for point-to-point communication. I am thinking of simply listening on k distinct channels, one for each peer. Is there an efficient way of doing this in Go?
The inbound gradient can be updated with the partial gradients received from other peers. Is there a way to do multi-threading inside a TensorFlow operator? I am afraid that threads may not be supported.
My current code skeleton for point-to-point communication for Ako can be found on the point-to-point branch.
I have installed the necessary packages with Python 3.5, Go 1.6 and TensorFlow 1.2.0.
When I installed KungFu, I ran into the issue below:
ERROR: Command errored out with exit status 1:
command: /eric_work_space/gx/py3/bin/python3.5 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-6mffe0_d --python-tag cp35
cwd: /tmp/pip-req-build-zz01ll48/
Complete output (125 lines):
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
creating build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/ext.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/loader.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/_utils.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/__init__.py -> build/lib.linux-x86_64-3.5/kungfu
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow
copying ./srcs/python/kungfu/tensorflow/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
copying ./srcs/python/kungfu/tensorflow/v2/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
copying ./srcs/python/kungfu/tensorflow/v1/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
copying ./srcs/python/kungfu/tensorflow/initializer/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
copying ./srcs/python/kungfu/tensorflow/initializer/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/topology.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/adapt.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/p2p.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/collective.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/state.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/monitor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/_tf_oplib.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/local.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/core.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/grad_noise_scale.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/sync_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/async_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/ada_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/grad_variance.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
copying ./srcs/python/kungfu/tensorflow/compat/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
copying ./srcs/python/kungfu/tensorflow/v2/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
copying ./srcs/python/kungfu/tensorflow/v2/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
copying ./srcs/python/kungfu/tensorflow/v1/datasets/adaptor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
copying ./srcs/python/kungfu/tensorflow/v1/datasets/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
copying ./srcs/python/kungfu/tensorflow/v1/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
copying ./srcs/python/kungfu/tensorflow/v1/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/layers.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/imagenet.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/idx.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/utils.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/cifar.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
running build_ext
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/pip-req-build-zz01ll48/build/temp.linux-x86_64-3.5
Scanning dependencies of target libkungfu-comm
can't load package: package /tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm: import "/tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm": cannot import absolute path
CMakeFiles/libkungfu-comm.dir/build.make:57: recipe for target 'CMakeFiles/libkungfu-comm' failed
make[2]: *** [CMakeFiles/libkungfu-comm] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/libkungfu-comm.dir/all' failed
make[1]: *** [CMakeFiles/libkungfu-comm.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-zz01ll48/setup.py", line 108, in <module>
install_requires=[],
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/wheel/bdist_wheel.py", line 192, in run
self.run_command('build')
File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
self.build_extensions()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
self.build_extension(ext)
File "/tmp/pip-req-build-zz01ll48/setup.py", line 90, in build_extension
cwd=self.build_temp,
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2
----------------------------------------
ERROR: Failed building wheel for kungfu
Running setup.py clean for kungfu
Failed to build kungfu
Installing collected packages: kungfu
Running setup.py install for kungfu ... error
ERROR: Command errored out with exit status 1:
command: /eric_work_space/gx/py3/bin/python3.5 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-e64wkiq0/install-record.txt --single-version-externally-managed --compile --install-headers /eric_work_space/gx/py3/include/site/python3.5/kungfu
cwd: /tmp/pip-req-build-zz01ll48/
Complete output (127 lines):
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.5
creating build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/ext.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/loader.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/_utils.py -> build/lib.linux-x86_64-3.5/kungfu
copying ./srcs/python/kungfu/__init__.py -> build/lib.linux-x86_64-3.5/kungfu
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow
copying ./srcs/python/kungfu/tensorflow/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
copying ./srcs/python/kungfu/tensorflow/v2/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
copying ./srcs/python/kungfu/tensorflow/v1/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
copying ./srcs/python/kungfu/tensorflow/initializer/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
copying ./srcs/python/kungfu/tensorflow/initializer/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/topology.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/adapt.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/p2p.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/collective.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/state.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/monitor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/_tf_oplib.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
copying ./srcs/python/kungfu/tensorflow/ops/local.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/core.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/grad_noise_scale.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/sync_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/async_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/ada_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
copying ./srcs/python/kungfu/tensorflow/optimizers/grad_variance.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
copying ./srcs/python/kungfu/tensorflow/compat/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
copying ./srcs/python/kungfu/tensorflow/v2/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
copying ./srcs/python/kungfu/tensorflow/v2/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
copying ./srcs/python/kungfu/tensorflow/v1/datasets/adaptor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
copying ./srcs/python/kungfu/tensorflow/v1/datasets/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
copying ./srcs/python/kungfu/tensorflow/v1/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
copying ./srcs/python/kungfu/tensorflow/v1/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/layers.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/imagenet.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/idx.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/utils.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
copying ./srcs/python/kungfu/tensorflow/v1/helpers/cifar.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
running build_ext
-- The C compiler identification is GNU 5.4.0
-- The CXX compiler identification is GNU 5.4.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/pip-req-build-zz01ll48/build/temp.linux-x86_64-3.5
Scanning dependencies of target libkungfu-comm
can't load package: package /tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm: import "/tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm": cannot import absolute path
CMakeFiles/libkungfu-comm.dir/build.make:57: recipe for target 'CMakeFiles/libkungfu-comm' failed
make[2]: *** [CMakeFiles/libkungfu-comm] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/libkungfu-comm.dir/all' failed
make[1]: *** [CMakeFiles/libkungfu-comm.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-zz01ll48/setup.py", line 108, in <module>
install_requires=[],
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/usr/lib/python3.5/distutils/command/install.py", line 583, in run
self.run_command('build')
File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
self.build_extensions()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
self.build_extension(ext)
File "/tmp/pip-req-build-zz01ll48/setup.py", line 90, in build_extension
cwd=self.build_temp,
File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2
----------------------------------------
Can anyone help address this issue?
I am thinking of supporting a shared-memory channel. According to this benchmark, a properly implemented shared-memory channel can outperform domain sockets by 100x. I quote the numbers for shared memory and domain sockets produced by this IPC-benchmark code:
Shared Memory:
Message size: 128
Message count: 1000000
Total duration: 261.650 ms
Average duration: 0.238 us
Minimum duration: 0.000 us
Maximum duration: 10092.032 us
Standard deviation: 22.095 us
Message rate: 3821893 msg/s
Domain Socket:
Message size: 128
Message count: 1000000
Total duration: 24579.846 ms
Average duration: 24.531 us
Minimum duration: 2.560 us
Maximum duration: 15932.928 us
Standard deviation: 37.854 us
Message rate: 40683 msg/s
We could still rely on domain sockets for small tensor communication, but if a tensor is big, we can use shared memory as a fast path. One possible direction is to:
This design has two benefits:
@lgarithm thoughts?
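To illustrate the mechanics of the proposed fast path (a Python stdlib sketch, not KungFu code): the sender publishes a large tensor into a named shared-memory segment, and only the small handle (segment name plus metadata) needs to cross the slow socket channel:

```python
import numpy as np
from multiprocessing import shared_memory

# "Sender": publish a large tensor into a named shared-memory segment.
tensor = np.arange(1 << 18, dtype=np.float32)
shm = shared_memory.SharedMemory(create=True, size=tensor.nbytes)
src = np.ndarray(tensor.shape, dtype=tensor.dtype, buffer=shm.buf)
src[:] = tensor
# Only this small handle needs to travel over the domain socket.
handle = (shm.name, tensor.shape, tensor.dtype.name)

# "Receiver": attach by name and read the tensor without copying it
# through the socket.
name, shape, dtype = handle
peer = shared_memory.SharedMemory(name=name)
view = np.ndarray(shape, dtype=dtype, buffer=peer.buf)
assert float(view[-1]) == float(tensor[-1])

# Release views before closing, then unlink once all peers are done.
del src, view
peer.close()
shm.close()
shm.unlink()
```

A real implementation would also need segment lifecycle management (who unlinks, and when) and a notification path so the receiver knows a segment is ready, which the socket channel could still provide.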
To concretely implement the above approach, we will use Python. The computation of the gradient norm uses the locally computed gradient, the gradient negotiated via KungFu parallel SGD, the per-worker batch size and the global batch size.
We will use TensorFlow operators to calculate the biased estimates at each iteration:
tf.norm(), which treats the tensor as a vector and computes the L2 norm
tf.constant()
tf.constant() at the end of the computation
DoD:
tf.constant()
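As a rough illustration (a NumPy sketch, not the project's actual code), the biased estimates described in the gradient-noise-scale literature can be computed from a local gradient g_small (per-worker batch size b_small) and the negotiated, all-reduced gradient g_big (global batch size b_big):

```python
import numpy as np

def gradient_noise_scale(g_small, g_big, b_small, b_big):
    """Biased GNS estimates (after McCandlish et al.).

    g_small: local gradient, computed with the per-worker batch size b_small
    g_big:   negotiated (averaged) gradient, for the global batch size b_big
    """
    sq_small = float(np.sum(g_small ** 2))  # |g_small|^2 (tf.norm(g)**2 in TF)
    sq_big = float(np.sum(g_big ** 2))      # |g_big|^2
    # Estimate of the true gradient's squared norm |G|^2.
    g2 = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    # Estimate of the per-example gradient noise S (trace of the covariance).
    s = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return s / g2  # simple noise scale B_simple = S / |G|^2

rng = np.random.default_rng(0)
true_grad = rng.normal(size=1000)
# Noisier local gradient vs. a less noisy negotiated gradient (16x batch).
g_small = true_grad + rng.normal(size=1000)
g_big = true_grad + rng.normal(size=1000) / 4.0
print(gradient_noise_scale(g_small, g_big, 16, 256))
```

In practice both estimates are very noisy per step, so they are usually smoothed with an exponential moving average before being used to adapt the batch size.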
panic: sync: WaitGroup is reused before previous Wait has returned
goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc00009c010)
/Users/lg/local/go/src/sync/waitgroup.go:132 +0xad
main.watchRun(0x43a41e0, 0xc000104500, 0x94c07f000001, 0xc0001305b8, 0x1, 0x1, 0xc0000b2180, 0x7f00000100000000, 0x94c0, 0xc000106240, ...)
/Users/lg/code/repos/github.com/lsds/KungFu/srcs/go/cmd/kungfu-run/watch.go:91 +0x908
main.main()
/Users/lg/code/repos/github.com/lsds/KungFu/srcs/go/cmd/kungfu-run/kungfu-run.go:121 +0x7eb
Reproduced with tensorflow-gpu 1.13.2 and 1.15.0:
python3 -m kungfu.tensorflow.v1.examples
[127.0.0.1.10000::stdout] step 0, result: 1.000000
[127.0.0.1.10000::stdout] unexpected result: 1.000000, want: 0.800000
[127.0.0.1.10000::stdout] step 1, result: 0.800000
[127.0.0.1.10000::stdout] unexpected result: 0.800000, want: 0.640000
[127.0.0.1.10000::stdout] step 2, result: 0.640000
[127.0.0.1.10000::stdout] unexpected result: 0.640000, want: 0.512000
[127.0.0.1.10000::stdout] step 3, result: 0.512000
[127.0.0.1.10000::stdout] unexpected result: 0.512000, want: 0.409600
[127.0.0.1.10000::stdout] step 4, result: 0.409600
[127.0.0.1.10000::stdout] unexpected result: 0.409600, want: 0.327680
A: v0 -> v1
A container log
[10.0.0.230.10003::stdout] sync to offset 0 on step 0
[10.0.0.230.10004::stdout] sync to offset 0 on step 0
[10.0.0.230.10007::stdout] sync to offset 0 on step 0
[10.0.0.230.10002::stdout] sync to offset 0 on step 0
[10.0.0.230.10006::stdout] sync to offset 0 on step 0
[10.0.0.230.10001::stdout] sync to offset 0 on step 0
[10.0.0.230.10005::stdout] sync to offset 0 on step 0
[10.0.0.230.10000::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0
B container log
[10.0.1.29.10004::stdout] sync to offset 0 on step 0
[10.0.1.29.10005::stdout] sync to offset 0 on step 0
[10.0.1.29.10001::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stdout] sync to offset 0 on step 0
[10.0.1.29.10003::stdout] sync to offset 0 on step 0
[10.0.1.29.10000::stdout] sync to offset 0 on step 0
[10.0.1.29.10006::stdout] sync to offset 0 on step 0
[10.0.1.29.10007::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10003::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10000::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10001::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10007::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10004::stderr] [E] New root can't not be a new worker! State will be lost.
A/B running well
A: v1 -> v2
A container log
[10.0.0.230.10006::stderr] INFO:tensorflow:step: 60(global step: 60) step/sec: 0.384 loss: 0.777 top-1: 0.800
[10.0.0.230.10003::stderr] INFO:tensorflow:step: 60(global step: 60) step/sec: 0.384 loss: 0.687 top-1: 0.800
I0608 09:10:39.150780 1 watch_host_file.go:65] update host file
I0608 09:10:39.150810 1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.151063 1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v2, new np=8, local: +0/-0, global: +0/-8
[10.0.0.230.10006::stdout] sync to offset 20320 on step 64
[10.0.0.230.10005::stdout] sync to offset 20320 on step 64
[10.0.0.230.10001::stdout] sync to offset 20320 on step 64
[10.0.0.230.10007::stdout] sync to offset 20320 on step 64
[10.0.0.230.10004::stdout] sync to offset 20320 on step 64
[10.0.0.230.10002::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stdout] sync to offset 20320 on step 64
[10.0.0.230.10003::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stderr] INFO:tensorflow:step: 70(global step: 70) step/sec: 1.041 loss: 0.683 top-1: 0.800
[10.0.0.230.10004::stderr] INFO:tensorflow:step: 70(global step: 70) step/sec: 1.041 loss: 0.629 top-1: 0.800
B container log
[10.0.1.29.10007::stderr] INFO:tensorflow:step: 60(global step: 60) step/sec: 0.384 loss: 0.926 top-1: 0.750
[10.0.1.29.10003::stderr] INFO:tensorflow:step: 60(global step: 60) step/sec: 0.384 loss: 0.757 top-1: 0.800
I0608 09:10:39.152199 1 watch_host_file.go:65] update host file
I0608 09:10:39.152222 1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.152408 1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.29.10002::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10004::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10005::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10003::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10006::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10002::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10004::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10006::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10005::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10003::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[W] terminated trapped
[E] canceled: context canceled
[I] stop watching
A running well, B is closed
A: v2 -> v3
A container log
[10.0.0.230.10002::stderr] INFO:tensorflow:step: 210(global step: 210) step/sec: 1.455 loss: 0.011 top-1: 1.000
[10.0.0.230.10001::stderr] INFO:tensorflow:step: 210(global step: 210) step/sec: 1.455 loss: 0.020 top-1: 1.000
I0608 09:12:30.528867 1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v3, new np=16, local: +0/-0, global: +8/-0
[10.0.0.230.10003::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10005::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10004::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10000::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10007::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10001::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10002::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10006::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102
I found that the runner of A exited with: exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102
Container B's log:
[10.0.1.30.10000::stdout] start with 0 trained samples.
[10.0.1.30.10000::stderr] INFO:tensorflow:Running will end at step: 750
[10.0.1.30.10006::stdout] sync to offset 0 on step 0
[10.0.1.30.10002::stdout] sync to offset 0 on step 0
[10.0.1.30.10004::stdout] sync to offset 0 on step 0
[10.0.1.30.10003::stdout] sync to offset 0 on step 0
[10.0.1.30.10005::stdout] sync to offset 0 on step 0
[10.0.1.30.10007::stdout] sync to offset 0 on step 0
[10.0.1.30.10000::stdout] sync to offset 0 on step 0
[10.0.1.30.10001::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +0/-0
[10.0.1.30.10004::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140503338179417:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 139700864633689:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 139746162006873:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139711679425369:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140323524553561:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[I] 10.0.1.30.10000 finished with error: exit status 1
exit on error: exit status 1 at 7030827:/home/work/KungFu/srcs/go/kungfu/runner/watch.go:147
Now both A and B are hanging.
Can KungFu currently support my test case, or how should I handle it?
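One framework-agnostic workaround for the "start with 0 trained samples" restart above, sketched under the assumption that each worker can persist its own progress alongside the model checkpoint: save the trained-sample count when the cluster changes and restore it before training resumes, so a restarted worker skips already-consumed data instead of syncing to offset 0. The file name and helper names below are hypothetical, not part of KungFu's API.

```python
import json
import os

PROGRESS_FILE = "trained_samples.json"  # hypothetical sidecar next to model.ckpt


def save_progress(trained_samples: int, step: int, path: str = PROGRESS_FILE) -> None:
    # Persist how many samples this job has consumed so a worker restarted
    # by a cluster resize can resume the dataset at the right offset.
    with open(path, "w") as f:
        json.dump({"trained_samples": trained_samples, "step": step}, f)


def load_progress(path: str = PROGRESS_FILE) -> tuple:
    # A genuinely fresh cluster starts from zero ("start with 0 trained samples").
    if not os.path.exists(path):
        return 0, 0
    with open(path) as f:
        d = json.load(f)
    return d["trained_samples"], d["step"]
```

In the scenario above, the workers of B would call `load_progress()` at startup and seek the input pipeline past the restored sample offset before their first step.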
/usr/bin/ld: cannot find -ltensorflow_framework
collect2: error: ld returned 1 exit status
CMakeFiles/kungfu_tensorflow_ops.dir/build.make:226: recipe for target '../lib.linux-x86_64-3.6/kungfu/kungfu_tensorflow_ops.cpython-36m-x86_64-linux-gnu.so' failed
make[2]: *** [../lib.linux-x86_64-3.6/kungfu/kungfu_tensorflow_ops.cpython-36m-x86_64-linux-gnu.so] Error 1
CMakeFiles/Makefile2:68: recipe for target 'CMakeFiles/kungfu_tensorflow_ops.dir/all' failed
make[1]: *** [CMakeFiles/kungfu_tensorflow_ops.dir/all] Error 2
Makefile:83: recipe for target 'all' failed
make: *** [all] Error 2
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-req-build-nw69zdj6/setup.py", line 106, in <module>
install_requires=[],
File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/wheel/bdist_wheel.py", line 192, in run
self.run_command('build')
File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 84, in run
_build_ext.run(self)
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
self.build_extensions()
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
self._build_extensions_serial()
File "/usr/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
self.build_extension(ext)
File "/tmp/pip-req-build-nw69zdj6/setup.py", line 88, in build_extension
cwd=self.build_temp,
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2.
----------------------------------------
ERROR: Failed building wheel for kungfu
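The `cannot find -ltensorflow_framework` failure usually means the TensorFlow wheel ships only a versioned shared library (e.g. `libtensorflow_framework.so.1`) while the `-l` flag resolves only the unversioned name. A common workaround is to symlink the unversioned name and expose the directory to the linker before rebuilding. The directory below is a stand-in; the real one is printed by `python3 -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"`.

```shell
# Stand-in for the real TF library directory; substitute the path printed by
#   python3 -c "import tensorflow as tf; print(tf.sysconfig.get_lib())"
TF_LIB=/tmp/tf-lib-example
mkdir -p "$TF_LIB"
touch "$TF_LIB/libtensorflow_framework.so.1"   # stand-in for the versioned lib

# -ltensorflow_framework only resolves the unversioned name, so symlink it:
ln -sf "$TF_LIB/libtensorflow_framework.so.1" "$TF_LIB/libtensorflow_framework.so"

# Make the directory visible to ld when rebuilding the KungFu wheel:
export LIBRARY_PATH="$TF_LIB:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$TF_LIB:$LD_LIBRARY_PATH"
```

After this, rerunning `pip install` for KungFu should get past the link step, provided the TF version matches the one the ops were built against.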