kungfu's People

Contributors

andrei3131, dingtonghan, kfertakis, lgarithm, luomai, marwage, prp

kungfu's Issues

KungFu implementation TODO

  • Enable the order group (by passing the reducible tensors' names into kungfu-world through TF operator attributes).
    • Use an attr to pass the name: #30
    • Choose the correct VariableMgr. @luomai: the current VariableMgr does not perform all_reduce for all gradients, which causes our order group to never complete.
  • Support multiple machines
  • Huawei Cloud + TF benchmark

Control plane operator design doc

Design goals

  1. Computing monitoring metrics for the learning plane. The first monitoring metric is the gradient noise scale proposed by OpenAI.
  2. Applying changes to hyper-parameters on distributed devices. This can be achieved using the Chi approach, where we maintain a distributed asynchronous control barrier for workers.
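
As context for goal 1, the gradient noise scale from OpenAI's "An Empirical Model of Large-Batch Training" can be estimated from gradients computed at two different batch sizes. A minimal NumPy sketch of that estimator follows; the function and argument names are illustrative, not KungFu's API:

```python
import numpy as np

def noise_scale(g_small, g_big, b_small, b_big):
    """Estimate B_simple = tr(Sigma) / |G|^2 from two gradient estimates.

    g_small / g_big are gradients averaged over batches of size
    b_small / b_big, with b_small < b_big.
    """
    # Unbiased estimate of the true squared gradient norm |G|^2.
    g2 = (b_big * np.sum(g_big ** 2) - b_small * np.sum(g_small ** 2)) \
        / (b_big - b_small)
    # Unbiased estimate of the noise term tr(Sigma).
    s = (np.sum(g_small ** 2) - np.sum(g_big ** 2)) \
        / (1.0 / b_small - 1.0 / b_big)
    return s / g2
```

In a data-parallel setting, the per-worker gradient plays the role of the small-batch gradient and the all-reduced gradient that of the large-batch one, so the two estimates come almost for free.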

Dependency graph of KungFu dataflows

Note: (A, B) means operator B has a dependency on operator A.

A TensorFlow dataflow:

(forward, loss)
(loss, gradients)
(gradients, reduced_gradients)
(reduced_gradients, optimizer)
(optimizer, forward)
.....

A controllable TensorFlow dataflow:

(forward, loss)
(loss, gradients)
(gradients, reduced_gradients)
(gradients, gradient_noise)
(reduced_gradients, optimizer)
(gradient_noise, controller)
(optimizer, controller)
(controller, forward)
......
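
The per-step execution order implied by these edges can be checked mechanically. A small sketch using Python's standard-library graphlib; the cyclic back-edge (controller, forward) is dropped because it links two consecutive steps rather than operators within one step:

```python
from graphlib import TopologicalSorter

# (A, B) means operator B depends on operator A; the (controller, forward)
# back-edge spans consecutive steps, so it is excluded here.
edges = [
    ('forward', 'loss'),
    ('loss', 'gradients'),
    ('gradients', 'reduced_gradients'),
    ('gradients', 'gradient_noise'),
    ('reduced_gradients', 'optimizer'),
    ('gradient_noise', 'controller'),
    ('optimizer', 'controller'),
]

deps = {}  # node -> set of prerequisite nodes
for a, b in edges:
    deps.setdefault(b, set()).add(a)

order = list(TopologicalSorter(deps).static_order())
print(order)  # 'forward' comes first, 'controller' last
```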

Segmentation fault on alpine Linux

/src/kungfu # valgrind ./bin/fake-agent 
==25== Memcheck, a memory error detector
==25== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==25== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==25== Command: ./bin/fake-agent
==25== 
==25== Thread 2:
==25== Invalid read of size 8
==25==    at 0x4E5660D: runtime.argv_index (/usr/lib/go/src/runtime/runtime1.go:57)
==25==    by 0x4E5660D: runtime.sysargs (/usr/lib/go/src/runtime/os_linux.go:206)
==25==    by 0x4E669AA: runtime.args (/usr/lib/go/src/runtime/runtime1.go:63)
==25==    by 0x4E82353: runtime.rt0_go (/usr/lib/go/src/runtime/asm_amd64.s:193)
==25==    by 0x544E06F: ??? (in /src/kungfu/lib/libkungfu.so)
==25==  Address 0x2a270388 is not stack'd, malloc'd or (recently) free'd
==25== 
==25== 
==25== Process terminating with default action of signal 11 (SIGSEGV)
==25==    at 0x4E85EB1: runtime.raise (/usr/lib/go/src/runtime/sys_linux_amd64.s:150)
==25==    by 0x4E6CBDA: runtime.dieFromSignal (/usr/lib/go/src/runtime/signal_unix.go:424)
==25==    by 0x4E6D03C: runtime.sigfwdgo (/usr/lib/go/src/runtime/signal_unix.go:629)
==25==    by 0x4E6C27F: runtime.sigtrampgo (/usr/lib/go/src/runtime/signal_unix.go:289)
==25==    by 0x4E861A2: runtime.sigtramp (/usr/lib/go/src/runtime/sys_linux_amd64.s:357)
==25==    by 0x4044E20: ??? (in /lib/ld-musl-x86_64.so.1)
==25== 
==25== HEAP SUMMARY:
==25==     in use at exit: 74,480 bytes in 11 blocks
==25==   total heap usage: 13 allocs, 2 frees, 74,600 bytes allocated
==25== 
==25== LEAK SUMMARY:
==25==    definitely lost: 0 bytes in 0 blocks
==25==    indirectly lost: 0 bytes in 0 blocks
==25==      possibly lost: 0 bytes in 0 blocks
==25==    still reachable: 74,480 bytes in 11 blocks
==25==         suppressed: 0 bytes in 0 blocks
==25== Rerun with --leak-check=full to see details of leaked memory
==25== 
==25== For lists of detected and suppressed errors, rerun with: -s
==25== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault

BERT demo question

@marwage
I want to run elastic BERT training with KungFu commit 9f8a69b. Could you tell me the corresponding BERT commit on the KungFu branch so that I can reproduce your result?

Support partitioned gradient exchange with NCCL

To support partitioned gradient exchange with NCCL, we need to create different order groups for different values of global_step % n_partitions. The following code snippet shows how we might achieve that using tf.cond:

#!/usr/bin/env python3

import tensorflow as tf

gs = tf.Variable(tf.zeros([]))
advance_gs = tf.assign(gs, gs + 1)

x = tf.Variable(tf.zeros([3]))


def create_run_part(i):
    # gs = 5n + i => x += i, i = 0, ..., 4
    return tf.cond(tf.equal(tf.mod(gs, 5), i),
                   lambda: tf.assign(x, x + i),
                   lambda: x)


def create_step_op():
    with tf.control_dependencies([advance_gs]):
        return tf.group([create_run_part(i) for i in range(5)])


step_op = create_step_op()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(15):
        t, v, _ = sess.run([gs, x, step_op])
        print('step %d, x is %s' % (t, v))
Output:

step 1, x is [1. 1. 1.]
step 2, x is [3. 3. 3.]
step 3, x is [6. 6. 6.]
step 4, x is [10. 10. 10.]
step 5, x is [10. 10. 10.]
step 6, x is [11. 11. 11.]
step 7, x is [13. 13. 13.]
step 8, x is [16. 16. 16.]
step 9, x is [20. 20. 20.]
step 10, x is [20. 20. 20.]
step 11, x is [21. 21. 21.]
step 12, x is [23. 23. 23.]
step 13, x is [26. 26. 26.]
step 14, x is [30. 30. 30.]
step 15, x is [30. 30. 30.]

Support real global batch normalisation

TensorFlow 2.2 supports the following feature:

Support added for global sync BatchNormalization by using the newly added tf.keras.layers.experimental.SyncBatchNormalization layer. This layer will sync BatchNormalization statistics every step across all replicas taking part in sync training.

It would be great if KungFu can support it as well.
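
What "sync" buys can be stated precisely: the normalisation statistics are computed over the union of all replicas' batches, which is exactly what an all-reduce of per-replica sums achieves. A NumPy sketch of the intended semantics (not KungFu's or Keras's API; the local sums would each be all-reduced in a real implementation):

```python
import numpy as np

def sync_batch_stats(replica_batches):
    """Global mean/variance over all replicas' samples.

    The three quantities summed across replicas here (sample counts,
    sums, and sums of squares) are what sync batch norm all-reduces.
    """
    n = sum(len(b) for b in replica_batches)
    s = sum(b.sum(axis=0) for b in replica_batches)
    sq = sum((b ** 2).sum(axis=0) for b in replica_batches)
    mean = s / n
    var = sq / n - mean ** 2
    return mean, var
```

The result is identical to computing statistics over the concatenated global batch, which is what distinguishes this from per-replica batch norm.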

Performance drops when TensorFlow experimental XLA JIT is enabled.

@lgarithm @luomai
You can test with your kungfu_benchmark.py by adding one line:

args = parser.parse_args()
args.cuda = not args.no_cuda

config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

I tested on a node with 8 V100 GPUs; the results are as follows.
Configuration:
optimizer=sync-sgd
batch-size=64

kungfu:
no xla:

[127.0.0.1.10000::stdout] Iter #3: 313.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #4: 312.8 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #5: 316.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 311.5 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 314.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 315.3 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 313.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 313.8 +-2.4

with xla:

[127.0.0.1.10000::stdout] Iter #5: 230.2 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #6: 230.9 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #7: 231.4 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #8: 229.6 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Iter #9: 230.1 img/sec per /gpu:0
[127.0.0.1.10000::stdout] Img/sec per /gpu:0: 230.1 +-1.5

horovod:
no xla:

Iter #35: 334.8 img/sec per GPU
Iter #36: 335.5 img/sec per GPU
Iter #37: 327.0 img/sec per GPU
Iter #38: 327.9 img/sec per GPU
Iter #39: 335.2 img/sec per GPU
Iter #40: 334.9 img/sec per GPU
Iter #41: 335.0 img/sec per GPU
Iter #42: 334.9 img/sec per GPU
Iter #43: 335.4 img/sec per GPU
Iter #44: 335.3 img/sec per GPU
Iter #45: 331.8 img/sec per GPU
Iter #46: 334.7 img/sec per GPU
Iter #47: 335.3 img/sec per GPU
Iter #48: 334.8 img/sec per GPU
Iter #49: 335.2 img/sec per GPU
Img/sec per GPU: 334.2 +-5.4

with xla:

Iter #39: 372.4 img/sec per GPU
Iter #40: 379.9 img/sec per GPU
Iter #41: 379.0 img/sec per GPU
Iter #42: 380.2 img/sec per GPU
Iter #43: 378.8 img/sec per GPU
Iter #44: 379.9 img/sec per GPU
Iter #45: 380.3 img/sec per GPU
Iter #46: 379.7 img/sec per GPU
Iter #47: 379.4 img/sec per GPU
Iter #48: 379.5 img/sec per GPU
Iter #49: 379.1 img/sec per GPU
Img/sec per GPU: 379.7 +-5.1

The KungFu job hangs when it scales down

Following up on #294 (the scale-up case), I continued testing the scale-down case of a KungFu job in Kubernetes, and I found there may be a race condition in how KungFu handles scaling down.

This is the time sequence of my test:

  1. bootstrap container A
  2. A is running
  3. update config.json and bootstrap container B (scale up)
  4. B is running
  5. A and B are running well
  6. update config.json and shut down container B (scale down)
  7. B's server closes immediately
  8. A hangs forever (there is no error log and no AllReduce broken-pipe log)

B's final logs (line 4 indicates the time of the config.json update):

[10.0.0.224.10003::stdout] [D] ingore unchanged proposal
[10.0.0.224.10003::stdout] [D] ignore update
I0607 20:13:22.713992       1 watch_host_file.go:65] update host file
I0607 20:13:22.714544       1 ma_fmk_kungfu.go:150] generated host file
[W] terminated trapped
[D] cancelled
[E] canceled: context canceled
[10.0.0.224.10006::stdout] [D] Server Closed
[D] Server Closed
[10.0.0.224.10004::stdout] [D] Server Closed
[10.0.0.224.10007::stdout] [D] Server Closed
[10.0.0.224.10001::stdout] [D] Server Closed
[10.0.0.224.10003::stdout] [D] Server Closed
[10.0.0.224.10005::stdout] [D] Server Closed
[10.0.0.224.10002::stdout] [D] Server Closed
[10.0.0.224.10000::stdout] [D] Server Closed
[I] stop watching
[D] Server Closed
[D] kungfu-run finished, took 2m22.50148318s

A's final logs (line 4 indicates the time of the config.json update):

[10.0.1.26.10005::stdout] [D] ignore update
[10.0.1.26.10003::stdout] [D] ignore update
I0607 20:13:22.712071       1 watch_host_file.go:65] update host file
I0607 20:13:22.712325       1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_133) failed: runStrategies failed with2 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: connection reset by peer, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_67) failed: runStrategies failed with 1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_191) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_130) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_125) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_128) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_124) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_129) failed: runStrategies failed with9 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_123) failed: runStrategies failed with2 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_121) failed: runStrategies failed with3 errors: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe, par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_118) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_122) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_120) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_116) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe
[10.0.1.26.10000::stderr] [E] kungfu operation AllReduce(v0/tower_0/KungfuAllReduce_115) failed: runStrategies failed with1 error: par failed with 1 error: write tcp 10.0.1.26:39252->10.0.0.224:10000: write: broken pipe

Currently, I have a workaround:

  1. Introduce a kungfu-manager process as the PID 1 process of the container, and use this manager process to bootstrap kungfu-run. The kungfu-manager process swallows the TERM signal and does not forward it to the kungfu-run process; the KungFu processes then exit by respecting the updated config.json (scale down).

By the way, kungfu-run acted as the PID 1 process in the container (in my hang case), so it receives the TERM signal when I shut down the container.

I am not sure whether my workaround is the correct way to scale down a job, i.e. scaling down only by updating config.json and never sending TERM or INT signals to the KungFu processes.

There is still a drawback: if users trap the TERM signal in their Python scripts, perhaps they cannot do so under KungFu?
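
The workaround described above can be sketched in a few lines. This is a hypothetical kungfu-manager, not anything shipped with KungFu: it installs a no-op SIGTERM handler so that a container shutdown does not kill kungfu-run, which is then expected to exit on its own once config.json shrinks:

```python
#!/usr/bin/env python3
import signal
import subprocess
import sys

def main():
    # Runs as PID 1 in the container. Swallow SIGTERM instead of letting
    # it reach kungfu-run; workers exit when config.json scales them out.
    signal.signal(signal.SIGTERM, lambda signum, frame: None)
    proc = subprocess.Popen(sys.argv[1:])  # e.g. ["kungfu-run", "-w", ...]
    sys.exit(proc.wait())

if __name__ == '__main__':
    main()
```

Note that because the handler is a no-op rather than SIG_IGN, a forced kill (SIGKILL) still works, and user scripts keep their own signal handling intact.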

Hang when running in distributed mode

Following the README, I can run on a single local node successfully:
xxx@master:/tmp/KungFu$ kungfu-run -np 2 python3 examples/tf1_mnist_session.py --data-dir=./mnist

...
[I] all 2/2 local peers finished, took 2.397370504s

But when running on a cluster, it hangs without any error:
@master:/tmp/KungFu$ kungfu-run -np 2 -H 10.208.209.163:1,10.208.209.171:1 -nic eno1 python3 examples/tf1_mnist_session.py --data-dir=./mnist
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=2
[arg] [3]=-H
[arg] [4]=10.208.209.163:1,10.208.209.171:1
[arg] [5]=-nic
[arg] [6]=eno1
[arg] [7]=python3
[arg] [8]=examples/tf1_mnist_session.py
[arg] [9]=--data-dir=./mnist
[nic] [0] lo :: 127.0.0.1/8
[nic] [1] eno1 :: 10.208.209.163/24
[nic] [2] docker0 :: 192.168.99.1/24
[nic] [3] br-fefb2fb37d81 :: 172.18.0.1/16
[cuda-env]: CUDA_VISIBLE_DEVICES=1
[I] will parallel run 1 instances of python3 with ["examples/tf1_mnist_session.py" "--data-dir=./mnist"]

A question about Horovod's central coordinator in the KungFu paper

The asynchronous collective communication layer also avoids having an expensive central coordinator, as used for invoking synchronous collective communication operations in existing systems, such as Horovod.

I have read the Horovod and KungFu papers, and I wonder why Horovod uses a central coordinator; I haven't found it in the Horovod paper. Could you please point me to some information about it, such as the relevant code? I want to compare the difference.

Thanks! Have a nice day!

[doc] Request: parameter documentation for -init-version=-1

I'd like KungFu to provide a brief doc about the parameters. I tried setting -init-version=-1 and omitting the -H parameter, but it seems kungfu-run can't handle that well:

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=64
[arg] [3]=-w
[arg] [4]=-config-server
[arg] [5]=file:///home/ma-user/user-job-dir/config.json
[arg] [6]=-nic
[arg] [7]=ib0
[arg] [8]=-init-version
[arg] [9]=-1
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification_xk.py
[arg] [12]=--num_clases=1001
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
exit on error: 169.254.143.141:38080 not in 127.0.0.1:38080 at 7037287:/home/work/KungFu/srcs/go/cmd/kungfu-run/kungfu-run.go:62

Adaptive dataset notes

From the output, we can see that the dataset iterator has been rescheduled on the fly.

#!/usr/bin/env python3

import numpy as np
import tensorflow as tf


def main():
    batch_size = tf.Variable(tf.constant(10, tf.int64))
    n_workers = tf.Variable(tf.constant(4, tf.int64))
    shard_id = tf.Variable(tf.constant(1, tf.int64))

    def _update_topology():
        update_cluster = tf.assign(n_workers, n_workers + 1)
        update_batch_size = tf.assign(batch_size, batch_size + 2)
        with tf.control_dependencies([update_cluster]):
            return tf.group([
                tf.assign(shard_id, tf.mod(shard_id + 1, n_workers)),
                update_batch_size,
            ])

    n = 1024
    source = np.array(list(range(n)))
    it = tf.data.Dataset.from_tensor_slices(source)

    _update_topology_op = _update_topology()
    it = it.batch(batch_size)
    it = it.shard(n_workers, shard_id)
    it = it.make_initializable_iterator()
    _reschedule_op = it.initializer
    get_next = it.get_next()

    def _debug_info(sess):
        np, rank, bs = sess.run([n_workers, shard_id, batch_size])
        print('np=%d, rank=%d, batch size=%d' % (np, rank, bs))

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        _debug_info(sess)

        for stage in range(3):
            print('stage %d' % stage)
            sess.run(_update_topology_op)
            sess.run(_reschedule_op)
            _debug_info(sess)
            for step in range(3):
                v = sess.run(get_next)
                print('stage %d, step %d, %s (bs=%d)' %
                      (stage, step, v, len(v)))


main()
Output:

np=4, rank=1, batch size=10
stage 0
np=5, rank=2, batch size=12
stage 0, step 0, [24 25 26 27 28 29 30 31 32 33 34 35] (bs=12)
stage 0, step 1, [84 85 86 87 88 89 90 91 92 93 94 95] (bs=12)
stage 0, step 2, [144 145 146 147 148 149 150 151 152 153 154 155] (bs=12)
stage 1
np=6, rank=3, batch size=14
stage 1, step 0, [42 43 44 45 46 47 48 49 50 51 52 53 54 55] (bs=14)
stage 1, step 1, [126 127 128 129 130 131 132 133 134 135 136 137 138 139] (bs=14)
stage 1, step 2, [210 211 212 213 214 215 216 217 218 219 220 221 222 223] (bs=14)
stage 2
np=7, rank=4, batch size=16
stage 2, step 0, [64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79] (bs=16)
stage 2, step 1, [176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191] (bs=16)
stage 2, step 2, [288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303] (bs=16)

KungFu Parallel AllReduce Gradient Tensor Names

In ResNet-50, I noticed that there exist tensors with names such as tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:2 and tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:1. The current all-reduce code (CPU and GPU) elides the last two characters of the tensor names (e.g., https://github.com/lsds/KungFu/blob/master/srcs/python/kungfu/ops.py#L46), which means that the key used during the all-reduce operation is not unique. This might explain why the number of tensors observed in-flight is less than the number of tensors before the operators are constructed. Could you please confirm whether this behaviour is expected?
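
The collision is easy to reproduce: stripping a fixed two-character suffix maps both output names to the same key, whereas splitting on the last ':' and keeping the output index preserves uniqueness. A small illustration (the truncation mimics the elision described above; the names are the two from this issue):

```python
# Two distinct outputs of the same FusedBatchNormGrad op:
names = [
    'tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:1',
    'tower_0/v0/gradients/tower_0/v0/cg/resnet_v111/conv38/batchnorm38/FusedBatchNorm_grad/FusedBatchNormGrad:2',
]

# Dropping the last two characters (the ':N' suffix) collapses them:
keys = [name[:-2] for name in names]
assert keys[0] == keys[1]  # key collision during all-reduce

# Keeping the (op_name, output_index) pair keeps the keys unique:
safe_keys = [tuple(name.rsplit(':', 1)) for name in names]
assert safe_keys[0] != safe_keys[1]
```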

Is Windows supported?

Hello! I am looking for a distributed training framework. Have you tried this on Windows? I am facing a linking issue with TensorFlow (possibly a MinGW and MSVC linking conflict).

Failed to establish connection to the newly created runner

Hi, I am testing the elastic capability of KungFu in a Kubernetes cluster (containers), and it turns out there is a tricky problem: sometimes KungFu can't connect to the newly created runner, and the job fails due to the following code:

// FIXME: assuming runners are up and running
var notify execution.PeerFunc = func(ctrl plan.PeerID) error {
	return p.router.Send(ctrl.WithName("update"), stage.Encode(), connection.ConnControl, 0)
}
if err := notify.Par(cluster.Runners); err != nil {
	utils.ExitErr(err)
}

These are my running logs:

[D] update to v1 with [16@2]{10.0.1.50:10000,10.0.1.50:10001,10.0.1.50:10002,10.0.1.50:10003,10.0.1.50:10004,10.0.1.50:10005,10.0.1.50:10006,10.0.1.50:10007,10.0.0.241:10000,10.0.0.241:10001,10.0.0.241:10002,10.0.0.241:10003,10.0.0.241:10004,10.0.0.241:10005,10.0.0.241:10006,10.0.0.241:10007}@{10.0.1.50:38080,10.0.0.241:38080}
[10.0.1.50.10005::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10000::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10001::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10006::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10002::stdout] [D] failed to establish connection to #<10.0.0.241:38080> for 1 times: dial tcp 10.0.0.241:38080: connect: connection refused
[10.0.1.50.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140552590385529:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 140247738442105:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10004::stdout] exit on error: par failed with 1 error: can't establish connection at 139984866718073:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10005::stdout] exit on error: par failed with 1 error: can't establish connection at 139962272162169:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139740905816441:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 140017202741625:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10006::stdout] exit on error: par failed with 1 error: can't establish connection at 140218611171705:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[10.0.1.50.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140089222893945:/tmp/pip-req-build-9vrn50rb/srcs/go/kungfu/peer/peer.go:189
[I] 10.0.1.50.10000 finished with error: exit status 1
exit on error: exit status 1 at 7069499:/cache/KungFu/srcs/go/kungfu/runner/watch.go:146

I observed that there is a time gap between updating the config.json file and the new KungFu runner bootstrapping (i.e. the new container being created).

I found the retry setting here:

var initRetry int
if t == ConnCollective || t == ConnPeerToPeer {
	initRetry = config.ConnRetryCount
}

also

for i := 0; i <= c.initRetry; i++ {
	var err error
	if c.conn, err = c.init(); err == nil {
		log.Debugf("%s connection to #<%s> established after %d trials, took %s", c.connType, c.dest, i+1, time.Since(t0))
		return nil
	}
	log.Debugf("failed to establish connection to #<%s> for %d times: %v", c.dest, i+1, err)
	time.Sleep(config.ConnRetryPeriod)
}
return errCantEstablishConnection

This shows that control messages get only one attempt; if it fails, the job exits.

To clarify, here is the time sequence of my job:

  1. start one KungFu container (A) with -w config.json
  2. A is running
  3. update config.json to two containers (A, B), and bootstrap another container B
  4. A tries to connect to container B's runner
  5. container B is running

Finally, I am not sure whether simply increasing the retry count for control messages would solve my problem, or what the best practice for running KungFu in Kubernetes is.
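
A fix along the lines suggested would give control messages the same retry budget as collective connections. In Python pseudocode mirroring the Go retry loop quoted above (connect, retries, and period are illustrative names, not KungFu's API):

```python
import time

def connect_with_retry(connect, retries, period):
    """Call `connect` up to `retries + 1` times, sleeping `period`
    seconds between failed attempts; re-raise the last error."""
    last_err = None
    for _ in range(retries + 1):
        try:
            return connect()
        except ConnectionError as err:
            last_err = err
            time.sleep(period)
    raise last_err
```

This tolerates the bootstrap gap between the config.json update and the new container coming up, at the cost of delaying a genuine failure by retries * period.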

Panic error

@luomai @marwage, when running the early BERT demo and updating from 1 peer to 8 peers, a panic happens. The configuration is the same as in run.sh; the error is as follows:
updating to 1 peers: 169.254.128.33:13006
[I] updated to 1 peers: 169.254.128.33:13006
OK
[I] arrived at v11, new np=1, local: +0/-6, global: -0/14
[169.254.128.33.13000::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13000/model.ckpt.
[169.254.128.33.13003::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13003/model.ckpt.
[169.254.128.33.13001::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13001/model.ckpt.
[169.254.128.33.13002::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13002/model.ckpt.
[169.254.128.33.13006::stdout] sync offset 153320 -> 153320 on step 605
[169.254.128.33.13004::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13004/model.ckpt.
[169.254.128.33.13007::stderr] INFO:tensorflow:Saving checkpoints for 1047 into /cache/output/169.254.128.33-13007/model.ckpt.
[169.254.128.33.13007::stderr] WARNING:tensorflow:From /home/work/anaconda3/lib/python3.6/site-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
[169.254.128.33.13007::stderr] Instructions for updating:
[169.254.128.33.13007::stderr] Use standard file APIs to delete files with this prefix.
[169.254.128.33.13003::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13001::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13004::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13002::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13000::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13007::stdout] stopped after trained 153320 samples in 138 steps due to change cluster
[169.254.128.33.13003::stderr] INFO:tensorflow:Loss for final step: 1.1395919.
[169.254.128.33.13003::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13003::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.733748
[169.254.128.33.13001::stderr] INFO:tensorflow:Loss for final step: 1.4893956.
[169.254.128.33.13001::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13001::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.850651
[169.254.128.33.13004::stderr] INFO:tensorflow:Loss for final step: 1.357612.
[169.254.128.33.13004::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13004::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:42.928306
[169.254.128.33.13002::stderr] INFO:tensorflow:Loss for final step: 1.4187098.
[169.254.128.33.13002::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13002::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.183136
[169.254.128.33.13000::stderr] INFO:tensorflow:Loss for final step: 1.1474836.
[169.254.128.33.13000::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13000::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.285399
[169.254.128.33.13007::stderr] INFO:tensorflow:Loss for final step: 2.2803762.
[169.254.128.33.13007::stderr] INFO:tensorflow:training_loop marked as finished
[169.254.128.33.13007::stderr] INFO:tensorflow:Training end time 2020-03-13 12:45:43.593442
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 0.469528
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 3.75623
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41049
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.2839
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41522
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.3218
[169.254.128.33.13006::stderr] INFO:tensorflow:Saving checkpoints for 1390 into /cache/output/169.254.128.33-13006/model.ckpt.
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.1294
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 25.0352
[169.254.128.33.13006::stderr] INFO:tensorflow:global_step/sec: 3.41119
[169.254.128.33.13006::stderr] INFO:tensorflow:examples/sec: 27.2895
updating to 8 peers: 169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005
[I] updated to 8 peers: 169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005
OK
[E] full update detected: [1@2]{169.254.128.33:13006}@{169.254.128.33:38080,169.254.128.114:38080} -> [8@2]{169.254.128.33:13002,169.254.128.114:13006,169.254.128.114:13005,169.254.128.33:13001,169.254.128.114:13002,169.254.128.114:13001,169.254.128.114:13000,169.254.128.33:13005}@{169.254.128.33:38080,169.254.128.114:38080}
[I] arrived at v12, new np=8, local: +3/-1, global: -8/1
[169.254.128.33.13006::�[1;35mstderr�[m] panic: runtime error: invalid memory address or nil pointer dereference
[169.254.128.33.13006::�[1;35mstderr�[m] [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x7fab3df583c5]
[169.254.128.33.13006::�[1;35mstderr�[m]
[169.254.128.33.13006::�[1;35mstderr�[m] goroutine 2706338 [running]:
[169.254.128.33.13006::�[1;35mstderr�[m] encoding/binary.Write(0x0, 0x0, 0x7fab3e4a5300, 0x7fab3e6c5900, 0x7fab3e4166e0, 0xc0007fa548, 0x6, 0xc0007fa540)
[169.254.128.33.13006::�[1;35mstderr�[m] /usr/local/go/src/encoding/binary/binary.go:342 +0x145
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/rchannel.(*messageHeader).WriteTo(0xc0b92adda8, 0x0, 0x0, 0x0, 0x0)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/message.go:84 +0x78
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/rchannel.(*tcpConnection).Send(0xc0006bccf0, 0x7fab3e10338a, 0x6, 0x17d, 0xc000320000, 0x17d, 0x17d, 0x0, 0x7fab00000000, 0x0, ...)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/connection.go:106 +0x16b
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/rchannel.(*Router).send(0xc000174db0, 0x94c0a9fe8072, 0x7fab3e10338a, 0x6, 0x17d, 0xc000320000, 0x17d, 0x17d, 0x0, 0x1, ...)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/router.go:53 +0xc0
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/rchannel.(*Router).Send(0xc000174db0, 0x94c0a9fe8072, 0x7fab3e10338a, 0x6, 0xc000320000, 0x17d, 0x17d, 0x320001, 0x17d, 0x17d)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/rchannel/router.go:44 +0xeb
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/kungfu.(*Kungfu).propose.func1(0x94c0a9fe8072, 0x5a, 0x5a)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:204 +0xfc
[169.254.128.33.13006::�[1;35mstderr�[m] github.com/lsds/KungFu/srcs/go/kungfu.par.func1(0xc00022c0f0, 0xc0002500c0, 0x2, 0x2, 0xc00031e230, 0x1, 0x94c0a9fe8072)
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:171 +0x3e
[169.254.128.33.13006::�[1;35mstderr�[m] created by github.com/lsds/KungFu/srcs/go/kungfu.par
[169.254.128.33.13006::�[1;35mstderr�[m] /tmp/pip-req-build-3r6z01fr/srcs/go/kungfu/kungfu.go:170 +0x10f
[I] 169.254.128.33.13006 finished with error: signal: aborted (core dumped)
exit on error: signal: aborted (core dumped)

Inconsistency detected by ld.so

Inconsistency detected by ld.so: ../elf/dl-tls.c: 481: _dl_allocate_tls_init: Assertion `listp->slotinfo[cnt].gen <= GL(dl_tls_generation)' failed!

This often happens when running on Azure with P100 GPUs:

    Standard_NC6s_v2
    Standard_NC12s_v2
    Standard_NC24s_v2

Kungfu-run log files directory

I would like to give kungfu-run a different directory in which to write the log files.
The -logfile option logs something other than the per-worker logs such as 127.0.0.1.10000-stderr.log.

Access to Adaptive Batch Size Policy

In the paper based on this project it is mentioned: "We implement an AP that adapts the batch size based on GNS when training the ResNet-56 model with the CIFAR-10 dataset."

Is this corresponding benchmark or adaptation policy available to the public? I'm having trouble finding it. I'm really interested in this portion of the project and am looking forward to seeing how GNS can be used to enforce dynamic batch size selection.

I've looked through the tensorflow benchmarks and monitoring benchmarks but all I can find is:

Thanks!

Lower speed

Hi.

Distributed mode is slower than single mode.

  1. In single mode:
    kungfu-run -np 1 -H 10.208.209.163,10.208.209.171 -nic eno1 python3 ../../KungFu-0.2.1/examples/tf1_mnist_keras.py --n-epochs 50

it takes 2s for one epoch:
[10.208.209.163.10000::stdout] Epoch 1/50
[10.208.209.163.10000::stderr] 2020-02-01 09:32:09.231149: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] - 2s - loss: 0.3730 - sparse_categorical_accuracy: 0.8899 - val_loss: 0.1923 - val_sparse_categorical_accuracy: 0.9436

  2. In distributed mode:
    kungfu-run -np 2 -H 10.208.209.163,10.208.209.171 -nic eno1 python3 ../../KungFu-0.2.1/examples/tf1_mnist_keras.py --n-epochs 50

it takes 22s for one epoch:
[10.208.209.163.10000::stdout] Epoch 1/50
[10.208.209.163.10000::stderr] 2020-02-01 09:31:14.899848: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
[10.208.209.163.10000::stdout] - 22s - loss: 0.4644 - sparse_categorical_accuracy: 0.8648 - val_loss: 0.2567 - val_sparse_categorical_accuracy: 0.9226

Collective All-reduce using NCCL

We are using the default stream that TF gives us (context->op_device_context()->stream()). This is the computation stream, so running all-reduce on it blocks graph computation and hurts performance. We should use the device-to-device communication stream instead.

Synchronous P2P Request Latency Too Large

Please find below the latencies measured for the synchronous case of P2P model request and store update. This is the current bottleneck of the system and must be resolved to improve throughput.

[127.0.0.1/02/02-of-04::stdout] Model store update took 16.012414ms
[127.0.0.1/03/03-of-04::stdout] Model store update took 276.625µs
[127.0.0.1/02/02-of-04::stdout] Model store update took 176.388µs
[127.0.0.1/02/02-of-04::stdout] Request took 673.288µs
[127.0.0.1/03/03-of-04::stdout] Request took 15.947665ms
[127.0.0.1/00/00-of-04::stdout] Model store update took 3.733306ms
[127.0.0.1/01/01-of-04::stdout] Model store update took 175.905µs
[127.0.0.1/00/00-of-04::stdout] Request took 14.822951ms
[127.0.0.1/03/03-of-04::stdout] Model store update took 195.484µs
[127.0.0.1/01/01-of-04::stdout] Request took 20.67839ms
[127.0.0.1/02/02-of-04::stdout] Model store update took 216.966µs
[127.0.0.1/00/00-of-04::stdout] Model store update took 235.257µs
[127.0.0.1/01/01-of-04::stdout] Request took 833.436µs

Error from pytorch demo

[127.0.0.1.10000::stderr] Traceback (most recent call last):
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 157, in
[127.0.0.1.10000::stderr] main()
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 153, in main
[127.0.0.1.10000::stderr] train(args, model, device, optimizer, step_based_schedule)
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 71, in train
[127.0.0.1.10000::stderr] sync_model(model)
[127.0.0.1.10000::stderr] File "/home/wrk/KungFu/examples/torch_elastic/torch_mnist_example.py", line 48, in sync_model
[127.0.0.1.10000::stderr] kf.broadcast_parameters(model.state_dict())
[127.0.0.1.10000::stderr] File "/home/wrk/anaconda3/envs/py36tf13/lib/python3.6/site-packages/kungfu/torch/ops/collective.py", line 43, in broadcast_parameters
[127.0.0.1.10000::stderr] h = inplace_broadcast_async_op(value, name)
[127.0.0.1.10000::stderr] File "/home/wrk/anaconda3/envs/py36tf13/lib/python3.6/site-packages/kungfu/torch/ops/collective.py", line 29, in inplace_broadcast_async_op
[127.0.0.1.10000::stderr] return broadcast_async_op_map[x.type()](x, x, x.type(), name)
[127.0.0.1.10000::stderr] KeyError: 'torch.FloatTensor'
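From the traceback, inplace_broadcast_async_op dispatches on x.type() through broadcast_async_op_map, and the CPU type string 'torch.FloatTensor' has no entry. A minimal stand-in for that dispatch pattern (the map contents here are an assumption for illustration, not the actual KungFu source) reproduces the failure mode:

```python
# Stand-in for the dispatch pattern implied by the traceback: collective ops
# are registered per tensor-type string, so looking up an unregistered type
# (here the CPU type 'torch.FloatTensor') raises KeyError.
broadcast_async_op_map = {
    # assumed: only CUDA tensor types were registered in this build
    "torch.cuda.FloatTensor": lambda x, name: ("cuda-broadcast", name),
}

def inplace_broadcast_async_op(tensor_type, name):
    # direct dict lookup, matching the KeyError in the log
    return broadcast_async_op_map[tensor_type](None, name)
```

If this reading is right, either the model needs to live on the GPU (so x.type() becomes a registered CUDA type) or a CPU entry needs to be registered in the map.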

@lgarithm

Point-to-Point Communication API

After inspecting the code and trying out an implementation skeleton, I have two questions and one observation:

  1. The receive function in Router currently expects a message from a particular peer (one of Prevs in the gather graph, as required by the collective communication paradigm). We will need to adapt this for point-to-point communication. I am thinking of simply listening on k distinct channels, one for each peer. Is there an efficient way of doing this in Go?
  2. In the Ako operator, I need to spawn a listener when the operator is constructed so that the inbound gradient can be updated with the partial gradients received from other peers. Is there a way to do multi-threading inside a TensorFlow operator? I am afraid that threads may not be supported.
  3. For sending, we will not require any underlying topology (although we can think of it as an implicit full-mesh network). It is sufficient for each node to know the total number of nodes.

My current code skeleton for point-to-point communication for Ako can be found on branch point-to-point.

Kungfu installation issue

I have installed the necessary packages with Python 3.5, Golang 1.6 and TensorFlow 1.2.0.

When I installed KungFu, I came across the following issue:

  ERROR: Command errored out with exit status 1:
   command: /eric_work_space/gx/py3/bin/python3.5 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-6mffe0_d --python-tag cp35
       cwd: /tmp/pip-req-build-zz01ll48/
  Complete output (125 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.5
  creating build/lib.linux-x86_64-3.5/kungfu
  copying ./srcs/python/kungfu/ext.py -> build/lib.linux-x86_64-3.5/kungfu
  copying ./srcs/python/kungfu/loader.py -> build/lib.linux-x86_64-3.5/kungfu
  copying ./srcs/python/kungfu/_utils.py -> build/lib.linux-x86_64-3.5/kungfu
  copying ./srcs/python/kungfu/__init__.py -> build/lib.linux-x86_64-3.5/kungfu
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow
  copying ./srcs/python/kungfu/tensorflow/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
  copying ./srcs/python/kungfu/tensorflow/v2/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
  copying ./srcs/python/kungfu/tensorflow/v1/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
  copying ./srcs/python/kungfu/tensorflow/initializer/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
  copying ./srcs/python/kungfu/tensorflow/initializer/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/topology.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/adapt.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/p2p.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/collective.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/state.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/monitor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/_tf_oplib.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  copying ./srcs/python/kungfu/tensorflow/ops/local.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/core.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/grad_noise_scale.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/sync_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/async_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/ada_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  copying ./srcs/python/kungfu/tensorflow/optimizers/grad_variance.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
  copying ./srcs/python/kungfu/tensorflow/compat/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
  copying ./srcs/python/kungfu/tensorflow/v2/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
  copying ./srcs/python/kungfu/tensorflow/v2/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
  copying ./srcs/python/kungfu/tensorflow/v1/datasets/adaptor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
  copying ./srcs/python/kungfu/tensorflow/v1/datasets/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
  copying ./srcs/python/kungfu/tensorflow/v1/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
  copying ./srcs/python/kungfu/tensorflow/v1/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
  copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/layers.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
  copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
  copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
  copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
  creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/imagenet.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/idx.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/utils.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  copying ./srcs/python/kungfu/tensorflow/v1/helpers/cifar.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
  running build_ext
  -- The C compiler identification is GNU 5.4.0
  -- The CXX compiler identification is GNU 5.4.0
  -- Check for working C compiler: /usr/bin/cc
  -- Check for working C compiler: /usr/bin/cc -- works
  -- Detecting C compiler ABI info
  -- Detecting C compiler ABI info - done
  -- Detecting C compile features
  -- Detecting C compile features - done
  -- Check for working CXX compiler: /usr/bin/c++
  -- Check for working CXX compiler: /usr/bin/c++ -- works
  -- Detecting CXX compiler ABI info
  -- Detecting CXX compiler ABI info - done
  -- Detecting CXX compile features
  -- Detecting CXX compile features - done
  -- Configuring done
  -- Generating done
  -- Build files have been written to: /tmp/pip-req-build-zz01ll48/build/temp.linux-x86_64-3.5
  Scanning dependencies of target libkungfu-comm
  can't load package: package /tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm: import "/tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm": cannot import absolute path
  CMakeFiles/libkungfu-comm.dir/build.make:57: recipe for target 'CMakeFiles/libkungfu-comm' failed
  make[2]: *** [CMakeFiles/libkungfu-comm] Error 1
  CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/libkungfu-comm.dir/all' failed
  make[1]: *** [CMakeFiles/libkungfu-comm.dir/all] Error 2
  Makefile:83: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-req-build-zz01ll48/setup.py", line 108, in <module>
      install_requires=[],
    File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/eric_work_space/gx/py3/lib/python3.5/site-packages/wheel/bdist_wheel.py", line 192, in run
      self.run_command('build')
    File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
      self.build_extensions()
    File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-req-build-zz01ll48/setup.py", line 90, in build_extension
      cwd=self.build_temp,
    File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2
  ----------------------------------------
  ERROR: Failed building wheel for kungfu
  Running setup.py clean for kungfu
Failed to build kungfu
Installing collected packages: kungfu
    Running setup.py install for kungfu ... error
    ERROR: Command errored out with exit status 1:
     command: /eric_work_space/gx/py3/bin/python3.5 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-zz01ll48/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-e64wkiq0/install-record.txt --single-version-externally-managed --compile --install-headers /eric_work_space/gx/py3/include/site/python3.5/kungfu
         cwd: /tmp/pip-req-build-zz01ll48/
    Complete output (127 lines):
    running install
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.5
    creating build/lib.linux-x86_64-3.5/kungfu
    copying ./srcs/python/kungfu/ext.py -> build/lib.linux-x86_64-3.5/kungfu
    copying ./srcs/python/kungfu/loader.py -> build/lib.linux-x86_64-3.5/kungfu
    copying ./srcs/python/kungfu/_utils.py -> build/lib.linux-x86_64-3.5/kungfu
    copying ./srcs/python/kungfu/__init__.py -> build/lib.linux-x86_64-3.5/kungfu
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow
    copying ./srcs/python/kungfu/tensorflow/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
    copying ./srcs/python/kungfu/tensorflow/v2/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
    copying ./srcs/python/kungfu/tensorflow/v1/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
    copying ./srcs/python/kungfu/tensorflow/initializer/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
    copying ./srcs/python/kungfu/tensorflow/initializer/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/initializer
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/topology.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/adapt.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/p2p.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/collective.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/state.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/monitor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/_tf_oplib.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    copying ./srcs/python/kungfu/tensorflow/ops/local.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/ops
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/core.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/grad_noise_scale.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/sync_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/async_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/sma_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/ada_sgd.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/keras.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    copying ./srcs/python/kungfu/tensorflow/optimizers/grad_variance.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/optimizers
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
    copying ./srcs/python/kungfu/tensorflow/compat/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/compat
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
    copying ./srcs/python/kungfu/tensorflow/v2/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
    copying ./srcs/python/kungfu/tensorflow/v2/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v2/examples
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
    copying ./srcs/python/kungfu/tensorflow/v1/datasets/adaptor.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
    copying ./srcs/python/kungfu/tensorflow/v1/datasets/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/datasets
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
    copying ./srcs/python/kungfu/tensorflow/v1/examples/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
    copying ./srcs/python/kungfu/tensorflow/v1/examples/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/examples
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
    copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/layers.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
    copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__main__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
    copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
    copying ./srcs/python/kungfu/tensorflow/v1/benchmarks/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/benchmarks
    creating build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/mnist.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/imagenet.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/idx.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/utils.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/__init__.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    copying ./srcs/python/kungfu/tensorflow/v1/helpers/cifar.py -> build/lib.linux-x86_64-3.5/kungfu/tensorflow/v1/helpers
    running build_ext
    -- The C compiler identification is GNU 5.4.0
    -- The CXX compiler identification is GNU 5.4.0
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Configuring done
    -- Generating done
    -- Build files have been written to: /tmp/pip-req-build-zz01ll48/build/temp.linux-x86_64-3.5
    Scanning dependencies of target libkungfu-comm
    can't load package: package /tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm: import "/tmp/pip-req-build-zz01ll48/srcs/go/libkungfu-comm": cannot import absolute path
    CMakeFiles/libkungfu-comm.dir/build.make:57: recipe for target 'CMakeFiles/libkungfu-comm' failed
    make[2]: *** [CMakeFiles/libkungfu-comm] Error 1
    CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/libkungfu-comm.dir/all' failed
    make[1]: *** [CMakeFiles/libkungfu-comm.dir/all] Error 2
    Makefile:83: recipe for target 'all' failed
    make: *** [all] Error 2
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-zz01ll48/setup.py", line 108, in <module>
        install_requires=[],
      File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/__init__.py", line 145, in setup
        return distutils.core.setup(**attrs)
      File "/usr/lib/python3.5/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/usr/lib/python3.5/distutils/dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/install.py", line 61, in run
        return orig.install.run(self)
      File "/usr/lib/python3.5/distutils/command/install.py", line 583, in run
        self.run_command('build')
      File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/usr/lib/python3.5/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/usr/lib/python3.5/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/usr/lib/python3.5/distutils/dist.py", line 974, in run_command
        cmd_obj.run()
      File "/eric_work_space/gx/py3/lib/python3.5/site-packages/setuptools/command/build_ext.py", line 84, in run
        _build_ext.run(self)
      File "/usr/lib/python3.5/distutils/command/build_ext.py", line 338, in run
        self.build_extensions()
      File "/usr/lib/python3.5/distutils/command/build_ext.py", line 447, in build_extensions
        self._build_extensions_serial()
      File "/usr/lib/python3.5/distutils/command/build_ext.py", line 472, in _build_extensions_serial
        self.build_extension(ext)
      File "/tmp/pip-req-build-zz01ll48/setup.py", line 90, in build_extension
        cwd=self.build_temp,
      File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2
    ----------------------------------------

Can anyone help address this issue?

Support for shared-memory channels?

I am thinking of supporting a shared-memory channel. According to this benchmark, a properly implemented shared-memory channel has the potential to outperform domain sockets by 100x. I quote the numbers for shared memory and domain sockets produced by this IPC-benchmark code:

Shared Memory:

Message size:       128
Message count:      1000000
Total duration:     261.650 ms
Average duration:   0.238 us
Minimum duration:   0.000 us
Maximum duration:   10092.032 us
Standard deviation: 22.095 us
Message rate:       3821893 msg/s

Domain Socket:

Message size:       128
Message count:      1000000
Total duration:     24579.846 ms
Average duration:   24.531 us
Minimum duration:   2.560 us
Maximum duration:   15932.928 us
Standard deviation: 37.854 us
Message rate:       40683 msg/s

We could still rely on domain sockets for small tensor communication, but if a tensor is big, we can use shared memory as a fast path. One possible direction is to:

  • Check the tensor size
  • If the tensor is small, use domain sockets
  • If the tensor is big, use the domain socket to notify the receiver to read the big tensor from shared memory.
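As a sketch of these steps (in Python with multiprocessing.shared_memory standing in for the Go implementation; the 64 KB threshold and the plain list standing in for the domain socket are assumptions for illustration):

```python
from multiprocessing import shared_memory

SHM_THRESHOLD = 64 * 1024  # assumed cut-off between "small" and "big" tensors

def send(sock, payload: bytes):
    if len(payload) < SHM_THRESHOLD:
        # small tensor: send inline over the (domain-)socket path
        sock.append(("inline", payload))
    else:
        # big tensor: write to shared memory; the socket only carries a
        # notification (segment name + size), so it stays in charge of
        # synchronising the read/write
        shm = shared_memory.SharedMemory(create=True, size=len(payload))
        shm.buf[: len(payload)] = payload
        sock.append(("shm", (shm.name, len(payload))))
        shm.close()

def recv(sock) -> bytes:
    kind, body = sock.pop(0)
    if kind == "inline":
        return body
    name, size = body
    shm = shared_memory.SharedMemory(name=name)
    data = bytes(shm.buf[:size])
    shm.close()
    shm.unlink()
    return data
```

The notification message preserves ordering on the socket, so both benefits listed below hold: the socket still synchronises the shared-memory access, and the bulk bytes never touch it.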

This design has two benefits:

  • We are still using domain sockets to synchronise the read/write of shared memory
  • We leverage the shared memory as a heavy lifter for big tensor communication.

@lgarithm thoughts?

Gradient Noise Implementation Design Doc

GradientNoise

To concretely implement the above approach, we will use Python. The computation of the gradient norm uses the locally computed gradient, the gradient negotiated via KungFu parallel SGD, the per-worker batch size, and the global batch size.

We will use TensorFlow operators to calculate the biased estimates at each iteration:

  • To calculate the norm of the gradient tensor, we use tf.norm(), which treats the tensor as a vector and computes the L2 norm.
  • All other variables in the formulas can be represented by tf.constant()
  • Computing the exponential moving average involves only multiplying tensors by scalars and adding tensors.
  • The noise scale is a tf.constant() at the end of the computation.
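The arithmetic behind those operators can be sketched in plain Python (following the two-batch unbiased estimators from the OpenAI gradient noise scale paper; g_small_sq/g_big_sq are the squared L2 norms of the local and negotiated gradients, b_small/b_big the per-worker and global batch sizes, and the EMA decay value is an assumed hyper-parameter):

```python
ALPHA = 0.6  # EMA decay, an assumed hyper-parameter

def gns_estimates(g_small_sq, g_big_sq, b_small, b_big):
    """Unbiased estimates of |G|^2 and tr(Sigma) from two batch sizes:
    |G|^2 ~= (b_big*|G_big|^2 - b_small*|G_small|^2) / (b_big - b_small)
    S     ~= (|G_small|^2 - |G_big|^2) / (1/b_small - 1/b_big)
    """
    g_sq = (b_big * g_big_sq - b_small * g_small_sq) / (b_big - b_small)
    s = (g_small_sq - g_big_sq) / (1.0 / b_small - 1.0 / b_big)
    return g_sq, s

class NoiseScaleEMA:
    """Smooth the two noisy estimates separately, then take their ratio."""
    def __init__(self, alpha=ALPHA):
        self.alpha = alpha
        self.g_sq = None
        self.s = None

    def update(self, g_small_sq, g_big_sq, b_small, b_big):
        g_sq, s = gns_estimates(g_small_sq, g_big_sq, b_small, b_big)
        if self.g_sq is None:
            self.g_sq, self.s = g_sq, s
        else:
            self.g_sq = self.alpha * self.g_sq + (1 - self.alpha) * g_sq
            self.s = self.alpha * self.s + (1 - self.alpha) * s
        return self.s / self.g_sq  # the noise scale estimate
```

In the TensorFlow implementation each of these scalar operations becomes a tensor op, with tf.norm() supplying the squared norms.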

DoD:

  • Obtain the noise scale tf.constant()

panic: sync: WaitGroup is reused before previous Wait has returned

panic: sync: WaitGroup is reused before previous Wait has returned

goroutine 1 [running]:
sync.(*WaitGroup).Wait(0xc00009c010)
        /Users/lg/local/go/src/sync/waitgroup.go:132 +0xad
main.watchRun(0x43a41e0, 0xc000104500, 0x94c07f000001, 0xc0001305b8, 0x1, 0x1, 0xc0000b2180, 0x7f00000100000000, 0x94c0, 0xc000106240, ...)
        /Users/lg/code/repos/github.com/lsds/KungFu/srcs/go/cmd/kungfu-run/watch.go:91 +0x908
main.main()
        /Users/lg/code/repos/github.com/lsds/KungFu/srcs/go/cmd/kungfu-run/kungfu-run.go:121 +0x7eb

training result delayed by 1 step when running with tensorflow-gpu

Reproduced with tensorflow-gpu 1.13.2 and 1.15.0:

python3 -m kungfu.tensorflow.v1.examples

[127.0.0.1.10000::stdout] step 0, result: 1.000000
[127.0.0.1.10000::stdout] unexpected result: 1.000000, want: 0.800000
[127.0.0.1.10000::stdout] step 1, result: 0.800000
[127.0.0.1.10000::stdout] unexpected result: 0.800000, want: 0.640000
[127.0.0.1.10000::stdout] step 2, result: 0.640000
[127.0.0.1.10000::stdout] unexpected result: 0.640000, want: 0.512000
[127.0.0.1.10000::stdout] step 3, result: 0.512000
[127.0.0.1.10000::stdout] unexpected result: 0.512000, want: 0.409600
[127.0.0.1.10000::stdout] step 4, result: 0.409600
[127.0.0.1.10000::stdout] unexpected result: 0.409600, want: 0.327680

kungfu job hangs in an inconsistent version when I scale down/up multiple times

scale up from 1 instance to 2 instances

A: v0 -> v1

A container log

[10.0.0.230.10003::stdout] sync to offset 0 on step 0
[10.0.0.230.10004::stdout] sync to offset 0 on step 0
[10.0.0.230.10007::stdout] sync to offset 0 on step 0
[10.0.0.230.10002::stdout] sync to offset 0 on step 0
[10.0.0.230.10006::stdout] sync to offset 0 on step 0
[10.0.0.230.10001::stdout] sync to offset 0 on step 0
[10.0.0.230.10005::stdout] sync to offset 0 on step 0
[10.0.0.230.10000::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0

B container log

[10.0.1.29.10004::stdout] sync to offset 0 on step 0
[10.0.1.29.10005::stdout] sync to offset 0 on step 0
[10.0.1.29.10001::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stdout] sync to offset 0 on step 0
[10.0.1.29.10003::stdout] sync to offset 0 on step 0
[10.0.1.29.10000::stdout] sync to offset 0 on step 0
[10.0.1.29.10006::stdout] sync to offset 0 on step 0
[10.0.1.29.10007::stdout] sync to offset 0 on step 0
[10.0.1.29.10002::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10003::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10000::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10001::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10007::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.29.10004::stderr] [E] New root can't not be a new worker! State will be lost.

A/B running well

scale down from 2 instances to 1 instance

A: v1 -> v2

A container log

[10.0.0.230.10006::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.777     top-1: 0.800
[10.0.0.230.10003::stderr] INFO:tensorflow:step: 60(global step: 60)    step/sec: 0.384 loss: 0.687     top-1: 0.800
I0608 09:10:39.150780       1 watch_host_file.go:65] update host file
I0608 09:10:39.150810       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.151063       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v2, new np=8, local: +0/-0, global: +0/-8
[10.0.0.230.10006::stdout] sync to offset 20320 on step 64
[10.0.0.230.10005::stdout] sync to offset 20320 on step 64
[10.0.0.230.10001::stdout] sync to offset 20320 on step 64
[10.0.0.230.10007::stdout] sync to offset 20320 on step 64
[10.0.0.230.10004::stdout] sync to offset 20320 on step 64
[10.0.0.230.10002::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stdout] sync to offset 20320 on step 64
[10.0.0.230.10003::stdout] sync to offset 20320 on step 64
[10.0.0.230.10000::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.683     top-1: 0.800
[10.0.0.230.10004::stderr] INFO:tensorflow:step: 70(global step: 70)    step/sec: 1.041 loss: 0.629     top-1: 0.800

B container log

[10.0.1.29.10007::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.926     top-1: 0.750
[10.0.1.29.10003::stderr] INFO:tensorflow:step: 60(global step: 60)     step/sec: 0.384 loss: 0.757     top-1: 0.800
I0608 09:10:39.152199       1 watch_host_file.go:65] update host file
I0608 09:10:39.152222       1 ma_fmk_kungfu.go:116] scale down to 1
I0608 09:10:39.152408       1 ma_fmk_kungfu.go:150] generated host file
[10.0.1.29.10002::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10004::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10005::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10003::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10006::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10007::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10002::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10004::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stdout] stopped after trained 0 samples in 64 steps due to change cluster
[10.0.1.29.10000::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10006::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10005::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10003::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[10.0.1.29.10001::stderr] INFO:tensorflow:Saving checkpoints for 64 into ./train_url/model.ckpt.
[W] terminated trapped
[E] canceled: context canceled
[I] stop watching

A running well, B is closed

scale up to 2 instances again

A: v2 -> v3

A container log

[10.0.0.230.10002::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.011     top-1: 1.000
[10.0.0.230.10001::stderr] INFO:tensorflow:step: 210(global step: 210)  step/sec: 1.455 loss: 0.020     top-1: 1.000
I0608 09:12:30.528867       1 ma_fmk_kungfu.go:150] generated host file
[I] arrived at v3, new np=16, local: +0/-0, global: +8/-0
[10.0.0.230.10003::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10005::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10004::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10000::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10007::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10001::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10002::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
[10.0.0.230.10006::stderr] [W] 10.0.1.30:38080 is up after pinged 8 times
exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

I found that the runner of A exited with exit on error: inconsistent update detected at 7031490:/home/work/KungFu/srcs/go/kungfu/runner/handler.go:102

B container log

[10.0.1.30.10000::stdout] start with 0 trained samples.
[10.0.1.30.10000::stderr] INFO:tensorflow:Running will end at step: 750
[10.0.1.30.10006::stdout] sync to offset 0 on step 0
[10.0.1.30.10002::stdout] sync to offset 0 on step 0
[10.0.1.30.10004::stdout] sync to offset 0 on step 0
[10.0.1.30.10003::stdout] sync to offset 0 on step 0
[10.0.1.30.10005::stdout] sync to offset 0 on step 0
[10.0.1.30.10007::stdout] sync to offset 0 on step 0
[10.0.1.30.10000::stdout] sync to offset 0 on step 0
[10.0.1.30.10001::stdout] sync to offset 0 on step 0
[I] arrived at v1, new np=16, local: +0/-0, global: +0/-0
[10.0.1.30.10004::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10005::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10006::stderr] [E] New root can't not be a new worker! State will be lost.
[10.0.1.30.10003::stdout] exit on error: par failed with 1 error: can't establish connection at 140503338179417:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10001::stdout] exit on error: par failed with 1 error: can't establish connection at 139700864633689:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10007::stdout] exit on error: par failed with 1 error: can't establish connection at 139746162006873:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10000::stdout] exit on error: par failed with 1 error: can't establish connection at 139711679425369:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[10.0.1.30.10002::stdout] exit on error: par failed with 1 error: can't establish connection at 140323524553561:/tmp/pip-req-build-ve3royr1/srcs/go/kungfu/peer/peer.go:204
[I] 10.0.1.30.10000 finished with error: exit status 1
exit on error: exit status 1 at 7030827:/home/work/KungFu/srcs/go/kungfu/runner/watch.go:147

now A/B are hung ...

currently, can KungFu support my test case? Or how should I handle this case?

Installation with tf 1.14

/usr/bin/ld: cannot find -ltensorflow_framework
  collect2: error: ld returned 1 exit status
  CMakeFiles/kungfu_tensorflow_ops.dir/build.make:226: recipe for target '../lib.linux-x86_64-3.6/kungfu/kungfu_tensorflow_ops.cpython-36m-x86_64-linux-gnu.so' failed
  make[2]: *** [../lib.linux-x86_64-3.6/kungfu/kungfu_tensorflow_ops.cpython-36m-x86_64-linux-gnu.so] Error 1
  CMakeFiles/Makefile2:68: recipe for target 'CMakeFiles/kungfu_tensorflow_ops.dir/all' failed
  make[1]: *** [CMakeFiles/kungfu_tensorflow_ops.dir/all] Error 2
  Makefile:83: recipe for target 'all' failed
  make: *** [all] Error 2
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/tmp/pip-req-build-nw69zdj6/setup.py", line 106, in <module>
      install_requires=[],
    File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/setuptools/__init__.py", line 145, in setup
      return distutils.core.setup(**attrs)
    File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
      dist.run_commands()
    File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
      self.run_command(cmd)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/wheel/bdist_wheel.py", line 192, in run
      self.run_command('build')
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run
      self.run_command(cmd_name)
    File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command
      self.distribution.run_command(command)
    File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
      cmd_obj.run()
    File "/home/marcel/kungfu-venv/lib/python3.6/site-packages/setuptools/command/build_ext.py", line 84, in run
      _build_ext.run(self)
    File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run
      self.build_extensions()
    File "/usr/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions
      self._build_extensions_serial()
    File "/usr/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial
      self.build_extension(ext)
    File "/tmp/pip-req-build-nw69zdj6/setup.py", line 88, in build_extension
      cwd=self.build_temp,
    File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['cmake', '--build', '.']' returned non-zero exit status 2.
  ----------------------------------------
  ERROR: Failed building wheel for kungfu
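
For reference, this linker error commonly occurs because TensorFlow 1.14 ships only the versioned library libtensorflow_framework.so.1, while -ltensorflow_framework makes the linker look for the unversioned name. A frequently used workaround is to create the unversioned symlink inside TensorFlow's library directory (tf.sysconfig.get_lib()). A hedged sketch of that workaround follows; the helper name is hypothetical, and the directory is passed in as an argument so the sketch does not itself require TensorFlow:

```python
import os


def ensure_unversioned_lib(lib_dir: str) -> str:
    """If libtensorflow_framework.so is missing but a versioned copy
    (e.g. .so.1) exists in lib_dir, link the unversioned name to it."""
    target = os.path.join(lib_dir, "libtensorflow_framework.so")
    if os.path.exists(target):
        return target  # nothing to do
    for name in sorted(os.listdir(lib_dir)):
        if name.startswith("libtensorflow_framework.so."):
            os.symlink(name, target)  # relative symlink inside lib_dir
            return target
    raise FileNotFoundError("no libtensorflow_framework found in " + lib_dir)
```

Equivalently, from a shell: run ln -s libtensorflow_framework.so.1 libtensorflow_framework.so inside the directory printed by python3 -c 'import tensorflow as tf; print(tf.sysconfig.get_lib())', then retry the pip install.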
