Comments (5)
It seems to be a bug in kungfu-run. Could you try this workaround before we fix it?
# get self ipv4 of given nic
get_self_ip() {
    local nic=$1
    ifconfig "$nic" | grep inet | grep -v inet6 | awk '{print $2}'
}
# construct kungfu-run flags
kungfu_run_flags() {
    local nic=$1
    local IP
    IP=$(get_self_ip "$nic")
    echo -H "$IP" # workaround
    echo -init-version -1
    echo -w
    echo -nic "$nic"
}
kungfu_run_with_nic() {
    local nic=$1
    shift # drop the nic so that only the training command remains in "$@"
    # the flag list is left unquoted on purpose so it splits into separate arguments
    kungfu-run $(kungfu_run_flags "$nic") "$@"
}
kungfu_run_with_nic ib0 python3 train-xxx.py
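One portability caveat with get_self_ip: older net-tools builds of ifconfig print the address as "inet addr:10.0.0.5", so awk '{print $2}' would return "addr:10.0.0.5" there. Below is a minimal sketch of a parser that accepts both layouts. The name parse_ipv4 is a hypothetical helper, not part of kungfu-run, and it reads ifconfig output on stdin.

```shell
# parse_ipv4: hypothetical helper (not part of kungfu-run).
# Reads ifconfig output on stdin and prints the first IPv4 address,
# accepting both "inet 1.2.3.4" and the older "inet addr:1.2.3.4" layout.
parse_ipv4() {
    awk '$1 == "inet" && !seen++ { sub(/^addr:/, "", $2); print $2 }'
}

# Usage: ifconfig ib0 | parse_ipv4
```

It could replace the grep/awk pipeline inside get_self_ip if the containers turn out to use different ifconfig versions.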
from kungfu.
Thanks for the reply. I tested kungfu-run with -H ${current_ip_of_nic} -nic ${current_nic} and without the -np parameter, but it turns out that gpuPool is initialized with the wrong number of slots in the newly added kungfu-run container ...
Should I also set -np to the correct value (the number of GPUs in the newly added kungfu-run container)?
To clarify: our machine currently has 8 V100 GPUs, and the newly added kungfu-run should be started with
kungfu-run -np 8 -H ${current_ip_of_nic}:8 -nic ${current_nic} -init-version -1
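The command above could be generated per container along these lines. This is only a sketch: kungfu_run_flags_for is a hypothetical name, the IP and GPU count are passed in explicitly, and the flags are just the ones quoted in this thread (whether -np is actually required here is the open question).

```shell
# Hypothetical helper: print kungfu-run flags for one newly added container,
# given its NIC, that NIC's IPv4 address, and its local GPU (slot) count.
# printf is used instead of echo so that no flag is mistaken for an echo option.
kungfu_run_flags_for() {
    local nic=$1 ip=$2 slots=$3
    printf '%s\n' "-np $slots" "-H $ip:$slots" "-nic $nic" "-init-version -1"
}

# Usage (word splitting of the flag list is intentional):
# kungfu-run $(kungfu_run_flags_for ib0 169.254.138.208 8) python3 train-xxx.py
```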
from kungfu.
This is my test case: start a job with 1 container (the A.1 container), then scale up to 2 containers (the A.2 container is added). It turns out that both containers hang after the sync to offset ...
init A.1 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.131.180:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification.py
...
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007}@{169.254.131.180:38080}
...
...
...
[169.254.131.180.10000::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 2.084 top1: 0.000 top5: 0.000 ent_loss: 13.765 reg_loss: nan total_loss: nan
I0628 17:07:26.790192 1 ma_fmk_kungfu.go:314] generated host file
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0
[169.254.131.180.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 178
newly added A.2 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.138.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-17/image_classification.py
...
[I] waiting to be initialized
[I] watching config server
W0628 17:07:26.790124 1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007,169.254.138.208:10000,169.254.138.208:10001,169.254.138.208:10002,169.254.138.208:10003,169.254.138.208:10004,169.254.138.208:10005,169.254.138.208:10006,169.254.138.208:10007}@{169.254.131.180:38080,169.254.138.208:38080}
[169.254.138.208.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10004::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10002::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10001::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10005::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10006::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10007::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10000::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10003::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10003::stderr] INFO:tensorflow:Done warm up
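When comparing logs like the two above, it can help to list which workers actually reached the sync point. Here is a rough sketch, assuming the log has been saved to a file and follows the "[ip.port::stream] ... sync to offset N on step M" layout shown in this thread. The name synced_workers is a hypothetical helper, not a KungFu tool.

```shell
# Hypothetical helper: print the sorted, unique worker ids (ip.port) that
# logged a "sync to offset" line. Reads a kungfu-run log on stdin.
# grep -o emits one match per occurrence, so two records fused onto a
# single line are still counted separately.
synced_workers() {
    grep -oE '\[[0-9]+(\.[0-9]+){4}::[^]]*\] *(INFO:tensorflow:)?sync to offset' |
        sed -E 's/^\[([0-9.]+)::.*/\1/' |
        sort -u
}

# Usage: synced_workers < a1.log | wc -l   # compare against the expected local np
```

A count lower than the container's local np would suggest that some worker never logged the sync line, although a truncated paste can look the same.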
export KUNGFU_CONFIG_LOG_LEVEL=0
init A.1 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.143.141:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-10/bert_classifier.py
[arg] [12]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [13]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [14]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [15]=--variable_update=kungfu_ssgd
[arg] [16]=--train_batch_size=20
[arg] [17]=--num_train_epochs=30
[arg] [18]=save_summaries_steps=10000
[arg] [19]=--eval_batch_size=20
[arg] [20]=--learning_rate=2e-5
[arg] [21]=--max_seq_length=128
[arg] [22]=--save_model_steps=20
[arg] [23]=--save_interval_secs=40
[arg] [24]=--kungfu_elastic=True
[arg] [25]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[nic] [18] vethf7542b2 :: fe80::9cce:b4ff:fe37:e823/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.143.141
[D] listening: 0.0.0.0:38080
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007}@{169.254.143.141:38080}
[D] 8 peers created: 0 - 0 + 8 = 8
...
[169.254.143.141.10007::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.274 top-1: 0.150
[169.254.143.141.10005::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.329 top-1: 0.100
[169.254.143.141.10003::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.246 top-1: 0.100
[169.254.143.141.10002::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.332 top-1: 0.050
[169.254.143.141.10000::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.059 loss: 2.259 top-1: 0.150
[169.254.143.141.10006::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.393 top-1: 0.050
[169.254.143.141.10004::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.341 top-1: 0.100
[169.254.143.141.10001::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.370 top-1: 0.000
[169.254.143.141.10004::stdout] sync to offset 160 on step 1
[169.254.143.141.10007::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] sync to offset 160 on step 1
[169.254.143.141.10003::stdout] sync to offset 160 on step 1
[169.254.143.141.10002::stdout] sync to offset 160 on step 1
[169.254.143.141.10006::stdout] sync to offset 160 on step 1
[169.254.143.141.10001::stdout] sync to offset 160 on step 1
[169.254.143.141.10000::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10005::stdout] [D] ingore unchanged proposal
[169.254.143.141.10005::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10001::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ingore unchanged proposal
[169.254.143.141.10002::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ignore update
[169.254.143.141.10004::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ignore update
[169.254.143.141.10003::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] ignore update
[169.254.143.141.10000::stdout] [D] ingore unchanged proposal
[169.254.143.141.10000::stdout] [D] ignore update
The hang happens here.
newly added A.2 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.135.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-10/bert_classifier.py
[arg] [14]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [15]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [16]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [17]=--variable_update=kungfu_ssgd
[arg] [18]=--train_batch_size=20
[arg] [19]=--num_train_epochs=30
[arg] [20]=save_summaries_steps=10000
[arg] [21]=--eval_batch_size=20
[arg] [22]=--learning_rate=2e-5
[arg] [23]=--max_seq_length=128
[arg] [24]=--save_model_steps=20
[arg] [25]=--save_interval_secs=40
[arg] [26]=--kungfu_elastic=True
[arg] [27]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.135.208/20
[nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
[nic] [9] docker0 :: 169.254.30.1/28
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
[nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
[nic] [16] br_plc_949f84f2 ::
[nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.135.208
[I] waiting to be initialized
[D] listening: 0.0.0.0:38080
[I] watching config server
W0629 09:19:07.999957 8 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[D] got control message from 169.254.143.141:10000, name: update, length: 644
[D] got control message from 169.254.143.141:10002, name: update, length: 644
[D] got control message from 169.254.143.141:10006, name: update, length: 644
[D] got control message from 169.254.143.141:10003, name: update, length: 644
[D] got control message from 169.254.143.141:10007, name: update, length: 644
[D] got control message from 169.254.143.141:10005, name: update, length: 644
[D] update to v1 with [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[D] got control message from 169.254.143.141:10001, name: update, length: 644
[D] got control message from 169.254.143.141:10004, name: update, length: 644
[D] 8 peers created: 0 - 0 + 8 = 16
...
[169.254.135.208.10000::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10006::stdout] start with 0 trained samples.
[169.254.135.208.10005::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10007::stdout] start with 0 trained samples.
[169.254.135.208.10004::stdout] start with 0 trained samples.
[169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10001::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10002::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10006::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stdout] start with 0 trained samples.
[169.254.135.208.10007::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10005::stdout] sync to offset 160 on step 0
[169.254.135.208.10004::stdout] sync to offset 160 on step 0
[169.254.135.208.10001::stdout] sync to offset 160 on step 0
[169.254.135.208.10002::stdout] sync to offset 160 on step 0
[169.254.135.208.10006::stdout] sync to offset 160 on step 0
[169.254.135.208.10007::stdout] sync to offset 160 on step 0
[169.254.135.208.10003::stdout] sync to offset 160 on step 0
[169.254.135.208.10000::stdout] sync to offset 160 on step 0
[169.254.135.208.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10004::stdout] [D] New peer list is consistent after ONE attempt!
The hang happens here.
from kungfu.
Start a job with 1 container (the A.1 container), then scale up to 2 containers (the A.2 container is added).
By the way, it works well this way (the -np/-H values differ from the previous case):
init A.1 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.135.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-10/bert_classifier.py
...
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.135.208/20
[nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
[nic] [9] docker0 :: 169.254.30.1/28
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
[nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
[nic] [16] br_plc_949f84f2 ::
[nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64
...
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.135.208:38080}
...
[169.254.135.208.10005::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.332 loss: 2.309 top-1: 0.250
[169.254.135.208.10006::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.332 loss: 2.353 top-1: 0.100
[169.254.135.208.10003::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.332 loss: 2.351 top-1: 0.200
[169.254.135.208.10007::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.332 loss: 2.246 top-1: 0.200
[169.254.135.208.10002::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.331 loss: 2.203 top-1: 0.100
[169.254.135.208.10001::stderr] INFO:tensorflow:step: 30(global step: 30) step/sec: 1.332 loss: 2.176 top-1: 0.300
[169.254.135.208.10000::stderr] INFO:tensorflow:step: 40(global step: 40) step/sec: 1.884 loss: 1.823 top-1: 0.450
...
I0628 20:16:27.978056 1 ma_fmk_kungfu.go:226] generated host file
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0
[169.254.135.208.10006::stdout] sync to offset 26240 on step 164
[169.254.135.208.10005::stdout] sync to offset 26240 on step 164
[169.254.135.208.10007::stdout] sync to offset 26240 on step 164
[169.254.135.208.10003::stdout] sync to offset 26240 on step 164
[169.254.135.208.10004::stdout] sync to offset 26240 on step 164
[169.254.135.208.10002::stdout] sync to offset 26240 on step 164
[169.254.135.208.10001::stdout] sync to offset 26240 on step 164
[169.254.135.208.10000::stdout] sync to offset 26240 on step 164
[169.254.135.208.10006::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10000::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10001::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10005::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10007::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10002::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10004::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.038 top-1: 1.000
[169.254.135.208.10007::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.073 top-1: 1.000
[169.254.135.208.10006::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.126 top-1: 0.950
newly added A.2 container
[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=16
[arg] [3]=-H
[arg] [4]=169.254.135.208:8,169.254.138.11:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-10/bert_classifier.py
...
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.138.11/20
[nic] [8] bond0 :: 192.168.6.190/22, fe80::f816:3eff:fe75:a4a/64
[nic] [9] docker0 :: 169.254.30.1/28
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::206e:5aff:fe16:107/64
[nic] [12] br_tun_b0345198 ::
[nic] [13] overlay_br_int ::
[nic] [14] vxlan_sys_4789 :: fe80::bc59:a2ff:fe10:65ab/64
[nic] [15] gw_11cbf51a :: 172.16.1.1/16, fe80::3451:daff:fe7d:68c1/64
[nic] [16] br_plc_8ea52e57 ::
[nic] [17] veth_8ea52e57 :: fe80::e84f:f8ff:fe22:6d7a/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[I] waiting to be initialized
[I] watching config server
W0628 20:16:27.976171 1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007,169.254.138.11:10000,169.254.138.11:10001,169.254.138.11:10002,169.254.138.11:10003,169.254.138.11:10004,169.254.138.11:10005,169.254.138.11:10006,169.254.138.11:10007}@{169.254.135.208:38080,169.254.138.11:38080}
...
[169.254.138.11.10002::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.138.11.10007::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.138.11.10001::stdout] sync to offset 26240 on step 0
[169.254.138.11.10004::stdout] sync to offset 26240 on step 0
[169.254.138.11.10005::stdout] sync to offset 26240 on step 0
[169.254.138.11.10003::stdout] sync to offset 26240 on step 0
[169.254.138.11.10002::stdout] sync to offset 26240 on step 0
[169.254.138.11.10006::stdout] sync to offset 26240 on step 0
[169.254.138.11.10007::stdout] sync to offset 26240 on step 0
[169.254.138.11.10000::stdout] sync to offset 26240 on step 0
[169.254.138.11.10001::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.061 loss: 0.037 top-1: 1.000
[169.254.138.11.10007::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.074 loss: 0.073 top-1: 1.000
[169.254.138.11.10002::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.072 loss: 0.121 top-1: 0.950
[169.254.138.11.10000::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.062 loss: 0.038 top-1: 1.000
[169.254.138.11.10004::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.061 loss: 0.065 top-1: 1.000
[169.254.138.11.10005::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.063 loss: 0.118 top-1: 0.950
[169.254.138.11.10006::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.069 loss: 0.036 top-1: 1.000
[169.254.138.11.10003::stderr] INFO:tensorflow:step: 0(global step: 164) step/sec: 0.062 loss: 0.039 top-1: 1.000
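In the working run above, the new container was started with the full host list (-H 169.254.135.208:8,169.254.138.11:8) and the total -np 16, rather than only its own entry. Here is a small sketch for deriving both values from a list of host IPs, assuming every host has the same number of slots. The names host_list and total_np are hypothetical helpers, not part of kungfu-run.

```shell
# Hypothetical helpers; both assume every host exposes the same number of slots.

# host_list SLOTS IP...  ->  "ip1:SLOTS,ip2:SLOTS,..."
host_list() {
    local slots=$1 out="" ip
    shift
    for ip in "$@"; do
        # append ",ip:slots", omitting the comma for the first entry
        out="${out:+$out,}$ip:$slots"
    done
    echo "$out"
}

# total_np SLOTS IP...  ->  SLOTS multiplied by the number of hosts
total_np() {
    local slots=$1
    shift
    echo $(( slots * $# ))
}

# Usage:
# host_list 8 169.254.135.208 169.254.138.11
# total_np 8 169.254.135.208 169.254.138.11
```

Whether passing the full list is required, or merely avoids the hang seen in the earlier runs, is not established by these logs.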
from kungfu.
We tried again but found another problem.
Start a job with one container; job 0 runs well. Then scale up to 3, and errors arise.
container0
container1\2 (barrier failed...
from kungfu.