
Comments (5)

lgarithm commented on June 23, 2024

This looks like a bug in kungfu-run. Could you try this workaround until we fix it?

# Get the IPv4 address of the given NIC.
get_self_ip() {
    local nic=$1
    ifconfig "$nic" | grep inet | grep -v inet6 | awk '{print $2}'
}

# Construct the kungfu-run flags.
kungfu_run_flags() {
    local nic=$1
    local IP
    IP=$(get_self_ip "$nic")

    echo -H "$IP"  # workaround
    echo -init-version -1
    echo -w
    echo -nic "$nic"
}

kungfu_run_with_nic() {
    local nic=$1
    shift  # drop the NIC so that only the training command remains in "$@"
    kungfu-run $(kungfu_run_flags "$nic") "$@"
}

kungfu_run_with_nic ib0 python3 train-xxx.py
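On hosts where ifconfig is not installed, or where it prints the address as `addr:x.x.x.x`, the same lookup could be sketched with the iproute2 `ip` command instead. This is only an illustration: `parse_ipv4` and `get_self_ip_iproute2` are hypothetical helper names, not part of kungfu-run, and the sketch assumes the one-line (`-o`) output format of `ip -4 addr show`.

```shell
# Hypothetical helper: read `ip -4 -o addr show dev <nic>` output on stdin
# and print the first IPv4 address without its prefix length.
parse_ipv4() {
    awk '$3 == "inet" { split($4, a, "/"); print a[1]; exit }'
}

# Hypothetical alternative to get_self_ip that uses iproute2 instead of ifconfig.
get_self_ip_iproute2() {
    local nic=$1
    ip -4 -o addr show dev "$nic" | parse_ipv4
}
```

Keeping the parsing in its own function makes it possible to check it against a captured line of `ip` output without touching a real interface.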

from kungfu.

zrss commented on June 23, 2024

Thanks for the reply. I tested kungfu-run with -H ${current_ip_of_nic} -nic ${current_nic} and without the -np parameter, but gpuPool is initialized with the wrong number of slots in the newly added kungfu-run container.

Should I also set -np to the correct value (the number of GPUs in the newly added kungfu-run container)?

To clarify: our machines each have 8 V100 GPUs, so the newly added kungfu-run would be started with:

kungfu-run \
    -np 8 \
    -H ${current_ip_of_nic}:8 \
    -nic ${current_nic} \
    -init-version -1
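The flag list above could be generated by a variant of the earlier workaround that also emits `-np` and the `ip:slots` form of `-H`. A minimal sketch, assuming the slot count equals the number of GPUs in the container; `kungfu_run_flags_with_slots` is a hypothetical name, and whether kungfu-run actually needs `-np` in this situation is the open question of this comment, not something the snippet settles.

```shell
# Hypothetical helper: print kungfu-run flags for one container, one token per line.
#   $1 = IPv4 address of the NIC
#   $2 = NIC name
#   $3 = number of slots (assumed here to be the number of GPUs in the container)
kungfu_run_flags_with_slots() {
    local ip="$1"
    local nic="$2"
    local slots="$3"
    printf '%s\n' -np "$slots" -H "$ip:$slots" -nic "$nic" -init-version -1 -w
}

# Example use, relying on get_self_ip from the workaround earlier in this thread:
#   kungfu-run $(kungfu_run_flags_with_slots "$(get_self_ip ib0)" ib0 8) python3 train-xxx.py
```

`printf` is used rather than `echo` so that tokens starting with `-n` are never interpreted as options by the shell's `echo`.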


zrss commented on June 23, 2024

This is my test case: start a job with one container (A.1), then scale up to two containers (A.2 is added). Both containers hang after the "sync to offset" message.

init A.1 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.131.180:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-17/image_classification.py
...
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007}@{169.254.131.180:38080}
...
[169.254.131.180.10000::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 2.084 top1: 0.000     top5: 0.000     ent_loss: 13.765        reg_loss: nan   total_loss: nan
I0628 17:07:26.790192       1 ma_fmk_kungfu.go:314] generated host file
[I] arrived at v1, new np=16, local: +0/-0, global: +8/-0
[169.254.131.180.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 178
[169.254.131.180.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 178

newly added A.2 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.138.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-17/image_classification.py
...
[I] waiting to be initialized
[I] watching config server
W0628 17:07:26.790124       1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.131.180:10000,169.254.131.180:10001,169.254.131.180:10002,169.254.131.180:10003,169.254.131.180:10004,169.254.131.180:10005,169.254.131.180:10006,169.254.131.180:10007,169.254.138.208:10000,169.254.138.208:10001,169.254.138.208:10002,169.254.138.208:10003,169.254.138.208:10004,169.254.138.208:10005,169.254.138.208:10006,169.254.138.208:10007}@{169.254.131.180:38080,169.254.138.208:38080}
[169.254.138.208.10004::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10004::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10002::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10002::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10001::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10001::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10005::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10005::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10006::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10006::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10007::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10007::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10000::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10000::stderr] INFO:tensorflow:Done warm up
[169.254.138.208.10003::stderr] INFO:tensorflow:sync to offset 364544 on step 0
[169.254.138.208.10003::stderr] INFO:tensorflow:Done warm up

Another run, this time with export KUNGFU_CONFIG_LOG_LEVEL=0 set:

init A.1 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.143.141:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=/home/work/anaconda/bin/python
[arg] [11]=kungfu-demo-6-10/bert_classifier.py
[arg] [12]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [13]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [14]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [15]=--variable_update=kungfu_ssgd
[arg] [16]=--train_batch_size=20
[arg] [17]=--num_train_epochs=30
[arg] [18]=save_summaries_steps=10000
[arg] [19]=--eval_batch_size=20
[arg] [20]=--learning_rate=2e-5
[arg] [21]=--max_seq_length=128
[arg] [22]=--save_model_steps=20
[arg] [23]=--save_interval_secs=40
[arg] [24]=--kungfu_elastic=True
[arg] [25]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.143.141/20
[nic] [8] bond0 :: 192.168.5.175/22, fe80::f816:3eff:fef7:d4fc/64
[nic] [9] docker0 :: 169.254.30.1/28, fe80::42:baff:fe91:cb50/64
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::ece2:eaff:fefb:ce44/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::c052:54ff:fe20:7f91/64
[nic] [15] gw_11cbf51a :: 172.16.0.193/16, fe80::44b8:36ff:febb:623c/64
[nic] [16] br_plc_a149041e ::
[nic] [17] veth_a149041e :: fe80::5428:eff:fe9b:5826/64
[nic] [18] vethf7542b2 :: fe80::9cce:b4ff:fe37:e823/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.143.141
[D] listening: 0.0.0.0:38080
[I] watching config server
[I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [8@1]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007}@{169.254.143.141:38080}
[D] 8 peers created: 0 - 0 + 8 = 8

...

[169.254.143.141.10007::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.274     top-1: 0.150
[169.254.143.141.10005::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.033 loss: 2.329     top-1: 0.100
[169.254.143.141.10003::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.246     top-1: 0.100
[169.254.143.141.10002::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.332     top-1: 0.050
[169.254.143.141.10000::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.059 loss: 2.259     top-1: 0.150
[169.254.143.141.10006::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.034 loss: 2.393     top-1: 0.050
[169.254.143.141.10004::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.341     top-1: 0.100
[169.254.143.141.10001::stderr] INFO:tensorflow:step: 0(global step: 0) step/sec: 0.041 loss: 2.370     top-1: 0.000
[169.254.143.141.10004::stdout] sync to offset 160 on step 1
[169.254.143.141.10007::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] sync to offset 160 on step 1
[169.254.143.141.10003::stdout] sync to offset 160 on step 1
[169.254.143.141.10002::stdout] sync to offset 160 on step 1
[169.254.143.141.10006::stdout] sync to offset 160 on step 1
[169.254.143.141.10001::stdout] sync to offset 160 on step 1
[169.254.143.141.10000::stdout] sync to offset 160 on step 1
[169.254.143.141.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10005::stdout] [D] ingore unchanged proposal
[169.254.143.141.10005::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10001::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10004::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ingore unchanged proposal
[169.254.143.141.10002::stdout] [D] ingore unchanged proposal
[169.254.143.141.10007::stdout] [D] ignore update
[169.254.143.141.10004::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.143.141.10002::stdout] [D] ignore update
[169.254.143.141.10001::stdout] [D] ignore update
[169.254.143.141.10006::stdout] [D] ignore update
[169.254.143.141.10003::stdout] [D] ingore unchanged proposal
[169.254.143.141.10003::stdout] [D] ignore update
[169.254.143.141.10000::stdout] [D] ingore unchanged proposal
[169.254.143.141.10000::stdout] [D] ignore update

The hang happens here.

newly added A.2 container

[arg] [0]=kungfu-run
[arg] [1]=-np
[arg] [2]=8
[arg] [3]=-H
[arg] [4]=169.254.135.208:8
[arg] [5]=-w
[arg] [6]=-config-server
[arg] [7]=file:///home/ma-user/user-job-dir/config.json
[arg] [8]=-nic
[arg] [9]=ib0
[arg] [10]=-init-version
[arg] [11]=-1
[arg] [12]=/home/work/anaconda/bin/python
[arg] [13]=kungfu-demo-6-10/bert_classifier.py
[arg] [14]=--data_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/data/5000.manifest
[arg] [15]=--train_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/train_url
[arg] [16]=--checkpoint_url=/home/ma-user/user-job-dir/kungfu-demo-6-10/chinese_L-12_H-768_A-12
[arg] [17]=--variable_update=kungfu_ssgd
[arg] [18]=--train_batch_size=20
[arg] [19]=--num_train_epochs=30
[arg] [20]=save_summaries_steps=10000
[arg] [21]=--eval_batch_size=20
[arg] [22]=--learning_rate=2e-5
[arg] [23]=--max_seq_length=128
[arg] [24]=--save_model_steps=20
[arg] [25]=--save_interval_secs=40
[arg] [26]=--kungfu_elastic=True
[arg] [27]=--kungfu_batch_size=20
[kf-env]: KUNGFU_CONFIG_LOG_LEVEL=0
[nic] [0] lo :: 127.0.0.1/8, ::1/128
[nic] [1] eth0 ::
[nic] [2] eth1 ::
[nic] [3] enp220s0 ::
[nic] [4] enp221s0 ::
[nic] [5] enp222s0 ::
[nic] [6] enp223s0 ::
[nic] [7] ib0 :: 169.254.135.208/20
[nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
[nic] [9] docker0 :: 169.254.30.1/28
[nic] [10] ovs-system ::
[nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
[nic] [12] overlay_br_int ::
[nic] [13] br_tun_b0345198 ::
[nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
[nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
[nic] [16] br_plc_949f84f2 ::
[nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64
[cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
[cuda-env]: CUDA_VERSION=10.0.130
[nccl-env]: NCCL_VERSION=2.4.2
[D] Using self=169.254.135.208
[I] waiting to be initialized
[D] listening: 0.0.0.0:38080
[I] watching config server
W0629 09:19:07.999957       8 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
[D] got control message from 169.254.143.141:10000, name: update, length: 644
[D] got control message from 169.254.143.141:10002, name: update, length: 644
[D] got control message from 169.254.143.141:10006, name: update, length: 644
[D] got control message from 169.254.143.141:10003, name: update, length: 644
[D] got control message from 169.254.143.141:10007, name: update, length: 644
[D] got control message from 169.254.143.141:10005, name: update, length: 644
[D] update to v1 with [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
[D] waiting 0 peers to stop
[D] 0 peer removed: 0 - 0 = 0
[E] full update detected: [0@0]{}@{} -> [16@2]{169.254.143.141:10000,169.254.143.141:10001,169.254.143.141:10002,169.254.143.141:10003,169.254.143.141:10004,169.254.143.141:10005,169.254.143.141:10006,169.254.143.141:10007,169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.143.141:38080,169.254.135.208:38080}
[D] got control message from 169.254.143.141:10001, name: update, length: 644
[D] got control message from 169.254.143.141:10004, name: update, length: 644
[D] 8 peers created: 0 - 0 + 8 = 16

...

[169.254.135.208.10000::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10006::stdout] start with 0 trained samples.
[169.254.135.208.10005::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10007::stdout] start with 0 trained samples.
[169.254.135.208.10004::stdout] start with 0 trained samples.
[169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 0 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
[169.254.135.208.10001::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10002::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10006::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stdout] start with 0 trained samples.
[169.254.135.208.10007::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10004::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10003::stderr] INFO:tensorflow:Running will end at step: 375
[169.254.135.208.10005::stdout] sync to offset 160 on step 0
[169.254.135.208.10004::stdout] sync to offset 160 on step 0
[169.254.135.208.10001::stdout] sync to offset 160 on step 0
[169.254.135.208.10002::stdout] sync to offset 160 on step 0
[169.254.135.208.10006::stdout] sync to offset 160 on step 0
[169.254.135.208.10007::stdout] sync to offset 160 on step 0
[169.254.135.208.10003::stdout] sync to offset 160 on step 0
[169.254.135.208.10000::stdout] sync to offset 160 on step 0
[169.254.135.208.10005::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10000::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10002::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10003::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10006::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10001::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10007::stdout] [D] New peer list is consistent after ONE attempt!
[169.254.135.208.10004::stdout] [D] New peer list is consistent after ONE attempt!

The hang happens here.
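Since the concern earlier in this thread was the slot count seen by the new container, one way to check it is to count how many peers each host contributes in the "full update detected" line. The sketch below is only an illustration built on the log format shown above; `peers_per_host` is a hypothetical helper, not a kungfu-run tool.

```shell
# Hypothetical helper: read kungfu-run log lines on stdin and, for the last
# "full update detected" line, print "<host> <number of peers>" per host.
peers_per_host() {
    grep 'full update detected' \
        | tail -n 1 \
        | sed -e 's/.*-> \[[0-9@]*\]{//' -e 's/}@{.*//' \
        | tr ',' '\n' \
        | cut -d: -f1 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example use (container.log is a placeholder file name):
#   peers_per_host < container.log
```

Everything before the final `-> [n@m]{` is discarded, so any colour codes in front of `[E]` do not affect the result.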


zrss commented on June 23, 2024

Same scenario: start a job with one container (A.1), then scale up to two containers (A.2 is added).

By the way, it works well when started this way (-np and -H differ from the previous case):

init A.1 container

     22 [arg] [0]=kungfu-run
     23 [arg] [1]=-np
     24 [arg] [2]=8
     25 [arg] [3]=-H
     26 [arg] [4]=169.254.135.208:8
     27 [arg] [5]=-w
     28 [arg] [6]=-config-server
     29 [arg] [7]=file:///home/ma-user/user-job-dir/config.json
     30 [arg] [8]=-nic
     31 [arg] [9]=ib0
     32 [arg] [10]=/home/work/anaconda/bin/python
     33 [arg] [11]=kungfu-demo-6-10/bert_classifier.py
...
     48 [nic] [0] lo :: 127.0.0.1/8, ::1/128
     49 [nic] [1] eth0 ::
     50 [nic] [2] eth1 ::
     51 [nic] [3] enp220s0 ::
     52 [nic] [4] enp221s0 ::
     53 [nic] [5] enp222s0 ::
     54 [nic] [6] enp223s0 ::
     55 [nic] [7] ib0 :: 169.254.135.208/20
     56 [nic] [8] bond0 :: 192.168.6.37/22, fe80::f816:3eff:fed7:1503/64
     57 [nic] [9] docker0 :: 169.254.30.1/28
     58 [nic] [10] ovs-system ::
     59 [nic] [11] br_monitor :: fe80::481:6ff:fe7c:d368/64
     60 [nic] [12] overlay_br_int ::
     61 [nic] [13] br_tun_b0345198 ::
     62 [nic] [14] vxlan_sys_4789 :: fe80::8c7f:9ff:fe8a:6612/64
     63 [nic] [15] gw_11cbf51a :: 172.16.1.17/16, fe80::d0f2:4cff:feb3:895c/64
     64 [nic] [16] br_plc_949f84f2 ::
     65 [nic] [17] veth_949f84f2 :: fe80::5410:51ff:fe16:6fed/64

...

     66 [cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
     67 [cuda-env]: CUDA_VERSION=10.0.130
     68 [nccl-env]: NCCL_VERSION=2.4.2
     69 [I] watching config server
     70 [I] arrived at v0, new np=8, local: +8/-0, global: +8/-0
     71 [E] full update detected: [0@0]{}@{} -> [8@1]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007}@{169.254.135.208:38080}

...

    516 [169.254.135.208.10005::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.309             top-1: 0.250
    517 [169.254.135.208.10006::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.353             top-1: 0.100
    518 [169.254.135.208.10003::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.351             top-1: 0.200
    519 [169.254.135.208.10007::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.246             top-1: 0.200
    520 [169.254.135.208.10002::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.331 loss: 2.203             top-1: 0.100
    521 [169.254.135.208.10001::stderr] INFO:tensorflow:step: 30(global step: 30)   step/sec: 1.332 loss: 2.176             top-1: 0.300
    522 [169.254.135.208.10000::stderr] INFO:tensorflow:step: 40(global step: 40)   step/sec: 1.884 loss: 1.823             top-1: 0.450

...

    650 I0628 20:16:27.978056       1 ma_fmk_kungfu.go:226] generated host file
    651 [I] arrived at v1, new np=16, local: +0/-0, global: +8/-0

    652 [169.254.135.208.10006::stdout] sync to offset 26240 on step 164
    653 [169.254.135.208.10005::stdout] sync to offset 26240 on step 164
    654 [169.254.135.208.10007::stdout] sync to offset 26240 on step 164
    655 [169.254.135.208.10003::stdout] sync to offset 26240 on step 164
    656 [169.254.135.208.10004::stdout] sync to offset 26240 on step 164
    657 [169.254.135.208.10002::stdout] sync to offset 26240 on step 164
    658 [169.254.135.208.10001::stdout] sync to offset 26240 on step 164
    659 [169.254.135.208.10000::stdout] sync to offset 26240 on step 164
    660 [169.254.135.208.10006::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    661 [169.254.135.208.10004::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    662 [169.254.135.208.10000::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    663 [169.254.135.208.10001::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    664 [169.254.135.208.10005::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    665 [169.254.135.208.10007::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    666 [169.254.135.208.10002::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    667 [169.254.135.208.10003::stderr] INFO:tensorflow:Saving checkpoints for 165 into /home/ma-user/user-job-dir/kungfu-demo-6-10/train_url/model.ckpt.
    668 [169.254.135.208.10004::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.038             top-1: 1.000
    669 [169.254.135.208.10007::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.073             top-1: 1.000
    670 [169.254.135.208.10006::stderr] INFO:tensorflow:step: 170(global step: 170) step/sec: 1.685 loss: 0.126             top-1: 0.950

newly added A.2 container

     23 [arg] [0]=kungfu-run
     24 [arg] [1]=-np
     25 [arg] [2]=16
     26 [arg] [3]=-H
     27 [arg] [4]=169.254.135.208:8,169.254.138.11:8
     28 [arg] [5]=-w
     29 [arg] [6]=-config-server
     30 [arg] [7]=file:///home/ma-user/user-job-dir/config.json
     31 [arg] [8]=-nic
     32 [arg] [9]=ib0
     33 [arg] [10]=-init-version
     34 [arg] [11]=-1
     35 [arg] [12]=/home/work/anaconda/bin/python
     36 [arg] [13]=kungfu-demo-6-10/bert_classifier.py
...
     51 [nic] [0] lo :: 127.0.0.1/8, ::1/128
     52 [nic] [1] eth0 ::
     53 [nic] [2] eth1 ::
     54 [nic] [3] enp220s0 ::
     55 [nic] [4] enp221s0 ::
     56 [nic] [5] enp222s0 ::
     57 [nic] [6] enp223s0 ::
     58 [nic] [7] ib0 :: 169.254.138.11/20
     59 [nic] [8] bond0 :: 192.168.6.190/22, fe80::f816:3eff:fe75:a4a/64
     60 [nic] [9] docker0 :: 169.254.30.1/28
     61 [nic] [10] ovs-system ::
     62 [nic] [11] br_monitor :: fe80::206e:5aff:fe16:107/64
     63 [nic] [12] br_tun_b0345198 ::
     64 [nic] [13] overlay_br_int ::
     65 [nic] [14] vxlan_sys_4789 :: fe80::bc59:a2ff:fe10:65ab/64
     66 [nic] [15] gw_11cbf51a :: 172.16.1.1/16, fe80::3451:daff:fe7d:68c1/64
     67 [nic] [16] br_plc_8ea52e57 ::
     68 [nic] [17] veth_8ea52e57 :: fe80::e84f:f8ff:fe22:6d7a/64
     69 [cuda-env]: CUDA_PKG_VERSION=10-0=10.0.130-1
     70 [cuda-env]: CUDA_VERSION=10.0.130
     71 [nccl-env]: NCCL_VERSION=2.4.2
     72 [I] waiting to be initialized
     73 [I] watching config server
     74 W0628 20:16:27.976171       1 utils.go:83] get the ipv4 addr of Nic(eth0) failed: the ipv4 addr of Nic(eth0) is not found
     75 [I] arrived at v1, new np=16, local: +8/-0, global: +16/-0
     76 [E] full update detected: [0@0]{}@{} -> [16@2]{169.254.135.208:10000,169.254.135.208:10001,169.254.135.208:10002,169.254.135.208:10003,169.254.135.208:10004,169.254.135.208:10005,169.254.135.208:10006,169.254.135.208:10007,169.254.138.11:10000,169.254.138.11:10001,169.254.138.11:10002,169.254.138.11:10003,169.254.138.11:10004,169.254.138.11:10005,169.254.138.11:10006,169.254.138.11:10007}@{169.254.135.208:38080,169.254.138.11:38080}

...

    459 [169.254.138.11.10002::stderr] INFO:tensorflow:Running will end at step: 375
    460 [169.254.138.11.10007::stderr] INFO:tensorflow:Running will end at step: 375
    461 [169.254.138.11.10001::stdout] sync to offset 26240 on step 0
    462 [169.254.138.11.10004::stdout] sync to offset 26240 on step 0
    463 [169.254.138.11.10005::stdout] sync to offset 26240 on step 0
    464 [169.254.138.11.10003::stdout] sync to offset 26240 on step 0
    465 [169.254.138.11.10002::stdout] sync to offset 26240 on step 0
    466 [169.254.138.11.10006::stdout] sync to offset 26240 on step 0
    467 [169.254.138.11.10007::stdout] sync to offset 26240 on step 0
    468 [169.254.138.11.10000::stdout] sync to offset 26240 on step 0
    469 [169.254.138.11.10001::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.061 loss: 0.037             top-1: 1.000
    470 [169.254.138.11.10007::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.074 loss: 0.073             top-1: 1.000
    471 [169.254.138.11.10002::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.072 loss: 0.121             top-1: 0.950
    472 [169.254.138.11.10000::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.062 loss: 0.038             top-1: 1.000
    473 [169.254.138.11.10004::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.061 loss: 0.065             top-1: 1.000
    474 [169.254.138.11.10005::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.063 loss: 0.118             top-1: 0.950
    475 [169.254.138.11.10006::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.069 loss: 0.036             top-1: 1.000
    476 [169.254.138.11.10003::stderr] INFO:tensorflow:step: 0(global step: 164)    step/sec: 0.062 loss: 0.039             top-1: 1.000
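When comparing a hanging run with a working one, it can help to confirm that the old and new workers report the same "sync to offset" value. A rough sketch based only on the log format shown in this thread; `sync_offsets` is a hypothetical helper name, and it says nothing about why a run hangs.

```shell
# Hypothetical helper: read worker log lines on stdin and print each distinct
# "sync to offset" value followed by the number of times it was reported.
sync_offsets() {
    grep -o 'sync to offset [0-9]*' \
        | awk '{ count[$4]++ } END { for (o in count) print o, count[o] }' \
        | sort -n
}

# Example use (a1.log and a2.log are placeholder file names):
#   cat a1.log a2.log | sync_offsets
```

`grep -o` prints every match on its own line, so log lines where two worker messages ended up on the same line are still counted separately. A single output line means all workers agreed on the offset.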


kevin0525 commented on June 23, 2024

We tried again but found another problem.
We started a job with one container and job 0 ran well; after scaling up to 3 containers, errors appeared.
container0:
[image]
container1 and container2 (barrier failed ...):
[image]
[image]

