chimera's Issues

NCCL stuck

I tried converting the Slurm script (i.e., prof_steps.sh) to torchrun and running it directly, but the run gets stuck when NCCL is used as the collective_backend. The torchrun script is as follows:

#!/bin/bash -l

export CUDA_DEVICE_MAX_CONNECTIONS=1
export CUDA_VISIBLE_DEVICES=0,1,2,3
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
GPUS_PER_NODE=4

MASTER_ADDR=localhost
MASTER_PORT=1234
NNODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))

#model=bert-base
model=bert-large
#pipeline='gpipe'
#pipeline='1f1b'
pipeline='chimera'
#pipeline='interleave'
GLOO='gloo'
NCCL='nccl'
stages=4
ngpus=4
microbs=16
acc=1

DISTRIBUTED_ARGS="
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --node_rank $NODE_RANK \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
"

BERT_ARGS="
    --num_stages $stages \
    --corpus_path /data/wikipedia.segmented.nltk.txt \
    --vocab_path ./bert_data/bert-large-uncased-vocab.txt \
    --corpus_lines 1000000 \
    --do_lower_case \
    --bert_config_path ./configs/bert_config_${model}-uncased.json \
    --max_seq_length 128 \
    --micro_batch_size $microbs \
    --num_optimization_steps 10000 \
    --gradient_accumulation_steps $acc \
    --pipeline_method $pipeline \
    --p2p_backend $GLOO \
    --collective_backend $NCCL \
    --profile \
    --chunks 2 \
    --num_pipelines 2 \
"
torchrun $DISTRIBUTED_ARGS main_bert.py \
    $BERT_ARGS

When I choose 'gpipe' or '1f1b' as the pipeline method, training works normally. However, selecting 'interleave' results in a loss of 0, while 'chimera' causes the program to get stuck and eventually raise a timeout error.
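
For reference, one way to surface the hang faster while debugging (instead of waiting out the default 600-second NCCL watchdog) is to shorten the process-group timeout and enable blocking wait. This is only a minimal sketch, assuming PyTorch 2.x; the init call and the timeout value are illustrative and are not the repo's actual initialization code:

# Minimal sketch (not from the Chimera repo): make a stuck collective fail fast
# instead of waiting for the default 600 s watchdog timeout.
import datetime
import os

import torch.distributed as dist

os.environ.setdefault("TORCH_NCCL_BLOCKING_WAIT", "1")      # NCCL_BLOCKING_WAIT on older PyTorch
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra consistency checks on collectives

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(seconds=120),  # illustrative: 2 minutes instead of 10
)

With blocking wait, the timeout is raised synchronously from the collective call itself, so the Python traceback points at the exact all-reduce that never completed. The full torchrun output (with NCCL_DEBUG=INFO) follows.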

root@gpt-dev-ppt:~/Chimera# bash ./scripts/prof_steps_torchrun.sh
W0227 12:31:58.166000 140270371275200 torch/distributed/run.py:717] 
W0227 12:31:58.166000 140270371275200 torch/distributed/run.py:717] *****************************************
W0227 12:31:58.166000 140270371275200 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0227 12:31:58.166000 140270371275200 torch/distributed/run.py:717] *****************************************
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/usr/local/lib/python3.10/dist-packages/transformers/utils/generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
0
1
2
3
============================
pipeline_method: chimera
num_epochs: 1
num_optimization_steps: 10000
world_size: 4
num_replica: 2
num_pipeline: 2
num_micro_batches_per_step: 2
recompute: False
stage0: ranks [0, 3]
stage1: ranks [1, 2]
stage2: ranks [2, 1]
stage3: ranks [3, 0]
----------------------------
corpus_path: /data/wikipedia.segmented.nltk.txt
corpus_lines: 1000000
vocab_path: ./bert_data/bert-large-uncased-vocab.txt
on_memory: False
do_lower_case: True
bert_config_path: ./configs/bert_config_bert-large-uncased.json
max_seq_length: 128
micro_batch_size: 16
num_optimization_steps: 10000
num_epochs: None
gradient_accumulation_steps: 1
adam_learning_rate: 3e-05
adam_max_grad_norm: 1.0
beta1: 0.9
weight_decay: 0.01
warmup_proportion: 0.1
damping: 0.01
pipeline_method: chimera
chunks: 2
recompute: False
num_stages: 4
num_pipelines: 2
checkpoint_dir: None
save_checkpoint_steps: 200
seed: 1
p2p_backend: gloo
collective_backend: nccl
num_workers: 4
profile: True
observe_norm: False
log_interval: 100
config: None
wandb: False
============================
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py:909: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
gpt-dev-ppt:65957:65957 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65957:65957 [0] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
gpt-dev-ppt:65957:65957 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpt-dev-ppt:65957:65957 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpt-dev-ppt:65957:65957 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpt-dev-ppt:65957:65957 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpt-dev-ppt:65957:65957 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.19.3+cuda12.2
gpt-dev-ppt:65958:65958 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65958:65958 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
gpt-dev-ppt:65960:65960 [1] NCCL INFO cudaDriverVersion 12000
gpt-dev-ppt:65960:65960 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65960:65960 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
gpt-dev-ppt:65958:65958 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpt-dev-ppt:65958:65958 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpt-dev-ppt:65958:65958 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpt-dev-ppt:65958:65958 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpt-dev-ppt:65958:65958 [0] NCCL INFO cudaDriverVersion 12000
NCCL version 2.19.3+cuda12.2
gpt-dev-ppt:65959:65959 [1] NCCL INFO cudaDriverVersion 12000
gpt-dev-ppt:65959:65959 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65959:65959 [1] NCCL INFO Bootstrap : Using lo:127.0.0.1<0>
gpt-dev-ppt:65960:65960 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpt-dev-ppt:65960:65960 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpt-dev-ppt:65960:65960 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpt-dev-ppt:65960:65960 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpt-dev-ppt:65959:65959 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v7 symbol.
gpt-dev-ppt:65959:65959 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin v6 (v6)
gpt-dev-ppt:65959:65959 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
gpt-dev-ppt:65959:65959 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v6)
gpt-dev-ppt:65957:67361 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpt-dev-ppt:65957:67361 [0] NCCL INFO P2P plugin IBext
gpt-dev-ppt:65957:67361 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65957:67361 [0] NCCL INFO NET/IB : No device found.
gpt-dev-ppt:65957:67361 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpt-dev-ppt:65957:67361 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65957:67361 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
gpt-dev-ppt:65957:67361 [0] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65957:67361 [0] NCCL INFO Using network Socket
gpt-dev-ppt:65958:67371 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpt-dev-ppt:65958:67371 [0] NCCL INFO P2P plugin IBext
gpt-dev-ppt:65958:67371 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65958:67371 [0] NCCL INFO NET/IB : No device found.
gpt-dev-ppt:65958:67371 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpt-dev-ppt:65958:67371 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65958:67371 [0] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
gpt-dev-ppt:65958:67371 [0] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65958:67371 [0] NCCL INFO Using network Socket
gpt-dev-ppt:65960:67373 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpt-dev-ppt:65960:67373 [1] NCCL INFO P2P plugin IBext
gpt-dev-ppt:65960:67373 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65960:67373 [1] NCCL INFO NET/IB : No device found.
gpt-dev-ppt:65960:67373 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpt-dev-ppt:65960:67373 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65960:67373 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
gpt-dev-ppt:65960:67373 [1] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65960:67373 [1] NCCL INFO Using network Socket
gpt-dev-ppt:65960:67373 [1] NCCL INFO comm 0x5636b3a68020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0xcfc18309f227cab4 - Init START
gpt-dev-ppt:65957:67361 [0] NCCL INFO comm 0x562ab1bdec40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0xcfc18309f227cab4 - Init START
gpt-dev-ppt:65960:67373 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
gpt-dev-ppt:65960:67373 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
gpt-dev-ppt:65959:67375 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
gpt-dev-ppt:65959:67375 [1] NCCL INFO P2P plugin IBext
gpt-dev-ppt:65959:67375 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65957:67361 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
gpt-dev-ppt:65957:67361 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
gpt-dev-ppt:65957:67361 [0] NCCL INFO Channel 00/02 :    0   1
gpt-dev-ppt:65957:67361 [0] NCCL INFO Channel 01/02 :    0   1
gpt-dev-ppt:65957:67361 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpt-dev-ppt:65960:67373 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpt-dev-ppt:65957:67361 [0] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65960:67373 [1] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65959:67375 [1] NCCL INFO NET/IB : No device found.
gpt-dev-ppt:65959:67375 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 1.
gpt-dev-ppt:65959:67375 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to lo
gpt-dev-ppt:65959:67375 [1] NCCL INFO NET/Socket : Using [0]lo:127.0.0.1<0>
gpt-dev-ppt:65959:67375 [1] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65959:67375 [1] NCCL INFO Using network Socket
gpt-dev-ppt:65959:67375 [1] NCCL INFO comm 0x55ed36a3e600 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0xf4533e4f39f3ac28 - Init START
gpt-dev-ppt:65958:67371 [0] NCCL INFO comm 0x5599f949dd50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0xf4533e4f39f3ac28 - Init START
gpt-dev-ppt:65958:67371 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
gpt-dev-ppt:65958:67371 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
gpt-dev-ppt:65957:67361 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65957:67361 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65959:67375 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
gpt-dev-ppt:65959:67375 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
gpt-dev-ppt:65960:67373 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65958:67371 [0] NCCL INFO Channel 00/02 :    0   1
gpt-dev-ppt:65959:67375 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpt-dev-ppt:65958:67371 [0] NCCL INFO Channel 01/02 :    0   1
gpt-dev-ppt:65959:67375 [1] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65958:67371 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpt-dev-ppt:65958:67371 [0] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65960:67373 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65957:67361 [0] NCCL INFO Connected all rings
gpt-dev-ppt:65957:67361 [0] NCCL INFO Connected all trees
gpt-dev-ppt:65960:67373 [1] NCCL INFO Connected all rings
gpt-dev-ppt:65960:67373 [1] NCCL INFO Connected all trees
gpt-dev-ppt:65960:67373 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65960:67373 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65957:67361 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65957:67361 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65958:67371 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65958:67371 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65959:67375 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65959:67375 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65958:67371 [0] NCCL INFO Connected all rings
gpt-dev-ppt:65958:67371 [0] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67375 [1] NCCL INFO Connected all rings
gpt-dev-ppt:65959:67375 [1] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67375 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65959:67375 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65958:67371 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65958:67371 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65960:67373 [1] NCCL INFO comm 0x5636b3a68020 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0xcfc18309f227cab4 - Init COMPLETE
gpt-dev-ppt:65957:67361 [0] NCCL INFO comm 0x562ab1bdec40 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0xcfc18309f227cab4 - Init COMPLETE
gpt-dev-ppt:65959:67375 [1] NCCL INFO comm 0x55ed36a3e600 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0xf4533e4f39f3ac28 - Init COMPLETE
gpt-dev-ppt:65958:67371 [0] NCCL INFO comm 0x5599f949dd50 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0xf4533e4f39f3ac28 - Init COMPLETE
gpt-dev-ppt:65957:67385 [0] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65957:67385 [0] NCCL INFO Using network Socket
gpt-dev-ppt:65959:67387 [2] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65959:67387 [2] NCCL INFO Using network Socket
gpt-dev-ppt:65958:67388 [1] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65958:67388 [1] NCCL INFO Using network Socket
gpt-dev-ppt:65959:67387 [2] NCCL INFO comm 0x55ed36ba41a0 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 1d000 commId 0x73c5d7f193d81ab4 - Init START
gpt-dev-ppt:65958:67388 [1] NCCL INFO comm 0x5599f96036f0 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x73c5d7f193d81ab4 - Init START
gpt-dev-ppt:65958:67388 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
gpt-dev-ppt:65959:67387 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
gpt-dev-ppt:65958:67388 [1] NCCL INFO Channel 00/02 :    0   1
gpt-dev-ppt:65958:67388 [1] NCCL INFO Channel 01/02 :    0   1
gpt-dev-ppt:65958:67388 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpt-dev-ppt:65958:67388 [1] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65959:67387 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpt-dev-ppt:65959:67387 [2] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65958:67388 [1] NCCL INFO Channel 00 : 0[1] -> 1[2] via SHM/direct/direct
gpt-dev-ppt:65958:67388 [1] NCCL INFO Channel 01 : 0[1] -> 1[2] via SHM/direct/direct
gpt-dev-ppt:65959:67387 [2] NCCL INFO Channel 00 : 1[2] -> 0[1] via SHM/direct/direct
gpt-dev-ppt:65959:67387 [2] NCCL INFO Channel 01 : 1[2] -> 0[1] via SHM/direct/direct
gpt-dev-ppt:65958:67388 [1] NCCL INFO Connected all rings
gpt-dev-ppt:65958:67388 [1] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67387 [2] NCCL INFO Connected all rings
gpt-dev-ppt:65959:67387 [2] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67387 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65959:67387 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65958:67388 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65958:67388 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65958:67388 [1] NCCL INFO comm 0x5599f96036f0 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x73c5d7f193d81ab4 - Init COMPLETE
gpt-dev-ppt:65959:67387 [2] NCCL INFO comm 0x55ed36ba41a0 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 1d000 commId 0x73c5d7f193d81ab4 - Init COMPLETE
gpt-dev-ppt:65958:67394 [0] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65958:67394 [0] NCCL INFO Using network Socket
gpt-dev-ppt:65959:67395 [1] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65959:67395 [1] NCCL INFO Using network Socket
gpt-dev-ppt:65958:67394 [0] NCCL INFO comm 0x5599f9cc8900 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x1ad9716b1600c155 - Init START
gpt-dev-ppt:65959:67395 [1] NCCL INFO comm 0x55ed37268f30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x1ad9716b1600c155 - Init START
gpt-dev-ppt:65958:67394 [0] NCCL INFO Setting affinity for GPU 0 to ff00ff
gpt-dev-ppt:65959:67395 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
gpt-dev-ppt:65958:67394 [0] NCCL INFO Channel 00/02 :    0   1
gpt-dev-ppt:65958:67394 [0] NCCL INFO Channel 01/02 :    0   1
gpt-dev-ppt:65958:67394 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpt-dev-ppt:65958:67394 [0] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65959:67395 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpt-dev-ppt:65959:67395 [1] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65958:67394 [0] NCCL INFO Channel 00 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65958:67394 [0] NCCL INFO Channel 01 : 0[0] -> 1[1] via SHM/direct/direct
gpt-dev-ppt:65959:67395 [1] NCCL INFO Channel 00 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65959:67395 [1] NCCL INFO Channel 01 : 1[1] -> 0[0] via SHM/direct/direct
gpt-dev-ppt:65958:67394 [0] NCCL INFO Connected all rings
gpt-dev-ppt:65958:67394 [0] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67395 [1] NCCL INFO Connected all rings
gpt-dev-ppt:65958:67394 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65958:67394 [0] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65959:67395 [1] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67395 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65959:67395 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65959:67395 [1] NCCL INFO comm 0x55ed37268f30 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x1ad9716b1600c155 - Init COMPLETE
gpt-dev-ppt:65958:67394 [0] NCCL INFO comm 0x5599f9cc8900 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1b000 commId 0x1ad9716b1600c155 - Init COMPLETE
gpt-dev-ppt:65959:67414 [2] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65959:67414 [2] NCCL INFO Using network Socket
gpt-dev-ppt:65958:67413 [1] NCCL INFO Using non-device net plugin version 0
gpt-dev-ppt:65958:67413 [1] NCCL INFO Using network Socket
gpt-dev-ppt:65959:67414 [2] NCCL INFO comm 0x55ed3727d5e0 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 1d000 commId 0x2678884a28f56324 - Init START
gpt-dev-ppt:65958:67413 [1] NCCL INFO comm 0x5599f9cdd200 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x2678884a28f56324 - Init START
gpt-dev-ppt:65958:67413 [1] NCCL INFO Setting affinity for GPU 1 to ff00ff
gpt-dev-ppt:65959:67414 [2] NCCL INFO Setting affinity for GPU 2 to ff00ff
gpt-dev-ppt:65958:67413 [1] NCCL INFO Channel 00/02 :    0   1
gpt-dev-ppt:65959:67414 [2] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] -1/-1/-1->1->0
gpt-dev-ppt:65959:67414 [2] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65958:67413 [1] NCCL INFO Channel 01/02 :    0   1
gpt-dev-ppt:65958:67413 [1] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
gpt-dev-ppt:65958:67413 [1] NCCL INFO P2P Chunksize set to 131072
gpt-dev-ppt:65959:67414 [2] NCCL INFO Channel 00 : 1[2] -> 0[1] via SHM/direct/direct
gpt-dev-ppt:65959:67414 [2] NCCL INFO Channel 01 : 1[2] -> 0[1] via SHM/direct/direct
gpt-dev-ppt:65958:67413 [1] NCCL INFO Channel 00 : 0[1] -> 1[2] via SHM/direct/direct
gpt-dev-ppt:65958:67413 [1] NCCL INFO Channel 01 : 0[1] -> 1[2] via SHM/direct/direct
gpt-dev-ppt:65958:67413 [1] NCCL INFO Connected all rings
gpt-dev-ppt:65958:67413 [1] NCCL INFO Connected all trees
gpt-dev-ppt:65959:67414 [2] NCCL INFO Connected all rings
gpt-dev-ppt:65959:67414 [2] NCCL INFO Connected all trees
gpt-dev-ppt:65958:67413 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65959:67414 [2] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
gpt-dev-ppt:65958:67413 [1] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65959:67414 [2] NCCL INFO 2 coll channels, 0 nvls channels, 2 p2p channels, 2 p2p channels per peer
gpt-dev-ppt:65958:67413 [1] NCCL INFO comm 0x5599f9cdd200 rank 0 nranks 2 cudaDev 1 nvmlDev 1 busId 1c000 commId 0x2678884a28f56324 - Init COMPLETE
gpt-dev-ppt:65959:67414 [2] NCCL INFO comm 0x55ed3727d5e0 rank 1 nranks 2 cudaDev 2 nvmlDev 2 busId 1d000 commId 0x2678884a28f56324 - Init COMPLETE
[rank0]:[E ProcessGroupNCCL.cpp:526] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=108965692, NumelOut=108965692, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
gpt-dev-ppt:65957:67376 [0] NCCL INFO [Service thread] Connection closed by localRank 0
gpt-dev-ppt:65957:66031 [0] NCCL INFO comm 0x562ab1bdec40 rank 0 nranks 2 cudaDev 0 busId 1b000 - Abort COMPLETE
[rank0]:[E ProcessGroupNCCL.cpp:1553] [PG 1 Rank 0] Timeout at NCCL work: 2, last enqueued NCCL work: 2, last completed NCCL work: 1.
[rank0]:[E ProcessGroupNCCL.cpp:540] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:546] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1391] [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=108965692, NumelOut=108965692, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:528 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f741677a0e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f73cb1b1872 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x203 (0x7f73cb1b6e93 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f73cb1b829c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f7415eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f741aa6eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f741ab00a40 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
  what():  [PG 1 Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=2, OpType=ALLREDUCE, NumelIn=108965692, NumelOut=108965692, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
Exception raised from checkTimeout at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:528 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f741677a0e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f73cb1b1872 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x203 (0x7f73cb1b6e93 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f73cb1b829c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xdc253 (0x7f7415eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #5: <unknown function> + 0x94ac3 (0x7f741aa6eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #6: <unknown function> + 0x126a40 (0x7f741ab00a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1395 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f741677a0e7 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe1cd31 (0x7f73caea9d31 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xdc253 (0x7f7415eb0253 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #3: <unknown function> + 0x94ac3 (0x7f741aa6eac3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #4: <unknown function> + 0x126a40 (0x7f741ab00a40 in /lib/x86_64-linux-gnu/libc.so.6)

Exception in thread Thread-5 (recv_comm_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Chimera/pipeline.py", line 160, in recv_comm_thread
    dist.recv(tensor=recv_tensor, src=src_rank, tag=tag)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1867, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:23269
Exception in thread Thread-3 (recv_comm_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/root/Chimera/pipeline.py", line 160, in recv_comm_thread
    dist.recv(tensor=recv_tensor, src=src_rank, tag=tag)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1867, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/pair.cc:534] Connection closed by peer [127.0.0.1]:23269
W0227 12:42:13.691000 140270371275200 torch/distributed/elastic/multiprocessing/api.py:694] Sending process 65958 closing signal SIGTERM
W0227 12:42:13.692000 140270371275200 torch/distributed/elastic/multiprocessing/api.py:694] Sending process 65959 closing signal SIGTERM
W0227 12:42:13.692000 140270371275200 torch/distributed/elastic/multiprocessing/api.py:694] Sending process 65960 closing signal SIGTERM
E0227 12:42:14.132000 140270371275200 torch/distributed/elastic/multiprocessing/api.py:669] failed (exitcode: -6) local_rank: 0 (pid: 65957) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 834, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 825, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 137, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 271, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
======================================================
main_bert.py FAILED
------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-27_12:42:13
  host      : gpt-dev-ppt
  rank      : 0 (local_rank: 0)
  exitcode  : -6 (pid: 65957)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 65957
======================================================
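
To narrow down whether the hang comes from the chimera schedule itself or from mixing a Gloo default group (for p2p) with NCCL sub-groups (for the gradient all-reduce), a small standalone test may help. The script below is a hypothetical sketch, not part of the repo; the rank pairs [0, 3] and [1, 2] are taken from the stage-to-rank mapping printed above, and recv_comm_thread is only imitated, not imported.

# mixed_backend_test.py -- hypothetical reproducer, not part of the Chimera repo.
# Launch: torchrun --nproc_per_node=4 mixed_backend_test.py
import os
import threading

import torch
import torch.distributed as dist


def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    assert world_size == 4, "sketch assumes 4 ranks"
    torch.cuda.set_device(local_rank)

    # Default group on gloo, matching --p2p_backend gloo in the issue.
    dist.init_process_group(backend="gloo")

    # NCCL sub-groups for the rank pairs that share a stage (from the log:
    # stage0 -> ranks [0, 3], stage1 -> ranks [1, 2]).
    my_group = None
    for pair in ([0, 3], [1, 2]):
        group = dist.new_group(ranks=pair, backend="nccl")
        if rank in pair:
            my_group = group

    # Background gloo recv thread, imitating recv_comm_thread in pipeline.py.
    recv_buf = torch.zeros(1024)
    recv_thread = threading.Thread(
        target=dist.recv,
        kwargs=dict(tensor=recv_buf, src=(rank + 1) % world_size),
        daemon=True,
    )
    recv_thread.start()

    # Blocking gloo send to the previous rank; every send has a matching recv above.
    dist.send(tensor=torch.full((1024,), float(rank)), dst=(rank - 1) % world_size)

    # NCCL all-reduce inside the sub-group, mimicking the gradient sync that timed out.
    grad = torch.ones(1 << 20, device="cuda")
    dist.all_reduce(grad, group=my_group)
    torch.cuda.synchronize()

    recv_thread.join()
    print(f"rank {rank}: p2p + sub-group all-reduce finished")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this sketch also hangs on the all-reduce, the problem is probably environment-level (for example the combination of NCCL_P2P_DISABLE=1 with the SHM transport shown in the log); if it passes, the deadlock is more likely in how the chimera schedule orders the Gloo receives against the sub-group all-reduce.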
