Describe the bug When I trained the model with two nodes for pipe

[BUG] CUDA error in pipeline parallel (deepspeed issue, open)

sunkun1997 commented on September 20, 2024

from deepspeed.

Comments (3)

loadams commented on September 20, 2024

Hi @sunkun1997 - can you please share more information on your setup, ds_config, ds_report, and sample repro script?

sunkun1997 commented on September 20, 2024

repro script
I am using the https://github.com/microsoft/DeepSpeedExamples/tree/master/training/pipeline_parallelism example, with slight modifications to fit our environment. In our environment, each node has four environment variables: WORLD_SIZE (the number of nodes), RANK (the node rank), MASTER_ADDR (the master node IP), and MASTER_PORT (the port). So I modified run.sh:

gpu=8  
n=$(($WORLD_SIZE * $gpu))  
start_rank=$(($RANK * $gpu))  
end_rank=$((($RANK + 1) * $gpu))  
for ((i=$start_rank; i<$end_rank; i++))  
do  
  {  
    LOCAL_RANK=$i WORLD_SIZE=$n MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT python train.py \  
    --p $gpu \  
    --steps=200  
  }&  
done  
wait
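
Editorial note: the arithmetic in the loop above can be sketched in Python to show which values each process receives. This is only an illustration of the script as posted; the function name `original_layout` is hypothetical, and the two-node, eight-GPU setup is taken from the thread.

```python
def original_layout(num_nodes, node_rank, gpus_per_node=8):
    """Mirror the loop in the modified run.sh for a single node.

    num_nodes corresponds to the node-level WORLD_SIZE and node_rank to the
    node-level RANK described in the post. The name is hypothetical.
    """
    world_size = num_nodes * gpus_per_node        # n=$(($WORLD_SIZE * $gpu))
    start_rank = node_rank * gpus_per_node        # start_rank=$(($RANK * $gpu))
    end_rank = (node_rank + 1) * gpus_per_node    # end_rank=$((($RANK + 1) * $gpu))
    # The script exports the loop index i as LOCAL_RANK for each process.
    return [
        {"LOCAL_RANK": i, "WORLD_SIZE": world_size}
        for i in range(start_rank, end_rank)
    ]


if __name__ == "__main__":
    for node_rank in range(2):
        ranks = [p["LOCAL_RANK"] for p in original_layout(2, node_rank)]
        print(f"node {node_rank}: LOCAL_RANK values {ranks}")
```

With two nodes, the second node's processes receive LOCAL_RANK values 8 through 15, while a node with eight GPUs only has device indices 0 through 7. If the library reads its device index from LOCAL_RANK (an assumption worth checking against the installed DeepSpeed version), that would be consistent with an "invalid device ordinal" error on the second node.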

I also modified the main block of train.py:

if __name__ == '__main__':  
    import json  
    args = get_args()  
    with open('./ds_config.json', 'r') as f:  
        args.deepspeed_config = json.loads(f.read())  
    args.local_rank = int(os.environ['LOCAL_RANK'])  
    args.world_size = int(os.environ['WORLD_SIZE'])  
    deepspeed.init_distributed(dist_backend=args.backend, rank=args.local_rank,  
                               world_size=args.world_size, auto_mpi_discovery=False)  
    torch.cuda.set_device(args.local_rank % 8)  
    if args.pipeline_parallel_size == 0:  
        train_base(args)  
    else:  
        train_pipe(args)

Then each node runs run.sh.
ds_report
The run raises a CUDA error:

Traceback (most recent call last):
  File "train.py", line 165, in <module>
    train_pipe(args)
  File "train.py", line 131, in train_pipe
    net = PipelineModule(layers=join_layers(net),
  File "/home/ray/anaconda3/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 201, in __init__
    self.to(get_accelerator().device_name(self.local_rank))
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1145, in to
    return self._apply(convert)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 797, in _apply
    module._apply(fn)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 820, in _apply
    param_applied = fn(param)
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1143, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
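
Editorial note: the traceback shows the failure inside `self.to(get_accelerator().device_name(self.local_rank))`, so one way to surface the problem earlier is a pre-flight check on the local rank. The sketch below is a hypothetical helper, not part of DeepSpeed or the example; in train.py the device count would presumably come from `torch.cuda.device_count()`.

```python
def check_device_ordinal(local_rank, device_count):
    """Raise a readable error if local_rank is not a valid device index.

    Hypothetical helper: device_count is passed in so the check itself has no
    dependency on torch. In a real script it could be called as
    check_device_ordinal(int(os.environ["LOCAL_RANK"]), torch.cuda.device_count()).
    """
    if not 0 <= local_rank < device_count:
        raise ValueError(
            f"LOCAL_RANK={local_rank} is not a valid device index; "
            f"this node exposes {device_count} device(s)"
        )
    return local_rank


if __name__ == "__main__":
    print(check_device_ordinal(7, 8))   # last valid index on an 8-GPU node
    try:
        check_device_ordinal(8, 8)      # first value used on the second node
    except ValueError as err:
        print(err)
```

Calling such a check before constructing the model would fail with a message naming the offending value, rather than a CUDA error raised deep inside `Module.to`.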

sunkun1997 commented on September 20, 2024

By the way, if I change the line that launches train.py to
LOCAL_RANK=$((i % $gpu)) RANK=$i WORLD_SIZE=$n MASTER_ADDR=$MASTER_ADDR MASTER_PORT=$MASTER_PORT python train.py
and change the distributed initialization to
deepspeed.init_distributed(dist_backend="nccl", rank=args.rank, world_size=args.world_size, auto_mpi_discovery=False)
then it looks like the nodes cannot communicate with each other, and the run raises:

Traceback (most recent call last):
  File "train.py", line 166, in <module>
    train_pipe(args)
  File "train.py", line 137, in train_pipe
    trainset = cifar_trainset(args.local_rank)
  File "train.py", line 30, in cifar_trainset
    dist.barrier()
  File "/home/ray/anaconda3/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3328, in barrier
    work = default_pg.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1275, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
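
Editorial note: the environment used in this second variant can be sketched as below, which makes it easy to confirm that global ranks are unique across nodes and local ranks stay within 0..7. The function name, the address `10.0.0.1`, and the port `29500` are placeholders. Setting `NCCL_DEBUG=INFO` is not something the thread mentions; it is added here as an assumption, since it is the NCCL environment variable that enables verbose logging and may give more detail than "Internal check failed". The sketch does not identify the cause of the NCCL error.

```python
def second_variant_env(num_nodes, node_rank, master_addr, master_port,
                       gpus_per_node=8):
    """Build one environment dict per process for the second launch variant.

    Hypothetical helper mirroring:
      LOCAL_RANK=$((i % $gpu)) RANK=$i WORLD_SIZE=$n ... python train.py
    """
    world_size = num_nodes * gpus_per_node
    first = node_rank * gpus_per_node
    envs = []
    for i in range(first, first + gpus_per_node):
        envs.append({
            "RANK": str(i),                          # global rank, unique per process
            "LOCAL_RANK": str(i % gpus_per_node),    # device index on this node
            "WORLD_SIZE": str(world_size),
            "MASTER_ADDR": master_addr,
            "MASTER_PORT": str(master_port),
            "NCCL_DEBUG": "INFO",                    # assumption: verbose NCCL logs
        })
    return envs


if __name__ == "__main__":
    # Placeholder address and port, for illustration only.
    for env in second_variant_env(2, 1, "10.0.0.1", 29500):
        print(env["RANK"], env["LOCAL_RANK"], env["WORLD_SIZE"])
```

Each dict could be merged into `os.environ` and passed to `subprocess.Popen(..., env=...)` to launch one train.py process per GPU.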
