Comments (1)
Sometimes it also looks like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
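For reference, per-process numbers like the rss/pss/uss/shared values above can be gathered with psutil. A minimal sketch of such a memory watcher, assuming Linux (psutil reads /proc/<pid>/smaps for PSS/USS) and not claiming to be RETURNN's actual implementation:

import psutil

def mem_summary(pid: int) -> str:
    """RSS/PSS/USS/shared summary of one process (Linux only)."""
    m = psutil.Process(pid).memory_full_info()
    mb = 1024 ** 2
    return (f"rss={m.rss / mb:.1f}MB pss={m.pss / mb:.1f}MB "
            f"uss={m.uss / mb:.1f}MB shared={m.shared / mb:.1f}MB")

def total_pss_uss_gb(main_pid: int) -> tuple[float, float]:
    """Sum PSS/USS (in GB) over a main proc and all its children,
    like the "MEMORY: total (main ...)" lines above."""
    procs = [psutil.Process(main_pid)]
    procs += procs[0].children(recursive=True)
    infos = [p.memory_full_info() for p in procs]
    gb = 1024 ** 3
    return sum(m.pss for m in infos) / gb, sum(m.uss for m in infos) / gb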
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
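As the error message says, the stacktrace is likely misleading, because the kernel failure only surfaces at some later, unrelated CUDA API call. To get the real crash site, kernel launches can be made synchronous. A minimal sketch; CUDA_LAUNCH_BLOCKING is a standard CUDA runtime env var, and exporting it in the shell before starting training works just as well as setting it in the script:

import os

# Must be set before the CUDA context is created, i.e. before the first
# CUDA call -- safest is before importing torch at all.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402

# With blocking launches, the RuntimeError is raised at the Python line
# that launched the failing kernel, so the stacktrace becomes meaningful.
# (TORCH_USE_CUDA_DSA additionally requires a PyTorch build compiled with
# device-side assertions, so it is not just an env var switch.)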
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then it just hangs.
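The shutdown above only sends SIGINT, which a worker stuck in a wedged CUDA call never acts on. A generic escalation pattern (a sketch, not RETURNN's actual cleanup code) is to follow up with SIGKILL after a timeout and reap the child so it does not linger as <defunct>:

import os
import signal
import time

def stop_child(pid: int, name: str, timeout: float = 10.0) -> None:
    """SIGINT a direct child; SIGKILL it if it has not exited in time."""
    print(f"Send signal SIGINT to pid {pid}/{name!r}")
    os.kill(pid, signal.SIGINT)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # os.waitpid only works for our own children; WNOHANG polls.
        if os.waitpid(pid, os.WNOHANG)[0] == pid:
            return  # exited and reaped
        time.sleep(0.1)
    print(f"pid {pid} still alive after {timeout}s, sending SIGKILL")
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)  # reap, so it does not stay around as <defunct>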
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
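Note that 3852, the proc that raised the CUDA error, is already <defunct>, i.e. a zombie: it exited but its parent never reaped it, and the surviving ranks are presumably blocked waiting on it (e.g. in a collective). Zombies are easy to spot programmatically, e.g. with psutil (a sketch; 3790 is the slurm_script root pid from the tree above):

import psutil

root = psutil.Process(3790)
for p in root.children(recursive=True):
    try:
        if p.status() == psutil.STATUS_ZOMBIE:
            print("zombie:", p.pid)  # would print 3852 here
    except psutil.NoSuchProcess:
        pass  # process exited while we were iterating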
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
^C
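So py-spy cannot get a dump out of 3850 either (hence the ^C). A fallback that works from inside the process is the stdlib faulthandler module, which dumps all Python thread stacks from a signal handler without needing the GIL, so it works even when threads are stuck in native (e.g. CUDA/NCCL) calls. A sketch of wiring it up at process startup (hypothetical placement, not something this setup necessarily does):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process dump all of its
# Python thread stacks to stderr, without attaching a debugger.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optional watchdog: dump every hour until cancelled via
# faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(timeout=3600, repeat=True)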