albertz avatar albertz commented on September 15, 2024

Sometimes also like this:

...

ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.

...

Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB

And then it hangs.
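The error message in the log suggests rerunning with CUDA_LAUNCH_BLOCKING=1 so that the failing kernel is reported at the call that launched it. A minimal sketch of how one might relaunch a command with that variable set, assuming the training is started as a child process; the helper names `blocking_launch_env` and `run_with_blocking_launches` are hypothetical and not part of RETURNN or PyTorch:

```python
import os
import subprocess
import sys


def blocking_launch_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    # The variable has to be in place before CUDA is initialized,
    # so it is set for the child process rather than at runtime.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(cmd):
    """Run cmd (a list of strings) as a child process and return its exit code."""
    return subprocess.run(cmd, env=blocking_launch_env(), check=False).returncode


if __name__ == "__main__":
    # Hypothetical usage: python this_script.py python3 rnn.py returnn.config
    if len(sys.argv) > 1:
        sys.exit(run_with_blocking_launches(sys.argv[1:]))
```

Child processes inherit the environment, so worker processes spawned by the launched command would see the variable too. Synchronous launches slow training down, so this is only meant for reproducing the error.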

Processes:

zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
    PID COMMAND                   
   6791 sshd                        
   6792  \_ zsh                     
   6810      \_ ps                  
   3790 slurm_script                
   3804  \_ python3.11              
   3832      \_ python3.11        
   3850          \_ python3.11      
   3989          |   \_ python3.11  
   3995          |   \_ watch memory
   4110          |   \_ MPD worker 0
   4111          |   \_ MPD worker 1
   4115          |   \_ MPD worker 2
   4121          |   \_ MPD worker 3
   4589          |   \_ python3.11  
   4600          |   \_ MPD worker 0
   4603          |   \_ MPD worker 1    
   4608          |   \_ MPD worker 2    
   4612          |   \_ MPD worker 3    
   5057          |   \_ MPD worker 0    
   5059          |   \_ MPD worker 1
   5061          |   \_ MPD worker 2
   5062          |   \_ MPD worker 3
   5603          |   \_ TDL worker 0
   5841          |       \_ MPD worker 0
   5944          |       \_ MPD worker 1
   6053          |       \_ MPD worker 2
   6159          |       \_ MPD worker 3
   3851          \_ python3.11
   3991          |   \_ python3.11
   3993          |   \_ watch memory
   4112          |   \_ MPD worker 0
   4116          |   \_ MPD worker 1
   4120          |   \_ MPD worker 2
   4124          |   \_ MPD worker 3
   4577          |   \_ python3.11
   4602          |   \_ MPD worker 0
   4606          |   \_ MPD worker 1
   4609          |   \_ MPD worker 2
   4614          |   \_ MPD worker 3
   5051          |   \_ MPD worker 0
   5053          |   \_ MPD worker 1
   5055          |   \_ MPD worker 2
   5056          |   \_ MPD worker 3
   5600          |   \_ TDL worker 0
   5842          |       \_ MPD worker 0
   5947          |       \_ MPD worker 1
   6055          |       \_ MPD worker 2
   6163          |       \_ MPD worker 3
   3852          \_ python3.11 <defunct>
   3853          \_ python3.11
   3988              \_ python3.11
   3992              \_ watch memory
   4113              \_ MPD worker 0
   4118              \_ MPD worker 1
   4122              \_ MPD worker 2
   4125              \_ MPD worker 3
   4583              \_ python3.11
   4599              \_ MPD worker 0
   4605              \_ MPD worker 1
   4610              \_ MPD worker 2
   4613              \_ MPD worker 3
   5052              \_ MPD worker 0
   5054              \_ MPD worker 1
   5058              \_ MPD worker 2
   5060              \_ MPD worker 3
   5599              \_ TDL worker 0
   5840                  \_ MPD worker 0
   5945                  \_ MPD worker 1
   6049                  \_ MPD worker 2
   6157                  \_ MPD worker 3

Those processes just hang. E.g. py-spy prints only its header and then stalls as well:

% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)

^C
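In the process tree above, proc 3852 (the one that raised the CUDA error) is listed as defunct while its three sibling procs are still alive. A small sketch for picking such entries out of `ps -o pid,comm` output automatically; the function name `find_defunct` and the shortened sample are only for illustration:

```python
def find_defunct(ps_output):
    """Return the PIDs of all entries marked <defunct> in `ps -o pid,comm` output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue  # skip the header and blank lines
        if "<defunct>" in fields[1:]:
            pids.append(int(fields[0]))
    return pids


# Shortened excerpt of the process tree from this comment.
sample = "\n".join([
    r"    PID COMMAND",
    r"   3850          \_ python3.11",
    r"   3989          |   \_ python3.11",
    r"   3852          \_ python3.11 <defunct>",
    r"   3853          \_ python3.11",
])

print(find_defunct(sample))
```

On the excerpt this reports only 3852, i.e. the crashed proc has exited but has not been reaped by its parent 3832.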

from returnn.
