Comments (1)
Sometimes it also looks like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
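For reference, per-process numbers like the rss/pss/uss/shared values above can be gathered with psutil. A minimal sketch of such a memory watcher, assuming Linux (psutil reads /proc/<pid>/smaps for PSS/USS) and not claiming to be RETURNN's actual implementation:

import psutil

def mem_summary(pid: int) -> str:
    """RSS/PSS/USS/shared summary of one process (Linux only)."""
    m = psutil.Process(pid).memory_full_info()
    mb = 1024 ** 2
    return (f"rss={m.rss / mb:.1f}MB pss={m.pss / mb:.1f}MB "
            f"uss={m.uss / mb:.1f}MB shared={m.shared / mb:.1f}MB")

def total_pss_uss_gb(main_pid: int) -> tuple[float, float]:
    """Sum PSS/USS (in GB) over a main proc and all its children,
    like the "MEMORY: total (main ...)" lines above."""
    procs = [psutil.Process(main_pid)]
    procs += procs[0].children(recursive=True)
    infos = [p.memory_full_info() for p in procs]
    gb = 1024 ** 3
    return sum(m.pss for m in infos) / gb, sum(m.uss for m in infos) / gb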
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
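As the error message says, the stacktrace is likely misleading, because the kernel failure only surfaces at some later, unrelated CUDA API call. To get the real crash site, kernel launches can be made synchronous. A minimal sketch; CUDA_LAUNCH_BLOCKING is a standard CUDA runtime env var, and exporting it in the shell before starting training works just as well as setting it in the script:

import os

# Must be set before the CUDA context is created, i.e. before the first
# CUDA call -- safest is before importing torch at all.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402

# With blocking launches, the RuntimeError is raised at the Python line
# that launched the failing kernel, so the stacktrace becomes meaningful.
# (TORCH_USE_CUDA_DSA additionally requires a PyTorch build compiled with
# device-side assertions, so it is not just an env var switch.)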
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then it just hangs.
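The shutdown above only sends SIGINT, which a worker stuck in a wedged CUDA call never acts on. A generic escalation pattern (a sketch, not RETURNN's actual cleanup code) is to follow up with SIGKILL after a timeout and reap the child so it does not linger as <defunct>:

import os
import signal
import time

def stop_child(pid: int, name: str, timeout: float = 10.0) -> None:
    """SIGINT a direct child; SIGKILL it if it has not exited in time."""
    print(f"Send signal SIGINT to pid {pid}/{name!r}")
    os.kill(pid, signal.SIGINT)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # os.waitpid only works for our own children; WNOHANG polls.
        if os.waitpid(pid, os.WNOHANG)[0] == pid:
            return  # exited and reaped
        time.sleep(0.1)
    print(f"pid {pid} still alive after {timeout}s, sending SIGKILL")
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, 0)  # reap, so it does not stay around as <defunct>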
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
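Note that 3852, the proc that raised the CUDA error, is already <defunct>, i.e. a zombie: it exited but its parent never reaped it, and the surviving ranks are presumably blocked waiting on it (e.g. in a collective). Zombies are easy to spot programmatically, e.g. with psutil (a sketch; 3790 is the slurm_script root pid from the tree above):

import psutil

root = psutil.Process(3790)
for p in root.children(recursive=True):
    try:
        if p.status() == psutil.STATUS_ZOMBIE:
            print("zombie:", p.pid)  # would print 3852 here
    except psutil.NoSuchProcess:
        pass  # process exited while we were iterating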
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/[email protected]/3.11.2_1/bin/python3.11)
^C
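So py-spy cannot get a dump out of 3850 either (hence the ^C). A fallback that works from inside the process is the stdlib faulthandler module, which dumps all Python thread stacks from a signal handler without needing the GIL, so it works even when threads are stuck in native (e.g. CUDA/NCCL) calls. A sketch of wiring it up at process startup (hypothetical placement, not something this setup necessarily does):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` makes the process dump all of its
# Python thread stacks to stderr, without attaching a debugger.
faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optional watchdog: dump every hour until cancelled via
# faulthandler.cancel_dump_traceback_later().
faulthandler.dump_traceback_later(timeout=3600, repeat=True)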