Comments (10)
That NeMo version is 6 months old; can you try r1.23 and see if the problem persists? We do not see CPU memory increasing constantly per epoch, but that may be because we train on multiple nodes (at least 4).
from nemo.
Is it GPU or CPU memory that is exhausted? And how many nodes are you using?
What version of NeMo are you using?
Without sufficient details it's not possible to debug.
What I can say is that we train on nodes with 400 GB of RAM per node and A100s with 80 GB of GPU memory, and we train on 90-400K hours of speech without OOM in either CPU or GPU memory.
If you can visibly see CPU RAM constantly increasing during training, a stopgap could be to set exp_manager.max_time_per_run to a reasonable value, like a day; the job then stops after a day and you can restart it before the leak exhausts memory. It's not a fix, just a temporary workaround.
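As a rough sketch of the time-limited restart idea: the snippet below builds command-line overrides for a one-day limit. It assumes that max_time_per_run takes a DD:HH:MM:SS string and that exp_manager.resume_if_exists is what lets the restarted job pick up from the last checkpoint; check both against your NeMo version. The helper name format_max_time is made up for this example.

```python
from datetime import timedelta


def format_max_time(limit: timedelta) -> str:
    """Format a duration as DD:HH:MM:SS (assumed format for max_time_per_run)."""
    total = int(limit.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


# Hydra-style overrides to append to the training command (assumptions, see above).
overrides = [
    f"exp_manager.max_time_per_run={format_max_time(timedelta(days=1))}",
    "exp_manager.resume_if_exists=true",  # assumed: resume from the latest checkpoint
]
print(" ".join(overrides))
```

The job would then be resubmitted with the same overrides each time it stops, so host memory is released between runs.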
Is it GPU or CPU memory that is exhausted? And how many nodes are you using?
- CPU not GPU
- Just single node
We added just one line:
self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False,)
in file:
nemo/collections/asr/models/ctc_models.py
Previously we used on_epoch=True, but the problem still remains after changing it to False.
What version of NeMo are you using?
git log
commit 0d3d8fa (HEAD -> main)
Author: anteju [email protected]
Date: Wed Nov 15 16:56:29 2023 -0800
[ASR] GSS-based mask estimator (#7849)
* Added GSS-based mask estimator for multispeaker scenarios
Signed-off-by: Ante Jukić <[email protected]>
* Addressed PR comments
Signed-off-by: Ante Jukić <[email protected]>
---------
Signed-off-by: Ante Jukić <[email protected]>
Co-authored-by: Taejin Park <[email protected]>
Actually, it's very easy to verify: just submit a training job with, say, LibriSpeech data, and you can observe that your CPU memory keeps increasing within an epoch.
With a dataset that small the increase does no harm, since memory grows slowly and, after an epoch, usage somehow goes down again. In our case, if we cut our training data down to 30k, then with 1.2 TB of CPU memory we can finish an epoch normally.
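To check this observation on your own run, one option is to sample the process's resident memory every so often and see whether it keeps rising within an epoch. A minimal sketch, assuming Linux (it reads /proc/self/statm); the helper names read_rss_mb and grows_steadily are made up for this example and are not part of NeMo.

```python
import os


def read_rss_mb() -> float:
    """Current resident set size of this process in MiB (Linux only)."""
    with open("/proc/self/statm") as f:
        resident_pages = int(f.read().split()[1])  # second field: resident pages
    return resident_pages * os.sysconf("SC_PAGE_SIZE") / (1024 ** 2)


def grows_steadily(samples, min_fraction=0.9):
    """True if at least min_fraction of consecutive samples increase."""
    if len(samples) < 2:
        return False
    rises = sum(later > earlier for earlier, later in zip(samples, samples[1:]))
    return rises / (len(samples) - 1) >= min_fraction
```

Calling read_rss_mb() every few hundred steps and passing the collected values to grows_steadily gives a crude signal. Note that with several dataloader workers most of the growth may sit in the worker processes, which this sketch does not measure.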
Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well)
Hi there,
Just checking in to ask whether this has been resolved.
I am facing the same issue.
Thank you.
Thanks @haihua
I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean?
I see, but the issue above persists with multiple nodes, and I'd like to get it working.