During training, memory usage increases steadily over time, until training is 74% done.

Memory is fully eaten and training quit with errors for 40k hours ASR training

haihua commented on July 20, 2024

Memory is fully eaten and training quit with errors for 40k hours ASR training

from nemo.

Comments (10)

titu1994 commented on July 20, 2024

That NeMo version is 6 months old. Can you use r1.23 and see if the problem persists? We do not see constantly increasing CPU memory per epoch, but that may be because we use multiple nodes (a minimum of 4).

titu1994 commented on July 20, 2024

Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

What version of NeMo are you using?
Without sufficient details it's not possible to debug.

What I can say is that we train on nodes with 400 GB of RAM per node and A100 GPUs with 80 GB of GPU memory, and we train on 90-400K hours of speech without running out of either CPU or GPU memory.

haihua commented on July 20, 2024

Is it GPU or CPU memory that is exhausted? And how many nodes are you using?

CPU, not GPU.
Just a single node.
We only added one line,
self.log('loss', loss_value, on_step=True, prog_bar=True, on_epoch=False,)
in the file:
nemo/collections/asr/models/ctc_models.py
Previously we used on_epoch=True, but the problem remains after changing it to False.

What version of NeMo are you using?

git log
commit 0d3d8fa (HEAD -> main)
Author: anteju [email protected]
Date: Wed Nov 15 16:56:29 2023 -0800

[ASR] GSS-based mask estimator (#7849)

* Added GSS-based mask estimator for multispeaker scenarios

Signed-off-by: Ante Jukić <[email protected]>

* Addressed PR comments

Signed-off-by: Ante Jukić <[email protected]>

---------

Signed-off-by: Ante Jukić <[email protected]>
Co-authored-by: Taejin Park <[email protected]>

Actually, it's very easy to verify: just submit a training job with, say, LibriSpeech data, and you can observe that CPU memory keeps increasing within an epoch.
With a small dataset this increase does no harm, since memory grows slowly and usage somehow goes down again after each epoch. In our case, if we reduce the training data to 30k hours, we can finish an epoch normally with 1.2 TB of CPU memory.
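
One way to check the within-epoch growth described above is to sample the resident memory of the training process at regular steps. The following is a minimal sketch and not part of NeMo: `rss_bytes` and `growth_per_step` are hypothetical helper names, the `/proc/self/statm` read assumes Linux, and the fallback assumes a Unix-like system and reports peak rather than current memory.

```python
import os
import sys
from typing import Sequence


def rss_bytes() -> int:
    """Return this process's resident set size in bytes."""
    try:
        # Linux: the second field of /proc/self/statm is resident pages.
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE")
    except OSError:
        # Fallback (assumption: Unix-like OS): peak RSS, not current RSS.
        import resource

        peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
        return peak if sys.platform == "darwin" else peak * 1024


def growth_per_step(samples: Sequence[int]) -> float:
    """Average change in bytes between consecutive samples."""
    if len(samples) < 2:
        return 0.0
    return (samples[-1] - samples[0]) / (len(samples) - 1)


if __name__ == "__main__":
    samples = [rss_bytes()]
    hoard = []  # stands in for whatever is retaining memory
    for _ in range(5):
        hoard.append(os.urandom(8 * 1024 * 1024))  # 8 MiB of touched memory
        samples.append(rss_bytes())
    mib = growth_per_step(samples) / 2**20
    print(f"average growth: {mib:.1f} MiB per step")
```

In a real run, the same sampling could be done every few hundred steps on each rank and written to the training log, which makes it easier to tell slow growth that resets at epoch boundaries from growth that never comes back down.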

haihua commented on July 20, 2024

On Fri, Apr 12, 2024 at 2:21 PM Somshubra Majumdar wrote: Is it GPU or CPU memory that is exhausted? And how many nodes are you using? What version of NeMo are you using? Without sufficient details it's not possible to debug. What I can say is we train on nodes with 400 GB ram per node and A100 with 80GB gpu memory and train on 90-400K hours of speech without oom in either CPU or GPU memory.

How many nodes have you used? If you use a lot of nodes, you might not trigger the bug. Say you have used 8 nodes; then there might be no issue ... Regards, Haihua

…

If you can visibly see CPU RAM constantly increase during training, a pseudo fix could be to use exp_manager.max_time_per_run and set it to a reasonable value like a day; the job then stops after a day, and you can restart it and avoid the memory leak. It's not a fix but a temporary solution.
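
The workaround above amounts to capping each run's wall-clock time and resuming from the last checkpoint. Below is a minimal sketch of the settings involved. It assumes that `max_time_per_run` takes a `DD:HH:MM:SS` string (the duration format of PyTorch Lightning's `Timer` callback) and that `resume_if_exists` is enabled so the restarted job picks up the saved checkpoint. `to_dd_hh_mm_ss` is a hypothetical helper, and the dictionary only illustrates the `exp_manager` section; it is not a complete config.

```python
from datetime import timedelta


def to_dd_hh_mm_ss(limit: timedelta) -> str:
    """Format a time limit as DD:HH:MM:SS (assumed max_time_per_run format)."""
    total = int(limit.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days:02d}:{hours:02d}:{minutes:02d}:{seconds:02d}"


# Illustrative exp_manager settings: stop each run after one day and
# resume from the latest checkpoint when the job is submitted again.
exp_manager_cfg = {
    "max_time_per_run": to_dd_hh_mm_ss(timedelta(days=1)),
    "resume_if_exists": True,
}

if __name__ == "__main__":
    print(exp_manager_cfg)
```

Each restart begins with a fresh process, so whatever memory accumulated during the previous run is released; the leak itself is left untouched.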

riqiang-dp commented on July 20, 2024

Hi, is this issue resolved? I've been running into the same issue. (I can confirm that it happens on 1.23 as well)

ROZBEH commented on July 20, 2024

Hi there,

Just checking here and wondering whether this is resolved?
I am facing the same issue.

Thank you.

haihua commented on July 20, 2024

Using multiple nodes to train can avoid the problem.

…

On Fri, May 24, 2024, 1:58 AM ROZBEH wrote: Hi there, Just checking here and wondering whether this is resolved? I am facing same issue. Thank you.

ROZBEH commented on July 20, 2024

Thanks @haihua
I'm indeed using 5 nodes with 5 GPUs each. Is that what you mean?

haihua commented on July 20, 2024

Yes, that's it.

…

On Fri, May 24, 2024, 8:06 PM ROZBEH wrote: Thanks @haihua I'm indeed 5 nodes with 5 GPU each. Is that what you mean?

ROZBEH commented on July 20, 2024

I see, but the issue above persists with multiple nodes, and I'd like to get it working.
