Comments (6)
This piece of code is what causes training to block.
When loss_metric is None, the process takes the else branch and never runs backward, and I suspect this desynchronizes the GPUs. So it is enough to set loss_metric (and hence its gradients) to 0 whenever it would be None, which is equivalent to the if/else. The function that produces loss_metric can therefore be changed as sketched below, and this avoids the training hang.
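A minimal sketch of this idea, assuming the discriminator-loss helper roughly follows the CMGAN training loop (names such as `discriminator`, `clean_mag`, `est_mag`, `one_labels`, `pesq_score` are illustrative and may differ in your copy): instead of returning None when PESQ fails, return a zero-valued loss that still touches the discriminator graph, so every DDP rank runs the same forward and backward.

```python
import torch.nn.functional as F


def discriminator_loss(discriminator, clean_mag, est_mag, one_labels, pesq_score):
    """pesq_score is the normalized batch PESQ tensor, or None when PESQ failed."""
    # Run the discriminator forward unconditionally, on every rank.
    predict_enhance = discriminator(clean_mag, est_mag.detach())
    predict_max = discriminator(clean_mag, clean_mag)

    if pesq_score is not None:
        loss_metric = F.mse_loss(predict_max.flatten(), one_labels) \
                    + F.mse_loss(predict_enhance.flatten(), pesq_score)
    else:
        # Zero-valued loss that still depends on the discriminator outputs:
        # its gradients are all zero, but backward() and the DDP gradient
        # all-reduce still run on this rank, so no rank is left waiting.
        loss_metric = 0.0 * (predict_max.sum() + predict_enhance.sum())
    return loss_metric
```

With this, the training step can call `loss_metric.backward()` unconditionally; when PESQ is unavailable the gradients are simply zero.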
from cmgan.
Your question is not clear: do you mean that you increased the test batch size from 4 to 8, i.e. `train_ds, test_ds = dataloader.load_data(args.data_dir, args.batch_size, 8, args.cut_len)`?
4 is the maximum we tested; 3 can also reproduce very similar results in case you have a limited GPU.
from cmgan.
First of all, thank you very much for your reply. Secondly, I'm sorry I didn't make my problem clear. My question is:
When reproducing your code, I often encounter a training stall in the initial epochs (I'm using two GPUs on a server for training). The specific issue is that the training process becomes unresponsive while both GPUs' utilization rates stay at 100%. I suspect this might be due to using DistributedDataParallel for multi-process training, as I don't experience this problem when using DataParallel for synchronized training. Unfortunately, I haven't been able to find a solution myself, which is why I've reached out to you for assistance. Are you familiar with this issue, or have you encountered it before?
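A generic way to confirm this suspicion (standard PyTorch/NCCL debugging, not specific to this repo) is to enable distributed debug logging before the process group is initialized, so each rank reports the collective it is blocked on when training hangs:

```python
import os

# Must be set before torch.distributed.init_process_group() is called,
# e.g. at the very top of train.py or in the launch environment.
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL logs communicator setup and errors
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # DDP checks/logs collective consistency across ranks
```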
from cmgan.
Have you reproduced the result now?
from cmgan.
I met the same problem. Training got stuck at epoch 0, step 500/726.
from cmgan.
Thanks a lot, this indeed works.
from cmgan.
Related Issues (20)
- the change of gen_loss during training HOT 1
- RuntimeError
- RuntimeError HOT 2
- About the decreasing of loss HOT 1
- Can not reproduce the results HOT 12
- Inferior results trained from scratch HOT 7
- RuntimeeError HOT 1
- Can not reproduce the results HOT 3
- Training GPU requirements HOT 1
- File "pesq/cypesq.pyx", line 1, in init cypesq ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it)
- File "/anaconda3/envs/cmg/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__ dist._verify_model_across_ranks(self.process_group, parameters) RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc). HOT 2
- How do you resample to 16000? HOT 2
- Question about the time-domain loss calculation
- the training speed confusion
- My server has a 3090, but reports that I don't have a gpu HOT 1
- Test set requirements when training
- epochs HOT 1
- Question about the training sample rate and GPU configuration HOT 1
- Can the model be open sourced?