Comments (6)
This piece of code is what causes training to block.
When loss_metric is None, the process takes the else branch and never runs backward, and I suspect this desynchronizes the GPUs. So it is enough to set loss_metric (and hence its gradients) to 0 whenever it would be None, which is equivalent to the if/else. The function that produces loss_metric can therefore be changed as sketched below, and this avoids the training hang.
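A minimal sketch of this idea, assuming the discriminator-loss helper roughly follows the CMGAN training loop (names such as `discriminator`, `clean_mag`, `est_mag`, `one_labels`, `pesq_score` are illustrative and may differ in your copy): instead of returning None when PESQ fails, return a zero-valued loss that still touches the discriminator graph, so every DDP rank runs the same forward and backward.

```python
import torch.nn.functional as F


def discriminator_loss(discriminator, clean_mag, est_mag, one_labels, pesq_score):
    """pesq_score is the normalized batch PESQ tensor, or None when PESQ failed."""
    # Run the discriminator forward unconditionally, on every rank.
    predict_enhance = discriminator(clean_mag, est_mag.detach())
    predict_max = discriminator(clean_mag, clean_mag)

    if pesq_score is not None:
        loss_metric = F.mse_loss(predict_max.flatten(), one_labels) \
                    + F.mse_loss(predict_enhance.flatten(), pesq_score)
    else:
        # Zero-valued loss that still depends on the discriminator outputs:
        # its gradients are all zero, but backward() and the DDP gradient
        # all-reduce still run on this rank, so no rank is left waiting.
        loss_metric = 0.0 * (predict_max.sum() + predict_enhance.sum())
    return loss_metric
```

With this, the training step can call `loss_metric.backward()` unconditionally; when PESQ is unavailable the gradients are simply zero.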
from cmgan.
Your question is not clear: do you mean that you increased the test batch size from 4 to 8, i.e. `train_ds, test_ds = dataloader.load_data(args.data_dir, args.batch_size, 8, args.cut_len)`?
4 is the maximum we tested; 3 can also reproduce very similar results in case you have a limited GPU.
from cmgan.
First of all, thank you very much for your reply. Secondly, I'm sorry I didn't make my problem clear. My question is:
When reproducing your code, I often encounter a training stall in the initial epochs (I'm using two GPUs on a server for training). The specific issue is that the training process becomes unresponsive while both GPUs' utilization rates stay at 100%. I suspect this might be due to using DistributedDataParallel for multi-process training, as I don't experience this problem when using DataParallel for synchronized training. Unfortunately, I haven't been able to find a solution myself, which is why I've reached out to you for assistance. Are you familiar with this issue, or have you encountered it before?
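A generic way to confirm this suspicion (standard PyTorch/NCCL debugging, not specific to this repo) is to enable distributed debug logging before the process group is initialized, so each rank reports the collective it is blocked on when training hangs:

```python
import os

# Must be set before torch.distributed.init_process_group() is called,
# e.g. at the very top of train.py or in the launch environment.
os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL logs communicator setup and errors
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # DDP checks/logs collective consistency across ranks
```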
from cmgan.
Have you reproduced the result now?
from cmgan.
I met the same problem. Training got stuck at epoch 0, step 500/726.
from cmgan.
Thanks a lot, this indeed works.
from cmgan.
Related Issues (20)
- the change of gen_loss during training HOT 1
- RuntimeError
- RuntimeError HOT 2
- About the decreasing of loss HOT 1
- Can not reproduce the results HOT 12
- Inferior results trained from scratch HOT 7
- RuntimeeError HOT 1
- Can not reproduce the results HOT 3
- Training GPU requirements HOT 1
- File "pesq/cypesq.pyx", line 1, in init cypesq ImportError: numpy.core.multiarray failed to import (auto-generated because you didn't call 'numpy.import_array()' after cimporting numpy; use '<void>numpy._import_array' to disable if you are certain you don't need it)
- File "/anaconda3/envs/cmg/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__ dist._verify_model_across_ranks(self.process_group, parameters) RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3 ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc). HOT 2
- How do you resample to 16000? HOT 2
- Question about the time-domain loss calculation
- the training speed confusion
- My server has a 3090, but reports that I don't have a gpu HOT 1
- Test set requirements when training
- epochs HOT 1
- Question about the training sample rate and GPU configuration HOT 1
- Can the model be open sourced?