Comments (22)

KiAlexander commented on August 23, 2024

Of course, I will try it again. If it goes well, I will also report more results in this issue.

mpariente commented on August 23, 2024

Hey, thanks for reporting the issue.
To be honest, training DPRNN with ks=2 was not easy and we had stability problems as well; that's why we shared the model for it. This implementation yielded the best stability.

Actually, I used only one GPU for training, but the backend is ddp. Which version of lightning do you have? In versions after 0.6.0, they inject the DDP sampler; it speeds up training but the results can change.
The shared model was trained with 0.6.0.

Note that DPRNN with ks=16 is much more stable and the results are always consistent.
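
For reference, here is a minimal sketch of the setup being discussed: the model with the more stable kernel_size=16 (and the chunk_size=100 mentioned later in this thread), plus a Trainer using the ddp backend on a single GPU. This assumes the asteroid.models.DPRNNTasNet interface and the pl 0.6/0.7-era Trainer arguments; it is not the recipe's actual train.py.

# Hedged sketch, not the recipe's train.py. Assumes asteroid.models.DPRNNTasNet
# and the pytorch-lightning 0.6/0.7-era Trainer API.
import pytorch_lightning as pl
from asteroid.models import DPRNNTasNet

# kernel_size=16 is the more stable configuration; ks=2 follows the original
# DPRNN paper but is much harder to train.
model = DPRNNTasNet(n_src=2, kernel_size=16, chunk_size=100)

trainer = pl.Trainer(
    max_epochs=200,              # illustrative value, not the recipe's conf
    gpus=1,                      # a single GPU, as used for the shared model
    distributed_backend="ddp",   # the ddp backend; pl >= 0.7 injects a DistributedSampler
)
# trainer.fit(system)  # in the recipe, `system` wraps the model, loaders and loss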

KiAlexander commented on August 23, 2024

pytorch-lightning==0.7.6

Thank you very much, I got it!

mpariente commented on August 23, 2024

Are you going to try it again?
I'd like to know the result, please, so that we can improve the docs, detect bugs, etc.
Thanks!

KiAlexander commented on August 23, 2024

With pytorch-lightning==0.7.6, when I set kernel_size=16 and chunk_size=100, early stopping was still triggered and the loss became "nan". Part of train.log is as follows.

Epoch 00058: val_loss reached -10.65770 (best -10.65770), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_58.ckpt as top 5
Epoch 60: 100%|██████████| 2681/2681 [23:23<00:00,  1.91it/s, loss=-10.870, v_num=0, val_loss=-10.7]
Epoch 00059: val_loss reached -10.70161 (best -10.70161), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_59.ckpt as top 5
Epoch 61: 100%|██████████| 2681/2681 [23:59<00:00,  1.86it/s, loss=-10.854, v_num=0, val_loss=-10.7]
Epoch 00060: val_loss reached -10.73184 (best -10.73184), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_60.ckpt as top 5
Epoch 62: 100%|██████████| 2681/2681 [23:15<00:00,  1.92it/s, loss=-11.149, v_num=0, val_loss=-10.5]
Epoch 00061: val_loss  was not in top 5                      
Epoch 63: 100%|██████████| 2681/2681 [23:50<00:00,  1.87it/s, loss=-11.011, v_num=0, val_loss=-10.7]
Epoch 00062: val_loss reached -10.67140 (best -10.73184), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_62.ckpt as top 5
Epoch 64: 100%|██████████| 2681/2681 [23:19<00:00,  1.92it/s, loss=-10.840, v_num=0, val_loss=-10.7]
Epoch 00063: val_loss reached -10.73805 (best -10.73805), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_63.ckpt as top 5
Epoch 65: 100%|██████████| 2681/2681 [24:02<00:00,  1.86it/s, loss=-10.701, v_num=0, val_loss=-10.8]
Epoch 00064: val_loss reached -10.75315 (best -10.75315), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_64.ckpt as top 5
Epoch 66: 100%|██████████| 2681/2681 [23:08<00:00,  1.93it/s, loss=-10.869, v_num=0, val_loss=-10.8]
Epoch 00065: val_loss reached -10.81207 (best -10.81207), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_65.ckpt as top 5
Epoch 67: 100%|██████████| 2681/2681 [23:43<00:00,  1.88it/s, loss=-10.770, v_num=0, val_loss=-10.8]
Epoch 00066: val_loss reached -10.81964 (best -10.81964), saving model to exp/train_dprnn_ln076/checkpoints/_ckpt_epoch_66.ckpt as top 5
Epoch 68: 100%|██████████| 2681/2681 [23:22<00:00,  1.91it/s, loss=nan, v_num=0, val_loss=nan]      
Epoch 00067: val_loss  was not in top 5                      
Epoch 69: 100%|██████████| 2681/2681 [24:03<00:00,  1.86it/s, loss=nan, v_num=0, val_loss=nan]
Epoch 00068: val_loss  was not in top 5                      
Epoch 70: 100%|██████████| 2681/2681 [23:10<00:00,  1.93it/s, loss=nan, v_num=0, val_loss=nan]
Epoch 00069: val_loss  was not in top 5                      
Epoch 71: 100%|██████████| 2681/2681 [23:41<00:00,  1.89it/s, loss=nan, v_num=0, val_loss=nan]
Epoch 00070: val_loss  was not in top 5                      
Epoch 72: 100%|██████████| 2681/2681 [23:18<00:00,  1.92it/s, loss=nan, v_num=0, val_loss=nan]
Epoch 00071: val_loss  was not in top 5                      
Epoch 00072: early stopping
Epoch 00072: early stopping
Epoch 72: 100%|██████████| 2681/2681 [23:18<00:00,  1.92it/s, loss=nan, v_num=0, val_loss=nan]
Epoch 00072: early stopping
Epoch 00072: early stopping
Epoch 00072: early stopping
Epoch 00072: early stopping

In this situation, of course, the evaluation result is bad.

Overall metrics :
{'sar': 11.844617098354227,
 'sar_imp': -137.33183151920835,
 'sdr': 11.01090081309095,
 'sdr_imp': 10.859858598785452,
 'si_sdr': 10.73790631004516,
 'si_sdr_imp': 10.73905742010222,
 'sir': 19.819491455700803,
 'sir_imp': 19.66844924139529,
 'stoi': 0.9158627038664738,
 'stoi_imp': 0.17781668059804287}
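
For context on why training halts a few epochs after the loss blows up: once val_loss is NaN it can never register as an improvement, so the checkpoint callback stops saving (the "was not in top 5" lines) and the early-stopping patience eventually runs out. A hedged sketch of the callbacks involved, using the pl 0.7-era API; the patience, paths and clipping value are illustrative assumptions, not the recipe's actual configuration.

# Hedged sketch, pl 0.7-era callback API; values are illustrative, not the recipe's.
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint

checkpoint = ModelCheckpoint(
    filepath="exp/train_dprnn/checkpoints/",
    monitor="val_loss", mode="min", save_top_k=5,   # matches the "as top 5" lines above
)
early_stop = EarlyStopping(monitor="val_loss", mode="min", patience=10)

trainer = pl.Trainer(
    checkpoint_callback=checkpoint,
    early_stop_callback=early_stop,
    gradient_clip_val=5.0,   # assumption: gradient clipping as a guard against blow-ups
    gpus=1,
)
# Once val_loss becomes NaN (epoch 68 onwards in the log above), it never
# "improves" again, so the patience counter exhausts and training stops.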

When I tried to downgrade to pytorch-lightning==0.6.0, something wrong happened, as follows.

/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py:102: UserWarning: 
            You're using multiple gpus and multiple nodes without using a DistributedSampler
            to assign a subset of your data to each process. To silence this warning, pass a
            DistributedSampler to your DataLoader.

            ie: this:
            dataset = myDataset()
            dataloader = Dataloader(dataset)

            becomes:
            dataset = myDataset()
            dist_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
            dataloader = Dataloader(dataset, sampler=dist_sampler)

            If you want each process to load the full dataset, ignore this warning.
            
  warnings.warn(msg)
Traceback (most recent call last):
  File "train.py", line 122, in <module>
    main(arg_dic)
  File "train.py", line 96, in main
    trainer.fit(system)
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 687, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception: 
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in run_pretrain_routine
    self.get_dataloaders(ref_model)
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 202, in get_dataloaders
    self.init_val_dataloader(model)
  File "/home/yjm/anaconda3/envs/pych/lib/python3.6/site-packages/pytorch_lightning/trainer/data_loading.py", line 124, in init_val_dataloader
    if not isinstance(dataloader.sampler, DistributedSampler):
AttributeError: 'list' object has no attribute 'sampler'
Stage 4 : Evaluation
Traceback (most recent call last):
  File "/home/yjm/ss_code/asteroid/egs/wham/DPRNN/model.py", line 68, in load_best_model
    with open(os.path.join(exp_dir, 'best_k_models.json'), "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'exp/train_dprnn_ln060/best_k_models.json'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "eval.py", line 118, in <module>
    main(arg_dic)
  File "eval.py", line 36, in main
    model = load_best_model(conf['train_conf'], conf['exp_dir'])
  File "/home/yjm/ss_code/asteroid/egs/wham/DPRNN/model.py", line 75, in load_best_model
    best_model_path = os.path.join(exp_dir, 'checkpoints', all_ckpt[-1])
IndexError: list index out of range
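
The second traceback is a downstream consequence of the first: training crashed before any checkpoint was saved, so neither best_k_models.json nor anything under exp/.../checkpoints/ exists when eval.py runs. For context, a hedged sketch of the bookkeeping load_best_model expects, assuming the recipe dumps ModelCheckpoint.best_k_models to JSON at the end of training (the function name here is illustrative, not the recipe's actual code):

# Hedged sketch of the post-training bookkeeping that eval.py's load_best_model
# appears to rely on (see the FileNotFoundError above).
import json
import os

def save_best_k(checkpoint, exp_dir):
    # ModelCheckpoint.best_k_models maps checkpoint paths to their monitored values
    best_k = {path: value.item() for path, value in checkpoint.best_k_models.items()}
    with open(os.path.join(exp_dir, "best_k_models.json"), "w") as f:
        json.dump(best_k, f, indent=0)

If training dies before this runs, eval.py has nothing to load, which is exactly the FileNotFoundError/IndexError pair above.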

mpariente commented on August 23, 2024

I'll launch a DPRNN training with kernel_size 16 today and let you know the results.
I'm sorry this doesn't work for you; you're the first one to report this.

mpariente commented on August 23, 2024

What is your torch version by the way?

KiAlexander commented on August 23, 2024

The versions of the torch-related packages are as follows.

torch==1.5.0+cu92
torch-optimizer==0.0.1a12
torch-stoi==0.1.1
torchaudio==0.5.0
torchvision==0.4.2

mpariente commented on August 23, 2024

Ok, thanks!
I won't be able to try with CUDA 9.2, but I doubt this is the problem.
I just started a training with kernel_size 16, pl 0.7.6 and torch 1.5.
I'll let you know how it goes.

mpariente commented on August 23, 2024

This is what I get with my last run. I don't get NaN and I get higher performance than what you had (I don't know why). There was a problem with checkpoint saving, so I'm trying to sort that out, and I'll run the eval code afterwards.
It looks like the train loss is not as good as what I had before, though.

Epoch 56: 100%|_______________________________________________________________________________________| 3561/3561 [29:59<00:00,  1.98it/s, loss=-16.729, v_num=228900, val_loss=-16.4]
Epoch 00055: val_loss reached -16.42357 (best -16.59247), saving model to exp/train_dprnn_dprnn_ks16_pl076_again/checkpoints/epoch=55.ckpt as top 5                                   
Epoch 57: 100%|_______________________________________________________________________________________| 3561/3561 [29:58<00:00,  1.98it/s, loss=-16.687, v_num=228900, val_loss=-16.5]
Epoch 00056: val_loss reached -16.53929 (best -16.59247), saving model to exp/train_dprnn_dprnn_ks16_pl076_again/checkpoints/epoch=56.ckpt as top 5                                   
Epoch 58: 100%|_______________________________________________________________________________________| 3561/3561 [30:00<00:00,  1.98it/s, loss=-16.800, v_num=228900, val_loss=-16.5]
Epoch 00057: val_loss reached -16.52685 (best -16.59247), saving model to exp/train_dprnn_dprnn_ks16_pl076_again/checkpoints/epoch=57.ckpt as top 5                                   
Epoch 59: 100%|_______________________________________________________________________________________| 3561/3561 [29:58<00:00,  1.98it/s, loss=-16.610, v_num=228900, val_loss=-16.5]
Epoch 00058: val_loss reached -16.54635 (best -16.59247), saving model to exp/train_dprnn_dprnn_ks16_pl076_again/checkpoints/epoch=58.ckpt as top 5      

Would you like to try an older version, from when we still used pytorch-lightning 0.6.0?
You can just git checkout v0.2.0 and run the same recipe (check the conf file, I'm not sure it's the same though).
I'll also do that in the next few days to see how it goes.
We stayed on pl 0.6.0 for a while because we had performance problems with the new versions.

KiAlexander commented on August 23, 2024

Thanks for another try. I will follow your advice and try it again later.

mpariente commented on August 23, 2024

I just launched training with asteroid 0.2.0 (and pl 0.6.0), I'll let you know.
By the way, did you have a problem with checkpointing when training the DPRNN?

KiAlexander commented on August 23, 2024

Emm, I ran into problems like issue84 and issue96 before.

mpariente commented on August 23, 2024

You trained with WHAM data, right? I never got a NaN loss with it, I'm sorry.
I trained again with 0.6.0, and early stopping actually doesn't work well there, so it trains much longer and the val loss still decreases (I'm around -18.2 val now).
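
For reference: this run, and the pl 0.7.6 run logged a few comments below under train_dprnn_pl0.7.6_noearlystopping, both keep training well past the point where early stopping had previously kicked in. A hedged sketch of explicitly disabling early stopping with the pl 0.7-era Trainer flag; this is an assumption about how the later run was configured, not a confirmed detail:

# Hedged sketch, pl 0.7-era API: disable early stopping so training runs to max_epochs.
import pytorch_lightning as pl

trainer = pl.Trainer(
    max_epochs=200,             # illustrative value
    gpus=1,
    early_stop_callback=False,  # let val_loss keep improving past the old stopping point
)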

KiAlexander commented on August 23, 2024

As set in stage 1, I used prepare_data.sh to generate the data (wsj0 and WHAM).

mpariente commented on August 23, 2024

I don't know what to say.
I just finished training DPRNN and reached 18.2 dB SI-SDR on the test set with the current version of Asteroid (which I just released as well).

Here is the end of my log:

Epoch 197: 100%|__________| 4022/4022 [16:10<00:00,  4.14it/s, loss=-19.217, v_num=234730, val_loss=-18.5]                                                                             
Epoch 00196: val_loss reached -18.45149 (best -18.46184), saving model to exp/train_dprnn_pl0.7.6_noearlystopping/checkpoints/epoch=196.ckpt as top 5                                  
Epoch 198: 100%|__________| 4022/4022 [16:10<00:00,  4.14it/s, loss=-19.250, v_num=234730, val_loss=-18.5]                                                                             
Epoch 00197: val_loss reached -18.46379 (best -18.46379), saving model to exp/train_dprnn_pl0.7.6_noearlystopping/checkpoints/epoch=197.ckpt as top 5                                  
Epoch 199: 100%|__________| 4022/4022 [16:11<00:00,  4.14it/s, loss=-19.443, v_num=234730, val_loss=-18.5]                                                                             
Epoch 00198: val_loss reached -18.46572 (best -18.46572), saving model to exp/train_dprnn_pl0.7.6_noearlystopping/checkpoints/epoch=198.ckpt as top 5                                  
Epoch 200: 100%|__________| 4022/4022 [16:11<00:00,  4.14it/s, loss=-19.072, v_num=234730, val_loss=-18.5]                                                                             
Epoch 00199: val_loss reached -18.46223 (best -18.46572), saving model to exp/train_dprnn_pl0.7.6_noearlystopping/checkpoints/epoch=199.ckpt as top 5                                  
Epoch 200: 100%|__________| 4022/4022 [16:11<00:00,  4.14it/s, loss=-19.072, v_num=234730, val_loss=-18.5]                                                                             
Stage 4 : Evaluation                                                                                                                                                                   
 90%|________________________________________________________________________________________________________________________________              | 2687/3000 [38:06<03:45,  1.39it/s]
/home/pul51/zaf67/miniconda3/envs/pytorch3.6/lib/python3.8/site-packages/pystoi/stoi.py:66: RuntimeWarning: Not enough STFT frames to compute intermediate intelligibility measure after removing silent frames. Returning 1e-5. Please check you wav files
  warnings.warn('Not enough STFT frames to compute intermediate '
100%|______________________________________________________________________________________________________________________________________________| 3000/3000 [42:43<00:00,  1.17it/s]
Drop 0 utts(0.00 h) from 3000 (shorter than None samples)
Overall metrics :                
{'sar': 19.116352171914485,      
 'sar_imp': -130.06009796503054, 
 'sdr': 18.617789605060587,      
 'sdr_imp': 18.466745426438173,  
 'si_sdr': 18.227683982688003,   
 'si_sdr_imp': 18.22883576588251,
 'sir': 29.22773720052717,    
 'sir_imp': 29.07669302190474,
 'stoi': 0.9722025377865715,
 'stoi_imp': 0.23415680987800583}
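
As a side note on reading these numbers: the training objective in this recipe appears to be permutation-invariant negative SI-SDR (the losses in the logs are negative and track the si_sdr metric), so a val_loss around -18.5 corresponds to roughly 18.5 dB SI-SDR on the validation set, consistent with the 18.2 dB test-set si_sdr above. A hedged sketch of that loss, assuming Asteroid's PITLossWrapper / pairwise_neg_sisdr interface:

# Hedged sketch of a PIT negative SI-SDR training loss (assumed interface).
import torch
from asteroid.losses import PITLossWrapper, pairwise_neg_sisdr

loss_func = PITLossWrapper(pairwise_neg_sisdr, pit_from="pw_mtx")

est_sources = torch.randn(4, 2, 32000)  # (batch, n_src, time) -- dummy tensors
sources = torch.randn(4, 2, 32000)
loss = loss_func(est_sources, sources)  # negative SI-SDR in dB: lower is better,
                                        # so val_loss ~ -18.5 means ~18.5 dB SI-SDR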

Could you maybe try with CUDA 10.1? I know that torch had LSTM problems before 1.5 and maybe they fixed it better with CUDA 10.1, no idea.

mpariente commented on August 23, 2024

@KiAlexander Any news?

KiAlexander commented on August 23, 2024

Sorry, my lab server is under repair. Once the server is back, I'll try with CUDA 10.1.

KiAlexander commented on August 23, 2024

Sorry for replying so late. I tried with CUDA 9.0 because the other server, with CUDA 10.0, is still down.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

However, after git checkout v0.2.0, the final result is OK.

Overall metrics :
{'sar': 19.815491848600306,
 'sar_imp': -129.36095676896227,
 'sdr': 19.344507067310975,
 'sdr_imp': 19.19346485300548,
 'si_sdr': 18.969924434155224,
 'si_sdr_imp': 18.971075544212287,
 'sir': 30.1779299318165,
 'sir_imp': 30.02688771751099,
 'stoi': 0.9755566920880101,
 'stoi_imp': 0.237510668819579}

mpariente commented on August 23, 2024

Thanks for checking it out again.
So the problem would be with CUDA 9.2? That seems really weird!

KiAlexander commented on August 23, 2024

However, the torch build I used still reports CUDA 9.2. The only change I made was git checkout v0.2.0. I am also confused.

$ python
Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21) 
[GCC 7.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.version.cuda)
9.2
>>> 

mpariente commented on August 23, 2024

That seems weird indeed.
I know early stopping changed behaviour between pl versions; maybe that is the source.
Otherwise I don't know.

Can we close it now?
