Comments (7)

jonashaag commented on August 23, 2024

Just wanted to share some training graphs here. Comparison of the following models:

  • DPRNN kernel_size 16
  • DPRNN kernel_size 4
  • ConvTasNet (Asteroid default hyper params)
  • SuDoRM-RF: enc_kernel_size/K_eps = 21; enc_num_basis/C_eps, out_channels/C, in_channels/C_U = 512; upsampling_depth/Q = 5; num_blocks/=25

Each model was trained with 2s and 4s non-silent segments of the training sources.

The dataset is a (so far) proprietary reverberated speech dataset that combines ~1k handpicked room impulse responses with the ~44k speech samples from VCTK. The training set uses ~80% of the rooms and ~80% of the VCTK speakers; the validation set uses ~20% of the rooms and ~20% of the speakers. Each room is combined with each speech sample in its split, for a total of ~28M training samples + ~1.7M validation samples. Each epoch uses a randomly generated subset of the training set, and validation uses a random subset of 5k samples from the validation set.
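The dataset sizes quoted above follow from multiplying rooms by speech samples within each split. A minimal sketch of that arithmetic, assuming the speaker split carries over to the same share of utterances (the comment does not say this explicitly) and using the rounded figures from the text:

```python
# Approximate figures quoted in the comment above.
num_rooms = 1_000       # ~1k room impulse responses
num_speech = 44_000     # ~44k VCTK speech samples
train_frac = 0.8        # share of rooms and of speakers used for training
val_frac = 0.2          # share of rooms and of speakers used for validation
val_subset = 5_000      # random validation subset actually evaluated

# Assumption: an 80/20 speaker split yields an 80/20 split of utterances.
train_pairs = round(num_rooms * train_frac) * round(num_speech * train_frac)
val_pairs = round(num_rooms * val_frac) * round(num_speech * val_frac)
val_subset_share = val_subset / val_pairs

print(f"train: ~{train_pairs / 1e6:.1f}M room/speech pairs")
print(f"val:   ~{val_pairs / 1e6:.2f}M room/speech pairs")
print(f"the 5k validation subset covers ~{val_subset_share:.2%} of the validation pairs")
```

The small share covered by the 5k subset illustrates why the validation curve may be noisy.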

In retrospect, I believe the validation subset is too small, so I consider the training loss more indicative of model performance than the validation loss. Also note that VCTK contains far fewer non-silent 4s segments than 2s ones, which could mean that the 4s dataset has less variety in its speech samples.

DPRNN was trained on a GTX 1080 Ti and all of the other models on an RTX 2080 Ti (which is ~1.6x as fast as the GTX 1080 Ti for this task). The training loss was SI-SDR with a learning rate of 5e-4. Note that none of the models were trained to convergence.
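For readers unfamiliar with the SI-SDR objective mentioned above, here is a minimal pure-Python sketch of the standard definition (project the estimate onto the target, then compare the energy of the projection with the energy of the residual). This is not the code used for these runs, and Asteroid's own loss implementation may differ in details; the mean removal and the `eps` constant are assumptions of this sketch.

```python
import math

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length sample sequences."""
    n = len(target)
    # Remove the mean of each signal (a common convention, assumed here).
    est_mean = sum(estimate) / n
    tgt_mean = sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Optimal scaling of the target: alpha = <est, tgt> / ||tgt||^2.
    alpha = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    scaled = [alpha * t for t in tgt]
    residual = [e - s for e, s in zip(est, scaled)]
    signal_energy = sum(s * s for s in scaled)
    residual_energy = sum(r * r for r in residual)
    return 10 * math.log10((signal_energy + eps) / (residual_energy + eps))

def neg_si_sdr_loss(estimate, target):
    """Training minimizes the negative SI-SDR, so higher SI-SDR means lower loss."""
    return -si_sdr(estimate, target)

# Toy signals: a sine wave and a slightly perturbed copy of it.
target = [math.sin(0.1 * i) for i in range(100)]
estimate = [t + 0.01 * math.cos(0.37 * i) for i, t in enumerate(target)]
print(f"SI-SDR: {si_sdr(estimate, target):.1f} dB")
```

Rescaling the estimate by any non-zero constant leaves the value essentially unchanged, which is the "scale-invariant" part of the name.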

Train loss (x-axis is hours; note that DPRNN was trained on a slower GPU)

[Figure: train loss]

Val loss (x-axis is hours; note that DPRNN was trained on a slower GPU)

[Figure: val loss]
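Because the x-axis is wall-clock hours and DPRNN ran on the slower card, one rough way to compare the curves is to rescale the DPRNN hours by the ~1.6x factor quoted above. A small sketch, where the helper name is hypothetical and the factor is only the approximate figure from this comment:

```python
# Approximate speedup of the RTX 2080 Ti over the GTX 1080 Ti, as quoted above.
SPEEDUP_2080TI_OVER_1080TI = 1.6

def to_2080ti_hours(hours_on_1080ti: float) -> float:
    """Estimate how long a GTX 1080 Ti run would have taken on an RTX 2080 Ti."""
    return hours_on_1080ti / SPEEDUP_2080TI_OVER_1080TI

# Example: 16 hours on the GTX 1080 Ti correspond to roughly 10 hours on the RTX 2080 Ti.
equivalent_hours = to_2080ti_hours(16.0)
print(f"{equivalent_hours:.1f} h")
```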

from asteroid.

mpariente commented on August 23, 2024

Good question.
I just finished training DPRNN with ks=16 on a single P100 (on WHAM!); the 200 epochs took 2 days and 5 hours.
This is by far the best compromise between performance and training duration.
ConvTasNet and DPRNN (with ks=2) might take 5-6 days on a single GPU.

jonashaag commented on August 23, 2024

Thanks a lot!

So if I'm correct, that would be ~73h of data trained for 200 epochs in ~53h, i.e. ~3.5ms of training time per second of training data.

That is very little compared to the 0.5-1s I have seen elsewhere.
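The estimate can be reproduced directly from the numbers in this thread. A quick sketch, taking the ~73h dataset size as given in the comment and 53h as the 2 days and 5 hours reported earlier in the thread:

```python
dataset_hours = 73     # ~73h of training data, as estimated in the comment
training_hours = 53    # 2 days and 5 hours of wall-clock training
epochs = 200

# Total audio seen by the model across all epochs, in seconds.
audio_seconds_processed = dataset_hours * 3600 * epochs
training_seconds = training_hours * 3600

ms_per_audio_second = 1000 * training_seconds / audio_seconds_processed
print(f"{ms_per_audio_second:.2f} ms of training time per second of audio")
```

This comes out at roughly 3.6ms, in line with the ~3.5ms figure above.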

mpariente commented on August 23, 2024

Yes, we are super fast 🚀
No, I'm joking. I don't know, where have you seen these numbers and with which models?

mpariente commented on August 23, 2024

It seems I answered the question, so I'm closing this.
Feel free to reopen.

mpariente commented on August 23, 2024

Thanks a lot for the insights!
Do you double the batch size when you train on 2s segments?
I'm half surprised about DPRNN ks=4, because dereverberation might need more context. Also, what is your batch size with it? We struggled to obtain convergence with DPRNN ks=2 for separation.

jonashaag commented on August 23, 2024

Hyper params for DPRNN:

ks  4 seg 2 batch  5 chunk 200
ks  4 seg 4 batch  2 chunk 200
ks 16 seg 2 batch 21 chunk 100
ks 16 seg 4 batch 11 chunk 100
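To relate the table to the question about doubling the batch size for 2s segments, it helps to look at the seconds of audio per batch (segment length times batch size). A small sketch using only the values listed above; the variable names are illustrative:

```python
# (kernel_size, segment_seconds, batch_size, chunk_size), copied from the table above.
configs = [
    (4, 2, 5, 200),
    (4, 4, 2, 200),
    (16, 2, 21, 100),
    (16, 4, 11, 100),
]

# Seconds of audio per batch for each (kernel_size, segment_seconds) setting.
audio_per_batch = {(ks, seg): seg * batch for ks, seg, batch, _ in configs}

for (ks, seg), seconds in audio_per_batch.items():
    print(f"ks={ks:2d} seg={seg}s -> {seconds}s of audio per batch")
```

For each kernel size, the 2s and 4s settings end up with a similar amount of audio per batch, so the batch size is roughly doubled for the shorter segments.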

I always use the largest possible batch size (found using trial and error).
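The trial-and-error search can be automated. The sketch below uses a binary search, which is a swapped-in technique and not what was actually done here. It assumes that if a batch size fits in memory then every smaller one fits too. The `fits` callback is hypothetical: in practice it would run one training step and report whether it succeeded without running out of memory.

```python
def largest_fitting_batch_size(fits, upper_bound=1024):
    """Return the largest batch size in [1, upper_bound] for which fits(bs) is True.

    Returns 0 if not even a batch size of 1 fits. Assumes monotonicity:
    if a batch size fits, every smaller batch size fits as well.
    """
    lo, hi = 0, upper_bound  # lo is known to fit (or is 0); nothing above hi fits
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Stand-in for a real trial step: pretend memory runs out above 21 samples.
def fake_trial(batch_size):
    return batch_size <= 21

best = largest_fitting_batch_size(fake_trial)
print(best)
```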

Well, as I said, none of those were trained to convergence, and I also don't have an intuition for how far each of them is from convergence. It may also help that I used an LR of 5e-4 instead of 1e-3, which I have found to be a bit too aggressive for Adam with other models (unrelated to speech separation).
