Hi, I'm training your model from scratch on 60 voices, each with 3-15 minutes of data.

Required amount of data and iterations to train the model

nvidia commented on August 17, 2024

from radtts.

Comments (5)

szprytny commented on August 17, 2024

Hi @Alexey322,
I trained from scratch for Polish on a dataset of about 14 hours in total. About 9 hours of that is one speaker, and the other speakers' durations vary a lot.

Comparing your tensorboard to mine, I see a higher loss_ctc on your side (about 1.8 vs. 1.3 for me)
and higher binarization_loss values (above 0.4, while for me it was between 0.25 and 0.35).

My train/mel_loss was heading toward -2.0 and reached it around step 200k; at step 60k it was around -1.7.
My val/mel_loss reached its lowest value, -1.52, near step 30k, and by step 200k it was at -0.75.
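
The validation numbers above suggest choosing a checkpoint by val/mel_loss rather than by step count alone. Below is a minimal sketch, assuming the scalars have already been exported from Tensorboard into a plain dict. The function name `best_validation_step` and the dict layout are hypothetical, and only the two validation points reported in this comment are used.

```python
def best_validation_step(val_loss_by_step):
    """Return (step, loss) for the lowest validation loss.

    val_loss_by_step maps a training step to its val/mel_loss value.
    """
    if not val_loss_by_step:
        raise ValueError("no validation points given")
    # Pick the step whose recorded loss is smallest.
    step = min(val_loss_by_step, key=val_loss_by_step.get)
    return step, val_loss_by_step[step]


# The only two val/mel_loss points reported in this comment.
reported_val = {30_000: -1.52, 200_000: -0.75}

step, loss = best_validation_step(reported_val)
print(f"lowest val/mel_loss {loss} at step {step}")
```

With a full export of the validation curve, the same call would point at the checkpoint to compare against later, possibly overfitted ones.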

Alexey322 commented on August 17, 2024

Thank you for sharing the results, @szprytny. Why did you try to overfit the model, and what synthesis results did you get before and after overfitting?

szprytny commented on August 17, 2024

I can't comment on synthesis from a non-overfitted model, because I used that 600k checkpoint to train the second step of the RADTTS++ model.
I can only say that some of the speakers are quite biased compared to the training samples, but for most of them you can still recognize who is who :D

What is important is that pronunciation is very good. There is no problem understanding the spoken sentences, even very long ones and "tongue twisters",
e.g. w gąszczu.zip

The Tensorboard screenshot is from step 1: training the decoder with config_ljs_decoder.json.
In the 2nd step I used config_ljs_dap.json to get the model for synthesis.

unilight commented on August 17, 2024

Hi @szprytny, thank you for the insights! In your experience, what would be a sufficient number of training steps? It's not described in the original paper, and I am still doing initial experiments with LJSpeech. The config (https://github.com/NVIDIA/radtts/blob/main/configs/config_ljs_decoder.json) sets the total number of epochs to 10,000,000, which seems like far too many.

szprytny commented on August 17, 2024

That probably depends very much on the dataset, but I can tell you that the model produces intelligible utterances pretty quickly for me, at about 30k steps with 8 samples per batch.
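
To relate a step count like this to passes over the data, a rough conversion can help. This is only a sketch: `steps_to_epochs` is a hypothetical helper, it assumes a single GPU so that the batch size is the total batch size, and the utterance count of 13,100 is the size of the full LJSpeech corpus, used purely as an illustration. The utterance count of the Polish dataset is not given in this thread, and a real training split would be smaller.

```python
def steps_to_epochs(steps, batch_size, num_utterances):
    """Approximate number of passes over the data after `steps` updates."""
    if batch_size <= 0 or num_utterances <= 0:
        raise ValueError("batch_size and num_utterances must be positive")
    # Each step consumes one batch, so samples seen = steps * batch_size.
    return steps * batch_size / num_utterances


# Illustrative values: 30k steps and batch size 8 come from the reply above;
# 13,100 utterances is an assumption (full LJSpeech), not the Polish dataset.
epochs = steps_to_epochs(30_000, batch_size=8, num_utterances=13_100)
print(f"{epochs:.1f} epochs")
```

Under these assumed numbers, 30k steps corresponds to roughly 18 passes over the data, far below the 10,000,000 epochs set in the config.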

I don't train the model with pitch and energy conditioning anymore. I noticed that for my multispeaker data the results are much worse than with the basic RADTTS model.
