tugstugi / pytorch-dc-tts


Text to Speech with PyTorch (English and Mongolian)

License: MIT License

Python 0.75% Jupyter Notebook 99.25%

convolutional-neural-networks deep-learning mongolian python pytorch speech-synthesis text-to-speech tts

pytorch-dc-tts's Issues

Licence

Hi! We're looking at using this for research purposes but for that we need a proper licence on the code. Any chance you could add a licence file to this repository?

I have successfully trained a TTS model for Indonesian. I have attached my results alongside the original utterances for comparison. The model was trained on 23,400 audio clips (about 17 hours) using the modified hyperparameters attached below.

The result is impressive and very natural, thanks to your great implementation. However, I am still not satisfied with the inference time, and I believe it can be faster. Using a GPU or half precision doesn't help.
Can you give me some hints on how to speed this up? Thank you.

Appendices

Hyperparameters

"""Hyper parameters."""
__author__ = 'Erdene-Ochir Tuguldur'


class HParams:
    """Hyper parameters"""

    disable_progress_bar = False  # set True if you don't want the progress bar in the console

    logdir = "logdir"  # log dir where the checkpoints and tensorboard files are saved

    # audio.py options, these values are from https://github.com/Kyubyong/dc_tts/blob/master/hyperparams.py
    reduction_rate = 4  # melspectrogram reduction rate, don't change because SSRN is using this rate
    n_fft = 2048 # fft points (samples)
    n_mels = 80  # Number of Mel banks to generate
    power = 1.5  # Exponent for amplifying the predicted magnitude
    n_iter = 50  # Number of inversion iterations
    preemphasis = .97
    max_db = 100
    ref_db = 20
    sr = 16000  # Sampling rate
    frame_shift = 0.05  # seconds
    frame_length = 0.75  # seconds
    hop_length = int(sr * frame_shift)  # samples. =800.
    win_length = int(sr * frame_length)  # samples. =12000.
    max_N = 180  # Maximum number of characters.
    max_T = 210  # Maximum number of mel frames.

    e = 128  # embedding dimension
    d = 256  # Text2Mel hidden unit dimension
    c = 512+128  # SSRN hidden unit dimension

    dropout_rate = 0.05  # dropout

    # Text2Mel network options
    text2mel_lr = 0.005  # learning rate
    text2mel_max_iteration = 300000  # max train step
    text2mel_weight_init = 'none'  # 'kaiming', 'xavier' or 'none'
    text2mel_normalization = 'layer'  # 'layer', 'weight' or 'none'
    text2mel_basic_block = 'gated_conv'  # 'highway', 'gated_conv' or 'residual'
    text2mel_batchsize = 64

    # SSRN network options
    ssrn_lr = 0.0005  # learning rate
    ssrn_max_iteration = 150000  # max train step
    ssrn_weight_init = 'kaiming'  # 'kaiming', 'xavier' or 'none'
    ssrn_normalization = 'weight'  # 'layer', 'weight' or 'none'
    ssrn_basic_block = 'residual'  # 'highway', 'gated_conv' or 'residual'
    ssrn_batchsize = 24
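
As a quick check of the derived values, the sketch below recomputes the hop and window sizes from the sampling rate and frame durations posted above. The variable names mirror the HParams fields; nothing else is assumed.

```python
# Recompute the derived STFT sizes from the values posted above.
sr = 16000           # sampling rate in Hz
frame_shift = 0.05   # seconds between successive frames
frame_length = 0.75  # seconds covered by one analysis window
n_fft = 2048         # FFT size in samples

hop_length = int(sr * frame_shift)   # samples per hop
win_length = int(sr * frame_length)  # samples per window

print(hop_length, win_length)
print("window longer than n_fft:", win_length > n_fft)
```

With these settings the window is longer than n_fft. Many STFT implementations expect win_length to be at most n_fft, so this combination may be worth double-checking.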

Sample audio

result.zip

Inference Time

Character Count	Average Duration (seconds)	CPU Utilization (%)
15 < c ≤ 20	9	55.1
20 < c ≤ 25	9	38.1
25 < c ≤ 30	12	70.9
30 < c ≤ 35	12	71.9
35 < c ≤ 40	12	72.7
40 < c ≤ 45	12	72.7
45 < c ≤ 50	12	72.2
50 < c ≤ 55	15	72.4
55 < c ≤ 60	15	71.6
60 < c ≤ 65	15	71.4
65 < c ≤ 70	18	71.3
70 < c ≤ 75	21	70.7
75 < c ≤ 80	27	90.2
80 < c ≤ 85	18	70.6
85 < c ≤ 90	24	70
90 < c ≤ 95	24	70.1
95 < c ≤ 100	24	70.2
100 < c ≤ 105	24	69.9
105 < c ≤ 110	24	69.1
110 < c ≤ 115	27	69.2
115 < c ≤ 120	33	69.3
120 < c ≤ 125	33	77.4
125 < c ≤ 130	42	81.7
130 < c ≤ 135	48	81.2
135 < c ≤ 140	48	80.7
140 < c ≤ 145	63	84.1
145 < c ≤ 150	63	84
150 < c ≤ 155	81	82.7
155 < c ≤ 160	75	82.9
160 < c ≤ 165	72	81.6
165 < c ≤ 170	81	82.3
170 < c ≤ 175	87	83.1
175 < c ≤ 180	87	82.7
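
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of one way to group timings into 5-character buckets. It is not how the table above was produced; `bucket_for`, `average_time_per_bucket` and the `synthesize` callable are hypothetical names, and `synthesize` stands in for whatever inference function you use.

```python
import time
from collections import defaultdict


def bucket_for(num_chars, width=5):
    """Return the (low, high] bucket that contains num_chars."""
    high = -(-num_chars // width) * width  # round up to a multiple of width
    return high - width, high


def average_time_per_bucket(sentences, synthesize):
    """Time synthesize(text) for each sentence and average per bucket."""
    durations = defaultdict(list)
    for text in sentences:
        start = time.perf_counter()
        synthesize(text)
        elapsed = time.perf_counter() - start
        durations[bucket_for(len(text))].append(elapsed)
    return {b: sum(v) / len(v) for b, v in sorted(durations.items())}


# Stand-in for the real model call; replace with your own inference function.
def fake_synthesize(text):
    return text.upper()


results = average_time_per_bucket(["a" * 18, "b" * 20, "c" * 42], fake_synthesize)
for (low, high), avg in results.items():
    print(f"{low} < c <= {high}: {avg:.6f} s")
```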

Custom dataset

I am not sure how to run the code with a new dataset. What should the dataset look like so that the model can be trained on it?
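
The repository's own loaders are not shown in this thread, so the following is only a sketch under one assumption: that the new dataset mirrors the public LJSpeech layout, meaning a `wavs/` folder plus a pipe-separated `metadata.csv` whose first field is the clip ID and whose last field is the normalized text. `read_metadata` is a hypothetical helper, not a function from the repo.

```python
import os


def read_metadata(dataset_dir, metadata_name="metadata.csv"):
    """Parse an LJSpeech-style metadata file into (wav_path, text) pairs."""
    pairs = []
    with open(os.path.join(dataset_dir, metadata_name), encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("|")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            clip_id, text = fields[0], fields[-1]  # last field: normalized text
            wav_path = os.path.join(dataset_dir, "wavs", clip_id + ".wav")
            pairs.append((wav_path, text))
    return pairs
```

Each returned pair can then be fed to whatever preprocessing produces the mel and magnitude spectrograms for training.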

colab demo RuntimeError

I tried to run this.
The last step of the "prepare model" segment gives me this error:
RuntimeError: storage has wrong size: expected 4541342186092732661 got 512

Transfer Learning

Hello,
thanks for the great repository.

I want to do transfer learning on a German dataset.

Unfortunately, I'm new to PyTorch and would therefore appreciate tips on what I need to change and where in the code.
Thanks in advance!

How to train a Tibetan-TTS model

First of all, thanks for this excellent project and for your patience. I have already finished training the Mongolian TTS model, and it works really well. It's cool!

What should I do if I want to train a Tibetan TTS model based on this project? Can you give me some suggestions?

Sorry for the question, but I'm a novice. Thanks in advance!

Custom dataset

Will I have to write a data loader for the custom dataset, like the ones written for the English and Mongolian datasets?

Solved issue

Empty speech output (some sentences synthesize and others don't)

Hello!

I have seen a similar problem among the issues with the WaveGlow model, but I am running the original code, on a Romanian dataset with a sampling rate of 16000. Tried both the gated_conv and highway options for the text2mel_basic_block.

After training the model for 300K steps, some sentences are synthesized correctly, while others are completely blank: the model generates a few seconds of silent audio. I have copied some of the attention, mel and mag files for sentences that didn't synthesize, along with some examples that were synthesized correctly. (https://drive.google.com/drive/folders/1Mq6_xub9urzJxXAXI_HVaws0sZZqrg8X?usp=sharing)

Do you have any ideas why that might be happening?

(I have added a multi-speaker option to your code that I can also share, it seems to be learning, but the same thing happens, some test sentences are synthesized, while others not, and it seems to be quite random.)

Thanks in advance!

Decoding is very slow. How can it be parallelized?

how to fix batch processing bug.

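Hello, I tried to generate Y and A in parallel, but the decoded result is wrong.

Have you solved this problem yet?

Could you post an update with the solution? Thanks.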

transfer learning ljspeech

First of all thank you for the repository, the result is really awesome.
Is it possible to apply transfer learning from the pretrained models (ljspeech) to my own recordings?

custom dataset

Hello!

Do the wav files in the dataset have to end at a sentence boundary? Can a file contain 2 sentences, or can long sentences be split into two or three parts? Would that have any effect on training?

Thank you

Text2Mel input to WaveGlow outputs noisy audio file without any speech

Hey!

I've trained the text2mel part for mel generation for a couple hundred epochs.
The model seems to be learning, and it gives reasonably good results on a dataset in a different language when its output is fed to SSRN (without fine-tuning the SSRN part).
I'm trying to feed the text2mel output to a trained WaveGlow model, but it outputs only low-frequency noise, without any speech.

Any advice on how to post-process the generated mels before feeding them to WaveGlow?
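
One possible cause is a mismatch in mel scaling. The sketch below rests on two assumptions that should be verified against your own preprocessing: that the mels were normalized the way the dc_tts preprocessing does it (dB scale, shifted by ref_db and max_db, clipped to [0, 1]), and that the vocoder expects natural-log amplitudes clamped at 1e-5, as in NVIDIA's reference WaveGlow code. The function names are illustrative only.

```python
import math

REF_DB = 20   # ref_db from the hyperparameters
MAX_DB = 100  # max_db from the hyperparameters


def normalize(amplitude):
    """Assumed dc_tts-style scaling: amplitude -> dB -> [0, 1]."""
    db = 20 * math.log10(max(1e-5, amplitude))
    return min(max((db - REF_DB + MAX_DB) / MAX_DB, 1e-8), 1.0)


def denormalize(value):
    """Invert normalize(): [0, 1] -> dB -> linear amplitude."""
    db = min(max(value, 0.0), 1.0) * MAX_DB - MAX_DB + REF_DB
    return 10 ** (db / 20)


def to_natural_log(amplitude, floor=1e-5):
    """Assumed vocoder-side scaling: natural log with a small floor."""
    return math.log(max(amplitude, floor))


def convert_frame(normalized_frame):
    """Convert one frame of normalized mel values to natural-log values."""
    return [to_natural_log(denormalize(v)) for v in normalized_frame]


print(convert_frame([normalize(0.5), normalize(0.01)]))
```

Scaling is only part of the picture: the sampling rate, FFT size, hop length and mel filterbank also have to match what the vocoder was trained on.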

Problem about the checkpoints

@tugstugi After training the model you provided, I have some checkpoints in .pth format. However, when I downloaded these .pth files, I noticed that they look like zip files. In your Colab notebook, you shared a link to similar .pth files, but when I download your .pth files, they are different from mine. Because of that, I cannot use my .pth files to synthesize speech. How can I solve this issue? Is anyone else facing the same problem?
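
Since PyTorch 1.6, torch.save writes a zip-based archive by default, and older PyTorch versions cannot read that format, which would be consistent with the difference described here. The sketch below uses only the standard library to tell the two formats apart; `checkpoint_format` is a hypothetical helper name.

```python
import zipfile


def checkpoint_format(path):
    """Return 'zip' for the zip-based torch.save format, else 'legacy'."""
    return "zip" if zipfile.is_zipfile(path) else "legacy"
```

If the synthesis side has to stay on an older PyTorch, one option is to re-save the checkpoint from the newer install with `torch.save(obj, path, _use_new_zipfile_serialization=False)`. Otherwise, using the same PyTorch version for training and synthesis avoids the mismatch.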

how to use

Comet message errors when running train-text2mel.py

Hello,

When running:

python train-text2mel.py --dataset=ljspeech

I am receiving this error:

Traceback (most recent call last):
  File "train-text2mel.py", line 19, in <module>
    from logger import Logger
  File "/media/hdd3tb/pytorch-dc-tts/logger.py", line 7, in <module>
    from comet_ml import Experiment
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/__init__.py", line 72, in <module>
    from .pytorch_logger import patch as pytorch_patch
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/pytorch_logger.py", line 72, in <module>
    check_module("torch", PYTORCH_ALREADY_IMPORTED_MSG)
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/_logging.py", line 249, in check_module
    raise SyntaxError(error_msg)
SyntaxError: Please import comet before importing any torch modules

bible_book_file_path

https://s3.us-east-2.amazonaws.com/bible.davarpartners.com/Mongolian
I found that this download path no longer exists: the URL does not exist or is no longer available,
so I can't use it to download the bible_book_file.
Thanks.

how to train language require tokenization

@tugstugi thanks for this repo and the sharing.

I am trying to train on Japanese, but the quality was not good. Japanese text needs to be tokenized first (for example ありがとうございます), as there is no white space between the words.

I copied mb_speech.py into a new file, where I tokenize the words, and then updated the vocab string in the speech.py model with all possible characters in the corpus. However, after 12 hours of training, it couldn't produce results in Japanese. Are there any extra steps I should follow? Thanks.
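
To make the vocab step concrete, here is a minimal sketch of building a character-level vocab string from a corpus and mapping text to indices. It assumes the dc_tts convention of a vocab string that starts with a padding symbol and an end-of-sentence symbol ('P' and 'E' here); `build_vocab` and `text_to_indices` are hypothetical helpers, not functions from the repo.

```python
def build_vocab(corpus, padding="P", eos="E"):
    """Return a vocab string: padding, EOS, then every character in the corpus."""
    # Literal 'P'/'E' characters are dropped so they cannot clash with the markers.
    chars = sorted(set("".join(corpus)) - {padding, eos})
    return padding + eos + "".join(chars)


def text_to_indices(text, vocab, eos="E"):
    """Map each character to its vocab index and append the EOS index."""
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    # A character missing from the vocab raises KeyError, which flags
    # text that was not covered when the vocab was built.
    return [char2idx[ch] for ch in text] + [char2idx[eos]]


corpus = ["ありがとう", "ございます"]
vocab = build_vocab(corpus)
print(len(vocab), text_to_indices("あり", vocab))
```

A corpus that includes kanji produces a very large character set, so converting the text to kana or romaji before building the vocab is a common alternative to consider.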

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got CUDAIntTensor instead

Hello!

I am trying to train the Mongolian TTS, and when I run

python train-text2mel.py --dataset=mbspeech

I receive this error:

use_gpu True
epoch 0 with lr=1.25e-06
0%| | 0/3456 [00:00<?, ?audios/s]
Traceback (most recent call last):
  File "train-text2mel.py", line 159, in <module>
    train_epoch_loss = train(epoch, phase='train')
  File "train-text2mel.py", line 76, in train
    Y_logit, Y, A = text2mel(L, S)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\text2mel.py", line 165, in forward
    K, V = self.text_enc(L)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\text2mel.py", line 78, in forward
    out = self.embedding(x)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\layers.py", line 98, in forward
    return self.embedding(x)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\sparse.py", line 108, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\functional.py", line 1076, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got CUDAIntTensor instead (while checking arguments for embedding)

I think I did something wrong. Can you help me? Thanks!
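
The message says the embedding layer received 32-bit integer indices where 64-bit (Long) indices were expected. A likely cause, stated here as an assumption, is that the text indices come from a NumPy array whose default integer type is 32-bit on Windows with older NumPy versions. A minimal sketch of the cast follows; the layer sizes and index values are made up for illustration.

```python
import torch

embedding = torch.nn.Embedding(num_embeddings=32, embedding_dim=4)

# Indices as they might arrive from a NumPy int32 array.
indices = torch.tensor([[1, 5, 7]], dtype=torch.int32)

# Cast to int64 (LongTensor) before the embedding lookup.
out = embedding(indices.long())
print(out.shape)  # torch.Size([1, 3, 4])
```

Where exactly to apply the cast depends on the code you are running. Going by the traceback, converting `L` with `.long()` before the `text2mel(L, S)` call would be one place to try.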

Transfer learning from already trained models

I am having some difficulties in loading already trained models. Could you be more specific how I can load already trained models? What are the changes that are to be made in the code?
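
As a starting point, here is a sketch of partial weight loading in PyTorch: tensors are copied only where the name and shape match, so layers that depend on the vocabulary size (such as the character embedding) keep their fresh initialization. `TinyModel` and `load_matching_weights` are hypothetical and stand in for the real networks. How this repository lays out its checkpoint files is not shown in this thread, so inspect the object returned by `torch.load(path, map_location="cpu")` first to find the actual state dict.

```python
import torch
from torch import nn


def load_matching_weights(model, pretrained_state):
    """Copy pretrained tensors whose name and shape match; return skipped names."""
    own_state = model.state_dict()
    matched = {k: v for k, v in pretrained_state.items()
               if k in own_state and v.shape == own_state[k].shape}
    skipped = [k for k in pretrained_state if k not in matched]
    # strict=False lets the unmatched parameters keep their current values.
    model.load_state_dict(matched, strict=False)
    return skipped


class TinyModel(nn.Module):
    """Toy stand-in: an embedding that depends on vocab size plus one layer."""

    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 8)
        self.proj = nn.Linear(8, 8)


pretrained = TinyModel(vocab_size=32)  # e.g. trained on the source language
target = TinyModel(vocab_size=50)      # new language with a larger vocab
skipped = load_matching_weights(target, pretrained.state_dict())
print(skipped)  # ['embedding.weight']
```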

h params module not found and 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)

I am getting the following error on Windows 10 in a Jupyter notebook using englishtts.py. Any help on how to get this running?

Traceback (most recent call last):
  File "englishtts.py", line 107, in <module>
    _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\text2mel.py", line 165, in forward
    K, V = self.text_enc(L)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\text2mel.py", line 78, in forward
    out = self.embedding(x)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\layers.py", line 98, in forward
    return self.embedding(x)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)

Empty audios

Hi, I am doing custom training for Spanish with a dataset I created from YouTube using youtube-tts-data-generator.
I changed the vocabulary and trained the text2mel and ssrn models for 20k and 5k iterations, respectively.
The text2mel loss doesn't seem to be improving anymore.
When I synthesize some phrases, all of them are blank and the audio files have no data in them.
What could I be doing wrong?
I only need to generate audio with one voice in this case.
I used prepro.py from the dc-tts repo to generate the mels and mags, as the preprocessing part of this repo is still marked as a to-do.
Could that be the problem?
Thanks!