tugstugi / pytorch-dc-tts
Text to Speech with PyTorch (English and Mongolian)
License: MIT License
Hi! We're looking at using this for research purposes but for that we need a proper licence on the code. Any chance you could add a licence file to this repository?
I have successfully trained a TTS model for Indonesian. I have also attached my results alongside the original utterances. The model was trained on 23,400 audio clips (about 17 hours) using the modified parameters attached below.
The results are impressive and very natural, thanks to your great implementation. However, I am still not satisfied with the time needed for inference, and I believe it can be faster. Neither a GPU nor half precision helps.
Can you give me hints on how to speed this up? Thank you.
"""Hyper parameters."""
__author__ = 'Erdene-Ochir Tuguldur'
class HParams:
    """Hyper parameters"""
    disable_progress_bar = False  # set True if you don't want the progress bar in the console
    logdir = "logdir"  # log dir where the checkpoints and tensorboard files are saved

    # audio.py options, these values are from https://github.com/Kyubyong/dc_tts/blob/master/hyperparams.py
    reduction_rate = 4  # melspectrogram reduction rate, don't change because SSRN is using this rate
    n_fft = 2048  # fft points (samples)
    n_mels = 80  # Number of Mel banks to generate
    power = 1.5  # Exponent for amplifying the predicted magnitude
    n_iter = 50  # Number of inversion iterations
    preemphasis = .97
    max_db = 100
    ref_db = 20
    sr = 16000  # Sampling rate
    frame_shift = 0.05  # seconds
    frame_length = 0.75  # seconds
    hop_length = int(sr * frame_shift)  # samples. =800 with the values above
    win_length = int(sr * frame_length)  # samples. =12000 with the values above
    max_N = 180  # Maximum number of characters.
    max_T = 210  # Maximum number of mel frames.
    e = 128  # embedding dimension
    d = 256  # Text2Mel hidden unit dimension
    c = 512+128  # SSRN hidden unit dimension
    dropout_rate = 0.05  # dropout

    # Text2Mel network options
    text2mel_lr = 0.005  # learning rate
    text2mel_max_iteration = 300000  # max train step
    text2mel_weight_init = 'none'  # 'kaiming', 'xavier' or 'none'
    text2mel_normalization = 'layer'  # 'layer', 'weight' or 'none'
    text2mel_basic_block = 'gated_conv'  # 'highway', 'gated_conv' or 'residual'
    text2mel_batchsize = 64

    # SSRN network options
    ssrn_lr = 0.0005  # learning rate
    ssrn_max_iteration = 150000  # max train step
    ssrn_weight_init = 'kaiming'  # 'kaiming', 'xavier' or 'none'
    ssrn_normalization = 'weight'  # 'layer', 'weight' or 'none'
    ssrn_basic_block = 'residual'  # 'highway', 'gated_conv' or 'residual'
    ssrn_batchsize = 24
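The `hop_length` and `win_length` values follow directly from `sr`, `frame_shift` and `frame_length`, so they are easy to check by hand. The sketch below recomputes them for the values posted above; `frame_params` is a hypothetical helper, not part of the repository. It also flags a window longer than `n_fft`, on the assumption (which holds for common STFT implementations, where the window is zero-padded up to `n_fft`) that `win_length` should not exceed `n_fft`.

```python
def frame_params(sr, frame_shift, frame_length):
    """Return (hop_length, win_length) in samples, truncating with int() as HParams does."""
    hop_length = int(sr * frame_shift)
    win_length = int(sr * frame_length)
    return hop_length, win_length


# Values taken from the posted hyperparameters.
sr, frame_shift, frame_length, n_fft = 16000, 0.05, 0.75, 2048
hop_length, win_length = frame_params(sr, frame_shift, frame_length)

# Assumption: an STFT window is padded up to n_fft, so it should not be longer.
window_fits = win_length <= n_fft

print(hop_length, win_length, window_fits)  # 800 12000 False
```

With these values the hop is 50 ms and the window is 750 ms, while a 2048-point FFT at 16 kHz spans only 128 ms, so the frame settings may be worth checking against the upstream defaults.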
Character Count | Average Duration (seconds) | CPU Utilization (%) |
---|---|---|
15 < c ≤ 20 | 9 | 55.1 |
20 < c ≤ 25 | 9 | 38.1 |
25 < c ≤ 30 | 12 | 70.9 |
30 < c ≤ 35 | 12 | 71.9 |
35 < c ≤ 40 | 12 | 72.7 |
40 < c ≤ 45 | 12 | 72.7 |
45 < c ≤ 50 | 12 | 72.2 |
50 < c ≤ 55 | 15 | 72.4 |
55 < c ≤ 60 | 15 | 71.6 |
60 < c ≤ 65 | 15 | 71.4 |
65 < c ≤ 70 | 18 | 71.3 |
70 < c ≤ 75 | 21 | 70.7 |
75 < c ≤ 80 | 27 | 90.2 |
80 < c ≤ 85 | 18 | 70.6 |
85 < c ≤ 90 | 24 | 70 |
90 < c ≤ 95 | 24 | 70.1 |
95 < c ≤ 100 | 24 | 70.2 |
100 < c ≤ 105 | 24 | 69.9 |
105 < c ≤ 110 | 24 | 69.1 |
110 < c ≤ 115 | 27 | 69.2 |
115 < c ≤ 120 | 33 | 69.3 |
120 < c ≤ 125 | 33 | 77.4 |
125 < c ≤ 130 | 42 | 81.7 |
130 < c ≤ 135 | 48 | 81.2 |
135 < c ≤ 140 | 48 | 80.7 |
140 < c ≤ 145 | 63 | 84.1 |
145 < c ≤ 150 | 63 | 84 |
150 < c ≤ 155 | 81 | 82.7 |
155 < c ≤ 160 | 75 | 82.9 |
160 < c ≤ 165 | 72 | 81.6 |
165 < c ≤ 170 | 81 | 82.3 |
170 < c ≤ 175 | 87 | 83.1 |
175 < c ≤ 180 | 87 | 82.7 |
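To see which stage accounts for the durations in the table, each stage can be timed separately. The sketch below is a generic timer built on `time.perf_counter`; the stage names and the `time.sleep` calls are placeholders standing in for the real text-to-mel, SSRN, and spectrogram-inversion calls, and `timed` is a hypothetical helper.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)  # accumulated seconds per stage


@contextmanager
def timed(stage):
    """Add the wall-clock time spent inside the block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start


# Placeholder workloads; replace the sleeps with the real inference calls.
with timed("text2mel"):
    time.sleep(0.01)
with timed("ssrn"):
    time.sleep(0.01)
with timed("spectrogram_inversion"):
    time.sleep(0.01)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds:.3f} s")
```

When timing on a GPU, CUDA calls are asynchronous, so calling `torch.cuda.synchronize()` before reading the clock gives more reliable numbers.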
I am not sure how to run the code with a new dataset. What should the dataset look like in order to train on it?
I tried to run this.
The last step of the "prepare model" segment gives me this error:
RuntimeError: storage has wrong size: expected 4541342186092732661 got 512
Hello,
thanks for the great repository.
I want to do transfer learning on a German dataset.
Unfortunately, I'm new to PyTorch and would therefore appreciate tips on what I need to change and where in the code.
Thanks in advance!
First, thanks for this excellent project and your patience. I have already finished training the Mongolian TTS model, which works really well. It's cool!
What should I do if I want to train a Tibetan TTS model based on this project? Can you give me some suggestions?
Sorry for this question, but I'm a novice. Thanks in advance!
Will I have to write a dataloader for the custom dataset, like the ones made for the English and Mongolian datasets?
Hello!
I have seen a similar problem reported in the issues in connection with the WaveGlow model, but I am running the original code on a Romanian dataset with a sampling rate of 16000 Hz. I tried both the gated_conv and highway options for text2mel_basic_block.
After the model has trained for 300K steps, some sentences are synthesized correctly, while others are completely blank: the model generates a few seconds of silent audio. I have copied some of the attention, mel and mag files for sentences that didn't synthesize and added some examples that were synthesized correctly. (https://drive.google.com/drive/folders/1Mq6_xub9urzJxXAXI_HVaws0sZZqrg8X?usp=sharing)
Do you have any idea why that might be happening?
(I have added a multi-speaker option to your code, which I can also share. It seems to be learning, but the same thing happens: some test sentences are synthesized while others are not, and it seems to be quite random.)
Thanks in advance!
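How can the batch processing bug be fixed?
Hello, I tried to generate Y and A in parallel, but the decoded result is wrong.
Have you solved this problem yet?
Could you post an update with the solution? Thanks.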
First of all thank you for the repository, the result is really awesome.
Is it possible to do transfer learning from the pretrained models (ljspeech) to my own recordings?
Hello,
Do the wav files in the dataset have to end at a sentence boundary? Can a file contain two sentences, or can long sentences be split into two or three parts? Would this have any effect on training?
Thank you
Hey!
I've trained the text2mel part for mel generation for a couple hundred epochs.
The model seems to be learning, and it gives fairly good results on a dataset in a different language when its output is fed to SSRN (without fine-tuning the SSRN part).
I'm trying to feed the text2mel output to a trained WaveGlow model, but it outputs only low-frequency noise, without any speech.
Any advice on how to post-process the generated mels before feeding them to WaveGlow?
@tugstugi After training the model you provided, I have some checkpoints in .pth format. However, when I downloaded these .pth files, I noticed that they look like zip files. In your Colab notebook, you shared a link to similar .pth files. When I download your .pth files, they are different from mine. Because of that, I cannot use my .pth files to synthesize speech. How can I solve this issue? Is anyone else facing the same problem?
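One possible explanation, offered as an assumption and not as a diagnosis: since PyTorch 1.6, `torch.save` writes a zip-based container by default, while earlier releases wrote a legacy pickle stream, and releases that predate the new format may not be able to read it. A quick way to tell the two apart without PyTorch is to ask the standard library whether the file is a zip archive. `checkpoint_format` is a hypothetical helper name, and the two files below are placeholders created only so the sketch runs without a real checkpoint.

```python
import os
import tempfile
import zipfile


def checkpoint_format(path):
    """Classify a checkpoint file as 'zip' (newer torch.save default) or 'legacy'."""
    return "zip" if zipfile.is_zipfile(path) else "legacy"


# Build one placeholder file of each kind and classify both.
with tempfile.TemporaryDirectory() as tmp:
    zipped = os.path.join(tmp, "new_style.pth")
    with zipfile.ZipFile(zipped, "w") as zf:
        zf.writestr("data.pkl", b"placeholder")

    plain = os.path.join(tmp, "old_style.pth")
    with open(plain, "wb") as f:
        f.write(b"\x80\x02placeholder")

    zipped_kind = checkpoint_format(zipped)
    plain_kind = checkpoint_format(plain)

print(zipped_kind, plain_kind)
```

If a checkpoint is in the zip format and must be read by an older PyTorch, re-saving it from a newer installation with `torch.save(state, path, _use_new_zipfile_serialization=False)` writes the legacy format.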
Hello,
When running:
python train-text2mel.py --dataset=ljspeech
I am receiving this error:
Traceback (most recent call last):
  File "train-text2mel.py", line 19, in <module>
    from logger import Logger
  File "/media/hdd3tb/pytorch-dc-tts/logger.py", line 7, in <module>
    from comet_ml import Experiment
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/__init__.py", line 72, in <module>
    from .pytorch_logger import patch as pytorch_patch
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/pytorch_logger.py", line 72, in <module>
    check_module("torch", PYTORCH_ALREADY_IMPORTED_MSG)
  File "/home/jamazzon/anaconda3/envs/tts/lib/python3.7/site-packages/comet_ml/_logging.py", line 249, in check_module
    raise SyntaxError(error_msg)
SyntaxError: Please import comet before importing any torch modules
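The error message states the fix itself: `comet_ml` has to be imported before any `torch` module. In the traceback, `train-text2mel.py` reaches `logger.py` (which imports `comet_ml`) only at line 19, presumably after `torch` had already been imported. The sketch below is a small, hypothetical checker, using only the standard library, that reports whether a script's top-level imports satisfy that order; it does not look inside the modules that the script imports.

```python
import ast


def first_import_line(source, package):
    """Return the line number of the first top-level import of `package`, or None."""
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        if any(n == package or n.startswith(package + ".") for n in names):
            return node.lineno
    return None


def comet_before_torch(source):
    """True if comet_ml is imported before torch, or if either import is absent."""
    comet = first_import_line(source, "comet_ml")
    torch = first_import_line(source, "torch")
    if comet is None or torch is None:
        return True
    return comet < torch


bad = "import torch\nfrom comet_ml import Experiment\n"
good = "from comet_ml import Experiment\nimport torch\n"
print(comet_before_torch(bad), comet_before_torch(good))  # False True
```

In practice the change is usually just moving `from logger import Logger` (or an explicit `import comet_ml`) above the first `torch` import in the training script.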
https://s3.us-east-2.amazonaws.com/bible.davarpartners.com/Mongolian
The download path above no longer exists; the URL is not available any more,
so I can't use it to download the Bible book files.
Thanks.
@tugstugi thanks for this repo and for sharing it.
I am trying to train a Japanese model, but the quality was not good. Japanese text such as ありがとうございます needs to be tokenized first, as there is no whitespace between words.
I copied mb_speech.py into a new file, where I tokenize the words, and updated the vocab string in that file to include every character in the corpus. However, after 12 hours of training, it could not produce results in Japanese. Are there any extra steps I should follow? Thanks.
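For a new language, the vocab string has to cover every character the text encoder will see. A minimal sketch of building one from a corpus follows. It assumes the convention of reserving the first two symbols for padding and end of sentence (`P` and `E` here), and `normalize`, `build_vocab` and `encode` are hypothetical helpers, not functions from the repository. For Japanese this yields one entry per distinct kana or kanji, so the vocabulary can become large.

```python
import unicodedata

PAD, EOS = "P", "E"  # assumed reserved symbols: padding and end of sentence


def normalize(text):
    """Apply Unicode NFKC normalization and lowercase the text."""
    return unicodedata.normalize("NFKC", text).lower()


def build_vocab(transcripts):
    """Return a vocab string: reserved symbols first, then sorted corpus characters."""
    chars = set()
    for line in transcripts:
        chars.update(normalize(line))
    chars -= {PAD, EOS}  # keep the reserved symbols unique
    return PAD + EOS + "".join(sorted(chars))


def encode(text, vocab):
    """Map text to a list of indices and append EOS; unknown characters raise KeyError."""
    char2idx = {ch: i for i, ch in enumerate(vocab)}
    return [char2idx[ch] for ch in normalize(text) + EOS]


transcripts = ["ありがとう", "ありがとうございます"]
vocab = build_vocab(transcripts)
ids = encode("ありがとう", vocab)
print(len(vocab), ids)
```

Raising on unknown characters is a deliberate choice in this sketch: a character missing from the vocab is easier to notice at encoding time than after hours of training.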
Hello!
I am trying to train the Mongolian TTS, and when I run
python train-text2mel.py --dataset=mbspeech
I receive this error:
use_gpu True
epoch 0 with lr=1.25e-06
0%| | 0/3456 [00:00<?, ?audios/s]
Traceback (most recent call last):
  File "train-text2mel.py", line 159, in <module>
    train_epoch_loss = train(epoch, phase='train')
  File "train-text2mel.py", line 76, in train
    Y_logit, Y, A = text2mel(L, S)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\text2mel.py", line 165, in forward
    K, V = self.text_enc(L)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\text2mel.py", line 78, in forward
    out = self.embedding(x)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "D:\yang\Projects\TTS\code\pytorch-dc-tts\models\layers.py", line 98, in forward
    return self.embedding(x)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\modules\sparse.py", line 108, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "C:\Users\Joee\AppData\Local\conda\conda\envs\ptPy36\lib\site-packages\torch\nn\functional.py", line 1076, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got CUDAIntTensor instead (while checking arguments for embedding)
I think I did something wrong. Can you help me? Thanks!
I am having some difficulties loading already trained models. Could you be more specific about how I can load them? What changes need to be made in the code?
I am getting the following error on Windows 10 in a Jupyter notebook using englishtts.py. Any help on how to get this running?
Traceback (most recent call last):
  File "englishtts.py", line 107, in <module>
    _, Y_t, A = text2mel(L, Y, monotonic_attention=True)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\text2mel.py", line 165, in forward
    K, V = self.text_enc(L)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\text2mel.py", line 78, in forward
    out = self.embedding(x)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\prilc\Desktop\speech\models\layers.py", line 98, in forward
    return self.embedding(x)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\modules\sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "C:\Program Files\Python37\lib\site-packages\torch\nn\functional.py", line 1852, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)
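Both tracebacks of this kind end in the same place: the embedding layer received 32-bit integer indices where it expected 64-bit (`Long`) ones. A common workaround, shown here as a sketch and not as the repository's own fix, is to cast the index tensor with `.long()` before calling the model. The tensor `L` below is a stand-in for the encoded text batch, and the sizes are arbitrary.

```python
import torch

embedding = torch.nn.Embedding(num_embeddings=32, embedding_dim=4)

# Stand-in for an encoded text batch that arrived as int32, e.g. from a NumPy
# array on a platform whose default integer is 32-bit (an assumption here).
L = torch.tensor([[2, 5, 1]], dtype=torch.int32)

L = L.long()        # cast the indices to int64 before the embedding lookup
out = embedding(L)  # shape: (batch, characters, embedding_dim)
print(L.dtype, tuple(out.shape))
```

In the scripts from the tracebacks, the equivalent change would be a cast on `L` just before the `text2mel(...)` call, or on the NumPy array (for example `astype('int64')`) where the batch is built.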
Hi, I am training a custom Spanish model on a dataset I created from YouTube using youtube-tts-data-generator.
I changed the vocabulary and trained the text2mel and ssrn models for 20k and 5k iterations, respectively.
The text2mel loss no longer seems to be improving.
When I synthesize some phrases, all of the outputs are blank: the audio files contain no data.
What could I be doing wrong?
I only need to generate audio with a single voice in this case.
I used prepro.py from the dc-tts repo to generate the mels and mags, since the preprocessing part of this repo is still marked as a TODO.
Could that be the problem?
Thanks!