Comments (45)
Okay, my implementation can now generate speech-like sounds.
- Input text: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
- model: checkpoint_step000100000.pth (trained 10 hours)
def build_deepvoice3(n_vocab, embed_dim=256, mel_dim=80, linear_dim=4096, r=5,
                     n_speakers=1, speaker_embed_dim=16, padding_idx=None,
                     dropout=(1 - 0.95)):
    encoder = Encoder(
        n_vocab, embed_dim, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((128, 7),) * 7)
    decoder_hidden_dim = 128
    decoder = Decoder(
        embed_dim, in_dim=mel_dim, r=r, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((decoder_hidden_dim, 7),) * 5,
        attention=[True, False, False, False, True],
        force_monotonic_attention=[True, False, False, False, False])
    converter = Converter(
        in_dim=decoder_hidden_dim // r, out_dim=linear_dim, dropout=dropout,
        convolutions=((256, 7),) * 7)
    model = DeepVoice3(
        encoder, decoder, converter, padding_idx=padding_idx,
        mel_dim=mel_dim, linear_dim=linear_dim,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim)
    return model
from deepvoice3_pytorch.
@Kyubyong Here is a screenshot of my tensorboard (taken while working on #3, so not exactly DeepVoice3).
1. linear_l1_loss: L1 loss for the linear-domain log magnitude spectrogram
2. mel_l1_loss: L1 loss for the mel-spectrogram
3. done_loss: binary cross entropy loss for the done flag
attn_loss, mel_binary_div_loss and linear_binary_div_loss come from https://arxiv.org/abs/1710.08969. However, I get similar loss curves without those losses. After 20 hours of training, I think I get good speech samples: https://www.dropbox.com/sh/jkkjqh6pawkg6sd/AAD5--NRm4rRgHo91sHYMAvGa?dl=0
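For readers unfamiliar with the guided attention term mentioned above, here is a minimal sketch of how such a loss can be computed. It assumes the penalty matrix W[n][t] = 1 - exp(-(n/N - t/T)^2 / (2 g^2)) over N text positions and T frames, with g = 0.2 as an assumed default. The function names are hypothetical, and plain Python lists stand in for tensors so that the example has no dependencies; it is not the code used in this repository.

```python
import math

def guided_attention_weights(n_text, n_frames, g=0.2):
    """Penalty matrix: close to 0 near the diagonal, close to 1 far from it."""
    return [
        [1.0 - math.exp(-((n / n_text - t / n_frames) ** 2) / (2.0 * g * g))
         for t in range(n_frames)]
        for n in range(n_text)
    ]

def guided_attention_loss(attention, g=0.2):
    """Mean of attention[n][t] * W[n][t] for an (N x T) attention matrix."""
    n_text, n_frames = len(attention), len(attention[0])
    weights = guided_attention_weights(n_text, n_frames, g)
    total = sum(attention[n][t] * weights[n][t]
                for n in range(n_text) for t in range(n_frames))
    return total / (n_text * n_frames)

# A diagonal (monotonic) alignment versus an anti-diagonal one, 4 characters x 4 frames.
diagonal = [[1.0 if n == t else 0.0 for t in range(4)] for n in range(4)]
anti_diagonal = [[1.0 if n == 3 - t else 0.0 for t in range(4)] for n in range(4)]
loss_diag = guided_attention_loss(diagonal)
loss_anti = guided_attention_loss(anti_diagonal)
```

The diagonal alignment is not penalized at all, while the anti-diagonal one is, which is why adding this term nudges attention toward monotonic alignments.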
from deepvoice3_pytorch.
Learned attention seems to be almost monotonic. Not working for incremental forward path yet, though.
encoder = Encoder(
    n_vocab, embed_dim, padding_idx=padding_idx,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
    dropout=dropout,
    convolutions=((64, 5),) * 10)
decoder = Decoder(
    embed_dim, in_dim=mel_dim, r=r, padding_idx=padding_idx,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
    dropout=dropout,
    convolutions=((128, 5),) * 5,
    attention=[False, False, False, False, True])
converter = Converter(
    in_dim=mel_dim, out_dim=linear_dim, dropout=dropout,
    convolutions=((256, 5),) * 5)
model = DeepVoice3(
    encoder, decoder, converter, padding_idx=padding_idx,
    mel_dim=mel_dim, linear_dim=linear_dim,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim)
from deepvoice3_pytorch.
Progress: I still cannot get a correct alignment with the incremental forward path. Results are below.
- Input text: they discarded this for a more completely Roman and far less beautiful letter. (from training data)
- Model: checkpoint_step000154000.pth
Ground truth (mel-spectrogram)
Predicted mel-spectrogram (off-line, feeding ground truth at every time step)
Predicted alignment (off-line)
Predicted mel-spectrogram (on-line, feeding ground truth at the first time step)
Predicted alignment (on-line)
Feeding zeros at the first time step doesn't work either :(
EDIT: There was a serious bug, fixed in a0b36a4
from deepvoice3_pytorch.
Still not good, but it seems to be starting to work now.
- checkpoint_step000100000.pth (trained 8 hours)
Predicted mel-spectrogram (off-line, feeding ground truth at every time step)
Predicted mel-spectrogram (on-line, start from zero decoder state, forced monotonic attention)
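As a rough illustration of what "forced monotonic attention" means at inference time, the sketch below restricts the softmax at each decoder step to a small window around the previously attended text position. The window sizes (3 positions ahead, 1 back), the function names and the toy scores are all assumptions made for this example; the repository's actual implementation may differ.

```python
import math

def windowed_attention(scores, last_pos, ahead=3, backward=1):
    """Softmax over scores restricted to [last_pos - backward, last_pos + ahead)."""
    lo = max(0, last_pos - backward)
    hi = min(len(scores), last_pos + ahead)
    masked = [s if lo <= i < hi else float("-inf") for i, s in enumerate(scores)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def decode_positions(score_rows):
    """Apply the window step by step, tracking the most attended text position."""
    last_pos, path = 0, []
    for scores in score_rows:
        probs = windowed_attention(scores, last_pos)
        last_pos = max(range(len(probs)), key=probs.__getitem__)
        path.append(last_pos)
    return path

# Toy scores: position 4 always has the highest raw score, which would make
# unconstrained attention jump there immediately.
rows = [
    [2.0, 0.1, 0.0, 0.0, 5.0, 0.0],
    [0.0, 1.0, 0.2, 0.0, 5.0, 0.0],
    [0.0, 0.0, 1.5, 0.3, 5.0, 0.0],
    [0.0, 0.0, 0.1, 0.2, 5.0, 0.0],
]
path = decode_positions(rows)
```

With the window in place the attended position advances gradually (0, 1, 2, then 4) instead of jumping straight to the highest raw score.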
from deepvoice3_pytorch.
Added dilated convolution support. Seems effective as reported in https://arxiv.org/pdf/1710.08969.pdf.
from deepvoice3_pytorch.
The best-quality speech sample I have gotten so far. Still not as good as Tacotron :(
- Input text: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
- model: checkpoint_step000250000.pth (trained 15 hours)
Some notes:
- Guided attention works
- Multiple attention layers (> 2) are hard to train, even with guided attention
- A deeper decoder tends to make the alignment harder to learn
from deepvoice3_pytorch.
I've written a short README on how to train a model and how to synthesize audio signals. I would appreciate it if anybody could try it and give me feedback.
from deepvoice3_pytorch.
https://www.dropbox.com/sh/uq4tsfptxt0y17l/AADBL4LsPJRP2PjAAJRSH5eta?dl=0 Added audio samples. Will keep updating as I get better samples.
from deepvoice3_pytorch.
Tried again to visualize mel-spectrogram generated by the model.
Compared to #1 (comment), it is slowly getting better.
From top to bottom: ground truth, predicted mel-spectrogram, and predicted alignment.
from deepvoice3_pytorch.
It seems the model struggles to learn long-term dependencies? I'm going to stack more layers and use a larger kernel_size and larger dilation factors to see if that works.
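To see how much context each of these knobs buys, the receptive field of a stack of stride-1 1-D convolutions can be computed as 1 + sum((kernel_size - 1) * dilation). The sketch below applies that formula to the two encoder settings posted earlier in this thread and to a purely hypothetical dilated stack; the helper name and the dilation schedule are illustrative assumptions.

```python
def receptive_field(layers):
    """layers: iterable of (kernel_size, dilation) for stacked stride-1 1-D convolutions."""
    return 1 + sum((k - 1) * d for k, d in layers)

# Encoder settings from earlier comments, without dilation.
seven_layers_k7 = receptive_field([(7, 1)] * 7)    # convolutions=((128, 7),) * 7
ten_layers_k5 = receptive_field([(5, 1)] * 10)     # convolutions=((64, 5),) * 10

# Hypothetical dilated stack: only four layers with kernel size 3.
dilated_k3 = receptive_field([(3, d) for d in (1, 3, 9, 27)])
```

Here the four dilated layers see 81 time steps, compared to 43 and 41 for the deeper undilated stacks, which is the usual argument for larger dilation factors when long-range context is the bottleneck.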
from deepvoice3_pytorch.
@r9y9 In their paper, did they provide the number of layers, kernel size and dilation factor?
If not, it would be useful to e-mail the authors!
from deepvoice3_pytorch.
Yes, they provided hyperparameters: number of layers, kernel sizes, etc. I think I tried almost the same hyperparameters (unless I misunderstood), but in my experience they didn't work for the LJSpeech dataset. I suspect the reason is that the speech samples in LJSpeech have reverberation, which makes it difficult to train a high-quality model. That's why I'm trying richer models, e.g., with more layers.
from deepvoice3_pytorch.
After looking at the mel-spectrogram, I feel like the model has a hard time learning the shifts that happen in the spectra, i.e., it seems to generate straight "lines" instead of the curvier lines in the ground truth.
Perhaps the quality would improve if the shifts could be extracted and provided as an input feature? (I'm thinking it would basically just be an amplitude vector)
However, having to provide this would kinda destroy the TTS side of things. So it is probably not practical, I'm just interested if it would manage to improve the quality at all.
Sorry if this is off topic.
from deepvoice3_pytorch.
@DarkDefender Thank you for your comment! The straight lines are what I want to improve. Note that if I give the ground truth to the decoder at every time frame, I can get curvy lines. So the auto-regressive process during decoding is causing the artifacts.
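The difference between the two decoding modes can be illustrated with a deliberately tiny toy, sketched below. The "decoder" here is a hypothetical one-step predictor that is slightly biased (it adds 0.9 where the true increment is 1.0); it has nothing to do with the real model and only shows how a small per-step error stays bounded when the ground truth is fed at every step but accumulates when the model consumes its own outputs.

```python
def toy_step(prev_frame):
    """Hypothetical one-step 'decoder': true increment is 1.0, the model predicts 0.9."""
    return prev_frame + 0.9

def teacher_forced(ground_truth):
    # Every step is conditioned on the true previous frame.
    return [toy_step(prev) for prev in ground_truth[:-1]]

def free_running(first_frame, n_steps):
    # Every step is conditioned on the model's own previous output.
    outputs, prev = [], first_frame
    for _ in range(n_steps):
        prev = toy_step(prev)
        outputs.append(prev)
    return outputs

truth = [float(t) for t in range(11)]  # 0, 1, ..., 10
targets = truth[1:]
tf_err = [abs(p - y) for p, y in zip(teacher_forced(truth), targets)]
fr_err = [abs(p - y) for p, y in zip(free_running(truth[0], 10), targets)]
```

In this toy the teacher-forced error stays around 0.1 at every step, while the free-running error grows to roughly 1.0 after ten steps.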
from deepvoice3_pytorch.
Hi, @r9y9, nice job! Could you upload the training curve? I'm also working on implementing deepvoice3, but with no luck, yet. I think I need to compare yours and mine. Any tips?
from deepvoice3_pytorch.
@Kyubyong Sure, I will upload logs when I finish my current experiment. Some random tips I have are:
- Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
- With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when we use multiple attention layers. With guided attention, I can confirm that five attention layers become monotonic, though I cannot get speech quality improvements.
- Positional encoding (i.e., using text positions and frame positions in decoder) is essential for learning monotonic alignments (without it I cannot get it to work). However, I'm still not sure why the position rate matters. 1.0 for both encoder and decoder worked in my previous experiment.
- Weight initialization is quite important, particularly for deeper (e.g. > 8 layers) networks. I noticed this when I tried to replicate https://arxiv.org/abs/1710.08969. They use more than 20 layers in the decoder! Very hard to train. Work in progress in #3. Speech samples (model: encoder/converter from https://arxiv.org/abs/1710.08969 and decoder from DeepVoice3): https://www.dropbox.com/sh/q9xfgscgh3k5lqa/AACPgWCprBfNgjRravscdDYCa?dl=0
- Adam with step lr decay works. However, for deeper networks, I find Adam + Noam's lr scheduler more stable. See deepvoice3_pytorch/lrschedule.py (lines 4 to 11 in a93535d).
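For reference, the Noam scheme mentioned in the last tip can be sketched as follows. This is a generic version of the schedule (linear warm-up followed by inverse square-root decay), not a copy of the referenced file; the function name and the default of 4000 warm-up steps are assumptions for illustration.

```python
def noam_lr(init_lr, step, warmup_steps=4000):
    """Linear warm-up to init_lr at `warmup_steps`, then decay proportional to step ** -0.5."""
    step = max(1, step)  # avoid 0 ** -0.5 on the very first call
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)

lr_half_warmup = noam_lr(1e-3, 2000)   # still warming up
lr_peak = noam_lr(1e-3, 4000)          # peak at the end of warm-up
lr_later = noam_lr(1e-3, 16000)        # decayed by sqrt(4000 / 16000)
```

The learning rate rises linearly to `init_lr` during warm-up and is back to half of it after four times as many steps, which avoids large early updates in deep stacks.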
from deepvoice3_pytorch.
Thanks @r9y9. I'm trying another dataset that is much shorter than LJ. Strangely, when I applied positional encoding, it didn't work. So I replaced it with positional embedding, and the networks started to learn, but not perfectly.
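To make the terms concrete: a positional encoding is a fixed sinusoidal table, whereas a positional embedding is a lookup table learned during training. Below is a minimal sketch of a sinusoidal table with a position rate that scales the position index before the sinusoids are evaluated. The exact indexing convention and the function name are assumptions, not necessarily what either implementation in this thread uses.

```python
import math

def positional_encoding(n_positions, dim, position_rate=1.0):
    """Row p, column i: sin (even i) or cos (odd i) of rate * p / 10000 ** (2 * (i // 2) / dim)."""
    table = []
    for pos in range(n_positions):
        row = []
        for i in range(dim):
            angle = position_rate * pos / (10000.0 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

enc = positional_encoding(4, 6)
fast = positional_encoding(3, 4, position_rate=2.0)
slow = positional_encoding(3, 4, position_rate=1.0)
```

With a position rate of 2.0, position 1 receives the same vector that position 2 receives at rate 1.0, so the rate effectively stretches or compresses the position axis of one sequence relative to the other.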
from deepvoice3_pytorch.
Amazing. So from the beginning, you could get monotonic alignments as can be seen above in this page, right? Is it thanks to the shared initial weights of the key and query projections? If that's the case, could you point out its implementation in your code?
The paper says,
"We initialize the fully-connected layer weights used to compute hidden attention vectors to the same values for the query projection and the key projection. "
from deepvoice3_pytorch.
Amazing. So from the beginning, you could get monotonic alignments as can be seen above in this page, right?
Yes.
Is it thanks to the shared initial weights of the key and query projections? If that's the case, could you point out its implementation in your code?
I haven't tried the shared weight initialization for attention because I thought it wasn't that important. Attention works without it. Will try shared initial weights next, thanks!
from deepvoice3_pytorch.
I think that your new samples sound quite good! To me, they sound more "clear" (not as muffled) than the tacotron samples posted on keithito's tacotron page. However, while yours are clearer, they do have more of the "heavy data compression"/vibration going on in them.
BTW, are you using a post processing network like the one in tacotron? I'm asking because your samples remind me a bit of the "Tajima Airport serves Toyooka." sample on:
https://google.github.io/tacotron/publications/tacotron/index.html
If you do not, then perhaps the vibration effect could be eliminated with a post processing network?
from deepvoice3_pytorch.
@DarkDefender Thank you for your feedback! As for the "heavy data compression", the reason might be that I used half the dimension for the linear spectrogram, i.e., FFT size 1024 (hparams.py#L73) instead of 2048.
BTW, are you using a post processing network as the one in tacotron?
Yes.
from deepvoice3_pytorch.
Yes, the FFT size might be it! I've recently taken an audio compression course and, to me at least, it sounds like some of the examples where we transformed the audio signal to the frequency domain with an FFT and then quantized the frequency constants a bit too much.
But I also guess that it might be that the network didn't learn how to put the audio signal together without producing these sound artifacts.
Sadly I do not have a nvidia GPU so I can't test upping the FFT size and see if the quality goes up...
Edit: BTW, thank you for answering my silly questions @r9y9
Edit2: Actually, if the FFT size in this case only refers to the sample chunk length, then increasing it might not improve things that much. You will only get more constants to work with in the frequency domain... If the neural network's sound artifacts are indeed from discontinuities that appear between sample chunks, then upping the chunk size will only space them out more (not eliminate them).
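To put numbers on the two FFT sizes under discussion, the small sketch below computes how many non-redundant frequency bins a real-valued FFT of a given size yields, and how long one analysis frame is. It assumes a sampling rate of 22050 Hz (the LJSpeech rate) and that the analysis window spans the whole FFT frame; the helper name is hypothetical.

```python
def stft_shape(fft_size, sample_rate):
    """Return (number of non-redundant frequency bins, frame length in milliseconds)."""
    bins = fft_size // 2 + 1
    frame_ms = 1000.0 * fft_size / sample_rate
    return bins, frame_ms

SAMPLE_RATE = 22050  # assumption: LJSpeech sampling rate
small = stft_shape(1024, SAMPLE_RATE)
large = stft_shape(2048, SAMPLE_RATE)
```

Under these assumptions, going from 1024 to 2048 doubles the frequency resolution (513 to 1025 bins) and also doubles the time span covered by each frame (about 46 ms to about 93 ms), so the linear spectrogram the converter has to predict becomes twice as wide.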
from deepvoice3_pytorch.
@DarkDefender I really appreciate your comments! Yeah, I hope the network can produce good speech samples even with a small FFT size. I will do more experiments with larger FFT sizes.
One new thing I found today is that the L1 loss for the spectrogram decreases more quickly when using decoder internal states as postnet inputs, rather than the mel-spectrogram. Hopefully this improves speech quality a bit.
from deepvoice3_pytorch.
@r9y9 I'm guessing that the new 380 000 checkpoint samples are with the new postnet inputs? I feel like the speech samples have improved a bit in quality regardless.
The only thing that got worse was the 3_checkpoint_step000380000.wav sample. I'm guessing that the pause between "repeal" and "replace" might go away with more training?
Did changing FFT size do anything noticeable BTW?
from deepvoice3_pytorch.
@r9y9 you mentioned using decoder internal states as postnet inputs. Is this something described in the paper?
from deepvoice3_pytorch.
@rafaelvalle Yes, it's mentioned in DeepVoice3, not in https://arxiv.org/abs/1710.08969 though.
from deepvoice3_pytorch.
@DarkDefender The new samples are from the model in https://arxiv.org/abs/1710.08969, using decoder internal states as postnet inputs (22a6748). The L1 loss for the spectrogram decreases more quickly.
I'm guessing that the pause between repeal and replace perhaps might go away with more training?
I'm guessing that too.
Did changing FFT size do anything noticeable BTW?
Haven't tried it yet:(
from deepvoice3_pytorch.
https://github.com/r9y9/deepvoice3_pytorch#pretrained-models
Pre-trained models are now up and ready.
from deepvoice3_pytorch.
Finally, I think I get reasonable (though not very good) speech quality with a multi-speaker model trained on VCTK (108 speakers). Speech samples for three different speakers are attached:
- Input text: Some have accepted this as a miracle without any physical explanation.
p225
p225
p236
WIP at #10
from deepvoice3_pytorch.
Thanks for the update! I really appreciate that you give us small updates every now and then.
As you said, it seems like they are able to read the text correctly, but the quality of the sound is not that good. But these are nice results nonetheless.
from deepvoice3_pytorch.
Yeah, the result is not very good, but since I had a very hard time getting a multi-speaker model to actually work with VCTK, this is big progress for me :)
from deepvoice3_pytorch.
Amazing work!
I've tried training with Nyanko on the current commit (8afb609) with the LJSpeech data set and default Hparams, but get significantly worse performance than the examples you have posted (for an earlier commit).
4357976 was a huge commit, so I'm going through to see if anything might have negatively affected single voice performance. I thought I'd also check to see if you have any suggestions.
Here are eval examples from 325000 and 665000 training steps.
https://www.dropbox.com/sh/rarwoxl3u0f5qkn/AAALH_XayWwEuBoN5bz1P3BIa?dl=0
from deepvoice3_pytorch.
I've tried training with Nyanko on the current commit (8afb609) with the LJSpeech data set and default Hparams, but get significantly worse performance than the examples you have posted (for an earlier commit).
Sorry about that. I might have introduced a bug in #6 (maybe 4d9bc6f doesn't work for Nyanko architecture). I will take a look soon.
from deepvoice3_pytorch.
I think I have reverted the (possibly affected) changes. It should give the same results as I posted previously. Will try to train new models as soon as possible.
from deepvoice3_pytorch.
OMG, I introduced a very stupid bug! Should be fixed by a934acd
from deepvoice3_pytorch.
@r9y9 Yep that will do it. Sorry I should have noticed that myself.
I'll train again (and also with another, custom dataset) and share models.
from deepvoice3_pytorch.
Ok, trained 150k steps, and it seems to be working as before. Dropout was essential to make it work.
from deepvoice3_pytorch.
@r9y9, I downloaded the pretrained model, but when I ran it I got the following error. Doesn't it support CPU inference?
python synthesis.py pretrainedmodel.pth test_list.txt output/
Command line args:
{'--checkpoint-postnet': None, '--checkpoint-seq2seq': None, '--file-name-suffix': '', '--help': False, '--hparams': '', '--max-decoder-steps': '500', '--output-html': False, '--replace_pronunciation_prob': '0.0', '--speaker_id': None, '<checkpoint>': 'pretrainedmodel.pth', '<dst_dir>': 'output/', '<text_list_file>': 'test_list.txt'}
Traceback (most recent call last):
  File "synthesis.py", line 124, in <module>
    checkpoint = torch.load(checkpoint_path)
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 231, in load
    return _load(f, map_location, pickle_module)
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 379, in _load
    result = unpickler.load()
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 350, in persistent_load
    data_type(size), location)
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 85, in default_restore_location
    result = fn(storage, location)
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 67, in _cuda_deserialize
    return obj.cuda(device_id)
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 58, in _cuda
    with torch.cuda.device(device):
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 125, in __enter__
    _lazy_init()
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 84, in _lazy_init
    _check_driver()
  File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 51, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
from deepvoice3_pytorch.
@saurabhvyas Sorry, CPU inference is not supported yet.
from deepvoice3_pytorch.
@r9y9 Okay no problem , but good project :)
from deepvoice3_pytorch.
#10 merged:)
Added a brief guide for speaker adaptation and training a multi-speaker model. See https://github.com/r9y9/deepvoice3_pytorch#advanced-usage if interested.
from deepvoice3_pytorch.
Audio samples are now available at the github page: https://r9y9.github.io/deepvoice3_pytorch/
from deepvoice3_pytorch.
I have finished everything I initially wanted to do. I will close this issue and create separate ones for specific issues.
from deepvoice3_pytorch.
Learned attention seems to be almost monotonic. Not working for incremental forward path yet, though.
encoder = Encoder(
    n_vocab, embed_dim, padding_idx=padding_idx,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
    dropout=dropout,
    convolutions=((64, 5),) * 10)
decoder = Decoder(
    embed_dim, in_dim=mel_dim, r=r, padding_idx=padding_idx,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
    dropout=dropout,
    convolutions=((128, 5),) * 5,
    attention=[False, False, False, False, True])
converter = Converter(
    in_dim=mel_dim, out_dim=linear_dim, dropout=dropout,
    convolutions=((256, 5),) * 5)
model = DeepVoice3(
    encoder, decoder, converter, padding_idx=padding_idx,
    mel_dim=mel_dim, linear_dim=linear_dim,
    n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim)
Hi @r9y9! I just want to know how you generated the GIF time lapse of the alignment?
from deepvoice3_pytorch.