
Comments (45)

r9y9 commented:

Okay, my implementation can now generate something that sounds like speech.

  • Input text: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
  • model: checkpoint_step000100000.pth (trained 10 hours)

4_alignment

4.wav.zip

def build_deepvoice3(n_vocab, embed_dim=256, mel_dim=80, linear_dim=4096, r=5,
                     n_speakers=1, speaker_embed_dim=16, padding_idx=None,
                     dropout=(1 - 0.95)):
    encoder = Encoder(
        n_vocab, embed_dim, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((128, 7),) * 7)
    decoder_hidden_dim = 128
    decoder = Decoder(
        embed_dim, in_dim=mel_dim, r=r, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((decoder_hidden_dim, 7),) * 5,
        attention=[True, False, False, False, True],
        force_monotonic_attention=[True, False, False, False, False])
    converter = Converter(
        in_dim=decoder_hidden_dim // r, out_dim=linear_dim, dropout=dropout,
        convolutions=((256, 7),) * 7)
    model = DeepVoice3(
        encoder, decoder, converter, padding_idx=padding_idx,
        mel_dim=mel_dim, linear_dim=linear_dim,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim)

    return model

r9y9 commented:

@Kyubyong Here is a screenshot of my TensorBoard (taken while working on #3, so not exactly DeepVoice3).

screenshot from 2017-11-22 16-12-14

1. linear_l1_loss: L1 loss for the linear-domain log-magnitude spectrogram
2. mel_l1_loss: L1 loss for the mel-spectrogram
3. done_loss: binary cross-entropy loss for the done flag

attn_loss, mel_binary_div_loss and linear_binary_div_loss come from https://arxiv.org/abs/1710.08969. However, I get similar loss curves even without those losses. After training for 20 hours, I think I get good speech samples: https://www.dropbox.com/sh/jkkjqh6pawkg6sd/AAD5--NRm4rRgHo91sHYMAvGa?dl=0.
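
For context, the guided attention loss from https://arxiv.org/abs/1710.08969 penalizes attention mass far from the time–character diagonal. A minimal sketch of the weight matrix, written from the paper's formula (illustrative only, not this repo's exact code):

    import numpy as np

    # Guided attention weights: W[n, t] = 1 - exp(-((n/N - t/T)^2) / (2 * g^2)), g ~ 0.2.
    # attn_loss is then mean(W * A) for an attention matrix A of shape (N, T).
    def guided_attention_weights(N, T, g=0.2):
        n = np.arange(N)[:, None] / N
        t = np.arange(T)[None, :] / T
        return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * g ** 2))

    A = np.full((50, 200), 1.0 / 50)   # dummy uniform attention (text positions x frames)
    attn_loss = np.mean(guided_attention_weights(50, 200) * A)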

r9y9 commented:

Learned attention seems to be almost monotonic. The incremental forward path is not working yet, though.

[alignment plot: test]

    encoder = Encoder(
        n_vocab, embed_dim, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((64, 5),) * 10)
    decoder = Decoder(
        embed_dim, in_dim=mel_dim, r=r, padding_idx=padding_idx,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim,
        dropout=dropout,
        convolutions=((128, 5),) * 5,
        attention=[False, False, False, False, True])
    converter = Converter(
        in_dim=mel_dim, out_dim=linear_dim, dropout=dropout,
        convolutions=((256, 5),) * 5)
    model = DeepVoice3(
        encoder, decoder, converter, padding_idx=padding_idx,
        mel_dim=mel_dim, linear_dim=linear_dim,
        n_speakers=n_speakers, speaker_embed_dim=speaker_embed_dim)

r9y9 commented:

Progress: I still cannot get correct alignment with the incremental forward path. Results are below.

  • Input text: they discarded this for a more completely Roman and far less beautiful letter. (from training data)
  • Model: checkpoint_step000154000.pth

Ground truth (mel-spectrogram)

figure_1

Predicted mel-spectrogram (off-line, feeding ground truth at every time step)

figure_1-1

Predicted alignment (off-line)

figure_1-2

Predicted mel-spectrogram (on-line, feeding ground truth only at the first time step)

figure_1-3

Predicted alignment (on-line)

figure_1-4

Feeding zeros at the first time step doesn't work either :(

EDIT: There was a serious bug, fixed in a0b36a4
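
For readers unfamiliar with the off-line vs. on-line distinction above: off-line decoding feeds ground-truth frames to the decoder at every step (teacher forcing), while on-line (incremental) decoding feeds the model's own predictions back in. A toy sketch with a hypothetical decoder_step stand-in (not the real model):

    import torch

    def decoder_step(prev_frames, encoder_out):
        # Hypothetical stand-in: a real decoder step attends to encoder_out
        # and predicts the next r mel frames from the previous ones.
        return 0.9 * prev_frames

    r, mel_dim, T = 5, 80, 100
    encoder_out = torch.zeros(1, 20, 128)        # dummy encoder memory
    ground_truth = torch.randn(1, T, mel_dim)    # (batch, frames, mel_dim)

    # Off-line: every input chunk comes from the ground truth.
    offline = [decoder_step(ground_truth[:, t:t + r], encoder_out) for t in range(0, T, r)]

    # On-line: start from zeros (or one ground-truth chunk) and feed predictions back.
    prev = torch.zeros(1, r, mel_dim)
    online = []
    for _ in range(T // r):
        prev = decoder_step(prev, encoder_out)
        online.append(prev)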

r9y9 commented:

Still not good, but it seems to be starting to work now.

  • checkpoint_step000100000.pth (trained 8 hours)

Predicted mel-spectrogram (off-line, feeding ground truth at every time step)

figure_1-5

Predicted mel-spectrogram (on-line, starting from a zero decoder state, with forced monotonic attention)

figure_1-6

r9y9 commented:

Added dilated convolution support. It seems effective, as reported in https://arxiv.org/pdf/1710.08969.pdf.
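
For anyone unfamiliar with dilated convolutions, a generic PyTorch sketch of a causal dilated 1-D convolution (illustrative only, not this repo's implementation): left-padding by (kernel_size - 1) * dilation keeps the layer causal while the dilation widens the receptive field.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    k, d, channels = 3, 4, 128
    conv = nn.Conv1d(channels, channels, kernel_size=k, dilation=d)

    x = torch.randn(1, channels, 100)        # (batch, channels, time)
    x_padded = F.pad(x, ((k - 1) * d, 0))    # left-pad so the convolution stays causal
    y = conv(x_padded)                       # same time length as x
    print(y.shape)                           # torch.Size([1, 128, 100])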

r9y9 commented:

5.wav.zip

The best-quality speech sample I have gotten so far. Still not as good as Tacotron :(

  • Input text: A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.
  • model: checkpoint_step000250000.pth (trained 15 hours)

5_alignment

Some notes:

  • Guided attention works
  • Multiple attention layers (> 2) are hard to learn, even with guided attention
  • A deeper decoder tends to make alignment harder to learn

r9y9 commented:

I've written a short README on how to train a model and how to synthesize audio signals. I would appreciate it if anybody could try it out and give me feedback.

r9y9 commented:

https://www.dropbox.com/sh/uq4tsfptxt0y17l/AADBL4LsPJRP2PjAAJRSH5eta?dl=0 Added audio samples. Will update them regularly if I can get better samples.

r9y9 commented:

Tried again to visualize the mel-spectrogram generated by the model.
Compared to #1 (comment), it is getting better, slowly but surely.

From top to bottom: ground truth, predicted mel-spectrogram, and predicted alignment.
figure_1

r9y9 commented:

It seems the model has trouble learning long-term dependencies. I'm going to stack more layers with a larger kernel_size and larger dilation factors and see if that works.
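
As a rough sanity check (my own back-of-the-envelope arithmetic, not from the paper), the receptive field of a stack of 1-D convolutions grows as 1 + sum((kernel_size - 1) * dilation), so dilation buys temporal context much faster than extra layers alone:

    def receptive_field(layers):
        """layers: list of (kernel_size, dilation) tuples."""
        rf = 1
        for k, d in layers:
            rf += (k - 1) * d
        return rf

    print(receptive_field([(5, 1)] * 5))                               # 21 frames
    print(receptive_field([(5, 1), (5, 3), (5, 9), (5, 27), (5, 1)]))  # 165 frames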

rafaelvalle commented:

@r9y9 In their paper, did they provide the number of layers, kernel size, and dilation factor?
If not, it would be useful to e-mail the authors!

r9y9 commented:

Yes, they provided hyperparameters: number of layers, kernel sizes, etc. I think I tried almost the same hyperparameters (unless I misunderstood something), but in my experience they didn't work for the LJSpeech dataset. I suspect the reason is that the speech samples in LJSpeech have reverberation, which makes it difficult to train a high-quality model. That's why I'm trying richer models, e.g., increasing the number of layers.

DarkDefender commented:

After looking at the mel-spectrogram, I feel like the model has a hard time learning the shifts that happen in the spectra. I.e., it seems to generate straight "lines" instead of the curvier lines in the ground truth.

Perhaps the quality would improve if the shifts could be extracted and provided as an input feature? (I'm thinking it would basically just be an amplitude vector.)

However, having to provide this would kind of defeat the TTS side of things, so it is probably not practical; I'm just curious whether it would improve the quality at all.

Sorry if this is off topic.

r9y9 commented:

@DarkDefender Thank you for your comment! The straight lines are what I want to improve. Note that if I feed the ground truth to the decoder at every time frame, I can get curvy lines, so the auto-regressive process during decoding is causing the artifacts.

Kyubyong commented:

Hi @r9y9, nice job! Could you upload the training curve? I'm also working on implementing DeepVoice3, but with no luck yet. I think I need to compare yours and mine. Any tips?

r9y9 commented:

@Kyubyong Sure, I will upload logs when I finish my current experiment. Some random tips I have are:

  • Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
  • With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably, even when using multiple attention layers. With guided attention, I can confirm that five attention layers become monotonic, though I don't get speech quality improvements.
  • Positional encoding (i.e., using text positions and frame positions in the decoder) is essential to learn monotonic alignments (without it I cannot get it to work). However, I'm still not sure why the position rate matters; 1.0 for both encoder and decoder worked in a previous experiment.
  • Weight initialization is quite important, particularly for deeper (e.g., > 8 layers) networks. I noticed this when I tried to replicate https://arxiv.org/abs/1710.08969; they use more than 20 layers in the decoder! Very hard to train. Work in progress in #3. Speech samples (model: encoder/converter from https://arxiv.org/abs/1710.08969 and decoder from DeepVoice3): https://www.dropbox.com/sh/q9xfgscgh3k5lqa/AACPgWCprBfNgjRravscdDYCa?dl=0.
  • Adam with step lr decay works. However, for deeper networks, I find Adam + Noam's lr scheduler more stable; see the snippet below (a usage sketch follows after this list).

    # https://github.com/tensorflow/tensor2tensor/issues/280#issuecomment-339110329
    import numpy as np

    def noam_learning_rate_decay(init_lr, global_step):
        # Noam scheme from tensor2tensor
        warmup_steps = 4000.0
        step = global_step + 1.
        lr = init_lr * warmup_steps**0.5 * np.minimum(
            step * warmup_steps**-1.5, step**-0.5)
        return lr
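
A minimal usage sketch for the scheduler above (my own illustration; the optimizer settings are placeholders, not necessarily what this repo uses):

    import torch

    model = torch.nn.Linear(80, 80)                              # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    init_lr = 0.001
    for global_step in range(1000):
        lr = noam_learning_rate_decay(init_lr, global_step)
        for param_group in optimizer.param_groups:
            param_group["lr"] = lr
        # ... forward pass, loss.backward(), optimizer.step() go here ...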

Kyubyong commented:

Thanks @r9y9. I'm trying another dataset that is much shorter than LJ. Strangely, when I applied positional encoding it didn't work, so I replaced it with a positional embedding, and the network started to learn, though not perfectly.
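
For reference, one common sinusoidal positional-encoding formulation with a "position rate" multiplier, sketched from the DeepVoice3 paper's description (illustrative; neither this repo's nor Kyubyong's exact code):

    import numpy as np

    # PE(i, 2k)   = sin(w * i / 10000^(2k / d))
    # PE(i, 2k+1) = cos(w * i / 10000^(2k / d)),  w = position rate
    def position_encoding(n_positions, d, position_rate=1.0):
        pe = np.zeros((n_positions, d))
        position = np.arange(n_positions)[:, None] * position_rate
        div = np.power(10000.0, np.arange(0, d, 2) / d)
        pe[:, 0::2] = np.sin(position / div)
        pe[:, 1::2] = np.cos(position / div)
        return pe

    print(position_encoding(100, 256).shape)    # (100, 256)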

Kyubyong commented:

Amazing. So from the beginning, you could get monotonic alignments as can be seen above on this page, right? Is this thanks to the shared initial weights of the key and query projections? If that's the case, could you point out its implementation in your code?

The paper says,

"We initialize the fully-connected layer weights used to compute hidden attention vectors to the same values for the query projection and the key projection. "

r9y9 commented:

@Kyubyong,

Amazing. So from the beginning, you could get monotonic alignments as can be seen above on this page, right?

Yes.

Is this thanks to the shared initial weights of the key and query projections? If that's the case, could you point out its implementation in your code?

I haven't tried the same weight initialization for attention because I didn't think it was that important; attention works without it. Will try shared initial weights next, thanks!

DarkDefender commented:

I think your new samples sound quite good! To me, they sound clearer (not as muffled) than the Tacotron samples posted on keithito's tacotron page. However, while yours are clearer, they do have more of the "heavy data compression"/vibration going on in them.

BTW, are you using a post-processing network like the one in Tacotron? I'm asking because your samples remind me a bit of the "Tajima Airport serves Toyooka." sample on:
https://google.github.io/tacotron/publications/tacotron/index.html

If you do not, then perhaps the vibration effect could be eliminated with a post-processing network?

r9y9 commented:

@DarkDefender Thank you for your feedback! As for the "heavy data compression", it might be because I used half the dimension for the linear spectrogram, i.e., FFT size 1024 (hparams.py#L73) instead of 2048.

BTW, are you using a post processing network as the one in tacotron?

Yes.
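
Returning to the FFT-size point above: the linear-spectrogram dimension is simply n_fft // 2 + 1 bins, so halving the FFT size from 2048 to 1024 halves the frequency resolution. A quick check:

    import numpy as np

    for n_fft in (1024, 2048):
        frame = np.random.randn(n_fft)
        spec = np.abs(np.fft.rfft(frame))
        print(n_fft, spec.shape)    # (513,) for 1024, (1025,) for 2048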

DarkDefender commented:

Yes, the FFT size might be it! I've recently taken an audio compression course and, to me at least, it sounds like some of the examples where we transformed the audio signal to the frequency domain with an FFT and then quantized the frequency coefficients a bit too much.

But I also guess it might be that the network didn't learn how to put the audio signal together without producing these sound artifacts.

Sadly, I do not have an NVIDIA GPU, so I can't test upping the FFT size to see if the quality goes up...

Edit: BTW, thank you for answering my silly questions @r9y9

Edit 2: Actually, if the FFT size in this case only refers to the sample chunk length, then increasing it might not improve things much. You will only get more coefficients to work with in the frequency domain... If the neural-network sound artifacts are indeed from disconnects that appear between sample chunks, then upping the chunk size will only space them out more (not eliminate them).

r9y9 commented:

@DarkDefender I really appreciate your comments! Yeah, I hope the network can produce good speech samples even with a small FFT size. I will do more experiments with larger FFT sizes.

One new thing I found today is that the L1 loss for the spectrogram decreases more quickly when using decoder internal states as the postnet inputs, rather than the mel-spectrogram. Hopefully this improves speech quality a bit.

DarkDefender commented:

@r9y9 I'm guessing that the new 380,000-step checkpoint samples are with the new postnet inputs? I feel like the speech samples have improved a bit in quality regardless.

The only thing that got worse was the 3_checkpoint_step000380000.wav sample. I'm guessing that the pause between "repeal" and "replace" might go away with more training?

Did changing the FFT size do anything noticeable, BTW?

rafaelvalle commented:

@r9y9 you mentioned using decoder internal states as postnet inputs. Is this something described in the paper?

r9y9 commented:

@rafaelvalle Yes, it's mentioned in the DeepVoice3 paper, though not in https://arxiv.org/abs/1710.08969.

r9y9 commented:

@DarkDefender The new samples are from the https://arxiv.org/abs/1710.08969 model, using decoder internal states as the postnet inputs (22a6748). The L1 loss for the spectrogram decreases more quickly.

I'm guessing that the pause between repeal and replace perhaps might go away with more training?

I'm guessing that too.

Did changing FFT size do anything noticeable BTW?

Haven't tried it yet:(

r9y9 commented:

https://github.com/r9y9/deepvoice3_pytorch#pretrained-models

Pre-trained models are now up and ready.

r9y9 commented:

Finally, I think I got reasonable (though not very good) speech quality with a multi-speaker model trained on VCTK (108 speakers). Speech samples for three different speakers are attached:

  • Input text: Some have accepted this as a miracle without any physical explanation.

p225

step000100000_text5_multispeaker0_alignment

p225

step000100000_text5_multispeaker1_alignment

p236

step000100000_text5_multispeaker10_alignment

eval.zip

WIP at #10
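
As background, a conceptual sketch (not this repo's exact code) of what the n_speakers / speaker_embed_dim arguments in the snippets above refer to: the multi-speaker model looks up a learned embedding per speaker id and conditions the network on it.

    import torch
    import torch.nn as nn

    n_speakers, speaker_embed_dim = 108, 16                  # e.g., VCTK
    speaker_embedding = nn.Embedding(n_speakers, speaker_embed_dim)

    speaker_ids = torch.tensor([0, 1, 10])                   # arbitrary speaker indices
    speaker_vectors = speaker_embedding(speaker_ids)
    print(speaker_vectors.shape)                             # torch.Size([3, 16])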

DarkDefender commented:

Thanks for the update! I really appreciate that you give us small updates every now and then.

As you said, it seems like they are able to read the text correctly, but the quality of the sound is not that good. Still, these are nice results nonetheless.

r9y9 commented:

Yeah, the result is not very good, but since I had a very hard time getting a multi-speaker model to actually work on VCTK, this is big progress for me :)

patHutchings commented:

Amazing work!

I've tried training with Nyanko on the current commit (8afb609) with the LJSpeech dataset and default hparams, but I get significantly worse performance than the examples you have posted (for an earlier commit).

4357976 was a huge commit, so I'm going through it to see if anything might have negatively affected single-voice performance. I thought I'd also check to see if you have any suggestions.

Here are eval examples from 325,000 and 665,000 training steps.
https://www.dropbox.com/sh/rarwoxl3u0f5qkn/AAALH_XayWwEuBoN5bz1P3BIa?dl=0

r9y9 commented:

@patHutchings

I've tried training with Nyanko on the current commit (8afb609) with the LJSpeech data set and default Hparams, but get significantly worse performance than the examples you have posted (for an earlier commit).

Sorry about that. I might have introduced a bug in #6 (maybe 4d9bc6f doesn't work for the Nyanko architecture). I will take a look soon.

r9y9 commented:

I think I have reverted the (possibly affected) changes. It should give the same results as I posted previously. I will try to train new models as soon as possible.

r9y9 commented:

OMG, I introduced a very stupid bug! It should be fixed by a934acd.

patHutchings commented:

@r9y9 Yep, that will do it. Sorry, I should have noticed that myself.
I'll train again (and also with another, custom dataset) and share the models.

r9y9 commented:

OK, I trained for 150k steps and it seems to work as before. Dropout was essential to make it work.

saurabhvyas commented:

@r9y9, I downloaded the pretrained model, but when I ran it I got the following error. Doesn't it support CPU inference?

    python synthesis.py pretrainedmodel.pth test_list.txt output/
    Command line args:
     {'--checkpoint-postnet': None, '--checkpoint-seq2seq': None, '--file-name-suffix': '', '--help': False, '--hparams': '', '--max-decoder-steps': '500', '--output-html': False, '--replace_pronunciation_prob': '0.0', '--speaker_id': None, '<checkpoint>': 'pretrainedmodel.pth', '<dst_dir>': 'output/', '<text_list_file>': 'test_list.txt'}
    Traceback (most recent call last):
      File "synthesis.py", line 124, in <module>
        checkpoint = torch.load(checkpoint_path)
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 231, in load
        return _load(f, map_location, pickle_module)
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 379, in _load
        result = unpickler.load()
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 350, in persistent_load
        data_type(size), location)
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 85, in default_restore_location
        result = fn(storage, location)
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 67, in _cuda_deserialize
        return obj.cuda(device_id)
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/_utils.py", line 58, in _cuda
        with torch.cuda.device(device):
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 125, in __enter__
        _lazy_init()
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 84, in _lazy_init
        _check_driver()
      File "/home/saurabh/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 51, in _check_driver
        raise AssertionError("Torch not compiled with CUDA enabled")
    AssertionError: Torch not compiled with CUDA enabled

r9y9 commented:

@saurabhvyas Sorry, CPU inference is not supported yet.
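
(Not specific to this repo: in general, PyTorch can at least deserialize a GPU-saved checkpoint on a CPU-only machine by passing map_location to torch.load; whether the rest of the synthesis pipeline then runs on CPU is a separate question.)

    import torch

    # Generic PyTorch pattern: remap CUDA storages to CPU when loading.
    checkpoint = torch.load("pretrainedmodel.pth",
                            map_location=lambda storage, loc: storage)
    # Newer PyTorch versions also accept map_location="cpu".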

saurabhvyas commented:

@r9y9 Okay, no problem. Good project, by the way :)

r9y9 commented:

#10 merged :)

I added a brief guide for speaker adaptation and training a multi-speaker model. See https://github.com/r9y9/deepvoice3_pytorch#advanced-usage if interested.

r9y9 commented:

Audio samples are now available on the GitHub page: https://r9y9.github.io/deepvoice3_pytorch/

r9y9 commented:

I have finished everything I wanted to do initially. I will close this issue and create separate ones for specific problems.

haqkiemdaim commented:

(Quoting r9y9's earlier comment above: "Learned attention seems to be almost monotonic. The incremental forward path is not working yet, though." — encoder/decoder/converter configuration snippet omitted.)

Hi @r9y9! I just want to know how you generated the GIF time lapse of the alignment?
