coqui-ai / tts

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

Home Page: http://coqui.ai

License: Mozilla Public License 2.0

Python 91.95% Jupyter Notebook 7.52% HTML 0.26% Shell 0.13% Makefile 0.07% Cython 0.04% Dockerfile 0.03%

python text-to-speech deep-learning speech pytorch tts vocoder tacotron glow-tts melgan


tts's Issues

Support offline mobile

This repository is really great! Decent samples too.

I see a huge opportunity for this to be extended to support mobile.

There are a number of obstacles to this of course, including running on TF-Lite.

If you ported it to Dart you could transpile it to iOS and Android.

French Tacotron2 DDC TTS model with HifiGan2 very noisy

I trained a French TTS model with Tacotron2 DDC on the M-AILABS dataset. I'm using Coqui-TTS v0.0.12.

I tried TTS with the vocoder vocoder_models--en--ljspeech--hifigan_v2, as downloaded from Coqui-TTS.

The resulting audio file is very noisy, as you can hear: https://sndup.net/2t9d

My config.json and vocoder_config.json are available as a gist: https://gist.github.com/lpierron/6c56302eb628ee6a86363daa08e5fa63

Any idea how to solve the noise problem?

I tried another vocoder (the MelGAN one) and there is no noise, but the voice is hoarse, as you can hear: https://sndup.net/4jgn

[Bug] MelGAN based vocoders do not use feature matching loss even if it is enabled.

Describe the bug
The condition that enables the feature matching loss in TTS/TTS/vocoder/layers/losses.py (line 263 in 4a3cc8d):

if self.use_feat_match_loss and not feats_fake:

is always False.

Probably all of our trained models are affected by this bug, which leads to suboptimal results.

Specifically, I observed that it caused the metallic noise in the model outputs.

Expected behavior
The models should use feat_matching loss

Additional context
For anyone who needs an instant fix, the line indicated above needs to be updated as follows.

        if self.use_feat_match_loss and feats_fake is not None:
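
The difference between the two conditions can be checked in isolation. This is a minimal sketch, assuming feats_fake is a non-empty Python list of feature maps; plain floats stand in for tensors, and old_condition and new_condition are illustrative names that do not appear in losses.py.

```python
# Stand-in for the discriminator features; assumed to be a non-empty list.
use_feat_match_loss = True
feats_fake = [[0.1, 0.2], [0.3, 0.4]]

# Original condition: `not feats_fake` is False for any non-empty list,
# so the feature matching branch is skipped.
old_condition = use_feat_match_loss and not feats_fake

# Proposed condition: true whenever features were actually passed in.
new_condition = use_feat_match_loss and feats_fake is not None

print(old_condition, new_condition)
```

With the features present, only the second condition lets the loss term run.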

[Bug] Skipping part of a sentence

Description
Using tts, it skipped part of a sentence. For example, for 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' it skipped the last part "and the distribution of the costs and benefits across different segments of society".
However, if a period is inserted before the part that was skipped, the complete text is synthesized.

To Reproduce
Single sentence:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' --out_path output.wav

Split by a period:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits. And the distribution of the costs and benefits across different segments of society.' --out_path output.wav

Expected behavior
The long sentences should be fully synthesized.

Google Lyra as the vocoder

Google recently open-sourced Lyra (https://github.com/google/lyra), a version of the WaveRNN vocoder. I wonder whether any of you have thought about using Lyra as the vocoder for TTS.

The benefit of this approach is that we would have a well-engineered real-time vocoder on mobile devices (and hopefully a high-quality one).

The unknown here is whether Lyra is suited to TTS. According to Google's papers, they use quantized 160-dimensional mel-spectrograms as the conditioning features, with only one frame of look-ahead.

The source code of this real-time WaveGRU vocoder can be really helpful anyway!

[Bug] KeyError: 'default_vocoder' for all hifigan_v2 models

Describe the bug
Running any of the HiFiGAN models fails with a KeyError for the default_vocoder.

To Reproduce
Steps to reproduce the behavior:

Install from PyPI (or dev branch using pip)
Run tts-server --list_models | grep hifigan to get the list of HiFiGAN models
Attempt to run any of them, e.g. tts-server --use_cuda=true --model_name vocoder_models/en/sam/hifigan_v2
See error:

Traceback (most recent call last):
  File "/home/gez/Projects/coqui-tts/.venv/bin/tts-server", line 6, in <module>
    from TTS.server.server import main
  File "/home/gez/Projects/coqui-tts/.venv/lib/python3.8/site-packages/TTS/server/server.py", line 86, in <module>
    args.vocoder_name = model_item["default_vocoder"] if args.vocoder_name is None else args.vocoder_name
KeyError: 'default_vocoder'

Expected behavior
Presumably model_item should always have a default_vocoder, or it should be checked and handled gracefully.
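
One way the graceful handling could look is sketched below. It assumes model_item is a plain dict as in the traceback; resolve_vocoder_name and the sample entries are hypothetical and are not part of TTS/server/server.py.

```python
def resolve_vocoder_name(model_item, cli_vocoder_name=None):
    """Return the vocoder to load, or None when the entry has no default."""
    if cli_vocoder_name is not None:
        # An explicit command-line choice always wins.
        return cli_vocoder_name
    # dict.get avoids the KeyError when "default_vocoder" is missing.
    return model_item.get("default_vocoder")


# Hypothetical model entries, for illustration only.
tts_entry = {"default_vocoder": "vocoder_models/en/ljspeech/multiband-melgan"}
vocoder_entry = {"description": "an entry without a default vocoder"}

print(resolve_vocoder_name(tts_entry))
print(resolve_vocoder_name(vocoder_entry))
```

The caller can then treat a None result as "no vocoder to load" or report a clear error, instead of crashing at import time.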

Environment (please complete the following information):

Python version: 3.8.3

coqui.ai site is down

The vendor says the site has expired.

[Bug] Dependencies in 0.0.13.1

Describe the bug
With TTS==0.0.13.1

from TTS.utils.synthesizer import Synthesizer

ModuleNotFoundError: No module named 'numba.decorators'
The error seems to come from librosa. Solution: pin dependency numba to 0.48. See librosa/librosa#1160
ModuleNotFoundError: No module named 'packaging'
Solution: add packaging as a dependency.

To Reproduce
Steps to reproduce the behavior:

pip install TTS==0.0.13.1
from TTS.utils.synthesizer import Synthesizer
See error

Expected behavior
No Exceptions :)

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker python:3.8
PyTorch or TensorFlow version (use command below):
Python version: Python 3.8.9

[Bug] AttributeError: 'AttrDict' object has no attribute 'generator_model'


To Reproduce

pip install TTS
tts --text "Это голос дикой планеты" --model_name "tts_models/ru/ruslan/tacotron2-DDC" --vocoder_name "tts_models/ru/ruslan/tacotron2-DDC" --out_path example.wav

Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
Expected behavior
Should not crash but generate audio

Environment (please complete the following information):

$ python --version
Python 3.8.5
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic

Exact command to reproduce:
$ tts --text "Это голос дикой планеты" --model_name "tts_models/ru/ruslan/tacotron2-DDC" --vocoder_name "tts_models/ru/ruslan/tacotron2-DDC" --out_path example.wav

Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
tts_models/ru/ruslan/tacotron2-DDC is already downloaded.
Using model: Tacotron2
Traceback (most recent call last):
  File "/opt/anaconda3/envs/py38/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/bin/synthesize.py", line 188, in main
    synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 49, in __init__
    self.load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 102, in load_vocoder
    self.vocoder_model = setup_generator(self.vocoder_config)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/vocoder/utils/generic_utils.py", line 70, in setup_generator
    print(" > Generator Model: {}".format(c.generator_model))
AttributeError: 'AttrDict' object has no attribute 'generator_model'

RuntimeError: The expanded size of the tensor (12) must match the existing size (84) at non-singleton dimension 2. Target sizes: [64, 80, 12]. Tensor sizes: [64, 1, 84]

I have the same problem with v0.0.12:

 CUDA_VISIBLE_DEVICES="0" python ../../TTS/bin/train_tacotron.py --config_path model_config.json
 > Using CUDA:  True
 > Number of GPUs:  1
 > Git Hash: 59ab268
 > Experiment folder: /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
 > Setting up Audio Processor...
 | > sample_rate:22050
 | > resample:False
 | > num_mels:80
 | > min_level_db:-100
 | > frame_shift_ms:None
 | > frame_length_ms:None
 | > ref_level_db:20
 | > fft_size:1024
 | > power:1.5
 | > preemphasis:0.0
 | > griffin_lim_iters:60
 | > signal_norm:True
 | > symmetric_norm:True
 | > mel_fmin:0
 | > mel_fmax:8000.0
 | > spec_gain:1.0
 | > stft_pad_mode:reflect
 | > max_norm:4.0
 | > clip_norm:True
 | > do_trim_silence:True
 | > trim_db:60
 | > do_sound_norm:False
 | > stats_path:./scale_stats.npy
 | > log_func:<ufunc 'log10'>
 | > exp_func:<function AudioProcessor.__init__.<locals>.<lambda> at 0x7f1ef7ac6c10>
 | > hop_length:256
 | > win_length:1024
 | > /tmp/tts/by_book/female/ezwa/monsieur_lecoq/metadata.csv
 | > Found 14211 files in /tmp/tts
 > Using model: Tacotron2

 > Model has 28183506 parameters
 > Starting with inf best loss.

 > DataLoader initialization
 | > Use phonemes: True
   | > phoneme language: fr-fr
 | > Number of instances : 14069
 | > Max length sequence: 281
 | > Min length sequence: 3
 | > Avg length sequence: 105.0826640130784
 | > Num. instances discarded by max-min (max=153, min=6) seq limits: 2420
 | > Batch group size: 128.

 > EPOCH: 0/1000

 > Number of output frames: 7

 > TRAINING (2021-04-23 15:38:18)
 ! Run is removed from /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
Traceback (most recent call last):
  File "../../TTS/bin/train_tacotron.py", line 744, in <module>
    main(args)
  File "../../TTS/bin/train_tacotron.py", line 704, in main
    train_avg_loss_dict, global_step = train(
  File "../../TTS/bin/train_tacotron.py", line 198, in train
    decoder_output, postnet_output, alignments, stop_tokens = model(
  File "/home/lpierron/miniconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/lpierron/Mozilla_TTS/COQUI-TTS/TTS/TTS/tts/models/tacotron2.py", line 226, in forward
    decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (12) must match the existing size (84) at non-singleton dimension 2.  Target sizes: [64, 80, 12].  Tensor sizes: [64, 1, 84]

I downgraded to librosa==0.6.3, but that didn't help.

My configuration is attached:

model_config.json.txt

Originally posted by @lpierron in #370 (comment)

Representative Samples

The README.md links to English Voice Samples, which claim to use an English DDC model; however, no such model is identifiable in the model list:

tts --list_models
 Name format: type/language/dataset/model
 >: tts_models/en/ek1/tacotron2
 >: tts_models/en/ljspeech/glow-tts
 >: tts_models/en/ljspeech/tacotron2-DCA
 >: tts_models/en/ljspeech/speedy-speech-wn
 >: tts_models/es/mai/tacotron2-DDC
 >: tts_models/fr/mai/tacotron2-DDC
 >: tts_models/zh-CN/baker/tacotron2-DDC-GST
 >: tts_models/nl/mai/tacotron2-DDC
 >: tts_models/ru/ruslan/tacotron2-DDC
 >: vocoder_models/universal/libri-tts/wavegrad
 >: vocoder_models/universal/libri-tts/fullband-melgan
 >: vocoder_models/en/ek1/wavegrad
 >: vocoder_models/en/ljspeech/multiband-melgan
 >: vocoder_models/nl/mai/parallel-wavegan

Can the Samples page be changed to one in this project and using an available model?

[Bug] fix windows support (audio lambda function)

I think we've established that Windows support has been broken since commit e0b3008.
I suspect that it's due to the exp/log functions stored in the class.
I would suggest replacing the log_func in the constructor with an exp_log_base and storing only that number in the class. Then I propose using math.log(x, exp_log_base) and exp_log_base**x, since np.log doesn't accept a base argument. But if we need NumPy for speed, we can define a function:

def np_log_base(x, base):
    return np.log(x) / np.log(base)

Would that fix be OK?
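
A minimal sketch of the proposal, assuming the stored lambda is what breaks on Windows. LogBaseScaler is a hypothetical stand-in for the audio class, not the actual AudioProcessor; it keeps only the numeric base and computes log and exp with NumPy on demand.

```python
import numpy as np


class LogBaseScaler:
    """Hypothetical stand-in: stores the base as a number, not a function."""

    def __init__(self, exp_log_base=10.0):
        self.exp_log_base = float(exp_log_base)

    def log(self, x):
        # Change of base, since np.log has no base argument.
        return np.log(x) / np.log(self.exp_log_base)

    def exp(self, x):
        # Inverse of log() for the same base.
        return np.power(self.exp_log_base, x)


scaler = LogBaseScaler(10.0)
values = np.array([1.0, 10.0, 100.0])
print(scaler.log(values))
print(scaler.exp(scaler.log(values)))
```

Because the instance holds only a float, it carries no lambda or local function as an attribute.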

"num_samples should be a positive integer value" error if `eval_split_size` is >= size of dataset

We're training a WaveGrad Vocoder on a fairly small dataset right now (~250 samples), and ran into the following error recently:

Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
    _, global_step = train(model, criterion, optimizer, scheduler, scaler,
  File "./TTS/bin/train_vocoder_wavegrad.py", line 82, in train
    data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
  File "./TTS/bin/train_vocoder_wavegrad.py", line 46, in setup_loader
    loader = DataLoader(dataset,
  File "coqui-tts\lib\site-packages\torch\utils\data\dataloader.py", line 266, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "coqui-tts\lib\site-packages\torch\utils\data\sampler.py", line 103, in __init__
    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

This appears to be related to the undocumented eval_split_size setting in config.json. The default config for WaveGrad sets it to 256. After debugging for a bit, it appears that this setting controls how many files are used for the evaluation set. So, if there are 500 WAV files and eval_split_size is 256, the first 256 audio files encountered are used for the evaluation set and the remaining 244 are used for training. With a dataset of 256 files or fewer, nothing is left for training, which triggers the error above.

Since it can take a fair bit of debugging for an end-user to understand what's going on, I propose two things:

There should be a sanity check/validation check that raises a more appropriate error if the number of WAV files is smaller than the eval_split_size.
The eval_split_size parameter in the config should be documented so users understand what it does and can tune it appropriately.
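
The proposed sanity check could look roughly like this. It is a sketch under the behaviour described above (the first eval_split_size items go to evaluation, the rest to training); split_eval_train is a hypothetical helper, not the project's loader code.

```python
def split_eval_train(items, eval_split_size):
    """Split items into (eval, train), failing early with a clear message."""
    if eval_split_size >= len(items):
        raise ValueError(
            f"eval_split_size ({eval_split_size}) must be smaller than the "
            f"number of audio files ({len(items)}); no files would be left "
            "for training."
        )
    return items[:eval_split_size], items[eval_split_size:]


# 500 files with the default of 256 leaves 244 for training.
files = [f"sample_{i}.wav" for i in range(500)]
eval_items, train_items = split_eval_train(files, 256)
print(len(eval_items), len(train_items))
```

With roughly 250 files and the default of 256, this raises a ValueError that names the setting instead of the num_samples=0 error from the sampler.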

Feature Request: Loss coincidence report for sample data

In https://discourse.mozilla.org/t/custom-voice-tts-not-learning/40897/5, @erogol mentioned that a way to weed out bad samples in the data is to run the training network on the data to see which have the highest loss. Is there any easy way to see this? I am taking the comment to mean that we'd need to narrow the training list to just a few files at a time, run training, and check the loss value; then repeat for each handful of sample files to see a pattern. If so, that could take quite some time. Unless there is a report or something that I'm not aware of?

As we all know, training data set quality is the biggest factor influencing training. So, anything we can do to flag sub-optimal training samples that the CheckDataset notebook otherwise doesn't flag would be ideal.

To that end, is there any opportunity for the model to track and spit out a coincidence report of files to the average loss with those files? In other words, what if the training process tracked the average loss value observed each time each file is in a batch. Over time, that could be used to drive a heatmap of which files happen to be coincident with higher loss. That way, users would quickly identify the outliers in the data set that are contributing most to the loss.
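
As a rough illustration of the idea, the sketch below keeps a running average of the batch loss for every file that appeared in the batch. FileLossTracker is hypothetical and not part of TTS; it assumes the training loop can see the file names of each batch and a scalar batch loss.

```python
from collections import defaultdict


class FileLossTracker:
    """Accumulates, per file, the loss of every batch the file appeared in."""

    def __init__(self):
        self._loss_sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, filenames, batch_loss):
        # Attribute the batch loss to each file in that batch.
        for name in filenames:
            self._loss_sum[name] += float(batch_loss)
            self._count[name] += 1

    def report(self, top_k=10):
        # Files sorted by average coincident loss, highest first.
        averages = {n: self._loss_sum[n] / self._count[n] for n in self._loss_sum}
        ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_k]


tracker = FileLossTracker()
tracker.update(["a.wav", "b.wav"], 0.5)
tracker.update(["a.wav", "c.wav"], 1.5)
tracker.update(["b.wav", "c.wav"], 1.0)
print(tracker.report())
```

The averages only become meaningful after many epochs, once each file has been batched with many different neighbours.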

--speaker_wav leads to AttributeError: 'NoneType' object has no attribute 'load_wav'

input:

tts --text 'Hello world!'  --out_path out/out_1.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_wav 28.wav

output/error

 > tts_models/en/vctk/sc-glow-tts is already downloaded.
 > vocoder_models/en/vctk/hifigan_v2 is already downloaded.
Loading speakers ...
 > Using model: glow_tts
 > Generator Model: hifigan_generator
Removing weight norm...
 > Text: Hello world!
 > Text splitted to sentences.
['Hello world!']
Traceback (most recent call last):
  File "/home/user/.local/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/user/.local/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 257, in main
    wav = synthesizer.tts(args.text, args.speaker_idx, args.speaker_wav)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 220, in tts
    speaker_embedding = self.speaker_manager.compute_x_vector_from_clip(speaker_wav)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 241, in compute_x_vector_from_clip
    x_vector = _compute(wf)
  File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 228, in _compute
    waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)
AttributeError: 'NoneType' object has no attribute 'load_wav'

When reaching this line:
waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)

self.speaker_encoder_ap is None for me, so it seems that self.speaker_encoder_ap wasn't initialized.

The wav file I'm supplying is a 22050 Hz mono file, and its path is correct.

I'm running version 0.13.

This works without a problem:

tts --text 'Hello world!'  --out_path out/out21.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_idx p245

[Feature request] add stopnet delay argument to synthesis function (tacotron)

Sometimes synthesis of a sentence is cut short at the last word. I know (or think) that this indicates something is amiss in the model or the dataset: the model was not trained long enough, the audio parameters could be tuned further (trim_db?), or the dataset quality is poor. But taking the time to fix that issue by debugging and training many models is a luxury that some people can't afford (maybe even more so for a low-resource language).

I would gladly open a PR to propose the feature, but I'm not sure how to go about the implementation.
Would adding a stopnet delay (delaying the stop signal by n steps) solve this issue?
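
To make the idea concrete, here is a toy sketch of a delayed stop rule. It is not the Tacotron decoder: stop_step, the 0.5 threshold, and the list of per-step stop probabilities are all assumptions for illustration.

```python
def stop_step(stop_probs, threshold=0.5, delay=0):
    """Return the index of the last decoder step to keep.

    Decoding would normally end at the first step whose stop probability
    exceeds the threshold; `delay` keeps that many extra steps, capped at
    the number of steps available.
    """
    last = len(stop_probs) - 1
    for step, prob in enumerate(stop_probs):
        if prob > threshold:
            return min(step + delay, last)
    # The stop signal never fired: keep everything.
    return last


probs = [0.1, 0.2, 0.9, 0.95, 0.99]
print(stop_step(probs, delay=0))
print(stop_step(probs, delay=2))
```

In a real decoder the extra steps would still have to be generated after the stop signal fires, so the delay would sit inside the inference loop rather than being applied afterwards.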

Can't train a model, "[!] {name} is not a valid value"

I'm trying to train a model with my own dataset, and I got this error. The same thing applied when I used the default LJSpeech dataset: https://pastebin.com/WugD8rZt

In Random Window Discriminator, feats is an empty list

In Random Window discriminator, feats list is defined, but it is not updated. Is this by design?

    def forward(self, x, c):
        scores = []
        feats = []
        # unconditional pass
        for (window_size, layer) in zip(self.window_sizes,
                                        self.unconditional_discriminators):
            index = np.random.randint(x.shape[-1] - window_size)

            score = layer(x[:, :, index:index + window_size])
            scores.append(score)

        # conditional pass
        for (window_size, layer) in zip(self.window_sizes,
                                        self.conditional_discriminators):
            frame_size = window_size // self.hop_length
            lc_index = np.random.randint(c.shape[-1] - frame_size)
            sample_index = lc_index * self.hop_length
            x_sub = x[:, :,
                      sample_index:(lc_index + frame_size) * self.hop_length]
            c_sub = c[:, :, lc_index:lc_index + frame_size]

            score = layer(x_sub, c_sub)
            scores.append(score)
        return scores, feats

Also, thank you for this project. It's awesome :)

[Feature request] accessing version variable

Is your feature request related to a problem? Please describe.
It would be nice to see which version of TTS I am currently using via TTS.__version__.

Describe the solution you'd like
Add __version__ information in a _version.py file
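
Until such an attribute exists, the installed version can be read from the package metadata. This is a sketch using the standard library (Python 3.8+); installed_version is a hypothetical helper, and the distribution name "TTS" is taken from the pip install TTS command used elsewhere on this page.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Prints the version string if TTS is installed, otherwise None.
print(installed_version("TTS"))
```

A _version.py file could expose the same value as __version__ so that it is available without a metadata lookup.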

Default output_path for tts endpoint points the wrong place.

The tts endpoint saves the output wav to the same path as synthesize.py, but it should save it to the directory the command is called from.

GravesAttention with Tacotron 1 yields empty alignment plots during training and throws a no-attribute error during inference

I've trained a model using T1 with GST and GravesAttention. During training, all training and eval alignment plots were empty (trained for 80k+ steps). The model produced audio in TensorBoard; however, when I used the logic from one of the notebooks to evaluate the model and synthesize speech, it threw the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx', referring to layers/tacotron.py --> 478 self.attention.init_win_idx(). I suspect that the Tacotron 1 model is not configured to use GravesAttention, because some of the methods called in layers/tacotron.py do not exist in the GravesAttention class.

Config:

{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {

"fft_size": 1024, 
"win_length": 1024, 
"hop_length": 256, 
"frame_length_ms": null, 
"frame_shift_ms": null, 


"sample_rate": 24000, 
"preemphasis": 0.0, 
"ref_level_db": 20, 


"do_trim_silence": true, 
"trim_db": 60, 


"power": 1.5, 
"griffin_lim_iters": 60, 


"num_mels": 80, 
"mel_fmin": 95.0, 
"mel_fmax": 12000.0, 
"spec_gain": 20,


"signal_norm": true, 
"min_level_db": -100, 
"symmetric_norm": true, 
"max_norm": 4.0, 
"clip_norm": true, 
"stats_path": null
},

"distributed": {
"backend": "nccl",
"url": "tcp://localhost:54321"
},

"reinit_layers": [],

"batch_size": 128,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,

"loss_masking": false,
"decoder_loss_alpha": 0.5,
"postnet_loss_alpha": 0.25,
"postnet_diff_spec_alpha": 0.25,
"decoder_diff_spec_alpha": 0.25,
"decoder_ssim_alpha": 0.5,
"postnet_ssim_alpha": 0.25,
"ga_alpha": 5.0,
"stopnet_pos_weight": 15.0,

"run_eval": true,
"test_delay_epochs": 10,
"test_sentences_file": null,

"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 300000,
"lr": 0.0001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,

"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,

"attention_type": "graves",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": true,
"bidirectional_decoder": false,
"double_decoder_consistency": false,
"ddc_r": 7,

"stopnet": true,
"separate_stopnet": true,

"print_step": 25,
"tb_plot_step": 100,
"print_eval": false,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,

"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 8,
"num_val_loader_workers": 8,
"batch_group_size": 4,
"min_seq_len": 6,
"max_seq_len": 153,
"compute_input_seq_cache": false,
"use_noise_augment": true,

"output_path": "/home/big-boy/Models/Blizzard/",

"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",

"use_speaker_embedding": false,
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": {
"gst_style_input": null,
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},

"datasets":
[{
"name": "ljspeech",
"path": "/Data/blizzard2013/segmented/",
"meta_file_train": "metadata.csv",
"meta_file_val": null
}]
}

Alignment plots:

[Bug] DDC-TTS_Universal-Fullband-MelGAN_MAI-karen_savage_ES.ipynb

Hi, while trying to run the Colab tutorial for synthesizing Spanish speech, I got an error when executing the following line:

align, spec, stop_tokens, wav = tts(vocoder_model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)

This is the error:

in tts(model, text, CONFIG, use_cuda, ap, use_gl, figures)
12 t_1 = time.time()
13 waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
---> 14 truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
15 print(mel_postnet_spec.shape)
16 mel_postnet_spec = ap._denormalize(mel_postnet_spec.T).T

/content/TTS_repo/TTS/tts/utils/synthesis.py in synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav, truncated, enable_eos_bos_chars, use_griffin_lim, do_trim_silence, speaker_embedding, backend)
239 if backend == 'torch':
240 decoder_output, postnet_output, alignments, stop_tokens = run_model_torch(
--> 241 model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings=speaker_embedding)
242 postnet_output, decoder_output, alignment, stop_tokens = parse_outputs_torch(
243 postnet_output, decoder_output, alignments, stop_tokens)

/content/TTS_repo/TTS/tts/utils/synthesis.py in run_model_torch(model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings)
57 else:
58 decoder_output, postnet_output, alignments, stop_tokens = model.inference(
---> 59 inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
60 elif 'glow' in CONFIG.model.lower():
61 inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device) # pylint: disable=not-callable

/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.__class__():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29

TypeError: inference() got an unexpected keyword argument 'speaker_ids'

Thanks!

[Bug] Not able to install

pip install TTS

Failed to build TTS ERROR: Could not build wheels for TTS which use PEP 517 and cannot be installed directly

Adding more details about the state of my machine. I am on Windows 10.
I am using python 3.8.0 installed via pyenv. Here are my current package versions under python 3.8.0 environment:

pip list

Package Version

absl-py 0.12.0
appdirs 1.4.4
argon2-cffi 20.1.0
astor 0.8.1
astunparse 1.6.3
async-generator 1.10
attrs 20.3.0
audioread 2.1.9
backcall 0.2.0
bar-chart-race 0.1.0
black 20.8b1
bleach 1.5.0
cachetools 4.2.1
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 7.1.2
colorama 0.4.4
cycler 0.10.0
decorator 5.0.5
defusedxml 0.7.1
dill 0.3.3
entrypoints 0.3
flatbuffers 1.12
gast 0.3.3
google-auth 1.28.0
google-auth-oauthlib 0.4.4
google-pasta 0.2.0
grpcio 1.32.0
gTTS 2.2.2
h5py 2.10.0
html5lib 0.9999999
idna 2.10
inflect 5.3.0
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
jedi 0.18.0
Jinja2 2.11.3
joblib 1.0.1
jsonpatch 1.32
jsonpointer 2.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.13
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
librosa 0.8.0
llvmlite 0.36.0
Markdown 3.3.4
MarkupSafe 1.1.1
matplotlib 3.3.3
mistune 0.8.4
multiprocess 0.70.11.1
mypy-extensions 0.4.3
nbclient 0.5.3
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
notebook 6.3.0
numba 0.53.1
numpy 1.19.3
oauthlib 3.1.0
opt-einsum 3.3.0
packaging 20.9
pandas 1.2.3
pandocfilters 1.4.3
parso 0.8.2
pathspec 0.8.1
pep517 0.10.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.0.1
pooch 1.3.0
prometheus-client 0.10.0
prompt-toolkit 3.0.18
protobuf 3.15.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
Pygments 2.8.1
pynndescent 0.5.2
pyparsing 2.4.7
PyQt5 5.15.4
PyQt5-Qt5 5.15.2
PyQt5-sip 12.8.1
pyrsistent 0.17.3
python-dateutil 2.8.1
pytz 2021.1
pywin32 300
pywinpty 0.5.7
pyzmq 22.0.3
qtconsole 5.0.3
QtPy 1.9.0
regex 2021.4.4
requests 2.25.1
requests-oauthlib 1.3.0
resampy 0.2.2
rsa 4.7.2
scikit-learn 0.24.1
scipy 1.6.2
Send2Trash 1.5.0
setuptools 54.2.0
six 1.15.0
sounddevice 0.4.1
SoundFile 0.10.3.post1
tensorboard 1.14.0
tensorboard-plugin-wit 1.8.0
tensorflow 1.14.0
tensorflow-estimator 1.14.0
tensorflow-hub 0.11.0
tensorflow-tensorboard 1.5.1
termcolor 1.1.0
terminado 0.9.4
testpath 0.4.4
threadpoolctl 2.1.0
toml 0.10.2
torch 1.8.1+cpu
torchaudio 0.8.1
torchfile 0.1.0
torchvision 0.9.1+cpu
tornado 6.1
tqdm 4.60.0
traitlets 5.0.5
typed-ast 1.4.2
typing-extensions 3.7.4.3
umap-learn 0.5.1
Unidecode 1.2.0
urllib3 1.26.4
visdom 0.1.8.9
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
Werkzeug 1.0.1
wheel 0.36.2
widgetsnbextension 3.5.1
wrapt 1.12.1

[Bug] Training using --restore_path fails with param 'initial_lr' not specified message.

Describe the bug
Training using --restore_path fails with a "param 'initial_lr' is not specified" message.

To Reproduce
Steps to reproduce the behavior:

python TTS/bin/train_tacotron.py --config_path config.json --restore_path best_model.pth.tar
See error:
  File "TTS/bin/train_tacotron.py", line 664, in <module>
    main(args)
  File "TTS/bin/train_tacotron.py", line 580, in main
    scheduler = NoamLR(optimizer,
  File "/root/TTS/TTS/utils/training.py", line 94, in __init__
    super(NoamLR, self).__init__(optimizer, last_epoch)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 39, in __init__
    raise KeyError("param 'initial_lr' is not specified "
KeyError: "param 'initial_lr' is not specified in param_groups[0] when resuming an optimizer"

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu Docker 18.04.4 LTS
PyTorch or TensorFlow version (use command below): torch=1.8.1+cu111
Python version: 3.8.3
CUDA/cuDNN version: 11
GPU model and memory: V100
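
PyTorch schedulers raise this KeyError when they are created with a last_epoch other than -1 and a parameter group has no 'initial_lr' entry. A commonly used workaround, not confirmed as this project's fix, is to fill in the missing key before building the scheduler. The sketch below uses plain dicts as a stand-in for optimizer.param_groups, and ensure_initial_lr is a hypothetical helper.

```python
def ensure_initial_lr(param_groups):
    """Add 'initial_lr' to each group that lacks it, using its current 'lr'."""
    for group in param_groups:
        # setdefault leaves an existing 'initial_lr' untouched.
        group.setdefault("initial_lr", group["lr"])
    return param_groups


# Plain dicts standing in for optimizer.param_groups after a restore.
groups = [{"lr": 0.0001}, {"lr": 0.001, "initial_lr": 0.01}]
ensure_initial_lr(groups)
print(groups)
```

In a training script the call would be ensure_initial_lr(optimizer.param_groups) just before the scheduler is constructed. Whether the current lr is the right initial value depends on how the checkpoint was saved.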

[Bug] save_spectogram seems not implemented

Hello there!

Thanks for the project. I think the saving of raw spectrograms through --save_spectogram is not implemented, right? If so, maybe we can use this issue to track its development.

Crash while saving checkpoint : "Audio buffer is not finite everywhere"

Hi,

First of all, thanks for all this great code!

Now, I'm training a new Tacotron2 using a Hindi dataset: 25 hours, 12,000 audio files, single speaker, not noisy, silences trimmed.

At 10000 global steps, when the model tries to save the checkpoint, it crashes with the message "Audio buffer is not finite everywhere". I've been trying to tweak the config parameters, but to no avail.

Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
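
librosa raises this error when the signal handed to `stft` contains NaN or inf values, so one way to narrow the problem down is to check the predicted spectrogram before it reaches Griffin-Lim. The sketch below is a plain-Python illustration with a hypothetical helper name; on NumPy arrays the equivalent check would be `np.isfinite(spec).all()`.

```python
import math


def first_non_finite_frame(spec):
    """Return the index of the first frame containing NaN or inf, else None.

    `spec` is a sequence of frames, each a sequence of floats.
    """
    for index, frame in enumerate(spec):
        if not all(math.isfinite(value) for value in frame):
            return index
    return None


good = [[0.1, 0.2], [0.3, 0.4]]
bad = [[0.1, 0.2], [float("inf"), 0.4], [float("nan"), 0.0]]
print(first_non_finite_frame(good), first_non_finite_frame(bad))
```

If the model output is already non-finite at this point, the problem is upstream of the audio code (for example in the loss or in mixed-precision training) rather than in the checkpoint-saving step itself.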

I'd really appreciate any hints to what might be causing this.

SpeedySpeech model causes error for input text shorter than 13 characters.

Due to the model's architecture and its total receptive field, input text shorter than 13 characters causes errors.

This can be fixed by padding the input text with empty characters.
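
The padding idea could look roughly like this. It is only a sketch: the function name and `pad_id=0` are assumptions for illustration, and the minimum length of 13 is taken from the description above.

```python
def pad_token_ids(token_ids, min_len=13, pad_id=0):
    """Right-pad a list of token IDs so it has at least `min_len` entries.

    `pad_id` is assumed to be the ID of the padding ("empty") character.
    """
    missing = max(0, min_len - len(token_ids))
    return list(token_ids) + [pad_id] * missing


short = pad_token_ids([5, 6, 7])
long_enough = pad_token_ids(list(range(1, 21)))
print(len(short), len(long_enough))
```

Inputs that are already long enough pass through unchanged, so the padding only affects the short texts that trigger the error.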

(venv) $ tts --model_name tts_models/en/ljspeech/speedy-speech-wn --text "Hey Bruce, what's good in the neighborhood?"
 > tts_models/en/ljspeech/speedy-speech-wn is already downloaded.
 > vocoder_models/en/ljspeech/multiband-melgan is already downloaded.
 > Using model: speedy_speech
Traceback (most recent call last):
  File "/home/josh/venv/bin/tts", line 8, in <module>
    sys.exit(main())
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/bin/synthesize.py", line 190, in main
    synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 47, in __init__
    use_cuda)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 96, in load_tts
    self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
  File "/home/josh/venv/lib/python3.6/site-packages/TTS/tts/models/speedy_speech.py", line 196, in load_checkpoint
    self.load_state_dict(state['model'])
  File "/home/josh/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SpeedySpeech:
	size mismatch for emb.weight: copying a param with shape torch.Size([129, 128]) from checkpoint, the shape in current model is torch.Size([130, 128]).

Thanks, @JRMeyer, for pointing this out 👑

'avg_align_error' does not change on Tensorboard.

'avg_align_error' does not change at validation for non-Tacotron models on Tensorboard.

This is probably because the attention maps are binary for these models and the alignment error does not work correctly with them.
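
To illustrate the point, here is a small sketch that assumes the alignment error is one minus the mean, over decoder steps, of the largest attention weight in each step. That definition is an assumption about the metric, not something confirmed in this issue. Under it, any binary attention map scores exactly zero, whatever the alignment looks like.

```python
def align_error(attention):
    """1 - mean over decoder steps of the largest attention weight per step.

    `attention` is a list of decoder steps, each a list of weights over inputs.
    """
    per_step_max = [max(step) for step in attention]
    return 1.0 - sum(per_step_max) / len(per_step_max)


soft = [[0.7, 0.2, 0.1], [0.3, 0.5, 0.2]]  # soft attention: error reflects sharpness
hard_a = [[1, 0, 0], [0, 1, 0]]            # binary map, diagonal alignment
hard_b = [[0, 0, 1], [0, 0, 1]]            # binary map, very different alignment
print(align_error(soft), align_error(hard_a), align_error(hard_b))
```

Both binary maps give the same value, which would explain a flat curve on Tensorboard for models with hard alignments.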

[Feature request] Test gpu memory capacity right from the start (training)

It would be cool to test if the GPU memory is enough for the model/config/dataset combo right from the start because it wastes time and money to start training only to discover that your training failed because of an OOM error.

I would suggest building the first batch of the first epoch from the longest samples (by duration/seq_length).
Or do a warmup batch built the same way, plus a loaded test batch.
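
Picking the samples for such a warm-up batch could be as simple as the sketch below. The function name is hypothetical, and the sample lengths are assumed to be known before training starts.

```python
def longest_batch(sample_lengths, batch_size):
    """Return the indices of the `batch_size` longest samples, longest first."""
    order = sorted(
        range(len(sample_lengths)),
        key=lambda index: sample_lengths[index],
        reverse=True,
    )
    return order[:batch_size]


lengths = [120, 45, 300, 80, 299, 10]
warmup_indices = longest_batch(lengths, batch_size=3)
print(warmup_indices)
```

Running one forward and backward pass on this batch before the real training loop would surface most OOM errors in the first minute, since no later batch can be longer.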

RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]

Hey,

I'm trying to run a training with Tacotron 1 using GST. I get the error on the first batch already.

PyTorch version: 1.8 and 1.7.1 (both yielded the same error)
Python version: 3.8.0

Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 168, in train
decoder_output, postnet_output, alignments, stop_tokens = model(
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/models/tacotron.py", line 173, in forward
decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]

My hyperparams:
// TRAINING
"batch_size": 64,
"eval_batch_size": 16,
"r": 4,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,

// MULTI-SPEAKER and GST
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": { // gst parameter if gst is enabled
"gst_style_input": null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},

Noam LR Scheduling - AttributeError: 'NoneType' object has no attribute 'step'

I'm working on a step-wise learning rate scheduling method and I wanted to take inspiration from the NoamLR() class found in training.py. When I set noam_schedule: true in the config, the following error is shown.

File "TTS/bin/train_tacotron.py", line 674, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 640, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 154, in train
scheduler.step()
AttributeError: 'NoneType' object has no attribute 'step'
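
For anyone building a step-wise schedule on top of this, the commonly used Noam formula can be sketched as follows. This is the textbook form written as a standalone function with a hypothetical name; it is not claimed to match `NoamLR` in training.py line for line. The example values `lr=0.0001` and `warmup_steps=4000` are the ones used in the configs posted in these issues.

```python
def noam_lr(step, base_lr, warmup_steps):
    """Learning rate at `step`: linear warm-up, then inverse-square-root decay."""
    step = max(step, 1)  # avoid a zero step in the power terms
    scale = min(step * warmup_steps ** -1.5, step ** -0.5)
    return base_lr * warmup_steps ** 0.5 * scale


for step in (1000, 4000, 16000):
    print(step, noam_lr(step, base_lr=1e-4, warmup_steps=4000))
```

The rate climbs linearly until `warmup_steps`, peaks at `base_lr` there, and then decays with the inverse square root of the step.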

Tacotron model uses Tacotron2 losses

/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py:94: UserWarning: Using a target size (torch.Size([64, 90, 80])) that is different to the input size (torch.Size([64, 90, 513])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.l1_loss(input, target, reduction=self.reduction)
! Run is removed from /home/big-boy/Models/Blizzard/blizzard-gts-March-11-2021_05+38PM-45068a9
Traceback (most recent call last):
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 180, in train
loss_dict = criterion(postnet_output, decoder_output, mel_input,
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 377, in forward
postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 203, in forward
return self.loss_func(x_diff, target_diff)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 94, in forward
return F.l1_loss(input, target, reduction=self.reduction)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/functional.py", line 2633, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/functional.py", line 71, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore
RuntimeError: The size of tensor a (513) must match the size of tensor b (80) at non-singleton dimension 2

Originally posted by @a-froghyar in #370 (comment)

You can try noam_schedule: True to let the model stabilize initially with lower learning rates.

Also set tb_model_param_stats: True to watch model layer stats on TensorBoard. It shows you if something is wrong with any of the layers.

Originally posted by @erogol in #388 (comment)

Vocoder training fails on Windows versions of Python

Trying to train a WaveGrad Vocoder on Python 3.8.8 for Windows 10 yields this error:

Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 417, in main
    best_loss = save_best_model(
  File "TTS\vocoder\utils\io.py", line 97, in save_best_model
    os.symlink(best_model_name, os.path.join(out_path, link_name))
OSError: [WinError 1314] A required privilege is not held by the client: 'best_model_1.pth.tar' -> 'c:/path/to/my/project/best_model.pth.tar'

This is because Windows only lets admins create symlinks for... reasons...
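
One way around it, as a sketch only: try the symlink first and fall back to copying the file when the OS refuses. The helper name `link_or_copy` is made up for this example and is not part of TTS; the file names mirror the ones in the traceback above.

```python
import os
import shutil
import tempfile


def link_or_copy(src, dst):
    """Point `dst` at `src` via a symlink; copy the file if symlinks are refused."""
    if os.path.lexists(dst):
        os.remove(dst)  # replace a stale link or copy from an earlier save
    try:
        os.symlink(src, dst)
        return "symlink"
    except (OSError, NotImplementedError):
        # e.g. WinError 1314 when the user lacks the symlink privilege
        shutil.copyfile(src, dst)
        return "copy"


with tempfile.TemporaryDirectory() as out_path:
    best = os.path.join(out_path, "best_model_1.pth.tar")
    with open(best, "wb") as handle:
        handle.write(b"weights")
    link = os.path.join(out_path, "best_model.pth.tar")
    how = link_or_copy(best, link)
    with open(link, "rb") as handle:
        content = handle.read()
    print(how, content)
```

The copy costs extra disk space per best model, but it keeps training from crashing on machines where symlinks need elevated rights.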

starting from best_model is not the best option when --continue_training

@gerazov, based on mozilla/TTS#679, starting from the last best_model looks confusing for people, and sometimes I found myself removing the best model to continue from the last checkpoint.

I suggest reverting to the old behavior, where we pick the last checkpoint.
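
Picking the last checkpoint could be sketched like this, assuming checkpoints follow the `checkpoint_<step>.pth.tar` naming seen in the training logs. The function name is hypothetical.

```python
import re


def latest_checkpoint(file_names):
    """Return the checkpoint_<step>.pth.tar name with the highest step, or None."""
    pattern = re.compile(r"^checkpoint_(\d+)\.pth\.tar$")
    found = []
    for name in file_names:
        match = pattern.match(name)
        if match:
            found.append((int(match.group(1)), name))
    return max(found)[1] if found else None


names = [
    "best_model.pth.tar",
    "checkpoint_200.pth.tar",
    "checkpoint_10000.pth.tar",
    "config.json",
]
print(latest_checkpoint(names))
```

Comparing the step as an integer matters here, because a plain string sort would put `checkpoint_200` after `checkpoint_10000`.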

🐸 TTS roadmap

These are the main dev plans for 🐸 TTS.

If you want to contribute to 🐸 TTS and don't know where to start you can pick one here and start with our Contribution Guideline. We're also always here to help.

Feel free to pick one or suggest a new one.

Contributions are always welcome 💪 .

v0.1.0 Milestones

v0.2.0 Milestones

Grapheme 2 Phoneme in-house conversion. (Thx to gruut 👍 )
Implement VITS model.

v0.3.0 Milestones

Implement generic ForwardTTS API.
Implement Fast Speech model.
Implement Fast Pitch model.

v0.4.0 Milestones

Trainer API v2 - join the discussion
Multi-speaker VCTK recipes for all the TTS.tts models.

v0.5.0 Milestones

Support for multi-lingual models
YourTTS release 🚀

v0.6.0 Milestones

Add ESpeak support
New Tokenizer and Phonemizer APIs #937
New Model API #1078
Splitting the trainer as a separate repo 👟Trainer
Update VITS model API
Gradient accumulation. #560 (in 👟)

v0.7.0 Milestones

Implement Capacitron 👑 @a-froghyar 👑 @WeberJulian
Release pretrained Capacitron

v0.8.0 Milestones

Separate numpy transforms
Better data sampling for VITS
New Thorsten DE models 👑 @thorstenMueller

🏃‍♀️ Milestones along the way

🤖 New TTS models

[Feature request] Pass config values to TensorBoard

Is your feature request related to a problem? Please describe.
It is hard to compare models with different configurations by just looking at Tensorboard.

Describe the solution you'd like
We could log the configuration fields to TensorBoard.
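
A rough sketch of the idea: flatten the nested config into key/value pairs and render them as a Markdown table, which a text-logging call such as `SummaryWriter.add_text` from `torch.utils.tensorboard` could then display. Both helper names below are hypothetical.

```python
def flatten_config(config, prefix=""):
    """Flatten a nested config dict into {'audio/sample_rate': 22050, ...}."""
    flat = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_config(value, prefix=f"{name}/"))
        else:
            flat[name] = value
    return flat


def config_to_markdown(config):
    """Render the flattened config as a Markdown table."""
    rows = ["| field | value |", "| --- | --- |"]
    rows += [f"| {key} | {value} |" for key, value in sorted(flatten_config(config).items())]
    return "\n".join(rows)


example = {"model": "Tacotron2", "audio": {"sample_rate": 22050, "num_mels": 80}}
print(config_to_markdown(example))
```

With the table logged once at the start of each run, two runs can be compared side by side in the TensorBoard text tab.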

[Feature request] prosody rate, style emotions, expressiveness, aggressiveness, pace, etc.

The resemble.ai system has markup like:

<prosody rate="45%"><style emotions="expressiveness:0.9
aggressiveness:0.5 pace:0.2">
<say-as interpret-as="characters">Zeuxis</style></say-as>

Is this open sourced in coqui?

[Discussion] Ideas for better model config management

(I keep it in the issues to refer back to the initial discussion)

Hi All!!

I guess one of the biggest issues in TTS is the way we handle the configs for models and training. Putting example config files under the config folder is hard to maintain and looks complicated for people to start using TTS.

So I want to discuss here some better alternatives and ask for the wisdom of the crowd 🧑‍🤝‍🧑.

A couple of constraints we need to consider, off the top of my head:

Configs should not be Python specific, and they should be in a generic form that can be serialized and loaded by other systems and programming languages. So if someone wants to export the model and use it in an embedded system, the config file should not be a problem.
Configs should allow easy experimentation, collaboration, and reproduction.
Each model should explain its config fields. Right now I do this in config.json by violating the JSON format with comments. It is not optimal ☹️.

If you have an idea please share it below and let's discuss it.

Edit:

I should also add one more constraint.

We should solve this with no dependencies if possible.
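
To make the discussion concrete, here is one possible direction using only the standard library: dataclasses hold the defaults and a per-field help string, while the serialized form stays plain JSON without comments. All class and field names below are illustrative, not a proposal for the final layout.

```python
import json
from dataclasses import asdict, dataclass, field, fields


@dataclass
class AudioConfig:
    sample_rate: int = field(default=22050, metadata={"help": "wav sample rate"})
    num_mels: int = field(default=80, metadata={"help": "size of the mel spec frame"})


@dataclass
class TrainConfig:
    model: str = field(default="Tacotron2", metadata={"help": "model name"})
    batch_size: int = field(default=32, metadata={"help": "batch size for training"})
    audio: AudioConfig = field(default_factory=AudioConfig)


def describe(config_cls):
    """Return {field name: help text}; this replaces the comments in config.json."""
    return {f.name: f.metadata.get("help", "") for f in fields(config_cls)}


def from_json(text):
    """Rebuild a TrainConfig from its JSON form."""
    raw = json.loads(text)
    raw["audio"] = AudioConfig(**raw["audio"])
    return TrainConfig(**raw)


config = TrainConfig(batch_size=64)
as_json = json.dumps(asdict(config), indent=2)  # valid JSON, readable from any language
restored = from_json(as_json)
print(as_json)
print(describe(AudioConfig))
```

The JSON file stays language neutral, while the field documentation lives next to the defaults in code instead of in comments that break the format.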

NOTE: This is a continuation of the previously started conversation mozilla/TTS#660

Originally posted by @erogol in #20

Crashing while saving checkpoint

Hi,

I am trying to train a Tacotron2 model in Hindi. I have my own 25 hour single speaker cleaned dataset. I'm using the following configuration.

{
"model": "Tacotron2",
"run_name": "hindi-ddc",
"run_description": "tacotron2 with DDC and differential spectral loss.",

// AUDIO PARAMETERS
"audio":{
    // stft parameters
    "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectrogram frame.
    "win_length": 1024,      // stft window length in samples.
    "hop_length": 256,       // stft window hop-length in samples.
    "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
    "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

    // Audio processing parameters
    "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate.
    "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no -pre-emphasis.
    "ref_level_db": 20,     // reference level db, theoretically 20db is the sound of air.

    // Silence trimming
    "do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (true), TWEB (false), Nancy (true)
    "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

    // Griffin-Lim
    "power": 1.5,           // value to sharpen wav signals after GL algorithm.
    "griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.

    // MelSpectrogram parameters
    "num_mels": 80,         // size of the mel spec frame.
    "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
    "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!
    "spec_gain": 1,

    // Normalization parameters
    "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
    "min_level_db": -100,   // lower bound for normalization
    "symmetric_norm": true, // move normalization to range [-1, 1]
    "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
    "clip_norm": true,      // clip normalized values into the range.
    "stats_path": null    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
},

// VOCABULARY PARAMETERS
// if custom character set is not defined,
// default set in symbols.py is used
"characters":{
    "pad": "_",
    "eos": "~",
    "bos": "^",
    "characters": "अआइईउऊऋएऐऑओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहह़ा",
    "punctuations":"!'\",.:?। ",
    "phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
},


// DISTRIBUTED TRAINING
"distributed":{
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

// TRAINING
"batch_size": 32,       // Batch size for training. Values lower than 32 might make attention hard to learn. It is overwritten by 'gradual_training'.
"eval_batch_size":16,
"r": 7,                 // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.
"mixed_precision": true,     // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.

// LOSS SETTINGS
"loss_masking": true,       // enable / disable loss masking against the sequence padding.
"decoder_loss_alpha": 0.5,  // original decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
"postnet_diff_spec_alpha": 0.25,     // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.25,     // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.5,     // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.25,     // postnet ssim loss weight. If > 0, it is enabled
"ga_alpha": 5.0,           // weight for guided attention loss. If > 0, guided attention is enabled.
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.


// VALIDATION
"run_eval": true,
"test_delay_epochs": 10,  //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null,  // set a file to load sentences to be used for testing. If it is null then we use default english sentences.

// OPTIMIZER
"noam_schedule": false,        // use noam warmup and lr schedule.
"grad_clip": 1.0,              // upper limit for gradients for clipping.
"epochs": 1000,                // total number of epochs to train.
"lr": 0.0001,                  // Initial learning rate. If Noam decay is active, maximum learning rate.
"wd": 0.000001,                // Weight decay weight.
"warmup_steps": 4000,          // Noam decay steps to increase the learning rate from 0 to "lr"
"seq_len_norm": false,         // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.

// TACOTRON PRENET
"memory_size": -1,             // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
"prenet_type": "original",     // "original" or "bn".
"prenet_dropout": false,       // enable/disable dropout at prenet.

// TACOTRON ATTENTION
"attention_type": "original",  // 'original' , 'graves', 'dynamic_convolution'
"attention_heads": 4,          // number of attention heads (only for 'graves')
"attention_norm": "sigmoid",   // softmax or sigmoid.
"windowing": false,            // Enables attention windowing. Used only in eval mode.
"use_forward_attn": false,     // if it uses forward attention. In general, it aligns faster.
"forward_attn_mask": false,    // Additional masking forcing monotonicity only in eval mode.
"transition_agent": false,     // enable/disable transition agent of forward attention.
"location_attn": true,         // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false,  // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.
"double_decoder_consistency": true,  // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
"ddc_r": 7,                           // reduction rate for coarse decoder.

// STOPNET
"stopnet": true,               // Train stopnet predicting the end of synthesis.
"separate_stopnet": true,      // Train stopnet separately if 'stopnet==true'. It prevents stopnet loss from influencing the rest of the model. It results in a better model, but it trains SLOWER.

// TENSORBOARD and LOGGING
"print_step": 25,       // Number of steps to log training on console.
"tb_plot_step": 100,    // Number of steps to plot TB training figures.
"print_eval": false,     // If True, it prints intermediate loss values in evaluation.
"save_step": 200,      // Number of training steps expected to save training stats and checkpoints.
"checkpoint": true,     // If true, it saves checkpoints per "save_step"
"keep_all_best": false,  // If true, keeps all best_models after keep_after steps
"keep_after": 10000,    // Global step after which to keep best models if keep_all_best is true
"tb_model_param_stats": false,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

// DATA LOADING
"text_cleaner": "basic_cleaners",
"enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
"num_loader_workers": 2,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 2,    // number of evaluation data loader processes.
"batch_group_size": 4,  //Number of batches to shuffle after bucketing.
"min_seq_len": 81,       // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 186,     // DATASET-RELATED: maximum text length
"compute_input_seq_cache": false,  // if true, text sequences are computed before starting training. If phonemes are enabled, they are also computed at this stage.
"use_noise_augment": true,

// PATHS
"output_path": "/home/ubuntu/output/",

// PHONEMES
"phoneme_cache_path": "/home/ubuntu/phoneme_cache/",  // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": false,           // use phonemes instead of raw characters. It is suggested for better pronunciation.
"phoneme_language": "hi",     // depending on your target language, pick one from  https://github.com/bootphon/phonemizer#languages

// MULTI-SPEAKER and GST
"use_speaker_embedding": false,      // use speaker embedding to enable multi-speaker learning.
"use_gst": false,                    // use global style tokens
"use_external_speaker_embedding_file": false, // if true, forces the model to use external embedding per sample instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
"external_speaker_embedding_file": "../../speakers-vctk-en.json", // if not null and use_external_speaker_embedding_file is true, it is used to load a specific embedding file and thus uses these embeddings instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
"gst":	{			                // gst parameter if gst is enabled
    "gst_style_input": null,        // Condition the style input either on a
                                    // -> wave file [path to wave] or
                                    // -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
                                    // with the dictionary being len(dict) <= len(gst_style_tokens).
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
},

// DATASETS
"datasets":   // List of datasets. They are all merged and they get different speaker_ids.
    [
        {
            "name": "hindi",
            "path": "/dev/data/hindidataset/",
            "meta_file_train": "metadata.csv", // for VCTK: if a list is given, the speaker ids in the list are ignored for training; this is useful for testing cloning with new speakers
            "meta_file_val": null
        }
    ]

}

The stacktrace I'm hitting is below.

CHECKPOINT : /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69/checkpoint_200.pth.tar
/home/ubuntu/TTS/TTS/utils/audio.py:234: RuntimeWarning: overflow encountered in power
return np.power(10.0, x / self.spec_gain)
! Run is kept in /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere

I've been trying to debug for 2 days but not able to make progress. I'd really appreciate any help/suggestions.

Weird Training Plots and No Audio After 70K Steps - Taco1 GST Blizzard

Hey, I'm getting weird plots and no audio produced in the tensorboard examples after 70k Steps. I'm using a custom Blizzard dataset that I've already trained other models with that produced intelligible speech after 20k steps. The training has also stopped after 70K steps because of the NaN decoder_loss error. I'm using the #373 patched dev branch with the following config file:
{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {

    "fft_size": 1024, 
    "win_length": 1024, 
    "hop_length": 256, 
    "frame_length_ms": null, 
    "frame_shift_ms": null, 

    
    "sample_rate": 24000, 
    "preemphasis": 0.0, 
    "ref_level_db": 20, 

    
    "do_trim_silence": true, 
    "trim_db": 60, 

    
    "power": 1.5, 
    "griffin_lim_iters": 60, 

    
    "num_mels": 80, 
    "mel_fmin": 95.0, 
    "mel_fmax": 12000.0, 
    "spec_gain": 1,

    
    "signal_norm": true, 
    "min_level_db": -100, 
    "symmetric_norm": true, 
    "max_norm": 4.0, 
    "clip_norm": true, 
    "stats_path": null 
},


"distributed": {
    "backend": "nccl",
    "url": "tcp:\/\/localhost:54321"
},

"reinit_layers": [], 


"batch_size": 128, 
"eval_batch_size": 16,
"r": 7, 
"gradual_training": [
    [0, 7, 64],
    [1, 5, 64],
    [50000, 3, 32],
    [130000, 2, 32],
    [290000, 1, 32]
], 
"mixed_precision": true, 


"loss_masking": false, 
"decoder_loss_alpha": 0.5, 
"postnet_loss_alpha": 0.25, 
"postnet_diff_spec_alpha": 0.25, 
"decoder_diff_spec_alpha": 0.25, 
"decoder_ssim_alpha": 0.5, 
"postnet_ssim_alpha": 0.25, 
"ga_alpha": 5.0, 
"stopnet_pos_weight": 15.0, 



"run_eval": true,
"test_delay_epochs": 10, 
"test_sentences_file": null, 


"noam_schedule": false, 
"grad_clip": 1.0, 
"epochs": 300000, 
"lr": 0.0001, 
"wd": 0.000001, 
"warmup_steps": 4000, 
"seq_len_norm": false, 


"memory_size": -1, 
"prenet_type": "original", 
"prenet_dropout": true, 


"attention_type": "graves", 
"attention_heads": 4, 
"attention_norm": "sigmoid", 
"windowing": false, 
"use_forward_attn": false, 
"forward_attn_mask": false, 
"transition_agent": false, 
"location_attn": true, 
"bidirectional_decoder": false, 
"double_decoder_consistency": false, 
"ddc_r": 7, 


"stopnet": true, 
"separate_stopnet": true, 


"print_step": 25, 
"tb_plot_step": 100, 
"print_eval": false, 
"save_step": 5000, 
"checkpoint": true, 
"tb_model_param_stats": false, 


"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false, 
"num_loader_workers": 8, 
"num_val_loader_workers": 8, 
"batch_group_size": 4, 
"min_seq_len": 6, 
"max_seq_len": 153, 
"compute_input_seq_cache": false, 
"use_noise_augment": true,


"output_path": "/home/big-boy/Models/Blizzard/",


"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/", 
"use_phonemes": true, 
"phoneme_language": "en-us", 


"use_speaker_embedding": false, 
"use_gst": true, 
"use_external_speaker_embedding_file": false, 
"external_speaker_embedding_file": "../../speakers-vctk-en.json", 
"gst": { 
    "gst_style_input": null, 
    "gst_embedding_dim": 512,
    "gst_num_heads": 4,
    "gst_style_tokens": 10,
    "gst_use_speaker_embedding": false
},


"datasets": 
    [{
        "name": "ljspeech",
        "path": "/home/big-boy/Data/blizzard2013/segmented/",
        "meta_file_train": "metadata.csv", 
        "meta_file_val": null
    }]

}

[Bug] TTS does not detect new model versions

Describe the bug

After upgrading from tts-0.0.9 to tts-0.0.11, the model was updated, but TTS still tries to load the cached version.
A fix could be to hash the model files and compare each cached file's hash against the expected one. This would also catch
cases where a cached model was corrupted in any way.
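
The hashing idea could be sketched as follows, assuming an expected SHA-256 digest is published alongside each model. The function names and file name are hypothetical, and where the expected digest would be stored is left open.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_model_is_current(path, expected_sha256):
    """True only if the cached file exists and matches the expected digest."""
    return os.path.isfile(path) and file_sha256(path) == expected_sha256


with tempfile.TemporaryDirectory() as cache_dir:
    model_path = os.path.join(cache_dir, "model_file.pth.tar")
    with open(model_path, "wb") as handle:
        handle.write(b"old weights")
    new_digest = hashlib.sha256(b"new weights").hexdigest()
    old_digest = hashlib.sha256(b"old weights").hexdigest()
    stale = not cached_model_is_current(model_path, new_digest)
    fresh = cached_model_is_current(model_path, old_digest)
    print(stale, fresh)
```

A mismatch would then trigger a fresh download instead of the "is already downloaded" shortcut.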

To Reproduce

$  /nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server --model_name tts_models/en/ljspeech/glow-tts --vocoder_name vocoder_models/universal/libri-tts/fullband-melgan
 > tts_models/en/ljspeech/glow-tts is already downloaded.
 > vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
 > Using model: glow_tts
Traceback (most recent call last):
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/.tts-server-wrapped", line 6, in <module>
    from TTS.server.server import main
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/server/server.py", line 62, in <module>
    synthesizer = Synthesizer(args.tts_checkpoint, args.tts_config, args.vocoder_checkpoint, args.vocoder_config, args.use_cuda)
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 45, in __init__
    self.load_tts(tts_checkpoint, tts_config,
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 95, in load_tts
    self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
  File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/tts/models/glow_tts.py", line 229, in load_checkpoint
    self.load_state_dict(state['model'])
  File "/nix/store/1vv0fsvdv9j4gmqjgjwb3c5v8x906qgd-python3.8-pytorch-1.8.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GlowTTS:
        size mismatch for encoder.emb.weight: copying a param with shape torch.Size([129, 192]) from checkpoint, the shape in current model is torch.Size([130, 192]).
/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server      4,64s user 0,71s system 117% cpu 4,541 total

Expected behavior

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): NixOS
PyTorch or TensorFlow version (use command below): pytorch 1.8.1
Python version: 3.7
CUDA/cuDNN version: no cuda
GPU model and memory: no gpu
Exact command to reproduce: see above

[Feature request] Unlimited Decoder Steps

We need a TTS with unlimited decoding steps for parsing long texts. Is it possible?

Please provide a German voice

The quality of the English examples is already very good. What's missing for it to be useful for me is a model for a German voice.

You can use this issue for related discussions and documenting progress creating it.

[Help] Share your TTS models

Please consider sharing your pre-trained models in any language (if the licenses allow it).

We can include them in our model catalogue for public use, attributed to your name (website, company, etc.).

That would let more people experiment together and coordinate, instead of making individual efforts toward similar goals.

That is also a chance to make your work more visible.

You can share in two ways:

Share the model files with us, and we will serve them with the next 🐸 TTS release.
Upload your models to GDrive and share the link.

Models are listed in the .models.json file, and every listed model is available through the tts CLI and the server endpoints.

(previously mozilla/TTS#395)

Bug: Dynamic Convolution Attention fails in `mixed_precision` training.

Describe the bug
Dynamic Convolution Attention fails in mixed_precision training and ultimately produces a NaN loss.

To Reproduce
Steps to reproduce the behavior:

1. Set mixed_precision=True in config.json.
2. Set dynamic_convolution=True in config.json.
3. Start training a tacotron or tacotron2 model.
4. On TensorBoard, the attention alignment looks broken from the start.
5. Eventually the loss becomes NaN.

Expected behavior
The model should learn the alignment after 10K iterations with no NaN loss as it does in full precision training.
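
Whether this is the cause here is not confirmed by the report, but NaN losses under mixed precision often trace back to the narrow range of half-precision floats: values above 65504 overflow, and very small values flush to zero, so a later log or division can produce Inf or NaN. A small standard-library illustration that does not touch the TTS code; `to_float16` is a hypothetical helper:

```python
import struct


def to_float16(value):
    """Round-trip a Python float through IEEE 754 half precision."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except OverflowError:
        # struct refuses finite values that do not fit in half precision
        return float("inf")


print(to_float16(65504.0))  # largest finite half-precision value
print(to_float16(70000.0))  # too large for half precision
print(to_float16(1e-8))     # too small for half precision
```

If the attention computation produces intermediate values outside this range, keeping that part of the model in full precision is one thing to try.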

Environment (please complete the following information):

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
PyTorch or TensorFlow version (use command below): Torch 1.8.0
Python version: 3.8
CUDA/cuDNN version: 11.2
GPU model and memory: 1080Ti
Exact command to reproduce:


Training WaveGrad Immediately Fails with "ValueError: The histogram is empty, please file a bug report."

With dc2954e checked out, when I try to train WaveGrad with the following config:

{
    "run_name": "wavegrad-my-project",
    "run_description": "wavegrad test",

    "audio":{
        "fft_size": 1024,         // number of stft frequency levels. Size of the linear spectrogram frame.
        "win_length": 1024,      // stft window length in samples.
        "hop_length": 256,       // stft window hop-length in samples.
        "frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
        "frame_shift_ms": null,  // stft window hop-length in ms. If null, 'hop_length' is used.

        // Audio processing parameters
        "sample_rate": 22050,   // DATASET-RELATED: wav sample-rate. If different from the original data, it is resampled.
        "preemphasis": 0.0,     // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
        "ref_level_db": 0,     // reference level db, theoretically 20db is the sound of air.

        // Silence trimming
        "do_trim_silence": false,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
        "trim_db": 60,          // threshold for trimming silence. Set this according to your dataset.

        // MelSpectrogram parameters
        "num_mels": 80,         // size of the mel spec frame.
        "mel_fmin": 50.0,        // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
        "mel_fmax": 7600.0,     // maximum freq level for mel-spec. Tune for dataset!!
        "spec_gain": 1.0,         // scalar value applied after log transform of spectrogram.

        // Normalization parameters
        "signal_norm": true,    // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
        "min_level_db": -100,   // lower bound for normalization
        "symmetric_norm": true, // move normalization to range [-1, 1]
        "max_norm": 4.0,        // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
        "clip_norm": true,      // clip normalized values into the range.
        "stats_path": "/path/to/my/project/scale_stats.npy"    // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
    },

    // DISTRIBUTED TRAINING
    "mixed_precision": true,     // enable torch mixed precision training (true, false)
    "distributed":{
        "backend": "nccl",
        "url": "tcp:\/\/localhost:54322"
    },

    "target_loss": "avg_wavegrad_loss",  // loss value to pick the best model to save after each epoch

    // MODEL PARAMETERS
    "generator_model": "wavegrad",
    "model_params":{
        "use_weight_norm": true,
        "y_conv_channels":32,
        "x_conv_channels":768,
        "ublock_out_channels": [512, 512, 256, 128, 128],
        "dblock_out_channels": [128, 128, 256, 512],
        "upsample_factors": [4, 4, 4, 2, 2],
        "upsample_dilations": [
            [1, 2, 1, 2],
            [1, 2, 1, 2],
            [1, 2, 4, 8],
            [1, 2, 4, 8],
            [1, 2, 4, 8]]
    },

    // DATASET
    "data_path": "/path/to/my/project/wavs/22.05k_edited_normalized",  // root data path. It finds all wav files recursively from there.
    "feature_path": null,   // if you use precomputed features
    "seq_len": 6144,        // 24 * hop_length
    "pad_short": 0,      // additional padding for short wavs
    "conv_pad": 0,          // additional padding against convolutions applied to spectrograms
    "use_noise_augment": false,     // add noise to the audio signal for augmentation
    "use_cache": false,      // use in memory cache to keep the computed features. This might cause OOM.

    "reinit_layers": [],    // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.

    // TRAINING
    "batch_size": 96,      // Batch size for training.

    // NOISE SCHEDULE PARAMS - Only effective at training time.
    "train_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 1000
    },
    "test_noise_schedule":{
        "min_val": 1e-6,
        "max_val": 1e-2,
        "num_steps": 50
    },

    // VALIDATION
    "run_eval": true,       // enable/disable evaluation run

    // OPTIMIZER
    "epochs": 10000,                // total number of epochs to train.
    "clip_grad": 1.0,                 // Generator gradient clipping threshold. Apply gradient clipping if > 0
    "lr_scheduler": "MultiStepLR",  // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
    "lr_scheduler_params": {
        "gamma": 0.5,
        "milestones": [100000, 200000, 300000, 400000, 500000, 600000]
    },
    "lr": 1e-4,                  // Initial learning rate. If Noam decay is active, maximum learning rate.

    // TENSORBOARD and LOGGING
    "print_step": 50,       // Number of steps to log training on console.
    "print_eval": false,     // If True, it prints loss values for each step in eval run.
    "save_step": 5000,      // Number of training steps expected to plot training stats on TB and save model checkpoints.
    "checkpoint": true,     // If true, it saves checkpoints per "save_step"
    "keep_all_best": false,  // If true, keeps all best_models after keep_after steps
    "keep_after": 10000,    // Global step after which to keep best models if keep_all_best is true
    "tb_model_param_stats": true,     // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.

    // DATA LOADING
    "num_loader_workers": 4,        // number of training data loader processes. Don't set it too big. 4-8 are good values.
    "num_val_loader_workers": 4,    // number of evaluation data loader processes.
    "eval_split_size": 256,

    // PATHS
    "output_path": "/path/to/my/project/Models"
}

... it fails immediately with this error:

 > TRAINING (2021-03-28 19:47:57)

   --> TRAIN PERFORMACE -- EPOCH TIME: 8.09 sec -- GLOBAL_STEP: 1
     | > avg_wavegrad_loss: 1.47542
     | > avg_loader_time: 16.76300
     | > avg_step_time: 8.08550

coqui-tts\lib\site-packages\torch\optim\lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[WARNING] NaN or Inf found in input tensor.
 ! Run is removed from D:/path/to/my/project
Traceback (most recent call last):
  File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
    main(args)
  File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
    _, global_step = train(model, criterion, optimizer, scheduler, scaler,
  File "./TTS/bin/train_vocoder_wavegrad.py", line 223, in train
    tb_logger.tb_model_weights(model, global_step)
  File "coqui-tts\TTS\utils\tensorboard_logger.py", line 34, in tb_model_weights
    self.writer.add_histogram(
  File "coqui-tts\lib\site-packages\tensorboardX\writer.py", line 503, in add_histogram
    histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
  File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 210, in histogram
    hist = make_histogram(values.astype(float), bins, max_bins)
  File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 248, in make_histogram
    raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
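
The `NaN or Inf found in input tensor` warning just before the traceback suggests the weights handed to `add_histogram` were already non-finite after the first step, which would leave tensorboardX with nothing to bin. That reading is an assumption based on the log alone. A minimal sketch of a guard that skips such inputs, using plain Python lists in place of tensors; `finite_only` and `log_histogram_safely` are hypothetical helpers, not part of TTS or tensorboardX:

```python
import math


def finite_only(values):
    """Keep only finite numbers, dropping NaN and +/-Inf."""
    return [v for v in values if math.isfinite(v)]


def log_histogram_safely(add_histogram, tag, values, step):
    """Call add_histogram only when at least one finite value remains.

    Returns True if the histogram was logged, False if it was skipped.
    """
    finite = finite_only(values)
    if not finite:
        print(f"skipping histogram for {tag}: no finite values at step {step}")
        return False
    add_histogram(tag, finite, step)
    return True


# A list-backed recorder stands in for the real writer method
logged = []


def recorder(tag, values, step):
    logged.append((tag, values, step))


ok = log_histogram_safely(recorder, "layer/weight", [0.1, float("nan"), -0.2], 1)
skipped = log_histogram_safely(recorder, "layer/bias", [float("nan"), float("inf")], 1)
```

Setting `tb_model_param_stats` to false in the config might sidestep the histogram call, assuming that flag gates `tb_model_weights`. Neither that nor the guard addresses the non-finite values themselves, which appear with `mixed_precision` enabled in this config.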

TypeError: expected str, bytes or os.PathLike object, not bool "Coqui-TTS-torchhub-example.ipynb"

When I try to run Coqui-TTS-torchhub-example.ipynb on Colab, I get this error:
"""
Downloading: "https://github.com/coqui-ai/TTS/archive/dev.zip" to /root/.cache/torch/hub/dev.zip

Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--tacotron2-DCA
Downloading model to /root/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan
Using model: Tacotron2

TypeError Traceback (most recent call last)
in ()
3 synthesizer = torch.hub.load('coqui-ai/TTS:dev',
4 'tts',
----> 5 source='github')
6 wav = synthesizer.tts("TTS is an open-source library that generates synthethic speech!")

6 frames
/usr/local/lib/python3.7/dist-packages/torch/hub.py in load(repo_or_dir, model, *args, **kwargs)
337 repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose)
338
--> 339 model = _load_local(repo_or_dir, model, *args, **kwargs)
340 return model
341

/usr/local/lib/python3.7/dist-packages/torch/hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
366
367 entry = _load_entry_from_hubconf(hub_module, model)
--> 368 model = entry(*args, **kwargs)
369
370 sys.path.remove(hubconf_dir)

/root/.cache/torch/hub/coqui-ai_TTS_dev/hubconf.py in tts(model_name, vocoder_name, use_cuda)
29
30 # create synthesizer
---> 31 synt = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, use_cuda)
32 return synt
33

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in __init__(self, tts_checkpoint, tts_config_path, tts_speakers_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, use_cuda)
73 self.output_sample_rate = self.tts_config.audio["sample_rate"]
74 if vocoder_checkpoint:
---> 75 self._load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
76 self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
77

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in _load_vocoder(self, model_file, model_config, use_cuda)
151 use_cuda (bool): enable/disable CUDA use.
152 """
--> 153 self.vocoder_config = load_config(model_config)
154 self.vocoder_ap = AudioProcessor(verbose=False, **self.vocoder_config["audio"])
155 self.vocoder_model = setup_generator(self.vocoder_config)

/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/io.py in load_config(config_path)
43 config = AttrDict()
44
---> 45 ext = os.path.splitext(config_path)[1]
46 if ext in (".yml", ".yaml"):
47 with open(config_path, "r", encoding="utf-8") as f:

/usr/lib/python3.7/posixpath.py in splitext(p)
120
121 def splitext(p):
--> 122 p = os.fspath(p)
123 if isinstance(p, bytes):
124 sep = b'/'

TypeError: expected str, bytes or os.PathLike object, not bool
"""
Can anyone tell me what is going wrong?
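
Reading the traceback, `hubconf.py` passes five positional arguments, while the `Synthesizer` constructor signature shown has `tts_speakers_file` as its third parameter. Positional binding would then place `use_cuda` (a bool) into `vocoder_config`, which matches the `not bool` error raised from `load_config`. A sketch reproducing this with a stand-in function that only copies the parameter names from the traceback; the default values and the example paths are assumptions:

```python
def synthesizer_init(
    tts_checkpoint,
    tts_config_path,
    tts_speakers_file=None,
    vocoder_checkpoint=None,
    vocoder_config=None,
    encoder_checkpoint=None,
    encoder_config=None,
    use_cuda=False,
):
    """Stand-in for the constructor; returns how its arguments were bound."""
    return {
        "tts_speakers_file": tts_speakers_file,
        "vocoder_checkpoint": vocoder_checkpoint,
        "vocoder_config": vocoder_config,
        "use_cuda": use_cuda,
    }


# Hypothetical paths, for illustration only
model_path, config_path = "model.pth.tar", "config.json"
vocoder_path, vocoder_config_path = "vocoder.pth.tar", "vocoder_config.json"

# Positional call, shaped like hubconf.py line 31 in the traceback
positional = synthesizer_init(model_path, config_path, vocoder_path, vocoder_config_path, False)

# Keyword call that binds each value to the intended parameter
keyword = synthesizer_init(
    model_path,
    config_path,
    vocoder_checkpoint=vocoder_path,
    vocoder_config=vocoder_config_path,
    use_cuda=False,
)

print(positional["vocoder_config"])
print(keyword["vocoder_config"])
```

If this reading is right, the fix belongs in `hubconf.py` on the `dev` branch: passing the vocoder and CUDA arguments by keyword would keep the call correct even when new parameters are inserted into the signature.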

Version conflicts with numba when installing locally

When I run pip install -e . or pip install -r requirements.txt, I get the following errors:

ERROR: umap-learn 0.5.1 has requirement numba>=0.49, but you'll have numba 0.48.0 which is incompatible.
ERROR: pynndescent 0.5.2 has requirement numba>=0.51.2, but you'll have numba 0.48.0 which is incompatible.

Which versions should these two packages be downgraded to?
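
The two errors only state lower bounds on numba (0.49 and 0.51.2), so they can be checked against a candidate numba version directly. A minimal sketch, assuming plain numeric version strings; `parse_version` and `satisfies_minimum` are hypothetical helpers, and a real check would use a full version parser:

```python
def parse_version(version):
    """Turn '0.51.2' into (0, 51, 2); handles plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def satisfies_minimum(installed, minimum):
    """True if installed >= minimum, comparing numeric components."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so '0.49' compares like '0.49.0'
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Minimum numba versions quoted in the pip errors above
requirements = {"umap-learn 0.5.1": "0.49", "pynndescent 0.5.2": "0.51.2"}

for candidate in ("0.48.0", "0.49.0", "0.51.2"):
    unmet = [pkg for pkg, minimum in requirements.items()
             if not satisfies_minimum(candidate, minimum)]
    print(f"numba {candidate}: unmet = {unmet}")
```

This does not say which older releases of umap-learn or pynndescent accept numba 0.48.0; that would need checking each release's declared requirements.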

Configuring MelSpectrogram parameters for custom dataset

I've been trying to figure out a good configuration for training the Tacotron2 model, but I'm not sure how to set the MelSpectrogram parameters accurately.

Specifically, how would I calculate the right values for mel_fmin and mel_fmax for my dataset?

Thanks!
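
Two constraints hold regardless of the dataset: `mel_fmax` cannot exceed the Nyquist frequency, which is half of `sample_rate`, and `mel_fmin` must lie below `mel_fmax`. The comments in the config quoted earlier on this page suggest roughly 50 Hz for male and 95 Hz for female voices as a starting `mel_fmin`. A minimal sketch of a sanity check; `check_mel_range` is a hypothetical helper, not part of TTS:

```python
def check_mel_range(sample_rate, mel_fmin, mel_fmax):
    """Return a list of problems with the chosen mel frequency range."""
    problems = []
    nyquist = sample_rate / 2  # highest frequency representable at this sample rate
    if mel_fmin < 0:
        problems.append("mel_fmin must be non-negative")
    if mel_fmax > nyquist:
        problems.append(f"mel_fmax {mel_fmax} exceeds Nyquist {nyquist}")
    if mel_fmin >= mel_fmax:
        problems.append("mel_fmin must be below mel_fmax")
    return problems


print(check_mel_range(22050, 50.0, 7600.0))
print(check_mel_range(16000, 50.0, 9000.0))
```

Within those bounds the values remain a tuning choice; inspecting spectrograms of a few samples from the dataset is a common way to see where the speech energy actually sits.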


v0.1.0 Milestones

v0.2.0 Milestones

v0.3.0 Milestones

v0.4.0 Milestones

v0.5.0 Milestones

v0.6.0 Milestones

v0.7.0 Milestones

v0.8.0 Milestones

🏃‍♀️ Milestones along the way

🤖 New TTS models
