coqui-ai / tts
🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
Home Page: http://coqui.ai
License: Mozilla Public License 2.0
This repository is really great! Decent samples too.
I see a huge opportunity for this to be extended to support mobile.
There are a number of obstacles to this of course, including running on TF-Lite.
If you ported it to Dart you could transpile it to iOS and Android.
I trained a French TTS model with Tacotron2 DDC from MAI-Labs. I'm using Coqui-TTS v0.0.12.
I tried TTS with the vocoder vocoder_models--en--ljspeech--hifigan_v2, as downloaded from Coqui-TTS.
The resulting audio file is very noisy, as you can hear: https://sndup.net/2t9d
You can find my config.json and vocoder_config.json as a gist: https://gist.github.com/lpierron/6c56302eb628ee6a86363daa08e5fa63
Any idea how to solve the noise problem?
I tried another vocoder (the MelGAN one) and there is no noise, but the voice is hoarse, as you can hear: https://sndup.net/4jgn
Describe the bug
The condition that enables the feature_matching loss in TTS/TTS/vocoder/layers/losses.py (line 263 in 4a3cc8d) is incorrect.
Probably all our trained models are affected by this bug, which caused suboptimal results.
Specifically, I observed that it caused the metallic noise in the model outputs.
Expected behavior
The models should use feat_matching loss
Additional context
For anyone who needs an instant fix, the line indicated above needs to be updated as follows.
if self.use_feat_match_loss and feats_fake is not None:
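To make the intent of that guard concrete, here is a minimal, self-contained sketch. It is not the code from losses.py: total_generator_loss, adv_loss, and l1_feature_distance are hypothetical names, and only the shape of the condition mirrors the line above.

```python
def total_generator_loss(adv_loss, feats_fake, feats_real,
                         use_feat_match_loss, feat_match_loss_fn):
    """Add the feature-matching term only when it is enabled and features exist."""
    loss = adv_loss
    # Same shape as the guard in the issue; `x is not None` is the idiomatic
    # spelling of `not x is None`.
    if use_feat_match_loss and feats_fake is not None:
        loss += feat_match_loss_fn(feats_fake, feats_real)
    return loss


def l1_feature_distance(fake, real):
    # Hypothetical feature-matching loss: mean absolute difference.
    return sum(abs(f - r) for f, r in zip(fake, real)) / len(fake)


with_feats = total_generator_loss(1.0, [0.5, 0.5], [1.0, 0.0], True, l1_feature_distance)
without_feats = total_generator_loss(1.0, None, None, True, l1_feature_distance)
print(with_feats, without_feats)
```

With the guard in place, the feature-matching term contributes whenever discriminator features are passed in, and is skipped cleanly when they are absent.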
Description
Using tts, it skipped part of a sentence. For example, for 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' it skipped the last part "and the distribution of the costs and benefits across different segments of society".
However, if a period is inserted before the part that was skipped, the complete text is synthesized.
To Reproduce
Single sentence:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits and the distribution of the costs and benefits across different segments of society.' --out_path output.wav
Split by a period:
tts --text 'But despite well-established practices for rigorous estimation of the pros and cons of policies, there is room to improve, particularly in characterizing difficult-to-measure benefits. And the distribution of the costs and benefits across different segments of society.' --out_path output.wav
Expected behavior
The long sentences should be fully synthesized.
Google recently open-sourced Lyra (https://github.com/google/lyra), a version of the WaveRNN vocoder. I wonder if any of you have thought of using Lyra as the vocoder for TTS.
The benefit of this approach is that we would have a well-engineered real-time vocoder on mobile devices (and hopefully, a high-quality vocoder).
The unknown here is whether Lyra is suited to TTS. From reading Google's papers, they use a quantized 160-dimensional mel-spectrogram as the conditioning features, with only one frame of look-ahead.
The source code of this real-time wavegru vocoder can be really helpful anyway!
Describe the bug
Running any of the HiFiGAN models fails with a KeyError for the default_vocoder.
To Reproduce
Steps to reproduce the behavior:
Run tts-server --list_models | grep hifigan to get the list of HiFiGAN models.
Run tts-server --use_cuda=true --model_name vocoder_models/en/sam/hifigan_v2
Traceback (most recent call last):
File "/home/gez/Projects/coqui-tts/.venv/bin/tts-server", line 6, in <module>
from TTS.server.server import main
File "/home/gez/Projects/coqui-tts/.venv/lib/python3.8/site-packages/TTS/server/server.py", line 86, in <module>
args.vocoder_name = model_item["default_vocoder"] if args.vocoder_name is None else args.vocoder_name
KeyError: 'default_vocoder'
Expected behavior
Presumably model_item should always have a default_vocoder, or it should be checked and handled gracefully.
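One way the "checked and handled gracefully" option could look is sketched below. resolve_vocoder_name is a hypothetical helper, not part of the TTS package, and the dictionaries only imitate the model_item entry seen in the traceback.

```python
def resolve_vocoder_name(model_item, vocoder_name=None):
    """Return the vocoder to use, or None when the entry defines no default.

    `model_item` stands in for the dictionary read from the model list.
    """
    if vocoder_name is not None:
        return vocoder_name
    # dict.get avoids the KeyError for entries without a "default_vocoder" key.
    return model_item.get("default_vocoder")


tts_entry = {"default_vocoder": "vocoder_models/en/ljspeech/multiband-melgan"}
entry_without_default = {"description": "hypothetical entry with no default vocoder"}

print(resolve_vocoder_name(tts_entry))
print(resolve_vocoder_name(entry_without_default))
print(resolve_vocoder_name(entry_without_default, "vocoder_models/en/sam/hifigan_v2"))
```

The caller can then decide what None means, for example skipping the vocoder download or printing a readable message instead of a traceback.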
Environment (please complete the following information):
The vendor says the site has expired.
Describe the bug
With TTS==0.0.13.1
from TTS.utils.synthesizer import Synthesizer
To Reproduce
Steps to reproduce the behavior:
Expected behavior
No Exceptions :)
Environment (please complete the following information):
Welcome to the 🐸TTS project! We are excited to see your interest, and appreciate your support!
This repository is governed by the Contributor Covenant Code of Conduct. For more details, see the CODE_OF_CONDUCT.md file.
If you've found a bug, please provide the following information:
Describe the bug
A clear and concise description of what the bug is.
To Reproduce
Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
Expected behavior
Should not crash but generate audio
Environment (please complete the following information):
$ python --version
Python 3.8.5
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic
Downloading model to /home/dims/.local/share/tts/tts_models--ru--ruslan--tacotron2-DDC
tts_models/ru/ruslan/tacotron2-DDC is already downloaded.
Using model: Tacotron2
Traceback (most recent call last):
File "/opt/anaconda3/envs/py38/bin/tts", line 8, in <module>
sys.exit(main())
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/bin/synthesize.py", line 188, in main
synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 49, in __init__
self.load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 102, in load_vocoder
self.vocoder_model = setup_generator(self.vocoder_config)
File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/TTS/vocoder/utils/generic_utils.py", line 70, in setup_generator
print(" > Generator Model: {}".format(c.generator_model))
AttributeError: 'AttrDict' object has no attribute 'generator_model'
I have the same problem with v0.0.12:
CUDA_VISIBLE_DEVICES="0" python ../../TTS/bin/train_tacotron.py --config_path model_config.json
> Using CUDA: True
> Number of GPUs: 1
> Git Hash: 59ab268
> Experiment folder: /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
> Setting up Audio Processor...
| > sample_rate:22050
| > resample:False
| > num_mels:80
| > min_level_db:-100
| > frame_shift_ms:None
| > frame_length_ms:None
| > ref_level_db:20
| > fft_size:1024
| > power:1.5
| > preemphasis:0.0
| > griffin_lim_iters:60
| > signal_norm:True
| > symmetric_norm:True
| > mel_fmin:0
| > mel_fmax:8000.0
| > spec_gain:1.0
| > stft_pad_mode:reflect
| > max_norm:4.0
| > clip_norm:True
| > do_trim_silence:True
| > trim_db:60
| > do_sound_norm:False
| > stats_path:./scale_stats.npy
| > log_func:<ufunc 'log10'>
| > exp_func:<function AudioProcessor.__init__.<locals>.<lambda> at 0x7f1ef7ac6c10>
| > hop_length:256
| > win_length:1024
| > /tmp/tts/by_book/female/ezwa/monsieur_lecoq/metadata.csv
| > Found 14211 files in /tmp/tts
> Using model: Tacotron2
> Model has 28183506 parameters
> Starting with inf best loss.
> DataLoader initialization
| > Use phonemes: True
| > phoneme language: fr-fr
| > Number of instances : 14069
| > Max length sequence: 281
| > Min length sequence: 3
| > Avg length sequence: 105.0826640130784
| > Num. instances discarded by max-min (max=153, min=6) seq limits: 2420
| > Batch group size: 128.
> EPOCH: 0/1000
> Number of output frames: 7
> TRAINING (2021-04-23 15:38:18)
! Run is removed from /home/lpierron/Mozilla_TTS/CORPUS_LP/Models/maiLabs-fr-dca/mailabs-fr-ddc-April-23-2021_03+38PM-59ab268
Traceback (most recent call last):
File "../../TTS/bin/train_tacotron.py", line 744, in <module>
main(args)
File "../../TTS/bin/train_tacotron.py", line 704, in main
train_avg_loss_dict, global_step = train(
File "../../TTS/bin/train_tacotron.py", line 198, in train
decoder_output, postnet_output, alignments, stop_tokens = model(
File "/home/lpierron/miniconda3/envs/tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/lpierron/Mozilla_TTS/COQUI-TTS/TTS/TTS/tts/models/tacotron2.py", line 226, in forward
decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (12) must match the existing size (84) at non-singleton dimension 2. Target sizes: [64, 80, 12]. Tensor sizes: [64, 1, 84]
I have downgraded to librosa==0.6.3, but it doesn't work.
See my configuration next:
Originally posted by @lpierron in #370 (comment)
The README.md links to English Voice Samples, which claim to use an English DDC model; however, that model is not identifiable in the model list:
tts --list_models
Name format: type/language/dataset/model
>: tts_models/en/ek1/tacotron2
>: tts_models/en/ljspeech/glow-tts
>: tts_models/en/ljspeech/tacotron2-DCA
>: tts_models/en/ljspeech/speedy-speech-wn
>: tts_models/es/mai/tacotron2-DDC
>: tts_models/fr/mai/tacotron2-DDC
>: tts_models/zh-CN/baker/tacotron2-DDC-GST
>: tts_models/nl/mai/tacotron2-DDC
>: tts_models/ru/ruslan/tacotron2-DDC
>: vocoder_models/universal/libri-tts/wavegrad
>: vocoder_models/universal/libri-tts/fullband-melgan
>: vocoder_models/en/ek1/wavegrad
>: vocoder_models/en/ljspeech/multiband-melgan
>: vocoder_models/nl/mai/parallel-wavegan
Can the Samples page be changed to one in this project that uses an available model?
I think we've established that Windows support has been broken since commit e0b3008.
I suspect that it's due to the exp/log functions stored in the class.
I would suggest replacing the log_func in the constructor with an exp_log_base, and storing only that number in the class. Then I propose using math.log(x, exp_log_base) and exp_log_base**x, since np.log doesn't accept a base argument. But if we need np for speed, we can define a function:
import numpy as np

def np_log_base(x, base):
    return np.log(x) / np.log(base)
Would that fix be OK?
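The proposal of keeping only the base can be sketched with the standard library alone. LogScale is a hypothetical stand-in for the class that currently stores log_func and exp_func; only the attribute name exp_log_base comes from the suggestion above.

```python
import math


class LogScale:
    """Hypothetical holder that stores only the base, not log/exp callables."""

    def __init__(self, exp_log_base=10.0):
        self.exp_log_base = exp_log_base

    def log(self, x):
        # math.log accepts the base as a second argument.
        return math.log(x, self.exp_log_base)

    def exp(self, x):
        return self.exp_log_base ** x


scale = LogScale(10.0)
print(scale.log(1000.0))
print(scale.exp(3.0))
```

Because the instance holds a plain number rather than a lambda, it carries no function objects around. Whether that is what breaks Windows is the suspicion stated above, not something this sketch verifies.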
We're training a WaveGrad Vocoder on a fairly small dataset right now (~250 samples), and ran into the following error recently:
Traceback (most recent call last):
File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
main(args)
File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
_, global_step = train(model, criterion, optimizer, scheduler, scaler,
File "./TTS/bin/train_vocoder_wavegrad.py", line 82, in train
data_loader = setup_loader(ap, is_val=False, verbose=(epoch == 0))
File "./TTS/bin/train_vocoder_wavegrad.py", line 46, in setup_loader
loader = DataLoader(dataset,
File "coqui-tts\lib\site-packages\torch\utils\data\dataloader.py", line 266, in __init__
sampler = RandomSampler(dataset, generator=generator) # type: ignore
File "coqui-tts\lib\site-packages\torch\utils\data\sampler.py", line 103, in __init__
raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0
This appears to be related to the undocumented eval_split_size setting in config.json. The default config for WaveGrad specifies this as 256. After debugging for a bit, it appears that this setting controls how many files are used for the evaluation set. So, if there are 500 WAV files and eval_split_size is set to 256, then the first 256 audio files encountered are used for the evaluation set and the remaining 244 are used for training. With only ~250 samples, that leaves no files for training, which would explain num_samples=0.
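The split behaviour described here can be reproduced in a few lines. This is a sketch of the behaviour as observed in the report, not the actual loader code; split_eval_train and the early ValueError are hypothetical additions.

```python
def split_eval_train(files, eval_split_size):
    """Split as described above: the first `eval_split_size` files go to eval."""
    eval_files = files[:eval_split_size]
    train_files = files[eval_split_size:]
    if not train_files:
        # Fail early with a readable message instead of a DataLoader error.
        raise ValueError(
            f"eval_split_size={eval_split_size} leaves no training files "
            f"(dataset has {len(files)} files)"
        )
    return eval_files, train_files


wavs_500 = [f"clip_{i:04d}.wav" for i in range(500)]
eval_files, train_files = split_eval_train(wavs_500, 256)
print(len(eval_files), len(train_files))

wavs_250 = wavs_500[:250]
try:
    split_eval_train(wavs_250, 256)
except ValueError as err:
    print(err)
```

A check of this kind would surface the problem at startup rather than as num_samples=0 deep inside the sampler.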
Since it can take a fair bit of debugging for an end-user to understand what's going on, I propose two things:
[...] eval_split_size.
The eval_split_size parameter in the config should be documented so users understand what it does and can tune it appropriately.
In https://discourse.mozilla.org/t/custom-voice-tts-not-learning/40897/5, @erogol mentioned that a way to weed out bad samples in the data is to run the training network on the data to see which have the highest loss. Is there an easy way to see this? I am taking the comment to mean that we'd need to narrow the training list to just a few files at a time, run training, and check the loss value; then repeat for each handful of sample files to see a pattern. If so, that could take quite some time. Unless there is a report or something that I'm not aware of?
As we all know, training data set quality is the biggest factor influencing training. So, anything we can do to flag sub-optimal training samples that the CheckDataset notebook otherwise doesn't flag would be ideal.
To that end, is there any opportunity for the model to track and output a report that relates each file to the average loss seen with it? In other words, what if the training process tracked the average loss value observed each time a given file is in a batch? Over time, that could be used to drive a heatmap of which files tend to coincide with higher loss. That way, users could quickly identify the outliers in the data set that contribute most to the loss.
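A rough sketch of such a tracker follows. PerFileLossTracker is a hypothetical class, and it assumes the training loop can see the file names that make up each batch; nothing here is taken from the TTS training scripts.

```python
from collections import defaultdict


class PerFileLossTracker:
    """Accumulate the batch loss against every file that was in the batch."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def update(self, batch_files, batch_loss):
        for path in batch_files:
            self._sum[path] += batch_loss
            self._count[path] += 1

    def worst(self, n=10):
        """Return up to n (file, average loss) pairs, highest average first."""
        averages = {p: self._sum[p] / self._count[p] for p in self._sum}
        return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


tracker = PerFileLossTracker()
tracker.update(["a.wav", "b.wav"], batch_loss=0.5)
tracker.update(["b.wav", "c.wav"], batch_loss=1.5)
tracker.update(["a.wav", "c.wav"], batch_loss=1.0)
print(tracker.worst(n=2))
```

Since a batch loss is shared by all files in the batch, the averages only become informative after each file has appeared in many differently composed batches.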
input:
tts --text 'Hello world!' --out_path out/out_1.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_wav 28.wav
output/error
> tts_models/en/vctk/sc-glow-tts is already downloaded.
> vocoder_models/en/vctk/hifigan_v2 is already downloaded.
Loading speakers ...
> Using model: glow_tts
> Generator Model: hifigan_generator
Removing weight norm...
> Text: Hello world!
> Text splitted to sentences.
['Hello world!']
Traceback (most recent call last):
File "/home/user/.local/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/user/.local/lib/python3.7/site-packages/TTS/bin/synthesize.py", line 257, in main
wav = synthesizer.tts(args.text, args.speaker_idx, args.speaker_wav)
File "/home/user/.local/lib/python3.7/site-packages/TTS/utils/synthesizer.py", line 220, in tts
speaker_embedding = self.speaker_manager.compute_x_vector_from_clip(speaker_wav)
File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 241, in compute_x_vector_from_clip
x_vector = _compute(wf)
File "/home/user/.local/lib/python3.7/site-packages/TTS/tts/utils/speakers.py", line 228, in _compute
waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)
AttributeError: 'NoneType' object has no attribute 'load_wav'
When reaching this line:
waveform = self.speaker_encoder_ap.load_wav(wav_file, sr=self.speaker_encoder_ap.sample_rate)
self.speaker_encoder_ap is None for me, so it seems that self.speaker_encoder_ap wasn't initialized.
The wav file I'm supplying is a 22050 Hz mono file and its path is correct.
I'm running version 0.13.
This works without a problem:
tts --text 'Hello world!' --out_path out/out21.wav --model_name tts_models/en/vctk/sc-glow-tts --vocoder_name vocoder_models/en/vctk/hifigan_v2 --speaker_idx p245
Sometimes synthesis of a sentence is cut short at the last word. I know (think) that this indicates something is amiss in the model or the dataset: either the model was not trained long enough, the audio parameters could be tuned further (trim_db?), or the dataset quality is lacking. But taking the time to fix that issue, debugging and training many models, is a luxury that some people can't afford (maybe even more so for a low-resource language).
I would gladly open a PR to propose the feature, but I'm not sure how to go about the implementation.
Would adding a stopnet delay (delaying the stop signal by n steps) solve this issue?
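To pin down what a stopnet delay could mean, here is a framework-free sketch. stop_step_with_delay, its threshold, and the list of stop probabilities are all hypothetical; in a real decoder the probabilities would be produced step by step rather than known in advance.

```python
def stop_step_with_delay(stop_probs, threshold=0.5, delay=0):
    """Return the decoder step at which to stop, or None if never triggered.

    Decoding continues for `delay` extra steps after the first stop
    probability above `threshold`.
    """
    for step, prob in enumerate(stop_probs):
        if prob > threshold:
            # Clamp because this sketch works on a finite list of steps.
            return min(step + delay, len(stop_probs) - 1)
    return None


probs = [0.01, 0.02, 0.10, 0.70, 0.90, 0.95, 0.99]
print(stop_step_with_delay(probs))
print(stop_step_with_delay(probs, delay=2))
```

A delay of this kind trades a possibly clipped last word for some extra trailing frames, so the result may need silence trimming afterwards.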
I'm trying to train a model with my own dataset, and I got this error. The same thing applied when I used the default LJSpeech dataset: https://pastebin.com/WugD8rZt
In Random Window discriminator, feats list is defined, but it is not updated. Is this by design?
def forward(self, x, c):
    scores = []
    feats = []  # never appended to below, so it is returned empty
    # unconditional pass
    for (window_size, layer) in zip(self.window_sizes,
                                    self.unconditional_discriminators):
        index = np.random.randint(x.shape[-1] - window_size)
        score = layer(x[:, :, index:index + window_size])
        scores.append(score)
    # conditional pass
    for (window_size, layer) in zip(self.window_sizes,
                                    self.conditional_discriminators):
        frame_size = window_size // self.hop_length
        lc_index = np.random.randint(c.shape[-1] - frame_size)
        sample_index = lc_index * self.hop_length
        x_sub = x[:, :,
                  sample_index:(lc_index + frame_size) * self.hop_length]
        c_sub = c[:, :, lc_index:lc_index + frame_size]
        score = layer(x_sub, c_sub)
        scores.append(score)
    return scores, feats
Also, thank you for this project. It's awesome :)
Is your feature request related to a problem? Please describe.
It would be nice to see which version of TTS I am currently using via TTS.__version__.
Describe the solution you'd like
Add __version__ information in a _version.py file
The tts endpoint saves the output wav to the same path as synthesize.py, but it should save it to where the command is called.
I've trained a model using T1 with GST and GravesAttention. During training, all training and eval alignment plots have been empty (trained 80k+ steps). The model produced audio in TensorBoard; however, when I used the logic from one of the notebooks to evaluate the model and synthesize speech, it threw the following error: AttributeError: 'GravesAttention' object has no attribute 'init_win_idx', referring to layers/tacotron.py:
--> 478 self.attention.init_win_idx()
I suspect that the Tacotron 1 model is not configured to use GravesAttention, because some of the methods called in layers/tacotron.py do not exist in the GravesAttention class.
Config:
{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {
"fft_size": 1024,
"win_length": 1024,
"hop_length": 256,
"frame_length_ms": null,
"frame_shift_ms": null,
"sample_rate": 24000,
"preemphasis": 0.0,
"ref_level_db": 20,
"do_trim_silence": true,
"trim_db": 60,
"power": 1.5,
"griffin_lim_iters": 60,
"num_mels": 80,
"mel_fmin": 95.0,
"mel_fmax": 12000.0,
"spec_gain": 20,
"signal_norm": true,
"min_level_db": -100,
"symmetric_norm": true,
"max_norm": 4.0,
"clip_norm": true,
"stats_path": null
},
"distributed": {
"backend": "nccl",
"url": "tcp://localhost:54321"
},
"reinit_layers": [],
"batch_size": 128,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,
"loss_masking": false,
"decoder_loss_alpha": 0.5,
"postnet_loss_alpha": 0.25,
"postnet_diff_spec_alpha": 0.25,
"decoder_diff_spec_alpha": 0.25,
"decoder_ssim_alpha": 0.5,
"postnet_ssim_alpha": 0.25,
"ga_alpha": 5.0,
"stopnet_pos_weight": 15.0,
"run_eval": true,
"test_delay_epochs": 10,
"test_sentences_file": null,
"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 300000,
"lr": 0.0001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,
"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,
"attention_type": "graves",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": true,
"bidirectional_decoder": false,
"double_decoder_consistency": false,
"ddc_r": 7,
"stopnet": true,
"separate_stopnet": true,
"print_step": 25,
"tb_plot_step": 100,
"print_eval": false,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 8,
"num_val_loader_workers": 8,
"batch_group_size": 4,
"min_seq_len": 6,
"max_seq_len": 153,
"compute_input_seq_cache": false,
"use_noise_augment": true,
"output_path": "/home/big-boy/Models/Blizzard/",
"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",
"use_speaker_embedding": false,
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": {
"gst_style_input": null,
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},
"datasets":
[{
"name": "ljspeech",
"path": "/Data/blizzard2013/segmented/",
"meta_file_train": "metadata.csv",
"meta_file_val": null
}]
}
Alignment plots:
Hi, while trying to run the Colab tutorial for synthesizing Spanish speech, I got an error when executing the following line:
align, spec, stop_tokens, wav = tts(vocoder_model, sentence, TTS_CONFIG, use_cuda, ap, use_gl=False, figures=True)
This is the error:
in tts(model, text, CONFIG, use_cuda, ap, use_gl, figures)
12 t_1 = time.time()
13 waveform, alignment, mel_spec, mel_postnet_spec, stop_tokens, inputs = synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav=None,
---> 14 truncated=False, enable_eos_bos_chars=CONFIG.enable_eos_bos_chars)
15 print(mel_postnet_spec.shape)
16 mel_postnet_spec = ap._denormalize(mel_postnet_spec.T).T
/content/TTS_repo/TTS/tts/utils/synthesis.py in synthesis(model, text, CONFIG, use_cuda, ap, speaker_id, style_wav, truncated, enable_eos_bos_chars, use_griffin_lim, do_trim_silence, speaker_embedding, backend)
239 if backend == 'torch':
240 decoder_output, postnet_output, alignments, stop_tokens = run_model_torch(
--> 241 model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings=speaker_embedding)
242 postnet_output, decoder_output, alignment, stop_tokens = parse_outputs_torch(
243 postnet_output, decoder_output, alignments, stop_tokens)
/content/TTS_repo/TTS/tts/utils/synthesis.py in run_model_torch(model, inputs, CONFIG, truncated, speaker_id, style_mel, speaker_embeddings)
57 else:
58 decoder_output, postnet_output, alignments, stop_tokens = model.inference(
---> 59 inputs, speaker_ids=speaker_id, speaker_embeddings=speaker_embeddings)
60 elif 'glow' in CONFIG.model.lower():
61 inputs_lengths = torch.tensor(inputs.shape[1:2]).to(inputs.device) # pylint: disable=not-callable
/usr/local/lib/python3.7/dist-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
25 def decorate_context(*args, **kwargs):
26 with self.__class__():
---> 27 return func(*args, **kwargs)
28 return cast(F, decorate_context)
29
TypeError: inference() got an unexpected keyword argument 'speaker_ids'
Thanks!
pip install TTS
Failed to build TTS
ERROR: Could not build wheels for TTS which use PEP 517 and cannot be installed directly
Adding more details about the state of my machine. I am on Windows 10.
I am using Python 3.8.0 installed via pyenv. Here are my current package versions under the Python 3.8.0 environment:
pip list
Package Version
absl-py 0.12.0
appdirs 1.4.4
argon2-cffi 20.1.0
astor 0.8.1
astunparse 1.6.3
async-generator 1.10
attrs 20.3.0
audioread 2.1.9
backcall 0.2.0
bar-chart-race 0.1.0
black 20.8b1
bleach 1.5.0
cachetools 4.2.1
certifi 2020.12.5
cffi 1.14.5
chardet 4.0.0
click 7.1.2
colorama 0.4.4
cycler 0.10.0
decorator 5.0.5
defusedxml 0.7.1
dill 0.3.3
entrypoints 0.3
flatbuffers 1.12
gast 0.3.3
google-auth 1.28.0
google-auth-oauthlib 0.4.4
google-pasta 0.2.0
grpcio 1.32.0
gTTS 2.2.2
h5py 2.10.0
html5lib 0.9999999
idna 2.10
inflect 5.3.0
ipykernel 5.5.3
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
jedi 0.18.0
Jinja2 2.11.3
joblib 1.0.1
jsonpatch 1.32
jsonpointer 2.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.13
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.2
kiwisolver 1.3.1
librosa 0.8.0
llvmlite 0.36.0
Markdown 3.3.4
MarkupSafe 1.1.1
matplotlib 3.3.3
mistune 0.8.4
multiprocess 0.70.11.1
mypy-extensions 0.4.3
nbclient 0.5.3
nbconvert 6.0.7
nbformat 5.1.3
nest-asyncio 1.5.1
notebook 6.3.0
numba 0.53.1
numpy 1.19.3
oauthlib 3.1.0
opt-einsum 3.3.0
packaging 20.9
pandas 1.2.3
pandocfilters 1.4.3
parso 0.8.2
pathspec 0.8.1
pep517 0.10.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.0.1
pooch 1.3.0
prometheus-client 0.10.0
prompt-toolkit 3.0.18
protobuf 3.15.7
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.20
Pygments 2.8.1
pynndescent 0.5.2
pyparsing 2.4.7
PyQt5 5.15.4
PyQt5-Qt5 5.15.2
PyQt5-sip 12.8.1
pyrsistent 0.17.3
python-dateutil 2.8.1
pytz 2021.1
pywin32 300
pywinpty 0.5.7
pyzmq 22.0.3
qtconsole 5.0.3
QtPy 1.9.0
regex 2021.4.4
requests 2.25.1
requests-oauthlib 1.3.0
resampy 0.2.2
rsa 4.7.2
scikit-learn 0.24.1
scipy 1.6.2
Send2Trash 1.5.0
setuptools 54.2.0
six 1.15.0
sounddevice 0.4.1
SoundFile 0.10.3.post1
tensorboard 1.14.0
tensorboard-plugin-wit 1.8.0
tensorflow 1.14.0
tensorflow-estimator 1.14.0
tensorflow-hub 0.11.0
tensorflow-tensorboard 1.5.1
termcolor 1.1.0
terminado 0.9.4
testpath 0.4.4
threadpoolctl 2.1.0
toml 0.10.2
torch 1.8.1+cpu
torchaudio 0.8.1
torchfile 0.1.0
torchvision 0.9.1+cpu
tornado 6.1
tqdm 4.60.0
traitlets 5.0.5
typed-ast 1.4.2
typing-extensions 3.7.4.3
umap-learn 0.5.1
Unidecode 1.2.0
urllib3 1.26.4
visdom 0.1.8.9
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 0.58.0
Werkzeug 1.0.1
wheel 0.36.2
widgetsnbextension 3.5.1
wrapt 1.12.1
Describe the bug
Training using --restore_path fails with a "param 'initial_lr' not specified" message.
To Reproduce
Steps to reproduce the behavior:
Environment (please complete the following information):
Hello there!
Thanks for the project. I think the saving of raw spectrograms through --save_spectogram is not implemented, right? If so, maybe we can use this issue to track its development.
Hi,
First of all, thanks for all this great code!
Now, I'm training a new Tacotron2 using a Hindi dataset: 25 hours, 12,000 audio files, single speaker, not noisy, silences trimmed.
At 10000 global steps, when the model tries to save the checkpoint, it crashes with the message "Audio buffer is not finite everywhere". I've been trying to tweak the config parameters, but to no avail.
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
I'd really appreciate any hints as to what might be causing this.
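As a debugging aid only, one could count non-finite values in the spectrogram before it reaches Griffin-Lim. The sketch below uses nested lists and the standard library so it runs anywhere; count_non_finite is a hypothetical helper, and with NumPy arrays np.isfinite would do the same job. It does not identify the root cause.

```python
import math


def count_non_finite(spec):
    """Count NaN/inf entries in a 2-D spectrogram given as nested lists."""
    return sum(1 for row in spec for value in row if not math.isfinite(value))


good = [[0.1, 0.2], [0.3, 0.4]]
bad = [[0.1, float("nan")], [float("inf"), 0.4]]
print(count_non_finite(good))
print(count_non_finite(bad))
```

If the count is non-zero for the model output at the failing step, the problem is upstream of the audio code (for example diverging training), not in librosa itself.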
Due to the architecture of the model and the total receptive field, it causes errors for input text shorter than 13 characters.
This can be fixed by padding the input text with empty characters.
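The padding idea can be sketched as follows. The threshold of 13 is the one quoted above; pad_text is a hypothetical helper, and using a space as the "empty character" is an assumption, since the actual padding symbol depends on the model's character set.

```python
MIN_INPUT_LEN = 13  # threshold quoted above
PAD_CHAR = " "      # assumption: a space stands in for the "empty character"


def pad_text(text, min_len=MIN_INPUT_LEN, pad_char=PAD_CHAR):
    """Right-pad text so it is at least min_len characters long."""
    return text.ljust(min_len, pad_char)


short = pad_text("Hi.")
long_enough = pad_text("A sentence that is long enough.")
print(repr(short))
print(repr(long_enough))
```

Text that already meets the minimum length is returned unchanged, so the helper can be applied to every input.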
(venv) $ tts --model_name tts_models/en/ljspeech/speedy-speech-wn --text "Hey Bruce, what's good in the neighborhood?"
> tts_models/en/ljspeech/speedy-speech-wn is already downloaded.
> vocoder_models/en/ljspeech/multiband-melgan is already downloaded.
> Using model: speedy_speech
Traceback (most recent call last):
File "/home/josh/venv/bin/tts", line 8, in <module>
sys.exit(main())
File "/home/josh/venv/lib/python3.6/site-packages/TTS/bin/synthesize.py", line 190, in main
synthesizer = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, args.use_cuda)
File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 47, in __init__
use_cuda)
File "/home/josh/venv/lib/python3.6/site-packages/TTS/utils/synthesizer.py", line 96, in load_tts
self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
File "/home/josh/venv/lib/python3.6/site-packages/TTS/tts/models/speedy_speech.py", line 196, in load_checkpoint
self.load_state_dict(state['model'])
File "/home/josh/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1224, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for SpeedySpeech:
size mismatch for emb.weight: copying a param with shape torch.Size([129, 128]) from checkpoint, the shape in current model is torch.Size([130, 128]).
Thanks, @JRMeyer, for pointing this out 👑
It would be cool to test whether the GPU memory is enough for the model/config/dataset combo right from the start, because it wastes time and money to start training only to discover later that it failed with an OOM error.
I would suggest, for the first batch of the first epoch, using the samples with the longest duration/seq_length.
Or do a warmup batch the same way, plus a loaded test batch.
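The "longest samples first" idea can be sketched independently of any data loader. longest_first_batch is a hypothetical helper, and plain strings stand in for training samples whose length drives memory use.

```python
def longest_first_batch(samples, batch_size, length_of=len):
    """Pick the batch_size longest samples to use as a memory stress-test batch."""
    return sorted(samples, key=length_of, reverse=True)[:batch_size]


texts = ["short", "a much longer training sentence", "mid length text", "tiny"]
stress_batch = longest_first_batch(texts, batch_size=2)
print(stress_batch)
```

Running one forward and backward pass on such a batch before the real loop starts would, in principle, trigger an OOM within seconds rather than hours into training.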
Hey,
I'm trying to run a training with Tacotron 1 using GST. I get the error on the first batch already.
Pytorch version: 1.8 and 1.7.1 (both yielded the same error)
Python version: 3.8.0
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 168, in train
decoder_output, postnet_output, alignments, stop_tokens = model(
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/models/tacotron.py", line 173, in forward
decoder_outputs = decoder_outputs * output_mask.unsqueeze(1).expand_as(decoder_outputs)
RuntimeError: The expanded size of the tensor (64) must match the existing size (112) at non-singleton dimension 2. Target sizes: [64, 80, 64]. Tensor sizes: [64, 1, 112]
My hyperparams:
// TRAINING
"batch_size": 64,
"eval_batch_size": 16,
"r": 4,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,
// MULTI-SPEAKER and GST
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": { // gst parameter if gst is enabled
"gst_style_input": null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},
I'm working on a step-wise learning rate scheduling method and I wanted to take inspiration from the NoamLR() class found in training.py. When I set noam_schedule: true in the config, the following error is shown.
File "TTS/bin/train_tacotron.py", line 674, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 640, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "TTS/bin/train_tacotron.py", line 154, in train
scheduler.step()
AttributeError: 'NoneType' object has no attribute 'step'
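For reference while designing a step-wise schedule, a commonly used normalized form of the Noam rule (linear warmup followed by inverse-square-root decay, scaled so the factor peaks at 1 when the step equals warmup_steps) can be written without any framework. This is a sketch of the general formula, not a copy of the NoamLR class in training.py; noam_factor is a hypothetical name, and the values 4000 and 0.0001 are only example settings.

```python
def noam_factor(step, warmup_steps=4000):
    """Scale factor applied to the base learning rate at a given step."""
    step = max(step, 1)  # avoid 0 ** -0.5
    return warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)


base_lr = 0.0001
for step in (1, 2000, 4000, 16000):
    print(step, base_lr * noam_factor(step))
```

The factor rises linearly until warmup_steps and then decays, so a custom step-wise scheduler can be checked against it at a few known points.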
/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py:94: UserWarning: Using a target size (torch.Size([64, 90, 80])) that is different to the input size (torch.Size([64, 90, 513])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
return F.l1_loss(input, target, reduction=self.reduction)
! Run is removed from /home/big-boy/Models/Blizzard/blizzard-gts-March-11-2021_05+38PM-45068a9
Traceback (most recent call last):
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 721, in <module>
main(args)
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 619, in main
train_avg_loss_dict, global_step = train(train_loader, model,
File "/home/big-boy/projects/TTS/TTS/bin/train_tacotron.py", line 180, in train
loss_dict = criterion(postnet_output, decoder_output, mel_input,
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 377, in forward
postnet_diff_spec_loss = self.criterion_diff_spec(postnet_output, mel_input, output_lens)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/projects/TTS/TTS/tts/layers/losses.py", line 203, in forward
return self.loss_func(x_diff, target_diff)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/modules/loss.py", line 94, in forward
return F.l1_loss(input, target, reduction=self.reduction)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/nn/functional.py", line 2633, in l1_loss
expanded_input, expanded_target = torch.broadcast_tensors(input, target)
File "/home/big-boy/anaconda3/envs/PyCapacitron/lib/python3.8/site-packages/torch/functional.py", line 71, in broadcast_tensors
return _VF.broadcast_tensors(tensors) # type: ignore
RuntimeError: The size of tensor a (513) must match the size of tensor b (80) at non-singleton dimension 2
Originally posted by @a-froghyar in #370 (comment)
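One possible reading of the 513 versus 80 mismatch, offered as an assumption rather than a diagnosis: 513 is the number of bins in a one-sided STFT frame when fft_size is 1024, while 80 is num_mels. If the postnet output is a linear spectrogram and it is compared against the mel target, the last dimensions cannot match. The sketch below only shows the arithmetic; the helper name is made up for illustration.

```python
def num_linear_bins(fft_size: int) -> int:
    """Number of frequency bins in a one-sided STFT frame."""
    return fft_size // 2 + 1


FFT_SIZE = 1024  # fft_size value used in the configs quoted on this page
NUM_MELS = 80    # num_mels value used in the configs quoted on this page

linear_bins = num_linear_bins(FFT_SIZE)
print(linear_bins, NUM_MELS, linear_bins == NUM_MELS)
```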
You can try noam_schedule: True to let the model stabilize initially with lower learning rates.
Also try tb_model_param_stats: True to watch model layer stats on TensorBoard. It shows you if something is wrong with any of the layers.
Originally posted by @erogol in #388 (comment)
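To make the "lower learning rates at the start" point concrete, here is a minimal sketch of the commonly used Noam warmup formula. It is an assumption that the NoamLR class mentioned above follows this exact form; the function name noam_lr is hypothetical. With this formula the rate rises until warmup_steps, where it equals the configured lr, and decays afterwards.

```python
def noam_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given step under a Noam-style warmup and decay.

    Linear increase for step < warmup_steps, inverse square-root decay after.
    """
    step = max(step, 1)  # avoid step ** -0.5 at step 0
    return base_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)


# Values taken from the configs on this page: lr = 0.0001, warmup_steps = 4000.
for step in (1, 1000, 4000, 16000):
    print(step, noam_lr(step, base_lr=1e-4, warmup_steps=4000))
```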
Trying to train a WaveGrad Vocoder on Python 3.8.8 for Windows 10 yields this error:
Traceback (most recent call last):
File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
main(args)
File "./TTS/bin/train_vocoder_wavegrad.py", line 417, in main
best_loss = save_best_model(
File "TTS\vocoder\utils\io.py", line 97, in save_best_model
os.symlink(best_model_name, os.path.join(out_path, link_name))
OSError: [WinError 1314] A required privilege is not held by the client: 'best_model_1.pth.tar' -> 'c:/path/to/my/project/best_model.pth.tar'
This is because Windows only lets admins create symlinks for... reasons...
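A possible workaround, sketched under the assumption that a plain copy of the best checkpoint is acceptable when a symlink cannot be created. The helper link_or_copy is hypothetical and is not part of the TTS code base; it only uses standard library calls.

```python
import os
import shutil


def link_or_copy(src: str, dst: str) -> str:
    """Point dst at src with a symlink, falling back to a file copy.

    Returns "symlink" or "copy" so the caller can log what happened.
    """
    if os.path.lexists(dst):
        os.remove(dst)
    try:
        os.symlink(src, dst)
        return "symlink"
    except OSError:
        # A relative symlink target is resolved against the link's directory,
        # so resolve it the same way before copying.
        if os.path.isabs(src):
            source_path = src
        else:
            source_path = os.path.join(os.path.dirname(dst), src)
        shutil.copyfile(source_path, dst)
        return "copy"
```

The copy costs extra disk space per best model, which is the trade-off for not requiring elevated privileges.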
@gerazov based on mozilla/TTS#679, starting from the last best_model looks confusing for people, and sometimes I found myself removing the best model to continue from the last checkpoint.
I suggest reverting this to the old behavior that we pick the last checkpoint.
These are the main dev plans for 🐸 TTS.
If you want to contribute to 🐸 TTS and don't know where to start you can pick one here and start with our Contribution Guideline. We're also always here to help.
Feel free to pick one or suggest a new one.
Contributions are always welcome 💪 .
Synthesizer interface on CLI or Server.
TTS.tts models.
Is your feature request related to a problem? Please describe.
It is hard to compare models with different configurations by just looking at Tensorboard.
Describe the solution you'd like
We can pass the configuration fields to TensorBoard.
The resemble.ai system has markup like:
<prosody rate="45%"><style emotions="expressiveness:0.9
aggressiveness:0.5 pace:0.2">
<say-as interpret-as="characters">Zeuxis</style></say-as>
Is this open sourced in coqui?
(I keep it in the issues to refer back to the initial discussion)
Hi All!!
I guess one of the biggest issues in TTS is the way we handle the configs for models and training. Putting example config files under the config folder is hard to maintain and looks complicated for people to start using TTS.
So I want to discuss here some better alternatives and ask for the wisdom of the crowd 🧑🤝🧑.
A couple of constraints we need to consider, off the top of my head:
config.json by violating the JSON format with comments. It is not optimal.
If you have an idea, please share it below and let's discuss it.
Edit:
I should also add one more constraint.
NOTE: This is a continuation of the previously started conversation in mozilla/TTS#660.
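On the point about config.json carrying comments that violate the JSON format: a minimal sketch of one way to read such files with the standard library. It strips // comments while leaving string contents alone, so a value such as "tcp://localhost:54321" survives. The function names are hypothetical, and this is not claimed to be how the project's own config loader works.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that sit outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line but keep the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_commented_json(text: str) -> dict:
    """Parse JSON text that may contain // line comments."""
    return json.loads(strip_line_comments(text))


sample = '{\n  "url": "tcp://localhost:54321", // distributed url\n  "r": 7 // reduction rate\n}'
print(load_commented_json(sample))
```

A comment stripper that ignores string boundaries would cut such a URL at the double slash, which may be why the configs on this page write it as tcp:\/\/.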
Hi,
I am trying to train a Tacotron2 model in Hindi. I have my own 25 hour single speaker cleaned dataset. I'm using the following configuration.
{
"model": "Tacotron2",
"run_name": "hindi-ddc",
"run_description": "tacotron2 with DDC and differential spectral loss.",
// AUDIO PARAMETERS
"audio":{
// stft parameters
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
"win_length": 1024, // stft window length in samples.
"hop_length": 256, // stft window hop-length in samples.
"frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-length in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
"ref_level_db": 20, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": true,// enable trimming of silence of audio as you load it. LJspeech (true), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for trimming silence. Set this according to your dataset.
// Griffin-Lim
"power": 1.5, // value to sharpen wav signals after GL algorithm.
"griffin_lim_iters": 60,// #griffin-lim iterations. 30-60 is a good range. Larger the value, slower the generation.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1,
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": null // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
},
// VOCABULARY PARAMETERS
// if custom character set is not defined,
// default set in symbols.py is used
"characters":{
"pad": "_",
"eos": "~",
"bos": "^",
"characters": "अआइईउऊऋएऐऑओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहह़ा",
"punctuations":"!'\",.:?। ",
"phonemes":"iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻʘɓǀɗǃʄǂɠǁʛpbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟˈˌːˑʍwɥʜʢʡɕʑɺɧɚ˞ɫ"
},
// DISTRIBUTED TRAINING
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54321"
},
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 32, // Batch size for training. Values lower than 32 might make attention hard to learn. It is overwritten by 'gradual_training'.
"eval_batch_size":16,
"r": 7, // Number of decoder frames to predict per iteration. Set the initial values if gradual training is enabled.
"gradual_training": [[0, 7, 64], [1, 5, 64], [50000, 3, 32], [130000, 2, 32], [290000, 1, 32]], //set gradual training steps [first_step, r, batch_size]. If it is null, gradual training is disabled. For Tacotron, you might need to reduce the 'batch_size' as you proceed.
"mixed_precision": true, // level of optimization with NVIDIA's apex feature for automatic mixed FP16/FP32 precision (AMP), NOTE: currently only O1 is supported, and use "O1" to activate.
// LOSS SETTINGS
"loss_masking": true, // enable / disable loss masking against the sequence padding.
"decoder_loss_alpha": 0.5, // original decoder loss weight. If > 0, it is enabled
"postnet_loss_alpha": 0.25, // original postnet loss weight. If > 0, it is enabled
"postnet_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_diff_spec_alpha": 0.25, // differential spectral loss weight. If > 0, it is enabled
"decoder_ssim_alpha": 0.5, // decoder ssim loss weight. If > 0, it is enabled
"postnet_ssim_alpha": 0.25, // postnet ssim loss weight. If > 0, it is enabled
"ga_alpha": 5.0, // weight for guided attention loss. If > 0, guided attention is enabled.
"stopnet_pos_weight": 15.0, // pos class weight for stopnet loss since there are way more negative samples than positive samples.
// VALIDATION
"run_eval": true,
"test_delay_epochs": 10, //Until attention is aligned, testing only wastes computation time.
"test_sentences_file": null, // set a file to load sentences to be used for testing. If it is null then we use default english sentences.
// OPTIMIZER
"noam_schedule": false, // use noam warmup and lr schedule.
"grad_clip": 1.0, // upper limit for gradients for clipping.
"epochs": 1000, // total number of epochs to train.
"lr": 0.0001, // Initial learning rate. If Noam decay is active, maximum learning rate.
"wd": 0.000001, // Weight decay weight.
"warmup_steps": 4000, // Noam decay steps to increase the learning rate from 0 to "lr"
"seq_len_norm": false, // Normalize each sample loss with its length to alleviate imbalanced datasets. Use it if your dataset is small or has skewed distribution of sequence lengths.
// TACOTRON PRENET
"memory_size": -1, // ONLY TACOTRON - size of the memory queue used for storing last decoder predictions for auto-regression. If < 0, memory queue is disabled and decoder only uses the last prediction frame.
"prenet_type": "original", // "original" or "bn".
"prenet_dropout": false, // enable/disable dropout at prenet.
// TACOTRON ATTENTION
"attention_type": "original", // 'original' , 'graves', 'dynamic_convolution'
"attention_heads": 4, // number of attention heads (only for 'graves')
"attention_norm": "sigmoid", // softmax or sigmoid.
"windowing": false, // Enables attention windowing. Used only in eval mode.
"use_forward_attn": false, // if it uses forward attention. In general, it aligns faster.
"forward_attn_mask": false, // Additional masking forcing monotonicity only in eval mode.
"transition_agent": false, // enable/disable transition agent of forward attention.
"location_attn": true, // enable_disable location sensitive attention. It is enabled for TACOTRON by default.
"bidirectional_decoder": false, // use https://arxiv.org/abs/1907.09006. Use it, if attention does not work well with your dataset.
"double_decoder_consistency": true, // use DDC explained here https://erogol.com/solving-attention-problems-of-tts-models-with-double-decoder-consistency-draft/
"ddc_r": 7, // reduction rate for coarse decoder.
// STOPNET
"stopnet": true, // Train stopnet predicting the end of synthesis.
"separate_stopnet": true, // Train stopnet separately if 'stopnet==true'. It prevents the stopnet loss from influencing the rest of the model. It gives a better model, but it trains SLOWER.
// TENSORBOARD and LOGGING
"print_step": 25, // Number of steps to log training on console.
"tb_plot_step": 100, // Number of steps to plot TB training figures.
"print_eval": false, // If True, it prints intermediate loss values in evaluation.
"save_step": 200, // Number of training steps expected to save training stats and checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"keep_all_best": false, // If true, keeps all best_models after keep_after steps
"keep_after": 10000, // Global step after which to keep best models if keep_all_best is true
"tb_model_param_stats": false, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"text_cleaner": "basic_cleaners",
"enable_eos_bos_chars": false, // enable/disable beginning of sentence and end of sentence chars.
"num_loader_workers": 2, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 2, // number of evaluation data loader processes.
"batch_group_size": 4, //Number of batches to shuffle after bucketing.
"min_seq_len": 81, // DATASET-RELATED: minimum text length to use in training
"max_seq_len": 186, // DATASET-RELATED: maximum text length
"compute_input_seq_cache": false, // if true, text sequences are computed before starting training. If phonemes are enabled, they are also computed at this stage.
"use_noise_augment": true,
// PATHS
"output_path": "/home/ubuntu/output/",
// PHONEMES
"phoneme_cache_path": "/home/ubuntu/phoneme_cache/", // phoneme computation is slow, therefore, it caches results in the given folder.
"use_phonemes": false, // use phonemes instead of raw characters. It is suggested for better pronunciation.
"phoneme_language": "hi", // depending on your target language, pick one from https://github.com/bootphon/phonemizer#languages
// MULTI-SPEAKER and GST
"use_speaker_embedding": false, // use speaker embedding to enable multi-speaker learning.
"use_gst": false, // use global style tokens
"use_external_speaker_embedding_file": false, // if true, forces the model to use external embedding per sample instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
"external_speaker_embedding_file": "../../speakers-vctk-en.json", // if not null and use_external_speaker_embedding_file is true, it is used to load a specific embedding file and thus uses these embeddings instead of nn.embeddings, that is, it supports external embeddings such as those used at: https://arxiv.org/abs/1806.04558
"gst": { // gst parameter if gst is enabled
"gst_style_input": null, // Condition the style input either on a
// -> wave file [path to wave] or
// -> dictionary using the style tokens {'token1': 'value', 'token2': 'value'} example {"0": 0.15, "1": 0.15, "5": -0.15}
// with the dictionary being len(dict) <= len(gst_style_tokens).
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},
// DATASETS
"datasets": // List of datasets. They are all merged and they get different speaker_ids.
[
{
"name": "hindi",
"path": "/dev/data/hindidataset/",
"meta_file_train": "metadata.csv", // for vctk if list, ignore speakers id in list for train, its useful for test cloning with new speakers
"meta_file_val": null
}
]
}
--
The stacktrace I'm hitting is below.
CHECKPOINT : /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69/checkpoint_200.pth.tar
/home/ubuntu/TTS/TTS/utils/audio.py:234: RuntimeWarning: overflow encountered in power
return np.power(10.0, x / self.spec_gain)
! Run is kept in /home/ubuntu/output/modi-ddc-March-19-2021_05+17PM-8545a69
Traceback (most recent call last):
File "TTS/bin/train_tacotron.py", line 664, in <module>
main(args)
File "TTS/bin/train_tacotron.py", line 634, in main
scaler_st)
File "TTS/bin/train_tacotron.py", line 312, in train
train_audio = ap.inv_melspectrogram(const_spec.T)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 286, in inv_melspectrogram
return self._griffin_lim(S**self.power)
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 315, in _griffin_lim
angles = np.exp(1j * np.angle(self._stft(y)))
File "/home/ubuntu/TTS/TTS/utils/audio.py", line 303, in _stft
pad_mode=self.stft_pad_mode,
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/core/spectrum.py", line 215, in stft
util.valid_audio(y)
File "/home/ubuntu/.local/lib/python3.6/site-packages/librosa/util/utils.py", line 275, in valid_audio
raise ParameterError('Audio buffer is not finite everywhere')
librosa.util.exceptions.ParameterError: Audio buffer is not finite everywhere
--
I've been trying to debug for 2 days but not able to make progress. I'd really appreciate any help/suggestions.
Hey, I'm getting weird plots and no audio produced in the tensorboard examples after 70k steps. I'm using a custom Blizzard dataset that I've already trained other models with that produced intelligible speech after 20k steps. The training has also stopped after 70k steps because of the NaN decoder_loss error. I'm using the #373 patched dev branch with the following config file:
{
"model": "Tacotron",
"run_name": "blizzard-gts",
"run_description": "tacotron GST.",
"audio": {
"fft_size": 1024,
"win_length": 1024,
"hop_length": 256,
"frame_length_ms": null,
"frame_shift_ms": null,
"sample_rate": 24000,
"preemphasis": 0.0,
"ref_level_db": 20,
"do_trim_silence": true,
"trim_db": 60,
"power": 1.5,
"griffin_lim_iters": 60,
"num_mels": 80,
"mel_fmin": 95.0,
"mel_fmax": 12000.0,
"spec_gain": 1,
"signal_norm": true,
"min_level_db": -100,
"symmetric_norm": true,
"max_norm": 4.0,
"clip_norm": true,
"stats_path": null
},
"distributed": {
"backend": "nccl",
"url": "tcp:\/\/localhost:54321"
},
"reinit_layers": [],
"batch_size": 128,
"eval_batch_size": 16,
"r": 7,
"gradual_training": [
[0, 7, 64],
[1, 5, 64],
[50000, 3, 32],
[130000, 2, 32],
[290000, 1, 32]
],
"mixed_precision": true,
"loss_masking": false,
"decoder_loss_alpha": 0.5,
"postnet_loss_alpha": 0.25,
"postnet_diff_spec_alpha": 0.25,
"decoder_diff_spec_alpha": 0.25,
"decoder_ssim_alpha": 0.5,
"postnet_ssim_alpha": 0.25,
"ga_alpha": 5.0,
"stopnet_pos_weight": 15.0,
"run_eval": true,
"test_delay_epochs": 10,
"test_sentences_file": null,
"noam_schedule": false,
"grad_clip": 1.0,
"epochs": 300000,
"lr": 0.0001,
"wd": 0.000001,
"warmup_steps": 4000,
"seq_len_norm": false,
"memory_size": -1,
"prenet_type": "original",
"prenet_dropout": true,
"attention_type": "graves",
"attention_heads": 4,
"attention_norm": "sigmoid",
"windowing": false,
"use_forward_attn": false,
"forward_attn_mask": false,
"transition_agent": false,
"location_attn": true,
"bidirectional_decoder": false,
"double_decoder_consistency": false,
"ddc_r": 7,
"stopnet": true,
"separate_stopnet": true,
"print_step": 25,
"tb_plot_step": 100,
"print_eval": false,
"save_step": 5000,
"checkpoint": true,
"tb_model_param_stats": false,
"text_cleaner": "phoneme_cleaners",
"enable_eos_bos_chars": false,
"num_loader_workers": 8,
"num_val_loader_workers": 8,
"batch_group_size": 4,
"min_seq_len": 6,
"max_seq_len": 153,
"compute_input_seq_cache": false,
"use_noise_augment": true,
"output_path": "/home/big-boy/Models/Blizzard/",
"phoneme_cache_path": "/home/big-boy/Models/phoneme_cache/",
"use_phonemes": true,
"phoneme_language": "en-us",
"use_speaker_embedding": false,
"use_gst": true,
"use_external_speaker_embedding_file": false,
"external_speaker_embedding_file": "../../speakers-vctk-en.json",
"gst": {
"gst_style_input": null,
"gst_embedding_dim": 512,
"gst_num_heads": 4,
"gst_style_tokens": 10,
"gst_use_speaker_embedding": false
},
"datasets":
[{
"name": "ljspeech",
"path": "/home/big-boy/Data/blizzard2013/segmented/",
"meta_file_train": "metadata.csv",
"meta_file_val": null
}]
}
Describe the bug
After upgrading from tts-0.0.9 to tts-0.0.11 the model was updated but TTS still tries to load the cached version.
A fix could be to hash the models and check whether the cached model matches the expected hash. This would also catch cases where a cached model was corrupted in any way.
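The hashing idea can be sketched with the standard library as follows. This assumes the expected SHA-256 digest of each model file is published somewhere the client can read it; where that digest would live is not specified in this issue, and the function names are hypothetical.

```python
import hashlib
import os


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_is_valid(path: str, expected_sha256: str) -> bool:
    """True only if the cached file exists and matches the expected digest."""
    if not os.path.isfile(path):
        return False
    return file_sha256(path) == expected_sha256.lower()
```

A caller would re-download the model whenever cache_is_valid returns False, instead of trusting an "already downloaded" check based on the path alone.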
To Reproduce
$ /nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server --model_name tts_models/en/ljspeech/glow-tts --vocoder_name vocoder_models/universal/libri-tts/fullband-melgan
> tts_models/en/ljspeech/glow-tts is already downloaded.
> vocoder_models/universal/libri-tts/fullband-melgan is already downloaded.
> Using model: glow_tts
Traceback (most recent call last):
File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/.tts-server-wrapped", line 6, in <module>
from TTS.server.server import main
File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/server/server.py", line 62, in <module>
synthesizer = Synthesizer(args.tts_checkpoint, args.tts_config, args.vocoder_checkpoint, args.vocoder_config, args.use_cuda)
File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 45, in __init__
self.load_tts(tts_checkpoint, tts_config,
File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/utils/synthesizer.py", line 95, in load_tts
self.tts_model.load_checkpoint(tts_config, tts_checkpoint, eval=True)
File "/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/lib/python3.8/site-packages/TTS/tts/models/glow_tts.py", line 229, in load_checkpoint
self.load_state_dict(state['model'])
File "/nix/store/1vv0fsvdv9j4gmqjgjwb3c5v8x906qgd-python3.8-pytorch-1.8.1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1223, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for GlowTTS:
size mismatch for encoder.emb.weight: copying a param with shape torch.Size([129, 192]) from checkpoint, the shape in current model is torch.Size([130, 192]).
/nix/store/kfk8kql1pxbxz6iazjwnac0cl1llzapn-tts-0.0.11/bin/tts-server 4,64s user 0,71s system 117% cpu 4,541 total
Expected behavior
Environment (please complete the following information):
We need a TTS with unlimited decoding steps for parsing long texts. Is it possible?
The quality of the English examples is already very good. What's missing for it to be useful to me is a model for a German voice.
You can use this issue for related discussions and documenting progress creating it.
Please consider sharing your pre-trained models in any language (If the licences allow that).
We can include them in our model catalogue for public use by attributing your name (website, company etc.).
That would enable more people to experiment together and coordinate, instead of individual efforts to achieve similar goals.
That is also a chance to make your work more visible.
You can share in two ways;
Models are served under the .models.json file and any model is available under tts CLI or Server end points. More details...
(previously mozilla/TTS#395)
Describe the bug
Dynamic Convolutional Attention fails in mixed_precision training and ultimately causes NaN error.
To Reproduce
Steps to reproduce the behavior:
Set mixed_precision=True in config.json.
Set dynamic_convolution=True in config.json.
Expected behavior
The model should learn the alignment after 10K iterations with no NaN loss as it does in full precision training.
Environment (please complete the following information):
With dc2954e checked out, when I try to train WaveGrad with the following config:
{
"run_name": "wavegrad-my-project",
"run_description": "wavegrad test",
"audio":{
"fft_size": 1024, // number of stft frequency levels. Size of the linear spectrogram frame.
"win_length": 1024, // stft window length in samples.
"hop_length": 256, // stft window hop-length in samples.
"frame_length_ms": null, // stft window length in ms. If null, 'win_length' is used.
"frame_shift_ms": null, // stft window hop-length in ms. If null, 'hop_length' is used.
// Audio processing parameters
"sample_rate": 22050, // DATASET-RELATED: wav sample-rate. If different than the original data, it is resampled.
"preemphasis": 0.0, // pre-emphasis to reduce spec noise and make it more structured. If 0.0, no pre-emphasis.
"ref_level_db": 0, // reference level db, theoretically 20db is the sound of air.
// Silence trimming
"do_trim_silence": false,// enable trimming of silence of audio as you load it. LJspeech (false), TWEB (false), Nancy (true)
"trim_db": 60, // threshold for trimming silence. Set this according to your dataset.
// MelSpectrogram parameters
"num_mels": 80, // size of the mel spec frame.
"mel_fmin": 50.0, // minimum freq level for mel-spec. ~50 for male and ~95 for female voices. Tune for dataset!!
"mel_fmax": 7600.0, // maximum freq level for mel-spec. Tune for dataset!!
"spec_gain": 1.0, // scaler value applied after log transform of spectrogram.
// Normalization parameters
"signal_norm": true, // normalize spec values. Mean-Var normalization if 'stats_path' is defined otherwise range normalization defined by the other params.
"min_level_db": -100, // lower bound for normalization
"symmetric_norm": true, // move normalization to range [-1, 1]
"max_norm": 4.0, // scale normalization to range [-max_norm, max_norm] or [0, max_norm]
"clip_norm": true, // clip normalized values into the range.
"stats_path": "/path/to/my/project/scale_stats.npy" // DO NOT USE WITH MULTI_SPEAKER MODEL. scaler stats file computed by 'compute_statistics.py'. If it is defined, mean-std based normalization is used and other normalization params are ignored
},
// DISTRIBUTED TRAINING
"mixed_precision": true, // enable torch mixed precision training (true, false)
"distributed":{
"backend": "nccl",
"url": "tcp:\/\/localhost:54322"
},
"target_loss": "avg_wavegrad_loss", // loss value to pick the best model to save after each epoch
// MODEL PARAMETERS
"generator_model": "wavegrad",
"model_params":{
"use_weight_norm": true,
"y_conv_channels":32,
"x_conv_channels":768,
"ublock_out_channels": [512, 512, 256, 128, 128],
"dblock_out_channels": [128, 128, 256, 512],
"upsample_factors": [4, 4, 4, 2, 2],
"upsample_dilations": [
[1, 2, 1, 2],
[1, 2, 1, 2],
[1, 2, 4, 8],
[1, 2, 4, 8],
[1, 2, 4, 8]]
},
// DATASET
"data_path": "/path/to/my/project/wavs/22.05k_edited_normalized", // root data path. It finds all wav files recursively from there.
"feature_path": null, // if you use precomputed features
"seq_len": 6144, // 24 * hop_length
"pad_short": 0, // additional padding for short wavs
"conv_pad": 0, // additional padding against convolutions applied to spectrograms
"use_noise_augment": false, // add noise to the audio signal for augmentation
"use_cache": false, // use in memory cache to keep the computed features. This might cause OOM.
"reinit_layers": [], // give a list of layer names to restore from the given checkpoint. If not defined, it reloads all heuristically matching layers.
// TRAINING
"batch_size": 96, // Batch size for training.
// NOISE SCHEDULE PARAMS - Only effective at training time.
"train_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 1000
},
"test_noise_schedule":{
"min_val": 1e-6,
"max_val": 1e-2,
"num_steps": 50
},
// VALIDATION
"run_eval": true, // enable/disable evaluation run
// OPTIMIZER
"epochs": 10000, // total number of epochs to train.
"clip_grad": 1.0, // Generator gradient clipping threshold. Apply gradient clipping if > 0
"lr_scheduler": "MultiStepLR", // one of the schedulers from https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
"lr_scheduler_params": {
"gamma": 0.5,
"milestones": [100000, 200000, 300000, 400000, 500000, 600000]
},
"lr": 1e-4, // Initial learning rate. If Noam decay is active, maximum learning rate.
// TENSORBOARD and LOGGING
"print_step": 50, // Number of steps to log training on console.
"print_eval": false, // If True, it prints loss values for each step in eval run.
"save_step": 5000, // Number of training steps expected to plot training stats on TB and save model checkpoints.
"checkpoint": true, // If true, it saves checkpoints per "save_step"
"keep_all_best": false, // If true, keeps all best_models after keep_after steps
"keep_after": 10000, // Global step after which to keep best models if keep_all_best is true
"tb_model_param_stats": true, // true, plots param stats per layer on tensorboard. Might be memory consuming, but good for debugging.
// DATA LOADING
"num_loader_workers": 4, // number of training data loader processes. Don't set it too big. 4-8 are good values.
"num_val_loader_workers": 4, // number of evaluation data loader processes.
"eval_split_size": 256,
// PATHS
"output_path": "/path/to/my/project/Models"
}
... it fails immediately with this error:
> TRAINING (2021-03-28 19:47:57)
--> TRAIN PERFORMACE -- EPOCH TIME: 8.09 sec -- GLOBAL_STEP: 1
| > avg_wavegrad_loss: 1.47542
| > avg_loader_time: 16.76300
| > avg_step_time: 8.08550
coqui-tts\lib\site-packages\torch\optim\lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
[WARNING] NaN or Inf found in input tensor.
! Run is removed from D:/path/to/my/project
Traceback (most recent call last):
File "./TTS/bin/train_vocoder_wavegrad.py", line 442, in <module>
main(args)
File "./TTS/bin/train_vocoder_wavegrad.py", line 412, in main
_, global_step = train(model, criterion, optimizer, scheduler, scaler,
File "./TTS/bin/train_vocoder_wavegrad.py", line 223, in train
tb_logger.tb_model_weights(model, global_step)
File "coqui-tts\TTS\utils\tensorboard_logger.py", line 34, in tb_model_weights
self.writer.add_histogram(
File "coqui-tts\lib\site-packages\tensorboardX\writer.py", line 503, in add_histogram
histogram(tag, values, bins, max_bins=max_bins), global_step, walltime)
File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 210, in histogram
hist = make_histogram(values.astype(float), bins, max_bins)
File "coqui-tts\lib\site-packages\tensorboardX\summary.py", line 248, in make_histogram
raise ValueError('The histogram is empty, please file a bug report.')
ValueError: The histogram is empty, please file a bug report.
When I try to run Coqui-TTS-torchhub-example.ipynb on Colab I get this error:
"""
Downloading: "https://github.com/coqui-ai/TTS/archive/dev.zip" to /root/.cache/torch/hub/dev.zip
Downloading model to /root/.local/share/tts/tts_models--en--ljspeech--tacotron2-DCA
Downloading model to /root/.local/share/tts/vocoder_models--en--ljspeech--multiband-melgan
Using model: Tacotron2
TypeError Traceback (most recent call last)
in ()
3 synthesizer = torch.hub.load('coqui-ai/TTS:dev',
4 'tts',
----> 5 source='github')
6 wav = synthesizer.tts("TTS is an open-source library that generates synthethic speech!")
6 frames
/usr/local/lib/python3.7/dist-packages/torch/hub.py in load(repo_or_dir, model, *args, **kwargs)
337 repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose)
338
--> 339 model = _load_local(repo_or_dir, model, *args, **kwargs)
340 return model
341
/usr/local/lib/python3.7/dist-packages/torch/hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
366
367 entry = _load_entry_from_hubconf(hub_module, model)
--> 368 model = entry(*args, **kwargs)
369
370 sys.path.remove(hubconf_dir)
/root/.cache/torch/hub/coqui-ai_TTS_dev/hubconf.py in tts(model_name, vocoder_name, use_cuda)
29
30 # create synthesizer
---> 31 synt = Synthesizer(model_path, config_path, vocoder_path, vocoder_config_path, use_cuda)
32 return synt
33
/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in __init__(self, tts_checkpoint, tts_config_path, tts_speakers_file, vocoder_checkpoint, vocoder_config, encoder_checkpoint, encoder_config, use_cuda)
73 self.output_sample_rate = self.tts_config.audio["sample_rate"]
74 if vocoder_checkpoint:
---> 75 self._load_vocoder(vocoder_checkpoint, vocoder_config, use_cuda)
76 self.output_sample_rate = self.vocoder_config.audio["sample_rate"]
77
/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/synthesizer.py in _load_vocoder(self, model_file, model_config, use_cuda)
151 use_cuda (bool): enable/disable CUDA use.
152 """
--> 153 self.vocoder_config = load_config(model_config)
154 self.vocoder_ap = AudioProcessor(verbose=False, **self.vocoder_config["audio"])
155 self.vocoder_model = setup_generator(self.vocoder_config)
/root/.cache/torch/hub/coqui-ai_TTS_dev/TTS/utils/io.py in load_config(config_path)
43 config = AttrDict()
44
---> 45 ext = os.path.splitext(config_path)[1]
46 if ext in (".yml", ".yaml"):
47 with open(config_path, "r", encoding="utf-8") as f:
/usr/lib/python3.7/posixpath.py in splitext(p)
120
121 def splitext(p):
--> 122 p = os.fspath(p)
123 if isinstance(p, bytes):
124 sep = b'/'
TypeError: expected str, bytes or os.PathLike object, not bool
"""
Can anyone tell me what is going wrong?
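Reading the traceback, one plausible cause is an argument-order mismatch: hubconf.py passes five positional arguments, while the Synthesizer signature shown above has tts_speakers_file in the third position. If so, use_cuda would land in vocoder_config and reach load_config as a bool. The sketch below reproduces that effect with a stand-in function. The stand-in, its default values, and the file names are assumptions for illustration, not the library's code.

```python
import os


def synthesizer_args(tts_checkpoint, tts_config_path, tts_speakers_file=None,
                     vocoder_checkpoint=None, vocoder_config=None,
                     encoder_checkpoint=None, encoder_config=None,
                     use_cuda=False):
    """Stand-in using the parameter names from the traceback.

    It only reports which value each parameter received.
    The default values are assumed; the traceback does not show them.
    """
    return dict(locals())


# Hypothetical file names, used only to make the mapping visible.
model_path, config_path = "model.pth.tar", "config.json"
vocoder_path, vocoder_config_path = "vocoder.pth.tar", "vocoder_config.json"
use_cuda = False

# Positional call in the same order as line 31 of hubconf.py in the traceback.
positional = synthesizer_args(model_path, config_path, vocoder_path,
                              vocoder_config_path, use_cuda)

# The fifth positional slot is vocoder_config, so it now holds the bool.
positional_error = None
try:
    os.path.splitext(positional["vocoder_config"])
except TypeError as err:
    positional_error = str(err)

# Keyword arguments keep every value attached to the intended parameter.
by_keyword = synthesizer_args(tts_checkpoint=model_path,
                              tts_config_path=config_path,
                              vocoder_checkpoint=vocoder_path,
                              vocoder_config=vocoder_config_path,
                              use_cuda=use_cuda)

print(positional["vocoder_config"], positional_error)
print(by_keyword["vocoder_config"])
```

If this reading is right, calling Synthesizer with keyword arguments in hubconf.py would avoid the error. That is a suggestion to verify against the dev branch, not a confirmed fix.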
When I run pip install -e . or pip install -r requirements.txt, I get the following errors:
ERROR: umap-learn 0.5.1 has requirement numba>=0.49, but you'll have numba 0.48.0 which is incompatible.
ERROR: pynndescent 0.5.2 has requirement numba>=0.51.2, but you'll have numba 0.48.0 which is incompatible.
To which versions should these two packages be downgraded?
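Both messages state a lower bound on numba, so it can help to compare the installed version against each bound before deciding what to change. The sketch below does that with plain tuple comparison. The helper names are hypothetical, the version strings are copied from the errors above, and the comparison only handles purely numeric versions.

```python
def parse_version(text):
    """Turn a purely numeric version such as '0.51.2' into (0, 51, 2)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(installed, minimum):
    """Return True when the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


# Lower bounds on numba, copied from the two pip error messages.
numba_minimums = {
    "umap-learn 0.5.1": "0.49",
    "pynndescent 0.5.2": "0.51.2",
}
installed_numba = "0.48.0"

# Packages whose numba requirement the installed version does not meet.
unmet = [name for name, minimum in numba_minimums.items()
         if not satisfies_minimum(installed_numba, minimum)]

# The smallest numba version that would satisfy both bounds at once.
lowest_numba_for_both = max(numba_minimums.values(), key=parse_version)

print(unmet)
print(lowest_numba_for_both)
```

This only shows what the two error messages require. Whether the rest of the requirements allow a newer numba, or whether older umap-learn and pynndescent releases are the better route, is not answered by the messages themselves.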
I've been trying to figure out a good configuration for training the Tacotron2 model. I'm not sure how to set the MelSpectrogram parameters accurately.
Specifically, how would I calculate the right values for mel_fmin and mel_fmax for my dataset?
Thanks!
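One constraint that holds for any dataset is that mel_fmax cannot exceed the Nyquist frequency, which is half the sample rate, and mel_fmin must be non-negative and below mel_fmax. The sketch below checks only that. The function name is hypothetical, and the 22050 Hz sample rate and the example values are assumptions. Choosing values inside the valid range is still a matter of tuning for the dataset.

```python
def check_mel_range(sample_rate, mel_fmin, mel_fmax=None):
    """Validate a mel frequency range against the Nyquist limit.

    Hypothetical helper, not part of any library. Returns the range as
    (mel_fmin, mel_fmax), using Nyquist when mel_fmax is not given.
    """
    nyquist = sample_rate / 2.0
    if mel_fmax is None:
        mel_fmax = nyquist
    if mel_fmax > nyquist:
        raise ValueError(
            f"mel_fmax={mel_fmax} exceeds the Nyquist frequency {nyquist}")
    if not 0.0 <= mel_fmin < mel_fmax:
        raise ValueError(
            f"mel_fmin={mel_fmin} must satisfy 0 <= mel_fmin < mel_fmax")
    return float(mel_fmin), float(mel_fmax)


# Assumed sample rate of 22050 Hz; replace with the dataset's own value.
full_band = check_mel_range(22050, 0.0)
narrow_band = check_mel_range(22050, 50.0, 8000.0)

print(full_band)
print(narrow_band)
```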