
deepvoice3_pytorch's Introduction


Deepvoice3_pytorch


PyTorch implementation of convolutional networks-based text-to-speech synthesis models:

  1. arXiv:1710.07654: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.
  2. arXiv:1710.08969: Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.

Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.

Folks

Online TTS demo

Notebooks intended to be executed on https://colab.research.google.com are available:

Highlights

  • Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
  • Multi-speaker and single speaker versions of DeepVoice3
  • Audio samples and pre-trained models
  • Preprocessor for LJSpeech (en), JSUT (jp) and VCTK datasets, as well as carpedm20/multi-speaker-tacotron-tensorflow compatible custom dataset (in JSON format)
  • Language-dependent frontend text processor for English and Japanese

Samples

Pretrained models

NOTE: pretrained models are not compatible with master. To be updated soon.

URL  | Model                    | Data     | Hyper parameters                                        | Git commit | Steps
link | DeepVoice3               | LJSpeech | link                                                    | abf0a21    | 640k
link | Nyanko                   | LJSpeech | builder=nyanko,preset=nyanko_ljspeech                   | ba59dc7    | 585k
link | Multi-speaker DeepVoice3 | VCTK     | builder=deepvoice3_multispeaker,preset=deepvoice3_vctk  | 0421749    | 300k + 300k

To use pre-trained models, it's highly recommended that you check out the specific git commit noted above, i.e.,

git checkout ${commit_hash}

Then follow the "Synthesize from a checkpoint" section in the README of the specific git commit. Please notice that the latest development version of the repository may not work.

You could try for example:

# pretrained model (20180505_deepvoice3_checkpoint_step000640000.pth)
# hparams (20180505_deepvoice3_ljspeech.json)
git checkout 4357976
python synthesis.py --preset=20180505_deepvoice3_ljspeech.json \
  20180505_deepvoice3_checkpoint_step000640000.pth \
  sentences.txt \
  output_dir

Notes on hyper parameters

  • Default hyper parameters, used during the preprocessing/training/synthesis stages, are tuned for English TTS using the LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See hparams.py for details.
  • builder specifies which model you want to use. deepvoice3, deepvoice3_multispeaker [1] and nyanko [2] are supported.
  • The hyper parameters described in the DeepVoice3 paper for the single-speaker model didn't work for the LJSpeech dataset, so I changed a few things: dilated convolutions, more channels, more layers, guided attention loss, etc. See the code for details. The changes are also applied to the multi-speaker model.
  • Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
  • With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when multiple attention layers are used. With guided attention, I can confirm that five attention layers become monotonic, though I did not see speech quality improvements.
  • Binary divergence (described in https://arxiv.org/abs/1710.08969) seems to stabilize training, particularly for deep (> 10 layers) networks.
  • Adam with step lr decay works. However, for deeper networks, I find Adam + the Noam lr scheduler more stable (a minimal sketch of the schedule follows below).
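As a reference point, here is a minimal sketch of the Noam learning rate schedule mentioned above (linear warmup followed by inverse-square-root decay, as in "Attention Is All You Need"). The function name, warmup value and exact parameterization are illustrative assumptions; the scheduler actually used by train.py may differ in detail.

def noam_lr(step, init_lr=5e-4, warmup_steps=4000):
    # Linear warmup up to warmup_steps, then decay proportional to step**-0.5,
    # scaled so that the peak learning rate equals init_lr.
    step = max(step, 1)
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)

for s in (100, 4000, 100000):
    print(s, noam_lr(s))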

Requirements

  • Python >= 3.5
  • CUDA >= 8.0
  • PyTorch >= v1.0.0
  • nnmnkwii >= v0.0.11
  • MeCab (Japanese only)

Installation

Please install packages listed above first, and then

git clone https://github.com/r9y9/deepvoice3_pytorch && cd deepvoice3_pytorch
pip install -e ".[bin]"

Getting started

Preset parameters

There are many hyper parameters to be tuned depending on what model and data you are working on. For typical datasets and models, parameters known to work well (presets) are provided in the repository. See the presets directory for details. Notice that

  1. preprocess.py
  2. train.py
  3. synthesis.py

accept the optional --preset=<json> parameter, which specifies where to load preset parameters from. If you are going to use preset parameters, you must use the same --preset=<json> throughout preprocessing, training and evaluation. e.g.,

python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech

instead of

python preprocess.py ljspeech ~/data/LJSpeech-1.0
# warning! this may use different hyper parameters used at preprocessing stage
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
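If you are unsure what a preset actually overrides, a quick way to check is to load the JSON and print a few entries. This is a minimal inspection sketch; the key names below (n_speakers, downsample_step, outputs_per_step, max_positions) mirror the "deepvoice3_ljspeech" values dumped in the issue logs further down this page.

import json

with open("presets/deepvoice3_ljspeech.json") as f:
    preset = json.load(f)
for key in ("n_speakers", "downsample_step", "outputs_per_step", "max_positions"):
    print(key, "=", preset.get(key))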

0. Download dataset

1. Preprocessing

Usage:

python preprocess.py ${dataset_name} ${dataset_path} ${out_dir} --preset=<json>

Supported ${dataset_name}s are:

  • ljspeech (en, single speaker)
  • vctk (en, multi-speaker)
  • jsut (jp, single speaker)
  • nikl_m (ko, multi-speaker)
  • nikl_s (ko, single speaker)

Assuming you use preset parameters known to work well for the LJSpeech dataset / DeepVoice3 and have the data in ~/data/LJSpeech-1.0, you can preprocess the data by:

python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in ./data/ljspeech.
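To sanity-check the output, you can load a couple of the generated .npy files. This is a minimal sketch; the <dataset>-mel-XXXXX.npy / <dataset>-spec-XXXXX.npy naming is an assumption based on the NIKL listing shown later on this page, and mel is typically shaped (frames, num_mels) while the linear spectrogram has fft_size // 2 + 1 bins.

import numpy as np

mel = np.load("./data/ljspeech/ljspeech-mel-00001.npy")    # assumed file name
spec = np.load("./data/ljspeech/ljspeech-spec-00001.npy")  # assumed file name
print("mel:", mel.shape, "linear:", spec.shape)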

1-1. Building custom dataset. (using json_meta)

Building your own dataset, with metadata in JSON format (compatible with carpedm20/multi-speaker-tacotron-tensorflow) is currently supported. Usage:

python preprocess.py json_meta ${list-of-JSON-metadata-paths} ${out_dir} --preset=<json>

You may need to modify a pre-existing preset JSON file, especially n_speakers. For English multi-speaker TTS, start with presets/deepvoice3_vctk.json.

Assuming you have dataset A (Speaker A) and dataset B (Speaker B), each described in the JSON metadata file ./datasets/datasetA/alignment.json and ./datasets/datasetB/alignment.json, then you can preprocess data by:

python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)

1-2. Preprocessing custom english datasets with long silence. (Based on vctk_preprocess)

Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise that undermine the char-level seq2seq model (e.g. VCTK, although this is covered by vctk_preprocess).

To deal with the problem, gentle_web_align.py will

  • Prepare phoneme alignments for all utterances
  • Cut silences during preprocessing

gentle_web_align.py uses Gentle, a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts and converts the results to HTK-style label files, which are then processed by preprocess.py. Gentle can be run on Linux/Mac/Windows (via Docker).

Preliminary results show that while the HTK/festival/merlin-based method in vctk_preprocess/prepare_vctk_labels.py works better on VCTK, Gentle is more stable with audio clips that contain ambient noise (e.g. movie excerpts).

Usage (assuming Gentle is running at localhost:8567, the default when not specified); a sketch of calling the Gentle server directly follows the two examples below:

  1. When sound files and transcript files are saved in separate folders (e.g. sound files in datasetA/wavs and transcripts in datasetA/txts):
python gentle_web_align.py -w "datasetA/wavs/*.wav" -t "datasetA/txts/*.txt" --server_addr=localhost --port=8567
  2. When sound files and transcript files are saved in a nested structure (e.g. datasetB/speakerN/blahblah.wav and datasetB/speakerN/blahblah.txt):
python gentle_web_align.py --nested-directories="datasetB" --server_addr=localhost --port=8567
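If you want to call the Gentle server directly (for debugging, or to pre-check a few utterances), a minimal sketch along these lines should work, assuming Gentle's usual /transcriptions?async=false HTTP endpoint and the host/port used above; the file names are hypothetical. gentle_web_align.py wraps this kind of request and converts the result to HTK-style labels.

import requests

def align(wav_path, txt_path, server="http://localhost:8567"):
    # Post one audio/transcript pair to the Gentle server and return its JSON
    # word-level alignment (each word carries start/end times on success).
    with open(wav_path, "rb") as audio, open(txt_path, "r") as transcript:
        resp = requests.post(server + "/transcriptions?async=false",
                             files={"audio": audio},
                             data={"transcript": transcript.read()})
    resp.raise_for_status()
    return resp.json()

result = align("datasetA/wavs/utt001.wav", "datasetA/txts/utt001.txt")
print(len(result.get("words", [])), "aligned words")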

Once you have a phoneme alignment for each utterance, you can extract features by running preprocess.py.

2. Training

Usage:

python train.py --data-root=${data-root} --preset=<json> --hparams="parameters you may want to override"

Suppose you want to build a DeepVoice3-style model using the LJSpeech dataset; then you can train your model by:

python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/

Model checkpoints (.pth) and alignments (.png) are saved in the ./checkpoints directory every 10,000 steps by default.

NIKL

Please check this in advance and follow the commands below.

python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s --preset=presets/deepvoice3_nikls.json

python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s --preset=presets/deepvoice3_nikls.json

3. Monitor with TensorBoard

Logs are dumped in the ./log directory by default. You can monitor them with TensorBoard:

tensorboard --logdir=log

4. Synthesize from a checkpoint

Given a list of texts, synthesis.py synthesizes audio signals from a trained model. Usage:

python synthesis.py ${checkpoint_path} ${text_list.txt} ${output_dir} --preset=<json>

Example test_list.txt:

Generative adversarial network or variational auto-encoder.
Once upon a time there was a dear little girl who was loved by every one who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Advanced usage

Multi-speaker model

VCTK and NIKL are the supported datasets for building a multi-speaker model.

VCTK

Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to vctk_preprocess.

Once you have phoneme alignment for each utterance, you can extract features by:

python preprocess.py vctk ${your_vctk_root_path} ./data/vctk

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
   --preset=presets/deepvoice3_vctk.json \
   --log-event-path=log/deepvoice3_multispeaker_vctk_preset

If you want to reuse a learned embedding from another dataset, you can instead run:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
   --preset=presets/deepvoice3_vctk.json \
   --log-event-path=log/deepvoice3_multispeaker_vctk_preset \
   --load-embedding=20171213_deepvoice3_checkpoint_step000210000.pth

This may improve training speed a bit.

NIKL

You will be able to obtain cleaned-up audio samples in ../nikl_preprocess. Details can be found here.

Once the NIKL corpus is ready to use after preprocessing, you can extract features by:

python preprocess.py nikl_m ${your_nikl_root_path} data/nikl_m

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

python train.py --data-root=./data/nikl_m  --checkpoint-dir checkpoint_nikl_m \
   --preset=presets/deepvoice3_niklm.json

Speaker adaptation

If you have very limited data, you can consider fine-tuning a pre-trained model. For example, using a model pre-trained on LJSpeech, you can adapt it to data from VCTK speaker p225 (30 mins) with the following command:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk_adaptation \
    --preset=presets/deepvoice3_ljspeech.json \
    --log-event-path=log/deepvoice3_vctk_adaptation \
    --restore-parts="20171213_deepvoice3_checkpoint_step000210000.pth"
    --speaker-id=0

In my experience, this reaches reasonable speech quality much more quickly than training the model from scratch.

There are two important options used above:

  • --restore-parts=<N>: It specifies where to load model parameters from. The differences from --checkpoint=<N> are: 1) --restore-parts=<N> ignores all mismatched parameters, while --checkpoint=<N> doesn't; 2) --restore-parts=<N> tells the trainer to start from step 0, while --checkpoint=<N> tells the trainer to continue from the last step. --checkpoint=<N> should be fine if you are using exactly the same model and simply continuing training, but --restore-parts=<N> is useful if you want to customize your model architecture and still take advantage of a pre-trained model (a rough sketch of this behaviour follows below).
  • --speaker-id=<N>: It specifies which speaker's data is used for training. This should only be specified if you are using a multi-speaker dataset. For VCTK, speaker ids are assigned automatically and incrementally (0, 1, ..., 107) according to speaker_info.txt in the dataset.

If you are training a multi-speaker model, speaker adaptation will only work when n_speakers is identical.
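For intuition, here is a minimal sketch of the "restore parts" behaviour described above: copy only the parameters whose names and shapes match, skip the rest, and leave the step counter at 0. The function name is illustrative; the actual logic in train.py may differ in detail.

import torch

def restore_parts(checkpoint_path, model):
    # Copy matching parameters from the checkpoint; ignore anything whose
    # name or shape does not fit the current model.
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    for name, param in state.items():
        if name in own and own[name].shape == param.shape:
            own[name].copy_(param)
    model.load_state_dict(own)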

Troubleshooting

#5 RuntimeError: main thread is not in main loop

This may happen depending on which matplotlib backend you have. Try changing the backend and see if it works, as follows:

MPLBACKEND=Qt5Agg python train.py ${args...}

In #78, engiecat reported that changing the matplotlib backend from Tkinter (TkAgg) to PyQt5 (Qt5Agg) fixed the problem.
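An equivalent in-code fix, if you prefer not to set the environment variable, is to select a non-interactive backend before pyplot is imported. This is a standard matplotlib pattern, shown here as a generic sketch rather than a change to this repository:

import matplotlib
matplotlib.use("Agg")  # or "Qt5Agg" if PyQt5 is installed
import matplotlib.pyplot as plt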

Sponsors

Acknowledgements

Part of the code was adapted from the following projects:

Banner and logo created by @jraulhernandezi (#76)

deepvoice3_pytorch's People

Contributors

abdoulfataoh, amilamad, engiecat, gisforgirard, homink, jraulhernandezi, kokimame, lzala, misterion777, r9y9, tripzero


deepvoice3_pytorch's Issues

Please correct hyperparams

Since the synthesis script has been altered to accept a builder param called deepvoice3_multispeaker instead of deepvoice3_vctk, please change the table in the pretrained models section of the README to reflect the new hyperparams for VCTK. It will eliminate confusion for people using this platform.

Reference Issue #14

The table entry should read:

--hparams="builder=deepvoice3_multispeaker,preset=deepvoice3_vctk"

"ImportError: dlopen: cannot load any more object with static TLS" in python3.5 synthesis.py ........

I got a fatal error when testing synthesis.py. Could you help?

python3.5 synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" /home/ml/deepvoice3_pytorch/models/20171213_deepvoice3_checkpoint_step000210000.pth ./text_list.txt ./output/

python3.5 synthesis.py --hparams="uilder=nyanko,preset=nyanko_ljspeech" "/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth" "/home/ml/deepvoice3_pytorch/text_list.txt" "/home/ml/deepvoice3_pytorch/output"
Command line args:
{'--checkpoint-postnet': None,
'--checkpoint-seq2seq': None,
'--file-name-suffix': '',
'--help': False,
'--hparams': 'uilder=nyanko,preset=nyanko_ljspeech',
'--max-decoder-steps': '500',
'--output-html': False,
'--replace_pronunciation_prob': '0.0',
'--speaker_id': None,
'': '/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth',
'<dst_dir>': '/home/ml/deepvoice3_pytorch/output',
'<text_list_file>': '/home/ml/deepvoice3_pytorch/text_list.txt'}
Traceback (most recent call last):
File "synthesis.py", line 98, in
hparams.parse(args["--hparams"])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/hparam.py", line 472, in parse
values_map = parse_values(values, type_map)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/hparam.py", line 206, in parse_values
raise ValueError('Unknown hyperparameter type for %s' % name)
ValueError: Unknown hyperparameter type for uilder
ml@tesla1a:~/deepvoice3_pytorch$ python3.5 synthesis.py --hparams="uilder=nyanko,preset=nyanko_ljspeech" "/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth" "/home/ml/deepvoice3_pytorch/text_list.txt" "/home/ml/deepvoice3_pytorch/output"
Traceback (most recent call last):
File "synthesis.py", line 26, in
import torch
File "/usr/local/lib/python3.5/dist-packages/torch/init.py", line 56, in
from torch._C import *
ImportError: dlopen: cannot load any more object with static TLS

error on training

I got the following error when I try to train the model. Is this because some of my speech is very long (such as 30 seconds)?

======
Los event path: ./log/aclclp
^M0it [00:00, ?it/s]
Traceback (most recent call last):
File "train.py", line 950, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 685, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 510, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "/home/chester/hdd22t/virtualenv/deepvoice3-pytorch-r9y9/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "train.py", line 280, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (1025) must match the size of tensor b (513) at non-singleton dimension 2

korean data

Hi, Ryuichi. Could you share the Korean single-speaker data? I ran into difficulties when trying to download the data from the link you provided.

AttributeError: module 'torch.nn.utils' has no attribute 'weight_norm'

I created a new environment for this project and made it through the preprocessing for the LJ dataset, and now I'm stuck at the training portion. I get this error

Traceback (most recent call last):
  File "train.py", line 906, in <module>
    model = build_model()
  File "train.py", line 799, in build_model
    value_projection=hparams.value_projection,
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/builder.py", line 46, in deepvoice3
    (h, k, 1), (h, k, 3)],
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/deepvoice3.py", line 54, in __init__
    dilation=1, std_mul=std_mul))
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/modules.py", line 104, in Conv1d
    return nn.utils.weight_norm(m)
AttributeError: module 'torch.nn.utils' has no attribute 'weight_norm'

when running python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

I installed pytorch with conda install pytorch torchvision cuda90 -c pytorch. Any help would be appreciated.

Alignment problems with German text?

Hi @r9y9, I'm training on German audio. I have added the German characters (Ä, Ö, Ü, ß, ä, ö, ü) to the symbol set and am using basic_cleaners.

The problem is the alignment on test-audio. Look at some of the samples. And, of course, the audio is horrible too. I have tested with up to 500k steps. Always the same results. When I generate audio with synthesis, I have similar results. Any hints where I'd need to add more info?

[alignment image: step000180000_text4_single_alignment]

Thanks for any recommendations... (I converted the German training data to ljspeech format...)

Changing fft_size, hop_size in hparams.py?

Hi there,

I changed hparams.py to

fft_size=2052, # default 1024
hop_size=114, # default 256

And I get an inaudible result!

What should I do if I want to increase fft_size and reduce hop_size? What did I do wrong?

Thanks a lot for any help!
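For reference, a quick bit of arithmetic under the usual STFT conventions: the linear spectrogram has fft_size // 2 + 1 frequency bins and roughly len(wav) / hop_size frames, so changing either value means preprocess.py has to be re-run and the same values used at training and synthesis time. The numbers below are just the values quoted in this issue.

for fft_size, hop_size in [(1024, 256), (2052, 114)]:
    n_bins = fft_size // 2 + 1
    n_frames = 22050 * 5 // hop_size  # ~frames for 5 seconds of 22.05 kHz audio
    print("fft_size={}: {} bins, ~{} frames per 5 s (hop_size={})".format(
        fft_size, n_bins, n_frames, hop_size))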

No activity on training

Hi,

After successful (1) installation of all prerequisites and (2) pre-processing,
starting the training phase with:
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
continues with a report of input parameters and eventually hangs on:
0it [00:00, ?it/s].

The command watch -n 1 nvidia-smi reports VRAM usage in the ~499 MB range with no activity on the GPU.

TODOs, status and progress

Single speaker model

Data: https://keithito.com/LJ-Speech-Dataset/

  • Convolution layers
  • Multi-hop attention layers
  • Attention mask for input zero padding
  • Alignments are learned almost monotonically
  • Incremental inference (greedy decoding)
  • Force monotonic attention
  • Done flag prediction
  • Get reasonable sound quality as Tacotron (https://github.com/r9y9/tacotron_pytorch)
  • Audio samples (en)
  • Audio samples (jp)
  • Pre-trained models

Multi-speaker model

Data: VCTK

  • Preprocessor for VCTK
  • Speaker embedding
  • Get reasonable sound quality
  • Audio samples
  • Pre-trained model

Misc

From https://arxiv.org/abs/1710.08969

  • Guided attention
  • Downsample mel-spectrogram / upsample converter
  • Binary divergence
  • Separate training for encoder+decoder and converter

Notes (to be moved to README.md)

  • Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
  • With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when multiple attention layers are used. With guided attention, I can confirm that five attention layers become monotonic, though I did not see speech quality improvements.
  • Positional encoding (i.e., using text positions and frame positions in decoder) is essential to learn monotonic alignments (without this I cannot get it to work). However, I'm still not sure why position rate matters. 1.0 for both encoder/decoder worked from my previous experiment.
  • Weight initialization is quite important particularly for deeper (e.g. > 8 layers) networks. Noticed when I tried to replicate https://arxiv.org/abs/1710.08969. They use more than 20 layers in the decoder! Very hard to train. Work in progress in #3. Speech samples (model: encoder/converter from https://arxiv.org/abs/1710.08969 and decoder from DeepVoice3): https://www.dropbox.com/sh/q9xfgscgh3k5lqa/AACPgWCprBfNgjRravscdDYCa?dl=0.
  • Adam with step lr decay works. However, for deeper networks, I find Adam + noam's lr scheduler is more stable.

jsut data

Hi,

When you trained on the JSUT corpus, did you use the original Japanese script? What I'm curious about is that Chinese characters (kanji) are not phonetic, so I doubt the network can learn from them. I thought they needed to be converted into a phonetic transcription (romaji).

AttributeError: 'NoneType' object has no attribute 'text_to_sequence'

When I try to train a dataset with the command from the tutorial (python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech") I get an error telling me that _frontend is a NoneType object and has no 'text_to_sequence' attribute. Do I need to modify anything to get this to work again?

AttributeError: 'NoneType' object has no attribute 'text_to_sequence'

Crowdsourcing a high-quality, open-source TTS dataset

Hi r9y9, first I just want to say that your repos are great and I have personally learned a lot from them. So a big thanks to you.

So I too have been trying to replicate the results of the big TTS papers. However the main thing that is frustrating me is the lack of a high quality TTS dataset (although 50 gpus would help too!).

I just wanted to throw this idea out there - what if random people on the internet interested in TTS/ML collaborated to create a good dataset? If enough people joined in (20+) the segmentation and labelling work should only be a couple of hours per person.

Here is a list of the options that occurred to me (and I by no means consider this list complete):

1 - Find a 20+ hour high-quality, open-source audiobook online. Given how massive the internet is - surely there is a possibility of a hi-fi audiobook that isn't poorly recorded, overly-compressed or too 'performed'. Working together, scouring the internet... who knows - a gem might be out there.

2 - Podcasts - there's an endless supply of these. But podcasts bring their own unique difficulties - e.g., were different eq/compression/mic/mastering used by the sound engineer across different episodes? Again, with enough searching, a candidate with consistent sound-quality may reveal itself.

3 - Commercial Audiobooks - this would unfortunately render the whole dataset closed-sourced and for personal research only. However I don't see how there would be any problems if all collaborators purchased the audiobook and didn't redistribute the dataset beyond the initial group of collaborators.

4 - Crowdfunding it - probably the least realistic option. Still though, if enough people were interested, 100 or so, then it might be possible. One studio, one sound engineer, one professional reader and someone to oversee the project for a week or two weeks max? Would $10,000 cover it? $20,000? I'm no expert in studio time and sound-engineering rates etc so I can't say for certain.

So to wrap this up - I just wanted to put this idea out there. I'm very curious what you, or any others reading this, think - even if you feel it's unrealistic. I know buying 50 gpus is unfeasible for most of us - but working together to solve the dataset problem? Personally, I'm optimistic.

Does this implementation ignore words?

I found that Tacotron will ignore some words when synthesizing a long sentence (a sentence with 30 words, etc.). Does Deep Voice 3 have that problem?

Another Assertion error

Hi again,

I trained a single-speaker Korean model successfully and am moving on to multi-speaker Korean. Again, I encountered the assertion error shown below. I tracked it down and it looks like self.encoder
in the AttentionSeq2Seq class gives these error messages. Could you let me know where the following self.encoder function is defined so that I can look into it further? max_positions doesn't work this time.

encoder_outputs = self.encoder(
text_sequences, lengths=input_lengths, speaker_embed=speaker_embed)

Thanks in advance,

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch2]$ CUDA_VISIBLE_DEVICES=2 python train.py   --data-root=./data/nikl_m/   --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker"   --checkpoint-dir checkpoint_nikl_m
Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoint_nikl_m',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/nikl_m/',
 '--help': False,
 '--hparams': 'frontend=ko,builder=deepvoice3,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker',
 '--load-embedding': None,
 '--log-event-path': None,
 '--reset-optimizer': False,
 '--restore-parts': None,
 '--speaker-id': None,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  allow_clipping_in_normalization: False
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: deepvoice3_multispeaker
  checkpoint_interval: 10000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  embedding_weight_std: 0.1
  encoder_channels: 256
  eval_interval: 10000
  fft_size: 1024
  fmax: 7600
  fmin: 125
  force_monotonic_attention: True
  freeze_embedding: False
  frontend: ko
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  key_projection: False
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.5
  max_positions: 512
  min_level_db: -100
  n_speakers: 1
  name: deepvoice3
  nepochs: 10000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  preset: deepvoice3_niklm
  presets: {'deepvoice3_niklm': {'n_speakers': 119, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 3000, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 600, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_vctk': {'n_speakers': 108, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'nyanko_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.01, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 128, 'encoder_channels': 256, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': False, 'value_projection': False, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  rescaling: False
  rescaling_max: 0.999
  sample_rate: 22050
  save_optimizer_state: True
  speaker_embed_dim: 16
  speaker_embedding_weight_std: 0.01
  text_embed_dim: 256
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  value_projection: False
  weight_decay: 0.0
  window_ahead: 3
  window_backward: 1
Override hyper parameters with preset "deepvoice3_niklm": {
    "n_speakers": 119,
    "speaker_embed_dim": 16,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "speaker_embedding_weight_std": 0.05,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.4,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 3000,
    "query_position_rate": 2.0,
    "key_position_rate": 7.6,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}

0it [00:00, ?it/s]
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered

Traceback (most recent call last):
  File "train.py", line 967, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 661, in train
    input_lengths=input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch2/deepvoice3_pytorch/__init__.py", line 80, in forward
    text_positions, frame_positions, input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch2/deepvoice3_pytorch/__init__.py", line 117, in forward
    print(text_sequences)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
    return 'Variable containing:' + self.data.__repr__()
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
    return str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
    return _tensor_str._str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 297, in _str
    strt = _matrix_str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 216, in _matrix_str
    min_sz=5 if not print_full_mat else 0)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 79, in _number_format
    tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/generic/THCTensorCopy.c:70

21k 30k 58.5k wrong ?

Do the pretrained DeepVoice3 models really need only 21k steps to train?
In my experiments, 21k steps seems far too few.
Maybe you wrote 210k as 21k?
And similarly 300k as 30k for Nyanko, and 585k as 58.5k for multi-speaker DeepVoice3?

Multi GPU Support

I'd like to train this model on 8 V100 GPUs - does it support multi GPU training?

positional encoding

    position_enc = np.array([
        [position_rate * pos / np.power(10000, 2 * (i // 2) / d_pos_vec) for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])

Hey! I wonder what the motivation is behind repeating the positional encoding values twice?
In the paper it's done this way:

position_rate * pos / np.power(10000,  i / d_pos_vec)...
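For context, here is a minimal sketch of the usual sinusoidal positional encoding (Vaswani et al.), where each pair of dimensions (2k, 2k+1) shares the frequency 1/10000^(2k/d) and sin/cos are applied to the even/odd index respectively; the 2 * (i // 2) in the snippet above builds exactly those shared per-pair frequencies before the sin/cos split, whereas 10000^(i/d) would give every dimension its own frequency. This is offered as an illustration, not a statement about the authors' intent.

import numpy as np

def sinusoidal_encoding(n_position, d_pos_vec, position_rate=1.0):
    # Shared frequency per (sin, cos) pair, as in the snippet above.
    enc = np.array([
        [position_rate * pos / np.power(10000, 2 * (i // 2) / d_pos_vec)
         for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
    enc[1:, 0::2] = np.sin(enc[1:, 0::2])  # even dimensions
    enc[1:, 1::2] = np.cos(enc[1:, 1::2])  # odd dimensions
    return enc

print(sinusoidal_encoding(4, 8).shape)  # (4, 8)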

Issue training with DeepVoice3 model with LJSpeech Data

Thanks for your excellent implementation of Deep Voice 3. I am attempting to retrain a DeepVoice3 model using the LJSpeech data. My interest in training a new model is that I want to make some small model parameter changes in order to enable fine-tuning using some Spanish data that I have.

As a first step I tried to retrain the baseline model and I have run into some issues.

With my installation, I have been able to successfully synthesize using the pre-trained DeepVoice3 model with git commit 4357976 as your instructions indicate. That synthesized audio sounds very much like the samples linked from the instructions page.

However, I am trying to train now with the latest git commit (commit 48d1014, dated Feb 7). I am using the LJSpeech data set downloaded from the link you provided. I have run the pre-processing and training steps as indicated in your instructions. I am using the default preset parameters for deepvoice3_ljspeech.

I have let the training process run for a while. When I synthesize using the checkpoint saved at 210K iterations, the alignment is bad and the audio is very robotic and mostly unintelligible.

[alignment image: 0_checkpoint_step000210000_alignment]

When I synthesize using the checkpoint saved at 700K iterations, the alignment is better (but not great); the audio is improved but still robotic and choppy.

[alignment image: 0_checkpoint_step000700000_alignment]

I can post the synthesized wav files via dropbox if you are interested. I expected to have good alignment and audio at 210K iterations as that is what the pretrained model used.

Any ideas what has changed between git commits 4357976 and 48d1014 that could have caused this issue? When I diff the two commits, I see some changes in audio.py, some places where support for multi-voice has been added, and some other changes I do not yet understand. There are some additions to hparams.py, but I only noticed one difference: in the current commit, masked_loss_weight defaults to 0.5, but in the prior commit the default was 0.0.

I have just started a new training run with masked_loss_weight set to 0.0. In the meantime, do you have thoughts on anything else that might be causing the issues I am seeing?

Speed up training.

Hi r9y9,
Thanks for the amazing library here. I'm only beginning to learn ML and love what this can do! Ultimately I'm trying to create what lyrebird.ai has been doing. I managed to finally set it all up and started training a single-speaker model with LJSpeech.

However, I'm experiencing the same training speed of ~3 s/it on both my desktop (specs below) and my MBP (2.5 GHz, i8, 4 cores). Is there a way I can speed things up? I know I don't have the ideal AI training hardware, but I'm looking forward to the results.

*Both setups have all CPU cores running at 100%

OS: Ubuntu 16.04.4
CPU: i7-7820X (8 CORE)
GPU: 2x 1080 Ti

Phonemes

Hi there,

I was wondering if you were ever considering making adjustments for the 'joint representation of characters and phonemes' that section 3.2 of the DeepVoice3 paper mentions.

Thanks in advance,

B1gM

cuda out of memory?

When I used
x = F.relu(self.fc1(x), inplace=True)
CUDA ran out of memory.
So I set inplace=False and that solved the problem!
x = F.relu(self.fc1(x), inplace=False)

VCTK alignment

Hi @r9y9, you mention that aligning VCTK with Gentle does not work; can you tell us what is happening? Is it the quality of the alignment, and how did you observe it?

Error in `python3': free(): invalid next size (fast) when running synthesis.py

Greetings!
I have successfully preprocessed the LJSpeech dataset and trained a model for a while with preset hyperparameters:

python3 train.py --data-root=./data/ljspeech \
--hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

But when trying to generate audio from text:

python3 synthesis.py ./checkpoints/checkpoint_step000270000.pth ./text_list.txt ./generated \ 
--hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

I'm getting the error:

*** Error in `python3': free(): invalid next size (fast): 0x000000000db7b050 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f1138cbd7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f1138cc637a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f1138cca53c]
/usr/local/cuda-8.0/lib64/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7f10e47eac69]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(+0x2dedf7)[0x7f10cc728df7]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f10cd5f9ee4]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7f10cc9694a2]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(+0x40d26e)[0x7f10cc85726e]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3[0x540199]
python3(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebd23]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebd23]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
...
....

After debugging I found that the problem appears in this loop at the first iteration (deepvoice3.py, line 90):

for f in self.convolutions:
            x = f(x, speaker_embed_btc) if isinstance(f, Conv1dGLU) else f(x)

but still can't solve it.

I tried using Python 3.5.2 and 3.6.3 with tensorflow 1.3.0 and torch 0.3.1 (also tried 0.3.0.post4)
CUDA version is 8.0, GPU: Titan X
Any help would be appreciated.

Getting error when num_workers > 0

Hi,
I have tried to train the LJSpeech model with the latest master and it gives me an error like this, with

num_workers = 2
[screenshot of the error]
It looks like _frontend doesn't get assigned for the worker processes. I tried injecting the _frontend object into the TextDataSource, but it failed. Is there a fix for this?

When I set num_workers = 0, it trains OK.
A quick Google search tells me that with num_workers = 0 all the data loading is done in the main process.
My question is: will this slow down my training process significantly?
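For what it's worth, a generic pattern (not this repository's code) for per-worker globals is DataLoader's worker_init_fn, which runs once inside each worker process and can set module-level state such as a text frontend there; all names below are stand-ins.

import torch
from torch.utils.data import DataLoader, Dataset

_frontend = None  # stand-in for the module-level global the error refers to

class DummyTextDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        assert _frontend is not None, "_frontend was not set in this worker"
        return torch.tensor([idx])

def init_worker(worker_id):
    global _frontend
    _frontend = "en"  # e.g. re-create the frontend object per worker

if __name__ == "__main__":
    loader = DataLoader(DummyTextDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=init_worker)
    for batch in loader:
        print(batch.shape)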

Persistent MemoryError while training on VCTK

Hello. I am currently trying to train a VCTK model with the DeepVoice3 multi-speaker model.
While it seems to work okay, sometimes the training crashes with the following error.

2734it [13:58,  3.26it/s]Traceback (most recent call last):
  File "train.py", line 957, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 585, in train
    in tqdm(enumerate(data_loader)):
  File "H:\envs\pytorch\lib\site-packages\tqdm\_tqdm.py", line 959, in __iter__
    for obj in iterable:
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
MemoryError: Traceback (most recent call last):
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "H:\Tensorflow_Study\git\deepvoice3_pytorch\train.py", line 329, in collate_fn
    dtype=np.float32)
MemoryError

Forcing garbage collection sporadically (using gc.collect()) doesn't help the issue.
Currently, I have 16 GB of RAM with 48 GB of virtual memory available on my SSD (just in case).
(Using Windows 10 with PyTorch 0.3.1 (with CUDA 8.0, GTX1060 6GB))

Also, I observe in Resource Monitor that the memory usage in Commit (KB) and Working Set (KB) is significantly different, as shown below. (Sorry for the non-English screenshot.)
[screenshot: Resource Monitor memory usage]

Thank you for creating such wonderful implementation!
:)

RuntimeError: invalid argument 2: sizes do not match

I downloaded pretrained models and upon running any of them I receive the following error:

My pytorch version is: 0.3.0.post4

RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/generic/THCTensorCopy.c:101

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "synthesis.py", line 125, in
model.load_state_dict(checkpoint["state_dict"])
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 487, in load_state_dict
.format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named seq2seq.encoder.embed_tokens.weight, whose dimensions in the model are torch.Size([149, 128]) and whose dimensions in the checkpoint are torch.Size([149, 256]).

Memory corruption when synthesising speech

Hi @r9y9 ,
Thanks for working on this project. I trained a model with --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" on the latest commit. However, when I synthesize speech, I get the following errors:

 python synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" checkpoints_deepvoice3/checkpoint_step000630000.pth test.txt samples
Command line args:
 {'--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--file-name-suffix': '',
 '--help': False,
 '--hparams': 'builder=deepvoice3,preset=deepvoice3_ljspeech',
 '--max-decoder-steps': '500',
 '--output-html': False,
 '--replace_pronunciation_prob': '0.0',
 '--speaker_id': None,
 '<checkpoint>': 'checkpoints_deepvoice3/checkpoint_step000630000.pth',
 '<dst_dir>': 'samples',
 '<text_list_file>': 'test.txt'}
Override hyper parameters with preset "deepvoice3_ljspeech": {
    "n_speakers": 1,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 512,
    "query_position_rate": 1.0,
    "key_position_rate": 1.385,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}
*** Error in `python': free(): invalid next size (fast): 0x0000000004da9360 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fcd2c2417e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fcd2c24a37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fcd2c24e53c]
/home/fatman/anaconda2/envs/dev3/bin/../lib/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7fccdeb64c69]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x2dedf7)[0x7fccb75acdf7]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7fccb847dee4]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7fccb77ed4a2]

Detailed logs are here.
The text file contains only a single line:
Generative adversarial network or variational auto-encoder.
Thanks.

preprocess: TypeError: unorderable types: NoneType() > int()

python3 preprocess.py ljspeech ./data/LJSpeech-1.0/ ./data/ljspeech
  0%|                                                                                   | 0/13100 [00:00<?, ?it/s]concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 57, in _process_utterance
    spectrogram = audio.spectrogram(wav).astype(np.float32)
  File "/data1/demobin/deepvoice3_pytorch/audio.py", line 32, in spectrogram
    D = _lws_processor().stft(preemphasis(y)).T
  File "/data1/demobin/deepvoice3_pytorch/audio.py", line 53, in _lws_processor
    return lws.lws(hparams.fft_size, hparams.hop_size, mode="speech")
  File "lws.pyx", line 357, in lws.lws.__init__ (lws.bycython.cpp:15047)
TypeError: unorderable types: NoneType() > int()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 55, in <module>
    preprocess_ljspeech(in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess_ljspeech
    metadata = ljspeech.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 34, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 34, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
TypeError: unorderable types: NoneType() > int()

Assertion `srcIndex < srcSelectDimSize` failed

Hi again,

I am applying this repository to a Korean speech corpus (http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464) and have encountered the following error. Could you have a look at it? I will be happy to submit a PR once it's working.

I formatted the Korean corpus into .npy files in the same layout as LJSpeech (as a single speaker) and ran training with a single GPU and with multiple GPUs. But it shows a series of error messages like Assertion srcIndex < srcSelectDimSize failed.

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl | head -3
nikl-mel-00001.npy
nikl-mel-00002.npy
nikl-mel-00003.npy
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl | tail -3
nikl-spec-00929.npy
nikl-spec-00930.npy
train.txt
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl/*.npy | wc -l
1860


CUDA_VISIBLE_DEVICES=3 python train.py \
  --data-root=./data/nikl/ \
  --hparams="frontend=jp,builder=deepvoice3,preset=deepvoice3_ljspeech" \
  --checkpoint-dir checkpoint_nikl


Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoint_nikl',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/nikl/',
 '--help': False,
 '--hparams': 'builder=deepvoice3,preset=deepvoice3_ljspeech',
 '--load-embedding': None,
 '--log-event-path': None,
 '--reset-optimizer': False,
 '--restore-parts': None,
 '--speaker-id': None,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  allow_clipping_in_normalization: True
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: deepvoice3
  checkpoint_interval: 10000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  embedding_weight_std: 0.1
  encoder_channels: 256
  eval_interval: 10000
  fft_size: 1024
  force_monotonic_attention: True
  freeze_embedding: False
  frontend: en
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  key_projection: False
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.5
  max_positions: 512
  min_level_db: -100
  n_speakers: 1
  name: deepvoice3
  nepochs: 2000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  preset: deepvoice3_ljspeech
  presets: {'deepvoice3_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_vctk': {'n_speakers': 108, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 1024, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'nyanko_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.01, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 128, 'encoder_channels': 256, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': False, 'value_projection': False, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  sample_rate: 22050
  save_optimizer_state: True
  speaker_embed_dim: 16
  speaker_embedding_weight_std: 0.01
  text_embed_dim: 128
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  value_projection: False
  weight_decay: 0.0
  window_ahead: 3
  window_backward: 1
Override hyper parameters with preset "deepvoice3_ljspeech": {
    "n_speakers": 1,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 512,
    "query_position_rate": 1.0,
    "key_position_rate": 1.385,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}
Los event path: log/run-test2018-01-30_15:05:32.238606
34it [00:08,  4.24it/s]
7it/s]/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [106,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion is repeated for threads [33,0,0] through [53,0,0] of block [106,0,0])

...

/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [46,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [46,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generic/THCStorage.cu line=58 error=59 : device-side assert triggered

Traceback (most recent call last):
  File "train.py", line 941, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 642, in train
    input_lengths=input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/deepvoice3_pytorch/__init__.py", line 94, in forward
    linear_outputs = self.postnet(postnet_inputs, speaker_embed)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/deepvoice3_pytorch/deepvoice3.py", line 597, in forward
    return F.sigmoid(x)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 817, in sigmoid
    return input.sigmoid()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generic/THCStorage.cu:58
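
For context, this device-side assert is typically an out-of-range index in an embedding or positional-encoding lookup, e.g. a text symbol id larger than the embedding table or a sequence longer than max_positions. A rough diagnostic sketch for the text side, assuming train.txt keeps the transcript in the last pipe-separated field and that the frontend module exposes text_to_sequence and n_vocab the way train.py uses them:

from hparams import hparams
from deepvoice3_pytorch import frontend   # assumption: same import path train.py uses

_frontend = getattr(frontend, hparams.frontend)
max_len, max_sym = 0, 0
with open("data/nikl/train.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip().split("|")[-1]           # assumption: text is the last field
        seq = _frontend.text_to_sequence(text)
        max_len = max(max_len, len(seq))
        max_sym = max(max_sym, max(seq))
print("longest text sequence:", max_len, "; max_positions:", hparams.max_positions)
print("largest symbol id    :", max_sym, "; embedding size (n_vocab):", _frontend.n_vocab)

If either value exceeds its limit (the decoder's mel frame counts are subject to the same max_positions cap), the CUDA assertion above is the expected failure mode.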

hparams is not defined while running preprocess.py

Ran the following command on the downloaded LJSpeech dataset:

python3 preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech

No preprocessed data was generated; instead I got an error:

NameError: name 'hparams' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    preprocess(mod, in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess
    metadata = mod.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/home/coglac/Documents/deepvoice3_pytorch/ljspeech.py", line 34, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/home/coglac/Documents/deepvoice3_pytorch/ljspeech.py", line 34, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
NameError: name 'hparams' is not defined
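
A NameError raised inside a ProcessPoolExecutor worker can hide the real import problem. A debugging sketch that runs a single utterance outside the pool so the original traceback surfaces directly (the paths and transcript are placeholders, and the _process_utterance(out_dir, index, wav_path, text) signature is assumed from ljspeech.py):

import ljspeech

# call the worker function directly, bypassing concurrent.futures, so the
# underlying "name 'hparams' is not defined" shows its full origin
ljspeech._process_utterance("./data/ljspeech", 1,
                            "/home/user/data/LJSpeech-1.0/wavs/LJ001-0001.wav",
                            "Printing, in the only sense with which we are at present concerned.")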

AssertionError

Hi,

I am new to PyTorch and am following the JSUT example here. I encountered the following assertion error, which is hard for me to investigate further. Could anyone help me out?

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ python -V
Python 3.5.4 :: Anaconda custom (64-bit)
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls /home/kwon/copora/jsut_ver1.1
basic5000  ChangeLog.txt  countersuffix26  LICENCE.txt  loanword128  onomatopee300  precedent130  README_en.txt  README_ja.txt  repeat500  travel1000  utparaphrase512  voiceactress100
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ python preprocess.py jsut /home/kwon/copora/jsut_ver1.1 ./data/jsut
  0%|                                                                                                                                                                               | 0/7696 [00:00<?, ?it/s]concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 52, in _process_utterance
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/audio.py", line 50, in melspectrogram
    assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    preprocess(mod, in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess
    metadata = mod.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 25, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 25, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
AssertionError
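
For context, the assert in audio.py checks that the dB-scaled mel spectrogram (already shifted by ref_level_db) stays inside [min_level_db, 0]; very loud, clipped, or differently-sampled audio can push it outside that range. A toy illustration with made-up values, plus the clipping that an allow_clipping_in_normalization-style setting would apply instead of failing:

import numpy as np

min_level_db = -100
S = np.array([-120.0, -30.0, 2.5])   # made-up dB values; the first and last violate the bounds
print(S.max() <= 0 and S.min() - min_level_db >= 0)   # False -> AssertionError
print(np.clip(S, min_level_db, 0))                    # clipping keeps S in range instead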

KeyError: 'unexpected key "seq2seq.decoder.attention.in_projection.bias" in state_dict'

Hi, thanks for the fantastic DeepVoice3 implementation!

When trying to train the Nyanko model starting from your pre-trained checkpoint using the following args:

--hparams="builder=nyanko,preset=nyanko_ljspeech" 
--checkpoint=checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth

I'm getting the error:

Load checkpoint from: checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth
Traceback (most recent call last):
  File "train.py", line 936, in <module>
    load_checkpoint(checkpoint_path, model, optimizer, reset_optimizer)
  File "train.py", line 820, in load_checkpoint
    model.load_state_dict(checkpoint["state_dict"])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 490, in load_state_dict
    .format(name))
KeyError: 'unexpected key "seq2seq.decoder.attention.in_projection.bias" in state_dict'

It looks like in_projection is missing from the AttentionLayer implementation in deepvoice3_pytorch/deepvoice3.py but is still present in the Nyanko pre-trained model https://github.com/r9y9/deepvoice3_pytorch#pretrained-models
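
The safest route is the one from the README: check out the git commit the checkpoint was trained against. If the goal is only to reuse the weights that still match the current model, a workaround sketch (not the repository's official procedure) is to filter the state dict before loading:

import torch

checkpoint = torch.load(
    "checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth",
    map_location=lambda storage, loc: storage)
pretrained = checkpoint["state_dict"]
model_state = model.state_dict()   # `model` built the same way train.py builds it
# keep only keys the current model knows about, and only where shapes agree
filtered = {k: v for k, v in pretrained.items()
            if k in model_state and v.size() == model_state[k].size()}
model_state.update(filtered)
model.load_state_dict(model_state)

Any layer whose weights were dropped this way starts from its random initialization, so results will differ from the published samples.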

An error occurs when loading the text

With Deep Voice 3, I get the following error.

collected_files = self.file_data_source.collect_files()
File "train.py", line 126, in collect_files
assert len(l) == 4 or len(l) == 5
AssertionError

Is the format of my text wrong? The data is JSUT.
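
For reference, that assert expects every line of train.txt to have four pipe-separated fields (five when a speaker id is appended, as in the multi-speaker case), roughly: spectrogram filename | mel filename | number of frames | text. An illustrative (not real) line would look like:

jsut-spec-00001.npy|jsut-mel-00001.npy|425|<transcript text>

A line with extra '|' characters, or with fewer fields (for example an empty or hand-edited line), makes len(l) something other than 4 or 5 and triggers the AssertionError.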

How do speeds compare between fairseq-py's conv_tbc and nn.Conv1d at inference time?

The fairseq team has said there is a big speed difference between their own temporal convolution (conv_tbc) and the original nn.Conv1d at inference time.

Did you check the speed of these two modules when removing the fairseq-py dependency?

By the way, I agree with implementing this without the dependency. It makes it much easier to follow the overall code flow.
Good job!
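
For anyone who wants to measure this, a rough benchmark sketch is below. It assumes a PyTorch build that exposes torch.conv_tbc (the primitive fairseq's ConvTBC wraps), with weight layouts (out, in, kernel) for conv1d and (kernel, in, out) for conv_tbc; it is not a statement about which one this repository should use.

import time
import torch
import torch.nn.functional as F

T, B, C, k = 500, 1, 256, 3
x_bct = torch.randn(B, C, T)                 # nn.Conv1d layout: (batch, channels, time)
x_tbc = x_bct.permute(2, 0, 1).contiguous()  # conv_tbc layout: (time, batch, channels)
w = torch.randn(C, C, k)                     # (out_channels, in_channels, kernel)
w_tbc = w.permute(2, 1, 0).contiguous()      # (kernel, in_channels, out_channels)
b = torch.zeros(C)

def bench(fn, n=100):
    fn()                                     # warm up
    t0 = time.time()
    for _ in range(n):
        fn()
    return (time.time() - t0) / n

print("F.conv1d      :", bench(lambda: F.conv1d(x_bct, w, b, padding=k - 1)))
print("torch.conv_tbc:", bench(lambda: torch.conv_tbc(x_tbc, w_tbc, b, k - 1)))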

RuntimeError: main thread is not in main loop

When I ran train.py
(python3 train.py --data-root=./datapath/ljspeech/ --hparams="batch_size=10")

I got this error:
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7f1b5f86a710>>
Traceback (most recent call last):
  File "/usr/lib/python3.5/tkinter/__init__.py", line 3359, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
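
This usually means matplotlib picked the interactive TkAgg backend and a figure got garbage-collected outside the main thread. A common workaround, assuming the training plots only need to be written to disk, is to force the non-interactive Agg backend before pyplot is imported:

# put this before any `import matplotlib.pyplot` (e.g. at the top of train.py)
import matplotlib
matplotlib.use("Agg")      # file-only backend, no tkinter involved
import matplotlib.pyplot as plt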

Some modifications on my side

  1. deepvoice3_pytorch/__init__.py
    from .version import __version__
    this line raises an error:
    version.py is not provided.

  2. deepvoice3_pytorch/builder.py
    deepvoice3_multispeaker
    inconsistent with hparams.py

  3. deepvoice3_pytorch/deepvoice3.py
    line 474, (done > 0.5).all()
    maybe done.data is better

train: RuntimeError: invalid argument 2: size '[16 x 126]' is invalid for input of with 126 elements at /home/demobin/github/pytorch/torch/lib/TH/THStorage.c:41

python3 train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko --hparams="use_preset=True,builder=nyanko" --log-event-path=log/nyanko_preset

Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoints_nyanko',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/ljspeech',
 '--help': False,
 '--hparams': 'use_preset=True,builder=nyanko',
 '--log-event-path': 'log/nyanko_preset',
 '--reset-optimizer': False,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: nyanko
  checkpoint_interval: 5000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  encoder_channels: 256
  fft_size: 1024
  force_monotonic_attention: True
  frontend: en
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.0
  max_positions: 512
  min_level_db: -100
  name: deepvoice3
  nepochs: 2000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  presets: {'nyanko': {'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'outputs_per_step': 1, 'text_embed_dim': 128, 'initial_learning_rate': 0.0005, 'binary_divergence_weight': 0.1, 'kernel_size': 3, 'downsample_step': 4, 'decoder_channels': 256, 'dropout': 0.050000000000000044, 'clip_thresh': 0.1, 'encoder_channels': 256, 'converter_channels': 256, 'use_decoder_state_for_postnet_input': True}, 'deepvoice3': {'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'outputs_per_step': 4, 'text_embed_dim': 256, 'initial_learning_rate': 0.001, 'binary_divergence_weight': 0.0, 'kernel_size': 7, 'downsample_step': 1, 'decoder_channels': 256, 'dropout': 0.050000000000000044, 'clip_thresh': 1.0, 'encoder_channels': 256, 'converter_channels': 256, 'use_decoder_state_for_postnet_input': True}, 'latest': {}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  sample_rate: 22050
  text_embed_dim: 128
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  use_preset: True
  weight_decay: 0.0
Override hyper parameters with preset "nyanko": {
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "outputs_per_step": 1,
    "text_embed_dim": 128,
    "initial_learning_rate": 0.0005,
    "binary_divergence_weight": 0.1,
    "kernel_size": 3,
    "downsample_step": 4,
    "decoder_channels": 256,
    "dropout": 0.050000000000000044,
    "clip_thresh": 0.1,
    "encoder_channels": 256,
    "converter_channels": 256,
    "use_decoder_state_for_postnet_input": true
}
Los event path: log/nyanko_preset
0it [00:00, ?it/s]Traceback (most recent call last):
  File "train.py", line 777, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 466, in train
    in tqdm(enumerate(data_loader)):
  File "/usr/local/lib/python3.5/dist-packages/tqdm/_tqdm.py", line 816, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 201, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 221, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 62, in _pin_memory_loop
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 117, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 82, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 198, in view_as
    return self.view(tensor.size())
RuntimeError: invalid argument 2: size '[16 x 126]' is invalid for input of with 126 elements at /home/demobin/github/pytorch/torch/lib/TH/THStorage.c:41
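
Since the failure happens inside the DataLoader's pinned-memory copy (pin_memory_batch), one thing worth trying (a guess based on the traceback, not a confirmed fix) is to disable pinning via the pin_memory hyper parameter shown in the dump above, or to move to a newer PyTorch release:

python3 train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko \
  --hparams="use_preset=True,builder=nyanko,pin_memory=False" \
  --log-event-path=log/nyanko_preset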

Tacotron 2

Sorry if this is off-topic (DeepVoice vs. Tacotron), but it seems the Tacotron 2 paper is now released.
The speech samples sound better than ever (I think):
https://google.github.io/tacotron/publications/tacotron2/index.html

I must admit that I'm not too well versed in how much this differs from the original Tacotron. But perhaps the changes made could also be used in your project?
