
deepvoice3_pytorch's Introduction


Deepvoice3_pytorch


PyTorch implementation of convolutional networks-based text-to-speech synthesis models:

  1. arXiv:1710.07654: Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning.
  2. arXiv:1710.08969: Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention.

Audio samples are available at https://r9y9.github.io/deepvoice3_pytorch/.

Folks

Online TTS demo

Notebooks intended to be executed on https://colab.research.google.com are available:

Highlights

  • Convolutional sequence-to-sequence model with attention for text-to-speech synthesis
  • Multi-speaker and single speaker versions of DeepVoice3
  • Audio samples and pre-trained models
  • Preprocessor for LJSpeech (en), JSUT (jp) and VCTK datasets, as well as carpedm20/multi-speaker-tacotron-tensorflow compatible custom dataset (in JSON format)
  • Language-dependent frontend text processor for English and Japanese

Samples

Pretrained models

NOTE: pretrained models are not compatible with master. To be updated soon.

URL  | Model                    | Data     | Hyper parameters                                        | Git commit | Steps
link | DeepVoice3               | LJSpeech | link                                                    | abf0a21    | 640k
link | Nyanko                   | LJSpeech | builder=nyanko,preset=nyanko_ljspeech                   | ba59dc7    | 585k
link | Multi-speaker DeepVoice3 | VCTK     | builder=deepvoice3_multispeaker,preset=deepvoice3_vctk  | 0421749    | 300k + 300k

To use pre-trained models, it's highly recommended that you check out the specific git commit noted above, i.e.,

git checkout ${commit_hash}

Then follow the "Synthesize from a checkpoint" section in the README of the specific git commit. Please notice that the latest development version of the repository may not work.

You could try for example:

# pretrained model (20180505_deepvoice3_checkpoint_step000640000.pth)
# hparams (20180505_deepvoice3_ljspeech.json)
git checkout 4357976
python synthesis.py --preset=20180505_deepvoice3_ljspeech.json \
  20180505_deepvoice3_checkpoint_step000640000.pth \
  sentences.txt \
  output_dir

Notes on hyper parameters

  • Default hyper parameters, used during the preprocessing/training/synthesis stages, are tuned for English TTS using the LJSpeech dataset. You will have to change some of the parameters if you want to try other datasets. See hparams.py for details.
  • builder specifies which model you want to use. deepvoice3, deepvoice3_multispeaker [1] and nyanko [2] are supported.
  • The hyper parameters described in the DeepVoice3 paper for the single-speaker model didn't work for the LJSpeech dataset, so I changed a few things: dilated convolutions, more channels, more layers, guided attention loss, etc. See the code for details. The changes are also applied to the multi-speaker model.
  • Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
  • With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when multiple attention layers are used. With guided attention, I can confirm that five attention layers become monotonic, though I did not see speech quality improvements.
  • Binary divergence (described in https://arxiv.org/abs/1710.08969) seems to stabilize training, particularly for deep (> 10 layers) networks.
  • Adam with step lr decay works. However, for deeper networks, I find Adam + the Noam lr scheduler more stable (a minimal sketch of the schedule follows below).
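As a reference point, here is a minimal sketch of the Noam learning rate schedule mentioned above (linear warmup followed by inverse-square-root decay, as in "Attention Is All You Need"). The function name, warmup value and exact parameterization are illustrative assumptions; the scheduler actually used by train.py may differ in detail.

def noam_lr(step, init_lr=5e-4, warmup_steps=4000):
    # Linear warmup up to warmup_steps, then decay proportional to step**-0.5,
    # scaled so that the peak learning rate equals init_lr.
    step = max(step, 1)
    return init_lr * warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)

for s in (100, 4000, 100000):
    print(s, noam_lr(s))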

Requirements

  • Python >= 3.5
  • CUDA >= 8.0
  • PyTorch >= v1.0.0
  • nnmnkwii >= v0.0.11
  • MeCab (Japanese only)

Installation

Please install packages listed above first, and then

git clone https://github.com/r9y9/deepvoice3_pytorch && cd deepvoice3_pytorch
pip install -e ".[bin]"

Getting started

Preset parameters

There are many hyper parameters to be tuned depending on what model and data you are working on. For typical datasets and models, parameters known to work well (presets) are provided in the repository. See the presets directory for details. Notice that

  1. preprocess.py
  2. train.py
  3. synthesis.py

accept the optional --preset=<json> parameter, which specifies where to load preset parameters from. If you are going to use preset parameters, you must use the same --preset=<json> throughout preprocessing, training and evaluation. e.g.,

python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech

instead of

python preprocess.py ljspeech ~/data/LJSpeech-1.0
# warning! this may use different hyper parameters used at preprocessing stage
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech
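If you are unsure what a preset actually overrides, a quick way to check is to load the JSON and print a few entries. This is a minimal inspection sketch; the key names below (n_speakers, downsample_step, outputs_per_step, max_positions) mirror the "deepvoice3_ljspeech" values dumped in the issue logs further down this page.

import json

with open("presets/deepvoice3_ljspeech.json") as f:
    preset = json.load(f)
for key in ("n_speakers", "downsample_step", "outputs_per_step", "max_positions"):
    print(key, "=", preset.get(key))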

0. Download dataset

1. Preprocessing

Usage:

python preprocess.py ${dataset_name} ${dataset_path} ${out_dir} --preset=<json>

Supported ${dataset_name}s are:

  • ljspeech (en, single speaker)
  • vctk (en, multi-speaker)
  • jsut (jp, single speaker)
  • nikl_m (ko, multi-speaker)
  • nikl_s (ko, single speaker)

Assuming you use preset parameters known to work well for the LJSpeech dataset / DeepVoice3 and have the data in ~/data/LJSpeech-1.0, you can preprocess the data by:

python preprocess.py --preset=presets/deepvoice3_ljspeech.json ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech

When this is done, you will see extracted features (mel-spectrograms and linear spectrograms) in ./data/ljspeech.
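To sanity-check the output, you can load a couple of the generated .npy files. This is a minimal sketch; the <dataset>-mel-XXXXX.npy / <dataset>-spec-XXXXX.npy naming is an assumption based on the NIKL listing shown later on this page, and mel is typically shaped (frames, num_mels) while the linear spectrogram has fft_size // 2 + 1 bins.

import numpy as np

mel = np.load("./data/ljspeech/ljspeech-mel-00001.npy")    # assumed file name
spec = np.load("./data/ljspeech/ljspeech-spec-00001.npy")  # assumed file name
print("mel:", mel.shape, "linear:", spec.shape)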

1-1. Building custom dataset. (using json_meta)

Building your own dataset, with metadata in JSON format (compatible with carpedm20/multi-speaker-tacotron-tensorflow) is currently supported. Usage:

python preprocess.py json_meta ${list-of-JSON-metadata-paths} ${out_dir} --preset=<json>

You may need to modify a pre-existing preset JSON file, especially n_speakers. For English multi-speaker TTS, start with presets/deepvoice3_vctk.json.

Assuming you have dataset A (Speaker A) and dataset B (Speaker B), each described in the JSON metadata file ./datasets/datasetA/alignment.json and ./datasets/datasetB/alignment.json, then you can preprocess data by:

python preprocess.py json_meta "./datasets/datasetA/alignment.json,./datasets/datasetB/alignment.json" "./datasets/processed_A+B" --preset=(path to preset json file)

1-2. Preprocessing custom english datasets with long silence. (Based on vctk_preprocess)

Some datasets, especially automatically generated ones, may include long silences and undesirable leading/trailing noise that undermine the char-level seq2seq model (e.g. VCTK, although this is covered by vctk_preprocess).

To deal with the problem, gentle_web_align.py will

  • Prepare phoneme alignments for all utterances
  • Cut silences during preprocessing

gentle_web_align.py uses Gentle, a Kaldi-based speech-text alignment tool. It accesses a web-served Gentle application, aligns the given sound segments with their transcripts and converts the results to HTK-style label files, which are then processed by preprocess.py. Gentle can be run on Linux/Mac/Windows (via Docker).

Preliminary results show that while the HTK/festival/merlin-based method in vctk_preprocess/prepare_vctk_labels.py works better on VCTK, Gentle is more stable with audio clips that contain ambient noise (e.g. movie excerpts).

Usage (assuming Gentle is running at localhost:8567, the default when not specified); a sketch of calling the Gentle server directly follows the two examples below:

  1. When sound files and transcript files are saved in separate folders (e.g. sound files in datasetA/wavs and transcripts in datasetA/txts):
python gentle_web_align.py -w "datasetA/wavs/*.wav" -t "datasetA/txts/*.txt" --server_addr=localhost --port=8567
  2. When sound files and transcript files are saved in a nested structure (e.g. datasetB/speakerN/blahblah.wav and datasetB/speakerN/blahblah.txt):
python gentle_web_align.py --nested-directories="datasetB" --server_addr=localhost --port=8567
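If you want to call the Gentle server directly (for debugging, or to pre-check a few utterances), a minimal sketch along these lines should work, assuming Gentle's usual /transcriptions?async=false HTTP endpoint and the host/port used above; the file names are hypothetical. gentle_web_align.py wraps this kind of request and converts the result to HTK-style labels.

import requests

def align(wav_path, txt_path, server="http://localhost:8567"):
    # Post one audio/transcript pair to the Gentle server and return its JSON
    # word-level alignment (each word carries start/end times on success).
    with open(wav_path, "rb") as audio, open(txt_path, "r") as transcript:
        resp = requests.post(server + "/transcriptions?async=false",
                             files={"audio": audio},
                             data={"transcript": transcript.read()})
    resp.raise_for_status()
    return resp.json()

result = align("datasetA/wavs/utt001.wav", "datasetA/txts/utt001.txt")
print(len(result.get("words", [])), "aligned words")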

Once you have a phoneme alignment for each utterance, you can extract features by running preprocess.py.

2. Training

Usage:

python train.py --data-root=${data-root} --preset=<json> --hparams="parameters you may want to override"

Suppose you want to build a DeepVoice3-style model using the LJSpeech dataset; then you can train your model by:

python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/

Model checkpoints (.pth) and alignments (.png) are saved in the ./checkpoints directory every 10,000 steps by default.

NIKL

Please check this in advance and follow the commands below.

python preprocess.py nikl_s ${your_nikl_root_path} data/nikl_s --preset=presets/deepvoice3_nikls.json

python train.py --data-root=./data/nikl_s --checkpoint-dir checkpoint_nikl_s --preset=presets/deepvoice3_nikls.json

3. Monitor with TensorBoard

Logs are dumped in the ./log directory by default. You can monitor them with TensorBoard:

tensorboard --logdir=log

4. Synthesize from a checkpoint

Given a list of texts, synthesis.py synthesizes audio signals from a trained model. Usage:

python synthesis.py ${checkpoint_path} ${text_list.txt} ${output_dir} --preset=<json>

Example test_list.txt:

Generative adversarial network or variational auto-encoder.
Once upon a time there was a dear little girl who was loved by every one who looked at her, but most of all by her grandmother, and there was nothing that she would not have given to the child.
A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic model and an audio synthesis module.

Advanced usage

Multi-speaker model

VCTK and NIKL are the supported datasets for building a multi-speaker model.

VCTK

Since some audio samples in VCTK have long silences that affect performance, it's recommended to do phoneme alignment and remove silences according to vctk_preprocess.

Once you have phoneme alignment for each utterance, you can extract features by:

python preprocess.py vctk ${your_vctk_root_path} ./data/vctk

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
   --preset=presets/deepvoice3_vctk.json \
   --log-event-path=log/deepvoice3_multispeaker_vctk_preset

If you want to reuse a learned embedding from another dataset, you can instead run:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk \
   --preset=presets/deepvoice3_vctk.json \
   --log-event-path=log/deepvoice3_multispeaker_vctk_preset \
   --load-embedding=20171213_deepvoice3_checkpoint_step000210000.pth

This may improve training speed a bit.

NIKL

You will be able to obtain cleaned-up audio samples in ../nikl_preprocess. Details can be found here.

Once the NIKL corpus is ready to use after preprocessing, you can extract features by:

python preprocess.py nikl_m ${your_nikl_root_path} data/nikl_m

Now that you have the data prepared, you can train a multi-speaker version of DeepVoice3 by:

python train.py --data-root=./data/nikl_m  --checkpoint-dir checkpoint_nikl_m \
   --preset=presets/deepvoice3_niklm.json

Speaker adaptation

If you have very limited data, you can consider fine-tuning a pre-trained model. For example, using a model pre-trained on LJSpeech, you can adapt it to data from VCTK speaker p225 (30 mins) with the following command:

python train.py --data-root=./data/vctk --checkpoint-dir=checkpoints_vctk_adaptation \
    --preset=presets/deepvoice3_ljspeech.json \
    --log-event-path=log/deepvoice3_vctk_adaptation \
    --restore-parts="20171213_deepvoice3_checkpoint_step000210000.pth"
    --speaker-id=0

In my experience, this reaches reasonable speech quality much more quickly than training the model from scratch.

There are two important options used above:

  • --restore-parts=<N>: It specifies where to load model parameters from. The differences from --checkpoint=<N> are: 1) --restore-parts=<N> ignores all mismatched parameters, while --checkpoint=<N> doesn't; 2) --restore-parts=<N> tells the trainer to start from step 0, while --checkpoint=<N> tells the trainer to continue from the last step. --checkpoint=<N> should be fine if you are using exactly the same model and simply continuing training, but --restore-parts=<N> is useful if you want to customize your model architecture and still take advantage of a pre-trained model (a rough sketch of this behaviour follows below).
  • --speaker-id=<N>: It specifies which speaker's data is used for training. This should only be specified if you are using a multi-speaker dataset. For VCTK, speaker ids are assigned automatically and incrementally (0, 1, ..., 107) according to speaker_info.txt in the dataset.

If you are training a multi-speaker model, speaker adaptation will only work when n_speakers is identical.
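For intuition, here is a minimal sketch of the "restore parts" behaviour described above: copy only the parameters whose names and shapes match, skip the rest, and leave the step counter at 0. The function name is illustrative; the actual logic in train.py may differ in detail.

import torch

def restore_parts(checkpoint_path, model):
    # Copy matching parameters from the checkpoint; ignore anything whose
    # name or shape does not fit the current model.
    state = torch.load(checkpoint_path, map_location="cpu")["state_dict"]
    own = model.state_dict()
    for name, param in state.items():
        if name in own and own[name].shape == param.shape:
            own[name].copy_(param)
    model.load_state_dict(own)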

Troubleshooting

#5 RuntimeError: main thread is not in main loop

This may happen depending on which matplotlib backend you have. Try changing the backend and see if it works, as follows:

MPLBACKEND=Qt5Agg python train.py ${args...}

In #78, engiecat reported that changing the matplotlib backend from Tkinter (TkAgg) to PyQt5 (Qt5Agg) fixed the problem.
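An equivalent in-code fix, if you prefer not to set the environment variable, is to select a non-interactive backend before pyplot is imported. This is a standard matplotlib pattern, shown here as a generic sketch rather than a change to this repository:

import matplotlib
matplotlib.use("Agg")  # or "Qt5Agg" if PyQt5 is installed
import matplotlib.pyplot as plt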

Sponsors

Acknowledgements

Part of the code was adapted from the following projects:

Banner and logo created by @jraulhernandezi (#76)

deepvoice3_pytorch's People

Contributors

abdoulfataoh, amilamad, engiecat, gisforgirard, homink, jraulhernandezi, kokimame, lzala, misterion777, r9y9, tripzero


deepvoice3_pytorch's Issues

Please correct hyperparams

Since the synthesis script has been altered to accept a builder param called deepvoice3_multispeaker instead of deepvoice3_vctk, please change the table in the pretrained models section of the README to reflect the new hyperparams for VCTK. It will eliminate confusion for people using this platform.

Reference Issue #14

The table entry should read:

--hparams="builder=deepvoice3_multispeaker,preset=deepvoice3_vctk"

"ImportError: dlopen: cannot load any more object with static TLS" in python3.5 synthesis.py ........

I got a fatal error when testing synthesis.py. Could you help?

python3.5 synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" /home/ml/deepvoice3_pytorch/models/20171213_deepvoice3_checkpoint_step000210000.pth ./text_list.txt ./output/

python3.5 synthesis.py --hparams="uilder=nyanko,preset=nyanko_ljspeech" "/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth" "/home/ml/deepvoice3_pytorch/text_list.txt" "/home/ml/deepvoice3_pytorch/output"
Command line args:
{'--checkpoint-postnet': None,
'--checkpoint-seq2seq': None,
'--file-name-suffix': '',
'--help': False,
'--hparams': 'uilder=nyanko,preset=nyanko_ljspeech',
'--max-decoder-steps': '500',
'--output-html': False,
'--replace_pronunciation_prob': '0.0',
'--speaker_id': None,
'': '/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth',
'<dst_dir>': '/home/ml/deepvoice3_pytorch/output',
'<text_list_file>': '/home/ml/deepvoice3_pytorch/text_list.txt'}
Traceback (most recent call last):
File "synthesis.py", line 98, in
hparams.parse(args["--hparams"])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/hparam.py", line 472, in parse
values_map = parse_values(values, type_map)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/training/python/training/hparam.py", line 206, in parse_values
raise ValueError('Unknown hyperparameter type for %s' % name)
ValueError: Unknown hyperparameter type for uilder
ml@tesla1a:~/deepvoice3_pytorch$ python3.5 synthesis.py --hparams="uilder=nyanko,preset=nyanko_ljspeech" "/home/ml/deepvoice3_pytorch/models/20171129_nyanko_checkpoint_step000585000.pth" "/home/ml/deepvoice3_pytorch/text_list.txt" "/home/ml/deepvoice3_pytorch/output"
Traceback (most recent call last):
File "synthesis.py", line 26, in
import torch
File "/usr/local/lib/python3.5/dist-packages/torch/init.py", line 56, in
from torch._C import *
ImportError: dlopen: cannot load any more object with static TLS

error on training

I got the following error when I try to train the model. Is this because some of my speech is very long (such as 30 seconds)?

======
Los event path: ./log/aclclp
^M0it [00:00, ?it/s]
Traceback (most recent call last):
File "train.py", line 950, in
train_seq2seq=train_seq2seq, train_postnet=train_postnet)
File "train.py", line 685, in train
priority_w=hparams.priority_freq_weight)
File "train.py", line 510, in spec_loss
l1_loss = w * masked_l1(y_hat, y, mask=mask) + (1 - w) * l1(y_hat, y)
File "/home/chester/hdd22t/virtualenv/deepvoice3-pytorch-r9y9/lib/python3.5/site-packages/torch/nn/modules/module.py", line 357, in call
result = self.forward(*input, **kwargs)
File "train.py", line 280, in forward
loss = self.criterion(input * mask_, target * mask_)
RuntimeError: The size of tensor a (1025) must match the size of tensor b (513) at non-singleton dimension 2

korean data

Hi, Ryuichi. Could you share the Korean single-speaker data? I ran into difficulties when trying to download the data from the link you provided.

AttributeError: module 'torch.nn.utils' has no attribute 'weight_norm'

I created a new environment for this project and made it through the preprocessing for the LJ dataset, and now I'm stuck at the training portion. I get this error

Traceback (most recent call last):
  File "train.py", line 906, in <module>
    model = build_model()
  File "train.py", line 799, in build_model
    value_projection=hparams.value_projection,
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/builder.py", line 46, in deepvoice3
    (h, k, 1), (h, k, 3)],
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/deepvoice3.py", line 54, in __init__
    dilation=1, std_mul=std_mul))
  File "/mnt/deepvoice3_pytorch/deepvoice3_pytorch/modules.py", line 104, in Conv1d
    return nn.utils.weight_norm(m)
AttributeError: module 'torch.nn.utils' has no attribute 'weight_norm'

when running python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

I installed pytorch with conda install pytorch torchvision cuda90 -c pytorch. Any help would be appreciated.

Alignment problems with German text?

Hi @r9y9, I'm training on German audio. I have added the German characters (Ä, Ö, Ü, ß, ä, ö, ü) to the symbol set and am using basic_cleaners.

The problem is the alignment on test-audio. Look at some of the samples. And, of course, the audio is horrible too. I have tested with up to 500k steps. Always the same results. When I generate audio with synthesis, I have similar results. Any hints where I'd need to add more info?

[alignment image: step000180000_text4_single_alignment]

Thanks for any recommendations... (I converted the German training data to ljspeech format...)

Changing fft_size, hop_size in hparams.py?

Hi there,

I changed hparams.py to

fft_size=2052, # default 1024
hop_size=114, # default 256

And I get an inaudible result!

What should I do if I want to increase fft_size and reduce hop_size? What did I do wrong?

Thanks a lot for any help!
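For reference, a quick bit of arithmetic under the usual STFT conventions: the linear spectrogram has fft_size // 2 + 1 frequency bins and roughly len(wav) / hop_size frames, so changing either value means preprocess.py has to be re-run and the same values used at training and synthesis time. The numbers below are just the values quoted in this issue.

for fft_size, hop_size in [(1024, 256), (2052, 114)]:
    n_bins = fft_size // 2 + 1
    n_frames = 22050 * 5 // hop_size  # ~frames for 5 seconds of 22.05 kHz audio
    print("fft_size={}: {} bins, ~{} frames per 5 s (hop_size={})".format(
        fft_size, n_bins, n_frames, hop_size))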

No activity on training

Hi,

After successful (1) installation of all prerequisites and (2) pre-processing,
starting the training phase with:
python train.py --preset=presets/deepvoice3_ljspeech.json --data-root=./data/ljspeech/
continues with a report of input parameters and eventually hangs on:
0it [00:00, ?it/s].

The command watch -n 1 nvidia-smi reports VRAM usage in the ~499 MB range with no activity on the GPU.

TODOs, status and progress

Single speaker model

Data: https://keithito.com/LJ-Speech-Dataset/

  • Convolution layers
  • Multi-hop attention layers
  • Attention mask for input zero padding
  • Alignments are learned almost monotonically
  • Incremental inference (greedy decoding)
  • Force monotonic attention
  • Done flag prediction
  • Get reasonable sound quality as Tacotron (https://github.com/r9y9/tacotron_pytorch)
  • Audio samples (en)
  • Audio samples (jp)
  • Pre-trained models

Multi-speaker model

Data: VCTK

  • Preprocessor for VCTK
  • Speaker embedding
  • Get reasonable sound quality
  • Audio samples
  • Pre-trained model

Misc

From https://arxiv.org/abs/1710.08969

  • Guided attention
  • Downsample mel-spectrogram / upsample converter
  • Binary divergence
  • Separate training for encoder+decoder and converter

Notes (to be moved to README.md)

  • Multiple attention layers are hard to learn. Empirically, one or two (first and last) attention layers seem to be enough.
  • With guided attention (see https://arxiv.org/abs/1710.08969), alignments become monotonic more quickly and reliably when multiple attention layers are used. With guided attention, I can confirm that five attention layers become monotonic, though I did not see speech quality improvements.
  • Positional encoding (i.e., using text positions and frame positions in decoder) is essential to learn monotonic alignments (without this I cannot get it to work). However, I'm still not sure why position rate matters. 1.0 for both encoder/decoder worked from my previous experiment.
  • Weight initialization is quite important particularly for deeper (e.g. > 8 layers) networks. Noticed when I tried to replicate https://arxiv.org/abs/1710.08969. They use more than 20 layers in the decoder! Very hard to train. Work in progress in #3. Speech samples (model: encoder/converter from https://arxiv.org/abs/1710.08969 and decoder from DeepVoice3): https://www.dropbox.com/sh/q9xfgscgh3k5lqa/AACPgWCprBfNgjRravscdDYCa?dl=0.
  • Adam with step lr decay works. However, for deeper networks, I find Adam + noam's lr scheduler is more stable.

jsut data

Hi,

When you trained on the JSUT corpus, did you use the original Japanese script? What I'm curious about is that Chinese characters (kanji) are not phonetic, so I doubt the network can learn from them. I thought they needed to be converted into a phonetic transcription (romaji).

AttributeError: 'NoneType' object has no attribute 'text_to_sequence'

When I try to train a dataset with the command from the tutorial (python train.py --data-root=./data/ljspeech/ --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech") I get an error telling me that _frontend is a NoneType object and has no 'text_to_sequence' attribute. Do I need to modify anything to get this to work again?

AttributeError: 'NoneType' object has no attribute 'text_to_sequence'

Crowdsourcing a high-quality, open-source TTS dataset

Hi r9y9, first I just want to say that your repos are great and I have personally learned a lot from them. So a big thanks to you.

So I too have been trying to replicate the results of the big TTS papers. However the main thing that is frustrating me is the lack of a high quality TTS dataset (although 50 gpus would help too!).

I just wanted to throw this idea out there - what if random people on the internet interested in TTS/ML collaborated to create a good dataset? If enough people joined in (20+) the segmentation and labelling work should only be a couple of hours per person.

Here is a list of the options that occurred to me (and I by no means consider this list complete):

1 - Find a 20+ hour high-quality, open-source audiobook online. Given how massive the internet is - surely there is a possibility of a hi-fi audiobook that isn't poorly recorded, overly-compressed or too 'performed'. Working together, scouring the internet... who knows - a gem might be out there.

2 - Podcasts - there's an endless supply of these. But podcasts bring their own unique difficulties - e.g., were different eq/compression/mic/mastering used by the sound engineer across different episodes? Again, with enough searching, a candidate with consistent sound-quality may reveal itself.

3 - Commercial Audiobooks - this would unfortunately render the whole dataset closed-sourced and for personal research only. However I don't see how there would be any problems if all collaborators purchased the audiobook and didn't redistribute the dataset beyond the initial group of collaborators.

4 - Crowdfunding it - probably the least realistic option. Still though, if enough people were interested, 100 or so, then it might be possible. One studio, one sound engineer, one professional reader and someone to oversee the project for a week or two weeks max? Would $10,000 cover it? $20,000? I'm no expert in studio time and sound-engineering rates etc so I can't say for certain.

So to wrap this up - I just wanted to put this idea out there. I'm very curious what you, or any others reading this, think - even if you feel it's unrealistic. I know buying 50 gpus is unfeasible for most of us - but working together to solve the dataset problem? Personally, I'm optimistic.

Does this implementation ignore words?

I found that Tacotron will ignore some words when synthesizing a long sentence (a sentence with 30 words, etc.). Does Deep Voice 3 have that problem?

Another Assertion error

Hi again,

I trained a single-speaker Korean model successfully and am moving on to multi-speaker Korean. Again, I encountered the assertion error shown below. I tracked it down and it looks like self.encoder
in the AttentionSeq2Seq class gives these error messages. Could you let me know where the following self.encoder function is defined so that I can look into it further? max_positions doesn't work this time.

encoder_outputs = self.encoder(
text_sequences, lengths=input_lengths, speaker_embed=speaker_embed)

Thanks in advance,

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch2]$ CUDA_VISIBLE_DEVICES=2 python train.py   --data-root=./data/nikl_m/   --hparams="frontend=ko,builder=deepvoice3,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker"   --checkpoint-dir checkpoint_nikl_m
Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoint_nikl_m',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/nikl_m/',
 '--help': False,
 '--hparams': 'frontend=ko,builder=deepvoice3,preset=deepvoice3_niklm,builder=deepvoice3_multispeaker',
 '--load-embedding': None,
 '--log-event-path': None,
 '--reset-optimizer': False,
 '--restore-parts': None,
 '--speaker-id': None,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  allow_clipping_in_normalization: False
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: deepvoice3_multispeaker
  checkpoint_interval: 10000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  embedding_weight_std: 0.1
  encoder_channels: 256
  eval_interval: 10000
  fft_size: 1024
  fmax: 7600
  fmin: 125
  force_monotonic_attention: True
  freeze_embedding: False
  frontend: ko
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  key_projection: False
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.5
  max_positions: 512
  min_level_db: -100
  n_speakers: 1
  name: deepvoice3
  nepochs: 10000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  preset: deepvoice3_niklm
  presets: {'deepvoice3_niklm': {'n_speakers': 119, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 3000, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 600, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_vctk': {'n_speakers': 108, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'nyanko_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.01, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 128, 'encoder_channels': 256, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': False, 'value_projection': False, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  rescaling: False
  rescaling_max: 0.999
  sample_rate: 22050
  save_optimizer_state: True
  speaker_embed_dim: 16
  speaker_embedding_weight_std: 0.01
  text_embed_dim: 256
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  value_projection: False
  weight_decay: 0.0
  window_ahead: 3
  window_backward: 1
Override hyper parameters with preset "deepvoice3_niklm": {
    "n_speakers": 119,
    "speaker_embed_dim": 16,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "speaker_embedding_weight_std": 0.05,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.4,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 3000,
    "query_position_rate": 2.0,
    "key_position_rate": 7.6,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}

0it [00:00, ?it/s]
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [0,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [1,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [2,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [3,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [4,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [5,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [6,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [7,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [8,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [9,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [10,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [11,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [12,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [13,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [14,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/THCTensorIndex.cu:279: void indexSelectSmallIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [0,0,0], thread: [15,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered

Traceback (most recent call last):
  File "train.py", line 967, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 661, in train
    input_lengths=input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch2/deepvoice3_pytorch/__init__.py", line 80, in forward
    text_positions, frame_positions, input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch2/deepvoice3_pytorch/__init__.py", line 117, in forward
    print(text_sequences)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/autograd/variable.py", line 119, in __repr__
    return 'Variable containing:' + self.data.__repr__()
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 133, in __repr__
    return str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/tensor.py", line 140, in __str__
    return _tensor_str._str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 297, in _str
    strt = _matrix_str(self)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 216, in _matrix_str
    min_sz=5 if not print_full_mat else 0)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/_tensor_str.py", line 79, in _number_format
    tensor = torch.DoubleTensor(tensor.size()).copy_(tensor).abs_().view(tensor.nelement())
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512957107421/work/torch/lib/THC/generic/THCTensorCopy.c:70

21k 30k 58.5k wrong ?

Do the pretrained DeepVoice3 models really need only 21k steps to train?
In my experiments, 21k steps seems far too few.
Maybe you wrote 210k as 21k?
And similarly 300k as 30k for Nyanko, and 585k as 58.5k for multi-speaker DeepVoice3?

Multi GPU Support

I'd like to train this model on 8 V100 GPUs - does it support multi GPU training?

positional encoding

    position_enc = np.array([
        [position_rate * pos / np.power(10000, 2 * (i // 2) / d_pos_vec) for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])

Hey! I wonder what the motivation is behind repeating the positional encoding values twice?
In the paper it's done this way:

position_rate * pos / np.power(10000,  i / d_pos_vec)...
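For context, here is a minimal sketch of the usual sinusoidal positional encoding (Vaswani et al.), where each pair of dimensions (2k, 2k+1) shares the frequency 1/10000^(2k/d) and sin/cos are applied to the even/odd index respectively; the 2 * (i // 2) in the snippet above builds exactly those shared per-pair frequencies before the sin/cos split, whereas 10000^(i/d) would give every dimension its own frequency. This is offered as an illustration, not a statement about the authors' intent.

import numpy as np

def sinusoidal_encoding(n_position, d_pos_vec, position_rate=1.0):
    # Shared frequency per (sin, cos) pair, as in the snippet above.
    enc = np.array([
        [position_rate * pos / np.power(10000, 2 * (i // 2) / d_pos_vec)
         for i in range(d_pos_vec)]
        if pos != 0 else np.zeros(d_pos_vec) for pos in range(n_position)])
    enc[1:, 0::2] = np.sin(enc[1:, 0::2])  # even dimensions
    enc[1:, 1::2] = np.cos(enc[1:, 1::2])  # odd dimensions
    return enc

print(sinusoidal_encoding(4, 8).shape)  # (4, 8)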

Issue training with DeepVoice3 model with LJSpeech Data

Thanks for your excellent implementation of Deep Voice 3. I am attempting to retrain a DeepVoice3 model using the LJSpeech data. My interest in training a new model is that I want to make some small model parameter changes in order to enable fine-tuning using some Spanish data that I have.

As a first step I tried to retrain the baseline model and I have run into some issues.

With my installation, I have been able to successfully synthesize using the pre-trained DeepVoice3 model with git commit 4357976 as your instructions indicate. That synthesized audio sounds very much like the samples linked from the instructions page.

However, I am trying to train now with the latest git commit (commit 48d1014, dated Feb 7). I am using the LJSpeech data set downloaded from the link you provided. I have run the pre-processing and training steps as indicated in your instructions. I am using the default preset parameters for deepvoice3_ljspeech.

I have let the training process run for a while. When I synthesize using the checkpoint saved at 210K iterations, the alignment is bad and the audio is very robotic and mostly unintelligible.

[alignment image: 0_checkpoint_step000210000_alignment]

When I synthesize using the checkpoint saved at 700K iterations, the alignment is better (but not great); the audio is improved but still robotic and choppy.

[alignment image: 0_checkpoint_step000700000_alignment]

I can post the synthesized wav files via dropbox if you are interested. I expected to have good alignment and audio at 210K iterations as that is what the pretrained model used.

Any ideas what has changed between git commits 4357976 and 48d1014 that could have caused this issue? When I diff the two commits, I see some changes in audio.py, some places where support for multi-voice has been added, and some other changes I do not yet understand. There are some additions to hparams.py, but I only noticed one difference: in the current commit, masked_loss_weight defaults to 0.5, but in the prior commit the default was 0.0.

I have just started a new training run with masked_loss_weight set to 0.0. In the meantime, do you have thoughts on anything else that might be causing the issues I am seeing?

Speed up training.

Hi r9y9,
Thanks for the amazing library here. I'm only beginning to learn ML and love what this can do! Ultimately I'm trying to create what lyrebird.ai has been doing. I managed to finally set it all up and started training a single-speaker model with LJSpeech.

However, I'm experiencing the same training speed of ~3 s/it on both my desktop (specs below) and my MBP (2.5 GHz, i8, 4 cores). Is there a way I can speed things up? I know I don't have the ideal AI training hardware, but I'm looking forward to the results.

*Both setups have all CPU cores running at 100%

OS: Ubuntu 16.04.4
CPU: i7-7820X (8 CORE)
GPU: 2x 1080 Ti

Phonemes

Hi there,

I was wondering if you were ever considering making adjustments for the 'joint representation of characters and phonemes' that section 3.2 of the DeepVoice3 paper mentions.

Thanks in advance,

B1gM

cuda out of memory?

When I used
x = F.relu(self.fc1(x), inplace=True)
CUDA ran out of memory.
So I set inplace=False and that solved the problem!
x = F.relu(self.fc1(x), inplace=False)

VCTK alignment

Hi @r9y9, you mention that aligning VCTK with Gentle does not work; can you tell us what is happening? Is it the quality of the alignment, and how did you observe it?

Error in `python3': free(): invalid next size (fast) when running synthesis.py

Greetings!
I have successfully preprocessed the LJSpeech dataset and trained a model for a while with preset hyperparameters:

python3 train.py --data-root=./data/ljspeech \
--hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

But when trying to generate audio from text:

python3 synthesis.py ./checkpoints/checkpoint_step000270000.pth ./text_list.txt ./generated \ 
--hparams="builder=deepvoice3,preset=deepvoice3_ljspeech"

I'm getting the error:

*** Error in `python3': free(): invalid next size (fast): 0x000000000db7b050 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7f1138cbd7e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7f1138cc637a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7f1138cca53c]
/usr/local/cuda-8.0/lib64/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7f10e47eac69]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(+0x2dedf7)[0x7f10cc728df7]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7f10cd5f9ee4]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7f10cc9694a2]
/usr/local/lib/python3.5/dist-packages/torch/_C.cpython-35m-x86_64-linux-gnu.so(+0x40d26e)[0x7f10cc85726e]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3[0x540199]
python3(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebd23]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebd23]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x4fb9ce]
python3(PyObject_Call+0x47)[0x5c1797]
python3[0x574b36]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
python3(PyObject_Call+0x47)[0x5c1797]
python3(PyEval_EvalFrameEx+0x252b)[0x53920b]
python3(PyEval_EvalCodeEx+0x13b)[0x540f9b]
python3[0x4ebe37]
...
....

After debugging I found that the problem appears in this loop at the first iteration (deepvoice3.py, line 90):

for f in self.convolutions:
            x = f(x, speaker_embed_btc) if isinstance(f, Conv1dGLU) else f(x)

but still can't solve it.

I tried using Python 3.5.2 and 3.6.3 with tensorflow 1.3.0 and torch 0.3.1 (also tried 0.3.0.post4)
CUDA version is 8.0, GPU: Titan X
Any help would be appreciated.

Getting error when num_workers > 0

Hi,
I have tried to train the LJSpeech model with the latest master and it gives me an error like this, with

num_workers = 2
[screenshot of the error]
It looks like _frontend doesn't get assigned for the worker processes. I tried injecting the _frontend object into the TextDataSource, but it failed. Is there a fix for this?

When I set num_workers = 0, it trains OK.
A quick Google search tells me that with num_workers = 0 all the data loading is done in the main process.
My question is: will this slow down my training process significantly?
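For what it's worth, a generic pattern (not this repository's code) for per-worker globals is DataLoader's worker_init_fn, which runs once inside each worker process and can set module-level state such as a text frontend there; all names below are stand-ins.

import torch
from torch.utils.data import DataLoader, Dataset

_frontend = None  # stand-in for the module-level global the error refers to

class DummyTextDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, idx):
        assert _frontend is not None, "_frontend was not set in this worker"
        return torch.tensor([idx])

def init_worker(worker_id):
    global _frontend
    _frontend = "en"  # e.g. re-create the frontend object per worker

if __name__ == "__main__":
    loader = DataLoader(DummyTextDataset(), batch_size=2, num_workers=2,
                        worker_init_fn=init_worker)
    for batch in loader:
        print(batch.shape)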

Persistent MemoryError while training on VCTK

Hello. I am currently trying to train a VCTK model with the DeepVoice3 multi-speaker model.
While it seems to work okay, sometimes the training crashes with the following error.

2734it [13:58,  3.26it/s]Traceback (most recent call last):
  File "train.py", line 957, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 585, in train
    in tqdm(enumerate(data_loader)):
  File "H:\envs\pytorch\lib\site-packages\tqdm\_tqdm.py", line 959, in __iter__
    for obj in iterable:
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 281, in __next__
    return self._process_next_batch(batch)
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
MemoryError: Traceback (most recent call last):
  File "H:\envs\pytorch\lib\site-packages\torch\utils\data\dataloader.py", line 55, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "H:\Tensorflow_Study\git\deepvoice3_pytorch\train.py", line 329, in collate_fn
    dtype=np.float32)
MemoryError

Forcing garbage collection sporadically (using gc.collect()) doesn't help the issue.
Currently, I have 16 GB of RAM with 48 GB of virtual memory available on my SSD (just in case).
(Using Windows 10 with PyTorch 0.3.1 (with CUDA 8.0, GTX1060 6GB))

Also, I observe in Resource Monitor that the memory usage in Commit (KB) and Working Set (KB) is significantly different, as shown below. (Sorry for the non-English screenshot.)
[screenshot: Resource Monitor memory usage]

Thank you for creating such wonderful implementation!
:)

RuntimeError: invalid argument 2: sizes do not match

I downloaded pretrained models and upon running any of them I receive the following error:

My pytorch version is: 0.3.0.post4

RuntimeError: invalid argument 2: sizes do not match at /pytorch/torch/lib/THC/generic/THCTensorCopy.c:101

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "synthesis.py", line 125, in
model.load_state_dict(checkpoint["state_dict"])
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 487, in load_state_dict
.format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named seq2seq.encoder.embed_tokens.weight, whose dimensions in the model are torch.Size([149, 128]) and whose dimensions in the checkpoint are torch.Size([149, 256]).

Memory corruption when synthesising speech

Hi @r9y9 ,
Thanks for working on this project. I trained a model with --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" on the latest commit. However, when I synthesize speech, I get the following errors:

 python synthesis.py --hparams="builder=deepvoice3,preset=deepvoice3_ljspeech" checkpoints_deepvoice3/checkpoint_step000630000.pth test.txt samples
Command line args:
 {'--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--file-name-suffix': '',
 '--help': False,
 '--hparams': 'builder=deepvoice3,preset=deepvoice3_ljspeech',
 '--max-decoder-steps': '500',
 '--output-html': False,
 '--replace_pronunciation_prob': '0.0',
 '--speaker_id': None,
 '<checkpoint>': 'checkpoints_deepvoice3/checkpoint_step000630000.pth',
 '<dst_dir>': 'samples',
 '<text_list_file>': 'test.txt'}
Override hyper parameters with preset "deepvoice3_ljspeech": {
    "n_speakers": 1,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 512,
    "query_position_rate": 1.0,
    "key_position_rate": 1.385,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}
*** Error in `python': free(): invalid next size (fast): 0x0000000004da9360 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fcd2c2417e5]
/lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fcd2c24a37a]
/lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fcd2c24e53c]
/home/fatman/anaconda2/envs/dev3/bin/../lib/libcudnn.so.6(cudnnDestroyConvolutionDescriptor+0x9)[0x7fccdeb64c69]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(+0x2dedf7)[0x7fccb75acdf7]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch5cudnn30cudnn_convolution_full_forwardEP8THCStateP12cudnnContext15cudnnDataType_tPNS_12THVoidTensorES7_S7_S7_St6vectorIiSaIiEESA_SA_ibb+0x6a4)[0x7fccb847dee4]
/home/fatman/anaconda2/envs/dev3/lib/python3.6/site-packages/torch/_C.cpython-36m-x86_64-linux-gnu.so(_ZN5torch8autograd11ConvForward5applyERKSt6vectorINS0_8VariableESaIS3_EE+0x1192)[0x7fccb77ed4a2]

Detailed logs are here.
The text file contains only a single line:
Generative adversarial network or variational auto-encoder.
Thanks.

preprocess: TypeError: unorderable types: NoneType() > int()

python3 preprocess.py ljspeech ./data/LJSpeech-1.0/ ./data/ljspeech
  0%|                                                                                   | 0/13100 [00:00<?, ?it/s]concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 57, in _process_utterance
    spectrogram = audio.spectrogram(wav).astype(np.float32)
  File "/data1/demobin/deepvoice3_pytorch/audio.py", line 32, in spectrogram
    D = _lws_processor().stft(preemphasis(y)).T
  File "/data1/demobin/deepvoice3_pytorch/audio.py", line 53, in _lws_processor
    return lws.lws(hparams.fft_size, hparams.hop_size, mode="speech")
  File "lws.pyx", line 357, in lws.lws.__init__ (lws.bycython.cpp:15047)
TypeError: unorderable types: NoneType() > int()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 55, in <module>
    preprocess_ljspeech(in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess_ljspeech
    metadata = ljspeech.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 34, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/data1/demobin/deepvoice3_pytorch/ljspeech.py", line 34, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
TypeError: unorderable types: NoneType() > int()

Assertion `srcIndex < srcSelectDimSize` failed

Hi again,

I am applying this repository to a Korean speech corpus (http://www.korean.go.kr/front/board/boardStandardView.do?board_id=4&mn_id=17&b_seq=464) and have encountered the following error. Could you have a look at it? I will be happy to submit a PR once it's working.

I formatted the Korean corpus into .npy files in the same layout as LJSpeech (as a single speaker) and ran training with a single GPU and with multiple GPUs. But it shows a series of error messages like Assertion srcIndex < srcSelectDimSize failed.

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl | head -3
nikl-mel-00001.npy
nikl-mel-00002.npy
nikl-mel-00003.npy
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl | tail -3
nikl-spec-00929.npy
nikl-spec-00930.npy
train.txt
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls data/nikl/*.npy | wc -l
1860


CUDA_VISIBLE_DEVICES=3 python train.py \
  --data-root=./data/nikl/ \
  --hparams="frontend=jp,builder=deepvoice3,preset=deepvoice3_ljspeech" \
  --checkpoint-dir checkpoint_nikl


Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoint_nikl',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/nikl/',
 '--help': False,
 '--hparams': 'builder=deepvoice3,preset=deepvoice3_ljspeech',
 '--load-embedding': None,
 '--log-event-path': None,
 '--reset-optimizer': False,
 '--restore-parts': None,
 '--speaker-id': None,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  allow_clipping_in_normalization: True
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: deepvoice3
  checkpoint_interval: 10000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  embedding_weight_std: 0.1
  encoder_channels: 256
  eval_interval: 10000
  fft_size: 1024
  force_monotonic_attention: True
  freeze_embedding: False
  frontend: en
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  key_projection: False
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.5
  max_positions: 512
  min_level_db: -100
  n_speakers: 1
  name: deepvoice3
  nepochs: 2000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  preset: deepvoice3_ljspeech
  presets: {'deepvoice3_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'deepvoice3_vctk': {'n_speakers': 108, 'speaker_embed_dim': 16, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.1, 'speaker_embedding_weight_std': 0.05, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 256, 'encoder_channels': 512, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.4, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 1024, 'query_position_rate': 2.0, 'key_position_rate': 7.6, 'key_projection': True, 'value_projection': True, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}, 'nyanko_ljspeech': {'n_speakers': 1, 'downsample_step': 4, 'outputs_per_step': 1, 'embedding_weight_std': 0.01, 'dropout': 0.050000000000000044, 'kernel_size': 3, 'text_embed_dim': 128, 'encoder_channels': 256, 'decoder_channels': 256, 'converter_channels': 256, 'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'binary_divergence_weight': 0.1, 'use_decoder_state_for_postnet_input': True, 'max_positions': 512, 'query_position_rate': 1.0, 'key_position_rate': 1.385, 'key_projection': False, 'value_projection': False, 'clip_thresh': 0.1, 'initial_learning_rate': 0.0005}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  sample_rate: 22050
  save_optimizer_state: True
  speaker_embed_dim: 16
  speaker_embedding_weight_std: 0.01
  text_embed_dim: 128
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  value_projection: False
  weight_decay: 0.0
  window_ahead: 3
  window_backward: 1
Override hyper parameters with preset "deepvoice3_ljspeech": {
    "n_speakers": 1,
    "downsample_step": 4,
    "outputs_per_step": 1,
    "embedding_weight_std": 0.1,
    "dropout": 0.050000000000000044,
    "kernel_size": 3,
    "text_embed_dim": 256,
    "encoder_channels": 512,
    "decoder_channels": 256,
    "converter_channels": 256,
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "binary_divergence_weight": 0.1,
    "use_decoder_state_for_postnet_input": true,
    "max_positions": 512,
    "query_position_rate": 1.0,
    "key_position_rate": 1.385,
    "key_projection": true,
    "value_projection": true,
    "clip_thresh": 0.1,
    "initial_learning_rate": 0.0005
}
Los event path: log/run-test2018-01-30_15:05:32.238606
34it [00:08,  4.24it/s]
7it/s]/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [106,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
(the same assertion is repeated for threads [33,0,0] through [53,0,0] of block [106,0,0])

...

/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [46,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/THCTensorIndex.cu:325: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2]: block: [46,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generic/THCStorage.cu line=58 error=59 : device-side assert triggered

Traceback (most recent call last):
  File "train.py", line 941, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 642, in train
    input_lengths=input_lengths)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/deepvoice3_pytorch/__init__.py", line 94, in forward
    linear_outputs = self.postnet(postnet_inputs, speaker_embed)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/deepvoice3_pytorch/deepvoice3.py", line 597, in forward
    return F.sigmoid(x)
  File "/home/kwon/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 817, in sigmoid
    return input.sigmoid()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1512386481460/work/torch/lib/THC/generic/THCStorage.cu:58
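
For context, this device-side assert is typically an out-of-range index in an embedding or positional-encoding lookup, e.g. a text symbol id larger than the embedding table or a sequence longer than max_positions. A rough diagnostic sketch for the text side, assuming train.txt keeps the transcript in the last pipe-separated field and that the frontend module exposes text_to_sequence and n_vocab the way train.py uses them:

from hparams import hparams
from deepvoice3_pytorch import frontend   # assumption: same import path train.py uses

_frontend = getattr(frontend, hparams.frontend)
max_len, max_sym = 0, 0
with open("data/nikl/train.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip().split("|")[-1]           # assumption: text is the last field
        seq = _frontend.text_to_sequence(text)
        max_len = max(max_len, len(seq))
        max_sym = max(max_sym, max(seq))
print("longest text sequence:", max_len, "; max_positions:", hparams.max_positions)
print("largest symbol id    :", max_sym, "; embedding size (n_vocab):", _frontend.n_vocab)

If either value exceeds its limit (the decoder's mel frame counts are subject to the same max_positions cap), the CUDA assertion above is the expected failure mode.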

hparams is not defined while running preprocess.py

Ran the following command on the downloaded LJSpeech dataset:

python3 preprocess.py ljspeech ~/data/LJSpeech-1.0/ ./data/ljspeech

No preprocessed data was generated; instead I got an error:

NameError: name 'hparams' is not defined
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    preprocess(mod, in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess
    metadata = mod.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/home/coglac/Documents/deepvoice3_pytorch/ljspeech.py", line 34, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/home/coglac/Documents/deepvoice3_pytorch/ljspeech.py", line 34, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
NameError: name 'hparams' is not defined
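
A NameError raised inside a ProcessPoolExecutor worker can hide the real import problem. A debugging sketch that runs a single utterance outside the pool so the original traceback surfaces directly (the paths and transcript are placeholders, and the _process_utterance(out_dir, index, wav_path, text) signature is assumed from ljspeech.py):

import ljspeech

# call the worker function directly, bypassing concurrent.futures, so the
# underlying "name 'hparams' is not defined" shows its full origin
ljspeech._process_utterance("./data/ljspeech", 1,
                            "/home/user/data/LJSpeech-1.0/wavs/LJ001-0001.wav",
                            "Printing, in the only sense with which we are at present concerned.")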

AssertionError

Hi,

I am new to PyTorch and am following the JSUT example here. I encountered the following assertion error, which is hard for me to investigate further. Could anyone help me out?

[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ python -V
Python 3.5.4 :: Anaconda custom (64-bit)
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ ls /home/kwon/copora/jsut_ver1.1
basic5000  ChangeLog.txt  countersuffix26  LICENCE.txt  loanword128  onomatopee300  precedent130  README_en.txt  README_ja.txt  repeat500  travel1000  utparaphrase512  voiceactress100
[kwon@ssi-dnn-slave-002 deepvoice3_pytorch]$ python preprocess.py jsut /home/kwon/copora/jsut_ver1.1 ./data/jsut
  0%|                                                                                                                                                                               | 0/7696 [00:00<?, ?it/s]concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/process.py", line 175, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 52, in _process_utterance
    mel_spectrogram = audio.melspectrogram(wav).astype(np.float32)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/audio.py", line 50, in melspectrogram
    assert S.max() <= 0 and S.min() - hparams.min_level_db >= 0
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "preprocess.py", line 47, in <module>
    preprocess(mod, in_dir, out_dir, num_workers)
  File "preprocess.py", line 21, in preprocess
    metadata = mod.build_from_path(in_dir, out_dir, num_workers, tqdm=tqdm)
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 25, in build_from_path
    return [future.result() for future in tqdm(futures)]
  File "/home/kwon/3rdParty/deepvoice3_pytorch/jsut.py", line 25, in <listcomp>
    return [future.result() for future in tqdm(futures)]
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/home/kwon/anaconda3/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
AssertionError
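
For context, the assert in audio.py checks that the dB-scaled mel spectrogram (already shifted by ref_level_db) stays inside [min_level_db, 0]; very loud, clipped, or differently-sampled audio can push it outside that range. A toy illustration with made-up values, plus the clipping that an allow_clipping_in_normalization-style setting would apply instead of failing:

import numpy as np

min_level_db = -100
S = np.array([-120.0, -30.0, 2.5])   # made-up dB values; the first and last violate the bounds
print(S.max() <= 0 and S.min() - min_level_db >= 0)   # False -> AssertionError
print(np.clip(S, min_level_db, 0))                    # clipping keeps S in range instead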

KeyError: 'unexpected key "seq2seq.decoder.attention.in_projection.bias" in state_dict'

Hi, thanks for the fantastic DeepVoice3 implementation!

When trying to train the Nyanko model starting from your pre-trained checkpoint using the following args:

--hparams="builder=nyanko,preset=nyanko_ljspeech" 
--checkpoint=checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth

I'm getting the error:

Load checkpoint from: checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth
Traceback (most recent call last):
  File "train.py", line 936, in <module>
    load_checkpoint(checkpoint_path, model, optimizer, reset_optimizer)
  File "train.py", line 820, in load_checkpoint
    model.load_state_dict(checkpoint["state_dict"])
  File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 490, in load_state_dict
    .format(name))
KeyError: 'unexpected key "seq2seq.decoder.attention.in_projection.bias" in state_dict'

It looks like in_projection is missing from the AttentionLayer implementation in deepvoice3_pytorch/deepvoice3.py but is still present in the Nyanko pre-trained model https://github.com/r9y9/deepvoice3_pytorch#pretrained-models
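
The safest route is the one from the README: check out the git commit the checkpoint was trained against. If the goal is only to reuse the weights that still match the current model, a workaround sketch (not the repository's official procedure) is to filter the state dict before loading:

import torch

checkpoint = torch.load(
    "checkpoints.pretrained/20171129_nyanko_checkpoint_step000585000.pth",
    map_location=lambda storage, loc: storage)
pretrained = checkpoint["state_dict"]
model_state = model.state_dict()   # `model` built the same way train.py builds it
# keep only keys the current model knows about, and only where shapes agree
filtered = {k: v for k, v in pretrained.items()
            if k in model_state and v.size() == model_state[k].size()}
model_state.update(filtered)
model.load_state_dict(model_state)

Any layer whose weights were dropped this way starts from its random initialization, so results will differ from the published samples.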

An error occurs when loading the text

With Deep Voice 3, I get the following error.

collected_files = self.file_data_source.collect_files()
File "train.py", line 126, in collect_files
assert len(l) == 4 or len(l) == 5
AssertionError

Is the format of my text wrong? The data is JSUT.
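
For reference, that assert expects every line of train.txt to have four pipe-separated fields (five when a speaker id is appended, as in the multi-speaker case), roughly: spectrogram filename | mel filename | number of frames | text. An illustrative (not real) line would look like:

jsut-spec-00001.npy|jsut-mel-00001.npy|425|<transcript text>

A line with extra '|' characters, or with fewer fields (for example an empty or hand-edited line), makes len(l) something other than 4 or 5 and triggers the AssertionError.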

How do speeds compare between fairseq-py's conv_tbc and nn.Conv1d at inference time?

The fairseq team has said there is a big speed difference between their own temporal convolution (conv_tbc) and the original nn.Conv1d at inference time.

Did you check the speed of these two modules when removing the fairseq-py dependency?

By the way, I agree with implementing this without the dependency. It makes it much easier to follow the overall code flow.
Good job!
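
For anyone who wants to measure this, a rough benchmark sketch is below. It assumes a PyTorch build that exposes torch.conv_tbc (the primitive fairseq's ConvTBC wraps), with weight layouts (out, in, kernel) for conv1d and (kernel, in, out) for conv_tbc; it is not a statement about which one this repository should use.

import time
import torch
import torch.nn.functional as F

T, B, C, k = 500, 1, 256, 3
x_bct = torch.randn(B, C, T)                 # nn.Conv1d layout: (batch, channels, time)
x_tbc = x_bct.permute(2, 0, 1).contiguous()  # conv_tbc layout: (time, batch, channels)
w = torch.randn(C, C, k)                     # (out_channels, in_channels, kernel)
w_tbc = w.permute(2, 1, 0).contiguous()      # (kernel, in_channels, out_channels)
b = torch.zeros(C)

def bench(fn, n=100):
    fn()                                     # warm up
    t0 = time.time()
    for _ in range(n):
        fn()
    return (time.time() - t0) / n

print("F.conv1d      :", bench(lambda: F.conv1d(x_bct, w, b, padding=k - 1)))
print("torch.conv_tbc:", bench(lambda: torch.conv_tbc(x_tbc, w_tbc, b, k - 1)))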

RuntimeError: main thread is not in main loop

When I ran train.py
(python3 train.py --data-root=./datapath/ljspeech/ --hparams="batch_size=10")

I got this error:
Exception ignored in: <bound method Image.__del__ of <tkinter.PhotoImage object at 0x7f1b5f86a710>>
Traceback (most recent call last):
  File "/usr/lib/python3.5/tkinter/__init__.py", line 3359, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
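
This usually means matplotlib picked the interactive TkAgg backend and a figure got garbage-collected outside the main thread. A common workaround, assuming the training plots only need to be written to disk, is to force the non-interactive Agg backend before pyplot is imported:

# put this before any `import matplotlib.pyplot` (e.g. at the top of train.py)
import matplotlib
matplotlib.use("Agg")      # file-only backend, no tkinter involved
import matplotlib.pyplot as plt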

Some modifications on my side

  1. deepvoice3_pytorch/__init__.py
    from .version import __version__
    this line raises an error:
    version.py is not provided.

  2. deepvoice3_pytorch/builder.py
    deepvoice3_multispeaker
    inconsistent with hparams.py

  3. deepvoice3_pytorch/deepvoice3.py
    line 474, (done > 0.5).all()
    maybe done.data is better

train: RuntimeError: invalid argument 2: size '[16 x 126]' is invalid for input of with 126 elements at /home/demobin/github/pytorch/torch/lib/TH/THStorage.c:41

python3 train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko --hparams="use_preset=True,builder=nyanko" --log-event-path=log/nyanko_preset

Command line args:
 {'--checkpoint': None,
 '--checkpoint-dir': 'checkpoints_nyanko',
 '--checkpoint-postnet': None,
 '--checkpoint-seq2seq': None,
 '--data-root': './data/ljspeech',
 '--help': False,
 '--hparams': 'use_preset=True,builder=nyanko',
 '--log-event-path': 'log/nyanko_preset',
 '--reset-optimizer': False,
 '--train-postnet-only': False,
 '--train-seq2seq-only': False}
Training whole model
Training seq2seq model
Hyperparameters:
  adam_beta1: 0.5
  adam_beta2: 0.9
  adam_eps: 1e-06
  batch_size: 16
  binary_divergence_weight: 0.1
  builder: nyanko
  checkpoint_interval: 5000
  clip_thresh: 0.1
  converter_channels: 256
  decoder_channels: 256
  downsample_step: 4
  dropout: 0.050000000000000044
  encoder_channels: 256
  fft_size: 1024
  force_monotonic_attention: True
  frontend: en
  guided_attention_sigma: 0.2
  hop_size: 256
  initial_learning_rate: 0.0005
  kernel_size: 3
  key_position_rate: 1.385
  lr_schedule: noam_learning_rate_decay
  lr_schedule_kwargs: {}
  masked_loss_weight: 0.0
  max_positions: 512
  min_level_db: -100
  name: deepvoice3
  nepochs: 2000
  num_mels: 80
  num_workers: 2
  outputs_per_step: 1
  padding_idx: 0
  pin_memory: True
  power: 1.4
  preemphasis: 0.97
  presets: {'nyanko': {'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'outputs_per_step': 1, 'text_embed_dim': 128, 'initial_learning_rate': 0.0005, 'binary_divergence_weight': 0.1, 'kernel_size': 3, 'downsample_step': 4, 'decoder_channels': 256, 'dropout': 0.050000000000000044, 'clip_thresh': 0.1, 'encoder_channels': 256, 'converter_channels': 256, 'use_decoder_state_for_postnet_input': True}, 'deepvoice3': {'use_guided_attention': True, 'guided_attention_sigma': 0.2, 'outputs_per_step': 4, 'text_embed_dim': 256, 'initial_learning_rate': 0.001, 'binary_divergence_weight': 0.0, 'kernel_size': 7, 'downsample_step': 1, 'decoder_channels': 256, 'dropout': 0.050000000000000044, 'clip_thresh': 1.0, 'encoder_channels': 256, 'converter_channels': 256, 'use_decoder_state_for_postnet_input': True}, 'latest': {}}
  priority_freq: 3000
  priority_freq_weight: 0.0
  query_position_rate: 1.0
  ref_level_db: 20
  replace_pronunciation_prob: 0.5
  sample_rate: 22050
  text_embed_dim: 128
  trainable_positional_encodings: False
  use_decoder_state_for_postnet_input: True
  use_guided_attention: True
  use_memory_mask: True
  use_preset: True
  weight_decay: 0.0
Override hyper parameters with preset "nyanko": {
    "use_guided_attention": true,
    "guided_attention_sigma": 0.2,
    "outputs_per_step": 1,
    "text_embed_dim": 128,
    "initial_learning_rate": 0.0005,
    "binary_divergence_weight": 0.1,
    "kernel_size": 3,
    "downsample_step": 4,
    "decoder_channels": 256,
    "dropout": 0.050000000000000044,
    "clip_thresh": 0.1,
    "encoder_channels": 256,
    "converter_channels": 256,
    "use_decoder_state_for_postnet_input": true
}
Los event path: log/nyanko_preset
0it [00:00, ?it/s]Traceback (most recent call last):
  File "train.py", line 777, in <module>
    train_seq2seq=train_seq2seq, train_postnet=train_postnet)
  File "train.py", line 466, in train
    in tqdm(enumerate(data_loader)):
  File "/usr/local/lib/python3.5/dist-packages/tqdm/_tqdm.py", line 816, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 201, in __next__
    return self._process_next_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 221, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
RuntimeError: Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 62, in _pin_memory_loop
    batch = pin_memory_batch(batch)
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in pin_memory_batch
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 123, in <listcomp>
    return [pin_memory_batch(sample) for sample in batch]
  File "/usr/local/lib/python3.5/dist-packages/torch/utils/data/dataloader.py", line 117, in pin_memory_batch
    return batch.pin_memory()
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 82, in pin_memory
    return type(self)().set_(storage.pin_memory()).view_as(self)
  File "/usr/local/lib/python3.5/dist-packages/torch/tensor.py", line 198, in view_as
    return self.view(tensor.size())
RuntimeError: invalid argument 2: size '[16 x 126]' is invalid for input of with 126 elements at /home/demobin/github/pytorch/torch/lib/TH/THStorage.c:41
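
Since the failure happens inside the DataLoader's pinned-memory copy (pin_memory_batch), one thing worth trying (a guess based on the traceback, not a confirmed fix) is to disable pinning via the pin_memory hyper parameter shown in the dump above, or to move to a newer PyTorch release:

python3 train.py --data-root=./data/ljspeech --checkpoint-dir=checkpoints_nyanko \
  --hparams="use_preset=True,builder=nyanko,pin_memory=False" \
  --log-event-path=log/nyanko_preset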

Tacotron 2

Sorry if this is off-topic (DeepVoice vs. Tacotron), but it seems the Tacotron 2 paper is now released.
The speech samples sound better than ever (I think):
https://google.github.io/tacotron/publications/tacotron2/index.html

I must admit that I'm not too well versed in how much this differs from the original Tacotron. But perhaps the changes made could also be used in your project?
