durian-pytorch's Introduction

DurIAN

Implementation of the "Duration Informed Attention Network for Multimodal Synthesis" paper (https://arxiv.org/pdf/1909.01700.pdf).

Status: released

1 Info

DurIAN is an encoder-decoder architecture for the text-to-speech synthesis task. Unlike prior architectures such as Tacotron 2, it does not learn an attention mechanism; instead, it takes explicit phoneme duration information into account. To use this model you therefore need a phonemized, duration-aligned dataset. However, you may try the duration model pretrained on the LJSpeech dataset (the CMU dictionary was used). Links are provided below.
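To illustrate the core idea, here is a minimal sketch (not the repository's actual code) of how phoneme-level encoder states can be expanded to frame level with duration information instead of a learned attention mechanism:

    import torch

    # Hypothetical example: repeat each phoneme's encoder state for as many
    # frames as its duration, producing the frame-level decoder input.
    encoder_states = torch.randn(5, 256)       # 5 phonemes, 256-dim hidden states
    durations = torch.tensor([3, 7, 2, 5, 4])  # frames assigned to each phoneme
    frame_level_states = torch.repeat_interleave(encoder_states, durations, dim=0)
    print(frame_level_states.shape)            # torch.Size([21, 256])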

2 Architecture details

The DurIAN model consists of two modules: a backbone synthesizer and a duration predictor. Here are the most notable differences from the DurIAN described in the paper:

  • Prosodic boundary markers aren't used (they weren't labeled in the data), so there is no 'skip states' exclusion of the prosodic boundaries' hidden states
  • Style codes aren't used either (for the same reason)
  • The Prenet before the CBHG encoder was removed (it didn't improve accuracy in experiments)
  • The decoder's recurrent cell outputs a single spectrogram frame at a time

Both the backbone synthesizer and the duration model are trained simultaneously. For implementation simplicity, the duration model predicts an alignment over a fixed maximum number of frames. You can train these outputs as a BCE problem, as an MSE problem by summing over the frames axis, or with both losses combined (the last option hasn't been tested); set this in config.json. Experiments showed that the BCE-only version of the optimization is unstable with longer text sequences, so prefer MSE+BCE or MSE-only (don't worry if the alignments look bad in Tensorboard).
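For illustration, here is a minimal sketch of the two loss options (tensor names and shapes are assumptions, not the repo's exact code): BCE is computed elementwise over the predicted alignment matrix, while MSE is computed over per-phoneme durations obtained by summing over the frames axis.

    import torch.nn.functional as F

    def duration_losses(pred_alignment, true_alignment):
        # Assumed shapes: (batch, max_frames, num_phonemes); predictions are
        # probabilities, targets are the 0/1 forced alignment from the dataset.
        bce = F.binary_cross_entropy(pred_alignment, true_alignment)
        # Summing over the frames axis yields per-phoneme durations in frames.
        mse = F.mse_loss(pred_alignment.sum(dim=1), true_alignment.sum(dim=1))
        return bce, mse  # use one of them or a weighted combination of both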

3 Reproducibility

You can check the synthesis demo wavfile (obtained well before convergence) in the demo folder (a WaveGlow vocoder was used).

  1. Clone the repository: git clone https://github.com/ivanvovk/DurrIAN

  2. Make sure you have installed all packages using pip install --upgrade -r requirements.txt. The code is tested with pytorch==1.5.0.

  3. To start training the paper-based DurIAN version, run python train.py -c configs/default.json. To train the baseline model instead, run python train.py -c configs/baseline.json --baseline.

To make sure that everything works in your local environment, you may run the unit tests in the tests folder with python <test_you_want_to_run.py>.

4 Pretrained models

This implementation was trained on the phonemized, duration-aligned LJSpeech dataset with BCE duration loss minimization. You may find the pretrained model via this link.

5 Dataset alignment problem

The main drawback of this model is that it requires a duration-aligned dataset. You can find the parsed LJSpeech filelists used to train the current implementation in the filelists folder. To use your own data, make sure your filelists are organized the same way as the provided LJSpeech ones. However, to save your time and brain cells, you may try training the model on your dataset without duration alignment by using the duration model pretrained on LJSpeech from my checkpoint (not tried). If you are interested in aligning your own dataset, carefully follow the next section.

6 How to align your own data

In my experiments I aligned LJSpeech with the Montreal Forced Aligner (MFA) tool. If anything below is unclear, please follow the instructions in the toolkit's docs. The alignment procedure has several steps:

  1. Organize your dataset properly. MFA requires it to be in a single folder with the structure {utterance_id.lab, utterance_id.wav}. Make sure all your transcripts are in .lab format.

  2. Download an MFA release and follow the installation instructions via this link.

  3. Once MFA is installed, you need a dictionary of your dataset's words with their phoneme transcriptions. Here you have several options:

    1. (Try this first) Download a ready-made dictionary from the MFA pretrained models list (at the bottom of the page). For the current implementation I used the English ARPAbet dictionary. One possible problem: if your dataset contains words missing from the dictionary, MFA may fail to parse them and skip the corresponding files. You may drop those files, preprocess your dataset to match the dictionary, or add the missing words by hand (if there aren't too many of them).
    2. You may generate the dictionary with a pretrained G2P model from the MFA pretrained models list using the command bin/mfa_generate_dictionary /path/to/model_g2p.zip /path/to/data dict.txt. Note that the default MFA installation automatically provides an English pretrained model, which you may use.
    3. Otherwise, you'll need to train your own G2P model on your data. To train your model, follow the instructions via this link.
  4. Once you have your data, dictionary and G2P model prepared, you are ready for aligning. Run the command bin/mfa_align /path/to/data dict.txt /path/to/model_g2p.zip outdir and wait until it's done. The outdir folder will contain a list of out-of-vocabulary words and a folder with .TextGrid files, where the alignments of the wavs are stored.

  5. Now we want to process these TextGrid files in order to get the final filelist. Here you may find the python package TextGrid useful. Install it using pip install TextGrid. Here is an example of how to use it:

    import textgrid
    # Parse an MFA output file; tier 0 holds the aligned words, tier 1 the aligned phonemes
    tg = textgrid.TextGrid.fromFile('./outdir/data/text0.TextGrid')
    

    Now tg is a set of two tiers: the first one contains the aligned words, the second one contains the aligned phonemes. You need the second one. Extract durations (in frames! tg stores intervals in seconds, so convert) for the whole dataset by iterating over the obtained .TextGrid files and prepare a filelist in the same format as the ones provided in the filelists folder; a sketch of this conversion follows below.
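
    A minimal sketch of the seconds-to-frames conversion (the sampling rate, hop length and flooring below are assumptions; match them to your own feature-extraction config):

    import textgrid

    SAMPLE_RATE = 22050  # LJSpeech default
    HOP_LENGTH = 256     # mel spectrogram hop length in samples

    def phones_and_durations(textgrid_path):
        # Tier 1 of an MFA TextGrid contains the aligned phonemes
        phones = textgrid.TextGrid.fromFile(textgrid_path)[1]
        symbols, durations = [], []
        for interval in phones:
            seconds = interval.maxTime - interval.minTime
            symbols.append(interval.mark)
            durations.append(int(seconds * SAMPLE_RATE / HOP_LENGTH))  # floor to whole frames
        return symbols, durations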

I found an overview of several aligners; maybe it will be helpful. However, I recommend using MFA, as it is one of the most accurate aligners to the best of my knowledge.

durian-pytorch's People

Contributors

ivanvovk, poria-cat


durian-pytorch's Issues

Alignments slightly inconsistent

I found small inconsistencies in the alignments of the audio files. It's mostly just a few frames, but still. See below, please.

MFCC LJ015-0189.wav has shape (80, 372), but filelists.txt indicates length (80, 369)
MFCC LJ048-0044.wav has shape (80, 445), but filelists.txt indicates length (80, 443)
MFCC LJ001-0052.wav has shape (80, 463), but filelists.txt indicates length (80, 461)
MFCC LJ034-0175.wav has shape (80, 667), but filelists.txt indicates length (80, 665)
MFCC LJ018-0115.wav has shape (80, 673), but filelists.txt indicates length (80, 671)
MFCC LJ011-0071.wav has shape (80, 653), but filelists.txt indicates length (80, 652)
MFCC LJ031-0034.wav has shape (80, 356), but filelists.txt indicates length (80, 354)
MFCC LJ017-0025.wav has shape (80, 261), but filelists.txt indicates length (80, 259)
MFCC LJ018-0014.wav has shape (80, 676), but filelists.txt indicates length (80, 674)
MFCC LJ023-0004.wav has shape (80, 723), but filelists.txt indicates length (80, 720)
MFCC LJ016-0154.wav has shape (80, 231), but filelists.txt indicates length (80, 229)
MFCC LJ031-0046.wav has shape (80, 482), but filelists.txt indicates length (80, 479)
MFCC LJ028-0414.wav has shape (80, 731), but filelists.txt indicates length (80, 729)
MFCC LJ044-0161.wav has shape (80, 183), but filelists.txt indicates length (80, 181)
MFCC LJ042-0086.wav has shape (80, 517), but filelists.txt indicates length (80, 515)
MFCC LJ048-0153.wav has shape (80, 765), but filelists.txt indicates length (80, 763)
MFCC LJ018-0128.wav has shape (80, 547), but filelists.txt indicates length (80, 546)

RuntimeError at the inference time

Hi. I trained the DurIAN model using a Mandarin dataset and modified text_frontend.py a little bit.
However, at inference time, I encountered the following problem:

Traceback (most recent call last):
  File "inference.py", line 56, in <module>
    test()
  File "inference.py", line 48, in test
    outputs = model.inference(inputs)
  File "/models/DurIAN-ivanvok/model/model.py", line 91, in inference
    alignments, _ = self.duration_model.inference(inputs)
  File "/models/DurIAN-ivanvok/model/duration.py", line 88, in inference
    outputs, durations = self._compute_weighted_forced_alignment(outputs[0])
  File "/models/DurIAN-ivanvok/model/duration.py", line 47, in _compute_weighted_forced_alignment
    durations = torch.bincount(alignment.argmax(dim=0))
RuntimeError: cannot perform reduction function argmax on a tensor with no elements because the operation does not have an identity

I printed the value of outputs[0].sum(dim=0):

tensor([1.7232e-08, 1.7707e-08, 2.1013e-08,  ..., 1.3151e+01, 1.5982e+01,
        1.9016e+01])

And I also printed the value of eos_idx, which is zero.

Could you tell me what eos_idx = list((outputs[0].sum(dim=0) > 0.1).cpu().numpy()).index(False) is used for, and how to solve the above problem? Thank you very much!

Generating several frames at one decoder step

Wonderful work!
I have one question after looking into your readme.
The difference between DurIAN and FastSpeech is DurIAN's autoregressive structure, which lets it generate better wavs: it uses the previous output as the input of the current step. However, the picture of the baseline model differs from the description in the paper.
[baseline model figure]
Besides, I think it is important to generate r frames of mel spectrogram at once (see the Tacotron paper), which is also a difficulty when implementing DurIAN. I hope you can implement this part.

Thank you!
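
For reference, here is a minimal sketch of the reduction-factor idea from Tacotron (names and shapes are assumptions, not this repository's code): the decoder's output projection emits r * n_mel values per step, which are reshaped into r consecutive spectrogram frames.

    import torch.nn as nn

    class FramesProjection(nn.Module):
        """Hypothetical output projection producing r mel frames per decoder step."""
        def __init__(self, decoder_dim=512, n_mel=80, r=2):
            super().__init__()
            self.n_mel, self.r = n_mel, r
            self.linear = nn.Linear(decoder_dim, n_mel * r)

        def forward(self, decoder_state):
            # decoder_state: (batch, decoder_dim) -> (batch, r, n_mel)
            return self.linear(decoder_state).view(-1, self.r, self.n_mel)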

An aligner issue!

Hi, I followed your directions in section 6.3 to make .TextGrid files.

I don't understand step i). Here you recommended downloading a ready-made dictionary from the MFA pretrained models list,

but I don't see which dictionary to use in your directions. So, what is the difference?

Thanks.

Training process was killed at iteration 71

Hi. Thank you very much for your implementation.
I trained the DurIAN model on my own dataset, with a V100, batch size 16, torch version 1.5.0.
However, the training process always gets killed between iterations 70 and 73, as follows:

Iteration 51 | Backbone loss 5.882367134094238 | Duration model 0.22346851229667664
Iteration 52 | Backbone loss 5.768142223358154 | Duration model 0.21113494038581848
Iteration 53 | Backbone loss 5.487927436828613 | Duration model 0.3232101500034332
Iteration 54 | Backbone loss 4.821720600128174 | Duration model 0.2446255385875702
Iteration 55 | Backbone loss 4.654604911804199 | Duration model 0.2901703417301178
Iteration 56 | Backbone loss 5.083403587341309 | Duration model 0.25193724036216736
Iteration 57 | Backbone loss 4.542118072509766 | Duration model 0.29316720366477966
Iteration 58 | Backbone loss 4.519543647766113 | Duration model 0.24372977018356323
Iteration 59 | Backbone loss 6.24979829788208 | Duration model 0.310175359249115
Iteration 60 | Backbone loss 4.367422580718994 | Duration model 0.2611904442310333
Iteration 61 | Backbone loss 3.087519645690918 | Duration model 0.1617337316274643
Iteration 62 | Backbone loss 3.959341049194336 | Duration model 0.2462971955537796
Iteration 63 | Backbone loss 3.8049283027648926 | Duration model 0.27857890725135803
Iteration 64 | Backbone loss 3.482792615890503 | Duration model 0.19809436798095703
Iteration 65 | Backbone loss 3.6724131107330322 | Duration model 0.3031512498855591
Iteration 66 | Backbone loss 3.4588942527770996 | Duration model 0.220042884349823
Iteration 67 | Backbone loss 3.3647873401641846 | Duration model 0.23614169657230377
Iteration 68 | Backbone loss 3.1085774898529053 | Duration model 0.24144530296325684
Iteration 69 | Backbone loss 4.648317813873291 | Duration model 0.26875409483909607
Iteration 70 | Backbone loss 2.7025070190429688 | Duration model 0.23824451863765717
Iteration 71 | Backbone loss 3.6294140815734863 | Duration model 0.1941741555929184
Killed

Also, I noticed that the GPU usage is always 0%.

Could you help me solve this problem? Thank you very much!

A simple doubt in Pre Processing

First of all, great work on the repo!
When I reproduce the MFA output on LJSpeech, my number of samples is different from yours.
I wanted to know if you applied any criteria to remove some data points in the pre-processing.
Thanks in advance!

About Duration

MFA extracts MFCCs with a 10 ms frame shift, but you extract mels with a ~11.6 ms hop (hop length 256), so you just use floor when extracting durations, right?
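
For reference, a minimal sketch of the arithmetic in question (22050 Hz is the LJSpeech sampling rate; whether to floor or round is exactly what is being asked):

    SAMPLE_RATE = 22050
    HOP_LENGTH = 256  # ~11.6 ms per mel frame

    def seconds_to_frames(seconds, use_floor=True):
        frames = seconds * SAMPLE_RATE / HOP_LENGTH
        return int(frames) if use_floor else round(frames)

    # e.g. a 0.10 s phone: 0.10 * 22050 / 256 ≈ 8.6 frames -> floor gives 8, round gives 9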

filelist.txt inconsistent

I noticed that filelist.txt is inconsistent.
For example, it contains 37 entries for 'LJ032-0100.wav'.
Am I missing something?
