Git Product home page Git Product logo

msmc-tts's Introduction

MSMC-TTS: Multi-Stage Multi-Codebook TTS

Official Implement of MSMC-TTS System of papers "A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS" and "Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations "

The latest MSMC-TTS (MSMC-TTS-v2) is optimized with a MSMC-VQ-GAN based autoencoder combining MSMC-VQ-VAE and HifiGAN. The multi-stage predictor is still applied as the acoustic model to predict MSMCRs for TTS synthesis. avatar

News

[2022.10.28] We release the latest version of MSMC-TTS (MSMC-TTS-v2) based on MSMC-VQ-GAN. Please refer to our latest paper "Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Representation"

[2022.10.18] We will release the code of all versions of MSMC-TTS in this repo. And anyone interested in this work is welcome to join us to explore more useful speech representations for speech synthesis.

[2022.9.22] "A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS" is published at INTERSPEECH 2022.

Usage

# Install
pip -r requirements.txt

# Train (Take the example of CSMSC, please refer to the example of CSMSC to prepare your training data)
python train.py -c examples/csmsc/configs/msmc_vq_gan.yaml
python train.py -c examples/csmsc/configs/msmc_vq_gan_am.yaml

# Multi-GPU Training
python train_dist.py -c examples/csmsc/configs/msmc_vq_gan.yaml
python train_dist.py -c examples/csmsc/configs/msmc_vq_gan_am.yaml

# Test -- Analysis-Synthesis
python infer.py -c examples/csmsc/configs/msmc_vq_gan.yaml -m examples/csmsc/checkpoints/msmc_vq_gan/model_800000 -t examples/csmsc/data/test_ae.yaml -o analysis_synthesis

# Test -- TTS-Synthesis
python infer.py -c examples/csmsc/configs/msmc_vq_gan_am.yaml -m examples/csmsc/checkpoints/msmc_vq_gan_am/model_200000 -t examples/csmsc/data/test_tts.yaml -o tts

Tips

Help you better train your models!

MSMC-VQ-GAN

  1. Be careful for the compactness of your representation. For single-speaker standard TTS, you can try 2-4 heads, which each head may have 64 - 256 codewords.
  2. Please use fewer codewords if your batch size is too small, otherwise the frame size of a batch is insufficient to support the dynamic codebook update.
  3. You may change the weight of the encoder loss if you find that some stages in your MSMC-VQ-GAN learn nothing. You can also train a single-stage VQ-GAN to check the training status first, then add a stage. The VQ loss of any stage should not be too small, otherwise this stage may fail in modeling.

Multi-Stage Predictor

  1. Triplet loss can improve the expressiveness of TTS, but also may degrade the smoothness. You may try different weights of Triplet loss, such as 0, 0.01, 0.1, 1, to find the most balanced performance.
  2. For low-resource datasets, please use smaller models to avoid over-fitting. Early-stop is also an useful trick.

Citations

@inproceedings{guo2022msmc,
  title={A Multi-Stage Multi-Codebook VQ-VAE Approach to High-Performance Neural TTS},
  author={Guo, Haohan and Xie, Fenglong and Soong, Frank K and Wu, Xixin and Meng, Helen},
  booktitle={Proc. INTERSPEECH},
  year={2022}
}

Acknowledge

  1. FastSpeech Implementation: https://github.com/NVIDIA/NeMo
  2. HifiGAN Implementation: https://github.com/jik876/hifi-gan
  3. UnivNet Implementation: https://github.com/mindslab-ai/univnet
  4. VITS Implementation: https://github.com/jaywalnut310/vits

msmc-tts's People

Contributors

hhguo avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.