zefyrr / radtts

This project forked from nvidia/radtts

Provides training, inference and voice conversion recipes for RADTTS and RADTTS++: Flow-based TTS models with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over Low Dimensional (F0 and Energy) Speech Attributes.

License: MIT License

Python 6.68% Dockerfile 0.01% Roff 93.31%

radtts's Introduction

Flow-based TTS with Robust Alignment Learning, Diverse Synthesis, and Generative Modeling and Fine-Grained Control over Low Dimensional (F0 and Energy) Speech Attributes.

This repository contains the source code and several checkpoints for our work based on RADTTS. RADTTS is a normalizing-flow-based TTS framework with state-of-the-art acoustic fidelity and a highly robust audio-transcription alignment module. Our project page and some samples can be found here, with relevant works listed here.

This repository can be used to train the following models:

  • A normalizing-flow bipartite architecture for mapping text to mel spectrograms
  • A variant of the above, conditioned on F0 and Energy
  • Normalizing flow models for explicitly modeling text-conditional phoneme duration, fundamental frequency (F0), and energy
  • A standalone alignment module for learning the unsupervised text-audio alignments necessary for TTS training

HiFi-GAN vocoder pre-trained models

We provide a checkpoint and config for a HiFi-GAN vocoder trained on LibriTTS 100 and 360.
For a HiFi-GAN vocoder trained on LJS, please download the v1 model provided by the HiFi-GAN authors here.

RADTTS pre-trained models

Model name | Description | Dataset
RADTTS++DAP-LJS | RADTTS model conditioned on F0 and Energy with deterministic attribute predictors | LJSpeech Dataset

We will soon provide more pre-trained RADTTS models with generative attribute predictors trained on LJS and LibriTTS. Stay tuned!

Setup

  1. Clone this repo: git clone https://github.com/NVIDIA/RADTTS.git
  2. Install the Python requirements or build the Docker image
    • Install python requirements: pip install -r requirements.txt
  3. Update the filelists inside the filelists folder and json configs to point to your data
    • basedir – the folder containing the filelists and the audiodir
    • audiodir – name of the audiodir
    • filelist – | (pipe) separated text file with relative audiopath, text, speaker, and optionally categorical label and audio duration in seconds
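
As a rough illustration of the filelist layout described above, the following sketch splits one row into named fields and resolves the audio path against basedir and audiodir. The helper name `parse_filelist_line` and the dictionary keys are assumptions made for this example; they are not taken from the repository's data loader.

```python
from pathlib import PurePosixPath

# Field order follows the description above; "label" and "duration" are optional.
FIELDS = ("audiopath", "text", "speaker", "label", "duration")


def parse_filelist_line(line, basedir="", audiodir=""):
    """Split one pipe-separated filelist row into a dict (hypothetical helper)."""
    parts = line.rstrip("\n").split("|")
    if len(parts) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(parts)}: {line!r}")
    # zip stops at the shorter sequence, so missing optional fields are simply absent.
    row = dict(zip(FIELDS, parts))
    # The audio path in the filelist is relative to basedir/audiodir.
    row["audiopath"] = str(PurePosixPath(basedir) / audiodir / row["audiopath"])
    # Duration, when present and non-empty, is given in seconds.
    if row.get("duration"):
        row["duration"] = float(row["duration"])
    return row
```

A quick check of your own filelists with a helper like this can catch rows with missing fields before a long training run starts.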

Training RADTTS (without pitch and energy conditioning)

  1. Train the decoder
    python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir
  2. Further train with the duration predictor
    python train.py -c config_ljs_radtts.json -p train_config.output_directory=outdir_dir train_config.warmstart_checkpoint_path=model_path.pt model_config.include_modules="decatndur"
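
The -p flag in the commands above takes dotted key=value pairs that override entries of the JSON config. A minimal sketch of how such overrides could be applied to a nested dictionary is shown below; `apply_overrides` is a hypothetical helper written for illustration and is not necessarily how train.py implements it.

```python
import ast


def apply_overrides(config, overrides):
    """Apply 'a.b.c=value' strings to a nested dict in place (hypothetical helper)."""
    for item in overrides:
        dotted_key, _, raw_value = item.partition("=")
        try:
            # Interpret numbers, quoted strings, lists and dicts as Python literals.
            value = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Bare words such as outdir or model_path.pt stay plain strings.
            value = raw_value
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return config
```

Under this scheme, an override such as train_config.output_directory=outdir replaces only that one entry and leaves the rest of the config file untouched.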

Training RADTTS++ (with pitch and energy conditioning)

  1. Train the decoder
    python train.py -c config_ljs_decoder.json -p train_config.output_directory=outdir
  2. Train the attribute predictor: autoregressive flow (agap), bi-partite flow (bgap) or deterministic (dap)
    python train.py -c config_ljs_{agap,bgap,dap}.json -p train_config.output_directory=outdir_wattr train_config.warmstart_checkpoint_path=model_path.pt

Training starting from a pre-trained model, ignoring the speaker embedding table

  1. Download our pre-trained model
  2. python train.py -c config.json -p train_config.ignore_layers_warmstart=["speaker_embedding.weight"] train_config.warmstart_checkpoint_path=model_path.pt
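
Conceptually, ignoring a layer at warm start means dropping its entry from the checkpoint before the weights are loaded, so the new model keeps its own freshly initialized values for that layer. The sketch below shows the idea with plain dictionaries standing in for state dicts; `filter_warmstart_state` is a hypothetical name, and the repository's actual loading code may differ.

```python
def filter_warmstart_state(checkpoint_state, ignore_layers):
    """Return a copy of the checkpoint state without the ignored entries.

    Hypothetical helper: plain dicts stand in for PyTorch state dicts here.
    """
    ignored = set(ignore_layers)
    return {
        name: value
        for name, value in checkpoint_state.items()
        if name not in ignored
    }
```

With PyTorch, the filtered dictionary would typically be passed to `model.load_state_dict(filtered, strict=False)`, since `strict=False` tolerates the keys that are now missing. This is useful when the new dataset has a different set of speakers, so the speaker embedding table from the checkpoint no longer fits.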

Multi-GPU (distributed)

  1. python -m torch.distributed.launch --use_env --nproc_per_node=NUM_GPUS_YOU_HAVE train.py -c config.json -p train_config.output_directory=outdir
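
With --use_env, torch.distributed.launch exposes the local rank through the LOCAL_RANK environment variable rather than a --local_rank command-line argument, alongside RANK and WORLD_SIZE. The sketch below shows one way a training script could read these values, falling back to single-process defaults; `read_dist_env` is a hypothetical helper, not a function from this repository.

```python
import os


def read_dist_env(environ=None):
    """Read the launcher-provided process layout (hypothetical helper)."""
    if environ is None:
        environ = os.environ
    return {
        # Global index of this process across all nodes.
        "rank": int(environ.get("RANK", 0)),
        # Index of this process on its node, usually the GPU to use.
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        # Total number of processes taking part in training.
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }
```

When the variables are absent, as in a plain `python train.py` run, the defaults describe a single process on device 0.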

Inference demo

  1. python inference.py -c CONFIG_PATH -r RADTTS_PATH -v HG_PATH -k HG_CONFIG_PATH -t TEXT_PATH -s ljs --speaker_attributes ljs --speaker_text ljs -o results/

Inference Voice Conversion demo

  1. python inference_voice_conversion.py --radtts_path RADTTS_PATH --radtts_config_path RADTTS_CONFIG_PATH --vocoder_path HG_PATH --vocoder_config_path HG_CONFIG_PATH --f0_mean=211.413 --f0_std=46.6595 --energy_mean=0.724884 --energy_std=0.0564605 --output_dir=results/ -p data_config.validation_files="{'Dummy': {'basedir': 'data/', 'audiodir':'22khz', 'filelist': 'vc_audiopath_txt_speaker_emotion_duration_filelist.txt'}}"
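
The --f0_mean and --f0_std flags describe the pitch statistics of the target voice. One plausible use, sketched below as an assumption rather than a description of what inference_voice_conversion.py does, is to standardize the voiced frames of the source F0 contour and rescale them to the target mean and standard deviation. The function name `shift_f0` and the convention that unvoiced frames are marked with 0 are both assumptions of this example.

```python
from statistics import mean, pstdev


def shift_f0(f0, target_mean, target_std):
    """Rescale voiced F0 values to the target statistics (illustrative sketch).

    Frames with a value of 0 are treated as unvoiced and left at 0.
    """
    voiced = [v for v in f0 if v > 0]
    if len(voiced) < 2:
        return list(f0)
    mu, sigma = mean(voiced), pstdev(voiced)
    if sigma == 0:
        return list(f0)
    return [
        (v - mu) / sigma * target_std + target_mean if v > 0 else 0.0
        for v in f0
    ]
```

The same idea would apply to the --energy_mean and --energy_std flags, with energy values in place of F0.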

Config Files

Filename | Description | Nota bene
config_ljs_decoder.json | Config for the decoder conditioned on F0 and Energy |
config_ljs_radtts.json | Config for the decoder not conditioned on F0 and Energy |
config_ljs_agap.json | Config for the Autoregressive Flow Attribute Predictors | Requires at least pre-trained alignment module
config_ljs_bgap.json | Config for the Bi-Partite Flow Attribute Predictors | Requires at least pre-trained alignment module
config_ljs_dap.json | Config for the Deterministic Attribute Predictors | Requires at least pre-trained alignment module

LICENSE

Unless otherwise specified, the source code within this repository is provided under the MIT License.

Acknowledgements

The code in this repository is heavily inspired by or makes use of source code from the following works:

Relevant Papers

Rohan Badlani, Adrian Łańcucki, Kevin J. Shih, Rafael Valle, Wei Ping, Bryan Catanzaro.
One TTS Alignment to Rule Them All. ICASSP 2022

Kevin J Shih, Rafael Valle, Rohan Badlani, Adrian Łańcucki, Wei Ping, Bryan Catanzaro.
RAD-TTS: Parallel flow-based TTS with robust alignment learning and diverse synthesis.
ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models 2021

Kevin J Shih, Rafael Valle, Rohan Badlani, João Felipe Santos, Bryan Catanzaro.
Generative Modeling for Low Dimensional Speech Attributes with Neural Spline Flows. Technical Report
