shaun95 / lightningfastspeech2

This project forked from minixc/lightningfastspeech2

License: MIT License

Shell 1.45% Python 98.55%

lightningfastspeech2's Introduction

LightningFastSpeech

WARNING: This is a work in progress and until version 0.1 (which will be out very soon), it might be hard to get running on your own machine. Thanks for your patience.

Large Pretrained TTS

In the NLP community, and more recently in speech recognition, large pre-trained models and how they can be used for downstream tasks have become an exciting area of research.

In TTS, however, little comparable work exists. With this project, I hope to take a first step toward bringing pretrained models to TTS. The original FastSpeech 2 model has 27M parameters and models a single speaker. Our version models more than 2,000 speakers; it would have almost 2B parameters without the improvements from LightSpeech, which bring its size down to a manageable 135M.
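To give a rough feel for where savings of this kind can come from, the sketch below compares the parameter count of a standard 1D convolution with that of a depthwise-separable one, the kind of depth-wise convolution credited to LightSpeech under Attribution. The function names, channel count, and kernel size are illustrative assumptions and are not taken from this repository's configuration.

```python
def conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a standard 1D convolution."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def separable_conv1d_params(c_in, c_out, kernel_size, bias=True):
    """Parameter count of a depthwise-separable 1D convolution.

    A depthwise convolution (one filter per input channel) is followed
    by a pointwise (kernel size 1) convolution that mixes channels.
    """
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
standard = conv1d_params(256, 256, 9)
separable = separable_conv1d_params(256, 256, 9)
print(f"standard: {standard}, separable: {separable}")
print(f"reduction factor: {standard / separable:.1f}x")
```

The saving grows with the kernel size, because the separable variant pays for the kernel only once per input channel rather than once per input-output channel pair.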

A big upside of this implementation is that it is based on PyTorch Lightning, which makes it easy to do multi-GPU training, load pre-trained models, and a lot more.

LightningFastSpeech couldn't exist without the amazing open source work of many others; for a full list, see Attribution.

Current Status

This library is a work in progress, and until v1.0, updates might break things occasionally.

Goals

v0.1

0.1 is right around the corner! For this version, the core functionality is already there, and what's missing are mostly quality of life improvements that we should get out of the way now.

v1.0

It will take a while to get to 1.0. The goal for this version is to allow everyone to easily fine-tune our models and to easily do controllable synthesis of utterances.

Allow models to be loaded from the Hugging Face Hub.
Streamlit interface for synthesising utterances and generating datasets.
Tract and tractjs integration to export models for on-device and web use.
Make it easy to add new datasets and to fine-tune models with them.
Add HiFi-GAN fine-tuning to the pipeline.
A range of pre-trained models with different domains and sizes (e.g. multi-lingual, noisy/clean).

Attribution

This would not be possible without the many amazing open source projects that already exist in the TTS space. Please cite their work when appropriate!

Chung-Ming Chien's FastSpeech 2 implementation, which was used as a reference implementation.
yistLin's public d-vector implementation, which is used for multi-speaker training.
Aidan Pine's fork of FastSpeech 2, which served as the basis for the implementation of the depth-wise convolutions used in LightSpeech.
Coqui AI's excellent TTS toolkit, which was used for the Stochastic Duration Predictor and inspired the loss weighting we do.
Jungil Kong's HiFi-GAN implementation, which is used for vocoding the mel spectrograms produced by our TTS system.
