A fast, local neural text to speech system.
echo 'Welcome to the world of speech synthesis!' | \
./larynx --model en-us-blizzard_lessac-medium.onnx --output_file welcome.wav
- U.S. English (16kHz, single speaker)
- German (16kHz, single speaker)
- Danish (22kHz, multispeaker)
- Norwegian (22kHz, single speaker)
- Nepali (16kHz, multispeaker)
- Vietnamese (16kHz, multispeaker)
Larynx is meant to sound as good as CoquiTTS while still running reasonably fast on a Raspberry Pi 4.
Voices are trained with VITS and exported to ONNX for inference with onnxruntime.
Download a release:
If you want to build from source, see the Makefile and C++ source. Last tested with onnxruntime 1.13.1.
- Download a voice and extract the .onnx and .onnx.json files
- Run the larynx binary with text on standard input, --model /path/to/your-voice.onnx, and --output_file output.wav
For example:
echo 'Welcome to the world of speech synthesis!' | \
./larynx --model blizzard_lessac-medium.onnx --output_file welcome.wav
For multi-speaker models, use --speaker <number> to change speakers (default: 0).
See larynx --help for more options.
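The shell pipeline above can also be driven from Python. This is a minimal sketch, not part of Larynx itself; it assembles the same command line (including the optional --speaker flag) and pipes the text over standard input:

```python
import subprocess

def larynx_command(model, output_file, speaker=None, larynx_bin="./larynx"):
    """Assemble the larynx argument list; --speaker is only needed
    for multi-speaker voices (default speaker is 0)."""
    cmd = [larynx_bin, "--model", model, "--output_file", output_file]
    if speaker is not None:
        cmd += ["--speaker", str(speaker)]
    return cmd

def synthesize(text, model, output_file, speaker=None):
    """Run the binary, feeding the text on stdin like the shell example."""
    subprocess.run(larynx_command(model, output_file, speaker),
                   input=text.encode("utf-8"), check=True)
```

Calling synthesize("Hello.", "your-voice.onnx", "out.wav", speaker=3) mirrors the echo pipeline with --speaker 3.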
See src/python for the training code.
Start by creating a virtual environment:
python3 -m venv .venv
source .venv/bin/activate
pip3 install --upgrade pip
pip3 install --upgrade wheel setuptools
pip3 install -r requirements.txt
Ensure you have espeak-ng installed (sudo apt-get install espeak-ng).
Next, preprocess your dataset:
python3 -m larynx_train.preprocess \
--language en-us \
--input-dir /path/to/ljspeech/ \
--output-dir /path/to/training_dir/ \
--dataset-format ljspeech \
--sample-rate 22050
Datasets must either be in the LJSpeech format or from Mimic Recording Studio (--dataset-format mycroft).
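For reference, an LJSpeech-format dataset pairs a metadata.csv of pipe-separated rows (utterance id first, transcript last, as in the public LJSpeech corpus) with a wavs/ directory of matching audio files. A minimal parsing sketch, to sanity-check your own metadata before preprocessing:

```python
import csv
import io

def read_ljspeech_metadata(fileobj):
    """Parse LJSpeech-style metadata: one utterance per line,
    pipe-separated; first field is the wav id, last field the text
    (LJSpeech itself has raw and normalized text columns)."""
    entries = {}
    for row in csv.reader(fileobj, delimiter="|", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        entries[row[0]] = row[-1]
    return entries
```

Each key should correspond to a file wavs/<id>.wav in the dataset directory.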
Finally, you can train:
python3 -m larynx_train \
--dataset-dir /path/to/training_dir/ \
--accelerator 'gpu' \
--devices 1 \
--batch-size 32 \
--validation-split 0.05 \
--num-test-examples 5 \
--max_epochs 10000 \
--precision 32
Training uses PyTorch Lightning. Run tensorboard --logdir /path/to/training_dir/lightning_logs to monitor progress.
See python3 -m larynx_train --help for many additional options.
It is highly recommended to train with the following Dockerfile:
FROM nvcr.io/nvidia/pytorch:22.03-py3
RUN pip3 install \
'pytorch-lightning'
ENV NUMBA_CACHE_DIR=.numba_cache
See the various infer_* and export_* scripts in src/python/larynx_train to test and export your voice from the checkpoint in lightning_logs.
The dataset.jsonl file in your training directory can be used with python3 -m larynx_train.infer for quick testing:
head -n5 /path/to/training_dir/dataset.jsonl | \
python3 -m larynx_train.infer \
--checkpoint lightning_logs/path/to/checkpoint.ckpt \
--sample-rate 22050 \
--output-dir wavs
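The head pipeline above just feeds the first few JSON-lines records to the inference script. An equivalent sketch in Python, making no assumption about the fields inside each record:

```python
import json

def head_jsonl(path, n=5):
    """Return the first n records of a JSON-lines file as dicts,
    skipping blank lines (like head -n5, but parsed)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            records.append(json.loads(line))
            if len(records) == n:
                break
    return records
```

This is handy for inspecting what preprocessing actually wrote to dataset.jsonl before running inference on it.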