tortoise-tts's Introduction

TorToiSe

Tortoise is a text-to-speech program built with the following priorities:

  1. Strong multi-voice capabilities.
  2. Highly realistic prosody and intonation.

This repo contains all the code needed to run Tortoise TTS in inference mode.

Manuscript: https://arxiv.org/abs/2305.07243

Hugging Face space

A live demo is hosted on Hugging Face Spaces. If you'd like to avoid a queue, please duplicate the Space and add a GPU. Please note that CPU-only spaces do not work for this demo.

https://huggingface.co/spaces/Manmay/tortoise-tts

Install via pip

pip install tortoise-tts

If you would like to install the latest development version, you can also install it directly from the git repository:

pip install git+https://github.com/neonbjb/tortoise-tts
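
Once installed, a quick smoke test (a minimal sketch; the first run downloads model weights, and with no reference clips Tortoise appears to fall back to a random voice):

import torchaudio
from tortoise.api import TextToSpeech

tts = TextToSpeech()
# synthesize a short phrase; the output is a 24 kHz audio tensor with a batch dimension
audio = tts.tts_with_preset("Tortoise is installed.", preset='ultra_fast')
torchaudio.save('smoke_test.wav', audio.squeeze(0).cpu(), 24000)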

What's in a name?

I'm naming my speech-related repos after Mojave desert flora and fauna. Tortoise is a bit tongue in cheek: this model is insanely slow. It leverages both an autoregressive decoder and a diffusion decoder, both of which are known for their low sampling rates. On a K80, expect to generate a medium-sized sentence every 2 minutes.

Well... not so slow anymore: we can now get a 0.25-0.3 real-time factor (RTF) on 4 GB of VRAM (an RTF of 0.25 means 10 seconds of audio takes about 2.5 seconds to generate), and with streaming we can get under 500 ms of latency!

Demos

See this page for a large list of example outputs.

A cool application of Tortoise + GPT-3 (not affiliated with this repository): https://twitter.com/lexman_ai. Unfortunately, this project no longer seems to be active.

Usage guide

Local installation

If you want to use this on your own computer, you must have an NVIDIA GPU.

On Windows, I highly recommend using the Conda installation method. I have been told that if you do not do this, you will spend a lot of time chasing dependency problems.

First, install miniconda: https://docs.conda.io/en/latest/miniconda.html

Then run the following commands, using the Anaconda Prompt as the terminal (or any other terminal configured to work with conda).

This will:

  1. create conda environment with minimal dependencies specified
  2. activate the environment
  3. install pytorch with the command provided here: https://pytorch.org/get-started/locally/
  4. clone tortoise-tts
  5. change the current directory to tortoise-tts
  6. run the tortoise-tts setup script
conda create --name tortoise python=3.9 numba inflect
conda activate tortoise
conda install pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia
conda install transformers=4.29.2
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
python setup.py install

Optionally, pytorch can be installed in the base environment, so that other conda environments can use it too. To do this, simply run the conda install pytorch... line before activating the tortoise environment.

Note: When you want to use tortoise-tts, you will always have to ensure the tortoise conda environment is activated.

If you are on Windows, you may also need to install pysoundfile: conda install -c conda-forge pysoundfile
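
Before going further, it can be worth confirming that PyTorch actually sees your GPU (a quick sanity check):

import torch

# both lines should succeed on a working CUDA install
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())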

Docker

An easy way to hit the ground running, and a good jumping-off point depending on your use case.

git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts

docker build . -t tts

docker run --gpus all \
    -e TORTOISE_MODELS_DIR=/models \
    -v /mnt/user/data/tortoise_tts/models:/models \
    -v /mnt/user/data/tortoise_tts/results:/results \
    -v /mnt/user/data/.cache/huggingface:/root/.cache/huggingface \
    -v /root:/work \
    -it tts

This gives you an interactive terminal in an environment that's ready to do some TTS. Now you can explore the different interfaces that tortoise exposes for TTS.

For example:

cd app
conda activate tortoise
time python tortoise/do_tts.py \
    --output_path /results \
    --preset ultra_fast \
    --voice geralt \
    --text "Time flies like an arrow; fruit flies like a bananna."

Apple Silicon

On macOS 13+ with M1/M2 chips, you need to install the nightly version of PyTorch. As stated on the official page, you can do:

pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu

Be sure to do that after you activate the environment. If you don't use conda, the commands would look like this:

python3.10 -m venv .venv
source .venv/bin/activate
pip install numba inflect psutil
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cpu
pip install transformers
git clone https://github.com/neonbjb/tortoise-tts.git
cd tortoise-tts
pip install .

Be aware that DeepSpeed is disabled on Apple Silicon, since it does not work there; the flag --use_deepspeed is ignored. You may need to prepend PYTORCH_ENABLE_MPS_FALLBACK=1 to the commands below to make them work, since MPS does not support all the operations in PyTorch.
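
For example (mirroring the do_tts.py invocation below):

PYTORCH_ENABLE_MPS_FALLBACK=1 python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast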

do_tts.py

This script allows you to speak a single phrase with one or more voices.

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

read_fast.py (faster inference)

This script provides a faster way to read large amounts of text.

python tortoise/read_fast.py --textfile <your text to be read> --voice random

read.py

This script provides tools for reading large amounts of text.

python tortoise/read.py --textfile <your text to be read> --voice random

This will break up the textfile into sentences, and then convert them to speech one at a time. It will output a series of spoken clips as they are generated. Once all the clips are generated, it will combine them into a single file and output that as well.

Sometimes Tortoise screws up an output. You can re-generate any bad clips by re-running read.py with the --regenerate argument.
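
For example (a sketch; the exact format of --regenerate is an assumption here, it appears to take a comma-separated list of clip numbers):

python tortoise/read.py --textfile <your text to be read> --voice random --regenerate 3,7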

API

Tortoise can be used programmatically, like so:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')
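
As a fully self-contained script (a minimal sketch; clips_paths and the output filename are placeholders, and the import style is one reasonable choice):

import torchaudio
from tortoise import api
from tortoise.utils import audio

clips_paths = ["voice1.wav", "voice2.wav"]  # paths to your reference clips

# reference clips are loaded at 22.05 kHz, the rate the conditioning encoder expects
reference_clips = [audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech()
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

# the result is a 24 kHz tensor with a leading batch dimension
torchaudio.save("generated.wav", pcm_audio.squeeze(0).cpu(), 24000)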

To use DeepSpeed:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

To use kv cache:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(kv_cache=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

To run the model in float16:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

For faster runs, use all three:

reference_clips = [utils.audio.load_audio(p, 22050) for p in clips_paths]
tts = api.TextToSpeech(use_deepspeed=True, kv_cache=True, half=True)
pcm_audio = tts.tts_with_preset("your text here", voice_samples=reference_clips, preset='fast')

Acknowledgements

This project has garnered more praise than I expected. I am standing on the shoulders of giants, though, and I want to credit a few of the amazing folks in the community who have helped make this happen:

  • Hugging Face, who wrote the GPT model and the generate API used by Tortoise, and who hosts the model weights.
  • Ramesh et al., who authored the DALL-E paper, which is the inspiration behind Tortoise.
  • Nichol and Dhariwal, who authored the revision of the code that drives the diffusion model.
  • Jang et al., who developed and open-sourced UnivNet, the vocoder this repo uses.
  • Kim and Jung, who implemented the UnivNet PyTorch model.
  • lucidrains, who writes awesome open-source PyTorch models, many of which are used here.
  • Patrick von Platen, whose guides on setting up wav2vec were invaluable in building my dataset.

Notice

Tortoise was built entirely by the author (James Betker) using their own hardware. Their employer was not involved in any facet of Tortoise's development.

License

Tortoise TTS is licensed under the Apache 2.0 license.

If you use this repo or the ideas therein for your research, please cite it! A BibTeX entry can be found in the right pane on GitHub.

tortoise-tts's People

Contributors

damithsenanayake, e0xextazy, fakerybakery, jerry-master, jnordberg, kevinastock, kianmeng, maierru, manmay-nakhashi, marcusllewellyn, martinshkreli, mikeymezher, n8bot, narviii, neonbjb, netshade, nirantk, noureldin-osama, osanseviero, psimyn, qtangs, rgkirch, rsxdalv, serggonin, space-pope, spottenn, wavymulder, wgaylord, whatthesamuel, wonbin-jung

tortoise-tts's Issues

When I try to run it locally, some error seems to happen

I have the problem shown below. It seems to work, but I can't get the voice.
Could you please help me?
It has troubled me for several days.

Solved it! Try another GPU.

(venv) PS C:\plan\test3\tortoise-tts-main> python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast
Removing weight norm...
Generating autoregressive samples..
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:08<00:00, 1.34s/it]
Computing best candidates using CLVP and CVVP
0%| | 0/6 [00:00<?, ?it/s]
C:\plan\test3\venv\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 6.20it/s]
Transforming autoregressive outputs into audio..
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:03<00:00, 24.16it/s]

A possible approach to pronunciation customization

Hi, I'm going to re-raise the topic in #12, which is currently closed. I apologize, and I appreciate that this is in some sense bad form.

I also would like the ability to, occasionally, fine-control pronunciation, and I am of the belief that fundamentally it's not a machine-solvable problem, thanks to the literal nightmare that is last names. I know six people who have the same last name by codepoint, but none of them say it the same way, and there's nothing your software could ever do to cope with that, because it's unavailable contextual knowledge.

The problem is, if you want to do high-quality rendering, getting names right is a sign of respect, so this genuinely matters, and I believe it needs to be in some way droppable to user control.

And so I was going to go bug the ocotillo author. Hm. Guess that works out nicely.

I don't entirely understand where the English <-> Audio mapping comes from, but on a quick glance, it looks like it might be in jbetker/wav2vec2-large-robust-ft-libritts-voxpopuli.

And so I was wondering.

  1. How hard would it be to have two of these?
  2. If the underlying symbolic language was in some way deterministic with regard to end pronunciation - that is, it's somehow a least-worst case - how hard would it be to adapt the jbetker thing to a second syllabary?

The reason being, y'know, the International Phonetic Alphabet is in Unicode, and does a pretty reasonable job with most real world languages. And that would reduce the job to Googling someone's name once, putting it in a lookup table in IPA, and promptly forgetting about it for eternity.

Which, to me, sounds pretty good.

Or, if you prefer, ask for Siobhan and Pádraig Moloughney from Worcester, Massachusetts ("shavon and petrick molockney from wooster mass").

"Let's talk to [ipa:ʃəˈvɔːn] and [ipa:ˈpˠɑːɾˠɪɟː mʌːlɒkːniː] about it" is nicely unambiguous, and fits with the symbology in the other request.

Feature: add random voice

I saw that in silero they added random speaker voice to generate TTS, and it sounds quite good. Maybe you can add something similar?

RuntimeError: CUDA out of memory.

Hi,

I'm raising an issue for "RuntimeError: CUDA out of memory" because, since upgrading to the latest version, this is what happens when using the exact same commands which have executed successfully in the previous version 😥

This error happens with any combination such as
preset = standard / candidates = 3
preset = fast / candidates = 1

I've ensured I'm trying to run this under identical settings to when it did work (ie nothing else open, nothing hogging GPU RAM). I just can't get this to work any more since the upgrade, where it used to work on my RTX 3070 using the 'standard' setting in Windows 10.

Any assistance would be very much appreciated. The full error is below (this was with fast/1 candidate).

D:\tortoise\tortoise-tts\tortoise\utils\audio.py:14: WavFileWarning: Chunk (non-data) not understood, skipping it.
sampling_rate, data = read(full_path)
Generating autoregressive samples..
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:23<00:00, 3.88s/it]
Computing best candidates using CLVP and CVVP
0%| | 0/6 [00:00<?, ?it/s]d:\anaconda\envs\tort2\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
0%| | 0/6 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\tortoise\tortoise-tts\tortoise\do_tts.py", line 30, in
gen = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
File "D:\tortoise\tortoise-tts\tortoise\api.py", line 289, in tts_with_preset
return self.tts(text, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\api.py", line 393, in tts
clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\clvp.py", line 121, in forward
enc_speech = self.speech_transformer(speech_emb, mask=voice_mask)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\arch_util.py", line 364, in forward
h = self.transformer(x, **kwargs)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\xtransformers.py", line 1237, in forward
x, intermediates = self.attn_layers(x, mask=mask, mems=mems, return_hiddens=True, **kwargs)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\xtransformers.py", line 972, in forward
out, inter, k, v = checkpoint(block, x, None, mask, None, attn_mask, self.pia_pos_emb, rotary_pos_emb,
File "d:\anaconda\envs\tort2\lib\site-packages\torch\utils\checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\utils\checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\arch_util.py", line 341, in forward
return torch.utils.checkpoint.checkpoint(partial, x, *args)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\utils\checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\utils\checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "d:\anaconda\envs\tort2\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\tortoise\tortoise-tts\tortoise\models\xtransformers.py", line 709, in forward
post_softmax_attn = attn.clone()
RuntimeError: CUDA out of memory. Tried to allocate 184.00 MiB (GPU 0; 8.00 GiB total capacity; 5.32 GiB already allocated; 0 bytes free; 5.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Multi-GPU support

I am looking for parallel support for multiple GPUs. I see pieces of this in the autoregressive model; I am just wondering if there is a particular way to accomplish this with UnifiedVoice. Thanks.

EOFError: Ran out of input

Oh no,
I encountered this error this morning.

Traceback (most recent call last):
File "tortoise/do_tts.py", line 27, in
tts = TextToSpeech(models_dir=args.model_dir)
File "C:\plan\test3\tortoise-tts-main\tortoise\api.py", line 196, in init
self.autoregressive.load_state_dict(torch.load(f'{models_dir}/autoregressive.pth'))
File "C:\plan\test3\venv\lib\site-packages\torch\serialization.py", line 713, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "C:\plan\test3\venv\lib\site-packages\torch\serialization.py", line 920, in _legacy_load
magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

A Tinkerer's Controllable Intonation

Following #16

Ruling out conversations around lucky sampling, etc., let's assume this is indeed caused by the autoregressive aspects of this model.

It then follows one could fix the first set of generated tokens using tokens from another sample matching the intonation we are looking for.

This would act as a primer, hopefully pulling the model towards the target intonation.

I think this could be accomplished easily by another LogitsWarper modulating the token scores at each step, then releasing scoring back to normal after the initial conditioning tokens have been passed.
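
A minimal sketch of that idea against the Hugging Face LogitsProcessor interface (the class name and its arguments are invented for illustration; a LogitsWarper would work the same way):

import torch
from transformers import LogitsProcessor

class PrimerLogitsProcessor(LogitsProcessor):
    """Force the first few generated tokens to match a primer sequence,
    then release scoring back to normal (a tinkering sketch)."""
    def __init__(self, primer_tokens, prompt_len):
        self.primer = primer_tokens   # tokens taken from a sample with the target intonation
        self.prompt_len = prompt_len  # length of the conditioning prompt

    def __call__(self, input_ids, scores):
        step = input_ids.shape[-1] - self.prompt_len  # tokens generated so far
        if step < len(self.primer):
            # mask every token except the primer token for this step
            forced = torch.full_like(scores, float("-inf"))
            forced[:, self.primer[step]] = 0.0
            return forced
        return scores  # normal sampling afterwards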

Doing this for both the VQVAE and the Autoregressive Decoder would be correct.

The VQVAE seems to be directly responsible for phonemes in the output, i.e. you can replace tokens to replace words.
The Autoregressive Decoder seems more about the actual aspects of the speaker, etc.

token replacement.zip without CLVP/CVVP for faster inference.

conditioning_tokens = torch.load('happy.pt').cuda()
...
best_results[:, :10] = conditioning_tokens[:10]
self.autoregressive = self.autoregressive.cuda()
best_latents = self.autoregressive(conds, text, torch.tensor([text.shape[-1]], device=conds.device), best_results,
    torch.tensor([best_results.shape[-1]*self.autoregressive.mel_length_compression], device=conds.device),
    return_latent=True, clip_inputs=False)
self.autoregressive = self.autoregressive.cpu()

Finally, if we were to keep tokens and scores around, someone could build a UI around resampling with the top-k variations. You'd have to rerun everything, but this would allow you to slowly lock down sections of the model's outputs till you are happy.

Hopefully I have not made a mistake in my understanding or wording. Thoughts?
Also I understand this is not a real solution, but it is fun to tinker around.

Tone Classifier

Github : https://github.com/bfelbo/deepmoji
Demo : https://deepmoji.mit.edu/

image

The embedding from this would be useful to condition the model for intonation. In the meantime, this could be integrated cheaply alongside the CLVP step, so we'd have a similarity measure between text, speech, and tone.

Edit: Never mind, I got too excited about the cheap integration; the model only accepts text tokens as input.

I'm getting an error when running locally on Win 11

I'm getting an error when running locally on Win 11.

OS: Win 11 21H2 x64
CPU: 8700k
GPU: 3090
RAM: 32GB

Anaconda3 Powershell Prompt:

conda activate chunkmogrify
cd C:\Anaconda3\envs\chunkmogrify\Lib\site-packages\Tortoise-TTS
python do_tts.py --text "I'm going to speak this" --voice freeman --preset fast

TypeError:

Generating autoregressive samples..
0%| | 0/6 [00:04<?, ?it/s]
Traceback (most recent call last):
File "do_tts.py", line 32, in
gen = tts.tts_with_preset(args.text, conds, preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
File "C:\Anaconda3\envs\chunkmogrify\Lib\site-packages\Tortoise-TTS\api.py", line 225, in tts_with_preset
return self.tts(text, voice_samples, **kwargs)
File "C:\Anaconda3\envs\chunkmogrify\Lib\site-packages\Tortoise-TTS\api.py", line 301, in tts
codes = self.autoregressive.inference_speech(conds, text,
File "C:\Anaconda3\envs\chunkmogrify\Lib\site-packages\Tortoise-TTS\models\autoregressive.py", line 564, in inference_speech
gen = self.inference_model.generate(inputs, bos_token_id=self.start_mel_token, pad_token_id=self.stop_mel_token, eos_token_id=self.stop_mel_token,
File "C:\Anaconda3\envs\chunkmogrify\lib\site-packages\torch\autograd\grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "C:\Anaconda3\envs\chunkmogrify\lib\site-packages\transformers\generation_utils.py", line 931, in generate
return self.sample(
TypeError: sample() got multiple values for keyword argument 'logits_processor'

conda list:

_tflow_select 2.3.0 gpu
absl-py 0.15.0 pyhd3eb1b0_0
addict 2.4.0 pypi_0 pypi
aiohttp 3.8.1 py38h2bbff1b_1
aiosignal 1.2.0 pyhd3eb1b0_0
albumentations 1.1.0 pypi_0 pypi
altair 4.2.0 pypi_0 pypi
analytics-python 1.4.0 pypi_0 pypi
antlr4-python3-runtime 4.8 pypi_0 pypi
anyio 3.5.0 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
argon2-cffi 21.3.0 pyhd3eb1b0_0
argon2-cffi-bindings 21.2.0 py38h2bbff1b_0
asgiref 3.5.0 pypi_0 pypi
astor 0.8.1 py38haa95532_0
asttokens 2.0.5 pyhd3eb1b0_0
astunparse 1.6.3 py_0
async-timeout 4.0.1 pyhd3eb1b0_0
atomicwrites 1.4.0 pypi_0 pypi
attrs 21.4.0 pyhd3eb1b0_0
audioread 2.1.9 pypi_0 pypi
av 9.2.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
backoff 1.10.0 pypi_0 pypi
backports-zoneinfo 0.2.1 pypi_0 pypi
basicsr 1.3.5 pypi_0 pypi
bcrypt 3.2.0 pypi_0 pypi
beautifulsoup4 4.10.0 pyh06a4308_0
black 21.4b2 pypi_0 pypi
blas 1.0 mkl
bleach 4.1.0 pyhd3eb1b0_0
blinker 1.4 py38haa95532_0
bottleneck 1.3.4 py38h080aedc_0
braceexpand 0.1.7 pypi_0 pypi
brotli 1.0.9 ha925a31_2
brotlipy 0.7.0 py38h2bbff1b_1003
bs4 4.10.0 hd3eb1b0_0
bzip2 1.0.8 he774522_0 anaconda
ca-certificates 2022.3.29 haa95532_1
cachetools 4.2.2 pyhd3eb1b0_0
certifi 2021.10.8 py38haa95532_2
cffi 1.15.0 py38h2bbff1b_1
chardet 4.0.0 pypi_0 pypi
charset-normalizer 2.0.12 pypi_0 pypi
clean-fid 0.1.23 pypi_0 pypi
click 7.1.2 pypi_0 pypi
clip 1.0 pypi_0 pypi
cloudpickle 2.0.0 pypi_0 pypi
cmake 3.22.3 pypi_0 pypi
colorama 0.4.4 pyhd3eb1b0_0
commentjson 0.9.0 pypi_0 pypi
cpython 0.0.6 pypi_0 pypi
cryptography 3.4.8 py38h71e12ea_0
cudatoolkit 11.3.1 h59b6b97_2
cvlib 0.2.7 pypi_0 pypi
cycler 0.11.0 pyhd3eb1b0_0
dataclasses 0.8 pyh6d0b6a4_7
debugpy 1.5.1 py38hd77b12b_0
decorator 4.4.2 pypi_0 pypi
deepspeed 0.3.16 pypi_0 pypi
defusedxml 0.7.1 pyhd3eb1b0_0
detectron2 0.6 dev_0
dill 0.3.4 pyhd3eb1b0_0
dlib 19.23.0 pypi_0 pypi
dnnlib 0.0.1 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
easydict 1.9 pypi_0 pypi
einops 0.4.1 pypi_0 pypi
entmax 1.0 pypi_0 pypi
entrypoints 0.3 py38_0
executing 0.8.3 pyhd3eb1b0_0
face-alignment 1.3.5 pypi_0 pypi
facexlib 0.2.2 pypi_0 pypi
fastapi 0.75.0 pypi_0 pypi
ffmpeg 4.2.2 he774522_0
ffmpy 0.3.0 pypi_0 pypi
filelock 3.6.0 pypi_0 pypi
filetype 1.0.10 pypi_0 pypi
filterpy 1.4.5 pypi_0 pypi
fire 0.4.0 pypi_0 pypi
flask 1.1.4 pypi_0 pypi
flask_cors 3.0.10 pyhd3eb1b0_0
flaskwebgui 0.3.5 pypi_0 pypi
flatbuffers 2.0 pypi_0 pypi
fonttools 4.30.0 pypi_0 pypi
freetype 2.10.4 hd328e21_0
frozenlist 1.2.0 py38h2bbff1b_0
fsspec 2022.3.0 pypi_0 pypi
ftfy 6.1.1 pypi_0 pypi
future 0.18.2 pypi_0 pypi
fvcore 0.1.5.post20220305 pypi_0 pypi
gast 0.4.0 pyhd3eb1b0_0
gcloud 0.18.3 pypi_0 pypi
gdown 4.4.0 pypi_0 pypi
gfpgan 1.3.2 pypi_0 pypi
gitdb 4.0.9 pypi_0 pypi
gitpython 3.1.27 pypi_0 pypi
glfw 2.5.1 pypi_0 pypi
google-api-core 2.7.1 pypi_0 pypi
google-api-python-client 2.42.0 pypi_0 pypi
google-auth 1.35.0 pypi_0 pypi
google-auth-httplib2 0.1.0 pypi_0 pypi
google-auth-oauthlib 0.4.4 pyhd3eb1b0_0
google-colab 1.0.0 pyh44b312d_0 conda-forge
google-pasta 0.2.0 pyhd3eb1b0_0
googleapis-common-protos 1.56.0 pypi_0 pypi
gradio 2.8.13 pypi_0 pypi
grpcio 1.42.0 py38hc60d5dd_0
h11 0.13.0 pypi_0 pypi
h5py 2.10.0 py38h5e291fa_0
hdf5 1.10.4 h7ebc959_0
httplib2 0.20.4 pypi_0 pypi
huggingface-hub 0.4.0 pypi_0 pypi
hydra-core 1.1.1 pypi_0 pypi
icc_rt 2019.0.0 h0cc432a_1
icu 58.2 ha925a31_3
idna 2.10 pypi_0 pypi
imageio 2.16.1 pypi_0 pypi
imageio-ffmpeg 0.4.5 pypi_0 pypi
imgui 1.4.1 pypi_0 pypi
importlib-metadata 4.11.3 py38haa95532_0
importlib-resources 5.4.0 pypi_0 pypi
importlib_metadata 4.11.3 hd3eb1b0_0
imutils 0.5.4 pypi_0 pypi
inflect 5.3.0 py38haa95532_1
iniconfig 1.1.1 pypi_0 pypi
insightface 0.2.1 pypi_0 pypi
intel-openmp 2021.4.0 haa95532_3556
iopath 0.1.9 pypi_0 pypi
ipdb 0.13.9 pypi_0 pypi
ipykernel 6.9.1 py38haa95532_0
ipython 8.2.0 py38haa95532_0
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.7.0 pypi_0 pypi
itsdangerous 1.1.0 pypi_0 pypi
jedi 0.18.1 py38haa95532_1
jinja2 2.11.3 pypi_0 pypi
joblib 1.1.0 pypi_0 pypi
jpeg 9d h2bbff1b_0
jsonschema 3.2.0 pyhd3eb1b0_2
jupyter_client 7.1.2 pyhd3eb1b0_0
jupyter_core 4.9.2 py38haa95532_0
jupyterlab-widgets 1.1.0 pypi_0 pypi
jupyterlab_pygments 0.1.2 py_0
keras-applications 1.0.8 py_1
keras-preprocessing 1.1.2 pyhd3eb1b0_0
kiwisolver 1.3.2 py38hd77b12b_0
kornia 0.6.4 pypi_0 pypi
lama-cleaner 0.9.3 pypi_0 pypi
lark-parser 0.7.8 pypi_0 pypi
libpng 1.6.37 h2a8f88b_0
libprotobuf 3.19.1 h23ce68f_0
librosa 0.9.1 pypi_0 pypi
libtiff 4.2.0 hd0e1b90_0
libuv 1.40.0 he774522_0
libwebp 1.2.2 h2bbff1b_0
linkify-it-py 1.0.3 pypi_0 pypi
llvmlite 0.38.0 pypi_0 pypi
lmdb 1.3.0 pypi_0 pypi
loguru 0.6.0 pypi_0 pypi
lpips 0.1.4 pypi_0 pypi
lz4-c 1.9.3 h2bbff1b_1
markdown 3.3.4 py38haa95532_0
markdown-it-py 2.0.1 pypi_0 pypi
markupsafe 2.0.1 py38h2bbff1b_0
matplotlib 3.5.1 py38haa95532_1
matplotlib-base 3.5.1 py38hd77b12b_1
matplotlib-inline 0.1.2 pyhd3eb1b0_2
mdit-py-plugins 0.3.0 pypi_0 pypi
mdurl 0.1.0 pypi_0 pypi
mistune 0.8.4 py38he774522_1000
mkl 2021.4.0 haa95532_640
mkl-service 2.4.0 py38h2bbff1b_0
mkl_fft 1.3.1 py38h277e83a_0
mkl_random 1.2.2 py38hf11a4ad_0
monotonic 1.6 pypi_0 pypi
moviepy 1.0.3 pypi_0 pypi
mpi4py 3.1.3 pypi_0 pypi
multidict 5.1.0 py38h2bbff1b_2
munkres 1.1.4 py_0
mypy-extensions 0.4.3 pypi_0 pypi
nbclient 0.5.11 pyhd3eb1b0_0
nbconvert 6.1.0 py38haa95532_0
nbformat 5.1.3 pyhd3eb1b0_0
nest-asyncio 1.5.1 pyhd3eb1b0_0
networkx 2.7.1 pypi_0 pypi
ninja 1.10.2.3 pypi_0 pypi
notebook 6.4.8 py38haa95532_0
numba 0.55.1 pypi_0 pypi
numexpr 2.8.1 py38hb80d3ca_0
numpy 1.20.3 pypi_0 pypi
numpy-base 1.21.5 py38hc2deb75_0
oauth2client 4.1.3 pypi_0 pypi
oauthlib 3.2.0 pyhd3eb1b0_0
omegaconf 2.1.1 pypi_0 pypi
onnx 1.11.0 pypi_0 pypi
onnxruntime 1.11.0 pypi_0 pypi
onnxruntime-gpu 1.11.0 pypi_0 pypi
opencv-python 4.5.5.64 pypi_0 pypi
opencv-python-headless 4.5.5.64 pypi_0 pypi
openexr 1.3.7 pypi_0 pypi
openssl 1.1.1n h2bbff1b_0
opt_einsum 3.3.0 pyhd3eb1b0_1
orjson 3.6.7 pypi_0 pypi
packaging 21.3 pyhd3eb1b0_0
pandas 1.4.1 py38hd77b12b_1
pandocfilters 1.5.0 pyhd3eb1b0_0
paramiko 2.10.3 pypi_0 pypi
parso 0.8.3 pyhd3eb1b0_0
pathspec 0.9.0 pypi_0 pypi
pathtools 0.1.2 pypi_0 pypi
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.0.1 pypi_0 pypi
pip 21.2.2 py38haa95532_0
pluggy 1.0.0 pypi_0 pypi
pooch 1.6.0 pypi_0 pypi
portalocker 2.4.0 pypi_0 pypi
portpicker 1.3.1 py38_0
pprintpp 0.4.0 pypi_0 pypi
proglog 0.1.9 pypi_0 pypi
progressbar 2.5 pypi_0 pypi
prometheus_client 0.13.1 pyhd3eb1b0_0
promise 2.3 pypi_0 pypi
prompt-toolkit 3.0.28 pypi_0 pypi
protobuf 3.19.4 pypi_0 pypi
psutil 5.8.0 py38h2bbff1b_1
pudb 2022.1.1 pypi_0 pypi
pure_eval 0.2.2 pyhd3eb1b0_0
py 1.11.0 pypi_0 pypi
pyarrow 7.0.0 pypi_0 pypi
pyasn1 0.4.8 pyhd3eb1b0_0
pyasn1-modules 0.2.8 py_0
pybind11 2.9.1 py38hbd9d945_0 conda-forge
pybind11-global 2.9.1 py38hbd9d945_0 conda-forge
pycocotools 2.0.4 pypi_0 pypi
pycparser 2.21 pyhd3eb1b0_0
pycryptodome 3.14.1 pypi_0 pypi
pydantic 1.9.0 pypi_0 pypi
pydeck 0.7.1 pypi_0 pypi
pydeprecate 0.3.2 pypi_0 pypi
pydot 1.4.2 pypi_0 pypi
pydrive 1.3.1 pypi_0 pypi
pydub 0.25.1 pypi_0 pypi
pygments 2.11.2 pyhd3eb1b0_0
pyjwt 2.1.0 py38haa95532_0
pymcubes 0.1.2 pypi_0 pypi
pymongo 4.0.2 pypi_0 pypi
pympler 1.0.1 pypi_0 pypi
pynacl 1.5.0 pypi_0 pypi
pyopengl 3.1.1a1 py38haa95532_0
pyopenssl 21.0.0 pyhd3eb1b0_1
pyparsing 3.0.4 pyhd3eb1b0_0
pyqt 5.9.2 py38hd77b12b_6
pyqt5 5.15.6 pypi_0 pypi
pyqt5-qt5 5.15.2 pypi_0 pypi
pyqt5-sip 12.9.1 pypi_0 pypi
pyrallis 0.3.1 pypi_0 pypi
pyreadline 2.1 py38_1
pyrsistent 0.18.0 py38h196d8e1_0
pysimplegui 4.59.0 pypi_0 pypi
pysocks 1.7.1 py38haa95532_0
pyspng 0.1.0 pypi_0 pypi
pytest 7.1.1 pypi_0 pypi
python 3.8.12 h6244533_0
python-dateutil 2.8.2 pyhd3eb1b0_0
python-multipart 0.0.5 pypi_0 pypi
python_abi 3.8 2_cp38 conda-forge
pytorch 1.11.0 py3.8_cuda11.3_cudnn8_0 pytorch
pytorch-lightning 1.6.0 pypi_0 pypi
pytorch-mutex 1.0 cuda pytorch
pytz 2022.1 pypi_0 pypi
pytz-deprecation-shim 0.1.0.post0 pypi_0 pypi
pywavelets 1.3.0 pypi_0 pypi
pywin32 303 pypi_0 pypi
pywinpty 2.0.5 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
pyzmq 22.3.0 py38hd77b12b_2
qt 5.9.7 vc14h73c81de_0
qudida 0.0.4 pypi_0 pypi
realesrgan 0.2.4.0 pypi_0 pypi
regex 2022.3.2 pypi_0 pypi
requests 2.25.1 pypi_0 pypi
requests-oauthlib 1.3.0 py_0
resampy 0.2.2 pypi_0 pypi
rotary-embedding-torch 0.1.5 pypi_0 pypi
rsa 4.7.2 pyhd3eb1b0_1
sacremoses 0.0.49 pypi_0 pypi
samplerate 0.1.0 pypi_0 pypi
scikit-image 0.19.2 pypi_0 pypi
scikit-learn 1.0.2 pypi_0 pypi
scipy 1.8.0 pypi_0 pypi
seaborn 0.11.2 pyhd3eb1b0_0
semver 2.13.0 pypi_0 pypi
send2trash 1.8.0 pyhd3eb1b0_1
sentry-sdk 1.5.7 pypi_0 pypi
setproctitle 1.2.2 pypi_0 pypi
setuptools 58.0.4 py38haa95532_0
shortuuid 1.0.8 pypi_0 pypi
sip 4.19.13 py38hd77b12b_0
six 1.16.0 pyhd3eb1b0_1
smmap 5.0.0 pypi_0 pypi
sniffio 1.2.0 pypi_0 pypi
soundfile 0.10.3.post1 pypi_0 pypi
soundstretch 1.2 pypi_0 pypi
soupsieve 2.3.1 pyhd3eb1b0_0
sqlite 3.38.2 h2bbff1b_0
stack_data 0.2.0 pyhd3eb1b0_0
starlette 0.17.1 pypi_0 pypi
streamlit 1.8.1 pypi_0 pypi
submitit 1.4.1 pypi_0 pypi
tabulate 0.8.9 pypi_0 pypi
taming-transformers 0.0.1 pypi_0 pypi
tb-nightly 2.9.0a20220402 pypi_0 pypi
tensorboard 2.4.0 pyhc547734_0
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.6.0 py_0
tensorboardx 1.8 pypi_0 pypi
tensorflow 2.3.0 mkl_py38h8557ec7_0
tensorflow-base 2.3.0 eigen_py38h75a453f_0
tensorflow-estimator 2.6.0 pyh7b7c402_0
tensorflow-gpu 2.3.0 he13fc11_0
termcolor 1.1.0 pypi_0 pypi
terminado 0.13.3 pypi_0 pypi
test-tube 0.7.5 pypi_0 pypi
testpath 0.5.0 pyhd3eb1b0_0
threadpoolctl 3.1.0 pypi_0 pypi
tifffile 2022.2.9 pypi_0 pypi
timm 0.5.4 pypi_0 pypi
tk 8.6.10 he774522_0 anaconda
tokenizers 0.10.3 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
tomli 2.0.1 pypi_0 pypi
toolz 0.11.2 pypi_0 pypi
torch 1.11.0 pypi_0 pypi
torch-fidelity 0.3.0 pypi_0 pypi
torchaudio 0.11.0 py38_cu113 pytorch
torchmetrics 0.7.3 pypi_0 pypi
torchtyping 0.1.4 pypi_0 pypi
torchvision 0.12.0 pypi_0 pypi
tornado 6.1 py38h2bbff1b_0
tqdm 4.64.0 pypi_0 pypi
traitlets 5.1.1 pyhd3eb1b0_0
transformers 4.3.1 pypi_0 pypi
trimesh 3.10.5 pypi_0 pypi
typeguard 2.13.3 pypi_0 pypi
typing-extensions 4.1.1 hd3eb1b0_0
typing-inspect 0.7.1 pypi_0 pypi
typing_extensions 4.1.1 pyh06a4308_0
tzdata 2022.1 pypi_0 pypi
tzlocal 4.2 pypi_0 pypi
uc-micro-py 1.0.1 pypi_0 pypi
unidecode 1.2.0 pyhd3eb1b0_0
unzip 1.0.0 pypi_0 pypi
uritemplate 4.1.1 pypi_0 pypi
urllib3 1.26.8 pyhd3eb1b0_0
urwid 2.1.2 pypi_0 pypi
urwid-readline 0.13 pypi_0 pypi
uvicorn 0.17.6 pypi_0 pypi
validators 0.18.2 pypi_0 pypi
vc 14.2 h21ff451_1
vs2015_runtime 14.27.29016 h5e58377_2
wandb 0.12.11 pypi_0 pypi
watchdog 2.1.7 pypi_0 pypi
wcwidth 0.2.5 pyhd3eb1b0_0
webdataset 0.2.5 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
werkzeug 1.0.1 pypi_0 pypi
wget 3.2 pypi_0 pypi
wheel 0.37.1 pyhd3eb1b0_0
whichcraft 0.6.1 pypi_0 pypi
widgetsnbextension 3.6.0 pypi_0 pypi
win32-setctime 1.1.0 pypi_0 pypi
win_inet_pton 1.1.0 py38haa95532_0
wincertstore 0.2 py38haa95532_2
winpty 0.4.3 4
wldhx-yadisk-direct 0.0.6 pypi_0 pypi
wrapt 1.13.3 py38h2bbff1b_2
xz 5.2.5 h62dcd97_0
yacs 0.1.8 pypi_0 pypi
yaml 0.2.5 he774522_0
yapf 0.32.0 pypi_0 pypi
yarl 1.6.3 py38h2bbff1b_0
yaspin 2.1.0 pypi_0 pypi
zipp 3.7.0 pyhd3eb1b0_0
zlib 1.2.11 hbd8134f_5
zstd 1.4.9 h19a0ad4_0

Keeping the model after usage and importing one's own pre-trained models

I was wondering if this was possible, and if you could clarify/instruct how to use one's own pre-trained models.
I noticed this as an update point in V2.1.
Basically, I want to keep the trained model of a custom voice I make, and was wondering if that is possible.

Customizing pronunciation

I was wondering if you had any suggestions for making TorToiSe pronounce words in a specific way, for example, for fictional names or domain-specific terms that it's not familiar with?

I've had pretty good luck with word replacement - programmatically replacing words with phonetic spellings - though with unusual fictional names it sometimes puts emphasis on the wrong syllable.
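
A minimal sketch of that word-replacement approach (the lexicon entries and function name here are invented examples):

import re

# hypothetical lexicon mapping tricky names to phonetic respellings
PRONUNCIATIONS = {
    "Siobhan": "Shivawn",
    "Pádraig": "Pawrig",
}

def respell(text):
    # swap whole words only, leaving the rest of the sentence untouched
    for word, phonetic in PRONUNCIATIONS.items():
        text = re.sub(rf"\b{re.escape(word)}\b", phonetic, text)
    return text

print(respell("Siobhan and Pádraig are here."))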

Thanks for any suggestions - and fantastic work on this project!

Long text generation

Hello, thank you for your excellent work, the results are impressive, I admire you.

When generating long text using the read.py script, the resulting pieces of audio have different characteristics, i.e. the heterogeneity of pronunciation is noticeable.

Is it possible to somehow achieve one long, homogeneous audio rendering of the entire text?

Setting the k value above 1 results in a RuntimeError

First off, amazing project! Well done :)

When setting the k value above one, I get the following runtime error. I tried with values of 2 and 3:

Traceback (most recent call last):
  File "C:\KI\Projects\tortoise-tts\api.py", line 228, in tts_with_preset
    return self.tts(text, voice_samples, **kwargs)
  File "C:\KI\Projects\tortoise-tts\api.py", line 343, in tts
    best_latents = self.autoregressive(conds, text, torch.tensor([text.shape[-1]], device=conds.device), best_results,
  File "C:\Users\andre\anaconda3\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\KI\Projects\tortoise-tts\models\autoregressive.py", line 446, in forward
    text_logits, mel_logits = self.get_logits(conds, text_emb, self.text_head, mel_emb, self.mel_head, get_attns=return_attentions, return_latent=return_latent)
  File "C:\KI\Projects\tortoise-tts\models\autoregressive.py", line 368, in get_logits
    emb = torch.cat([speech_conditioning_inputs, first_inputs, second_inputs], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 2 for tensor number 2 in the list.

Sample files and colab notebook

Hi James,

thanks for this awesome project! I've tried it out and some of the speakers I tried are very convincing.

In the readme you teased some hand picked samples: https://github.com/neonbjb/tortoise-tts#hand-picked-tts-samples
But I've noticed that you didn't upload them yet. I'd love to see what your model can do at its best!

The requirements file seems to be incomplete. I've seen you removed the x-transformers dependency, but in the colab it was needed for importing from the api.

ModuleNotFoundError

After following the installation steps, and everything went successfully, I tried to test it and got this error:

ModuleNotFoundError: No module named 'tortoise.models'

What am I missing?

Training Script

Hey!
Do you have any plans for sharing the training scripts?
PS: I wanted to check the performance on different languages.

VQ-VAE network not released

Hello! After taking a closer look into the repo, it seems that you have not released the VQ-VAE model used to generate the discrete speech representations that the autoregressive model was trained on. Is this on purpose? Do you not plan to do that in the future?

Feature request: Audio reference.

Hello, first of all, thank you for developing this tool, which is as amazing as it is entertaining.

I would like to ask if there is a possibility of implementing the use of another voice audio as a reference, to match the pace of the original speaker or even the pitch, as Controllable TalkNet does. https://github.com/justinjohn0306/ControllableTalkNet

I apologise if this is something impossible to even ask for.
Thanks!

GPU Out of Memory

I'm a user with a 4 GB GPU. Can this be resolved using 'torch.cuda.empty_cache()'? If not, can I get a solution?

Error Message

python tortoise/do_tts.py --text "I'm going to speak this" --voice lj --preset fast

Generating autoregressive samples..
100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:11<00:00, 1.96s/it]
Computing best candidates using CLVP and CVVP
0%| | 0/6 [00:00<?, ?it/s]C:\Anaconda3\envs\laver\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
0%| | 0/6 [00:01<?, ?it/s]
Traceback (most recent call last):
File "tortoise/do_tts.py", line 29, in
gen = tts.tts_with_preset(args.text, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
File "C:\Users\power\Desktop\Project\Dev\tortoise-tts\tortoise\api.py", line 288, in tts_with_preset
return self.tts(text, **kwargs)
File "C:\Users\power\Desktop\Project\Dev\tortoise-tts\tortoise\api.py", line 392, in tts
clvp = self.clvp(text_tokens.repeat(batch.shape[0], 1), batch, return_loss=False)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\utils\checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\utils\checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Anaconda3\envs\laver\lib\site-packages\tortoise-2.2.0-py3.8.egg\tortoise\models\arch_util.py", line 341, in forward
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\utils\checkpoint.py", line 235, in checkpoint
return CheckpointFunction.apply(function, preserve, *args)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\utils\checkpoint.py", line 96, in forward
outputs = run_function(*args)
File "C:\Anaconda3\envs\laver\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Anaconda3\envs\laver\lib\site-packages\tortoise-2.2.0-py3.8.egg\tortoise\models\xtransformers.py", line 709, in forward
RuntimeError: CUDA out of memory. Tried to allocate 124.00 MiB (GPU 0; 4.00 GiB total capacity; 2.81 GiB already allocated; 0 bytes free; 2.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
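
For what it's worth, the options described in the API section of the README above target exactly this; on low-VRAM cards it may be worth trying kv caching and float16 (a sketch):

from tortoise.api import TextToSpeech

# kv_cache and half reduce memory use during inference (see the API section above)
tts = TextToSpeech(kv_cache=True, half=True)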

Add web demo to Hugging Face

Hi, would you be interested in adding a tortoise-tts web demo to Hugging Face using Gradio? I see there are already models set up on Hugging Face for this repo: https://huggingface.co/jbetker

here is a guide for adding spaces to your org or username

How to add a Space: https://huggingface.co/blog/gradio-spaces

Example spaces with repos:
github: https://github.com/salesforce/BLIP
Spaces: https://huggingface.co/spaces/salesforce/BLIP

github: https://github.com/facebookresearch/omnivore
Spaces: https://huggingface.co/spaces/akhaliq/omnivore

A Gradio demo can be set up in 2 lines of code using the inference API (if enabled) integration through Hugging Face:

import gradio as gr
gr.Interface.load("huggingface/jbetker/tortoise-tts-v2").launch()

would launch the demo

Please let us know if you would be interested and if you have any questions.

Unnecessary word insertions

Thanks for publishing your work. Very nice results.

I noticed that the model may hallucinate and insert words, if the similar word was already used in the sentence. For example, the sentence "If this is your first night at Fight Club, you have to fight." is pronounced as "If this is your first night at Fight Club, you have to fight club." The model is reusing the previously generated expression "Fight Club" instead of pronouncing just one word "fight".
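
One knob that may be worth experimenting with here: the tts() signature visible elsewhere on this page includes a repetition_penalty argument, and tts_with_preset forwards extra keyword arguments to tts(). A sketch reusing the variables from the README's API example (whether it actually suppresses this failure mode is untested):

pcm_audio = tts.tts_with_preset(
    "If this is your first night at Fight Club, you have to fight.",
    voice_samples=reference_clips,
    preset='fast',
    repetition_penalty=2.5,  # hypothetical value; higher penalizes re-emitting used tokens
)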

Error - multiple values for keyword argument 'logits_processor'

Hello,

I'm trying to run the current branch's example use case from the guide and receive the following error:

python3 tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

Generating autoregressive samples..
0%| | 0/96 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/ghost/tortoise-tts/tortoise/do_tts.py", line 30, in <module>
gen = tts.tts_with_preset(args.text, k=args.candidates, voice_samples=voice_samples, conditioning_latents=conditioning_latents,
File "/home/ghost/tortoise-tts/tortoise/api.py", line 305, in tts_with_preset
return self.tts(text, **kwargs)
File "/home/ghost/tortoise-tts/tortoise/api.py", line 387, in tts
codes = self.autoregressive.inference_speech(auto_conditioning, text_tokens,
File "/usr/local/lib/python3.9/dist-packages/TorToiSe-2.3.0-py3.9.egg/tortoise/models/autoregressive.py", line 498, in inference_speech
File "/home/ghost/.local/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/ghost/.local/lib/python3.9/site-packages/transformers/generation_utils.py", line 1016, in generate
return self.sample(
TypeError: transformers.generation_utils.GenerationMixin.sample() got multiple values for keyword argument 'logits_processor'

Any advice? Thanks

An error when trying to run the thing

Hello everyone,
For some reason, when I run the do_tts Python script, I am getting this error. I input my text and select the voice to use, but I still get this. I also have an NVIDIA GPU, which I have used to train Tacotron models, so I really don't know what is happening here:
Traceback (most recent call last):
File "C:\Users\thema\Downloads\tortoise-tts-main\do_tts.py", line 22, in <module>
tts = TextToSpeech()
File "C:\Users\thema\Downloads\tortoise-tts-main\api.py", line 201, in __init__
self.vocoder.load_state_dict(torch.load('.models/vocoder.pth')['model_g'])
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 712, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1046, in _load
result = unpickler.load()
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1016, in persistent_load
load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 1001, in load_tensor
wrap_storage=restore_location(storage, location),
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 176, in default_restore_location
result = fn(storage, location)
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 152, in _cuda_deserialize
device = validate_cuda_device(location)
File "C:\Users\thema\anaconda3\lib\site-packages\torch\serialization.py", line 136, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
If anyone could help me with this, I would really appreciate it!
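
The last line of the error message itself suggests the workaround for CPU-only setups (a minimal sketch; if you do have an NVIDIA GPU, the underlying fix is usually installing a CUDA-enabled PyTorch build instead):

import torch

# map tensors that were saved on a CUDA device onto the CPU at load time
state = torch.load('.models/vocoder.pth', map_location=torch.device('cpu'))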

Missing module when installing locally, torchaudio

Hi,

I just installed tortoise-tts, created a virtual env, ran pip install -e ., then tried the example from README.md but got an error.

(venv) kinow@ranma:~/Development/python/workspace/tortoise-tts$ python tortoise/read.py test.txt
Traceback (most recent call last):
  File "/home/kinow/Development/python/workspace/tortoise-tts/tortoise/read.py", line 5, in <module>
    import torchaudio
ModuleNotFoundError: No module named 'torchaudio'

After pip install torchaudio, it worked (downloading data now).

(venv) kinow@ranma:~/Development/python/workspace/tortoise-tts$ python tortoise/read.py --textfile test.txt
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2.06k/2.06k [00:00<00:00, 1.25MB/s]

Thanks!

Was the UnivNet vocoder fine-tuned?

Hello, first of all, great project; this is by far the best zero-shot TTS I've seen yet. I wonder whether the UnivNet vocoder you used was fine-tuned on your dataset, or did you simply take the one from https://github.com/mindslab-ai/univnet without further training?
I want to fine-tune the vocoder myself and wonder if I should use the generator weights from this project or from the mindslab repo.

Also is there any way to get your UnivNet discriminator weights?

Colab causes "device-side assert" errors

The following error is generated when I am trying to use my own text:


RuntimeError Traceback (most recent call last)
<ipython-input> in <module>()
7 conds.append(c)
8
----> 9 gen = tts.tts_with_preset(text, conds, preset)
10 torchaudio.save('generated.wav', gen.squeeze(0).cpu(), 24000)
11 IPython.display.Audio('generated.wav')

1 frames
/content/tortoise-tts/api.py in tts(self, text, voice_samples, k, verbose, num_autoregressive_samples, temperature, length_penalty, repetition_penalty, top_p, max_mel_tokens, typical_sampling, typical_mass, clvp_cvvp_slider, diffusion_iterations, cond_free, cond_free_k, diffusion_temperature, **hf_generate_kwargs)
278 Sample rate is 24kHz.
279 """
--> 280 text = torch.IntTensor(self.tokenizer.encode(text)).unsqueeze(0).cuda()
281 text = F.pad(text, (0, 1)) # This may not be necessary.
282

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Add tests

This repo is more popular than I thought it would be, and maintaining it is resulting in regressions in features I don't manually test when changes are made. It's probably about time to hunker down and write unit tests for this thing.

Error when running

after following the instructions EXACTLY as they were written, I get this error:

C:\Users\no\anaconda3\lib\site-packages\torchaudio\_internal\module_utils.py:99: UserWarning: Failed to import soundfile. 'soundfile' backend is not available.
warnings.warn("Failed to import soundfile. 'soundfile' backend is not available.")
Traceback (most recent call last):
File "C:\Users\no\anaconda3\lib\site-packages\soundfile-0.10.3.post1-py3.9.egg\soundfile.py", line 142, in <module>
raise OSError('sndfile library not found')
OSError: sndfile library not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "d:\voicecloning\tortoise-tts\tortoise\do_tts.py", line 6, in <module>
from api import TextToSpeech
File "d:\voicecloning\tortoise-tts\tortoise\api.py", line 21, in <module>
from tortoise.utils.audio import wav_to_univnet_mel, denormalize_tacotron_mel
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
File "<frozen zipimport>", line 259, in load_module
File "C:\Users\no\anaconda3\lib\site-packages\tortoise-2.1.3-py3.9.egg\tortoise\utils\audio.py", line 4, in <module>
File "C:\Users\no\anaconda3\lib\site-packages\librosa-0.9.1-py3.9.egg\librosa\__init__.py", line 209, in <module>
from . import core
File "C:\Users\no\anaconda3\lib\site-packages\librosa-0.9.1-py3.9.egg\librosa\core\__init__.py", line 6, in <module>
from .audio import * # pylint: disable=wildcard-import
File "C:\Users\no\anaconda3\lib\site-packages\librosa-0.9.1-py3.9.egg\librosa\core\audio.py", line 8, in <module>
import soundfile as sf
File "C:\Users\no\anaconda3\lib\site-packages\soundfile-0.10.3.post1-py3.9.egg\soundfile.py", line 162, in <module>
_snd = _ffi.dlopen(_os.path.join(
OSError: cannot load library 'C:\Users\no\anaconda3\lib\site-packages\soundfile-0.10.3.post1-py3.9.egg\_soundfile_data\libsndfile64bit.dll': error 0x7e

So, I figured "why don't I just install soundfile since it doesn't exist?"

(base) d:\voicecloning\tortoise-tts>pip install soundfile
Requirement already satisfied: soundfile in c:\users\no\anaconda3\lib\site-packages\soundfile-0.10.3.post1-py3.9.egg (0.10.3.post1)
Requirement already satisfied: cffi>=1.0 in c:\users\no\anaconda3\lib\site-packages (from soundfile) (1.14.6)
Requirement already satisfied: pycparser in c:\users\no\anaconda3\lib\site-packages (from cffi>=1.0->soundfile) (2.20)

Oh, so now it's going to tell me that it already exists, but then it says it doesn't when I run it.

So, what gives? Why doesn't this work?

Custom voices are not supported in colab

Hi, thank you for making and sharing this awesome work.

I've tried to add a custom voice right in the colab, but got this error.
However, everything worked fine when I ran it locally.

(screenshot of the error)

Failed to initialize NumPy / Numpy is not available

Greetings.

First of all, very stellar work on this. This looks to be very helpful to use in a personal project I'm working on.

With that, I'd like to get this set up on my local computer. I'm running a GTX 1060 6gb and I haven't had many issues with running CUDA-enabled software in the past. I'm using an environment created with Anaconda running Python 3.10.4 on Windows 10 to run this code.

At first, I had a few troubles getting torch to work with my GPU, but it was an easy fix that involved appending +cu113 to two of the torch packages.

However, now I've run into another problem that I can't quite seem to wrap my head around. When executing the "do_tts.py" command, I run into this:

(TortTTS2) PS D:\git\tortoise-tts> python do_tts.py --text "I'm going to speak this" --voice daniel --preset fast

D:\Conda\envs\TortTTS2\lib\site-packages\torch\_masked\__init__.py:223: UserWarning: Failed to initialize NumPy: module compiled against API version 0xf but this version of numpy is 0xe (Triggered internally at  ..\torch\csrc\utils\tensor_numpy.cpp:68.)
  example_input = torch.tensor([[-3, -2, -1], [0, 1, 2]])
Removing weight norm...
Generating autoregressive samples..
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:08<00:00,  1.35s/it]
Computing best candidates using CLVP and CVVP
  0%|                                                                                            | 0/6 [00:00<?, ?it/s]D:\Conda\envs\TortTTS2\lib\site-packages\torch\utils\checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
100%|████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:08<00:00,  1.42s/it]
Transforming autoregressive outputs into audio..
D:\git\tortoise-tts\utils\stft.py:119: FutureWarning: Pass size=1024 as keyword args. From version 0.10 passing these as positional arguments will result in an error
  fft_window = pad_center(fft_window, filter_length)
Traceback (most recent call last):
  File "D:\git\tortoise-tts\do_tts.py", line 32, in <module>
    gen = tts.tts_with_preset(args.text, conds, preset=args.preset, clvp_cvvp_slider=args.voice_diversity_intelligibility_slider)
  File "D:\git\tortoise-tts\api.py", line 225, in tts_with_preset
    return self.tts(text, voice_samples, **kwargs)
  File "D:\git\tortoise-tts\api.py", line 365, in tts
    mel = do_spectrogram_diffusion(self.diffusion, diffuser, latents, voice_samples, temperature=diffusion_temperature, verbose=verbose)
  File "D:\git\tortoise-tts\api.py", line 136, in do_spectrogram_diffusion
    cond_mel = wav_to_univnet_mel(sample.to(latents.device), do_normalization=False)
  File "D:\git\tortoise-tts\utils\audio.py", line 138, in wav_to_univnet_mel
    stft = TacotronSTFT(1024, 256, 1024, 100, 24000, 0, 12000)
  File "D:\git\tortoise-tts\utils\audio.py", line 101, in __init__
    self.stft_fn = STFT(filter_length, hop_length, win_length)
  File "D:\git\tortoise-tts\utils\stft.py", line 120, in __init__
    fft_window = torch.from_numpy(fft_window).float()
RuntimeError: Numpy is not available

For the top error, I've found that it can apparently be fixed by upgrading NumPy. Unfortunately, I can't do so, since Numba requires a version of NumPy that's lower than 1.22 and at least 1.18.

I get this when upgrading NumPy:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
numba 0.55.0 requires numpy<1.22,>=1.18, but you have numpy 1.22.3 which is incompatible.

I'll be honest, when going into this I wasn't even sure what python version I should be using. I have a thought that it might be something to do with my torch installation, but I'm not so sure. Any help towards fixing this will be much appreciated!
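
One way to satisfy both constraints in the meantime is to pin NumPy into the range numba declares (taken from the error message above):

pip install "numpy>=1.18,<1.22"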

Revive CVVP as optional

I removed CVVP from the model stack because I found it to be unnecessary and even harmful in all cases that I had encountered in a testing run.

Someone reached out to me and provided me a specific case where the removal of CVVP has broken rendering (with CVVP, a given voice works. Without it, Tortoise produces audio clips from two different speakers).

I should bring it back as an optional rendering switch.

Minimum GPU RAM for inference?

As the title suggests, I'd like to know the minimum GPU VRAM necessary for inference. I've got a 1060 6gb (mine) and a 2080 Ti (uni) and would love to explore the models.

Zero Shot Intonation

As we have all seen with the latest papers, how you prompt transformer models can greatly influence their outputs.
See the typical DALL-E/Disco Diffusion prompts or the PALM paper's section on Chain-of-Thought Prompting.

Prompt engineering is model-specific, affected by the training set.

As an example, prompts like the ones below do not invoke intonations aligned with the text; instead, the double quotes cause shifts away from the reader's voice, as people in the training set likely do, reading the quoted text as the character.

She said with a happy voice, "I start my new job today".
They said happily with a happy voice, "I start my new job today".

Matching is not necessarily expected here, so I did some further testing, generating samples prompted differently to see if your model can exhibit this behavior.

Here are my results.

output.zip

Here are my anecdotal findings.

Typical Sampling is a must if you care about expressiveness, though there is a noticeable quality drop.

PROMPT A : She said with a sad voice, "I start my new job today".
PROMPT B : It is so sad, I start my new job today.
PROMPT C1 : Sad, I start my new job today.
PROMPT C2 : Happy, I start my new job today.

Using A, there is a perceptual change in prosody in almost every sample over the "I start my new job today" section. This is expected, given likely aspects of the training set, where readers shift voice when reading character quotes.

Quotes should be avoided unless you are going for an "audio book reader effect".

Surprisingly though, A never actually produces a "sad"-sounding sentence. This could be for many reasons; I'll leave off speculating for now.

Both B and C usefully give nice intonation aligned with the prompt, with B winning out but requiring more setup.
C seems to be sufficient, and simple enough that you can use it automatically.
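
For anyone who wants to reproduce the comparison, here is a minimal sketch against the repo's Python API (the voice and preset are arbitrary choices):

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voices

tts = TextToSpeech()
voice_samples, conditioning_latents = load_voices(['train_empire'])

prompts = {
    'A':  'She said with a sad voice, "I start my new job today".',
    'B':  'It is so sad, I start my new job today.',
    'C1': 'Sad, I start my new job today.',
    'C2': 'Happy, I start my new job today.',
}
for label, text in prompts.items():
    gen = tts.tts_with_preset(text, voice_samples=voice_samples,
                              conditioning_latents=conditioning_latents,
                              preset='standard')
    torchaudio.save(f'prompt_{label}.wav', gen.squeeze(0).cpu(), 24000)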

Further thoughts, just writing things.

  • Does Typical Sampling help? Do we get more intonation out of the model when prompting? Yes, maybe, kinda.
  • I think you have a nice balance here: Typical Sampling to pull the sampling into something novel and specific, and then CLVP to pull it back.
  • Is CLVP restricting the generation of intonation?
  • Is top K really the best choice? Since we've already spent the time computing, we should dump the top 3 or 5 candidates rather than just one (see the sketch below). Humans are the best similarity measures we have.
  • Where should the intonation be injected, along with the start token, or maybe along with the diffusion model's inputs?
  • Where does the reliance on the vocoder to carry this information come in? Should we be passing hints to the vocoder?

This isn't my area, but I'm interested in tinkering around.
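
On the candidate-dumping point above, a sketch continuing from the earlier snippet; it assumes the tts call in your version of api.py accepts a candidate count k (recent versions expose one; if yours doesn't, a plain loop achieves the same):

# tts, voice_samples and conditioning_latents as in the earlier sketch.
candidates = tts.tts_with_preset('Sad, I start my new job today.',
                                 voice_samples=voice_samples,
                                 conditioning_latents=conditioning_latents,
                                 preset='standard', k=3)
# With k > 1 a list of clips comes back; save them all for human A/B'ing.
for i, clip in enumerate(candidates):
    torchaudio.save(f'candidate_{i}.wav', clip.squeeze(0).cpu(), 24000)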

Feature: Invoke Emotion With Two Audio Sources

Hello, thanks for the awesome project! This is very fun to mess around with.

One thing I've been having fun with is mixing and mashing voices together. I've noticed that many TTS models lack emotion due to the nature of how they work. It gave me the idea that, instead of mixing two voices together to create a new one, we could extract features from one and invoke a type of style transfer, if you will. I was thinking of a framework such as:

A. Source audio. Normal speaking.
B. One or two clips of somebody speaking in an angry, sad, happy tone, etc.
C. Source A references source B's utterance, but not explicitly the words spoken, just the tone of voice.

If it's possible to do so without training another model, I would definitely look into doing this in my free time if led in the right direction. Cheers!
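
Until someone wires this up properly, one crude, training-free approximation would be to blend conditioning latents from the two sources. A minimal sketch, assuming the get_conditioning_latents() helper in api.py, that the API accepts latents without raw samples, and that 'myvoice' and 'angry_clips' are hypothetical voice directories:

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()
samples_a, _ = load_voice('myvoice')      # A: the target speaker
samples_b, _ = load_voice('angry_clips')  # B: clips carrying the emotion
lat_a = tts.get_conditioning_latents(samples_a)
lat_b = tts.get_conditioning_latents(samples_b)

alpha = 0.3  # how much of B's tone to let leak into A's identity
mixed = tuple(a * (1 - alpha) + b * alpha for a, b in zip(lat_a, lat_b))

gen = tts.tts_with_preset('I start my new job today.',
                          conditioning_latents=mixed, preset='standard')
torchaudio.save('style_mix.wav', gen.squeeze(0).cpu(), 24000)

Note this blends speaker identity as well as tone, so it is not true style transfer, but it is a cheap way to probe whether the latents carry emotional information at all.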

Real time inference

Not an issue really. Just wanted to say the quality is impressive, much better than anything else I've heard.

Is there a roadmap/glimpse of hope of having inference speed at near real time?

The resulting voice doesn't sound much like me

Hi, so when I tried out the colab page, I found that it works very well for pre-set voices, which is honestly very impressive.
But because my voice is unique and different, Tortoise has a hard time grasping it. The result isn't that bad; it still has some features of my voice in it, but it certainly isn't my voice.

Is there a way to "fix" this issue? I've already tried using 10 WAV samples instead of 3 or 4, but that didn't change the voice. I've also used the "standard" quality preset, but that didn't make it sound much more like me.

I understand if there is no "fix", this project is amazing and I love it!
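
Not a fix, but the usual lever here is the quality of the conditioning clips rather than their count. Following the repo's guidance for adding voices (clean recordings of roughly 10 seconds each, without background music or reverb, saved as 22,050 Hz WAVs in their own folder), a minimal setup looks like this; the folder name is just an example:

mkdir tortoise/voices/myvoice
# copy 3-5 clean WAV clips, ~10 seconds each, 22,050 Hz, into that folder
python tortoise/do_tts.py --text "This is a test of my cloned voice." --voice myvoice --preset high_quality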

Repository Not Found for url: https://huggingface.co/jbetker/tacotron_symbols/resolve/main/vocab.json

I tried to run this as per the documentation

python tortoise/do_tts.py --text "I'm going to speak this" --voice random --preset fast

But I got this error instead:

transformers.utils.hub.RepositoryNotFoundError: 404 Client Error: Repository Not Found for url: https://huggingface.co/jbetker/tacotron_symbols/resolve/main/vocab.json

I noticed that the Hugging Face model seemed to be going through some changes about 15 minutes ago: https://huggingface.co/jbetker/tacotron-symbols

I hope this gets fixed soon as I would really like to try this locally. Thank you!

Combine voices

It's too small of an issue to create a pull request, I guess.

In your .ipynb file you have this cell:

# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Lets see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
voice_samples, conditioning_latents = load_voices(['pat', 'william'])

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.", 
                          voice_samples=None, conditioning_latents=None, preset=preset)
torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('captain_kirkard.wav')

I think the voice_samples=None, conditioning_latents=None bit is supposed to be voice_samples=voice_samples, conditioning_latents=conditioning_latents, because otherwise it won't work.
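
For reference, the fixed cell would read:

# You can also combine conditioning voices. Combining voices produces a new voice
# with traits from all the parents.
#
# Let's see what it would sound like if Picard and Kirk had a kid with a penchant for philosophy:
voice_samples, conditioning_latents = load_voices(['pat', 'william'])

gen = tts.tts_with_preset("They used to say that if man was meant to fly, he’d have wings. But he did fly. He discovered he had to.",
                          voice_samples=voice_samples, conditioning_latents=conditioning_latents, preset=preset)
torchaudio.save('captain_kirkard.wav', gen.squeeze(0).cpu(), 24000)
IPython.display.Audio('captain_kirkard.wav')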

Audio at ends of clips cut-off

When using read.py on a paragraph, the final word of each audio file is cut short by around 0.2-0.5 seconds (sometimes only about half of the last word is spoken). Even in the combined wav file you can still clearly hear the cuts. I couldn't find an easy workaround. Is this a known issue? I was using the train_empire voice and the latest build. If anyone has a fix or workaround, please let me know.

Dataset licence

Hello,
thank you for making this amazing TTS model public. It is by far the best-quality TTS model I have tried so far.

I would like to ask about the licensing of the dataset you used for training: am I guessing correctly that you used your own selection of LibriVox recordings?

I'm asking just to be sure that I can use the outputs in a commercial setting, since all LibriVox recordings are in the public domain.
