OnlySpeakTTS

This is a TTS server that uses a private fork of Tortoise to keep generation times and VRAM usage low. If you play the audio while it is still generating, you can get very close to real-time.

You will need to provide your own audio clips. Store them in 'Tortoise/tortoise/voices/{voice_name}'. You will want two or three 10-second clips; if the result doesn't come out perfect, play around with the parameters. If you don't mind mixing and matching, you can include a few other clips that provide accent and dynamic range.
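
If you want to sanity-check the clips before generating, a tiny Python sketch like this works (the voice name 'my_voice' is a placeholder; use whatever folder name you created):

    from pathlib import Path

    # Hypothetical voice name; replace with your own folder.
    voice_dir = Path("Tortoise/tortoise/voices/my_voice")

    # Two or three ~10 second clips; the file names are up to you.
    clips = sorted(voice_dir.glob("*.wav")) + sorted(voice_dir.glob("*.mp3"))
    print(f"Found {len(clips)} clip(s) in {voice_dir}:")
    for clip in clips:
        print(" -", clip.name)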

You may be able to get more emotion if you add an 'angry' or 'happy' clip to the mix, then generate a separate voice for each emotion. As long as you can tell which emotion to use, you can quickly swap between them.

System Requirements

Generations can use up to 5 GB of VRAM, and I average about 7-8 seconds per generation for full sentences on an RTX 3090, and 3-4 seconds for shorter sentences. The 'fast' preset is slightly faster; the quality isn't bad, it's just not as smooth.

I experience no slowdown in generation while running games like Minecraft. I did see generation times increase by roughly 100% while maxing out my graphics card with games like Generation Zero.

I have never tested generating speech while running inference on a text-generation model at the same time, but I assume it would slow down generation for both. It's perfectly okay to keep both models loaded in VRAM and go back and forth between them, though.

What Can This Do?

Assuming you use the server.py script, all inputs will automatically be separated into segments that are within Tortoise's max range.
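
As a rough sketch of the idea (the 200-character limit and the sentence-based splitting rule are illustrative assumptions, not the fork's exact logic):

    import re

    def split_into_segments(text, max_chars=200):
        """Illustrative only: split on sentence boundaries and pack
        sentences into chunks that stay under a maximum length."""
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        segments, current = [], ""
        for sentence in sentences:
            if current and len(current) + len(sentence) + 1 > max_chars:
                segments.append(current)
                current = sentence
            else:
                current = f"{current} {sentence}".strip()
        if current:
            segments.append(current)
        return segments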

You also have the option to save the generated voice tensors to files, and load from them later. This ensures that voices are consistent, instead of relying on randomly generated latents each time.

You can load voices from files in a second or less, so using multiple voices is perfectly viable.
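
Under the hood these saved voices are just PyTorch tensors. A minimal sketch of the round trip, assuming a single conditioning tensor saved as Auto_Conditioning.pt (that file name shows up in speech.py; the helper functions here are illustrative, not the fork's actual API):

    import os
    import torch

    def save_voice_tensor(latents, voice_path):
        # 'latents' is the conditioning tensor computed for a voice.
        os.makedirs(voice_path, exist_ok=True)
        torch.save(latents, os.path.join(voice_path, "Auto_Conditioning.pt"))

    def load_voice_tensor(voice_path):
        # Loading a pre-computed tensor is fast (well under a second).
        return torch.load(os.path.join(voice_path, "Auto_Conditioning.pt"))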

How do I use this?

Tortoise-tts has its own way it wants to be used, but I completely messed up the api.py script in this fork and didn't feel like fixing it.

For this fork, just use the server.py script and send HTTP POST requests to port 7332. You can check the client.py script to see how these POST requests should be formatted, what commands you can use, and how to use them (there is also a rough example after the list below).

You can:

  1. Generate a new voice
  2. Redo the previous generation if you got a bad one (because it's random)
  3. Save the current voice to files that can be loaded later
  4. Send a message to be spoken
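
A minimal sketch of what a request might look like (the JSON field names here are assumptions; check client.py for the exact command names and format):

    import requests

    # Hypothetical payload shape; see client.py for the real fields.
    payload = {"command": "speak", "text": "Hello there, this is a test sentence."}
    response = requests.post("http://127.0.0.1:7332/", json=payload)
    print(response.status_code, response.text)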

In addition to the requirements for Tortoise, server.py, client.py, and speech.py have a few of their own:

colorama, requests, soundfile, wave, pydub, threading, winsound, rich
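
wave, threading, and winsound ship with Python itself (winsound is Windows-only), so the remaining packages can be installed with pip:

    pip install colorama requests soundfile pydub rich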

Installation

You can create a venv, a conda env, or whatever you prefer. The original installation instructions for Tortoise still apply, but you may need to manually install setuptools before using the setup.py script:

pip install setuptools
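
For example, with a plain venv on Windows (the environment name is just a placeholder, and the last command is run from the Tortoise folder, per the original Tortoise instructions):

    python -m venv onlyspeak-env
    onlyspeak-env\Scripts\activate
    pip install setuptools
    python setup.py install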

Example Video

Another_Example.mp4

Cloning Voices

Better_Cloning.mp4

This clip of Lara Croft uses around 5 diffusion iterations, which give it the poor quality needed to match the original voice. Alternatively, I could have used 7-12 diffusion iterations, which would have produced a much clearer voice for almost no extra generation time.
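
In upstream tortoise-tts, that trade-off corresponds to the diffusion_iterations argument of the tts() call; a rough sketch (this fork's speech.py may expose the parameter differently, and the voice name is a placeholder):

    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_voice

    tts = TextToSpeech()
    voice_samples, conditioning_latents = load_voice("lara")  # placeholder voice folder

    # Fewer diffusion iterations = faster but rougher audio; 7-12 is already noticeably cleaner.
    audio = tts.tts("Testing the cloned voice.",
                    voice_samples=voice_samples,
                    conditioning_latents=conditioning_latents,
                    diffusion_iterations=10)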

Links

I uploaded a longer example to YouTube (the Lara Croft voice in this video is old):

https://www.youtube.com/watch?v=XV87AE22a6M

The Tortoise Repo:

https://github.com/neonbjb/tortoise-tts

Issues

Where to get the .pt files?

Where can I find the .pt files that should be in the same folder as the voices? It errors for me when trying to load the .pt files:

 * Serving Flask app 'server'
 * Debug mode: off
CHANGING VOICE...
Loaded 1.mp3
Loaded 2.mp3
Loaded 3.mp3
Loaded 4.mp3
Loaded 5.mp3
[2023-10-21 12:34:38,907] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\flask\app.py", line 1455, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\flask\app.py", line 869, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\flask\app.py", line 867, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\flask\app.py", line 852, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "C:\ai\OnlySpeakTTS\server.py", line 55, in result
    LOADED_VOICE = synthesizer.load_voice(os.path.join("voice_tensors", CURRENT_VOICE))
  File "C:\ai\OnlySpeakTTS\src\speech.py", line 151, in load_voice
    voice.append(torch.load(f"{voice_path}/Auto_Conditioning.pt"))
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\torch\serialization.py", line 986, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\torch\serialization.py", line 435, in _open_file_like
    return _open_file(name_or_buffer, mode)
  File "C:\Users\root\anaconda3\envs\OnlySpeakTTS\lib\site-packages\torch\serialization.py", line 416, in __init__
    super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'voice_tensors\\natlamir/Auto_Conditioning.pt'
