
LJ Speech Tools

Tools for creating voices for Webaverse.

What do I need to make a dataset?

30-60 minutes of clean audio (no background noise, music, or other speakers). It can be one file or many. The rest will be done for you.

Mono 22.05 kHz (22050 Hz) WAV is ideal, but the system will try to convert other formats.
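If you'd rather convert files yourself before running the pipeline, a minimal sketch of driving ffmpeg from Python (the function name is illustrative, and the actual conversion call is left commented out):

```python
import subprocess
from pathlib import Path

def to_mono_22k(src, dst):
    """Build an ffmpeg command that converts src to mono 22050 Hz WAV."""
    cmd = [
        "ffmpeg", "-y",          # overwrite output without asking
        "-i", src,
        "-ac", "1",              # downmix to a single (mono) channel
        "-ar", "22050",          # resample to 22050 Hz
        str(Path(dst).with_suffix(".wav")),
    ]
    # subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
    return cmd
```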

Installation

You will need Python 3 installed on your system.

To install:

sudo apt install ffmpeg
pip install -r requirements.txt

Pipeline (Easy Mode)

  1. Put your files in the put_audio_files_here folder

  2. Run the python script:

python pipeline.py

The resulting dataset.zip file is what you will use to train a voice.

Training

A Colab notebook has been provided here.

Upload your dataset to your Google Drive, then press "Run all cells" to run the notebook.

You will need a Google account, but Colab is free. However, upgrading to Colab Pro will make training a lot easier, as you have guaranteed access to faster hardware.

Individual Tools

pipeline.py is built from a collection of other scripts, which you can also run individually. Speaker separation is not included in the pipeline: it is assumed that you have clean, separated tracks, but you can run it as a preprocessing step if you need it.

Speaker Separation

In our tests, separation usually only worked in the case where two or three speakers with distinctly different voices were speaking.

To remove any WAV files that contain audio which isn't from the source speaker:

First, place some example audio from your speaker in the 'target' folder.

Then, place example audio from speakers who are not your speaker in the 'ignore' folder.

Then run separator.py with a --threshold value, probably somewhere between 0.6 and 0.9:

bash separate.sh
# or
python separator.py --threshold=0.65
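separator.py's internals aren't documented here, but speaker filtering of this kind typically embeds each clip and compares it to the target speaker's embedding; the threshold is then a minimum similarity score. A sketch of that decision (the function names and embedding inputs are illustrative, not the script's actual API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def keep_clip(clip_embedding, target_embedding, threshold=0.65):
    """Keep a clip only if it is similar enough to the target speaker."""
    return cosine_similarity(clip_embedding, target_embedding) >= threshold
```

A higher threshold keeps fewer, more-confident matches; a lower one keeps more audio but risks including other speakers.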

Audio Transcription

To transcribe audio files in the wavs folder:

bash transcribe.sh
# or
python transcriber.py

Transcription will create an LJSpeech-compatible metadata.csv.

Transcription removes swearing and replaces it with ****. If you want some of the swearing back, you can run python swearing.py. If you don't want swearing in your dataset, you should remove that data entirely, as the asterisks will negatively affect alignment.

Get length of audio dataset

python count_length.py

Will give you the total length, plus the longest and shortest file lengths, from the wavs folder.

Split long audio into shorter audio samples

python audiosplitter.py

Most training scripts prefer a variety of audio sample lengths from 2 to 12 seconds. The splitter will try to find silent points and break the audio into chunks of up to 12 seconds. You can modify the script to your liking; just change the 12 to a 10 or whatever you want.
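Conceptually, the splitter does something like the following (a pure-Python sketch over raw amplitude values in [-1, 1]; the names and thresholds are illustrative, not the script's actual implementation):

```python
def split_on_silence(samples, rate, max_len_s=12.0, silence_thresh=0.01):
    """Cut a list of samples into chunks at silent points, capping chunk length."""
    chunks = []
    start = 0
    last_silence = None
    max_len = int(max_len_s * rate)
    for i, s in enumerate(samples):
        if abs(s) < silence_thresh:
            last_silence = i  # remember the most recent quiet sample
        if i - start >= max_len:
            # cut at the last silence if we saw one, otherwise hard-cut here
            cut = last_silence if (last_silence is not None and last_silence > start) else i
            chunks.append(samples[start:cut])
            start = cut
            last_silence = None
    if start < len(samples):
        chunks.append(samples[start:])  # remaining tail
    return chunks
```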

Prepare dataset

python make_dataset.py

This will create a train_filelist.txt, val_filelist.txt and dataset.zip which can be uploaded to the SortAnon TalkNet training colab notebook here: https://github.com/bycloudai/TalkNET-colab You may need to reformat the name of the files if you use other notebooks.
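The exact train/validation split that make_dataset.py uses isn't documented here; a sketch of the usual approach, with an illustrative 5% validation fraction:

```python
import random

def make_filelists(metadata_lines, val_fraction=0.05, seed=1234):
    """Shuffle metadata rows and split them into (train, val) lists."""
    lines = list(metadata_lines)
    random.Random(seed).shuffle(lines)  # fixed seed for a reproducible split
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]
```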

Complete Pipeline

If you are trying to process a voice from a noisy track, you should first bring the audio files into audio editing software, mute any noise or speech from other speakers, and normalize and compress the remaining audio so it all sounds as similar as possible.

Assuming you wanted to record your own voice and didn't need to deal with speaker separation or isolation, here are the steps you would take to do that.

  1. Put all of your files in the wavs folder. If they are not WAVs, convert them using ffmpeg; the default we are targeting is mono 22050 Hz WAV.

  2. If your files are longer than 12 seconds and not hand-split, run the audiosplitter

python audiosplitter.py

Verify that the data is good, then delete the contents of the wavs folder and move the contents of data_outputs there.

  3. Transcribe your dataset

bash transcribe.sh
# or
python transcriber.py

This will create a metadata.csv in the standard LJSpeech format. For most training needs, the metadata.csv and the wavs folder are all you need as input.

Good Luck!

And thanks to all the hard-working Ponies who took the time to document this. The compendium of knowledge created by the Pony Preservation Project was instrumental in giving these tools shape and form.

ljspeechtools's People

Contributors

lalalune, propialis
