DeepTalk

PyTorch implementation of the DeepTalk model described in DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis by A. Chowdhury, A. Ross, and P. David in IEEE International Conference on Acoustics, Speech and Signal Processing 2021 (ICASSP-2021).

Research Article

Anurag Chowdhury, Arun Ross, and Prabu David, DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis, IEEE International Conference on Acoustics, Speech and Signal Processing (2021).

Description

DeepTalk is a deep-learning-based vocal style transfer model developed by A. Chowdhury, A. Ross, and P. David at Michigan State University. The model takes a reference audio sample from a target speaker and a sample text, and synthesizes speech audio that mimics the vocal identity of the target speaker uttering the sample text.

[Figure: DeepTalk model]

Downloading the DeepTalk code

  1. Clone the git repository:

git clone git@github.com:ChowdhuryAnurag/DeepTalk-Deployment.git

     You should now have a folder named 'DeepTalk-Deployment'.

  2. Go into the folder 'DeepTalk-Deployment':

cd DeepTalk-Deployment

  3. Please contact the maintainer of this repository at [email protected] for access to the pretrained DeepTalk models. Unzip 'trained_models.zip' (received separately from the maintainer) into this folder:

unzip trained_models.zip

     You should now have a folder named 'trained_models' containing several pretrained models.

  4. The Generic model, trained on the LibriSpeech and VoxCeleb 1 and 2 datasets, is primarily used as a starting point for fine-tuning with speech data from a target speaker. The other models (Hannah, Ted, and Gordon Smith) are sample fine-tuned models based on speech data from internal sources.
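
To sanity-check the unpacked archive, here is a minimal listing sketch (an optional helper, not part of the repository; it assumes the archive unpacked into trained_models/ as described above):

    from pathlib import Path

    # Print the pretrained model directories unpacked from trained_models.zip.
    for model_dir in sorted(Path("trained_models").iterdir()):
        if model_dir.is_dir():
            print(model_dir.name)  # expected to include Generic, Hannah, Ted, ...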

Setting up the python environment for running the DeepTalk code

  1. The model was implemented in PyTorch 1.3.1 and TensorFlow 1.14 using Python 3.6.8. It may be compatible with other versions of PyTorch, TensorFlow, and Python, but this has not been tested. (The GPU versions of PyTorch and TensorFlow are recommended for faster training and inference.)

    1.1) Install the Anaconda Python distribution from https://www.anaconda.com/products/individual

    1.2) Create an anaconda environment called 'deeptalk':

    conda create -n deeptalk python=3.6.8
    

    Type [y] when prompted to Proceed([y]/n)

    1.3) Activate the deeptalk python environment:

    conda activate deeptalk
    
  2. Additional requirements are listed in the ./requirements.txt file. Install them as follows (a quick version check follows this step):

    pip install -r requirements.txt
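
    Optionally, verify the installed versions before proceeding. A minimal check, assuming the requirements above pulled in PyTorch and TensorFlow:

    import sys
    import torch
    import tensorflow as tf

    print("Python     :", sys.version.split()[0])    # expected 3.6.8
    print("PyTorch    :", torch.__version__)         # expected 1.3.1
    print("TensorFlow :", tf.__version__)            # expected 1.14.x
    print("CUDA GPU   :", torch.cuda.is_available()) # True if a GPU build is usable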
    
  3. Now we need to install the Montreal-Forced-Aligner. For this project, this can be done in either of the following two ways:

    3.1) Download and install the Montreal-Forced-Aligner following the official installation instructions. We have included a copy of Montreal-Forced-Aligner (both for Linux and Mac OS) with this repository to serve as a template for the directory structure expected by the DeepTalk implementation. Please note that the librispeech-lexicon.txt file included in both the montreal_forced_aligner_mac and montreal_forced_aligner_linux directories is important for this project and should be retained in the final installation of Montreal-Forced-Aligner.

    3.2) Alternatively, you can run the install_MFA_linux.sh script (Linux machines only) to automatically download and install Montreal-Forced-Aligner. This script also fixes some of the most common installation issues associated with running Montreal-Forced-Aligner on Linux machines.

    ./install_MFA_linux.sh
    

    3.3) Now, run the following command to ensure Montreal-Forced-Aligner was installed correctly and is working fine.

    montreal_forced_aligner_linux/bin/mfa_align
    

    You should get the following output if everything is working fine:

    usage: mfa_align [-h] [-s SPEAKER_CHARACTERS] [-b BEAM] [-t TEMP_DIRECTORY]
                    [-j NUM_JOBS] [-v] [-n] [-c] [-d] [-e] [-i] [-q]
                    corpus_directory dictionary_path acoustic_model_path
                    output_directory
    mfa_align: error: the following arguments are required: corpus_directory, dictionary_path, acoustic_model_path, output_directory
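
    For reference, a full alignment run supplies the four required positional arguments shown in the usage message above. A minimal sketch via Python's subprocess; every path except the aligner binary and the bundled lexicon is an illustrative placeholder (the lexicon location follows the directory layout described in step 3.1):

    import subprocess

    subprocess.run([
        "montreal_forced_aligner_linux/bin/mfa_align",
        "path/to/corpus",                                         # corpus_directory
        "montreal_forced_aligner_linux/librispeech-lexicon.txt",  # dictionary_path
        "path/to/acoustic_model",                                 # acoustic_model_path
        "path/to/output",                                         # output_directory
    ], check=True)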
    

Running the DeepTalk GUI to generate synthetic audio using pre-trained models received from the code maintainer

Note: You should already be inside 'DeepTalk-Deployment' directory with 'deeptalk' conda environment activated.

  1. Execute the following two commands to run the GUI prototype:

export FLASK_APP=app.py
flask run

You should now be able to access the GUI prototype in your web browser at the following URL:

http://localhost:5000/
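
A quick way to confirm the server is responding, using only the Python standard library:

import urllib.request

# Expect HTTP status 200 if the GUI is being served at the default address.
print(urllib.request.urlopen("http://localhost:5000/").status)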

Finetuning the DeepTalk model for a target speaker

  1. The DeepTalk model can be fine-tuned to mimic the voice of a target speaker of your choice. For this process, you will need to place high-quality audio WAV files containing speech from the target speaker in the Data/SampleAudio directory, as follows:

Data/SampleAudio/<speaker_name>/<fileid_subjectname_audiotitle.wav>

Example:

Data/SampleAudio/Speaker1/1_Speaker1_BroadcastIndustry.wav

We have included a few sample audio files (through trained_models.zip) following the directory format specified above, to serve as a reference. These sample audio files can be listed using the following command (a helper sketch for arranging your own recordings into this layout follows this list):

ls Data/SampleAudio/Speaker1/

  2. Run preprocess_audio.py <input_directory> <output_directory> to preprocess the audio from the previous step and make it compatible with fine-tuning the DeepTalk model:

python preprocess_audio.py Data/SampleAudio Data/ProcessedAudio

The processed audio will be saved at Data/LibriSpeech/train-other-custom/<speaker_name>

  3. Run train_DeepTalk_step1.py <preprocessed_audio_directory> to use the preprocessed audio to fine-tune the Synthesizer of the DeepTalk model:

python train_DeepTalk_step1.py Data/LibriSpeech/train-other-custom/Speaker1

  4. Run train_DeepTalk_step2.py <preprocessed_audio_directory> to use the preprocessed audio to fine-tune the Vocoder of the DeepTalk model:

python train_DeepTalk_step2.py Data/LibriSpeech/train-other-custom/Speaker1

  5. A fine-tuned model directory bearing the <speaker_name> should now appear in the trained_models directory.
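
As referenced in step 1, here is a small helper sketch for arranging raw recordings into the expected Data/SampleAudio layout. This script is not part of the repository, and the source directory name is hypothetical; adapt both to your data:

import shutil
from pathlib import Path

def organize(raw_dir, speaker, dest_root="Data/SampleAudio"):
    # Copy raw WAV files into Data/SampleAudio/<speaker>/ using the
    # <fileid_subjectname_audiotitle.wav> naming convention from step 1,
    # e.g. interview.wav -> 1_Speaker1_interview.wav
    dest = Path(dest_root) / speaker
    dest.mkdir(parents=True, exist_ok=True)
    for i, wav in enumerate(sorted(Path(raw_dir).glob("*.wav")), start=1):
        shutil.copy(wav, dest / "{}_{}_{}.wav".format(i, speaker, wav.stem))

organize("my_raw_recordings", "Speaker1")  # hypothetical source directory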

Acknowledgement

Portions of this implementation are based on this repository.

Citation

If you use this repository, please cite:

@InProceedings{chowdhDeepTalk21,
  author       = "Chowdhury, A. and Ross, A. and David, P.",
  title        = "DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis",
  booktitle    = "ICASSP",
  year         = "2021",
}
