simple-autovc

A simple, performant re-implementation of AutoVC trained on VCTK.

Motivation

The original author's repo has not released models which produce the same quality conversions as those presented in the demo. In this repo I aim to get as close as possible to the demo performance and to release the model publicly for anyone to use.

Description

I use the model definition provided by the original author but swap in the HiFi-GAN vocoder and its associated mel-spectrogram transform. Concretely, the sample rate is kept at 16kHz as in the original model, and the number of training steps is increased drastically from the 100k steps stated in the paper to 2.3 million steps. The speaker embedding network is also pretrained on a larger external dataset.

Otherwise, all the hyperparameters are the same as those from the paper, the original author's repo, or (where appropriate) GitHub issues on that repo. The 3 model components are as follows:

  • AutoVC: the encoder-decoder voice conversion model defined in this repo.

  • Speaker embedding network: the GRU embedder from RF5/simple-speaker-embedding, used to condition conversion on the source and target speakers.

  • HiFi-GAN: the vocoder that converts output mel-spectrograms back to waveforms.

Usage

To use the pretrained models, no dependencies aside from pytorch, librosa==0.9.2, scipy, and numpy are required. The models are exposed through torch hub, making loading exceedingly simple:
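Before loading anything, you can check which entrypoints the hub repo exposes; a small sketch using torch.hub.list (the 'autovc' and 'hifigan' names come from the loading code in the quickstart below):

import torch
# Downloads/caches the repo and returns its hubconf entrypoints,
# e.g. ['autovc', 'hifigan']
print(torch.hub.list('RF5/simple-autovc'))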

Quickstart

Step 1: load all the models

import torch 

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Load the pretrained autovc model:
autovc = torch.hub.load('RF5/simple-autovc', 'autovc').to(device)
autovc.eval()
# Load the pretrained hifigan model:
hifigan = torch.hub.load('RF5/simple-autovc', 'hifigan').to(device)
hifigan.eval()
# Load speaker embedding model:
sse = torch.hub.load('RF5/simple-speaker-embedding', 'gru_embedder').to(device)
sse.eval()

Step 2: do inference on some utterances of your choice

# Get mel spectrogram
mel = autovc.mspec_from_file('example/source_uttr.flac') 
# or autovc.mspec_from_numpy(numpy array, sampling rate) if you have a numpy array

# Get embedding for source speaker
sse_src_mel = sse.melspec_from_file('example/source_uttr.flac')
with torch.no_grad(): 
    src_embedding = sse(sse_src_mel[None].to(device))
# Get embedding for target speaker
sse_trg_mel = sse.melspec_from_file('example/target_uttr.flac')
with torch.no_grad(): 
    trg_embedding = sse(sse_trg_mel[None].to(device))

# Do the actual voice conversion!
with torch.no_grad():
    spec_padded, len_pad = autovc.pad_mspec(mel)
    x_src = spec_padded.to(device)[None]
    s_src = src_embedding.to(device)
    s_trg = trg_embedding.to(device)
    x_identic, x_identic_psnt, _ = autovc(x_src, s_src, s_trg)
    if len_pad == 0:
        x_trg = x_identic_psnt[0, 0, :, :]
    else:
        x_trg = x_identic_psnt[0, 0, :-len_pad, :]

# x_trg is now the converted spectrogram!

Step 3: vocode the output spectrogram:

# Make a vocode function
@torch.no_grad()
def vocode(spec):
    # denormalize mel-spectrogram
    spec = autovc.denormalize_mel(spec)
    _m = spec.T[None] # hifigan expects a (batch, n_mels, frames) tensor
    waveform = hifigan(_m.to(device))[0]
    return waveform.squeeze()

converted_waveform = vocode(x_trg) # output waveform 
# Save the waveform as a flac file
import soundfile as sf
sf.write('converted_uttr.flac', converted_waveform.cpu().numpy(), 16000)

Doing this for the example utterances in the example/ folder yields the following:

  1. Source utterance:

     1.1 raw 48kHz: source_uttr.mp4

     1.2 vocoded 16kHz: in.mp4

  2. Reference style utterance:

     2.1 raw 48kHz: target_uttr.mp4

     2.2 vocoded 16kHz: ref.mp4

  3. Converted output utterance (vocoded 16kHz): converted_uttr.mp4

Note as well that the source or reference utterance may come from speakers unseen during training, or indeed be any audio file if you are feeling very brave.

Training

AutoVC

To train the model, simply set the root data directory in hp.py and run train.py with the arguments best suited to your use case. Note that train.py is currently set up to load data in a VCTK-style folder format (sketched below), so you may need to rework it for your dataset if you use a different one.
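For reference, a VCTK-style layout has one folder per speaker containing that speaker's utterances; the filenames below are illustrative:

wav48/
├── p225/
│   ├── p225_001.wav
│   └── p225_002.wav
└── p226/
    ├── p226_001.wav
    └── ...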

You can save time during training by pre-computing the mel-spectrograms from the waveforms using spec_utils.py, in which case just pass the precomputed mel-spectrogram folder to train.py as the appropriate argument.
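If you prefer to script the pre-computation yourself, the hub model's own transform can be reused; a minimal sketch, assuming a VCTK-style wav48/ tree and .npy output (spec_utils.py may use a different on-disk format, so check it first):

from pathlib import Path
import numpy as np
import torch

autovc = torch.hub.load('RF5/simple-autovc', 'autovc')
autovc.eval()

wav_root = Path('wav48')   # assumption: VCTK-style data root
mel_root = Path('mels')    # assumption: folder later passed to train.py

for wav_path in wav_root.rglob('*.wav'):
    # Same normalized mel transform used in the quickstart above
    mel = autovc.mspec_from_file(str(wav_path))
    out_path = mel_root / wav_path.relative_to(wav_root).with_suffix('.npy')
    out_path.parent.mkdir(parents=True, exist_ok=True)
    np.save(out_path, mel.cpu().numpy())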

Speaker embedding network

Please see the details in the original repo (RF5/simple-speaker-embedding) if you wish to train it further, but the released model is already strong and works well even on several unseen languages.

HiFi-GAN

Please see the instructions in the HiFi-GAN repo on how to fine-tune the vocoder. To do this, you would need to generate reconstructed AutoVC spectrogram outputs and pair them with the ground-truth waveforms. HiFi-GAN fine-tuning will then use teacher forcing to make the vocoder better adapt to AutoVC's output. Remember to set the sampling rate to 16kHz for this step, as the default for HiFi-GAN is 22.05kHz.
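A minimal sketch of generating those paired spectrograms, assuming the three quickstart models are already loaded and that "reconstruction" means conditioning AutoVC on the utterance's own speaker embedding; the ft_mels folder name and the (n_mels, frames) .npy layout follow HiFi-GAN's fine-tuning convention, but verify against its repo:

from pathlib import Path
import numpy as np
import torch

@torch.no_grad()
def reconstruct_mel(wav_path):
    # Normalized mel of the ground-truth waveform
    mel = autovc.mspec_from_file(wav_path)
    # Speaker embedding of the same utterance (source == target for reconstruction)
    emb = sse(sse.melspec_from_file(wav_path)[None].to(device))
    spec_padded, len_pad = autovc.pad_mspec(mel)
    _, x_psnt, _ = autovc(spec_padded.to(device)[None], emb, emb)
    out = x_psnt[0, 0] if len_pad == 0 else x_psnt[0, 0, :-len_pad]
    # Denormalize so the saved target matches what hifigan consumes
    return autovc.denormalize_mel(out)

out_dir = Path('ft_mels')  # assumption: input_mels_dir for HiFi-GAN fine-tuning
out_dir.mkdir(exist_ok=True)
for wav_path in Path('wav48').rglob('*.wav'):  # assumption: VCTK-style tree
    mel = reconstruct_mel(str(wav_path))
    # HiFi-GAN pairs each wav with a same-named .npy mel of shape (n_mels, frames)
    np.save(out_dir / (wav_path.stem + '.npy'), mel.cpu().numpy().T)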
