
triaan-vc's Introduction

TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion (recognized as a Top 3% paper at ICASSP 2023)

This is a PyTorch implementation of TriAAN-VC: Triple Adaptive Attention Normalization for any-to-any Voice Conversion. TriAAN-VC is a deep learning model for any-to-any voice conversion. Unlike previous methods, TriAAN-VC maintains the linguistic content of the source speech while representing the target speaker's characteristics. Experimental results on the VCTK dataset show that TriAAN-VC achieves state-of-the-art performance.

We recommend you visit our demo site.

The overall architecture of TriAAN-VC is as below:

[Figure: overall architecture of TriAAN-VC]

Installation & Environment

The OS, Python, and PyTorch versions we used are listed below (other versions may also work):

  • Windows
  • Linux
  • python == 3.8
  • pytorch == 1.9.1
  • torchaudio == 0.9.1

You can install the requirements through git and requirements.txt, except for PyTorch and torchaudio.

git clone https://github.com/winddori2002/TriAAN-VC.git
cd TriAAN-VC
pip install -r requirements.txt
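Since PyTorch and torchaudio are excluded from requirements.txt, install them separately to match your environment. The command below is one possible way to install the pinned versions; the wheel index and CUDA variant are assumptions, so adjust them for your setup.

pip install torch==1.9.1 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html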

Prepare for usage

1. Prepare dataset

We use the VCTK dataset consisting of 110 speakers with 400 utterances per speaker.

  • The dataset can be downloaded here.
  • We divide the dataset depending on seen-to-seen and unseen-to-unseen scenarios for evaluation.

2. Prepare pre-trained vocoder and feature extractor

We use a pre-trained ParallelWaveGAN as the vocoder and a CPC model as the feature extractor. You can use the pre-trained weights in this repository. The vocoder is trained on the VCTK dataset, and the CPC extractor is trained on the LibriSpeech dataset.

  • This repository provides the pre-trained ParallelWaveGAN from here and the CPC extractor from here.
  • Alternatively, you can train ParallelWaveGAN and CPC yourself.

3. Preprocess

The preprocessing stages cover dataset splitting, feature extraction, building data paths, and creating evaluation pairs.

These steps are for the VCTK dataset; if you want to use another dataset, you will need to modify the details.

  • To split the dataset and extract mel-spectrograms, lf0, and metadata, modify the directories in ./config/preprocess.yaml and run the following command.
python preprocess.py
  • To extract CPC features and record their paths in the metadata, run the following command (pre-trained CPC weights are required).
python preprocess_cpc.py
  • To generate evaluation pairs for conversion, run the following command.
python generate_eval_pair.py

The dataset split and evaluation pairs are only used for evaluation and analysis; they are not strictly necessary for training models.
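As a rough illustration of the feature settings only, the sketch below computes a mel-spectrogram with the values listed in the config/base.yaml dump reproduced in the issues further down this page (16 kHz audio, n_fft 400, hop length 160, window length 400, 80 mel bands, fmin 80, fmax 7600). preprocess.py is the authoritative implementation, and its exact log scaling and normalization may differ.

import torch
import torchaudio

# Mel settings assumed from config/base.yaml (sampling_rate, n_fft, n_shift, win_length, n_mels, fmin, fmax)
mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, win_length=400, hop_length=160,
    f_min=80, f_max=7600, n_mels=80)

wav, sr = torchaudio.load("sample.wav")  # hypothetical 16 kHz mono input
assert sr == 16000
mel = torch.log(mel_fn(wav) + 1e-9)      # log-mel; the repository's exact scaling may differ
print(mel.shape)                         # (1, 80, num_frames)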

How to use

1. Train

Training with settings

You can train TriAAN-VC by running the following command.

If you want to change the model settings, you can run python main.py train with additional arguments; an example is shown after the argument listing below.

In config/base.yaml, you can find the other arguments, such as batch size, epochs, and so on.

python main.py train
Model arguments:
  encoder:
    c_in:      256  (cpc: 256, mel: 80)
    c_h:       512
    c_out:     4
    num_layer: 6
  decoder:
    c_in:      4
    c_h:       512
    c_out:     80
    num_layer: 6

Train arguments:
  epoch:      500
  batch_size: 64
  siam:       True (if False, the siamese path is excluded)
  cpc:        True (if False, TriAAN-VC uses mel inputs)
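For example, a training run that overrides a few of these settings from the command line might look like the line below. The flag names are assumptions based on the keys in config/base.yaml; check the argument parser in main.py, or edit config/base.yaml directly if they are not exposed as flags.

python main.py train --epoch 300 --batch_size 32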

Training with logging

The logs are uploaded to neptune.ai.

python main.py train --logging True

Logging arguments:
  --logging    : True/False

2. Evaluation

After training, you can evaluate the model in terms of linguistic content (WER and CER) and target speaker characteristics (SV).

You need to keep the same model arguments as in the training phase. This command only supports the case where the number of target utterances is 1.

python main.py test
Evaluation arguments:
  --checkpoint: Checkpoint path

Alternatively, you can use the command below for testing with multiple target utterances.

python test.py --n_uttr 1 --eval True
Evaluation arguments:
  --eval:       Option for evaluation
  --n_uttr:     Number of target utterances
  --checkpoint: Checkpoint path
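For example, evaluating with three target utterances might look like the following; the checkpoint path is a placeholder and should point to your trained or downloaded weights.

python test.py --n_uttr 3 --eval True --checkpoint ./checkpoints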

3. Pretrained weights

The pre-trained weights of TriAAN-VC are uploaded to the GitHub release here.

We provide two model versions depending on the input type (mel or cpc).

4. Custom convert

For custom conversion, you can run convert.py. The code covers data processing, prediction, and vocoding.

python convert.py 
Conversion arguments:
  --src_name:   Sample source names
  --trg_name:   Sample target names
  --checkpoint: Checkpoint path
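For example, a conversion run might look like the following; the wav names are placeholders for files under ./samples, and the checkpoint path should point to your trained or downloaded weights.

python convert.py --src_name src.wav --trg_name trg.wav --checkpoint ./checkpoints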

You can find converted examples in ./samples, or visit our demo site.

Experimental Results

The experimental results below are obtained with the provided pre-trained weights, so they may differ slightly from the paper. "VCTK Split" indicates pre-trained weights trained with the dataset split used in the paper.

The results below are summarized for the VCTK Split weights. Each score is the average over the seen-to-seen and unseen-to-unseen scenarios (e.g., WER AVG = (WER_s2s + WER_u2u) / 2).

Model          Pre-trained Ver.  # uttr  WER AVG (%)  CER AVG (%)  SV AVG (%)
TriAAN-VC-Mel  VCTK Split        1       27.61        14.78        89.42
TriAAN-VC-Mel  VCTK Split        3       22.86        12.15        95.92
TriAAN-VC-CPC  VCTK Split        1       21.50        11.24        92.33
TriAAN-VC-CPC  VCTK Split        3       17.42        8.86         97.75

Certificate

Fortunately, our paper was recognized as a top 3% paper at ICASSP 2023.

Citation

@article{park2023triaan,
  title={TriAAN-VC: Triple Adaptive Attention Normalization for Any-to-Any Voice Conversion},
  author={Park, Hyun Joon and Yang, Seok Woo and Kim, Jin Sob and Shin, Wooseok and Han, Sung Won},
  journal={arXiv preprint arXiv:2303.09057},
  year={2023}
}

License

This repository is released under the MIT license. We adapted the CPC code and weights from facebookresearch/CPC_audio, released under the MIT license. We used vocoder code from kan-bayashi/ParallelWaveGAN, released under the MIT license. For the pre-trained vocoder, we used the weights from Wendison/VQMIVC, which is released under the MIT license. We also modified preprocessing code from Wendison/VQMIVC.

triaan-vc's People

Contributors

winddori2002

triaan-vc's Issues

about base.yaml

How should I change this part if I use another dataset instead of the VCTK dataset?
[screenshot of the relevant part of config/base.yaml]

Evaluation results

After training for 200 epochs with the Chinese datasets, I got the following results:

[screenshot of evaluation results]

Here, I only used the Chinese datasets for training and did not retrain CPC. Could this be the reason why the results are so different from the results in your paper?

But if I remember correctly, I saw in the CPC-audio paper that the authors mention the model can be transferred to other languages and perform well?

Training Logs

Hello!
Do you have any graphs or logs from your training?
I wanted to check if my loss curve is reasonable.
Thank you.

Model output size

Hello again,
Can you help me understand the shape of the output from the model?
It looks like the output is (feature_length, mel_bands). Is that correct?
It appears the CPC feature length is not the same as the original mel-spectrogram length. Does that affect the length of the converted audio?
Thank you for your help :)

Mixed datasets and calculating the threshold value

What might happen if I mix the VCTK dataset with datasets from other languages? For example, I used a dataset mixing VCTK and Chinese. In this case, how should I calculate the test threshold value in config/base.yaml?

Memory issue/bug

May I ask if you have any experience with strange memory behaviour? I tried to run inference on a V100 with 32 GB of memory. However, it seems the model tries to allocate more than 25 GB, which does not make sense if you used a single RTX 3090. By the way, the GPU memory is completely free according to nvidia-smi before I start convert.py.
Here is my error log:

[Config]
data_path: ./base_data
wav_path: ./vctk/wav48_silence_trimmed
txt_path: ./vctk/txt
spk_info_path: ./vctk/speaker-info.txt
converted_path: ./checkpoints/converted_None_uttr
vocoder_path: ./vocoder
cpc_path: ./cpc
n_uttr: None
setting: 
    sampling_rate: 16000
    top_db: 60
    n_mels: 80
    n_fft: 400
    n_shift: 160
    win_length: 400
    window: hann
    fmin: 80
    fmax: 7600
    s2s_portion: 0.1
    eval_spks: 10
    n_frames: 128
model: 
    encoder: 
        c_in: 256
        c_h: 512
        c_out: 4
        num_layer: 6
    decoder: 
        c_in: 4
        c_h: 512
        c_out: 80
        num_layer: 6
train: 
    epoch: 500
    batch_size: 64
    lr: 1e-4
    loss: l1
    eval_every: 100
    save_epoch: 100
    siam: True
    cpc: True
test: 
    threshold: 0.6895345449450861
_name: Config
config: ./config/base.yaml
device: cuda:0
sample_path: ./samples
src_name: 
    - csmd024.wav
trg_name: 
    - One.wav
checkpoint: ./checkpoints
model_name: model-cpc-split.pth
seed: 1234
ex_name: TriAAN-VC
Traceback (most recent call last):
  File "/raid/YX/TriAAN-VC/convert.py", line 153, in <module>
    main(cfg)
  File "/raid/YX/TriAAN-VC/convert.py", line 119, in main
    output = model(src_feat, src_lf0, trg_feat)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/raid/YX/TriAAN-VC/model/model.py", line 184, in forward
    trg, trg_skips = self.spk_encoder(trg)  # target: spk
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/raid/YX/TriAAN-VC/model/model.py", line 43, in forward
    x = block(x)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 215, in forward
    input = module(input)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/raid/YX/TriAAN-VC/model/attention.py", line 55, in forward
    attn = self.softmax(attn)              
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/modules/activation.py", line 1514, in forward
    return F.softmax(input, self.dim, _stacklevel=5)
  File "/home/YX/.virtualenvs/TriAAN-VC/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 1856, in softmax
    ret = input.softmax(dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 25.84 GiB. GPU 0 has a total capacty of 31.74 GiB of which 1.81 GiB is free. Including non-PyTorch memory, this process has 29.91 GiB memory in use. Of the allocated memory 27.82 GiB is allocated by PyTorch, and 1.71 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

a question about the evaluation results

Are the evaluation numbers in the tables of the paper obtained with main.py test after training, or by evaluating the outputs of convert.py? And how do you distinguish between s2s and u2u?

A question about speaker encoder

Hi, why not add speaker classification to the speaker encoder, or use speaker verification features? If I only use a speaker encoder, will there be any problems with timbre coupling?

Silence trimming code

Hi,
Thanks for the great work.
I could not find the code for trimming silence, though the data path suggests that the inputs should be silence-trimmed.
Is it the same at inference time too?
Why do we need silence trimming? For faster training? If we keep the silence, wouldn't training be more robust?

Results

Hi,

I retrained TriAAN-VC and ParallelWaveGAN with the Chinese dataset. Since there are no phoneme labels, I did not fine-tune CPC_audio, and the final results are as follows:

mel-features:
CER: | s2s_st: 0.5101 | s2s_ut: 0.4329 | u2u_st: 0.4561 | u2u_ut: 0.4826
WER: | s2s_st: 0.8145 | s2s_ut: 0.6441 | u2u_st: 0.7425 | u2u_ut: 0.7286
ASV ACC: | s2s_st: 0.7200 | s2s_ut: 0.9067 | u2u_st: 0.7767 | u2u_ut: 0.8933
ASV COS: | s2s_st: 0.7321 | s2s_ut: 0.7770 | u2u_st: 0.7379 | u2u_ut: 0.7702

cpc-features:
CER: | s2s_st: 0.3831 | s2s_ut: 0.3040 | u2u_st: 0.3313 | u2u_ut: 0.3476
WER: | s2s_st: 0.6723 | s2s_ut: 0.4938 | u2u_st: 0.5674 | u2u_ut: 0.5691
ASV ACC: | s2s_st: 0.7900 | s2s_ut: 0.9567 | u2u_st: 0.8233 | u2u_ut: 0.9267
ASV COS: | s2s_st: 0.7637 | s2s_ut: 0.8076 | u2u_st: 0.7573 | u2u_ut: 0.7941

The results are still quite different from those in your paper. I don't know what went wrong; can you give me some advice?

Thanks.

changing window size and n_fft

I tried using other vocoders, such as BigVGAN trained on combined data, HiFiGAN, or even PWG from https://github.com/kan-bayashi/ParallelWaveGAN.
All outputs are noisy.
Why are the window size, n_fft, and hop length so important? The output should not vary too much, right? This is just computing the FFT over different short time windows, which should more or less preserve the frequency range, with only a difference in time/frequency resolution.

Chinese voice conversion

Hello, could you please advise on how to better adapt a model to a Chinese dataset for the purpose of Chinese voice conversion?

What are st and ut?

I am getting these scores on the test set after 500 epochs when trimming is not done.

--- Set: test ---
CER: | s2s_st: 0.1787 | s2s_ut: 0.1920 | u2u_st: 0.1717 | u2u_ut: 0.1766
WER: | s2s_st: 0.3156 | s2s_ut: 0.3644 | u2u_st: 0.3217 | u2u_ut: 0.2973
ASV ACC: | s2s_st: 0.9500 | s2s_ut: 0.9600 | u2u_st: 0.9150 | u2u_ut: 0.9550
ASV COS: | s2s_st: 0.7993 | s2s_ut: 0.7960 | u2u_st: 0.7859 | u2u_ut: 0.7878

Is this OK, or should I rerun it several times to reach your scores because of the inherent randomness?

How do I calculate the average scores here? Specifically, what are s2s_st and s2s_ut? s2s already means 'seen-to-seen', right?

Custom Training

I tested with the provided models, and it worked well, but the voice from the reference was as good as when we use the samples that come with the code.

I guess the samples are from the VCTK dataset, on which the provided model was trained.

So I think that to get a custom voice that is more similar to the reference voice, we would need to train on the reference voice. Is that correct?

And if so, what would you suggest as a minimum amount of reference data to train on for a better result?

Thanks a lot, and your work is amazing!

Can I train on one long audio file (10 mins+)?

I was wondering if it is possible to train on one long wav file of 10+ minutes and split it 60%/20%/20% into train, validation, and test sets as the paper mentions. Does that work right away, or will I have to split the long audio file into separate audio files of single sentences, like the VCTK dataset?

Training time/compute for provided model

Hi there, thanks for sharing this codebase and the associated models :)

I was curious about a couple of things:

  1. How long did it take to train on the VCTK dataset?
  2. What compute was used? (type and number of GPUs)

This info might help folks who want to look into training on a larger dataset. I tried to find it in the paper and here but didn't find anything (sorry if I missed something!). Thanks!

question about model size and result

thanks for your excellent work.

  1. I got noise in the VC result. What is causing the noise, and how can I remove it? In addition, I trained the model on the full-length audio samples instead of the 128-frame (n_frames) slices.
    VC result: https://github.com/winddori2002/TriAAN-VC/assets/24654967/795df989-9a2a-4812-8d84-764c19698665
  2. I noticed that the output channel of the speaker and content encoders is 4. I wonder why you chose such a small number. Have you tried other values like 128 or 256?

Other vocoder or any possible improvement

Hi,
Thanks for the great work!
I have trained the model on custom datasets with ten times more speakers now.
However, the results are only marginally better.
It works well for self-regeneration, but it fails to achieve good performance in conversion cases.
Is it about the speaker encoder? Should I increase the dimensions to capture more variety?
If self-regeneration is good, then presumably the vocoder part works well, right? So there is no need to change the vocoder, right?
And if we do want to change the vocoder and shift to NVIDIA BigVGAN, should I retrain the network with BigVGAN's parameters?
