
grad-svc's Introduction

Grad-SVC based on Grad-TTS from HUAWEI Noah's Ark Lab


This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs considerably from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so the features of whisper-vits-svc are used in this project. Note that Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up to Grad-TTS.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

grad_tts

grad_svc

The framework of grad-svc-v1

grad_svc_v2

The framework of grad-svc-v2 & v3, encoder:768->512, diffusion:64->96

Elysia_Grad_SVC.mp4

Features

  1. Clean, easy-to-read code inherited from Grad-TTS

  2. Multi-speaker support based on a speaker encoder

  3. No speaker leakage, thanks to perturbation, instance normalization, and GRL

    One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization

  4. No electronic sound

  5. Integrated DPM Solver-k for fewer steps

  6. Integrated Fast Maximum Likelihood Sampling Scheme for fewer steps

  7. Conditional Flow Matching (V3), first used in SVC

  8. Rectified Flow Matching (TODO)

Setup Environment

  1. Install project dependencies

    pip install -r requirements.txt
  2. Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/.

  3. Download the hubert_soft model, and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.

  4. Download the pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.

    Performance bottleneck: the generator and discriminator together are 116 MB, while the generator alone is only 22 MB.

  5. Download the pretrained model gvc.pretrain.pth, and put it into grad_pretrain/.

    python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
    

    For this pretrained model, temperature is set to 1.015 (temperature=1.015) in gvc_inference.py to get good results.
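Before running inference, it can help to confirm that all four downloads are where the steps above say they should be. The following is a minimal sketch, not part of the repository: the helper name `missing_pretrained` is hypothetical, and the relative paths are copied from the setup steps.

```python
from pathlib import Path

# Paths taken from the setup steps; adjust if your layout differs.
REQUIRED_FILES = [
    "speaker_pretrain/best_model.pth.tar",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "bigvgan_pretrain/nsf_bigvgan_pretrain_32K.pth",
    "grad_pretrain/gvc.pretrain.pth",
]


def missing_pretrained(root="."):
    """Return the required files that do not exist under `root` (hypothetical helper)."""
    root = Path(root)
    return [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run from the project root; prints nothing when every file is present.
    for rel in missing_pretrained():
        print(f"missing: {rel}")
```

An empty result means every file is in place; anything printed points at a download step that still needs to be done.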

Dataset preparation

Put the dataset into the data_raw directory following the structure below.

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
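
The layout above can be checked before preprocessing. This is a small sketch using only the standard library; `list_speakers` and `empty_speakers` are hypothetical helpers, not scripts shipped with the project, and they assume one sub-directory per speaker containing .wav files as shown in the tree.

```python
from pathlib import Path


def list_speakers(data_raw="data_raw"):
    """Map each speaker directory name to its sorted list of .wav file names."""
    root = Path(data_raw)
    speakers = {}
    for spk_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        speakers[spk_dir.name] = sorted(f.name for f in spk_dir.glob("*.wav"))
    return speakers


def empty_speakers(speakers):
    """Return the speakers that have no .wav files at all."""
    return [name for name, wavs in speakers.items() if not wavs]


if __name__ == "__main__":
    if Path("data_raw").is_dir():
        found = list_speakers("data_raw")
        for name, wavs in found.items():
            print(f"{name}: {len(wavs)} wav file(s)")
        for name in empty_speakers(found):
            print(f"warning: {name} contains no .wav files")
```

Files that are not .wav are ignored, so a speaker folder holding only other formats shows up as empty.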

Data preprocessing

After preprocessing, the output has the following structure.

data_gvc/
└── waves-16k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── waves-32k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── mel
│    └── speaker0
│    │      ├── 000001.mel.pt
│    │      └── 000xxx.mel.pt
│    └── speaker1
│           ├── 000001.mel.pt
│           └── 000xxx.mel.pt
└── pitch
│    └── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
└── hubert
│    └── speaker0
│    │      ├── 000001.vec.npy
│    │      └── 000xxx.vec.npy
│    └── speaker1
│           ├── 000001.vec.npy
│           └── 000xxx.vec.npy
└── speaker
│    └── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
  1. Re-sampling
    • Generate audio with a sampling rate of 16000Hz in ./data_gvc/waves-16k
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
    
    • Generate audio with a sampling rate of 32000Hz in ./data_gvc/waves-32k
    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
    
  2. Use 16K audio to extract pitch
    python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
    
  3. Use 32k audio to extract mel
    python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
    
  4. Use 16K audio to extract hubert
    python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
    
  5. Use 16k audio to extract timbre code
    python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
    
  6. Extract the average value of the timbre code for inference
    python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
    
  7. Use 32k audio to generate training index
    python prepare/preprocess_train.py
    
  8. Training file debugging
    python prepare/preprocess_zzz.py
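
The eight steps above have to run in order, since later steps read what earlier ones wrote. The sketch below collects the same commands into one list; it is an illustration under the assumption that it is launched from the project root, and `preprocessing_commands` and `run_all` are hypothetical names rather than part of the repository. The command strings are copied verbatim from the steps.

```python
import shlex
import subprocess

# Commands copied from the preprocessing steps, in the order they must run.
COMMANDS = [
    "python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000",
    "python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000",
    "python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch",
    "python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel",
    "python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert",
    "python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker",
    "python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer",
    "python prepare/preprocess_train.py",
    "python prepare/preprocess_zzz.py",
]


def preprocessing_commands():
    """Return each command as an argument list suitable for subprocess.run."""
    return [shlex.split(cmd) for cmd in COMMANDS]


def run_all():
    """Run every step in order and stop at the first failure (check=True raises)."""
    for args in preprocessing_commands():
        print("running:", " ".join(args))
        subprocess.run(args, check=True)


if __name__ == "__main__":
    # Dry run by default: only print the commands. Call run_all() to execute them.
    for args in preprocessing_commands():
        print(" ".join(args))
```

Stopping at the first failing step avoids building the training index from incomplete features.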
    

Train

  1. Start training
    python gvc_trainer.py
    
  2. Resume training
    python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
    
  3. Log visualization
    tensorboard --logdir logs/
    

Train Loss

loss_96_v2

grad_svc_mel

Inference

  1. Export inference model

    python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth
    
  2. Inference

    python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --rature 1.015 --shift 0
    

    The temperature (1.015 here) needs to be adjusted to get good results; the recommended range is (1.001, 1.035).

  3. Inference step by step

    • Extract hubert content vector
      python hubert/inference.py -w test.wav -v test.vec.npy
      
    • Extract pitch to the csv text format
      python pitch/inference.py -w test.wav -p test.csv
      
    • Convert hubert & pitch to wave
      python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
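
Since the temperature has to be tuned by ear within the recommended range (1.001, 1.035), one way to explore it is to generate one inference command per candidate value. The following is a sketch only: `sweep_temperatures` and `inference_command` are hypothetical helpers, the step size is an arbitrary choice, and the `--rature` flag is copied as written in the inference command above. How the output file is named is not described here, so results may need to be renamed between runs.

```python
def sweep_temperatures(first_milli=1005, last_milli=1030, step_milli=5):
    """Candidate temperatures, built from integers to avoid float drift."""
    return [m / 1000 for m in range(first_milli, last_milli + 1, step_milli)]


def inference_command(temperature,
                      model="gvc.pth",
                      spk="./data_gvc/singer/your_singer.spk.npy",
                      wave="test.wav",
                      shift=0):
    """Build the argument list for one inference run at the given temperature."""
    return [
        "python", "gvc_inference.py",
        "--model", model,
        "--spk", spk,
        "--wave", wave,
        "--rature", str(temperature),  # flag spelled as in the command above
        "--shift", str(shift),
    ]


if __name__ == "__main__":
    # Print the commands; run them one at a time and compare the results by ear.
    for t in sweep_temperatures():
        print(" ".join(inference_command(t)))
```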
      

Data

Name URL
PopCS https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md
opencpop https://wenet.org.cn/opencpop/download/
Multi-Singer https://github.com/Multi-Singer/Multi-Singer.github.io
M4Singer https://github.com/M4Singer/M4Singer/blob/master/apply_form.md
VCTK https://datashare.ed.ac.uk/handle/10283/2651

Code sources and references

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS

https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC

https://github.com/facebookresearch/speech-resynthesis

https://github.com/cantabile-kwok/VoiceFlow-TTS

https://github.com/shivammehta25/Matcha-TTS

https://github.com/shivammehta25/Diff-TTSG

https://github.com/majidAdibian77/ResGrad

https://github.com/LuChengTHU/dpm-solver

https://github.com/gmltmd789/UnitSpeech

https://github.com/zhenye234/CoMoSpeech

https://github.com/seahore/PPG-GradVC

https://github.com/thuhcsi/LightGrad

https://github.com/lmnt-com/wavegrad

https://github.com/naver-ai/facetts

https://github.com/jaywalnut310/vits

https://github.com/NVIDIA/BigVGAN

https://github.com/bshall/soft-vc

https://github.com/mozilla/TTS

https://github.com/ubisoft/ubisoft-laforge-daft-exprt

https://github.com/yl4579/StyleTTS-VC

https://github.com/MingjieChen/DYGANVC

https://github.com/sony/ai-research-code/tree/master/nvcnet

grad-svc's People

Contributors

awas666, maxmax2016


grad-svc's Issues

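Question about the electronic-sound artifact

I would like to ask about the acoustic model that runs before the diffusion model, that is, the stage from hubert to mel. When the mel produced at this stage is sent directly to the vocoder, why is there an electronic-sound artifact? In principle, hubert already contains enough information, so why does the generated mel spectrogram still show so many parallel formants? Has the author tried replacing hubert with WavLM?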

Error during training

For testing, I set full_epochs to 15, fast_epochs to 10, and save_step to 5.
At the end of the 11th epoch, the following error message appears and the training is terminated.

Traceback (most recent call last):
  File "S:\VoiceChanger\Grad-SVC\gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "S:\VoiceChanger\Grad-SVC\grad_extend\train.py", line 127, in train
    prior_loss, diff_loss, mel_loss, spk_loss = model.compute_loss(
  File "S:\VoiceChanger\Grad-SVC\grad\model.py", line 132, in compute_loss
    mel = slice_segments(mel, ids, out_size)
  File "S:\VoiceChanger\Grad-SVC\grad\utils.py", line 82, in slice_segments
    ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (200) must match the existing size (0) at non-singleton dimension 1. Target sizes: [100, 200]. Tensor sizes: [100, 0]

Runtime error

This is the error that I get when trying to do the "Training file debugging" step.

`Traceback (most recent call last):
File "", line 1, in
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 131, in _main
prepare(preparation_data)
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 291, in run_path
File "", line 98, in _run_module_code
File "", line 88, in _run_code
File "A:\GradSVC\Grad-SVC-20230925-V3-CFM\prepare\preprocess_zzz.py", line 19, in
for batch in tqdm(loader):
File "C:\Users\phill\AppData\Roaming\Python\Python311\site-packages\tqdm\std.py", line 1178, in iter
for obj in iterable:
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 442, in iter
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1043, in init
w.start()
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 45, in init
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

0%| | 0/5 [00:05<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\queues.py", line 114, in get
raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "A:\GradSVC\Grad-SVC-20230925-V3-CFM\prepare\preprocess_zzz.py", line 19, in
for batch in tqdm(loader):
File "C:\Users\phill\AppData\Roaming\Python\Python311\site-packages\tqdm\std.py", line 1178, in iter
for obj in iterable:
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 634, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 14904) exited unexpectedly`

Please help!

A better alternative to Grad-TTS

Thanks for noticing Better Diffusion Modeling Technology. Recently, Xue et al. proposed that Multi-GradSpeech using Consistent Diffusion Model as the generative network outperforms Grad-TTS in both single- and multi-speaker scenarios, and I believe that this advantage can be carried over to the SVC task, and I'd be happy to share the code if you'd like to try to replace Grad-TTS with Multi-GradSpeech.

num_worker Issue

Traceback (most recent call last):
  File "/Users/workstation/Music/Grad-SVC V2 96/gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "/Users/workstation/Music/Grad-SVC V2 96/grad_extend/train.py", line 120, in train
    for batch in progress_bar:
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1318, in _next_data
    self._shutdown_workers()
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1443, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 947, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 53481) is killed by signal: Segmentation fault: 11.

I got this error while training. Would it be resolved if I changed num_workers=8 to num_workers=0 in ./grad_extend/train.py?

Path error during inference

!python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/Sakura.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
Error:
Traceback (most recent call last):
  File "E:\Grad-SVC-20230930-V3-CFM\gvc_inference.py", line 215, in <module>
    assert os.path.isfile(args.model_bigv)
AssertionError

What is the best version?

There are a variety of versions of Grad-SVC (V3 CFM, V3 CFM RoPE, V2 96, etc.), but which one is the best?

Setting base.yaml

What is the difference between full and fast epochs? And what are test size, test step, and save step?

Regarding the error "Fail to allocate bitmap" during the training process

I made modifications to the parameters in base.yaml:

full_epoch: Changed from 150 to 15000
batch_size: Changed from 8 to 24(18)
save_step: Changed from 10 to 100

My environment is:

Windows 10 22H2
CPU: 10850k
Memory: 64GiB
GPU: RTX 4090 64GiB
Pytorch 2.0.1+cu117
Python 3.8

During the training process, I encountered the following issues:

After running approximately +1030 Epochs, the following error message is frequently encountered:

Synthesis...
Fail to allocate bitmap

After running approximately +90 Epochs, the following error message can occasionally appear:

Exception ignored in: <function Image.__del__ at 0x00000132B1459AF0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 4017, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x00000132EA2C18B0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x00000132EA2C18B0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):

I'm not sure what caused these issues. Do you have any suggestions on how to pinpoint them? Thanks.

training error

python gvc_trainer.py

Numbers of GPU : True
Initializing logger...
Initializing data loaders...
----------131----------
----------10----------
Initializing model...
/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Number of encoder parameters = 16.99m
Number of decoder parameters = 16.87m
Start from Grad_SVC pretrain model: grad_pretrain/gvc.pretrain.pth
Initializing optimizer...
Logging test batch...
Traceback (most recent call last):
  File "gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "/Users/workstation/Music/Grad-SVC/grad_extend/train.py", line 72, in train
    logger.add_image(f'image_{i}/ground_truth', plot_tensor(mel.squeeze()),
  File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 59, in plot_tensor
    data = save_figure_to_numpy(fig)
  File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 48, in save_figure_to_numpy
    data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
ValueError: cannot reshape array of size 4320000 into shape (300,1200,3)
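
The numbers in the ValueError can be checked by hand, which narrows down where the mismatch comes from. The arithmetic below uses only the values printed in the message. The interpretation in the comments (a canvas rendered at twice the reported width and height, as can happen on a HiDPI display) is an assumption and is not confirmed anywhere in this report.

```python
# Values taken from the error message.
expected_h, expected_w, channels = 300, 1200, 3
buffer_size = 4320000

# Number of values the reshape target can hold.
expected_size = expected_h * expected_w * channels

# How much larger the actual buffer is than the target shape.
ratio = buffer_size / expected_size

# If the extra data comes from uniform scaling of both axes (assumption),
# the per-axis scale factor is the square root of the ratio.
scale = ratio ** 0.5

print(expected_size, ratio, scale)
```

The buffer holds exactly four times as many values as the target shape, which is what a 600 x 2400 RGB image would contain.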
