playvoice / grad-svc Goto Github PK


Diffusion Singing Voice Conversion based on Grad-TTS from Huawei

Home Page: https://huggingface.co/spaces/maxmax20160403/grad-svc

License: MIT License

Python 100.00%

diff-svc diffusion svc vits-svc voice-change grad-tts vits vits2 flow-matching

grad-svc's Introduction

Grad-SVC based on Grad-TTS from HUAWEI Noah's Ark Lab

This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs considerably from other diffusion-based SVC models. The code is adapted from Grad-TTS and whisper-vits-svc, so the features of whisper-vits-svc are also used in this project. Note that Diff-VC (Diffusion-Based Any-to-Any Voice Conversion) is a follow-up of Grad-TTS.

Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech

The framework of grad-svc-v1

The framework of grad-svc-v2 & v3, encoder:768->512, diffusion:64->96

Elysia_Grad_SVC.mp4

Features

Clean, easy-to-read code inherited from Grad-TTS
Multi-speaker support based on a speaker encoder
No speaker leakage, thanks to perturbation, instance normalization, and GRL

One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization
No electronic sound
Integrated DPM Solver-k for fewer steps
Integrated Fast Maximum Likelihood Sampling Scheme for fewer steps
Conditional Flow Matching (V3), first used in SVC
Rectified Flow Matching (TODO)
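
For readers unfamiliar with conditional flow matching, here is a minimal sketch of the training target for the optimal-transport path used in projects such as Matcha-TTS (listed in the references). It is an illustration only: whether Grad-SVC V3 uses exactly this path is an assumption, and the names `cfm_pair`, `cfm_loss` and `sigma_min` are hypothetical, not taken from the repository.

```python
import random

def cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Return (x_t, u_t) for the OT conditional flow matching path.

    x0: noise sample, x1: data sample (e.g. one mel frame), t in [0, 1].
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1 - sigma_min) * x0   (the velocity the network regresses)
    """
    k = 1.0 - sigma_min
    x_t = [(1.0 - k * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - k * a for a, b in zip(x0, x1)]
    return x_t, u_t

def cfm_loss(pred, target):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                     # toy "data" vector (assumption)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]    # Gaussian noise
x_t, u_t = cfm_pair(x0, x1, t=0.3)
print(cfm_loss([0.0] * len(u_t), u_t))    # loss of a model that predicts zero
```

In a real model the velocity would be predicted by a network conditioned on `x_t`, `t` and the encoder output; the zero prediction above only exercises the loss.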

Setup Environment

Install project dependencies
```
pip install -r requirements.txt
```
Download the timbre encoder (Speaker-Encoder by @mueller91) and put best_model.pth.tar into speaker_pretrain/.
Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
Download pretrained nsf_bigvgan_pretrain_32K.pth, and put it into bigvgan_pretrain/.

Performance bottleneck: the generator and discriminator together are 116 MB, while the generator alone is only 22 MB.

Download the pretrained model gvc.pretrain.pth and put it into grad_pretrain/.
```
python gvc_inference.py --model ./grad_pretrain/gvc.pretrain.pth --spk ./assets/singers/singer0001.npy --wave test.wav
```
For this pretrained model, temperature is set to 1.015 in gvc_inference.py to get good results.

Dataset preparation

Put the dataset into the data_raw directory following the structure below.

data_raw
├───speaker0
│   ├───000001.wav
│   ├───...
│   └───000xxx.wav
└───speaker1
    ├───000001.wav
    ├───...
    └───000xxx.wav
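
Before preprocessing, it can help to confirm that the folder really follows this layout. The snippet below is a small sketch, not part of the repository: the function name `summarize_dataset` is hypothetical, and it only assumes one sub-folder per speaker containing `.wav` files, as in the tree above.

```python
from pathlib import Path

def summarize_dataset(root):
    """Map each speaker folder under `root` to its number of .wav files."""
    root = Path(root)
    summary = {}
    for speaker_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[speaker_dir.name] = sum(1 for _ in speaker_dir.glob("*.wav"))
    return summary

if __name__ == "__main__":
    data_root = Path("data_raw")
    if data_root.is_dir():
        for speaker, count in summarize_dataset(data_root).items():
            print(f"{speaker}: {count} wav files")
    else:
        print("data_raw not found")
```

A speaker that reports zero files usually points to a wrong nesting level or a different audio extension.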

Data preprocessing

After preprocessing you will get an output with the following structure.

data_gvc/
└── waves-16k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── waves-32k
│    └── speaker0
│    │      ├── 000001.wav
│    │      └── 000xxx.wav
│    └── speaker1
│           ├── 000001.wav
│           └── 000xxx.wav
└── mel
│    └── speaker0
│    │      ├── 000001.mel.pt
│    │      └── 000xxx.mel.pt
│    └── speaker1
│           ├── 000001.mel.pt
│           └── 000xxx.mel.pt
└── pitch
│    └── speaker0
│    │      ├── 000001.pit.npy
│    │      └── 000xxx.pit.npy
│    └── speaker1
│           ├── 000001.pit.npy
│           └── 000xxx.pit.npy
└── hubert
│    └── speaker0
│    │      ├── 000001.vec.npy
│    │      └── 000xxx.vec.npy
│    └── speaker1
│           ├── 000001.vec.npy
│           └── 000xxx.vec.npy
└── speaker
│    └── speaker0
│    │      ├── 000001.spk.npy
│    │      └── 000xxx.spk.npy
│    └── speaker1
│           ├── 000001.spk.npy
│           └── 000xxx.spk.npy
└── singer
    ├── speaker0.spk.npy
    └── speaker1.spk.npy
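
Once all preprocessing steps have run, each 16k wave should have a matching file in every feature folder. The following sketch checks this using only the folder names and suffixes shown in the tree above; the helper `missing_features` and the `FEATURES` table are hypothetical and not part of the repository.

```python
from pathlib import Path

# Sub-directory -> file suffix, as shown in the tree above.
FEATURES = {
    "waves-32k": ".wav",
    "mel": ".mel.pt",
    "pitch": ".pit.npy",
    "hubert": ".vec.npy",
    "speaker": ".spk.npy",
}

def missing_features(root, reference="waves-16k"):
    """List feature files expected for every reference wav but absent on disk."""
    root = Path(root)
    missing = []
    for wav in sorted((root / reference).glob("*/*.wav")):
        speaker, stem = wav.parent.name, wav.stem
        for subdir, suffix in FEATURES.items():
            expected = root / subdir / speaker / f"{stem}{suffix}"
            if not expected.is_file():
                missing.append(expected.relative_to(root).as_posix())
    return missing
```

Calling `missing_features("data_gvc")` returns an empty list when every feature is present; any path it lists indicates the preprocessing step that needs to be re-run.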

Re-sampling

Generate audio with a sampling rate of 16000Hz in ./data_gvc/waves-16k

python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000

Generate audio with a sampling rate of 32000Hz in ./data_gvc/waves-32k

python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000

Use 16K audio to extract pitch

python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch

Use 32k audio to extract mel

python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel

Use 16K audio to extract hubert

python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert

Use 16k audio to extract timbre code

python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker

Extract the average value of the timbre code for inference

python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
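
Conceptually, this step reduces all per-utterance timbre codes of one speaker to a single vector by element-wise averaging. The sketch below illustrates that idea with plain Python lists to stay dependency-free; the real script presumably works on the `.spk.npy` arrays, and the function name `average_embedding` and the toy values are assumptions.

```python
from statistics import fmean

def average_embedding(embeddings):
    """Element-wise mean of equally sized speaker embeddings."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    return [fmean(column) for column in zip(*embeddings)]

# Three toy 4-dimensional utterance embeddings for one speaker.
utterances = [
    [0.0, 1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0, 3.0],
    [4.0, 1.0, 4.0, 3.0],
]
print(average_embedding(utterances))  # [2.0, 1.0, 2.0, 3.0]
```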

Use 32k audio to generate training index
```
python prepare/preprocess_train.py
```
Training file debugging
```
python prepare/preprocess_zzz.py
```

Train

Start training
```
python gvc_trainer.py
```

Resume training

python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
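
To avoid typing the checkpoint name by hand, the newest file can be looked up first. This is a small sketch under two assumptions: checkpoints live in `logs/grad_svc/` and match `grad_svc_*.pth`, as in the command above, and "latest" is taken to mean most recently modified. The helper `latest_checkpoint` is hypothetical.

```python
from pathlib import Path

def latest_checkpoint(log_dir="logs/grad_svc", pattern="grad_svc_*.pth"):
    """Return the most recently modified checkpoint, or None if none exist."""
    candidates = list(Path(log_dir).glob(pattern))
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)

if __name__ == "__main__":
    ckpt = latest_checkpoint()
    if ckpt is None:
        print("no checkpoint found; start with: python gvc_trainer.py")
    else:
        print(f"python gvc_trainer.py -p {ckpt.as_posix()}")
```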

Log visualization
```
tensorboard --logdir logs/
```

Train Loss

Inference

Export inference model

python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pth

Inference

python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --rature 1.015 --shift 0

temperature=1.015 usually needs to be adjusted to get good results; the recommended range is (1.001, 1.035).
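
To give an intuition for this parameter: in the Grad-TTS reference code the temperature divides the Gaussian noise added to the encoder output before reverse diffusion starts, so values above 1 slightly shrink the noise. The sketch below mimics that behaviour with plain Python. Whether gvc_inference.py applies the temperature in exactly the same place is an assumption, and `sample_prior` is a hypothetical name.

```python
import random

def sample_prior(mu, temperature=1.015, seed=0):
    """Draw the starting point z = mu + eps / temperature, eps ~ N(0, 1)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)
    return [m + rng.gauss(0.0, 1.0) / temperature for m in mu]

mu = [0.0] * 5                       # toy encoder output (assumption)
for temp in (1.001, 1.015, 1.035):   # ends and middle of the recommended range
    z = sample_prior(mu, temperature=temp)
    print(temp, [round(v, 3) for v in z])
```

With a fixed seed the printed values shrink slightly as the temperature grows, which is why small changes within the recommended range are enough to alter the result.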

Inference step by step

Extract hubert content vector

python hubert/inference.py -w test.wav -v test.vec.npy

Extract pitch to the csv text format

python pitch/inference.py -w test.wav -p test.csv

Convert hubert & pitch to wave

python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
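
The `--shift` option transposes the extracted pitch. A common convention, assumed here rather than confirmed from the source, is that the value is given in semitones, so each F0 value is multiplied by 2^(shift/12). The sketch below shows that arithmetic; `shift_f0` is a hypothetical helper, and treating 0 Hz as "unvoiced" is also an assumption.

```python
def shift_f0(f0_hz, semitones):
    """Transpose F0 values by `semitones`; 0 Hz (unvoiced) frames stay 0."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor if f > 0 else 0.0 for f in f0_hz]

contour = [220.0, 0.0, 440.0]   # Hz, with one unvoiced frame
print(shift_f0(contour, 12))    # one octave up: [440.0, 0.0, 880.0]
print(shift_f0(contour, 0))     # unchanged: [220.0, 0.0, 440.0]
```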

Data

| Name | URL |
| --- | --- |
| PopCS | https://github.com/MoonInTheRiver/DiffSinger/blob/master/resources/apply_form.md |
| opencpop | https://wenet.org.cn/opencpop/download/ |
| Multi-Singer | https://github.com/Multi-Singer/Multi-Singer.github.io |
| M4Singer | https://github.com/M4Singer/M4Singer/blob/master/apply_form.md |
| VCTK | https://datashare.ed.ac.uk/handle/10283/2651 |

Code sources and references

https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS

https://github.com/huawei-noah/Speech-Backbones/tree/main/DiffVC

https://github.com/facebookresearch/speech-resynthesis

https://github.com/cantabile-kwok/VoiceFlow-TTS

https://github.com/shivammehta25/Matcha-TTS

https://github.com/shivammehta25/Diff-TTSG

https://github.com/majidAdibian77/ResGrad

https://github.com/LuChengTHU/dpm-solver

https://github.com/gmltmd789/UnitSpeech

https://github.com/zhenye234/CoMoSpeech

https://github.com/seahore/PPG-GradVC

https://github.com/thuhcsi/LightGrad

https://github.com/lmnt-com/wavegrad

https://github.com/naver-ai/facetts

https://github.com/jaywalnut310/vits

https://github.com/NVIDIA/BigVGAN

https://github.com/bshall/soft-vc

https://github.com/mozilla/TTS

https://github.com/ubisoft/ubisoft-laforge-daft-exprt

https://github.com/yl4579/StyleTTS-VC

https://github.com/MingjieChen/DYGANVC

https://github.com/sony/ai-research-code/tree/master/nvcnet

grad-svc's People


maxmax2016 splinter21 muruganr96 wendongj diiogofernands ishine youngjundev2 lokshaw-chau a-h-m-e-t-c-a-n guangkechen awas666 techthiyanes seanbackstrom zhaopufeng kdrkdrkdr

grad-svc's Issues

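Question about the electronic-sound artifact

I would like to ask about the acoustic model that comes before the diffusion model, i.e. the stage from hubert to mel. When the mel from this stage is fed directly to the vocoder, why is there an electronic-sound artifact? In theory, hubert already contains enough information, so why does the generated mel spectrogram still show so many parallel formants? Have you tried replacing hubert with WavLM?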

Error during training

For testing, I set full_epochs to 15, fast_epochs to 10, and save_step to 5.
At the end of the 11th epoch, the following error message appears and the training is terminated.

Traceback (most recent call last):
File "S:\VoiceChanger\Grad-SVC\gvc_trainer.py", line 30, in
train(hps, args.checkpoint_path)
File "S:\VoiceChanger\Grad-SVC\grad_extend\train.py", line 127, in train
prior_loss, diff_loss, mel_loss, spk_loss = model.compute_loss(
File "S:\VoiceChanger\Grad-SVC\grad\model.py", line 132, in compute_loss
mel = slice_segments(mel, ids, out_size)
File "S:\VoiceChanger\Grad-SVC\grad\utils.py", line 82, in slice_segments
ret[i] = x[i, :, idx_str:idx_end]
RuntimeError: The expanded size of the tensor (200) must match the existing size (0) at non-singleton dimension 1. Target sizes: [100, 200]. Tensor sizes: [100, 0]

Runtime error

This is the error that I get when trying to do the "Training file debugging" step.

`Traceback (most recent call last):
File "", line 1, in
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 131, in _main
prepare(preparation_data)
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 246, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 297, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "", line 291, in run_path
File "", line 98, in _run_module_code
File "", line 88, in _run_code
File "A:\GradSVC\Grad-SVC-20230925-V3-CFM\prepare\preprocess_zzz.py", line 19, in
for batch in tqdm(loader):
File "C:\Users\phill\AppData\Roaming\Python\Python311\site-packages\tqdm\std.py", line 1178, in iter
for obj in iterable:
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 442, in iter
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1043, in init
w.start()
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 45, in init
prep_data = spawn.get_preparation_data(process_obj._name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 164, in get_preparation_data
_check_not_importing_main()
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 140, in _check_not_importing_main
raise RuntimeError('''
RuntimeError:
An attempt has been made to start a new process before the
current process has finished its bootstrapping phase.

    This probably means that you are not using fork to start your
    child processes and you have forgotten to use the proper idiom
    in the main module:

        if __name__ == '__main__':
            freeze_support()
            ...

    The "freeze_support()" line can be omitted if the program
    is not going to be frozen to produce an executable.

    To fix this issue, refer to the "Safe importing of main module"
    section in https://docs.python.org/3/library/multiprocessing.html

0%| | 0/5 [00:05<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1133, in _try_get_data
data = self._data_queue.get(timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\queues.py", line 114, in get
raise Empty
_queue.Empty

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "A:\GradSVC\Grad-SVC-20230925-V3-CFM\prepare\preprocess_zzz.py", line 19, in
for batch in tqdm(loader):
File "C:\Users\phill\AppData\Roaming\Python\Python311\site-packages\tqdm\std.py", line 1178, in iter
for obj in iterable:
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 634, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1329, in _next_data
idx, data = self._get_data()
^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1295, in _get_data
success, data = self._try_get_data()
^^^^^^^^^^^^^^^^^^^^
File "C:\Users\phill\AppData\Local\Programs\Python\Python311\Lib\site-packages\torch\utils\data\dataloader.py", line 1146, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 14904) exited unexpectedly`

Please help!

Does SVS work with English lyrics?

A better alternative to Grad-TTS

Thanks for noticing Better Diffusion Modeling Technology. Recently, Xue et al. proposed that Multi-GradSpeech using Consistent Diffusion Model as the generative network outperforms Grad-TTS in both single- and multi-speaker scenarios, and I believe that this advantage can be carried over to the SVC task, and I'd be happy to share the code if you'd like to try to replace Grad-TTS with Multi-GradSpeech.

What is the advantage of Grad-SVC compared to So-VITS-SVC?

Hi @MaxMax2016, thank you for this wonderful project.

What is the advantage of Grad-SVC compared to So-VITS-SVC?

num_worker Issue

Traceback (most recent call last):
  File "/Users/workstation/Music/Grad-SVC V2 96/gvc_trainer.py", line 30, in <module>
    train(hps, args.checkpoint_path)
  File "/Users/workstation/Music/Grad-SVC V2 96/grad_extend/train.py", line 120, in train
    for batch in progress_bar:
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1318, in _next_data
    self._shutdown_workers()
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1443, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/popen_fork.py", line 40, in wait
    if not wait([self.sentinel], timeout):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/connection.py", line 947, in wait
    ready = selector.select(timeout)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/workstation/Music/Grad-SVC V2 96/Grad-SVC V2 96/lib/python3.11/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 53481) is killed by signal: Segmentation fault: 11.

I got this error while training. Would it be resolved if I change num_workers=8 to num_workers=0 in ./grad_extend/train.py?

Adjusting Hubert model

How can I use this Hubert Model on Grad SVC?
https://huggingface.co/team-lucid/hubert-base-korean

Path error during inference

!python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/Sakura.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
Error:
Traceback (most recent call last):
File "E:\Grad-SVC-20230930-V3-CFM\gvc_inference.py", line 215, in
assert os.path.isfile(args.model_bigv)
AssertionError

What is the best version?

There are a variety of versions of Grad-SVC (V3 CFM, V3 CFM RoPE, V2 96, etc.). Which one is the best?

How was the Speaker Encoder trained?

As the title says, I would like to ask how the Speaker Encoder was trained. Is there any reference code?

Setting base.yaml

What is the difference between full and fast epochs? And what are test size, test step, and save step?

Something wrong with the decoder

I adapted the Korean Hubert model to Grad-SVC and trained the model, but the exported audio file sounds weird and the generated decoder image looks wrong.

Regarding the error "Fail to allocate bitmap" during the training process

I made modifications to the parameters in base.yaml:

full_epoch: Changed from 150 to 15000
batch_size: Changed from 8 to 24(18)
save_step: Changed from 10 to 100

My environment is:

Windows 10 22H2
CPU: 10850k
Memory: 64GiB
GPU: RTX 4090 64GiB
Pytorch 2.0.1+cu117
Python 3.8

During the training process, I encountered the following issues:

After running approximately +1030 Epochs, the following error message is frequently encountered:

Synthesis...
Fail to allocate bitmap

After running approximately +90 Epochs, the following error message can occasionally appear:

Exception ignored in: <function Image.__del__ at 0x00000132B1459AF0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 4017, in __del__
    self.tk.call('image', 'delete', self.name)
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x00000132EA2C18B0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):
RuntimeError: main thread is not in main loop
Exception ignored in: <function Variable.__del__ at 0x00000132EA2C18B0>
Traceback (most recent call last):
  File "H:\Python389\lib\tkinter\__init__.py", line 363, in __del__
    if self._tk.getboolean(self._tk.call("info", "exists", self._name)):

I'm not sure what caused these issues. Do you have any suggestions on how to pinpoint them? Thanks.

Amount of training data?

I would like to ask how much data was used to train the pretrained model.

why skip_diff_train before fast_epochs

Hi,

Can you explain why diffusion training is skipped before the configured fast_epochs?
And how many epochs does diffusion training need?

Thanks!

training error

python gvc_trainer.py

Numbers of GPU : True
Initializing logger...
Initializing data loaders...
----------131----------
----------10----------
Initializing model...
/Users/workstation/Music/Grad-SVC/Grad-SVC/lib/python3.8/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Number of encoder parameters = 16.99m
Number of decoder parameters = 16.87m
Start from Grad_SVC pretrain model: grad_pretrain/gvc.pretrain.pth
Initializing optimizer...
Logging test batch...
Traceback (most recent call last):
File "gvc_trainer.py", line 30, in
train(hps, args.checkpoint_path)
File "/Users/workstation/Music/Grad-SVC/grad_extend/train.py", line 72, in train
logger.add_image(f'image_{i}/ground_truth', plot_tensor(mel.squeeze()),
File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 59, in plot_tensor
data = save_figure_to_numpy(fig)
File "/Users/workstation/Music/Grad-SVC/grad_extend/utils.py", line 48, in save_figure_to_numpy
data = data.reshape(fig.canvas.get_width_height()[::-1] + (3,))
ValueError: cannot reshape array of size 4320000 into shape (300,1200,3)
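
A quick arithmetic check on the numbers in this message can narrow down the cause. The sketch below only uses the sizes reported in the error; the helper `scale_factor` is hypothetical, and the interpretation that follows is a hypothesis, not a confirmed diagnosis.

```python
def scale_factor(buffer_size, width, height, channels=3):
    """Return (ratio, remainder) of buffer_size vs. width*height*channels."""
    expected = width * height * channels
    return divmod(buffer_size, expected)

# Numbers taken from the error message above: target shape (300, 1200, 3).
ratio, remainder = scale_factor(4320000, width=1200, height=300)
print(ratio, remainder)   # 4 0
print(ratio ** 0.5)       # 2.0 per axis, if the scaling is uniform
```

The buffer is exactly four times the expected size, i.e. 600 x 2400 x 3. That would be consistent with the figure canvas being rendered at twice the logical resolution on each axis (for example on a high-DPI display) while `get_width_height()` reports logical pixels, but this has not been verified against the reporter's setup.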
