
diffusion-svc's Introduction

Language: English 简体中文

Diffusion-SVC

Open in OpenXLab Colab_CN madewithlove Discord

This repository is a standalone home for the diffusion part of the DDSP-SVC repository. It can be trained and used for inference on its own.


Diffusion-SVC 2.0 is coming soon. You can preview it on the v2.0_dev branch: go to the branch.


Recent update: pairing this repository's naive model with a shallow diffusion model can achieve better results than a pure diffusion model at a very low training cost, and is strongly recommended. However, the small naive network generalizes poorly and may have pitch-range problems on small datasets; in that case, do not fine-tune the naive model for too many steps (that would degrade the base model), or consider swapping the front-end for a DDSP model, which has an unlimited pitch range.
For results and an introduction, see the [introduction video (not yet available)]. You are welcome to join the QQ group for discussion: 882426004. (Diagram: shallow-diffusion pipeline overview)

0. Introduction

Diffusion-SVC is a standalone home for the diffusion part of the DDSP-SVC repository. It can be trained and used for inference on its own.

Compared with the well-known Diff-SVC, this project uses much less GPU memory, trains and infers faster, and is specifically optimized for shallow diffusion and real-time use. It can run real-time inference on a reasonably strong GPU, and when paired with this project's naive model for shallow diffusion, even a weaker GPU can generate high-quality audio in real time.

If both the training data and the input source are of very high quality, Diffusion-SVC may deliver the best conversion results.

This project can easily be cascaded after another acoustic model for shallow diffusion, to improve the final output or reduce the performance cost. For example, cascading Diffusion-SVC after DDSP-SVC or after this project's naive model further reduces the number of diffusion steps needed while still producing high-quality output.

In addition, this project can train only the denoising steps needed for shallow diffusion instead of the full denoising process starting from Gaussian noise, which speeds up training and improves quality; see below for details.

Disclaimer: please make sure you only train Diffusion-SVC models with legally obtained, properly licensed data, and do not use these models or any audio they synthesize for illegal purposes. The authors of this repository are not responsible for any infringement, fraud, or other illegal acts caused by the use of these model checkpoints and audio.

1. Install dependencies

  1. Install PyTorch: we recommend downloading PyTorch from the official PyTorch website; a typical install command is sketched after this list.

  2. Install the remaining dependencies:

pip install -r requirements.txt 
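
For reference, a minimal sketch of installing PyTorch for a CUDA 11.8 environment is shown below; the exact command depends on your OS and CUDA version, so prefer the selector on the official PyTorch website.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118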

2. Set up pretrained models

  • (Required) Download the pretrained ContentVec encoder and place it in the pretrain folder. The trimmed ContentVec mirror works exactly the same but is only 190 MB.
    • Note: other feature extractors can also be used, but ContentVec is still the first recommendation. All supported feature extractors are listed in the Units_Encoder class in tools/tools.py.
  • (Required) Download the pretrained vocoder from the DiffSinger community vocoder project and extract it into the pretrain/ folder.
    • Note: you should download the archive whose name contains nsf_hifigan, not nsf_hifigan_finetune.
  • If you want to use the speaker-embedding model, set use_speaker_encoder to true in the config file and download the pretrained speaker encoder from here; the model comes from mozilla/TTS.
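
After these downloads, the pretrain folder typically looks roughly like the sketch below (the ContentVec checkpoint name and the nsf_hifigan paths follow the defaults referenced elsewhere in this repository; adjust if your config points elsewhere):

pretrain
├─ contentvec
│    └─ checkpoint_best_legacy_500.pt
└─ nsf_hifigan
     ├─ config.json
     └─ model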

3. Preprocessing

1. Set up the training and validation datasets

1.1 Manual setup:

Put all training data (.wav audio clips) into data/train/audio, or into the folder specified in the config file, e.g. xxxx/yyyy/audio.

Put all validation data (.wav audio clips) into data/val/audio, or into the folder specified in the config file, e.g. aaaa/bbbb/audio.

1.2 Random selection by the program:

Run python draw.py and the program will help you pick the validation data (parameters such as the number of files to draw can be adjusted in draw.py).

1.3 Folder structure:

Note: speaker IDs must start from 1, not 0; if there is only one speaker, its ID must be 1.

  • Directory layout:
data
├─ train
│    ├─ audio
│    │    ├─ 1
│    │    │   ├─ aaa.wav
│    │    │   ├─ bbb.wav
│    │    │   └─ ....wav
│    │    ├─ 2
│    │    │   ├─ ccc.wav
│    │    │   ├─ ddd.wav
│    │    │   └─ ....wav
│    │    └─ ...
│
├─ val
│    ├─ audio
│    │    ├─ 1
│    │    │   ├─ eee.wav
│    │    │   ├─ fff.wav
│    │    │   └─ ....wav
│    │    ├─ 2
│    │    │   ├─ ggg.wav
│    │    │   ├─ hhh.wav
│    │    │   └─ ....wav
│    │    └─ ...

2. Run the preprocessing

python preprocess.py -c configs/config.yaml

You can edit the config file configs/config.yaml before preprocessing.

3. Notes:

  1. Keep the sample rate of all audio clips consistent with the sample rate in the yaml config file! (Resampling and other preprocessing with fap is recommended.)

  2. Cutting long audio into short clips speeds up training, but every clip should be at least 2 seconds long. If there are many clips, a lot of RAM is required; setting the cache_all_data option to false in the config file solves this.

  3. About 10 audio clips are recommended for the validation set. Do not put in too many, or validation will be slow.

  4. If your dataset is not of high quality, set 'f0_extractor' to 'crepe' in the config file. The crepe algorithm has the best noise robustness, at the cost of greatly increasing preprocessing time.

  5. The 'n_spk' parameter in the config file controls whether a multi-speaker model is trained. If you want to train a multi-speaker model, the audio folder names used to number the speakers must be positive integers no greater than 'n_spk'.

4. Training

1. Training without pretrained weights:

python train.py -c configs/config.yaml

2. Pretrained models:

  • We strongly recommend fine-tuning from a pretrained model; it is much easier and cheaper than training from scratch, and can reach a higher ceiling than a small dataset alone allows.

  • Note that fine-tuning on a base model requires the same encoder as the base model (e.g. ContentVec with ContentVec); the same applies to other encoders (such as the speaker encoder), and parameters such as the network size must also match.


!!!!!!!!! Training a shallow diffusion model plus a naive model is strongly recommended !!!!!!!!!

A combination of a shallow diffusion model trained only to a depth of k_step_max and a naive model may achieve even higher quality than a pure full diffusion model, while training faster. However, the naive model may have pitch-range problems.


2.1 Pretrained full-diffusion models

(Note: whisper-ppg corresponds to Whisper's medium weights, and whisper-ppg-large to Whisper's large-v2 weights.)

Units Encoder | Network size | Dataset | Download
contentvec768l12 (recommended) | 512*20 | VCTK + m4singer | HuggingFace
hubertsoft | 512*20 | VCTK + m4singer | HuggingFace
whisper-ppg (so-vits-svc only) | 512*20 | VCTK + m4singer + opencpop + kiritan | HuggingFace

As a bonus, there is also an experimental base model encoded with contentvec768l12 and trained on m4singer/opencpop/vctk; it is not recommended and not guaranteed to be problem-free: download

2.2 Pretrained diffusion models trained only to a depth of k_step_max

(Note: whisper-ppg corresponds to Whisper's medium weights, and whisper-ppg-large to Whisper's large-v2 weights.)

Units Encoder | Network size | k_step_max | Dataset | Shallow diffusion model download
contentvec768l12 | 512*30 | 100 | VCTK + m4singer | HuggingFace
contentvec768l12 | 512*20 | 200 | VCTK + m4singer | HuggingFace
contentvec256l9 | 512*20 | 200 | VCTK + m4singer | HuggingFace
contentvec256l9 | 768*30 | 200 | VCTK + m4singer | HuggingFace
whisper-ppg (so-vits-svc only) | 768*30 | 200 | PTDB + m4singer | HuggingFace

  • Experiments show that the naive model has pitch-range problems on small datasets; prefer fine-tuning the naive model for fewer steps, or use a DDSP model with unlimited pitch range as the front end instead.

2.3 Pretrained Naive and DDSP models matching 2.2

Units Encoder | Network size | Dataset | Type | Naive model download
contentvec768l12 | 3*256 | VCTK + m4singer | Naive | HuggingFace

  • Note: the pretrained naive model can also serve as the front-end naive model for a full diffusion model. When fine-tuning the shallow model, it is recommended to reduce decay_step in the config file (e.g. to 10000).

3. Training with pretrained (base model) weights:

  1. PRs with multi-speaker base models are welcome (please train only on datasets whose licenses allow open-source use).
  2. The pretrained models are listed above; take special care to use a model trained with the same encoder.
  3. Put the pretrained model named model_0.pt into the model export folder specified by the "expdir: exp/*****" parameter in config.yaml; create the folder if it does not exist. The program automatically loads the pretrained model from that folder (see the sketch after this list).
  4. Start training exactly as you would without pretrained weights.
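
A minimal sketch of steps 3 and 4, assuming expdir in config.yaml is exp/my_voice and the downloaded base model is model_0.pt (both paths are illustrative):

# place the base model where the trainer will look for it
mkdir -p exp/my_voice
cp /path/to/downloaded/model_0.pt exp/my_voice/model_0.pt
# then start training as usual
python train.py -c configs/config.yaml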

4.1. Naive models and combo models

Naive model

The naive model is a lightweight SVC model that can serve as the front end for shallow diffusion. It is trained the same way as the diffusion model, with an example config file at configs/config_naive.yaml, and it uses the same preprocessing as the diffusion model.

python train.py -c configs/config_naive.yaml

At inference time, pass -nmodel pointing to this model file to use it; a shallow diffusion depth -kstep is then required.

Combo model

combo.py can combine a diffusion model and a naive model into a single combo model; this one model is then enough for shallow diffusion. The two models must be trained with the same parameters (e.g. the same speaker IDs), because they also share the same parameters at inference time.

python combo.py -model <model> -nmodel <nmodel> -exp <exp> -n <name>

The command above combines the two models: -model is the path to the diffusion model and -nmodel the path to the naive model; the config files stored next to each model are read automatically. -exp is the output directory for the combo model and -n is the name to save it under. The combo model is written to <exp> as <name>.ptc.

At inference time, a combo model can be loaded directly as the diffusion model and used for shallow diffusion, without passing -nmodel to load the naive model separately.

4.2. About k_step_max and shallow diffusion

(See the diagram at the top of this readme.)

In shallow diffusion, the diffusion model only diffuses from a certain noise depth instead of starting from Gaussian noise. Therefore, for shallow-diffusion use, the diffusion model can also be trained only up to a certain noise depth rather than from Gaussian noise.

Setting k_step_max in the config file to the desired diffusion depth trains the model this way; the value must be less than 1000 (the number of steps of full diffusion). A model trained like this cannot run inference on its own: it must apply shallow diffusion on top of a front-end model's output or on the input source, and the diffusion depth cannot exceed k_step_max.

See configs/config_shallow.yaml for an example config file.

It is recommended to combine such a shallow-only diffusion model with a naive model into a combo model.
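
A minimal sketch of the recommended shallow-diffusion workflow, assuming the example configs shipped with the repository and illustrative checkpoint paths:

# train a diffusion model limited to k_step_max (see configs/config_shallow.yaml)
python train.py -c configs/config_shallow.yaml
# train the lightweight naive model that will act as its front end
python train.py -c configs/config_naive.yaml
# combine the two checkpoints into a single combo model (paths and names are placeholders)
python combo.py -model exp/shallow_my_voice/model_100000.pt -nmodel exp/naive_my_voice/model_50000.pt -exp exp/combo_my_voice -n my_voice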

5. Visualization

# check training status with tensorboard
tensorboard --logdir=exp

After the first validation, you can listen to the synthesized test audio in TensorBoard.

6. Non-real-time inference

python main.py -i <input.wav> -model <model_ckpt.pt> -o <output.wav> -k <keychange> -id <speaker_id> -speedup <speedup> -method <method> -kstep <kstep> -nmodel <nmodel> -pe <f0_extractor>

-model is the model path, -k is the pitch shift, -speedup is the speed-up factor, -method is one of pndm, ddim, unipc, or dpm-solver, -kstep is the number of shallow diffusion steps, and -id is the speaker ID for the diffusion model.

If -kstep is given, shallow diffusion starts from the mel of the input source; if -kstep is omitted, full-depth Gaussian diffusion is performed.

-nmodel (optional, trained separately) is the path to a naive model, used to provide a rough mel for the diffusion model to refine with shallow diffusion of depth k_step; its parameters must match the main model.

-pe can be one of crepe, parselmouth, dio, harvest, rmvpe; the default is crepe.

If speaker-embedding encoding is used, you can specify an external speaker embedding with -spkemb or override the model's speaker-embedding dictionary with -spkembdict.
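
For example, a shallow-diffusion call with a combo model might look like the line below; the file names, pitch shift, and step counts are illustrative:

python main.py -i input.wav -model exp/combo_my_voice/my_voice.ptc -o output.wav -k 0 -id 1 -speedup 10 -method dpm-solver -kstep 100 -pe rmvpe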

7. Units index (optional, not recommended)

Feature indexing similar to RVC and so-vits-svc.

Note that this is an optional feature; everything works normally without an index. The index takes a lot of storage space and indexing is CPU-heavy, so this feature is not recommended.

# train the feature index (preprocessing must be finished first)
python train_units_index.py -c config.yaml

At inference time, enable it with the -lr parameter, which sets the retrieval ratio.
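
A hedged example of indexed inference; the retrieval ratio 0.5 and the file names are illustrative, and the remaining flags behave as in section 6:

python main.py -i input.wav -model exp/combo_my_voice/my_voice.ptc -o output.wav -k 0 -id 1 -kstep 100 -lr 0.5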

8. Real-time inference

The GUI shipped with this repository is recommended for real-time inference. If you want shallow diffusion, combine the models into a combo model first.

python gui_realtime.py

This project can also work with rtvc for real-time inference.

Note: flask_api is currently an experimental feature and rtvc is not yet mature, so this route is not recommended.

pip install rtvc
python rtvc
python flask_api.py

9. Compatibility

9.1. Units encoders

Units Encoder | Diffusion-SVC | DDSP-SVC | so-vits-svc
ContentVec | √ | √ | √
HubertSoft | √ | √ | √
Hubert(Base,Large) | √ | √ | ×
CNHubert(Base,Large) | √ | √ | √*
CNHubertSoft | √ | √ | ×
Wav2Vec2-xlsr-53-espeak-cv-ft | √* | × | ×
DPHubert | × | × | √
Whisper-PPG | × | × | √*
WavLM(Base,Large) | × | × | √*

10. Colab

You can use the Diffusion_SVC_CN.ipynb notebook written by TheMandateOfRock; since I have no way to test it, please report any issues to the notebook's author.

11. Onnx export

Create a new folder under the exp folder (its name will be the ProjectName used in the command below), put the model and config file into it, and rename the model to model.pt and the config to config.yaml.
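
A minimal sketch of that preparation step, assuming the project is to be called MyVoice and the trained files come from an illustrative exp/my_voice folder:

mkdir -p exp/MyVoice
cp exp/my_voice/model_100000.pt exp/MyVoice/model.pt   # rename the checkpoint to model.pt
cp exp/my_voice/config.yaml exp/MyVoice/config.yaml    # the config must be named config.yaml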

Then run the following command to export.

python diffusion/onnx_export.py --project <ProjectName>

After the export finishes, a config file for MoeVS is created automatically. Thanks to NaruseMioShirakana (who is also the author of MoeVS) for providing the onnx export support.

Thanks

Thanks to all contributors for their efforts.

diffusion-svc's People

Contributors

bfloat16, cnchtu, huanlinoto, kakaruhayate, mlbv, narusemioshirakana, rainaobi, ricecakey06, ylzz1997, yxlllc


diffusion-svc's Issues

How to use whisper-ppg as encoder?

Hello! The diffusion model training table has a "whisper-ppg (only usable with so-vits)" row, but the encoder choices in the config are: 'hubertsoft', 'hubertbase', 'hubertbase768', 'contentvec', 'contentvec768', 'contentvec768l12' or 'cnhubertsoftfish'. How can I use the pretrained whisper model? Thank you!

What is the effect of using mean speaker embedding in Naive model training?

Thank you for this wonderful project sharing as open source.

I couldn't see args.model.use_speaker_encoder in any of the config files in the configs folder, but the option is there in preprocess.py.

I have a couple of questions.

  1. What is the effect of using a mean speaker embedding in Naive model training / shallow diffusion? Will it increase speaker similarity along with quality?
  2. Why compute a mean embedding instead of an utterance-level speaker embedding?
  3. Are any released naive models trained with args.model.use_speaker_encoder set to true?

Thanks. I am looking forward to your comments :)

samples

Do you have any kind of samples? Of opencpop or kiritan, or anyone's, just to compare quality against the other SVC algorithms?

The sample rate of the input audio may cause problems

In my case, using only the shallow model, the input audio sample rate is 22050 Hz, while the training audio and the config_shallow.yaml sample rate are both 44100 Hz.

It reports "RuntimeError: The size of tensor a (336) must match the size of tensor b (760) at non-singleton dimension 2".

After converting the sample rate to 44100 Hz, it works fine.

Data loading takes too long

As the title says. When caching to cpu/cuda at startup, the rate is only 2-3 it/s, so loading 100k+ audio files takes nearly seven hours.
With caching disabled, the "Load the f0, volume data from : data/train" step reaches about 20 it/s, which is still slow, and training itself is also very slow.
The system is a K8S pod with 12 cores, 2x A800, and Lustre cluster storage.

Solution to serious memory leaks in preprocessing under Linux

Please use the following command to force pytorch to update to the nightly version:

# cu118:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118 --force-reinstall
# cu121:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu121 --force-reinstall

Pitch-range problem

The introduction says the naive model has pitch-range problems. But in my tests, without training a naive model at all and training only a shallow diffusion model or a full diffusion model, both resulting models are fine in the mid and low range, while high notes (above F5) are weak: low volume, with electronic artifacts (as heard in the audio played in tensorboard). It seems to be a problem of the diffusion model itself. The dataset is over 90 minutes, which should not count as small. I tried raising the f0 frequency upper limit in the config file and re-running preprocessing and training, but it does not seem to help.

Error when using the realtime GUI

Traceback (most recent call last):
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\gui_realtime.py", line 8, in <module>
    from tools.infer_tools import DiffusionSVC
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\infer_tools.py", line 11, in <module>
    from tools.tools import F0_Extractor, Volume_Extractor, Units_Encoder, SpeakerEncoder, cross_fade
  File "G:\Diffusion-SVC-barbara\Diffusion-SVC-barbara\tools\tools.py", line 12, in <module>
    from fairseq import checkpoint_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\__init__.py", line 20, in <module>
    from fairseq.distributed import utils as distributed_utils
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\__init__.py", line 7, in <module>
    from .fully_sharded_data_parallel import (
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\distributed\fully_sharded_data_parallel.py", line 10, in <module>
    from fairseq.dataclass.configs import DistributedTrainingConfig
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\__init__.py", line 6, in <module>
    from .configs import FairseqDataclass
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\site-packages\fairseq\dataclass\configs.py", line 1104, in <module>
    @dataclass
     ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1230, in dataclass
    return wrap(cls)
           ^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 1220, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 958, in _process_class
    cls_fields.append(_get_field(cls, name, type, kw_only))
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python311\Lib\dataclasses.py", line 815, in _get_field
    raise ValueError(f'mutable default {type(f.default)} for field '
ValueError: mutable default <class 'fairseq.dataclass.configs.CommonConfig'> for field common is not allowed: use default_factory

Hoarse sound when using realtime inference

I trained a combo model, and it works well for offline inference with main.py, but gui_realtime.py only produces a hoarse sound with no intelligible pronunciation (the pitch seems correct). I have tried changing input devices and upgrading torch, but nothing works.

Below are screenshots of the settings.

Offline Inference
(screenshot)

Realtime Inference
(screenshot)

Any ideas or workarounds?

Does anyone know if the whisper-ppg-largev2 or v3 model can train diffusion models and use them in the so-vits-svc model

Does anyone know whether the whisper-ppg-largev2 or v3 model can be used to train diffusion models for the so-vits-svc project? Actually, I have trained a diffusion model with largev2 as the vencoder, but the diff model cannot be used; it shows the bug below. I noticed that this project says whisper-ppg cannot use the diff module, but then why does the so-vits-svc team supply a way to train a whisper-ppg-largev2 diff model? I like the diff module and find it very useful, so I really want to know why I cannot use it with my so-vits-svc 4.1 model, or hope to find a solution. Thanks!
(screenshots)

Loss value does not decrease during training

Hello! I have been training a model on 1204 files, each between 5 and 25 seconds long, for a total duration of 14031 seconds. However, despite training for approximately 3 hours (4722 epochs), I have noticed that the loss is not decreasing. I would appreciate any insight into why this might be happening and any suggestions on how to resolve the issue. Thank you!

configs/diffusion.yaml file:

data:
  block_size: 512
  cnhubertsoft_gate: 10
  duration: 2
  encoder: whisper-ppg
  encoder_hop_size: 320
  encoder_out_channels: 1024
  encoder_sample_rate: 16000
  extensions:
  - wav
  sampling_rate: 44100
  training_files: filelists/train.txt
  unit_interpolate_mode: nearest
  validation_files: filelists/val.txt
device: cuda
env:
  expdir: logs/44k/diffusion
  gpu_id: 0
infer:
  method: dpm-solver
  speedup: 10
model:
  n_chans: 512
  n_hidden: 256
  n_layers: 20
  n_spk: 1
  type: Diffusion
  use_pitch_aug: true
spk:
  speaker_all: 0
train:
  amp_dtype: fp32
  batch_size: 384
  cache_all_data: true
  cache_device: cpu
  cache_fp16: true
  decay_step: 100000
  epochs: 100000
  gamma: 0.5
  interval_force_save: 10000
  interval_log: 10
  interval_val: 2000
  lr: 0.00008
  num_workers: 2
  save_opt: false
  weight_decay: 0.01
vocoder:
  ckpt: pretrain/nsf_hifigan/model
  type: nsf-hifigan

(screenshots: train loss and validation curves)

UPD: the model is trained as part of this repository: https://github.com/svc-develop-team/so-vits-svc

[Help Wanted] Why is 2 subtracted from initial_global_step in the code?

As the title says.
In train.py, line 81:

param_group['lr'] = args.train.lr * args.train.gamma ** max((initial_global_step - 2) // args.train.decay_step, 0)

Line 83:

scheduler = lr_scheduler.StepLR(optimizer, step_size=args.train.decay_step, gamma=args.train.gamma, last_epoch=initial_global_step-2)

Suggestion: rename gui.py

It should be something like rtvc-gui.py, to avoid confusion with potential future GUIs for training or inference.

OOM during inference with 8 GB of VRAM

The command used was python .\main.py -i 'input.wav' -model .\models\murasame.ptc -o 'output.wav' -k 0 -kstep 100 -pe rmvpe
The log is as follows:

2023-12-25 01:45:06 | INFO | fairseq.tasks.text_to_speech | Please install tensorboardX: pip install tensorboardX
 [Loading] .\models\murasame.ptc
C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\utils\weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | current directory is F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC
2023-12-25 01:45:09 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-12-25 01:45:09 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Units Forced Mode:nearest
 [INFO] Extract f0 volume and mask: Use rmvpe, start...
 [INFO] Extract f0 volume and mask: Done. Use time:5.367121458053589
  0%|                                                                                            | 0/2 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\main.py", line 195, in <module>
    out_wav, out_sr = diffusion_svc.infer_from_long_audio(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\infer_tools.py", line 380, in infer_from_long_audio
    seg_units = self.units_encoder.encode(seg_input, sr, hop_size)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 471, in encode
    units = self.model(audio_res, padding_mask=padding_mask)
  File "F:\C#\MurasameQQBot\MurasameQQBot\bin\Debug\net8.0\Tools\Diffusion-SVC\tools\tools.py", line 601, in __call__
    logits = self.hubert.extract_features(**inputs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 535, in extract_features
    res = self.forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\hubert\hubert.py", line 467, in forward
    x, _ = self.encoder(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1003, in forward
    x, layer_results = self.extract_features(x, padding_mask, layer)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1049, in extract_features
    x, (z, lr) = layer(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\models\wav2vec\wav2vec2.py", line 1260, in forward
    x, attn = self.self_attn(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\fairseq\modules\multihead_attention.py", line 538, in forward
    return F.multi_head_attention_forward(
  File "C:\Users\dogdi\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\nn\functional.py", line 5440, in multi_head_attention_forward
    attn_output = scaled_dot_product_attention(q, k, v, attn_mask, dropout_p, is_causal)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 8.00 GiB. GPU 0 has a total capacty of 8.00 GiB of which 4.45 GiB is free. Of the allocated memory 1.36 GiB is allocated by PyTorch, and 239.44 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

In theory, inference shouldn't need 8 GB of VRAM...
The model was trained on a rented 4090; inference runs on an RTX 2060S with 8 GB (could that be related?).

Some questions about v2.0

I have a few questions about the latest 2.0 version that I'd like to ask:

  1. Will the current v2.0 branch still see major changes? Can models already be trained and used for inference on the current branch?
  2. Are v2.0 models compatible with v1 models?
  3. What improvements does v2.0 bring over v1?
  4. If I want to use the new hifi-vaegan as the vocoder, must hifi-vaegan be used during training, i.e. can nsf-hifigan no longer be used?
  5. What are the advantages and disadvantages of hifi-vaegan compared with nsf-hifigan?
  6. If the old and new models are incompatible, is there already a plan to train and release new base models?
  7. For training a base model with v2.0, does the config file need specific changes (and are there recommended batch size and lr values)?

If any of these are inconvenient to answer, please just skip them.

Voice gets stuttered and lagged during realtime inference

During realtime inference, the output voice sometimes stutters and lags. Changing input/output devices or combo models didn't help (decreasing the number of historical blocks may mitigate the problem). The problem can be reproduced by switching foreground programs or decreasing the speedup. When the problem occurs, there is no obviously abnormal resource usage (CUDA or video memory). By the way, the same problem occurs in the DDSP-SVC project.

Below is a demo video.

Desktop.2023.06.29.-.17.05.11.03.-.Trim.mp4

Any ideas or workarounds?

FCPE fails to import

When trying to use FCPE as an f0 extractor, the following error appears: [Errno 2] No such file or directory: 'exp/f0bce_test_R004_cu0\\config.yamI'
It's looking for a config.yaml file for the FCPE model, but this shouldn't be the case.
