plachtaa / FAcodec
Training code for FAcodec presented in NaturalSpeech3
For example, tensorboard is not listed there, and I get an error at runtime: audiotools is required, but I don't see it in the requirements either. Should the requirements file be completed?
The error log says that some modules were compiled against an older NumPy (the 1.x series), while the current environment uses NumPy 2.0.0, which may crash the program. To support both NumPy 1.x and 2.x, these modules would have to be recompiled against NumPy 2.0. Does this require an update?
AttributeError: _ARRAY_API not found — may I ask what that API is?
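For context, that error is the symptom NumPy 2 raises when importing extension modules built against NumPy 1.x. A minimal sketch of detecting the mismatch up front (pinning `numpy<2` in requirements would be the usual workaround; that pin is my assumption, not something the repo states):

```python
import numpy as np

# Extensions compiled against NumPy 1.x look up a private C-API handle
# (_ARRAY_API) that changed in NumPy 2.x, hence the AttributeError.
major = int(np.__version__.split(".")[0])
compatible_with_1x_builds = major < 2
print(np.__version__, "compatible with 1.x-built modules:", compatible_with_1x_builds)
```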
Hi, I have a question about the dataset.
As far as I know, the official FACodec checkpoint was trained using Librilight.
Was this version of the checkpoint also trained using Librilight?
The README only mentions 50k hours of training data and possible multi-language coverage.
I'm confused because Librilight is known to contain 60k hours, while Libriheavy is known to contain 50k hours.
I wonder about the details of the training data.
Thanks.
I tried to test the code specifically for prosody, but it seems that the prosody is entangled with the content in codes[1]?
Hello,
I've attempted to train FAcodec using my own dataset. However, whether I start from scratch or fine-tune your provided checkpoint, the reconstructed audio clips are just noise. I fine-tuned the model using around 128 hours of Common Voice 18 ZH-TW data. After approximately 20k steps, the loss seemed to converge. Some losses, like feature loss, decreased successfully, while others, such as mel loss and waveform loss, were oscillating.
Do all losses decrease during your training process?
Hello, can the decoder support streaming output?
The two GRL layers, gr_content_f0 and gr_prosody_phone, don't seem to be used, which is inconsistent with the original paper. Have you investigated the impact of these two parts?
Can you indicate in which file you implemented this feature?
Also, as you wrote in the README: \t<speaker_id>\t\t<script>\t<phonemized_transcript> — if these fields cannot be replaced with placeholders, will their presence or absence affect the performance of the final trained model?
May I ask whether you have run the code end to end and verified the results?
I ask because many of the weight settings seem inconsistent with the paper.
In meldataset.py, is the `clamp` on line 68 meant to be `clip`? Also, line 84:
max_wave_length = max([b[0].size(0) for b in batch])
raises
TypeError: 'int' object is not callable
Should it be changed to `shape[0]`, like the line above it?
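For reference, a minimal sketch of the suggested fix (assuming `b[0]` is a NumPy array, whose `.size` is an int attribute rather than a method, which is exactly what makes `b[0].size(0)` raise `TypeError: 'int' object is not callable`):

```python
import numpy as np

# Two waveforms of different lengths, as (wave, label) pairs.
batch = [(np.zeros(16000), 0), (np.zeros(24000), 1)]

# Broken: max([b[0].size(0) for b in batch])
#   -> TypeError: 'int' object is not callable (ndarray.size is an int)
# Fixed: index the shape tuple, as the neighbouring line already does.
max_wave_length = max(b[0].shape[0] for b in batch)
print(max_wave_length)  # 24000
```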
I. Is my workflow correct?
1. Modify meldataset.py into my own dataloader, use the VCTK dataset with pseudo-labels generated by wav2vec, and train several ckpt files with train.py.
2. Use the last trained ckpt as the pretrained model and run train_redecoder.py (one question: can train_redecoder.py use the same dataset as train.py?).
3. Use the ckpt trained by train.py together with the ckpt trained by train_redecoder.py in reconstruct_redecoder.py to perform voice conversion.
II. Do the ckpt files produced by train.py and train_redecoder.py have the same structure and parameters as the pretrained bin model you provide?
Thanks for your help!
Hello, I hit the following error while running train.py:
Traceback (most recent call last):
File "/home/tts/ref/ns3/train.py", line 496, in <module>
main(args)
File "/home/tts/ref/ns3/train.py", line 342, in main
spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
File "/home/tts/ref/ns3/train.py", line 342, in <listcomp>
spk_logits = torch.cat([speaker_model.infer_segment(w16.cpu()[..., :wl])[1] for w16, wl in zip(waves_16k, wave_lengths)], dim=0)
File "/opt/conda/envs/amphion/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1688, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'EncDecSpeakerLabelModel' object has no attribute 'infer_segment'
The nemo-toolkit version is 1.21.0.
I searched for this error in nemo's issues but could not find anything related.
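One defensive pattern for this class of error is to check for the method before training starts; a sketch with a stand-in class (the real fix is installing the nemo-toolkit version the repo expects, which I cannot verify here):

```python
class EncDecSpeakerLabelModel:
    """Stand-in for the NeMo speaker model named in the traceback (hypothetical)."""

speaker_model = EncDecSpeakerLabelModel()

# Fail fast with a clear message instead of an AttributeError mid-epoch.
has_method = hasattr(speaker_model, "infer_segment")
if not has_method:
    print("speaker model lacks 'infer_segment'; check the pinned nemo-toolkit version")
```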
The reconstruct and redecoder reconstruct scripts in this project seem to work only with the pretrained files, i.e. the bin files. Can the pth files produced by train.py be used for inference?
Also, which file implements the method that disentangles the audio attributes without any labels at all?
Thanks for your help.
Hi, thank you for sharing the training code of FACodec! I've come across a couple of points:
1. Fine-tuning the redecoder:
I'm interested in fine-tuning the redecoder using the provided encoder and redecoder bin files. However, I noticed that there's no 'net' key in the bin file, which seems to cause an issue when loading the checkpoint. Could you provide some guidance on how to properly load these files for fine-tuning?
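A hedged sketch of loading either layout, assuming (as described in this thread) that training checkpoints nest everything under a 'net' key while the released bin files store the module state dicts at the top level:

```python
def unwrap_net(ckpt: dict) -> dict:
    """Return the module state dicts whether or not they are nested under 'net'.

    Training checkpoint: {'net': {...}, 'optimizer': ..., 'scheduler': ...}
    Released bin file:   {'encoder': ..., 'quantizer': ..., 'decoder': ...}
    """
    return ckpt["net"] if "net" in ckpt else ckpt

# usage (hypothetical path):
# state = unwrap_net(torch.load("pytorch_model.bin", map_location="cpu"))
```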
2. Additional activation function:
I noticed that there's an additional WN gated activation function applied after the timbre layer norm, which differs from the original code and description in the paper. I'm curious about the reasoning behind this architectural change. Could you share some insights into why this modification was made and how it impacts the model's performance?
Hello, in FAcodec/modules/quantize.py, the forward_v2 function of FApredictors comments out the line
spk_pred = self.timbre_predictor(timbre)[0]
so the timbre prediction is None. As a result, further down,
spk_pred_logits = preds['timbre']
spk_loss = F.cross_entropy(spk_pred_logits, spk_labels)
spk_pred_logits is None and this raises an error. Is this a bug?
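If the timbre predictor is intentionally disabled in forward_v2, the loss side would need a matching guard; a minimal sketch of the idea (not the repo's actual code):

```python
import torch
import torch.nn.functional as F

def speaker_loss(preds: dict, spk_labels: torch.Tensor) -> torch.Tensor:
    """Skip the speaker loss when the timbre predictor is disabled (preds['timbre'] is None)."""
    spk_pred_logits = preds.get("timbre")
    if spk_pred_logits is None:
        return torch.zeros(())  # contribute zero loss instead of crashing
    return F.cross_entropy(spk_pred_logits, spk_labels)
```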
That is, for whether a given frame is voiced — is it computed by checking whether f0 exceeds some threshold?
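If that reading is right, the voiced flag would be a simple threshold on f0; a sketch (the threshold value here is an arbitrary assumption, not taken from the repo):

```python
import numpy as np

def voiced_flags(f0: np.ndarray, threshold: float = 10.0) -> np.ndarray:
    """Per-frame voiced/unvoiced flag: 1.0 where f0 exceeds the threshold."""
    return (f0 > threshold).astype(np.float32)

flags = voiced_flags(np.array([0.0, 120.5, 0.0, 98.3]))
print(flags)  # [0. 1. 0. 1.]
```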
I noticed that the pretrained checkpoints you provide are all bin files containing only weights, while the checkpoints produced by train.py in this repo are pth files; the file sizes alone differ by about 2.5 GB.
Since I can reach neither HF nor the HF mirror, I tried my own trained checkpoint first: I renamed it to pytorch_model.bin and put it into checkpoints together with the config.
I then found that the trained model cannot be used for speech reconstruction, because at reconstruct time the model's keys are:
dict_keys(['encoder', 'quantizer', 'decoder', 'discriminator', 'fa_predictors'])
while the checkpoint's keys are:
Keys in ckpt_params: dict_keys(['net', 'optimizer', 'scheduler', 'iters', 'epoch'])
Is this by design, or am I using it incorrectly?
Finally, I'd like to ask: how do you disentangle the timbre, content, and pitch of an audio signal without any labels or annotations? Which function in which file does this?
Thanks for your help.
Thanks for your great work on implementing FACodec!
I found the data file in https://github.com/Plachtaa/FAcodec/blob/master/data/val.txt has some labels, like speaker id, phonemes. How can I get these labels? Will these labels be auto-generated in the training process?
Hello, the current config uses a batch size of 4, but the paper uses 8 GPUs with a batch size of 32. May I ask how many GPUs were used to train the released pre-trained model? Thanks.
Hi! Nice work!
Could you share how many steps would be sufficient to train a new model? I'm trying to train a 16k FAcodec. The results reconstructed by the 130,000-step ckpt still sound different from the real speech, especially the speaker timbre.