bshall / acoustic-model
Acoustic models for: A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
Home Page: https://bshall.github.io/soft-vc/
License: MIT License
@bshall Thank you for this great work.
The validation loss stopped decreasing after 25k steps while fine-tuning on a small target dataset (about 1 hour).
The best model (25k steps) does not work properly. The content is transferred correctly, but the pitch is not converted from the source to the target, and the tone is not close to the target speaker's.
@bshall, could you please give your suggestions?
Thanks
@bshall Thank you for this great work.
I fine-tuned the pre-trained LJSpeech acoustic model on my custom dataset (about 1 hour):
python train.py --resume checkpoints/hubert-soft-0321fd7e.pt data/ finetuned_checkpoints/
I took the best model from this fine-tuning run (model-best.pt, 20,000 steps) and modified the code (https://github.com/bshall/acoustic-model/blob/main/acoustic/model.py#L119) to load my checkpoint path instead of calling torch.hub.load_state_dict_from_url, but I got the error below. The error log and the modified function follow for reference.
Could you please help me resolve this issue?
Thanks
Traceback (most recent call last):
  File "/root/Experiments/soft-vc/inference.py", line 12, in <module>
    acoustic = hubert_soft().cuda()
  File "/root/Experiments/soft-vc/acoustic/acoustic/model.py", line 165, in hubert_soft
    return _acoustic(
  File "/root/Experiments/soft-vc/acoustic/acoustic/model.py", line 133, in _acoustic
    acoustic.load_state_dict(checkpoint["acoustic-model"])
  File "/root/anaconda3/envs/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1406, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for AcousticModel:
Missing key(s) in state_dict: "encoder.prenet.net.0.weight", "encoder.prenet.net.0.bias", "encoder.prenet.net.3.weight", "encoder.prenet.net.3.bias", "encoder.convs.0.weight", "encoder.convs.0.bias", "encoder.convs.3.weight", "encoder.convs.3.bias", "encoder.convs.4.weight", "encoder.convs.4.bias", "encoder.convs.7.weight", "encoder.convs.7.bias", "decoder.prenet.net.0.weight", "decoder.prenet.net.0.bias", "decoder.prenet.net.3.weight", "decoder.prenet.net.3.bias", "decoder.lstm1.weight_ih_l0", "decoder.lstm1.weight_hh_l0", "decoder.lstm1.bias_ih_l0", "decoder.lstm1.bias_hh_l0", "decoder.lstm2.weight_ih_l0", "decoder.lstm2.weight_hh_l0", "decoder.lstm2.bias_ih_l0", "decoder.lstm2.bias_hh_l0", "decoder.lstm3.weight_ih_l0", "decoder.lstm3.weight_hh_l0", "decoder.lstm3.bias_ih_l0", "decoder.lstm3.bias_hh_l0", "decoder.proj.weight".
Unexpected key(s) in state_dict: "module.encoder.prenet.net.0.weight", "module.encoder.prenet.net.0.bias", "module.encoder.prenet.net.3.weight", "module.encoder.prenet.net.3.bias", "module.encoder.convs.0.weight", "module.encoder.convs.0.bias", "module.encoder.convs.3.weight", "module.encoder.convs.3.bias", "module.encoder.convs.4.weight", "module.encoder.convs.4.bias", "module.encoder.convs.7.weight", "module.encoder.convs.7.bias", "module.decoder.prenet.net.0.weight", "module.decoder.prenet.net.0.bias", "module.decoder.prenet.net.3.weight", "module.decoder.prenet.net.3.bias", "module.decoder.lstm1.weight_ih_l0", "module.decoder.lstm1.weight_hh_l0", "module.decoder.lstm1.bias_ih_l0", "module.decoder.lstm1.bias_hh_l0", "module.decoder.lstm2.weight_ih_l0", "module.decoder.lstm2.weight_hh_l0", "module.decoder.lstm2.bias_ih_l0", "module.decoder.lstm2.bias_hh_l0", "module.decoder.lstm3.weight_ih_l0", "module.decoder.lstm3.weight_hh_l0", "module.decoder.lstm3.bias_ih_l0", "module.decoder.lstm3.bias_hh_l0", "module.decoder.proj.weight".
def _acoustic(
    name: str,
    discrete: bool,
    upsample: bool,
    pretrained: bool = True,
    progress: bool = True,
) -> AcousticModel:
    acoustic = AcousticModel(discrete, upsample)
    if pretrained:
        # checkpoint = torch.hub.load_state_dict_from_url(URLS[name], progress=progress)
        # consume_prefix_in_state_dict_if_present(checkpoint["acoustic-model"], "module.")
        load_path = "/root/Experiments/soft-vc/acoustic/finetuned_checkpoints/model-best.pt"
        checkpoint = torch.load(load_path)
        acoustic.load_state_dict(checkpoint["acoustic-model"])
        acoustic.eval()
    return acoustic
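The "module." prefix on every unexpected key suggests that the checkpoint was saved from a model wrapped for distributed training, and the commented-out consume_prefix_in_state_dict_if_present call is the step that would normally strip it. A minimal sketch of that stripping step follows, using a plain dictionary in place of a real checkpoint. The helper name strip_prefix and the placeholder values are hypothetical and are not part of the repository.

```python
from collections import OrderedDict


def strip_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with `prefix` removed from keys that start with it."""
    return OrderedDict(
        (key[len(prefix):] if key.startswith(prefix) else key, value)
        for key, value in state_dict.items()
    )


# Placeholder values stand in for tensors; only the key names matter here.
saved = {
    "module.encoder.prenet.net.0.weight": 1,
    "module.decoder.proj.weight": 2,
}
cleaned = strip_prefix(saved)
print(list(cleaned))
```

With PyTorch available, restoring the commented-out line, consume_prefix_in_state_dict_if_present(checkpoint["acoustic-model"], "module."), before the load_state_dict call should have the same effect in place.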
Training the AcousticModel with train.py crashes with a missing-attribute error.
It is caused by a missing argparse argument, discrete.
It can be fixed by adding the argument, so I made a pull request (#5).
When train.py is run with a proper dataset-dir and checkpoint-dir, it crashes.
The error message says that the attribute discrete is missing.
-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/content/softVC_AM/train.py", line 96, in train
    discrete=args.discrete,
AttributeError: 'Namespace' object has no attribute 'discrete'
In train.py, args.discrete is used, but there is no corresponding parser.add_argument call.
Lines 87 to 91 in df6eba9
As described in the paper, softVC-AM seems to support both soft and discrete units.
So we can add a discrete flag (by default, it works in soft mode).
When I add it, the bug disappears.
I made a pull request (#5) that fixes this bug.
Thanks for your great OSS! I am happy if this helps you and the community.
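One way such a flag could be declared with argparse is sketched below. The flag name --discrete matches the attribute read in the traceback. The positional argument names, the help text, and the store_true default are assumptions for illustration and are not necessarily what pull request #5 contains.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(description="Train the acoustic model (illustrative sketch).")
parser.add_argument("dataset_dir", type=Path, help="path to the dataset directory")
parser.add_argument("checkpoint_dir", type=Path, help="path to the checkpoint directory")
# store_true makes the flag default to False, so soft units are used unless --discrete is given.
parser.add_argument("--discrete", action="store_true", help="use discrete units instead of soft units")

soft_args = parser.parse_args(["data/", "checkpoints/"])
discrete_args = parser.parse_args(["data/", "checkpoints/", "--discrete"])
print(soft_args.discrete, discrete_args.discrete)
```

Because the attribute now always exists on the parsed namespace, a later read of args.discrete no longer raises AttributeError.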
Greetings.
I am aware of the different repositories involved in building a voice conversion model. However, the repositories contain little information about the complete training pipeline. Could the README.md file be extended with instructions for training a voice conversion model from scratch, similar to the information provided in your parallel repository hubert, so that a full training pipeline can be run? Information such as:
a requirements.txt file
preprocess.py -i foo -o bar, then train.py -i bar -o model_output
...
Thanks in advance for your time.
Hi @bshall, can the pre-trained hubert-soft or discrete model be used to encode Mandarin Chinese data? I want to train a model for Vietnamese VC, but train only the acoustic model and the HiFiGAN vocoder on a Vietnamese dataset.
Typically it's possible to load torch models onto the CPU or GPU by using the map_location argument.
This doesn't work for the acoustic model:
TypeError: hubert_soft() got an unexpected keyword argument 'map_location'
On a CPU-only machine, loading this model gives the error:
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
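For reference, the map_location behaviour that the error message points to can be sketched with torch.load directly. The in-memory buffer below stands in for a checkpoint file, and the "acoustic-model" key and tensor name only mirror the names seen in the traceback earlier on this page; none of this is the repository's own loading code.

```python
import io

import torch

# Stand-in for a checkpoint file; a real run would pass a file path to torch.load.
buffer = io.BytesIO()
torch.save({"acoustic-model": {"decoder.proj.weight": torch.zeros(2, 3)}}, buffer)
buffer.seek(0)

# Use the GPU when one is available, otherwise remap all storages to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load(buffer, map_location=device)
weight = checkpoint["acoustic-model"]["decoder.proj.weight"]
print(weight.device.type)
```

Supporting this in hubert_soft() would presumably mean forwarding a map_location value to the internal checkpoint-loading call, which is a change to the repository rather than something the caller can do today.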
Unit-to-mel inference with generate.py crashes with a missing-file error.
It is caused by a variable-name mistake in generate.py.
It can be fixed with a one-line change, so I made a pull request (#2).
When generate.py is run with a proper in-dir and out-dir, it crashes.
The error message says: No such file or directory: 'path'.
Generating from sample_softVC -> o_test
0% 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "./generate.py", line 57, in <module>
    generate(args)
  File "./generate.py", line 22, in generate
    units = np.load("path")
  File "/usr/local/lib/python3.7/dist-packages/numpy/lib/npyio.py", line 417, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'path'
In generate.py, the variable path is mistakenly written as the string "path".
Line 16 in c30a7c3
When I fix it, the bug disappears.
I made a pull request (#2) that fixes this bug.
I am so impressed with the softVC project, so if this PR helps this super cool project, I am glad.
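The difference between np.load("path") and np.load(path) can be illustrated with a small, self-contained loop over .npy files. The load_units helper, the file name utt1.npy, and the array shape are hypothetical and only mimic the kind of loop the traceback suggests; this is not the actual generate.py.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_units(in_dir):
    """Load every .npy file under in_dir, keyed by file stem."""
    units = {}
    for path in sorted(Path(in_dir).rglob("*.npy")):
        # Pass the loop variable itself; np.load("path") would look for a file literally named 'path'.
        units[path.stem] = np.load(path)
    return units


with tempfile.TemporaryDirectory() as tmp:
    np.save(Path(tmp) / "utt1.npy", np.zeros((4, 256), dtype=np.float32))
    loaded = load_units(tmp)
    print({name: array.shape for name, array in loaded.items()})
```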
Hello,
I've been trying to drop in BigVGAN in place of HiFi-GAN, but I keep running into an issue with the number of mel channels: the acoustic model is trained on 128 channels, whereas BigVGAN uses 100. Is there a simple way to fix this, or does the acoustic model need to be trained with 100 mel channels?
Have you tried this in a multi-speaker setting?