finetune-hf-vits's Issues

Error keeps crashing the system (GPU used: NVIDIA GeForce RTX 4060 Ti; dataset below)

dataset: https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/nl

return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
preprocess train dataset: 23%|██████████████████████████████ | 7991/34605 [05:37<02:19, 190.55 examples/s]
Traceback (most recent call last):
File "/home/ai001/Desktop/Texttospeach/finetune-hf-vits/env/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/ai001/Desktop/Texttospeach/finetune-hf-vits/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ai001/Desktop/Texttospeach/finetune-hf-vits/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
simple_launcher(args)
File "/home/ai001/Desktop/Texttospeach/finetune-hf-vits/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)

Using torch compile

Since we are using accelerate, is it possible to use torch compile by setting dynamo_backend="INDUCTOR"? I am trying to get it working, but it looks like the model gets deallocated and then the code hangs. The only change I made was:

accelerator = Accelerator(
        gradient_accumulation_steps=training_args.gradient_accumulation_steps,
        log_with=training_args.report_to,
        project_config=accelerator_project_config,
        kwargs_handlers=[ddp_kwargs],
        dynamo_backend="INDUCTOR"
    )
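For what it's worth, recent accelerate releases also expose the backend as a launcher flag, so one hedged thing to try (assuming your accelerate version supports --dynamo_backend) is to leave the script untouched and pass the backend at launch time:

accelerate launch --dynamo_backend inductor run_vits_finetuning.py <your_config.json>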

finetuning error: filter audio lengths. File does not exist or is not a regular file (possibly a pipe?).

[INFO|feature_extraction_utils.py:537] 2024-08-08 14:16:12,709 >> loading configuration file checkpoints/preprocessor_config.json
[INFO|feature_extraction_utils.py:586] 2024-08-08 14:16:12,711 >> Feature extractor VitsFeatureExtractor {
"feature_extractor_type": "VitsFeatureExtractor",
"feature_size": 80,
"hop_length": 256,
"max_wav_value": 32768.0,
"n_fft": 1024,
"padding_side": "right",
"padding_value": 0.0,
"return_attention_mask": false,
"sampling_rate": 16000
}

[INFO|tokenization_utils_base.py:2267] 2024-08-08 14:16:12,941 >> loading file vocab.json
[INFO|tokenization_utils_base.py:2267] 2024-08-08 14:16:12,941 >> loading file added_tokens.json
[INFO|tokenization_utils_base.py:2267] 2024-08-08 14:16:12,941 >> loading file special_tokens_map.json
[INFO|tokenization_utils_base.py:2267] 2024-08-08 14:16:12,941 >> loading file tokenizer_config.json
[INFO|tokenization_utils_base.py:2267] 2024-08-08 14:16:12,941 >> loading file tokenizer.json
Filter (num_proc=4): 0%| | 0/150 [00:00<?, ? examples/s]
[src/libmpg123/parse.c:do_readahead():1099] warning: Cannot read next header, a one-frame stream? Duh...
Filter (num_proc=4): 0%| | 0/150 [00:01<?, ? examples/s]
multiprocess.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 678, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3552, in _map_single
batch = apply_function_on_filtered_inputs(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3405, in apply_function_on_filtered_inputs
inputs = format_table(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 641, in format_table
formatted_output = formatter(pa_table_to_format, query_type=query_type)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 401, in call
return self.format_batch(pa_table)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 450, in format_batch
batch = self.python_features_decoder.decode_batch(batch)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/formatting/formatting.py", line 222, in decode_batch
return self.features.decode_batch(batch) if self.features else batch
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/features/features.py", line 2029, in decode_batch
[
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/features/features.py", line 2030, in
decode_nested_example(self[column_name], value, token_per_repo_id=token_per_repo_id)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/features/features.py", line 1351, in decode_nested_example
return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/features/audio.py", line 187, in decode_example
array, sampling_rate = sf.read(file)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/soundfile.py", line 285, in read
with SoundFile(file, 'r', samplerate, channels,
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/soundfile.py", line 658, in init
self._file = self._open(file, mode_int, closefd)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/soundfile.py", line 1216, in _open
raise LibsndfileError(err, prefix="Error opening {0!r}: ".format(self.name))
soundfile.LibsndfileError: Error opening <_io.BytesIO object at 0x7f78a06e8040>: File does not exist or is not a regular file (possibly a pipe?).
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/data/yangcheng/github_projects/finetune-hf-vits/run_vits_finetuning.py", line 1494, in
main()
File "/data/yangcheng/github_projects/finetune-hf-vits/run_vits_finetuning.py", line 767, in main
vectorized_datasets = raw_datasets.filter(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/dataset_dict.py", line 983, in filter
{
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/dataset_dict.py", line 984, in
k: dataset.filter(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/fingerprint.py", line 482, in wrapper
out = func(dataset, *args, **kwargs)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3714, in filter
indices = self.map(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 602, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 567, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3253, in map
for rank, done, content in iflatmap_unordered(
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 718, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 718, in
[async_result.get(timeout=0.05) for async_result in async_results]
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
raise self._value
soundfile.LibsndfileError: Error opening <_io.BytesIO object at 0x7f78a06e8040>: File does not exist or is not a regular file (possibly a pipe?).
Traceback (most recent call last):
File "/home/yangcheng/anaconda3/envs/fairseq/bin/accelerate", line 8, in
sys.exit(main())
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/yangcheng/anaconda3/envs/fairseq/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/yangcheng/anaconda3/envs/fairseq/bin/python', 'run_vits_finetuning.py', './training_config_examples/finetune_mms_bod.json']' returned non-zero exit status 1.
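One way to narrow this down, sketched below under the assumption that your dataset has an "audio" column: disable automatic decoding and try opening each file yourself. This separates "a few corrupt mp3s" from "libsndfile built without mp3 support" (older libsndfile builds fail on every mp3 with exactly this message, in which case upgrading the soundfile package may help).

import io

import soundfile as sf
from datasets import Audio, load_dataset

ds = load_dataset("your/dataset", split="train")   # placeholder dataset id
ds = ds.cast_column("audio", Audio(decode=False))  # keep raw bytes/paths

bad_rows = []
for i, row in enumerate(ds):
    audio = row["audio"]
    try:
        # decode from in-memory bytes or from the file path, whichever is set
        if audio.get("bytes"):
            sf.read(io.BytesIO(audio["bytes"]))
        else:
            sf.read(audio["path"])
    except Exception:
        bad_rows.append(i)

print(f"{len(bad_rows)} undecodable rows:", bad_rows[:10])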

Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Traceback (most recent call last):
File "/mnt/sdb/stt/mata/fairseq-mms/finetune-hf-vits/run_vits_finetuning.py", line 1513, in
main()
File "/mnt/sdb/stt/mata/fairseq-mms/finetune-hf-vits/run_vits_finetuning.py", line 1346, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1523, in forward
else self._run_ddp_forward(*inputs, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1359, in _run_ddp_forward
return self.module(*inputs, **kwargs) # type: ignore[index]
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/accelerate/utils/operations.py", line 659, in forward
return model_forward(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/accelerate/utils/operations.py", line 647, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/finetune-hf-vits/utils/modeling_vits_training.py", line 2151, in forward
return self._inference_forward(
File "/mnt/sdb/stt/mata/fairseq-mms/finetune-hf-vits/utils/modeling_vits_training.py", line 2000, in _inference_forward
text_encoder_output = self.text_encoder(
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/finetune-hf-vits/utils/modeling_vits_training.py", line 1563, in forward
hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
return F.embedding(
File "/mnt/sdb/stt/mata/fairseq-mms/venvTTS/lib/python3.10/site-packages/torch/nn/functional.py", line 2237, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)
VALIDATION - batch 1, process1, waveform torch.Size([8, 108288, 1]), tokens torch.Size([8, 125])

#########

I am getting this error during validation; can you check it? I tried to fine-tune for the Uzbek (Cyrillic) language. Here is my dataset:
https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0/viewer/uz
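Not an official fix, but the traceback suggests the token ids reach the embedding layer as floats during the full-generation validation step. A minimal workaround sketch, assuming you edit the script where full_generation_sample is consumed (the variable name comes from the traceback above):

import torch

# Hedged workaround: embedding layers require Long/Int indices, so cast
# the ids back to integers before the validation forward pass.
if full_generation_sample["input_ids"].dtype != torch.long:
    full_generation_sample["input_ids"] = full_generation_sample["input_ids"].long()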

Loss doesn't go down

I am finetuning a Catalan checkpoint from MMS with a single-speaker dataset of 5 hours. The loss doesn't go below 22, even though all the characters in my dataset's transcriptions exist in the model's vocab.

I don't know if I just need more training. How many epochs for 2,500 audios is a good amount to get better results than the checkpoint?

I have already done 400.

How to train from scratch?

I would like to train a new voice from scratch to do some testing, and also try training from scratch at a higher sample rate.

Could anyone point me in the right direction on how to do this?

Thanks!

Fine-tune exited without creating a finetuned model

I got this error:
File "/home/ph/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/ph/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
simple_launcher(args)
File "/home/ph/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'run_vits_finetuning.py', 'finetune_audio2.json']' returned non-zero exit status 1.

better feature extractor

Does the feature extractor also get tuned during the finetuning process? And how can we choose a better feature extractor to extract better features from the audio?

hub_token not implemented

The help documentation suggests hub_token can be used to push the model to the Hub during training, but it is not actually used in the code. As a result, the model silently fails to push to a private repo during training (no error messages are given).
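Until that is wired up, one hedged workaround sketch (the repo id and paths below are placeholders) is to push the finetuned model manually with an explicit token:

from huggingface_hub import HfApi

# Hedged workaround: upload the trained model folder with an explicit token,
# bypassing the script's unused hub_token argument.
api = HfApi(token="hf_xxx")  # your access token
api.upload_folder(
    folder_path="./vits_finetuned_xx",      # placeholder output dir
    repo_id="username/vits-finetuned-xx",   # placeholder private repo
    repo_type="model",
)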

Getting an error while fine-tuning for Hindi

Thanks. I am getting the error below, basically: RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Please help. I am using Google Colab and am following the instructions exactly.

2024-02-19 11:48:42.153900: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-02-19 11:48:42.153955: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-02-19 11:48:42.155392: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-02-19 11:48:43.496722: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/usr/local/lib/python3.10/dist-packages/torch/functional.py:650: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:863.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Steps: 0%| | 50/175200 [00:36<26:49:06, 1.81it/s, lr=2e-5, step_loss=29.5, step_loss_disc=2.78, step_loss_duration=1.5
02/19/2024 11:49:16 - INFO - __main__ - Running validation...
VALIDATION - batch 0, process0, waveform torch.Size([4, 134400, 1]), tokens torch.Size([4, 169])...
VALIDATION - batch 0, process0, PADDING AND GATHER...
Traceback (most recent call last):
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1494, in
main()
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/run_vits_finetuning.py", line 1327, in main
full_generation = model(**full_generation_sample.to(model.device), speaker_id=speaker_id)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 817, in forward
return model_forward(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 805, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/usr/local/lib/python3.10/dist-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
return func(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2151, in forward
return self._inference_forward(
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 2000, in _inference_forward
text_encoder_output = self.text_encoder(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/content/drive/MyDrive/MMSTTS1/finetune-hf-vits/utils/modeling_vits_training.py", line 1563, in forward
hidden_states = self.embed_tokens(input_ids) * math.sqrt(self.config.hidden_size)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/sparse.py", line 162, in forward
return F.embedding(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/functional.py", line 2233, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding)

Finetuning Stops After Few Steps

Hi,
I keep trying to train the model, but every time, after a few steps, it ends with an error; the attached image shows it. However, with the same configuration I previously trained the model that I am now finetuning further. Kindly help me resolve this, please.
Thank you so much, by the way; this repo is fabulous!

[attached screenshot: issue-git]

BatchEncoding.to() non_blocking error

Forgive me if this is the wrong place to inquire about this issue, but what would be the reason for this error? I prepared a dataset for Faroese, and am trying to start the finetuning by running the command:

accelerate launch run_vits_finetuning.py ./training_config_examples/finetune_mms_fao.json

It seems to start and I get a few confirmation messages:

[INFO|modeling_utils.py:3280] 2024-04-03 13:32:51,747 >> loading weights file /tmp/tmp2gq66xpi/model.safetensors
[INFO|modeling_utils.py:4024] 2024-04-03 13:32:51,786 >> All model checkpoint weights were used when initializing VitsDiscriminator.

[INFO|modeling_utils.py:4032] 2024-04-03 13:32:51,787 >> All the weights of VitsDiscriminator were initialized from the model checkpoint at /tmp/tmp2gq66xpi.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VitsDiscriminator for predictions without further training.

But after a few seconds, I get this error message:

04/03/2024 13:32:53 - INFO - __main__ - ***** Running training *****
04/03/2024 13:32:53 - INFO - __main__ -   Num examples = 1075
04/03/2024 13:32:53 - INFO - __main__ -   Num Epochs = 200
04/03/2024 13:32:53 - INFO - __main__ -   Instantaneous batch size per device = 16
04/03/2024 13:32:53 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
04/03/2024 13:32:53 - INFO - __main__ -   Gradient Accumulation steps = 1
04/03/2024 13:32:53 - INFO - __main__ -   Total optimization steps = 13600
Steps:   0%|                                | 0/13600 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/myuser/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
TypeError: BatchEncoding.to() got an unexpected keyword argument 'non_blocking'

During handling of the above exception, another exception occurred:


Traceback (most recent call last):
  File "/home/myuser/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
    main()
  File "/home/myuser/finetune-hf-vits/run_vits_finetuning.py", line 1090, in main
    for step, batch in enumerate(train_dataloader):
  File "/home/myuser/.local/lib/python3.10/site-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
  File "/home/myuser/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
  File "/home/myuser/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 800, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
  File "/home/myuser/.local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 800, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
AttributeError: 'NoneType' object has no attribute 'to'
Traceback (most recent call last):
  File "/home/myuser/.local/lib/python3.10/site-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
TypeError: BatchEncoding.to() got an unexpected keyword argument 'non_blocking'

Any clue on what the reason is for this error message?
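The first TypeError looks like a version mismatch: newer accelerate passes non_blocking to .to(), which older transformers BatchEncoding.to() signatures do not accept. Aligning your accelerate/transformers versions is the cleaner route; a hedged shim, offered only as a stopgap and not as the repo's fix, would be:

from transformers.tokenization_utils_base import BatchEncoding

# Hedged compatibility shim: accept and ignore the non_blocking kwarg
# that newer accelerate versions pass to BatchEncoding.to().
_orig_to = BatchEncoding.to

def _patched_to(self, device, non_blocking=False):
    return _orig_to(self, device)

BatchEncoding.to = _patched_to

Note that the follow-up AttributeError shows a None value inside the batch even on the fallback path, so the shim removes only the first error; the None field itself still needs investigating.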

Has anyone had this error, please?

ValueError: num_samples should be a positive integer value, but got num_samples=0
Traceback (most recent call last):
File "/home/adem/.local/bin/accelerate", line 8, in
sys.exit(main())
File "/home/adem/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/adem/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/adem/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'run_vits_finetuning.py', './training_config_examples/finetune_mms.json']'

Audio quality not that good.

I finetuned using 40 minutes of Urdu audio for more than 100 epochs, but the voice I get is not that clean. How much data do I need to get good results, or do I need to change anything else?

integration of dataset

The dataset you are using contains the audio and text in one data frame. In my case, I have a folder of MP3 audio files and a separate TSV file containing the audio file names, the texts, and the speaker_id. How can I handle this dataset and integrate it with the JSON configuration file?
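A hedged sketch of one way to do this, with the TSV column names ("file_name", "text", "speaker_id") and the folder layout assumed from the description: build a datasets.Dataset from the TSV, attach the audio paths, and push or save it so the JSON config can reference it by name.

import pandas as pd
from datasets import Audio, Dataset

# Assumed layout: metadata.tsv next to an ./audio folder of mp3 files.
df = pd.read_csv("metadata.tsv", sep="\t")
df["audio"] = "./audio/" + df["file_name"]

ds = Dataset.from_pandas(df)
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Either push to the Hub and point "dataset_name" at it, or save locally.
ds.push_to_hub("username/my-tts-dataset")  # placeholder repo id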

KeyError: 'speaker_id'

Is it necessary to have a speaker_id column given that I don't need a specific speaker? My dataset contains just the audios and their transcriptions.
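If the script does expect the column, a minimal hedged sketch for the single-speaker case (assuming ds is your datasets.Dataset) is to add a constant one:

# Hedged sketch: give every row the same speaker id in a single-speaker set.
ds = ds.add_column("speaker_id", [0] * len(ds))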

dataset

I need to understand the format of the dataset I should use, and what exactly it should contain. I mean these fields:
"dataset_name": "ylacombe/english_dialects",
"dataset_config_name": "welsh_female",
"override_speaker_embeddings": true,
"filter_on_speaker_id": 5223,

Colab Notebook & Tutorial

It will be helpful to have an easy to follow Colab code & tutorial or Gradio UI with easy fine tuning

Speaker_id during inference

Hi @ylacombe! I have multi-speaker data with which I have trained the Hindi checkpoint. I want to generate a particular speaker's voice during inference. Is there any way to do that using the inference code given in the README?

Here is how my current code looks:
import scipy.io.wavfile
from transformers import pipeline

model_id = "./vits_finetuned_hindi"
synthesiser = pipeline("text-to-speech", model_id, device=0)  # add device=0 if you want to use a GPU
speech = synthesiser("वहीं पंजाब सरकार ने सरबत खालसा के आयोजन के लिए, पंजाब के भठिंडा ज़िले में, तलवंडी साबो में, जगह देने से मना कर दिया है।")
scipy.io.wavfile.write("hindi_1.wav", rate=speech["sampling_rate"], data=speech["audio"][0])
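One approach that may work, assuming a transformers version whose text-to-speech pipeline forwards extra model kwargs, is to pass the speaker index through forward_params (the index 3 below is a hypothetical example):

# Hedged sketch: forward a speaker index to the model at generation time.
speech = synthesiser(
    "your text here",
    forward_params={"speaker_id": 3},  # hypothetical speaker index
)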

Error in `run_vits_finetuning` when starting to train

I am encountering an error when trying to run accelerate launch run_vits_finetuning.py. I get past the Weights & Biases authentication but then get stuck on training:

03/22/2024 19:57:22 - INFO - __main__ - ***** Running training *****
03/22/2024 19:57:22 - INFO - __main__ -   Num examples = 110
03/22/2024 19:57:22 - INFO - __main__ -   Num Epochs = 200
03/22/2024 19:57:22 - INFO - __main__ -   Instantaneous batch size per device = 16
03/22/2024 19:57:22 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 16
03/22/2024 19:57:22 - INFO - __main__ -   Gradient Accumulation steps = 1
03/22/2024 19:57:22 - INFO - __main__ -   Total optimization steps = 1400
Steps:   0%|                                | 0/1400 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 155, in send_to_device
    return tensor.to(device, non_blocking=non_blocking)
TypeError: BatchEncoding.to() got an unexpected keyword argument 'non_blocking'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/finetune-hf-vits/run_vits_finetuning.py", line 1494, in <module>
    main()
  File "/content/finetune-hf-vits/run_vits_finetuning.py", line 1090, in main
    for step, batch in enumerate(train_dataloader):
  File "/usr/local/lib/python3.10/dist-packages/accelerate/data_loader.py", line 461, in __iter__
    current_batch = send_to_device(current_batch, self.device)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/operations.py", line 157, in send_to_device
    return tensor.to(device)
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 789, in to
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
  File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 789, in <dictcomp>
    self.data = {k: v.to(device=device) for k, v in self.data.items()}
AttributeError: 'NoneType' object has no attribute 'to'

I am relatively new to all of this, so any ideas on where the problem might be would be really helpful. It looks to me like the training data is not loading successfully but I'm trying to figure out if maybe there is an issue in the config? For context:

  • I am trying to run everything locally so I don't push to the hub.
  • I don't need the wandb functionality, but the problem persists whether or not I visualize
  • This problem appears on colab but also locally when running WSL, same error.

I have put a reproducible example in this colab notebook which also includes the configs I'm using. To get started I was just trying to reproduce the Gujarati training example. Any pointers to where I'm going wrong would be greatly appreciated!
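Since the final AttributeError fires on a None value inside the collated batch, a quick hedged check, placed just before the training loop in run_vits_finetuning.py (names follow the traceback above), is to print which field is None:

# Hedged debugging sketch: inspect one collated batch for None fields.
batch = next(iter(train_dataloader))
for key, value in batch.items():
    print(key, "None" if value is None else getattr(value, "shape", type(value)))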

model.safetensors file

I got many checkpoints, but when I try to run the model I am unable to, because I don't have a model.safetensors file.

Cannot load model checkpoints

After training mms-tts-urd_arabic-script on custom data, the final model works for inference, although the new model does not pronounce the words well. I thought the model might be overfitted, so I wanted to run inference from the intermediate model checkpoints. However, that throws an error with the standard Hugging Face pipeline code, specifically about the absence of config.json etc. On copying the final model files into the checkpoint directory, the inference code works and generates an output .wav file, but the file is blank with no audio. I am using the inference code provided by dunkerbunker in issue #1.

Inference error

from transformers import pipeline
import scipy

model_id = "Vijish/vits_mongolian_monospeaker"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU

speech = synthesiser("Монголын бурхан шашинтны төв Гандантэгчэнлин хийдийн Тэргүүн их хамба Д.")

scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])

/usr/local/lib/python3.10/dist-packages/scipy/io/wavfile.py in write(filename, rate, data)
795 block_align = channels * (bit_depth // 8)
796
--> 797 fmt_chunk_data = struct.pack('<HHIIHH', format_tag, channels, fs,
798 bytes_per_second, block_align, bit_depth)
799 if not (dkind == 'i' or dkind == 'u'):

error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)
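The struct.pack failure is consistent with the channel count overflowing: scipy.io.wavfile.write treats a 2-D array as (samples, channels), and the pipeline returns audio shaped (1, num_samples), so every sample is counted as a channel. A hedged fix sketch is to write the squeezed 1-D array:

# Hedged fix: write mono audio as a 1-D array so scipy does not misread
# the second axis as tens of thousands of channels.
scipy.io.wavfile.write(
    "finetuned_output.wav",
    rate=speech["sampling_rate"],
    data=speech["audio"].squeeze(),
)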

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

Hi! I am trying to train a TTS model for the Persian language. I followed the instructions you posted on the main page (README). However, I get this error and I can't fix it. In my opinion, these errors are related to the encoding format of the JSON files that are created in the "pytorch_dump_folder_path" folder.
I tried to convert all of them to UTF-8, but because some of them contain only ASCII characters, they remained in ASCII encoding. In my research, I kept coming across this statement:

Since ASCII is a subset of UTF-8, these files don't require conversion because they are already compatible with UTF-8.
These ASCII-encoded JSON files are: config.json, tokenizer_config.json, preprocessor_config.json, added_tokens.json.

Of course, I'm not sure if the errors I'm getting are necessarily related to this issue or not, but anyway, I'd be happy if you could help.

Traceback (most recent call last):
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/configuration_utils.py", line 719, in _get_config_dict
    config_dict = cls._dict_from_json_file(resolved_config_file)
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/configuration_utils.py", line 817, in _dict_from_json_file
    text = reader.read()
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/user1/finetune-hf-vits/run_vits_finetuning.py", line 1491, in <module>
    main()
  File "/home/user1/finetune-hf-vits/run_vits_finetuning.py", line 635, in main
    config = VitsConfig.from_pretrained(
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/configuration_utils.py", line 605, in from_pretrained
    config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/configuration_utils.py", line 634, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/home/user1/.pyenv/versions/3.10.0/lib/python3.10/site-packages/transformers/configuration_utils.py", line 722, in _get_config_dict
    raise EnvironmentError(
OSError: It looks like the config file at '/home/user1/finetune-hf-vits/torch_files/model.safetensors' is not a valid JSON file.
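A hint from the traceback itself: the OSError says the "config file" being parsed is .../torch_files/model.safetensors, which suggests the script received the safetensors file path where it expected the checkpoint directory, rather than an encoding problem. A hedged check (the path is the one from your log):

from transformers import VitsConfig

# Hedged sketch: from_pretrained wants the directory holding config.json,
# not the model.safetensors file inside it.
config = VitsConfig.from_pretrained("/home/user1/finetune-hf-vits/torch_files")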

AttributeError: 'NoneType' object has no attribute 'to' when running example

Hello!

accelerate launch run_vits_finetuning.py ./training_config_examples/finetune_english.json
I'm getting this error:

tokenization_utils_base.py", line 803, in to
    self.data = { k: v.to(device=device) for k, v in self.data.items() }
                     ^^^^
AttributeError: 'NoneType' object has no attribute 'to'

What could be causing this?
I use Apple M1 Pro Sonoma 14.4.1

checkpoint inference

Hi!
I have set the parameters so that the model does not push to huggingface and the checkpoints are saved directly in my local system. Each checkpoint folder has these files:
[screenshot of the checkpoint folder contents]

As you can see, there are two of each: optimizer, pytorch_model, and scheduler files. Is this normal? If so, how can I use them?
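If the doubling reflects the generator/discriminator pair that VITS training keeps (a reading inferred from the file names, not from repo documentation), then only the generator weights matter for synthesis. A hedged loading sketch, with paths and file names assumed from the description:

import torch
from transformers import VitsConfig, VitsModel

# Hedged sketch: rebuild an inference model from an intermediate checkpoint.
config = VitsConfig.from_pretrained("./final_model")  # dir containing config.json
model = VitsModel(config)
state = torch.load("./checkpoint-5000/pytorch_model.bin", map_location="cpu")
model.load_state_dict(state, strict=False)  # skip any non-generator keys
model.eval()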

Stuck at : Steps: 0%| | 0/20 [00:00<?, ?it/s]

My model is stuck at 0% and I can't understand why.

[INFO|configuration_utils.py:472] 2024-08-07 04:19:10,786 >> Configuration saved in C:\Users\MUHAMM1\AppData\Local\Temp\tmp68z04aah\config.json
[INFO|modeling_utils.py:2765] 2024-08-07 04:19:11,656 >> Model weights saved in C:\Users\MUHAMM~1\AppData\Local\Temp\tmp68z04aah\model.safetensors
[INFO|configuration_utils.py:731] 2024-08-07 04:19:11,668 >> loading configuration file C:\Users\MUHAMM~1\AppData\Local\Temp\tmp68z04aah\config.json
[INFO|configuration_utils.py:800] 2024-08-07 04:19:11,680 >> Model config VitsConfig {
"_name_or_path": "ylacombe/mms-tts-guj-train",
"activation_dropout": 0.1,
"architectures": [
"VitsDiscriminator"
],
"attention_dropout": 0.1,
"depth_separable_channels": 2,
"depth_separable_num_layers": 3,
"discriminator_kernel_size": 5,
"discriminator_period_channels": [
1,
32,
128,
512,
1024
],
"discriminator_periods": [
2,
3,
5,
7,
11
],
"discriminator_scale_channels": [
1,
16,
64,
256,
1024
],
"discriminator_stride": 3,
"duration_predictor_dropout": 0.5,
"duration_predictor_filter_channels": 256,
"duration_predictor_flow_bins": 10,
"duration_predictor_kernel_size": 3,
"duration_predictor_num_flows": 4,
"duration_predictor_tail_bound": 5.0,
"ffn_dim": 768,
"ffn_kernel_size": 3,
"flow_size": 192,
"hidden_act": "relu",
"hidden_dropout": 0.1,
"hidden_size": 192,
"hop_length": 256,
"initializer_range": 0.02,
"layer_norm_eps": 1e-05,
"layerdrop": 0.1,
"leaky_relu_slope": 0.1,
"model_type": "vits",
"noise_scale": 0.667,
"noise_scale_duration": 0.8,
"num_attention_heads": 2,
"num_hidden_layers": 6,
"num_speakers": 1,
"posterior_encoder_num_wavenet_layers": 16,
"prior_encoder_num_flows": 4,
"prior_encoder_num_wavenet_layers": 4,
"resblock_dilation_sizes": [
[
1,
3,
5
],
[
1,
3,
5
],
[
1,
3,
5
]
],
"resblock_kernel_sizes": [
3,
7,
11
],
"sampling_rate": 16000,
"segment_size": 8192,
"speaker_embedding_size": 0,
"speaking_rate": 1.0,
"spectrogram_bins": 513,
"torch_dtype": "float32",
"transformers_version": "4.43.4",
"upsample_initial_channel": 512,
"upsample_kernel_sizes": [
16,
16,
4,
4
],
"upsample_rates": [
8,
8,
2,
2
],
"use_bias": true,
"use_stochastic_duration_prediction": true,
"vocab_size": 60,
"wavenet_dilation_rate": 1,
"wavenet_dropout": 0.0,
"wavenet_kernel_size": 5,
"window_size": 4
}

[INFO|modeling_utils.py:3641] 2024-08-07 04:19:11,728 >> loading weights file C:\Users\MUHAMM~1\AppData\Local\Temp\tmp68z04aah\model.safetensors
[INFO|modeling_utils.py:4473] 2024-08-07 04:19:11,842 >> All model checkpoint weights were used when initializing VitsDiscriminator.

[INFO|modeling_utils.py:4481] 2024-08-07 04:19:11,844 >> All the weights of VitsDiscriminator were initialized from the model checkpoint at C:\Users\MUHAMM~1\AppData\Local\Temp\tmp68z04aah.
If your task is similar to the task the model of the checkpoint was trained on, you can already use VitsDiscriminator for predictions without further training.
wandb: Currently logged in as: saadgondal203 (saadgondal203-comsats-university-islamabad). Use wandb login --relogin to force relogin
wandb: Tracking run with wandb version 0.17.5
wandb: Run data is saved locally in M:\VoiceCloning\finetune-hf-vits\wandb\run-20240807_041919-bd4vl385
wandb: Run wandb offline to turn off syncing.
wandb: Syncing run still-deluge-3
wandb: View project at https://wandb.ai/saadgondal203-comsats-university-islamabad/mms_gujarati_finetuning
wandb: View run at https://wandb.ai/saadgondal203-comsats-university-islamabad/mms_gujarati_finetuning/runs/bd4vl385
08/07/2024 04:19:21 - INFO - __main__ - ***** Running training *****
08/07/2024 04:19:21 - INFO - __main__ - Num examples = 110
08/07/2024 04:19:21 - INFO - __main__ - Num Epochs = 200
08/07/2024 04:19:21 - INFO - __main__ - Instantaneous batch size per device = 16
08/07/2024 04:19:21 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 16
08/07/2024 04:19:21 - INFO - __main__ - Gradient Accumulation steps = 1
08/07/2024 04:19:21 - INFO - __main__ - Total optimization steps = 1400
Steps: 0%| | 0/1400 [00:00<?, ?it/s]
C:\Users\Muhammad Saad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\torch\functional.py:666: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error.
Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\SpectralOps.cpp:878.)
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in run_code
File "C:\Users\Muhammad Saad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts\accelerate.exe_main
.py", line 7, in
File "C:\Users\Muhammad Saad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main
args.func(args)
File "C:\Users\Muhammad Saad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\accelerate\commands\launch.py", line 1106, in launch_command
simple_launcher(args)
File "C:\Users\Muhammad Saad\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['C:\Users\Muhammad Saad\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe', 'run_vits_finetuning.py', './training_config_examples/finetune_mms.json']' returned non-zero exit status 3221225477.

How to efficiently train multi-speaker TTS on multiple languages

I have on average ~7 hours of high-quality TTS data per speaker, per language, and I have around 11 languages. I get the part about finetuning the MMS model per language, but I am looking to make the most efficient use of my TTS speech corpus to train a multi-speaker, multi-language TTS model. As I am new to TTS, I am not aware of the nitty-gritty details of training TTS models; any help would be highly appreciated.
