
SpeechT5

Unified-modal speech-text pre-training for spoken language processing:

SpeechT5 (ACL 2022): SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing

Speech2C (INTERSPEECH 2022): Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

YiTrans (IWSLT 2022): The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task

SpeechUT (EMNLP 2022): SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

SpeechLM (IEEE/ACM TASLP): SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

Speech2S (ICASSP 2023): Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation

Prosody-SpeechT5 (ICASSP 2023): Prosody-aware SpeechT5 for Expressive Neural TTS

VATLM (IEEE Transactions on Multimedia): VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

VALL-E X (Arxiv 2023): Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

VioLA (Arxiv 2023): VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation

WavLLM (Arxiv 2024): WavLLM: Towards Robust and Adaptive Speech Large Language Model

Update

  • April, 2024: WavLLM Arxiv.
  • March, 2024: SpeechLM was accepted by IEEE/ACM Transactions on Audio, Speech, and Language Processing.
  • May, 2023: VioLA Arxiv.
  • May, 2023: VATLM was accepted by IEEE Transactions on Multimedia.
  • March, 2023: VALL-E X Arxiv and Demo.
  • February, 2023: Speech2S and Prosody-SpeechT5 were accepted by ICASSP 2023.
  • [HuggingFace Integration] February, 2023: SpeechT5 models are on HuggingFace.
  • [Model Release] November, 2022: VATLM models are released.
  • November, 2022: VATLM Arxiv.
  • November, 2022: Speech2S Arxiv.
  • [Model Release] October, 2022: SpeechUT models are released.
  • October, 2022: SpeechUT was accepted by EMNLP 2022.
  • [Model Release] October, 2022: SpeechLM models are released.
  • September, 2022: SpeechLM Arxiv.
  • [Evaluation] June, 2022: The end-to-end ST system YiTrans achieved top results on IWSLT 2022 shared tasks.
  • June, 2022: Speech2C was accepted by InterSpeech 2022.
  • [Model Release] May, 2022: Speech2C models are released.
  • [Model Release] April, 2022: SpeechT5 models are released.
  • March, 2022: Speech2C Arxiv.
  • February, 2022: SpeechT5 was accepted by ACL 2022.
  • October, 2021: SpeechT5 Arxiv.

Pre-Trained Models

| Model | Pre-training Dataset | Fine-tuning Dataset | Download |
| --- | --- | --- | --- |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | - | HuggingFace / Google Drive |
| SpeechT5 Base | 960 hrs LibriSpeech + LibriSpeech LM Dataset | 100 hrs LibriSpeech | HuggingFace / Google Drive |
| SpeechT5 Large | 60k hrs Libri-Light + LibriSpeech LM Dataset | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | - | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 10 hrs LibriSpeech | Google Drive |
| Speech2C | 960 hrs LibriSpeech | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | - | Google Drive |
| SpeechLM-H Base | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Google Drive |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-De CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ca CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Ar CoVoST-2 | Azure Storage |
| SpeechLM-P Base | 960 hrs LibriSpeech + 40M Text | En-Tr CoVoST-2 | Azure Storage |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | - | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | 960 hrs LibriSpeech | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-De CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ca CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Ar CoVoST-2 | Google Drive |
| SpeechLM-P Large | 60k hrs LibriLight + 40M Text | En-Tr CoVoST-2 | Google Drive |
| SpeechUT Base (ASR) | 960 hrs LibriSpeech + 40M Text | - | Azure Storage |
| SpeechUT Base (ASR) | 960 hrs LibriSpeech + 40M Text | 100 hrs LibriSpeech | Azure Storage |
| SpeechUT Large (ASR) | 60k hrs LibriSpeech + 40M Text | - | Azure Storage |
| SpeechUT Large (ASR) | 60k hrs LibriSpeech + 40M Text | 960 hrs LibriSpeech | Azure Storage |
| SpeechUT Base (En-De) | 960 hrs LibriSpeech + 408 hrs MuST-C v1 + 4.6M Text | - | Azure Storage |
| SpeechUT Base (En-De) | 960 hrs LibriSpeech + 408 hrs MuST-C v1 + 4.6M Text | En-De MuST-C v1 | Azure Storage |
| SpeechUT Base (En-Es) | 960 hrs LibriSpeech + 504 hrs MuST-C v1 + 15M Text | - | Azure Storage |
| SpeechUT Base (En-Es) | 960 hrs LibriSpeech + 504 hrs MuST-C v1 + 15M Text | En-Es MuST-C v1 | Azure Storage |
| SpeechUT Base (En-Fr) | 960 hrs LibriSpeech + 492 hrs MuST-C v1 + 40M Text | - | Azure Storage |
| SpeechUT Base (En-Fr) | 960 hrs LibriSpeech + 492 hrs MuST-C v1 + 40M Text | En-Fr MuST-C v1 | Azure Storage |

SpeechT5 Introduction

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modality-specific (speech/text) pre-nets and post-nets. After the pre-nets preprocess the input speech/text, the shared encoder-decoder network models the sequence-to-sequence transformation, and the post-nets then generate output in the speech or text modality from the decoder output.

[Figure: SpeechT5 model architecture]
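
The data flow described above can be sketched schematically as follows. This is only an illustrative sketch: the module names, dimensions, and the use of a single shared pre-net per modality are assumptions made for clarity, not the classes used in this repository.

# Schematic sketch of the SpeechT5 data flow: modality-specific pre-net -> shared
# encoder-decoder -> modality-specific post-net. All names and shapes are illustrative.
import torch
import torch.nn as nn

class UnifiedEncoderDecoder(nn.Module):
    def __init__(self, d_model=768, vocab_size=1000, n_mels=80):
        super().__init__()
        self.backbone = nn.Transformer(d_model=d_model, batch_first=True)  # shared encoder-decoder
        self.speech_prenet = nn.Linear(n_mels, d_model)        # log-Mel frames -> hidden
        self.text_prenet = nn.Embedding(vocab_size, d_model)   # token ids -> hidden
        self.speech_postnet = nn.Linear(d_model, n_mels)       # hidden -> log-Mel frames
        self.text_postnet = nn.Linear(d_model, vocab_size)     # hidden -> vocabulary logits

    def forward(self, src, tgt, src_modality, tgt_modality):
        enc_in = self.speech_prenet(src) if src_modality == "speech" else self.text_prenet(src)
        dec_in = self.speech_prenet(tgt) if tgt_modality == "speech" else self.text_prenet(tgt)
        hidden = self.backbone(enc_in, dec_in)                  # shared sequence-to-sequence model
        post = self.speech_postnet if tgt_modality == "speech" else self.text_postnet
        return post(hidden)

model = UnifiedEncoderDecoder()
speech = torch.randn(2, 120, 80)                                # (batch, frames, mel bins)
text = torch.randint(0, 1000, (2, 30))                          # (batch, tokens)
out = model(speech, text, src_modality="speech", tgt_modality="text")  # ASR-style direction
print(out.shape)                                                # torch.Size([2, 30, 1000])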

Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
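
The mixing step can be pictured roughly as follows: hidden states are snapped to their nearest entries in a shared codebook of latent units, and a random subset of positions is replaced by those quantized vectors. This is only a sketch of the idea under assumed hyper-parameters (codebook size, mixing probability, Euclidean distance); the quantizer used in the paper differs in detail.

# Rough sketch of cross-modal vector quantization mixing. Codebook size, mixing
# probability, and the distance metric are illustrative assumptions.
import torch

def mix_with_latent_units(states, codebook, p_mix=0.5):
    # states: (batch, seq, dim); codebook: (num_units, dim)
    dists = torch.cdist(states, codebook.unsqueeze(0).expand(states.size(0), -1, -1))
    nearest = dists.argmin(dim=-1)                   # index of the closest latent unit
    quantized = codebook[nearest]                    # (batch, seq, dim)
    mask = torch.rand(states.shape[:2], device=states.device) < p_mix
    return torch.where(mask.unsqueeze(-1), quantized, states)

states = torch.randn(2, 50, 768)
codebook = torch.randn(100, 768)
print(mix_with_latent_units(states, codebook).shape)  # torch.Size([2, 50, 768])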

SpeechT5 Downstream Task Performance

We evaluate our models on typical spoken language processing tasks, including automatic speech recognition, text-to-speech, speech-to-text translation, voice conversion, speech enhancement, and speaker identification.

Automatic Speech Recognition

Evaluation on LibriSpeech (WER)

| Model | LM | dev-clean | dev-other | test-clean | test-other |
| --- | --- | --- | --- | --- | --- |
| wav2vec 2.0 Base | - | 6.1 | 13.5 | 6.1 | 13.3 |
| HuBERT Base | - | 5.5 | 13.1 | 5.8 | 13.3 |
| Baseline (w/o CTC) | - | 5.8 | 12.3 | 6.2 | 12.3 |
| Baseline | - | 4.9 | 11.7 | 5.0 | 11.9 |
| SpeechT5 (w/o CTC) | - | 5.4 | 10.7 | 5.8 | 10.7 |
| SpeechT5 | - | 4.3 | 10.3 | 4.4 | 10.4 |
| DiscreteBERT | 4-gram | 4.0 | 10.9 | 4.5 | 12.1 |
| wav2vec 2.0 Base | 4-gram | 2.7 | 7.9 | 3.4 | 8.0 |
| HuBERT Base | 4-gram | 2.7 | 7.8 | 3.4 | 8.1 |
| wav2vec 2.0 Base | Transf. | 2.2 | 6.3 | 2.6 | 6.3 |
| Baseline | Transf. | 2.3 | 6.3 | 2.5 | 6.3 |
| SpeechT5 | Transf. | 2.1 | 5.5 | 2.4 | 5.8 |

Text-to-Speech

Evaluation on LibriTTS

| Model | Naturalness | MOS | CMOS |
| --- | --- | --- | --- |
| Ground Truth | - | 3.87 | - |
| Baseline | 2.76 | 3.56 | 0 |
| SpeechT5 | 2.91 | 3.65 | +0.290 |

Speech Translation

Evaluation on MuST-C v1 (BLEU)

| Model | EN-DE | EN-FR |
| --- | --- | --- |
| Fairseq ST | 22.70 | 32.90 |
| ESPnet ST | 22.91 | 32.69 |
| Adapter Tuning | 24.63 | 34.98 |
| Baseline | 23.43 | 33.76 |
| SpeechT5 (w/o initializing decoder) | 24.44 | 34.5 |
| SpeechT5 | 25.18 | 35.30 |

Voice Conversion

Evaluation on CMU Arctic

| Model | WER (bdl to slt) | WER (clb to slt) | MCD (bdl to slt) | MCD (clb to slt) |
| --- | --- | --- | --- | --- |
| VTN w/ ASR | 11.1 | 10.9 | 6.5 | 6.11 |
| VTN w/ TTS | 7.6 | 9.1 | 6.33 | 13.3 |
| Many-to-many VTN | - | - | 6.13 | 5.97 |
| Baseline | 21.5 | 10.8 | 6.26 | 6.16 |
| SpeechT5 | 7.8 | 6.4 | 5.93 | 5.87 |

Speech Enhancement

Evaluation on the WSJ0 Hipster Ambient Mixtures (WHAM!) dataset (WER)

| Model | WER |
| --- | --- |
| Ground Truth Speech | 3.2 |
| Noisy Speech | 76.1 |
| Baseline | 10.9 |
| SpeechT5 | 8.9 |

Speaker Identification

Evaluation on VoxCeleb1

| Model | Acc |
| --- | --- |
| SUPERB, wav2vec 2.0 Base | 75.18% |
| SUPERB, HuBERT Base | 81.42% |
| SUPERB, HuBERT Large | 90.33% |
| SpeechNet, single task | 86.00% |
| SpeechNet, multi-task with TTS | 87.90% |
| Thin ResNet-34 | 89.00% |
| Baseline | 91.92% |
| SpeechT5 | 96.49% |

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the FAIRSEQ and ESPnet projects.

Microsoft Open Source Code of Conduct

Reference

If you find our work useful in your research, please cite the following papers:

@article{Ao2021SpeechT5,
  title   = {SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing},
  author  = {Junyi Ao and Rui Wang and Long Zhou and Chengyi Wang and Shuo Ren and Yu Wu and Shujie Liu and Tom Ko and Qing Li and Yu Zhang and Zhihua Wei and Yao Qian and Jinyu Li and Furu Wei},
  eprint={2110.07205},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  year={2021}
}
@article{Ao2022Speech2C,
  title   = {Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data},
  author  = {Junyi Ao and Ziqiang Zhang and Long Zhou and Shujie Liu and Haizhou Li and Tom Ko and Lirong Dai and Jinyu Li and Yao Qian and Furu Wei},
  eprint={2203.17113},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  year={2022}
}
@article{Zhang2022Yitrans,
  title   = {The YiTrans End-to-End Speech Translation System for IWSLT 2022 Offline Shared Task},
  author  = {Zhang, Ziqiang and Ao, Junyi and Zhou, Long and Liu, Shujie and Wei, Furu and Li, Jinyu},
  eprint={2206.05777},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}
@article{zhang2022speechut,
  title   = {SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training},
  author  = {Zhang, Ziqiang and Zhou, Long and Ao, Junyi and Liu, Shujie and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint={2210.03730},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}
@article{zhang2022speechlm,
  title   = {SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data},
  author  = {Zhang, Ziqiang and Chen, Sanyuan and Zhou, Long and Wu, Yu and Ren, Shuo and Liu, Shujie and Yao, Zhuoyuan and Gong, Xun and Dai, Lirong and Li, Jinyu and Wei, Furu},
  eprint={2209.15329},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  year={2022}
}

Contact Information

For help or issues using SpeechT5 models, please submit a GitHub issue.

For other communications related to SpeechT5, please contact Long Zhou ([email protected]).

Contributors

ajyy, hollance, kunazure, mechanicalsea, microsoft-github-policy-service[bot], wszlong, xiaoshanhsj, zqs01, zz12375


Issues

SpeechT5: How to get speaker embeddings?

Hi, I found that there must be 3 columns in the audio manifest TSV file. Is there a tutorial or example on how to get the speaker embeddings using my own dataset? Is it possible to pretrain a model on a dataset without speaker labels?
Thanks 😊
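
For reference, one common way to produce utterance-level speaker embeddings of the x-vector kind is an off-the-shelf model such as SpeechBrain's spkrec-xvect-voxceleb. The snippet below is only a hedged sketch: the model name, the 16 kHz mono assumption, and the L2 normalization are assumptions, not the confirmed recipe used for the released manifests.

# Hypothetical sketch: extract an utterance-level x-vector with SpeechBrain.
import numpy as np
import torch
import torch.nn.functional as F
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

waveform, sr = torchaudio.load("utt1.wav")      # (channels, samples); assumed 16 kHz mono
with torch.no_grad():
    emb = classifier.encode_batch(waveform)     # (1, 1, 512)
    emb = F.normalize(emb.squeeze(), dim=-1)    # unit-length 512-dim speaker embedding

np.save("utt1_xvector.npy", emb.cpu().numpy())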

VATLM: ModuleNotFoundError: No module named 'fairseq.data.audio.multi_corpus_dataset_audio'

Hi, there. I'm trying to extract features with the released VATLM pre-trained models. Here's what I did.

Firstly, I tried loading the pre-trained model:

# cwd: ..../av_hubert
import fairseq
import vathubert
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(["./finetune_large_vox2_v_433h.pt"])

but an error occurs: AssertionError: Could not infer task type from {'_name': 'vat_hubert_pretraining'....
This happens because the task vat_hubert_pretraining was not successfully registered with Fairseq: the sub-package vathubert.tasks is never initialized, since it contains no __init__.py (without an __init__.py it is only a namespace package rather than a regular package). To fix it, I added an __init__.py to ./vathubert/tasks/ with the following content:

from .vathubert_pretraining import VATHubertPretrainingTask

This registers the task with Fairseq. However, another error then occurs at:

from fairseq.data.audio.multi_corpus_dataset_audio import MultiCorpusDataset

ModuleNotFoundError: No module named 'fairseq.data.audio.multi_corpus_dataset_audio'

I read the Fairseq source code and could not find the file multi_corpus_dataset_audio.py in the directory fairseq/data/audio.

Did I miss anything? Or is there some bug in the code? Any help is appreciated, Thanks!

How to fine-tune SID on the pretrained model?

I want to run the pretrained model for SID, but I get an error like this:
generate_class.py: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')
If I have to fine-tune SID first, I run the SID fine-tuning command and get this error:
fairseq-train: error: argument --task: invalid choice: 'speecht5' (choose from 'masked_lm', 'cross_lingual_lm', 'translation', 'hubert_pretraining', 'online_backtranslation', 'denoising', 'multilingual_denoising', 'translation_multi_simple_epoch', 'legacy_masked_lm', 'translation_from_pretrained_bart', 'language_modeling', 'multilingual_translation', 'sentence_prediction', 'sentence_ranking', 'translation_lev', 'audio_pretraining', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'speech_to_text', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt') SID finetuning finished
So, how do I run the model correctly? Thanks!
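
For what it's worth, this "invalid choice: 'speecht5'" error usually means the custom task was never registered with fairseq, i.e. the SpeechT5 user directory was not passed via --user-dir (or common.user_dir in the Hydra configs shown elsewhere on this page). A hedged check from Python, with a placeholder path, looks like this:

# Hypothetical check: the 'speecht5' task only becomes a valid --task choice after the
# SpeechT5 user directory has been imported. The path below is a placeholder assumption.
from argparse import Namespace
from fairseq import tasks, utils

utils.import_user_module(Namespace(user_dir="/path/to/SpeechT5/SpeechT5/speecht5"))
print("speecht5" in tasks.TASK_REGISTRY)   # True once the task is registered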

SpeechLM: How to resample phonemes' frame rate from 30ms to 20ms?

Hi, thank you for your great work.
According to the appendix of the paper, a Kaldi model is used to convert audio into phonemes. I have trained a Kaldi model with a frame rate of 30 ms.
To generate the SpeechLM Base labels (10 ms), I simply repeat each phoneme 3 times, and it works fine.
But the SpeechLM Large labels (20 ms) cannot be generated simply by repeating phonemes. Could you provide some details about this conversion?
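
For context, one generic way to re-map frame-level labels between frame rates (not necessarily what the authors did for the 20 ms SpeechLM Large labels) is to assign each target frame the phoneme whose source frame covers its centre time:

# Hypothetical sketch: resample frame-level phoneme labels from a 30 ms hop to a 20 ms
# hop by nearest-time lookup. A generic approach, not a confirmed SpeechLM recipe.
def resample_labels(labels_30ms, src_hop=0.03, tgt_hop=0.02):
    duration = len(labels_30ms) * src_hop
    n_tgt = int(round(duration / tgt_hop))
    out = []
    for i in range(n_tgt):
        t = (i + 0.5) * tgt_hop                      # centre time of the target frame
        j = min(int(t / src_hop), len(labels_30ms) - 1)
        out.append(labels_30ms[j])
    return out

print(resample_labels(["sil", "sil", "AH", "AH"]))   # 4 x 30 ms -> 6 x 20 ms labels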

SpeechT5 Speech Enhancement

Hi,

Could you tell me where I can find the fine-tuned SpeechT5 for the speech enhancement task? Also, a link to how I can load and use it would be very useful.

Thank you,
Andrei

Whether fp16 is enabled in VATLM during pre-training

Hi there. Recently I have been working on extracting features using the model "VatLM Large VoxCeleb2 + LRS3 + paired audio+text+audio LRS-433h visual".
From the config file, fp16 is enabled, and it is not overridden in the launching script:
python /path/to/fairseq/fairseq_cli/hydra_train.py \
--config-dir /path/to/vat_hubert/vathubert/conf/pretrain --config-name large_vox_iter5.yaml \
task.data=${datapath}/fbank_lrs3_vox_tsv \
task.label_dir=${datapath}/fbank_lrs3_vox_tsv \
+task.sup_data_path=${datapath}/fbank_tedv3_phone_concat_vox_tsv \
+task.sup_manifest=${datapath}/fbank_tedv3_phone_concat_vox_tsv \
+task.onlytext_manifest=${datapath}/cantab2_vox_tsv \
+task.onlyaudio_manifest=${datapath}/fbank_giga_vox_tsv_km \
hydra.run.dir=${save_path} \
common.user_dir=/path/to/vat_hubert/vathubert \
distributed_training.distributed_world_size=${ngpu} \
optimization.update_freq=[${updatefreq}] \
dataset.max_tokens=3000 \
model.label_rate=25 \
common.log_interval=200 \
checkpoint.save_interval=5 \
+task.sample_distributions=\"0.13,0.15,0.32,0.3\" \
+criterion.banlance_loss_weights=[1.0,1.0] \
dataset.data_buffer_size=40 \
+task.use_supervised_data=True \
+task.use_extra_textdata=True \
+task.use_extra_audiodata=True \

However, when I load the pretrained model, the log suggests that fp16 is disabled (see the last field of the config dump below).

2022-12-04 11:26:57 | INFO | vathubert.tasks.vathubert_pretraining | VATHubertPretrainingTask Config {'_name': 'vat_hubert_pretraining', 'data': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'labels': ['km'], 'label_dir': '/LocalData/vatlm_related/fbankdata/fbank_lrs3_vox_tsv', 'label_rate': 25, 'sample_rate': 25, 'normalize': True, 'enable_padding': False, 'max_sample_size': 500, 'min_sample_size': 5, 'max_trim_sample_size': '${task.max_sample_size}', 'single_target': False, 'random_crop': False, 'pad_audio': True, 'pdb': False, 'stack_order_audio': 4, 'skip_verify': False, 'text_sampling_alpha': 0.2, 'split_modality_batch': False, 'image_aug': True, 'image_crop_size': 88, 'image_mean': 0.421, 'image_std': 0.165, 'modalities': ['audio', 'video'], 'is_s2s': False, 'tokenizer_bpe_name': None, 'tokenizer_bpe_model': None, 'noise_wav': None, 'noise_prob': 0.0, 'noise_snr': '0', 'noise_num': 1, 'fine_tuning': False, 'use_supervised_data': True, 'sup_data_path': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sup_manifest': '/LocalData/vatlm_related/fbankdata/fbank_tedv3_phone_concat_vox_tsv', 'sample_distributions': '0.13,0.15,0.32,0.3', 'use_extra_textdata': True, 'onlytext_manifest': '/LocalData/vatlm_related/fbankdata/cantab2_vox_tsv', 'use_extra_audiodata': True, 'onlyaudio_manifest': '/LocalData/vatlm_related/fbankdata/fbank_giga_vox_tsv_km'} 2022-12-04 11:26:57 | INFO | vathubert.models.vathubert | HubertModel Config: {'_name': 'vat_hubert', 'label_rate': 25, 'modalities': '${task.modalities}', 'extractor_mode': default, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length_audio': 10, 'mask_prob_audio': 0.8, 'mask_length_image': 5, 'mask_prob_image': 0.3, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'resnet_relu_type': 'prelu', 'resnet_weights': None, 'sim_type': 'cosine', 'sub_encoder_layers': 0, 'audio_feat_dim': 104, 'modality_dropout': 0.5, 'audio_dropout': 0.5, 'modality_fuse': 'concat', 'selection_type': 'same_seq', 'masking_type': 'input', 'decoder_embed_dim': 768, 'decoder_ffn_embed_dim': 3072, 'decoder_layers': 6, 'decoder_layerdrop': 0.0, 'decoder_attention_heads': 4, 'decoder_learned_pos': False, 'decoder_normalize_before': False, 'no_token_positional_embeddings': False, 'decoder_dropout': 0.1, 'decoder_attention_dropout': 0.1, 'decoder_activation_dropout': 0.0, 'max_target_positions': 2048, 'share_decoder_input_output_embed': False, 'no_scale_embedding': True, 'layer_type': transformer, 'pos_conv_depth': 1, 'max_positions': 100000, 'checkpoint_activations': False, 'required_seq_len_multiple': 1, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

In my experience, inconsistent fp16 settings between linear probing and pre-training often lead to degraded performance, so I would like to know whether fp16 was enabled during pre-training. Thanks!

No code for Speech Synthesis

Code for fine-tuning speech synthesis with the predicted log Mel-filterbank features, as described in the SpeechT5 paper, is not available.

Is it possible to provide this?

Many thanks

SpeechT5 pretrain

Thanks for your previous reply!
But now I have another question: when I use fp16 for pre-training, I get the following error:
image
It seems that fairseq's fp16 support is not compatible with my version of torch.

reproduction steps for inference

Are the required preprocessing steps all of the following:

  1. acquire dataset, checkpoint, source
  2. train spm
  3. hubert feature extraction
  4. run fairseq

or are there any missing parts?

SpeechLM: KeyError: 'text_transformer' while initializing SpeechLMConfig

Hi, I'm trying to use the script in the README to extract features using the pre-trained models. I used the model speechlmp_base_asr_checkpoint_best.pt, but I encountered an error while initializing SpeechLMConfig:

Traceback (most recent call last):
File "/remote-home/jzhan/SpeechT5/SpeechLM/test.py", line 7, in
cfg = SpeechLMConfig(checkpoint['cfg']['model'])
File "/SpeechT5/SpeechLM/SpeechLM.py", line 128, in init
self.update(cfg)
File "/SpeechT5/SpeechLM/SpeechLM.py", line 132, in update
self.text_transformer = TransformerConfig(model_cfg['text_transformer'])
KeyError: 'text_transformer'

Am I missing any model files?
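
A quick diagnostic before constructing SpeechLMConfig is to look at what the checkpoint actually stores under cfg['model']; fine-tuned ASR checkpoints may carry a wrapper config rather than the bare SpeechLM config with a text_transformer section. This is a generic check, not a fix from the authors, and the path is a placeholder:

# Generic diagnostic: inspect the stored model config to see whether it contains a
# 'text_transformer' section at all.
import torch

ckpt = torch.load("speechlmp_base_asr_checkpoint_best.pt", map_location="cpu")
model_cfg = ckpt["cfg"]["model"]
print(model_cfg)                                             # full stored model config
print("has text_transformer:", "text_transformer" in model_cfg)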

SpeechT5: Finetuned SID model

Would it be possible to share the SpeechT5 model finetuned on VoxCeleb1 for the SID task? (I noticed the re-implemented VC and TTS models shared on HF).

SpeechLM: How to train 'Phone-unit tokenizer for speech' using kaldi?

Hello, congratulations on your success in this paper!

I would like to ask whether there are any training scripts for the 'Phone-unit tokenizer for speech' part, which uses a Kaldi recipe to "train a hybrid GMM-HMM ASR model on 100 hours labeled LibriSpeech data".

I'm new to speech processing, especially the traditional HMM models used by Kaldi, so I would be very thankful for an answer.

Thanks a lot!

SpeechT5 Pretrain ERROR

After 95400 pre-training updates, the following error occurred:

File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/multitask_dataset.py", line 58, in getitem
sample = self.datasets[dataset_idx][sample_idx]
File "/SpeechT5/SpeechT5/SpeechT5/speecht5/data/text_dataset.py", line 218, in getitem
assert (source[1:-1] >= 1).all()
IndexError: slice() cannot be applied to a 0-dim tensor

Does the problem come from the text data preparation?

Missing speecht5 task

Hello,

The following inference instructions seem outdated:
https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#inference-1

The script SpeechT5/scripts/generate_speech.py triggers this error when using --task speecht5:

generate_speech.py: error: argument --task: invalid choice: 'speecht5' (choose from 'hubert_pretraining', 'denoising', 'multilingual_denoising', 'translation', 'multilingual_translation', 'translation_from_pretrained_bart', 'translation_lev', 'language_modeling', 'speech_to_text', 'legacy_masked_lm', 'online_backtranslation', 'simul_speech_to_text', 'simul_text_to_text', 'audio_pretraining', 'semisupervised_translation', 'sentence_prediction', 'cross_lingual_lm', 'translation_from_pretrained_xlm', 'masked_lm', 'sentence_ranking', 'translation_multi_simple_epoch', 'multilingual_masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt')

Thanks

Text data preparation

Hello.
First of all, thank you for your great work.

Unfortunately, I have some issues with preparing the text data for pre-training and ASR fine-tuning.
I have followed the instructions provided, but I cannot figure out how to preprocess the text data using SPM and fairseq.
How can I create text_train.tsv / text_valid.tsv?
I also have some difficulty creating the label data for the text data; what format should I use?
Could you provide more details or examples of the manifest for the text?
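
As a rough illustration of the SPM half of this pipeline (file names, vocabulary size, and model type below are placeholder assumptions, not the repository's exact settings), one can train a SentencePiece model on the raw text and encode each line into pieces that fairseq-preprocess can then binarize:

# Hypothetical sketch: train a SentencePiece model and encode text into subword pieces.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="librispeech_lm_corpus.txt",      # one sentence per line
    model_prefix="spm_unigram",
    vocab_size=10000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
with open("librispeech_lm_corpus.txt") as fin, open("text_train.spm.txt", "w") as fout:
    for line in fin:
        fout.write(" ".join(sp.encode(line.strip(), out_type=str)) + "\n")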

How to load the pretrained models in PyTorch

Hi, how can I instantiate the SpeechT5 model in a PyTorch script and load the provided pretrained weights into it?

Something similar to this (which doesn't work, by the way):
image
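
The loading pattern that the repository suggests (a fuller version is quoted in the "Difficulties loading pre-trained weights!" issue further down this page) is roughly the following; all paths are placeholders:

# Condensed sketch of the README loading pattern: load the checkpoint, patch the task
# config paths, set up the task, build the model, then load the weights.
import torch
from speecht5.tasks.speecht5 import SpeechT5Task
from speecht5.models.speecht5 import T5TransformerModel

checkpoint = torch.load("/path/to/speecht5_checkpoint", map_location="cpu")
checkpoint["cfg"]["task"].t5_task = "pretrain"
checkpoint["cfg"]["task"].hubert_label_dir = "/path/to/hubert_label"
checkpoint["cfg"]["task"].data = "/path/to/tsv_file"

task = SpeechT5Task.setup_task(checkpoint["cfg"]["task"])
model = T5TransformerModel.build_model(checkpoint["cfg"]["model"], task)
model.load_state_dict(checkpoint["model"])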

Speech2C "Inf detected in output" while training

Hello!

Thank you for the great work again! I tried training Speech2C and got this error after 49 epochs:

[2022-11-07 00:33:16,340][fairseq.nan_detector][WARNING] - Inf detected in output of , shape: torch.Size([1464, 505]), forward
Some training details:
dataset: libri 360
k-means trained on: libri 100

config: https://drive.google.com/file/d/1Ms5m-cuTrv43xsntHBdM_PEWaXtGGMOR/view?usp=sharing
hydra_log: https://drive.google.com/file/d/1HWvXqUGhNU-LnKNRj52HAbXPR-GqOVBU/view?usp=sharing

Could you let me know whether this has ever happened in your training setup, or whether you know where I am going wrong?

Thank You!

Example values for finetuning asr

Hi, congratulations on your achievement in this great work!
This is my first time using fairseq, so could you please give the exact values or an example of the parameters in the "ASR finetune" training and inference parts, i.e. these variables:

DATA_ROOT=
SAVE_DIR=
TRAIN_SET=
VALID_SET=
LABEL_DIR=
BPE_TOKENIZER=
USER_DIR=
PT_CHECKPOINT_PATH=

Thanks a lot!
(And the steps for how to obtain these values would be great!)

SpeechLM

Hello, thanks for your great work. However, I want to ask a question. I notice that there is a model named Fast Text2Unit Model in the SpeechLM project, but I didn't find any usage of it. Is this model used to transform text (transcribed from speech) into units?

Combining speech and text in the encoder

Hi,

Do you think it would be possible to combine both speech and text as input to the encoder? I'm looking to then decode text based on this multimodal input. Should I be looking at the MultitaskDataset? Would the s2t task work for this?

Thanks.

Using SpeechT5 Large for TTS

Hello, thank you so much for providing these models and code along with all the documentation. The HuggingFace integration is very helpful for people like me whose specialty is not ML :) I tried out the TTS model available on HuggingFace and the results are very good, but I'm curious what the difference would be like using the larger SpeechT5 model.

My goal is to prepare the SpeechT5 Large model (60k hrs Libri-Light + LibriSpeech LM Dataset) for TTS in the same way that the smaller model on HuggingFace is tuned for TTS. I'm a little confused though on how the training was done for the smaller model in order to prepare it for TTS. I looked at the manifest and it says: "speecht5_tts.pt are reimplemented Text-to-Speech fine-tuning on the released manifest but with a smaller batch size or max updates (Ensure the manifest is ok)." Does this mean that the SpeechT5 for TTS model was completely retrained from scratch with different batch size/max updates, or was it fine-tuned from the SpeechT5 base model (960 hrs LibriSpeech + LibriSpeech LM Dataset)?

The manifest also says: "This manifest is an attempt to recreate the Text-to-Speech recipe used for training SpeechT5. This manifest was constructed using LibriTTS clean datasets, including train-clean-100 and train-clean-360 for training, dev-clean for validation, and test-clean for evaluation." Does this mean that it was trained from scratch using 100 + 360 = 460 hours of LibriTTS data, or was it fine-tuned on those 460 hours of data?

Thank you!

VATLM: Error when loading finetuned checkpoints for infer_s2s

Hi,

I am trying to load the fine-tuned models provided for VATLM. However, I encounter an error where loading tries to access local storage paths from the machine where the model was trained. This occurs with all the models you have shared.

The error is:

`File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/distributed/utils.py", line 328, in distributed_main
main(cfg, **kwargs)

File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 93, in main
return _main(cfg, h)

File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/infer_s2s.py", line 115, in _main
models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([cfg.common_eval.path])

File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task
model = task.build_model(cfg.model)

File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
model = models.build_model(cfg, self)

File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/models/init.py", line 96, in build_model
return model.build_model(cfg, task)

File "/home/projects/SpeechT5/VATLM/vat_hubert/vathubert/models/vathubert_asr.py", line 400, in build_model
state = checkpoint_utils.load_checkpoint_to_cpu(

File "/home/projects/SpeechT5/VATLM/fairseq/fairseq/checkpoint_utils.py", line 303, in load_checkpoint_to_cpu
with open(local_path, "rb") as f:

FileNotFoundError: [Errno 2] No such file or directory: '/mnt/default/v-qiushizhu/vatlm_related/results/fbank_large_vox_pretrain_iter5_ext_audio_1_32ngpu_2updatefreq/checkpoints/checkpoint_388_600000.pt'

I noticed that the shared VATLM checkpoints don't carry cfg.model.w2v_args (it is None during loading).

Would be great if you could help in resolving this.
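
One generic workaround for checkpoints that embed absolute paths from the training machine is to patch the stored config and re-save the file. The field name w2v_path below is an assumption based on AV-HuBERT-style ASR configs, not a confirmed attribute of the VATLM checkpoints:

# Hypothetical workaround: overwrite the stale absolute path stored inside the
# fine-tuned checkpoint so build_model() can find the pre-trained weights locally.
import torch

ckpt = torch.load("finetune_large_vox2_v_433h.pt", map_location="cpu")
model_cfg = ckpt["cfg"]["model"]
print(model_cfg.w2v_path)                                    # stale /mnt/default/... path
model_cfg.w2v_path = "/local/path/to/vatlm_pretrained.pt"    # point at a local copy
torch.save(ckpt, "finetune_large_vox2_v_433h_patched.pt")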

Can you provide a voice conversion finetune recipe?

First, thank you for your amazing achievements.

I tried ASR fine-tuning and it works well!
So I'd like to try other tasks too, such as voice conversion.

Could you provide a voice conversion fine-tuning and conversion recipe?

ArgumentError in SpeechT5Task.add_args() when running fairseq-generate

Hi,

I tried running fairseq-generate according to the instructions in the README. It crashed at tasks/speecht5.py line 173, with the following exception:

argparse.ArgumentError: argument --mask-length: conflicting option string: --mask-length

I used the debugger and found that a --mask-length argument is already configured in the parser; it was added from fairseq/fairseq/options.py line 149, where the wav2vec2 arguments are registered. Apparently fairseq's generate.py sets wav2vec2 as the default for --arch.

I tried manually specifying the --arch argument as t5_transformer_base or t5_transformer_base_asr, but then the argument parser complains that --path is not a supported argument.

My versions are SpeechT5 commit f9b059b and fairseq commit e35c593c84bd84d5c7777ef7ace98dab508ff88e.

Any idea how to fix it, preferably without modifying fairseq code? Thanks.

Hydra fine-tuning for SpeechT5?

Hello,

I would like to do a grid search over hyper-parameters using ./SpeechT5/speecht5. I keep getting:

omegaconf.errors.MissingMandatoryValue: Missing mandatory value: model

Any recommendation?

Thanks!

Difficulties loading pre-trained weights!

Hello!

Thank you very much for adding a code snippet to outline how to load pre-trained SpeechT5 weights, super helpful for understanding how to process the data and load the task 😊

I've been attempting to load the 'base' pre-trained weights according to the code snippet provided here:

import torch
from speecht5.tasks.speecht5 import SpeechT5Task
from speecht5.models.speecht5 import T5TransformerModel

checkpoint = torch.load('/path/to/speecht5_checkpoint')

checkpoint['cfg']['task'].t5_task = 'pretrain'
checkpoint['cfg']['task'].hubert_label_dir = "/path/to/hubert_label"
checkpoint['cfg']['task'].data = "/path/to/tsv_file"

task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])

Steps performed:

  1. Downloaded the fine-tuned base checkpoint from https://github.com/microsoft/SpeechT5#pre-trained-models
  2. Create a dummy dict of HuBERT labels using the instructions provided here, with n_clusters=500:
for x in $(seq 0 $((n_clusters - 1))); do
  echo "$x 1"
done >> $lab_dir/dict.km.txt
  3. Download the dummy text dictionary using the download link provided in @Ajyy's previous issue response (the Google Drive link). As outlined, the text dict should be placed under data and the HuBERT labels under hubert_label_dir.
  4. Using the HuBERT labels and text dict, run the aforementioned code snippet to load the pre-trained model. Loading the task throws an error:
Full traceback:
task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:242, in Dictionary.add_from_file(self, f)
    241 try:
--> 242     line, field = line.rstrip().rsplit(" ", 1)
    243     if field == "#fairseq:overwrite":

ValueError: not enough values to unpack (expected 2, got 1)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])

File ~/SpeechT5/SpeechT5/speecht5/tasks/speecht5.py:301, in SpeechT5Task.setup_task(cls, args, **kwargs)
    299 if args.t5_task == "pretrain":
    300     dicts["hubert"] = [Dictionary.load(f"{args.hubert_label_dir}/dict.{label}.txt") for label in args.hubert_labels]
--> 301     dicts["text"] = Dictionary.load(op.join(args.data, "dict.txt"))
    302 else:
    303     if config is None:

File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:216, in Dictionary.load(cls, f)
    207 """Loads the dictionary from a text file with the format:
    208 
    209 ```
   (...)
    213 ```
    214 """
    215 d = cls()
--> 216 d.add_from_file(f)
    217 return d

File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:227, in Dictionary.add_from_file(self, f)
    225 try:
    226     with open(PathManager.get_local_path(f), "r", encoding="utf-8") as fd:
--> 227         self.add_from_file(fd)
    228 except FileNotFoundError as fnfe:
    229     raise fnfe

File ~/SpeechT5/SpeechT5/fairseq/fairseq/data/dictionary.py:260, in Dictionary.add_from_file(self, f)
    258     self.add_symbol(word, n=count, overwrite=overwrite)
    259 except ValueError:
--> 260     raise ValueError(
    261         "Incorrect dictionary format, expected '<token> <cnt> [flags]'"
    262     )

ValueError: Incorrect dictionary format, expected '<token> <cnt> [flags]'
  5. Do we need to add a flag column to the dummy text dict? I appended a column of zeros to the dummy text dict (giving token, count, 0). The task can then be loaded:
    task = SpeechT5Task.setup_task(checkpoint['cfg']['task'])
  6. Error with loading the weights:
model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
model.load_state_dict(checkpoint['model'])
Full traceback:
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [17], in <cell line: 2>()
      1 model = T5TransformerModel.build_model(checkpoint['cfg']['model'], task)
----> 2 model.load_state_dict(checkpoint['model'])

File ~/SpeechT5/SpeechT5/speecht5/models/speecht5.py:1040, in T5TransformerModel.load_state_dict(self, state_dict, strict, model_cfg, args)
   1036     m_state_dict = {
   1037         key.replace(f"{m}.", ""): value for key, value in state_dict.items() if key.startswith(f"{m}.")
   1038     }
   1039     if hasattr(self, m):
-> 1040         self._modules[m].load_state_dict(m_state_dict, False)
   1041 return self

File ~/venv/lib/python3.8/site-packages/torch/nn/modules/module.py:1497, in Module.load_state_dict(self, state_dict, strict)
   1492         error_msgs.insert(
   1493             0, 'Missing key(s) in state_dict: {}. '.format(
   1494                 ', '.join('"{}"'.format(k) for k in missing_keys)))
   1496 if len(error_msgs) > 0:
-> 1497     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   1498                        self.__class__.__name__, "\n\t".join(error_msgs)))
   1499 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for TransformerEncoder:
	size mismatch for proj.weight: copying a param with shape torch.Size([81, 768]) from checkpoint, the shape in current model is torch.Size([7, 768]).
	size mismatch for proj.bias: copying a param with shape torch.Size([81]) from checkpoint, the shape in current model is torch.Size([7]).

I would be very grateful for some insight on these two questions:

  1. Do we need to process the dummy text data in an additional way to add the 'flag' column?
  2. Is the size mismatch error being thrown related to the saved PT checkpoint?

Many thanks for your help!

SpeechT5-tts fine-tuned on Chinese

I used a Colab notebook to fine-tune this model. When I run trainer.train(), it errors out.

in <cell line: 2>:2

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1662 in train

  1659         inner_training_loop = find_executable_batch_size(
  1660             self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size
  1661         )
❱ 1662         return inner_training_loop(
  1663             args=args,
  1664             resume_from_checkpoint=resume_from_checkpoint,
  1665             trial=trial,

/usr/local/lib/python3.9/dist-packages/transformers/trainer.py:1839 in _inner_training_loop

  1836         self.state.is_world_process_zero = self.is_world_process_zero()
  1837
  1838         # tr_loss is a tensor to avoid synchronization of TPUs through .item()
❱ 1839         tr_loss = torch.tensor(0.0).to(args.device)
  1840         # _total_loss_scalar is updated everytime .item() has to be called on tr_loss an
  1841         self._total_loss_scalar = 0.0
  1842         self._globalstep_last_logged = self.state.global_step
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be 
incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

I am using a GPU; why did this error happen?
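
A generic way to narrow down device-side asserts (not specific to this notebook) is to make CUDA errors synchronous and to check that every token id produced for the Chinese text lies inside the tokenizer's vocabulary, since out-of-range ids are one common trigger:

# Generic debugging sketch; the checkpoint name is the public microsoft/speecht5_tts model.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"    # report the failing kernel at the real call site

from transformers import SpeechT5Processor

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_tts")
ids = processor.tokenizer("你好，世界")["input_ids"]
print(ids)
print("max id:", max(ids), "vocab size:", processor.tokenizer.vocab_size)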

Speech2C training error

Hi there,

Great repo and paper. I followed the exact installation and data preparation steps for Speech2C training, and I get this error when I run the pre-training command:

AssertionError: number of labels does not match (5567 != 5566)

Any help would be appreciated!

Additionally, I also had to change:

dir: ./

in config and

common.user_dir=speech2c

to make the code work.
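
For what it's worth, a "number of labels does not match" assertion usually means the manifest and label files are off by one line. Assuming HuBERT-style manifests (a root-directory header line in the .tsv, one label line per utterance in the .km file), a quick check is:

# Hypothetical diagnostic: compare the number of utterances in the manifest with the
# number of label lines; the two counts must be equal.
with open("train.tsv") as f:
    n_audio = sum(1 for _ in f) - 1    # subtract the root-dir header line
with open("train.km") as f:
    n_label = sum(1 for _ in f)
print(n_audio, n_label)                # must match, otherwise revisit the data preparation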

About the SpeechT5 pre-training curve

Hi, congratulations on your achievement in this great work!
I ran pre-training with the given configuration, but the loss converges quickly (in about 20k updates) and then rises. I don't know whether this is normal; could you share your pre-training curve? Thanks.

SpeechUT inference and fine-tune problem

I want to fine-tune from the released pretrained SpeechUT model and also run inference with the released model fine-tuned on MuST-C En-De. However, when I reload the checkpoints, they contain some non-local paths, which cause a "FileNotFoundError". How can I solve this problem?
image

Pretraining SpeechT5: problems with batch_sampler in multitask_dataset. Should I create idx/bin files per utterance (wav), or a single pair of idx/bin files for all the data?

Hi, I want to pretrain a model using the SpeechT5 architecture. I followed the scripts given here: https://github.com/microsoft/SpeechT5/tree/main/SpeechT5#data-preparation. But I wonder whether there is a restriction in fairseq-preprocess when preparing data, because I hit this error:
image
The error is raised while batching samples from the .idx and .bin data produced by fairseq-preprocess. Here is what my batch_sampler looks like: there are 455 items in batch_sampler, and each item contains 6 elements except the last one:
image
image
So in order to run successfully, I tried to give up the last row:

batch_sampler = batch_sampler[:-2]
But then I got this:

image
  1. I think it is caused by the function np.random.choice(), and I infer from it that batch_sampler should be a list containing only one array, right?
  2. But I have no idea how that comes about; should the idx and bin files contain all of the training data or just one row of training data?
  3. What is the object that batch_sampler samples over?

Here is what my directory looks like:
image

I would really appreciate it if you could explain this. Thank you!

pretrain loss

Excuse me, what value should my pre-training loss reach before I can start fine-tuning TTS?
image
I found that my fine-tuned TTS model can generate a mel-spectrogram, but it is very different from the original mel-spectrogram.
image
Is this because the BART loss is too high?

SpeechLM: how to prepare phoneme sequence for T2U generator

Hi, I am trying to reproduce the T2U generator but have issues converting ASR transcripts into phoneme sequences. I think the phoneme sequences in dataset/LibriSpeech/fast_phone2units/genset_examples.tsv were not produced by speechlm/data_process/prepare_phn2ltr_librilm.sh: the phonemes in the former are not up-sampled, and the probability of inserted silence is less than 0.25. Is there an example of how to prepare the phoneme sequences for the T2U generator?
Thanks.
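
For context, here is a toy illustration of the kind of transformation being discussed, where phonemes are duration-expanded and a silence token is inserted with probability 0.25 (the 0.25 comes from the issue text; the repeat factors and everything else are assumptions, not the actual prepare_phn2ltr_librilm.sh logic):

# Toy sketch only: up-sample phonemes by a random repeat factor and insert silence
# with probability 0.25. Not the actual SpeechLM data-preparation script.
import random

def phones_to_units(phones, sil="sil", p_sil=0.25, min_rep=1, max_rep=3):
    out = []
    for ph in phones:
        out.extend([ph] * random.randint(min_rep, max_rep))   # duration up-sampling
        if random.random() < p_sil:
            out.append(sil)
    return out

print(phones_to_units(["HH", "AH", "L", "OW"]))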

SpeechUT inference error in en_fr checkpoint

When I use the fine-tuned En-Fr checkpoint for inference, the results contain serious errors (the BLEU score is 0, and inference takes far too long). This did not happen with the En-De and En-Es checkpoints. I wonder whether there is an error in the En-Fr checkpoint?
image

Port to Huggingface

Hey SpeechT5 team, I've been seeing some SpeechT5 activity on Hugging Face, but I haven't really seen a Hugging Face version of SpeechT5. Could you please tell me whether you are working on porting SpeechT5 to Hugging Face, or whether it is already done?

I am planning to pretrain SpeechT5 and then use it in Hugging Face for other downstream tasks.

How to pre-train on a custom dataset?

Hey there, I am looking to pre-train SpeechT5 on a custom dataset, preferably multilingual datasets. Could I please get some references, documentation, etc. as a starting point? Thanks.

[SpeechLM] About phoneme tokenizer in detail?

First of all, thanks for your great work and code.

I am studying SpeechLM and have some questions about training and inference.

  1. Could you indicate which stage you used for training? Is it the stage at #L155 below, as I expected?
    [https://github.com/kaldi-asr/kaldi/blob/master/egs/librispeech/s5/run.sh#L155]

  2. Could you indicate which decoder is used for pseudo-label generation, and share your command?
    Is it steps/decode_fmllr.sh, or online2-wav-gmm-latgen-faster directly?

Best Regards
