shivammehta25 / matcha-tts

[ICASSP 2024] 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching

Home Page: https://shivammehta25.github.io/Matcha-TTS/

License: MIT License

Makefile 0.15% Python 22.84% Cython 0.16% Shell 0.03% Jupyter Notebook 76.82%
deep-learning flow-matching machine-learning non-autoregressive probabilistic probabilistic-machine-learning text-to-speech tts tts-api tts-engines

matcha-tts's Introduction

🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching


This is the official code implementation of 🍵 Matcha-TTS [ICASSP 2024].

We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis. Our method:

  • Is probabilistic
  • Has a compact memory footprint
  • Sounds highly natural
  • Is very fast to synthesise from

Check out our demo page and read our ICASSP 2024 paper for more details.

Pre-trained models will be automatically downloaded with the CLI or gradio interface.

You can also try 🍵 Matcha-TTS in your browser on HuggingFace 🤗 spaces.

Teaser video

Watch the video

Installation

  1. Create an environment (suggested but optional)
conda create -n matcha-tts python=3.10 -y
conda activate matcha-tts
  2. Install Matcha-TTS using pip or from source
pip install matcha-tts

from source

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .
  3. Run CLI / gradio app / jupyter notebook
# This will download the required models
matcha-tts --text "<INPUT TEXT>"

or

matcha-tts-app

or open synthesis.ipynb in Jupyter Notebook

CLI Arguments

  • To synthesise from given text, run:
matcha-tts --text "<INPUT TEXT>"
  • To synthesise from a file, run:
matcha-tts --file <PATH TO FILE>
  • To batch synthesise from a file, run:
matcha-tts --file <PATH TO FILE> --batched

Additional arguments

  • Speaking rate
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0
  • Sampling temperature
matcha-tts --text "<INPUT TEXT>" --temperature 0.667
  • Euler ODE solver steps
matcha-tts --text "<INPUT TEXT>" --steps 10
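These flags can also be combined in a single call, for example:

matcha-tts --text "<INPUT TEXT>" --speaking_rate 0.9 --temperature 0.5 --steps 4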

Train with your own dataset

Let's assume we are training with LJ Speech.

  1. Download the dataset from here, extract it to data/LJSpeech-1.1, and prepare the file lists to point to the extracted data, as in item 5 of the setup instructions in the NVIDIA Tacotron 2 repo.
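For reference, each line of such a filelist pairs a wav path with its transcript, separated by |. An illustrative line (the path assumes the extraction location above):

data/LJSpeech-1.1/wavs/LJ001-0001.wav|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition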

  2. Clone and enter the Matcha-TTS repository

git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
  3. Install the package from source
pip install -e .
  4. Go to configs/data/ljspeech.yaml and change
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
to the paths of your train and validation filelists.
  5. Generate normalisation statistics with the yaml configuration file of the dataset
matcha-data-stats -i ljspeech.yaml
# Output:
#{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}

Update these values in configs/data/ljspeech.yaml under the data_statistics key.

data_statistics:  # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101


  6. Run the training script
make train-ljspeech

or

python matcha/train.py experiment=ljspeech
  • For a minimum-memory run
python matcha/train.py experiment=ljspeech_min_memory
  • For multi-GPU training, run
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
  7. Synthesise from the custom-trained model
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>
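For illustration, with the default Lightning/Hydra output layout (the run directory name is hypothetical; checkpoints land under logs/train/<experiment>/runs/<timestamp>/checkpoints, as in the logs quoted in the issues below):

matcha-tts --text "<INPUT TEXT>" --checkpoint_path logs/train/ljspeech/runs/<TIMESTAMP>/checkpoints/last.ckpt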

ONNX support

Special thanks to @mush42 for implementing ONNX export and inference support.

It is possible to export Matcha checkpoints to ONNX, and run inference on the exported ONNX graph.

ONNX export

To export a checkpoint to ONNX, first install ONNX with

pip install onnx

then run the following:

python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5

Optionally, the ONNX exporter accepts vocoder-name and vocoder-checkpoint arguments. This enables you to embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).
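A hypothetical invocation (assuming the flags are spelled --vocoder-name and --vocoder-checkpoint as the text suggests, and that a HiFi-GAN checkpoint such as hifigan_T2_v1 is available locally):

python3 -m matcha.onnx.export matcha.ckpt model_with_vocoder.onnx --n-timesteps 5 --vocoder-name hifigan_T2_v1 --vocoder-checkpoint hifigan_T2_v1.ckpt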

Note that n_timesteps is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). If not specified, n_timesteps is set to 5.

Important: for now, torch>=2.1.0 is needed for export, since the scaled_dot_product_attention operator is not exportable in older versions. Until the final version is released, those who want to export their models must install torch>=2.1.0 manually as a pre-release.

ONNX Inference

To run inference on the exported model, first install onnxruntime using

pip install onnxruntime
pip install onnxruntime-gpu  # for GPU inference

then use the following:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs

You can also control synthesis parameters:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0

To run inference on GPU, make sure to install onnxruntime-gpu package, and then pass --gpu to the inference command:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu

If you exported only Matcha to ONNX, this will write mel-spectrograms (as plots and NumPy arrays) to the output directory. If you embedded the vocoder in the exported graph, this will write .wav audio files to the output directory.

If you exported only Matcha to ONNX, and you want to run a full TTS pipeline, you can pass a path to a vocoder model in ONNX format:

python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx

This will write .wav audio files to the output directory.

Citation information

If you use our code or otherwise find this work useful, please cite our paper:

@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}

Acknowledgements

Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.

Other source code we would like to acknowledge:

  • Coqui-TTS: For helping me figure out how to make cython binaries pip installable and encouragement
  • Hugging Face Diffusers: For their awesome diffusers library and its components
  • Grad-TTS: For the monotonic alignment search source code
  • torchdyn: Useful for trying other ODE solvers during research and development
  • labml.ai: For the RoPE implementation

matcha-tts's People

Contributors

dependabot[bot], ghenter, jimregan, mush42, pre-commit-ci[bot], shivammehta25


matcha-tts's Issues

Information regarding Comparison with Grad-TTS

The comparison with Grad-TTS in the paper is unfair. The authors of Grad-TTS have published a follow-up paper which includes a maximum-likelihood-based SDE solver, from "Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme" (https://arxiv.org/abs/2109.13821). This SDE solver can improve performance with a small number of decoding steps. It is worth noting that the code for this SDE solver is open source in the same repository as Grad-TTS (it is implemented in https://github.com/huawei-noah/Speech-Backbones/blob/main/DiffVC/model/diffusion.py). The pre-trained Grad-TTS model was used in the paper, and it is unlikely that the authors did not notice this fact. However, the paper ended up using Euler's method for the discretisation of the Grad-TTS reverse process.

Here's a tweet from the Grad-TTS authors urging people not to use Euler's method for the Grad-TTS reverse process:
[tweet screenshot]

Multi-speaker model doesn't seem to pause well between sentences, compared with the single-speaker model

I noticed that the single-speaker model (LJ Speech) works very well, with natural pauses between words and sentences. However, the multi-speaker model (VCTK) seems to just talk through all the words, with limited pauses. This applies to all speaker IDs.

This is easily replicable in the Spaces demo by inputting multiple sentences to compare.

Is this a known issue? Is there a way around this so that the multi-speaker model can work as well as the single-speaker model?

Generating normalization statistics error

Hello,

When generating normalization statistics as directed in the README, I get the following error:

Traceback (most recent call last):
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "C:\Users\vul\miniconda3\envs\matcha-tts\Scripts\matcha-data-stats.exe\__main__.py", line 7, in <module>
    sys.exit(main())
  File "D:\Code\Matcha-TTS\matcha\utils\generate_data_statistics.py", line 102, in main
    params = compute_data_statistics(data_loader, cfg["n_feats"])
  File "D:\Code\Matcha-TTS\matcha\utils\generate_data_statistics.py", line 36, in compute_data_statistics
    for batch in tqdm(data_loader, leave=False):
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\tqdm\std.py", line 1182, in __iter__
    for obj in iterable:
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\utils\data\dataloader.py", line 633, in __next__    data = self._next_data()
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\utils\data\dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\utils\data\dataloader.py", line 1371, in _process_data
    data.reraise()
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 4.
Original Traceback (most recent call last):
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\utils\data\_utils\worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "C:\Users\vul\miniconda3\envs\matcha-tts\lib\site-packages\torch\utils\data\_utils\fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "D:\Code\Matcha-TTS\matcha\data\text_mel_datamodule.py", line 223, in __call__
    y[i, :, : y_.shape[-1]] = y_
RuntimeError: expand(torch.FloatTensor{[2, 80, 293]}, size=[80, 293]): the number of sizes provided (2) must be greater or equal to the number of dimensions in the tensor (3)

Do you know what might have caused this?
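For context: the [2, 80, 293] mel has a leading channel dimension of 2, which suggests a stereo wav where mono was expected. A minimal check, assuming torchaudio and "path|text" filelist lines, to find non-mono files:

import torchaudio

# Flag any filelist entries whose audio is not mono; a [2, 80, T] mel
# indicates two channels where a [80, T] mel was expected.
with open("data/filelists/ljs_audio_text_train_filelist.txt", encoding="utf-8") as f:
    for line in f:
        wav_path = line.split("|", 1)[0]
        info = torchaudio.info(wav_path)
        if info.num_channels != 1:
            print(wav_path, "has", info.num_channels, "channels")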

How are short utterances performing?

Hi, thanks for your interesting model! I was reading in an issue here that you recommend training on audio samples > X seconds, where X is 2 seconds or even more.

Besides the concrete reasons for the model to require such minimum sample lengths, I would be interested to know whether you have already performed tests on short utterances. I would be especially interested in, e.g., how spelling, letters, and short numbers perform.

Background: for people with reading disabilities, the screen keyboard is an essential writing tool. Users rely on screen readers to read those keyboard letters aloud. In my experience, many models have problems with such short utterances, but it helps to train with enough appropriately short audio samples, which would be problematic with Matcha-TTS?

Training with multi-GPU does not progress

When I start training with multi-GPU, the script hangs forever with the GPU at 100% and GPU memory at 2%. See the screenshots and training command below.

[screenshots: GPU utilisation at 100%, GPU memory at 2%]

export CUDA_VISIBLE_DEVICES=4,5
python matcha/train.py experiment=multispeaker trainer.devices=[0,1]
[2024-02-13 18:39:58,942][matcha.utils.utils][INFO] - Enforcing tags! <cfg.extras.enforce_tags=True>
[2024-02-13 18:39:58,946][matcha.utils.utils][INFO] - Printing config tree with Rich! <cfg.extras.print_config=True>
CONFIG
├── data
│   └── _target_: matcha.data.text_mel_datamodule.TextMelDataModule
│       name: multispeaker
│       train_filelist_path: data/filelists/multispeaker_audio_sid_text_train_filelist.txt
│       valid_filelist_path: data/filelists/multispeaker_audio_sid_text_val_filelist.txt
│       batch_size: 32
│       num_workers: 20
│       pin_memory: true
│       cleaners:
│       - english_cleaners2
│       add_blank: true
│       n_spks: 3
│       n_fft: 1024
│       n_feats: 80
│       sample_rate: 22050
│       hop_length: 256
│       win_length: 1024
│       f_min: 0
│       f_max: 8000
│       data_statistics:
│         mel_mean: -5.556468963623047
│         mel_std: 2.2946767807006836
│       seed: 1234
│
├── model
│   └── _target_: matcha.models.matcha_tts.MatchaTTS
│       n_vocab: 178
│       n_spks: 3
│       spk_emb_dim: 64
│       n_feats: 80
│       data_statistics:
│         mel_mean: -5.556468963623047
│         mel_std: 2.2946767807006836
│       out_size: null
│       prior_loss: true
│       encoder:
│         encoder_type: RoPE Encoder
│         encoder_params:
│           n_feats: 80
│           n_channels: 192
│           filter_channels: 768
│           filter_channels_dp: 256
│           n_heads: 2
│           n_layers: 6
│           kernel_size: 3
│           p_dropout: 0.1
│           spk_emb_dim: 64
│           n_spks: 1
│           prenet: true
│         duration_predictor_params:
│           filter_channels_dp: 256
│           kernel_size: 3
│           p_dropout: 0.1
│       decoder:
│         channels:
│         - 512
│         - 512
│         dropout: 0.05
│         attention_head_dim: 128
│         n_blocks: 1
│         num_mid_blocks: 2
│         num_heads: 2
│         act_fn: snakebeta
│       cfm:
│         name: CFM
│         solver: euler
│         sigma_min: 0.0001
│       optimizer:
│         _target_: torch.optim.Adam
│         _partial_: true
│         lr: 0.0001
│         weight_decay: 0.0
│
├── callbacks
│   └── model_checkpoint:
│         _target_: lightning.pytorch.callbacks.ModelCheckpoint
│         dirpath: /home/myuser/matcha-tts/Matcha-TTS/logs/train/multispeaker/runs/2024-02-13_18-39-58/checkpoints
│         filename: checkpoint_{epoch:03d}
│         monitor: epoch
│         verbose: false
│         save_last: true
│         save_top_k: 10
│         mode: max
│         auto_insert_metric_name: true
│         save_weights_only: false
│         every_n_train_steps: null
│         train_time_interval: null
│         every_n_epochs: 100
│         save_on_train_epoch_end: null
│       model_summary:
│         _target_: lightning.pytorch.callbacks.RichModelSummary
│         max_depth: 3
│       rich_progress_bar:
│         _target_: lightning.pytorch.callbacks.RichProgressBar
│
├── logger
│   └── tensorboard:
│         _target_: lightning.pytorch.loggers.tensorboard.TensorBoardLogger
│         save_dir: /home/myuser/matcha-tts/Matcha-TTS/logs/train/multispeaker/runs/2024-02-13_18-39-58/tensorboard/
│         name: null
│         log_graph: false
│         default_hp_metric: true
│         prefix: ''
│
├── trainer
│   └── _target_: lightning.pytorch.trainer.Trainer
│       default_root_dir: /home/myuser/matcha-tts/Matcha-TTS/logs/train/multispeaker/runs/2024-02-13_18-39-58
│       max_epochs: -1
│       accelerator: gpu
│       devices:
│       - 0
│       - 1
│       precision: 16-mixed
│       check_val_every_n_epoch: 1
│       deterministic: false
│       gradient_clip_val: 5.0
│
├── paths
│   └── root_dir: /home/myuser/matcha-tts/Matcha-TTS
│       data_dir: /home/myuser/matcha-tts/Matcha-TTS/data/
│       log_dir: /home/myuser/matcha-tts/Matcha-TTS/logs/
│       output_dir: /home/myuser/matcha-tts/Matcha-TTS/logs/train/multispeaker/runs/2024-02-13_18-39-58
│       work_dir: /home/myuser/matcha-tts/Matcha-TTS
│
├── extras
│   └── ignore_warnings: false
│       enforce_tags: true
│       print_config: true
│
├── task_name
│   └── train
├── run_name
│   └── multispeaker
├── tags
│   └── ['multispeaker']
├── train
│   └── True
├── test
│   └── True
├── ckpt_path
│   └── None
└── seed
    └── 1234
Seed set to 1234
[2024-02-13 18:39:58,991][__main__][INFO] - Instantiating datamodule <matcha.data.text_mel_datamodule.TextMelDataModule>
[2024-02-13 18:39:59,783][__main__][INFO] - Instantiating model <matcha.models.matcha_tts.MatchaTTS>

[2024-02-13 18:40:00,681][__main__][INFO] - Instantiating callbacks...
[2024-02-13 18:40:00,681][matcha.utils.instantiators][INFO] - Instantiating callback <lightning.pytorch.callbacks.ModelCheckpoint>
[2024-02-13 18:40:00,684][matcha.utils.instantiators][INFO] - Instantiating callback <lightning.pytorch.callbacks.RichModelSummary>
[2024-02-13 18:40:00,684][matcha.utils.instantiators][INFO] - Instantiating callback <lightning.pytorch.callbacks.RichProgressBar>
[2024-02-13 18:40:00,685][__main__][INFO] - Instantiating loggers...
[2024-02-13 18:40:00,685][matcha.utils.instantiators][INFO] - Instantiating logger <lightning.pytorch.loggers.tensorboard.TensorBoardLogger>
[2024-02-13 18:40:00,724][__main__][INFO] - Instantiating trainer <lightning.pytorch.trainer.Trainer>
Using 16bit Automatic Mixed Precision (AMP)
Trainer already configured with model summary callbacks: [<class 'lightning.pytorch.callbacks.rich_model_summary.RichModelSummary'>]. Skipping setting a default `ModelSummary` callback.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2024-02-13 18:40:00,949][__main__][INFO] - Logging hyperparameters!
[2024-02-13 18:40:00,988][__main__][INFO] - Starting training!
[rank: 0] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2

[rank: 1] Seed set to 1234
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------

You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision

Sampling rate problem

I have a question regarding this:

I have prepared a Japanese dataset with a sampling rate of 48000Hz. Is it necessary to convert the sampling rate to 22050Hz, or can I leave the sampling rate as it is? Thank you!
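For reference, a minimal resampling sketch with torchaudio (the file names are illustrative; the input is assumed to be a mono 48000 Hz wav):

import torchaudio
from torchaudio.transforms import Resample

# Load the 48 kHz wav and write a 22050 Hz copy.
wav, sr = torchaudio.load("input_48k.wav")  # sr is expected to be 48000
wav_22k = Resample(orig_freq=sr, new_freq=22050)(wav)
torchaudio.save("output_22050.wav", wav_22k, 22050)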

Matcha-TTS access denied

I'm getting this error in colab:

Access denied with the following error:

Cannot retrieve the public link of the file. You may need to change the permission to 'Anyone with the link', or have had many accesses. 

You may still be able to access the file from the browser:

 https://drive.google.com/uc?id=1enuxmfslZciWGAl63WGh2ekVo00FYuQ9

Cannot resume training

Trying to resume training, it runs for a bit, and then errors as follows:

Trainable params: 18.2 M
Non-trainable params: 0
Total params: 18.2 M
Total estimated model params size (MB): 72
Restored all states from the checkpoint at models/libritts-r-run/checkpoints/last.ckpt
/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/torch/nn/modules/conv.py:309: UserWarning: Applied workaround for CuDNN issue, install nvrtc.so (Triggered
internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:80.)
  return F.conv1d(input, weight, bias, self.stride,
/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/utilities/data.py:76: UserWarning: Trying to infer the `batch_size` from an
ambiguous collection. The batch size we found is 8. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
  warning_cache.warn(
Epoch 84/-2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23089/23089 1:58:45 • 0:00:00 3.57it/s v_num: 0 loss/train_step: nan loss/val_step: nan loss/val_epoch: nan
[2023-10-17 17:35:25,842][matcha.utils.utils][ERROR] -
Traceback (most recent call last):
  File "/home/myuser/Matcha-TTS/matcha/utils/utils.py", line 76, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
                               ^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/train.py", line 79, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
    self.on_advance_end()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 249, in on_advance_end
    self.val_loop.run()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
    self._on_evaluation_end()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 304, in _on_evaluation_end
    call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 146, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/models/baselightningmodule.py", line 186, in on_validation_end
    output = self.synthesise(x[:, :x_lengths], x_lengths, n_timesteps=10, spks=spks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/models/matcha_tts.py", line 123, in synthesise
    y_mask = sequence_mask(y_lengths, y_max_length_).unsqueeze(1).to(x_mask.dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/utils/model.py", line 10, in sequence_mask
    x = torch.arange(max_length, dtype=length.dtype, device=length.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: upper bound and larger bound inconsistent with step sign
[2023-10-17 17:35:25,849][matcha.utils.utils][INFO] - Output dir: /home/myuser/Matcha-TTS/outputs/2023-10-17/15-36-28
Error executing job with overrides: []
Traceback (most recent call last):
  File "/home/myuser/Matcha-TTS/matcha/train.py", line 122, in <module>
    main()  # pylint: disable=no-value-for-parameter
    ^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
           ^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
            ^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
        ^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
                       ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/train.py", line 112, in main
    metric_dict, _ = train(cfg)
                     ^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/utils/utils.py", line 86, in wrap
    raise ex
  File "/home/myuser/Matcha-TTS/matcha/utils/utils.py", line 76, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
                               ^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/train.py", line 79, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 532, in fit
    call._call_and_handle_interrupt(
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 571, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 980, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1023, in _run_stage
    self.fit_loop.run()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/fit_loop.py", line 355, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 134, in run
    self.on_advance_end()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 249, in on_advance_end
    self.val_loop.run()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 181, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 122, in run
    return self.on_run_end()
           ^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 258, in on_run_end
    self._on_evaluation_end()
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 304, in _on_evaluation_end
    call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 146, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/models/baselightningmodule.py", line 186, in on_validation_end
    output = self.synthesise(x[:, :x_lengths], x_lengths, n_timesteps=10, spks=spks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/miniconda3/envs/Matcha-TTS/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/models/matcha_tts.py", line 123, in synthesise
    y_mask = sequence_mask(y_lengths, y_max_length_).unsqueeze(1).to(x_mask.dtype)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/myuser/Matcha-TTS/matcha/utils/model.py", line 10, in sequence_mask
    x = torch.arange(max_length, dtype=length.dtype, device=length.device)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: upper bound and larger bound inconsistent with step sign

Ryanspeech

Hey, by any chance do you have the RyanSpeech dataset? I have been looking for it for a while, and when I apply for the dataset I get an error, so I thought to ask if you could re-upload it in case you have it.

Thanks in advance.

Some questions about the reproduction of this paper, from a newcomer

Hello, I am a novice in the field of speech, and I don't understand many things. I hope you don't mind taking the trouble to answer my questions. Thank you again.

Firstly, for the dataset: I used the LJ Speech dataset used in the paper, with a total of 13100 wavs. I divided it into training, validation, and testing sets in a 7:2:1 ratio. Is there any problem with my approach, and how did you split the data in your experiments?

Secondly, I have some doubts about the code. Is the best model saved according to epoch (monitor: epoch # name of the logged metric which determines when the model is improving)? That would mean the more iterations the model has, the better it is considered. I didn't find any other evaluation metrics in the code, such as RTF or WER after running, which may be because my coding ability is too poor. I don't quite understand this point.

Thirdly, there is a section of code in train.py:

if logger:
    log.info("Logging hyperparameters!")
    utils.log_hyperparameters(object_dict)

if cfg.get("train"):
    log.info("Starting training!")
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))

train_metrics = trainer.callback_metrics

if cfg.get("test"):
    log.info("Starting testing!")
    ckpt_path = trainer.checkpoint_callback.best_model_path
    if ckpt_path == "":
        log.warning("Best ckpt not found! Using current weights for testing...")
        ckpt_path = None
    trainer.test(model=model, datamodule=datamodule, ckpt_path=ckpt_path)
    log.info(f"Best ckpt path: {ckpt_path}")

test_metrics = trainer.callback_metrics

# Merge train and test metrics
metric_dict = {**train_metrics, **test_metrics}

return metric_dict, object_dict

For this code, I only saw "Starting training!" but never "Starting testing!" during runtime, and the testing section was never reached. When I trained for a small number of epochs instead of the 50k iterations in the paper and then stopped it, I encountered:

raise MisconfigurationException(f"No {step_name}() method defined to run Trainer.{trainer_method}.")
lightning.fabric.utilities.exceptions.MisconfigurationException: No test_step() method defined to run Trainer.test.

What is the reason for this?

These questions may seem childish to you, but they are really important to me. Could you please answer them? I express my gratitude to you again.

phonemizer/espeak and the GPL

When running Matcha-TTS, I notice inference relies on phonemizer (which itself relies on espeak for the backend). Both of those are under the GPLv3 license; have you tested or attempted inference with MIT-licensed alternatives?

Keeps on running into segfault

Hey, thanks for open-sourcing the model! I would love to experiment with it, but it keeps segfaulting whenever I try running it on Amazon Linux 2. I double-checked that all the packages are at the same versions as in requirements.txt.

Output

[ec2-user@ip-172-31-72-8 ~]$ matcha-tts --text "hello world"
[+] GPU Available! Using GPU
[!] Configurations: 
	- Model: matcha_ljspeech
	- Vocoder: hifigan_T2_v1
	- Temperature: 0.667
	- Speaking rate: 0.95
	- Number of ODE steps: 10
	- Speaker: None
[!] Loading matcha_ljspeech!
[+] matcha_ljspeech loaded!
[!] Loading hifigan_T2_v1!
Removing weight norm...
[+] hifigan_T2_v1 loaded!
====================================================================================================
[1] - Input text: hello world
Segmentation fault

Problem with training

Hi, I'm trying to train the model. I downloaded the LJSpeech-1.1 dataset to test and understand the training part, and I'm trying to train the model using this data.
I split metadata.csv so that the last 500 lines become ljs_audio_text_val_filelist.txt and the rest ljs_audio_text_train_filelist.txt,
and my wavs directory contains the audio files in the same directory, like this:


(matcha-tts-sourceenv) root@DESKTOP-RUI9N9R:/matcha-tts-source/Matcha-TTS/data/filelists# ll
total 1908
drwxr-xr-x 3 root   root      4096 Sep 27 19:59 ./
drwxr-xr-x 4 root   root      4096 Sep 27 15:05 ../
-rw-r--r-- 1 root   root   1407640 Sep 27 20:03 ljs_audio_text_train_filelist.txt
-rw-r--r-- 1 root   root     58269 Sep 27 20:03 ljs_audio_text_val_filelist.txt
drwxr-xr-x 2 mehmet mehmet  471040 Feb 19  2018 wavs/

My text files look like this: I removed everything after the second | character (and that character itself), because the Tacotron 2 guide says to do this. It is correct, I hope?

LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
LJ001-0002|in being comparatively modern.

When I try to run this command, I receive a runtime error. What am I doing wrong? Can you explain?



(matcha-tts-sourceenv) root@DESKTOP-RUI9N9R:/matcha-tts-source/Matcha-TTS/data/filelists# matcha-data-stats -i ljspeech.yaml
Traceback (most recent call last):                                                                                                                                                   
  File "/root/anaconda3/envs/matcha-tts-sourceenv/bin/matcha-data-stats", line 8, in <module>
    sys.exit(main())
  File "/matcha-tts-source/Matcha-TTS/matcha/utils/generate_data_statistics.py", line 102, in main
    params = compute_data_statistics(data_loader, cfg["n_feats"])
  File "/matcha-tts-source/Matcha-TTS/matcha/utils/generate_data_statistics.py", line 36, in compute_data_statistics
    for batch in tqdm(data_loader, leave=False):
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/tqdm/std.py", line 1182, in __iter__
    for obj in iterable:
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/matcha-tts-source/Matcha-TTS/matcha/data/text_mel_datamodule.py", line 197, in __getitem__
    datapoint = self.get_datapoint(self.filepaths_and_text[index])
  File "/matcha-tts-source/Matcha-TTS/matcha/data/text_mel_datamodule.py", line 168, in get_datapoint
    mel = self.get_mel(filepath)
  File "/matcha-tts-source/Matcha-TTS/matcha/data/text_mel_datamodule.py", line 173, in get_mel
    audio, sr = ta.load(filepath)
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 256, in load
    return _fallback_load(filepath, frame_offset, num_frames, normalize, channels_first, format)
  File "/root/anaconda3/envs/matcha-tts-sourceenv/lib/python3.10/site-packages/torchaudio/backend/sox_io_backend.py", line 30, in _fail_load
    raise RuntimeError("Failed to load audio from {}".format(filepath))
RuntimeError: Failed to load audio from LJ031-0079

If you need any other details or anything else, just message me and I will provide it. Thanks a lot for helping.
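For context: the traceback shows torchaudio failing to load audio from the bare ID LJ031-0079, so each filelist line presumably needs the full path to the wav file, not just the ID. A minimal sketch, assuming the LJSpeech layout above and "ID|text" lines, to rewrite a filelist in place:

# Rewrite "LJ001-0001|<text>" lines into "data/LJSpeech-1.1/wavs/LJ001-0001.wav|<text>"
# so that torchaudio can locate the files.
filelist = "data/filelists/ljs_audio_text_train_filelist.txt"
with open(filelist, encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f if line.strip()]
fixed = []
for line in lines:
    file_id, text = line.split("|", 1)
    fixed.append(f"data/LJSpeech-1.1/wavs/{file_id}.wav|{text}")
with open(filelist, "w", encoding="utf-8") as f:
    f.write("\n".join(fixed) + "\n")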

Q&A about the correct training setup and the dataset

Hi, after several days of testing, could you please let me know if I am on the right track? I am very new to this and need to ensure that I have understood everything correctly. Thank you very much for your time and all your amazing work!

I am trying to understand how many steps are needed to train for a new language that is not English using IPA.

I have some questions about the dataset. I am using WAV files with a sample rate of 22050 Hz in mono. The minimum duration is 1.0 second, and the maximum is 15 seconds. Can I use durations outside this range, like a minimum of 0.5 seconds and a maximum of 25 seconds? What should I consider regarding this? Also, should the WAV files be normalized for sound? Sometimes the audio is low, and I want to ensure the correct level. Currently, I normalize the sound to a range of -1.0 to 1.0. Is this correct?

Regarding the length of the dataset, I tested with 14 hours from a single speaker to see how it goes. Can I use more hours, say around 20 hours? How many hours are recommended for fine-tuning or training from scratch?

In terms of single speaker vs. multi-speaker datasets, if I use multi-speakers, with 4 hours for each of the 10 speakers, totaling about 40 hours, does this enhance the training process? Or is it necessary for each speaker to have a substantial number of hours, similar to the 14 hours for a single speaker?

Regarding the training time, is one day sufficient for training, or is it better to train for a week for improved results? I am using an RTX 4090. How many steps are needed to achieve a good model?

When it comes to pretraining vs. training from scratch, is it better to use a pretrained model or train from scratch for a new language?

Lastly, I have trained for about 180k steps. How many more steps are needed to complete the training process?

What parameters could you suggest tweaking to improve model performance?

Hi, I am trying to train your model on my personal dataset.

Are there any parameters you could suggest tweaking during training on my dataset? Or are the default params enough for good-quality results?

Are there any specific requirements for the dataset? I have a dataset of similar size to LJSpeech: 22050 Hz, mono, with a similar duration distribution.

[error] git clone

I encountered an issue while trying to clone the repository using the git clone command. The error message I received is as follows:

Cloning into 'Matcha-TTS'...
remote: Enumerating objects: 855, done.
remote: Counting objects: 100% (255/255), done.
remote: Compressing objects: 100% (101/101), done.
error: RPC failed; curl 92 HTTP/2 stream 0 was not closed cleanly: CANCEL (err 8)
error: 2925 bytes of body are still expected
fetch-pack: unexpected disconnect while reading sideband packet
fatal: early EOF
fatal: fetch-pack: invalid index-pack output

This error occurred on a server running Ubuntu.

Could not create monotonic_align

Hi. Thanks for sharing the source code.
I tried to train the model with my custom dataset, but this error occurred. Can you help me?

error: could not create 'matcha/utils/monotonic_align/core.cpython-38-x86_64-linux-gnu.so': No such file or directory
Best,

Error when trying to train on a multispeaker dataset

Hi! I am trying to train the Matcha-TTS model on my own dataset in a low-resource language. Therefore, I have to use some ASR data to test it, even though I know it is not the best type of data for TTS training. Because it is a multispeaker dataset, I am changing the n_spks variable in dataset.yaml to 29, which is the total number of speakers in the training set (in the validation set there are 45 different speakers).

From my research, I believe it might have something to do with changing n_spks from 1 to 29, and that this messes with some indexing or boundaries that have been set, but I am not sure. Also, I managed to run it without errors when I had n_spks=1 (even though the results were not great due to the noisy dataset).
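For context, srcIndex < srcSelectDimSize assertions from an embedding lookup usually mean an index is out of range for the embedding table. A minimal sanity check, assuming "path|speaker_id|text" filelist lines (the filelist paths here are hypothetical):

# Verify that every speaker ID fits the embedding table, i.e. 0 <= sid < n_spks.
n_spks = 29
for path in ["data/filelists/train_filelist.txt", "data/filelists/val_filelist.txt"]:
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            sid = int(line.split("|")[1])  # second pipe-separated field
            if not 0 <= sid < n_spks:
                print(f"{path}:{lineno}: speaker id {sid} out of range for n_spks={n_spks}")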

When I run the script to train the model on a GPU, it gives me hundreds of lines with this error:

../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [68,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1290: indexSelectLargeIndex: block: [0,0,0], thread: [69,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

Followed by:

Traceback (most recent call last):
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 412, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/baselightningmodule.py", line 128, in validation_step
    loss_dict = self.get_losses(batch)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/baselightningmodule.py", line 61, in get_losses
    dur_loss, prior_loss, diff_loss = self(
                                      ^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/matcha_tts.py", line 176, in forward
    mu_x, logw, x_mask = self.encoder(x, x_lengths, spks)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/components/text_encoder.py", line 397, in forward
    x = self.emb(x) * math.sqrt(self.n_channels)
        ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/train.py", line 112, in main
    metric_dict, _ = train(cfg)
                     ^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/utils/utils.py", line 86, in wrap
    raise ex
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/utils/utils.py", line 76, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
                               ^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/train.py", line 79, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1010, in _teardown
    self.strategy.teardown()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 537, in teardown
    self.lightning_module.cpu()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 82, in cpu
    return super().cpu()
           ^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 960, in cpu
    return self._apply(lambda t: t.cpu())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 960, in <lambda>
    return self._apply(lambda t: t.cpu())
                                 ^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
              ^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 135, in run
    self._evaluation_step(batch, batch_idx, dataloader_idx, dataloader_iter)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 396, in _evaluation_step
    output = call._call_strategy_hook(trainer, hook_name, *step_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 309, in _call_strategy_hook
    output = fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 412, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/baselightningmodule.py", line 128, in validation_step
    loss_dict = self.get_losses(batch)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/baselightningmodule.py", line 61, in get_losses
    dur_loss, prior_loss, diff_loss = self(
                                      ^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/matcha_tts.py", line 176, in forward
    mu_x, logw, x_mask = self.encoder(x, x_lengths, spks)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/models/components/text_encoder.py", line 397, in forward
    x = self.emb(x) * math.sqrt(self.n_channels)
        ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/train.py", line 112, in main
    metric_dict, _ = train(cfg)
                     ^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/utils/utils.py", line 86, in wrap
    raise ex
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/utils/utils.py", line 76, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
                               ^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/MatchaTTS_Norwegian_Custom/matcha/train.py", line 79, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/call.py", line 68, in _call_and_handle_interrupt
    trainer._teardown()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/trainer/trainer.py", line 1010, in _teardown
    self.strategy.teardown()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/pytorch/strategies/strategy.py", line 537, in teardown
    self.lightning_module.cpu()
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/lightning/fabric/utilities/device_dtype_mixin.py", line 82, in cpu
    return super().cpu()
           ^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 960, in cpu
    return self._apply(lambda t: t.cpu())
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 960, in <lambda>
    return self._apply(lambda t: t.cpu())
                                 ^^^^^^^
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x155552cf4d87 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x155552ca575f in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x155552dc58a8 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xfa5656 (0x1555086d3656 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0x543010 (0x1555516bf010 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x649bf (0x155552cd99bf in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x21b (0x155552cd2c8b in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x155552cd2e39 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #8: <unknown function> + 0x80b718 (0x155551987718 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x2f6 (0x155551987a96 in /global/D1/homes/gardaa/gards-py311-cu121-venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #10: python() [0x52d8be]
frame #11: python() [0x56e86e]
frame #12: python() [0x56d568]
frame #13: python() [0x56d5a9]
frame #14: python() [0x56d5a9]
frame #15: python() [0x56d5a9]
frame #16: python() [0x56d5a9]
frame #17: python() [0x56d5a9]
frame #18: python() [0x56d5a9]
frame #19: python() [0x57b43a]
frame #20: python() [0x57b4f9]
frame #21: python() [0x5b7c77]
frame #22: python() [0x523195]
frame #23: python() [0x60ee6e]

I am looking forward to your help with this issue. Thank you so much in advance!
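For anyone hitting this: a device-side assert that fires inside self.emb(x) is most often an out-of-range token id, i.e. a custom symbol set producing ids >= n_vocab. A minimal sanity check along these lines (all numbers below are made up; run it on CPU so the error is synchronous):

import torch

# Hypothetical check: every token id must be < n_vocab, otherwise
# nn.Embedding triggers exactly this kind of device-side assert on CUDA.
n_vocab = 178                               # whatever your text encoder was built with
token_ids = torch.tensor([[0, 5, 177]])     # example ids from your text cleaner
highest = token_ids.max().item()
assert highest < n_vocab, f"token id {highest} out of range for n_vocab={n_vocab}"
# An id of 178 or more here would reproduce the assert from the traceback.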

Possible to manipulate text projection to elongate phonemes controllably? [Question]


I'm reading through the paper, and I'm wondering: at inference time, could you manipulate the duration predictor, or some other part, to allow controllable elongation of certain phonemes?

Could you somehow intervene at or before inference time to allow adaptation for singing, text2singing for instance? I tried writing "I'm Siiiinging in the rain" to elongate the generated sound, but it loses intelligibility.

I'm wondering if this technique could be adapted to text2singing; one could take some inspiration from RVC and potentially train on an F0 contour as well, so the input would be F0 + phones/text projection, with F0 used as an additional conditioning signal alongside the text. Could this work?
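For reference, one plausible way (a sketch, not a feature of the repo) to elongate chosen phonemes is to scale the predicted log-durations of selected token positions before they are converted to frame counts, following the Glow-TTS-style duration predictor that Matcha-TTS uses (the encoder returns mu_x, logw, x_mask, as in the traceback above); the function name and shapes below are assumptions:

import torch

# Sketch: adding log(k) to the predicted log-durations of chosen positions
# multiplies their duration by k once exp() and ceil() are applied downstream.
def stretch_durations(logw: torch.Tensor, positions: list[int], factor: float) -> torch.Tensor:
    # logw: (batch, 1, n_tokens), as returned by the encoder alongside mu_x
    stretched = logw.clone()
    stretched[:, :, positions] += torch.log(torch.tensor(factor))
    return stretched

logw = torch.randn(1, 1, 12)
longer = stretch_durations(logw, positions=[4, 5], factor=3.0)  # ~3x the frames for tokens 4-5

Whether the decoder then produces something song-like is another matter: sustained vowels are out of distribution for a speech-trained model, which matches the loss of intelligibility observed with "Siiiinging".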

torch.cuda.OutOfMemoryError: CUDA out of memory

Hi, I am training Matcha-TTS on my custom dataset in the Hindi language. Here are the changes I made:

1.) Updated symbols.py to include more letters/phones
2.) Updated n_vocab to accommodate my symbols for the embedding
3.) Reduced batch_size from 32 -> 16 -> 8, but still got OOM

Here is the complete traceback:

errorlog.txt

I was training on an 80GB A100 which had 40GB of vRAM available; the rest was occupied by another PyTorch task.

Here is my data config:
_target_: matcha.data.text_mel_datamodule.TextMelDataModule
name: ljspeech
train_filelist_path: /home/azureuser/exp/Matcha-TTS/data/filelists/train_set.txt
valid_filelist_path: /home/azureuser/exp/Matcha-TTS/data/filelists/validation_set.txt
batch_size: 8
num_workers: 12
pin_memory: True
cleaners: [basic_cleaners]
add_blank: True
n_spks: 1
n_fft: 1024
n_feats: 80
sample_rate: 22050
hop_length: 256
win_length: 1024
f_min: 0
f_max: 8000
data_statistics: # Computed for ljspeech dataset
mel_mean: -6.200761318206787
mel_std: 2.2481095790863037
seed: ${seed}
Please let me know if you'll need any other information.
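Before reducing the batch size further, it may be worth checking utterance lengths: a handful of very long clips can blow up memory on their own, since activation memory grows quickly with sequence length. A sketch of a filelist filter, assuming the LJSpeech-style path|text format, the soundfile package, and a made-up 15-second cutoff:

import soundfile as sf

# Hypothetical filter: drop utterances longer than MAX_SECONDS from a
# "path|text" filelist; a few very long clips are a common cause of
# CUDA OOM even at small batch sizes.
MAX_SECONDS = 15.0

with open("train_set.txt") as fin, open("train_set.filtered.txt", "w") as fout:
    for line in fin:
        wav_path = line.split("|", 1)[0]
        info = sf.info(wav_path)
        if info.frames / info.samplerate <= MAX_SECONDS:
            fout.write(line)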

UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor

I modified the architecture to perform emotional speech synthesis training, using a pre-trained wav2vec model to extract emotion features (dim=1024) from the audio files. The modification maps these emotion features to the same dimension as the embedded text and then adds them together, as illustrated in the diagram below.
(architecture diagram attached to the original issue)
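For reference, a minimal sketch of the modification as described (the class name, tensor layout, and 192-channel text width are all assumptions, not the issue author's code):

import torch
import torch.nn as nn

# Sketch: project a 1024-dim wav2vec emotion feature to the text-embedding
# width and add it to every position of the embedded text sequence.
class EmotionConditioner(nn.Module):
    def __init__(self, emo_dim: int = 1024, text_dim: int = 192):
        super().__init__()
        self.proj = nn.Linear(emo_dim, text_dim)

    def forward(self, text_emb: torch.Tensor, emo_feat: torch.Tensor) -> torch.Tensor:
        # text_emb: (batch, text_dim, seq_len); emo_feat: (batch, emo_dim)
        emo = self.proj(emo_feat).unsqueeze(-1)  # (batch, text_dim, 1)
        return text_emb + emo                    # broadcasts over seq_len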

Although the training runs, it has become significantly slower, and I'm encountering the following warning message during the training process:

/opt/miniconda/envs/emotion/lib/python3.10/site-packages/torch/nn/modules/conv.py:306:
UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor
Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv1d(input, weight, bias, self.stride,)
...
/opt/miniconda/envs/emotion/lib/python3.10/site-packages/lightning/pytorch/utilities/data.py:77: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 15. To avoid any 
miscalculations, use `self.log(..., batch_size=batch_size)`.
/opt/miniconda/envs/emotion/lib/python3.10/site-packages/torch/nn/modules/conv.py:306: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed 
cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
  return F.conv1d(input, weight, bias, self.stride,

Is the problem related to the dataset configuration?

Failed to import transformers.generation.utils

Hello,

I get this error after installing and running the CLI command

matcha-tts --text "Hello how are you"

Logs:

RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf
ImportError: numpy.core._multiarray_umath failed to import
ImportError: numpy.core.umath failed to import
Traceback (most recent call last):
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1184, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/generation/utils.py", line 27, in <module>
    from ..integrations.deepspeed import is_deepspeed_zero3_enabled
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/integrations/__init__.py", line 21, in <module>
    from .deepspeed import (
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/integrations/deepspeed.py", line 29, in <module>
    from ..optimization import get_scheduler
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/optimization.py", line 27, in <module>
    from .trainer_utils import SchedulerType
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/trainer_utils.py", line 49, in <module>
    import tensorflow as tf
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/__init__.py", line 37, in <module>
    from tensorflow.python.tools import module_util as _module_util
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/__init__.py", line 42, in <module>
    from tensorflow.python import data
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/__init__.py", line 21, in <module>
    from tensorflow.python.data import experimental
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/experimental/__init__.py", line 97, in <module>
    from tensorflow.python.data.experimental import service
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/experimental/service/__init__.py", line 419, in <module>
    from tensorflow.python.data.experimental.ops.data_service_ops import distribute
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/experimental/ops/data_service_ops.py", line 22, in <module>
    from tensorflow.python.data.experimental.ops import compression_ops
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/experimental/ops/compression_ops.py", line 16, in <module>
    from tensorflow.python.data.util import structure
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/util/structure.py", line 22, in <module>
    from tensorflow.python.data.util import nest
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/data/util/nest.py", line 34, in <module>
    from tensorflow.python.framework import sparse_tensor as _sparse_tensor
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/framework/sparse_tensor.py", line 25, in <module>
    from tensorflow.python.framework import constant_op
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/framework/constant_op.py", line 25, in <module>
    from tensorflow.python.eager import execute
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/eager/execute.py", line 21, in <module>
    from tensorflow.python.framework import dtypes
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/tensorflow/python/framework/dtypes.py", line 37, in <module>
    _np_bfloat16 = _pywrap_bfloat16.TF_bfloat16_type()
TypeError: Unable to convert function return value to a Python type! The signature was
() -> handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1184, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/modeling_utils.py", line 39, in <module>
    from .generation import GenerationConfig, GenerationMixin
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1174, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1186, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
Unable to convert function return value to a Python type! The signature was
() -> handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/fawazahamedshaik/miniforge3/bin/matcha-tts", line 5, in <module>
    from matcha.cli import cli
  File "/Users/fawazahamedshaik/github/ASR/Matcha-TTS/matcha/cli.py", line 16, in <module>
    from matcha.models.matcha_tts import MatchaTTS
  File "/Users/fawazahamedshaik/github/ASR/Matcha-TTS/matcha/models/matcha_tts.py", line 7, in <module>
    import matcha.utils.monotonic_align as monotonic_align
  File "/Users/fawazahamedshaik/github/ASR/Matcha-TTS/matcha/utils/__init__.py", line 1, in <module>
    from matcha.utils.instantiators import instantiate_callbacks, instantiate_loggers
  File "/Users/fawazahamedshaik/github/ASR/Matcha-TTS/matcha/utils/instantiators.py", line 4, in <module>
    from lightning import Callback
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/__init__.py", line 25, in <module>
    from lightning.pytorch.callbacks import Callback  # noqa: E402
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/pytorch/__init__.py", line 26, in <module>
    from lightning.pytorch.callbacks import Callback  # noqa: E402
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/pytorch/callbacks/__init__.py", line 14, in <module>
    from lightning.pytorch.callbacks.batch_size_finder import BatchSizeFinder
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/pytorch/callbacks/batch_size_finder.py", line 24, in <module>
    from lightning.pytorch.callbacks.callback import Callback
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/pytorch/callbacks/callback.py", line 22, in <module>
    from lightning.pytorch.utilities.types import STEP_OUTPUT
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/lightning/pytorch/utilities/types.py", line 40, in <module>
    from torchmetrics import Metric
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/torchmetrics/__init__.py", line 14, in <module>
    from torchmetrics import functional  # noqa: E402
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/torchmetrics/functional/__init__.py", line 120, in <module>
    from torchmetrics.functional.text._deprecated import _bleu_score as bleu_score
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/torchmetrics/functional/text/__init__.py", line 50, in <module>
    from torchmetrics.functional.text.bert import bert_score  # noqa: F401
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/torchmetrics/functional/text/bert.py", line 23, in <module>
    from torchmetrics.functional.text.helper_embedding_metric import (
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/torchmetrics/functional/text/helper_embedding_metric.py", line 27, in <module>
    from transformers import AutoModelForMaskedLM, AutoTokenizer, PreTrainedModel, PreTrainedTokenizerBase
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1174, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/Users/fawazahamedshaik/miniforge3/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1186, in _get_module
    raise RuntimeError(
RuntimeError: Failed to import transformers.modeling_utils because of the following error (look up to see its traceback):
Failed to import transformers.generation.utils because of the following error (look up to see its traceback):
Unable to convert function return value to a Python type! The signature was
() -> handle
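For what it's worth, the repeated "module compiled against API version 0x10 but this version of numpy is 0xf" lines say the installed NumPy is older than the one the binary extensions were built against, and the final TypeError comes from TensorFlow's bfloat16 bindings rather than from Matcha-TTS itself. Upgrading NumPy in the environment (pip install -U numpy) and removing or upgrading the TensorFlow installation that torchmetrics is picking up are plausible first steps; both suggestions are inferred from the log, not a confirmed fix.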

Model Training Logs

Attaching my loss curves from TensorBoard; does everything look okay? val_epoch seems to have converged. What were the final losses you were getting? Is there anything else I need to keep an eye on?

(TensorBoard loss curves attached to the original issue, shown with smoothing = 0.6)

No matching distribution found for piper-phonemize

Hi!
I'm trying to install Matcha-TTS from source.
pip install -e .
gives
INFO: pip is looking at multiple versions of matcha-tts to determine which version is compatible with other requirements. This could take a while.
ERROR: Ignored the following versions that require a different python version: 1.6.2 Requires-Python >=3.7,<3.10; 1.6.3 Requires-Python >=3.7,<3.10; 1.7.0 Requires-Python >=3.7,<3.10; 1.7.1 Requires-Python >=3.7,<3.10
ERROR: Could not find a version that satisfies the requirement piper-phonemize (from matcha-tts) (from versions: none)
ERROR: No matching distribution found for piper-phonemize
Thanks in advance.

Train with one's own dataset

Hi, I wonder how I can create the corresponding .yaml file (like ljspeech.yaml) when working on my own dataset. In addition, have you tested the performance of Matcha-TTS on other languages like Chinese, and is a relatively small dataset (around 20 minutes) okay? Thank you very much!

gradio 4 removed `Box`

Removing weight norm...
[+] hifigan_univ_v1 loaded!
Traceback (most recent call last):
  File "/usr/local/bin/matcha-tts-app", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/matcha/app.py", line 171, in main
    with gr.Box():
AttributeError: module 'gradio' has no attribute 'Box'

See: gradio-app/gradio#6815
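Until the app is updated for the new API, one stopgap sketch for gradio >= 4, where Box was removed (gr.Group is my substitution, not necessarily the maintainers' chosen fix):

import gradio as gr

# gradio >= 4 removed gr.Box; gr.Group is the closest remaining container
# for visually grouping components.
with gr.Blocks() as demo:
    with gr.Group():  # was: with gr.Box():
        text = gr.Textbox(label="Text to synthesise")

Alternatively, pinning gradio below version 4 in the environment sidesteps the change entirely.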

ONNX inference is around 5 times slower than Pytorch when using GPU

Hi,

I modified the logic in matcha/onnx/infer.py, essentially changing the provider to CUDAExecutionProvider instead of GPUExecutionProvider, because of the following error:

EP Error Unknown Provider Type: GPUExecutionProvider when using ['GPUExecutionProvider']
Falling back to ['CPUExecutionProvider'] and retrying.

I figured CUDAExecutionProvider would work because ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'CPUExecutionProvider'] were the providers reported for my device when I ran ort.get_available_providers().

Passing CUDAExecutionProvider to ort.InferenceSession results in this warning:

2024-02-08 16:40:45.111210799 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 32 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-02-08 16:40:45.138377088 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-02-08 16:40:45.138389649 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.

The inference time is around 5 times slower than just inferring via PyTorch with the device set to "cuda". This seems unusual, as I expected ONNX+GPU to be at least as fast as, if not faster than, PyTorch+GPU. Is there a workaround for this warning? @mush42, any help regarding this?

Running on NVIDIA A10G, CUDA version - 12.1, onnxruntime-gpu version 1.17.0. The model I'm using is the matcha_ljspeech pre-trained checkpoint exported to onnx via the export script.
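One thing that sometimes helps (a sketch under assumptions, not a verified fix): the Memcpy warning indicates host<->device copies inserted around CPU-assigned nodes, and binding inputs and outputs to the GPU with IOBinding removes at least the copies at the graph boundary. The model path and input/output names below are assumptions; check sess.get_inputs() and sess.get_outputs() for the real ones.

import numpy as np
import onnxruntime as ort

# Minimal IOBinding sketch: keep inputs and outputs on the GPU so ONNX Runtime
# does not round-trip them through host memory on every call. Shape-related
# ops that ORT assigns to the CPU will still run there.
sess = ort.InferenceSession("matcha_ljspeech.onnx",
                            providers=["CUDAExecutionProvider"])
print([i.name for i in sess.get_inputs()])  # confirm the graph's input names

x = np.random.randint(0, 100, size=(1, 50), dtype=np.int64)  # dummy token ids
x_lengths = np.array([50], dtype=np.int64)
scales = np.array([0.667, 1.0], dtype=np.float32)            # temperature, length scale

io = sess.io_binding()
for name, arr in (("x", x), ("x_lengths", x_lengths), ("scales", scales)):
    io.bind_ortvalue_input(name, ort.OrtValue.ortvalue_from_numpy(arr, "cuda", 0))
io.bind_output("mel", "cuda")  # let ORT allocate the output on the GPU
sess.run_with_iobinding(io)
mel = io.copy_outputs_to_cpu()[0]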

Matcha-TTS synthesised audio prosody does not seem reflective of the paper

Output from TransformerTTS (Fastpitch/fastspeech2 based) :

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?" (Hindi: "Hello, I am Swati speaking on behalf of Bajaj Allianz General Insurance; is this a good time to talk to you?")

transformerTTS.mp4

Output from Matcha-TTS (Speech-rate 0.90) :

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है."

MatchaTTS.mp4

Output from Matcha-TTS (Speech-rate 0.90) :

text: "नमस्ते, मैं बजाज आलियांज़ जनरल इंश्योरेंस की ओर से स्वाति बोल रही हूँ, क्या आप से बात करने के लिए ये समय सही है?"

MatchaTTS-2.mp4

Please share your opinion.

About batch inference for multi-speakers

Hello, thank you for your great work.

I would like to ask two questions:

  1. Regarding the problem of batch inference of different sentences from the same speaker.
I am now using --file to read a txt file containing multiple lines (taking 4 lines as an example), and an error is reported during inference:
  File "/mnt/E/isjwdu/Matcha-TTS/matcha/models/components/text_encoder.py", line 403, in forward
    x = torch.cat([x, spks.unsqueeze(-1).repeat(1, 1, x.shape[-1])], dim=1)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 4 but got size 1 for tensor number 1 in the list.

Does the original code only read a single line at a time? Is there a recommended way to run inference on multiple sentences in batches?

  2. For batch inference on a txt file containing different speakers and different sentences, are there any code modification suggestions or tips?

For example my txt:

p329-016|p329|the norsemen considered the rainbow as a bridge over which the gods passed from earth to their home in the sky.
p316-091|p316|there was no bad behavior.

I want to synthesise different audio files based on the different speakers.

Looking forward to your reply
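As a stopgap for question 1, a sketch (my suggestion, not the repository's fix) of tiling a single speaker embedding so its batch dimension matches the batched text before the torch.cat that fails above; the helper name and embedding size are made up:

import torch

# Hypothetical helper: expand a (1, spk_emb_dim) speaker embedding to
# (batch_size, spk_emb_dim) so it can be concatenated with batched text.
def tile_speaker(spks: torch.Tensor, batch_size: int) -> torch.Tensor:
    return spks.expand(batch_size, -1) if spks.shape[0] == 1 else spks

spks = torch.randn(1, 64)            # 64 is a made-up embedding size
print(tile_speaker(spks, 4).shape)   # torch.Size([4, 64])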

Matcha-TTS has very low GPU utilization.

First of all, thanks for a very nice TTS system. This is a very interesting and inspiring system.

  1. I tried to train it, but it seems to train very slowly: I see 0.5 to 1.6 iterations per second. At the same time I see negligible GPU utilization. Is that to be expected? (Running on an A100 GPU.)

  2. I expect the answer is yes; judging by the number of parameters the model is quite a small one, so I would assume the speed is limited by the CPU-based MAS computation.

  3. Are you aware of any reason why the alignment implemented by the FastPitch team would not work? The implementation in the NVIDIA Deep Learning Examples is very snappy: https://arxiv.org/abs/2108.10447

Thanks for your great work and insights!

Inference is not deterministic?

For the same text I get different WAV files. I run the following command twice:

python cli.py --model matcha_ljspeech --vocoder hifigan_T2_v1 --cpu --output_folder "C:\Project 5 Support\matcha-tts\audio\pytorch-infer" --text "I haven't seen you for a while."

I have attached the two WAV files.

audio.zip

For comparison I used the Windows fc command and a hex editor. The difference starts at offset 0x2C.
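This is expected behaviour: Matcha-TTS is probabilistic, and each run samples fresh noise as the starting point of the ODE solve, so the waveforms differ. (Offset 0x2C is exactly where sample data begins after the canonical 44-byte WAV header, so the headers match and only the audio samples differ.) If repeatable output matters, a sketch assuming you drive the model from Python rather than the CLI:

import torch

# Fixing PyTorch's RNG seed before each synthesis call makes the sampled
# initial noise, and hence the output, repeatable on the same hardware and
# library versions.
torch.manual_seed(1234)
# model.synthesise(...)  # hypothetical call site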

ONNX Export

Hi,

Any plans to enable ONNX export for Matcha TTS?

I developed a script to do it, but it has some issues. Specifically, dynamic input shapes are not possible, and synthesis parameters, such as temperature, cannot be passed as scalars.

To properly enable ONNX export, we need to change how some tensor and scalar parameters are handled.

Are you interested in ONNX export? If so, how should I proceed? Perhaps with a PR?

Best
Musharraf
