
kan-tts's Introduction




Introduction

ModelScope is built upon the notion of "Model-as-a-Service" (MaaS). It seeks to bring together the most advanced machine learning models from the AI community, and to streamline the process of leveraging AI models in real-world applications. The core ModelScope library open-sourced in this repository provides the interfaces and implementations that allow developers to perform model inference, training, and evaluation.

In particular, with rich layers of API abstraction, the ModelScope library offers a unified experience for exploring state-of-the-art models across domains such as CV, NLP, Speech, Multi-Modality, and Scientific Computation. Model contributors in different areas can integrate models into the ModelScope ecosystem through the layered APIs, allowing easy and unified access to their models. Once integrated, model inference, fine-tuning, and evaluation can be done with only a few lines of code. Meanwhile, flexibility is also provided so that different components in the model applications can be customized wherever necessary.

Apart from harboring implementations of a wide range of models, the ModelScope library also enables the necessary interactions with ModelScope backend services, particularly the Model-Hub and Dataset-Hub. Such interactions allow the management of various entities (models and datasets) to be performed seamlessly under the hood, including entity lookup, version control, cache management, and more.

Models and Online Accessibility

Hundreds of models are publicly available on ModelScope (700+ and counting), covering the latest developments in areas such as NLP, CV, Audio, Multi-Modality, and AI for Science. Many of these models represent the SOTA in their specific fields and made their open-source debut on ModelScope. Users can visit ModelScope (modelscope.cn) and experience first-hand how these models perform, with just a few clicks. Immediate developer experience is also possible through the ModelScope Notebook, which is backed by a ready-to-use CPU/GPU development environment in the cloud - only one click away on ModelScope.



Some representative examples include:

LLM:

Multi-Modal:

CV:

Audio:

AI for Science:

Note: Most models on ModelScope are public and can be downloaded without account registration on the ModelScope website (www.modelscope.cn). Please refer to the instructions for model download, for downloading models either with the API provided by the modelscope library or with git.
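
For example, a model can be fetched programmatically with the hub API; a minimal sketch, using the word-segmentation model from the QuickTour below:

>>> from modelscope.hub.snapshot_download import snapshot_download
>>> model_dir = snapshot_download('damo/nlp_structbert_word-segmentation_chinese-base')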

QuickTour

We provide a unified interface for inference using pipeline, and for fine-tuning and evaluation using Trainer, across different tasks.

For any given task with any type of input (image, text, audio, video...), an inference pipeline can be implemented with only a few lines of code, which will automatically load the underlying model to produce the inference result, as exemplified below:

>>> from modelscope.pipelines import pipeline
>>> word_segmentation = pipeline('word-segmentation', model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> word_segmentation('今天天气不错,适合出去游玩')
{'output': '今天 天气 不错 , 适合 出去 游玩'}

Given an image, portrait matting (a.k.a. background removal) can be accomplished with the following code snippet:

[input image]

>>> import cv2
>>> from modelscope.pipelines import pipeline

>>> portrait_matting = pipeline('portrait-matting')
>>> result = portrait_matting('https://modelscope.oss-cn-beijing.aliyuncs.com/test/images/image_matting.png')
>>> cv2.imwrite('result.png', result['output_img'])

The output image with the background removed is: [output image]

Fine-tuning and evaluation can also be done with a few more lines of code to set up the training dataset and trainer, with the heavy lifting of training and evaluating a model encapsulated in the implementation of the trainer.train() and trainer.evaluate() interfaces.

For example, the GPT-3 base model (1.3B) can be fine-tuned with the Chinese poetry dataset, resulting in a model that can be used for Chinese poetry generation.

>>> from modelscope.metainfo import Trainers
>>> from modelscope.msdatasets import MsDataset
>>> from modelscope.trainers import build_trainer

>>> train_dataset = MsDataset.load('chinese-poetry-collection', split='train').remap_columns({'text1': 'src_txt'})
>>> eval_dataset = MsDataset.load('chinese-poetry-collection', split='test').remap_columns({'text1': 'src_txt'})
>>> max_epochs = 10
>>> tmp_dir = './gpt3_poetry'

>>> kwargs = dict(
     model='damo/nlp_gpt3_text-generation_1.3B',
     train_dataset=train_dataset,
     eval_dataset=eval_dataset,
     max_epochs=max_epochs,
     work_dir=tmp_dir)

>>> trainer = build_trainer(name=Trainers.gpt3_trainer, default_args=kwargs)
>>> trainer.train()
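
Evaluation follows the same pattern; a minimal sketch, assuming the trainer built above and that evaluate() defaults to the most recent checkpoint under work_dir:

>>> metrics = trainer.evaluate()
>>> print(metrics)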

Why should I use the ModelScope library

  1. A unified and concise user interface is abstracted for different tasks and different models. Model inference and training can be implemented in as few as 3 and 10 lines of code, respectively, making it convenient for users to explore models across different fields in the ModelScope community. All models integrated into ModelScope are ready to use, which makes it easy to get started with AI, in both educational and industrial settings.

  2. ModelScope offers a model-centric development and application experience. It streamlines support for model training, inference, export, and deployment, and makes it easy for users to build their own MLOps on the ModelScope ecosystem.

  3. For the model inference and training processes, a modular design is in place, and a wealth of functional module implementations are provided, making it convenient for users to customize their own model inference, training, and other processes.

  4. For distributed model training, especially of large models, rich training strategy support is provided, including data parallelism, model parallelism, hybrid parallelism, and so on; see the launch sketch below.
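
As a generic single-node launch sketch (standard PyTorch torchrun flags, not a ModelScope-specific CLI; finetune.py is a hypothetical script that builds and runs a trainer as in the QuickTour above):

# hypothetical: finetune.py builds a ModelScope trainer and calls trainer.train()
torchrun --nproc_per_node=8 finetune.py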

Installation

Docker

The ModelScope Library currently supports popular deep learning frameworks for model training and inference, including PyTorch, TensorFlow, and ONNX. All releases are tested and run on Python 3.7+, PyTorch 1.8+, and TensorFlow 1.15 or TensorFlow 2.0+.

To allow out-of-the-box usage of all models on ModelScope, official docker images are provided for all releases. Based on these images, developers can skip all environment installation and configuration and start directly. Currently, the latest versions of the CPU image and GPU image can be obtained from:

CPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-py38-torch2.0.1-tf2.13.0-1.9.5

GPU docker image

# py37
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.6.1

# py38
registry.cn-hangzhou.aliyuncs.com/modelscope-repo/modelscope:ubuntu20.04-cuda11.8.0-py38-torch2.0.1-tf2.13.0-1.9.5

Setup Local Python Environment

One can also set up a local ModelScope environment using pip and conda. ModelScope supports Python 3.7 and above. We suggest Anaconda for creating the local Python environment:

conda create -n modelscope python=3.8
conda activate modelscope

PyTorch or TensorFlow can be installed separately according to each model's requirements.

  • Install pytorch doc
  • Install tensorflow doc

After installing the necessary machine-learning framework, you can install the modelscope library as follows:

If you only want to play around with the modelscope framework, or try out model/dataset download, you can install the core modelscope components:

pip install modelscope

If you want to use multi-modal models:

pip install modelscope[multi-modal]

If you want to use nlp models:

pip install modelscope[nlp] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use cv models:

pip install modelscope[cv] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use audio models:

pip install modelscope[audio] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

If you want to use science models:

pip install modelscope[science] -f https://modelscope.oss-cn-beijing.aliyuncs.com/releases/repo.html

Notes:

  1. Currently, some audio-task models only support Python 3.7 and TensorFlow 1.15.4 on Linux. Most other models can be installed and used on Windows and Mac (x86).

  2. Some models in the audio field use the third-party library SoundFile for wav file processing. On Linux, users need to manually install libsndfile, which SoundFile depends on (doc link). On Windows and macOS, it is installed automatically. For example, on Ubuntu, you can use the following commands:

    sudo apt-get update
    sudo apt-get install libsndfile1
  3. Some models in computer vision need mmcv-full. You can refer to the mmcv installation guide; a minimal installation is as follows:

    pip uninstall mmcv # if you have installed mmcv, uninstall it
    pip install -U openmim
    mim install mmcv-full

Learn More

We provide additional documentation, including:

License

This project is licensed under the Apache License (Version 2.0).

Citation

@Misc{modelscope,
  title = {ModelScope: bring the notion of Model-as-a-Service to life.},
  author = {The ModelScope Team},
  howpublished = {\url{https://github.com/modelscope/modelscope}},
  year = {2023}
}

kan-tts's People

Contributors

alibaba-oss, ginchow, hukai-sun, tonyhehahaha


kan-tts's Issues

SyntaxError: invalid syntax (3690364265.py, line 2) in notebook

Feature extraction

python kantts/preprocess/data_process.py --voice_input_dir ptts_spk0_autolabel --voice_output_dir training_stage/test_male_ptts_feats --audio_config kantts/configs/audio_config_se_16k.yaml --speaker F7 --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.*
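
A likely cause, assuming the command above was pasted directly into a notebook code cell: shell commands in Jupyter must be prefixed with !, e.g.:

!python kantts/preprocess/data_process.py --voice_input_dir ptts_spk0_autolabel --voice_output_dir training_stage/test_male_ptts_feats --audio_config kantts/configs/audio_config_se_16k.yaml --speaker F7 --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.*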

Expanding epochs

stage0=training_stage
voice=test_male_ptts_feats

# merge the validation list into the training list
cat $stage0/$voice/am_valid.lst >> $stage0/$voice/am_train.lst
# append shuffled copies of the training list until it reaches at least 400 lines
lines=0
while [ $lines -lt 400 ]
do
    shuf $stage0/$voice/am_train.lst >> $stage0/$voice/am_train.lst.tmp
    lines=$(wc -l < "$stage0/$voice/am_train.lst.tmp")
done
mv $stage0/$voice/am_train.lst.tmp $stage0/$voice/am_train.lst

About ttsfrd

Hi, could you distribute an aarch64 build of ttsfrd? In the TTS model deployment chain this seems to be the only missing piece for aarch64; with it, the whole pipeline could run on all kinds of edge-computing devices.

A question about multi-speaker training

Can the speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k model be trained on multiple speakers at the same time?

auto_label error: Authentication token does not exist, failed to access model

P.S. Solution: change v1.0.4 to v1.0.7. The version referenced at https://modelscope.cn/models/Jinglin/personal_voice/summary is outdated, which is why it fails to run.

Code to reproduce:

from modelscope.tools import run_auto_label

input_wav = "./test_wavs/"
output_data = "./output_training_data/"

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.4")

Error:

2023-09-15 10:20:33,746 - modelscope - INFO - PyTorch version 2.0.1 Found.
2023-09-15 10:20:33,746 - modelscope - INFO - Loading ast index from /home/ubuntu/.cache/modelscope/ast_indexer
2023-09-15 10:20:35,583 - modelscope - INFO - Loading done! Current index file version is 1.9.0, with md5 66c797046ce7835fbc6d499ee4dcf5e4 and a total number of 921 components indexed
2023-09-15 10:20:40,467 - modelscope - INFO - Use user-specified model revision: v1.0.4
INFO:root:2023-09-15 10:20:54
INFO:root:TTS-AutoLabel version: 1.1.8
INFO:root:TTS-AutoLabel resource path: /home/ubuntu/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model
INFO:root:Target sampling rate: 16000
INFO:root:Input wav dir: /cfs/user/ubuntu/work/tts/ali-tts/test_female
INFO:root:Output data dir: /cfs/user/ubuntu/work/tts/ali-tts/output_training_data
INFO:root:wav_preprocess start...
INFO:root:---  There is this folder!  ---
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 21.49it/s]
INFO:root:[VAD] chunk recordings for training.
INFO:root:wav cut by vad start...
2023-09-15 10:20:55,457 - modelscope - ERROR - Authentication token does not exist, failed to access model /home/ubuntu/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/speech_vad_assert/fsmn_vad_16k which may not exist or may be                 private. Please login first.
Traceback (most recent call last):
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/hub/errors.py", line 81, in handle_http_response
    response.raise_for_status()
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http://www.modelscope.cn/api/v1/models//home/ubuntu/.cache/modelscope/hub/damo/speech_ptts_autolabel_16k/model/speech_vad_assert/fsmn_vad_16k/revisions?EndTime=1693929600

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py", line 42, in __init__
    model_dir = snapshot_download(model_dir, cache_dir=cache_dir)
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/hub/snapshot_download.py", line 96, in snapshot_download
    revision = _api.get_valid_revision(
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/hub/api.py", line 464, in get_valid_revision
    revisions = self.list_model_revisions(
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/hub/api.py", line 433, in list_model_revisions
    handle_http_response(r, logger, cookies, model_id)
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/hub/errors.py", line 88, in handle_http_response
    raise HTTPError('Response details: %s' % message) from error
requests.exceptions.HTTPError: Response details: 404 page not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "label.py", line 6, in <module>
    ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.4")
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/modelscope/tools/speech_tts_autolabel.py", line 78, in run_auto_label
    ret_code, report = auto_labeling.run()
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/tts_autolabel/auto_label.py", line 853, in run
    self.wav_cut_by_vad(self.resample_wav_dir, self.cut_wav_dir)
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/tts_autolabel/auto_label.py", line 437, in wav_cut_by_vad
    vad_cut(input_wav_dir, output_wav_dir, self.resource_dir)
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/tts_autolabel/audiocut/vad.py", line 350, in vad_cut
    vad_pipeline_superhigh = Fsmn_vad(
  File "/cfs/user/ubuntu/anaconda3/envs/modelscope/lib/python3.8/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py", line 44, in __init__
    raise "model_dir must be model_name in modelscope or local path downloaded from modelscope, but is {}".format(
TypeError: exceptions must derive from BaseException
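
Applying the fix noted at the top of this issue (changing v1.0.4 to v1.0.7), the call becomes:

ret, report = run_auto_label(input_wav=input_wav, work_dir=output_data, resource_revision="v1.0.7")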

[ONNXRuntimeError] : 7 : INVALID_PROTOBUF ERROR when running KAN-TTS preprocess!

When running the KAN-TTS preprocess script, I hit an ONNXRuntimeError: a Protobuf parsing error occurred while loading the se.onnx model.

The training procedure I followed is this one: https://www.modelscope.cn/models/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/summary


The command and its output are as follows:

python kantts/preprocess/data_process.py --voice_input_dir /home/speech_personal_sambert_modelscope/KAN-TTS/resource/test_male_autolabel --voice_output_dir training_stage/test_male_ptts_feats --audio_config kantts/configs/audio_config_se_16k.yaml --speaker test_male --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx

(maas) [root@VM-16-3-centos KAN-TTS]# python kantts/preprocess/data_process.py --voice_input_dir /home/speech_personal_sambert_modelscope/KAN-TTS/resource/test_male_autolabel --voice_output_dir training_stage/test_male_ptts_feats --audio_config kantts/configs/audio_config_se_16k.yaml --speaker test_male --se_model speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:00<00:00, 13163.76it/s]
2023-03-30:23:02:36 INFO [TextScriptConvertor.py:469] TextScriptConvertor.process:
Save script to: training_stage/test_male_ptts_feats/Script.xml
2023-03-30:23:02:36 INFO [TextScriptConvertor.py:490] TextScriptConvertor.process:
Save metafile to: training_stage/test_male_ptts_feats/raw_metafile.txt
2023-03-30:23:02:36 INFO [audio_processor.py:90] [AudioProcessor] Initialize AudioProcessor.
2023-03-30:23:02:36 INFO [audio_processor.py:91] [AudioProcessor] config params:
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] wav_normalize: True
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] trim_silence: True
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] trim_silence_threshold_db: 60
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] preemphasize: False
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] sampling_rate: 16000
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] hop_length: 200
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] win_length: 1000
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] n_fft: 2048
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] n_mels: 80
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] fmin: 0.0
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] fmax: 8000.0
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] phone_level_feature: True
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] se_feature: True
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] norm_type: mean_std
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] max_norm: 1.0
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] symmetric: False
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] min_level_db: -100.0
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] ref_level_db: 20
2023-03-30:23:02:36 INFO [audio_processor.py:93] [AudioProcessor] num_workers: 16
2023-03-30:23:02:36 INFO [audio_processor.py:201] [AudioProcessor] Amplitude normalization started
2023-03-30:23:02:36 INFO [utils.py:184] Volume statistic proceeding...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 191.17it/s]
2023-03-30:23:02:36 INFO [utils.py:170] Average amplitude RMS : 0.054727649999999996
2023-03-30:23:02:36 INFO [utils.py:186] Volume statistic done.
2023-03-30:23:02:36 INFO [utils.py:194] Volume normalization proceeding...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 2199.60it/s]
2023-03-30:23:02:36 INFO [utils.py:221] Volume normalization done.
2023-03-30:23:02:36 INFO [audio_processor.py:204] [AudioProcessor] Amplitude normalization finished
2023-03-30:23:02:36 INFO [audio_processor.py:394] [AudioProcessor] Duration generation started
  0%|                                                                                                                                                                | 0/20 [00:00<?, ?it/s]2023-03-30:23:02:36 INFO [audio_processor.py:411] [AudioProcessor] Duration align with mel is proceeding...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 182.71it/s]
2023-03-30:23:02:36 INFO [audio_processor.py:453] [AudioProcessor] Duration generate finished
2023-03-30:23:02:36 INFO [audio_processor.py:278] [AudioProcessor] Trim silence with interval started
2023-03-30:23:02:36 INFO [audio_processor.py:216] [AudioProcessor] Start to load pcm from training_stage/test_male_ptts_feats/wav
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 179.38it/s]
  0%|                                                                                                                                                                | 0/20 [00:00<?, ?it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 4150.72it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:314] [AudioProcessor] Trim silence finished
2023-03-30:23:02:37 INFO [audio_processor.py:322] [AudioProcessor] Melspec extraction started
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 56.95it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:361] [AudioProcessor] Melspec extraction finished
2023-03-30:23:02:37 INFO [audio_processor.py:365] Melspec statistic proceeding...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 39107.73it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 10867.48it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:368] Melspec statistic done
2023-03-30:23:02:37 INFO [audio_processor.py:374] [AudioProcessor] melspec mean and std saved to:
training_stage/test_male_ptts_feats/mel/mel_mean.txt,
training_stage/test_male_ptts_feats/mel/mel_std.txt
2023-03-30:23:02:37 INFO [audio_processor.py:378] [AudioProcessor] Melspec mean std norm is proceeding...
2023-03-30:23:02:37 INFO [audio_processor.py:384] [AudioProcessor] Melspec normalization finished
2023-03-30:23:02:37 INFO [audio_processor.py:385] [AudioProcessor] Normed Melspec saved to training_stage/test_male_ptts_feats/mel
2023-03-30:23:02:37 INFO [audio_processor.py:467] [AudioProcessor] Pitch extraction started
  0%|                                                                                                                                                                | 0/20 [00:00<?, ?it/s]2023-03-30:23:02:37 INFO [audio_processor.py:483] [AudioProcessor] Pitch align with mel is proceeding...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 73.88it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:510] [AudioProcessor] Pitch normalization is proceeding...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 72565.81it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 33209.06it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:518] [AudioProcessor] f0 mean and std saved to:
training_stage/test_male_ptts_feats/f0/f0_mean.txt,
training_stage/test_male_ptts_feats/f0/f0_std.txt
2023-03-30:23:02:37 INFO [audio_processor.py:521] [AudioProcessor] Pitch mean std norm is proceeding...
2023-03-30:23:02:37 INFO [audio_processor.py:548] [AudioProcessor] Pitch turn to phone-level is proceeding...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 182.40it/s]
2023-03-30:23:02:37 INFO [audio_processor.py:580] [AudioProcessor] Pitch normalization finished
2023-03-30:23:02:37 INFO [audio_processor.py:581] [AudioProcessor] Normed f0 saved to training_stage/test_male_ptts_feats/f0
2023-03-30:23:02:37 INFO [audio_processor.py:582] [AudioProcessor] Pitch extraction finished
2023-03-30:23:02:37 INFO [audio_processor.py:593] [AudioProcessor] Energy extraction started
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 116.35it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 59033.13it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 31968.78it/s]
2023-03-30:23:02:38 INFO [audio_processor.py:638] [AudioProcessor] energy mean and std saved to:
training_stage/test_male_ptts_feats/energy/energy_mean.txt,
training_stage/test_male_ptts_feats/energy/energy_std.txt
2023-03-30:23:02:38 INFO [audio_processor.py:642] [AudioProcessor] Energy mean std norm is proceeding...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 186.50it/s]
2023-03-30:23:02:38 INFO [audio_processor.py:690] [AudioProcessor] Energy normalization finished
2023-03-30:23:02:38 INFO [audio_processor.py:691] [AudioProcessor] Normed Energy saved to training_stage/test_male_ptts_feats/energy
2023-03-30:23:02:38 INFO [audio_processor.py:692] [AudioProcessor] Energy extraction finished
2023-03-30:23:02:38 INFO [audio_processor.py:774] [AudioProcessor] All features extracted successfully!
2023-03-30:23:02:38 INFO [data_process.py:192] Processing audio done.
2023-03-30:23:02:38 INFO [se_processor.py:63] [SpeakerEmbeddingProcessor] Speaker embedding extractor started
2023-03-30:23:02:38 ERROR [data_process.py:237] [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx failed:Protobuf parsing failed.
Traceback (most recent call last):
  File "kantts/preprocess/data_process.py", line 234, in <module>
    args.se_model,
  File "kantts/preprocess/data_process.py", line 199, in process_data
    se_model,
  File "/home/speech_personal_sambert_modelscope/KAN-TTS/kantts/preprocess/se_processor/se_processor.py", line 67, in process
    sess = onnxruntime.InferenceSession(se_onnx, sess_options=opts)
  File "/root/anaconda3/envs/maas/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/root/anaconda3/envs/maas/lib/python3.7/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 397, in _create_inference_session
    sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx failed:Protobuf parsing failed.
(maas) [root@VM-16-3-centos KAN-TTS]# 
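
A hedged diagnostic, assuming the model files were fetched with git and the repository stores se.onnx via git-lfs: INVALID_PROTOBUF often means the file on disk is a small git-lfs pointer rather than the actual model, which can be checked and fixed like this:

# if this prints "version https://git-lfs.github.com/spec/v1", the binary was never pulled
head -c 100 speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/basemodel_16k/speaker_embedding/se.onnx
# run inside the model repository to fetch the real binaries
git lfs pull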


"Load pinyin_en_mix_dict failed" is printed repeatedly while running

While running, "Load pinyin_en_mix_dict failed" is printed over and over. Audio is still produced normally, but does this log line indicate a problem? Am I missing something in my configuration?

I am using the SambertHifigan TTS Chinese multi-speaker pretrained 16k model directly.
The main script is as follows:
#!/bin/bash

# SambertHifigan TTS - Chinese - multi-speaker pretrained - 16k

git clone -b pretrain http://www.modelscope.cn/speech_tts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k.git

# speaker_list: 'F7,F74,FBYN,FRXL,M7,xiaoyu' - all except M7 are female

res_zip=../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/resource.zip
am_ckpt=../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_980000.pth
voc_ckpt=../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth
spk=xiaoyu

outdir=out_$spk
[ -d $outdir ] && rm -rf $outdir; mkdir -p $outdir

python ./kantts/bin/text_to_wav.py \
    --txt ./test_data/txt \
    --output_dir $outdir \
    --res_zip $res_zip \
    --am_ckpt $am_ckpt \
    --voc_ckpt $voc_ckpt \
    --speaker $spk

The log printed during the run:
Converting text to symbols...
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
Load pinyin_en_mix_dict failed
text.cc: festival_Text_init
AM is infering...
Loading checkpoint: ../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_980000.pth
Inference sentence: 0_0
x_band_width:7, h_band_width: 7
Inference sentence: 1_0
x_band_width:6, h_band_width: 6
Inference sentence: 2_0
x_band_width:8, h_band_width: 8
Inference sentence: 3_0
x_band_width:7, h_band_width: 7
Vocoder is infering...
Loss = {'discriminator_adv_loss': {'enable': True, 'params': {'average_by_discriminators': False}, 'weights': 1.0}, 'feat_match_loss': {'enable': True, 'params': {'average_by_discriminators': False, 'average_by_layers': False}, 'weights': 2.0}, 'generator_adv_loss': {'enable': True, 'params': {'average_by_discriminators': False}, 'weights': 1.0}, 'mel_loss': {'enable': True, 'params': {'fft_size': 2048, 'fmax': 8000, 'fmin': 0, 'fs': 16000, 'hop_size': 200, 'log_base': None, 'num_mels': 80, 'win_length': 1000, 'window': 'hann'}, 'weights': 45.0}, 'stft_loss': {'enable': False}, 'subband_stft_loss': {'enable': False, 'params': {'fft_sizes': [384, 683, 171], 'hop_sizes': [35, 75, 15], 'win_lengths': [150, 300, 60], 'window': 'hann_window'}}}
Model = {'Generator': {'optimizer': {'params': {'betas': [0.5, 0.9], 'lr': 0.0002, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'bias': True, 'channels': 256, 'in_channels': 80, 'kernel_size': 7, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'resblock_dilations': [[1, 3, 5, 7], [1, 3, 5, 7], [1, 3, 5, 7]], 'resblock_kernel_sizes': [3, 7, 11], 'upsample_kernal_sizes': [20, 10, 4, 4], 'upsample_scales': [10, 5, 2, 2], 'use_weight_norm': True}, 'scheduler': {'params': {'gamma': 0.5, 'milestones': [200000, 400000, 600000, 800000]}, 'type': 'MultiStepLR'}}, 'MultiPeriodDiscriminator': {'optimizer': {'params': {'betas': [0.5, 0.9], 'lr': 0.0002, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'discriminator_params': {'bias': True, 'channels': 32, 'downsample_scales': [3, 3, 3, 3, 1], 'in_channels': 1, 'kernel_sizes': [5, 3], 'max_downsample_channels': 1024, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1, 'use_spectral_norm': False}, 'periods': [2, 3, 5, 7, 11]}, 'scheduler': {'params': {'gamma': 0.5, 'milestones': [200000, 400000, 600000, 800000]}, 'type': 'MultiStepLR'}}, 'MultiScaleDiscriminator': {'optimizer': {'params': {'betas': [0.5, 0.9], 'lr': 0.0002, 'weight_decay': 0.0}, 'type': 'Adam'}, 'params': {'discriminator_params': {'bias': True, 'channels': 128, 'downsample_scales': [4, 4, 4, 4, 1], 'in_channels': 1, 'kernel_sizes': [15, 41, 5, 3], 'max_downsample_channels': 1024, 'max_groups': 16, 'nonlinear_activation': 'LeakyReLU', 'nonlinear_activation_params': {'negative_slope': 0.1}, 'out_channels': 1}, 'downsample_pooling': 'DWT', 'downsample_pooling_params': {'kernel_size': 4, 'padding': 2, 'stride': 2}, 'follow_official_norm': True, 'scales': 3}, 'scheduler': {'params': {'gamma': 0.5, 'milestones': [200000, 400000, 600000, 800000]}, 'type': 'MultiStepLR'}}}
allow_cache = True
audio_config = {'fmax': 8000.0, 'fmin': 0.0, 'hop_length': 200, 'max_norm': 1.0, 'min_level_db': -100.0, 'n_fft': 2048, 'n_mels': 80, 'norm_type': 'mean_std', 'num_workers': 16, 'phone_level_feature': True, 'preemphasize': False, 'ref_level_db': 20, 'sampling_rate': 16000, 'symmetric': False, 'trim_silence': True, 'trim_silence_threshold_db': 60, 'wav_normalize': True, 'win_length': 1000}
batch_max_steps = 9600
batch_size = 16
create_time = 2022-09-18 14:11:30
discriminator_grad_norm = -1
discriminator_train_start_steps = 0
eval_interval_steps = 10000
generator_grad_norm = -1
generator_train_start_steps = 1
git_revision_hash = 22ae438
log_interval_steps = 1000
model_type = hifigan
num_save_intermediate_results = 4
num_workers = 2
pin_memory = True
remove_short_samples = False
save_interval_steps = 20000
train_max_steps = 2500000
Loaded model parameters from ../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth.
Removing weight norm...
Finished generation of 4 utterances (RTF = 0.310).
['out_xiaoyu/0_0_mel_gen.wav', 'out_xiaoyu/1_0_mel_gen.wav', 'out_xiaoyu/2_0_mel_gen.wav', 'out_xiaoyu/3_0_mel_gen.wav']
Text to wav finished!

Conda list:

Name Version Build Channel

_libgcc_mutex 0.1 main https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_openmp_mutex 5.1 1_gnu https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
absl-py 1.4.0 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
aliyun-python-sdk-core 2.13.36 pypi_0 pypi
aliyun-python-sdk-kms 2.16.1 pypi_0 pypi
aniso8601 9.0.1 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
audioread 3.0.0 pypi_0 pypi
autopep8 2.0.2 pypi_0 pypi
bitstring 4.0.2 pypi_0 pypi
ca-certificates 2023.05.30 h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
certifi 2023.5.7 pypi_0 pypi
cffi 1.15.1 pypi_0 pypi
cfgv 3.3.1 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
click 8.0.4 pypi_0 pypi
cmake 3.26.4 pypi_0 pypi
coloredlogs 14.0 pypi_0 pypi
contourpy 1.1.0 pypi_0 pypi
crcmod 1.7 pypi_0 pypi
cryptography 41.0.1 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
cython 0.29.35 pypi_0 pypi
datasets 2.8.0 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
dill 0.3.6 pypi_0 pypi
distance 0.1.3 pypi_0 pypi
distlib 0.3.6 pypi_0 pypi
dnspython 2.3.0 pypi_0 pypi
easyasr 0.0.7 pypi_0 pypi
edit-distance 1.0.6 pypi_0 pypi
editdistance 0.5.2 pypi_0 pypi
einops 0.6.1 pypi_0 pypi
espnet-tts-frontend 0.0.3 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
eventlet 0.33.3 pypi_0 pypi
filelock 3.12.2 pypi_0 pypi
flask 2.1.3 pypi_0 pypi
flask-cors 3.0.10 pypi_0 pypi
flask-restful 0.3.10 pypi_0 pypi
flask-socketio 4.3.2 pypi_0 pypi
flask-talisman 1.0.0 pypi_0 pypi
fonttools 4.40.0 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
funasr 0.6.1 pypi_0 pypi
future 0.18.3 pypi_0 pypi
g2p 1.1.20230511 pypi_0 pypi
g2p-en 2.1.0 pypi_0 pypi
gast 0.5.4 pypi_0 pypi
greenlet 2.0.2 pypi_0 pypi
grpcio 1.54.2 pypi_0 pypi
h5py 3.8.0 pypi_0 pypi
huggingface-hub 0.15.1 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
hyperpyyaml 1.2.1 pypi_0 pypi
identify 2.5.24 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 6.6.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
inflect 6.0.4 pypi_0 pypi
itsdangerous 2.1.2 pypi_0 pypi
jaconv 0.3.4 pypi_0 pypi
jamo 0.4.1 pypi_0 pypi
jedi 0.18.2 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jmespath 0.10.0 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
kaldiio 2.18.0 pypi_0 pypi
kantts 0.0.1 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
kwsbp 0.0.6 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi 3.4.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
librosa 0.9.2 pypi_0 pypi
libstdcxx-ng 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit 16.0.5.post0 pypi_0 pypi
llvmlite 0.40.1rc1 pypi_0 pypi
lxml 4.9.2 pypi_0 pypi
markdown 3.4.3 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.7.1 pypi_0 pypi
mindaec 0.0.2 pypi_0 pypi
mir-eval 0.7 pypi_0 pypi
modelscope 1.6.1 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.0.5 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.14 pypi_0 pypi
munkres 1.1.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 2.8.4 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
nodeenv 1.8.0 pypi_0 pypi
numba 0.57.0 pypi_0 pypi
numpy 1.22.0 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.0.8 h7f8727e_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
oss2 2.18.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
panphon 0.20.0 pypi_0 pypi
parso 0.8.3 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.1.2 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
platformdirs 3.5.3 pypi_0 pypi
pooch 1.7.0 pypi_0 pypi
pre-commit 3.3.3 pypi_0 pypi
prompt-toolkit 3.0.38 pypi_0 pypi
protobuf 3.20.0 pypi_0 pypi
ptflops 0.7 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
py-sound-connect 0.2.1 pypi_0 pypi
pyarrow 12.0.1 pypi_0 pypi
pycodestyle 2.10.0 pypi_0 pypi
pycparser 2.21 pypi_0 pypi
pycryptodome 3.18.0 pypi_0 pypi
pydantic 1.10.9 pypi_0 pypi
pygments 2.15.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pypinyin 0.44.0 pypi_0 pypi
pysptk 0.1.21 pypi_0 pypi
python 3.8.16 h955ad1f_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.8.2 pypi_0 pypi
python-engineio 3.14.2 pypi_0 pypi
python-socketio 4.6.1 pypi_0 pypi
pytorch-wavelets 1.3.0 pypi_0 pypi
pytorch-wpe 0.0.1 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pywavelets 1.4.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
regex 2023.6.3 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
resampy 0.4.2 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rotary-embedding-torch 0.2.3 pypi_0 pypi
ruamel-yaml 0.17.28 pypi_0 pypi
ruamel-yaml-clib 0.2.7 pypi_0 pypi
scikit-learn 1.2.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 67.8.0 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
simplejson 3.19.1 pypi_0 pypi
six 1.16.0 pypi_0 pypi
sortedcontainers 2.4.0 pypi_0 pypi
soundfile 0.12.1 pypi_0 pypi
sox 1.4.1 pypi_0 pypi
speechbrain 0.5.14 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sympy 1.12 pypi_0 pypi
tensorboard 1.15.0 pypi_0 pypi
tensorboardx 2.6 pypi_0 pypi
text-unidecode 1.3 pypi_0 pypi
textgrid 1.5 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tomli 2.0.1 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
torch-complex 0.4.3 pypi_0 pypi
torchaudio 2.0.2 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
traitlets 5.9.0 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
ttsfrd 0.2.1 pypi_0 pypi
typeguard 2.13.3 pypi_0 pypi
typing-extensions 4.6.3 pypi_0 pypi
unicodecsv 0.14.1 pypi_0 pypi
unidecode 1.3.6 pypi_0 pypi
urllib3 2.0.3 pypi_0 pypi
virtualenv 20.23.0 pypi_0 pypi
wcwidth 0.2.6 pypi_0 pypi
werkzeug 2.0.3 pypi_0 pypi
wheel 0.38.4 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xxhash 3.2.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
yapf 0.40.0 pypi_0 pypi
yarl 1.9.2 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

git rev-parse HEAD:
8caf892

Choice of activation function in the HiFi-GAN vocoder

Hello,

I noticed that in the upsample part, this HiFi-GAN first applies

x = torch.sin(x) + x

This seems to replace the activation function used in the original implementation. Is there a reference for this choice, or was it validated by experiments?

Multi-GPU training not working?

[WeChat screenshot 20230117111115]

If your GPU devices are enough, you can use distributed training, which is a lot faster than single GPU training. For example, assign GPU device indexes with CUDA_VISIBLE_DEVICES system variable, --nproc_per_node denotes the count of GPU devices.

But the script has no --nproc_per_node argument?
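
A sketch of the intended launch, assuming --nproc_per_node belongs to the PyTorch launcher rather than the training script itself (the script reads a --local_rank argument that the launcher injects); the train_sambert.py flags are mirrored from the train_hifigan.py invocations elsewhere in these issues:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 \
    kantts/bin/train_sambert.py --model_config <config.yaml> --root_dir <feats_dir> --stage_dir <ckpt_dir>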

Error when fine-tuning Sambert-hifigan with general-format data

The fine-tuning steps follow: https://mp.weixin.qq.com/s/Xo-pMe3-P-fJ-32Z1JLonA
Already tried:
kantts/configs/sambert_16k_MAS.yaml (speaker already modified)
speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/config.yaml (speaker already modified)

The general-format data layout is as follows:

.
├── prosody
│   └── prosody.txt
└── wav
    ├── 1.wav
    ├── 2.wav
    ├── ...
    └── 9000.wav

The error message is as follows:

Traceback (most recent call last):
  File "kantts/bin/train_sambert.py", line 231, in <module>
    args.local_rank,
  File "kantts/bin/train_sambert.py", line 122, in train
    pin_memory=config["pin_memory"],
  File "/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 262, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 104, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

kantts/bin/text_to_wav.py errors out when loading the SambertHifigan TTS Chinese multi-speaker pretrained 16k model for synthesis

Error message:
Converting text to symbols...
Traceback (most recent call last):
  File "./kantts/bin/text_to_wav.py", line 168, in <module>
    text_to_wav(
  File "./kantts/bin/text_to_wav.py", line 102, in text_to_wav
    speaker = config["linguistic_unit"]["speaker_list"].split(",")[0]
KeyError: 'linguistic_unit'

The run script:
#!/bin/bash

# SambertHifigan TTS - Chinese - multi-speaker pretrained - 16k
# speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k

outdir=out
[ -d $outdir ] && rm -rf $outdir; mkdir -p $outdir

python ./kantts/bin/text_to_wav.py \
    --txt ./test_data/txt \
    --output_dir $outdir \
    --res_zip ../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/resource.zip \
    --am_ckpt ../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth \
    --voc_ckpt ../funtts/speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth

Runtime environment: Linux
conda list:

packages in environment at /home/eeodev/anaconda3/envs/funtts:

Name Version Build Channel

_libgcc_mutex 0.1 main https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_openmp_mutex 5.1 1_gnu https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
absl-py 1.4.0 pypi_0 pypi
addict 2.4.0 pypi_0 pypi
aiohttp 3.8.4 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
aliyun-python-sdk-core 2.13.36 pypi_0 pypi
aliyun-python-sdk-kms 2.16.1 pypi_0 pypi
aniso8601 9.0.1 pypi_0 pypi
async-timeout 4.0.2 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
audioread 3.0.0 pypi_0 pypi
autopep8 2.0.2 pypi_0 pypi
bitstring 4.0.2 pypi_0 pypi
ca-certificates 2023.05.30 h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
certifi 2023.5.7 pypi_0 pypi
cffi 1.15.1 pypi_0 pypi
cfgv 3.3.1 pypi_0 pypi
charset-normalizer 3.1.0 pypi_0 pypi
click 8.0.4 pypi_0 pypi
cmake 3.26.4 pypi_0 pypi
coloredlogs 14.0 pypi_0 pypi
contourpy 1.1.0 pypi_0 pypi
crcmod 1.7 pypi_0 pypi
cryptography 41.0.1 pypi_0 pypi
cycler 0.11.0 pypi_0 pypi
cython 0.29.35 pypi_0 pypi
datasets 2.8.0 pypi_0 pypi
decorator 5.1.1 pypi_0 pypi
dill 0.3.6 pypi_0 pypi
distance 0.1.3 pypi_0 pypi
distlib 0.3.6 pypi_0 pypi
dnspython 2.3.0 pypi_0 pypi
easyasr 0.0.7 pypi_0 pypi
edit-distance 1.0.6 pypi_0 pypi
editdistance 0.5.2 pypi_0 pypi
einops 0.6.1 pypi_0 pypi
espnet-tts-frontend 0.0.3 pypi_0 pypi
et-xmlfile 1.1.0 pypi_0 pypi
eventlet 0.33.3 pypi_0 pypi
filelock 3.12.2 pypi_0 pypi
flask 2.1.3 pypi_0 pypi
flask-cors 3.0.10 pypi_0 pypi
flask-restful 0.3.10 pypi_0 pypi
flask-socketio 4.3.2 pypi_0 pypi
flask-talisman 1.0.0 pypi_0 pypi
fonttools 4.40.0 pypi_0 pypi
frozenlist 1.3.3 pypi_0 pypi
fsspec 2023.6.0 pypi_0 pypi
funasr 0.6.1 pypi_0 pypi
future 0.18.3 pypi_0 pypi
g2p 1.1.20230511 pypi_0 pypi
g2p-en 2.1.0 pypi_0 pypi
gast 0.5.4 pypi_0 pypi
greenlet 2.0.2 pypi_0 pypi
grpcio 1.54.2 pypi_0 pypi
h5py 3.8.0 pypi_0 pypi
huggingface-hub 0.15.1 pypi_0 pypi
humanfriendly 10.0 pypi_0 pypi
hyperpyyaml 1.2.1 pypi_0 pypi
identify 2.5.24 pypi_0 pypi
idna 3.4 pypi_0 pypi
importlib-metadata 6.6.0 pypi_0 pypi
importlib-resources 5.12.0 pypi_0 pypi
inflect 6.0.4 pypi_0 pypi
itsdangerous 2.1.2 pypi_0 pypi
jaconv 0.3.4 pypi_0 pypi
jamo 0.4.1 pypi_0 pypi
jedi 0.18.2 pypi_0 pypi
jinja2 3.1.2 pypi_0 pypi
jmespath 0.10.0 pypi_0 pypi
joblib 1.2.0 pypi_0 pypi
kaldiio 2.18.0 pypi_0 pypi
kantts 0.0.1 pypi_0 pypi
kiwisolver 1.4.4 pypi_0 pypi
kwsbp 0.0.6 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi 3.4.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
librosa 0.9.2 pypi_0 pypi
libstdcxx-ng 11.2.0 h1234567_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
lit 16.0.5.post0 pypi_0 pypi
llvmlite 0.40.1rc1 pypi_0 pypi
lxml 4.9.2 pypi_0 pypi
markdown 3.4.3 pypi_0 pypi
markupsafe 2.1.3 pypi_0 pypi
matplotlib 3.7.1 pypi_0 pypi
mindaec 0.0.2 pypi_0 pypi
mir-eval 0.7 pypi_0 pypi
modelscope 1.6.1 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
msgpack 1.0.5 pypi_0 pypi
multidict 6.0.4 pypi_0 pypi
multiprocess 0.70.14 pypi_0 pypi
munkres 1.1.4 pypi_0 pypi
ncurses 6.4 h6a678d5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 2.8.4 pypi_0 pypi
nltk 3.8.1 pypi_0 pypi
nodeenv 1.8.0 pypi_0 pypi
numba 0.57.0 pypi_0 pypi
numpy 1.22.0 pypi_0 pypi
nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi
nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi
nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi
nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi
nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi
nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi
nvidia-curand-cu11 10.2.10.91 pypi_0 pypi
nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi
nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi
nvidia-nccl-cu11 2.14.3 pypi_0 pypi
nvidia-nvtx-cu11 11.7.91 pypi_0 pypi
openpyxl 3.1.2 pypi_0 pypi
openssl 3.0.8 h7f8727e_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
oss2 2.18.0 pypi_0 pypi
packaging 23.1 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
panphon 0.20.0 pypi_0 pypi
parso 0.8.3 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pillow 9.5.0 pypi_0 pypi
pip 23.1.2 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
platformdirs 3.5.3 pypi_0 pypi
pooch 1.7.0 pypi_0 pypi
pre-commit 3.3.3 pypi_0 pypi
prompt-toolkit 3.0.38 pypi_0 pypi
protobuf 3.20.0 pypi_0 pypi
ptflops 0.7 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
py-sound-connect 0.2.1 pypi_0 pypi
pyarrow 12.0.1 pypi_0 pypi
pycodestyle 2.10.0 pypi_0 pypi
pycparser 2.21 pypi_0 pypi
pycryptodome 3.18.0 pypi_0 pypi
pydantic 1.10.9 pypi_0 pypi
pygments 2.15.1 pypi_0 pypi
pyparsing 3.0.9 pypi_0 pypi
pypinyin 0.44.0 pypi_0 pypi
pysptk 0.1.21 pypi_0 pypi
python 3.8.16 h955ad1f_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil 2.8.2 pypi_0 pypi
python-engineio 3.14.2 pypi_0 pypi
python-socketio 4.6.1 pypi_0 pypi
pytorch-wavelets 1.3.0 pypi_0 pypi
pytorch-wpe 0.0.1 pypi_0 pypi
pytz 2023.3 pypi_0 pypi
pywavelets 1.4.1 pypi_0 pypi
pyyaml 6.0 pypi_0 pypi
readline 8.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
regex 2023.6.3 pypi_0 pypi
requests 2.31.0 pypi_0 pypi
resampy 0.4.2 pypi_0 pypi
responses 0.18.0 pypi_0 pypi
rotary-embedding-torch 0.2.3 pypi_0 pypi
ruamel-yaml 0.17.28 pypi_0 pypi
ruamel-yaml-clib 0.2.7 pypi_0 pypi
scikit-learn 1.2.2 pypi_0 pypi
scipy 1.10.1 pypi_0 pypi
sentencepiece 0.1.99 pypi_0 pypi
setuptools 67.8.0 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
simplejson 3.19.1 pypi_0 pypi
six 1.16.0 pypi_0 pypi
sortedcontainers 2.4.0 pypi_0 pypi
soundfile 0.12.1 pypi_0 pypi
sox 1.4.1 pypi_0 pypi
speechbrain 0.5.14 pypi_0 pypi
sqlite 3.41.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sympy 1.12 pypi_0 pypi
tensorboard 1.15.0 pypi_0 pypi
tensorboardx 2.6 pypi_0 pypi
text-unidecode 1.3 pypi_0 pypi
textgrid 1.5 pypi_0 pypi
threadpoolctl 3.1.0 pypi_0 pypi
tk 8.6.12 h1ccaba5_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tomli 2.0.1 pypi_0 pypi
torch 2.0.1 pypi_0 pypi
torch-complex 0.4.3 pypi_0 pypi
torchaudio 2.0.2 pypi_0 pypi
tqdm 4.65.0 pypi_0 pypi
traitlets 5.9.0 pypi_0 pypi
triton 2.0.0 pypi_0 pypi
ttsfrd 0.2.1 pypi_0 pypi
typeguard 2.13.3 pypi_0 pypi
typing-extensions 4.6.3 pypi_0 pypi
unicodecsv 0.14.1 pypi_0 pypi
unidecode 1.3.6 pypi_0 pypi
urllib3 2.0.3 pypi_0 pypi
virtualenv 20.23.0 pypi_0 pypi
wcwidth 0.2.6 pypi_0 pypi
werkzeug 2.0.3 pypi_0 pypi
wheel 0.38.4 py38h06a4308_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xxhash 3.2.0 pypi_0 pypi
xz 5.4.2 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
yapf 0.40.0 pypi_0 pypi
yarl 1.9.2 pypi_0 pypi
zipp 3.15.0 pypi_0 pypi
zlib 1.2.13 h5eee18b_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

About license

Hi~:

Thanks for the great job! But we did not find an open-source license in the project. Is there any information about the license?

CUFFT_INTERNAL_ERROR when training the vocoder

CUDA_VISIBLE_DEVICES=0 python kantts/bin/train_hifigan.py --model_config speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/config.yaml --resume_path speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth --root_dir newtest/training_stage/SSB0009_feats --stage_dir newtest/training_stage/SSB0009_hifigan_ckpt

cuFFT error: CUFFT_INTERNAL_ERROR
Traceback (most recent call last):
  File "kantts/bin/train_hifigan.py", line 171, in train
    trainer.train()
  File "/home/dufei/git/KAN-TTS/kantts/train/trainer.py", line 199, in train
    self.train_epoch()
  File "/home/dufei/git/KAN-TTS/kantts/train/trainer.py", line 207, in train_epoch
    self.train_step(batch)
  File "/home/dufei/git/KAN-TTS/kantts/train/trainer.py", line 509, in train_step
    mel_loss = self.criterion["mel_loss"](y_, y)
  File "/root/anaconda3/envs/maasKan1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufei/git/KAN-TTS/kantts/train/loss.py", line 307, in forward
    mel_hat = self.mel_spectrogram(y_hat)
  File "/root/anaconda3/envs/maasKan1/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/dufei/git/KAN-TTS/kantts/utils/audio_torch.py", line 175, in forward
    x_stft = torch.stft(x, window=window, **self.stft_params)
  File "/root/anaconda3/envs/maasKan1/lib/python3.7/site-packages/torch/functional.py", line 633, in stft
    normalized, onesided, return_complex)
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
2023-08-17:16:52:02, INFO [train_hifigan.py:179] Successfully saved checkpoint @ 1steps.

Some questions about training and inference

Hello, I have some questions.

  1. If I want to train a model on other datasets, say, the Nancy Corpus (Blizzard Challenge 2011 dataset), how should I prepare the data? Is there an example of the "general data" mentioned in the wiki?
  2. Is it possible to control the speed of the speech during inference?

Thank you!

pqmf

Hi, is there a config for multi-band (PQMF)?

Error during the voice feature extraction step

During feature extraction the program runs normally, with no errors or warnings, but the final output is missing the se folder. I am using tts-autolabel 1.1.7 and modelscope 1.8.1.

subprocess.CalledProcessError: Command '['git', 'rev-parse', 'HEAD']' returned non-zero exit status 128.

OS: centos7.9

Python/C++ Version: python3.9, gcc4.8.5

Package Version: pytorch==1.13.1, modelscope==1.5.2, kantts==1.0.0, torchaudio==0.13.1

Model:
speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k

Command:

from modelscope.metainfo import Trainers
from modelscope.trainers import build_trainer
from modelscope.utils.audio.audio_utils import TtsTrainType

pretrained_model_id = 'damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k'

dataset_id = "./output_training_data/"
pretrain_work_dir = "./pretrain_work_dir/"

# Training info, used to specify which model(s) to train; this example shows training both the AM and the vocoder
# Currently supported: TtsTrainType.TRAIN_TYPE_SAMBERT, TtsTrainType.TRAIN_TYPE_VOC
# Training SAMBERT fine-tunes from the model's latest step
train_info = {
    TtsTrainType.TRAIN_TYPE_SAMBERT: {  # configure training of the AM (sambert) model
        'train_steps': 202,  # number of steps to train
        'save_interval_steps': 200,  # save a checkpoint every this many steps
        'log_interval': 10  # print a training log every this many steps
    }
}

# Configure training parameters: specify the dataset, a temporary work directory, and train_info
kwargs = dict(
    model=pretrained_model_id,  # the model to finetune
    model_revision="v1.0.5",
    work_dir=pretrain_work_dir,  # temporary work directory
    train_dataset=dataset_id,  # dataset id
    train_type=train_info  # training types and their parameters
)

trainer = build_trainer(Trainers.speech_kantts_trainer,
                        default_args=kwargs)

trainer.train()

ERROR:

(audio) [root@ecs-97b3-0001 /data/audio/kantts]# python train.py
2023-06-20 17:46:00,777 - modelscope - INFO - PyTorch version 1.13.1+cu116 Found.
2023-06-20 17:46:00,778 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-06-20 17:46:00,804 - modelscope - INFO - Loading done! Current index file version is 1.5.2, with md5 2b4346fea97faefdf1f85f3cdc38c819 and a total number of 860 components indexed
2023-06-20 17:46:02,103 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-06-20 17:46:02,871 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-06-20 17:46:04,294 - modelscope - INFO - Set workdir to ./pretrain_work_dir/
2023-06-20 17:46:04,555 - modelscope - INFO - load ./output_training_data/
2023-06-20 17:46:04,704 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-06-20 17:46:05,905 - modelscope - INFO - am_config=./pretrain_work_dir/orig_model/basemodel_16k/sambert/config.yaml voc_config=./pretrain_work_dir/orig_model/basemodel_16k/hifigan/config.yaml
2023-06-20 17:46:05,905 - modelscope - INFO - audio_config=./pretrain_work_dir/orig_model/basemodel_16k/audio_config_se_16k.yaml
2023-06-20 17:46:05,905 - modelscope - INFO - am_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/sambert/ckpt/checkpoint_2400000.pth')])
2023-06-20 17:46:05,905 - modelscope - INFO - voc_ckpts=OrderedDict([(2400000, './pretrain_work_dir/orig_model/basemodel_16k/hifigan/ckpt/checkpoint_2400000.pth')])
2023-06-20 17:46:05,905 - modelscope - INFO - se_path=./pretrain_work_dir/orig_model/se.npy se_model_path=./pretrain_work_dir/orig_model/basemodel_16k/speaker_embedding/se.onnx
2023-06-20 17:46:05,905 - modelscope - INFO - mvn_path=./pretrain_work_dir/orig_model/mvn.npy
Load pinyin_en_mix_dict failed   (this line repeated 16 times)
text.cc: festival_Text_init
fatal: Not a git repository (or any parent up to mount point /data)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
Traceback (most recent call last):
  File "/data/audio/kantts/train.py", line 33, in <module>
    trainer.train()
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/modelscope/trainers/audio/tts_trainer.py", line 229, in train
    self.prepare_data()
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/modelscope/trainers/audio/tts_trainer.py", line 205, in prepare_data
    self.audio_data_preprocessor(self.raw_dataset_path, self.data_dir,
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/modelscope/preprocessors/tts.py", line 36, in __call__
    self.do_data_process(data_dir, output_dir, audio_config_path,
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/modelscope/preprocessors/tts.py", line 56, in do_data_process
    process_data(datadir, outputdir, audio_config, speaker_name,
  File "/data/audio/kantts/kantts/preprocess/data_process.py", line 137, in process_data
    config["git_revision_hash"] = get_git_revision_hash()
  File "/data/audio/kantts/kantts/utils/log.py", line 26, in get_git_revision_hash
    return subprocess.check_output(["git", "rev-parse", "HEAD"]).decode("ascii").strip()
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['git', 'rev-parse', 'HEAD']' returned non-zero exit status 128.
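
For reference, the traceback shows that kantts records the current git revision during preprocessing by running git rev-parse HEAD, which fails with exit status 128 when the script runs outside a git working tree (see the "fatal: Not a git repository" line above). Running train.py from inside a cloned KAN-TTS checkout avoids this; alternatively, a defensive sketch of get_git_revision_hash in kantts/utils/log.py that falls back instead of crashing (an assumption, not the shipped fix):

import subprocess

def get_git_revision_hash():
    # Fall back to a placeholder when the working directory is not inside a
    # git repository, which is what exit status 128 from 'git rev-parse' means.
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode("ascii").strip()
    except subprocess.CalledProcessError:
        return "unknown"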

Inconsistent checkpoint-naming rules when fine-tuning sambert vs. hifigan; male-voice inference requires speaker F7

1. Following the tutorial at https://modelscope.cn/models/damo/speech_personal_sambert-hifigan_nsf_tts_zh-cn_pretrain_16k/summary, the fine-tuned sambert checkpoint names continue from the base model's step count, while hifigan's do not and start from 0; yet according to the logs, hifigan is also fine-tuned from the resume path.

The train_max_steps settings for fine-tuning sambert and hifigan at https://modelscope.cn/docs/sambert are likewise inconsistent: the former adds the base model's steps, the latter does not.

2. For the open-source male-voice models on ModelScope (aixiang, zhida, zhishuo, etc.), inference with text_to_wav.py must pass --speaker F7, otherwise the audio quality is very poor. Normally F7 is the female voice and M7 the male voice.

KeyError: 'test_male'

Original Traceback (most recent call last):
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/datasets/dataset.py", line 461, in __getitem__
    ling_data = self.ling_unit.encode_symbol_sequence(ling_txt)
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 226, in encode_symbol_sequence
    lfeat_symbol_separate[index].strip(), self._lfeat_type_list[index]
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 293, in encode_sub_unit
    sequence = self.encode_speaker_category(this_lfeat_symbol)
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 393, in encode_speaker_category
    sequence.append(self._speaker_to_id[this_speaker])
KeyError: 'test_male'
Traceback (most recent call last):
  File "kantts/bin/train_sambert.py", line 190, in train
    trainer.train()
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/train/trainer.py", line 199, in train
    self.train_epoch()
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/train/trainer.py", line 206, in train_epoch
    for batch in tqdm(self.train_loader):
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/akira/anaconda3/envs/maas/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/datasets/dataset.py", line 461, in __getitem__
    ling_data = self.ling_unit.encode_symbol_sequence(ling_txt)
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 226, in encode_symbol_sequence
    lfeat_symbol_separate[index].strip(), self._lfeat_type_list[index]
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 293, in encode_sub_unit
    sequence = self.encode_speaker_category(this_lfeat_symbol)
  File "/home/akira/KAN-TTS-1/KAN-TTS/kantts/utils/ling_unit/ling_unit.py", line 393, in encode_speaker_category
    sequence.append(self._speaker_to_id[this_speaker])
KeyError: 'test_male'

Has anyone run into this problem, and how can it be solved?
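
For reference, the exception is raised in encode_speaker_category when the speaker name taken from the training metadata ('test_male') is absent from the model's speaker table. A quick check, assuming the usual KAN-TTS layout where the speaker list is a comma-separated string under linguistic_unit in the audio config (the path and key layout below are assumptions):

import yaml

with open("audio_config.yaml") as f:  # hypothetical: the model's audio config
    config = yaml.safe_load(f)

speakers = config["linguistic_unit"]["speaker_list"].split(",")
print(speakers)  # 'test_male' must appear here, or the KeyError above is raised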

hifigan architecture design?

Hi, I'd like to ask about the hifigan architecture: besides the transpose_upsamples from the original hifigan structure, an extra nn.Upsample is stacked on top. What is the motivation for this? Is the benefit that the results are more stable?
It feels like stacking the two structures together increases the computational cost.

How is the ttsfrd package downloaded?

Whether I run pip install ttsfrd directly or go through the Tsinghua, Douban, or Aliyun mirrors, I always get:
ERROR: Could not find a version that satisfies the requirement ttsfrd (from versions: none)
ERROR: No matching distribution found for ttsfrd
(The "from versions: none" output indicates that no ttsfrd release is published on PyPI at all.)

Does kantts support multi-speaker models?

The kantts training guide is single-speaker; even with the multi-speaker AISHELL-3 data, speakers are trained one at a time.
The open-source multi-speaker models on ModelScope are internally a combination of several acoustic models rather than a single model.

Can a single kantts model support multiple speakers?
If not, what is the specific reason; what mechanism causes multiple speakers to interfere with each other?
If it can, how exactly should it be set up to achieve this?

Errors running WuuShanghai and Cantonese in KAN-TTS or ModelScope

python3.8 x86_64 Ubuntu server

Running WuuShanghai from the basemodel_16k directory in KAN-TTS:
Traceback (most recent call last):
  File "kantts/bin/text_to_wav.py", line 218, in <module>
    text_to_wav(
  File "kantts/bin/text_to_wav.py", line 157, in text_to_wav
    am_infer(symbols_file, am_ckpt, output_dir, se_file)
  File "/data2/caoyangang/open_project/KAN-TTS/kantts/bin/infer_sambert.py", line 217, in am_infer
    mel, mel_post, dur, f0, energy = am_synthesis(
  File "/data2/caoyangang/open_project/KAN-TTS/kantts/bin/infer_sambert.py", line 83, in am_synthesis
    inputs_ling = torch.stack(
RuntimeError: stack expects each tensor to be equal size, but got [16] at entry 0 and [78] at entry 1

Running WuuShanghai from the voices directory in KAN-TTS gives the same error as above.

However, running WuuShanghai through ModelScope works fine.

Running the Cantonese model from the basemodel_16k directory in KAN-TTS:
Traceback (most recent call last):
  File "kantts/bin/text_to_wav.py", line 218, in <module>
    text_to_wav(
  File "kantts/bin/text_to_wav.py", line 157, in text_to_wav
    am_infer(symbols_file, am_ckpt, output_dir, se_file)
  File "/data2/caoyangang/open_project/KAN-TTS/kantts/bin/infer_sambert.py", line 217, in am_infer
    mel, mel_post, dur, f0, energy = am_synthesis(
  File "/data2/caoyangang/open_project/KAN-TTS/kantts/bin/infer_sambert.py", line 83, in am_synthesis
    inputs_ling = torch.stack(
RuntimeError: stack expects each tensor to be equal size, but got [16] at entry 0 and [78] at entry 1

Running the Cantonese model from the voices directory in KAN-TTS:
Traceback (most recent call last):
  File "kantts/bin/text_to_wav.py", line 218, in <module>
    text_to_wav(
  File "kantts/bin/text_to_wav.py", line 157, in text_to_wav
    am_infer(symbols_file, am_ckpt, output_dir, se_file)
  File "/data2/caoyangang/open_project/KAN-TTS/kantts/bin/infer_sambert.py", line 201, in am_infer
    fsnet.load_state_dict(state_dict["model"], strict=False)
  File "/home/eeodev/anaconda3/envs/kan-tts/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for KanTtsSAMBERT:
    size mismatch for text_encoder.sy_emb.weight: copying a param with shape torch.Size([107, 512]) from checkpoint, the shape in current model is torch.Size([147, 512]).
    size mismatch for text_encoder.tone_emb.weight: copying a param with shape torch.Size([14, 512]) from checkpoint, the shape in current model is torch.Size([10, 512]).

Running the Cantonese model through ModelScope gives the same error as above.
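
For reference, size mismatches in sy_emb/tone_emb usually mean the symbol and tone inventories built from the resource files differ from those the checkpoint was trained with (e.g. a Mandarin resource paired with a Cantonese checkpoint). A quick way to inspect the checkpoint side (the path below is hypothetical; the state-dict keys come from the error above):

import torch

state = torch.load("checkpoint.pth", map_location="cpu")["model"]
print(state["text_encoder.sy_emb.weight"].shape)    # symbol embedding table
print(state["text_encoder.tone_emb.weight"].shape)  # tone embedding table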

bug?

Hello, in the kantts_sambert.py code, lines 967-975:
LFR_text_inputs = LR_text_outputs.contiguous().view(batch_size, -1, self.mel_decoder.r * text_hid.shape[-1])
LFR_emo_inputs = LR_emo_outputs.contiguous().view(batch_size, -1, self.mel_decoder.r * emo_hid.shape[-1])[:, :, : emo_hid.shape[-1]]
LFR_spk_inputs = LR_spk_outputs.contiguous().view(batch_size, -1, self.mel_decoder.r * spk_hid.shape[-1])[:, :, : spk_hid.shape[-1]]

If data.size() is [64, 100, 40] and r = 3,
then data.contiguous().view(64, -1, 3 * 40) raises an error, because the 100-frame time axis is not a multiple of r (64 × 100 × 40 elements cannot be reshaped to [64, -1, 120]).
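
A minimal reproduction of the constraint (shapes taken from the example above): view() only succeeds when the time axis is a multiple of r, so an LR output that is not length-padded fails exactly as described.

import torch

x = torch.randn(64, 100, 40)
r = 3
try:
    x.contiguous().view(64, -1, r * 40)
except RuntimeError as e:
    print(e)  # 64*100*40 elements cannot be viewed as [64, -1, 120]

# Padding the time axis up to a multiple of r makes the reshape valid,
# which is presumably what the surrounding pipeline guarantees for LR outputs.
pad = (-x.shape[1]) % r  # 2 extra frames: 100 -> 102
x = torch.nn.functional.pad(x, (0, 0, 0, pad))
print(x.contiguous().view(64, -1, r * 40).shape)  # torch.Size([64, 34, 120])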

Error during audio format conversion: TypeError: check_argument_types() missing 1 required positional argument: 'func'

2023-06-19 14:08:44,582 - modelscope - INFO - Use user-specified model revision: v1.0.5
2023-06-19:14:08:44, INFO [api.py:470] Use user-specified model revision: v1.0.5
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████| 1.02G/1.02G [00:33<00:00, 32.5MB/s]
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████| 6.27k/6.27k [00:00<00:00, 1.16MB/s]
2023-06-19:14:09:30, INFO [auto_label.py:91] ---  New folder /data/audio/ptts_spk0_autolabel/paragraph/prosody...  ---
2023-06-19:14:09:30, INFO [auto_label.py:92] ---  OK  ---
2023-06-19:14:09:30, INFO [auto_label.py:91] ---  New folder /data/audio/ptts_spk0_autolabel/sp_interval...  ---
2023-06-19:14:09:30, INFO [auto_label.py:92] ---  OK  ---
2023-06-19:14:09:30, INFO [auto_label.py:91] ---  New folder /data/audio/ptts_spk0_autolabel/wav...  ---
2023-06-19:14:09:30, INFO [auto_label.py:92] ---  OK  ---
2023-06-19:14:09:30, INFO [auto_label.py:91] ---  New folder /data/audio/ptts_spk0_autolabel/log...  ---
2023-06-19:14:09:30, INFO [auto_label.py:92] ---  OK  ---
2023-06-19:14:09:30, INFO [auto_label.py:301] 2023-06-19 14:09:30
2023-06-19:14:09:30, INFO [auto_label.py:355] wav_preprocess start...
---  new folder...  ---
---  OK  ---
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 128.04it/s]
2023-06-19:14:09:30, INFO [auto_label.py:367] wav cut by vad start...
Traceback (most recent call last):
  File "/data/audio/tran.py", line 8, in <module>
    ret, report = run_auto_label(input_wav = input_wav,
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/modelscope/tools/speech_tts_autolabel.py", line 78, in run_auto_label
    ret_code, report = auto_labeling.run()
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/tts_autolabel/auto_label.py", line 765, in run
    self.wav_cut_by_vad()
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/tts_autolabel/auto_label.py", line 371, in wav_cut_by_vad
    vad_cut(self.resample_wav_dir, self.cut_wav_dir, self.resource_dir)
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/tts_autolabel/audiocut/vad.py", line 55, in vad_cut
    vad_pipeline = Fsmn_vad(vad_model_dir)
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/vad_bin.py", line 62, in __init__
    self.frontend = WavFrontend(
  File "/data/soft/anaconda3/envs/audio/lib/python3.9/site-packages/tts_autolabel/audio2phone/funasr_onnx/utils/frontend.py", line 32, in __init__
    check_argument_types()
TypeError: check_argument_types() missing 1 required positional argument: 'func'

########################################################################################
The code used is:

# Import the run_auto_label tool; the first run downloads the required libraries
from modelscope.tools import run_auto_label
# Run autolabel for automatic annotation; about 4 minutes for 20 audio clips
import os

input_wav = '/mnt/workspace/Data/ptts_spk0_wav'  # wav audio path
work_dir = '/mnt/workspace/Data/ptts_spk0_autolabel'  # output path
os.makedirs(work_dir, exist_ok=True)

ret, report = run_auto_label(input_wav=input_wav,
                             work_dir=work_dir,
                             resource_revision='v1.0.5')
print(report)
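
A hedged note: this failure pattern is commonly tied to the installed typeguard version. In the typeguard 2.x API, check_argument_types() takes no required positional arguments, while later releases changed the signature, which matches the TypeError above; pinning typeguard to a 2.x release (e.g. typeguard==2.13.3) is the fix reported in similar issues, though not verified against this exact environment. Checking the installed version:

from importlib.metadata import version

# If this prints a 3.x/4.x version, the 2.x-style check_argument_types()
# call in funasr_onnx/utils/frontend.py will fail as shown above.
print(version("typeguard"))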

Could not load library libcudnn_cnn_infer.so.8.

(/media/lab-hp/B23AB5DD3AB59F33/condaenv/maas) lab-hp@labhp-HP:~/桌面/KAN-TTS$ python ./kantts/bin/text_to_wav.py --txt test.txt --output_dir res --res_zip speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/resource.zip --am_ckpt speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_980000.pth --voc_ckpt speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/hifigan/ckpt/checkpoint_2000000.pth --speaker xiaoyu
2023-07-24:22:10:22, INFO [text_to_wav.py:97] Converting text to symbols...
Load pinyin_en_mix_dict failed   (this line repeated 16 times)
text.cc: festival_Text_init
2023-07-24:22:10:26, INFO [text_to_wav.py:109] AM is infering...
2023-07-24:22:10:29, INFO [infer_sambert.py:198] Loading checkpoint: speech_sambert-hifigan_tts_zh-cn_multisp_pretrain_16k/basemodel_16k/sambert/ckpt/checkpoint_980000.pth
2023-07-24:22:10:29, INFO [infer_sambert.py:210] Inference sentence: 0_0
Could not load library libcudnn_cnn_infer.so.8. Error: /media/lab-hp/B23AB5DD3AB59F33/condaenv/maas/bin/../lib/libnvrtc.so: undefined symbol: nvrtcGetCUBIN
Aborted (core dumped)

How can the "Load pinyin_en_mix_dict failed" messages during inference be avoided?

When running speech synthesis, the following messages frequently appear:

Load pinyin_en_mix_dict failed   (this line repeated 16 times)
text.cc: festival_Text_init

I found that this is caused by the ttsfrd package loading the zip resource file:

fe = ttsfrd.TtsFrontendEngine()
fe.initialize(resources_dir)

Although the final audio output is fine, it seems to slow down inference. How can this be resolved?
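
If the slowdown comes from re-initializing the frontend for every request, one mitigation is to construct the engine once and reuse it. A minimal sketch, assuming the set_lang_type/gen_tacotron_symbols calls used by KAN-TTS's text_to_wav (the paths and texts below are placeholders):

import ttsfrd

resources_dir = "./resource"  # hypothetical: path to the unzipped resource directory
texts = ["今天天气不错", "适合出去游玩"]

# Initialize once: the 'Load pinyin_en_mix_dict failed' lines are printed
# during initialize(), so constructing the engine per request repeats both
# the messages and the loading cost.
fe = ttsfrd.TtsFrontendEngine()
fe.initialize(resources_dir)
fe.set_lang_type('pinyin')

for text in texts:  # reuse the same engine for every sentence
    symbols = fe.gen_tacotron_symbols(text)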

train_sambert error

inputs_text_embedding + pitch_embeddings + energy_embeddings
RuntimeError: The size of tensor a (50) must match the size of tensor b (411) at non-singleton dimension 1
2023-07-03:11:07:19 INFO [trainer.py:903] torch.Size([32, 50, 4])
2023-07-03:11:07:19 INFO [trainer.py:904] torch.Size([32, 50])
2023-07-03:11:07:19 INFO [trainer.py:905] torch.Size([32, 50])
torch.Size([32, 50, 32])
torch.Size([32, 411, 32])
torch.Size([32, 411, 32])

A small bug when running text_to_wav.py

You need to copy the contents of speech_sambert-hifigan_tts_zhitian_emo_zh-cn_16k/voices/zhitian_emo/audio_config.yaml into speech_sambert-hifigan_tts_zhitian_emo_zh-cn_16k/voices/zhitian_emo/voc/config.yaml, otherwise an error is raised; see the sketch below.
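
A minimal sketch of the workaround described above, using the same paths as the report:

import shutil

voice = "speech_sambert-hifigan_tts_zhitian_emo_zh-cn_16k/voices/zhitian_emo"
# Overwrite the vocoder config with the voice-level audio config
# before running text_to_wav.py.
shutil.copyfile(f"{voice}/audio_config.yaml", f"{voice}/voc/config.yaml")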

How is sybert used, and is there a pretrained model?

Hi, this project is great! But there are a few things I don't quite understand:

  1. How exactly is sybert used, and what effect is it expected to achieve (no audio data is involved in its training, so I don't quite understand how sybert models pronunciation features)?
    kantts contains a training module for sybert, but no inference module and no module that applies it.

  2. Is there a pretrained sybert model?
    I tried loading languagedata_embedded.bin from the resource directory, but it doesn't seem to be a PyTorch model.

Need Python 3.9 & Python 3.10 support

pip install git+https://github.com/alibaba-damo-academy/KAN-TTS

INFO: pip is looking at multiple versions of kantts to determine which version is compatible with other requirements. This could take a while.
ERROR: Package 'kantts' requires a different Python: 3.9.16 not in '<3.9,>=3.7.0'

Thank you!
