alibaba-damo-academy / funasr

A Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models. | A speech recognition toolkit with a rich set of high-performance open-source pretrained models, supporting speech recognition, voice activity detection, text post-processing, and more, with service deployment capabilities.

Home Page: https://www.funasr.com

License: Other

Shell 3.76% Python 72.92% Perl 0.33% CMake 0.33% C++ 12.92% C 0.01% HTML 0.18% JavaScript 3.92% Java 0.75% Makefile 0.03% C# 2.92% Ruby 0.01% Objective-C 0.18% Objective-C++ 0.20% Kotlin 0.02% Cuda 0.43% SCSS 0.39% Vue 0.71%
conformer pytorch speech-recognition paraformer punctuation speaker-diarization rnnt audio-visual-speech-recognition pretrained-model voice-activity-detection

funasr's Introduction

(Simplified Chinese | English)

FunASR: A Fundamental End-to-End Speech Recognition Toolkit


FunASR aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers conduct research and production of speech recognition models more conveniently and promotes the development of the speech recognition ecosystem. ASR for Fun!

Highlights | News | Installation | Quick Start | Tutorial | Runtime | Model Zoo | Contact

Highlights

  • FunASR is a fundamental speech recognition toolkit that offers a variety of features, including speech recognition (ASR), Voice Activity Detection (VAD), Punctuation Restoration, Language Models, Speaker Verification, Speaker Diarization and multi-talker ASR. FunASR provides convenient scripts and tutorials, supporting inference and fine-tuning of pre-trained models.
  • We have released a vast collection of academic and industrial pretrained models on ModelScope and Hugging Face, which can be accessed through our Model Zoo. The representative Paraformer-large, a non-autoregressive end-to-end speech recognition model, offers high accuracy, high efficiency, and convenient deployment, supporting the rapid construction of speech recognition services. For more details on service deployment, please refer to the service deployment document.

What's new:

  • 2024/03/05: Added the Qwen-Audio and Qwen-Audio-Chat large-scale audio-text multimodal models, which have topped multiple audio-domain leaderboards. These models support speech dialogue (usage).
  • 2024/03/05: Added support for the Whisper-large-v3 model, a multitasking model that can perform multilingual speech recognition, speech translation, and language identification. It can be downloaded from ModelScope and OpenAI.
  • 2024/03/05: Offline File Transcription Service 4.4, Offline File Transcription Service of English 1.5, and Real-time Transcription Service 1.9 released; the Docker image now supports the ARM64 platform, and ModelScope has been updated (docs).
  • 2024/01/30: funasr-1.0 has been released (docs).
  • 2024/01/30: Emotion recognition models are now supported (model link), modified from this repo.
  • 2024/01/25: Offline File Transcription Service 4.2 and Offline File Transcription Service of English 1.3 released; optimized the VAD (Voice Activity Detection) data processing to significantly reduce peak memory usage and fix memory leaks. Real-time Transcription Service 1.7 released with client-side optimizations (docs).
  • 2024/01/09: The FunASR SDK for Windows 2.0 has been released, supporting the offline file transcription service (CPU) of Mandarin 4.1, the offline file transcription service (CPU) of English 1.2, and the real-time transcription service (CPU) of Mandarin 1.6. For more details, please refer to the official documentation or release notes (FunASR-Runtime-Windows).
  • 2024/01/03: File Transcription Service 4.0 released: added support for 8k models, fixed timestamp mismatch issues and added sentence-level timestamps, improved the effectiveness of English word FST hotwords, supported automated configuration of thread parameters, and fixed known crash and memory leak issues (docs).
  • 2024/01/03: Real-time Transcription Service 1.6 released: the 2pass-offline mode now supports Ngram language model decoding and WFST hotwords; known crash and memory leak issues have also been fixed (docs).
  • 2024/01/03: Fixed known crash and memory leak issues (docs).
  • 2023/12/04: The FunASR SDK for Windows 1.0 has been released, supporting the offline file transcription service (CPU) of Mandarin, the offline file transcription service (CPU) of English, and the real-time transcription service (CPU) of Mandarin. For more details, please refer to the official documentation or release notes (FunASR-Runtime-Windows).
  • 2023/11/08: The offline file transcription service 3.0 (CPU) of Mandarin has been released, adding a large punctuation model, an Ngram language model, and WFST hotwords. For detailed information, please refer to the docs.
  • 2023/10/17: The offline file transcription service (CPU) of English has been released. For more details, please refer to the docs.
  • 2023/10/13: SlideSpeech: a large-scale multi-modal audio-visual corpus with a significant amount of real-time synchronized slides.
  • 2023/10/10: The combined ASR and speaker diarization pipeline Paraformer-VAD-SPK is now released. Try the model to get recognition results with speaker information.
  • 2023/10/07: FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec.
  • 2023/09/01: The offline file transcription service 2.0 (CPU) of Mandarin has been released, with added support for ffmpeg, timestamp, and hotword models. For more details, please refer to (docs).
  • 2023/08/07: The real-time transcription service (CPU) of Mandarin has been released. For more details, please refer to (docs).
  • 2023/07/17: BAT is released, which is a low-latency and low-memory-consumption RNN-T model. For more details, please refer to (BAT).
  • 2023/06/26: The ASRU 2023 Multi-Channel Multi-Party Meeting Transcription Challenge 2.0 has concluded, and the results have been announced. For more details, please refer to (M2MeT2.0).

Installation

pip3 install -U funasr

Or install from source code

git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./

Install ModelScope for the pretrained models (optional)

pip3 install -U modelscope
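
To verify the installation, a quick import check such as the following should suffice (a minimal sketch; it assumes funasr exposes a __version__ attribute, as recent releases do):

import funasr
from funasr import AutoModel

print(funasr.__version__)  # prints the installed FunASR version if the install succeeded
print(AutoModel)           # the main entry point used in the examples below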

Model Zoo

FunASR has open-sourced a large number of pre-trained models on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Below are some representative models; for more models, please refer to the Model Zoo.

(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)

Model Name | Task Details | Training Data | Parameters
paraformer-zh ( 🤗 ) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M
paraformer-zh-streaming ( 🤗 ) | speech recognition, streaming | 60000 hours, Mandarin | 220M
paraformer-en ( 🤗 ) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M
conformer-en ( 🤗 ) | speech recognition, non-streaming | 50000 hours, English | 220M
ct-punc ( 🤗 ) | punctuation restoration | 100M, Mandarin and English | 1.1G
fsmn-vad ( 🤗 ) | voice activity detection | 5000 hours, Mandarin and English | 0.4M
fa-zh ( 🤗 ) | timestamp prediction | 5000 hours, Mandarin | 38M
cam++ ( 🤗 ) | speaker verification/diarization | 5000 hours | 7.2M
Whisper-large-v2 ( 🍀 ) | speech recognition, with timestamps, non-streaming | multilingual | 1.5G
Whisper-large-v3 ( 🍀 ) | speech recognition, with timestamps, non-streaming | multilingual | 1.5G
Qwen-Audio ( 🤗 ) | audio-text multimodal model (pretraining) | multilingual | 8B
Qwen-Audio-Chat ( 🤗 ) | audio-text multimodal model (chat) | multilingual | 8B

Quick Start

Below is a quick start tutorial. Test audio files (Mandarin, English).

Command-line usage

funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav

Note: A single audio file is supported, as well as a file list in Kaldi-style wav.scp format: wav_id wav_path (see the sketch below).
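
For illustration, a minimal sketch of the wav.scp route in Python (hypothetical file names; it assumes AutoModel.generate accepts a wav.scp path, as the note above suggests):

from funasr import AutoModel

# Hypothetical Kaldi-style list: one "wav_id wav_path" pair per line.
with open("wav.scp", "w", encoding="utf-8") as f:
    f.write("utt_001 /data/audio/example_1.wav\n")
    f.write("utt_002 /data/audio/example_2.wav\n")

model = AutoModel(model="paraformer-zh", vad_model="fsmn-vad", punc_model="ct-punc")
res = model.generate(input="wav.scp")  # one result entry per wav_id (assumption)
print(res)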

Speech Recognition (Non-streaming)

from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(model="paraformer-zh",  vad_model="fsmn-vad",  punc_model="ct-punc", 
                  # spk_model="cam++", 
                  )
res = model.generate(input=f"{model.model_path}/example/asr_example.wav", 
                     batch_size_s=300, 
                     hotword='魔搭')
print(res)

Note: hub specifies the model repository; "ms" selects download from ModelScope, and "hf" selects download from Hugging Face.
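
For example, the same model could be pulled from either hub (a minimal sketch; it assumes AutoModel accepts the hub keyword described in the note above):

from funasr import AutoModel

model_ms = AutoModel(model="paraformer-zh", hub="ms")  # download from ModelScope
model_hf = AutoModel(model="paraformer-zh", hub="hf")  # download from Hugging Face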

Speech Recognition (Streaming)

from funasr import AutoModel

chunk_size = [0, 10, 5] #[0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4 #number of chunks to lookback for encoder self-attention
decoder_chunk_look_back = 1 #number of encoder chunks to lookback for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

import soundfile
import os

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960 # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size, encoder_chunk_look_back=encoder_chunk_look_back, decoder_chunk_look_back=decoder_chunk_look_back)
    print(res)

Note: chunk_size is the configuration for streaming latency. [0,10,5] means that the real-time display granularity is 10*60=600ms and the lookahead is 5*60=300ms. Each inference input is 600ms (16000*0.6=9600 sample points), and the output is the corresponding text. For the last speech segment, is_final=True must be set to force output of the last word.
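
As a sanity check on those numbers, the display granularity, lookahead, and chunk stride can be derived directly from chunk_size (a minimal sketch; the 60 ms frame unit and 16 kHz sample rate follow the note above):

sample_rate = 16000    # Hz
frame_ms = 60          # one chunk_size unit corresponds to 60 ms
chunk_size = [0, 10, 5]

display_ms = chunk_size[1] * frame_ms             # 10 * 60 = 600 ms per output update
lookahead_ms = chunk_size[2] * frame_ms           # 5 * 60 = 300 ms of future context
chunk_stride = display_ms * sample_rate // 1000   # 9600 samples, i.e. chunk_size[1] * 960
print(display_ms, lookahead_ms, chunk_stride)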

Voice Activity Detection (Non-Streaming)

from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)

Note: The output format of the VAD model is: [[beg1, end1], [beg2, end2], ..., [begN, endN]], where begN/endN indicates the starting/ending point of the N-th valid audio segment, measured in milliseconds.
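
As an example of consuming that output, the detected segments can be cut out of the waveform like this (a minimal sketch; it assumes the [[beg, end], ...] list is returned in res[0]["value"], matching the streaming example below):

import soundfile
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)

res = model.generate(input=wav_file)
for beg_ms, end_ms in res[0]["value"]:
    beg = int(beg_ms * sample_rate / 1000)  # milliseconds -> sample index
    end = int(end_ms * sample_rate / 1000)
    segment = speech[beg:end]               # one detected speech segment
    print(beg_ms, end_ms, len(segment))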

Voice Activity Detection (Streaming)

from funasr import AutoModel

chunk_size = 200 # ms
model = AutoModel(model="fsmn-vad")

import soundfile

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i*chunk_stride:(i+1)*chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)

Note: The output format for the streaming VAD model can be one of four scenarios:

  • [[beg1, end1], [beg2, end2], ..., [begN, endN]]: the same as the offline VAD output described above.
  • [[beg, -1]]: only a starting point has been detected.
  • [[-1, end]]: only an ending point has been detected.
  • []: neither a starting point nor an ending point has been detected.

The output is measured in milliseconds and represents the absolute time from the starting point.
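
A minimal sketch of folding those streaming events back into closed [beg, end] segments (assuming each res[0]["value"] entry follows one of the four formats listed above):

def update_segments(segments, events):
    # events: [[beg, end]] in milliseconds, where -1 marks a boundary not yet detected
    for beg, end in events:
        if beg != -1 and end != -1:
            segments.append([beg, end])          # complete segment in a single event
        elif beg != -1:
            segments.append([beg, None])         # only a start so far; close it later
        elif end != -1 and segments and segments[-1][1] is None:
            segments[-1][1] = end                # close the pending segment
    return segments

# usage inside the streaming loop above:
#     if len(res[0]["value"]):
#         segments = update_segments(segments, res[0]["value"])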

Punctuation Restoration

from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)

Timestamp Prediction

from funasr import AutoModel

model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)

For more usage, refer to the docs; for more examples, refer to the demo.

Export ONNX

Command-line usage

funasr-export ++model=paraformer ++quantize=false ++device=cpu

Python

from funasr import AutoModel

model = AutoModel(model="paraformer", device="cpu")

res = model.export(quantize=False)

Test ONNX

# pip3 install -U funasr-onnx
from funasr_onnx import Paraformer
model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)

More examples can be found in the demo.

Deployment Service

FunASR supports deploying pre-trained or further fine-tuned models for service. Currently, it supports the following types of service deployment:

  • File transcription service, Mandarin, CPU version, done
  • The real-time transcription service, Mandarin (CPU), done
  • File transcription service, English, CPU version, done
  • File transcription service, Mandarin, GPU version, in progress
  • and more.

For more detailed information, please refer to the service deployment documentation.

Community Communication

If you encounter problems during use, you can raise an issue directly on the GitHub page.

You can also scan the following DingTalk group or WeChat group QR code to join the community group for communication and discussion.

DingTalk group | WeChat group (QR codes not reproduced here)

Contributors

The contributors can be found in the contributors list.

License

This project is licensed under the MIT License. FunASR also contains various third-party components and some code modified from other repositories under other open-source licenses. The use of the pretrained models is subject to the model license.

Citations

@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
  author={Keyu An and Xian Shi and Shiliang Zhang},
  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  year=2022,
  booktitle={Proc. Interspeech 2022},
  pages={2063--2067},
  doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
  year={2023},
  booktitle={ICASSP2024}
}

funasr's People

Contributors

aky15, bltcn, cdevelop, cgisky1980, chenmengzheaaa, dyyzhmm, gbbin, hnluo, jmwang66, lauragpt, lingyunfly, lizerui9926, lyblsgo, manyeyes, nichongjia-2007, onlybetheone, r1ckshi, season4675, smohan-speech, tramphero, virtuoso461, xiaowan0322, yeyupiaoling, yhliang-aslp, yuekaizhang, yufan-aslp, zhaomingwork, zhihaodu, zhuzizyf, znsoftm


funasr's Issues

MWER supported?

As the title mentions, is MWER supported in the current version?
Thanks.

Chinese ITN Tagger predictions are abnormal

The Tagger cannot automatically extract the parts of the text that need to be converted. For example, in the case below, a space has to be manually inserted before "两千两百五十公里" before the conversion works.

> python -m fun_text_processing.inverse_text_normalization.inverse_normalize --text "这条路长达两千两百五十公里" --overwrite_cache --language zh --verbose

Time to generate graph: 0.57 sec
这条路长达两千两百五十公里
tokens { name: "这条路长达两千两百五十公里" }
这条路长达两千两百五十公里

> python -m fun_text_processing.inverse_text_normalization.inverse_normalize --text "这条路长达 两千两百五十公里" --overwrite_cache --language zh --verbose

Time to generate graph: 0.55 sec
这条路长达 两千两百五十公里
tokens { name: "这条路长达" } tokens { measure { cardinal { integer: "2250" } units: "km" } }
这条路长达 2250km


Error when decoding with paraformer model

When I use modelscope_common_infer.sh to decode an audio file, the error below occurs. Does anyone know how to fix it?

File ".../.../FunASR/funasr/utils/postprocess_utils.py", line 170, in sentence_postprocess
    raise ValueError('invalid character: {}'.format(ch))
ValueError: invalid character: that's

joint_network param is not necessary for model ParaformerBert ?

Traceback (most recent call last):
File "/public/home/asc01/otl/funasr/egs/aishell/paraformerbert/../../../funasr/bin/asr_train_paraformer.py", line 46, in
main(args=args)
File "/public/home/asc01/otl/funasr/egs/aishell/paraformerbert/../../../funasr/bin/asr_train_paraformer.py", line 23, in main
ASRTask.main(args=args, cmd=cmd)
File "/public/home/asc01/otl/funasr/funasr/tasks/abs_task.py", line 1086, in main
cls.main_worker(args)
File "/public/home/asc01/otl/funasr/funasr/tasks/abs_task.py", line 1142, in main_worker
model = cls.build_model(args=args)
File "/public/home/asc01/otl/funasr/funasr/tasks/asr.py", line 859, in build_model
model = model_class(
TypeError: init() missing 1 required positional argument: 'joint_network'

When I run the AISHELL-1 ParaformerBert baseline, I get this error. However, I found that the so-called joint_network is not used in ParaformerBert, so should I just delete the parameter and continue training?

There seems to have been an update today; CPU inference now fails

/www/miniconda3/envs/vits2/lib/python3.7/site-packages/torch/amp/autocast_mode.py:214: UserWarning: In CPU autocast, but the target dtype is not supported. Disabling autocast.
CPU Autocast only supports dtype of torch.bfloat16 currently.

For the model I was running, I changed the two occurrences of with autocast(False): in FunASR/funasr/models/e2e_asr_paraformer.py to

#with autocast(False):
if True:

and it works now.

FunASR 0.1.4 AISHELL ParaformerBert Baseline Decoding Error

2022-12-10 18:24:02,135 (asr_inference_paraformer:429) INFO: decoding, utt_id: ['BAC009S0746W0278']
2022-12-10 18:24:02,261 (beam_search:1300) INFO: decoder input length: 91
2022-12-10 18:24:02,261 (beam_search:1301) INFO: max output length: 14
2022-12-10 18:24:02,984 (beam_search:1380) INFO: adding in the last position in the loop
2022-12-10 18:24:02,987 (beam_search:1316) INFO: no hypothesis. Finish decoding.
2022-12-10 18:24:02,987 (beam_search:1337) INFO: -0.00 * 0.5 = -0.00 for ctc
2022-12-10 18:24:02,987 (beam_search:1340) INFO: total log probability: -0.98
2022-12-10 18:24:02,987 (beam_search:1341) INFO: normalized log probability: -0.06
2022-12-10 18:24:02,987 (beam_search:1342) INFO: total number of ended hypotheses: 10
2022-12-10 18:24:02,988 (beam_search:1344) INFO: best hypo: 微软今年在**的重点拓展领域

2022-12-10 18:24:02,988 (asr_inference_paraformer:440) INFO: decoding, feature length: 91, forward_time: 0.8529, rtf: 0.9372
2022-12-10 18:24:02,988 (asr_inference_paraformer:446) INFO: batch_id 0 len(keys) 1 keys ['BAC009S0746W0278']
2022-12-10 18:24:03,057 (asr_inference_paraformer:467) INFO: decoding, utt: BAC009S0746W0278, predictions: 微软今年在**的重点拓展领域
2022-12-10 18:24:03,057 (asr_inference_paraformer:446) INFO: batch_id 1 len(keys) 1 keys ['BAC009S0746W0278']
Traceback (most recent call last):
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr-main/funasr/bin/asr_inference_launch.py", line 241, in
main()
File "/home/work_nfs4_ssd/hwang/workspaces/funasr-main/funasr/bin/asr_inference_launch.py", line 237, in main
inference_launch(**kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr-main/funasr/bin/asr_inference_launch.py", line 204, in inference_launch
return inference(**kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr-main/funasr/bin/asr_inference_paraformer.py", line 447, in inference
key = keys[batch_id]
IndexError: list index out of range

I logged some values to debug; the length of keys is actually one.

bug exists in funasr/models/e2e_vad.py

def ComputeDecibel(self) -> None:
    frame_sample_length = int(self.vad_opts.frame_length_ms * self.vad_opts.sample_rate / 1000)
    frame_shift_length = int(self.vad_opts.frame_in_ms * self.vad_opts.sample_rate / 1000)
    self.data_buf = self.waveform[0]  # points to self.waveform[0]

    for offset in range(0, self.waveform.shape[1] - frame_sample_length, frame_shift_length):
        self.decibel.append(
            10 * math.log10((self.waveform[0][offset: offset + frame_sample_length]).square().sum() + \
                            0.000001))

If (self.waveform.shape[1] - frame_sample_length) % frame_shift_length == 0, the assertion assert len(self.decibel) == len(self.scores[0]) (which guarantees that the frame counts match) fails:
len(self.decibel) is 1 less than len(self.scores[0]).

Maybe the last item of self.decibel should be duplicated when there is no remainder; see the sketch below.
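
A sketch of the workaround the reporter proposes, i.e. duplicating the last decibel value when the frame counts diverge (illustrative only, not the maintainers' fix):

# Illustrative patch, placed just before the frame-count assertion in e2e_vad.py:
while len(self.decibel) < len(self.scores[0]):
    self.decibel.append(self.decibel[-1])  # pad with the last value so frame counts match
assert len(self.decibel) == len(self.scores[0])  # ensure frame counts match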

Error when loading the speechio paraformer model

Error "pickle.UnpicklingError: invalid load key, 'v'." occurs when running paraformer_large_infer.sh in egs_modelscope/speechio. The decoding model is downloaded from modelscope in advance.

batch_size does not work during inference with the Paraformer-large long-audio model

Referring to egs_modelscope/asr_vad_punc/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/infer.py, batch_size seems to have no effect across different settings; the relative speed-up stays at roughly 8x.
Plenty of computing resources (memory, GPU utilization) remain, so is there any way to get a higher relative speed?
The funasr version is 0.1.6 and modelscope is 1.2.0.

Some excellent code for handling various audio file formats, learned from OpenAI's Whisper

I have copied some excellent code from the Whisper project that can handle various audio file formats easily.
The Whisper project uses ffmpeg to process the audio files the user provides; it gives convenience to the user and keeps the complexity to itself.
I think FunASR could learn from this to improve its quality.
Good luck, guys! :)

(Code blocks 1–4 were attached in the original issue and are not reproduced here.)

Docker image doesn't work

I downloaded the Docker image from here, but an error occurred when I ran the example code. (The error screenshots are attached in the original issue.)

The open-source pretrained model runs very slowly

When I run the open-source model speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch, whether through modelscope.pipelines or through Speech2Text under funasr.bin.asr_inference_paraformer, it is very slow: about 5 s of audio takes a bit more than 4 s, i.e. an RTF of roughly 0.9. Is this normal? (Running on CPU, Linux, 256 GB of memory, 64 CPU cores.)

After directly replacing the Paraformer model in gRPC with the ONNX version, the client reports an error

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.RESOURCE_EXHAUSTED
details = "received initial metadata size exceeds limit"
debug_error_string = "UNKNOWN:Error received from peer ipv4:127.0.0.1:55555 {created_time:"2023-02-20T14:03:04.697806464+00:00", grpc_status:8, grpc_message:"received initial metadata size exceeds limit"}"

Error in training transformerLM

Stage 0: Generate character level token_list from /workspace/model_scope/data/train2.txt
/opt/conda/bin/python /workspace/model_scope/FunASR/funasr/bin/tokenize_text.py --token_type char --input /workspace/model_scope/data/train2.txt --output ./data/exp/baseline_train_lm_transformer_zh_char_exp1/vocab.txt --non_linguistic_symbols none --field 2- --cleaner none --g2p none --write_vocabulary true --add_symbol ':0' --add_symbol ':1' --add_symbol ':2' --add_symbol ':-1'
2023-03-20 17:17:14,563 (tokenize_text:186) INFO: OOV rate = 0.0 %
stage 1: Data preparation
/opt/conda/bin/python /workspace/model_scope/FunASR/funasr/bin/aggregate_stats_dirs.py --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.1 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.2 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.3 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.4 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.5 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.6 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.7 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.8 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.9 --input_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/log/stats.10 --output_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1
stage 2: Training
run.sh: init method is file:///workspace/model_scope/FunASR/egs/aishell2/transformerLM/data/exp/baseline_train_lm_transformer_zh_char_exp1/ddp_init
Stage 3: Calc perplexity: /workspace/model_scope/data/test2.txt
/opt/conda/bin/python ../../../funasr/bin/lm_inference.py --output_dir ./data/exp/baseline_train_lm_transformer_zh_char_exp1/perplexity_test --ngpu 1 --batch_size 1 --train_config ./data/exp/baseline_train_lm_transformer_zh_char_exp1/config.yaml --model_file ./data/exp/baseline_train_lm_transformer_zh_char_exp1/valid.loss.ave.pth --data_path_and_name_and_type /workspace/model_scope/data/test2.txt,text,text --num_workers 1 --split_with_space false
Traceback (most recent call last):
File "../../../funasr/bin/lm_inference.py", line 406, in
main()
File "../../../funasr/bin/lm_inference.py", line 403, in main
inference(**kwargs)
File "../../../funasr/bin/lm_inference.py", line 68, in inference
**kwargs,
File "../../../funasr/bin/lm_inference.py", line 107, in inference_modelscope
train_config, model_file, device)
File "/workspace/model_scope/FunASR/funasr/tasks/abs_task.py", line 1918, in build_model_from_file
with config_file.open("r", encoding="utf-8") as f:
File "/opt/conda/lib/python3.7/pathlib.py", line 1208, in open
opener=self._opener)
File "/opt/conda/lib/python3.7/pathlib.py", line 1063, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'data/exp/baseline_train_lm_transformer_zh_char_exp1/config.yaml'

(modelscope image:ubuntu20.04-cuda11.3.0-py37-torch1.11.0-tf1.15.5-1.4.1)

AISHELL performance not as good as reported

According to https://modelscope.cn/models/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/summary, Paraformer-large should reach a WER of 1.95 on the AISHELL test set.

However, I got 1.75 on the dev set and 6.55 on the test set.

Below is the log; could anybody tell me what went wrong?

2023-01-13 03:53:21,756 - modelscope - INFO - PyTorch version 1.7.1+cu110 Found.
2023-01-13 03:53:21,756 - modelscope - INFO - Loading ast index from /root/.cache/modelscope/ast_indexer
2023-01-13 03:53:21,775 - modelscope - INFO - Loading done! Current index file version is 1.1.4, with md5 7b2befbc61ca2ec3494371f3469c93bc and a total number of 477 components indexed
2023-01-13 03:53:22,994 - modelscope - INFO - Use user-specified model revision: v1.0.4
2023-01-13 03:53:23,256 - modelscope - INFO - File am.mvn already in cache, skip downloading!
2023-01-13 03:53:23,256 - modelscope - INFO - File asr_example.wav already in cache, skip downloading!
2023-01-13 03:53:23,256 - modelscope - INFO - File config.yaml already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File config_lm.yaml already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File configuration.json already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File decoding.yaml already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File finetune.yaml already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File lm.pb already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File model.pb already in cache, skip downloading!
2023-01-13 03:53:23,257 - modelscope - INFO - File README.md already in cache, skip downloading!
2023-01-13 03:53:23,258 - modelscope - INFO - File seg_dict already in cache, skip downloading!
2023-01-13 03:53:23,258 - modelscope - INFO - File struct.png already in cache, skip downloading!
2023-01-13 03:53:23,258 - modelscope - INFO - File tokens.txt already in cache, skip downloading!
2023-01-13 03:53:23,280 - modelscope - INFO - initiate model from /root/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
2023-01-13 03:53:23,281 - modelscope - INFO - initiate model from location /root/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch.
2023-01-13 03:53:23,283 - modelscope - INFO - initialize model from /root/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Decoding started... log: 'exp/aishell/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/decode_asr/dev/logdir/asr_inference.*.log'
%WER 1.75 [ 3603 / 205341, 133 ins, 85 del, 3385 sub ]
%SER 18.0 [ 2578 / 14326 ]
Scored 14326 sentences, 0 not present in hyp.
Decoding started... log: 'exp/aishell/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/decode_asr/test/logdir/asr_inference.*.log'
%WER 6.55 [ 6862 / 104765, 157 ins, 213 del, 6492 sub ]
%SER 38.84 [ 2787 / 7176 ]
Scored 7176 sentences, 0 not present in hyp.

rapid_paraformer: English tokens cannot be aligned with timestamps when the audio contains English

During rapid_paraformer inference, when the audio contains English, the English tokens have already been merged, which makes it impossible to align them with the timestamps.

OS: Windows
Python Version:3.7.9
Package Version: pytorch==1.13.1, modelscope==1.4.1, funasr==0.3.0
Model:damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch

Traceback (most recent call last):
File "D:/project/msasr/test8.py", line 18, in
result = model(wav)
File "D:\Miniconda\envs\modelscope\lib\site-packages\rapid_paraformer\paraformer_onnx.py", line 82, in call
timestamp, timestamp_total = time_stamp_lfr6_onnx(us_cif_peak_, copy.copy(tokens))
File "D:\Miniconda\envs\modelscope\lib\site-packages\rapid_paraformer\utils\timestamp_utils.py", line 21, in time_stamp_lfr6_onnx
assert num_peak == len(char_list) + 1 # number of peaks is supposed to be number of tokens + 1
AssertionError

My own SpeechIO test results don't match the official ones

Why is the CER in your SpeechIO test results so low?
For example, for speechio 12, my result is:
%WER 3.21 [ 2344 / 73114, 332 ins, 803 del, 1209 sub ]
%SER 72.91 [ 853 / 1170 ]
(My settings are shown in a screenshot in the original issue.)

Also, no matter whether I add a language model or not, the measured WER is always 3.21.

Error when infering using the script funasr/bin/asr_inference_launch.py.

When running paraformer_large_finetune.sh in egs_modelscope/aishell/paraformer, the following error occurs at stage 4 (inference):

Traceback (most recent call last):
File "/mnt/jd_cloud_nfs/env/py3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/mnt/jd_cloud_nfs/env/py3/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/jd_cloud_nfs/env/py3/lib/python3.7/site-packages/funasr/bin/asr_inference_launch.py", line 241, in
main()
File "/mnt/jd_cloud_nfs/env/py3/lib/python3.7/site-packages/funasr/bin/asr_inference_launch.py", line 237, in main
inference_launch(args.mode, **kwargs)
TypeError: inference_launch() got multiple values for argument 'mode'

"modelscope[audio]"是不是还不支持arm的linux呀

安装"modelscope[audio]" 卡在

The conflict is caused by:
modelscope[audio] 1.3.2 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.3.1 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.3.0 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.2.1 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.2.0 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.1.4 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.1.3 depends on py-sound-connect>=0.1; extra == "audio"
modelscope[audio] 1.1.2 depends on py-sound-connect>=0.1; extra == "audio"
....

The package py-sound-connect cannot be found.

C++ inference output is garbled

OS: Windows 10
Python Version: 3.7.16
Package Version: modelscope 1.3.2, funasr 0.2.3
Model: speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
Command: tester.exe F:\FunASR-main\funasr\runtime\onnxruntime\models F:\RapidASR-main\cpp_onnx\wave\long.wav
Error log: none
Problem description: the C++ inference output is garbled text. (Screenshot attached in the original issue.)

Error when running the ASR inference code

Hello, I followed the tutorial to run the ASR inference code and got an error. The code is:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

inference_16k_pipline = pipeline(
task=Tasks.auto_speech_recognition,
model='damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
model_revision='v1.0.0'
)

rec_result = inference_16k_pipline(audio_in='https://modelscope.oss-cn-beijing.aliyuncs.com/test/audios/asr_example.wav')
print(rec_result)

The error is as follows:
rec_result = inference_pipeline(audio_in=wav_path)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 79, in call
output = self.forward(output)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 157, in forward
inputs['asr_result'] = self.run_inference(cmd)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/modelscope/pipelines/audio/asr_inference_pipeline.py", line 255, in run_inference
frontend_conf=cmd['frontend_conf'])
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/easyasr/asr_inference_paraformer_espnet.py", line 570, in asr_inference
frontend_conf=frontend_conf)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/easyasr/asr_inference_paraformer_espnet.py", line 376, in inference
mvn_data = wav_utils.extract_CMVN_featrures(mvn_file)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/easyasr/common/wav_utils.py", line 42, in extract_CMVN_featrures
cmvn = kaldiio.load_mat(mvn_file)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/kaldiio/matio.py", line 241, in load_mat
return _load_mat(fd, offset, slices, endian=endian)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/kaldiio/matio.py", line 331, in _load_mat
array = read_kaldi(fd, endian)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/kaldiio/matio.py", line 441, in read_kaldi
array = read_ascii_mat(fd)
File "/home/wang/miniconda3/envs/funasr/lib/python3.7/site-packages/kaldiio/matio.py", line 617, in read_ascii_mat
raise RuntimeError(ma + "is not a digit\nFile format is wrong?")
RuntimeError: is not a digit
File format is wrong?

Error after modifying the mic_chunk parameter in grpc_main_client_mic.py

Error message:

  • recording
    Traceback (most recent call last):
    File "grpc_main_client_mic.py", line 136, in
    asyncio.run(record(args.host,args.port,args.sample_rate,args.mic_chunk,args.record_seconds,args.user_allowed,language))
    File "/Users/liuzhiyong/anaconda3/lib/python3.7/asyncio/runners.py", line 43, in run
    return loop.run_until_complete(main)
    File "/Users/liuzhiyong/anaconda3/lib/python3.7/asyncio/base_events.py", line 583, in run_until_complete
    return future.result()
    File "grpc_main_client_mic.py", line 72, in record
    await asyncio.create_task(deal_chunk(sig_mic))
    File "grpc_main_client_mic.py", line 45, in deal_chunk
    if vad.is_speech(sig_mic, sample_rate): #speaking
    File "/Users/liuzhiyong/anaconda3/lib/python3.7/site-packages/webrtcvad.py", line 27, in is_speech
    return _webrtcvad.process(self._vad, sample_rate, buf, length)
    webrtcvad.Error: Error while processing frame

Results of model fine-tuning

After fine-tuning the Paraformer common large model on our internal datasets, my results are as follows:
(1) Fine-tuning with 2,000 hours: CER drops by about 6% relative.
(2) Fine-tuning with 10,000 hours: CER drops by about 5% relative.
The learning rate is the same in both runs and is set very small, one tenth of the original fine-tuning value.
My questions are:
(1) Is this consistent with what should be expected, i.e. that the larger dataset gives a relatively worse result?
(2) For the larger 10,000-hour dataset, should I lower the learning rate further?
(3) Are there any other parameters (in finetune.yaml) that would help fine-tuning?

TypeError is reported when using the punctuation model in both the online environment and the local environment

TypeError Traceback (most recent call last)
/tmp/ipykernel_355/1084747498.py in
----> 1 p('我们都是木头人不会讲话不会动',)

/opt/conda/lib/python3.7/site-packages/modelscope/pipelines/audio/punctuation_processing_pipeline.py in call(self, text_in, output_dir, cache, param_dict)
77 self.cmd['param_dict'] = param_dict
78
---> 79 output = self.forward(self.text_in)
80 result = self.postprocess(output)
81 return result

/opt/conda/lib/python3.7/site-packages/modelscope/pipelines/audio/punctuation_processing_pipeline.py in forward(self, text_in)
152 self.cmd['name_and_type'] = data_cmd
153 self.cmd['raw_inputs'] = raw_inputs
--> 154 punc_result = self.run_inference(self.cmd)
155
156 return punc_result

/opt/conda/lib/python3.7/site-packages/modelscope/pipelines/audio/punctuation_processing_pipeline.py in run_inference(self, cmd)
164 output_dir_v2=cmd['output_dir'],
165 cache=cmd['cache'],
--> 166 param_dict=cmd['param_dict'])
167 else:
168 raise ValueError('model type is mismatching')

TypeError: 'NoneType' object is not callable

AISHELL Paraformer Baseline Decoding Error

2022-12-06 16:44:40,039 (asr_inference_paraformer:337) INFO: decoding, feature length: 412, forward_time: 14.0918, rtf: 3.4203
2022-12-06 16:44:40,077 (asr_inference_paraformer:359) INFO: decoding, predictions: 这样第三产业在快速增长后
2022-12-06 16:44:40,085 (asr_inference_paraformer:324) INFO: decoding, utt_id: ['BAC009S0749W0123']
2022-12-06 16:44:58,486 (beam_search:1300) INFO: decoder input length: 87
2022-12-06 16:44:58,486 (beam_search:1301) INFO: max output length: 11
2022-12-06 16:44:59,051 (beam_search:1380) INFO: adding in the last position in the loop
2022-12-06 16:44:59,053 (beam_search:1316) INFO: no hypothesis. Finish decoding.
2022-12-06 16:44:59,054 (beam_search:1337) INFO: -0.00 * 0.5 = -0.00 for ctc
2022-12-06 16:44:59,054 (beam_search:1340) INFO: total log probability: -1.09
2022-12-06 16:44:59,054 (beam_search:1341) INFO: normalized log probability: -0.08
2022-12-06 16:44:59,054 (beam_search:1342) INFO: total number of ended hypotheses: 10
2022-12-06 16:44:59,054 (beam_search:1344) INFO: best hypo: 目前矿山企业已基本关闭

2022-12-06 16:44:59,055 (asr_inference_paraformer:337) INFO: decoding, feature length: 354, forward_time: 18.9701, rtf: 5.3588
2022-12-06 16:44:59,095 (asr_inference_paraformer:359) INFO: decoding, predictions: 目前矿山企业已基本关闭
2022-12-06 16:44:59,104 (asr_inference_paraformer:324) INFO: decoding, utt_id: ['BAC009S0761W0470']
Traceback (most recent call last):
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/bin/asr_inference_launch.py", line 225, in
main()
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/bin/asr_inference_launch.py", line 219, in main
inference(**kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/bin/asr_inference_paraformer.py", line 329, in inference
results = speech2text(**batch)
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/bin/asr_inference_paraformer.py", line 193, in call
decoder_outs = self.asr_model.cal_decoder_with_predictor(enc, enc_len, pre_acoustic_embeds, pre_token_length)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/models/e2e_asr_paraformer.py", line 333, in cal_decoder_with_predictor
decoder_out, _ = self.decoder(
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/models/decoder/transformer_decoder.py", line 504, in forward
x, tgt_mask, memory, memory_mask = decoder(
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/models/decoder/transformer_decoder.py", line 123, in forward
x = residual + self.dropout(self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
File "/home/environment/hwang/anaconda3/envs/oslasr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/modules/attention.py", line 114, in forward
return self.forward_attention(v, scores, mask)
File "/home/work_nfs4_ssd/hwang/workspaces/funasr/funasr/modules/attention.py", line 83, in forward_attention
scores = scores.masked_fill(mask, min_value)
RuntimeError: The size of tensor a (13) must match the size of tensor b (14) at non-singleton dimension 3
# Accounting: time=2679 threads=1
# Ended (code 1) at Tue Dec 6 16:45:13 CST 2022, elapsed time 2679 seconds

Maybe it is a coding bug?

middle layer output

Great work. I want to use some middle-layer outputs for another purpose (voice conversion). How can I extract these features from a given audio file?

Is ONNX export supported?

Hello, I recently wanted to run inference with Paraformer. Is there a ready-made API for exporting Paraformer to ONNX?

What if I don't have DingTalk?

What should I do if I don't have DingTalk? Could you provide a WeChat group or something similar for discussion?

The model's quickstart document needs to be updated

For the model speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch, if I follow its quickstart document, it reports an error.

Finally, I found that for funasr 0.2.1 and modelscope 1.3.0, the following code should be used:

from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks
inference_pipeline = pipeline(
    task=Tasks.auto_speech_recognition,
    model='damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision="v1.2.1",
    vad_model='damo/speech_fsmn_vad_zh-cn-16k-common-pytorch',
    vad_model_revision="v1.1.8",
    punc_model='damo/punc_ct-transformer_zh-cn-common-vocab272727-pytorch',
    punc_model_revision="v1.1.6",
)
result = inference_pipeline('http://www.modelscope.cn/api/v1/models/damo/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch/repo?Revision=master\u0026FilePath=example/asr_example.wav',)
print(result)

Please update the document.

Questions about training the multi-frame cross-channel attention (MFCCA) model

Some time ago I was studying your paper on unified modeling for far-field multi-speaker speech recognition, MFCCA: MULTI-FRAME CROSS-CHANNEL ATTENTION FOR MULTI-SPEAKER ASR IN MULTI-PARTY MEETING SCENARIO. I recently noticed that you have open-sourced the model implementation, and after reading it carefully I fixed quite a few errors and bugs in my own implementation, for which I am very grateful. I also have a few questions:

  1. First, I noticed that the cross-channel attention implementation does not use a mask. However, before this point the whole training batch has already been downsampled and positionally encoded, which means that although the padded parts of the shorter samples in the batch were masked during feature extraction, they already contain non-zero values after the front-end processing. As a result, when computing the keys and values, the two look-ahead frames of the last two frames are not freshly padded all-zero features. Does this have any impact, or have I misunderstood? (Screenshot attached in the original issue.)

  2. Second, I would like to ask about the composition of the AliMeeting training set. I see that the near-field training set Ali-near consists of single-channel audio recorded with headset microphones, while the paper describes generating 600 h of simulated multi-speaker data, Ali-simu. Could you share the specific method used to generate the simulated data? Is the generated data all 8-channel, or only partially overlapped single-channel data? Was the original single-channel Ali-near data not used for training? In addition, is Ali-far-bf, generated with the CDDMA beamformer, single-channel data? Was this part implemented with the algorithms in the project such as beamformer.py, dnn_beamformer.py, and dnn_wpe.py?

  3. Finally, one more point that confuses me: if training uses both single-channel and 8-channel data, each batch presumably has to consist entirely of 8-channel samples or entirely of single-channel samples. How is this handled? Do you train on the multi-channel data first and then on the single-channel data?

Sorry for the many questions; if you have time, I would greatly appreciate your help with them. Thank you very much!

Saying nothing but getting irrelevant recognition results

Environment:

os:centos 7.9
python:3.8.3
package: torch 1.13.1,modelscope 1.3.2 funasr 0.2.2
model:speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
command: inference_pipeline = pipeline(task=Tasks.auto_speech_recognition,  model="./speech_paraformer-large-contextual_asr_nat-zh-cn-16k-common-vocab8404", param_dict=param_dict)
rec_result = inference_pipeline(audio_in=audio.wav)

audio.wav is actually captured from a microphone and streamed to the server in real time; once 20 s have accumulated, it is sent to the model for decoding.
Nothing was said, yet the model outputs strange recognition results such as the following:

哦每庭有诉讼都一一包括括你你你认觉有问题题的话那么我们就就问问题没有其他他他这个这个这边
呃三呃十三十三三三三三三三呃三有三有三的证据
四一二三三三现在那个国务要的个个东题就是呃a呃是是在呃呃还进出去西话还有个光标幺幺二二二二是a二的是lt的
那个个那个那诉那个嗯你现在用卡来换了吗那那个个十二号
我们分是这个就是刚管里的这个被个破板抓住抓住呢然然相于就是在在推电子的时候候这个电
呃是三个的个个个个个个个个呃的的的的的的的的的的的的的的的的的的是的有的证据
证明的这是我证时间第四年是一个八月六月的范围这是应该回事就是他在因为这次共同发的的个个政府保存的收 的被告外外的的这个承包这个
