Comments (4)
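I found the problem. When the current version of funasr processes the training set, any label for audio that contains English is turned into garbage or into "<unk>", the last entry in the tokens list. As a result, the finetuned model returns empty results when running inference on audio that contains English.
For example, for one audio clip
the correct label is: 这个应该是就是目前来说的话IPTV的话是还是要另外付费的
but during training the current funasr turns it into "<unk>" or "嗯支好帐思一三三多物".
Is this caused by a mistake in my finetune parameters, or by a parameter in config.yaml that I did not set correctly?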
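Could this be related to the tokenizer: CharTokenizer parameter in config.yaml, with the text2tokens conversion step going wrong?
The regex rule is at fault: it does not handle English and only processes text containing digits and Chinese.
In earlier versions the regex could handle letters: pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])"). After switching to this pattern, labels for audio containing English are converted to tokens correctly.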
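The effect of the two rules can be illustrated with a small sketch. The first pattern is the one quoted above. The second is only a hypothetical stand-in for the failing rule, since the actual pattern in the current funasr code is not shown in this thread, and `matched_chars` is a helper name invented for this example rather than a funasr function.

```python
import re

# Pattern quoted in this thread: CJK ideographs, ASCII letters, and digits.
PATTERN_WITH_LETTERS = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")

# Hypothetical stand-in for the failing rule (an assumption, not funasr code):
# the same character class with the ASCII letters removed.
PATTERN_WITHOUT_LETTERS = re.compile(r"([\u4E00-\u9FA50-9])")


def matched_chars(pattern, text):
    """Return the characters of `text` that `pattern` matches, in order."""
    return pattern.findall(text)


label = "这个应该是就是目前来说的话IPTV的话是还是要另外付费的"

kept = matched_chars(PATTERN_WITH_LETTERS, label)
dropped = matched_chars(PATTERN_WITHOUT_LETTERS, label)

# Characters that only the letter-aware pattern picks up.
missing = [c for c in kept if c not in dropped]

print("with letters:   ", "".join(kept))
print("without letters:", "".join(dropped))
print("lost characters:", missing)
```

If the label loses its English characters at this stage, the text no longer lines up with the audio, which would be consistent with the garbled or "<unk>" targets reported above.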
I have tested it. We will fix it soon.