
Comments (4)

WjMessi1 commented on September 22, 2024

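I found the problem. When the current version of funasr processes the training set, the labels of audio containing English are turned into garbled text or into the "< unk >" token (the last symbol in tokens). As a result, the finetuned model returns empty inference results for audio that contains English.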

For example, for one audio clip:
The correct label is: 这个应该是就是目前来说的话IPTV的话是还是要另外付费的
During training, the current funasr turns it into: "< unk >" or “嗯支好帐思一三三多物”

Is this caused by my finetune parameter settings, or is there a parameter in config.yaml that I did not modify correctly?

from funasr.

WjMessi1 commented on September 22, 2024

Additional information:
Debugging shows that the encoder is the cause:
image

Is this related to the tokenizer: CharTokenizer setting in config.yaml? The text2tokens step converts the text incorrectly:
image

The regex rule is the problem: it does not handle English and only processes digits and Chinese characters:
image

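In earlier versions, the regex could handle letters: pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])"). After switching back to this pattern, labels for audio containing English are converted to tokens correctly.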
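
To illustrate the difference, here is a minimal sketch comparing the earlier pattern quoted above with a stand-in for the newer one. The newer pattern was only shown in a screenshot, so `new_pattern_assumed` is a guess based on the description "digits and Chinese only", and `matched_chars` is a hypothetical helper, not FunASR's actual `text2tokens` code.

```python
import re

# Pattern quoted in this comment (earlier version): Chinese, letters, digits.
old_pattern = re.compile(r"([\u4E00-\u9FA5A-Za-z0-9])")

# ASSUMPTION: stand-in for the newer pattern, which is described here only as
# handling digits and Chinese characters. The real pattern may differ.
new_pattern_assumed = re.compile(r"([\u4E00-\u9FA50-9])")

# Label from the first comment in this thread.
label = "这个应该是就是目前来说的话IPTV的话是还是要另外付费的"


def matched_chars(pattern, text):
    """Hypothetical helper: return every single character the pattern matches."""
    return pattern.findall(text)


old_chars = matched_chars(old_pattern, label)
new_chars = matched_chars(new_pattern_assumed, label)

# Characters the earlier pattern keeps but the assumed newer pattern skips.
dropped = [c for c in old_chars if c not in new_chars]
print(dropped)
```

Under this assumption, the letters of "IPTV" are the only characters that the narrower pattern fails to match, which is consistent with the symptom reported for labels containing English.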



LauraGPT commented on September 22, 2024

I have tested it and can reproduce it. We will fix it soon.


LauraGPT commented on September 22, 2024

Bugfix: https://github.com/alibaba-damo-academy/FunASR/blob/90bc3ad02eee3745188be3960036ae3e9e746049/funasr/tokenizer/char_tokenizer.py#L97

