Comments (3)

hankcs commented on May 22, 2024

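First, you are probably using the mini dictionary, which does not contain the entry "点钟 qt 185".
Second, your custom dictionary may have a problem. On my side the space case is segmented as [4/m, 月/n, 30号/mq, /w, 9点钟/mq], without the null you reported. Set a breakpoint to find out where that null comes from.
Finally, 30号9点 was merged incorrectly because the mini dictionary gives 号 the part-of-speech tag m (numeral). That is unreasonable; deleting the tag fixes it.
The segmentation results are now: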

[4月/mq, 30号/mq,  /w, 9点钟/mq]
[4月/mq, 30号/mq, 9点钟/mq]
[4月/mq, 30日/mq, 9点钟/mq]

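As for whether the space should exist as a punctuation token (w), I don't think it matters much, since it will be filtered out at search time anyway. This is open to further discussion.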
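
The search-time filtering mentioned above could look roughly like the following. This is a minimal sketch, not HanLP code: the class `WhitespaceTermFilter`, the method `dropBlankTerms`, and the "word/tag" string representation of a term are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of HanLP): drops whitespace-only terms before indexing or querying. */
final class WhitespaceTermFilter {

    /**
     * Each term is assumed to be a "word/tag" string such as "30号/mq";
     * the tag is whatever follows the last '/'.
     */
    static List<String> dropBlankTerms(List<String> terms) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            int slash = term.lastIndexOf('/');
            String word = slash >= 0 ? term.substring(0, slash) : term;
            // A word made only of whitespace (or an empty word) carries nothing to search for.
            boolean blank = word.chars().allMatch(Character::isWhitespace);
            if (!blank) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

Applied to the first result above, the " /w" term is dropped and the three mq terms are kept in their original order.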

from hanlp.

a198720 commented on May 22, 2024

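I found the cause: it is the normalization parameter in the configuration file, which I had set to true. After the space is processed, its tag changes from /w to /nx (foreign-language symbol). I suggest not processing spaces here. Also, NLPTokenizer still gives the same result as before.
#Normalization (Traditional -> Simplified, full-width -> half-width, upper case -> lower case); the cache must be deleted after changing this setting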
Normalization=true

For now, my workaround is to skip normalization of spaces in CharTable:
/**
 * Normalize characters (in place)
 * @param charArray the characters to normalize
 */
public static void normalization(char[] charArray)
{
    assert charArray != null;
    for (int i = 0; i < charArray.length; i++)
    {
        // Leave spaces untouched instead of looking them up in the table
        if (charArray[i] == ' ') continue;
        charArray[i] = CONVERT[charArray[i]];
    }
}
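
To see the effect of that guard in isolation, here is a self-contained sketch. It does not load HanLP's real table: `NormalizationSketch` and its in-memory `CONVERT` array are stand-ins built for illustration, covering only ASCII case folding and full-width ASCII variants, with the space entry deliberately set to U+0000 to mimic the behaviour reported in this thread.

```java
/** Illustrative stand-in for a character conversion table; not the table HanLP ships. */
final class NormalizationSketch {

    // One entry per UTF-16 code unit, identity mapping by default.
    static final char[] CONVERT = new char[Character.MAX_VALUE + 1];

    static {
        for (int c = 0; c < CONVERT.length; c++) {
            CONVERT[c] = (char) c;
        }
        // Upper case -> lower case (ASCII only in this sketch).
        for (char c = 'A'; c <= 'Z'; c++) {
            CONVERT[c] = Character.toLowerCase(c);
        }
        // Full-width ASCII variants (U+FF01..U+FF5E) -> half-width, then lower-cased.
        for (char c = '\uFF01'; c <= '\uFF5E'; c++) {
            CONVERT[c] = Character.toLowerCase((char) (c - 0xFEE0));
        }
        // Assumption for the demo: the space entry is broken and maps to U+0000.
        CONVERT[' '] = '\u0000';
    }

    /** Normalizes in place, skipping spaces so the broken entry is never consulted. */
    static void normalization(char[] charArray) {
        if (charArray == null) {
            throw new IllegalArgumentException("charArray must not be null");
        }
        for (int i = 0; i < charArray.length; i++) {
            if (charArray[i] == ' ') continue;
            charArray[i] = CONVERT[charArray[i]];
        }
    }
}
```

With the guard in place, a full-width "A" followed by "b C" normalizes to "ab c" and the space survives. Without the guard, the space would be replaced by U+0000.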

hankcs commented on May 22, 2024

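Thanks for the suggestion. ' ' is being normalized to '\u0000', which is unreasonable. The Unicode code point of '\u0000' is 0, which causes problems in the double-array trie. Adjusting data/dictionary/other/CharTable.bin.yes fixes the issue.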
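
A quick sanity check for this kind of fault is to scan a conversion table for any non-NUL character that maps to U+0000. The sketch below is an assumption-laden illustration: `CharTableCheck` and `findNulMappings` are hypothetical names, and the table is passed in as a plain `char[]` rather than read from CharTable.bin.yes, whose on-disk format is not described in this thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical checker (not part of HanLP) for conversion tables held as a char[]. */
final class CharTableCheck {

    /**
     * Returns every source character other than U+0000 itself whose
     * table entry is U+0000, i.e. characters that normalization would erase.
     */
    static List<Character> findNulMappings(char[] table) {
        List<Character> offenders = new ArrayList<>();
        for (int c = 1; c < table.length; c++) {
            if (table[c] == '\u0000') {
                offenders.add((char) c);
            }
        }
        return offenders;
    }
}
```

An empty result means no character is erased by the table. For a table with the fault described here, the result would contain the space character.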
