Comments (3)
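First, you are probably using the mini dictionary, which does not contain the entry 点钟 qt 185.
Second, your custom dictionary may be at fault. On my side, the input containing the space is segmented as [4/m, 月/n, 30号/mq, /w, 9点钟/mq], without the null you reported. You can set a breakpoint to see where that null comes from.
Finally, 30号9点 is merged incorrectly because 号 carries the m part-of-speech tag in the mini dictionary. That tag is unreasonable; just delete it.
Now the segmentation results are: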
[4月/mq, 30号/mq, /w, 9点钟/mq]
[4月/mq, 30号/mq, 9点钟/mq]
[4月/mq, 30日/mq, 9点钟/mq]
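As for whether this space should exist as a punctuation token (w), I don't think it matters, since it will certainly be filtered out at search time anyway. We can discuss it further.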
from hanlp.
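I found the cause: it is the normalization parameter in the configuration file, which I had set to true. After normalization processes the space, its /w tag is converted to /nx (foreign symbol). I suggest leaving spaces untouched here. Also, the result with NLPTokenizer is still the same as before.
# Normalization (Traditional -> Simplified, full-width -> half-width, uppercase -> lowercase); the cache must be deleted after changing this setting
Normalization=true
For now, my workaround is to skip normalization of spaces in CharTable: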
/**
 * Normalizes some characters (in place).
 * @param charArray the characters to normalize
 */
public static void normalization(char[] charArray)
{
    assert charArray != null;
    for (int i = 0; i < charArray.length; i++)
    {
        // Skip spaces so the conversion table does not rewrite them.
        if (charArray[i] == ' ') continue;
        charArray[i] = CONVERT[charArray[i]];
    }
}
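To make the effect of that workaround concrete, here is a minimal, self-contained sketch. It assumes a hypothetical stand-in for the CONVERT table (an identity mapping plus a few illustrative entries, including the space -> '\u0000' entry described in this thread); the real table is loaded from HanLP's data files and is not reproduced here. The class and method names are made up for illustration.

```java
class NormalizationSketch {
    // Hypothetical stand-in for CharTable.CONVERT: identity mapping by default.
    static final char[] CONVERT = new char[Character.MAX_VALUE + 1];

    static {
        for (int i = 0; i < CONVERT.length; i++) {
            CONVERT[i] = (char) i;
        }
        // Assumed entries, for illustration only.
        CONVERT['A'] = 'a';        // uppercase -> lowercase
        CONVERT['\uFF14'] = '4';   // full-width digit four -> half-width
        CONVERT[' '] = '\u0000';   // the problematic mapping discussed in this thread
    }

    // Naive version: every character goes through the table.
    static void normalizeAll(char[] charArray) {
        for (int i = 0; i < charArray.length; i++) {
            charArray[i] = CONVERT[charArray[i]];
        }
    }

    // Workaround version: spaces are left untouched.
    static void normalizeSkippingSpaces(char[] charArray) {
        for (int i = 0; i < charArray.length; i++) {
            if (charArray[i] == ' ') continue;
            charArray[i] = CONVERT[charArray[i]];
        }
    }

    public static void main(String[] args) {
        char[] naive = "A 4".toCharArray();
        normalizeAll(naive);
        System.out.println("naive: space became U+0000? " + (naive[1] == '\u0000'));

        char[] skipped = "A 4".toCharArray();
        normalizeSkippingSpaces(skipped);
        System.out.println("workaround: space preserved? " + (skipped[1] == ' '));
    }
}
```

The workaround leaves the table itself unchanged, so any other code path that reads the table directly would still see the space mapped to '\u0000'.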
from hanlp.
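Thanks for the suggestion. Normalizing ' ' to '\u0000' is indeed unreasonable: '\u0000' has Unicode code point 0, which causes problems in the double-array trie. The issue can be fixed by adjusting data/dictionary/other/CharTable.bin.yes.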
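A quick way to look for this kind of entry is to scan a conversion table for characters that map to '\u0000'. The sketch below is only an illustration under stated assumptions: it works on a hypothetical in-memory char[] table, because the on-disk format of CharTable.bin.yes is not shown in this thread, and the class and method names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class NulMappingCheck {
    // Returns every character other than U+0000 itself that the table maps to U+0000.
    static List<Character> findNulMappings(char[] table) {
        List<Character> offenders = new ArrayList<>();
        for (int c = 1; c < table.length; c++) {
            if (table[c] == '\u0000') {
                offenders.add((char) c);
            }
        }
        return offenders;
    }

    // Hypothetical table: every character maps to itself.
    static char[] identityTable() {
        char[] table = new char[Character.MAX_VALUE + 1];
        for (int i = 0; i < table.length; i++) {
            table[i] = (char) i;
        }
        return table;
    }

    public static void main(String[] args) {
        char[] table = identityTable();
        table[' '] = '\u0000'; // reproduce the mapping reported in this thread

        for (char c : findNulMappings(table)) {
            System.out.printf("U+%04X maps to U+0000%n", (int) c);
        }

        table[' '] = ' '; // table-level repair: map the space to itself
        System.out.println("offenders after repair: " + findNulMappings(table).size());
    }
}
```

Repairing the table entry, rather than skipping spaces in the loop, fixes the mapping for every caller that uses the table.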
from hanlp.