tenlee2012 / elasticsearch-analysis-hao
A very easy-to-use ("非常 hao 用", a pun on 好/hǎo) Elasticsearch (ES) Chinese tokenizer plugin.
License: Apache License 2.0
Suggestion: support configuring multiple dictionary files, for both local custom dictionaries and remote dictionaries.
Also, does the remote dictionary fetch check Last-Modified and ETag? It would be good to state this explicitly in the documentation.
Keyword: 图书发行第一股
Tokenization result: 图书发行、图书、发行、第一股、第一
Expected result: 图书发行、图书、发行、第一股、第一、股
Keyword: 图书股
Tokenization result: 图书股、图书
Expected result: 图书股、图书、股
Currently, in and mode, searching for "图书股" does not match "图书发行第一股".
Keyword: 图书发行第一股
Tokenization result: 图书、发行、第一、一股、一、股
Keyword: 图书股
Tokenization result: 图书、股
Currently, in and mode, searching for "图书股" does match "图书发行第一股".
I hope a hao_max_word mode can be added, similar to IK's ik_max_word mode.
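For illustration, an ik_max_word-style pass can be thought of as emitting every dictionary word found anywhere in the input. The sketch below uses a hypothetical mini-dictionary and class name (`MaxWordDemo`); it is not the plugin's actual algorithm:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class MaxWordDemo {
    // Hypothetical mini-dictionary for illustration only.
    static final Set<String> DICT = new HashSet<>(Arrays.asList(
            "图书", "图书发行", "图书股", "发行", "第一", "第一股", "一股", "股"));

    // Emit every dictionary word occurring anywhere in the text,
    // scanning start positions left to right.
    static List<String> maxWord(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                if (DICT.contains(text.substring(i, j))) {
                    tokens.add(text.substring(i, j));
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // "图书发行第一股" now also yields "股", so an and-mode search for
        // "图书股" (tokenized as 图书, 图书股, 股) can match it.
        System.out.println(maxWord("图书发行第一股"));
        System.out.println(maxWord("图书股"));
    }
}
```

With this toy dictionary, "图书发行第一股" yields 图书, 图书发行, 发行, 第一, 第一股, 一股, 股, so both tokens of "图书股" appear in the longer text.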
Does the hao tokenizer support hot-loaded words and term-frequency statistics?
The README says the hao tokenizer is faster than IK. How is that achieved, and by how much?
When I extend the IK tokenizer with a dictionary of tens of millions of entries, indexing throughput drops sharply. Would the hao tokenizer help in that scenario?
Apache Log4j 2.x versions earlier than 2.15.0-rc2 are all affected by the vulnerability (Log4Shell, CVE-2021-44228).
@tenlee2012 Hi,
in data/hanlp-data/data/dictionary/tc/t2s.txt,
some of the traditional characters, combined with other traditional characters, produce abnormal tokenization. Looking forward to your reply, thanks.
curl 'http://localhost:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
"tokenizer" : "hao_index_mode",
"text": "㑮 奮發圖強"
}
'
{
"tokens" : [
{
"token" : "㑮",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "奮",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "發",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "圖",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "強",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 4
}
]
}
{
"tokens" : [
{
"token" : "㑮",
"start_offset" : 0,
"end_offset" : 1,
"type" : "word",
"position" : 0
},
{
"token" : "奮發圖強",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 0
},
{
"token" : "奮發",
"start_offset" : 1,
"end_offset" : 3,
"type" : "word",
"position" : 1
},
{
"token" : "圖強",
"start_offset" : 3,
"end_offset" : 5,
"type" : "word",
"position" : 2
}
]
}
I tried adding the traditional-character forms to the custom_dictionary.text dictionary, but it had no effect.
Adding the simplified forms does work, however, and then the traditional forms are tokenized as well. How does the plugin handle traditional versus simplified characters?
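A plausible explanation (my guess from the observed behavior, not confirmed from the source): the tokenizer converts input to simplified characters via the t2s.txt table before dictionary lookup, so dictionary entries only match in their simplified form. A minimal sketch with a hypothetical four-entry mapping table:

```java
import java.util.Map;

public class T2SDemo {
    // Hypothetical traditional→simplified pairs for illustration; the real
    // table lives in data/hanlp-data/data/dictionary/tc/t2s.txt.
    static final Map<Character, Character> T2S = Map.of(
            '奮', '奋', '發', '发', '圖', '图', '強', '强');

    // Replace each character by its simplified form if the table has one.
    static String toSimplified(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (char c : s.toCharArray()) {
            sb.append(T2S.getOrDefault(c, c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // If lookup happens on the simplified form, a simplified custom entry
        // ("奋发图强") would also match traditional input, which is consistent
        // with the behavior described above.
        System.out.println(toSimplified("奮發圖強"));
    }
}
```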
First of all, sincere thanks to everyone who keeps following and using the Hao tokenizer. I will keep investing in it and strive to make Hao the best Chinese tokenizer.
Please submit a comment here including, for example:
ByteDance (字节跳动)
Beijing (北京)
[email protected]
www.douyin.com
Thanks again for taking part!
Hi, this plugin is great, thanks for your work.
Below is a test case where I found a bug; I hope it can be fixed.
curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"tokenizer" : "hao_index_mode",
"char_filter": ["html_strip"],
"text" : [
"<a>b",
"A"
]
}
'
Current result:
{
"tokens" : [
{
"token" : "a",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "A",
"start_offset" : 2,
"end_offset" : 3,
"type" : "word",
"position" : 101
}
]
}
Expected result:
{
"tokens" : [
{
"token" : "a",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 0
},
{
"token" : "A",
"start_offset" : 5,
"end_offset" : 6,
"type" : "word",
"position" : 101
}
]
}
The start_offset of "A" being smaller than the start_offset of "a" causes a Lucene indexing error.
Thanks!
Hi, without enabling enableSingleWord, is there any way to make specific single characters tokenizable for search?
I tried adding them to a custom dictionary, but it had no effect.
For example, for 河南省政府今天發表重要講話,
I want the single character "省" to be searchable.
The current tokenization result is 河南省政府/河南省/河南/政府/今天/發表/重要講話/重要/講話.
Is there a build for version 7.10.2?
Why does this error occur when using this tokenizer?
What is the difference between hao_search_mode and hao_index_mode? Please document it clearly in the README.
java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/exc/InputCoercionException
at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.createMapDeserializer(BasicDeserializerFactory.java:1376) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:387) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:349) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:264) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142) ~[?:?]
at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:479) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:4405) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4214) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3099) ~[?:?]
at com.itenlee.search.analysis.help.JSONUtil.lambda$parseJSON$2(JSONUtil.java:32) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_231]
at com.itenlee.search.analysis.help.JSONUtil.parseJSON(JSONUtil.java:32) ~[?:?]
at com.itenlee.search.analysis.core.Dictionary.loadDict(Dictionary.java:96) ~[?:?]
at com.itenlee.search.analysis.core.Dictionary.initial(Dictionary.java:65) ~[?:?]
at com.itenlee.search.analysis.lucence.Configuration.(Configuration.java:125) ~[?:?]
at org.elasticsearch.index.analysis.hao.HaoTokenizerFactory.(HaoTokenizerFactory.java:34) ~[?:?]
at org.elasticsearch.index.analysis.hao.HaoTokenizerFactory.getHttpSmartTokenizerFactory(HaoTokenizerFactory.java:38) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:343) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:179) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:164) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.IndexService.(IndexService.java:175) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:399) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:564) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:513) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:166) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:507) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:269) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:495) ~[elasticsearch-7.2.0.jar:7.2.0]
at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_231]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:493) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:464) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.2.0.jar:7.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.exc.InputCoercionException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_231]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_231]
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:817) ~[?:1.8.0_231]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_231]
... 40 more
Plugin version: v8.3.3
{
"analyzer": "hao_index_mode",
"text": "0xFF0x030x420x270x010x430x320x010x460x640x5E0x5C0x3B0x010x360x290x010x050x330x010x030x220x010x630x3E0x010x270x340x350x5F0x3D0x010x400x2F0x010x3C0x430x010x610x3E0x010x370x140x010x3F0x360x3C0x2F0x350x380x060x2D0x010x040x460x060x490x310x010x280x230x1F0x220x2C0x2E0x190x460x410x0E0x340x1F0x1A0x650x3A0x0F0x250x350x1D0x200x1C0x070x330x010x230x4F0x3E0x240x650x650x3A0x2D0x010x3D0x370x490x090x4B0x170x480x4D0x540x470x1F0x410x410x0E0x300x0B0x370x320x450x330x010x4B0x010x010x4A0x330x330x510x360x010x4C0x5F0x240x0D0x380x010x5D0x010x010x2B0x270x010x290x440x010x2D0x340x010x5E0x3A0x010x3E0x2E0x010x080x3C0x010x0A0x340x010x010x2B0x010x440x240x010x670x3D0x010x0C0x3A0x010x250x330x330x000x310x3A0x310x300x330x330x330x330x330x2C0x320x3A0x310x330x330x330x530x330x330x2C0x330x3A0x320x330x010x010x0B0x010x330x2C0x340x3A0x330x390x370x360x350x330x330x2C0x370x3A0x340x330x330x330x330x330x330x2C0x310x300x3A0x360x300x330x010x0B0x010x330x2C0x310x340x3A0x310x300x300x320x330x330x330x330x330x2C0x310x350x3A0x360x380x330x330x330x330x330x2C0x380x3A0x320x330x010x010x0B0x010x330x2C0x330x310x3A0x320x330x280x310x1C0x010x330x2C0x380x310x3A0x320x330x280x310x1C0x010x330x2C0x320x300x300x300x3A0x320x310x390x010x650x630x010x330x2C0x370x310x3A0x310x3F0x650x5B0x010x330x2C0x370x320x3A0x310x010x010x010x010x330x2C0x370x330x3A0x310x010x010x010x010x330x2C0x390x310x3A0x310x5B0x150x330x010x330x2C0x390x320x3A0x310x010x0B0x0B0x010x330x2C0x390x330x3A0x310x010x0B0x0B0x010x330x000x310x310x340x000x340x330x000x330x000x320x320x000x010x010x630x000x000x310x2E0x30"
}
Tokenization hangs outright on this input. The same version of IK does not have this problem.
The runUnprivileged function in Monitor.java never sees the initialization of Monitor.lastModified/Monitor.eTags performed by the loadRemoteDictionary function in Dictionary.java. As a result, the dictionary is pointlessly reloaded even when the remote dictionary has not changed.
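For reference, the intended behavior is that the loader and the monitor share the same validator state, so an unchanged Last-Modified/ETag pair skips the reload. A minimal sketch of that shared state (illustrative class and method names, not the plugin's actual Monitor/Dictionary code):

```java
import java.util.Objects;

public class RemoteDictState {
    private String lastModified; // Last-Modified header from the last fetch
    private String eTag;         // ETag header from the last fetch

    // Returns true only when either validator changed, and records the new
    // values. If the loader initializes one copy of this state while the
    // monitor polls a separate copy (the bug described above), every poll
    // looks like a change and the dictionary is reloaded needlessly.
    public synchronized boolean shouldReload(String newLastModified, String newETag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(eTag, newETag);
        if (changed) {
            lastModified = newLastModified;
            eTag = newETag;
        }
        return changed;
    }
}
```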
Awkward... my ES is 7.17.6.
Currently "镐窝" is tokenized into two single characters. For short texts, I suggest also returning the original text as one of the tokens, controlled by a configuration switch.
For example, tokenize "镐窝" into 镐窝, 镐, 窝.
This would partially mitigate the lack of named-entity recognition and new-word discovery, and help with tokenizing uncommon personal names.
I have seen many people asking about a stop-word feature.
I'm sorry, but this plugin does not support stop-word configuration or remote stop-word dictionaries.
The reason is that Elasticsearch itself already provides a stop-word feature, and Chinese stop words are updated infrequently, so there was no point in reinventing the wheel.
If you need stop words, please use the native ES stop token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "stopwords_path": "path to the stop-word file, one word per line",
          "ignore_case": true
        }
      }
    }
  }
}
PS: Dictionary hot-updates only go so far anyway: only documents indexed after the update are tokenized with the new words; existing documents still require a reindex.
When installing the 7.13.3 version of the plugin, Elasticsearch reported this error:
plugin-security.policy] contains illegal permission ("java.lang.RuntimePermission" "*") in global grant
After investigating, I found it is related to this commit: elastic/elasticsearch#64751 (introduced in v7.11).
Elasticsearch has restricted plugin permissions, so the original plugin-security.policy no longer works.
According to one of the developers, plugins will eventually only be allowed to read their own config directory (a directory under /etc/elasticsearch named for the plugin), so the read permission of java.io.FilePermission will also be removed in the future, and RuntimePermission can no longer use "*".
The permissions therefore need to be narrowed down to adapt to newer ES versions.
PS: If anyone else runs into this, the following policy works for me for now:
grant {
  permission java.net.SocketPermission "*", "accept,connect,resolve"; // okhttp
  permission java.lang.RuntimePermission "getClassLoader"; // okhttp
  permission java.net.NetPermission "getProxySelector"; // okhttp
  permission java.lang.RuntimePermission "accessDeclaredMembers"; // jackson
  permission java.lang.reflect.ReflectPermission "suppressAccessChecks"; // jackson
  //permission java.lang.RuntimePermission "*"; // tensorflow
  permission java.lang.RuntimePermission "createClassLoader";
  permission java.lang.RuntimePermission "setContextClassLoader";
  permission java.lang.RuntimePermission "setFactory";
  permission java.lang.RuntimePermission "loadLibrary.*";
  permission java.lang.RuntimePermission "accessClassInPackage.*";
};
Hi, I found an issue where the fast vector highlighter (fvh) produces misplaced highlights.
Plugin version: 8.3.1
Index settings:
{
"mappings": {
"dynamic": false,
"properties": {
"data": {
"type": "text",
"analyzer": "hao_index_mode",
"search_analyzer": "hao_search_mode",
"term_vector": "with_positions_offsets"
}
}
}
}
Insert a document:
{
"data": [
"测试文本1",
"......",
"测试文本2"
]
}
Search for the keyword 测试:
GET /_search
{
"query": {
"match": {"data": "测试"}
},
"highlight": {
"fields": {"data": {}}
}
}
The returned result:
{
"_source": {
"data": [
"测试文本1",
"......",
"测试文本2"
]
},
"highlight": {
"data": [
"<em>测试</em>文本1",
".<em>..</em>..."
]
}
}
As you can see, the highlight fragment for the second 测试 is wrong.
If the ...... element in the middle is removed, the problem goes away:
POST /_doc
{
"data": [
"测试文本1",
"测试文本2"
]
}
GET /_search
{
"query": {
"match": {"data": "测试"}
},
"highlight": {
"fields": {"data": {}}
}
}
Returns:
{
"_source": {
"data": [
"测试文本1",
"测试文本2"
]
},
"highlight": {
"data": [
"<em>测试</em>文本1",
"<em>测试</em>文本2"
]
}
}
Could tokenization of mixed Chinese-English words be supported? For example, even after adding the word "GPS定位系统" to the dictionary, "GPS定位系统" currently does not seem to be produced as a token.
The personal name "乔大新" is tokenized into 3 single characters.
Like this:
elasticsearch-analysis-ik-7.10.1.zip
elasticsearch-analysis-hanlp-7.10.1.zip
With multiple ES versions installed locally, they are easy to mix up.