kennfalcon / elasticsearch-analysis-hanlp
HanLP Analyzer for Elasticsearch
License: Apache License 2.0
CustomDictionary.remove(String)
CustomDictionary.insert(String, String)
Are these two methods thread-safe?
We have a scenario: when a line is deleted from the remote dictionary, the already-loaded dictionary must drop that word as well.
The current remote-dictionary loading logic only keeps inserting entries from the remote dictionary; it never removes entries that were deleted from it.
I modified RemoteMonitor's remote-dictionary loading method so that it compares against the previously loaded remote dictionary and computes the entries that need to be removed.
We have a remote dictionary of about 1,000,000 lines, and the business keeps refining it, so these two methods will be called frequently. If I use a collection's parallelStream() to run CustomDictionary.insert in parallel, could that cause problems?
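The diff described above — comparing the previous remote-dictionary snapshot with the new one to work out which words to insert and which to remove — can be sketched in plain Python, independent of HanLP. Since this thread never confirms that CustomDictionary.insert/remove are thread-safe, a cautious implementation would apply the resulting changes sequentially rather than through parallelStream():

```python
def diff_dictionary(previous, current):
    """Compare two snapshots of a remote dictionary.

    Returns (to_insert, to_remove): words newly added to the remote
    dictionary, and words deleted from it that must therefore also be
    removed from the already-loaded dictionary.
    """
    prev, curr = set(previous), set(current)
    return curr - prev, prev - curr

# Example: one word was added and one removed between polls.
old_snapshot = ["自然语言", "旧词条", "分词器"]
new_snapshot = ["自然语言", "分词器", "语义分析"]
to_insert, to_remove = diff_dictionary(old_snapshot, new_snapshot)
```

The two returned sets would then be fed to CustomDictionary.insert and CustomDictionary.remove respectively, one entry at a time.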
Looking at the code, the property configured there is enable_stop_dictionary. Since I'm not familiar with how parameters are passed in Elasticsearch plugin development, my question is: is this parameter set when creating the Elasticsearch index?
Found the explanation in issue #7.
The tokenization result is: 2019 年初 级 会计职称
The result we want is: 2019年 初级 会计职称
Tokenizing with HanLP's built-in API directly works fine and produces the expected result.
thanks
Whitespace in the text is also being indexed.
GET _analyze
{
"text": ["启动力车规级汽车"],
"analyzer": "hanlp_index"
}
Actual result:
"tokens" : [
{
"token" : "启",
"start_offset" : 0,
"end_offset" : 1,
"type" : "v",
"position" : 0
},
{
"token" : "动力车",
"start_offset" : 1,
"end_offset" : 4,
"type" : "nz",
"position" : 1
},
{
"token" : "动力",
"start_offset" : 1,
"end_offset" : 3,
"type" : "n",
"position" : 2
},
{
"token" : "规",
"start_offset" : 4,
"end_offset" : 5,
"type" : "ng",
"position" : 3
},
{
"token" : "级",
"start_offset" : 5,
"end_offset" : 6,
"type" : "q",
"position" : 4
},
... (remainder omitted)
]
}
Expected result:
The tokens "启动力" and "启动" should be produced.
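One common way to obtain segmentations like this (my suggestion, not something confirmed in this thread) is to add the desired words to HanLP's custom dictionary, whose lines have the form `word [part-of-speech] [frequency]`:

```
启动力 nz 1024
```

Here `nz` and `1024` are placeholder part-of-speech and frequency values; existing documents need to be re-indexed before they reflect the new segmentation.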
Supplement:
Hi, ES version 7.2.0 — how can dictionary entries be deleted dynamically?
In my tests the remote dictionary supports dynamic additions, but after deleting an entry, ES does not remove it; a restart is required for the deletion to take effect.
Deleting hanlp.cache and the .bin files did not help either.
The source code only inserts entries directly; is any method for deleting entries provided?
I ran into a Japanese person-name recognition problem: with the prefix the text is segmented as 声优 / 藤田 / 淑 / 子, but without the prefix it becomes 藤田淑 / 子:
POST _analyze
{
"analyzer": "hanlp_index",
"text": "日本声优藤田淑子逝世"
}
{
"tokens": [
{
"token": "日本",
"start_offset": 0,
"end_offset": 2,
"type": "ns",
"position": 0
},
{
"token": "声优",
"start_offset": 2,
"end_offset": 4,
"type": "nz",
"position": 1
},
{
"token": "藤田",
"start_offset": 4,
"end_offset": 6,
"type": "nr",
"position": 2
},
{
"token": "淑",
"start_offset": 6,
"end_offset": 7,
"type": "ng",
"position": 3
},
{
"token": "子",
"start_offset": 7,
"end_offset": 8,
"type": "ng",
"position": 4
},
{
"token": "逝世",
"start_offset": 8,
"end_offset": 10,
"type": "vi",
"position": 5
}
]
}
Tokenization at query time:
POST _analyze
{
"analyzer": "hanlp_standard",
"text": "藤田淑子"
}
{
"tokens": [
{
"token": "藤田淑",
"start_offset": 0,
"end_offset": 3,
"type": "nr",
"position": 0
},
{
"token": "子",
"start_offset": 0,
"end_offset": 1,
"type": "ng",
"position": 1
}
]
}
Is there a way to intervene in this kind of inconsistency? From the HanLP documentation, I'm not sure whether enabling Japanese name recognition would solve it.
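HanLP does ship a Japanese-name recognizer, and this plugin exposes per-tokenizer switches for HanLP options in the index settings. A sketch of enabling it — the option name `enable_japanese_name_recognize` is taken from the plugin's README and should be verified against your plugin version:

```
PUT jp_test
{
"settings": {
"analysis": {
"analyzer": {
"my_hanlp": {
"tokenizer": "my_hanlp_tokenizer"
}
},
"tokenizer": {
"my_hanlp_tokenizer": {
"type": "hanlp",
"enable_japanese_name_recognize": true
}
}
}
}
}
```

Using the same custom tokenizer for both index and search analysis would also remove the index/query inconsistency shown above.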
### Reproduction steps:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"my_hanlp_analyzer": {
"tokenizer": "my_hanlp"
}
},
"tokenizer": {
"my_hanlp": {
"type": "hanlp",
"enable_offset": false,
"enable_custom_config": true
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "my_hanlp_analyzer"
}
}
}
}
{
"title": "下降机测试"
}
Tokenized as 下降 / 机 / 测试.
Searching works:
"match": {"title": "下降"}
"match": {"title": "下降机"}
both return the document.
After adding the custom dictionary entry 下降机 nz 1, it is tokenized as 下降机 / 测试.
Expected once the index reflects the update:
"match": {"title": "下降"} should no longer match, while "match": {"title": "下降机"} should.
But the actual result is the reverse: "match": {"title": "下降"} matches, and "match": {"title": "下降机"} does not.
Searching for 下降机 also occasionally returns an error.
{
"error": {
"root_cause": [
{
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
"bytes_wanted": 251846810,
"bytes_limit": 246733209,
"durability": "PERMANENT"
}
],
"type": "circuit_breaking_exception",
"reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
"bytes_wanted": 251846810,
"bytes_limit": 246733209,
"durability": "PERMANENT"
},
"status": 429
}
Is this behavior expected? And what causes the error? It seems intermittent.
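This circuit_breaking_exception is not an analyzer bug: it is Elasticsearch's parent circuit breaker rejecting the request because estimated memory use (~240.1 MB) would exceed the limit (~235.3 MB, a percentage of a very small heap). HanLP's dictionaries consume a sizable amount of heap themselves, so the usual remedy is to give the JVM more heap in config/jvm.options (the values below are illustrative):

```
-Xms2g
-Xmx2g
```

The intermittent nature fits this explanation: requests are rejected only when real heap usage happens to be near the limit at that moment.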
I had extracted the archive directly, so plugins/analysis-hanlp contained a config directory. ES started normally, but using hanlp for analysis threw:
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config
After deleting the config directory under plugins/analysis-hanlp, everything worked again.
A heads-up for anyone who runs into this later.
I installed via elasticsearch-plugin install; running it produced the error below:
curl 'https://$ES_URL:9200/twitter2/_analyze?pretty=true' -H 'Content-Type: application/json' -d '{ "tokenizer":"hanlp", "text":"士大夫敢死队风格"}'
{
"error" : {
"root_cause" : [
{
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
}
],
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
},
"status" : 404
}
The hanlp.properties file is under ES_PATH/config/analysis-hanlp/.
/root/elasticsearch-6.4.3/bin/elasticsearch-plugin install /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip
A tool for managing installed elasticsearch plugins
Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch
Non-option arguments:
command
Option Description
------ -----------
-h, --help show help
-s, --silent show minimal output
-v, --verbose show verbose output
ERROR: Unknown plugin /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip
The following was tested on 6.6.2:
PUT index
{
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "hanlp",
"search_analyzer": "hanlp"
}
}
}
}
}
PUT index/doc/1
{
"body": ["张三\n\n新买的手机"]
}
GET index/_search
{
"query": {
"match": {
"body": {
"query": "手机"
}
}
},
"highlight": {
"fields": {
"body": {}
}
}
}
Returned:
{
"took" : 44,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "index",
"_type" : "doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"body" : [
"""
张三
新买的手机
"""
]
},
"highlight" : {
"body" : [
"""
张三
新买<em>的手</em>机
"""
]
}
}
]
}
}
Expected:
...
"highlight" : {
"body" : [
"""
张三
新买的<em>手机</em>
"""
]
}
...
<!-- Users can configure a remote extension dictionary here -->
<entry key="remote_ext_dict">http://我的地址:1888/a.txt</entry>
<!-- Users can configure a remote extension stopword dictionary here -->
<!--<entry key="remote_ext_stopwords">stop_words_location</entry>-->
Contents of http://我的地址:1888/a.txt:
中
华
POST _analyze
{
"text": "中华人民共和国",
"tokenizer": "hanlp_index"
}
Result:
{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "ns",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "nz",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "nz",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "n",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "nz",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "n",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "n",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "n",
"position" : 7
}
]
}
I expected the single-character entries "中" and "华" to appear in the result.
My original requirement was to define a custom analyzer based on hanlp_index that strips HTML tags, but in use I found the multi-granularity segmentation effect was gone.
{
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "hanlp_index"
}
}
}
}
So I analyzed with _analyze; from the results it looks like the hanlp_index tokenizer is the problem.
The hanlp_index analyzer does produce multi-granularity tokens:
GET _analyze
{
"text": ["<p>除甲醛</p>", "汽车座椅"],
"analyzer": "hanlp_index"
}
Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 6,
"end_offset" : 9,
"type" : "n",
"position" : 1
},
{
"token" : "甲醛",
"start_offset" : 10,
"end_offset" : 12,
"type" : "n",
"position" : 2
},
{
"token" : "<",
"start_offset" : 14,
"end_offset" : 15,
"type" : "nx",
"position" : 3
},
{
"token" : "/",
"start_offset" : 16,
"end_offset" : 17,
"type" : "w",
"position" : 4
},
{
"token" : "p>",
"start_offset" : 18,
"end_offset" : 20,
"type" : "nx",
"position" : 5
},
{
"token" : "汽车座椅",
"start_offset" : 24,
"end_offset" : 28,
"type" : "nz",
"position" : 6
},
{
"token" : "汽车",
"start_offset" : 28,
"end_offset" : 30,
"type" : "n",
"position" : 7
},
{
"token" : "车座",
"start_offset" : 31,
"end_offset" : 33,
"type" : "n",
"position" : 8
},
{
"token" : "座椅",
"start_offset" : 34,
"end_offset" : 36,
"type" : "n",
"position" : 9
}
]
}
The hanlp_index tokenizer, used directly, has no multi-granularity effect:
GET /_analyze
{
"text": ["<p>除甲醛</p>", "汽车座椅"],
"tokenizer": "hanlp_index"
}
Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 3,
"end_offset" : 6,
"type" : "n",
"position" : 1
},
{
"token" : "<",
"start_offset" : 6,
"end_offset" : 7,
"type" : "nx",
"position" : 2
},
{
"token" : "/",
"start_offset" : 7,
"end_offset" : 8,
"type" : "w",
"position" : 3
},
{
"token" : "p>",
"start_offset" : 8,
"end_offset" : 10,
"type" : "nx",
"position" : 4
},
{
"token" : "汽车座椅",
"start_offset" : 22,
"end_offset" : 26,
"type" : "nz",
"position" : 5
}
]
}
IK supports hot-reloading its dictionaries: it downloads the .dic file from a remote server and uses the ETag response header to decide whether the dictionary data needs to be replaced.
I hope the HanLP plugin can implement this as well.
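The ETag mechanism described above boils down to a small decision: the client remembers the last ETag, sends it back as If-None-Match, and only re-parses the dictionary when the server reports a change (a 304 means unchanged). A minimal sketch of that decision logic, independent of any HTTP library:

```python
def should_reload(last_etag, status, new_etag):
    """Decide whether the remote dictionary must be re-downloaded and re-parsed.

    last_etag: ETag stored from the previous successful download (or None).
    status:    HTTP status of the conditional GET sent with If-None-Match.
    new_etag:  ETag header of the current response (or None if absent).
    """
    if status == 304:   # Not Modified: the server confirmed no change
        return False
    if status != 200:   # error response: keep the current dictionary
        return False
    # 200 OK: reload if the ETag changed or the server sent none.
    return new_etag is None or new_etag != last_etag

# First fetch: no previous ETag, so a 200 always triggers a load.
assert should_reload(None, 200, '"abc"')
```

A real monitor thread would wrap this around an HTTP client and fall back to Last-Modified when the server sends no ETag.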
https://github.com/hankcs/HanLP/releases/tag/v1.6.8 claims "the world's largest Chinese corpus". I replaced the data directory with the 1.6.8 one directly, and so far it runs without visible problems.
But I keep worrying that something might go wrong.
Stopwords are not taking effect with either the remote or the local dictionary. Is this because the plugin does not filter stop words?
During tokenization, empty values are being indexed.
[2019-02-19T08:09:05,566][INFO ][o.e.c.r.a.AllocationService] [w44JGC2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index][2]] ...]).
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.corpus.io.IOUtil readBytes
WARNING: 读取plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin时发生异常java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin (No such file or directory)
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.dictionary.CoreDictionary load
WARNING: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt不存在!java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt (No such file or directory)
[2019-02-19T08:09:11,715][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[w44JGC2][index][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.seg.common.Vertex.newB(Vertex.java:462) ~[?:?]
at com.hankcs.hanlp.seg.common.WordNet.(WordNet.java:73) ~[?:?]
at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:40) ~[?:?]
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:507) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:222) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:200) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:148) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:75) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:294) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:610) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: java.lang.IllegalArgumentException: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt加载失败
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:44) ~[?:?]
... 20 more
After solving the Java permission problem by following https://www.jianshu.com/p/52c42cdab997 (see also https://github.com/KennFalcon/elasticsearch-analysis-hanlp/issues/2), I ran into another error.
Does this support ES 5.6.4?
Hello, I have already enabled Traditional Chinese:
PUT test
{
"settings": {
"analysis": {
"analyzer": {
"tc_hanlp": {
"tokenizer": "hanlp"
}
},
"tokenizer": {
"tc_hanlp": {
"type": "hanlp",
"enable_traditional_chinese_mode": true,
"enable_custom_config": true
}
}
}
}
}
But the output seems to segment only Simplified Chinese correctly. Do I need to convert the dictionary files to Traditional Chinese?
POST test/_analyze
{
"text": "美國阿拉斯加州發生8.0級地震",
"analyzer": "tc_hanlp"
}
{
"tokens" : [
{
"token" : "美",
"start_offset" : 0,
"end_offset" : 1,
"type" : "b",
"position" : 0
},
{
"token" : "國",
"start_offset" : 0,
"end_offset" : 1,
"type" : "w",
"position" : 1
},
{
"token" : "阿拉斯加州",
"start_offset" : 0,
"end_offset" : 5,
"type" : "nsf",
"position" : 2
},
{
"token" : "發",
"start_offset" : 0,
"end_offset" : 1,
"type" : "n",
"position" : 3
},
{
"token" : "生",
"start_offset" : 0,
"end_offset" : 1,
"type" : "v",
"position" : 4
},
{
"token" : "8.0",
"start_offset" : 0,
"end_offset" : 3,
"type" : "m",
"position" : 5
},
{
"token" : "級",
"start_offset" : 0,
"end_offset" : 1,
"type" : "n",
"position" : 6
},
{
"token" : "地震",
"start_offset" : 0,
"end_offset" : 2,
"type" : "n",
"position" : 7
}
]
}
Tested with the latest code; it does not appear to be fixed:
With search_analyzer set to hanlp_nlp:
{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
}
}
}
Fails.
With search_analyzer set to hanlp:
{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
}
}
}
Works fine!
As titled.
[2019-04-04T00:11:45,886][INFO ][o.e.n.Node ] [rpmdWQa] started
[2019-04-04T00:12:10,126][INFO ][c.h.d.c.RemoteDictConfig ] [rpmdWQa] try load remote hanlp config from D:\webdev\es\elasticsearch-6.6.1\config\analysis-hanlp\hanlp-remote.xml
[2019-04-04T00:12:10,211][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [rpmdWQa] fatal error in thread [elasticsearch[rpmdWQa][analyze][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:35) ~[?:?]
at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.(DoubleArrayTrieSegment.java:45) ~[?:?]
at com.hankcs.lucene.HanLPSpeedAnalyzer.createComponents(HanLPSpeedAnalyzer.java:31) ~[?:?]
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:198) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:267) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:252) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:170) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:81) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:115) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.security.AccessControlException: access denied ("java.io.FilePermission" "src\main\java" "read")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_131]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkRead(SecurityManager.java:888) ~[?:1.8.0_131]
at java.io.File.isDirectory(File.java:844) ~[?:1.8.0_131]
at com.hankcs.hanlp.HanLP$Config.(HanLP.java:342) ~[?:?]
... 14 more
The remote dictionary is currently polled once a minute; can this scheduled interval be configured?
Many thanks for the HanLP analysis plugin.
I have one problem: if the hanlp.cache file exists, restarting ES fails with the error below. What could be the cause?
If I delete hanlp.cache it runs normally, but it then needs time to reload HanLP's dictionaries.
Error:
[2018-07-24T17:46:22,115][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node1] fatal error in thread [elasticsearch[node1][generic][T#2]], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/core/internal/io/IOUtils
at com.hankcs.dic.cache.DictionaryFileCache.lambda$loadCache$0(DictionaryFileCache.java:60) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_73]
at com.hankcs.dic.cache.DictionaryFileCache.loadCache(DictionaryFileCache.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.(Dictionary.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.initial(Dictionary.java:52) ~[?:?]
at com.hankcs.cfg.Configuration.(Configuration.java:54) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.(HanLPTokenizerFactory.java:31) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.getHanLPNLPTokenizerFactory(HanLPTokenizerFactory.java:47) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:361) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:176) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:154) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexService.(IndexService.java:145) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:448) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.verifyIndexMetadata(IndicesService.java:481) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.Gateway.performStateRecovery(Gateway.java:135) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.GatewayService$1.doRun(GatewayService.java:229) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.core.internal.io.IOUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_73]
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_73]
... 22 more
As titled — could the author share a contact method?
I have confirmed the dictionary is UTF-8; calling HanLP on the same file from Python generates the .bin cache successfully.
Hi, KennFalcon:
The plugin version I'm using: elasticsearch-analysis-hanlp for ElasticSearch 6.3.2.
Because HanLP.Config.Normalization=false by default, normalization is not enabled during segmentation, but I want it enabled when tokenizing.
The analyzer I use is com.hankcs.lucene.HanLPStandardAnalyzer. My current approach is to add one line to HanLPStandardAnalyzer.createComponents():
@Override
protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
// enable normalization during segmentation
AccessController.doPrivileged((PrivilegedAction) () -> HanLP.Config.Normalization = true);
Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment(), configuration);
return new Analyzer.TokenStreamComponents(tokenizer);
}
Then I rebuilt the jar, copied it into the elasticsearch-6.3.2/plugins/analysis-hanlp directory to replace the original jar, and restarted ES.
My question:
Would you consider making "enable normalization during segmentation" configurable? For example, when defining the index and specifying the analyzer for the content field, add a normalization option:
"content": {
"type": "text",
"analyzer": "hanlp_standard",
"normalization":true
Or add a config option such as enableNormalization to com.hankcs.cfg.Configuration and have it drive the normalization?
Or is there a better approach you would suggest?
For large-scale commercial use alongside ES, are there any known performance or stability issues with the HanLP analyzer?
With version 7.2.0, hanlp_index has no effect for index-time analysis; other versions I tried are fine.
ES version6.1.2, elasticsearch-analysis-hanlp version 7.3.2
PUT INDEX
{
"settings": {
"analysis": {
"analyzer": {
"my_hanlp_index_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "hanlp_index"
},
"my_hanlp_analyzer": {
"filter": [
"lowercase"
],
"tokenizer": "hanlp"
}
},
"tokenizer": {
"my_hanlp_index": {
"enable_custom_config": "true",
"type": "hanlp_index",
"enable_stop_dictionary": "true"
},
"my_hanlp": {
"enable_custom_config": "true",
"type": "hanlp",
"enable_stop_dictionary": "true"
}
}
}
},
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "my_hanlp_index_analyzer",
"search_analyzer": "my_hanlp_analyzer"
}
}
}
}
}
PUT index/doc/1
{
"body": ["\n营造建设和谐环境 \n \n\n专业名称:建设 发布日期:2010 年 1 月 \n\n \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"]
}
GET index/_search
{
"query": {
"match": {
"body": {
"query": "科学"
}
}
},
"highlight": {
"fields": {
"body": {}
}
}
}
Returned:
{
"took": 75,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.29379335,
"hits": [
{
"_index": "index",
"_type": "doc",
"_id": "1",
"_score": 0.29379335,
"_source": {
"body": [
"\n营造建设和谐环境 \n \n\n专业名称:建设 发布日期:2010 年 1 月 \n\n \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"
]
},
"highlight": {
"body": [
"营造建设和谐环境 \n \n\n专业名称:建设 发布日期:2010 年 1 月 \n\n \n\n摘要:为贯彻落实科<em>学发</em>展观,将建设有机融入全社会持续发展,减"
]
}
}
]
}
}
As titled.
1. Environment: ES version 6.3.2, plugin version 6.3.2.
2. Problem description (I enabled term vectors to make this performance issue easier to demonstrate):
Whitespace participates in queries and slows them down. Profiling shows the query takes several hundred milliseconds, while other term queries take only around 10 ms. This "whitespace query" should be unnecessary; it does not happen with the ik_max_word analyzer.
First define an index user with the analyzer hanlp_standard; user has a single field named nick.
PUT user
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"profile": {
"properties": {
"nick": {
"type": "text",
"analyzer": "hanlp_standard",
"term_vector": "yes",
"fields": {
"raw": {
"type": "keyword"
}
}
}
}
}
}
}
Add a document. Note that there is a space between "人生" and "如梦".
PUT user/profile/1
{
"nick":"人生 如梦"
}
Viewing it with term vectors shows the space was indexed as well:
GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}
The result is below. There are 4 terms in total, and the first one is the space itself; this space term hurts query efficiency.
{
"_index": "user",
"_type": "profile",
"_id": "1",
"_version": 1,
"found": true,
"took": 0,
"term_vectors": {
"nick": {
"field_statistics": {
"sum_doc_freq": 4,
"doc_count": 1,
"sum_ttf": 4
},
"terms": {
" ": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"人生": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"如": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
},
"梦": {
"doc_freq": 1,
"ttf": 1,
"term_freq": 1
}
}
}
}
}
A profile analysis of the query shows that the space also participates in the query:
GET user/profile/_search?human=true
{
"profile":true,
"query": {
"match": {
"nick": "人生 如梦"
}
}
}
The above is a minimal example showing that the space takes part in the query. In a real production environment, when the index holds hundreds of millions of documents, **this causes a serious query-performance problem**, because the term query for the space is slow. Here is an actual profile (the space took 480 ms, while "微信" took only 18 ms):
"profile": {
"shards": [
{
"id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:微信 nick: nick:黄色",
"time": "888.6ms",
"time_in_nanos": 888636963,
"breakdown": {
"score": 513864260,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 93345,
"next_doc": 364649642,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5063173,
"score_count": 4670398,
"build_scorer": 296094,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:微信",
"time": "18.4ms",
"time_in_nanos": 18480019,
"breakdown": {
"score": 656810,
"build_scorer_count": 62,
"match_count": 0,
"create_weight": 23633,
"next_doc": 17712339,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 7085,
"score_count": 5705,
"build_scorer": 74384,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick: ",
"time": "480.5ms",
"time_in_nanos": 480508016,
"breakdown": {
"score": 278358058,
"build_scorer_count": 72,
"match_count": 0,
"create_weight": 6041,
"next_doc": 192388910,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 5056541,
"score_count": 4665006,
"build_scorer": 33387,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:黄色",
"time": "3.8ms",
"time_in_nanos": 3872679,
"breakdown": {
"score": 136812,
"build_scorer_count": 50,
"match_count": 0,
"create_weight": 5423,
"next_doc": 3700537,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 923,
"score_count": 755,
"build_scorer": 28178,
"advance": 0,
"advance_count": 0
}
}
]
}
],
I then compared the ik_max_word analyzer: its profile shows no query for the space at all; the TermQuery children below contain no space term.
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5753642,
"hits": [
{
"_index": "myindex",
"_type": "profile",
"_id": "1",
"_score": 0.5753642,
"_source": {
"nick": "人生 如梦"
}
}
]
},
"profile": {
"shards": [
{
"id": "[7MyDkEDrRj2RPHCPoaWveQ][myindex][0]",
"searches": [
{
"query": [
{
"type": "BooleanQuery",
"description": "nick:人生 nick:如梦",
"time": "410.8micros",
"time_in_nanos": 410831,
"breakdown": {
"score": 26377,
"build_scorer_count": 2,
"match_count": 0,
"create_weight": 227597,
"next_doc": 12341,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 144510,
"advance": 0,
"advance_count": 0
},
"children": [
{
"type": "TermQuery",
"description": "nick:人生",
"time": "197.6micros",
"time_in_nanos": 197665,
"breakdown": {
"score": 9670,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 146018,
"next_doc": 1302,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 40668,
"advance": 0,
"advance_count": 0
}
},
{
"type": "TermQuery",
"description": "nick:如梦",
"time": "62.8micros",
"time_in_nanos": 62830,
"breakdown": {
"score": 999,
"build_scorer_count": 3,
"match_count": 0,
"create_weight": 55092,
"next_doc": 864,
"match": 0,
"create_weight_count": 1,
"next_doc_count": 2,
"score_count": 1,
"build_scorer": 5868,
"advance": 0,
"advance_count": 0
}
}
]
}
],
"rewrite_time": 26763,
"collector": [
{
"name": "CancellableCollector",
"reason": "search_cancelled",
"time": "41.9micros",
"time_in_nanos": 41945,
"children": [
{
"name": "SimpleTopScoreDocCollector",
"reason": "search_top_hits",
"time": "30.6micros",
"time_in_nanos": 30633
}
]
}
]
}
],
"aggregations": []
}
]
}
}
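A possible workaround (my own suggestion, not from this thread) is to wrap the hanlp tokenizer in a custom analyzer and drop the space term with a `stop` token filter before it reaches the index or the query. Whether a bare-space stopword catches the term exactly as emitted should first be confirmed with `_analyze`:

```
PUT user_v2
{
"settings": {
"analysis": {
"filter": {
"drop_space": {
"type": "stop",
"stopwords": [" "]
}
},
"analyzer": {
"hanlp_no_space": {
"tokenizer": "hanlp_standard",
"filter": ["drop_space"]
}
}
}
}
}
```

Because a `stop` filter removes tokens at both index and search time, the space term would neither be stored in term vectors nor appear as a TermQuery in the profile.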
I installed the hanlp plugin using method 2; installation went smoothly and ES started normally, but when testing hanlp in Kibana:
GET /_analyze?pretty
{
"analyzer" : "hanlp",
"text" : ["重庆华龙网海数科技有限公司"]
}
I found that ES had shut down. The log shows:
[2018-09-06T09:15:39,827][ERROR][c.h.d.Monitor ] can not find hanlp.properties
java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_181]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_181]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_181]
at java.lang.ClassLoader.checkClassLoaderPermission(ClassLoader.java:1528) ~[?:1.8.0_181]
at java.lang.Thread.getContextClassLoader(Thread.java:1443) ~[?:1.8.0_181]
at com.hankcs.dic.Monitor.reloadProperty(Monitor.java:61) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at com.hankcs.dic.Monitor.run(Monitor.java:34) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
[2018-09-06T09:16:23,505][INFO ][o.e.m.j.JvmGcMonitorService] [cyVG2fn] [gc][57] overhead, spent [318ms] collecting in the last [1s]
[2018-09-06T09:35:31,918][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[cyVG2fn][index][T#1]], exiting
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:423) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:266) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:243) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:164) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:80) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
Data package directory: ES_HOME/analysis-hanlp
I searched the entire codebase and found nothing that reads this folder, and there is no mention of it in the logs.
I need to set enable_number_quantifier_recognize to true. When and where should this option be configured?
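Options like this are set per tokenizer in the index settings, following the same pattern as the `enable_custom_config` example earlier on this page (the option name below is taken from the question itself; check the plugin README for your version):

```
PUT nq_test
{
"settings": {
"analysis": {
"analyzer": {
"my_hanlp_analyzer": {
"tokenizer": "my_hanlp"
}
},
"tokenizer": {
"my_hanlp": {
"type": "hanlp",
"enable_number_quantifier_recognize": true
}
}
}
}
}
```

The custom analyzer is then referenced by name in the field mapping, the same way `my_hanlp_analyzer` is used in the reproduction steps above.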
"滴滴出行" is segmented into just "出行"...
The exception is:
Caused by: java.security.AccessControlException: access denied ("java.util.PropertyPermission" "*" "read,write")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_111]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPropertiesAccess(SecurityManager.java:1262) ~[?:1.8.0_111]
at java.lang.System.getProperties(System.java:630) ~[?:1.8.0_111]
at com.hankcs.hanlp.HanLP$Config.<clinit>(HanLP.java:306) ~[?:?]
I also ran into the problem above, though my version is 6.4.
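AccessControlException errors like this usually mean the plugin's security-policy grants don't cover an operation HanLP performs (here, System.getProperties). Elasticsearch plugins declare such grants in a plugin-security.policy file; a hedged sketch of the permission that the stack trace points at, assuming your plugin bundle allows editing it:

```
// plugin-security.policy (illustrative fragment; the plugin ships its own)
grant {
  permission java.util.PropertyPermission "*", "read,write";
};
```

If the shipped policy already contains this grant, the mismatch may instead come from how the plugin was installed (e.g. unpacked manually rather than via elasticsearch-plugin install, which prompts to accept these permissions).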
Tokenizing with hanlp:
GET _analyze
{
"text": ["中国地大物博"],
"tokenizer": "hanlp"
}
Returned data:
{
"tokens" : [
{
"token" : "中国",
"start_offset" : 0,
"end_offset" : 2,
"type" : "ns",
"position" : 0
},
{
"token" : "地大物博",
"start_offset" : 0,
"end_offset" : 4,
"type" : "nz",
"position" : 1
}
]
}
The second term's start_offset still begins at 0,
while the offsets produced by ES's built-in tokenizers are strictly increasing.
GET _analyze
{
"text": ["中国地大物博"],
"tokenizer": "standard"
}
Response:
{
"tokens" : [
{
"token" : "中",
"start_offset" : 0,
"end_offset" : 1,
"type" : "<IDEOGRAPHIC>",
"position" : 0
},
{
"token" : "国",
"start_offset" : 1,
"end_offset" : 2,
"type" : "<IDEOGRAPHIC>",
"position" : 1
},
{
"token" : "地",
"start_offset" : 2,
"end_offset" : 3,
"type" : "<IDEOGRAPHIC>",
"position" : 2
},
{
"token" : "大",
"start_offset" : 3,
"end_offset" : 4,
"type" : "<IDEOGRAPHIC>",
"position" : 3
},
{
"token" : "物",
"start_offset" : 4,
"end_offset" : 5,
"type" : "<IDEOGRAPHIC>",
"position" : 4
},
{
"token" : "博",
"start_offset" : 5,
"end_offset" : 6,
"type" : "<IDEOGRAPHIC>",
"position" : 5
}
]
}
Search still works fine, but highlighting only matches the first few characters.
Below are the results tested on 6.5.4:
PUT document
{
"mappings": {
"doc": {
"properties": {
"body": {
"type": "text",
"analyzer": "hanlp",
"search_analyzer": "hanlp"
}
}
}
}
}
PUT document/doc/1
{
"body": ["中国地大物博"]
}
GET document/_search
{
"query": {
"match": {
"body": {
"query": "地大物博"
}
}
},
"highlight": {
"fields": {
"body": {}
}
}
}
Response:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "document",
"_type" : "doc",
"_id" : "1",
"_score" : 0.2876821,
"_source" : {
"body" : [
"中国地大物博"
]
},
"highlight" : {
"body" : [
"<em>中国地大</em>物博"
]
}
}
]
}
}
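The misplaced highlight here may be related to the non-increasing offsets reported earlier in this thread: overlapping tokens whose start_offset jumps backwards can confuse the default highlighter. As a diagnostic sketch (not a confirmed fix), switching the highlighter type is one way to narrow the cause down:

```
GET document/_search
{
  "query": { "match": { "body": "地大物博" } },
  "highlight": {
    "fields": {
      "body": { "type": "plain" }
    }
  }
}
```

Elasticsearch supports "plain", "unified", and "fvh" highlighter types; if one of them highlights correctly, the problem is in how the default highlighter handles the analyzer's overlapping offsets rather than in the query itself.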
{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
}
}
}
Hello, a problem came up while using your plugin, and I'm not sure whether I'm using it incorrectly.
After downloading the model data from the HanLP website and placing it under /plugins/analysis-hanlp/data/model, I restarted ES,
and tokenization via _analyze works correctly.
"hanLPAnalyzer" : {
"type" : "custom",
"char_filter" : [
"charconvert"
],
"tokenizer" : "hanlp_nlp_word"
},
"hanlp_nlp_word" : {
"enable_normalization" : "true",
"enable_remote_dict" : "true",
"type" : "hanlp_nlp"
}
Then I query the data with _search as follows. There are about 2 million documents; a single query takes over 2 seconds, and after issuing it 5 or more times in a row a 401 error appears and the server log reports an out-of-memory error.
POST /poi/_search
{
"from": 0,
"size": 10,
"query": {
"dis_max": {
"tie_breaker": 0.3,
"queries": [{
"match": {
"full_q": {
"query": "青年路",
"operator": "OR",
"analyzer": "hanLPAnalyzer"
}
}
}],
"boost": 1.0
}
},
"post_filter": {
"bool": {
"adjust_pure_negative": true,
"boost": 1.0
}
}
}
Hello:
I'm using the elasticsearch-analysis-hanlp-6.5.1 plugin. I previously used the ik plugin, and I mirrored ik's synonym configuration in the settings below, but the synonyms don't take effect. How should synonyms be configured for the hanlp plugin? Many thanks!!
"settings" : {
"index" : {
"analysis" : {
"filter" : {
"hanlp_synonym_ik_standard" : {
"type" : "synonym",
"synonyms_path" : "../plugins/analysis-hanlp/data/dictionary/synonym/CoreSynonym.txt"
}
},
"analyzer" : {
"hanlp_default_search" : {
"tokenizer" : "hanlp_standard"
},
"hanlp_default_index" : {
"filter" : [
"hanlp_synonym_ik_standard"
],
"tokenizer" : "hanlp_index"
}
}
}
}
},
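One thing worth checking: a synonym token filter only runs when text passes through an analyzer that includes it. Defining hanlp_default_index in the settings is not enough if the field mappings still reference hanlp_nlp directly (as in the mapping shown earlier in this thread). A hedged sketch of wiring the field to the analyzer that carries the filter, using the names from the settings above:

```
"mappings": {
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "hanlp_default_index",
      "search_analyzer": "hanlp_default_index"
    }
  }
}
```

Also note that hanlp_default_search as defined has no synonym filter at all, so query-time synonym expansion would not happen through it even if it were referenced.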