
elasticsearch-analysis-hanlp's People

Contributors

cjdxhjj, codacy-badger, dependabot[bot], hariag, kennfalcon, kevin-sb


elasticsearch-analysis-hanlp's Issues

Is CustomDictionary thread-safe?

CustomDictionary.remove(String)
CustomDictionary.insert(String, String)
Are these two methods thread-safe?

Our scenario: when a line is deleted from the remote dictionary, the already-loaded dictionary should drop that entry as well.
The current remote-dictionary loading logic only keeps adding entries from the remote dictionary; it never removes entries that have been deleted remotely.

I modified the method in RemoteMonitor that loads the remote dictionary so that it diffs against the previously loaded remote dictionary and works out which entries need to be removed.

We have a remote dictionary with about 1,000,000 lines, and the business keeps refining it, so these two methods will be called frequently. If I use a collection's parallelStream() to run CustomDictionary.insert in parallel, will that cause problems?

How are stop words configured?

Looking at the code, the enable_stop_dictionary property is what gets configured. Since I'm not very familiar with how parameters are passed in Elasticsearch plugin development, my question is: is this parameter set when the Elasticsearch index is created?

Found the explanation in issue #7.
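
For reference, a later issue in this digest sets the option on a custom tokenizer in the index settings. A minimal sketch of that pattern (the index name stopword_test is illustrative; the enable_custom_config and enable_stop_dictionary option names are taken from that later issue, not from the plugin documentation):

PUT stopword_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_stop_dictionary": true
        }
      },
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      }
    }
  }
}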

Some segmentation results are ambiguous

e.g. 2019年初级会计职称

Actual segmentation: 2019 年初 级 会计职称
Expected segmentation: 2019年 初级 会计职称

Segmenting with HanLP's built-in API works fine and gives the expected result.
Thanks.
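
A sketch of reproducing the report with the cluster-level _analyze API used elsewhere in these issues (the analyzer name hanlp is an assumption; the report does not say which analyzer was used):

GET _analyze
{
  "text": ["2019年初级会计职称"],
  "analyzer": "hanlp"
}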

Incorrect segmentation example in index mode

GET _analyze
{
   "text": ["启动力车规级汽车"],
   "analyzer": "hanlp_index"
}

Actual result

  "tokens" : [
    {
      "token" : "启",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "v",
      "position" : 0
    },
    {
      "token" : "动力车",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "nz",
      "position" : 1
    },
    {
      "token" : "动力",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "n",
      "position" : 2
    },
    {
      "token" : "规",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "ng",
      "position" : 3
    },
    {
      "token" : "级",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "q",
      "position" : 4
    },
    ... (remaining tokens omitted)
  ]
}

Expected result
The tokens "启动力" and "启动" should be produced.

Additional notes

  • I tested "启动力 车规级汽车"; with the space present, "启动力" and "启动" do appear (see the request sketch after this list).
  • The native HanLP API also segments it correctly.
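
A sketch of the comparison described above, using the same hanlp_index analyzer with the spaced text:

GET _analyze
{
  "text": ["启动力 车规级汽车"],
  "analyzer": "hanlp_index"
}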

Dynamically removing dictionary entries

Hi, ES version is 7.2.0. How can dictionary entries be removed dynamically?
In my tests the remote dictionary can add entries dynamically, but after deleting an entry ES does not remove it; a restart is required for the deletion to take effect.
Deleting hanlp.cache and the .bin files did not help either.

The source code only inserts entries. Is a method for removing entries provided?

Question about inconsistent hanlp_index segmentation

I ran into a Japanese person-name recognition issue: with the prefix present the text is segmented as 声优 藤田 淑 子, but without the prefix it becomes 藤田淑 子.

POST _analyze
{
  "analyzer": "hanlp_index",
  "text": "日本声优藤田淑子逝世"
}

{
  "tokens": [
    {
      "token": "日本",
      "start_offset": 0,
      "end_offset": 2,
      "type": "ns",
      "position": 0
    },
    {
      "token": "声优",
      "start_offset": 2,
      "end_offset": 4,
      "type": "nz",
      "position": 1
    },
    {
      "token": "藤田",
      "start_offset": 4,
      "end_offset": 6,
      "type": "nr",
      "position": 2
    },
    {
      "token": "",
      "start_offset": 6,
      "end_offset": 7,
      "type": "ng",
      "position": 3
    },
    {
      "token": "",
      "start_offset": 7,
      "end_offset": 8,
      "type": "ng",
      "position": 4
    },
    {
      "token": "逝世",
      "start_offset": 8,
      "end_offset": 10,
      "type": "vi",
      "position": 5
    }
  ]
}

Segmentation at query time:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": "藤田淑子"
}

{
  "tokens": [
    {
      "token": "藤田淑",
      "start_offset": 0,
      "end_offset": 3,
      "type": "nr",
      "position": 0
    },
    {
      "token": "",
      "start_offset": 0,
      "end_offset": 1,
      "type": "ng",
      "position": 1
    }
  ]
}

Is there any way to intervene in this kind of inconsistency? Looking at the HanLP documentation, would enabling Japanese person-name recognition solve it?
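
If the plugin exposes HanLP's Japanese person-name recognition as a tokenizer option, defining a custom tokenizer would be one way to test it. This is a sketch only: the option name enable_japanese_name_recognize below is an assumption by analogy with the other enable_* options that appear in these issues, not a confirmed setting.

PUT name_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp_index": {
          "type": "hanlp_index",
          "enable_japanese_name_recognize": true
        }
      },
      "analyzer": {
        "my_hanlp_index_analyzer": {
          "tokenizer": "my_hanlp_index"
        }
      }
    }
  }
}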

After the custom dictionary is updated, is there a way to refresh the existing index?

Steps to reproduce:

The custom dictionary file starts out empty.

Then create the index:

PUT test
{
"settings": {
    "analysis": {
    "analyzer": {
        "my_hanlp_analyzer": {
        "tokenizer": "my_hanlp"
        }
    },
    "tokenizer": {
        "my_hanlp": {
        "type": "hanlp",
        "enable_offset": false,
        "enable_custom_config": true
        }
    }
    }
},
"mappings": {
    "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_hanlp_analyzer"
    }
    }
}
}

Insert a document:

{
  "title": "下降机测试"
}

Result of testing with _analyze

Segmented as 下降, 机, 测试

Both of these queries find the document:

"match": {"title": "下降"}

"match": {"title": "下降机"}

After modifying the custom dictionary and restarting ES:

下降机 nz 1

_analyze now gives

Segmented as 下降机, 测试

Expected index behavior:

"match": {"title": "下降"} should no longer match, while "match": {"title": "下降机"} should match.

But the actual result is the opposite: "match": {"title": "下降"} still matches, and "match": {"title": "下降机"} does not.

Error

Searching for 下降机 also occasionally returns an error.

{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
                "bytes_wanted": 251846810,
                "bytes_limit": 246733209,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
        "bytes_wanted": 251846810,
        "bytes_limit": 246733209,
        "durability": "PERMANENT"
    },
    "status": 429
}

Is this behavior expected? And what is causing the error? It seems intermittent.
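
As a general Elasticsearch note, independent of this plugin: documents indexed before the dictionary change keep the tokens produced at index time, so the existing index normally has to be re-analyzed before the new segmentation affects search. One common way to do that is _update_by_query, which rewrites each document in place with the current analyzer; a sketch against the test index from this report:

POST test/_update_by_query?conflicts=proceed
{
  "query": { "match_all": {} }
}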

Suggestion: document that the config directory under plugins/analysis-hanlp must be deleted

I installed by unzipping directly, so plugins/analysis-hanlp contained a config directory. ES started normally, but using the hanlp analyzer threw:
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config

After deleting the config directory under plugins/analysis-hanlp, everything works again.

Just a heads-up for anyone who runs into this.

Version 6.3.2 reports index_not_found_exception

Installed with elasticsearch-plugin install; running the request below returns this error:
curl 'https://$ES_URL:9200/twitter2/_analyze?pretty=true' -H 'Content-Type: application/json' -d '{ "tokenizer":"hanlp", "text":"士大夫敢死队风格"}'
{
"error" : {
"root_cause" : [
{
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
}
],
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
},
"status" : 404
}

The hanlp.properties file is under ES_PATH/config/analysis-hanlp/.
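
Since the error says no such index for twitter2, the request fails before analysis runs. The same test can be issued at cluster level without naming an index, as other issues here do; a sketch with the same tokenizer and text:

GET _analyze
{
  "tokenizer": "hanlp",
  "text": "士大夫敢死队风格"
}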

Installation via the first method fails?

/root/elasticsearch-6.4.3/bin/elasticsearch-plugin install /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip

A tool for managing installed elasticsearch plugins

Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch

Non-option arguments:
command              

Option         Description        
------         -----------        
-h, --help     show help          
-s, --silent   show minimal output
-v, --verbose  show verbose output
ERROR: Unknown plugin /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip

Text containing '\n' causes highlight misalignment

The following was tested on 6.6.2.

PUT index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "hanlp",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}
PUT index/doc/1
{
  "body": ["张三\n\n新买的手机"]
}
GET index/_search
{
  "query": {
    "match": {
      "body": {
        "query": "手机"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took" : 44,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "body" : [
            """
张三

新买的手机
"""
          ]
        },
        "highlight" : {
          "body" : [
            """
张三

新买<em>的手</em>机
"""
          ]
        }
      }
    ]
  }
}

Expected:

...
"highlight" : {
          "body" : [
            """
张三

新买的<em>手机</em>
"""
          ]
}
...

V7.0.0: the remote_ext_dict remote dictionary URL configured in hanlp-remote.xml has no effect, while the same configuration works with ik

HanLP Analyzer extended configuration
<!-- Remote extension dictionary can be configured here -->
<entry key="remote_ext_dict">http://我的地址:1888/a.txt</entry>

<!-- Remote extension stop-word dictionary can be configured here -->
<!--<entry key="remote_ext_stopwords">stop_words_location</entry>-->

Contents of http://我的地址:1888/a.txt:

POST _analyze
{
"text": "中华人民共和国",
"tokenizer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "ns",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "nz",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "nz",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "n",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "nz",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "n",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "n",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "n",
"position" : 7
}
]
}

I expected the result to also include entries for the single characters "中" and "华".

The hanlp_index tokenizer has no effect, but the hanlp_index analyzer does

My original requirement was to define a custom analyzer based on hanlp_index that strips HTML tags, but when I used it, the multi-granularity segmentation was gone.

{
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "hanlp_index"
}
}
}
}

So I ran _analyze, and from the results it looks like the hanlp_index tokenizer is the problem.
The hanlp_index analyzer does multi-granularity segmentation:

GET _analyze
{
  "text": ["<p>除甲醛</p>", "汽车座椅"],
  "analyzer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 6,
"end_offset" : 9,
"type" : "n",
"position" : 1
},
{
"token" : "甲醛",
"start_offset" : 10,
"end_offset" : 12,
"type" : "n",
"position" : 2
},
{
"token" : "<",
"start_offset" : 14,
"end_offset" : 15,
"type" : "nx",
"position" : 3
},
{
"token" : "/",
"start_offset" : 16,
"end_offset" : 17,
"type" : "w",
"position" : 4
},
{
"token" : "p>",
"start_offset" : 18,
"end_offset" : 20,
"type" : "nx",
"position" : 5
},
{
"token" : "汽车座椅",
"start_offset" : 24,
"end_offset" : 28,
"type" : "nz",
"position" : 6
},
{
"token" : "汽车",
"start_offset" : 28,
"end_offset" : 30,
"type" : "n",
"position" : 7
},
{
"token" : "车座",
"start_offset" : 31,
"end_offset" : 33,
"type" : "n",
"position" : 8
},
{
"token" : "座椅",
"start_offset" : 34,
"end_offset" : 36,
"type" : "n",
"position" : 9
}
]
}

The hanlp_index tokenizer does not do multi-granularity segmentation:

GET /_analyze
{
  "text": ["<p>除甲醛</p>", "汽车座椅"],
  "tokenizer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 3,
"end_offset" : 6,
"type" : "n",
"position" : 1
},
{
"token" : "<",
"start_offset" : 6,
"end_offset" : 7,
"type" : "nx",
"position" : 2
},
{
"token" : "/",
"start_offset" : 7,
"end_offset" : 8,
"type" : "w",
"position" : 3
},
{
"token" : "p>",
"start_offset" : 8,
"end_offset" : 10,
"type" : "nx",
"position" : 4
},
{
"token" : "汽车座椅",
"start_offset" : 22,
"end_offset" : 26,
"type" : "nz",
"position" : 5
}
]
}

Remote dictionary update request

IK supports hot-reloading its dictionary:

it downloads the .dic file from a remote service and uses the ETag response header to decide whether the dictionary data needs to be replaced.

It would be great if the hanlp plugin supported this as well.

stopwords not taking effect

Whether I use the remote dictionary or a local one, stop words are not applied. Is this because the plugin does not filter stop words?

Error running on ES 5.2.2

[2019-02-19T08:09:05,566][INFO ][o.e.c.r.a.AllocationService] [w44JGC2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index][2]] ...]).
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.corpus.io.IOUtil readBytes
WARNING: 读取plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin时发生异常java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin (No such file or directory)
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.dictionary.CoreDictionary load
WARNING: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt不存在!java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt (No such file or directory)
[2019-02-19T08:09:11,715][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[w44JGC2][index][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.seg.common.Vertex.newB(Vertex.java:462) ~[?:?]
at com.hankcs.hanlp.seg.common.WordNet.(WordNet.java:73) ~[?:?]
at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:40) ~[?:?]
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:507) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:222) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:200) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:148) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:75) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:294) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:610) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: java.lang.IllegalArgumentException: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt加载失败
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:44) ~[?:?]
... 20 more

After fixing the Java permission problem by following https://www.jianshu.com/p/52c42cdab997 (see also https://github.com/KennFalcon/elasticsearch-analysis-hanlp/issues/2), I ran into this other error.

With enable_traditional_chinese_mode, do the dictionary files also need to be converted to Traditional Chinese?

Hello, I have already enabled Traditional Chinese mode:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tc_hanlp": {
          "tokenizer": "hanlp"
        }
      },
      "tokenizer": {
        "tc_hanlp": {
          "type": "hanlp",
          "enable_traditional_chinese_mode": true,
          "enable_custom_config": true
        }
      }
    }
  }
}

But the output still seems to segment only Simplified Chinese correctly. Do I need to convert the dictionary files to Traditional Chinese?

POST test/_analyze
{
  "text": "美國阿拉斯加州發生8.0級地震",
  "analyzer": "tc_hanlp"
}

{
  "tokens" : [
    { "token" : "美", "start_offset" : 0, "end_offset" : 1, "type" : "b", "position" : 0 },
    { "token" : "國", "start_offset" : 0, "end_offset" : 1, "type" : "w", "position" : 1 },
    { "token" : "阿拉斯加州", "start_offset" : 0, "end_offset" : 5, "type" : "nsf", "position" : 2 },
    { "token" : "發", "start_offset" : 0, "end_offset" : 1, "type" : "n", "position" : 3 },
    { "token" : "生", "start_offset" : 0, "end_offset" : 1, "type" : "v", "position" : 4 },
    { "token" : "8.0", "start_offset" : 0, "end_offset" : 3, "type" : "m", "position" : 5 },
    { "token" : "級", "start_offset" : 0, "end_offset" : 1, "type" : "n", "position" : 6 },
    { "token" : "地震", "start_offset" : 0, "end_offset" : 2, "type" : "n", "position" : 7 }
  ]
}

search_analyzer cannot be set to hanlp_nlp; setting it to hanlp works

ref : #38
#35

Tested with the latest code; it does not seem to be fixed.
With search_analyzer set to hanlp_nlp:

{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
}
}
}


Fails.

With search_analyzer set to hanlp:

{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
}
}
}


Works fine!

v6.6.1 runtime error I can't make sense of

[2019-04-04T00:11:45,886][INFO ][o.e.n.Node ] [rpmdWQa] started
[2019-04-04T00:12:10,126][INFO ][c.h.d.c.RemoteDictConfig ] [rpmdWQa] try load remote hanlp config from D:\webdev\es\elasticsearch-6.6.1\config\analysis-hanlp\hanlp-remote.xml
[2019-04-04T00:12:10,211][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [rpmdWQa] fatal error in thread [elasticsearch[rpmdWQa][analyze][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:35) ~[?:?]
at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.(DoubleArrayTrieSegment.java:45) ~[?:?]
at com.hankcs.lucene.HanLPSpeedAnalyzer.createComponents(HanLPSpeedAnalyzer.java:31) ~[?:?]
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:198) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:267) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:252) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:170) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:81) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:115) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.security.AccessControlException: access denied ("java.io.FilePermission" "src\main\java" "read")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_131]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkRead(SecurityManager.java:888) ~[?:1.8.0_131]
at java.io.File.isDirectory(File.java:844) ~[?:1.8.0_131]
at com.hankcs.hanlp.HanLP$Config.(HanLP.java:342) ~[?:?]
... 14 more

hanlp.cache issue

Thanks very much for the hanlp analysis plugin.
I have a problem: if the hanlp.cache file exists, restarting ES throws an error. What could be the cause?
If I delete hanlp.cache, everything runs normally, but it then takes time to reload the HanLP dictionaries.
Error:
[2018-07-24T17:46:22,115][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node1] fatal error in thread [elasticsearch[node1][generic][T#2]], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/core/internal/io/IOUtils

at com.hankcs.dic.cache.DictionaryFileCache.lambda$loadCache$0(DictionaryFileCache.java:60) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_73]
at com.hankcs.dic.cache.DictionaryFileCache.loadCache(DictionaryFileCache.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.(Dictionary.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.initial(Dictionary.java:52) ~[?:?]
at com.hankcs.cfg.Configuration.(Configuration.java:54) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.(HanLPTokenizerFactory.java:31) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.getHanLPNLPTokenizerFactory(HanLPTokenizerFactory.java:47) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:361) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:176) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:154) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexService.(IndexService.java:145) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:448) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.verifyIndexMetadata(IndicesService.java:481) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.Gateway.performStateRecovery(Gateway.java:135) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.GatewayService$1.doRun(GatewayService.java:229) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.core.internal.io.IOUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_73]
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_73]
... 22 more

How can normalization be conveniently enabled during segmentation?

Hi, KennFalcon:
Plugin version: elasticsearch-analysis-hanlp for Elasticsearch 6.3.2.
HanLP segments with HanLP.Config.Normalization = false, i.e. normalization is disabled by default, but I would like it enabled during segmentation.
The analyzer I use is com.hankcs.lucene.HanLPStandardAnalyzer; my current approach is to add one statement in HanLPStandardAnalyzer.createComponents():

    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        // enable normalization during segmentation
        AccessController.doPrivileged((PrivilegedAction) () -> HanLP.Config.Normalization = true);
        Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment(), configuration);
        return new Analyzer.TokenStreamComponents(tokenizer);
    }

Then I rebuild the jar, copy it into elasticsearch-6.3.2/plugins/analysis-hanlp to replace the original jar, and restart ES.
My question:
Would you consider making normalization-during-segmentation configurable? For example, when defining the index and specifying the analyzer for the content field, add a normalization option:

        "content": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "normalization":true

Or add a configuration option such as enableNormalization to com.hankcs.cfg.Configuration to support the corresponding normalization?
Or do you have another good way of doing this?
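
For what it's worth, a later issue in this digest configures a tokenizer with an enable_normalization option, which looks like exactly this kind of switch. A minimal sketch of that style of configuration (the option name is taken from that issue, and the hanlp_standard tokenizer type is assumed by analogy with the hanlp_standard analyzer; neither is verified here):

PUT norm_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp_standard": {
          "type": "hanlp_standard",
          "enable_normalization": true
        }
      },
      "analyzer": {
        "my_normalized_analyzer": {
          "tokenizer": "my_hanlp_standard"
        }
      }
    }
  }
}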

Multiple newlines cause highlight misalignment

ES version 6.1.2, elasticsearch-analysis-hanlp version 7.3.2

PUT INDEX
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_index_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "hanlp_index"
        },
        "my_hanlp_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp_index": {
          "enable_custom_config": "true",
          "type": "hanlp_index",
          "enable_stop_dictionary": "true"
        },
        "my_hanlp": {
          "enable_custom_config": "true",
          "type": "hanlp",
          "enable_stop_dictionary": "true"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "my_hanlp_index_analyzer",
          "search_analyzer": "my_hanlp_analyzer"
        }
      }
    }
  }
}
PUT index/doc/1
{
  "body": ["\n营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"]
}
GET index/_search
{
  "query": {
    "match": {
      "body": {
        "query": "科学"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took": 75,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.29379335,
    "hits": [
      {
        "_index": "index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.29379335,
        "_source": {
          "body": [
            "\n营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"
          ]
        },
        "highlight": {
          "body": [
            "营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科<em>学发</em>展观,将建设有机融入全社会持续发展,减"
          ]
        }
      }
    ]
  }
}

The HanLP analysis plugin has a search performance problem caused by spaces

1. Environment: ES version 6.3.2, plugin version 6.3.2.
2. Problem description (term vectors are enabled to make the issue easier to demonstrate):
The space character takes part in the query and slows it down. Profiling shows the query takes several hundred milliseconds, while queries on other fields take only 10-odd milliseconds. This "space query" should be unnecessary; with the ik_max_word analyzer it does not occur.

First define the index user with the analyzer hanlp_standard. user has a single field named nick.

PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes", 
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Add one document; note that there is a space between "人生" and "如梦".

PUT user/profile/1
{
  "nick":"人生 如梦"
}

Checking with the term vectors API shows that the space has also been indexed:

GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}

The response is below. There are 4 terms in total; the first one is the space character, and it hurts query efficiency.
{
  "_index": "user",
  "_type": "profile",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "nick": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        " ": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "人生": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "如": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "梦": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        }
      }
    }
  }
}

Analyzing the query with profile shows that the space also takes part in the query.

GET user/profile/_search?human=true
{
  "profile":true,
  "query": {
    "match": {
      "nick": "人生 如梦"
    }
  }
}

The above is just an example to show that the space participates in the query. In a real production environment with hundreds of millions of documents indexed, this causes a serious query performance problem, because the term query for the space is slow. Here is an actual profile (the space takes 480ms, while "微信" takes only 18ms):

 "profile": {
    "shards": [
      {
        "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:微信 nick:  nick:黄色",
                "time": "888.6ms",
                "time_in_nanos": 888636963,
                "breakdown": {
                  "score": 513864260,
                  "build_scorer_count": 50,
                  "match_count": 0,
                  "create_weight": 93345,
                  "next_doc": 364649642,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 5063173,
                  "score_count": 4670398,
                  "build_scorer": 296094,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:微信",
                    **"time": "18.4ms",**
                    "time_in_nanos": 18480019,
                    "breakdown": {
                      "score": 656810,
                      "build_scorer_count": 62,
                      "match_count": 0,
                      "create_weight": 23633,
                      "next_doc": 17712339,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 7085,
                      "score_count": 5705,
                      "build_scorer": 74384,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    **"description": "nick: ",**
                    **"time": "480.5ms",**
                    "time_in_nanos": 480508016,
                    "breakdown": {
                      "score": 278358058,
                      "build_scorer_count": 72,
                      "match_count": 0,
                      "create_weight": 6041,
                      "next_doc": 192388910,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 5056541,
                      "score_count": 4665006,
                      "build_scorer": 33387,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:黄色",
                    **"time": "3.8ms",**
                    "time_in_nanos": 3872679,
                    "breakdown": {
                      "score": 136812,
                      "build_scorer_count": 50,
                      "match_count": 0,
                      "create_weight": 5423,
                      "next_doc": 3700537,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 923,
                      "score_count": 755,
                      "build_scorer": 28178,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],

I then compared with the ik_max_word analyzer: its query profile never queries the space; the TermQuery children below contain no space term.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "myindex",
        "_type": "profile",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "nick": "人生 如梦"
        }
      }
    ]
  },
  "profile": {
    "shards": [
      {
        "id": "[7MyDkEDrRj2RPHCPoaWveQ][myindex][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:人生 nick:如梦",
                "time": "410.8micros",
                "time_in_nanos": 410831,
                "breakdown": {
                  "score": 26377,
                  "build_scorer_count": 2,
                  "match_count": 0,
                  "create_weight": 227597,
                  "next_doc": 12341,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 2,
                  "score_count": 1,
                  "build_scorer": 144510,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:人生",
                    "time": "197.6micros",
                    "time_in_nanos": 197665,
                    "breakdown": {
                      "score": 9670,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 146018,
                      "next_doc": 1302,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 40668,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:如梦",
                    "time": "62.8micros",
                    "time_in_nanos": 62830,
                    "breakdown": {
                      "score": 999,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 55092,
                      "next_doc": 864,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 5868,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 26763,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "41.9micros",
                "time_in_nanos": 41945,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time": "30.6micros",
                    "time_in_nanos": 30633
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}



ES 6.2.2 shuts down after enabling hanlp

Installed the hanlp plugin with method 2; installation went smoothly and ES started normally, but when testing hanlp in Kibana:
GET /_analyze?pretty
{
"analyzer" : "hanlp",
"text" : ["重庆华龙网海数科技有限公司"]
}
ES shut down. The log shows:
[2018-09-06T09:15:39,827][ERROR][c.h.d.Monitor ] can not find hanlp.properties
java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_181]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_181]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_181]
at java.lang.ClassLoader.checkClassLoaderPermission(ClassLoader.java:1528) ~[?:1.8.0_181]
at java.lang.Thread.getContextClassLoader(Thread.java:1443) ~[?:1.8.0_181]
at com.hankcs.dic.Monitor.reloadProperty(Monitor.java:61) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at com.hankcs.dic.Monitor.run(Monitor.java:34) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
[2018-09-06T09:16:23,505][INFO ][o.e.m.j.JvmGcMonitorService] [cyVG2fn] [gc][57] overhead, spent [318ms] collecting in the last [1s]
[2018-09-06T09:35:31,918][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[cyVG2fn][index][T#1]], exiting
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:423) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:266) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:243) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:164) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:80) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Does not run on ES 5.x

The exception is:

Caused by: java.security.AccessControlException: access denied ("java.util.PropertyPermission" "*" "read,write")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_111]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPropertiesAccess(SecurityManager.java:1262) ~[?:1.8.0_111]
at java.lang.System.getProperties(System.java:630) ~[?:1.8.0_111]
at com.hankcs.hanlp.HanLP$Config.<clinit>(HanLP.java:306) ~[?:?]

Offsets are incorrect after tokenization

Using the hanlp tokenizer:
GET _analyze

{
  "text": ["**地大物博"],
  "tokenizer": "hanlp"
}

Response:

{
  "tokens" : [
    {
      "token" : "**",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "ns",
      "position" : 0
    },
    {
      "token" : "地大物博",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "nz",
      "position" : 1
    }
  ]
}

The second term's start_offset still starts from 0.

Whereas the offsets produced by ES's built-in tokenizers always increase:

GET _analyze

{
  "text": ["**地大物博"],
  "tokenizer": "standard"
}

Response:

{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}

Search still finds the document, but highlighting only matches the first few characters.
The following was tested on 6.5.4.

PUT document

{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "hanlp",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}

PUT document/doc/1

{
  "body": ["**地大物博"]
}

GET document/_search

{
  "query": {
    "match": {
      "body": {
        "query": "地大物博"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "body" : [
            "**地大物博"
          ]
        },
        "highlight" : {
          "body" : [
            "<em>**地大</em>物博"
          ]
        }
      }
    ]
  }
}
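
One setting that may be related: an earlier issue in this digest defines a custom hanlp tokenizer with an enable_offset option (set to false there). Whether toggling it changes the offsets reported in this issue is untested here; a sketch of trying it, purely as an experiment:

PUT offset_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_offset": true
        }
      },
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      }
    }
  }
}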

Crash after downloading the model from the official site and adding it to the plugin

Hello, I ran into a problem while using your plugin and am not sure whether I am using it incorrectly.
After downloading the model data from the HanLP site and placing it under /plugins/analysis-hanlp/data/model, I restarted ES.
Testing segmentation with _analyze works fine.
"hanLPAnalyzer" : {
"type" : "custom",
"char_filter" : [
"charconvert"
],
"tokenizer" : "hanlp_nlp_word"
},
"hanlp_nlp_word" : {
"enable_normalization" : "true",
"enable_remote_dict" : "true",
"type" : "hanlp_nlp"
}
Then I query the data with _search as below. There are roughly 2,000,000 documents; the query takes more than 2 s, and after issuing it more than 5 times in a row a 401 error appears and the server log reports an out-of-memory error.
POST /poi/_search
{
"from": 0,
"size": 10,
"query": {
"dis_max": {
"tie_breaker": 0.3,
"queries": [{
"match": {
"full_q": {
"query": "青年路",
"operator": "OR",
"analyzer": "hanLPAnalyzer"
}
}
}],
"boost": 1.0
}
},
"post_filter": {
"bool": {
"adjust_pure_negative": true,
"boost": 1.0
}
}
}

How should synonyms be configured?

Hello,
I am using the elasticsearch-analysis-hanlp-6.5.1 plugin. I used the ik plugin before, and based on ik's synonym configuration I set up the settings below, but the synonyms do not take effect. How should synonyms be configured so they work with the hanlp plugin? Thanks a lot!
"settings" : {
"index" : {
"analysis" : {
"filter" : {
"hanlp_synonym_ik_standard" : {
"type" : "synonym",
"synonyms_path" : "../plugins/analysis-hanlp/data/dictionary/synonym/CoreSynonym.txt"
}
},
"analyzer" : {
"hanlp_default_search" : {
"tokenizer" : "hanlp_standard"
},
"hanlp_default_index" : {
"filter" : [
"hanlp_synonym_ik_standard"
],
"tokenizer" : "hanlp_index"
}
}
}
}
},
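
One way to check whether the synonym filter is actually applied is to run _analyze against the index with the analyzer that includes the filter and look for the expanded terms. A sketch (the index name myindex and the sample text are illustrative; use a word that appears in CoreSynonym.txt):

GET myindex/_analyze
{
  "analyzer": "hanlp_default_index",
  "text": "电脑"
}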
