
elasticsearch-analysis-hanlp's People

Contributors

cjdxhjj, codacy-badger, dependabot[bot], hariag, kennfalcon, kevin-sb


elasticsearch-analysis-hanlp's Issues

Is CustomDictionary thread-safe?

CustomDictionary.remove(String)
CustomDictionary.insert(String, String)
Are these two methods thread-safe?

Our scenario: when a line is deleted from the remote dictionary, the already-loaded dictionary should drop that entry as well.
The current remote-dictionary loading logic only keeps adding entries from the remote dictionary; it never removes entries that have been deleted remotely.

I modified the method in RemoteMonitor that loads the remote dictionary so that it diffs against the previously loaded remote dictionary and works out which entries need to be removed.

We have a remote dictionary with about 1,000,000 lines, and the business keeps refining it, so these two methods will be called frequently. If I use a collection's parallelStream() to run CustomDictionary.insert in parallel, will that cause problems?

How are stop words configured?

Looking at the code, the enable_stop_dictionary property is what gets configured. Since I'm not very familiar with how parameters are passed in Elasticsearch plugin development, my question is: is this parameter set when the Elasticsearch index is created?

Found the explanation in issue #7.
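
For reference, a later issue in this digest sets the option on a custom tokenizer in the index settings. A minimal sketch of that pattern (the index name stopword_test is illustrative; the enable_custom_config and enable_stop_dictionary option names are taken from that later issue, not from the plugin documentation):

PUT stopword_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_custom_config": true,
          "enable_stop_dictionary": true
        }
      },
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      }
    }
  }
}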

Some segmentation results are ambiguous

e.g. 2019年初级会计职称

Actual segmentation: 2019 年初 级 会计职称
Expected segmentation: 2019年 初级 会计职称

Segmenting with HanLP's built-in API works fine and gives the expected result.
Thanks.
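
A sketch of reproducing the report with the cluster-level _analyze API used elsewhere in these issues (the analyzer name hanlp is an assumption; the report does not say which analyzer was used):

GET _analyze
{
  "text": ["2019年初级会计职称"],
  "analyzer": "hanlp"
}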

Incorrect segmentation example in index mode

GET _analyze
{
   "text": ["启动力车规级汽车"],
   "analyzer": "hanlp_index"
}

Actual result

  "tokens" : [
    {
      "token" : "启",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "v",
      "position" : 0
    },
    {
      "token" : "动力车",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "nz",
      "position" : 1
    },
    {
      "token" : "动力",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "n",
      "position" : 2
    },
    {
      "token" : "规",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "ng",
      "position" : 3
    },
    {
      "token" : "级",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "q",
      "position" : 4
    },
    ... (remaining tokens omitted)
  ]
}

Expected result
The tokens "启动力" and "启动" should be produced.

Additional notes

  • I tested "启动力 车规级汽车"; with the space present, "启动力" and "启动" do appear (see the request sketch after this list).
  • The native HanLP API also segments it correctly.
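
A sketch of the comparison described above, using the same hanlp_index analyzer with the spaced text:

GET _analyze
{
  "text": ["启动力 车规级汽车"],
  "analyzer": "hanlp_index"
}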

Dynamically removing dictionary entries

Hi, ES version is 7.2.0. How can dictionary entries be removed dynamically?
In my tests the remote dictionary can add entries dynamically, but after deleting an entry ES does not remove it; a restart is required for the deletion to take effect.
Deleting hanlp.cache and the .bin files did not help either.

The source code only inserts entries. Is a method for removing entries provided?

Question about inconsistent hanlp_index segmentation

I ran into a Japanese person-name recognition issue: with the prefix present the text is segmented as 声优 藤田 淑 子, but without the prefix it becomes 藤田淑 子.

POST _analyze
{
  "analyzer": "hanlp_index",
  "text": "日本声优藤田淑子逝世"
}

{
  "tokens": [
    {
      "token": "日本",
      "start_offset": 0,
      "end_offset": 2,
      "type": "ns",
      "position": 0
    },
    {
      "token": "声优",
      "start_offset": 2,
      "end_offset": 4,
      "type": "nz",
      "position": 1
    },
    {
      "token": "藤田",
      "start_offset": 4,
      "end_offset": 6,
      "type": "nr",
      "position": 2
    },
    {
      "token": "",
      "start_offset": 6,
      "end_offset": 7,
      "type": "ng",
      "position": 3
    },
    {
      "token": "",
      "start_offset": 7,
      "end_offset": 8,
      "type": "ng",
      "position": 4
    },
    {
      "token": "逝世",
      "start_offset": 8,
      "end_offset": 10,
      "type": "vi",
      "position": 5
    }
  ]
}

Segmentation at query time:

POST _analyze
{
  "analyzer": "hanlp_standard",
  "text": "藤田淑子"
}

{
  "tokens": [
    {
      "token": "藤田淑",
      "start_offset": 0,
      "end_offset": 3,
      "type": "nr",
      "position": 0
    },
    {
      "token": "",
      "start_offset": 0,
      "end_offset": 1,
      "type": "ng",
      "position": 1
    }
  ]
}

Is there any way to intervene in this kind of inconsistency? Looking at the HanLP documentation, would enabling Japanese person-name recognition solve it?
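
If the plugin exposes HanLP's Japanese person-name recognition as a tokenizer option, defining a custom tokenizer would be one way to test it. This is a sketch only: the option name enable_japanese_name_recognize below is an assumption by analogy with the other enable_* options that appear in these issues, not a confirmed setting.

PUT name_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp_index": {
          "type": "hanlp_index",
          "enable_japanese_name_recognize": true
        }
      },
      "analyzer": {
        "my_hanlp_index_analyzer": {
          "tokenizer": "my_hanlp_index"
        }
      }
    }
  }
}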

After the custom dictionary is updated, is there a way to refresh the existing index?

Steps to reproduce:

The custom dictionary file starts out empty.

Then create the index:

PUT test
{
"settings": {
    "analysis": {
    "analyzer": {
        "my_hanlp_analyzer": {
        "tokenizer": "my_hanlp"
        }
    },
    "tokenizer": {
        "my_hanlp": {
        "type": "hanlp",
        "enable_offset": false,
        "enable_custom_config": true
        }
    }
    }
},
"mappings": {
    "properties": {
    "title": {
        "type": "text",
        "analyzer": "my_hanlp_analyzer"
    }
    }
}
}

Insert a document:

{
  "title": "下降机测试"
}

Result of testing with _analyze

Segmented as 下降, 机, 测试

Both of these queries find the document:

"match": {"title": "下降"}

"match": {"title": "下降机"}

After modifying the custom dictionary and restarting ES:

下降机 nz 1

_analyze now gives

Segmented as 下降机, 测试

Expected index behavior:

"match": {"title": "下降"} should no longer match, while "match": {"title": "下降机"} should match.

But the actual result is the opposite: "match": {"title": "下降"} still matches, and "match": {"title": "下降机"} does not.

Error

Searching for 下降机 also occasionally returns an error.

{
    "error": {
        "root_cause": [
            {
                "type": "circuit_breaking_exception",
                "reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
                "bytes_wanted": 251846810,
                "bytes_limit": 246733209,
                "durability": "PERMANENT"
            }
        ],
        "type": "circuit_breaking_exception",
        "reason": "[parent] Data too large, data for [<http_request>] would be [251846810/240.1mb], which is larger than the limit of [246733209/235.3mb], real usage: [251846672/240.1mb], new bytes reserved: [138/138b], usages [request=0/0b, fielddata=1458/1.4kb, in_flight_requests=138/138b, accounting=30684/29.9kb]",
        "bytes_wanted": 251846810,
        "bytes_limit": 246733209,
        "durability": "PERMANENT"
    },
    "status": 429
}

Is this behavior expected? And what is causing the error? It seems intermittent.
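
As a general Elasticsearch note, independent of this plugin: documents indexed before the dictionary change keep the tokens produced at index time, so the existing index normally has to be re-analyzed before the new segmentation affects search. One common way to do that is _update_by_query, which rewrites each document in place with the current analyzer; a sketch against the test index from this report:

POST test/_update_by_query?conflicts=proceed
{
  "query": { "match_all": {} }
}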

Suggestion: document that the config directory under plugins/analysis-hanlp must be deleted

I installed by unzipping directly, so plugins/analysis-hanlp contained a config directory. ES started normally, but using the hanlp analyzer threw:
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config

After deleting the config directory under plugins/analysis-hanlp, everything works again.

Just a heads-up for anyone who runs into this.

Version 6.3.2 reports index_not_found_exception

Installed with elasticsearch-plugin install; running the request below returns this error:
curl 'https://$ES_URL:9200/twitter2/_analyze?pretty=true' -H 'Content-Type: application/json' -d '{ "tokenizer":"hanlp", "text":"士大夫敢死队风格"}'
{
"error" : {
"root_cause" : [
{
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
}
],
"type" : "index_not_found_exception",
"reason" : "no such index",
"resource.type" : "index_expression",
"resource.id" : "twitter2",
"index_uuid" : "na",
"index" : "twitter2"
},
"status" : 404
}

The hanlp.properties file is under ES_PATH/config/analysis-hanlp/.
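
Since the error says no such index for twitter2, the request fails before analysis runs. The same test can be issued at cluster level without naming an index, as other issues here do; a sketch with the same tokenizer and text:

GET _analyze
{
  "tokenizer": "hanlp",
  "text": "士大夫敢死队风格"
}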

Installation via the first method fails?

/root/elasticsearch-6.4.3/bin/elasticsearch-plugin install /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip

A tool for managing installed elasticsearch plugins

Commands
--------
list - Lists installed elasticsearch plugins
install - Install a plugin
remove - removes a plugin from Elasticsearch

Non-option arguments:
command              

Option         Description        
------         -----------        
-h, --help     show help          
-s, --silent   show minimal output
-v, --verbose  show verbose output
ERROR: Unknown plugin /root/elasticsearch-6.4.3/elasticsearch-analysis-hanlp-6.4.3.zip

Text containing '\n' causes highlight misalignment

The following was tested on 6.6.2.

PUT index
{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "hanlp",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}
PUT index/doc/1
{
  "body": ["张三\n\n新买的手机"]
}
GET index/_search
{
  "query": {
    "match": {
      "body": {
        "query": "手机"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took" : 44,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "index",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "body" : [
            """
张三

新买的手机
"""
          ]
        },
        "highlight" : {
          "body" : [
            """
张三

新买<em>的手</em>机
"""
          ]
        }
      }
    ]
  }
}

Expected:

...
"highlight" : {
          "body" : [
            """
张三

新买的<em>手机</em>
"""
          ]
}
...

V7.0.0: the remote_ext_dict remote dictionary URL configured in hanlp-remote.xml has no effect, while the same configuration works with ik

HanLP Analyzer extended configuration
<!-- Remote extension dictionary can be configured here -->
<entry key="remote_ext_dict">http://我的地址:1888/a.txt</entry>

<!-- Remote extension stop-word dictionary can be configured here -->
<!--<entry key="remote_ext_stopwords">stop_words_location</entry>-->

Contents of http://我的地址:1888/a.txt:

POST _analyze
{
"text": "中华人民共和国",
"tokenizer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "中华人民共和国",
"start_offset" : 0,
"end_offset" : 7,
"type" : "ns",
"position" : 0
},
{
"token" : "中华人民",
"start_offset" : 0,
"end_offset" : 4,
"type" : "nz",
"position" : 1
},
{
"token" : "中华",
"start_offset" : 0,
"end_offset" : 2,
"type" : "nz",
"position" : 2
},
{
"token" : "华人",
"start_offset" : 1,
"end_offset" : 3,
"type" : "n",
"position" : 3
},
{
"token" : "人民共和国",
"start_offset" : 2,
"end_offset" : 7,
"type" : "nz",
"position" : 4
},
{
"token" : "人民",
"start_offset" : 2,
"end_offset" : 4,
"type" : "n",
"position" : 5
},
{
"token" : "共和国",
"start_offset" : 4,
"end_offset" : 7,
"type" : "n",
"position" : 6
},
{
"token" : "共和",
"start_offset" : 4,
"end_offset" : 6,
"type" : "n",
"position" : 7
}
]
}

I expected the result to also include entries for the single characters "中" and "华".

The hanlp_index tokenizer has no effect, but the hanlp_index analyzer does

My original requirement was to define a custom analyzer based on hanlp_index that strips HTML tags, but when I used it, the multi-granularity segmentation was gone.

{
"analysis": {
"analyzer": {
"default": {
"filter": [
"lowercase",
"asciifolding"
],
"char_filter": [
"html_strip"
],
"type": "custom",
"tokenizer": "hanlp_index"
}
}
}
}

So I ran _analyze, and from the results it looks like the hanlp_index tokenizer is the problem.
The hanlp_index analyzer does multi-granularity segmentation:

GET _analyze
{
  "text": ["<p>除甲醛</p>", "汽车座椅"],
  "analyzer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 6,
"end_offset" : 9,
"type" : "n",
"position" : 1
},
{
"token" : "甲醛",
"start_offset" : 10,
"end_offset" : 12,
"type" : "n",
"position" : 2
},
{
"token" : "<",
"start_offset" : 14,
"end_offset" : 15,
"type" : "nx",
"position" : 3
},
{
"token" : "/",
"start_offset" : 16,
"end_offset" : 17,
"type" : "w",
"position" : 4
},
{
"token" : "p>",
"start_offset" : 18,
"end_offset" : 20,
"type" : "nx",
"position" : 5
},
{
"token" : "汽车座椅",
"start_offset" : 24,
"end_offset" : 28,
"type" : "nz",
"position" : 6
},
{
"token" : "汽车",
"start_offset" : 28,
"end_offset" : 30,
"type" : "n",
"position" : 7
},
{
"token" : "车座",
"start_offset" : 31,
"end_offset" : 33,
"type" : "n",
"position" : 8
},
{
"token" : "座椅",
"start_offset" : 34,
"end_offset" : 36,
"type" : "n",
"position" : 9
}
]
}

The hanlp_index tokenizer does not do multi-granularity segmentation:

GET /_analyze
{
  "text": ["<p>除甲醛</p>", "汽车座椅"],
  "tokenizer": "hanlp_index"
}

Result:
{
"tokens" : [
{
"token" : "<p>",
"start_offset" : 0,
"end_offset" : 3,
"type" : "nx",
"position" : 0
},
{
"token" : "除甲醛",
"start_offset" : 3,
"end_offset" : 6,
"type" : "n",
"position" : 1
},
{
"token" : "<",
"start_offset" : 6,
"end_offset" : 7,
"type" : "nx",
"position" : 2
},
{
"token" : "/",
"start_offset" : 7,
"end_offset" : 8,
"type" : "w",
"position" : 3
},
{
"token" : "p>",
"start_offset" : 8,
"end_offset" : 10,
"type" : "nx",
"position" : 4
},
{
"token" : "汽车座椅",
"start_offset" : 22,
"end_offset" : 26,
"type" : "nz",
"position" : 5
}
]
}

Remote dictionary update request

IK supports hot-reloading its dictionary:

it downloads the .dic file from a remote service and uses the ETag response header to decide whether the dictionary data needs to be replaced.

It would be great if the hanlp plugin supported this as well.

stopwords not taking effect

Whether I use the remote dictionary or a local one, stop words are not applied. Is this because the plugin does not filter stop words?

Error running on ES 5.2.2

[2019-02-19T08:09:05,566][INFO ][o.e.c.r.a.AllocationService] [w44JGC2] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[index][2]] ...]).
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.corpus.io.IOUtil readBytes
WARNING: 读取plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin时发生异常java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt.bin (No such file or directory)
Feb 19, 2019 8:09:11 AM com.hankcs.hanlp.dictionary.CoreDictionary load
WARNING: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt不存在!java.io.FileNotFoundException: plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt (No such file or directory)
[2019-02-19T08:09:11,715][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[w44JGC2][index][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.seg.common.Vertex.newB(Vertex.java:462) ~[?:?]
at com.hankcs.hanlp.seg.common.WordNet.(WordNet.java:73) ~[?:?]
at com.hankcs.hanlp.seg.Viterbi.ViterbiSegment.segSentence(ViterbiSegment.java:40) ~[?:?]
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:507) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:222) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:200) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:148) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:75) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:294) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:287) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:610) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:596) ~[elasticsearch-5.2.2.jar:5.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.2.2.jar:5.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_201]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_201]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_201]
Caused by: java.lang.IllegalArgumentException: 核心词典plugins/analysis-hanlp/data/dictionary/CoreNatureDictionary.txt加载失败
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:44) ~[?:?]
... 20 more

After fixing the Java permission problem by following https://www.jianshu.com/p/52c42cdab997 (see also https://github.com/KennFalcon/elasticsearch-analysis-hanlp/issues/2), I ran into this other error.

With enable_traditional_chinese_mode, do the dictionary files also need to be converted to Traditional Chinese?

Hello, I have already enabled Traditional Chinese mode:

PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "tc_hanlp": {
          "tokenizer": "hanlp"
        }
      },
      "tokenizer": {
        "tc_hanlp": {
          "type": "hanlp",
          "enable_traditional_chinese_mode": true,
          "enable_custom_config": true
        }
      }
    }
  }
}

But the output still seems to segment only Simplified Chinese correctly. Do I need to convert the dictionary files to Traditional Chinese?

POST test/_analyze
{
  "text": "美國阿拉斯加州發生8.0級地震",
  "analyzer": "tc_hanlp"
}

{
  "tokens" : [
    { "token" : "美", "start_offset" : 0, "end_offset" : 1, "type" : "b", "position" : 0 },
    { "token" : "國", "start_offset" : 0, "end_offset" : 1, "type" : "w", "position" : 1 },
    { "token" : "阿拉斯加州", "start_offset" : 0, "end_offset" : 5, "type" : "nsf", "position" : 2 },
    { "token" : "發", "start_offset" : 0, "end_offset" : 1, "type" : "n", "position" : 3 },
    { "token" : "生", "start_offset" : 0, "end_offset" : 1, "type" : "v", "position" : 4 },
    { "token" : "8.0", "start_offset" : 0, "end_offset" : 3, "type" : "m", "position" : 5 },
    { "token" : "級", "start_offset" : 0, "end_offset" : 1, "type" : "n", "position" : 6 },
    { "token" : "地震", "start_offset" : 0, "end_offset" : 2, "type" : "n", "position" : 7 }
  ]
}

search_analyzer cannot be set to hanlp_nlp; setting it to hanlp works

ref : #38
#35

Tested with the latest code; it does not seem to be fixed.
With search_analyzer set to hanlp_nlp:

{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp_nlp"
}
}
}


Fails.

With search_analyzer set to hanlp:

{
"properties": {
"content": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
},
"remark": {
"type": "text",
"analyzer": "hanlp_nlp",
"search_analyzer": "hanlp"
}
}
}


Works fine!

v6.6.1 runtime error I can't make sense of

[2019-04-04T00:11:45,886][INFO ][o.e.n.Node ] [rpmdWQa] started
[2019-04-04T00:12:10,126][INFO ][c.h.d.c.RemoteDictConfig ] [rpmdWQa] try load remote hanlp config from D:\webdev\es\elasticsearch-6.6.1\config\analysis-hanlp\hanlp-remote.xml
[2019-04-04T00:12:10,211][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [rpmdWQa] fatal error in thread [elasticsearch[rpmdWQa][analyze][T#1]], exiting
java.lang.ExceptionInInitializerError: null
at com.hankcs.hanlp.dictionary.CoreDictionary.(CoreDictionary.java:35) ~[?:?]
at com.hankcs.hanlp.seg.Other.DoubleArrayTrieSegment.(DoubleArrayTrieSegment.java:45) ~[?:?]
at com.hankcs.lucene.HanLPSpeedAnalyzer.createComponents(HanLPSpeedAnalyzer.java:31) ~[?:?]
at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:198) ~[lucene-core-7.6.0.jar:7.6.0 719cde97f84640faa1e3525690d262946571245f - nknize - 2018-12-07 14:44:20]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:267) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:252) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:170) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:81) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$1.doRun(TransportSingleShardAction.java:115) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) ~[elasticsearch-6.6.1.jar:6.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.6.1.jar:6.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.security.AccessControlException: access denied ("java.io.FilePermission" "src\main\java" "read")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_131]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_131]
at java.lang.SecurityManager.checkRead(SecurityManager.java:888) ~[?:1.8.0_131]
at java.io.File.isDirectory(File.java:844) ~[?:1.8.0_131]
at com.hankcs.hanlp.HanLP$Config.(HanLP.java:342) ~[?:?]
... 14 more

hanlp.cache issue

Thanks very much for the hanlp analysis plugin.
I have a problem: if the hanlp.cache file exists, restarting ES throws an error. What could be the cause?
If I delete hanlp.cache, everything runs normally, but it then takes time to reload the HanLP dictionaries.
Error:
[2018-07-24T17:46:22,115][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [node1] fatal error in thread [elasticsearch[node1][generic][T#2]], exiting
java.lang.NoClassDefFoundError: org/elasticsearch/core/internal/io/IOUtils

at com.hankcs.dic.cache.DictionaryFileCache.lambda$loadCache$0(DictionaryFileCache.java:60) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_73]
at com.hankcs.dic.cache.DictionaryFileCache.loadCache(DictionaryFileCache.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.(Dictionary.java:45) ~[?:?]
at com.hankcs.dic.Dictionary.initial(Dictionary.java:52) ~[?:?]
at com.hankcs.cfg.Configuration.(Configuration.java:54) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.(HanLPTokenizerFactory.java:31) ~[?:?]
at org.elasticsearch.index.analysis.HanLPTokenizerFactory.getHanLPNLPTokenizerFactory(HanLPTokenizerFactory.java:47) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:361) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:176) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:154) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexService.(IndexService.java:145) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:363) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:448) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.indices.IndicesService.verifyIndexMetadata(IndicesService.java:481) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.Gateway.performStateRecovery(Gateway.java:135) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.gateway.GatewayService$1.doRun(GatewayService.java:229) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) ~[elasticsearch-5.6.1.jar:5.6.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-5.6.1.jar:5.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[?:1.8.0_73]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[?:1.8.0_73]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
Caused by: java.lang.ClassNotFoundException: org.elasticsearch.core.internal.io.IOUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:381) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:424) ~[?:1.8.0_73]
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:814) ~[?:1.8.0_73]
at java.lang.ClassLoader.loadClass(ClassLoader.java:357) ~[?:1.8.0_73]
... 22 more

How can normalization be conveniently enabled during segmentation?

Hi, KennFalcon:
Plugin version: elasticsearch-analysis-hanlp for Elasticsearch 6.3.2.
HanLP segments with HanLP.Config.Normalization = false, i.e. normalization is disabled by default, but I would like it enabled during segmentation.
The analyzer I use is com.hankcs.lucene.HanLPStandardAnalyzer; my current approach is to add one statement in HanLPStandardAnalyzer.createComponents():

    @Override
    protected Analyzer.TokenStreamComponents createComponents(String fieldName) {
        // enable normalization during segmentation
        AccessController.doPrivileged((PrivilegedAction) () -> HanLP.Config.Normalization = true);
        Tokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment(), configuration);
        return new Analyzer.TokenStreamComponents(tokenizer);
    }

Then I rebuild the jar, copy it into elasticsearch-6.3.2/plugins/analysis-hanlp to replace the original jar, and restart ES.
My question:
Would you consider making normalization-during-segmentation configurable? For example, when defining the index and specifying the analyzer for the content field, add a normalization option:

        "content": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "normalization":true

Or add a configuration option such as enableNormalization to com.hankcs.cfg.Configuration to support the corresponding normalization?
Or do you have another good way of doing this?
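
For what it's worth, a later issue in this digest configures a tokenizer with an enable_normalization option, which looks like exactly this kind of switch. A minimal sketch of that style of configuration (the option name is taken from that issue, and the hanlp_standard tokenizer type is assumed by analogy with the hanlp_standard analyzer; neither is verified here):

PUT norm_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp_standard": {
          "type": "hanlp_standard",
          "enable_normalization": true
        }
      },
      "analyzer": {
        "my_normalized_analyzer": {
          "tokenizer": "my_hanlp_standard"
        }
      }
    }
  }
}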

Multiple newlines cause highlight misalignment

ES version 6.1.2, elasticsearch-analysis-hanlp version 7.3.2

PUT INDEX
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_hanlp_index_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "hanlp_index"
        },
        "my_hanlp_analyzer": {
          "filter": [
            "lowercase"
          ],
          "tokenizer": "hanlp"
        }
      },
      "tokenizer": {
        "my_hanlp_index": {
          "enable_custom_config": "true",
          "type": "hanlp_index",
          "enable_stop_dictionary": "true"
        },
        "my_hanlp": {
          "enable_custom_config": "true",
          "type": "hanlp",
          "enable_stop_dictionary": "true"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "my_hanlp_index_analyzer",
          "search_analyzer": "my_hanlp_analyzer"
        }
      }
    }
  }
}
PUT index/doc/1
{
  "body": ["\n营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"]
}
GET index/_search
{
  "query": {
    "match": {
      "body": {
        "query": "科学"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took": 75,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.29379335,
    "hits": [
      {
        "_index": "index",
        "_type": "doc",
        "_id": "1",
        "_score": 0.29379335,
        "_source": {
          "body": [
            "\n营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科学发展观,将建设有机融入全社会持续发展,减\n\n小建设对社会资源的占用及对正常社会生产生活带来的影响,公司积极抓好\n\n两手准备"
          ]
        },
        "highlight": {
          "body": [
            "营造建设和谐环境 \n \n\n专业名称:建设                        发布日期:2010 年 1 月 \n\n      \n\n摘要:为贯彻落实科<em>学发</em>展观,将建设有机融入全社会持续发展,减"
          ]
        }
      }
    ]
  }
}

The HanLP analysis plugin has a search performance problem caused by spaces

1. Environment: ES version 6.3.2, plugin version 6.3.2.
2. Problem description (term vectors are enabled to make the issue easier to demonstrate):
The space character takes part in the query and slows it down. Profiling shows the query takes several hundred milliseconds, while queries on other fields take only 10-odd milliseconds. This "space query" should be unnecessary; with the ik_max_word analyzer it does not occur.

First define the index user with the analyzer hanlp_standard. user has a single field named nick.

PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes", 
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Add one document; note that there is a space between "人生" and "如梦".

PUT user/profile/1
{
  "nick":"人生 如梦"
}

Checking with the term vectors API shows that the space has also been indexed:

GET /user/profile/1/_termvectors
{
"fields" : ["nick"],
"offsets" : true,
"payloads" : true,
"positions" : true,
"term_statistics" : true,
"field_statistics" : true
}

The response is below. There are 4 terms in total; the first one is the space character, and it hurts query efficiency.
{
  "_index": "user",
  "_type": "profile",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 0,
  "term_vectors": {
    "nick": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        " ": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "人生": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "如": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "梦": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        }
      }
    }
  }
}

Analyzing the query with profile shows that the space also takes part in the query.

GET user/profile/_search?human=true
{
  "profile":true,
  "query": {
    "match": {
      "nick": "人生 如梦"
    }
  }
}

The above is just an example to show that the space participates in the query. In a real production environment with hundreds of millions of documents indexed, this causes a serious query performance problem, because the term query for the space is slow. Here is an actual profile (the space takes 480ms, while "微信" takes only 18ms):

 "profile": {
    "shards": [
      {
        "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:微信 nick:  nick:黄色",
                "time": "888.6ms",
                "time_in_nanos": 888636963,
                "breakdown": {
                  "score": 513864260,
                  "build_scorer_count": 50,
                  "match_count": 0,
                  "create_weight": 93345,
                  "next_doc": 364649642,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 5063173,
                  "score_count": 4670398,
                  "build_scorer": 296094,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:微信",
                    **"time": "18.4ms",**
                    "time_in_nanos": 18480019,
                    "breakdown": {
                      "score": 656810,
                      "build_scorer_count": 62,
                      "match_count": 0,
                      "create_weight": 23633,
                      "next_doc": 17712339,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 7085,
                      "score_count": 5705,
                      "build_scorer": 74384,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    **"description": "nick: ",**
                    **"time": "480.5ms",**
                    "time_in_nanos": 480508016,
                    "breakdown": {
                      "score": 278358058,
                      "build_scorer_count": 72,
                      "match_count": 0,
                      "create_weight": 6041,
                      "next_doc": 192388910,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 5056541,
                      "score_count": 4665006,
                      "build_scorer": 33387,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:黄色",
                    **"time": "3.8ms",**
                    "time_in_nanos": 3872679,
                    "breakdown": {
                      "score": 136812,
                      "build_scorer_count": 50,
                      "match_count": 0,
                      "create_weight": 5423,
                      "next_doc": 3700537,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 923,
                      "score_count": 755,
                      "build_scorer": 28178,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],

I then compared with the ik_max_word analyzer: its query profile never queries the space; the TermQuery children below contain no space term.

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5753642,
    "hits": [
      {
        "_index": "myindex",
        "_type": "profile",
        "_id": "1",
        "_score": 0.5753642,
        "_source": {
          "nick": "人生 如梦"
        }
      }
    ]
  },
  "profile": {
    "shards": [
      {
        "id": "[7MyDkEDrRj2RPHCPoaWveQ][myindex][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:人生 nick:如梦",
                "time": "410.8micros",
                "time_in_nanos": 410831,
                "breakdown": {
                  "score": 26377,
                  "build_scorer_count": 2,
                  "match_count": 0,
                  "create_weight": 227597,
                  "next_doc": 12341,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 2,
                  "score_count": 1,
                  "build_scorer": 144510,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:人生",
                    "time": "197.6micros",
                    "time_in_nanos": 197665,
                    "breakdown": {
                      "score": 9670,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 146018,
                      "next_doc": 1302,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 40668,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:如梦",
                    "time": "62.8micros",
                    "time_in_nanos": 62830,
                    "breakdown": {
                      "score": 999,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 55092,
                      "next_doc": 864,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 5868,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 26763,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "41.9micros",
                "time_in_nanos": 41945,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time": "30.6micros",
                    "time_in_nanos": 30633
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }
}



ES 6.2.2 shuts down after enabling hanlp

Installed the hanlp plugin with method 2; installation went smoothly and ES started normally, but when testing hanlp in Kibana:
GET /_analyze?pretty
{
"analyzer" : "hanlp",
"text" : ["重庆华龙网海数科技有限公司"]
}
ES shut down. The log shows:
[2018-09-06T09:15:39,827][ERROR][c.h.d.Monitor ] can not find hanlp.properties
java.security.AccessControlException: access denied ("java.lang.RuntimePermission" "getClassLoader")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_181]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_181]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_181]
at java.lang.ClassLoader.checkClassLoaderPermission(ClassLoader.java:1528) ~[?:1.8.0_181]
at java.lang.Thread.getContextClassLoader(Thread.java:1443) ~[?:1.8.0_181]
at com.hankcs.dic.Monitor.reloadProperty(Monitor.java:61) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at com.hankcs.dic.Monitor.run(Monitor.java:34) [elasticsearch-analysis-hanlp-6.2.2.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_181]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_181]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
[2018-09-06T09:16:23,505][INFO ][o.e.m.j.JvmGcMonitorService] [cyVG2fn] [gc][57] overhead, spent [318ms] collecting in the last [1s]
[2018-09-06T09:35:31,918][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [] fatal error in thread [elasticsearch[cyVG2fn][index][T#1]], exiting
java.lang.NoClassDefFoundError: Could not initialize class com.hankcs.hanlp.HanLP$Config
at com.hankcs.hanlp.seg.Segment.seg(Segment.java:423) ~[?:?]
at com.hankcs.lucene.SegmentWrapper.next(SegmentWrapper.java:76) ~[?:?]
at com.hankcs.lucene.HanLPTokenizer.incrementToken(HanLPTokenizer.java:94) ~[?:?]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.simpleAnalyze(TransportAnalyzeAction.java:266) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.analyze(TransportAnalyzeAction.java:243) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:164) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.admin.indices.analyze.TransportAnalyzeAction.shardOperation(TransportAnalyzeAction.java:80) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:293) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$ShardTransportHandler.messageReceived(TransportSingleShardAction.java:286) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_181]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_181]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

Does not run on ES 5.x

The exception is:

Caused by: java.security.AccessControlException: access denied ("java.util.PropertyPermission" "*" "read,write")
at java.security.AccessControlContext.checkPermission(AccessControlContext.java:472) ~[?:1.8.0_111]
at java.security.AccessController.checkPermission(AccessController.java:884) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPermission(SecurityManager.java:549) ~[?:1.8.0_111]
at java.lang.SecurityManager.checkPropertiesAccess(SecurityManager.java:1262) ~[?:1.8.0_111]
at java.lang.System.getProperties(System.java:630) ~[?:1.8.0_111]
at com.hankcs.hanlp.HanLP$Config.<clinit>(HanLP.java:306) ~[?:?]

Offsets are incorrect after tokenization

Using the hanlp tokenizer:
GET _analyze

{
  "text": ["**地大物博"],
  "tokenizer": "hanlp"
}

Response:

{
  "tokens" : [
    {
      "token" : "**",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "ns",
      "position" : 0
    },
    {
      "token" : "地大物博",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "nz",
      "position" : 1
    }
  ]
}

The second term's start_offset still starts from 0.

Whereas the offsets produced by ES's built-in tokenizers always increase:

GET _analyze

{
  "text": ["**地大物博"],
  "tokenizer": "standard"
}

Response:

{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<IDEOGRAPHIC>",
      "position" : 0
    },
    {
      "token" : "",
      "start_offset" : 1,
      "end_offset" : 2,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    },
    {
      "token" : "",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "<IDEOGRAPHIC>",
      "position" : 3
    },
    {
      "token" : "",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    }
  ]
}

Search still finds the document, but highlighting only matches the first few characters.
The following was tested on 6.5.4.

PUT document

{
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "text",
          "analyzer": "hanlp",
          "search_analyzer": "hanlp"
        }
      }
    }
  }
}

PUT document/doc/1

{
  "body": ["**地大物博"]
}

GET document/_search

{
  "query": {
    "match": {
      "body": {
        "query": "地大物博"
      }
    }
  },
  "highlight": {
    "fields": {
      "body": {}
    }
  }
}

Response:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "document",
        "_type" : "doc",
        "_id" : "1",
        "_score" : 0.2876821,
        "_source" : {
          "body" : [
            "**地大物博"
          ]
        },
        "highlight" : {
          "body" : [
            "<em>**地大</em>物博"
          ]
        }
      }
    ]
  }
}
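
One setting that may be related: an earlier issue in this digest defines a custom hanlp tokenizer with an enable_offset option (set to false there). Whether toggling it changes the offsets reported in this issue is untested here; a sketch of trying it, purely as an experiment:

PUT offset_test
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hanlp": {
          "type": "hanlp",
          "enable_offset": true
        }
      },
      "analyzer": {
        "my_hanlp_analyzer": {
          "tokenizer": "my_hanlp"
        }
      }
    }
  }
}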

Crash after downloading the model from the official site and adding it to the plugin

Hello, I ran into a problem while using your plugin and am not sure whether I am using it incorrectly.
After downloading the model data from the HanLP site and placing it under /plugins/analysis-hanlp/data/model, I restarted ES.
Testing segmentation with _analyze works fine.
"hanLPAnalyzer" : {
"type" : "custom",
"char_filter" : [
"charconvert"
],
"tokenizer" : "hanlp_nlp_word"
},
"hanlp_nlp_word" : {
"enable_normalization" : "true",
"enable_remote_dict" : "true",
"type" : "hanlp_nlp"
}
Then I query the data with _search as below. There are roughly 2,000,000 documents; the query takes more than 2 s, and after issuing it more than 5 times in a row a 401 error appears and the server log reports an out-of-memory error.
POST /poi/_search
{
"from": 0,
"size": 10,
"query": {
"dis_max": {
"tie_breaker": 0.3,
"queries": [{
"match": {
"full_q": {
"query": "青年路",
"operator": "OR",
"analyzer": "hanLPAnalyzer"
}
}
}],
"boost": 1.0
}
},
"post_filter": {
"bool": {
"adjust_pure_negative": true,
"boost": 1.0
}
}
}

How should synonyms be configured?

Hello,
I am using the elasticsearch-analysis-hanlp-6.5.1 plugin. I used the ik plugin before, and based on ik's synonym configuration I set up the settings below, but the synonyms do not take effect. How should synonyms be configured so they work with the hanlp plugin? Thanks a lot!
"settings" : {
"index" : {
"analysis" : {
"filter" : {
"hanlp_synonym_ik_standard" : {
"type" : "synonym",
"synonyms_path" : "../plugins/analysis-hanlp/data/dictionary/synonym/CoreSynonym.txt"
}
},
"analyzer" : {
"hanlp_default_search" : {
"tokenizer" : "hanlp_standard"
},
"hanlp_default_index" : {
"filter" : [
"hanlp_synonym_ik_standard"
],
"tokenizer" : "hanlp_index"
}
}
}
}
},
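
One way to check whether the synonym filter is actually applied is to run _analyze against the index with the analyzer that includes the filter and look for the expanded terms. A sketch (the index name myindex and the sample text are illustrative; use a word that appears in CoreSynonym.txt):

GET myindex/_analyze
{
  "analyzer": "hanlp_default_index",
  "text": "电脑"
}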
