elasticsearch-analysis-hao's Issues

Request: support configuring multiple dictionary files

Suggestion: allow the dictionary to be configured with multiple files, for both the local custom dictionary and the remote dictionary.
Also, does the remote dictionary check Last-Modified and ETag as well? It would be good to state this explicitly in the documentation.
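
For reference, whether a remote dictionary endpoint is suitable for incremental checks can be verified with a plain HTTP HEAD request (the URL below is only a placeholder, not a real endpoint of this plugin):

curl -I 'http://your-dict-server/custom_remote_dict.txt'
# A server suitable for conditional polling should return headers such as:
#   Last-Modified: Wed, 01 Sep 2021 08:00:00 GMT
#   ETag: "5f3c2a"
# A client polling the dictionary would typically re-download the file only
# when one of these values changes.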

Request: add a hao_max_word mode

hao tokenizer

  • hao_index_mode
Keyword: 图书发行第一股
Result: 图书发行、图书、发行、第一股、第一
Expected: 图书发行、图书、发行、第一股、第一、股
Keyword: 图书股
Result: 图书、股
Expected: 图书股、图书、股

Currently, in "and" mode, a search for "图书股" does not match "图书发行第一股".

  • hao_index_mode, autoWordLength=3
Keyword: 图书发行第一股
Result: 图书发行、图书、发行、第一股、第一
Expected: 图书发行、图书、发行、第一股、第一、股
Keyword: 图书股
Result: 图书股、图书
Expected: 图书股、图书、股

Currently, in "and" mode, a search for "图书股" does not match "图书发行第一股".

ik tokenizer

  • ik_max_word
Keyword: 图书发行第一股
Result: 图书、发行、第一、一股、一、股
Keyword: 图书股
Result: 图书、股

Currently, in "and" mode, a search for "图书股" does match "图书发行第一股".

Suggestion

Please add a hao_max_word mode, similar to IK's ik_max_word mode.
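
The comparison above can be reproduced with the _analyze API; a minimal sketch, assuming both plugins are installed on a local node:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "hao_index_mode",
  "text": "图书发行第一股"
}
'
# Repeat with "tokenizer": "ik_max_word" to compare the two token lists.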

How is it faster than IK?

The README says the hao tokenizer is faster than IK. How is that achieved, and by how much?
When I extend the IK tokenizer with a dictionary of tens of millions of entries, indexing throughput drops sharply. Would the hao tokenizer help in this situation?

`㑮` causes a tokenization anomaly

Hi @tenlee2012,

data/hanlp-data/data/dictionary/tc/t2s.txt
Some traditional characters from this file, combined with other traditional characters, cause a tokenization anomaly. Looking forward to your reply, thanks.

Test text

curl 'http://localhost:9200/_analyze?pretty' -H 'Content-Type:application/json' -d '
{
  "tokenizer" : "hao_index_mode",
  "text": "㑮 奮發圖強"
}
'

Current output

{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 4
    }
  ]
}

Expected output

{
  "tokens" : [
    {
      "token" : "",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "奮發圖強",
      "start_offset" : 1,
      "end_offset" : 5,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "奮發",
      "start_offset" : 1,
      "end_offset" : 3,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "圖強",
      "start_offset" : 3,
      "end_offset" : 5,
      "type" : "word",
      "position" : 2
    }
  ]
}

How are traditional and simplified characters handled?

I tried adding traditional-character words to the custom_dictionary.text dictionary, but it had no effect.
Adding the simplified form does work, and tokenization of the traditional form is then supported as well. How does the plugin handle the mapping between traditional and simplified characters?
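
A hedged way to verify the behavior described above (the example word is hypothetical; custom_dictionary.text is the file named in this issue):

# Hypothetical check: after adding the simplified form 奋发图强 to
# custom_dictionary.text and reloading the dictionary, analyzing the
# traditional form should, per this issue, also produce the whole word.
curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "hao_index_mode",
  "text": "奮發圖強"
}
'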

Who is using the Hao tokenizer?

First of all, sincere thanks to everyone who keeps following and using the Hao tokenizer. I will keep investing in this project and strive to make Hao the best Chinese tokenizer.

Purpose of this issue

  • Give me more motivation to keep making it better
  • Hear feedback from users so the Hao tokenizer can improve
  • Attract more contributors

What we would like you to provide

Please leave a comment on this issue that includes:

  • Your company, school, or organization
  • Your city
  • Your contact details: Weibo, email, WeChat, QQ, Facebook, Twitter
  • A URL where your use of it can be tried out
    You can follow the example below when providing your information:

ByteDance
Beijing
[email protected]
www.douyin.com

Thanks again for participating!

Data insertion error with html_strip + hao_index_mode + multi-value fields

Hi, this plugin is very good; thank you for your work.

Below is a test case that exposes a bug; I hope it can be fixed.

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "hao_index_mode",
  "char_filter": ["html_strip"],
  "text" : [
      "<a>b",
      "A"
    ]
}
'
Current result:
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "A",
      "start_offset" : 2,
      "end_offset" : 3,
      "type" : "word",
      "position" : 101
    }
  ]
}

Expected result:
{
  "tokens" : [
    {
      "token" : "a",
      "start_offset" : 3,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "A",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "word",
      "position" : 101
    }
  ]
}

Because the start_offset of "A" is smaller than the start_offset of "a", Lucene reports an error when the document is indexed.

Thanks!

Search support for specific single characters

Hi, without enabling enableSingleWord, is there any way to have specific single characters tokenized so they can be searched?
I tried adding them to the custom dictionary, but it had no effect.

For example, for the text 河南省政府今天發表重要講話, I want searches for the single character "省" to work.
Currently the tokenization result is 河南省政府/河南省/河南/政府/今天/發表/重要講話/重要/講話.
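
For context, enableSingleWord is the tokenizer option being discussed here. Below is a minimal sketch of enabling it on a custom tokenizer; the exact option name and its placement are assumptions taken from this issue rather than from the plugin documentation, so check the README for the authoritative configuration.

# Assumption: "enableSingleWord" is accepted inline on a tokenizer of
# type "hao_index_mode"; this is not confirmed against the plugin docs.
curl -X PUT "localhost:9200/my-index" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_hao_tokenizer": {
          "type": "hao_index_mode",
          "enableSingleWord": true
        }
      },
      "analyzer": {
        "my_hao_analyzer": {
          "tokenizer": "my_hao_tokenizer"
        }
      }
    }
  }
}
'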

Startup error on 7.2

java.lang.NoClassDefFoundError: com/fasterxml/jackson/core/exc/InputCoercionException
at com.fasterxml.jackson.databind.deser.BasicDeserializerFactory.createMapDeserializer(BasicDeserializerFactory.java:1376) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer2(DeserializerCache.java:387) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createDeserializer(DeserializerCache.java:349) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCache2(DeserializerCache.java:264) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache._createAndCacheValueDeserializer(DeserializerCache.java:244) ~[?:?]
at com.fasterxml.jackson.databind.deser.DeserializerCache.findValueDeserializer(DeserializerCache.java:142) ~[?:?]
at com.fasterxml.jackson.databind.DeserializationContext.findRootValueDeserializer(DeserializationContext.java:479) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper._findRootDeserializer(ObjectMapper.java:4405) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:4214) ~[?:?]
at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3099) ~[?:?]
at com.itenlee.search.analysis.help.JSONUtil.lambda$parseJSON$2(JSONUtil.java:32) ~[?:?]
at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_231]
at com.itenlee.search.analysis.help.JSONUtil.parseJSON(JSONUtil.java:32) ~[?:?]
at com.itenlee.search.analysis.core.Dictionary.loadDict(Dictionary.java:96) ~[?:?]
at com.itenlee.search.analysis.core.Dictionary.initial(Dictionary.java:65) ~[?:?]
at com.itenlee.search.analysis.lucence.Configuration.&lt;init&gt;(Configuration.java:125) ~[?:?]
at org.elasticsearch.index.analysis.hao.HaoTokenizerFactory.&lt;init&gt;(HaoTokenizerFactory.java:34) ~[?:?]
at org.elasticsearch.index.analysis.hao.HaoTokenizerFactory.getHttpSmartTokenizerFactory(HaoTokenizerFactory.java:38) ~[?:?]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildMapping(AnalysisRegistry.java:343) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.buildTokenizerFactories(AnalysisRegistry.java:179) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.analysis.AnalysisRegistry.build(AnalysisRegistry.java:164) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.IndexService.&lt;init&gt;(IndexService.java:175) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.index.IndexModule.newIndexService(IndexModule.java:399) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndexService(IndicesService.java:564) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:513) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.IndicesService.createIndex(IndicesService.java:166) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.createIndices(IndicesClusterStateService.java:507) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:269) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$5(ClusterApplierService.java:495) ~[elasticsearch-7.2.0.jar:7.2.0]
at java.lang.Iterable.forEach(Iterable.java:75) ~[?:1.8.0_231]
at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:493) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:464) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:165) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:688) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-7.2.0.jar:7.2.0]
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-7.2.0.jar:7.2.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_231]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_231]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]
Caused by: java.lang.ClassNotFoundException: com.fasterxml.jackson.core.exc.InputCoercionException
at java.net.URLClassLoader.findClass(URLClassLoader.java:382) ~[?:1.8.0_231]
at java.lang.ClassLoader.loadClass(ClassLoader.java:418) ~[?:1.8.0_231]
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:817) ~[?:1.8.0_231]
at java.lang.ClassLoader.loadClass(ClassLoader.java:351) ~[?:1.8.0_231]
... 40 more

Tokenization hangs on long text

Plugin version: v8.3.3


{
    "analyzer": "hao_index_mode",
    "text": "0xFF0x030x420x270x010x430x320x010x460x640x5E0x5C0x3B0x010x360x290x010x050x330x010x030x220x010x630x3E0x010x270x340x350x5F0x3D0x010x400x2F0x010x3C0x430x010x610x3E0x010x370x140x010x3F0x360x3C0x2F0x350x380x060x2D0x010x040x460x060x490x310x010x280x230x1F0x220x2C0x2E0x190x460x410x0E0x340x1F0x1A0x650x3A0x0F0x250x350x1D0x200x1C0x070x330x010x230x4F0x3E0x240x650x650x3A0x2D0x010x3D0x370x490x090x4B0x170x480x4D0x540x470x1F0x410x410x0E0x300x0B0x370x320x450x330x010x4B0x010x010x4A0x330x330x510x360x010x4C0x5F0x240x0D0x380x010x5D0x010x010x2B0x270x010x290x440x010x2D0x340x010x5E0x3A0x010x3E0x2E0x010x080x3C0x010x0A0x340x010x010x2B0x010x440x240x010x670x3D0x010x0C0x3A0x010x250x330x330x000x310x3A0x310x300x330x330x330x330x330x2C0x320x3A0x310x330x330x330x530x330x330x2C0x330x3A0x320x330x010x010x0B0x010x330x2C0x340x3A0x330x390x370x360x350x330x330x2C0x370x3A0x340x330x330x330x330x330x330x2C0x310x300x3A0x360x300x330x010x0B0x010x330x2C0x310x340x3A0x310x300x300x320x330x330x330x330x330x2C0x310x350x3A0x360x380x330x330x330x330x330x2C0x380x3A0x320x330x010x010x0B0x010x330x2C0x330x310x3A0x320x330x280x310x1C0x010x330x2C0x380x310x3A0x320x330x280x310x1C0x010x330x2C0x320x300x300x300x3A0x320x310x390x010x650x630x010x330x2C0x370x310x3A0x310x3F0x650x5B0x010x330x2C0x370x320x3A0x310x010x010x010x010x330x2C0x370x330x3A0x310x010x010x010x010x330x2C0x390x310x3A0x310x5B0x150x330x010x330x2C0x390x320x3A0x310x010x0B0x0B0x010x330x2C0x390x330x3A0x310x010x0B0x0B0x010x330x000x310x310x340x000x340x330x000x330x000x320x320x000x010x010x630x000x000x310x2E0x30"
}

Tokenization simply hangs. The same version of IK does not have this problem.

Suggestion: when tokenizing short text, return the original text as one of the tokens

Currently "镐窝" is tokenized into two single characters. When tokenizing short text, it would be helpful to also return the original text as one of the tokens, toggled by a configuration option.
For example, "镐窝" would be tokenized into "镐窝", "镐", and "窝".
This would partially mitigate the lack of named-entity recognition and new-word discovery, and would help with tokenizing some uncommon personal names.
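
A quick check of the current behavior, as a minimal sketch against a local node:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "hao_index_mode",
  "text": "镐窝"
}
'
# Per the description above, this currently yields only the single-character
# tokens 镐 and 窝; the proposal is to additionally emit 镐窝 itself.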

Stop words

I have seen many people asking about stop word support.
Unfortunately, this plugin does not support stop word configuration or a remote stop word dictionary.
The reason is that Elasticsearch already provides stop word support, and Chinese stop word lists change infrequently, so there was no point reinventing the wheel.
If you need stop words, please use the native Elasticsearch stop token filter:
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-tokenfilter.html

PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "stopwords_path": "停用词路径,每个词一行"
          "ignore_case": true
        }
      }
    }
  }
}
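
Once the index exists, the filter can be sanity-checked via the index _analyze endpoint, referencing the custom filter by name:

curl -X GET "localhost:9200/my-index-000001/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer": "whitespace",
  "filter": ["my_custom_stop_words_filter"],
  "text": "这 是 一 个 测试"
}
'
# Any token that appears in the configured stop words file is dropped from the output.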

PS: Note that hot-reloading the dictionary only means documents indexed after the dictionary update are tokenized with the new words; existing documents still require a reindex.

Newer versions require restricting the permissions the plugin requests

When installing the plugin on 7.13.3, Elasticsearch reports:
plugin-security.policy] contains illegal permission ("java.lang.RuntimePermission" "*") in global grant

After investigating, this turns out to be related to this commit: elastic/elasticsearch#64751 (introduced in v7.11).

Elasticsearch now restricts the permissions a plugin may request, so the original plugin-security.policy no longer works.
According to one of the Elasticsearch developers, in the future plugins will only be allowed to read their own plugin config directory (a directory under /etc/elasticsearch named for the plugin), so the read permission of java.io.FilePermission will eventually be removed as well, and RuntimePermission can no longer use "*".

So the requested permissions need to be narrowed to adapt the plugin to newer ES versions.


PS: If anyone else runs into this, I temporarily changed the policy to the following, which works:

grant {
  permission java.net.SocketPermission "*", "accept,connect,resolve"; // okhttp
  permission java.lang.RuntimePermission "getClassLoader"; // okhttp
  permission java.net.NetPermission "getProxySelector"; // okhttp
  permission java.lang.RuntimePermission "accessDeclaredMembers"; // jackson
  permission java.lang.reflect.ReflectPermission "suppressAccessChecks"; // jackson
  //permission java.lang.RuntimePermission "*"; // tensorflow
  permission java.lang.RuntimePermission "createClassLoader";
  permission java.lang.RuntimePermission "accessDeclaredMembers";
  permission java.lang.RuntimePermission "getClassLoader";
  permission java.lang.RuntimePermission "setContextClassLoader";
  permission java.lang.RuntimePermission "setFactory";
  permission java.lang.RuntimePermission "loadLibrary.*";
  permission java.lang.RuntimePermission "accessClassInPackage.*";
};

Array fields with FVH enabled can, in some cases, produce misaligned highlight fragments

Hi, I found a problem where FVH causes highlights to be misaligned.


Plugin version: 8.3.1


Index configuration:

{
    "mappings": {
        "dynamic": false,
        "properties": {
            "data": {
                "type": "text",
                "analyzer": "hao_index_mode",
                "search_analyzer": "hao_search_mode",
                "term_vector": "with_positions_offsets"
            }
        }
    }
}

Insert a document:

{
    "data": [
        "测试文本1",
        "......",
        "测试文本2"
    ]
}

Keyword search test:

GET /_search

{
    "query": {
        "match": {"data": "测试"}
    },
    "highlight": {
        "fields": {"data": {}}
    }
}

The returned result is as follows:

{
    "_source": {
        "data": [
            "测试文本1",
            "......",
            "测试文本2"
        ]
    },
    "highlight": {
        "data": [
            "<em>测试</em>文本1",
            ".<em>..</em>..."
        ]
    }
}

As you can see, the highlight fragment for the second occurrence of "测试" is wrong.

If the "......" element in the middle is removed, the problem does not occur.

POST /_doc

{
    "data": [
        "测试文本1",
        "测试文本2"
    ]
}

GET /_search

{
    "query": {
        "match": {"data": "测试"}
    },
    "highlight": {
        "fields": {"data": {}}
    }
}

Returned result:
{
    "_source": {
        "data": [
            "测试文本1",
            "测试文本2"
        ]
    },
    "highlight": {
        "data": [
            "<em>测试</em>文本1",
            "<em>测试</em>文本2"
        ]
    }
}
