Elasticsearch Version 7.15.1 Installed Plugins

ngram about elasticsearch HOT 7 CLOSED

S-Dragon0302 commented on August 21, 2024

ngram

from elasticsearch.

Comments (7)

elasticsearchmachine commented on August 21, 2024

Pinging @elastic/es-search (Team:Search)

cbuescher commented on August 21, 2024

@S-Dragon0302 Would you please let us know what problem you are encountering? I'm going to remove the "bug" label for now, as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.

S-Dragon0302 commented on August 21, 2024

The tokenization result is incorrect. The analysis returns no tokens at all, although it should return a result.

S-Dragon0302 commented on August 21, 2024

The tokenization result should be this:

{
  "tokens": [
    {
      "token": "是不",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "不是",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "是发",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "发现",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 3
    },
    {
      "token": "现我",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 4
    },
    {
      "token": "我的",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    },
    {
      "token": "的字",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 6
    },
    {
      "token": "字冒",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒烟",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟了",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 9
    }
  ]
}

S-Dragon0302 commented on August 21, 2024

The actual result is this.
{
"tokens" : [ ]
}

benwtrent commented on August 21, 2024

@S-Dragon0302

For the given text 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ, running the pattern tokenizer without the ngram token filter:

GET /my_index/_analyze
{
  "filter": [
    "lowercase"
  ],
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^\\p{L}\\p{N}]+"
  },
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

Results in:

{
  "tokens": [
    {
      "token": "是",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "不",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "发",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "现",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "我",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 6
    },
    {
      "token": "字",
      "start_offset": 14,
      "end_offset": 15,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟",
      "start_offset": 18,
      "end_offset": 19,
      "type": "word",
      "position": 9
    },
    {
      "token": "了",
      "start_offset": 20,
      "end_offset": 21,
      "type": "word",
      "position": 10
    }
  ]
}

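The pattern splits on the combining mark that follows each character, so none of those tokens is longer than 1 character. An ngram filter that requires at least 2 characters therefore produces no output.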
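To make the interaction concrete, here is a minimal Python sketch that imitates the two stages outside of Elasticsearch. It is an approximation, not Elasticsearch's implementation: `pattern_tokenize` and `ngram_filter` are hypothetical helper names, Unicode general categories stand in for the regex classes `\p{L}` and `\p{N}`, and `min_gram = max_gram = 2` is an assumption, since the thread does not show the actual filter settings.

```python
import unicodedata

TEXT = "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"


def is_letter_or_number(ch):
    # Approximates \p{L} and \p{N} using the Unicode general category.
    return unicodedata.category(ch)[0] in ("L", "N")


def pattern_tokenize(text):
    """Split on runs of characters that are neither letters nor numbers."""
    tokens, current = [], []
    for ch in text:
        if is_letter_or_number(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def ngram_filter(tokens, min_gram, max_gram):
    """Emit every substring of each token whose length is in [min_gram, max_gram]."""
    out = []
    for tok in tokens:
        for start in range(len(tok)):
            for n in range(min_gram, max_gram + 1):
                if start + n <= len(tok):
                    out.append(tok[start:start + n])
    return out


tokens = pattern_tokenize(TEXT)
bigrams = ngram_filter(tokens, 2, 2)  # assumed settings: min_gram = max_gram = 2
print(tokens)
print(bigrams)
```

Every token that reaches the filter is a single character, so there is no substring of length 2 to emit and the second list comes back empty.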

benwtrent commented on August 21, 2024

Closing as expected behavior. An ngram filter that requires at least 2 characters emits nothing when every token is only 1 character long.
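For comparison, the token strings in the expected result posted earlier are what character bigrams look like when the marks between the characters are not treated as delimiters. The sketch below is only an illustration under that assumption: it drops every character that is not a letter or number before forming bigrams, which is not something the thread proposes as a fix, and it does not reproduce Elasticsearch offsets.

```python
import unicodedata

TEXT = "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"

# Keep only letters and numbers (general categories L* and N*),
# discarding the marks that the pattern tokenizer split on.
stripped = "".join(ch for ch in TEXT if unicodedata.category(ch)[0] in ("L", "N"))

# Character bigrams over the remaining text.
bigrams = [stripped[i:i + 2] for i in range(len(stripped) - 1)]
print(bigrams)
```

With the marks removed, the 11 characters form one run and yield 10 bigrams, starting with 是不 and ending with 烟了.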
