
Comments (7)

elasticsearchmachine commented on August 21, 2024

Pinging @elastic/es-search (Team:Search)

from elasticsearch.

cbuescher commented on August 21, 2024

@S-Dragon0302 Would you please let us know what problem you are encountering? I'm going to remove the "bug" label for now, as I don't see what's missing. Also keep in mind that if this is a language-specific problem, the language-specific discuss forums (https://discuss.elastic.co/c/in-your-native-tongue/11) might be a good place to ask.


S-Dragon0302 commented on August 21, 2024

The segmentation result is incorrect. The analysis returns no tokens, although it should return a result.


S-Dragon0302 commented on August 21, 2024

The segmentation result should be this.
{
  "tokens": [
    {
      "token": "是不",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "不是",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "是发",
      "start_offset": 4,
      "end_offset": 7,
      "type": "word",
      "position": 2
    },
    {
      "token": "发现",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 3
    },
    {
      "token": "现我",
      "start_offset": 8,
      "end_offset": 11,
      "type": "word",
      "position": 4
    },
    {
      "token": "我的",
      "start_offset": 10,
      "end_offset": 13,
      "type": "word",
      "position": 5
    },
    {
      "token": "的字",
      "start_offset": 12,
      "end_offset": 15,
      "type": "word",
      "position": 6
    },
    {
      "token": "字冒",
      "start_offset": 14,
      "end_offset": 17,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒烟",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟了",
      "start_offset": 18,
      "end_offset": 21,
      "type": "word",
      "position": 9
    }
  ]
}


S-Dragon0302 commented on August 21, 2024

The actual result is this.
{
"tokens" : [ ]
}


benwtrent commented on August 21, 2024

@S-Dragon0302

For the given text 是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ, running the pattern tokenizer without the token filtering:

GET /my_index/_analyze
{
  "filter": [
    "lowercase"
  ],
  "tokenizer": {
    "type": "pattern",
    "pattern": "[^\\p{L}\\p{N}]+"
  },
  "text": "是ྂ不ྂ是ྂ发ྂ现ྂ我ྂ的ྂ字ྂ冒ྂ烟ྂ了ྂ"
}

Results in:

{
  "tokens": [
    {
      "token": "是",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "不",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "是",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "发",
      "start_offset": 6,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "现",
      "start_offset": 8,
      "end_offset": 9,
      "type": "word",
      "position": 4
    },
    {
      "token": "我",
      "start_offset": 10,
      "end_offset": 11,
      "type": "word",
      "position": 5
    },
    {
      "token": "的",
      "start_offset": 12,
      "end_offset": 13,
      "type": "word",
      "position": 6
    },
    {
      "token": "字",
      "start_offset": 14,
      "end_offset": 15,
      "type": "word",
      "position": 7
    },
    {
      "token": "冒",
      "start_offset": 16,
      "end_offset": 17,
      "type": "word",
      "position": 8
    },
    {
      "token": "烟",
      "start_offset": 18,
      "end_offset": 19,
      "type": "word",
      "position": 9
    },
    {
      "token": "了",
      "start_offset": 20,
      "end_offset": 21,
      "type": "word",
      "position": 10
    }
  ]
}

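None of those tokens is longer than one character: the mark after each character is neither a letter (\p{L}) nor a number (\p{N}), so the pattern splits on it. A filter that requires 2-grams therefore produces no output.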
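
To make the reasoning above concrete, here is a rough Python sketch that imitates the two steps outside Elasticsearch. It is not the Elasticsearch implementation: `pattern_tokenize` and `ngrams` are hypothetical helpers, Python's `unicodedata` categories stand in for the `\p{L}` and `\p{N}` classes, and the mark is written as `\u0f82`, which is an assumption about the exact code point in the issue text.

```python
import unicodedata


def pattern_tokenize(text):
    """Split on runs of characters that are neither letters (L*) nor numbers (N*).

    Hypothetical stand-in for a pattern tokenizer using [^\\p{L}\\p{N}]+.
    """
    tokens, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "N"):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def ngrams(token, n):
    """Return every contiguous substring of length n (empty if the token is shorter)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


# Assumption: the mark following each character is U+0F82 (a nonspacing mark).
MARK = "\u0f82"
text = "".join(ch + MARK for ch in "是不是发现我的字冒烟了")

# Each mark acts as a separator, so every token is a single character.
tokens = pattern_tokenize(text)
bigrams = [gram for token in tokens for gram in ngrams(token, 2)]
print(tokens)
print(bigrams)  # []

# For comparison only: with the marks removed first, the text is one token
# and 2-grams exist.
stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
bigrams_stripped = [g for t in pattern_tokenize(stripped) for g in ngrams(t, 2)]
print(bigrams_stripped)
```

The last part is only there to show where the expected 2-grams would come from; it is not a recommendation for any particular analyzer setting.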


benwtrent commented on August 21, 2024

Closing as expected behavior. A filter that requires 2-grams producing no output when every token is a single character is expected.

