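Hello, your jcseg has helped me a great deal, and I am very grateful. While using jcseg, two things have always seemed wrong to me. Both concern the tokens derived when a word has synonyms or when Chinese numerals are converted to Arabic numerals, namely the derived synonyms ...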


Problems with synonyms and Chinese-to-Arabic numeral conversion about jcseg HOT 10 CLOSED

lionsoul2014 commented on May 25, 2024

Problems with synonyms and Chinese-to-Arabic numeral conversion

from jcseg.

Comments (10)

lionsoul2014 commented on May 25, 2024

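1. On the first issue: I tested this. If endOffset is set to the same value as the original word's, the token itself gets truncated; Lucene-family products all create the ByteRef from startOffset and endOffset
2. For Lucene's internal tokens, each position must be greater than the previous one; I tested this as well.

The current synonym approach does cause incorrect highlighting. I need to study how ES implements synonyms and borrow from it.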

from jcseg.

hupet commented on May 25, 2024

I ran a test. First, I defined a custom analyzer using Elasticsearch's synonym filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch",
            "computer,pc"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}

Then I analyzed a sentence with this analyzer:

GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": ["this is a pc game"]
}

The result shows that the synonym's end_offset and position are both identical to the original word's:

...    
    {
      "token": "pc",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "computer",
      "start_offset": 10,
      "end_offset": 12,
      "type": "SYNONYM",
      "position": 3
    },...

Next, the same test with jcseg:

PUT /jcseg_index
{
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
          "type": "text",
          "analyzer": "jcseg_complex"
        }
      }
    }
  }
}

Analyze the following sentence:

GET jcseg_index/_analyze
{
  "analyzer": "jcseg_complex",
  "text": ["他是人事部的经理"]
}

Here the end_offset and position differ for every derived synonym:

    {
      "token": "人事部",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "人事管理部",
      "start_offset": 2,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "人事管理部门",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 4
    },
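
The offsets matter because a highlighter typically slices the original text using start_offset and end_offset. A minimal Python sketch, assuming the offsets are character indices into the analyzed string and using a hypothetical `highlight_span` helper (not part of jcseg or Elasticsearch), shows what the three jcseg tokens above would mark in the source text:

```python
TEXT = "他是人事部的经理"

# (token, start_offset, end_offset) copied from the jcseg output above.
JCSEG_TOKENS = [
    ("人事部", 2, 5),
    ("人事管理部", 2, 7),
    ("人事管理部门", 2, 8),
]


def highlight_span(text, start, end):
    """Return the slice of the original text that a highlighter would mark.

    Hypothetical helper for illustration; offsets are treated as
    character indices into `text`.
    """
    return text[start:end]


for token, start, end in JCSEG_TOKENS:
    print(token, "->", highlight_span(TEXT, start, end))
# 人事部 -> 人事部
# 人事管理部 -> 人事部的经
# 人事管理部门 -> 人事部的经理
```

Only the first span covers the word that is actually present in the text. The Elasticsearch synonym filter avoids this by giving the synonym the original token's offsets (10 to 12 in the pc/computer output).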

from jcseg.

hupet commented on May 25, 2024

Continuing from the previous comment: at query time, the query is analyzed with the synonym filter as well:

GET my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "computer game"
    }
  }
}

The result:

{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "my_index",
      "valid": true,
      "explanation": """txt:"(computer pc) game""""
    }
  ]
}

The synonyms computer and pc are grouped in parentheses to form an OR clause, so both "pc game" and "computer game" match.
Running the same query analysis with jcseg:

GET jcseg_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "人事管理部门的经理"
    }
  }
}

The result:

{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "jcseg_index",
      "valid": true,
      "explanation": """txt:"人事管理部门 人事管理部 人事部 的 经 理""""
    }
  ]
}

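Because the three synonyms have different positions, they do not form an OR clause. As a result, a search for "人事部的经理" matches,
but a search for "人事管理部门的经理" does not, which seems to defeat the purpose of synonyms.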
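
The effect of positions on phrase matching can be reproduced with a small simulation. The Python sketch below is not how Lucene implements phrase queries; it is a simplified model with hypothetical helpers `group_by_position` and `phrase_matches`, in which terms sharing a position are alternatives and consecutive slots must line up. The position of 的, the token order of the shorter query, and the stacked variant are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_position(tokens):
    """Group (term, position) pairs into term sets ordered by position.

    Simplification: positions are assumed to be contiguous, so gaps are ignored.
    """
    slots = defaultdict(set)
    for term, position in tokens:
        slots[position].add(term)
    return [slots[p] for p in sorted(slots)]


def phrase_matches(index_tokens, query_tokens):
    """Simplified phrase match: each query slot must share at least one term
    with the index slot at the same relative offset, for some start slot."""
    index_slots = group_by_position(index_tokens)
    query_slots = group_by_position(query_tokens)
    for start in range(len(index_slots) - len(query_slots) + 1):
        window = index_slots[start:start + len(query_slots)]
        if all(q & w for q, w in zip(query_slots, window)):
            return True
    return False


# One position per synonym, as in the jcseg output reported in this thread.
# The position of "的" is assumed to follow the last synonym.
sequential_index = [("人事部", 2), ("人事管理部", 3), ("人事管理部门", 4), ("的", 5)]
query_short = [("人事部", 0), ("人事管理部", 1), ("人事管理部门", 2), ("的", 3)]
query_long = [("人事管理部门", 0), ("人事管理部", 1), ("人事部", 2), ("的", 3)]

# Assumed alternative: synonyms stacked on the original word's position.
stacked_index = [("人事部", 2), ("人事管理部", 2), ("人事管理部门", 2), ("的", 3)]
stacked_query = [("人事管理部门", 0), ("人事管理部", 0), ("人事部", 0), ("的", 1)]

print(phrase_matches(sequential_index, query_short))  # True
print(phrase_matches(sequential_index, query_long))   # False
print(phrase_matches(stacked_index, stacked_query))   # True
```

In this model the sequential layout matches only the query whose synonyms come out in the same order as in the index, while the stacked layout yields the same set of alternatives whichever form the user types.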

from jcseg.

lionsoul2014 commented on May 25, 2024

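Thanks for the analysis; I think I understand now. I have always set the token.type passed to Lucene to word. Synonyms need to be set to SYNONYM, and the derived word's offsets must match the original word's so that it is not truncated. I will build a revised version and try it.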

from jcseg.

hupet commented on May 25, 2024

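There is also positionIncrement, which is probably the real key to synonyms. I suggest taking a look at the Lucene documentation: https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
The very first usage note in the documentation says: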
Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.

As I recall, I tried changing the type to SYNONYM and it made no difference; the key is still the position.
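
Following the rule quoted above, a tokenizer that emits synonyms would give the original word an increment of one and each derived synonym an increment of zero. A rough Python sketch of that bookkeeping is shown here; the `emit_with_synonyms` and `absolute_positions` generators, the word segmentation, and the toy synonym table are all hypothetical and are not jcseg's actual code.

```python
def emit_with_synonyms(words, synonyms):
    """Yield (term, position_increment, start_offset, end_offset) tuples.

    `words` is a list of (term, start_offset, end_offset) for the original
    text; `synonyms` maps a term to its derived synonyms. Both are
    illustrative inputs, not jcseg data structures.
    """
    for term, start, end in words:
        # The original word advances the position by one.
        yield term, 1, start, end
        for alt in synonyms.get(term, ()):
            # A synonym stays on the same position and reuses the
            # original word's offsets.
            yield alt, 0, start, end


def absolute_positions(stream):
    """Turn increments into absolute positions (the first token lands on 0)."""
    position = -1
    for term, increment, start, end in stream:
        position += increment
        yield term, position, start, end


# Assumed segmentation of the start of "他是人事部的经理".
words = [("他", 0, 1), ("是", 1, 2), ("人事部", 2, 5), ("的", 5, 6)]
synonyms = {"人事部": ["人事管理部", "人事管理部门"]}

for token in absolute_positions(emit_with_synonyms(words, synonyms)):
    print(token)
```

With these inputs, 人事部 and both derived synonyms land on position 2 with offsets 2 to 5, which mirrors the layout of the pc/computer pair in the Elasticsearch output.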

from jcseg.

lionsoul2014 commented on May 25, 2024

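OK, thanks for the material. This problem has actually bothered me for quite a while, but I never had the energy to look into it. I will give it a try!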

from jcseg.

outshow commented on May 25, 2024

I don't know whether the author has solved this yet; I seem to have run into the same problem.

from jcseg.

lionsoul2014 commented on May 25, 2024

backup: https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html

from jcseg.

lionsoul2014 commented on May 25, 2024

@outshow @hupet Please try the latest code on the master branch. I pushed a fix commit, and initial testing looks OK.

from jcseg.

lionsoul2014 commented on May 25, 2024

After testing, this issue has been fixed: https://www.oschina.net/news/109718

from jcseg.
