Comments (10)

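lionsoul2014 commented on May 25, 2024

1. Regarding the first question: I tested this. If endOffset is set to the same value as the original word's, the token itself gets truncated, because Lucene-based products all create the BytesRef from startOffset and endOffset.
2. For Lucene's internal tokens, each position must be greater than the previous one. I tested this as well.

The current synonym approach does indeed cause highlight errors. I need to study how ES implements synonyms and borrow from it.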

from jcseg.

hupet commented on May 25, 2024

I ran a test. First, I built a custom analyzer with Elasticsearch's synonym filter:

PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch",
            "computer,pc"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }, 
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
            "type": "text",
            "analyzer": "my_synonyms"
        }
      }
    }
  }
}

Then I analyzed a sentence with this analyzer:

GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": ["this is a pc game"]
}

The result shows that the synonym's end_offset and position are both the same as the original word's:

...    
    {
      "token": "pc",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "computer",
      "start_offset": 10,
      "end_offset": 12,
      "type": "SYNONYM",
      "position": 3
    },...

Next, the same test with jcseg:

PUT /jcseg_index
{
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
          "type": "text",
          "analyzer": "jcseg_complex"
        }
      }
    }
  }
}

Analyzing the following sentence:

GET jcseg_index/_analyze
{
  "analyzer": "jcseg_complex",
  "text": ["他是人事部的经理"]
}

In this result, both end_offset and position differ between the original word and its synonyms:

    {
      "token": "人事部",
      "start_offset": 2,
      "end_offset": 5,
      "type": "word",
      "position": 2
    },
    {
      "token": "人事管理部",
      "start_offset": 2,
      "end_offset": 7,
      "type": "word",
      "position": 3
    },
    {
      "token": "人事管理部门",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 4
    },
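To make the offset difference concrete, here is a minimal sketch that slices the original sentence by the offsets jcseg reported above. It assumes, as a simplification, that a highlighter marks the span of the source text between start_offset and end_offset. The helper name `highlight` is made up for illustration and is not part of jcseg, Lucene, or Elasticsearch.

```python
# Simplified model: a highlighter wraps text[start_offset:end_offset] in tags.
# `highlight` is a hypothetical helper, not a jcseg/Lucene/Elasticsearch API.
# All characters here are in the BMP, so Python string indices line up with
# the offsets shown in the _analyze output.

text = "他是人事部的经理"

# (token, start_offset, end_offset) copied from the jcseg _analyze output above.
tokens = [
    ("人事部", 2, 5),
    ("人事管理部", 2, 7),
    ("人事管理部门", 2, 8),
]


def highlight(source, start, end):
    """Wrap the source span [start, end) in <em> tags."""
    return source[:start] + "<em>" + source[start:end] + "</em>" + source[end:]


# The part of the original sentence that each token's offsets actually cover.
covered = {term: text[start:end] for term, start, end in tokens}

# Highlighted output per token; only the first one marks exactly the word 人事部.
highlighted = {term: highlight(text, start, end) for term, start, end in tokens}
```

Under this model, the two synonym tokens cover characters that belong to the following words, which would explain a highlight that spills past 人事部. When a synonym reuses the original word's offsets, as in the Elasticsearch output earlier, the highlighted span stays on the original word.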

hupet commented on May 25, 2024

Continuing from the above: at query time the query text is analyzed as well. Here is the query analysis with the synonym filter:

GET my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "computer game"
    }
  }
}

The result:

{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "my_index",
      "valid": true,
      "explanation": """txt:"(computer pc) game""""
    }
  ]
}

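The synonyms computer and pc are grouped in parentheses to form an OR clause, so a search for either "pc game" or "computer game" finds the document.
Running the same query analysis with jcseg: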

GET jcseg_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "人事管理部门的经理"
    }
  }
}

The result:

{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "jcseg_index",
      "valid": true,
      "explanation": """txt:"人事管理部门 人事管理部 人事部 的 经 理""""
    }
  ]
}

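Because the three synonyms have different positions, they do not form an OR clause. As a result, a search for "人事部的经理" matches, but a search for "人事管理部门的经理" does not, which seems to defeat the purpose of synonyms.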
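The effect can be reproduced with a small model of phrase matching. This is a sketch, not Lucene's implementation: it assumes a phrase query requires, for each query position, at least one of that position's terms at the matching relative position in the document. The function names are made up for illustration, and the "sequential" token lists are an illustrative variant of the English example that mimics the jcseg behaviour (every synonym gets its own position).

```python
# Simplified model of phrase matching over (term, position) pairs.
# Names are hypothetical; this is not how Lucene is implemented internally.

def group_by_position(tokens):
    """Map each position to the set of terms that occupy it."""
    grouped = {}
    for term, pos in tokens:
        grouped.setdefault(pos, set()).add(term)
    return grouped


def phrase_matches(query_tokens, doc_tokens):
    """True if every query position has a term at the same relative doc position."""
    query = group_by_position(query_tokens)
    doc = group_by_position(doc_tokens)
    first = min(query)
    for start in doc:
        if all(terms & doc.get(start + pos - first, set())
               for pos, terms in query.items()):
            return True
    return False


# Synonyms stacked on one position, as in the synonym filter output above.
stacked_doc = [("this", 0), ("is", 1), ("a", 2),
               ("pc", 3), ("computer", 3), ("game", 4)]
stacked_query = [("computer", 0), ("pc", 0), ("game", 1)]  # "(computer pc) game"

# Illustrative variant: each synonym gets its own position, original word first.
sequential_doc = [("this", 0), ("is", 1), ("a", 2),
                  ("pc", 3), ("computer", 4), ("game", 5)]
sequential_query_computer = [("computer", 0), ("pc", 1), ("game", 2)]
sequential_query_pc = [("pc", 0), ("computer", 1), ("game", 2)]

stacked_hit = phrase_matches(stacked_query, stacked_doc)
sequential_hit_computer = phrase_matches(sequential_query_computer, sequential_doc)
sequential_hit_pc = phrase_matches(sequential_query_pc, sequential_doc)
```

In this model the stacked query matches, while the sequential version matches only when the query starts from the same word that appeared in the document, because the synonyms are then emitted in the same order on both sides.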

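lionsoul2014 commented on May 25, 2024

Thanks for the analysis; I think I understand now. I have always set the token.type passed to Lucene to word, whereas synonyms need to be set to SYNONYM, and the derived word's offset must match the original word's so that it is not truncated. I will put together a modified version and try it.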

hupet commented on May 25, 2024

There is also positionIncrement, which is probably the real key to synonyms. I suggest taking a look at the Lucene documentation: https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
The very first point under its usage notes says:
Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.

As I recall, I tried changing the type to SYNONYM and it had no effect; the key is still the position.
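As a rough illustration of the quoted rule, the sketch below turns a stream of position increments into absolute positions. It is a plain model rather than Lucene code: the function name `to_positions` is made up, and it assumes the position counter starts at -1 so that a first token with increment 1 lands on position 0, as in the _analyze output earlier in this thread.

```python
# Model of how position increments accumulate into absolute positions.
# `to_positions` is a hypothetical helper, not a Lucene API.

def to_positions(tokens):
    """tokens: list of (term, position_increment) -> list of (term, position)."""
    position = -1  # assumed start, so the first increment of 1 yields position 0
    result = []
    for term, increment in tokens:
        position += increment
        result.append((term, position))
    return result


# Synonym emitted with increment 0: it shares the position of the original word.
stacked = to_positions([
    ("this", 1), ("is", 1), ("a", 1),
    ("pc", 1), ("computer", 0),
    ("game", 1),
])

# Synonym emitted with increment 1: it is pushed to its own position,
# and every later token shifts by one as well.
sequential = to_positions([
    ("this", 1), ("is", 1), ("a", 1),
    ("pc", 1), ("computer", 1),
    ("game", 1),
])
```

With increment 0 the pair pc/computer ends up on position 3, matching the synonym filter output shown earlier; with increment 1 the synonym occupies position 4 and game moves to position 5.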

lionsoul2014 commented on May 25, 2024

OK, thanks for the material. This problem has actually bothered me for quite a while, but I never had the energy to look into it. I will give it a try!

outshow commented on May 25, 2024

I wonder whether the author has solved this problem yet; I seem to be running into it too.

lionsoul2014 commented on May 25, 2024

backup: https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html

lionsoul2014 commented on May 25, 2024

@outshow @hupet Please try the latest code on the master branch. I pushed a fix commit, and it passed initial testing.

lionsoul2014 commented on May 25, 2024

After testing, this issue has been fixed: https://www.oschina.net/news/109718
