Comments (10)
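1. On the first question: I tested this. If endOffset is set to the same value as the original word's, the token itself gets truncated, because Lucene-based products all build the BytesRef from startOffset and endOffset.
2. For Lucene's internal tokens, each position must be greater than the previous one. I tested this as well.
The current synonym approach does break highlighting. I need to study how ES implements synonyms and borrow from it.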
from jcseg.
I ran a test. First, I built a custom analyzer with Elasticsearch's synonym filter:
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british,english",
            "queen,monarch",
            "computer,pc"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
          "type": "text",
          "analyzer": "my_synonyms"
        }
      }
    }
  }
}
Then I used this analyzer to analyze a sentence:
GET my_index/_analyze
{
  "analyzer": "my_synonyms",
  "text": ["this is a pc game"]
}
In the result, the synonym's endOffset and position are both identical to the original word's:
...
{
  "token": "pc",
  "start_offset": 10,
  "end_offset": 12,
  "type": "<ALPHANUM>",
  "position": 3
},
{
  "token": "computer",
  "start_offset": 10,
  "end_offset": 12,
  "type": "SYNONYM",
  "position": 3
},...
Next, the same test with jcseg:
PUT /jcseg_index
{
  "mappings": {
    "doc": {
      "properties": {
        "txt": {
          "type": "text",
          "analyzer": "jcseg_complex"
        }
      }
    }
  }
}
Analyzing the following sentence:
GET jcseg_index/_analyze
{
  "analyzer": "jcseg_complex",
  "text": ["他是人事部的经理"]
}
In the result, endOffset and position both differ between the original word and its synonyms:
{
  "token": "人事部",
  "start_offset": 2,
  "end_offset": 5,
  "type": "word",
  "position": 2
},
{
  "token": "人事管理部",
  "start_offset": 2,
  "end_offset": 7,
  "type": "word",
  "position": 3
},
{
  "token": "人事管理部门",
  "start_offset": 2,
  "end_offset": 8,
  "type": "word",
  "position": 4
},
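One way to read the two outputs side by side is to turn the reported position values into the gap between each token and the one before it. The sketch below is plain Python, not part of Elasticsearch or jcseg; position_gaps is a hypothetical helper, and the input lists are just the "position" values copied from the tokens shown above.

```python
def position_gaps(positions):
    """Return the gap between each reported position and the previous one."""
    return [current - previous
            for previous, current in zip(positions, positions[1:])]


# "position" values copied from the two _analyze outputs above.
es_positions = [3, 3]        # pc, computer
jcseg_positions = [2, 3, 4]  # 人事部, 人事管理部, 人事管理部门

es_gaps = position_gaps(es_positions)
jcseg_gaps = position_gaps(jcseg_positions)

print(es_gaps)     # [0]: the synonym is stacked on the original word
print(jcseg_gaps)  # [1, 1]: each synonym advances to a new position
```

A gap of 0 means the two tokens occupy the same slot in the token stream; a gap of 1 means the second token is treated as the next word in the text.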
from jcseg.
Continuing from the above: at query time the query text itself is analyzed. Here is the query analysis with the synonym filter:
GET my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "computer game"
    }
  }
}
The result:
{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "my_index",
      "valid": true,
      "explanation": """txt:"(computer pc) game""""
    }
  ]
}
The synonyms computer and pc are grouped in parentheses to form an OR, so searching for either "pc game" or "computer game" finds the document.
With jcseg, the query analysis is:
GET jcseg_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "txt": "人事管理部门的经理"
    }
  }
}
The result:
{
  "valid": true,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "explanations": [
    {
      "index": "jcseg_index",
      "valid": true,
      "explanation": """txt:"人事管理部门 人事管理部 人事部 的 经 理""""
    }
  ]
}
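Because the three synonyms have different positions, they do not form an OR. As a result, searching for "人事部的经理" finds the document, but searching for "人事管理部门的经理" does not, which seems to defeat the purpose of synonyms.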
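The difference between the two explanations can be modeled with a small sketch. This is plain Python rather than Lucene's actual query builder; phrase_slots and explain are hypothetical helpers, and the token positions are assumptions inferred from the explanation strings (the validate output itself does not print positions).

```python
def phrase_slots(tokens):
    """Group (term, position) pairs so terms sharing a position share a slot."""
    slots = {}
    for term, position in tokens:
        slots.setdefault(position, []).append(term)
    return [slots[position] for position in sorted(slots)]


def explain(slots):
    """Render slots roughly the way the explanation strings look."""
    parts = []
    for terms in slots:
        if len(terms) == 1:
            parts.append(terms[0])
        else:
            parts.append("(" + " ".join(terms) + ")")
    return " ".join(parts)


# Assumed positions: the synonym filter stacks "pc" on "computer".
synonym_filter_tokens = [("computer", 0), ("pc", 0), ("game", 1)]
# Assumed positions: every jcseg token, synonyms included, gets its own slot.
jcseg_tokens = [("人事管理部门", 0), ("人事管理部", 1), ("人事部", 2),
                ("的", 3), ("经", 4), ("理", 5)]

print(explain(phrase_slots(synonym_filter_tokens)))  # (computer pc) game
print(explain(phrase_slots(jcseg_tokens)))           # 人事管理部门 人事管理部 人事部 的 经 理
```

In this model a slot holding several terms behaves as an OR, while one term per slot means the phrase requires every term in sequence.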
from jcseg.
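Thanks for the analysis, I think I get it now. I have always set the token.type passed to Lucene to word; synonyms need to be set to SYNONYM, and a derived word's offsets must match the original word's so that it is not truncated. I'll put together a build with this change and try it.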
from jcseg.
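There is also positionIncrement, which is probably the real key to synonyms. I suggest taking a look at the Lucene documentation: https://lucene.apache.org/core/7_3_0/core/org/apache/lucene/analysis/tokenattributes/PositionIncrementAttribute.html
The very first usage note there says: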
Set it to zero to put multiple terms in the same position. This is useful if, e.g., a word has multiple stems. Searches for phrases including either stem will match. In this case, all but the first stem's increment should be set to zero: the increment of the first instance should be one. Repeating a token with an increment of zero can also be used to boost the scores of matches on that token.
As I recall, I tried changing the type to SYNONYM and it had no effect; the key is still the position.
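The rule quoted from the Lucene documentation can be illustrated with a small model. This is a plain-Python sketch, not Lucene's synonym filter and not the jcseg fix; expand_synonyms and absolute_positions are hypothetical helpers, the "word" type label is only a placeholder, and the offsets for "this is a pc game" are worked out by hand. The model assumes the position counter starts at -1, so a first increment of 1 lands on position 0.

```python
def expand_synonyms(tokens, synonyms):
    """Emit each token, then its synonyms with increment 0 and the same offsets."""
    expanded = []
    for term, start, end in tokens:
        expanded.append({"token": term, "start_offset": start,
                         "end_offset": end, "type": "word", "increment": 1})
        for synonym in synonyms.get(term, []):
            expanded.append({"token": synonym, "start_offset": start,
                             "end_offset": end, "type": "SYNONYM", "increment": 0})
    return expanded


def absolute_positions(expanded):
    """Accumulate increments into positions, starting from -1 (an assumption)."""
    position = -1
    for token in expanded:
        position += token["increment"]
        token["position"] = position
    return expanded


# Hand-computed (term, start_offset, end_offset) for "this is a pc game".
tokens = [("this", 0, 4), ("is", 5, 7), ("a", 8, 9), ("pc", 10, 12), ("game", 13, 17)]
result = absolute_positions(expand_synonyms(tokens, {"pc": ["computer"]}))

for token in result:
    print(token["token"], token["start_offset"], token["end_offset"], token["position"])
```

In this model pc and computer both end up at position 3 with offsets 10 to 12, the same shape as the synonym filter output earlier in the thread, and game follows at position 4.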
from jcseg.
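OK, thanks for the material. This problem has actually bothered me for quite a while, and I never had the energy to dig into it. I'll give it a try!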
from jcseg.
Has the author fixed this yet? I seem to be running into the same problem.
from jcseg.
@outshow @hupet Please try the latest code on the master branch. I pushed a fix, and it passed initial testing.
from jcseg.
Tested, and the problem is fixed: https://www.oschina.net/news/109718
from jcseg.