The following is the official Lucene 9.7 example; only the stored value has been changed.

Indexing and searching with the same analyzer returns no hits [CLOSED]

SxunS commented on June 18, 2024

Indexing and searching with the same analyzer returns no hits.

from hanlp.

Comments (4)

hankcs commented on June 18, 2024

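hanlp-lucene-plugin currently supports Lucene 7.2.0 and does not support Lucene 9.7. The class org.apache.lucene.analysis.util.TokenizerFactory does not exist in Lucene 9.7, so the code could not have compiled. Either you are not running the code you listed and are actually using a different tokenizer, or you are not running the official version.
With Lucene 7.2.0 the search does find the document: https://github.com/hankcs/hanlp-lucene-plugin/blob/c6be0de363022a38436490cd19761881ebad41e8/src/test/java/com/hankcs/lucene/HanLPAnalyzerTest.java#L87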

    public void testIndexAndSearch() throws Exception
    {
        // Use the same analyzer for indexing and for query parsing
        Analyzer analyzer = new HanLPAnalyzer();
        IndexWriterConfig config = new IndexWriterConfig(analyzer);
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        Directory directory = new RAMDirectory();
        IndexWriter indexWriter = new IndexWriter(directory, config);

        // Index a single document with one stored text field
        Document document = new Document();
        document.add(new TextField("content", "**人", Field.Store.YES));
        indexWriter.addDocument(document);

        indexWriter.commit();
        indexWriter.close();

        // Search for the same text that was indexed; exactly one hit is expected
        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse("**人");
        ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;
        assertEquals(1, hits.length);
        for (ScoreDoc scoreDoc : hits)
        {
            // Print the stored field value of each matching document
            Document targetDoc = isearcher.doc(scoreDoc.doc);
            System.out.println(targetDoc.getField("content").stringValue());
        }
    }

SxunS commented on June 18, 2024

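Sorry, you are right. Because the project is built with Maven, I did not notice that the org.apache.lucene.analysis.util.TokenizerFactory class actually in use does come from Lucene 7.2.0. That is why compilation did not fail (I am running the official version).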
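When several Lucene versions may end up on a Maven classpath, it can help to ask the JVM where a class was actually loaded from. The following is a minimal sketch using only the JDK; the class name `ClassOrigin` and the method `describe` are made up for this illustration and are not part of HanLP or Lucene.

```java
import java.security.CodeSource;

public class ClassOrigin {
    /** Describes where the named class was loaded from, or notes that it is absent. */
    public static String describe(String className) {
        try {
            // Load without running static initializers; we only need the class metadata.
            Class<?> cls = Class.forName(className, false, ClassOrigin.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            // Classes that ship with the JDK itself report no code source.
            return source == null ? "(JDK runtime)" : String.valueOf(source.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.lucene.analysis.util.TokenizerFactory";
        // With Lucene on the classpath this prints the JAR that provides the class.
        System.out.println(name + " -> " + describe(name));
    }
}
```

Running `mvn dependency:tree` is another way to see which Lucene artifacts Maven resolved, including transitive ones.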
For the test case posted above, I set up a clean environment from scratch. The Maven dependencies are as follows:

<dependencies>
    <dependency>
      <groupId>com.hankcs.nlp</groupId>
      <artifactId>hanlp-lucene-plugin</artifactId>
      <version>1.1.7</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.13.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>com.hankcs</groupId>
      <artifactId>hanlp</artifactId>
      <version>portable-1.8.4</version>
    </dependency>
  </dependencies>

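The result is still the same as in the problem description.
3. After removing the portable-1.8.4 dependency, the search returns the expected hit, so I suspect the problem is related to com.hankcs:hanlp:portable-1.8.4.
4. With the portable-1.8.4 dependency included, the test result:

5. With the portable-1.8.4 dependency removed, the test result: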

SxunS commented on June 18, 2024

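Additional notes:

portable-1.7.6: search works
portable-1.8.4: search fails (Figure 1)
Using option 2 (release jar + data + properties): search works

What is the difference between portable and the release jar? Is it only that the data dictionaries and models differ?
With portable I also used the custom dictionary (downloaded from the official source).
properties configuration: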

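#Root directory for the paths in this configuration file; root + other path = full path (relative paths are supported, see: https://github.com/hankcs/HanLP/pull/254)
#Windows users: always use / as the path separator
root=E:/xx/demo/document-search/document-search/document-search-server/src/main/resources

#That is the only part that must be modified; uncomment and edit the options below as needed.
Normalization=true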
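The "root + other path = full path" rule from the comment in the configuration can be sketched in plain Java. This is only an illustration of the rule, not HanLP's actual loading code; the class `RootPathDemo`, its methods, the shortened root, and the `CoreDictionaryPath` entry are assumptions chosen for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class RootPathDemo {
    /** Parses properties text; wraps the checked exception for convenience. */
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Joins the configured root with the relative path stored under the given key. */
    public static String resolve(Properties props, String key) {
        String relative = props.getProperty(key);
        if (relative == null) {
            return null; // key not configured
        }
        String root = props.getProperty("root", "");
        // Make sure exactly one separator sits between root and the relative path.
        if (!root.isEmpty() && !root.endsWith("/")) {
            root += "/";
        }
        return root + relative;
    }

    public static void main(String[] args) {
        // Hypothetical configuration: a root plus one path relative to it.
        Properties props = parse(
            "root=E:/demo/resources\n"
            + "CoreDictionaryPath=data/dictionary/CoreNatureDictionary.txt\n");
        System.out.println(resolve(props, "CoreDictionaryPath"));
    }
}
```

Forward slashes are used throughout, in line with the note for Windows users in the configuration comments.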

hankcs commented on June 18, 2024

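Commit 3a99bc6 most likely introduced an initialization bug.
The portable version loads the small (mini) model by default.
The bug only affects the result of the first segmentation that the mini model performs after the JRE starts.
If you use the mini model, please use a version earlier than https://github.com/hankcs/HanLP/releases/tag/v1.8.1. Otherwise, portable or not, you are unaffected as long as your hanlp.properties does not load the mini model.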
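A bug that changes only the first segmentation would plausibly explain the missed hit: the document is tokenized by the first call and the query by a later one, so the two token streams can differ. A generic way to check for this kind of cold-start instability is to segment the same text twice and compare. The sketch below uses only the JDK; the segmenter is passed in as a function, and `coldStartBuggy` is a hypothetical stand-in that merely imitates such a bug, not HanLP code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class FirstCallCheck {
    /** Returns true when two consecutive calls on the same text yield the same tokens. */
    public static boolean isStable(Function<String, List<String>> segmenter, String text) {
        List<String> first = segmenter.apply(text);
        List<String> second = segmenter.apply(text);
        return first.equals(second);
    }

    /** Hypothetical stand-in: the first call splits per character, later calls keep the text whole. */
    public static Function<String, List<String>> coldStartBuggy() {
        boolean[] warmedUp = {false};
        return text -> {
            if (!warmedUp[0]) {
                warmedUp[0] = true;
                return Arrays.asList(text.split(""));
            }
            return Arrays.asList(text);
        };
    }

    public static void main(String[] args) {
        // The stand-in behaves differently on its first call, so it is reported as unstable.
        System.out.println(isStable(coldStartBuggy(), "hanlp"));
        // A stateless whitespace tokenizer gives the same result every time.
        System.out.println(isStable(t -> Arrays.asList(t.split(" ")), "index search"));
    }
}
```

In a real project the function would wrap the segmenter or analyzer under test, and the check would need to run as the very first segmentation in a fresh JVM to be meaningful.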

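Thanks for the report. This has been fixed; please check whether the commit above resolves the problem.
If problems remain, feel free to reopen the issue.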
