Comments (4)
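- hanlp-lucene-plugin currently supports Lucene 7.2.0; it does not support Lucene 9.7. The class
org.apache.lucene.analysis.util.TokenizerFactory
does not exist in Lucene 9.7, so your code could not have compiled. Either you are running a different tokenizer rather than the code you listed, or you are not running the official release.
- The search failure does not occur with Lucene 7.2.0: https://github.com/hankcs/hanlp-lucene-plugin/blob/c6be0de363022a38436490cd19761881ebad41e8/src/test/java/com/hankcs/lucene/HanLPAnalyzerTest.java#L87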
public void testIndexAndSearch() throws Exception
{
    // Use the same analyzer for indexing and for query parsing
    Analyzer analyzer = new HanLPAnalyzer();
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    Directory directory = new RAMDirectory();

    // Index a single document into the in-memory directory
    IndexWriter indexWriter = new IndexWriter(directory, config);
    Document document = new Document();
    document.add(new TextField("content", "**人", Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();
    indexWriter.close();

    // Search for the same text that was indexed
    IndexReader ireader = DirectoryReader.open(directory);
    IndexSearcher isearcher = new IndexSearcher(ireader);
    QueryParser parser = new QueryParser("content", analyzer);
    Query query = parser.parse("**人");
    ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;

    // Exactly one document should match
    assertEquals(1, hits.length);
    for (ScoreDoc scoreDoc : hits)
    {
        Document targetDoc = isearcher.doc(scoreDoc.doc);
        System.out.println(targetDoc.getField("content").stringValue());
    }
}
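- Sorry, you are right. Because the project is built with Maven, I did not notice which Lucene version was actually in use. The class
org.apache.lucene.analysis.util.TokenizerFactory
does exist in Lucene 7.2.0, which is why compilation did not fail (I am running the official release).
- For the test case above, I set up a fresh, clean environment. The Maven dependency coordinates are as follows: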
<dependencies>
    <dependency>
        <groupId>com.hankcs.nlp</groupId>
        <artifactId>hanlp-lucene-plugin</artifactId>
        <version>1.1.7</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.13.2</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>com.hankcs</groupId>
        <artifactId>hanlp</artifactId>
        <version>portable-1.8.4</version>
    </dependency>
</dependencies>
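The result is still the same as described in the issue.
3. After removing the portable-1.8.4 dependency, the search returns the document as expected, so the problem may be related to com.hankcs:hanlp:portable-1.8.4.
4. With the portable-1.8.4 dependency included, test result: (screenshot not preserved)
5. With the portable-1.8.4 dependency removed, test result: (screenshot not preserved)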
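Additional notes:
- portable-1.7.6: queries work correctly
- portable-1.8.4: queries fail (Figure 1)
- Using option 2 (release jar + data + properties): queries work correctly
What is the difference between the portable build and the release jar? Is it only that the data dictionaries and models differ?
With the portable build I am also using a custom dictionary (downloaded from the official source).
properties configuration:
# Root directory for the paths in this configuration file; root + other path = full path (relative paths are supported, see https://github.com/hankcs/HanLP/pull/254)
# Windows users: always use / as the path separator
root=E:/xx/demo/document-search/document-search/document-search-server/src/main/resources
# That is the only part that must be modified; uncomment and edit the options below as needed.
Normalization=true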
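To rule out a configuration problem, it can help to confirm how the two settings above are actually parsed. The following is a minimal sketch, assuming the file follows the standard java.util.Properties syntax; the class HanLPPropertiesCheck and its methods are hypothetical helpers for illustration, not part of HanLP, and the fallback to false for a missing Normalization key is an assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not part of HanLP): parses properties text with the
// standard java.util.Properties rules and reports the two settings shown above.
class HanLPPropertiesCheck {

    /** Parses properties text; lines starting with '#' are treated as comments. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Assumption: a missing Normalization key is treated as false. */
    static boolean normalizationEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty("Normalization", "false").trim());
    }

    public static void main(String[] args) {
        String sample =
                "# Windows users: always use / as the path separator\n"
              + "root=E:/xx/demo/document-search/document-search/document-search-server/src/main/resources\n"
              + "Normalization=true\n";
        Properties props = parse(sample);
        System.out.println("root = " + props.getProperty("root"));
        System.out.println("Normalization = " + normalizationEnabled(props));
    }
}
```

In the java.util.Properties format a backslash starts an escape sequence, so a Windows path written with \ would not survive parsing intact; this is consistent with the advice in the file to use / everywhere.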
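- Commit 3a99bc6 probably introduced an initialization bug
- The portable version loads the small (mini) model by default
- The bug only affects the result of the first segmentation performed by the mini model after the JRE starts
- If you use the mini model, please use a version earlier than https://github.com/hankcs/HanLP/releases/tag/v1.8.1. Otherwise, portable or not, you are unaffected as long as your hanlp.properties does not load the mini model.
Thanks for the report. This has been fixed; please check whether the commit above resolves the problem.
If the problem persists, feel free to reopen the issue.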
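The version advice above can be checked mechanically. The following is a minimal sketch, assuming version strings of the form seen in this thread ("portable-1.8.4", "v1.8.1"); HanLPVersionCheck and its methods are hypothetical names for illustration, not a HanLP API. It only encodes the "earlier than v1.8.1" rule and says nothing about releases that already contain the fix.

```java
// Hypothetical helper (not part of HanLP): compares dotted version strings
// after stripping a non-numeric prefix such as "portable-" or "v".
class HanLPVersionCheck {

    /** Extracts the numeric components, e.g. "portable-1.8.4" -> {1, 8, 4}. */
    static int[] parse(String version) {
        String digits = version.replaceFirst("^[^0-9]*", "");
        String[] parts = digits.split("\\.");
        int[] out = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        return out;
    }

    /** Returns true when version is strictly earlier than bound. */
    static boolean isBefore(String version, String bound) {
        int[] a = parse(version);
        int[] b = parse(bound);
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = i < a.length ? a[i] : 0; // missing components count as 0
            int y = i < b.length ? b[i] : 0;
            if (x != y) {
                return x < y;
            }
        }
        return false; // equal versions are not "before"
    }

    public static void main(String[] args) {
        for (String v : new String[] {"portable-1.7.6", "portable-1.8.4"}) {
            System.out.println(v + " predates v1.8.1: " + isBefore(v, "v1.8.1"));
        }
    }
}
```

For the two versions reported in this thread, the check agrees with the observations: portable-1.7.6 predates v1.8.1 and worked, while portable-1.8.4 does not and showed the problem.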