Comments (8)
Please post the version of the plugin jar you are using.
from hanlp.
It looks like a plugin built for Lucene 4.x is being used with Lucene 5.x.
from hanlp.
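A question: with Lucene 5.2.1 (hanlp-portable-1.2.4, hanlp-solr-plugin-1.0), the index is created successfully, but searches return no hits. The search raises no error, and the term is definitely present in the indexed text.
Building the index:
Analyzer analyzer = new HanLPAnalyzer(); // HanLP analyzer used at index time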
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
IndexWriter indexWriter = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("time", res, Store.YES));
document.add(new TextField("content", content, Store.YES));
indexWriter.addDocument(document);
indexWriter.commit();
closeWriter(indexWriter);
Searching:
Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
Analyzer analyzer = new HanLPAnalyzer(); // same analyzer type used at query time
IndexReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse(text);
ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;
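Searching for "time:[20140101 TO 20150101]" returns hits, but searching for the word "被告人" (defendant) returns 0 hits, even though the word is definitely in the indexed text. Do you know what might cause this?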
from hanlp.
It works correctly in my test:
Analyzer analyzer = new HanLPAnalyzer(); // one analyzer instance shared by indexing and searching
IndexWriterConfig config = new IndexWriterConfig(analyzer);
config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
String INDEX_DIR = System.getProperty("java.io.tmpdir") + File.separator + "index";
Directory directory = FSDirectory.open(Paths.get(INDEX_DIR));
IndexWriter indexWriter = new IndexWriter(directory, config);
Document document = new Document();
document.add(new TextField("content", "被公诉机关指控涉嫌犯罪的当事人称作被告人。", Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("content", "商品和服务", Field.Store.YES));
indexWriter.addDocument(document);
document = new Document();
document.add(new TextField("content", "和服的价格是每镑15便士", Field.Store.YES));
indexWriter.addDocument(document);
indexWriter.commit();
indexWriter.close();
IndexReader ireader = DirectoryReader.open(directory);
IndexSearcher isearcher = new IndexSearcher(ireader);
QueryParser parser = new QueryParser("content", analyzer);
Query query = parser.parse("被告人");
ScoreDoc[] hits = isearcher.search(query, 300000).scoreDocs;
for (ScoreDoc scoreDoc : hits)
{
Document targetDoc = isearcher.doc(scoreDoc.doc);
System.out.println(targetDoc.getField("content").stringValue());
}
If this test also passes on your side, then the problem is probably somewhere else.
from hanlp.
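After a day of digging I found the cause: when the string being added (content) contains two or more newline characters, the text after them is not recognized. For example, after indexing "\n\n" + "被公诉机关指控涉嫌犯罪的当事人称作被告人。", a search for 被告人 returns 0 results.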
from hanlp.
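For anyone stuck on a plugin version without the fix, one possible workaround suggested by the finding above is to strip line breaks from the text before adding it to the document. The following is a minimal sketch using only the JDK. `NewlineNormalizer` and `normalize` are hypothetical names, not part of HanLP or Lucene, and the approach is an assumption that has not been verified against the affected plugin version.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of HanLP or Lucene): removes line breaks
// from text before it is handed to the analyzer at index time.
final class NewlineNormalizer {
    // \R matches any line-break sequence (\n, \r\n, \r, and Unicode line separators);
    // the + makes a whole run of consecutive breaks match at once.
    private static final Pattern LINE_BREAKS = Pattern.compile("\\R+");

    private NewlineNormalizer() {}

    // Replaces every run of line breaks with a single space and trims the ends.
    static String normalize(String text) {
        if (text == null) {
            return "";
        }
        return LINE_BREAKS.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The failing input reported in this thread.
        String raw = "\n\n" + "被公诉机关指控涉嫌犯罪的当事人称作被告人。";
        System.out.println(normalize(raw));
    }
}
```

The cleaned string would then be passed to `new TextField("content", NewlineNormalizer.normalize(content), Store.YES)` in place of the raw `content`. Note that the stored field value also loses its line breaks this way, which may or may not be acceptable for display.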
Thanks for tracking this down. The problem is confirmed, and I will fix this bug right away.
from hanlp.
This should be fixed now. If you still run into problems, feel free to open another issue.
from hanlp.
Thank you very much!
from hanlp.