castorini / pyserini
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
Home Page: http://pyserini.io/
License: Apache License 2.0
We have scripts here and there for reading in TREC runs, interpolating runs, etc. Should we add them to Pyserini, organized under a nice API?
That would be better than an endless proliferation of scripts...
Thoughts @rodrigonogueira4 @x65han ?
Follow up to #91
Let's document these new features?
Based on our new conventions: https://github.com/lintool/guide/blob/master/coding-style.md
Hi @zeynepakkalyoncu - do we have an actual use case for `dump_document_vectors` in `IndexReaderUtils`? I'm trying to write a use case for it, and I can't even get it to work - it throws a mysterious `jnius.JavaException: JVM exception occurred` error.
If you're using this feature somewhere (in a notebook), we should make sure it works and write a test case... otherwise I would suggest removing it until a real use case comes up.
Feature request from @cmacdonald -
Compute scores wrt a set of documents specified by the user, i.e., "rerank this set for me". Could decompose into "score this query wrt this document" with an outer loop over documents.
Can be accomplished today by setting k to be a really large value and then filtering results... but having a better implementation of this feature would be generally useful.
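The "set k large, then filter" workaround can be sketched as follows; the helper name and the (docid, score) tuples are hypothetical stand-ins for searcher hits, not actual Pyserini API:

```python
def rerank_subset(hits, candidate_docids, k=10):
    # Keep only hits whose docid is in the user-specified set,
    # then re-rank by score and truncate to the top k.
    filtered = [h for h in hits if h[0] in candidate_docids]
    filtered.sort(key=lambda h: h[1], reverse=True)
    return filtered[:k]

# Toy ranked list of (docid, score) pairs standing in for searcher output.
hits = [('d1', 9.1), ('d2', 8.7), ('d3', 8.2), ('d4', 7.9)]
print(rerank_subset(hits, {'d1', 'd3'}, k=2))  # [('d1', 9.1), ('d3', 8.2)]
```

A first-class implementation would instead score the query against exactly the supplied documents, avoiding the need to over-retrieve.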
Is it possible to dump out tf-idf document vectors for retrieved documents in pyserini?
Obtained a JavaException error after importing:
from pyserini.search import pysearch
--> Error:
... jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
Installed Pyserini via:
pip install pyserini --user
I was not able to resolve this. Complete error:
Traceback (most recent call last):
  File "build_db.py", line 6, in <module>
    from pyserini.search import pysearch
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/search/pysearch.py", line 25, in <module>
    from ..pyclass import JSearcher, JResult, JDocument, JString, JArrayList, JTopics, JTopicReader
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/pyclass.py", line 51, in <module>
    JDefaultEnglishAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
  File "/home/pboers/.local/lib/python3.7/site-packages/jnius/reflect.py", line 208, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
Let's try to replicate 20 newsgroup classification w/ scikit-learn using Pyserini and Anserini:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
That is, use Pyserini to extract the tf-idf vectors that feed the classifiers in scikit-learn.
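A minimal sketch of the idea, assuming per-document term-frequency dicts of the shape returned by `get_document_vector` (the `to_matrix` helper is hypothetical); the resulting count matrix is what a scikit-learn tf-idf transformer would consume:

```python
def to_matrix(doc_vectors):
    # doc_vectors: {docid: {term: tf}}, i.e., get_document_vector-style output.
    # Returns a sorted vocabulary and a dense term-count matrix, one row per doc.
    vocab = sorted({t for v in doc_vectors.values() for t in v})
    col = {t: j for j, t in enumerate(vocab)}
    rows = []
    for tf in doc_vectors.values():
        row = [0] * len(vocab)
        for term, count in tf.items():
            row[col[term]] = count
        rows.append(row)
    return vocab, rows

vecs = {'d1': {'black': 2, 'cat': 1}, 'd2': {'cat': 3}}
vocab, matrix = to_matrix(vecs)
print(vocab)   # ['black', 'cat']
print(matrix)  # [[2, 1], [0, 3]]
```

In practice one would build a `scipy.sparse` matrix rather than dense rows, since the vocabulary is large.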
Issue transferred from castorini/anserini#1120
Initial work by @yuki617 here:
https://github.com/yuki617/anserini/blob/tfidf/20newgroup_replication.ipynb
Working on the 20 newsgroup dataset and encountered the following error:
Also attached the problematic document:
pyserini==0.8.1.0
Related to #43 - currently `IndexReaderUtils` exposes an `analyze` method:
https://github.com/castorini/pyserini/blob/master/pyserini/index/pyutils.py
This is hard-coded to a default analyzer. We should think about how to expose arbitrary Lucene analyzers in general... what would the API look like?
Hi @zeynepakkalyoncu we should check out Guido's materials for AFIRM 2019:
https://github.com/ielab/afirm2019
This is a summary of the issue presented in #32
Consider this fragment:
>>> from pyserini.index import pyutils
>>>
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> postings_list = index_utils.get_postings_list('black')
>>>
>>> for i in range(0, 10):
... print('{}'.format(postings_list[i]))
...
(6, 2) [555,606]
(29, 1) [410]
(32, 2) [65,462]
(35, 2) [288,475]
(56, 1) [662]
(60, 1) [69]
(61, 1) [110]
(63, 1) [195]
(74, 2) [230,518]
(96, 1) [107]
The docids (e.g., 6 in the first posting) refer to internal Lucene docids, which are different from external docids (i.e., those in the collection).
Use this hidden method `convertLuceneDocidToDocid` to convert, as in:
>>> for i in range(0, 10):
... print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
...
LA111289-0011 (6, 2) [555,606]
LA092890-0052 (29, 1) [410]
LA022489-0041 (32, 2) [65,462]
LA051990-0051 (35, 2) [288,475]
LA092890-0077 (56, 1) [662]
LA022489-0061 (60, 1) [69]
LA021889-0073 (61, 1) [110]
LA110689-0057 (63, 1) [195]
LA080789-0088 (74, 2) [230,518]
LA021889-0117 (96, 1) [107]
The TODO is to explicitly expose `convertLuceneDocidToDocid`.
Similarly, we can use `convertDocidToLuceneDocid` to convert an external collection docid into an internal docid:
>>> from jnius import autoclass
>>> JString = autoclass('java.lang.String')
>>> index_utils.object.convertDocidToLuceneDocid(index_utils.reader, JString("LA052189-0089"))
200443
We can verify as follows:
>>> for i in range(len(postings_list)):
... if postings_list[i].docid == 200443:
... print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
...
LA052189-0089 (200443, 64) [18,133,175,212,225,244,262,273,307,320,344,372,388,431,438,454,464,541,576,583,616,640,772,778,801,831,838,885,891,912,937,952,970,1123,1151,1165,1180,1210,1215,1231,1270,1307,1346,1431,1436,1507,1514,1542,1546,1550,1663,1676,1726,1750,1764,1769,1781,1784,1838,1847,1873,1880,1922,1971]
Which matches exactly what we get from `get_document_vector`:
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> doc_vector = index_utils.get_document_vector("LA052189-0089")
>>> doc_vector['black']
64
import json
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('/home/ds/anserini/covid-2020-04-17/lucene-index-covid-paragraph-new/')
hits = searcher.search('nsp1 synthesis degradation', 10)
article = json.loads(searcher.doc('42saxb98').raw)
#print(json.dumps(article, indent=4))
article['metadata']['title']
Error:
AttributeError Traceback (most recent call last)
in ()
----> 1 article = json.loads(searcher.doc('42saxb98').raw)
2
3 # Uncomment to print the entire article... warning, it's long! :)
4 #print(json.dumps(article, indent=4))
5
AttributeError: 'Document' object has no attribute 'raw'
I was trying to run the demo code. Importing gives me this error:
File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
How do I index my own data in Pyserini? All the notebooks and examples use prebuilt indexes.
As a result of castorini/anserini#1027 - we need to refactor analyzers in Pyserini.
Multiple people have asked for the ability to perform indexing from Python. This shouldn't be too hard, we just need to properly expose `IndexCollection`.
We would like to be able to search an index without the query being processed by the stemmer. The specific use-case would be for the background linking task of TREC. We want to use the document vectors (that contain stemmed terms) to construct a new query.
The current invocation is something like:
from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
@zeynepakkalyoncu @emmileaf don't you think we have one unnecessary nested layer?
Would something like this make more sense?
from pyserini import search
searcher = search.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
or
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
I first indexed my docs using `lucene-8.5.1` in Java; then after running
searcher = pysearch.SimpleSearcher('/home/sipah00/java_lucene/lucene_idx1/')
I got this error: `JavaException: JVM exception occurred: Could not load codec 'Lucene84'. Did you forget to add lucene-backward-codecs.jar?`
How do I resolve this issue? Is there a specific version of `lucene` that `pyserini` supports?
Also, how can I index my docs in `pyserini` only?
It'd be nice to be able to replicate standard regression runs directly from Python, something like:
python -m pyserini.search_collection ...
We should be able to get exactly the same output as from Java.
This no longer works:
from pyserini.search import pysearch
topics = pysearch.get_topics('msmarco_passage_dev_subset')
The reason is that the Java end uses generics, and so pyjnius can't properly dispatch to the method. See: kivy/pyjnius#134
I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:
index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}
output:
tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}
I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."
With RM3, we get an NPE when we try to search with a query built using the querybuilder.
I need to access statistics of my index like tf, tf-idf, etc. I want to answer questions like: In how many documents does a specific term occur? Which documents does a specific term occur in? What terms occur in the 1st document? Is there any way to do this using Pyserini? Thanks!
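All three questions map naturally onto an inverted index; here is a toy pure-Python illustration of the lookups involved (with Pyserini itself these would be calls along the lines of `get_term_counts`, `get_postings_list`, and `get_document_vector`):

```python
from collections import defaultdict

# Toy analyzed corpus standing in for an index.
docs = {'d1': ['black', 'cat'], 'd2': ['black', 'dog'], 'd3': ['dog']}

# Build the term -> set-of-docids postings.
postings = defaultdict(set)
for docid, terms in docs.items():
    for term in terms:
        postings[term].add(docid)

df = len(postings['black'])        # in how many documents does the term occur?
where = sorted(postings['black'])  # which documents does it occur in?
doc_terms = docs['d1']             # what terms occur in a given document?
print(df, where, doc_terms)        # 2 ['d1', 'd2'] ['black', 'cat']
```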
The main `README.md` is getting pretty long. Should we break it into separate pages?
@chriskamphuis @PepijnBoers @x65han thoughts?
@x389liu has been working with Spacy and Pyserini.
She's offered to write up a guide to take Pyserini output and do basic NLP on it... e.g., sentence chunking, NER, etc.
Now that Pyserini is reasonably stable... we kinda need a testing framework...
https://github.com/castorini/pyserini/blob/master/pyserini/trectools/_base.py#L17
from __future__ import annotations
This, from what I understand, forces users to Python 3.7. Is this okay?
I don't mind either way, but we should have a discussion about it... and make the decision across all castorini.
@rodrigonogueira4 @x65han @ronakice thoughts?
We have:
# Pass in a no-op analyzer:
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
index_utils.get_term_counts(term, analyzer=analyzer)
df, cf = index_utils.get_term_counts(term)
Here, we take an `analyzer`.
And:
# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyze=False)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
Here, we take a bool. Let's make both consistent? How about both take an `analyzer` and accept `None`? Passing in a "no-op" analyzer seems a bit janky.
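The proposed dispatch could look like the sketch below; `lookup_term` is a hypothetical helper and `str.lower` stands in for a real Lucene analyzer:

```python
def lookup_term(term, analyzer=None):
    # Proposed convention: analyzer=None means the term is already
    # analyzed and should be used verbatim, instead of requiring the
    # caller to construct a no-op analyzer.
    if analyzer is not None:
        term = analyzer(term)
    return term

print(lookup_term('Black', analyzer=str.lower))  # 'black'
print(lookup_term('cheroke'))                    # 'cheroke' (used as-is)
```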
Thoughts? @PepijnBoers @chriskamphuis
I am trying to import pysearch from pyserini.search. I set the JAVA_HOME variable to jdk11. I am running this using Jupyter Notebook and I am getting the error as shown.
import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'
from pyserini.search import pysearch
The error is:
ValueError: VM is already running, can't set classpath/options; VM started at File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
I initially was running in a virtual environment (conda environment). I tried coming out of it and execute, but I am still getting the same issue. Any workarounds? Or am I missing something?
I am using MacOS Catalina.
Currently, we have
numpy==1.16.4
scipy==1.4.1
The latest `numpy` appears to be 1.18.4; `scipy` appears to be up to date.
- Arg layout should be consistent with the Anserini structure: change `-prf` to `-prcl`, since we've decided to name the class `PseudoRelevanceClassifierReranker`
- Args should be grouped, e.g., `-prcl.r`, `-prcl.n`, `-prcl.alpha`
- Use tqdm for the progress indicator
Write test case: you can assume the CACM test index: https://github.com/castorini/pyserini/blob/master/tests/test_indexutils.py#L31
This would be similar to the commit hook in anserini: https://github.com/castorini/anserini/blob/master/.travis.yml#L20
I think the output file should be obligatory.
Hello! Thank you for your work!
I have some issues and I am not sure whether it's a bug in the API or due to my misunderstanding of some semantics related to the API.
When I run these two programs and analyze the frequency of the term "standard" in some documents, for the same document id I get a count of 3 from the first program and a count of 1 from the second. The term "standard" was used here just for illustration; I encounter the same problem with other terms.
I will be very grateful to get a feedback about this issue.
Best regards,
Call it `FusionSearcher` or something like that. Combines results via RRF by default.
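Reciprocal rank fusion itself is simple; a minimal sketch (function name hypothetical) using the usual score(d) = sum over runs of 1 / (k + rank), with k = 60:

```python
from collections import defaultdict

def rrf_fuse(runs, k=60):
    # Each run is a ranked list of docids; a document's fused score is
    # the sum of 1 / (k + rank) over every run it appears in.
    scores = defaultdict(float)
    for run in runs:
        for rank, docid in enumerate(run, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

run_a = ['d1', 'd2', 'd3']
run_b = ['d2', 'd3', 'd1']
print(rrf_fuse([run_a, run_b]))  # ['d2', 'd1', 'd3']
```

A `FusionSearcher` would wrap several searchers, collect one run per searcher, and apply a fusion like this to the results.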
Right now we have:
$ python -m pyserini.search.pysearch
usage: pysearch.py [-h] -index path -topics path -output path
Maybe let's move to:
$ python -m pyserini.search
usage: pysearch.py [-h] -index path -topics path -output path
What do you think @yuki617 ?
I do this:
from pyserini.index import pyutils
index_utils = pyutils.IndexReaderUtils('lucene-index.robust04.pos+docvectors+rawdocs')
iter1 = index_utils.terms()
iter2 = index_utils.terms()
for term in iter1:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
As expected, I iterate over all terms. Then:
for term in iter2:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
Gives nothing... is `terms()` returning the same iterator every time? That doesn't seem like the right behavior?
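That is exactly how a Python generator behaves once exhausted: if `terms()` hands back the same underlying iterator each time, the second loop sees nothing. A quick illustration of the symptom, plus caller-side workarounds:

```python
import itertools

def terms():
    # Stand-in for an API method; yields a stream of terms.
    yield from ['black', 'cat']

it1 = terms()
print(list(it1))  # ['black', 'cat']
print(list(it1))  # [] -- the iterator is exhausted, as in the bug above

# Workarounds: materialize the terms once, or duplicate the iterator.
snapshot = list(terms())
a, b = itertools.tee(terms())
print(snapshot == list(a) == list(b))  # True
```

The proper fix on the library side would be for `terms()` to return a fresh iterator on every call.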
Hi. I'm getting the error "ImportError: DLL load failed: The specified module could not be found." when trying to import both of:
from pyserini.search import pysearch
from pyserini.search.pysearch import SimpleSearcher
The modules have installed and they are suggested by the editor when typing them. I am using PyCharm, Python 3.7.6, and tried both pyserini==0.9.0.0 and pyserini==0.9.3.0.
What should I use to make indexes? I was looking at a few options - Solr, Elasticsearch, Lucene, etc.
We've been having an issue where `Collection` iterators sometimes freeze when used in a forked process. This is not specific to Pyserini and can be reproduced with only pyjnius and a `BufferedReader`. See below for a minimal script to reproduce it.
This issue reliably occurs if `BufferedReader` is run on a "large" directory, where "large" depends on a combination of the number of files and their sizes. It does not happen on a directory with 2000 empty files (created with `touch $fn`), but it does happen if this is increased to 3000 empty files. It also happens if the 2000 files are 1MB each (`dd if=/dev/zero of=$fn bs=1M count=1`) rather than empty. (I've also reproduced it on a random directory containing 265 files of varying sizes.)
Script:
import os
import sys
from multiprocessing import Pool
from jnius import autoclass

jstr = autoclass("java.lang.String")
jbr = autoclass("java.io.BufferedReader")
jfr = autoclass("java.io.FileReader")

def jprint(x):
    fr = jfr(x)
    f = jbr(fr)
    while True:
        line = f.readLine()
        if line is None:
            print("break")
            f.close()
            fr.close()
            break
        else:
            print("not none")

if __name__ == "__main__":
    dir = sys.argv[1]
    p = Pool(5)
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    p.map(jprint, fns)
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.9.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary pyserini -w /tmp/ext pyserini==0.9.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting pyserini==0.9.2.0
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/6bb/4e22d7cb0a83a/pyserini-0.9.2.0.tar.gz (57.8 MB)
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-we7atqws/pyserini/pip-egg-info
cwd: /tmp/pip-wheel-we7atqws/pyserini/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-wheel-we7atqws/pyserini/setup.py", line 3, in <module>
with open("project-description.md", "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'project-description.md'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Pyserini notebooks are not updated with the most recent commit. For example, in pyserini_robust04_demo.ipynb, `hits[0].contents` should be used instead of `hits[0].content` after this change.
Hi. I have an Anserini index and am trying to access its statistics, so I am following this:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md
when I type:
from pyserini import analysis, index
index_reader = index.IndexReader()
I am getting:
AttributeError Traceback (most recent call last)
in ()
1 from pyserini import analysis, index
2
----> 3 index_reader = index.IndexReader()
AttributeError: module 'pyserini.index' has no attribute 'IndexReader'
Hi @lintool!
I have a new issue:
I created a new index with the "DUC-2001" dataset using this command:
sh anserini/target/appassembler/bin/IndexCollection \
-collection TrecCollection \
-generator JsoupGenerator \
-threads 2 \
-input ${EXP}/ \
-index indexes/lucene-index.XXX \
-storePositions -storeDocvectors -storeRawDocs
I also installed the Luke toolbox to understand how the index works.
When I run this code:
for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)
it works for some terms but not for all...
Traceback (most recent call last):
File "doc2index_2.py", line 50, in <module>
postings_list = index_utils.get_postings_list(term_)
File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
postings_list = self.object.getPostingsList(self.reader, JString(term))
File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException
I think there are two different indexes: the first one applies stemming (the word "Cherokee" becomes "cheroke") and the second keeps the word without stemming.
So, how can I apply stemming when querying the postings index?
Best regards
All CACM HTML files are under `collection`. Instantiating `pycollection.Collection` with Python 3.5.0 and Java 11 leads to `java.nio.file.NoSuchFileException`:
from pyserini.collection import pycollection
collection = pycollection.Collection('HtmlCollection', 'collection/')
2019-11-22 08:05:07,954 ERROR [main] collection.DocumentCollection$2 (DocumentCollection.java:226) - Visiting failed for ä
java.nio.file.NoSuchFileException: ä
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149) ~[?:?]
at java.base/java.nio.file.Files.readAttributes(Files.java:1763) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322) ~[?:?]
at java.base/java.nio.file.Files.walkFileTree(Files.java:2716) [?:?]
at java.base/java.nio.file.Files.walkFileTree(Files.java:2796) [?:?]
at io.anserini.collection.DocumentCollection.discover(DocumentCollection.java:232) [anserini-0.6.0-fatjar.jar:?]
at io.anserini.collection.DocumentCollection.iterator(DocumentCollection.java:110) [anserini-0.6.0-fatjar.jar:?]
Expose the internal Lucene docid to the searcher so that the Python interface can easily iterate over the index.
@stephaniewhoo Can you share your notebook here and we can discuss?
Hi, congratulations on this work!
I just wanted to ask, if possible, for an example of how to use Pyserini to build a new index. Also, to build my own collection, do I just have to override the Collection class?
Thanks!
We should have a version of https://github.com/castorini/anserini/blob/master/src/main/python/verify_simplesearcher.py that verifies the output of `SimpleSearcher` against Java's `SearchCollection`. Most of the script can be reused (just copy over).
It should take a path to the Anserini root, something like `-anserini ..`.