
Pyserini


Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, prebuilt indexes, and evaluation scripts for many commonly used IR test collections. With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!

For additional details, our paper in SIGIR 2021 provides a nice overview.

โ— Anserini was upgraded from JDK 11 to JDK 21 at commit 272565 (2024/04/03), which corresponds to the release of v0.35.0. Correspondingly, Pyserini was upgraded to JDK 21 at commit b2f677 (2024/04/04).

🎬 Installation

Install via PyPI (requires Python 3.10+):

pip install pyserini

Sparse retrieval depends on Anserini, which is itself built on Lucene (written in Java) and thus requires JDK 21.

Dense retrieval depends on neural networks and requires a more complex set of dependencies. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you.
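For example, on a CPU-only Linux machine, the missing pieces can often be installed as follows (a sketch; the right packages and versions depend on your platform and hardware):

pip install torch
pip install faiss-cpu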

The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.

If you're planning on just using Pyserini, then the pip instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation. Instructions are provided here.

🙋 How do I search?

Pyserini supports several classes of retrieval models, spanning sparse, dense, and hybrid sparse-dense retrieval.

See this guide for details on how to search common corpora in IR and NLP research (e.g., MS MARCO, NaturalQuestions, BEIR, etc.) using indexes that we have already built for you.
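As a quick illustration, here is a minimal sketch of BM25 search over one of the prebuilt indexes (msmarco-v1-passage is one of the available prebuilt index names):

from pyserini.search.lucene import LuceneSearcher

# Downloads the prebuilt index on first use, then searches it with BM25.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is a lobster roll?')

for i in range(10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')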

Once you get the top-k results, you'll actually want to fetch the document text... See this guide for how.
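Continuing the sketch above (in recent Pyserini versions, Document exposes raw() and contents() as methods):

# Fetch the stored document for the top hit and access its raw text.
doc = searcher.doc(hits[0].docid)
raw = doc.raw()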

🙋 How do I index my own corpus?

Well, it depends on what type of retrieval model you want to search with:

The steps differ across classes of models; this guide describes the details.
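For the common case of building a sparse (BM25) index over a JSONL collection, the command looks roughly like this (a sketch; all paths here are illustrative):

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input path/to/jsonl/collection \
  --index indexes/my-index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw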

🙋 Additional FAQs

โš—๏ธ Reproducibility

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! We provide a number of prebuilt indexes that directly support reproducibility "out of the box".

In our SIGIR 2022 paper, we introduced "two-click reproductions" that allow anyone to reproduce experimental runs with only two clicks (i.e., copy and paste). Documentation is organized into reproduction matrices for different corpora that summarize the experimental conditions and query sets.
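A typical two-click reproduction pairs a retrieval run with its evaluation, along these lines (a sketch; topic and index names follow Pyserini's naming conventions):

python -m pyserini.search.lucene \
  --index msmarco-v1-passage \
  --topics msmarco-passage-dev-subset \
  --output run.msmarco-passage.bm25.txt \
  --bm25

python -m pyserini.eval.trec_eval -c -m recip_rank \
  msmarco-passage-dev-subset run.msmarco-passage.bm25.txt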

For more details, see our paper on Building a Culture of Reproducibility in Academic Research.

Additional reproduction guides below provide detailed step-by-step instructions.

Sparse Retrieval

Dense Retrieval

Hybrid Sparse-Dense Retrieval

Available Corpora

Corpora Size Checksum
MS MARCO V1 passage: uniCOIL (noexp) 2.7 GB f17ddd8c7c00ff121c3c3b147d2e17d8
MS MARCO V1 passage: uniCOIL (d2q-T5) 3.4 GB 78eef752c78c8691f7d61600ceed306f
MS MARCO V1 doc: uniCOIL (noexp) 11 GB 11b226e1cacd9c8ae0a660fd14cdd710
MS MARCO V1 doc: uniCOIL (d2q-T5) 19 GB 6a00e2c0c375cb1e52c83ae5ac377ebb
MS MARCO V2 passage: uniCOIL (noexp) 24 GB d9cc1ed3049746e68a2c91bf90e5212d
MS MARCO V2 passage: uniCOIL (d2q-T5) 41 GB 1949a00bfd5e1f1a230a04bbc1f01539
MS MARCO V2 doc: uniCOIL (noexp) 55 GB 97ba262c497164de1054f357caea0c63
MS MARCO V2 doc: uniCOIL (d2q-T5) 72 GB c5639748c2cbad0152e10b0ebde3b804

📃 Additional Documentation

๐Ÿ“œ๏ธ Release History

older... (and historic notes)

๐Ÿ“œ๏ธ Historical Notes

โ‰๏ธ Lucene 8 to Lucene 9 Transition. In 2022, Pyserini underwent a transition from Lucene 8 to Lucene 9. Most of the prebuilt indexes have been rebuilt using Lucene 9, but there are a few still based on Lucene 8.

Explanations:

  • What's the impact? Indexes built with Lucene 8 are not fully compatible with Lucene 9 code (see Anserini #1952). The workaround is to disable consistent tie-breaking, which happens automatically if a Lucene 8 index is detected by Pyserini. However, Lucene 9 code running on Lucene 8 indexes will give slightly different results than Lucene 8 code running on Lucene 8 indexes. Note that Lucene 8 code is not able to read indexes built with Lucene 9.

  • Why is this necessary? Although disruptive, an upgrade to Lucene 9 is necessary to take advantage of Lucene's HNSW indexes, which will increase the capabilities of Pyserini and open up the design space of dense/sparse hybrids.

Through v0.11.0.0, Pyserini versions adopted the convention X.Y.Z.W, where X.Y.Z tracks the version of Anserini and W distinguishes releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions became decoupled.

Anserini is designed to work with JDK 11. There was a JRE path change above JDK 9 that broke pyjnius 1.2.0, as documented in this issue and also reported in Anserini here and here. The issue was fixed with pyjnius 1.2.1 (released December 2019). The previous error is documented in this notebook, and this notebook documents the fix.

✨ References

If you use Pyserini, please cite the following paper:

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

๐Ÿ™ Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Contributors

alexlimh, arthurchen189, cathrineee, chriskamphuis, crystina-z, dahlia-chehata, ehsk, haksoat, hanglics, jacklin64, jasper-xian, justram, kaisun314, lintool, manveertamber, mofetoluwa, mrkarezina, mxueguang, pepijnboers, qguo96, ronakice, sahel-sh, saileshnankani, stephaniewhoo, toluclassics, tteofili, x389liu, x65han, yuki617, zeynepakkalyoncu


pyserini's Issues

pysearch.get_topics doesn't work anymore

This no longer works:

from pyserini.search import pysearch
topics = pysearch.get_topics('msmarco_passage_dev_subset')

The reason is that the Java end uses generics, and so pyjnius can't properly dispatch to the method. See: kivy/pyjnius#134
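For reference, later Pyserini versions expose topics through a plain function, which sidesteps the generics dispatch problem (a sketch; topic key naming has also changed over time):

from pyserini.search import get_topics

topics = get_topics('msmarco-passage-dev-subset')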

Access Index from colab

Hi. I have an Anserini index and am trying to access its statistics, so I am following this:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md

When I type:
from pyserini import analysis, index
index_reader = index.IndexReader()
I am getting:
AttributeError Traceback (most recent call last)
in ()
1 from pyserini import analysis, index
2
----> 3 index_reader = index.IndexReader()

AttributeError: module 'pyserini.index' has no attribute 'IndexReader'
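For reference, in later Pyserini versions the reader takes the index path directly; a sketch (the class is named IndexReader or LuceneIndexReader depending on the version):

from pyserini.index.lucene import LuceneIndexReader

index_reader = LuceneIndexReader('path/to/index')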

Term occurs in document vector, but has collection frequency 0

I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:

index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}

output:

tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}

I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."
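One way to narrow this down is to fetch the postings for the already-analyzed term directly, skipping re-analysis (a sketch using the analyze flag that appears elsewhere in this API):

# The term from the document vector is already analyzed, so don't re-analyze it.
postings_list = index_utils.get_postings_list('hobbies:photographi', analyze=False)
print(postings_list)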

Add functionality to search without preprocessing

We would like to be able to search an index without the query being processed by the stemmer. The specific use-case would be for the background linking task of TREC. We want to use the document vectors (that contain stemmed terms) to construct a new query.
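A sketch of what this could look like, assuming the searcher exposes a way to swap in the analyzer (treat set_analyzer as an assumed method name; the analyzer construction and import paths follow the API shown in other issues and vary across versions):

from pyserini.analysis import pyanalysis  # import path varies across versions
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('path/to/index')
# A non-stemming analyzer lets query terms taken from document vectors
# (already stemmed) match the index verbatim.
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
searcher.set_analyzer(analyzer)  # assumed method name
hits = searcher.search('cheroke nation')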

Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Obtained a JavaException error after importing:
from pyserini.search import pysearch

--> Error:
... jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Installed Pyserini via:
pip install pyserini --user

I was not able to resolve this, I tried:

  • replacing the fatjar with a newly built Anserini 0.8.1 fatjar
  • manual configuration of classpath via configure_classpath() method in pyserini.setup.

Complete error:

Traceback (most recent call last):
  File "build_db.py", line 6, in <module>
    from pyserini.search import pysearch
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/search/pysearch.py", line 25, in <module>
    from ..pyclass import JSearcher, JResult, JDocument, JString, JArrayList, JTopics, JTopicReader
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/pyclass.py", line 51, in <module>
    JDefaultEnglishAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
  File "/home/pboers/.local/lib/python3.7/site-packages/jnius/reflect.py", line 208, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Pyserini Collection iterators sometimes freeze with large directories when used in forked processes

We've been having an issue where Collection iterators sometimes freeze when used in forked processes. This is not specific to Pyserini and can be reproduced with only pyjnius and a BufferedReader. See below for a minimal script to reproduce it.

This issue reliably occurs if:

  1. pyjnius is initialized in the main process before forked processes are created. If we initialize pyjnius inside each forked process, everything's fine. (This is not easy to do though, because there's no way to close pyjnius after it's initialized, so every call to pyserini would have to happen in a forked process.)
  2. The BufferedReader is run on a "large" directory. Large depends on a combination of number of files and file size. It does not happen on a directory with 2000 empty files (created with touch $fn), but it does happen if this is increased to 3000 empty files. It does happen if the 2000 empty files are 1MB each (dd if=/dev/zero of=$fn bs=1M count=1) rather than empty. (I've also reproduced it on a random directory containing 265 files of varying sizes.)

Script:

import os
import sys
from multiprocessing import Pool
from jnius import autoclass
jstr = autoclass("java.lang.String")
jbr = autoclass("java.io.BufferedReader")
jfr = autoclass("java.io.FileReader")
def jprint(x):
    fr = jfr(x)
    f = jbr(fr)
    while True:
        line = f.readLine()
        if line is None:
            print("break")
            f.close()
            fr.close()
            break
        else:
            print("not none")

if __name__ == "__main__":
    dir = sys.argv[1]
    p = Pool(5)
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    p.map(jprint, fns)
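Consistent with observation (1), one workaround sketch is the "spawn" start method, so each worker re-imports the module and initializes pyjnius in its own process rather than inheriting a forked JVM:

from multiprocessing import get_context

if __name__ == "__main__":
    dir = sys.argv[1]
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    # Each spawned worker re-runs the module-level autoclass() calls itself.
    with get_context("spawn").Pool(5) as p:
        p.map(jprint, fns)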

java.nio.file.NoSuchFileException using the Collection API

All CACM HTML files are under collection. Instantiating pycollection.Collection with Python 3.5.0 and Java 11 leads to java.nio.file.NoSuchFileException:

from pyserini.collection import pycollection

collection = pycollection.Collection('HtmlCollection', 'collection/')
2019-11-22 08:05:07,954 ERROR [main] collection.DocumentCollection$2 (DocumentCollection.java:226) - Visiting failed for ä
java.nio.file.NoSuchFileException: ä
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149) ~[?:?]
	at java.base/java.nio.file.Files.readAttributes(Files.java:1763) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322) ~[?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2716) [?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2796) [?:?]
	at io.anserini.collection.DocumentCollection.discover(DocumentCollection.java:232) [anserini-0.6.0-fatjar.jar:?]
	at io.anserini.collection.DocumentCollection.iterator(DocumentCollection.java:110) [anserini-0.6.0-fatjar.jar:?]

get_document_vector() and get_postings_list() Stemming ?

Hi @lintool!
I have a new issue:
I created a new index with the dataset "DUC-2001" by means of this command:

 sh anserini/target/appassembler/bin/IndexCollection \
            -collection TrecCollection \
            -generator JsoupGenerator \
            -threads 2 \
            -input ${EXP}/ \
            -index indexes/lucene-index.XXX \
            -storePositions -storeDocvectors -storeRawDocs

I also installed the Luke toolbox to understand how the index works.

When I run this code:

for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)

It works for some terms but not for all...

Traceback (most recent call last):
  File "doc2index_2.py", line 50, in <module>
    postings_list = index_utils.get_postings_list(term_)
  File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
    postings_list = self.object.getPostingsList(self.reader, JString(term))
  File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException

I think there are two different indexes: the first applies stemming (the word "Cherokee" becomes "cheroke") and the second keeps the word without stemming.

So, how can I apply stemming when looking up the postings index?

Best regards
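Since the terms returned by get_document_vector are already analyzed (stemmed), one option is to skip re-analysis when fetching postings, a sketch using the analyze flag shown elsewhere in this API:

for term_ in doc_vector:
    # term_ is already stemmed; look it up verbatim.
    postings_list = index_utils.get_postings_list(term_, analyze=False)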

get_document_vector vs get_postings_list

Hello! Thank you for your work!
I have an issue, and I am not sure whether it's a bug in the API or a misunderstanding on my part of the API's semantics.
When I run these two programs and analyze the frequency of the term "standard" in some documents, I get:

  • Program 1:
    doc_vector = index_utils.get_document_vector(XXX)
    Output : {"...",'standard': 3, "..."}
  • Program 2:
    postings_list = index_utils.get_postings_list('standard')
    Output: docid=XXX, tf=1, pos=[26]

In fact, for the same document id, the first program reports a term frequency of 3 while the second reports 1. The term "standard" was used here just for illustration; I encounter the same problem for other terms.

I would be very grateful for feedback on this issue.

Best regards,

Retire technical debt for pysearch.search __main__

Arg layout should be consistent with the Anserini structure: change -prf to -prcl, since we've decided to name the class PseudoRelevanceClassifierReranker

Args should be grouped, e.g., -prcl.r, -prcl.n, -prcl.alpha


Use tqdm as a progress indicator.


Write test case: you can assume the CACM test index: https://github.com/castorini/pyserini/blob/master/tests/test_indexutils.py#L31

This would be similar to the commit hook in anserini: https://github.com/castorini/anserini/blob/master/.travis.yml#L20


I think the output file should be obligatory.

Replicate SearchCollection in Pyserini

It'd be nice to be able to replicate standard regression runs directly from Python, something like:

python -m pyserini.search_collection ...

We should be able to get exactly the same output as from Java.

Use case for dump_document_vectors in IndexReaderUtils?

Hi @zeynepakkalyoncu - do we have an actual use case for dump_document_vectors in IndexReaderUtils? I'm trying to write a use case for it, and I can't seem to get it to work; it throws a mysterious jnius.JavaException: JVM exception occurred error.

If you're using this feature somewhere (in a notebook), we should make sure it works and write a test case... otherwise I would suggest removing it until a real use case comes up.

Expose methods to convert between internal and external docid

This is a summary of the issue presented in #32

Consider this fragment:

>>> from pyserini.index import pyutils
>>> 
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> postings_list = index_utils.get_postings_list('black')
>>> 
>>> for i in range(0, 10):
...     print('{}'.format(postings_list[i]))
... 
(6, 2) [555,606]
(29, 1) [410]
(32, 2) [65,462]
(35, 2) [288,475]
(56, 1) [662]
(60, 1) [69]
(61, 1) [110]
(63, 1) [195]
(74, 2) [230,518]
(96, 1) [107]

The docids (e.g., 6 in the first posting) refer to internal Lucene docids, which are different from external docids (i.e., those in the collection).

Use this hidden method convertLuceneDocidToDocid to convert, as in:

>>> for i in range(0, 10):
...     print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA111289-0011 (6, 2) [555,606]
LA092890-0052 (29, 1) [410]
LA022489-0041 (32, 2) [65,462]
LA051990-0051 (35, 2) [288,475]
LA092890-0077 (56, 1) [662]
LA022489-0061 (60, 1) [69]
LA021889-0073 (61, 1) [110]
LA110689-0057 (63, 1) [195]
LA080789-0088 (74, 2) [230,518]
LA021889-0117 (96, 1) [107]

The TODO is to explicitly expose convertLuceneDocidToDocid.

Similarly, we can use convertDocidToLuceneDocid to convert an external collection docid into an internal docid:

>>> from jnius import autoclass
>>> JString = autoclass('java.lang.String')
>>> index_utils.object.convertDocidToLuceneDocid(index_utils.reader, JString("LA052189-0089"))
200443

We can verify as follows:

>>> for i in range(len(postings_list)):
...     if postings_list[i].docid == 200443:
...         print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA052189-0089 (200443, 64) [18,133,175,212,225,244,262,273,307,320,344,372,388,431,438,454,464,541,576,583,616,640,772,778,801,831,838,885,891,912,937,952,970,1123,1151,1165,1180,1210,1215,1231,1270,1307,1346,1431,1436,1507,1514,1542,1546,1550,1663,1676,1726,1750,1764,1769,1781,1784,1838,1847,1873,1880,1922,1971]

Which matches exactly what we get from get_document_vector:

>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> doc_vector = index_utils.get_document_vector("LA052189-0089")
>>> doc_vector['black']
64
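For reference, later versions of the index reader expose these conversions directly, along these lines (a sketch; method names are version-dependent, with IndexReader replacing IndexReaderUtils):

internal = index_reader.convert_collection_docid_to_internal_docid('LA052189-0089')
external = index_reader.convert_internal_docid_to_collection_docid(internal)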

AttributeError: 'Document' object has no attribute 'raw'

import json

from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('/home/ds/anserini/covid-2020-04-17/lucene-index-covid-paragraph-new/')
hits = searcher.search('nsp1 synthesis degradation', 10)

article = json.loads(searcher.doc('42saxb98').raw)

# Uncomment to print the entire article... warning, it's long! :)

#print(json.dumps(article, indent=4))

article['metadata']['title']

Error:


AttributeError Traceback (most recent call last)
in ()
----> 1 article = json.loads(searcher.doc('42saxb98').raw)
2
3 # Uncomment to print the entire article... warning, it's long! :)
4 #print(json.dumps(article, indent=4))
5

AttributeError: 'Document' object has no attribute 'raw'
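In more recent Pyserini versions, raw is a method rather than an attribute, so the equivalent call would be along these lines (a sketch):

article = json.loads(searcher.doc('42saxb98').raw())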

Move location of search main

Right now we have:

$ python -m pyserini.search.pysearch
usage: pysearch.py [-h] -index path -topics path -output path

Maybe let's move to:

$ python -m pyserini.search
usage: pysearch.py [-h] -index path -topics path -output path

What do you think @yuki617 ?

Make methods in IndexReaderUtils more consistent re: Analyzer

We have:

# Pass in a no-op analyzer:
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
index_utils.get_term_counts(term, analyzer=analyzer)
df, cf = index_utils.get_term_counts(term)

Here, we take an analyzer.

And:

# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyze=False)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

Here, we take a bool. Let's make both consistent?

How about both taking an analyzer and accepting None? Passing in a "no-op" analyzer seems a bit janky.

Thoughts? @PepijnBoers @chriskamphuis

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.9.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary pyserini -w /tmp/ext pyserini==0.9.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting pyserini==0.9.2.0
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/6bb/4e22d7cb0a83a/pyserini-0.9.2.0.tar.gz (57.8 MB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-we7atqws/pyserini/pip-egg-info
         cwd: /tmp/pip-wheel-we7atqws/pyserini/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-we7atqws/pyserini/setup.py", line 3, in <module>
        with open("project-description.md", "r") as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'project-description.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Testing framework

Now that Pyserini is reasonably stable... we kinda need a testing framework...

Index own data

How do I index my own data in Pyserini? All the notebooks and examples use prebuilt indexes.

Strange terms() behavior in IndexReaderUtils

I do this:

from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('lucene-index.robust04.pos+docvectors+rawdocs')

iter1 = index_utils.terms()
iter2 = index_utils.terms()

for term in iter1:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

As expected, I iterate over all terms. Then:

for term in iter2:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

Gives nothing... is terms() returning the same iterator every time?

Doesn't seem like the right behavior?
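If terms() does hand back a single shared iterator, a simple workaround is to materialize the terms once and reuse the list (a sketch):

terms = list(index_utils.terms())
for term in terms:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
# A second pass over the list now works as expected.
for term in terms:
    print(term.term)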

Access term frequencies in index

I need to access statistics of my index, like tf, tf-idf, etc. I want to answer questions like: In how many documents does a specific term occur? Which documents does a specific term occur in? What terms occur in a given document? Is there any way to do this using Pyserini? Thanks!
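The IndexReaderUtils API shown in other issues covers most of this; a sketch (the index path is illustrative):

from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('path/to/index')

# In how many documents does a specific term occur (df), and how often overall (cf)?
df, cf = index_utils.get_term_counts('standard')

# Which documents does the term occur in, with what frequency and positions?
for posting in index_utils.get_postings_list('standard'):
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

# What terms occur in a given document?
tf = index_utils.get_document_vector('LA052189-0089')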

Issue with VM during pysearch import

I am trying to import pysearch from pyserini.search. I set the JAVA_HOME variable to jdk11. I am running this using Jupyter Notebook and I am getting the error as shown.

import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'

from pyserini.search import pysearch

The error is:

ValueError: VM is already running, can't set classpath/options; VM started at  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main

I initially was running in a virtual environment (conda environment). I tried coming out of it and execute, but I am still getting the same issue. Any workarounds? Or am I missing something?

I am using MacOS Catalina.
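The error means the JVM was already started (e.g., by an earlier import in the same kernel) before JAVA_HOME took effect, so the classpath and options can no longer be changed. A workaround sketch: restart the kernel, set the environment variable first, and only then import anything that touches the JVM.

import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'

# Import pyserini only after JAVA_HOME is set; any earlier pyserini/jnius
# import in the session will have already started the VM.
from pyserini.search import pysearch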

JavaException: JVM exception occurred: Could not load codec 'Lucene84'

I first indexed my docs using Lucene 8.5.1 in Java; then, after running

searcher = pysearch.SimpleSearcher('/home/sipah00/java_lucene/lucene_idx1/')

I got this error: JavaException: JVM exception occurred: Could not load codec 'Lucene84'. Did you forget to add lucene-backward-codecs.jar?
How do I resolve this issue? Is there a specific version of Lucene that Pyserini supports?
Also, how can I index my docs with Pyserini only?

Pyserini package/module structure: unnecessary nesting?

Currently, invocation looks something like:

from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

@zeynepakkalyoncu @emmileaf don't you think we have one unnecessary nested layer?

Would something like this make more sense?

from pyserini import search
searcher = search.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

or

from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

Import Error

Hi. I am getting the error "ImportError: DLL load failed: The specified module could not be found." when trying to import both of:
from pyserini.search import pysearch
from pyserini.search.pysearch import SimpleSearcher

The modules are installed and are suggested by the editor when typing them. I am using PyCharm, Python 3.7.6, and tried both pyserini==0.9.0.0 and pyserini==0.9.3.0.

Score query wrt document or rerank input set of documents wrt query

Feature request from @cmacdonald -

Compute scores with respect to a set of documents specified by the user, i.e., "rerank this set for me". This could decompose into "score this query wrt this document" with an outer loop over documents.

Can be accomplished today by setting k to be a really large value and then filtering results... but having a better implementation of this feature would be generally useful.
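For reference, later versions of the index reader expose a direct query-document scoring call that supports exactly this decomposition (a sketch; class and method names are version-dependent, and candidate_docids is supplied by the user):

from pyserini.index.lucene import LuceneIndexReader  # name varies across versions

index_reader = LuceneIndexReader('path/to/index')
# Score one query against each candidate document ("rerank this set for me").
for docid in candidate_docids:
    score = index_reader.compute_query_document_score(docid, 'my query')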

How can I build an index with pyserini?

Hi, congratulations on this work!
I just wanted to ask whether you could add an example of how to use Pyserini to build a new index. Also, to build my own collection, do I just have to override the Collection class?
Thanks!

How to build indexes

What should I use to build indexes? I was looking at a few options: Solr, Elasticsearch, Lucene, etc.
