
Pyserini


Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, prebuilt indexes, and evaluation scripts for many commonly used IR test collections. With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!

For additional details, our paper in SIGIR 2021 provides a nice overview.

โ— Anserini was upgraded from JDK 11 to JDK 21 at commit 272565 (2024/04/03), which corresponds to the release of v0.35.0. Correspondingly, Pyserini was upgraded to JDK 21 at commit b2f677 (2024/04/04).

🎬 Installation

Install via PyPI (requires Python 3.10+):

pip install pyserini

Sparse retrieval depends on Anserini, which is itself built on Lucene (written in Java) and thus requires JDK 21.

Dense retrieval depends on neural networks and requires a more complex set of dependencies. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you.
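For example, on a CPU-only Linux machine, the missing pieces can often be installed as follows (a sketch; the right packages and versions depend on your platform and hardware):

pip install torch
pip install faiss-cpu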

The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.

If you're planning on just using Pyserini, then the pip instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation. Instructions are provided here.

🙋 How do I search?

Pyserini supports several classes of retrieval models, spanning sparse, dense, and hybrid sparse-dense retrieval.

See this guide for details on how to search common corpora in IR and NLP research (e.g., MS MARCO, NaturalQuestions, BEIR, etc.) using indexes that we have already built for you.
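As a quick illustration, here is a minimal sketch of BM25 search over one of the prebuilt indexes (msmarco-v1-passage is one of the available prebuilt index names):

from pyserini.search.lucene import LuceneSearcher

# Downloads the prebuilt index on first use, then searches it with BM25.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is a lobster roll?')

for i in range(10):
    print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')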

Once you get the top-k results, you'll actually want to fetch the document text... See this guide for how.
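Continuing the sketch above (in recent Pyserini versions, Document exposes raw() and contents() as methods):

# Fetch the stored document for the top hit and access its raw text.
doc = searcher.doc(hits[0].docid)
raw = doc.raw()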

🙋 How do I index my own corpus?

Well, it depends on what type of retrieval model you want to search with:

The steps differ across classes of models; this guide describes the details.
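For the common case of building a sparse (BM25) index over a JSONL collection, the command looks roughly like this (a sketch; all paths here are illustrative):

python -m pyserini.index.lucene \
  --collection JsonCollection \
  --input path/to/jsonl/collection \
  --index indexes/my-index \
  --generator DefaultLuceneDocumentGenerator \
  --threads 1 \
  --storePositions --storeDocvectors --storeRaw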

🙋 Additional FAQs

โš—๏ธ Reproducibility

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! We provide a number of prebuilt indexes that directly support reproducibility "out of the box".

In our SIGIR 2022 paper, we introduced "two-click reproductions" that allow anyone to reproduce experimental runs with only two clicks (i.e., copy and paste). Documentation is organized into reproduction matrices for different corpora that summarize the experimental conditions and query sets.
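A typical two-click reproduction pairs a retrieval run with its evaluation, along these lines (a sketch; topic and index names follow Pyserini's naming conventions):

python -m pyserini.search.lucene \
  --index msmarco-v1-passage \
  --topics msmarco-passage-dev-subset \
  --output run.msmarco-passage.bm25.txt \
  --bm25

python -m pyserini.eval.trec_eval -c -m recip_rank \
  msmarco-passage-dev-subset run.msmarco-passage.bm25.txt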

For more details, see our paper on Building a Culture of Reproducibility in Academic Research.

Additional reproduction guides below provide detailed step-by-step instructions.

Sparse Retrieval

Dense Retrieval

Hybrid Sparse-Dense Retrieval

Available Corpora

Corpora Size Checksum
MS MARCO V1 passage: uniCOIL (noexp) 2.7 GB f17ddd8c7c00ff121c3c3b147d2e17d8
MS MARCO V1 passage: uniCOIL (d2q-T5) 3.4 GB 78eef752c78c8691f7d61600ceed306f
MS MARCO V1 doc: uniCOIL (noexp) 11 GB 11b226e1cacd9c8ae0a660fd14cdd710
MS MARCO V1 doc: uniCOIL (d2q-T5) 19 GB 6a00e2c0c375cb1e52c83ae5ac377ebb
MS MARCO V2 passage: uniCOIL (noexp) 24 GB d9cc1ed3049746e68a2c91bf90e5212d
MS MARCO V2 passage: uniCOIL (d2q-T5) 41 GB 1949a00bfd5e1f1a230a04bbc1f01539
MS MARCO V2 doc: uniCOIL (noexp) 55 GB 97ba262c497164de1054f357caea0c63
MS MARCO V2 doc: uniCOIL (d2q-T5) 72 GB c5639748c2cbad0152e10b0ebde3b804

📃 Additional Documentation

๐Ÿ“œ๏ธ Release History

older... (and historic notes)

๐Ÿ“œ๏ธ Historical Notes

โ‰๏ธ Lucene 8 to Lucene 9 Transition. In 2022, Pyserini underwent a transition from Lucene 8 to Lucene 9. Most of the prebuilt indexes have been rebuilt using Lucene 9, but there are a few still based on Lucene 8.

Explanations:

  • What's the impact? Indexes built with Lucene 8 are not fully compatible with Lucene 9 code (see Anserini #1952). The workaround is to disable consistent tie-breaking, which happens automatically if a Lucene 8 index is detected by Pyserini. However, Lucene 9 code running on Lucene 8 indexes will give slightly different results than Lucene 8 code running on Lucene 8 indexes. Note that Lucene 8 code is not able to read indexes built with Lucene 9.

  • Why is this necessary? Although disruptive, an upgrade to Lucene 9 is necessary to take advantage of Lucene's HNSW indexes, which will increase the capabilities of Pyserini and open up the design space of dense/sparse hybrids.

Through v0.11.0.0, Pyserini versions adopted the convention X.Y.Z.W, where X.Y.Z tracks the version of Anserini and W distinguishes releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions became decoupled.

Anserini is designed to work with JDK 11. There was a JRE path change above JDK 9 that broke pyjnius 1.2.0, as documented in this issue and also reported in Anserini here and here. The issue was fixed with pyjnius 1.2.1 (released December 2019). The previous error is documented in this notebook, and this notebook documents the fix.

✨ References

If you use Pyserini, please cite the following paper:

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

๐Ÿ™ Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Contributors

alexlimh, arthurchen189, cathrineee, chriskamphuis, crystina-z, dahlia-chehata, ehsk, haksoat, hanglics, jacklin64, jasper-xian, justram, kaisun314, lintool, manveertamber, mofetoluwa, mrkarezina, mxueguang, pepijnboers, qguo96, ronakice, sahel-sh, saileshnankani, stephaniewhoo, toluclassics, tteofili, x389liu, x65han, yuki617, zeynepakkalyoncu


pyserini's Issues

pysearch.get_topics doesn't work anymore

This no longer works:

from pyserini.search import pysearch
topics = pysearch.get_topics('msmarco_passage_dev_subset')

The reason is that the Java end uses generics, and so pyjnius can't properly dispatch to the method. See: kivy/pyjnius#134
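For reference, later Pyserini versions expose topics through a plain function, which sidesteps the generics dispatch problem (a sketch; topic key naming has also changed over time):

from pyserini.search import get_topics

topics = get_topics('msmarco-passage-dev-subset')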

Access Index from colab

Hi. I have an Anserini index and am trying to access its statistics, so I am following this:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md

When I type:
from pyserini import analysis, index
index_reader = index.IndexReader()
I am getting:
AttributeError Traceback (most recent call last)
in ()
1 from pyserini import analysis, index
2
----> 3 index_reader = index.IndexReader()

AttributeError: module 'pyserini.index' has no attribute 'IndexReader'
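For reference, in later Pyserini versions the reader takes the index path directly; a sketch (the class is named IndexReader or LuceneIndexReader depending on the version):

from pyserini.index.lucene import LuceneIndexReader

index_reader = LuceneIndexReader('path/to/index')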

Term occurs in document vector, but has collection frequency 0

I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:

index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}

output:

tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}

I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."
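One way to narrow this down is to fetch the postings for the already-analyzed term directly, skipping re-analysis (a sketch using the analyze flag that appears elsewhere in this API):

# The term from the document vector is already analyzed, so don't re-analyze it.
postings_list = index_utils.get_postings_list('hobbies:photographi', analyze=False)
print(postings_list)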

Add functionality to search without preprocessing

We would like to be able to search an index without the query being processed by the stemmer. The specific use-case would be for the background linking task of TREC. We want to use the document vectors (that contain stemmed terms) to construct a new query.
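A sketch of what this could look like, assuming the searcher exposes a way to swap in the analyzer (treat set_analyzer as an assumed method name; the analyzer construction and import paths follow the API shown in other issues and vary across versions):

from pyserini.analysis import pyanalysis  # import path varies across versions
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('path/to/index')
# A non-stemming analyzer lets query terms taken from document vectors
# (already stemmed) match the index verbatim.
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
searcher.set_analyzer(analyzer)  # assumed method name
hits = searcher.search('cheroke nation')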

Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Obtained a JavaException error after importing:
from pyserini.search import pysearch

--> Error:
... jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Installed Pyserini via:
pip install pyserini --user

I was not able to resolve this, I tried:

  • replacing the fatjar with a newly built Anserini 0.8.1 fatjar
  • manual configuration of classpath via configure_classpath() method in pyserini.setup.

Complete error:

Traceback (most recent call last):
  File "build_db.py", line 6, in <module>
    from pyserini.search import pysearch
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/search/pysearch.py", line 25, in <module>
    from ..pyclass import JSearcher, JResult, JDocument, JString, JArrayList, JTopics, JTopicReader
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/pyclass.py", line 51, in <module>
    JDefaultEnglishAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
  File "/home/pboers/.local/lib/python3.7/site-packages/jnius/reflect.py", line 208, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Pyserini Collection iterators sometimes freeze with large directories when used in forked processes

We've been having an issue where Collection iterators sometimes freeze when used in forked processes. This is not specific to Pyserini and can be reproduced with only pyjnius and a BufferedReader. See below for a minimal script to reproduce it.

This issue reliably occurs if:

  1. pyjnius is initialized in the main process before forked processes are created. If we initialize pyjnius inside each forked process, everything's fine. (This is not easy to do though, because there's no way to close pyjnius after it's initialized, so every call to pyserini would have to happen in a forked process.)
  2. The BufferedReader is run on a "large" directory. Large depends on a combination of number of files and file size. It does not happen on a directory with 2000 empty files (created with touch $fn), but it does happen if this is increased to 3000 empty files. It does happen if the 2000 empty files are 1MB each (dd if=/dev/zero of=$fn bs=1M count=1) rather than empty. (I've also reproduced it on a random directory containing 265 files of varying sizes.)

Script:

import os
import sys
from multiprocessing import Pool
from jnius import autoclass
jstr = autoclass("java.lang.String")
jbr = autoclass("java.io.BufferedReader")
jfr = autoclass("java.io.FileReader")
def jprint(x):
    fr = jfr(x)
    f = jbr(fr)
    while True:
        line = f.readLine()
        if line is None:
            print("break")
            f.close()
            fr.close()
            break
        else:
            print("not none")

if __name__ == "__main__":
    dir = sys.argv[1]
    p = Pool(5)
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    p.map(jprint, fns)
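Consistent with observation (1), one workaround sketch is the "spawn" start method, so each worker re-imports the module and initializes pyjnius in its own process rather than inheriting a forked JVM:

from multiprocessing import get_context

if __name__ == "__main__":
    dir = sys.argv[1]
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    # Each spawned worker re-runs the module-level autoclass() calls itself.
    with get_context("spawn").Pool(5) as p:
        p.map(jprint, fns)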

java.nio.file.NoSuchFileException using the Collection API

All CACM HTML files are under collection. Instantiating pycollection.Collection with Python 3.5.0 and Java 11 leads to java.nio.file.NoSuchFileException:

from pyserini.collection import pycollection

collection = pycollection.Collection('HtmlCollection', 'collection/')
2019-11-22 08:05:07,954 ERROR [main] collection.DocumentCollection$2 (DocumentCollection.java:226) - Visiting failed for ä
java.nio.file.NoSuchFileException: ä
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149) ~[?:?]
	at java.base/java.nio.file.Files.readAttributes(Files.java:1763) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322) ~[?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2716) [?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2796) [?:?]
	at io.anserini.collection.DocumentCollection.discover(DocumentCollection.java:232) [anserini-0.6.0-fatjar.jar:?]
	at io.anserini.collection.DocumentCollection.iterator(DocumentCollection.java:110) [anserini-0.6.0-fatjar.jar:?]

get_document_vector() and get_postings_list() Stemming ?

Hi @lintool!
I have a new issue:
I created a new index with the dataset "DUC-2001" by means of this command:

 sh anserini/target/appassembler/bin/IndexCollection \
            -collection TrecCollection \
            -generator JsoupGenerator \
            -threads 2 \
            -input ${EXP}/ \
            -index indexes/lucene-index.XXX \
            -storePositions -storeDocvectors -storeRawDocs

I also installed the Luke toolbox to understand how the index works.

When I run this code:

for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)

It works for some terms but not for all...

Traceback (most recent call last):
  File "doc2index_2.py", line 50, in <module>
    postings_list = index_utils.get_postings_list(term_)
  File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
    postings_list = self.object.getPostingsList(self.reader, JString(term))
  File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException

I think there are two different indexes: the first applies stemming (the word "Cherokee" becomes "cheroke") and the second keeps the word without stemming.

So, how can I apply stemming when looking up the postings index?

Best regards
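Since the terms returned by get_document_vector are already analyzed (stemmed), one option is to skip re-analysis when fetching postings, a sketch using the analyze flag shown elsewhere in this API:

for term_ in doc_vector:
    # term_ is already stemmed; look it up verbatim.
    postings_list = index_utils.get_postings_list(term_, analyze=False)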

get_document_vector vs get_postings_list

Hello! Thank you for your work!
I have an issue, and I am not sure whether it's a bug in the API or a misunderstanding on my part of the API's semantics.
When I run these two programs and analyze the frequency of the term "standard" in some documents, I get:

  • Program 1:
    doc_vector = index_utils.get_document_vector(XXX)
    Output : {"...",'standard': 3, "..."}
  • Program 2:
    postings_list = index_utils.get_postings_list('standard')
    Output: docid=XXX, tf=1, pos=[26]

In fact, for the same document id, the first program reports a term frequency of 3 while the second reports 1. The term "standard" was used here just for illustration; I encounter the same problem for other terms.

I would be very grateful for feedback on this issue.

Best regards,

Retire technical debt for pysearch.search __main__

Arg layout should be consistent with the Anserini structure: change -prf to -prcl, since we've decided to name the class PseudoRelevanceClassifierReranker

Args should be grouped, e.g., -prcl.r, -prcl.n, -prcl.alpha


Use tqdm as a progress indicator.


Write test case: you can assume the CACM test index: https://github.com/castorini/pyserini/blob/master/tests/test_indexutils.py#L31

This would be similar to the commit hook in anserini: https://github.com/castorini/anserini/blob/master/.travis.yml#L20


I think the output file should be obligatory.

Replicate SearchCollection in Pyserini

It'd be nice to be able to replicate standard regression runs directly from Python, something like:

python -m pyserini.search_collection ...

We should be able to get exactly the same output as from Java.

Use case for dump_document_vectors in IndexReaderUtils?

Hi @zeynepakkalyoncu - do we have an actual use case for dump_document_vectors in IndexReaderUtils? I'm trying to write a use case for it, and I can't seem to get it to work; it throws a mysterious jnius.JavaException: JVM exception occurred error.

If you're using this feature somewhere (in a notebook), we should make sure it works and write a test case... otherwise I would suggest removing it until a real use case comes up.

Expose methods to convert between internal and external docid

This is a summary of the issue presented in #32

Consider this fragment:

>>> from pyserini.index import pyutils
>>> 
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> postings_list = index_utils.get_postings_list('black')
>>> 
>>> for i in range(0, 10):
...     print('{}'.format(postings_list[i]))
... 
(6, 2) [555,606]
(29, 1) [410]
(32, 2) [65,462]
(35, 2) [288,475]
(56, 1) [662]
(60, 1) [69]
(61, 1) [110]
(63, 1) [195]
(74, 2) [230,518]
(96, 1) [107]

The docids (e.g., 6 in the first posting) refer to internal Lucene docids, which are different from external docids (i.e., those in the collection).

Use this hidden method convertLuceneDocidToDocid to convert, as in:

>>> for i in range(0, 10):
...     print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA111289-0011 (6, 2) [555,606]
LA092890-0052 (29, 1) [410]
LA022489-0041 (32, 2) [65,462]
LA051990-0051 (35, 2) [288,475]
LA092890-0077 (56, 1) [662]
LA022489-0061 (60, 1) [69]
LA021889-0073 (61, 1) [110]
LA110689-0057 (63, 1) [195]
LA080789-0088 (74, 2) [230,518]
LA021889-0117 (96, 1) [107]

The TODO is to explicitly expose convertLuceneDocidToDocid.

Similarly, we can use convertDocidToLuceneDocid to convert an external collection docid into an internal docid:

>>> from jnius import autoclass
>>> JString = autoclass('java.lang.String')
>>> index_utils.object.convertDocidToLuceneDocid(index_utils.reader, JString("LA052189-0089"))
200443

We can verify as follows:

>>> for i in range(len(postings_list)):
...     if postings_list[i].docid == 200443:
...         print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA052189-0089 (200443, 64) [18,133,175,212,225,244,262,273,307,320,344,372,388,431,438,454,464,541,576,583,616,640,772,778,801,831,838,885,891,912,937,952,970,1123,1151,1165,1180,1210,1215,1231,1270,1307,1346,1431,1436,1507,1514,1542,1546,1550,1663,1676,1726,1750,1764,1769,1781,1784,1838,1847,1873,1880,1922,1971]

Which matches exactly what we get from get_document_vector:

>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> doc_vector = index_utils.get_document_vector("LA052189-0089")
>>> doc_vector['black']
64
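For reference, later versions of the index reader expose these conversions directly, along these lines (a sketch; method names are version-dependent, with IndexReader replacing IndexReaderUtils):

internal = index_reader.convert_collection_docid_to_internal_docid('LA052189-0089')
external = index_reader.convert_internal_docid_to_collection_docid(internal)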

AttributeError: 'Document' object has no attribute 'raw'

import json

from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('/home/ds/anserini/covid-2020-04-17/lucene-index-covid-paragraph-new/')
hits = searcher.search('nsp1 synthesis degradation', 10)

article = json.loads(searcher.doc('42saxb98').raw)

# Uncomment to print the entire article... warning, it's long! :)

#print(json.dumps(article, indent=4))

article['metadata']['title']

Error:


AttributeError Traceback (most recent call last)
in ()
----> 1 article = json.loads(searcher.doc('42saxb98').raw)
2
3 # Uncomment to print the entire article... warning, it's long! :)
4 #print(json.dumps(article, indent=4))
5

AttributeError: 'Document' object has no attribute 'raw'
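In more recent Pyserini versions, raw is a method rather than an attribute, so the equivalent call would be along these lines (a sketch):

article = json.loads(searcher.doc('42saxb98').raw())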

Move location of search main

Right now we have:

$ python -m pyserini.search.pysearch
usage: pysearch.py [-h] -index path -topics path -output path

Maybe let's move to:

$ python -m pyserini.search
usage: pysearch.py [-h] -index path -topics path -output path

What do you think @yuki617 ?

Make methods in IndexReaderUtils more consistent re: Analyzer

We have:

# Pass in a no-op analyzer:
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
index_utils.get_term_counts(term, analyzer=analyzer)
df, cf = index_utils.get_term_counts(term)

Here, we take an analyzer.

And:

# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyze=False)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

Here, we take a bool. Let's make both consistent?

How about both taking an analyzer and accepting None? Passing in a "no-op" analyzer seems a bit janky.

Thoughts? @PepijnBoers @chriskamphuis

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.9.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary pyserini -w /tmp/ext pyserini==0.9.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting pyserini==0.9.2.0
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/6bb/4e22d7cb0a83a/pyserini-0.9.2.0.tar.gz (57.8 MB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-we7atqws/pyserini/pip-egg-info
         cwd: /tmp/pip-wheel-we7atqws/pyserini/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-we7atqws/pyserini/setup.py", line 3, in <module>
        with open("project-description.md", "r") as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'project-description.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Testing framework

Now that Pyserini is reasonably stable... we kinda need a testing framework...

Index own data

How do I index my own data in Pyserini? All the notebooks and examples use prebuilt indexes.

Strange terms() behavior in IndexReaderUtils

I do this:

from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('lucene-index.robust04.pos+docvectors+rawdocs')

iter1 = index_utils.terms()
iter2 = index_utils.terms()

for term in iter1:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

As expected, I iterate over all terms. Then:

for term in iter2:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

Gives nothing... is terms() returning the same iterator every time?

Doesn't seem like the right behavior?
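If terms() does hand back a single shared iterator, a simple workaround is to materialize the terms once and reuse the list (a sketch):

terms = list(index_utils.terms())
for term in terms:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
# A second pass over the list now works as expected.
for term in terms:
    print(term.term)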

Access term frequencies in index

I need to access statistics of my index, like tf, tf-idf, etc. I want to answer questions like: In how many documents does a specific term occur? Which documents does a specific term occur in? What terms occur in a given document? Is there any way to do this using Pyserini? Thanks!
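The IndexReaderUtils API shown in other issues covers most of this; a sketch (the index path is illustrative):

from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('path/to/index')

# In how many documents does a specific term occur (df), and how often overall (cf)?
df, cf = index_utils.get_term_counts('standard')

# Which documents does the term occur in, with what frequency and positions?
for posting in index_utils.get_postings_list('standard'):
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

# What terms occur in a given document?
tf = index_utils.get_document_vector('LA052189-0089')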

Issue with VM during pysearch import

I am trying to import pysearch from pyserini.search. I set the JAVA_HOME variable to jdk11. I am running this using Jupyter Notebook and I am getting the error as shown.

import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'

from pyserini.search import pysearch

The error is:

ValueError: VM is already running, can't set classpath/options; VM started at  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main

I initially was running in a virtual environment (conda environment). I tried coming out of it and execute, but I am still getting the same issue. Any workarounds? Or am I missing something?

I am using MacOS Catalina.
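The error means the JVM was already started (e.g., by an earlier import in the same kernel) before JAVA_HOME took effect, so the classpath and options can no longer be changed. A workaround sketch: restart the kernel, set the environment variable first, and only then import anything that touches the JVM.

import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'

# Import pyserini only after JAVA_HOME is set; any earlier pyserini/jnius
# import in the session will have already started the VM.
from pyserini.search import pysearch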

JavaException: JVM exception occurred: Could not load codec 'Lucene84'

I first indexed my docs using Lucene 8.5.1 in Java; then, after running

searcher = pysearch.SimpleSearcher('/home/sipah00/java_lucene/lucene_idx1/')

I got this error: JavaException: JVM exception occurred: Could not load codec 'Lucene84'. Did you forget to add lucene-backward-codecs.jar?
How do I resolve this issue? Is there a specific version of Lucene that Pyserini supports?
Also, how can I index my docs with Pyserini only?

Pyserini package/module structure: unnecessary nesting?

Currently, invocation looks something like:

from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

@zeynepakkalyoncu @emmileaf don't you think we have one unnecessary nested layer?

Would something like this make more sense?

from pyserini import search
searcher = search.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

or

from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

Import Error

Hi. I am getting the error "ImportError: DLL load failed: The specified module could not be found." when trying to import both of:
from pyserini.search import pysearch
from pyserini.search.pysearch import SimpleSearcher

The modules are installed and are suggested by the editor when typing them. I am using PyCharm, Python 3.7.6, and tried both pyserini==0.9.0.0 and pyserini==0.9.3.0.

Score query wrt document or rerank input set of documents wrt query

Feature request from @cmacdonald -

Compute scores with respect to a set of documents specified by the user, i.e., "rerank this set for me". This could decompose into "score this query wrt this document" with an outer loop over documents.

Can be accomplished today by setting k to be a really large value and then filtering results... but having a better implementation of this feature would be generally useful.
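For reference, later versions of the index reader expose a direct query-document scoring call that supports exactly this decomposition (a sketch; class and method names are version-dependent, and candidate_docids is supplied by the user):

from pyserini.index.lucene import LuceneIndexReader  # name varies across versions

index_reader = LuceneIndexReader('path/to/index')
# Score one query against each candidate document ("rerank this set for me").
for docid in candidate_docids:
    score = index_reader.compute_query_document_score(docid, 'my query')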

How can I build an index with pyserini?

Hi, congratulations on this work!
I just wanted to ask whether you could add an example of how to use Pyserini to build a new index. Also, to build my own collection, do I just have to override the Collection class?
Thanks!

How to build indexes

What should I use to build indexes? I was looking at a few options: Solr, Elasticsearch, Lucene, etc.
