castorini / pyserini
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
Home Page: http://pyserini.io/
License: Apache License 2.0
We have scripts here and there for reading in TREC runs, interpolating runs, etc. Should we add them to Pyserini, organized under a nice API?
That would be better than an endless proliferation of scripts...
Thoughts @rodrigonogueira4 @x65han ?
Follow up to #91
Let's document these new features?
Based on our new conventions: https://github.com/lintool/guide/blob/master/coding-style.md
Hi @zeynepakkalyoncu - do we have an actual use case for `dump_document_vectors` in `IndexReaderUtils`? I'm trying to write a use case for it, and I can't even get it to work - it throws a mysterious `jnius.JavaException: JVM exception occurred` error.
If you're using this feature somewhere (in a notebook), we should make sure it works and write a test case... otherwise I would suggest removing it until a real use case comes up.
Feature request from @cmacdonald -
Compute scores wrt a set of documents specified by the user, i.e., "rerank this set for me". Could decompose into "score this query wrt this document" with an outer loop over documents.
Can be accomplished today by setting k to be a really large value and then filtering results... but having a better implementation of this feature would be generally useful.
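The "set k large, then filter" workaround can be sketched as follows; the helper name and the (docid, score) tuples are hypothetical stand-ins for searcher hits, not actual Pyserini API:

```python
def rerank_subset(hits, candidate_docids, k=10):
    # Keep only hits whose docid is in the user-specified set,
    # then re-rank by score and truncate to the top k.
    filtered = [h for h in hits if h[0] in candidate_docids]
    filtered.sort(key=lambda h: h[1], reverse=True)
    return filtered[:k]

# Toy ranked list of (docid, score) pairs standing in for searcher output.
hits = [('d1', 9.1), ('d2', 8.7), ('d3', 8.2), ('d4', 7.9)]
print(rerank_subset(hits, {'d1', 'd3'}, k=2))  # [('d1', 9.1), ('d3', 8.2)]
```

A first-class implementation would instead score the query against exactly the supplied documents, avoiding the need to over-retrieve.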
Is it possible to dump out tf-idf document vectors for retrieved documents in pyserini?
Obtained a JavaException error after importing:
from pyserini.search import pysearch
--> Error:
... jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
Installed Pyserini via:
pip install pyserini --user
I was not able to resolve this. Complete error:
Traceback (most recent call last):
  File "build_db.py", line 6, in <module>
    from pyserini.search import pysearch
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/search/pysearch.py", line 25, in <module>
    from ..pyclass import JSearcher, JResult, JDocument, JString, JArrayList, JTopics, JTopicReader
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/pyclass.py", line 51, in <module>
    JDefaultEnglishAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
  File "/home/pboers/.local/lib/python3.7/site-packages/jnius/reflect.py", line 208, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
Let's try to replicate 20 newsgroup classification w/ scikit-learn using Pyserini and Anserini:
https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html
That is, use Pyserini to extract the tf-idf vectors that feed the classifiers in scikit-learn.
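A minimal sketch of the idea, assuming per-document term-frequency dicts of the shape returned by `get_document_vector` (the `to_matrix` helper is hypothetical); the resulting count matrix is what a scikit-learn tf-idf transformer would consume:

```python
def to_matrix(doc_vectors):
    # doc_vectors: {docid: {term: tf}}, i.e., get_document_vector-style output.
    # Returns a sorted vocabulary and a dense term-count matrix, one row per doc.
    vocab = sorted({t for v in doc_vectors.values() for t in v})
    col = {t: j for j, t in enumerate(vocab)}
    rows = []
    for tf in doc_vectors.values():
        row = [0] * len(vocab)
        for term, count in tf.items():
            row[col[term]] = count
        rows.append(row)
    return vocab, rows

vecs = {'d1': {'black': 2, 'cat': 1}, 'd2': {'cat': 3}}
vocab, matrix = to_matrix(vecs)
print(vocab)   # ['black', 'cat']
print(matrix)  # [[2, 1], [0, 3]]
```

In practice one would build a `scipy.sparse` matrix rather than dense rows, since the vocabulary is large.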
Issue transferred from castorini/anserini#1120
Initial work by @yuki617 here:
https://github.com/yuki617/anserini/blob/tfidf/20newgroup_replication.ipynb
Working on the 20 newsgroup dataset and encountered the following error:
Also attached the problematic document:
pyserini==0.8.1.0
Related to #43 - currently `IndexReaderUtils` exposes an `analyze` method:
https://github.com/castorini/pyserini/blob/master/pyserini/index/pyutils.py
This is hard-coded to a default analyzer. We should think about how to expose arbitrary Lucene analyzers in general... what would the API look like?
Hi @zeynepakkalyoncu we should check out Guido's materials for AFIRM 2019:
https://github.com/ielab/afirm2019
This is a summary of the issue presented in #32
Consider this fragment:
>>> from pyserini.index import pyutils
>>>
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> postings_list = index_utils.get_postings_list('black')
>>>
>>> for i in range(0, 10):
... print('{}'.format(postings_list[i]))
...
(6, 2) [555,606]
(29, 1) [410]
(32, 2) [65,462]
(35, 2) [288,475]
(56, 1) [662]
(60, 1) [69]
(61, 1) [110]
(63, 1) [195]
(74, 2) [230,518]
(96, 1) [107]
The docids (e.g., 6 in the first posting) refer to internal Lucene docids, which are different from external docids (i.e., those in the collection).
Use this hidden method `convertLuceneDocidToDocid` to convert, as in:
>>> for i in range(0, 10):
... print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
...
LA111289-0011 (6, 2) [555,606]
LA092890-0052 (29, 1) [410]
LA022489-0041 (32, 2) [65,462]
LA051990-0051 (35, 2) [288,475]
LA092890-0077 (56, 1) [662]
LA022489-0061 (60, 1) [69]
LA021889-0073 (61, 1) [110]
LA110689-0057 (63, 1) [195]
LA080789-0088 (74, 2) [230,518]
LA021889-0117 (96, 1) [107]
The TODO is to explicitly expose `convertLuceneDocidToDocid`.
Similarly, we can use `convertDocidToLuceneDocid` to convert an external collection docid into an internal docid:
>>> from jnius import autoclass
>>> JString = autoclass('java.lang.String')
>>> index_utils.object.convertDocidToLuceneDocid(index_utils.reader, JString("LA052189-0089"))
200443
We can verify as follows:
>>> for i in range(len(postings_list)):
... if postings_list[i].docid == 200443:
... print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
...
LA052189-0089 (200443, 64) [18,133,175,212,225,244,262,273,307,320,344,372,388,431,438,454,464,541,576,583,616,640,772,778,801,831,838,885,891,912,937,952,970,1123,1151,1165,1180,1210,1215,1231,1270,1307,1346,1431,1436,1507,1514,1542,1546,1550,1663,1676,1726,1750,1764,1769,1781,1784,1838,1847,1873,1880,1922,1971]
Which matches exactly what we get from `get_document_vector`:
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> doc_vector = index_utils.get_document_vector("LA052189-0089")
>>> doc_vector['black']
64
import json
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('/home/ds/anserini/covid-2020-04-17/lucene-index-covid-paragraph-new/')
hits = searcher.search('nsp1 synthesis degradation', 10)
article = json.loads(searcher.doc('42saxb98').raw)
#print(json.dumps(article, indent=4))
article['metadata']['title']
Error:
AttributeError Traceback (most recent call last)
in ()
----> 1 article = json.loads(searcher.doc('42saxb98').raw)
2
3 # Uncomment to print the entire article... warning, it's long! :)
4 #print(json.dumps(article, indent=4))
5
AttributeError: 'Document' object has no attribute 'raw'
I was trying to run the demo code. Importing gives me this error:
File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'
How do I index my own data in Pyserini? All the notebooks and examples use prebuilt indexes.
As a result of castorini/anserini#1027 - we need to refactor analyzers in Pyserini.
Multiple people have asked for the ability to perform indexing from Python. This shouldn't be too hard, we just need to properly expose `IndexCollection`.
We would like to be able to search an index without the query being processed by the stemmer. The specific use-case would be for the background linking task of TREC. We want to use the document vectors (that contain stemmed terms) to construct a new query.
The current invocation is something like:
from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
@zeynepakkalyoncu @emmileaf don't you think we have one unnecessary nested layer?
Would something like this make more sense?
from pyserini import search
searcher = search.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
or
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')
I first indexed my docs using `lucene-8.5.1` in Java; then after running
searcher = pysearch.SimpleSearcher('/home/sipah00/java_lucene/lucene_idx1/')
I got this error: `JavaException: JVM exception occurred: Could not load codec 'Lucene84'. Did you forget to add lucene-backward-codecs.jar?`
How do I resolve this issue? Is there a specific version of `lucene` that `pyserini` supports?
Also, how can I index my docs in `pyserini` only?
It'd be nice to be able to replicate standard regression runs directly from Python, something like:
python -m pyserini.search_collection ...
We should be able to get exactly the same output as from Java.
This no longer works:
from pyserini.search import pysearch
topics = pysearch.get_topics('msmarco_passage_dev_subset')
The reason is that the Java end uses generics, and so pyjnius can't properly dispatch to the method. See: kivy/pyjnius#134
I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:
index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}
output:
tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}
I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."
With RM3, we get an NPE when we try to search with a query built using the querybuilder.
I need to access statistics of my index like tf, tf-idf, etc. I want to answer questions like: In how many documents does a specific term occur? Which documents does a specific term occur in? What terms occur in the 1st document? Is there any way to do this using Pyserini? Thanks!
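All three questions map naturally onto an inverted index; here is a toy pure-Python illustration of the lookups involved (with Pyserini itself these would be calls along the lines of `get_term_counts`, `get_postings_list`, and `get_document_vector`):

```python
from collections import defaultdict

# Toy analyzed corpus standing in for an index.
docs = {'d1': ['black', 'cat'], 'd2': ['black', 'dog'], 'd3': ['dog']}

# Build the term -> set-of-docids postings.
postings = defaultdict(set)
for docid, terms in docs.items():
    for term in terms:
        postings[term].add(docid)

df = len(postings['black'])        # in how many documents does the term occur?
where = sorted(postings['black'])  # which documents does it occur in?
doc_terms = docs['d1']             # what terms occur in a given document?
print(df, where, doc_terms)        # 2 ['d1', 'd2'] ['black', 'cat']
```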
The main `README.md` is getting pretty long. Should we break it into separate pages?
@chriskamphuis @PepijnBoers @x65han thoughts?
@x389liu has been working with Spacy and Pyserini.
She's offered to write up a guide to take Pyserini output and do basic NLP on it... e.g., sentence chunking, NER, etc.
Now that Pyserini is reasonably stable... we kinda need a testing framework...
https://github.com/castorini/pyserini/blob/master/pyserini/trectools/_base.py#L17
from __future__ import annotations
This, from what I understand, forces users to Python 3.7. Is this okay?
I don't mind either way, but we should have a discussion about it... and make the decision across all castorini.
@rodrigonogueira4 @x65han @ronakice thoughts?
We have:
# Pass in a no-op analyzer:
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
index_utils.get_term_counts(term, analyzer=analyzer)
df, cf = index_utils.get_term_counts(term)
Here, we take an `analyzer`.
And:
# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyze=False)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')
Here, we take a bool. Let's make both consistent? How about both take an `analyzer` and accept `None`? Passing in a "no-op" analyzer seems a bit janky.
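The proposed dispatch could look like the sketch below; `lookup_term` is a hypothetical helper and `str.lower` stands in for a real Lucene analyzer:

```python
def lookup_term(term, analyzer=None):
    # Proposed convention: analyzer=None means the term is already
    # analyzed and should be used verbatim, instead of requiring the
    # caller to construct a no-op analyzer.
    if analyzer is not None:
        term = analyzer(term)
    return term

print(lookup_term('Black', analyzer=str.lower))  # 'black'
print(lookup_term('cheroke'))                    # 'cheroke' (used as-is)
```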
Thoughts? @PepijnBoers @chriskamphuis
I am trying to import pysearch from pyserini.search. I set the JAVA_HOME variable to jdk11. I am running this using Jupyter Notebook and I am getting the error as shown.
import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'
from pyserini.search import pysearch
The error is:
ValueError: VM is already running, can't set classpath/options; VM started at File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main
I initially was running in a virtual environment (conda environment). I tried coming out of it and execute, but I am still getting the same issue. Any workarounds? Or am I missing something?
I am using MacOS Catalina.
Currently, we have
numpy==1.16.4
scipy==1.4.1
The latest `numpy` appears to be 1.18.4; `scipy` appears to be up to date.
- Arg layout should be consistent with the Anserini structure: change `-prf` to `-prcl`, since we've decided to name the class `PseudoRelevanceClassifierReranker`
- Args should be grouped, e.g., `-prcl.r`, `-prcl.n`, `-prcl.alpha`
- Use tqdm for the progress indicator
Write test case: you can assume the CACM test index: https://github.com/castorini/pyserini/blob/master/tests/test_indexutils.py#L31
This would be similar to the commit hook in anserini: https://github.com/castorini/anserini/blob/master/.travis.yml#L20
I think the output file should be obligatory.
Hello! Thank you for your work!
I have some issues and I am not sure whether it's a bug in the API or due to my misunderstanding of some semantics related to the API.
When I run these two programs and analyze the frequency of the term "standard" in some documents, for the same document id I get a count of 3 from the first program and a count of 1 from the second. The term "standard" was used here just for illustration; I encounter the same problem with other terms.
I will be very grateful to get a feedback about this issue.
Best regards,
Call it `FusionSearcher` or something like that. Combines results via RRF by default.
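Reciprocal rank fusion itself is simple; a minimal sketch (function name hypothetical) using the usual score(d) = sum over runs of 1 / (k + rank), with k = 60:

```python
from collections import defaultdict

def rrf_fuse(runs, k=60):
    # Each run is a ranked list of docids; a document's fused score is
    # the sum of 1 / (k + rank) over every run it appears in.
    scores = defaultdict(float)
    for run in runs:
        for rank, docid in enumerate(run, start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

run_a = ['d1', 'd2', 'd3']
run_b = ['d2', 'd3', 'd1']
print(rrf_fuse([run_a, run_b]))  # ['d2', 'd1', 'd3']
```

A `FusionSearcher` would wrap several searchers, collect one run per searcher, and apply a fusion like this to the results.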
Right now we have:
$ python -m pyserini.search.pysearch
usage: pysearch.py [-h] -index path -topics path -output path
Maybe let's move to:
$ python -m pyserini.search
usage: pysearch.py [-h] -index path -topics path -output path
What do you think @yuki617 ?
I do this:
from pyserini.index import pyutils
index_utils = pyutils.IndexReaderUtils('lucene-index.robust04.pos+docvectors+rawdocs')
iter1 = index_utils.terms()
iter2 = index_utils.terms()
for term in iter1:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
As expected, I iterate over all terms. Then:
for term in iter2:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))
Gives nothing... is `terms()` returning the same iterator every time? That doesn't seem like the right behavior?
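That is exactly how a Python generator behaves once exhausted: if `terms()` hands back the same underlying iterator each time, the second loop sees nothing. A quick illustration of the symptom, plus caller-side workarounds:

```python
import itertools

def terms():
    # Stand-in for an API method; yields a stream of terms.
    yield from ['black', 'cat']

it1 = terms()
print(list(it1))  # ['black', 'cat']
print(list(it1))  # [] -- the iterator is exhausted, as in the bug above

# Workarounds: materialize the terms once, or duplicate the iterator.
snapshot = list(terms())
a, b = itertools.tee(terms())
print(snapshot == list(a) == list(b))  # True
```

The proper fix on the library side would be for `terms()` to return a fresh iterator on every call.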
Hi. I'm getting the error "ImportError: DLL load failed: The specified module could not be found." when trying to import both of:
from pyserini.search import pysearch
from pyserini.search.pysearch import SimpleSearcher
The modules have installed and they are suggested by the editor when typing them. I am using PyCharm, Python 3.7.6, and tried both pyserini==0.9.0.0 and pyserini==0.9.3.0.
What should I use to make indexes? I was looking at a few options - Solr, Elasticsearch, Lucene, etc.
We've been having an issue where `Collection` iterators sometimes freeze when used in a forked process. This is not specific to Pyserini and can be reproduced with only pyjnius and a `BufferedReader`. See below for a minimal script to reproduce it.
This issue reliably occurs if `BufferedReader` is run on a "large" directory, where "large" depends on a combination of the number of files and their sizes. It does not happen on a directory with 2000 empty files (created with `touch $fn`), but it does happen if this is increased to 3000 empty files. It also happens if the 2000 files are 1MB each (`dd if=/dev/zero of=$fn bs=1M count=1`) rather than empty. (I've also reproduced it on a random directory containing 265 files of varying sizes.)
Script:
import os
import sys
from multiprocessing import Pool
from jnius import autoclass

jstr = autoclass("java.lang.String")
jbr = autoclass("java.io.BufferedReader")
jfr = autoclass("java.io.FileReader")

def jprint(x):
    fr = jfr(x)
    f = jbr(fr)
    while True:
        line = f.readLine()
        if line is None:
            print("break")
            f.close()
            fr.close()
            break
        else:
            print("not none")

if __name__ == "__main__":
    dir = sys.argv[1]
    p = Pool(5)
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    p.map(jprint, fns)
It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.9.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.
+ /tmp/venv/bin/pip3 wheel --no-binary pyserini -w /tmp/ext pyserini==0.9.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting pyserini==0.9.2.0
Downloading http://10.10.0.139:9191/root/pypi/%2Bf/6bb/4e22d7cb0a83a/pyserini-0.9.2.0.tar.gz (57.8 MB)
ERROR: Command errored out with exit status 1:
command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-we7atqws/pyserini/pip-egg-info
cwd: /tmp/pip-wheel-we7atqws/pyserini/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-wheel-we7atqws/pyserini/setup.py", line 3, in <module>
with open("project-description.md", "r") as fh:
FileNotFoundError: [Errno 2] No such file or directory: 'project-description.md'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
Pyserini notebooks are not updated with the most recent commit. For example, in pyserini_robust04_demo.ipynb, `hits[0].contents` should be used instead of `hits[0].content` after this change.
Hi. I have an Anserini index and am trying to access its statistics, so I am following this:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md
when I type:
from pyserini import analysis, index
index_reader = index.IndexReader()
I am getting:
AttributeError Traceback (most recent call last)
in ()
1 from pyserini import analysis, index
2
----> 3 index_reader = index.IndexReader()
AttributeError: module 'pyserini.index' has no attribute 'IndexReader'
Hi @lintool!
I have a new issue:
I created a new index with the "DUC-2001" dataset using this command:
sh anserini/target/appassembler/bin/IndexCollection \
-collection TrecCollection \
-generator JsoupGenerator \
-threads 2 \
-input ${EXP}/ \
-index indexes/lucene-index.XXX \
-storePositions -storeDocvectors -storeRawDocs
I also installed the Luke toolbox to understand how the index works.
When I run this code:
for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)
it works for some terms but not for all...
Traceback (most recent call last):
File "doc2index_2.py", line 50, in <module>
postings_list = index_utils.get_postings_list(term_)
File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
postings_list = self.object.getPostingsList(self.reader, JString(term))
File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException
I think there are two different indexes: the first one applies stemming (the word "Cherokee" becomes "cheroke") and the second keeps the word without stemming.
So, how can I apply stemming when querying the postings index?
Best regards
All CACM HTML files are under `collection`. Instantiating `pycollection.Collection` with Python 3.5.0 and Java 11 leads to `java.nio.file.NoSuchFileException`:
from pyserini.collection import pycollection
collection = pycollection.Collection('HtmlCollection', 'collection/')
2019-11-22 08:05:07,954 ERROR [main] collection.DocumentCollection$2 (DocumentCollection.java:226) - Visiting failed for ä
java.nio.file.NoSuchFileException: ä
at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149) ~[?:?]
at java.base/java.nio.file.Files.readAttributes(Files.java:1763) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276) ~[?:?]
at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322) ~[?:?]
at java.base/java.nio.file.Files.walkFileTree(Files.java:2716) [?:?]
at java.base/java.nio.file.Files.walkFileTree(Files.java:2796) [?:?]
at io.anserini.collection.DocumentCollection.discover(DocumentCollection.java:232) [anserini-0.6.0-fatjar.jar:?]
at io.anserini.collection.DocumentCollection.iterator(DocumentCollection.java:110) [anserini-0.6.0-fatjar.jar:?]
Expose the internal Lucene docid to the searcher so that the Python interface can easily iterate over the index.
@stephaniewhoo Can you share your notebook here and we can discuss?
Hi, congratulations on this work!
I just wanted to ask, if possible, for an example of how to use Pyserini to build a new index. Also, to build my own collection, do I just have to override the Collection class?
Thanks!
We should have a version of https://github.com/castorini/anserini/blob/master/src/main/python/verify_simplesearcher.py that verifies the output of `SimpleSearcher` against Java's `SearchCollection`. Most of the script can be reused (just copy over).
It should take a path to the Anserini root, something like `-anserini ..`.