
pyserini's Issues

Use case for dump_document_vectors in IndexReaderUtils?

Hi @zeynepakkalyoncu - do we have an actual use case for dump_document_vectors in IndexReaderUtils? I'm trying to write a use case for it, and I can't even get it to work: it throws a mysterious jnius.JavaException: JVM exception occurred error.

If you're using this feature somewhere (in a notebook), we should make sure it works and write a test case... otherwise I would suggest removing it until a real use case comes up.

Score query wrt document or rerank input set of documents wrt query

Feature request from @cmacdonald -

Compute scores wrt a set of documents specified by the user, i.e., "rerank this set for me". Could decompose into "score this query wrt this document" with an outer loop over documents.

Can be accomplished today by setting k to be a really large value and then filtering results... but having a better implementation of this feature would be generally useful.
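
The workaround described above (retrieve with a very large k, then filter) can be sketched roughly as follows. This is an illustration only: rerank_by_filtering and search_fn are hypothetical names, and search_fn is assumed to be any callable that takes (query, k) and returns (docid, score) pairs in rank order, for example a thin wrapper around a searcher.

```python
from typing import Callable, Iterable, List, Tuple

Hit = Tuple[str, float]


def rerank_by_filtering(
    search_fn: Callable[[str, int], List[Hit]],  # hypothetical search wrapper
    query: str,
    candidates: Iterable[str],
    k: int = 100000,
) -> List[Hit]:
    """Rank only the given candidate docids by over-retrieving and filtering."""
    wanted = set(candidates)
    # Retrieve a deliberately large ranked list for the query.
    hits = search_fn(query, k)
    # Keep the original rank order, but only for the requested documents.
    return [(docid, score) for docid, score in hits if docid in wanted]
```

Candidates that fall outside the top k are silently dropped by this approach, which is one reason a dedicated "score this query wrt this document" method would be preferable.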

Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Obtained a JavaException error after importing:
from pyserini.search import pysearch

--> Error:
... jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Installed Pyserini via:
pip install pyserini --user

I was not able to resolve this. I tried:

  • replacing the fatjar with a newly built Anserini 0.8.1 fatjar
  • manual configuration of classpath via configure_classpath() method in pyserini.setup.

Complete error:

Traceback (most recent call last):
  File "build_db.py", line 6, in <module>
    from pyserini.search import pysearch
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/search/pysearch.py", line 25, in <module>
    from ..pyclass import JSearcher, JResult, JDocument, JString, JArrayList, JTopics, JTopicReader
  File "/home/pboers/.local/lib/python3.7/site-packages/pyserini/pyclass.py", line 51, in <module>
    JDefaultEnglishAnalyzer = autoclass('io.anserini.analysis.DefaultEnglishAnalyzer')
  File "/home/pboers/.local/lib/python3.7/site-packages/jnius/reflect.py", line 208, in autoclass
    c = find_javaclass(clsname)
  File "jnius/jnius_export_func.pxi", line 28, in jnius.find_javaclass
jnius.JavaException: Class not found b'io/anserini/analysis/DefaultEnglishAnalyzer'

Expose methods to convert between internal and external docid

This is a summary of the issue presented in #32

Consider this fragment:

>>> from pyserini.index import pyutils
>>> 
>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> postings_list = index_utils.get_postings_list('black')
>>> 
>>> for i in range(0, 10):
...     print('{}'.format(postings_list[i]))
... 
(6, 2) [555,606]
(29, 1) [410]
(32, 2) [65,462]
(35, 2) [288,475]
(56, 1) [662]
(60, 1) [69]
(61, 1) [110]
(63, 1) [195]
(74, 2) [230,518]
(96, 1) [107]

The docids (e.g., 6 in the first posting) refer to internal Lucene docids, which are different from external docids (i.e., those in the collection).

Use the hidden method convertLuceneDocidToDocid to convert, as in:

>>> for i in range(0, 10):
...     print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA111289-0011 (6, 2) [555,606]
LA092890-0052 (29, 1) [410]
LA022489-0041 (32, 2) [65,462]
LA051990-0051 (35, 2) [288,475]
LA092890-0077 (56, 1) [662]
LA022489-0061 (60, 1) [69]
LA021889-0073 (61, 1) [110]
LA110689-0057 (63, 1) [195]
LA080789-0088 (74, 2) [230,518]
LA021889-0117 (96, 1) [107]

The TODO is to explicitly expose convertLuceneDocidToDocid.

Similarly, we can use convertDocidToLuceneDocid to convert an external collection docid into an internal docid:

>>> from jnius import autoclass
>>> JString = autoclass('java.lang.String')
>>> index_utils.object.convertDocidToLuceneDocid(index_utils.reader, JString("LA052189-0089"))
200443

We can verify as follows:

>>> for i in range(len(postings_list)):
...     if postings_list[i].docid == 200443:
...         print('{} {}'.format(index_utils.object.convertLuceneDocidToDocid(index_utils.reader, postings_list[i].docid), postings_list[i]))
... 
LA052189-0089 (200443, 64) [18,133,175,212,225,244,262,273,307,320,344,372,388,431,438,454,464,541,576,583,616,640,772,778,801,831,838,885,891,912,937,952,970,1123,1151,1165,1180,1210,1215,1231,1270,1307,1346,1431,1436,1507,1514,1542,1546,1550,1663,1676,1726,1750,1764,1769,1781,1784,1838,1847,1873,1880,1922,1971]

Which matches exactly what we get from get_document_vector:

>>> index_utils = pyutils.IndexReaderUtils('index-robust04-20191213/')
>>> doc_vector = index_utils.get_document_vector("LA052189-0089")
>>> doc_vector['black']
64
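
One possible shape for the exposed methods, as a sketch only: the function names to_external_docid and to_internal_docid are hypothetical, and the bodies simply forward to the two hidden methods used in the fragments above. The to_jstring parameter is an assumption standing in for the JString conversion shown earlier.

```python
def to_external_docid(index_utils, lucene_docid):
    """Map an internal Lucene docid to the external collection docid."""
    # Forwards to the hidden method used in the fragment above.
    return index_utils.object.convertLuceneDocidToDocid(
        index_utils.reader, lucene_docid)


def to_internal_docid(index_utils, docid, to_jstring=str):
    """Map an external collection docid to the internal Lucene docid.

    to_jstring is a hypothetical hook: with a real index it would be the
    JString class obtained via autoclass('java.lang.String').
    """
    return index_utils.object.convertDocidToLuceneDocid(
        index_utils.reader, to_jstring(docid))
```

With a real index, the second function would be called as to_internal_docid(index_utils, 'LA052189-0089', to_jstring=JString).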

AttributeError: 'Document' object has no attribute 'raw'

import json
from pyserini.search import pysearch

searcher = pysearch.SimpleSearcher('/home/ds/anserini/covid-2020-04-17/lucene-index-covid-paragraph-new/')
hits = searcher.search('nsp1 synthesis degradation', 10)

article = json.loads(searcher.doc('42saxb98').raw)

# Uncomment to print the entire article... warning, it's long! :)

#print(json.dumps(article, indent=4))

article['metadata']['title']

Error:


AttributeError Traceback (most recent call last)
in ()
----> 1 article = json.loads(searcher.doc('42saxb98').raw)
2
3 # Uncomment to print the entire article... warning, it's long! :)
4 #print(json.dumps(article, indent=4))
5

AttributeError: 'Document' object has no attribute 'raw'

Index own data

How do I index my own data in Pyserini? All the notebooks and examples use prebuilt indexes.

Add functionality to search without preprocessing

We would like to be able to search an index without the query being processed by the stemmer. The specific use-case would be for the background linking task of TREC. We want to use the document vectors (that contain stemmed terms) to construct a new query.

Pyserini package/module structure: unnecessary nesting?

Currently, usage looks something like:

from pyserini.search import pysearch
searcher = pysearch.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

@zeynepakkalyoncu @emmileaf don't you think we have one unnecessary nested layer?

Would something like this make more sense?

from pyserini import search
searcher = search.SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

or

from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('lucene-index.robust04.pos+docvectors+rawdocs')

JavaException: JVM exception occurred: Could not load codec 'Lucene84'

I first indexed my docs using lucene-8.5.1 in Java. Then, after running

searcher = pysearch.SimpleSearcher('/home/sipah00/java_lucene/lucene_idx1/')

I got this error: JavaException: JVM exception occurred: Could not load codec 'Lucene84'. Did you forget to add lucene-backward-codecs.jar?
How can I resolve this issue? Is there a specific version of Lucene that Pyserini supports?
Also, how can I index my docs using only Pyserini?

Replicate SearchCollection in Pyserini

It'd be nice to be able to replicate standard regression runs directly from Python, something like:

python -m pyserini.search_collection ...

We should be able to get exactly the same output as from Java.

pysearch.get_topics doesn't work anymore

This no longer works:

from pyserini.search import pysearch
topics = pysearch.get_topics('msmarco_passage_dev_subset')

The reason is that the Java end uses generics, and so pyjnius can't properly dispatch to the method. See: kivy/pyjnius#134

Term occurs in document vector, but has collection frequency 0

I've found a term that occurs once in a document vector, but doesn't occur in the collection. Am I using the wrong analyzer or is this a bug? I've used the following Pyserini functions:

index_utils = pyutils.IndexReaderUtils('/Index/lucene-index.core18.pos+docvectors+rawdocs_all')
tf = index_utils.get_document_vector(docid)
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
df = {term: (index_utils.get_term_counts(term, analyzer=analyzer))[1] for term in tf.keys()}

output:

tf = {.. 'hobbies:photographi': 1, ..}
df = {.. 'hobbies:photographi': 0, ..}

I assume the term is derived from this part in the raw text: "..<b>HOBBIES:</b>Photography..."

Access term frequencies in index

I need to access statistics of my index, such as tf, tf-idf, etc. I want to answer questions like: How many documents does a specific term occur in? Which documents does a specific term occur in? Which terms occur in the first document? Is there any way to do this using Pyserini? Thanks!
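
A rough sketch of how these questions might be answered, assuming index_utils behaves like the IndexReaderUtils object quoted in other issues on this page (get_term_counts returning a (df, cf) pair, postings exposing docid and tf, and get_document_vector returning a term-to-frequency mapping). The helper names term_statistics and terms_in_document are hypothetical, and tf-idf is not covered.

```python
def term_statistics(index_utils, term):
    """Collect df, cf and per-document tf for one term.

    Note: the docids in the postings are internal Lucene docids,
    not the external collection docids.
    """
    df, cf = index_utils.get_term_counts(term)
    # Guard against a missing postings list (an assumption about the API).
    postings = index_utils.get_postings_list(term) or []
    return {
        'df': df,  # number of documents containing the term
        'cf': cf,  # total occurrences in the collection
        'tf_by_docid': {p.docid: p.tf for p in postings},
    }


def terms_in_document(index_utils, docid):
    """Return {term: tf} for one external docid."""
    return dict(index_utils.get_document_vector(docid))
```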

Testing framework

Now that Pyserini is reasonably stable... we kinda need a testing framework...

Make methods in IndexReaderUtils more consistent re: Analyzer

We have:

# Pass in a no-op analyzer:
analyzer = pyanalysis.get_lucene_analyzer(stemming=False, stopwords=False)
index_utils.get_term_counts(term, analyzer=analyzer)
df, cf = index_utils.get_term_counts(term)

Here, we take an analyzer.

And:

# Fetch and traverse postings for an analyzed term:
postings_list = index_utils.get_postings_list(analyzed[0], analyze=False)
for posting in postings_list:
    print(f'docid={posting.docid}, tf={posting.tf}, pos={posting.positions}')

Here, we take a bool. Let's make both consistent?

How about both take an analyzer and accept None? Passing in a "no-op" analyzer seems a bit janky.

Thoughts? @PepijnBoers @chriskamphuis
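
To make the proposal concrete, here is a small pure-Python sketch of a shared helper that both methods could call. Everything in it is hypothetical: resolve_term and toy_analyzer are illustrative names, an analyzer is modelled as a plain callable returning a list of tokens, and analyzer=None is assumed to mean "the term is already analyzed, use it as-is". How the default (analyzed) behavior would be selected is left open.

```python
from typing import Callable, List, Optional

Analyzer = Callable[[str], List[str]]


def toy_analyzer(text: str) -> List[str]:
    """Stand-in analyzer: lowercases and splits on whitespace."""
    return [token.lower() for token in text.split()]


def resolve_term(term: str, analyzer: Optional[Analyzer] = None) -> str:
    """Return the index term to look up.

    analyzer=None skips analysis entirely, replacing both the
    analyze=False flag and the "no-op" analyzer.
    """
    if analyzer is None:
        return term
    tokens = analyzer(term)
    if len(tokens) != 1:
        raise ValueError(f'expected exactly one token, got {tokens!r}')
    return tokens[0]
```

With a helper like this, get_term_counts and get_postings_list could share one signature, (term, analyzer=...), instead of one taking an analyzer and the other a bool.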

Issue with VM during pysearch import

I am trying to import pysearch from pyserini.search. I set the JAVA_HOME variable to jdk11. I am running this using Jupyter Notebook and I am getting the error as shown.

import os
os.environ['JAVA_HOME'] = '/Library/Java/JavaVirtualMachines/jdk-11.0.7.jdk/Contents/Home'

from pyserini.search import pysearch

The error is:

ValueError: VM is already running, can't set classpath/options; VM started at  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/runpy.py", line 193, in _run_module_as_main

I was initially running in a virtual environment (conda environment). I tried leaving it and running again, but I still get the same issue. Are there any workarounds? Or am I missing something?

I am using MacOS Catalina.

Retire technical debt for pysearch.search __main__

  • Arg layout should be consistent with the Anserini structure: change -prf to -prcl, since we've decided to name the class PseudoRelevanceClassifierReranker.
  • Args should be grouped, e.g., -prcl.r, -prcl.n, -prcl.alpha.
  • Use tqdm for the progress indicator.
  • Write a test case: you can assume the CACM test index: https://github.com/castorini/pyserini/blob/master/tests/test_indexutils.py#L31
    This would be similar to the commit hook in anserini: https://github.com/castorini/anserini/blob/master/.travis.yml#L20
  • I think the output file should be obligatory.
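
A minimal argparse sketch of the grouped layout, under stated assumptions: the option names come from the list above, but the types, defaults, help texts, and dest names (prcl_r, prcl_n, prcl_alpha) are guesses made for illustration, and build_parser is a hypothetical helper.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog='pysearch.py')
    parser.add_argument('-index', metavar='path', required=True)
    parser.add_argument('-topics', metavar='path', required=True)
    # The output file is obligatory, as suggested above.
    parser.add_argument('-output', metavar='path', required=True)

    # Group all PseudoRelevanceClassifierReranker options under one prefix.
    # Types and defaults below are assumptions, not the real settings.
    prcl = parser.add_argument_group('prcl')
    prcl.add_argument('-prcl.r', dest='prcl_r', type=int, default=10,
                      metavar='num', help='assumed: an integer parameter')
    prcl.add_argument('-prcl.n', dest='prcl_n', type=int, default=100,
                      metavar='num', help='assumed: an integer parameter')
    prcl.add_argument('-prcl.alpha', dest='prcl_alpha', type=float,
                      default=0.5, metavar='val',
                      help='assumed: a float weight')
    return parser
```

Explicit dest names are used because an attribute name containing a dot could only be read back with getattr.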

get_document_vector vs get_postings_list

Hello! Thank you for your work!
I have an issue, and I am not sure whether it is a bug in the API or a misunderstanding on my part of the API's semantics.
When I run these two programs and look at the frequency of the term "standard" in a document, I get:

  • Program 1:
    doc_vector = index_utils.get_document_vector(XXX)
    Output : {"...",'standard': 3, "..."}
  • Program 2:
    postings_list = index_utils.get_postings_list('standard')
    Output: docid=XXX, tf=1, pos=[26]

For the same document ID, the first program reports a term frequency of 3, while the second reports a term frequency of 1. The term "standard" is used here just for illustration; I encounter the same problem with other terms.

I would be very grateful for feedback on this issue.

Best regards,

Move location of search main

Right now we have:

$ python -m pyserini.search.pysearch
usage: pysearch.py [-h] -index path -topics path -output path

Maybe let's move to:

$ python -m pyserini.search
usage: pysearch.py [-h] -index path -topics path -output path

What do you think @yuki617 ?

Strange terms() behavior in IndexReaderUtils

I do this:

from pyserini.index import pyutils

index_utils = pyutils.IndexReaderUtils('lucene-index.robust04.pos+docvectors+rawdocs')

iter1 = index_utils.terms()
iter2 = index_utils.terms()

for term in iter1:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

As expected, I iterate over all terms. Then:

for term in iter2:
    print('{} (df={}, cf={})'.format(term.term, term.doc_freq, term.total_term_freq))

Gives nothing... is terms() returning the same iterator every time?

Doesn't seem like the right behavior?
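
The symptom is what you would see if terms() handed back one shared, single-pass iterator. Whether IndexReaderUtils actually does this is not confirmed here; the pure-Python sketch below (with hypothetical class names) only reproduces the behavior and shows the alternative of creating a fresh iterator per call.

```python
class SharedIteratorIndex:
    """Returns the same iterator on every call (reproduces the symptom)."""

    def __init__(self, terms):
        self._iter = iter(terms)  # created once, consumed once

    def terms(self):
        return self._iter


class FreshIteratorIndex:
    """Returns a new iterator on every call."""

    def __init__(self, terms):
        self._terms = list(terms)

    def terms(self):
        return iter(self._terms)


shared = SharedIteratorIndex(['black', 'cat'])
iter1, iter2 = shared.terms(), shared.terms()
first_pass = list(iter1)    # consumes the shared iterator
second_pass = list(iter2)   # nothing left to iterate over

fresh = FreshIteratorIndex(['black', 'cat'])
fresh_first = list(fresh.terms())
fresh_second = list(fresh.terms())
```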

Import Error

Hi. I am getting the error "ImportError: DLL load failed: The specified module could not be found." when trying either of these imports:
from pyserini.search import pysearch
from pyserini.search.pysearch import SimpleSearcher

The modules are installed, and the editor suggests them when I type. I am using PyCharm and Python 3.7.6, and I tried both pyserini==0.9.0.0 and pyserini==0.9.3.0.

How to build indexes

What should I use to build indexes? I was looking at a few options: Solr, Elasticsearch, Lucene, etc.

forked pyserini Collection iterators sometimes freeze with large directories

We've been having an issue where Collection iterators sometimes freeze when used in a forked process. This is not specific to pyserini and can be reproduced with only pyjnius and a BufferedReader. See below for a minimal script to reproduce it.

This issue reliably occurs if:

  1. pyjnius is initialized in the main process before forked processes are created. If we initialize pyjnius inside each forked process, everything's fine. (This is not easy to do though, because there's no way to close pyjnius after it's initialized, so every call to pyserini would have to happen in a forked process.)
  2. The BufferedReader is run on a "large" directory. Large depends on a combination of number of files and file size. It does not happen on a directory with 2000 empty files (created with touch $fn), but it does happen if this is increased to 3000 empty files. It does happen if the 2000 empty files are 1MB each (dd if=/dev/zero of=$fn bs=1M count=1) rather than empty. (I've also reproduced it on a random directory containing 265 files of varying sizes.)

Script:

import os
import sys
from multiprocessing import Pool
from jnius import autoclass
jstr = autoclass("java.lang.String")
jbr = autoclass("java.io.BufferedReader")
jfr = autoclass("java.io.FileReader")
def jprint(x):
    fr = jfr(x)
    f = jbr(fr)
    while True:
        line = f.readLine()
        if line is None:
            print("break")
            f.close()
            fr.close()
            break
        else:
            print("not none")

if __name__ == "__main__":
    dir = sys.argv[1]
    p = Pool(5)
    fns = [os.path.join(dir, fn) for fn in os.listdir(dir) if os.path.isfile(os.path.join(dir, fn))]
    p.map(jprint, fns)

Missing files in sdist

It appears that the manifest is missing at least one file necessary to build
from the sdist for version 0.9.2.0. You're in good company, about 5% of other
projects updated in the last year are also missing files.

+ /tmp/venv/bin/pip3 wheel --no-binary pyserini -w /tmp/ext pyserini==0.9.2.0
Looking in indexes: http://10.10.0.139:9191/root/pypi/+simple/
Collecting pyserini==0.9.2.0
  Downloading http://10.10.0.139:9191/root/pypi/%2Bf/6bb/4e22d7cb0a83a/pyserini-0.9.2.0.tar.gz (57.8 MB)
    ERROR: Command errored out with exit status 1:
     command: /tmp/venv/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"'; __file__='"'"'/tmp/pip-wheel-we7atqws/pyserini/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-wheel-we7atqws/pyserini/pip-egg-info
         cwd: /tmp/pip-wheel-we7atqws/pyserini/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-wheel-we7atqws/pyserini/setup.py", line 3, in <module>
        with open("project-description.md", "r") as fh:
    FileNotFoundError: [Errno 2] No such file or directory: 'project-description.md'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Access Index from colab

Hi. I have an Anserini index and am trying to access its statistics, so I am following this:
https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md

When I type:
from pyserini import analysis, index
index_reader = index.IndexReader()
I am getting:
AttributeError Traceback (most recent call last)
in ()
1 from pyserini import analysis, index
2
----> 3 index_reader = index.IndexReader()

AttributeError: module 'pyserini.index' has no attribute 'IndexReader'

get_document_vector() and get_postings_list() Stemming ?

Hi @lintool!
I have a new issue:
I created a new index from the "DUC-2001" dataset using this command:

 sh anserini/target/appassembler/bin/IndexCollection \
            -collection TrecCollection \
            -generator JsoupGenerator \
            -threads 2 \
            -input ${EXP}/ \
            -index indexes/lucene-index.XXX \
            -storePositions -storeDocvectors -storeRawDocs

I also installed the Luke toolbox to understand how the index works.

When I run this code:

for id_ in docid:
    doc_vector = index_utils.get_document_vector(id_)
    bm25_score_one_doc = {}
    for term_ in doc_vector:
        postings_list = index_utils.get_postings_list(term_)

It works for some terms but not for all...

Traceback (most recent call last):
  File "doc2index_2.py", line 50, in <module>
    postings_list = index_utils.get_postings_list(term_)
  File "/home/poulain/.local/lib/python3.6/site-packages/pyserini/index/pyutils.py", line 118, in get_postings_list
    postings_list = self.object.getPostingsList(self.reader, JString(term))
  File "jnius/jnius_export_class.pxi", line 768, in jnius.JavaMethod.__call__
  File "jnius/jnius_export_class.pxi", line 934, in jnius.JavaMethod.call_staticmethod
  File "jnius/jnius_utils.pxi", line 91, in jnius.check_exception
jnius.JavaException: JVM exception occurred: java.lang.NullPointerException

I think there are two different indexes: the first applies stemming (the word "Cherokee" becomes "cheroke") and the second keeps the word unstemmed.

So, how can I apply stemming to the postings index?

Best regards

java.nio.file.NoSuchFileException using the Collection API

All CACM HTML files are under collection. Instantiating pycollection.Collection with Python 3.5.0 and Java 11 leads to java.nio.file.NoSuchFileException:

from pyserini.collection import pycollection

collection = pycollection.Collection('HtmlCollection', 'collection/')
2019-11-22 08:05:07,954 ERROR [main] collection.DocumentCollection$2 (DocumentCollection.java:226) - Visiting failed for ä
java.nio.file.NoSuchFileException: ä
	at java.base/sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
	at java.base/sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:116) ~[?:?]
	at java.base/sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:55) ~[?:?]
	at java.base/sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:149) ~[?:?]
	at java.base/java.nio.file.Files.readAttributes(Files.java:1763) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.getAttributes(FileTreeWalker.java:219) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.visit(FileTreeWalker.java:276) ~[?:?]
	at java.base/java.nio.file.FileTreeWalker.walk(FileTreeWalker.java:322) ~[?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2716) [?:?]
	at java.base/java.nio.file.Files.walkFileTree(Files.java:2796) [?:?]
	at io.anserini.collection.DocumentCollection.discover(DocumentCollection.java:232) [anserini-0.6.0-fatjar.jar:?]
	at io.anserini.collection.DocumentCollection.iterator(DocumentCollection.java:110) [anserini-0.6.0-fatjar.jar:?]

How can I build an index with pyserini?

Hi, congratulations on this work!
Could you add an example of how to use Pyserini to build a new index? Also, to build my own collection, do I just have to override the Collection class?
Thanks!
