Comments (10)
you can just do:
    def get_term_counts(self, term: str, analyzer=get_lucene_analyzer()) -> Tuple[int, int]:
        if analyzer is None:
            # skip analysis (pass dummy to Anserini)
            ...
        else:
            # perform analysis with analyzer (either default or custom)
            ...
where you import that function from pyanalysis
from pyserini.
Both taking an analyzer makes sense to me, especially since the getPostingsListWithAnalyzer method is already available in Anserini.
@PepijnBoers do you have cycles to take this on?
@lintool sure!
Currently get_term_counts applies Anserini's default Lucene analyzer if no analyzer is specified (analyzer=None), so in order to skip term analysis you have to pass a dummy analyzer. Is this the requested behavior, or do we want analyzer=None to mean that no analysis should take place?
I'm thinking:
- No analyzer specified - use default.
- analyzer=foo - use foo as analyzer
- analyzer=None - skip analysis
And we make this behavior consistent across all methods (existing, and in the future).
Thoughts?
I assumed that None would be the value for an unspecified analyzer. Do you suggest we overload the methods in Pyserini and remove the default value for the analyzer parameter?
We could do it like this:
    def get_term_counts(self, term: str, analyzer='unspecified') -> Tuple[int, int]:
        if analyzer == 'unspecified':
            # use default
            ...
        elif analyzer is None:
            # skip analysis (pass dummy to Anserini)
            ...
        else:
            # perform analysis with given analyzer
            ...
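One caveat with a string sentinel: a caller could legitimately pass the same string, and comparing a string literal by identity depends on interning. A unique sentinel object avoids both problems. The following is only a minimal sketch of that idea, not Pyserini's actual code: `_UNSPECIFIED`, `default_analyzer`, and `analyze_term` are hypothetical names, and the analyzers are plain Python callables standing in for Lucene analyzers.

```python
from typing import Callable, List

# Hypothetical stand-in: real code would use a Lucene analyzer from Anserini.
Analyzer = Callable[[str], List[str]]


def default_analyzer(text: str) -> List[str]:
    """Stand-in for the default analyzer: lowercase and split on whitespace."""
    return text.lower().split()


# Unique sentinel meaning "the caller did not pass an analyzer at all".
_UNSPECIFIED = object()


def analyze_term(term: str, analyzer=_UNSPECIFIED) -> List[str]:
    if analyzer is _UNSPECIFIED:
        # No analyzer specified: use the default.
        return default_analyzer(term)
    if analyzer is None:
        # Explicit None: skip analysis and keep the term as-is.
        return [term]
    # Otherwise apply the analyzer the caller supplied.
    return analyzer(term)
```

With this shape, `analyze_term(t)`, `analyze_term(t, analyzer=None)`, and `analyze_term(t, analyzer=foo)` map onto the three cases discussed above.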
What about something simpler, like this?
    def get_term_counts(self, term: str, analyzer=default) -> Tuple[int, int]:
        if analyzer is None:
            # skip analysis (pass dummy to Anserini)
            ...
        else:
            # perform analysis with analyzer (either default or custom)
            ...
Looks good, but then we have to specify default somewhere, otherwise we face a NameError. The question would then also be how/where to define default, right?
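One possible answer, sketched under assumptions: bind the default to a module-level name before the function that uses it is defined, since Python evaluates default argument values once, when the `def` statement runs. `make_default_analyzer`, `DEFAULT_ANALYZER`, and `analyze_with_default` are hypothetical names, and the analyzer is again a plain Python callable rather than a Lucene analyzer.

```python
from typing import Callable, List, Optional

Analyzer = Callable[[str], List[str]]


def make_default_analyzer() -> Analyzer:
    """Hypothetical factory standing in for building the default analyzer."""
    return lambda text: text.lower().split()


# Bound at import time, so the name exists when the def below is evaluated
# (referencing an undefined name there would raise NameError).
DEFAULT_ANALYZER = make_default_analyzer()


def analyze_with_default(term: str,
                         analyzer: Optional[Analyzer] = DEFAULT_ANALYZER) -> List[str]:
    if analyzer is None:
        # Explicit None: skip analysis.
        return [term]
    # Default or custom analyzer: both take the same path.
    return analyzer(term)
```

Because the default is evaluated only once, every call that omits `analyzer` shares the same analyzer object, which is fine as long as that object is safe to reuse.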