Comments (4)
Have a look here: https://github.com/naver/splade/tree/main#evaluating-a-pre-trained-model
Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.
The previous approach outlines a method to generate this PISA index in five steps: create an Anserini index from the SPLADE model's sparse vectors, export it to CIFF, convert the CIFF file to the PISA format, build the PISA index, and then map the queries to the expected format. Is this still the most direct way to create a PISA index with a pre-trained SPLADE model?
If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression page does not explain how to ingest those files via its command target/appassembler/bin/IndexCollection, and instead specifies a downloadable version of the MS MARCO Passage Corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.
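For readers following along, here is a minimal sketch of what producing the two files might look like. It assumes the layout commonly used with Anserini's JsonVectorCollection (one JSON object per line with an id and a token-to-integer-weight "vector") and a tab-separated query file in which each token is repeated according to its quantized weight. The field names, the scale factor of 100, and both helper functions are assumptions for illustration and are not confirmed anywhere in this thread; the files written by the SPLADE repository's own scripts are the reference.

```python
import json
from pathlib import Path


def quantize(weights, scale=100):
    """Turn float term weights into positive integers, dropping terms that round to 0."""
    quantized = {}
    for token, weight in weights.items():
        value = int(round(weight * scale))
        if value > 0:
            quantized[token] = value
    return quantized


def write_anserini_docs(doc_vectors, out_dir, scale=100):
    """Write docs_anserini.jsonl into out_dir, one JSON object per document (assumed layout)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "docs_anserini.jsonl"
    with out_path.open("w", encoding="utf-8") as f:
        for doc_id, weights in doc_vectors.items():
            record = {"id": doc_id, "content": "", "vector": quantize(weights, scale)}
            f.write(json.dumps(record) + "\n")
    return out_path


def write_anserini_queries(query_vectors, out_path, scale=100):
    """Write 'qid<TAB>tok tok ...' lines, repeating each token by its quantized weight."""
    out_path = Path(out_path)
    with out_path.open("w", encoding="utf-8") as f:
        for query_id, weights in query_vectors.items():
            tokens = []
            for token, value in quantize(weights, scale).items():
                tokens.extend([token] * value)
            f.write(f"{query_id}\t{' '.join(tokens)}\n")
    return out_path
```

The document file goes into its own folder because an indexer that takes a corpus directory reads every file inside it, so keeping the query file elsewhere avoids indexing it by accident.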
Hi Saartak,
> Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.
> The previous approach outlines a method to generate this PISA index by first creating an Anserini index with the SPLADE model's sparse vectors, exporting it to a CIFF format, converting the CIFF format to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct approach to create the PISA index with a pre-trained SPLADE model?
On our side, it is still the most direct approach. You can also look into https://github.com/terrierteam/pyterrier_pisa/tree/main, which creates a PISA index directly using Terrier and also lets you query that index. They have an example using SPLADE at the very end of the README. We are still looking into integrating this, or using it as the default, but we also need to make sure that it does not create a circular dependency (SPLADE would depend on Pyterrier_Pisa, which depends on Pyterrier_SPLADE, which depends on SPLADE).
> If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression page does not explain how to ingest those files via its command target/appassembler/bin/IndexCollection, and instead specifies a downloadable version of the MS MARCO Passage Corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.
OK, so for IndexCollection you should just pass the path to the folder containing your docs_anserini.jsonl, which is your corpus, as the input parameter (replace -input /path/to/msmarco-passage-distill-splade-max with the path to the folder holding docs_anserini.jsonl). Then, for the retrieval step, pass your queries_anserini.tsv as the topics (replace -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz with the path to your queries_anserini.tsv).
Hope this helps, but feel free to ask for more clarification.
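The two substitutions above can be sketched as a small helper that assembles the command lines without running them. Only -input, -topics and target/appassembler/bin/IndexCollection come from this thread. The SearchCollection binary name and every other flag (-collection, -generator, -impact, -pretokenized, -topicreader, and so on) are recalled from the Anserini regression page for DistilSPLADE-max and should be treated as assumptions to verify against that page. The TsvInt topic reader in particular presumes integer query ids, and the function names and example paths are hypothetical.

```python
import shlex


def index_collection_args(corpus_dir, index_dir, threads=16):
    """Build the indexing command; corpus_dir is the folder holding docs_anserini.jsonl."""
    return [
        "target/appassembler/bin/IndexCollection",
        "-collection", "JsonVectorCollection",      # assumed, from the regression page
        "-input", str(corpus_dir),                  # the substitution described above
        "-index", str(index_dir),
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        "-impact", "-pretokenized",                 # assumed flags for impact-weighted tokens
    ]


def search_collection_args(index_dir, topics_path, run_path):
    """Build the retrieval command; topics_path points at queries_anserini.tsv."""
    return [
        "target/appassembler/bin/SearchCollection",  # assumed binary name
        "-index", str(index_dir),
        "-topics", str(topics_path),                 # the substitution described above
        "-topicreader", "TsvInt",                    # assumed; presumes integer query ids
        "-output", str(run_path),
        "-impact", "-pretokenized",
    ]


if __name__ == "__main__":
    # Hypothetical paths; print the commands for inspection instead of executing them.
    print(shlex.join(index_collection_args("/data/my_corpus", "indexes/my_splade_index")))
    print(shlex.join(search_collection_args(
        "indexes/my_splade_index", "/data/queries_anserini.tsv", "runs/my_run.trec")))
```

Building the argument lists separately from execution makes it easy to check the paths before handing them to subprocess.run or pasting them into a shell from the Anserini checkout.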
Hi Carlos,
That's very helpful, thank you! That worked perfectly!