What is the process for indexing MS MARCO using Efficient SPLADE?

Indexing a document corpus with Efficient SPLADE (splade issue, closed)

saarthaks commented on June 6, 2024

Indexing a document corpus with Efficient SPLADE

Comments (4)

DRRV commented on June 6, 2024

Have a look here: https://github.com/naver/splade/tree/main#evaluating-a-pre-trained-model

saarthaks commented on June 6, 2024

Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.

The previous approach generates this PISA index by first creating an Anserini index with the SPLADE model's sparse vectors, exporting it to CIFF format, converting the CIFF file to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct way to create the PISA index with a pre-trained SPLADE model?

If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression process does not say how to ingest those files via its command target/appassembler/bin/IndexCollection, and instead specifies a downloadable version of the MS MARCO Passage Corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.

cadurosar commented on June 6, 2024

Hi Saarthak,

> Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.
>
> The previous approach generates this PISA index by first creating an Anserini index with the SPLADE model's sparse vectors, exporting it to CIFF format, converting the CIFF file to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct way to create the PISA index with a pre-trained SPLADE model?

On our side it kind of is the most direct. You can also look into https://github.com/terrierteam/pyterrier_pisa/tree/main, which creates a PISA index directly using Terrier and also lets you query that index. They have an example of using SPLADE at the very end of the README. We are still looking into integrating it or using it as the default, but we also need to make sure that it does not create a circular dependency (SPLADE would depend on Pyterrier_Pisa, which depends on Pyterrier_SPLADE, which depends on SPLADE).
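For the Anserini route, the corpus and queries first have to be written as docs_anserini.jsonl and queries_anserini.tsv. The following is only a minimal sketch of what those files might contain, under assumptions not stated in this thread: each document is a JSON line with `id`, `content`, and `vector` fields, float weights are quantized to integers with a factor of 100, and each query token is repeated according to its integer weight. The function names are hypothetical; check the files produced by your own SPLADE run for the actual layout.

```python
import json

# Assumption: scale used to turn float SPLADE weights into integer impacts.
QUANTIZATION_FACTOR = 100


def to_anserini_doc(doc_id, token_weights, factor=QUANTIZATION_FACTOR):
    """Build one JSON line for docs_anserini.jsonl from a {token: weight} dict."""
    vector = {}
    for token, weight in token_weights.items():
        quantized = int(weight * factor)
        if quantized > 0:  # tokens that quantize to zero carry no impact
            vector[token] = quantized
    # Assumed record layout: id, empty content, and the token -> impact map.
    return json.dumps({"id": str(doc_id), "content": "", "vector": vector})


def to_anserini_query(query_id, token_weights, factor=QUANTIZATION_FACTOR):
    """Build one line for queries_anserini.tsv: id, a tab, then repeated tokens."""
    tokens = []
    for token, weight in token_weights.items():
        # Repeating a token n times encodes an integer query weight of n.
        tokens.extend([token] * int(weight * factor))
    return f"{query_id}\t{' '.join(tokens)}"


if __name__ == "__main__":
    print(to_anserini_doc(7, {"splade": 1.25, "the": 0.001}))
    print(to_anserini_query(3, {"index": 0.5, "pisa": 0.25}, factor=4))
```

Repeating query tokens keeps the query file plain text, so the weights survive without a separate weighting format.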

> If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression process does not say how to ingest those files via its command target/appassembler/bin/IndexCollection, and instead specifies a downloadable version of the MS MARCO Passage Corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.

Ok, so for IndexCollection you just pass the path to the folder containing your docs_anserini.jsonl as the input parameter: replace -input /path/to/msmarco-passage-distill-splade-max with the path to the folder holding docs_anserini.jsonl. That folder is your corpus. Then, for the retrieval step, pass your queries_anserini.tsv as the topics: replace -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz with the path to your queries_anserini.tsv.
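The two substitutions can be sketched as a small helper that assembles the command lines without running them. Only the `-input` and `-topics` substitutions come from this thread. The remaining flags (`-collection JsonVectorCollection`, `-generator`, `-impact`, `-pretokenized`, `-topicreader TsvInt`), the `SearchCollection` binary, the thread count, and all example paths are assumptions recalled from the Anserini regression page, so verify them against the version you have built.

```python
import shlex


def index_collection_cmd(corpus_dir, index_dir, threads=16):
    """Argument list for indexing; corpus_dir is the folder holding docs_anserini.jsonl."""
    return [
        "target/appassembler/bin/IndexCollection",
        "-collection", "JsonVectorCollection",  # assumed collection type
        "-input", corpus_dir,                   # substitution described in the thread
        "-index", index_dir,
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        "-impact", "-pretokenized",             # assumed impact-indexing flags
    ]


def search_collection_cmd(index_dir, topics_tsv, run_file):
    """Argument list for retrieval; topics_tsv is the path to queries_anserini.tsv."""
    return [
        "target/appassembler/bin/SearchCollection",
        "-index", index_dir,
        "-topics", topics_tsv,                  # substitution described in the thread
        "-topicreader", "TsvInt",               # assumed reader; expects integer query ids
        "-output", run_file,
        "-impact", "-pretokenized",
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    print(shlex.join(index_collection_cmd("out/index", "indexes/my-splade")))
    print(shlex.join(search_collection_cmd(
        "indexes/my-splade", "out/queries_anserini.tsv", "runs/run.my-splade.txt")))
```

Building the argument list rather than a shell string makes it easy to inspect the command first and then pass it to `subprocess.run` from the Anserini checkout.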

Hope this helps, but feel free to ask for more clarification.

saarthaks commented on June 6, 2024

Hi Carlos,

That's very helpful, thank you! That worked perfectly!
