
Comments (4)

DRRV commented on June 6, 2024

Have a look here: https://github.com/naver/splade/tree/main#evaluating-a-pre-trained-model


saarthaks commented on June 6, 2024

Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.

The previous approach outlines a method to generate this PISA index by first creating an Anserini index with the SPLADE model's sparse vectors, exporting it to a CIFF format, converting the CIFF format to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct approach to create the PISA index with a pre-trained SPLADE model?

If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The linked Anserini regression process does not show how to ingest those files via its command target/appassembler/bin/IndexCollection. Instead, it specifies a downloadable version of the MS MARCO Passage Corpus that has already been processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.


cadurosar commented on June 6, 2024

Hi Saarthak,

> Thanks for the pointer, but I'm not sure that addresses my question completely. This response to a previously closed issue gets closer to the heart of what I'm asking, which is how to create a PISA index for an arbitrary document/query corpus that is encoded into sparse vectors with a pre-trained efficient-SPLADE model.

> The previous approach outlines a method to generate this PISA index by first creating an Anserini index with the SPLADE model's sparse vectors, exporting it to a CIFF format, converting the CIFF format to the PISA format, building the PISA index, and then mapping the queries to the expected format. Is this still the most direct approach to create the PISA index with a pre-trained SPLADE model?

On our side, it is still pretty much the most direct approach. You can also look into https://github.com/terrierteam/pyterrier_pisa/tree/main, which creates a PISA index directly using Terrier and also lets you query that index. They have an example of using SPLADE at the very end of the README. We are still looking into integrating it, or using it as the default, but we also need to make sure that it does not create a circular dependency: SPLADE would depend on Pyterrier_Pisa, which depends on Pyterrier_SPLADE, which depends on SPLADE.

> If so, I've run into an intermediate issue with the previous approach. How are the docs_anserini.jsonl and queries_anserini.tsv files used to create the Anserini index? The regression process for Anserini that is linked does not list how to ingest those files via its command target/appassembler/bin/IndexCollection, and instead specifies a downloadable version of the MS MARCO Passage Corpus that has already been specifically processed with DistilSPLADE-max. As a result, it seems to ignore the dataset and SPLADE model that were used to create docs_anserini.jsonl and queries_anserini.tsv.

OK, so for IndexCollection you just pass the path to the folder containing your docs_anserini.jsonl, which is your corpus, as the input parameter (replace -input /path/to/msmarco-passage-distill-splade-max with the path to that folder). Then, for the retrieval step, you pass your queries_anserini.tsv as the topics (replace -topics tools/topics-and-qrels/topics.msmarco-passage.dev-subset.distill-splade-max.tsv.gz with the path to your queries_anserini.tsv).
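A rough sketch of those two substitutions, written as a small Python helper that only assembles the command lines without running them. The remaining flags (-collection JsonVectorCollection, -generator DefaultLuceneDocumentGenerator, -topicreader TsvInt, -impact, -pretokenized) are assumed from the linked Anserini regression page and should be checked against your Anserini version. The helper name and all paths are hypothetical.

```python
import shlex


def build_anserini_commands(docs_dir, queries_tsv, index_dir, run_file, threads=16):
    """Assemble (but do not run) the Anserini indexing and retrieval commands.

    docs_dir    -- folder that contains docs_anserini.jsonl (your corpus)
    queries_tsv -- path to queries_anserini.tsv
    index_dir   -- where the Lucene index should be written (hypothetical path)
    run_file    -- where the retrieval run should be written (hypothetical path)
    """
    # Indexing: only -input changes compared to the regression command.
    index_cmd = [
        "target/appassembler/bin/IndexCollection",
        "-collection", "JsonVectorCollection",  # assumed from the regression page
        "-input", docs_dir,
        "-index", index_dir,
        "-generator", "DefaultLuceneDocumentGenerator",  # assumed, as above
        "-threads", str(threads),
        "-impact", "-pretokenized",
    ]
    # Retrieval: only -topics changes compared to the regression command.
    search_cmd = [
        "target/appassembler/bin/SearchCollection",
        "-index", index_dir,
        "-topics", queries_tsv,
        "-topicreader", "TsvInt",  # assumed; expects integer query ids
        "-output", run_file,
        "-impact", "-pretokenized",
    ]
    return index_cmd, search_cmd


if __name__ == "__main__":
    index_cmd, search_cmd = build_anserini_commands(
        docs_dir="/path/to/folder_with_docs_anserini",
        queries_tsv="/path/to/queries_anserini.tsv",
        index_dir="indexes/my-splade-index",
        run_file="runs/run.my-splade.txt",
    )
    # Print shell-ready commands to run from the Anserini checkout.
    print(shlex.join(index_cmd))
    print(shlex.join(search_cmd))
```

Printing the commands instead of executing them makes it easy to compare them against the regression page before launching a long indexing job.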

Hope this helps, but feel free to ask for more clarification.


saarthaks commented on June 6, 2024

Hi Carlos,

That's very helpful, thank you! That worked perfectly!

