PyTerrier_ANCE

This is the PyTerrier plugin for the ANCE dense passage retriever.

Installation

This repository can be installed using pip:

pip install --upgrade git+https://github.com/terrierteam/pyterrier_ance.git

You will also need FAISS (CPU or GPU) installed:

On Colab:

!pip install faiss-cpu 

On Anaconda:

# CPU-only version
$ conda install -c pytorch faiss-cpu

# GPU(+CPU) version
$ conda install -c pytorch faiss-gpu

For ANCE, the CPU version is sufficient.
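
After installing, it can be useful to confirm that the packages are importable before starting a long indexing run. The following is a minimal sketch; `missing_modules` is a hypothetical helper, not part of PyTerrier or this plugin, and it only checks that the modules can be found, not that they work.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# faiss-cpu and faiss-gpu both install a module named "faiss"
missing = missing_modules(["faiss", "pyterrier", "pyterrier_ance"])
if missing:
    print("Not importable yet:", ", ".join(missing))
else:
    print("All required modules are importable.")
```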

Indexing

You will need a pre-trained ANCE checkpoint. There are several available from the ANCE repository.

Then, indexing is as easy as instantiating the indexer, pointing it at the (unzipped) checkpoint and the directory in which you wish to create the index:

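import pyterrier as pt
import pyterrier_ance

# load the Vaswani test collection
dataset = pt.get_dataset("irds:vaswani")
# encode each document with the ANCE checkpoint and write the index to disk
indexer = pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())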
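
To index your own documents rather than a ready-made dataset, you can pass any iterable of document records. The sketch below assumes the indexer reads the same `docno` and `text` fields that the Vaswani corpus iterator provides; `corpus_iter` and its `prefix` argument are hypothetical names used only for illustration.

```python
from typing import Dict, Iterable, Iterator


def corpus_iter(texts: Iterable[str], prefix: str = "doc") -> Iterator[Dict[str, str]]:
    """Wrap raw strings as records with a unique docno and a text field."""
    for i, text in enumerate(texts):
        yield {"docno": f"{prefix}{i}", "text": text}


docs = list(corpus_iter(["first passage", "second passage"]))
print(docs)

# Assumed usage with the indexer created above:
# indexer.index(corpus_iter(my_texts))
```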

Retrieval

You can instantiate the retrieval transformer, again by specifying the checkpoint location and the index location:

anceretr = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex")

Thereafter, you can use it in the normal PyTerrier way, for instance in an experiment:

pt.Experiment(
    [anceretr], 
    dataset.get_topics(), 
    dataset.get_qrels(), 
    eval_metrics=["map"]
)
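
Conceptually, dense retrieval ranks documents by the similarity between a query embedding and the document embeddings. The toy sketch below is not the plugin's implementation (which relies on FAISS and the ANCE encoder); it assumes inner-product similarity and uses hand-written three-dimensional vectors, with `inner_product` and `top_k` as hypothetical helper names.

```python
from typing import Dict, List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec: Sequence[float],
          doc_vecs: Dict[str, Sequence[float]],
          k: int = 2) -> List[Tuple[str, float]]:
    """Score every document against the query and keep the k highest scores."""
    scored = [(docno, inner_product(query_vec, vec)) for docno, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings"; real ANCE vectors are produced by the checkpoint.
doc_vecs = {"d1": [1.0, 0.0, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [1.0, 1.0, 0.0]}
ranking = top_k([1.0, 0.5, 0.0], doc_vecs, k=2)
print(ranking)
```

An exhaustive scan like this is only practical for tiny collections, which is why a library such as FAISS is used for the real index.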

You can also use ANCE to score text (e.g., as a re-ranker) using ANCETextScorer.

ance_text_scorer = pyterrier_ance.ANCETextScorer("/path/to/checkpoint")
# You'll need to use this in a retrieval pipeline that includes the document text, e.g.:
# bm25 >> pt.text.get_text(dataset, 'text') >> ance_text_scorer

Documents longer than Passages

If your documents are longer than passages, you should split them into passages before indexing, and aggregate the passage scores back into document scores (for example, with max passage) during retrieval:

# indexing
import pyterrier as pt
import pyterrier_ance
dataset = pt.get_dataset("irds:vaswani")
# split each document's "text" into overlapping passages before encoding them
indexer = pt.text.sliding("text", prepend_attr=None) >> pyterrier_ance.ANCEIndexer("/path/to/checkpoint", "/path/to/anceindex")
indexer.index(dataset.get_corpus_iter())

# retrieval
# retrieve passages, then keep each document's best-scoring passage
ance_maxp = pyterrier_ance.ANCERetrieval("/path/to/checkpoint", "/path/to/anceindex") >> pt.text.max_passage()
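
To make the two steps above concrete, here is a small pure-Python sketch of sliding-window passaging and max-passage aggregation. It illustrates the idea only and is not PyTerrier's implementation: the function names, the window sizes, and the `%p` passage-id suffix are all assumptions made for this example.

```python
from typing import Dict, List, Tuple


def sliding_passages(docno: str, text: str, length: int = 3, stride: int = 2) -> List[Tuple[str, str]]:
    """Split text into overlapping windows of `length` tokens, advancing by `stride`."""
    tokens = text.split()
    passages: List[Tuple[str, str]] = []
    start = 0
    while True:
        passage_id = f"{docno}%p{len(passages)}"
        passages.append((passage_id, " ".join(tokens[start:start + length])))
        if start + length >= len(tokens):
            break  # the last window reached the end of the document
        start += stride
    return passages


def max_passage(scored: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Map passage scores back to documents, keeping the best passage score per document."""
    best: Dict[str, float] = {}
    for passage_id, score in scored:
        docno = passage_id.split("%p")[0]
        if docno not in best or score > best[docno]:
            best[docno] = score
    return sorted(best.items(), key=lambda pair: pair[1], reverse=True)


passages = sliding_passages("d1", "a b c d e")
doc_ranking = max_passage([("d1%p0", 0.2), ("d1%p1", 0.9), ("d2%p0", 0.5)])
print(passages)
print(doc_ranking)
```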

Examples

Check out the notebooks, which can also be run on Colab.

The Terrier data repository contains ANCE indices for several corpora, including Vaswani and MSMARCO Passage v1.

Implementation Details

We use a forked copy of ANCE that makes it pip-installable and addresses other quibbles.

References

  • [Xiong20]: Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, Arnold Overwijk. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. https://arxiv.org/pdf/2007.00808.pdf
  • [Macdonald20]: Craig Macdonald, Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271

Credits

  • Craig Macdonald, University of Glasgow
  • Nicola Tonellotto, University of Pisa
