PyGaggle

PyGaggle provides a gaggle of deep neural architectures for text ranking and question answering. It was designed for tight integration with Pyserini, but can be easily adapted for other sources as well.

Currently, this repo contains implementations of the rerankers for CovidQA on CORD-19, as described in "Rapidly Bootstrapping a Question Answering Dataset for COVID-19".

Installation

  1. Install via PyPI: pip install pygaggle. Requires Python 3.6+.

  2. Install PyTorch 1.4+.

  3. Download the index: sh scripts/update-index.sh.

  4. Make sure you have an installation of Java 11+: javac --version.

  5. Install Anserini.
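
The Python and Java requirements above can be checked up front. The following is a minimal sketch, not part of PyGaggle: javac_major and check_prereqs are hypothetical helper names, and the parsing assumes the version banner looks like "javac 11.0.2" (or the legacy "javac 1.8.0_292").

```shell
# javac_major BANNER: hypothetical helper that prints the major version
# from a banner such as "javac 11.0.2" or the legacy "javac 1.8.0_292".
javac_major() {
    ver=${1#javac }                 # drop the leading "javac "
    major=${ver%%.*}                # text before the first dot
    if [ "$major" = "1" ]; then     # legacy 1.x scheme: 1.8 means Java 8
        rest=${ver#1.}
        major=${rest%%.*}
    fi
    printf '%s\n' "$major"
}

# check_prereqs: hypothetical helper; returns non-zero with a message on
# stderr if Python 3.6+ or Java 11+ is missing.
check_prereqs() {
    python -c 'import sys; sys.exit(0 if sys.version_info >= (3, 6) else 1)' \
        || { echo "Python 3.6+ not found" >&2; return 1; }
    # Older JDKs print the banner to stderr, so capture both streams.
    banner=$(javac -version 2>&1) \
        || { echo "javac not found" >&2; return 1; }
    [ "$(javac_major "$banner")" -ge 11 ] 2>/dev/null \
        || { echo "Java 11+ required, found: $banner" >&2; return 1; }
}
```

Calling check_prereqs before step 3 gives an early failure instead of an error halfway through the index download or the Anserini build.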

Evaluations

Additional Instructions

  1. Clone the repo with git clone --recursive https://github.com/castorini/pygaggle.git

  2. Make sure you have an installation of Python 3.6+. All python commands below refer to this.

  3. Install the dependencies with pip: pip install -r requirements.txt

    • If you prefer Anaconda, use conda env create -f environment.yml && conda activate pygaggle.

Running rerankers on CovidQA

For a full list of mostly self-explanatory environment variables, see this file.

BM25 uses the CPU. If you don't have a GPU for the transformer models, pass --device cpu (PyTorch device string format) to the script.
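
One way to choose the --device value automatically is to ask PyTorch whether a GPU is usable. This is a sketch under assumptions: pick_device is a hypothetical helper, the PYTHON variable is an illustrative override, and "cuda" is assumed to be accepted because the flag takes PyTorch device strings.

```shell
# pick_device: hypothetical helper. Prints "cuda" when PyTorch reports a
# usable GPU, and "cpu" otherwise (including when torch cannot be imported
# or the interpreter is missing).
pick_device() {
    py=${PYTHON:-python}
    dev=$("$py" -c 'import torch; print("cuda" if torch.cuda.is_available() else "cpu")' 2>/dev/null) || dev=cpu
    printf '%s\n' "${dev:-cpu}"
}

# Example use with one of the commands from this README:
#   python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer \
#       --model-name bert-base-cased --device "$(pick_device)"
```

Falling back to cpu on any failure keeps the evaluation runnable on machines without a GPU, at the cost of slower transformer inference.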

Note: Run the following evaluations from the root of this repo.

Unsupervised Methods

BM25:

python -um pygaggle.run.evaluate_kaggle_highlighter --method bm25

BERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name bert-base-cased

SciBERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name allenai/scibert_scivocab_cased

BioBERT:

python -um pygaggle.run.evaluate_kaggle_highlighter --method transformer --model-name biobert

Supervised Methods

T5 (fine-tuned on MS MARCO):

python -um pygaggle.run.evaluate_kaggle_highlighter --method t5

BioBERT (fine-tuned on SQuAD v1.1):

  1. mkdir biobert-squad && cd biobert-squad

  2. Download the weights, vocab, and config from the BioBERT repository to biobert-squad.

  3. Untar the model and rename some files in biobert-squad:

tar -xvzf BERT-pubmed-1000000-SQuAD.tar.gz
mv bert_config.json config.json
for filename in model.ckpt*; do
    mv "$filename" "$(python -c "import re; print(re.sub(r'ckpt-\\d+', 'ckpt', '$filename'))")"
done
  4. Evaluate the model:
cd .. # go to the root of this repo
python -um pygaggle.run.evaluate_kaggle_highlighter --method qa_transformer --model-name <folder path>

BioBERT (fine-tuned on MS MARCO):

  1. Download the weights, vocab, and config from our Google Storage bucket. This requires an installation of gsutil.
mkdir biobert-marco && cd biobert-marco
gsutil cp "gs://neuralresearcher_data/doc2query/experiments/exp374/model.ckpt-100000*" .
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/bert_config.json config.json
gsutil cp gs://neuralresearcher_data/biobert_models/biobert_v1.1_pubmed/vocab.txt .
  2. Rename the files:
for filename in model.ckpt*; do
    mv "$filename" "$(python -c "import re; print(re.sub(r'ckpt-\\d+', 'ckpt', '$filename'))")"
done
  3. Evaluate the model:
cd .. # go to root of this repo
python -um pygaggle.run.evaluate_kaggle_highlighter --method seq_class_transformer --model-name <folder path>
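
The renaming loops in both BioBERT procedures start a Python interpreter once per checkpoint file. A pure-shell alternative can be sketched with sed. This is an illustration rather than the project's own tooling: strip_ckpt_step is a hypothetical helper, and it assumes the checkpoint files are named model.ckpt-<digits>.<suffix>, as the regular expression in those loops implies.

```shell
# strip_ckpt_step DIR: hypothetical helper, not part of PyGaggle.
# Renames DIR/model.ckpt-<digits>.* to DIR/model.ckpt.* using sed only.
strip_ckpt_step() {
    dir=${1:-.}
    for f in "$dir"/model.ckpt*; do
        [ -e "$f" ] || continue                 # glob matched nothing
        base=${f##*/}                           # file name without the directory
        new=$(printf '%s\n' "$base" | sed 's/ckpt-[0-9][0-9]*/ckpt/')
        if [ "$base" != "$new" ]; then          # skip files already renamed
            mv -- "$f" "$dir/$new"
        fi
    done
}
```

Running strip_ckpt_step biobert-marco (or strip_ckpt_step biobert-squad) from the repo root would replace the corresponding loop. Running it a second time is a no-op, because files that are already renamed are skipped.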
