
GENRE

The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch.

@article{de2020autoregressive,
  title={Autoregressive Entity Retrieval},
  author={De Cao, Nicola and Izacard, Gautier and Riedel, Sebastian and Petroni, Fabio},
  journal={arXiv preprint arXiv:2010.00904},
  year={2020}
}

Please consider citing our work if you use code from this repository.

In a nutshell, GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking) based on a fine-tuned BART architecture. GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. Here is an example of generation for Wikipedia page retrieval for open-domain question answering:

For end-to-end entity linking, GENRE re-generates the input text annotated with a markup:
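For instance (an illustrative example, not an actual model output; see the usage examples below for the exact markup), an input like "In 1921 Einstein received the Nobel Prize." would be re-generated as something like "In 1921 { Einstein } [ Albert Einstein ] received the Nobel Prize.", with the mention in curly braces followed by the linked entity name in square brackets.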

GENRE achieves state-of-the-art results on multiple datasets.

Main dependencies

  • python>=3.7
  • pytorch>=1.6
  • fairseq>=0.10 (for training -- optional for inference)
  • transformers>=4.0 (optional for inference)

Usage

See the examples on how to use GENRE with both pytorch / fairseq and huggingface / transformers:

Generally, after importing and loading the model, you would generate predictions (in this example for Entity Disambiguation) with a simple call like:

model.sample(
    sentences=[
        "[START_ENT] Armstrong [END_ENT] was the first man on the Moon."
    ]
)
[[{'text': 'Neil Armstrong', 'logprob': tensor(-0.1443)},
  {'text': 'William Armstrong', 'logprob': tensor(-1.4650)},
  {'text': 'Scott Armstrong', 'logprob': tensor(-1.7311)},
  {'text': 'Arthur Armstrong', 'logprob': tensor(-1.7356)},
  {'text': 'Rob Armstrong', 'logprob': tensor(-1.7426)}]]

NOTE: we used fairseq for all experiments in the paper. The huggingface/transformers models were obtained with a conversion script similar to this one, so results might differ.
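As a minimal sketch of the loading step that precedes such a call (an assumption-laden example: it supposes the hf_entity_disambiguation_aidayago checkpoint was extracted under models/ and the genre package is importable; module and checkpoint names may differ in your setup):

# Minimal sketch: load a GENRE checkpoint and run Entity Disambiguation.
# Assumptions: the hf_entity_disambiguation_aidayago archive was extracted to
# models/ and the genre package is on the path (see Troubleshooting below).
from genre.hf_model import GENRE  # use genre.fairseq_model.GENRE for fairseq checkpoints

model = GENRE.from_pretrained("models/hf_entity_disambiguation_aidayago").eval()

predictions = model.sample(
    sentences=["[START_ENT] Armstrong [END_ENT] was the first man on the Moon."]
)
print(predictions)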

Models

Use the link above to download the models in .tar.gz format, then run tar -zxvf <FILENAME> to uncompress them. As an alternative, use this script to download all of them.

Entity Disambiguation

| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| BLINK | fairseq_entity_disambiguation_blink | hf_entity_disambiguation_blink |
| BLINK + AidaYago2 | fairseq_entity_disambiguation_aidayago | hf_entity_disambiguation_aidayago |

End-to-End Entity Linking

| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| WIKIPEDIA | fairseq_e2e_entity_linking_wiki_abs | hf_e2e_entity_linking_wiki_abs |
| WIKIPEDIA + AidaYago2 | fairseq_e2e_entity_linking_aidayago | hf_e2e_entity_linking_aidayago |

Document Retrieval

| Training Dataset | pytorch / fairseq | huggingface / transformers |
| --- | --- | --- |
| KILT | fairseq_wikipage_retrieval | hf_wikipage_retrieval |

See here for examples of how to load the models and run inference.

Dataset

Use the link above to download the datasets. As an alternative, use this script to download all of them. These datasets (except the BLINK data) are a pre-processed version of the Phong Le and Ivan Titov (2018) data available here. The BLINK data is taken from here.

Entity Disambiguation (train / dev)

Entity Disambiguation (test)

Document Retrieval

  • KILT: for these datasets, please follow the download instructions on the KILT repository.

Pre-processing

To pre-process a KILT-formatted dataset into the source and target files expected by fairseq, use

python scripts/convert_kilt_to_fairseq.py $INPUT_FILENAME $OUTPUT_FOLDER
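As a rough illustration of what this conversion does (a simplified sketch, not the actual script: it assumes one gold provenance per record, while scripts/convert_kilt_to_fairseq.py handles the full KILT schema and its edge cases), the input text of each KILT record becomes a source line and the Wikipedia title of its provenance becomes the target line:

# Simplified, hypothetical sketch of the source/target extraction done by the
# conversion step; the helper name kilt_record_to_pair is illustrative only.
import json

def kilt_record_to_pair(json_line):
    record = json.loads(json_line)
    source = record["input"]  # the query / mention text
    # target: the Wikipedia title of the first gold provenance
    target = record["output"][0]["provenance"][0]["title"]
    return source, target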

Then, to tokenize and binarize them as expected by fairseq, use

./preprocess_fairseq.sh $DATASET_PATH $MODEL_PATH

Note that this requires having the fairseq source code downloaded in the same folder as the genre repository (see here).

Trie from KILT Wikipedia titles

We also release the BPE prefix tree (trie) built from the KILT Wikipedia titles (kilt_titles_trie.pkl), which is based on the 2019/08/01 Wikipedia dump (downloadable in its raw format here). The trie contains ~5M titles and is used to generate valid entities in all the KILT experiments.
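A minimal sketch of how such a trie can constrain generation to valid titles (assuming the huggingface interface, that model.sample accepts a prefix_allowed_tokens_fn hook, and that the pickle deserializes directly to a trie object; the exact loading code may differ):

# Sketch: restrict beam search to KILT Wikipedia titles via the prefix trie.
# Assumptions: kilt_titles_trie.pkl unpickles to a trie exposing .get(), and
# the hf_wikipage_retrieval checkpoint was extracted to models/.
import pickle
from genre.hf_model import GENRE

with open("data/kilt_titles_trie.pkl", "rb") as f:
    trie = pickle.load(f)

model = GENRE.from_pretrained("models/hf_wikipage_retrieval").eval()

model.sample(
    sentences=["Einstein was a German physicist."],
    # allow only continuations that stay inside the trie of valid titles
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)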

Troubleshooting

If the module cannot be found, preface the python command with PYTHONPATH=.

Licence

GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.

Contributors

nicola-decao, fabiopetroni, denden047, ii-research-yu
