Evaluating Open-Domain Question Answering in the Era of Large Language Models

This repository hosts the code for our ACL 2023 paper: https://arxiv.org/abs/2305.06984.

Overview

Lexical matching is the standard evaluation method for open-domain question answering (QA), but it fails when plausible answers are not in the provided list. In this study, we manually examined the answers of several open-domain QA models and found that

  1. The true performance of all models is severely underestimated by lexical matching;
  2. The performance of LLMs increases by nearly 60%, and the few-shot LLM (InstructGPT text-davinci-003) actually achieves state-of-the-art performance;
  3. Automated evaluation methods (BERT-based or LLM-based) are a reasonable surrogate for lexical matching in some circumstances, but not for long-form answers usually generated by LLMs.

Please see our paper for more details.

Requirements

The code needs Python 3.8+ (we tested it with Python 3.8).

To install from the repo:

pip install git+https://github.com/ehsk/OpenQA-eval.git

To install from the source:

git clone git@github.com:ehsk/OpenQA-eval.git
cd OpenQA-eval
pip install -e .

Data

We worked on the Natural Questions-open (Lee et al., ACL 2019) test dataset that consists of 3,610 questions. We randomly sampled 301 questions for annotation.

In the data directory, we provide the answers generated by all open-domain QA models along with the output of the four evaluation mechanisms, described in the paper:

   data
    ├── model_outputs                                   # Answers generated by 12 open-domain QA models
    │   ├── NQ301_text-davinci-003_fewshot-n64.jsonl    # InstructGPT (few-shot)
    │   ├── NQ301_text-davinci-003_zeroshot.jsonl       # InstructGPT (zero-shot)
    │   ├── NQ_ANCE-plus_FiD.jsonl                      # ANCE+ & Fusion-In-Decoder
    │   └── ...
    ├── NQ301_BEM.tsv                                   # BEM predictions for all generated answers
    ├── NQ301_gpt-4.tsv                                 # GPT4-eval output for all generated answers
    ├── NQ301_human.tsv                                 # Human judgments for all generated answers
    └── NQ301_text-davinci-003.tsv                      # InstructGPT-eval output for all generated answers

The annotations can also be viewed online here.

Evaluation

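The evaluation script takes a prediction file in JSON Lines (jsonl) format, with one JSON object per line holding the question, the list of gold answers, and the model's prediction, as shown below. It measures the predictions' performance with different metrics.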

{"question": "who is under the mask of darth vader", "answer": ["Anakin Skywalker"], "prediction": "Anakin Skywalker"}
{"question": "which is the default file extension for an audio file in windows media player", "answer": ["Windows Playlist ( WPL )"], "prediction": "WMA"}
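
As a minimal sketch of how such a file could be produced and read back, the snippet below uses only Python's standard json module. The helper names write_predictions and read_predictions are hypothetical and are not part of oqaeval; only the three field names come from the example above.

```python
import json


def write_predictions(path, records):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")


def read_predictions(path):
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Each record carries the question, a list of gold answers, and one prediction.
records = [
    {
        "question": "who is under the mask of darth vader",
        "answer": ["Anakin Skywalker"],
        "prediction": "Anakin Skywalker",
    },
]

# Usage (hypothetical file name):
# write_predictions("prediction_file.jsonl", records)
```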

The following command computes only two lexical matching metrics: EM (Exact-Match accuracy) and macro-averaged F1.

python -m oqaeval /path/to/prediction_file.jsonl
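
To make the two metrics concrete, here is a rough sketch of how EM and macro-averaged F1 are commonly computed for open-domain QA. It assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and articles) and takes the best score over the gold answers. These are assumptions for illustration and may differ from what oqaeval actually does, and the function names are hypothetical.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def normalize(text):
    """Lowercase, drop punctuation and articles, collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, answers):
    """1.0 if the normalized prediction equals any normalized gold answer."""
    pred = normalize(prediction)
    return float(any(pred == normalize(a) for a in answers))


def token_f1(prediction, answers):
    """Best token-level F1 between the prediction and any gold answer."""
    def single(pred, gold):
        p, g = normalize(pred).split(), normalize(gold).split()
        overlap = sum((Counter(p) & Counter(g)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(p), overlap / len(g)
        return 2 * precision * recall / (precision + recall)

    return max((single(prediction, a) for a in answers), default=0.0)


def evaluate(records):
    """Average per-question EM and F1 over all records (macro-averaging)."""
    n = len(records)
    em = sum(exact_match(r["prediction"], r["answer"]) for r in records) / n
    f1 = sum(token_f1(r["prediction"], r["answer"]) for r in records) / n
    return {"EM": em, "F1": f1}
```

On the two example lines above, this sketch scores the first prediction as correct and the second ("WMA" against "Windows Playlist ( WPL )") as a miss on both metrics. The second case is the kind of answer that lexical matching cannot credit even if a human might judge it plausible.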

To evaluate with an LLM, as InstructGPT-eval does in the paper, pass the model name (text-davinci-003 or gpt-4) via the --model argument:

python -m oqaeval /path/to/prediction_file.jsonl --model text-davinci-003

This command calls the OpenAI API, so the environment variable OPENAI_API_KEY must be set first. Bear in mind that running it will incur charges on your OpenAI account. We did not see a significant difference between GPT-4 and InstructGPT, so we recommend using the cheaper model (InstructGPT).
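
If you drive the evaluation from a script, it can help to check for the API key before starting a paid run. The sketch below is one possible wrapper. The helper names check_api_key and build_command are hypothetical, and the only command-line pieces it uses are the module name and the --model flag shown above.

```python
import os
import sys


def check_api_key(env=None):
    """Raise early if OPENAI_API_KEY is missing or empty."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is not set; LLM-based evaluation needs it.")


def build_command(prediction_file, model=None):
    """Build the argument list for `python -m oqaeval`, optionally with --model."""
    cmd = [sys.executable, "-m", "oqaeval", prediction_file]
    if model is not None:
        cmd += ["--model", model]
    return cmd


# Usage (runs the evaluation and will call the OpenAI API):
# import subprocess
# check_api_key()
# subprocess.run(build_command("/path/to/prediction_file.jsonl", "text-davinci-003"), check=True)
```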

To evaluate using our provided annotated files including human judgment, you can simply run:

python -m oqaeval /path/to/prediction_file.jsonl --annotation data/NQ301_human.tsv

The above command evaluates only the 301 annotated questions and skips any other questions in the prediction file.

Bugs 🐛 or questions ❓

If you have any questions or encounter any problems, feel free to open an issue.

Citation

If you want to cite our paper, please use:

@article{kamalloo2023evaluating,
  title = "{Evaluating Open-Domain Question Answering in the Era of Large Language Models}",
  author = {Kamalloo, Ehsan and Dziri, Nouha and Clarke, Charles L. A. and Rafiei, Davood},
  journal={arXiv preprint arXiv:2305.06984},
  year={2023}
}

License

This work is licensed under the MIT license. See LICENSE for details.
