karlstratos / factscore

This project forked from shmsw25/factscore


A package to evaluate factuality of long-form generation. Original implementation of our EMNLP 2023 paper "FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation"

Home Page: https://arxiv.org/abs/2305.14251

License: MIT License


factscore's Introduction

Setup

conda create --name factscore python=3.10 --yes
conda activate factscore
pip install --editable .   # We need to use the code in this repo
python -m spacy download en_core_web_sm
python factscore/download_data.py   # No need for reconstructing instruct-llama-7b if we use openai
pip install gdown
gdown https://drive.google.com/uc?id=1mekls6OGOKLmt7gYtHs0WGf5oTamTNat -O .cache/factscore/enwiki-20230401.db  # See https://github.com/shmsw25/FActScore/issues/40#issue-2125580571

Put your OpenAI key in a file, assumed to be openai_key.txt here.

Issues

  • In the code, the atomic fact generator is InstructGPT and the factscore evaluator is (Retrieval+)ChatGPT.

  • Originally, InstructGPT=GPT3=text-davinci-003, and ChatGPT=gpt-3.5-turbo.

  • Due to deprecation, InstructGPT is now set to gpt-3.5-turbo-instruct; ChatGPT is still gpt-3.5-turbo. So if we do our own atomic fact generation, we will be using gpt-3.5-turbo-instruct instead of text-davinci-003. But even if we don't, and just run factscore evaluation with gpt-3.5-turbo, the results are different. My best guess is that exact replication is difficult because who knows what updates OpenAI is making to its models. See this sheet for the results.

  • Some topics don't exactly match titles in the DB due to ambiguity, which causes the code to fail (see this issue). An example is "Francisco Urroz". The code is modified so that, when this happens, it matches all titles sharing the topic as a prefix (much slower), which gives "Francisco Urroz (rugby union)" and "Francisco Urroz (footballer)". The code then uses all the paragraphs in these entries, on the assumption that the model shouldn't be penalized for the ambiguity of the query. This is an interesting example because GPT-4 is at least aware of the ambiguity (though it gets most facts wrong) while Alpaca-65B is oblivious; see *-problem.jsonl.
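The prefix-matching fallback described above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the table name, column names, and titles in the in-memory DB are hypothetical stand-ins for the enwiki-20230401.db schema.

```python
import sqlite3

# Tiny in-memory DB standing in for enwiki-20230401.db. The schema
# (table "documents" with "title"/"text" columns) is an assumption
# for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (title TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO documents VALUES (?, ?)",
    [
        ("Francisco Urroz (rugby union)", "..."),
        ("Francisco Urroz (footballer)", "..."),
        ("Francisca Urroz", "..."),
    ],
)

def lookup(topic):
    """Try an exact title match first; on failure, fall back to a
    prefix match so ambiguous topics return all candidate entries."""
    rows = conn.execute(
        "SELECT title FROM documents WHERE title = ?", (topic,)
    ).fetchall()
    if not rows:  # ambiguous topic: accept any title with this prefix
        rows = conn.execute(
            "SELECT title FROM documents WHERE title LIKE ?", (topic + "%",)
        ).fetchall()
    return [r[0] for r in rows]

print(lookup("Francisco Urroz"))
# Both disambiguated entries come back; all their paragraphs would be used.
```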

Commands

The code caches results for already-completed prompts. By default these live in .cache/factscore/InstructGPT.pkl for atomic fact generation and .cache/factscore/ChatGPT.pkl for factscore evaluation.
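The caching behavior can be mimicked with a plain pickle dict keyed by prompt. The file name and cache structure here are assumptions based on the .pkl paths above, not necessarily the repo's exact format, and fake_api_call is a stand-in for the OpenAI call.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for .cache/factscore/InstructGPT.pkl.
cache_path = os.path.join(tempfile.mkdtemp(), "InstructGPT.pkl")

def load_cache(path):
    """Load the prompt -> completion dict, or start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {}

def complete(prompt, cache):
    """Return a cached completion, hitting the API only on a miss."""
    if prompt not in cache:
        cache[prompt] = fake_api_call(prompt)
        with open(cache_path, "wb") as f:
            pickle.dump(cache, f)  # persist after each new completion
    return cache[prompt]

calls = 0
def fake_api_call(prompt):
    """Stand-in for the real OpenAI call; counts invocations."""
    global calls
    calls += 1
    return f"completion for: {prompt}"

cache = load_cache(cache_path)
complete("Tell me a bio of X.", cache)
complete("Tell me a bio of X.", cache)  # served from cache; no second call
```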

Toy

Run atomic fact generation and factscore evaluation on 2 unlabeled biographies generated by ChatGPT.

python factscore/factscorer.py --input_path ChatGPT_unlabeled_head2.jsonl --model_name retrieval+ChatGPT --openai_key openai_key.txt --verbose

Labeled

Evaluate the ChatGPT (i.e., the old gpt-3.5-turbo) responses, human-labeled with atomic facts. This only runs the factscore evaluator (i.e., retrieval + gpt-3.5-turbo). It needs to consume ~5.9 million tokens corresponding to atomic facts with contexts in 157 ChatGPT biographies (I guess this is what remains after abstentions from the original 183 topics). Even though we don't need atomic fact generation, the OpenAI API can take a while to complete (2-3 hours). The cost ended up around $6.

python factscore/factscorer.py --input_path data/labeled/ChatGPT.jsonl --model_name retrieval+ChatGPT --openai_key openai_key.txt --use_atomic_facts --verbose
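As a rough sanity check on that cost, back-of-the-envelope arithmetic with the token count above is in the right ballpark. The per-token rate here is an assumption for illustration (~$1 per million input tokens), not an official OpenAI price:

```python
# Assumed rate: ~$1.00 per 1M input tokens (illustrative, not official pricing).
tokens = 5.9e6          # ~5.9M input tokens from the labeled run above
rate_per_million = 1.0
cost = tokens / 1e6 * rate_per_million
print(f"${cost:.2f}")   # in the same ballpark as the ~$6 observed
```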

Unlabeled

Evaluate the Alpaca-65B responses. This requires generating atomic facts (gpt-3.5-turbo-instruct) as well as running the factscore evaluator (retrieval + gpt-3.5-turbo). Alpaca-65B responds to all 500 topics.

For atomic fact generation, we need to consume ~2 million input tokens corresponding to the in-context examples and the biography.

For factscore evaluation, we need to consume ~8.1 million input tokens corresponding to atomic facts with contexts. The total cost ended up around $4.

The whole thing takes a few hours (OpenAI willing).

python factscore/factscorer.py --input_path data/unlabeled/Alpaca-65B.jsonl --model_name retrieval+ChatGPT --openai_key openai_key.txt --verbose
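For reference, the metric itself is simple: a response's factscore is the fraction of its atomic facts judged supported by the knowledge source, averaged over responses (see the paper for the full definition, including how abstentions and length penalties are handled). A minimal sketch:

```python
def factscore(decisions):
    """decisions: one list of booleans per response, where True means
    the atomic fact was judged supported by the knowledge source."""
    per_response = [sum(d) / len(d) for d in decisions]
    return sum(per_response) / len(per_response)

# Two toy responses: 3/4 and 1/2 of their atomic facts supported.
print(factscore([[True, True, True, False], [True, False]]))  # 0.625
```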

Contributors

shmsw25, martiansideofthemoon, karlstratos
