
minicheck's Introduction

MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents

Authors: Liyan Tang, Philippe Laban, Greg Durrett

Please check out our work here 📃

LLM-AggreFact Benchmark

Description

LLM-AggreFact is a fact verification benchmark. It aggregates 10 of the most up-to-date publicly available datasets on factual consistency evaluation across both closed-book and grounded generation settings. In LLM-AggreFact:

  1. Documents come from diverse sources, including Wikipedia paragraphs, interviews, and web text, and cover domains such as news, dialogue, science, and healthcare.
  2. Claims to be verified are mostly generated by recent generative models (except for one dataset of human-written claims), with no human intervention of any kind, such as injecting particular error types into model-generated claims.

Benchmark Access

Our benchmark is available on HuggingFace 🤗. More benchmark details can be found here.

from datasets import load_dataset
dataset = load_dataset("lytang/LLM-AggreFact")

The benchmark contains the following fields:

Field     Description
dataset   One of the 10 datasets in the benchmark
doc       Document used to check the corresponding claim
claim     Claim to be checked against the corresponding document
label     1 if the claim is supported, 0 otherwise
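
As a quick sanity check, the snippet below loads the benchmark and prints the fields of a single record. This is a minimal sketch: the split name 'test' is an assumption about how the dataset is laid out on HuggingFace.

from datasets import load_dataset

dataset = load_dataset("lytang/LLM-AggreFact")

# Print the fields of one example (the 'test' split name is an assumption)
example = dataset['test'][0]
print(example['dataset'], example['label'])
print(example['claim'])
print(example['doc'][:200])  # first 200 characters of the document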

MiniCheck Model Evaluation Demo

Please first clone our GitHub repo and install the necessary packages from requirements.txt.

Our MiniCheck models are available on HuggingFace 🤗. More model details can be found in this collection. Below is a simple use case of MiniCheck. The models are automatically downloaded from HuggingFace the first time they are used and cached in the specified directory.

from minicheck.minicheck import MiniCheck

doc = "A group of students gather in the school library to study for their upcoming final exams."
claim_1 = "The students are preparing for an examination."
claim_2 = "The students are on vacation."

# model_name can be one of ['roberta-large', 'deberta-v3-large', 'flan-t5-large']
# lytang/MiniCheck-Flan-T5-Large is auto-downloaded from HuggingFace the first time this is run
scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_label, raw_prob, _, _ = scorer.score(docs=[doc, doc], claims=[claim_1, claim_2])

print(pred_label) # [1, 0]
print(raw_prob)   # [0.9805923700332642, 0.007121307775378227]

A detailed walkthrough of the evaluation process on LLM-AggreFact and replication of the results is available in this notebook: inference-example-demo.ipynb. A rough sketch of the scoring loop is shown below.
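
The following is a minimal sketch of how such a replication loop could look, not the notebook's exact code. It assumes the same MiniCheck setup as above, a 'test' split on the HuggingFace dataset, that scikit-learn is available, and that balanced accuracy is the per-dataset metric; refer to the notebook for the authoritative version.

from collections import defaultdict

from datasets import load_dataset
from sklearn.metrics import balanced_accuracy_score
from minicheck.minicheck import MiniCheck

# Load the benchmark (the 'test' split name is an assumption)
data = load_dataset("lytang/LLM-AggreFact")['test']

scorer = MiniCheck(model_name='flan-t5-large', device='cuda:0', cache_dir='./ckpts')
pred_labels, _, _, _ = scorer.score(docs=data['doc'], claims=data['claim'])

# Group gold labels and predictions by source dataset
grouped = defaultdict(lambda: ([], []))
for name, gold, pred in zip(data['dataset'], data['label'], pred_labels):
    grouped[name][0].append(gold)
    grouped[name][1].append(pred)

# Report balanced accuracy (BAcc) per dataset
for name, (golds, preds) in sorted(grouped.items()):
    print(f"{name}: BAcc = {balanced_accuracy_score(golds, preds):.3f}")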

Synthetic Data Generation

Code for generating synthetic data (both C2D and D2C methods) is available in the synthetic_data_gen directory. The 14K synthetic examples (7K C2D and 7K D2C) used for model training are available on HuggingFace 🤗 and can be found here.

Citation

If you find our work useful, please consider citing it.

@misc{tang2024minicheck,
      title={MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents}, 
      author={Liyan Tang and Philippe Laban and Greg Durrett},
      year={2024},
      eprint={2404.10774},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


minicheck's Issues

A request for the MiniCheck-FT5 fine-tuning code

I recently tried to follow the MiniCheck-FT5 approach from the paper, but unfortunately could not achieve the same results 🤣. I suspect this is because my fine-tuning code is not well written, since this is not a simple text classification task. If it's convenient for you, could you please share the fine-tuning code for MiniCheck-FT5?

Generation of "claims"

Hi!

How did you generate the claims in the benchmark? Did you (1) directly prompt the model to generate a claim, (2) first generate model responses and then decompose them into claims, or (3) use some other method?

Thanks!
