
aggrefact's Introduction

Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Announcement

Our paper is accepted to ACL 2023! Please consider citing our paper if you find it useful.

TL;DR

Reported results on standard summarization factuality benchmarks can be misleading. Factuality metrics, even ChatGPT-based ones, do not actually improve on summaries from more recent summarizers!

Abstract

The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems’ outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models instead of more relevant recent summarization models. We further perform a finer-grained analysis per error-type and find similar performance variance across error types for different factuality metrics. Our results show that no one metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.

Please check our latest paper version here.

AggreFact Benchmark

The dataset can be found in the data folder. aggre_fact_sota.csv is a subset of aggre_fact_final.csv that contains only summaries generated by SOTA models. The columns are described below.

| Col. name | Description |
| --- | --- |
| dataset | Name of the original annotated dataset. |
| origin | Summarization dataset, either cnndm or xsum. |
| id | Document id. |
| doc | Input article. |
| summary | Model-generated summary. |
| model_name | Name of the model used to generate the summary. |
| label | Factual consistency of the generated summary: 1 if factually consistent, 0 otherwise. |
| cut | Either val or test. |
| system_score | The output score from a factuality system. |
| system_label | The binary factual consistency label derived from the factuality system's score. Only examples in the test set have labels. Labels are determined under the threshold-per-dataset setting. |
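For illustration, here is a minimal sketch of loading the SOTA subset and scoring a factuality system under a threshold-per-dataset setting: for each annotated dataset, tune a binarization threshold on the val split and apply it to the test split, reporting balanced accuracy. It assumes pandas and scikit-learn, the file path data/aggre_fact_sota.csv, and the column layout described above; the paper's actual evaluation script may differ.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Path assumed relative to the repository root.
df = pd.read_csv("data/aggre_fact_sota.csv")

def tune_threshold(val):
    """Grid-search a binarization threshold over the val-split system scores."""
    best_t, best_acc = 0.5, -1.0
    for t in val["system_score"].unique():
        preds = (val["system_score"] >= t).astype(int)
        acc = balanced_accuracy_score(val["label"], preds)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

results = {}
for name, group in df.groupby("dataset"):
    val = group[group["cut"] == "val"]
    test = group[group["cut"] == "test"]
    if len(val) == 0 or len(test) == 0:
        continue
    t = tune_threshold(val)
    preds = (test["system_score"] >= t).astype(int)
    results[name] = balanced_accuracy_score(test["label"], preds)

for name, acc in results.items():
    print(f"{name}: balanced accuracy = {acc:.3f}")
```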

A new CSV file containing the results from the ChatGPT-based metrics will be released shortly.

Main Result

Unified Error Types

We unified the distinct error-type taxonomies from XSumFaith, FRANK, Goyal21, and CLIFF; the mapping is provided under data/error_type_mapping. More details can be found in Section 4 of the paper.
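Conceptually, applying such a mapping is a per-dataset lookup from each dataset's original labels to the unified taxonomy. The sketch below is purely illustrative: the keys and values are placeholders, not the actual contents of data/error_type_mapping.

```python
# Purely illustrative placeholders, not the actual mapping shipped in
# data/error_type_mapping (see Section 4 of the paper).
ERROR_TYPE_MAPPING = {
    "xsumfaith": {"intrinsic": "intrinsic-error", "extrinsic": "extrinsic-error"},
    "frank": {"EntE": "intrinsic-error", "OutE": "extrinsic-error"},
}

def unify_error_types(dataset_name, original_labels):
    """Map one dataset's error labels onto the unified taxonomy; unknown labels fall back to 'other'."""
    mapping = ERROR_TYPE_MAPPING.get(dataset_name, {})
    return [mapping.get(label, "other") for label in original_labels]

print(unify_error_types("frank", ["EntE", "OutE", "GramE"]))
# -> ['intrinsic-error', 'extrinsic-error', 'other']
```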

aggrefact's People

Contributors

liyan06


aggrefact's Issues

Integrating human faithfulness ratings from HELM paper.

The team behind the HELM paper just shared a dataset of doc-summary faithfulness ratings in this issue.
The ratings are binary and were crowd-sourced. The rated documents are from CNN/DM and XSum. The summaries are either references or generated by some recent models (GPT-3, etc.). I think this could be integrated into AggreFact to get an even bigger and better benchmark.

I would be interested in discussing whether this is a good fit for integration into AggreFact and what to consider while doing so.

Release of AggreFact Unify dataset

Would you be willing to release the AggreFact-unified dataset? Major findings in the paper stemmed from the error analysis, and it would be valuable as a benchmark dataset for hallucination detection.

thanks!

New paper version

Hi,
I think it would be beneficial for people discovering your work if you noted in the readme that a new version of the paper is available (as can be seen in this overview) and what the key differences are.

It would probably also be nice to have access to the new CSV file containing the ChatGPT-based metrics, so people could compare and verify your results.

kind regards

Converting ChatGPT responses

Hi,
I am currently trying to re-evaluate the ChatGPT-based metrics (from Luo). I am wondering how you converted the answers into boolean decisions. I would imagine that the chain-of-thought template in particular produces responses that don't strictly follow a pattern.
As far as I can see, the paper by Luo is also not clear about how to do this conversion.
Did you look for keywords, or is there a specific pattern you evaluated? It would also be nice if you could share the code.
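For reference, one simple way such free-form answers could be binarized is a keyword check on the final line of the response. This is purely illustrative and not necessarily the conversion used for the paper.

```python
import re

def parse_consistency_answer(response):
    """Heuristically map a free-form (chain-of-thought) answer to 1/0/None."""
    last_line = response.strip().splitlines()[-1].lower()
    # Check negative verdicts first so "not consistent" is not read as "consistent".
    if re.search(r"\b(no|inconsistent|not consistent|unfaithful)\b", last_line):
        return 0
    if re.search(r"\b(yes|consistent|faithful)\b", last_line):
        return 1
    return None  # could not parse; may need manual inspection
```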

kind regards

What are BertExt, BertExtAbs, and BertSum?

Hi, I am looking at the model_name column of the data you shared. What are the three models BertExt, BertExtAbs, and BertSum? How are they related to BertSumExt and BertSumAbs from Liu and Lapata (2019)?

Dataset Release

Hi @Liyan06

Awesome work! I wanted to know when you plan to release the datasets, as I would like to use them for running some experiments.

Thanks

Where do these scores come from?

From the code, it looks like the data is read directly from the ready-made CSV tables, and the results are obtained with just a few calculations. I'd like to ask where the score of a given dataset under a given method (a metric in the paper) comes from. For example, for a CNN/DM sample with a BART-generated summary, its DAE score is 0.912; what I want to understand is the source of this score: was it taken directly from previous papers, or did you run the models yourselves? Thanks for your guidance!

Sentence level dataset

Thanks for the great work, your benchmark is more than useful!

Could you please provide the dataset at the sentence level? This would be particularly helpful for CNN/DM summaries, where some sentences may be faithful and others not.

Summarizer `missing`

Hi, in your data, under the model_name column, some values are missing. Are samples with such a value in the model_name column counted as OLD, ExFormer, or SOTA, or not used in your analysis at all?

Reproduction Scores

Thanks for making your work available!
I've been unable to reproduce scores on this dataset for SummaC (I used the default code available in its repository). Could you point me to the settings you used?

ChatGPT-based metrics and long inputs

I am trying to re-evaluate the ChatGPT-based metrics on your benchmark.
Using gpt-3.5-turbo, some texts produce errors because they contain too many tokens.
How did you handle this problem? Truncation? Did you use a model with a larger context window, like gpt-3.5-turbo-16k?
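For context, one possible workaround is to truncate the input article before building the prompt; the sketch below uses tiktoken and is not necessarily how the benchmark results were produced. The token budget of 3000 is an arbitrary illustrative choice.

```python
import tiktoken

def truncate_document(doc, model="gpt-3.5-turbo", max_doc_tokens=3000):
    """Truncate the article so the prompt fits within the model's context window.

    Keeps the first max_doc_tokens tokens of the document, leaving room for the
    instruction, the summary, and the model's answer.
    """
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(doc)
    if len(tokens) <= max_doc_tokens:
        return doc
    return enc.decode(tokens[:max_doc_tokens])
```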

Go Figure not included?

Thanks for the great work compiling a SOTA dataset for metric evaluation.
I am just curious why the annotations from GO FIGURE are not included.

kind regards
