
aggrefact's Introduction

Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

Announcement

Our paper is accepted to ACL 2023! Please consider citing our paper if you find it useful.

TL;DR

Reported results on standard summarization factuality benchmarks can be misleading. Factuality metrics, even ChatGPT-based ones, do not actually improve on summaries from more recent summarizers!

Abstract

The propensity of abstractive summarization models to make factual errors has been studied extensively, including design of metrics to detect factual errors and annotation of errors in current systems’ outputs. However, the ever-evolving nature of summarization systems, metrics, and annotated benchmarks makes factuality evaluation a moving target, and drawing clear comparisons among metrics has become increasingly difficult. In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model. We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models. Critically, our analysis shows that much of the recent improvement in the factuality detection space has been on summaries from older (pre-Transformer) models instead of more relevant recent summarization models. We further perform a finer-grained analysis per error-type and find similar performance variance across error types for different factuality metrics. Our results show that no one metric is superior in all settings or for all error types, and we provide recommendations for best practices given these insights.

Please check our latest paper version here.

AggreFact Benchmark

The dataset can be found in the data folder. aggre_fact_sota.csv is a subset of aggre_fact_final.csv that contains only summaries generated by SOTA models. The columns are described below.

| Col. name | Description |
| --- | --- |
| dataset | Name of the original annotated dataset. |
| origin | Summarization dataset, either cnndm or xsum. |
| id | Document id. |
| doc | Input article. |
| summary | Model-generated summary. |
| model_name | Name of the model used to generate the summary. |
| label | Factual consistency of the generated summary: 1 if factually consistent, 0 otherwise. |
| cut | Either val or test. |
| system_score | The output score from a factuality system. |
| system_label | The binary factual consistency label derived from the factuality system's score. Only examples in the test set have labels. Labels are determined under the threshold-per-dataset setting. |
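For illustration, here is a minimal sketch of loading the SOTA subset and scoring a factuality system under a threshold-per-dataset setting: for each annotated dataset, tune a binarization threshold on the val split and apply it to the test split, reporting balanced accuracy. It assumes pandas and scikit-learn, the file path data/aggre_fact_sota.csv, and the column layout described above; the paper's actual evaluation script may differ.

```python
import pandas as pd
from sklearn.metrics import balanced_accuracy_score

# Path assumed relative to the repository root.
df = pd.read_csv("data/aggre_fact_sota.csv")

def tune_threshold(val):
    """Grid-search a binarization threshold over the val-split system scores."""
    best_t, best_acc = 0.5, -1.0
    for t in val["system_score"].unique():
        preds = (val["system_score"] >= t).astype(int)
        acc = balanced_accuracy_score(val["label"], preds)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

results = {}
for name, group in df.groupby("dataset"):
    val = group[group["cut"] == "val"]
    test = group[group["cut"] == "test"]
    if len(val) == 0 or len(test) == 0:
        continue
    t = tune_threshold(val)
    preds = (test["system_score"] >= t).astype(int)
    results[name] = balanced_accuracy_score(test["label"], preds)

for name, acc in results.items():
    print(f"{name}: balanced accuracy = {acc:.3f}")
```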

A new CSV file containing the results from the ChatGPT-based metrics will be released shortly.

Main Result

Unified Error Types

We unified the distinct error-type taxonomies from XSumFaith, FRANK, Goyal21, and CLIFF; the mapping is provided under data/error_type_mapping. More details can be found in Section 4 of the paper.
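Conceptually, applying such a mapping is a per-dataset lookup from each dataset's original labels to the unified taxonomy. The sketch below is purely illustrative: the keys and values are placeholders, not the actual contents of data/error_type_mapping.

```python
# Purely illustrative placeholders, not the actual mapping shipped in
# data/error_type_mapping (see Section 4 of the paper).
ERROR_TYPE_MAPPING = {
    "xsumfaith": {"intrinsic": "intrinsic-error", "extrinsic": "extrinsic-error"},
    "frank": {"EntE": "intrinsic-error", "OutE": "extrinsic-error"},
}

def unify_error_types(dataset_name, original_labels):
    """Map one dataset's error labels onto the unified taxonomy; unknown labels fall back to 'other'."""
    mapping = ERROR_TYPE_MAPPING.get(dataset_name, {})
    return [mapping.get(label, "other") for label in original_labels]

print(unify_error_types("frank", ["EntE", "OutE", "GramE"]))
# -> ['intrinsic-error', 'extrinsic-error', 'other']
```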

aggrefact's People

Contributors

liyan06


aggrefact's Issues

Integrating human faithfulness ratings from HELM paper.

The team behind the HELM paper just shared a dataset of doc-summary faithfulness ratings in this issue.
The ratings are binary and were crowd-sourced. The rated documents are from CNN/DM and XSum. The summaries are either references or generated by some recent models (GPT-3, etc.). I think this could be integrated into AggreFact to get an even bigger and better benchmark.

I would be interested in discussing whether this is a good fit for integration into AggreFact and what to consider while doing so.

Release of AggreFact Unify dataset

Would you be willing to release the AggreFact-unified dataset? Major findings in the paper stemmed from the error analysis, and it would be valuable as a benchmark dataset for hallucination detection.

thanks!

New paper version

Hi,
I think it would be beneficial for people discovering your work if you noted in the readme that a new version of the paper is available (as can be seen in this overview) and what the key differences are.

It would probably also be nice to have access to the new CSV file containing the ChatGPT-based metrics, so people could compare and verify your results.

kind regards

Converting ChatGPT responses

Hi,
I am currently trying to re-evaluate the ChatGPT-based metrics (from Luo). I am wondering how you converted the answers into boolean decisions. I would imagine that the chain-of-thought template in particular produces responses that don't strictly follow a pattern.
As far as I can see, the paper by Luo is also not clear about how to do this conversion.
Did you look for keywords, or is there a specific pattern you evaluated? It would also be nice if you could share the code.
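For reference, one simple way such free-form answers could be binarized is a keyword check on the final line of the response. This is purely illustrative and not necessarily the conversion used for the paper.

```python
import re

def parse_consistency_answer(response):
    """Heuristically map a free-form (chain-of-thought) answer to 1/0/None."""
    last_line = response.strip().splitlines()[-1].lower()
    # Check negative verdicts first so "not consistent" is not read as "consistent".
    if re.search(r"\b(no|inconsistent|not consistent|unfaithful)\b", last_line):
        return 0
    if re.search(r"\b(yes|consistent|faithful)\b", last_line):
        return 1
    return None  # could not parse; may need manual inspection
```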

kind regards

What are BertExt, BertExtAbs, and BertSum?

Hi, I am looking at the model_name column of the data you shared. What are the three models BertExt, BertExtAbs, and BertSum? How are they related to BertSumExt and BertSumAbs from Liu and Lapata (2019)?

Dataset Release

Hi @Liyan06

Awesome work! I wanted to know when you plan to release the datasets, as I would like to use them for running some experiments.

Thanks

Where do these scores come from?

From the code, it looks like the data is read directly from the ready-made CSV tables, and the results are obtained with just a few calculations. I'd like to ask where the score of a given dataset under a given method (a metric in the paper) comes from. For example, for a CNN/DM sample with a BART-generated summary, its DAE score is 0.912; what I want to understand is the source of this score: was it taken directly from previous papers, or did you run the models yourselves? Thanks for your guidance!

Sentence level dataset

Thanks for the great work, your benchmark is more than useful!

Could you please provide the dataset at the sentence level? This would be particularly helpful for CNN/DM summaries, where some sentences may be faithful and others not.

Summarizer `missing`

Hi, in your data, under the model_name column, some values are missing. Are samples with such a value in the model_name column counted as OLD, ExFormer, or SOTA, or not used in your analysis at all?

Reproduction Scores

Thanks for making your work available!
I've been unable to reproduce scores on this dataset for SummaC (I used the default code available in its repository). Could you point me to the settings you used?

ChatGPT-based metrics and long inputs

I am trying to re-evaluate the ChatGPT-based metrics on your benchmark.
Using gpt-3.5-turbo, some texts produce errors because they contain too many tokens.
How did you handle this problem? Truncation? Did you use a model with a larger context window, like gpt-3.5-turbo-16k?
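For context, one possible workaround is to truncate the input article before building the prompt; the sketch below uses tiktoken and is not necessarily how the benchmark results were produced. The token budget of 3000 is an arbitrary illustrative choice.

```python
import tiktoken

def truncate_document(doc, model="gpt-3.5-turbo", max_doc_tokens=3000):
    """Truncate the article so the prompt fits within the model's context window.

    Keeps the first max_doc_tokens tokens of the document, leaving room for the
    instruction, the summary, and the model's answer.
    """
    enc = tiktoken.encoding_for_model(model)
    tokens = enc.encode(doc)
    if len(tokens) <= max_doc_tokens:
        return doc
    return enc.decode(tokens[:max_doc_tokens])
```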

Go Figure not included?

Thanks for the great work compiling a SOTA dataset for metric evaluation.
I am just curious why the annotations from GO FIGURE are not included.

kind regards
