
wmt-mqm-human-evaluation

Expert-based Human Evaluations for the Submissions of WMT 2020, WMT 2021, WMT 2022 and WMT 2023.

The contents of this repository are not an official Google product.

We re-annotated the WMT English to German and Chinese to English test sets newstest2020 and newstest2021, as well as the TED talks WMT21 test suite, with raters who are professional translators and native speakers of the target language. The resulting human ratings are more reliable than crowd-worker human evaluations. See our paper for details of the experimental setup.

You can use the Marot web app to open these TSV data files for computing scores as well as for interactively slicing and dicing (details and screenshots presented further down in this documentation).

Files in this repository

  1. mqm_newstest2020_ende.tsv: MQM labels acquired for 10 submissions of newstest2020 for English-to-German.

  2. psqm_newstest2020_ende.tsv: pSQM labels acquired for 10 submissions of newstest2020 for English-to-German.

  3. mqm_newstest2020_zhen.tsv: MQM labels acquired for 10 submissions of newstest2020 for Chinese-to-English.

  4. psqm_newstest2020_zhen.tsv: pSQM labels acquired for 10 submissions of newstest2020 for Chinese-to-English.

  5. mqm_newstest2021_ende.tsv: MQM labels acquired for 15 submissions of newstest2021 for English-to-German.

  6. mqm_newstest2021_zhen.tsv: MQM labels acquired for 15 submissions of newstest2021 for Chinese-to-English.

  7. mqm_ted_ende.tsv: MQM labels acquired for 15 submissions of TED talks for English-to-German.

  8. mqm_ted_zhen.tsv: MQM labels acquired for 15 submissions of TED talks for Chinese-to-English.

  9. mqm_generalMT2022_ende.tsv: MQM labels acquired for 16 submissions of generalMT2022 for English-to-German.

  10. mqm_generalMT2022_zhen.tsv: MQM labels acquired for 16 submissions of generalMT2022 for Chinese-to-English.

  11. mqm_generalMT2023_ende.tsv: MQM labels acquired for 13 submissions of generalMT2023 for English-to-German.

  12. mqm_generalMT2023_zhen.tsv: MQM labels acquired for 16 submissions of generalMT2023 for Chinese-to-English.
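The files are plain TSV, so they can also be analyzed programmatically. Below is a minimal loading sketch with pandas; the column names come from each file's header row, and the "system" column referenced here is an assumption based on that header.

```python
import csv
import pandas as pd

# Tab-separated with a header row; disable quoting because source/target
# segments may contain unbalanced quote characters.
mqm = pd.read_csv("mqm_newstest2020_ende.tsv", sep="\t", quoting=csv.QUOTE_NONE)

print(mqm.columns.tolist())          # column names as defined in the file
print(mqm["system"].value_counts())  # number of ratings per submission
```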

newstest2021

English to German

| System | Expert-based MQM | WMT DA |
| --- | --- | --- |
| ref-C | 0.51 (1) | 0.320 (3) |
| VolcTrans-GLAT | 1.04 (2) | 0.301 (6) |
| Facebook-AI | 1.05 (3) | 0.378 (2) |
| ref-A | 1.22 (4) | 0.280 (8) |
| Nemo | 1.34 (5) | 0.250 (10) |
| HuaweiTSC | 1.38 (6) | 0.312 (4) |
| Online-W | 1.46 (7) | 0.391 (1) |
| UEdin | 1.51 (8) | 0.305 (5) |
| eTranslation | 1.70 (9) | 0.281 (7) |
| VolcTrans-AT | 1.74 (10) | 0.280 (9) |

Chinese to English

| System | Expert-based MQM | WMT DA |
| --- | --- | --- |
| ref-A | 4.35 (1) | 0.019 (3) |
| NiuTrans | 4.63 (2) | 0.042 (1) |
| SMU | 4.84 (3) | 0.034 (7) |
| MiSS | 4.93 (4) | 0.029 (5) |
| Borderline | 4.95 (5) | 0.023 (4) |
| DIDI-NLP | 5.10 (6) | 0.031 (2) |
| IIE-MT | 5.15 (7) | 0.031 (6) |
| Facebook-AI | 5.22 (8) | 0.037 (8) |
| Online-W | 5.57 (9) | 0.087 (9) |

TED talks

English to German

| System | Expert-based MQM |
| --- | --- |
| ref.A | 0.91 (1) |
| Facebook-AI | 1.06 (2) |
| Online-W | 1.12 (3) |
| VolcTrans-AT | 1.24 (4) |
| metricsystem3 | 1.44 (5) |
| VolcTrans-GLAT | 1.49 (6) |
| HuaweiTSC | 1.50 (7) |
| metricsystem1 | 1.63 (8) |
| metricsystem2 | 1.69 (9) |
| metricsystem5 | 1.72 (10) |
| UEdin | 1.77 (11) |
| metricsystem4 | 1.78 (12) |
| eTranslation | 1.96 (13) |
| Nemo | 2.14 (14) |

Chinese to English

| System | Expert-based MQM |
| --- | --- |
| ref.B | 0.42 (1) |
| DIDI-NLP | 1.65 (2) |
| metricsystem2 | 1.76 (3) |
| metricsystem1 | 1.90 (4) |
| MiSS | 1.97 (5) |
| IIE-MT | 1.98 (6) |
| metricsystem4 | 2.05 (7) |
| metricsystem5 | 2.15 (8) |
| SMU | 2.20 (9) |
| Borderline | 2.40 (10) |
| NiuTrans | 2.49 (11) |
| Facebook-AI | 2.64 (12) |
| Online-W | 2.93 (13) |
| metricsystem3 | 2.99 (14) |
| ref.A | 5.52 (15) |

newstest2020

English to German

| System | Expert-based MQM | WMT DA |
| --- | --- | --- |
| Human-B | 0.75 (1) | 0.57 (1) |
| Human-A | 0.91 (2) | 0.45 (4) |
| Human-P | 1.41 (3) | 0.30 (10) |
| Tohoku-AIP-NTT | 2.02 (4) | 0.47 (3) |
| OPPO | 2.25 (5) | 0.50 (2) |
| eTranslation | 2.33 (6) | 0.31 (9) |
| Tencent_Translation | 2.35 (7) | 0.39 (6) |
| Huoshan_Translate | 2.45 (8) | 0.33 (7) |
| Online-B | 2.48 (9) | 0.42 (5) |
| Online-A | 2.99 (10) | 0.32 (8) |

Chinese to English

| System | Expert-based MQM | WMT DA |
| --- | --- | --- |
| Human-A | 3.43 (1) | - |
| Human-B | 3.62 (2) | -0.03 (9) |
| VolcTrans | 5.03 (3) | 0.10 (1) |
| WeChat_AI | 5.13 (4) | 0.08 (3) |
| Tencent_Translation | 5.19 (5) | 0.06 (4) |
| OPPO | 5.20 (6) | 0.05 (7) |
| THUNLP | 5.34 (7) | 0.03 (8) |
| DeepMind | 5.41 (8) | 0.05 (6) |
| DiDi_NLP | 5.48 (9) | 0.09 (2) |
| Online-B | 5.85 (10) | 0.06 (5) |

Types of extra human evaluations

Multidimensional Quality Metrics (MQM)

To adapt the generic MQM framework for our context, we followed the official guidelines for scientific research.

Our annotators were instructed to identify all errors within each segment of a document, paying particular attention to document context. Each error was highlighted in the text and labeled with an error category and a severity. To temper the effect of long segments, we imposed a maximum of five errors per segment, instructing raters to choose the five most severe errors for segments containing more. Our error hierarchy includes the standard top-level categories Accuracy, Fluency, Terminology, Style, and Locale, each with a specific set of sub-categories. After an initial pilot run, we introduced a special Non-translation error that can be used to tag an entire segment that is too badly garbled to permit reliable identification of individual errors. Error severities are assigned independently of category and consist of Major, Minor, and Neutral levels, corresponding respectively to actual translation or grammatical errors, smaller imperfections, and purely subjective opinions about the translation.

Since we are ultimately interested in scoring segments, we require a weighting on error types. We fixed the weight on Minor errors at 1, and explored a range of Major weights from 1 to 10 (the Major weight recommended in the MQM standard). For each weight combination we examined the stability of system ranking using a resampling technique. We found that a Major weight of 5 gave the best balance between stability and ability to discriminate among systems.

These weights apply to all error categories, with two exceptions. We assigned a weight of 0.1 to Minor Fluency/Punctuation errors to reflect their mostly non-linguistic nature. Decisions like the style of quotation mark to use or the spacing around punctuation affect the appearance of a text but do not change its meaning. Unlike other kinds of Minor errors, these are easy to correct algorithmically, so we assign a low weight to ensure that their main role is to distinguish between systems that are equivalent in other respects. Major Fluency/Punctuation errors, which render a text ungrammatical or change its meaning (e.g., eliding the comma in “Let’s eat, grandma”), have the standard weighting. The second exception is the singleton Non-translation category, with a weight of 25, equivalent to five Major errors.
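Concretely, the weighting scheme can be sketched as a small scoring function. This is only an illustration: the weight of 0 for Neutral errors and the exact category and severity strings are assumptions based on the description above.

```python
def error_weight(category: str, severity: str) -> float:
    """Weight of a single annotated error under the scheme described above."""
    if category.startswith("Non-translation"):
        return 25.0  # singleton category, equivalent to five Major errors
    if severity == "Major":
        return 5.0
    if severity == "Minor":
        if category == "Fluency/Punctuation":
            return 0.1  # mostly non-linguistic punctuation decisions
        return 1.0
    return 0.0  # Neutral: purely subjective opinions (weight assumed to be 0)

# A segment's MQM score is the sum of its error weights (at most five
# errors per segment), so lower scores indicate better translations.
```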

Marot: interactive analysis tool for MQM evaluations

The Marot web app can be used to view detailed analyses of MQM evaluations. To use it, download the file marot-lite.html to your computer:

wget https://raw.githubusercontent.com/google-research/google-research/master/marot/marot-lite.html

Then simply open the marot-lite.html file in a web browser and use the "Choose file" button to pick an MQM TSV data file (downloaded to your computer). MQM data spans several columns, so it's best to use a desktop or laptop computer with a wide screen. You can interactively slice and dice the evaluation results in Marot by filtering down to specific systems, documents, etc. Here are a couple of screenshots of the tool:

Screenshot of evaluation metrics in Marot:

marot-scores

Screenshot of examples of rated sentences in Marot:

marot-examples

Scalar Quality Metrics (SQM)

Similar to the WMT setting, the Scalar Quality Metric (SQM) evaluation collects segment-level scalar ratings with document context. Unlike the 0–100 assessment of translation quality used in WMT, SQM uses a 0–6 scale. Another difference is that the segments were rated by professional translators instead of crowd workers or researchers. We use the label pSQM for SQM ratings acquired from professional translators.

Credits

If you use this data, please cite the following paper:

@misc{freitag2021experts,
      title={Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation}, 
      author={Markus Freitag and George Foster and David Grangier and Viresh Ratnakar and Qijun Tan and Wolfgang Macherey},
      year={2021},
      eprint={2104.14478},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Issues

Source-Side Spans

Thank you for creating this amazing resource!
The annotator guidelines (Table 12 of your paper) contain the following instruction:

To identify an error, highlight the relevant span of text [...] The span of text may be in the source segment if the error is a source error or an omission.

In the dataset, all the spans are on the target side, including omission errors:

Source: Setting the example? Income inequality in the US is at an all-time high
Target: <v>Die</v> Einkommensungleichheit in den USA ist auf einem Allzeithoch
Category: Accuracy/Omission 

I'm wondering: Did the annotators deviate from the guidelines in this respect, or is it maybe a data processing mistake that could still be fixed?

avg_seg_scores for 2021?

Hi, I saw that for 2020 you published sentence-level scores, and they were really helpful for me. Thanks for that :)

I couldn't find these scores for 2021 in the repo. Could you provide them, or share the script to calculate them from the MQM evaluations? It would be of great help. Thanks!
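In the meantime I recompute rough per-segment averages myself from the MQM file, using the weights described in the README above (the column names and the exact severity/category strings here are my assumptions):

```python
import csv
import pandas as pd

def weight(category: str, severity: str) -> float:
    # weights per the README: Minor=1, Major=5, Minor Fluency/Punctuation=0.1,
    # Non-translation=25; anything else (e.g. Neutral) is assumed to count 0
    if "Non-translation" in category:
        return 25.0
    if severity == "Major":
        return 5.0
    if severity == "Minor":
        return 0.1 if category == "Fluency/Punctuation" else 1.0
    return 0.0

df = pd.read_csv("mqm_newstest2021_ende.tsv", sep="\t", quoting=csv.QUOTE_NONE)
df["w"] = [weight(c, s) for c, s in zip(df["category"], df["severity"])]

# sum error weights per rater and segment, then average over raters
per_rater = df.groupby(["system", "doc", "seg_id", "rater"])["w"].sum()
avg_seg_scores = per_rater.groupby(level=["system", "doc", "seg_id"]).mean()
```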

Pairwise inter-rater agreement

Hi,

I'm trying to figure out how to compute pairwise inter-rater agreement, but my results (avg. 0.279) are lower than those reported in the paper (avg. 0.584) for English→German MQM.

To compute the agreement, I followed these steps (sketched in code below):

  1. load the MQM annotations of news2020, news2021, and ted; increment doc_id in news2021 and ted by 1000 and 2000 respectively to avoid doc_id overlap.
  2. convert each MQM score s to a 7-point Likert-type score by mapping s==0 to 0, 0<s<=5 to 1, ..., 24.99<s<=25 to 6
  3. given two raters, get the intersection of the segments labeled by both raters
  4. compute Cohen's kappa on the intersection using sklearn.metrics.cohen_kappa_score
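In code, steps 2–4 look roughly like this (the middle bin boundaries in step 2 are my interpolation of the mapping above):

```python
from sklearn.metrics import cohen_kappa_score

def to_likert(s: float) -> int:
    # step 2: map an MQM segment score to a 7-point Likert-type bin
    # (the intermediate boundaries 10/15/20 are interpolated from the "...")
    if s == 0:
        return 0
    for likert, upper in enumerate([5, 10, 15, 20, 24.99], start=1):
        if s <= upper:
            return likert
    return 6  # 24.99 < s <= 25

def pairwise_kappa(scores_a: dict, scores_b: dict) -> float:
    # step 3: restrict to segments labeled by both raters
    common = sorted(set(scores_a) & set(scores_b))
    # step 4: Cohen's kappa on the binned scores
    a = [to_likert(scores_a[k]) for k in common]
    b = [to_likert(scores_b[k]) for k in common]
    return cohen_kappa_score(a, b)
```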

Do you have any advice on computing the agreement, or perhaps a reference script?

I appreciate any help you can provide.

Data missing in mqm_newstest2020_zhen.tsv

I found that mqm_newstest2020_zhen.tsv is missing data for the following systems at seg ID 181:

181 Online-B.1605
181 WeChat_AI.1525
181 Tencent_Translation.1249
181 THUNLP.1498
181 OPPO.1422
181 Huoshan_Translate.919

This causes a mismatch between mqm_newstest2020_zhen.tsv and mqm_newstest2020_zhen.avg_seg_scores.tsv: mqm_newstest2020_zhen.tsv has only 19994 unique system+seg ID combinations.
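The gap is easy to reproduce with a quick check (the system and seg_id column names are taken from the file header):

```python
import csv
import pandas as pd

df = pd.read_csv("mqm_newstest2020_zhen.tsv", sep="\t", quoting=csv.QUOTE_NONE)

# count distinct (system, seg_id) combinations -- 19994 here
print(len(df[["system", "seg_id"]].drop_duplicates()))

# per-system segment counts reveal which submissions fall short
print(df.groupby("system")["seg_id"].nunique())
```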

Minor inconsistencies in the data

Thanks again for creating this great resource. I noticed some minor inconsistencies in the newstest2020 dataset that could be relevant for other people working with it:

Traces of Post-Editing
The raters did not just highlight error spans (as indicated by the guidelines) but also seem to have made post-edits. The "target" column sometimes contains text that deviates from the original translation. This affects 8255 out of 79020 ratings for EN–DE and 32184 out of 124292 ratings for ZH–EN.
Example:

ID: Human-A.0 | independent.281139 | 4 | 4
Rater: rater2
Original translation: Um auf Titelseiten zu gelangen, trug er einen Mundschutz und klebte sich Klebeband auf seine Nase, um Leute zum Reden zu bringen.
Target in the dataset: Um auf Titelseiten zu gelangen, trug <v>er angeblich </v> einen Mundschutz und klebte sich Klebeband auf seine Nase, um Leute zum Reden zu bringen.
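I found these cases by stripping the <v>...</v> span markers and comparing the targets across raters, assuming that the most frequent tag-stripped target per segment is the original translation (a heuristic, not ground truth):

```python
import re
from collections import Counter

def strip_tags(target: str) -> str:
    # remove the <v>...</v> error-span markers, keeping the enclosed text
    return re.sub(r"</?v>", "", target)

def postedited(targets: list[str]) -> list[str]:
    # targets: all rated "target" strings for one (system, doc, seg_id)
    stripped = [strip_tags(t) for t in targets]
    original, _ = Counter(stripped).most_common(1)[0]
    return [t for t, s in zip(targets, stripped) if s != original]
```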

Superfluous Quotes
A few translations are wrapped in quotes that are not present in the original data. The raters have usually marked the quotes as addition or punctuation errors, which has slightly increased the error count for these types. This affects at least 135 out of 14180 samples for EN–DE and 25 out of 19994 samples for ZH–EN.
Example:

ID: Online-A.1574 | stv.tv.21636 | 17 | 77
Original source: "This review is focused on steps that can be taken to help aid enforcement agencies such as local authorities, as they use their powers to help keep communities safe."
Source in the dataset: ""This review is focused on steps that can be taken to help aid enforcement agencies such as local authorities, as they use their powers to help keep communities safe.""
Original translation: "Diese Überprüfung konzentriert sich auf Maßnahmen, die ergriffen werden können, um Durchsetzungsbehörden wie lokalen Behörden zu helfen, da sie ihre Befugnisse nutzen, um die Sicherheit der Gemeinschaften zu gewährleisten."
Target in the dataset: """Diese Überprüfung konzentriert sich auf Maßnahmen, die ergriffen werden können, um Durchsetzungsbehörden wie lokalen Behörden zu helfen, da sie ihre Befugnisse nutzen, um die Sicherheit der Gemeinschaften zu gewährleisten."""
