google / wmt-mqm-human-evaluation
License: Apache License 2.0
Hi,
I found that the WMT 2020 ZH–EN source file is missing the seg_id and rater columns. Please check.
Sincerely,
Hi,
I'm now trying to figure out how to compute pairwise inter-rater agreement, but my results are lower (avg: 0.279) than the results from the paper (avg. 0.584) for English→German MQM.
To compute the agreement, I followed these steps:
1. Merge news2020, news2021, and ted; increment doc_id in news2021 and ted by 1000 and 2000 respectively to avoid doc_id overlap.
2. Convert each score s to a 7-point Likert-type score by mapping s == 0 to 0, 0 < s <= 5 to 1, ..., 24.99 < s <= 25 to 6.
3. Compute pairwise Cohen's kappa with sklearn.metrics.cohen_kappa_score.
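For reference, my binning and agreement computation can be sketched as below. `to_likert` follows the bin edges described above, and `cohen_kappa` is a plain unweighted Cohen's kappa that should agree with `sklearn.metrics.cohen_kappa_score` on the same label lists; the rater scores at the bottom are hypothetical.

```python
from collections import Counter

def to_likert(s):
    """Map an MQM segment score s (0 = perfect, 25 = worst) to a
    7-point Likert-type category using the bin edges described above."""
    if s == 0:
        return 0
    if s <= 5:
        return 1
    if s <= 10:
        return 2
    if s <= 15:
        return 3
    if s <= 20:
        return 4
    if s <= 24.99:
        return 5
    return 6

def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa; should match
    sklearn.metrics.cohen_kappa_score on the same inputs."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical MQM scores for two raters on the same five segments
rater_a = [to_likert(s) for s in (0, 1.0, 6.0, 25.0, 12.0)]
rater_b = [to_likert(s) for s in (0, 6.0, 7.0, 25.0, 18.0)]
kappa = cohen_kappa(rater_a, rater_b)
```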
Could you offer any advice on computing the agreement, or a sample script?
I appreciate any help you can provide.
Thank you for creating this amazing resource!
The annotator guidelines (Table 12 of your paper) contain the following instruction:
To identify an error, highlight the relevant span of text [...] The span of text may be in the source segment if the error is a source error or an omission.
In the dataset, all the spans are on the target side, including omission errors:
Source: Setting the example? Income inequality in the US is at an all-time high
Target: <v>Die</v> Einkommensungleichheit in den USA ist auf einem Allzeithoch
Category: Accuracy/Omission
I'm wondering: Did the annotators deviate from the guidelines in this respect, or is it maybe a data processing mistake that could still be fixed?
Hi,
When I look at the English→German table for newstest2020, the DA scores match those in the official findings report: https://statmt.org/wmt20/pdf/2020.wmt-1.1.pdf
For WMT21, however, the scores and rankings are completely off (page 19, Table 10 of https://statmt.org/wmt21/pdf/2021.wmt-1.1.pdf).
Is there a specific reason?
I found that mqm_newstest2020_zhen.tsv has missing data for the following systems at seg ID 181:
181 Online-B.1605
181 WeChat_AI.1525
181 Tencent_Translation.1249
181 THUNLP.1498
181 OPPO.1422
181 Huoshan_Translate.919
This causes a mismatch between mqm_newstest2020_zhen.tsv and mqm_newstest2020_zhen.avg_seg_scores.tsv: mqm_newstest2020_zhen.tsv has only 19994 unique system+seg_id combinations.
Thanks again for creating this great resource. I noticed some minor inconsistencies in the newstest2020 dataset that could be relevant for other people working with it:
Traces of Post-Editing
The raters did not just highlight error spans (as indicated by the guidelines) but also seem to have made post-edits. The "target" column sometimes contains text that deviates from the original translation. This affects 8255 out of 79020 ratings for EN–DE and 32184 out of 124292 ratings for ZH–EN.
Example:
ID: Human-A.0 | independent.281139 | 4 | 4
Rater: rater2
Original translation: Um auf Titelseiten zu gelangen, trug er einen Mundschutz und klebte sich Klebeband auf seine Nase, um Leute zum Reden zu bringen.
Target in the dataset: Um auf Titelseiten zu gelangen, trug <v>er angeblich </v> einen Mundschutz und klebte sich Klebeband auf seine Nase, um Leute zum Reden zu bringen.
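Such cases can be detected by stripping the `<v>...</v>` error-span markers from the dataset target and comparing the result against the original system output; a minimal sketch:

```python
import re

V_TAGS = re.compile(r'</?v>')

def is_post_edited(dataset_target, original_translation):
    """True if the rater's target, once the <v>...</v> span markers are
    removed, no longer matches the original system output (modulo
    whitespace introduced by the markers)."""
    stripped = V_TAGS.sub('', dataset_target)
    normalize = lambda s: ' '.join(s.split())
    return normalize(stripped) != normalize(original_translation)
```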
Superfluous Quotes
A few translations are wrapped in quotes that are not present in the original data. The raters have usually marked the quotes as addition or punctuation errors, which has slightly increased the error count for these types. This affects at least 135 out of 14180 samples for EN–DE and 25 out of 19994 samples for ZH–EN.
Example:
ID: Online-A.1574 | stv.tv.21636 | 17 | 77
Original source: "This review is focused on steps that can be taken to help aid enforcement agencies such as local authorities, as they use their powers to help keep communities safe."
Source in the dataset: ""This review is focused on steps that can be taken to help aid enforcement agencies such as local authorities, as they use their powers to help keep communities safe.""
Original translation: "Diese Überprüfung konzentriert sich auf Maßnahmen, die ergriffen werden können, um Durchsetzungsbehörden wie lokalen Behörden zu helfen, da sie ihre Befugnisse nutzen, um die Sicherheit der Gemeinschaften zu gewährleisten."
Target in the dataset: """Diese Überprüfung konzentriert sich auf Maßnahmen, die ergriffen werden können, um Durchsetzungsbehörden wie lokalen Behörden zu helfen, da sie ihre Befugnisse nutzen, um die Sicherheit der Gemeinschaften zu gewährleisten."""
I've tried this, but it looks like I'm still missing ~20% of the data relative to the references: https://www.kaggle.com/code/alvations/lightyear2
Hi, I saw in 2020 you published sentence scores and they were really helpful for me. Thanks for that :)
I couldn't find these scores for 2021 in the repo. Could you provide them? Or do you have a script to calculate them from the MQM evaluations? It would be a great help, thanks!
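In case it helps others, a hedged sketch of how segment scores could be recomputed from the per-error ratings. The weights (major = 5, minor = 1, minor Fluency/Punctuation = 0.1, major Non-translation = 25) follow my reading of the MQM paper's weighting scheme, and the exact category strings are assumptions; both should be verified against the official scoring before use.

```python
def error_weight(category, severity):
    """Weight one MQM error. Weights and category strings are assumed
    from the paper's scheme; verify against the official scoring."""
    severity = severity.lower()
    if severity in ('neutral', 'no-error'):
        return 0.0
    if severity == 'major':
        return 25.0 if category == 'Non-translation!' else 5.0
    if severity == 'minor':
        return 0.1 if category == 'Fluency/Punctuation' else 1.0
    return 0.0

def segment_score(ratings):
    """ratings: (category, severity) tuples for one (system, doc, seg,
    rater); the segment score is the sum of the error weights."""
    return sum(error_weight(c, s) for c, s in ratings)
```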