wmt_baseline's Introduction

Important Notification About Data Bug

  • In the initial release, there was a bug in data_2_terminology/train.tsv. If you used this file, you must redownload the fixed file, as the old file did not properly include all terminologies.

  • I've added diffcheck.py so you can check that the only difference between the data_2/train.tsv and data_2_terminology/train.tsv is the 608 terminology pairs.

  • The baseline results for using terminology have also been updated, but the differences from before are minor.

Sorry about the confusion!!
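The check that diffcheck.py performs can be sketched roughly as follows (a hypothetical stand-in, assuming both files are plain TSVs with one sentence pair per line; the actual script may differ):

```python
# Sketch of a diff check between the two training files: the terminology
# variant should contain every line of the base file plus exactly 608
# extra terminology pairs, and nothing should be missing.
from collections import Counter

def count_extra_lines(base_path, extended_path):
    """Return (num_extra, num_missing) lines in extended vs. base."""
    with open(base_path, encoding="utf-8") as f:
        base = f.read().splitlines()
    with open(extended_path, encoding="utf-8") as f:
        extended = f.read().splitlines()
    # Use Counters so duplicate lines are handled correctly.
    extra = Counter(extended) - Counter(base)
    missing = Counter(base) - Counter(extended)
    return sum(extra.values()), sum(missing.values())

# Usage: with the fixed files, expect extra == 608 and missing == 0.
# extra, missing = count_extra_lines("data_2/train.tsv",
#                                    "data_2_terminology/train.tsv")
```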

Baseline for WMT21 Machine Translation using Terminologies Task

This is a baseline for the WMT21 Machine Translation using Terminologies task. The task invites participants to explore methods to incorporate terminologies into either the training or the inference process, in order to improve both the accuracy and consistency of MT systems on a new domain.

For the baseline, we consider the English-to-French translation task, and evaluation is performed on the TICO-19 dataset, which is part of the overall evaluation for the task in WMT21.

Model

The baseline finetunes an OPUS-MT system, which is pre-trained on OPUS parallel data with the Marian toolkit. We use the Huggingface-ported version of the model and train with Huggingface Transformers + PyTorch.

Datasets

In the challenge you are allowed to use any parallel or monolingual data from previous WMT shared tasks. However, to reduce training time & resources we finetuned the pre-trained English-to-French model from MarianMT (OPUS-MT) on the following datasets:

Training Datasets

| Split | Num. examples |
|-----------|---------------|
| Training1 | 614093 |
| Training2 | 6540 |

| Split | Num. examples |
|-------|---------------|
| Train | 885606 |

| Split | Num. examples |
|-------|---------------|
| Train | 608 |

Subsampled datasets for finetuning

For the initial training dataset, we provide two subsampled versions:

  • data_2
| Dataset | Taus | Medline1 | Medline2 | Terminologies | Total |
|---------------|-------|----------|----------|---------------|-------|
| Num. examples | 30000 | 30000 | 6540 | 0 | 66540 |

Note that Medline2 has at most 6540 pairs after filtering out empty examples.

  • data_2_terminology
| Dataset | Taus | Medline1 | Medline2 | Terminologies | Total |
|---------------|-------|----------|----------|---------------|-------|
| Num. examples | 30000 | 30000 | 6540 | 608 | 67148 |
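The terminology variant simply appends the 608 terminology pairs as additional parallel examples. A minimal sketch of that construction (file names are illustrative; per the TODO below, the official generation code is not yet released):

```python
# Sketch of building the terminology-augmented training file: treat each
# terminology entry ("source term <TAB> target term") as one extra
# parallel example and append it to the subsampled training data.
def append_terminology(train_lines, term_lines):
    """Both inputs are lists of 'source\ttarget' strings."""
    return train_lines + term_lines

# Usage (paths are assumptions based on the repo layout):
# with open("data_2/train.tsv", encoding="utf-8") as f:
#     train = f.read().splitlines()
# with open("terminology.tsv", encoding="utf-8") as f:  # hypothetical name
#     terms = f.read().splitlines()
# augmented = append_terminology(train, terms)  # 66540 + 608 = 67148 lines
```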

Evaluation Dataset

During training, we evaluate on the dev set of TICO-19, and the final evaluation is performed on the test set of TICO-19. Note that the dev and test sets are on the smaller side.

| Split | Num. examples |
|-------|---------------|
| Dev | 971 |
| Test | 2100 |

Baseline results

| Training Epochs | Use terminology | Eval set | BLEU |
|-----------------|-----------------|--------------|---------|
| 3 epochs | No | TICO-19 Dev | 40.0991 |
| 3 epochs | Yes | TICO-19 Dev | 40.3334 |
| 3 epochs | No | TICO-19 Test | 37.5342 |
| 3 epochs | Yes | TICO-19 Test | 37.6491 |
| 10 epochs | No | TICO-19 Dev | 39.9382 |
| 10 epochs | Yes | TICO-19 Dev | 40.0829 |
| 10 epochs | No | TICO-19 Test | 37.4869 |
| 10 epochs | Yes | TICO-19 Test | 37.579 |

Note that due to the small size of the data, these results can vary with the exact settings (hyperparameters, number of training epochs, etc.). In general, however, the results with terminology should be better than without.

Run the code on Colab

You can run the baseline experiments from this colab notebook. To make further changes to the code, make sure to choose "Save a copy in Drive" to save an editable copy to your own Google Drive.

TODO

  • Add code for generating the datasets
  • Add editable transformer code

wmt_baseline's People

Contributors

mnskim


wmt_baseline's Issues

Question about BLEU performance

Hello TA! I have a question about the predict_bleu value in predict_result.json.
I tried data augmentation and back translation as you suggested, but the performance actually dropped...
Is the model that scored BLEU 38.87 on the codalab-competition the one obtained by running the Colab notebook you linked?

Question about the assignment

Hello TA, I don't think I'm understanding the assignment correctly.
The code uses Huggingface Transformers for the model, the tokenizer, and the trainer class, so what exactly are we supposed to modify to improve performance? Are we meant to use the Transformer model as-is rather than writing the model code ourselves?

Question about modifying the Dev and Test sets

Hello TA,

I performed data augmentation by tagging the phrases corresponding to the given terminology (608 pairs) and trained on that,
and I applied the same method to the dev and test data when measuring performance.
I thought tagging could be applied to dev and test just as tokenization is, so I did it that way. Is this an acceptable approach?
I used the same files for the dev and test data and applied the tagging inside the run script!
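The tagging approach described in this issue can be sketched roughly as follows (the tag tokens and the substring-matching strategy are illustrative assumptions, not the asker's actual code):

```python
# Illustrative sketch of terminology tagging for data augmentation:
# wrap each source-side occurrence of a known term with marker tokens
# that also carry the target-side translation.
def tag_terms(sentence, terminology):
    """terminology: dict mapping source term -> target term."""
    for src_term, tgt_term in terminology.items():
        if src_term in sentence:
            sentence = sentence.replace(
                src_term, f"<term> {src_term} <trans> {tgt_term} </term>")
    return sentence

print(tag_terms("the vaccine works", {"vaccine": "vaccin"}))
# -> the <term> vaccine <trans> vaccin </term> works
```

Whether such tagging may also be applied to the dev/test inputs at inference time is exactly the question raised above.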
