wmt_baseline's Introduction

Important Notification About Data Bug

  • In the initial release, there was a bug in data_2_terminology/train.tsv. If you used this file, you must redownload the fixed file, as the old file did not properly include all terminologies.

  • I've added diffcheck.py so you can check that the only difference between the data_2/train.tsv and data_2_terminology/train.tsv is the 608 terminology pairs.

  • The baseline results for using terminology have also been updated, but the differences from before are minor.

Sorry about the confusion!!
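The check that diffcheck.py performs can be sketched roughly as follows (a hypothetical stand-in, assuming both files are plain TSVs with one sentence pair per line; the actual script may differ):

```python
# Sketch of a diff check between the two training files: the terminology
# variant should contain every line of the base file plus exactly 608
# extra terminology pairs, and nothing should be missing.
from collections import Counter

def count_extra_lines(base_path, extended_path):
    """Return (num_extra, num_missing) lines in extended vs. base."""
    with open(base_path, encoding="utf-8") as f:
        base = f.read().splitlines()
    with open(extended_path, encoding="utf-8") as f:
        extended = f.read().splitlines()
    # Use Counters so duplicate lines are handled correctly.
    extra = Counter(extended) - Counter(base)
    missing = Counter(base) - Counter(extended)
    return sum(extra.values()), sum(missing.values())

# Usage: with the fixed files, expect extra == 608 and missing == 0.
# extra, missing = count_extra_lines("data_2/train.tsv",
#                                    "data_2_terminology/train.tsv")
```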

Baseline for WMT21 Machine Translation using Terminologies Task

This is a baseline for the WMT21 Machine Translation using Terminologies task. The task invites participants to explore methods to incorporate terminologies into either the training or the inference process, in order to improve both the accuracy and consistency of MT systems on a new domain.

For the baseline, we consider the English-to-French translation task, and evaluation is performed on the TICO-19 dataset, which is part of the overall evaluation for the task in WMT21.

Model

The baseline finetunes an OPUS-MT system, which is pre-trained on OPUS parallel data with the Marian toolkit. We use the Huggingface-ported version of the model and train with Huggingface Transformers + PyTorch.

Datasets

In the challenge you are allowed to use any parallel or monolingual data from previous WMT shared tasks. However, to reduce training time & resources we finetuned the pre-trained English-to-French model from MarianMT (OPUS-MT) on the following datasets:

Training Datasets

| Split | Num. examples |
|-----------|---------------|
| Training1 | 614093 |
| Training2 | 6540 |

| Split | Num. examples |
|-------|---------------|
| Train | 885606 |

| Split | Num. examples |
|-------|---------------|
| Train | 608 |

Subsampled datasets for finetuning

For the initial training dataset, we provide two subsampled versions:

  • data_2
| Dataset | Taus | Medline1 | Medline2 | Terminologies | Total |
|---------------|-------|----------|----------|---------------|-------|
| Num. examples | 30000 | 30000 | 6540 | 0 | 66540 |

Note that Medline2 has at most 6540 pairs after filtering out empty examples.

  • data_2_terminology
| Dataset | Taus | Medline1 | Medline2 | Terminologies | Total |
|---------------|-------|----------|----------|---------------|-------|
| Num. examples | 30000 | 30000 | 6540 | 608 | 67148 |
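The terminology variant simply appends the 608 terminology pairs as additional parallel examples. A minimal sketch of that construction (file names are illustrative; per the TODO below, the official generation code is not yet released):

```python
# Sketch of building the terminology-augmented training file: treat each
# terminology entry ("source term <TAB> target term") as one extra
# parallel example and append it to the subsampled training data.
def append_terminology(train_lines, term_lines):
    """Both inputs are lists of 'source\ttarget' strings."""
    return train_lines + term_lines

# Usage (paths are assumptions based on the repo layout):
# with open("data_2/train.tsv", encoding="utf-8") as f:
#     train = f.read().splitlines()
# with open("terminology.tsv", encoding="utf-8") as f:  # hypothetical name
#     terms = f.read().splitlines()
# augmented = append_terminology(train, terms)  # 66540 + 608 = 67148 lines
```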

Evaluation Dataset

During training, we evaluate on the dev set of TICO-19, and the final evaluation is performed on the test set of TICO-19. Note that the dev and test sets are on the smaller side.

| Split | Num. examples |
|-------|---------------|
| Dev | 971 |
| Test | 2100 |

Baseline results

| Training Epochs | Use terminology | Eval set | BLEU |
|-----------------|-----------------|--------------|---------|
| 3 epochs | No | TICO-19 Dev | 40.0991 |
| 3 epochs | Yes | TICO-19 Dev | 40.3334 |
| 3 epochs | No | TICO-19 Test | 37.5342 |
| 3 epochs | Yes | TICO-19 Test | 37.6491 |
| 10 epochs | No | TICO-19 Dev | 39.9382 |
| 10 epochs | Yes | TICO-19 Dev | 40.0829 |
| 10 epochs | No | TICO-19 Test | 37.4869 |
| 10 epochs | Yes | TICO-19 Test | 37.579 |

Note that due to the small size of the data, these results can vary with the exact settings (hyperparameters, number of training epochs, etc.). In general, however, the results with terminology should be better than without.

Run the code on Colab

You can run the baseline experiments from this colab notebook. To make further changes to the code, make sure to choose "Save a copy in Drive" to save an editable copy to your own Google Drive.

TODO

  • Add code for generating the datasets
  • Add editable transformer code

wmt_baseline's People

Contributors

mnskim


wmt_baseline's Issues

Question about BLEU performance

Hello TA! I have a question about the predict_bleu value in predict_result.json.
I tried data augmentation and back translation as you suggested, but the performance actually dropped...
Is the model that scored BLEU 38.87 on the codalab-competition the one obtained by running the Colab notebook you linked?

Question about the assignment

Hello TA, I don't think I'm understanding the assignment correctly.
The code uses Huggingface Transformers for the model, the tokenizer, and the trainer class, so what exactly are we supposed to modify to improve performance? Are we meant to use the Transformer model as-is rather than writing the model code ourselves?

Question about modifying the Dev and Test sets

Hello TA,

I performed data augmentation by tagging the phrases corresponding to the given terminology (608 pairs) and trained on that,
and I applied the same method to the dev and test data when measuring performance.
I thought tagging could be applied to dev and test just as tokenization is, so I did it that way. Is this an acceptable approach?
I used the same files for the dev and test data and applied the tagging inside the run script!
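The tagging approach described in this issue can be sketched roughly as follows (the tag tokens and the substring-matching strategy are illustrative assumptions, not the asker's actual code):

```python
# Illustrative sketch of terminology tagging for data augmentation:
# wrap each source-side occurrence of a known term with marker tokens
# that also carry the target-side translation.
def tag_terms(sentence, terminology):
    """terminology: dict mapping source term -> target term."""
    for src_term, tgt_term in terminology.items():
        if src_term in sentence:
            sentence = sentence.replace(
                src_term, f"<term> {src_term} <trans> {tgt_term} </term>")
    return sentence

print(tag_terms("the vaccine works", {"vaccine": "vaccin"}))
# -> the <term> vaccine <trans> vaccin </term> works
```

Whether such tagging may also be applied to the dev/test inputs at inference time is exactly the question raised above.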
