
Low-resource-Machine-Translation

This repository contains the code for the project developed for the Deep Natural Language Processing course. The goal of the project is to replicate the experiments performed by Dabre et al. on low-resource machine translation: starting from a machine translation model pretrained on a large dataset, we finetune it on a low-resource language. Then, two extensions are implemented:

  • The same approach is tested on translation from Vietnamese to English and then from English to the other low-resource languages
  • The same approach is tested on a different dataset and a different language pair

Implementation details

The Python version used is 3.7.12.

Library versions

  • transformers 4.16.2
  • datasets 1.18.3
  • metrics 0.3.3
  • sentencepiece 0.1.96
  • sacrebleu 2.0.0
  • torch 1.10.0+cu111

Multilingual finetuning

Open In Colab

The initial model chosen for the task is MarianMT, a transformer-based model pretrained on a large English-Chinese corpus. The model is finetuned on four low-resource languages from the ALT dataset (Vietnamese, Indonesian, Khmer, and Filipino). The finetuning is performed using the Hugging Face 🤗 Transformers library and relies on the Trainer API. The code for model finetuning is available in the finetuning_en_target notebook.
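To make the setup concrete, the following is a minimal sketch of such a finetuning run with the Trainer API. It is illustrative only: the toy parallel data stands in for the ALT corpus, the output directory and hyperparameters are assumptions rather than the exact code of the finetuning_en_target notebook, and the target-language token handling shown in the Model usage section is omitted for brevity.

from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "Helsinki-NLP/opus-mt-en-zh"  # initial English-Chinese MarianMT model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy English-Vietnamese pairs standing in for the ALT training/validation splits
train_data = Dataset.from_dict({
    "en": ["The cat is on the table"],
    "vi": ["Con mèo ở trên bàn"],
})
eval_data = train_data

def preprocess(batch):
    # Tokenize source sentences normally and target sentences with the target tokenizer
    model_inputs = tokenizer(batch["en"], max_length=128, truncation=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(batch["vi"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = train_data.map(preprocess, batched=True)
eval_dataset = eval_data.map(preprocess, batched=True)

training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-en-vi",   # hypothetical output directory
    learning_rate=2e-5,            # placeholder hyperparameters
    per_device_train_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()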

Changing direction of translation

Open In Colab

For this task, the initial model is MarianMT pretrained on a Chinese-English corpus. The model is finetuned on the Vietnamese-English task; the resulting English sentences are then translated to another low-resource language using the models finetuned in the previous part, and the results are assessed by computing the BLEU score. The code for Vietnamese-English finetuning is available in the finetuning_vi_en notebook, whereas the code to translate between two low-resource languages using the pretrained models is available in the translate_vi_target notebook.
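As an illustration of the pivoting step, here is a minimal sketch that translates Vietnamese sentences to English and then to a target language, scoring the output with sacrebleu. The hub model names, the example sentence, and the reference are hypothetical placeholders, and the target-language token handling shown in the Model usage section below is omitted for brevity; the actual pipeline is in the translate_vi_target notebook.

import sacrebleu
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def translate(model_name, sentences):
    # Load a seq2seq model from the hub and translate a batch of sentences
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return [tokenizer.decode(t, skip_special_tokens=True) for t in generated]

vi_sentences = ["Con mèo ở trên bàn"]         # placeholder Vietnamese input
references = ["Kucing itu ada di atas meja"]  # placeholder Indonesian reference

# Step 1: Vietnamese -> English, step 2: English -> target language
# ("CLAck/vi-en" and "CLAck/en-id" are hypothetical names for the finetuned models)
english = translate("CLAck/vi-en", vi_sentences)
hypotheses = translate("CLAck/en-id", english)

print(sacrebleu.corpus_bleu(hypotheses, [references]).score)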

Testing on a different dataset

Open In Colab

In this task, the approach is tested on the WikiMatrix dataset, which consists of parallel sentences mined from Wikipedia using a distance metric to predict alignments. The selected language pair is English-Kazakh because it provides a number of samples comparable to those used in the previous sections. The starting model is MarianMT pretrained on English-Turkish, and the results are evaluated using the BLEU score. The code for model finetuning is available in the finetuning_en_kazakh notebook.
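As a reference for data preparation, below is a minimal sketch of turning the public WikiMatrix English-Kazakh release (tab-separated lines with a LASER margin score followed by the two sentences) into a Hugging Face dataset. The file name, column order, and score threshold are assumptions based on the public release, not the exact code of the finetuning_en_kazakh notebook.

import gzip
from datasets import Dataset

# Hypothetical local copy of the public WikiMatrix English-Kazakh file
pairs = {"en": [], "kk": []}
with gzip.open("WikiMatrix.en-kk.tsv.gz", "rt", encoding="utf-8") as f:
    for line in f:
        score, en_sent, kk_sent = line.rstrip("\n").split("\t", maxsplit=2)
        # Keep only confidently aligned pairs; 1.04 is the margin-score threshold
        # suggested by the WikiMatrix authors (treated here as an assumption)
        if float(score) >= 1.04:
            pairs["en"].append(en_sent)
            pairs["kk"].append(kk_sent)

dataset = Dataset.from_dict(pairs).train_test_split(test_size=0.1, seed=42)
print(dataset)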

Model usage

Open In Colab

Some of the models finetuned within this project are available on the Hugging Face hub, so they can be downloaded and used directly. A usage example is provided below.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Download the finetuned English-Vietnamese model available on the hub
model = AutoModelForSeq2SeqLM.from_pretrained("CLAck/en-vi")
tokenizer = AutoTokenizer.from_pretrained("CLAck/en-vi")

# The finetuned model's tokenizer no longer handles English, so the source
# sentence is tokenized with the tokenizer of the initial model
tokenizer_en = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-zh")
# These special tokens are needed to reproduce the original tokenizer
tokenizer_en.add_tokens(["<2zh>", "<2vi>"], special_tokens=True)

sentence = "The cat is on the table"
# The language token identifies the target language of the translation
input_sentence = "<2vi> " + sentence
translated = model.generate(**tokenizer_en(input_sentence, return_tensors="pt", padding=True))
output_sentence = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]

Contributors

  • notlosca
  • andrea-cavallo-98
