
Constrained-LevT

This repository contains the code for the ACL 2020 paper: Lexically Constrained Neural Machine Translation with Levenshtein Transformer. If you use this repository in your work, please cite:

@article{susanto2020lexically,
  title={Lexically Constrained Neural Machine Translation with Levenshtein Transformer},
  author={Susanto, Raymond Hendy and Chollampatt, Shamil and Tan, Liling},
  journal={arXiv preprint arXiv:2004.12681},
  year={2020}
}

Requirements and Installation

  • PyTorch version >= 1.2.0
  • Python version >= 3.6

To install from source:

git clone https://github.com/raymondhs/constrained-levt
cd constrained-levt
pip install --editable .

Usage

To replicate the experiments in our paper, download our pretrained models and evaluation sets into the root directory of this repository. These models were trained following the original instructions for training the Levenshtein Transformer model. To preserve each constraint in the output, pass --preserve-constraint. For example:

mkdir -p data-bin
tar -xvzf const_levt_en_de.tgz -C data-bin
cat data-bin/const_levt_en_de/newstest2014-wikt.en \
| python interactive_with_constraints.py \
    data-bin/const_levt_en_de \
    -s en -t de \
    --task translation_lev \
    --path data-bin/const_levt_en_de/checkpoint_best.pt \
    --iter-decode-max-iter 9 \
    --iter-decode-eos-penalty 0 \
    --beam 1 \
    --print-step \
    --batch-size 400 \
    --buffer-size 4000 \
    --preserve-constraint | tee /tmp/gen.out
# ...
# | Translated 3003 sentences (87040 tokens) in 11.5s (261.37 sentences/s, 7575.50 tokens/s)

# Compute term usage rate
cat /tmp/gen.out \
| grep ^H \
| sed 's/^H\-//' \
| sort -n -k 1 \
| cut -f 3 > /tmp/gen.out.sys
python scripts/term_usage_rate.py \
    -i data-bin/const_levt_en_de/newstest2014-wikt.en \
    -s /tmp/gen.out.sys
# Term use rate: 100.000
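The repository's scripts/term_usage_rate.py is the authoritative implementation of this metric. As a rough illustration only, term usage rate can be understood as the percentage of target-side constraint terms that actually appear in the system output; the sketch below makes that assumption and is not the repository's code:

```python
def term_usage_rate(input_lines, system_lines):
    """Percentage of target-side constraint terms found in the system output.

    input_lines: tab-separated lines (source text, then src|||tgt constraints).
    system_lines: corresponding system translations.
    """
    used = total = 0
    for inp, hyp in zip(input_lines, system_lines):
        fields = inp.rstrip("\n").split("\t")
        for constraint in fields[1:]:  # every column after the source text
            _, _, target = constraint.partition("|||")
            total += 1
            if target in hyp:
                used += 1
    return 100.0 * used / total if total else 0.0

# Toy example: one sentence with one satisfied constraint.
inputs = ["the patient has cancer\tcancer|||Krebs"]
outputs = ["der Patient hat Krebs"]
print(f"Term use rate: {term_usage_rate(inputs, outputs):.3f}")  # Term use rate: 100.000
```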

Each input line is tab-separated: the first column contains the source text, and the remaining columns contain the constraints. Each constraint uses the format source|||target. A preprocessing script (tokenize.sh) is provided in case you want to try your own input; it runs tokenization, BPE segmentation, and additional preprocessing for Romanian. For example:

echo 'Hello world!' | ./tokenize.sh en data-bin/const_levt_en_de/ende.code
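Based on the tab-separated format described above, building a constrained input line can be sketched as follows (a minimal helper for illustration; the function name and the example sentence are not from the repository, and real input should first pass through tokenize.sh):

```python
def make_input_line(source, constraints):
    """Format one input line: source text, then one src|||tgt pair per column."""
    return "\t".join([source] + [f"{src}|||{tgt}" for src, tgt in constraints])

# One source sentence with a single lexical constraint (cancer -> Krebs).
line = make_input_line("The patient has cancer .", [("cancer", "Krebs")])
print(line)  # source text, a tab character, then cancer|||Krebs
```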

License

The code and models in this repository are licensed under the MIT License. The evaluation datasets are licensed under CC-BY-SA 3.0.
