****************************************
Top-Down BTG-based Preordering
****************************************

This is an implementation of Top-Down Bracketing Transduction Grammar
(BTG)-based preordering, which improves machine translation quality by
reordering an input sentence so that its word order resembles that of the
target language. The algorithm is described in detail in this paper:

Tetsuji Nakagawa: Efficient Top-Down BTG Parsing for Machine Translation
Preordering, ACL-2015 (http://www.aclweb.org/anthology/P15-1021).


****************************************
1. Installing
****************************************

This software depends on the CityHash library
(https://github.com/google/cityhash), which must be installed beforehand.
After installing the library, run the following command:

$ make

Then, two binary files (tdbtg_preorderer_train and tdbtg_preorderer_parse) will
be generated.


****************************************
2. Example Usage
****************************************

The directory example/ contains tiny training and test data for
English-to-Japanese preordering. Training and testing can be carried out from
inside that directory as follows:

* Training model parameters

$ ../tdbtg_preorderer_train \
-input_annot train.annot \
-input_align train.align \
-output_model train.model

* Preordering source sentences

$ ../tdbtg_preorderer_parse \
-input_model train.model \
-input_annot test.annot \
-output_result test.order

* Evaluating the result

$ ../evaluate_preordering.py test.align test.order

The result will look like this:

Number of evaluated sentences: 3
Number of skipped sentences: 0
Fuzzy Reordering Score: 0.857143
Kendall's Tau: 0.933333
Complete Match: 0.666667


****************************************
3. File Format
****************************************

This software uses two file formats for training and test data, Annot and
Align, which are used in Lader (http://www.phontron.com/lader/).

* Annot file

This file contains tokenized and annotated source sentences. Each line has a
sentence in the following format:

w_1 w_2 ... w_K\tp_1 p_2 ... p_K\tc_1 c_2 ... c_K

where w_i, p_i, and c_i are the i-th word, POS tag, and word class
respectively. Tokens within a sequence are separated by spaces, and the three
sequences are separated by tabs.
Word classes can be obtained with Brown clustering or mkcls. If a POS tagger is
not available, coarse-grained word classes can be used in place of POS tags
(Koo et al.: Simple Semi-supervised Dependency Parsing).
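
The Annot line format above can be read with a few lines of Python. This is a
minimal sketch, not part of the distributed software: the names AnnotSentence
and parse_annot_line are hypothetical, and the sample sentence, tags, and
class labels are made up for illustration.

```python
from typing import List, NamedTuple


class AnnotSentence(NamedTuple):
    """One Annot line: three parallel sequences of length K."""
    words: List[str]
    pos_tags: List[str]
    word_classes: List[str]


def parse_annot_line(line: str) -> AnnotSentence:
    """Split one Annot line into words, POS tags, and word classes."""
    # The three sequences are separated by tabs.
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError(f"expected 3 tab-separated fields, got {len(fields)}")
    # Tokens inside each sequence are separated by single spaces.
    words, pos_tags, word_classes = (field.split(" ") for field in fields)
    if not (len(words) == len(pos_tags) == len(word_classes)):
        raise ValueError("the three sequences must have the same length K")
    return AnnotSentence(words, pos_tags, word_classes)


# Made-up sample line (not taken from example/train.annot).
sentence = parse_annot_line("he reads books\tPRP VBZ NNS\t0110 1011 0010\n")
print(sentence.words)
print(sentence.pos_tags)
```

Checking that all three sequences have the same length catches the most
common formatting mistake before the data reaches the trainer.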

* Align file

This file contains word alignment information.
The i-th line of the file contains word alignment information of the i-th
sentence in the corresponding Annot file.
Below is the format of each line:

m-n ||| s_1-t_1 s_2-t_2 ... s_L-t_L

where m and n are the numbers of tokens in the source and the target
sentences respectively, and each pair s_i-t_i means that the (s_i + 1)-th
source token and the (t_i + 1)-th target token are aligned (that is, s_i and
t_i are 0-based token indices).
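
As an illustration of the Align line format, the sketch below parses one line
into the two sentence lengths and a list of 0-based index pairs. The function
name parse_align_line and the sample line are hypothetical, and the range
check is an extra sanity test rather than documented behavior of the tools.

```python
from typing import List, Tuple


def parse_align_line(line: str) -> Tuple[int, int, List[Tuple[int, int]]]:
    """Parse 'm-n ||| s_1-t_1 ... s_L-t_L' into (m, n, [(s, t), ...])."""
    lengths, separator, pairs = line.partition("|||")
    if not separator:
        raise ValueError("missing '|||' separator")
    m, n = (int(x) for x in lengths.split("-"))
    links = []
    for pair in pairs.split():
        s, t = (int(x) for x in pair.split("-"))
        # s and t are 0-based, so they must be smaller than m and n.
        if not (0 <= s < m and 0 <= t < n):
            raise ValueError(f"alignment {pair} is out of range for {m}-{n}")
        links.append((s, t))
    return m, n, links


# Made-up sample line (not taken from example/train.align).
m, n, links = parse_align_line("3-3 ||| 0-0 1-2 2-1\n")
print(m, n, links)
```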


****************************************
4. Training
****************************************

tdbtg_preorderer_train is the program for training model parameters.
It reads training data in the Annot and Align formats and writes the learned
model.

Available options:
  -input_annot <string>
    Input annotation file.
  -input_align <string>
    Input alignment file.
  -output_model <string>
    Output model file.
  -beam_width <integer>
    Number of candidates for beam search. The default value is 20.
  -training_iterations <integer>
    Number of training iterations of Perceptron. The default value is 20.
  -min_updates <integer>
    Minimum number of updates a feature must receive in order not to be
    dropped. The default value is 0.
    If an integer larger than 0 is specified, the model size is reduced by
    applying a technique for obtaining a sparser Perceptron (Goldberg and
    Elhadad: Learning Sparser Perceptron Models).
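
When training is driven from a script, the options above can be assembled into
an argument list. The helper below is a hypothetical convenience wrapper, not
part of this software; it uses only the option names listed above, and the
default binary path is an assumption about where "make" was run.

```python
from typing import List


def build_train_command(
    annot: str,
    align: str,
    model: str,
    beam_width: int = 20,
    training_iterations: int = 20,
    min_updates: int = 0,
    binary: str = "./tdbtg_preorderer_train",  # assumed location of the binary
) -> List[str]:
    """Return the argument list for one training run."""
    return [
        binary,
        "-input_annot", annot,
        "-input_align", align,
        "-output_model", model,
        "-beam_width", str(beam_width),
        "-training_iterations", str(training_iterations),
        "-min_updates", str(min_updates),
    ]


command = build_train_command("train.annot", "train.align", "train.model",
                              min_updates=2)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) from
the Python standard library.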


****************************************
5. Preordering
****************************************

tdbtg_preorderer_parse is the program for reordering input sentences.
It reads a model and Annot data, and writes the reordered result.

Available options:
  -input_model <string>
    Input model file.
  -input_annot <string>
    Input annotation file.
  -output_result <string>
    Output result file.
  -output_format <string>
    Output file format: {order, text, tree}.
    "order" means a sequence of numbers in which the i-th element is the
    source-side index of the i-th target-side token.
    "text" means a reordered sentence.
    "tree" means a BTG parse tree.
    The default value is "order".
  -beam_width <integer>
    Number of candidates for beam search. The default value is 20.
  -rank_in_nbest <integer>
    Rank in the n-best results to be output. The default value is 0.
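
To make the "order" output format concrete, the sketch below applies an order
sequence to the source tokens to obtain the reordered sentence. It assumes
that the indices are 0-based (as in the Align format) and that the order is a
permutation of all source positions; neither is stated explicitly above. The
function name apply_order and the sample data are hypothetical.

```python
from typing import List, Sequence


def apply_order(tokens: Sequence[str], order: Sequence[int]) -> List[str]:
    """Reorder source tokens: output position i takes tokens[order[i]]."""
    # Assumption: order is a permutation of the 0-based source positions.
    if sorted(order) != list(range(len(tokens))):
        raise ValueError("order must be a permutation of 0..K-1")
    return [tokens[index] for index in order]


tokens = ["he", "reads", "books"]
order = [0, 2, 1]  # made-up order placing the verb last
reordered = apply_order(tokens, order)
print(" ".join(reordered))
```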


****************************************
6. Evaluating
****************************************

evaluate_preordering.py is the program for evaluating preordering results.
It takes gold-standard word alignment data in the Align format and a
reordering result in the Order format.

evaluate_preordering.py <align file> <order file>

Fuzzy Reordering Score, Kendall's Tau, and Complete Match are output, together
with the numbers of evaluated and skipped sentences.
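
For readers unfamiliar with Kendall's Tau, the sketch below computes the
textbook version for a single sentence. It is not the code of
evaluate_preordering.py: the script's normalization, its handling of unaligned
or tied tokens, and its averaging over sentences are not described here and
may differ. The input "ranks" is a hypothetical list in which ranks[i] is the
gold target-side position of the token placed at output position i.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(ranks: Sequence[int]) -> float:
    """Textbook Kendall's Tau in [-1, 1] for a sequence of gold positions."""
    n = len(ranks)
    if n < 2:
        return 1.0  # a single token is trivially in order
    concordant = 0
    discordant = 0
    # combinations() keeps the original left-to-right order inside each pair.
    for earlier, later in combinations(ranks, 2):
        if earlier < later:
            concordant += 1
        elif earlier > later:
            discordant += 1
    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


print(kendall_tau([0, 1, 2]))  # monotone output, matches the gold order
print(kendall_tau([2, 1, 0]))  # fully inverted output
```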
