This R package implements the translation models IBM1, IBM2, IBM3, and IBM4 from Brown et al. (1993), as well as phrase-based translation. While SMT methods have largely been overtaken by neural network-based methods, the IBM models remain historically important and are still used in conjunction with neural methods thanks to their word-alignment capabilities.
Our objective is to find the best translation $\hat{\mathbf{e}} = \arg\max_{\mathbf{e}} \Pr(\mathbf{e} | \mathbf{f}) = \arg\max_{\mathbf{e}} \Pr(\mathbf{f} | \mathbf{e}) \Pr(\mathbf{e})$, where the translation model $\Pr(\mathbf{f} | \mathbf{e})$ is implemented in this package and the language model $\Pr(\mathbf{e})$ is estimated using n-grams (implemented in the `kgrams` R package).
Actually, there is one piece missing. The translation and language models we will consider implicitly treat the output and input sentence lengths $n$ and $m$ as given. In the decoding algorithm of Wang & Waibel (1998), they state that they model $\Pr(\mathbf{e}, n | \mathbf{f}, m) \propto \Pr(\mathbf{f} | \mathbf{e}, n, m) \, \Pr(\mathbf{e} | n) \, \Pr(n | m)$.
To summarise:

- $\Pr(\mathbf{f} | \mathbf{e}, n, m)$: translation probability, estimated using IBM models or phrase tables; implemented in this package
- $\Pr(\mathbf{e} | n)$: language model, estimated using n-grams; implemented in the `kgrams` R package
- $\Pr(n | m)$: sentence length model; by default estimated using the built-in `glm` function to run a simple Poisson regression (see the sketch below)
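As a rough illustration of that default, here is a minimal sketch of the kind of Poisson regression `glm` fits, using made-up sentence lengths (the vectors below are illustrative and not part of the package):

```r
# illustrative sentence lengths from a hypothetical parallel corpus
source_lengths <- c(5, 8, 12, 7, 10, 4, 9, 11)
target_lengths <- c(6, 9, 11, 8, 12, 4, 10, 13)

# Poisson regression of output length n on input length m,
# mirroring the package default glm(..., family = "poisson")
len_model <- glm(target_lengths ~ source_lengths, family = "poisson")

# estimated Pr(n | m) for an input sentence of length 7
lambda <- predict(len_model, newdata = data.frame(source_lengths = 7),
                  type = "response")
dpois(0:20, lambda)  # probability of each candidate output length n
```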
Algorithms for IBM1 and IBM2 largely follow those provided in Koehn (2009). For IBM3 and IBM4, I use the heuristic algorithm without pegging as described in the appendix of Och & Ney (2003).
Given a vector of output sentences `target` and input sentences `source`, each function follows the basic syntax of `IBMX(target, source, maxiter, eps)`, where `X` is 1, 2, or 3, `maxiter` is the maximum number of iterations allowed, and `eps` is the stopping criterion for convergence.
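For example, a minimal call might look like this (the toy corpus below is made up purely for illustration):

```r
# a tiny toy parallel corpus (illustrative only)
e <- c("the house", "the cat", "the fish")    # output (English) sentences
f <- c("la maison", "le chat", "le poisson")  # input (French) sentences

# fit IBM model 1 with at most 50 EM iterations
model1 <- IBM1(target = e, source = f, maxiter = 50, eps = 0.01)
```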
The sentences in `target` and `source` are assumed to be strings containing space-delimited words. Notice then that "Wow!" would be considered one word while "Wow !" would be considered two words. You will likely want to do some pre-processing on the sentences before throwing them into the models. Some examples of handy pre-processing functions are `tm::removePunctuation`, `stringr::str_squish`, and `base::tolower`.
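For instance, a simple pipeline built from those functions might look like this (a sketch; adapt it to your corpus):

```r
library(tm)       # removePunctuation
library(stringr)  # str_squish

preprocess <- function(sentences) {
  sentences <- tolower(sentences)            # lowercase everything
  sentences <- removePunctuation(sentences)  # strip punctuation
  str_squish(sentences)                      # collapse repeated whitespace
}

preprocess(c("Wow!  What a   day.", "Do you have time?"))
#> [1] "wow what a day"   "do you have time"
```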
The only pre-processing that is done by the `IBMX` functions is to add `<NULL>` to the beginning of all sentences in `source`. E.g., "do you have time ?" becomes "<NULL> do you have time ?". This is done to allow words in the `target` sentences to be aligned with "nothing".
As explained in the cited papers, lower-numbered models are meant to be fed as input into higher-numbered models. For this reason, each function calls the previous one by default in order to initialize the algorithm. You can specify the number of iterations for the initialization, e.g., `IBM3(e, f, maxiter, eps, init.IBM1=30, init.IBM2=15)`. You can also set the `init` arguments to 0 and use results from previous model estimations instead. See the help pages for more details.
Phrase-based translation traditionally takes the following steps for estimating phrase probability tables (see Koehn (2009)):
1. Estimate IBM models in both directions (e.g. English-to-French and French-to-English)
2. Combine alignments from the two models to obtain a "symmetrized" set of sentence alignments
3. Extract phrases using the symmetrized alignments
4. Estimate phrase probabilities based on the frequency with which each phrase occurs
Steps 2 to 4 are handled by the `build_phrase_table` function. For example:
```r
# estimate models in both directions
model3_ftoe = IBM3(target=e, source=f, maxiter=5, init.IBM1=15, init.IBM2=15, verbose=100)
model3_etof = IBM3(target=f, source=e, maxiter=5, init.IBM1=15, init.IBM2=15, verbose=100)

# extract phrases from the IBM models and compute phrase probabilities
phtable = build_phrase_table(model3_ftoe, model3_etof)
```
These methods rely on being able to take a word or words and look up the corresponding probability in a table. To do this fast, I use R environment objects, which are the closest equivalent to Python dictionaries available in base R. These use hashing, making the table lookups O(1) time. For example, each IBM model produces a `tmatrix`, which gives the translation probability of an output word given an input word. E.g., `tmatrix[["fish"]][["poisson"]]` (or equivalently `tmatrix$fish$poisson`) gives the probability that "fish" in English is the proper translation of "poisson" in French.
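To illustrate the data structure (a minimal standalone sketch, not the package's internal constructor), a nested hashed environment behaves like this:

```r
# build a toy two-level "tmatrix" out of hashed environments
tmatrix <- new.env(hash = TRUE)
tmatrix[["fish"]] <- new.env(hash = TRUE)
tmatrix[["fish"]][["poisson"]] <- 0.71  # made-up probability
tmatrix[["fish"]][["truite"]]  <- 0.02  # made-up probability

# O(1) hashed lookups
tmatrix[["fish"]][["poisson"]]  # 0.71
tmatrix$fish$poisson            # same lookup via dollar syntax
```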
Environments are a bit less intuitive than matrices. Firstly, the built-in R function `object.size` does not actually compute the total size of all objects in the environment. Because of this, I provide the functions `object.size.IBM1`, `object.size.IBM2`, and `object.size.IBM3`, which iterate through each level of the environments and add up the sizes. Secondly, unlike matrices, `tmatrix2 = tmatrix` does not make a copy of the environment `tmatrix`. Instead, it copies only the pointer to `tmatrix`. This means that making changes to `tmatrix2` will also affect `tmatrix`.
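The reference semantics are easy to demonstrate in base R (a standalone sketch, unrelated to any package object):

```r
env1 <- new.env()
env1$p <- 0.5

env2 <- env1   # copies the pointer, not the contents
env2$p <- 0.9  # modifies the shared environment

env1$p  # 0.9: env1 sees the change made through env2

# to get a genuine copy, clone the bindings into a fresh environment
env3 <- list2env(as.list(env1), envir = new.env())
env3$p <- 0.1
env1$p  # still 0.9
```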
To generate the most likely translations of a given sentence in language f into language e, use the generic method `decode(object, target.sentence, senlength.model, language.model)`, where `object` is an object returned by one of `IBM1()`, `IBM2()`, or `build_phrase_table()`; `target.sentence` is the sentence in the language we want to translate from; and `senlength.model` and `language.model` are the sentence-length and language models described previously in the noisy channel section. See `?decode.IBM2` for how they should be specified.
If `senlength.model` or `language.model` aren't specified, they have defaults:

- `language.model` is a 3-gram model based on maximum likelihood without any smoothing
- `senlength.model` is a matrix of predicted probabilities based on the regression `glm(object$corpus$target_lengths ~ object$corpus$source_lengths, family="poisson")`
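Putting this together, a typical call might look like the following (continuing the illustrative `e`/`f` toy corpus from above):

```r
# fit IBM model 2 on the toy corpus
model2 <- IBM2(target = e, source = f, maxiter = 10, eps = 0.01)

# decode a French sentence using the default sentence-length and language models
decode(model2, target.sentence = "le chat")
```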
It is computationally infeasible to consider all possible translations while decoding, so in practice we rely on heuristic algorithms. For IBM1 and IBM2, I implement Algorithm 1 of Wang & Waibel (1998), which is a stack decoding approach. For phrase translation, I implement the methods described in chapter 6 of Koehn (2009).
Given a decoded translation, we then need to evaluate how good the translation is! Ideally one would use human evaluators, but our next-best option is automated evaluation measures. I provide the `evaluate` function for this purpose, which implements precision, recall, f-measure, PER, WER, and BLEU as described in Koehn (2009). See `?evaluate` for more details.
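As a reminder of what the simplest of these measures compute, here is a back-of-the-envelope sketch in base R (not the package's `evaluate` implementation):

```r
candidate <- strsplit("the cat sat on mat", " ")[[1]]
reference <- strsplit("the cat sat on the mat", " ")[[1]]

# word-level matches with clipped counts
common  <- intersect(candidate, reference)
matches <- sum(pmin(table(candidate)[common], table(reference)[common]))

precision <- matches / length(candidate)  # 5/5 = 1.00
recall    <- matches / length(reference)  # 5/6 ~ 0.83
fmeasure  <- 2 * precision * recall / (precision + recall)
```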
See `?IBM1`, `?IBM2`, `?IBM3`, `?IBM4`, `?build_phrase_table`, `?decode.IBM1`, `?decode.IBM2`, `?decode.phrase_table`, and `?evaluate` for more details.
Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263-311.
Koehn, P. (2009). Statistical machine translation. Cambridge University Press.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19-51.
Wang, Y. Y., & Waibel, A. (1998). Fast decoding for statistical machine translation. In Fifth International Conference on Spoken Language Processing.