joshua-decoder / joshua Goto Github PK

View Code? Open in Web Editor NEW

121.0 121.0 57.0 354.75 MB

Joshua Statistical Machine Translation Toolkit

Home Page: http://joshua-decoder.org/

joshua's People

Contributors

Stargazers

Watchers

joshua's Issues

compile kenlm during default ant target

Packed grammar does not get recreated when a new grammar.gz file is provided

When the pipeline is resumed from the TUNE stage, and a new grammar file is provided (following a complete run with an invalid/incomplete grammar), the packed grammar file does not get recreated in the tune and the test directories within $RUNDIR/data. They are detected as cached.

Steps to recreate :

Complete a pipeline run with an invalid grammar file (make sure you get past the TEST stage). You will most probably get BLEU scores of 0
Now resume the pipeline from the TUNE stage with a new (and proper) grammar file.
The packed grammars in $RUNDIR/data/tune/ and $RUNDIR/data/test/ will not be recreated even though a new grammar file was provided.

Extracting the 1-best hypothesis should be possible without recomputing features

Currently, extracting even the 1-best hypothesis also computes the features associated with it, which triggers a recomputation of all the features along all edges of the Viterbi derivation. But if you just want the best hypothesis or model scores, and not the feature breakdown, you shouldn't have to recompute all of these (which takes a little bit of time). It should be possible to do this (and it would help with timing tests).

Documentation for every new feature

Add lmplz to pipeline

Pipeline should use left-state minimization by default

escape brackets (& parens?)

Joshua doesn't properly escape brackets (and possibly parens). I ran into this while trying to true-case some English text. This was on test4. Running with v5.0-rc3, the JOSHUA environment variable set:

$ ./run-joshua.sh -server-port 5995 -threads 1
#assume $FILE is set to a local file
#grab the first "real" line of it
$ head -n3 $FILE | tail -n1 | nc test4 5995
Translating sentence #0 [thread 9]: 'SAN SALVADOR, 3 JAN 90 -- [REPORT] [ARMED FORCES PRESS COMMITTEE, COPREFA] [TEXT] THE ARCE BATTALION COMMAND HAS REPORTED THAT ABOUT 50 PEASANTS OF VARIOUS AGES HAVE BEEN KIDNAPPED BY TERRORISTS OF THE FARABUNDO MARTI NATIONAL LIBERATION FRONT [FMLN] IN SAN MIGUEL DEPARTMENT. ACCORDING TO THAT GARRISON, THE MASS KIDNAPPING TOOK PLACE ON 30 DECEMBER IN SAN LUIS DE LA REINA. THE SOURCE ADDED THAT THE TERRORISTS FORCED THE INDIVIDUALS, WHO WERE TAKEN TO AN UNKNOWN LOCATION, OUT OF THEIR RESIDENCES, PRESUMABLY TO INCORPORATE THEM AGAINST THEIR WILL INTO CLANDESTINE GROUPS.'
Exception in thread "Thread-4" java.lang.NullPointerException
        at joshua.decoder.ff.state_maintenance.NgramStateComputer.computeState(NgramStateComputer.java:56)
        at joshua.decoder.ff.state_maintenance.NgramStateComputer.computeState(NgramStateComputer.java:13)
        at joshua.decoder.chart_parser.ComputeNodeResult.(ComputeNodeResult.java:71)
        at joshua.decoder.chart_parser.Chart.completeSpan(Chart.java:330)
        at joshua.decoder.chart_parser.Chart.expand(Chart.java:486)
        at joshua.decoder.DecoderThread.translate(DecoderThread.java:107)
        at joshua.decoder.Decoder$DecoderThreadRunner.run(Decoder.java:248)

Implement collapsing of LM-equivalent states

Comparisons with Moses and cdec show that this makes a difference in search.

Thrax dies on blank lines

Thrax should just skip them. Also, the pipeline should remove blank lines from the training corpora.

Import MIRA

Import kbMIRA to remove runtime dependence on Moses.

default output with new -output-format unspecified

With the new devel -output-format, it looks like the nbest list is being output when I use an old configuration (pasted below). This is on a Mac laptop, freshly compiled commit 74ee78c. ~~The funny thing is that I get the expected backwords-compatible output on a linux server with pretty much the same configuration.~~

Here's the output of the command:
$ echo "hola\nmi amigo" | ./joshua-decoder -c ~/workspace/mt/grammar_es-en/joshua.config 2> /dev/null

0 ||| hello ||| WordPenalty=-1.303 lm_0=-3.033 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-0.344 tm_pt_2=-0.605 tm_pt_4=-1.000 tm_pt_5=-0.323 tm_pt_6=-0.577 ||| 0.840
0 ||| hi ||| WordPenalty=-1.303 lm_0=-3.467 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-1.277 tm_pt_2=-0.532 tm_pt_4=-1.000 tm_pt_5=-1.327 tm_pt_6=-0.401 ||| -0.638
0 ||| hell ||| WordPenalty=-1.303 lm_0=-5.532 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-6.034 tm_pt_2=-3.401 tm_pt_3=-0.135 tm_pt_4=-1.000 tm_pt_5=-5.991 tm_pt_6=-2.428 ||| -8.913
0 ||| ho ||| WordPenalty=-1.303 lm_0=-6.468 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-6.439 tm_pt_2=-1.872 tm_pt_3=-0.368 tm_pt_4=-1.000 tm_pt_5=-6.396 tm_pt_6=-1.099 ||| -10.240
0 ||| wave ||| WordPenalty=-1.303 lm_0=-7.044 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-6.439 tm_pt_2=-2.303 tm_pt_3=-0.368 tm_pt_4=-1.000 tm_pt_5=-6.396 tm_pt_6=-1.705 ||| -11.003
0 ||| hey ||| WordPenalty=-1.303 lm_0=-4.235 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_2=-6.356 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 tm_pt_6=-6.252 ||| -11.654
0 ||| helo ||| WordPenalty=-1.303 lm_0=-7.121 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_2=-0.693 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 tm_pt_6=-0.693 ||| -12.592
0 ||| hellos ||| WordPenalty=-1.303 lm_0=-7.455 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 ||| -12.684
0 ||| say ||| WordPenalty=-1.303 lm_0=-5.676 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_2=-8.549 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 tm_pt_6=-7.656 ||| -13.688
0 ||| "hello ||| WordPenalty=-1.303 lm_0=-8.774 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 ||| -14.004
0 ||| "hi ||| WordPenalty=-1.303 lm_0=-8.774 tm_glue_0=1.000 tm_pt_0=-1.000 tm_pt_1=-7.133 tm_pt_3=-1.000 tm_pt_4=-1.000 tm_pt_5=-7.089 ||| -14.004
0 ||| hola ||| OOVPenalty=1.000 WordPenalty=-1.303 lm_0=-7.455 tm_glue_0=1.000 ||| -120.434
1 ||| my friend ||| WordPenalty=-1.737 lm_0=-5.719 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-0.687 tm_pt_2=-1.281 tm_pt_4=-2.000 tm_pt_5=-0.370 tm_pt_6=-1.130 ||| -1.410
1 ||| my friends ||| WordPenalty=-1.737 lm_0=-6.290 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-3.279 tm_pt_2=-4.016 tm_pt_4=-2.000 tm_pt_5=-3.104 tm_pt_6=-3.942 ||| -5.930
1 ||| my friend of ||| WordPenalty=-2.171 lm_0=-7.739 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.294 tm_pt_2=-8.108 tm_pt_4=-3.000 tm_pt_5=-2.868 tm_pt_6=-1.201 ||| -6.403
1 ||| me friend ||| WordPenalty=-1.737 lm_0=-9.472 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.930 tm_pt_2=-2.754 tm_pt_4=-2.000 tm_pt_5=-2.071 tm_pt_6=-2.449 ||| -7.437
1 ||| i friend ||| WordPenalty=-1.737 lm_0=-8.579 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-2.459 tm_pt_2=-5.077 tm_pt_4=-2.000 tm_pt_5=-4.631 tm_pt_6=-6.886 ||| -10.292
1 ||| the friend ||| WordPenalty=-1.737 lm_0=-6.860 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-5.391 tm_pt_2=-7.966 tm_pt_4=-2.000 tm_pt_5=-5.365 tm_pt_6=-8.052 ||| -10.369
1 ||| a friend ||| WordPenalty=-1.737 lm_0=-5.762 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.509 tm_pt_2=-8.404 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-6.551 tm_pt_6=-8.326 ||| -10.680
1 ||| me friends ||| WordPenalty=-1.737 lm_0=-9.064 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-4.522 tm_pt_2=-5.489 tm_pt_4=-2.000 tm_pt_5=-4.805 tm_pt_6=-5.261 ||| -10.979
1 ||| my male friend ||| WordPenalty=-2.171 lm_0=-11.304 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.373 tm_pt_2=-3.806 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-5.119 tm_pt_6=-0.317 ||| -12.016
1 ||| me friend of ||| WordPenalty=-2.171 lm_0=-11.098 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-2.537 tm_pt_2=-9.582 tm_pt_4=-3.000 tm_pt_5=-4.569 tm_pt_6=-2.520 ||| -12.036
1 ||| myself friend ||| WordPenalty=-1.737 lm_0=-10.498 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-5.815 tm_pt_2=-3.238 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-6.383 tm_pt_6=-2.423 ||| -13.191
1 ||| mi friend ||| WordPenalty=-1.737 lm_0=-10.743 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.857 tm_pt_2=-1.194 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-6.646 tm_pt_6=-0.996 ||| -13.274
1 ||| i friends ||| WordPenalty=-1.737 lm_0=-8.172 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-5.051 tm_pt_2=-7.812 tm_pt_4=-2.000 tm_pt_5=-7.365 tm_pt_6=-9.698 ||| -13.834
1 ||| me of my friends ||| WordPenalty=-2.606 lm_0=-9.935 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-4.724 tm_pt_2=-19.797 tm_pt_3=-0.050 tm_pt_4=-4.000 tm_pt_5=-6.127 tm_pt_6=-3.341 ||| -13.926
1 ||| i'm friend ||| WordPenalty=-1.737 lm_0=-8.501 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.010 tm_pt_2=-6.608 tm_pt_3=-0.050 tm_pt_4=-2.000 tm_pt_5=-7.562 tm_pt_6=-7.900 ||| -14.049
1 ||| mine friend ||| WordPenalty=-1.737 lm_0=-9.724 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.039 tm_pt_2=-4.467 tm_pt_3=-0.018 tm_pt_4=-2.000 tm_pt_5=-7.339 tm_pt_6=-4.298 ||| -14.112
1 ||| it friend ||| WordPenalty=-1.737 lm_0=-8.811 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.509 tm_pt_2=-8.286 tm_pt_3=-0.007 tm_pt_4=-2.000 tm_pt_5=-7.157 tm_pt_6=-7.683 ||| -14.155
1 ||| my my friend ||| WordPenalty=-2.171 lm_0=-8.416 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-0.687 tm_pt_2=-1.715 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-8.948 tm_pt_6=-3.452 ||| -14.185
1 ||| the friends ||| WordPenalty=-1.737 lm_0=-6.746 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.982 tm_pt_2=-10.701 tm_pt_4=-2.000 tm_pt_5=-8.099 tm_pt_6=-10.864 ||| -14.206
1 ||| my opinion friend ||| WordPenalty=-2.171 lm_0=-11.211 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.379 tm_pt_2=-4.891 tm_pt_3=-0.050 tm_pt_4=-3.000 tm_pt_5=-7.562 tm_pt_6=-3.617 ||| -14.474
1 ||| me . ||| WordPenalty=-1.737 lm_0=-5.454 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.517 tm_pt_2=-12.971 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-7.514 tm_pt_6=-12.640 ||| -14.841
1 ||| mu friend ||| WordPenalty=-1.737 lm_0=-11.210 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.550 tm_pt_2=-1.827 tm_pt_3=-0.050 tm_pt_4=-2.000 tm_pt_5=-7.562 tm_pt_6=-0.813 ||| -14.848
1 ||| i friend of ||| WordPenalty=-2.171 lm_0=-10.205 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-3.066 tm_pt_2=-11.905 tm_pt_4=-3.000 tm_pt_5=-7.129 tm_pt_6=-6.957 ||| -14.891
1 ||| me male friend ||| WordPenalty=-2.171 lm_0=-12.043 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-2.616 tm_pt_2=-5.280 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-6.820 tm_pt_6=-1.636 ||| -15.030
1 ||| a friend of ||| WordPenalty=-2.171 lm_0=-7.295 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.116 tm_pt_2=-15.231 tm_pt_3=-0.000 tm_pt_4=-3.000 tm_pt_5=-9.049 tm_pt_6=-8.397 ||| -15.187
1 ||| her friend ||| WordPenalty=-1.737 lm_0=-8.107 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.955 tm_pt_2=-7.559 tm_pt_3=-0.368 tm_pt_4=-2.000 tm_pt_5=-8.255 tm_pt_6=-7.307 ||| -15.214
1 ||| your friend ||| WordPenalty=-1.737 lm_0=-6.406 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.732 tm_pt_2=-7.461 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-8.785 ||| -15.612
1 ||| i've friend ||| WordPenalty=-1.737 lm_0=-9.304 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.451 tm_pt_2=-5.864 tm_pt_3=-0.368 tm_pt_4=-2.000 tm_pt_5=-8.255 tm_pt_6=-7.240 ||| -15.927
1 ||| , friend ||| WordPenalty=-1.737 lm_0=-9.009 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.509 tm_pt_2=-10.431 tm_pt_3=-0.135 tm_pt_4=-2.000 tm_pt_5=-7.850 tm_pt_6=-11.907 ||| -16.437
1 ||| his friend ||| WordPenalty=-1.737 lm_0=-7.442 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.649 tm_pt_2=-7.481 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-7.603 ||| -16.545
1 ||| myself friends ||| WordPenalty=-1.737 lm_0=-10.091 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.407 tm_pt_2=-5.973 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-9.118 tm_pt_6=-5.235 ||| -16.734
1 ||| a friends ||| WordPenalty=-1.737 lm_0=-7.912 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.100 tm_pt_2=-11.139 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-9.285 tm_pt_6=-11.138 ||| -16.781
1 ||| mi friends ||| WordPenalty=-1.737 lm_0=-10.336 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.448 tm_pt_2=-3.929 tm_pt_3=-0.000 tm_pt_4=-2.000 tm_pt_5=-9.380 tm_pt_6=-3.808 ||| -16.816
1 ||| me boyfriend ||| WordPenalty=-1.737 lm_0=-9.451 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.517 tm_pt_2=-7.477 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-7.514 tm_pt_6=-6.904 ||| -16.874
1 ||| i'm friends ||| WordPenalty=-1.737 lm_0=-7.792 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.601 tm_pt_2=-9.343 tm_pt_3=-0.050 tm_pt_4=-2.000 tm_pt_5=-10.297 tm_pt_6=-10.712 ||| -17.291
1 ||| their friend ||| WordPenalty=-1.737 lm_0=-7.942 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-8.505 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-8.054 ||| -17.390
1 ||| our friend ||| WordPenalty=-1.737 lm_0=-8.449 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-7.387 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-6.865 ||| -17.493
1 ||| mine friends ||| WordPenalty=-1.737 lm_0=-9.317 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.631 tm_pt_2=-7.202 tm_pt_3=-0.018 tm_pt_4=-2.000 tm_pt_5=-10.073 tm_pt_6=-7.110 ||| -17.655
1 ||| me boy friend ||| WordPenalty=-2.171 lm_0=-12.396 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-2.620 tm_pt_2=-8.670 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-7.514 tm_pt_6=-1.636 ||| -17.661
1 ||| like friend ||| WordPenalty=-1.737 lm_0=-8.686 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-5.430 tm_pt_2=-6.718 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-9.991 ||| -17.679
1 ||| it friends ||| WordPenalty=-1.737 lm_0=-8.403 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.100 tm_pt_2=-11.021 tm_pt_3=-0.007 tm_pt_4=-2.000 tm_pt_5=-9.891 tm_pt_6=-10.495 ||| -17.698
1 ||| have a friend ||| WordPenalty=-2.171 lm_0=-7.988 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.145 tm_pt_2=-18.115 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-8.948 tm_pt_6=-7.581 ||| -17.864
1 ||| mi friend of ||| WordPenalty=-2.171 lm_0=-12.369 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.464 tm_pt_2=-8.021 tm_pt_3=-0.000 tm_pt_4=-3.000 tm_pt_5=-9.144 tm_pt_6=-1.067 ||| -17.873
1 ||| m friend ||| WordPenalty=-1.737 lm_0=-9.726 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.649 tm_pt_2=-4.685 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-5.117 ||| -17.922
1 ||| the male friend ||| WordPenalty=-2.171 lm_0=-9.425 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.076 tm_pt_2=-10.491 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-10.114 tm_pt_6=-7.239 ||| -17.957
1 ||| mine's friend ||| WordPenalty=-1.737 lm_0=-10.777 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-2.232 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-1.506 ||| -17.983
1 ||| my opinion friends ||| WordPenalty=-2.171 lm_0=-10.804 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-3.971 tm_pt_2=-7.626 tm_pt_3=-0.050 tm_pt_4=-3.000 tm_pt_5=-10.297 tm_pt_6=-6.429 ||| -18.016
1 ||| love friend ||| WordPenalty=-1.737 lm_0=-9.639 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.039 tm_pt_2=-5.502 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-7.092 ||| -18.108
1 ||| without friend ||| WordPenalty=-1.737 lm_0=-9.010 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-7.459 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-7.232 ||| -18.142
1 ||| a male friend ||| WordPenalty=-2.171 lm_0=-8.235 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.194 tm_pt_2=-10.929 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-11.300 tm_pt_6=-7.513 ||| -18.176
1 ||| yea friend ||| WordPenalty=-1.737 lm_0=-9.228 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-6.928 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-6.795 ||| -18.196
1 ||| dear friend ||| WordPenalty=-1.737 lm_0=-10.435 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.732 tm_pt_2=-3.643 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-4.784 ||| -18.272
1 ||| i of my friends ||| WordPenalty=-2.606 lm_0=-10.620 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-5.253 tm_pt_2=-22.120 tm_pt_3=-0.050 tm_pt_4=-4.000 tm_pt_5=-8.687 tm_pt_6=-7.777 ||| -18.358
1 ||| mu friends ||| WordPenalty=-1.737 lm_0=-10.802 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-10.142 tm_pt_2=-4.562 tm_pt_3=-0.050 tm_pt_4=-2.000 tm_pt_5=-10.297 tm_pt_6=-3.625 ||| -18.390
1 ||| my my friends ||| WordPenalty=-2.171 lm_0=-8.764 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-3.279 tm_pt_2=-4.450 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-11.683 tm_pt_6=-6.264 ||| -18.483
1 ||| her friends ||| WordPenalty=-1.737 lm_0=-7.513 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-10.547 tm_pt_2=-10.294 tm_pt_3=-0.368 tm_pt_4=-2.000 tm_pt_5=-10.990 tm_pt_6=-10.119 ||| -18.569
1 ||| i . ||| WordPenalty=-1.737 lm_0=-5.479 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.047 tm_pt_2=-15.294 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-10.073 tm_pt_6=-17.077 ||| -18.613
1 ||| i'm friend of ||| WordPenalty=-2.171 lm_0=-10.127 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.617 tm_pt_2=-13.435 tm_pt_3=-0.050 tm_pt_4=-3.000 tm_pt_5=-10.060 tm_pt_6=-7.971 ||| -18.648
1 ||| from friend ||| WordPenalty=-1.737 lm_0=-8.725 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-9.950 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-9.604 ||| -18.698
1 ||| well friend ||| WordPenalty=-1.737 lm_0=-8.712 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.649 tm_pt_2=-9.760 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-10.319 ||| -18.701
1 ||| mine friend of ||| WordPenalty=-2.171 lm_0=-11.350 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.646 tm_pt_2=-11.295 tm_pt_3=-0.018 tm_pt_4=-3.000 tm_pt_5=-9.837 tm_pt_6=-4.369 ||| -18.711
1 ||| it friend of ||| WordPenalty=-2.171 lm_0=-10.437 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.116 tm_pt_2=-15.113 tm_pt_3=-0.007 tm_pt_4=-3.000 tm_pt_5=-9.655 tm_pt_6=-7.754 ||| -18.755
1 ||| and friend ||| WordPenalty=-1.737 lm_0=-8.126 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-11.719 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-11.564 ||| -18.756
1 ||| opinion friend ||| WordPenalty=-1.737 lm_0=-10.767 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.550 tm_pt_2=-4.457 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-5.811 ||| -18.904
1 ||| hon friend ||| WordPenalty=-1.737 lm_0=-12.113 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-0.846 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-0.813 ||| -18.986
1 ||| yours friend ||| WordPenalty=-1.737 lm_0=-10.555 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-5.976 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-5.170 ||| -19.046
1 ||| my opinion friend of ||| WordPenalty=-2.606 lm_0=-12.837 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.986 tm_pt_2=-11.719 tm_pt_3=-0.050 tm_pt_4=-4.000 tm_pt_5=-10.060 tm_pt_6=-3.688 ||| -19.073
1 ||| ma friend ||| WordPenalty=-1.737 lm_0=-11.444 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-3.411 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-2.893 ||| -19.104
1 ||| my my friend of ||| WordPenalty=-2.606 lm_0=-10.436 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-1.294 tm_pt_2=-8.543 tm_pt_3=-1.000 tm_pt_4=-4.000 tm_pt_5=-11.446 tm_pt_6=-3.523 ||| -19.178
1 ||| don't friend ||| WordPenalty=-1.737 lm_0=-9.153 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-10.385 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-9.672 ||| -19.198
1 ||| think friend ||| WordPenalty=-1.737 lm_0=-9.861 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.955 tm_pt_2=-8.330 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-9.166 ||| -19.300
1 ||| mu friend of ||| WordPenalty=-2.171 lm_0=-12.835 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.157 tm_pt_2=-8.654 tm_pt_3=-0.050 tm_pt_4=-3.000 tm_pt_5=-10.060 tm_pt_6=-0.884 ||| -19.447
1 ||| i've friends ||| WordPenalty=-1.737 lm_0=-8.896 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.043 tm_pt_2=-8.599 tm_pt_3=-0.368 tm_pt_4=-2.000 tm_pt_5=-10.990 tm_pt_6=-10.052 ||| -19.469
1 ||| , friends ||| WordPenalty=-1.737 lm_0=-8.100 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.100 tm_pt_2=-13.166 tm_pt_3=-0.135 tm_pt_4=-2.000 tm_pt_5=-10.584 tm_pt_6=-14.719 ||| -19.479
1 ||| own friend ||| WordPenalty=-1.737 lm_0=-10.671 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-7.117 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-6.389 ||| -19.575
1 ||| the boyfriend ||| WordPenalty=-1.737 lm_0=-6.729 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-10.978 tm_pt_2=-12.688 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-10.807 tm_pt_6=-12.507 ||| -19.697
1 ||| i boyfriend ||| WordPenalty=-1.737 lm_0=-8.559 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.047 tm_pt_2=-9.800 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-10.073 tm_pt_6=-11.340 ||| -19.729
1 ||| her friend of ||| WordPenalty=-2.171 lm_0=-9.733 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-8.563 tm_pt_2=-14.387 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-10.753 tm_pt_6=-7.378 ||| -19.813
1 ||| the boy friend ||| WordPenalty=-2.171 lm_0=-9.011 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-6.080 tm_pt_2=-13.882 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-10.807 tm_pt_6=-7.239 ||| -19.821
1 ||| phone friend ||| WordPenalty=-1.737 lm_0=-10.669 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-7.816 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-7.217 ||| -19.845
1 ||| his friends ||| WordPenalty=-1.737 lm_0=-6.812 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-11.240 tm_pt_2=-10.216 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-11.683 tm_pt_6=-10.415 ||| -19.865
1 ||| your friends ||| WordPenalty=-1.737 lm_0=-6.731 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-10.324 tm_pt_2=-10.196 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-11.683 tm_pt_6=-11.597 ||| -19.886
1 ||| the . ||| WordPenalty=-1.737 lm_0=-5.013 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-10.978 tm_pt_2=-18.182 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-10.807 tm_pt_6=-18.243 ||| -19.945
1 ||| part friend ||| WordPenalty=-1.737 lm_0=-10.871 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.342 tm_pt_2=-7.577 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-8.948 tm_pt_6=-7.354 ||| -20.045
1 ||| their friends ||| WordPenalty=-1.737 lm_0=-6.744 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-11.933 tm_pt_2=-11.240 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-11.683 tm_pt_6=-10.866 ||| -20.142
1 ||| our friends ||| WordPenalty=-1.737 lm_0=-7.234 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-11.933 tm_pt_2=-10.122 tm_pt_3=-1.000 tm_pt_4=-2.000 tm_pt_5=-11.683 tm_pt_6=-9.677 ||| -20.227
1 ||| i boy friend ||| WordPenalty=-2.171 lm_0=-11.503 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-3.149 tm_pt_2=-10.993 tm_pt_3=-1.000 tm_pt_4=-3.000 tm_pt_5=-10.073 tm_pt_6=-6.073 ||| -20.516
1 ||| i've friend of ||| WordPenalty=-2.171 lm_0=-10.930 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.058 tm_pt_2=-12.691 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-10.753 tm_pt_6=-7.311 ||| -20.526
1 ||| mi male friend ||| WordPenalty=-2.171 lm_0=-13.314 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.543 tm_pt_2=-3.720 tm_pt_3=-0.368 tm_pt_4=-3.000 tm_pt_5=-11.395 tm_pt_6=-0.182 ||| -20.867
1 ||| a of my friends ||| WordPenalty=-2.606 lm_0=-10.086 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-9.303 tm_pt_2=-25.446 tm_pt_3=-0.050 tm_pt_4=-4.000 tm_pt_5=-10.607 tm_pt_6=-9.217 ||| -21.030
1 ||| , friend of ||| WordPenalty=-2.171 lm_0=-10.635 tm_glue_0=2.000 tm_pt_0=-2.000 tm_pt_1=-7.116 tm_pt_2=-17.259 tm_pt_3=-0.135 tm_pt_4=-3.000 tm_pt_5=-10.348 tm_pt_6=-11.978 ||| -21.037

joshua.config:

# This file is a template for the Joshua pipeline; variables enclosed
# in <angle-brackets> are substituted by the pipeline script as
# appropriate.  This file also serves to document Joshua's many
# parameters.

# This is the grammar file and the grammar file format.  The grammar
# file can be compressed with gzip.  Supported formats are "thrax" and
# "samt".  The latter denotes the format used in Zollmann and
# Venugopal's SAMT decoder (http://www.cs.cmu.edu/~zollmann/samt/).

#tm-file = /home/hltcoe/mpost/expts/scale12/runs/ldc_transcript/4/data/test/grammar.filtered.gz
tm-file = /Users/orluke/workspace/mt/grammar_es-en/grammar.filtered.gz
tm-format = thrax

# The span limit is the maximum input span permitted for the
# application of grammar rules found in the grammar file.

span-limit = 12

# This symbol is used over unknown words in the source language

default-non-terminal = X

# This is the goal nonterminal, used to determine when a complete
# parse is found.  It should correspond to the root-level rules in the
# glue grammar.

goal-symbol = GOAL

# DEPRECATED
## If set to true (true is denoted with a case-insensitive form of the
## word "true"), the decoder will look for sentence-specific grammars
## by appending the sentence ID to the value of {tm_file}.  e.g., if
## tm_file=grammar, it will look for grammar.19 when decoding sentence
## 19.  Sentence numbers are placed before compression suffixes, so if
## tm_file=grammar.gz, Joshua will look for grammar.19.gz.
#
#use-sent-specific-tm = false

# The glue grammar contains glue rules.  Its main distinction from the
# regular grammar is that the span limit does not apply to it.  In the
# future, the explicit distinction between these grammars will be
# dropped in favor of specifying an arbitrary number of grammars with
# various per-grammar settings.

#glue-file = /home/hltcoe/mpost/expts/scale12/runs/ldc_transcript/4/data/test/grammar.glue
glue-file = /Users/orluke/workspace/joshua/data/glue-grammar
glue-format = thrax
glue-owner = glue

# Language model config.

# Multiple language models are supported.  For each language model,
# create a line in the following format, 
#
# lm = TYPE 5 false false 100 FILE
#
# where the six fields correspond to the following values:
# - LM type: one of "kenlm", "berkeleylm", "javalm" (not recommended), or "none"
# - LM order: the N of the N-gram language model
# - whether to use left equivalent state (currently not supported)
# - whether to use right equivalent state (currently not supported)
# - the ceiling cost of any n-gram (currently ignored)
# - LM file: the location of the language model file
# You also need to add a weight for each language model below.

lm = kenlm 5 false false 100 /Users/orluke/workspace/mt/grammar_es-en/lm.gz

# The suffix _OOV is appended to unknown source-language words if this
# is set to true.

mark-oovs = false

# The pop-limit for decoding.  This determines how many hypotheses are
# considered over each span of the input.

pop-limit = 100

# How many hypotheses to output

top-n = 300

# Whether those hypotheses should be distinct strings

use-unique-nbest = true

# The following two options control whether to output (a) the
# derivation tree and (b) word alignment information (for each
# hypothesis on the n-best list).  Note that setting these options to
# 'true' will currently break MERT, so don't use these in the
# pipeline.

use-tree-nbest = false
include-align-index = false

## Model weights #####################################################

# For each langage model line listed above, create a weight in the
# following format: the keyword "lm", a 0-based index, and the weight.
# lm INDEX WEIGHT

lm 0 1.0

# The phrasal weights correspond to weights stored with each of the
# grammar rules.  The format is
#
#   phrasemodel owner COLUMN WEIGHT
#
# where COLUMN denotes the 0-based order of the parameter in the
# grammar file and WEIGHT is the corresponding weight.  In the future,
# we plan to add a sparse feature representation which will simplify
# this.

phrasemodel pt 0 0.4158868281310451
phrasemodel pt 1 0.1627066162702699
phrasemodel pt 2 0.13203392808433392
phrasemodel pt 3 1.8679839704500125
phrasemodel pt 4 1.4648798819368178
phrasemodel pt 5 0.9362930322587838
phrasemodel pt 6 0.21589422360841612

phrasemodel glue 0 0.7789235291076393

# The wordpenalty feature counts the number of words in each hypothesis.

wordpenalty -4.25070502440208

# This feature counts the number of unknown words in the hypothesis.

oovpenalty -119.29654861114149

# This feature weights paths through an input lattice.  It is only activated
# when decoding lattices.

"ant thrax" doesn't build Thrax

It complains about a bunch of errors. This workaround works suggesting it should be easy to fix:

cd thrax
ant 
cd ..

Don't recompute features for functions with only a single feature value

Currently, k-best extraction recomputes all feature values for every (reached) hyperedge since we store only the dot product during decoding. This is really time consuming. However, for feature functions that contribute only a single feature, we already have the value and need not recompute it. This has the advantage of working for the most expensive feature function, the LM, but would also probably be useful for some of the others that have to walk the rule's RHS.

giza++ build error

After building most of GIZA++ succeeds, the following fails:

giza:
     [exec] /Users/orluke/workspace/joshua-4.0
     [exec] make -C GIZA++-v2
     [exec] /Users/orluke/workspace/joshua-4.0
     [exec] make -C mkcls-v2
     [exec] c++ -Wall -W -DNDEBUG -O3 -Wno-deprecated -c KategProblemTest.cpp -o KategProblemTest.o
     [exec] make[1]: Nothing to be done for `opt'.
     [exec] In file included from KategProblemTest.cpp:27:
     [exec] In file included from ./KategProblemTest.h:30:
     [exec] In file included from ./KategProblem.h:34:
     [exec] In file included from ./Problem.h:34:
     [exec] In file included from ./StatVar.h:33:
     [exec] ./myleda.h:180:4: error: use of undeclared identifier 'insert'
     [exec]           insert(typename MY_HASH_BASE::value_type(a,init));
     [exec]           ^
     [exec]           this->
     [exec] KategProblemTest.cpp:97:14: note: in instantiation of member function 'leda_h_array<std::basic_string<char>, int>::operator[]' requested here
     [exec]   translation["1"]=1;
     [exec]              ^
     [exec] /usr/include/c++/4.2.1/ext/hash_map:201:7: note: must qualify identifier to find this declaration in dependent base class
     [exec]       insert(const value_type& __obj)
     [exec]       ^
     [exec] /usr/include/c++/4.2.1/ext/hash_map:206:9: note: must qualify identifier to find this declaration in dependent base class
     [exec]         insert(_InputIterator __f, _InputIterator __l)
     [exec]         ^
     [exec] 1 error generated.
     [exec] make[1]: *** [KategProblemTest.o] Error 1
     [exec] make: *** [mkcls-v2] Error 2
     [exec] Result: 2
     [exec] make: Nothing to be done for `all'.

Generalize ff computation

ComputeFeatures an computeScore could just take in an accumulator and then be one function.

(Due to Kenneth.)

Add Shen et al. (2008) dependency LM

Derivations in the output break MERT

Including derivations or alignments in the output breaks MERT.

Add KenLM estimation code to the --lm-gen options

multiple language models

Does Joshua support multiple language models? It doesn't seem to, now, but it would be nice if you could include multiple LMs to be weighted by MERT.

Grammar representation

A more efficient grammar representation is needed.

OS X installation depends on coreutils to run thrax test

the gstat command from coreutils is not installed in Darwin by default. One must resolve that dependency via Homebrew, Macports, etc.
The test/thrax/test.sh test will fail on an OS X system that does not have coreutils installed. We should either change the test so that it does not require coreutils in Darwin or make it clear in the (developer) installation/setup instructions that coreutils are required for this test, check for coreutils when running the thrax test, and output a helpful message instructing the developer to go install coreutils if gstat is not found.

ZMERT shouldn't have to stop and restart the decoder

Loading the models is an expensive step, and there's no reason that MERT runs have to load them multiple times. It should just load the decoder once and reuse the running decoder across iterations. This could be accomplished with a special input command to Joshua that changes the model weights and resets the sentence counter.

lattice test fails

currently, this is the output of an ant test run:

test:
     [exec] Running test in ./bn-en/hiero...PASSED
     [exec] Running test in ./bn-en/packed...PASSED
     [exec] Running test in ./bn-en/samt...PASSED
     [exec] Running test in ./lattice...
     [exec] FAILED

Verbosity levels

Joshua should support verbosity levels with a command-line switch, so it's easy to shut it up with something like -v 0 or -q.

Problems with the grammar packer slice size heuristic

When packing an unfiltered Europarl Hiero grammar (~ 8 GB), the packer failed with the following error:

INFO: Allocated slice: /home/hltcoe/mpost/expts/wmt13/runs/hiero/es-en/1/es-en-hiero/grammar.packed/slice_00048.source
Jul 18, 2013 10:20:25 PM joshua.tools.GrammarPacker$PackingFileTuple <init>
INFO: Allocated slice: /home/hltcoe/mpost/expts/wmt13/runs/hiero/es-en/1/es-en-hiero/grammar.packed/slice_00049.source
Jul 18, 2013 10:22:57 PM joshua.tools.GrammarPacker$PackingFileTuple <init>
INFO: Allocated slice: /home/hltcoe/mpost/expts/wmt13/runs/hiero/es-en/1/es-en-hiero/grammar.packed/slice_00050.source
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
        at joshua.tools.GrammarPacker$PackingBuffer.reallocate(GrammarPacker.java:714)
        at joshua.tools.GrammarPacker$FeatureBuffer.add(GrammarPacker.java:820)
        at joshua.tools.GrammarPacker.binarize(GrammarPacker.java:314)
        at joshua.tools.GrammarPacker.pack(GrammarPacker.java:154)
        at joshua.tools.GrammarPacker.main(GrammarPacker.java:554)

MERT hangs with very large development sets

From an email from Phu Le:

If you want to have a look, here it is:

http://ltvp.net/zmert.tar.gz (zmert folder when it hangs)

To reproduce, you can just rerun zmert:
java -Xms1G -Xmx3G -cp meteor.jar:zmert.jar joshua.zmert.ZMERT -maxMem 1000 zmert_config.txt

3 input files (dev.matched, dev.reference, decoder_config_base) I fed to MEMT:
http://ltvp.net/data.tar.gz

Thanks in advance. Give me a starting point if you have any idea. I'll try to fix it myself if possible.

Regards,
LTVP

skipping output for blank input lines

This report is against commit 829e2e7.

After starting the decoder server, this is the output for the following command

$ echo -e "1\n2\n\n3\n" | nc localhost 8182
0 ||| 1_OOV |||  ||| 0.000
1 ||| 2_OOV |||  ||| 0.000
3 ||| 3_OOV |||  ||| 0.000

The expected output is

0 ||| 1_OOV |||  ||| 0.000
1 ||| 2_OOV |||  ||| 0.000
2 |||  |||  ||| 0.000
3 ||| 3_OOV |||  ||| 0.000

Move to string-less feature vectors

Drop silly feature string construction & re-parsing in packed grammars. Could consider moving to faster primitive int-to-float maps for feature vector.

Set weights from command line

It would be convenient to be able to set the weights from the command line.

GIZA should support parallelization

https://groups.google.com/forum/#!topic/joshua_support/bFXaCmHOPAg

add jhclark/bigfatlm

It would be nice to leverage more Hadoop tools in the pipeline.

DPState interface should compute integer hashes instead of string ones

pipeline.pl: test-decode-$run step should have dependency on input test set.

When a bad input test set causes Joshua Decoder to crash (in my case, a 189,000 line caused an out of memory exception https://groups.google.com/forum/#!topic/joshua_support/-2PykKqbnU4) the pipeline script does not die as it should, which means one must manually terminate that step.

This results in the incomplete test.output.nbest to be cached in cachepipe. And a few subsequent steps(including extract-onebest) of the pipeline will inadvertently run as well before being manually terminated, since they will finish almost immeadiately with a small test.output.nbest.

Thus on subsequent retries, when the bad input test set is fixed, the pipeline does not detect this change and attempts to use the BLEU scorer with the incomplete candidate set (test.output.1best). And since the line count that JoshuaEval uses only counts the number of lines in the ref file (inputs/test.target), we get a NullPointerException when it reads to end of file in the candidate file and calls normalize on a null line at joshua.util.JoshuaEval:120.

The title suggests one possible fix, although I'm not sure what the best solution is.

Add option to print chart cell counts post-decoding

Add a debug output with number of items per chart cell, like Moses, for neat debugging information.

BLEU scorer parity with Moses

Our BLEU scorer returns different results from the standard Moses one. Find out if this is a bug or different assumptions.

Close file handles after loading packed grammar

Packed grammars don't release file handles after loading is completed.

Quasi-synchronous grammar

In the more long term, I think it would be worth looking into quasi-synchronous grammar support in the decoder.

Something is broken with multithreading

Running with lots of threads (e.g., 16) does not peg the CPU at 1600%. In fact, it rarely seems to go over 1000%. Also, the running time as a function of the number of threads quickly stops decreasing, with about 4 threads being the saturation point. This is for a newstest de-en model built on Europarl + Common Crawl, using the hash-based grammar, with a bunch of language models:

Looking a little deeper, it seems that the time taken to translate each thread grows as threads are added. Here is a graph of the sum of the times fo translating all the sentence (no model loading or anything included):

After 4 or so, it's basically linear in the number of threads!

Glue grammar creation in pipeline is broken

It currently operates under the assumption that a glue grammar can't be created from a packed grammar, which is no longer true.

Sparse feature representation

Grammars should be able to include sparse (instead of fixed) feature vectors.

Clean up examples

Parallelize MBR computation

MBR should be multithreaded. This would be easy to add following the model used in the InputManager.

merge devel into master when it's nice and stable (5.0 release?)

pipeline test failing -- illegal option -- c

test:
[exec] Running test 'test-berkeleylm.sh' in ./bn-en/hiero...PASSED
[exec] Running test 'test.sh' in ./bn-en/hiero...PASSED
[exec] Running test 'test.sh' in ./bn-en/packed...PASSED
[exec] Running test 'test.sh' in ./bn-en/samt...PASSED
[exec] Running test 'test.sh' in ./decoder/constrained...PASSED
[exec] Running test 'test.sh' in ./decoder/denormalization...PASSED
[exec] Running test 'test.sh' in ./decoder/empty-test...PASSED
[exec] Running test 'test.sh' in ./decoder/n-ary...PASSED
[exec] Running test 'test.sh' in ./decoder/regexp-grammar...PASSED
[exec] Running test 'test.sh' in ./decoder/regexp-grammar-both-rule-types...PASSED
[exec] Running test 'test.sh' in ./lattice...PASSED
[exec] Running test 'test.sh' in ./lattice-short...PASSED
[exec] Running test 'test.sh' in ./packed-grammar...PASSED
[exec] Running test 'test.sh' in ./parser...PASSED
[exec] Running test 'test-ghkm.sh' in ./pipeline...stat: illegal option -- c
[exec] usage: stat [-FlLnqrsx] [-f format] [-t timefmt] [file ...]
[exec] FAILED
[exec] Running test 'test.sh' in ./scripts/normalization...PASSED
[exec] Running test 'test.sh' in ./server...PASSED
[exec] Running test 'test.sh' in ./thrax/extraction...PASSED
[exec] Running test 'test-exact.sh' in ./thrax/filtering...PASSED
[exec] Running test 'test-fast.sh' in ./thrax/filtering...
[exec] PASSED

test:

BUILD FAILED
/Users/orluke/workspace/joshua-4.0/build.xml:304: The directory /Users/orluke/workspace/joshua-4.0/test does not exist

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.