
grammatical / neural-naacl2018

License: MIT License

Languages: Python 62.97%, Shell 16.53%, Perl 13.16%, Makefile 7.34%

neural-naacl2018's Introduction

Approaching Neural GEC as a Low-Resource MT Task

This repository contains neural models and instructions for reproducing the results of our neural grammatical error correction systems from M. Junczys-Dowmunt, R. Grundkiewicz, S. Guha, K. Heafield: Approaching Neural Grammatical Error Correction as a Low-Resource Machine Translation Task, NAACL 2018.

Citation

@InProceedings{neural-naacl2018,
    author    = {Junczys-Dowmunt, Marcin  and  Grundkiewicz, Roman  and  Guha,
                 Shubha  and  Heafield, Kenneth},
    title     = {Approaching Neural Grammatical Error Correction as a
                 Low-Resource Machine Translation Task},
    booktitle = {Proceedings of the 2018 Conference of the North American
                 Chapter of the Association for Computational Linguistics:
                 Human Language Technologies, Volume 1 (Long Papers)},
    month     = {June},
    year      = {2018},
    address   = {New Orleans, Louisiana},
    publisher = {Association for Computational Linguistics},
    pages     = {595--606},
    url       = {http://www.aclweb.org/anthology/N18-1055}
}

Models

We have prepared the top neural GEC system described in the paper: an ensemble of four transformer models and a neural language model. Each translation model is pre-trained with a language model and trained with an edit-weighted MLE objective on NUCLE and Lang-8 data.

The systems were created with training settings very similar to those described in the paper. Small performance differences occur mainly due to the use of a more recent version of the Marian toolkit, which comes with new features. The most noticeable change is the replacement of averaged model checkpoints with exponential smoothing. Differences in less significant training hyperparameters may also exist. Other settings, including the data and data preprocessing, remain exactly the same as in the original paper.
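
For intuition, the edit-weighted MLE objective mentioned above can be sketched as follows. This is an illustration only, not the Marian implementation used in the paper; the function name and the weight value are hypothetical.

import torch
import torch.nn.functional as F

EDIT_WEIGHT = 3.0  # hypothetical weight for edited target tokens

def edit_weighted_nll(log_probs, target, edit_mask):
    """log_probs: (num_tokens, vocab) log-softmax outputs,
    target: (num_tokens,) gold token ids,
    edit_mask: (num_tokens,) bool, True where the target token is an edit."""
    per_token = F.nll_loss(log_probs, target, reduction="none")
    weights = torch.where(edit_mask,
                          torch.full_like(per_token, EDIT_WEIGHT),
                          torch.ones_like(per_token))
    # Edited tokens contribute EDIT_WEIGHT times more to the loss.
    return (weights * per_token).sum() / weights.sum()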

Content

  • models - pretrained neural models, instructions on how to use them, and scripts to evaluate the system on the CoNLL and JFLEG data sets
  • outputs - corrected outputs and evaluation scores for the CoNLL and JFLEG data sets generated by the prepared GEC system
  • training - complete training pipeline reproducing the prepared neural models

If you have any questions, please open an issue or send me (Roman) an email.

neural-naacl2018's People

Contributors

emjotde, kellymarchisio, snukky

neural-naacl2018's Issues

BPE model

Hi,
I want to re-train the BPE model, so I am wondering which dataset you used to train it, and whether you used only the source side, the target side, or both. Also, what value of bpe_operations did you use? Thanks!
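
(Not an authoritative answer, just a note for other readers: the repository vendors subword-nmt under models/tools/subword-nmt, and a joint BPE model is commonly learned on the concatenation of the source and target sides. A minimal sketch, with hypothetical file names and operation count:)

from subword_nmt.learn_bpe import learn_bpe

# train.both = erroneous (source) side concatenated with corrected (target) side
with open("train.both", encoding="utf-8") as corpus, \
        open("bpe.codes", "w", encoding="utf-8") as codes:
    learn_bpe(corpus, codes, num_symbols=30000)  # num_symbols ~ bpe_operations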

Replication of the training process

Hello, I tried to re-train the model twice, following the README file step by step, but the M2 score was only 54.41 on the CoNLL-14 test set. I found that the recall is similar to the downloaded model's, but my re-trained model has lower precision (for both the ensemble and the single transformer model). Do you know what causes this issue?

Besides, I noticed that you mention pre-trained embeddings and an edit-weighted loss in your paper, but I have not seen any parameters related to these techniques in the transformer.sh script. Where are these techniques used?

No License File

We intend to use this pipeline to re-create the results from the source research paper. However, there is no LICENSE file to clarify usage of the code in the repository.

Could you please add a LICENSE file to this repository, to make fair use of the code clear?

Error evaluating test datasets

I'm trying to evaluate the test datasets with the pre-trained models, as instructed in models/README.md.

However, when running bash evaluate.sh 0, I get an AssertionError saying that candidates, sources, and gold_edits have different lengths. I found that JFLEG works fine, but CoNLL 2013 and 2014 do not.

Does anyone have any advice? I'm currently using CUDA 9.0, Python 3.6, and PyTorch 0.4.0. Here is the error I currently get:

CoNLL Test 2013
Traceback (most recent call last):
  File "./tools/remove_repetitions.py", line 62, in <module>
    main()
  File "./tools/remove_repetitions.py", line 16, in main
    ngrams = get_ngrams(words, args.max_size, args.min_freq)
  File "./tools/remove_repetitions.py", line 47, in get_ngrams
    for ngram in counts.keys():
RuntimeError: dictionary changed size during iteration
Traceback (most recent call last):
  File "/mnt/gwena/Gwena/IncompleteIntentionClassifier/baseline/gec_models/neural-naacl2018/models/tools/subword-nmt/subword_nmt/apply_bpe.py", line 373, in <module>
    args.output.write(bpe.process_line(line))
BrokenPipeError: [Errno 32] Broken pipe
Exception ignored in: <_io.TextIOWrapper name='<stdout>' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
Traceback (most recent call last):
  File "./tools/m2scorer/scripts/m2scorer.py", line 137, in <module>
    p, r, f1 = levenshtein.batch_multi_pre_rec_f1(system_sentences, source_sentences, gold_edits, max_unchanged_words, beta, ignore_whitespace_casing, verbose, very_verbose)
  File "/mnt/gwena/Gwena/IncompleteIntentionClassifier/baseline/gec_models/neural-naacl2018/models/tools/m2scorer/scripts/levenshtein.py", line 107, in batch_multi_pre_rec_f1
    assert len(candidates) == len(sources) == len(gold_edits)
AssertionError
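
(A note on the first traceback: in Python 3, dict.keys() returns a live view of the dictionary, so deleting entries while iterating raises exactly this RuntimeError. A likely fix for the loop in remove_repetitions.py; the pruning condition is guessed from context:)

# Iterate over a snapshot of the keys so the dict can be mutated in the loop.
for ngram in list(counts.keys()):
    if counts[ngram] < min_freq:  # hypothetical pruning condition
        del counts[ngram]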

Question regarding oversampling

  • In the Low-Resource paper, you mention that NUCLE was oversampled 10 times for domain adaptation to the CoNLL-14 dataset.

  • I tried benchmarking the pre-trained models provided in this repo on the WI+LOCNESS test set.

  • The single model gave an F-score of 34.15, whereas the ensemble of 4 models + reranking gave an F-score of 53.27. The ensemble produces fewer false positives than the single model, leading to higher precision.

  • Metrics of the single model on the WI+LOCNESS test set
    (screenshot: lrgec_single)

  • Metrics of the ensemble on the WI+LOCNESS test set
    (screenshot: lrgec_ensemble)

  • Does oversampling the NUCLE data explain the drop in the single model's precision, from 69-70 on the CoNLL-14 test set to 31.3 on the WI+LOCNESS test set?

Thanks!
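
For reference, the 10x oversampling mentioned above is straightforward to reproduce. A minimal sketch with hypothetical file names, repeating the NUCLE portion ten times before appending Lang-8:

# Build an oversampled source-side training file (the target side is analogous).
with open("train.src", "w", encoding="utf-8") as out:
    for path, copies in [("nucle.src", 10), ("lang8.src", 1)]:
        with open(path, encoding="utf-8") as f:
            out.write(f.read() * copies)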

Training

I tried to follow the training documentation. When I run

cd data
make all
cd ..

it gives the following error:

python ../tools/spellcheck.py < ../tools/jfleg/dev/dev.src > jflegdev.err
Traceback (most recent call last):
  File "../tools/spellcheck.py", line 30, in <module>
    main()
  File "../tools/spellcheck.py", line 19, in main
    print(d.suggest(w)[0], end=" ")
IndexError: list index out of range
make: *** [Makefile:54: jflegdev.err] Error 1
make: *** Deleting file 'jflegdev.err'

I have the file dev.src in tools/jfleg/dev/.

I'm using Python 2 for this experiment, since the pipeline does not seem to support Python 3.
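
(A note on the IndexError: d.suggest(w) can return an empty list for some tokens, so indexing [0] blindly fails. A defensive rewrite of the failing line in spellcheck.py, assuming d is a pyenchant-style dictionary, falls back to the original word:)

suggestions = d.suggest(w)
print(suggestions[0] if suggestions else w, end=" ")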
