XLM

NEW: Added XLM-R model.

PyTorch original implementation of Cross-lingual Language Model Pretraining.



XLM supports multi-GPU and multi-node training, and contains code for:

  • Language model pretraining:
    • Causal Language Model (CLM)
    • Masked Language Model (MLM)
    • Translation Language Model (TLM)
  • GLUE fine-tuning
  • XNLI fine-tuning
  • Supervised / Unsupervised MT training:
    • Denoising auto-encoder
    • Parallel data training
    • Online back-translation

Installation

Install the Python package in editable mode with:

pip install -e .

Dependencies

  • Python 3
  • NumPy
  • PyTorch (currently tested on version 0.4 and 1.0)
  • fastBPE (generate and apply BPE codes)
  • Moses (scripts to clean and tokenize text only - no installation required)
  • Apex (for fp16 training)

I. Monolingual language model pretraining (BERT)

In what follows we explain how you can download and use our pretrained XLM (English-only) BERT model. Then we explain how you can train your own monolingual model, and how you can fine-tune it on the GLUE tasks.

Pretrained English model

We provide our pretrained XLM_en English model, trained with the MLM objective.

Languages Pretraining Model BPE codes Vocabulary
English MLM Model BPE codes Vocabulary

which obtains better performance than BERT on the GLUE benchmark, despite being trained on the same data:

Model Score CoLA SST2 MRPC STS-B QQP MNLI_m MNLI_mm QNLI RTE WNLI AX
BERT 80.5 60.5 94.9 89.3/85.4 87.6/86.5 72.1/89.3 86.7 85.9 92.7 70.1 65.1 39.6
XLM_en 82.8 62.9 95.6 90.7/87.1 88.8/88.2 73.2/89.8 89.1 88.5 94.0 76.0 71.9 44.7

If you want to play around with the model and its representations, just download the model and take a look at our ipython notebook demo.

Our XLM PyTorch English model is trained on the same data as the pretrained BERT TensorFlow model (Wikipedia + Toronto Book Corpus). Our implementation does not use the next-sentence prediction task and has only 12 layers but a higher capacity (665M parameters). Overall, our model achieves better performance than the original BERT on all GLUE tasks (cf. the table above for comparison).

Train your own monolingual BERT model

In what follows, we explain how you can train a similar model on your own data.

1. Preparing the data

First, get the monolingual data (English Wikipedia, the TBC corpus is not hosted anymore).

# Download and tokenize Wikipedia data in 'data/wiki/en.{train,valid,test}'
# Note: the tokenization includes lower-casing and accent-removal
./get-data-wiki.sh en

Install fastBPE and learn BPE vocabulary (with 30,000 codes here):

OUTPATH=data/processed/XLM_en/30k  # path where processed files will be stored
FASTBPE=tools/fastBPE/fast  # path to the fastBPE tool

# create output path
mkdir -p $OUTPATH

# learn bpe codes on the training set (or only use a subset of it)
$FASTBPE learnbpe 30000 data/wiki/txt/en.train > $OUTPATH/codes

Now apply BPE tokenization to train/valid/test files:

$FASTBPE applybpe $OUTPATH/train.en data/wiki/txt/en.train $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/valid.en data/wiki/txt/en.valid $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/test.en data/wiki/txt/en.test $OUTPATH/codes &

and get the post-BPE vocabulary:

cat $OUTPATH/train.en | $FASTBPE getvocab - > $OUTPATH/vocab &

Binarize the data to limit the size of the data we load in memory:

# This will create three files: $OUTPATH/{train,valid,test}.en.pth
# After that we're all set
python preprocess.py $OUTPATH/vocab $OUTPATH/train.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/valid.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/test.en &
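If you want to inspect what preprocess.py produced, you can load one of the binarized files with torch.load. This is a hedged sketch: the key names ('dico', 'sentences', 'positions') are assumptions based on this codebase's data loader, so check src/data/loader.py if they differ.

import torch

data = torch.load('data/processed/XLM_en/30k/train.en.pth')
print(len(data['dico']))        # vocabulary size
print(data['sentences'].shape)  # flat stream of word indices
print(data['positions'].shape)  # (n_sentences, 2) start/end offsets into the stream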

2. Train the BERT model

Train your BERT model (without the next-sentence prediction task) on the preprocessed data:


python train.py

## main parameters
--exp_name xlm_en                          # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en'                                 # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en'                           # MLM objective

## transformer parameters
--emb_dim 2048                             # embeddings / model dimension (2048 is big, reduce if only 16Gb of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequences length  (streams of 256 tokens)
--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001  # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_en_mlm_ppl     # validation metric (when to save the best model)
--stopping_criterion _valid_en_mlm_ppl,25  # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## bert parameters
--word_mask_keep_rand '0.8,0.1,0.1'        # bert masking probabilities
--word_pred '0.15'                         # predict 15 percent of the words

## There are other parameters that are not specified here (see train.py).
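For intuition, here is a minimal, hedged sketch of the masking scheme implied by --word_pred 0.15 and --word_mask_keep_rand '0.8,0.1,0.1' (not the exact implementation in the trainer, which differs in details such as how masked words are sampled):

import torch

def mask_batch(x, mask_index, vocab_size, word_pred=0.15, p_mask=0.8, p_keep=0.1):
    pred_mask = torch.rand(x.shape) < word_pred           # choose ~15% of positions
    y = x.clone()                                         # targets: the original words
    x = x.clone()
    choice = torch.rand(x.shape)
    x[pred_mask & (choice < p_mask)] = mask_index         # 80%: replace with the mask token
    rand = pred_mask & (choice >= p_mask + p_keep)        # 10%: replace with a random word
    x[rand] = torch.randint(0, vocab_size, x.shape)[rand]
    return x, y, pred_mask                                # remaining 10%: word kept unchanged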

To train with multiple GPUs use:

export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

Tips: Even when the validation perplexity plateaus, keep training your model. The larger the batch size, the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. in [0.0001, 0.0002]) should also help.
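As a reference for tuning, here is a hedged sketch of the adam_inverse_sqrt schedule used above (lr=0.0001, warmup_updates=30000): linear warmup to the peak learning rate, then decay proportional to 1/sqrt(step). This is the standard inverse-sqrt formula; see src/optim.py for the exact implementation.

import math

def inverse_sqrt_lr(step, lr=1e-4, warmup_updates=30000):
    if step < warmup_updates:
        return lr * step / warmup_updates         # linear warmup
    return lr * math.sqrt(warmup_updates / step)  # inverse square-root decay

print(inverse_sqrt_lr(30000))   # 0.0001 (peak)
print(inverse_sqrt_lr(120000))  # 5e-05 (decayed)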

3. Fine-tune a pretrained model on GLUE tasks

Now that the model is pretrained, let's fine-tune it. First, download and preprocess the GLUE tasks:

# Download and tokenize GLUE tasks in 'data/glue/{MNLI,QNLI,SST-2,STS-B}'

./get-data-glue.sh

# Preprocessing should be the same as for training.
# If you removed lower-casing/accent-removal, it should be reflected here as well.

and prepare the GLUE data using the codes and vocab:

# by default this script uses the BPE codes and vocab of the pretrained XLM_en. Modify the script if needed.
./prepare-glue.sh

In addition to the train.py script, we provide a complementary script glue-xnli.py to fine-tune a model on either GLUE or XNLI.

You can now fine-tune the pretrained model on one of the English GLUE tasks using this config:

# Config used for fine-tuning our pretrained English BERT model (mlm_en_2048.pth)
python glue-xnli.py
--exp_name test_xlm_en_glue              # experiment name
--dump_path ./dumped                     # where to store the experiment
--model_path mlm_en_2048.pth             # model location
--data_path $OUTPATH                     # data location
--transfer_tasks MNLI-m,QNLI,SST-2       # transfer tasks (GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of the embedder / pretrained model (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of the projection / classifier (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch (relatively small on purpose)
--max_len 256                            # max number of words in sentences
--max_vocab -1                           # max number of words in vocab

Tips: You should sweep over the batch size (4 and 8) and the learning rate (5e-6, 2.5e-5, 1.25e-4) values.
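The sweep can be scripted, for instance by launching glue-xnli.py once per (batch size, learning rate) pair. A hedged sketch using subprocess; the flags mirror the command above, while the experiment name and data path are placeholders you should adapt:

import itertools
import subprocess

for bs, lr in itertools.product([4, 8], [5e-6, 2.5e-5, 1.25e-4]):
    subprocess.run([
        'python', 'glue-xnli.py',
        '--exp_name', f'glue_bs{bs}_lr{lr}',
        '--dump_path', './dumped',
        '--model_path', 'mlm_en_2048.pth',
        '--data_path', 'data/processed/XLM_en/30k',  # placeholder: your $OUTPATH
        '--transfer_tasks', 'SST-2',
        '--optimizer_e', f'adam,lr={lr}',
        '--optimizer_p', f'adam,lr={lr}',
        '--finetune_layers', '0:_1',
        '--batch_size', str(bs),
        '--n_epochs', '250',
        '--epoch_size', '20000',
        '--max_len', '256',
        '--max_vocab', '-1',
    ], check=True)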

II. Cross-lingual language model pretraining (XLM)

XLM-R (new model)

XLM-R is the new state-of-the-art XLM model. XLM-R shows the possibility of training one model for many languages without sacrificing per-language performance. It is trained on 2.5 TB of CommonCrawl data, in 100 languages. You can load XLM-R from torch.hub (PyTorch >= 1.1):

# XLM-R model
import torch
xlmr = torch.hub.load('pytorch/fairseq', 'xlmr.large')
xlmr.eval()

Apply sentence-piece model (SPM) encoding to the input text:

en_tokens = xlmr.encode('Hello world!')
assert en_tokens.tolist() == [0, 35378,  8999, 38, 2]
xlmr.decode(en_tokens)  # 'Hello world!'

ar_tokens = xlmr.encode('مرحبا بالعالم')
assert ar_tokens.tolist() == [0, 665, 193478, 258, 1705, 77796, 2]
xlmr.decode(ar_tokens) # 'مرحبا بالعالم'

zh_tokens = xlmr.encode('你好,世界')
assert zh_tokens.tolist() == [0, 6, 124084, 4, 3221, 2]
xlmr.decode(zh_tokens)  # '你好,世界'

Extract features from XLM-R:

# Extract the last layer's features
last_layer_features = xlmr.extract_features(zh_tokens)
assert last_layer_features.size() == torch.Size([1, 6, 1024])

# Extract all layers' features (layer 0 is the embedding layer)
all_layers = xlmr.extract_features(zh_tokens, return_all_hiddens=True)
assert len(all_layers) == 25
assert torch.all(all_layers[-1] == last_layer_features)
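Beyond the assertions above, a common way to turn these features into a fixed-size sentence vector is mean pooling over the token dimension. This is a generic recipe, not an official part of the XLM-R API:

sentence_vector = last_layer_features.mean(dim=1)  # shape: (1, 1024)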

XLM-R handles the following 100 languages: Afrikaans, Albanian, Amharic, Arabic, Armenian, Assamese, Azerbaijani, Basque, Belarusian, Bengali, Bengali Romanized, Bosnian, Breton, Bulgarian, Burmese, Catalan, Chinese (Simplified), Chinese (Traditional), Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Filipino, Finnish, French, Galician, Georgian, German, Greek, Gujarati, Hausa, Hebrew, Hindi, Hindi Romanized, Hungarian, Icelandic, Indonesian, Irish, Italian, Japanese, Javanese, Kannada, Kazakh, Khmer, Korean, Kurdish (Kurmanji), Kyrgyz, Lao, Latin, Latvian, Lithuanian, Macedonian, Malagasy, Malay, Malayalam, Marathi, Mongolian, Nepali, Norwegian, Oriya, Oromo, Pashto, Persian, Polish, Portuguese, Punjabi, Romanian, Russian, Sanskrit, Scottish Gaelic, Serbian, Sindhi, Sinhala, Slovak, Slovenian, Somali, Spanish, Sundanese, Swahili, Swedish, Tamil, Tamil Romanized, Telugu, Telugu Romanized, Thai, Turkish, Ukrainian, Urdu, Urdu Romanized, Uyghur, Uzbek, Vietnamese, Welsh, Western Frisian, Xhosa, Yiddish.

Pretrained cross-lingual language models

We provide large pretrained models for the 15 languages of XNLI, and two other models in 17 and 100 languages.

Languages Pretraining Tokenization Model BPE codes Vocabulary
15 MLM tokenize + lowercase + no accent + BPE Model BPE codes (80k) Vocabulary (95k)
15 MLM + TLM tokenize + lowercase + no accent + BPE Model BPE codes (80k) Vocabulary (95k)
17 MLM tokenize + BPE Model BPE codes (175k) Vocabulary (200k)
100 MLM tokenize + BPE Model BPE codes (175k) Vocabulary (200k)

which obtain better performance than mBERT on the XNLI cross-lingual classification task:

Model lg en es de ar zh ur
mBERT 102 81.4 74.3 70.5 62.1 63.8 58.3
XLM (MLM) 15 83.2 76.3 74.2 68.5 71.9 63.4
XLM (MLM+TLM) 15 85.0 78.9 77.8 73.1 76.5 67.3
XLM (MLM) 17 84.8 79.4 76.2 71.5 75 -
XLM (MLM) 100 83.7 76.6 73.6 67.4 71.7 62.9

If you want to play around with the model and its representations, just download the model and take a look at our ipython notebook demo.

The 17 and 100 Languages

The XLM-17 model includes these languages: en-fr-es-de-it-pt-nl-sv-pl-ru-ar-tr-zh-ja-ko-hi-vi

The XLM-100 model includes these languages: en-es-fr-de-zh-ru-pt-it-ar-ja-id-tr-nl-pl-simple-fa-vi-sv-ko-he-ro-no-hi-uk-cs-fi-hu-th-da-ca-el-bg-sr-ms-bn-hr-sl-zh_yue-az-sk-eo-ta-sh-lt-et-ml-la-bs-sq-arz-af-ka-mr-eu-tl-ang-gl-nn-ur-kk-be-hy-te-lv-mk-zh_classical-als-is-wuu-my-sco-mn-ceb-ast-cy-kn-br-an-gu-bar-uz-lb-ne-si-war-jv-ga-zh_min_nan-oc-ku-sw-nds-ckb-ia-yi-fy-scn-gan-tt-am

Train your own XLM model with MLM or MLM+TLM

In what follows, we explain how you can train an XLM model on your own data.

1. Preparing the data

Monolingual data (MLM): Follow the same procedure as in I.1, and download multiple monolingual corpora, such as the Wikipedias.

Note that we provide a tokenizer script:

lg=en
cat my_file.$lg | ./tools/tokenize.sh $lg > my_tokenized_file.$lg &

Parallel data (TLM): We provide download scripts for some language pairs in the get-data-para.sh script.

# Download and tokenize parallel data in 'data/wiki/para/en-zh.{en,zh}.{train,valid,test}'
./get-data-para.sh en-zh &

For other language pairs, look at the OPUS collection, and modify the get-data-para.sh script here (https://github.com/facebookresearch/XLM/blob/master/get-data-para.sh#L179-L180) to add your own language pair.

Now create your training set for the BPE vocabulary, for instance by taking 10M sentences from each monolingual corpus.

# build the training set for BPE tokenization (50k codes)
OUTPATH=data/processed/XLM_en_zh/50k
mkdir -p $OUTPATH
shuf -r -n 10000000 data/wiki/train.en >> $OUTPATH/bpe.train
shuf -r -n 10000000 data/wiki/train.zh >> $OUTPATH/bpe.train

Then learn the 50k BPE codes on the bpe.train file as in the previous section, apply BPE tokenization to the monolingual and parallel corpora, and binarize everything using preprocess.py:

pair=en-zh

for lg in $(echo $pair | sed -e 's/\-/ /g'); do
  for split in train valid test; do
    $FASTBPE applybpe $OUTPATH/$pair.$lg.$split data/wiki/para/$pair.$lg.$split $OUTPATH/codes
    python preprocess.py $OUTPATH/vocab $OUTPATH/$pair.$lg.$split
  done
done
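TLM requires the two sides of a parallel corpus to stay line-aligned, so it is worth checking that both sides have the same number of sentences after BPE. A small hedged sanity check (not part of the repo):

def count_lines(path):
    with open(path, encoding='utf-8') as f:
        return sum(1 for _ in f)

OUTPATH = 'data/processed/XLM_en_zh/50k'
for split in ['train', 'valid', 'test']:
    n_en = count_lines(f'{OUTPATH}/en-zh.en.{split}')
    n_zh = count_lines(f'{OUTPATH}/en-zh.zh.{split}')
    assert n_en == n_zh, f'{split}: {n_en} != {n_zh} sentences'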

2. Train the XLM model

Train your XLM (MLM only) on the preprocessed data:

python train.py

## main parameters
--exp_name xlm_en_zh                       # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en-zh'                              # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en,zh'                        # MLM objective

## transformer parameters
--emb_dim 1024                             # embeddings / model dimension (reduce if you only have 16GB of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequences length  (streams of 256 tokens)
--optimizer adam,lr=0.0001                 # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_mlm_ppl        # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,25     # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## There are other parameters that are not specified here (see https://github.com/facebookresearch/XLM/blob/master/train.py#L24-L198).

Here the validation metric _valid_mlm_ppl is the average of the per-language MLM perplexities.
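Concretely, perplexity is the exponential of the average per-token cross-entropy on the masked positions, and the per-language values are averaged. A hedged illustration with made-up losses:

import math

xent_en, xent_zh = 2.1, 2.6             # hypothetical validation losses (in nats)
ppl_en = math.exp(xent_en)              # ~8.17
ppl_zh = math.exp(xent_zh)              # ~13.46
valid_mlm_ppl = (ppl_en + ppl_zh) / 2   # the averaged validation metric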

MLM+TLM model: If you want to add TLM on top of MLM, just add the "en-zh" language pair to mlm_steps:

--mlm_steps 'en,zh,en-zh'                  # MLM objective

Tips: You can also pretrain your model with MLM-only, and then continue training with MLM+TLM with the --reload_model parameter.

3. Fine-tune XLM models (Applications, see below)

Cross-lingual language model (XLM) pretraining provides a strong method for cross-lingual understanding (XLU) tasks. In what follows, we present applications to machine translation (unsupervised and supervised) and cross-lingual classification (XNLI).

III. Applications: Supervised / Unsupervised MT

XLMs can be used as a pretraining method for unsupervised or supervised neural machine translation.

Pretrained XLM(MLM) models

The English-French, English-German and English-Romanian models are the ones we used in the paper for MT pretraining. They are trained with monolingual data only, with the MLM objective. If you use these models, you should use the same data preprocessing / BPE codes to preprocess your data. See the preprocessing commands in get-data-nmt.sh.

Languages Pretraining Model BPE codes Vocabulary
English-French MLM Model BPE codes Vocabulary
English-German MLM Model BPE codes Vocabulary
English-Romanian MLM Model BPE codes Vocabulary

Download / preprocess data

To download the data required for the unsupervised MT experiments, simply run:

git clone https://github.com/facebookresearch/XLM.git
cd XLM

Then run one of the three commands below:

./get-data-nmt.sh --src en --tgt fr
./get-data-nmt.sh --src de --tgt en
./get-data-nmt.sh --src en --tgt ro

for English-French, German-English, or English-Romanian experiments. The script will successively:

  • download Moses scripts, download and compile fastBPE
  • download, extract, tokenize, apply BPE to monolingual and parallel test data
  • binarize all datasets

If you want to use our pretrained models, you need an exactly identical vocabulary. Since small differences can happen during preprocessing, we recommend that you use our BPE codes and vocabulary (although you should get something almost identical if you learn the codes and compute the vocabulary yourself). This will ensure that the vocabulary of your preprocessed data perfectly matches that of our pretrained models, and that there is no word/index mismatch. To do so, simply run:

wget https://dl.fbaipublicfiles.com/XLM/codes_enfr
wget https://dl.fbaipublicfiles.com/XLM/vocab_enfr

./get-data-nmt.sh --src en --tgt fr --reload_codes codes_enfr --reload_vocab vocab_enfr
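If you want to double-check the match, you can compare your preprocessed vocabulary against the released one word-for-word. A hedged sketch, assuming the "word count" per-line format produced by fastBPE getvocab, and a hypothetical local vocab path:

def load_words(path):
    with open(path, encoding='utf-8') as f:
        return [line.split()[0] for line in f]

mine = load_words('data/processed/en-fr/vocab')  # hypothetical local vocab path
theirs = load_words('vocab_enfr')                # downloaded with wget above
assert mine == theirs, 'word/index mismatch: do not reload the pretrained model'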

get-data-nmt.sh contains a few parameters defined at the beginning of the file:

  • N_MONO number of monolingual sentences for each language (default 5000000)
  • CODES number of BPE codes (default 60000)
  • N_THREADS number of threads in data preprocessing (default 16)

The default amount of monolingual data is 5M sentences per language, but using more monolingual data will significantly improve the quality of the pretrained models. In practice, the models we release for MT are trained on all available NewsCrawl data, i.e. about 260M, 200M and 65M sentences for German, English and French respectively.

The script should output a data summary that contains the location of all files required to start experiments:

===== Data summary
Monolingual training data:
    en: ./data/processed/en-fr/train.en.pth
    fr: ./data/processed/en-fr/train.fr.pth
Monolingual validation data:
    en: ./data/processed/en-fr/valid.en.pth
    fr: ./data/processed/en-fr/valid.fr.pth
Monolingual test data:
    en: ./data/processed/en-fr/test.en.pth
    fr: ./data/processed/en-fr/test.fr.pth
Parallel validation data:
    en: ./data/processed/en-fr/valid.en-fr.en.pth
    fr: ./data/processed/en-fr/valid.en-fr.fr.pth
Parallel test data:
    en: ./data/processed/en-fr/test.en-fr.en.pth
    fr: ./data/processed/en-fr/test.en-fr.fr.pth

Pretrain a language model (with MLM)

The following script will pretrain a model with the MLM objective for English and French:

python train.py

## main parameters
--exp_name test_enfr_mlm                # experiment name
--dump_path ./dumped/                   # where to store the experiment

## data location / training objective
--data_path ./data/processed/en-fr/     # data location
--lgs 'en-fr'                           # considered languages
--clm_steps ''                          # CLM objective
--mlm_steps 'en,fr'                     # MLM objective

## transformer parameters
--emb_dim 1024                          # embeddings / model dimension
--n_layers 6                            # number of layers
--n_heads 8                             # number of heads
--dropout 0.1                           # dropout
--attention_dropout 0.1                 # attention dropout
--gelu_activation true                  # GELU instead of ReLU

## optimization
--batch_size 32                         # sequences per batch
--bptt 256                              # sequences length
--optimizer adam,lr=0.0001              # optimizer
--epoch_size 200000                     # number of sentences per epoch
--validation_metrics _valid_mlm_ppl     # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,10  # end experiment if stopping criterion does not improve

If parallel data is available, the TLM objective can be used with --mlm_steps 'en-fr'. To train with both the MLM and TLM objective, you can use --mlm_steps 'en,fr,en-fr'. We provide models trained with the MLM objective for English-French, English-German and English-Romanian, along with the BPE codes and vocabulary used to preprocess the data.

Train on unsupervised MT from a pretrained model

You can now use the pretrained model for Machine Translation. To download a model trained with the command above on the MLM objective, and the corresponding BPE codes, run:

wget -c https://dl.fbaipublicfiles.com/XLM/mlm_enfr_1024.pth

If you preprocessed your dataset in ./data/processed/en-fr/ with the provided BPE codes codes_enfr and vocabulary vocab_enfr, you can initialize your NMT model with mlm_enfr_1024.pth and run:

python train.py

## main parameters
--exp_name unsupMT_enfr                                       # experiment name
--dump_path ./dumped/                                         # where to store the experiment
--reload_model 'mlm_enfr_1024.pth,mlm_enfr_1024.pth'          # model to reload for encoder,decoder

## data location / training objective
--data_path ./data/processed/en-fr/                           # data location
--lgs 'en-fr'                                                 # considered languages
--ae_steps 'en,fr'                                            # denoising auto-encoder training steps
--bt_steps 'en-fr-en,fr-en-fr'                                # back-translation steps
--word_shuffle 3                                              # noise for auto-encoding loss
--word_dropout 0.1                                            # noise for auto-encoding loss
--word_blank 0.1                                              # noise for auto-encoding loss
--lambda_ae '0:1,100000:0.1,300000:0'                         # schedule for the auto-encoding coefficient (see the sketch after this block)

## transformer parameters
--encoder_only false                                          # use a decoder for MT
--emb_dim 1024                                                # embeddings / model dimension
--n_layers 6                                                  # number of layers
--n_heads 8                                                   # number of heads
--dropout 0.1                                                 # dropout
--attention_dropout 0.1                                       # attention dropout
--gelu_activation true                                        # GELU instead of ReLU

## optimization
--tokens_per_batch 2000                                       # use batches with a fixed number of words
--batch_size 32                                               # batch size (for back-translation)
--bptt 256                                                    # sequence length
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001  # optimizer
--epoch_size 200000                                           # number of sentences per epoch
--eval_bleu true                                              # also evaluate the BLEU score
--stopping_criterion 'valid_en-fr_mt_bleu,10'                 # stopping criterion (end experiment if criterion does not improve 10 times)
--validation_metrics 'valid_en-fr_mt_bleu'                    # validation metric (when to save the best model)
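The schedule string passed to --lambda_ae is a list of (step, value) anchor points: the auto-encoding coefficient starts at 1, decays to 0.1 by step 100k, and reaches 0 at step 300k. A hedged sketch of how such a string can be interpreted, assuming linear interpolation between anchor points (see the lambda_* handling in the source for the actual parsing):

def parse_schedule(s):
    return [(int(k), float(v)) for k, v in (p.split(':') for p in s.split(','))]

def coeff_at(step, points):
    for (s0, v0), (s1, v1) in zip(points, points[1:]):
        if s0 <= step <= s1:
            return v0 + (v1 - v0) * (step - s0) / (s1 - s0)  # linear interpolation
    return points[-1][1]                                     # constant afterwards

points = parse_schedule('0:1,100000:0.1,300000:0')
print(coeff_at(50000, points))  # 0.55, halfway between 1 and 0.1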

The parameters of your Transformer model have to be identical to the ones used for pretraining (or you will have to slightly modify the code to only reload existing parameters). After 8 epochs on 8 GPUs, the above command should give you something like this:

epoch               ->     7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu  -> 34.02
test_en-fr_mt_bleu  -> 36.62

IV. Applications: Cross-lingual text classification (XNLI)

XLMs can be used to build cross-lingual classifiers. After fine-tuning an XLM model on an English training corpus (e.g. for sentiment analysis or natural language inference), the model is still able to make accurate predictions at test time in other languages, for which there is little or no training data. This approach is usually referred to as "zero-shot cross-lingual classification".

Get the right tokenizers

Before running the scripts below, make sure you download the tokenizers from the tools/ directory.

Download / preprocess monolingual data

Follow a similar approach to section I.1 for the 15 languages:

for lg in ar bg de el en es fr hi ru sw th tr ur vi zh; do
  ./get-data-wiki.sh $lg
done

Downloading the Wikipedia dumps may take several hours. The get-data-wiki.sh script will automatically download the Wikipedia dumps, extract raw sentences, and clean and tokenize them. Note that in our experiments we also concatenated the Toronto Book Corpus to the English Wikipedia, but this dataset is no longer hosted.

For Chinese and Thai you will need a special tokenizer that you can install using the commands below. For all other languages, the data will be tokenized with Moses scripts.

# Thai - https://github.com/PyThaiNLP/pythainlp
pip install pythainlp

# Chinese
cd tools/
wget https://nlp.stanford.edu/software/stanford-segmenter-2018-10-16.zip
unzip stanford-segmenter-2018-10-16.zip

Download parallel data

This script will download and tokenize the parallel data used for the TLM objective:

lg_pairs="ar-en bg-en de-en el-en en-es en-fr en-hi en-ru en-sw en-th en-tr en-ur en-vi en-zh"
for lg_pair in $lg_pairs; do
  ./get-data-para.sh $lg_pair
done

Apply BPE and binarize

Apply BPE and binarize the data as in the previous sections.

Pretrain a language model (with MLM and TLM)

The following script will pretrain a model with the MLM and TLM objectives for the 15 XNLI languages:

python train.py

## main parameters
--exp_name train_xnli_mlm_tlm            # experiment name
--dump_path ./dumped/                    # where to store the experiment

## data location / training objective
--data_path ./data/processed/XLM15/                   # data location
--lgs 'ar-bg-de-el-en-es-fr-hi-ru-sw-th-tr-ur-vi-zh'  # considered languages
--clm_steps ''                                        # CLM objective
--mlm_steps 'ar,bg,de,el,en,es,fr,hi,ru,sw,th,tr,ur,vi,zh,en-ar,en-bg,en-de,en-el,en-es,en-fr,en-hi,en-ru,en-sw,en-th,en-tr,en-ur,en-vi,en-zh,ar-en,bg-en,de-en,el-en,es-en,fr-en,hi-en,ru-en,sw-en,th-en,tr-en,ur-en,vi-en,zh-en'  # MLM objective

## transformer parameters
--emb_dim 1024                           # embeddings / model dimension
--n_layers 12                            # number of layers
--n_heads 8                              # number of heads
--dropout 0.1                            # dropout
--attention_dropout 0.1                  # attention dropout
--gelu_activation true                   # GELU instead of ReLU

## optimization
--batch_size 32                          # sequences per batch
--bptt 256                               # sequences length
--optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0  # optimizer
--epoch_size 200000                      # number of sentences per epoch
--validation_metrics _valid_mlm_ppl      # validation metric (when to save the best model)
--stopping_criterion _valid_mlm_ppl,10   # end experiment if stopping criterion does not improve

Download XNLI data

This script will download and tokenize the XNLI corpus:

./get-data-xnli.sh

Preprocess data

This script will apply BPE using the XNLI15 bpe codes, and binarize data.

./prepare-xnli.sh

Fine-tune your XLM model on cross-lingual classification (XNLI)

You can now use the pretrained model for cross-lingual classification. To download a model trained with the command above on the MLM-TLM objective, run:

wget -c https://dl.fbaipublicfiles.com/XLM/mlm_tlm_xnli15_1024.pth

You can now fine-tune the pretrained model on XNLI, or on one of the English GLUE tasks:

python glue-xnli.py
--exp_name test_xnli_mlm_tlm             # experiment name
--dump_path ./dumped/                    # where to store the experiment
--model_path mlm_tlm_xnli15_1024.pth     # model location
--data_path ./data/processed/XLM15       # data location
--transfer_tasks XNLI,SST-2              # transfer tasks (XNLI or GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of the embedder / pretrained model (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of the projection / classifier (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch
--max_len 256                            # max number of words in sentences
--max_vocab 95000                        # max number of words in vocab

V. Product-Key Memory Layers (PKM)

XLM also implements the Product-Key Memory layer (PKM) described in [4]. To add a memory in (for instance) layers 4 and 7 of an encoder, you can simply provide --use_memory true --mem_enc_positions 4,7 as arguments to train.py (and similarly for --mem_dec_positions and the decoder). All memory layer parameters can be found here. A minimalist and simple implementation of the PKM layer, that uses the same configuration as in the paper, can be found in this ipython notebook.
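To make the mechanism concrete, here is a minimalist, hedged sketch of a product-key memory lookup in the spirit of [4] (not the repository's implementation; see the linked notebook for that). Two half-queries are scored against two small sub-key tables, and the top-k Cartesian combinations of sub-keys address a table of n_keys^2 values:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ProductKeyMemory(nn.Module):
    def __init__(self, dim, n_keys=512, k_dim=256, knn=32):
        super().__init__()
        self.knn = knn
        self.n_keys = n_keys
        self.query = nn.Linear(dim, k_dim)                # project input to a query
        self.keys1 = nn.Parameter(torch.randn(n_keys, k_dim // 2))
        self.keys2 = nn.Parameter(torch.randn(n_keys, k_dim // 2))
        self.values = nn.Embedding(n_keys * n_keys, dim)  # n_keys^2 memory slots

    def forward(self, x):                                 # x: (batch, dim)
        q1, q2 = self.query(x).chunk(2, dim=-1)           # split the query in half
        s1, i1 = (q1 @ self.keys1.t()).topk(self.knn)     # top-k for each half
        s2, i2 = (q2 @ self.keys2.t()).topk(self.knn)
        # score all knn x knn combinations and keep the overall top-k
        scores = (s1.unsqueeze(-1) + s2.unsqueeze(-2)).flatten(1)
        slots = (i1.unsqueeze(-1) * self.n_keys + i2.unsqueeze(-2)).flatten(1)
        best, pos = scores.topk(self.knn)
        w = F.softmax(best, dim=-1)                       # weights over the selected slots
        v = self.values(slots.gather(1, pos))             # (batch, knn, dim)
        return (w.unsqueeze(-1) * v).sum(dim=1)

mem = ProductKeyMemory(dim=1024)
print(mem(torch.randn(4, 1024)).shape)                    # torch.Size([4, 1024])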

Frequently Asked Questions

How can I run experiments on multiple GPUs?

XLM supports both multi-GPU and multi-node training, and was tested with up to 128 GPUs. To run an experiment with multiple GPUs on a single machine, simply replace python train.py in the commands above with:

export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

Multi-node training is automatically handled by SLURM.

References

Please cite [1] if you found the resources in this repository useful.

Cross-lingual Language Model Pretraining

[1] G. Lample *, A. Conneau * Cross-lingual Language Model Pretraining

* Equal contribution. Order has been determined with a coin flip.

@article{lample2019cross,
  title={Cross-lingual Language Model Pretraining},
  author={Lample, Guillaume and Conneau, Alexis},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2019}
}

XNLI: Evaluating Cross-lingual Sentence Representations

[2] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, V. Stoyanov XNLI: Evaluating Cross-lingual Sentence Representations

@inproceedings{conneau2018xnli,
  title={XNLI: Evaluating Cross-lingual Sentence Representations},
  author={Conneau, Alexis and Lample, Guillaume and Rinott, Ruty and Williams, Adina and Bowman, Samuel R and Schwenk, Holger and Stoyanov, Veselin},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

Phrase-Based & Neural Unsupervised Machine Translation

[3] G. Lample, M. Ott, A. Conneau, L. Denoyer, MA. Ranzato Phrase-Based & Neural Unsupervised Machine Translation

@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

Large Memory Layers with Product Keys

[4] G. Lample, A. Sablayrolles, MA. Ranzato, L. Denoyer, H. Jégou Large Memory Layers with Product Keys

@article{lample2019large,
  title={Large Memory Layers with Product Keys},
  author={Lample, Guillaume and Sablayrolles, Alexandre and Ranzato, Marc'Aurelio and Denoyer, Ludovic and J{\'e}gou, Herv{\'e}},
  journal={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2019}
}

Unsupervised Cross-lingual Representation Learning at Scale

[5] A. Conneau *, K. Khandelwal *, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzman, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov Unsupervised Cross-lingual Representation Learning at Scale

* Equal contribution

@article{conneau2019unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  journal={arXiv preprint arXiv:1911.02116},
  year={2019}
}

License

See the LICENSE file for more details.


Issues

Not able to learn with sinusoidal embeddings.

Hi,
I ran the MLM pretraining for en-fr using the default arguments.
I noticed that while I was able to learn using the learnt embeddings, using sinusoidal embeddings completely fails to learn and the validation accuracy stays around 5%.

Did you face similar issues when using sinusoidal embeddings?

Thanks!

Pretrained word embeddings

First, thanks for sharing your code!

I really appreciate it.

I have a question about pre-trained word embeddings for unsupervised NMT task.

While reviewing the code, I found that you never used pre-trained word embeddings
(since --reload_emb is empty).

If it is true that pre-trained word embeddings have not been used, is there a specific reason for not using them?

Thank You!

Embeddings for each subword in a sentence

Hi,

thanks for releasing the code for Cross-lingual Language Model Pretraining ❤️

I would like to know if it's possible to encode a whole sentence and get the embeddings for each token (or better, each subword). The notebook only contains an example of how to encode a sentence, but could you also provide a way to get the embeddings for each subword?

Thanks :)

Why SRC < TGT ?

if [ "$SRC" \> "$TGT" ]; then echo "please ensure SRC < TGT"; exit; fi

Hi @glample,
Can you explain why you make this assumption "SRC < TGT"?
I noticed it also in:

if src < tgt and ((src, tgt) in required_para or (tgt, src) in required_para)
_lang1, _lang2 = (lang1, lang2) if lang1 < lang2 else (lang2, lang1)
assert lang1 < lang2

Address already in use

I tried to run several multi-GPU programs on a single server, but I encountered this problem:
RuntimeError: Address already in use at /pytorch/torch/lib/THD/process_group/General.cpp:17
So, if I have 4 GPUs on a single server and want to run two programs on GPUs 0,1 and 2,3, how should I set the local_rank and master_port parameters? @glample

How can I use multi-GPU to train UNMT

I added --local_rank, but it raises an error.

SLURM job: False
Traceback (most recent call last):
File "train.py", line 322, in
main(params)
File "train.py", line 198, in main
init_distributed_mode(params)
File "XLM/src/slurm.py", line 110, in init_distributed_mode
params.global_rank = int(os.environ['RANK'])
File "/usr/lib/python3.5/os.py", line 725, in getitem
raise KeyError(key) from None
KeyError: 'RANK'

Training a model for languages with great differences

Thanks for your work. I have a question: when training on languages with great differences, such as Chinese-English or English-Kazakh, is it a good choice to share all parameters? I notice that XLM usually shares all parameters.

Adjust learning rate

Hi, I noticed that whether it is unsupervised NMT training or MLM training, the learning rate is 0.0001. Is this the learning rate when training with 8 GPUs? If I use 4 GPUs, how should I adjust the learning rate and warm-up? Thank you very much.

Experience OOM error during evaluate_mt()

Dear authors,
Thank you so much for your code. I'm trying to reproduce the supervised MT results on WMT14 en-de. Training works fine with a single GPU (or multiple GPUs). However, I frequently experience an OOM error after one epoch, during the evaluate_mt() step. Here's the script I used and the error message:

python train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en --lgs 'en-de' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0'
(--gpus just indicates the gpuid to use)

Traceback (most recent call last):
File "train.py", line 325, in
main(params)
File "train.py", line 300, in main
scores = evaluator.run_all_evals(trainer)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 181, in run_all_evals
self.evaluate_mt(scores, data_set, lang1, lang2, eval_bleu)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/evaluation/evaluator.py", line 377, in evaluate_mt
word_scores, loss = decoder('predict', tensor=dec2, pred_mask=pred_mask, y=y, get_scores=True)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 313, in forward
return self.predict(**kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 416, in predict
scores, loss = self.pred_layer(masked_tensor, y, get_scores)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in call
result = self.forward(*input, **kwargs)
File "/nfs/eecs-fserv/share/yangyil/XLM/src/model/transformer.py", line 132, in forward
loss = F.cross_entropy(scores, y, reduction='elementwise_mean')
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 1550, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/nfs/stak/users/yangyil/shared/anaconda3/lib/python3.6/site-packages/torch/nn/functional.py", line 975, in log_softmax
return input.log_softmax(dim)
RuntimeError: CUDA error: out of memory

The OOM always happens within F.cross_entropy(), although cross_entropy doesn't always trigger OOM. Do you have any idea how to make it more stable?

Another thing: I use PyTorch 0.4.1 but didn't experience #15; if I update to 1.0.1, I experience another error, pytorch/pytorch#13273 (_queue_reduction() doesn't take a torch.distributed.ProcessGroupNCCL object).

Best.
Yilin

How can I get the words embeddings?

Hello!
Thank you for sharing this code!

Is there an easy way to get the embedding of a particular word?
I mean those found in Table 5 of the paper.
Thank you!

FP 16 Training for mt and bt steps.

Hi, I noticed in the code that fp16 training is manually disabled for machine translation and back-translation updates with assert False statements.

Specifically, I am trying to use the MT step. I commented out the assert statement and added retain_graph=True to the first backward call, but I noticed that after doing this my throughput was actually lower than without fp16 enabled.

Can you help me with correctly setting up the fp16 training for mt step?

RuntimeError: CUDA out of memory. Tried to allocate 498.50 MiB (GPU 0; 7.92 GiB total capacity; 6.74 GiB already allocated; 307.56 MiB free; 3.53 MiB cached)

Hi,@glample

I pretrained a model with the MLM objective for Mongolian and Chinese, but when I used the pretrained model for mn-zh machine translation, this error came up. I tried reducing --batch_size from the default 32 to 16, 8, 4, 2, and 1, but that didn't help. Do you have any good solutions to share?

The pretrained result is:
INFO - 02/28/19 09:47:19 - 1 day, 1:01:20 - ============ End of epoch 7 ============
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - epoch -> 7.000000
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_ppl -> 20.055305
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mn_mlm_acc -> 56.151420
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_ppl -> 1813.456839
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_zh_mlm_acc -> 28.312303
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_ppl -> 916.756072
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - valid_mlm_acc -> 42.231861
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_ppl -> 8.259349
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mn_mlm_acc -> 65.375485
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_ppl -> 11569.002599
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_zh_mlm_acc -> 15.452244
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_ppl -> 5788.630974
INFO - 02/28/19 09:47:30 - 1 day, 1:01:31 - test_mlm_acc -> 40.413864

Train on unsupervised MT from the pretrained model
python train.py --exp_name unsupMT_mnzh --dump_path ./dumped/ --reload_model 'best-valid_mlm_ppl.pth,best-valid_mlm_ppl.pth' --data_path ./data/processed/mn-zh/ --lgs 'mn-zh' --ae_steps 'mn,zh' --bt_steps 'mn-zh-mn,zh-mn-zh' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 1024 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 2000 --batch_size 16 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.999,lr=0.0001 --epoch_size 300000 --eval_bleu true --stopping_criterion 'valid_mn-zh_mt_bleu,10' --validation_metrics 'valid_mn-zh_mt_bleu'

RuntimeError: CUDA error: device-side assert triggered

Hello! I have been running your translate.py script and ran into this error on a particular line of my input file, which contains a BPE-ised URL but is otherwise nothing special (only 13 subwords long). The error occurs on the following line of code:

decoded, dec_lengths = decoder.generate(encoded, lengths.cuda(), params.tgt_id, max_len=int(1.5 * lengths.max().item() + 10))

Do you have any suggestions about what might be causing this error and how it could be fixed? Thank you very much in advance!

The result becomes 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese

Hi,@glample

The result becomes 0 at the end of the second epoch when I pretrain a model with the MLM objective for Mongolian and Chinese. Is the preprocessing method inappropriate?

details:
python train.py --exp_name 'my_mnzh_mlm' --dump_path './dumped/' --exp_id '190225' --data_path './data/processed/mn-zh/' --lgs 'mn-zh' --clm_steps '' --mlm_steps 'mn,zh' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.2' --attention_dropout '0.2' --gelu_activation 'true' --batch_size '16' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '300000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'


INFO - 02/25/19 13:21:37 - 3:07:50 - ============ End of epoch 0 ============
INFO - 02/25/19 13:21:48 - 3:08:01 - epoch -> 0.000000
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_ppl -> 574.678424
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mn_mlm_acc -> 17.192429
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_ppl -> 5591.294827
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_zh_mlm_acc -> 14.550473
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_ppl -> 3082.986625
INFO - 02/25/19 13:21:48 - 3:08:01 - valid_mlm_acc -> 15.871451
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_ppl -> 436.168551
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mn_mlm_acc -> 13.728215
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_ppl -> 32195.137737
INFO - 02/25/19 13:21:48 - 3:08:01 - test_zh_mlm_acc -> 7.138838
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_ppl -> 16315.653144
INFO - 02/25/19 13:21:48 - 3:08:01 - test_mlm_acc -> 10.433527

INFO - 02/25/19 16:29:17 - 6:15:30 - ============ End of epoch 1 ============
INFO - 02/25/19 16:29:28 - 6:15:41 - epoch -> 1.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_ppl -> 966.486405
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mn_mlm_acc -> 7.886435
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_ppl -> 8967.092445
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_ppl -> 4966.789425
INFO - 02/25/19 16:29:28 - 6:15:41 - valid_mlm_acc -> 3.943218
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_ppl -> 808.229061
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mn_mlm_acc -> 12.853917
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_ppl -> 43495.881859
INFO - 02/25/19 16:29:28 - 6:15:41 - test_zh_mlm_acc -> 0.000000
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_ppl -> 22152.055460
INFO - 02/25/19 16:29:28 - 6:15:41 - test_mlm_acc -> 6.426958

Hyperparameters for replicating supervised MT ro -> en result.

Hi,

I am trying to replicate the supervised MT ro -> en baseline of 28.4 mentioned in the paper. I was hoping that you could give me some idea about the hyperparameters for that.
Specifically, can you tell me the number of BPE operations, the learning rate and learning rate schedule used, the dropout and attention dropout values, the embedding size of the network, the batch size, and the number of GPUs used during training.

Thanks!

weird codes in Evaluator.get_iterator

Hi,

I just found a weird piece of code at:

if len(self.params.langs) > 30:
eval_lgs = set(["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh", "ab", "ay", "bug", "ha", "ko", "ln", "min", "nds", "pap", "pt", "tg", "to", "udm", "uk", "zh_classical"])
eval_lgs = set(["ar", "bg", "de", "el", "en", "es", "fr", "hi", "ru", "sw", "th", "tr", "ur", "vi", "zh"])
subsample = 10 if (data_set == 'test' or lang1 not in eval_lgs) else 5
n_sentences = 600 if (data_set == 'test' or lang1 not in eval_lgs) else 1500

If possible, may I ask about the intuition behind this "hack"?

Thanks.

Reloading model and params from Checkpoint

Hi,
How can I reload the checkpoint and model file in order to continue from the last epoch I reached in a previous (aborted) run? I want to do this in the pretraining stage and also in the training stage.

Thanks,
Odel

Cannot get good results if I train from original data and script

Hi, when I ran the translation task I ran into a problem. I can get similar results if I load your mlm_enfr_1024.pth, but I cannot get good results if I start from your get-data-nmt.sh, for both the de-en and en-fr cases.

details:
Running command: python train.py --exp_name 'my_enfr_mlm' --dump_path './dumped/' --exp_id 'bs.20' --data_path './data/processed/en-fr/' --lgs 'en-fr' --clm_steps '' --mlm_steps 'en,fr' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.1' --attention_dropout '0.1' --gelu_activation 'true' --batch_size '32' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '200000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

INFO - 02/20/19 17:14:59 - 0:50:54 - valid_en_mlm_ppl -> 1413.372916
INFO - 02/20/19 17:14:59 - 0:50:54 - log:{"epoch": 0, "valid_en_mlm_ppl": 1413.3729161899485, "valid_en_mlm_acc": 4.681079149544399, "valid_fr_mlm_ppl": 1137.9702763241598, "valid_fr_mlm_acc": 4.591462520170163, "valid_mlm_ppl": 1275.6715962570543, "valid_mlm_acc": 4.636270834857281, "test_en_mlm_ppl": 1377.6397512089368, "test_en_mlm_acc": 4.500805152979066, "test_fr_mlm_ppl": 1547.092026693417, "test_fr_mlm_acc": 4.81150066011442, "test_mlm_ppl": 1462.3658889511769, "test_mlm_acc": 4.656152906546742}
INFO - 02/20/19 18:05:31 - 1:41:26 - valid_en_mlm_ppl -> 2161.567965
INFO - 02/20/19 18:05:31 - 1:41:26 - log:{"epoch": 1, "valid_en_mlm_ppl": 2161.56796481175, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1688.979616470098, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1925.2737906409238, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2062.9860141920476, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2497.6693821048448, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2280.327698148446, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 18:56:00 - 2:31:55 - valid_en_mlm_ppl -> 2245.817440
INFO - 02/20/19 18:56:00 - 2:31:55 - log:{"epoch": 2, "valid_en_mlm_ppl": 2245.8174404810325, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1625.404408585545, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1935.6109245332887, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2138.2897057505943, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2388.5677765876662, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2263.4287411691303, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 19:46:31 - 3:22:26 - valid_en_mlm_ppl -> 2165.622311
INFO - 02/20/19 19:46:31 - 3:22:26 - log:{"epoch": 3, "valid_en_mlm_ppl": 2165.6223114703407, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1680.1268854516293, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1922.874598460985, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2075.5851921823105, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2465.9347158442074, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2270.7599540132587, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 20:37:00 - 4:12:55 - valid_en_mlm_ppl -> 2062.631943
INFO - 02/20/19 20:37:00 - 4:12:55 - log:{"epoch": 4, "valid_en_mlm_ppl": 2062.6319433943568, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1765.4204690043236, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1914.0262061993403, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 1966.636764557332, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2606.315150449565, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2286.4759575034486, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 21:27:28 - 5:03:23 - valid_en_mlm_ppl -> 2151.624741
INFO - 02/20/19 21:27:28 - 5:03:23 - log:{"epoch": 5, "valid_en_mlm_ppl": 2151.624740528933, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1690.7461604349478, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1921.1854504819405, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2054.5326346790675, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2479.448594677353, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2266.9906146782105, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 22:17:56 - 5:53:51 - valid_en_mlm_ppl -> 2155.638091
INFO - 02/20/19 22:17:56 - 5:53:51 - log:{"epoch": 6, "valid_en_mlm_ppl": 2155.6380909977584, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1699.0517872173994, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1927.3449391075787, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2053.9586330892766, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2483.16693279636, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2268.5627829428186, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 23:08:23 - 6:44:18 - valid_en_mlm_ppl -> 2133.608678
INFO - 02/20/19 23:08:23 - 6:44:18 - log:{"epoch": 7, "valid_en_mlm_ppl": 2133.608678409897, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1695.3582695161938, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1914.4834739630455, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2038.1278812563512, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2492.9029435971656, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2265.5154124267583, "test_mlm_acc": 4.0700737027375675}
INFO - 02/20/19 23:58:51 - 7:34:46 - valid_en_mlm_ppl -> 2065.049633
INFO - 02/20/19 23:58:51 - 7:34:46 - log:{"epoch": 8, "valid_en_mlm_ppl": 2065.049632547123, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1770.2985750724292, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1917.6741038097762, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 1973.5921541087191, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2588.5655595835324, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2281.0788568461257, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 00:49:20 - 8:25:15 - valid_en_mlm_ppl -> 2177.331599
INFO - 02/21/19 00:49:20 - 8:25:15 - log:{"epoch": 9, "valid_en_mlm_ppl": 2177.331599451264, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1664.960476646684, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1921.1460380489739, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2081.1290653201354, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2436.2827245826775, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2258.7058949514067, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 01:39:46 - 9:15:41 - valid_en_mlm_ppl -> 2110.860061
INFO - 02/21/19 01:39:46 - 9:15:41 - log:{"epoch": 10, "valid_en_mlm_ppl": 2110.8600607294125, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1716.5880506037283, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1913.7240556665704, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2007.549178045412, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2522.7412353839986, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2265.145206714705, "test_mlm_acc": 4.0700737027375675}
INFO - 02/21/19 02:30:13 - 10:06:08 - valid_en_mlm_ppl -> 2208.660441
INFO - 02/21/19 02:30:13 - 10:06:08 - log:{"epoch": 11, "valid_en_mlm_ppl": 2208.6604406115257, "valid_en_mlm_acc": 5.074146864391638, "valid_fr_mlm_ppl": 1656.203270846642, "valid_fr_mlm_acc": 4.254070705588969, "valid_mlm_ppl": 1932.431855729084, "valid_mlm_acc": 4.664108784990304, "test_en_mlm_ppl": 2111.8613551170783, "test_en_mlm_acc": 4.186795491143317, "test_fr_mlm_ppl": 2405.011263807759, "test_fr_mlm_acc": 3.9533519143318174, "test_mlm_ppl": 2258.4363094624186, "test_mlm_acc": 4.0700737027375675}


INFO - 02/20/19 16:24:05 - 0:00:00 - ============ Monolingual data (en)
INFO - 02/20/19 16:24:05 - 0:00:00 - Loading data from ./data/processed/en-fr/train.en.pth ...
INFO - 02/20/19 16:24:06 - 0:00:01 - 129033877 words (64139 unique) in 5000000 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:08 - 0:00:03 - Loading data from ./data/processed/en-fr/valid.en.pth ...
INFO - 02/20/19 16:24:08 - 0:00:03 - 69727 words (64139 unique) in 3000 sentences. 1 unknown words (1 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:08 - 0:00:03 - Loading data from ./data/processed/en-fr/test.en.pth ...
INFO - 02/20/19 16:24:09 - 0:00:03 - 76017 words (64139 unique) in 3003 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:09 - 0:00:04 - ============ Monolingual data (fr)
INFO - 02/20/19 16:24:09 - 0:00:04 - Loading data from ./data/processed/en-fr/train.fr.pth ...
INFO - 02/20/19 16:24:09 - 0:00:04 - 130884578 words (64139 unique) in 5000000 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:12 - 0:00:06 - Loading data from ./data/processed/en-fr/valid.fr.pth ...
INFO - 02/20/19 16:24:12 - 0:00:07 - 79585 words (64139 unique) in 3000 sentences. 1 unknown words (1 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:12 - 0:00:07 - Loading data from ./data/processed/en-fr/test.fr.pth ...
INFO - 02/20/19 16:24:12 - 0:00:07 - 86351 words (64139 unique) in 3003 sentences. 0 unknown words (0 unique) covering 0.00% of the data.

INFO - 02/20/19 16:24:13 - 0:00:08 - ============ Data summary
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - train - en: 5000000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - valid - en: 3000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - test - en: 3003
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - train - fr: 5000000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - valid - fr: 3000
INFO - 02/20/19 16:24:13 - 0:00:08 - Monolingual data - test - fr: 3003

Performance of Unsupervised NMT with 5M monolingual data

Hi, @glample . Thank you for your nice contribution.

I have noticed that the demo you released only uses 5M of monolingual data. I tried it and it seems it cannot reach the accuracy reported in the paper, but I would like to know what accuracy it should reach with 5M of monolingual data (just for reference). Can you provide some help?

TypeError: cross_entropy() got an unexpected keyword argument 'reduction'

Hi, @glample

I trained with a single GPU and got an error just like the one shown in the title.
Running command:
First: ./get-data-nmt.sh --src en --tgt fr
got:
===== Data summary
Monolingual training data:
en: ./data/processed/en-fr/train.en.pth
fr: ./data/processed/en-fr/train.fr.pth
Monolingual validation data:
en: ./data/processed/en-fr/valid.en.pth
fr: ./data/processed/en-fr/valid.fr.pth
Monolingual test data:
en: ./data/processed/en-fr/test.en.pth
fr: ./data/processed/en-fr/test.fr.pth
Parallel validation data:
en: ./data/processed/en-fr/valid.en-fr.en.pth
fr: ./data/processed/en-fr/valid.en-fr.fr.pth
Parallel test data:
en: ./data/processed/en-fr/test.en-fr.en.pth
fr: ./data/processed/en-fr/test.en-fr.fr.pth
And then run: python train.py --exp_name 'my_enfr_mlm' --dump_path './dumped/' --exp_id 'bs.20' --data_path './data/processed/en-fr/' --lgs 'en-fr' --clm_steps '' --mlm_steps 'en,fr' --emb_dim '1024' --n_layers '6' --n_heads '8' --dropout '0.1' --attention_dropout '0.1' --gelu_activation 'true' --batch_size '8' --bptt '256' --optimizer 'adam,lr=0.0001' --epoch_size '300000' --validation_metrics '_valid_mlm_ppl' --stopping_criterion '_valid_mlm_ppl,10'

and got the error shown in the title.
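This error usually means the installed PyTorch predates 0.4.1, where the reduction keyword was introduced for the loss functions (replacing size_average/reduce). Upgrading PyTorch is the simpler fix; a minimal compatibility shim is also possible, sketched below:

import torch
import torch.nn.functional as F

# Hedged sketch: route to the right keyword set depending on the PyTorch
# version. `reduction` exists in F.cross_entropy only from 0.4.1 onwards.
def cross_entropy_compat(scores, target, reduction='mean'):
    if torch.__version__ >= '0.4.1':   # crude string check, fine for 0.x/1.x
        return F.cross_entropy(scores, target, reduction=reduction)
    return F.cross_entropy(scores, target,
                           size_average=(reduction == 'mean'),
                           reduce=(reduction != 'none'))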

Couldn't match SOTA performance on WMT14 En-De

Dear authors,

I understand this repo isn't primarily aimed at supervised MT. But your codebase contains a Transformer encoder-decoder model and, more importantly, it is much simpler than the standard supervised MT codebases (e.g. T2T, Fairseq, OpenNMT).

With the intention of reproducing the WMT14 En-De SOTA performance, I used the data & BPE from Fairseq and trained the Transformer base (emb_dim=512) with only mt_steps="en-de" on 4x 2080 Ti (a single GPU scored even lower). I finally got a tokenized BLEU score of 25.63 with beam_size 4 and length_penalty 0.6, which is more than 1 BLEU below the score reported in the Transformer paper.

Training script:
export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py --exp_name wmt14_ende --dump_path ./dumped/ --data_path ./data/processed/wmt14_de-en/fairseq --lgs 'en-de' --encoder_only false --emb_dim 512 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 6000 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 --epoch_size 200000 --eval_bleu true --stopping_criterion 'valid_en-de_mt_bleu,10' --validation_metrics 'valid_en-de_mt_bleu' --mt_steps "en-de" --gpus '0,1,2,3'

Translate results:

valid_en-de_mt_ppl -> 5.401580
valid_en-de_mt_acc -> 65.806969
valid_en-de_mt_bleu -> 28.990000
test_en-de_mt_ppl -> 5.942769
test_en-de_mt_acc -> 66.605212
test_en-de_mt_bleu -> 25.630000

My intuition is that the model structure is slightly different (GELU, layer norm, etc.). May I ask whether you have tried this codebase on the supervised WMT14 benchmark, and what your thoughts on this are?

Best.

Translation script

Hello! Do you happen to have a translate.py script so that the model can be used to translate new data? I saw the --eval_only parameter, but it seems the file to be translated has to be named according to the naming conventions specified in the trainer (and the data folder has to contain all the training/validation files too). The evaluator also appears to use the target-language file to get the maximum sentence length, which we shouldn't have access to when translating a new document.

Thanks for your help!
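In the meantime, the core of such a script is small: encode the source sentence, then decode greedily up to a fixed max_len, so the target-side file is never needed. Below is a self-contained sketch of that loop; the tiny GRU modules are hypothetical stand-ins so the example runs, not the repo's Transformer classes.

import torch
import torch.nn as nn

# Placeholder modules (hypothetical): a real script would load the trained
# XLM encoder/decoder instead.
BOS, EOS, V, H = 0, 1, 32, 16
encoder = nn.GRU(H, H)
decoder = nn.GRUCell(H, H)
embed, project = nn.Embedding(V, H), nn.Linear(H, V)

def greedy_translate(src_ids, max_len=20):
    src = embed(src_ids).unsqueeze(1)        # (src_len, 1, H)
    _, h = encoder(src)                      # final encoder state (1, 1, H)
    h = h.squeeze(0)                         # (1, H)
    tok = torch.full((1,), BOS, dtype=torch.long)
    out = []
    for _ in range(max_len):                 # hard length cap: no reference needed
        h = decoder(embed(tok), h)
        tok = project(h).argmax(dim=-1)      # next token id, shape (1,)
        if tok.item() == EOS:
            break
        out.append(tok.item())
    return out

print(greedy_translate(torch.randint(2, V, (5,))))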

Loss of padding

Hi, do we need to ignore the padding loss when we do back-translation? It seems the code doesn't ignore padding when computing the loss. Thank you very much.

pred_mask = alen[:, None] < len1[None] - 1 # do not predict anything given the last target word
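The quoted mask in fact already keeps padding out of the loss: positions at or beyond len1 - 1 evaluate to False, and every padded position lies beyond the sentence length, so only real target tokens are ever scored. A toy illustration (the tensor names mirror the quoted line; the sizes are made up):

import torch

# A batch of 2 sentences padded to length 6 (illustrative sizes).
slen = 6
len1 = torch.tensor([6, 4])                 # true sentence lengths
alen = torch.arange(slen)                   # positions 0..slen-1
pred_mask = alen[:, None] < len1[None] - 1  # (slen, bs), False on padding
print(pred_mask)
# Indexing the scores with pred_mask drops the padded positions, so they
# never enter the cross-entropy.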

How to save fine-tune models for XNLI task?

Hi,
I ran the XNLI fine-tuning task (with MLM+TLM) and got an average accuracy of 73.5 (compared to 75.1 in your paper). The code generated params.pkl, but I could not find the fine-tuned model. How do I save the model after fine-tuning (or after every epoch of fine-tuning)?
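One way to do this, assuming you can edit the fine-tuning loop, is to dump the state dicts yourself at the end of each epoch. In the sketch below, model (the XLM encoder) and proj (the classification head) are hypothetical handles for whatever modules the XNLI trainer actually holds:

import torch

# Hedged sketch: persist the fine-tuned weights after each epoch.
def save_checkpoint(path, epoch, model, proj):
    torch.save({
        'epoch': epoch,
        'model': model.state_dict(),   # encoder weights
        'proj': proj.state_dict(),     # classifier weights
    }, path)

# e.g. inside the epoch loop (hypothetical call):
# save_checkpoint('checkpoint-epoch%d.pth' % epoch, epoch, model, proj)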

reloading decoder from mlm_1024.pth

Hi, thanks for your work. When I reload the decoder from the pretrained mlm_1024.pth, the following warnings are raised:

WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter layer_norm15.0.weight not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter layer_norm15.0.bias not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.q_lin.weight not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.q_lin.bias not found.
WARNING - 03/03/19 10:07:57 - 0:00:37 - Parameter encoder_attn.0.k_lin.weight not found.
...

if dec_path != '':

My training is for unsupervised NMT. Is this normal? How can I fix it? Thank you very much.
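For context, an MLM checkpoint contains only encoder weights, so decoder-specific parameters (the encoder_attn cross-attention and its layer_norm15) are necessarily absent and get freshly initialized; the warnings are the reload code reporting exactly that. A generic sketch of this tolerant-reload pattern (illustrative, not the repo's exact code; the 'model' key is an assumption about the checkpoint layout):

import torch

def reload_partial(model, checkpoint_path):
    # Copy every parameter that exists in the checkpoint; warn on the rest.
    state = torch.load(checkpoint_path, map_location='cpu')['model']
    own = model.state_dict()
    for name in own:
        if name in state:
            own[name].copy_(state[name])
        else:
            print('Parameter %s not found.' % name)  # decoder-only params land here
    model.load_state_dict(own)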

Question About Performance

The paper reports a best en-fr BLEU of 33.4. The readme.md shows:

epoch -> 7
valid_fr-en_mt_bleu -> 28.36
valid_en-fr_mt_bleu -> 30.50
test_fr-en_mt_bleu -> 34.02
test_en-fr_mt_bleu -> 36.62
Does this result from the max_len parameter, which removes long sentences from the parallel test corpus?

Truecasing

Hi,

did you do truecasing/lowercasing in your MT experiments? I can't find any sign of it in the code.

Is there any specific reason to do / not do it?

Thanks

Subsampling frequent outputs

Hi,

thanks for sharing your code!
I'm just wondering whether you implemented subsampling of frequent outputs (I can't find it in your code) and whether it was crucial for performance.

Cheers,
Stephan

Memory is not released

Hi, when the program ends, the memory on GPU 0 is released, but the memory on the other GPUs is not. Why is that?

Question About Decoder

How does the decoder know which direction to go towards (lang1 or lang2) when the input language is lang1? In other words, how does the decoder know which state it is in, DAE or MT?
In the previous version (UNMT), it used different projection layers. In XLM, self.pred_layer is always the same. @glample
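For reference, XLM steers its shared decoder with language embeddings rather than separate prediction layers: the input to the layer stack is the sum of token, position, and language embeddings, so the same pred_layer can emit either language depending on the language ids fed in. A toy sketch of the idea (module names here are illustrative):

import torch
import torch.nn as nn

emb_dim, n_words, n_langs, n_pos = 16, 100, 2, 32
tok_emb  = nn.Embedding(n_words, emb_dim)
pos_emb  = nn.Embedding(n_pos, emb_dim)
lang_emb = nn.Embedding(n_langs, emb_dim)

tokens = torch.randint(0, n_words, (5, 3))            # (seq_len, batch)
positions = torch.arange(5).unsqueeze(1).expand(5, 3)
langs = torch.full((5, 3), 1, dtype=torch.long)       # target language id

# The same decoder weights produce lang 0 or lang 1 text depending on `langs`.
x = tok_emb(tokens) + pos_emb(positions) + lang_emb(langs)
print(x.shape)  # torch.Size([5, 3, 16])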

Error when using multi-GPU for training MT only

I tried to train a machine translation model using parallel data only. The script I used for training is as follows:

export NGPU=4; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py \
    --exp_name supMT_deen \
    --dump_path ./checkpoints/ \
    --data_path /unsullied/sharefs/zhaoyuekai/data/WMT/corpus/de-en/processed/ \
    --lgs 'de-en' \
    --mt_steps 'de-en' \
    --lambda_mt '0:1,100000:0.1,300000:0' \
    --encoder_only false \
    --emb_dim 1024 \
    --n_layers 6 \
    --n_heads 8 \
    --dropout 0.1 \
    --attention_dropout 0.1 \
    --gelu_activation true \
    --tokens_per_batch 2000 \
    --batch_size 32 \
    --bptt 256 \
    --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001 \
    --epoch_size 200000 \
    --eval_bleu true \
    --stopping_criterion 'valid_en-fr_mt_bleu,10' \
    --validation_metrics 'valid_en-fr_mt_bleu'

When training on only one GPU, no error was reported; however, when I tried to train on 4 GPUs, the following error was encountered (each of the four processes printed the same traceback, interleaved):

Traceback (most recent call last):
  File "train.py", line 341, in <module>
    main(params)
  File "train.py", line 300, in main
    trainer.mt_step(lang1, lang2, params.lambda_mt)
  File "/unsullied/sharefs/zhaoyuekai/data/XLM/config/XLM.active/src/trainer.py", line 770, in mt_step
    self.optimize(loss, ['encoder', 'decoder'])
  File "/unsullied/sharefs/zhaoyuekai/data/XLM/config/XLM.active/src/trainer.py", line 131, in optimize
    loss.backward()
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/tensor.py", line 102, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 445, in distributed_data_parallel_hook
    self._queue_reduction(bucket_idx)
  File "/home/zhaoyuekai/miniconda3/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 475, in _queue_reduction
    self.device_ids)
TypeError: _queue_reduction(): incompatible function arguments. The following argument types are supported:
    1. (process_group: torch.distributed.ProcessGroup, grads_batch: List[List[at::Tensor]], devices: List[int]) -> Tuple[torch.distributed.Work, at::Tensor]

The BLEU score decreases when training unsupervised NMT

Hi, @glample

I pre-trained a language model and used it to initialize unsupervised NMT training, but the BLEU score keeps getting lower and lower. Is there something wrong?

Details:
The language model:
INFO - 03/14/19 16:57:00 - 23:43:04 - ============ End of epoch 11 ============
INFO - 03/14/19 16:57:06 - 23:43:10 - epoch -> 11.000000
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mn_mlm_ppl -> 12.698742
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mn_mlm_acc -> 61.901453
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_zh_mlm_ppl -> 482.045657
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_zh_mlm_acc -> 24.392448
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mlm_ppl -> 247.372200
INFO - 03/14/19 16:57:06 - 23:43:10 - valid_mlm_acc -> 43.146951
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mn_mlm_ppl -> 34.794975
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mn_mlm_acc -> 52.602524
INFO - 03/14/19 16:57:06 - 23:43:10 - test_zh_mlm_ppl -> 124.785448
INFO - 03/14/19 16:57:06 - 23:43:10 - test_zh_mlm_acc -> 34.501062
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mlm_ppl -> 79.790211
INFO - 03/14/19 16:57:06 - 23:43:10 - test_mlm_acc -> 43.551793

Unsupervised NMT:

python3.6.2 train.py --exp_name unsupMT_mnzh --dump_path ./dumped/ --exp_id '190315' --reload_model './dumped/my_mnzh_mlm/190313/best-valid_mlm_ppl.pth,./dumped/my_mnzh_mlm/190313/best-valid_mlm_ppl.pth' --data_path ./data/processed/mn-zh/ --lgs 'mn-zh' --ae_steps 'mn,zh' --bt_steps 'mn-zh-mn,zh-mn-zh' --word_shuffle 3 --word_dropout 0.1 --word_blank 0.1 --lambda_ae '0:1,100000:0.1,300000:0' --encoder_only false --emb_dim 768 --n_layers 6 --n_heads 8 --dropout 0.1 --attention_dropout 0.1 --gelu_activation true --tokens_per_batch 1000 --batch_size 16 --max_batch_size 64 --bptt 256 --optimizer adam_inverse_sqrt,beta1=0.9,beta2=0.98,lr=0.0001,weight_decay=0 --epoch_size 300000 --eval_bleu true --stopping_criterion 'valid_mn-zh_mt_bleu,10' --validation_metrics 'valid_mn-zh_mt_bleu'

INFO - 03/15/19 12:54:23 - 3:17:34 - ============ End of epoch 0 ============
INFO - 03/15/19 12:56:06 - 3:19:16 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.mn-zh.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.valid.txt : 0.180000
INFO - 03/15/19 12:58:15 - 3:21:25 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.zh-mn.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.valid.txt : 2.740000
INFO - 03/15/19 12:58:36 - 3:21:47 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.mn-zh.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.test.txt : 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp0.zh-mn.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.test.txt : 2.160000
INFO - 03/15/19 12:59:01 - 3:22:12 - epoch -> 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_ppl -> 6020.106288
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_acc -> 9.684522
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_mn-zh_mt_bleu -> 0.180000
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_ppl -> 146.305114
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_acc -> 40.263721
INFO - 03/15/19 12:59:01 - 3:22:12 - valid_zh-mn_mt_bleu -> 2.740000
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_ppl -> 6059.479785
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_acc -> 12.168889
INFO - 03/15/19 12:59:01 - 3:22:12 - test_mn-zh_mt_bleu -> 0.000000
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_ppl -> 488.040713
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_acc -> 34.044409
INFO - 03/15/19 12:59:01 - 3:22:12 - test_zh-mn_mt_bleu -> 2.160000

INFO - 03/16/19 06:23:05 - 20:46:16 - ============ End of epoch 5 ============
INFO - 03/16/19 06:25:41 - 20:48:51 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.mn-zh.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.valid.txt : 0.000000
INFO - 03/16/19 06:27:31 - 20:50:41 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.zh-mn.valid.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.valid.txt : 0.280000
INFO - 03/16/19 06:27:58 - 20:51:09 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.mn-zh.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.mn-zh.test.txt : 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - BLEU ./dumped/unsupMT_mnzh/190315/hypotheses/hyp5.zh-mn.test.txt ./dumped/unsupMT_mnzh/190315/hypotheses/ref.zh-mn.test.txt : 0.920000
INFO - 03/16/19 06:28:22 - 20:51:33 - epoch -> 5.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_ppl -> 9263.390210
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_acc -> 7.963293
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_mn-zh_mt_bleu -> 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_ppl -> 195.211674
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_acc -> 36.910448
INFO - 03/16/19 06:28:22 - 20:51:33 - valid_zh-mn_mt_bleu -> 0.280000
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_ppl -> 9938.071239
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_acc -> 6.666667
INFO - 03/16/19 06:28:22 - 20:51:33 - test_mn-zh_mt_bleu -> 0.000000
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_ppl -> 619.158340
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_acc -> 32.541759
INFO - 03/16/19 06:28:22 - 20:51:33 - test_zh-mn_mt_bleu -> 0.920000

Bug with file path in get-data-xnli.sh

Hi, there were a couple of bugs in the "get-data-xnli.sh" script related to file paths. The following is the fix:

  1. Comment out "mkdir -p $XNLI_PATH" (line 29) -- creating this directory up front prevents XNLI-1.0.zip from being downloaded.

  2. Replace
        mkdir -p $PROCESSED_PATH/eval/XNLI
        rm $PROCESSED_PATH/eval/XNLI/*    # fails with "cannot remove...no such file..."
     with
        if [ -d $PROCESSED_PATH/eval/XNLI ]; then
            rm -rf $PROCESSED_PATH/eval/XNLI
        fi
        mkdir -p $PROCESSED_PATH/eval/XNLI

loss.backward is blocked

Hi,
Thanks a lot for the awesome project!
I appended an MLP after the XLM sentence embedding to build a QA model, but after running for a while (single GPU, ~21000 steps, batch size 8) it blocks on the loss.backward step without any error message. When run on 4 GPUs it blocks sooner (around step 4200, 4×8 batch size). Could you please give some hints on how I can fix this?
Thanks a lot!
