glat's Introduction

GLAT

Implementation for the ACL 2021 paper "Glancing Transformer for Non-Autoregressive Neural Machine Translation".

Requirements

  • Python >= 3.7
  • Pytorch >= 1.5.0
  • Fairseq 1.0.0a0
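
The repository bundles its own Fairseq (1.0.0a0) copy together with train.py and fairseq_cli/, so the commands below are run from the repository root. A minimal environment setup might look like the following sketch; the conda usage and the exact PyTorch build are illustrative assumptions, not part of the original instructions.

# Illustrative setup; pick a PyTorch build that matches your CUDA version.
conda create -n glat python=3.7 -y
conda activate glat
pip install "torch>=1.5.0"
git clone https://github.com/FLC777/GLAT.git
cd GLAT    # run the commands in the following sections from this directory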

Preparation

Train an autoregressive Transformer according to the instructions in Fairseq.

Use the trained autoregressive Transformer to generate target sentences for the training set.
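
A rough sketch of this distillation step (not spelled out in the original README): decode the train split with the trained AT teacher and keep the hypothesis lines as the new target side. Here at_data_dir, at_checkpoint, and the beam size are illustrative placeholders; input_dir, src, and tgt are the variables used in the binarization command below.

at_data_dir=path_to_binarized_data_of_the_AT_teacher
at_checkpoint=path_to_trained_AT_checkpoint
python3 fairseq_cli/generate.py ${at_data_dir} --path ${at_checkpoint} \
    --gen-subset train --beam 5 > train.decode.out
# Keep the BPE segmentation so the distilled text matches the provided dictionaries.
# Extract the hypothesis lines (prefixed with "H-"), restore the original sentence
# order, and pair them with the original source side (train.${src}).
grep ^H- train.decode.out | sort -t- -k2 -n | cut -f3 > ${input_dir}/train.${tgt}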

Binarize the distilled training data.

input_dir=path_to_raw_text_data
data_dir=path_to_binarized_output
src=source_language
tgt=target_language
python3 fairseq_cli/preprocess.py --source-lang ${src} --target-lang ${tgt} --trainpref ${input_dir}/train \
    --validpref ${input_dir}/valid --testpref ${input_dir}/test --destdir ${data_dir}/ \
    --workers 32 --src-dict ${input_dir}/dict.${src}.txt --tgt-dict ${input_dir}/dict.${tgt}.txt

Train

  • For training GLAT
save_path=path_for_saving_models
python3 train.py ${data_dir} --arch glat --noise full_mask --share-all-embeddings \
    --criterion glat_loss --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --stop-min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --optimizer adam --adam-betas '(0.9, 0.999)' \
    --adam-eps 1e-6 --task translation_lev_modified --max-tokens 8192 --weight-decay 0.01 --dropout 0.1 \
    --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --fp16 \
    --max-source-positions 1000 --max-target-positions 1000 --max-update 300000 --seed 0 --clip-norm 5 \
    --save-dir ${save_path} --src-embedding-copy --length-loss-factor 0.05 --log-interval 1000 \
    --eval-bleu --eval-bleu-args '{"iter_decode_max_iter": 0, "iter_decode_with_beam": 1}' \
    --eval-tokenized-bleu --eval-bleu-remove-bpe --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric --decoder-learned-pos --encoder-learned-pos \
    --apply-bert-init --activation-fn gelu --user-dir glat_plugins
  • For training GLAT+CTC
save_path=path_for_saving_models
python3 train.py ${data_dir} --arch glat_ctc --noise full_mask --share-all-embeddings \
    --criterion ctc_loss --label-smoothing 0.1 --lr 5e-4 --warmup-init-lr 1e-7 --stop-min-lr 1e-9 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --optimizer adam --adam-betas '(0.9, 0.999)' \
    --adam-eps 1e-6 --task translation_lev_modified --max-tokens 8192 --weight-decay 0.01 --dropout 0.1 \
    --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 --decoder-embed-dim 512 --fp16 \
    --max-source-positions 1000 --max-target-positions 1000 --max-update 300000 --seed 0 --clip-norm 2 \
    --save-dir ${save_path} --length-loss-factor 0 --log-interval 1000 \
    --eval-bleu --eval-bleu-args '{"iter_decode_max_iter": 0, "iter_decode_with_beam": 1}' \
    --eval-tokenized-bleu --eval-bleu-remove-bpe --best-checkpoint-metric bleu \
    --maximize-best-checkpoint-metric --decoder-learned-pos --encoder-learned-pos \
    --apply-bert-init --activation-fn gelu --user-dir glat_plugins

Inference

  • The default setting without self re-ranking
checkpoint_path=path_to_your_checkpoint
python3 fairseq_cli/generate.py ${data_dir} --path ${checkpoint_path} --user-dir glat_plugins \
    --task translation_lev_modified --remove-bpe --max-sentences 20 --source-lang ${src} --target-lang ${tgt} \
    --quiet --iter-decode-max-iter 0 --iter-decode-eos-penalty 0 --iter-decode-with-beam 1 --gen-subset test
  • Generating with self re-ranking using a beam of 5
checkpoint_path=path_to_your_checkpoint
python3 fairseq_cli/generate.py ${data_dir} --path ${checkpoint_path} --user-dir glat_plugins \
    --task translation_lev_modified --remove-bpe --max-sentences 20 --source-lang ${src} --target-lang ${tgt} \
    --quiet --iter-decode-max-iter 0 --iter-decode-eos-penalty 0 --iter-decode-with-beam 5 --gen-subset test

The script for averaging checkpoints is scripts/average_checkpoints.py.
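
For example, averaging the last few epoch checkpoints could look like this (--inputs, --num-epoch-checkpoints, and --output are the script's standard options; averaging 5 checkpoints is an illustrative choice, not a value taken from the paper):

python3 scripts/average_checkpoints.py --inputs ${save_path} \
    --num-epoch-checkpoints 5 --output ${save_path}/checkpoint_avg.pt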

Thanks to dugu9sword for contributing part of the code.

glat's People

Contributors

FLC777


glat's Issues

KeyError: 'bleu'

@FLC777 Hello, thank you very much for open-sourcing the code. When I run the glat_CTC script, the model can train and save checkpoints, but the following errors are reported from time to time, after which the program stops and is killed (error screenshots were attached to the original issue).

About the CTC/NPD implementation

Hello thanks for your work!

There are three length prediction methods mentioned in your paper:
 ① the method used in Mask-Predict
 ② NPD
 ③ CTC

I noticed that the length prediction method in this repository is the implementation of ①, shown below:

def forward_length_prediction(self, length_out, encoder_out, tgt_tokens=None):

Do you have any plans to release the CTC or NPD implementations?

Thanks a lot!

train command gives error related to 'torch_imputer'

I got this error while running the train command:
2023-09-12 15:47:46.250592: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-12 15:47:47.042422: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Traceback (most recent call last):
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/train.py", line 14, in
cli_main()
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/fairseq_cli/train.py", line 440, in cli_main
parser = options.get_training_parser()
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/fairseq/options.py", line 36, in get_training_parser
parser = get_parser("Trainer", default_task)
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/fairseq/options.py", line 216, in get_parser
utils.import_user_module(usr_args)
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/fairseq/utils.py", line 459, in import_user_module
importlib.import_module(module_name)
File "/home/administrator/anaconda3/envs/itv2/lib/python3.9/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1030, in _gcd_import
File "", line 1007, in _find_and_load
File "", line 986, in _find_and_load_unlocked
File "", line 680, in _load_unlocked
File "", line 850, in exec_module
File "", line 228, in _call_with_frames_removed
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/glat_plugins/init.py", line 2, in
from .models import *
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/glat_plugins/models/init.py", line 2, in
from .glat_ctc import *
File "/media/administrator/0adda20a-df3e-4c0c-a4fc-ef1c1c039914/EXPERIMENTS/GLAT-main/glat_plugins/models/glat_ctc.py", line 19, in
from torch_imputer.imputer import best_alignment
ModuleNotFoundError: No module named 'torch_imputer'

Please let me know how to resolve this.

size mismatch for encoder.embed_tokens.weight

Thanks for your code. An error occurred while attempting to evaluate on the test dataset.

fairseq plugins loaded...
2021-11-12 08:25:29 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': 'glat_plugins', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': '/data/wbxu/GLAT_Checkpoints/checkpoint_best.pt', 'post_process': 'subword_nmt', 'quiet': True, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 2, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 20, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 20, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'checkpoints', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 
'min_len': 1, 'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 0, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': None, 'task': {'_name': 'translation_lev_modified', 'data': '/data/wbxu/DSLP-main/data-bin/wmt14.en-de_kd', 'source_lang': 'en', 'target_lang': 'de', 'load_alignments': False, 'left_pad_source': False, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024, 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': None, 'required_seq_len_multiple': 1, 'eval_bleu': False, 'eval_bleu_args': '{}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args': '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': None, 'eval_bleu_print_samples': False, 'noise': 'random_delete', 'start_p': 0.5, 'minus_p': 0.2, 'total_up': 300000}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2021-11-12 08:25:29 | INFO | fairseq.tasks.translation | [en] dictionary: 39840 types
2021-11-12 08:25:29 | INFO | fairseq.tasks.translation | [de] dictionary: 39840 types
2021-11-12 08:25:29 | INFO | fairseq_cli.generate | loading model(s) from /data/wbxu/GLAT_Checkpoints/checkpoint_best.pt
/home/wbxu/anaconda3/envs/GLATEnv/lib/python3.7/site-packages/omegaconf/omegaconf.py:579: UserWarning: update() merge flag is is not specified, defaulting to False.
For more details, see https://github.com/omry/omegaconf/issues/367
  stacklevel=1,
Traceback (most recent call last):
  File "fairseq_cli/generate.py", line 408, in <module>
    cli_main()
  File "fairseq_cli/generate.py", line 404, in cli_main
    main(args)
  File "fairseq_cli/generate.py", line 49, in main
    return _main(cfg, sys.stdout)
  File "fairseq_cli/generate.py", line 102, in _main
    num_shards=cfg.checkpoint.checkpoint_shard_count,
  File "/data/wbxu/GLAT-main/fairseq/checkpoint_utils.py", line 304, in load_model_ensemble
    state,
  File "/data/wbxu/GLAT-main/fairseq/checkpoint_utils.py", line 358, in load_model_ensemble_and_task
    model.load_state_dict(state["model"], strict=strict, model_cfg=cfg.model)
  File "/data/wbxu/GLAT-main/fairseq/models/fairseq_model.py", line 115, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/home/wbxu/anaconda3/envs/GLATEnv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Glat:
	size mismatch for encoder.embed_tokens.weight: copying a param with shape torch.Size([39847, 512]) from checkpoint, the shape in current model is torch.Size([39840, 512]).
	size mismatch for decoder.embed_tokens.weight: copying a param with shape torch.Size([39847, 512]) from checkpoint, the shape in current model is torch.Size([39840, 512]).
	size mismatch for decoder.output_projection.weight: copying a param with shape torch.Size([39847, 512]) from checkpoint, the shape in current model is torch.Size([39840, 512]).

Problem of inference speed

I tested the inference speed of the GLAT model against an autoregressive model, and the speedup is actually no more than 2x.
Your paper reports about a 15.3x speedup (Table 1); please provide the inference scripts.
Thanks!

backpropagation

Hi, I would like to ask whether replacing the encoder output will have an impact on backpropagation.

about mask_id at line 41 of "glat_ctc.py"

I noticed that the blank_id used in the generator file and in ctc_loss is unk_idx=3. Why is "pad_id=1" used as the masked_id in the "get_align_target" function of "glat_ctc.py"? Shouldn't it be blank_id?

cannot reproduce results on wmt14 ende distill

I followed the instructions but cannot reproduce the results on WMT14 En-De (distilled). With the provided generation script I only get
BLEU4 = 16.65, 48.6/22.0/11.4/6.3 (BP=1.000, ratio=1.014, syslen=65432, reflen=64506)
After re-ranking with an AT model, I get
BLEU4 = 19.93, 52.2/25.6/14.2/8.3 (BP=1.000, ratio=1.013, syslen=65313, reflen=64506)
This is still far from the 26.55 reported in the paper.
I used the provided training script and trained the model on 8 V100 GPUs. The training log shows that the glat accuracy is pretty high at the end of training.
epoch 151: 1050 / 1993 loss=3.323, nll_loss=1.371, glat_accu=0.758, glat_keep=0.073, glat_context_p=0.3, word_ins=3.162, length=3.223, ppl=10.01, wps=132588, ups=2.16, wpb=61302.6, bsz=2018.2, num_updates=300000, lr=5.7735e-05, gnorm=1.455, clip=0, train_wall=46, gb_free=6.8, wall=139238
But the valid loss is also very high.
valid | epoch 151 | valid on 'valid' subset | loss 7.412 | nll_loss 5.888 | glat_accu 0 | glat_keep 0 | glat_context_p 0 | word_ins 7.252 | length 3.209 | ppl 170.31 | wps 242820 | wpb 41551 | bsz 1500 | num_updates 300000 | best_loss 6.511

About Data Set Selection

Hello, while reviewing your code I found that you did not specify which dataset to use. I read your paper and downloaded the dataset mentioned there (e.g., WMT14 En-De). However, for the src-dict and tgt-dict parameters of preprocess.py, I could not find dict.en.txt and dict.de.txt in the dataset. Sorry to bother you, but I hope you can provide the dataset (or the dictionaries).

The argument "--cpu" is not supported

I want to use a device without a GPU to translate the test set, but it seems that a GPU is required even though I passed the "--cpu" argument.
My command is:

$ python3 fairseq_cli/generate.py ${data_dir} --path ${checkpoint_path}/checkpoint_best.pt --user-dir glat_plugins \
>     --task translation_lev_modified --remove-bpe --max-sentences 20 --source-lang ${src} --target-lang ${tgt} \
>     --quiet --iter-decode-max-iter 0 --iter-decode-eos-penalty 0 --iter-decode-with-beam 1 --gen-subset test --save-dir transcheck \
>     --cpu

and it responds:

fairseq plugins loaded...
2021-11-05 20:48:08 | INFO | fairseq_cli.generate | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': True, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'user_dir': 'glat_plugins', 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False}, 'common_eval': {'_name': None, 'path': 'smallmodel/checkpoint_best.pt', 'post_process': 'subword_nmt', 'quiet': True, 'model_overrides': '{}', 'results_path': None}, 'distributed_training': {'_name': None, 'distributed_world_size': 1, 'distributed_rank': 0, 'distributed_backend': 'nccl', 'distributed_init_method': None, 'distributed_port': -1, 'device_id': 0, 'distributed_no_spawn': False, 'ddp_backend': 'pytorch_ddp', 'bucket_cap_mb': 25, 'fix_batches_to_gpus': False, 'find_unused_parameters': False, 'fast_stat_sync': False, 'heartbeat_timeout': -1, 'broadcast_buffers': False, 'slowmo_momentum': None, 'slowmo_algorithm': 'LocalSGD', 'localsgd_frequency': 3, 'nprocs_per_node': 1, 'pipeline_model_parallel': False, 'pipeline_balance': None, 'pipeline_devices': None, 'pipeline_chunks': 0, 'pipeline_encoder_balance': None, 'pipeline_encoder_devices': None, 'pipeline_decoder_balance': None, 'pipeline_decoder_devices': None, 'pipeline_checkpoint': 'never', 'zero_sharding': 'none', 'tpu': False}, 'dataset': {'_name': None, 'num_workers': 1, 'skip_invalid_size_inputs_valid_test': False, 'max_tokens': None, 'batch_size': 20, 'required_batch_size_multiple': 8, 'required_seq_len_multiple': 1, 'dataset_impl': None, 'data_buffer_size': 10, 'train_subset': 'train', 'valid_subset': 'valid', 'validate_interval': 1, 'validate_interval_updates': 0, 'validate_after_updates': 0, 'fixed_validation_seed': None, 'disable_validation': False, 'max_tokens_valid': None, 'batch_size_valid': 20, 'curriculum': 0, 'gen_subset': 'test', 'num_shards': 1, 'shard_id': 0}, 'optimization': {'_name': None, 'max_epoch': 0, 'max_update': 0, 'stop_time_hours': 0.0, 'clip_norm': 0.0, 'sentence_avg': False, 'update_freq': [1], 'lr': [0.25], 'stop_min_lr': -1.0, 'use_bmuf': False}, 'checkpoint': {'_name': None, 'save_dir': 'transcheck', 'restore_file': 'checkpoint_last.pt', 'finetune_from_model': None, 'reset_dataloader': False, 'reset_lr_scheduler': False, 'reset_meters': False, 'reset_optimizer': False, 'optimizer_overrides': '{}', 'save_interval': 1, 'save_interval_updates': 0, 'keep_interval_updates': -1, 'keep_last_epochs': -1, 'keep_best_checkpoints': -1, 'no_save': False, 'no_epoch_checkpoints': False, 'no_last_checkpoints': False, 'no_save_optimizer_state': False, 'best_checkpoint_metric': 'loss', 'maximize_best_checkpoint_metric': False, 'patience': -1, 'checkpoint_suffix': '', 'checkpoint_shard_count': 1, 'load_checkpoint_on_all_dp_ranks': False, 'model_parallel_size': 1, 'distributed_rank': 0}, 'bmuf': {'_name': None, 'block_lr': 1.0, 'block_momentum': 0.875, 'global_sync_iter': 50, 'warmup_iterations': 500, 'use_nbm': False, 'average_sync': False, 'distributed_world_size': 1}, 'generation': {'_name': None, 'beam': 5, 'nbest': 1, 'max_len_a': 0.0, 'max_len_b': 200, 'min_len': 1, 
'match_source_len': False, 'unnormalized': False, 'no_early_stop': False, 'no_beamable_mm': False, 'lenpen': 1.0, 'unkpen': 0.0, 'replace_unk': None, 'sacrebleu': False, 'score_reference': False, 'prefix_size': 0, 'no_repeat_ngram_size': 0, 'sampling': False, 'sampling_topk': -1, 'sampling_topp': -1.0, 'constraints': None, 'temperature': 1.0, 'diverse_beam_groups': -1, 'diverse_beam_strength': 0.5, 'diversity_rate': -1.0, 'print_alignment': None, 'print_step': False, 'lm_path': None, 'lm_weight': 0.0, 'iter_decode_eos_penalty': 0.0, 'iter_decode_max_iter': 0, 'iter_decode_force_max_iter': False, 'iter_decode_with_beam': 1, 'iter_decode_with_external_reranker': False, 'retain_iter_history': False, 'retain_dropout': False, 'retain_dropout_modules': None, 'decoding_format': None, 'no_seed_provided': False}, 'eval_lm': {'_name': None, 'output_word_probs': False, 'output_word_stats': False, 'context_window': 0, 'softmax_batch': 9223372036854775807}, 'interactive': {'_name': None, 'buffer_size': 0, 'input': '-'}, 'model': None, 'task': {'_name': 'translation_lev_modified', 'data': 'cantonese-mandarin/pre-processed', 'source_lang': 'can', 'target_lang': 'man', 'load_alignments': False, 'left_pad_source': False, 'left_pad_target': False, 'max_source_positions': 1024, 'max_target_positions': 1024, 'upsample_primary': -1, 'truncate_source': False, 'num_batch_buckets': 0, 'train_subset': 'train', 'dataset_impl': None, 'required_seq_len_multiple': 1, 'eval_bleu': False, 'eval_bleu_args': '{}', 'eval_bleu_detok': 'space', 'eval_bleu_detok_args': '{}', 'eval_tokenized_bleu': False, 'eval_bleu_remove_bpe': None, 'eval_bleu_print_samples': False, 'noise': 'random_delete', 'start_p': 0.5, 'minus_p': 0.2, 'total_up': 300000}, 'criterion': {'_name': 'cross_entropy', 'sentence_avg': True}, 'optimizer': None, 'lr_scheduler': {'_name': 'fixed', 'force_anneal': None, 'lr_shrink': 0.1, 'warmup_updates': 0, 'lr': [0.25]}, 'scoring': {'_name': 'bleu', 'pad': 1, 'eos': 2, 'unk': 3}, 'bpe': None, 'tokenizer': None}
2021-11-05 20:48:08 | INFO | fairseq.tasks.translation | [can] dictionary: 10168 types
2021-11-05 20:48:08 | INFO | fairseq.tasks.translation | [man] dictionary: 10168 types
2021-11-05 20:48:08 | INFO | fairseq_cli.generate | loading model(s) from smallmodel/checkpoint_best.pt
2021-11-05 20:48:10 | INFO | fairseq.data.data_utils | loaded 1,000 examples from: cantonese-mandarin/pre-processed/test.can-man.can
2021-11-05 20:48:10 | INFO | fairseq.data.data_utils | loaded 1,000 examples from: cantonese-mandarin/pre-processed/test.can-man.man
2021-11-05 20:48:10 | INFO | fairseq.tasks.translation | cantonese-mandarin/pre-processed test can-man 1000 examples
Traceback (most recent call last):                                                                                                                                         
  File "fairseq_cli/generate.py", line 408, in <module>
    cli_main()
  File "fairseq_cli/generate.py", line 404, in cli_main
    main(args)
  File "fairseq_cli/generate.py", line 49, in main
    return _main(cfg, sys.stdout)
  File "fairseq_cli/generate.py", line 206, in _main
    constraints=constraints,
  File "/data/home/db72687/Documents/glat/fairseq/tasks/fairseq_task.py", line 501, in inference_step
    models, sample, prefix_tokens=prefix_tokens, constraints=constraints
  File "/data/home/db72687/anaconda3/envs/glat/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
    return func(*args, **kwargs)
  File "/data/home/db72687/Documents/glat/fairseq/iterative_refinement_generator.py", line 212, in generate
    prev_decoder_out, encoder_out, **decoder_options
  File "/data/home/db72687/Documents/glat/fairseq/models/nat/nonautoregressive_transformer.py", line 138, in forward_decoder
    step=step,
  File "/data/home/db72687/anaconda3/envs/glat/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/data/home/db72687/Documents/glat/fairseq/models/nat/fairseq_nat_model.py", line 46, in wrapper
    self, normalize=normalize, encoder_out=encoder_out, *args, **kwargs
  File "/data/home/db72687/Documents/glat/fairseq/models/nat/nonautoregressive_transformer.py", line 239, in forward
    embedding_copy=(step == 0) & self.src_embedding_copy,
  File "/data/home/db72687/Documents/glat/fairseq/models/nat/nonautoregressive_transformer.py", line 301, in extract_features
    x = cat_x.index_select(dim=0, index=torch.arange(bsz * seq_len).cuda() * 2 +
  File "/data/home/db72687/anaconda3/envs/glat/lib/python3.7/site-packages/torch/cuda/__init__.py", line 149, in _lazy_init
    _check_driver()
  File "/data/home/db72687/anaconda3/envs/glat/lib/python3.7/site-packages/torch/cuda/__init__.py", line 54, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError: 
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

About the CTC implementation

Hi, thanks for your brilliant work!

I have read your paper and want to reproduce the results of GLAT model with CTC loss.

However, I noticed that there is no implementation of the proposed glancing training with the CTC loss, where you use the LCS distance between Y and Ŷ.

I had some trouble implementing this (mainly the selection of Y and the LCS calculation for tensors), and I am wondering whether you could kindly release this part of the code.

Thanks a lot!

Training Speed

Hi, thanks for your awesome work!

When I use the command line here (https://github.com/FLC777/GLAT#train) to train the model on 8 V100 GPUs, the time cost of each epoch increases rapidly (epoch 1: 10 min -> epoch 50: 120 min). Is there something wrong with my command? What am I supposed to do?

Thanks very much!
hemingkx

About CTC

I'd like to ask: can I compute the loss with cross-entropy on the generated ctc_align sequence instead of torch.nn.CTCLoss() in the code?

Training fails with the following error

 File "/GLAT/glat_plugins/criterions/glat_loss.py", line 150, in forward
    utils.item(l["loss"].data / l["factor"])
  File "/GLAT/fairseq/utils.py", line 293, in item
    return tensor.item()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Which part of your code copies the encoder hidden when glancing?

Hi,
I read your paper and code with great interest. You said in the paper that during glancing, the sampled tokens are replaced with target embeddings in the decoder input, while the unsampled tokens use the encoder output.
However, in the code, the unglanced tokens are still masked.
Am I missing something?

       glat_info = None
        if glat and tgt_tokens is not None:
            with torch.no_grad():
                with torch_seed(rand_seed):
                    word_ins_out = self.decoder(
                        normalize=False,
                        prev_output_tokens=prev_output_tokens,
                        encoder_out=encoder_out,
                    )
                pred_tokens = word_ins_out.argmax(-1)
                same_num = ((pred_tokens == tgt_tokens) & nonpad_positions).sum(1)
                input_mask = torch.ones_like(nonpad_positions)
                bsz, seq_len = tgt_tokens.size()
                for li in range(bsz):
                    target_num = (((seq_lens[li] - same_num[li].sum()).float()) * glat['context_p']).long()
                    if target_num > 0:
                        input_mask[li].scatter_(dim=0, index=torch.randperm(seq_lens[li])[:target_num].cuda(), value=0)
                input_mask = input_mask.eq(1)
                input_mask = input_mask.masked_fill(~nonpad_positions,False)
                glat_prev_output_tokens = prev_output_tokens.masked_fill(~input_mask, 0) + tgt_tokens.masked_fill(input_mask, 0)
                # this line here
                
                glat_tgt_tokens = tgt_tokens.masked_fill(~input_mask, self.pad)

                prev_output_tokens, tgt_tokens = glat_prev_output_tokens, glat_tgt_tokens

                glat_info = {
                    "glat_accu": (same_num.sum() / seq_lens.sum()).item(),
                    "glat_context_p": glat['context_p'],
                }
