LaserTagger

LaserTagger is a text-editing model which predicts a sequence of token-level edit operations to transform a source text into a target text. The model currently supports four different edit operations:

  1. Keep the token.
  2. Delete the token.
  3. Add a phrase before the token.
  4. Swap the order of input sentences (if there are two of them).

Operation 3 can be combined with operations 1 and 2. Compared to sequence-to-sequence models, LaserTagger is (1) less prone to hallucination, (2) more data-efficient, and (3) faster at inference time.
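
For illustration, consider fusing two sentences into one, assuming the phrase "and" is in the model's phrase vocabulary (this is a hypothetical example; the TAG|phrase notation means the phrase is added before the token, combining operation 3 with 1 or 2):

Source: Turing was born in 1912 . Turing died in 1954 .
Tags:   KEEP KEEP KEEP KEEP KEEP DELETE|and DELETE KEEP KEEP KEEP KEEP
Output: Turing was born in 1912 and died in 1954 .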

A detailed method description and evaluation can be found in our EMNLP'19 paper: https://arxiv.org/abs/1909.01187

LaserTagger is built on Python 3, TensorFlow 1.x, and BERT. It runs on CPU, GPU, and Cloud TPU.

Usage Instructions

Running an experiment with LaserTagger consists of the following steps:

  1. Optimize the vocabulary of phrases that can be added by LaserTagger.
  2. Convert target texts into target tag sequences.
  3. Finetune a pretrained BERT model to predict the tags.
  4. Compute predictions.
  5. Evaluate the predictions.

Next we go through these steps, using the Split-and-Rephrase (WikiSplit) task as a running example.

You can run all of the steps with

sh run_wikisplit_experiment.sh

after setting the paths at the beginning of the script.

Note: Before applying LaserTagger, the text should be tokenized, with spaces separating the tokens.
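
For example, a minimal pre-tokenization sketch (assuming the nltk package; any tokenizer that separates punctuation from words with spaces works equally well):

import nltk

nltk.download('punkt', quiet=True)  # One-time download of the tokenizer models.

def pretokenize(text):
    """Returns `text` with tokens separated by single spaces."""
    return ' '.join(nltk.word_tokenize(text))

print(pretokenize('Turing was born in 1912. Turing died in 1954.'))
# -> Turing was born in 1912 . Turing died in 1954 .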

1. Phrase Vocabulary Optimization

Download the WikiSplit dataset and run the following command to find a set of phrases that the model is allowed to add.

export WIKISPLIT_DIR=/path/to/wikisplit
export OUTPUT_DIR=/path/to/output

python phrase_vocabulary_optimization.py \
  --input_file=${WIKISPLIT_DIR}/train.tsv \
  --input_format=wikisplit \
  --vocabulary_size=500 \
  --max_input_examples=1000000 \
  --output_file=${OUTPUT_DIR}/label_map.txt
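
The resulting label_map.txt lists the model's output tags, one per line. Its exact contents depend on the data, but it should look roughly like the following (an illustrative sketch; the phrases shown are assumptions, and combined tags use the TAG|phrase notation):

KEEP
DELETE
KEEP|and
DELETE|and
KEEP|,
DELETE|,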

Note that you can also set max_input_examples to a smaller value to get a reasonable vocabulary, but in the case of WikiSplit you should then shuffle the dataset rows first: the rows are in alphabetical order, so taking the first k of them might not give you a representative sample of the data.
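
A minimal shuffling sketch in Python (file names are placeholders; a one-liner like "shuf train.tsv > train_shuffled.tsv" does the same):

import random

# Shuffle the WikiSplit rows so that truncating to --max_input_examples
# yields a representative sample.
random.seed(0)  # Fix the seed for a reproducible sample.
with open('train.tsv') as f:
    lines = f.readlines()
random.shuffle(lines)
with open('train_shuffled.tsv', 'w') as f:
    f.writelines(lines)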

2. Converting Target Texts to Tags

Download a pretrained BERT model from the official repository. We've used the 12-layer "BERT-Base, Cased" model for all of our experiments. Then convert the original TSV datasets into TFRecord format.

export BERT_BASE_DIR=/path/to/cased_L-12_H-768_A-12

python preprocess_main.py \
  --input_file=${WIKISPLIT_DIR}/tune.tsv \
  --input_format=wikisplit \
  --output_tfrecord=${OUTPUT_DIR}/tune.tf_record \
  --label_map_file=${OUTPUT_DIR}/label_map.txt \
  --vocab_file=${BERT_BASE_DIR}/vocab.txt \
  --output_arbitrary_targets_for_infeasible_examples=true

python preprocess_main.py \
    --input_file=${WIKISPLIT_DIR}/train.tsv \
    --input_format=wikisplit \
    --output_tfrecord=${OUTPUT_DIR}/train.tf_record \
    --label_map_file=${OUTPUT_DIR}/label_map.txt \
    --vocab_file=${BERT_BASE_DIR}/vocab.txt \
    --output_arbitrary_targets_for_infeasible_examples=false
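
preprocess_main.py also writes the number of converted examples to "*.num_examples" side files (used in step 3 below). If you want to double-check a TFRecord yourself, a short TF 1.x sketch will do (the path is a placeholder):

import tensorflow as tf

# Count the records written to a TFRecord file (TF 1.x API).
count = sum(1 for _ in tf.python_io.tf_record_iterator('train.tf_record'))
print(count)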

3. Model Training

Model hyperparameters are specified in lasertagger_config.json. This configuration file extends bert_config.json, which comes with the zipped pretrained BERT model.

Note that if you want to switch from LaserTagger_FF to LaserTagger_AR, you should set "use_t2t_decoder": true in the LaserTagger config. The autoregressive (AR) model is usually more accurate, whereas the feedforward (FF) model runs inference faster.
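
For example, a minimal illustrative fragment of such a config (only use_t2t_decoder comes from this README; the surrounding BERT fields are assumptions inherited from bert_config.json):

{
  "hidden_size": 768,
  "num_hidden_layers": 12,
  "use_t2t_decoder": true
}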

Train the model on CPU/GPU.

# Check these numbers from the "*.num_examples" files created in step 2.
export NUM_TRAIN_EXAMPLES=310922
export NUM_EVAL_EXAMPLES=5000
export CONFIG_FILE=configs/lasertagger_config.json
export EXPERIMENT=wikisplit_experiment_name

python run_lasertagger.py \
  --training_file=${OUTPUT_DIR}/train.tf_record \
  --eval_file=${OUTPUT_DIR}/tune.tf_record \
  --label_map_file=${OUTPUT_DIR}/label_map.txt \
  --model_config_file=${CONFIG_FILE} \
  --output_dir=${OUTPUT_DIR}/models/${EXPERIMENT} \
  --init_checkpoint=${BERT_BASE_DIR}/bert_model.ckpt \
  --do_train=true \
  --do_eval=true \
  --train_batch_size=256 \
  --save_checkpoints_steps=500 \
  --num_train_examples=${NUM_TRAIN_EXAMPLES} \
  --num_eval_examples=${NUM_EVAL_EXAMPLES}

To train on Cloud TPU, you should additionally set:

  --use_tpu=true \
  --tpu_name=${TPU_NAME}

Please see the BERT TPU instructions and the Google Cloud TPU tutorial for details on using Cloud TPUs.

4. Prediction

First, export your model:

python run_lasertagger.py \
  --label_map_file=${OUTPUT_DIR}/label_map.txt \
  --model_config_file=${CONFIG_FILE} \
  --output_dir=${OUTPUT_DIR}/models/${EXPERIMENT} \
  --do_export=true \
  --export_path=${OUTPUT_DIR}/models/${EXPERIMENT}/export

You can additionally set --init_checkpoint to specify which checkpoint to export (by default, the latest checkpoint in the model directory is exported).
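
To see which checkpoint would be exported by default, you can query the model directory with the standard TF 1.x checkpoint utility (a sketch; the path is a placeholder):

import tensorflow as tf

# Prints the path of the most recent checkpoint, e.g. ".../model.ckpt-<step>".
print(tf.train.latest_checkpoint('/path/to/output/models/wikisplit_experiment_name'))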

Compute the predicted tags and realize the output text with:

export SAVED_MODEL_DIR=/path/to/exported/model
export PREDICTION_FILE=${OUTPUT_DIR}/models/${EXPERIMENT}/pred.tsv

python predict_main.py \
  --input_file=${WIKISPLIT_DIR}/validation.tsv \
  --input_format=wikisplit \
  --output_file=${PREDICTION_FILE} \
  --label_map_file=${OUTPUT_DIR}/label_map.txt \
  --vocab_file=${BERT_BASE_DIR}/vocab.txt \
  --saved_model=${SAVED_MODEL_DIR}

Note that the above runs inference with a batch size of 1, so it is not optimal in terms of inference time.

5. Evaluation

Compute the evaluation scores.

python score_main.py --prediction_file=${PREDICTION_FILE}

Example output:

Exact score:     15.220
SARI score:      61.668
 KEEP score:     93.059
 ADDITION score: 32.168
 DELETION score: 59.778

How to Cite LaserTagger

@inproceedings{malmi2019lasertagger,
  title={Encode, Tag, Realize: High-Precision Text Editing},
  author={Eric Malmi and Sebastian Krause and Sascha Rothe and Daniil Mirylenka and Aliaksei Severyn},
  booktitle={EMNLP-IJCNLP},
  year={2019}
}

License

Apache 2.0; see LICENSE for details.

Disclaimer

This repository contains a Python reimplementation of the original C++ code used for the paper, so some discrepancies from the published results are possible. However, we've verified that we get similar results on the WikiSplit dataset.

This is not an official Google product.


Issues

Problem when training on GPU

I have tried to train the model on both CPU and GPU. It works well on CPU, but I get the following error when running on GPU:

2021-05-03 09:56:44.869207: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ************************__
2021-05-03 09:56:44.869262: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
ERROR:tensorflow:Error recorded from training_loop: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_0/intermediate/dense/MatMul (defined at /home/kieuttt/anaconda3/envs/lasertagger/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss/Mean/_4031]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[32768,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node bert/encoder/layer_0/intermediate/dense/MatMul (defined at /home/kieuttt/anaconda3/envs/lasertagger/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Does anyone else get this or know how to fix it?
Thank you.

Why does my model only predict KEEP tags?

Hi,
I used your code to train a model for the GEC task, after upgrading the code to TensorFlow 2.x. But at the prediction step, i.e. when I run inference, the model only predicts [KEEP, KEEP, ..., KEEP]. I used your original dataset and also got all KEEP on Lang-8 English with 200k examples; I trained 3 epochs (170k steps), and only the models at step 0 and step 100 predicted something other than all KEEP (obviously random output).
The same problem arises with LaserTagger_FF. My label map size is 2002 and I followed the steps exactly; whether trained on the original few-shot dataset or on Lang-8, the model behaves the same, predicting all KEEP with no change to the source.

CUBLAS_STATUS_EXECUTION_FAILED when running on GPU.

LaserTagger works well on CPU. However, when running it on GPU, I get the error report below. How can I solve this problem?

2021-10-21 22:10:44.179433: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(128, 2), b.shape=(2, 768), m=128, n=768, k=2
[[{{node bert/embeddings/MatMul}}]]
[[loss/Cast/_429]]
(1) Internal: Blas GEMM launch failed : a.shape=(128, 2), b.shape=(2, 768), m=128, n=768, k=2
[[{{node bert/embeddings/MatMul}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/hunan/fever/depend/lasertagger/predict_main.py", line 97, in
app.run(main)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/hunan/fever/depend/lasertagger/predict_main.py", line 90, in main
prediction = predictor.predict(sources)
File "/home/hunan/fever/depend/lasertagger/predict_utils.py", line 57, in predict
out = self._predictor({key: [example.features[key]] for key in keys})
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/contrib/predictor/predictor.py", line 77, in call
return self._session.run(fetches=self.fetch_tensors, feed_dict=feed_dict)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/hunan/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(128, 2), b.shape=(2, 768), m=128, n=768, k=2
[[node bert/embeddings/MatMul (defined at /miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[loss/Cast/_429]]
(1) Internal: Blas GEMM launch failed : a.shape=(128, 2), b.shape=(2, 768), m=128, n=768, k=2
[[node bert/embeddings/MatMul (defined at /miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'bert/embeddings/MatMul':
File "/fever/depend/lasertagger/predict_main.py", line 97, in
app.run(main)
File "/miniconda3/envs/google/lib/python3.7/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/miniconda3/envs/google/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/fever/depend/lasertagger/predict_main.py", line 79, in main
tf.contrib.predictor.from_saved_model(FLAGS.saved_model), builder,
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/contrib/predictor/predictor_factories.py", line 153, in from_saved_model
config=config)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/contrib/predictor/saved_model_predictor.py", line 153, in init
loader.load(self._session, tags.split(','), export_dir)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/saved_model/loader_impl.py", line 269, in load
return loader.load(sess, tags, import_scope, **saver_kwargs)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/saved_model/loader_impl.py", line 422, in load
**saver_kwargs)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/saved_model/loader_impl.py", line 352, in load_graph
meta_graph_def, import_scope=import_scope, **saver_kwargs)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/training/saver.py", line 1477, in _import_meta_graph_with_return_elements
**kwargs))
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/meta_graph.py", line 809, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
producer_op_list=producer_op_list)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 517, in _import_graph_def_internal
_ProcessNewOps(graph)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 243, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3561, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3561, in
for c_op in c_api_util.new_tf_operations(self)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 3451, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "/miniconda3/envs/google/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()

Permission denied error while replicating

I am unable to replicate the paper in Colab; the error below occurs at line 7. How do I fix it?

 5 assert BUCKET, 'Must specify an existing GCS bucket name'
      6 OUTPUT_DIR = 'gs://{}/lasertagger/{}'.format(BUCKET, TASK)
----> 7 tf.gfile.MakeDirs(OUTPUT_DIR)
      8 print('***** Model output directory: {} *****'.format(OUTPUT_DIR))
      
PermissionDeniedError: Error executing an HTTP request: HTTP response code 403 with body '{
  "error": {
    "code": 403,
    "message": "[email protected] does not have storage.objects.get access to contextualexp/lasertagger/wikisplit.",
    "errors": [
      {
        "message": "[email protected] does not have storage.objects.get access to contextualexp/lasertagger/wikisplit.",
        "domain": "global",
        "reason": "forbidden"
      }
    ]
  }
}
'
	 when reading metadata of gs://contextualexp/lasertagger/wikisplit

Fine-tuned models request

Are there any models available for Abstractive Summarization and Grammatical Error Correction?

It would also be good to have a utils file for parsing the MSR Abstractive Text Compression Dataset and one for the Low Resource Track of the GEC task.

Why can't it train?

The terminal shows no error, only this:

zsh: killed python run_lasertagger.py --training_file=${OUTPUT_DIR}/train.tf_record

and then it finishes.

Lasertagger Colab Notebook request

The goal is to run lasertagger in a Google Colab notebook, similar to the BERT finetuning notebooks.

A few issues involved:

  • Requirements -- Install BERT
  • Colab works only with TF 1.x for TPU support
  • Default batch size of 256 results in an OOM error on the Colab K80 GPU
  • Add support for training / export using TFHub / Google Cloud.

Problem with pkg-resources==0.0.0 when installing requirements.txt

I installed the packages with: pip install -r requirements.txt.
Then I got this error:

ERROR: Could not find a version that satisfies the requirement pkg-resources==0.0.0
ERROR: No matching distribution found for pkg-resources==0.0.0

Does anyone else get this or know how to fix it?
Thanks a lot.

Why is the predicted Exact score 0.00?

I set NUM_EPOCHS=100000.0 and ran the run_wikisplit_experiment.sh file; the final results are as follows:

Exact score: 0.000
SARI score: 16.101
KEEP score: 20.214
ADDITION score: 1.225
DELETION score: 26.865

Why is the Exact score 0.00, and why are the other values so different from the published results?
How should I modify the network parameters?

Output files (e.g. label map) not found

Error:
tensorflow.python.framework.errors_impl.NotFoundError: output/label_map.txt.log; No such file or directory

I followed the instructions to the letter: I downloaded WikiSplit and BERT (using the links from run_wikisplit_experiment.sh) and then set OUTPUT_DIR=output.

Tail of the log:

I0609 12:19:06.618008 4580216256 phrase_vocabulary_optimization.py:190] 988000 examples processed.
I0609 12:19:07.309025 4580216256 phrase_vocabulary_optimization.py:190] 989000 examples processed.
I0609 12:19:07.954758 4580216256 phrase_vocabulary_optimization.py:202] 989944 examples processed.

Traceback (most recent call last):
  File "phrase_vocabulary_optimization.py", line 286, in <module>
    app.run(main)
  File "/Users/zheka/Library/Python/3.7/lib/python/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/zheka/Library/Python/3.7/lib/python/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "phrase_vocabulary_optimization.py", line 266, in main
    stats_writer.write('Idx\tFrequency\tCoverage (%)\tPhrase\n')
  File "/Users/zheka/Library/Python/3.7/lib/python/site-packages/tensorflow_core/python/lib/io/file_io.py", line 106, in write
    self._prewrite_check()
  File "/Users/zheka/Library/Python/3.7/lib/python/site-packages/tensorflow_core/python/lib/io/file_io.py", line 92, in _prewrite_check
    compat.as_bytes(self.__name), compat.as_bytes(self.__mode))
tensorflow.python.framework.errors_impl.NotFoundError: output/label_map.txt.log; No such file or directory

Can tagging_converter support consecutive DELETE|add_phrases?

Hi, in the following case I want to convert the target into tags. I would expect the result to be "DELETE|hi--DELETE|hi", but it cannot be converted:
import tagging
import tagging_converter

input_texts, target = ['hello hello .'], 'hi hi .'  # ['hello Tim hello .'], 'hi Tim hi' works fine.
phrase_vocabulary = ['hi']
task = tagging.EditingTask(input_texts)
converter = tagging_converter.TaggingConverter(phrase_vocabulary, True)
tags = converter.compute_tags(task, target)
Looking forward to your reply. Thanks.

MSR Summarization Example

Can you provide a detailed example for the MSR dataset? I run into the same problem as here: #5 (comment)

I don't understand how to tokenize this dataset. Can it be tokenized with BERT's FullTokenizer?

Please help if you can.
