qags: Question Answering and Generation for Summarization

This repo contains the code for the paper Asking and Answering Questions to Evaluate the Factual Consistency of Summaries, which appeared at ACL 2020.

Usage

To compute QAGS scores, we need to

  1. generate questions
  2. answer questions
  3. compare answers

1. Generating Questions

Extracting answer candidates

We use an answer-conditional question generation model, so we first need to extract answer candidates. Use the following command, where data_file is a text file containing one example per line and out_dir is the directory to write the processed files to. The script will produce test.txt, test_{n_ans_per_txt}.txt, and test_w_{n_ans_per_txt}ans.txt in out_dir, which respectively contain the examples, the extracted answers, and the answers and examples formatted to feed into the QG model.

python qg_utils.py --command extract_ans \
                   --data_file ${data_file} \
                   --out_dir ${out_dir}
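For example, assuming the model-generated summaries to be evaluated are stored one per line in a file called summaries.txt (a hypothetical path), the call might look like:

python qg_utils.py --command extract_ans \
                   --data_file summaries.txt \
                   --out_dir qg_data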

Generating questions

To generate the questions, we rely on BART finetuned on NewsQA, implemented in fairseq. A frozen version of fairseq for doing so is available in qags/fairseq. Our pretrained QG model is available here.

To generate from these models, we must first preprocess the data (tokenize and binarize) using the command ./fairseq/scripts/aw/preprocess.sh preprocess. In the script, make sure to change dat_dir to point to the directory containing your files. The script expects dat_dir to contain test.src and test.trg, where test.src is the file that will actually be fed into the QG model to generate from; test.trg can be a dummy file with the same number of lines (e.g., a copy of test.src).
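A minimal sketch of setting up this layout, assuming the QG-ready file from the previous step is used as test.src and that dat_dir is a hypothetical directory matching the variable edited inside preprocess.sh:

# lay out the inputs that preprocess.sh expects (paths are placeholders)
mkdir -p ${dat_dir}
cp ${out_dir}/test_w_${n_ans_per_txt}ans.txt ${dat_dir}/test.src   # QG model inputs from qg_utils.py
cp ${dat_dir}/test.src ${dat_dir}/test.trg                         # dummy target with the same number of lines
# after pointing dat_dir inside the script at this directory, tokenize and binarize
./fairseq/scripts/aw/preprocess.sh preprocess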

Then, to generate, use the command ./scripts/aw/gen_sum.sh. Change model_path to point to the pretrained QG checkpoint, data_path to the directory containing the processed data (typically the processed directory created during preprocessing), and out_file to the file to log generations to. Due to a code quirk, in fairseq/fairseq/models/summerization_encoder_only.py, set HACK_PATH (line 107) to point to the best_pretrained_bert.pt checkpoint.
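The generation step then amounts to editing those variables and running the script; a hedged sketch, with placeholder paths, is:

# values to set inside ./scripts/aw/gen_sum.sh (placeholders):
#   model_path=/path/to/qg_checkpoint.pt    # pretrained QG checkpoint
#   data_path=/path/to/processed            # binarized data from preprocess.sh
#   out_file=qg_gen.log                     # file the generations are logged to
# and in fairseq/fairseq/models/summerization_encoder_only.py, line 107:
#   HACK_PATH=/path/to/best_pretrained_bert.pt
./scripts/aw/gen_sum.sh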

Finally, extract the generated questions using

python qg_utils.py --command extract-gen \
                   --data_file ${fseq_log_file} \
                   --out_dir ${out_dir}

which will extract the generations and the corresponding probabilities respectively to gen.txt and prob.txt in out_dir.

2. Answering Questions

To prepare the QA data, use the following command:

python qa_utils.py --command format-qa-data --out_dir tmp \
                   --src_txt_file ${src_txt_file} --gen_txt_file ${gen_txt_file} \
                   --gen_qst_file ${gen_qst_file} --gen_prob_file ${gen_prob_file} 

where gen_{qst/prob}_file are the files generated in the previous step (gen.txt and prob.txt), and {src/gen}_txt_file are respectively the source and model-generated texts (e.g., for summarization, the source articles and the model-generated summaries to be evaluated). As part of this step, we filter questions by quality using a number of heuristics. Most importantly, we enforce answer consistency: we use a QA model to answer the generated questions, and if the predicted answer doesn't match the original answer, we throw out the question. This requires running the QA model on the generated questions, which will produce an answer file. For this first pass, use the flag --use_all_qsts and then run the QA model on the resulting data file.
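A sketch of this first pass, with hypothetical file names standing in for your own data:

# pass 1: format all generated questions, without filtering, for the QA model
python qa_utils.py --command format-qa-data --out_dir tmp \
                   --src_txt_file articles.txt --gen_txt_file summaries.txt \
                   --gen_qst_file gen.txt --gen_prob_file prob.txt \
                   --use_all_qsts
# then answer the questions in the file written to tmp/ with the QA command below,
# which yields the predicted answers used for the consistency check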

Once you have answers for each question, we need to compare the expected and predicted answers, which we do via the flags --use_exp_anss --gen_ans_file ${gen_ans_file} --gen_prd_file ${gen_prd_file}, where the latter two files respectively contain the expected and the predicted answers.
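Presumably this means rerunning the formatting command with the extra flags; a sketch under that assumption, reusing the hypothetical file names from above:

# pass 2: keep only questions whose predicted answer matches the expected answer
python qa_utils.py --command format-qa-data --out_dir tmp \
                   --src_txt_file articles.txt --gen_txt_file summaries.txt \
                   --gen_qst_file gen.txt --gen_prob_file prob.txt \
                   --use_exp_anss \
                   --gen_ans_file ${gen_ans_file} --gen_prd_file ${gen_prd_file}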

To run our QA models, use the following command, which evaluates the model on pred_file and writes the predictions to out_dir/out_file. Our models are based on pytorch-pretrained-BERT (now transformers), and pretrained checkpoints are located here. Make sure model_dir points to the QA model directory. To compute QAGS scores, evaluate the QA model using both the article as context and the summary as context, so you will need to run this command twice; a sketch of the two runs follows the command below.

python finetune_pt_squad.py \
              --bert_model bert-large-uncased \
              --load_model_from_dir ${model_dir} \
              --version_2_with_negative \
              --do_lower_case \
              --do_predict \
              --predict_file ${pred_file} \
              --output_dir ${out_dir} \
              --prediction_file ${out_file} \
              --overwrite_output_dir
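
A sketch of the two runs, using hypothetical names for the question files produced by qa_utils.py and for the prediction files:

# run 1: answer the questions using the source articles as context
python finetune_pt_squad.py \
              --bert_model bert-large-uncased \
              --load_model_from_dir ${model_dir} \
              --version_2_with_negative \
              --do_lower_case \
              --do_predict \
              --predict_file tmp/questions_w_article.json \
              --output_dir ${out_dir} \
              --prediction_file article_answers.json \
              --overwrite_output_dir

# run 2: answer the same questions using the model-generated summaries as context
python finetune_pt_squad.py \
              --bert_model bert-large-uncased \
              --load_model_from_dir ${model_dir} \
              --version_2_with_negative \
              --do_lower_case \
              --do_predict \
              --predict_file tmp/questions_w_summary.json \
              --output_dir ${out_dir} \
              --prediction_file summary_answers.json \
              --overwrite_output_dir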

3. Comparing Answers

Finally, to get the actual QAGS scores, we compare answers. The following command will write the scores to out_dir/qags_scores.txt.

python qa_utils.py --command compute-qags \
                   --src-ans-file ${src_ans_file} \
                   --trg-ans-file ${trg_ans_file} \
                   --out-dir ${out_dir}
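For example, with the hypothetical prediction files from the two QA runs above (article context and summary context):

python qa_utils.py --command compute-qags \
                   --src-ans-file article_answers.json \
                   --trg-ans-file summary_answers.json \
                   --out-dir qags_out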

Data

The crowdsourced annotations of summary sentences we collected are available in data/mturk_{cnndm,xsum}.jsonl. Each line contains an article, a model-generated summary divided into sentences, and three annotations per sentence. Each annotation is a binary judgment of whether or not the summary sentence is factually supported by the article, along with an anonymized annotator ID.
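To inspect the annotation format, you can pretty-print the first record; this assumes only that each line is a standalone JSON object:

# peek at the first crowdsourced annotation record
head -n 1 data/mturk_cnndm.jsonl | python -m json.tool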

For CNNDM, the summarization model is Bottom-Up Summarization (Gehrmann et al., 2018). For XSUM, the summarization model is BART finetuned on the XSUM training data.

Citation

If you use this code or data, please cite us.

@inproceedings{wang2020asking,
   title={Asking and Answering Questions to Evaluate the Factual Consistency of Summaries},
   author={Wang, Alex and Cho, Kyunghyun and Lewis, Mike},
   booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
   publisher={Association for Computational Linguistics},
   year={2020},
   url={http://dx.doi.org/10.18653/v1/2020.acl-main.450},
   doi={10.18653/v1/2020.acl-main.450}
}
