Evaluating consistency of Question-Answering Models

This repository contains code for creating implications and evaluating the consistency of question-answering models, as described in the following paper:

Are Red Roses Red? Evaluating Consistency of Question-Answering Models
Marco Tulio Ribeiro, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2019

Installation

Clone this repository and cd to the folder:

git clone git@github.com:marcotcr/qa_consistency.git
cd qa_consistency

Create and activate a virtual environment, e.g.:

virtualenv -p python3.6 env
source env/bin/activate

Run the following, replacing [gpu] with [cpu] if you don't have a GPU:

pip install cython numpy
pip install benepar[gpu]
pip install -e .
cd qa_consistency
git clone https://github.com/kelvinguu/qanli.git
cd ..
python -c "import benepar;benepar.download('benepar_en_small')"
python -m spacy download en_core_web_sm

Generating implications:

VQA

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsVQA()
gen.implications('How many birds?', '3')

[('Are there 3 birds ?', 'yes', 'yeseqcount'),
('Are there 4 birds ?', 'no', 'n+1'),
('Are there any birds ?', 'yes', 'ans>0 implies some')]
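Each entry in the output above appears to be a (question, answer, implication type) tuple. As a minimal sketch of working with that output, the snippet below groups implications by their type label. It uses the sample tuples from this example rather than calling the library, and `group_by_type` is a hypothetical helper, not part of qa_consistency.

```python
from collections import defaultdict

# Sample output copied from the VQA example above.
# Assumed layout of each tuple: (question, answer, implication type).
implications = [
    ('Are there 3 birds ?', 'yes', 'yeseqcount'),
    ('Are there 4 birds ?', 'no', 'n+1'),
    ('Are there any birds ?', 'yes', 'ans>0 implies some'),
]


def group_by_type(imps):
    """Group (question, answer) pairs by implication type (hypothetical helper)."""
    grouped = defaultdict(list)
    for question, answer, imp_type in imps:
        grouped[imp_type].append((question, answer))
    return dict(grouped)


by_type = group_by_type(implications)
for imp_type, pairs in by_type.items():
    print(imp_type, pairs)
```

Grouping this way can be handy when you want to inspect or sample implications of a single type before running a model on them.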

SQuAD

import tensorflow as tf
config = tf.ConfigProto()
config.gpu_options.allow_growth=True
sess = tf.Session(config=config)
import qa_consistency.implication
gen = qa_consistency.implication.ImplicationsSquad()
passage = 'Kublai originally named his eldest son, Zhenjin, as the Crown Prince, \
but he died before Kublai in 1285.'
gen.implications('When did Zhenjin die?', '1285', passage)

[('Who died in 1285?', 'Zhenjin', 'subj')]

Evaluating the consistency of models

VQA

Download and extract precomputed implications here. Create a folder for the consistency dataset (CONSISTENCY_FOLDER). Output your model predictions into a json file (PRED_FILE) in the VQA format. Then run:

import pickle
import qa_consistency.dataset_utils

# vqa_path: path to your copy of the VQA dataset
with open('vqa_imps.pkl', 'rb') as f:
    all_imps = pickle.load(f)
vqa = qa_consistency.dataset_utils.load_vqa(vqa_path, 'validation')
# Uncomment the line below if you want vqa v2
# vqa = qa_consistency.dataset_utils.load_vqav2(vqa_path, 'validation')
qa_consistency.dataset_utils.generate_implication_vqa(vqa, PRED_FILE, all_imps, CONSISTENCY_FOLDER)

This will write CONSISTENCY_FOLDER/{questions,annotations}.json. At this point, run your model on these files to generate a new prediction file (CONSISTENCY_PRED_FILE), then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_vqa(CONSISTENCY_FOLDER, CONSISTENCY_PRED_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100* v))
print()
print('Avg  : %.1f' % (100 * stats['all']))
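The VQA steps above expect model predictions in a json file "in the VQA format". The sketch below shows one way to write such a file, assuming this means the results format used by the official VQA evaluation tools: a JSON list of objects with `question_id` and `answer` fields. The helper name, the file name, and the example predictions are all placeholders.

```python
import json
import os
import tempfile


def write_vqa_predictions(predictions, path):
    """Write {question_id: answer} as a VQA-style results list (hypothetical helper).

    Assumed format: [{"question_id": int, "answer": str}, ...].
    """
    results = [{'question_id': qid, 'answer': ans}
               for qid, ans in predictions.items()]
    with open(path, 'w') as f:
        json.dump(results, f)
    return results


# Placeholder predictions; real ones would come from your model.
example_predictions = {1: 'yes', 2: '3'}
pred_file = os.path.join(tempfile.mkdtemp(), 'vqa_predictions.json')
written = write_vqa_predictions(example_predictions, pred_file)
print(written)
```

The same helper could be reused for the second prediction file (CONSISTENCY_PRED_FILE), since both are described as model predictions on VQA-style question files.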

SQuAD

Download and extract precomputed implications here. Let SQUAD_PATH be the path to the original SQuAD dev set json (dev-v1.1.json), and PRED_FILE be your model's predictions json on the dev set, in the official SQuAD format (dictionary of id : answer). Run:

import pickle
import qa_consistency.dataset_utils

with open('squad_imps.pkl', 'rb') as f:
    all_imps = pickle.load(f)
qa_consistency.dataset_utils.generate_implication_squad(
    SQUAD_PATH, PRED_FILE, all_imps, NEW_SQUAD_JSON)

This will generate a new dataset in the SQuAD format at the NEW_SQUAD_JSON path. At this point, run your model on this file to generate a new prediction file (CONSISTENCY_PRED_FILE), then run:

stats = qa_consistency.dataset_utils.evaluate_consistency_squad(NEW_SQUAD_JSON, CONSISTENCY_PRED_FILE)
print('Consistency by implication type:')
print()
for x, v in stats.items():
    if x == 'all':
        continue
    print('%s : %.1f' % (x, 100* v))
print()
print('Avg  : %.1f' % (100 * stats['all']))
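The SQuAD steps above take predictions as a json dictionary of id : answer. As a minimal sketch, a file in that shape could be written as follows. The helper name, the file name, and the question ids are placeholders, not real SQuAD ids.

```python
import json
import os
import tempfile


def write_squad_predictions(predictions, path):
    """Write a {question id: answer text} dictionary to a json file (hypothetical helper)."""
    for qid, answer in predictions.items():
        # Both ids and answers are expected to be plain strings.
        if not isinstance(qid, str) or not isinstance(answer, str):
            raise TypeError('ids and answers must be strings')
    with open(path, 'w') as f:
        json.dump(predictions, f)


# Placeholder ids and answers; real ones would come from your model's
# output on dev-v1.1.json.
example_predictions = {
    'placeholder-id-1': '1285',
    'placeholder-id-2': 'Zhenjin',
}
squad_pred_file = os.path.join(tempfile.mkdtemp(), 'squad_predictions.json')
write_squad_predictions(example_predictions, squad_pred_file)
```

The type check is only a convenience: it catches the common mistake of writing integer ids or non-string answers before the file is handed to the evaluation code.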

Notebooks where we bring it all together

Code of Conduct

Microsoft Open Source Code of Conduct
