
ActiveQA: Active Question Answering

This repo contains code for our paper Ask the Right Questions: Active Question Reformulation with Reinforcement Learning.

A small forewarning: this is still much more of a research codebase than a library, and no support is provided.

If you use this code for your research, please cite the paper.

Introduction

ActiveQA is an agent that transforms questions online in order to find the best answers. The agent consists of a TensorFlow model that reformulates questions and an answer selection model. It interacts with an environment that contains a question-answering system. The agent queries the environment with variants of a question and scores each returned answer against the original question. The model is trained end-to-end using reinforcement learning.

This version addresses the SearchQA question-answering task, and the environment consists of the Bi-directional Attention Flow (BiDAF) model of Seo et al. (2017).
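
As a rough mental model, the interaction loop looks like the sketch below. This is illustrative Python only; the function names are placeholders, not the repository's actual API.

# Conceptual sketch of the ActiveQA loop. The callables are placeholders:
# `reformulate` stands in for the reformulator model, `query_environment`
# for the BiDAF-based environment, and `score` for the answer selector.
def active_qa(question, reformulate, query_environment, score):
    best_answer, best_score = None, float('-inf')
    for rewrite in reformulate(question):       # propose query variants
        answer = query_environment(rewrite)     # ask the QA environment
        s = score(question, answer)             # score the answer against the original question
        if s > best_score:
            best_answer, best_score = answer, s
    return best_answer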

Setup

Dependencies

We require TensorFlow and a number of other supporting libraries. TensorFlow should be installed separately by following its installation docs. To install the other dependencies, run:

pip install -r requirements.txt

Note: we have only run this code with Python 2; Python 3 is not officially supported.

Data

Download the source data (the SearchQA dataset, GloVe vectors, and the NLTK corpus) and save it in $HOME/data.

export DATA_DIR=$HOME/data
mkdir $DATA_DIR

Download

Download the SearchQA dataset (~600 MB) for training, testing, and validation here: https://drive.google.com/open?id=1OxRhw81g7amW3aBd_iu2By5THysgr2uv

<Download the dataset to $DATA_DIR/SearchQA.zip>
unzip $DATA_DIR/SearchQA.zip -d $DATA_DIR
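
Alternatively, you can fetch the file programmatically with something like the snippet below. This assumes the third-party gdown package is installed (it is not part of requirements.txt); the file id is taken from the Drive link above.

# Hypothetical helper: download the SearchQA zip from Google Drive with gdown.
import os
import gdown

data_dir = os.path.expandvars('$HOME/data')
gdown.download(
    'https://drive.google.com/uc?id=1OxRhw81g7amW3aBd_iu2By5THysgr2uv',
    os.path.join(data_dir, 'SearchQA.zip'),
    quiet=False)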

Download GloVe (~850 MB):

export GLOVE_DIR=$DATA_DIR/glove
mkdir $GLOVE_DIR

wget -c http://nlp.stanford.edu/data/glove.6B.zip -O $GLOVE_DIR/glove.6B.zip
unzip $GLOVE_DIR/glove.6B.zip -d $GLOVE_DIR

Download the NLTK punkt data (used by the tokenizer). Make sure nltk is installed first!

python -m nltk.downloader -d $HOME/nltk_data punkt

Download the reformulator model pretrained on UN+Paralex datasets (~140 MB):

export PRETRAINED_DIR=$DATA_DIR/pretrained
mkdir $PRETRAINED_DIR

wget -c https://storage.googleapis.com/pretrained_models/translate.ckpt-1460356.zip -O $PRETRAINED_DIR/translate.ckpt-1460356.zip
unzip $PRETRAINED_DIR/translate.ckpt-1460356.zip -d $PRETRAINED_DIR

Preprocess

The SearchQA dataset requires two preprocessing steps:

  1. Convert the data into SQuAD format, since the model only works with that format.

    export SQUAD_DIR=$DATA_DIR/squad
    mkdir $SQUAD_DIR
    
    python -m searchqa.prepro \
    --searchqa_dir=$DATA_DIR/SearchQA \
    --squad_dir=$SQUAD_DIR
    
  2. Preprocess the SQuAD-formatted SearchQA data (together with the GloVe vectors) and save the result in $PWD/data/squad (~60 minutes):

    python -m third_party.bi_att_flow.squad.prepro \
    --glove_dir=$GLOVE_DIR \
    --source_dir=$SQUAD_DIR
    

Note that Python 2 and Python 3 handle Unicode differently, so the preprocessing output differs between versions. Either version can be used to convert the SearchQA format to SQuAD format; use Python 3 for other datasets.
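
A quick way to confirm that both steps produced what the training commands below expect is to check for the output files. Paths here assume the environment variables used in this README ($SQUAD_DIR for step 1, $PWD/data/squad for step 2).

# Sanity check: these are the files referenced later by the training commands.
import os

squad_dir = os.path.expandvars('$SQUAD_DIR')
expected = [
    os.path.join(squad_dir, 'train-questions.txt'),
    os.path.join(squad_dir, 'train-annotation.txt'),
    os.path.join(squad_dir, 'dev-questions.txt'),
    os.path.join(squad_dir, 'dev-annotation.txt'),
    'data/squad/data_train.json',
    'data/squad/data_dev.json',
]
for path in expected:
    print('{}: {}'.format(path, 'OK' if os.path.exists(path) else 'MISSING'))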

gRPC

We need to compile the gRPC interface for the Environment Server.

chmod +x compile_protos.sh; ./compile_protos.sh
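
compile_protos.sh is the supported way to do this. For reference only, proto compilation for this layout roughly amounts to the following sketch, assuming the .proto definitions live under px/proto (the package the generated *_pb2 modules are imported from) and that grpcio-tools is installed.

# Sketch of gRPC/protobuf compilation with grpcio-tools; compile_protos.sh
# remains the authoritative script.
import glob
from grpc_tools import protoc

for proto_file in glob.glob('px/proto/*.proto'):
    protoc.main([
        'grpc_tools.protoc',
        '-I.',
        '--python_out=.',
        '--grpc_python_out=.',
        proto_file,
    ])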

Run Environment Server

The training requires running the environment gRPC server, which receives queries from the ActiveQA agent and sends back one response per query.

python -m px.environments.bidaf_server \
--port=10000 \
--squad_data_dir=data/squad \
--bidaf_shared_file=data/bidaf/shared.json \
--bidaf_model_dir=data/bidaf/
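
Before launching training, you can check that the server is reachable. The snippet below mirrors the readiness check the training client itself performs in px/nmt/environment_client.py (see the gRPC timeout issue further down).

# Verify the environment gRPC server is up on localhost:10000.
import grpc

channel = grpc.insecure_channel('localhost:10000')
# Raises grpc.FutureTimeoutError if the server is not reachable within 30s.
grpc.channel_ready_future(channel).result(timeout=30)
print('environment server is ready')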

The checkpoint of a BiDAF model trained on SearchQA is already provided in data/bidaf, so you don't have to train one yourself. However, if you want to reproduce our training, clone the BiDAF repository and run

python basic/cli.py \
--mode=train \
--data_dir=data/squad \
--shared_path=data/bidaf/shared.json \
--init_lr=0.001 \
--num_steps=14000

Reformulator Training

We first train the reformulator, starting from a model pretrained on the UN and Paralex datasets. It should take about a week on a single P100 GPU to reach an F1 score of ~42 on SearchQA's dev set.

export OUT_DIR=/tmp/active-qa
mkdir $OUT_DIR

export REFORMULATOR_DIR=$OUT_DIR/reformulator
mkdir $REFORMULATOR_DIR

echo "model_checkpoint_path: \"$PRETRAINED_DIR/translate.ckpt-1460356\"" > checkpoint
cp -f checkpoint $REFORMULATOR_DIR
cp -f checkpoint $REFORMULATOR_DIR/initial_checkpoint.txt

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=true \
--enable_selector_training=false \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

Note: if you don't want to wait a week for training, you can download this checkpoint of the reformulator trained on SearchQA, which reaches a dev-set F1 score of 42.5. Note that this is not the exact model analyzed in the paper, but one with equivalent performance.
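
Once a reformulator checkpoint is available (trained or downloaded), you can try it interactively. The snippet below is adapted from a usage example in the issues section further down; the environment gRPC server must be running, out_dir should point at your reformulator checkpoint directory, and the example question is arbitrary.

from px.nmt import reformulator
from px.proto import reformulator_pb2

reformulator_instance = reformulator.Reformulator(
    hparams_path='px/nmt/example_configs/reformulator.json',
    source_prefix='<en> <2en> ',
    out_dir='/tmp/active-qa/reformulator',  # i.e. $REFORMULATOR_DIR
    environment_server_address='localhost:10000')

# GREEDY returns a single rewrite per question; use BEAM for multiple rewrites.
responses = reformulator_instance.reformulate(
    questions=['what year did the first iclr conference take place?'],
    inference_mode=reformulator_pb2.ReformulatorRequest.GREEDY)

print([r[0].reformulation for r in responses])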

Selector Training

After training the reformulator, we can now train the selector. It should take 2-3 days on a single P100 GPU to reach ~47.5 F1 score on SearchQA's dev set.

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=false \
--enable_selector_training=true \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--batch_size_train=16 \
--batch_size_eval=64 \
--save_path=$OUT_DIR/selector \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

Note: If you don't want to wait 2-3 days for the training to finish, you can download a checkpoint of the selector. The checkpoint is trained on SearchQA, achieving an F1 score of ~47.5 on the dev set.

References

This repository relies on the work of the following repositories:

  • Bi-directional Attention Flow (BiDAF) by Seo et al. (2017), vendored here under third_party/bi_att_flow

and uses data from the following sources:

  • SearchQA
  • GloVe
  • NLTK (punkt tokenizer)
  • the UN and Paralex datasets (used to pretrain the reformulator)

BibTeX

@inproceedings{buck18,
  author    = {Christian Buck and
               Jannis Bulian and
               Massimiliano Ciaramita and
               Andrea Gesmundo and
               Neil Houlsby and
               Wojciech Gajewski and
               Wei Wang},
  title     = {Ask the Right Questions: Active Question Reformulation with Reinforcement
               Learning},
  booktitle = {Sixth International Conference on Learning Representations (ICLR)},
  year      = {2018},
  month     = {May},
  address   = {Vancouver, Canada},
  url       = {https://openreview.net/forum?id=S1CChZ-CZ},
}


Issues

answers_file is not extracted/provided

When running the Selector Training section, the px.nmt.reformulator_and_selector_training module requires answers files (shown below). However, train_data is not provided in the configuration, nor is the answers file generated after preprocessing the SQuAD data with:

python -m searchqa.prepro \
--searchqa_dir=$DATA_DIR/SearchQA \
--squad_dir=$SQUAD_DIR

Could you please give some help on how to fix this issue?

questions, annotations, docid_2_answer = read_data(
    questions_file=FLAGS.train_questions,
    annotations_file=FLAGS.train_annotations,
    answers_file=FLAGS.train_data,
    preprocessing_mode=FLAGS.mode)
dev_questions, dev_annotations, dev_docid_2_answer = read_data(
    questions_file=FLAGS.dev_questions,
    annotations_file=FLAGS.dev_annotations,
    answers_file=FLAGS.dev_data,
    preprocessing_mode=FLAGS.mode,
    max_lines=FLAGS.max_dev_examples)

Nonsensical reformed queries from Reformulator

I am trying to get reformulations from the reformulator, but all of the reformulations I get are nonsensical.


My questions were: ['how can i apply for nsa?', 'what is the minimum working hours required for a day?']

I used this code to get the reformulations:

from px.nmt import reformulator
from px.proto import reformulator_pb2

questions = ['how can i apply for nsa?', 'what is the minimum working hours required for a day?']

reformulator_instance = reformulator.Reformulator(
    hparams_path='px/nmt/example_configs/reformulator.json',
    source_prefix='<en> <2en> ',
    out_dir='path/to/reformulator_dir',
    environment_server_address='localhost:10000')

# Change from GREEDY to BEAM if you want 20 rewrites instead of one.
responses = reformulator_instance.reformulate(
    questions=questions,
    inference_mode=reformulator_pb2.ReformulatorRequest.GREEDY)

# Since we are using greedy decoder, keep only the first rewrite.
reformulations = [r[0].reformulation for r in responses]

print reformulations

How to test with my own text file or document

Hi @dberlin ,

I ran the full code on my system, and at the end I downloaded the pretrained translate checkpoints and selector modules and placed them in the correct folders. My question is how to test my own text file, or any other document file containing a paragraph, to generate output in question-and-answer format. Could you help me get this working?

Thanks and Regards,
Manikantha Sekhar.

Happy Coding.

Installation Error

Output

    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-OSBzO1/numpy/setup.py", line 31, in <module>
        raise RuntimeError("Python version >= 3.5 required.")
    RuntimeError: Python version >= 3.5 required.

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-OSBzO1/numpy/

System Info

Ubuntu 16.04.6 LTS (GNU/Linux 4.4.0-165-generic x86_64)

Getting an error while running the reformulator_training file

Hi @willnorris @cdibona @christianbuck @dberlin @j5b

I am running the command:

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=true \
--enable_selector_training=false \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

and I am getting an error like:

tensorflow.python.framework.errors_impl.NotFoundError: /train-questions.txt; No such file or directory

But the squad directory does contain the train-questions.txt file, yet it still reports that the file is not found. Could you help me?

Thanks & Regards,
Manikantha Sekhar

Getting a grpc.FutureTimeoutError while using the Reformulator from the checkpoint

Thank you for your interesting paper & open-sourcing it!

Running the code given in: #9 (comment) but getting a grpc.FutureTimeoutError:

python2 reformulate.py 
Num encoder layer 2 is different from num decoder layer 4, so set pass_hidden_state to False
# hparams:
  src=source
  tgt=target
  train_prefix=None
  dev_prefix=None
  test_prefix=None
  train_annotations=None
  dev_annotations=None
  test_annotations=None
  out_dir=/tmp/active-qa/reformulator
# Vocab file data/spm2/spm.unigram.16k.vocab.nocount.notab.source exists
  using source vocab for target
# Use the same embedding for source and target
Traceback (most recent call last):
  File "reformulate.py", line 10, in <module>
    environment_server_address='localhost:10000')
  File "/root/active-qa/px/nmt/reformulator.py", line 130, in __init__
    use_placeholders=True)
  File "/root/active-qa/px/nmt/model_helper.py", line 171, in create_train_model
    trie=trie)
  File "/root/active-qa/px/nmt/gnmt_model.py", line 56, in __init__
    trie=trie)
  File "/root/active-qa/px/nmt/attention_model.py", line 65, in __init__
    trie=trie)
  File "/root/active-qa/px/nmt/model.py", line 137, in __init__
    hparams.environment_server, mode=hparams.environment_mode))
  File "/root/active-qa/px/nmt/environment_client.py", line 152, in make_environment_reward_fn
    grpc.channel_ready_future(channel).result(timeout=30)
  File "/root/active-qa/venv/local/lib/python2.7/site-packages/grpc/_utilities.py", line 134, in result
    self._block(timeout)
  File "/root/active-qa/venv/local/lib/python2.7/site-packages/grpc/_utilities.py", line 84, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError

Does the gRPC server have to run in order to use the Reformulator or am I missing something else here?

Selector Pre-Trained Models

I was wondering where I can find (or whether you plan to release) the selector pre-trained models that achieved the ~47.5 F1 score.

Odd gRPC status reporting during selector training

Hello, I am just beginning the training of the selector, and would like to share some odd-looking reporting with you to see if it is expected and/or ignorable, or something possibly problematic. The most confusing report is that of the termination for 'deadline_exceeded', though the server still appears to be answering as tf_logging reports truncated questions. Here is a sample run-through, which happens each iteration:

W1129 16:50:36.295381 140505258112768 tf_logging.py:120] <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1543510236.294404271","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
W1129 16:50:36.298100 140503681070848 tf_logging.py:120] <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1543510236.297316511","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
W1129 16:50:36.296053 140505249720064 tf_logging.py:120] <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1543510236.295287420","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
W1129 16:50:36.301875 140503672678144 tf_logging.py:120] <_Rendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1543510236.301333914","description":"Error received from peer","file":"src/core/lib/surface/call.cc","file_line":1017,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
I1129 16:51:09.612217 140514004862784 tf_logging.py:115] Answered: 0 : 19th century , literature argentine cowboys popular , jose hernandez' martin fierro classic : gaucho : 5767 : 0.0
I1129 16:51:09.612453 140514004862784 tf_logging.py:115] Answered: 1 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuckuckuck cláusula cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612546 140514004862784 tf_logging.py:115] Answered: 2 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuckuckuckuck cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612641 140514004862784 tf_logging.py:115] Answered: 3 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuckuck cláusulauck cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612725 140514004862784 tf_logging.py:115] Answered: 4 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuckuck cláusula cláusula cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612806 140514004862784 tf_logging.py:115] Answered: 5 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuck cláusulauck cláusula cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612886 140514004862784 tf_logging.py:115] Answered: 6 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuckuckuck cláusulauck cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.612967 140514004862784 tf_logging.py:115] Answered: 7 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuckuck cláusulauckuck cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.613049 140514004862784 tf_logging.py:115] Answered: 8 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuckuck cláusulauckuck cláusula cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.613131 140514004862784 tf_logging.py:115] Answered: 9 : alertealertealerte 84 84 84 84 Normas Normas Normas Normas Normas Normas Normas Normas Normas Normas profunda profunda profunda profunda culmin hair Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica Jurídica hair hair hairuckuckuckuckuckuckuckuckuckuck cláusulauckuckuck cláusula cláusula cláusula : gaucho : 5767 : 0.0
I1129 16:51:09.613209 140514004862784 tf_logging.py:115] Time to make 1344 environment calls: 153.337013006

How do I know the training is finished for reformulator_and_selector_training

Hello!

Could you please provide some info on when/how I know training is finished for reformulator_and_selector_training?
Once training is finished, how can I directly use the trained model for query reformulation?
Could you please provide a trained model for reformulator_and_selector_training, as you did for the reformulator?

Thanks!

ImportError: cannot import name 'aqa_pb2'

When I run this command in Jupyter, this error shows up:

!python -m px.environments.bidaf_server \
--port=10000 \
--squad_data_dir=./data/squad \
--bidaf_shared_file=./data/bidaf/shared.json \
--bidaf_model_dir=./data/bidaf

The import can't be completed, I guess, because the file to import doesn't exist. I can't find it in px/proto.

I'm using Python 2.

Setup on Windows

Console Output

Collecting sentencepiece (from -r requirements.txt (line 11))
  Using cached https://files.pythonhosted.org/packages/1b/87/c3c2fa8cbec61fffe031ca9f0da512747520bec9be7f886f748457daac31/sentencepiece-0.1.83.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "c:\users\zzj04\appdata\local\temp\pip-install-vxmbno\sentencepiece\setup.py", line 29, in <module>
        with codecs.open(os.path.join('..', 'VERSION'), 'r', 'utf-8') as f:
      File "c:\python27\lib\codecs.py", line 898, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '..\\VERSION'

Windows Version

OS Name Microsoft Windows 10 Pro
Version 10.0.17763 Build 17763

Python Version

Python 2.7.16 (v2.7.16:413a49145e, Mar  4 2019, 01:30:55) [MSC v.1500 32 bit (Intel)] on win32

How do I use the Selector for a custom document?

For a custom document and a question related to that document, I can run the reformulator on that question and get multiple reformulations. But how can I get the answers for those reformulations using that custom document, and then get the best answer using the pretrained selector model?

'_coverage_penalty_weight' attribute not found:

When running in an IPython notebook:

reformulator = reformulator.Reformulator(
      hparams_path='px/nmt/example_configs/reformulator.json',
      source_prefix='<en> <2en> ',
      out_dir='/tmp',
      environment_server_address='localhost:10000')

AttributeError: 'DiverseBeamSearchDecoder' object has no attribute '_coverage_penalty_weight'

When running via the CLI:

python -m px.nmt.reformulator_and_selector_training \
--environment_server_address=localhost:10000 \
--hparams_path=px/nmt/example_configs/reformulator.json \
--enable_reformulator_training=true \
--enable_selector_training=false \
--train_questions=$SQUAD_DIR/train-questions.txt \
--train_annotations=$SQUAD_DIR/train-annotation.txt \
--train_data=data/squad/data_train.json \
--dev_questions=$SQUAD_DIR/dev-questions.txt \
--dev_annotations=$SQUAD_DIR/dev-annotation.txt \
--dev_data=data/squad/data_dev.json \
--glove_path=$GLOVE_DIR/glove.6B.100d.txt \
--out_dir=$REFORMULATOR_DIR \
--tensorboard_dir=$OUT_DIR/tensorboard

AttributeError: 'DiverseBeamSearchDecoder' object has no attribute '_coverage_penalty_weight'

This should be set in the parent object as per https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/seq2seq/python/ops/beam_search_decoder.py#L338

I'm not clear on what is missing.

Assertion Error

Hi everyone,
I followed the README.md file and the instructions given there to run the code. Meanwhile, I got an error while running this command:

python -m searchqa.prepro --searchqa_dir=$DATA_DIR/SearchQA --squad_dir=$SQUAD_DIR

Traceback (most recent call last):
  File "searchqa/prepro.py", line 165, in <module>
    app.run(main)
  File "/home/launchship/my_name/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/launchship/my_name/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "searchqa/prepro.py", line 145, in main
    assert os.path.exists(FLAGS.searchqa_dir)
AssertionError

Could anyone help solve this problem? It would help me a lot.

Thanks and Regards,
Manikantha Sekhar.

Happy Coding.

Model pretrained on UN and Paralex datasets

Reformulator Training
We first train reformulator from a model pretrained on UN and Paralex datasets. It should take a week on a single P100 GPU to reach ~42 F1 score on SearchQA's dev set.

@rodrigonogueira4 How can I build the model (pretrained on the UN and Paralex datasets) from scratch on a different dataset?

ValueError from the run environment step

All prior steps went fine. Running the gRPC environment server errors out. Thoughts?

Running Python 2.7.14 :: Anaconda custom (x86_64) on CPU

Full stack trace:

python -m px.environments.bidaf_server \
--port=10000 \
--squad_data_dir=data/squad \
--bidaf_shared_file=data/bidaf/shared.json \
--bidaf_model_dir=data/bidaf/
I0514 14:08:35.730832 140735704388480 bidaf_server.py:195] Loading server...
Traceback (most recent call last):
  File "/Users/david/anaconda/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/Users/david/anaconda/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/Users/david/active-qa/px/environments/bidaf_server.py", line 227, in <module>
    app.run(main)
  File "/Users/david/anaconda/lib/python2.7/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/Users/david/anaconda/lib/python2.7/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/Users/david/active-qa/px/environments/bidaf_server.py", line 207, in main
    debug_mode=FLAGS.debug_mode), server)
  File "/Users/david/active-qa/px/environments/bidaf_server.py", line 84, in __init__
    debug_mode=debug_mode)
  File "/Users/david/active-qa/px/environments/bidaf_server.py", line 107, in _InitializeEnvironment
    debug_mode=debug_mode)
  File "px/environments/bidaf.py", line 95, in __init__
    self.config, dataset, True, data_filter=data_filter)
  File "third_party/bi_att_flow/basic/read_data.py", line 199, in read_data
    shared = json.load(fh)
  File "/Users/david/anaconda/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/Users/david/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/Users/david/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Users/david/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 1781596155 (char 1781596154)

px.utils module is missing

In the reformulator_and_selector_training.py file, the eval_utils module needs to be imported via "from px.utils import eval_utils". However, there is no utils module in the px folder. Could you please upload this file?

Have not found Monte Carlo Sampling in the code

Hi,
Thanks for releasing the code for active-qa.
After browsing the code, I did not find Monte Carlo sampling in the training stage. It seems that each training instance consists of only one (query, reformulated_query, reward) tuple, so the reward is the same for every token in one reformulated query.
I don't know whether this suspicion is right. If it is, how will the model perform with versus without Monte Carlo sampling? Maybe using only one instance for Monte Carlo sampling is like the relation between stochastic gradient descent and (batch) gradient descent?
Thank you

The Process is killed in the middle

Hi @willnorris ,
I am running your project, and during the two-step preprocessing, while running this command:

python -m third_party.bi_att_flow.squad.prepro \
--glove_dir=$GLOVE_DIR \
--source_dir=$SQUAD_DIR

the process is killed in the middle, like this:

66%|██████▌ | 59663/90834 [12:42<33:24:15, 3.86s/it]Killed

Can you tell me whether this is a system configuration issue or something else? Before the process was killed, my system was stuck for about 3 minutes, and later when I opened the command prompt it showed that the process had been killed. Can you tell me the reason behind this?

Thanks and Regards,
Manikantha Sekhar..

Happy Coding.
