
question_generation's Introduction

Neural Question Generation: Learning to Ask

This project aims at exploring automatic question generation from sentences in reading comprehension passages using deep neural networks.

We can interpret this task as the reverse of Question Answering: there, given a sentence and a question, we build an algorithm to find the answer. Here, the goal is to generate a question given an input sentence, and potentially an answer.

Various paradigms can be considered:

  • given a sentence, generate a question about the sentence. This paradigm is very close to Machine Translation, where given an input sentence in language A, we intend to translate it into the corresponding sentence in language B. The main difference lies in the size of the output space, which is much larger for QG, since many different questions can be created from a single sentence.
  • given a sentence and an answer (a span in the sentence), generate a question about the sentence that can be answered by that answer. The difference with the previous paradigm is that the output space of potential questions is much narrower, since it is constrained by the answer.
  • given a paragraph, a sentence in the paragraph and an answer in the sentence, generate a question about the sentence that can be answered by that answer. Here the paragraph could help generate a valid question by providing more context than the standalone sentence (see the small example after this list).
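
As a small, invented example (none of these strings come from the datasets), the three paradigms differ only in how much input the model sees for the same target question:

    # Invented example, for illustration only -- these strings are not from SQuAD or NewsQA.
    paragraph = ("Marie Curie was a physicist and chemist. "
                 "She received the Nobel Prize in Physics in 1903.")
    sentence = "She received the Nobel Prize in Physics in 1903."
    answer = "1903"                      # a span inside the sentence

    # Paradigm 1: sentence                      -> question
    # Paradigm 2: (sentence, answer)            -> question
    # Paradigm 3: (paragraph, sentence, answer) -> question
    question = "In what year did she receive the Nobel Prize in Physics?"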

For now, I implemented a baseline as described in Xinya Du, Junru Shao and Claire Cardie's paper Learning to Ask: Neural Question Generation for Reading Comprehension, following the first paradigm.

For their work, they used OpenNMT, a library built on top of Torch (resp. OpenNMT-py on top of PyTorch) specifically designed for Neural Machine Translation modeling.

For learning purposes and for fun, I decided to implement their work in PyTorch directly. If you are looking for performance, I highly advise you to have a look at OpenNMT instead, since their implementation is more efficient than mine.

Model Architecture

(Seq2Seq encoder-decoder architecture diagram)
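
A minimal sketch of such an encoder-decoder in PyTorch, assuming a bidirectional LSTM encoder and an LSTM decoder; the layer names and sizes are invented, and attention and answer features are omitted, so it does not necessarily match model.py and layers.py:

    import torch
    import torch.nn as nn

    class Seq2Seq(nn.Module):
        """Minimal encoder-decoder sketch: no attention, no answer features."""

        def __init__(self, src_vocab, trg_vocab, emb_size=300, hidden_size=600):
            super().__init__()
            self.src_embed = nn.Embedding(src_vocab, emb_size)
            self.trg_embed = nn.Embedding(trg_vocab, emb_size)
            # Bidirectional LSTM over the embedded source sentence.
            self.encoder = nn.LSTM(emb_size, hidden_size // 2,
                                   batch_first=True, bidirectional=True)
            # Unidirectional LSTM that generates the question token by token.
            self.decoder = nn.LSTM(emb_size, hidden_size, batch_first=True)
            self.out = nn.Linear(hidden_size, trg_vocab)

        def forward(self, sentence, question):
            enc_out, (h, c) = self.encoder(self.src_embed(sentence))
            # Concatenate the final forward/backward states to initialise the decoder.
            h = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)   # (1, batch, hidden_size)
            c = torch.cat([c[0], c[1]], dim=-1).unsqueeze(0)
            dec_out, _ = self.decoder(self.trg_embed(question), (h, c))
            return self.out(dec_out)                            # logits over the question vocabulary

    # Dummy usage: a batch of 8 sentences (20 tokens) and questions (12 tokens).
    model = Seq2Seq(src_vocab=5000, trg_vocab=5000)
    logits = model(torch.randint(0, 5000, (8, 20)), torch.randint(0, 5000, (8, 12)))

In train.py the gold question is fed to the decoder and an NLL loss is computed over the predicted question tokens; the paper's full model additionally attends over the encoder states and, in the answer-aware paradigms, feeds answer features to the encoder, which are omitted here for brevity.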

Code Organization

├── config.py          <- Configuration file with data directories and hyper-parameters to train the model
├── preprocessing.py   <- Preprocess the input text files, building datasets and vocabularies for model training
├── layers.py          <- Define the various layers used by the main model
├── make_dataset.py    <- Download the SQuAD and NewsQA datasets used for this experiment
├── model.py           <- Define the Seq2Seq model architecture, with an encoder and a decoder
├── requirements.txt   <- Required Python libraries to build the project
├── train.py           <- Train the model
├── eval.py            <- Use the model to generate questions on unseen data
└── utils.py           <- A collection of useful functions to process the data

Results

  • Using only a sentence as input

    (accuracy and perplexity training curves)

    Accuracy and perplexity after 15 epochs:

    ACC    PLP
    43%    32.2

  • Using a sentence and the answer to the question to be created

    (accuracy and perplexity training curves)

    Accuracy and perplexity after 15 epochs:

    ACC    PLP
    47.3%  25.2

  • Using the full paragraph and the answer to the question to be created

    (accuracy and perplexity training curves)

    Accuracy and perplexity after 15 epochs:

    ACC    PLP
    46.5%  26.6
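
For reference, assuming perplexity is computed in the standard way (the exact implementation in train.py may differ), it is the exponential of the average per-token cross-entropy:

    import math

    # Hypothetical numbers: an average per-token negative log-likelihood of ~3.47
    # corresponds to a perplexity on the order of the values reported above.
    avg_nll = 3.47
    perplexity = math.exp(avg_nll)   # ≈ 32.1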

Set-Up

Before running the following commands to train your model, you first need to download the NewsQA dataset manually here. Follow the steps they describe: essentially, download the data as a ZIP file and use the provided helper functions to wrap it into a single JSON file.

Once it is done:

  • Clone this repository
  • Create a directory for your experiments, logs and model weights: mkdir output
  • Download the GloVe word vectors: https://nlp.stanford.edu/projects/glove/
  • Modify the config.py file to set the paths to your GloVe, SQuAD and NewsQA data, and the directory where your models will be saved (a hypothetical sketch of these paths is given after this list)
  • Create a Python virtual environment and activate it: mkvirtualenv qa-env ; workon qa-env if you use virtualenvwrapper
  • Install the dependencies: pip install -r requirements.txt ; python -m spacy download en
  • Run python make_dataset.py to download the SQuAD dataset and join the SQuAD and NewsQA datasets into a single file
  • Run python preprocessing.py to preprocess the data
  • Run python train.py to train the model with hyper-parameters found in config.py
  • Run python eval.py on a test file to generate your own questions!
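
As a purely hypothetical sketch of what the paths in config.py might look like (the real variable names and values in this repository may differ):

    # Hypothetical config.py excerpt -- variable names are invented for illustration.
    import os

    DATA_DIR   = "data"
    OUTPUT_DIR = "output"                                                 # created with mkdir output

    GLOVE_PATH  = os.path.join(DATA_DIR, "glove.840B.300d.txt")           # GloVe vectors
    SQUAD_DIR   = os.path.join(DATA_DIR, "squad")                         # SQuAD download location
    NEWSQA_PATH = os.path.join(DATA_DIR, "combined-newsqa-data-v1.json")  # NewsQA JSON built manually

    # A few training hyper-parameters read by train.py
    BATCH_SIZE = 64
    NUM_EPOCHS = 15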

Next Steps

  • Use a pointer-generator to copy words from the source sentence (a rough sketch of the copy step is given after this list)
  • Improve the training process by including Reinforcement Learning rewards, as in this paper
  • Investigate Transfer Learning as well as Multi-Task Learning
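
No pointer-generator is implemented yet; as a rough sketch of the idea (following the standard "copy vs. generate" formulation, with invented tensor names), the decoder would mix its vocabulary distribution with a copy distribution derived from the attention weights:

    import torch

    def pointer_generator_step(vocab_dist, attn_weights, src_token_ids, p_gen):
        """One decoding step of a pointer-generator (illustrative only).

        vocab_dist:    (batch, vocab_size)  softmax over the output vocabulary
        attn_weights:  (batch, src_len)     attention weights over source positions
        src_token_ids: (batch, src_len)     vocabulary ids of the source tokens
        p_gen:         (batch, 1)           probability of generating rather than copying
        """
        final_dist = p_gen * vocab_dist
        copy_dist = (1.0 - p_gen) * attn_weights
        # Add the copy probabilities onto the vocabulary entries of the source words.
        return final_dist.scatter_add(1, src_token_ids, copy_dist)

Copying lets the model emit rare words such as names and numbers directly from the source sentence instead of producing <unk>.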

Resources

question_generation's People

Contributors

dependabot[bot], trisongz


question_generation's Issues

ValueError: Expected input batch_size (1216) to match target batch_size (1280)

Traceback (most recent call last):
File "train.py", line 171, in
  loss = criterion(pred.view(-1, pred.size(2)), question.view(-1))
File "/var/www/html/question_generate/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
  result = self.forward(*input, **kwargs)
File "/var/www/html/question_generate/env/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 209, in forward
  return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
File "/var/www/html/question_generate/env/lib/python3.6/site-packages/torch/nn/functional.py", line 1869, in nll_loss
  .format(input.size(0), target.size(0)))
ValueError: Expected input batch_size (1216) to match target batch_size (1280)

Invalid name of newsqa json

Hi,
I tried running make_dataset.py, but in the links provided for NewsQA I didn't find a NewsQA JSON file; all I found was a .questions file. Can you please provide the link to get the JSON file that will be stored as "combined-newsqa-data-v1.json"?

Trained Model

Hi,
Given the long time it takes to train the model, is there any way you could provide your trained model?

runtime error (eval.py)

In eval.py I got this error:

File "/home/frenk/Scaricati/question/eval.py", line 82, in
pred = model(sentence, len_sentence, answer=answer)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)

File "/home/frenk/Scaricati/question/model.py", line 29, in forward
enc_output, enc_hidden = self.enc(sentence, sentence_len, answer)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)

File "/home/frenk/Scaricati/question/layers.py", line 54, in forward
x, (hidden, cell) = self.rnn(x) # (batch_size, seq_len, 2 * hidden_size)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 557, in forward
return self.forward_packed(input, hx)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 550, in forward_packed
output, hidden = self.forward_impl(input, hx, batch_sizes, max_batch_size, sorted_indices)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 519, in forward_impl
self.check_forward_args(input, hx, batch_sizes)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 490, in check_forward_args
self.check_input(input, batch_sizes)

File "/home/frenk/anaconda3/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 153, in check_input
self.input_size, input.size(-1)))

RuntimeError: input.size(-1) must be equal to input_size. Expected 302, got 300

input directory

Hello
I could not find where the user can provide input. Can someone tell me in which file I should specify an input sentence of my choice?

pytorch issue

I've been trying to run your project with all the data files you instructed, but I've been getting the following error:

Traceback (most recent call last):
File "train.py", line 135, in
  pred = model(sentence, len_sentence, question, answer)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
  result = self.forward(*input, **kwargs)
File "/home/quilllionzml/question_generation/model.py", line 29, in forward
  enc_output, enc_hidden = self.enc(sentence, sentence_len, answer)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
  result = self.forward(*input, **kwargs)
File "/home/quilllionzml/question_generation/layers.py", line 48, in forward
  x = self.embedding(x, y)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
  result = self.forward(*input, **kwargs)
File "/home/quilllionzml/question_generation/layers.py", line 20, in forward
  f_emb = self.f_embed(y)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/module.py", line 493, in __call__
  result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/modules/sparse.py", line 117, in forward
  self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/usr/local/lib/python3.5/dist-packages/torch/nn/functional.py", line 1506, in embedding
  return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: index out of range at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:193

Could you help with this?
Side note: what are in_vocab_size and out_vocab_size?

eval.py is not working

When I run eval.py, the system does not produce any output after it prints 2-3 lines. It neither stops nor continues. It is not stuck in an infinite loop, because I checked with print() statements. Can someone please help me with this?

Refer to the attached screenshot for the output.

Cannot evaluate the model

I tried running python eval.py as mentioned in README.md and ran into RuntimeError: Expected object of backend CUDA but got backend CPU for argument #3 'index'. I managed to solve this by looking at train.py and replacing

sentence, len_sentence, question = batch.src[0], batch.src[1], batch.trg[0]

with

sentence, len_sentence, question = batch.src[0].to(device), batch.src[1].to(device), batch.trg[0].to(device)

But now I have another problem. I debugged the code and found that the evaluation loop runs for 4 iterations and prints a lot of <unk>s, but on the 5th iteration it doesn't go beyond the line:

pred = model(sentence, len_sentence, answer=answer)

It is stuck forever in the loop. I'm new to this and I'm trying to learn how this works by looking at your code. Please help.

newsqa

Can I get a processed version of the NewsQA dataset?
