malllabiisc / dips
NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation
License: Apache License 2.0
I wanted to know what changes are required in order to get more than one paraphrased question for a given question. :D
The link to Quora-Div doesn't work; could you please update it? Thanks.
Hello,
Can I get the pre-trained word embedding file for the model?
I am facing a GPU compatibility issue. The code itself works completely fine; it is purely a hardware problem, which is why I am asking for a pre-trained file.
Hello, thanks for the code!
I'd like to achieve the following:
What are the exact steps for this?
In the paper and the implementation, sentences in the dataset are truncated to a maximum length of 20 tokens (words), but I am not sure why you set this condition.
Could you tell me more about this setting: why you imposed the limit, and whether it comes from a paper you referenced or from empirical findings?
Thanks.
In helper.py, inside the load_checkpoint function, bleu_score is undefined in the except block. I believe initializing it to None should solve it.
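For concreteness, a minimal sketch of the suggested fix; the function's actual signature and checkpoint keys in src/helper.py are assumptions here, and only the bleu_score initialization is the point:

```python
import torch

def load_checkpoint(model, ckpt_path, device):
    # Hypothetical sketch: signature and checkpoint keys are assumptions.
    # Initializing bleu_score up front avoids the NameError in the
    # except branch when no checkpoint is found.
    start_epoch, bleu_score = 0, None
    try:
        checkpoint = torch.load(ckpt_path, map_location=device)
        model.load_state_dict(checkpoint['model_state_dict'])
        start_epoch = checkpoint['epoch'] + 1
        bleu_score = checkpoint.get('bleu_score')
    except FileNotFoundError:
        pass  # fall through with the defaults above
    return model, start_epoch, bleu_score
```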
Hi, Thank you for sharing the code!
I want to know whether this code can be used for Chinese.
If I want to use the code for Chinese data augmentation, which parts should I change?
Thank you!
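For what it's worth, a hedged sketch of where the language dependence likely lives: the pipeline appears to assume whitespace-separated tokens, so Chinese input would at least need pre-tokenization (here with jieba, purely as an illustration), plus Chinese word embeddings in place of the English ones. This is an assumption about the code, not a confirmed recipe:

```python
import jieba

# Segment a Chinese sentence into whitespace-joined tokens, which is
# the format an English-centric loader typically expects.
sentence = "我想用这段代码做中文数据增强"  # "I want to use this code for Chinese data augmentation"
tokens = list(jieba.cut(sentence))
print(" ".join(tokens))
```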
pkg-resources==0.0.0 causes pip install -r requirements.txt to fail (on Windows). According to this link, it should be safe to just remove it from requirements.txt.
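A small sketch of the suggested workaround, dropping the spurious pin before installing (editing the file by hand works just as well):

```python
# Remove the 'pkg-resources==0.0.0' pin, an artifact of 'pip freeze'
# on some Ubuntu systems with no corresponding installable package.
with open('requirements.txt') as f:
    lines = [line for line in f if not line.startswith('pkg-resources')]
with open('requirements.txt', 'w') as f:
    f.writelines(lines)
```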
Hi,
I am trying to run the experiment from Section 5 of the paper using the evaluation scripts in src/evaluation/, but I am struggling to reproduce the results in Tables 4 and 5.
The model trained on Twitter gave almost the same results (BLEU: 47-56 for lambda: 0.5-1), but the model trained on Quora-Div reached a BLEU of only 21-26 (lambda: 0.5-1), which is lower than the paper's 35.1.
Here are the outputs of src/evaluation/get_bleu_score.py on the decoding results of the Quora-Div test data.
# lambda = 0.5
results_submod_src_0.5_1.0_1.0_1.0_1.0.npy : (0.2133832613042971, [0.5514513662938855, 0.2794985741291888, 0.16517278760476983, 0.10313060915605508], 0.9426644403664957, 0.9442470265324794, 516031, 546500)
# lambda = 1.0
results_submod_src_1.0_1.0_1.0_1.0_1.0.npy : (0.2677655774938234, [0.6150636401998018, 0.3489407079357134, 0.22080072282389715, 0.14503831657417837], 0.92996373353237, 0.9323055809698079, 509505, 546500)
In my setup, the seq2seq model was trained on the Quora-Div dataset with the same hyperparameter settings as in the supplementary material, the w2v dictionary was created from a trained embedding, and each decoding was done with beam=10.
Could you provide more detail on how the experiment was conducted, in particular how BLEU is calculated for each reference-candidate pair (e.g., averaging over all candidates, or taking the maximum)?
I would also like to know how the METEOR and TERp scores in the paper were measured (the libraries or other open-source tools used to calculate them).
Thanks
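To make the aggregation question concrete, here is a toy sketch of the two strategies using nltk's sentence-level BLEU. This is not the repo's get_bleu_score.py, and the example sentences are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu_per_source(reference, candidates):
    """reference: token list; candidates: list of token lists.
    Returns (average, max) sentence-BLEU over the candidate set."""
    scores = [sentence_bleu([reference], cand, smoothing_function=smooth)
              for cand in candidates]
    return sum(scores) / len(scores), max(scores)

avg, best = bleu_per_source(
    "how do i learn python".split(),
    ["how can i learn python".split(), "what is python".split()])
print(f"average: {avg:.3f}, max: {best:.3f}")
```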
@ashutoshml I am traying to make this run for my custom dataset. I have some questions, which I have kept in the src.txt file. I wanted to know what should I keep in the tgt.txt file. Thanks in advance for the help :D
Hi.
I'm trying to generate 500 paraphrases for each input sentence, but I got "Error in Submod: attempt to get argmax of an empty sequence", and I found that only 50 paraphrases were generated for each sentence in the output file. I'm wondering if you can tell me the maximum number of paraphrases that can be generated for a single input sentence, or how I can get more paraphrases using DiPS.
Thanks.
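A toy illustration of why the error plausibly appears: greedy selection can only pick from the candidates the decoder produced, so asking for more outputs than the candidate-pool size (apparently 50 per sentence here) empties the pool. The fixed scores below stand in for the actual submodular gains, which this sketch does not implement:

```python
import numpy as np

def greedy_select(candidates, scores, k):
    """Pick k candidates greedily by score (a stand-in for the
    submodular objective). With k > len(candidates), the pool runs
    out and np.argmax raises
    'ValueError: attempt to get argmax of an empty sequence'."""
    selected, pool = [], list(range(len(candidates)))
    for _ in range(k):
        best = pool[int(np.argmax([scores[i] for i in pool]))]
        selected.append(candidates[best])
        pool.remove(best)
    return selected

# Works for k <= 3; raises the error above for k = 4.
print(greedy_select(["a", "b", "c"], [0.2, 0.9, 0.5], k=3))
```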
Merry Christmas and Happy New Year!
I downloaded the Google word vectors bin file and placed it inside data/, but I encountered No such file or directory: 'data/embeddings/word2vec.pickle' while trying to train the model.
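A hedged sketch of one way to produce the missing file with gensim; whether the repo expects a pickled KeyedVectors object or some other structure, and the exact .bin filename, are assumptions here:

```python
import pickle
from gensim.models import KeyedVectors

# Load the downloaded GoogleNews binary (path is an assumption) and
# pickle it to the location the training script looks for.
kv = KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin', binary=True)
with open('data/embeddings/word2vec.pickle', 'wb') as f:
    pickle.dump(kv, f)
```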
Hi, firstly thank you very much for the code and detailed explanation.
I am running this line of code:
! python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset twitter -run_name DiPS
and I am getting the following error:
Loading Word2Vec
Word2vec Loaded
Time Taken : 0.0016974012056986492
2020-06-07 22:22:12,866 | INFO | main | Training and Validation data loading..
2020-06-07 22:22:13,243 | INFO | main | Training and Validation data loaded!
2020-06-07 22:22:13,243 | INFO | main | Creating vocab ...
2020-06-07 22:22:30,206 | INFO | main | Vocab created with number of words = 18865
2020-06-07 22:22:30,207 | INFO | main | Saving Vocabulary file
2020-06-07 22:22:30,217 | INFO | main | Vocabulary file saved in Model/DiPS/vocab.p
2020-06-07 22:22:30,219 | INFO | main | Checkpoint found with epoch num 30
2020-06-07 22:22:30,297 | INFO | main | Building Encoder RNN..
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 473, in
main()
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 415, in main
model = s2s(args, voc, device, logger)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/model.py", line 44, in init
self.config.bidirectional).to(device)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Actually, I am using Colab for training. I trained for 12 hours with the defaults and it got up to 74 epochs, although in my Drive I could only see the checkpoint file from the 30th epoch. So I restarted it last night and it resumed from the 30th epoch, but for some reason I had to stop after 4 epochs, so I stopped the code and switched off. Now, when I run the whole thing the same way again today, I am unable to proceed with training and get this error. Please help; I am just a newbie in this field.
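Some generic diagnostics that may help when Colab throws CUDNN_STATUS_EXECUTION_FAILED; these are not repo-specific fixes, and the error is often a stale runtime or out-of-memory condition rather than a code bug:

```python
import torch

# Sanity checks: is the GPU visible, which CUDA build, how much memory
# is already held by this process?
print(torch.cuda.is_available(), torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.memory_allocated(0))

# Last-resort workaround: run the RNNs without cuDNN (slower).
torch.backends.cudnn.enabled = False
```

Restarting the Colab runtime (Runtime > Factory reset runtime) before re-running often resolves this class of error.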
Is there any reason or limitation why one could not use BERT or other Transformer-based encoders for:
Best
Thank you for uploading this code!
I just finished training the model and would like to know how I could use it to generate new sentences from unseen seed sentences.
Say I have a text file src.txt containing 3 sentences, and I would like to generate 2 or 3 paraphrases for each of them. I saw that the decode mode of main.py requires src.txt and tgt.txt files, but is it possible to do it without tgt.txt? Many thanks in advance.
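One hedged workaround, assuming decode mode only reads tgt.txt to compute reference scores (whether DiPS actually ignores the targets at decode time is an assumption): give it a dummy targets file that mirrors the sources, e.g.:

```python
import shutil

# Hypothetical paths: point these at wherever your dataset lives.
# The generated paraphrases come from src.txt; any scores reported
# against these dummy targets would of course be meaningless.
shutil.copyfile('data/mydata/src.txt', 'data/mydata/tgt.txt')
```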