malllabiisc / dips
NAACL 2019: Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation
License: Apache License 2.0
I wanted to know what changes are required in order to get more than one paraphrased question for a given question. :D
The link to Quora-Div doesn't work; could you please update it? Thanks.
Hello,
Can I get the pre-trained word embedding file for the model?
I am facing a GPU compatibility issue. The code itself works completely fine; it is purely a hardware problem, which is why I am asking for a pre-trained file.
Hello, thanks for the code!
I'd like to achieve the following:
What are the exact steps for this?
In the paper and the implementation, sentences in the dataset are truncated to a maximum length of 20 tokens (words), but I am not sure why you set this condition.
Could you tell me more about this setting: why you imposed the limit, and whether it comes from a paper you referenced or from empirical findings?
Thanks.
In helper.py, inside the load_checkpoint function, bleu_score is undefined in the except block. I believe initializing it to None should solve it.
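For concreteness, a minimal sketch of the suggested fix; the function's actual signature and checkpoint keys in src/helper.py are assumptions here, and only the bleu_score initialization is the point:

```python
import torch

def load_checkpoint(model, ckpt_path, device):
    # Hypothetical sketch: signature and checkpoint keys are assumptions.
    # Initializing bleu_score up front avoids the NameError in the
    # except branch when no checkpoint is found.
    start_epoch, bleu_score = 0, None
    try:
        checkpoint = torch.load(ckpt_path, map_location=device)
        model.load_state_dict(checkpoint['model_state_dict'])
        start_epoch = checkpoint['epoch'] + 1
        bleu_score = checkpoint.get('bleu_score')
    except FileNotFoundError:
        pass  # fall through with the defaults above
    return model, start_epoch, bleu_score
```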
Hi, Thank you for sharing the code!
I want to know whether this code can be used for Chinese.
If I want to use the code for Chinese data augmentation, which parts should I change?
Thank you!
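For what it's worth, a hedged sketch of where the language dependence likely lives: the pipeline appears to assume whitespace-separated tokens, so Chinese input would at least need pre-tokenization (here with jieba, purely as an illustration), plus Chinese word embeddings in place of the English ones. This is an assumption about the code, not a confirmed recipe:

```python
import jieba

# Segment a Chinese sentence into whitespace-joined tokens, which is
# the format an English-centric loader typically expects.
sentence = "我想用这段代码做中文数据增强"  # "I want to use this code for Chinese data augmentation"
tokens = list(jieba.cut(sentence))
print(" ".join(tokens))
```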
pkg-resources==0.0.0 causes pip install -r requirements.txt to fail (on Windows). According to this link, it should be safe to just remove it from requirements.txt.
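A small sketch of the suggested workaround, dropping the spurious pin before installing (editing the file by hand works just as well):

```python
# Remove the 'pkg-resources==0.0.0' pin, an artifact of 'pip freeze'
# on some Ubuntu systems with no corresponding installable package.
with open('requirements.txt') as f:
    lines = [line for line in f if not line.startswith('pkg-resources')]
with open('requirements.txt', 'w') as f:
    f.writelines(lines)
```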
Hi,
I am trying to run the experiment from Section 5 of the paper using the evaluation scripts in src/evaluation/, but I am struggling to reproduce the results in Tables 4 and 5.
The model trained on Twitter gave almost the same results (BLEU: 47-56 for lambda: 0.5-1), but the model trained on Quora-Div reached a BLEU of only 21-26 (lambda: 0.5-1), which is lower than the paper's 35.1.
Here are the outputs of src/evaluation/get_bleu_score.py on the decoding results of the Quora-Div test data.
# lambda = 0.5
results_submod_src_0.5_1.0_1.0_1.0_1.0.npy : (0.2133832613042971, [0.5514513662938855, 0.2794985741291888, 0.16517278760476983, 0.10313060915605508], 0.9426644403664957, 0.9442470265324794, 516031, 546500)
# lambda = 1.0
results_submod_src_1.0_1.0_1.0_1.0_1.0.npy : (0.2677655774938234, [0.6150636401998018, 0.3489407079357134, 0.22080072282389715, 0.14503831657417837], 0.92996373353237, 0.9323055809698079, 509505, 546500)
In my setup, the seq2seq model was trained on the Quora-Div dataset with the same hyperparameter settings as in the supplementary material, the w2v dictionary was created from a trained embedding, and each decoding was done with beam=10.
Could you provide more detail on how the experiment was conducted, in particular how BLEU is calculated for each reference-candidate pair (e.g., averaging over all candidates, or taking the maximum)?
I would also like to know how the METEOR and TERp scores in the paper were measured (the libraries or other open-source tools used to calculate them).
Thanks
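To make the aggregation question concrete, here is a toy sketch of the two strategies using nltk's sentence-level BLEU. This is not the repo's get_bleu_score.py, and the example sentences are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu_per_source(reference, candidates):
    """reference: token list; candidates: list of token lists.
    Returns (average, max) sentence-BLEU over the candidate set."""
    scores = [sentence_bleu([reference], cand, smoothing_function=smooth)
              for cand in candidates]
    return sum(scores) / len(scores), max(scores)

avg, best = bleu_per_source(
    "how do i learn python".split(),
    ["how can i learn python".split(), "what is python".split()])
print(f"average: {avg:.3f}, max: {best:.3f}")
```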
@ashutoshml I am traying to make this run for my custom dataset. I have some questions, which I have kept in the src.txt file. I wanted to know what should I keep in the tgt.txt file. Thanks in advance for the help :D
Hi.
I'm trying to generate 500 paraphrases for each input sentence, but I got "Error in Submod: attempt to get argmax of an empty sequence", and I found that only 50 paraphrases were generated for each sentence in the output file. I'm wondering if you can tell me the maximum number of paraphrases that can be generated for a single input sentence, or how I can get more paraphrases using DiPS.
Thanks.
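A toy illustration of why the error plausibly appears: greedy selection can only pick from the candidates the decoder produced, so asking for more outputs than the candidate-pool size (apparently 50 per sentence here) empties the pool. The fixed scores below stand in for the actual submodular gains, which this sketch does not implement:

```python
import numpy as np

def greedy_select(candidates, scores, k):
    """Pick k candidates greedily by score (a stand-in for the
    submodular objective). With k > len(candidates), the pool runs
    out and np.argmax raises
    'ValueError: attempt to get argmax of an empty sequence'."""
    selected, pool = [], list(range(len(candidates)))
    for _ in range(k):
        best = pool[int(np.argmax([scores[i] for i in pool]))]
        selected.append(candidates[best])
        pool.remove(best)
    return selected

# Works for k <= 3; raises the error above for k = 4.
print(greedy_select(["a", "b", "c"], [0.2, 0.9, 0.5], k=3))
```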
Merry Christmas and Happy New Year!
I downloaded the Google word vectors bin file and placed it inside data/, but I encountered No such file or directory: 'data/embeddings/word2vec.pickle' while trying to train the model.
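A hedged sketch of one way to produce the missing file with gensim; whether the repo expects a pickled KeyedVectors object or some other structure, and the exact .bin filename, are assumptions here:

```python
import pickle
from gensim.models import KeyedVectors

# Load the downloaded GoogleNews binary (path is an assumption) and
# pickle it to the location the training script looks for.
kv = KeyedVectors.load_word2vec_format(
    'data/GoogleNews-vectors-negative300.bin', binary=True)
with open('data/embeddings/word2vec.pickle', 'wb') as f:
    pickle.dump(kv, f)
```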
Hi, firstly thank you very much for the code and detailed explanation.
I am running this line of code:
! python -m src.main -mode train -gpu 0 -use_attn -bidirectional -dataset twitter -run_name DiPS
and I am getting the following error:
Loading Word2Vec
Word2vec Loaded
Time Taken : 0.0016974012056986492
2020-06-07 22:22:12,866 | INFO | main | Training and Validation data loading..
2020-06-07 22:22:13,243 | INFO | main | Training and Validation data loaded!
2020-06-07 22:22:13,243 | INFO | main | Creating vocab ...
2020-06-07 22:22:30,206 | INFO | main | Vocab created with number of words = 18865
2020-06-07 22:22:30,207 | INFO | main | Saving Vocabulary file
2020-06-07 22:22:30,217 | INFO | main | Vocabulary file saved in Model/DiPS/vocab.p
2020-06-07 22:22:30,219 | INFO | main | Checkpoint found with epoch num 30
2020-06-07 22:22:30,297 | INFO | main | Building Encoder RNN..
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 473, in
main()
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/main.py", line 415, in main
model = s2s(args, voc, device, logger)
File "/content/drive/My Drive/Colab Notebooks/DiPS/src/model.py", line 44, in init
self.config.bidirectional).to(device)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 386, in to
return self._apply(convert)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 193, in _apply
module._apply(fn)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 127, in _apply
self.flatten_parameters()
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/rnn.py", line 123, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
Actually, I am using Colab for training. I trained for 12 hours with the defaults and it got up to 74 epochs, although in my Drive I could only see the checkpoint file from the 30th epoch. So I restarted it last night and it resumed from the 30th epoch, but for some reason I had to stop after 4 epochs, so I stopped the code and switched off. Now, when I run the whole thing the same way again today, I am unable to proceed with training and get this error. Please help; I am just a newbie in this field.
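Some generic diagnostics that may help when Colab throws CUDNN_STATUS_EXECUTION_FAILED; these are not repo-specific fixes, and the error is often a stale runtime or out-of-memory condition rather than a code bug:

```python
import torch

# Sanity checks: is the GPU visible, which CUDA build, how much memory
# is already held by this process?
print(torch.cuda.is_available(), torch.version.cuda)
print(torch.cuda.get_device_name(0))
print(torch.cuda.memory_allocated(0))

# Last-resort workaround: run the RNNs without cuDNN (slower).
torch.backends.cudnn.enabled = False
```

Restarting the Colab runtime (Runtime > Factory reset runtime) before re-running often resolves this class of error.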
Is there any reason or limitation why one could not use BERT or other Transformer-based encoders for:
Best
Thank you for uploading this code!
I just finished training the model and would like to know how I could use it to generate new sentences from unseen seed sentences.
Say I have a text file src.txt containing 3 sentences, and I would like to generate 2 or 3 paraphrases for each of them. I saw that the decode mode of main.py requires src.txt and tgt.txt files, but is it possible to do it without tgt.txt? Many thanks in advance.
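One hedged workaround, assuming decode mode only reads tgt.txt to compute reference scores (whether DiPS actually ignores the targets at decode time is an assumption): give it a dummy targets file that mirrors the sources, e.g.:

```python
import shutil

# Hypothetical paths: point these at wherever your dataset lives.
# The generated paraphrases come from src.txt; any scores reported
# against these dummy targets would of course be meaningless.
shutil.copyfile('data/mydata/src.txt', 'data/mydata/tgt.txt')
```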