edinburghnlp / refresh
Ranking Sentences for Extractive Summarization with Reinforcement Learning
License: BSD 3-Clause "New" or "Revised" License
Hi
I am using TF 1.10 and upgraded your code to match, so I cannot restore your pretrained model to measure its test accuracy. Could you please tell me what test accuracy you got with your best model, model.ckpt.epoch-11?
Thanks
Hi Shashi,
While running the code, after 1 epoch is done, when it is just about to write the final validation summaries, I get the following error -
File "document_summarizer_training_testing.py", line 212, in train
rouge_score = rouge_generator.get_full_rouge(FLAGS.train_dir+"/model.ckpt.epoch-"+str(epoch)+".validation-summary-topranked", "validation")
File "/datadrive/prateek/Summarization/Refresh/reward_utils.py", line 199, in get_full_rouge
rouge_score = _rouge(system_dir, gold_summary_directory)
File "/datadrive/prateek/Summarization/Refresh/reward_utils.py", line 38, in _rouge
output = r.convert_and_evaluate(rouge_args="-e /address/to/rouge/data/directory/rouge/data -a -c 95 -m -n 4 -w 1.2")
File "/datadrive/miniconda3/envs/summly/lib/python2.7/site-packages/pyrouge/Rouge155.py", line 361, in convert_and_evaluate
rouge_output = self.evaluate(system_id, rouge_args)
File "/datadrive/miniconda3/envs/summly/lib/python2.7/site-packages/pyrouge/Rouge155.py", line 336, in evaluate
rouge_output = check_output(command).decode("UTF-8")
File "/datadrive/miniconda3/envs/summly/lib/python2.7/subprocess.py", line 223, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '[u'/datadrive/prateek/Summarization/github-pyrouge/pyrouge/tools/ROUGE-1.5.5/ROUGE-1.5.5.pl', '-e', '/address/to/rouge/data/directory/rouge/data', '-a', '-c', '95', '-m', '-n', '4', '-w', '1.2', u'-m', u'/tmp/tmpFdvDy4/rouge_conf.xml']' returned non-zero exit status 2
To me it seems like a pyrouge related error. I have installed Pyrouge using the instructions mentioned here -
https://stackoverflow.com/questions/47045436/how-to-install-the-python-package-pyrouge-on-microsoft-windows/47045437#47045437
Is the error due to pyrouge? If so, how would you recommend I install it?
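One detail visible in the traceback above is that the `-e` argument is still the placeholder `/address/to/rouge/data/directory/rouge/data`, which may be why ROUGE-1.5.5.pl exits with a non-zero status. A minimal sketch of a pre-flight check, assuming the placeholder is meant to be replaced by the `data` directory of a local ROUGE-1.5.5 install; the helper name `build_rouge_args` is hypothetical and is not part of the repository or of pyrouge:

```python
import os


def build_rouge_args(rouge_data_dir):
    """Return a ROUGE-1.5.5 argument string, failing early on a bad data path.

    Hypothetical helper: the options mirror the string in the traceback,
    only the -e path is made configurable and checked before use.
    """
    if not os.path.isdir(rouge_data_dir):
        raise IOError("ROUGE data directory not found: %s" % rouge_data_dir)
    return "-e %s -a -c 95 -m -n 4 -w 1.2" % rouge_data_dir
```

The result could then be passed to the call shown in the traceback, e.g. `r.convert_and_evaluate(rouge_args=build_rouge_args(my_rouge_data_dir))`, where `my_rouge_data_dir` is whatever path holds the ROUGE data files on your machine.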
We are trying to run the example as given in the README file but have not been able to for the following reason.
I'm trying to wrap my head around how this part of the paper is implemented in practice, as it appears to imply that you're building a static distribution ahead of time. How then does it link to the probabilities produced from your policy network so that you can backprop? My understanding of REINFORCE is that you sample from a distribution built from your policy predicted probabilities. It might be helpful if you can point me to the place(s) in the code where this section is implemented.
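A minimal sketch of how a pre-computed sample can still drive a gradient, under the assumption (not confirmed in this thread) that training reduces to a reward-weighted cross-entropy: the sampled labels and their reward are fixed ahead of time, while the probabilities come from the policy network at every step. The function names and toy numbers are hypothetical and do not point at specific places in the repository.

```python
import math


def reward_weighted_nll(probs, sampled_labels, reward):
    """REINFORCE-style loss for one document.

    probs: policy probabilities p(y_i = 1) for each sentence; these depend
        on the network parameters, so gradients flow through them.
    sampled_labels: 0/1 extraction labels fixed before training.
    reward: scalar reward (e.g. ROUGE) of the sampled summary, a constant.
    """
    log_likelihood = 0.0
    for p, y in zip(probs, sampled_labels):
        log_likelihood += math.log(p if y == 1 else 1.0 - p)
    return -reward * log_likelihood


def grad_wrt_probs(probs, sampled_labels, reward):
    """Gradient of the loss above with respect to each probability."""
    return [-reward * (1.0 / p if y == 1 else -1.0 / (1.0 - p))
            for p, y in zip(probs, sampled_labels)]
```

In this view the static distribution only decides which label sequence is scored; the link to the policy is the log-probability the network assigns to that sequence.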
Hi Shashi,
I use your preprocessed CNN and DailyMail data (11487 test samples) and the function in your code to evaluate the ROUGE-1 of LEAD-3, but I get 40.487, which is higher than the result in your paper (39.6). Could I have your ROUGE-1.5.5 folder so that I can get the same result?
I use pyrouge 0.1.3 and ROUGE 1.5.5.
Hi,
I need REFRESH's ROUGE score on DUC-2004 as a strong baseline, but I can't find the code that processes the data in scripts/oracle-estimator, so I can't be sure my preprocessing is correct.
Do you have preprocessing code for DUC-2004, or the ROUGE score on DUC-2004?
It would be helpful for me.
Thank you.
Very interesting paper. Looking it over, there's something I cannot quite grasp about the architecture. What's the advantage of running the sentences through the Document Extractor LSTM, pulling out the state to initialize the Sentence Extractor LSTM and running the same sentences through that? Would you not get the same thing from running the sentences through a single multi-layer LSTM? Perhaps that's what you are doing in practice and the diagram is more there to aid interpretation, but I was nevertheless curious if that's true.
Thanks for the great work. How can I process my own data like yours?
Hi Shashi,
I am trying to use BERT sentence embeddings in your code, so I need the original training sentences without preprocessing, but I don't have the training main body text for your dataset. Could I have it?
Thanks for your help.
Hi Shashi,
I am planning to run the code on my own data. How can I preprocess the dataset so that it can be used by the code (Refresh)?
Thanks,
Prateek
Hi Shashi,
I am trying to generate a summary of my own text article using the pretrained embeddings provided in the link. I created a doc file with the article text, saved it as cnn.test.doc, and updated the corresponding title file. But when I run the code, it shows the error below.
File "/Users/ravi/Desktop/Sidenet-1/data_utils.py", line 263, in populate_data
thissent = [int(item) for item in line.strip().split()]
I supplied a text document, but the code expects integers. I guess we need to provide the word IDs of the words in each sentence?
Could you please guide me on how to generate a summary for new text articles using this code?
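The `populate_data` line in the traceback parses each line as space-separated integers, so the `.doc` file presumably has to contain word IDs rather than raw text. A minimal sketch of that conversion, assuming a word-to-ID dictionary in which `_PAD` is 0 and `_UNK` is 1 (as a training log elsewhere on this page reports); the function name and the toy vocabulary are hypothetical, and the real file layout may carry extra structure not shown here:

```python
def sentence_to_ids(sentence, vocab_dict, unk_token="_UNK"):
    """Map a whitespace-tokenized sentence to a line of space-separated IDs."""
    unk_id = vocab_dict[unk_token]
    ids = [vocab_dict.get(token, unk_id) for token in sentence.split()]
    return " ".join(str(i) for i in ids)


# Toy vocabulary for illustration only; the real one comes from vocab.txt.
toy_vocab = {"_PAD": 0, "_UNK": 1, "the": 2, "cat": 3, "sat": 4}
line = sentence_to_ids("the cat sat quietly", toy_vocab)
```

Words missing from the vocabulary fall back to the `_UNK` ID, so the resulting line can be parsed by the integer-reading code in the traceback.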
I tried to run the model for evaluation and got some error. The log is posted here:
Command: python document_summarizer_training_testing.py --use_gpu /gpu:2 --data_mode cnn --exp_mode test --model_to_load 2 --train_dir training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5 --num_sample_rollout 5 > training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/test.model2.log
Error:
Traceback (most recent call last):
File "document_summarizer_training_testing.py", line 291, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 287, in main
test()
File "document_summarizer_training_testing.py", line 259, in test
model.saver.restore(sess, selected_modelpath)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1562, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Tensor name "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases_1" not found in checkpoint files training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/model.ckpt.epoch-2
[[node save/RestoreV2 (defined at /media/gtx/data/Asif/Refresh-master/my_model.py:73) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
Caused by op u'save/RestoreV2', defined at:
File "document_summarizer_training_testing.py", line 291, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 287, in main
test()
File "document_summarizer_training_testing.py", line 244, in test
model = MY_Model(sess, len(vocab_dict)-2)
File "/media/gtx/data/Asif/Refresh-master/my_model.py", line 73, in __init__
self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=None)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 795, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()
NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
Tensor name "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases_1" not found in checkpoint files training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/model.ckpt.epoch-2
[[node save/RestoreV2 (defined at /media/gtx/data/Asif/Refresh-master/my_model.py:73) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]
What could be the issue here?
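A minimal sketch for narrowing this down, assuming the variable names of the current graph and of the checkpoint have already been collected as plain strings (for example from `tf.global_variables()`, which the traceback shows the saver using, and from a checkpoint reader). The helper name and every sample name except the one quoted in the error are hypothetical:

```python
def diff_variable_names(graph_names, checkpoint_names):
    """Return (missing_in_checkpoint, unused_in_graph), both sorted."""
    graph_set = set(graph_names)
    ckpt_set = set(checkpoint_names)
    return sorted(graph_set - ckpt_set), sorted(ckpt_set - graph_set)


# Hypothetical name lists; only the first graph name comes from the error.
graph_names = [
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases_1",
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_weights_1",
]
checkpoint_names = [
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases",
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_weights",
]
missing, unused = diff_variable_names(graph_names, checkpoint_names)
```

If the two lists differ only by suffixes or scope prefixes, the graph was probably built differently from the one that wrote the checkpoint, for example under a different TensorFlow version or after a code change.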
How can I get vocab.txt?
Hello Shashi,
When I use the CNN dataset to run the training code, I encounter some problems.
File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 329, in prepare_vocab_embeddingdict
for line in fembedd:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 4009: illegal multibyte sequence
embed_line = ""
linecount = 0
with open(wordembed_filename, "r", encoding='utf-8') as fembedd:
    for line in fembedd:
        if linecount == 0:
            vocabsize = int(line.split()[0])
I added encoding='utf-8' to the open(wordembed_filename, "r") call, as shown above, and that worked.
But then:
File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 353, in prepare_vocab_embeddingdict
foutput.write("\n".join(vocab_list)+"\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\xa3' in position 714: illegal multibyte sequence
the original code is:
foutput = open(vocabfilename,"w")
vocab_list = [(vocab_dict[key], key) for key in vocab_dict.keys()]
vocab_list.sort()
vocab_list = [item[1] for item in vocab_list]
foutput.write("\n".join(vocab_list)+"\n")
foutput.close()
return vocab_dict, word_embedding_array
I tried the same method as above, but it didn't work. What can I do to solve this problem? Could you help me? I am running the code on Windows.
Thank you very much,
Zuky Li
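Since the second traceback in this report comes from the write side, the same idea may need to be applied to the `open(vocabfilename, "w")` call as well. A minimal sketch, assuming Python 3 (or `io.open` on Python 2) and that UTF-8 is an acceptable encoding for the vocabulary file; the function name and any dictionary passed to it are hypothetical:

```python
import io


def write_vocab(vocabfilename, vocab_dict):
    """Write vocabulary words ordered by ID, one per line, as UTF-8."""
    vocab_list = sorted((idx, word) for word, idx in vocab_dict.items())
    # io.open accepts `encoding` on both Python 2 and 3, so the platform
    # default codec (gbk in the traceback) is never used for the output.
    with io.open(vocabfilename, "w", encoding="utf-8") as foutput:
        foutput.write(u"\n".join(word for _, word in vocab_list) + u"\n")
```

Any code that later reads vocab.txt back would need the same explicit encoding, otherwise the Windows default codec is used again.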
How to preprocess the dataset?
Hi
After my training on the CNN dataset completed, the log showed that the model had been saved:
MRT: Epoch 20 : Saving model after epoch completion
MRT: Epoch 20 : Saving rouge dictionary
MRT: Epoch 20 : Performance on the validation data
MRT: Epoch 20 : Validation (1220) accuracy= 0.912055
MRT: Epoch 20 : Writing final validation summaries
Writing predictions and final summaries ...
MRT: Epoch 20 : Validation (1220) rouge= 0.220117
Optimization Finished!
But when I run the evaluation command, I get the error "Model not found in checkpoint folder" and the program stops. The log is as follows:
Prepare vocab dict and read pretrained word embeddings ...
Reading pretrained word embeddings file: /address/data/1-billion-word-language-modeling-benchmark-r13output.word2vec.vec
0 ...
100000 ...
200000 ...
300000 ...
400000 ...
500000 ...
Read pretrained embeddings: (559183, 200)
Size of vocab: 559185 (_PAD:0, _UNK:1)
Writing vocab file: /address/to/training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/vocab.txt
Prepare test data ...
Data file prefix (.doc, .title, .image, .label.multipleoracle): /address/data/preprocessed-input-directory/cnn.test
Data sizes: 1090 1090 1090 1090
Reading data (no padding to save memory) ...
0 ...
Writing data files with prefix (.filename, .doc, .title, .image, .label, .weight, .rewards): /address/to/training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/cnn.test
Model not found in checkpoint folder.
Can you help me solve this problem? Thank you!
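A minimal sketch for checking which epochs actually have checkpoint files in the training directory, assuming the `model.ckpt.epoch-N` naming seen in other reports on this page and that newer TensorFlow versions write that prefix with suffixes such as `.index`, `.meta` and `.data-...` rather than as a single file. The function name is hypothetical:

```python
import glob
import os
import re

# Matches the bare prefix as well as the suffixed files of newer checkpoints,
# but not other outputs such as validation summaries.
_CKPT_RE = re.compile(r"^model\.ckpt\.epoch-(\d+)(\.index|\.meta|\.data-.*)?$")


def available_epochs(train_dir):
    """Return the sorted epoch numbers that have a checkpoint file on disk."""
    epochs = set()
    for path in glob.glob(os.path.join(train_dir, "model.ckpt.epoch-*")):
        match = _CKPT_RE.match(os.path.basename(path))
        if match:
            epochs.add(int(match.group(1)))
    return sorted(epochs)
```

If the epoch passed to `--model_to_load` is not in the returned list, or only suffixed files exist while the script looks for the bare prefix, that could explain the message.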
Hi Shashi, thanks for sharing your code :)
After running my_flags.py, I try to run document_summarizer_training_testing.py, but I keep getting this error. Can you help me, please?
raise _exceptions.DuplicateFlagError.from_flag(name, self)
DuplicateFlagError: The flag 'tmp_directory' is defined twice. First from /home/azad/Documents/Refresh-master/my_flags.py, Second from my_flags. Description from first occurrence: Temporary directory used by rouge code.
Hi, I wonder if the output of the model on CNN/DM is provided? Thank you!
The links in the README.md appear to return a 403
When trying to train the model with an updated TensorFlow version, I am facing a lot of issues. Can anyone share the demo code?
Hi, I don't understand what you mean by multiple oracle estimation.
Also, could you please explain how you calculate the actual-reward-multisample- in the code?
I am sorry for disturbing you, but I am a beginner and a little confused.
Hi Shashi,
I have an issue when training.
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 559183. Shapes are [1,200] and [559183,200].
From merging shape 1 with other shapes. for 'PolicyNetwork/concat/concat_dim' (op: 'Pack') with input shapes: [1,200], [1,200], [559183,200].
Thank you very much.
I am doing research on Myanmar (Burmese) text summarization. I want to test Myanmar summarization with your code (Refresh: Ranking Sentences for Extractive Summarization with Reinforcement Learning). Can I use this code with the Myanmar language? Is it language-independent? Please help me. Thank you so much.
I am looking forward to your response,
Soe Soe Lwin
Hi Shashi,
I also use pyrouge (https://github.com/bheinzerling/pyrouge) to calculate the ROUGE score and I use exactly the same code as you do (in reward_utils.py).
However, the result I've gotten with the test set is a lot higher than yours.
Did you do any preprocessing before summarizing on the test set, or what do you think could cause me to get a higher score?
(I am using python2.7 and tensorflow 0.12)
Thank you very much!
Simeng
Hi Shashi,
In the paper, I can't find a description of what the weights do, but I see them used in the code. Can you give me some information about what the weights are doing? Thank you.
Hi Shashi,
Thanks for providing the oracle generation script. While going through estimate_multiple_orcales.py, I noticed that files are skipped if their names do not start with a number, and that only a portion of the dataset is considered for the multiple oracles. Could you please explain the reason behind this decision?
Hi,
I am getting a 403 error while downloading the word embeddings file.
Can you guys add a licensing file?
By the way, cool paper. I think I will try to implement this in PyTorch, but with a Transformer encoder added to see how Transformers perform.
Hi Shashi:
I would like to train your model on another dataset.
In the preprocessed CNN dataset, every article has 2 or 3 sentences with a ROUGE score.
How do you select these 2 or 3 sentences?
In your paper, you only mention a greedy approach based on ROUGE score. Would you kindly give me more hints or details about this greedy approach?
Thanks
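A minimal sketch of one common greedy oracle construction: repeatedly add the sentence that most improves the score of the selected set against the reference summary, and stop when nothing improves it. This is an assumption about what "greedy" means here, not the authors' confirmed procedure, and the unigram-overlap F1 below is only a stand-in for ROUGE. All names are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, a rough stand-in for ROUGE-1 F."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / float(len(candidate_tokens))
    recall = overlap / float(len(reference_tokens))
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(sentences, reference, max_sentences=3):
    """Greedily pick sentence indices while the score keeps improving."""
    reference_tokens = reference.split()
    selected, best_score = [], 0.0
    while len(selected) < max_sentences:
        best_index = None
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Score the candidate set in document order.
            chosen = sorted(selected + [i])
            tokens = " ".join(sentences[j] for j in chosen).split()
            score = unigram_f1(tokens, reference_tokens)
            if score > best_score:
                best_index, best_score = i, score
        if best_index is None:
            break  # no remaining sentence improves the score
        selected.append(best_index)
    return sorted(selected), best_score
```

With a real ROUGE scorer in place of `unigram_f1`, the same loop would yield a small set of sentences per article, which is consistent with the 2 or 3 labelled sentences mentioned above.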
Can you share the demo code?
For my purposes, a quick Python text -> summary API would be helpful.
I am comparing a number of summarization methods qualitatively and want to evaluate your retrained model.
All of the papers about summarization mention comparisons to LEAD scores.
What is that?
Sorry for the off-topic question, but I searched for it a lot and came to a dead end.
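For reference, LEAD usually denotes the baseline that takes the first k sentences of a document as its summary (LEAD-3 for k = 3). A minimal sketch, assuming the document is already split into sentences; the function name and sample document are hypothetical:

```python
def lead_k(sentences, k=3):
    """Return the first k sentences of a document as its summary."""
    return sentences[:k]


document = ["First sentence.", "Second sentence.",
            "Third sentence.", "Fourth sentence."]
summary = lead_k(document)
```

Because news articles tend to put the key facts first, this baseline is often hard to beat, which is why papers report it.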
Hi Shashi,
I find that the number of test documents for CNN is 1090 when running with exp_mode test, but the paper says 1093. Is there a problem with my data or code?
Thanks.
Hi Shashi,
I have a problem with both the CNN and DailyMail datasets.
When running either of them, I keep getting the following error.
I would appreciate your help.
cuda version >>> release 10.0, V10.0.130
tensorflow >>> tensorflow-gpu = 1.14.0
Traceback (most recent call last):
File "document_summarizer_training_testing.py", line 282, in <module>
tf.app.run()
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 277, in main
train()
File "document_summarizer_training_testing.py", line 102, in train
model = MY_Model(sess, len(vocab_dict)-2)
File "/home/azad/Documents/Refresh-master/my_model.py", line 58, in __init__
self.extractor_output, self.logits = model_docsum.policy_network(self.vocab_embed_variable, self.document_placeholder, self.label_placeholder)
File "/home/azad/Documents/Refresh-master/model_docsum.py", line 142, in policy_network
fullvocab_embed_variable = tf.concat(0, [pad_embed_variable, unk_embed_variable, vocab_embed_variable])
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1296, in concat
dtype=dtypes.int32).get_shape().assert_is_compatible_with(
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1087, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1145, in convert_to_tensor_v2
as_ref=False)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1224, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1145, in _autopacking_conversion_function
return _autopacking_helper(v, dtype, name or "packed")
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1095, in _autopacking_helper
return gen_array_ops.pack(elems_as_tensors, name=scope)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5897, in pack
"Pack", values=values, axis=axis, name=name)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2027, in __init__
control_input_ops)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1867, in _create_c_op
raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 559183. Shapes are [1,200] and [559183,200].
From merging shape 1 with other shapes. for 'PolicyNetwork/concat/concat_dim' (op: 'Pack') with input shapes: [1,200], [1,200], [559183,200].
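The traceback shows `tf.concat(0, [...])`. In TensorFlow 1.0 and later the signature is `tf.concat(values, axis)`, so the list is taken as the axis and packed into a single tensor, which fails because the three shapes differ. Swapping the arguments, i.e. `tf.concat([pad_embed_variable, unk_embed_variable, vocab_embed_variable], 0)`, would likely resolve this particular error. A small pure-Python sketch of the shape arithmetic involved; the helper names are hypothetical and only mimic the shape rules, not TensorFlow itself:

```python
def concat_shape(shapes, axis=0):
    """Shape after concatenation: sizes add along `axis`, others must match."""
    result = list(shapes[0])
    for shape in shapes[1:]:
        for dim, (a, b) in enumerate(zip(result, shape)):
            if dim != axis and a != b:
                raise ValueError("Dimension %d differs: %d vs %d" % (dim, a, b))
    result[axis] = sum(shape[axis] for shape in shapes)
    return result


def pack_shape(shapes):
    """Shape after stacking (the Pack op): all input shapes must be identical."""
    for shape in shapes[1:]:
        if list(shape) != list(shapes[0]):
            raise ValueError("Shapes must be equal: %s vs %s" % (shapes[0], shape))
    return [len(shapes)] + list(shapes[0])


# Shapes taken from the error message: _PAD row, _UNK row, pretrained vocab.
shapes = [[1, 200], [1, 200], [559183, 200]]
full_vocab_shape = concat_shape(shapes, axis=0)
```

Concatenation along axis 0 gives 559185 rows, which matches the vocabulary size reported in the training log elsewhere on this page (559183 pretrained words plus `_PAD` and `_UNK`), whereas packing the same three shapes raises an error like the one above.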
I am facing issues with the word embedding vector file. Please check whether it has errors after line number 500000. I am also unable to download the new file from the link http://kinloch.inf.ed.ac.uk/public/Refresh-NAACL18-1-billion-benchmark-wordembeddings.tar.gz provided in the README.md (403 error, FORBIDDEN).