edinburghnlp / refresh

Ranking Sentences for Extractive Summarization with Reinforcement Learning

License: BSD 3-Clause "New" or "Revised" License

Python 100.00%

refresh's Issues

How much is the test accuracy?

Hi
I am using TF 1.10 and upgraded your code to match, so I cannot restore your pretrained model to measure the test accuracy. Could you please tell me what test accuracy you got with your best model, model.ckpt.epoch-11?

thanks

Pyrouge

Hi Shashi,

While running the code, after 1 epoch is done, when it is just about to write the final validation summaries, I get the following error -

File "document_summarizer_training_testing.py", line 212, in train
rouge_score = rouge_generator.get_full_rouge(FLAGS.train_dir+"/model.ckpt.epoch-"+str(epoch)+".validation-summary-topranked", "validation")
File "/datadrive/prateek/Summarization/Refresh/reward_utils.py", line 199, in get_full_rouge
rouge_score = _rouge(system_dir, gold_summary_directory)
File "/datadrive/prateek/Summarization/Refresh/reward_utils.py", line 38, in _rouge
output = r.convert_and_evaluate(rouge_args="-e /address/to/rouge/data/directory/rouge/data -a -c 95 -m -n 4 -w 1.2")
File "/datadrive/miniconda3/envs/summly/lib/python2.7/site-packages/pyrouge/Rouge155.py", line 361, in convert_and_evaluate
rouge_output = self.evaluate(system_id, rouge_args)
File "/datadrive/miniconda3/envs/summly/lib/python2.7/site-packages/pyrouge/Rouge155.py", line 336, in evaluate
rouge_output = check_output(command).decode("UTF-8")
File "/datadrive/miniconda3/envs/summly/lib/python2.7/subprocess.py", line 223, in check_output
raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command '[u'/datadrive/prateek/Summarization/github-pyrouge/pyrouge/tools/ROUGE-1.5.5/ROUGE-1.5.5.pl', '-e', '/address/to/rouge/data/directory/rouge/data', '-a', '-c', '95', '-m', '-n', '4', '-w', '1.2', u'-m', u'/tmp/tmpFdvDy4/rouge_conf.xml']' returned non-zero exit status 2

To me it seems like a pyrouge related error. I have installed Pyrouge using the instructions mentioned here -
https://stackoverflow.com/questions/47045436/how-to-install-the-python-package-pyrouge-on-microsoft-windows/47045437#47045437

Is the error due to Pyrouge? If so, how would you recommend I install it?
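
For anyone hitting the same exit status: the command in the traceback still passes the placeholder path /address/to/rouge/data/directory/rouge/data to ROUGE's -e option, so it may be worth checking that this directory exists before suspecting the pyrouge install itself. A minimal sketch of such a check follows; the helper name rouge_data_dir is hypothetical and is not part of Refresh or pyrouge.

```python
import os
import shlex


def rouge_data_dir(rouge_args):
    """Return the directory passed to ROUGE's -e option, or None if absent."""
    tokens = shlex.split(rouge_args)
    for i, tok in enumerate(tokens[:-1]):
        if tok == "-e":
            return tokens[i + 1]
    return None


# Argument string copied from the traceback in this issue.
args = "-e /address/to/rouge/data/directory/rouge/data -a -c 95 -m -n 4 -w 1.2"
data_dir = rouge_data_dir(args)
if data_dir is None or not os.path.isdir(data_dir):
    print("ROUGE data directory not found: %s" % data_dir)
```

If the directory is missing, pointing -e at the data folder inside the local ROUGE-1.5.5 checkout would be the first thing to try.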

TypeError: 'range' object does not support item assignment
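
This issue has no description, but the message is what Python 3 raises when code written for Python 2 assigns into the result of range(...), for example when shuffling indices in place. In Python 3, range returns an immutable range object rather than a list. Where that is the cause, wrapping the call in list(...) is the usual fix. The snippet below is a generic illustration, not a line from the repository.

```python
import random

indices = list(range(5))   # a real list, so in-place edits are allowed
random.shuffle(indices)    # shuffling a bare range object raises the TypeError
indices[0] = 42            # item assignment now works
```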

Error getting the code to work on the sample examples

We are trying to run the example given in the README file but have not been able to, for the following reasons.

We have not been able to figure out the software configuration for Python, TensorFlow, etc. Are there instructions that would let us reproduce the environment? We have tried Python 2 and 3; for TensorFlow we have no clue.
The links 'Original Test and Validation mainbody data' and 'Gold Test and Validation highlights' are not working. They have been showing a forbidden/404 error for the last few days.
Can you please share some materials or documentation so that we can at least reproduce the example for CNN data for daily doc data?

4.2 Training with High Probability Samples

I'm trying to wrap my head around how this part of the paper is implemented in practice, as it appears to imply that you're building a static distribution ahead of time. How then does it link to the probabilities produced from your policy network so that you can backprop? My understanding of REINFORCE is that you sample from a distribution built from your policy predicted probabilities. It might be helpful if you can point me to the place(s) in the code where this section is implemented.
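
One way to read Section 4.2, offered here only as a sketch and not as the repository's implementation: the candidate extracts can be fixed ahead of time, while the policy network still scores whichever extract is sampled, so the gradient flows through the log-probability the network assigns to that extract. With a single sample, the loss then looks like a cross-entropy term scaled by the sample's reward. The function name and the toy numbers below are assumptions for illustration.

```python
import math


def reward_weighted_nll(probs, labels, reward):
    """Negative reward-weighted log-likelihood of one sampled extract.

    probs[i]  : policy probability that sentence i is selected
    labels[i] : 1 if the sampled extract contains sentence i, else 0
    reward    : scalar score of the sampled extract (for example its ROUGE)
    """
    log_lik = 0.0
    for p, y in zip(probs, labels):
        log_lik += math.log(p if y == 1 else 1.0 - p)
    return -reward * log_lik


# Toy example: two sentences, the sampled extract keeps only the first one.
loss = reward_weighted_nll([0.5, 0.5], [1, 0], reward=2.0)
```

In an autodiff framework, probs would come from the network, so minimizing this loss raises the probability of high-reward extracts even though the set of candidate extracts was built in advance.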

Lead3 rouge score

Hi Shashi,
I used your preprocessed CNN and DailyMail data (11487 test samples) and the function in your code to evaluate lead3's ROUGE-1. But I get a ROUGE-1 of 40.487, which is higher than the result in your paper (39.6). Could I have your ROUGE-1.5.5 folder so that I can get the same result?

I use pyrouge 0.1.3 and ROUGE 1.5.5.

Process for test data

Hi,
I need REFRESH's ROUGE score on the DUC-2004 test data, since it is my strong baseline. But I can't find the code that processes the data in scripts/oracle-estimator, so I can't be sure whether my preprocessing is correct.

I wanted to ask whether you have the preprocessing code or the ROUGE score for DUC-2004.
It would be helpful for me.

Thank you.

Doc Encode + Sent Extract vs multi-layer LSTM

Very interesting paper. Looking it over, there's something I cannot quite grasp about the architecture. What's the advantage of running the sentences through the Document Extractor LSTM, pulling out the state to initialize the Sentence Extractor LSTM and running the same sentences through that? Would you not get the same thing from running the sentences through a single multi-layer LSTM? Perhaps that's what you are doing in practice and the diagram is more there to aid interpretation, but I was nevertheless curious if that's true.

How to process my own data?

Thanks for the great work. How can I process my own data like yours?

training set mainbody problem

Hi Shashi,
I am trying to use BERT sentence embeddings in your code, so I need the original training sentences without preprocessing. But I don't have the training mainbody for your dataset. Could I have the training mainbody?

Thanks for your help.

got shape [60,20,110,2], but wanted [61]

Can anyone help me, please?

How to preprocess the dataset

Hi Shashi,

I am planning to run the code on my own data. How can I preprocess the dataset so that it can be used by the code (Refresh)?

Thanks,
Prateek

How to generate summary for my own data?

Hi Shashi,

I am trying to generate a summary of my own text article using the pretrained embeddings provided in the link. I created a doc file with the article text and saved as cnn.test.doc and also updated the corresponding title file. But when I am running the code it shows error as shown below.

File "/Users/ravi/Desktop/Sidenet-1/data_utils.py", line 263, in populate_data
thissent = [int(item) for item in line.strip().split()]
I have given it a text document, but it expects integers. I guess we need to provide the word IDs of the corresponding words in each sentence?

Can you please guide me on how to generate the summary for new text articles using this code.
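
The line in the traceback parses every token with int(...), so the .doc file apparently expects word IDs rather than raw text. Below is a minimal sketch of the conversion, assuming vocab.txt holds one word per line with the line index as its ID (_PAD at 0 and _UNK at 1, as the training log prints). The helper names are hypothetical, and the tokenization used by the original preprocessing is not shown in this thread, so plain whitespace splitting is an assumption.

```python
def load_vocab(lines):
    """Map each word to its line index in vocab.txt."""
    return {word.rstrip("\n"): idx for idx, word in enumerate(lines)}


def sentence_to_ids(sentence, vocab, unk_id=1):
    """Replace each whitespace-separated token by its ID, unknown words by unk_id."""
    return [vocab.get(token, unk_id) for token in sentence.split()]


# Toy vocabulary standing in for the real vocab.txt.
vocab = load_vocab(["_PAD\n", "_UNK\n", "the\n", "cat\n"])
ids = sentence_to_ids("the dog", vocab)
print(" ".join(str(i) for i in ids))
```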

Restoring from checkpoint failed in evaluation

I tried to run the model for evaluation and got some error. The log is posted here:

Command: python document_summarizer_training_testing.py --use_gpu /gpu:2 --data_mode cnn --exp_mode test --model_to_load 2 --train_dir training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5 --num_sample_rollout 5 > training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/test.model2.log

Error:
Traceback (most recent call last):
File "document_summarizer_training_testing.py", line 291, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 287, in main
test()
File "document_summarizer_training_testing.py", line 259, in test
model.saver.restore(sess, selected_modelpath)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1562, in restore
err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Tensor name "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases_1" not found in checkpoint files training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/model.ckpt.epoch-2
[[node save/RestoreV2 (defined at /media/gtx/data/Asif/Refresh-master/my_model.py:73) = RestoreV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, ..., DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/device:CPU:0"](_arg_save/Const_0_0, save/RestoreV2/tensor_names, save/RestoreV2/shape_and_slices)]]

Caused by op u'save/RestoreV2', defined at:
File "document_summarizer_training_testing.py", line 291, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 287, in main
test()
File "document_summarizer_training_testing.py", line 244, in test
model = MY_Model(sess, len(vocab_dict)-2)
File "/media/gtx/data/Asif/Refresh-master/my_model.py", line 73, in __init__
self.saver = tf.train.Saver(tf.global_variables(), max_to_keep=None)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1102, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1114, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1151, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 795, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 406, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 862, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1466, in restore_v2
shape_and_slices=shape_and_slices, dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

What could be the issue here?
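
A common way to narrow this down is to compare the variable names the graph expects with the names stored in the checkpoint. The sketch below only does the set comparison; the helper name is hypothetical. In TensorFlow 1.x the two name lists can be obtained from tf.global_variables() and tf.train.list_variables(checkpoint_path), which are left out here so that the snippet runs without TensorFlow. The second graph name and the checkpoint name are made up for the example.

```python
def missing_in_checkpoint(graph_names, checkpoint_names):
    """Names the graph wants to restore that the checkpoint does not contain."""
    stored = set(checkpoint_names)
    # Variable names in a graph carry an output suffix such as ":0"; drop it.
    wanted = [name.split(":")[0] for name in graph_names]
    return sorted(name for name in wanted if name not in stored)


graph = [
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases_1:0",  # from the error message
    "PolicyNetwork/ConvLayer/Conv1D_1/conv_biases:0",    # hypothetical
]
ckpt = ["PolicyNetwork/ConvLayer/Conv1D_1/conv_biases"]  # hypothetical
print(missing_in_checkpoint(graph, ckpt))
```

A list of names that differ only by a numeric suffix would suggest that the graph was built with different variable naming than the one used when the checkpoint was written.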

how can i get vocab.txt?

How can I get vocab.txt?

'gbk' codec can't encode character

Hello Shashi,
When I use CNN data sets to run code for training, I encounter some problems.

File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 329, in prepare_vocab_embeddingdict
for line in fembedd:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 4009: illegal multibyte sequence

embed_line = ""
linecount = 0
with open(wordembed_filename, "r", encoding='utf-8') as fembedd:
    for line in fembedd:
        if linecount == 0:
            vocabsize = int(line.split()[0])
I added encoding='utf-8' after with open(wordembed_filename, "r" and that worked.
But then,
File "D:\Python-Pytorch\myrefresh\Refresh-master\data_utils.py", line 353, in prepare_vocab_embeddingdict
foutput.write("\n".join(vocab_list)+"\n")
UnicodeEncodeError: 'gbk' codec can't encode character '\xa3' in position 714: illegal multibyte sequence

the original code is:
foutput = open(vocabfilename,"w")
vocab_list = [(vocab_dict[key], key) for key in vocab_dict.keys()]
vocab_list.sort()
vocab_list = [item[1] for item in vocab_list]
foutput.write("\n".join(vocab_list)+"\n")
foutput.close()
return vocab_dict, word_embedding_array
I tried the same method as above, but it didn't work. What can I do to solve this problem? Could you help me? I use Windows to run the code.
Thank you very much,
Zuky Li
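
On Windows, Python 3 opens text files with the locale encoding (gbk here) unless told otherwise, so the write handle needs an explicit encoding just like the read handle did. Here is a sketch of the vocab-writing step with that change, written as a standalone function with a hypothetical name rather than as a patch to data_utils.py:

```python
import io


def write_vocab(vocabfilename, vocab_dict):
    """Write words ordered by their ID, one per line, always as UTF-8."""
    vocab_list = sorted((idx, word) for word, idx in vocab_dict.items())
    with io.open(vocabfilename, "w", encoding="utf-8") as foutput:
        foutput.write(u"\n".join(word for _, word in vocab_list) + u"\n")
```

If the error persists after this change, some other open(...) call on the same path is probably still using the default encoding.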

How to preprocess the dataset?

Model not found in checkpoint folder

Hi
After my training completed on the CNN dataset, the log shows that the model has been saved. The log is as follows:

MRT: Epoch 20 : Saving model after epoch completion
MRT: Epoch 20 : Saving rouge dictionary
MRT: Epoch 20 : Performance on the validation data
MRT: Epoch 20 : Validation (1220) accuracy= 0.912055
MRT: Epoch 20 : Writing final validation summaries
Writing predictions and final summaries ...
MRT: Epoch 20 : Validation (1220) rouge= 0.220117
Optimization Finished!

But when I run the evaluation command, I get the error "Model not found in checkpoint folder" and the program stops running. The log is as follows:

Prepare vocab dict and read pretrained word embeddings ...
Reading pretrained word embeddings file: /address/data/1-billion-word-language-modeling-benchmark-r13output.word2vec.vec
0 ...
100000 ...
200000 ...
300000 ...
400000 ...
500000 ...
Read pretrained embeddings: (559183, 200)
Size of vocab: 559185 (_PAD:0, _UNK:1)
Writing vocab file: /address/to/training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/vocab.txt
Prepare test data ...
Data file prefix (.doc, .title, .image, .label.multipleoracle): /address/data/preprocessed-input-directory/cnn.test
Data sizes: 1090 1090 1090 1090
Reading data (no padding to save memory) ...
0 ...
Writing data files with prefix (.filename, .doc, .title, .image, .label, .weight, .rewards): /address/to/training/directory/cnn-reinforcementlearn-singlesample-from-moracle-noatt-sample5/cnn.test
Model not found in checkpoint folder.

Can you help me solve this problem? Thank you!
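
Since checkpoints in these issues are named model.ckpt.epoch-N, one quick check is whether files for the requested epoch actually exist in the directory passed as --train_dir. Here is a small sketch with a hypothetical helper name; it matches the exact prefix so that epoch 2 does not accidentally match epoch 20:

```python
import os


def checkpoint_files(train_dir, epoch):
    """List files in train_dir that belong to the checkpoint of the given epoch."""
    prefix = "model.ckpt.epoch-%d" % epoch
    return sorted(
        name for name in os.listdir(train_dir)
        if name == prefix or name.startswith(prefix + ".")
    )
```

An empty result for the epoch given to --model_to_load would mean that the evaluation run is looking in a different directory, or at a different epoch, than the training run wrote to.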

DuplicateFlagError: The flag 'tmp_directory' is defined twice.

Hi Shashi, thanks for sharing your code :)
After running my_flag.py, I try to run document_summarizer_training_testing.py but I keep getting this error. Can you help me, please?

raise _exceptions.DuplicateFlagError.from_flag(name, self)
DuplicateFlagError: The flag 'tmp_directory' is defined twice. First from /home/azad/Documents/Refresh-master/my_flags.py, Second from my_flags. Description from first occurrence: Temporary directory used by rouge code.

Model output availability

Hi, I wonder if the output of the model on CNN/DM is provided? Thank you!

Links in README.md are broken

The links in the README.md appear to return a 403

facing problem in trying to summarize my data

When trying to train the model with an updated TensorFlow version, I am facing a lot of issues. Can anyone share the demo code?

What is multiple oracle?

Hi, I don't know what you mean by multiple oracle estimation.

Also, could you please explain how you calculate the actual-reward-multisample- in the code?

I am sorry for disturbing you, but I am a beginner and a little bit confused.

Dimension mismatch problem

Hi Shashi,
I have an issue when training.
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 559183. Shapes are [1,200] and [559183,200].
From merging shape 1 with other shapes. for 'PolicyNetwork/concat/concat_dim' (op: 'Pack') with input shapes: [1,200], [1,200], [559183,200].
Thank you very much.

Does this code support Myanmar(Burmese) language summarization?

I am doing research on Myanmar (Burmese) text summarization. I want to test Myanmar summarization with your code (Refresh: Ranking Sentences for Extractive Summarization with Reinforcement Learning). Could I test this code with the Myanmar language? Is it language-independent? Please help me. Thank you so much.

I am looking forward to your response,
Soe Soe Lwin

Different Result

Hi Shashi,

I also use pyrouge (https://github.com/bheinzerling/pyrouge) to calculate the ROUGE score and I use exactly the same code as you do (in reward_utils.py).

However, the result I've gotten with the test set is a lot higher than yours.

Did you do any preprocessing before summarizing on the test set, or what do you think could cause me to get a higher score?

(I am using python2.7 and tensorflow 0.12)

Thank you very much!
Simeng

When I tried to download the file from the link, access was denied and it showed forbidden

What is the function of the weights in the code.

Hi Shashi,
In the paper, I can't find a description of the function of the weights, but I saw them used in the code. Can you give me some information about what the weights are doing? Thank you.

Did anybody try to load the pretrained model provided?

some clarification in estimate_multiple_orcales.py

Hi Shashi,
Thanks for providing the oracle generation script. While going through estimate_multiple_orcales.py, I noticed that files are not considered if their names do not start with numbers, and that only a few portions of the dataset are considered for the multiple oracle. Can you please explain the reason behind this decision?

You don't have permission to access /public/Refresh-NAACL18-1-billion-benchmark-wordembeddings.tar.gz on this server.

Hi,
I am getting a 403 error while downloading the word embeddings file.

License

Can you guys add a licensing file?

btw Cool Paper, I think I will try to implement this in pytorch, but will add transformer nets encoder to see how transformer performs.

How to "selecting the best subset of sentences using a greedy approach"

Hi Shashi:

I would like to train your model on another dataset.
In preprocessing CNN datasets, every article has 2 or 3 sentences with ROUGE score.
How do you select these 2 or 3 sentences?
In your paper, you just mention the greedy approach with ROUGE score, would you
kindly give me more hints or details for this greedy approach?

Thanks
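
No answer is recorded for this question, so the following is only a generic sketch of greedy subset selection, not the authors' script. Starting from an empty summary, it repeatedly adds the sentence that most improves the score against the gold highlights, and stops when no sentence helps or a length cap is reached. A real setup would plug in ROUGE; the unigram_recall stand-in, the function names, and the cap of three sentences are assumptions made to keep the example self-contained.

```python
def unigram_recall(selected, reference):
    """Fraction of distinct reference words covered by the selected sentences."""
    ref_words = set(reference.split())
    if not ref_words:
        return 0.0
    covered = set()
    for sentence in selected:
        covered.update(sentence.split())
    return len(ref_words & covered) / float(len(ref_words))


def greedy_select(sentences, reference, max_sentences=3, score=unigram_recall):
    """Return indices of a greedily chosen subset, in document order."""
    chosen, best = [], 0.0
    while len(chosen) < max_sentences:
        candidate, candidate_score = None, best
        for i in range(len(sentences)):
            if i in chosen:
                continue
            trial = score([sentences[j] for j in chosen + [i]], reference)
            if trial > candidate_score:  # keep only strict improvements
                candidate, candidate_score = i, trial
        if candidate is None:  # no remaining sentence improves the score
            break
        chosen.append(candidate)
        best = candidate_score
    return sorted(chosen)


doc = ["the cat sat", "dogs bark loudly", "on the mat"]
labels = greedy_select(doc, "the cat sat on the mat")
```

Because the loop stops as soon as no sentence improves the score, articles can end up with fewer selected sentences than the cap.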

Demo Code

Can you share the demo code ?

For my purposes, a quick Python API for text -> summary would be helpful.

I am comparing a bunch of summary methods qualitatively, want to evaluate your retrained model.

What is LEAD?

All of the papers that talk about summarization mention comparisons to LEAD scores.
What is that?
Sorry for the off-topic question, but I searched for it a lot and came to a dead end.
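
LEAD is the standard position baseline in news summarization: the summary is simply the first few sentences of the article, and LEAD-3 takes the first three. A minimal sketch (the function name is illustrative):

```python
def lead_n(sentences, n=3):
    """Return the first n sentences of a document as its summary."""
    return sentences[:n]


article = ["First sentence.", "Second sentence.", "Third sentence.", "Fourth sentence."]
summary = lead_n(article)
```

The ROUGE score of this summary against the gold highlights is the LEAD number that papers compare against.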

How to generate label for single oracle?

Number of testing documents for CNN.

Hi Shashi,
I find that the number of testing documents for CNN is 1090 when using exp_mode, but in the paper it is 1093. Is there a problem in my data or code?
Thanks.

size mismatch error in model_docsum.py in tf.concat

Hi Shashi,
I have a problem on both the CNN and DailyMail datasets.
When running either of them I keep getting the following error.
I would appreciate your help.

cuda version >>> release 10.0, V10.0.130
tensorflow >>> tensorflow-gpu = 1.14.0

Traceback (most recent call last):
File "document_summarizer_training_testing.py", line 282, in <module>
tf.app.run()
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "document_summarizer_training_testing.py", line 277, in main
train()
File "document_summarizer_training_testing.py", line 102, in train
model = MY_Model(sess, len(vocab_dict)-2)
File "/home/azad/Documents/Refresh-master/my_model.py", line 58, in __init__
self.extractor_output, self.logits = model_docsum.policy_network(self.vocab_embed_variable, self.document_placeholder, self.label_placeholder)
File "/home/azad/Documents/Refresh-master/model_docsum.py", line 142, in policy_network
fullvocab_embed_variable = tf.concat(0, [pad_embed_variable, unk_embed_variable, vocab_embed_variable])
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1296, in concat
dtype=dtypes.int32).get_shape().assert_is_compatible_with(
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1087, in convert_to_tensor
return convert_to_tensor_v2(value, dtype, preferred_dtype, name)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1145, in convert_to_tensor_v2
as_ref=False)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1224, in internal_convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1145, in _autopacking_conversion_function
return _autopacking_helper(v, dtype, name or "packed")
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 1095, in _autopacking_helper
return gen_array_ops.pack(elems_as_tensors, name=scope)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5897, in pack
"Pack", values=values, axis=axis, name=name)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
op_def=op_def)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2027, in __init__
control_input_ops)
File "/home/azad/anaconda3/envs/thesis/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1867, in _create_c_op
raise ValueError(str(e))
ValueError: Dimension 0 in both shapes must be equal, but are 1 and 559183. Shapes are [1,200] and [559183,200].
From merging shape 1 with other shapes. for 'PolicyNetwork/concat/concat_dim' (op: 'Pack') with input shapes: [1,200], [1,200], [559183,200].
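
The traceback shows tf.concat(0, [...]), which is the argument order used before TensorFlow 1.0. From 1.0 onward the signature is tf.concat(values, axis), so on tensorflow-gpu 1.14 the list ends up in the axis position and TensorFlow tries to pack three differently shaped tensors into one, which is the 'Pack' failure above. Swapping the arguments, i.e. tf.concat([pad_embed_variable, unk_embed_variable, vocab_embed_variable], 0), is the likely fix, though it has not been verified against this repository here. The pure-Python sketch below only reproduces the shape arithmetic; both helper names are hypothetical.

```python
def concat_shape(shapes, axis=0):
    """Shape after concatenating along axis; all other dimensions must agree."""
    first = list(shapes[0])
    total = 0
    for shape in shapes:
        if len(shape) != len(first):
            raise ValueError("rank mismatch: %s vs %s" % (first, list(shape)))
        for d in range(len(first)):
            if d != axis and shape[d] != first[d]:
                raise ValueError("dimension %d differs: %d vs %d" % (d, first[d], shape[d]))
        total += shape[axis]
    result = first[:]
    result[axis] = total
    return result


def stack_shape(shapes):
    """Shape after packing (stacking); every input must have the same shape."""
    first = list(shapes[0])
    for shape in shapes:
        if list(shape) != first:
            raise ValueError("Shapes are %s and %s" % (first, list(shape)))
    return [len(shapes)] + first


shapes = [[1, 200], [1, 200], [559183, 200]]
full_vocab_shape = concat_shape(shapes, axis=0)
```

Concatenating along axis 0 gives 559185 rows, which matches the vocabulary size with _PAD and _UNK that the training log reports, whereas stack_shape(shapes) raises because the first dimensions differ.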

Errors in Word Embedding Vector file.

I am facing issues with the word embedding vector file. Please check whether it has errors after line number 500000. I am also unable to download the new file from the link http://kinloch.inf.ed.ac.uk/public/Refresh-NAACL18-1-billion-benchmark-wordembeddings.tar.gz provided in the README.md (403 error, FORBIDDEN).
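
To locate malformed lines, one option is to scan the file and compare every row against the header. The sketch assumes the usual word2vec text layout: a first line with the vocabulary size and the dimension, then one word followed by its vector per line. The function name is hypothetical.

```python
def find_bad_lines(lines):
    """Return (declared_size, rows_seen, bad_line_numbers) for a word2vec text file."""
    lines = iter(lines)
    header = next(lines).split()
    vocabsize, dim = int(header[0]), int(header[1])
    bad, rows = [], 0
    for lineno, line in enumerate(lines, start=2):
        rows += 1
        # A well-formed row has the word plus exactly dim numbers.
        if len(line.split()) != dim + 1:
            bad.append(lineno)
    return vocabsize, rows, bad


sample = ["2 3\n", "a 0.1 0.2 0.3\n", "b 0.1 0.2\n"]
report = find_bad_lines(sample)
```

On the real file, the open file handle (opened with an explicit UTF-8 encoding) can be passed in place of the sample list. A mismatch between the declared size and the rows seen, or any reported line numbers, would show where the file is damaged.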
