hpzhao / summarunner

The PyTorch Implementation of SummaRuNNer

Home Page: https://arxiv.org/pdf/1611.04230.pdf

License: MIT License

Python 100.00%

extractive-summarization pytorch pytorch-implmention summarunner summary

summarunner's Issues

dimension out of range (expected to be in range of [-1, 0], but got 1)

I am using PyTorch 0.2.0 on a Mac, with the GPU-related code commented out. The following exception then occurred.

2017-11-23 17:01:50,849 [INFO] loadding train dataset
2017-11-23 17:03:18,838 [INFO] loadding validation dataset
Traceback (most recent call last):
  File "/SummaRuNNer-master/src/train.py", line 74, in <module>
    outputs = net(sents)
  File "/usr/local/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
    result = self.forward(*input, **kwargs)
  File "/SummaRuNNer-master/src/model.py", line 72, in forward
    doc = torch.transpose(self.tanh(self.fc1(doc_features)), 0, 1)
  File "/usr/local/lib/python2.7/site-packages/torch/autograd/variable.py", line 733, in transpose
    return Transpose.apply(self, dim1, dim2)
  File "/usr/local/lib/python2.7/site-packages/torch/autograd/_functions/tensor.py", line 80, in forward
    result = i.transpose(dim1, dim2)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

Final performances ?

Hi,

I was wondering what the final performance of the model was. Can it more or less reproduce the reported results?

Thank you for your help !

RuntimeError: fractional_max_pool2d_backward_out_cuda failed with error code 0

When I tried to run the code on Google Colab, it failed with this error:

RuntimeError                              Traceback (most recent call last)
<ipython-input-9-c137c0eee290> in <module>()
    276         predict(bod)
    277     else:
--> 278         train()

2 frames
/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables)
     97     Variable._execution_engine.run_backward(
     98         tensors, grad_tensors, retain_graph, create_graph,
---> 99         allow_unreachable=True)  # allow_unreachable flag
    100 
    101 

RuntimeError: fractional_max_pool2d_backward_out_cuda failed with error code 0

Pretrained model

Do you have a copy of a pre-trained model that could be used to extract key sentences from other text?

How was the embedding.npz file trained? With Google's word2vec?

Data sets not available

Can you share the datasets in pkl format, or let us know how you prepare the data?

Traceback (most recent call last):
File "train.py", line 76, in <module>
outputs = net(sents)
File "/home/op/zhaopeng/pytorch/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 224, in __call__
result = self.forward(*input, **kwargs)
File "/home/op/zhaopeng/pytorch/SummaRuNNer-master/src/model.py", line 68, in forward
sent_features = self._avg_pooling(word_outputs, sequence_length)
File "/home/op/zhaopeng/pytorch/SummaRuNNer-master/src/model.py", line 56, in _avg_pooling
avg_pooling = torch.mean(data[:sequence_length[index][0], :], dim = 0)
TypeError: 'int' object has no attribute '__getitem__'

The run failed with the error above.

Trained model?

Hi,

Is there any way you can provide the trained model for this? Would love to try how it works on my data.

Thanks.

TypeError: slice indices must be integers or None or have an __index__ method

mldl@mldlUB1604:/ub16_prj/SummaRuNNer$ python main.py
2018-03-20 04:11:24,567 [INFO] Loading vocab,train and val dataset.Wait a second,please
RNN (
(abs_pos_embed): Embedding(100, 50)
(rel_pos_embed): Embedding(10, 50)
(embed): Embedding(151332, 100, padding_idx=0)
(word_RNN): GRU(100, 200, batch_first=True, bidirectional=True)
(sent_RNN): GRU(400, 200, batch_first=True, bidirectional=True)
(fc): Linear (400 -> 400)
(content): Linear (400 -> 1)
(salience): Bilinear (in1_features=400, in2_features=400, out_features=1)
(novelty): Bilinear (in1_features=400, in2_features=400, out_features=1)
(abs_pos): Linear (50 -> 1)
(rel_pos): Linear (50 -> 1)
)
#Params: 16.7M
Traceback (most recent call last):
File "main.py", line 171, in <module>
train()
File "main.py", line 106, in train
probs = net(features,doc_lens)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 206, in __call__
result = self.forward(*input, **kwargs)
File "/home/mldl/ub16_prj/SummaRuNNer/models/RNN.py", line 76, in forward
word_out = self.max_pool1d(x,sent_lens)
File "/home/mldl/ub16_prj/SummaRuNNer/models/RNN.py", line 53, in max_pool1d
t = t[:seq_lens[index],:]
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/variable.py", line 69, in __getitem__
return Index(key)(self)
File "/usr/local/lib/python2.7/dist-packages/torch/autograd/_functions/tensor.py", line 16, in forward
result = i.index(self.index)
TypeError: slice indices must be integers or None or have an __index__ method
mldl@mldlUB1604:/ub16_prj/SummaRuNNer$

word2id file?

Hi hpzhao,
Sorry to bother you. How can I generate the word2id file? What is the starting id number? Do I need to remove some words when generating word2id, and if so, can you provide the list of words to remove?
Thanks a lot.
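
One possible way to build such a mapping, as a minimal sketch only. The prepare_data snippet quoted elsewhere on this page pads with id 0 and maps unknown words to id 1, so the sketch assumes real words start at id 2. The function name `build_word2id`, the `min_count` cutoff, and the `<pad>`/`<unk>` token strings are hypothetical and are not taken from the repository.

```python
from collections import Counter

PAD_ID = 0  # assumption: id 0 is reserved for padding
UNK_ID = 1  # assumption: id 1 is reserved for out-of-vocabulary words


def build_word2id(sentences, min_count=1):
    """Map each word seen at least `min_count` times to an integer id >= 2."""
    counts = Counter(word for sent in sentences for word in sent.strip().split())
    word2id = {"<pad>": PAD_ID, "<unk>": UNK_ID}
    # Most frequent words get the smallest ids; ties are broken alphabetically.
    for word, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if count >= min_count:
            word2id[word] = len(word2id)
    return word2id


word2id = build_word2id(["the cat sat", "the dog sat down"], min_count=2)
```

Raising `min_count` is one simple way to drop rare words; whether the original vocabulary used a frequency cutoff or a stop list is not stated in this thread.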

sentences labeling

Hello @hpzhao, after reading the paper I am still confused about the sentence labeling method. Can you explain the details or share the relevant code?

scripts to produce pkl files?

hi @hpzhao, if you get a chance, could you include some of the scripts you ran to generate the pkl files? the google drive link you provided is very helpful but it would be great to have the scripts too for the sake of completeness. it would also make it easy to rerun on a new dataset

abstractive training?

hey @hpzhao, have you tried implementing the abstractive training mode? I'm a little suspicious of some of the sentence labels based on spot checking the data

One GRU network for a mini batch, i.e. multiple documents

Hi,
I'm wondering why your forward pass uses only one word-level GRU and one sentence-level GRU. I understand that these GRUs need to be run for each document, but I guess yours are run once over all the documents in a mini-batch.

Thank you for your help!

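Data labels

Hello. A major contribution of the paper is that the model can be trained directly from the source text and the reference summary. It does this with a greedy matching algorithm that extracts sentences from the article so as to maximize ROUGE against the reference summary; the extracted sentences become the labels. However, your project does not include this code. Your raw data already contains labels, while many datasets, including my own, do not. Do you have code for generating the labels?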
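
The greedy matching described above can be sketched as follows. This is an illustration only, not the repository's code: it uses a simple unigram-overlap F1 as a stand-in for ROUGE, and `unigram_f1`, `greedy_labels`, and `max_sents` are hypothetical names.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1, used here as a cheap stand-in for ROUGE-1."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_labels(doc_sents, reference, max_sents=3):
    """Return a 0/1 label per sentence, adding sentences while the score improves."""
    ref_tokens = reference.split()
    selected, best = [], 0.0
    while len(selected) < max_sents:
        best_idx = None
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            # Score the summary formed by the chosen sentences plus candidate i.
            tokens = [w for j in sorted(selected + [i]) for w in doc_sents[j].split()]
            score = unigram_f1(tokens, ref_tokens)
            if score > best:
                best, best_idx = score, i
        if best_idx is None:  # no remaining sentence improves the score
            break
        selected.append(best_idx)
    return [1 if i in selected else 0 for i in range(len(doc_sents))]


doc = ["the cat sat on the mat", "stocks fell sharply today", "the dog barked"]
labels = greedy_labels(doc, "the cat sat on the mat")
```

Each round rescans every unselected sentence, so the cost grows with the number of sentences times `max_sents`; swapping in a real ROUGE scorer would make each of those evaluations considerably slower.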

Problem when evaluating with ROUGE

2018-09-07 16:53:13,999 [MainThread ] [INFO ] Running ROUGE with command /home/xy/ROUGE/RELEASE-1.5.5/ROUGE-1.5.5.pl -e /home/xy/ROUGE/RRELEASE-1.5.5/data -a -c 95 -m -n 2 -b 75 -m /tmp/tmpzgkhf8tu/rouge_conf.xml
Cannot open /home/xy/ROUGE/RRELEASE-1.5.5/data/smart_common_words.txt
Traceback (most recent call last):
File "eval.py", line 38, in <module>
rouge()
File "eval.py", line 33, in rouge
output = r.convert_and_evaluate(rouge_args=command)
File "/home/xy/anaconda3/lib/python3.6/site-packages/pyrouge-0.1.3-py3.6.egg/pyrouge/Rouge155.py", line 367, in convert_and_evaluate
rouge_output = self.evaluate(system_id, rouge_args)
File "/home/xy/anaconda3/lib/python3.6/site-packages/pyrouge-0.1.3-py3.6.egg/pyrouge/Rouge155.py", line 342, in evaluate
rouge_output = check_output(command, env=env).decode("UTF-8")
File "/home/xy/anaconda3/lib/python3.6/subprocess.py", line 336, in check_output
**kwargs).stdout
File "/home/xy/anaconda3/lib/python3.6/subprocess.py", line 418, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['/home/xy/ROUGE/RELEASE-1.5.5/ROUGE-1.5.5.pl', '-e', '/home/xy/ROUGE/RRELEASE-1.5.5/data', '-a', '-c', '95', '-m', '-n', '2', '-b', '75', '-m', '/tmp/tmpzgkhf8tu/rouge_conf.xml']' returned non-zero exit status 2.
I still have not found the cause.
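
One thing visible in the log above: the script path uses RELEASE-1.5.5 while the -e data directory uses RRELEASE-1.5.5, and ROUGE reports that it cannot open smart_common_words.txt under the latter. Assuming the extra R is a typo in the configured path, a small pre-flight check like the sketch below would surface the problem before ROUGE is launched. The helper name `check_rouge_paths` is hypothetical and is not part of pyrouge.

```python
import os


def check_rouge_paths(script_path, data_dir):
    """Return a list of problems with the ROUGE script and data directory."""
    problems = []
    if not os.path.isfile(script_path):
        problems.append("ROUGE script not found: " + script_path)
    # The log shows ROUGE opening this file under the directory given with -e.
    words_file = os.path.join(data_dir, "smart_common_words.txt")
    if not os.path.isfile(words_file):
        problems.append("missing data file: " + words_file)
    return problems


# Paths copied from the log above; adjust them to the local install.
for problem in check_rouge_paths(
    "/home/xy/ROUGE/RELEASE-1.5.5/ROUGE-1.5.5.pl",
    "/home/xy/ROUGE/RRELEASE-1.5.5/data",
):
    print(problem)
```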

embedding.pkl

Hi,
Sorry to bother you. How was embedding.pkl produced? Was it trained with word2vec, and if so, on what data: everything in train.pkl + validation.pkl + test.pkl?
Thanks a lot.

Word2Vec

Hello,
First of all, thank you for sharing this.
This is my first experience with a deep learning application, so apologies for the inconvenience. I tried to understand how you use word embeddings, but I only see that you feed the network features that correspond to the ids of the words in each sentence (probs = net(features,doc_lens)).
I could not work out where in the code the tensor called 'features' is transformed into word2vec embeddings. It seems to me that 'embed' is used to initialize a vocab object (vocab = utils.Vocab(embed, word2id)) but is not connected to the RNN's input layer. Could you clarify how, and in which order, the word embeddings are passed to the model?

RNN_RNN_seed_1.pt and CNN_RNN_seed_1.pt give the same result

I am using the pre-trained models RNN_RNN_seed_1.pt and CNN_RNN_seed_1.pt for prediction, but both give exactly the same summary. Kindly let me know if I am missing something.

How many epochs should be trained, and how do I get the ROUGE score?

question about vocab.make_features

hey @hpzhao, I'm a little confused about how the vocab.make_features method works. I'm trying to write an individual document predict method.

It looks like sents_list combines sentences across the batch of documents and feeds them to the net at once? Is this different from how it previously worked? how does the net know when one document ends and the next begins? I remember seeing this old prepare_data method which made me think it was going one document at a time:

from copy import deepcopy

import numpy


def prepare_data(doc, word2id):
    data = deepcopy(doc.content)
    # Find the longest sentence in this document.
    max_len = -1
    for sent in data:
        words = sent.strip().split()
        max_len = max(max_len, len(words))
    sent_list = []

    for sent in data:
        words = sent.strip().split()
        # Unknown words map to id 1; id 0 pads each sentence to max_len.
        sent = [word2id[word] if word in word2id else 1 for word in words]
        sent += [0 for _ in range(max_len - len(sent))]
        sent_list.append(sent)

    sent_array = numpy.array(sent_list)
    label_array = numpy.array(doc.label)

    return sent_array, label_array

Thanks!
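
For what it is worth, call sites quoted on this page (probs = net(features,doc_lens) and prob = probs[start:stop]) suggest that sentences from all documents are concatenated and that doc_lens records how many sentences each document contributes. Under that assumption, per-document scores can be recovered by slicing, as in this sketch. `split_by_doc` is a hypothetical helper, and plain Python lists stand in for tensors.

```python
def split_by_doc(probs, doc_lens):
    """Slice a flat list of per-sentence scores back into one list per document."""
    if sum(doc_lens) != len(probs):
        raise ValueError("doc_lens must add up to the number of sentences")
    per_doc, start = [], 0
    for doc_len in doc_lens:
        stop = start + doc_len
        per_doc.append(probs[start:stop])
        start = stop
    return per_doc


# Two documents with 3 and 2 sentences, flattened into one batch of 5 scores.
flat_probs = [0.9, 0.1, 0.4, 0.7, 0.2]
per_doc = split_by_doc(flat_probs, [3, 2])
```

A single-document predict method would then be the special case where doc_lens holds one entry equal to the number of sentences in that document.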

NLP-OSS conf

hey @hpzhao, if you want, you could convert this into an installable package and perhaps submit to this conf: https://nlposs.github.io/

maybe something about productionizing / distributing pytorch models? issues with going from paper -> code, what kinds of things should authors include to make it easier to code etc.

An error occurs when I use test mode

10%|███████████████▋ | 997/10350 [01:25<12:29, 12.48it/s]Traceback (most recent call last):
File "main.py", line 270, in <module>
test()
File "main.py", line 200, in test
prob = probs[start:stop]
IndexError: dimension specified as 0 but tensor has no dimensions
Exception ignored in: <function tqdm.__del__ at 0x7fa9a42a28c8>
Traceback (most recent call last):
File "/home/qiuxueming/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 889, in __del__
self.close()
File "/home/qiuxueming/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 1095, in close
self._decr_instances(self)
File "/home/qiuxueming/anaconda3/lib/python3.7/site-packages/tqdm/_tqdm.py", line 454, in _decr_instances
cls.monitor.exit()
File "/home/qiuxueming/anaconda3/lib/python3.7/site-packages/tqdm/_monitor.py", line 52, in exit
self.join()
File "/home/qiuxueming/anaconda3/lib/python3.7/threading.py", line 1029, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread
How can I fix this?

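About preprocess.py

Hello, and thank you very much for sharing this!

I am new to this and have recently been reading preprocess.py. Can this script generate the word2id.json file?
Could you explain how to run it, and could you share the data/100.w2v file it mentions?

Thank you very much!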

Using pretrained models error : dimension specified as 0 but tensor has no dimensions

Runtime error.

I tried using your model and got the following error. Do you know what the problem is?

Traceback (most recent call last):
File "train.py", line 77, in <module>
outputs = net(sents)
File "/home/cc/.local/lib/python3.5/site-packages/torch/nn/modules/module.py", line 325, in __call__
result = self.forward(*input, **kwargs)
File "/home/cc/Model/SummaRuNNer/src/model.py", line 72, in forward
doc = torch.transpose(self.tanh(self.fc1(doc_features)), 0, 1)
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

sentence labeling code

Hi, could you upload the sentence labeling code? When I label each sentence myself it is quite slow, so I would like to refer to your approach.

Data Label

Dear author,
do you know the meaning of label 2 in the source data from "Neural Summarization by Extracting Sentences and Words"? I ask even though we simply set every label other than 1 to 0.

System does not summarize (the output document is the same as the original document)

Hi @hpzhao, I tried running the system with your pretrained models, and apparently everything works. But when I analyzed the output documents (generated in test mode), I noticed that they are identical to the original documents. I also tried predict mode with a single document as input (the first example from the test dataset) and got the same result. Do you know what I'm doing wrong? Maybe you have already seen a similar bug and know what I can do. Thank you for your time.

PS: no errors are shown in the terminal.

interpret output

extracting content, salience, novelty etc.

hi @hpzhao, do you know how one might extract the computed features from this model such as content, salience, novelty, abs. pos. imp, rel. pos. imp?

basic pytorch question

hi @hpzhao, I have a rather basic pytorch question that I'm having trouble googling. I'm working on converting the code to run on a CPU (at test time) and it was mostly easy enough. so far I had to add a map_location function to torch.load and put if statements around each use of cuda in test.py.

changes made so far:

net = SummaRuNNer(config)
net = net.cuda() if args.gpu is not None else net.cpu()
if args.gpu is not None:
    net.load_state_dict(torch.load(args.model_file))
else:
    net.load_state_dict(torch.load(args.model_file, map_location=lambda storage, loc: storage))

sents = Variable(torch.from_numpy(x))
if args.gpu is not None:
    sents = sents.cuda()

the part that I can't quite understand is why the Variable has a .cuda() at the end of it within the model.py?

s = Variable(torch.zeros(100, 1)).cuda()
position_index = Variable(torch.LongTensor([[position]])).cuda()

Can you explain what adding .cuda() to the end of the Variable does here? I thought that this is enough on the outside:

net = SummaRuNNer(config)
net.cuda()

thanks!

-- super novice pytorch user

Data Format

Please provide the format of the data or the pickle files. It would be even better if the data format were explained a bit.

In particular, I am interested in how a complete doc (or a batch of them) is processed: first the individual words are processed by the word_gru, and then the sentences are processed by the sent_gru. How are the weights updated: after each word, each sentence, or each complete document? To answer that, it is necessary to know how the data is fed.

How many sentences did you summarize to get a result of 26.0 with the RNN-RNN model?

AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

Hello, I get this error during training:

Traceback (most recent call last):
File "main.py", line 62, in <module>
torch.cuda.set_device(args.device)
File "/home/minelab/anaconda3/envs/py3/lib/python3.5/site-packages/torch/cuda/__init__.py", line 161, in set_device
torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

My PyTorch version is 0.1.12. What might be the cause? Thanks.

provide data as non-pickle files?

hi @hpzhao, is there any chance you can make the data as gzipped line json or text files? the binary pickle files are a bit less portable. for example, I'm trying to run the code using Python 3 instead of Python 2 and run into issues with the pickle files.

in addition, the pickle files are deeply tied to your utility classes instead of raw numpy objects, torch tensors, or dicts etc. if one tries to unpickle the file from the wrong location, it will complain and say utils.Dataset or utils.Vocab not found.

you can load the raw data from the line json and then initialize the Vocab or Dataset class after loading the raw data.

as a side benefit I found the file sizes to be much smaller as well. in my personal fork based on the old code, I could fit these files into the repo itself:

I used the npz (numpy compressed) format for the embedding which brought the file size down to 66mb.

i'd be happy to do this and submit as a PR if you're interested, please let me know!
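
A minimal sketch of the gzipped line-JSON idea, using only the Python 3 standard library. The record fields doc and labels and the file name are made up for illustration; the actual fields would depend on what the pickled objects contain.

```python
import gzip
import json
import os
import tempfile


def write_jsonl_gz(path, records):
    """Write one JSON object per line to a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl_gz(path):
    """Yield one dict per non-empty line of a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


# Hypothetical record layout, for illustration only.
records = [{"doc": ["first sentence", "second sentence"], "labels": [1, 0]}]
path = os.path.join(tempfile.mkdtemp(), "train.jsonl.gz")
write_jsonl_gz(path, records)
loaded = list(read_jsonl_gz(path))
```

Because the file holds only plain dicts and lists, it loads without utils.Dataset or utils.Vocab being importable; those classes can be constructed from the raw records afterwards.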