
nre's People

Contributors

mrlyk423, zibuyu


nre's Issues

Experiment Parameter

When I use the parameters in the out folder to run test.cpp, I can't get the same results as shown in your chart. Could you provide the parameters used to generate the charts? I would be very grateful.

The pr.txt files do not match the curve in the paper

Hi,

I'm trying to reproduce the PR curves in the paper. However, I find that the pr.txt files in the repository do not match the curves reported in the paper (PCNN+ATT, for example).

Are these files generated by models that are not fully trained?
Can you provide pr.txt files that can reproduce curves in the paper?

Much appreciated.

P@N

What does P@N mean? What does the N stand for?

A SEGV signal occurred in CNN+ATT/init.h: 71

When I ran the test program in CNN+ATT, a SEGV signal occurred in init.h:71

=================================================================
==25628==ERROR: AddressSanitizer: SEGV on unknown address 0x0000000000c0 (pc 0x7fa794dc2908 bp 0x7ffca4461380 sp 0x7ffca4460ca0 T0)
    #0 0x7fa794dc2907 in _IO_vfscanf (/lib/x86_64-linux-gnu/libc.so.6+0x5b907)
    #1 0x7fa795c415d0 in vfscanf (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x525d0)
    #2 0x7fa795c41749 in __interceptor_fscanf (/usr/lib/x86_64-linux-gnu/libasan.so.2+0x52749)
    #3 0x402c1e in init() /home/mfc_fuzz/NRE/CNN+ATT/init.h:71
    #4 0x40e400 in main /home/mfc_fuzz/NRE/CNN+ATT/test.cpp:99
    #5 0x7fa794d8782f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
    #6 0x4028e8 in _start (/home/mfc_fuzz/NRE/CNN+ATT/test+0x4028e8)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ??:0 _IO_vfscanf
==25628==ABORTING

First word match may not be the right entity mention

From this line of code, it seems that the first match between a head (or tail) entity and a word is treated as the correct entity mention in the sentence. But when a sentence contains several mentions of the entity, this is not necessarily true.

head: brooklyn
tail: eastern parkway
sentence: brooklyn museum , 200 eastern parkway , brooklyn , (718) 638-5000 .

The original dataset contains the necessary index information, but the preprocessed data in this repo does not seem to include it.

Please correct me if I am wrong.

about weight diagonal matrix

I don't understand the matrix A in the formula. What is the weighted diagonal matrix, and are its entries trained as parameters or fixed from the start?

Does the accuracy computation include the NA type?

Hello,
Does the accuracy computation include the NA type? NA instances make up nearly 80% of the training corpus, so many instances are very likely to be predicted as NA, and NA makes up nearly 90% of the test data. If NA is included, accuracy is indeed very high, but accuracy computed without NA is very low.

About the experimental results in the paper

image
In the table above, under the first major column "One" (one sentence sampled from each bag), why do the row settings ONE, AVE, and ATT give different results? My understanding is that with only one sentence, AVE and ATT have no effect, and the results should match ONE.

Word embeddings file

Would you like to share how the word embedding file was created, i.e., what procedure was used? Also, if I want this algorithm to work on my own dataset, how should I create a word embedding file for it?

PCNN+ATT has no attentions

Hi.
When I read the code, I found that PCNN+ATT doesn't have any attention part, while CNN+ATT does.

Concretely, "init.h" in the "CNN+ATT" directory declares the attention-related matrices "att_W" and "att_W_Dao", which are used and trained in the training step in "train.cpp", but I haven't found the corresponding parts in "PCNN+ATT".

Could you check them?

Segmentation Fault

Hi,

Thanks for sharing your code. I am running this on a Linux server with 60 GB of free memory. Compilation works fine, but ./train leads to a segmentation fault.

[s CNN+ATT]$ ls
init.h log.txt makefile out test.cpp test.h train.cpp
[s CNN+ATT]$ make
g++ train.cpp -o train -O2 -lpthread
g++ test.cpp -o test -O2 -lpthread
[sharmistha@momo CNN+ATT]$ ./train
Init Begin.

Segmentation fault (core dumped)

All the training files are in the correct directory. Please let me know how to resolve this issue.

Thanks

am I missing something?

mldl@mldlUB1604:/ub16_prj/NRE/CNN+ATT$ ./train
Init Begin.
wordTotal= 114042
Word dimension= 50
Segmentation fault (core dumped)
mldl@mldlUB1604:
/ub16_prj/NRE/CNN+ATT$ ll ../data
total 215124
drwxr-xr-x 2 mldl mldl 4096 5月 2 03:29 ./
drwxrwxr-x 10 mldl mldl 4096 5月 2 03:29 ../
-rw-r--r-- 1 mldl mldl 584570 7月 16 2016 entity2id.txt
-rw-r--r-- 1 mldl mldl 1851 7月 16 2016 relation2id.txt
-rw-r--r-- 1 mldl mldl 48268627 7月 16 2016 test.txt
-rw-r--r-- 1 mldl mldl 147456013 7月 16 2016 train.txt
-rw-r--r-- 1 mldl mldl 23955231 7月 16 2016 vec.bin
mldl@mldlUB1604:~/ub16_prj/NRE/CNN+ATT$

about the query vector

Hi, two questions have been bothering me a lot.
First:
2018-04-08 09 39 20
Does the r already exist, or do we need to train it?
Second:
In my view, this method cannot extract novel relations; in fact, the extracted relation is always among the already-known r's. Did I misunderstand?
Thanks!

Identical precision and recall between the last training pass and test

Hello,
I followed your readme exactly when using your code, and found that the p and r values printed on screen during the last training pass are identical to those printed by test. I have just started learning natural language processing and don't know what is going on, so sorry to trouble you.
./train

tot:1950
persicon:1 recall :0.00512821
persicon:0.782178 recall :0.0405128

./test
tot:1950
persicon:1 recall :0.00512821
persicon:0.782178 recall :0.0405128

how to test in single CNN/PCNN

I am trying to reproduce CNN/PCNN with TensorFlow. In the test phase, I treat each sentence as its own bag and draw the precision-recall curve. Surprisingly, single CNN/PCNN gets better performance than CNN/PCNN+ONE/ATT.
Another question: the performance of the various models looks very similar, while in the papers it looks very different. Why is that?
pr

Entity order while building distant supervision dataset

How did you (or Riedel) configure the dataset with regard to the order of entities in a sentence?

In the training set, there are instances for both the entity pair (e1, e2) and the reversed pair (e2, e1).

In addition, not only do those entity pairs not share sentence instances (relation mentions), but each pair also has no fixed order of entity appearance. (I mean, for an (e1, e2) entity pair, there are both sentences in which e1 appears before e2 and sentences in which e2 appears before e1.)

I think the order of entities is important as PCNN uses position embedding.

If there is no triple for the (e1, e2) entity pair in Freebase, which sentences are assigned as training instances for (e1, e2)-None, and which for (e2, e1)-None?

Thank you :)

Bug Report: bug in the convolution code of PCNN+ATT/test.h

In line 12 of PCNN+ATT/test.h, the code is
`for (int i = 0; i < 3 * dimensionC; i++) {`

However, I think the code should be
`for (int i = 0; i < dimensionC; i++) {`
which is similar to line 44 of train.cpp.

I think the original code in test.h effectively assumes there are 3 * dimensionC convolution kernels, while there are actually only dimensionC kernels.

I tested the corrected test code with the released trained parameters, and the PR curve matches the curve in the paper. It also runs much faster than the original code.

A question about the P@N evaluation

Hello,
I'd like to ask how the P@N evaluation is done.
Are the ground-truth labels in this step taken from distant supervision, or judged manually?
I believe they should be judged manually, but in the code at https://github.com/thunlp/TensorFlow-NRE, the P@N evaluation uses the distant-supervision labels on the test set as ground truth.
Please let me know, thanks!

About context-wise split

Hello. Most models now split the input sentence into 3 segments around the two entities, and each segment can then be padded or truncated to a fixed length. If I use truncation, how should the segment between the two entities be truncated?

For example, in "XXX Obama XXXXXXXXXXXXX USA XXX", if the part to the left of Obama or to the right of USA is too long, the words far from the entities can be dropped, but how should the part between Obama and USA be handled? Thanks!

about test.h in CNN+ATT

Hi, at line 114 of /CNN+ATT/test.h, it seems that the function vector<double> test(int *sentence, int *testPositionE1, int *testPositionE2, int len, float *r) is never used, since vector<double> score is never used in the following code.

Therefore *r, which should hold the sentence encoding produced by that function, is actually a random vector in void* testMode(void *id) and is then pushed into r_tmp.

Did I miss something? Thanks!

About relation2id.txt

While loading train.txt, I found relation types that do not appear in relation2id.txt, e.g. /people/ethnicity/includes_groups; reading the NRE code, it seems these are treated as the NA relation.
Are these relations missing from relation2id.txt because they occur too rarely? If so, is it acceptable either to map them to NA or to simply ignore them?

How to select relation in Attention while Testing?

Hi, something about the attention confuses me a lot.

image

r is the query vector associated with relation r (the relation representation).
In the training phase, is r the target relation label? If so, in the test phase, which r should be chosen to compute the attention weights for the instances in a bag?

Did I misunderstand something in the paper?

Thanks.

A question about the data

Your paper says the training set contains 522,611 sentences and the test set 172,448 sentences. In the released data.zip, the test set has 172,448 lines but only 61,707 unique sentences; the training set has 570,088 lines and 368,099 unique sentences, and even deduplicating jointly on sentence + entity pair + relation gives 510,415, not 522,611.

Where does the discrepancy come from? What does "number of sentences" refer to in your paper?

Can't get the same PR curve as in the paper

Hi,

Thanks for the great work. I'm trying to reproduce the results but having some trouble.

First, I just used the "pr.txt" file in the "NRE/PCNN+ATT/out/" directory and plotted the PR curve.
Then, I re-ran the test part of the program (without retraining) and plotted the PR curve with the newly generated "pr.txt" file.

What I plotted:
image

Somehow, the two curves differ from each other, and neither matches Figure 3 in the paper. Could you please elaborate on possible reasons? Thanks.
