junefeng / relationclassification-rl

Reinforcement Learning for Relation Classification from Noisy Data (AAAI 2018)

C++ 99.74% Makefile 0.26%

relationclassification-rl's Issues

duplicate data in train.txt

Hi, I have found some duplicate data in train.txt. For example:
line 190245: m.053x3n m.0fnb4 shamsur_rahman dhaka /people/deceased_person/place_of_death these include '' the best poems of shamsur_rahman , '' published last year in new delhi ; and '' the devotee , the combatant : selected poems of shamsur_rahman , '' published in 2000 in dhaka . ###END###
line 190246: m.053x3n m.0fnb4 shamsur_rahman dhaka /people/deceased_person/place_of_death these include '' the best poems of shamsur_rahman , '' published last year in new delhi ; and '' the devotee , the combatant : selected poems of shamsur_rahman , '' published in 2000 in dhaka . ###END###

line 190667: m.05fjf m.0xsbj new_jersey bound_brook /location/location/contains bound_brook is one of the oldest settlements in new_jersey , dating to 1681 . ###END###
line 190668: m.05fjf m.0xsbj new_jersey bound_brook /location/location/contains bound_brook is one of the oldest settlements in new_jersey , dating to 1681 . ###END###

In each pair, both the entities and the sentences are identical. Are these duplicates intentional?

Data file problem

Hello, while debugging the code I found that two files, data/pretrain/word2vec.txt and data/pretrain/pre_bestRL.txt, do not exist. Could you provide them?

Testing with the pre-trained CNN weights yields a PR curve similar to the one obtained after joint training

I ran testing with the pre-trained CNN weights provided in the "data/pretrain" directory, by setting outString in main.cpp to the directory where those weights are stored. The resulting PR curve is very similar to the PR curve obtained after joint training.
Can you clarify why this happens?
[Figure: PR curve of the pre-trained CNN]

[Figure: PR curve after joint training]

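Is it sentence-level or bag-level?

Hello, your paper refers to sentence-level relation extraction throughout, but I printed part of the output produced during testing, as follows:

This is clearly a bag-level evaluation: the total is (number of bags) × 52 classes = 5027256, and tot is the number of positive bags, 1950.
If training is sentence-level, why is the evaluation bag-level? (I also skimmed the training code. I do not know C++ well, but variables such as bags_train appear there, so is training bag-level as well? If so, doesn't that conflict with the paper?)
I hope you can clarify this. Thank you very much!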
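
The figures quoted in this issue are consistent with ranking every (bag, relation) candidate: 5027256 / 52 = 96678 bags. A minimal sketch of how a PR curve over such candidates is typically computed follows. It is an illustration under assumptions, not the repository's evaluation code; `Candidate` and `prCurve` are hypothetical names, and `totalPositives` plays the role of the `tot` value mentioned in the issue.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical type: one scored (bag, relation) candidate.
struct Candidate {
    float score;   // model confidence for this (bag, relation) pair
    bool correct;  // true if the pair is a labelled positive
};

// Returns (precision, recall) after each of the top-k ranked candidates.
// totalPositives is the number of labelled positive pairs in the test set.
std::vector<std::pair<double, double>> prCurve(std::vector<Candidate> cands,
                                               std::size_t totalPositives) {
    // Rank all candidates by descending score.
    std::sort(cands.begin(), cands.end(),
              [](const Candidate& a, const Candidate& b) { return a.score > b.score; });
    std::vector<std::pair<double, double>> curve;
    curve.reserve(cands.size());
    std::size_t hits = 0;
    for (std::size_t k = 0; k < cands.size(); ++k) {
        if (cands[k].correct) ++hits;
        double precision = static_cast<double>(hits) / static_cast<double>(k + 1);
        double recall = totalPositives == 0
                            ? 0.0
                            : static_cast<double>(hits) / static_cast<double>(totalPositives);
        curve.emplace_back(precision, recall);
    }
    return curve;
}
```

With this scheme the number of ranked items is the number of bags times the number of relation classes, which is why the total can look bag-level even when the classifier scores individual sentences.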

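Environment and configuration

I could not find any information about environment setup for this project in the README. If anyone knows, please share. Thanks.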

What is the meaning of Dao? Does it denote gradients?

For example

matrixRelationDao = (float *)calloc(dimensionC*relationTotal, sizeof(float));
matrixW1Dao =  (float*)calloc(dimensionC * dimension * window, sizeof(float));
matrixB1Dao =  (float*)calloc(dimensionC, sizeof(float));

updateMatrixRelation = (float *)calloc(dimensionC*relationTotal, sizeof(float));
updateMatrixW1 =  (float*)calloc(dimensionC * dimension * window, sizeof(float));
updateMatrixB1 =  (float*)calloc(dimensionC, sizeof(float));

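A question about this work:

While studying your paper, I ran into the following doubt:
Your model/method changes the original distribution of the training set and testing set.

Other RL work fits the data by changing the model parameters, so the training data and testing data are left unchanged. This preserves the original distribution of the training set and testing set.

The core of this paper, however, is to use RL to remove noisy bags from the original training data, changing the input data by means of the label Y. That is fine in the training phase, since it does reduce the interference of noisy data with the classification model. But can the same be done in the testing phase? The testing set has no labels, so how can a reward be fed back to the policy module to remove bags from the testing set? How does the method work in the testing phase?

I read the code and found that, in the testing phase, the CNN is indeed applied directly to the test set for relation classification.

Thanks.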

Segmentation fault (core dumped)

mldl@ub1604:/ub16_prj/RelationClassification-RL$ ./main rlpre 0.01
wordTotal= 114042
Word dimension= 50
Segmentation fault (core dumped)
mldl@ub1604:/ub16_prj/RelationClassification-RL$ ./main r 0.01
mldl@ub1604:/ub16_prj/RelationClassification-RL$ ./main rl 0.01
wordTotal= 114042
Word dimension= 50
Segmentation fault (core dumped)
mldl@ub1604:/ub16_prj/RelationClassification-RL$ ./main test
wordTotal= 114042
Word dimension= 50
Segmentation fault (core dumped)
mldl@ub1604:~/ub16_prj/RelationClassification-RL$

Question about the number of entity2vec embeddings

There are 49,828 entities in the training set, but there are only 39,528 pre-trained entity embeddings.

TensorFlow version

Hi,
Do you have a TensorFlow implementation of this code?

Hello, there are some parts of the implementation code for this paper that I do not understand. Could you give me some guidance?

A question about the idea

Under the special reward setting in this work, a better policy will select the sentences in a bag that have a higher log P(r|x_i). The best outcome is then to find the maximum, which means selecting the single highest-scoring sentence in each bag and feeding it to the classifier for training. Is that correct?
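
The limiting case this question describes, keeping only the highest-scoring sentence of each bag, can be sketched as follows. This illustrates the questioner's hypothesis only; it is not the repository's policy network, and the names `selectBestSentence` and `selectPerBag` are hypothetical. The log-probabilities log P(r|x_i) are assumed to be precomputed by the classifier.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the sentence with the largest log P(r | x_i).
// Precondition: the bag is non-empty.
std::size_t selectBestSentence(const std::vector<double>& logProbs) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < logProbs.size(); ++i) {
        if (logProbs[i] > logProbs[best]) best = i;
    }
    return best;
}

// Applies the argmax selection to every bag; each inner vector holds the
// per-sentence log-probabilities of one bag.
std::vector<std::size_t> selectPerBag(const std::vector<std::vector<double>>& bags) {
    std::vector<std::size_t> chosen;
    chosen.reserve(bags.size());
    for (const auto& bag : bags) {
        chosen.push_back(selectBestSentence(bag));
    }
    return chosen;
}
```

Whether a trained policy actually converges to this one-sentence-per-bag behavior is the open question of the issue; the sketch only makes the hypothesized endpoint concrete.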

Three missing relations

Hi,
There are 56 relations in the sentences, but only 53 in train.txt. However, the three missing relations are not rare:
/business/company/industry: 6 sentences
/people/ethnicity/includes_groups: 7 sentences
/people/ethnicity/people: 169 sentences

By contrast, 4 relations have only 1 sentence each:
/location/fr_region/capital
/business/shopping_center/owner
/business/shopping_center_owner/shopping_centers_owned
/location/mx_state/capital

So why were frequent relations deleted while relations with only 1 sentence were kept?
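
Per-relation sentence counts like those above can be tallied with a short scan. This is a minimal sketch, assuming the line layout of the train.txt samples quoted in this thread (two entity IDs, two entity names, then the relation as the fifth whitespace-separated field); `countRelations` is a hypothetical name, not a function from the repository.

```cpp
#include <cstddef>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: counts sentences per relation label.
// Assumes each line is "<e1 id> <e2 id> <e1 name> <e2 name> <relation> <sentence...>".
std::map<std::string, std::size_t> countRelations(std::istream& in) {
    std::map<std::string, std::size_t> counts;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string e1Id, e2Id, e1Name, e2Name, relation;
        // Lines with fewer than five fields are skipped.
        if (fields >> e1Id >> e2Id >> e1Name >> e2Name >> relation) {
            ++counts[relation];
        }
    }
    return counts;
}
```

Running it on both the raw sentence file and train.txt, then comparing the key sets, would show which relations were dropped and how frequent they were.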

./main rl 0.001 Segmentation fault

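A question about the dataset

Hello,
I downloaded the dataset you provide, RE.zip, and found that the number of sentences in the training set, 570088, does not match the number reported in the paper (522611). I also found that this training set (570088) contains entity pairs that overlap with those in the test set. Is it appropriate to use it as the training set?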
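
The overlap described here can be measured by collecting the entity pairs of each file and intersecting them. The sketch below is an assumption-laden illustration: it takes the first two whitespace-separated fields of each line as the entity pair, following the train.txt samples quoted earlier in this thread, and assumes the test file uses the same layout. `readEntityPairs` and `countOverlap` are hypothetical names.

```cpp
#include <cstddef>
#include <istream>
#include <set>
#include <sstream>
#include <string>
#include <utility>

using EntityPair = std::pair<std::string, std::string>;

// Hypothetical helper: collects the distinct (e1 id, e2 id) pairs of a file,
// assuming the two IDs are the first two fields of every line.
std::set<EntityPair> readEntityPairs(std::istream& in) {
    std::set<EntityPair> pairs;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string e1, e2;
        if (fields >> e1 >> e2) pairs.emplace(e1, e2);
    }
    return pairs;
}

// Number of distinct test entity pairs that also occur in the training set.
std::size_t countOverlap(const std::set<EntityPair>& train, const std::set<EntityPair>& test) {
    std::size_t shared = 0;
    for (const auto& p : test) {
        if (train.count(p) != 0) ++shared;
    }
    return shared;
}
```

A nonzero result from `countOverlap` would confirm that some test entity pairs are also seen during training.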
