may- / cnn-re-tf

Convolutional Neural Network for Multi-label Multi-instance Relation Extraction in Tensorflow

Python 14.45% Jupyter Notebook 85.55%

cnn tensorflow distant-supervision relation-extraction wikidata

cnn-re-tf's Introduction

Convolutional Neural Network for Relation Extraction

Note: This project is mostly based on https://github.com/yuhaozhang/sentence-convnet

Requirements

Python 2.7
Tensorflow (tested with version ~~0.10.0rc0~~ -> 1.0.1)
Numpy

To download wikipedia articles (distant_supervision.py)

Beautifulsoup
Pandas
Stanford NER (the path to Stanford NER is set in the ner_path variable in distant_supervision.py)

To visualize the results (visualize.ipynb)

Data

The data directory includes preprocessed data:

cnn-re-tf
├── ...
├── word2vec
└── data
    ├── er              # binary-classification dataset
    │   ├── source.txt      #   source sentences
    │   └── target.txt      #   target labels
    └── mlmi            # multi-label multi-instance dataset
        ├── source.att      #   attention
        ├── source.left     #   left context
        ├── source.middle   #   middle context
        ├── source.right    #   right context
        ├── source.txt      #   source sentences
        └── target.txt      #   target labels

To reproduce:

python ./distant_supervision.py

The word2vec directory is empty. Please download the Google News pretrained vector data from this Google Drive link and unzip it into that directory. The result is a .bin file.

Usage

Preprocess

python ./util.py

It creates vocab.txt, ids.txt and emb.npy files.
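
As a rough illustration of what a preprocessing step of this kind produces, the sketch below builds a vocabulary and fixed-length id sequences from tokenized sentences. It is not the code in util.py: the function names, the `<pad>` and `<unk>` tokens and the whitespace tokenization are all assumptions made for the example.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not taken from util.py


def build_vocab(sentences, max_size=None):
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    if max_size is not None:
        ordered = ordered[: max_size - 2]  # leave room for PAD and UNK
    return {tok: i for i, tok in enumerate([PAD, UNK] + ordered)}


def to_ids(sentence, vocab, sent_len):
    """Convert one sentence to a fixed-length list of ids (truncate or pad)."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in sentence.split()][:sent_len]
    return ids + [vocab[PAD]] * (sent_len - len(ids))


sentences = ["the fox chased a bunny", "the bunny ran"]
vocab = build_vocab(sentences)
print(to_ids("the fox ran away", vocab, sent_len=6))
```

Unknown words map to the `<unk>` id and short sentences are padded up to sent_len, which is why sent_len and vocab_size have to match the data at training time.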

Training

Binary classification (ER-CNN):

python ./train.py --sent_len=3 --vocab_size=11208 --num_classes=2 --train_size=15000 \
--data_dir=./data/er --attention=False --multi_label=False --use_pretrain=False

Multi-label multi-instance learning (MLMI-CNN):

python ./train.py --sent_len=255 --vocab_size=36112 --num_classes=23 --train_size=10000 \
--data_dir=./data/mlmi --attention=True --multi_label=True --use_pretrain=True

Multi-label multi-instance Context-wise learning (MLMI-CONT):

python ./train_context.py --sent_len=102 --vocab_size=36112 --num_classes=23 --train_size=10000 \
--data_dir=./data/mlmi --attention=True --multi_label=True --use_pretrain=True

Caution: A wrong value for the input-data-dependent options (sent_len, vocab_size and num_classes) may cause an error. If you want to train the model on another dataset, please check these values first.
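
One way to sanity-check these options for a new dataset is to measure them directly from the data files. The sketch below assumes whitespace-tokenized sentences, one per line, in source.txt and space-separated label vectors in target.txt; `dataset_stats` is a hypothetical helper, not part of this repository. The vocab_size that train.py expects may also count special tokens added during preprocessing, so treat the token count as a lower bound rather than the exact value.

```python
import os
import tempfile


def dataset_stats(source_path, target_path):
    """Return (max sentence length, distinct token count, label columns)."""
    max_len, tokens = 0, set()
    with open(source_path) as f:
        for line in f:
            toks = line.split()
            max_len = max(max_len, len(toks))
            tokens.update(toks)
    with open(target_path) as f:
        widths = {len(line.split()) for line in f if line.strip()}
    if len(widths) != 1:
        raise ValueError("target rows have inconsistent widths: %s" % sorted(widths))
    return max_len, len(tokens), widths.pop()


# Tiny made-up dataset, written to a temporary directory for the demo.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "source.txt")
tgt = os.path.join(tmp, "target.txt")
with open(src, "w") as f:
    f.write("the fox chased a bunny\nthe bunny ran\n")
with open(tgt, "w") as f:
    f.write("1 0\n0 1\n")

print(dataset_stats(src, tgt))
```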

Evaluation

python ./eval.py --train_dir=./train/1473898241

Replace the --train_dir value with the output directory created by training.

Run TensorBoard

tensorboard --logdir=./train/1473898241

Architecture

Results

	P	R	F	AUC	init_lr	l2_reg
ER-CNN	0.9410	0.8630	0.9003	0.9303	0.005	0.05
MLMI-CNN	0.8205	0.6406	0.7195	0.7424	1e-3	1e-4
MLMI-CONT	0.8819	0.7158	0.7902	0.8156	1e-3	1e-4

*As you see above, these models somewhat suffer from overfitting ...
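
The F column is consistent with the harmonic mean of P and R, i.e. the usual F1 definition. The short check below recomputes it from the table values. That the evaluation script uses exactly this formula is an assumption, although the numbers agree to four decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from the results table.
results = {
    "ER-CNN": (0.9410, 0.8630, 0.9003),
    "MLMI-CNN": (0.8205, 0.6406, 0.7195),
    "MLMI-CONT": (0.8819, 0.7158, 0.7902),
}

for name, (p, r, reported) in results.items():
    print("%-10s F = %.4f (reported %.4f)" % (name, f1(p, r), reported))
```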

References

http://github.com/yuhaozhang/sentence-convnet
http://github.com/dennybritz/cnn-text-classification-tf
http://www.wildml.com/2015/12/implementing-a-cnn-for-text-classification-in-tensorflow/
http://tkengo.github.io/blog/2016/03/14/text-classification-by-cnn/
Adel et al. Comparing Convolutional Neural Networks to Traditional Models for Slot Filling NAACL 2016
Nguyen and Grishman. Relation Extraction: Perspective from Convolutional Neural Networks NAACL 2015
Lin et al. Neural Relation Extraction with Selective Attention over Instances ACL 2016

cnn-re-tf's Issues

How do you create the entities.pickle file?

Where can I download the entities.pickle file used for distant supervision? I need help with this.

Dataset format and input format for new predictions

Hi, can you please explain how I can build my own dataset for training MLMICNN? I'm confused by the source.att and target files and by some tokens in the other files, such as , , etc.
Also, is it possible to check the relation prediction for a single sentence after training is completed?

How to prepare the source.att file

Hi there, I am back again. This time I am trying to train a context CNN model using the sentence plus some other attributes. For example, a source data entry would include the sentence itself and demographic attributes of the writer (age, income, education level, etc.): "The fox chase a bunny , 23 , 24000 , high school"

I am not quite sure whether I can simply add these other attributes as the left and right files. Correct me if I am wrong:

source.left:
the fox chase a bunny
source.middle:
23,24000
source.right:
high school
target.txt
1 0

But how about source.att? How should I decide the values between 0.0 and 1.0?

TypeError: object of type 'NoneType' has no len() with #3 settings

(.venv) ub16hp@UB16HP:/ub16_prj/cnn-re-tf$ python distant_supervision.py
===== step 1 =====
[1/4] Downloading wiki articles ...
===== step 2 =====
[1/4] Downloading wiki articles ...
===== step 3 =====
[1/4] Downloading wiki articles ...
===== step 4 =====
[1/4] Downloading wiki articles ...
===== step 5 =====
[1/4] Downloading wiki articles ...
===== step 6 =====
[1/4] Downloading wiki articles ...
===== step 7 =====
[1/4] Downloading wiki articles ...
Traceback (most recent call last):
  File "distant_supervision.py", line 693, in <module>
    main()
  File "distant_supervision.py", line 681, in main
    positive_examples()
  File "distant_supervision.py", line 452, in positive_examples
    ret = loop(step, doc_id, limit, entities, relations, counter)
  File "distant_supervision.py", line 343, in loop
    docs = download_wiki_articles(doc_id, limit)
  File "distant_supervision.py", line 73, in download_wiki_articles
    pages = bs(r, "html.parser").findAll('page')
  File "/home/ub16hp/ub16_prj/cnn-re-tf/.venv/local/lib/python2.7/site-packages/bs4/__init__.py", line 246, in __init__
    elif len(markup) <= 256 and (
TypeError: object of type 'NoneType' has no len()
(.venv) ub16hp@UB16HP:/ub16_prj/cnn-re-tf$
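
The last frame shows BeautifulSoup calling len() on its markup argument, so `r` was None when download_wiki_articles passed it in, most likely because the download itself returned nothing. One possible way to handle this is to retry the download and skip parsing when no markup comes back. The sketch below is not the repository's code: `fetch_with_retry` and its `fetch` callable are hypothetical names standing in for whatever request logic the script uses.

```python
import time


def fetch_with_retry(fetch, retries=3, delay=0.0):
    """Call fetch() until it returns something other than None.

    `fetch` is any zero-argument callable that returns the response body,
    or None when the request failed. Returns None if every attempt fails.
    """
    for _ in range(retries):
        body = fetch()
        if body is not None:
            return body
        time.sleep(delay)  # wait briefly before the next attempt
    return None


# Simulated download that fails twice before succeeding.
responses = iter([None, None, "<api><page>Example</page></api>"])
body = fetch_with_retry(lambda: next(responses))

# Only hand the markup to a parser when there is something to parse.
pages = [] if body is None else [body]
print(len(pages))
```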

distant supervision script exits with error

Hello, Thank you for the code.

I have been trying to recreate the dataset using the instructions here (https://github.com/may-/cnn-re-tf/issues/3#issuecomment-309293662). It works well until the very end, then gives the following error:

Traceback (most recent call last):
  File "./distant_supervision.py", line 693, in <module>
    main()
  File "./distant_supervision.py", line 687, in main
    extract_negative()
  File "./distant_supervision.py", line 666, in extract_negative
    subj = '<' + entities[row['subj'].encode('utf-8')][0] + '>'
KeyError: 'Bell\xc3\xaame'

Any thoughts? Thanks in advance.

Did you optimize F1 specifically

Hi,

I am working on a project similar to yours.
I took a look at your text_cnn, and it looks like the loss function you use is cross entropy.
I wonder what precision and recall look like when your loss starts to converge.
In my case the loss becomes very small but precision and recall are still high, and I am not sure what needs to be done.
Did you optimize F1 specifically?
Thanks!

STANFORD NER

hello!
(base_path = "http://en.wikipedia.org/w/api.php?format=xml&action=query"
query = base_path + "&list=random&rnnamespace=0&rnlimit=%d" % limit)
What does this mean?
I cannot get any useful information from this URL.
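
For context, the snippet quoted above builds a request to the MediaWiki query API that asks for a list of random pages. Opened in a browser, that URL returns only page ids and titles as XML, not article text, which may be why it looks uninformative. The sketch below rebuilds the same URL and annotates each parameter. The helper name `random_pages_url` is made up for this example, and no network request is sent.

```python
BASE_PATH = "http://en.wikipedia.org/w/api.php?format=xml&action=query"


def random_pages_url(limit):
    """Build the URL that asks Wikipedia for `limit` random article titles."""
    return (
        BASE_PATH                # format=xml: XML response, action=query: query module
        + "&list=random"         # use the "random pages" list
        + "&rnnamespace=0"       # namespace 0 = ordinary articles only
        + "&rnlimit=%d" % limit  # how many random pages to return
    )


print(random_pages_url(5))
```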

[Help] How do I specify the positive class? How to output the prediction results?

Dear all,

I need help understanding this code.

I would like to use this code to make predictions. My data contains two class labels, namely 'Cat' and 'Bunny'. If I would like to pick "Bunny" as the positive class, should I edit \er\target.txt and set all instances whose class is "Bunny" to "1 0" and the others (Cat) to "0 1"?

Moreover, how can I get the actual predictions?

Thanks in advance for your time.
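
Going by the "1 0" and "0 1" rows proposed in the question, each line of target.txt appears to be a space-separated one-hot vector. A minimal sketch of writing such lines from class names follows. `one_hot_lines` is a hypothetical helper, and the class order, and therefore which column counts as "positive", is an assumption that should be checked against how eval.py interprets the columns.

```python
def one_hot_lines(labels, classes):
    """Turn class names into space-separated one-hot rows, one per example."""
    index = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = ["0"] * len(classes)
        row[index[label]] = "1"
        rows.append(" ".join(row))
    return rows


# Hypothetical ordering: column 0 = Bunny, column 1 = Cat.
rows = one_hot_lines(["Bunny", "Cat", "Bunny"], ["Bunny", "Cat"])
print("\n".join(rows))
```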