
pure's Introduction

PURE: Entity and Relation Extraction from Text

This repository contains (PyTorch) code and pre-trained models for PURE (the Princeton University Relation Extraction system), described in the paper: A Frustratingly Easy Approach for Entity and Relation Extraction.

Overview

In this work, we present a simple approach for entity and relation extraction. Our approach contains three components:

  1. The entity model takes a piece of text as input and predicts all the entities at once.
  2. The relation model considers every pair of entities independently by inserting typed entity markers, and predicts the relation type for each pair.
  3. The approximation relation model supports batch computations, which enables efficient inference for the relation model.

Please find more details of this work in our paper.
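
To make the typed entity markers concrete, here is a minimal sketch of how one candidate subject/object pair can be wrapped before being fed to the relation model. This is illustrative only: the helper function is hypothetical, and only the "<subj_start=TYPE>"-style marker strings mirror the markers added by the released code.

# Illustrative sketch: spans are (start, end, type) with inclusive,
# sentence-level token indices.
def insert_typed_markers(tokens, subj, obj):
    inserts = [
        (subj[0],     f"<subj_start={subj[2]}>"),
        (subj[1] + 1, f"<subj_end={subj[2]}>"),
        (obj[0],      f"<obj_start={obj[2]}>"),
        (obj[1] + 1,  f"<obj_end={obj[2]}>"),
    ]
    out = list(tokens)
    # Insert from right to left so earlier offsets remain valid.
    for pos, marker in sorted(inserts, key=lambda x: x[0], reverse=True):
        out.insert(pos, marker)
    return out

tokens = ["MORPA", "is", "a", "fully", "implemented", "parser"]
print(insert_typed_markers(tokens, (0, 0, "Method"), (5, 5, "Method")))
# ['<subj_start=Method>', 'MORPA', '<subj_end=Method>', 'is', 'a', 'fully',
#  'implemented', '<obj_start=Method>', 'parser', '<obj_end=Method>']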

Setup

Install dependencies

Please install all the dependency packages using the following command:

pip install -r requirements.txt

Download and preprocess the datasets

Our experiments are based on three datasets: ACE04, ACE05, and SciERC. Please find the links and pre-processing below:

  • ACE04/ACE05: We use the preprocessing code from the DyGIE repo. Please follow its instructions to preprocess the ACE05 and ACE04 datasets.
  • SciERC: The preprocessed SciERC dataset can be downloaded from their project website.

Quick Start

The following commands can be used to download the preprocessed SciERC dataset and run our pre-trained models on SciERC.

# Download the SciERC dataset
wget http://nlp.cs.washington.edu/sciIE/data/sciERC_processed.tar.gz
mkdir scierc_data; tar -xf sciERC_processed.tar.gz -C scierc_data; rm -f sciERC_processed.tar.gz
scierc_dataset=scierc_data/processed_data/json/

# Download the pre-trained models (single-sentence)
mkdir scierc_models; cd scierc_models

# Download the pre-trained entity model
wget https://nlp.cs.princeton.edu/projects/pure/scierc_models/ent-scib-ctx0.zip
unzip ent-scib-ctx0.zip; rm -f ent-scib-ctx0.zip
scierc_ent_model=scierc_models/ent-scib-ctx0/

# Download the pre-trained full relation model
wget https://nlp.cs.princeton.edu/projects/pure/scierc_models/rel-scib-ctx0.zip
unzip rel-scib-ctx0.zip; rm -f rel-scib-ctx0.zip
scierc_rel_model=scierc_models/rel-scib-ctx0/

# Download the pre-trained approximation relation model
wget https://nlp.cs.princeton.edu/projects/pure/scierc_models/rel_approx-scib-ctx0.zip
unzip rel_approx-scib-ctx0.zip; rm -f rel_approx-scib-ctx0.zip
scierc_rel_model_approx=scierc_models/rel_approx-scib-ctx0/

cd ..

# Run the pre-trained entity model, the result will be stored in ${scierc_ent_model}/ent_pred_test.json
python run_entity.py \
    --do_eval --eval_test \
    --context_window 0 \
    --task scierc \
    --data_dir ${scierc_dataset} \
    --model allenai/scibert_scivocab_uncased \
    --output_dir ${scierc_ent_model}

# Run the pre-trained full relation model
python run_relation.py \
  --task scierc \
  --do_eval --eval_test \
  --model allenai/scibert_scivocab_uncased \
  --do_lower_case \
  --context_window 0 \
  --max_seq_length 128 \
  --entity_output_dir ${scierc_ent_model} \
  --output_dir ${scierc_rel_model}
  
# Output end-to-end evaluation results
python run_eval.py --prediction_file ${scierc_rel_model}/predictions.json

# Run the pre-trained approximation relation model (with batch computation)
python run_relation_approx.py \
  --task scierc \
  --do_eval --eval_test \
  --model allenai/scibert_scivocab_uncased \
  --do_lower_case \
  --context_window 0 \
  --max_seq_length 250 \
  --entity_output_dir ${scierc_ent_model} \
  --output_dir ${scierc_rel_model_approx} \
  --batch_computation

# Output end-to-end evaluation results
python run_eval.py --prediction_file ${scierc_rel_model_approx}/predictions.json

Entity Model

Input data format for the entity model

The input data format of the entity model is JSONL. Each line of the input file contains one document in the following format.

{
  # document ID (please make sure doc_key can be used to identify a certain document)
  "doc_key": "CNN_ENG_20030306_083604.6",

  # sentences in the document, each sentence is a list of tokens
  "sentences": [
    [...],
    [...],
    ["tens", "of", "thousands", "of", "college", ...],
    ...
  ],

  # entities (boundaries and entity type) in each sentence
  "ner": [
    [...],
    [...],
    [[26, 26, "LOC"], [14, 14, "PER"], ...], # the boundary positions are indexed at the document level
    ...,
  ],

  # relations (two spans and relation type) in each sentence
  "relations": [
    [...],
    [...],
    [[14, 14, 10, 10, "ORG-AFF"], [14, 14, 12, 13, "ORG-AFF"], ...],
    ...
  ]
}
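
As a quick sanity check on this format (in particular the document-level span indexing), the following minimal sketch reads a JSONL file and verifies that every NER span falls inside its sentence's token range; the file path is just an example.

import json

# Example path; point this at your own preprocessed file.
with open("scierc_data/processed_data/json/train.json") as f:
    for line in f:
        doc = json.loads(line)
        offset = 0
        for sent, ner in zip(doc["sentences"], doc["ner"]):
            for start, end, label in ner:
                # spans are inclusive and indexed over the whole document
                assert offset <= start <= end < offset + len(sent), \
                    (doc["doc_key"], start, end, label)
            offset += len(sent)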

Train/evaluate the entity model

You can use run_entity.py with --do_train to train an entity model and with --do_eval to evaluate an entity model. A training command template is as follows:

python run_entity.py \
    --do_train --do_eval [--eval_test] \
    --learning_rate=1e-5 --task_learning_rate=5e-4 \
    --train_batch_size=16 \
    --context_window {0 | 100 | 300} \
    --task {ace05 | ace04 | scierc} \
    --data_dir {directory of preprocessed dataset} \
    --model {bert-base-uncased | albert-xxlarge-v1 | allenai/scibert_scivocab_uncased} \
    --output_dir {directory of output files}

Arguments:

  • --learning_rate: the learning rate for BERT encoder parameters.
  • --task_learning_rate: the learning rate for task-specific parameters, i.e., the classifier head after the encoder.
  • --context_window: the context window size used in the model. 0 means using no context. In our cross-sentence entity experiments, we use --context_window 300 for BERT and SciBERT models and --context_window 100 for ALBERT models (a simplified sketch of how the window is applied follows this list).
  • --model: the base transformer model. We use bert-base-uncased and albert-xxlarge-v1 for ACE04/ACE05 and use allenai/scibert_scivocab_uncased for SciERC.
  • --eval_test: whether to evaluate on the test set or not.
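
As a rough illustration of what --context_window does (a simplified sketch under our own assumptions, not the actual code in entity/utils.py): a sentence is extended with tokens from its neighboring sentences until the total length reaches the window size.

def add_context(sentences, idx, context_window):
    # Simplified sketch: extend sentences[idx] with tokens from neighboring
    # sentences, roughly half on each side, until the total length reaches
    # context_window. context_window == 0 keeps the sentence unchanged.
    sent = list(sentences[idx])
    if context_window <= len(sent):
        return sent
    left = sum(sentences[:idx], [])        # all tokens before the sentence
    right = sum(sentences[idx + 1:], [])   # all tokens after the sentence
    budget = context_window - len(sent)
    left_take = min(len(left), budget // 2)
    right_take = min(len(right), budget - left_take)
    return left[len(left) - left_take:] + sent + right[:right_take]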

The predictions of the entity model will be saved as a file (ent_pred_dev.json) in the output_dir directory. If you set --eval_test, the predictions (ent_pred_test.json) will be made on the test set instead. The prediction file of the entity model is the input file of the relation model.

Relation Model

Input data format for the relation model

The input data format of the relation model is almost the same as that of the entity model, except that there is one additional field "predicted_ner" that stores the predictions of the entity model.

{
  "doc_key": "CNN_ENG_20030306_083604.6",
  "sentences": [...],
  "ner": [...],
  "relations": [...],
  "predicted_ner": [
    [...],
    [...],
    [[26, 26, "LOC"], [14, 15, "PER"], ...],
    ...
  ]
}
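
If you produce entity predictions outside of run_entity.py (e.g., for a custom dataset), a small helper along these lines can attach them to the gold documents so the file matches this format. The helper itself is hypothetical and not part of the repository.

import json

def attach_predictions(in_path, out_path, predictions):
    # predictions: dict mapping doc_key -> per-sentence lists of
    # [start, end, type] spans, using the same document-level indexing as "ner".
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            doc = json.loads(line)
            doc["predicted_ner"] = predictions[doc["doc_key"]]
            f_out.write(json.dumps(doc) + "\n")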

Train/evaluate the relation model

You can use run_relation.py with --do_train to train a relation model and with --do_eval to evaluate a relation model. A training command template is as follows:

python run_relation.py \
  --task {ace05 | ace04 | scierc} \
  --do_train --train_file {path to the training json file of the dataset} \
  --do_eval [--eval_test] [--eval_with_gold] \
  --model {bert-base-uncased | albert-xxlarge-v1 | allenai/scibert_scivocab_uncased} \
  --do_lower_case \
  --train_batch_size 32 \
  --eval_batch_size 32 \
  --learning_rate 2e-5 \
  --num_train_epochs 10 \
  --context_window {0 | 100} \
  --max_seq_length {128 | 228} \
  --entity_output_dir {path to output files of the entity model} \
  --output_dir {directory of output files}

Arguments:

  • --eval_with_gold: whether to evaluate the model with the gold entities provided.
  • --entity_output_dir: the output directory of the entity model. The prediction files (ent_pred_dev.json or ent_pred_test.json) of the entity model should be in this directory.

The prediction results will be stored in the file predictions.json in the folder output_dir, and the format will be almost the same as the output file of the entity model, except that there is one more field "predicted_relations" for each document.

You can run the evaluation script to output the end-to-end performance (Ent, Rel, and Rel+) of the predictions.

python run_eval.py --prediction_file {path to output_dir}/predictions.json
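
For reference, Rel counts a predicted relation as correct when both argument spans and the relation label match a gold relation, while Rel+ (strict) additionally requires the predicted entity types of both arguments to be correct. The sketch below is a simplified stand-in for these metric definitions, not run_eval.py itself; the tuple layout is an assumption made for the example.

def relation_prf(gold, pred, strict=False):
    # gold/pred: sets of tuples
    # (subj_start, subj_end, subj_type, obj_start, obj_end, obj_type, rel_type).
    # Rel ignores entity types; Rel+ (strict=True) requires them to match too.
    def key(t):
        s0, s1, st, o0, o1, ot, r = t
        return t if strict else (s0, s1, o0, o1, r)
    g = {key(t) for t in gold}
    p = {key(t) for t in pred}
    correct = len(g & p)
    prec = correct / len(p) if p else 0.0
    rec = correct / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1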

Approximation relation model

You can use the following command to train an approximation model.

python run_relation_approx.py \
 --task {ace05 | ace04 | scierc} \
 --do_train --train_file {path to the training json file of the dataset} \
 --do_eval [--eval_with_gold] \
 --model {bert-base-uncased | allenai/scibert_scivocab_uncased} \
 --do_lower_case \
 --train_batch_size 32 \
 --eval_batch_size 32 \
 --learning_rate 2e-5 \
 --num_train_epochs 10 \
 --context_window {0 | 100} \
 --max_seq_length {128 | 228} \
 --entity_output_dir {path to output files of the entity model} \
 --output_dir {directory of output files}

Once you have a trained approximation model, you can enable efficient batch computation during inference with --batch_computation:

python run_relation_approx.py \
 --task {ace05 | ace04 | scierc} \
 --do_eval [--eval_test] [--eval_with_gold] \
 --model {bert-base-uncased | allenai/scibert_scivocab_uncased} \
 --do_lower_case \
 --eval_batch_size 32 \
 --context_window {0 | 100} \
 --max_seq_length 250 \
 --entity_output_dir {path to output files of the entity model} \
 --output_dir {directory of output files} \
 --batch_computation

Note: the current code does not support approximation models based on ALBERT.

Pre-trained Models

We release our pre-trained entity models and relation models for ACE05 and SciERC datasets.

Note: the performance of the pre-trained models might be slightly different from the reported numbers in the paper, since we reported the average numbers based on multiple runs.

Pre-trained models for ACE05

Entity models:

Relation models:

Performance of pretrained models on ACE05 test set:

  • BERT (single)
NER - P: 0.890260, R: 0.882944, F1: 0.886587
REL - P: 0.689624, R: 0.652476, F1: 0.670536
REL (strict) - P: 0.664830, R: 0.629018, F1: 0.646429
  • BERT-approx (single)
NER - P: 0.890260, R: 0.882944, F1: 0.886587
REL - P: 0.678899, R: 0.642919, F1: 0.660419
REL (strict) - P: 0.651376, R: 0.616855, F1: 0.633646
  • ALBERT (single)
NER - P: 0.900237, R: 0.901388, F1: 0.900812
REL - P: 0.739901, R: 0.652476, F1: 0.693444
REL (strict) - P: 0.698522, R: 0.615986, F1: 0.654663
  • BERT (cross)
NER - P: 0.902111, R: 0.905405, F1: 0.903755
REL - P: 0.701950, R: 0.656820, F1: 0.678636
REL (strict) - P: 0.668524, R: 0.625543, F1: 0.646320
  • BERT-approx (cross)
NER - P: 0.902111, R: 0.905405, F1: 0.903755
REL - P: 0.684448, R: 0.657689, F1: 0.670802
REL (strict) - P: 0.659132, R: 0.633362, F1: 0.645990
  • ALBERT (cross)
NER - P: 0.911111, R: 0.905953, F1: 0.908525
REL - P: 0.748521, R: 0.659427, F1: 0.701155
REL (strict) - P: 0.723866, R: 0.637706, F1: 0.678060

Pre-trained models for SciERC

Entity models:

Relation models:

Performance of pretrained models on SciERC test set:

  • SciBERT (single)
NER - P: 0.667857, R: 0.665875, F1: 0.666865
REL - P: 0.491614, R: 0.481520, F1: 0.486515
REL (strict) - P: 0.360587, R: 0.353183, F1: 0.356846
  • SciBERT-approx (single)
NER - P: 0.667857, R: 0.665875, F1: 0.666865
REL - P: 0.500000, R: 0.453799, F1: 0.475780
REL (strict) - P: 0.376697, R: 0.341889, F1: 0.358450
  • SciBERT (cross)
NER - P: 0.676223, R: 0.713947, F1: 0.694573
REL - P: 0.494797, R: 0.536961, F1: 0.515017
REL (strict) - P: 0.362346, R: 0.393224, F1: 0.377154
  • SciBERT-approx (cross)
NER - P: 0.676223, R: 0.713947, F1: 0.694573
REL - P: 0.483366, R: 0.507187, F1: 0.494990
REL (strict) - P: 0.356164, R: 0.373717, F1: 0.364729

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Zexuan Zhong ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

If you use our code in your research, please cite our work:

@inproceedings{zhong2021frustratingly,
   title={A Frustratingly Easy Approach for Entity and Relation Extraction},
   author={Zhong, Zexuan and Chen, Danqi},
   booktitle={North American Association for Computational Linguistics (NAACL)},
   year={2021}
}


pure's Issues

Questions about package

Hello,
I have a few questions about the package:

1.) Is your code integrated into the spaCy pipeline, or is it separate from it?
2.) What size of dataset do we need to train a good working model?
3.) What system configuration is needed to train the model with the data from question 2?

Thanks again for all the help.

Ace Relation Dataset

Where can we find the ACE2005 relation dataset with the "predicted_ner" attribute, as mentioned in the README?

Thanks,

Provide full environment

Hi, can you please provide an environment.yml (or requirements.txt if using only pip) with all the libraries and the python version this repo uses? Using the current requirements.txt throws errors for libraries like spacy. Thank you!

About the Approximation relation model

Hi Zexuan,

Thanks for releasing the nice code.

I am confused about why the "batch computations" are only applied at inference time. Did you try to use batch computations during training?

Best,
Deming

About Multiple Entity Pairs in the Same Sentence

Hello,

For multiple entity pairs in one sentence, do you separately generate a new sentence for each entity pair during training? How do you sample negative examples, i.e., entity pairs without relations?

Thank you

How to run pre-trained model on a custom datasets

Hi,
I have used the brat tool to annotate my data, so I have two files: .ann files with the annotations of entities and relations, and raw text files. My goal is to convert these two files into the SciERC format and finally use your pre-trained models on my dataset. However, I am having problems converting the data into the SciERC format.

Can you please point me in the right direction on how to achieve this, so I can run your model on my data?

Thanks

Using own pretrained model from checkpoint

Hi,

I have the problem that training with my own, further pre-trained model results in very long training times.
I tried bert-base-uncased first, which trains fairly fast. However, my model has additional tokens in the added_tokens.json file of the checkpoint, which results in at least 5x slower training. When I delete that file, it is as fast as usual. Using the same model to train a BERT model with AutoModelForTokenClassification also gives the same fast speed with the additional tokens; only the entity model training here slows down. Any advice on why this happens or how to speed it up?

Thank you.

Negative sampling

Hey! Once again, great work!

Since I don't have enough computer resources for experimenting myself, I need to address this question to you guys.

Have you tried sampling the negative samples (non-ner spans) instead of taking them all?
If so, did it yield good results?
If it didn't, do you have a clue why not?

Thanks in advance!

different F1 with the same seed

Hi, I am trying to use your code with our own dataset. Every time I train, the results on the dev or test set are slightly different, by maybe 0.1 F1, but I can't find the specific place in your code where this happens. Have you ever tried to use the same seed, run multiple times, and check whether the results are consistent? Thank you.

Error about running run_entity.py on SciERC

Error1: OSError: Model name 'allenai/scibert_scivocab_uncased' was not found in tokenizers model name list.....

So I upgraded some dependency packages (torch=1.8.1, transformers=4.6.1), but then I got a new error.

Error2: TypeError: Caught TypeError in replica 0 on device 0.
TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

I hope you can help me. What dependency package versions do you use when running run_entity.py on SciERC?

The added special tokens seem to be converted to IDs incorrectly on my Chinese dataset

Have you tried the add_new_tokens argument? I used bert-base-chinese to train on my Chinese dataset, and the number of entity types exceeds the number of available [unused] tokens. After applying the add_marker_tokens function, the vocab size and tokenizer length both increased from 21128 to 21300. Then I applied the original BERT model and called resize_token_embeddings(); however, I found that the special tokens, such as <subj_start=地点>, <obj_start=人物>, etc., are all mapped to 100, namely [UNK], in the input_ids sequence after applying tokenizer.convert_tokens_to_ids(). Since the different marker tokens should have different IDs according to the paper, could you please help me solve this problem?

Fine-tuning time (training time) for the entity model

Hey! First of all, congratulations on the paper and thanks for releasing the code.

I couldn't find the time spent on the training phase (or the processors used) in the paper. Could you provide this info, please?

Also, have you tried your model with a distilled BERT?

Thanks!

[Paper] What are "gold" entity and relationship types?

Hi, I was reading through your work and in section 3.2 under Training & Inference section, you define e* and r* to be "gold" entity and relationship types. Could you please explain or point to any literature that explains what exactly that means? Thank you!

How to run a pretrained model on unlabeled data?

Hi,

I'm looking to apply your pretrained models to an unlabeled, new dataset. I have my dataset in DyGIE format. Looking at the script, it's unclear to me how to do this, because there are only two blocks of code in the script. The first is if args.do_train:, where the model is trained, and the second is if args.do_eval:, where the model is evaluated.

I don't want to train, since I'm using a pre-trained model, but I also don't want to evaluate, since my data don't have labels, which makes my use case different than the example of applying the pretrained scibert models to the scierc dataset.

Wondering if you have pointers on how to do this?

Thanks!

Error trying to train custom model.

Hello,
I have successfully converted my JSON output from Inception into the JSONL format accepted by your package. The output is as follows:

{"doc_key": "Test_001", "sentences": [["Google", "was", "founded", "on", "September", "4", ",", "1998", ",", "by", "Larry", "Page", "and", "Sergey", "Brin", "while", "they", "were", "Ph.D", ".", "students", "at", "Stanford", "University", "in", "California", "."], ["On", "December", "3", ",", "2019", ",", "Pichai", "also", "became", "the", "CEO", "of", "Google", "."], ["In", "March", "1999", ",", "the", "Google", "moved", "its", "offices", "to", "Palo", "Alto", ",", "California", "."]], "ner": [[[0, 0, "ORG"], [10, 11, "PER"], [13, 14, "PER"]], [[33, 33, "PER"], [39, 39, "ORG"]], [[46, 46, "ORG"], [51, 54, "LOC"]]], "relations": [[[0, 0, 10, 11, "Founder"], [0, 0, 13, 14, "Founder"]], [[33, 33, 39, 39, "CEO"]], [[46, 46, 51, 54, "Located in"]]]}

This was saved as train.json, test.json and dev.json. I wanted to test your package using this input file to see if it is working. I used the following command:

python run_entity.py --do_train --learning_rate=1e-5 --task_learning_rate=5e-4 --train_batch_size=1 --context_window 300 --data_dir data_dir --model bert-base-uncased --output_dir output_dir --task ace05

I got the following error:

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 2.00 GiB total capacity; 1.28 GiB already allocated; 5.61 MiB free; 1.39 GiB reserved in total by PyTorch)

How can I fix this? Also, do you have any recommendations for the options of my command(s) given my goal? Thank you very much.

About License

Thank you for publishing the great results.
Is it possible to include a license for this repository's code?

How to compute the REL+ F1(strict)

Hi Zexuan,

Is there any script for computing REL+ (strict), which measures the correctness of both the entity types and the relation prediction?

In run_relation.py, the output F1 seems to be the REL F1.

Best,
Deming

Question regarding Table 4 in the paper

Hi,

Thanks for your excellent work!

May I ask why the F1 score of 64.2 reported in Table 4 for "TypedMarkers" is different from the 66.7 reported in Table 1 under the single-sentence, BERT-base setting?

Peiyuan

tool to create training data

Hello,
I was wondering, how do you create the custom training data in JSONL format? Do you do it manually or do you have access to a tool? Thank you very much.

Expired certificate for nlp.cs.princeton.edu

When trying to download pretrained models with wget, I get the following error:

Resolving nlp.cs.princeton.edu (nlp.cs.princeton.edu)... 128.112.136.61
Connecting to nlp.cs.princeton.edu (nlp.cs.princeton.edu)|128.112.136.61|:443... connected.
ERROR: cannot verify nlp.cs.princeton.edu's certificate, issued by ‘/C=US/O=Let's Encrypt/CN=R3’:
  Issued certificate has expired.
To connect to nlp.cs.princeton.edu insecurely, use `--no-check-certificate'.

I was wondering if anyone else has had this problem, & if it's resolvable or if I should connect insecurely.

Thanks!

What does e2e mean in Table 4

Hi,

I have a question regarding Table 4 in the original paper.

Since you only explain e2e as "e2e: the entities are predicted by our entity model", there are several possible interpretations:

  1. The relation model uses entity labels and entity span predicted by the entity model in both training and evaluation (in this case it is on dev set).
  2. The relation model uses the gold entity class labels and gold entity span during training but uses the labels and span predicted by the entity model during inference.
  3. The relation model uses the span predicted by the entity model but uses the gold entity class label during training. During inference, it uses the span and class label predicted by the entity model.

May I ask which should be the correct explanation? Thank you!

Best regards,
Peiyuan

Language

Hello, can this model be used on datasets in other languages?

Missing predicted value

There are multiple entities of the same type in a sentence, but the model always predicts only one. For example, the input is "10,000 catties of oranges and 20,000 catties of apples", and the output is only "oranges --> fruits". Why is this?

About the second encoder for the relation model

Hello, I'm really interested in your pipelined approach in this paper. I notice that you mentioned a second pre-trained encoder in the relation model. I'm a little confused about that. I'm not sure how you get the second pre-trained encoder. Is the training procedure completely identical with BERT? Or is there anything special in getting the encoder?

Thanks,

Non-deterministic NER results

Hi Zexuan,

I tried to reproduce the NER results for SciERC and ran the same command 5 times (the default seed is 0).

python run_entity.py \
    --do_train \
    --context_window 0 \
    --task scierc \
    --data_dir scierc_data/processed_data/json/ \
    --model allenai/scibert_scivocab_uncased \
    --output_dir scierc_models/ent-scib-ctx0/

However, I got non-deterministic results.
I've only observed this happening on the GPU, rather than the CPU.
Is it possible to avoid the non-deterministic results?

08/19/2021 21:45:10 - INFO - root - Evaluating...
08/19/2021 21:45:13 - INFO - root - Accuracy: 0.989835
08/19/2021 21:45:13 - INFO - root - Cor: 1113, Pred TOT: 1703, Gold TOT: 1685
08/19/2021 21:45:13 - INFO - root - P: 0.65355, R: 0.66053, F1: 0.65702
08/19/2021 21:45:13 - INFO - root - Used time: 3.658346
08/19/2021 21:45:17 - INFO - root - Total pred entities: 1703

08/19/2021 21:45:32 - INFO - root - Evaluating...
08/19/2021 21:45:36 - INFO - root - Accuracy: 0.989780
08/19/2021 21:45:36 - INFO - root - Cor: 1135, Pred TOT: 1759, Gold TOT: 1685
08/19/2021 21:45:36 - INFO - root - P: 0.64525, R: 0.67359, F1: 0.65912
08/19/2021 21:45:36 - INFO - root - Used time: 3.603827
08/19/2021 21:45:39 - INFO - root - Total pred entities: 1759

08/19/2021 21:45:54 - INFO - root - Evaluating...
08/19/2021 21:45:58 - INFO - root - Accuracy: 0.989813
08/19/2021 21:45:58 - INFO - root - Cor: 1147, Pred TOT: 1769, Gold TOT: 1685
08/19/2021 21:45:58 - INFO - root - P: 0.64839, R: 0.68071, F1: 0.66416
08/19/2021 21:45:58 - INFO - root - Used time: 3.655415
08/19/2021 21:46:01 - INFO - root - Total pred entities: 1769

08/19/2021 21:46:17 - INFO - root - Evaluating...
08/19/2021 21:46:21 - INFO - root - Accuracy: 0.990009
08/19/2021 21:46:21 - INFO - root - Cor: 1107, Pred TOT: 1672, Gold TOT: 1685
08/19/2021 21:46:21 - INFO - root - P: 0.66208, R: 0.65697, F1: 0.65952
08/19/2021 21:46:21 - INFO - root - Used time: 3.643370
08/19/2021 21:46:24 - INFO - root - Total pred entities: 1672

08/19/2021 21:46:39 - INFO - root - Evaluating...
08/19/2021 21:46:43 - INFO - root - Accuracy: 0.989835
08/19/2021 21:46:43 - INFO - root - Cor: 1115, Pred TOT: 1714, Gold TOT: 1685
08/19/2021 21:46:43 - INFO - root - P: 0.65053, R: 0.66172, F1: 0.65608
08/19/2021 21:46:43 - INFO - root - Used time: 3.624065
08/19/2021 21:46:46 - INFO - root - Total pred entities: 1714

The GPU is a GTX 3090 (CUDA 11.1). The packages installed are:

allennlp==2.4.0
torch==1.7.0
transformers==4.5.1
overrides==3.1.0
requests==2.24.0

Best,
Yaobin

entity/models.py _get_input_tensors function

Hi, I have a detailed question that has puzzled me for a long time.
It concerns the function "_get_input_tensors" in 'models.py' under the 'entity' folder:
def _get_input_tensors(self, tokens, spans, spans_ner_label):
    start2idx = []
    end2idx = []

    bert_tokens = []
    bert_tokens.append(self.tokenizer.cls_token)
    for token in tokens:
        start2idx.append(len(bert_tokens))
        sub_tokens = self.tokenizer.tokenize(token)
        bert_tokens += sub_tokens
        end2idx.append(len(bert_tokens)-1)
    bert_tokens.append(self.tokenizer.sep_token)

    indexed_tokens = self.tokenizer.convert_tokens_to_ids(bert_tokens)
    tokens_tensor = torch.tensor([indexed_tokens])

    bert_spans = [[start2idx[span[0]], end2idx[span[1]], span[2]] for span in spans]
    bert_spans_tensor = torch.tensor([bert_spans])

    spans_ner_label_tensor = torch.tensor([spans_ner_label])

    return tokens_tensor, bert_spans_tensor, spans_ner_label_tensor

Here, in "bert_spans = [[start2idx[span[0]], end2idx[span[1]], span[2]] for span in spans]", "spans" was created from the sentence alone (it does not contain the left and right context), but start2idx and end2idx are computed from the sentence plus its left and right context. So I think there may be a problem with "bert_spans" (maybe my understanding is wrong).
Could you explain this for me? Thank you!

RuntimeError: CUDA error: device-side assert triggered

Hi, I ran into a "RuntimeError: CUDA error: device-side assert triggered" error when I attempted to run your code on a Chinese dataset.
The log is as follows:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [165,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [165,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "run_entity.py", line 225, in <module>
    output_dict = model.run_batch(train_batches[i], training=True)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 302, in run_batch
    attention_mask = attention_mask_tensor.to(self._model_device),
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 65, in forward
    spans_embedding = self._get_span_embeddings(input_ids, spans, token_type_ids=token_type_ids, 
attention_mask=attention_mask)
  File "/tf_group/lihongyu/PURE-main/entity/models.py", line 41, in _get_span_embeddings
    sequence_output, pooled_output = self.bert(input_ids=input_ids, token_type_ids=token_type_ids,             
attention_mask=attention_mask)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 752, in forward
    input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py", line 181, in forward
    embeddings = inputs_embeds + position_embeddings + token_type_embeddings
RuntimeError: CUDA error: device-side assert triggered

I have searched for a few cases of this error on Stack Overflow, but still fail to make out what has happened.
I printed the dimensions of inputs_embeds, position_embeddings, and token_type_embeddings, and nothing seemed to be wrong (all of them are of shape [1, seq_len (>350), 768]).
Thanks for your time.

Question about ACE2004

Hello,

This work is very interesting!
However, I cannot find the code for how you do the k-fold cross-validation on ACE2004.

Siheng

About NER F1

Hi Zexuan,

In the evaluation in run_entity.py, you select the entities with span length <= max_span_length (8) to compute tot_gold. Doesn't this make the recall an overestimate?

PURE/run_entity.py

Lines 76 to 108 in 8517005

def evaluate(model, batches, tot_gold):
    """
    Evaluate the entity model
    """
    logger.info('Evaluating...')
    c_time = time.time()
    cor = 0
    tot_pred = 0
    l_cor = 0
    l_tot = 0
    for i in range(len(batches)):
        output_dict = model.run_batch(batches[i], training=False)
        pred_ner = output_dict['pred_ner']
        for sample, preds in zip(batches[i], pred_ner):
            for gold, pred in zip(sample['spans_label'], preds):
                l_tot += 1
                if pred == gold:
                    l_cor += 1
                if pred != 0 and gold != 0 and pred == gold:
                    cor += 1
                if pred != 0:
                    tot_pred += 1
    acc = l_cor / l_tot
    logger.info('Accuracy: %5f'%acc)
    logger.info('Cor: %d, Pred TOT: %d, Gold TOT: %d'%(cor, tot_pred, tot_gold))
    p = cor / tot_pred if cor > 0 else 0.0
    r = cor / tot_gold if cor > 0 else 0.0
    f1 = 2 * (p * r) / (p + r) if cor > 0 else 0.0
    logger.info('P: %.5f, R: %.5f, F1: %.5f'%(p, r, f1))
    logger.info('Used time: %f'%(time.time()-c_time))
    return f1

Best,
Deming

Preparing input file

For your input file that you used here:

{
  # document ID (please make sure doc_key can be used to identify a certain document)
  "doc_key": "CNN_ENG_20030306_083604.6",

  # sentences in the document, each sentence is a list of tokens
  "sentences": [
    [...],
    [...],
    ["tens", "of", "thousands", "of", "college", ...],
    ...
  ],

  # entities (boundaries and entity type) in each sentence
  "ner": [
    [...],
    [...],
    [[26, 26, "LOC"], [14, 14, "PER"], ...], # the boundary positions are indexed at the document level
    ...,
  ],

  # relations (two spans and relation type) in each sentence
  "relations": [
    [...],
    [...],
    [[14, 14, 10, 10, "ORG-AFF"], [14, 14, 12, 13, "ORG-AFF"], ...],
    ...
  ]
}

... how did you do the annotation for relations and NER? Was it done using Inception or Tagtog or another tool? Thank you very much again.

The NER results for 5 times runs

Hi Zexuan,

The paper reports the average Rel F1 over 5 runs. Do the 5 RE runs use the same ent_pred_test.json (e.g., from the BERT (cross, W=300) entity model), or do they use different NER predictions from entity models trained with different seeds?

Could you please share the ent_pred_test.json for the 5 RE runs?

Thanks!

Best,
Deming

TensorFlow version

Thank you for your excellent work. May I ask whether there is a TensorFlow version?

CUDA error when trying to run command

Hi, I tried to run the commands for the pre-trained model:

python run_entity.py \
    --do_eval --eval_test \
    --context_window 0 \
    --task scierc \
    --data_dir ${scierc_dataset} \
    --model allenai/scibert_scivocab_uncased \
    --output_dir ${scierc_ent_model}

and the custom model:

python run_entity.py --do_train --learning_rate=1e-5 --task_learning_rate=5e-4 --train_batch_size=16 --context_window 0 --data_dir data_dir --model bert-base-uncased --output_dir output_dir --task inception

I get this error for both:

[screenshot of the error message]

Here are my specs:

RTX 3060, CUDA 11.2
torch==1.4.0

Do you think it is the Nvidia version that is causing the issue? Thank you very much.

attention mask

I want to ask why the dimension of the attention mask in the relation_approx model is [batch_size, from_seq_length, to_seq_length],
and why the marker tokens carrying the position information (subject and object) that are spliced after the sentence are masked?

Line 82 of the code in the models.py file in the relations folder

Thank you.

Multiple issues

Hi,

I recently used this for a project and faced several issues/shortcomings of the repository that can be improved. Here are some of the things:

  • Ability to auto-populate task fields based on labels found in the dataset and removal of the restriction of the task argument to be one of the standard ones
    parser.add_argument('--task', type=str, default=None, required=True, choices=['ace04', 'ace05', 'scierc'])
  • Having a --no-cuda argument in run_entity as well to allow users to test code on a machine without CUDA without actually interfering with the code. This is already present in run_relations, but has been skipped in this file for some reason.
  • Lack of --do_predict, which has already been raised in #30
  • Some optimisations like
    model = EntityModel(args, num_ner_labels=num_ner_labels)

    dev_data = Dataset(args.dev_data)
    dev_samples, dev_ner = convert_dataset_to_samples(dev_data, args.max_span_length, ner_label2id=ner_label2id, context_window=args.context_window)
    dev_batches = batchify(dev_samples, args.eval_batch_size)

    if args.do_train:
        train_data = Dataset(args.train_data)

All of these lines can be put under the if args.do_train as these are relevant only to that section and the other args.do_eval section already loads the model again like so

    if args.do_eval:
        args.bert_model_dir = args.output_dir
        model = EntityModel(args, num_ner_labels=num_ner_labels)
  • Other small optimisations and some typos

I am thinking of creating an independent issue for each of these and starting to work on including some of these fixes/features in the near future, as I've already made quite a few changes locally for my project. Just wanted to get your thoughts on these issues and whether there is something that I am missing.

Reproducing NER results for ACE05

Hi Zexuan,

I tried to reproduce the NER results for ACE05 (88.7 for single-sentence, 90.1 for cross-sentence).

My commands are as follows:

single sentence:

python run_entity.py     --do_train --do_eval --eval_test     --learning_rate=1e-5 --task_learning_rate=5e-4     --train_batch_size=16     --context_window 0     --task ace05       --data_dir data/ace05     --model ../bert_models/bert-base-uncased       --output_dir models/bsz16_seed0_ctx0

And I get test F1=83.475

cross sentence:

python run_entity.py     --do_train --do_eval --eval_test     --learning_rate=1e-5 --task_learning_rate=5e-4     --train_batch_size=16     --context_window 300     --task ace05       --data_dir data/ace05     --model ../bert_models/bert-base-uncased       --output_dir models/bsz16_seed42_ctx300 --seed 42

And I get test F1=84.96

These numbers are much lower than those you reported. Is anything wrong?

Best,
Deming

FileNotFoundError: [Errno 2] No such file or directory: 'scierc_models/ent-scib-ctx0/ent_pred_dev.json'

Hello,
I keep getting the error mentioned in the title while running this script:

python run_relation.py   --task scierc   --do_train --train_file scierc_data/processed_data/json/train.json   --do_eval --eval_with_gold   --model  allenai/scibert_scivocab_uncased   --do_lower_case   --train_batch_size 8   --eval_batch_size 32   --learning_rate 2e-5   --num_train_epochs 3   --context_window 100   --max_seq_length 128    --entity_output_dir scierc_models/ent-scib-ctx0   --output_dir scierc_outputs

I have also tried with the --eval_test argument, and without using either --eval_test or --eval_with_gold; both throw the same error.
Earlier I had managed to run the first two evaluation scripts provided with the repo seamlessly. They produced ent_pred_test.json and predictions.json, respectively, as expected:

# Run the pre-trained entity model, the result will be stored in ${scierc_ent_model}/ent_pred_test.json
python run_entity.py \
   --do_eval --eval_test \
   --context_window 0 \
   --task scierc \
   --data_dir ${scierc_dataset} \
   --model allenai/scibert_scivocab_uncased \
   --output_dir ${scierc_ent_model}

# Run the pre-trained full relation model
python run_relation.py \
 --task scierc \
 --do_eval --eval_test \
 --model allenai/scibert_scivocab_uncased \
 --do_lower_case \
 --context_window 0\
 --max_seq_length 128 \
 --entity_output_dir ${scierc_ent_model} \
 --output_dir ${scierc_rel_model}

Could you please tell me if I am doing anything wrong?

About the relation in datasets

Hi, I noticed the "relations" field in the format of your data input.
Does every entity in a relation have to appear in the same sentence, rather than just the same document?

Input Data Format

Hello!
I'm trying to use your repo but first I need to convert my annotations in json format.
My annotations are in .ann format and this is a sample of them:
T0 Claim 2992 3010 here are the facts
T1 Premise 3012 3078 210,000 dead people in our country in just the last several months
R0 Support Arg1:T0 Arg2:

Now, I am trying to create the .json file to "feed" the model. My concerns are about the indexes I have to put in the .json input file for entities and relations. The indexes next to the "T" entities refer to character indexes in the main .txt file.

My issue is whether I have to do some math to calculate "new" indexes with some offset, or whether the "real" indexes can be used without any problems.

Thank you in advance for your time and help. I'm trying to test your models on my annotations, but I want to be sure the input is good.

Pierpaolo

About the "Batch Computations"

Hi, Thanks for releasing the code.
I am very interested in the "Batch Computations" mentioned in your paper, but I couldn't find where they are implemented.
Looking forward to your reply.

Creating custom dataset from scratch

I was wondering whether you download the training data in the original SciERC format (as shown here) and then reformat it automatically before training the model. I am asking because I am a little confused about whether to format my custom dataset like SciERC or like your input data format shown in the README.md. Also, SciERC annotates sentence-wise, as per the aforementioned link. How does PURE handle multi-sentence passages? The pre-trained models downloaded via the README.md links are also labeled as single-sentence models.

KeyError causing custom training to halt

Hello,
I believe I have resolved my previous issue with the custom training. I have to wait 10 minutes to move on from "Waiting for CUDA..." but it does successfully move on to the next stage. However, I get the following error:

File "run_entity.py", line 197, in
dev_samples, dev_ner = convert_data_to_samples(dev_data, args.max_span_length, ner_label2id=ner_label2id, context_window=args.context_window)
File "/home/programmer/Documents/data/custom_triples_phases/PURE/entity/utils.py", line 122, in convert_dataset_to_samples
sample['spans_label'].append(ner_label2id[sent_ner[(i,j)]])
KeyError: 'LOC'

Do you have any idea what is causing this error? Just as a reminder, my train, dev and test JSON files are as follows (with different doc_keys):

{"clusters": [], "doc_key": "Test_001", "sentences": [["Google", "was", "founded", "on", "September", "4", ",", "1998", ",", "by", "Larry", "Page", "and", "Sergey", "Brin", "while", "they", "were", "Ph.D", ".", "students", "at", "Stanford", "University", "in", "California", "."], ["On", "December", "3", ",", "2019", ",", "Pichai", "also", "became", "the", "CEO", "of", "Google", "."], ["In", "March", "1999", ",", "the", "Google", "moved", "its", "offices", "to", "Palo", "Alto", ",", "California", "."]], "ner": [[[0, 0, "ORG"], [10, 11, "PER"], [13, 14, "PER"]], [[33, 33, "PER"], [39, 39, "ORG"]], [[46, 46, "ORG"], [51, 54, "LOC"]]], "relations": [[[0, 0, 10, 11, "Founder"], [0, 0, 13, 14, "Founder"]], [[33, 33, 39, 39, "CEO"]], [[46, 46, 51, 54, "Located in"]]]}

The command I ran is as follows:

python run_entity.py --do_train --learning_rate=1e-5 --task_learning_rate=5e-4 --train_batch_size=16 --context_window 0 --data_dir data_dir --model bert-base-uncased --output_dir output_dir --task inception

Thank you very much.

About relation F1

Hello, I want to know the difference between F1 and task_F1 in run_relation.py. Isn't n_gold always equal to e2e_ngold?
