
yuanxiaosc / multiple-relations-extraction-only-look-once


Multiple-Relations-Extraction-Only-Look-Once. Look at the sentence just once and extract the multiple pairs of entities and their corresponding relations. An end-to-end joint multi-relation extraction model that can be used for the http://lic2019.ccf.org.cn/kg information extraction task.

Home Page: https://yuanxiaosc.github.io/2019/05/28/信息抽取任务相关论文发展脉络/

Python 89.54% Jupyter Notebook 10.46%
relation-extraction entity-extraction information-extraction joint-models tensorflow-models bert-model

multiple-relations-extraction-only-look-once's Introduction

Multiple-Relations-Extraction-Only-Look-Once

Multiple-Relations-Extraction-Only-Look-Once. Look at the sentence just once and extract all of the entities together with all of their corresponding relations.

input "text": "《逐风行》是百度文学旗下纵横中文网签约作家清水秋风创作的一部东方玄幻小说,小说已于2014-04-28正式发布"

output

"spo_list": [{"predicate": "连载网站", "object_type": "网站", "subject_type": "网络小说", "object": "纵横中文网", "subject": "逐风行"}, {"predicate": "作者", "object_type": "人物", "subject_type": "图书作品", "object": "清水秋风", "subject": "逐风行"}]

Main principle

The entity extraction task is cast as a sequence labeling task, and the multi-relation extraction task is cast as a multi-head selection task. Tokens are fed to the model, which predicts an output label, predicate values, and predicate locations for each token. The bin/read_standard_format_data.py file converts the original competition data format into the data format required by the multi-head selection model, as follows:

+-------+-------+------------+----------------------+--------------------+
| index | token |   label    |   predicate value    | predicate location |
+-------+-------+------------+----------------------+--------------------+
|   0   |   《  |     O      |        ['N']         |        [0]         |
|   1   |   逐  | B-图书作品  | ['连载网站', '作者']  |      [12, 21]      |
|   2   |   风  | I-图书作品  |        ['N']         |        [2]         |
|   3   |   行  | I-图书作品  |        ['N']         |        [3]         |
|   4   |   》  |     O      |        ['N']         |        [4]         |
|   5   |   是  |     O      |        ['N']         |        [5]         |
|   6   |   百  |     O      |        ['N']         |        [6]         |
|   7   |   度  |     O      |        ['N']         |        [7]         |
|   8   |   文  |     O      |        ['N']         |        [8]         |
|   9   |   学  |     O      |        ['N']         |        [9]         |
|   10  |   旗  |     O      |        ['N']         |        [10]        |
|   11  |   下  |     O      |        ['N']         |        [11]        |
|   12  |   纵  |   B-网站   |        ['N']         |        [12]        |
|   13  |   横  |   I-网站   |        ['N']         |        [13]        |
|   14  |   中  |   I-网站   |        ['N']         |        [14]        |
|   15  |   文  |   I-网站   |        ['N']         |        [15]        |
|   16  |   网  |   I-网站   |        ['N']         |        [16]        |
|   17  |   签  |     O      |        ['N']         |        [17]        |
|   18  |   约  |     O      |        ['N']         |        [18]        |
|   19  |   作  |     O      |        ['N']         |        [19]        |
|   20  |   家  |     O      |        ['N']         |        [20]        |
|   21  |   清  |   B-人物   |        ['N']         |        [21]        |
|   22  |   水  |   I-人物   |        ['N']         |        [22]        |
|   23  |   秋  |   I-人物   |        ['N']         |        [23]        |
|   24  |   风  |   I-人物   |        ['N']         |        [24]        |
|   25  |   创  |     O      |        ['N']         |        [25]        |
...
+-------+-------+------------+----------------------+--------------------+
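
The table can be reproduced from a sentence and its spo_list with a few lines of Python. The following is only a minimal sketch and not the repository's actual conversion code: the function name build_labels is hypothetical, tokens are assumed to be single characters, each entity is assumed to occur verbatim in the text (only its first occurrence is tagged), and when the same span receives two types the later triple overwrites the earlier one.

```python
def build_labels(text, spo_list):
    """Turn a sentence and its triples into per-token targets."""
    tokens = list(text)                            # assumption: one token per character
    labels = ["O"] * len(tokens)
    predicates = [["N"] for _ in tokens]           # "N" means no relation
    locations = [[i] for i in range(len(tokens))]  # default: a token points at itself

    def tag(entity, entity_type):
        """BIO-tag the first occurrence of entity and return its start index."""
        start = text.find(entity)
        if start < 0:
            return None
        labels[start] = "B-" + entity_type
        for i in range(start + 1, start + len(entity)):
            labels[i] = "I-" + entity_type
        return start

    for spo in spo_list:
        s = tag(spo["subject"], spo["subject_type"])
        o = tag(spo["object"], spo["object_type"])
        if s is None or o is None:
            continue  # entity not found verbatim in the text
        if predicates[s] == ["N"]:
            predicates[s], locations[s] = [], []
        # The relation is stored on the first token of the subject and
        # points at the first token of the object.
        predicates[s].append(spo["predicate"])
        locations[s].append(o)
    return tokens, labels, predicates, locations


text = "《逐风行》是百度文学旗下纵横中文网签约作家清水秋风创作的一部东方玄幻小说"
spo_list = [
    {"predicate": "连载网站", "object_type": "网站", "subject_type": "网络小说",
     "object": "纵横中文网", "subject": "逐风行"},
    {"predicate": "作者", "object_type": "人物", "subject_type": "图书作品",
     "object": "清水秋风", "subject": "逐风行"},
]
tokens, labels, predicates, locations = build_labels(text, spo_list)
print(labels[1], predicates[1], locations[1])
# prints: B-图书作品 ['连载网站', '作者'] [12, 21]
```

Storing every relation on the subject's first token is what lets one token carry several (predicate, location) pairs, which is the "multi-head" part of the formulation.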

See my blog for more details

Document description

name                                  description
bert                                  The feature extractor of the model. BERT is used here; it can be replaced with other feature extractors, such as BiLSTM or CNN.
bin/data_manager.py                   Prepare formatted data for the model.
bin/integrated_model_output.py        Organize the model's prediction output into the standard data format.
bin/read_standard_format_data.py      View formatted data.
bin/test_head_select_scores.py        One method for solving the relation extraction problem: the multi-head selection method.
experimental_loss_function            Files under experimentation (run_multiple_relations_extraction_XXX.py).
raw_data
produce_submit_json_file.py           Generate entity-relation triples and write them to JSON files.
run_multiple_relations_extraction.py  The most basic model for training and prediction.

Need Your Help!

Problem code location

run_multiple_relations_extraction.py, lines 531~549.

You can try different experiments (experimental_loss_function/run_multiple_relations_extraction_XXX.py) and share your results.

Use example

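Competition task

Given a set of schema constraints and a sentence sent, where each schema defines a relation P and the types of its subject S and object O, for example (S_TYPE:人物,P:妻子,O_TYPE:人物) and (S_TYPE:公司,P:创始人,O_TYPE:人物), that is (person, wife, person) and (company, founder, person), the task requires a participating system to analyze the sentence automatically and output all SPO triples in it that satisfy the schema constraints: Triples=[(S1, P1, O1), (S2, P2, O2)…]. Input/output: (1) Input: the schema constraint set and the sentence sent. (2) Output: the triples Triples contained in sent that conform to the given schema constraints.

Example input sentence: "text": "《古世》是连载于云中书城的网络小说,作者是未弱"

Output triples: "spo_list": [{"predicate": "作者", "object_type": "人物", "subject_type": "图书作品", "object": "未弱", "subject": "古世"}, {"predicate": "连载网站", "object_type": "网站", "subject_type": "网络小说", "object": "云中书城", "subject": "古世"}]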
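
The schema constraint can be checked mechanically. The sketch below is illustrative only: it assumes the schema is held in memory as a set of (subject_type, predicate, object_type) tuples, which is not necessarily how the competition's schema file is laid out, and the name filter_by_schema is hypothetical.

```python
def filter_by_schema(spo_list, schema):
    """Keep only the triples whose (S_TYPE, P, O_TYPE) appears in the schema."""
    return [
        spo for spo in spo_list
        if (spo["subject_type"], spo["predicate"], spo["object_type"]) in schema
    ]


# Hypothetical in-memory schema built from the types used in the example.
schema = {
    ("图书作品", "作者", "人物"),
    ("网络小说", "连载网站", "网站"),
}
candidates = [
    {"predicate": "作者", "object_type": "人物", "subject_type": "图书作品",
     "object": "未弱", "subject": "古世"},
    {"predicate": "连载网站", "object_type": "网站", "subject_type": "网络小说",
     "object": "云中书城", "subject": "古世"},
    # Violates the schema: 作者 must link a 图书作品 to a 人物.
    {"predicate": "作者", "object_type": "网站", "subject_type": "图书作品",
     "object": "云中书城", "subject": "古世"},
]
valid = filter_by_schema(candidates, schema)
print(len(valid))  # prints: 2
```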

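Data overview

The SKE dataset used in this competition is the largest schema-based Chinese information extraction dataset in the industry. It contains more than 430,000 triples, 210,000 Chinese sentences, and 50 predefined schemas; Table 1 shows the 50 schemas in the SKE dataset with corresponding examples. The sentences come from Baidu Baike and Baidu feed text. The dataset is split into 170,000 training sentences, 20,000 validation sentences, and 20,000 test sentences. The training and validation sets are used for training and can be downloaded freely. There are two test sets: test set 1 lets participants validate their results on the platform themselves, while test set 2 is released one week before the competition ends, cannot be validated on the platform, and is used for the final ranking.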

Getting Started

Environment Requirements

  • Python 3.6+
  • TensorFlow 1.12.0+

Step 1: Environment preparation

  • Install TensorFlow
  • Download BERT-Base, Chinese, unzip the file, and put it in the pretrained_model folder.

Step 2: Download the training data, dev data and schema files

Please download the training data, development data and schema files from the competition website, then unzip files and put them in ./raw_data/ folder.

cd raw_data
unzip train_data.json.zip
unzip dev_data.json.zip
cd -
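
After unzipping, a quick sanity check of the files can catch path or encoding problems early. The sketch below rests on two assumptions that are not confirmed by this README: that each data file holds one JSON object per line with "text" and "spo_list" keys, and that the training file sits at raw_data/train_data.json. The helper names are hypothetical.

```python
import json
import os


def read_json_lines(path):
    """Yield one dict per non-empty line of a JSON-lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_triples(samples):
    """Return (number of sentences, number of triples)."""
    sentences = triples = 0
    for sample in samples:
        sentences += 1
        # "spo_list" may be missing or null, so fall back to an empty list.
        triples += len(sample.get("spo_list") or [])
    return sentences, triples


path = "raw_data/train_data.json"  # assumed location, adjust as needed
if os.path.exists(path):
    print(count_triples(read_json_lines(path)))
```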

The raw data for the 2019 Language and Intelligence Challenge information extraction task is no longer available for download here. If you have any questions, you can contact me by email at [email protected]

Step 3: Data preprocessing

python bin/data_manager.py

It is currently recommended to use run_multiple_relations_extraction_MSE_loss.py instead of run_multiple_relations_extraction.py for model training and prediction!

Step 4: Model training

Using run_multiple_relations_extraction_mask_loss.py is recommended.

python run_multiple_relations_extraction.py \
--task_name=SKE_2019 \
--do_train=true \
--do_eval=false \
--data_dir=bin/standard_format_data \
--vocab_file=pretrained_model/chinese_L-12_H-768_A-12/vocab.txt \
--bert_config_file=pretrained_model/chinese_L-12_H-768_A-12/bert_config.json \
--init_checkpoint=pretrained_model/chinese_L-12_H-768_A-12/bert_model.ckpt \
--max_seq_length=128 \
--train_batch_size=32 \
--learning_rate=2e-5 \
--num_train_epochs=3.0 \
--output_dir=./output_model/multiple_relations_model/epochs3/

Step 5: Model prediction

python run_multiple_relations_extraction.py \
  --task_name=SKE_2019 \
  --do_predict=true \
  --data_dir=bin/standard_format_data \
  --vocab_file=pretrained_model/chinese_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=pretrained_model/chinese_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=output_model/multiple_relations_model/epochs3/model.ckpt-2000 \
  --max_seq_length=128 \
  --output_dir=./infer_out/multiple_relations_model/epochs3/ckpt2000

Step 6: Generate the entity and relation files

python produce_submit_json_file.py

You can use other strategies to generate the final entity-relation file for better results. Template code is provided in produce_submit_json_file.py.
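
As one possible starting point for such a strategy, the sketch below decodes per-token predictions back into triples. It is not the logic of produce_submit_json_file.py: the function names are hypothetical, the inputs are assumed to follow the table format shown under "Main principle" (a BIO label, a list of predicate values, and a list of predicate locations per token), and a relation is kept only when it starts on the first token of one entity and points at the first token of another.

```python
def extract_entities(tokens, labels):
    """Map each entity's start index to (entity text, entity type) using BIO tags."""
    entities = {}
    i = 0
    while i < len(labels):
        if labels[i].startswith("B-"):
            entity_type = labels[i][2:]
            j = i + 1
            while j < len(labels) and labels[j] == "I-" + entity_type:
                j += 1
            entities[i] = ("".join(tokens[i:j]), entity_type)
            i = j
        else:
            i += 1
    return entities


def decode_triples(tokens, labels, predicates, locations):
    """Combine entity spans and head-selection outputs into an spo_list."""
    entities = extract_entities(tokens, labels)
    spo_list = []
    for s_idx, (subject, subject_type) in entities.items():
        for predicate, o_idx in zip(predicates[s_idx], locations[s_idx]):
            if predicate == "N" or o_idx not in entities:
                continue  # no relation, or the head is not an entity start
            obj, object_type = entities[o_idx]
            spo_list.append({
                "predicate": predicate, "object_type": object_type,
                "subject_type": subject_type, "object": obj, "subject": subject,
            })
    return spo_list


tokens = list("《古世》作者是未弱")
labels = ["O", "B-图书作品", "I-图书作品", "O", "O", "O", "O", "B-人物", "I-人物"]
predicates = [["N"] for _ in tokens]
locations = [[i] for i in range(len(tokens))]
predicates[1], locations[1] = ["作者"], [7]
print(decode_triples(tokens, labels, predicates, locations))
```

A stricter or looser rule for accepting a relation (for example, also checking the triple against the schema) is the kind of alternative strategy this step leaves open.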

Paper realization

This code is an unofficial implementation of "Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers" and "Joint entity recognition and relation extraction as a multi-head selection problem".
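
To give a feel for the multi-head selection idea, here is a small pure-Python sketch of one common way to score it: for every token pair (i, j) and relation r, a score V_r · tanh(U z_j + W z_i + b) is passed through a sigmoid, so each token can select several heads and relations independently. This is an illustration under stated assumptions, not the code in this repository: the parameter shapes, the sharing of U, W, and b across relations, and the random values standing in for encoder outputs are all choices made for the sketch.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def head_selection_probs(z, U, W, b, V):
    """probs[i][j][r]: probability that token j is the head of token i under relation r."""
    Uz = [matvec(U, z_j) for z_j in z]
    Wz = [matvec(W, z_i) for z_i in z]
    n = len(z)
    probs = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            hidden = [math.tanh(u + w + c) for u, w, c in zip(Uz[j], Wz[i], b)]
            probs[i][j] = [
                sigmoid(sum(v * h for v, h in zip(V_r, hidden))) for V_r in V
            ]
    return probs


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
n_tokens, d, hidden_size, n_relations = 4, 6, 5, 3
z = rand_matrix(n_tokens, d)            # stand-in for encoder (e.g. BERT) outputs
U, W = rand_matrix(hidden_size, d), rand_matrix(hidden_size, d)
b = [0.0] * hidden_size
V = rand_matrix(n_relations, hidden_size)

probs = head_selection_probs(z, U, W, b, V)
# Every (token, head, relation) combination above the threshold is selected.
selected = [(i, j, r)
            for i in range(n_tokens)
            for j in range(n_tokens)
            for r in range(n_relations)
            if probs[i][j][r] > 0.5]
```

Because each (i, j, r) decision is an independent sigmoid rather than one softmax over heads, a single token can take part in several relations at once.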


multiple-relations-extraction-only-look-once's Issues

Out of range: Read less bytes than requested

W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Out of range: Read less bytes than requested
Memory: 120 GB
CPU
Has the author run into this problem? How can it be solved?

English

Is this model applicable to English corpora?

After subject_predicate_object_predict_output.json is generated, every spo_list is empty, as shown below

{"text": "《不是所有时光都微笑》是2012年7月1日光明日报出版社出版的书籍,作者是蓝瞳", "spo_list": null}
{"text": "《鬼影实录2》是托德·威廉姆斯执导,布赖恩·波兰德主演的恐怖片", "spo_list": null}
{"text": "”这是明朝天启年间的首辅大学士叶向高为纪念尤溪籍靖边将领詹荣逝世七十年所写的《读史吊詹角山司马》诗", "spo_list": null}

do eval

How do I set up model evaluation?

Prediction results are wrong

Training was run with run_multiple_relations_extraction.py
I only ran prediction on three sentences:

  • 肥西家明房产经纪有限公司创建于2000年12月18日,总部办公地址位于合肥市肥西县县城上派镇三河中路众鑫楼2单元303室
  • 360于2005年11月创立,系互联网安全服务和产品提供商,并于2011年登陆美国纽交所
  • 兹娜·萨扎娜维特斯,女,出生于1990年10月25日,白俄罗斯举重运动员

This is the loss:
image

The NER results are there, but the relation results:
image

Dataset link is dead

Hello, the Baidu Netdisk link for the dataset has expired. Could you please re-upload it? Thanks!

File generation fails at the last step

Everything before it works fine, but when I run the last step, this keeps happening
python produce_submit_json_file.py
0it [00:00, ?it/s]

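Single-text prediction?

Thank you very much for your open-source spirit!
I would like to use your project to try single-text prediction. How should I do that?
I tried modifying the code in run_multiple_relations_extraction.py, but when I finally reach this function, how should a single text be fed in here?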
d = tf.data.TFRecordDataset(input_file)

def input_fn(params):
    """The actual input function."""
    batch_size = params["batch_size"]

    # For training, we want a lot of parallel reading and shuffling.
    # For eval, we want no shuffling and parallel reading doesn't matter.
    d = tf.data.TFRecordDataset(input_file)
    if is_training:
        d = d.repeat()
        d = d.shuffle(buffer_size=100)

    d = d.apply(
        tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features),
            batch_size=batch_size,
            drop_remainder=drop_remainder))

    return d

Dataset

Hello, the provided dataset can no longer be downloaded. Could you provide it again?

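Label information

In this method, is the predicted entity label embedding not used when doing head prediction? For example, by concatenating the BERT encoder output with the entity label embedding and using the result as the input to head prediction.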

loss

Among the different losses, which one gives the best results?

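About the paper Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers

Thanks to the author for sharing the code; it has been a great help to my study and research. In the Paper realization section at the end of the README I saw the paper Extracting Multiple-Relations in One-Pass with Pre-Trained Transformers. Where in the code were the improvements made? Also, in the entity recognition model, and I am not sure whether I understand this correctly, why is softmax classification used instead of a CRF?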

Label Embedding

Hello, has the use of the NER label as part of the input to relation prediction been dropped?

Following the README, all SPO results are empty?

The results are as follows:
{"text": "《不是所有时光都微笑》是2012年7月1日光明日报出版社出版的书籍,作者是蓝瞳", "spo_list": []}
{"text": "《鬼影实录2》是托德·威廉姆斯执导,布赖恩·波兰德主演的恐怖片", "spo_list": []}
{"text": "”这是明朝天启年间的首辅大学士叶向高为纪念尤溪籍靖边将领詹荣逝世七十年所写的《读史吊詹角山司马》诗", "spo_list": []}
What is the cause?
