yuanxiaosc / entity-relation-extraction

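Entity and Relation Extraction Based on TensorFlow and BERT. Pipeline-style entity and relation extraction built on TensorFlow and BERT, a solution to the information extraction task of the 2019 Language and Intelligence Challenge (Schema based Knowledge Extraction, SKE 2019).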

Home Page: https://yuanxiaosc.github.io/2019/05/17/多关系抽取研究/

Python 100.00%

tensorflow entity-extraction relation-extraction pipeline-framework bert-model competition-code


entity-relation-extraction's Issues

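@broccolik The postag information is not used here. If you do want to use it, character-level postags would be the better choice, because the parameters of the Chinese pre-trained BERT model were all trained on characters.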

Originally posted by @yuanxiaosc in #27 (comment)

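Where do the files in the test folder come from?

Hello, do I need to prepare the files in the test folder myself? Following the steps did not generate them. If I have to prepare them myself, what format should they be in?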

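What else would this model need to match the first-place accuracy?

Thank you very much for sharing. I would like to ask what further work this model would need in order to match the first-place accuracy, or whether you could share an existing solution with higher accuracy. Thanks!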

KeyError when running run_predicate_classification.py

Hello, I ran run_predicate_classification.py as described in ReadMe.md and got the following error:

Traceback (most recent call last):
File "run_predicate_classification.py", line 821, in <module>
tf.app.run()
File "/data/anaconda3/envs/tf/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_predicate_classification.py", line 698, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
File "run_predicate_classification.py", line 385, in file_based_convert_examples_to_features
max_seq_length, tokenizer)
File "run_predicate_classification.py", line 347, in convert_single_example
label_ids = _predicate_label_to_id(label_list, label_map)
File "run_predicate_classification.py", line 371, in _predicate_label_to_id
predicate_label_ids[predicate_label_map[label]] = 1
KeyError: ''

I do not know how to solve this problem.

Possible bug in row_label_ids?

On line 579:

row_label_ids = tf.reduce_sum(tf.ones_like(elements_equal), -1)

should be:
row_label_ids = tf.reduce_sum(tf.ones_like(label_ids), -1)

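Hello author, could you send me a copy of the original dataset?

Since the dataset can no longer be downloaded from the official site, would it be possible to send me a copy of the original dataset for research? The competition has ended, so I can no longer register to download it.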

The run_predicate_classification.py training script fails with an error. Could you take a look?

Traceback (most recent call last):
File "run_predicate_classification.py", line 812, in <module>
tf.app.run()
File "/home/anaconda/anaconda3/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_predicate_classification.py", line 690, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
File "run_predicate_classification.py", line 381, in file_based_convert_examples_to_features
max_seq_length, tokenizer)
File "run_predicate_classification.py", line 343, in convert_single_example
label_ids = _predicate_label_to_id(label_list, label_map)
File "run_predicate_classification.py", line 367, in _predicate_label_to_id
predicate_label_ids[predicate_label_map[label]] = 1
KeyError: ''

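About results and joint training

How well does this pipeline-style training perform?
Also, given BERT's high accuracy, would it be feasible to build the loss by directly combining the losses of the two tasks, i.e. the sum of the classification loss and the tagging loss, and then fine-tune?
I have seen people do joint training, but so far the results do not seem good enough.
Thanks!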

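Where is the GPU setting?

Hello, when I run the program it automatically selects the CPU, while my other programs automatically select the GPU. Where is the option for using the GPU?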

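KeyError at runtime

The function def _predicate_label_to_id(predicate_label, predicate_label_map): converts relation labels into a one-hot vector. All relations are defined in def get_labels(self):, and any relation listed there can be converted, so you can print the key, i.e. the label in predicate_label_map[label], and take a look.

Originally posted by @yuanxiaosc in #4 (comment)
Hello author, I also get a KeyError when training the relation classification model. After printing label I found that it is empty; that sentence in the original corpus indeed has no relation class. How should this be handled?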
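
One way to handle sentences that carry no relation, sketched under assumptions: skip empty label strings when building the multi-hot vector, so that such a sentence yields an all-zero target. The function below is a standalone re-creation for illustration, not the repository's code, and the three-entry label map is hypothetical. Whether an all-zero target is the right training signal, as opposed to dropping the example, is a judgment call.

```python
def predicate_labels_to_multi_hot(predicate_labels, predicate_label_map):
    """Turn a list of relation names into a multi-hot vector.

    Empty strings (sentences with no relation) are skipped, so they
    produce an all-zero vector instead of raising KeyError: ''.
    """
    multi_hot = [0] * len(predicate_label_map)
    for label in predicate_labels:
        if not label:  # '' comes from an empty label line
            continue
        multi_hot[predicate_label_map[label]] = 1
    return multi_hot


# Hypothetical three-relation label map, for illustration only.
label_map = {"国籍": 0, "出生地": 1, "作者": 2}

# "".split(" ") yields [''], one way an empty string can reach the lookup.
print(predicate_labels_to_multi_hot("".split(" "), label_map))  # [0, 0, 0]
print(predicate_labels_to_multi_hot("国籍 作者".split(" "), label_map))  # [1, 0, 1]
```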

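The same relation occurring multiple times in one sentence

For example, in the sentence "张三的国籍是**，李四的国籍是印度" ("Zhang San's nationality is **, Li Si's nationality is India"), the relation "国籍" (nationality) has to be predicted twice. In the relation classification model, should the corresponding label be 国籍, or 国籍, 国籍?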

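Hello author, how long does training on a CPU take?

Hello, roughly what is the loss at the very start of training in the run_sequnce_labeling code?

I am reproducing this in PyTorch and have a question here: my evaluation metrics fall well short of your results.
loss = 0.5 * predicate_loss + token_label_loss
Because TensorFlow does not show the concrete intermediate outputs, I would like to ask you about this.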
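
As a rough sanity check for a re-implementation (not a number taken from the author's runs): if a freshly initialized softmax head predicted all K classes with equal probability, its cross-entropy would be ln K. The sketch below combines two such terms with the 0.5 weight from the formula above. The class counts are placeholders, real heads will not start exactly uniform, and the sketch assumes both losses are softmax cross-entropies averaged per example or per token.

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)


def expected_initial_loss(num_predicates, num_token_labels, predicate_weight=0.5):
    """Rough starting value of predicate_weight * predicate_loss + token_label_loss."""
    predicate_loss = uniform_cross_entropy(num_predicates)
    token_label_loss = uniform_cross_entropy(num_token_labels)
    return predicate_weight * predicate_loss + token_label_loss


# Placeholder class counts; substitute the sizes of your own label sets.
print(round(expected_initial_loss(num_predicates=50, num_token_labels=10), 3))
```

A starting loss far above such an estimate can hint at a labeling or masking bug in the port; a value near it says little about final quality.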

A problem while generating the entity-relation results

When running python produce_submit_json_file.py:

python produce_submit_json_file.py

Traceback (most recent call last):
File "produce_submit_json_file.py", line 324, in <module>
spo_list_manager = Sorted_relation_and_entity_list_Management(TEST_DATA_DIR, MODEL_OUTPUT_DIR, Competition_Mode=Competition_Mode)
File "produce_submit_json_file.py", line 133, in __init__
File_Management.__init__(self, TEST_DATA_DIR=TEST_DATA_DIR, MODEL_OUTPUT_DIR=MODEL_OUTPUT_DIR, Competition_Mode=Competition_Mode)
File "produce_submit_json_file.py", line 82, in __init__
self.MODEL_OUTPUT_DIR = get_latest_model_predict_data_dir(MODEL_OUTPUT_DIR)
File "produce_submit_json_file.py", line 22, in get_latest_model_predict_data_dir
if not os.path.exists(new_ckpt_dir):
UnboundLocalError: local variable 'new_ckpt_dir' referenced before assignment

The error is: local variable 'new_ckpt_dir' referenced before assignment
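
An UnboundLocalError of this kind usually means the variable is only assigned inside a loop or branch that never ran, for instance because the model output directory contains no prediction sub-directory yet. The sketch below is a hypothetical helper, not the repository's get_latest_model_predict_data_dir; picking the newest sub-directory by modification time is an assumption made for illustration. It fails early with a readable message instead.

```python
import os


def get_latest_subdir(parent_dir):
    """Return the most recently modified sub-directory of parent_dir.

    Raises FileNotFoundError with a readable message when there is
    nothing to pick, instead of failing later with UnboundLocalError.
    """
    if not os.path.isdir(parent_dir):
        raise FileNotFoundError("No such directory: %s" % parent_dir)
    subdirs = [
        os.path.join(parent_dir, name)
        for name in os.listdir(parent_dir)
        if os.path.isdir(os.path.join(parent_dir, name))
    ]
    if not subdirs:
        raise FileNotFoundError(
            "%s contains no prediction output; run the predict step first" % parent_dir
        )
    return max(subdirs, key=os.path.getmtime)
```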

Hello, would it be convenient for you to share the data?

Evaluation error in run_sequence_labeling

Hello, did you get the following error when running evaluation with run_sequence_labeling?
TypeError: Values of eval_metric_ops must be (metric_value, update_op) tuples, given: Tensor("ArgMax:0", shape=(?,), dtype=int32) for key: predicate_prediction

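Cannot find the code that generates token_in.txt and the other txt files

Where are txt files such as token_in.txt and predicate_out.txt generated? I see the first issue asks the same question. Downloading the training data works fine, but the txt files are clearly not there. Could you provide these files and the script that generates them? Thanks.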
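
Judging from the issues on this page, these txt files appear to be intermediate outputs of a data preparation step rather than part of the raw download. A small pre-flight check, sketched below, can report which inputs are missing before a long training run starts. The helper name and the example directory are illustrative; only the two file names are taken from the issue text.

```python
import os


def find_missing_files(data_dir, required_names):
    """Return the entries of required_names that do not exist under data_dir."""
    return [
        name for name in required_names
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


# File names as quoted in the issue; the directory is only an example.
missing = find_missing_files("raw_data/train", ["token_in.txt", "predicate_out.txt"])
if missing:
    print("Missing input files, run the data preparation step first:", missing)
```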

dataset

Could you provide the complete training dataset? I did not take part in the competition, so I cannot download it.

The data on the website you provided can no longer be downloaded. Could you upload the data to GitHub? Thank you.

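Input format for model prediction

Hello, what is the input to this model: two entities and their part-of-speech tags, plus the whole sentence?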

Hello author, running the relation classification model raises KeyError

File "run_predicate_classification.py", line 812, in <module>
tf.app.run()
File "D:\Anaconda3\envs\comp\lib\site-packages\tensorflow\python\platform\app.py", line 125, in run
_sys.exit(main(argv))
File "run_predicate_classification.py", line 690, in main
train_examples, label_list, FLAGS.max_seq_length, tokenizer, train_file)
File "run_predicate_classification.py", line 381, in file_based_convert_examples_to_features
max_seq_length, tokenizer)
File "run_predicate_classification.py", line 343, in convert_single_example
label_ids = _predicate_label_to_id(label_list, label_map)
File "run_predicate_classification.py", line 367, in _predicate_label_to_id
predicate_label_ids[predicate_label_map[label]] = 1
KeyError: ''

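Is the data in raw_data incomplete? The official site has stopped offering the download; could you share the full dataset? Thanks!

I saw that readme.md says: "The SKE dataset used in this competition is the industry's largest schema-based Chinese information extraction dataset, containing more than 430,000 triples and 210,000 Chinese sentences", but raw_data actually contains only a few thousand sentences.

The official site has stopped offering the download; could you share the full dataset?

Thanks!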

No such file or directory: 'bin/subject_object_labeling/sequence_labeling_data/test/token_in_and_one_predicate.txt'

An error occurs when running prediction with the sequence labeling model

python run_sequnce_labeling.py \
  --task_name=SKE_2019 \
  --do_predict=true \
  --data_dir=bin/subject_object_labeling/sequence_labeling_data \
  --vocab_file=pretrained_model/chinese_L-12_H-768_A-12/vocab.txt \
  --bert_config_file=pretrained_model/chinese_L-12_H-768_A-12/bert_config.json \
  --init_checkpoint=output/sequnce_labeling_model/epochs9/model.ckpt-22000 \
  --max_seq_length=128 \
  --output_dir=./output/sequnce_infer_out/epochs9/ckpt22000

Exception:

W0202 13:40:21.592350 139693254723392 tpu_context.py:222] eval_on_tpu ignored because use_tpu is False.
Traceback (most recent call last):
  File "run_sequnce_labeling.py", line 885, in <module>
    tf.app.run()
  File "/srv/jupyterhub/envs/lib/python3.7/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/srv/jupyterhub/envs/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/srv/jupyterhub/envs/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "run_sequnce_labeling.py", line 826, in main
    predict_examples = processor.get_test_examples(FLAGS.data_dir)
  File "run_sequnce_labeling.py", line 235, in get_test_examples
    with open(os.path.join(data_dir, os.path.join("test", "token_in_and_one_predicate.txt")), encoding='utf-8') as token_in_f:
FileNotFoundError: [Errno 2] No such file or directory: 'bin/subject_object_labeling/sequence_labeling_data/test/token_in_and_one_predicate.txt'

I checked, and the bin/subject_object_labeling/sequence_labeling_data/test/ directory is empty.

Also, there does not seem to be a bin/prepare_data_for_labeling_infer.py script.

Releasing the data like this is probably illegal, isn't it?

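How to use the models after training the relation and sequence labeling models

Hello, I followed your code and approach and finished training the relation classification model and the entity recognition model. How do I then use the models? For example, I input sentence 1, call the models, and get a table of the entities and their relations as output.

Hello, may I ask what rank this model's results would reach?

As titled, thanks @yuanxiaosc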

Can't find token_in.txt in data_dir

This happens at line 218 of run_predicate_classification.py. Where can I download this txt file?

How to run the evaluation

After generating keep_empty_spo_list_subject_predicate_object_predict_output.json, how do I evaluate it? Should I use calc_pr.py in bin/evaluation/ directly? I passed keep_empty_spo_list_subject_predicate_object_predict_output.json as the predict_file argument, but got:
predict file is error
{"errorCode": 1, "errorMsg": "file_reading_error"}

The system cannot find the specified path when saving the model

Hello,

I ran into the following error when running the code:

INFO:tensorflow:Saving checkpoints for 0 into ./output/predicate_classification_model/epochs6/model.ckpt.
2019-12-15 15:31:02.192730: W T:\src\github\tensorflow\tensorflow\core\framework\op_kernel.cc:1318] OP_REQUIRES failed at save_restore_v2_ops.cc:109 : Not found: Failed to create a NewWriteableFile: ./output/predicate_classification_model/epochs6/model.ckpt-0_temp_58f0ef5ae6ef4fadb59db0652cc8e3ec/part-00000-of-00001.data-00000-of-00001.tempstate4867356647866357195 : 系统找不到指定的路径。

System environment: Windows 10

How can I solve this?
Thanks!
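
One thing worth ruling out here, offered as a guess rather than a confirmed diagnosis: the checkpoint directory may not exist yet, or the relative ./output/... path may not resolve as expected on Windows. The sketch below uses a hypothetical helper name, creates the directory up front, and returns an absolute path that could then be passed as the output directory.

```python
import os


def ensure_output_dir(output_dir):
    """Create output_dir (and parents) if needed and return its absolute path."""
    absolute_dir = os.path.abspath(output_dir)
    os.makedirs(absolute_dir, exist_ok=True)
    return absolute_dir


# Example (path taken from the log above):
# ensure_output_dir("./output/predicate_classification_model/epochs6")
```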

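A question for the author about predicting the corresponding entities

In the entity prediction part, I noticed that you do not seem to use the relation information and only adopt joint training. Is that right?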

Runtime error

AttributeError: module 'tokenization' has no attribute 'FullTokenizer'
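
A possible cause, stated as an assumption since the thread does not confirm it: Python may be importing some other module named tokenization (for example an unrelated installed package) instead of the tokenization.py that ships with the BERT code. The sketch below, with a hypothetical helper name, prints where a module would be loaded from so this can be checked.

```python
import importlib.util


def module_origin(module_name):
    """Return the file a module would be imported from, or None if not found."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# Should point at the tokenization.py next to the BERT scripts,
# not at an unrelated package under site-packages.
print(module_origin("tokenization"))
```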

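How to run evaluation

With this approach, when detecting entities, aren't you effectively not using the relation predicted in the first step at all?

Why is this relation used as supervision for training,
rather than added as an extra feature to the sequence features for sequence labeling?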

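Running the program on specific GPUs

Hello, when I run the commands from the ReadMe, the program by default detects the currently idle GPUs and fills up all of them. If I want to run on only one particular GPU, or a few particular GPUs, is there a way to do that?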
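
A common way to restrict a TensorFlow process to particular GPUs is the CUDA_VISIBLE_DEVICES environment variable, which controls which devices the CUDA runtime exposes. A minimal sketch, assuming it is placed at the very top of the training script (the device index "0" is only an example):

```python
import os

# Expose only GPU 0 to this process; use "0,1" for two GPUs or "" to hide all GPUs.
# This must run before TensorFlow is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same effect can be had without editing the script by setting the variable in the shell that launches the command.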

Does the model epochs6ckpt1000 mean the model at iteration 1000 of the 6th epoch?

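Hello author, the code fails to run

Does sequence_labeling_data_manager.py lack handling for the test data?
I could not find a test dataset that run_sequence_labeling.py can run on, i.e. I cannot find the file test/token_in_and_one_predicate.txt. Did I miss some step? Thanks.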

Hello author, the final recall is extremely low. What could be the problem?

How is the postag embedding done?

Does the postag embedding part embed the postag of each character or of each word? For example, for {"word": "的", "pos": "u"}, is it "u" that gets embedded?

A question about the dataset

Hello, is this manually annotated data or distantly supervised data?

Why does run_sequence_labeling also need to compute the predicate loss?

As titled: why does run_sequence_labeling also need to compute the predicate loss?

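A question about prediction performance

Hello, I ran this code with 6 epochs for relation classification and 9 epochs for sequence labeling, and the resulting predictions reach an F1 of only 0.67. What could be wrong? This was trained on my own data; the data size is about the same, with 60-odd relation types. I also trained on the official data, and F1 was still below 0.7. Is something wrong somewhere?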

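Original paper for the **

Hello author! Is there a related paper for this ** of classifying relations first and then recognizing entities?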

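Hello, when will the code be updated? The competition is over and I would like to learn from you.

Did you write your blog's style yourself? Is there a template, and if so, could you share it? It looks good.

The github.io style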

Accuracy problem: with the same data as yours, accuracy is only fifty-odd percent

Screenshot details below: precision and recall

precision:0.5743
recall:0.5948
F1-score:0.5844
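
The three numbers above are internally consistent: F1 is the harmonic mean of precision and recall. A quick check, as a minimal sketch:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.5743, 0.5948), 4))  # 0.5844
```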

Exact and partial accuracy of entity-relation classification:

The image failed to upload; its content is as follows:

correct_line: 509, line: 1000, percentage: 50.9000%

superset_line: 141, line: 1000, percentage: 14.1000%

subset_line: 254, line: 1000, percentage: 25.4000%

I cannot figure out the cause; any advice would be appreciated!

new token_in.txt during training

FileNotFoundError: [Errno 2] No such file or directory: './raw_data/train\token_in.txt'

Errors occur when I run your code.

Some errors occur when I run eval. Can you help me?

Here are the details:
WARNING:tensorflow:Reraising captured error
Traceback (most recent call last):
File "run_predicate_classification.py", line 812, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "run_predicate_classification.py", line 741, in main
result = estimator.evaluate(input_fn=eval_input_fn, steps=eval_steps)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2424, in evaluate
rendezvous.raise_errors()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.5/dist-packages/six.py", line 693, in reraise
raise value
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2418, in evaluate
name=name
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 478, in evaluate
return _evaluate()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 460, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1484, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1520, in _call_model_fn_eval
features, labels, model_fn_lib.ModeKeys.EVAL, config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2195, in _call_model_fn
features, labels, mode, config)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/estimator.py", line 1195, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2479, in _model_fn
features, labels, is_export_mode=is_export_mode)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1259, in call_without_tpu
return self._call_model_fn(features, labels, is_export_mode=is_export_mode)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 1538, in _call_model_fn
return estimator_spec.as_estimator_spec()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 330, in as_estimator_spec
prediction_hooks=self.prediction_hooks + hooks)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/estimator/model_fn.py", line 236, in __new__
'tuples, given: {} for key: {}'.format(value, key))
TypeError: Values of eval_metric_ops must be (metric_value, update_op) tuples, given: Tensor("Abs:0", shape=(), dtype=float32) for key: eval_accuracy

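Questions about preprocessing and determining entity types

Hello, and thank you for providing the code. While studying it I ran into two points of confusion:

During preprocessing, when the BERT tokenizer produces ## strings, you replace them with [##WordPiece]. I do not quite understand the purpose of this step. I think it could be omitted, since that would reduce the number of label types for the second model and also make the predictions easier to post-process.
If one relation can have multiple entity types, is it then impossible to determine which entity type under this relation the two entities belong to?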

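Questions about relation extraction

Hello, I have read your code and there are two main points I do not understand, which I would like to ask you about:

In modeling, the classification model's final output is batch_size × 128 × 768, where 128 is the sentence length. Why is only the first token used as the basis for classification? Are the later tokens not needed?
In the relation extraction model, I see that the code trains the label prediction and the BIO tagging of the text separately, which means the predicted label does not depend on the BIO tags of the text.

But isn't the usual approach to tag the text with BIO labels first, and then train on the text together with the BIO tags to obtain the predicted label?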
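
On the first question, standard BERT classifiers use the hidden state at position 0 (the [CLS] token) as the sentence representation; because of self-attention, that vector already depends on every token in the sequence. A toy sketch with plain lists follows. The tiny shapes are hypothetical stand-ins for batch_size × 128 × 768, and the dense + tanh pooling layer that BERT applies on top of this vector is omitted.

```python
def first_token_vectors(sequence_output):
    """Pick the position-0 ([CLS]) vector of every sequence in the batch.

    sequence_output is a nested list shaped [batch_size, seq_length, hidden_size];
    the result is shaped [batch_size, hidden_size].
    """
    return [sequence[0] for sequence in sequence_output]


# Toy batch: 2 sequences, 3 tokens each, hidden size 2.
toy_output = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]
print(first_token_vectors(toy_output))  # [[0.1, 0.2], [0.7, 0.8]]
```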

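Is there any difference between this model's algorithm and that of the Multiple-Relations-Extraction-Only-Look-Once model?

Is there any difference between this model's algorithm and that of the Multiple-Relations-Extraction-Only-Look-Once model? Also, which one gives better results?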