king-yyf / cmekg_tools

License: MIT License

Python 100.00%

cmekg_tools's Introduction

CMeKG tool code and models

Index

CMeKG tools
- Model download
Dependencies
Model usage

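CMeKG tools

CMeKG website

CMeKG (Chinese Medical Knowledge Graph) is a Chinese medical knowledge graph built from large-scale medical text with natural language processing and text mining techniques, developed by combining human curation with automated methods.

The main model tools in CMeKG are medical text word segmentation, medical entity recognition, and medical relation extraction. This repository contains the code, models, and usage instructions for these three tools.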

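Model download

Because the dependencies and trained models are large, the models are hosted on Baidu Netdisk. The links are below; download what you need.

RE: link: https://pan.baidu.com/s/1cIse6JO2H78heXu7DNewmg password: 4s6k

NER: link: https://pan.baidu.com/s/16TPSMtHean3u9dJSXF9mTw password: shwh

Word segmentation: link: https://pan.baidu.com/s/1bU3QoaGs2IxI34WBx7ibMQ password: yhek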

Dependencies

json
random
numpy
torch
transformers
gc
re
time
tqdm

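Model usage

Medical relation extraction

Required files

pytorch_model.bin: BERT-base model pretrained on medical text
vocab.txt
config.json
model_re.pkl: the trained relation extraction model file, containing model parameters, optimizer parameters, etc.
predicate.json

Usage

Configuration parameters live in class config in medical_re.py. Start by editing the file paths there.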
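Before editing the paths, it can help to confirm that the download is complete. The sketch below is not part of the repository: `missing_files` is a hypothetical helper, and it assumes all five files listed above sit in one directory, which may not match how class config lays them out.

```python
import os

# File names taken from the required-files list above.
REQUIRED_FILES = [
    "pytorch_model.bin",
    "vocab.txt",
    "config.json",
    "model_re.pkl",
    "predicate.json",
]


def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(model_dir, name))
    ]
```

Calling `missing_files("/path/to/model_re")` returns an empty list when every file is in place; the path here is only a placeholder.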

Training

import medical_re
medical_re.load_schema()
medical_re.run_train()

model_re/train_example.json is a sample training file.

Usage

import json

import medical_re
medical_re.load_schema()
model4s, model4po = medical_re.load_model()

text = '据报道称，新冠肺炎患者经常会发热、咳嗽，少部分患者会胸闷、乏力，其病因包括: 1.自身免疫系统缺陷\n2.人传人。'  # text is the input passage
res = medical_re.get_triples(text, model4s, model4po)
print(json.dumps(res, ensure_ascii=False, indent=True))

Output

[
 {
  "text": "据报道称，新冠肺炎患者经常会发热、咳嗽，少部分患者会胸闷、乏力，其病因包括: 1.自身免疫系统缺陷\n2.人传人",
  "triples": [
   [
    "新冠肺炎",
    "临床表现",
    "肺炎"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "发热"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "咳嗽"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "胸闷"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "乏力"
   ],
   [
    "新冠肺炎",
    "病因",
    "自身免疫系统缺陷"
   ],
   [
    "新冠肺炎",
    "病因",
    "人传人"
   ]
  ]
 }
]
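The result is a list of records, each with a "text" field and a "triples" list of [subject, predicate, object] entries. A minimal sketch of regrouping that structure by subject and predicate follows; `group_by_predicate` is a hypothetical helper, not a function from medical_re, and the sample data is a shortened copy of the output above.

```python
from collections import defaultdict


def group_by_predicate(results):
    """Map subject -> predicate -> list of objects, preserving triple order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for item in results:
        for subject, predicate, obj in item["triples"]:
            grouped[subject][predicate].append(obj)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {s: dict(preds) for s, preds in grouped.items()}


sample = [
    {
        "text": "...",
        "triples": [
            ["新冠肺炎", "临床表现", "发热"],
            ["新冠肺炎", "临床表现", "咳嗽"],
            ["新冠肺炎", "病因", "人传人"],
        ],
    }
]
print(group_by_predicate(sample))
```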

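Medical entity recognition

Tunable parameters and model paths are in ner_constant.py.

Training

python3 train_ner.py

Usage example

The medical_ner class provides two test interface functions:

predict_sentence(sentence): tests a single sentence and returns {"entity type": "entity"}, with different entities separated by commas.
predict_file(input_file, output_file): tests a whole file. File format: one line per sentence, holding the sentence and its extracted entities as {"entity type": "entity"}, with different entities separated by commas.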

from run import medical_ner

# Run the tool
my_pred = medical_ner()
# Enter a single sentence at the prompt, e.g. "高血压病人不可食用阿莫西林等药物"
sentence = input("Enter a sentence to test: ")
my_pred.predict_sentence("".join(sentence.split()))

# Process a file (input file, output file)
my_pred.predict_file("my_test.txt", "outt.txt")
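The single-sentence example strips all whitespace before prediction. A small sketch of preparing an input file the same way is shown here; `write_sentences` is a hypothetical helper, and the one-sentence-per-line, UTF-8 layout is an assumption about what predict_file expects rather than something the README states.

```python
def write_sentences(sentences, path):
    """Write one whitespace-free sentence per line; return how many were written."""
    written = 0
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            # Same cleanup as the interactive example: drop all whitespace.
            cleaned = "".join(sentence.split())
            if cleaned:  # skip lines that were empty or whitespace only
                handle.write(cleaned + "\n")
                written += 1
    return written
```

For example, `write_sentences(["高血压病人 不可食用阿莫西林等药物"], "my_test.txt")` would produce a file that can then be passed as the first argument of predict_file.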

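Medical text word segmentation

Tunable parameters and model paths are in cws_constant.py.

Training

python3 train_cws.py

Usage example

The medical_cws class provides two test interface functions:

predict_sentence(sentence): tests a single sentence and returns {"entity type": "entity"}, with different entities separated by commas.
predict_file(input_file, output_file): tests a whole file. File format: one line per sentence, holding the sentence and its extracted entities as {"entity type": "entity"}, with different entities separated by commas.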

from run import medical_cws

# Run the tool
my_pred = medical_cws()
# Enter a single sentence at the prompt, e.g. "高血压病人不可食用阿莫西林等药物"
sentence = input("Enter a sentence to test: ")
my_pred.predict_sentence("".join(sentence.split()))

# Process a file (input file, output file)
my_pred.predict_file("my_test.txt", "outt.txt")


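cmekg_tools's Issues

>Do the usage examples in medical_cws.py and medical_ner.py run successfully?

Thanks a lot, it runs after making the change!
Do the usage examples in medical_cws.py and medical_ner.py run successfully?
Originally posted by @gouyulang in #8 (comment)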

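How do I train my own model?

Can the .pkl file be left empty so that I can train my own model? Setting it to null raises an error. Is there a way around this?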

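Hello, I read your code carefully and have two points about RE that I would like to discuss:
1. In extract_spoes(), L280-L291: as I understand it, when one input text contains several subject spans, the intent is to iterate over each span and pass it to model4po as a mask that is added to hidden_state, so that object and relation extraction attends only to that subject's start position, which removes the need for dependency parsing. However, this loop only ever takes the first span. The results look right only because get_triples splits the text on "。", and a single sentence usually has just one subject.
2. Related to the above: in the model4po definition, s appears to be written directly into the positions of all valid tokens, all_s[b, :cue_len, :] = s, so it cannot act as a mask over long text, and adding this step to the training of the second-stage po extraction has no effect.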
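To make the second point concrete, here is a toy illustration in plain Python lists. It is not code from medical_re: both function names are hypothetical, and the second one is just one possible alternative, shown only to contrast "same vector at every valid position" with "vector only at the subject span".

```python
def fill_all_valid(seq_len, valid_len, s):
    """Mimic all_s[b, :cue_len, :] = s: every valid token gets the same vector."""
    zero = [0.0] * len(s)
    return [list(s) if i < valid_len else list(zero) for i in range(seq_len)]


def fill_subject_span(seq_len, start, end, s):
    """Hypothetical alternative: only tokens start..end (inclusive) get the vector."""
    zero = [0.0] * len(s)
    return [list(s) if start <= i <= end else list(zero) for i in range(seq_len)]


s = [1.0, 2.0]
# Every valid row is identical, so the rows carry no hint of where the subject is.
print(fill_all_valid(5, 4, s))
# Only rows 1 and 2 are non-zero, so the subject position is recoverable.
print(fill_subject_span(5, 1, 2, s))
```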

Website won't open

It fails whether I open it directly, through a VPN, or over mobile data.

Opening the .pkl file

Hello, which version of torch is used? And how should the .pkl file be opened? When I parse it with Python, I get a sequence of int values.
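Files like these are normally PyTorch checkpoints, so they are read with torch.load rather than parsed by hand. The sketch below is only a helper for looking at what a loaded checkpoint contains: `summarize_checkpoint` is a hypothetical name, the `stand_in` dict is made-up data, and the only key confirmed elsewhere on this page is 'model4s_state_dict'.

```python
def summarize_checkpoint(checkpoint, max_items=5):
    """Return {top-level key: short description} for a loaded checkpoint dict."""
    summary = {}
    for key, value in checkpoint.items():
        if isinstance(value, dict):  # state dicts are dict subclasses
            names = list(value)[:max_items]
            summary[key] = f"dict with {len(value)} entries, e.g. {names}"
        else:
            summary[key] = type(value).__name__
    return summary


# With PyTorch installed, the dict would come from something like:
#   import torch
#   checkpoint = torch.load("model_re.pkl", map_location="cpu")
# Here a made-up stand-in is used so the sketch runs without PyTorch.
stand_in = {"model4s_state_dict": {"a": 1, "b": 2}, "epoch": 3}
print(summarize_checkpoint(stand_in))
```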

CMeKG

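Hello, this website is down: http://cmekg.pcl.ac.cn/

Frontend

Hello, could you upload the frontend code?

NER test set

Hi, we have no test set for evaluating the NER task. Could you release one so we can run tests? Thank you very much.

Hello, could I pay for a consultation, billed by the hour?

I don't fully understand the model-loading part, and I would like to use your code to train on my own non-medical data. Is that possible? Could you offer paid guidance?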

TypeError: dropout(): argument 'input' (position 1) must be Tensor, not str

Is this problem caused by the transformers library version?
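One common cause of this message, offered here as a possibility rather than a diagnosis of this repository: newer transformers releases return a dict-like output object by default, and tuple-unpacking a mapping yields its keys (strings) instead of tensors. The stand-in below uses an OrderedDict with list values in place of the real output object to show the effect.

```python
from collections import OrderedDict

# Stand-in for a dict-like model output; the real object would hold tensors.
outputs = OrderedDict(last_hidden_state=[[0.1, 0.2]], pooler_output=[0.3])

# Tuple-unpacking a mapping iterates over its keys, so both names are strings.
key_a, key_b = outputs
print(type(key_a).__name__, key_a)  # str last_hidden_state

# Looking the field up by name returns the actual value.
hidden = outputs["last_hidden_state"]
print(hidden)
```

If this is what is happening, passing `return_dict=False` when calling the model, or indexing the output by field name, are the usual ways to get tensors back.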

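medical_cws.py fails at runtime

I downloaded the model files from Baidu Netdisk and updated the model path in medical_cws.py, but running medical_cws.py raises an error. How can I fix it? The log follows.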
(base) ubuntu@ubuntu-test3:~/knowledgegraph/CMeKG_tools/CMeKG_tools-main$ python medical_cws.py
Some weights of the model checkpoint at /home/ubuntu/knowledgegraph/CMeKG_tools/CMeKG_tools-main/models/medical_cws were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias']

This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
File "medical_cws.py", line 157, in <module>
res = meg.predict_sentence("肾上腺由皮质和髓质两个功能不同的内分泌器官组成，皮质分泌肾上腺皮质激素，髓质分泌儿茶酚胺激素。")
File "medical_cws.py", line 105, in predict_sentence
self.model.load_state_dict(torch.load(self.NEWPATH,map_location=self.device))
File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT_LSTM_CRF:
Missing key(s) in state_dict: "word_embeds.embeddings.position_ids".

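The website http://cmekg.pcl.ac.cn/ won't open

The website http://cmekg.pcl.ac.cn/ won't open, so I cannot access the knowledge graph.
I would like to do research on knowledge graph construction. May I use your CMeKG dataset? I will cite the source.

How can the NER length limit be changed? I see it is capped at 512.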
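A 512 cap usually comes from the BERT position embeddings, so raising it would need a different pretrained model. A common workaround is to split long text before prediction. The sketch below is one way to do that and is not taken from the repository: `split_for_bert` is a hypothetical helper, it assumes roughly one token per character, and it reserves two positions for the [CLS] and [SEP] tokens.

```python
import re


def split_for_bert(text, max_len=512, reserved=2):
    """Split text into chunks of at most max_len - reserved characters,
    preferring to break after Chinese sentence-ending punctuation."""
    limit = max_len - reserved
    if limit <= 0:
        raise ValueError("max_len must be larger than reserved")
    # Split after sentence-ending punctuation, keeping the punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？；])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        pieces = [sentence[i:i + limit] for i in range(0, len(sentence), limit)]
        for piece in pieces:
            if len(current) + len(piece) > limit:
                chunks.append(current)
                current = ""
            current += piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to predict_sentence separately. Entities that straddle a hard split may be missed, which is why the function breaks at sentence boundaries whenever it can.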

About the version of the "transformers" package

Hello,

Could you please provide the specific version of transformers?
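The page does not state the versions the models were trained with. When reporting or comparing environments, a short script like the following can at least show what is installed locally; `installed_versions` is a hypothetical helper, and it needs Python 3.8 or later for importlib.metadata.

```python
from importlib import metadata


def installed_versions(names=("torch", "transformers", "numpy", "tqdm")):
    """Return {package name: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```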

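Is this error caused by the model file?

Hello:
Could you help check whether the error below is caused by the .pkl file or by something else? Many thanks.
model4s.load_state_dict(checkpoint['model4s_state_dict'])

The downloaded NER model fails to load, apparently missing some parameters. How can this be fixed?

Hello, I ran into this error: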

Traceback (most recent call last):
  File "medical_ner.py", line 184, in <module>
    res = my_pred.predict_sentence(sentence)
  File "medical_ner.py", line 103, in predict_sentence
    self.model.load_state_dict(torch.load(self.NEWPATH, map_location=device))
  File "/home/amax/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT_LSTM_CRF:
        Missing key(s) in state_dict: "word_embeds.embeddings.position_ids".

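Here is what I did:

git clone this repository
Set up the environment and dependencies
Download the NER model (link: https://pan.baidu.com/s/16TPSMtHean3u9dJSXF9mTw ) and extract the archive
Edit the paths in these lines of medical_ner.py so they point to the correct locations on my server:

        self.NEWPATH = '/Users/yangyf/workplace/model/medical_ner/model.pkl'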
        self.vocab = load_vocab('/Users/yangyf/workplace/model/medical_ner/vocab.txt')
        self.vocab_reverse = {v: k for k, v in self.vocab.items()}

        self.model = BERT_LSTM_CRF('/Users/yangyf/workplace/model/medical_ner', tagset_size, 768, 200, 2,
                              dropout_ratio=0.5, dropout1=0.5, use_cuda=use_cuda)

Run medical_ner.py
The error above then appears.

I checked, and the current model [requires] the following parameters:

word_embeds.embeddings.position_ids      torch.Size([1, 512])
word_embeds.embeddings.word_embeddings.weight    torch.Size([21128, 768])
word_embeds.embeddings.position_embeddings.weight        torch.Size([512, 768])
word_embeds.embeddings.token_type_embeddings.weight      torch.Size([2, 768])
word_embeds.embeddings.LayerNorm.weight          torch.Size([768])
word_embeds.embeddings.LayerNorm.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.self.query.weight          torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.query.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.self.key.weight    torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.key.bias      torch.Size([768])
word_embeds.encoder.layer.0.attention.self.value.weight          torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.value.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.output.dense.weight        torch.Size([768, 768])
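... (omitted) ...

My guess is that the loaded checkpoint (that is, model.pkl) may not contain the word_embeds.embeddings.position_ids entry. Could you take a look when you have time? Is there a mistake in my steps, or is there a problem with the trained model checkpoint? Thank you!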
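A plausible explanation, stated as an assumption rather than a confirmed diagnosis: the checkpoint was saved with a transformers version that did not store position_ids in the state dict, while the installed version expects it. That entry is only a buffer of position indices that the model builds for itself at construction, so skipping it when loading is generally harmless. The sketch below compares key sets and reports whether only that known entry is missing; `check_missing_keys` and `SAFE_TO_SKIP` are hypothetical names.

```python
# Keys the model can rebuild on its own; taken from the error message above.
SAFE_TO_SKIP = {"word_embeds.embeddings.position_ids"}


def check_missing_keys(model_keys, checkpoint_keys, safe=SAFE_TO_SKIP):
    """Return (missing, unexpected, ok); ok is True when nothing is unexpected
    and every missing key is in the safe-to-skip set."""
    missing = set(model_keys) - set(checkpoint_keys)
    unexpected = set(checkpoint_keys) - set(model_keys)
    blocking = (missing - set(safe)) | unexpected
    return sorted(missing), sorted(unexpected), not blocking


# With PyTorch installed, usage would look roughly like:
#   state = torch.load(path, map_location="cpu")
#   _, _, ok = check_missing_keys(model.state_dict().keys(), state.keys())
#   if ok:
#       model.load_state_dict(state, strict=False)
```

Passing `strict=False` tells load_state_dict to tolerate key mismatches, so checking first that the mismatch is limited to position_ids avoids silently skipping real weights.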