
cmekg_tools's Introduction

CMeKG tools: code and models

Index

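CMeKG tools

CMeKG website

Chinese Medical Knowledge Graph (CMeKG): CMeKG is a Chinese medical knowledge graph built from large-scale medical text using natural language processing and text mining, with a combination of automated and manual work.

The main model tools in CMeKG are medical text segmentation, medical entity recognition, and medical relation extraction. This repository provides the code, models, and usage instructions for all three tools.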

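Model download

Because the dependency files and trained models are large, they are hosted on Baidu Netdisk. Download what you need from the links below.

RE: link: https://pan.baidu.com/s/1cIse6JO2H78heXu7DNewmg password: 4s6k

NER: link: https://pan.baidu.com/s/16TPSMtHean3u9dJSXF9mTw password: shwh

Segmentation: link: https://pan.baidu.com/s/1bU3QoaGs2IxI34WBx7ibMQ password: yhek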

Dependencies

  • json
  • random
  • numpy
  • torch
  • transformers
  • gc
  • re
  • time
  • tqdm
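Of the names above, json, random, gc, re, and time ship with Python's standard library, so only numpy, torch, transformers, and tqdm need to be installed. The following is a minimal sketch for checking which modules the current environment can import. The helper name `missing_modules` is illustrative and not part of the repository.

```python
from importlib.util import find_spec

# Module names taken from the dependency list above.
REQUIRED = ["json", "random", "numpy", "torch", "transformers",
            "gc", "re", "time", "tqdm"]


def missing_modules(names):
    """Return the names that the import system cannot locate."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

Using `find_spec` only locates each module without importing it, so the check stays fast even when torch is installed.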

Model usage

Medical relation extraction

Required files

  • pytorch_model.bin: BERT-base model pre-trained on medical text
  • vocab.txt
  • config.json
  • model_re.pkl: trained relation extraction model file, containing the model parameters, optimizer parameters, etc.
  • predicate.json
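Before editing the paths, it can help to confirm that the download is complete. The sketch below assumes all five files sit in a single directory, which the source does not state. `missing_files` and `RE_MODEL_FILES` are hypothetical names, not part of medical_re.py.

```python
from pathlib import Path

# File names taken from the list above.
RE_MODEL_FILES = ["pytorch_model.bin", "vocab.txt", "config.json",
                  "model_re.pkl", "predicate.json"]


def missing_files(model_dir, required=RE_MODEL_FILES):
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

For example, `missing_files("/path/to/model_re")` returns an empty list when every file is present.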

Usage

Configuration parameters live in class config in medical_re.py. First, edit the file paths in that class.

  • Training
import medical_re
medical_re.load_schema()
medical_re.run_train()

model_re/train_example.json is a sample training file.

  • Inference
import json  # needed for json.dumps below

import medical_re
medical_re.load_schema()
model4s, model4po = medical_re.load_model()

text = '据报道称,新冠肺炎患者经常会发热、咳嗽,少部分患者会胸闷、乏力,其病因包括: 1.自身免疫系统缺陷\n2.人传人。'  # text is the input passage
res = medical_re.get_triples(text, model4s, model4po)
print(json.dumps(res, ensure_ascii=False, indent=1))
  • Output
[
 {
  "text": "据报道称,新冠肺炎患者经常会发热、咳嗽,少部分患者会胸闷、乏力,其病因包括: 1.自身免疫系统缺陷\n2.人传人",
  "triples": [
   [
    "新冠肺炎",
    "临床表现",
    "肺炎"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "发热"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "咳嗽"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "胸闷"
   ],
   [
    "新冠肺炎",
    "临床表现",
    "乏力"
   ],
   [
    "新冠肺炎",
    "病因",
    "自身免疫系统缺陷"
   ],
   [
    "新冠肺炎",
    "病因",
    "人传人"
   ]
  ]
 }
]
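The result is a list of objects, each holding the input text and a list of [subject, predicate, object] triples. As a sketch of how it might be consumed, the helper below groups objects by subject and predicate. It assumes only the structure visible in the output above, and `group_by_predicate` is an illustrative name, not a function in medical_re.py.

```python
from collections import defaultdict


def group_by_predicate(results):
    """Map (subject, predicate) to the list of objects found for that pair."""
    grouped = defaultdict(list)
    for item in results:
        for subj, pred, obj in item["triples"]:
            grouped[(subj, pred)].append(obj)
    return dict(grouped)


# Abbreviated, hand-copied subset of the output shown above.
res = [{
    "text": "新冠肺炎患者经常会发热、咳嗽",
    "triples": [
        ["新冠肺炎", "临床表现", "发热"],
        ["新冠肺炎", "临床表现", "咳嗽"],
        ["新冠肺炎", "病因", "人传人"],
    ],
}]

for (subj, pred), objs in group_by_predicate(res).items():
    print(subj, pred, "、".join(objs))
```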

Medical entity recognition

Tunable parameters and model settings are in ner_constant.py.

Training

python3 train_ner.py

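Usage example

The medical_ner class provides two test functions:

  • predict_sentence(sentence): runs on a single sentence and returns {"entity type": "entity"}; multiple entities are separated by commas
  • predict_file(input_file, output_file): runs on a whole file. File format: each line holds the sentence to process and the entities extracted from it as {"entity type": "entity"}; multiple entities are separated by commas
from run import medical_ner

# Create the tool
my_pred = medical_ner()
# Enter a single sentence at the prompt, e.g. “高血压病人不可食用阿莫西林等药物”
sentence = input("Enter a sentence to test: ")
my_pred.predict_sentence("".join(sentence.split()))

# File mode: (test file, output file)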
my_pred.predict_file("my_test.txt","outt.txt")
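The single-sentence example strips all whitespace before prediction. A minimal sketch for preparing a test file the same way follows. It assumes that predict_file reads one sentence per line in UTF-8, which the source does not spell out. `normalize_sentence` and `write_input_file` are hypothetical helpers, not part of the repository.

```python
def normalize_sentence(sentence):
    """Remove all whitespace, mirroring "".join(sentence.split()) above."""
    return "".join(sentence.split())


def write_input_file(sentences, path):
    """Write one normalized, non-empty sentence per line; return the count."""
    cleaned = [s for s in map(normalize_sentence, sentences) if s]
    with open(path, "w", encoding="utf-8") as f:
        for s in cleaned:
            f.write(s + "\n")
    return len(cleaned)
```

The resulting path could then be passed as the first argument of predict_file.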

Medical text segmentation

Tunable parameters and model settings are in cws_constant.py.

Training

python3 train_cws.py

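Usage example

The medical_cws class provides two test functions:

  • predict_sentence(sentence): runs on a single sentence and returns {"entity type": "entity"}; multiple entities are separated by commas
  • predict_file(input_file, output_file): runs on a whole file. File format: each line holds the sentence to process and the entities extracted from it as {"entity type": "entity"}; multiple entities are separated by commas
from run import medical_cws

# Create the tool
my_pred = medical_cws()
# Enter a single sentence at the prompt, e.g. “高血压病人不可食用阿莫西林等药物”
sentence = input("Enter a sentence to test: ")
my_pred.predict_sentence("".join(sentence.split()))

# File mode: (test file, output file)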
my_pred.predict_file("my_test.txt","outt.txt")


cmekg_tools's Issues

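BUG: is there a problem in the code?

Hello, I read your code carefully and would like to discuss two points about RE:
1. In extract_spoes(), L280-L291: as I understand it, when one input text contains several subject spans, you intend to iterate over each of them and use it as a mask in model4po, added to hidden_state, so that extraction of the object and the relation attends only to that subject's start position, which removes the need for dependency parsing. However, this loop only ever picks up the first group. It only looks correct because get_triples splits the text on "。", and a sentence usually has a single subject.
2. Likewise, in the definition of model4po, s appears to be filled directly into every valid token position, all_s[b, :cue_len, :] = s, so it cannot act as a mask over a long text. Adding this step to the training of the second-stage po extraction is meaningless.

Website won't open

It fails whether I open it directly, through a VPN, or over mobile data.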

Opening the .pkl file

Hello, which version of torch is required, and how should the .pkl file be opened? When I parse it with Python, I get a sequence of int values.

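Front end

Hello, could you upload the front-end code?

NER task test set

Hi, we have no test set for evaluating the NER task. Could you release one so we can test? Many thanks.

medical_cws.py fails at runtime

I downloaded the model files from Baidu Netdisk and updated the model path in medical_cws.py, but running medical_cws.py raises an error. How can this be fixed? The log follows.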
(base) ubuntu@ubuntu-test3:~/knowledgegraph/CMeKG_tools/CMeKG_tools-main$ python medical_cws.py
Some weights of the model checkpoint at /home/ubuntu/knowledgegraph/CMeKG_tools/CMeKG_tools-main/models/medical_cws were not used when initializing BertModel: ['cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.bias']

- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Traceback (most recent call last):
  File "medical_cws.py", line 157, in <module>
    res = meg.predict_sentence("肾上腺由皮质和髓质两个功能不同的内分泌器官组成,皮质分泌肾上腺皮质激素,髓质分泌儿茶酚胺激素。")
  File "medical_cws.py", line 105, in predict_sentence
    self.model.load_state_dict(torch.load(self.NEWPATH,map_location=self.device))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 847, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT_LSTM_CRF:
        Missing key(s) in state_dict: "word_embeds.embeddings.position_ids".

The downloaded NER model fails to load, apparently missing some parameters. How can this be fixed?

Hello, I ran into this error:

Traceback (most recent call last):
  File "medical_ner.py", line 184, in <module>
    res = my_pred.predict_sentence(sentence)
  File "medical_ner.py", line 103, in predict_sentence
    self.model.load_state_dict(torch.load(self.NEWPATH, map_location=device))
  File "/home/amax/.conda/envs/torch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for BERT_LSTM_CRF:
        Missing key(s) in state_dict: "word_embeds.embeddings.position_ids". 

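Here is what I did:

  1. git clone this repository
  2. Set up the environment and dependencies
  3. Download the NER model (link: https://pan.baidu.com/s/16TPSMtHean3u9dJSXF9mTw ) and extract the archive
  4. Edit the paths in these lines of medical_ner.py so that they point to the correct locations on my server: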
       self.NEWPATH = '/Users/yangyf/workplace/model/medical_ner/model.pkl'
        self.vocab = load_vocab('/Users/yangyf/workplace/model/medical_ner/vocab.txt')
        self.vocab_reverse = {v: k for k, v in self.vocab.items()}

        self.model = BERT_LSTM_CRF('/Users/yangyf/workplace/model/medical_ner', tagset_size, 768, 200, 2,
                              dropout_ratio=0.5, dropout1=0.5, use_cuda=use_cuda)

  5. Run medical_ner.py
  6. The error above then appears

I checked, and the current model requires the following parameters:

word_embeds.embeddings.position_ids      torch.Size([1, 512])
word_embeds.embeddings.word_embeddings.weight    torch.Size([21128, 768])
word_embeds.embeddings.position_embeddings.weight        torch.Size([512, 768])
word_embeds.embeddings.token_type_embeddings.weight      torch.Size([2, 768])
word_embeds.embeddings.LayerNorm.weight          torch.Size([768])
word_embeds.embeddings.LayerNorm.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.self.query.weight          torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.query.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.self.key.weight    torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.key.bias      torch.Size([768])
word_embeds.encoder.layer.0.attention.self.value.weight          torch.Size([768, 768])
word_embeds.encoder.layer.0.attention.self.value.bias    torch.Size([768])
word_embeds.encoder.layer.0.attention.output.dense.weight        torch.Size([768, 768])
...(omitted)...

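I suspect that the loaded checkpoint (that is, model.pkl) does not contain the word_embeds.embeddings.position_ids entry. Could you take a look? Are my steps wrong, or is there a problem with the trained checkpoint? Thank you!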
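One way to test this guess is to compare the key names the model expects with the key names stored in the checkpoint. The sketch below works on plain lists of names so that it runs without PyTorch. With PyTorch, the expected names would come from `model.state_dict().keys()` and the stored names from the object returned by `torch.load(path, map_location="cpu")`, assuming the file holds a plain state dict. `diff_state_dict_keys` is a hypothetical helper, not part of the repository.

```python
def diff_state_dict_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key names, each sorted alphabetically."""
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


# Illustrative names copied from the parameter listing above; the checkpoint
# contents here are an assumption made only for this example.
expected = [
    "word_embeds.embeddings.position_ids",
    "word_embeds.embeddings.word_embeddings.weight",
]
checkpoint = ["word_embeds.embeddings.word_embeddings.weight"]

missing, unexpected = diff_state_dict_keys(expected, checkpoint)
print("missing:", missing)
print("unexpected:", unexpected)
```

`torch.nn.Module.load_state_dict` also accepts `strict=False`, which tolerates missing keys. Whether that is appropriate for this checkpoint has not been verified here.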

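How do I train on my own dataset?

Hello, for medical_re.py we currently load your trained model and test it on train_example.json. If we want to use our own dataset, how should we train our own model? Could you describe the training workflow? Many thanks, and I look forward to your reply.

train_data.json

Excuse me, where can I get the train_data.json file mentioned in medical_re.py?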
