named_entity_recognition's People

Contributors

luopeixiang

named_entity_recognition's Issues

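Question about the initial-state computation in HMM decoding

Hello. In the HMM decoding part of models, when the tag probabilities for the first element of the sequence are computed, the initial state probability Pi and the element's tag probability bt are simply added together. Doesn't that mean the probabilities no longer sum to 1? I'd appreciate an explanation, thanks!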
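One possible reading, assuming the decoder stores log-probabilities (the thread does not confirm this): adding log Pi and log B is the same as multiplying the two probabilities, and the resulting Viterbi scores are joint probabilities of tag and observation, which are not expected to sum to 1. A minimal sketch with invented numbers:

```python
import math

# Hypothetical two-tag model; every number here is made up for illustration.
pi = {"B": 0.6, "O": 0.4}                      # initial state probabilities
emit = {"B": {"张": 0.3}, "O": {"张": 0.1}}    # P(word | tag)

def first_step_scores(word):
    """Return log(pi[tag] * emit[tag][word]) for each tag, computed as a sum of logs."""
    return {tag: math.log(pi[tag]) + math.log(emit[tag][word]) for tag in pi}

scores = first_step_scores("张")
# Back in probability space these are joint probabilities P(tag, word).
joint = {tag: math.exp(score) for tag, score in scores.items()}
```

Only the ranking of the scores matters for decoding, so no renormalisation is needed at this step.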

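Question about list and lists

Hello. After reading the tags or words into a list, why do you then store that list into lists at each blank line?
Why not store all tags or words, line breaks included, directly in a single list, without using lists?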

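Error when opening train, dev, and test

A file-opening error occurs at run time:
UnicodeDecodeError: 'gbk' codec can't decode byte 0x98 in position 2: illegal multibyte sequence
I'm not familiar with the build_corpus function and don't know what its parameters are, so I can't change it to read UTF-8.
What should I do? Thanks for any answer!
If convenient, could you give me a way to contact you, or reach me by email? My personal email is [email protected]
Thanks again!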
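The 'gbk' in the message suggests that open() fell back to the Windows default codec. Assuming the data files are UTF-8 encoded (the build_corpus variant posted elsewhere in this thread opens them that way), passing the encoding explicitly is the usual fix. The helper name read_lines_utf8 below is hypothetical, not part of the repository:

```python
import os
import tempfile

def read_lines_utf8(path):
    """Read a text file as UTF-8 regardless of the platform's default codec."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

# Self-contained demonstration using a temporary file in the two-column format.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.char.bmes")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("高 B-NAME\n勇 E-NAME\n")
    demo_lines = read_lines_utf8(demo_path)
```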

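Questions about the dataset

Hello, I still have a few questions about the dataset and would appreciate answers:
1. What is each of the dev, test, and train files in the project used for?
2. Were the labels in these datasets annotated by hand?
3. How is the accuracy of a trained model tested?
Thanks!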

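A few thoughts on the code

First of all, I really admire that you could write this code. For me, just understanding it takes a long time, and after reading through it there is still a lot I don't understand and need time to digest. Still, a few thoughts came up while reading, and I'd like to discuss them with the repo owner.

  1. About the test functions of HMM and BiLSTM-CRF: the test interface requires word2id and tag2id as arguments. When a trained model is used on unlabeled sequences, that is, when it is moved to a new environment, these two arguments are hard to supply. My idea is to pass word2id and tag2id in __init__, so that neither train nor test needs them as arguments, which makes the model easier to port.

  2. The test function in BiLSTM-Model also requires a tag_lists argument. That works in the author's test setting, because the test set is labeled and the goal is only to check performance. When predicting on truly unlabeled sequences, however, tag_lists cannot be provided, and the function does not handle that case. The sort_by_lengths and prepocess_data_for_lstmcrf functions would need small matching changes.

These are just my personal thoughts. I hope to discuss them with the author, and thanks again for your code!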
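The two suggestions could be sketched roughly as follows. Tagger, decode_fn, and the "<unk>" entry are hypothetical names chosen for illustration; this is not the repository's interface, only one way the proposal might look:

```python
class Tagger:
    """Hypothetical wrapper that owns its vocabularies (not the repository's API)."""

    def __init__(self, word2id, tag2id, decode_fn):
        self.word2id = word2id
        self.id2tag = {i: t for t, i in tag2id.items()}
        self.decode_fn = decode_fn  # stand-in for the trained model's decoder

    def predict(self, word_lists):
        """Tag unlabeled sentences; no tag_lists argument is required."""
        unk = self.word2id.get("<unk>")
        results = []
        for words in word_lists:
            ids = [self.word2id.get(w, unk) for w in words]
            results.append([self.id2tag[i] for i in self.decode_fn(ids)])
        return results

# Toy decoder that labels every token with tag id 0, only to exercise the interface.
tagger = Tagger({"<unk>": 0, "我": 1}, {"O": 0, "B-LOC": 1}, lambda ids: [0] * len(ids))
predictions = tagger.predict([["我", "们"]])
```

Keeping the maps on the object also means they are saved together with the model when it is serialised.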

Potential fix in build_corpus

I changed an if-else block to a try-except block and it worked. Machine: Windows 10, Python 3.7.
I also needed to install the sklearn package separately after installing requirements.txt.
I think this is due to a syntactic mismatch between the BMES format and the Windows file reader, but I'm not sure.

from os.path import join

def build_corpus(split, make_vocab=True, data_dir="./ResumeNER"):
    """Read the data."""
    assert split in ['train', 'dev', 'test']

    word_lists = []
    tag_lists = []
    with open(join(data_dir, split+".char.bmes"), 'r', encoding='utf-8') as f:
        word_list = []
        tag_list = []
        for line in f.readlines():
            try:
                word, tag = line.strip('\n').split()
                word_list.append(word)
                tag_list.append(tag)
            except ValueError:
                # A line without exactly two fields (e.g. a blank line) ends the sentence
                word_lists.append(word_list)
                tag_lists.append(tag_list)
                word_list = []
                tag_list = []

    # If make_vocab is True, also return word2id and tag2id
    # (build_map is the repository's vocabulary helper, not shown here)
    if make_vocab:
        word2id = build_map(word_lists)
        tag2id = build_map(tag_lists)
        return word_lists, tag_lists, word2id, tag2id
    else:
        return word_lists, tag_lists

Is there anything to watch out for in the dataset?

I trained on my own dataset. Training ran without problems, but evaluation fails with this error:
Traceback (most recent call last):
File "C:/Users/gavin/PycharmProjects/named_entity_recognition-master/test.py", line 68, in
main()
File "C:/Users/gavin/PycharmProjects/named_entity_recognition-master/test.py", line 55, in main
crf_word2id, crf_tag2id)
File "C:\Users\gavin\PycharmProjects\named_entity_recognition-master\models\bilstm_crf.py", line 170, in test
pred_tag_lists = [pred_tag_lists[i] for i in indices]
File "C:\Users\gavin\PycharmProjects\named_entity_recognition-master\models\bilstm_crf.py", line 170, in
pred_tag_lists = [pred_tag_lists[i] for i in indices]
IndexError: list index out of range

CRF log-sum-exp

Is the summation dimension at line 146 of models/utili.py wrong?
Should the dimension be set to 2, since we want to sum over the tag space of the previous step t-1?
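For reference, the forward recursion for a linear-chain CRF reduces over the previous step's tag for each current tag. Which tensor axis that corresponds to in the repository depends on how its score tensor is laid out, which this thread does not show. The sketch below uses plain lists, assumes trans[i][j] scores a move from tag i to tag j, and includes a brute-force enumeration as a cross-check:

```python
import math
from itertools import product

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def crf_log_partition(emit, trans):
    """emit[t][j]: score of tag j at step t; trans[i][j]: score of moving from tag i to tag j."""
    n_tags = len(trans)
    alpha = list(emit[0])
    for t in range(1, len(emit)):
        alpha = [
            # Reduce over i, the tag at step t-1, for each current tag j.
            logsumexp([alpha[i] + trans[i][j] for i in range(n_tags)]) + emit[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)

def brute_force_log_partition(emit, trans):
    """Enumerate every tag path; only feasible for tiny examples."""
    n_tags = len(trans)
    totals = []
    for path in product(range(n_tags), repeat=len(emit)):
        score = emit[0][path[0]]
        for t in range(1, len(emit)):
            score += trans[path[t - 1]][path[t]] + emit[t][path[t]]
        totals.append(score)
    return logsumexp(totals)

emit = [[0.5, 1.0], [0.2, -0.3], [1.5, 0.1]]
trans = [[0.1, -0.4], [0.7, 0.3]]
```

If the two functions agree on small inputs, the reduction axis in a batched tensor version can be checked against the same brute-force result.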

Error when running main.py: not enough values to unpack

Running 'python main.py' produces the following error:

>python main.py
读取数据...
Traceback (most recent call last):
  File "main.py", line 73, in <module>
    main()
  File "main.py", line 14, in main
    build_corpus("train")
  File "data.py", line 16, in build_corpus
    word, tag = line.strip('\n').split()
ValueError: not enough values to unpack (expected 2, got 0)
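"expected 2, got 0" means split() returned no fields, so a blank or whitespace-only line reached the unpacking statement. A minimal sketch of a reader that treats such lines as sentence boundaries follows; parse_bmes_lines is a hypothetical helper written for this note, not the repository's build_corpus:

```python
def parse_bmes_lines(lines):
    """Group 'char tag' lines into sentences; blank lines mark sentence boundaries."""
    word_lists, tag_lists = [], []
    words, tags = [], []
    for line in lines:
        parts = line.split()          # also discards a trailing '\r' or '\n'
        if len(parts) == 2:
            words.append(parts[0])
            tags.append(parts[1])
        elif not parts:
            if words:                 # ignore repeated or leading blank lines
                word_lists.append(words)
                tag_lists.append(tags)
                words, tags = [], []
        else:
            raise ValueError(f"unexpected line: {line!r}")
    if words:                         # flush a final sentence with no trailing blank line
        word_lists.append(words)
        tag_lists.append(tags)
    return word_lists, tag_lists

sample = ["高 B-NAME\n", "勇 E-NAME\n", "\n", "\r\n", "男 O\n"]
word_lists, tag_lists = parse_bmes_lines(sample)
```

Because it accepts any iterable of strings, it can be fed an open file object directly.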

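BiLSTM-CRF

Why does the loss become negative after two or three epochs? Shouldn't it approach 0 instead?

Data question

Wasn't the data said to be labeled with BIOES? Why are the files in the ResumeNER folder in BMES?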

When doing inference with bilstm+crf, why do tags still have to be passed in?

print("加载并评估bilstm+crf模型...")
crf_word2id, crf_tag2id = extend_maps(word2id, tag2id, for_crf=True)
bilstm_model = load_model(BiLSTMCRF_MODEL_PATH)
bilstm_model.model.bilstm.bilstm.flatten_parameters()  # remove warning
test_word_lists, test_tag_lists = prepocess_data_for_lstmcrf(
    test_word_lists, test_tag_lists, test=True
)
lstmcrf_pred, target_tag_list = bilstm_model.test(test_word_lists, test_tag_lists,
                                                  crf_word2id, crf_tag2id)

I only need lstmcrf_pred, so why does test require test_tag_lists as an argument?

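Why does the bilstm_crf model call sort_by_lengths(word_lists, tag_lists)?

Why does the bilstm_crf model call sort_by_lengths(word_lists, tag_lists), and what does it do? What benefit does it bring during training? Would it also be fine not to sort?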


def sort_by_lengths(word_lists, tag_lists):
    pairs = list(zip(word_lists, tag_lists))
    indices = sorted(range(len(pairs)),
                     key=lambda k: len(pairs[k][0]),
                     reverse=True)
    pairs = [pairs[i] for i in indices]
    # pairs.sort(key=lambda pair: len(pair[0]), reverse=True)

    word_lists, tag_lists = list(zip(*pairs))

    return word_lists, tag_lists, indices
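One common reason for sorting longest-first, offered here as an assumption since the author does not answer in this thread, is that torch.nn.utils.rnn.pack_padded_sequence historically required batches in descending length order. The returned indices then allow predictions to be put back in the original order. The sketch below repeats sort_by_lengths for self-containment and adds restore_order, a hypothetical helper:

```python
def sort_by_lengths(word_lists, tag_lists):
    pairs = list(zip(word_lists, tag_lists))
    indices = sorted(range(len(pairs)), key=lambda k: len(pairs[k][0]), reverse=True)
    pairs = [pairs[i] for i in indices]
    word_lists, tag_lists = list(zip(*pairs))
    return word_lists, tag_lists, indices

def restore_order(sorted_items, indices):
    """Undo the permutation: sorted_items[k] came from original position indices[k]."""
    restored = [None] * len(indices)
    for k, original_pos in enumerate(indices):
        restored[original_pos] = sorted_items[k]
    return restored

words = [["a"], ["b", "c", "d"], ["e", "f"]]
tags = [["O"], ["B", "M", "E"], ["B", "E"]]
sorted_words, sorted_tags, indices = sort_by_lengths(words, tags)
```

Applying restore_order to the model's per-sentence output keeps it aligned with the input sentences.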

Pretrained word embeddings

Hello, can the word embeddings be replaced with pretrained ones?

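Training bilstm_crf does not need <end> appended after the labels

Thanks to the author for sharing!
In prepocess_data_for_lstmcrf, I noticed that the author appends an end marker to every sentence and to every tag sequence.
Running the code on my own dataset, the val_loss does not turn negative, yet it does not work.
My understanding is that this effectively creates two ends. When the CRF transition matrix is trained, end->end then has to take the maximum value at the last step, which seems wrong to me. I don't think the word and tag sequences need this end tail added in the data. The start and end tags exist for the CRF matrix to use.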

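Question about HMM learning

Hello, how does the HMM parameter learning reflect the EM algorithm for maximum likelihood? In the train function the three parameter matrices are only initialised by normalising tag frequencies, and I don't see any learning process. I'm a beginner and would appreciate any pointers!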
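With fully labeled training data, the maximum-likelihood estimates of an HMM have a closed form: relative frequencies of the observed counts. EM (Baum-Welch) is only needed when the tags are unobserved. The sketch below illustrates the counting approach without smoothing and assumes non-empty sentences; estimate_hmm is a hypothetical function, not the repository's train method:

```python
from collections import Counter, defaultdict

def estimate_hmm(word_lists, tag_lists):
    """Supervised maximum-likelihood estimates obtained by counting and normalising."""
    init, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for words, tags in zip(word_lists, tag_lists):
        init[tags[0]] += 1
        for word, tag in zip(words, tags):
            emit[tag][word] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev][cur] += 1

    def normalise(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    pi = normalise(init)                                  # initial tag distribution
    a = {tag: normalise(c) for tag, c in trans.items()}   # transition probabilities
    b = {tag: normalise(c) for tag, c in emit.items()}    # emission probabilities
    return pi, a, b

pi, a, b = estimate_hmm(
    [["美", "国"], ["我", "去", "美", "国"]],
    [["B-LOC", "E-LOC"], ["O", "O", "B-LOC", "E-LOC"]],
)
```

In practice a small constant is usually added to zero counts so that unseen transitions or words do not get probability zero.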

Error when opening a trained model from ckpts

Hello, when I open a trained model with pickle.load(), I get the error No module named 'models.hmm'. Every model fails this way. Why is that?
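pickle records a class by its module path rather than storing its code, so the module that defined the class (here models.hmm) has to be importable when the file is loaded. A minimal sketch, assuming repo_root is the directory containing the models package; load_checkpoint is a hypothetical helper:

```python
import os
import pickle
import sys

def load_checkpoint(ckpt_path, repo_root):
    """Unpickle a model after making the repository's packages importable.

    repo_root is assumed to be the directory that contains the `models` package.
    """
    repo_root = os.path.abspath(repo_root)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    with open(ckpt_path, "rb") as f:
        return pickle.load(f)
```

Running the loading script from the repository root has the same effect. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.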

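Download problem

Hello, why is the rest of the content missing from the folder I downloaded? Thank you very much for your reply.

Replacing labels

Hello, I think your code is very good and I'd like to learn from it by switching to my own dataset, but I can't find where to replace the labels with my own. Could you help me with this? Many thanks!

Epoch display seems wrong

As shown below, epoch 11 had already finished training, so why does epoch 12 start again from 8.33%?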
Epoch 12, step/total_step: 5/60 8.33% Loss:5.1469
Epoch 12, step/total_step: 10/60 16.67% Loss:3.1716

保存模型...
Epoch 10, Val Loss:4.1249
Epoch 11, step/total_step: 5/60 8.33% Loss:5.9976
Epoch 11, step/total_step: 10/60 16.67% Loss:3.6216
Epoch 11, step/total_step: 15/60 25.00% Loss:2.9838
Epoch 11, step/total_step: 20/60 33.33% Loss:1.7743
Epoch 11, step/total_step: 25/60 41.67% Loss:2.1327
Epoch 11, step/total_step: 30/60 50.00% Loss:1.2837
Epoch 11, step/total_step: 35/60 58.33% Loss:1.3841
Epoch 11, step/total_step: 40/60 66.67% Loss:1.2403
Epoch 11, step/total_step: 45/60 75.00% Loss:1.0821
Epoch 11, step/total_step: 50/60 83.33% Loss:0.9138
Epoch 11, step/total_step: 55/60 91.67% Loss:0.6943
Epoch 11, step/total_step: 60/60 100.00% Loss:0.5065
保存模型...
Epoch 11, Val Loss:4.0510
Epoch 12, step/total_step: 5/60 8.33% Loss:5.1469
Epoch 12, step/total_step: 10/60 16.67% Loss:3.1716
Epoch 12, step/total_step: 15/60 25.00% Loss:2.4593
Epoch 12, step/total_step: 20/60 33.33% Loss:1.4299
Epoch 12, step/total_step: 25/60 41.67% Loss:1.9214
Epoch 12, step/total_step: 30/60 50.00% Loss:1.2415
Epoch 12, step/total_step: 35/60 58.33% Loss:1.3120
Epoch 12, step/total_step: 40/60 66.67% Loss:1.2194
Epoch 12, step/total_step: 45/60 75.00% Loss:0.9205
Epoch 12, step/total_step: 50/60 83.33% Loss:0.8615
Epoch 12, step/total_step: 55/60 91.67% Loss:0.6182
Epoch 12, step/total_step: 60/60 100.00% Loss:0.4272

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

正在训练评估双向LSTM模型...
Traceback (most recent call last):
File "main.py", line 73, in
main()
File "main.py", line 43, in main
crf=False
File "/home/l1/NER/named_entity_recognition/evaluate.py", line 64, in bilstm_train_and_eval
bilstm_model = BILSTM_Model(vocab_size, out_size, crf=crf)
File "/home/l1/NER/named_entity_recognition/models/bilstm_crf.py", line 31, in init
self.hidden_size, out_size).to(self.device)
File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/module.py", line 381, in to
return self._apply(convert)
File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/module.py", line 187, in _apply
module._apply(fn)
File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 117, in _apply
self.flatten_parameters()
File "/home/l1/anaconda3/envs/NER/lib/python3.7/site-packages/torch/nn/modules/rnn.py", line 113, in flatten_parameters
self.batch_first, bool(self.bidirectional))
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

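I installed the environment according to requirements.txt, using python3.7, cuda10.0 and cudnn v7.6.5, and I keep getting the error above. Has anyone run into the same error?

What is used to convert words into vectors?

What is used to turn words into vectors, word2vec? I've seen other people write that giving each word a word2id (a unique id) is enough. Is there no need to train word embeddings? I'm somewhat confused about this part.

recall

Hello, your project is great. One question: I used this project for Chinese word segmentation and got good precision, but the recall is very low. What could be the reason?

The ResumeNER data seems to have a small problem

The ResumeNER data uploaded to the repo has a problem: at run time, line 16 of data.py raises an error, probably because of a blank line. With the original authors' data the problem does not occur. (The environment is fine; it should just be that the uploaded copy contains a blank line.)
Thanks again for open-sourcing this.

Data problem

After replacing the data with my own, labeled in BMES, I get a division by zero error when recall is computed. What is going on? Are there any requirements on the data format?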
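One plausible cause, stated as an assumption because the traceback is not shown: recall divides by the number of gold occurrences of a tag, and some tag in the tag set never appears in the gold test data. A guarded division avoids the crash. Both function names below are hypothetical:

```python
def safe_divide(numerator, denominator, default=0.0):
    """Return numerator / denominator, or `default` when the denominator is zero."""
    return numerator / denominator if denominator else default

def recall_per_tag(gold_counts, correct_counts, tagset):
    """Recall for each tag; tags absent from the gold data get 0.0 instead of raising."""
    return {
        tag: safe_divide(correct_counts.get(tag, 0), gold_counts.get(tag, 0))
        for tag in tagset
    }

recalls = recall_per_tag({"B-LOC": 4}, {"B-LOC": 3}, ["B-LOC", "E-LOC"])
```

It is also worth checking that every tag in the training data actually occurs in the test split.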

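NER should be evaluated per entity, not per individual tag

Thank you for using our paper and data.

I noticed that in your evaluation function precision/recall/F1 are averaged over all tags, whereas NER is actually evaluated at the entity level rather than the tag level. For example: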

美 B-LOC
国 E-LOC
的 O
华 B-PER
莱 I-PER
士 E-PER

我 O
跟 O
他 O
谈 O
笑 O
风 O
生 O

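Here all we really care about is whether the two entities 美国 (LOC) and 华莱士 (PER) are predicted correctly. Evaluation should therefore first extract the position and type of each recognized entity, and if either one differs, the prediction has to be counted as wrong.

See https://github.com/jiesutd/NCRFpp/blob/master/utils/metric.py for an implementation of an NER evaluation function.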
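A minimal sketch of entity-level scoring under exact span-and-type matching follows. It is written for this note and is not the NCRF++ implementation; it accepts both I- and M- as inside prefixes and silently drops malformed spans, which are simplifying assumptions:

```python
def extract_entities(tags):
    """Return a set of (start, end, type) spans from BIOES/BMES-style tags."""
    entities, start, etype = set(), None, None
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "S":
            entities.add((i, i, label))
            start, etype = None, None
        elif prefix == "B":
            start, etype = i, label
        elif prefix in ("I", "M") and start is not None and label == etype:
            continue
        elif prefix == "E" and start is not None and label == etype:
            entities.add((start, i, label))
            start, etype = None, None
        else:  # "O" or an inconsistent tag closes any open span without emitting it
            start, etype = None, None
    return entities

def entity_prf(gold_tag_lists, pred_tag_lists):
    """Micro-averaged precision, recall and F1 over exact-match entity spans."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_tag_lists, pred_tag_lists):
        g, p = extract_entities(gold), extract_entities(pred)
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-LOC", "E-LOC", "O", "B-PER", "I-PER", "E-PER"]]
pred = [["B-LOC", "E-LOC", "O", "B-PER", "E-PER", "O"]]
precision, recall, f1 = entity_prf(gold, pred)
```

In this example four of the six tags match, but only one of the two entities is recovered exactly, which is the gap between tag-level and entity-level scores that the issue describes.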

Is this code redundant?

    def train_step(self, batch_sents, batch_tags, word2id, tag2id):
        self.model.train()
        self.step += 1
        # Prepare the data
        tensorized_sents, lengths = tensorized(batch_sents, word2id)
        tensorized_sents = tensorized_sents.to(self.device)
        targets, lengths = tensorized(batch_tags, tag2id)
        targets = targets.to(self.device)

        # forward
        scores = self.model(tensorized_sents, lengths)

        # Compute the loss and update the parameters
        self.optimizer.zero_grad()
        loss = self.cal_loss_func(scores, targets, tag2id).to(self.device)
        loss.backward()
        self.optimizer.step()

        return loss.item()

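The code above is from model/bilstm_crf.py.
I'd like to ask what self.model.train() in the train_step function does.

Model running speed / hyperparameter tuning

When I use the models on my own data, HMM and CRF produce results quickly, but the BiLSTM and BiLSTM-CRF models train very slowly, taking roughly two to three days, and the final results are not very good either. I tried raising the learning rate and reducing the number of epochs, but training did not get noticeably faster. Is this a problem with my data, or with the model settings?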

Environment setup fails

I ran into this error:
AttributeError: 'LSTM' object has no attribute '_flat_weights'
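After searching, I found that it is a torch version problem; I was using the latest version, 1.11.0.
When I tried to switch versions, I found that 1.0.1.post2 is not available: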
ERROR: Could not find a version that satisfies the requirement torch==1.0.1.post2 (from versions: 1.7.1, 1.8.0, 1.8.1, 1.9.0, 1.9.1, 1.10.0, 1.10.1, 1.10.2, 1.11.0)
ERROR: No matching distribution found for torch==1.0.1.post2
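Searching further, I found that Windows is not supported.
I then tried installing on a server
and found that it is still not supported, so I cannot continue with this project.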
