ner-lstm-crf's Introduction

NER-LSTM-CRF

An easy-to-use named entity recognition (NER) toolkit implementing an LSTM+[CNN]+CRF model in TensorFlow.

This project will not be maintained in the near term; a PyTorch version is available at https://github.com/liu-nlper/SLTK

1. Model

Bi-LSTM/Bi-GRU + [CNN] + CRF, where the CNN layer targets English text and captures character-level features; it is switched on or off with the use_char_feature parameter.
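
For orientation, the core of such a model in TensorFlow 1.x looks roughly like the sketch below; names, shapes and sizes are illustrative assumptions, not the repository's exact code:

    import tensorflow as tf

    # Illustrative placeholders (shapes and sizes are assumptions).
    inputs = tf.placeholder(tf.float32, [None, None, 64])  # [batch, time, embed_dim]
    labels = tf.placeholder(tf.int32, [None, None])        # gold label ids
    lengths = tf.placeholder(tf.int32, [None])             # actual sequence lengths

    # Bi-LSTM encoder (a GRU cell could be swapped in, cf. the rnn_unit parameter).
    cell_fw = tf.nn.rnn_cell.LSTMCell(256)
    cell_bw = tf.nn.rnn_cell.LSTMCell(256)
    (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        cell_fw, cell_bw, inputs, sequence_length=lengths, dtype=tf.float32)
    rnn_out = tf.concat([out_fw, out_bw], axis=-1)

    # Per-token class scores, then a CRF over the whole label sequence.
    logits = tf.layers.dense(rnn_out, 9)  # 9 = nb_classes (assumption)
    log_lik, transitions = tf.contrib.crf.crf_log_likelihood(logits, labels, lengths)
    loss = tf.reduce_mean(-log_lik)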

2. Usage

2.1 Data preparation

Format the training data as shown below: features are separated by tabs (or spaces), each line has n columns, columns 1 through n-1 are features, and the last column is the label (a minimal reader sketch follows the sample).

苏   NR   B-ORG
州   NR   I-ORG
大   NN   I-ORG
学   NN   E-ORG
位   VV   O
于   VV   O
江   NR   B-GPE
苏   NR   I-GPE
省   NR   E-GPE
苏   NR   B-GPE
州   NR   I-GPE
市   NR   E-GPE
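
A minimal reader for this format might look like the sketch below (the function name and return layout are illustrative, not the repository's actual preprocessing code):

    def read_corpus(path, sep='\t'):
        """Read the format above: blank lines separate sentences; the last
        column of each line is the label, the preceding columns are features."""
        sentences, current = [], []
        with open(path, encoding='utf-8') as f:
            for line in f:
                line = line.rstrip('\n')
                if not line:  # a blank line ends the current sentence
                    if current:
                        sentences.append(current)
                        current = []
                    continue
                columns = line.split(sep)
                current.append((columns[:-1], columns[-1]))
        if current:
            sentences.append(current)
        return sentences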

2.2 Editing the configuration file

Step 1: write the path of the training file described above into the data_params/path_train parameter of the configuration file config.yml;

Step 2: each line of the sample data above contains three columns, called f1, f2 and label. First set model_params/feature_names to ['f1', 'f2'] and rename the entries under embed_params to the corresponding feature names. The shape parameter there can only be obtained after preprocessing (Step 3); path_pre_train is the path to pre-trained word vectors, in the same txt format that gensim produces;

Step 3: update the parameters under data_params: they store the vocabularies (voc, i.e., name-to-id mapping dictionaries) for the features and labels; change them to the appropriate paths (see the example fragment below).

Note: when processing Chinese, set the use_char_feature parameter to false; when processing English, set it to true.
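
Put together, the relevant part of config.yml would look something like the fragment below (paths are placeholders; the shape values are filled in automatically during preprocessing, see 2.3):

    model_params:
      feature_names: ['f1', 'f2']
      embed_params:
        f1:
          shape:  # filled in by preprocessing.py
          path_pre_train: './data/embedding.txt'  # optional, gensim txt format
        f2:
          shape:  # filled in by preprocessing.py
          path_pre_train: null
    data_params:
      voc_params:
        f1: {min_count: 0, path: './Res/voc/f1.voc.pkl'}
        f2: {min_count: 0, path: './Res/voc/f2.voc.pkl'}
        label: {min_count: 0, path: './Res/voc/label.voc.pkl'}
      path_train: './data/train.txt'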

2.3 Preprocessing

$ python/python3 preprocessing.py

After preprocessing, you will get the number of items for each feature and the number of labels; each feature's shape parameter and the nb_classes parameter in config.yml are updated automatically;

The maximum sentence length parameter sequence_length is controlled by the sequence_len_pt parameter, which defaults to 98, meaning the computed sequence_length covers 98% of the instances; adjust it as your data requires;
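
The percentile computation presumably amounts to something like the following sketch (assuming numpy; lengths is the list of sentence lengths collected from the training data):

    import numpy as np

    lengths = [len(sentence) for sentence in sentences]  # sentence lengths from the training data
    sequence_length = int(np.percentile(lengths, 98))    # sequence_len_pt = 98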

Note that if pre-trained embedding vectors are provided, the embedding dimension of that feature follows the pre-trained vectors; otherwise the first feature's embedding dimension defaults to 64 and the remaining features default to 32. These can be adjusted as needed.

2.4 Training the model

Training: adjust the remaining parameters as needed, where dev_size is the fraction of the training set used as the development set (default 0.1), then run:

$ python/python3 train.py

2.5 Tagging data

Tagging: set the corresponding path_test and path_result in config.yml, then run:

$ python/python3 test.py

2.6 Parameter description

No. Parameter Description
1 rnn_unit: str, ['lstm', 'gru'], which RNN unit to use; user-set, default lstm
2 num_units: int, number of BiLSTM/BiGRU units; user-set, default 256
3 num_layers: int, number of BiLSTM/BiGRU layers; user-set, default 1
4 rnn_dropout: float, dropout rate of the LSTM/GRU layer; user-set, default 0.2
5 use_crf: bool, whether to use a CRF layer; user-set, default true
6 use_char_feature: bool, whether to use character-level features (for English); user-set, default false
7 learning_rate: float, learning rate; user-set, default 0.001
8 dropout_rate: float, dropout between the BiLSTM/BiGRU output and the fully connected layer; user-set, default 0.5
9 l2_rate: float, L2 penalty on the fully connected layer weights; user-set, default 0.001
10 clip: None or int, gradient clipping; user-set, default 10
11 dev_size: float in (0, 1), fraction of the (shuffled) training set split off as the development set; user-set, default 0.2
12 sequence_len_pt: int, sentence-length percentile used to set the maximum sentence length; user-set, default 98
13 sequence_length: int, maximum sentence length, computed from sequence_len_pt; no need to set
14 word_len_pt: int, word-length percentile used to set the maximum word length; only used when use_char_feature is true; user-set, default 95
15 word_length: int, maximum word length, computed from word_len_pt
16 nb_classes: int, number of labels, computed automatically; no need to set
17 batch_size: int, batch size; user-set, default 64
18 nb_epoch: int, number of training epochs; user-set
19 max_patience: int, early-stopping patience: training stops once the development-set performance has not improved for max_patience consecutive epochs; user-set, default 5 (see the sketch after this table)
20 path_model: str, path where the model is saved; user-set, default ./Model/best_model
21 sep: str, ['table', 'space'], separator between features; user-set, default table
22 conv_filter_size_list: list, numbers of convolution filters when char features are used; user-set, default [8, 8, 8, 8, 8]
23 conv_filter_len_list: list, convolution kernel sizes when char features are used; user-set, default [1, 2, 3, 4, 5]
24 conv_dropout: dropout rate of the convolution layer
25 Other parameters ......
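
The max_patience early stopping (parameter 19) amounts to a loop of roughly the following shape; the helper functions here are hypothetical, a sketch rather than the repository's train.py:

    best_dev_loss, patience = float('inf'), 0
    for epoch in range(nb_epoch):
        train_one_epoch()                 # hypothetical helper
        dev_loss = evaluate_on_dev()      # hypothetical helper
        if dev_loss < best_dev_loss:
            best_dev_loss, patience = dev_loss, 0
            save_model(path_model)        # keep the best model so far
        else:
            patience += 1
            if patience >= max_patience:  # no improvement for max_patience epochs
                break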

3. Utils

Some small utilities, including:

  • train_word2vec_model.py: trains word vectors with gensim (see the sketch after this list);
  • trietree.py: builds a trie and performs lookups (to be optimized); can be used to build dictionary features;
  • KeywordExtractor: trietree.py can be replaced by the KeywordExtractor library, which provides more interfaces;
  • updating...
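
For reference, train_word2vec_model.py presumably boils down to gensim's classic (pre-4.0) training call; a sketch with assumed parameter values and file names:

    from gensim.models import Word2Vec

    # One tokenized sentence per line; 'corpus.txt' is a placeholder name.
    sentences = [line.split() for line in open('corpus.txt', encoding='utf-8')]
    model = Word2Vec(sentences, size=100, window=5, min_count=1, workers=4)
    # Save in the txt format that path_pre_train expects (gensim's own format).
    model.wv.save_word2vec_format('embedding.txt', binary=False)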

4. Requirements

  • numpy
  • tensorflow 1.4
  • pickle
  • tqdm
  • yaml

5. References

ner-lstm-crf's People

Contributors

liu-nlper, shuaihuaiyi

ner-lstm-crf's Issues

About the numbering of features and labels

Hello, looking at your code, feature ids start from 2 and label ids start from 1, and nb_classes is voc[-1]+1, so with 7 labels in total nb_classes comes out as 9. I don't quite follow: shouldn't nb_classes still be 7? I don't understand the logic here; could you explain? Thanks!

is get_sequence_actual_length right?

Hi liu,

When the tensor passed to get_sequence_actual_length (utils.py) is:
[[-0.1, 0.1, 0]]
it returns:
[0]
but the actual length is [2].
So it is wrong, right?
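
For what it's worth, the reported values are consistent with the length being computed as a sum of tf.sign over raw (possibly negative) values; taking the absolute value first would give the expected answer. A sketch of the suspected issue, not the repository's exact code:

    import tensorflow as tf

    x = tf.constant([[-0.1, 0.1, 0.0]])
    buggy = tf.reduce_sum(tf.sign(x), axis=1)          # (-1) + 1 + 0 = 0
    fixed = tf.reduce_sum(tf.sign(tf.abs(x)), axis=1)  #   1  + 1 + 0 = 2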

Missing code in utils

Hello:
Thank you very much for sharing your code; the automatic generation of the configuration file is a nice touch.
However, the create_dictionary and load_embed_from_txt functions seem to be missing from utils. Could you upload that part of the code?
Thanks!

About nil_vars

self.nil_vars.add(self.feature_weight_dict[feature_name].name)

This variable appears to be meant to keep certain unused rows of the embedding matrix from being updated during training, but the indentation here means the mechanism is not applied to the embeddings of features that have no pre-trained vectors. Is this a bug?

Also, preprocessing leaves the first two rows empty, apparently reserving row 0 for padding and row 1 for out-of-vocabulary words. Shouldn't the gradients of both rows be replaced with zeros during training? I'm still unfamiliar with the TensorFlow API, but the zero_nil_slot function seems to replace only one row.
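
For context, zeroing the gradients of the padding/OOV embedding rows is usually done along these lines (a sketch assuming rows 0 and 1 are the nil rows, as described above; not the repository's exact zero_nil_slot):

    import tensorflow as tf

    def zero_nil_rows(grad, num_nil_rows=2):
        """Zero the gradient of the first num_nil_rows embedding rows
        (padding and out-of-vocabulary) so they are never updated."""
        zeros = tf.zeros_like(grad[:num_nil_rows])
        return tf.concat([zeros, grad[num_nil_rows:]], axis=0)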

the data of NER

It would be better if the NER data (training data and embedding data) could be accessed.

Hello, I have a few questions

1. Step 3 says: update the parameters under data_params, which store the vocabularies (name-to-id mapping dictionaries) for features and labels, and change them to the appropriate paths.
What exactly needs to be changed here? Could you give an example? Thanks.
2. My training stops after 20 epochs. Is that normal?
3. At test time, all predicted labels are O.

Thanks for your help!

A few questions about the CNN

1. In the shape of the 3D convolution kernel, what does the dimension that is always 1 represent? Earlier you used tf.expand_dims to add a dimension to the input; are the two related? And by what rule are the values in kernel_size ordered?
2. tf.contrib.layers.conv3d already has the default argument activation_fn=tf.nn.relu; why apply an activation manually again afterwards?
3. What does the last line of the with block, scope.reuse_variables(), do? Right after it executes, the loop moves on and constructs a new scope, which I find puzzling.

Can this be changed to multi-GPU training?

Thanks for sharing this so generously!
I'd like to run the model on a multi-GPU machine, but model.py only has the device:cpu variable. Which other parameters or settings would need to change for a multi-GPU version?
Thanks in advance!

data

右 JJB B-body is my data format

Different dataset

Can the code run if my dataset lacks the middle column?

CRF loss is negative

I trained the model on the JNLPBA (http://www.nactem.ac.uk/tsujii/GENIA/ERtask/report.html) training data (IOB format), but the train and dev losses sometimes come out negative.
The config values are:

    model: NER
    model_params:
      bilstm_params:
        num_units: 128
        num_layers: 1
      feature_names: ['f1']
      embed_params:
        f1:
          dropout_rate: 0.5
          shape: [23000, 100]
          path_pre_train: #'./data/embedding.txt'
          path: #'./Res/embed/char_embed.pkl'
        f2:
          dropout_rate: 0.3
          shape: [5, 32]
          path_pre_train: null
          path: null
      use_crf: True
      rnn_unit: 'gru'  # 'lstm' or 'gru'
      learning_rate: 0.01
      clip: 10
      dev_size: 0.1
      dropout_rate: 0.5
      l2_rate: 0.00
      nb_classes: 11
      sequence_length: 208
      batch_size: 512
      nb_epoch: 1000
      max_patience: 20
      path_model: './Model/best_model'
    data_params:
      voc_params:
        f1:
          min_count: 0
          path: './Res/voc/f1.voc.pkl'
        f2:
          min_count: 0
          path: './Res/voc/f2.voc.pkl'
        label:
          min_count: 0
          path: './Res/voc/label.voc.pkl'
      sep: 'table'  # table or space
      path_train: './data/JNLPBA/Genia4ERtask1.iob2'
      path_test: './data/JNLPBA/sampletest1.iob2'
      path_result: './data/JNLPBA/sampletest1.iob2_result.txt'

Why would the CRF loss go negative, and how can I fix it? Thanks.

Hi, a question

When I run preprocess and train, both fail with: list indices must be integers or slices, not str
How can I fix this?

Training on your data works, but with my own data the following error is raised

Data (the middle column carries no information):
【 O O
拉手 O O
】 O O
您 O O
好 O O
, O O
黄记 O B-commodityname
煌 O I-commodityname
中华 O I-commodityname
店 O I-commodityname
0 O I-commodityname
人餐 O E-commodityname
券号 O O
000000000 O S-order_arr
等 O O
0 O B-consumequantity
张 O E-consumequantity
券 O O
已于 O O
00 O B-date
日 O E-date
00 O B-time
时 O E-time
消费 O O
, O O
拉手 O O
客服 O O
: O O
0000000000 O O

【 O O
拉手 O O
】 O O
您 O O
好 O O
, O O
黄记 O B-commodityname
煌 O I-commodityname
中华 O I-commodityname
店 O I-commodityname
0 O I-commodityname
人餐 O E-commodityname
券号 O O
000000000 O S-order_arr
等 O O
0 O B-consumequantity
张 O E-consumequantity
券 O O
已于 O O
00 O B-date
日 O E-date
00 O B-time
时 O E-time
消费 O O
, O O
拉手 O O
客服 O O
: O O
0000000000 O O

您 O O
于 O O
00 O B-date

• O I-date
00 O E-date
00 O B-time
: O I-time
00 O E-time
消费 O O
00 O B-commodityname
元 O I-commodityname
风采 O I-commodityname
飞 O I-commodityname
虹 O I-commodityname
店 O I-commodityname
足浴 O E-commodityname
共 O O
0 O B-consumequantity
份 O E-consumequantity
, O O
剩余 O O
0 O B-remainquantity
份 O E-remainquantity
, O O
订单号 O O
00000000 O S-order_arr
, O O
客服 O O
0000000000 O O
, O O
【 O O
窝窝 O O
团 O O
】 O O
Only the first two records work; adding the last record causes the error. I'm quite confused about this.

Epoch 1 / 20:
0%| | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1139, in _do_call
return fn(*args)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1121, in _run_fn
status, run_metadata)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\contextlib.py", line 66, in exit
next(self.gen)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\framework\errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1,8] = 155 is not in [0, 144)
[[Node: Gather_1 = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape_3, add_3)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:/PythonWorkSpace/nlp/lstm_crf/train.py", line 74, in
main()
File "D:/PythonWorkSpace/nlp/lstm_crf/train.py", line 70, in main
data_dict=data_dict, dev_size=config['model_params']['dev_size'])
File "D:\PythonWorkSpace\nlp\lstm_crf\model.py", line 240, in fit
_, loss = self.sess.run([self.train_op, self.loss], feed_dict=feed_dict)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[1,8] = 155 is not in [0, 144)
[[Node: Gather_1 = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape_3, add_3)]]

Caused by op 'Gather_1', defined at:
File "D:/PythonWorkSpace/nlp/lstm_crf/train.py", line 74, in
main()
File "D:/PythonWorkSpace/nlp/lstm_crf/train.py", line 67, in main
path_model=config['model_params']['path_model'])
File "D:\PythonWorkSpace\nlp\lstm_crf\model.py", line 73, in init
self.build_model()
File "D:\PythonWorkSpace\nlp\lstm_crf\model.py", line 160, in build_model
self.loss = self.compute_loss()
File "D:\PythonWorkSpace\nlp\lstm_crf\model.py", line 371, in compute_loss
self.logits, self.input_label_ph, self.sequence_actual_length)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\crf\python\ops\crf.py", line 155, in crf_log_likelihood
transition_params)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\crf\python\ops\crf.py", line 93, in crf_sequence_score
transition_params)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\contrib\crf\python\ops\crf.py", line 220, in crf_binary_score
flattened_transition_indices)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\ops\gen_array_ops.py", line 1179, in gather
validate_indices=validate_indices, name=name)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "D:\Anaconda\Anaconda2\envs\tensorflow\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in init
self._traceback = _extract_stack()

InvalidArgumentError (see above for traceback): indices[1,8] = 155 is not in [0, 144)
[[Node: Gather_1 = Gather[Tindices=DT_INT32, Tparams=DT_FLOAT, validate_indices=true, _device="/job:localhost/replica:0/task:0/cpu:0"](Reshape_3, add_3)]]

Question about the pre-trained word vectors

@liu-nlper
Hi, a question about the pre-trained vectors: did you supply character vectors or word vectors during training? Since the input format is one character per line, my understanding is that character vectors should be supplied. But for some reason my results actually got worse after providing pre-trained character vectors. Could you help explain? Thanks!

should dropout be set to zero at inference?

Hi liu-nlper,

Nice work and an easy-to-follow README!

But I noticed that in the testing phase you still construct the model with the dropout value from the config. Was this a deliberate design choice? Usually at inference we use the complete model without dropout.

Thanks!
Xuan
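
The usual pattern is to feed the keep probability through a placeholder so dropout can be disabled at inference; a minimal sketch, not the repository's code:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 4])
    keep_prob = tf.placeholder(tf.float32)  # keep_prob = 1 - dropout_rate
    y = tf.nn.dropout(x, keep_prob=keep_prob)

    with tf.Session() as sess:
        data = [[1.0, 2.0, 3.0, 4.0]]
        print(sess.run(y, {x: data, keep_prob: 0.5}))  # training: dropout active
        print(sess.run(y, {x: data, keep_prob: 1.0}))  # inference: dropout disabled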

A runtime error

Hello, the following error occurs while running: one_instance_items[j].append(feature_tokens[j])
IndexError: list index out of range. How can this be fixed?

Thanks

Sent it to your email, thanks!

Segmentation fault during training

$ python train.py 
/opt/python/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py:91: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Epoch 1 / 200:
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 109/109 [05:18<00:00,  2.48s/it]
train loss: 9.972754, dev loss: 7.669499
Segmentation fault

I'm not sure where it went wrong; is there any other configuration information I should provide?
