GAN-BERT-CRF AND GAN-BiLSTM-CRF

This is an algorithm which uses BERT-CRF as the generator and a CNN-based network as the discrminator. This algorithm use unannotated samples in addition to annotated samples to maximize NER (i.e Name entity recognition) performance. Furthermore, it effectively identify sequence samples with error labels without complex rules and domain knowledge. Also, it is an aciteve leanring algorithm. The paper (Adversarial active learning for the identification of medical concepts and annotation inconsistency) is published in the Journal of Biomedical Informatics (https://www.sciencedirect.com/science/article/abs/pii/S1532046420301106)
.

NER: In medical field or other field, values of unannotated samples were ignored. This algorithm use adversarial idea to use unannotated to maximize NER performances， which means both unannotated and annotated samples are untilized for NER.
Active leanring: Use both ouputs of CRF layer and the discriminator to select unannotated samples to be labeled.
Identification of error labels: The discriminator outputs smaller score for the sequences with error labels.

Requirements

Python: 3.6
Tensorflow: 1.15.2

DataSet

The DataSet should be in the format of two columns. The words are in the first column and labels are in the second column. Sample data is in the Data Directory.
The china_medical_char_data_cleaned directory processes Chinese data and medical_char_data_cleaned processes English data. First, Download dataset and put dataset into these two directories based on languages. Then put your embedding files into these dicrectories and run .sh file to produce training data, validation data, test data and embedding npz.

bash makefile.sh

Usage

Download pretrained Bert and divide you training data into two parts: labeled data and unlabeled data. Put your divided training data、validation data and test data into model/data_path. (pretrained Bert: https://github.com/google-research/bert)
Train data

python main.py --mode=train

How to optimize

One of effective optimize methods is to update different layers of BERT, you can set up which layers are updated in the bert_gan file.For example,

self.params_1 = [param for param in tf.trainable_variables() if
                         ('crf_param' not in param.name and 'bert/encoder/layer_11/output' in param.name) ]
grads_and_vars2 = optim.compute_gradients(self.loss2, self.params_1)

Test data and carry out active learning steps

python main.py --mode=test

Reference

[BERT] (https://github.com/google-research/bert)
[BiLSTM-CRF] (https://github.com/guillaumegenthial/sequence_tagging) (https://github.com/guillaumegenthial/tf_ner/tree/master/models/chars_lstm_lstm_crf)

kushal-h / gan-bert-crf Goto Github PK

gan-bert-crf's Introduction

GAN-BERT-CRF AND GAN-BiLSTM-CRF

Requirements

DataSet

Usage

How to optimize

Reference

gan-bert-crf's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent