
autobioner's Introduction

Distantly Supervised Biomedical Named Entity Recognition with Dictionary Expansion

This project provides the AutoBioNER framework for distantly supervised biomedical named entity recognition (BioNER).

The whole framework consists of two parts: Dictionary Expansion and Neural Model Training.

Required Inputs

  • Tokenized Raw Texts: e.g., DictExpan/data/bc5/input_text.txt
    • One token per line.
    • An empty line means the end of a sentence.
  • Two Dictionaries
    • Core Dictionary w/ Type Info: e.g., DictExpan/data/bc5/dict_core.txt
      • Two columns (i.e., Type, Tokenized Surface) per line.
      • Tab separated.
      • How to obtain: from domain-specific dictionaries.
    • Full Dictionary w/o Type Info: e.g., DictExpan/data/bc5/dict_full.txt
      • One tokenized high-quality phrase per line.
      • How to obtain: from domain-specific dictionaries and from a high-quality phrase mining tool (e.g., AutoPhrase) run on a domain-specific corpus.
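The input formats above can be checked with a small reader before running the pipeline. The following is a minimal sketch, not part of the AutoBioNER code: the function names `read_sentences` and `read_core_dict` are hypothetical, and it assumes that the tokens of a dictionary surface are separated by single spaces, which the format description does not state explicitly.

```python
from typing import Iterable, List, Tuple


def read_sentences(lines: Iterable[str]) -> List[List[str]]:
    """Group one-token-per-line input into sentences.

    An empty line marks the end of a sentence.
    """
    sentences: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        token = line.strip()
        if token:
            current.append(token)
        elif current:
            sentences.append(current)
            current = []
    if current:  # the last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences


def read_core_dict(lines: Iterable[str]) -> List[Tuple[str, List[str]]]:
    """Parse tab-separated (Type, Tokenized Surface) lines.

    Assumption: surface tokens are separated by whitespace.
    """
    entries: List[Tuple[str, List[str]]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        entity_type, surface = line.split("\t", 1)
        entries.append((entity_type, surface.split()))
    return entries
```

Both functions accept any iterable of lines, so an open file handle such as `open("DictExpan/data/bc5/input_text.txt", encoding="utf-8")` can be passed directly. A core dictionary line without a tab raises a `ValueError`, which makes malformed entries easy to spot.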

Dictionary Expansion

cd DictExpan/

Start Stanford CoreNLP

# Download the Stanford CoreNLP Toolkit to src/tools/CoreNLP/
cd src/tools/CoreNLP/stanford-corenlp
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer
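Before running the expansion, it can help to confirm that the server is accepting connections. The sketch below is an illustration only: `is_port_open` is a hypothetical helper, and it assumes the server listens on `localhost` at port 9000, the CoreNLP server's default when no `-port` option is given.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 9000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False
```

Calling `is_port_open()` after starting the server should return `True` once the server has finished starting up. The check only tests that the port accepts TCP connections; it does not verify that the process behind it is CoreNLP.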

Run Dictionary Expansion

# Before running, set the corpus name (data) and the input file names (RAW_TEXT, DICT_CORE, DICT_FULL) in run.sh
# Set the corpus name (data) in src/corpusProcessing/corpusProcess.sh
# Set the corpus name (data) in src/dataProcessing/dataProcess.sh
# Set the corpus name (data) in src/SetExpan/set_expan_main.py
./run.sh

Output

Two expanded dictionaries:

  • Expanded core dictionary: e.g., DictExpan/data/bc5/dict_core_expand.txt
  • Expanded full dictionary: e.g., DictExpan/data/bc5/dict_full_expand.txt
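To see what the expansion added, the expanded files can be compared with the original dictionaries. This is a minimal sketch under one assumption: each expanded file keeps the same one-entry-per-line format as its input dictionary. The helper name `new_entries` is hypothetical.

```python
from typing import Iterable, List


def new_entries(original: Iterable[str], expanded: Iterable[str]) -> List[str]:
    """Return entries found in the expanded dictionary but not in the original.

    Order follows the expanded dictionary; duplicates are reported once.
    """
    seen = {line.strip() for line in original if line.strip()}
    added: List[str] = []
    for line in expanded:
        entry = line.strip()
        if entry and entry not in seen:
            added.append(entry)
            seen.add(entry)
    return added
```

For example, passing the open file handles of `DictExpan/data/bc5/dict_full.txt` and `DictExpan/data/bc5/dict_full_expand.txt` would list the phrases introduced by the expansion step, if the format assumption holds.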

Neural Model Training

After the Dictionary Expansion step, use the tokenized raw corpus (DictExpan/data/bc5/input_text.txt), the expanded core dictionary (DictExpan/data/bc5/dict_core_expand.txt), and the expanded full dictionary (DictExpan/data/bc5/dict_full_expand.txt) as the inputs to AutoNER.

The details of the Neural Model Training can be found in the AutoNER repository.

Citation

If you find the implementation useful, please cite the following paper:

@inproceedings{wang2019distantly,
  title={Distantly supervised biomedical named entity recognition with dictionary expansion},
  author={Wang, Xuan and Zhang, Yu and Li, Qi and Ren, Xiang and Shang, Jingbo and Han, Jiawei},
  booktitle={2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)},
  pages={496--503},
  year={2019},
  organization={IEEE}
}
