
frankkramer-lab / drnote


DrNote is an open tagging tool for text annotation and entity linking based on OpenTapioca and Wikidata/Wikipedia. It provides an entity linking service with pre-trained data for medical annotations in multilingual settings. It processes raw text as well as PDF files, the latter through a Tesseract backend.

License: Apache License 2.0

Shell 100.00%
entity-linking knowledge-graph medical-text-mining multilingual nlp tagging text-mining wikidata wikipedia


DrNote

Accepted at PLOS DH:
https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086 (or see citation)

DrNote is a simple yet effective annotation tool for various purposes.

The annotation method is based on the OpenTapioca (GitHub) codebase and provides named entity linking on unstructured text data.

The project leverages the data from Wikidata and Wikipedia without the requirement of any commercial components.

The annotation service provides a web-based UI as well as API-based access.

The processing of PDF files is supported. Linked entities can be injected as hyperlinks into the uploaded PDF file.

Different languages (de, en, es, etc.) are supported.

Update on Results:
A bug in the evaluation pipeline was found that degraded the reported scores. See the updated scores in the Errata section.

Demo:
Our demo instance is available at:
https://drnote.misit-augsburg.de
Note: Upload of large PDF files is not supported. Uploaded data is discarded after processing.

Graphical Demo:
[Figure: Annotation Demo]

CLI Demo:

# Enter text
text="Die Diagnosen sind Hypothyreose bei Autoimmunthyreoiditis, Diabetes mellitus mit diabetische Nephropathie und akutes Nierenversagen."
# Annotate
curl -k https://drnote.misit-augsburg.de/annotate \
  -F "inputType=plaintext" \
  -F "outputType=html" \
  -F \
"filterOptions={
  \"pipeline\": \"de_core_news_sm\",
  \"rules\": [
    \"any pos[NOUN,PROPN] require\",
    \"all non_stopwords require\"
  ]
}" \
  -F \
"plaintext=$text"
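
The request above can also be wrapped in a small reusable function. The following is a minimal sketch, not part of the repository: the function names `drnote_filter_options` and `drnote_annotate` and the variable `DRNOTE_URL` are made up for illustration, while the endpoint, the form fields, and the filter rules are copied from the demo. It uses curl's `--form-string` for the text so that input starting with `@` or `<`, or containing `;`, is sent literally.

```shell
#!/bin/sh
# Hypothetical helper names; endpoint and form fields are taken from the demo above.
DRNOTE_URL="${DRNOTE_URL:-https://drnote.misit-augsburg.de}"

# Print the filterOptions JSON for a given pipeline name (default as in the demo).
drnote_filter_options() {
  local pipeline="${1:-de_core_news_sm}"
  cat <<EOF
{
  "pipeline": "${pipeline}",
  "rules": [
    "any pos[NOUN,PROPN] require",
    "all non_stopwords require"
  ]
}
EOF
}

# Send plain text to the /annotate endpoint and print the HTML response.
# -k skips TLS certificate verification, as in the demo.
drnote_annotate() {
  local text="$1" pipeline="${2:-de_core_news_sm}"
  curl -sS -k "${DRNOTE_URL}/annotate" \
    -F "inputType=plaintext" \
    -F "outputType=html" \
    -F "filterOptions=$(drnote_filter_options "$pipeline")" \
    --form-string "plaintext=${text}"
}
```

A call such as `drnote_annotate "$text" > annotated.html` would then store the annotated HTML in a file.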

Errata

Detected issues:

  • For the GSC EMEA/Medline datasets, the labels were not correctly filtered for the CHEM label class in all instances.
  • Because of an overly strict regular expression, detected Chemical entries for PubTator were only considered if a MeSH code was given.
  • For GSC EMEA/Medline datasets, in the cTAKES outputs, the UMLS tags were wrongly used instead of the MedicationMentions tags.
  • The character spans of cTAKES may yield broken values due to unsupported umlaut characters. The broken character spans are now fixed using a workaround.

The evaluation was re-run with a revised evaluation pipeline. However, because Wikidata changes constantly, the results may vary. For instance, after substantial changes in the Wikidata graph structure, the SPARQL query for finding medication entities was changed from the previous query

(old SPARQL query)
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
}
to a query that additionally matches any entity with an ATC code (P267)
(new SPARQL query)
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
    UNION
    {?entity wdt:P267 ?atccode .}
}
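
To inspect what the revised query returns, it can be sent to a SPARQL endpoint. The sketch below assumes the public Wikidata Query Service at `https://query.wikidata.org/sparql`; the README does not state which endpoint or client the evaluation used. The function names and the User-Agent string are hypothetical. Transitive queries of this size may hit the public endpoint's time limit.

```shell
#!/bin/sh
# Assumption: the public Wikidata Query Service endpoint.
WDQS_URL="${WDQS_URL:-https://query.wikidata.org/sparql}"

# Print the revised medication query exactly as given above.
medication_query() {
  cat <<'EOF'
SELECT DISTINCT ?entity WHERE
{
    {?entity wdt:P279+ wd:Q12140 .}
    UNION
    {?entity wdt:P31+ wd:Q12140 .}
    UNION
    {?entity wdt:P267 ?atccode .}
}
EOF
}

# Run the query and print the result as CSV (one entity URI per line).
run_medication_query() {
  curl -sS -G "$WDQS_URL" \
    -H "Accept: text/csv" \
    -H "User-Agent: drnote-readme-example/0.1" \
    --data-urlencode "query=$(medication_query)"
}
```

For example, `run_medication_query > medication_entities.csv` would write the matched entities to a file.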

For comparison, the (cached) original outputs from PubTator and cTAKES, the original pre-trained DrNote model and index store, and the cached set of UMLS entities were used. The updated results (as of 31.07.2024) are as follows.

Dataset       Method              Precision   Recall   F1 score
GERNERMED     cTAKES              0.858       0.512    0.641
GERNERMED     PubTator            0.760       0.481    0.590
GERNERMED     DrNote              0.935       0.624    0.749
Medline GSC   cTAKES              0.806       0.307    0.444
Medline GSC   PubTator            0.449       0.420    0.434
Medline GSC   DrNote              0.693       0.139    0.232
EMEA GSC      cTAKES              0.834       0.357    0.500
EMEA GSC      PubTator            0.522       0.211    0.301
EMEA GSC      DrNote              0.833       0.172    0.285
Medline GSC   DrNote (filtered)   0.634       0.444    0.522
EMEA GSC      DrNote (filtered)   0.604       0.636    0.620
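
The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a quick sanity check, the following sketch recomputes it from two numbers; the function name `f1` is made up for illustration and is not part of the evaluation pipeline. Because the table shows rounded precision and recall, a recomputed value can differ from the listed F1 in the last digit.

```shell
#!/bin/sh
# Compute F1 = 2PR / (P + R), printed with three decimals.
f1() {
  awk -v p="$1" -v r="$2" 'BEGIN {
    if (p + r == 0) { print "0.000"; exit }   # avoid division by zero
    printf "%.3f\n", 2 * p * r / (p + r)
  }'
}

f1 0.858 0.512   # GERNERMED, cTAKES
```

For the GERNERMED cTAKES row this prints 0.641, which matches the table.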

How to Use

Spawn DrNote using Pre-trained Data

Steps to spawn the service using pre-trained data:

# Assumed: Docker, Docker-compose installed and user added to Docker group
# follow guide from https://docs.docker.com/engine/install/ubuntu/
# sudo apt-get install -y docker docker-compose
# sudo usermod -aG docker $USER

# Clone repository
git clone https://github.com/frankkramer-lab/DrNote
cd DrNote/

# Retrieve pre-trained data
wget -O build/pretrained_data.tar.gz https://myweb.rz.uni-augsburg.de/~freijoha/DrNote/pretrained_data.tar.gz

# Spawn annotation service
./04_start_annotation_service.sh

The annotation service should be available at:
https://<DOCKER_HOST>/
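
The containers may need some time before the service answers. A minimal sketch for waiting on it follows; the function name `wait_for_service` is hypothetical, and the check assumes that the root URL answers with HTTP 200 once the service is ready, which the README does not state. `-k` skips certificate verification, as in the CLI demo.

```shell
#!/bin/sh
# Poll a URL until it answers with HTTP 200 or the attempts run out.
# Usage: wait_for_service URL [TRIES] [DELAY_SECONDS]
wait_for_service() {
  local url="$1" tries="${2:-30}" delay="${3:-2}" i=1 code
  while [ "$i" -le "$tries" ]; do
    # curl prints 000 as status code when no connection could be made.
    code="$(curl -k -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$code" = "200" ]; then
      echo "service is up at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "no HTTP 200 from $url after $tries attempts" >&2
  return 1
}
```

For a local setup this could be called as `wait_for_service "https://localhost/"`.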

Build From Scratch and Spawn DrNote

Steps to automatically build the OpenTapioca data setup pipeline and spawn the annotation service.

Prestep: Set up the configuration:

  • Modify the file ./cfg/opentapioca_profile.json.
  • Modify the file ./cfg/load_config.json.
    Note: The language code should match the entry in ./cfg/opentapioca_profile.json.

Steps:

  1. Check dependencies:

    • Run ./01_checkDependencies.sh
  2. Generate the NIF file:

    • Run ./02_loadNIFFile.sh
  3. Generate the OpenTapioca data:

    • Run ./03_processForOpenTapioca.sh
  4. Spawn the MISIT annotation service:

    • Run ./04_start_annotation_service.sh

The annotation service should be available at:
https://<DOCKER_HOST>/
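
The four steps can be chained so that a failing step stops the run. This is a minimal sketch, assuming it is started from the repository root after the configuration files have been edited; the function name `run_pipeline` is hypothetical, and the script names are the ones listed above.

```shell
#!/bin/sh
# Run the four build steps in order and stop at the first failure.
run_pipeline() {
  local step
  for step in \
    ./01_checkDependencies.sh \
    ./02_loadNIFFile.sh \
    ./03_processForOpenTapioca.sh \
    ./04_start_annotation_service.sh
  do
    echo "==> running $step"
    if ! "$step"; then
      echo "step failed: $step" >&2
      return 1
    fi
  done
  echo "all steps finished"
}
```

Calling `run_pipeline` then performs the same sequence as running the scripts by hand.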

Citation

The paper is available at: https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0000086
If you use our work or want to reference it, use the following BibTeX entry:

@article{10.1371/journal.pdig.0000086,
    doi = {10.1371/journal.pdig.0000086},
    author = {Frei, Johann and Soto-Rey, Iñaki and Kramer, Frank},
    journal = {PLOS Digital Health},
    publisher = {Public Library of Science},
    title = {DrNote: An open medical annotation service},
    year = {2022},
    month = {08},
    volume = {1},
    url = {https://doi.org/10.1371/journal.pdig.0000086},
    pages = {1-18},
    abstract = {In the context of clinical trials and medical research medical text mining can provide broader insights for various research scenarios by tapping additional text data sources and extracting relevant information that is often exclusively present in unstructured fashion. Although various works for data like electronic health reports are available for English texts, only limited work on tools for non-English text resources has been published that offers immediate practicality in terms of flexibility and initial setup. We introduce DrNote, an open source text annotation service for medical text processing. Our work provides an entire annotation pipeline with its focus on a fast yet effective and easy to use software implementation. Further, the software allows its users to define a custom annotation scope by filtering only for relevant entities that should be included in its knowledge base. The approach is based on OpenTapioca and combines the publicly available datasets from WikiData and Wikipedia, and thus, performs entity linking tasks. In contrast to other related work our service can easily be built upon any language-specific Wikipedia dataset in order to be trained on a specific target language. We provide a public demo instance of our DrNote annotation service at https://drnote.misit-augsburg.de/.},
    number = {8},
}

Referenced Repositories

Not required for smaller queries:
