pymedext_eds's Introduction

Pymedext annotators for the EDS pipeline

Installation

Requires the installation of PyMedExt_core. This can be done using requirements.txt:

pip install -r requirements.txt

Installation via pip:

pip install git+https://github.com/equipe22/pymedext_eds.git@master#egg=pymedext_eds

Cloning the repository:

git clone https://github.com/equipe22/pymedext_eds.git
cd pymedext_eds
pip install .

Basic usage

All the annotators are defined in the pymedext_eds.annotators module. You will find a description of the existing annotators in the next section.

  • First, import the required modules and annotators:
import pkg_resources
from glob import glob

from pymedext_eds.utils import rawtext_loader

from pymedext_eds.annotators import Endlines, SentenceTokenizer, \
                                    RegexMatcher, Pipeline

from pymedext_eds.viz import display_annotations
  • Load documents:
data_path = pkg_resources.resource_filename('pymedext_eds', 'data/demo')
file_list = glob(data_path + '/*.txt')
docs = [rawtext_loader(x) for x in file_list]
  • Declare the pipeline:
endlines = Endlines(['raw_text'], 'endlines', 'endlines:v1')
sentences = SentenceTokenizer(['endlines'], 'sentence', 'sentenceTokenizer:v1')
regex = RegexMatcher(['endlines','syntagme'], 'regex', 'RegexMatcher:v1', 'list_regexp.json')

pipeline = Pipeline(pipeline = [endlines, sentences, regex])
  • Use the pipeline to annotate:
annotated_docs = pipeline.annotate(docs)
  • Explore annotations by type:
from pprint import pprint
pprint(annotated_docs[0].get_annotations('regex')[10].to_dict())
  • Display annotations in the text (using displacy):
display_annotations(annotated_docs[0], ['regex'])

Existing annotators

  • Endlines:
    • Cleans text extracted from PDFs by removing erroneous endlines introduced by the PDF-to-text conversion.
    • input : raw_text
    • output: Annotations
  • SectionSplitter:
    • Segments the text into sections
    • output: Annotations
  • SentenceTokenizer:
    • Tokenizes the text into sentences
    • input: cleaned text from Endlines or sections
    • output: Annotations
  • Hypothesis:
    • Classification of sentences regarding the degree of certainty
    • input: sentences
    • output: Attributes
  • ATCDFamille:
    • Classification of sentences regarding the subject (patient or family)
    • input: sentences
    • output: Attributes
  • SyntagmeTokenizer:
    • Segmentation of sentences into syntagms
    • input: sentences
    • output: Annotations
  • Negation:
    • Classification of syntagms according to the polarity
    • input: syntagm
    • output: Attributes
  • RegexMatcher:
    • Extracts information using predefined regexes
    • input: sentence or syntagm
    • output: Annotations
  • QuickUMLSAnnotator:
    • Extracts medical concepts from UMLS using QuickUMLS
    • output: Annotations
  • MedicationAnnotator:
    • Extracts medication information using a deep learning pipeline
    • output: Annotations
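All the annotators above follow the same chaining convention: each one declares the annotation types it consumes, the single type it produces, and a version ID (the pattern behind constructor calls such as Endlines(['raw_text'], 'endlines', 'endlines:v1')). The sketch below illustrates that data flow with simplified stand-in classes; these are NOT the real PyMedExt classes, just a toy model of the convention:

```python
# Toy sketch of the input/output chaining convention.
# ToyAnnotator and ToyPipeline are illustrative stand-ins,
# not the actual PyMedExt API.

class ToyAnnotator:
    def __init__(self, key_input, key_output, ID):
        self.key_input = key_input    # annotation types consumed
        self.key_output = key_output  # annotation type produced
        self.ID = ID                  # annotator name and version

    def annotate(self, annotations):
        # Process every annotation whose type matches an input key.
        return [
            {"type": self.key_output, "value": ann["value"], "source": self.ID}
            for ann in annotations
            if ann["type"] in self.key_input
        ]

class ToyPipeline:
    def __init__(self, pipeline):
        self.pipeline = pipeline

    def annotate(self, annotations):
        # Each annotator's output is appended, so later annotators
        # can consume what earlier ones produced.
        for annotator in self.pipeline:
            annotations = annotations + annotator.annotate(annotations)
        return annotations

endlines = ToyAnnotator(["raw_text"], "endlines", "endlines:v1")
sentences = ToyAnnotator(["endlines"], "sentence", "sentenceTokenizer:v1")
pipeline = ToyPipeline([endlines, sentences])

result = pipeline.annotate(
    [{"type": "raw_text", "value": "some text", "source": "loader"}]
)
print([ann["type"] for ann in result])  # ['raw_text', 'endlines', 'sentence']
```

This is why the declaration order matters: an annotator whose input type is produced by another annotator must appear after it in the pipeline list.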

QuickUMLS installation (copied from Georgetown-IR-Lab/QuickUMLS)

Installation

  1. Obtain a UMLS installation. This tool requires a valid UMLS installation on disk. To install UMLS, you must first obtain a license from the National Library of Medicine; then you should download all UMLS files from this page; finally, you can install UMLS using the MetamorphoSys tool as explained in this guide. The installation can be removed once the system has been initialized.
  2. Install QuickUMLS: You can do so by either running pip install quickumls or python setup.py install. On macOS, using anaconda is strongly recommended†.
  3. Create a QuickUMLS installation. Initialize the system by running python -m quickumls.install <umls_installation_path> <destination_path>, where <umls_installation_path> is where the installation files are (in particular, MRCONSO.RRF and MRSTY.RRF are needed) and <destination_path> is the directory where the QuickUMLS data files should be installed. This process takes between 5 and 30 minutes, depending on the speed of the CPU and of the drive storing the UMLS and QuickUMLS files (on a system with an Intel i7 6700K CPU and a 7200 RPM hard drive, initialization takes 8.5 minutes).

python -m quickumls.install supports the following optional arguments:

  • -L / --lowercase: if used, all concept terms are folded to lowercase before being processed. This option typically increases recall, but it might reduce precision;
  • -U / --normalize-unicode: if used, expressions with non-ASCII characters are converted to the closest combination of ASCII characters.
  • -E / --language: Specify the language to consider for UMLS concepts; by default, English is used. For a complete list of languages, please see this table provided by NLM.
  • -d / --database-backend: Specify which database backend to use for QuickUMLS. The two options are leveldb and unqlite. The latter supports multi-process reading and has better Unicode compatibility, and it is used as the default for all new 1.4 installations; the former is still the default when instantiating a QuickUMLS client. More info about the differences between the two databases and migration info are available here.

†: If the installation fails on macOS when using Anaconda, install leveldb first by running conda install -c conda-forge python-leveldb.

Run a simple server

Define the server and the pipeline:

from flask import Flask, request

from pymedext_eds.annotators import Endlines, SentenceTokenizer, Hypothesis, \
                                    ATCDFamille, SyntagmeTokenizer, Negation, RegexMatcher, \
                                    Pipeline

endlines = Endlines(['raw_text'], 'endlines', 'endlines:v1')
sentences = SentenceTokenizer(['endlines'], 'sentence', 'sentenceTokenizer:v1')
hypothesis = Hypothesis(['sentence'], 'hypothesis', 'hypothesis:v1')
family = ATCDFamille(['sentence'], 'context', 'ATCDfamily:v1')
syntagmes = SyntagmeTokenizer(['sentence'], 'syntagme', 'SyntagmeTokenizer:v1')
negation = Negation(['syntagme'], 'negation', 'Negation:v1')
regex = RegexMatcher(['endlines','syntagme'], 'regex', 'RegexMatcher:v1', 'list_regexp.json')

pipeline = Pipeline(pipeline = [endlines, sentences, hypothesis, family, syntagmes, negation, regex])

app=Flask(__name__)

@app.route('/annotate',methods = ['POST'])
def result():
    if request.method == 'POST':

        return pipeline(request)

if __name__ == '__main__':
    app.run(port = 6666, debug=True)

Save this code in demo_flask_server.py and run it using:

python demo_flask_server.py

Query the server:

import pkg_resources
import requests
from glob import glob

from pymedextcore.document import Document
from pymedext_eds.utils import rawtext_loader

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/demo')
file_list = glob(data_path + '/*.txt')
docs = [rawtext_loader(x) for x in file_list]

json_doc = [doc.to_dict() for doc in docs]
res = requests.post("http://127.0.0.1:6666/annotate", json=json_doc)
if res.status_code == 200:
    res = res.json()['result']
    docs = [Document.from_dict(doc) for doc in res]

Run a docker server

Define the git credentials

First, create a file named .git-credentials and replace user and pass with your GitHub credentials, such as:

https://user:pass@github.com

WARNING: never commit this file to the repository!

Build the images

docker build -f eds_apps/Dockerfile_backend -t pymedext-eds:v1 .


# if behind a proxy, add build arguments:
docker build -f eds_apps/Dockerfile_backend -t pymedext-eds:v1 \
    --build-arg http_proxy="proxy" \
    --build-arg https_proxy="proxy" .

Start the backend server

docker run --rm  -d -p 6666:6666 pymedext-eds:v1 python3 demo_flask.py

pymedext_eds's People

Contributors

aneuraz, willdgn, marc-r-vincent

Watchers

Nicolas Garcelon and others

Forkers

drfabach

pymedext_eds's Issues

Span in NER model error

The spans in the NER model are incorrect.

Code to reproduce:

from glob import glob
import pandas as pd
import re
from pprint import pprint
import pkg_resources

from pymedextcore.document import Document
from pymedext_eds.annotators import Endlines, SentenceTokenizer, SectionSplitter
from pymedext_eds.utils import rawtext_loader
from pymedext_eds.med import MedicationAnnotator, MedicationNormalizer

endlines = Endlines(["raw_text"], "clean_text", ID="endlines")
sections = SectionSplitter(['clean_text'], "section", ID= 'sections')
sentenceSplitter = SentenceTokenizer(["section"],"sentence", ID="sentences")

models_param = [{'tagger_path':'data/models/apmed5/entities/final-model.pt' ,
                'tag_name': 'entity_pred' },
                {'tagger_path':'data/models/apmed5/events/final-model.pt' ,
                'tag_name': 'event_pred' },
               {'tagger_path': "data/models/apmed5/drugblob/final-model.pt",
                'tag_name': 'drugblob_pred'}]

med = MedicationAnnotator(['sentence'], 'med', ID='med:v2', models_param=models_param,  device='cuda:1')

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/romedi')
romedi_path = glob(data_path + '/*.p')[0]

norm = MedicationNormalizer(['ENT/DRUG','ENT/CLASS'], 'normalized_mention', ID='norm',romedi_path= romedi_path)

pipeline = [endlines,sections, sentenceSplitter, med, norm]

data_path = pkg_resources.resource_filename('pymedext_eds', 'data/demo')
file_list = glob(data_path + '/*.txt')

docs = [rawtext_loader(x) for x in file_list]

for doc in docs:
    doc.annotate(pipeline)

[t.value for t in docs[0].get_annotations('ENT/DRUG')]

docs[0].get_annotations('clean_text')[0].value[5687:5691]
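The last line above checks a span by slicing the clean text directly. As a general pattern, a span is correct when the substring at [start:end) equals the annotated mention. A minimal illustration of that check, using a toy sentence and made-up offsets rather than real pipeline output:

```python
# Toy span sanity check: the slice at [start:end) should reproduce
# the annotated mention. Values here are illustrative only.
clean_text = "Le patient prend du paracetamol depuis hier."
annotation = {"value": "paracetamol", "span": (20, 31)}

start, end = annotation["span"]
extracted = clean_text[start:end]
print(extracted == annotation["value"])  # True when the span is correct
```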

Lock error when repeating pipeline init.

Initializing a pipeline twice results in a lock error (e.g. in demo_pymedext_eds.ipynb). The QuickUMLS installation was done with default options (i.e. just specifying the data path).

  • quickumls installed version: quickumls-1.3.0.post4, database backend leveldb
  • does not occur with quickumls-1.4.0 and database backend unqlite
  • install command: python -m quickumls.install <umls_installation_path> <destination_path>

Review CLASS_NORM in constants

'ATB': ['J01C (BETALACTAMINES : PENICILLINES)',
         'J01X (AUTRES ANTIBACTERIENS)',
         'J01D (AUTRES BETALACTAMINES)',
         'J01M (QUINOLONES ANTIBACTERIENNES)',
         'J01F (MACROLIDES, LINCOSAMIDES  ET STREPTOGRAMINES)',
         'J01A (TETRACYCLINES)',
         'J01E (SULFAMIDES ET TRIMETHOPRIME)',
         'J01G (AMINOSIDES ANTIBACTERIENS)',
         "J01R (ASSOCIATIONS D'ANTIBACTERIENS)",
         'J01B (PHENICOLES)'],
 'ATBTHERAPIE': ['J01C (BETALACTAMINES : PENICILLINES)',
                 'J01X (AUTRES ANTIBACTERIENS)',
                 'J01D (AUTRES BETALACTAMINES)',
                 'J01M (QUINOLONES ANTIBACTERIENNES)',
                 'J01F (MACROLIDES, LINCOSAMIDES  ET STREPTOGRAMINES)',
                 'J01A (TETRACYCLINES)',
                 'J01E (SULFAMIDES ET TRIMETHOPRIME)',
                 'J01G (AMINOSIDES ANTIBACTERIENS)',
                 "J01R (ASSOCIATIONS D'ANTIBACTERIENS)",
                 'J01B (PHENICOLES)'],

ATB should point to J01 Anti-infectieux systémiques

Seemingly wrong use of re.sub in Annotators

In Annotators, re.sub is repeatedly used in the following fashion:

txt=re.sub(pattern, repl, string, re.IGNORECASE)

but the correct signature of re.sub (Python 3.7) is the following:

txt = re.sub(pattern, repl, string, count=0, flags=0)

Therefore re.sub interprets re.IGNORECASE as the count argument and stops after the second match (re.IGNORECASE == 2), while matching remains case-sensitive.
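A minimal illustration of the bug and the fix, using a toy pattern and string rather than the project's actual regexes:

```python
import re

text = "Dose: 10 MG matin, 20 mg soir, 30 Mg midi"

# Buggy call: re.IGNORECASE (== 2) is passed positionally, so it is
# taken as `count` (at most 2 replacements) and matching stays
# case-SENSITIVE; only the lowercase "mg" is replaced.
buggy = re.sub("mg", "milligrams", text, re.IGNORECASE)

# Correct call: pass the flag by keyword.
fixed = re.sub("mg", "milligrams", text, flags=re.IGNORECASE)

print(buggy)  # Dose: 10 MG matin, 20 milligrams soir, 30 Mg midi
print(fixed)  # Dose: 10 milligrams matin, 20 milligrams soir, 30 milligrams midi
```

Passing flags by keyword (flags=re.IGNORECASE) avoids the ambiguity entirely and is the recommended fix throughout Annotators.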
