nlpatvcu / medacy

:hospital: Medical Text Mining and Information Extraction with spaCy

License: GNU General Public License v3.0

Python 100.00%

natural-language-processing medical-natural-language-processing machine-learning metamap clinical-text-processing information-extraction medical-text-mining spacy

medacy's Introduction

medaCy

🏥 Medical Text Mining and Information Extraction with spaCy 🏥

MedaCy is a text processing and learning framework built over spaCy to support the lightning-fast prototyping, training, and application of highly predictive medical NLP models. It is designed to streamline researcher workflow by providing utilities for model training, prediction, and organization while ensuring the replicability of systems.

🌟 Features

Highly predictive, shared-task dominating out-of-the-box trained models for medical named entity recognition.
Customizable pipelines with detailed development instructions and documentation.
Allows the designing of replicable NLP systems for reproducing results and encouraging the distribution of models whilst still allowing for privacy.
Active community development spearheaded and maintained by NLP@VCU.
Detailed API.

💭 Where to ask questions

MedaCy is actively maintained by a team of researchers at Virginia Commonwealth University. The best way to receive immediate responses to any questions is to raise an issue. Make sure to first consult the API. See how to formulate a good issue or feature request in the Contribution Guide.

💻 Installation Instructions

MedaCy can be installed for general use or for pipeline development / research purposes.

Application	Run
Prediction and Model Training (stable)	`pip install git+https://github.com/NLPatVCU/medaCy.git`
Prediction and Model Training (latest)	`pip install git+https://github.com/NLPatVCU/medaCy.git@development`
Pipeline Development and Contribution	See Contribution Instructions

📚 Power of medaCy

After installing medaCy and medaCy's clinical model, simply run:

from medacy.model.model import Model

model = Model.load_external('medacy_model_clinical_notes')
annotation = model.predict("The patient was prescribed 1 capsule of Advil for 5 days.")
print(annotation)

and receive instant predictions:

[
    ('Drug', 40, 45, 'Advil'),
    ('Dosage', 27, 28, '1'), 
    ('Form', 29, 36, 'capsule'),
    ('Duration', 46, 56, 'for 5 days')
]

MedaCy can also be used through its command line interface, documented here.

To explore medaCy's other models or train your own, visit the examples section.

Reference

@ARTICLE {
    author  = "Andriy Mulyar, Natassja Lewinski and Bridget McInnes",
    title   = "TAC SRIE 2018: Extracting Systematic Review Information with MedaCy",
    journal = "National Institute of Standards and Technology (NIST) 2018 Systematic Review Information Extraction (SRIE) > Text Analysis Conference",
    year    = "2018",
    month   = "nov"
}

License

This package is licensed under the GNU General Public License.

Authors

Current contributors: Steele Farnsworth, Anna Conte, Gabby Gurdin, Aidan Kierans, Aidan Myers, and Bridget T. McInnes

Former contributors: Andriy Mulyar, Jorge Vargas, Corey Sutphin, and Bobby Best

Acknowledgments


medacy's Issues

MetaMap mapped_terms_to_spacy_ann Bug

Description

Working on the development branch, I attempted to run MetaMap on a file and generate a spaCy annotation file, and got the error listed below.

Traceback (most recent call last):
  File "medtest.py", line 5, in <module>
    me_ann = medy.mapped_terms_to_spacy_ann(me_dict)
  File "/home/jeff/.local/lib/python3.6/site-packages/medacy/pipeline_components/metamap/metamap.py", line 196, in mapped_terms_to_spacy_ann
    for span in self.get_span_by_term(term): #if a single entity corresonds to a disjunct span
  File "/home/jeff/.local/lib/python3.6/site-packages/medacy/pipeline_components/metamap/metamap.py", line 255, in get_span_by_term
    if int(term['ConceptPIs']['@Count']) == 1:
TypeError: string indices must be integers

Steps/Code to Reproduce

python3 medtest.py

Expected Results

Annotation output.

Actual Results

See error above

Versions

Linux-4.15.0-33-generic-x86_64-with-LinuxMint-19-tara
Python 3.6.5 (default, Apr 1 2018, 05:46:30)
[GCC 7.3.0]
NumPy 1.15.4
SciPy 1.2.0
medacy 0.0.6

Create flow diagrams and figures describing how medaCy works.

[FEATURE REQUEST] Expand the scope of medaCy past NER

What problem does your feature solve?
medaCy is capable of supporting much more than NER - the current codebase would not take much refactoring to set up medaCy to run pipelines for other tasks such as relationship extraction.

Describe the solution you'd like
Refactor medaCy to make room for other types of medical text processing systems to be included. Pipeline components, pipelines, and tools could be left where they are - NER, relationship extraction, etc become root directories each containing a sub-directory of model.

Additional context
Looking big picture here.

Include a better interface for prediction.

In the Model class:
The .predict() method should ideally default to accepting a string of text and output structured predictions from that text (a spacy compatible annotation style would be useful). Currently, the .predict() method does not allow for this - it does bulk predictions.

Refactoring this functionality away into a bulk_predict() method would still allow for this.

Add pathos as a dependency in setup.py

_restore_from_ascii() method in metamap.py throws "TypeError: list indices must be integers not strings" when dealing with a converted document.

Description

If a non-ascii character is actually converted, then when the program goes to restore it in the _restore_from_ascii method in metamap.py, it throws a TypeError. The offending line is for mapping in metamap_dict['metamap']['MMOs']['MMO']['Utterances']['Utterance']['Phrases']['Phrase']['Mappings']['Mapping']:.
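One plausible cause, stated here as an assumption rather than a confirmed diagnosis, is that the XML-to-dict conversion yields a single dict when an element occurs once and a list when it repeats, so a fixed chain of string keys fails as soon as any level is a list. The sketch below walks the same key path defensively; as_list and iter_mappings are hypothetical helper names, not part of medaCy.

```python
def as_list(value):
    """Wrap a single item in a list, pass lists through, and map None to []."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def iter_mappings(metamap_dict):
    """Yield every Mapping entry, whether each level is a dict or a list of dicts."""
    mmos = metamap_dict['metamap']['MMOs']
    for mmo in as_list(mmos['MMO']):
        for utterance in as_list(mmo['Utterances']['Utterance']):
            for phrase in as_list(utterance['Phrases']['Phrase']):
                # Assumption: a phrase without mappings may carry None here.
                mappings = phrase.get('Mappings') or {}
                for mapping in as_list(mappings.get('Mapping')):
                    yield mapping
```

Iterating through a generator like this would let the restoration step update character spans without caring how many utterances, phrases, or mappings a document produced.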

Steps/Code to Reproduce

Include a character such as л in your document and attempt to metamap. The error will be thrown when it attempts to restore the document back.

Expected Results

Text has non-ascii characters again and the character spans in the metamap dictionary are updated to reflect the restoration.

Actual Results

Error thrown.

Versions

Linux-4.15.0-43-generic-x86_64-with-Ubuntu-16.04-xenial
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
medacy 0.0.8

metamap.py Line 51, unclosed file opened for reading

On line 51 of metamap.py, the file_to_map is opened and not closed during the method. Possible I/O leak on systems.
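A minimal sketch of the usual fix, assuming the method only needs the file's text: a with block closes the handle even if reading raises. The function name read_text is hypothetical and only illustrates the pattern.

```python
def read_text(path, encoding="utf-8"):
    """Read a whole file and guarantee the handle is closed afterwards."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```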

Allow Metamap to auto-cache output for later retrieval

Implement a persistent storage option for metamap output (see here: https://stackoverflow.com/questions/94153/how-do-i-persist-to-disk-a-temporary-file-using-python).

[FEATURE REQUEST] Option for label ranking in outputs

Description
A CRF produces label probability outputs. Currently, we are simply using the highest probability label as the predicted entity label. It would be useful to allow for an option to output a label ranking for a given token.

Uses
Multi-label prediction, hierarchical label prediction. Capable of handling nested entities.

Where to get started
A contributor would simply have to interface the correct attributes set by the sklearn-crfsuite wrapper into the Annotation class. Some discussion would have to be had to ensure compatibility with other parts of the package.
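As a rough sketch of what a ranking could look like: sklearn-crfsuite's CRF exposes predict_marginals, which returns one label-to-probability dict per token. The helper below ranks a single such dict; rank_labels and the example probabilities are illustrative assumptions, and how the result would be stored on the Annotation class is left open.

```python
def rank_labels(token_marginals, top_k=None):
    """Sort one token's {label: probability} dict from most to least likely."""
    ranked = sorted(token_marginals.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Made-up marginals for a single token, in the shape predict_marginals produces.
example = {"Drug": 0.7, "O": 0.2, "Dosage": 0.1}
top_two = rank_labels(example, top_k=2)
```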

Line 58-62 metamap.py. Uses list comprehension to check if file exists in directory.

Lines 58-62 of metamap.py have a roundabout, list comprehension way of checking for existing files. Can be replaced with if file in os.listdir(self.cache_directory) ....
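An alternative sketch that avoids listing the directory at all, assuming the cache holds plain files addressed by name; is_cached is a hypothetical helper, and cache_directory mirrors the attribute named above.

```python
import os


def is_cached(cache_directory, file_name):
    """Return True if file_name already exists as a regular file in the cache."""
    return os.path.isfile(os.path.join(cache_directory, file_name))
```

Compared with scanning os.listdir, this asks the filesystem about one path only, which matters little for small caches but scales better for large ones.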

Typo in Utilizing Pre-trained NER models tutorial

The command line text to copy/paste for installing the package to run the example needs to be updated to:

pip install git+https://github.com/NLPatVCU/medaCy_model_clinical_notes.git

Make token merging optional during token annotation in each PipelineComponent.

Currently, tokens are merged by default in components such as the MetaMap annotator or the various UnitAnnotators. This is so that annotated groups of tokens are seen as individual blocks by the end classifier. This functionality is often wanted and should remain the default, but the option of turning off this merging should be provided to the end developer of a pipeline. This should be standard for any new components, but refactoring of the MetaMap and individual unit annotation components will be required.

[FEATURE REQUEST] Functionality for analyzing the differences between two Annotation objects.

What problem does your feature solve?
A method to do analysis of annotations (namely for the application of looking at differences between gold and predicted annotations).

Describe the solution you'd like
The Annotation class should be given some static methods, like Annotation.diff(ann_object_1, ann_object_2), that output the difference between two annotation objects. Maybe some parameter for leniency to deal with fuzzy annotation matching.

Interface sklearn to compute various evaluation metrics between two annotation files (assuming one is gold and one is predicted).
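A minimal sketch of such a diff, assuming entities are (label, start, end, text) tuples as in the README output. diff_entities and spans_match are hypothetical names, and the lenient mode here simply treats same-label overlapping spans as a match, which is only one possible notion of fuzziness.

```python
def spans_match(a, b, lenient=False):
    """Compare two (label, start, end, text) tuples by label and character span."""
    if a[0] != b[0]:
        return False
    if lenient:
        return a[1] < b[2] and b[1] < a[2]  # spans overlap
    return (a[1], a[2]) == (b[1], b[2])     # spans are identical


def diff_entities(gold, predicted, lenient=False):
    """Split entities into matched gold, missing gold, and spurious predictions."""
    matched = [g for g in gold if any(spans_match(g, p, lenient) for p in predicted)]
    missing = [g for g in gold if g not in matched]
    spurious = [p for p in predicted
                if not any(spans_match(g, p, lenient) for g in gold)]
    return {"matched": matched, "missing": missing, "spurious": spurious}
```

Counts derived from these three lists (true positives, false negatives, false positives) are what precision, recall, and F1 would be computed from.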

Additional context
This would be very useful for result analysis and guiding the building of pipelines.

Refactor away UnitAnnotator and transition/test individual annotators for each unit type

Each unit type will have its own lexicons to pull from. This is best accomplished by placing each unit type in its own pipeline component. This has largely been started but needs to be completed and tested.

Separate models into new python packages.

What problem does your feature solve?
Models should not be provided with medaCy, rather they should be available for installation and compatible with medaCy.
Describe the solution you'd like
This will work very similarly to #59 .

Implement code for various feature representations

Currently only feature dictionaries exist - a necessity is the implementation of feature vectors. The feature type returned should be an argument to the FeatureExtractor class.

AttributeError: 'NoneType' object has no attribute 'netloc'

Description

Hi, I am trying to install medacy and the medacy_model_clinical_notes model in a Google Colab Jupyter notebook.

Steps/Code to Reproduce

Install medacy - successful
!pip install git+https://github.com/NLPatVCU/medaCy.git
Install medacy_model_clinical_notes - unsuccessful
!pip install git+https://github.com/NLPatVCU/medaCy_model_clinical_notes.git

Collecting git+https://github.com/NLPatVCU/medaCy_model_clinical_notes.git
Cloning https://github.com/NLPatVCU/medaCy_model_clinical_notes.git to /tmp/pip-req-build-iyx8hwuq
Requirement already satisfied: medacy>=0.0.3 in /usr/local/lib/python3.6/dist-packages (from medacy-model-clinical-notes==1.0.1) (0.0.9)
Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/base_command.py", line 179, in main
status = self.run(options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/install.py", line 315, in run
resolver.resolve(requirement_set)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolve.py", line 131, in resolve
self._resolve_one(requirement_set, req)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolve.py", line 357, in _resolve_one
add_req(subreq, extras_requested=available_requested)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolve.py", line 314, in add_req
use_pep517=self.use_pep517
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/req/constructors.py", line 328, in install_req_from_req_string
if req.url and comes_from.link.netloc in domains_not_allowed:
AttributeError: 'NoneType' object has no attribute 'netloc'

Versions

import platform; print(platform.platform())
import sys; print("Python", sys.version)
import numpy; print("NumPy", numpy.__version__)
import scipy; print("SciPy", scipy.__version__)
import medacy; print("medacy", medacy.__version__)
Linux-4.14.79+-x86_64-with-Ubuntu-18.04-bionic
Python 3.6.7 (default, Oct 22 2018, 11:32:17)
[GCC 8.2.0]
NumPy 1.14.6
SciPy 1.1.0
medacy 0.0.9

Could anyone help out with this issue please, thank you!

Logging number of files loaded for training incorrectly stays the same despite different number of files being passed in

Description

When I fit a model initially, it prints out:
DEBUG:root:Loaded 4 files for training
DEBUG:root:Loaded 121 files for training
DEBUG:root:Loaded 138 files for training

where the 138 number is the correct number. When I rerun the script to fit the model I point it to a different directory containing five files and get the exact same log output, although it only trains on the specified five files.

Steps/Code to Reproduce

model.fit(train_loader) #Contains ~138 samples to train on

Then after it finishes...

model.fit(sample_loader) #Contains 5 samples

Expected Results

model.fit(sample_loader) will log that there are five files to train on

Actual Results

DEBUG:root:Loaded 4 files for training
DEBUG:root:Loaded 121 files for training
DEBUG:root:Loaded 138 files for training

Versions

Write a number of useful examples in an /examples sub-directory of root.

Problem with installing medaCy

Description

Hi. I am trying to install medaCy on my system using the instructions given in the README; however, I am getting an error caused by the unavailability of some spaCy models.

Steps/Code to Reproduce

Run:
pip install git+https://github.com/NLPatVCU/medaCy.git

Output



Collecting git+https://github.com/NLPatVCU/medaCy.git
       Cloning https://github.com/NLPatVCU/medaCy.git to /tmp/pip-k5kMwQ-build
Collecting en_core_web_sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 (from medacy==0.1.0)
        Could not find a version that satisfies the requirement en_core_web_sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 (from medacy==0.1.0) (from versions: )
No matching distribution found for en_core_web_sm@ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz#egg=en_core_web_sm-2.0.0 (from medacy==0.1.0)

Versions

Linux-4.4.0-17134-Microsoft-x86_64-with-Ubuntu-18.04-bionic
NumPy 1.16.3
SciPy 1.3.0
medaCy not installed, the issue is about that only

Refactor to implement scikit-learn-like functionality

It may be a good idea to give the model building process a scikit-learn-like feel. That is, a class with a 'fit' and 'predict' method. This would be alongside the existing code for learning and predicting.

Refactor to directly use JSON output from Metamap2016

The newest version of Metamap now supports JSON format - update the Metamap wrapper to directly parse this information. It currently gathers the XML and manually turns it into JSON - this will make the code much cleaner and potentially faster

ImportError: Package not installed: medacy_model_clinical_notes

      3 model = Model.load_external('medacy_model_clinical_notes')
      4 annotation = model.predict("The patient was prescribed 1 capsule of Advil for 5 days.")
      5 print(annotation)

/usr/local/lib/python3.6/dist-packages/medacy/ner/model/model.py in load_external(package_name)
    458         """
    459         if importlib.util.find_spec(package_name) is None:
--> 460             raise ImportError("Package not installed: %s" % package_name)
    461         return importlib.import_module(package_name).load()
    462

ImportError: Package not installed: medacy_model_clinical_notes

Plan and execute the inclusion of models into medaCy

Plan out how ready-made models will be included into medaCy

Half of the document is considered for a specific span

Description

When generating the MedaCy ground truth files from 5-fold cross validation, in some files a very large span is considered as the span for some Dose instances. This behavior is not observed when the file is the only one trained on. The following are examples of two such files in the TAC (2008) data set.
File : PMC1257590.ann
Dose 929 932 400

PMC4847079.ann
Dose 1394 1398 25.6
Dose 1409 1413 30.7

Steps/Code to Reproduce

Run 5 fold cross validation using the systematic_review_pipeline on the following files individually and with more than two files and compare the MedaCy ground truth files for the mentioned Dose instances above

PMC1257590.ann
PMC4847079.ann

Expected Results

File : PMC1257590.ann
Ground truth:
Dose 929 932 400
MedaCy truth:
Dose 929 932 400
Predictions:
Dose 929 932 400

File : PMC4847079.ann
Ground truth:
Dose 1394 1398 25.6
Dose 1409 1413 30.7
MedaCy truth:
Dose 1394 1398 25.6
Dose 1409 1413 30.7
Predictions:
Dose 1394 1398 25.6
Dose 1409 1413 30.7

Actual Results

File : PMC1257590.ann
Ground truth:
Dose 929 932 400
MedaCy truth:(when run individually)
Dose 929 932 400
MedaCy truth:(when run with more than 2 files)
Dose 929 10478 400 μg/kg) of BPA (> 99% purity; Sigma-Aldrich.......................
Predictions:
Dose 929 932 400

File : PMC4847079.ann
Ground truth:
Dose 1394 1398 25.6
Dose 1409 1413 30.7
MedaCy truth:(when run individually)
Dose 1394 1397 25.
Dose 1397 1398 6
Dose 1409 1412 30.
Dose 1412 1413 7
MedaCy truth:(when run with more than 2 files)
Dose 1394 1397 25.
Dose 1397 3460 6 mg/m3 and 30.7 mg/m3, respectively. Concentration measurements were taken using a portable DataRAM......
Dose 1409 3461 30.7 mg/m3, respectively. Concentration measurements were taken using a portable DataRAM.....................
Predictions:
None

Versions

Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Python 3.4.5
NumPy 1.15.4
SciPy 1.2.0

Unit Tests have metamap path hardcoded

All metamap-related unit tests have the metamap path hardcoded. Figure out how to pass it as a parameter to the unit testing suite.

Create a docker container with medaCy set up for easy first use

Deal with replacing non-ascii characters as a pre-processing step.

Metamap cannot process non-ASCII characters (not between 0, FF). A pre-processing step must be integrated as an option to this process.

Common examples of this issue arising are when degree symbols (°) appear or a mu appears.
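One way such a step could work, sketched under the assumption that a one-for-one placeholder substitution is acceptable: every replaced character keeps its position, so character spans computed on the converted text stay valid for the original. The names to_ascii and restore and the "?" placeholder are illustrative choices, not medaCy's implementation.

```python
def to_ascii(text, placeholder="?"):
    """Replace each non-ASCII character with a one-character placeholder.

    Returns the converted text and a {index: original_character} map.
    """
    replaced = {}
    chars = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            replaced[index] = char
            chars.append(placeholder)
        else:
            chars.append(char)
    return "".join(chars), replaced


def restore(text, replaced):
    """Put the original characters back at their recorded positions."""
    chars = list(text)
    for index, char in replaced.items():
        chars[index] = char
    return "".join(chars)
```

Because the text length never changes, no span arithmetic is needed on restoration. A substitution that maps one character to several (for example mu to "mu") would be friendlier to MetaMap but would require shifting every later offset.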

[FEATURE REQUEST] Make Model class pickle-able

What problem does your feature solve?
Define a way to serialize a Model object such that the pipeline and inner machine learning model are connected in a single binary file.
Model objects can then be serialized and compressed for remote installation.

Describe the solution you'd like
Serialized models could be hosted anywhere and installed from anywhere.
model = medacy.load(serialized_model) # this would load an instance of the Model class
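A minimal sketch of the serialize-and-compress idea using only the standard library, assuming the Model object (pipeline plus learned weights) can be pickled at all, which is the open question of this request. save_model and load_model are hypothetical names and do not correspond to an existing medacy.load function.

```python
import gzip
import pickle


def save_model(model, path):
    """Pickle an object and gzip-compress it into a single binary file."""
    with gzip.open(path, "wb") as handle:
        pickle.dump(model, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_model(path):
    """Load an object written by save_model. Only unpickle files you trust."""
    with gzip.open(path, "rb") as handle:
        return pickle.load(handle)
```

Since unpickling can execute arbitrary code, models hosted remotely would need to come from a trusted source or be verified before loading.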

ImportError: Package not installed: medacy_model_clinical_notes

---------------------------------------------------------------------------

ImportError Traceback (most recent call last)
in
1 from medacy.ner.model import Model
2
----> 3 model = Model.load_external('medacy_model_clinical_notes')
4 annotation = model.predict("The patient was prescribed 1 capsule of Advil for 5 days.")
5 print(annotation)

~/anaconda3/lib/python3.7/site-packages/medacy/ner/model/model.py in load_external(package_name)
458 """
459 if importlib.util.find_spec(package_name) is None:
--> 460 raise ImportError("Package not installed: %s" % package_name)
461 return importlib.import_module(package_name).load()
462

ImportError: Package not installed: medacy_model_clinical_notes

Multi-thread feature extraction on documents during train

Look into multi-threading feature extraction on documents during training. The bottleneck is aggregating all extracted features into a single array - this will have to be addressed.

[FEATURE REQUEST] Use medaCy with spaCy pipeline

This is more of a question/clarification about existing functionality. I would like to use medaCy in the way that one would typically use spaCy, in terms of pipeline components. That is, create a doc and use the doc attributes (ents, annotations, etc). Is there a way to load something like a clinical note and use it like you would in a spaCy pipeline? Is there a way to extend the spaCy pipeline with medaCy models so that annotations can be visualized with displaCy, or some approximation of these things? I read the docs and looked at the code base, but it wasn't clear to me whether this is currently possible. Any help and/or clarification would be appreciated. I'm currently trying out medaCy to extract drug dosage information from clinical notes. The NER is doing a great job of extraction; being able to use this model in the way one would use models in spaCy would be very helpful for our proof of concept stage.
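As an approximation of the visualization part, displaCy can render entities from a plain dict when called with manual=True, so predictions do not have to live on a spaCy Doc. The sketch below assumes predictions arrive as (label, start, end, text) tuples like the README output; to_displacy is a hypothetical helper, not a medaCy function.

```python
def to_displacy(text, entities, title=None):
    """Convert (label, start, end, text) tuples into displaCy's manual entity format."""
    ents = sorted(
        ({"start": start, "end": end, "label": label}
         for label, start, end, _ in entities),
        key=lambda ent: ent["start"],
    )
    return {"text": text, "ents": ents, "title": title}


note = "The patient was prescribed 1 capsule of Advil for 5 days."
predictions = [
    ("Drug", 40, 45, "Advil"),
    ("Dosage", 27, 28, "1"),
    ("Form", 29, 36, "capsule"),
    ("Duration", 46, 56, "for 5 days"),
]
data = to_displacy(note, predictions)
# With spaCy installed: spacy.displacy.render(data, style="ent", manual=True)
```

Entities are sorted by start offset because displaCy walks the text left to right when rendering manual input.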

Predictions directory not created automatically when running cross validation

Description

When running the cross validation model, the prediction directory is used to write predictions during bulk prediction. If the prediction directory is not created manually inside the data set folder it throws the following error:
'FileNotFoundError: [Errno 2] No such file or directory: '/home/mahendrand/VE/SMM4H/data_smmh4h/task2/training/dataset_1/predictions/326575463835250689.ann''

Steps/Code to Reproduce

if you do not create a prediction directory manually inside the data set folder and run the following model:
model.cross_validate(num_folds = 5, dataset = training_dataset, write_predictions=True)
The error will be thrown when it attempts to do the bulk prediction

Expected Results

When the path to the dataset is passed as the parameter, the prediction directory is expected to be created automatically inside the data set folder
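A minimal sketch of the expected behavior, assuming predictions are written as text files under a predictions folder: os.makedirs with exist_ok=True creates the directory when missing and is a no-op otherwise. write_prediction is a hypothetical helper name.

```python
import os


def write_prediction(predictions_dir, file_name, content):
    """Create the predictions directory if needed, then write one prediction file."""
    os.makedirs(predictions_dir, exist_ok=True)
    path = os.path.join(predictions_dir, file_name)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(content)
    return path
```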

Actual Results

error is thrown as above

Versions

Linux-3.10.0-693.11.6.el7.x86_64-x86_64-with-centos-7.4.1708-Core
Python 3.4.5
NumPy 1.15.4
SciPy 1.2.0
medacy 0.0.9

[FEATURE REQUEST] Add an option to use word embeddings during training

What problem does your feature solve?
Includes more features to use during training.

Describe the solution you'd like
The VCU NLP lab has a collection of word embeddings (specifically glove vectors) trained over various medical corpora. Incorporate this into medaCy's FeatureExtractor.

Additional context
Please get in contact with me if you would like to pursue this feature - I have the embeddings packaged up for use with spaCy.

Update README with example NER and results for n2c2.

Separate data into new python packages.

What problem does your feature solve?
Datasets should be separated into unique python packages. This will allow users to install, build, and work with data in an efficient manner while allowing complete control over who has access to that data.

medaCy can be broken into three parts: loading data, loading models, manipulating models and data (that is, training, prediction, meta-information extraction, etc). medaCy currently does all three in one package - but storing data and models takes a lot of memory. The architecture for the package should transition towards the direction of leaving only code that orchestrates the interaction between data and models in this repository while allowing actual data and models to be interfaced from separate compatible python packages. A nomenclature like medacy_model_***** and medacy_dataset_**** for external package naming makes sense although outside developers could clearly name their packages freely.
Datasets that are included for benchmarking/testing purposes can be made into open repositories, while anyone could also create a private version of a dataset by following directions on how to make their private package interface with medaCy.

Describe the solution you'd like

DataLoader should be refactored to something like Dataset (this nomenclature makes more sense if one considers what the current DataLoader class does).
Each python package corresponding to a dataset can be installed/removed freely and has medaCy as an installation requirement.
Each medacy compatible dataset package should return an instantiated, ready-to-work-with version of the Dataset class. This class should include all functionality currently present in DataLoader alongside also sidestepping the init method from a machine directory to load data present in a medacy compatible data package.
Detailed instructions should be provided to write a custom interfacing data package (maybe provide a boilerplate template repository).
Ideally, the internals of Dataset will store a generator containing raw text files and annotation files alongside meta-data about the dataset (entities and types of entity relationships). These can then be looped through at will by the current code for processing documents - steps should be taken to ensure this process is as memory efficient as possible, as big corpora could be used.

Integrate MetamapLite as a component

MetaMapLite looks promising - currently MetaMap is very bulky. This would be a fascinating direction to explore and would make the package much more robust.

When performing cross validation, log predictions on the test data done at each fold.

Add time stamps/tracking to logs to see how much time each component is taking.

Add result -> latex table conversion

Add automatic formatting of LaTeX tables with printed results.

Line 49 of metamap.py, recent_file should be added to the init docstring

On line 49 of metamap.py, recent_file should be added to the __init__ docstring so that a class variable is not being defined for the first time in a member function without us realising.

Write converters for various annotation formats.

Currently, only ANN formatted files are supported. Converters should be written!

Allow metamap.py to also cache results when metamapping just a string of text

Line 80 in metamap.py

    def map_text(self, text, max_prune_depth=10):
        #TODO add caching here as in map_file
        metamap_dict = self._run_metamap('--XMLf --blanklines 0 --silent --prune %i' % max_prune_depth, text)
        return metamap_dict
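One possible shape for the missing cache, sketched under the assumption that the MetaMap output is JSON-serializable: results are stored on disk under a hash of the input text plus the options that affect the output (here a variant string, which could carry max_prune_depth). cached_call and its parameters are hypothetical and only illustrate the pattern; compute stands in for the call to _run_metamap.

```python
import hashlib
import json
import os


def cached_call(text, cache_directory, compute, variant=""):
    """Return compute(text), reusing a JSON file on disk when one already exists."""
    key = hashlib.sha256((variant + "\n" + text).encode("utf-8")).hexdigest()
    path = os.path.join(cache_directory, key + ".json")
    if os.path.isfile(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    result = compute(text)
    os.makedirs(cache_directory, exist_ok=True)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(result, handle)
    return result
```

Hashing the text avoids using arbitrary strings as file names, and including the options in the key prevents a result computed with one prune depth from being returned for another.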

installation issues on windows

Description

issues when trying to install medacy on windows.

Steps/Code to Reproduce

on cmd i am running:
pip install git+https://github.com/NLPatVCU/medaCy.git

Expected Results

medacy installed

Actual Results

Building wheel for ujson (setup.py) ... error
ERROR: Complete output from command 'c:\python\python37\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\om\AppData\Local\Temp\pip-install-5k7yz4al\ujson\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\om\AppData\Local\Temp\pip-wheel-572h8xdt' --python-tag cp37:
ERROR: Warning: 'classifiers' should be a list, got type 'filter'
running bdist_wheel
running build
running build_ext
building 'ujson' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

ERROR: Failed building wheel for ujson

Versions

Windows-10-10.0.17763-SP0
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)]
NumPy 1.16.2
ModuleNotFoundError: No module named 'scipy'
ModuleNotFoundError: No module named 'medacy'

Bad input array shape

Description

I'm getting an error: could not broadcast input array from shape (96) into shape (128).

Steps/Code to Reproduce

import medacy_model_clinical_notes
from medacy.model import Model
model = Model.load_external('medacy_model_clinical_notes')
annotations = model.predict("The patient took 5 mg of aspirin.")

Expected Results

{
    'entities': {
        'T3': ('Drug', 40, 45, 'Advil'),
        'T1': ('Dosage', 27, 28, '1'),
        'T2': ('Form', 29, 36, 'capsule'),
        'T4': ('Duration', 46, 56, 'for 5 days')
    },
    'relations': []
}

Actual Results

ValueError Traceback (most recent call last)
in
6 #model = Model(pipeline)
7 f = open('/usr/local/lib/python3.6/dist-packages/medacy_model_clinical_notes/model/n2c2_2018_no_metamap_2018_12_22_16.49.17.pkl','rb')
----> 8 model1 = medacy_model_clinical_notes.load()
9 #model = Model.load_external('medacy_model_clinical_notes')
10 f.close()

/usr/local/lib/python3.6/dist-packages/medacy_model_clinical_notes/medacy_model_clinical_notes.py in load()
6 def load():
7 entities = ['Drug', 'Form', 'Route', 'ADE', 'Reason', 'Frequency', 'Duration', 'Dosage', 'Strength']
----> 8 pipeline = ClinicalPipeline(entities=entities)
9 model = Model(pipeline, n_jobs=1)
10 model_directory = resource_filename('medacy_model_clinical_notes', 'model')

/usr/local/lib/python3.6/dist-packages/medacy/pipelines/clinical_pipeline.py in __init__(self, metamap, entities)
22 description="""Pipeline tuned for the extraction of ADE related entities from the 2018 N2C2 Shared Task"""
23 super().__init__("clinical_pipeline",
---> 24 spacy_pipeline=spacy.load("en_core_web_sm"),
25 description=description,
26 creators="Andriy Mulyar (andriymulyar.com)", #append if multiple creators

/usr/local/lib/python3.6/dist-packages/spacy/__init__.py in load(name, **overrides)
16 if depr_path not in (True, False, None):
17 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 18 return util.load_model(name, **overrides)
19
20

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model(name, **overrides)
112 return load_model_from_link(name, **overrides)
113 if is_package(name): # installed as package
--> 114 return load_model_from_package(name, **overrides)
115 if Path(name).exists(): # path to model data directory
116 return load_model_from_path(Path(name), **overrides)

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_package(name, **overrides)
133 """Load a model from an installed package."""
134 cls = importlib.import_module(name)
--> 135 return cls.load(**overrides)
136
137

/usr/local/lib/python3.6/dist-packages/en_core_web_sm/__init__.py in load(**overrides)
10
11 def load(**overrides):
---> 12 return load_model_from_init_py(__file__, **overrides)

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_init_py(init_file, **overrides)
171 if not model_path.exists():
172 raise IOError(Errors.E052.format(path=path2str(data_path)))
--> 173 return load_model_from_path(data_path, meta, **overrides)
174
175

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model_from_path(model_path, meta, **overrides)
154 component = nlp.create_pipe(name, config=config)
155 nlp.add_pipe(component, name=name)
--> 156 return nlp.from_disk(model_path)
157
158

/usr/local/lib/python3.6/dist-packages/spacy/language.py in from_disk(self, path, disable)
645 if not (path / 'vocab').exists():
646 exclude['vocab'] = True
--> 647 util.from_disk(path, deserializers, exclude)
648 self._path = path
649 return self

/usr/local/lib/python3.6/dist-packages/spacy/util.py in from_disk(path, readers, exclude)
509 for key, reader in readers.items():
510 if key not in exclude:
--> 511 reader(path / key)
512 return path
513

/usr/local/lib/python3.6/dist-packages/spacy/language.py in <lambda>(p, proc)
641 if not hasattr(proc, 'to_disk'):
642 continue
--> 643 deserializers[name] = lambda p, proc=proc: proc.from_disk(p, vocab=False)
644 exclude = {p: False for p in disable}
645 if not (path / 'vocab').exists():

pipeline.pyx in spacy.pipeline.Tagger.from_disk()

pipeline.pyx in spacy.pipeline.Tagger.from_disk.load_model()

/usr/local/lib/python3.6/dist-packages/thinc/neural/_classes/model.py in from_bytes(self, bytes_data)
350 name = name.decode('utf8')
351 dest = getattr(layer, name)
--> 352 copy_array(dest, param[b'value'])
353 i += 1
354 if hasattr(layer, '_layers'):

/usr/local/lib/python3.6/dist-packages/thinc/neural/util.py in copy_array(dst, src, casting, where)
46 def copy_array(dst, src, casting='same_kind', where=None):
47 if isinstance(dst, numpy.ndarray) and isinstance(src, numpy.ndarray):
---> 48 dst[:] = src
49 elif isinstance(dst, cupy.ndarray):
50 src = cupy.array(src, copy=False)

ValueError: could not broadcast input array from shape (96) into shape (128)

Versions

[FEATURE REQUEST] Add Language model

What problem does your feature solve?
Can you add a language model for medical notes to predict the next word?

Describe the solution you'd like
Adding a transformer model, or an implementation of the paper
Learning to Write Notes in Electronic Health Records

NotImplementedError: object proxy must define __reduce_ex__()

Description

Installed medaCy and the model (medacy_model_clinical_notes) on a Mac using the GitHub instructions. When running the GitHub example ("The Power of medaCy") using Anaconda, I get the following error:

NotImplementedError: object proxy must define __reduce_ex__()

This is thrown in pickle.py. See the attached console trace for details:

medaCy.console.trace.txt

Steps/Code to Reproduce

from medacy.ner.model import Model

model = Model.load_external('medacy_model_clinical_notes')
annotation = model.predict("The patient was prescribed 1 capsule of Advil for 5 days.")
print(annotation)

Expected Results

{
    'entities': {
        'T3': ('Drug', 40, 45, 'Advil'),
        'T1': ('Dosage', 27, 28, '1'),
        'T2': ('Form', 29, 36, 'capsule'),
        'T4': ('Duration', 46, 56, 'for 5 days')
    },
    'relations': []
}

Actual Results

The above mentioned error.

Versions

Darwin-18.7.0-x86_64-i386-64bit
Python 3.7.2 (default, Dec 29 2018, 00:00:04)
[Clang 4.0.1 (tags/RELEASE_401/final)]
NumPy 1.15.4
SciPy 1.1.0
medacy 0.1.1
