aphp / eds-scikit
eds-scikit is a Python library providing tools to process and analyse OMOP data
Home Page: https://aphp.github.io/eds-scikit
License: BSD 3-Clause "New" or "Revised" License
Currently, adding concept sets is quite hacky:
from eds_scikit.biology import ConceptsSet
protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)
It would be handier to have the following:
protein = protein_blood + protein_urine
The concept_name could be generic, like "addition_1", since it doesn't seem to be used except in the bioclean table.
We would need to add to ConceptsSet:
__add__
__sub__
__eq__
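As a rough illustration of the three operators requested above, here is a minimal sketch on a standalone toy class. ToyConceptsSet, the "subtraction_N" naming, and the concept codes are all hypothetical and are not part of eds-scikit; the real ConceptsSet may store its codes differently.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

# Shared counter used to build generic names such as "addition_1".
_counter = count(1)


@dataclass(eq=False)
class ToyConceptsSet:
    """Hypothetical stand-in for ConceptsSet: a name plus a list of codes."""

    name: str
    concept_codes: List[str] = field(default_factory=list)

    def __add__(self, other):
        if not isinstance(other, ToyConceptsSet):
            return NotImplemented
        # Union of both code lists, keeping first-seen order, no duplicates.
        merged = list(dict.fromkeys(self.concept_codes + other.concept_codes))
        return ToyConceptsSet(f"addition_{next(_counter)}", merged)

    def __sub__(self, other):
        if not isinstance(other, ToyConceptsSet):
            return NotImplemented
        # Keep only the codes that are absent from the right-hand operand.
        removed = set(other.concept_codes)
        kept = [code for code in self.concept_codes if code not in removed]
        return ToyConceptsSet(f"subtraction_{next(_counter)}", kept)

    def __eq__(self, other):
        if not isinstance(other, ToyConceptsSet):
            return NotImplemented
        # Equality compares the codes only and ignores the names.
        return set(self.concept_codes) == set(other.concept_codes)


# Placeholder codes, made up for the example.
protein_blood = ToyConceptsSet("Protein_Blood_Quantitative", ["A0001", "A0002"])
protein_urine = ToyConceptsSet("Protein_Urine_Quantitative", ["B0001", "A0002"])
protein = protein_blood + protein_urine
```

Ignoring the name in __eq__ is a design choice of this sketch: it makes a set built by addition compare equal to one built by hand from the same codes.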
Functionality to plot patient trajectory based on various events.
More generally, it can handle any sequence of events.
For the moment, the input data will look like:
For the function parameters:
Can be launched with no additional parameters
event_mapping:
figure parameters:
When running the code from the "A gentle demo" section of the documentation with version 0.1.6, some commands return errors (probably caused by small syntax changes).
diabetes.concept.value_counts()
The discrepancy was solved in my case by replacing the concept column with the value column.
event_from_code function
codes = dict(
    COVID=dict(
        code_list=r"U071[0145]",
        code_type="regex",
    )
)
covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)
Changing the dictionary in the following way solved the issue in my case:
codes = dict(
    COVID=dict(
        regex=r"U071[0145]",
    )
)
TypeError: One of the provided Serie isn't a datetime Serie
A solution in my case was to convert birth_datetime to datetime format using the following command:
visit_detail_covid["birth_datetime"].apply(lambda x:pd.to_datetime(x))
I guess the issue might be coming from the i2b2 connector
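The workaround above can be made a little more defensive by checking the dtype before converting. The sketch below uses plain pandas on made-up rows; ensure_datetime is a hypothetical helper, and the subtraction is done directly with pandas instead of datetime_helpers.substract_datetime. The tables returned by HiveData may not be plain pandas DataFrames, so this only shows the idea.

```python
import pandas as pd

# Made-up rows: birth_datetime arrives as strings (object dtype), which is
# the kind of input that triggers the "isn't a datetime Serie" error.
visits = pd.DataFrame(
    {
        "visit_detail_start_datetime": pd.to_datetime(["2019-01-01", "2019-03-15"]),
        "birth_datetime": ["1959-01-01", "1989-03-15"],
    }
)


def ensure_datetime(df, columns):
    """Return a copy of df where every listed column has a datetime dtype."""
    out = df.copy()
    for col in columns:
        if not pd.api.types.is_datetime64_any_dtype(out[col]):
            out[col] = pd.to_datetime(out[col])
    return out


visits = ensure_datetime(visits, ["visit_detail_start_datetime", "birth_datetime"])

# Age in years, computed from a difference expressed in hours.
age_hours = (
    visits["visit_detail_start_datetime"] - visits["birth_datetime"]
) / pd.Timedelta(hours=1)
visits["age"] = age_hours / (24 * 365.25)
```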
Code to load an i2b2 database (common to the 3 bugs):
import eds_scikit
import datetime
from eds_scikit.io import HiveData
database_name = "cse_**"
data = HiveData(
    database_name=database_name,
    database_type="I2B2",
)
DATE_MIN = datetime.datetime(2018, 1, 1)
DATE_MAX = datetime.datetime(2019, 6, 1)
Minimal code for bug 1:
from eds_scikit.event.diabetes import diabetes_from_icd10
diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)
diabetes.concept.value_counts()
Minimal code for bug 2:
from eds_scikit.event import conditions_from_icd10
codes = dict(
    COVID=dict(
        code_list=r"U071[0145]",
        code_type="regex",
    )
)
covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)
Minimal code for bug 3:
from eds_scikit.event import conditions_from_icd10
from eds_scikit.utils import datetime_helpers
codes = dict(
    COVID=dict(
        regex=r"U071[0145]",
    )
)
covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)
visit_detail_covid = data.visit_detail.merge(
    covid[["visit_occurrence_id"]],
    on="visit_occurrence_id",
    how="inner",
)
visit_detail_covid = visit_detail_covid.merge(
    data.person[["person_id", "birth_datetime"]],
    on="person_id",
    how="inner",
)
visit_detail_covid["age"] = (
    datetime_helpers.substract_datetime(
        visit_detail_covid["visit_detail_start_datetime"],
        visit_detail_covid["birth_datetime"],
        out="hours",
    )
    / (24 * 365.25)
)
New pipeline
I would be very interested in porting the biology report introduced in plot_concepts_set (https://github.com/aphp/eds-scikit/blob/main/eds_scikit/biology/viz/plot.py) to other types of concepts, such as billing codes (CIM-10, CCAM).
It would be great when the goal is to explore and describe a cohort based on diagnoses, for example.
I'll try to find time for a first PR.
utility
We would like to avoid loading all the data in memory when persisting HiveData. This is especially useful when processing a big cohort, where the current code will exhaust memory.
Several proposals/directions for improvement:
spark.write.mode("overwrite").parquet("file://"+path)
It would be the best solution in my opinion. However, the local Spark currently does not have the right to write to local files (tested on cse210038 and the cse diabetes of Theo Jolivet).
Modification of HiveData to accept user-defined dicts of columns / tables to load, because some OMOP extractions can be quite large and users might not want to load everything.
I think the entry point to the data is very important: people will always start by loading data before anything else, so it should be a first-class feature.
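To make the request concrete, here is a minimal sketch of what a user-defined dict of tables and columns could look like. It works on plain pandas DataFrames; select_tables and the shape of columns_to_load are hypothetical and do not describe the actual HiveData interface.

```python
from typing import Dict, List, Optional

import pandas as pd


def select_tables(
    tables: Dict[str, pd.DataFrame],
    columns_to_load: Dict[str, Optional[List[str]]],
) -> Dict[str, pd.DataFrame]:
    """Keep only the requested tables and, for each one, the requested columns.

    A value of None means "keep every column of that table".
    """
    selected = {}
    for table_name, columns in columns_to_load.items():
        if table_name not in tables:
            raise KeyError(f"Unknown table: {table_name}")
        df = tables[table_name]
        if columns is None:
            selected[table_name] = df
            continue
        missing = [col for col in columns if col not in df.columns]
        if missing:
            raise KeyError(f"Columns {missing} not found in table {table_name}")
        selected[table_name] = df[columns]
    return selected


# Made-up extraction with two small tables.
tables = {
    "person": pd.DataFrame(
        {"person_id": [1, 2], "birth_datetime": ["1959-01-01", "1989-03-15"], "gender": ["f", "m"]}
    ),
    "visit_occurrence": pd.DataFrame({"visit_occurrence_id": [10], "person_id": [1]}),
}
subset = select_tables(
    tables,
    {"person": ["person_id", "birth_datetime"], "visit_occurrence": None},
)
```

Failing early on an unknown table or column keeps typos in the user-defined dict from silently producing an incomplete dataset.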
I cannot load the package because of the import of the eds_scikit.biology module in the __init__.py file:
I am trying to use the package inside a JupyterLab of the scikit-eds project.
input:
import eds_scikit
output:
import eds_scikit.biology # noqa: F401 --> To register functions
Problems when using eds_scikit on Windows. We might want to add testing on Windows in the CI.
For instance, time.tzset() (used in __init__.py) does not exist on Windows.
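One possible guard is to call time.tzset() only when the platform provides it, since the function is only available on Unix. The sketch below assumes the initialization sets the TZ environment variable before calling tzset; set_timezone is a hypothetical helper, not the library's code.

```python
import os
import time


def set_timezone(tz_name: str) -> bool:
    """Set the TZ variable and apply it when the platform supports tzset.

    Returns True if time.tzset() was called, False otherwise (e.g. on Windows).
    """
    os.environ["TZ"] = tz_name
    if hasattr(time, "tzset"):  # time.tzset is only defined on Unix
        time.tzset()
        return True
    return False


applied = set_timezone("UTC")
```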
A new method for Phenotype Class:
def add_inclusion_criteria(self, inclusion_criteria: pd.DataFrame, level="patient", threshold=1):
In order to improve the Phenotype class and make it more universal, I would very much like to be able to add inclusion criteria other than purely code-based ones.
Examples of such non-code features I would need to use this class for my study:
The feature would filter the data based on the provided inclusion criteria.
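The filtering described above could look like the following sketch, written as a standalone function with the same parameters as the proposed method. The semantics are assumptions: level="patient" is mapped to person_id, level="visit" to visit_occurrence_id, and threshold is read as the minimum number of matching rows in inclusion_criteria.

```python
import pandas as pd

# Assumed mapping between the level argument and the key column.
LEVEL_TO_KEY = {"patient": "person_id", "visit": "visit_occurrence_id"}


def apply_inclusion_criteria(data, inclusion_criteria, level="patient", threshold=1):
    """Keep the rows of data whose key appears at least `threshold` times
    in inclusion_criteria (hypothetical semantics)."""
    key = LEVEL_TO_KEY[level]
    counts = inclusion_criteria.groupby(key).size()
    kept_keys = counts[counts >= threshold].index
    return data[data[key].isin(kept_keys)]


# Made-up cohort and non-code criterion (e.g. rows from a measurement table).
cohort = pd.DataFrame({"person_id": [1, 2, 3], "phenotype": [True, True, False]})
criteria = pd.DataFrame({"person_id": [1, 1, 3]})

at_least_once = apply_inclusion_criteria(cohort, criteria)
at_least_twice = apply_inclusion_criteria(cohort, criteria, threshold=2)
```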