aphp / eds-scikit

eds-scikit is a Python library providing tools to process and analyse OMOP data

Home Page: https://aphp.github.io/eds-scikit

License: BSD 3-Clause "New" or "Revised" License

Python 98.77% HTML 0.03% Shell 0.05% TeX 1.15%

clinical-data-warehouse data-science medical omop python

eds-scikit's People

straymat vincent-maladiere jcharline theooj paul-bssr

eds-scikit's Issues

Feature request: Add more flexibility to `ConceptsSet`

Description

Currently, adding concept sets is quite hacky:

from eds_scikit.biology import ConceptsSet

protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)

It would be handier to have the following:

protein = protein_blood + protein_urine

The concept_name could be generic like "addition_1" since it doesn't seem to be used except in the bioclean table.

We would need to add the following methods to ConceptsSet:

__add__
__sub__
__eq__
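
A minimal sketch of how these three operators could behave. It uses a stand-in class, not the real ConceptsSet: the name ConceptsSetSketch, the generated names, the placeholder codes "A1", "A2", "B1" and the order-insensitive equality are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # eq=False: equality is defined by hand below
class ConceptsSetSketch:
    """Hypothetical stand-in for ConceptsSet, for illustration only."""

    name: str
    concept_codes: List[str] = field(default_factory=list)

    def __add__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Union of both code lists, keeping first-seen order, no duplicates.
        merged = list(dict.fromkeys(self.concept_codes + other.concept_codes))
        return ConceptsSetSketch(f"{self.name}+{other.name}", merged)

    def __sub__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Drop every code that also appears in the right-hand operand.
        removed = set(other.concept_codes)
        kept = [code for code in self.concept_codes if code not in removed]
        return ConceptsSetSketch(f"{self.name}-{other.name}", kept)

    def __eq__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Assumption: two sets are equal when they hold the same codes,
        # whatever their names or the order of the codes.
        return set(self.concept_codes) == set(other.concept_codes)


# Placeholder codes, not real biology concept codes.
protein_blood = ConceptsSetSketch("Protein_Blood_Quantitative", ["A1", "A2"])
protein_urine = ConceptsSetSketch("Protein_Urine_Quantitative", ["A2", "B1"])
protein = protein_blood + protein_urine
```

Deriving the name from the operands is only one option; a generic name such as "addition_1" would work the same way.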

Errors when running `introduction.ipynb`

When running the code from the "A gentle demo" section of the documentation with version 0.1.6, some commands raise errors (probably caused by small syntax changes).

Description

In section "Extracting diabetes status", the following command does not output the same result as in the documentation:

diabetes.concept.value_counts()

In my case, the discrepancy was resolved by using the value column instead of the concept column.

In section "Extracting covid status", the code cell below raises a KeyError: 'code_list', arising from line 81 of the event_from_code function:

codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

Changing the dictionary as follows solved the issue in my case:

codes = dict(
    COVID=dict(
        regex=r"U071[0145]", 
    )
)

In section "Adding patient age", the following error is raised when trying to compute the patient age:

TypeError: One of the provided Serie isn't a datetime Serie

A solution in my case was to convert birth_datetime to datetime format using the following command:

visit_detail_covid["birth_datetime"].apply(lambda x:pd.to_datetime(x))

I guess the issue might be coming from the i2b2 connector
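
The workaround for the third error can be sketched as follows. This is an illustration on a plain pandas DataFrame with invented values, not on the Koalas tables returned by HiveData, and it uses ordinary pandas subtraction in place of datetime_helpers.substract_datetime.

```python
import pandas as pd

# Invented example rows; birth_datetime is stored as strings, which is
# the kind of input that triggers the "isn't a datetime" TypeError.
visit_detail_covid = pd.DataFrame(
    {
        "visit_detail_start_datetime": pd.to_datetime(
            ["2019-01-01", "2018-06-15"]
        ),
        "birth_datetime": ["1989-01-01", "2000-06-15"],
    }
)

# Convert the column and assign the result back to the DataFrame.
visit_detail_covid["birth_datetime"] = pd.to_datetime(
    visit_detail_covid["birth_datetime"]
)

# Age in years: elapsed hours divided by the number of hours in a year.
elapsed = (
    visit_detail_covid["visit_detail_start_datetime"]
    - visit_detail_covid["birth_datetime"]
)
elapsed_hours = elapsed.dt.total_seconds() / 3600
visit_detail_covid["age"] = elapsed_hours / (24 * 365.25)
```

Note that the converted column has to be assigned back; calling the conversion without keeping its result leaves the original column unchanged.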

How to reproduce the bug

Code to load an i2b2 database (common to the 3 bugs):

import eds_scikit
import datetime
from eds_scikit.io import HiveData

database_name = "cse_**" 

data = HiveData(
    database_name=database_name,
    database_type="I2B2"
)


DATE_MIN = datetime.datetime(2018, 1, 1)
DATE_MAX = datetime.datetime(2019, 6, 1)

Minimal code for bug 1:

from eds_scikit.event.diabetes import diabetes_from_icd10

diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

diabetes.concept.value_counts()

Minimal code for bug 2:

from eds_scikit.event import conditions_from_icd10

codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

Minimal code for bug 3:

from eds_scikit.event import conditions_from_icd10
from eds_scikit.utils import datetime_helpers

codes = dict(
    COVID=dict(
        regex=r"U071[0145]", 
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

visit_detail_covid = data.visit_detail.merge(
    covid[["visit_occurrence_id"]],
    on="visit_occurrence_id",
    how="inner",
)

visit_detail_covid = visit_detail_covid.merge(data.person[['person_id','birth_datetime']], 
                                              on='person_id', 
                                              how='inner')

visit_detail_covid["age"] = (
    datetime_helpers.substract_datetime(
        visit_detail_covid["visit_detail_start_datetime"],
        visit_detail_covid["birth_datetime"],
        out="hours",
    )
    / (24 * 365.25)
)

Port biology report to other tables, e.g. cim10 and ccam

Feature type

New pipeline

Description

I would be super interested in porting the biology report introduced in plot_concepts_set (https://github.com/aphp/eds-scikit/blob/main/eds_scikit/biology/viz/plot.py) to other types of concepts, such as billing codes (cim10, ccam).

It would be great when the goal is to explore and describe a cohort based on diagnoses, for example.

I'll try to find time for a first PR.

Add a feature to avoid loading the whole subsetted cohort in memory

Feature type

utility

Description

We would like to avoid loading all the data in memory when persisting HiveData. This is especially useful when processing big cohorts, where the current code will exhaust the memory.

Several proposals/directions for improvement:

Keep local storage but use the garbage collector to avoid overloading the memory. It does not solve the issue, but it scales to bigger cohorts.
Use HDFS instead of local storage to gain distributed and out-of-memory processing (as described in the aphp wiki). This is a bit APHP specific though.
Write directly from Spark to the local file system by prefixing the path with "file://" in the write call, e.g. df.write.mode("overwrite").parquet("file://" + path), where df is a Spark DataFrame. It would be the best solution in my opinion. However, the local Spark currently does not have the right to write to local files (tested on cse210038 and on the diabetes cse of Theo Jolivet).

Feature request: Patient trajectories visualization

Feature type

Functionality to plot patient trajectory based on various events.
More generally, it can handle any sequence of events.

Description

For the moment, the entry data will look like:

Entry DF

Minimal

person
event
t_start
t_end

Additional

index_date
family

For the function parameters:

Can be launched with no additional parameters

Optional

event mapping (see below)
family to index mapping
columns name mapping
list of person ids to plot
figure parameters
same_x_axis_scale

event_mapping:

color =
type = point / continuous
label

figure parameters:

width / height
bar height, point width
title / labels parameters
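
The input described above could look like the following sketch. The helper check_trajectory_input, the example events, and the colors and labels in event_mapping are hypothetical; only the column names and the point / continuous types come from the description.

```python
import pandas as pd

REQUIRED_COLUMNS = ["person", "event", "t_start", "t_end"]
OPTIONAL_COLUMNS = ["index_date", "family"]


def check_trajectory_input(df):
    """Hypothetical helper: validate the entry DataFrame.

    Raises ValueError if a required column is missing and returns the
    list of optional columns that are present.
    """
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return [col for col in OPTIONAL_COLUMNS if col in df.columns]


# Invented example: one stay and one lab test for person 1, one stay
# for person 2. A "point" event has identical t_start and t_end.
events = pd.DataFrame(
    {
        "person": [1, 1, 2],
        "event": ["hospitalization", "lab_test", "hospitalization"],
        "t_start": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-02-10"]),
        "t_end": pd.to_datetime(["2020-01-10", "2020-01-03", "2020-02-12"]),
    }
)

# One entry per event value, with the three keys listed above.
event_mapping = {
    "hospitalization": {
        "color": "tab:blue",
        "type": "continuous",
        "label": "Hospital stay",
    },
    "lab_test": {"color": "tab:red", "type": "point", "label": "Lab test"},
}

optional_present = check_trajectory_input(events)
```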

Feature request: Allowing tables_to_load and columns_to_load in HiveData

Description

Modify HiveData to accept user-defined dicts of columns / tables to load, because some OMOP extractions can be quite large and users might not want to load everything.
I think the entry point to the data is very important: people will always start by loading data before anything else, so it should be a premium feature.
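
One possible shape for such a user-defined selection, as a pure-Python sketch that does not touch HiveData itself. The helper resolve_columns, the single combined dict (table name mapped to a list of columns, or None for all columns) and the example schema are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def resolve_columns(
    available: Dict[str, List[str]],
    tables_to_load: Optional[Dict[str, Optional[List[str]]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: turn a user selection into explicit columns.

    `available` maps each table to its columns. `tables_to_load` maps
    each wanted table to a list of columns, or to None for "all".
    Passing no selection at all keeps every table and column.
    """
    if tables_to_load is None:
        return {table: list(cols) for table, cols in available.items()}

    selection = {}
    for table, columns in tables_to_load.items():
        if table not in available:
            raise KeyError(f"Unknown table: {table}")
        if columns is None:
            selection[table] = list(available[table])
            continue
        unknown = [col for col in columns if col not in available[table]]
        if unknown:
            raise KeyError(f"Unknown columns in {table}: {unknown}")
        selection[table] = list(columns)
    return selection


# Invented, heavily shortened schema.
available = {
    "person": ["person_id", "birth_datetime", "gender_source_value"],
    "visit_occurrence": ["visit_occurrence_id", "person_id"],
    "note": ["note_id", "note_text"],
}

selection = resolve_columns(
    available,
    {"person": ["person_id", "birth_datetime"], "visit_occurrence": None},
)
```

Here the large note table is simply left out of the selection, so it would never be loaded.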

No module named 'eds_scikit.biology'

I cannot load the package because of the import of the eds_scikit.biology module in the __init__.py file:

I am trying to use the package inside a JupyterLab instance of the scikit-eds project.

input:

import eds_scikit

output:

import eds_scikit.biology  # noqa: F401 --> To register functions

`eds_scikit` on Windows

Description

Problems when using eds_scikit on Windows. We might want to add testing on Windows in the CI.

For instance, time.tzset() (used in __init__.py) does not exist on Windows.
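
A minimal sketch of a portable guard, assuming the call is used to apply a TZ environment variable (the helper name set_process_timezone and the "UTC" value are illustrative, not taken from the library):

```python
import os
import time


def set_process_timezone(tz_name: str) -> bool:
    """Set TZ and apply it where the platform supports it.

    Returns True if time.tzset() was called, False otherwise
    (time.tzset() is only available on Unix).
    """
    os.environ["TZ"] = tz_name
    if hasattr(time, "tzset"):
        time.tzset()
        return True
    return False


applied = set_process_timezone("UTC")
```

On Windows this silently skips the call, so the import no longer fails, but the process time zone is then left unchanged.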

Feature request: Adding other types of features than code-based to the Phenotype class

Feature type

A new method for the Phenotype class:

def add_inclusion_criteria(self, inclusion_criteria: pd.DataFrame, level="patient", threshold=1):

Description

In order to improve the Phenotype class and make it more universal, I would very much like to be able to add inclusion criteria other than purely code-based ones.

Examples of such non-code features that I would need in order to use this class for my study:

include patients having at least a minimum number of hospitalizations.
include only patients who are adults at a given date

The feature would filter the data based on the provided inclusion criteria.
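
As an illustration of the first example, here is a sketch of such a filter on a plain pandas DataFrame. The function filter_by_min_visits and the example data are hypothetical, and this is only one possible reading of what a patient-level threshold could mean; it is not the Phenotype API.

```python
import pandas as pd


def filter_by_min_visits(visits: pd.DataFrame, min_visits: int = 1) -> pd.DataFrame:
    """Keep the rows of patients with at least `min_visits` distinct visits."""
    counts = visits.groupby("person_id")["visit_occurrence_id"].nunique()
    eligible = counts[counts >= min_visits].index
    return visits[visits["person_id"].isin(eligible)]


# Invented data: patient 1 has 2 visits, patient 2 has 1, patient 3 has 3.
visits = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 3, 3, 3],
        "visit_occurrence_id": [10, 11, 20, 30, 31, 32],
    }
)

kept = filter_by_min_visits(visits, min_visits=2)
```

The age criterion could follow the same pattern, with the eligible patients computed from birth_datetime and the reference date instead of from visit counts.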
