eds-scikit's People

Contributors

aremaki, jcharline, percevalw, svittoz, theooj, thomzoy, vincent-maladiere

eds-scikit's Issues

Feature request: Add more flexibility to `ConceptsSet`

Description

Currently, adding concept sets is quite hacky:

from eds_scikit.biology import ConceptsSet

protein_blood = ConceptsSet("Protein_Blood_Quantitative")
protein_urine = ConceptsSet("Protein_Urine_Quantitative")
protein = ConceptsSet(
    name="Protein_Quantitative",
    concept_codes=protein_blood.concept_codes + protein_urine.concept_codes,
)

It would be handier to have the following:

protein = protein_blood + protein_urine

The concept_name could be generic like "addition_1" since it doesn't seem to be used except in the bioclean table.

We would need to add the following methods to ConceptsSet:

  • __add__
  • __sub__
  • __eq__
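
A minimal sketch of what these three methods could look like. `ConceptsSetSketch` is a hypothetical stand-in written for illustration, not the actual `eds_scikit.biology.ConceptsSet` class, and the codes `A1`, `A2`, `B1` are made-up placeholders. It assumes that addition means a de-duplicated union of codes, subtraction removes codes, and equality ignores the name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class ConceptsSetSketch:
    """Hypothetical stand-in, NOT the eds_scikit.biology.ConceptsSet class."""

    name: str
    concept_codes: List[str] = field(default_factory=list)

    def __add__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Union of codes, keeping first-seen order and dropping duplicates.
        merged = list(dict.fromkeys(self.concept_codes + other.concept_codes))
        return ConceptsSetSketch(f"({self.name} + {other.name})", merged)

    def __sub__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Keep only the codes that are absent from the other set.
        removed = set(other.concept_codes)
        kept = [code for code in self.concept_codes if code not in removed]
        return ConceptsSetSketch(f"({self.name} - {other.name})", kept)

    def __eq__(self, other):
        if not isinstance(other, ConceptsSetSketch):
            return NotImplemented
        # Two sets are equal when they hold the same codes, whatever their names.
        return set(self.concept_codes) == set(other.concept_codes)


# Placeholder codes, for illustration only.
protein_blood = ConceptsSetSketch("Protein_Blood_Quantitative", ["A1", "A2"])
protein_urine = ConceptsSetSketch("Protein_Urine_Quantitative", ["B1", "A2"])
protein = protein_blood + protein_urine
```

Deriving the name from the operands, rather than using a counter such as "addition_1", keeps the result traceable; either choice would work if the name is only used in the bioclean table.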

Errors when running `introduction.ipynb`

When running the code from the "A gentle demo" section of the documentation with version 0.1.6, some commands return errors (probably originating from small syntax changes).

Description

  1. In section "Extracting diabetes status", the following command does not output the same result as in the documentation:
diabetes.concept.value_counts()

In my case, the discrepancy was solved by using the value column instead of concept.

  2. In section "Extracting covid status", the code cell below raises KeyError: 'code_list', arising from line 81 of the event_from_code function:
codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

Changing the dictionary in the following way solved the issue in my case:

codes = dict(
    COVID=dict(
        regex=r"U071[0145]", 
    )
)
  3. In section "Adding patient age", the following error is raised when trying to compute patient age:
TypeError: One of the provided Serie isn't a datetime Serie

In my case, a solution was to convert birth_datetime to datetime format using the following command:

visit_detail_covid["birth_datetime"] = visit_detail_covid["birth_datetime"].apply(lambda x: pd.to_datetime(x))

I guess the issue might be coming from the i2b2 connector

How to reproduce the bug

Code to load an i2b2 database (common to the three bugs):

import eds_scikit
import datetime
from eds_scikit.io import HiveData

database_name = "cse_**" 

data = HiveData(
    database_name=database_name,
    database_type="I2B2"
)


DATE_MIN = datetime.datetime(2018, 1, 1)
DATE_MAX = datetime.datetime(2019, 6, 1)

Minimal code for bug 1:

from eds_scikit.event.diabetes import diabetes_from_icd10

diabetes = diabetes_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

diabetes.concept.value_counts()

Minimal code for bug 2:

from eds_scikit.event import conditions_from_icd10

codes = dict(
    COVID=dict(
        code_list=r"U071[0145]", 
        code_type="regex",
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

Minimal code for bug 3:

from eds_scikit.event import conditions_from_icd10
from eds_scikit.utils import datetime_helpers

codes = dict(
    COVID=dict(
        regex=r"U071[0145]", 
    )
)

covid = conditions_from_icd10(
    condition_occurrence=data.condition_occurrence,
    visit_occurrence=data.visit_occurrence,
    codes=codes,
    date_min=DATE_MIN,
    date_max=DATE_MAX,
)

visit_detail_covid = data.visit_detail.merge(
    covid[["visit_occurrence_id"]],
    on="visit_occurrence_id",
    how="inner",
)

visit_detail_covid = visit_detail_covid.merge(data.person[['person_id','birth_datetime']], 
                                              on='person_id', 
                                              how='inner')

visit_detail_covid["age"] = (
    datetime_helpers.substract_datetime(
        visit_detail_covid["visit_detail_start_datetime"],
        visit_detail_covid["birth_datetime"],
        out="hours",
    )
    / (24 * 365.25)
)
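
As a sanity check of the unit conversion used above (hours divided by 24 * 365.25), here is a small standard-library sketch. `age_in_years` is a hypothetical helper written for this illustration and is not part of eds_scikit. It works on `datetime` objects only, which is the same constraint behind the TypeError above: a birth date stored as a string has to be parsed first.

```python
from datetime import datetime


def age_in_years(start: datetime, birth: datetime) -> float:
    """Hypothetical helper: elapsed time in years, using 365.25 days per year."""
    hours = (start - birth).total_seconds() / 3600
    return hours / (24 * 365.25)


# 2000-01-01 to 2020-01-01 spans 7305 days (20 years including 5 leap days).
age = age_in_years(datetime(2020, 1, 1), datetime(2000, 1, 1))
```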

Give a feature to avoid loading all subsetted cohort in memory

Feature type

utility

Description

We would like to avoid loading all the data in memory when persisting HiveData. This is especially useful when processing big cohorts, where the current code will exhaust memory.

Several possible directions for improvement:

  • Keep local storage but use the garbage collector to avoid overloading memory. This does not solve the issue, but it scales to bigger cohorts.
  • Use HDFS instead of local storage to gain distributed and out-of-memory processing (as described in the APHP wiki). This is a bit APHP-specific though.
  • Write directly from Spark to the local file system by prefixing the path with "file://", as in spark.write.mode("overwrite").parquet("file://"+path). This would be the best solution in my opinion. However, the local Spark currently does not have the right to write to local files (tested on cse210038 and the cse diabetes of Theo Jolivet).
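
The first direction (keep local storage, release memory between tables) could be sketched as follows. This is only an illustration under assumptions: `persist_one_by_one`, `load_table` and `write_table` are hypothetical names, and the loader and writer are passed in as callables so that the sketch does not depend on any real HiveData method.

```python
import gc
from typing import Callable, Iterable, List


def persist_one_by_one(
    table_names: Iterable[str],
    load_table: Callable[[str], object],
    write_table: Callable[[str, object], None],
) -> List[str]:
    """Hypothetical helper: load, write and release one table at a time."""
    written = []
    for name in table_names:
        table = load_table(name)  # only this table is held in memory
        write_table(name, table)
        written.append(name)
        del table  # drop the only reference to the table
        gc.collect()  # ask Python to reclaim memory before the next table
    return written


# Toy usage with in-memory stand-ins for the loader and the writer.
store = {}
done = persist_one_by_one(
    ["person", "visit_occurrence"],
    load_table=lambda name: [name] * 3,
    write_table=lambda name, table: store.__setitem__(name, len(table)),
)
```

With this pattern, peak memory is driven by the largest single table rather than by the whole cohort, which is why it scales further without removing the limit.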

Feature request: Patient trajectories visualization

Feature type

Functionality to plot patient trajectories based on various events.
More generally, it can handle any sequence of events.

Description

For the moment, the input data would look like this:

Input DataFrame

Minimal columns

  • person
  • event
  • t_start
  • t_end

Additional

  • index_date
  • family

For the function parameters:

The function can be called with no additional parameters.

Optional

  • event mapping (see below)
  • family to index mapping
  • columns name mapping
  • list of person ids to plot
  • figure parameters
  • same_x_axis_scale

event_mapping:

  • color =
  • type = point / continuous
  • label

figure parameters:

  • width / height
  • bar height, point width
  • title / labels parameters
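
To make the event mapping more concrete, here is a plotting-library-agnostic sketch that turns input rows into drawing primitives, one list per person. Everything in it is an assumption made for illustration: `EventStyle`, `to_primitives`, the default color and the rule that an event without `t_end` is drawn as a point are not part of eds_scikit.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass
class EventStyle:
    """Hypothetical per-event style, mirroring the event_mapping fields."""

    color: str = "grey"
    type: str = "continuous"  # "point" or "continuous"
    label: Optional[str] = None


def to_primitives(
    rows: List[dict], event_mapping: Optional[Dict[str, EventStyle]] = None
) -> Dict[object, List[Tuple]]:
    """Group rows by person as (kind, label, color, start, end) tuples."""
    event_mapping = event_mapping or {}
    out: Dict[object, List[Tuple]] = {}
    for row in rows:
        # Events without an explicit mapping fall back to the default style.
        style = event_mapping.get(row["event"], EventStyle())
        t_end = row.get("t_end")
        # A point event, or an event with no end, becomes a marker at t_start.
        kind = "point" if style.type == "point" or t_end is None else "bar"
        end = row["t_start"] if kind == "point" else t_end
        label = style.label or row["event"]
        out.setdefault(row["person"], []).append(
            (kind, label, style.color, row["t_start"], end)
        )
    return out


rows = [
    {"person": 1, "event": "hospitalization",
     "t_start": datetime(2020, 1, 1), "t_end": datetime(2020, 1, 10)},
    {"person": 1, "event": "biology",
     "t_start": datetime(2020, 1, 3), "t_end": None},
]
mapping = {"biology": EventStyle(color="red", type="point", label="Lab test")}
primitives = to_primitives(rows, mapping)
```

Separating this step from the actual drawing keeps the function callable with no parameters other than the data, while the figure parameters only matter at rendering time.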

Feature request: Allowing tables_to_load and columns_to_load in HiveData

Description

Modify HiveData to accept user-defined dicts of tables / columns to load: some OMOP extractions can be quite large, and users might not want to load everything.
I think the entry point to the data is very important, since people always start by loading data before anything else, so this should be a first-class feature.
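
One possible shape for this selection logic, sketched independently of HiveData. `resolve_columns` and the `available` schema dict are hypothetical; the sketch assumes that omitting `tables_to_load` means all tables and that a table missing from `columns_to_load` keeps all of its columns.

```python
from typing import Dict, List, Optional


def resolve_columns(
    available: Dict[str, List[str]],
    tables_to_load: Optional[List[str]] = None,
    columns_to_load: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: compute which columns to read for each table."""
    tables = list(available) if tables_to_load is None else tables_to_load
    columns_to_load = columns_to_load or {}
    selection = {}
    for table in tables:
        if table not in available:
            raise KeyError(f"Unknown table: {table}")
        # Default to every column when the user gave no list for this table.
        wanted = columns_to_load.get(table, available[table])
        missing = [col for col in wanted if col not in available[table]]
        if missing:
            raise KeyError(f"Unknown columns for {table}: {missing}")
        selection[table] = list(wanted)
    return selection


schema = {
    "person": ["person_id", "birth_datetime", "gender_source_value"],
    "visit_occurrence": ["visit_occurrence_id", "person_id"],
}
selection = resolve_columns(
    schema,
    tables_to_load=["person"],
    columns_to_load={"person": ["person_id", "birth_datetime"]},
)
```

Failing early on unknown tables or columns gives users an error at load time instead of a KeyError later in their pipeline.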

No module named 'eds_scikit.biology'

I cannot load the package because of the import of the eds_scikit.biology module in the __init__.py file.

I am trying to use the package inside a JupyterLab of the scikit-eds project.

input:

import eds_scikit

output:

import eds_scikit.biology  # noqa: F401 --> To register functions

`eds_scikit` on Windows

Description

Problems when using eds_scikit on Windows. We might want to add testing on Windows in the CI.

For instance, time.tzset() (used in __init__.py) is not available on Windows.
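
A portable guard could look like the sketch below. `apply_timezone` is a hypothetical helper, and setting the `TZ` environment variable to "UTC" is an assumption made for the example, not necessarily what the package's `__init__.py` does. The only documented fact relied on here is that `time.tzset()` is available on Unix only.

```python
import os
import time


def apply_timezone(tz: str = "UTC") -> bool:
    """Hypothetical helper: set TZ and call time.tzset() where it exists."""
    os.environ["TZ"] = tz
    # time.tzset() is Unix-only; skip it elsewhere instead of raising
    # AttributeError at import time.
    if hasattr(time, "tzset"):
        time.tzset()
        return True
    return False


applied = apply_timezone("UTC")
```

The return value tells the caller whether the process-wide time zone was actually refreshed, which can be logged as a warning on Windows.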

Feature request: Adding other types of features than code-based to the Phenotype class

Feature type

A new method for Phenotype Class:

def add_inclusion_criteria(self, inclusion_criteria: pd.DataFrame, level="patient", threshold=1):

Description

In order to improve the Phenotype class and make it more universal, I would very much like to be able to add inclusion criteria other than purely code-based ones.

Examples of such non-code features that I would need this class to support for my study:

  • include only patients having at least a minimum number of hospitalizations
  • include only patients who are adults at a given date

The feature would filter the data based on the provided inclusion criteria.
