
cv19index's Issues

Script crashes if there are 0 inpatient records w/out discharge date

Hi,

Thanks for the script!

When I run a claims file that contains only outpatient data (all inpatient = FALSE), or an inpatient record without a discharge date, the script produces this error:

Traceback (most recent call last):
  File "run_cv19index.py", line 6, in <module>
    cv19index.predict.main()
  File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\predict.py", line 445, in main
    do_run_claims(args.demographics_file, args.claims_file, args.output_file, args.model, args.as_of_date, args.feature_file)
  File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\predict.py", line 416, in do_run_claims
    input_df = preprocess_xgboost(claim_df, demo_df, asOfDate)
  File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\preprocess.py", line 79, in preprocess_xgboost
    inpatient_days = pd.Series((inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date).dt.days, index=claim_df['personId'])
  File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
    return object.__getattribute__(self, name)
  File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\accessor.py", line 133, in __get__
    accessor_obj = self._accessor(obj)
  File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\accessors.py", line 325, in __new__
    raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values

To work around it, I changed one claim to inpatient=TRUE and added a discharge date of today.

Can we update the script for this scenario?

Thanks!
Christopher
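A possible guard for this scenario, as a sketch only: coerce both date columns with pd.to_datetime so the .dt accessor stays valid even when the frame has zero inpatient rows, and treat a missing discharge date as zero inpatient days. The column names follow the traceback; the function name is hypothetical.

```python
import pandas as pd

def inpatient_day_counts(inpatient_rows: pd.DataFrame) -> pd.Series:
    """Per-stay inpatient day counts that tolerate zero rows and missing
    discharge dates (both otherwise break the .dt accessor)."""
    # to_datetime keeps the result datetimelike even for an empty column;
    # errors="coerce" turns missing/blank discharge dates into NaT.
    admit = pd.to_datetime(inpatient_rows["admitDate"], errors="coerce")
    discharge = pd.to_datetime(inpatient_rows["dischargeDate"], errors="coerce")
    days = (discharge - admit).dt.days
    # A stay with no discharge date contributes 0 days instead of crashing.
    return days.fillna(0).astype(int)
```

With this shape, an all-outpatient file simply yields an empty series instead of an AttributeError.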

ICD 10 Cleaning Code Removing Valid Codes

Hey guys, I love this tool and am working with it right now, what a great contribution.

One thing I noticed: the function to clean ICD 10 codes from the Tutorial seems to return only codes that have no dots and are longer than 3 characters. It was not 100% clear that codes supplied with dots would not make it through, so I was stuck on that for a while.

The bigger problem, though, is that there are valid ICD 10 codes with only 3 characters (F99, R17, ...). They appear in the CCSR docs but will never make it into a dataset, from what I have tested. Thanks again; I just wanted to document these speed bumps.

I was able to resolve it simply by adding another return after the else:, since I had already done some cleaning on my ICD 10 codes (too much :D). But I could see how, for a more general case, you'd want to add some other validation.

    def cleanICD10Syntax(code):
        if len(code) > 3 and '.' not in code:
            return code[:3] + '.' + code[3:]
        else:
            return code  # the missing return: pass 3-character and dotted codes through
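To confirm the behavior, a quick check of the function with the missing return added (F99 and R17 are the 3-character codes from the issue; E1165 is an illustrative undotted code):

```python
def cleanICD10Syntax(code):
    """Tutorial cleaner with the missing return added after the else."""
    if len(code) > 3 and '.' not in code:
        # Insert the dot that ICD-10 notation expects after the third character.
        return code[:3] + '.' + code[3:]
    else:
        return code  # 3-character and already-dotted codes pass through unchanged

print(cleanICD10Syntax("F99"))    # F99
print(cleanICD10Syntax("R17"))    # R17
print(cleanICD10Syntax("E1165"))  # E11.65
```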

urllib.parse.quote: improper usage?

I got a problem running the model:
ValueError: feature_names mismatch

The problem is that df_inputs in predict.py:253 has columns with wrong names, like

Diagnosis%20of%20Nephritis_%20nephrosis_%20renal%20sclerosis%20in%20the%20previous%2012%20months

This is because of the use of urllib.parse.quote here:
df_inputs.columns = [urllib.parse.quote(col) for col in df_inputs.columns]

Python 3.7.7 and 3.8.1, Windows 10.
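The mangling is easy to reproduce: quote() percent-encodes spaces by default. If the trained model's stored feature names are unencoded, either skipping the quoting or undoing it before comparison would restore the match; this is a workaround sketch, not the project's fix.

```python
import urllib.parse

col = "Diagnosis of Nephritis_ nephrosis_ renal sclerosis in the previous 12 months"

# Default quote() turns every space into %20, producing exactly the
# mismatched column names reported above.
encoded = urllib.parse.quote(col)
print(encoded.startswith("Diagnosis%20of%20Nephritis_"))  # True

# unquote() reverses the encoding, so names can be compared unencoded.
assert urllib.parse.unquote(encoded) == col
```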

Add Conda package for windows

Several people have had problems installing xgboost on windows. We should provide directions for installing using Anaconda on Windows. This will be easier than pip.
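Possible directions, sketched under the assumption that xgboost is available from the conda-forge channel and that cv19index itself is published on PyPI (otherwise install it from a repo checkout):

```shell
# Create an isolated environment with a prebuilt xgboost (no compiler needed)
conda create -n cv19 python=3.7
conda activate cv19
conda install -c conda-forge xgboost pandas
# cv19index itself via pip inside the conda environment
pip install cv19index
```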

Errors installing python dependencies

Several people have had issues installing the python dependencies on windows. Often these involve errors asking them to install Visual C++ libraries.

A script to get ROC AUC results without cv19index package

It would be great if you could provide an end-to-end working Python script that simply outputs the ROC AUC on the data in this repo. That way, people can quickly swap their models for your xgboost model and try new things out.

Requiring cv19index to be installed as a package should not be necessary, as we are really fighting against time as humanity.
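As a stopgap, ROC AUC can be computed with no cv19index dependency at all. A minimal sketch using the rank (Mann-Whitney) formulation, with hypothetical labels and scores standing in for the repo's example outputs:

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney statistic: the probability that a
    random positive is scored above a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical predictions; replace with the model's output column.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

The same numbers fed to sklearn.metrics.roc_auc_score give an identical result, so either can serve as the end-to-end check.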

Error with _NA_VALUES

This issue has been reported:

I'm running this through the Anaconda environment and getting dependency issues:

Traceback (most recent call last):
  File "run_cv19index.py", line 3, in <module>
    import cv19index.predict
  File "C:\Users\eugene.nguyen\Desktop\cv19index-master\cv19index\predict.py", line 14, in <module>
    from .io import read_frame, read_model, write_predictions, read_claim, read_demographics
  File "C:\Users\eugene.nguyen\Desktop\cv19index-master\cv19index\io.py", line 9, in <module>
    from pandas.io.common import _NA_VALUES
ImportError: cannot import name '_NA_VALUES' from 'pandas.io.common' (C:\Users\eugene.nguyen\Anaconda3\lib\site-packages\pandas\io\common.py)

No such file when running cv19index executable

Got this error when running the executable:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/daved/Desktop/python-virtual-environments/env/lib/python3.8/site-packages/cv19index/resources/demographics.schema.json'

Error running inference, Python version: 3.6.9

Edit: Stack Trace below
Error : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte

INFO:/content/cv19index/cv19index/predict.py:Reading model from /content/cv19index/cv19index/resources/xgboost_all_ages/model.pickle.  Writing results to examples/predictions.csv
DEBUG:cv19index.preprocess:Beginning claims data frame preprocessing, raw data frame as follows.
DEBUG:cv19index.preprocess:           personId  ... dx15
0  001ef63fe5cb0cc5  ...     
1  001ef63fe5cb0cc5  ...     
2  001ef63fe5cb0cc5  ...     
3  001ef63fe5cb0cc5  ...     
4  001ef63fe5cb0cc5  ...     

[5 rows x 21 columns]
DEBUG:cv19index.preprocess:Filtered claims to just those within the dates 2017-12-31 to 2018-12-31.  Claim count went from 68481 to 35735
DEBUG:cv19index.preprocess:Cleaning diagnosis codes.

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
DEBUG:cv19index.preprocess:Computing diagnosis flags.
DEBUG:cv19index.preprocess:Preprocessing complete.
INFO:/content/cv19index/cv19index/predict.py:Reordering test inputs to match training.
INFO:/content/cv19index/cv19index/predict.py:Scale pos weight is 24.882477847848648. Rescaling predictions to probabilities
WARNING:/content/cv19index/cv19index/shap_top_factors.py:Computing SHAP scores.  Approximate = False
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
Traceback (most recent call last):
  File "run_cv19index.py", line 6, in <module>
    cv19index.predict.main()
  File "/content/cv19index/cv19index/predict.py", line 454, in main
    do_run_claims(args.demographics_file, args.claims_file, args.output_file, args.model, args.as_of_date, args.feature_file)
  File "/content/cv19index/cv19index/predict.py", line 431, in do_run_claims
    predictions = run_xgb_model(input_df, model, quote = quote)
  File "/content/cv19index/cv19index/predict.py", line 376, in run_xgb_model
    **kwargs,
  File "/content/cv19index/cv19index/predict.py", line 161, in perform_predictions
    df_cutoff, model, outcome_column, mapping, **kwargs
  File "/content/cv19index/cv19index/shap_top_factors.py", line 127, in generate_shap_top_factors
    shap_values = shap.TreeExplainer(model).shap_values(
  File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 121, in __init__
    self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
  File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 726, in __init__
    xgb_loader = XGBTreeModelLoader(self.original_model)
  File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 1326, in __init__
    self.name_obj = self.read_str(self.name_obj_len)
  File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 1456, in read_str
    val = self.buf[self.pos:self.pos+size].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-9-4aabaf703f58> in <module>()
----> 1 get_ipython().run_cell_magic('shell', '', 'cd cv19index/\npython run_cv19index.py -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv -f examples/features.csv\n# cv19index -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv')

2 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in check_returncode(self)
    136     if self.returncode:
    137       raise subprocess.CalledProcessError(
--> 138           returncode=self.returncode, cmd=self.args, output=self.output)
    139 
    140   def _repr_pretty_(self, p, cycle):  # pylint:disable=unused-argument

Support Python 3.5

Currently we only support Python 3.6 because of the type hinting we use. We should check whether there is a backwards-compatible way to support this on Python 3.5; otherwise we should consider removing those hints to broaden the availability of the code.
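If the blocker is PEP 526 variable annotations (3.6+ syntax), type comments are one backwards-compatible option: the interpreter ignores them while mypy still checks them. A sketch with illustrative names:

```python
from typing import Dict, List

def mean_risk(scores):
    # type: (List[float]) -> float
    """Annotated with a type comment: parses on Python 3.5, checked by mypy."""
    return sum(scores) / len(scores)

# Variable annotation as a comment instead of 3.6-only `x: T = ...` syntax.
thresholds = {"high": 0.8}  # type: Dict[str, float]
```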

inpatient days mismatch

Hello, when reviewing the results I found that the inpatient days values do not align with what I observe in the original claims input file, e.g. patients with no inpatient visits have 24 inpatient days, or vice versa. Upon debugging, the problem seems to lie in the part where inpatient_days is created with an index taken from claim_df; this actually selects only the date_diff values whose positional index happens to equal a personId.

    preprocessed_df['# of Admissions (12M)'] = inpatient_rows.groupby('personId').admitDate.nunique()
    date_diff = pd.to_timedelta(inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date)
    inpatient_days = pd.Series(date_diff.dt.days, index=claim_df['personId'])
    preprocessed_df['Inpatient Days'] = inpatient_days.groupby('personId').sum()

Example of date_diff:
date_diff.dt.days
10 8
29 2
53 2
56 9
60 2
..
1333281 3
1333325 2 --> if there were a personId == 1333325, its inpatient days would come out as 2, even though this number is claim_df's row index, not a personId.
1333336 10
1333337 5
1333340 5
Length: 74609, dtype: int64


The claim_df and demo_df were set up as suggested:

  • demo_df has a unique row for each patient, with age and gender
  • claim_df has one or multiple rows for each patient (only patients with claims are included).

Please let me know if you have any suggestions. Thank you.
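One possible correction, keeping the column names from the snippet: key the per-stay day counts by each row's own personId (taken positionally from inpatient_rows) instead of reindexing by claim_df['personId'], then sum per person. The frame below is a hypothetical minimal example with every row assumed inpatient.

```python
import pandas as pd

inpatient_rows = pd.DataFrame({
    "personId": ["a", "a", "b"],
    "admitDate": pd.to_datetime(["2018-01-01", "2018-02-01", "2018-03-01"]),
    "dischargeDate": pd.to_datetime(["2018-01-04", "2018-02-03", "2018-03-02"]),
})

date_diff = pd.to_timedelta(
    inpatient_rows["dischargeDate"].dt.date - inpatient_rows["admitDate"].dt.date
)
# Pair each stay length with that stay's own personId (positionally),
# rather than reinterpreting claim_df's row index as person ids.
inpatient_days = pd.Series(date_diff.dt.days.values, index=inpatient_rows["personId"])
print(inpatient_days.groupby(level=0).sum())  # a -> 5, b -> 1
```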

Add BlueButton support

We currently have a CSV file format for getting in claims data. Another useful format would be to take in data in FHIR JSON format. BlueButton is an example of a source like this. Example files are available for the BlueButton developer sandbox.

Age cap?

Does the model have an intrinsic age cap? We have a population of patients ranging from age 22 to 107. After running the xgboost model from the tutorial, we are seeing all patients aged 98 to 107 with a risk score of only 30, while patients aged 92 all have the highest risk score at 73. Our sample sizes are large enough that we would expect to see at least some of the oldest patients in the highest risk category, which made us wonder whether there is some kind of intrinsic age cap in the model.

TypeError: ('Timestamp subtraction must have the same timezones or no timezones', 'occurred at index 92')

File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 6928, in apply
return op.get_result()

File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()

File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 292, in apply_standard
self.apply_series_generator()

File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 321, in apply_series_generator
results[i] = self.f(v)

File "", line 16, in
inpatient_df['Inpatient Days'] = inpatient_df[['dischargeDate','admitDate']].apply(lambda x: (pd.to_datetime(x.dischargeDate) - pd.to_datetime(x.admitDate)).days, axis=1)

File "pandas/_libs/tslibs/c_timestamp.pyx", line 295, in pandas._libs.tslibs.c_timestamp._Timestamp.sub

TypeError: ('Timestamp subtraction must have the same timezones or no timezones', 'occurred at index 92')

Error with reorder_inputs

Hello,
I tried running the prediction as follows and got this error. It seems to happen because model["model"].feature_names is empty, resulting in the 'NoneType' error in the reorder_inputs function. Can you please help fix it? Thank you.

    input_schema = resource_filename("cv19index", "resources/xgboost/input.csv.schema.json")
    model = resource_filename("cv19index", "resources/xgboost/model.pickle")
    model = resource_filename("cv19index", "resources/model_simple/model.pickle")

    asOfDate = '2020-01-31'
    fclaim = "data/claim_test.csv"
    fdemo = "data/demo_test.csv"
    output = "data/prediction_test.csv"
    model_name = "xgboost"

    do_run_claims(fdemo, fclaim, output, model_name, asOfDate, feature_file = None)

Traceback (most recent call last):

File "", line 12, in
do_run_claims(fdemo, fclaim, output, model_name, asOfDate, feature_file = None)

File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 431, in do_run_claims
predictions = run_xgb_model(input_df, model, quote = quote)

File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 359, in run_xgb_model
df_inputs = reorder_inputs(df_inputs, predictor)

File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 342, in reorder_inputs
if set(predictor["model"].feature_names) == set(df_inputs.columns) and predictor[

TypeError: 'NoneType' object is not iterable

from cv19index.io import read_model

model_name = "xgboost"
schema_fpath = resource_filename("cv19index", f"resources/{model_name}/input.csv.schema.json")
model_fpath = resource_filename("cv19index",f"resources/{model_name}/model.pickle")
model = read_model(model_fpath)

print(f'model["model"].feature_names: {model["model"].feature_names}')

Output:
model["model"].feature_names: None

Version:
cv19index: 1.1.4
xgboost: 1.4.0
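A possible guard to apply before reorder_inputs, under the assumption that the feature names can be recovered from the input frame's columns (ensure_feature_names is a hypothetical helper, not part of cv19index):

```python
def ensure_feature_names(booster, columns):
    """Restore feature_names on an unpickled model when it comes back None
    (seen when a newer xgboost loads a pickle saved by an older release)."""
    if getattr(booster, "feature_names", None) is None:
        booster.feature_names = list(columns)
    return booster
```

Called as `ensure_feature_names(model["model"], df_inputs.columns)` right before the reorder step, the set comparison in reorder_inputs would no longer iterate over None.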

Questions to feature mapping?

How are the questions asked here mapped to the 15 features dx0 to dx15 and to ER visits?

How do we run inference on the simpler regression model? What inputs/features should we supply, etc.?

Running the CV19 Index Predictor

do_run(input_fpath, input_schema, model, output)

Traceback (most recent call last):

File "", line 1, in
do_run(input_fpath, input_schema, model, output)

File "..\cv19index\predict.py", line 360, in do_run
model = read_model(model_fpath)

File "..\cv19index\io.py", line 19, in read_model
return pickle.load(fobj)

File "C:\Users\cdhingr1\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\xgboost\core.py", line 981, in setstate
_check_call(_LIB.XGBoosterLoadModelFromBuffer(handle, ptr, length))

File "C:\Users\cdhingr1\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\xgboost\core.py", line 176, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))

XGBoostError: [15:34:02] C:\Jenkins\workspace\xgboost-win64_release_0.90\src\gbm\gbm.cc:20: Unknown gbm type
