closedloop-ai / cv19index
COVID-19 Vulnerability Index
Home Page: http://cv19index.com
License: Other
Hi,
Thanks for the script!
When I run a claims file that contains only outpatient data (all inpatient = FALSE), or an inpatient claim without a discharge date, the script produces this error:
Traceback (most recent call last):
File "run_cv19index.py", line 6, in <module>
cv19index.predict.main()
File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\predict.py", line 445, in main
do_run_claims(args.demographics_file, args.claims_file, args.output_file, args.model, args.as_of_date, args.feature_file)
File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\predict.py", line 416, in do_run_claims
input_df = preprocess_xgboost(claim_df, demo_df, asOfDate)
File "c:\Users\chris-pickering\Documents\Projects\cv19index\cv19index\cv19index\preprocess.py", line 79, in preprocess_xgboost
inpatient_days = pd.Series((inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date).dt.days, index=claim_df['personId'])
File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
return object.__getattribute__(self, name)
File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\accessor.py", line 133, in __get__
accessor_obj = self._accessor(obj)
File "C:\Users\chris-pickering\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\indexes\accessors.py", line 325, in __new__
raise AttributeError("Can only use .dt accessor with datetimelike "
AttributeError: Can only use .dt accessor with datetimelike values
To work around it, I changed one claim to inpatient = TRUE and added a discharge date of today.
Can we update the script for this scenario?
Thanks!
Christopher
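A defensive tweak along these lines could handle the all-outpatient case. This is only a sketch against a made-up frame; the column names mirror the traceback rather than the real loader, and the zero-fill fallback is an assumption about the desired behavior.

```python
import pandas as pd

# Hypothetical all-outpatient claims frame, with columns named as in the traceback.
claim_df = pd.DataFrame({
    "personId": ["a", "b"],
    "inpatient": [False, False],
    "admitDate": pd.to_datetime(["2018-01-01", "2018-02-01"]),
    "dischargeDate": [None, None],  # never populated for outpatient claims
})

inpatient_rows = claim_df[claim_df["inpatient"]]
if inpatient_rows.empty:
    # No inpatient claims at all: every person gets zero inpatient days,
    # and the .dt accessor is never touched on an object-dtype column.
    inpatient_days = pd.Series(
        0, index=pd.Index(claim_df["personId"].unique(), name="personId")
    )
else:
    # Coerce dischargeDate so missing discharges become NaT instead of
    # leaving the column as object dtype (which is what breaks .dt).
    discharge = pd.to_datetime(inpatient_rows["dischargeDate"])
    inpatient_days = (discharge - inpatient_rows["admitDate"]).dt.days

print(inpatient_days)
```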
Hey guys, I love this tool and am working with it right now, what a great contribution.
One thing I noticed: the function to clean ICD-10 codes from the tutorial seems to return a value only for codes without dots and longer than 3 characters. It was not 100% clear that codes supplied with dots would not make it through, so I was stuck on that for a while.
The bigger problem, though, is that there are valid ICD-10 codes with only 3 characters (F99, R17, ...); they appear in the CCSR docs but, from what I have tested, will never make it into a dataset. Thanks again; I just wanted to document these speed bumps.
I was able to resolve it simply by adding another return after the else, since I had already done some cleaning on my ICD-10s (too much :D). But I could see how, for a more general case, you'd want to add some other validation.

def cleanICD10Syntax(code):
    if len(code) > 3 and '.' not in code:
        return code[:3] + '.' + code[3:]
    else:
        return code
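For reference, a sketch of the helper with the missing `return` added, checked against both a 3-character code and a longer undotted code (the example codes are illustrative):

```python
def cleanICD10Syntax(code):
    # Insert the dot for undotted codes longer than 3 characters;
    # everything else (3-char codes, already-dotted codes) passes through.
    if len(code) > 3 and '.' not in code:
        return code[:3] + '.' + code[3:]
    else:
        return code

print(cleanICD10Syntax("F99"))    # 3-character codes pass through unchanged
print(cleanICD10Syntax("E119"))   # -> E11.9
print(cleanICD10Syntax("E11.9"))  # already dotted, unchanged
```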
I got a problem running the model:
ValueError: feature_names mismatch
The problem is that df_inputs in predict.py:253 has columns with wrong names, like
Diagnosis%20of%20Nephritis_%20nephrosis_%20renal%20sclerosis%20in%20the%20previous%2012%20months
This is because of the use of urllib.parse.quote here:
df_inputs.columns = [urllib.parse.quote(col) for col in df_inputs.columns]
Python 3.7.7 and 3.8.1, Windows 10.
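For illustration, quote percent-encodes spaces, so a column name turns into something the pickled model's feature_names never contained (the column name below is shortened for the example):

```python
import urllib.parse

# A shortened stand-in for one of the affected column names.
col = "Diagnosis of Nephritis_ nephrosis_ renal sclerosis"

# Spaces become %20; underscores are unreserved and pass through unchanged.
print(urllib.parse.quote(col))
# Diagnosis%20of%20Nephritis_%20nephrosis_%20renal%20sclerosis
```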
Several people have had problems installing xgboost on windows. We should provide directions for installing using Anaconda on Windows. This will be easier than pip.
Several people have had issues installing the python dependencies on windows. Often these involve errors asking them to install Visual C++ libraries.
The logistic regression model is currently just a pickle file with parameters. It is not callable from the regular script.
It would be great if you can just provide an end-to-end working python script that simply outputs the ROC AUC on the data you have in this repo. This way people can quickly replace their models with your xgboost model and try new things out.
Requiring cv19index to be installed as a package should not be necessary, since as humanity we are really fighting against time.
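As a starting point, a minimal end-to-end evaluation might look like the sketch below. The column names outcome and risk_score are assumptions (this repo's prediction output schema may differ), and the inline dummy data stands in for the repo's example files.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Dummy stand-in for a predictions file joined with true outcomes;
# the real cv19index output columns may be named differently.
preds = pd.DataFrame({
    "outcome": [0, 0, 1, 1],
    "risk_score": [0.1, 0.4, 0.35, 0.8],
})

auc = roc_auc_score(preds["outcome"], preds["risk_score"])
print(f"ROC AUC: {auc:.2f}")  # ROC AUC: 0.75
```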
This issue has been reported:
I'm running this through the Anaconda environment and getting dependency issues:
Traceback (most recent call last):
File "run_cv19index.py", line 3, in
import cv19index.predict
File "C:\Users\eugene.nguyen\Desktop\cv19index-master\cv19index\predict.py", line 14, in
from .io import read_frame, read_model, write_predictions, read_claim, read_demographics
File "C:\Users\eugene.nguyen\Desktop\cv19index-master\cv19index\io.py", line 9, in
from pandas.io.common import _NA_VALUES
ImportError: cannot import name '_NA_VALUES' from 'pandas.io.common' (C:\Users\eugene.nguyen\Anaconda3\lib\site-packages\pandas\io\common.py)
Got this error when running the executable:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/daved/Desktop/python-virtual-environments/env/lib/python3.8/site-packages/cv19index/resources/demographics.schema.json'
Edit: Stack Trace below
Error : UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte
INFO:/content/cv19index/cv19index/predict.py:Reading model from /content/cv19index/cv19index/resources/xgboost_all_ages/model.pickle. Writing results to examples/predictions.csv
DEBUG:cv19index.preprocess:Beginning claims data frame preprocessing, raw data frame as follows.
DEBUG:cv19index.preprocess: personId ... dx15
0 001ef63fe5cb0cc5 ...
1 001ef63fe5cb0cc5 ...
2 001ef63fe5cb0cc5 ...
3 001ef63fe5cb0cc5 ...
4 001ef63fe5cb0cc5 ...
[5 rows x 21 columns]
DEBUG:cv19index.preprocess:Filtered claims to just those within the dates 2017-12-31 to 2018-12-31. Claim count went from 68481 to 35735
DEBUG:cv19index.preprocess:Cleaning diagnosis codes.
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
DEBUG:cv19index.preprocess:Computing diagnosis flags.
DEBUG:cv19index.preprocess:Preprocessing complete.
INFO:/content/cv19index/cv19index/predict.py:Reordering test inputs to match training.
INFO:/content/cv19index/cv19index/predict.py:Scale pos weight is 24.882477847848648. Rescaling predictions to probabilities
WARNING:/content/cv19index/cv19index/shap_top_factors.py:Computing SHAP scores. Approximate = False
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
Traceback (most recent call last):
File "run_cv19index.py", line 6, in <module>
cv19index.predict.main()
File "/content/cv19index/cv19index/predict.py", line 454, in main
do_run_claims(args.demographics_file, args.claims_file, args.output_file, args.model, args.as_of_date, args.feature_file)
File "/content/cv19index/cv19index/predict.py", line 431, in do_run_claims
predictions = run_xgb_model(input_df, model, quote = quote)
File "/content/cv19index/cv19index/predict.py", line 376, in run_xgb_model
**kwargs,
File "/content/cv19index/cv19index/predict.py", line 161, in perform_predictions
df_cutoff, model, outcome_column, mapping, **kwargs
File "/content/cv19index/cv19index/shap_top_factors.py", line 127, in generate_shap_top_factors
shap_values = shap.TreeExplainer(model).shap_values(
File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 121, in __init__
self.model = TreeEnsemble(model, self.data, self.data_missing, model_output)
File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 726, in __init__
xgb_loader = XGBTreeModelLoader(self.original_model)
File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 1326, in __init__
self.name_obj = self.read_str(self.name_obj_len)
File "/usr/local/lib/python3.6/dist-packages/shap/explainers/tree.py", line 1456, in read_str
val = self.buf[self.pos:self.pos+size].decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 341: invalid start byte
---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-9-4aabaf703f58> in <module>()
----> 1 get_ipython().run_cell_magic('shell', '', 'cd cv19index/\npython run_cv19index.py -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv -f examples/features.csv\n# cv19index -a 2018-12-31 examples/demographics.csv examples/claims.csv examples/predictions.csv')
2 frames
/usr/local/lib/python3.6/dist-packages/google/colab/_system_commands.py in check_returncode(self)
136 if self.returncode:
137 raise subprocess.CalledProcessError(
--> 138 returncode=self.returncode, cmd=self.args, output=self.output)
139
140 def _repr_pretty_(self, p, cycle): # pylint:disable=unused-argument
Currently we only support Python 3.6 because of the type hinting that we have. We should check whether there is a backwards compatible way to support this in Python 3.5. Otherwise we should look at whether we should remove those hints to increase the availability of the code.
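One backwards-compatible option, sketched below: PEP 526 variable annotations (3.6+) can be rewritten as PEP 484 type comments, which the Python 3.5 parser accepts (the variable here is just an illustration, not code from this repo).

```python
from typing import Dict

# Python 3.6+ only:
#   scores: Dict[str, float] = {}
# Python 3.5-compatible equivalent using a type comment:
scores = {}  # type: Dict[str, float]

scores["risk"] = 0.42
print(scores)
```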
Hello, when running the results, I found that the inpatient days values are not aligned with what I observed in the original claims input file, e.g. patients having no inpatient visits but an inpatient days value of 24, or vice versa. Upon debugging, the problem seems to lie in the part where inpatient_days is created with an index taken from claim_df; this effectively picks only the date_diff values whose positional index happens to equal a personId.
preprocessed_df['# of Admissions (12M)'] = inpatient_rows.groupby('personId').admitDate.nunique()
date_diff = pd.to_timedelta(inpatient_rows['dischargeDate'].dt.date - inpatient_rows['admitDate'].dt.date)
inpatient_days = pd.Series(date_diff.dt.days, index=claim_df['personId'])
preprocessed_df['Inpatient Days'] = inpatient_days.groupby('personId').sum()
Example of date_diff:
date_diff.dt.days
10 8
29 2
53 2
56 9
60 2
..
1333281 3
1333325 2 --> if there were a personId == 1333325, its inpatient days would be 2; but this number is the row index of claim_df, not related to personId.
1333336 10
1333337 5
1333340 5
Length: 74609, dtype: int64
The claim_df and demo_df were set up as suggested:
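A sketch of one possible fix, using a toy frame: build the per-stay day counts with personId taken from the inpatient rows themselves, instead of re-indexing the positional date_diff Series by claim_df['personId'].

```python
import pandas as pd

# Toy inpatient rows; column names follow the snippet above.
inpatient_rows = pd.DataFrame({
    "personId": ["a", "a", "b"],
    "admitDate": pd.to_datetime(["2018-01-01", "2018-03-01", "2018-02-10"]),
    "dischargeDate": pd.to_datetime(["2018-01-05", "2018-03-03", "2018-02-12"]),
})

# Day counts aligned to personId via the same rows, not claim_df's row index.
date_diff_days = (inpatient_rows["dischargeDate"] - inpatient_rows["admitDate"]).dt.days
inpatient_days = pd.Series(date_diff_days.values, index=inpatient_rows["personId"])

per_person = inpatient_days.groupby(level="personId").sum()
print(per_person)  # a -> 6, b -> 2
```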
We currently have a CSV file format for getting in claims data. Another useful format would be to take in data in FHIR JSON format. BlueButton is an example of a source like this. Example files are available for the BlueButton developer sandbox.
Does the model have an intrinsic age cap? We have a population of patients ranging from age 22 to 107. After running the xgboost model using the tutorial, we are seeing all patients aging 98 - 107 with a risk score of only 30, while patients age 92 all have the highest risk score at 73. Our sample sizes are large enough that we would expect to see at least some of the oldest patients in the highest risk category. This made us wonder if there is some kind of intrinsic age cap in the model.
File "/usr/local/lib/python3.6/dist-packages/pandas/core/frame.py", line 6928, in apply
return op.get_result()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 292, in apply_standard
self.apply_series_generator()
File "/usr/local/lib/python3.6/dist-packages/pandas/core/apply.py", line 321, in apply_series_generator
results[i] = self.f(v)
File "", line 16, in
inpatient_df['Inpatient Days'] = inpatient_df[['dischargeDate','admitDate']].apply(lambda x: (pd.to_datetime(x.dischargeDate) - pd.to_datetime(x.admitDate)).days, axis=1)
File "pandas/_libs/tslibs/c_timestamp.pyx", line 295, in pandas._libs.tslibs.c_timestamp._Timestamp.sub
TypeError: ('Timestamp subtraction must have the same timezones or no timezones', 'occurred at index 92')
Hello,
I tried running the prediction as follows and got this error. It seems to be because model["model"].feature_names is empty, resulting in the 'NoneType' error in the reorder_inputs function. Can you please help fix it? Thank you.
input_schema = resource_filename("cv19index", "resources/xgboost/input.csv.schema.json")
model = resource_filename("cv19index", "resources/xgboost/model.pickle")
model = resource_filename("cv19index", "resources/model_simple/model.pickle")
asOfDate = '2020-01-31'
fclaim = "data/claim_test.csv"
fdemo = "data/demo_test.csv"
output = "data/prediction_test.csv"
model_name = "xgboost"
do_run_claims(fdemo, fclaim, output, model_name, asOfDate, feature_file = None)
Traceback (most recent call last):
File "", line 12, in
do_run_claims(fdemo, fclaim, output, model_name, asOfDate, feature_file = None)
File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 431, in do_run_claims
predictions = run_xgb_model(input_df, model, quote = quote)
File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 359, in run_xgb_model
df_inputs = reorder_inputs(df_inputs, predictor)
File "C:\Users..\Python\Python37\site-packages\cv19index\predict.py", line 342, in reorder_inputs
if set(predictor["model"].feature_names) == set(df_inputs.columns) and predictor[
TypeError: 'NoneType' object is not iterable
from cv19index.io import read_model
model_name = "xgboost"
schema_fpath = resource_filename("cv19index", f"resources/{model_name}/input.csv.schema.json")
model_fpath = resource_filename("cv19index",f"resources/{model_name}/model.pickle")
model = read_model(model_fpath)
print(f'model["model"].feature_names: {model["model"].feature_names}')
Output:
model["model"].feature_names: None
Version:
cv19index: 1.1.4
xgboost: 1.4.0
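A guard along these lines inside reorder_inputs could avoid the NoneType crash. The function below is a simplified stand-in, not the repo's actual signature, and falling back to the DataFrame's own column order when the Booster carries no feature names is an assumption about the desired behavior.

```python
def reorder_columns_safe(df_columns, feature_names):
    """Pick the column order for prediction, tolerating a Booster whose
    feature_names came back as None from an old pickle."""
    if feature_names is None:
        # Nothing to reorder against; trust the preprocessed frame's order.
        return list(df_columns)
    if set(feature_names) == set(df_columns):
        return list(feature_names)
    raise ValueError("feature_names mismatch")

print(reorder_columns_safe(["age", "dx1"], None))            # ['age', 'dx1']
print(reorder_columns_safe(["age", "dx1"], ["dx1", "age"]))  # ['dx1', 'age']
```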
How are the questions asked here mapped to the diagnosis features dx0 to dx15 and ER visits?
How to run inference on the simpler regression model? What inputs/features to give etc?
What does the "flag" variable represent?
do_run(input_fpath, input_schema, model, output)
Traceback (most recent call last):
File "", line 1, in
do_run(input_fpath, input_schema, model, output)
File "..\cv19index\predict.py", line 360, in do_run
model = read_model(model_fpath)
File "..\cv19index\io.py", line 19, in read_model
return pickle.load(fobj)
File "C:\Users\cdhingr1\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\xgboost\core.py", line 981, in setstate
_check_call(_LIB.XGBoosterLoadModelFromBuffer(handle, ptr, length))
File "C:\Users\cdhingr1\AppData\Local\Continuum\anaconda3\envs\fastai\lib\site-packages\xgboost\core.py", line 176, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
XGBoostError: [15:34:02] C:\Jenkins\workspace\xgboost-win64_release_0.90\src\gbm\gbm.cc:20: Unknown gbm type
What is the mapping of gender label (1/2) to Male/Female?
I only see numeric values for gender in the example and couldn't find relevant mappings in this repo: https://github.com/closedloop-ai/cv19index/blob/master/examples/data/demographics.csv