emilyxia / hospital-information-analysis
ORIE4741 Fall Project
The project aims to help patients select a hospital based on their medical conditions and personal information. The dataset used in the project is from SPARCS for 2012.
Things that I like about the project:
- The visuals (charts and tables) are very clear, which helps the reader understand the dataset.
- It has a clear and reasonable explanation of the data-cleaning process.
- The discussion of Weapons of Math Destruction is interesting and detailed, and the reasoning is sound.
Suggestions / Potential Improvements:
- What is the size of the dataset? I see that the augmented dataset has 28 columns, but how many rows does it have?
- It seems like the random forest model outperforms linear models (with or without regularization) for all diseases; if that's the case, why do you say "The difference in models found to work for each of the diseases implies that we need to take disease characteristics into account, when modeling hospital charges"?
This project aims to predict average charges for a patient to visit a hospital, so patients can make better decisions about where to go for different reasons.
Things I liked about this midterm report:
Things that could be better:
The project is about helping a patient select a hospital based on their medical condition, personal information, and information about hospitals. The dataset is SPARCS hospital de-identified data for New York State. The goal of the project is to predict the cost of treating a patient with one of the three most common medical conditions given information about the patient and information about hospitals.
What they did well:
The team had interesting graphs and charts in the exploratory section, which showed large differences in both income and treatment cost across different counties, as well as differences in treatment cost by payment type.
The team also chose three top diseases, which helped in that they could focus on creating accurate predictive models for those diseases, without worrying about having disease as a confounding factor. Within those three diseases, they found that certain features had different importances across the diseases, which is interesting.
The team chose several different models to find both accurate classifiers and the features with the greatest importances for those classifiers; this was good, as they considered both linear models (multiple kinds of regressions) as well as non-linear models (random forests).
Improvements:
It would have been good if the team had addressed the differences in costs by payment type, as it seems like it might be a latent variable (corresponding to income or family hardship).
Summary:
The project is about helping patients decide which hospital to choose by recommending hospitals for different situations.
The first dataset they used is SPARCS, which includes patients’ information: characteristics, diagnoses, treatments, etc. The second is MEPS, which provides data on the cost of health care and insurance coverage.
Their objective is to investigate the choice of hospital for a patient with a required treatment and to improve hospital selection.
Three Likes:
Three Improvements:
1. The team could describe how they will test the effectiveness of the model, which is required in this report.
2. I am confused about how hospital selections are made based on total charge, so the team could explain this more clearly at the beginning.
3. It would be better to include the choice of models or algorithms and to explain in more depth how a specific hospital is chosen.
This project analyzes the total average charges of a hospital based on the disease a patient has, demographic information about the patient, and information about the hospital. The project primarily focuses on regression models such as Linear Regression, Lasso Regression, and Random Forests to predict this target variable.
Things I liked about the project:
The group tried several different models and compared the results for each model by recording both the train mean squared error and the test mean squared error. This made it clear to the reader which model performed the best, along with which models were overfitting.
The group took overfitting into account in their project, and argued that due to their limited number of features, it was likely that they could be overfitting. This was the motivation for why they used lasso regression and ridge regression.
The group analyzed the importance of various features with respect to the target variable by examining the coefficients of their ridge regression model and the feature importance scores of their random forest model. This was a good thing to do because it helped them realize that the "Length of Stay" feature seemed to have the largest correlation with their target variable.
The group's Fairness analysis was very in depth and they considered how different protected attributes varied with the target label.
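The comparison workflow described in these points (train vs. test MSE across linear models with and without regularization, plus a random forest) can be sketched as below. This is a minimal illustration on synthetic stand-in data, assuming scikit-learn; the real SPARCS features, targets, and hyperparameters are not reproduced here.

```python
# Hedged sketch: compare train/test MSE across the model families the group
# used. The data is synthetic -- NOT the actual SPARCS dataset.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # stand-in for patient/hospital features
y = 3.0 * X[:, 0] + rng.normal(size=500)    # stand-in for total charges

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # train MSE
        mean_squared_error(y_te, model.predict(X_te)),  # test MSE
    )
for name, (tr, te) in results.items():
    print(f"{name}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Recording both columns, as the group did, is what surfaces overfitting: a model whose train MSE is far below its test MSE is memorizing the training set.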
Things that could be improved:
The group focused a lot on model selection, but did not focus a lot on their feature representation. I wish there was more analysis on how they represented their features. I think they may have gotten accuracy improvements if they tried varying aspects of their feature vectors. Another thing they could have tried is using different ways to encode categorical variables, such as using hashing instead of one hot encoding, to shrink the size of their categorical variable vectors.
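The hashing alternative suggested above could look like the following sketch. The `diag_*` category values, the column contents, and the 32-bucket width are made-up assumptions for illustration, not the team's actual encoding.

```python
# Hedged sketch: feature hashing vs. one-hot encoding for a high-cardinality
# categorical column. Category names are invented for illustration.
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

# 1000 rows of a hypothetical diagnosis column with 300 distinct categories
diagnoses = np.array([[f"diag_{i % 300}"] for i in range(1000)])

one_hot = OneHotEncoder().fit_transform(diagnoses)      # 1000 x 300, sparse
hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[d[0]] for d in diagnoses])  # 1000 x 32, sparse

print(one_hot.shape, hashed.shape)  # hashing shrinks the width from 300 to 32
```

The trade-off is that hashing can collide distinct categories into the same bucket, so it trades a little signal for a much smaller feature vector.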
During model selection, I wish there was more discussion about hyperparameter tuning. For example, maybe they could have discussed their results from doing K-Fold cross validation to choose the optimal "max-depth" parameter for their Random Forest Regression.
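The K-fold tuning suggested here might be sketched as follows, assuming scikit-learn's `GridSearchCV`; the grid values and the synthetic data are illustrative assumptions, not the team's actual setup.

```python
# Hedged sketch: 5-fold cross-validation to choose max_depth for a random
# forest regressor. Data and parameter grid are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,                                  # K-fold cross-validation with K=5
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best max_depth:", search.best_params_["max_depth"])
```

Reporting the cross-validated score for each candidate depth would also let the reader see how sensitive the model is to this hyperparameter.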
During the feature importance analysis, it seemed that "length of stay" had the largest correlation with the target variable by far. I would have liked the group to have trained only on the "length of stay" feature, recorded the test error of that simple model, and compared it to their full models. This way they would be able to show that the other features give a definite improvement to their prediction accuracy.
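This single-feature baseline could be checked roughly as below. The data is synthetic, with column 0 standing in for "length of stay"; the column positions and coefficients are assumptions made for the sketch.

```python
# Hedged sketch: a model trained only on a "length of stay"-style column vs.
# one trained on all features. Synthetic data, not the team's dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# charges driven mostly by column 0 ("length of stay"), plus a weaker signal
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = LinearRegression().fit(X_tr[:, [0]], y_tr)  # length-of-stay only
full = LinearRegression().fit(X_tr, y_tr)              # all features

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te[:, [0]]))
mse_full = mean_squared_error(y_te, full.predict(X_te))
print(mse_baseline, mse_full)  # the gap shows what the other features add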
This project analyzes the average charges of a hospital based on patient features such as personal information, hospital indicators, demographic indicators, and location. The team started with feature selection in order to identify the most important features. Then they ran several regression models, such as Linear Regression, Lasso Regression, and Ridge Regression.
Things I like:
The team tried different regression models. Besides being a requirement, it is very useful to have results from different models on the same dataset; this leads to better performance on real-life projects. Comparing results via mean squared error made the model performance clearer, along with which models were over- or underfitting.
The team explored the features to see which ones would have the greatest impact on the models. They discovered that the "Length of Stay" feature had the largest correlation with their target variable, which is something you would expect given how hospitals charge fees.
Things to improve:
Summary
The goal of the project is to improve the decision-making process for patients when they select a hospital for their treatment, considering the insurance policy, cost, distance, and other factors, using data analysis. The project uses the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset, which provides details on patients and their treatments, as well as the MEPS data.
What I like about the proposal:
Potential Areas for improvement:
This project is about using the patient's relevant information to estimate the total medical charges. The major dataset comes from NY government and Kaggle, and it mainly provides details on patient characteristics, diagnoses, treatments, charges, etc. The long term goal is to develop a strategy to help patients find the most appropriate hospital based on their medical conditions and requirements.
Three things I like about the proposal:
Three areas for improvement:
Despite my suggestions, I don't think there is anything wrong with your report. The midterm report is very well-written and I believe you are on your way to a high-quality project.
The project is about using machine learning models to perform hospital selection for specific situations, like emergency cases, severe chronic diseases, and minor illnesses.
They are using data from the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset, which can be found at https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/u4ud-w55t and https://www.kaggle.com/c/pf2012 .
Their objective is to find the optimal hospital for a given condition after fully analyzing the SPARCS dataset.
Three things I liked:
Great job! This is a very interesting topic and you’ve been making major progress! This project is about hospital selection for emergency cases, based on patients’ medical conditions and other relevant features. The dataset comes from the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset.
Three things I like:
The report is detailed and clean.
The visualizations are easy to read.
Good machine learning strategies are used.
Three areas of concern:
For data cleaning, I have no idea why you transformed 120+ to 120. Are they identical?
By using one-hot encoding, the dimension of the feature space can become very large. Did you run into problems like the curse of dimensionality? I noticed that you dropped rows but not columns. If you have too many columns (many features), you may spend too much time training the model and may also have overfitting problems.
For the random forests part, the output of the model is the top 7 factors that influence the total charges in emergencies. However, the topic of the project is hospital selection for emergency cases, based on patients’ medical conditions and other relevant features. The report doesn't discuss the relation between hospital selection and total charges in emergencies.
This project addresses hospital selection depending on the patient's conditions. The team chose to use data from the Statewide Planning and Research Cooperative System and insurance data for their learning. The team overall did a good job assessing the objective and exploring the available learning features in the data.
Three things I like:
Three areas of concern:
This project addresses hospital selection for a given situation (emergency, chronic disease, small disease, etc.). Due to the complexities of the healthcare system in the US, selecting a hospital can be a stressful and difficult decision, especially during critical circumstances. Thus, the project seeks to study how to make hospital selection quicker and easier by analyzing publicly available patient data from the Statewide Planning and Research Cooperative System as well as health insurance information from MEPS.
Three things I like:
Three areas of concern:
This project is to help patients find the hospitals that best fit their situation - kind of disease, severity, cost, etc. The dataset is the SPARCS inpatient de-identified dataset, which contains patients' characteristics, diagnoses, treatments, charges, etc. Another dataset, MEPS, contains data on medical providers and employers for many families across the United States. The objective is that, given a patient's diagnosis, severity, and other features, the model will output the recommended hospital.
I like this project in the following ways:
However, I am concerned in the following aspects:
This project is about hospital information analysis. Namely, the group will examine all the given information about hospitals, such as who the patients are, what kind of illness they are encountering, etc. They are using the NY state Hospital Inpatient Discharges (SPARCS De-Identified): 2012 dataset as the source for this information, and MEPS as the source for health insurance coverage information. This project aims to give suggestions on hospital selection in various situations, considering many factors.
Overall, I really like the idea of this project.
Three things I like about this project:
The dataset is great. The source of the data is reliable and convincing. Most importantly, it is large enough to analyze, while there is still enough missing data to make the dataset realistically messy. This serves as a good foundation for this project.
The idea is good. Hospital information is always something that people care about. Especially in emergency cases, when people are sometimes not rational enough to make a decision, it is nice to have such a tool as an adviser for patients.
The feature space is broad. The group has taken many factors into consideration when monitoring this project, and this is essential. A wide coverage of feature space guarantees that the result output is more applicable in realistic settings.
Three things that concern me:
The dataset is too large. I went through the NY state hospital dataset and saw that there are 2.54 million rows. This is a concern to me, because large datasets strongly affect the runtime of your training program. Of course, this depends strongly on the complexity of your training model. Yet one of my previous projects, which had 40k input text records, sometimes took hours to train. This might require further techniques for managing caches or batches.
How to select the information. This point follows from the previous one. Since you have such a large dataset, what is your criterion for determining your feature space and data? Namely, how will you decide which portion of the data you will use and which portion you will not touch? In other words, how would you convince others that the portion of data you did not touch will not seriously affect the results? It would be better to include this in further statements.
Possible approaches? Just wondering, in general, which approach will you use to train your model? Will you train a more complex model, or just use linear regression? If you train a model, which classifier will you use? You mentioned that you want to give suggestions to patients in various situations; will you then group the hospitals according to some metric? If so, how would you solve that clustering problem?
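One way to make the grouping question concrete is the sketch below, which clusters hospitals by per-hospital summary statistics. The choice of features (mean total charge and mean length of stay), the two-cluster count, and all numbers are assumptions for illustration only.

```python
# Hedged sketch: grouping hospitals with k-means on per-hospital summary
# statistics. All feature choices and values are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# rows = hospitals; columns = [mean total charge, mean length of stay]
hospital_stats = np.vstack([
    rng.normal([10_000, 3], [1_000, 0.5], size=(20, 2)),  # cheaper, shorter stays
    rng.normal([40_000, 8], [3_000, 1.0], size=(20, 2)),  # pricier, longer stays
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(hospital_stats)
print(labels)  # cluster assignment for each of the 40 hospitals
```

In practice the features would need scaling (charges dominate length of stay in magnitude), and the number of clusters would itself be something to justify, e.g. by a silhouette score.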