emilyxia / hospital-information-analysis
ORIE4741 Fall Project
The project aims to help patients select a hospital based on their medical conditions and personal information. The dataset used in the project is from SPARCS for 2012.
Things that I like about the project:
- The visuals (charts and tables) are very clear, which helps the reader understand the dataset.
- It has a clear and reasonable explanation of the data-cleaning process.
- The discussion of Weapons of Math Destruction is interesting and detailed, and the reasoning is sound.
Suggestions / Potential Improvements:
- What is the size of the dataset? I see that the augmented dataset has 28 columns, but how many rows does it have?
- It seems like the random forest model outperforms linear models (with or without regularization) for all diseases; if that's the case, why do you say "The difference in models found to work for each of the diseases implies that we need to take disease characteristics into account, when modeling hospital charges"?
This project aims to predict average charges for a patient to visit a hospital, so patients can make better decisions about where to go for different reasons.
Things I liked about this midterm report:
Things that could be better:
The project is about helping a patient select a hospital based on their medical condition, personal information, and information about hospitals. The dataset is SPARCS hospital de-identified data for New York State. The goal of the project is to predict the cost of treating a patient with one of the three most common medical conditions given information about the patient and information about hospitals.
What they did well:
The team had interesting graphs and charts in the exploratory section, which showed large differences in both income and treatment cost across different counties, as well as differences in treatment cost by payment type.
The team also chose three top diseases, which helped in that they could focus on creating accurate predictive models for those diseases, without worrying about having disease as a confounding factor. Within those three diseases, they found that certain features had different importances across the diseases, which is interesting.
The team chose several different models to find both accurate classifiers and the features with the greatest importances for those classifiers; this was good, as they considered both linear models (multiple kinds of regressions) as well as non-linear models (random forests).
Improvements:
It would have been good if the team had addressed the differences in costs by payment type, as it seems like it might be a latent variable (corresponding to income or family hardship).
Summary:
The project is about helping patients decide which hospital to choose by recommending hospitals for different situations.
The first dataset they used is SPARCS, which includes patients’ information: characteristics, diagnoses, treatments, etc. The second is MEPS, which provides data on the cost of health care and insurance coverage.
Their objective is to investigate the choice of hospital for a patient with a required treatment and to improve hospital selection.
Three Likes:
Three Improvements:
1. The team could describe how they will test the effectiveness of the model, which is required in this report.
2. I am confused about how hospital selections are made based on total charge, so the team could explain this more clearly at the beginning.
3. It would be better to include the choice of models or algorithms and to explain in more depth how a specific hospital is chosen.
This project analyzes the total average charges of a hospital based on the disease a patient has, demographic information about the patient, and information about the hospital. The project primarily focuses on regression models such as Linear Regression, Lasso Regression, and Random Forests to predict this target variable.
Things I liked about the project:
The group tried several different models and compared the results for each model by recording both the train mean squared error and the test mean squared error. This made it clear to the reader which model performed the best, along with which models were overfitting.
The group took overfitting into account in their project, and argued that due to their limited number of features, it was likely that they could be overfitting. This was the motivation for why they used lasso regression and ridge regression.
The group analyzed the importance of various features with respect to the target variable by examining the coefficients of their ridge regression model and the feature importance scores of their random forest model. This was a good thing to do because it helped them realize that the "Length of Stay" feature seemed to have the largest correlation with their target variable.
The group's Fairness analysis was very in depth and they considered how different protected attributes varied with the target label.
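The comparison workflow described in these points (train vs. test MSE across linear models with and without regularization, plus a random forest) can be sketched as below. This is a minimal illustration on synthetic stand-in data, assuming scikit-learn; the real SPARCS features, targets, and hyperparameters are not reproduced here.

```python
# Hedged sketch: compare train/test MSE across the model families the group
# used. The data is synthetic -- NOT the actual SPARCS dataset.
import numpy as np
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # stand-in for patient/hospital features
y = 3.0 * X[:, 0] + rng.normal(size=500)    # stand-in for total charges

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=0.1),
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    results[name] = (
        mean_squared_error(y_tr, model.predict(X_tr)),  # train MSE
        mean_squared_error(y_te, model.predict(X_te)),  # test MSE
    )
for name, (tr, te) in results.items():
    print(f"{name}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Recording both columns, as the group did, is what surfaces overfitting: a model whose train MSE is far below its test MSE is memorizing the training set.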
Things that could be improved:
The group focused a lot on model selection, but did not focus a lot on their feature representation. I wish there was more analysis on how they represented their features. I think they may have gotten accuracy improvements if they tried varying aspects of their feature vectors. Another thing they could have tried is using different ways to encode categorical variables, such as using hashing instead of one hot encoding, to shrink the size of their categorical variable vectors.
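The hashing alternative suggested above could look like the following sketch. The `diag_*` category values, the column contents, and the 32-bucket width are made-up assumptions for illustration, not the team's actual encoding.

```python
# Hedged sketch: feature hashing vs. one-hot encoding for a high-cardinality
# categorical column. Category names are invented for illustration.
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.preprocessing import OneHotEncoder

# 1000 rows of a hypothetical diagnosis column with 300 distinct categories
diagnoses = np.array([[f"diag_{i % 300}"] for i in range(1000)])

one_hot = OneHotEncoder().fit_transform(diagnoses)      # 1000 x 300, sparse
hasher = FeatureHasher(n_features=32, input_type="string")
hashed = hasher.transform([[d[0]] for d in diagnoses])  # 1000 x 32, sparse

print(one_hot.shape, hashed.shape)  # hashing shrinks the width from 300 to 32
```

The trade-off is that hashing can collide distinct categories into the same bucket, so it trades a little signal for a much smaller feature vector.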
During model selection, I wish there was more discussion about hyperparameter tuning. For example, maybe they could have discussed their results from doing K-Fold cross validation to choose the optimal "max-depth" parameter for their Random Forest Regression.
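The K-fold tuning suggested here might be sketched as follows, assuming scikit-learn's `GridSearchCV`; the grid values and the synthetic data are illustrative assumptions, not the team's actual setup.

```python
# Hedged sketch: 5-fold cross-validation to choose max_depth for a random
# forest regressor. Data and parameter grid are invented for illustration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=300)

search = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,                                  # K-fold cross-validation with K=5
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print("best max_depth:", search.best_params_["max_depth"])
```

Reporting the cross-validated score for each candidate depth would also let the reader see how sensitive the model is to this hyperparameter.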
During the feature importance analysis, it seemed that "length of stay" had the largest correlation with the target variable by far. I would have liked the group to have trained only on the "length of stay" feature, recorded the test error of that simple model, and compared it to their full models. This way they would be able to show that the other features give a definite improvement to their prediction accuracy.
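This single-feature baseline could be checked roughly as below. The data is synthetic, with column 0 standing in for "length of stay"; the column positions and coefficients are assumptions made for the sketch.

```python
# Hedged sketch: a model trained only on a "length of stay"-style column vs.
# one trained on all features. Synthetic data, not the team's dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
# charges driven mostly by column 0 ("length of stay"), plus a weaker signal
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = LinearRegression().fit(X_tr[:, [0]], y_tr)  # length-of-stay only
full = LinearRegression().fit(X_tr, y_tr)              # all features

mse_baseline = mean_squared_error(y_te, baseline.predict(X_te[:, [0]]))
mse_full = mean_squared_error(y_te, full.predict(X_te))
print(mse_baseline, mse_full)  # the gap shows what the other features add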
This project analyzes the average charges of a hospital based on patient features such as personal information, hospital indicators, demographic indicators, and location. The team started with feature selection in order to identify the most important features. Then they ran several regression models, such as Linear Regression, Lasso Regression, and Ridge Regression.
Things I like:
The team tried different regression models. Besides being a requirement, it is very useful to have results from different models on the same dataset; this leads to better performance on real-life projects. Comparing results via mean squared error made the model performance clearer, along with which models were over- or underfitting.
The team explored the features to see which ones would have the greatest impact on the models. They discovered that the "Length of Stay" feature had the largest correlation with their target variable, which is something you would expect given how hospitals charge fees.
Things to improve:
Summary
The goal of the project is to improve the decision-making process for patients when they select a hospital for their treatment, considering the insurance policy, cost, distance, and other factors, using data analysis. The project uses the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset, which provides details on patients and their treatments, as well as the MEPS data.
What I like about the proposal:
Potential Areas for improvement:
This project is about using the patient's relevant information to estimate the total medical charges. The major dataset comes from NY government and Kaggle, and it mainly provides details on patient characteristics, diagnoses, treatments, charges, etc. The long term goal is to develop a strategy to help patients find the most appropriate hospital based on their medical conditions and requirements.
Three things I like about the proposal:
Three areas for improvement:
Despite my suggestions, I don't think there is anything wrong with your report. The midterm report is very well-written and I believe you are on your way to a high-quality project.
The project is about using machine learning models to perform hospital selection for specific situations, like emergency cases, severe chronic diseases, and minor illnesses.
They are using data from the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset, which can be found at https://health.data.ny.gov/Health/Hospital-Inpatient-Discharges-SPARCS-De-Identified/u4ud-w55t and https://www.kaggle.com/c/pf2012 .
Their objective is to find the optimal hospital for a given condition after fully analyzing the SPARCS dataset.
Three things I liked:
Great job! This is a very interesting topic and you’ve been making major progress! This project is about hospital selection for emergency cases, based on patients’ medical conditions and other relevant features. The dataset comes from the Statewide Planning and Research Cooperative System (SPARCS) inpatient de-identified dataset.
Three things I like:
The report is detailed and clean.
The visualizations are easy to read.
Good machine learning strategies are used.
Three areas of concern:
For data cleaning, I have no idea why you transformed 120+ to 120. Are they identical?
By using one-hot encoding, the dimension of the feature space can become very large. Did you run into problems like the curse of dimensionality? I noticed that you dropped rows but not columns. If you have too many columns (many features), you may spend too much time training the model and may also have overfitting problems.
For the random forests part, the output of the model is the top 7 factors that influence the total charges in emergencies. However, the topic of the project is hospital selection for emergency cases, based on patients’ medical conditions and other relevant features. The report doesn't discuss the relation between hospital selection and total charges in emergencies.
This project addresses hospital selection depending on the patient's conditions. The team chose to use data from the Statewide Planning and Research Cooperative System and insurance data for their learning. The team overall did a good job assessing the objective and exploring the available learning features in the data.
Three things I like:
Three areas of concern:
This project addresses hospital selection for a given situation (emergency, chronic disease, small disease, etc.). Due to the complexities of the healthcare system in the US, selecting a hospital can be a stressful and difficult decision, especially during critical circumstances. Thus, the project seeks to study how to make hospital selection quicker and easier by analyzing publicly available patient data from the Statewide Planning and Research Cooperative System as well as health insurance information from MEPS.
Three things I like:
Three areas of concern:
This project is to help patients find the hospitals that best fit their situation - kind of disease, severity, cost, etc. The dataset is the SPARCS inpatient de-identified dataset, which contains patients' characteristics, diagnoses, treatments, charges, etc. Another dataset, MEPS, contains data on medical providers and employers for many families across the United States. The objective is that, given a patient's diagnosis, severity, and other features, the model will output the recommended hospital.
I like this project in the following ways:
However, I am concerned in the following aspects:
This project is about hospital information analysis. Namely, the group will examine all the given information about hospitals, such as who the patients are, what kind of illness they are encountering, etc. They are using the NY state Hospital Inpatient Discharges (SPARCS De-Identified): 2012 dataset as the source for this information, and MEPS as the source for health insurance coverage information. This project aims to give suggestions on hospital selection in various situations, considering many factors.
Overall, I really like the idea of this project.
Three things I like about this project:
The dataset is great. The source of the data is reliable and convincing. Most importantly, it is large enough to analyze, while there is still enough missing data to make the dataset realistically messy. This serves as a good foundation for this project.
The idea is good. Hospital information is always something that people care about. Especially in emergency cases, when people are sometimes not rational enough to make a decision, it is nice to have such a tool as an adviser for patients.
The feature space is broad. The group has taken many factors into consideration when monitoring this project, and this is essential. A wide coverage of feature space guarantees that the result output is more applicable in realistic settings.
Three things that concern me:
The dataset is too large. I went through the NY state hospital dataset and saw that there are 2.54 million rows. This is a concern to me, because large datasets strongly affect the runtime of your training program. Of course, this depends strongly on the complexity of your training model. Yet one of my previous projects, which had 40k input text records, sometimes took hours to train. This might require further techniques for managing caches or batches.
How to select the information. This point follows from the previous one. Since you have such a large dataset, what is your criterion for determining your feature space and data? Namely, how will you decide which portion of the data you will use and which portion you will not touch? In other words, how would you convince others that the portion of data you did not touch will not seriously affect the results? It would be better to include this in further statements.
Possible approaches? Just wondering, in general, which approach will you use to train your model? Will you train a more complex model, or just use linear regression? If you train a model, which classifier will you use? You mentioned that you want to give suggestions to patients in various situations; will you then group the hospitals according to some metric? If so, how would you solve that clustering problem?
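One way to make the grouping question concrete is the sketch below, which clusters hospitals by per-hospital summary statistics. The choice of features (mean total charge and mean length of stay), the two-cluster count, and all numbers are assumptions for illustration only.

```python
# Hedged sketch: grouping hospitals with k-means on per-hospital summary
# statistics. All feature choices and values are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# rows = hospitals; columns = [mean total charge, mean length of stay]
hospital_stats = np.vstack([
    rng.normal([10_000, 3], [1_000, 0.5], size=(20, 2)),  # cheaper, shorter stays
    rng.normal([40_000, 8], [3_000, 1.0], size=(20, 2)),  # pricier, longer stays
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(hospital_stats)
print(labels)  # cluster assignment for each of the 40 hospitals
```

In practice the features would need scaling (charges dominate length of stay in magnitude), and the number of clusters would itself be something to justify, e.g. by a silhouette score.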