wgova Goto Github PK

followers: 21.0 following: 45.0 repos: 41.0 gists: 0.0

Name: Webster Gova

Type: User

Company: Fin-Tech

Bio: Decision Scientist. Passionate about advancement of Sub_Saharan startups in FinTech/EdTech/MedTech/EconTech solutions

Location: South Africa

Data scientist by day, PhD reader by night, bit of a nocturnal nerd. Regular guy with strokes of grey, ..... 👩🏾‍💻

Blog posts

💬 Clustering African countries: Covid-19 real time R0

🔭 Nowcasting Covid-19 cases

Machine learning tutorials

Iris dataset K-means clustering

- ......Loading financial time series predictions

You can also connect with me on:

Languages and Tools:

Webster Gova's Projects

automations

Intended for re-useable scripts in data preparation and plotting functions

awesome-public-datasets

A topic-centric list of high-quality open datasets in public domains. New PR ☛☛☛

candle-stick-screener-for-nifty-companies

candle stick screener using yfinance and ta-lib library in flask

courses

Course materials for the Data Science Specialization: https://www.coursera.org/specialization/jhudatascience/1

coworking-analysis

Creative economy and the future of work through the lens of meetup.

docker-for-data-science-tutorial

Materials for "Docker for Data Science" tutorial presented at PyCon 2018 in Cleveland, OH

gameofthrones---network-analysis

An analysis of game of thrones characters to predict the most important ones from book 1 to book5

LOGISTIC REGRESSION - HEART DISEASE PREDICTION Introduction World Health Organization has estimated 12 million deaths occur worldwide, every year due to Heart diseases. Half the deaths in the United States and other developed countries are due to cardio vascular diseases. The early prognosis of cardiovascular diseases can aid in making decisions on lifestyle changes in high risk patients and in turn reduce the complications. This research intends to pinpoint the most relevant/risk factors of heart disease as well as predict the overall risk using logistic regression Data Preparation Source The dataset is publically available on the Kaggle website, and it is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The classification goal is to predict whether the patient has 10-year risk of future coronary heart disease (CHD).The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes. Variables Each attribute is a potential risk factor. There are both demographic, behavioral and medical risk factors. Demographic: • Sex: male or female(Nominal) • Age: Age of the patient;(Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous) Behavioral • Current Smoker: whether or not the patient is a current smoker (Nominal) • Cigs Per Day: the number of cigarettes that the person smoked on average in one day.(can be considered continuous as one can have any number of cigarettes, even half a cigarette.) Medical( history) • BP Meds: whether or not the patient was on blood pressure medication (Nominal) • Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal) • Prevalent Hyp: whether or not the patient was hypertensive (Nominal) • Diabetes: whether or not the patient had diabetes (Nominal) Medical(current) • Tot Chol: total cholesterol level (Continuous) • Sys BP: systolic blood pressure (Continuous) • Dia BP: diastolic blood pressure (Continuous) • BMI: Body Mass Index (Continuous) • Heart Rate: heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.) • Glucose: glucose level (Continuous) Predict variable (desired target) • 10 year risk of coronary heart disease CHD (binary: “1”, means “Yes”, “0” means “No”) Logistic Regression Logistic regression is a type of regression analysis in statistics used for prediction of outcome of a categorical dependent variable from a set of predictor or independent variables. In logistic regression the dependent variable is always binary. Logistic regression is mainly used to for prediction and also calculating the probability of success. The results above show some of the attributes with P value higher than the preferred alpha(5%) and thereby showing low statistically significant relationship with the probability of heart disease. Backward elimination approach is used here to remove those attributes with highest P-value one at a time followed by running the regression repeatedly until all attributes have P Values less than 0.05. Feature Selection: Backward elimination (P-value approach) Logistic regression equation P=eβ0+β1X1/1+eβ0+β1X1P=eβ0+β1X1/1+eβ0+β1X1 When all features plugged in: logit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucoselogit(p)=log(p/(1−p))=β0+β1∗Sexmale+β2∗age+β3∗cigsPerDay+β4∗totChol+β5∗sysBP+β6∗glucose Interpreting the results: Odds Ratio, Confidence Intervals and P-values • This fitted model shows that, holding all other features constant, the odds of getting diagnosed with heart disease for males (sex_male = 1)over that of females (sex_male = 0) is exp(0.5815) = 1.788687. In terms of percent change, we can say that the odds for males are 78.8% higher than the odds for females. • The coefficient for age says that, holding all others constant, we will see 7% increase in the odds of getting diagnosed with CDH for a one year increase in age since exp(0.0655) = 1.067644. • Similarly , with every extra cigarette one smokes thers is a 2% increase in the odds of CDH. • For Total cholesterol level and glucose level there is no significant change. • There is a 1.7% increase in odds for every unit increase in systolic Blood Pressure. Model Evaluation - Statistics From the above statistics it is clear that the model is highly specific than sensitive. The negative values are predicted more accurately than the positives. Predicted probabilities of 0 (No Coronary Heart Disease) and 1 ( Coronary Heart Disease: Yes) for the test data with a default classification threshold of 0.5 lower the threshold Since the model is predicting Heart disease too many type II errors is not advisable. A False Negative ( ignoring the probability of disease when there actually is one) is more dangerous than a False Positive in this case. Hence in order to increase the sensitivity, threshold can be lowered. Conclusions • All attributes selected after the elimination process show P-values lower than 5% and thereby suggesting significant role in the Heart disease prediction. • Men seem to be more susceptible to heart disease than women. Increase in age, number of cigarettes smoked per day and systolic Blood Pressure also show increasing odds of having heart disease • Total cholesterol shows no significant change in the odds of CHD. This could be due to the presence of 'good cholesterol(HDL) in the total cholesterol reading. Glucose too causes a very negligible change in odds (0.2%) • The model predicted with 0.88 accuracy. The model is more specific than sensitive. Overall model could be improved with more data

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.