Project 4

Problem Statement
Data Dictionary
Preprocessing and Modeling
Conclusion and Recommendations
Datasets

Problem Statement

Using data from the UC Irvine Machine Learning Repository, build a model that can predict whether a person's income exceeds $50,000 given certain profile information.

Data Dictionary

Feature	Type	Dataset	Description
age	int64	large_train_sample	person's age
education-num	int64	large_train_sample	years of education
capital-gain	int64	large_train_sample	capital gained
capital-loss	int64	large_train_sample	capital lost
hours-per-week	int64	large_train_sample	average hours worked per week
native-country	int64	large_train_sample	1: from United States, 0: other, engineered feature
wage	int64	large_train_sample	1: made over 50k, 0: made under 50k, engineered feature
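relationship_Other-relative	int64	large_train_sample	1: relationship is Other-relative, 0: otherwise, engineered feature
relationship_Own-child	int64	large_train_sample	1: relationship is Own-child, 0: otherwise, engineered feature
relationship_Unmarried	int64	large_train_sample	1: relationship is Unmarried, 0: otherwise, engineered feature
relationship_Wife	int64	large_train_sample	1: relationship is Wife, 0: otherwise, engineered feature
workclass_Local-gov	int64	large_train_sample	1: workclass is Local-gov, 0: otherwise, engineered feature
workclass_Private	int64	large_train_sample	1: workclass is Private, 0: otherwise, engineered feature
workclass_Self-emp-inc	int64	large_train_sample	1: workclass is Self-emp-inc, 0: otherwise, engineered feature
workclass_Self-emp-not-inc	int64	large_train_sample	1: workclass is Self-emp-not-inc, 0: otherwise, engineered feature
workclass_State-gov	int64	large_train_sample	1: workclass is State-gov, 0: otherwise, engineered feature
workclass_ Without-pay	int64	large_train_sample	1: workclass is Without-pay, 0: otherwise, engineered feature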

Preprocessing and Modeling

Data cleaning included dropping the following features: final weight, education, marital status, and occupation. Null and non-numerical values were dropped. The wage and native country columns were binarized, and the sex, relationship, and workclass columns were dummy-encoded.
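The cleaning steps described above could look roughly like the following pandas sketch. It is not the original notebook: the raw column names (`fnlwgt`, `marital-status`, and so on), the `?` marker for missing values, the `>50K` label, and the use of `drop_first=True` are all assumptions made for illustration, and the four-row frame is made-up data.

```python
import pandas as pd

# Assumed raw column names; the README only gives the descriptive names.
DROPPED = ["fnlwgt", "education", "marital-status", "occupation"]
CATEGORICAL = ["workclass", "relationship", "sex", "native-country", "wage"]
DUMMIED = ["sex", "relationship", "workclass"]


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, remove incomplete rows, binarize, and dummy-encode."""
    df = raw.drop(columns=DROPPED).copy()
    # Strip stray whitespace so comparisons such as == ">50K" behave as expected.
    for col in CATEGORICAL:
        df[col] = df[col].str.strip()
    # Assumption: "?" marks a missing categorical value.
    missing = df[CATEGORICAL].eq("?").any(axis=1)
    df = df[~missing].dropna().copy()
    # Binarize the target and the native-country column.
    df["wage"] = (df["wage"] == ">50K").astype(int)
    df["native-country"] = (df["native-country"] == "United-States").astype(int)
    # One-hot encode; dropping the first level of each column is an assumption.
    return pd.get_dummies(df, columns=DUMMIED, drop_first=True, dtype=int)


# Made-up rows, only to show the shape of the output.
raw = pd.DataFrame({
    "age": [39, 50, 28, 44],
    "fnlwgt": [1, 2, 3, 4],
    "education": ["Bachelors", "HS-grad", "Masters", "Bachelors"],
    "marital-status": ["Never-married", "Married", "Married", "Married"],
    "occupation": ["Adm-clerical", "Sales", "Tech-support", "Sales"],
    "workclass": [" State-gov", " Private", " ?", " Private"],
    "relationship": [" Not-in-family", " Husband", " Wife", " Wife"],
    "sex": [" Male", " Male", " Female", " Female"],
    "native-country": [" United-States", " Cuba", " United-States", " United-States"],
    "wage": [" <=50K", " >50K", " <=50K", " >50K"],
})
out = preprocess(raw)
print(out.columns.tolist())
```

Dropping one level per dummied column avoids perfectly collinear indicator columns, which matters little for tree ensembles but keeps the feature count down.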

The models used were a Random Forest Classifier, an XGBoost Classifier, and an XGBoost-Dart Classifier. First, a baseline score was established; I then used GridSearchCV to identify the best scores and parameters for each model. Each model was trained on 18 features.
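As an illustration of the baseline-then-grid-search workflow, here is a minimal scikit-learn sketch for the Random Forest case. The data is synthetic (`make_classification` standing in for the 18-feature matrix), and the parameter grid, fold count, and random seeds are hypothetical, since the values actually searched are not listed here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 18-feature design matrix (not the census data).
X, y = make_classification(
    n_samples=400, n_features=18, weights=[0.75, 0.25], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Baseline: accuracy obtained by always predicting the majority class.
positive_rate = y_train.mean()
baseline = max(positive_rate, 1 - positive_rate)

# Hypothetical grid; the real search space is an assumption.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(f"baseline accuracy: {baseline:.3f}")
print(f"best params: {search.best_params_}")
print(f"best CV accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

The same pattern should carry over to the XGBoost models by swapping in `xgboost.XGBClassifier`; the DART variant is selected with `booster="dart"`.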

Model	Training Accuracy	Testing Accuracy	Cross-Val Score
Random Forest	0.846	0.837	0.842
XGBoost Classifier	0.860	0.853	0.856
XGBoost-Dart Classifier	0.872	0.853	0.854
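For reference, the three score columns in a table like this can be produced with a few scikit-learn calls. Note that for classifiers `score` returns mean accuracy. The sketch below uses synthetic data and hypothetical hyperparameters, so its numbers will not match the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; not the census sample.
X, y = make_classification(n_samples=400, n_features=18, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Hypothetical hyperparameters, chosen only to keep the example fast.
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
model.fit(X_train, y_train)

scores = {
    "train": model.score(X_train, y_train),  # accuracy on the training split
    "test": model.score(X_test, y_test),     # accuracy on the held-out split
    # Mean accuracy over 5 folds of the training split (refits clones of the model).
    "cross_val": cross_val_score(model, X_train, y_train, cv=5).mean(),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

A training score noticeably above the testing and cross-validation scores is the usual sign of overfitting.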

Conclusion and Recommendations

In summary, all three models predicted wages under 50k better than wages over 50k. Precision scores were similar across the three models; the best, 0.80, came from the Random Forest Classifier. The model with the highest accuracy was the XGBoost Classifier, at 0.8529.
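To make the per-class comparison concrete, the following sketch computes accuracy plus precision and recall for each class with scikit-learn. The labels are a made-up toy example (1 = over 50k, 0 = under 50k), not model output from this project.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels for illustration only: 1 = over 50k, 0 = under 50k.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)

# Precision: of the rows predicted as a class, how many really belong to it.
precision_over = precision_score(y_true, y_pred, pos_label=1)
precision_under = precision_score(y_true, y_pred, pos_label=0)

# Recall: of the rows that belong to a class, how many were found.
recall_over = recall_score(y_true, y_pred, pos_label=1)
recall_under = recall_score(y_true, y_pred, pos_label=0)

print(f"accuracy: {accuracy:.3f}")
print(f"over 50k  precision={precision_over:.3f} recall={recall_over:.3f}")
print(f"under 50k precision={precision_under:.3f} recall={recall_under:.3f}")
```

In this toy case the under-50k class scores higher on both metrics, which is the kind of gap the summary describes; with imbalanced classes, overall accuracy alone can hide it.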

Datasets

large_train_sample.csv: Train Sample Data
test_data.csv: Test Data
sub.csv: Submission Data

anthonyzurke / project-4
