Predicting House Prices in Ames, Iowa using Machine Learning

Data set from the House Prices: Advanced Regression Techniques competition on Kaggle: https://www.kaggle.com/c/house-prices-advanced-regression-techniques/

Machine Learning Project for NYC Data Science Academy Cohort 15

Overview

The data was provided as train.csv (observations with sale prices) and test.csv (new observations with the same features but no SalePrice). We cleaned and processed train.csv, performed feature engineering with several methods, optimized model hyperparameters using K-fold cross-validation, stacked the resulting models based on train/test performance, and wrote the final CSV for Kaggle submission.

We performed two different runs with our data:

Run 1 -- Cleaned/processed train.csv and test.csv separately. Performed no manual feature removal or creation. Obtained candidate feature lists via AIC, Lasso, and Gradient Boosting, and manually curated these lists based on the optimal score in train/test modeling.

Test Score (RMSLE): 0.113, Kaggle Score: 0.130

Run 2 -- Cleaned/processed train.csv and test.csv together. Performed manual feature removal/alteration based on a detailed reading of the data description file. Obtained a feature list using the AIC minimization method and applied it unchanged in model development.

Test Score (RMSLE): 0.122, Kaggle Score: 0.118
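
Both scores use the competition metric, root mean squared logarithmic error (RMSLE), which is simply RMSE computed on log-transformed prices. A minimal sketch of the metric:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, the competition metric."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))
```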

Repository Contents

Data cleaning and processing:

Run 1:

AmesDF_Processing -- Processing/cleaning of train.csv only, without additional feature removal/creation. Associated with Run 1.

Ames_Final_Feature_DF -- Script to create the final training DataFrame with the manually-tested feature list. Associated with Run 1.

Ames_TestSet1_Processing_Run1_XGB_Submission -- Processing/cleaning of test.csv and application of the stacked (linear/GBoost/random forest) model to the test data, using the feature list computed in Run 1.

Run 2:

Ames_TestTrain_Processing_Run2_Submission -- Joint processing of test.csv and train.csv, and creation and submission of the second prediction CSV for Kaggle.
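
A minimal sketch of the submission step, assuming the Id/SalePrice format the competition expects; the inverse log transform is an assumption (a common choice given the RMSLE metric), not something confirmed by the repo:

```python
import numpy as np
import pandas as pd

def write_submission(test_ids, log_preds, path="submission.csv"):
    # expm1 inverts a log1p-transformed target; skip it if the model
    # predicts raw prices (an assumption, not confirmed by the repo).
    out = pd.DataFrame({"Id": test_ids, "SalePrice": np.expm1(log_preds)})
    out.to_csv(path, index=False)
    return out
```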

Feature Engineering:

AmesDF_Multicollinearity_Feature_Reduction -- Checking the processed DataFrame for signs of multicollinearity and correlation between features; also running the AIC feature selection algorithm used in Run 1.
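
A sketch of one common multicollinearity diagnostic, variance inflation factors via statsmodels; the exact checks in the notebook may differ:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(X: pd.DataFrame) -> pd.DataFrame:
    """Variance inflation factor per feature; values above roughly
    5-10 are a common rule-of-thumb flag for multicollinearity."""
    Xc = sm.add_constant(X)
    vifs = [variance_inflation_factor(Xc.values, i)
            for i in range(1, Xc.shape[1])]  # index 0 is the constant
    return pd.DataFrame({"feature": X.columns, "VIF": vifs})
```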

AmesDF_FeatureSelection_AfterManualRemoval -- Multicollinearity and correlation check of the processed data from Run 2, and feature engineering for Run 2 using a random feature-subset search to minimize AIC.
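
A minimal sketch of a random-subset AIC search, assuming one OLS fit per candidate subset; the function name, subset sizes, and iteration count are illustrative:

```python
import numpy as np
import statsmodels.api as sm

def random_aic_search(X, y, n_iter=500, seed=0):
    """Sample random feature subsets and keep the one whose OLS fit
    minimizes AIC."""
    rng = np.random.default_rng(seed)
    cols = list(X.columns)
    best_aic, best_cols = np.inf, cols
    for _ in range(n_iter):
        k = int(rng.integers(1, len(cols) + 1))
        subset = list(rng.choice(cols, size=k, replace=False))
        aic = sm.OLS(y, sm.add_constant(X[subset])).fit().aic
        if aic < best_aic:
            best_aic, best_cols = aic, subset
    return best_cols, best_aic
```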

Feature_Engineering_Lasso -- Use of the Lasso algorithm to choose optimal lambda values and build an exclusion list of features whose coefficients shrink to 0 under the regularization penalty. Features were normalized with min/max scaling to avoid scale bias in the penalty.
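
A sketch of this selection step with sklearn, assuming `X` is the processed feature frame and `y` the (possibly log-transformed) target; the function name is hypothetical:

```python
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import MinMaxScaler

def lasso_exclusion_list(X: pd.DataFrame, y) -> list:
    """Fit LassoCV on min/max-scaled features and return the features
    whose coefficients the L1 penalty drives to zero."""
    X_scaled = MinMaxScaler().fit_transform(X)
    lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y)
    return [c for c, w in zip(X.columns, lasso.coef_) if w == 0.0]
```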

Feature_Engineering_GradientBoost -- Use of GradientBoostingRegressor to determine a relevant candidate feature list via the feature_importances_ attribute.
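
A minimal sketch of importance-based ranking with sklearn; the hyperparameters are defaults, not necessarily those used here:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def gboost_top_features(X: pd.DataFrame, y, n_top=30) -> pd.Series:
    """Rank features by the fitted model's feature_importances_
    (impurity-based importance) and keep the top n_top."""
    gbm = GradientBoostingRegressor(random_state=0).fit(X, y)
    return (pd.Series(gbm.feature_importances_, index=X.columns)
              .sort_values(ascending=False)
              .head(n_top))
```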

Modeling and hyperparameter optimization:

Modeling_Linear -- K-Fold testing of multiple linear regression using features from AIC optimization
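
A sketch of the K-fold evaluation, assuming the target is already log-transformed so that fold RMSE equals RMSLE on raw prices:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def kfold_rmse_linear(X, y_log, n_splits=5):
    """Mean K-fold RMSE of plain linear regression; with a
    log-transformed target this equals RMSLE on the raw prices."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = cross_val_score(LinearRegression(), X, y_log, cv=cv,
                             scoring="neg_root_mean_squared_error")
    return -scores.mean()
```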

Modeling_Ridge_AllStandardized -- Choosing the optimal lambda hyperparameter via train/test modeling, using the features with non-zero coefficients from the Lasso feature engineering step.
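
A sketch of the lambda sweep using sklearn's RidgeCV (sklearn calls the penalty strength `alpha`); the grid is illustrative:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

def tune_ridge_lambda(X_scaled, y):
    """Sweep the penalty strength over a log-spaced grid and return
    the value chosen by cross-validation."""
    ridge = RidgeCV(alphas=np.logspace(-3, 3, 50), cv=5).fit(X_scaled, y)
    return ridge.alpha_
```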

Modeling_RandomForest -- K-Fold testing and hyperparameter optimization using the RandomForestRegressor algorithm from sklearn, using feature list generated from AIC optimization
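
A sketch of the hyperparameter search with GridSearchCV; the grid values are illustrative, not those from the notebook:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X, y):
    """Grid-search a few forest hyperparameters with 5-fold CV and
    return the refit best model."""
    grid = {"n_estimators": [200, 500],
            "max_depth": [None, 10, 20],
            "max_features": ["sqrt", 0.5]}
    search = GridSearchCV(RandomForestRegressor(random_state=0), grid,
                          scoring="neg_root_mean_squared_error", cv=5)
    return search.fit(X, y).best_estimator_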

Modeling_GBoost_AIC -- K-Fold hyperparameter optimization and modeling using GradientBoostingRegressor from sklearn, with the feature list generated from AIC optimization
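
One way to carry out this kind of tuning (not necessarily the notebook's exact procedure) is to fit a single long boosting run and scan validation error across stages with staged_predict:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def pick_n_estimators(X, y, max_trees=2000, learning_rate=0.05):
    """Fit one long boosting run, then scan validation RMSE across
    stages to pick the best tree count."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    gbm = GradientBoostingRegressor(n_estimators=max_trees,
                                    learning_rate=learning_rate,
                                    random_state=0).fit(X_tr, y_tr)
    rmse = [mean_squared_error(y_val, p) ** 0.5
            for p in gbm.staged_predict(X_val)]
    return int(np.argmin(rmse)) + 1
```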

Modeling_Elasticnet -- Choosing optimal alpha and rho hyperparameters for balancing Ridge/Lasso regularization algorithms using train/test modeling
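
A sketch with sklearn's ElasticNetCV, where `l1_ratio` plays the role of rho (0 = pure Ridge, 1 = pure Lasso); the candidate grid is illustrative:

```python
from sklearn.linear_model import ElasticNetCV

def tune_elasticnet(X_scaled, y):
    """Jointly tune the overall penalty (alpha) and the Ridge/Lasso
    mix (l1_ratio) by cross-validation."""
    enet = ElasticNetCV(l1_ratio=[0.1, 0.3, 0.5, 0.7, 0.9, 1.0],
                        n_alphas=100, cv=5, random_state=0)
    enet.fit(X_scaled, y)
    return enet.alpha_, enet.l1_ratio_
```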

Model stacking and train/test comparison:

Stacking_FinalFeatures_Run1_WithXGB -- Stacking model for Run 1, where a blend of 24% XGBoost, 36% Gradient Boosting, 20% Random Forest, and 20% linear regression scored best against the test set.

Stacking_FinalFeatures_Run2 -- Stacking model for Run 2, which suggested an optimal blend of 40% Gradient Boosting, 35% Random Forest, and 25% linear regression (0% Ridge).
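
Both blends reduce to a weighted average of per-model predictions; a minimal sketch using the Run 2 weights from above:

```python
import numpy as np

def blend(preds: dict, weights: dict) -> np.ndarray:
    """Weighted average of per-model prediction arrays, e.g. the
    Run 2 blend: {"gboost": 0.40, "rf": 0.35, "linear": 0.25}."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(w * np.asarray(preds[m]) for m, w in weights.items())
```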

Individual subfolders:

These subfolders contain pre-work or work that was not part of the final processing pipeline.

Contributors

tinyboxes, cchen17, dcorrig1, adrian-gillerman
