Project - Regression Modeling with the Ames Housing Dataset

Introduction

In this lab, you'll apply the regression analysis and diagnostic techniques covered in this section to the "Ames Housing" dataset. You performed a detailed EDA on this dataset earlier, so you should recall roughly how the data is structured. Here, you'll use some of its features to build linear models that predict the house sale price.

Objectives

You will be able to:

  • Perform a linear regression using statsmodels
  • Determine if a particular set of data exhibits the assumptions of linear regression
  • Evaluate a linear regression model by using statistical performance metrics pertaining to the overall model and to specific parameters
  • Use the coefficient of determination to determine model performance
  • Interpret the parameters of a simple linear regression model in relation to what they signify for specific data

Let's get started

Import necessary libraries and load 'ames.csv' as a pandas dataframe

import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
ames = pd.read_csv('ames.csv')

subset = ['YrSold', 'MoSold', 'Fireplaces', 'TotRmsAbvGrd', 'GrLivArea',
          'FullBath', 'YearRemodAdd', 'YearBuilt', 'OverallCond', 'OverallQual', 'LotArea', 'SalePrice']

data = ames.loc[:, subset]

The columns in the Ames housing data represent the dependent and independent variables. We have taken a subset of all available columns to focus on feature interpretation rather than preprocessing steps. The dependent variable here is the sale price of a house, SalePrice. Descriptions of the other variables are available on Kaggle.

Inspect the columns of the dataset and comment on type of variables present

# Your code here
# Record your observations here 
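
One possible way to approach the inspection, as a minimal sketch: the helper name `summarize_columns` is hypothetical and not part of the lab. It relies only on standard pandas calls and assumes the `data` DataFrame created above.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, distinct values, missing values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),      # storage type of each column
        "n_unique": df.nunique(),            # distinct non-missing values
        "n_missing": df.isna().sum(),        # count of missing entries
    })
```

Calling `summarize_columns(data)` gives a compact overview. Columns with only a handful of distinct values are likely discrete or ordinal rather than continuous, which matters when you interpret a regression slope.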

Create histograms for all variables in the dataset and comment on their shape (uniform or not?)

# Your code here 
# Your observations here
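
A sketch of one way to draw the histograms, assuming the `data` DataFrame from above. The function name `plot_histograms` and the default of 30 bins are illustrative choices, not something the lab prescribes.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_histograms(df: pd.DataFrame, bins: int = 30):
    """Draw one histogram per numeric column and return the figure."""
    # DataFrame.hist lays the columns out on a grid of subplots
    axes = df.hist(bins=bins, figsize=(14, 10))
    fig = axes.flatten()[0].get_figure()
    fig.tight_layout()
    return fig
```

After `plot_histograms(data)`, `data.skew()` can back up the visual impression with a number: values far from zero suggest a long tail on one side.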

Check the linearity assumption for all chosen features with target variable using scatter plots

# Your code here 
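
The linearity check could be sketched as a grid of scatter plots, one per feature, each against the target. The helper name `plot_feature_scatter` and its `ncols` parameter are assumptions made for illustration.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd


def plot_feature_scatter(df: pd.DataFrame, target: str = "SalePrice", ncols: int = 4):
    """Scatter each feature against the target and return the figure."""
    features = [col for col in df.columns if col != target]
    nrows = math.ceil(len(features) / ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3.5 * nrows), squeeze=False)
    flat_axes = axes.flatten()
    for ax, feature in zip(flat_axes, features):
        ax.scatter(df[feature], df[target], alpha=0.3, s=10)
        ax.set_xlabel(feature)
        ax.set_ylabel(target)
    # Hide any unused panels in the grid
    for ax in flat_axes[len(features):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig
```

With `plot_feature_scatter(data)`, a roughly straight band of points suggests the linearity assumption is plausible for that feature, while vertical stripes indicate a discrete feature.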

Clearly, your data needs a lot of preprocessing to improve the results. The key to a Kaggle competition is processing the data so that you can identify the relationships and make the best possible predictions. For now, we'll use the dataset untouched and move on with the regression. The assumptions are not all strictly fulfilled, but they hold well enough to proceed.

Let's do Regression

Now, let's perform a number of simple regression experiments between the chosen independent variables and the dependent variable (SalePrice). You'll do this in a loop, picking one of the independent variables in each iteration. For each one, perform the following steps:

  • Run a simple OLS regression between the independent and dependent variables
  • Plot the residuals using sm.graphics.plot_regress_exog()
  • Plot a Q-Q plot of the regression residuals to check normality
  • Store the following values in an array for each iteration:
    • 'independent variable'
    • 'r_squared'
    • 'intercept'
    • 'slope'
    • 'p-value'
    • 'normality (JB)'
  • Comment on each output

# Your code here

Clearly, the results are not very reliable. The highest R-squared is obtained with OverallQual, so in this analysis, it is our best single predictor.

How can you improve these results?

  1. Preprocessing

This is where the preprocessing of data comes in. Dealing with outliers, normalizing data, scaling values, and so on can help regression analysis get more meaningful results from the given data.

  2. Advanced Analytical Methods

Simple regression is a very basic analysis technique, and trying to fit a straight line to complex analytical questions may prove inadequate. Later on, you'll explore multiple regression, where you can use several features at once to model the relationship with the outcome. You'll also look at some preprocessing and data simplification techniques and revisit the Ames dataset with an improved toolkit.

Level up - Optional

Apply some of the data wrangling skills you learned in the previous section to preprocess the set of independent variables chosen above. You can start with outliers and think of a way to deal with them, then see how that affects the goodness of fit.
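
As a starting point for the outlier step, here is a minimal sketch of an interquartile-range filter. The IQR rule with a multiplier of 1.5 is a common convention, not something this lab mandates, and the function name `drop_iqr_outliers` is hypothetical.

```python
import pandas as pd


def drop_iqr_outliers(df: pd.DataFrame, columns, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose values in `columns` lie within k * IQR of the quartiles."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    # A row is kept only if every listed column falls inside its own bounds
    within = df[columns].ge(q1 - k * iqr) & df[columns].le(q3 + k * iqr)
    return df[within.all(axis=1)]
```

For example, `drop_iqr_outliers(data, ['GrLivArea', 'LotArea'])` removes rows with extreme living or lot areas. Refitting the simple regressions on the filtered data and comparing R-squared shows whether the outliers were driving the fit.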

Summary

In this lab, you applied the skills you have learned so far to a new dataset. You looked at the outcome of your analysis and saw that the data likely needs some preprocessing before the results clearly improve. You'll pick this back up later on, after learning more preprocessing and advanced modeling techniques.
