Multiple Linear Regression in Statsmodels - Lab
Introduction
In this lab, you'll practice fitting a multiple linear regression model on the Ames Housing dataset!
Objectives
You will be able to:
- Determine if it is necessary to perform normalization/standardization for a specific model or set of data
- Use standardization/normalization on features of a dataset
- Identify if it is necessary to perform log transformations on a set of features
- Perform log transformations on different features of a dataset
- Use statsmodels to fit a multiple linear regression model
- Evaluate a linear regression model using statistical performance metrics for the overall model and for individual parameters
The Ames Housing Data
Using the specified continuous and categorical features, preprocess your data to prepare for modeling:
- Split off and one hot encode the categorical features of interest
- Log transform and scale the selected continuous features
import pandas as pd
import numpy as np
ames = pd.read_csv('ames.csv')
continuous = ['LotArea', '1stFlrSF', 'GrLivArea', 'SalePrice']
categoricals = ['BldgType', 'KitchenQual', 'SaleType', 'MSZoning', 'Street', 'Neighborhood']
Continuous Features
# Log transform and normalize
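One possible approach, sketched under a few assumptions: every selected column is strictly positive (so the log is defined), "normalize" is read as z-score standardization, and the helper name `log_and_standardize` and the `_log` column suffix are illustrative choices rather than part of the lab.

```python
import numpy as np
import pandas as pd


def log_and_standardize(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Return log-transformed, z-scored copies of `columns` (hypothetical helper)."""
    # Natural log; assumes all values in these columns are > 0
    logged = np.log(df[columns])
    logged.columns = [f"{col}_log" for col in columns]
    # Standardize each column: subtract its mean, divide by its sample std (ddof=1)
    return (logged - logged.mean()) / logged.std()
```

With the variables defined earlier, a call such as `ames_log_norm = log_and_standardize(ames, continuous)` would produce one standardized log column per continuous feature, including the target.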
Categorical Features
# One hot encode categoricals
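A minimal sketch of the encoding step using `pd.get_dummies`. Dropping the first level of each feature (`drop_first=True`) is an assumption made here to avoid perfectly collinear dummy columns; the helper name `one_hot_encode` is hypothetical.

```python
import pandas as pd


def one_hot_encode(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """One-hot encode `columns`, dropping one reference level per feature."""
    # dtype=float keeps the dummies numeric so they can go straight into a regression
    return pd.get_dummies(df[columns], drop_first=True, dtype=float)
```

For example, `ames_ohe = one_hot_encode(ames, categoricals)` would yield columns named like `BldgType_Duplex`, with the alphabetically first level of each feature acting as the baseline.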
Combine Categorical and Continuous Features
# combine features into a single dataframe called preprocessed
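Combining the two sets of features can be done with a column-wise `pd.concat`. The sketch below assumes both frames were derived from the same rows of `ames` and therefore share an index; the function name `combine_features` is illustrative.

```python
import pandas as pd


def combine_features(continuous_df: pd.DataFrame, categorical_df: pd.DataFrame) -> pd.DataFrame:
    """Place continuous and one-hot encoded columns side by side."""
    # Guard against silent misalignment: concat along axis=1 joins on the index
    if not continuous_df.index.equals(categorical_df.index):
        raise ValueError("Both frames must share the same row index")
    return pd.concat([continuous_df, categorical_df], axis=1)
```

The result of such a call is what the lab refers to as `preprocessed`.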
Fit a linear model in statsmodels with SalePrice as the target variable
# Your code here
Run the same model in scikit-learn
# Your code here - Check that the coefficients and intercept are the same as those from Statsmodels
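The same fit can be sketched with scikit-learn's `LinearRegression`, which includes an intercept by default. As before, the target column name `SalePrice_log` and the helper name `fit_sklearn` are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_sklearn(preprocessed: pd.DataFrame, target: str = "SalePrice_log"):
    """Fit LinearRegression and return the model plus labeled coefficients."""
    X = preprocessed.drop(columns=[target])
    y = preprocessed[target]
    linreg = LinearRegression().fit(X, y)
    # Label coefficients by feature name so they are easy to compare across libraries
    coefs = pd.Series(linreg.coef_, index=X.columns)
    return linreg, coefs
```

To check agreement, compare `coefs` against the non-intercept entries of the statsmodels `params`, and `linreg.intercept_` against the statsmodels `const` entry. Small floating-point differences are expected, so a tolerance-based comparison such as `np.allclose` is more appropriate than exact equality.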
Predict the house price for a home with the following characteristics (given as raw values, before any transformation)
Make sure to transform these values the same way you transformed the training data!
- LotArea: 14977
- 1stFlrSF: 1976
- GrLivArea: 1976
- BldgType: 1Fam
- KitchenQual: Gd
- SaleType: New
- MSZoning: RL
- Street: Pave
- Neighborhood: NridgHt
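The prediction has to repeat the training-time preprocessing on the new observation and then undo the transformation of the target. The sketch below assumes several conventions that are not fixed by the lab: continuous features were log-transformed and z-scored using the training data's own mean and standard deviation and stored as `<name>_log`, dummies are named `<feature>_<level>` with one dropped baseline level per feature, and `params` is a pandas Series of fitted coefficients whose intercept entry is called `const`. The function name `predict_price` is hypothetical.

```python
import numpy as np
import pandas as pd


def predict_price(raw, ames, params, continuous_predictors, categoricals, target="SalePrice"):
    """Predict a price in original units from raw (untransformed) characteristics."""
    feature_names = params.index.drop("const")
    row = pd.Series(0.0, index=feature_names)

    # Continuous predictors: log, then standardize with the training log-mean and log-std
    for col in continuous_predictors:
        logs = np.log(ames[col])
        row[f"{col}_log"] = (np.log(raw[col]) - logs.mean()) / logs.std()

    # Categorical predictors: switch on the matching dummy, if any.
    # A level with no dummy column is treated as the dropped baseline (all zeros).
    for col in categoricals:
        dummy = f"{col}_{raw[col]}"
        if dummy in row.index:
            row[dummy] = 1.0

    # Prediction on the standardized log scale
    pred_std = params["const"] + (row * params[feature_names]).sum()

    # Undo the standardization and the log to get back to a price
    target_logs = np.log(ames[target])
    return float(np.exp(pred_std * target_logs.std() + target_logs.mean()))
```

Here `raw` would be a dictionary of the values listed above, for example `{'LotArea': 14977, '1stFlrSF': 1976, 'GrLivArea': 1976, 'BldgType': '1Fam', ...}`, and `continuous_predictors` would be the continuous features without `SalePrice`.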
Summary
Congratulations! You preprocessed the Ames Housing data using log transformation, standardization, and one-hot encoding. You also fit your first multiple linear regression model on the Ames Housing data using statsmodels and scikit-learn!