Cross Validation

Cross validation is very useful for determining optimal model parameters such as our regularization parameter, alpha. It first divides the training set into subsets (by default, the sklearn package uses 3) and then selects an optimal hyperparameter (in this case alpha, our regularization parameter) based on held-out test performance. For example, if we have 3 splits, A, B and C, we can run 3 training and testing combinations and then average the test performance as an overall estimate of model performance for the given parameters. (The three combinations are: train on A+B, test on C; train on A+C, test on B; train on B+C, test on A.) We can repeat this across various alpha values in order to determine an optimal regularization parameter. By default, sklearn will even generate candidate alphas for you, or you can explicitly check the performance of specific alphas.
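
Below is a minimal sketch of this scheme on synthetic data (the dataset, model, and alpha value here are illustrative placeholders, not part of the lesson's housing data):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic regression data purely for illustration
X_demo, y_demo = make_regression(n_samples=90, n_features=5, noise=10, random_state=0)

# cv=3 trains on two folds and tests on the third, rotating through all three folds
scores = cross_val_score(Lasso(alpha=1.0), X_demo, y_demo, cv=3)
print('Per-fold scores:', scores)
print('Average score:', scores.mean())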

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split

# Load the housing prices training data and preview it
df = pd.read_csv('Housing_Prices/train.csv')
print(len(df))
df.head()
1460
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

from sklearn.linear_model import LassoCV, RidgeCV
#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values with the column mean
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y)
L1 = LassoCV()
print('Model Details:\n', L1)

L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
# Count how many coefficients the L1 penalty drove to zero
count = np.sum(L1.coef_ == 0)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 198489.80980228688
First 5 coefficients:
 [-2.80735194 -0.         -0.          0.25507382  0.        ]
Number of coefficients set to zero: 25

Notes on Coefficients and Using Lasso for Feature Selection

The Lasso technique also has a very important and profound side effect: feature selection. Many of the feature coefficients will be optimized to zero, effectively removing those features' impact on the model. This can be useful in practice when trying to reduce the number of features in the model. Note that which variables are set to zero can change if multicollinearity is present in the data: if two features within the X space are highly correlated, which one takes precedence in the model is somewhat arbitrary, so multiple runs of .fit() (on different train/test splits, for example) can produce substantially different coefficient values.
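
As a concrete sketch (assuming the fitted L1 model and the X DataFrame from the cells above are still in scope), the retained features are simply those with nonzero coefficients:

# Features that survived the L1 penalty (nonzero coefficients)
selected = [feat for feat, coef in zip(X.columns, L1.coef_) if coef != 0]
print('Number of features retained: {}'.format(len(selected)))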

With Normalization

#Define X and Y
feats = [col for col in df.columns if df[col].dtype in [np.int64, np.float64]]

X = df[feats].drop('SalePrice', axis=1)

#Impute null values with the column mean
for col in X:
    avg = X[col].mean()
    X[col] = X[col].fillna(value=avg)

y = df.SalePrice

print('Number of X features: {}'.format(len(X.columns)))

#Train Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y)
# normalize=True rescales the features before fitting (note: this parameter
# is deprecated in newer sklearn releases in favor of scaling the data explicitly)
L1 = LassoCV(normalize=True)
print('Model Details:\n', L1)
L1.fit(X_train, y_train)

print('Optimal alpha: {}'.format(L1.alpha_))
print('First 5 coefficients:\n', L1.coef_[:5])
# Count how many coefficients the L1 penalty drove to zero
count = np.sum(L1.coef_ == 0)
print('Number of coefficients set to zero: {}'.format(count))
Number of X features: 37
Model Details:
 LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=True, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
Optimal alpha: 141.0984264427501
First 5 coefficients:
 [ -0.00000000e+00  -5.95275404e+01   0.00000000e+00   1.60217484e-01
   2.00649624e+04]
Number of coefficients set to zero: 21

Calculate the Mean Squared Error

Calculate the mean squared error on the test set for each of the two models above.

# Your code here
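
One possible approach is sketched below; it refits both variants on a single shared split so that the comparison uses the same test set (mean_squared_error is sklearn's standard metric):

from sklearn.metrics import mean_squared_error

# Refit both variants on one shared split for an apples-to-apples comparison
X_train, X_test, y_train, y_test = train_test_split(X, y)
for model in [LassoCV(), LassoCV(normalize=True)]:
    model.fit(X_train, y_train)
    print('Test MSE: {}'.format(mean_squared_error(y_test, model.predict(X_test))))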

Repeat this Process for the Ridge Regression Object

# Your code here
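
A minimal sketch of the same workflow with RidgeCV (imported at the top alongside LassoCV), assuming the split and metric import from the previous sketch are still in scope. Note that, unlike LassoCV, RidgeCV searches a small default grid of alphas rather than generating a path of candidates:

# Fit a cross-validated ridge regression and evaluate on the held-out test set
R1 = RidgeCV()
R1.fit(X_train, y_train)
print('Optimal alpha: {}'.format(R1.alpha_))
print('Test MSE: {}'.format(mean_squared_error(y_test, R1.predict(X_test))))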

Practice Preprocessing and Feature Engineering

Use some of our previous techniques, including normalization, feature engineering, and dummy variables, on the dataset. Then repeat fitting and tuning a model, and observe the performance impact compared to the models above.

# Your code here
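
A rough sketch of one reasonable approach is given below (names such as cat_feats and X_eng are illustrative, not prescribed): one-hot encode the categorical columns, normalize the combined feature matrix, and retune the model.

# One-hot encode the categorical (object-dtype) columns
cat_feats = [col for col in df.columns if df[col].dtype == object]
dummies = pd.get_dummies(df[cat_feats], drop_first=True)

# Combine with the imputed numeric features, then z-score normalize
X_eng = pd.concat([X, dummies], axis=1)
X_eng = X_eng.loc[:, X_eng.std() > 0]  # drop constant columns before scaling
X_eng = (X_eng - X_eng.mean()) / X_eng.std()

X_train, X_test, y_train, y_test = train_test_split(X_eng, y)
model = LassoCV()
model.fit(X_train, y_train)
print('Optimal alpha: {}'.format(model.alpha_))
print('Test MSE: {}'.format(mean_squared_error(y_test, model.predict(X_test))))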
