
Regression Trees and Model Optimization - Lab

Introduction

In this lab, we'll see how to apply regression analysis using CART trees, with some hyperparameter tuning to improve our model.

Objectives

In this lab you will:

  • Perform the full process of cleaning data, tuning hyperparameters, creating visualizations, and evaluating decision tree models
  • Determine the optimal hyperparameters for a decision tree model and evaluate the performance of decision tree models

Ames Housing dataset

The dataset is available in the file 'ames.csv'.

  • Import the dataset and examine its dimensions:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

# Load the Ames housing dataset 
data = None

# Print the dimensions of data


# Check out the info for the dataframe


# Show the first 5 rows
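One way the cell above could be filled in. Since 'ames.csv' may not be present in every environment, this sketch builds a tiny stand-in frame with the same columns; in the lab itself you would call `pd.read_csv('ames.csv')` instead.

```python
import pandas as pd

# In the lab: data = pd.read_csv('ames.csv')
# Stand-in frame with the same columns keeps this sketch runnable anywhere
data = pd.DataFrame({
    'LotArea':   [8450, 9600, 11250, 9550, 14260],
    '1stFlrSF':  [856, 1262, 920, 961, 1145],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})

print(data.shape)   # dimensions: (rows, columns)
data.info()         # column dtypes and non-null counts
print(data.head())  # first 5 rows
```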

Identify features and target data

In this lab, we will use 3 continuous predictive features:

Features

  • LotArea: Lot size in square feet
  • 1stFlrSF: Size of first floor in square feet
  • GrLivArea: Above grade (ground) living area square feet

Target

  • SalePrice: the sale price of the home, in dollars

  • Create DataFrames for the features and the target variable as shown above

  • Inspect the contents of both the features and the target variable

# Features and target data
target = None
features = None
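A possible solution sketch, again using a small stand-in frame in place of 'ames.csv':

```python
import pandas as pd

# Stand-in for data = pd.read_csv('ames.csv')
data = pd.DataFrame({
    'LotArea':   [8450, 9600, 11250, 9550, 14260],
    '1stFlrSF':  [856, 1262, 920, 961, 1145],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})

# Target: the sale price; features: the three continuous predictors
target = data[['SalePrice']]
features = data[['LotArea', '1stFlrSF', 'GrLivArea']]

print(target.describe())
print(features.describe())
```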

Inspect correlations

  • Use scatter plots to show the correlation between the chosen features and the target variable
  • Comment on each scatter plot
# Your code here 
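A sketch of one way to draw the three scatter plots, on stand-in data (the `Agg` backend line only keeps the sketch runnable outside a notebook; with `%matplotlib inline` it is unnecessary). With the real dataset, you would then comment on how linear each relationship looks.

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for the Ames data
data = pd.DataFrame({
    'LotArea':   [8450, 9600, 11250, 9550, 14260],
    '1stFlrSF':  [856, 1262, 920, 961, 1145],
    'GrLivArea': [1710, 1262, 1786, 1717, 2198],
    'SalePrice': [208500, 181500, 223500, 140000, 250000],
})
features = data[['LotArea', '1stFlrSF', 'GrLivArea']]
y = data['SalePrice']

# One scatter plot per feature against the target
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, col in zip(axes, features.columns):
    ax.scatter(features[col], y, alpha=0.6)
    ax.set_xlabel(col)
    ax.set_ylabel('SalePrice')
plt.close(fig)
```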

Create evaluation metrics

  • Import r2_score and mean_squared_error from sklearn.metrics
  • Create a function performance(true, predicted) to calculate and return both the R-squared score and Root Mean Squared Error (RMSE) for two equal-sized arrays for the given true and predicted values
    • Depending on your version of sklearn, to get the RMSE you will either need to set squared=False or take the square root of the output of the mean_squared_error function - check the documentation for your version (a related StackOverflow post covers the same point)
    • The benefit of calculating RMSE instead of the Mean Squared Error (MSE) is that RMSE is in the same units as the target - here, dollars - so it tells us, on average, how many dollars our predictions are off from the actual home prices
# Import metrics


# Define the function
def performance(y_true, y_predict):
    """ 
    Calculates and returns the two performance scores between 
    true and predicted values - first R-Squared, then RMSE
    """

    # Calculate the r2 score between 'y_true' and 'y_predict'

    # Calculate the root mean squared error between 'y_true' and 'y_predict'

    # Return the score

    pass


# Test the function
score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
score

# [0.9228556485355649, 0.6870225614927066]
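A sketch of `performance()` that reproduces the test output above. Taking `np.sqrt` of `mean_squared_error` works on any sklearn version (newer versions also accept `squared=False`):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

def performance(y_true, y_predict):
    """
    Calculates and returns the two performance scores between
    true and predicted values - first R-Squared, then RMSE
    """
    # R-squared between 'y_true' and 'y_predict'
    r2 = r2_score(y_true, y_predict)
    # Root mean squared error; sqrt of MSE works on every sklearn version
    rmse = np.sqrt(mean_squared_error(y_true, y_predict))
    return [r2, rmse]

score = performance([3, -0.5, 2, 7, 4.2], [2.5, 0.0, 2.1, 7.8, 5.3])
print(score)  # [0.9228556485355649, 0.6870225614927066]
```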

Split the data into training and test sets

  • Split features and target datasets into training/test data (80/20)
  • For reproducibility, use random_state=42
from sklearn.model_selection import train_test_split 

# Split the data into training and test subsets
x_train, x_test, y_train, y_test = None
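A sketch of the 80/20 split. The data here are synthetic stand-ins (same column names as the lab, different values), so this cell runs without 'ames.csv':

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features/target built earlier
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)

# 80/20 split, fixed seed for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

print(x_train.shape, x_test.shape)
```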

Grow a vanilla regression tree

  • Import the DecisionTreeRegressor class
  • Run a baseline model for later comparison using the datasets created above
  • Generate predictions for test dataset and calculate the performance measures using the function created above
  • Use random_state=45 for tree instance
  • Record your observations
# Import DecisionTreeRegressor


# Instantiate DecisionTreeRegressor 
# Set random_state=45
regressor = None

# Fit the model to training data


# Make predictions on the test data
y_pred = None

# Calculate performance using the performance() function 
score = None
score

# [0.5961521990414137, 55656.48543887347] - R2, RMSE
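A sketch of the baseline tree, again on synthetic stand-in data; the resulting scores will differ from the lab's reference values above, since the data differ:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in for the Ames features/target
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Vanilla tree: all hyperparameters at their defaults
regressor = DecisionTreeRegressor(random_state=45)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Same [R2, RMSE] pair that performance() would return
score = [r2_score(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred))]
print(score)
```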

Hyperparameter tuning (I)

  • Find the best tree depth using depth range: 1-30
  • Run the regressor repeatedly in a for loop for each depth value
  • Use random_state=45 for reproducibility
  • Calculate RMSE and R-squared for each run
  • Plot both performance measures for all runs
  • Comment on the output
# Your code here 
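One way to run the depth sweep, on the same kind of synthetic stand-in data. With the real data, you would typically see test RMSE fall and then rise again as the tree starts to overfit:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data and split
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# One tree per depth value, scoring each on the test set
depths = range(1, 31)
r2_list, rmse_list = [], []
for depth in depths:
    reg = DecisionTreeRegressor(max_depth=depth, random_state=45)
    reg.fit(x_train, y_train)
    pred = reg.predict(x_test)
    r2_list.append(r2_score(y_test, pred))
    rmse_list.append(np.sqrt(mean_squared_error(y_test, pred)))

# Plot both measures against depth
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
ax1.plot(depths, rmse_list, 'b-')
ax1.set_xlabel('max_depth'); ax1.set_ylabel('RMSE')
ax2.plot(depths, r2_list, 'r-')
ax2.set_xlabel('max_depth'); ax2.set_ylabel('R-squared')
plt.close(fig)
```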

Hyperparameter tuning (II)

  • Repeat the above process for min_samples_split
  • Use a range of values from 2-10 for this hyperparameter
  • Use random_state=45 for reproducibility
  • Visualize the output and comment on results as above
# Your code here 
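The same loop pattern applied to `min_samples_split` (values 2 through 10), sketched on synthetic stand-in data; only the hyperparameter and its range change:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data and split
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# One tree per min_samples_split value (minimum valid value is 2)
splits = range(2, 11)
r2_list, rmse_list = [], []
for s in splits:
    reg = DecisionTreeRegressor(min_samples_split=s, random_state=45)
    reg.fit(x_train, y_train)
    pred = reg.predict(x_test)
    r2_list.append(r2_score(y_test, pred))
    rmse_list.append(np.sqrt(mean_squared_error(y_test, pred)))

print(list(zip(splits, rmse_list)))
```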

Run the optimized model

  • Use the best values for max_depth and min_samples_split found in previous runs and run an optimized model with these values
  • Calculate the performance and comment on the output
# Your code here 
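A sketch of the final step. The values max_depth=6 and min_samples_split=5 here are placeholders - substitute whatever your own tuning curves suggested:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_squared_error

# Synthetic stand-in data and split
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Placeholder hyperparameters: use the best values from your own tuning runs
best = DecisionTreeRegressor(max_depth=6, min_samples_split=5, random_state=45)
best.fit(x_train, y_train)
y_pred = best.predict(x_test)

score = [r2_score(y_test, y_pred), np.sqrt(mean_squared_error(y_test, y_pred))]
print(score)
```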

Level up (Optional)

  • Try bringing in more features from the original dataset that may be good predictors
  • Also, try tuning additional hyperparameters such as max_features to find a better version of the model
# Your code here 
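As one possible starting point for the level-up, the same sweep pattern works for `max_features` (here 1 to 3, since the stand-in data has three features):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data and split
rng = np.random.RandomState(0)
features = pd.DataFrame({
    'LotArea':   rng.uniform(5000, 20000, 100),
    '1stFlrSF':  rng.uniform(600, 2500, 100),
    'GrLivArea': rng.uniform(800, 3000, 100),
})
target = 100 * features['GrLivArea'] + 2 * features['LotArea'] + rng.normal(0, 10000, 100)
x_train, x_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Sweep max_features: how many features each split may consider
rmse_list = []
for m in range(1, 4):
    reg = DecisionTreeRegressor(max_features=m, random_state=45)
    reg.fit(x_train, y_train)
    pred = reg.predict(x_test)
    rmse_list.append(np.sqrt(mean_squared_error(y_test, pred)))

print(rmse_list)
```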

Summary

In this lab, we looked at applying a decision-tree-based regression analysis on the Ames Housing dataset. We saw how to train various models to find the optimal values for hyperparameters.
