Hyperparameter Tuning and Pruning in Decision Trees - Lab

Introduction

In this lab, you will use the Titanic dataset to see the impact of tree pruning and hyperparameter tuning on the predictive performance of a decision tree classifier. Pruning reduces the size of a decision tree by removing nodes that contribute little predictive power. Decision trees are especially prone to overfitting, because an unconstrained tree can keep splitting until it memorizes the training data, and effective pruning reduces this risk.

Objectives

In this lab you will:

  • Determine the optimal hyperparameters for a decision tree model and evaluate the model performance

Import necessary libraries

Let's first import the libraries you'll need for this lab.

import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
plt.style.use('seaborn')  # named 'seaborn-v0_8' from matplotlib 3.6 onward; the old name was later removed

Import the data

The Titanic dataset, available in 'titanic.csv', has already been cleaned and preprocessed for you so that you can focus on pruning and optimization. Import the dataset and print the first five rows of the data:

# Import the data
df = None
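
As a rough illustration of the import step, the sketch below reads CSV text with `pd.read_csv` and prints the first five rows with `.head()`. The rows and column names other than 'Survived' and 'PassengerId' are made up for the example, and `csv_text` and `sample_df` are hypothetical names; in the lab you would pass the path 'titanic.csv' instead.

```python
import io

import pandas as pd

# A few made-up rows in CSV form (hypothetical; the real data lives in 'titanic.csv')
csv_text = """PassengerId,Survived,Age,Fare
1,0,22.0,7.25
2,1,38.0,71.28
3,1,26.0,7.92
4,1,35.0,53.10
5,0,35.0,8.05
6,0,54.0,51.86
"""

# read_csv accepts a file path or any file-like object
sample_df = pd.read_csv(io.StringIO(csv_text))

# head() returns the first five rows by default
print(sample_df.head())
```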

Create training and test sets

  • Assign the 'Survived' column to y
  • Drop the 'Survived' and 'PassengerId' columns from df, and assign the resulting DataFrame to X
  • Split X and y into training and test sets. Assign 30% to the test set and set the random_state to SEED

# Create X and y 
y = None
X = None

# Split into training and test sets
SEED = 1
X_train, X_test, y_train, y_test = None
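
The three steps above could look roughly like the following sketch. It uses a tiny made-up DataFrame (`toy_df`, a hypothetical name with invented values) rather than the real data, so only the column names 'Survived' and 'PassengerId' come from the lab.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 1

# Hypothetical stand-in for the preprocessed Titanic data (made-up rows)
toy_df = pd.DataFrame({
    'PassengerId': range(1, 11),
    'Survived': [0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
    'Age': [22, 38, 26, 35, 54, 2, 27, 14, 4, 58],
    'Fare': [7.25, 71.28, 7.92, 8.05, 51.86, 21.08, 11.13, 30.07, 16.70, 26.55],
})

# The target is 'Survived'; the features are everything except the target and the ID
y_toy = toy_df['Survived']
X_toy = toy_df.drop(columns=['Survived', 'PassengerId'])

# Hold out 30% of the rows for testing; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=SEED
)

print(X_tr.shape, X_te.shape)
```

Dropping 'PassengerId' matters because an arbitrary row identifier carries no real signal, yet a tree can still split on it and overfit.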

Train a vanilla classifier

Note: The term "vanilla" is used for a machine learning algorithm with its default settings (no tweaking/tuning).

  • Instantiate a decision tree
    • Use the 'entropy' criterion and set the random_state to SEED
  • Fit this classifier to the training data

# Train the classifier using training data
dt = None

Make predictions

  • Create a set of predictions using the test set
  • Using y_test and y_pred, calculate the AUC (Area under the curve) to check the predictive performance

# Make predictions using test set 
y_pred = None

# Check the AUC of predictions
false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
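
A minimal sketch of the train, predict, and score sequence follows. It runs on synthetic data from `make_classification`, which is an assumption standing in for the Titanic features, and the names `X_demo`, `vanilla_dt`, and `vanilla_auc` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic binary-classification data (an assumption, not the Titanic data)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

# "Vanilla" tree: only the criterion and the seed are set
vanilla_dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
vanilla_dt.fit(X_tr, y_tr)

# Hard class predictions on the held-out data
y_hat = vanilla_dt.predict(X_te)

# roc_curve returns false positive rates, true positive rates, and thresholds;
# auc integrates the resulting curve with the trapezoidal rule
fpr, tpr, thresholds = roc_curve(y_te, y_hat)
vanilla_auc = auc(fpr, tpr)
print(round(vanilla_auc, 3))
```

Because `predict` returns class labels rather than probabilities, the ROC curve here has a single interior point; this mirrors the lab's use of y_pred.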

Maximum Tree Depth

Let's first check for the best depth parameter for our decision tree:

  • Create an array of max_depth values ranging from 1 to 32
  • In a loop, train the classifier for each depth value (32 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/overfitting and the optimal value
  • Interpret the results

# Identify the optimal tree depth for given data
# Your observations here 
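
One way the depth sweep might be structured is sketched below. It again uses synthetic data as a stand-in (an assumption), and `auc_from_predictions` is a hypothetical helper introduced only for this example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic data (an assumption, standing in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

def auc_from_predictions(model, X, y):
    """Hypothetical helper: AUC of the ROC curve built from hard predictions."""
    fpr, tpr, _ = roc_curve(y, model.predict(X))
    return auc(fpr, tpr)

max_depths = np.arange(1, 33)  # 1, 2, ..., 32
train_auc, test_auc = [], []
for depth in max_depths:
    dt_depth = DecisionTreeClassifier(criterion='entropy', max_depth=int(depth),
                                      random_state=SEED)
    dt_depth.fit(X_tr, y_tr)
    train_auc.append(auc_from_predictions(dt_depth, X_tr, y_tr))
    test_auc.append(auc_from_predictions(dt_depth, X_te, y_te))

# A widening gap between the two curves is the usual sign of overfitting
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(max_depths, train_auc, label='Train AUC')
ax.plot(max_depths, test_auc, label='Test AUC')
ax.set_xlabel('max_depth')
ax.set_ylabel('AUC')
ax.legend()

best_depth = int(max_depths[int(np.argmax(test_auc))])
print('Depth with the highest test AUC:', best_depth)
```

In a notebook the figure renders inline; in a script you would call `plt.show()` at the end.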

Minimum Sample Split

Now check for the best min_samples_split parameter for our decision tree:

  • Create an array of min_samples_split values ranging from 0.1 to 1 with an increment of 0.1
  • In a loop, train the classifier for each min_samples_split value (10 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/overfitting and the optimal value
  • Interpret the results

# Identify the optimal min-samples-split for given data
# Your observations here
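
The same loop pattern could be applied to `min_samples_split`, as in this sketch. When the value is a float, scikit-learn treats it as a fraction of the training samples. The data is synthetic (an assumption) and `auc_from_predictions` is a hypothetical helper; plotting is omitted to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic data (an assumption, standing in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

def auc_from_predictions(model, X, y):
    """Hypothetical helper: AUC of the ROC curve built from hard predictions."""
    fpr, tpr, _ = roc_curve(y, model.predict(X))
    return auc(fpr, tpr)

# Ten fractions: 0.1, 0.2, ..., 1.0
split_fractions = np.linspace(0.1, 1.0, 10)
train_auc, test_auc = [], []
for frac in split_fractions:
    dt_split = DecisionTreeClassifier(criterion='entropy',
                                      min_samples_split=float(frac),
                                      random_state=SEED)
    dt_split.fit(X_tr, y_tr)
    train_auc.append(auc_from_predictions(dt_split, X_tr, y_tr))
    test_auc.append(auc_from_predictions(dt_split, X_te, y_te))

for frac, tr, te in zip(split_fractions, train_auc, test_auc):
    print(f'{frac:.1f}  train={tr:.3f}  test={te:.3f}')
```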

Minimum Sample Leafs

Now check for the best min_samples_leaf parameter value for our decision tree:

  • Create an array of min_samples_leaf values ranging from 0.1 to 0.5 with an increment of 0.1
  • In a loop, train the classifier for each min_samples_leaf value (5 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/overfitting and the optimal value
  • Interpret the results

# Calculate the optimal value for minimum sample leafs
# Your observations here 
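
A comparable sketch for `min_samples_leaf` is shown below. A float value is read as the minimum fraction of training samples each leaf must hold, so larger values force a coarser tree. As before, the data is synthetic (an assumption) and `auc_from_predictions` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic data (an assumption, standing in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

def auc_from_predictions(model, X, y):
    """Hypothetical helper: AUC of the ROC curve built from hard predictions."""
    fpr, tpr, _ = roc_curve(y, model.predict(X))
    return auc(fpr, tpr)

# Five fractions: 0.1, 0.2, 0.3, 0.4, 0.5
leaf_fractions = np.linspace(0.1, 0.5, 5)
train_auc, test_auc, leaf_counts = [], [], []
for frac in leaf_fractions:
    dt_leaf = DecisionTreeClassifier(criterion='entropy',
                                     min_samples_leaf=float(frac),
                                     random_state=SEED)
    dt_leaf.fit(X_tr, y_tr)
    train_auc.append(auc_from_predictions(dt_leaf, X_tr, y_tr))
    test_auc.append(auc_from_predictions(dt_leaf, X_te, y_te))
    leaf_counts.append(dt_leaf.get_n_leaves())

for frac, tr, te, n in zip(leaf_fractions, train_auc, test_auc, leaf_counts):
    print(f'{frac:.1f}  train={tr:.3f}  test={te:.3f}  leaves={n}')
```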

Maximum Features

Now check for the best max_features parameter value for our decision tree:

  • Create an array of max_features values ranging from 1 to 12 (1 feature vs all)
  • In a loop, train the classifier for each max_features value (12 runs)
  • Calculate the training and test AUC for each run
  • Plot a graph to show under/overfitting and the optimal value
  • Interpret the results

# Find the best value for optimal maximum feature size
# Your observations here
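
For `max_features`, an integer value limits how many features the tree considers at each split. The sketch below sweeps 1 through 12 on synthetic data generated with exactly 12 features (an assumption chosen to match the range in the list above); `auc_from_predictions` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic data with 12 features (an assumption, standing in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

def auc_from_predictions(model, X, y):
    """Hypothetical helper: AUC of the ROC curve built from hard predictions."""
    fpr, tpr, _ = roc_curve(y, model.predict(X))
    return auc(fpr, tpr)

feature_counts = list(range(1, 13))  # 1, 2, ..., 12
train_auc, test_auc = [], []
for n_feat in feature_counts:
    dt_feat = DecisionTreeClassifier(criterion='entropy', max_features=n_feat,
                                     random_state=SEED)
    dt_feat.fit(X_tr, y_tr)
    train_auc.append(auc_from_predictions(dt_feat, X_tr, y_tr))
    test_auc.append(auc_from_predictions(dt_feat, X_te, y_te))

for n_feat, tr, te in zip(feature_counts, train_auc, test_auc):
    print(f'{n_feat:2d}  train={tr:.3f}  test={te:.3f}')
```

Since no depth or leaf limit is set here, the training AUC will typically stay near 1 across the sweep, so the test curve is the one to read.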

Re-train the classifier with chosen values

Now we will use the best values from each training phase above and feed them back into our classifier. Then we can see whether predictive performance improves.

  • Train the classifier with the optimal values identified
  • Compare the AUC of the new model with the earlier vanilla decision tree AUC
  • Interpret the results of the comparison

# Train a classifier with optimal values identified above
dt = None


false_positive_rate, true_positive_rate, thresholds = None
roc_auc = None
roc_auc
# Your observations here
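
The comparison could be set up along the lines of the sketch below. The hyperparameter values are placeholders chosen only for illustration, not results from the lab, and the data is synthetic (an assumption); substitute the values you identified in the sections above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

SEED = 1

# Synthetic data (an assumption, standing in for the Titanic features)
X_demo, y_demo = make_classification(n_samples=500, n_features=12,
                                     n_informative=5, random_state=SEED)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          random_state=SEED)

def auc_from_predictions(model, X, y):
    """Hypothetical helper: AUC of the ROC curve built from hard predictions."""
    fpr, tpr, _ = roc_curve(y, model.predict(X))
    return auc(fpr, tpr)

# Baseline: the vanilla tree
vanilla_dt = DecisionTreeClassifier(criterion='entropy', random_state=SEED)
vanilla_dt.fit(X_tr, y_tr)

# Tuned tree: placeholder values, replace with your own findings
tuned_dt = DecisionTreeClassifier(criterion='entropy',
                                  max_depth=4,
                                  min_samples_split=0.2,
                                  min_samples_leaf=0.1,
                                  max_features=6,
                                  random_state=SEED)
tuned_dt.fit(X_tr, y_tr)

vanilla_auc = auc_from_predictions(vanilla_dt, X_te, y_te)
tuned_auc = auc_from_predictions(tuned_dt, X_te, y_te)
print(f'vanilla test AUC: {vanilla_auc:.3f}')
print(f'tuned test AUC:   {tuned_auc:.3f}')
```

Each hyperparameter was tuned in isolation, so the combined values are not guaranteed to beat the vanilla tree; the settings interact with one another.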

In the next section, we shall talk about hyperparameter tuning using a technique called "grid-search" to make this process even more granular and decisive.

Summary

In this lesson, we looked at tuning a decision tree classifier to reduce overfitting and improve its ability to generalize. For the Titanic dataset, identifying good hyperparameter values can yield some improvement in predictive performance. This idea will be built on in upcoming lessons and labs.
