In the following code block, we generate a data set with just one variable, "age", to mimic our "Game of Thrones" tree example.
# This code is provided
import random
import pandas as pd
import numpy as np

np.random.seed(0)
random.seed(123)

age_0 = pd.DataFrame([19])
age_1 = pd.DataFrame(sorted(random.sample(range(18, 50), 20)))
age_2 = pd.DataFrame(sorted(random.sample(range(35, 70), 20)))
age_3 = pd.DataFrame([68])

# DataFrame.append is deprecated in recent pandas, so we stack the pieces with pd.concat
age = pd.concat([age_0, age_1, age_2, age_3], ignore_index=True)

label_1 = pd.DataFrame([1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1])
label_2 = pd.DataFrame([0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0])
label = pd.concat([label_1, label_2], ignore_index=True)

data = pd.concat([age, label], axis=1)
data.columns = ['age', 'label']
The data is constructed so that the two classes are not perfectly separable, and so that younger people seem more likely to watch the show.
data.head()
|   | age | label |
|---|-----|-------|
| 0 | 19  | 1     |
| 1 | 19  | 0     |
| 2 | 20  | 1     |
| 3 | 21  | 0     |
| 4 | 22  | 1     |
data.tail()
|    | age | label |
|----|-----|-------|
| 37 | 64  | 0     |
| 38 | 65  | 1     |
| 39 | 67  | 0     |
| 40 | 68  | 1     |
| 41 | 68  | 0     |
We'll manually calculate the "perfect split" for this data set; essentially, we'll recreate the first split (at age 41) in the decision tree from our lab.
Is 41 the best split? Let's find out! You'll create 3 functions in total:
- A function `split()` which, given a split value, splits the data set up so that you can easily compute the gini for the two "children" of the parent node.
- A function `gini_score()` which, given a certain split, computes the gini for the left node, the gini for the right node, and the purity gain.
- A function `best_split()` which loops over the former two functions to find the best split.
Let's start by creating the `split()` function. This function takes three arguments: the first is the column name (a string), the second is the value that represents the split (here, a certain age), and the third is the data set.

In the cell below, complete the `split()` function. Any rows where the named column is less than or equal to the `value` parameter should be stored in a variable called `data_left`; any rows where `col_name` is greater than `value` should be stored in `data_right`. The function should return `data_left, data_right`.
def split(col_name, value, data):
#split the data in 2 given a column name and the value
pass
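If you get stuck, here is one way you could fill in the stub above (a minimal sketch using boolean masks on the named column):

def split(col_name, value, data):
    # rows where the column is <= value go to the left node, the rest go right
    data_left = data[data[col_name] <= value]
    data_right = data[data[col_name] > value]
    return data_left, data_right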
Let's use our newly created function on the column "age" (for this data set, the only option) and split at age 44.
data_left, data_right = None
Now, inspect `data_left` and `data_right`.
Next, we'll create a function to determine the gini score for a given split.
Complete the `gini_score()` function in the cell below. Since this function is a bit complicated, comments have been provided to help simplify writing it.

`gini_score()` should:

- Determine the size of each split, as well as the total number of samples.
- Determine the probability of seeing each outcome (watches or doesn't watch GoT) in both splits.
- Square those probabilities and sum the squares for each node.
- Calculate the gini for both splits (1 minus the value calculated in the previous step).
- Weight each gini by the proportion of the total sample that side makes up (size of side / total number of samples).
- Calculate the gain by adding the two weighted gini scores together.
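As a quick sanity check on the arithmetic (with made-up numbers, not our actual split): if a node contained 10 watchers and 5 non-watchers, its gini would be 1 - ((10/15)^2 + (5/15)^2) = 1 - (4/9 + 1/9) = 4/9 ≈ 0.444, and if those were 15 of 42 total samples, the node's weighted gini would be (15/42) * 0.444 ≈ 0.159.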
def gini_score(data_left, data_right):
# number of instances flowing into the left vs. right node
size_left= None
size_right = None
n_samples = None
# respective chances of seeing each outcome in the left vs right node
p_0L = None
p_1L = None
p_0R = None
p_1R = None
#take the squares and sum over each node
score_L = None
score_R = None
# left node and right node ginis
gini_L = None
gini_R = None
#weighted ginis
weight_gini_L = None
weight_gini_R = None
# The gain (here, we don't compute the root gini again; this value should be minimized)
gain = None
return gini_L, gini_R, gain
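Again, if you need a hint, here is one possible way to fill in the skeleton above (it assumes the outcome column is named 'label', as in our data set):

def gini_score(data_left, data_right):
    # number of instances flowing into the left vs. right node
    size_left = len(data_left)
    size_right = len(data_right)
    n_samples = size_left + size_right
    # respective chances of seeing each outcome in the left vs. right node
    p_0L = (data_left['label'] == 0).sum() / size_left
    p_1L = (data_left['label'] == 1).sum() / size_left
    p_0R = (data_right['label'] == 0).sum() / size_right
    p_1R = (data_right['label'] == 1).sum() / size_right
    # take the squares and sum over each node
    score_L = p_0L**2 + p_1L**2
    score_R = p_0R**2 + p_1R**2
    # left node and right node ginis
    gini_L = 1 - score_L
    gini_R = 1 - score_R
    # weighted ginis
    weight_gini_L = gini_L * (size_left / n_samples)
    weight_gini_R = gini_R * (size_right / n_samples)
    # the gain is the weighted sum, which we want to minimize
    gain = weight_gini_L + weight_gini_R
    return gini_L, gini_R, gain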
Now, let's call our function on the split we've already made to test that everything works.

In the cell below, call `gini_score()` and pass in `data_left` and `data_right`.
# Expected Output: (0.4444444444444444, 0.4012345679012346, 0.42592592592592593)
Now that we have a function to split our data on a given value, and another function to determine how good a given split is using the gini score, we'll write one more function to find the split that produces the best possible gini scores.
In the cell below, complete the `best_split()` function. As with the previous function, we have included comments to make coding it less complicated.

The function should:

- Determine the range of the search space (between the minimum and maximum values that the column contains).
- Iterate through that search space. For every value:
    - Split the data using our `split()` function.
    - Calculate the gini for each side of the split, as well as the gain.
    - If the gain is better (lower) than the current best score, update the `best` values.
- When every possible value in the search space has been tested, return an `output` dictionary containing the best value to split on, the best gain score, the best splits, and the best gini scores for those groups.
def best_split(data, col_name):
# make sure you have the correct range to loop over
min_val = None
max_val = None
best_score = 999
# loop over all the ages
for i in range(min_val, max_val):
data_left, data_right = None
gini_l, gini_r, gain = None
# update if gain is lower than any previously observed gain
if None:
best_val = None
best_score = None
best_groups = None
best_ginis = None
output = None
# create a dictionary with the best value, the best gain, the best groups and the best ginis
output['val'] = None
output['gain'] = None
output['groups'] = None
output['ginis'] = None
return output
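If you want to check your work, here is one possible way to complete it, built on the two functions above:

def best_split(data, col_name):
    # search between the smallest and largest value in the column
    min_val = data[col_name].min()
    max_val = data[col_name].max()
    best_score = 999
    # loop over all the ages
    for i in range(min_val, max_val):
        data_left, data_right = split(col_name, i, data)
        gini_l, gini_r, gain = gini_score(data_left, data_right)
        # keep this split if its gain is the lowest seen so far
        if gain < best_score:
            best_val = i
            best_score = gain
            best_groups = (data_left, data_right)
            best_ginis = (gini_l, gini_r)
    # create a dictionary with the best value, the best gain, the best groups and the best ginis
    output = {}
    output['val'] = best_val
    output['gain'] = best_score
    output['groups'] = best_groups
    output['ginis'] = best_ginis
    return output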
Now, let's test that our new function works.
# best_output = best_split(data, "age")
# best_output['ginis']
# split_age = best_output['val']
# split_age
Uncomment and run the cell below to get a better idea of what the `best_split()` function is actually doing during each iteration of the loop.
# If you want to see what's going on in each loop...
# min_age = 19
# max_age = 68
# for i in range(min_age, max_age):
#     data_left, data_right = split("age", i, data)
#     gini_l, gini_r, gain = gini_score(data_left, data_right)
#     print(i)
#     print(gain)
We'll use scikit-learn to create a decision tree. Run the cell below to import the `tree` module from sklearn.
from sklearn import tree
Now, create a `DecisionTreeClassifier` object. When creating the tree, set the `criterion` equal to `'gini'` and the `max_depth` equal to `1`.
clf_GoT = None
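If you're unsure of the syntax, one way to create it (using the `tree` module imported above):

clf_GoT = tree.DecisionTreeClassifier(criterion='gini', max_depth=1)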
Now we can use `clf_GoT.fit()` with the ages as the first argument and the labels as the second. If you only have one predictor, you need to reshape it using `.reshape(-1, 1)`.
Run the cell below to fit our Decision Tree Classifier object to the data.
# GoT_tree = clf_GoT.fit(data['age'].values.reshape(-1, 1), data['label'])
# GoT_tree
Some third-party libraries make it easy to create a visual representation of our fitted decision tree classifier. Run the cell below to create the visualization.

NOTE: The code below relies on a library called `graphviz`, which is notoriously troublesome to get working on some machines. If the code doesn't work immediately, feel free to try to debug it by googling the error message; usually it only means you need to install a missing dependency. If you can't get the visualization to work, feel free to skip this section!
# Uncomment this code and run this cell to visualize the Tree
# import graphviz
# from sklearn.tree import export_graphviz
# GoT_graph = tree.export_graphviz(GoT_tree, out_file=None)
# graph = graphviz.Source(GoT_graph)
# Uncomment this code and run this cell to visualize the Tree
# export_graphviz(GoT_tree, out_file="mytree.dot")
# with open("mytree.dot") as f: dot_graph = f.read()
# graphviz.Source(dot_graph)
We see that scikit-learn generated the same split! Now, let's verify that we computed the correct gini measures.
Next, let's apply decision trees to a real data set. The salary data set was extracted from the census bureau database and contains salary information. The goal is to use this data set to try to draw conclusions about what drives salaries. More specifically, the target variable is categorical (>50k / <=50k).
import pandas as pd
import numpy as np
import statsmodels as sm
import sklearn as skl
import sklearn.preprocessing as preprocessing
import sklearn.linear_model as linear_model
# sklearn.cross_validation was renamed to sklearn.model_selection in newer versions
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics
import sklearn.tree as tree
import seaborn as sns
salaries = pd.read_csv("salaries_final.csv", index_col = 0)
salaries.tail()
The data set `salaries` contains 6 predictors and one outcome variable, the target salary (<=50k / >50k).

The 6 predictors are:

- `Age`: continuous.
- `Education`: categorical. Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `Occupation`: categorical. Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `Relationship`: categorical. Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `Race`: categorical. White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `Sex`: categorical. Female, Male.
It's important to know that scikit-learn needs dummy variables as input for categorical features. Luckily, we can use `dmatrices` from the `patsy` library to get our data in the correct shape. Of our 6 predictors, only "Age" is continuous; the other 5 are all categorical.

To use categorical data in the model, we'll need to one-hot encode it by creating boolean dummy columns for each category in each categorical column. Pandas provides a way to do this, but we'll explore that in a later lab. In this lab, we'll make use of the `patsy` library's `dmatrices` function.
Run the cell below to split the target from the predictors and transform the predictors into a one-hot encoded version.
from patsy import dmatrices
target, data = dmatrices('Target ~ Age + C(Education) + \
C(Occupation) + C(Relationship) + C(Race) + C(Sex)',
salaries, return_type = "dataframe")
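If you're curious what `dmatrices` produced, you can peek at the column names; patsy names each dummy column after its category level, so you should see something along these lines (the exact order and names depend on your patsy version):

# data.columns
# expect an Intercept column, Age, and one dummy column per category level, e.g. C(Education)[T.11th]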
Now, use the appropriate method to split our data and labels into training and testing sets.
data_train, data_test,target_train, target_test = train_test_split(data, target,
test_size = 0.25)
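Note that `train_test_split` shuffles randomly, so your exact numbers below may differ from run to run. If you want a reproducible split, you could pass a seed (the value 123 is an arbitrary choice):

data_train, data_test, target_train, target_test = train_test_split(data, target, test_size=0.25, random_state=123)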
Now that we have prepared our data, we'll create a large decision tree to see how it does.

In the cell below, create a `DecisionTreeClassifier` object and set the `criterion` parameter to `'gini'` and the `max_depth` parameter to `12`.

Then, run the cell below it to `fit()` our model to the data.
from sklearn import tree
sal_tree = None
# sal_tree.fit(data_train, target_train.iloc[:,1])
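If you need the syntax, a minimal sketch of the constructor (after which the commented `fit()` line above can be uncommented):

sal_tree = tree.DecisionTreeClassifier(criterion='gini', max_depth=12)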
# Optional: Uncomment this cell and run it to visualize our trained model.
# import graphviz
# from sklearn.tree import export_graphviz
# export_graphviz(sal_tree, out_file="mytree.dot", feature_names=data_train.columns , class_names=list(target_train), rounded=True)
# with open("mytree.dot") as f:
# dot_graph = f.read()
# graphviz.Source(dot_graph)
Let's examine whether there's a difference with smaller trees.

In the cell below, create a decision tree classifier as we did before, but this time set the `max_depth` to `3`. Keep the `criterion` set to `'gini'`. Then, `.fit()` the smaller model to our training data (see the cell above if you are unsure of the syntax).
sal_tree_smaller = None
# sal_tree_smaller.fit(data_train, target_train.iloc[:,1])
# Optional: Uncomment this cell and run it to visualize our trained model.
# export_graphviz(sal_tree_smaller, out_file="mytree.dot", feature_names=data_train.columns ,
# class_names=list(target_train), rounded=True)
# with open("mytree.dot") as f:
# dot_graph = f.read()
# graphviz.Source(dot_graph)
Most leaf nodes will point to <=50k. How is this possible? A class imbalance in our data set: 5865 people make more than 50k, while 18555 make less (roughly 25% vs. 75%).

Note how the left nodes always point to "true" and the right nodes to "false".
Now that we have trained models, let's evaluate the performance of each.
Run the cell below to import the `accuracy_score` helper function. Then, run the cells below it to create some sample predictions on our testing data and generate a `confusion_matrix` and a `classification_report` based on those predictions.
from sklearn.metrics import accuracy_score
# pred= sal_tree.predict(data_test)
# print(metrics.confusion_matrix(target_test.iloc[:,1], pred))
# print(metrics.classification_report(target_test.iloc[:,1], pred))
# accuracy_score(target_test.iloc[:,1], pred)
# pred_smaller = sal_tree_smaller.predict(data_test)
# print(metrics.confusion_matrix(target_test.iloc[:,1], pred_smaller))
# print(metrics.classification_report(target_test.iloc[:,1], pred_smaller))
Now, run the cell below to generate an accuracy score for the smaller tree's predictions.
# accuracy_score(target_test.iloc[:,1], pred_smaller)
Decision trees are famously prone to overfitting. One of the best ways to prevent this is to pre-tune the model by placing constraints on certain aspects of the tree, which makes the model more general.
The following parameters are the most commonly used for tuning. In a later lab, you'll learn how to automate the search for the best parameters for each.
- `criterion`: either `'gini'` for gini impurity, or `'entropy'` for information gain.
- `max_depth`: the maximum depth of the tree.
- `min_samples_split`: the minimum number of samples required to split an internal node.
- `min_samples_leaf`: the minimum number of samples required to be at a leaf node.
- `class_weight`: weights associated with the classes.
from sklearn import tree
sal_tree_tuned = tree.DecisionTreeClassifier(criterion="gini", max_depth=12,
                                             min_samples_split=300, min_samples_leaf=150)
sal_tree_tuned = sal_tree_tuned.fit(data_train, target_train.iloc[:,1])
# Optional: Uncomment this cell and run it to visualize our trained model.
# export_graphviz(sal_tree_tuned, out_file="mytree.dot", feature_names=data_train.columns ,
# class_names=list(target_train), rounded=True)
# with open("mytree.dot") as f:
# dot_graph = f.read()
# graphviz.Source(dot_graph)
Now, run the cell below so we can see how well the pre-tuned model did.
pred_tuned = sal_tree_tuned.predict(data_test)
print(metrics.confusion_matrix(target_test.iloc[:,1], pred_tuned))
print(metrics.classification_report(target_test.iloc[:,1], pred_tuned))
accuracy_score(target_test.iloc[:,1], pred_tuned)
For this deep tree, we get a better result with pruning!
Sources:

- https://www.svds.com/machine-learning-vs-statistics/
- https://github.com/xbno/Projects/blob/master/Models_Scratch/Decision%20Trees%20from%20scratch.ipynb
- https://archive.ics.uci.edu/ml/machine-learning-databases/adult/
- https://www.valentinmihov.com/2015/04/17/adult-income-data-set/