
Introduction to Decision Trees

Introduction

In this lesson, we shall look at decision tree classifiers. These are rule-based classifiers and belong to the first generation of modern AI. Despite the fact that this algorithm has been used in practice for decades, its simplicity and effectiveness for routine classification tasks is still on par with more sophisticated approaches. In addition, we can now combine multiple instances of this algorithm to create much more complex architectures like random forests and other ensemble approaches. Let's move ahead with this.

Objectives

You will be able to:

  • Understand and describe a decision tree algorithm in terms of graph architecture
  • Describe how decision trees are used to create partitions in a sample space
  • Gain an overview of the training and prediction stages involved in decision tree classification
  • Understand the importance of a cost function for decision trees

From Graphs to Decision Trees

We have seen basic classification algorithms (a.k.a. classifiers), including Naive Bayes and sigmoid-based logistic regression, earlier. A decision tree is a somewhat different type of classifier that works through a recursive partitioning of the sample space. In this lesson, we shall get a conceptual understanding of how this is achieved.

A decision tree comprises decisions that originate from a chosen point in the sample space. In terms of a graph-theoretic understanding (recall the graph section), it is a directed acyclic graph with a root, called the "root node", that has no incoming edges. All other nodes have one (and only one) incoming edge. Nodes with outgoing edges are known as internal nodes. All other nodes are called leaves, or terminal nodes: they have an incoming edge but no outgoing edges.
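As a minimal sketch (not any particular library's representation), such a tree could be expressed in Python with two node types: internal nodes that test an attribute and hold outgoing edges to children, and leaves that carry only a class label:

```python
# Hypothetical, minimal representation of a decision tree as a DAG.
class Leaf:
    """Terminal node: one incoming edge, no outgoing edges, carries a class label."""
    def __init__(self, label):
        self.label = label

class InternalNode:
    """Internal node: tests a single attribute and routes to one child per outcome."""
    def __init__(self, attribute, children):
        self.attribute = attribute
        self.children = children  # dict mapping attribute value -> child node

# The root is simply an InternalNode that no other node points to.
tree = InternalNode("attribute_1", {"value_a": Leaf("class_1"),
                                    "value_b": Leaf("class_2")})
```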

Directed Acyclic Graphs

In computer science and mathematics, a directed acyclic graph (DAG) is a graph that is directed and contains no cycles. This means it is impossible to start at a node and follow a sequence of directed edges that leads back to that same node. Every DAG admits a topological ordering, in which each node appears before all of the nodes it points to.

Partitioning the Sample Space

So a decision tree is effectively a DAG like the one seen above, where each internal node partitions the sample space into two (or more) sub-spaces according to some discrete function of the input attribute values.

In the simplest and most frequent case, each internal node considers a single attribute, so that the space is partitioned according to that attribute's value. In the case of numeric attributes, the condition refers to a range. Let's see a bit more on this with a simple example below.

Above, you can see that the root node (testing "Color") acts as the first decision for the feature "Color", creating three new paths based on whether the color is red, green, or blue. On the right side, you can see the three primary partitions of our sample space.

If the color is identified as "Red", we don't do any further tests, and thus all red objects belong to the middle partition without any further sub-partitioning.

For the "Green" color, we do a further test on the attribute "Size". So for green objects, we further classify them into small green and large green objects. On the right, we see the green sample space further divided accordingly.

For the "Blue" color, we perform two further tests. If the blue objects are round, we stop there and do not partition the space any further. For square blue objects, we perform yet another test to see whether they are "small blue square objects" or "large blue square objects". So in the blue partition, we can see that large squares and small squares are put into their own spaces. Here is another example of a decision tree, made for taking decisions on bank loan applications.

So this is the basic idea behind decision trees: every internal node checks for a condition and performs a decision, and every terminal/leaf node represents a discrete class. Decision tree induction is closely related to rule induction. In essence, a decision tree is just a series of IF-ELSE statements (rules). Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by combining the decisions along the path to form the antecedent part, and taking the leaf's class prediction as the class value.
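As a rough illustration using the color/size/shape example above (the returned labels are just placeholders for whichever class each partition maps to), the same tree can be written as plain IF-ELSE rules, one branch per root-to-leaf path:

```python
def classify(obj):
    """Hypothetical rule form of the color/size/shape tree above; each
    root-to-leaf path becomes one IF-ELSE branch."""
    if obj["color"] == "Red":
        return "red objects"                      # one test, no further partition
    elif obj["color"] == "Green":
        if obj["size"] == "Small":
            return "small green objects"
        else:
            return "large green objects"
    else:                                         # Blue
        if obj["shape"] == "Round":
            return "round blue objects"
        else:                                     # Square
            if obj["size"] == "Small":
                return "small blue square objects"
            else:
                return "large blue square objects"

print(classify({"color": "Blue", "shape": "Square", "size": "Small"}))
# -> small blue square objects
```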

Definition

A decision tree is a DAG-type classifier where each branch node represents a choice between a number of alternatives and each leaf node represents a classification. An unknown (or test) instance is routed down the tree according to the values of the attributes tested in the successive nodes. When the instance reaches a leaf, it is classified according to the label assigned to the corresponding leaf.

A real dataset would usually have a lot more features than the example above and would create much bigger trees, but the idea remains exactly the same. Feature importance matters a great deal here: selecting the correct feature to split on determines the complexity and effectiveness of the classification process. Regression trees are represented in the same manner; they just predict continuous values, like the price of a house.

Training Process

The process of training a decision tree and predicting the target features of query instances is as follows (a minimal code sketch of these steps appears after the list):

  1. Present a dataset of training examples containing features/predictors and a target (similar to the classifiers we have seen earlier).

  2. Train the tree model by making splits on the target using the values of the predictors. The predictor to split on is selected following the idea of feature selection, using measures like "information gain" and the "Gini index". We shall cover these shortly.

  3. The tree is grown until some stopping criterion is reached. This could be a set depth of the tree or any other similar measure.

  4. Show a new set of features to the tree, with an unknown class, and let the example propagate through the trained tree. The resulting leaf node represents the class prediction for this data.
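Below is a minimal sketch of this workflow using scikit-learn. The column names and the tiny dataset are made up purely for illustration, and the `criterion` and `max_depth` arguments stand in for the split measure and stopping criterion mentioned above.

```python
# A minimal sketch of the training/prediction workflow with scikit-learn.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# 1. A dataset of predictors and a target (values are illustrative only,
#    with categorical features already label-encoded as integers).
train = pd.DataFrame({
    "outlook":  [0, 1, 2, 0, 1, 2],
    "humidity": [0, 1, 0, 1, 1, 0],
    "play":     [1, 0, 1, 0, 0, 1],   # target
})
X, y = train[["outlook", "humidity"]], train["play"]

# 2-3. Train the tree; `criterion` controls the split measure and
#      `max_depth` acts as the stopping criterion.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3)
clf.fit(X, y)

# 4. Propagate an unseen example through the trained tree.
print(clf.predict(pd.DataFrame({"outlook": [1], "humidity": [0]})))
```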

Splitting Criteria

The training process of a decision tree can be generalized as "recursive binary splitting".

In this procedure, all the features are considered and different split points are tried and tested using some cost function. The split with the lowest cost is selected; a small sketch of this scoring appears after the list of algorithms below.

There are a couple of algorithms used to build a decision tree:

  • CART (Classification and Regression Trees) uses the Gini index as its metric.
  • ID3 (Iterative Dichotomiser 3) uses the entropy function and information gain as its metrics.
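As a rough sketch of what these cost functions look like (not a full CART or ID3 implementation), both the Gini index and entropy can be computed from the class proportions within a group of samples, and a candidate split is scored by the size-weighted impurity of the groups it produces:

```python
# Rough sketch of the two impurity measures and of scoring a candidate split.
from collections import Counter
from math import log2

def gini(labels):
    """Gini index of one group of class labels (used by CART)."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy of one group of class labels (used by ID3 via information gain)."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_cost(groups, impurity=gini):
    """Size-weighted impurity of the groups produced by a candidate split;
    recursive binary splitting keeps the split with the lowest cost."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)

# A split that separates the classes perfectly has zero cost:
print(split_cost([["Y", "Y", "Y"], ["N", "N"]]))            # 0.0
print(split_cost([["Y", "N"], ["Y", "N", "Y"]], entropy))   # ~0.95
```

A pure group scores 0 under both measures, which is why the split that best separates the classes gets the lowest cost.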

Let's quickly see why using these cost criteria is imperative for building a tree. We shall try to develop an intuition using a simple example. Let's take a famous dataset in the machine learning world, the weather dataset (playing a game, Y or N, based on the weather conditions).

So we have four features, X (outlook, temp, humidity, and windy), all categorical, and one target, y (play: Y or N), also categorical, and we need to learn the mapping between X and y. This is a binary classification problem, and in order to create a tree, we need to have a root node first and must decide which feature (outlook, temp, humidity, or windy) to use first. Selecting the wrong feature can increase the complexity of the tree, and it is desirable to keep the tree as short as possible.

Greedy Search

We need to determine the attribute that best classifies the training data and use this attribute at the root of the tree. At each node, we repeat this process, creating further splits, until a leaf node is reached, i.e. all data gets classified.

This means we are performing top-down, greedy search through the space of possible decision trees.

In order to identify the best attribute for ID3 classification trees, we use the "information gain" criterion. Information gain (IG) measures how much "information" a feature gives us about the class. Decision trees always try to maximize information gain, so the attribute with the highest information gain will be tested/split on first.
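To hint at how this selection works (the next lesson covers it properly), here is a small, self-contained sketch that computes information gain for two candidate attributes; the handful of rows below are made up for demonstration rather than being the full weather dataset:

```python
# Rough sketch of root selection by information gain on a few illustrative rows.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attribute, target="play"):
    """IG = entropy of the target minus the weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

rows = [
    {"outlook": "sunny",    "windy": False, "play": "N"},
    {"outlook": "sunny",    "windy": True,  "play": "N"},
    {"outlook": "overcast", "windy": False, "play": "Y"},
    {"outlook": "rainy",    "windy": False, "play": "Y"},
    {"outlook": "rainy",    "windy": True,  "play": "N"},
]

# The attribute with the highest information gain is tested at the root.
for attr in ("outlook", "windy"):
    print(attr, round(information_gain(rows, attr), 3))
```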

Let's move on to the next lesson, where we shall look into this criterion with simple examples.

Additional Resources

  • R2D3. Highly recommended for getting a visual introduction to decision trees, with excellent animations explaining the training and prediction stages shown above.

  • Dataversity: Decision Trees Intro. A quick and visual introduction to decision trees.

  • Directed Acyclic Graphs. This will help relate an early understanding of graph computation to decision tree architectures.

Summary

In this lesson, we saw an introduction to decision trees as simple yet effective classifiers. We looked at how decision trees partition the sample space by learning rules from a given dataset, and at why feature selection for splitting the tree is of such high importance. Next, we shall look at the information gain criterion used for feature selection.
