Build a tree by making one useful split at a time.
This page starts with a tiny classification problem, then turns it into a hands-on
lesson about entropy, information gain, and why a tree that keeps chasing perfect purity
can stop generalizing.
- Play with split building
- Inspect entropy and gain
- See why single trees wobble
How to use this page
Start by jumping straight to the entropy sandbox or by scrolling through the first tree
build. Watch the partition chart and the tree diagram together, then come back to the
gain and perturbation sections once the split logic feels intuitive.
We just saw how a Decision Tree operates at a high level: from the top down, it creates a series of sequential rules that split the data into well-separated regions for classification. But given the large number of potential options, how exactly does the algorithm determine where to partition the data? Before we learn how that works, we need to understand Entropy.
Entropy measures the amount of information carried by some variable or event. We'll make use of it to identify regions consisting of a large number of similar (pure) or dissimilar (impure) elements.
Given a certain set of events that occur with probabilities $p_1, p_2, \ldots, p_n$, the total entropy $H$ can be written as the negative sum of the probabilities weighted by their log-probabilities:

$$H = -\sum_{i=1}^{n} p_i \log_2 p_i$$

The quantity $H$ has a number of interesting properties:
Entropy Properties
- $H = 0$ only if all but one of the $p_i$ are zero, the remaining one having the value 1. Thus the entropy vanishes only when there is no uncertainty in the outcome, meaning that the sample is completely unsurprising.
- $H$ is maximum when all the $p_i$ are equal (each $p_i = 1/n$). This is the most uncertain, or "impure", situation.
- Any change towards the equalization of the probabilities increases $H$.
The entropy can be used to quantify the impurity of a collection of labeled data points: a node containing multiple classes is impure whereas a node including only one class is pure.
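The definition above can be sketched in a few lines of Python. This is an illustrative helper, not code from the interactive sandbox; the function name `entropy` is our own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    # Sum p * log2(p) over the observed classes, then negate.
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# A pure node has zero entropy; a 50/50 node has the maximum of 1 bit.
print(entropy(["a", "a", "a", "a"]))  # 0.0
print(entropy(["a", "a", "b", "b"]))  # 1.0
```

Note that classes with zero counts never appear in the sum, which matches the convention $0 \log 0 = 0$.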
Above, you can compute the entropy of a collection of labeled data points belonging to two classes, which is typical for binary classification problems. Click on the Add and Remove buttons to modify the composition of the bubble.
Did you notice that pure samples have zero entropy whereas impure ones have larger entropy values? This is what entropy is doing for us: measuring how pure (or impure) a set of samples is. We'll use it in the algorithm to train Decision Trees by defining the Information Gain.
Why this matters
The splitter is greedy on purpose: it only asks which cut reduces impurity the most right
now. That makes trees fast and interpretable, but it also means later branches inherit every
early choice.
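That greedy search can be sketched for a single numeric feature: try each candidate threshold, score it by information gain, and keep the best. The function name `best_split` and the toy data are our own illustration, not the sandbox's implementation:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Greedily pick the threshold on one numeric feature that maximizes
    information gain, ignoring all future splits."""
    parent = entropy(ys)
    best = (0.0, None)  # (gain, threshold)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # a split must put data on both sides
        gain = parent - (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(ys)
        if gain > best[0]:
            best = (gain, t)
    return best

xs = [1.0, 2.0, 3.0, 4.0]
ys = ["a", "a", "b", "b"]
print(best_split(xs, ys))  # (1.0, 2.0): a perfect split at threshold 2.0
```

A real tree repeats this search over every feature at every node, which is exactly the sequential-rule behavior described above.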
Let's recap what we've learned so far. First, we saw how a Decision Tree classifies data by repeatedly partitioning the feature space into regions according to some conditional series of rules. Second, we learned about entropy, a popular metric used to measure the purity (or lack thereof) of a given sample of data. Third, we learned how Decision Trees use entropy in information gain and the ID3 algorithm to determine the exact conditional series of rules to select. Taken together, the three sections detail the typical Decision Tree algorithm.
To reinforce concepts, let's look at our Decision Tree from a slightly different perspective.
The tree below maps exactly to the tree we showed in the How to Build a Decision Tree section above. However, instead of showing the partitioned feature space alongside our tree's structure, let's look at the partitioned data points and their corresponding entropy at each node itself:
From the top down, our sample of data points to classify shrinks as it gets partitioned to different decision and leaf nodes. In this manner, we could trace the full path taken by a training data point if we so desired. Note also that not every leaf node is pure: as discussed previously (and in the next section), we don't want the structure of our Decision Trees to be too deep, as such a model likely won't generalize well to unseen data.
Without question, Decision Trees have a lot of things going for them. They're simple models that are easy to interpret. They're fast to train and require minimal data preprocessing. And they handle outliers with ease. Yet they suffer from a major limitation, and that is their instability compared with other predictors. They can be extremely sensitive to small perturbations in the data: a minor change in the training examples can result in a drastic change in the structure of the Decision Tree.
Check for yourself how small random Gaussian perturbations on just 5% of the training examples create a set of completely different Decision Trees:
In their vanilla form, Decision Trees are unstable.
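The instability shows up even in a one-level tree. In this toy sketch (helper name `best_threshold` is ours), moving a single training point shifts the root split the greedy search selects:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(xs, ys):
    """Return the single-feature threshold with the highest information gain."""
    parent = entropy(ys)
    best = (0.0, None)  # (gain, threshold)
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if left and right:
            gain = parent - (len(left) * entropy(left)
                             + len(right) * entropy(right)) / len(ys)
            if gain > best[0]:
                best = (gain, t)
    return best[1]

ys = ["a", "a", "a", "b", "b", "b"]
print(best_threshold([1, 2, 3, 4, 5, 6], ys))    # 3: a clean split
print(best_threshold([1, 2, 5.5, 4, 5, 6], ys))  # 2: one moved point shifts the root split
```

In a deeper tree the effect compounds: a different root split changes which points each subtree sees, so every branch below it can change too.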
If left unchecked, the ID3 algorithm to train Decision Trees will work endlessly to minimize entropy. It will continue splitting the data until all leaf nodes are completely pure - that is, consisting of only one class. Such a process may yield very deep and complex Decision Trees. In addition, we just saw that Decision Trees are subject to high variance when exposed to small perturbations of the training data.
Both issues are undesirable, as they lead to predictors that fail to clearly distinguish between persistent and random patterns in the data, a problem known as overfitting. This is problematic because it means that our model won't perform well when exposed to new data.
There are ways to prevent excessive growth of Decision Trees by pruning them: for instance, constraining their maximum depth, limiting the number of leaves that can be created, or requiring a minimum number of items per leaf so that splits producing nearly empty leaves are disallowed.
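Two of those constraints can be sketched as stopping conditions in a recursive grower. This is a toy one-feature version under our own names (`grow`, `max_depth`, `min_leaf`), not the sandbox's code:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def grow(xs, ys, depth=0, max_depth=2, min_leaf=1):
    """Grow a one-feature tree, but stop early when the node is pure,
    the depth budget is spent, or no split keeps both children >= min_leaf."""
    majority = Counter(ys).most_common(1)[0][0]
    if depth >= max_depth or entropy(ys) == 0.0:
        return majority  # leaf: predict the majority class
    parent = entropy(ys)
    best_gain, best_t = 0.0, None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue  # pruning constraint: reject tiny leaves
        gain = parent - (len(left) * entropy(left)
                         + len(right) * entropy(right)) / len(ys)
        if gain > best_gain:
            best_gain, best_t = gain, t
    if best_t is None:
        return majority  # no admissible split: stop here
    lpairs = [(x, y) for x, y in zip(xs, ys) if x <= best_t]
    rpairs = [(x, y) for x, y in zip(xs, ys) if x > best_t]
    return {"threshold": best_t,
            "left": grow([x for x, _ in lpairs], [y for _, y in lpairs],
                         depth + 1, max_depth, min_leaf),
            "right": grow([x for x, _ in rpairs], [y for _, y in rpairs],
                          depth + 1, max_depth, min_leaf)}

tree = grow([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"], max_depth=1)
print(tree)  # {'threshold': 2.0, 'left': 'a', 'right': 'b'}
```

Without the depth cap and the leaf-size floor, the recursion would keep splitting until every leaf is pure, which is exactly the runaway growth described above.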
As for the issue of high variance? Well, unfortunately it's an intrinsic characteristic when training a single Decision Tree.
Perhaps ironically, one way to alleviate the instability induced by perturbations is to introduce an extra layer of randomness in the training process. In practice this can be achieved by creating collections of Decision Trees trained on slightly different versions of the data set, the combined predictions of which do not suffer so heavily from high variance. This approach opens the door to one of the most successful Machine Learning algorithms thus far: random forests.
Continue to the local Random Forest explainer.
Thanks for reading. The walkthrough above keeps the full decision-tree story intact, including the split-building progression, entropy sandbox, information-gain view, and perturbed-tree comparisons.
To keep things compact, it skips over related topics such as regression trees, end-cut preference, and additional tree-specific hyperparameters. The references below are still part of the lesson and are useful for going deeper.
These resources support the concepts and visuals used throughout the article:
A Mathematical Theory Of Communication
(Claude E. Shannon, 1948).
Induction of decision trees
(John Ross Quinlan, 1986).
A Study on End-Cut Preference in Least Squares Regression Trees
(Luis Torgo, 2001).
The Origins Of The Gini Index: Extracts From Variabilità e Mutabilità (Corrado Gini, 1912)
(Lidia Ceriani & Paolo Verme, 2012).
D3.js
(Mike Bostock & Philippe Rivière)
d3-annotation
(Susie Lu)
KaTeX
(Emily Eisenberg & Sophie Alpert)