Ok, so as we saw in previous parts, the CART algorithm allows us to build decision trees. Up till now we have grown these trees until all leaves are pure, meaning they contain only one class of examples (for classification trees). However, this can lead to overfitting the training data, which decreases the generalizability of our model, and therefore its usefulness. This is where cost-complexity pruning comes into play.
What is pruning?
So pruning comes from biology: pruning a plant is selectively removing some parts of it. In the case of decision trees, it just means removing some branches. However, even though we remove branches, we want to keep all of our samples, so we can't just discard the samples that fall into those branches. Effectively, removing a branch corresponds to choosing a pruning node, where we want the branch to end, and collapsing all of its child nodes into it.
Now, how do we choose which branches to remove? If we remove too many, our model loses whatever classifying (or regressing) power it has; if we remove too few, we can still have overfitting. This is addressed by cost-complexity pruning, which balances the complexity of the tree (the number of leaves, and thus the potential for overfitting) against the performance of the tree.
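To make "collapsing child nodes into a pruning node" concrete, here is a minimal sketch using a hypothetical `Node` class (not from any library): pruning a node drops its children, and the node becomes a leaf that predicts the majority class of the samples it holds.

```python
from collections import Counter

class Node:
    def __init__(self, samples, left=None, right=None):
        self.samples = samples            # class labels of the samples reaching this node
        self.left, self.right = left, right

    def is_leaf(self):
        return self.left is None and self.right is None

    def prune(self):
        """Collapse the branch rooted here: the children are removed,
        but the node keeps all of its samples and becomes a leaf."""
        self.left = self.right = None

    def predict(self):
        # a leaf predicts the majority class of its samples
        return Counter(self.samples).most_common(1)[0][0]

# two pure leaves collapsed back into their parent
root = Node(["a", "a", "b"], left=Node(["a", "a"]), right=Node(["b"]))
root.prune()
print(root.is_leaf(), root.predict())   # True a
```

Note that no samples are lost: the pruned node simply stops distinguishing between them.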
Notation
In order to explain cost-complexity pruning, we are going to need to give names to the objects we will manipulate; luckily, that's already been done.
Tree nomenclature
Let us consider a decision tree $T$ and two of its nodes, $t_1$ and $t_2$.
- $t_2$ is a descendant of $t_1$ if there is a path down the tree (from the root to the leaves) from $t_1$ to $t_2$.
- $t_1$ is an ancestor of $t_2$ if there is a path up the tree (from the leaves to the root) from $t_2$ to $t_1$.
- $t_R$ and $t_L$ are, respectively, the right and left child nodes of $t$.
- A branch $T_t$ is the branch of $T$ with root $t$; it is composed of the node $t$ and all of its descendants.
- Pruning a branch $T_t$ from $T$ is removing all the descendants of $t$ from $T$ (so that $t$ becomes a leaf); the pruned tree is called $T - T_t$.
- If you can get a tree $T'$ from $T$ by pruning branches, then $T'$ is a pruned subtree of $T$, and we denote that with: $T' \leq T$.
- For a given tree $T$, we can define $\tilde{T}$, the set of its leaf nodes.
- The complexity of $T$ is given by the cardinality of $\tilde{T}$ (i.e. the number of leaf nodes); it is noted $|\tilde{T}|$.
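The definitions above translate directly into code. A small sketch (again with a hypothetical bare-bones `Node` class) that collects the leaf set $\tilde{T}$ and measures the complexity $|\tilde{T}|$:

```python
class Node:
    def __init__(self, left=None, right=None):
        self.left, self.right = left, right

def leaves(t):
    """Return ~T, the list of leaf nodes of the tree rooted at t."""
    if t.left is None and t.right is None:
        return [t]
    return leaves(t.left) + leaves(t.right)

# a small tree: the root's left child is a leaf,
# the right child has two leaf children -> 3 leaves in total
T = Node(left=Node(), right=Node(left=Node(), right=Node()))
print(len(leaves(T)))   # 3, the complexity |~T|
```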
Measures
Let us consider a leaf node $t$ of $T$, with $j(t)$ the class of $t$ (i.e. the majority class in the node).
- $r(t) = 1 - p(j(t) \mid t)$ is the resubstitution error estimate of $t$. $p(j(t) \mid t)$ is the proportion of the majority class in $t$.
- We denote $R(t) = r(t) p(t)$, with $p(t)$ simply being the proportion of samples in node $t$ compared to the rest of the tree.
- It is provable that $R(t) \geq R(t_L) + R(t_R)$, which just means that if we split a node, the misclassification rate cannot get worse.
- The overall misclassification rate for $T$ is:
$$R(T) = \sum_{t \in \tilde{T}} R(t) = \sum_{t \in \tilde{T}} r(t) p(t)$$
Which is to say, the sum, over all of the leaf nodes, of the resubstitution error of each leaf multiplied by the probability of a sample landing in that leaf.
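These measures are easy to compute from the class labels that end up in each leaf. A small illustrative sketch (the leaf contents are made-up example data):

```python
from collections import Counter

def leaf_measures(leaf_labels, n_total):
    """Return r(t), p(t) and R(t) for one leaf, given the class
    labels of the samples in that leaf and the total sample count."""
    n_t = len(leaf_labels)
    p_majority = Counter(leaf_labels).most_common(1)[0][1] / n_t  # p(j(t)|t)
    r_t = 1 - p_majority          # resubstitution error of the leaf
    p_t = n_t / n_total           # probability of landing in the leaf
    return r_t, p_t, r_t * p_t    # R(t) = r(t) * p(t)

# three leaves of a tree grown on 10 samples
leaf_contents = [["a", "a", "a", "b"], ["b", "b"], ["a", "b", "b", "b"]]
n = sum(len(leaf) for leaf in leaf_contents)
R_T = sum(leaf_measures(leaf, n)[2] for leaf in leaf_contents)
print(round(R_T, 2))   # 0.2
```

The pure leaf `["b", "b"]` contributes nothing to $R(T)$; each impure leaf contributes $0.25 \times 0.4 = 0.1$.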
The pruning
The first step in pruning a tree is, ..., you guessed it: having a tree. So we start by growing the maximal tree, the one with pure leaves. Now, the naive approach would be to go through all the possible pruned subtrees and see which one has the best trade-off between performance and complexity; however, that is, in practice, impossible because of the huge number of possible pruned subtrees.
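To see just how huge, we can count the pruned subtrees of a complete binary tree of a given depth: at each internal node we either stop (the node becomes a leaf) or keep both children and prune each side independently. A short recursion (an illustrative sketch, not part of the pruning algorithm itself):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def n_pruned_subtrees(depth):
    """Count the pruned subtrees (all containing the root) of a
    complete binary tree of the given depth: the root either becomes
    a leaf (1 possibility) or keeps both children, whose branches
    can then be pruned independently of one another."""
    if depth == 0:
        return 1                    # a leaf admits only itself
    n = n_pruned_subtrees(depth - 1)
    return 1 + n * n

for d in (2, 3, 4, 10):
    print(d, n_pruned_subtrees(d))  # 5, 26, 677, then an astronomical number
```

At depth 10 (a modest 1024-leaf tree) the count already exceeds $10^{180}$, so exhaustive enumeration is out of the question.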