Say you’re interviewing for a data analyst position at Google (congrats!!), and the interviewer starts asking you about decision trees. What’s the objective function? How is the tree built? What are the hyperparameters???
…but you don’t know. You’re just a model.fit()/GridSearchCV programmer. You fail the interview, become unemployed, and go homeless in three months due to rising rent prices.
How could you have passed that interview and turned your life around?
Let’s have a chat about how decision trees work!
Skip this if you already know the basics
A decision tree is typically a classification model that is, quite literally, a tree of decisions. The standard implementation in scikit-learn uses the CART algorithm to build a binary tree.
This means that we take an observation, and go down a virtual flowchart to decide what prediction to make for it. At each step, we take one predictor, p, and ask: is p less than some value v? If so, we go left. If not, we go right. At the end, we end up at a leaf node containing our final decision.
Graphically, this is equivalent to repeatedly partitioning an n-dimensional space into halves along some axis, and assigning class labels to each slice. This is why decision tree boundaries are always made of axis-aligned rectangles.
Here, we have a decision tree on the left as a flowchart, and a graphical representation of its decision boundary on the right. Boundaries can get more complex with more splits.
After the tree is built, each leaf node will contain some points of the training data. We can simply predict the class with the highest proportion of points inside each leaf.
In fact, we can even approximate class probabilities by taking the proportions of each class in the leaf, computed from the training data.
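As a rough sketch, here is what those two ideas look like in scikit-learn (the dataset and hyperparameter values are just for illustration):

```python
# Sketch: fit a small decision tree, then read off predictions and
# class proportions. Dataset and max_depth are illustrative choices.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# predict() returns the majority class of the leaf each point lands in;
# predict_proba() returns the training-class proportions in that leaf.
print(tree.predict(X[:1]))
print(tree.predict_proba(X[:1]))
```

`predict_proba` is exactly the "proportions of each class in the leaf" idea above, which is why its rows always sum to 1.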
Let’s discuss the CART algorithm, which builds a decision tree by making binary splits recursively.
To understand the algorithm, we only need to learn how a single binary split is made. Each split is chosen to minimize a loss function over the resulting leaves.
What are the loss functions?
The two common loss functions are Gini impurity and entropy:
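Written out (these are the standard definitions), for a node with class proportions p_k:

```latex
G = 1 - \sum_{k} p_k^2
\qquad
H = -\sum_{k \,:\, p_k > 0} p_k \log_2 p_k
```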
Here, k ranges over the class labels, and p_k is the proportion of points with label k in the resulting leaf. For both measures, the loss will be high if the leaf contains many points with different labels, and low if most points share the same label.
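A quick numeric check of that behavior (the node contents here are hypothetical):

```python
# Toy check: a pure node scores 0 under both measures, while a
# maximally mixed two-class node scores the highest possible loss.
import math

def gini(proportions):
    return 1.0 - sum(p ** 2 for p in proportions)

def entropy(proportions):
    return -sum(p * math.log2(p) for p in proportions if p > 0)

pure = [1.0, 0.0]    # every point has the same label
mixed = [0.5, 0.5]   # a 50/50 mix of two labels

print(gini(pure), gini(mixed))        # 0.0 0.5
print(entropy(pure), entropy(mixed))  # 0.0 1.0
```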
How does the splitting work?
At every step, the CART algorithm will search for a feature k and a threshold t that produce the purest possible subsets after the split, weighted by their size.
This is the CART algorithm’s loss function:
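In symbols, the standard form of this cost for a candidate split on feature k at threshold t is:

```latex
J(k, t) = \frac{m_{\text{left}}}{m}\, G_{\text{left}}
        + \frac{m_{\text{right}}}{m}\, G_{\text{right}}
```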
The m values are the numbers of instances in the left node, the right node, and the combined (parent) node.
G_left and G_right are the impurity values of the left and right nodes.
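To make the search concrete, here is a bare-bones sketch of finding one best split by brute force (the function and variable names are made up for this example; real implementations are far more optimized):

```python
# Illustrative sketch of a single CART split: scan every feature and
# every candidate threshold, keep the split with the lowest weighted
# Gini impurity across the two child nodes.

def gini(labels):
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    m, n_features = len(X), len(X[0])
    best = (None, None, float("inf"))  # (feature, threshold, loss)
    for k in range(n_features):
        for t in sorted({row[k] for row in X}):
            left = [y[i] for i in range(m) if X[i][k] < t]
            right = [y[i] for i in range(m) if X[i][k] >= t]
            if not left or not right:
                continue  # skip splits that leave one side empty
            # weighted impurity: (m_left/m) * G_left + (m_right/m) * G_right
            loss = len(left) / m * gini(left) + len(right) / m * gini(right)
            if loss < best[2]:
                best = (k, t, loss)
    return best

X = [[1.0], [2.0], [3.0], [4.0]]
y = [0, 0, 1, 1]
print(best_split(X, y))  # (0, 3.0, 0.0): splitting at 3.0 gives pure leaves
```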
The tree will search for the best split, and recursively split its nodes until a stopping criterion is met.
Examples of stopping criteria:
- Cannot reduce loss any further (or by at least some threshold)
- Reaches max_depth, set by user
- The node to split has fewer than min_samples_split observations in it.
There are other stopping criteria provided in scikit-learn that effectively regularize the tree.
Note that decision trees are prone to overfitting if not regularized (theoretically, they could simply put every training observation in its own leaf). I hope your understanding of these splitting and stopping criteria helps you make more informed hyperparameter choices!
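As a sketch of what that regularization looks like in practice (the dataset and hyperparameter values here are illustrative, not recommendations):

```python
# Sketch: an unconstrained tree keeps splitting until its leaves are
# (nearly) pure, while regularization hyperparameters cap its growth.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)
shallow = DecisionTreeClassifier(
    max_depth=3,                 # stop after 3 levels of splits
    min_samples_split=10,        # don't split nodes with < 10 points
    min_impurity_decrease=0.01,  # require a meaningful loss reduction
    random_state=0,
).fit(X, y)

print(deep.get_depth(), shallow.get_depth())
```

The regularized tree ends up much shallower, trading some training accuracy for a better chance of generalizing.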
What about regression?
For each leaf, instead of predicting the dominant class label, predict the mean value of the response variable among all observations in the leaf.
Swap out the impurity function for MSE, and we can now use the decision tree for regression problems!
(This is typically not useful, until we get into random forests or boosting, but that’s how it works!)
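Here is a minimal sketch of that idea with scikit-learn's regression tree (the data is made up for illustration):

```python
# Sketch: the same splitting machinery, with MSE as the impurity
# measure, gives a regression tree; each leaf predicts the mean
# response of the training points that land in it.
from sklearn.tree import DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9]

reg = DecisionTreeRegressor(max_depth=1, random_state=0)  # one split only
reg.fit(X, y)

# With depth 1, the tree splits the data once and predicts the mean
# of y on each side of the threshold.
print(reg.predict([[2.0], [11.0]]))
```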
I hope you better understand decision trees now, and can get that job at Google!
However, decision trees are not very powerful, at least on their own. Typically, classification problems require higher-performance models, such as random forests or boosting, which use decision trees as a foundation.
Thus, it’s important to know the math behind them before we move forward. I will post about bagging and boosting in the coming weeks, so follow me to be notified when those articles are available!