In this guide, we’ll go beyond the basics. Terms like “node impurity,” “weighted fractions,” and “cost-complexity pruning” in the random forest documentation need no longer feel overwhelming. We’ll take each parameter in turn and explain its role and impact. Through a blend of theory and hands-on Python examples, you’ll come away with a working understanding of how to tune a Random Forest to your problem.
Seasoned data scientists often have an intuitive feel for their datasets, a sixth sense that guides them to the right algorithm and the right parameters. Whilst this might seem like arcane magic, it is simply the result of years of experience in applying and understanding these models. In this post we will cover the core elements of the random forest algorithm so that it is easy not only to understand but also to apply in your own data science project. So, whether you’re a data enthusiast looking to master Random Forests or a practitioner seeking a refresher with deeper insights, this post is your compass.
“A Random Forest is an ensemble machine learning method that combines multiple decision trees to produce more accurate and robust predictions.”
That description is apt for most purposes, but we are after something more. Below you will find code and explanations showing how changing key parameters of the algorithm affects the performance of your model.
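To make the quoted definition concrete, here is a minimal sketch of what "combining multiple decision trees" looks like in scikit-learn. It assumes the clean Iris data (no noise added) and arbitrary illustrative settings (`n_estimators=25`, `random_state=0`). It averages the class probabilities of the individual fitted trees by hand and compares the result with what the forest itself reports.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Illustrative settings only: 25 trees and a fixed seed for repeatability.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# After fitting, the individual decision trees live in forest.estimators_.
# Stack each tree's class probabilities: shape (n_trees, n_samples, n_classes).
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# The ensemble step: average the per-tree probabilities.
manual_proba = tree_probas.mean(axis=0)

# Check the hand-rolled average against the forest's own output.
matches = np.allclose(manual_proba, forest.predict_proba(X))
print("Manual average matches forest.predict_proba:", matches)
```

Each tree is trained on a different bootstrap sample of the rows, so the trees disagree with one another in places; averaging their outputs is what smooths out those individual errors.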
Typically, before entering the modelling phase, you will have already cleaned your datasets and carried out exploratory data analysis. From this you will have a grounded understanding of how the data varies and how it relates to your dependent variable. Combined with a knowledge of how each model works, that understanding gives some guidance as to which models are worth exploring.
In this post we will use the classic Iris toy dataset, with some noise injected into the variables so that we can see the value of optimising the model’s parameters:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier