You can download this notebook from: here. You should install all the required packages before running the code locally. You can download the dataset from:

## Problem Statement

Consider you are working with a weather forecasting agency that aims to improve its predictions of rain across Australia. As a Data Scientist, your task is to create an automated system that predicts whether it will rain the next day. Your prediction model should take into account features such as location, temperature, humidity, wind speed, and other relevant weather parameters.

Your system should not only predict whether it will rain but also provide a clear explanation for each prediction it makes.

To assist you in this task, you are provided with a CSV file containing a decade’s worth of daily weather observations from various Australian locations. The historical dataset includes weather parameters along with a binary indicator “RainTomorrow”, denoting whether it rained the next day (1mm or more, ‘Yes’) or not (‘No’).

Your job is to use this data to build a prediction system capable of accurately forecasting rain for any given day, even if that day’s data is not part of the initial dataset.

## Importing libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

%matplotlib inline

data = pd.read_csv('weatherAUS.csv')  # load the dataset
```

## Exploratory Data Analysis and Visualization

To understand the data that’s available, we perform exploratory analysis by visualizing the distribution of values in each feature and the relationships between RainTomorrow and the other features. By features, I mean the columns of the data.

`data`

The dataset contains 145460 rows and 23 columns. Each row contains information (columns), also called features, about the weather on a specific day. The task is to find a way to estimate the value in the “RainTomorrow” column using the values in the other columns. If we can do this estimation for historical data, then we should be able to estimate RainTomorrow for future dates that are not part of this set, simply by providing information like temperature, humidity, wind speed, and other relevant weather parameters.

Check the datatypes of each column

`data.info()`

Check the missing values of each column

`data.isnull().sum()`

Since there are not too many missing values in the RainTomorrow column, I will remove all days where this value is missing: it’s important to train our model only on rows with an accurate target value.

```python
data.shape  # outputs (145460, 23)

# Drop rows where RainToday or RainTomorrow is NaN
data.dropna(subset=['RainToday', 'RainTomorrow'], inplace=True)
```


Now we begin to explore some of the features.

The plot below shows that the distribution of the Location feature is fairly uniform: in most cities, the number of rainy days is similar. The exceptions, such as Uluru, Katherine, and Nhil, also have lower overall counts, which probably means more missing values because the weather was not fully measured there over these 10 years.

```python
# Explore rainy days over all locations in Australia
px.histogram(data, x='Location', color='RainToday', title='Rainy days over all locations in Australia')

data.Location.nunique()  # outputs 49
```

The plots below show the correlation between low temperatures at 3 pm and the chance of rain tomorrow, and between low temperatures in the morning and the chance of rain on that day. The plots confirm the intuitive assumption: the lower the temperature at 3 pm, the higher the chance of rain tomorrow.

```python
px.histogram(data, x='Temp3pm', color='RainTomorrow', title='Rain Tomorrow vs Temperature at 3pm')
px.histogram(data, x='Temp9am', color='RainToday', title='Rain Today vs Temperature at 9am')
```

It’s important to check the distribution of our Target Variable RainTomorrow

`px.histogram(data, x='RainTomorrow', color='RainToday', title='Rain Tomorrow vs Rain Today')`

In the plot above, we could tell that our RainTomorrow, the feature (class) that we’re trying to predict, is not equally represented in the dataset. This means that our data is imbalanced since there are more days when there’s no rain than when there is rain.

From this distribution, we can see that if it did not rain today, there’s a high chance that it won’t rain tomorrow (we can see this from the RainTomorrow = No bar). On the other hand, if we look at RainTomorrow = Yes, it’s hard to conclude from RainToday alone whether it’s going to rain tomorrow.

This makes it easier for our model to predict that it’s NOT going to rain tomorrow and harder to predict that it is.

`px.scatter(data.sample(1000), x='MinTemp', y='MaxTemp', color='RainToday', title='MinTemp and MaxTemp vs RainToday')`

From the plot above, we can observe a linear correlation between MinTemp and MaxTemp on a given day. Looking at the RainToday = Yes data points, we can also see that on days when it rained, there is not much difference between the minimum and maximum temperature.

```python
# Group the data by Cloud3pm and RainTomorrow and count the occurrences
grouped_data = data.groupby(['Cloud3pm', 'RainTomorrow']).size().unstack()

grouped_data.plot(kind='bar', stacked=True)
plt.xlabel('Cloud3pm')
plt.ylabel('Count')
plt.title('Correlation between Cloud3pm and RainTomorrow')
plt.legend(title='RainTomorrow')
plt.show()
```

Based on the plot above, we can see a correlation between cloud cover at 3 pm and rain tomorrow: if Cloud3pm is 7 or 8, there’s a high chance of rain the next day.

`px.histogram(data, x='WindSpeed3pm', color='RainTomorrow', title='Rain Tomorrow vs Wind Speed at 3pm')`

`px.scatter(x='Temp3pm', y='Humidity3pm', color='RainTomorrow', data_frame=data.sample(8000), title='Rain Tomorrow vs Temperature at 3pm, Humidity at 3pm')`

From the plots above, we can observe that if the humidity is greater than 50 and the temperature is relatively low, there’s more chance of rain tomorrow.

`px.scatter(x='Temp9am', y='Humidity9am', color='RainToday', data_frame=data.sample(8000), title='Rain Today vs Temperature at 9am, Humidity at 9am')`

We get similar results by checking humidity and temperature at 9 am.

## Train / Test / Validation Splits

**Training Set**: This is the largest portion of the dataset and is used to train the machine learning model. The model learns from this data and tunes its parameters to minimize the difference between the predicted and actual values of the target variable. The training set usually accounts for about 60–80% of the dataset.

**Validation Set**: The validation set is used to evaluate the performance of the model during the training phase and fine-tune the model’s hyperparameters (parameters that are not learned from the data but set by the practitioner, like the learning rate in gradient descent). The model does not learn from this data in the traditional sense; rather, it’s used to prevent overfitting to the training data. The validation set helps ensure that the model generalizes well to unseen data. It typically comprises about 10–20% of the dataset.

**Test Set**: This is the data that the model has never seen during its training or validation phase. It’s used to evaluate the final performance of the model after all training and validation have been completed. This helps us understand how the model will perform when making predictions on new, unseen data in the real world. Like the validation set, the test set usually makes up about 10–20% of the dataset.

Separating data into these three sets is important because it allows us to make sure our model not only performs well on the data it was trained on but also generalizes well to new, unseen data. This process helps us avoid overfitting, where a model learns the training data so well that it performs poorly when faced with new data.

```python
train_val_df, test_df = train_test_split(data, test_size=0.2, random_state=42)  # 20% of the data for testing
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=42)  # 25% of the remaining 80% for validation; the rest for training

data['Date'] = pd.to_datetime(data['Date'])  # convert Date column to datetime format
data['Year'] = data['Date'].dt.year  # extract year from the Date column
sns.countplot(x=data['Year'])
```

Since we’re trying to predict whether it will rain in the future, it’s logical to split the data accordingly: train on historical data (2007 until 2014), validate on 2015’s data for hyperparameter tuning, and test the model on the data after 2015.

```python
train_df = data[data['Year'] < 2015]  # data before 2015 for training
val_df = data[data['Year'] == 2015]   # data from 2015 for validation
test_df = data[data['Year'] > 2015]   # data after 2015 for testing
```

## Identify the Inputs and Target

There’s no point in training the model on the Date column. The other features, based on our analysis, are important for the prediction because they correlate with rain today/tomorrow.

The target column is RainTomorrow, and we should remove it from our inputs.

This dataset also has a Location column. We’re using it in our model, but that means the model can only make predictions for the locations present in the training set; if we wanted a more general model, the Location column should be removed.

## Handling Missing Data

Handling missing values is a very important step in data preprocessing. The strategy for handling them depends on the type of data and the nature of the problem.

For numerical data, it’s common to use the following techniques:

- Drop: Remove rows with missing values.
- Mean/Median: Replace missing values with the mean or median of the non-missing values. Since the mean is sensitive to outliers while the median is not, the median can be more suitable.
- Random: Substitute missing values with a random value from the available data.
- K-Nearest Neighbors (KNN): Impute missing values using the KNN algorithm, which fills missing values based on similar “neighbor” observations.

For categorical data, it’s common to use the following techniques:

- Drop: Remove rows with missing values.
- Mode: Replace missing values with the mode (most frequent value).
- Random: Same as for numerical data.
- KNN: Can also be used for categorical data, where the most common class among the K-nearest neighbors replaces the missing value.

Choosing the right method depends on the specific data, the importance of the feature, and the proportion of missing values.

There are more techniques for handling missing values; depending on the problem, you might experiment with different approaches.

In the code below we identify the numerical and categorical columns, define a helper for handling missing values, and apply it to our train/validation/test sets.

```python
numerical_columns = train_df.select_dtypes(include=np.number).columns  # numerical columns
categorical_columns = train_df.select_dtypes('object').columns         # categorical columns
train_df[categorical_columns].nunique()

def handle_missing(table, columns=None, method='drop'):
    table = table.copy()
    if columns is None:
        columns = table.columns
    for col in columns:
        if method == 'drop':
            table = table.dropna(subset=[col])
        elif method == 'mode':
            table[col].fillna(table[col].mode()[0], inplace=True)
        elif method == 'median':
            table[col].fillna(table[col].median(), inplace=True)
        elif method == 'mean':
            table[col].fillna(table[col].mean(), inplace=True)
        elif method == 'random':
            table[col] = table[col].apply(
                lambda x: np.random.choice(table[col].dropna().values) if np.isnan(x) else x)
    return table

train_df = handle_missing(train_df, columns=numerical_columns, method='mean')
test_df = handle_missing(test_df, columns=numerical_columns, method='mean')
val_df = handle_missing(val_df, columns=numerical_columns, method='mean')

y_train = train_df['RainTomorrow']
y_test = test_df['RainTomorrow']
y_val = val_df['RainTomorrow']

train_df = train_df.drop(['RainTomorrow'], axis=1)
test_df = test_df.drop(['RainTomorrow'], axis=1)
val_df = val_df.drop(['RainTomorrow'], axis=1)

train_df.isnull().sum()
```

## Scaling Numeric Features

Scaling numerical features to a certain range (like 0 to 1 or -1 to 1) is a good practice in machine learning. It helps ensure that all features contribute equally to the model’s prediction by preventing any single feature from dominating due to its larger scale. Additionally, it makes optimization algorithms more effective, as they usually function better with smaller numbers.

For example:

Consider two data points:

Person A: Age = 25, Income = $50,000

Person B: Age = 50, Income = $100,000

Without scaling, the income feature would overpower the age feature due to its larger values, affecting our model’s learning.

By applying min-max scaling, we adjust the values:

Person A: Scaled Age = 0, Scaled Income = 0

Person B: Scaled Age = 1, Scaled Income = 1

Now, both features have the same range, allowing the model to learn from both without bias.
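The min-max arithmetic above can be checked in a few lines (a sketch with a hypothetical `minmax_scale` helper, not part of the article’s code):

```python
# Minimal sketch of min-max scaling: x' = (x - min) / (max - min)
def minmax_scale(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [25, 50]              # Person A and Person B
incomes = [50_000, 100_000]

print(minmax_scale(ages))     # [0.0, 1.0]
print(minmax_scale(incomes))  # [0.0, 1.0]
```

With only two data points, each feature’s minimum maps to 0 and its maximum to 1, exactly as in the example.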

We scale the numerical features using the scikit-learn library.

```python
from sklearn.preprocessing import MinMaxScaler

train_df.describe()

scaler = MinMaxScaler()
scaler.fit(data[numerical_columns])

train_df[numerical_columns] = scaler.transform(train_df[numerical_columns])
val_df[numerical_columns] = scaler.transform(val_df[numerical_columns])
test_df[numerical_columns] = scaler.transform(test_df[numerical_columns])

val_df[numerical_columns].describe()
```

## Encoding Categorical Data

Before we can apply machine learning algorithms to categorical variables, we need to transform them into numerical form. This process is known as encoding.

For instance, take the ‘RainToday’ column, which contains ‘Yes’ and ‘No’. Given that there are only two categories, we can use binary encoding: assign ‘0’ to ‘No’ and ‘1’ to ‘Yes’, or vice versa.

In the case of a column with more categories, such as ‘WindGustDir’, we can apply One-Hot Encoding. Here, each category gets its own column in the data, and these new columns are binary, indicating the presence (1) or absence (0) of that category for a given record.

One-Hot Encoding is particularly useful when the categories do not have a natural order or hierarchy, as is the case with wind directions or locations. It prevents the machine learning algorithm from assigning inappropriate weight or importance to the categories based on a numerical value.

However, one should be cautious when using One-Hot-Encoding with a variable that has many categories. This is because it can lead to a high increase in the number of columns (dimensionality) in your dataset, making it sparse and potentially harder to work with — a situation often referred to as the “Curse of Dimensionality”. In such situations, other encoding techniques such as ordinal encoding or target encoding might be more appropriate.

One-Hot Encoding implementation using the scikit-learn library:

```python
from sklearn.preprocessing import OneHotEncoder

# Remove RainTomorrow from the categorical columns since it's our target
categorical_columns = categorical_columns.drop('RainTomorrow')

encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
encoder.fit(data[categorical_columns])

encoded_columns = list(encoder.get_feature_names_out(categorical_columns))
print(encoded_columns)

train_df[encoded_columns] = encoder.transform(train_df[categorical_columns].fillna('Unknown'))
val_df[encoded_columns] = encoder.transform(val_df[categorical_columns].fillna('Unknown'))
test_df[encoded_columns] = encoder.transform(test_df[categorical_columns].fillna('Unknown'))

train_df = train_df.drop(categorical_columns, axis=1)
test_df = test_df.drop(categorical_columns, axis=1)
val_df = val_df.drop(categorical_columns, axis=1)

train_df = train_df.drop(['Date', 'Year'], axis=1)
val_df = val_df.drop(['Date', 'Year'], axis=1)
test_df = test_df.drop(['Date', 'Year'], axis=1)
```

## Logistic Regression

Logistic regression is a classification learning algorithm. The explanation here covers binary classification, though it can be extended to multiclass problems. In logistic regression, similar to linear regression, we still want to model our output yi (the i-th example in our dataset) as a linear function of xi (yi = w * xi + b). But this results in a function whose values range from minus infinity to plus infinity (a continuous range, same as in linear regression), whereas in our problem yi can only be Yes or No (remember, RainTomorrow is Yes or No).

As a solution to this problem, we can encode the positive labels as 1 and the negative labels as 0 (in our case, ‘Yes’ = 1, ‘No’ = 0), and then we just need a simple continuous function whose codomain is (0, 1). In such a case, if the value returned by the model for input x is closer to 0, we assign a negative label to x; otherwise, the example is labeled as positive. One function with this property is the standard logistic function (also known as the sigmoid function): σ(z) = 1 / (1 + e^(−z)).

So now we just replace z with our well-known linear combination of features (x * w + b). Whatever value this returns lies in the range (0, 1), and then we can set a threshold: for example, values greater than 0.5 are classified as Yes, and values smaller than 0.5 as No.

So, we take a linear combination of our input features with their respective weights, then apply the sigmoid function to obtain a number between 0 and 1. This number represents the probability of the input being classified as “Yes”.
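That squashing behavior can be sketched directly (assuming the standard sigmoid formula; `sigmoid` here is a hypothetical helper, not the article’s code):

```python
import math

def sigmoid(z):
    # squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))   # 0.5 -> right on the decision boundary
print(sigmoid(5))   # close to 1 -> classify as 'Yes'
print(sigmoid(-5))  # close to 0 -> classify as 'No'
```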

The error function that evaluates our results is called the cross-entropy loss; for binary labels it is the average of −[y·log(ŷ) + (1 − y)·log(1 − ŷ)] over all examples, where ŷ is the predicted probability.

Minimizing cross-entropy loss leads to the best model, i.e., the one that predicts the highest probability for the correct class.
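A minimal sketch of the binary cross-entropy computation (labels encoded as 1 = ‘Yes’, 0 = ‘No’; the probabilities are made up for illustration):

```python
import math

def binary_cross_entropy(y_true, y_prob):
    # average of -[y*log(p) + (1-y)*log(1-p)] over all examples
    n = len(y_true)
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob)) / n

# confident, correct predictions -> small loss
low = binary_cross_entropy([1, 0], [0.9, 0.1])
# confident, wrong predictions -> large loss
high = binary_cross_entropy([1, 0], [0.1, 0.9])
print(low, high)
```

Notice how confidently wrong predictions are punished much more heavily than confidently correct ones are rewarded.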

In terms of weights and biases, our goal is to find the values for these parameters that minimize the cross-entropy loss function. This is typically achieved through an iterative process like gradient descent, where we start with random values for weights and biases and then iteratively adjust these values in the direction that most decreases the cross-entropy loss.

In a nutshell, the procedure would look like this:

- Initialize the weights and biases: Start with random values for the weights and biases. These are the initial “guesses” for the best values of these parameters.
- Compute the loss: Use the current weights and biases to make predictions on the training data, and then compute the cross entropy loss. This tells us how well (or poorly) the current weights and biases are performing.
- Compute the gradients: Calculate the gradients of the loss function with respect to the weights and biases. The gradient is a multi-dimensional derivative that tells us the slope of the loss function in every direction. It points in the direction of the steepest ascent.
- Update the weights and biases: Subtract a small multiple of the gradient from the current weights and biases. This “steps” in the direction of steepest descent, i.e., the direction that most decreases the loss.
- Repeat: Repeat steps 2–4 many times, until the loss has decreased to a satisfactory level or there are no significant decreases anymore.

It’s important to understand how gradient descent works: how the gradients are computed and how the weights are updated. For more info, watch this video: here.
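The five steps can be sketched end-to-end on a toy one-feature dataset (pure Python; the data, learning rate, and iteration count are illustrative choices, not the article’s setup):

```python
import math

# Toy data: one feature x, binary label y (1 = rain, 0 = no rain)
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [0, 0, 0, 1, 1]

w, b = 0.0, 0.0  # step 1: initialize the weight and bias
lr = 0.5         # learning rate

for _ in range(500):  # step 5: repeat
    # step 2: predictions via sigmoid(w*x + b)
    preds = [1 / (1 + math.exp(-(w * x + b))) for x in xs]
    # step 3: gradients of the cross-entropy loss w.r.t. w and b
    n = len(xs)
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / n
    grad_b = sum(p - y for p, y in zip(preds, ys)) / n
    # step 4: step in the direction of steepest descent
    w -= lr * grad_w
    b -= lr * grad_b

# after training, the rounded predictions separate the two classes
final_preds = [round(1 / (1 + math.exp(-(w * x + b)))) for x in xs]
print(final_preds)
```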

## Training the model using LogisticRegression and Interpreting using the sklearn library

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear', random_state=42)
```

The max_iter parameter of LogisticRegression is 100 by default. This means the solver will run up to 100 iterations, each time making predictions, calculating the error, and modifying the weights and bias.

`model.fit(train_df, y_train)`

We can check the weights assigned for each feature

```python
print(train_df.columns)
print(model.coef_.tolist())
```

We can see that MaxTemp has a negative weight, which could mean it pushes the prediction away from rain. Rainfall has a positive weight, temperature-related features have positive values, and so on. The model has learned these weights from the data; Sunshine, for example, has a highly negative weight.

The larger a weight’s magnitude, the more that feature influences our target. Let’s plot the feature importance by sorting and displaying each feature’s corresponding weight.

```python
weight_df = pd.DataFrame({
    'feature': train_df.columns,
    'weight': model.coef_.tolist()[0]
})
weight_df

sns.barplot(data=weight_df.sort_values('weight', ascending=False).head(10), x='weight', y='feature')
```

As we can see from the plot above, the most important features, with higher weights, that affect our target variable are WindGustSpeed, Humidity3pm, …

Make predictions on our train data using the code below:

```python
train_predictions = model.predict(train_df)
train_predictions
y_train
```

## Evaluation of our model

Count the number of matches between predictions and actual values, then divide by the total number of predictions (rows):

```python
from sklearn.metrics import accuracy_score

accuracy_score(y_train, train_predictions)
```

The accuracy of our model on the training data is about 85%. We can also return probabilities: for each prediction, the model outputs a probability for each class.

```python
model.classes_  # outputs array(['No', 'Yes'], dtype=object)

train_probabilities = model.predict_proba(train_df)

from sklearn.metrics import confusion_matrix
```

train_probabilities is a list of probability pairs: for each training data point, the probability that the prediction is ‘No’ and the probability that it is ‘Yes’ (the two sum to 1).

A better way to evaluate the performance of our model is the **confusion matrix**.

The four terms in the matrix are:

**True Positives (TP)**: These are the cases where the model predicted ‘Yes’ and the actual class was also ‘Yes’.

**True Negatives (TN)**: These are the cases where the model predicted ‘No’ and the actual class was also ‘No’.

**False Positives (FP)**: These are the cases where the model predicted ‘Yes’ but the actual class was ‘No’. This is also known as “Type I error.”

**False Negatives (FN)**: These are the cases where the model predicted ‘No’ but the actual class was ‘Yes’. This is also known as a “Type II error.”
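These four counts can be computed by hand from predictions and labels (a minimal sketch, not the scikit-learn implementation; the example arrays are made up):

```python
# Count TP, TN, FP, FN for a binary classifier with string labels
def confusion_counts(y_true, y_pred, positive='Yes'):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = ['Yes', 'No', 'Yes', 'No', 'No']
y_pred = ['Yes', 'No', 'No', 'Yes', 'No']
print(confusion_counts(y_true, y_pred))  # (1, 2, 1, 1)
```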

```python
confusion_matrix(y_train, train_predictions, normalize='true')
# OUTPUTS:
# array([[0.94614779, 0.05385221],
#        [0.47729149, 0.52270851]])
```

The confusion matrix is giving us an overview of how well our model is performing in predicting whether it will rain tomorrow or not.

Let’s break down what the values in the matrix mean. The top left value (0.9461) is the True Negative rate. This means that the model correctly predicted 94.61% of the time when it was NOT going to rain the next day. So, it’s very good at telling us when it’s going to be a dry day.

However, the bottom right value (0.5227), which represents the True Positive rate, shows that the model only correctly predicted that it would rain the next day about 52.27% of the time. So, it’s not that accurate when it comes to predicting rainy days (which is unfortunate because that’s what we’re trying to achieve with this model).

The top right value (0.0538), the False Positive rate, tells us that the model incorrectly predicted that it would rain the next day about 5.38% of the time when it was a dry day.

The bottom left value (0.4772), the False Negative rate, indicates that the model incorrectly predicted that it would not rain the next day about 47.72% of the time when it ended up raining.

Depending on the situation, we might be more worried about False Positives (predicting rain when it’s dry) or False Negatives (predicting dry when it’s raining). These are areas we’ll need to work on to improve our model’s performance.

Suppose we use the model to decide whether to host a tennis game tomorrow. If the false-negative rate is high, the model will often predict no rain on days when it does rain, so it is not serving us well. In that case, we try to reduce the false negatives, even if it reduces the overall accuracy.

If we were recommending chemotherapy for breast cancer, we would focus on false positives: predicting that a person has cancer when they don’t. So, depending on the problem, we optimize for different values.

Based on the confusion matrix, this model is not that good: it is right only about 52% of the time when it actually rains tomorrow, which is close to a random guess.

**The reason that the TP rate is only about 0.52 while the TN rate is about 0.95 is mostly that the dataset is imbalanced, and that should be fixed to get better results. This model is biased toward predicting no rain tomorrow, and it still hasn’t learned much about predicting when it is going to rain.**

Similarly to the predictions on the training set, we can make predictions on the test set, data our model has never seen, to estimate how well it would predict rain tomorrow in practice.

```python
test_predictions = model.predict(test_df)
accuracy_score(y_test, test_predictions)  # accuracy score on test data is 0.841
```

## Next Steps

- Handle class imbalance, since the algorithm is heavily biased towards the majority class (no rain tomorrow).
- Apply techniques like under- or over-sampling, SMOTE, or tuning class weights (experiment with these, always checking the results).
- Experiment more with logistic regression: change the 0.5 threshold (decision boundary) and pick the one that best fits the nature of the problem.
- Use the validation set for hyperparameter tuning.
- After experimenting with parameters and handling class imbalance, try feeding a single input to the model and check the probabilities of rain tomorrow or not.
- Use other classification algorithms, such as Random Forests or XGBoost.
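To illustrate the decision-boundary idea from the list above: a sketch of how moving the 0.5 threshold trades false negatives for false positives (the probabilities stand in for `model.predict_proba` output and are made up):

```python
# Hypothetical P(rain tomorrow) from a model, with the true labels
probs = [0.9, 0.6, 0.45, 0.3, 0.1]
labels = ['Yes', 'Yes', 'Yes', 'No', 'No']

def predict(probs, threshold):
    return ['Yes' if p >= threshold else 'No' for p in probs]

def false_negatives(labels, preds):
    # rainy days the model called dry
    return sum(t == 'Yes' and p == 'No' for t, p in zip(labels, preds))

print(false_negatives(labels, predict(probs, 0.5)))  # 1 rainy day missed
print(false_negatives(labels, predict(probs, 0.4)))  # 0: lowering the threshold catches it
```

Lowering the threshold makes the model say ‘Yes’ more often, which reduces false negatives but, on real data, usually increases false positives.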

## Random Forest Classifier

## Bagging

Bagging, also known as bootstrap aggregating, is a strategy used to make our predictions more reliable. It works by reducing variance in our predictions.

To understand bagging, it’s helpful to know two terms: ‘variance’ and ‘independent and identically distributed (i.i.d.)’.

Variance is a measure of how much values in a dataset differ from the average value. In simple words, it’s a way of understanding how spread out the data is. For instance, if we predict the price of a house based on certain features, the variance would be how much the predicted prices differ from the average predicted price. High variance might mean that our model performs very well on some data but very poorly on other data, which is not ideal.

Now, ‘independent and identically distributed’ or ‘i.i.d.’ is a term used to describe a scenario where all the elements in a sequence have the same probability distribution and each item in the sequence does not depend on the other items. In other words, it’s like drawing cards from a deck, replacing the card each time — each draw doesn’t affect the others and has the same odds.

So, what bagging does is that it averages the predictions from multiple models, each of which is trained on a different set of i.i.d. samples (datasets). This averaging process reduces the variance, meaning our predictions become more reliable and less scattered. In simpler terms, by using bagging, we’re gathering opinions from several models instead of relying on one, and this leads to a more stable and accurate final prediction.
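The variance-reduction claim can be illustrated numerically (pure Python; each ‘model’ here is simply the mean of one bootstrap sample, a stand-in for a real estimator):

```python
import random
import statistics

random.seed(42)
population = [random.gauss(0, 1) for _ in range(100)]

def bootstrap_prediction(data):
    # one 'model': the mean of one bootstrap sample (drawn with replacement)
    sample = [random.choice(data) for _ in data]
    return statistics.mean(sample)

# predictions from single models vs. bagged ensembles of 20 models
single = [bootstrap_prediction(population) for _ in range(200)]
bagged = [statistics.mean([bootstrap_prediction(population) for _ in range(20)])
          for _ in range(200)]

spread_single = statistics.stdev(single)
spread_bagged = statistics.stdev(bagged)
print(spread_single, spread_bagged)  # the bagged predictions are far less spread out
```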

## Decision Trees

Before using the Random Forest algorithm, it’s important to have an understanding of the Decision Tree model.

Imagine a small table of daily weather conditions where we want to decide whether to play golf. We can train a decision tree on this dataset: the algorithm constructs a hierarchical structure where each node holds a decision to make (the tree is built by the computer instead of us manually figuring out the best sequence of decisions that leads to an answer).

A more specific example is a tree trained using scikit-learn’s DecisionTreeClassifier class.

Each box shows a Gini value. This is the impurity measure used by the decision tree to decide which column should be used for splitting the data, and at what value it should be split. A lower Gini value means a better split, in the sense that the node’s data is more dominated by a single class. A perfect split (all data in the node belongs to the same class) has a Gini index of 0.

At each step, the decision tree splits the data into subsets, one for each new child node. The tree can keep splitting until all data points in each leaf belong to the same class, which is a perfect classification of the training data but also leads to overfitting.
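The Gini impurity can be computed directly (a sketch of the standard formula 1 − Σ pᵢ², with a hypothetical `gini` helper):

```python
# Gini impurity of a node: 1 - sum of squared class proportions
def gini(labels):
    n = len(labels)
    props = [labels.count(c) / n for c in set(labels)]
    return 1 - sum(p * p for p in props)

print(gini(['Yes', 'Yes', 'Yes']))       # 0.0 -> a pure node (perfect split)
print(gini(['Yes', 'No', 'Yes', 'No']))  # 0.5 -> worst case for two classes
```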

## Random Forests

Decision trees tend to overfit the training data. That means they can get so good at predicting the examples they’ve seen, that they perform poorly on new, unseen data. They may also be sensitive to small changes in the data, leading to drastically different trees.

Random Forest helps overcome these issues. It creates a whole forest of different decision trees, each trained on a different set of samples from the data (this is the ‘bagging’ part), and each tree only gets a subset of the features to make decisions on (this is the ‘random’ part). By doing so, Random Forest makes sure that the trees are diverse and not correlated to each other.

When a new prediction needs to be made, Random Forest takes the input, has each of the decision trees in the forest make a prediction, and then takes the majority vote of the predictions as the final decision. This way, the model leverages the wisdom of the crowd, making the overall predictions more reliable and less susceptible to the overfitting of any individual tree.
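The majority-vote step can be sketched in a few lines (each string stands in for one tree’s prediction):

```python
from collections import Counter

def majority_vote(tree_predictions):
    # the class predicted by the most trees wins
    return Counter(tree_predictions).most_common(1)[0][0]

print(majority_vote(['Yes', 'No', 'Yes', 'Yes', 'No']))  # 'Yes'
```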

`from sklearn.ensemble import RandomForestClassifier`

Using the same training/test sets, I’ve created a base model with the default parameters of RandomForestClassifier. We will use this model for initial predictions and then improve it further with techniques such as GridSearchCV to find the parameters that give the best results.

```python
model = RandomForestClassifier(n_jobs=-1, random_state=42)
model.fit(train_df, y_train)

print("Accuracy on train set")
print(model.score(train_df, y_train))
print("Accuracy on validation set")
print(model.score(val_df, y_val))

# OUTPUTS:
# Accuracy on train set
# 0.9999795893374699
# Accuracy on validation set
# 0.8562233015390017
```

Looking at the accuracy scores, we can observe a big difference between the training and validation sets. The model is practically perfect on the training set with an accuracy of 99.99%, but on the validation set, the accuracy drops to 85.62%. This indicates that our model is overfitting — it has learned the training data extremely well, maybe too well, to the point where it struggles to generalize to new, unseen data in the validation set.

Overfitting occurs when the model learns the specific details and noise in our training data to the extent that it negatively impacts our model’s ability to perform on new data. In this case, it has essentially ‘memorized’ the training set, and therefore performs significantly worse on data it hasn’t seen before.

Ensemble methods, which combine the predictions from multiple models, can often help with this. They work on the principle that the individual errors of each model tend to cancel each other out when averaged, thus leading to a more robust and accurate final prediction. It’s kind of like asking many people for their opinion — they might not all agree, but collectively they can often arrive at a better decision. In the case of decision trees, it would take a lot of trees predicting inaccurately to end up with a final incorrect prediction.

## Get Feature Importance

```python
importance_df = pd.DataFrame({
    'feature': train_df.columns,
    'importances': model.feature_importances_
}).sort_values('importances', ascending=False)

plt.title("Feature importances")
sns.barplot(data=importance_df.head(10), x='importances', y='feature')
```

## Hyperparameter Tuning with Random Forest

GridSearchCV and RandomizedSearchCV are two methods that can be used from sklearn for hyperparameter tuning. They are used to find the optimal hyperparameters of a model which results in the most ‘accurate’ predictions.

GridSearchCV: This method performs an exhaustive search over specified parameter values for an estimator. It trains the model for each combination of the hyperparameters and retains the best combination. For example, if you specify max_depth values as [1, 2, 3] and n_estimators as [50, 100, 200], then GridSearchCV will try all combinations of [(1, 50), (1, 100), (1, 200), (2, 50), (2, 100), (2, 200), (3, 50), (3, 100), (3, 200)] and return the set of parameters with the best performance metric. The downside is that it can be very time-consuming for larger datasets and/or when many parameter values are specified.

RandomizedSearchCV: This method performs a random search over hyperparameters: each setting is sampled from a distribution (or list) of possible parameter values, for a fixed number of iterations. Given enough time, RandomizedSearchCV tends to find parameters as good as, or better than, those found by GridSearchCV, and it is usually much faster.
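The combination count from the GridSearchCV description can be reproduced with `itertools.product` (this sketches only the search space, not the cross-validated training):

```python
from itertools import product

param_grid = {'max_depth': [1, 2, 3], 'n_estimators': [50, 100, 200]}

# every (max_depth, n_estimators) pair that GridSearchCV would try
combinations = list(product(param_grid['max_depth'], param_grid['n_estimators']))
print(len(combinations))  # 9 candidate settings, each trained and scored
print(combinations[:3])   # [(1, 50), (1, 100), (1, 200)]
```

With k parameters of n₁, n₂, …, nₖ values each, grid search trains n₁·n₂·…·nₖ models (times the number of cross-validation folds), which is why it gets expensive quickly.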