Regression is a powerful statistical tool that allows us to predict a continuous outcome variable based on the value of one or several predictor variables. But how do we evaluate the accuracy and reliability of our predictions?
This is where regression metrics come into play. In this blog, we’ll delve into some key regression metrics such as MSE, MAE, RMSE, R2 Score, and Adjusted R2 Score, simplifying them for a clear understanding.
1. MAE:
MAE stands for mean absolute error. Consider that you are working on a simple linear regression problem and you have to calculate MAE.
To calculate MAE, you find the error with respect to each data point, (yi − ŷi), and take its absolute value, |yi − ŷi|. You do the same for all data points and add the results together.
yi = original target value
ŷi = predicted target value
Then you divide the sum by the total number of rows, n, so the formula for MAE becomes
MAE = (1/n) Σ |yi − ŷi|
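Here is a minimal sketch of the MAE calculation: take the absolute difference between each actual and predicted value, then average over all rows. The function name and the package values are made up for illustration and are not from any real dataset.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical packages in LPA (illustrative numbers only)
y_true = [3.0, 4.5, 6.0, 8.0]
y_pred = [3.5, 4.0, 7.0, 6.0]

mae = mean_absolute_error(y_true, y_pred)
print(f"MAE = {mae} LPA")
```

The absolute errors here are 0.5, 0.5, 1.0 and 2.0, so the MAE works out to 1.0 LPA.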
After calculating MAE, you get a number, i.e., the loss. Our aim is to reduce the value of MAE as much as possible.
Advantages:
1. The number you get is in the same unit as ‘Y’. Suppose your ‘Y’ variable is ‘Package (LPA)’. If the MAE you got is 1.8, then it is 1.8 LPA. In short, Unit of ‘Y’ Variable = Unit of MAE.
2. It is robust to outliers.
Disadvantage:
MAE uses the modulus (absolute value) function, and the graph of the modulus function is not differentiable at zero. This makes MAE awkward to use as a loss function for gradient-based optimization.
MSE solves this problem.
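2. MSE:
MSE stands for mean square error. Consider the previous example, where MAE is
MAE = (1/n) Σ |yi − ŷi|
There is a slight difference between MSE and MAE. In MSE, instead of the modulus, you use the square function:
MSE = (1/n) Σ (yi − ŷi)²
The geometric intuition of MSE is that each squared error (yi − ŷi)² is the area of a square whose side is the vertical distance between the data point and the regression line, and MSE is the average of these areas.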
Calculating MSE also gives you a single number, and again the aim is to reduce the value of MSE as much as possible.
Advantage:
You can use it as a loss function, because the square function is differentiable everywhere, including at 0.
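To see the differentiability point numerically, the sketch below estimates the slope just to the left and just to the right of zero for the absolute error and for the squared error. The helper `one_sided_slopes` and the step size `h` are assumptions made for this illustration.

```python
def one_sided_slopes(f, x=0.0, h=1e-6):
    """Return the (left, right) finite-difference slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right

# Absolute error |e| and squared error e^2, both evaluated around e = 0
abs_left, abs_right = one_sided_slopes(abs)
sq_left, sq_right = one_sided_slopes(lambda e: e * e)

print("absolute error slopes:", abs_left, abs_right)
print("squared error slopes: ", sq_left, sq_right)
```

For the absolute error the two slopes disagree (one is negative and the other positive), so there is no single derivative at zero. For the squared error both slopes shrink towards 0 as `h` gets smaller, so the derivative exists.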
Disadvantages:
1. The unit of the number you get after calculating MSE is difficult to interpret. If your ‘Y’ variable has the unit LPA, then MSE is in (LPA)².
2. It is not robust to outliers. Because the errors are squared, outliers are penalized heavily and have too much impact on the result.
If your data has many outliers, you should go with MAE. If it has few outliers, you should go with MSE.
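A small sketch of why outliers matter more for MSE than for MAE. All numbers are hypothetical: every prediction is off by 1 LPA, and then one actual value is replaced by an outlier so that a single error becomes 10 LPA.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float)))

def mean_squared_error(y_true, y_pred):
    return np.mean((np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2)

y_pred = [4.0, 5.0, 6.0, 7.0]
y_clean = [3.0, 4.0, 5.0, 6.0]     # every error is exactly 1 LPA
y_outlier = [3.0, 4.0, 5.0, 17.0]  # last row is an outlier: error of 10 LPA

mae_clean = mean_absolute_error(y_clean, y_pred)
mse_clean = mean_squared_error(y_clean, y_pred)
mae_outlier = mean_absolute_error(y_outlier, y_pred)
mse_outlier = mean_squared_error(y_outlier, y_pred)

print("clean:   MAE =", mae_clean, " MSE =", mse_clean)
print("outlier: MAE =", mae_outlier, " MSE =", mse_outlier)
```

With one outlier, MAE goes from 1.0 to 3.25, while MSE goes from 1.0 to 25.75, because the single error of 10 contributes 100 after squaring.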
3. RMSE:
RMSE stands for root mean square error.
RMSE = √MSE
Its properties are similar to those of MSE.
The main benefit is that the value you get after calculating RMSE has the same unit as your target variable, ‘Y’, so it is easy to interpret.
The disadvantage is that it is not robust to outliers. Even so, RMSE is the metric used most of the time.
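As a quick sketch, RMSE is just the square root of MSE, which brings the value back to the unit of the target. The function name and the numbers are illustrative assumptions.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean of squared errors."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return np.sqrt(mse)

# Hypothetical packages in LPA (illustrative numbers only)
y_true = [3.0, 4.5, 6.0, 8.0]
y_pred = [6.0, 0.5, 6.0, 8.0]

rmse = root_mean_squared_error(y_true, y_pred)
print(f"RMSE = {rmse} LPA")
```

The squared errors are 9, 16, 0 and 0, so MSE is 6.25 (LPA)² and RMSE is 2.5 LPA.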
4. R2 Score:
The R2 score tells you how well your model is performing. Consider the placement dataset, which has CGPA as the input column and package (LPA) as the target.
Suppose you don’t have the CGPA values and only have the package values, and someone asks you, “I want to take admission to this college. What will be my package?”
Then you have only one option: you would calculate the mean package of all students who were placed in the past and give that as the answer.
So in the worst-case scenario, where you still have to tell someone how much package they will get, the best option is the mean (the y-mean line).
But you do have the CGPA values, so you can use them to draw the regression line and use that line for the prediction.
When you calculate the R2 score, you actually compare how much better the linear regression line is than the y-mean line. It is also called the coefficient of determination.
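The comparison can be sketched with the standard definition, R2 = 1 − (squared error of the regression line) / (squared error of the y-mean line). The CGPA and package values below are hypothetical, and `np.polyfit` with `deg=1` is used here simply as one convenient way to fit a straight line.

```python
import numpy as np

def r2_score(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the regression line
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of the y-mean line
    return 1.0 - ss_res / ss_tot

# Hypothetical placement data (illustrative numbers only)
cgpa = np.array([6.0, 7.0, 8.0, 9.0])
package = np.array([3.0, 5.0, 5.0, 7.0])

# Fit a straight line: package ≈ slope * cgpa + intercept
slope, intercept = np.polyfit(cgpa, package, deg=1)
predicted = slope * cgpa + intercept

r2 = r2_score(package, predicted)
print("R2 =", round(r2, 4))
```

For this toy data the regression line has a squared error of 0.8 against 8.0 for the mean line, so the R2 score is about 0.9.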
Interpretation of R2 Score
If the R2 score is zero, then the amount of error in both lines is the same, for example when the regression line overlaps the y-mean line.
If the R2 score is one, then there is no error in the regression line, which means it is perfect.
The better your model gets, the more the R2 score moves towards 1. The worse it gets, the more the R2 score moves towards zero.
If the R2 score is negative, your regression line is making more errors than the mean line. This is the worst-case scenario.
If the R2 score comes out at 0.80, the interpretation is that the CGPA column, which is the input column, is able to explain 80% of the variance in the package (LPA) column.
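These cases can be checked with a small sketch. The package values are hypothetical, and the three sets of "predictions" are constructed by hand (they are not fitted models) just to hit each case.

```python
import numpy as np

def r2_score(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

package = np.array([3.0, 5.0, 5.0, 7.0])

mean_line = np.full_like(package, package.mean())  # always predict the mean
perfect_line = package.copy()                      # predict every value exactly
bad_line = package[::-1].copy()                    # predictions in reversed order

r2_mean = r2_score(package, mean_line)
r2_perfect = r2_score(package, perfect_line)
r2_bad = r2_score(package, bad_line)

print("mean line:   ", r2_mean)
print("perfect line:", r2_perfect)
print("bad line:    ", r2_bad)
```

Predicting the mean gives 0, predicting every value exactly gives 1, and the reversed predictions make more error than the mean line, so their score is negative (−3 for these numbers).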
The biggest flaw in the R2 score is that when you add more input columns to your dataset, the R2 score increases. It assumes that the more input columns you add, the more variance will be explained.
Even if you add irrelevant columns, the R2 score still increases or stays constant. Ideally, it should decrease. This is not appropriate behavior, and the adjusted R2 score is used to handle it.
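A rough sketch of this flaw on synthetic data: fit a least-squares model once with CGPA only and once with CGPA plus a column of pure random noise, then compare the R2 scores on the same data. The data-generating numbers, the seed and the helper `fit_r2` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Synthetic data: package depends on CGPA plus some noise
cgpa = rng.uniform(5.0, 10.0, size=n)
package = 1.5 * cgpa - 4.0 + rng.normal(0.0, 1.0, size=n)

# An irrelevant column that has nothing to do with package
noise_col = rng.normal(size=n)

def fit_r2(columns, y):
    """Fit least squares with an intercept and return R2 on the same data."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_pred = X @ coef
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_one = fit_r2([cgpa], package)
r2_two = fit_r2([cgpa, noise_col], package)

print("R2 with CGPA only:        ", r2_one)
print("R2 with CGPA + noise col: ", r2_two)
```

The second score is never lower than the first, even though the extra column carries no information about the package.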
5. Adjusted R2 score
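The formula is designed in such a way that it removes this flaw of the R2 score:
Adjusted R2 = 1 − [(1 − R2)(n − 1)] / (n − 1 − k)
Here n is the number of rows and k is the number of input columns. You can see how it works with the help of two cases.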
Case 1: When you add an irrelevant column, the value of ‘k’ in the adjusted R2 formula increases, which results in a decrease in the value of the denominator, i.e., (n − 1 − k).
The (n − 1) term in the numerator stays constant, and the (1 − R2) term decreases only a little or stays constant, because R2 increases only a little or not at all. Let’s consider it constant for this discussion.
So the numerator is constant but the denominator is decreasing, which means the whole fraction increases.
When you subtract it from 1, the result becomes smaller, which means the adjusted R2 score decreases.
Case 2: When you add a relevant column, the value of k increases, so the value of the denominator (n − 1 − k) decreases, while (n − 1) stays constant.
But since the R2 score increases substantially when you add a relevant column, the (1 − R2) term decreases sharply, so the whole fraction decreases.
When you subtract it from 1, the result increases, which means the adjusted R2 score increases.
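Both cases can be tried out with a small sketch of the adjusted R2 formula, 1 − (1 − R2)(n − 1) / (n − 1 − k). The R2 values below are assumed for illustration: 0.80 with one column, 0.801 after adding an irrelevant column, and 0.90 after adding a relevant one.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R2 for n rows and k input columns."""
    if n - 1 - k <= 0:
        raise ValueError("need more rows than input columns plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - 1 - k)

n = 50  # hypothetical number of rows

base = adjusted_r2(0.80, n, k=1)         # one input column
irrelevant = adjusted_r2(0.801, n, k=2)  # R2 barely moves after adding a column
relevant = adjusted_r2(0.90, n, k=2)     # R2 jumps after adding a column

print("one column:          ", round(base, 4))
print("+ irrelevant column: ", round(irrelevant, 4))
print("+ relevant column:   ", round(relevant, 4))
```

With these numbers, the adjusted score drops slightly below the base value for the irrelevant column and rises well above it for the relevant one.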
The adjusted R2 score therefore becomes a reliable metric, especially when dealing with multiple linear regression.