In the previous post we discussed what Linear Regression is and what it is used for; today we will cover the cost function, gradient descent, and model evaluation.

The cost function is used to find the best line in linear regression: it measures the error, which is the distance between the predicted value (on the line) and the actual data points. The smaller this difference (the error), the better the line fits the data.

In Linear Regression, the **Mean Squared Error (MSE)** cost function is generally used; it is the average of the squared errors between the predicted values **ŷᵢ** and the actual values **yᵢ**.

We calculate MSE for the simple linear equation y = mx + b:

**MSE = (1/n) · Σᵢ (yᵢ − (m·xᵢ + b))²**
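As a quick sketch, the MSE of a candidate line can be computed with NumPy (the sample data and the two candidate lines below are purely illustrative):

```python
import numpy as np

def mse(x, y, m, b):
    """Mean squared error of the line y = m*x + b on points (x, y)."""
    y_pred = m * x + b                # predictions on the candidate line
    return np.mean((y - y_pred) ** 2) # average squared residual

# Illustrative data: points that lie exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

print(mse(x, y, m=2.0, b=1.0))  # perfect fit -> 0.0
print(mse(x, y, m=1.0, b=0.0))  # a worse line -> larger error
```

A lower MSE means the candidate line sits closer to the data points, which is exactly what the optimization below tries to achieve.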

Using the MSE function, we update the values of the intercept **b** and the slope **m** so that the MSE settles at its minimum. These parameters can be determined using the gradient descent method, which drives the cost function toward its minimum value.

However, while the MSE helps us evaluate how well the model is performing, it doesn’t directly provide a method for finding the best values of the model’s parameters (in this case, the slope “m” and the intercept “b”) to minimize this error. This is where Gradient Descent comes into play.

Here’s why we need Gradient Descent in linear regression:

Linear regression aims to find the best-fitting line by adjusting the slope and intercept to minimize the MSE. Gradient Descent is an optimization algorithm that iteratively adjusts these parameters to minimize the cost function, which, in this case, is the MSE.

Imagine you’re inside a complex maze, and your goal is to reach the exit, but you can only move in discrete steps (forward, backward, left, or right). You can think of this maze as a three-dimensional surface, where each point in the maze represents a different elevation.

In this scenario:

- Gradient Descent can be compared to your strategy for finding the exit efficiently. The gradient (a vector that points in the direction of steepest ascent) indicates the slope of the surface at your current location.
- Learning Rate represents how large or small your steps are. If you take very small steps, you’ll carefully navigate the maze, but it might take a long time to reach the exit. If you take large steps, you might move too quickly and miss the correct path.
- Optimal Path corresponds to the quickest way to reach the exit. Gradient Descent helps you by constantly assessing the slope (gradient) at your current location and guiding you toward the steepest descent, ultimately leading you to the exit.

Just like in optimization problems, choosing the right learning rate in the maze is essential. If your steps are too small, you might wander aimlessly; if they’re too large, you might miss the optimal path. The goal is to strike a balance between step size (learning rate) and the guidance provided by the gradient, so that you navigate the maze efficiently and reach your destination. One way to achieve this is to apply the batch gradient descent algorithm, in which the parameters are updated once per iteration using gradients computed over the entire training set.
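The batch update rule can be sketched as follows; this is a minimal NumPy implementation, and the learning rate, iteration count, and sample data are illustrative assumptions rather than recommended settings:

```python
import numpy as np

def batch_gradient_descent(x, y, lr=0.05, n_iters=2000):
    """Fit y = m*x + b by minimizing MSE with batch gradient descent."""
    m, b = 0.0, 0.0                          # start from an arbitrary line
    n = len(x)
    for _ in range(n_iters):
        y_pred = m * x + b
        error = y_pred - y
        # Gradients of the MSE with respect to m and b, over the full batch
        grad_m = (2.0 / n) * np.sum(error * x)
        grad_b = (2.0 / n) * np.sum(error)
        # Step opposite the gradient, i.e. in the direction of steepest descent
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Illustrative data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
m, b = batch_gradient_descent(x, y)
print(m, b)  # should approach m = 2, b = 1
```

Note how the learning rate `lr` plays exactly the role of the step size in the maze analogy: too small and convergence crawls, too large and the updates overshoot the minimum.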

The strength of any linear regression model can be assessed using various evaluation metrics. These evaluation metrics usually provide a measure of how well the observed outputs are being generated by the model.

The most used metrics are,

- Coefficient of Determination or R-Squared (R2)
- Root Mean Squared Error (RMSE) and Residual Standard Error (RSE)

R-Squared is a number that expresses the proportion of the variation in the data that is explained/captured by the developed model. It always ranges between 0 and 1. In general, the higher the value of R-squared, the better the model fits the data.

Mathematically it can be represented as,

**R² = 1 − ( RSS / TSS )**

**Residual Sum of Squares (RSS)** is defined as the sum of the squared residuals over all data points in the plot/data. It measures the difference between the predicted and the actual observed output:

**RSS = Σᵢ (yᵢ − ŷᵢ)²**

**Total Sum of Squares (TSS)** is defined as the sum of the squared deviations of the data points from the mean of the response variable. Mathematically, TSS is

**TSS = Σᵢ (yᵢ − ȳ)²**

where ȳ (y bar) is the mean of the sample data points.
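Putting RSS, TSS, and R² together, here is a small NumPy sketch; the observed values and model predictions are made-up numbers for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - RSS / TSS."""
    rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    tss = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - rss / tss

# Illustrative observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])
print(r_squared(y_true, y_pred))  # close to 1 for a good fit
```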

The Root Mean Squared Error is the square root of the mean of the squared residuals. It specifies the absolute fit of the model to the data, i.e. how close the observed data points are to the predicted values. Mathematically, it can be represented as

**RMSE = √( (1/n) · Σᵢ (yᵢ − ŷᵢ)² )**
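As a matching sketch in NumPy (again with illustrative observed values and predictions):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the mean squared residual."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])
print(rmse(y_true, y_pred))  # a small value indicates a close fit
```

Unlike R², this value is expressed in the same units as the response variable, which is exactly the limitation discussed next.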

R-squared is often a more useful measure than RMSE: because the value of the Root Mean Squared Error depends on the units of the variables (i.e. it is not a normalized measure), it changes when the units of the variables change.

That’s it for Linear Regression. Thank you for reading; see you in the next post.