In this article, I am going to explain the Ridge and Lasso Regression algorithms. Ridge and Lasso Regression are **regularization techniques** used to prevent **overfitting** in linear regression models by adding a **penalty term** to the loss function. Regularization helps overcome overfitting problems in machine learning models, and it gets its name because it helps keep the parameters regular, or normal. The most common techniques are **L1 and L2 Regularization**, better known as **Lasso and Ridge Regression**.

As we know, in linear regression we use data points to draw the best-fit line. In this example, I have a single independent feature **x** and the output **y**. I have some points like the ones below, and I am trying to create the best-fit line with the help of linear regression.

`(x, y) = (1,1), (2,2)`

Please assume the training data set has only these two points. Linear regression gives a best-fit line that passes through both of them. Now we calculate the **sum of residuals** for this line. I will treat this as a cost function, because it is what I need to minimize.

`Cost Function:`

J(θ0,θ1) = 1/2m [i =1 -> m Σ (hθ(x)^i - y^i)^2]
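As a quick check, this cost function can be evaluated directly in a few lines of Python (a minimal sketch; the function and variable names are my own, not from the article):

```python
# Minimal sketch of the linear regression cost function J(θ0, θ1)
# evaluated on the two-point training set (1, 1), (2, 2) from this example.

def cost(theta0, theta1, xs, ys):
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs, ys = [1, 2], [1, 2]
print(cost(0, 1.0, xs, ys))  # the best-fit line y = x gives a cost of 0.0
print(cost(0, 0.5, xs, ys))  # a worse slope gives a positive cost (0.3125)
```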

Now I will plot the line that passes through all the points, like this.

Now if I calculate the cost function, the value of **J(θ0,θ1)** is **0**, because this line passes exactly through both points (and since it passes through the graph’s origin, **θ0** is **zero**). This data is our **training data**. Now we add a new data point (the red point), which we can call test data, and you can see there is a gap **(E)** between the newly added data point and the line we created. This condition is called **Overfitting**.

That means the **model performs well with training data but fails to perform well with test data**.

When a model performs **well** with the training data (low bias) but **fails** to perform well with the test data (high variance), it is called **Overfitting**.

So there is another scenario called **underfitting**.

When a model **fails** to perform well with the training data (high bias) and also performs **badly** with the test data, it is called **Underfitting**.

Let’s say I have three models: **Model 1, Model 2, and Model 3**. Model 1 has a training accuracy of **90%** and a test accuracy of **80%**. Model 2 has a training accuracy of **92%** and a test accuracy of **91%**. Model 3 has a training accuracy of **70%** and a test accuracy of **65%**. Model 1 is an **overfitting model**, Model 2 is a **generalized model**, and Model 3 is an **underfitting model**.

So the first model has **low bias and high variance**, the second model has **low bias and low variance**, and the third model has **high bias and low variance**. We always want a **generalized model**, because a generalized model will give us an excellent output on unseen data.

The error function (the blue line in the graph above) is calculated on the training data set. When the model fits too closely to the training data, this is termed **Overfitting**: the model performs very well on training data but very poorly on testing data. Regularization neutralizes or optimizes away this error, and thus helps keep the parameters regular, or normal. **This is the scenario where regularization techniques come into the picture.**

A regression model that uses the L1 regularization technique is called **Lasso Regression**, and a model that uses L2 is called **Ridge Regression**. The key difference between these two is the penalty term.

## Ridge Regression (L2 Regularization)

**Ridge regression** adds the “*squared magnitude*” of the coefficients as a penalty term to the loss function. The last part of the equation below represents the L2 regularization element (**λ [i =1 -> m Σ (slope)²]**).

`λ [i =1 -> m Σ (slope)²] : λ is a Hyperparameter`

J(θ0,θ1) = 1/2m [i =1 -> m Σ (hθ(x)^i - y^i)^2] + λ (slope)²

J(θ0,θ1) = 1/2m [i =1 -> m Σ (ˆy^i - y^i)^2] + λ (slope)²

Now let’s observe how this will reduce the overfitting.

`hθ(x) = θ0 + θ1x`

// θ0 is zero here, so θ0 = 0

hθ(x) = θ1x

// Here θ1 is called the slope

// I don't want J(θ0,θ1) to be zero, because that is the overfitting condition

J(θ0,θ1) = 1/2m [i =1 -> m Σ (ˆy^i - y^i)^2] + λ (slope)² // Assume λ = 1 and slope = θ1

= 0 + 1 (θ1)²

= (θ1)² : This is > 0

// Now the cost function is not equal to zero, so gradient descent will settle on a different best-fit line

// And the new best-fit line never passes exactly through the training data points

Now you know your new best-fit line’s cost function is minimal but not zero. The slope takes a small value, and when the slope is small, the cost function stays small too. This is how Ridge Regression makes sure that overfitting does not happen.
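A small numeric sketch of this idea (the grid search and names are my own, chosen for illustration): for the points (1, 1) and (2, 2), minimizing the ridge cost with λ = 1 pulls the best slope below 1, so the line no longer passes exactly through the training points.

```python
# Sketch: the ridge penalty λ·θ1² shrinks the optimal slope below 1,
# so the best-fit line no longer passes exactly through (1,1) and (2,2).

def ridge_cost(theta1, lam, xs, ys):
    m = len(xs)
    mse = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return mse + lam * theta1 ** 2

xs, ys = [1, 2], [1, 2]
grid = [i / 1000 for i in range(2001)]  # candidate slopes 0.000 .. 2.000
best_plain = min(grid, key=lambda t: ridge_cost(t, 0.0, xs, ys))
best_ridge = min(grid, key=lambda t: ridge_cost(t, 1.0, xs, ys))
print(best_plain)  # 1.0   -> plain least squares: the line through both points
print(best_ridge)  # 0.556 -> λ = 1 shrinks the slope, and the cost is now > 0
```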

## What is the relationship between λ and (slope)²?

I will explain this using the example from the first article. Let’s change the λ values and observe how the cost-function graph spreads.

You may remember I used a dataset like this to calculate the cost function. With λ = 0, I got these J(θ1) values with respect to θ1:

| slope (θ1) | J(θ1), λ = 0 |
| --- | --- |
| -1 | 9.33 |
| 0 | 2.3 |
| 0.5 | 0.58 |
| 1 | 0 |

Now let’s change **λ** and observe the cost function **J(θ1)** and its curve. I take λ values of 1, 2, and 3. This is the result I got.

| slope (θ1) | J(θ1), λ = 0 | J(θ1), λ = 1 | J(θ1), λ = 2 | J(θ1), λ = 3 |
| --- | --- | --- | --- | --- |
| -1 | 9.33 | 10.33 | 11.33 | 12.33 |
| 0 | 2.3 | 2.3 | 2.3 | 2.3 |
| 0.5 | 0.58 | 0.83 | 1.08 | 1.33 |
| 1 | 0 | 1 | 2 | 3 |
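These values can be reproduced in a few lines of Python (a sketch; I am assuming the dataset from the first article is (1, 1), (2, 2), (3, 3), since that matches the λ = 0 column):

```python
# Reproduces the table above: J(θ1) + λ·θ1² on the points (1,1), (2,2), (3,3).

def ridge_j(theta1, lam, xs=(1, 2, 3), ys=(1, 2, 3)):
    m = len(xs)
    mse = sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)
    return mse + lam * theta1 ** 2

for theta1 in (-1, 0, 0.5, 1):
    row = [round(ridge_j(theta1, lam), 2) for lam in (0, 1, 2, 3)]
    print(theta1, row)  # e.g. 0.5 -> [0.58, 0.83, 1.08, 1.33]
```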

Now I am trying to plot these points.

`λ = 0 (blue graph)`

Here the cost function can still reach 0, which means we are not applying Ridge Regression at all; this is just the cost function of linear regression. Assume the global minimum is at θa, where J(θ1) is zero.

`λ = 1 (red graph)`

Now I get a new cost curve whose global minimum is at θb. You can see the curve has shifted: the optimal θ1 value has reduced, and J(θ1) is no longer zero.

`λ = 2 (orange graph)`

The curve shifts again and the global minimum is at θc, with a still smaller θ1 and a non-zero J(θ1).

`λ = 3 (green graph)`

The global minimum is now at θd, with the smallest θ1 of the four, and J(θ1) is again non-zero.

So you can see that I can no longer get a zero value for J(θ1) as I increase the λ value.

So **λ** is a **hyperparameter** here. When we observe the relationship between λ and the slope, you can see that **as λ increases, the slope decreases**.

Let’s assume I have a multiple linear regression like **hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3**, and let’s try to understand how we can reduce overfitting using Ridge Regression.

`hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3`

// Let's assume we get values for the slopes like this:

hθ(x) = 0.55 + 0.54x1 + 0.43x2 + 0.23x3 -- (1)

// When we apply Ridge Regression, the slopes shrink. I am using some arbitrary values here:

hθ(x) = 0.55 + 0.43x1 + 0.34x2 + 0.12x3 -- (2)

// Here 0.43x1 means that when x1 moves by 1, y moves by 0.43

// A large slope means the input feature x1 is highly correlated with the output feature y

// So the features in equation (1) are more strongly correlated with the output than in equation (2)

* Since the slope values in equation (2) are very small compared to equation (1), they will not move the best-fit line by a large amount, yet the slopes are still not equal to zero.

* Ridge Regression reduces the impact of each input feature on the output feature by shrinking its coefficient (slope). This is what Ridge Regression means.
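To see this shrinkage numerically, here is a sketch using the closed-form ridge solution θ = (XᵀX + λI)⁻¹Xᵀy on synthetic data (the data, λ values, and names are my own, chosen for illustration, not from the article):

```python
import numpy as np

# Closed-form ridge regression: a larger λ shrinks every coefficient
# toward 0, but never forces one exactly to 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.54, 0.43, 0.23]) + rng.normal(scale=0.1, size=50)

def ridge_coefs(X, y, lam):
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

for lam in (0.0, 10.0, 100.0):
    print(lam, np.round(ridge_coefs(X, y, lam), 3))
# the coefficient vector becomes smaller in magnitude as λ grows
```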

## Lasso Regression (L1 Regularization)

We use this algorithm for **feature selection**: features that are not important get their coefficients driven to zero automatically, and only the important features are kept. Lasso adds the **“absolute value of magnitude”** of the coefficients as a penalty term to the loss function. To do the feature selection we use an equation like this.

`Cost Function + λ [i =1 -> m Σ | slope | ] : λ is a Hyperparameter`

J(θ0,θ1) = 1/2m [i =1 -> m Σ (hθ(x)^i - y^i)^2] + λ [i =1 -> m Σ | slope | ]

If I plot the cost function and try to identify the relationship between **λ and the slope**, it looks like this.

The similarity with Ridge is that when **λ increases, the slope (θ) decreases**, but you can see that after a certain point the θ value becomes exactly zero. When a θ value (slope, or coefficient) is zero, we effectively remove that feature. Assume you have a multiple linear regression like **hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + θ4x4** and your **θ4** is very small; when we apply Lasso Regression to the cost function, this **θ4 value becomes zero** at some point, so **we can remove the x4 feature.**

When a feature is not very correlated with the output (it has a small coefficient value), we can remove that feature from the equation and use the remaining features to find the best-fit line.

**When do we use Lasso Regression?**

Suppose you have hundreds of features: you can apply Lasso Regression and remove the features that are not well correlated with the output.

The **key difference** between these techniques is that Lasso shrinks the less important features’ coefficients to zero, thus removing some features altogether. So this works well for **feature selection** when we have a huge number of features.
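As a sketch of this behaviour, here is my own minimal coordinate-descent implementation with soft-thresholding (an illustration, not a production solver): a feature that has no real relationship with y gets its coefficient driven exactly to zero.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used by the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/2m)·||y - Xθ||² + λ·||θ||₁."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iter):
        for j in range(n):
            # partial residual with feature j's own contribution removed
            r = y - X @ theta + X[:, j] * theta[j]
            rho = X[:, j] @ r / m
            z = X[:, j] @ X[:, j] / m
            theta[j] = soft_threshold(rho, lam) / z
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
# y depends only on the first two features; the third is pure noise
y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

theta = lasso_cd(X, y, lam=0.3)
print(theta)  # the third coefficient is driven exactly to 0 (feature selected out)
```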

## ElasticNet Regression

This algorithm is a **combination of Ridge and Lasso Regression**: it both **reduces overfitting** and performs **feature selection**.

The cost function looks like this,

`J(θ0,θ1) = 1/2m [i =1 -> m Σ (hθ(x)^i - y^i)^2] + λ1 [i =1 -> m Σ (slope)²] + λ2 [i =1 -> m Σ | slope | ]`

Suppose we have a model with an overfitting condition and a lot of features; in that case, we can use **ElasticNet Regression**.
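The combined cost function above can be transcribed directly into Python (a sketch; the names and the tiny example data are mine, chosen so the fit is perfect and only the penalties remain):

```python
import numpy as np

def elastic_net_cost(theta, X, y, lam1, lam2):
    """J = (1/2m)·Σ(hθ(x) - y)² + λ1·Σθ² + λ2·Σ|θ|."""
    m = len(y)
    mse = np.sum((X @ theta - y) ** 2) / (2 * m)
    return mse + lam1 * np.sum(theta ** 2) + lam2 * np.sum(np.abs(theta))

# tiny usage example with a perfect fit, so only the penalty terms remain
X = np.eye(2)
y = np.array([1.0, -2.0])
theta = np.array([1.0, -2.0])
print(elastic_net_cost(theta, X, y, lam1=0.5, lam2=0.1))
# = 0 + 0.5·(1² + 2²) + 0.1·(1 + 2) ≈ 2.8
```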

## LASSO (L1 regularization)

- regularization term penalizes the absolute value of the coefficients.
- sets irrelevant coefficients to 0.
- might remove too many features from your model.

## Ridge regression (L2 regularization)

- penalizes the size (square of the magnitude) of the regression coefficients.
- forces the *B* (slope/partial slope) coefficients to be lower, but not 0.
- does not remove irrelevant features, but minimizes their impact.

I hope you now have a good understanding of LASSO (L1 regularization) and Ridge Regression (L2 regularization). See you in another Machine Learning article.

Thank you!