Given a set of input features and labels, denoted as *D*:

```python
import numpy as np
import matplotlib.pyplot as plt

# In this example, we are going to include a dataset with only 2 dimensions

# These are the parameters for the noise
mean = 3
stddev = 25
size = 50

# These are the parameters of the true model
beta0 = 3
beta1 = 2

# Generating noise for the samples
noise = np.random.normal(mean, stddev, size)

# Generating the data using a line plus noise
x = np.arange(size)
y = beta0 + beta1*x + noise

# Plotting the data
plt.scatter(x, y)
```

Our goal is to learn a mapping from the inputs to the labels. One of the simplest methods is to use a linear model of the form [1]:

$$y_i = \sum_{j=1}^{d} \beta_j x_{i,j} + \varepsilon_i$$

Here *y_i* is assumed to be the weighted sum of the features of *x_i*, plus a residual term *ε_i*. Moreover, the *β*'s are the weights (or coefficients) of the model, and *x_{i,j}* is feature *j* of example *i*.

```python
# Plotting the "true" linear model
x0 = 0
x1 = 50
y0 = beta0 + beta1*x0
y1 = beta0 + beta1*x1

# Plotting the data
plt.scatter(x, y)
plt.plot([x0, x1], [y0, y1], color="red", label="True Linear model")
plt.legend()
```

We can write this in compact form using matrices:

$$y = X\beta + \varepsilon$$

Note that here we are talking about a theoretical "true" model. Out of all the possible linear models, this true model is the best of all. However, in the real world, it is not possible to obtain it, as that would require data from the entire population under analysis. Often, we can only afford to get a sample.

Instead, what we are doing is finding the estimators:

$$\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \dots, \hat{\beta}_d)$$

that minimize the Mean Squared Error, defined as [1]:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
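As a quick numeric sanity check (toy numbers, purely illustrative), the MSE can be computed directly from a handful of predictions:

```python
import numpy as np

# MSE of a candidate model, computed by hand on toy values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 6.0, 9.5])

mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # average of the squared residuals
```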

```python
# The "true" model, used below to draw the residuals
true_model = lambda x: beta0 + beta1*x

# Plotting the residuals that the mean squared error is built from
for i in range(len(x)):
    plt.plot(
        [x[i], x[i]],
        [true_model(x[i]), y[i]],
        color='green',
        linestyle='dotted',
        label="Residuals" if i == len(x)-1 else None)
plt.legend()
```

The MSE is convex and differentiable, which means that the global minimum is found at the zeros of the gradient, and no local minima exist. Furthermore, thanks to these properties, there is a closed-form solution for the model coefficients, given by:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$

```python
# Function to fit a linear model to data
# Note: this uses the global y defined above
def fit_estimator(x, include_bias=True):
    if include_bias:
        # Prepend a column of ones so that beta0 acts as the intercept
        x = np.column_stack((np.ones((len(x), 1)), x))
    return np.linalg.inv(x.T@x)@x.T@y
```

Now our estimated model can be used for inference as follows:

```python
# Fitting the linear model
betas = fit_estimator(x)
beta0hat = betas[0]
beta1hat = betas[1]

# Function used for inference
est_model = lambda x: beta0hat + beta1hat*x

# Plotting the data
plt.scatter(x, y)

# Plotting the estimated and true linear models
plt.plot([x0, x1], [est_model(x0), est_model(x1)], color="orange", label="Estimated Linear model")
plt.plot([x0, x1], [y0, y1], color="red", label="True Linear model", linestyle='dotted')
plt.legend()
```

Let's define two concepts before introducing regularization:

- Model **variance**: the spread, or variability, of the model prediction for a given data point.
- Model **bias**: the difference between the expectation of the model prediction and the real target.

As the number of trainable parameters *d* increases, it is said that the model becomes increasingly complex. As *d* approaches the number of training samples *n*, the **variance** of the model increases, but the **bias** decreases [2].

A model that is too complex, with many parameters and little data, will have a small bias but high variance. This means that you would have to be lucky to get a model that produces reliable predictions. Conversely, if the model is not complex enough to describe the data under analysis, due to few features or a simple architecture, the variance will be small but the bias will be large. In this case, the model will never reach the target.

An illustration of this is in Fig 5. As the complexity decreases, so does the spread of possible predictions for the same target, but they also become more biased away from it.
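This trade-off can be sketched with a small simulation (illustrative only; the setup and the `poly_predictions` helper are ours, not part of the running example): refitting a low-degree and a high-degree polynomial on fresh noisy samples of the same line shows the complex model's predictions spreading out far more.

```python
import numpy as np

rng = np.random.default_rng(0)

def poly_predictions(degree, n_trials=200, n_samples=30, x_query=0.5):
    """Refit a polynomial of the given degree on fresh noisy samples of
    the same underlying line, collecting its prediction at x_query."""
    preds = []
    for _ in range(n_trials):
        xs = rng.uniform(0, 1, n_samples)
        ys = 3 + 2 * xs + rng.normal(0, 1, n_samples)  # true line plus noise
        coeffs = np.polyfit(xs, ys, degree)
        preds.append(np.polyval(coeffs, x_query))
    return np.array(preds)

simple_preds = poly_predictions(degree=1)   # low complexity
complex_preds = poly_predictions(degree=9)  # high complexity

# The complex model's predictions vary far more from sample to sample
print(simple_preds.var(), complex_preds.var())
```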

With regularized models, the objective functions *L* have the form:

$$L = E(D) + \lambda R$$

where *E* measures the error for a dataset *D*; the Mean Squared Error for regression is one example of such measurements. Additionally, there is a new regularization term *R*, which **penalizes** the **complexity** of the model to **avoid over-fitting** and introduce desirable properties. The two most common regularized models are Ridge and Lasso [2].
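As a minimal sketch of this objective (the `regularized_loss` helper and the toy data are ours, purely for illustration), the error term and the penalty term can be computed separately and then combined:

```python
import numpy as np

def regularized_loss(betas, X, y, reg_lambda, penalty="l2"):
    """Illustrative objective L = E + lambda * R."""
    error = np.mean((y - X @ betas) ** 2)  # E: mean squared error
    if penalty == "l2":
        reg = np.sum(betas ** 2)           # R used by Ridge
    else:
        reg = np.sum(np.abs(betas))        # R used by Lasso
    return error + reg_lambda * reg

X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_demo = np.array([1.0, 2.0, 3.0])
b_demo = np.array([1.0, 2.0])

# With lambda = 0 the objective is just the error term (zero for this exact fit);
# a positive lambda adds the complexity penalty on top
print(regularized_loss(b_demo, X_demo, y_demo, 0.0))
print(regularized_loss(b_demo, X_demo, y_demo, 0.1, penalty="l1"))
```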

## Closed-form estimation of parameters

Ridge uses the squared Euclidean (L2) norm of the weights as the regularization term:

$$R(\beta) = \lVert \beta \rVert_2^2 = \sum_j \beta_j^2$$

It is easy to see that both terms are convex and differentiable, as they are continuous, smooth functions. Thus, the optimum has a closed-form solution:

$$\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y$$

```python
# Function to fit a ridge model to data (notice the np.eye)
def fit_ridge_regressor(x, y, reg_lambda, include_bias=True):
    if include_bias:
        x = np.column_stack((np.ones((len(x), 1)), x))
    # The lambda*I term shrinks the weights and makes the matrix invertible
    # (note that this simple version also penalizes the bias term)
    return np.linalg.inv(x.T@x + reg_lambda*np.eye(x.shape[1]))@x.T@y
```

## Ridge regularization effect

Ridge penalizes large weights, pushing them asymptotically towards zero but not disabling them, as you can see in Figure 6. In this scenario, the variance is reduced and highly correlated variables will receive similar weights, which favors interpretability while constraining the set of possible models that can be generated from the same dataset.

```python
from sklearn.datasets import load_diabetes

# Load the diabetes dataset
dataset = load_diabetes()

# Split the dataset into x and y
X = dataset.data
y = dataset.target

# Compute model for varying regularization lambdas
num_lambdas = 30
reg_results = np.empty((X.shape[1]+1, num_lambdas))
reg_lambdas = np.linspace(start=0, stop=20, num=num_lambdas)
for i, reg_lambda in enumerate(reg_lambdas):
    betas = fit_ridge_regressor(X, y, reg_lambda)  # fit model here
    reg_results[:, i] = betas

# Plot the weights and notice them being pushed towards zero
plt.figure(figsize=(15, 5))
for i in range(reg_results.shape[0]):
    plt.plot(reg_lambdas, reg_results[i], linestyle='dotted')
```

## Limitations of Ridge

In scenarios of very high dimensionality, possibly with many variables that do not provide relevant information to the problem, the model will still consider all of the variables. Ridge does not enforce sparsity in the model, and this is one of its notable drawbacks.
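A quick way to see this (a sketch on synthetic data with scikit-learn, not part of the article's running example): fit Ridge and Lasso on a problem where only a few features matter, and count the exactly-zero coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic problem where only 3 of 20 features are truly informative
X_syn, y_syn = make_regression(n_samples=100, n_features=20, n_informative=3,
                               noise=5.0, random_state=0)

ridge = Ridge(alpha=100.0).fit(X_syn, y_syn)
lasso = Lasso(alpha=5.0).fit(X_syn, y_syn)

# Ridge shrinks every weight but leaves all of them nonzero,
# while Lasso drives many of them exactly to zero
print(np.sum(ridge.coef_ == 0), np.sum(lasso.coef_ == 0))
```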

**Coordinate descent optimization**

The loss function in Lasso might look very similar to Ridge but has very different properties. The regularization term is given by the sum of the absolute values of the model weights:

$$R(\beta) = \lVert \beta \rVert_1 = \sum_j \lvert \beta_j \rvert$$

This means that the overall loss function is not differentiable around zero, and a closed-form solution is not possible. However, there are more ways to solve this problem!

One of the most popular is the **coordinate descent** algorithm, which states that if a function can be written as:

$$f(\beta) = g(\beta) + \sum_i h_i(\beta_i)$$

where *g* is a convex and differentiable function, and the *h_i* are the convex, non-smooth parts, then the optimum can be reached by minimizing along each coordinate axis [3]. In the case of Lasso, coordinate descent is a perfect fit, since the squared-error term plays the role of *g* and each penalty term *λ|β_i|* plays the role of an *h_i*.

Note that the **penalization** term is **still not differentiable**. But wait! Now we arrive at the concept of subdifferentials [4]. The subdifferential returns the set of values that lie between the left and right derivatives of the function being differentiated. For the absolute value it is:

$$\partial \lvert \beta \rvert = \begin{cases} \{-1\} & \beta < 0 \\ [-1, 1] & \beta = 0 \\ \{1\} & \beta > 0 \end{cases}$$

Thus, the optimum of the loss function with respect to each coordinate is given by setting the subdifferential to zero. Our goal is to solve for the coefficient *β_i*. If we further define a variable *z_i* as:

$$z_i = \frac{x_i^T \left( y - X_{-i}\beta_{-i} \right)}{x_i^T x_i}$$

where *X_{-i}* denotes the data matrix without column *i* and *β_{-i}* the coefficients without entry *i*, the solution for Lasso is given by soft-thresholding:

$$\hat{\beta}_i = S(z_i, \lambda) = \begin{cases} z_i + \lambda & z_i < -\lambda \\ 0 & -\lambda \le z_i \le \lambda \\ z_i - \lambda & z_i > \lambda \end{cases}$$

```python
def soft_threshold(z_i, lambda_reg):
    if z_i < -lambda_reg:
        return z_i + lambda_reg
    elif z_i > lambda_reg:
        return z_i - lambda_reg
    else:
        return 0

def fit_lasso_regressor(x, y, reg_lambda, include_bias=True, n_iter=50):
    # Random initialization
    betas = np.random.uniform(size=x.shape[1])
    # Coordinate descent for n_iter iterations
    for i in range(n_iter):
        # Minimizing one coordinate at a time
        for j in range(x.shape[1]):
            x_j = x[:, j].reshape(-1, 1)
            x_minus_j = np.delete(x, j, 1)
            betas_minus_j = np.delete(betas, j)
            z_j = float((x_j.T @ (y - x_minus_j@betas_minus_j)) / (x_j.T@x_j))
            if include_bias and j == 0:
                # The bias term is not penalized
                betas[j] = z_j
            else:
                betas[j] = soft_threshold(z_j, reg_lambda)
    return betas
```

## Lasso regularization effect

Lasso will **enforce sparsity** in the model, pushing some of the weights to be exactly zero and acting as an **automatic feature selector**. This effect can be seen in Figure 7. For instance, when a pair of features is highly collinear, Lasso will disable the weight of one of them. This regularization effect reduces model complexity and prevents overfitting.

```python
# Calculate betas for varying regularization lambdas
num_lambdas = 100
reg_results = np.empty((X.shape[1], num_lambdas))
reg_lambdas = np.linspace(start=0, stop=1000, num=num_lambdas)
for i, reg_lambda in enumerate(reg_lambdas):
    betas = fit_lasso_regressor(X, y, reg_lambda)
    reg_results[:, i] = betas

# Plot the weights and notice them being pushed towards zero
plt.figure(figsize=(15, 5))
for i in range(reg_results.shape[0]):
    plt.plot(reg_lambdas, reg_results[i], linestyle='dotted')
```

## Limitations of Lasso

According to the authors of Elastic-Net [5], the Lasso model, by design, has three major shortcomings:

- When the number of features *p* is greater than the number of observations *n*, the lasso selects at most *n* variables before it saturates. This is a limiting feature for a variable selection method.
- If there is a group of variables among which the pairwise correlation is very high, then the lasso tends to select only one variable from the group and does not care which one is selected. This affects model interpretability.
- For the usual *n > p* situation, if there are high correlations between features, it has been observed that the performance of Lasso is dominated by ridge regression.

The naive Elastic-Net is a regularized method that follows:

$$L = \lVert y - X\beta \rVert^2 + \lambda_1 \lVert \beta \rVert_1 + \lambda_2 \lVert \beta \rVert_2^2$$

As you can see, it includes both the Ridge and Lasso regularization terms. However, we can reorganize this equation to use a single regularization hyperparameter *λ* and a ratio *α* as follows:

$$L = \lVert y - X\beta \rVert^2 + \lambda \left( \alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \right)$$
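Since the second form is just a rescaling of the two penalty weights (with λ₁ = λα and λ₂ = λ(1−α)), a quick check confirms that the two parameterizations of the penalty agree (toy values, illustrative only):

```python
import numpy as np

coefs = np.array([0.5, -1.0, 2.0])
reg_lambda, alpha = 2.0, 0.3

l1 = np.sum(np.abs(coefs))   # Lasso-style penalty
l2 = np.sum(coefs ** 2)      # Ridge-style penalty

# Naive form with two separate hyperparameters
naive = (reg_lambda * alpha) * l1 + (reg_lambda * (1 - alpha)) * l2
# Reorganized form with a single lambda and a ratio
single = reg_lambda * (alpha * l1 + (1 - alpha) * l2)
print(naive, single)
```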

If *α = 0*, then Elastic-Net reduces to Ridge. Conversely, if *α = 1*, it will be equivalent to Lasso. Furthermore, this method was designed to overcome the limitations of the latter. Specifically [5]:

- The original loss function can be reorganized into an equivalent problem by playing with the regularization parameters in a way such that we get artificially augmented data of size *n + p* and rank *p*. Thus, for the *p > n* case, the model can select all *p* features. See the paper for more details.
- The Elastic-Net regularization penalty is designed to be strictly convex. Thus, it will assign the same values to the regression coefficients if the corresponding features are highly correlated. This is called the grouping effect, and it solves the interpretability issue of Lasso.
- According to the empirical studies, after optimizing the regularization hyperparameters, Elastic-Net is not outperformed by either Ridge or Lasso, while it keeps both the variable selection and shrinkage properties.
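The grouping effect is easy to demonstrate (a sketch on synthetic data with scikit-learn's ElasticNet, not part of the running example): duplicate a column exactly and the strictly convex penalty splits the shared signal evenly between the two copies.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)

# Two identical (perfectly correlated) columns plus one independent feature
n = 200
f = rng.normal(size=n)
g = rng.normal(size=n)
X_dup = np.column_stack([f, f, g])
y_dup = 3 * f + 2 * g + rng.normal(scale=0.1, size=n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X_dup, y_dup)

# The duplicated columns receive (near-)identical coefficients
print(enet.coef_)
```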

## Estimating the coefficients with coordinate descent

The coefficients of Elastic-Net can be obtained using coordinate descent. Solving for the coefficient at coordinate *i*, we get the same soft-thresholding operator as for Lasso, but with threshold *λα*:

$$\hat{\beta}_i = S(z_i, \lambda\alpha)$$

where *z_i* is given by:

$$z_i = \frac{x_i^T \left( y - X_{-i}\beta_{-i} \right)}{x_i^T x_i + \lambda(1 - \alpha)}$$

```python
def fit_elastic_net_regressor(x, y, reg_lambda, reg_ratio, include_bias=True, n_iter=50):
    # Random initialization
    betas = np.random.uniform(size=x.shape[1])
    # Coordinate descent for n_iter iterations
    for i in range(n_iter):
        # Minimizing one coordinate at a time
        for j in range(x.shape[1]):
            x_j = x[:, j].reshape(-1, 1)
            x_minus_j = np.delete(x, j, 1)
            betas_minus_j = np.delete(betas, j)
            # The L2 part of the penalty shows up in the denominator
            z_j = float((x_j.T @ (y - x_minus_j@betas_minus_j)) / (x_j.T@x_j + reg_lambda*(1-reg_ratio)))
            if include_bias and j == 0:
                # The bias term is not penalized
                betas[j] = z_j
            else:
                # The L1 part of the penalty gives the soft threshold
                betas[j] = soft_threshold(z_j, reg_lambda*reg_ratio)
    return betas
```

And now let's test this method with fixed ratios and varying lambdas to see how the weights of the model behave. This is just an example; in real life, you should properly tune the hyperparameters to maximize performance.

```python
# Calculate betas for varying regularization lambdas
def try_elastic_net(fixed_ratio, X, y):
    num_lambdas = 100
    reg_results = np.empty((X.shape[1], num_lambdas))
    reg_lambdas = np.linspace(start=0, stop=100, num=num_lambdas)
    for i, reg_lambda in enumerate(reg_lambdas):
        betas = fit_elastic_net_regressor(X, y, reg_lambda, fixed_ratio)
        reg_results[:, i] = betas

    # Plot the weights and notice them being pushed towards zero
    plt.figure(figsize=(15, 5))
    for i in range(reg_results.shape[0]):
        plt.plot(reg_lambdas, reg_results[i], linestyle='dotted')
```

You can clearly see that a ratio closer to zero will make the model look closer to Ridge, while a ratio closer to one will resemble Lasso:

```python
# Model that behaves similarly to Ridge (ratio 0.1)
try_elastic_net(0.1, X, y)

# Model that behaves similarly to Lasso (ratio 0.9)
try_elastic_net(0.9, X, y)
```

Let's do a quick exercise on how these methods can be used to model datasets. For this exercise, **scikit-learn** will be used, since it is the most popular library for classical machine learning algorithms. It is **stable** and under continuous, active development.

## Splitting and standardizing the dataset

In this particular example, we are going to use the diabetes dataset, found in the scikit-learn API. This dataset has:

- 442 instances
- 10 predictor columns (age, sex, body mass index, glucose, etc.)
- One continuous numerical target that represents the progression of the disease.

The first step is to split the data into a train set and a test set. The **train set** is used by the model to learn a function that **maps** the features to the targets. This means that the model has access to these targets.

On the other hand, the **test set** is used to **assess** the **fitness** of the model and to get a notion of performance on unseen data. In this stage, the model will receive the features, but it will not have access to the targets. However, the targets in the test set are still used to measure the error.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler

# Load the diabetes dataset
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the data, fitting the scaler on the training set only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

## Hyperparameter optimization and model selection

Remember the regularization parameters defined before? Well, we cannot just guess what the best values would be. Proper hyperparameter optimization must be performed to get as close as possible to the true model.

In this scenario, the classic grid search method will be used, which simply fits different models using all the possible combinations of hyperparameters defined in the search space. There are a variety of more sophisticated methods available, but exploring them is beyond the scope of this entry. I will leave that up to the reader!

```python
# Parameters for each method
ridge_params = {'alpha': np.logspace(-4, 4, 100)}
lasso_params = {'alpha': np.logspace(-4, 4, 100)}
elastic_params = {'alpha': np.logspace(-4, 4, 100),
                  'l1_ratio': np.linspace(0.1, 0.9, 10)}
```

Behind the scenes, scikit-learn will take the training set and split it further into folds, using part of it as a validation set (here, 5-fold cross-validation). During validation the model does not see the targets, just as with the test set, and the results are used for tuning the model. After defining which values we want to explore for the hyperparameter optimization, we pick the model that had the smallest validation error.

Why don't we do this on the test set? Because we don't want to leak information to the model when assessing its ability to generalize. This would be like telling the model the best hyperparameters for that particular sample, and it would be cheating!

```python
def perform_regression(method, params, X_train, y_train, X_test, y_test):
    """
    Performs hyperparameter optimization, searching for the
    best regularization parameters, then makes predictions and
    computes the regression error on the test set.
    """
    # Optimization
    grid = GridSearchCV(method, params, cv=5)
    grid.fit(X_train, y_train)
    best_model = grid.best_estimator_
    best_params = grid.best_params_

    # Prediction
    predictions = best_model.predict(X_test)

    # Test error
    mse = mean_squared_error(y_test, predictions)
    return best_params, mse
```

## Testing the trained models

Finally, after the model is fitted with the best hyperparameters, what is left is to use the test set to measure the error.

```python
# Perform regression for each method
results = []
methods = [
    (Ridge(), ridge_params),
    (Lasso(), lasso_params),
    (ElasticNet(), elastic_params)
]
for method, params in methods:
    best_params, mse = perform_regression(method, params, X_train_scaled, y_train, X_test_scaled, y_test)
    results.append({
        'Method': method.__class__.__name__,
        'Best Alpha': best_params.get('alpha', 'N/A'),
        'Best L1 Ratio': best_params.get('l1_ratio', 'N/A'),
        'MSE': mse
    })
results_df = pd.DataFrame(results)

# Bar chart for MSE comparison
plt.figure(figsize=(10, 6))
plt.bar(results_df['Method'], results_df['MSE'], color=['blue', 'green', 'purple'])
plt.xlabel('Regression Method')
plt.ylabel('Mean Squared Error (MSE)')
plt.title('MSE Comparison: Ridge vs Lasso vs Elastic Net')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```

Note that Lasso got the lowest MSE of the three here. However, be careful: this exercise was not meant to be extensive experimentation from which to draw conclusions about any of the models. In addition, there are many more techniques that can be used for modeling, such as support vector machines or neural networks.