*This is part 1 of a two-part series about feature selection. Read **part 2 here**.*

When you’re fitting a model to a dataset, you may need to perform feature selection: keeping only some subset of the features to fit the model, while discarding the rest. This can be necessary for a variety of reasons:

- to keep the model explainable (having too many features makes interpretation harder)
- to avoid the curse of dimensionality
- to maximize/minimize some objective function related to the model (R-squared, AIC, etc)
- to avoid a poor fit, etc.

If the number of features N is small, an exhaustive search may be doable — you can literally try all possible combinations of features, and just keep the one that minimizes the cost / objective function. But if N is large, then an exhaustive search might be impossible. The total number of combinations to try is `2^N`

which, if N is larger than a few dozen, becomes prohibitive (it’s an exponential function). In such cases you must use a heuristic method: explore the search space in an efficient way, looking for a combination of features that minimizes the objective function you use to conduct the search.

What you’re looking for is a vector `[1, 0, 0, 1, 0, 1, 1, 1, 0, ...]`

of length N, with elements taking values from `{0, 1}`

. Each element in the vector is assigned to a feature; if the element is 1, the feature is selected; if the element is 0, the feature is discarded. You need to find the vector that minimizes the objective function. The search space has as many dimensions N as there are features, and the only possible values along any dimension are 0 and 1.

Finding a good heuristic algorithm is not trivial. The `regsubsets()`

function in R has some options you can use. Also, scikit-learn offers several methods to perform a heuristic feature selection, provided your problem is a good fit for their techniques. But finding a good, general purpose heuristic — in the most general form — is a hard problem. In this series of articles we will explore a few options that may work reasonably well even when N is quite large, and the objective function can be literally anything, as long as it’s easy to compute.

For all optimization techniques in this series of articles, I am using the very popular House Prices dataset on Kaggle (MIT license) — after a few simple feature transformations, it ends up having 213 features (N=213) and 1453 observations. The model I’m using is linear regression, `statsmodels.api.OLS()`

, and the objective function I am trying to minimize is BIC — the Bayesian Information Criterion, a measure of information loss, so a lower BIC is better. It is similar to AIC — the Akaike Information Criterion — except BIC tends to produce more parsimonious models (it prefers models with fewer features). Minimizing either AIC or BIC tends to reduce overfitting. But other objective functions could also be used instead, e.g. R-squared (the explained variance in the target) or adjusted R-squared — just keep in mind that a larger R-squared is better, so that’s a maximization problem.

Ultimately, the choice of objective function is irrelevant here. What matters is that we have one objective function that we’re consistently trying to optimize using various techniques.

The full code used in this series of articles is contained in a single notebook in my feature selection repository — also linked at the end. I will provide code snippets within the text here, but please check the notebook for the full context.

But first, let’s check a well-known, tried-and-true feature selection technique, which we will use as a baseline comparison with the more complex techniques described later.

SFS, the forward version, is a simple algorithm. It starts by trying to choose a single feature, and it selects that feature that minimizes the objective function. Once a feature is selected, it stays selected forever. Then it tries to add another feature to it (for a total of 2 features), in such a way as to minimize the objective again. It increases the number of selected features by 1, every time trying to find the best new feature to add to the existing collection. The search ends when all features are tried together. Whichever combination minimizes the objective, wins.

SFS is a greedy algorithm — each choice is locally optimal — and it never goes back to correct its own mistakes. But it’s reasonably fast even when N is quite large. The total number of combinations it tries is `N(N+1)/2`

which is a quadratic polynomial (whereas an exhaustive search needs to perform an exponential number of trials).

Let’s see what the SFS code may look like in Python, using the mlxtend library:

`import statsmodels.api as sm`

from mlxtend.feature_selection import SequentialFeatureSelector as SFS

from sklearn.base import BaseEstimatorclass DummyEstimator(BaseEstimator):

# mlxtend wants to use an sklearn estimator, which is not needed here

# (statsmodels OLS is used instead)

# create a dummy estimator to pacify mlxtend

def fit(self, X, y=None, **kwargs):

return self

def neg_bic(m, X, y):

# objective function

lin_mod_res = sm.OLS(y, X, hasconst=True).fit()

return -lin_mod_res.bic

seq_selector = SFS(

DummyEstimator(),

k_features=(1, X.shape[1]),

forward=True,

floating=False,

scoring=neg_bic,

cv=None,

n_jobs=-1,

verbose=0,

# make sure the intercept is not dropped

fixed_features=['const'],

)

n_features = X.shape[1] - 1

objective_runs_sfs = round(n_features * (n_features + 1) / 2)

t_start_seq = time.time()

# mlxtend will mess with your dataframes if you don't .copy()

seq_res = seq_selector.fit(X.copy(), y.copy())

t_end_seq = time.time()

run_time_seq = t_end_seq - t_start_seq

seq_metrics = seq_res.get_metric_dict()

It runs through the combinations quickly and these are the summary results:

`best k: 36`

best objective: 33708.98602877906

R2 @ best k: 0.9075677543649224

objective runs: 22791

total run time: 44.850 sec

The best number of features is 36 out of 213. The best BIC is 33708.986, and it takes less than 1 minute to complete on my system. It invokes the objective function 22.8k times.

These are the best BIC and R-squared values, as functions of the number of features selected:

For more information, such as the names of the features actually selected, check the notebook in the repository.

Now let’s try something more complex.

This is a numeric optimization algorithm. It’s in the same class as genetic algorithms (they’re both evolutionary), but CMA-ES is quite distinct from GA. It’s a stochastic algorithm, and it’s derivative-free — it does not require you to compute a derivative of the objective function (unlike gradient descent, which relies on partial derivatives). It is computationally efficient, and it’s used in a variety of numeric optimization libraries such as Optuna. I will only attempt a brief sketch of CMA-ES here; for more detailed explanations, please refer to the literature in the links section at the end.

Consider the 2-dimensional Rastrigin function:

The heatmap below shows the values of this function — brighter colors mean a higher value. The function has the global minimum in the origin (0, 0), but it’s peppered with many local extremes. We want to find the global minimum via CMA-ES.

CMA-ES is based on the multivariate normal distribution. It generates test points in the search space from this distribution. You will have to guess the original mean value of the distribution, and its standard deviation, but after that the algorithm will iteratively modify all these parameters, moving the distribution in the search space, looking for the best objective function values. Here’s the original distribution that the test points are drawn from:

`xi`

is the set of points that the algorithm generates at each step, in the search space. `lambda`

is the number of points generated. The mean value of the distribution will be updated at every step, and hopefully will converge eventually on the true solution. `sigma`

is the standard deviation of the distribution — the spread of the test points. `C`

is the covariance matrix: it defines the shape of the distribution. Depending on the values of `C`

the distribution may have a “round” shape or a more elongated, oval shape. Changes to `C`

allow CMA-ES to “sneak” into certain areas in the search space, or avoid other areas.

A population of 6 points was generated in the image above, which is the default population size picked by the optimizer for this problem. This is the first step. After this, the algorithm needs to:

- compute the objective function (Rastrigin) for each point
- update the mean, the standard deviation, and the covariance matrix, effectively creating a new multivariate normal distribution
- generate a new set of test points from the new distribution
- repeat until some criterion is fulfilled (either converge on some mean value, or exceed the maximum number of steps, etc.)

I will not show the updates for all distribution parameters here, or else this article will become very large — check the links at the end for a complete explanation. But updating just the mean of the distribution is quite simple, and it works like this: after computing the objective function for each test point, weights are assigned to the points, with the larger weights given to the points with a better objective value, and a weighted sum is computed from their positions, which becomes the new mean. Effectively, CMA-ES moves the distribution mean towards the points with a better objective value:

If the algorithm converges to the true solution, then the mean value of the distribution will converge to that solution. The standard deviation will converge to 0. The covariance matrix will change the shape of the distribution (round or oval), depending on the geography of the objective function, extending into promising areas, and shying away from bad areas.

Here’s an animated GIF showing the evolution in time of the test points while CMA-ES is solving the Rastrigin problem:

The 2D Rastrigin function was relatively simple, since it only has 2 dimensions. For our feature selection problem, we have N=213 dimensions. Moreover, the space is not continuous. Each test point is an N-dimensional vector with component values from `{0, 1}`

. In other words, each test point looks like this: `[1, 0, 0, 1, 1, 1, 0, ...]`

— a binary vector. But other than that, the problem is the same: we need to find the points (the vectors) that minimize the objective function: the BIC parameter of an OLS model.

Here’s a simplified version of the CMA-ES code for feature selection, using the cmaes library:

`def cma_objective(fs):`

features_use = ['const'] + [

f for i, f in enumerate(features_select) if fs[i,] == 1

]

lin_mod = sm.OLS(y_cmaes, X_cmaes[features_use], hasconst=True).fit()

return lin_mod.bicX_cmaes = X.copy()

y_cmaes = y.copy()

features_select = [f for f in X_cmaes.columns if f != 'const']

dim = len(features_select)

bounds = np.tile([0, 1], (dim, 1))

steps = np.ones(dim)

optimizer = CMAwM(

mean=np.full(dim, 0.5),

sigma=1 / 6,

bounds=bounds,

steps=steps,

n_max_resampling=10 * dim,

seed=0,

)

max_gen = 100

best_objective_cmaes_small = np.inf

best_sol_raw_cmaes_small = None

for gen in tqdm(range(max_gen)):

solutions = []

for _ in range(optimizer.population_size):

x_for_eval, x_for_tell = optimizer.ask()

value = cma_objective(x_for_eval)

solutions.append((x_for_tell, value))

if value < best_objective_cmaes_small:

best_objective_cmaes_small = value

best_sol_raw_cmaes_small = x_for_eval

optimizer.tell(solutions)

best_features_cmaes_small = [

features_select[i]

for i, val in enumerate(best_sol_raw_cmaes_small.tolist())

if val == 1.0

]

print(f'best objective: {best_objective_cmaes_small}')

print(f'best features: {best_features_cmaes_small}')

The CMA-ES optimizer is defined with some initial guesses for the mean value and for sigma (standard deviation). It then loops through many generations, creating test points `x_for_eval`

, evaluating them with the objective, modifying the distribution (mean, sigma, covariance matrix), etc. Each `x_for_eval`

point is a binary vector `[1, 1, 1, 0, 0, 1, ...]`

used to select the features from the dataset.

Please note that the `CMAwM()`

optimizer is used (CMA with margin) instead of the default `CMA()`

. The default optimizer works well for regular, continuous problems, but the search space here is high-dimensional, and only two discrete values (0 and 1) are allowed. The default optimizer gets stuck in this space. `CMAwM()`

enlarges a bit the search space (while the solutions it returns are still binary vectors), and that seems enough to unblock it.

This simple code definitely works, but it’s far from optimal. In the companion notebook, I have a more complex, optimized version of it, which is able to find better solutions, faster. But the code is big, so I will not show it here — check the notebook.

The image below shows the history of the complex, optimized CMA-ES code searching for the best solution. The heatmap shows the prevalence / popularity of each feature at each generation (brighter = more popular). You can see how some features are always very popular, while others fall out of fashion quickly, and yet others are “discovered” later. The population size picked by the optimizer, given the parameters of this problem, is 20 points (individuals), so feature popularity is averaged across these 20 points.