Are you feeling a bit down about underwhelming performance scores from your good ol’ Naive Bayes model? Is that score keeping you up at night, leaving you running grid searches to fine-tune those cagey hyperparameters?

Don’t worry, we can work it out! You might be missing something silly. Let’s revisit your algorithm decisions one by one and exhaust every means of improving it.

Before heading over make sure you’ve refreshed your understanding of Bayesian Modeling.

Let’s start with the choice of model.

Choosing the right model is the first thing you need to get right. Let’s break down the different types of Naive Bayes classifiers in simple terms and explain when to use each one.

## Bernoulli Naive Bayes: Spam or Not?

BernoulliNB is designed exclusively for binary/boolean features. Imagine you’re trying to detect spam emails. You notice that some spam emails include your email handle in the subject line, while others don’t, so you create a feature that captures this: if your email handle is present in the subject, the feature is 1; otherwise, it’s 0. This binary representation helps classify emails as spam or not.
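To make this concrete, here is a minimal sketch using scikit-learn’s `BernoulliNB` with a single binary feature; the data and labels are made up for illustration:

```python
# Hypothetical sketch: spam detection from one binary feature
# ("does the subject line contain my email handle?").
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# 1 = handle appears in the subject line, 0 = it does not
X = np.array([[1], [1], [0], [1], [0], [0]])
y = np.array([1, 1, 0, 1, 0, 0])  # 1 = spam, 0 = not spam

clf = BernoulliNB()
clf.fit(X, y)

print(clf.predict([[1], [0]]))  # → [1 0]
```

With such a perfectly separating feature, an email whose subject contains the handle is predicted spam and one without it is predicted not-spam.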

## Multinomial Naive Bayes: Counting Words

Like BernoulliNB, MultinomialNB is suitable for discrete data. The difference is that MultinomialNB works with occurrence counts, while BernoulliNB is designed for binary/boolean features.

Let’s consider a situation where you’re working with text data and trying to identify spam. Instead of merely looking for the presence of certain words, you count how many times each word appears in an email. For instance, you might tally words like “CASH” or “Lottery.” The idea is to provide the model with more information. Not only does it know if a word is present or not, but it also understands how frequently that word appears.

The Multinomial Naive Bayes classifier is the go-to option for this type of data because it **assumes that the features are drawn from a multinomial distribution**. It’s great for working with discrete data, like word counts, and helps improve the model’s accuracy.
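As a minimal sketch of the word-count idea, here is `MultinomialNB` paired with `CountVectorizer`; the tiny corpus and labels are made up for illustration:

```python
# Hypothetical sketch: spam detection from word counts.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "CASH prize lottery CASH",      # spam
    "claim your lottery CASH now",  # spam
    "meeting agenda for monday",    # not spam
    "monday lunch with the team",   # not spam
]
labels = [1, 1, 0, 0]

vec = CountVectorizer()       # turns each email into word-count features
X = vec.fit_transform(emails)

clf = MultinomialNB()
clf.fit(X, labels)

print(clf.predict(vec.transform(["CASH lottery winner"])))  # → [1]
```

Because “cash” and “lottery” appear far more often in the spam examples, the new email is classified as spam.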

## Gaussian Naive Bayes: Continuous Features

Finally, consider a scenario where you’re trying to predict whether a college student can dunk a basketball based solely on their height. Heights are continuous data, and the distribution of human heights follows the normal distribution, also known as a Gaussian distribution. The Gaussian Naive Bayes classifier looks at the heights of all the students and figures out where the cutoff should be to maximize the model’s performance, classifying students as dunkers or non-dunkers.

This classifier is used when all the **features are continuous**, making it perfect for situations like the Iris dataset, which includes features like sepal width, petal width, sepal length, and petal length.

After selecting your model, the next step is to ensure you’ve adhered to all the best practices.

## Enhancing Performance of Bayesian Model:

There are a few techniques to keep in mind when you start building your Naive Bayes model:

Make sure you have encoded the categorical values and, most importantly, chosen the right kind of encoding method.

**One Hot Encoding**: One-hot encoding (OHE) creates a new binary column for each category in a categorical column. If we have k categories in a column, we create k new binary columns to represent those categories (unlike statistical modelling, you don’t have to worry about dropping the kth column here, as the ML algorithm takes care of it). For example, if a column named fruits contains apple, mango, and grapes, OHE will transform it into three columns (Apple, Mango, Grapes), each holding a 0/1 value. You will need this especially for BernoulliNB.

What happens when the test set introduces new categories? With *OneHotEncoder*, we can handle this by setting `handle_unknown="ignore"`, which produces a row of all zeros for any unrecognized category. This ensures unknown categories are treated consistently for that feature.

```python
from sklearn.preprocessing import OneHotEncoder

# sparse_output replaces the older sparse argument in scikit-learn >= 1.2
OHE = OneHotEncoder(sparse_output=False, dtype="int", handle_unknown="ignore")
OHE.fit(X_train[["col_name"]])
X_train_encoded = OHE.transform(X_train[["col_name"]])
```

In cases where a feature has only two values (say, yes/no or boy/girl), we use the `drop` argument of `OneHotEncoder` and set it to `"if_binary"`. This avoids redundant information by encoding only one of the two values: you get a single 0/1 column instead of two.

```python
OHE_binary = OneHotEncoder(sparse_output=False, dtype="int", drop="if_binary")
OHE_binary.fit(X[["col_name"]])
X_encoded = OHE_binary.transform(X[["col_name"]])
```

**Ordinal Encoding** is another preprocessing method for categorical variables, used for ordinal data. Ordinal categories have a clear and meaningful order or ranking, but the intervals between them are not necessarily equal or defined. Take education: if the values are Undergraduate, Postgraduate, and PhD, we can easily order them, but can we say how much more valuable a Postgraduate degree is than an Undergraduate one, or a PhD than a Postgraduate degree?

For this kind of data we usually use MultinomialNB.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({"education": ["Undergraduate", "PhD", "Postgraduate",
                                "Postgraduate", "PhD", "Undergraduate",
                                "PhD", "Undergraduate"]})

# Pass the category order explicitly so the ranks are meaningful
order = ["Undergraduate", "Postgraduate", "PhD"]
OE = OrdinalEncoder(categories=[order], dtype=int)
OE.fit(X)
X_ord = OE.transform(X)
X.assign(education_enc=X_ord)
```

**Feature Selection**: Naive Bayes is called ‘naive’ for a reason. I blame it on its assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. The independence assumption in Bayes’ theorem is like saying that the color of one marble you pick doesn’t depend on the colors of the marbles you picked before. It helps simplify complex problems and make predictions, but in reality, many real-world datasets contain correlated or dependent features.

But how can we accommodate this assumption? By looking at how features are related and keeping just one of any pair that is too strongly related. A correlation heatmap shows these relationships at a glance; when two features are highly correlated, keep the more informative one and drop the other. In practice, an absolute correlation above 0.5 is often treated as high. This makes the model’s calculations both simpler and more accurate.
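A minimal sketch of this pruning step, using made-up features (`feat_a`, `feat_b`, `feat_c`) where `feat_b` is deliberately constructed to be almost a copy of `feat_a`:

```python
# Hypothetical sketch of correlation-based feature selection:
# drop one feature from any pair whose absolute correlation
# exceeds the 0.5 rule of thumb mentioned above.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "feat_a": a,
    "feat_b": a * 0.9 + rng.normal(scale=0.1, size=200),  # near-duplicate of feat_a
    "feat_c": rng.normal(size=200),                        # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.5).any()]

df_reduced = df.drop(columns=to_drop)
print(to_drop)  # → ['feat_b']
```

Only one member of the correlated pair survives, which brings the data closer to the independence assumption.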

**Eliminating Zero Observations**: If a feature value never occurs with a class in the training data, its estimated probability is zero, and multiplying by zero wipes out the whole product. The common remedy is Laplace smoothing: in scikit-learn, the `alpha` hyperparameter of MultinomialNB and BernoulliNB (default 1.0) adds a small pseudo-count to every feature count, preventing zero probabilities. (GaussianNB has an analogous `var_smoothing` parameter, default 1e-9, which adds a small value to the variances for numerical stability.)
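To see why this matters, here is a tiny arithmetic sketch of Laplace smoothing; the word counts and vocabulary size are made up:

```python
# Suppose the word "lottery" never appeared in ham training emails:
count_word_in_class = 0
total_words_in_class = 100
vocab_size = 50

# Without smoothing, an unseen word zeroes out the whole product:
p_unsmoothed = count_word_in_class / total_words_in_class
print(p_unsmoothed)  # → 0.0

# With Laplace smoothing (alpha=1), the estimate stays positive:
alpha = 1
p_smoothed = (count_word_in_class + alpha) / (total_words_in_class + alpha * vocab_size)
print(round(p_smoothed, 4))  # → 0.0067
```

The smoothed estimate is small but nonzero, so one unseen word no longer vetoes an entire class.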

**Hyperparameter Tuning**: Practically, there is only one hyperparameter worth optimising: `alpha` for MultinomialNB and BernoulliNB (or `var_smoothing` for GaussianNB).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Define the hyperparameter grid; we use discrete values here
param_grid = {
    "alpha": [1e-6, 0.001, 0.01, 0.1, 1.0, 10.0],
    "fit_prior": [True, False],
}

# Create a Naive Bayes classifier; replace it with your model
clf = MultinomialNB()

# Perform grid search over your data X, y;
# change the scoring metric as per your requirements
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X, y)

# Get the best hyperparameters and model
best_alpha = grid_search.best_params_["alpha"]
best_fit_prior = grid_search.best_params_["fit_prior"]
best_model = grid_search.best_estimator_

print(f"Best Alpha: {best_alpha}")
print(f"Best Fit Prior: {best_fit_prior}")
```

Or, if your *GridSearchCV* is taking a long, long time, ‘bayes’ is always there to help: Bayesian optimisation is a hyperparameter tuning technique that is itself based on Bayesian ideas.

```python
from bayes_opt import BayesianOptimization
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Define the function to optimize over your data X, y;
# replace MultinomialNB with your model and change the
# scoring metric as per your requirements
def evaluate_naive_bayes(alpha):
    clf = MultinomialNB(alpha=alpha)
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    return scores.mean()

# Define the parameter range for Laplace smoothing (alpha)
pbounds = {"alpha": (1e-6, 10.0)}

# Initialize BayesianOptimization
optimizer = BayesianOptimization(f=evaluate_naive_bayes, pbounds=pbounds, random_state=1)

# Perform optimization
optimizer.maximize(init_points=5, n_iter=10)

# Get the best Laplace smoothing parameter
best_alpha = optimizer.max["params"]["alpha"]
```

**Handling Continuous Variables**: Categorical data naturally fits the simplicity of Naive Bayes. The probabilities associated with each category are easier to calculate, and the assumptions of the model, such as conditional independence, are more likely to hold.

Categorizing the data involves dividing the range of continuous values into intervals, or bins, and then assigning category labels to data points based on which interval they fall into. This is called discretization.
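A minimal sketch of discretization with scikit-learn’s `KBinsDiscretizer`; the height values are made up:

```python
# Sketch: binning a continuous feature (height, in cm) so it can be
# fed to a discrete Naive Bayes variant.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

heights = np.array([[150.0], [165.0], [172.0], [180.0], [195.0], [201.0]])

# Three equal-width bins, ordinal-encoded as 0, 1, 2
disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
binned = disc.fit_transform(heights)
print(binned.ravel())  # → [0. 0. 1. 1. 2. 2.]
```

Each height now carries a bin label instead of a raw value; `strategy="quantile"` would instead make the bins equally populated.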

Are we missing anything important here? Maybe handling missing values?

A critical step of data preprocessing is handling missing values, and it is a major decision for most models. With Naive Bayes, you’re in relative luck: **the algorithm itself is tolerant of missing values**, because it handles the input features separately at both the model-construction and prediction phases, so a missing feature can simply be left out of the product. Be aware, though, that scikit-learn’s Naive Bayes implementations do not accept NaN inputs, so in practice you still need to impute or drop missing values first.
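If you are working in scikit-learn, whose Naive Bayes estimators reject NaN inputs, a simple imputation step fixes this; a minimal sketch with `SimpleImputer`, using a made-up toy matrix:

```python
# Sketch: fill NaNs with the per-column mean before fitting a model.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # replace NaN with the column mean
X_filled = imputer.fit_transform(X)
print(X_filled)  # NaNs become 4.0 (col 0 mean) and 2.5 (col 1 mean)
```

Other strategies such as `"median"` or `"most_frequent"` plug in the same way.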

Naive Bayes classifiers are handy tools for solving a wide range of classification problems. Whether your data is binary, discrete, or continuous, there’s a Naive Bayes classifier suited to the task. Understanding which one to use depends on the nature of your features and the problem you’re trying to solve. So, the next time you encounter a classification challenge, remember that Naive Bayes has you covered.