Classification models are powerful tools in machine learning that enable us to predict the class or category of an observation based on its features. In this article, we aim to explore and compare the suitability of two popular classification algorithms: Logistic Regression and Support Vector Machines (SVM). To achieve this, we will create two classification datasets that contain more than 50% informative features and have a minimum of 2500 rows. Additionally, we will apply scaling techniques on the first five features and assess the impact of these transformations on the final quality of our classification models.

## Problem Statement

- Create two classification datasets, each having more than 50% informative features and a minimum of 2500 rows.
- Perform some kind of scaling on the first 5 features.
- Compare the suitability of the above datasets for Logistic Regression and Support Vector Machines.
- Check whether the applied transformations have an impact on the final quality of the classification model, and present the findings neatly.

**Objective**

- Create two classification datasets that contain more than 50% informative features
- Each dataset should have at least 2500 rows
- Apply Standard Scaling
- Perform Logistic Regression and Support Vector Machines
- Apply other classification models to the created datasets

**Approaches**

- Import relevant libraries for model creation
- Import relevant libraries for model evaluation
- Use relevant metrics to measure performance

**Observation**

- A higher mean accuracy together with a lower standard deviation indicates that a dataset is easier to classify and that the algorithms used are well suited to it.

## Importing Libraries

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
```

## Creating Datasets

To ensure that our datasets are informative, we will carefully select features that contribute significantly to the classification task. By incorporating more than 50% informative features, we aim to enhance the predictive power of our models. Moreover, having a substantial number of rows (at least 2500) will provide us with sufficient data to train and evaluate our models effectively.

```python
# First dataset
X1, y1 = make_classification(n_samples=3000, n_features=10, n_informative=6,
                             n_redundant=2, n_classes=2, random_state=42)

# Second dataset
X2, y2 = make_classification(n_samples=4000, n_features=15, n_informative=9,
                             n_redundant=3, n_classes=3, random_state=21)
```

In the cell above we created two datasets. The first contains 3000 rows with 10 features, 6 of which are informative and 2 redundant, while the second contains 4000 rows with 15 features, 9 of which are informative and 3 redundant. Both datasets satisfy the condition of containing more than 50% informative features and also meet the row-count requirement. The `n_classes` parameter defines the number of classes in each dataset.

**Performing Scaling**

Before training our models, we will apply scaling techniques to the first five features of our datasets. Scaling is essential to normalize the feature values and bring them to a similar scale, avoiding bias towards features with larger magnitudes. By applying scaling, we can improve the convergence and performance of our classification algorithms, ensuring that all features are treated equally.

```python
# Create an instance of the StandardScaler class
scaler = StandardScaler()

# Fit and transform the first 5 features of the first dataset
X1[:, :5] = scaler.fit_transform(X1[:, :5])
```

In the cell above we use `StandardScaler` and its `fit_transform` method to scale the first 5 features of the first dataset. After the transformation these features have zero mean and unit variance, and the scaled values are assigned back to the original array.

```python
# Create an instance of the StandardScaler class
scaler = StandardScaler()

# Fit and transform the first 5 features of the second dataset
X2[:, :5] = scaler.fit_transform(X2[:, :5])
```

This mirrors the previous cell: we use `StandardScaler` and `fit_transform` on the first 5 features of the second dataset, which afterwards have zero mean and unit variance, and assign the scaled values back to the original array.
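One caveat worth noting: because the scaler here is fit on the full dataset before cross-validation, a little information leaks across folds. A leak-free alternative is to put the scaling inside a `Pipeline` with a `ColumnTransformer`; a minimal sketch (the step names are our own, and the dataset is recreated so the snippet stands alone):

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Recreate the first dataset so this example is self-contained
X1, y1 = make_classification(n_samples=3000, n_features=10, n_informative=6,
                             n_redundant=2, n_classes=2, random_state=42)

# Scale only the first 5 columns; pass the remaining columns through unchanged
preprocess = ColumnTransformer(
    [("scale_first5", StandardScaler(), list(range(5)))],
    remainder="passthrough",
)
pipe = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

# cross_val_score clones the pipeline for each fold, so the scaler is fit
# only on that fold's training split
scores = cross_val_score(pipe, X1, y1, cv=5)
print("CV accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```

With two informative features left unscaled the effect is small here, but the pipeline pattern generalizes safely to any preprocessing step.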

## Performing Logistic Regression and Support Vector Machines

Once we have prepared our datasets and scaled the features, we will compare the suitability of Logistic Regression and Support Vector Machines for our classification tasks. Both algorithms are widely used and offer distinct advantages and disadvantages. We will evaluate each model using appropriate metrics such as accuracy, precision, recall, and F1 score. This comparison will provide insight into the strengths and weaknesses of each algorithm, helping us determine which one is better suited to each dataset.
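Although the cells below report only accuracy, the other metrics mentioned here can be collected in a single pass with `cross_validate`; a minimal sketch (macro averaging is our choice for the 3-class dataset, and the dataset is recreated so the snippet stands alone):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Recreate the second (3-class) dataset so this example is self-contained
X2, y2 = make_classification(n_samples=4000, n_features=15, n_informative=9,
                             n_redundant=3, n_classes=3, random_state=21)

# Macro averaging weights the three classes equally
scoring = ["accuracy", "precision_macro", "recall_macro", "f1_macro"]
results = cross_validate(LogisticRegression(max_iter=1000), X2, y2,
                         cv=5, scoring=scoring)
for metric in scoring:
    vals = results["test_" + metric]
    print("%s: %0.2f (+/- %0.2f)" % (metric, vals.mean(), vals.std() * 2))
```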

```python
# Logistic Regression on Dataset 1
lr1 = LogisticRegression()
scores_lr1 = cross_val_score(lr1, X1, y1, cv=5)
print("Logistic Regression on Dataset 1: Accuracy = %0.2f (+/- %0.2f)" % (scores_lr1.mean(), scores_lr1.std() * 2))

# SVM on Dataset 1
svm1 = SVC()
scores_svm1 = cross_val_score(svm1, X1, y1, cv=5)
print("SVM on Dataset 1: Accuracy = %0.2f (+/- %0.2f)" % (scores_svm1.mean(), scores_svm1.std() * 2))
```

```
Logistic Regression on Dataset 1: Accuracy = 0.89 (+/- 0.02)
SVM on Dataset 1: Accuracy = 0.94 (+/- 0.02)
```

In the cell above we train and evaluate Logistic Regression and SVM (Support Vector Machine) on the first dataset using 5-fold cross-validation. The output shows the mean and standard deviation of the accuracy scores across the folds.

```python
# Logistic Regression on Dataset 2
lr2 = LogisticRegression()
scores_lr2 = cross_val_score(lr2, X2, y2, cv=5)
print("Logistic Regression on Dataset 2: Accuracy = %0.2f (+/- %0.2f)" % (scores_lr2.mean(), scores_lr2.std() * 2))

# SVM on Dataset 2
svm2 = SVC()
scores_svm2 = cross_val_score(svm2, X2, y2, cv=5)
print("SVM on Dataset 2: Accuracy = %0.2f (+/- %0.2f)" % (scores_svm2.mean(), scores_svm2.std() * 2))
```

```
Logistic Regression on Dataset 2: Accuracy = 0.63 (+/- 0.01)
SVM on Dataset 2: Accuracy = 0.89 (+/- 0.02)
```

This cell repeats the procedure on the second dataset: Logistic Regression and SVM are trained and evaluated with 5-fold cross-validation. In conclusion, this kind of comparison helps us determine which of the datasets is most suitable for Logistic Regression and Support Vector Machines. Here, a higher mean accuracy together with a lower standard deviation means the dataset is easier to classify and the algorithms used are well suited to it.
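This rule of thumb can be expressed as a small helper; the score arrays below are illustrative values, not outputs of the cells above:

```python
import numpy as np

# Illustrative per-fold CV accuracies for two datasets (not real outputs)
scores_d1 = np.array([0.89, 0.90, 0.88, 0.89, 0.89])
scores_d2 = np.array([0.63, 0.64, 0.63, 0.62, 0.63])

def easier(a, b):
    # Prefer the higher mean accuracy; break ties with the lower spread
    if a.mean() != b.mean():
        return "Dataset 1" if a.mean() > b.mean() else "Dataset 2"
    return "Dataset 1" if a.std() < b.std() else "Dataset 2"

print(easier(scores_d1, scores_d2))  # → Dataset 1
```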

## Model Selection & Quality of The Model

**Checking whether the applied transformations have an impact on the final quality of the classification model**

Finally, we will investigate whether the applied scaling transformations have any impact on the final quality of our classification models. We will train and evaluate the models on both the original unscaled datasets and the scaled datasets. By comparing the performance metrics of the models on these different datasets, we can determine whether the scaling process improves or hampers the classification accuracy. This analysis will allow us to draw conclusions about the effectiveness of scaling in enhancing the predictive capabilities of our classification models.

**Without Scaling**

```python
# Without Scaling
lr1_no_scaling = LogisticRegression()
scores_no_scaling = cross_val_score(lr1_no_scaling, X1, y1, cv=5)
acc_no_scaling = accuracy_score(y1, lr1_no_scaling.fit(X1, y1).predict(X1))
print("Logistic Regression on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_no_scaling.mean(), scores_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_no_scaling)
print("-"*35)

lr2_no_scaling = LogisticRegression()
scores_no_scaling2 = cross_val_score(lr2_no_scaling, X2, y2, cv=5)
acc_no_scaling2 = accuracy_score(y2, lr2_no_scaling.fit(X2, y2).predict(X2))
print("Logistic Regression on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_no_scaling2.mean(), scores_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_no_scaling2)
```

```
Logistic Regression on Dataset 1 without scaling:
 Accuracy (CV): 0.89 (+/- 0.02)
 Accuracy (train): 0.90
-----------------------------------
Logistic Regression on Dataset 2 without scaling:
 Accuracy (CV): 0.63 (+/- 0.01)
 Accuracy (train): 0.64
```

In the cell above we train and evaluate the Logistic Regression model on the unscaled datasets, that is, without applying feature scaling, using 5-fold cross-validation, and then fetch the training accuracy score after fitting.

**With Feature Scaling**

```python
# Feature scaling
scaler1 = StandardScaler()
X1_scaled = scaler1.fit_transform(X1)
scaler2 = StandardScaler()
X2_scaled = scaler2.fit_transform(X2)

# With Scaling
lr1_scaling = LogisticRegression()
scores_scaling = cross_val_score(lr1_scaling, X1_scaled, y1, cv=5)
acc_scaling = accuracy_score(y1, lr1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("Logistic Regression on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_scaling.mean(), scores_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_scaling)

lr2_scaling = LogisticRegression()
scores_scaling2 = cross_val_score(lr2_scaling, X2_scaled, y2, cv=5)
acc_scaling2 = accuracy_score(y2, lr2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("Logistic Regression on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_scaling2.mean(), scores_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_scaling2)
```

```
Logistic Regression on Dataset 1 with scaling:
 Accuracy (CV): 0.89 (+/- 0.02)
 Accuracy (train): 0.90
Logistic Regression on Dataset 2 with scaling:
 Accuracy (CV): 0.63 (+/- 0.01)
 Accuracy (train): 0.64
```

In the cell above we perform the same procedure, training and evaluating the Logistic Regression model on the scaled datasets with the same number of cross-validation folds, and then fetch the accuracy scores so that the performance of the two variants can be compared.

## Applying The Same To Other Classification Models

**Support Vector Machine**

```python
# Support Vector Machines
print("*"*30)
print("Support Vector Machines Without-Scaling")
print("*"*30)
svc1_no_scaling = SVC()
scores_svc_no_scaling = cross_val_score(svc1_no_scaling, X1, y1, cv=5)
acc_svc_no_scaling = accuracy_score(y1, svc1_no_scaling.fit(X1, y1).predict(X1))
print("Support Vector Machines on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_svc_no_scaling.mean(), scores_svc_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_svc_no_scaling)

svc2_no_scaling = SVC()
scores_svc_no_scaling2 = cross_val_score(svc2_no_scaling, X2, y2, cv=5)
acc_svc_no_scaling2 = accuracy_score(y2, svc2_no_scaling.fit(X2, y2).predict(X2))
print("Support Vector Machines on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_svc_no_scaling2.mean(), scores_svc_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_svc_no_scaling2)

print("*"*30)
print("Support Vector Machines With-Scaling")
print("*"*30)
svc1_scaling = SVC()
scores_svc_scaling = cross_val_score(svc1_scaling, X1_scaled, y1, cv=5)
acc_svc_scaling = accuracy_score(y1, svc1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("Support Vector Machines on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_svc_scaling.mean(), scores_svc_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_svc_scaling)

svc2_scaling = SVC()
scores_svc_scaling2 = cross_val_score(svc2_scaling, X2_scaled, y2, cv=5)
acc_svc_scaling2 = accuracy_score(y2, svc2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("Support Vector Machines on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_svc_scaling2.mean(), scores_svc_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_svc_scaling2)
```

```
******************************
Support Vector Machines Without-Scaling
******************************
Support Vector Machines on Dataset 1 without scaling:
 Accuracy (CV): 0.94 (+/- 0.02)
 Accuracy (train): 0.95
Support Vector Machines on Dataset 2 without scaling:
 Accuracy (CV): 0.89 (+/- 0.02)
 Accuracy (train): 0.92
******************************
Support Vector Machines With-Scaling
******************************
Support Vector Machines on Dataset 1 with scaling:
 Accuracy (CV): 0.94 (+/- 0.02)
 Accuracy (train): 0.95
Support Vector Machines on Dataset 2 with scaling:
 Accuracy (CV): 0.91 (+/- 0.02)
 Accuracy (train): 0.93
```

**K-Nearest Neighbors**

```python
# K-Nearest Neighbors
print("*"*30)
print("K-Nearest Neighbors Without-Scaling")
print("*"*30)
knn1_no_scaling = KNeighborsClassifier()
scores_knn_no_scaling = cross_val_score(knn1_no_scaling, X1, y1, cv=5)
acc_knn_no_scaling = accuracy_score(y1, knn1_no_scaling.fit(X1, y1).predict(X1))
print("K-Nearest Neighbors on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_knn_no_scaling.mean(), scores_knn_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_knn_no_scaling)

knn2_no_scaling = KNeighborsClassifier()
scores_knn_no_scaling2 = cross_val_score(knn2_no_scaling, X2, y2, cv=5)
acc_knn_no_scaling2 = accuracy_score(y2, knn2_no_scaling.fit(X2, y2).predict(X2))
print("K-Nearest Neighbors on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_knn_no_scaling2.mean(), scores_knn_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_knn_no_scaling2)

print("-"*35)
print("*"*30)
print("K-Nearest Neighbors With-Scaling")
print("*"*30)
knn1_scaling = KNeighborsClassifier()
scores_knn_scaling = cross_val_score(knn1_scaling, X1_scaled, y1, cv=5)
acc_knn_scaling = accuracy_score(y1, knn1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("K-Nearest Neighbors on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_knn_scaling.mean(), scores_knn_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_knn_scaling)

knn2_scaling = KNeighborsClassifier()
scores_knn_scaling2 = cross_val_score(knn2_scaling, X2_scaled, y2, cv=5)
acc_knn_scaling2 = accuracy_score(y2, knn2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("K-Nearest Neighbors on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_knn_scaling2.mean(), scores_knn_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_knn_scaling2)
```

```
******************************
K-Nearest Neighbors Without-Scaling
******************************
K-Nearest Neighbors on Dataset 1 without scaling:
 Accuracy (CV): 0.93 (+/- 0.02)
 Accuracy (train): 0.95
K-Nearest Neighbors on Dataset 2 without scaling:
 Accuracy (CV): 0.85 (+/- 0.02)
 Accuracy (train): 0.91
-----------------------------------
******************************
K-Nearest Neighbors With-Scaling
******************************
K-Nearest Neighbors on Dataset 1 with scaling:
 Accuracy (CV): 0.93 (+/- 0.02)
 Accuracy (train): 0.95
K-Nearest Neighbors on Dataset 2 with scaling:
 Accuracy (CV): 0.87 (+/- 0.02)
 Accuracy (train): 0.92
```

**Decision Trees**

```python
# Decision Trees
print("*"*30)
print("Decision Trees Without-Scaling")
print("*"*30)
dt1_no_scaling = DecisionTreeClassifier()
scores_dt_no_scaling = cross_val_score(dt1_no_scaling, X1, y1, cv=5)
acc_dt_no_scaling = accuracy_score(y1, dt1_no_scaling.fit(X1, y1).predict(X1))
print("Decision Trees on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_dt_no_scaling.mean(), scores_dt_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_dt_no_scaling)

dt2_no_scaling = DecisionTreeClassifier()
scores_dt_no_scaling2 = cross_val_score(dt2_no_scaling, X2, y2, cv=5)
acc_dt_no_scaling2 = accuracy_score(y2, dt2_no_scaling.fit(X2, y2).predict(X2))
print("Decision Trees on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_dt_no_scaling2.mean(), scores_dt_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_dt_no_scaling2)

print("-"*35)
print("*"*30)
print("Decision Trees With-Scaling")
print("*"*30)
scaler1 = StandardScaler()
X1_scaled = scaler1.fit_transform(X1)
dt1_scaling = DecisionTreeClassifier()
scores_dt_scaling = cross_val_score(dt1_scaling, X1_scaled, y1, cv=5)
acc_dt_scaling = accuracy_score(y1, dt1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("Decision Trees on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_dt_scaling.mean(), scores_dt_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_dt_scaling)

scaler2 = StandardScaler()
X2_scaled = scaler2.fit_transform(X2)
dt2_scaling = DecisionTreeClassifier()
scores_dt_scaling2 = cross_val_score(dt2_scaling, X2_scaled, y2, cv=5)
acc_dt_scaling2 = accuracy_score(y2, dt2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("Decision Trees on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_dt_scaling2.mean(), scores_dt_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_dt_scaling2)
```

```
******************************
Decision Trees Without-Scaling
******************************
Decision Trees on Dataset 1 without scaling:
 Accuracy (CV): 0.88 (+/- 0.01)
 Accuracy (train): 1.00
Decision Trees on Dataset 2 without scaling:
 Accuracy (CV): 0.77 (+/- 0.04)
 Accuracy (train): 1.00
-----------------------------------
******************************
Decision Trees With-Scaling
******************************
Decision Trees on Dataset 1 with scaling:
 Accuracy (CV): 0.88 (+/- 0.02)
 Accuracy (train): 1.00
Decision Trees on Dataset 2 with scaling:
 Accuracy (CV): 0.77 (+/- 0.04)
 Accuracy (train): 1.00
```

**Random Forest**

```python
# Random Forest
print("*"*30)
print("Random Forest Without-Scaling")
print("*"*30)
rf1_no_scaling = RandomForestClassifier()
scores_rf_no_scaling = cross_val_score(rf1_no_scaling, X1, y1, cv=5)
acc_rf_no_scaling = accuracy_score(y1, rf1_no_scaling.fit(X1, y1).predict(X1))
print("Random Forest on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_rf_no_scaling.mean(), scores_rf_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_rf_no_scaling)

rf2_no_scaling = RandomForestClassifier()
scores_rf_no_scaling2 = cross_val_score(rf2_no_scaling, X2, y2, cv=5)
acc_rf_no_scaling2 = accuracy_score(y2, rf2_no_scaling.fit(X2, y2).predict(X2))
print("Random Forest on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_rf_no_scaling2.mean(), scores_rf_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_rf_no_scaling2)

print("-"*35)
print("*"*30)
print("Random Forest With-Scaling")
print("*"*30)
rf1_scaling = RandomForestClassifier()
scores_rf_scaling = cross_val_score(rf1_scaling, X1_scaled, y1, cv=5)
acc_rf_scaling = accuracy_score(y1, rf1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("Random Forest on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_rf_scaling.mean(), scores_rf_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_rf_scaling)

rf2_scaling = RandomForestClassifier()
scores_rf_scaling2 = cross_val_score(rf2_scaling, X2_scaled, y2, cv=5)
acc_rf_scaling2 = accuracy_score(y2, rf2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("Random Forest on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_rf_scaling2.mean(), scores_rf_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_rf_scaling2)
```

```
******************************
Random Forest Without-Scaling
******************************
Random Forest on Dataset 1 without scaling:
 Accuracy (CV): 0.93 (+/- 0.03)
 Accuracy (train): 1.00
Random Forest on Dataset 2 without scaling:
 Accuracy (CV): 0.88 (+/- 0.02)
 Accuracy (train): 1.00
-----------------------------------
******************************
Random Forest With-Scaling
******************************
Random Forest on Dataset 1 with scaling:
 Accuracy (CV): 0.93 (+/- 0.02)
 Accuracy (train): 1.00
Random Forest on Dataset 2 with scaling:
 Accuracy (CV): 0.88 (+/- 0.02)
 Accuracy (train): 1.00
```

**Naive Bayes**

```python
# Naive Bayes
print("*"*30)
print("Naive Bayes Without-Scaling")
print("*"*30)
nb1_no_scaling = GaussianNB()
scores_nb_no_scaling = cross_val_score(nb1_no_scaling, X1, y1, cv=5)
acc_nb_no_scaling = accuracy_score(y1, nb1_no_scaling.fit(X1, y1).predict(X1))
print("Naive Bayes on Dataset 1 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_nb_no_scaling.mean(), scores_nb_no_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_nb_no_scaling)

nb2_no_scaling = GaussianNB()
scores_nb_no_scaling2 = cross_val_score(nb2_no_scaling, X2, y2, cv=5)
acc_nb_no_scaling2 = accuracy_score(y2, nb2_no_scaling.fit(X2, y2).predict(X2))
print("Naive Bayes on Dataset 2 without scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_nb_no_scaling2.mean(), scores_nb_no_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_nb_no_scaling2)

print("")
print("-"*30)
print("Naive Bayes with Scaling")
print("-"*30)
nb1_scaling = GaussianNB()
scores_nb_scaling = cross_val_score(nb1_scaling, X1_scaled, y1, cv=5)
acc_nb_scaling = accuracy_score(y1, nb1_scaling.fit(X1_scaled, y1).predict(X1_scaled))
print("Naive Bayes on Dataset 1 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_nb_scaling.mean(), scores_nb_scaling.std() * 2))
print(" Accuracy (train): %0.2f" % acc_nb_scaling)

nb2_scaling = GaussianNB()
scores_nb_scaling2 = cross_val_score(nb2_scaling, X2_scaled, y2, cv=5)
acc_nb_scaling2 = accuracy_score(y2, nb2_scaling.fit(X2_scaled, y2).predict(X2_scaled))
print("Naive Bayes on Dataset 2 with scaling:")
print(" Accuracy (CV): %0.2f (+/- %0.2f)" % (scores_nb_scaling2.mean(), scores_nb_scaling2.std() * 2))
print(" Accuracy (train): %0.2f" % acc_nb_scaling2)
```

```
******************************
Naive Bayes Without-Scaling
******************************
Naive Bayes on Dataset 1 without scaling:
 Accuracy (CV): 0.86 (+/- 0.03)
 Accuracy (train): 0.86
Naive Bayes on Dataset 2 without scaling:
 Accuracy (CV): 0.65 (+/- 0.03)
 Accuracy (train): 0.65

------------------------------
Naive Bayes with Scaling
------------------------------
Naive Bayes on Dataset 1 with scaling:
 Accuracy (CV): 0.86 (+/- 0.03)
 Accuracy (train): 0.86
Naive Bayes on Dataset 2 with scaling:
 Accuracy (CV): 0.65 (+/- 0.03)
 Accuracy (train): 0.65
```

## Conclusion

In this article, we embarked on a comparative study of two popular classification algorithms, Logistic Regression and Support Vector Machines, along with a few others. We created informative classification datasets with a substantial number of rows and applied scaling to normalize the feature values. By evaluating the models on both scaled and unscaled datasets, we gained insight into the impact of scaling on classification accuracy. This exploration provides practical guidance for choosing a suitable classification model and preprocessing technique for a given task.

Thank you for joining me on this journey of exploring classification models and their suitability for different datasets. I hope you found this article insightful and informative. Remember, in the realm of machine learning, understanding the strengths and weaknesses of various algorithms is crucial for achieving accurate predictions. So, whether you’re working on logistic regression, support vector machines, or any other classification technique, keep experimenting, learning, and improving. Stay curious, stay passionate, and keep pushing the boundaries of what’s possible. Happy Learning!