## Accuracy of statistical learning models

Assessing the accuracy of statistical learning models is a crucial aspect of data analysis. In this section, we will explore some fundamental concepts related to selecting a suitable statistical learning method for a particular dataset. Throughout the journey, we will delve into practical applications of these concepts.

The primary objective of this journey is to introduce readers to various statistical learning methods that go beyond the conventional linear regression approach. But why is it necessary to present multiple methods instead of just advocating for a single, superior technique? The answer lies in the nature of statistics itself: there is no universal method that outperforms all others across every conceivable dataset. While a specific method may work exceptionally well on a particular dataset, another method could yield better results on a similar yet distinct dataset. Therefore, determining the most effective approach for a given dataset is a critical task. Selecting the optimal method can be one of the most challenging aspects of applying statistical learning in practice.


**Measuring the Quality of Fit**

In order to assess how well a statistical learning method performs on a given dataset, we need a measure of how closely its predictions align with the actual observed data. This involves quantifying the extent to which the predicted response value for each observation matches the true response value. In the regression setting, the most commonly used measure is the **mean squared error (MSE)**.

The *MSE* is calculated by taking the average of the squared differences between the predicted response values (denoted ˆf(xi)) and the true response values (denoted yi) over all observations in the dataset. Mathematically, it can be expressed as:

MSE = (1/n) Σ (yi − ˆf(xi))²

Here, n represents the number of observations in the dataset, and the sum runs over i = 1, …, n.
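As a quick illustration, the MSE can be computed directly from its definition; the minimal NumPy sketch below uses made-up numbers:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference
    between true and predicted responses."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up numbers: predictions off by 1 and 3
print(mse([2.0, 5.0], [3.0, 2.0]))  # (1 + 9) / 2 = 5.0
```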

The MSE provides a measure of the overall fit of the statistical learning model. A *smaller MSE indicates that the predicted responses are close to the true responses*, signaling a better fit. Conversely, a *larger MSE suggests substantial differences between the predicted and true responses* for certain observations.

It’s important to note that the MSE is computed using the training data that was used to fit the model. While this **training MSE** gives us an indication of how well the method fits the training data, it is not necessarily an accurate reflection of its performance on new, unseen test data. Ultimately, *we are interested in the accuracy of the predictions when the method is applied to previously unseen test data.*

Consider an example where we want to develop an algorithm to predict stock prices based on previous stock returns. We train the method using historical stock returns from the past six months. However, our main concern is not how well the method predicts the stock price from last week (which is part of the training data), but rather how accurately it can predict tomorrow’s price or next month’s price. Similarly, in a medical context, if we train a statistical learning method to predict the risk of diabetes based on clinical measurements, we are more interested in accurately predicting the risk for future patients based on their clinical data, rather than predicting the risk for the patients used to train the model, as we already know their diabetes status.

Our goal is to develop a model that can accurately capture the relationship between input variables (features) and output variables (responses). To assess the performance of competing methods, we typically split our data into two sets: a training set and a test set.

During the training phase, we use the training set to fit or train the statistical learning method. By adjusting the model’s parameters or selecting the best-fitting model, we obtain an estimate denoted ˆf. This estimate represents the learned relationship between the input variables and the corresponding outputs.

To evaluate how well our trained model performs on new, unseen data, we employ the test set. This set consists of observations that were not used during the training phase. For each test observation (x0, y0), where x0 represents the input and y0 is the true output, we utilize our trained method to predict the response value, resulting in the estimate ˆf(x0).

To measure the accuracy of these predictions, we calculate the squared difference between the true response value and the predicted value, (y0 − ˆf(x0))², for each test observation. By averaging these squared differences across all test observations, we obtain the test mean squared error:

Test MSE = Ave (y0 − ˆf(x0))²
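To make the training/test distinction concrete, here is a small simulated sketch (an assumed toy data-generating process: a linear signal plus Gaussian noise) that fits a line on one portion of the data and evaluates the MSE on a held-out portion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data-generating process: y = 2x + Gaussian noise
x = rng.uniform(0, 1, 100)
y = 2 * x + rng.normal(0, 0.1, 100)

# Hold out the last 30 observations as a test set
x_train, y_train = x[:70], y[:70]
x_test, y_test = x[70:], y[70:]

# Fit a line by least squares on the training set only
slope, intercept = np.polyfit(x_train, y_train, 1)

train_mse = float(np.mean((y_train - (slope * x_train + intercept)) ** 2))
test_mse = float(np.mean((y_test - (slope * x_test + intercept)) ** 2))
print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
```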

In the absence of test data, we might be tempted to select the method that minimizes the training MSE. However, there is a fundamental problem with this approach. Many statistical methods are designed to minimize the training MSE by estimating coefficients specifically based on the training data. While this may result in a small training MSE, it does not guarantee a small test MSE. In fact, the test MSE may be much larger, indicating overfitting of the model to the training data.

Overfitting occurs when a model is excessively flexible and captures random patterns in the training data that do not generalize well to unseen test data. As the flexibility of a model increases, the training MSE tends to decrease, but the test MSE may not follow the same pattern. Overfitting can lead to a small training MSE but a large test MSE, indicating poor generalization.

To illustrate this phenomenon, let’s consider an example. We have a true underlying function, denoted f, which is represented by the black curve. We fit three different models: a linear regression line (orange curve) and two *smoothing spline* fits with varying levels of flexibility (blue and green curves).

As the flexibility of the model increases, the training MSE (grey curve) decreases steadily, but the test MSE (red curve) does not follow the same pattern; the dashed line marks the minimum possible test MSE over all methods. As flexibility grows, the models start to capture more of the noise present in the training data. The training MSE keeps falling because the models can fit the training points closely, while the test MSE rises because the models overfit the noise rather than the true underlying pattern. *For a fuller treatment of this example, see the ISLR book, pages 31 and 32.*
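The overfitting pattern can be reproduced in a few lines. The sketch below (simulated data, with polynomial fits standing in for the smoothing splines of the figure) shows the training MSE shrinking as flexibility grows, while the test MSE need not follow:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed true function sin(3x), observed with noise
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)

# Alternate observations into training and test halves
x_tr, y_tr = x[::2], y[::2]
x_te, y_te = x[1::2], y[1::2]

train_mse, test_mse = {}, {}
for degree in (1, 4, 12):  # inflexible -> very flexible
    coefs = np.polyfit(x_tr, y_tr, degree)
    train_mse[degree] = float(np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2))
    test_mse[degree] = float(np.mean((y_te - np.polyval(coefs, x_te)) ** 2))
    print(f"degree {degree:2d}: train MSE {train_mse[degree]:.3f}, "
          f"test MSE {test_mse[degree]:.3f}")
```

The training MSE is guaranteed to decrease as the degree increases (the models are nested), but the test MSE typically bottoms out at a moderate degree.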

To overcome this issue and select a model that generalizes well to unseen data, we can use techniques like cross-validation. Cross-validation involves splitting the available data into multiple subsets or folds. The model is then trained on a portion of the data (training set) and evaluated on the remaining portion (validation set). This process is repeated for different combinations of training and validation sets.

By calculating the MSE for each fold, we can obtain an estimate of the model’s performance on unseen data. The average validation MSE across all folds can be used as a more reliable estimate of the model’s true prediction error. This helps in selecting the model with the smallest average validation MSE as the one that is likely to perform best on new, unseen data.
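As a rough sketch of the procedure just described, the following code performs k-fold cross-validation by hand for polynomial fits of several degrees; the data-generating process and the candidate degrees are illustrative assumptions:

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Estimate test MSE of a degree-d polynomial fit via k-fold CV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        val = folds[i]  # validation fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coefs = np.polyfit(x[tr], y[tr], degree)
        errors.append(np.mean((y[val] - np.polyval(coefs, x[val])) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 80)
y = x ** 2 + rng.normal(0, 0.1, 80)  # assumed quadratic truth

scores = {d: kfold_mse(x, y, d) for d in (1, 2, 8)}
print(scores)
```

With a quadratic truth, the degree-2 fit should achieve a clearly lower cross-validated MSE than the degree-1 fit.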

**The Bias-Variance Trade-Off**

The U-shape observed in the test mean squared error (MSE) curves (Figures 1.1–1.3) is a result of two competing properties in statistical learning methods. Although the mathematical proof is beyond the scope of this journey, it is possible to decompose the expected test MSE into three fundamental quantities:

the *variance* of the predicted value ˆf(x0), the squared *bias* of ˆf(x0), and the variance of the error term ϵ. This relationship is given by the equation:

E[(y0 − ˆf(x0))²] = **Var**(ˆf(x0)) + [**Bias**(ˆf(x0))]² + Var(ϵ)

Here, E[(y0 − ˆf(x0))²] represents the *expected test MSE* at x0: the average test MSE obtained when f is repeatedly estimated from different training sets and each estimate is evaluated at x0. The overall expected test MSE is computed by averaging this quantity over all values of x0 in the test set.

This equation indicates that to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves *low variance* and *low bias*. It’s important to note that variance and squared bias are nonnegative quantities, so the expected test MSE can never lie below the irreducible error Var(ϵ).
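The decomposition can be checked numerically. The simulation below (an assumed true function and noise level, with a cubic fit standing in for the learning method) estimates the variance and squared bias of ˆf(x0) over many training sets and compares their sum plus Var(ϵ) with a direct Monte Carlo estimate of the expected test MSE:

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return np.sin(3 * x)  # assumed true (normally unknown) function

sigma = 0.3  # sd of the irreducible error
x0 = 0.5     # fixed test point

# Estimate f many times, each from a fresh training set
preds = []
for _ in range(2000):
    x = rng.uniform(-1, 1, 50)
    y = f(x) + rng.normal(0, sigma, 50)
    coefs = np.polyfit(x, y, 3)  # cubic fit as the learning method
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

variance = float(preds.var())
sq_bias = float((preds.mean() - f(x0)) ** 2)

# Direct Monte Carlo estimate of E[(y0 - fhat(x0))^2]
y0 = f(x0) + rng.normal(0, sigma, preds.size)
expected_mse = float(np.mean((y0 - preds) ** 2))

print(variance + sq_bias + sigma ** 2, expected_mse)  # approximately equal
```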

**Variance** refers to the amount by which ˆf would change if we estimated it using a different training dataset. A method with high variance shows large changes in ˆf across training sets. In general, more flexible statistical methods have higher variance. For example, in Figure 1.1, the green curve represents a highly flexible method that follows the observations closely; it has high variance because changing any one of these data points could cause the estimate ˆf to change considerably. In contrast, the orange least squares line is relatively inflexible and has low variance, because moving any single observation will likely cause only a small shift in the position of the line.

**Bias**, on the other hand, refers to the error introduced by approximating a complex real-life problem with a simpler model. Linear regression, for instance, assumes a linear relationship between the response variable (Y) and the predictors (X1, X2, …, Xp); this assumption introduces bias when the true relationship is more complex. In Figure 1.3, where the true relationship is substantially non-linear, linear regression produces high bias. Generally, more flexible methods result in less bias.

As a general rule, increasing the flexibility of a method leads to a decrease in bias and an increase in variance. The relative rates of change of bias and variance determine whether the test MSE increases or decreases. Initially, increasing flexibility leads to a rapid decrease in bias, resulting in a sharp decline in the expected test MSE. However, at a certain point, further flexibility has little impact on bias but significantly increases the variance. This is when the test MSE starts to increase. This pattern can be observed in the right-hand panels of Figures 1.1–1.3.

Squared bias (blue curve), variance (orange curve), Var(ϵ) (dashed line), and test MSE (red curve) for the three data sets in Figures 1.1–1.3. The vertical dotted line indicates the flexibility level corresponding to the smallest test MSE.

Figure 1.4 illustrates this decomposition of the expected test MSE for the examples in Figures 1.1–1.3. The blue curve represents the squared bias for different levels of flexibility, while the orange curve corresponds to the variance. The dashed line represents the irreducible error (Var(ϵ)), and the red curve represents the test set MSE, which is the sum of these three quantities. The optimal test MSE, indicated by the smallest value, occurs at different levels of flexibility for the three data sets because the squared bias and variance change at different rates.

This relationship between bias, variance, and test set MSE is known as the **bias-variance trade-off**. Achieving good test set performance requires balancing low variance and low squared bias. It’s referred to as a trade-off because it’s challenging to obtain a method with extremely low bias and low variance simultaneously. Methods with very low bias may have high variance, while methods with low variance may have high bias. The goal is to find a method where both variance and squared bias are low.

While it may not be possible to explicitly compute the test MSE, bias, or variance in real-life situations where the true f is unobserved, understanding the bias-variance trade-off remains crucial. In this journey, highly flexible methods are explored to reduce bias, but that does not guarantee they will outperform simpler methods like linear regression: the optimal choice depends on the complexity of the true relationship. If the true f is highly non-linear and we have an ample number of training observations, a highly flexible approach may do better. Cross-validation provides a way to estimate the test MSE using only the training data.

## Accuracy Assessment in the Classification Setting

In addition to the regression setting, model accuracy is also crucial in the classification setting. While many concepts, such as the bias-variance trade-off, transfer over with some modifications, the presence of qualitative outcomes introduces new considerations. Let’s explore how we can quantify the accuracy of a classification model.

**Training Error Rate**: In the classification setting, we seek to estimate the true function f from training observations (x1, y1), …, (xn, yn), where the yi’s are qualitative (class labels). The training *error rate*, the proportion of mistakes made when our estimate ˆf is applied to the training observations, is the most common measure of accuracy on the training data. Mathematically, it can be expressed as:

Training Error Rate = (1/n) Σ I(yi ≠ ŷi)

Here, ŷi represents the predicted class label for the ith observation using our estimated function ˆf, and I(yi ≠ ŷi) is an *indicator variable* that equals 1 if yi ≠ ŷi (misclassified) and 0 if yi = ŷi (correctly classified). The training error rate is therefore the fraction of incorrect classifications in the training data.
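Computed directly from its definition, the error rate is simply the mean of the indicator variable; a minimal sketch with made-up labels:

```python
import numpy as np

def error_rate(y_true, y_pred):
    """Fraction of misclassified observations: mean of I(y_i != yhat_i)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Made-up labels: one mistake out of four observations
print(error_rate(["sick", "well", "well", "sick"],
                 ["sick", "sick", "well", "sick"]))  # 0.25
```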

**Test Error Rate**: While the training error rate provides insight into the accuracy of our model on the training data, our main interest lies in the performance on new, unseen test data. The test error rate measures the accuracy of our classifier when applied to test observations that were not used in training. For a given test observation (x0, y0), the *test error rate* is calculated as the average misclassification rate over all test observations:

Test Error Rate = Ave (I(y0 ≠ ŷ0))

Here, ŷ0 represents the predicted class label for the test observation (x0) using the classifier. A good classifier is one that minimizes the test error rate, indicating accurate predictions on unseen data.

It’s important to note that in both the training and test error rates, the indicator variable I(yi ≠ ŷi) or I(y0 ≠ ŷ0) quantifies the number of misclassifications, contributing to the **overall error rate**.

Examples: Let’s consider an example to illustrate these concepts. Suppose we have a dataset of patients and their medical records, and the task is to classify whether a patient has a certain disease or not based on their symptoms. We use the training data, which consists of patients with known disease status, to train our classifier. The training error rate measures the proportion of misclassifications made when applying the classifier to this training data.

To evaluate the performance of our classifier on new patients, we utilize the test data, which includes patients whose disease status is unknown. The test error rate is calculated by averaging the misclassification rates across all test observations, providing an estimate of the classifier’s performance on unseen patients.

By minimizing the test error rate, we can select a classifier that accurately predicts the disease status of future patients based on their symptoms.

**The Bayes Classifier**

In classification tasks, it has been shown that the test error rate defined above is minimized, on average, by a straightforward classifier known as the Bayes classifier. The Bayes classifier assigns each observation to the most likely class based on its predictor values.

The key component of the Bayes classifier is the *conditional probability*,

Pr(Y = j|X = x0),

which represents the probability of the response variable Y being equal to class j given the observed predictor vector x0. The *Bayes classifier* simply assigns a test observation with predictor vector x0 to the class j for which

Pr(Y = j|X = x0) is the largest.

The Bayes classifier is particularly useful in two-class problems, where there are only two possible response values (e.g., class 1 or class 2). In such cases, the Bayes classifier predicts class 1 if Pr(Y = 1|X = x0) is greater than 0.5, and class 2 otherwise.
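The two-class rule can be sketched directly, assuming (hypothetically) that the conditional probabilities are known exactly, as they would be in a simulation:

```python
import math

def pr_orange(x1, x2):
    """Hypothetical, exactly known Pr(Y = 'orange' | X1 = x1, X2 = x2);
    here orange grows more likely toward the upper right."""
    return 1 / (1 + math.exp(-(x1 + x2)))

def bayes_classify(x1, x2):
    """Assign the class whose conditional probability exceeds 0.5."""
    return "orange" if pr_orange(x1, x2) > 0.5 else "blue"

print(bayes_classify(1.0, 1.0))   # orange side of the boundary x1 + x2 = 0
print(bayes_classify(-2.0, 0.5))  # blue side
```

With this assumed probability model, the Bayes decision boundary is the line x1 + x2 = 0, where the conditional probability is exactly 50%.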

To visualize this concept, consider a simulated dataset in a two-dimensional space with predictors X1 and X2. The orange and blue circles represent training observations belonging to two different classes. The probability of the response being orange or blue varies across different values of X1 and X2. Because we know the data-generating process, we can calculate the conditional probabilities for each combination of X1 and X2.

In the figure, the orange shaded region represents the points for which Pr(Y = orange|X) is greater than 50%, while the blue shaded region represents points for which the probability is below 50%. The purple dashed line indicates the points where the probability is exactly 50% — this is known as the *Bayes decision boundary*. The Bayes classifier’s prediction is based on this boundary: observations falling on the orange side are assigned to the orange class, and those on the blue side are assigned to the blue class.

The Bayes classifier achieves the lowest possible test error rate, referred to as the *Bayes error rate*. Since the classifier always chooses the class with the largest value of Pr(Y = j|X = x0), the error rate is given by 1 minus this maximum probability. In general, the overall Bayes error rate is computed by averaging this error rate over all possible values of X.

For example, in our simulated dataset, the Bayes error rate is 0.133, which is greater than zero. This is because the classes overlap in the true population, causing some values of x0 to have Pr(Y = j|X = x0) less than 1. The Bayes error rate is analogous to the irreducible error discussed earlier, representing the inherent limits of classification accuracy.

**K-nearest neighbors (KNN)**

The K-nearest neighbors (KNN) classifier is a method for estimating the conditional distribution of a qualitative response variable given a set of predictors. It is a simple yet effective algorithm for classification tasks. While the ideal approach is to compute the *Bayes classifier*, which relies on the true conditional distribution of the response, this is *often not feasible because that distribution is typically unknown in real-world data.*

The KNN classifier works as follows: given a positive integer K and a test observation x0, the algorithm first identifies the K points in the training data that are closest to x0, a set denoted N0. It then estimates the conditional probability of each class as the fraction of points in N0 that belong to that class.

The estimated conditional probability for class j is computed as:

Pr(Y = j|X = x0) = (1/K) Σ(i ∈ N0) I(yi = j)

Here, yi represents the response value of the ith point among the K neighbors, and I(yi = j) is an indicator function that equals 1 if yi = j and 0 otherwise.

Finally, the KNN classifier assigns the test observation x0 to the class with the highest estimated probability from the above calculation.

Let’s consider an example to illustrate the KNN approach. Suppose we have a small training dataset consisting of six blue and six orange observations. We want to predict the class of a new test observation represented by a black cross. If we choose K = 3, the KNN classifier will identify the three closest observations to the test point. These neighbors form a neighborhood represented by a circle. If two of the three nearest neighbors are blue and one is orange, the estimated probabilities for the blue and orange classes would be 2/3 and 1/3, respectively. Therefore, the KNN classifier would predict that the test observation belongs to the blue class.
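A minimal from-scratch sketch of these steps, with a made-up six-point training set echoing the example above:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x0, k):
    """Classify x0 by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)  # Euclidean distances
    neighbors = np.argsort(dists)[:k]             # indices of the k closest
    votes = Counter(y_train[i] for i in neighbors)
    return votes.most_common(1)[0][0]

# Made-up training set: a blue cluster and an orange cluster
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [1.0, 1.0], [0.9, 1.2], [1.1, 0.8]])
y = np.array(["blue"] * 3 + ["orange"] * 3)

print(knn_predict(X, y, np.array([0.15, 0.15]), k=3))  # blue
print(knn_predict(X, y, np.array([1.0, 0.9]), k=3))    # orange
```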

In practice, the KNN decision boundary is often remarkably close to the optimal Bayes decision boundary, even though the true conditional distribution is unknown. The decision boundary separates the regions where the KNN classifier assigns different classes, and it can be traced by applying the KNN approach, for a given value of K, across all possible values of the predictor variables.

The choice of K is crucial in the KNN classifier. If K is too small (e.g., K = 1), the decision boundary can be overly flexible, capturing noise and irrelevant patterns from the training data. This leads to a classifier with low bias but high variance, meaning it is sensitive to small fluctuations in the training data and may not generalize well to new observations. On the other hand, if K is too large (e.g., K = 100), the decision boundary becomes too rigid and simple, resulting in a classifier with low variance but high bias. This classifier may fail to capture important patterns in the data. It’s important to find the right balance between bias and variance.

To assess the performance of the KNN classifier, we can calculate the test error rate, which measures the proportion of misclassified observations in a separate test dataset. The goal is to select the value of K that minimizes the test error rate. However, it’s worth noting that there is not always a strong relationship between the training error rate and the test error rate. As the flexibility of the KNN classifier increases (smaller values of K), the training error rate tends to decrease, but the test error rate may not necessarily follow the same trend. This phenomenon is known as the bias-variance tradeoff.
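To see how the test error rate varies with K, here is a small simulation; the two overlapping Gaussian classes and the specific values of K are illustrative assumptions:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)

def knn_predict(X_train, y_train, x0, k):
    """Majority vote among the k nearest training points."""
    nearest = np.argsort(np.linalg.norm(X_train - x0, axis=1))[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# Two overlapping Gaussian classes in two dimensions
n = 200
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(1.5, 1, (n, 2))])
y = np.array([0] * n + [1] * n)
perm = rng.permutation(2 * n)
X, y = X[perm], y[perm]
X_tr, y_tr = X[:300], y[:300]
X_te, y_te = X[300:], y[300:]

err = {}
for k in (1, 5, 25, 101):
    preds = np.array([knn_predict(X_tr, y_tr, x0, k) for x0 in X_te])
    err[k] = float(np.mean(preds != y_te))
    print(f"K = {k:3d}: test error rate {err[k]:.2f}")
```

Because the classes overlap, no value of K drives the test error to zero; the best K sits between the overly flexible (K = 1) and overly rigid extremes.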

In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.