8.3 Bias-Variance Decomposition of the Squared Error (L08: Model Evaluation Part 1)

In the previous lecture, we gained some intuition about bias and variance and briefly touched on the bias-variance decomposition of a loss function. Now, in this lecture, we will delve deeper into the bias-variance decomposition by focusing on the squared error loss. Starting with the squared error loss makes it simpler and more intuitive before we explore its relation to overfitting and underfitting. Additionally, we will briefly discuss the bias-variance decomposition of 0-1 loss, which is a more controversial topic with recent works dedicated to it. However, we will first examine the squared error case since it provides a more straightforward understanding.

To recap briefly, bias and variance were covered in more detail in the previous video, but it's worth spending a moment on the setting. We are looking at a point estimator, here the predicted target for a given test point produced by a model trained on one particular training set. The expectation is taken over different training sets drawn from the same distribution or population, so it represents the average prediction for that data point.

The bias measures how far the average prediction is from the true target value, while the variance quantifies the amount by which the individual predictions deviate from the average prediction. The deviations are squared so that positive and negative deviations do not cancel, which lets us focus on the overall spread of the predictions around the average.

The squared error loss can be represented as (theta - theta hat)^2, where theta is the true value and theta hat is the predicted value for a specific data point. In this lecture, we will focus on the bias-variance decomposition of the squared error loss, considering only the bias and variance terms and ignoring the noise term.

To proceed with the bias-variance decomposition, we introduce some notation and setup for the scenario. We consider a true function that generates the labels (y), and we have a hypothesis (h) as our model, which approximates the true data-generating function. We use y hat to represent a prediction. With these terms, we can express the squared error loss as (y - y hat)^2. To avoid confusion with the expectation symbol (E), we denote the squared error as (s).

Now, let's decompose the squared error into bias and variance components. To achieve this, we employ a mathematical trick: we insert and subtract the expectation of the prediction, E[y hat], inside the squared term:

s = (y - y hat)^2 = (y - E[y hat] + E[y hat] - y hat)^2

Expanding this quadratic expression gives three terms:

s = (y - E[y hat])^2 + 2(y - E[y hat])(E[y hat] - y hat) + (E[y hat] - y hat)^2

Next, we apply the expectation (over training sets) to both sides of the equation. The first term, (y - E[y hat])^2, is unaffected by the expectation, since y is the fixed true label for the given data point and E[y hat] is already an average, so both are constants. The third term becomes E[(y hat - E[y hat])^2], the average squared deviation of the predictions from the average prediction.

To see why the middle, cross term vanishes, note that its expectation is 2(y - E[y hat]) * E[E[y hat] - y hat], because the factor (y - E[y hat]) is a constant that can be pulled out of the expectation. The remaining factor is E[E[y hat] - y hat] = E[y hat] - E[y hat] = 0, so the entire cross term disappears.

After applying the expectation, we are therefore left with exactly two terms, the squared bias and the variance:

E[s] = E[(y - y hat)^2] = (y - E[y hat])^2 + E[(y hat - E[y hat])^2]

The first term, (y - E[y hat])^2, is the squared bias. It measures the discrepancy between the true label y and the average prediction E[y hat].

The second term, E[(y hat - E[y hat])^2], is the variance. It captures the variability of the individual predictions y hat around their average E[y hat] across different training sets.

Therefore, we can conclude that the expected squared error loss decomposes into a squared bias term and a variance term (plus an irreducible noise term in the general setting with noisy labels, which we ignore here).
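To make this decomposition concrete, here is a small simulation sketch (my own illustration, not from the lecture; the data-generating function, model, and all names are arbitrary choices). It repeatedly retrains a regressor on freshly drawn training sets, estimates the squared bias and variance at a single noise-free test point, and checks that their sum matches the average squared error.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)

def true_fn(x):
    return np.sin(x)  # noise-free data-generating function (assumption for this sketch)

x_test, y_test = 1.0, true_fn(1.0)   # a single test point
n_repeats, preds = 500, []

for _ in range(n_repeats):
    # draw a fresh training set from the same distribution
    X_train = rng.uniform(0, 2 * np.pi, size=(30, 1))
    y_train = true_fn(X_train.ravel())
    model = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
    preds.append(model.predict([[x_test]])[0])

preds = np.array(preds)
avg_pred = preds.mean()                         # E[y hat]
bias_sq = (y_test - avg_pred) ** 2              # squared bias
variance = ((preds - avg_pred) ** 2).mean()     # variance
expected_loss = ((y_test - preds) ** 2).mean()  # E[(y - y hat)^2]

print(bias_sq + variance, expected_loss)  # the two values should agree (no noise term here)
```

Because there is no label noise in this setup, squared bias plus variance reproduces the expected squared error exactly (up to floating-point error).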

Understanding the bias-variance decomposition helps us gain insights into the behavior of a model. A high bias indicates underfitting, where the model is not able to capture the underlying patterns in the data. A high variance indicates overfitting, where the model is too sensitive to the training data and fails to generalize well to unseen data.

By analyzing the bias and variance components, we can make informed decisions about model complexity, regularization techniques, and data collection strategies to optimize the performance of our models.

In the next videos, we will relate bias and variance to overfitting and underfitting and then extend the bias-variance decomposition to the 0-1 loss and discuss its implications.

Video: 8.3 Bias-Variance Decomposition of the Squared Error (L08: Model Evaluation Part 1), published 2020.11.04 on www.youtube.com.

8.4 Bias and Variance vs Overfitting and Underfitting (L08: Model Evaluation Part 1)

In this video, my goal is to set a record for the shortest video in this course. I want to keep it concise and not drag on with the topic for too long. I only have two slides, so it won't take much time. In this video, we will explore the relationship between bias-variance decomposition and the concepts of underfitting and overfitting.

Let's start by looking at the graph shown earlier in this lecture. Please note that this is a simple sketch and not based on real numbers. In practice, the relationship between these terms can be noisy when dealing with real-world datasets. The graph illustrates the squared error loss plotted against the model's capacity, which relates to its complexity or ability to fit the training data.

Capacity refers to how well the model can fit the training set. A higher capacity means the model is more capable of fitting the data. For example, in parametric models like regression, capacity is often determined by the number of parameters or terms. As the capacity increases, the training error decreases because a more complex model can better fit the training data.

However, having a low training error doesn't guarantee good performance on new data. It's possible to overfit the training data by fitting it too closely, which can lead to an increase in the error on new data, known as the generalization error. The generalization error can be estimated using an independent test set. Initially, as the capacity increases, the generalization error improves to some extent. But after reaching a certain point, the error starts to increase again, indicating overfitting.

The gap between the training error and the generalization error represents the degree of overfitting. As the model's capacity increases, the gap increases because the model fits the data too closely, including noise in the data. The degree of overfitting indicates how much the model overfits the training data and fails to generalize well to new data.

Now, let's relate these concepts to bias and variance. In the graph, I've added the terms bias and variance in red. As the model's capacity increases, its variance also increases. This can be observed in the case of deep decision trees compared to short decision trees. Models with higher variance are more prone to overfitting. The higher the variance, the larger the degree of overfitting, which is represented by the gap between the training error and the generalization error.
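To illustrate this relationship in code (a sketch of my own, not part of the lecture; the synthetic dataset and depth values are arbitrary), the following snippet sweeps the maximum depth of a decision tree, a simple capacity knob, and prints the training accuracy, test accuracy, and their gap. The gap tends to widen as capacity grows, mirroring the overfitting region of the sketch described above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

for depth in [1, 2, 4, 8, 16, None]:  # None lets the tree grow until the leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}, "
          f"gap={train_acc - test_acc:.3f}")
```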

Conversely, as the variance increases, the bias decreases: a more complex model usually has a lower bias. The graph may appear to show the bias going down and then up again, but that is just an artifact of the sketch. In reality, the bias decreases, approaching an asymptote, as the model's capacity, and with it the variance, increases.

On the other hand, when the model has low capacity (such as a simple model), it underfits the data, resulting in poor performance on both the training and test sets. This is associated with a high bias. Underfitting occurs when the model is too simplistic to capture the underlying patterns in the data.

To summarize, high bias is correlated with underfitting, while high variance is correlated with overfitting. In the next video, we will briefly explore the bias-variance decomposition of the 0-1 loss, which is more relevant to classification tasks. Although it's less intuitive than decomposing the squared error loss, it provides insights into the bias and variance components in a classification context.

Video: 8.4 Bias and Variance vs Overfitting and Underfitting (L08: Model Evaluation Part 1), published 2020.11.04 on www.youtube.com.

8.5 Bias-Variance Decomposition of the 0/1 Loss (L08: Model Evaluation Part 1)

In this discussion, we delved into the bias-variance decomposition of the squared error loss and its relationship with overfitting and underfitting. Now, we will shift our focus to the bias-variance decomposition of the 0/1 loss, which is a bit more complex due to its piecewise nature. The 0/1 loss assigns a value of 0 if the true label matches the predicted label, and 1 otherwise. Analyzing this loss function is trickier since it is not a continuous function.

To explore the bias-variance decomposition in the context of the 0/1 loss, we will refer to the work of Pedro Domingos and of Kong and Dietterich. Domingos's paper, "A Unified Bias-Variance Decomposition," aimed to unify the various bias-variance decompositions that had been proposed for the 0/1 loss. Several authors have proposed different decompositions, but each of them has significant shortcomings.

In this class, we will primarily focus on the intuition behind the bridge between the bias-variance decomposition and the 0/1 loss. We will briefly discuss Kong and Dietterich's work from 1995 and Pedro Domingos's explanation of it. For a more detailed understanding, you can refer to the referenced papers.

Let's begin by revisiting the squared error loss, which we defined as the squared difference between the true value and the predicted value. Previously, we looked at the expectation of this loss over different training sets and decomposed it into bias and variance terms. Now, we will introduce a generalized notation using function L to represent the loss and take the expectation of this function.

When discussing the bias-variance decomposition of the squared error loss, we decomposed it into bias and variance terms. The bias term, denoted as Bias(Y), represents the difference between the true label (Y) and the average prediction (E[Y_hat]). The variance term, denoted as Var(Y_hat), measures the variability of predictions around the average prediction. These terms capture how much the predictions deviate from the true label and how much they scatter, respectively.

Now, we will define a new term called the main prediction. In the case of the squared error loss, the main prediction is the average prediction across different training sets. However, when dealing with the 0/1 loss, the main prediction is obtained by taking the mode of the predictions, i.e., the most frequent prediction. This distinction is crucial in understanding the bias-variance decomposition in the context of classification.

Let's explore how the bias and variance can be defined in terms of the 0/1 loss. We will refer to the cleaned-up version of the previous slide. On the right-hand side, we introduce the bias term. In the papers by Kong and Dietterich, the bias is defined as 1 if the main prediction (the mode of the predictions across training sets) is not equal to the true label (Y), and 0 otherwise. This definition captures whether the main prediction matches the true label or not.

Next, let's focus on the case where the bias is zero, indicating that the main prediction matches the true label. In this scenario, the loss is equal to the variance. By definition, the loss is the probability that the prediction does not match the true label, and since the main prediction equals the true label here, this is simply the probability that a prediction (Y_hat) differs from the main prediction. That probability is exactly how the variance is defined for the 0/1 loss.

Now, let's delve into the case where the bias is one, which is slightly more complicated. We start by rewriting the loss as one minus the probability that the prediction matches the true label, which is equivalent to one minus the accuracy.

When the bias is one, the main prediction does not match the true label. In binary classification there are only two labels, so an individual prediction matches the true label exactly when it deviates from the main prediction. The probability that the prediction matches the true label is therefore the probability that the prediction differs from the main prediction, which is the variance. This gives us loss = 1 - variance, or equivalently loss = bias - variance, in the biased case.

To summarize, in the bias-variance decomposition of the 0/1 loss, when the main prediction matches the true label (zero bias), the loss equals the variance; when the main prediction does not match the true label (bias of one), the loss equals one minus the variance, so variability around the (wrong) main prediction actually reduces the error.
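To make these definitions concrete, here is a small sketch (my own illustration, not code from the papers; the prediction values are made up). For a single test point, it computes the main prediction as the mode of the predictions produced by models trained on different training sets, and then the 0/1 bias and variance as defined above.

```python
import numpy as np

# hypothetical predictions for ONE test point, produced by models trained
# on ten different training sets drawn from the same distribution
y_true = 1
preds = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])

# main prediction for the 0/1 loss: the mode (most frequent predicted label)
main_pred = np.bincount(preds).argmax()

bias = int(main_pred != y_true)           # 0/1 bias (Kong & Dietterich definition)
variance = np.mean(preds != main_pred)    # P(prediction != main prediction)
avg_loss = np.mean(preds != y_true)       # average 0/1 loss across training sets

# with bias == 0, avg_loss equals the variance;
# with bias == 1 (binary case), avg_loss equals 1 - variance
print(bias, variance, avg_loss)
```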

It's important to note that the bias-variance decomposition for the 0/1 loss is more nuanced and complex compared to the squared error loss due to the discrete nature of the loss function. The bias and variance terms are defined based on the concept of the main prediction and capture different aspects of the classification performance.

Understanding the bias-variance trade-off in the context of the 0/1 loss is crucial for evaluating and improving classification models. By analyzing the bias and variance components, we can gain insights into the sources of error and make informed decisions to mitigate underfitting or overfitting issues.

If you're interested in a more detailed exploration of the bias-variance decomposition for the 0/1 loss, I recommend reading Pedro Domingos's paper "A Unified Bias-Variance Decomposition" and the related work by Kong and Dietterich. These papers provide in-depth explanations and mathematical formalisms for the decomposition.

The bias-variance trade-off is a fundamental concept in machine learning that relates to the model's ability to balance between underfitting and overfitting. The bias term represents the error due to the model's assumptions or simplifications, leading to an underfitting scenario where the model is too simple to capture the underlying patterns in the data. On the other hand, the variance term represents the error due to the model's sensitivity to small fluctuations in the training data, resulting in an overfitting scenario where the model is too complex and captures noise rather than generalizable patterns.

In the case of the 0/1 loss, the bias term captures the misclassification error when the main prediction is different from the true label. A high bias indicates that the model is consistently making incorrect predictions and is unable to capture the true underlying patterns in the data. This often occurs when the model is too simple or lacks the necessary complexity to capture the complexity of the problem.

The variance term, on the other hand, captures the variability of the predictions around the main prediction. It reflects the model's sensitivity to different training data samples and the instability of its predictions. A high variance indicates that the model is overly sensitive to small changes in the training data and is likely overfitting. This means that the model may perform well on the training data but fail to generalize to unseen data.

Ideally, we want to find a model that achieves a balance between bias and variance, minimizing both types of errors. However, there is often a trade-off between the two. Decreasing the bias may increase the variance and vice versa. This is known as the bias-variance trade-off.

To strike the right balance, various techniques can be employed. Regularization methods, such as L1 or L2 regularization, can help reduce the model's complexity and control the variance. Cross-validation can be used to evaluate the model's performance on different subsets of the data and identify potential overfitting. Ensemble methods, like bagging or boosting, can also be employed to reduce variance by combining multiple models.
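As a small illustration of one of these techniques (a sketch of my own, with an arbitrary synthetic dataset and penalty values), the following code compares ridge (L2) penalties of different strengths using cross-validation. A very small penalty corresponds to the high-variance regime, a very large one to the high-bias regime, and the cross-validated score helps locate a reasonable middle ground.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=50, n_informative=10,
                       noise=20.0, random_state=0)

for alpha in [0.001, 0.1, 1.0, 10.0, 100.0]:
    # the L2 regularization strength alpha controls the bias-variance trade-off
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5, scoring="r2")
    print(f"alpha={alpha}: mean R^2 = {scores.mean():.3f} +/- {scores.std():.3f}")
```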

Understanding the bias-variance trade-off is crucial for model selection and hyperparameter tuning. It allows us to assess the model's generalization performance and make informed decisions to improve its accuracy and reliability.

Video: 8.5 Bias-Variance Decomposition of the 0/1 Loss (L08: Model Evaluation Part 1), published 2020.11.05 on www.youtube.com.

8.6 Different Uses of the Term "Bias" (L08: Model Evaluation Part 1)

The lecture was not particularly exciting as it delved into the topic of bias and variance decomposition in machine learning. The speaker acknowledged the tediousness of the subject matter. However, there was one last important point the speaker wanted to address regarding the different forms of bias in machine learning.

The term "machine learning bias" was explained as an overloaded term, meaning it is used to refer to different things in different contexts. In a previous machine learning course taught by the speaker, bias unit and neural networks were discussed, but that was different from the statistical bias discussed in this lecture. In the context of machine learning, bias refers to the preferences or restrictions of the machine learning algorithm, also known as inductive bias.

The speaker gave an example of a decision tree algorithm to illustrate inductive bias. Decision trees favor smaller trees over larger trees. If two decision trees have the same performance on a training set, the algorithm would prefer the smaller tree and stop growing the tree if no improvement can be made. This preference for smaller trees is an example of inductive bias affecting a decision tree algorithm.

The speaker referred to a paper by Dietterich and Kong that contrasts machine learning bias with statistical bias. Appropriate and inappropriate biases were discussed in relation to absolute bias. Inappropriate biases do not contain any good approximation to the target function, meaning the algorithm is not well-suited to the problem. On the other hand, appropriate biases allow for good approximations to the target function.

Relative bias was described as being too strong or too weak. A bias that is too strong may not rule out good approximations but prefers poorer hypotheses instead. Conversely, a bias that is too weak considers too many hypotheses, potentially leading to overfitting.

The speaker shared an example of a simulation study involving decision tree models to demonstrate the interplay between bias and variance. The study evaluated the mean error rate and found that some errors were due to bias while others were due to variance.

Another important type of bias discussed was fairness bias, which refers to demographic disparities in algorithmic systems that are objectionable for societal reasons. Machine learning models can treat certain demographics unfairly, and this bias can stem from imbalanced datasets or other factors. The speaker recommended referring to the Fair ML Book for more information on fairness in machine learning.

The speaker briefly mentioned a project they worked on involving hiding soft biometric information from face images while still maintaining matching accuracy. The goal was to protect privacy by preventing algorithms from extracting gender information from face images. The speaker evaluated the performance of their system and commercial face matching algorithms, noting biases in the commercial software's binary gender classifier based on skin color.

The speaker emphasized the importance of minimizing biases and being mindful of how classifiers perform on different demographics. They highlighted the need for techniques such as oversampling to address biases and ensure fairer outcomes.

The lecture covered various forms of bias in machine learning, including inductive bias, statistical bias, and fairness bias. The examples and discussions shed light on the challenges and considerations involved in mitigating bias and promoting fairness in machine learning algorithms.

Video: 8.6 Different Uses of the Term "Bias" (L08: Model Evaluation Part 1), published 2020.11.05 on www.youtube.com.

9.1 Introduction (L09 Model Eval 2: Confidence Intervals)

Hello everyone! Today, we have a highly engaging and informative lecture ahead. In contrast to the previous lecture, which delved into the rather dry topic of setup and bias-variance decomposition, this session promises to be more exciting. We will be discussing various resampling techniques and conducting simulations on different datasets to observe how resampling affects the training of algorithms. By dividing a dataset into training and test sets, we reduce the available training size, potentially impacting model performance.

Moreover, we will explore confidence intervals and different methods to construct them. This includes using normal approximation intervals and various bootstrapping techniques. Confidence intervals have gained significance in machine learning, with recent paper submissions requiring their inclusion. Reviewers also take confidence intervals more seriously now. They provide an expectation within the field and prove useful not only for reviewers but also for other readers examining your models.

Now, let's dive into the lecture topics. We'll begin with an introduction, followed by the holdout method for model evaluation. Then, we'll explore how the holdout method can be employed for model selection. Moving forward, we'll delve into constructing confidence intervals using different techniques, starting with the normal approximation interval.

Resampling methods will also be a key focus. We'll analyze the repeated holdout method, where the holdout method is applied to resampled versions of the training set. Furthermore, we'll examine empirical confidence intervals, which rely on resampling techniques. Here, we'll encounter the familiar bootstrap technique discussed in the bagging and ensemble model lecture.

Once we understand how to create empirical confidence intervals using the bootstrap method, we'll explore two enhanced versions: the 0.632 bootstrap and the 0.632+ bootstrap. It's important to note the context of this lecture within the broader framework of model evaluation. We won't introduce new machine learning algorithms but instead focus on essential techniques for comparing and selecting models.

These techniques are crucial because it's challenging to determine which machine learning algorithm performs well on a given dataset. We often need to try and compare numerous algorithms to find the best performing one. Additionally, evaluating model performance is vital for developing applications like image recognition on iPhones, where predicting image labels accurately is crucial.

Besides estimating generalization performance for unseen data, we also compare different models. By using the same algorithm and training set, we can obtain multiple models with different hyperparameter settings. We compare these models to select the best one. Furthermore, we may use different algorithms and want to assess their performance on specific data types, such as images or text.

To select the best model, we can either estimate the absolute generalization performance accurately or rank the models without absolute performance values. The latter approach helps avoid biases introduced when using the same test set multiple times. A ranking system allows us to select the best model without relying on accurate estimates of generalization performance.

In the upcoming lectures, we will cover cross-validation techniques, statistical tests for model evaluation, and evaluation metrics beyond accuracy, such as precision, recall, and receiver operating characteristic (ROC) curves.

These lectures are critical because they provide the means to compare different machine learning algorithms and select the most suitable model. While they don't introduce new algorithms, they offer practical insights and techniques for assessing model performance.

In summary, our lecture today will cover resampling techniques, confidence intervals, and their relevance in machine learning. By the end of this lecture series, you'll have a comprehensive understanding of model evaluation and the tools necessary to make informed decisions in machine learning. Let's begin our exploration of these topics!

Video: 9.1 Introduction (L09 Model Eval 2: Confidence Intervals), published 2020.11.11 on www.youtube.com.

9.2 Holdout Evaluation (L09 Model Eval 2: Confidence Intervals)

In this video, we will discuss the holdout method for model evaluation. While this method is not new, there are some interesting aspects that we haven't explored before. The holdout method involves dividing the dataset into a training set and a test set. The training set is used to train or fit the model, while the test set is used to evaluate the model's performance.
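A minimal sketch of this basic holdout evaluation (my own example; the dataset, classifier, and 30% split are arbitrary choices):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

# hold out 30% of the data as an independent test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))  # tends to be optimistic
print("Test accuracy:", clf.score(X_test, y_test))     # generalization estimate
```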

However, there are a few considerations to keep in mind. First, the training set error is an optimistically biased estimate of the generalization error. This means that the training error may not reliably estimate the model's performance because it could be overfitting the training data. On the other hand, the test set provides an unbiased estimate of the generalization error if it is independent of the training set. However, from a conceptual perspective, the test set can be pessimistically biased. This bias arises because when we divide the dataset into training and test sets, we lose valuable data. Even with a small dataset, removing 30% of the data for evaluation can significantly impact the model's performance.

To illustrate this point, let's consider a simple example. Imagine we have a dataset consisting of only 10 data points. If we remove 30% of the data for evaluation, the model will be trained on only 70% of the data. This limited training data can lead to decreased model performance because machine learning models generally benefit from more data. If we plot a learning curve, we typically observe that as the dataset size increases, the generalization performance improves. Therefore, withholding a significant portion of the data for evaluation can make the model worse.

Despite this drawback, evaluating the model is necessary. In academia, we usually report the test set performance and consider our task complete. However, in industry, we often train the model on the entire dataset after evaluating it on the test set. This allows us to report the model's performance accurately to stakeholders, such as project managers. But training on the entire dataset can lead to a pessimistic bias in the test set performance estimate. For example, if the model achieved 95% accuracy on the test set, training on the full dataset might improve the model's performance to 96%. In this case, the initial estimate of 95% accuracy is pessimistically biased.

Using the holdout method alone is not always ideal. It has limitations, such as not accounting for the variance in the training data. When we split the data randomly, different splits can result in varying model performance. This variability makes the test set estimate less reliable as it provides only a point estimate. Additionally, the holdout method does not consider the possibility of optimistic bias when the test set is used multiple times for tuning and comparing models.

To further understand the impact of biases, let's consider the concept of pessimistic bias. In terms of model selection, a 10% pessimistic bias does not affect the ranking of models based on prediction accuracy. Suppose we have three models: h2, h1, and h3. Even if all the accuracy estimates are pessimistically biased by 10%, the ranking remains the same. The goal of model selection is to choose the best model available, and a consistent pessimistic bias across all models does not alter the relative ranking.

Similarly, there can be cases where the test set error is optimistically biased. This occurs when the same test set is used multiple times to tune and compare different models. Using the test set repeatedly can lead to survivor bias, where only the models that perform well on the test set are considered. An example of this is the "Do CIFAR-10 classifiers generalize to CIFAR-10?" paper, which examines overfitting and optimistic biases in classifiers trained and evaluated on the CIFAR-10 image dataset.

In conclusion, while the holdout method is a commonly used approach for model evaluation, it has its limitations and potential biases. To overcome these limitations, alternative techniques have been developed, such as cross-validation and bootstrapping.

Cross-validation is a method that involves dividing the dataset into multiple subsets or folds. The model is trained on a combination of these folds and evaluated on the remaining fold. This process is repeated several times, with each fold serving as the test set once. Cross-validation provides a more comprehensive evaluation of the model's performance as it utilizes different subsets of the data for training and testing. It helps mitigate the impact of random data splits and provides a more reliable estimate of the model's generalization performance.

Bootstrapping is another resampling technique that addresses the limitations of the holdout method. It involves randomly sampling the dataset with replacement to create multiple bootstrap samples. Each bootstrap sample is used as a training set, and the remaining data is used as a test set. By repeatedly sampling with replacement, bootstrapping generates multiple training-test splits, allowing for a more robust evaluation of the model's performance.

Both cross-validation and bootstrapping help to alleviate the biases associated with the holdout method. They provide more reliable estimates of the model's performance by utilizing the available data more efficiently and accounting for the variability in the training-test splits.

While the holdout method is a straightforward approach for model evaluation, it has limitations and potential biases. To mitigate these issues, techniques like cross-validation and bootstrapping offer more robust and reliable estimates of the model's performance. It is important to consider these alternative methods depending on the specific requirements and constraints of the problem at hand.

Video: 9.2 Holdout Evaluation (L09 Model Eval 2: Confidence Intervals), published 2020.11.11 on www.youtube.com.

9.3 Holdout Model Selection (L09 Model Eval 2: Confidence Intervals)

In the previous video, we discussed the holdout method for model evaluation. Now, we will explore how we can modify this method for model selection. To recap, in the previous video, we split the dataset into a training set and a test set. We trained a model on the training set using a machine learning algorithm and fixed hyperparameter settings. Then, we evaluated the model on the test set. Additionally, we optionally fit the model to the entire dataset to leverage more data, expecting improved performance.

Now, we aim to use the holdout method for model selection, which is closely related to hyperparameter tuning. Model selection involves choosing the best model among different hyperparameter settings. In the process of hyperparameter tuning, we generate multiple models, each corresponding to a specific hyperparameter setting. Model selection helps us identify the model with the optimal hyperparameter setting.

To explain the modified holdout method for model selection, let's break down the steps. First, instead of splitting the dataset into just a training and test set, we divide it into three sets: a training set, a validation set, and a test set. This separation allows us to have an independent dataset, the validation set, for model selection.

Next, we consider different hyperparameter settings and fit multiple models using the training data. For instance, we may use a K-nearest neighbor algorithm with hyperparameter values of k=3, k=5, and k=7, resulting in three models.

The model selection step involves evaluating these models using the validation set. Because the models may overfit the training data, the training set performance is not suitable for selecting the best model. Therefore, we rely on the independent validation set. We compute a performance metric, such as prediction accuracy, for each model and select the one with the best validation performance as the optimal model, corresponding to the best hyperparameter setting.

However, using the validation set multiple times for model selection can introduce bias, similar to the issue we encountered with the test set in the previous video. To obtain an unbiased estimate of the model's performance, we reserve an independent test set. After selecting the best model, we evaluate its performance on the test set and report the results.

Optionally, before the final evaluation, we can refit the model using the combined training and validation data. This step leverages more data to potentially improve the model's performance. Finally, we evaluate the final model on the independent test set and report its performance. Although we don't have a test set to further evaluate the model fitted with the combined data, it is generally expected to be better due to the increased amount of data.

In practice, the holdout method for model selection may vary, and not all steps are strictly followed. Some practitioners directly evaluate the selected model on the test set without retraining on the combined data. Nonetheless, the key idea is to have separate datasets for training, validation, and testing to ensure unbiased performance estimation and facilitate the selection of the best model.
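In code, the three-way holdout procedure described above might look like the following sketch (my own; the dataset, split ratios, and k values are arbitrary): select k on the validation set, optionally refit on the combined training and validation data, and report the final estimate on the untouched test set.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# first split off the test set, then split the remainder into train/validation
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp)

# model selection: pick the k with the best validation accuracy
best_k, best_acc = None, -np.inf
for k in [3, 5, 7]:
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_val, y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc

# optional: refit the selected model on training + validation data
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val]))

# performance estimate on the untouched test set
print(f"best k = {best_k}, test accuracy = {final_model.score(X_test, y_test):.3f}")
```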

In the next video, we will delve into the concept of confidence intervals.

Video: 9.3 Holdout Model Selection (L09 Model Eval 2: Confidence Intervals), published 2020.11.12 on www.youtube.com.

9.4 ML Confidence Intervals via Normal Approximation (L09 Model Eval 2: Confidence Intervals)

In this video, our focus is on confidence intervals, specifically for estimating classification error or classification accuracy from a test set. We'll be using the normal approximation method, which is the simplest approach. However, we'll also discuss better methods based on resampling in future videos.

Currently, we are in the basic section, exploring confidence intervals using the normal approximation method. In subsequent videos, we'll delve into different resampling techniques, starting with the repeated holdout method and then moving on to methods like bootstrapping for constructing empirical confidence intervals, which are more effective when dealing with smaller datasets commonly encountered in traditional machine learning.

Let's begin by discussing the binomial distribution, which you may already be familiar with from other statistics classes. The binomial distribution describes the number of successes in n independent trials, with parameters n and p representing the number of trials and the success probability, respectively. The mean of the binomial distribution is n times p. For example, if we have 100 trials with a 33% success probability, the mean would be 33.

In the figure on the left-hand side, you can see the probability mass function of the binomial distribution for different values of p and n, which illustrates the probability of observing different numbers of successes. Additionally, the variance of the binomial distribution is n times p times (1 - p), which we'll use later. Take a moment to familiarize yourself with this distribution.

Now, let's connect the binomial distribution to machine learning. We can view the 0-1 loss as a Bernoulli trial with two possible outcomes: a correct classification and an incorrect classification. For the purpose of counting errors, we treat an incorrect classification as a "success" and a correct classification as a "failure," analogous to heads and tails in a coin flip. To estimate the probability of success (i.e., of an incorrect classification), we can compute it empirically by performing a large number of trials and dividing the number of successes by the total number of trials. The expected number of successes is n times p, which corresponds to the mean of the binomial distribution.

The relationship between the 0-1 loss and the binomial distribution helps us formalize the notion of error in machine learning. We can treat each prediction as a Bernoulli trial, and the true error as the probability of an incorrect prediction. To estimate the true error, we use a test set: we count the number of incorrect predictions and divide it by the size of the test set, which gives the classification error as a value between zero and one.

When constructing confidence intervals, we use the same methods employed in one-sample confidence intervals from other statistics classes. A confidence interval is an interval that is expected to contain the parameter of interest with a certain probability. The most common confidence level is 95%, but other levels such as 90% or 99% can also be used. The choice of confidence level determines the width of the interval, with higher levels resulting in wider intervals.

To formally define a confidence interval, we consider multiple samples drawn repeatedly from the assumed distribution. In our case, we assume a normal distribution. When constructing a 95% confidence interval using this method, if we were to construct an infinite number of intervals based on an infinite number of samples, we would expect 95% of these intervals to contain the true parameter.

You may be wondering why we assume that the data can be drawn from a normal distribution. The reason is that the binomial distribution resembles a normal distribution when the number of trials is large. Even for relatively moderate numbers of trials, the distribution already has a shape similar to that of a normal distribution. This is why we employ the normal approximation method for constructing confidence intervals in this case.

Now, let's dive into the details of constructing a confidence interval for the classification error using the normal approximation method. First, we need the standard error of our error estimate. Since the error rate is a proportion, its variance under the binomial model is p times (1 - p) divided by n, so the standard error is the square root of p(1 - p)/n, where p is estimated by the observed error rate on the test set.

Next, we determine the z value corresponding to the desired confidence level. This is the quantile of the standard normal distribution that leaves the desired central probability mass between -z and +z; for a 95% confidence level, z is approximately 1.96. (Recall that a z-score in general standardizes a value x via (x - mu) / sigma, i.e., it expresses how many standard deviations x lies from the mean.)

To construct the confidence interval, we start with the estimated error rate from the test set, which is our point estimate. Then we subtract and add the product of the z value and the standard error to obtain the lower and upper bounds of the interval. The resulting interval represents the range of values within which we expect the true classification error to fall with the specified confidence level: ERR +/- z * sqrt(ERR * (1 - ERR) / n).
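Putting these steps together, a minimal sketch (my own; the test-set size and error count are made-up numbers) of a 95% normal-approximation interval around a test-set error estimate:

```python
import numpy as np
import scipy.stats as st

n_test = 1000          # test set size (hypothetical)
n_errors = 150         # number of misclassified test examples (hypothetical)

err = n_errors / n_test                  # point estimate of the error rate
z = st.norm.ppf(0.975)                   # ~1.96 for a 95% confidence level
se = np.sqrt(err * (1 - err) / n_test)   # standard error of the proportion

lower, upper = err - z * se, err + z * se
print(f"error = {err:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```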

It's important to note that the normal approximation method assumes that the number of trials (size of the test set) is sufficiently large. If the test set is small, this approximation may not be accurate. In such cases, resampling methods like bootstrapping can provide more reliable confidence intervals.

In summary, constructing confidence intervals for classification error using the normal approximation method involves the following steps:

  1. Compute the standard error of the estimated error rate using the formula sqrt(ERR * (1 - ERR) / n), where ERR is the test-set error and n is the test set size.
  2. Determine the z value corresponding to the desired confidence level (about 1.96 for 95%).
  3. Compute the lower and upper bounds of the confidence interval by subtracting and adding the product of the z value and the standard error to the point estimate.

Keep in mind that in subsequent videos, we will explore more advanced methods based on resampling techniques, which are particularly useful for smaller datasets. These methods provide empirical confidence intervals and are often more accurate than the normal approximation method.

Video: 9.4 ML Confidence Intervals via Normal Approximation (L09 Model Eval 2: Confidence Intervals), published 2020.11.12 on www.youtube.com.

9.5 Resampling and Repeated Holdout (L09 Model Eval 2: Confidence Intervals)

In this video, we will delve into the topic of resampling and specifically discuss the repeated holdout method. Previously, we explored the regular holdout method, where the dataset is divided into training and test sets. We also explored how the normal approximation method can be used to construct confidence intervals based on the performance estimated on the test set. Now, we will shift our focus to resampling methods, starting with the repeated holdout method.

To provide a visual illustration, let's consider learning curves. Learning curves indicate whether our model would benefit from additional training data. In the graph, the x-axis represents the size of the training set, while the y-axis represents performance, measured as accuracy. (The same plot could show the error instead by flipping it.) The performance shown here is based on the MNIST handwritten digit dataset, but only a subset of 5000 images was used to speed up computation. Of these 5000 images, 3500 were reserved for training and 1500 were set aside as the test set, and training sets of varying sizes were drawn from the 3500 training images.

Each data point on the graph corresponds to a specific training set size, while the test set size remains constant at 1500. The trend observed is that the training accuracy is highest for the smallest training sets and decreases as the training set grows. One possible explanation is that with a small training set, it is easy for the model to memorize the data, including any outliers or noise; as the training set grows, memorization becomes harder because more diverse examples and outliers are present. At the same time, a larger training set facilitates better generalization, leading to improved performance on the test set.

It's worth noting that the graph stops at a training set size of 3500, as no larger training pool was available. The test set, shown in red, remained fixed at 1500 samples. By reserving these samples for testing, a pessimistic bias was introduced, because the model may not have reached its full potential: it could likely still improve if it were given more training data. In this case, a simple softmax classifier (multinomial logistic regression) was used for efficiency, but other classifiers could be employed for similar experiments.

In connection with learning curves, it is important to consider the size of the dataset and its impact on classifier performance. Increasing the dataset size can improve the classifier's performance, especially when learning curves indicate a decreasing test error as the training set size grows. For example, when working on a project involving movie rating prediction, collecting more movie reviews from sources like IMDb can enhance the classifier's performance.
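A short sketch for producing such a learning curve with scikit-learn (my own example, using the small digits dataset bundled with scikit-learn rather than the MNIST subset from the lecture): if the validation score is still rising at the largest training size, more data is likely to help.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=2000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for size, tr, te in zip(train_sizes,
                        train_scores.mean(axis=1),
                        test_scores.mean(axis=1)):
    print(f"n_train={size}: train acc={tr:.3f}, validation acc={te:.3f}")
```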

During office hours, students often inquire about improving classifier performance for their projects. Enhancing a classifier can involve various strategies, such as parameter changes, feature selection, or feature extraction. However, increasing the dataset size is a simple yet effective method that may yield positive results. Examining learning curves helps determine whether more data can benefit the model, instead of solely focusing on tuning hyperparameters.

It's important to acknowledge the pessimistic bias resulting from splitting the dataset into training and test sets. By withholding a substantial portion of the data for testing, the model may not have reached its full potential due to limited training data. One solution is to reduce the size of the test set to address this bias. However, reducing the test set size introduces another challenge: an increase in variance. The variance of the model's performance estimate rises with smaller test sets, potentially leading to less reliable estimates.

To mitigate these challenges, we can employ a technique called Monte Carlo cross-validation, which involves repeating the holdout method multiple times and averaging the results. This technique is commonly known as the repeated holdout method.

In the repeated holdout method, we perform multiple iterations of the holdout process, where we randomly split the dataset into training and test sets. Each iteration uses a different random split, ensuring that different subsets of the data are used for training and testing in each iteration. By repeating this process several times, we can obtain multiple performance estimates for our model.

The key advantage of the repeated holdout method is that it provides a more robust and reliable estimate of the model's performance compared to a single holdout split. Since each iteration uses a different random split, we can capture the variability in the performance due to the randomness in the data. This helps us obtain a more accurate estimate of the model's true performance on unseen data.

Once we have the performance estimates from each iteration, we can calculate the average performance and use it as our final estimate. Additionally, we can also compute the variance or standard deviation of the performance estimates to get an idea of the variability in the results.
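A minimal sketch of this repeated holdout procedure (my own code; the dataset, classifier, number of repetitions, and split ratio are arbitrary choices):

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
accuracies = []

for seed in range(50):  # 50 repetitions, each with a different random split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y)
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

# average performance plus the variability across the random splits
print(f"mean accuracy = {np.mean(accuracies):.3f} "
      f"(std across splits = {np.std(accuracies):.3f})")
```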

It's important to note that in the repeated holdout method, the training and test sets should be disjoint in each iteration to ensure that the model is evaluated on unseen data. Also, the size of the training and test sets should be determined based on the size of the available dataset and the desired trade-off between training and evaluation data.

The repeated holdout method is particularly useful when the dataset is large enough to allow for multiple random splits. It helps to provide a more robust evaluation of the model's performance and can be especially beneficial when working with limited data.

In summary, the repeated holdout method is a resampling technique that involves repeating the holdout process multiple times with different random splits of the dataset. It helps to obtain more reliable performance estimates and capture the variability in the model's performance. By averaging the results of the repeated holdout iterations, we can obtain a better estimate of the model's true performance.

Video: 9.5 Resampling and Repeated Holdout (L09 Model Eval 2: Confidence Intervals), published 2020.11.13 on www.youtube.com.

9.6 Bootstrap Confidence Intervals (L09 Model Eval 2: Confidence Intervals)

Welcome back! We have now reached the more interesting parts of this lecture. In this video, our focus will be on empirical confidence intervals using the bootstrap method. As a quick recap, we have previously discussed the bootstrap method when we talked about bagging methods. In bagging, we drew bootstrap samples from the training set. But have you ever wondered why it is called the 'bootstrap' method?

Well, the term 'bootstrap' originated from the phrase 'pulling oneself up by one's bootstraps,' which was figuratively used to describe an impossible task. The bootstrap method is indeed a challenging technique as it involves estimating the sampling distribution from a single sample. So, in a way, we are metaphorically trying to pull ourselves up by our bootstraps by attempting this difficult task.

Over time, the meaning of 'bootstrap' expanded to include the concept of bettering oneself through rigorous, unaided effort. However, in the context of the bootstrap method, we are solely focused on the technique itself and not the political connotations associated with 'pulling oneself up by one's bootstraps.'

Now, let's delve into the bootstrap method and how it allows us to estimate the sampling distribution and the uncertainty of our performance estimates. The bootstrap method, first introduced by Bradley Efron in 1979, is a resampling technique used to estimate a sampling distribution when we only have access to a single dataset.

To understand the concept, imagine you have only one dataset, and you want to make use of it to estimate various sample statistics. These statistics can be anything of interest, such as the sample mean, standard deviation, R-squared, or correlations. The bootstrap method allows us to generate new datasets by repeatedly sampling from the original dataset, simulating the process of drawing samples from the population. It's important to note that the sampling is done with replacement, unlike the repeated holdout method, which samples without replacement.

By drawing these bootstrap samples and calculating the desired sample statistic, such as the sample mean, we can observe that the distribution of the bootstrap sample means is approximately normal. The standard deviation of this distribution, known as the standard error of the mean, can also be estimated directly from the sample standard deviation divided by the square root of the sample size.

The bootstrap method enables us to construct confidence intervals by estimating the standard deviation and using it to determine the uncertainty associated with our performance estimates. Confidence intervals provide a range of plausible values for the true population parameter. In the case of the bootstrap method, we compute the standard deviation empirically and utilize it to calculate confidence intervals.

Now, let's understand the steps involved in the bootstrap procedure. First, we draw a sample with replacement from the original dataset. Next, we compute the desired sample statistic using this bootstrap sample. We repeat these two steps a large number of times, usually recommended to be around 200 or more, to obtain a distribution of sample statistics. The standard deviation of this distribution serves as an estimate of the standard error of the sample statistic. Finally, we can use the standard error to compute confidence intervals, which provide a measure of uncertainty around our performance estimate.
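As a small sketch of this general procedure (my own code, using simulated data and the sample mean as the statistic of interest):

```python
import numpy as np

rng = np.random.RandomState(42)
sample = rng.normal(loc=5.0, scale=2.0, size=100)  # the single dataset we have

n_rounds = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.shape[0], replace=True).mean()  # resample with replacement
    for _ in range(n_rounds)
])

se = boot_means.std()            # bootstrap estimate of the standard error
point_estimate = sample.mean()
lower, upper = point_estimate - 1.96 * se, point_estimate + 1.96 * se
print(f"mean = {point_estimate:.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")
```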

When it comes to evaluating the performance of a classifier using the bootstrap method, we can modify the approach slightly. Consider a dataset of size n. In this case, we perform p bootstrap rounds, where in each round, we draw a bootstrap sample from the original dataset. We then fit a model to each of these bootstrap samples and compute the accuracy on the out-of-bag samples, which are the samples not included in the bootstrap sample. By averaging the accuracies over all the bootstrap rounds, we obtain the bootstrap accuracy. This approach addresses the issue of overfitting by evaluating the model on unseen data, rather than the samples used for training. Additionally, the bootstrap accuracy provides a measure of the model's performance variability.

To summarize the steps involved in evaluating the performance of a classifier using the bootstrap method:

  1. Randomly select a bootstrap sample of size n (with replacement) from the original dataset.
  2. Train a classifier on the bootstrap sample.
  3. Evaluate the trained classifier on the out-of-bag samples (samples not included in the bootstrap sample) and calculate the accuracy.
  4. Repeat steps 1-3 for a large number of bootstrap rounds (p times).
  5. Calculate the average accuracy across all bootstrap rounds to obtain the bootstrap accuracy.

The bootstrap accuracy can serve as an estimate of the classifier's performance on unseen data and provides a measure of the uncertainty associated with the performance estimate. Furthermore, it can help assess the stability and robustness of the classifier.
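Here is a sketch of this out-of-bag bootstrap evaluation (my own implementation of the steps above; the dataset, classifier, and number of rounds are arbitrary), including a simple percentile interval computed from the per-round accuracies as one way to express the uncertainty.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.RandomState(0)
n, n_rounds = X.shape[0], 200
accuracies = []

for _ in range(n_rounds):
    # 1. bootstrap sample of size n, drawn with replacement
    idx = rng.randint(0, n, size=n)
    oob_mask = np.ones(n, dtype=bool)
    oob_mask[idx] = False                      # out-of-bag = never drawn this round
    if not oob_mask.any():
        continue                               # extremely unlikely, but be safe
    # 2. train on the bootstrap sample, 3. evaluate on the out-of-bag samples
    clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    accuracies.append(clf.score(X[oob_mask], y[oob_mask]))

accuracies = np.array(accuracies)
# 5. average accuracy plus a simple 95% percentile interval
print(f"bootstrap accuracy = {accuracies.mean():.3f}, "
      f"95% CI = [{np.percentile(accuracies, 2.5):.3f}, "
      f"{np.percentile(accuracies, 97.5):.3f}]")
```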

By utilizing the bootstrap method, we can gain valuable insights into the performance of our models and estimate the uncertainty associated with our performance estimates. This technique is particularly useful when we have limited data and want to make the most of the available dataset. The bootstrap method allows us to approximate the sampling distribution, construct confidence intervals, and evaluate the performance of classifiers effectively.

In conclusion, the bootstrap method is a powerful resampling technique that enables us to estimate the sampling distribution and assess the uncertainty of performance estimates using a single dataset. It provides a practical approach to address various statistical challenges and has found applications in a wide range of fields, including machine learning, statistics, and data analysis. By understanding and implementing the bootstrap method, we can enhance our ability to make informed decisions and draw reliable conclusions from limited data.

Video: 9.6 Bootstrap Confidence Intervals (L09 Model Eval 2: Confidence Intervals), published 2020.11.13 on www.youtube.com.