Something Interesting in Financial Video

 

Caltech's Machine Learning Course - CS 156. Lecture 08 - Bias-Variance Tradeoff 

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:16

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

The professor discusses the bias-variance tradeoff in machine learning, explaining how the complexity of the hypothesis set affects the tradeoff between generalization and approximation. He introduces the concepts of bias and variance: the bias measures the deviation between the average hypothesis a learning algorithm produces and the actual target function, while the variance measures how much the hypotheses the algorithm produces vary across different datasets. The tradeoff is that a larger hypothesis set has a smaller bias but a larger variance, while a smaller hypothesis set has a larger bias but a smaller variance. The lecturer emphasizes the importance of having enough data resources to effectively navigate the hypothesis set, and highlights the difference in scale between the bias-variance analysis and the VC analysis.

He also discusses the tradeoff between simple and complex models in terms of their ability to approximate and generalize: fewer examples call for simple models, while larger pools of examples can support more complex ones. The bias-variance analysis presented is specific to linear regression and assumes knowledge of the target function, with validation remaining the gold standard for choosing a model. Ensemble learning is discussed through Bagging, which uses bootstrapping to average models trained on multiple resampled datasets, reducing variance. The balance between variance and covariance in ensemble learning is also explained, and linear regression is classified as a learning technique, with fitting as the first part of learning and the theory supplying the emphasis on good out-of-sample performance.

  • 00:00:00 In this section, the focus shifts to the bias-variance tradeoff, which is another approach to understanding generalization. In the previous lectures, the VC analysis established the generalization ability of a chosen hypothesis via the VC dimension of a hypothesis set. The VC bound holds for any learning algorithm, any input data, and any target function. One aspect of the VC analysis is that it provides a practical measure: by plotting the probability of error versus the number of examples, we found that the number of examples needed is proportional to the VC dimension; as a rule of thumb, you need about 10 times the VC dimension to start getting interesting generalization properties. Finally, the VC analysis was summarized into a generalization bound, which will be used in later techniques like regularization.

  • 00:05:00 In this section, the lecturer discusses the tradeoff between approximation and generalization in learning. Learning aims at a small E_out, which means the hypothesis approximates the target function well and that this approximation holds out-of-sample. A more complex hypothesis set improves the chance of approximating f well but makes it harder to identify the suitable hypothesis. The ideal hypothesis set would be a singleton containing just the target function; since we don't know the target function, we need a hypothesis set large enough to stand a chance. The lecturer also notes that whereas the VC analysis quantifies the tradeoff through a bound, the bias-variance analysis quantifies it by decomposing E_out.

  • 00:10:00 In this section, the speaker introduces the bias-variance tradeoff and how it relates to real-valued functions and regression using squared error. The goal is to decompose the out-of-sample error into two conceptual components: approximation and generalization. Since the final hypothesis depends on the particular data set used, the speaker takes the expected value of the error with respect to all possible data sets, integrating the data set out to remove that dependency. The result is a way to analyze the general behavior of the error when given a specific number of data points to work with.

  • 00:15:00 In this section, the lecturer explains how to take the expected value of this behavior with respect to all possible realizations of 100 examples. By reversing the order of integration, an inner expectation can be dealt with first, leading to a clean decomposition. The next step derives an average hypothesis, g bar, as the expected value over data sets of the hypotheses the algorithm produces. Although computing g bar is certainly an impossible task in practice, it provides a conceptual tool for analysis. Its technical utility appears when the squared error is expanded: the cross term that arises vanishes precisely because g bar is defined as this expectation.

  • 00:20:00 In this section, the lecturer decomposes, in two steps, the quantity that measures how far the hypothesis a learning algorithm derives from a given dataset diverges from the target function. The first step assesses how far this hypothesis deviates from the average hypothesis the algorithm produces (the best the algorithm can do on average over datasets), while the second assesses how far that average hypothesis deviates from the actual target function. The lecturer arrives at two quantities, the bias and the variance, corresponding to these two steps. The bias measures the deviation between the average hypothesis and the actual target function, and reflects the limitation of the algorithm's hypothesis set. The variance measures how much the hypotheses the model produces vary across different datasets.
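
In symbols (writing g^(D) for the hypothesis learned from dataset D), the decomposition takes the standard form:

    E_D[ E_out(g^(D)) ] = E_x[ bias(x) + var(x) ],   where
    g_bar(x) = E_D[ g^(D)(x) ]                       (the average hypothesis)
    bias(x)  = ( g_bar(x) - f(x) )^2
    var(x)   = E_D[ ( g^(D)(x) - g_bar(x) )^2 ]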

  • 00:25:00 In this section, the professor discusses the bias-variance tradeoff in machine learning. He explains that the bias is the limitation of the hypothesis set, and the variance is the difference in outcome when using different data sets. He then shows how there is a tradeoff between generalization and approximation when changing the size of the hypothesis set, and illustrates this idea with a comparison of a small and large hypothesis set. He argues that a larger hypothesis set will have a smaller bias but a larger variance, while a smaller hypothesis set will have a larger bias but a smaller variance.

  • 00:30:00 In this section, the speaker introduces the concept of bias-variance tradeoff, where the bias decreases and variance increases as the hypothesis set becomes larger. To understand this, the speaker sets a concrete example where the target function is a sinusoid, and two different hypothesis sets are given: a constant model and a linear model. The speaker then shows that the linear model gives a better approximation of the sinusoid, but with some errors. This is not a learning situation but illustrates the tradeoff between bias and variance in the approximation of the target function, paving the way for more complex learning problems.

  • 00:35:00 In this section, the lecturer explains the bias-variance tradeoff in machine learning. He uses the example of fitting a line to two points, first to approximate a target function and then to learn from examples. The bias-variance analysis is needed to evaluate a model's performance regardless of which two points are used, overcoming the dependency on the particular dataset. The lecturer generates many datasets of two points each, fits a line to each, and shows that the expected out-of-sample error is the sum of the bias and the variance. The very light green line, g bar of x, is the average hypothesis obtained by repeating this game; it is not the output of any single learning run, because different datasets give different hypotheses.
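
The experiment is easy to reproduce numerically. A minimal sketch, assuming the lecture's setup of f(x) = sin(pi*x) on [-1, 1] and two-point datasets fitted by a line (variable names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n_datasets, n_grid = 10000, 1000
    x_grid = np.linspace(-1, 1, n_grid)
    f = lambda x: np.sin(np.pi * x)

    a = np.empty(n_datasets)                     # slopes
    b = np.empty(n_datasets)                     # intercepts
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, size=2)
        a[i], b[i] = np.polyfit(x, f(x), deg=1)  # least-squares line through 2 points

    g_bar = a.mean() * x_grid + b.mean()         # the average hypothesis
    all_g = np.outer(a, x_grid) + b[:, None]     # every learned line on the grid
    bias = np.mean((g_bar - f(x_grid)) ** 2)
    var = np.mean((all_g - g_bar) ** 2)
    print(f"bias ~ {bias:.2f}, var ~ {var:.2f}, expected E_out ~ {bias + var:.2f}")

An analogous run with a constant model shows the other side of the tradeoff: larger bias, much smaller variance, and possibly a smaller sum.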

  • 00:40:00 In this section of the video, the bias-variance tradeoff is discussed in the context of machine learning. The variance is visualized as a band of one standard deviation around g bar, measuring how the outputs of the learning process spread, while the bias is the error between the average hypothesis and the target function. The tradeoff is demonstrated using two models, one with a small bias and a large variance and the other with a large bias and a small variance. The lesson is that in a learning situation, the model complexity should be matched to the data resources available rather than to the target complexity.

  • 00:45:00 In this section, the speaker discusses the bias-variance tradeoff in learning and introduces the concept of learning curves. Learning curves plot the expected values of E_out (out-of-sample error) and E_in (in-sample error) as a function of N, the size of the dataset. As N increases, the out-of-sample error generally decreases, but this trend can be influenced by the complexity of the model being used. The speaker emphasizes the importance of having enough data resources to effectively navigate the hypothesis set, and notes that noisy data can make this navigation even more difficult. The learning curves provide a visual representation of the bias-variance tradeoff and how it changes with increasing N.

  • 00:50:00 In this section, the lecturer discusses the relationship between the bias-variance analysis and the VC analysis using learning curves. He explains that both theories are discussing approximation and take into consideration what happens in terms of generalization. The lecturer highlights the difference in scale between the two theories and mentions that the bias depends on the hypothesis set. Finally, the lecturer briefly covers the analysis for the linear regression case and recommends it as a good exercise to gain insight into linear regression.

  • 00:55:00 In this section, the instructor describes the in-sample and out-of-sample error patterns using the learning curves. Using linear regression with noise, he gives a simple formula for the expected in-sample error: it sits below the noise level sigma^2 by a fraction involving d plus 1, so in-sample you appear to do better than "perfect" because the model absorbs some of the noise. The curve shows that the more data points you have, the less the noise impacts the error rate. However, when you overfit to the sample data, you end up fitting the noise, and this harms you rather than helping you in the long run.

  • 01:00:00 In this section, the professor talks about the tradeoff between simple and complex models and their ability to approximate and generalize. While complex models can better approximate the target function and the training examples, simple models are better in terms of generalization ability. Since the sum of the two effects can go either way, the key is to match the complexity of the model to the data resources available: fewer examples mean simple models should be used, while larger pools of examples can support complex models for better performance. The expected generalization error comes out of a simple formula: it is proportional to the VC dimension divided by the number of examples (the expressions are sketched below).
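
For the linear regression case, with noise variance sigma^2 and d+1 parameters, the lecture's expressions are:

    E_D[ E_in ]  = sigma^2 * (1 - (d+1)/N)
    E_D[ E_out ] = sigma^2 * (1 + (d+1)/N)
    expected generalization error = 2 * sigma^2 * (d+1) / N

The in-sample error sits below the noise floor sigma^2 by exactly as much as the out-of-sample error sits above it, and the gap shrinks like 1/N.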

  • 01:05:00 In this section, the professor discusses how the bias-variance analysis presented is specific to linear regression and assumes you know the target function. It is a helpful guide for understanding how to affect both bias and variance, but it is not something you can plug in to tell you what the model should be. He also mentions that the gold standard for choosing a model is validation, and ensemble methods such as boosting come up briefly. The professor then revisits the idea of g bar as a theoretical tool for analysis, noting that it is not the focus of this lecture.

  • 01:10:00 In this section, the professor talks about ensemble learning through Bagging, the process of using one dataset to generate a large number of different datasets through bootstrapping and averaging the resulting models. This captures some of the dividend of ensemble learning and can help reduce variance by averaging many things out (a minimal sketch follows). The moderator then asks whether bias and variance still appear under the Bayesian approach; the professor explains that although the Bayesian approach makes certain assumptions, the bias-variance tradeoff still exists. Finally, he talks about the relation between numerical function approximation and extrapolation in machine learning, and the bias-variance-covariance dilemma.
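
As a concrete illustration of Bagging, here is a minimal sketch; the polynomial base model and all names are assumptions made for the example, not the lecture's notation:

    import numpy as np

    def bagged_fit(x, y, n_models=25, deg=3, seed=0):
        # Fit n_models polynomial regressors on bootstrap resamples of (x, y).
        rng = np.random.default_rng(seed)
        coefs = []
        for _ in range(n_models):
            idx = rng.integers(0, len(x), size=len(x))   # sample with replacement
            coefs.append(np.polyfit(x[idx], y[idx], deg))
        return coefs

    def bagged_predict(coefs, x):
        # Averaging the individual predictions is what cuts the variance.
        return np.mean([np.polyval(c, x) for c in coefs], axis=0)

The averaging plays the role of g bar: truly independent datasets are not available in practice, but bootstrap resamples approximate that averaging, up to the covariance between models discussed below.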

  • 01:15:00 In this section of the lecture, the professor discusses the balance between variance and covariance in the context of ensemble learning. He explains that in the bias-variance analysis, he had the luxury of picking independently generated data sets, generating independent models, and then averaging them. However, in actual practice, when constructing models based on variations of the data set, the covariance between the models starts to play a role. Later, when asked if linear regression is a learning technique or just function approximation, the professor states that linear regression is a learning technique and fitting is the first part of learning. The added element is to ensure that the model performs well out-of-sample, which is what the theory is about.

Lecture 08 - Bias-Variance Tradeoff
  • 2012.04.28
  • www.youtube.com
Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The learning curves. Lecture 8 of 18 of Caltech's Machine Learning...
 

Caltech's Machine Learning Course - CS 156. Lecture 09 - The Linear Model II

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:18

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

This lecture covers various aspects of the linear model, including the bias-variance decomposition, learning curves, and techniques for linear models such as perceptrons, linear regression, and logistic regression. The speaker emphasizes the tradeoff between complexity and generalization performance, cautioning against overfitting and stressing that the VC dimension of the full hypothesis space must be properly charged for the VC warranty to remain valid. The use of nonlinear transforms and their impact on generalization behavior is also discussed. The lecture further covers the logistic function and its applications in estimating probabilities, and introduces the concepts of likelihood and cross-entropy error measures in the context of logistic regression. Finally, iterative methods for optimizing the error function, such as gradient descent, are explained.

The lecture also covers a range of topics related to linear models and optimization algorithms in machine learning. The professor explains the compromise involved in choosing the learning rate (step size) in gradient descent, introduces the logistic regression algorithm, and discusses its error measure and learning algorithm. The challenges of termination in gradient descent and of multi-class classification are also addressed. Deriving and selecting features is emphasized as an art that depends on the application domain, and one that must be charged in terms of VC dimension. Overall, this lecture provides a comprehensive overview of linear models and optimization algorithms for machine learning.

  • 00:00:00 In this section, Yaser Abu-Mostafa reviews the bias-variance decomposition of the out-of-sample error and illustrates how it trades off with the size of the hypothesis set. He also reviews learning curves, which describe how the generalization error behaves, and how the number of examples, proportional to the VC dimension, determines the generalization properties. Techniques for linear models are then introduced.

  • 00:05:00 In this section of the lecture, the speaker briefly recaps the linear model in terms of linear classification and linear regression, which have been covered in previous lectures, and then moves to the third type of linear model - logistic regression. Before starting on logistic regression, the speaker ties up the loose ends in terms of nonlinear transforms and generalization issues. Nonlinear transforms offer a platform for applying learning algorithms in the Z space (feature space), with the final hypothesis still residing in the X space (input space). In the case of nonlinear transforms, the speaker emphasizes that the generalization issues were left out and that he will provide the missing piece in the lecture.

  • 00:10:00 In this section, the lecturer discusses the price one pays for nonlinear transforms in terms of generalization behavior. Using the linear model in the X space, you pay for a weight vector of d+1 free parameters; after a transform, the VC dimension charged is that of the feature space, which may be much larger than that of the X space. If the VC dimension is too large, then although it is possible to fit, say, a 17th-order polynomial, there is no real chance of generalization. Two cases are discussed: one that is almost linearly separable and one that is genuinely nonlinear. In the first case, getting E_in to zero requires going to a high-dimensional space, which is a bad bargain when only two points are misclassified.
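
Schematically, taking a 2nd-order transform of two inputs as a representative example:

    x = (1, x_1, x_2)   --Phi-->   z = (1, x_1, x_2, x_1*x_2, x_1^2, x_2^2)
    final hypothesis:  g(x) = sign( w~^T Phi(x) )

The weights w~ live in the Z space, so generalization is charged for the VC dimension there (at most d~+1 for a linear model in Z), which can be far larger than the d+1 of the original X space.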

  • 00:15:00 In this section of the lecture, the instructor discusses the approximation-generalization tradeoff for linear models. A more complex model, such as a fourth-order surface, can better approximate the data but may not generalize well. He mentions the idea of transforming to a nonlinear space but cautions against seeking a discount in the number of parameters: the VC dimension of the entire hypothesis space explored, even if only in your mind, must be charged for the warranty provided by the VC inequality to remain valid.

  • 00:20:00 In this section, the discussion is centered around the dangers of data snooping when choosing a model before looking at the data. It is emphasized that this practice can lead to a contaminated hypothesis set, meaning that the data is no longer trustworthy for reflecting real-world performance. The concept of logistic regression is introduced, along with its unique model, error measure, and learning algorithm. This linear model is considered to be a significant complement to the perceptron and linear regression models previously discussed, and provides a useful example of the complexities and variations that exist within machine learning.

  • 00:25:00 In this section, the lecturer discusses the linear model and the different ways it can be used: perceptrons, linear regression, and logistic regression. For linear classification, the hypothesis is a decision of +1 or -1 obtained by directly thresholding the signal. In linear regression, the output is the signal itself, while logistic regression applies a nonlinearity, the logistic function, to the signal, and the result is interpreted as the probability of something happening. The lecturer explains the shape of the logistic function and its applications in estimating probabilities for various problems, such as credit card applications.

  • 00:30:00 In this section, the concept of a soft threshold or sigmoid is introduced in the context of the logistic function. This function takes a linear signal as input and outputs a probability. It is particularly useful in predicting outcomes like the risk of a heart attack, where multiple factors contribute to the likelihood of an event occurring. The output of the logistic regression is treated as a genuine probability during the learning process, even though the input data does not directly provide that information.
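
The soft threshold itself is the logistic function applied to the linear signal s = w^T x:

    theta(s) = e^s / (1 + e^s) = 1 / (1 + e^(-s))

It maps any real signal into (0, 1), is interpreted as the probability of y = +1 given x, and satisfies theta(-s) = 1 - theta(s), which is what makes the error measure derived below come out so compact.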

  • 00:35:00 In this section, we discuss supervised learning on medical data and how to generate a model that approximates a hidden target function. The examples come with binary outputs generated according to a probability, making this a noisy case. The target f(x) maps the d-dimensional Euclidean space to [0,1] with a probability interpretation. The hypothesis g(x) is obtained by applying the logistic function to the dot product of the weights with x. The objective is to choose the weights so that the logistic regression hypothesis reflects the target function, using an error measure constructed from likelihood that is both plausible and friendly to the optimizer. The error measure grades different hypotheses according to the likelihood that they actually generated the data.

  • 00:40:00 In this section of the lecture, the speaker discusses the use of likelihood and the controversy around its application. He explains that the use of likelihood is to find the most plausible hypothesis given the data. However, it is not a completely clean process as likelihood is not the probability that is required. The speaker then introduces a formula for likelihood and explains how it can be used to derive a full-fledged error measure. The formula is then used to find the likelihood of an entire dataset, which is a product of the likelihoods of individual data points. He concludes that there will always be a compromise when choosing a hypothesis, as favoring one example may mess up the others.

  • 00:45:00 In this section of the lecture, the speaker explains how maximizing the likelihood of a hypothesis under a dataset can lead to minimizing the error measure. Taking the natural logarithm allows the maximization to become a minimization, which results in an error measure in the training set. After simplifying the formula, the speaker calls the error measure the in-sample error of logistic regression, and he defines it as the error measure between the hypothesis that depends on w, applied to x_n, and the value given as a label for that example, which is y_n. The speaker also discusses the interesting interpretation of the risk score, which identifies those at risk of heart attacks based on the sign of w transposed x_n.
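
Concretely, with labels y_n in {-1, +1} and the model P(y | x) = theta(y * w^T x), maximizing the likelihood of the dataset and taking the natural logarithm gives the lecture's in-sample error:

    E_in(w) = (1/N) * sum_{n=1..N} ln( 1 + e^(-y_n * w^T x_n) )

Each term is small when y_n agrees with the signal w^T x_n in sign and the signal is large in magnitude, and grows when they disagree.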

  • 00:50:00 In this section, the cross-entropy error measure is introduced as a way to measure the accuracy of binary predictions. The goal is to minimize this error measure in order to improve the model's predictions. However, unlike linear regression, there is no closed-form solution to minimize the error measure for logistic regression. Instead, an iterative solution is needed, which will be achieved through the gradient descent method. This method involves taking a step along the steepest slope of the surface and repeating until the minimum is reached. The convexity of the error measure for logistic regression makes gradient descent a good choice for optimization.

  • 00:55:00 In this section of the lecture, the professor discusses the iterative methods used to find the minimum of the error function in the linear model. These methods move along the surface in small steps, making local approximations using calculus, specifically Taylor series. He introduces gradient descent, where the next weight vector is the current one plus a move in a specific direction, found by solving for the unit vector in the direction of steepest descent. The direction chosen is the one that achieves the most negative value for the inner product between the gradient and a unit vector.

  • 01:00:00 In this section, the lecturer discusses the compromise in choosing the step size, or learning rate, in gradient descent. Very small steps will eventually reach the minimum but take forever, while bigger steps are faster but may invalidate the linear approximation. After analyzing the graphs, the best compromise is a large learning rate initially, to take advantage of steep slopes, becoming more careful closer to the minimum to avoid overshooting. The lecturer then presents the fixed-learning-rate formula, where the step taken is proportional to the size of the gradient. The logistic regression algorithm is then introduced: the gradient is computed from the in-sample error formula, and the next weight vector is obtained by subtracting the learning rate times the gradient from the current one. Finally, all three linear models, perceptron, linear regression, and logistic regression, are summarized in one slide and applied to the credit domain.
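
A minimal batch-gradient-descent sketch of the logistic regression algorithm just summarized; the fixed learning rate and all names are illustrative:

    import numpy as np

    def logistic_regression(X, y, eta=0.1, n_iters=1000):
        # X: (N, d) inputs, y: (N,) labels in {-1, +1}; returns weights (d+1,).
        Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x_0 = 1
        w = np.zeros(Xb.shape[1])
        for _ in range(n_iters):
            s = y * (Xb @ w)
            # gradient of E_in: -(1/N) * sum of y_n x_n / (1 + exp(y_n w^T x_n))
            grad = -(y[:, None] * Xb / (1.0 + np.exp(s))[:, None]).mean(axis=0)
            w -= eta * grad                         # fixed-rate step downhill
        return w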

  • 01:05:00 In this section, the professor discusses the different linear models that can be implemented in credit analysis and the corresponding error measures and learning algorithms. For example, the perceptron is used for binary classification, while logistic regression computes the probability of default. Different error measures go with each model, such as binary classification error for the perceptron and cross-entropy error for logistic regression, and the learning algorithm follows from the error measure chosen, such as the perceptron learning algorithm for classification error and gradient descent for cross-entropy error. Lastly, the professor briefly discusses termination criteria in gradient descent, noting that properly analyzing termination is tricky because of the many unknowns in the error surface.

  • 01:10:00 In this section, the speaker explains that gradient descent is an effective but not foolproof optimization algorithm. If the surface that the optimization algorithm is trying to navigate has multiple local minima, the algorithm might only find a local minimum instead of a global minimum that gives the best result. The speaker suggests using a combination of criteria to terminate the optimization algorithm and notes that the conjugate gradient is a valid alternative to gradient descent. The speaker suggests that if local minima become a real issue in an application, there are many approaches in the field of optimization to tackle this problem.

  • 01:15:00 In this section, the professor explains the concept of cross-entropy, which is a way of getting a relationship between two probability distributions using logarithmic and expected values. The professor also discusses the limitations of binary search and 2nd-order methods in optimization, emphasizing that while more sophisticated methods may lead to better results, they may be too expensive in terms of CPU cycles. Finally, in response to a question, the professor confirms that logistic regression can be applied to a multi-class setting, as demonstrated in the example of recognizing digits.

  • 01:20:00 In this section of the lecture, the professor discusses various methods for multi-class classification, including ordinal regression and tree-based binary decisions. The professor also introduces the use of the tanh function, which will be used as the neuronal function in neural networks. The concept of the learning rate is also discussed, with the professor mentioning that there are heuristics for adaptive learning rates that can be used, and a rule of thumb for choosing the learning rate is presented. Additionally, the distinction between meaningful features and features derived from looking at the specific data set is made, with the former being less likely to forfeit the VC warranty.

  • 01:25:00 In this section, the professor discusses the process of deriving features in machine learning and emphasizes that it is an art that depends on the application domain. While it is possible to derive features by looking at the data, the final hypothesis set will still determine the generalization behavior. The professor also notes that feature selection can be done automatically in machine learning, but it then becomes part of learning and is charged in terms of VC dimension. The topic will be addressed further in the future lecture on neural networks and hidden layers.

Lecture 09 - The Linear Model II
  • 2012.05.02
  • www.youtube.com
The Linear Model II - More about linear models. Logistic regression, maximum likelihood, and gradient descent. Lecture 9 of 18 of Caltech's Machine Learning ...
 

Caltech's Machine Learning Course - CS 156. Lecture 10 - Neural Networks 

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:19

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

Yaser Abu-Mostafa, professor at the California Institute of Technology, discusses logistic regression and neural networks in this lecture. Logistic regression is a linear model whose output is a bounded real value interpreted as a probability. Its error measure has no closed-form minimizer, so the method of gradient descent is introduced, a technique for minimizing any nonlinear function that is smooth enough and twice differentiable. Because the error measure is a convex function, it is relatively easy to optimize with gradient descent.

Stochastic gradient descent is an extension of gradient descent that is used in neural networks. Neural networks are a model that implements a hypothesis motivated by a biological viewpoint and related to perceptrons. The backpropagation algorithm is an efficient algorithm that goes with neural networks and makes the model particularly practical. The model's biological link got people excited, and the algorithm made it easy to implement. Although it is not the model of choice nowadays, neural networks were successful in practical applications and are still used as a standard in many industries, such as banking and credit approval.

Brief summary:

  • Logistic regression is a linear model whose output is a bounded real value interpreted as a probability;
  • Because the logistic regression error measure has no closed-form minimizer, the method of gradient descent is introduced to optimize it;
  • Stochastic gradient descent is an extension of gradient descent that is used in neural networks (a minimal sketch follows this list);
  • Neural networks are a model that implements a hypothesis motivated by a biological viewpoint and related to perceptrons;
  • The backpropagation algorithm is an efficient algorithm that goes with neural networks and makes the model particularly practical;
  • Although neural networks are not the model of choice nowadays, they are still used as a standard in many industries, such as banking and credit approval.
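
A minimal sketch of the stochastic variant mentioned above: the same logistic error as in Lecture 09, but updating on one randomly chosen example at a time (names are illustrative):

    import numpy as np

    def sgd_logistic(X, y, eta=0.1, n_epochs=100, seed=0):
        # X: (N, d+1) with x_0 = 1 already prepended, y: (N,) labels in {-1, +1}.
        rng = np.random.default_rng(seed)
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for n in rng.permutation(len(X)):       # visit examples in random order
                grad_n = -y[n] * X[n] / (1.0 + np.exp(y[n] * (X[n] @ w)))
                w -= eta * grad_n                   # one-example gradient step
        return w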

Lecture 10 - Neural Networks
  • 2012.05.06
  • www.youtube.com
Neural Networks - A biologically inspired model. The efficient backpropagation learning algorithm. Hidden layers. Lecture 10 of 18 of Caltech's Machine Learn...
 

Caltech's Machine Learning Course - CS 156. Lecture 11 - Overfitting

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:21

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

This lecture introduces the concept and importance of overfitting in machine learning. Overfitting occurs when a model is trained on noise instead of the signal, resulting in poor out-of-sample fit. The lecture includes various experiments to illustrate the effects of different parameters, such as noise level and target complexity, on overfitting. The lecturer stresses the importance of detecting overfitting early on and the use of regularization and validation techniques to prevent it. The impact of deterministic and stochastic noise on overfitting is also discussed, and the lecture concludes by introducing the next two lectures on avoiding overfitting through regularization and validation.

The concept of overfitting is discussed, and the importance of regularization in preventing it is emphasized. The professor highlights the trade-off between overfitting and underfitting and explains the VC dimension's role in overfitting, where the discrepancy in VC dimension given the same number of examples results in discrepancies in out-of-sample and in-sample error. The practical issue of validating a model and how it can impact overfitting and model selection is also covered. Furthermore, the professor emphasizes the role of piecewise linear functions in preventing overfitting and highlights the importance of considering the number of degrees of freedom in the model and restricting it through regularization.

  • 00:00:00 In this section, the lecturer introduces the topic of overfitting in machine learning and its importance, noting that the ability to deal with overfitting separates professionals from amateurs in the field. The main culprit for overfitting is identified as noise, and the lecturer introduces the concept of regularization and validation as techniques to deal with overfitting. The section serves as an introduction to a new topic that will be covered in the next three lectures.

  • 00:05:00 In this section, the lecturer explains the concept of overfitting by showing how it can occur when fitting a 4th-order polynomial to a 2nd-order target function with added noise. This results in zero training error and a poor out-of-sample fit, a classic case of overfitting, where the model went further than it needed to. The point is reinforced with overfitting in neural networks, where E_in keeps going down during training while E_out starts going back up. The lecturer also notes that overfitting is a comparative term, as there has to be another situation that is better, and overfitting can occur within the same model.

  • 00:10:00 In this section, Professor Abu-Mostafa discusses overfitting, which occurs when E_in is lowered, but E_out increases due to fitting the noise instead of the signal. He explains that effective VC dimension grows with time, but the generalization error gets worse and worse as the number of parameters increases. Overfitting can occur when two different models or instances within the same model are compared. One way to fix this is to detect overfitting by using the early stopping algorithm, based on validation, which acts as regularization to prevent overfitting. In order to avoid fitting the noise when overfitting occurs, it's important to detect it early on and stop rather than continuing to minimize E_in.

  • 00:15:00 In this section, the lecturer discusses how overfitting can occur due to the presence of noise in the data. A case study is presented with two different models - one with a noisy low-order target, and another with a noiseless high-order target. A 2nd-order polynomial and a 10th-order polynomial are used to fit the data. For the second-order fit, the in-sample error is 0.05, and the out-of-sample error is slightly higher. In contrast, the 10th-order fit presents a problem, with the in-sample error being smaller than that of the 2nd-order fit. However, the out-of-sample error dramatically increases, indicating a case of overfitting where the noise has been fitted into the model.
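
The flavor of this case study is easy to reproduce. A minimal sketch, where the arbitrary 10th-order target, the noise level, and N = 15 points are assumptions rather than the lecture's exact experiment:

    import numpy as np

    rng = np.random.default_rng(1)
    N, sigma = 15, 0.5
    target = rng.standard_normal(11)                # a 10th-order target's coefficients
    f = lambda x: np.polyval(target, x)

    x_train = rng.uniform(-1, 1, N)
    y_train = f(x_train) + sigma * rng.standard_normal(N)   # noisy labels
    x_test = np.linspace(-1, 1, 2000)

    for deg in (2, 10):
        g = np.polyfit(x_train, y_train, deg)
        e_in = np.mean((np.polyval(g, x_train) - y_train) ** 2)
        e_out = np.mean((np.polyval(g, x_test) - f(x_test)) ** 2)
        print(f"degree {deg:2d}: E_in = {e_in:.3f}, E_out = {e_out:.3f}")

The 10th-order fit typically drives E_in below the 2nd-order fit's while its E_out blows up, which is the signature of fitting the noise.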

  • 00:20:00 In this section, the lecturer discusses overfitting and how it can occur even in noiseless situations when the model is fitting another type of noise. He gives an example of fitting a 10th-order model to a 10th-order noisy target and how it resulted in overfitting. Then, he shows that by matching the complexity of the model to the data resources rather than the target complexity, it can result in better performance despite having a simpler model. The lecturer emphasizes that generalization issues depend on the size and quality of the dataset, and simply matching the complexity of the model to the target function is not always the best approach.

  • 00:25:00 In this section, the concept of overfitting in machine learning is explored. The lecture uses learning curves to demonstrate that, over a range of N, the in-sample error for a more complex model is smaller but the out-of-sample error is larger, defining the gray area where overfitting happens. The lecture also shows an experiment with two learners, one choosing a 10th-order fit and the other a 2nd-order fit, for a 50th-order target with no noise. Despite the absence of conventional noise, the complex learner still overfits, leading to the realization that something else must be acting as noise and to the need for caution in real-world machine learning problems. The lecture concludes that overfitting occurs in the majority of the cases examined, emphasizing the importance of understanding and addressing the issue.

  • 00:30:00 In this section, the lecturer discusses the parameters that affect overfitting, including the noise level, the target complexity, and the number of data points. To create interesting target functions of high complexity, the lecturer uses a standard set of Legendre polynomials, which are orthogonal to each other, with suitable coefficients. By normalizing the signal to an energy of 1, sigma squared can be read directly as the amount of noise. When generating instances of the experiment, the lecturer uses different combinations of noise level, target complexity, and number of data points to observe how persistent overfitting is.

  • 00:35:00 In this section, the lecturer discusses an overfitting measurement method that compares the out-of-sample errors of two different models: a 2nd-order polynomial and a 10th-order polynomial. The measure is the difference between the out-of-sample error for the complex model and the out-of-sample error for the simple model. If the complex model's out-of-sample error is larger, causing the measure to be positive, then there is overfitting. The lecturer then shows how the overfitting measure changes with varying levels of noise and target complexity. As the noise level increases and the target complexity increases, overfitting worsens. The lecturer also notes that overfitting is a significant issue and must be addressed.

  • 00:40:00 In this section, the concept of noise in overfitting is expanded beyond conventional noise and divided into stochastic noise and deterministic noise. It is noted that more data usually leads to less overfitting, and an increase in stochastic or deterministic noise leads to more overfitting. Deterministic noise is defined as the part of the target function that a hypothesis set cannot capture, and it is labeled as noise because a hypothesis set cannot deal with it. The concept of how something that cannot be captured is noise is further explored using a hypothetical scenario involving explaining complex numbers to a young sibling with a limited understanding of numbers.

  • 00:45:00 In this section of the lecture, the difference between deterministic and stochastic noise is explained, and the impact of deterministic noise on overfitting is analyzed. It is emphasized that deterministic noise depends on the hypothesis set used, and as the target complexity increases, the deterministic noise and overfitting increase as well. However, this doesn't occur until the target complexity surpasses a certain level. For finite N, the same issues with stochastic noise apply to deterministic noise in that you may capture some of it due to the limited sample size. It is also mentioned that using a more complex hypothesis set is not always better and may lead to overfitting.

  • 00:50:00 In this section, the lecturer discusses the issue of overfitting when given a finite sample. Once given a finite sample, one has the ability to fit the noise, both stochastic and deterministic, which leads to worse performance. The lecturer provides a quantitative analysis that adds noise to the target to gain insight into the roles of stochastic and deterministic noise. He adds and subtracts the centroid (the average hypothesis) and the noise term epsilon in preparation for getting squared terms and cross terms, which leads to a variance term, a bias term, and an added term that is just sigma squared, the variance of the noise.
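
With the noisy target y = f(x) + epsilon (noise variance sigma^2), the decomposition picks up one extra term relative to the noiseless case:

    E_{D,epsilon}[ ( g^(D)(x) - y )^2 ] = sigma^2 + bias(x) + var(x)

The sigma^2 term is the stochastic noise, irreducible by any model; the bias plays the role of the (squared) deterministic noise, the part of f that the hypothesis set cannot capture.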

  • 00:55:00 In this section of the lecture, the speaker discusses the decomposition of the expected error into bias, variance, and a noise term, and how these relate to deterministic and stochastic noise. The bias reflects the limits of the best approximation to the target that the hypothesis set can make, while sigma squared is the noise that no model can predict. Increasing the number of examples decreases the variance, but the bias and the noise term are inevitable for a given hypothesis set. On a finite set of data points, both deterministic and stochastic noise have realized versions, which make the fit more susceptible to overfitting through the variance. The speaker gives a lead into the next two lectures on avoiding overfitting via two approaches, regularization and validation: regularization is like putting on the brakes, while validation is checking the bottom line to make sure overfitting is avoided.

  • 01:00:00 In this section, the professor discusses putting the brakes on overfitting by using a restrained fit, that is, regularization. He uses the example of fitting points with a 4th-order polynomial but preventing it from fitting all the way by putting some friction in. The amount of brake applied is minimal yet results in a dramatic reduction in overfitting while still achieving a good fit. The professor notes that it is important to understand regularization and how to choose it in order to prevent overfitting. The Q&A session addresses the importance of randomization in stochastic gradient descent and how the out-of-sample error is drawn in neural network plots.

  • 01:05:00 In this section, the professor explains that deterministic and stochastic noise play the same role in a learning scenario: deterministic noise arises from the inability of the hypothesis set to get closer to the target function. In real-world learning problems, the complexity of the target function is generally unknown and the noise cannot be identified, so the point of understanding overfitting conceptually is to avoid it without knowing the particulars of the noise. Over-training is synonymous with overfitting, relative to the same model. Other sources of error, such as floating-point precision, have such a limited effect on overfitting that they are rarely worth mentioning. Regarding the third linear model (logistic regression), the professor clarifies that when it is applied to linearly separable data, a local minimum with zero in-sample error can be achieved.

  • 01:10:00 In this section, the professor discusses overfitting and its finite-sample version, which occurs because both stochastic and deterministic noise contribute a realized noise component in a finite sample, and the algorithm fits that noise, which is harmful when fitting larger models such as H_10. Discussing the use of piecewise linear functions to prevent overfitting, the professor highlights the importance of counting the degrees of freedom in the model and restricting the fit through regularization. Lastly, the professor covers the practical question of validating a model and how it impacts overfitting and model selection.

  • 01:15:00 In this section, the professor discusses the trade-off between overfitting and underfitting and explains that in order to arrive at a better hypothesis, you may need to deprive yourself of a resource that could have been used for training. The professor also elaborates on the VC (Vapnik-Chervonenkis) dimension and how it relates to overfitting, stating that the discrepancy in the VC dimension, given the same number of examples, is the reason for discrepancies in the out-of-sample and in-sample error. The professor also clarifies that even though they illustrated the target complexity in the color plots, target complexity is not explicitly measured, and there is no clear way to map it into the energy of deterministic noise. Finally, the professor discusses how the target complexity could translate into something in the bias-variance decomposition and has an impact on overfitting and generalization.

Lecture 11 - Overfitting
  • 2012.05.10
  • www.youtube.com
Overfitting - Fitting the data too well; fitting the noise. Deterministic noise versus stochastic noise. Lecture 11 of 18 of Caltech's Machine Learning Cours...
 
Caltech's Machine Learning Course - CS 156. Lecture 12 - Regularization

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:24

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

This lecture on regularization begins with an explanation of overfitting and its negative impact on the generalization of machine learning models. Two approaches to regularization are discussed: mathematical and heuristic. The lecture then delves into the impact of regularization on bias and variance in linear models, using the example of Legendre polynomials as expansion components. The relationship between C and lambda in regularization is also covered, with an introduction to the augmented error and its role in justifying regularization for generalization. Weight decay and weight growth techniques, and the importance of choosing the right regularizer to avoid overfitting, are also discussed. The lecture ends with choosing a good omega as a heuristic exercise and the hope that lambda will serve as a saving grace for regularization.

The second part discusses weight decay as a way of balancing the simplicity of the network against its functionality. The lecturer cautions against over-regularization and non-optimal performance, emphasizing the use of validation to determine the optimal regularization parameter for different levels of noise. Regularization is presented as experimental work with a basis in both theory and practice. Common types of regularization, such as L1/L2, early stopping, and dropout, are introduced, along with how to choose the appropriate regularization method for a given problem and the common hyperparameters involved in implementing it.

  • 00:00:00 In this section, Yaser Abu-Mostafa delves into the details of overfitting, which occurs when a model fits the data too well at the cost of poor generalization. Even if the data is not noisy, deterministic noise arises from the limitations of the model, leading to a pattern that harms the out-of-sample error and causes overfitting. Abu-Mostafa introduces regularization as the first cure for overfitting, a technique used in almost every machine learning application and important to understand.

  • 00:05:00 In this section, the lecturer discusses two approaches to regularization in machine learning. The first approach is mathematical, where smoothness constraints are imposed to solve ill-posed problems, but the assumptions made in these developments are not always realistic for practical applications. The second approach is heuristic and involves handicapping the minimization of in-sample error by putting the brakes on the fit, which helps fight overfitting. The lecturer gives an example using a sinusoid and a line fit, showing that by regularizing and controlling the offset and slope of the lines, we may be able to gain better performance out-of-sample.

  • 00:10:00 In this section, the lecturer discusses the impact of regularization on the bias and variance of a linear model. By using regularization, the variance is reduced while the bias is slightly increased due to the imperfect fit. The lecturer uses the example of a polynomial model with Legendre polynomials as expanding components to demonstrate the effect of regularization on bias and variance. With regularization, the linear model outperforms the unregularized model and even the constant model. The lecture delves into the mathematical development of one of the most famous regularization techniques in machine learning with a focus on concrete conclusions and lessons that can be learned to deal with real-world situations.

  • 00:15:00 In this section, the lecturer introduces the Legendre polynomials and explains how they can be used to construct a hypothesis set for polynomial regression. By using these polynomials, which are orthogonal and deal with different coordinates, the relevant parameter is a combination of weights, rather than just one individual weight. The hypothesis set can be parameterized and represented in a linear form, allowing for easy analytic solutions. The target function is unknown, and the goal is to get a good approximation for it using a finite training set. The lecturer also goes over the unconstrained and constrained solutions for minimizing in-sample error using linear regression.

  • 00:20:00 In this section, the lecturer discusses the concept of regularization as a constraint applied to the weights of the hypothesis set. Regularization here means setting a budget C for the total squared magnitude of the weights, so the weights cannot all be large. The problem is to minimize the in-sample error subject to this constraint, and the solution, obtained using Lagrange multipliers or KKT conditions, is called w_reg. The lecturer explains that the goal is to pick the point within the circle that minimizes the in-sample error, going as far out as the budget affords without violating the constraint.

  • 00:25:00 In this section, the concept of regularization is discussed, where the objective is to derive a model that generalizes well to unseen data. If the unconstrained linear regression solution already satisfies the constraint, it is the answer; otherwise the solution lies on the constraint boundary, and the task is to derive the analytic condition for the minimum of E_in subject to the constraint, a compromise between the objective and the constraint. At the solution, the gradient of the objective must be normal to the constraint surface, and the analytic condition for w_reg is that the gradient is proportional to the negative of the solution vector. Minimizing the corresponding augmented expression then yields the same minimum, unconditionally.

  • 00:30:00 In this section, the lecture discusses the relationship between the parameters C and lambda in regularization. The larger the value of C, the smaller the value of lambda, as there is less emphasis on the regularization term; conversely, as C decreases, the value of lambda needs to increase to enforce the condition. The lecture also introduces the augmented error, the sum of the in-sample error and the regularization term: minimizing it without constraints is equivalent to minimizing the in-sample error subject to the constraint. This correspondence justifies regularization in terms of generalization and applies to any regularizer. Finally, the lecture gives the formula for minimizing the augmented error and presents the solution.
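
The correspondence described here is:

    minimize E_in(w)  subject to  w^T w <= C
        <=>
    minimize E_aug(w) = E_in(w) + (lambda/N) * w^T w   (unconstrained)

with a smaller budget C corresponding to a larger lambda, and C going to infinity corresponding to lambda = 0 (no regularization).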

  • 00:35:00 In this section, the speaker discusses the solution to the problem of regularization. The solution is represented by w_reg, which is a modification of the pseudo-inverse solution with an additional regularization term. Under clean assumptions, we have one-step learning, including regularization. In other words, we can have a solution outright without doing a constrained optimization. The regularization term in the solution becomes dominant as lambda increases, which knocks w_reg down to zero, creating a smaller and smaller solution. The speaker then applies regularization to a familiar problem, showing that the choice of lambda is critical, and a heuristic choice for the type of regularizer will be necessary.
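
In code, the one-step regularized solution w_reg = (Z^T Z + lambda*I)^(-1) Z^T y is a minimal change to ordinary linear regression. A sketch, where Z stands for the feature matrix (e.g. Legendre features) and the names are illustrative:

    import numpy as np

    def weight_decay_fit(Z, y, lam):
        # Z: (N, d+1) features, y: (N,) targets, lam: regularization parameter.
        d1 = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d1), Z.T @ y)

Setting lam = 0 recovers the pseudo-inverse solution; increasing lam shrinks w_reg toward zero, exactly as described above.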

  • 00:40:00 In this section, the concept of regularization and its most famous instance, weight decay, are introduced. Weight decay involves adding w transposed w to the error, pushing the weights toward smaller values, hence the name "decay". When using neural networks, weight decay can be implemented through batch gradient descent, where the added term shrinks the weights before any movement in the weight space; when lambda is large, this limits how much one can learn about the function. Variations of weight decay include assigning importance factors to particular weights and experimenting with different constants in the regularizer.
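
The "shrink before you move" reading comes straight from the gradient of the augmented error; the batch update is:

    w(t+1) = w(t) * (1 - 2*eta*lambda/N) - eta * gradient of E_in at w(t)

so the weights are first multiplied by a factor slightly below one (the decay) and only then moved along the usual gradient direction.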

  • 00:45:00 In this section, the lecturer discusses weight decay and weight growth, constraints used in machine learning to shape the range of weights a model uses. Weight decay constrains models toward smaller weights, while weight growth pushes toward larger weights. An optimal lambda value must be chosen in either case to achieve the best out-of-sample performance. The lecturer then discusses how to choose the right regularizer, emphasizing practical guidelines that help avoid overfitting, such as moving toward smoother hypotheses because stochastic noise is high-frequency.

  • 00:50:00 In this section of the lecture, the instructor explains the different types of noise that can lead to overfitting and why it's important to choose a regularizer that tends to pick smoother hypotheses. He defines the general form of regularization and the augmented error that is minimized, which is similar to the equation used in VC analysis. He also discusses the correspondence between the complexity of an individual hypothesis and the complexity of the set of objects, and how E_aug is a better estimate for E_out than E_in.

  • 00:55:00 In this section of the lecture on regularization, the idea of the augmented error as a better proxy for the out-of-sample error is discussed. Regularization aims to reduce overfitting, which is essentially fitting the noise more than the signal. The guiding principle for choosing a regularizer is to move in the direction of smoother hypotheses: noise is not smooth, so smoother solutions tend to harm the fitting of noise more than the fitting of signal. The notion of simpler is also introduced for cases where smoother does not apply well. Choosing a good omega is a heuristic exercise, and the math involved is only as good as the assumptions on which it rests. The lecture ends with the hope that lambda will serve as a saving grace, since even an imperfect regularizer can be tuned through its lambda.

  • 01:00:00 In this section of the lecture, weight decay for neural networks is explored: small weights keep the function simple, while larger weights drive the units into their logical (saturated) regime, allowing any functionality to be implemented. Another form of regularizer is weight elimination, where some of the weights in the network are forced to be zero, resulting in a smaller VC dimension, better generalization, and a smaller chance of overfitting. Soft weight elimination applies a continuous function to the weights so as to emphasize some over others. Finally, early stopping is discussed as a regularizer that recommends stopping training before the end, since it indirectly keeps the function simple.

  • 01:05:00 In this section, the professor explains the case where regularization is done through the optimizer rather than by changing the objective function: the objective handed to the optimizer is still the in-sample error, and the regularization comes from how the minimization is carried out, as with early stopping. He cautions that putting the regularizer into the optimizer carelessly can lead to over-regularization and non-optimal performance, and emphasizes capturing as much as possible in the objective function and then using validation to determine the optimal value of the regularization parameter, lambda. The professor shows how the choice of lambda changes with different levels of noise, and how validation helps determine the best possible outcome given the noise. Finally, he discusses using different types of regularizers with different parameters, depending on performance.

  • 01:10:00 In this section, the professor discusses the use of regularizers in machine learning, which is an experimental activity rather than a completely principled activity. The machine learning approach is somewhere between theory and practice, meaning that it has a strong grounding in both. The professor uses Legendre polynomials as the orthogonal functions because they provide a level of generality that is interesting and the solution is simple. Regularization allows a user to find a sweet spot for the best performance, which could be between two discrete steps. The regularization term added does not explicitly depend on the data set. However, the optimal parameter, lambda, will depend on the training set, which will be determined by validation.

  • 01:15:00 In this section, the concept of regularization is introduced, which involves adding a penalty term to the loss function in order to avoid overfitting in machine learning models. The two most common types of regularization, L1 and L2, are discussed along with their respective advantages and disadvantages. Additionally, the use of early stopping and dropout as alternative regularization techniques is explained. The lecture concludes with an overview of how to determine the appropriate regularization method for a given problem, as well as common hyperparameters to consider when implementing regularization.
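
As a rough illustration of the L1/L2 difference (a sketch in our own notation, not lecture code): an L2 penalty shrinks all weights by a uniform factor, while the proximal step induced by an L1 penalty soft-thresholds them, driving small weights exactly to zero:

```python
import numpy as np

def l2_shrink(w, lam, eta):
    # gradient step on lam * ||w||^2: uniform multiplicative shrinkage
    return w * (1 - 2 * eta * lam)

def l1_soft_threshold(w, lam, eta):
    # proximal step on lam * ||w||_1: small weights become exactly zero
    return np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)

w = np.array([0.8, -0.05, 0.3, -0.01])
print(l2_shrink(w, lam=1.0, eta=0.05))          # every weight shrinks a little
print(l1_soft_threshold(w, lam=1.0, eta=0.05))  # tiny weights are zeroed out
```

This sparsifying behavior is why L1 is often associated with feature selection, while L2 corresponds to the weight decay discussed above.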

Lecture 12 - Regularization
  • 2012.05.14
  • www.youtube.com
Regularization - Putting the brakes on fitting the noise. Hard and soft constraints. Augmented error and weight decay. Lecture 12 of 18 of Caltech's Machine ...
 

Caltech's Machine Learning Course - CS 156. Lecture 13 - Validation

MetaQuotes, 2023.04.07 12:26

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

In lecture 13, the focus is on validation as an important technique in machine learning for model selection. The lecture goes into the specifics of validation, including why it's called validation and why it's important for model selection. Cross-validation is also discussed as a type of validation that allows for the use of all available examples for training and validation. The lecturer explains how to estimate the out-of-sample error using the random variable that takes an out-of-sample point and calculates the difference between the hypothesis and the target value. The lecture also discusses the bias introduced when using the estimate to choose a particular model, as it is no longer reliable since it was selected based on the validation set. The concept of cross-validation is introduced as a method for evaluating the out-of-sample error for different hypotheses.

He also covers the use of cross-validation together with validation for model selection and for preventing overfitting, with a focus on "leave one out" and 10-fold cross-validation. The professor demonstrates the importance of accounting for out-of-sample discrepancy and data snooping, and suggests randomizing methods to avoid sampling bias. He explains that although cross-validation adds complexity, combining it with regularization can select the best model, and that validation is unique among model-selection methods in requiring almost no assumptions. The professor further explains how cross-validation can help make principled choices even when comparing across different scenarios and models, and how the total number of validation points determines the error bar while the bias depends on how cross-validation is used.

  • 00:00:00 In this section, the focus is on validation, another important technique in machine learning that is used for model selection. The process involves choosing a validation set size and using it to validate the model selection process. The lecture goes into the specifics of validation, including why it's called validation and why it's important for model selection. The discussion also covers cross-validation, which is a type of validation that enables the use of all available examples for training and validation. The lecture also contrasts validation with regularization, in terms of what each one controls.

  • 00:05:00 In this section, the lecturer discusses validation and regularization in the context of the well-known equation that deals with the difference between the in-sample error and the out-of-sample error due to the model's complexity. Regularization estimates the penalty for overfit complexity while validation tries to estimate the out-of-sample error directly. The lecturer explains how to estimate the out-of-sample error using the random variable that takes an out-of-sample point and calculates the difference between the hypothesis and the target value. The lecturer emphasizes how variance affects the quality of the estimate, and proposes using a full set of points instead of one.

  • 00:10:00 In this section, the notion of a validation set and the validation error as an unbiased estimate of the out-of-sample error is introduced. The expected value of the validation error is E_out, which is another form of the expected value on a single point. The variance of the validation error is analyzed to show that there is an improvement in the estimate based on E_val compared to a single point. The variance ends up being proportional to 1/K, which means that increasing K can shrink the error bar and improve the reliability of the estimate. However, the number of validation points is not free and has a direct impact on the number of points available for training.
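
A quick simulation (a sketch with a made-up error distribution, just to illustrate the scaling) shows the 1/K behavior: averaging K single-point error estimates shrinks the variance of E_val by a factor of K.

```python
import numpy as np

rng = np.random.default_rng(0)
for K in [1, 10, 100, 1000]:
    # pretend each validation point contributes an error with mean
    # E_out = 0.2 and fixed variance; E_val averages K such errors
    e_val = rng.uniform(0.0, 0.4, size=(100_000, K)).mean(axis=1)
    print(K, round(e_val.mean(), 3), round(e_val.var(), 6))  # variance ~ 1/K
```

The mean stays at E_out (the estimate is unbiased), while the error bar shrinks like 1/sqrt(K).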

  • 00:15:00 In this section, the focus is on the process of validation, whereby K points are taken out of the N points for validation purposes, while the remaining subset D_train is used for training. It is important to have a reliable estimate from the validation set so that the final hypothesis can be trusted; however, having a reliable estimate of a bad quantity should not be the aim. As the value of K is increased, the estimate becomes more reliable, but the quality of the hypothesis decreases because it is trained on fewer points. Thus, it is vital to find a way of not paying the price that comes with increasing K. One way is to restore the data set after estimating the error and train on the full set to obtain a better final hypothesis.

  • 00:20:00 In this section, the focus is on the compromise in performance when using a validation set during training. The reduced set D_train has fewer examples than the full training set D, and training on it yields the hypothesis g minus. To get an estimate, we evaluate g minus on the validation set D_val, then put the validation examples back into the pot, retrain on all of D, and report the resulting g. However, a large K means a bigger difference between g minus and g, which affects how well the reported estimate describes g. Hence the rule of thumb to use one fifth of the data for validation, to get the best of both worlds. We call it validation because it affects the learning process and helps in making choices.
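
A minimal sketch of this procedure (with hypothetical `train` and `error` placeholders standing in for whatever model and error measure are used):

```python
import numpy as np

def validate_then_restore(X, y, train, error, seed=0):
    """train(X, y) -> hypothesis; error(h, X, y) -> scalar."""
    N = len(y)
    K = N // 5                              # rule of thumb: one fifth for validation
    idx = np.random.default_rng(seed).permutation(N)
    val, tr = idx[:K], idx[K:]
    g_minus = train(X[tr], y[tr])           # trained on N - K points
    e_val = error(g_minus, X[val], y[val])  # estimate of E_out for g_minus
    g = train(X, y)                         # restore the data, train on all N points
    return g, e_val                         # report g, with e_val as its estimate
```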

  • 00:25:00 In this section, the focus is on understanding the difference between the test error and the validation error. When the test set is unbiased and is used to estimate E_out, there will be fluctuations in the estimate, but no systematic direction. Once the estimate is used to make a choice, such as early stopping, the bias of the estimate changes. In a mini-learning scenario this is easy to see: for two independent error estimates uniform on [0, 1], each has expected value 0.5, but the expected value of their minimum is only 1/3, an optimistic bias. The same thing happens when a point is chosen for early stopping - the point chosen is the minimum on that realization, and an optimistic bias is introduced.

  • 00:30:00 In this section, the lecture discusses the use of the validation set for model selection in machine learning. The process involves training M models using a dataset split into training and validation sets, and then evaluating the performance of each model on the validation set to obtain estimates of out-of-sample error. The model with the smallest validation error is chosen, but there is a risk of bias introduced due to this selection process. Nevertheless, the bias is generally minor in practice and can be accepted to obtain a reliable estimate of the out-of-sample error.
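
In sketch form (again with hypothetical placeholders), model selection by validation looks like this: train each of the M models on the training portion, score each finalist on the validation set, pick the smallest validation error, and retrain the winner on all the data.

```python
import numpy as np

def select_model(models, X_tr, y_tr, X_val, y_val, error):
    """models: list of train functions, each mapping (X, y) -> hypothesis."""
    finalists = [train(X_tr, y_tr) for train in models]
    e_vals = [error(g, X_val, y_val) for g in finalists]
    m_star = int(np.argmin(e_vals))     # this selection introduces a small bias
    X_all = np.vstack([X_tr, X_val])
    y_all = np.concatenate([y_tr, y_val])
    return models[m_star](X_all, y_all), e_vals[m_star]
```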

  • 00:35:00 In this section, the lecturer discusses the bias introduced when the validation estimate is used to choose a particular model: the estimate is no longer reliable in the same way, since the model was selected because of its performance on the validation set. The expected value of the estimator becomes an optimistically biased estimate of the out-of-sample error. An experiment with two models generated curves that show this systematic bias. The curves behave like learning curves read backward: out-of-sample error goes down with more training examples, so it goes up as more points are diverted to validation. As the size of the validation set grows, the estimates become more reliable, and the curves for the two models converge.

  • 00:40:00 In this section, the lecture explains how to estimate the discrepancy, or bias, introduced by selecting the final hypothesis from the special hypothesis set consisting of the M finalists. The validation error acts as a training error for this hypothesis set, and with a little mathematics involving the VC dimension and effective complexity, an estimate of the out-of-sample error can be obtained. Although more validation examples improve the estimate, only a logarithmic penalty is paid when selecting from an increased number of hypotheses. When the choice is over a single continuous parameter, the effective complexity goes with a VC dimension of about 1, which is not difficult to handle. Therefore, with a validation set of suitable size, the estimate of the out-of-sample error will not differ too much from the actual value.

  • 00:45:00 In this section, the speaker discusses the idea of data contamination when using error estimates to make decisions, particularly in the context of validation. The training set is considered to be completely contaminated, while the test set is completely clean and gives an unbiased estimate. However, the validation set is slightly contaminated because it is used to make a few decisions, so it is important to not get carried away and move on to another validation set when necessary. The speaker then introduces cross-validation as a regime of validation that can get a better estimate with a smaller error bar, as long as it is not biased in the process.

  • 00:50:00 In this section, the professor introduces the concept of validation through cross-validation, specifically the "leave one out" method. In this method, the dataset is divided into two, with one point being used for validation and the rest used for training. The process is repeated for different points, resulting in multiple unbiased and imperfect estimations. Since all the estimations are based on training with N minus 1 data points, they have a common thread. Despite being imperfect, the repeated estimates give insight into the behavior of the model and help optimize it for the best out-of-sample performance.

  • 00:55:00 In this section, the concept of cross-validation is introduced as a method for evaluating the out-of-sample error for different hypotheses. By dividing the dataset into training and validation sets repeatedly, it is possible to estimate the performance of the model on unseen data. The "leave one out" method is used to illustrate the process. The effectiveness of cross-validation is discussed: it is remarkably efficient in that it trains on N minus 1 points each time, yet effectively uses all N points for validation, each point exactly once.
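
A minimal "leave one out" sketch (hypothetical `train` and `error` placeholders): each of the N folds trains on N - 1 points and validates on the single point left out, and the cross-validation error averages the N estimates.

```python
import numpy as np

def leave_one_out(X, y, train, error):
    N = len(y)
    errors = []
    for n in range(N):
        mask = np.arange(N) != n            # leave point n out
        g_minus = train(X[mask], y[mask])   # train on the other N - 1 points
        errors.append(error(g_minus, X[n:n+1], y[n:n+1]))
    return float(np.mean(errors))           # E_cv: the average of N estimates
```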
  • 01:00:00 In this section, the professor discusses the use of cross-validation for model selection. He demonstrates this by comparing the linear and constant models with three points, and shows how the constant model wins. He then applies cross-validation to the problem of finding a separating surface for handwritten digits using a 5th order nonlinear transform with 20 features. He uses cross-validation "leave one out" to compare 20 models and chooses where to stop adding features. He shows that the cross-validation error tracks closely with the out-of-sample error, and that using it as a criterion for model choice leads to minima at 6 features with improved performance compared to using the full model without validation.

  • 01:05:00 In this section, the professor discusses the use of validation for preventing overfitting, and how it is considered similar to regularization. He explains how "leave one out" validation is not practical for most real problems, and suggests using 10-fold cross-validation instead. He also provides guidance on the number of parameters to use based on the size of the data set, and clarifies why model choice by validation does not count as data snooping.

  • 01:10:00 In this section, the professor discusses the importance of accounting for out-of-sample discrepancy and data snooping when using the validation set to make model choices. He emphasizes the need to use randomizing methods such as flipping coins to avoid sampling bias and using cross-validation techniques to choose the regularization parameter in many practical cases. While cross-validation can add computational complexity, it can also be combined with regularization to select the best hypothesis for a model. The professor notes that although there are other methods for model selection, validation is unique in that it doesn't require assumptions.

  • 01:15:00 In this section, the professor discusses how validation can help make principled choices in selecting models, regardless of the nature of the choice, and how it can also be used to update the model in case of time evolution or tracking system evolution. When comparing validation and cross-validation, he explains that both methods have bias, but cross-validation allows for more examples to be used for both training and validation, resulting in a smaller error bar and less vulnerability to bias. While it may be possible to have data sets so large that cross-validation is not needed, the professor provides an example where even with 100 million points, cross-validation was still beneficial due to the nature of the data.

  • 01:20:00 In this section, the professor discusses scenarios where cross-validation is useful and addresses potential problems with it. He explains that cross-validation becomes relevant when the most relevant part of a large data set is smaller than the whole set. When deciding between competing models, statistical evidence is necessary to determine the significance of the out-of-sample error. The professor states that with a smaller data set, there is no definitive answer as to whether it is better to re-sample or break the set into chunks for cross-validation. The professor also discusses the role of balance between classes and how bias behaves when increasing the number of points left out. Finally, the professor explains that the total number of validation points determines the error bar, and bias is a function of how cross-validation is used.

  • 01:25:00 In this section, the professor discusses the error bar and how it can provide an indication of vulnerability to bias in an estimate. If two scenarios have comparable error bars, there is no reason to believe that one is more vulnerable to bias. However, a detailed analysis is needed to see the difference between taking one scenario at a time and considering correlations. The professor concludes that as long as a number of folds are done and every example appears in the cross-validation estimate exactly once, there is no preference between scenarios in terms of bias.

Lecture 13 - Validation
  • 2012.05.17
  • www.youtube.com
Validation - Taking a peek out of sample. Model selection and data contamination. Cross validation. Lecture 13 of 18 of Caltech's Machine Learning Course - C...
 

Caltech's Machine Learning Course - CS 156. Lecture 14 - Support Vector Machines

MetaQuotes, 2023.04.07 12:27

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

The lecture covers the importance of validation and its use in machine learning, as well as the advantages of cross-validation over validation. The focus of the lecture is on support vector machines (SVMs) as one of the most effective learning models for classification, and a detailed outline is presented: maximization of the margin, its formulation, and the analytic solution through constrained optimization. The lecture covers a range of technicalities, including how to calculate the distance between a point and a hyperplane in SVMs, how to solve the optimization problem for SVMs, and how to formulate the SVM optimization problem in its dual form. The lecturer also discusses the practical aspects of using quadratic programming to solve the optimization problem and the importance of identifying support vectors. The lecture concludes with a brief discussion of the use of nonlinear transformations in SVMs.

In the second part of this lecture on support vector machines (SVMs), the lecturer explains how the number of support vectors divided by the number of examples gives an upper bound on the probability of error in classifying an out-of-sample point, which makes support vectors with nonlinear transformations feasible. The professor also discusses the normalization of w transposed x plus b to 1 and why it is convenient for the optimization, as well as the soft-margin version of SVM, which allows for errors and penalizes them. In addition, the relationship between the number of support vectors and the VC dimension is explained, and the method's resistance to noise is mentioned, with the soft version of the method used in cases of noisy data.

  • 00:00:00 In this section, the lecturer discusses the importance of validation, particularly in terms of its use in machine learning. The concepts of the unbiased estimate and the optimistic bias of the validation error, and their effect on model selection, are also explained. The advantage of cross-validation over validation is further highlighted. The lecturer then introduces support vector machines as arguably the most effective learning model for classification, citing their intuitive interpretation, principled derivation, and ready-made optimization packages as significant advantages. A detailed outline of the section, which involves the maximization of the margin, its formulation, and the analytic solution through constrained optimization, is also presented.

  • 00:05:00 In this section, the concept of maximizing the margin in linear separation was explained. While all lines that separate linearly separable data have zero in-sample error, some may have better margins that allow for greater generalization. It is explained that a bigger margin is better because, in noisy situations, the likelihood that the new point will be classified correctly is higher. This is related to the growth function, and how a bigger growth function is disadvantageous for generalization in machine learning. It is shown that maximizing the margin can help with generalization by searching for lines that not only separate the data correctly but also have the maximum margin possible for those data points.

  • 00:10:00 In this section, the lecturer discusses fat margins and how they can improve the performance of a classifier. By requiring a classifier to have a margin of a certain size, the number of possible dichotomies is reduced, leading to a smaller growth function and a smaller VC dimension. The larger the margin, the better the out-of-sample performance of the classifier. The lecturer then explains how to solve for the biggest possible margin, by finding the distance between the hyperplane and the nearest data point and normalizing the vector w to simplify the analysis. The signal w transposed x plus b is not itself the Euclidean distance to the hyperplane: it only preserves the ordering of nearer and farther points, so it needs to be converted to obtain the Euclidean distance.

  • 00:15:00 In this section, the lecturer explains some technicalities relevant to the support vector machine analysis. Firstly, in order to compare the performance of different planes, the Euclidean distance is used as the yardstick. Secondly, the constant coordinate is extracted from the vector x to make the analysis more convenient, and w₀ is pulled out and renamed b so that it is not confused with the w vector, which now has a new role. The goal is to compute the distance between xₙ (the nearest point) and the plane. The lecturer shows that the vector w is orthogonal to the plane, being orthogonal to every vector lying on the plane, which is what lets us compute the distance between xₙ and the plane.

  • 00:20:00 In this section, the speaker discusses how to calculate the distance between a point and a hyperplane in SVMs. This can be done by projecting the vector going from the point to a generic point on the hyperplane onto the direction that is orthogonal to the hyperplane. The unit vector in this direction is computed by normalizing the length of the vector. By using some algebra, the speaker derives a formula for the distance that is simplified by adding a missing term. This formula can be used to choose the combination of w's that gives the best possible margin. The optimization problem that results from this is not very user-friendly because of the minimum in the constraints. However, by making some simple observations, this problem can be reformulated into a more friendly quadratic one.
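
In symbols, with the normalization |w transposed xₙ + b| = 1 for the nearest point, the derivation gives:

```latex
\operatorname{dist}(x_n, \text{plane})
  \;=\; \frac{\lvert w^{\mathsf T} x_n + b \rvert}{\lVert w \rVert}
  \;=\; \frac{1}{\lVert w \rVert}
```

so maximizing the margin 1/||w|| amounts to minimizing ||w||.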

  • 00:25:00 In this section, the lecturer explains how to solve the optimization problem for Support Vector Machines (SVMs). They begin by showing how SVMs can be formulated as a constrained optimization problem where they must minimize an objective function subject to linear inequality constraints. They prove that it is possible to use Lagrange multipliers to transform the inequality constraints into equality constraints and then solve the new Lagrangian. They note that this approach was independently discovered by Karush and Kuhn-Tucker and is referred to as KKT Lagrangian. The lecturer emphasizes that the process is similar to the procedure for regularization, and they recall the gradient condition for the solution.
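
The constrained optimization problem set up here is the SVM primal:

```latex
\min_{w,\,b}\;\; \tfrac{1}{2}\, w^{\mathsf T} w
\qquad \text{subject to} \qquad
y_n \left( w^{\mathsf T} x_n + b \right) \ge 1,
\quad n = 1, \dots, N
```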

  • 00:30:00 In this section, the lecturer explains the relationship between the SVM, regularization, and the Lagrange formulation. It is essential to note that with inequality constraints the gradient at the solution need not be zero, unlike the unconstrained problem. The Lagrange formulation depends on the original variables w and b and on new variables, the Lagrange multipliers, collected in the alpha vector. The problem is to minimize the objective function subject to the inequality constraints y_n(w transposed x_n plus b) greater than or equal to 1, which are folded into the Lagrangian. The interesting part is that we are actually maximizing with respect to alpha, and the alphas have to be non-negative, which requires attention. The section concludes with a brief explanation of the unconstrained part, where we set the gradient of the Lagrangian with respect to w and b to zero.

  • 00:35:00 In this section of the lecture, the speaker explains how to formulate the SVM optimization problem in its dual formulation. He first optimizes the problem with respect to w and b, resulting in two conditions that he substitutes back into the original Lagrangian, leading to the dual formulation of the problem, which is a nice formula in terms of the Lagrange multipliers alpha only. He then sets the constraint for the alphas to be non-negative and solves the maximization problem subject to these constraints, resulting in the optimal values of alpha that determine the support vectors.
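
Substituting those conditions back in gives the dual problem in terms of the alphas alone:

```latex
\max_{\alpha}\;\;
\sum_{n=1}^{N} \alpha_n
\;-\; \tfrac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N}
y_n\, y_m\, \alpha_n\, \alpha_m\, x_n^{\mathsf T} x_m
\qquad \text{subject to} \qquad
\alpha_n \ge 0,
\quad \sum_{n=1}^{N} \alpha_n y_n = 0
```

with the weights recovered as w = Σₙ αₙ yₙ xₙ.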

  • 00:40:00 In this section, the speaker discusses the practical aspects of using quadratic programming to solve the optimization problem presented earlier for support vector machines. The objective and constraints are translated into coefficients that are passed onto the quadratic programming package for minimization. The matrix dimension depends on the number of examples and this becomes a practical consideration for large datasets. The speaker warns that when the number of examples is big, quadratic programming has a hard time finding the solution and may require use of heuristics.
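
As a concrete sketch of the setup (assuming the third-party cvxopt package; the matrix bookkeeping, not the particular solver, is the point here), the dual can be passed to a generic QP solver like this:

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is installed

def svm_dual_qp(X, y):
    """Hard-margin SVM dual via quadratic programming (a sketch).
    X: (N, d) inputs; y: (N,) labels in {-1, +1}."""
    N = X.shape[0]
    Yx = y[:, None] * X
    P = matrix(Yx @ Yx.T)                       # quadratic coefficients y_n y_m x_n'x_m
    q = matrix(-np.ones(N))                     # maximize sum(alpha) = minimize -sum(alpha)
    G = matrix(-np.eye(N))                      # -alpha_n <= 0, i.e. alpha_n >= 0
    h = matrix(np.zeros(N))
    A = matrix(y.reshape(1, -1).astype(float))  # equality: sum alpha_n y_n = 0
    b = matrix(0.0)
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    w = (alpha * y) @ X                         # w = sum_n alpha_n y_n x_n
    sv = alpha > 1e-6                           # support vectors have positive alpha
    b0 = y[sv][0] - X[sv][0] @ w                # solve the margin condition for b
    return w, b0, alpha
```

For separable data this returns a maximum-margin separating plane; for large N, as the lecture warns, the N x N quadratic coefficient matrix becomes the bottleneck.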

  • 00:45:00 In this section, the lecture delves into the solution returned by quadratic programming, namely alpha, and how it relates to the original problem of determining the weights, the surface, the margin, and b. The lecture highlights the importance of identifying support vectors, which are the points that define the plane and the margin. The KKT condition on the Lagrange multipliers (the alphas here) gives the way to identify them: interior points get alpha equal to zero, and the points with strictly positive alpha are exactly the support vectors. These alpha values define the boundary between the two classes, and identifying their location is what determines the weights and creates the maximum margin.

  • 00:50:00 In this section, the concept of support vectors is introduced and discussed in the context of the support vector machine (SVM) algorithm. Support vectors are defined as the data points that are closest to the decision boundary or hyperplane that separates the classes of data. The SVM algorithm optimizes a quadratic programming problem to determine the support vectors and the parameters of the decision function. The values of the parameters depend only on the support vectors, which are the critical points, enabling the model to generalize well. Nonlinear transformations are also briefly discussed as a way to handle non-separable data. Transforming the data into a higher-dimensional space does not complicate the optimization problem, and the same technique can be used to find the support vectors and decision function.

  • 00:55:00 In this section of the video, the lecturer discusses the use of nonlinear transformations in SVMs. Nonlinear transformations are used when data is not linearly separable, which is the case in the X space. The lecturer demonstrates how to use a nonlinear transformation and work in the Z space to achieve a linearly separable result. He explains that the solution is easy, and the number of alphas depends on the number of data points, not the dimensionality of the space that you're working in. The key idea is that you can go to an enormous space without paying a price in terms of the optimization. The support vectors are identified in the Z space, but in the X space, they look like data points.
  • 01:00:00 In this section, the lecturer discusses the generalization result that makes using support vectors with a nonlinear transformation feasible. The number of support vectors, which plays the role of the number of effective parameters, divided by the number of examples gives an upper bound on the probability of error in classifying an out-of-sample point. More precisely, the bound holds in expectation over data sets: the expected E_out is at most the expected number of support vectors divided by N minus 1, a familiar type of bound (number of parameters, degrees of freedom, or VC dimension divided by the number of examples). This result is why people use support vectors, with or without the nonlinear transformation: you don't pay for the computation of going to a higher dimension, nor for the generalization that would normally go with it.

  • 01:05:00 In this section, the professor explains why he chooses to normalize w transposed x plus b to be 1, and why this normalization is necessary for optimization. He also answers a question about how SVM deals with non-linearly separable points through nonlinear transformations, and how the soft-margin version of SVM allows for errors and penalizes for them. Additionally, the professor briefly touches on the relationship between the number of support vectors and the VC dimension, and how the alphas represent the parameters in SVM.

  • 01:10:00 In this section, the lecturer discusses the relationship between the VC dimension and the number of non-zero parameters, which by definition equals the number of support vectors. The margin measure can vary depending on the norm used, but there is no compelling reason to prefer one norm over another in terms of performance. While there is no direct method for pruning support vectors, taking subsets and obtaining the support vectors of the support vectors is a possible computational shortcut. The SVM method is not particularly susceptible to noise, and in cases of noisy data the soft version of the method is used, which is remarkably similar to the non-noisy case.

Lecture 14 - Support Vector Machines
  • 2012.05.18
  • www.youtube.com
Support Vector Machines - One of the most successful learning algorithms; getting a complex model at the price of a simple one. Lecture 14 of 18 of Caltech's...
 

Caltech's Machine Learning Course - CS 156. Lecture 15 - Kernel Methods

MetaQuotes, 2023.04.07 12:29

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

This lecture on kernel methods introduces support vector machines (SVMs) as a linear model that is more performance-driven than traditional linear regression models because of the concept of maximizing the margin. If the data is not linearly separable, nonlinear transforms can be used to create wiggly surfaces that still enable complex hypotheses without paying a high price in complexity. The video explains kernel methods that go to high-dimensional Z space, explaining how to compute the inner product without computing the individual vectors. The video also outlines the different approaches to obtaining a valid kernel for classification problems and explains how to apply SVM to non-separable data. Finally, the video explains the concept of slack and quantifying the margin violation in SVM, introducing a variable xi to penalize margin violation and reviewing the Lagrangian formulation to solve for alpha.

The second part covers practical aspects of using support vector machines (SVMs) and kernel methods. He explains the concept of soft margin support vector machines and how they allow for some misclassification while maintaining a wide margin. He talks about the importance of the parameter C, which determines how much violation can occur, and suggests using cross-validation to determine its value. He also addresses concerns about the constant coordinate in transformed data and assures users that it plays the same role as the bias term. Additionally, he discusses the possibility of combining kernels to produce new kernels and suggests heuristic methods that can be used when quadratic programming fails in solving SVMs with too many data points.

  • 00:00:00 In this section of the lecture on Kernel Methods, Yaser Abu-Mostafa introduces the concept of support vector machines (SVMs), noting that in the simplest form an SVM is nothing but a linear model, made more performance-oriented by the idea of maximizing the margin. By using a quadratic programming package, we can solve the SVM problem and get the alphas back, which identify the support vectors. If the data is not linearly separable, we can use a nonlinear transform, and the resulting wiggly surface still gives us a complex hypothesis without paying a high price in complexity: we can bound the out-of-sample error using the number of support vectors, which is an in-sample quantity.

  • 00:05:00 In this section, the video explains the concept of kernel methods and their role in extending support vector machines beyond the linearly separable case. The idea behind kernel methods is to go to a high-dimensional Z space without paying the price for complexity. The video explains that the key to achieving this is to be able to compute the inner product in the Z space without actually computing the individual vectors in that space. This is where kernels come in, as they allow for the computation of inner products using only explicit inputs. The video goes on to explain the implications of these methods for dealing with nonlinear transformations and soft margins, and how they can be used in practice to handle complex problems.

  • 00:10:00 In this section, the lecture explains the use of the inner product in the Z space, and how it relates to kernel methods. The inner product is necessary to form the Lagrangian and pass on constraints to quadratic programming, but it can be computed using only inner products in order to perform support vector machinery. By using a generalized inner product or kernel that corresponds to a Z space, one can transform two points x and x dash into a function that is determined by x and x dash, which is called the kernel. An example is given of a two-dimensional Euclidean space using a 2nd-order polynomial transformation.

  • 00:15:00 In this section, the lecturer discusses the concept of kernel methods and how to compute kernels without ever transforming x and x dash. The lecturer improvises a kernel that never visits the Z space and convinces the audience that it nonetheless corresponds to a transformation to some Z space, with the kernel value being an inner product there. Raising 1 plus the inner product of x and x dash to the power Q gives such a valid kernel: it equals an inner product in the space of all monomials up to order Q. Furthermore, the computation costs essentially the same regardless of Q, whereas working with the explicit high-dimensional expansion would not.
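
A tiny numeric check of this claim (a sketch; the transform z below is one standard choice for two-dimensional inputs and Q = 2, not something spelled out coordinate by coordinate in the lecture):

```python
import numpy as np

def kernel(x, xd):
    # K(x, x') = (1 + x.x')^2, computed entirely in the X space
    return (1.0 + x @ xd) ** 2

def z(x):
    # an explicit Z-space transform whose inner product reproduces the kernel
    x1, x2 = x
    return np.array([1.0, x1 * x1, x2 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2])

x, xd = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(kernel(x, xd), z(x) @ z(xd))  # the two numbers agree
```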

  • 00:20:00 In this section, the lecturer explains how the polynomial kernel can be evaluated without actually expanding the polynomial: one computes the scalar 1 plus the inner product and simply raises it to the power Q, a cheap operation (doable, for instance, by exponentiating Q times its logarithm) that avoids a huge expansion. This is an easy polynomial that can be visualized in 2D and extrapolated to other cases. A kernel that maps to a higher-dimensional space is obtained whenever the kernel equals an inner product in that space. The lecturer then introduces an example of a kernel that has no apparent inner-product form in the X space but corresponds to an inner product in an infinite-dimensional Z space. Despite the seeming extravagance of going to an infinite-dimensional space, the kernel method remains useful, and the number of support vectors can still be used to gauge the generalization of the model.

  • 00:25:00 In this section, the lecturer demonstrates the radial-basis-function kernel, a sophisticated kernel that corresponds to an infinite-dimensional space, and shows how it works in action by taking a slightly non-separable case. The lecturer generates 100 points at random and shows that there is no line to separate them. Then, the lecturer transforms X into an infinite-dimensional space and computes the kernel, which is a simple exponential. The lecturer passes this on to quadratic programming, which gives back the support vectors. When the lecturer darkens the support vectors, it becomes easier to see the two classes.
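
The kernel demonstrated here is the Gaussian K(x, x') = exp(-gamma ||x - x'||^2); below is a short numpy sketch (our own illustration, not lecture code) of computing the full kernel matrix without loops:

```python
import numpy as np

def rbf_kernel(X1, X2, gamma=1.0):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2), via the expansion
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq = ((X1**2).sum(1)[:, None]
          + (X2**2).sum(1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-gamma * sq)
```

This matrix is exactly what replaces each inner product x_n transposed x_m in the dual problem when working in the infinite-dimensional Z space.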

  • 00:30:00 In this section, the speaker discusses the idea of kernel methods and how they can be used for classification. He presents an example of using a kernel on a dataset of points in order to transform them to an infinite-dimensional space where they can be separated by a linear plane. The resulting margin and support vectors are used to determine the in-sample quantity that guides the generalization property. The speaker then goes on to explain how a valid kernel corresponding to an inner product in some Z space can be used in formulating the problem and constructing the hypothesis. Overall, he emphasizes the usefulness of kernel methods and how they can be applied to solve classification problems.

  • 00:35:00 In this section, we learn how to translate the linear model into kernel form, where the support vector machine becomes a model parametrized by the choice of kernel: the kernel simply takes the place of every Z-space inner product, both in the dual problem and in the final hypothesis. We can also solve for b by plugging a support vector into the margin condition. Verifying a kernel's validity is the tricky part, since you cannot always exhibit the Z space explicitly. Nonetheless, we can compare approaches by looking at the functional forms of different kernels.
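
In kernel form the final hypothesis reads:

```latex
g(x) \;=\; \operatorname{sign}\!\Big( \sum_{\alpha_n > 0} \alpha_n\, y_n\, K(x_n, x) \;+\; b \Big)
```

where the sum runs over the support vectors, and b is fixed by requiring y_m (Σₙ αₙ yₙ K(xₙ, x_m) + b) = 1 for any support vector x_m.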

  • 00:40:00 In this section, the lecturer explains the conditions for obtaining a valid kernel. There are three approaches: construction, where the kernel is built from a conceptual or explicit set of transformations; Mercer's condition, which requires the kernel to be symmetric and the matrix of kernel values to be positive semi-definite for any choice of points; and improvisation, where one simply proposes a kernel and the concern becomes very practical: for the kernel to be valid, the same two requirements, symmetry and positive semi-definiteness of the kernel matrix for any choice of points, must be satisfied, as Mercer's condition demands.

  • 00:45:00 In this section, the lecturer describes situations where data is not linearly separable and how to apply the support vector machine algorithm in such cases. There are two scenarios of non-separable data: one where the non-separability is slight, and one where it is significant. For slightly non-separable data, one can accept a few errors and keep good generalization, rather than resorting to inordinately high-dimensional spaces that accommodate every data point. In the case of serious non-separability, one goes for a nonlinear transformation and uses kernels or soft-margin support vector machines. The lecturer then talks about the idea of margin violation and how to quantify it to account for classification errors.

  • 00:50:00 In this section, the lecturer introduces the concept of slack and quantifying the margin violation in SVM. He explains that he will introduce a slack for every point that measures the violation of margin, and will penalize the total violation made by adding up these slacks. He chooses this error measure, which is reasonable and measures the violation of the margin, instead of others. He then introduces the new optimization, which is minimizing the margin violation error term, along with maximizing the margin. The constant C gives the relative importance of this margin violation term versus the previous term that maximizes the margin. Depending on the value of C, the end result could be a linearly separable data or a compromise as it represents the tradeoff between margin and slack. Finally, he reviews the Lagrangian formulation with the addition of the new terms.
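
The new optimization being described is the soft-margin primal:

```latex
\min_{w,\,b,\,\xi}\;\;
\tfrac{1}{2}\, w^{\mathsf T} w \;+\; C \sum_{n=1}^{N} \xi_n
\qquad \text{subject to} \qquad
y_n \left( w^{\mathsf T} x_n + b \right) \ge 1 - \xi_n,
\quad \xi_n \ge 0
```

where each slack ξₙ measures how badly point n violates the margin and C sets the tradeoff.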

  • 00:55:00 In this section, the lecturer explains the new quadratic programming problem introduced by adding the variables xi to penalize margin violations. The Lagrangian includes new constraints on xi that are handled with additional Lagrange multipliers, beta. The lecturer then shows that the minimization with respect to w and b is unchanged, and that minimizing with respect to xi yields a condition that makes beta drop out of the Lagrangian entirely, leaving the same dual as before. The only ramification is that alpha is now not only greater than or equal to zero but also less than or equal to C.
  • 01:00:00 In this section of the video, the lecturer goes over the concept of soft margin support vector machines, which allow for some misclassification while still maintaining a wide margin. The solution involves an added constraint that requires alpha to be at most C, along with the already existing equality constraint. The soft margin support vector machines include both margin and non-margin support vectors, with the latter being the points that violate the margin, causing a slack that is represented by the value xi. The value of C is an important parameter that determines how much violation can occur, and this is usually determined through cross-validation.

  • 01:05:00 In this section, the lecturer discusses practical points on using support vector machines (SVMs) and kernel methods. He explains that if the data is not linearly separable, quadratic programming may not converge, leading to a situation where there is no feasible solution. However, he encourages users to be lazy and still pass alphas from quadratic programming back to the solution to evaluate whether or not it separates the data. Additionally, he addresses concerns about the constant coordinate, 1, that is transformed with the data, explaining that it effectively plays the same role as the bias term, b, and that users need not worry about having multiple coordinates with the same role.

  • 01:10:00 In this section, the professor explains that how support vector machines compare to a plain linear model depends on certain assumptions, and SVM can be better than linear in some cases. The dimension of the data may affect SVM's effectiveness, but the RBF kernel can deal with infinite dimensions because the higher-order terms decay fast. A valid kernel needs a well-defined inner product, which depends on convergence. The professor doesn't cover SVMs generalized to regression, as that requires more technical details, and SVM's major success is in classification. Lastly, quadratic programming packages may complain that the kernel matrix is not positive definite, yet the solutions returned are often still usable in practice.

  • 01:15:00 In this section, the professor discusses the possibility of combining kernels to produce new kernels and the requirement for the combination to maintain an inner product in a Z space. He also mentions that the quadratic programming problem is the bottleneck in solving problems with SVMs and gives an estimate of the number of points that can be handled by quadratic programming. Additionally, he suggests heuristic methods that can be used when quadratic programming fails in solving SVMs with too many data points.

Lecture 15 - Kernel Methods
  • 2012.05.24
  • www.youtube.com
Kernel Methods - Extending SVM to infinite-dimensional spaces using the kernel trick, and to non-separable data using soft margins. Lecture 15 of 18 of Calte...
 

Caltech's Machine Learning Course - CS 156. Lecture 16 - Radial Basis Functions

MetaQuotes, 2023.04.07 12:31

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa

In this lecture on radial basis functions, Professor Yaser Abu-Mostafa covers a range of topics, from SVMs to clustering, unsupervised learning, and function approximation using RBFs. The lecture discusses the parameter learning process for RBFs, the effect of gamma on the outcome of the Gaussian in RBF models, and using RBFs for classification. The concept of clustering is introduced for unsupervised learning, with Lloyd's algorithm and K-means clustering discussed in detail. He also describes a modification to RBFs where certain representative centers are chosen for the data to influence the neighborhood around them, with the K-means algorithm used to select these centers. The importance of selecting an appropriate value for the gamma parameter when implementing RBFs for function approximation is also discussed, along with the use of multiple gammas for different data sets and the relation of RBFs to regularization.

In the second part Yaser Abu-Mostafa discusses radial basis functions (RBF) and how they can be derived based on regularization. The professor introduces a smoothness constraint approach using derivatives to achieve a smooth function and presents the challenges of choosing the number of clusters and gamma when dealing with high-dimensional spaces. Additionally, the professor explains that using RBF assumes the target function is smooth and takes into account input noise in the data set. The limitations of clustering are also discussed, but it can be useful to obtain representative points for supervised learning. Finally, the professor mentions that in certain cases, RBFs can outperform support vector machines (SVMs) if the data is clustered in a particular way and the clusters have a common value.

  • 00:00:00 In this section, Abu-Mostafa introduces a way to generalize SVM by allowing errors, or violations of the margin, which adds another degree of freedom to the design: the parameter C sets the degree to which violations of the margin are tolerated. The good news is that the solution has almost the same form and is still obtained with quadratic programming. However, it is not obvious how to choose the best value for C, which is why cross-validation is used to find the C that minimizes the out-of-sample error estimate. SVM is a superb classification technique and the model of choice for many people, because it has very small overhead and a principled criterion that makes it better than choosing a random separating plane.

  • 00:05:00 In this section, the professor discusses the radial basis function model and its importance in understanding different facets of machine learning. The model is based on the idea that every point in the data set influences the value of the hypothesis at every point x through the distance between them, with closer points having a bigger influence. The standard form of the model is a sum of exponentials of the negative squared distance between x and each data point x_n, scaled by a positive parameter gamma and weighted by coefficients w_n that are to be determined. The model is called radial because the influence is symmetric around each data point center, and basis function because these exponentials are the building blocks of the model's functional form.
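
Written out, the standard form is:

```latex
h(x) \;=\; \sum_{n=1}^{N} w_n \, \exp\!\big( -\gamma\, \lVert x - x_n \rVert^{2} \big)
```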

  • 00:10:00 In this section of the video, the lecturer discusses the parameter learning process for radial basis functions. The goal is to find the parameters, labeled w_1 up to w_N, which minimize some sort of error based on the training data. Evaluating the hypothesis at the points x_n gives N equations in the N unknowns, and if the matrix phi is invertible, the solution is simply w equals the inverse of phi times y. With the Gaussian kernel, the interpolation between points is exact, and the effect of fixing the parameter gamma is then analyzed.
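
A minimal sketch of this exact-interpolation solve (our own numpy illustration):

```python
import numpy as np

def rbf_exact_interpolation(X, y, gamma=1.0):
    """Solve Phi w = y with Phi[n, m] = exp(-gamma ||x_n - x_m||^2)."""
    sq = ((X**2).sum(1)[:, None]
          + (X**2).sum(1)[None, :]
          - 2.0 * X @ X.T)
    Phi = np.exp(-gamma * sq)
    return np.linalg.solve(Phi, y)   # assumes Phi is invertible
```

With these weights, h(x_n) = y_n exactly on every training point, which is why regularization or fewer centers are needed to avoid overfitting.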

  • 00:15:00 In this section, the lecturer discusses the effect of gamma on the outcome of the Gaussian in RBF models. If gamma is small, the Gaussian is wide, and the interpolation between points is successful. However, if gamma is large, the influence of each point dies out quickly, resulting in poor interpolation between points. The lecturer also demonstrates how RBFs are used for classification: the signal is the hypothesis value, the output is its sign, and learning minimizes the mismatch with the plus-or-minus-one targets on the training data. Finally, the lecturer explains how radial basis functions relate to other models, including the simple nearest-neighbor method.

  • 00:20:00 In this section, the lecturer discusses implementing the nearest-neighbor method using radial basis functions (RBFs), by taking the influence of the nearest point. The nearest-neighbor method is brittle and abrupt, and it can be made less abrupt by moving to k nearest neighbors; using a Gaussian instead of a cylinder of influence smooths the surface further. The lecturer then modifies the exact-interpolation model to deal with the problem of having N parameters for N data points by introducing regularization, which addresses overfitting and underfitting; with weight decay, the result is the RBF analog of ridge regression.

  • 00:25:00 In this section, the lecturer describes a modification to radial basis functions, where certain important or representative centers are chosen for the data to influence the neighborhood around them. The number of centers is denoted as K, which is much smaller than the total number of data points, N, so that there are fewer parameters to consider. However, the challenge is in selecting the centers in a way that represents the data inputs without contaminating the training data. The lecturer explains the K-means clustering algorithm to select these centers, where the center for each group of nearby points is assigned as the mean of those points.

  • 00:30:00 In this section, the concept of clustering is introduced for unsupervised learning. The objective is to group similar data points together; each cluster has a center representative of the points within the cluster. The goal is to minimize the mean squared error of each point within its cluster. The challenge is that this problem is NP-hard, but by using Lloyd's algorithm, also known as K-means, a local minimum can be found iteratively. The algorithm minimizes the total mean squared error by fixing the clusters and optimizing the centers and then fixing the centers and optimizing the clusters iteratively.
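
A compact sketch of Lloyd's algorithm (our own numpy illustration of the two alternating steps just described):

```python
import numpy as np

def lloyds_algorithm(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # initial configuration
    for _ in range(iters):
        # step 1: fix centers, optimize clusters (assign each point to nearest center)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        # step 2: fix clusters, optimize centers (move each center to its cluster mean)
        new = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])
        if np.allclose(new, centers):
            break                                      # reached a local minimum
        centers = new
    return centers, labels
```

Each step can only decrease the total mean squared error, which is why the iteration converges, though only to a local minimum that depends on the initial configuration.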

  • 00:35:00 In this section on radial basis functions, the concept of Lloyd's algorithm for clustering is discussed. Lloyd's algorithm involves creating new clusters by taking every point and measuring its distance to the newly acquired mean. The closest mean is then determined to belong to that point's cluster. The algorithm continues back and forth, reducing the objective function until a local minimum is reached. The initial configuration of centers determines the local minimum, and trying different starting points can give different results. The algorithm is applied to a nonlinear target function, and its ability to create clusters based on similarity, rather than the target function, is demonstrated.

  • 00:40:00 In this section, the speaker discusses Lloyd's algorithm, which involves repeatedly clustering data points and updating the cluster centers until convergence. The algorithm feeds into radial basis functions, and while the data in this example has no natural clusters, the clustering produced still makes sense. However, having centers serve as centers of influence can cause issues, particularly since the choice is made by unsupervised learning. The speaker then contrasts this with the support vectors of the previous lecture: support vectors are representative of the separating surface, whereas the generic centers here are representative of the inputs themselves.

  • 00:45:00 In this section, the presenter discusses the process of choosing the important points: unsupervised for the centers, supervised for the weights. The centers are found using Lloyd's algorithm, which solves half the problem. The weights are then determined using the labels, with K weights and N equations; since K is less than N, something has to give. The presenter shows how to solve this with the matrix phi, which has N rows and K columns: some in-sample error is incurred, but the chances of generalization are good, since only K weights are being determined. The presenter then relates this structure to neural networks and emphasizes how familiar this layered configuration is.
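
In sketch form (our own numpy illustration), the weights come from the pseudo-inverse of the N x K design matrix:

```python
import numpy as np

def rbf_weights(X, y, centers, gamma=1.0):
    """X: (N, d) inputs; centers: (K, d) from Lloyd's algorithm, with K < N."""
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    Phi = np.exp(-gamma * sq)                    # N x K design matrix
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # pseudo-inverse solution
    return w
```

The overdetermined system Phi w = y no longer interpolates exactly; the least-squares fit trades a little in-sample error for far fewer parameters.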

  • 00:50:00 In this section, the speaker discusses the benefits of radial basis functions and how they compare to neural networks. The radial basis function network looks at local regions in space without worrying about faraway points, whereas in a neural network every unit interferes globally. The RBF network's nonlinearity is phi, while the neural network's corresponding nonlinearity is theta, both of which are combined with w's to get h. The radial basis function network is thus a two-layer network, and it is closely related to the SVM with an RBF kernel. Finally, the speaker highlights that the gamma parameter of the Gaussian can now be treated as a genuine parameter and learned.

  • 00:55:00 In this section, the lecturer discusses the importance of selecting an appropriate value for the gamma parameter when implementing radial basis functions (RBFs) for function approximation. If gamma is fixed, the pseudo-inverse method can be used to obtain the necessary parameters. However, if gamma is not fixed, gradient descent can be used. The lecturer explains an iterative approach called the Expectation-Maximization (EM) algorithm that can be used to converge quickly to the appropriate values of gamma and the necessary parameters for the RBF. Additionally, the lecturer discusses the use of multiple gammas for different data sets and the relation of RBFs to regularization. Finally, the lecturer compares RBFs to their kernel version and the use of support vectors for classification.
  • 01:00:00 In this section, the lecturer compares two different approaches that use the same kernel. The first approach is a straight RBF implementation with 9 centers: unsupervised learning of the centers followed by the pseudo-inverse, a linear-regression-style fit, for classification. The second approach is an SVM that maximizes the margin with the same kernel and passes the problem to quadratic programming. Even though the data has no natural clusters, the SVM performs better, with zero in-sample error and a boundary closer to the target. Finally, the lecturer discusses how RBFs can be derived entirely from regularization, with one term minimizing the in-sample error and the other being a regularization term that keeps the function from misbehaving away from the data points.

  • 01:05:00 In this section, the professor introduces a smoothness constraint approach which involves constraints on derivatives to ensure a smooth function. The smoothness is measured by the size of the k-th derivative which is parametrized analytically and squared, and then integrated from minus infinity to plus infinity. The contributions of different derivatives are combined with coefficients and multiplied by a regularization parameter. The resulting solution leads to radial basis functions which represent the smoothest interpolation. Additionally, the professor explains how SVM simulates a two-level neural network and discusses the challenge of choosing the number of centers in clustering.

  • 01:10:00 In this section, the professor discusses the difficulties that arise when choosing the number of clusters in RBF, and the choice of gamma, when dealing with high-dimensional spaces. The curse of dimensionality makes it difficult to expect good interpolation in high dimensions, with RBFs or any other method. The professor reviews various heuristics and affirms that cross-validation and similar techniques are useful for validation. He further explains how to choose gamma by treating all the parameters on an equal footing using general nonlinear optimization, and how to use the EM algorithm to get a local minimum for gamma while the w_k's are held fixed. Finally, the professor mentions that two-layer neural networks are sufficient to approximate anything, but there may be practical cases where more than two layers help.

  • 01:15:00 In this section, the professor explains that one of the underlying assumptions in using radial basis functions (RBFs) is that the target function is smooth, since the RBF formula comes from solving an approximation problem with a smoothness requirement. There is, however, another motivation for using RBFs: accounting for noise in the inputs. If the input noise is Gaussian, then requiring that the value of the hypothesis not change much under small perturbations of x, so that nothing is missed, again leads to Gaussian interpolation. A student asks how to choose gamma in the RBF formula, and the professor answers that the width of the Gaussian should be comparable to the distances between points, so that there is genuine interpolation; this gives an objective criterion for choosing gamma. When asked whether the number of clusters K is a measure of VC dimension, the professor says that the number of clusters affects the complexity of the hypothesis set, which in turn affects the VC dimension.

  • 01:20:00 In this section, the professor discusses the limitations of clustering and how it can be used as a half-cooked clustering method in unsupervised learning. He explains that clustering can be difficult as the inherent number of clusters is often unknown, and even if there is clustering, it may not be clear how many clusters there are. However, clustering can still be useful to obtain representative points for supervised learning to get the values right. The professor also mentions that in certain cases, RBFs can perform better than SVMs if the data is clustered in a particular way and the clusters have a common value.

Lecture 16 - Radial Basis Functions
  • 2012.05.29
  • www.youtube.com
Radial Basis Functions - An important learning model that connects several machine learning models and techniques. Lecture 16 of 18 of Caltech's Machine Lear...
 

Caltech's Machine Learning Course - CS 156. Lecture 17 - Three Learning Principles

Forum on trading, automated trading systems and testing trading strategies

Machine Learning and Neural Networks

MetaQuotes, 2023.04.07 12:34

Caltech's Machine Learning Course - CS 156 by Professor Yaser Abu-Mostafa




Caltech's Machine Learning Course - CS 156. Lecture 17 - Three Learning Principles

This lecture on Three Learning Principles covers Occam's razor, sampling bias, and data snooping in machine learning. The principle of Occam's razor is discussed in detail, along with the complexity of an object and a set of objects, which can be measured in different ways. The lecture explains how simpler models are often better, as they reduce complexity and improve out-of-sample performance. The concepts of falsifiability and non-falsifiability are also introduced. Sampling bias is another key concept discussed, along with methods to deal with it, such as matching distributions of input and test data. Data snooping is also covered, with examples of how it can affect the validity of a model, including through normalization and reusing the same data set for multiple models.

The second part covers the topic of data snooping and its dangers in machine learning, specifically in financial applications where overfitting due to data snooping can be especially risky. The professor suggests two remedies for data snooping: avoiding it or accounting for it. The lecture also touches on the importance of scaling and normalization of input data, as well as the principle of Occam's razor in machine learning. Additionally, the video discusses how to properly correct sampling bias in computer vision applications and concludes with a summary of all the topics covered.

  • 00:00:00 In this section, Professor Abu-Mostafa reviews the versatility of radial basis functions (RBF) in machine learning. He notes that RBFs serve as a building block for Gaussian clusters in unsupervised learning and as a soft version of nearest neighbor, affecting the input space gradually with a diminishing effect. They are also related to neural networks through the sigmoids in the activation function of the hidden layer. RBFs are applicable to support vector machines with an RBF kernel, except that the centers in the SVM happen to be the support vectors, located around the separating boundary, whereas the centers in RBF are all over the input space, representing the different clusters of the input. RBFs also arise from regularization, where a smoothness criterion expressed in terms of derivatives leads to Gaussians as the solution for interpolation and extrapolation.

  • 00:05:00 In this section, the lecturer introduces the three learning principles: Occam's razor, sampling bias, and data snooping. He starts with Occam's razor, which states that the simplest model that fits the data is the most plausible. He notes that the statement is neither precise nor self-evident and proceeds to tackle two key questions: what does it mean for a model to be simple, and how do we know that simpler is better in terms of performance? The lecture discusses these questions to make the principle concrete and practical for machine learning.

  • 00:10:00 In this section, the lecturer explains that complexity can be measured in two ways: the complexity of an object, such as a hypothesis, or the complexity of a set of objects, such as a hypothesis set or model. The complexity of an object can be measured by its minimum description length or the order of a polynomial, while the complexity of a set of objects can be measured by entropy or VC dimension. The lecturer argues that all these definitions of complexity are more or less talking about the same thing, despite being different conceptually.

  • 00:15:00 In this section, the lecturer explains that the complexity measures in the literature fall into two categories: the complexity of an object and the complexity of a set of objects, both of which ultimately come down to counting. The lecture gives examples of how complexity can be misleading when measured naively, including real-valued parameters and the SVM, which is not really complex because it is defined by only a few support vectors. The first of five puzzles presented in this lecture is then introduced: a football oracle who can seemingly predict game outcomes.

  • 00:20:00 In this section, the speaker tells the story of a person sending letters predicting the outcomes of football games. The person is not actually predicting anything: he sends different predictions to different groups of recipients and then targets only the recipients who happened to receive the correct calls (the counting is spelled out in the snippet below). From the viewpoint of any single recipient the record looks perfect, but once the complexity of the full scheme is taken into account, the predictions carry no evidence at all. The speaker uses this example to explain why simpler models in machine learning are often better: reducing the complexity of the model makes a good fit significant and improves out-of-sample performance, which is the concrete statement of Occam's razor.
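
The arithmetic of the scam is a simple halving argument; the function below is just an illustration of the count, not something from the lecture:

```python
def letters_needed(n_games: int) -> int:
    """To guarantee one recipient sees a perfect record over n binary
    predictions, start with 2**n recipients; each week keep only the half
    that received the correct call."""
    return 2 ** n_games

# 5 games: 32 initial recipients -> 16 -> 8 -> 4 -> 2 -> 1 perfect record
print(letters_needed(5))  # 32
```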

  • 00:25:00 In this section of the lecture, the professor explains the argument behind the principle that a fit by a simpler hypothesis counts for more than a fit by a complex one. The crux of the argument is that there are fewer simple hypotheses than complex ones, so it is less likely that a given simple hypothesis fits a dataset by chance; when a simpler hypothesis does fit, the fit is more significant and provides more evidence. The notion of falsifiability is also introduced: the data must have a chance of falsifying an assertion in order to provide evidence for it.

  • 00:30:00 In this section, the concepts of non-falsifiability and sampling bias are discussed as important principles in machine learning. The axiom of non-falsifiability says that if a model can fit any data set, for example a linear model applied to a data set too small to constrain it, then fitting the data provides no evidence about the target. The lecture also explains the importance of red flags: Occam's razor warns us that a complex model fitting the sample data well is a red flag rather than a guarantee of generalization. Sampling bias is then introduced through a puzzle about a phone poll that predicted Dewey would beat Truman in the 1948 presidential election; Truman won because the poll sampled telephone owners, a group that was not representative of the general population at the time.

  • 00:35:00 In this section, we learn about the sampling bias principle and its impact on learning outcomes. The principle states that if the data is sampled in a biased way, learning will produce a similarly biased outcome, since algorithms fit the data they receive. A practical example from finance shows how a trading algorithm that worked well on historical stock data failed in live trading because the training period did not cover the market conditions encountered later. One technique for dealing with sampling bias is to match the distributions of the training and test data; when the probability distributions are not known, resampling the training data or adjusting the weights assigned to the samples can achieve this (a histogram-based sketch is given below), although it may cost sample size and independence of the points.
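
For one-dimensional inputs, a crude way to weight training points so that their distribution mimics the test distribution is a histogram-based density ratio; this sketch is my own illustration of the matching idea, and real density-ratio estimation is considerably more delicate:

```python
import numpy as np

def importance_weights(x_train, x_test, bins=20):
    """Weight each training point by an estimate of p_test(x) / p_train(x),
    so the reweighted training sample mimics the test distribution."""
    edges = np.histogram_bin_edges(np.concatenate([x_train, x_test]), bins=bins)
    p_train, _ = np.histogram(x_train, bins=edges, density=True)
    p_test, _ = np.histogram(x_test, bins=edges, density=True)
    idx = np.clip(np.digitize(x_train, edges) - 1, 0, bins - 1)
    ratio = p_test[idx] / np.maximum(p_train[idx], 1e-12)
    return ratio / ratio.mean()   # normalize so the weights average to 1
```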

  • 00:40:00 In this section, the lecturer discusses the issue of sampling bias in machine learning and presents various scenarios in which it can occur. In one case, the lecturer explains how weighting data points can be used to match a dataset's distribution to that of a smaller set, resulting in improved performance. However, in cases such as presidential polls, where the dataset is not weighted and sampling bias occurs, there is no cure. Finally, the lecturer applies the concept of sampling bias to the credit approval process, explaining that using historical data of only the approved customers leaves out the rejected applicants, potentially affecting the accuracy of future approval decisions. However, this bias is less severe in this scenario as banks tend to be aggressive in providing credit, so the boundary is mainly represented by the already approved customers.

  • 00:45:00 In this section, the speaker discusses the principle of data snooping, which states that if a dataset has affected any step of the learning process, then the ability of that same dataset to assess the outcome has been compromised. Data snooping is the most common trap for practitioners, and it has many manifestations, which makes it easy to fall into. Looking at the data is one way to fall into the trap, because it lets the learner zoom in and narrow down the hypotheses, thereby affecting the learning process. The speaker goes on to give examples of data snooping, along with the discipline needed to avoid it and the accounting needed to compensate for it when it cannot be avoided.

  • 00:50:00 In this section, the speaker discusses how data snooping can compromise the validity of a model. Looking at the data set itself makes one vulnerable to designing a model around the idiosyncrasies of that particular realization. It is valid to use any other information about the target function and the input space, but not the realization of the data set that will be used for training and testing, unless the process is properly charged for it. To illustrate, the speaker presents a financial forecasting puzzle: predicting the exchange rate between the US dollar and the British pound from a data set of 2,000 points, split into a training set of 1,500 points and a test set of 500 points. The model must be trained solely on the training set and evaluated on the test set to avoid data snooping.

  • 00:55:00 In this section, the video discusses how snooping can occur through normalization, which can contaminate the test set and lead to deceptively good results. Normalization should be done only with parameters obtained exclusively from the training set, so that the test set is never observed before evaluation (a snooping-free version is sketched below). The video also touches on the danger of reusing the same data set across multiple models: by torturing the data long enough, it may start to confess, but the results cannot be trusted without testing on a fresh data set.
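
A snooping-free split-then-normalize step might look like this (a minimal sketch; the 1,500/500 split mirrors the exchange-rate puzzle above, and the epsilon guard is my own addition):

```python
import numpy as np

def split_then_normalize(X, n_train=1500):
    """Compute normalization statistics on the training set only, then
    apply them unchanged to the test set. Normalizing all of X before
    splitting would leak test information into training (snooping)."""
    X_train, X_test = X[:n_train], X[n_train:]
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12   # guard against zero variance
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```
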
  • 01:00:00 In this section, the speaker discusses the danger of data snooping and how it can lead to overfitting. Data snooping is not just about directly looking at the data, but it can also occur when using prior knowledge from sources that have used the same data. Once we start making decisions based on this prior knowledge, we are already contaminating our model with the data. The speaker suggests two remedies for data snooping: avoiding it or accounting for it. While avoiding it requires discipline and can be difficult, accounting for it enables us to understand the impact of prior knowledge on the final model. In financial applications, overfitting due to data snooping is especially risky because the noise in the data can be used to fit a model that looks good in-sample but does not generalize out-of-sample.

  • 01:05:00 In this section, the professor discusses how data snooping can produce misleading results when testing a trading strategy. Testing a "buy and hold" strategy on 50 years of data for the S&P 500 shows a fantastic profit, but the analysis included only stocks that are currently traded, a survivorship bias: choosing the stocks with hindsight is itself a form of snooping, and it invalidates the conclusion. The professor also addresses a question about the importance of scaling and normalization of input data, stating that while it is important, it was not covered due to time constraints. Finally, the professor explains how to compare different models without falling into the trap of data snooping.

  • 01:10:00 In this section, the video discusses how data snooping makes a practitioner more optimistic than the evidence warrants. Snooping occurs when the data is used to reject certain models and steer toward others without being accounted for. Accounting for it means considering the effective VC dimension of the entire process, including every model that was examined, which in turn requires a much larger data set to ensure generalization. The lecture also touches on how to get around sampling bias, and emphasizes the importance of Occam's razor in statistics, with the professor noting that there are scenarios in which it can be violated.

  • 01:15:00 In this section, the professor discusses the principle of Occam's razor in relation to machine learning, where simpler models tend to perform better. The discussion then transitions to the idea of correcting sampling bias in applications of computer vision. The method is the same as discussed earlier, where data points are given different weights or resampled to replicate the test distribution. The approach may be modified depending on the domain-specific features extracted. The lecture concludes with a summary of the discussion.

Lecture 17 - Three Learning Principles
  • 2012.05.31
  • www.youtube.com
Three Learning Principles - Major pitfalls for machine learning practitioners; Occam's razor, sampling bias, and data snooping. Lecture 17 of 18 of Caltech's...