Machine Learning and Neural Networks

 

Lecture 06 - Theory of Generalization




Caltech's Machine Learning Course - CS 156. Lecture 06 - Theory of Generalization

The lecture develops the theory of generalization around the growth function, defined as the number of dichotomies that a hypothesis set can generate on a set of N points; the goal is to characterize the entire growth function, for every N, by characterizing the break point. The speaker computes the growth function for several hypothesis sets and proves an upper bound on it using a combinatorial identity. The discussion also covers substituting the growth function into the Hoeffding inequality, using the VC bound to characterize overlaps between hypotheses, and the resulting Vapnik-Chervonenkis inequality, whose bound is polynomial in N with the order of the polynomial determined by the break point.

The professor discusses the theory of generalization, clarifying earlier points and explaining the concept of a break point, which can be used to estimate the resources needed for learning. The focus of learning is not on the value of E_in itself but on how well E_in approximates E_out, allowing the learner to work with familiar quantities. The professor also explains the reasoning behind replacing M with the growth function and how this relates to the combinatorial quantity B(N, k). On regression functions, the professor points to the bias-variance tradeoff, and he stresses that learnability is independent of the target function. Finally, the professor notes that the same principles apply to all types of functions.

  • 00:00:00 In this section, we learn about dichotomies as mini-hypotheses that are restricted to a finite set of points, and about the growth function, which counts the number of dichotomies that a hypothesis set can generate on a set of N points. The break point is the number of points at which the hypothesis set first fails to generate all possible patterns; for perceptrons, patterns begin to be missed because the hypotheses come from a restricted set. The theoretical goal is to characterize the entire growth function, and to generalize to every N, by characterizing the break point. We also see that a restriction on the number of patterns on a few points results in the loss of many patterns for larger numbers of points, independently of the hypothesis set and input space.

  • 00:05:00 In this section, the lecturer covers two items: first, showing that the growth function is polynomial when there is a break point, and second, justifying the replacement of M, the number of hypotheses, in Hoeffding's inequality. The lecturer emphasizes that the particulars of the growth function are not needed; it suffices to show that it is bounded by a polynomial so that it can be used in the Hoeffding inequality. The lecturer introduces a key combinatorial quantity, B(N, k), the maximum number of dichotomies on N points with a break point k. A bound on B(N, k) is found recursively, by laying the dichotomies out as a table on N points and isolating the last point to set up a recursion.

  • 00:10:00 In this section, the speaker groups the rows of this table according to how they extend on the last point. The first group, S_1, consists of rows whose pattern on the first N-1 points appears with only one extension; the second group, S_2, consists of rows whose pattern appears with both extensions. The speaker defines alpha as the number of rows in S_1 and beta as the number of rows in each half of S_2 (the rows of S_2 come in pairs, one for each extension). With these definitions, the speaker can set up a recursion for the maximum number of rows/patterns obtainable on N points such that no k columns contain all possible patterns.

  • 00:15:00 In this section of the lecture, the speaker discusses the theory of generalization and how to bound beta. By analyzing the S_2 block, whose pattern part appears in repeated pairs, he argues that this block of patterns must have a break point of k minus 1, not k. He also explains that by taking alpha plus beta, the total number of distinct patterns in the mini-matrix on the first N-1 points, he can say something about a break point for that smaller matrix as well. He ends by stating that putting these pieces together bounds the full matrix and its number of rows.

  • 00:20:00 In this section, the speaker analyzes the matrix and derives a recursion giving an upper bound on B(N, k), the maximum number of dichotomies on N points under the constraint that no k points exhibit all possible patterns; B(N, k) therefore upper-bounds the growth function of any hypothesis set with break point k. By computing values with the recursion, the speaker fills a table of upper bounds on B(N, k): the boundary conditions are filled in first, and the rest of the table follows from the recursion.
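
A minimal sketch of that table fill in Python (my own formulation): it treats the recursive bound B(N, k) ≤ B(N−1, k) + B(N−1, k−1), together with the boundary conditions B(N, 1) = 1 and B(1, k) = 2 for k > 1, as an equality, which is exactly what the tabulated upper bound does.

```python
def B_bound(N, k):
    """Upper bound on B(N, k): the maximum number of dichotomies on N
    points such that no k points exhibit all 2**k possible patterns."""
    if k == 1:
        return 1      # no single point may show both patterns
    if N == 1:
        return 2      # one point, and the constraint is vacuous for k > 1
    # recursion from isolating the last point
    return B_bound(N - 1, k) + B_bound(N - 1, k - 1)

for N in range(1, 7):
    print([B_bound(N, k) for k in range(1, 5)])
```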

  • 00:25:00 In this section, the speaker discusses the theory of generalization and a table representing the maximum number of dichotomies or patterns given a number of points, N, and a break point, k. The speaker explains how the table is filled, including the cells where the constraint is vacuous. Additionally, they present a formula for this maximum number of dichotomies, which serves as an upper bound for the growth function of any hypothesis set that has break point k, without asking any questions whatsoever about the hypothesis set or the input space.

  • 00:30:00 In this section, the lecturer walks through the induction step of the proof of the formula for B(N, k). The step involves assuming that the formula holds at N-1, for both k and k-1, and then proving that it also holds at N and k. The lecturer demonstrates the manipulation of the two summations, merging them and reducing them to a single quantity using algebra or a combinatorial argument. This establishes that the formula holds for all values of N and k, and the theorem is proven.

  • 00:35:00 In this section, the speaker completes the proof of the upper bound on B(N, k), and hence on the growth function of any hypothesis set with break point k, using a combinatorial identity. The resulting bound is useful because the break point is a fixed number that does not grow with N. The speaker shows that the bound is polynomial in N: the maximum power is N to the k minus 1, where k is a constant. Finally, the speaker applies the upper bound to three example hypothesis sets and shows that all of them satisfy it.
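
In symbols, the bound proved in this section is

```latex
\[
m_{\mathcal{H}}(N)\;\le\;B(N,k)\;\le\;\sum_{i=0}^{k-1}\binom{N}{i},
\]
```

which is a polynomial in N of degree k minus 1, since the break point k is fixed.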

  • 00:40:00 In this section, the lecturer computes the growth-function bounds for positive rays and positive intervals. Using the break point, which is the only input required, he bounds the growth function without considering the geometry of the hypothesis set. He then applies this to the two-dimensional perceptron, where the growth function is unknown but the break point is known to be 4; the break point alone bounds the growth function completely, which greatly simplifies the characterization of hypothesis sets. The lecturer then explains that this growth function will be used in the Hoeffding inequality to replace the number of hypotheses, since the union bound is next to useless when M is large or infinite.
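
Concretely, the three examples line up with the bound as follows; the first two meet it with equality, while for the 2D perceptron only the bound is known:

```latex
\begin{aligned}
\text{positive rays } (k=2):\quad & m_{\mathcal{H}}(N)=N+1=\binom{N}{0}+\binom{N}{1}\\[2pt]
\text{positive intervals } (k=3):\quad & m_{\mathcal{H}}(N)=\tfrac12 N^{2}+\tfrac12 N+1=\binom{N}{0}+\binom{N}{1}+\binom{N}{2}\\[2pt]
\text{2D perceptron } (k=4):\quad & m_{\mathcal{H}}(N)\le\binom{N}{0}+\binom{N}{1}+\binom{N}{2}+\binom{N}{3}
\end{aligned}
```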

  • 00:45:00 In this section, the lecturer gives the pictorial proof sketch behind replacing M with the growth function. The canvas represents the space of possible data sets, and the colored area represents the bad region: the data sets for which E_in deviates from E_out. Painting this bad region red, the Hoeffding inequality guarantees that the colored area is small for a single hypothesis, so the union bound can accommodate a few hypotheses. However, as more hypotheses are added, the colored area fills up the canvas, which is the problem with the union bound. The lecturer then explains the two things that need to be established: that the growth function characterizes the overlaps between the bad regions, and that E_out can be made to conform to a finite-sample argument.

  • 00:50:00 In this section, the lecturer introduces the VC bound with a new canvas that accounts for overlaps between hypotheses. He explains that the growth function is an abstract quantity that characterizes these overlaps, telling you how many groups of dichotomies behave the same way; this redundancy is what the growth function captures. A difficulty is that whether a point is colored depends not only on the sample but also on the entire input space, through E_out. The lecturer overcomes this by picking two samples instead of one, independently generated from the same distribution, and tracking them against each other rather than tracking E_in against an E_out that depends on the whole space.

  • 00:55:00 In this section, the speaker discusses whether E_in and E_in', the in-sample errors on two independent samples, track each other. With multiple bins, the tie between E_out and E_in becomes looser and looser as the number of bins increases, and the mathematical ramifications of multiple hypotheses play out here just as they did for one bin. As the speaker goes through the technicalities of the proof, epsilon becomes epsilon over 2 and then epsilon over 4; when squared, this gives epsilon squared over 16, resulting in the factor of 1/8 in the exponent. The result is the Vapnik-Chervonenkis inequality, whose bound is polynomial in N with the order of the polynomial decided by the break point.
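
The final statement, with the factor 1/8 and the 2N argument coming from the two-sample trick and the epsilon bookkeeping just described, is

```latex
\[
\mathbb{P}\left[\;\left|E_{\text{in}}(g)-E_{\text{out}}(g)\right|>\epsilon\;\right]
\;\le\;4\,m_{\mathcal{H}}(2N)\,e^{-\frac{1}{8}\epsilon^{2}N}.
\]
```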

  • 01:00:00 In this section of the video lecture, the moderator asks the professor to clarify points from earlier slides. The professor explains that the N points chosen in slide 5 correspond to a particular set of points in an input space in machine learning, but in the abstraction they are simply abstract labels. The professor also clarifies that the use of alpha and beta in the lecture is merely a naming convention; there is no assertion about the relative values of the two. Finally, the professor explains that the break point is found by visiting the input space and the hypothesis set and determining, for a given hypothesis set, the smallest number of points that cannot be separated in every possible way.

  • 01:05:00 In this section, the professor explains that for most learning models, exact break points, or bounds on them, have already been established, so the resources needed to learn can be estimated before starting the learning process. Although there may be cases where the bounds are not tight, in most cases the discrepancy between the exact growth function and the polynomial bound is of little consequence. The lecture emphasizes that the focus of learning is not on the actual value of E_in, but on how well it approximates E_out, enabling the learner to work with familiar quantities. Finally, the professor assures the audience that the VC dimension, a building block for understanding the theory of learning, will be covered in detail in the next lecture.

  • 01:10:00 In this section, the professor explains the reasoning behind replacing M with the growth function and the modifications needed to make the statement technically correct. The professor also clarifies the definition of B(N, k), detailing how it is an upper bound for any hypothesis set with a break point, and how it is a purely combinatorial quantity. The professor then addresses a question about the proof, stating that the break point k does not change when restricting from N points to N-1 points, since no k columns of the smaller set can have all possible patterns (otherwise the same k columns would violate the constraint in the full set). Finally, the professor notes that the VC analysis applies to binary functions, though it can be extended to real-valued functions.

  • 01:15:00 In this section, the professor explains that instead of going into technical extensions of learnability, he would rather use a different approach, the bias-variance tradeoff, when discussing regression functions. He also clarifies that learnability is proven under conditions on the hypothesis set, and that it is independent of the target function. He goes on to explain that the generalization question does not depend on the target function, whereas the question of whether E_in can be made small enough to make the user happy does depend on the target function. Finally, the professor states that the same principles apply regardless of the type of function.
Lecture 06 - Theory of Generalization
  • 2012.04.21
  • www.youtube.com
Theory of Generalization - How an infinite model can learn from a finite sample. The most important theoretical result in machine learning. Lecture 6 of 18 o...
 

Lecture 07 - The VC Dimension




Caltech's Machine Learning Course - CS 156. Lecture 07 - The VC Dimension

The lecture introduces the concept of VC dimension, which is the maximum number of points that can be shattered by a hypothesis set, and explains its practical applications. The VC dimension represents the degrees of freedom of a model, and its relationship to the number of parameters in a model is discussed. Examples are given to demonstrate how to compute the VC dimension for different hypothesis sets. The relationship between the number of examples needed and the VC dimension is explored, and it is noted that there is a proportional relationship between the two. The implications of increasing the VC dimension on the performance of a learning algorithm are also discussed. Overall, the lecture provides insights into the VC theory and its practical implications for machine learning.

Also the video covers the concept of generalization and the generalization bound, which is a positive statement that shows the tradeoff between hypothesis set size and good generalization in machine learning. The professor explains the VC dimension, which is the largest value before the first break point, and how it can be used to approximate the number of examples needed. He notes the importance of choosing the correct error measure and clarifies that the VC dimension estimate is a loose estimate that can be used to compare models and approximate the number of examples needed. The lecture ends by highlighting the commonalities between this material and the topic of design of experiments and how the principles of learning extend to other situations beyond strict learning scenarios.

  • 00:00:00 In this section, the lecturer summarizes the previous lecture's main result in learning theory, which is the VC (Vapnik-Chervonenkis) inequality, which characterizes generalization in machine learning. The growth function, which characterizes the redundancy needed to switch from the Hoeffding inequality to the VC inequality, was introduced and related to bad events with overlapping regions. The technical problem with E_out was solved, and the growth function was used to replace the number of hypotheses M. The VC dimension, which is related to the break point, is then defined and computed exactly for perceptrons in any-dimensional space. The interpretation of the VC dimension and its practical applications are also discussed.

  • 00:05:00 In this section, the concept of VC dimension is introduced as the maximum number of points that can be shattered by a hypothesis set. The VC dimension is denoted as d_VC and is the largest value of N such that the growth function is 2 to the N. It is important to note that the VC dimension does not guarantee that every N points can be shattered, but only that there exist N points that can be shattered. The section provides examples, such as the positive rays and 2D perceptrons, to demonstrate how to compute the VC dimension for a given hypothesis set. The VC dimension is used to bound the growth function of a hypothesis set, and it serves as the order of the polynomial that bounds the growth function.
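
Since the break point k of the previous lecture sits just above it, d_VC = k − 1, and the polynomial bound on the growth function can be restated in terms of the VC dimension:

```latex
\[
m_{\mathcal{H}}(N)\;\le\;\sum_{i=0}^{d_{\mathrm{VC}}}\binom{N}{i}\;\le\;N^{\,d_{\mathrm{VC}}}+1.
\]
```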

  • 00:10:00 In this section, the focus is on the VC dimension of convex sets and its relation to learning. The VC dimension represents the maximum number of points that can be shattered by a hypothesis set. If the VC dimension is finite, the final hypothesis will generalize, regardless of the input distribution or the learning algorithm used. The learning diagram, which includes the target function, learning algorithm, and input distribution, shows that the VC theory is independent of the learning algorithm and target function; of all the blocks in the diagram, the conclusion depends only on the hypothesis set and the training examples.

  • 00:15:00 In this section, we turn to the VC dimension of perceptrons. The hypothesis set is the object the entire VC theory deals with, since it is the hypothesis set that has a VC dimension and tells us whether we are able to generalize. Although the VC dimension of perceptrons in two-dimensional space is three, a simple formula states that in d-dimensional space the VC dimension is d plus one. We will prove this by showing that the VC dimension is at most d plus one and at least d plus one. To demonstrate the latter, we construct a specific set of N points (N being d plus one), arranged as a matrix, and show that they can be shattered.

  • 00:20:00 In this section, the lecturer shows a specific set of d plus 1 points and demonstrates that they can be shattered using an invertible matrix. He then poses a question to the audience about the VC dimension and asks them to choose which conclusion they can make based on the results of the demonstration. The correct answer is b, which states that the VC dimension is greater than or equal to d plus 1.
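
A small sketch of that construction (the dimension choice and names are mine): take the origin and the d standard basis vectors, prepend the constant-1 bias coordinate, and the resulting (d+1) by (d+1) matrix X is invertible, so every dichotomy y is realized by the weights w = X⁻¹y.

```python
import numpy as np
from itertools import product

d = 3
X = np.eye(d + 1)        # rows: the origin and the d standard basis vectors...
X[:, 0] = 1.0            # ...with the constant-1 bias coordinate prepended

for y in product([-1.0, 1.0], repeat=d + 1):
    w = np.linalg.solve(X, np.array(y))   # X is invertible by construction
    assert np.array_equal(np.sign(X @ w), np.array(y))
print(f"all {2 ** (d + 1)} dichotomies realized, so d_VC >= d + 1")
```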

  • 00:25:00 In this section, the professor discusses how to prove that the VC dimension is at most d plus 1. He asks the audience which of several statements would establish the premise and they respond with “d." The professor then explains that he needs to show that there is a set of d plus 2 points that he cannot shatter. He does this by showing that for a set of d plus 2 points, there will always be one point that is a linear combination of the others. Therefore, he constructs a dichotomy that he shows cannot be implemented with a perceptron.

  • 00:30:00 In this section of the video, the speaker explains the concept of a dichotomy in a perceptron, which is essentially assigning labels of +1 or -1 to specific points. Through the use of algebraic properties, it is shown that it's impossible to shatter any set of d plus 2 points, with the VC dimension being d plus 1. This is due to the number of parameters in the perceptron model, which is d plus 1, and the VC dimension gives the maximum number of points that can be shattered.

  • 00:35:00 In this section, the lecture introduces the concept of VC dimension and its interpretation. The VC dimension is a measure of the degrees of freedom of a model and how it relates to the number of parameters it has. The lecture compares these degrees of freedom to knobs on an audio system, where more knobs can give you more control over the sound, but it can be challenging to use effectively. The lecture explains that the VC dimension abstracts away the details of the mathematics inside a model and focuses on its expressive power. The lecture also discusses the correspondence between the VC dimension and the degrees of freedom of various models, such as positive rays, showing that the VC dimension equals one when there is one degree of freedom, which corresponds to a model with one parameter.

  • 00:40:00 In this section, the lecturer discusses degrees of freedom and their relationship to the VC dimension in the context of simple models. While degrees of freedom are nominally counted by the parameters of a model, the effective degrees of freedom are not necessarily equal to the number of parameters. By constructing an artificial example, the lecturer shows that parameters may not always contribute degrees of freedom: a model with eight parameters can have the same VC dimension as a model with only two parameters. The VC dimension measures these effective degrees of freedom more reliably. Finally, the lecturer notes that practitioners are often interested in the number of data points needed for a system, and that this can be related to the VC dimension of the hypothesis set.

  • 00:45:00 In this section, the speaker discusses the relationship between the number of examples needed and the value of the VC dimension. The VC inequality involves two performance quantities that we want to be small: epsilon, the tolerance on how far E_in can be from E_out, and delta, the probability that this tolerance is violated. Having chosen epsilon and delta, the speaker explains how to determine the number of examples required by looking at a plot of the function N to the power of the VC dimension times e to the power of -N. The interesting part of the curve is where the probability bound drops below 1, and the speaker then explores the implications of increasing the VC dimension from 4 to 5.
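
A sketch of reading the required N off the bound (the epsilon and delta values are my own picks): find the smallest N for which 4(2N)^d_VC e^(-eps^2 N / 8) drops below delta. The numbers come out in the tens of thousands, which is the looseness of the bound that the lecture acknowledges; the 10 times d_VC rule of thumb discussed below is what survives in practice.

```python
import math

def examples_needed(d_vc, eps=0.1, delta=0.1):
    # smallest N with 4 * (2N)^d_vc * exp(-eps^2 * N / 8) <= delta
    N = 1
    while 4 * (2 * N) ** d_vc * math.exp(-eps ** 2 * N / 8) > delta:
        N += 1
    return N

for d_vc in (3, 4, 5):
    print(d_vc, examples_needed(d_vc))
```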

  • 00:50:00 In this section, the lecturer discusses the relationship between the number of examples in a dataset and the VC dimension, which is a measure of the complexity of a learning algorithm. He uses several graphs to illustrate how the performance of the algorithm changes as the VC dimension increases, and emphasizes that the number of examples needed to achieve a certain level of performance is proportional to the VC dimension. However, he also notes that while the bounds on performance are guaranteed to follow a certain monotonicity, the actual performance may not always do so, which can be a source of frustration for practitioners.

  • 00:55:00 In this section, the lecturer discusses observations and practical applications of the VC dimension. The first lesson is that there is a proportional relationship between the VC dimension and the number of examples needed to achieve a given level of performance; as a rule of thumb, 10 times the VC dimension is needed to get into the comfort zone of the VC inequality, where the probability statement is meaningful. The second practical observation is that this rule of thumb holds for a huge range of reasonable epsilon and delta. The lecturer then simplifies the VC inequality into a quantity called capital Omega, which depends on N, delta, and the growth function, and notes that as the VC dimension gets bigger, Omega gets worse.

  • 01:00:00 In this section, the speaker discusses the concept of generalization and how having more examples affects the growth function and the polynomial behavior. He introduces the generalization bound, a positive statement instead of a characterization of bad events: with probability greater than or equal to 1 minus delta, E_in tracks E_out, meaning that they are within Omega of each other, where Omega depends on the number of examples and the hypothesis set's VC dimension. The speaker rearranges the bound to show that E_out is bounded by E_in plus Omega, and explains how this illustrates the tradeoff between the hypothesis set's size and good generalization, leading to the concept of regularization in machine learning.
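
In the rearranged form: with probability at least 1 − δ,

```latex
\[
E_{\text{out}}(g)\;\le\;E_{\text{in}}(g)\;+\;\underbrace{\sqrt{\frac{8}{N}\,\ln\frac{4\,m_{\mathcal{H}}(2N)}{\delta}}}_{\Omega(N,\mathcal{H},\delta)}.
\]
```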

  • 01:05:00 In this section, the professor explains that the VC dimension is the largest value just short of the first break point; any value bigger than a break point is also a break point, so the notion of a break point covers many values, while the VC dimension is the unique one that stands out. He also clarifies that when discussing shattering N points, you get to pick the points to shatter. The professor explains that epsilon and delta are the two performance parameters of learning: epsilon is the approximation parameter that ensures E_in tracks E_out, while delta is the probability that the statement fails. Asked about the effect of the error measure on the number of points needed, the professor explains that with a binary error measure there is no need to worry about the variance because there is an upper bound, but with other co-domains or error measures, modifications are necessary.

  • 01:10:00 In this section, the professor explains that obtaining the VC dimension exactly is rare, but they know the exact dimension for perceptrons. When it comes to neural networks, the VC dimension estimate cannot be above a certain number due to redundancies and cancellations. The professor emphasizes that the VC dimension bound is a loose estimate, but it still maintains its conceptual meaning and can be used as a guide to compare models and approximate the number of examples needed. The rule of thumb is to use at least 10 times the VC dimension to get into the interesting region of the VC inequality, which depends on the customer's desired level of accuracy. The professor notes that there are commonalities between this material and the topic of design of experiments, and the principles of learning extend to other situations beyond strict learning scenarios.
Lecture 07 - The VC Dimension
  • 2012.04.26
  • www.youtube.com
The VC Dimension - A measure of what it takes a model to learn. Relationship to the number of parameters and degrees of freedom. Lecture 7 of 18 of Caltech's...
 

Lecture 08 - Bias-Variance Tradeoff



Caltech's Machine Learning Course - CS 156. Lecture 08 - Bias-Variance Tradeoff

The professor discusses the bias-variance tradeoff in machine learning, explaining how the complexity of the hypothesis set affects the tradeoff between generalization and approximation. The lecturer introduces bias and variance: the bias measures the deviation between the average hypothesis a learning algorithm produces and the target function, while the variance measures how much the hypothesis varies across different datasets. The tradeoff means that a larger hypothesis set has a smaller bias but a larger variance, while a smaller hypothesis set has a larger bias but a smaller variance. The lecturer emphasizes the importance of having enough data resources to effectively navigate the hypothesis set and highlights the difference in scale between the bias-variance analysis and the VC analysis.

He also discusses the tradeoff between simple and complex models in terms of their ability to approximate and generalize: fewer examples call for simple models, while larger resources of examples support more complex ones. The bias-variance analysis presented is specific to linear regression and assumes knowledge of the target function, with validation remaining the gold standard for choosing a model. Ensemble learning is discussed through bagging, which uses bootstrapping to generate and average over multiple data sets, reducing variance. The balance between variance and covariance in ensemble learning is also explained, and linear regression is classified as a learning technique, with fitting as the first part of learning and the theory supplying the emphasis on good out-of-sample performance.

  • 00:00:00 In this section, the focus shifts to the bias-variance tradeoff, another approach to understanding generalization. In the previous lectures, the VC analysis established the generalization ability of a chosen hypothesis via the VC dimension of the hypothesis set, and the VC bound holds for any learning algorithm, any input data, and any target function. Another aspect of the VC analysis is that it provides a practical measure: by plotting the probability of error versus the number of examples, we found that the number of examples needed is proportional to the VC dimension, or, as a rule of thumb, that you need 10 times the VC dimension to start getting interesting generalization properties. Finally, the VC analysis was summarized in a generalization bound, which will be used in later techniques such as regularization.

  • 00:05:00 In this section, the lecturer discusses the tradeoff between approximation and generalization in learning. Learning aims for a small E_out, which requires both that the hypothesis approximate the target function well and that this approximation hold out-of-sample. A more complex hypothesis set raises the chance of approximating f well, but makes it harder to identify the suitable hypothesis. The ideal hypothesis set for learning would be a singleton containing only the target function, but since we do not know the target function, we need a hypothesis set large enough to stand a chance. The lecturer also contrasts the two analyses: VC analysis quantifies the tradeoff with the VC bound, while bias-variance analysis quantifies it by decomposing E_out.

  • 00:10:00 In this section, the speaker introduces the bias-variance tradeoff and how it applies to real-valued functions and regression with squared error. The goal is to decompose the out-of-sample error into two conceptual components: approximation and generalization. Since the final hypothesis depends on the data set used, the speaker takes the expected value of the error with respect to the data set, integrating out that dependency. The result is a way to analyze the general behavior of the error when given a specific number of data points to work with.

  • 00:15:00 In this section, the lecturer explains how to compute the expected behavior with respect to all possible realizations of, say, 100 examples. By reversing the order of integration, one expectation is disposed of and a clean decomposition emerges. The next step is to define the average hypothesis, g bar, as the expected value of the learned hypothesis over all possible data sets. Although actually computing g bar is an impossible task, it is a conceptual tool for the analysis: its technical utility appears when the squared error is expanded, because the cross term vanishes precisely by the definition of g bar.

  • 00:20:00 In this section, the lecturer decomposes the expected error into two steps that determine how far the hypothesis a learning algorithm derives from a given dataset is from the target function. The first step measures how far that hypothesis deviates from the average hypothesis g bar, the best the algorithm can do in an average sense; the second step measures how far g bar deviates from the actual target function. This yields two quantities, the bias and the variance: the bias measures the deviation between the average hypothesis and the target function, a limitation set by the algorithm's hypothesis set, while the variance measures how much the hypothesis varies from one dataset to another.
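
Written out, with g^(D) denoting the hypothesis learned from data set D and g bar the average hypothesis:

```latex
\[
\mathbb{E}_{D}\!\left[E_{\text{out}}\!\left(g^{(D)}\right)\right]
=\mathbb{E}_{x}\!\left[\;\underbrace{\bigl(\bar{g}(x)-f(x)\bigr)^{2}}_{\text{bias}(x)}
\;+\;\underbrace{\mathbb{E}_{D}\!\left[\bigl(g^{(D)}(x)-\bar{g}(x)\bigr)^{2}\right]}_{\text{var}(x)}\;\right],
\qquad
\bar{g}(x)=\mathbb{E}_{D}\!\left[g^{(D)}(x)\right].
\]
```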

  • 00:25:00 In this section, the professor discusses the bias-variance tradeoff in machine learning. He explains that the bias is the limitation of the hypothesis set, and the variance is the difference in outcome when using different data sets. He then shows how there is a tradeoff between generalization and approximation when changing the size of the hypothesis set, and illustrates this idea with a comparison of a small and large hypothesis set. He argues that a larger hypothesis set will have a smaller bias but a larger variance, while a smaller hypothesis set will have a larger bias but a smaller variance.

  • 00:30:00 In this section, the speaker introduces the concept of bias-variance tradeoff, where the bias decreases and variance increases as the hypothesis set becomes larger. To understand this, the speaker sets a concrete example where the target function is a sinusoid, and two different hypothesis sets are given: a constant model and a linear model. The speaker then shows that the linear model gives a better approximation of the sinusoid, but with some errors. This is not a learning situation but illustrates the tradeoff between bias and variance in the approximation of the target function, paving the way for more complex learning problems.

  • 00:35:00 In this section, the lecturer works through the bias-variance tradeoff using the example of fitting a line to two points, first as pure approximation of a target function and then as learning from examples. The bias-variance analysis evaluates the model's performance regardless of which two points are drawn, overcoming the dependency on the particular dataset. The lecturer generates many datasets of two points each, fits a line to each, and shows that the expected out-of-sample error is essentially the sum of the bias and the variance. The light green line, g bar of x, is the average hypothesis obtained by repeating this game; it is not itself the output of the learning process, since different datasets give different hypotheses.
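
A Monte Carlo sketch of this game (the sample counts and seed are my choices): target sin(pi x) on [-1, 1], datasets of two points, a least-squares line per dataset; averaging yields g bar, the bias, and the variance. The lecture reports roughly bias 0.21 and variance 1.69 for the linear model.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)          # target function
trials, grid = 10_000, np.linspace(-1, 1, 201)

coefs = np.empty((trials, 2))
for t in range(trials):
    x = rng.uniform(-1, 1, size=2)       # a dataset of two points
    coefs[t] = np.polyfit(x, f(x), 1)    # exact line through them: (a, b)

g = coefs[:, :1] * grid + coefs[:, 1:]   # every g^(D) evaluated on the grid
g_bar = g.mean(axis=0)                   # the average hypothesis
bias = np.mean((g_bar - f(grid)) ** 2)
var = np.mean(g.var(axis=0))
print(f"bias ~ {bias:.2f}, variance ~ {var:.2f}")
```

In the lecture's version of the experiment, the constant model comes out the other way, with larger bias (about 0.50) but much smaller variance (about 0.25), which is why it wins overall.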

  • 00:40:00 In this section of the video, the bias-variance tradeoff is illustrated concretely. The variance is computed from the spread of the learned hypotheses around their average, while the bias is the error between that average hypothesis and the target function. The tradeoff is demonstrated with the two models, one with a small bias and a large variance and the other with a large bias and a small variance. The lesson is that in a learning situation, model complexity should be matched to the data resources available rather than to the target complexity.

  • 00:45:00 In this section, the speaker discusses the bias-variance tradeoff in learning and introduces the concept of learning curves. Learning curves plot the expected values of E_out (out-of-sample error) and E_in (in-sample error) as a function of N, the size of the dataset. As N increases, the out-of-sample error generally decreases, but this trend can be influenced by the complexity of the model being used. The speaker emphasizes the importance of having enough data resources to effectively navigate the hypothesis set, and notes that noisy data can make this navigation even more difficult. The learning curves provide a visual representation of the bias-variance tradeoff and how it changes with increasing N.

  • 00:50:00 In this section, the lecturer discusses the relationship between the bias-variance analysis and the VC analysis using learning curves. He explains that both theories are discussing approximation and take into consideration what happens in terms of generalization. The lecturer highlights the difference in scale between the two theories and mentions that the bias depends on the hypothesis set. Finally, the lecturer briefly covers the analysis for the linear regression case and recommends it as a good exercise to gain insight into linear regression.

  • 00:55:00 In this section, the instructor describes the patterns of the in-sample and out-of-sample errors using the learning curves. Using linear regression with noise, he gives a simple formula for the expected in-sample error: it sits below the noise level by a ratio involving d plus 1 over N, because the fit absorbs part of the noise, so in-sample you appear to be doing better than perfect. The curve shows that the more data points you have, the less the noise impacts the error. However, when you fit the sample too closely, you end up fitting the noise, and this harms you rather than helping you out of sample.
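
The learning-curve formulas behind this, for linear regression on N examples in d dimensions with noise variance sigma squared:

```latex
\[
\mathbb{E}_{D}\!\left[E_{\text{in}}\right]=\sigma^{2}\left(1-\frac{d+1}{N}\right),
\qquad
\mathbb{E}_{D}\!\left[E_{\text{out}}\right]=\sigma^{2}\left(1+\frac{d+1}{N}\right).
\]
```

Their difference, 2 sigma squared (d+1)/N, is the expected generalization error, which is proportional to the (d+1)/N ratio echoed in the next bullet.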

  • 01:00:00 In this section, the professor talks about the tradeoff between simple and complex models in their ability to approximate and generalize. Complex models can better approximate the target function on the training examples, while simple models generalize better; since the sum of the two contributions can go either way, the key is to match the complexity of the model to the data resources available. Fewer examples mean that simple models should be used, while larger resources of examples can support complex models for better performance. The expected generalization error follows a simple formula: it is proportional to the VC dimension divided by the number of examples.

  • 01:05:00 In this section, the professor discusses how the bias-variance analysis shown is specific to linear regression and assumes you know the target function. It is a helpful conceptual guide for understanding what affects bias and variance, but not something you can plug numbers into to pick a model. He also notes that the gold standard for choosing a model in practice is validation, and he mentions ensemble methods such as boosting. The professor then briefly revisits g bar as a theoretical tool for the analysis, noting that it is not itself the focus of the lecture.

  • 01:10:00 In this section, the professor talks about ensemble learning through bagging, the process of using one dataset to generate a large number of different datasets through bootstrapping and averaging the resulting hypotheses. This pays a dividend in ensemble learning, since averaging many hypotheses reduces the variance. The moderator then asks whether the bias-variance decomposition still appears under the Bayesian approach; the professor explains that although the Bayesian approach makes certain assumptions, the bias-variance decomposition still exists. Finally, he relates numerical function approximation and extrapolation in machine learning to the bias-variance-covariance dilemma.

  • 01:15:00 In this section of the lecture, the professor discusses the balance between variance and covariance in the context of ensemble learning. He explains that in the bias-variance analysis, he had the luxury of picking independently generated data sets, generating independent models, and then averaging them. However, in actual practice, when constructing models based on variations of the data set, the covariance between the models starts to play a role. Later, when asked if linear regression is a learning technique or just function approximation, the professor states that linear regression is a learning technique and fitting is the first part of learning. The added element is to ensure that the model performs well out-of-sample, which is what the theory is about.
Lecture 08 - Bias-Variance Tradeoff
  • 2012.04.28
  • www.youtube.com
Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The learning curves. Lecture 8 of 18 of Caltech's Machine Learning...
 

Lecture 09 - The Linear Model II



Caltech's Machine Learning Course - CS 156. Lecture 09 - The Linear Model II

This lecture covers various aspects of the linear model, including the bias-variance decomposition, learning curves, and techniques for linear models such as perceptrons, linear regression, and logistic regression. The speaker emphasizes the tradeoff between complexity and generalization performance, cautioning against overfitting and emphasizing the importance of properly charging the VC dimension of the hypothesis space for valid warranties. The use of nonlinear transforms and their impact on generalization behavior is also discussed. The lecture further covers the logistic function and its applications in estimating probabilities, and introduces the concepts of likelihood and cross-entropy error measures in the context of logistic regression. Finally, iterative methods for optimizing the error function, such as gradient descent, are explained.

He also covers a range of topics related to linear models and optimization algorithms in machine learning. The professor explains the compromise between learning rate and speed in gradient-descent optimization, introduces the logistic regression algorithm, and discusses its error measure and learning algorithm. The challenges of termination in gradient descent and of multi-class classification are also addressed. The derivation and selection of features is discussed as an art that depends on the application domain, and one that is charged in terms of VC dimension. Overall, the lecture provides a comprehensive overview of linear models and optimization algorithms for machine learning.

  • 00:00:00 In this section, Yaser Abu-Mostafa recaps the bias-variance decomposition of the out-of-sample error and how it trades off with the choice of hypothesis set. He also reviews learning curves, which describe how the in-sample and out-of-sample errors evolve with the number of examples, and how the number of examples needed, proportional to the VC dimension, determines generalization properties. The lecture then turns to techniques for linear models.

  • 00:05:00 In this section of the lecture, the speaker briefly recaps the linear model in terms of linear classification and linear regression, which have been covered in previous lectures, and then moves to the third type of linear model - logistic regression. Before starting on logistic regression, the speaker ties up the loose ends in terms of nonlinear transforms and generalization issues. Nonlinear transforms offer a platform for applying learning algorithms in the Z space (feature space), with the final hypothesis still residing in the X space (input space). In the case of nonlinear transforms, the speaker emphasizes that the generalization issues were left out and that he will provide the missing piece in the lecture.

  • 00:10:00 In this section, the lecturer discusses the price paid for nonlinear transforms in terms of generalization behavior. Using the linear model directly in the X space, the weight vector has d+1 free parameters, but the VC dimension in the feature space may be much larger than that of the X space. If the VC dimension is too large, then although it is possible to fit the data, even with a 17th-order polynomial, there is no real chance of generalization. Two cases are discussed: one that is almost linearly separable, and one that is genuinely nonlinear. In the first case, insisting on getting E_in to zero on account of just two misclassified points pushes you into a high-dimensional space, which creates a generalization problem.

  • 00:15:00 In this section of the lecture, the instructor discusses the approximation-generalization tradeoff when dealing with linear models. He talks about how using a more complex model, such as a fourth-order surface, can better approximate the data but may not generalize well. He also mentions the idea of using a transformation to a non-linear space, but cautions against seeking a discount in the number of parameters. The instructor explains that charging the VC dimension of the entire hypothesis space explored in the mind is important in order for the warranty provided by the VC inequality to be valid.

  • 00:20:00 In this section, the discussion is centered around the dangers of data snooping when choosing a model before looking at the data. It is emphasized that this practice can lead to a contaminated hypothesis set, meaning that the data is no longer trustworthy for reflecting real-world performance. The concept of logistic regression is introduced, along with its unique model, error measure, and learning algorithm. This linear model is considered to be a significant complement to the perceptron and linear regression models previously discussed, and provides a useful example of the complexities and variations that exist within machine learning.

  • 00:25:00 In this section, the lecturer situates the three linear models: perceptrons, linear regression, and logistic regression. For linear classification, the hypothesis is a decision of +1 or -1, a direct thresholding of the signal. In linear regression, the output is the signal itself, while logistic regression applies a nonlinearity, the logistic function, to the signal, and the result is interpreted as the probability of something happening. The lecturer describes the shape of the logistic function and its applications in estimating probabilities for problems such as credit card applications.

  • 00:30:00 In this section, the concept of a soft threshold or sigmoid is introduced in the context of the logistic function. This function takes a linear signal as input and outputs a probability. It is particularly useful in predicting outcomes like the risk of a heart attack, where multiple factors contribute to the likelihood of an event occurring. The output of the logistic regression is treated as a genuine probability during the learning process, even though the input data does not directly provide that information.
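
The soft threshold in question is the logistic function theta applied to the linear signal:

```latex
\[
\theta(s)=\frac{e^{s}}{1+e^{s}}=\frac{1}{1+e^{-s}},
\qquad
h(x)=\theta\!\left(w^{\mathsf{T}}x\right).
\]
```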

  • 00:35:00 In this section, we discuss supervised learning on medical data and how to produce a model that approximates a hidden target function. The examples carry binary outputs generated according to a probability, which makes this a noisy case: the target f(x) maps the d-dimensional Euclidean space to [0, 1] and carries a probability interpretation. The hypothesis g(x) is formed by taking the dot product of a weight vector with x and passing it through the logistic function. The objective is to choose the weights so that the logistic regression hypothesis reflects the target function, using an error measure based on likelihood that is both plausible and friendly to the optimizer: it grades different hypotheses according to the likelihood that they actually generated the data.

  • 00:40:00 In this section of the lecture, the speaker discusses the use of likelihood and the controversy around its application. He explains that the use of likelihood is to find the most plausible hypothesis given the data. However, it is not a completely clean process as likelihood is not the probability that is required. The speaker then introduces a formula for likelihood and explains how it can be used to derive a full-fledged error measure. The formula is then used to find the likelihood of an entire dataset, which is a product of the likelihoods of individual data points. He concludes that there will always be a compromise when choosing a hypothesis, as favoring one example may mess up the others.

  • 00:45:00 In this section of the lecture, the speaker explains how maximizing the likelihood of a hypothesis under a dataset can lead to minimizing the error measure. Taking the natural logarithm allows the maximization to become a minimization, which results in an error measure in the training set. After simplifying the formula, the speaker calls the error measure the in-sample error of logistic regression, and he defines it as the error measure between the hypothesis that depends on w, applied to x_n, and the value given as a label for that example, which is y_n. The speaker also discusses the interesting interpretation of the risk score, which identifies those at risk of heart attacks based on the sign of w transposed x_n.
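
In symbols, the maximization of likelihood lands on this in-sample error (the cross-entropy error named in the next section), for labels y_n in {-1, +1}:

```latex
\[
E_{\text{in}}(w)=\frac{1}{N}\sum_{n=1}^{N}\ln\!\left(1+e^{-y_{n}\,w^{\mathsf{T}}x_{n}}\right).
\]
```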

  • 00:50:00 In this section, the cross-entropy error measure is introduced as a way to measure the accuracy of binary predictions. The goal is to minimize this error measure in order to improve the model's predictions. However, unlike linear regression, there is no closed-form solution to minimize the error measure for logistic regression. Instead, an iterative solution is needed, which will be achieved through the gradient descent method. This method involves taking a step along the steepest slope of the surface and repeating until the minimum is reached. The convexity of the error measure for logistic regression makes gradient descent a good choice for optimization.

  • 00:55:00 In this section of the lecture, the professor discusses the iterative methods used to find the minimum value of the error function in the linear model. He explains that these methods involve moving along the surface in small steps and making local approximations using calculus, specifically Taylor series. He then introduces the concept of gradient descent, where the next weight is determined by the current weight plus the move in a specific direction, which is determined by solving for the unit vector in the direction of steepest descent. The professor goes on to explain how the direction that achieves the most negative value for the inner product between a vector and a unit vector is chosen as the direction of movement.

  • 01:00:00 In this section, the lecturer discusses the compromise in the size of the step, or learning rate, in gradient-descent optimization. Very small steps eventually reach the minimum but take forever, while bigger steps are faster but may break the linear approximation. Analyzing the graphs, the best compromise is a large learning rate initially, to take advantage of steep slopes, becoming more careful near the minimum to avoid overshooting. The lecturer then presents the fixed-learning-rate formula, in which the step taken ends up proportional to the size of the gradient. The logistic regression algorithm follows: compute the gradient of the in-sample error, and obtain the next weight by subtracting the learning rate times the gradient from the current weight. Finally, all three linear models, perceptron, linear regression, and logistic regression, are summarized in one slide and applied to the credit domain.
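
A compact sketch of that algorithm (the learning rate, iteration count, and function name are my choices), using the gradient of the cross-entropy in-sample error given above:

```python
import numpy as np

def logistic_regression(X, y, eta=0.1, iters=2000):
    """Batch gradient descent. X: (N, d+1) with leading 1s; y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        # gradient of E_in(w) = mean_n ln(1 + exp(-y_n w.x_n))
        grad = -np.mean((y / (1 + np.exp(y * (X @ w))))[:, None] * X, axis=0)
        w -= eta * grad                   # w(t+1) = w(t) - eta * gradient
    return w

# predicted probability of y = +1 for a new x: 1 / (1 + np.exp(-(x @ w)))
```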

  • 01:05:00 In this section, the professor discusses the different types of linear models that can be implemented in credit analysis and the corresponding error measures and learning algorithms used. For example, the perceptron is used for binary classification and logistic regression is used to compute the probability of default. Different error measures were used for each model, such as binary classification error for the perceptron and cross-entropy error for logistic regression. The learning algorithm used was dependent on the error measure chosen, such as the perceptron learning algorithm for classification error and gradient descent for cross-entropy error. Lastly, the professor briefly discusses termination criteria and issues that arise with termination in gradient descent as a properly analyzed termination is a bit tricky due to many unknowns in the error surface.

  • 01:10:00 In this section, the speaker explains that gradient descent is an effective but not foolproof optimization algorithm. If the surface that the optimization algorithm is trying to navigate has multiple local minima, the algorithm might only find a local minimum instead of a global minimum that gives the best result. The speaker suggests using a combination of criteria to terminate the optimization algorithm and notes that the conjugate gradient is a valid alternative to gradient descent. The speaker suggests that if local minima become a real issue in an application, there are many approaches in the field of optimization to tackle this problem.

  • 01:15:00 In this section, the professor explains the concept of cross-entropy, which is a way of getting a relationship between two probability distributions using logarithmic and expected values. The professor also discusses the limitations of binary search and 2nd-order methods in optimization, emphasizing that while more sophisticated methods may lead to better results, they may be too expensive in terms of CPU cycles. Finally, in response to a question, the professor confirms that logistic regression can be applied to a multi-class setting, as demonstrated in the example of recognizing digits.

  • 01:20:00 In this section of the lecture, the professor discusses various methods for multi-class classification, including ordinal regression and tree-based binary decisions. The professor also introduces the use of the tanh function, which will be used as the neuronal function in neural networks. The concept of the learning rate is also discussed, with the professor mentioning that there are heuristics for adaptive learning rates that can be used, and a rule of thumb for choosing the learning rate is presented. Additionally, the distinction between meaningful features and features derived from looking at the specific data set is made, with the former being less likely to forfeit the VC warranty.

  • 01:25:00 In this section, the professor discusses the process of deriving features in machine learning and emphasizes that it is an art that depends on the application domain. While it is possible to derive features based on the data, the final hypothesis set will still determine the generalization behavior. The professor also notes that selecting features is automatically done in machine learning, but it becomes part of learning and is charged in terms of VC dimension. The topic of selecting features will be further addressed in the future lecture on neural networks and hidden layers.
Lecture 09 - The Linear Model II
  • 2012.05.02
  • www.youtube.com
The Linear Model II - More about linear models. Logistic regression, maximum likelihood, and gradient descent. Lecture 9 of 18 of Caltech's Machine Learning ...
 

Lecture 10 - Neural Networks



Caltech's Machine Learning Course - CS 156. Lecture 10 - Neural Networks

Yaser Abu-Mostafa, the professor at the California Institute of Technology, discusses logistic regression and neural networks in this lecture. Logistic regression is a linear model whose output is a bounded real value that is interpreted as a probability. Its error measure cannot be minimized in closed form, so the method of gradient descent is introduced to minimize an arbitrary nonlinear function that is smooth enough and twice differentiable. Although there is no closed-form solution, the error measure is a convex function, making it relatively easy to optimize using gradient descent.

Stochastic gradient descent is an extension of gradient descent that is used in neural networks. Neural networks are a model that implements a hypothesis motivated by a biological viewpoint and related to perceptrons. The backpropagation algorithm is an efficient algorithm that goes with neural networks and makes the model particularly practical. The model has a biological link that got people excited and was easy to implement using the algorithm. Although it is not the model of choice nowadays, neural networks were successful in practical applications and are still used as a standard in many industries, such as banking and credit approval.

Brief summary:

  • Logistic regression is a linear model whose output is a bounded real value that is interpreted as a probability;
  • Its error measure has no closed-form minimizer, so the method of gradient descent is introduced to optimize it;
  • Stochastic gradient descent is an extension of gradient descent that is used in neural networks (a minimal sketch appears after this list);
  • Neural networks are a model that implements a hypothesis motivated by a biological viewpoint and related to perceptrons;
  • The backpropagation algorithm is an efficient algorithm that goes with neural networks and makes the model particularly practical;
  • Although neural networks are not the model of choice nowadays, they are still used as a standard in many industries, such as banking and credit approval.
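
A minimal sketch of the stochastic variant mentioned in the list above (the function and parameter names are mine, not the lecture's): the update rule is the same as in gradient descent, but it is applied to the gradient of the error at a single randomly picked example at a time.

```python
import numpy as np

def sgd(X, y, grad_one, eta=0.1, epochs=100, seed=0):
    """grad_one(w, x_n, y_n): gradient of the error at one example."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for n in rng.permutation(len(y)):   # one random example at a time
            w -= eta * grad_one(w, X[n], y[n])
    return w

# e.g. for logistic regression:
# grad_one = lambda w, x, y: -y * x / (1 + np.exp(y * (x @ w)))
```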
Lecture 10 - Neural Networks
  • 2012.05.06
  • www.youtube.com
Neural Networks - A biologically inspired model. The efficient backpropagation learning algorithm. Hidden layers. Lecture 10 of 18 of Caltech's Machine Learn...
 

Lecture 11 - Overfitting



Caltech's Machine Learning Course - CS 156. Lecture 11 - Overfitting

This lecture introduces the concept and importance of overfitting in machine learning. Overfitting occurs when a model is trained on noise instead of the signal, resulting in poor out-of-sample fit. The lecture includes various experiments to illustrate the effects of different parameters, such as noise level and target complexity, on overfitting. The lecturer stresses the importance of detecting overfitting early on and the use of regularization and validation techniques to prevent it. The impact of deterministic and stochastic noise on overfitting is also discussed, and the lecture concludes by introducing the next two lectures on avoiding overfitting through regularization and validation.

The concept of overfitting is discussed, and the importance of regularization in preventing it is emphasized. The professor highlights the trade-off between overfitting and underfitting and explains the VC dimension's role in overfitting, where the discrepancy in VC dimension given the same number of examples results in discrepancies in out-of-sample and in-sample error. The practical issue of validating a model and how it can impact overfitting and model selection is also covered. Furthermore, the professor emphasizes the role of piecewise linear functions in preventing overfitting and highlights the importance of considering the number of degrees of freedom in the model and restricting it through regularization.

  • 00:00:00 In this section, the lecturer introduces the topic of overfitting in machine learning and its importance, noting that the ability to deal with overfitting separates professionals from amateurs in the field. The main culprit for overfitting is identified as noise, and the lecturer introduces the concept of regularization and validation as techniques to deal with overfitting. The section serves as an introduction to a new topic that will be covered in the next three lectures.

  • 00:05:00 In this section, the lecturer explains the concept of overfitting by showing how it can occur when fitting a 4th-order polynomial to a 2nd-order target function with added noise. This results in zero training error and poor out-of-sample fit, which is a classic example of overfitting, where the model went further than it needed to. This point is further emphasized when discussing overfitting in neural networks, as E_in goes down during training while E_out stays high. The lecturer also notes that overfitting is a comparative term, as there has to be another situation that is better, and overfitting can occur within the same model.

  • 00:10:00 In this section, Professor Abu-Mostafa discusses overfitting, which occurs when E_in is lowered, but E_out increases due to fitting the noise instead of the signal. He explains that effective VC dimension grows with time, but the generalization error gets worse and worse as the number of parameters increases. Overfitting can occur when two different models or instances within the same model are compared. One way to fix this is to detect overfitting by using the early stopping algorithm, based on validation, which acts as regularization to prevent overfitting. In order to avoid fitting the noise when overfitting occurs, it's important to detect it early on and stop rather than continuing to minimize E_in.

  • 00:15:00 In this section, the lecturer discusses how overfitting can occur due to the presence of noise in the data. A case study is presented with two different models - one with a noisy low-order target, and another with a noiseless high-order target. A 2nd-order polynomial and a 10th-order polynomial are used to fit the data. For the second-order fit, the in-sample error is 0.05, and the out-of-sample error is slightly higher. In contrast, the 10th-order fit presents a problem, with the in-sample error being smaller than that of the 2nd-order fit. However, the out-of-sample error dramatically increases, indicating a case of overfitting where the noise has been fitted into the model.
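
A toy reconstruction of this kind of experiment (the target, noise level, and sample size below are illustrative stand-ins, not the lecture's exact setup):

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(np.pi * x)                  # stand-in target function
    x = rng.uniform(-1, 1, 15)
    y = f(x) + 0.3 * rng.standard_normal(x.size)     # noisy training data
    x_out = np.linspace(-1, 1, 1000)                 # proxy for out-of-sample points
    for order in (2, 10):
        w = np.polyfit(x, y, order)                  # may warn: high order is ill-conditioned
        E_in = np.mean((np.polyval(w, x) - y) ** 2)
        E_out = np.mean((np.polyval(w, x_out) - f(x_out)) ** 2)
        print(order, E_in, E_out)   # expect: the 10th order has smaller E_in, larger E_out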

  • 00:20:00 In this section, the lecturer discusses overfitting and how it can occur even in noiseless situations when the model is fitting another type of noise. He gives an example of fitting a 10th-order model to a 10th-order noisy target and how it resulted in overfitting. Then, he shows that by matching the complexity of the model to the data resources rather than the target complexity, it can result in better performance despite having a simpler model. The lecturer emphasizes that generalization issues depend on the size and quality of the dataset, and simply matching the complexity of the model to the target function is not always the best approach.

  • 00:25:00 In this section, the concept of overfitting in machine learning is explored. The lecture uses learning curves to demonstrate how the in-sample error for a more complex model is smaller, but the out-of-sample error is larger, defining the gray area where overfitting is happening. The lecture also shows an experiment with two learners, one choosing a 10th order and the other choosing a 2nd order to fit a 50th-order target with no noise. Despite the absence of noise, both learners still experience overfitting, leading to the definition of actual noise and the need for caution in real-world machine learning problems. The lecture concludes that overfitting occurs in the majority of cases, emphasizing the importance of understanding and addressing this issue.

  • 00:30:00 In this section, the lecturer discusses the parameters that affect overfitting, including the noise level, target complexity, and number of data points. To create interesting target functions with high complexity, the lecturer uses combinations of the standard Legendre polynomials, which are orthogonal to each other, with random coefficients. By normalizing the signal to an energy of 1, the lecturer can state that sigma squared is the amount of noise. When generating instances of the experiment, the lecturer uses different combinations of noise, target complexity, and number of data points to observe the persistence of overfitting; a sketch of this construction follows below.
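
A sketch of how such a target might be generated, assuming x is uniform on [-1, 1] so that the orthogonality of the Legendre polynomials gives a clean energy normalization:

    import numpy as np
    from numpy.polynomial import legendre

    def random_target(Q, rng):
        # Random degree-Q Legendre combination, normalized to unit signal energy
        # so that sigma^2 is directly the amount of noise.
        a = rng.standard_normal(Q + 1)
        energy = np.sum(a**2 / (2 * np.arange(Q + 1) + 1))  # E[f(x)^2] for x ~ U[-1, 1]
        a = a / np.sqrt(energy)
        return lambda x: legendre.legval(x, a)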

  • 00:35:00 In this section, the lecturer discusses an overfitting measurement method that compares the out-of-sample errors of two different models: a 2nd-order polynomial and a 10th-order polynomial. The measure is the difference between the out-of-sample error for the complex model and the out-of-sample error for the simple model. If the complex model's out-of-sample error is larger, causing the measure to be positive, then there is overfitting. The lecturer then shows how the overfitting measure changes with varying levels of noise and target complexity. As the noise level increases and the target complexity increases, overfitting worsens. The lecturer also notes that overfitting is a significant issue and must be addressed.

  • 00:40:00 In this section, the concept of noise in overfitting is expanded beyond conventional noise and divided into stochastic noise and deterministic noise. It is noted that more data usually leads to less overfitting, and an increase in stochastic or deterministic noise leads to more overfitting. Deterministic noise is defined as the part of the target function that a hypothesis set cannot capture, and it is labeled as noise because a hypothesis set cannot deal with it. The concept of how something that cannot be captured is noise is further explored using a hypothetical scenario involving explaining complex numbers to a young sibling with a limited understanding of numbers.

  • 00:45:00 In this section of the lecture, the difference between deterministic and stochastic noise is explained, and the impact of deterministic noise on overfitting is analyzed. It is emphasized that deterministic noise depends on the hypothesis set used, and as the target complexity increases, the deterministic noise and overfitting increase as well. However, this doesn't occur until the target complexity surpasses a certain level. For finite N, the same issues with stochastic noise apply to deterministic noise in that you may capture some of it due to the limited sample size. It is also mentioned that using a more complex hypothesis set is not always better and may lead to overfitting.

  • 00:50:00 In this section, the lecturer discusses the issue of overfitting when given a finite sample. He explains that once given a finite sample, one has the ability to fit the noise, both stochastic and deterministic, which can lead to worse performance. The lecturer provides a quantitative analysis that adds noise to the target to gain insight into the role of stochastic and deterministic noise. He adds and subtracts the centroid and epsilon in preparation for getting squared terms and cross terms, which leads to a variance term, a bias term, and an added term. The added term is just sigma squared, the variance of the noise.

  • 00:55:00 In this section of the lecture, the speaker discusses the decomposition of the expected value into bias and variance, and how they relate to deterministic and stochastic noise: the bias reflects the best approximation to the target function that the hypothesis set can offer, while the stochastic-noise term is the part of the target that cannot be predicted at all. Increasing the number of examples decreases the variance, but given a hypothesis set, the bias is unavoidable. Both the deterministic noise and the stochastic noise have finite-sample versions on the data points, and these inflate the variance by making the fit more susceptible to overfitting. The speaker gives a lead into the next two lectures on avoiding overfitting by discussing two approaches, regularization and validation. Regularization is like putting on the brakes to avoid overfitting, while validation is checking the bottom line to make sure overfitting is avoided.
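
Under squared error with a noisy target y = f(x) + epsilon, the decomposition described here can be written in a standard form that matches the lecture's quantities:

\[
\mathbb{E}_{\mathcal{D},\,\epsilon}\big[E_{\text{out}}\big]
  \;=\; \sigma^2
  \;+\; \underbrace{\mathbb{E}_{x}\big[(\bar{g}(x) - f(x))^2\big]}_{\text{bias (deterministic noise)}}
  \;+\; \underbrace{\mathbb{E}_{x,\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big]}_{\text{variance}}
\]

where \( \bar{g} \) is the average hypothesis (the centroid) and \( \sigma^2 \) is the variance of the stochastic noise.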

  • 01:00:00 In this section, the professor discusses the concept of putting the brakes on overfitting by using a restrained fit or regularization. He uses the example of fitting points to a 4th-order polynomial, but preventing it from fitting all the way by putting some friction in it. The amount of brake applied is minimal but results in a dramatic reduction in overfitting while still achieving a fantastic fit. The professor notes that it is important to understand regularization and how to choose it in order to prevent overfitting. The Q&A session addresses the importance of randomization in stochastic gradient descent and how to draw out-of-sample error in neural network plots.

  • 01:05:00 In this section, the professor explains that deterministic and stochastic noise play the same role in a learning scenario, because deterministic noise arises from the inability of the hypothesis set to get closer to the target function. In real-world learning problems, the complexity of the target function is generally unknown, and the noise cannot be identified; the point of understanding overfitting conceptually is to be able to avoid it without knowing the particulars of the noise. Over-training is synonymous with overfitting, relative to the same model. Other sources of error, such as floating-point precision, have such a limited effect on overfitting that they are never a concern. Regarding the third linear model (logistic regression), the professor clarifies that when it is applied to linearly separable data, a local minimum with zero in-sample error can be achieved.

  • 01:10:00 In this section, the professor discusses the issue of overfitting and the finite-sample version of it, which occurs due to the contribution of noise from both stochastic and deterministic factors in a finite sample. This leads the algorithm to fit that noise, which is harmful when it comes to fitting larger models such as H_10. When discussing the use of piecewise linear functions to prevent overfitting, the professor highlights the importance of considering the number of degrees of freedom in your model and taking steps to restrict your model in terms of fitting through regularization. Lastly, the professor covers the practical question of validating a model and how it can impact overfitting and model selection.

  • 01:15:00 In this section, the professor discusses the trade-off between overfitting and underfitting and explains that in order to arrive at a better hypothesis, you may need to deprive yourself of a resource that could have been used for training. The professor also elaborates on the VC (Vapnik-Chervonenkis) dimension and how it relates to overfitting, stating that the discrepancy in the VC dimension, given the same number of examples, is the reason for discrepancies in the out-of-sample and in-sample error. The professor also clarifies that even though they illustrated the target complexity in the color plots, target complexity is not explicitly measured, and there is no clear way to map it into the energy of deterministic noise. Finally, the professor discusses how the target complexity could translate into something in the bias-variance decomposition and has an impact on overfitting and generalization.
Lecture 11 - Overfitting (2012.05.10, www.youtube.com): Overfitting - Fitting the data too well; fitting the noise. Deterministic noise versus stochastic noise. Lecture 11 of 18 of Caltech's Machine Learning Course - CS 156.
 

Lecture 12 - Regularization



Caltech's Machine Learning Course - CS 156. Lecture 12 - Regularization

This lecture on regularization begins with an explanation of overfitting and its negative impact on the generalization of machine learning models. Two approaches to regularization are discussed: mathematical and heuristic. The lecture then delves into the impact of regularization on bias and variance in linear models, using the example of Legendre polynomials as expanding components. The relationship between C and lambda in regularization is also covered, with an introduction to augmented error and its role in justifying regularization for generalization. Weight decay/growth techniques and the importance of choosing the right regularizer to avoid overfitting are also discussed. The lecture ends with a focus on choosing a good omega as a heuristic exercise and hopes that lambda will serve as a saving grace for regularization.

The second part discusses the weight decay as a way of balancing simplicity of the network with its functionality. The lecturer cautions against over-regularization and non-optimal performance, emphasizing the use of validation to determine optimal regularization parameters for different levels of noise. Regularization is discussed as experimental with a basis in theory and practice. Common types of regularization such as L1/L2, early stopping, and dropout are introduced, along with how to determine the appropriate regularization method for different problems. Common hyperparameters associated with implementing regularization are also discussed.

  • 00:00:00 In this section, Yaser Abu-Mostafa delves into the details of overfitting, which occurs when a model fits the data too well, at the cost of poor generalization. Even if the data is not noisy, deterministic noise can occur due to the limitations of the model, leading to a pattern that harms the out-of-sample error and causes overfitting. However, Abu-Mostafa introduces regularization as the first cure for overfitting, which is a technique used in almost every machine learning application, and is important to understand.

  • 00:05:00 In this section, the lecturer discusses two approaches to regularization in machine learning. The first approach is mathematical, where smoothness constraints are imposed to solve ill-posed problems, but the assumptions made in these developments are not always realistic for practical applications. The second approach is heuristic and involves handicapping the minimization of in-sample error by putting the brakes on the fit, which helps fight overfitting. The lecturer gives an example using a sinusoid and a line fit, showing that by regularizing and controlling the offset and slope of the lines, we may be able to gain better performance out-of-sample.

  • 00:10:00 In this section, the lecturer discusses the impact of regularization on the bias and variance of a linear model. By using regularization, the variance is reduced while the bias is slightly increased due to the imperfect fit. The lecturer uses the example of a polynomial model with Legendre polynomials as expanding components to demonstrate the effect of regularization on bias and variance. With regularization, the linear model outperforms the unregularized model and even the constant model. The lecture delves into the mathematical development of one of the most famous regularization techniques in machine learning with a focus on concrete conclusions and lessons that can be learned to deal with real-world situations.

  • 00:15:00 In this section, the lecturer introduces the Legendre polynomials and explains how they can be used to construct a hypothesis set for polynomial regression. By using these polynomials, which are orthogonal and deal with different coordinates, the relevant parameter is a combination of weights, rather than just one individual weight. The hypothesis set can be parameterized and represented in a linear form, allowing for easy analytic solutions. The target function is unknown, and the goal is to get a good approximation for it using a finite training set. The lecturer also goes over the unconstrained and constrained solutions for minimizing in-sample error using linear regression.

  • 00:20:00 In this section, the lecturer discusses the concept of regularization, which is a constraint applied to the weights of hypothesis sets. Regularization involves setting a budget C for the total squared magnitude of the weights, which means the weights cannot all be too big. The problem is to minimize the in-sample error subject to this constraint. The solution is obtained using Lagrange multipliers or KKT conditions, which gives a new solution called w_reg. The lecturer explains that the goal is to pick the point within the circle that minimizes the in-sample error, which requires going as far out as the budget allows without violating the constraint.

  • 00:25:00 In this section, the concept of regularization is discussed, where the objective is to derive a model that generalizes well to unseen data. If the unconstrained linear regression solution already satisfies the constraint, it is also the constrained solution; otherwise, the solution lies on the boundary of the constraint. The main focus is on deriving the analytic condition for the minimum of E_in subject to the constraint, a compromise between the objective and the constraint. At the solution, the gradient of E_in must be normal to the constraint surface, which means it is proportional to the negative of w_reg itself. Minimizing E_in(w) plus the corresponding multiple of w transposed w, unconditionally, then yields the same minimum.

  • 00:30:00 In this section, the lecture discusses the relationship between the parameters C and lambda in regularization. The larger the value of C, the smaller the value of lambda as there is less emphasis on the regularization term. Conversely, as C decreases, the regularization term becomes more significant and the value of lambda needs to increase to enforce the condition. The lecture also introduces augmented error, which is the sum of the error function and regularization term. It is equivalent to an unconstrained optimization problem of minimizing the error function while subject to the constraint. This correspondence justifies regularization in terms of generalization and is applicable to any regularizer. Finally, the lecture provides the formula for minimizing augmented error and concludes by providing the solution.
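
Concretely, the augmented error being minimized is

\[
E_{\text{aug}}(w) \;=\; E_{\text{in}}(w) \;+\; \frac{\lambda}{N}\, w^{\mathsf{T}} w,
\]

the unconstrained counterpart of minimizing \( E_{\text{in}}(w) \) subject to \( w^{\mathsf{T}} w \le C \).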

  • 00:35:00 In this section, the speaker discusses the solution to the problem of regularization. The solution is represented by w_reg, which is a modification of the pseudo-inverse solution with an additional regularization term. Under clean assumptions, we have one-step learning, including regularization. In other words, we can have a solution outright without doing a constrained optimization. The regularization term in the solution becomes dominant as lambda increases, which knocks w_reg down to zero, creating a smaller and smaller solution. The speaker then applies regularization to a familiar problem, showing that the choice of lambda is critical, and a heuristic choice for the type of regularizer will be necessary.
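
A minimal sketch of this one-step solution, assuming Z is the transformed data matrix and y the target vector:

    import numpy as np

    def w_reg(Z, y, lam):
        # One-step learning with weight decay: w = (Z'Z + lambda*I)^(-1) Z'y.
        # lam = 0 recovers the pseudo-inverse solution; large lam shrinks w to 0.
        d = Z.shape[1]
        return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)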

  • 00:40:00 In this section, the concept of regularization and its associated method, weight decay, are introduced. Weight decay is a famous regularizer in machine learning that minimizes w transposed w, keeping the weights small, which is where the name "decay" comes from. When using neural networks, weight decay can be implemented through batch gradient descent, where the added term shrinks the weights before any movement in the weight space; when lambda is large, this limits how much one can learn about the function. Variations of weight decay include assigning importance factors to certain weights and using different constants to experiment with the type of regularizer being used.
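
In batch gradient descent, the gradient of the augmented error turns the update into a multiplicative shrinking of the weights before the usual step, which is where the name comes from:

\[
w(t+1) \;=\; \Big(1 - 2\eta\,\frac{\lambda}{N}\Big)\, w(t) \;-\; \eta\, \nabla E_{\text{in}}\big(w(t)\big)
\]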

  • 00:45:00 In this section, the lecturer discusses weight decay and weight growth, two kinds of constraint used in machine learning to control the magnitude of the weights a model uses. Weight decay constrains models toward smaller weights, while weight growth pushes them toward larger weights. The lecturer explains that an optimal lambda value must be chosen for either technique to achieve the best out-of-sample performance. Additionally, the lecturer discusses how to choose the right regularizer, emphasizing guidelines that help fight overfitting, such as preferring smoother hypotheses because stochastic noise is high-frequency. Ultimately, the lecturer recommends practical rules of this kind for finding good regularizers.

  • 00:50:00 In this section of the lecture, the instructor explains the different types of noise that can lead to overfitting and why it's important to choose a regularizer that tends to pick smoother hypotheses. He defines the general form of regularization and the augmented error that is minimized, which is similar to the equation used in VC analysis. He also discusses the correspondence between the complexity of an individual hypothesis and the complexity of the set of objects, and how E_aug is a better estimate for E_out than E_in.

  • 00:55:00 In this section of the lecture on regularization, the idea of augmented error as a better proxy for the out-of-sample error is discussed. Regularization aims to reduce overfitting, which is essentially fitting the noise more than the signal. The guiding principle for choosing a regularizer is to move in the direction of smoother hypotheses: noise is not smooth, so smoother solutions tend to harm the noise more than they harm the signal. The related notion of simpler is introduced for cases where smoother does not apply well. Choosing a good omega is a heuristic exercise, and the math involved is only as good as the assumption on which it is based. The lecture ends with the hope that lambda will serve as a saving grace for choosing the regularizer.

  • 01:00:00 In this section of the lecture, the concept of weight decay for neural networks is explored: small weights keep the network in its linear, simple regime, while large weights push the neurons toward a logical (binary) regime in which any functionality can be implemented. Another form of regularizer is weight elimination, in which some of the weights within a network are forced to be zero, resulting in a smaller VC dimension, better generalization, and a smaller chance of overfitting. Soft weight elimination is introduced, whereby a continuous penalty pushes small weights toward zero while leaving large weights relatively unaffected. Finally, early stopping is discussed as a form of regularizer, which recommends stopping training before the end, as a way of indirectly providing simplicity to the function.

  • 01:05:00 In this section, the professor explains that regularization is done through the optimizer and that we don't change the objective function. Instead, we hand over the objective function, which is the in-sample error, to the optimizer and tell it to minimize it. The professor then cautions against just putting the regularizer in the optimizer, which can lead to over-regularization and non-optimal performance if not done correctly. He emphasizes the importance of capturing as much as possible in the objective function and then using validation to determine the optimal value for the regularization parameter, lambda. The professor then shows how the choice of lambda changes with different levels of noise, and how using validation can help determine the best possible outcome given the noise. Finally, he discusses the use of different types of regularizers with different parameters, depending on the performance.

  • 01:10:00 In this section, the professor discusses the use of regularizers in machine learning, which is an experimental activity rather than a completely principled activity. The machine learning approach is somewhere between theory and practice, meaning that it has a strong grounding in both. The professor uses Legendre polynomials as the orthogonal functions because they provide a level of generality that is interesting and the solution is simple. Regularization allows a user to find a sweet spot for the best performance, which could be between two discrete steps. The regularization term added does not explicitly depend on the data set. However, the optimal parameter, lambda, will depend on the training set, which will be determined by validation.

  • 01:15:00 In this section, the concept of regularization is introduced, which involves adding a penalty term to the loss function in order to avoid overfitting in machine learning models. The two most common types of regularization, L1 and L2, are discussed along with their respective advantages and disadvantages. Additionally, the use of early stopping and dropout as alternative regularization techniques is explained. The lecture concludes with an overview of how to determine the appropriate regularization method for a given problem, as well as common hyperparameters to consider when implementing regularization.
Lecture 12 - Regularization (2012.05.14, www.youtube.com): Regularization - Putting the brakes on fitting the noise. Hard and soft constraints. Augmented error and weight decay. Lecture 12 of 18 of Caltech's Machine Learning Course - CS 156.
 

Lecture 13 - Validation




Caltech's Machine Learning Course - CS 156. Lecture 13 - Validation

In lecture 13, the focus is on validation as an important technique in machine learning for model selection. The lecture goes into the specifics of validation, including why it's called validation and why it's important for model selection. Cross-validation is also discussed as a type of validation that allows for the use of all available examples for training and validation. The lecturer explains how to estimate the out-of-sample error using the random variable that takes an out-of-sample point and calculates the difference between the hypothesis and the target value. The lecture also discusses the bias introduced when using the estimate to choose a particular model, as it is no longer reliable since it was selected based on the validation set. The concept of cross-validation is introduced as a method for evaluating the out-of-sample error for different hypotheses.

In the second part, he covers the use of cross-validation for model selection and validation to prevent overfitting, with a focus on "leave one out" and 10-fold cross-validation. The professor demonstrates the importance of accounting for out-of-sample discrepancy and data snooping, and suggests randomizing methods to avoid sampling bias. He explains that although cross-validation can add complexity, combining it with regularization can select the best model, and because validation requires no assumptions, it is unique among model-selection methods. The professor further explains how cross-validation can help make principled choices even when comparing across different scenarios and models, and how the total number of validation points determines the error bar, while the bias depends on how cross-validation is used.

  • 00:00:00 In this section, the focus is on validation, another important technique in machine learning that is used for model selection. The process involves choosing a validation set size and using it to validate the model selection process. The lecture goes into the specifics of validation, including why it's called validation and why it's important for model selection. The discussion also covers cross-validation, which is a type of validation that enables the use of all available examples for training and validation. The lecture also contrasts validation with regularization as two different ways of controlling overfitting.

  • 00:05:00 In this section, the lecturer discusses validation and regularization in the context of the well-known equation that deals with the difference between the in-sample error and the out-of-sample error due to the model's complexity. Regularization estimates the penalty for overfit complexity while validation tries to estimate the out-of-sample error directly. The lecturer explains how to estimate the out-of-sample error using the random variable that takes an out-of-sample point and calculates the difference between the hypothesis and the target value. The lecturer emphasizes how variance affects the quality of the estimate, and proposes using a full set of points instead of one.

  • 00:10:00 In this section, the notion of a validation set and the validation error as an unbiased estimate of the out-of-sample error is introduced. The expected value of the validation error is E_out, which is another form of the expected value on a single point. The variance of the validation error is analyzed to show that there is an improvement in the estimate based on E_val compared to a single point. The variance ends up being proportional to 1/K, which means that increasing K can shrink the error bar and improve the reliability of the estimate. However, the number of validation points is not free and has a direct impact on the number of points available for training.
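
Written out, with \( \sigma^2(g^{-}) \) denoting the variance of the single-point error and assuming independent validation points:

\[
E_{\text{val}}(g^{-}) \;=\; \frac{1}{K}\sum_{k=1}^{K} e\big(g^{-}(x_k), y_k\big),
\qquad
\operatorname{Var}\big[E_{\text{val}}(g^{-})\big] \;=\; \frac{\sigma^2(g^{-})}{K}
\]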

  • 00:15:00 In this section, the focus is on the process of validation, whereby K points are taken from N points for validation purposes, while the remaining subset D_train is used for training. It is also important to note the usefulness of having a reliable estimate of a validation set to ensure that the final hypothesis is reliable. However, having a reliable estimate of a bad quantity should not be the aim. As the value of K is increased, the estimate becomes more reliable, but the quality of hypothesis decreases. Thus, it is vital to find a means of not having to pay the price that comes with the increase of K. One way is to restore the data set after estimating the error and train on the full set to obtain better results.

  • 00:20:00 In this section, the focus is on the compromise in performance when using a validation set during training. The reduced set of D_train will have fewer examples compared to the full training set D, using which we obtain final hypothesis g minus. To get an estimate, we evaluate g minus on a validation set D_val, and then add the rest of the examples back into the pot and report g. However, a large K means the difference between g minus and g is bigger, and this affects the reliability of the estimate we report. Hence, there is a rule of thumb to use one fifth for validation to get the best of both worlds. We call it validation because it affects the learning process and helps in making choices.

  • 00:25:00 In this section, the focus is on understanding the difference between test error and validation error. When the test set is unbiased and is used to estimate E_out, there will be fluctuations in the estimate. If early stopping is used, the bias of the estimate changes. In a mini-learning scenario, it is easy to see that the expected value of the minimum is less than 0.5, making it an optimistic bias. The same thing happens when a point is chosen for early stopping - the point chosen is minimum on the realization, and an optimistic bias is introduced.

  • 00:30:00 In this section, the lecture discusses the use of the validation set for model selection in machine learning. The process involves training M models using a dataset split into training and validation sets, and then evaluating the performance of each model on the validation set to obtain estimates of out-of-sample error. The model with the smallest validation error is chosen, but there is a risk of bias introduced due to this selection process. Nevertheless, the bias is generally minor in practice and can be accepted to obtain a reliable estimate of the out-of-sample error.
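
A sketch of that selection loop; the `models` argument is a hypothetical list of trainers, each returning a predictor, and squared error is an illustrative choice:

    import numpy as np

    def select_by_validation(models, X, y, K, seed=0):
        # Train each model on N-K points, validate on K held-out points,
        # then retrain the winner on all N points ("restore the data set").
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        val, train = idx[:K], idx[K:]
        errors = []
        for fit in models:
            g = fit(X[train], y[train])                  # g_minus for this model
            errors.append(np.mean((g(X[val]) - y[val]) ** 2))
        best = models[int(np.argmin(errors))]            # selection introduces a small bias
        return best(X, y), min(errors)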

  • 00:35:00 In this section, the lecturer discusses the bias introduced when using the estimate to choose a particular model, as it is no longer reliable since it was selected based on the validation set. The expected value of the estimator becomes a biased, optimistic estimate of the out-of-sample error. An experiment with two models generated a curve indicating a systematic bias toward whichever model happened to look better on the validation set. The curves on the graph read like learning curves in reverse, showing how the out-of-sample error goes down as more examples are used for training. As the size of the validation set gets bigger, the estimate becomes more reliable, and the curves indicating the models' errors converge.

  • 00:40:00 In this section, the lecture explains how to estimate the discrepancy, or bias, introduced by selecting the final hypothesis from the small hypothesis set made up of the M finalists. The validation error then plays the role of a training error for that small hypothesis set, and with a little mathematics related to the VC dimension and effective complexity, an estimate of the out-of-sample error can be obtained. Although more examples improve the estimate, the contribution of the number of hypotheses being selected from is only logarithmic. Moreover, when a single parameter is being chosen, the effective complexity corresponds to a VC dimension of about 1, which is not too difficult to handle. Therefore, with a suitable validation set, the estimate of the out-of-sample error will not differ too much from the actual value.

  • 00:45:00 In this section, the speaker discusses the idea of data contamination when using error estimates to make decisions, particularly in the context of validation. The training set is considered to be completely contaminated, while the test set is completely clean and gives an unbiased estimate. However, the validation set is slightly contaminated because it is used to make a few decisions, so it is important to not get carried away and move on to another validation set when necessary. The speaker then introduces cross-validation as a regime of validation that can get a better estimate with a smaller error bar, as long as it is not biased in the process.

  • 00:50:00 In this section, the professor introduces the concept of validation through cross-validation, specifically the "leave one out" method. In this method, the dataset is divided into two, with one point being used for validation and the rest used for training. The process is repeated for different points, resulting in multiple unbiased and imperfect estimations. Since all the estimations are based on training with N minus 1 data points, they have a common thread. Despite being imperfect, the repeated estimates give insight into the behavior of the model and help optimize it for the best out-of-sample performance.

  • 00:55:00 In this section, the concept of cross-validation is introduced as a method for evaluating the out-of-sample error for different hypotheses. By dividing the dataset into training and validation sets, it is possible to estimate the performance of the model on unseen data. The "leave one out" method is used to illustrate the process. Its effectiveness is discussed: training on N minus 1 points while, over the N repetitions, effectively using all N points for validation is remarkably efficient for obtaining accurate results.
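
A compact sketch of "leave one out", again with a hypothetical trainer `fit` and squared error:

    import numpy as np

    def loo_cv(fit, X, y):
        # Train on N-1 points, validate on the point left out, repeat N times;
        # every point serves for validation exactly once.
        N = len(y)
        errs = np.empty(N)
        for n in range(N):
            mask = np.arange(N) != n
            g = fit(X[mask], y[mask])
            errs[n] = (g(X[n:n + 1])[0] - y[n]) ** 2
        return errs.mean()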

  • 01:00:00 In this section, the professor discusses the use of cross-validation for model selection. He demonstrates this by comparing the linear and constant models on three points, and shows how the constant model wins. He then applies cross-validation to the problem of finding a separating surface for handwritten digits using a 5th-order nonlinear transform with 20 features. He uses "leave one out" cross-validation to compare 20 models and to choose where to stop adding features. He shows that the cross-validation error tracks the out-of-sample error closely, and that using it as the criterion for model choice leads to a minimum at 6 features, with improved performance compared to using the full model without validation.

  • 01:05:00 In this section, the professor discusses the use of validation for preventing overfitting, and how it is considered similar to regularization. He explains how "leave one out" validation is not practical for most real problems, and suggests using 10-fold cross-validation instead. He also provides guidance on the number of parameters to use based on the size of the data set, and clarifies why model choice by validation does not count as data snooping.

  • 01:10:00 In this section, the professor discusses the importance of accounting for out-of-sample discrepancy and data snooping when using the validation set to make model choices. He emphasizes the need to use randomizing methods such as flipping coins to avoid sampling bias and using cross-validation techniques to choose the regularization parameter in many practical cases. While cross-validation can add computational complexity, it can also be combined with regularization to select the best hypothesis for a model. The professor notes that although there are other methods for model selection, validation is unique in that it doesn't require assumptions.

  • 01:15:00 In this section, the professor discusses how validation can help make principled choices in selecting models, regardless of the nature of the choice, and how it can also be used to update the model in case of time evolution or tracking system evolution. When comparing validation and cross-validation, he explains that both methods have bias, but cross-validation allows for more examples to be used for both training and validation, resulting in a smaller error bar and less vulnerability to bias. While it may be possible to have data sets so large that cross-validation is not needed, the professor provides an example where even with 100 million points, cross-validation was still beneficial due to the nature of the data.

  • 01:20:00 In this section, the professor discusses scenarios where cross-validation is useful and addresses potential problems with it. He explains that cross-validation becomes relevant when the most relevant part of a large data set is smaller than the whole set. When deciding between competing models, statistical evidence is necessary to determine the significance of the out-of-sample error. The professor states that with a smaller data set, there is no definitive answer as to whether it is better to re-sample or break the set into chunks for cross-validation. The professor also discusses the role of balance between classes and how bias behaves when increasing the number of points left out. Finally, the professor explains that the total number of validation points determines the error bar, and bias is a function of how cross-validation is used.

  • 01:25:00 In this section, the professor discusses the error bar and how it can provide an indication of vulnerability to bias in an estimate. If two scenarios have comparable error bars, there is no reason to believe that one is more vulnerable to bias. However, a detailed analysis is needed to see the difference between taking one scenario at a time and considering correlations. The professor concludes that as long as a number of folds are done and every example appears in the cross-validation estimate exactly once, there is no preference between scenarios in terms of bias.
Lecture 13 - Validation (2012.05.17, www.youtube.com): Validation - Taking a peek out of sample. Model selection and data contamination. Cross validation. Lecture 13 of 18 of Caltech's Machine Learning Course - CS 156.
 

Lecture 14 - Support Vector Machines



Caltech's Machine Learning Course - CS 156. Lecture 14 - Support Vector Machines

The lecture covers the importance of validation and its use in machine learning, as well as the advantages of cross-validation over validation. The focus of the lecture is on support vector machines (SVMs) as the most effective learning model for classification, with a detailed outline of the section that involves maximization of the margin, formulation, and analytical solutions through constrained optimization presented. The lecture covers a range of technicalities, including how to calculate the distance between a point and a hyperplane in SVMs, how to solve the optimization problem for SVMs, and how to formulate the SVM optimization problem in its dual formulation. The lecturer also discusses the practical aspects of using quadratic programming to solve the optimization problem and the importance of identifying support vectors. The lecture concludes with a brief discussion of the use of nonlinear transformations in SVMs.

In the second part of this lecture on support vector machines (SVMs), the lecturer explains how the number of support vectors divided by the number of examples gives an upper bound on the probability of error in classifying an out-of-sample point, which makes using support vectors with nonlinear transformations feasible. The professor also discusses the normalization of w transposed x plus b to be 1 and why it is needed for the optimization, as well as the soft-margin version of SVM, which allows for errors and penalizes them. In addition, the relationship between the number of support vectors and the VC dimension is explained, and the method's resistance to noise is mentioned, with the soft version of the method used in cases of noisy data.

  • 00:00:00 In this section, the lecturer discusses the importance of validation, particularly in terms of its use in machine learning. The concept of unbiased and optimistic bias as a result of validation error and its effect on model selection is also explained. The advantage of cross-validation over validation is further highlighted in the section. Furthermore, the lecturer introduces support vector machines as the most effective learning model for classification, citing its intuitive interpretation, a principled derivation, and optimization package as significant advantages to the learning model. A detailed outline of the section, which involves the maximization of margin, formulation, and analytical solutions through constrained optimization, is also presented.

  • 00:05:00 In this section, the concept of maximizing the margin in linear separation was explained. While all lines that separate linearly separable data have zero in-sample error, some may have better margins that allow for greater generalization. It is explained that a bigger margin is better because, in noisy situations, the likelihood that the new point will be classified correctly is higher. This is related to the growth function, and how a bigger growth function is disadvantageous for generalization in machine learning. It is shown that maximizing the margin can help with generalization by searching for lines that not only separate the data correctly but also have the maximum margin possible for those data points.

  • 00:10:00 In this section, the lecturer discusses fat margins and how they can improve the performance of a classifier. By requiring a classifier to have a margin of a certain size, the number of possible dichotomies is reduced, leading to a smaller growth function and smaller VC dimension. The larger the margin, the better the out-of-sample performance of the classifier. The lecturer then explains how to solve for the biggest possible margin, by finding the distance between the hyperplane and the nearest data point, and normalizing the vector w to simplify the analysis. The signal w transposed x plus b reflects which points are nearer to or further from the hyperplane, but it is not itself the Euclidean distance, and it needs to be converted to obtain the Euclidean distance.

  • 00:15:00 In this section, the lecturer explains some technicalities relevant to the support vector machine analysis. Firstly, in order to compare the performance of different planes, the Euclidean distance is used as a yardstick. Secondly, the constant coordinate is taken out of the vector x, and w₀ is pulled out of w (and renamed b) so that it is not confused with the w vector that now has a new role. The goal is to compute the distance between xₙ (the nearest point) and the plane. The lecturer shows that the vector w is orthogonal to the plane, because it is orthogonal to every vector lying on the plane, which gives us a way to compute the distance between xₙ and the plane.

  • 00:20:00 In this section, the speaker discusses how to calculate the distance between a point and a hyperplane in SVMs. This can be done by projecting the vector going from the point to a generic point on the hyperplane onto the direction that is orthogonal to the hyperplane. The unit vector in this direction is computed by normalizing the length of the vector. By using some algebra, the speaker derives a formula for the distance that is simplified by adding a missing term. This formula can be used to choose the combination of w's that gives the best possible margin. The optimization problem that results from this is not very user-friendly because of the minimum in the constraints. However, by making some simple observations, this problem can be reformulated into a more friendly quadratic one.
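
With the normalization \( |w^{\mathsf{T}} x_n + b| = 1 \) at the nearest point, the derivation gives

\[
\text{dist}(x_n, \text{plane}) \;=\; \frac{|\,w^{\mathsf{T}} x_n + b\,|}{\lVert w \rVert} \;=\; \frac{1}{\lVert w \rVert},
\]

so maximizing the margin becomes the friendlier quadratic problem: minimize \( \tfrac{1}{2} w^{\mathsf{T}} w \) subject to \( y_n (w^{\mathsf{T}} x_n + b) \ge 1 \) for \( n = 1, \dots, N \).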

  • 00:25:00 In this section, the lecturer explains how to solve the optimization problem for Support Vector Machines (SVMs). They begin by showing how SVMs can be formulated as a constrained optimization problem where they must minimize an objective function subject to linear inequality constraints. They prove that it is possible to use Lagrange multipliers to transform the inequality constraints into equality constraints and then solve the new Lagrangian. They note that this approach was independently discovered by Karush and Kuhn-Tucker and is referred to as KKT Lagrangian. The lecturer emphasizes that the process is similar to the procedure for regularization, and they recall the gradient condition for the solution.

  • 00:30:00 In this section, the lecturer explains the relationship between the SVM and regularization via the Lagrange formulation. It is essential to note that the constraints lead to a non-zero gradient, unlike the unconstrained problem where the gradient equals 0. The Lagrange formulation depends on the variables w and b, together with new variables, the Lagrange multipliers, collected in the alpha vector. The problem at hand is to minimize the objective function subject to the inequality constraints, which are folded into the Lagrangian. The interesting part is that we are actually maximizing with respect to alpha, although the alphas have to be non-negative, so we need to pay attention to this. The section concludes with a brief explanation of the unconstrained part, where we minimize the Lagrangian with respect to w and b by setting its gradient to zero.

  • 00:35:00 In this section of the lecture, the speaker explains how to formulate the SVM optimization problem in its dual formulation. He first optimizes the problem with respect to w and b, resulting in two conditions that he substitutes back into the original Lagrangian, leading to the dual formulation of the problem, which is a nice formula in terms of the Lagrange multipliers alpha only. He then sets the constraint for the alphas to be non-negative and solves the maximization problem subject to these constraints, resulting in the optimal values of alpha that determine the support vectors.
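
The dual problem referred to here takes the form

\[
\max_{\alpha}\; \sum_{n=1}^{N} \alpha_n \;-\; \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} y_n y_m\, \alpha_n \alpha_m\, x_n^{\mathsf{T}} x_m
\quad \text{s.t.} \quad \alpha_n \ge 0, \;\; \sum_{n=1}^{N} \alpha_n y_n = 0,
\]

with the weights recovered as \( w = \sum_n \alpha_n y_n x_n \).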

  • 00:40:00 In this section, the speaker discusses the practical aspects of using quadratic programming to solve the optimization problem presented earlier for support vector machines. The objective and constraints are translated into coefficients that are passed onto the quadratic programming package for minimization. The matrix dimension depends on the number of examples and this becomes a practical consideration for large datasets. The speaker warns that when the number of examples is big, quadratic programming has a hard time finding the solution and may require use of heuristics.
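
A sketch of how those coefficients might be passed to a QP package; cvxopt is one possible choice here (not the one the lecture prescribes), and the labels y are assumed to be plus/minus 1 stored as floats:

    import numpy as np
    from cvxopt import matrix, solvers

    def hard_margin_alphas(X, y):
        # Dual as a QP: minimize (1/2) a'Qa - 1'a  s.t.  a >= 0 and y'a = 0,
        # where Q[n, m] = y_n y_m x_n.x_m.
        N = len(y)
        Q = np.outer(y, y) * (X @ X.T)       # N x N: the practical bottleneck for big N
        sol = solvers.qp(matrix(Q), matrix(-np.ones(N)),
                         matrix(-np.eye(N)), matrix(np.zeros(N)),   # encodes a >= 0
                         matrix(y.reshape(1, -1)), matrix(0.0))     # encodes y'a = 0
        return np.ravel(sol['x'])            # support vectors are the alphas > 0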

  • 00:45:00 In this section, the lecture delves into the solution returned by quadratic programming, specifically alpha, and how it relates to the original problem of determining the weights, the surface, the margin, and b. The lecture highlights the importance of identifying support vectors, which are the points that define the plane and the margin. The mathematics behind the positive Lagrange multipliers (alphas in this case) gives a way of identifying them: the points with strictly positive alphas are precisely the support vectors. These alpha values define the boundary between the two classifications, and locating them is what determines the optimal weights and the maximum margin.

  • 00:50:00 In this section, the concept of support vectors is introduced and discussed in the context of the support vector machine (SVM) algorithm. Support vectors are defined as the data points that are closest to the decision boundary or hyperplane that separates the classes of data. The SVM algorithm optimizes a quadratic programming problem to determine the support vectors and the parameters of the decision function. The values of the parameters depend only on the support vectors, which are the critical points, enabling the model to generalize well. Nonlinear transformations are also briefly discussed as a way to handle non-separable data. Transforming the data into a higher-dimensional space does not complicate the optimization problem, and the same technique can be used to find the support vectors and decision function.

  • 00:55:00 In this section of the video, the lecturer discusses the use of nonlinear transformations in SVMs. Nonlinear transformations are used when data is not linearly separable, which is the case in the X space. The lecturer demonstrates how to use a nonlinear transformation and work in the Z space to achieve a linearly separable result. He explains that the solution is easy, and the number of alphas depends on the number of data points, not the dimensionality of the space that you're working in. The key idea is that you can go to an enormous space without paying a price in terms of the optimization. The support vectors are identified in the Z space, but in the X space, they look like data points.

  • 01:00:00 In this section, the lecturer discusses the generalization result that makes using support vectors with nonlinear transformations feasible. The number of support vectors, which plays the role of the number of effective parameters, divided by the number of examples gives an upper bound, in expectation over data sets, on the probability of error in classifying an out-of-sample point. This is a familiar type of bound: a number of parameters, degrees of freedom, or VC dimension divided by the number of examples. This result is why people use support vectors, and support vectors with nonlinear transformations, since you do not pay for the computation of going to a higher dimension or for the generalization penalty that would normally go with it.

  • 01:05:00 In this section, the professor explains why he chooses to normalize w transposed x plus b to be 1, and why this normalization is necessary for optimization. He also answers a question about how SVM deals with non-linearly separable points through nonlinear transformations, and how the soft-margin version of SVM allows for errors and penalizes for them. Additionally, the professor briefly touches on the relationship between the number of support vectors and the VC dimension, and how the alphas represent the parameters in SVM.

  • 01:10:00 In this section, the lecturer discusses the relationship between the number of non-zero parameters and the VC dimension, which is equivalent to the number of support vectors by the definition. The margin measure can vary depending on the norm used, but there is no compelling reason to prefer one over the other in terms of performance. While there is no direct method for pruning support vectors, taking subsets and obtaining the support vectors of the support vectors are possible computational considerations. The SVM method is not particularly susceptible to noise, and in cases of noisy data, the soft version of the method is used, which is remarkably similar to the non-noisy case.
Lecture 14 - Support Vector Machines (2012.05.18, www.youtube.com): Support Vector Machines - One of the most successful learning algorithms; getting a complex model at the price of a simple one. Lecture 14 of 18 of Caltech's Machine Learning Course - CS 156.
 

Lecture 15 - Kernel Methods



Caltech's Machine Learning Course - CS 156. Lecture 15 - Kernel Methods

This lecture on kernel methods introduces support vector machines (SVMs) as a linear model that is more performance-driven than traditional linear regression models because of the concept of maximizing the margin. If the data is not linearly separable, nonlinear transforms can be used to create wiggly surfaces that still enable complex hypotheses without paying a high price in complexity. The video explains kernel methods that go to high-dimensional Z space, explaining how to compute the inner product without computing the individual vectors. The video also outlines the different approaches to obtaining a valid kernel for classification problems and explains how to apply SVM to non-separable data. Finally, the video explains the concept of slack and quantifying the margin violation in SVM, introducing a variable xi to penalize margin violation and reviewing the Lagrangian formulation to solve for alpha.

The second part covers practical aspects of using support vector machines (SVMs) and kernel methods. He explains the concept of soft margin support vector machines and how they allow for some misclassification while maintaining a wide margin. He talks about the importance of the parameter C, which determines how much violation can occur, and suggests using cross-validation to determine its value. He also addresses concerns about the constant coordinate in transformed data and assures users that it plays the same role as the bias term. Additionally, he discusses the possibility of combining kernels to produce new kernels and suggests heuristic methods that can be used when quadratic programming fails in solving SVMs with too many data points.

  • 00:00:00 In this section of the lecture on Kernel Methods, Yaser Abu-Mostafa introduces the concept of support vector machines (SVMs), noting that they are nothing but a linear model in the simplest form, but are more performance-oriented because of the idea of maximizing the margin. By using a package of quadratic programming, we can solve the SVM problem and get the alphas back, which helps us identify the support vectors. If the data is not linearly separable, we can use nonlinear transform, but the resulting wiggly surface still allows us to get a complex hypothesis without paying a high price in complexity. We can predict the out-of-sample error based on the number of support vectors, which is an in-sample quantity.

  • 00:05:00 In this section, the video explains the concept of kernel methods and their role in extending support vector machines beyond the linearly separable case. The idea behind kernel methods is to go to a high-dimensional Z space without paying the price for complexity. The video explains that the key to achieving this is to be able to compute the inner product in the Z space without actually computing the individual vectors in that space. This is where kernels come in, as they allow for the computation of inner products using only explicit inputs. The video goes on to explain the implications of these methods for dealing with nonlinear transformations and soft margins, and how they can be used in practice to handle complex problems.

  • 00:10:00 In this section, the lecture explains the use of the inner product in the Z space, and how it relates to kernel methods. The inner product is needed to form the Lagrangian and to pass the constraints on to quadratic programming, and these are the only operations required, so the whole support vector machinery can run on inner products alone. By using a generalized inner product, one can replace the Z-space inner product of two points x and x′ with a function of x and x′ alone, called the kernel, that corresponds to some Z space. An example is given in a two-dimensional Euclidean space using a 2nd-order polynomial transformation.

  • 00:15:00 In this section, the lecturer discusses the concept of kernel methods and how to compute kernels without transforming x and x′. The lecturer improvises a kernel that never visits the Z space and convinces the audience that it nevertheless corresponds to a transformation to some Z space, with an inner product taken there. By raising (1 + x transposed x′) to the power Q, the lecturer explains, one obtains an inner product in some space, making it a valid kernel. Furthermore, the lecturer compares how much computation this takes versus working with the transformed coordinates directly: regardless of Q, the cost of evaluating the kernel remains essentially the same.
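
For instance, in two dimensions with Q = 2, expanding the square shows the kernel is a genuine inner product:

\[
K(x, x') \;=\; (1 + x^{\mathsf{T}} x')^2 \;=\; z^{\mathsf{T}} z'
\quad \text{with} \quad
z \;=\; \big(1,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2\big)
\]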

  • 00:20:00 In this section, the lecturer explains how the polynomial kernel can be evaluated without actually expanding the polynomial: one takes the inner product, adds 1, and raises the result to the power Q, a simple operation that requires no huge expansion. This is an easy kernel to visualize in 2D and extrapolate to other cases. The lecturer then introduces an example of a kernel whose formula contains no visible inner-product term, yet corresponds to an inner product in an infinite-dimensional Z space. Despite the conceptual challenge of going to an infinite-dimensional space, the kernel remains cheap to compute, and the number of support vectors can still be used to assess the generalization of the model.
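
The infinite-dimensional claim can be made concrete with a standard expansion, shown here for a scalar input as in the lecture's illustration:

```latex
\exp\!\bigl(-(x - x')^2\bigr)
  = e^{-x^2}\, e^{-x'^2}\, e^{2xx'}
  = e^{-x^2}\, e^{-x'^2} \sum_{k=0}^{\infty} \frac{2^k}{k!}\, x^k (x')^k
  = \sum_{k=0}^{\infty} \Phi_k(x)\, \Phi_k(x'),
\qquad
\Phi_k(x) = \sqrt{\frac{2^k}{k!}}\; e^{-x^2}\, x^k .
```

The kernel is thus an inner product of infinitely many features Φ_k, even though evaluating it is a single exponential.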

  • 00:25:00 In this section, the lecturer demonstrates the radial-basis-function kernel, a sophisticated kernel that corresponds to an infinite-dimensional space, and shows how it works in action by taking a slightly non-separable case. The lecturer generates 100 points at random and shows that there is no line to separate them. Then, the lecturer transforms X into an infinite-dimensional space and computes the kernel, which is a simple exponential. The lecturer passes this on to quadratic programming, which gives back the support vectors. When the lecturer darkens the support vectors, it becomes easier to see the two classes.
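
A rough re-creation of this demo in scikit-learn; the target function, gamma, and C below are my own illustrative choices, not the lecture's:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.5)   # circular boundary: no separating line

clf = SVC(kernel="rbf", gamma=1.0, C=1e6)    # K(x, x') = exp(-gamma * ||x - x'||^2)
clf.fit(X, y)

plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.4)
sv = clf.support_vectors_
plt.scatter(sv[:, 0], sv[:, 1], c="black")   # "darken" the support vectors
plt.show()
```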

  • 00:30:00 In this section, the speaker discusses the idea of kernel methods and how they can be used for classification. He presents an example of applying a kernel to a dataset of points in order to transform them to an infinite-dimensional space where they can be separated by a hyperplane. The resulting margin and support vectors supply the in-sample quantity, the number of support vectors, that guides the generalization property. The speaker then explains how any valid kernel, one corresponding to an inner product in some Z space, can be used both in formulating the problem and in constructing the hypothesis. Overall, he emphasizes the usefulness of kernel methods and how they can be applied to solve classification problems.

  • 00:35:00 In this section, we learn how to translate the linear model into kernel form, where the support vector machine becomes a model parameterized by the choice of kernel. The kernel takes the place of every inner product taken in the Z space, and the resulting model depends on the kernel choice; we can also solve for b by plugging in a support vector. A kernel can be hard to validate, however, since you cannot verify that it corresponds to an inner product without exhibiting the Z space. Nonetheless, we can compare approaches by looking at the functional forms of different kernels. The resulting hypothesis and the formula for b are written out below.
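
In kernel form, the final hypothesis and the computation of b read as follows (matching the lecture's formulation, with (x_m, y_m) denoting any support vector, i.e. any point with α_m > 0):

```latex
g(x) = \operatorname{sign}\!\Bigl(\sum_{\alpha_n > 0} \alpha_n\, y_n\, K(x_n, x) \;+\; b\Bigr),
\qquad
b = y_m \;-\; \sum_{\alpha_n > 0} \alpha_n\, y_n\, K(x_n, x_m) .
```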

  • 00:40:00 In this section, the lecturer explains the conditions for obtaining a valid kernel. There are three approaches: construction, where the kernel is built from a conceptual or explicit set of transformations; Mercer’s condition, where a given kernel is checked mathematically, requiring that it be symmetric and that the matrix constructed from its values be positive semi-definite for any choice of points; and improvisation, where the kernel is simply used because its viability is a practical rather than a mathematical concern. Whichever route is taken, the two Mercer conditions, symmetry and positive semi-definiteness of the kernel matrix for every choice of points, are what make a kernel valid. A numerical spot-check of these conditions is sketched below.
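
The sketch below tests Mercer's condition empirically on random samples of points; passing random trials is evidence of validity, not a proof, since the condition must hold for every choice of points:

```python
import numpy as np

def gram(kernel, X):
    """Matrix of kernel values for every pair of points in X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def passes_mercer_spot_check(kernel, n_trials=20, n_points=30, d=3):
    rng = np.random.default_rng(0)
    for _ in range(n_trials):
        X = rng.normal(size=(n_points, d))
        K = gram(kernel, X)
        if not np.allclose(K, K.T):                 # symmetry
            return False
        tol = -1e-8 * np.abs(K).max()               # slack for round-off error
        if np.linalg.eigvalsh(K).min() < tol:       # positive semi-definiteness
            return False
    return True

print(passes_mercer_spot_check(lambda a, b: (1 + a @ b) ** 3))             # polynomial
print(passes_mercer_spot_check(lambda a, b: np.exp(-np.sum((a - b)**2))))  # RBF
```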

  • 00:45:00 In this section, the lecturer describes situations where the data is not linearly separable and how to apply the support vector machine algorithm in such cases. There are two scenarios of non-separable data: one where the non-separability is slight, and one where it is significant. For slightly non-separable data, one can accept a few errors and retain good generalization, instead of resorting to an inordinately complex high-dimensional space that fits every data point. In the case of serious non-separability, one must go for a nonlinear transformation and use kernels or soft-margin support vector machines. The lecturer then introduces the idea of margin violation and how to quantify it to account for classification errors.

  • 00:50:00 In this section, the lecturer introduces the concept of slack and the quantification of margin violations in SVMs. He introduces a slack variable for every point that measures its violation of the margin, and penalizes the total violation by adding up these slacks. He chooses this error measure, which is reasonable and directly measures margin violation, over the alternatives. He then states the new optimization: minimizing the margin-violation error term while still maximizing the margin. The constant C gives the relative importance of the margin-violation term versus the term that maximizes the margin; depending on the value of C, the result can range from the hard-margin solution on separable data to a compromise, reflecting the tradeoff between margin and slack. Finally, he reviews the Lagrangian formulation with the new terms added; the full optimization is written out below.
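
Written out, the soft-margin optimization described above is the following (x_n here stands for the possibly transformed input, i.e. z_n when a nonlinear transform is used):

```latex
\min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\, w^{\mathsf{T}} w \;+\; C \sum_{n=1}^{N} \xi_n
\qquad \text{subject to} \qquad
y_n \bigl( w^{\mathsf{T}} x_n + b \bigr) \;\ge\; 1 - \xi_n,
\qquad \xi_n \ge 0, \qquad n = 1, \dots, N .
```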

  • 00:55:00 In this section, the lecturer explains the new quadratic programming problem obtained by adding the variables xi to penalize margin violations. The Lagrangian includes new constraints on xi, which are handled with additional Lagrange multipliers, beta. The lecturer then shows that the minimization over w and b remains unchanged, and that setting the derivative with respect to each xi to zero makes the xi terms in the Lagrangian vanish. As a result, beta drops out of the Lagrangian, leaving the same dual as before; the only ramification is that each alpha is now not only greater than or equal to zero but also less than or equal to C. The derivation is sketched below.
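
The step where beta drops out is a one-line computation: stationarity of the Lagrangian with respect to each slack variable gives

```latex
\frac{\partial \mathcal{L}}{\partial \xi_n} \;=\; C - \alpha_n - \beta_n \;=\; 0
\quad\Longrightarrow\quad
\beta_n \;=\; C - \alpha_n \;\ge\; 0
\quad\Longrightarrow\quad
0 \;\le\; \alpha_n \;\le\; C .
```

Substituting β_n = C − α_n back into the Lagrangian cancels every ξ_n term, which is why the dual is unchanged apart from the new box constraint on alpha.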

  • 01:00:00 In this section of the video, the lecturer goes over the concept of soft margin support vector machines, which allow for some misclassification while still maintaining a wide margin. The solution involves an added constraint that requires alpha to be at most C, along with the already existing equality constraint. The soft margin support vector machines include both margin and non-margin support vectors, with the latter being the points that violate the margin, causing a slack that is represented by the value xi. The value of C is an important parameter that determines how much violation can occur, and this is usually determined through cross-validation.
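
A minimal sketch of the cross-validation step for choosing C (scikit-learn assumed; the data, kernel, and grid of C values below are arbitrary illustrations, not recommendations):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=200))   # noisy labels: not separable

search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                      cv=10)                         # 10-fold cross-validation
search.fit(X, y)
print("chosen C:", search.best_params_["C"])
```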

  • 01:05:00 In this section, the lecturer discusses practical points on using support vector machines (SVMs) and kernel methods. He explains that if the data is not linearly separable, quadratic programming may not converge, leading to a situation where there is no feasible solution. However, he encourages users to be lazy and still pass alphas from quadratic programming back to the solution to evaluate whether or not it separates the data. Additionally, he addresses concerns about the constant coordinate, 1, that is transformed with the data, explaining that it effectively plays the same role as the bias term, b, and that users need not worry about having multiple coordinates with the same role.

  • 01:10:00 In this section, the professor explains that whether support vector machines count as merely linear depends on certain assumptions, and that they can do better than linear in some cases. The dimension of the data may affect SVM effectiveness, but the RBF kernel can cope with its infinite-dimensional space because the higher-order terms of its expansion decay fast. A valid kernel needs a well-defined inner product, which in the infinite-dimensional case depends on convergence. The professor does not cover the generalization of SVMs to regression, which requires more technical detail, noting that SVMs' major success is in classification. Lastly, quadratic programming packages may complain that a kernel matrix is not positive definite, yet the returned solution may still be reliable enough to use in practice.

  • 01:15:00 In this section, the professor discusses the possibility of combining kernels to produce new kernels and the requirement for the combination to maintain an inner product in a Z space. He also mentions that the quadratic programming problem is the bottleneck in solving problems with SVMs and gives an estimate of the number of points that can be handled by quadratic programming. Additionally, he suggests heuristic methods that can be used when quadratic programming fails in solving SVMs with too many data points.
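
One standard way to combine kernels is to add them: a sum of valid kernels corresponds to concatenating the underlying Z spaces and is therefore itself valid. The sketch below (illustrative data and parameters of my own choosing) passes such a combined kernel to scikit-learn as a callable:

```python
import numpy as np
from sklearn.svm import SVC

def combined_kernel(A, B):
    """Sum of a linear kernel and an RBF kernel, evaluated on point matrices."""
    linear = A @ B.T
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    rbf = np.exp(-sq_dists)
    return linear + rbf            # Gram matrix of shape (len(A), len(B))

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = np.sign(X[:, 0] * X[:, 1])     # XOR-like target

clf = SVC(kernel=combined_kernel, C=1.0).fit(X, y)
print("accuracy on the training set:", clf.score(X, y))
```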