
 

Lecture 4. Eigenvalues and Eigenvectors



4. Eigenvalues and Eigenvectors

This video explains the concept of eigenvalues and eigenvectors and how they describe the action of a linear transformation. It also goes on to show how eigenvectors can be used to solve systems of difference equations.

  • 00:00:00 In this video, the author explains the concept of eigenvectors and eigenvalues for square matrices. They also discuss the usefulness of eigenvectors and eigenvalues for certain problems. Finally, the author discusses positive definite symmetric matrices and their importance.
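
As a concrete illustration of the definitions in this opening section (not code from the lecture itself), here is a minimal NumPy sketch that computes the eigenvalues and eigenvectors of a small square matrix and verifies the defining equation Av = λv; the example matrix is arbitrary.

```python
import numpy as np

# An arbitrary small square matrix (not from the lecture).
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

# Eigenvalues and eigenvectors: column i of V satisfies A @ V[:, i] = lam[i] * V[:, i].
lam, V = np.linalg.eig(A)

for i in range(len(lam)):
    v = V[:, i]
    # Check the defining equation A v = lambda v.
    assert np.allclose(A @ v, lam[i] * v)

print("eigenvalues:", lam)
```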

  • 00:05:00 The video discusses the concept of eigenvalues and eigenvectors and how they describe the action of a linear transformation. It also goes on to show how eigenvectors can be used to solve systems of difference equations.

  • 00:10:00 This video explains how eigenvalues and eigenvectors can be used to solve difference equations quickly. That is the principal use for which eigenvectors were invented: solving vector difference equations of the form u_{k+1} = A u_k (a sketch follows this bullet). In addition, the video explains how similar matrices have the same eigenvalues.
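
A minimal sketch of that principal use, assuming the standard recurrence u_{k+1} = A u_k; the matrix, starting vector, and number of steps are made up for illustration. Expanding u_0 in the eigenvector basis turns k steps of the recurrence into k-th powers of the eigenvalues.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])        # made-up matrix for illustration
u0 = np.array([1.0, 0.0])        # made-up starting vector
k = 20

# Eigen-decomposition A = X diag(lam) X^{-1}.
lam, X = np.linalg.eig(A)
c = np.linalg.solve(X, u0)       # expand u0 = X c in the eigenvector basis

# u_k = X diag(lam**k) c : each eigen-direction is simply multiplied by lam**k.
u_k = X @ (lam**k * c)

# Compare with stepping the recurrence u_{k+1} = A u_k directly.
u_direct = u0.copy()
for _ in range(k):
    u_direct = A @ u_direct
assert np.allclose(u_k, u_direct)
print(u_k)
```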

  • 00:15:00 The video explains how eigenvalues are computed and how they are related to eigenvectors. It also discusses how eigenvalues behave under powers and products of matrices (for example, A² has eigenvalues λ² with the same eigenvectors).

  • 00:20:00 In this video, the presenter discusses the concept of eigenvalues and eigenvectors and explains why eigenvalues alone do not determine a matrix. He then goes on to discuss how two matrices with the same eigenvalues can still be different in terms of their eigenvectors.

  • 00:25:00 In this video, the author specializes to symmetric matrices to discuss what is special about the eigenvalues and eigenvectors. He claims that an anti-symmetric matrix has imaginary eigenvalues.

  • 00:30:00 In this video, the eigenvalues and eigenvectors of a matrix are computed. Two quick checks are performed to verify the calculation (the trace of the matrix equals the sum of the eigenvalues, and the determinant equals their product). Finally, symmetric and positive-definite matrices are introduced.

  • 00:35:00 The video discusses the eigenvalues and eigenvectors of a symmetric matrix. The eigenvalues and eigenvectors reveal the matrix's structure, and one can verify that the eigenvalues stay the same under a similarity transformation. Additionally, the video discusses how to reach a diagonal matrix (diagonalization).

  • 00:40:00 In this video, the author diagonalizes a matrix: he finds the eigenvalues and a matrix M of eigenvectors so that M⁻¹AM is diagonal (a similar matrix). He then writes this information in matrix form and confirms that it is correct.

  • 00:45:00 This video discusses the concepts of eigenvalues and eigenvectors and how they are related. It goes on to explain how a symmetric matrix can be written in terms of its eigenvectors and eigenvalues as S = QΛQᵀ, the spectral theorem, and how to calculate this representation.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 5. Positive Definite and Semidefinite Matrices



5. Positive Definite and Semidefinite Matrices

In this video, the speaker summarizes the highlights from the previous lectures in linear algebra, including eigenvalues, determinants, and pivots, all of which provide tests for positive definite matrices. The speaker then explains the relationship between positive definite and indefinite matrices, their connection to eigenvalues and determinants, and how to compute the energy xᵀSx of a vector x for a matrix S. The speaker also discusses the concepts of deep learning, neural nets, machine learning, and minimizing an energy. They touch on the concept of a convex function and explain how it can be used in deep learning. Finally, the speaker introduces exercises for positive definite and semidefinite matrices and briefly mentions the upcoming topic of singular value decomposition.

  • 00:00:00 In this section, the speaker summarizes the highlights from the previous lectures in linear algebra, including eigenvalues, determinants, and pivots, all of which provide tests for positive definite matrices. He explains that positive definite matrices are the best of the symmetric matrices and have positive eigenvalues, but there are additional tests beyond eigenvalues. The speaker demonstrates how to determine whether a two-by-two matrix is positive definite by asking whether it has positive eigenvalues, a positive determinant, positive pivots, or whether it can be factored as AᵀA (the tests are sketched below).
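
A short sketch of these tests on a small symmetric matrix. The matrix [[3, 4], [4, 6]] is an assumption: it matches the diagonal entries 3 and 6 and the 8xy cross term quoted in the next bullet, but it is not taken verbatim from the lecture.

```python
import numpy as np

S = np.array([[3.0, 4.0],
              [4.0, 6.0]])   # assumed example, consistent with the 3, 6 and 8xy below

# Test 1: all eigenvalues positive.
print("eigenvalues:", np.linalg.eigvalsh(S))

# Test 2: leading determinants (upper-left submatrices) positive.
print("leading determinants:", S[0, 0], np.linalg.det(S))

# Test 3: pivots positive (they are the squares of the Cholesky factor's diagonal).
L = np.linalg.cholesky(S)          # succeeds only for a positive definite S
print("pivots:", np.diag(L)**2)

# Test 4 (energy test): x^T S x > 0 for any nonzero x; try a few random vectors.
for _ in range(3):
    x = np.random.randn(2)
    print("energy:", x @ S @ x)
```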

  • 00:05:00 In this section, the speaker discusses positive definite and indefinite matrices and their connection to eigenvalues and determinants. The determinant is the product of the eigenvalues, so if the determinant is negative, there is at least one negative eigenvalue. Indefinite matrices can be made positive definite by adjusting the diagonal entries, and the leading determinants (determinants of the submatrices in the upper left corner) must all be positive for positive definiteness. The speaker also connects pivots to determinants and elimination. Ultimately, the speaker defines positive definite matrices as those that pass the energy test.

  • 00:10:00 In this section, the speaker demonstrates how to compute the energy xᵀSx for a vector x and a matrix S, and shows that the energy of a positive definite matrix is greater than zero. The energy, in this case, is a pure quadratic function that could serve as a loss function in deep learning, measuring the difference between the training data and the numbers the model produces. In the example, the diagonal entries 3 and 6 give the squared terms 3x² and 6y², and the off-diagonal entries give the cross term 8xy, which can go negative (written out below).
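
Written out, with the off-diagonal entries taken to be 4 (an assumption consistent with the 8xy cross term), the energy is:

```latex
\begin{bmatrix} x & y \end{bmatrix}
\begin{bmatrix} 3 & 4 \\ 4 & 6 \end{bmatrix}
\begin{bmatrix} x \\ y \end{bmatrix}
= 3x^2 + 8xy + 6y^2 > 0
\quad \text{for all } (x, y) \neq (0, 0).
```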

  • 00:15:00 In this section, the speaker explains the relationship between deep learning, neural nets, machine learning, and minimizing an energy. The speaker uses the analogy of a bowl to visually demonstrate how finding the minimum of a quadratic works, and how non-linear terms make the problem more complicated. They then explain how machine learning on large problems can take over a week to compute because it involves minimizing complicated functions that can include more than 100,000 variables. The speaker also touches on the idea of a convex function and explains how it can be used in deep learning.

  • 00:20:00 In this section, the speaker discusses the concept of gradient descent, which is the primary algorithm used in deep learning, neural networks and machine learning. Starting from an initial point on a surface, the algorithm calculates the derivatives of the function to determine the direction of the steepest slope or gradient, and then follows this path until it reaches a minimum or turns upwards. The algorithm involves recalculation of the gradient at each step until the desired level of accuracy is achieved.
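
A minimal gradient-descent sketch on a pure quadratic energy f(x) = ½xᵀSx − bᵀx, whose gradient is Sx − b. The matrix (reusing the assumed example above), right-hand side, step size, and iteration count are made up; note the slow convergence, which echoes the next bullet's point about eigenvalues that are far apart.

```python
import numpy as np

S = np.array([[3.0, 4.0],
              [4.0, 6.0]])          # assumed positive definite "bowl"
b = np.array([1.0, 0.0])            # made-up right-hand side
f = lambda x: 0.5 * x @ S @ x - b @ x

x = np.zeros(2)                      # starting point
step = 0.1                           # fixed step size (made up)
for k in range(2000):
    grad = S @ x - b                 # gradient of the quadratic
    x = x - step * grad              # move downhill along the negative gradient

# The true minimizer of the quadratic solves S x = b.
assert np.allclose(x, np.linalg.solve(S, b), atol=1e-6)
print("minimizer:", x, "f:", f(x))
```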

  • 00:25:00 In this section, the concept of gradient descent is explained, which is commonly used in machine learning for optimization. It is mentioned that usually only first derivatives are computed for optimization since computing second derivatives for a large number of variables can be complicated. However, gradient descent has limitations, such as when going down a narrow valley. Positive definite matrices are important as they give a bowl-like shape for optimization, but if the eigenvalues are far apart, it can cause problems. Finally, the conversation shifts towards homework.

  • 00:30:00 In this section, the speaker introduces exercises for positive definite and semidefinite matrices. The speaker takes a positive definite matrix S and a positive definite matrix T and asks whether their sum, S + T, is positive definite. The speaker uses the energy test to answer this question, separating xᵀ(S + T)x into two pieces to show that the sum is indeed positive definite. The speaker also discusses whether the inverse S⁻¹ is positive definite, using the first test (its eigenvalues are the reciprocals 1/λ, which are positive). The speaker notes that a matrix should be symmetric before asking these questions, since symmetry guarantees real eigenvalues.

  • 00:35:00 In this section, the speaker discusses the concept of positive definite matrices and introduces the idea of semi-definite matrices. A positive definite matrix is a symmetric matrix whose eigenvalues are all positive. The speaker shows that multiplying a positive definite matrix S on both sides by an orthogonal matrix Q and its transpose, as QSQᵀ, gives a symmetric matrix; since similar matrices have the same eigenvalues, this new symmetric matrix is also positive definite. The speaker then introduces semi-definite matrices, whose eigenvalues are greater than or equal to zero. Such a matrix has determinant zero and at least one zero eigenvalue, while its trace, the sum of the eigenvalues, is still a nonnegative number.

  • 00:40:00 In this section, the concept of positive definite matrices is expanded to include positive semi-definite ones, which lie on the edge of the positive definite matrices. The eigenvalues of the 3-by-3 matrix of all ones are calculated to be 3, 0, and 0, making it a positive semi-definite matrix (checked below). The tests become eigenvalues and energies greater than or equal to 0, and dependent columns are now allowed. The matrix must be symmetric, and if its rank is only 1, then it cannot be positive definite, but it is positive semi-definite if its nonzero eigenvalue is positive.
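
A quick check of the numbers quoted here (the 3-by-3 all-ones matrix and its eigenvalues 3, 0, 0), plus the semidefinite energy test:

```python
import numpy as np

S = np.ones((3, 3))                      # the all-ones matrix from the lecture

print(np.linalg.eigvalsh(S))             # approximately [0, 0, 3]

# Semidefinite energy test: x^T S x = (x1 + x2 + x3)^2 >= 0, and it hits zero
# for dependent directions such as x = (1, -1, 0).
x = np.array([1.0, -1.0, 0.0])
print(x @ S @ x)                         # 0.0
```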

  • 00:45:00 In this section, the speaker briefly mentions that the topic of the next section will be the singular value decomposition (SVD). They also note that they have now covered positive definite and semidefinite matrices, indicating that they are moving on to more advanced topics in linear algebra.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 6. Singular Value Decomposition (SVD)



6. Singular Value Decomposition (SVD)

This video explains the concept of the Singular Value Decomposition (SVD), which factors a matrix into three matrices, where the middle one is diagonal and contains the singular values. The SVD helps understand the relationship between A, U, Σ, and V, ultimately helping to solve equations. The video discusses the importance of orthogonal vectors, eigenvectors, and eigenvalues in the SVD, and emphasizes the orthogonality of the U and V matrices. The video also explains the graphical representation of the SVD process and the polar decomposition of a matrix. Finally, the video discusses the process of extracting the most important part of a big matrix of data using the SVD.

  • 00:00:00 In this section, the instructor discusses the concept of the Singular Value Decomposition (SVD), which plays the role of eigenvalues but applies to rectangular matrices. Eigenvalues are not workable for rectangular matrices, and even for square non-symmetric matrices the eigenvectors can be complex or not orthogonal. The SVD introduces two sets of singular vectors and singular values in place of eigenvectors and eigenvalues, respectively. The key to the SVD is that AᵀA is a good matrix: it is square and symmetric, built as a product involving the rectangular A. The first step in the SVD is to show that any matrix can be factored as A = UΣVᵀ.

  • 00:05:00 In this section, the speaker discusses the factorization of the matrix AᵀA and its eigenvectors and eigenvalues. This matrix is square, symmetric, and positive (semi)definite, so its eigenvalues are nonnegative and their square roots give the singular values. The companion matrix AAᵀ has the same nonzero eigenvalues but different eigenvectors. The speaker then talks about the factorization of A itself: we look for a set of orthogonal vectors v that A carries into a set of orthogonal vectors u. These vectors are used to build the singular value decomposition, a factorization of A into three matrices where the middle one is diagonal and contains the singular values of A (a construction sketch follows this bullet).
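
A sketch of that construction for an arbitrary rectangular matrix: take the eigenvectors v of AᵀA, set σ = √λ, and build u = Av/σ. The result matches NumPy's own SVD up to sign choices; the matrix is random, purely for illustration.

```python
import numpy as np

np.random.seed(0)
A = np.random.randn(5, 3)                 # arbitrary rectangular matrix

# Eigen-decomposition of the symmetric positive semidefinite matrix A^T A.
lam, V = np.linalg.eigh(A.T @ A)          # eigenvalues come out ascending
order = np.argsort(lam)[::-1]             # sort descending, like the SVD convention
lam, V = lam[order], V[:, order]

sigma = np.sqrt(lam)                      # singular values = square roots of eigenvalues
U = A @ V / sigma                         # u_i = A v_i / sigma_i; columns come out orthonormal

# Check A = U Sigma V^T and compare singular values with np.linalg.svd.
assert np.allclose(A, U @ np.diag(sigma) @ V.T)
assert np.allclose(sigma, np.linalg.svd(A, compute_uv=False))
print(sigma)
```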

  • 00:10:00 In this section, the orthogonality of the v's is placed in the big picture of linear algebra, where the space is divided into the column space, the row space, and the null spaces. It is shown that when the v's are multiplied by A, the resulting u's are also orthogonal, which is what makes the v's special. A matrix form of the equations is presented, and it is revealed that by looking at AᵀA, the problem of finding orthogonal (indeed orthonormal) u's can be simplified. It is concluded that AᵀA is symmetric and positive (semi)definite, and its diagonalization tells us what the v's are.

  • 00:15:00 In this section, the speaker continues with the Singular Value Decomposition (SVD). The v's in the SVD are the eigenvectors of AᵀA, and the entries of ΣᵀΣ (the squares σ² of the singular values) are its eigenvalues. The SVD is established by taking the final step of understanding the eigenvectors for repeated (double or triple) eigenvalues. The SVD clarifies the relationship between A, U, Σ, and V, which will ultimately help solve equations such as AAᵀx = b.

  • 00:20:00 In this section, the speaker explains the final step of the SVD construction: proving that the chosen basis vectors u are orthogonal. To do so, the speaker shows that the dot product of u₁ and u₂ equals zero. Since u₁ = Av₁/σ₁ and u₂ = Av₂/σ₂, the dot product becomes v₁ᵀAᵀAv₂/(σ₁σ₂); because v₂ is an eigenvector of AᵀA with eigenvalue σ₂², this equals (σ₂/σ₁)·v₁ᵀv₂ = 0, proving that the basis vectors u are orthogonal (the computation is written out below).
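
The orthogonality check in this step, written out (with the σ's nonzero):

```latex
u_1^{\mathsf T} u_2
= \frac{(A v_1)^{\mathsf T} (A v_2)}{\sigma_1 \sigma_2}
= \frac{v_1^{\mathsf T} (A^{\mathsf T} A) v_2}{\sigma_1 \sigma_2}
= \frac{v_1^{\mathsf T} (\sigma_2^2 v_2)}{\sigma_1 \sigma_2}
= \frac{\sigma_2}{\sigma_1}\, v_1^{\mathsf T} v_2 = 0 .
```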

  • 00:25:00 In this section, the speaker discusses the orthogonality of the U and V matrices in the Singular Value Decomposition (SVD) and their relation to eigenvectors. The u's and v's form orthonormal bases for the column space and the row space of A, respectively. The speaker then discusses the history of the discovery and the importance of this relationship for data matrices. The speaker cautions against using AᵀA to compute the SVD numerically, as it can be computationally expensive and vulnerable to roundoff errors. Finally, the speaker uses a diagram to explain how the SVD factors can be thought of as a rotation, a stretch, and another rotation.

  • 00:30:00 In this section, the concept of the Singular Value Decomposition (SVD) is explained through a graphical representation of the process. The video demonstrates how the orthogonal matrix Vᵀ rotates the unit vectors, how Σ stretches them into an ellipse, and how the orthogonal matrix U then rotates the ellipse. If the matrix is symmetric and positive definite, then U is the same as V and the SVD coincides with the eigenvalue factorization S = QΛQᵀ. The video also explains how the parameters in the factorization can be counted.

  • 00:35:00 In this section, the speaker explains the matching of the numbers between the left and right sides in the singular value decomposition (SVD) using a two by two example. The rotation in the SVD requires two parameters, while the stretching requires two parameters, which adds up to a total of four parameters that match the four numbers in the SVD. Additionally, the speaker talks about the calculation of the SVD for a three by three matrix and suggests that a rotation in 3D space requires three parameters, namely roll, pitch, and yaw. Finally, the speaker mentions that the example for the SVD presented in the text is for a specific matrix and introduces a few facts about eigenvalues and singular values.

  • 00:40:00 In this section, the speaker explains that the determinant of a square matrix equals (up to sign) the product of its singular values, and the example confirms that the product of the σ's matches the determinant. However, computing examples of the SVD by hand takes more time, since one has to take square roots of the eigenvalues of AᵀA. The speaker emphasizes the two shapes of the SVD that will be used in the next session: the smaller (economy) form, which keeps only the nonzero singular values, and the larger (full) form, which also accounts for the null space.

  • 00:45:00 In this section, the speaker introduces the polar decomposition of a matrix, which factors any matrix into a symmetric matrix times an orthogonal matrix. This is a famous factorization in engineering and geometry, and it can be obtained quickly from the SVD: by inserting an identity and shifting things slightly, the S and Q can be read off from the SVD (sketched below). In mechanical engineering language it says that any strain can be described as a symmetric stretch plus an internal twist.
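
A sketch of reading the polar decomposition off the SVD by inserting UᵀU = I between the factors, in the symmetric-times-orthogonal order described here; the example matrix is arbitrary.

```python
import numpy as np

np.random.seed(1)
A = np.random.randn(3, 3)            # arbitrary square matrix

U, sigma, Vt = np.linalg.svd(A)

# Insert U^T U = I between Sigma and V^T:
#   A = U Sigma V^T = (U Sigma U^T)(U V^T) = S Q
S = U @ np.diag(sigma) @ U.T         # symmetric positive semidefinite "stretch"
Q = U @ Vt                           # orthogonal "rotation / twist"

assert np.allclose(S, S.T)
assert np.allclose(Q @ Q.T, np.eye(3))
assert np.allclose(A, S @ Q)
print(np.round(S, 3), np.round(Q, 3), sep="\n")
```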

  • 00:50:00 In this section, the speaker explains the process of extracting the most important part of a big matrix of data, which data science must do, since part of the matrix is noise and part of it is signal. To find the most significant part of the signal, the speaker examines UΣVᵀ and picks out the most essential number, σ₁. This number, together with its column u₁ and row v₁ᵀ, forms the most important rank-one piece σ₁u₁v₁ᵀ of the matrix, the part with the greatest variance. The next step is to compute these three pieces in order to understand the data more fully.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 7. Eckart-Young: The Closest Rank k Matrix to A



7. Eckart-Young: The Closest Rank k Matrix to A

In this YouTube video, the lecturer explains the concept of principal component analysis (PCA), which is used for understanding a matrix of data and extracting meaningful information from it. The importance of the largest k singular values of a matrix, which contain the most crucial information, is highlighted, and the Eckart-Young theorem, which states that the first k pieces of a singular value decomposition provide the best rank k approximation to the matrix, is introduced. The speaker also discusses different types of norms for vectors and matrices, including the l2, l1, and infinity norms. The importance of the Frobenius norm in the Netflix competition and MRI scans is highlighted, along with the concept of the closest rank k matrix to A. The speaker also discusses the use of orthogonal matrices in preserving the properties of the original matrix and introduces the concept of Singular Value Decomposition (SVD) and how it relates to PCA. Lastly, the importance of solving a linear system of equations involving the rectangular matrix A and its transpose is discussed, along with the use of the SVD method in finding the best ratio of age to height for a given dataset.

  • 00:00:00 In this section, the lecturer explains the concept of principal component analysis (PCA), which is a tool used for understanding a matrix of data. He highlights the importance of extracting meaningful information from the data rather than copying it all. He explains that the largest k singular values of the matrix contain the most important facts, and that A_k, built from the first k pieces of the SVD, is the best rank-k approximation to A. The Eckart-Young theorem, which states this precisely, is introduced, and the lecturer explains the different measures (norms) used to compare matrices.

  • 00:05:00 In this section, the speaker discusses different types of norms for vectors and matrices. The l2 norm of a matrix, the largest singular value, is an important norm. The speaker explains that when minimizing a function using the l1 norm, the winning vector is sparse, mostly consisting of zero components, which is useful in signal processing and sensing. Minimization in the l1 norm is known as basis pursuit, and it is important because the few nonzero components of the winning vector can be interpreted. The l2 and l1 norms are compared, and the speaker also introduces the infinity norm.

  • 00:10:00 In this section, the speaker explains three important matrix norms. The first is the two norm, which is similar to the length of a vector and satisfies the triangle inequality. The second is the Frobenius norm, which treats a matrix's entries like a long vector and takes the square root of the sum of their squares. The third is the nuclear norm, which is the sum of a matrix's singular values. These norms are important because they all satisfy the Eckart-Young statement that the closest rank K approximation to a matrix can be found from its first K singular values.
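
All three of these matrix norms can be read off the singular values, which gives a quick numerical check of the statements in this bullet; the matrix below is arbitrary.

```python
import numpy as np

np.random.seed(2)
A = np.random.randn(4, 3)                    # arbitrary matrix
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, largest first

two_norm = sigma[0]                          # two (spectral) norm = largest singular value
frobenius = np.sqrt(np.sum(sigma**2))        # Frobenius norm = sqrt of sum of sigma_i^2
nuclear = np.sum(sigma)                      # nuclear norm = sum of singular values

assert np.isclose(two_norm, np.linalg.norm(A, 2))
assert np.isclose(frobenius, np.linalg.norm(A, 'fro'))
assert np.isclose(nuclear, np.linalg.norm(A, 'nuc'))
print(two_norm, frobenius, nuclear)
```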

  • 00:15:00 In this section, the speaker discusses how the L2 and Frobenius norms of a matrix depend only on its singular values. The Frobenius norm was used in the Netflix competition where participants had to complete a large matrix of movie rankings with missing entries, and it turned out to be the right norm for the best nuclear norm completion of the matrix. This method of matrix completion is now being used for MRI scans with missing data, where it can produce an excellent picture even with incomplete data.

  • 00:20:00 In this section, the speaker discusses the concept of the closest rank k matrix to A. This involves completing a matrix by filling in what the MRI would have seen in the positions where it didn't look long enough, using the nuclear norm. The example given is of a rank four matrix, and to find the best approximation of rank two, the speaker picks 4 and 3 as the two largest values. Any other matrix B would be further away from A than this chosen matrix, although it is not obvious because it depends on the norm. The point of the theorem is that it is not easy to find the closest rank k matrix to A, and a proof is needed.
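
A sketch of the rank-k truncation the theorem is about: keep the k largest singular values and their singular vectors. The diagonal matrix with entries 4, 3, 2, 1 is an assumed stand-in for the rank-four example whose two largest values are 4 and 3.

```python
import numpy as np

A = np.diag([4.0, 3.0, 2.0, 1.0])       # assumed stand-in for the rank-4 example
k = 2

U, sigma, Vt = np.linalg.svd(A)
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]   # keep the k largest pieces

# Eckart-Young: A_k is the closest rank-k matrix to A in the 2, Frobenius,
# and nuclear norms.  For this diagonal example the 2-norm error is sigma_{k+1} = 2.
print(A_k)
print(np.linalg.norm(A - A_k, 2))        # 2.0
```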

  • 00:25:00 In this section, the speaker discusses how diagonal matrices are not as special as they seem and introduces the concept of an orthogonal matrix, which can be used to multiply on both sides of a given matrix. The speaker poses the question of what happens to the singular values of a matrix when multiplied by an orthogonal matrix, and explains that the singular values will not change. The speaker also explains that the norms of vectors are not changed by orthogonal matrices, and concludes that orthogonal matrices are just as good as diagonal matrices in terms of preserving the properties of the original matrix.

  • 00:30:00 In this section, the Singular Value Decomposition (SVD) is used to answer the question about QA, where Q is orthogonal: the SVD of QA is (QU)ΣVᵀ, and since QU is again an orthogonal matrix, the singular values of QA are the same as those of A. This section then introduces Principal Component Analysis (PCA) and explains how to extract meaningful insights from data points. The first step in PCA is to get mean zero by subtracting the average value of each component (row) of the data. The section further explains how the resulting values can be used to find the best linear relationship between the components.

  • 00:35:00 In this section, the speaker discusses Principal Component Analysis (PCA) and how it differs from least squares. While least squares measures the errors between points and a line, PCA measures the perpendicular distance of points from a line and adds up their squares to minimize them. Therefore, the solution to this problem involves the Singular Value Decomposition (SVD) Sigmas instead of the equations found in ordinary linear algebra. The speaker distinguishes the problem of finding the best linear relation in PCA from finding the least square solution as the former problem aims to model nonlinear data in a linear way.
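
A minimal PCA sketch following the steps just described: subtract the mean of each row (component), take the SVD of the centered data, and read the leading direction from the first left singular vector. The age/height data below are made up; the variable names simply echo the example mentioned at 00:45:00.

```python
import numpy as np

np.random.seed(3)
n = 200
# Made-up data: rows are the two components (age and height), columns are samples.
ages = np.random.uniform(5, 18, n)
heights = 6.0 * ages + np.random.randn(n)        # roughly linear, slope about 6
A = np.vstack([ages, heights])

A0 = A - A.mean(axis=1, keepdims=True)           # step 1: make each row mean zero

U, sigma, Vt = np.linalg.svd(A0, full_matrices=False)
u1 = U[:, 0]                                     # first principal direction (largest variance)

print("principal direction:", u1)
print("slope (height per unit age):", u1[1] / u1[0])   # close to 6 for this fake data
```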

  • 00:40:00 In this section, the speaker discusses the importance of solving the linear system involving the rectangular matrix A and its transpose. While this is a fundamental application in 18.06, the speaker notes that it is not the same as principal component analysis (PCA), which statisticians have been applying for a long time. He notes that the covariance matrix, or the sample covariance matrix, which involves the mean and the variance, plays a huge role in such statistical applications. In particular, the sample covariance matrix is computed from the samples and normalized by the number of data points, and it is exactly AAᵀ divided by that normalizing factor.

  • 00:45:00 In this section, the speaker introduces a problem that involves finding the best ratio of age to height for a given dataset. The objective is to minimize the distance between the given data and the solution. The speaker suggests that the answer lies in finding the vector that points in the right direction, which could be a principal component in the symmetric positive definite matrix. The SVD method is proposed as a solution to this problem.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.07.18).
 

Lecture 8: Norms of Vectors and Matrices



Lecture 8: Norms of Vectors and Matrices

This lecture discusses the concept of norms of vectors and matrices, including the L1 and max norms, and their application in fields such as compressed sensing and signal processing. The lecture also covers the importance of the triangle inequality for norms, the shape of the S-norm's unit ball, and the connection between the L2 norm of vectors and the two norm of matrices. Additionally, the lecture explores the Frobenius norm and the nuclear norm, whose role in optimizing neural nets remains a conjecture, and emphasizes the importance of teaching and learning alongside students.

  • 00:00:00 In this section, the speaker discusses an interesting observation made by a faculty member at the Sloan School of MIT concerning how people guess the outcome of coin tosses. He explains that although, in theory, the optimal strategy would be to consistently guess heads, people and animals end up guessing tails about a quarter of the time, even though the odds of getting heads are much higher. The reason for this is not explained as the speaker did not have enough time to hear the explanation. The speaker also briefly introduces the concept of norms and their importance in measuring the size of vectors, matrices, tensors, and functions.

  • 00:05:00 In this section, the concept of norms of vectors and matrices is discussed. The lecturer introduces different types of norms, like the L1 norm and the max norm, that are integral in the fields of compressed sensing and signal processing. He explains that the p-norm is the p-th root of the sum of the p-th powers of the components, so that the norm of 2v is exactly twice the norm of v (the definition is written out below). Additionally, the zero "norm", the number of non-zero components, is introduced as a measure of the sparsity of vectors and matrices. However, it is not a true norm, because doubling a vector leaves its number of nonzero components unchanged, and the discussion notes that proper norms exist for p between one and infinity.
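
For reference, the p-norm being described, the scaling property that makes it a norm, and the three special cases discussed in this lecture are:

```latex
\|v\|_p = \Big( |v_1|^p + |v_2|^p + \cdots + |v_n|^p \Big)^{1/p},
\qquad
\|2v\|_p = \Big( 2^p \sum_i |v_i|^p \Big)^{1/p} = 2\,\|v\|_p ,
```
```latex
\|v\|_1 = \sum_i |v_i|, \qquad
\|v\|_2 = \Big(\sum_i v_i^2\Big)^{1/2}, \qquad
\|v\|_\infty = \max_i |v_i| .
```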

  • 00:10:00 In this section, the lecturer discusses the unit balls of vector norms. The unit ball for the l2 norm is a circle with the equation v1² + v2² = 1. The unit ball for the l1 norm is a diamond, with the straight line v1 + v2 = 1 as its boundary in the positive quadrant. The unit ball for the max norm is a square: the points (±1, 0) and (0, ±1) lie on it, and the rest of the boundary, out to the corners, takes a little thought to figure out. As the number p changes, the unit ball starts as a diamond at p = 1, swells out to a circle at p = 2, and becomes a square at p = infinity. Finally, for the zero "norm", which is not a true norm, the points with only one non-zero component lie on the axes.

  • 00:15:00 In this section, the lecturer discusses different types of norms, such as the L1 or Manhattan norm, the L2 or Euclidean norm, and the S-norm, which is defined by a positive definite symmetric matrix S. The lecturer notes the importance of the triangle inequality for norms, which is broken in certain cases, such as the Lp "norm" with p less than one. Additionally, the unit ball of the S-norm is shown to be an ellipse, which satisfies the convexity property; shapes that are not convex signal that the rules of a norm have been violated.

  • 00:20:00 In this section, the lecturer discusses the different types of norms that can be applied to vectors and matrices. The S-norm reduces to the L2 norm when the matrix S is the identity, but using a different matrix S changes the shape of the unit ball; a typical case weights the components differently (the example uses a 3), creating a weighted norm whose unit ball is an ellipse. The Lp norms are variations with different values of p, and the S-norms are weighted versions of the L2 norm. The lecturer also briefly mentions the basis pursuit problem and ridge regression, with their respective L1 and L2 norms.

  • 00:25:00 In this section, the lecturer discusses the concept of norms in optimization, particularly the L1 and L2 norms. Using the example of finding the point on a line with the smallest L2 norm and then the smallest L1 norm, the lecturer emphasizes that the point with the smallest L1 norm is the winner and has the most zeros, making it a sparse vector. This is an important fact that extends into higher dimensions and makes the L1 norm special. Overall, the lecture delves into the nuances and applications of norms in optimizing neural nets and life in general.

  • 00:30:00 In this section, the speaker discusses the L1-norm winner and why going further up the line is not advisable, since it only increases the nonzero second component. They also introduce the two-norm of a matrix and how it is connected to the two norm of vectors through a blow-up factor: the maximum, over all x, of the ratio ‖Ax‖₂ / ‖x‖₂. The matrix norm is defined as this maximum blow-up factor.

  • 00:35:00 In this section, the lecturer discusses norms of matrices and how to find a good norm of a matrix. The maximum value of the ratio in the two norm is exactly σ₁, and the maximizing vector is the first singular vector v₁, so the norm identifies that singular vector without computing all of them. Other matrix norms can be obtained by maximizing the blow-up factor measured in other vector norms. Singular vectors are the right tool here; eigenvectors may not work when the matrix is not symmetric.

  • 00:40:00 In this section, the lecturer discusses the Frobenius norm of matrices, denoted by a capital F subscript, which is the square root of the sum of the squares of all the matrix entries. This norm is related to the σ's: the squared Frobenius norm equals the sum of the squares of the singular values of the SVD. Furthermore, the lecture explores how multiplying by an orthogonal matrix leaves the Frobenius norm unchanged, and how the nuclear norm is related to deep learning optimization algorithms.

  • 00:45:00 In this section, the lecturer discusses the conjecture that in a model situation, optimization by gradient descent picks out the weights that minimize the nuclear norm. The nuclear norm is the sum of the singular values of a matrix, similar to the L1 norm for vectors. This conjecture remains unproven, but the idea has potential applications in deep learning and compressed sensing. The lecturer emphasizes that his job is not to grade his students but to teach and learn with them. The lecture concludes with an announcement of homework three, which will use the notes from sections eight and nine.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 9. Four Ways to Solve Least Squares Problems



9. Four Ways to Solve Least Squares Problems

In this video, the instructor discusses the concept of least squares and various ways to approach it. He emphasizes the importance of least squares, as it is an essential problem in linear algebra and serves as the glue that holds the entire course together. The video covers the pseudo-inverse of matrices, the SVD of invertible and non-invertible matrices, and different methods to solve least squares problems, including Gauss's plan and orthogonal columns. The video also discusses the idea of minimizing the distance between Ax and b, measured by the L2 norm squared, and how it relates to linear regression and statistics. Additionally, the video provides insight into a project that uses the material learned in the course, focusing on areas like machine learning and deep learning.

  • 00:00:00 In this section, the instructor discusses the importance of least squares and how it is an essential problem in linear algebra. He mentions that there are various ways to approach least squares, and this subject is the glue that holds the entire course together. He also mentions that there won't be any final exams or tests, but instead, he will encourage a project that uses the material learned in the course. The project will include different areas like machine learning and deep learning, and he will send out a message about the details of the project as the time comes.

  • 00:05:00 In this section, the speaker explains the concept of the pseudo-inverse of a matrix. The inverse, when it exists, lets us multiply by it and get back to the original vector, but for a matrix with no inverse we turn to the pseudo-inverse. This is relevant when the matrix is rectangular, has zero eigenvalues, or has a null space. The speaker uses a picture of the row space and column space to explain which part of the mapping is invertible (row space to column space) and which part is hopeless (the null space). The pseudo-inverse will be used to solve problems when the matrix is not invertible, providing an adequate solution.

  • 00:10:00 In this section, the speaker explains how to define the pseudo-inverse of a matrix for situations where the matrix cannot be inverted. They discuss how to handle the null space of the matrix and what the pseudo-inverse should do there. The speaker lays out a plan for what the pseudo-inverse should do on the column space and on the orthogonal space that nothing maps onto. Using the SVD, they give a formula for the pseudo-inverse, A⁺ = VΣ⁺Uᵀ, in which Σ⁺ inverts the nonzero singular values, so that Σ⁺Σ has an identity block on the top and zeros on the bottom.

  • 00:15:00 In this section, the video discusses the SVD of an invertible matrix, where the SVD takes the v's to the u's and the inverse takes the u's back to the v's. If a matrix is not invertible, its rectangular Σ matrix has no inverse, and Σ⁻¹ must be replaced with the pseudo-inverse Σ⁺. The video shows an example of a matrix with two independent columns, where Σ has only two non-zeros and the rest are zeros; the best option is to use Σ⁺ in place of Σ⁻¹ (sketched below).
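
A sketch of Σ⁺ and A⁺ = VΣ⁺Uᵀ for a rank-deficient rectangular matrix, compared against NumPy's pinv; the example matrix is made up (its second column is twice the first, so there is no ordinary inverse).

```python
import numpy as np

# Made-up rectangular matrix with dependent columns (rank 1).
A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0]])

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Sigma+ inverts the nonzero singular values and leaves the zeros alone.
cutoff = 1e-10 * sigma[0]
sigma_plus = np.array([1.0 / s if s > cutoff else 0.0 for s in sigma])

A_plus = Vt.T @ np.diag(sigma_plus) @ U.T      # A+ = V Sigma+ U^T

assert np.allclose(A_plus, np.linalg.pinv(A, rcond=1e-10))
# A+ A is the projection onto the row space (rank 1 here), not the identity.
print(np.round(A_plus @ A, 6))
```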

  • 00:20:00 In this section, the concept of Σ⁺, the pseudo-inverse of Σ, is introduced as a solution for rectangular matrices that cannot be inverted. The pseudo-inverse is used to solve the least squares problem where there is an equation Ax = b but A is not invertible; this problem arises when there are too many measurements or noise. The Σ⁺ matrix handles the vectors in the column space, while the vectors in the orthogonal space are treated as unsolvable and sent to zero. The first way to solve the least squares problem is therefore to use the pseudo-inverse: x = A⁺b.

  • 00:25:00 In this section, the speaker discusses the least squares problem of fitting a straight line to noisy measurements using a linear system of equations. They explain that if the measurements happened to lie on a line, the linear system would have a solution, but in general it does not. They then introduce the idea of minimizing the distance between Ax and the actual measurements b, using the L2 norm squared. This technique was proposed by Gauss and is used to find the best values of C and D in the equation Cx + D for the straight line closest to the measurements.

  • 00:30:00 In this section, the speaker explains the concept of least squares and how it is used to handle unsolvable problems in linear regression and statistics. Minimizing the quadratic loss function produces a system of linear equations, which ultimately gives the best answer, following Gauss's advice. The best x̂ is found by solving the normal equations AᵀA x̂ = Aᵀb, which gives the minimum. The speaker then draws a graph to explain the column space of A, how b is not in the column space, and how least squares and the normal equations lead to the best Ax̂, the projection of b onto the column space (a sketch follows this bullet).
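
A sketch of Gauss's recipe for fitting the line Cx + D to noisy measurements: build the tall matrix A with columns x and 1, then solve the normal equations AᵀA x̂ = Aᵀb. The data points are made up; np.linalg.lstsq gives the same answer.

```python
import numpy as np

# Made-up noisy measurements b_i taken at points x_i.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Fit the line C*x + D: the tall matrix A has columns x and all-ones.
A = np.column_stack([x, np.ones_like(x)])

# Normal equations: A^T A [C, D]^T = A^T b.
CD = np.linalg.solve(A.T @ A, A.T @ b)

# Same answer from the library least-squares routine.
CD_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(CD, CD_lstsq)
print("C, D =", CD)
```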

  • 00:35:00 In this section, the speaker discusses different methods to solve least squares problems. Method 2 involves solving the normal equations directly in MATLAB; however, this may not work well if the matrix has nearly dependent columns. Method 3, Gauss's plan, works only if the matrix has independent columns, so that AᵀA is invertible. The pseudo-inverse method can also be used when the matrix itself is not invertible but still has independent columns. The importance of invertibility is emphasized throughout the section.

  • 00:40:00 In this section, the speaker explains that when the null space of A is zero, the answer from the pseudo-inverse method is the same as the answer from the formula (AᵀA)⁻¹Aᵀb. When A does have a null space, AᵀA is no longer invertible; the matrix A⁺A then does its best to be the identity but only reaches a projection. The pseudo-inverse is shown to give an answer whatever the rank is, and to agree with the other formula when the ranks match.

  • 00:45:00 In this section, the speaker discusses two more ways to solve least squares problems. The third way involves getting orthogonal columns first, which would make the problem easier. The Gram-Schmidt procedure is one way to get orthogonal vectors in a natural way. The fourth and final way to solve least squares problems is not discussed in detail, but it involves taking advantage of the fact that data in real life is often sparse. The speaker concludes by noting that least squares is not a new concept and continues to be used for good reason.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 10: Survey of Difficulties with Ax = b



Lecture 10: Survey of Difficulties with Ax = b

In this lecture on numerical linear algebra, the difficulties with solving linear equations of the form Ax=b are discussed. These difficulties arise when the matrix A is nearly singular, making its inverse unreasonably large, and when the problem is too big with a giant matrix that is impossible to solve in a feasible time. The lecturer outlines several possibilities for solving the problem, ranging from the easy normal case to the extremely difficult case of underdetermined equations. The use of randomized linear algebra, iterative methods, and the SVD are discussed, along with the importance of finding solutions that work on test data, particularly with deep learning. Additionally, the lecturer emphasizes that the SVD is still the best tool for diagnosing any matrix issues.

  • 00:00:00 In this section, the lecturer discusses the difficulties that can arise when attempting to solve the equation Ax = B. He notes that the problem can occur in various sizes and ranks, and may be nearly singular or not nearly singular. He outlines several possibilities for solving the problem, ranging from the easy normal case of a square matrix with a reasonable condition number, to the extremely difficult case of underdetermined equations. In the latter case, the lecturer notes that the problem is common in deep learning and that multiple solutions may exist.

  • 00:05:00 In this section, the lecturer discusses difficult problems with Ax = b and how to approach them. These problems usually arise when the columns of the matrix are nearly dependent, making it problematic to accept the columns a1, a2, up to an of the given matrix. The solution to this is to find orthonormal column vectors in that column space by using Gram-Schmidt and fixing the columns by orthogonalizing them. The lecturer saves Gram-Schmidt discussion to the next lecture but previews the importance of column pivoting which allows reordering of the columns, a concept that is also applicable in elimination.

  • 00:10:00 In this section, the lecturer discusses the difficulties with solving linear equations of the form Ax = b, including the possibility that the matrix may be nearly singular, making its inverse unreasonably large. The lecturer also talks about inverse problems, which are typically problems where you know the system's output but must determine the network's structure or input. These problems often give nearly singular matrices, making it difficult to solve the system accurately without adding a penalty term. The LU and QR worlds, row exchanges, and Gram-Schmidt orthogonalization are also mentioned.

  • 00:15:00 In this section, we learn about some difficulties involved in solving Ax = b. One such difficulty is when the matrix A is badly conditioned, with columns nearly dependent, so that the inverse of AᵀA becomes giant. To counter this, we add a penalty term, which makes the problem well conditioned but shifts the question to deciding how much to penalize. Another approach is iterative methods, like the conjugate gradient method, which take steps closer and closer to the exact answer until it is close enough. When the problem is too big, with a giant matrix that is impossible to solve in a feasible time, randomized linear algebra is used to sample the columns and rows of the matrix and produce an answer from the sample.

  • 00:20:00 In this section, the lecturer discusses the use of randomized linear algebra to determine solutions to difficult problems in cases where the matrix is reasonable. While there is no guarantee that the solutions will be correct, using the probabilities of inequalities can yield a good solution to the problem. Iterative methods and randomized algorithms, along with the use of the SVD, are discussed as methods of finding solutions. The lecturer emphasizes the importance of finding solutions that work on test data, particularly with deep learning, and discusses the deep mathematical questions that arise with this problem. The SVD is explained as a potential solution when the matrix is nearly singular.

  • 00:25:00 In this section, the professor discusses a method to regularize the problem of minimizing ‖Ax − b‖² when the inverse would be huge. By solving a least squares problem with an additional penalty term involving a positive δ, the problem remains solvable and stays safely away from singular, even when δ goes to zero or A does crazy things. As δ goes to zero, the behavior of the result changes drastically, and the right choice of δ may depend on the level of noise in the system.

  • 00:30:00 In this section of the video, the speaker analyzes the solution for a given δ and asks what happens in the limit. The focus is on a one-by-one problem: minimizing a penalized least squares expression in a single variable. The equation is solved by setting the derivative to zero, and the resulting x is examined as δ goes to zero. The two possibilities are that σ is not zero, in which case the solution approaches b/σ (the inverse), or σ is zero, in which case the solution goes to zero (the worked computation is below).
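
Written out for the one-by-one case, taking the penalty as δ²x² (an assumption about the exact form used in the lecture): minimize (σx − b)² + δ²x², set the derivative to zero, and let δ shrink.

```latex
2\sigma(\sigma x - b) + 2\delta^2 x = 0
\;\;\Longrightarrow\;\;
x_\delta = \frac{\sigma b}{\sigma^2 + \delta^2}
\;\;\xrightarrow{\;\delta \to 0\;}\;\;
\begin{cases}
b/\sigma & \text{if } \sigma \neq 0 \quad (\text{the inverse}),\\[4pt]
0 & \text{if } \sigma = 0 \quad (\text{the pseudo-inverse value}).
\end{cases}
```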

  • 00:35:00 In this section of the video, the speaker discusses the behavior of the penalized squares approach when the penalty term goes to zero. The speaker notes that in this case, the system behaves in a strange way, with a sudden bifurcation between zero and a non-zero limit. This limit is identified as the pseudo-inverse, and as Delta gets smaller and smaller, the solution of the system approaches the pseudo-inverse, which is the always-correct answer for the system. The speaker notes that in a practical case, this approach would be useful for finding the unknown parameters of a system, such as the resistances and inductances in an electrical circuit.

  • 00:40:00 In this section, the lecturer explains that the solution to the problem Ax=b can be achieved by adding a penalty term to regularize the problem. The penalty term can be introduced by using the L1 norm, which gives sparse solutions without many little components in the answer. He also discusses the importance of iterative methods in conventional linear algebra and Gram-Schmidt with or without pivoting. However, he decides to cover those topics in the next lecture.

  • 00:45:00 In this section, the lecturer discusses how the SVD is an effective tool for proving things about matrices; it simplifies a messy problem into a problem about a diagonal matrix Sigma in the middle, which is why it is useful in diagnosing any matrix issues. Additionally, the lecturer provides a formula for a special case of a problem, with Sigma as a diagonal matrix, implying that understanding Sigma's behavior, specifically on each diagonal entry, is vital in pursuing such cases. The SVD, the lecturer emphasizes, is still the best tool for this. Finally, the lecturer highlights that this lecture is a survey of what numerical linear algebra deals with, and while not every topic has been covered yet, they will be in the remaining sessions.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 11: Minimizing ‖x‖ Subject to Ax = b



Lecture 11: Minimizing ‖x‖ Subject to Ax = b

In this lecture, the speaker covers a range of topics related to numerical linear algebra. They start with discussing the issues that can arise when solving for Ax=b, then move onto the Gram-Schmidt process for finding an orthogonal basis for a space, and the modified Gram-Schmidt method for minimizing ‖x‖ subject to Ax = b. The speaker also introduces the concept of column exchange or column pivoting in a more professional Gram-Schmidt algorithm and discusses an improvement on the standard Gram-Schmidt process for orthonormalizing the columns of a matrix A. They also touch upon the idea of the Krylov space to solve the problem Ax=b and the importance of having a good basis for minimizing ‖x‖ subject to Ax = b. Finally, they mention that they have finished with the problem of minimizing x subject to Ax=b and are moving on to tackling the issue of dealing with very large matrices.

  • 00:00:00 In this section, the lecturer mentions three things. Firstly, the issues that can arise when solving Ax = b, including the case where A is too big to fit into core memory but other methods are available. Secondly, he shows the rough first draft of two pages of his book and explains the two-year process he goes through to perfect and improve it. Thirdly, he discusses minimizing different norms, such as the L1, L2, or max (L infinity) norm, subject to the constraint that Ax = b is satisfied, providing a visual comparison of the L1, L2, and L infinity norms.

  • 00:05:00 In this section, the speaker discusses the winning point for different unit balls in different norm spaces, including L1, L2, and L infinity. He shows how to find the winning point, or the point that touches the line first, in each case. He then introduces the topic of the day, Gram-Schmidt, which is a way to make a non-orthogonal matrix orthogonal by finding a different set of vectors that span the same space while being orthogonal. He outlines the general facts of Gram-Schmidt and mentions that it is a standard topic taught in linear algebra courses.

  • 00:10:00 In this section, the professor explains the Gram-Schmidt process, which rewrites a matrix A in terms of an orthogonal matrix with columns q1 to qn that are orthonormal. The matrix R records which combinations of the q's reproduce the columns of A, or, read backwards, how A is related to the final Q. The equation for R is R = QᵀA, so the entries of R are just the inner products of the q's with the columns of A; there is nothing mysterious about R because Q is orthogonal. The MATLAB command would be qr(A) instead of lu(A).

  • 00:15:00 In this section, the lecture explains the Gram-Schmidt process for finding an orthogonal basis for a space. The lecture starts with a non-orthogonal basis set and the aim is to construct an orthogonal basis set. The process starts with the first column vector being the first basis vector and then taking the second vector and orthogonalizing this with the first vector. The next step is constructing the third vector that is orthogonal to the first two vectors. This continues until the whole basis set is orthogonally constructed. Finally, we divide each vector by its norm to make each basis vector a unit vector. Gram-Schmidt takes a non-orthogonal basis set and generates an orthogonal set suitable for projection methods.

  • 00:20:00 In this section, the speaker discusses the modified Gram-Schmidt method. They explain the process of subtracting off the components along q1 and q2 from the new vector and checking that the result is orthogonal. They also address the danger of taking the columns in a fixed order, just as taking rows in order during elimination can be dangerous, and suggest using the modified Gram-Schmidt method to avoid the build-up of computational errors (a sketch follows this bullet).
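
A minimal sketch of modified Gram-Schmidt (column version, no pivoting), assuming full column rank: each new column has its components along the already-accepted q's subtracted off one at a time, using the current partially-cleaned vector, before it is normalized. The test matrix is arbitrary.

```python
import numpy as np

def modified_gram_schmidt(A):
    """Return Q with orthonormal columns and upper-triangular R with A = Q R."""
    A = A.astype(float).copy()
    m, n = A.shape
    Q = np.zeros((m, n))
    R = np.zeros((n, n))
    for j in range(n):
        v = A[:, j]
        for i in range(j):
            # Project the *current, already-updated* v onto q_i and subtract.
            R[i, j] = Q[:, i] @ v
            v = v - R[i, j] * Q[:, i]
        R[j, j] = np.linalg.norm(v)
        Q[:, j] = v / R[j, j]
    return Q, R

np.random.seed(4)
A = np.random.randn(6, 4)              # arbitrary full-column-rank matrix
Q, R = modified_gram_schmidt(A)
assert np.allclose(Q.T @ Q, np.eye(4)) # orthonormal columns
assert np.allclose(Q @ R, A)           # A = Q R
print(np.round(R, 3))
```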

  • 00:25:00 In this section of the lecture, the speaker discusses the idea of column exchange, or column pivoting, in a more professional Gram-Schmidt algorithm. Just as in elimination, if the new part of the column is too small, it can build in round-off errors that cannot be removed. Therefore, it is essential for the algorithm to check the size of the new piece (the analogue of the pivot) and exchange columns if necessary. The main idea behind column exchange is to compare the new part of the column with all the other remaining possibilities and take the biggest before deciding the next step. This is crucial in avoiding round-off errors that can affect the accuracy of the result.

  • 00:30:00 In this section, the speaker explains an improvement on the standard Gram-Schmidt process for orthonormalizing the columns of a matrix A. Instead of only considering the next column in A, the improvement involves considering all remaining columns in A when orthonormalizing each new column. The speaker argues that this is not more work than the standard method, as all the subtractions needed are computed sooner regardless. The improvement relies on selecting the largest remaining column and is similar to selecting the largest pivot in Gaussian elimination.

  • 00:35:00 In this section, the lecturer introduces the idea of the Krylov space for solving the big matrix problem Ax = b. The Krylov space is spanned by the vectors b, Ab, A²b, and so on up to A^(j-1)b, and the lecturer looks for the best least squares solution x_j in that space. However, there is still a catch in this method.

  • 00:40:00 In this section, the speaker discusses the importance of having a good basis when working in the Krylov space. The basis should be orthogonalized to make the calculations easy, and this is where the contributions of Arnoldi and Lanczos come in. An orthogonal basis is perfect for a projection, and the speaker explains the equation that makes calculations easy: when the q's are orthogonal, the coefficients c can be found simply by taking the dot product of the given vector with each q, that is, by applying Qᵀ. This allows for an efficient solution to the problem.

  • 00:45:00 In this section of the lecture, the speaker discusses the concept of a basis and how to find a good basis using Gram-Schmidt or Krylov vectors. The speaker notes that using the Gram-Schmidt method in this case is preferable, and also mentions section 2.1 of the book's notes on numerical linear algebra, which summarizes the common techniques in the field such as Krylov, Arnoldi, and Lanczos. He recommends the textbook by Golub and Van Loan as an excellent reference on numerical linear algebra for those wanting to learn more about the topic.

  • 00:50:00 In this section of the video, the speaker mentions that they have finished with the problem of minimizing x subject to Ax=b and are moving on to tackling the issue of dealing with very large matrices.
Source: MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018. Instructor: Gilbert Strang (YouTube, 2019.05.16).
 

Lecture 12. Computing Eigenvalues and Singular Values



12. Computing Eigenvalues and Singular Values

In this video, the QR method for computing eigenvalues and singular values is introduced. The process starts with the matrix whose eigenvalues are desired and factors it into QR, where the upper triangular matrix R connects the non-orthogonal basis with the orthogonal one. The iteration is repeated until the off-diagonal entries become small, at which point the diagonal entries approximate the eigenvalues. The speaker also discusses a shift method to speed up the computation. The benefits of using MATLAB for symmetric matrices are also highlighted. The video also touches on the use of Krylov vectors for solving eigenvalue problems for large matrices.

  • 00:00:00 In this section, the professor introduces the QR method for computing eigenvalues and singular values of a matrix. The QR method starts with the matrix whose eigenvalues are desired and factors it into QR: the columns are orthogonalized, and the upper triangular matrix R connects the non-orthogonal basis with the orthogonal basis. Next, the method reverses the factors, forming RQ, to produce the next matrix. The professor explains that the eigenvalues are the same before and after this step because the matrices are similar, and the same idea will be useful for computing singular values.

  • 00:05:00 In this section, the professor explains the process of computing eigenvalues using QR factorization. The process iterates the QR factorization, and after multiple iterations the off-diagonal entries of the resulting matrix become very small; at that point the diagonal entries are close to the actual eigenvalues of the original matrix and can be used to approximate them (a sketch of the iteration follows). The professor also highlights the fast convergence of the method, with the relevant off-diagonal entries getting cubed at each step and rapidly approaching zero, making the method extremely accurate.
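
A minimal sketch of the basic (unshifted) iteration described here: factor A_k = QR, reverse the factors to get A_{k+1} = RQ, which is similar to A_k, and repeat until the diagonal holds the eigenvalues. The symmetric test matrix is made up; real software adds shifts and a Hessenberg reduction, as the later bullets explain.

```python
import numpy as np

# Made-up symmetric tridiagonal test matrix with real, well-separated eigenvalues.
A = np.array([[4.0, 1.0, 0.0, 0.0],
              [1.0, 3.0, 1.0, 0.0],
              [0.0, 1.0, 2.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

Ak = A.copy()
for _ in range(300):                   # unshifted iteration: simple, but slower than with shifts
    Q, R = np.linalg.qr(Ak)            # factor A_k = Q R
    Ak = R @ Q                         # reverse the factors: R Q = Q^T A_k Q, similar to A_k

# The off-diagonal entries have died out and the diagonal holds the eigenvalues.
print(np.round(np.sort(np.diag(Ak)), 6))
print(np.round(np.linalg.eigvalsh(A), 6))   # same numbers from the library routine
```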

  • 00:10:00 In this section, the video discusses an improvement to the algorithm: introducing a shift. Instead of working with the matrix A, they take A − sI, where sI is a multiple of the identity; this shifts all the eigenvalues of A by s. They then factor this shifted matrix into QR, reverse the factors to form RQ, and finally undo the shift by adding sI back, obtaining a new matrix A₁. The point is that A₁ is still similar to A, but a well-chosen shift makes the iteration converge much faster.

  • 00:15:00 In this section, the professor discusses the QR method for computing eigenvalues of a matrix. He demonstrates an incomplete example where he uses the QR method to show that the lower triangular part of the matrix begins to disappear, and the eigenvalues start to pop up on the diagonal. The professor then discusses how to improve the efficiency of the QR method by taking advantage of any zeros in the original matrix. If there are extra diagonals with zeros, the method can be sped up by skipping some steps in the QR factorization process.

  • 00:20:00 In this section, the speaker discusses how to compute eigenvalues and singular values. It is not possible to reach the eigenvalues exactly, since that would mean making the entire lower triangular part exactly zero; the eigenvalues solve a polynomial equation of degree n, and it was proven centuries ago (Abel) that such an equation cannot in general be solved in a finite number of simple steps. So there is no closed-form formula for the lambdas or for the singular values. However, one can get as close as desired by continuing the QR iteration, first reducing the matrix to Hessenberg form, which is upper triangular plus one extra subdiagonal, with lots of zeros below it. MATLAB and other matrix systems rely on LAPACK and LINPACK to compute these values.
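
The reduction to Hessenberg form mentioned here can be tried directly; a small sketch using SciPy (the random test matrix is made up for the example):

```python
import numpy as np
from scipy.linalg import hessenberg

# Reduce a general matrix to upper Hessenberg form: upper triangular plus
# one subdiagonal. The reduction is an orthogonal similarity, so the
# eigenvalues are unchanged, and QR iteration on H is much cheaper.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H, Q = hessenberg(A, calc_q=True)

print(np.allclose(Q @ H @ Q.T, A))        # A = Q H Q^T
print(np.allclose(np.tril(H, -2), 0.0))   # zeros below the first subdiagonal
print(np.sort(np.linalg.eigvals(A)))
print(np.sort(np.linalg.eigvals(H)))      # same eigenvalues: the reduction is a similarity
```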

  • 00:25:00 In this section of the video, the speaker discusses the benefits of using MATLAB and provides insight into the special structure of symmetric matrices. He explains that when a symmetric matrix is reduced to Hessenberg form, it ends up with only one diagonal above and one below the main diagonal, i.e. it is tridiagonal. This significantly reduces the cost of the QR calculation, since each step works with about 2n numbers instead of n^2. The speaker also briefly touches on singular values, noting that they are the square roots of the eigenvalues of A transpose A, but warns against computing them through determinants, as that is slow, ill-conditioned, and loses information.
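
A quick numerical check of that relationship, with a made-up 2x2 matrix; it also hints at why dedicated SVD routines avoid determinant- or A^T A-based routes for large problems:

```python
import numpy as np

# Singular values of A are the square roots of the eigenvalues of A^T A.
A = np.array([[3.0, 0.0],
              [4.0, 5.0]])
sv_direct = np.linalg.svd(A, compute_uv=False)
sv_via_ATA = np.sqrt(np.linalg.eigvalsh(A.T @ A))[::-1]
print(sv_direct, sv_via_ATA)   # agree here, but forming A^T A squares the
                               # condition number, so tiny singular values
                               # can be lost; SVD routines work on A itself.
```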

  • 00:30:00 In this section, the speaker discusses the concept of using orthogonal matrices to simplify symmetric matrices, making them tri-diagonal so that their eigenvalues can easily be found. Then, the speaker poses the question of what can be done to a general matrix to simplify it in a way that leaves its singular values unchanged. The speaker connects this question to the SVD and discusses the invariance of the singular values under certain operations, such as multiplying by an orthogonal matrix. The question of what other operations leave the singular values invariant is left open for the audience to consider.

  • 00:35:00 In this section, the lecturer discusses the effect of multiplying a diagonal matrix of singular values by an orthogonal matrix Q. Multiplying by Q does not change the singular values, and this can be done on both sides with two different orthogonal matrices. This extra flexibility, compared with the one-sided similarity used for eigenvalues, allows a general matrix to be reduced past tridiagonal, all the way to bidiagonal form, which makes each step of the algorithm faster. The lecturer also discusses how useful a bidiagonal matrix is for simplifying the subsequent multiplications.
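
A quick numerical check of this invariance (the random matrices are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

# Build orthogonal matrices Q1 (4x4) and Q2 (3x3) from QR factorizations.
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((3, 3)))

# Multiplying on either side by an orthogonal matrix leaves the singular
# values unchanged -- this is the freedom used to reach bidiagonal form.
print(np.linalg.svd(A, compute_uv=False))
print(np.linalg.svd(Q1 @ A @ Q2, compute_uv=False))
```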

  • 00:40:00 In this section, the speaker discusses computing eigenvalues and singular values, specifically for matrices of order up to about a thousand. For the SVD, one looks at A transpose A, which is tridiagonal once A has been reduced to bidiagonal form, so the singular values come from the eigenvalues of a symmetric tridiagonal matrix. This approach is effective up to a certain size; beyond that, Krylov's method can be used for sparse matrices. Krylov's method restricts the problem to a subspace of modest dimension, typically around a hundred, and finds eigenvectors in that space.

  • 00:45:00 In this section, the speaker explains an approach called Krylov vectors that can be used to solve eigenvalue problems for large matrices. By applying the matrix operation to Krylov vectors, which have a smaller dimension than the original matrix, a smaller eigenvalue problem can be created and solved. While not providing exact eigenvalues, Krylov vectors can give good approximations for certain problems. The speaker also introduces the idea of random sampling for large matrices and mentions that this will be explored in the next lecture.
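
Krylov-subspace eigenvalue solvers are available off the shelf; here is a hedged sketch using SciPy's Lanczos-based eigsh on a large sparse symmetric matrix (the test matrix and sizes are chosen only for illustration):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# A large sparse symmetric tridiagonal test matrix (made up for the example):
# diagonal 1, 2, ..., n with -1 on the off-diagonals.
n = 100_000
main = np.arange(1.0, n + 1.0)
off = -np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csr")

# eigsh builds a Krylov subspace (Lanczos) of modest dimension and solves the
# small projected eigenvalue problem -- approximate in general, but very
# accurate for the extreme eigenvalues, without ever factoring the full matrix.
vals = eigsh(A, k=6, which="LA", return_eigenvectors=False)
print(np.sort(vals))   # the six largest eigenvalues, each close to one of
                       # the six largest diagonal entries
```
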
12. Computing Eigenvalues and Singular Values
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018Instructor: Gilbert StrangView the complete course: https://o...
 

Lecture 13: Randomized Matrix Multiplication



Lecture 13: Randomized Matrix Multiplication

This video lecture discusses randomized matrix multiplication, which samples columns of matrix A and the corresponding rows of matrix B with probabilities that add up to one. The estimator has the correct mean, so on average it gives the right answer, but there is still variance. The lecture reviews the concepts of mean and variance and shows how to pick the probabilities that minimize the variance: the constraint that the probabilities add to one is built in with a Lagrange multiplier, usually called lambda, and derivatives are taken to find the best p_j. The lecture also asks how to weight the probabilities when some columns of the matrix are larger than others; two possibilities are weighting the probabilities by the norm squared or mixing the columns and using equal probabilities. Overall, the video gives a detailed account of randomized matrix multiplication and of optimizing the probabilities to obtain the smallest variance.

  • 00:00:00 In this section of the video, the speaker explains the concept of randomized matrix multiplication, which is an idea that falls under randomized linear algebra. This method is used for large matrices by sampling the columns of matrix A and the corresponding rows of matrix B, but not all of them. Instead, different pieces are randomly sampled with probabilities that add up to one. By computing the mean value of the random samples, the correct answer can be obtained, but there will still be variance. The goal then is to pick the best probabilities that minimize the variance. The lecture goes on to discuss the concepts of mean and variance and practicing with an example.

  • 00:05:00 In this section, the speaker describes a randomized sampling process for matrix multiplication. The example takes two columns with probability one half each and averages the samples by dividing by the number of times they are drawn. The mean of the randomized result is then calculated with the usual formula for averaging the two samples. The variance is computed in either of two ways: summing probability times output squared and subtracting the mean squared, or summing probability times the squared distance from the mean.

  • 00:10:00 In this section of the video, the speaker reviews the concepts of mean and variance in statistics and how they apply to computing the variance of randomized matrix multiplication. The variance is the probability-weighted sum of squared distances from the mean, and in this example he sums the squares of the differences between each possible output and the mean. He then computes the variance for the specific example, which has two possible outcomes with their probabilities.

  • 00:15:00 In this section, the speaker discusses the computation of variance and introduces the formula for variance using probabilities and squared distances from the mean. The speaker also brings up randomized sampling in linear algebra and how adjusting the probabilities decreases the variance when B is much bigger than A; the optimal probability weights the larger piece more heavily, roughly according to the squared size of B relative to A, a point to be developed further later. Finally, the speaker mentions the second formula for variance, which uses probability times output squared minus the mean squared.
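
For reference, the two equivalent formulas for the variance used throughout this part of the lecture can be written (in my notation) as:

```latex
\operatorname{Var}(X) \;=\; \sum_i p_i\,(x_i-\mu)^2 \;=\; \sum_i p_i\,x_i^2 \;-\; \mu^2,
\qquad \mu \;=\; \mathbb{E}[X] \;=\; \sum_i p_i\,x_i .
```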

  • 00:20:00 In this section, the speaker discusses mean and variance in probability and demonstrates the two equivalent ways of computing the variance when the mean is subtracted. The focus then shifts to how to weight the probabilities when some columns of a matrix are larger than others. The speaker suggests two possibilities: weight the probabilities according to the norm squared, or mix the columns of the matrix and use equal probabilities. The speaker favors the first approach and explains how to use probabilities proportional to the norm squared.

  • 00:25:00 In this section, the lecturer explains how to rescale the probabilities so that they add up to one. He then describes the plan: choose column j of A together with row j of B with a particular probability, multiply them to get a rank-one piece, and take the approximation to AB to be the sum of these pieces over s samples. The lecturer also states that the p_j will be chosen to minimize the total variance, and that the mean of the estimator is correct.
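
In symbols (my notation, not the lecture's exact writing), the estimator just described and its mean are:

```latex
\widehat{AB} \;=\; \frac{1}{s}\sum_{k=1}^{s}\frac{a_{j_k}\,b_{j_k}^{\mathsf T}}{p_{j_k}},
\qquad
\mathbb{E}\!\left[\widehat{AB}\right]
\;=\; \sum_{j=1}^{R} p_j\,\frac{a_j\,b_j^{\mathsf T}}{p_j}
\;=\; \sum_{j=1}^{R} a_j\,b_j^{\mathsf T} \;=\; AB .
```

Here a_j is column j of A, b_j^T is row j of B, and j_k is the index drawn at sample k; dividing each draw by its probability is what makes the mean come out right.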

  • 00:30:00 In this section, the lecturer explains how to calculate the variance for one sample in randomized matrix multiplication. The mean of the sum of all the samples is just the number of samples times the mean of one sample; the hard part is the variance. The variance depends on which piece, 1 through R, was chosen and on the probabilities p_1 to p_R, which depend on the sizes. Each individual sample is certainly wrong, since it has rank one, so the variance is certainly not zero. The second moment of a sample turns out to be a sum over j of terms built from a_j and b_j divided by the probability, and the mean squared is subtracted from this to get the complete variance.
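
The standard form of this per-sample variance (again in my notation, since the summary above is garbled at this point) is:

```latex
\text{second moment of one sample}
\;=\; \sum_{j=1}^{R} p_j\,\frac{\|a_j\|^{2}\,\|b_j\|^{2}}{p_j^{2}}
\;=\; \sum_{j=1}^{R} \frac{\|a_j\|^{2}\,\|b_j\|^{2}}{p_j},
\qquad
\text{variance of one sample}
\;=\; \sum_{j=1}^{R} \frac{\|a_j\|^{2}\,\|b_j\|^{2}}{p_j} \;-\; \|AB\|_F^{2} .
```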

  • 00:35:00 In this section, the speaker plugs in the chosen values of p_j, and the denominators simplify so that the sum involves the norms of a_j and b_j to the first power. Calling that sum C gives the expression for the variance: after taking s samples and combining them, the variance is a fixed number involving C, which they would like to make small. The speaker wants to show that this is the best choice, namely weighting the probabilities by the product of the length of a_j and the length of b_j.

  • 00:40:00 In this section, the speaker discusses the final step: optimizing the probabilities p_1 to p_R for the columns of matrix A and the corresponding rows of matrix B, subject to the constraint that they add up to 1. The goal is to minimize the variance expression by choosing the optimal p_j. The speaker introduces the Lagrange idea of building the constraint into the function with an unknown multiplier, usually called lambda, in order to find the best p_j. This concludes the discussion of randomized sampling and leads to the final subproblem.

  • 00:45:00 In this section, the lecturer works through Lagrange's idea for optimizing the probabilities under the condition that they add to one. The constraint is built into the function with a multiplier lambda; taking derivatives with respect to each p_j and setting them to zero gives the recommended answer, while the derivative with respect to lambda recovers the constraint. The lecturer explains that the Lagrange multiplier lambda is then fixed as exactly the number that makes the probabilities add up to one.
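
Written out, the Lagrangian and the resulting optimal probabilities are (my notation, following the standard derivation):

```latex
L(p_1,\dots,p_R,\lambda)
\;=\; \sum_{j=1}^{R}\frac{\|a_j\|^{2}\,\|b_j\|^{2}}{p_j}
\;+\; \lambda\Big(\sum_{j=1}^{R} p_j - 1\Big),
\qquad
\frac{\partial L}{\partial p_j}
\;=\; -\,\frac{\|a_j\|^{2}\,\|b_j\|^{2}}{p_j^{2}} + \lambda \;=\; 0
\;\;\Longrightarrow\;\;
p_j \;=\; \frac{\|a_j\|\,\|b_j\|}{\sum_{k=1}^{R}\|a_k\|\,\|b_k\|} .
```

So the best p_j is proportional to the product of the column length of A and the row length of B, with lambda absorbed into the normalization that makes the probabilities sum to one.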

  • 00:50:00 In this section, the professor explains the process of choosing probabilities to obtain the smallest variance in a randomized system. He mentions that the ideal probabilities are higher when the column is larger, so finding the lengths of the columns is a prerequisite before randomized sampling. Although the variance can be a bit challenging to calculate, he encourages students to go through the notes slowly and revisit the formulas for better understanding, as they will be using probability more seriously in the future.
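
Putting the pieces together, here is a hedged Python sketch of the whole procedure with the norm-based probabilities recommended above (the function name, sizes, and sample count are mine, for illustration only):

```python
import numpy as np

def randomized_matmul(A, B, s, rng=None):
    """Approximate A @ B by sampling s column-row outer products.
    Column j of A and row j of B are picked with probability p_j
    proportional to ||a_j|| * ||b_j||, and each sample is divided by
    s * p_j so the estimator is unbiased."""
    rng = np.random.default_rng() if rng is None else rng
    col_norms = np.linalg.norm(A, axis=0)   # lengths of the columns of A
    row_norms = np.linalg.norm(B, axis=1)   # lengths of the rows of B
    p = col_norms * row_norms
    p /= p.sum()                            # probabilities add up to one
    idx = rng.choice(A.shape[1], size=s, p=p)
    approx = np.zeros((A.shape[0], B.shape[1]))
    for j in idx:
        approx += np.outer(A[:, j], B[j, :]) / (s * p[j])
    return approx

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 500))
B = rng.standard_normal((500, 300))
est = randomized_matmul(A, B, s=100, rng=rng)
exact = A @ B
print(np.linalg.norm(est - exact) / np.linalg.norm(exact))  # error shrinks roughly like 1/sqrt(s)
```
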
Lecture 13: Randomized Matrix Multiplication
  • 2019.05.16
  • www.youtube.com
MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning, Spring 2018Instructor: Gilbert StrangView the complete course: https://o...