Machine Learning and Neural Networks

 

Machine Learning: Testing and Error Metrics




Hi and welcome to this tutorial on machine learning testing and error metrics. My name is Luis Serrano, and I work at Udacity, where I teach machine learning. That's a picture of me. Today, we will focus on two questions: first, how well is my model doing? Once we figure that out, how do we improve it based on these metrics? Let's dive right in and look at some data. We have blue points and red points, and we want to train a model to separate them. Our simplest model is a linear model, which is a line that cuts the data into blue and red. It makes some mistakes but is generally good. Let's also consider a more complex model using a higher degree polynomial. This model does better at separating the points, but which model is better between the two?

To answer this question, we need to use testing. Instead of using all the points for training, we split them into training and testing sets. The training points are used to train the model, while the testing points are used to evaluate how well the model performs. In our example, the linear model makes one mistake on the testing set, while the polynomial model makes two mistakes. Therefore, the linear model performs better on the testing set because it generalizes better.
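As a concrete sketch of this split (the dataset, the 80/20 ratio, and the choice of a logistic regression model below are illustrative assumptions, not details from the video), scikit-learn's train_test_split keeps the testing points untouched during training:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy two-class dataset standing in for the blue and red points.
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

# Hold out 20% of the points for testing; the model never sees them while training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("accuracy on the unseen testing points:", model.score(X_test, y_test))
```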

When it comes to testing, there are some important rules to follow. The first golden rule is to never use your testing data for training. Keep your testing data separate and only use it for evaluation. The second rule is to ensure that your friends also don't use their testing data for training. Finally, the third rule emphasizes never using your testing data for training. It's crucial to avoid any accidental misuse of the testing data.

Although it might seem like we are wasting data by separating it into training and testing sets, there is a way to address this concern. We can split the data into k equal sets, typically using a technique called k-fold cross-validation. Each portion of the data is used for both training and testing, and the results are averaged at the end. This approach allows us to make better use of the data while still evaluating the model's performance.
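Here is a minimal sketch of k-fold cross-validation with scikit-learn; the value k = 5 and the model choice are arbitrary assumptions for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)

# Each of the 5 folds takes a turn as the testing set while the other 4 train the model,
# and the resulting scores are averaged at the end.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print("per-fold accuracy:", scores)
print("average accuracy:", scores.mean())
```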

Now let's discuss the metrics that help us assess how well our models are doing. One common metric is accuracy, which measures how many instances the model classifies correctly compared to the total number of instances. However, accuracy alone may not always be the best metric, as shown in the examples of credit card fraud detection, medical diagnostics, and spam classification. In these cases, false negatives and false positives have different implications.

To evaluate the models more effectively, we use a confusion matrix. This matrix presents four possibilities: true positives, true negatives, false positives, and false negatives. Each of these represents different outcomes based on the model's predictions compared to the actual data. For example, in medical diagnostics, a false negative means a sick person is classified as healthy, which is worse than a false positive. Similarly, in spam classification, a false positive means a non-spam email is marked as spam, which is worse than a false negative.
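A small sketch of how the four cells of the confusion matrix are counted; the label vectors are made up for illustration, with 1 meaning "positive" (for example, sick or spam):

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # actual labels (hypothetical)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions (hypothetical)

tp = np.sum((y_pred == 1) & (y_true == 1))   # true positives
tn = np.sum((y_pred == 0) & (y_true == 0))   # true negatives
fp = np.sum((y_pred == 1) & (y_true == 0))   # false positives: healthy flagged as sick, or ham sent to spam
fn = np.sum((y_pred == 0) & (y_true == 1))   # false negatives: sick sent home, or spam reaching the inbox

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```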

By considering the specific context and consequences of false positives and false negatives, we can choose appropriate metrics for evaluating our models. Accuracy alone may not capture the whole picture, and other metrics like precision, recall, and F1 score can provide more insights into the model's performance.

Can we combine these two scores into one? One simple way to combine precision and recall is to take their average. Let's do that for the scores above: the precision is 69.5 and the recall is 66.95, so taking the average of these two scores gives 68.225. However, this average may not provide much information and may not be significantly different from accuracy. To understand why the plain average falls short, let's consider an extreme example involving credit card fraud detection.

In the example, we have two models: one that classifies all transactions as good and another that classifies all transactions as fraudulent. Let's calculate the precision and recall for both models.

For the model that classifies all transactions as good, the precision is 100% (it never flags a legitimate transaction as fraudulent), but the recall is 0% (none of the fraudulent transactions are caught). If we take the average of precision and recall, we get 50%; giving such a high score to a model that performs so poorly doesn't seem appropriate. Similarly, for the model that classifies all transactions as fraudulent, the recall is 100% (every fraudulent transaction is caught), but the precision is only about 0.16% (of everything it flags, only the 472 genuinely fraudulent transactions are correct). Again, the average of precision and recall is around 50%, which does not accurately reflect the model's poor performance.

To overcome this limitation, we can use another type of average called the harmonic mean, which is also known as the F1 score. The harmonic mean is calculated using the formula 2 * (precision * recall) / (precision + recall). The F1 score provides a more balanced representation, especially when one metric (precision or recall) is significantly different from the other.

For the medical model, the precision is 55.7, and the recall is 83.3. Calculating the F1 score using the harmonic mean formula, we get 66.76%. For the spam detection model, the precision is 76.9, and the recall is 37. The F1 score is 49.96%. And for the linear model, the precision is 75, and the recall is 85.7. The F1 score is 80%.
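These F1 values can be reproduced directly from the harmonic mean formula; here is a quick sketch, with precision and recall given as percentages as in the examples above:

```python
def f1_score(precision, recall):
    # Harmonic mean: the score drops sharply when either value is low.
    return 2 * precision * recall / (precision + recall)

print(f1_score(55.7, 83.3))  # medical model  -> about 66.76
print(f1_score(76.9, 37.0))  # spam model     -> about 49.96
print(f1_score(75.0, 85.7))  # linear model   -> about 80
```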

In the case of the credit card fraud models, if we prefer to catch all fraudulent transactions, we would prioritize recall. So, a good metric in this case would be closer to recall rather than precision.

Testing and error metrics are crucial for evaluating and improving machine learning models. Following the golden rules of testing, selecting appropriate metrics, and considering the consequences of false positives and false negatives help us make informed decisions about our models.

Combining precision and recall into a single score can be done using the F1 score, which takes the harmonic mean of the two metrics. This provides a more balanced evaluation and is especially useful when one metric is significantly different from the other.

Machine Learning: Testing and Error Metrics
  • 2017.03.16
  • www.youtube.com
Announcement: New Book by Luis Serrano! Grokking Machine Learning. bit.ly/grokkingML40% discount code: serranoytA friendly journey into the process of evalua...
 

ROC (Receiver Operating Characteristic) Curve in 10 minutes!




Greetings! I am Luis Serrano, and in this video, we will discuss the Receiver Operating Characteristic (ROC) curve. The ROC curve is widely used for evaluating machine learning models and making important decisions.

Let's start with a dataset containing blue and red points, where the blue points are labeled as positive and the red ones as negative. We will build a machine learning model that fits this dataset. For instance, consider this line. Note that the blue side of the line classifies every point on that side as blue or positive, while the red side classifies points on that side as red or negative. However, the model makes some mistakes. One example is the red point located in the blue area, which is a false positive. Another example is the blue point below the line in the red area, which is a false negative. To understand false positives and false negatives better, check out my other video on machine learning testing and error metrics.

Now, let's consider two different examples for the same type of model. The model on the left is a medical model, where the positive points represent sick patients and the negative points represent healthy ones. On the other hand, the model on the right is a spam detection model, where the positive points are spam messages and the negative points are non-spam or "ham" messages.

In the medical model, a false positive occurs when a healthy person is diagnosed as sick, leading to unnecessary tests. A false negative happens when a sick person is diagnosed as healthy, resulting in no treatment. In this case, false negatives are considered worse because it is better to send a healthy person for additional tests than to send a sick person home without treatment. Thus, we aim to modify the model to reduce false negatives.

In the spam detection model, false positives are good "ham" emails classified as spam and sent to the spam folder. False negatives are spam emails incorrectly classified as "ham" and delivered to the inbox. Here, false positives are considered worse because receiving occasional spam emails in the inbox is preferred over important emails being marked as spam. Therefore, we focus on reducing false positives in this model.

These two models represent extremes, and most models lie somewhere in between, tolerating some false positives or false negatives. However, the importance assigned to each type of error may vary. Thus, for each type of model, we need to determine the optimal point to set the classification threshold.

In this video, I will show you a useful method to help make such decisions. We will consider a hypothetical dataset represented by this line. Our models will be parallel translations of this line. For each model, we will record the number of correctly classified red and blue points.

We start with the line at the bottom, where all the blue points are correctly classified and none of the red points are: 0 correct red points and 5 correct blue points. We plot these values, then move the line up past one point at a time, recording the correct classifications at each step. This process continues until the line is at the top, where we have 5 correct red points and 0 correct blue points. We plot all of these points and calculate the area under the resulting curve. In this case, the curve covers 21 out of 25 squares, giving an area under the curve of 0.84. The area ranges between 0 and 1, with higher values indicating better model performance. We can use this ROC curve to make decisions about our model. A good model corresponds to a point on the curve with few false positives and few false negatives. Depending on our specific requirements, such as minimizing false negatives in the medical model or false positives in the spam detection model, we can select the corresponding point on the curve.

Alternatively, we can view the data points as having scores between 0 and 1 assigned by the model. By applying a threshold, we convert these scores into discrete predictions. Adjusting the threshold and recording how the model performs at each level is exactly what traces out the ROC curve, and choosing a point on the curve amounts to choosing a threshold for our predictions. It is also worth noting that an area under the curve below 0.5 does not necessarily mean the model is useless. For instance, a model with an area of 0 gets every prediction wrong, but if we flip its predictions, we obtain a model that gets everything right. Thus, models with areas well below 0.5 can still be effective once their predictions are inverted.
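Here is a minimal sketch of building an ROC curve from model scores with scikit-learn; the labels and scores below are invented for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 1])                         # 1 = positive, hypothetical labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.55, 0.65, 0.3])  # model scores in [0, 1]

# Each threshold contributes one point on the curve (true positive rate vs. false positive rate).
fpr, tpr, thresholds = roc_curve(y_true, scores)
print("area under the curve:", roc_auc_score(y_true, scores))
```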

Thank you for your attention! If you enjoyed this video, please subscribe for more content. Feel free to like, share, and comment, especially if you have suggestions for future videos. You can also reach out to me on Twitter @LuisLikesMath. Check out the Luis Serrano Academy for more information on videos, books, and courses. See you in the next video!

ROC (Receiver Operating Characteristic) Curve in 10 minutes!
  • 2020.07.14
  • www.youtube.com
The ROC curve is a very effective way to make decisions on your machine learning model based on how important is it to not allow false positives or false neg...
 

Linear Regression: A friendly introduction




Hi, I'm Luis Serrano, and this is a friendly introduction to linear regression. This video is the first part of a three-part series in which I'll cover linear regression, logistic regression, and support vector machines. For now, let's focus on linear regression.

Typically, explanations of linear regression involve complex concepts like linear algebra, matrices, derivatives, calculus, and error functions. But for me, this can be overwhelming. I have a visual mind, and I prefer to understand linear regression with a more intuitive approach using points and lines.

Let me demonstrate what I mean by that. First, let's define linear regression using an example of housing prices. Imagine we have five houses with varying numbers of rooms. We own house three and want to estimate its price for selling. To do this, we look at the prices of similar houses and try to make an educated guess. After observing that house one costs $150,000, house two costs $200,000, house four costs $300,000, and house five costs $350,000, we can estimate that house three might cost around $250,000.

Essentially, what we did here is perform linear regression. We plotted the houses on a graph, with the number of rooms on the horizontal axis and the price on the vertical axis. We observed that the prices seemed to follow a straight line, so we placed house three on the line to make the best estimate. This simple approach of fitting a line through a set of points and using it for predictions is the essence of linear regression.

In reality, calculating prices involves additional factors, making the data more complex. However, we can still fit a line that closely approximates the house prices. Linear regression can be applied to various scenarios, such as calculating stock prices in finance or determining patient lifespans in medicine. The applications are numerous.

Now, let's discuss how we actually perform linear regression. We need an algorithm or a procedure to find the line that best fits the points. For simplicity, we'll focus on a basic algorithm that involves making small adjustments to the line until it fits the points well.

Here's a summary of the algorithm:

  1. Start with a random line.
  2. Define the learning rate, which determines the size of the adjustments we make.
  3. Repeat the following steps a specified number of times (epochs):
    • Select a random point.
    • Move the line towards the point based on its position relative to the line.
  4. Enjoy your fitted line!

To move the line towards a point, we perform rotations and translations. Rotating the line clockwise or counterclockwise adjusts the slope, while translating the line up or down adjusts the y-intercept. The adjustments are made by adding or subtracting small amounts, which are determined by the learning rate.

By repeating this process multiple times, the line gradually moves closer to the points, resulting in a good fit. The number of epochs determines how many times we repeat the process, and it can be adjusted based on the desired accuracy and available computational resources.

That's the basic idea behind linear regression. It's a powerful tool for fitting lines to data points and making predictions. I hope this intuitive explanation helps you understand linear regression better.

We now have something that is closer to the point, and there are four cases to check to make sure the square trick works in all of them. Let's examine another case, where the point is below the line and to the right of the y-axis. The equation is presented again. In step one, we select a small learning rate of 0.01. In step two, we apply this rate to the slope and the y-intercept. Now the vertical distance is negative because the point is below the line (-4), while the horizontal distance remains positive because the point is to the right of the y-axis (+5). To modify the slope, we add the learning rate times the vertical distance times the horizontal distance: 0.01 times (-4) times 5, which is -0.2. Adding -0.2 decreases the slope, causing the line to rotate clockwise. For the y-intercept, we add the learning rate times the vertical distance, which is 0.01 times (-4) = -0.04. The new line equation becomes y = 1.8x + 2.96: the slope went from 2 to 2 - 0.2 = 1.8 (a clockwise rotation), and the y-intercept from 3 to 3 - 0.04 = 2.96 (a downward translation). This adjustment brings the line closer to the point. Although we have only checked two of the four cases, I encourage you to test all of them to see that step two consistently works. Finally, let's spell out the linear regression algorithm. It proceeds as follows:

  1. Start with a random line.
  2. Set the number of repetitions (epochs) to 8,000.
  3. Choose a small step length (learning rate) of 0.01.
  4. Repeat the following loop for the specified number of epochs:
    • Randomly select a point.
    • Adjust the line to move towards the point using the learning rate, vertical distance, and horizontal distance.
  5. Enjoy your fitted line. This is the linear regression algorithm. I encourage you to code it using the provided pseudocode and test it on different datasets to observe its performance. I have implemented this algorithm and found that it works exceptionally well.
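Here is one way that pseudocode could look in Python, using the square trick described above; the toy dataset is made up, and the learning rate and epoch count follow the values in the list:

```python
import random

# Toy data roughly following a line (number of rooms vs. price, arbitrary units).
points = [(1, 155), (2, 197), (3, 244), (4, 356), (5, 407)]

slope, intercept = random.random(), random.random()   # 1. start with a random line
learning_rate = 0.01                                   # 3. small step length
for _ in range(8000):                                  # 2. and 4. repeat for 8,000 epochs
    x, y = random.choice(points)                       # randomly select a point
    predicted = slope * x + intercept
    vertical_distance = y - predicted                  # negative when the point is below the line
    # Square trick: nudge the slope and y-intercept toward the point.
    slope += learning_rate * vertical_distance * x     # x plays the role of the horizontal distance
    intercept += learning_rate * vertical_distance

print(f"fitted line: y = {slope:.2f}x + {intercept:.2f}")  # 5. enjoy your fitted line
```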

Now, the question arises: is this algorithm better than, or the same as, existing ones? This situation reminds me of a brilliant mathematician named Jonathan Borwein, whom I once worked with. He searched for formulas to calculate digits of pi, and it was a win-win situation: either a formula produced the digits of pi, in which case he had discovered a new formula, or it produced something remarkably close to pi, which was interesting in its own right. Similarly, with this algorithm we have a win-win situation: either it outperforms existing algorithms, or it turns out to be just as effective while being simpler.

Surprisingly, this algorithm is exactly the same as the one used in traditional linear regression, such as the square error gradient descent. The traditional approach involves minimizing the square error by calculating the distance between points and the line and then using derivatives, calculus, and gradient descent or solving linear systems of equations. The challenge I present to you is to verify that the square trick is equivalent to the traditional square error method by calculating the square error, taking derivatives, and performing small steps opposite to the derivative direction, just like gradient descent. You'll discover that the derivative of the squared difference is closely related to the vertical and horizontal distances. This demonstrates that the square trick is equivalent to the traditional approach.

The second challenge relates to measuring the badness or goodness of a line. We previously discussed the square error, which involves squaring the distances. Another simpler method is the absolute error, where we add the absolute values of the distances without considering their signs. The bad line has larger orange distances, while the good line has smaller orange distances due to its closer proximity to the points. Using calculus, you can minimize the absolute error and perform gradient descent steps. You can develop an absolute trick, which only deals with the horizontal distance and includes an if statement to handle the line's position relative to the point. By exploring this challenge, you'll discover the absolute trick. Feel free to code it and observe its effectiveness.

This video focused on linear regression. Remember that this is part of a three-part series that includes tricks for logistic regression and support vector machines. Stay tuned for the upcoming videos. If you enjoyed this content, please consider subscribing, liking, and sharing. I appreciate your comments, questions, and suggestions. Let me know how you fared with the challenges, and feel free to suggest topics for future videos. You can also reach out to me on Twitter (@LuisLikesMath). Thank you, and see you in the next video.

Linear Regression: A friendly introduction
  • 2018.12.22
  • www.youtube.com
Announcement: New Book by Luis Serrano! Grokking Machine Learning. bit.ly/grokkingML40% discount code: serranoytAn introduction to linear regression that req...
 

Logistic Regression and the Perceptron Algorithm: A friendly introduction




There is a better algorithm that can handle all three cases simultaneously. I will now explain the perceptron trick, which takes care of these cases together. Let's consider a line with the same equation and a blue point that lies on the red side of the line, so it is misclassified. We need to adjust the line's position to accommodate this point. To do so, we will decrease the value of -6 by a small amount, let's say 0.1, which will move the line upward. Next, we will adjust the slope of the line to rotate it. For example, to rotate the line in the desired direction, we can reduce the value of 2 to 1.8.

Similarly, if the point is farther in that direction, we need to rotate the line more to bring it closer. In this case, we might reduce the value of 2 to 1.4. However, if the point is on the other side, we need to rotate the line in the opposite direction. So, we may need to increase the value of 2, let's say to 2.2. Let's go through this process again more accurately. When the point is not far from the y-axis, we decrease the value of -6 by a small amount, such as 0.01, which moves the line up slightly.

Now, let's consider the value of 2. If the point has coordinates (4, 5), the horizontal distance is 4. To adjust this coefficient, we multiply 4 by the learning rate of 0.01, giving 0.04, and subtract that from 2. This rotates the line around a pivot. Similarly, we apply the same principle to the value of 3: the vertical coordinate of the point is 5, and multiplying 5 by the learning rate gives 0.05, which we subtract from 3, rotating the line further. Finally, after reducing 2 by 0.04, reducing 3 by 0.05, and adjusting -6 by 0.01, the equation becomes 1.96x + 2.95y - 6.01 = 0. This technique, known as the perceptron trick, takes care of both the direction and the magnitude of the adjustments. To summarize, with a learning rate of 0.01 and a misclassified point (p, q), we update the equation ax + by + c = 0 by reducing a by the learning rate times p, reducing b by the learning rate times q, and reducing c by the learning rate.

However, there is one more consideration: the case where the blue area is on top and the red area is on the bottom, with misclassified points. In this scenario, we would add the values instead of subtracting them. For example, if we have -2x - 3y + 6 = 0 instead of 2x + 3y - 6 = 0, we would add small amounts to a, b, and c. By taking into account where the misclassified point lies relative to the line and to the y-axis, we can determine whether each value should be increased or decreased. Now, let's move on to the logistic regression algorithm, which is an even better approach.
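Below is a small sketch of the perceptron trick just described, reproducing the update for the line 2x + 3y - 6 = 0, a learning rate of 0.01, and the point (4, 5); the sign convention (which side of the line counts as positive) and the function framing are assumptions made for illustration:

```python
def perceptron_step(a, b, c, p, q, label, learning_rate=0.01):
    """One perceptron-trick update for the line ax + by + c = 0.

    label is +1 if the point (p, q) should fall on the positive side
    (ax + by + c >= 0) and -1 otherwise. The update only fires when the
    point is misclassified, and it moves the line toward the point.
    """
    prediction = 1 if a * p + b * q + c >= 0 else -1
    if prediction != label:
        a += learning_rate * label * p
        b += learning_rate * label * q
        c += learning_rate * label
    return a, b, c

# The point (4, 5) lies on the positive side of 2x + 3y - 6 = 0 but should not,
# so the coefficients are reduced, giving 1.96x + 2.95y - 6.01 as above.
print(tuple(round(v, 2) for v in perceptron_step(2, 3, -6, 4, 5, label=-1)))
```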

I won't delve into the details, but I'll provide an outline and two challenges for you. The first challenge is called gradient descent, which involves using an error function to measure the performance of the classifier. The goal is to minimize the error and improve the algorithm using calculus. The process of gradient descent is similar to the perceptron algorithm we discussed earlier.

The second challenge is to choose an appropriate activation function for logistic regression. The activation function is responsible for transforming the weighted sum of inputs into a probability value between 0 and 1.

One commonly used activation function is the sigmoid function, which has an S-shaped curve. It maps any real number to a value between 0 and 1. The sigmoid function is defined as:

σ(z) = 1 / (1 + e^(-z))

Here, z represents the weighted sum of inputs and is calculated as:

z = ax + by + c

In logistic regression, the goal is to find the optimal values for a, b, and c that maximize the likelihood of the observed data. This is achieved by minimizing a cost function, often referred to as the cross-entropy loss function.

The cost function measures the discrepancy between the predicted probabilities and the actual class labels. One common form of the cost function is:

J(a, b, c) = -1/m * Σ(y * log(σ(z)) + (1-y) * log(1-σ(z)))

In this equation, m represents the number of training examples, y is the true class label (0 or 1), and σ(z) is the predicted probability of the positive class.

To minimize the cost function, gradient descent can be applied. The idea behind gradient descent is to iteratively update the parameter values (a, b, and c) in the opposite direction of the gradient of the cost function with respect to these parameters. This process continues until convergence, where the cost function is minimized.

The update equations for gradient descent in logistic regression are similar to those in the perceptron algorithm. Using the learning rate (α), the parameter updates are as follows:

a := a - α * ∂J/∂a
b := b - α * ∂J/∂b
c := c - α * ∂J/∂c

The partial derivatives (∂J/∂a, ∂J/∂b, ∂J/∂c) represent the gradients of the cost function with respect to each parameter. They can be computed using calculus.

By iteratively updating the parameter values using gradient descent, logistic regression can learn to classify data with more flexibility compared to the perceptron algorithm. It is a widely used and effective algorithm for binary classification problems.
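A compact sketch of these gradient descent updates on a toy two-dimensional dataset; the data, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy dataset: points (x, y) with class labels 0 or 1 (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [3.0, 3.0], [4.0, 4.5], [5.0, 4.0], [6.0, 6.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

a, b, c = 0.0, 0.0, 0.0      # parameters of z = ax + by + c
alpha = 0.1                  # learning rate

for _ in range(5000):
    p = sigmoid(a * X[:, 0] + b * X[:, 1] + c)     # predicted probabilities
    # Gradients of the cross-entropy cost J with respect to a, b, and c.
    da = np.mean((p - labels) * X[:, 0])
    db = np.mean((p - labels) * X[:, 1])
    dc = np.mean(p - labels)
    a, b, c = a - alpha * da, b - alpha * db, c - alpha * dc

print(f"learned boundary: {a:.2f}x + {b:.2f}y + {c:.2f} = 0")
```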

In conclusion, logistic regression builds upon the principles of the perceptron algorithm but introduces a probabilistic framework and a different cost function. By applying gradient descent, it optimizes the parameters to maximize the likelihood of the observed data. The choice of the sigmoid activation function allows logistic regression to produce probability estimates for class membership.

Logistic Regression and the Perceptron Algorithm: A friendly introduction
  • 2019.01.01
  • www.youtube.com
Announcement: New Book by Luis Serrano! Grokking Machine Learning. bit.ly/grokkingML40% discount code: serranoytAn introduction to logistic regression and th...
 

Support Vector Machines (SVMs): A friendly introduction




Hello, my name is Luis Serrano, and this is a friendly introduction to Support Vector Machines, or SVM for short. This is the third video in a series of three on linear models. If you haven't watched the first one yet, it's called 'Linear Regression,' and the second one is called 'Logistic Regression.' This video builds upon the concepts covered in the second video.

Firstly, I'd like to give credit to the students of the machine learning class I taught at Quest University in British Columbia, Canada. They helped me develop the key idea for this video, and it was a wonderful experience working with them. In the picture, you can see me with my friend Richard Hoshino, who is also a professor at the university.

Now, let's dive into SVM. Support Vector Machines are a crucial classification algorithm that aims to separate points of two classes using a line. However, SVM goes a step further to find the best possible line that maximizes the separation between the points. Typically, SVM is explained in terms of linear optimization or gradient descent. But in this video, I'll introduce a method that I haven't seen in the literature, which I call the 'small-step method.' It's an iterative approach where we continuously improve the line's classification ability.

To begin, let's recap the previous video on logistic regression and the perceptron algorithm. In that video, we aimed to find a line that separates data into two classes: red points and blue points. The perceptron algorithm starts with a random line and iteratively adjusts it based on the feedback from the points. The algorithm makes small steps to gradually improve the line's ability to classify the points correctly.

Now, let's introduce the extra step in SVM. Instead of finding just one line, we aim to find two parallel lines that are as far apart as possible while still effectively separating the data. To illustrate this, imagine two sets of parallel lines, equidistant from the main line. The goal is to maximize the distance between these two lines. The larger the separation, the better the classification.

Next, we need to train the algorithm to select the best pair of lines. We do this by multiplying the line's equation by different constants. Multiplying by a constant doesn't change the central line itself, but it does control the distance between the two parallel lines: multiplying by a constant smaller than 1 spreads them farther apart, while multiplying by a constant larger than 1 brings them closer together. In this way, we can search for the lines that maximize the separation.

To separate lines using equations, let's consider an example equation: 2x + 3y - 6 = 0. Multiplying this equation by a constant factor doesn't change the line itself but affects the distance between the lines. We can adjust this constant to expand or contract the lines.

To formalize the SVM algorithm, we incorporate the expanding rate, which determines the step size for spreading the lines apart. By multiplying the line equations with a value close to 1, such as 0.99, we gradually increase the distance between the lines at each iteration.

In summary, the SVM algorithm follows these steps:

  1. Start with a random line and two equidistant parallel lines.
  2. Apply the perceptron algorithm to adjust the line's position based on point feedback.
  3. Introduce the expanding rate to spread the lines apart slightly.
  4. Repeat these steps iteratively to find the optimal lines that maximize the separation.

This is a brief overview of the SVM algorithm and how it incorporates the small-step method to improve classification. It's an effective technique for solving classification problems, and I hope this video provides a unique perspective on SVM.
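As a sketch of the small-step method under some assumptions of my own (the hinge-style condition that also corrects points inside the margin, the toy data, and the iteration counts are not taken from the video), the loop might look like this:

```python
import random

def svm_small_steps(points, labels, epochs=10000, learning_rate=0.01, expanding_rate=0.99):
    """Perceptron-style correction followed by a shrink of the coefficients,
    which pushes the two lines ax + by + c = 1 and ax + by + c = -1 apart."""
    a, b, c = random.random(), random.random(), random.random()
    for _ in range(epochs):
        i = random.randrange(len(points))
        (p, q), label = points[i], labels[i]           # label is +1 or -1
        if label * (a * p + b * q + c) < 1:            # misclassified or inside the margin
            a += learning_rate * label * p
            b += learning_rate * label * q
            c += learning_rate * label
        # Multiplying by a number slightly below 1 widens the distance
        # 2 / sqrt(a^2 + b^2) between the two parallel lines.
        a, b, c = a * expanding_rate, b * expanding_rate, c * expanding_rate
    return a, b, c

points = [(1, 2), (2, 1), (1, 1), (4, 5), (5, 4), (5, 5)]   # toy data
labels = [-1, -1, -1, 1, 1, 1]
print(svm_small_steps(points, labels))
```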

Now we are going to consider another aspect that contributes to the error in our model. This aspect is based on the distance between the two lines. To demonstrate it, let's examine two support vector machines (SVMs) that classify the same dataset. The first SVM has widely separated lines (a wide margin) but misclassifies a point, while the second SVM has lines that are very close together (a narrow margin) but classifies all points correctly. The question is, which one is better? The answer depends on various factors such as the dataset, the model, and the scenario. However, we can analyze the trade-off using error functions.

Let's recall the error calculation in the perceptron algorithm. When a point is correctly classified, the error is zero. If a point is misclassified, the error depends on its distance from the boundary. Points closer to the boundary have smaller errors, while those further away have larger errors. We want an error function that captures this distance relationship. Similarly, in SVM, we have two classification errors: the blue classification error and the red classification error. Each error measures the distance from a misclassified point to its respective boundary. By summing these distances, we obtain the classification error.

Additionally, we have the margin error, which reflects how close together the two lines are. A larger margin error means the lines are closer together, while a smaller margin error means they are farther apart. We want our model to have a small margin error, that is, a wide margin between the lines. Hence, the smaller the error, the better the model. Interestingly, the margin error plays the same role as the regularization term in L2 regularization.

To summarize, the SVM error consists of the classification error and the margin error. The classification error measures the number of misclassified points and the extent of misclassification. The margin error indicates how far apart the lines are. These two errors together form the total SVM error. Our goal is to minimize this error using gradient descent, similar to the SVM trick of adjusting the lines' position and separation.

Returning to the comparison of the two models, the one on the left has a large classification error but a small margin error since the lines are far apart. The model on the right, on the other hand, has a small classification error but a large margin error due to the close proximity of the lines. Deciding which model is better depends on our preferences and requirements. We can use the C parameter, a hyperparameter, to control the importance of the classification error relative to the margin error. A small C emphasizes the margin error, resulting in a model like the one on the left, while a large C prioritizes the classification error, leading to a model similar to the one on the right.
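In scikit-learn, this trade-off is exposed directly as the C hyperparameter of its SVM classifier; the toy data below is invented, and the two C values are just examples of the two extremes:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

wide_margin = SVC(kernel="linear", C=0.01).fit(X, y)    # small C: tolerate errors, favor a wide margin
strict_fit = SVC(kernel="linear", C=100.0).fit(X, y)    # large C: punish misclassifications

print("training accuracy with small C:", wide_margin.score(X, y))
print("training accuracy with large C:", strict_fit.score(X, y))
```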

It's important to note that the choice of hyperparameters can be determined through experimentation and testing different values to evaluate model performance. Hyperparameters play a crucial role in machine learning algorithms and allow us to fine-tune our models.

Thank you for your attention throughout this series of videos on linear models. I hope you found it informative and enjoyable. Feel free to like, share, and comment. Your feedback is appreciated. Stay tuned for more videos in the future.

Support Vector Machines (SVMs): A friendly introduction
  • 2019.01.27
  • www.youtube.com
Announcement: New Book by Luis Serrano! Grokking Machine Learning. bit.ly/grokkingML40% discount code: serranoytAn introduction to support vector machines (S...
 

Denoising and Variational Autoencoders




Hello, I'm Luis Serrano, and in this video, we will be discussing autoencoders. Specifically, we will focus on denoising and variational autoencoders.

Autoencoders are popular generator models and are part of a series that includes generative adversarial networks and restricted Boltzmann machines. If you enjoy generator models, be sure to check out the links in the comments.

To understand autoencoders, let's imagine you want to comprehend a book, so you ask your intelligent friend Aisha to summarize it for you. Aisha's job is to condense the book into a few pages to make it easier for you to understand. Then, we assess the quality of Aisha's summary by asking our other friend, Berta, to rewrite the entire book based on Aisha's summary. We compare Berta's rewritten book with the original book to evaluate their performance.

Ideally, Aisha would summarize the book's main ideas as accurately as possible, and Berta would excel at reconstructing the book based on those ideas. By forcing Aisha and Berta to condense and rebuild the book, respectively, they gain a deep understanding of its content. This concept is the basis of autoencoders. They simplify the data significantly and then reconstruct it, extracting the most useful features in the process.

Autoencoders are dimensionality reduction algorithms and are part of unsupervised machine learning. In the context of real-life datasets, let's consider a dataset of facial images. Each image is encoded using multiple numbers to represent the color of each pixel. Aisha, now referred to as the encoder, summarizes this dataset by extracting features such as eye size, hair color, and other facial characteristics. These features are then passed to Berta, the decoder, who helps reconstruct the faces from these extracted features.

The dimension of a data point refers to the number of numbers required to encode it. In this case, we reduce the dimensionality of the facial image dataset by summarizing it into a smaller set of features and then increase it again during the reconstruction process.

Autoencoders consist of two parts: an encoder that shrinks the data and a decoder that expands it back. The goal is for the reconstructed output to closely resemble the original input. The shrunken or simplified data is known as the latent space, which summarizes and compactifies the data to provide valuable insights.

Two important types of autoencoders are denoising autoencoders and variational autoencoders. Denoising autoencoders are trained to take corrupted data, such as noisy images, and generate clear and crisper versions of those images. Variational autoencoders, on the other hand, are trained to generate new data by picking samples from a simplified latent space, which represents a lower-dimensional version of the original data space.

Variational autoencoders are particularly effective at generating high-resolution images, such as faces. In the video, you can find samples of faces generated by a variational autoencoder.

The main concept to grasp before diving into autoencoders is dimensionality reduction. To illustrate this concept, let's consider a sample dataset of images. These images are simple 2x2 pixel images, with each pixel having a different color: red, blue, yellow, and green. Each pixel has an intensity value ranging from 0 to 1, where 1 represents full color intensity and 0 represents white.

Each data point in this dataset is described by four intensity values, one per pixel. However, upon closer inspection, we notice a peculiar property: the intensities of the pixels in the upper right and lower left corners are always the same, as are the intensities of the pixels in the upper left and lower right corners. This means we no longer need to encode all four intensity values independently. Instead, we can represent each data point with just two values: one for the upper-left/lower-right diagonal and one for the upper-right/lower-left diagonal.

By doing this, we have reduced the dimensionality of our dataset from four to two. This reduction allows us to capture the essential information of the images while discarding redundant or correlated features. It also simplifies the data representation and can help in various tasks such as visualization, storage, and analysis.
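A tiny sketch of that reduction, assuming the diagonal pixels really are always equal as in the example:

```python
def encode(image):
    # image = [top_left, top_right, bottom_left, bottom_right] intensities
    return [image[0], image[1]]              # keep one value per diagonal

def decode(latent):
    a, b = latent                            # a = top-left/bottom-right, b = top-right/bottom-left
    return [a, b, b, a]

original = [0.9, 0.3, 0.3, 0.9]              # a valid image in this dataset
print(decode(encode(original)) == original)  # True: nothing was lost
```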

Autoencoders use a similar principle of dimensionality reduction but in a more sophisticated and learned manner. Instead of manually selecting which features to keep or discard, autoencoders learn to automatically extract the most informative features from the input data.

Let's dive deeper into the architecture and training process of autoencoders. As mentioned earlier, autoencoders consist of an encoder and a decoder.

The encoder takes the input data, such as an image, and applies a series of transformations to encode it into a lower-dimensional representation, known as the latent space. The latent space is a compressed representation of the input data, capturing its most important features.

The decoder takes the latent space representation and applies the reverse transformations to reconstruct the original input data as closely as possible. The goal is to minimize the difference between the input data and the reconstructed output, effectively learning a compressed representation that can faithfully reproduce the input.

To train an autoencoder, we only need a dataset of input samples; the target output for each sample is the sample itself. During training, the autoencoder learns to minimize the reconstruction error, which is typically measured using a loss function such as mean squared error (MSE) or binary cross-entropy.

The training process involves feeding the input data through the encoder to obtain the latent space representation. Then, the latent space representation is passed through the decoder to reconstruct the data. The reconstruction is compared to the original input, and the discrepancy between them is used to update the weights of the encoder and decoder through backpropagation and gradient descent.

By iteratively training the autoencoder on a large dataset, it gradually learns to extract the most salient features from the input data and becomes capable of reconstructing it with minimal loss.
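A minimal PyTorch sketch of this encoder/decoder training loop, using four-pixel images like the earlier example as input; the architecture, learning rate, and training length are arbitrary choices:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid())   # 4 pixel intensities -> 2 latent numbers
decoder = nn.Sequential(nn.Linear(2, 4), nn.Sigmoid())   # 2 latent numbers -> 4 pixel intensities
autoencoder = nn.Sequential(encoder, decoder)

optimizer = torch.optim.Adam(autoencoder.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Toy dataset: images whose diagonal pixels match, as in the 2x2 example.
a = torch.rand(500, 1)
b = torch.rand(500, 1)
images = torch.cat([a, b, b, a], dim=1)

for epoch in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(images), images)   # reconstruction error
    loss.backward()
    optimizer.step()

print("final reconstruction error:", loss.item())
```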

Denoising autoencoders and variational autoencoders are two popular variations of the basic autoencoder architecture.

Denoising autoencoders are specifically designed to handle noisy or corrupted input data. During training, the autoencoder is presented with input data that has been intentionally corrupted, such as by adding random noise or introducing distortions. The autoencoder then learns to denoise the input by reconstructing the original, clean data as accurately as possible. This denoising capability allows the autoencoder to learn robust representations that are more resilient to noise.
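The denoising variant changes only the training loop: the network sees a deliberately corrupted input but is scored against the clean original. A self-contained sketch of that loop (the noise level and architecture are arbitrary assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid(), nn.Linear(2, 4), nn.Sigmoid())
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

clean = torch.rand(500, 4)                            # stand-in for clean training images
for epoch in range(2000):
    noisy = clean + 0.3 * torch.randn_like(clean)     # corrupt the input on purpose
    optimizer.zero_grad()
    loss = loss_fn(model(noisy), clean)               # compare against the clean image, not the noisy one
    loss.backward()
    optimizer.step()
```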

Variational autoencoders (VAEs) take a different approach by incorporating probabilistic modeling into the autoencoder framework. VAEs aim to learn a latent space that follows a specific probability distribution, such as a Gaussian distribution. This allows VAEs to generate new data points by sampling from the learned distribution in the latent space.

The training of VAEs involves not only minimizing the reconstruction error but also maximizing the likelihood of the latent space distribution matching the desired distribution. This is achieved through a combination of reconstruction loss and a regularization term called the Kullback-Leibler (KL) divergence, which measures the difference between the learned distribution and the desired distribution.

By optimizing the reconstruction loss and the KL divergence simultaneously, VAEs learn to generate new data samples that exhibit similar characteristics to the training data while exploring the diversity of the learned latent space.
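A sketch of the VAE objective described here, combining the reconstruction term with the KL divergence to a standard normal prior; the reparameterization step and the equal weighting of the two terms are the standard textbook choices, not details from the video:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    # How far the reconstruction is from the input.
    reconstruction = F.mse_loss(x_hat, x, reduction="sum")
    # KL divergence between N(mu, sigma^2) and the standard normal prior N(0, 1).
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return reconstruction + kl

def sample_latent(mu, log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, so gradients can flow through mu and log_var.
    return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
```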

Autoencoders are powerful unsupervised learning models that can learn compact representations of input data. They consist of an encoder and a decoder that work together to compress and reconstruct the data. Denoising autoencoders and variational autoencoders are two notable variations that extend the basic autoencoder architecture and offer additional capabilities such as denoising and generative modeling.

Let's examine some additional examples in this section. The first example appears to be a fully colored image, except for the green pixel in the bottom right corner, which has a value of 0.2. The denoising autoencoder is able to color the entire picture for us. Next, we have a white picture, except for the blue pixel with a value of 0.8 in the upper right corner. The denoising autoencoder decides that the picture should be white and turns the blue pixel white as well. I encourage you to explore the repository and experiment with your own images to see the outcomes. In summary, these autoencoders force any image to resemble one of the images in the training dataset.

The autoencoders we've used so far had only one layer, but that isn't always the case in neural networks. Autoencoders can have multiple layers and more complex architectures, such as convolutional or recurrent layers. One example is a convolutional autoencoder used for denoising images of handwritten digits from the MNIST dataset. These kinds of denoising autoencoders perform very well at cleaning up noisy images.

Now, let's move on to another fascinating property of autoencoders. They have the ability to generate entirely new data points, such as new images. These generated images are not mere copies of the images in the dataset but completely new and unique images that closely resemble the ones in the dataset. This capability is truly remarkable. For instance, an autoencoder can generate the face of a person who doesn't exist or a handwritten digit that has never been drawn before. To generate these new images, we only need to focus on the decoder and forget about the encoder. The decoder takes a set of numbers, called the latent representation, and generates an image based on those numbers. This process provides a visual representation of the latent space.

To illustrate this, let's consider an example. Suppose we input the numbers 0.3 and 0.8 into the decoder. It generates an image with intensities of 0.12 and 0.95. We can visualize the latent space as a square, where the horizontal axis corresponds to the red-green diagonal and the vertical axis corresponds to the blue-yellow diagonal of the image. Each point in this square represents an image, and as we move to the right, the intensity of the red-green diagonal increases. Similarly, as we move from bottom to top, the intensity of the blue-yellow diagonal increases. This visualization allows us to understand the latent space and its relationship to the generated images.

In the latent space, we can select any point uniformly, which is equivalent to choosing two numbers and passing them through the decoder. This process enables us to generate any of these images with the same likelihood. However, in some cases, we may only want to generate specific images that exist in the dataset, while excluding others. This situation is common when the latent space is large and contains noise, with only a small portion representing the desired images. To address this, we can use a technique called variational autoencoders.

In the case of variational autoencoders, we train two normal distributions that allow us to select points that are highly likely to be inside a specific region of interest. For example, if we have a small oval-shaped region in the latent space where the desired images exist, we want the autoencoder to generate images within or near that region with a higher probability than the rest of the space. We achieve this by training two normal distributions, one for each coordinate in the latent space. These distributions enable us to select points that are likely to be towards the center of the desired region. By using these selected points as the latent representation and passing them through the decoder, we can generate images that are more focused and aligned with our desired criteria.

The training process for variational autoencoders involves optimizing two objectives simultaneously: the reconstruction loss and the regularization loss. The reconstruction loss measures how well the generated images match the input images, similar to traditional autoencoders. The regularization loss, often referred to as the Kullback-Leibler (KL) divergence, encourages the latent distribution to resemble a known prior distribution, typically a standard normal distribution.

The addition of the regularization loss introduces a trade-off during training. On one hand, we want the reconstructed images to closely resemble the input images, which is achieved by minimizing the reconstruction loss. On the other hand, we want the latent distribution to match the prior distribution, promoting the generation of diverse and realistic images. Balancing these objectives is crucial to ensure that the model captures the important features of the data while still allowing for creativity and novelty in the generated images.

Once the variational autoencoder is trained, we can sample points from the prior distribution (often a standard normal distribution) and pass them through the decoder to generate new images. By controlling the sampling process, we can explore different regions of the latent space and generate diverse variations of the input data.

Variational autoencoders have been widely used in various applications, such as image generation, text generation, and anomaly detection. They provide a powerful framework for learning and generating complex data distributions while allowing control over the generated output.

In summary, autoencoders, including denoising and variational autoencoders, offer fascinating capabilities in image processing and generation. They can remove noise from images, reconstruct missing parts, and generate entirely new and unique images. Denoising autoencoders leverage an encoder-decoder architecture to learn the underlying structure of the data, while variational autoencoders add probabilistic modeling to capture diverse variations in the generated output. These techniques have revolutionized the field of unsupervised learning and have found extensive applications in computer vision and artificial intelligence research.

Denoising and Variational Autoencoders
  • 2022.01.15
  • www.youtube.com
A video about autoencoders, a very powerful generative model. The video includes:Intro: (0:25)Dimensionality reduction (3:35)Denoising autoencoders (10:50)Va...
 

Decision trees - A friendly introduction




Welcome to Serrano Academy! In this video, we'll be discussing decision trees, which are highly popular machine learning models. Decision trees are effective in real-life scenarios and are intuitive to understand. They mimic the way humans make decisions, making them easy to interpret.

To illustrate how decision trees work, let's use an example of a recommendation system. Imagine you need to decide whether to wear a jacket in the morning. You can start by checking if it's raining outside. If it's raining, wearing a jacket is a clear choice. But if it's not raining, you can further consider the temperature. If it's cold, you'll wear a jacket, and if it's warm, you won't. This decision process can be represented as a decision tree, where each decision becomes a node and the available options become edges leading to new nodes or final decisions.

There can be multiple decision trees for a given problem. For instance, another decision tree may involve checking if it's Monday, the color of your car, and whether you had coffee that day. However, not all decision trees are equally effective. The first decision tree we discussed seems to work well, while the second one includes irrelevant nodes. Finding the best decision tree is the goal of machine learning. Machine learning helps us discover the decision tree that fits the data best. Let's explore an example to understand this process.

Consider a small dataset from a recommendation system for apps based on user demographics. The dataset includes columns for gender, age, and the app downloaded. We want to formulate a rule to recommend apps to future users based on this dataset. By analyzing the data, we can observe trends. For example, all young people in the dataset downloaded TikTok, so it's reasonable to recommend TikTok to a 16-year-old female. Similarly, if we see that females in their 30s mostly downloaded YouTube, we can recommend YouTube to a 30-year-old female. By following this approach, we can make recommendations for different users based on their demographic information.

The intuition behind decision trees aligns with the mathematical principles used in machine learning. Decision trees can handle both categorical and numerical data. For numerical data, we determine the best split point by evaluating different possibilities. Each split point creates a decision stump, and we compare the accuracy of these stumps to find the best one. Once we find the best split, we can continue building the decision tree by iterating the process on the resulting subsets of data. This iterative process allows us to construct larger decision trees.
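Here is a sketch of the "compare decision stumps" idea for a single numerical feature, using accuracy as the score; the tiny age/label dataset is invented for illustration:

```python
def best_split(ages, labels):
    """Try a threshold between each pair of consecutive ages and keep the one whose
    decision stump (below -> one app, above -> the other) is most accurate."""
    best_threshold, best_accuracy = None, 0.0
    values = sorted(set(ages))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [l for a, l in zip(ages, labels) if a <= threshold]
        right = [l for a, l in zip(ages, labels) if a > threshold]
        # Each side of the stump predicts its majority label.
        correct = sum(side.count(max(set(side), key=side.count)) for side in (left, right) if side)
        accuracy = correct / len(ages)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy

ages = [14, 16, 22, 25, 31, 35, 40]   # hypothetical users
labels = ["TikTok", "TikTok", "TikTok", "YouTube", "YouTube", "YouTube", "YouTube"]
print(best_split(ages, labels))       # -> (23.5, 1.0)
```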

Decision trees are powerful machine learning models that provide accurate predictions and are easy to interpret. They mimic human decision-making processes and can be trained on various types of data. By finding the best decision tree for a specific problem, we can make effective recommendations or predictions based on given data.

Decision trees - A friendly introduction
  • 2022.09.29
  • www.youtube.com
A video about decision trees, and how to train them on a simple example.Accompanying blog post: https://medium.com/@luis.serrano/splitting-data-by-asking-que...
 

A friendly introduction to Bayes Theorem and Hidden Markov Models




Hi, and welcome to this introduction to Bayes' theorem and hidden Markov models. I'm Luis Serrano from Udacity, where I teach machine learning and artificial intelligence courses. In this scenario, we have two friends named Alice and Bob who live far apart and communicate over the phone. Bob's mood changes based on the weather: if it's sunny, Bob is happy, and if it's rainy, Bob is grumpy. Alice can therefore infer the weather from Bob's mood.

Let's make the scenario more complicated. Bob is mostly happy when it's sunny, but there are exceptions. He's mostly grumpy when it's rainy, but there are also exceptions. We have calculated probabilities based on past data. When it's sunny, Bob is happy with an 80% probability, and grumpy with a 20% probability. When it's rainy, Bob is grumpy with a 60% probability, and happy with a 40% probability.

Now, let's consider a specific situation. Bob tells Alice that this week was an emotional roller coaster. On Monday, he was happy, on Tuesday he was grumpy, on Wednesday he was happy again, on Thursday he was grumpy, and on Friday he was happy. Alice tries to infer the weather based on Bob's mood.

To determine the likelihood of this sequence of moods, we use a hidden Markov model. It has observations (Bob's mood) and hidden states (the weather). We calculate transition probabilities (the probability of going from one state to another) and emission probabilities (the probability of observations being emitted from the hidden states).

In this video, we will answer four questions. First, how do we calculate these probabilities? Second, what is the probability of a random day being sunny or rainy, regardless of Bob's mood? Third, if Bob is happy today, what is the probability that it's sunny or rainy? And fourth, if Bob is happy for three consecutive days, what is the most likely weather?

We can calculate the probabilities by analyzing past data. We count the occurrences of certain weather patterns and Bob's moods to estimate the probabilities. With enough data, we can get good estimations of the actual probabilities.

To determine the probability of a random day being sunny or rainy, independent of Bob's mood, we can either count the occurrences of sunny and rainy days in the past data or use the transition probabilities. In this case, we find that it's 2/3 likely to be sunny and 1/3 likely to be rainy.

If Bob is happy today, the probabilities of sunny and rainy change, and we use Bayes' theorem to update them. We combine the prior probabilities (2/3 sunny and 1/3 rainy) with the emission probabilities (when it's sunny, Bob is happy 80% of the time and grumpy 20% of the time; when it's rainy, he is happy 40% of the time and grumpy 60% of the time). Using Bayes' theorem, we calculate the posterior probabilities: 8/10 sunny and 2/10 rainy.
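The numbers above come from a direct application of Bayes' theorem; here is a small sketch of that computation, using the priors and emission probabilities stated above:

```python
p_sunny, p_rainy = 2 / 3, 1 / 3      # prior probabilities of the weather
p_happy_given_sunny = 0.8            # emission probabilities for "happy"
p_happy_given_rainy = 0.4

# Bayes' theorem: P(sunny | happy) = P(happy | sunny) * P(sunny) / P(happy)
p_happy = p_happy_given_sunny * p_sunny + p_happy_given_rainy * p_rainy
p_sunny_given_happy = p_happy_given_sunny * p_sunny / p_happy
p_rainy_given_happy = p_happy_given_rainy * p_rainy / p_happy

print(p_sunny_given_happy, p_rainy_given_happy)   # -> 0.8 and 0.2, i.e. 8/10 sunny and 2/10 rainy
```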

These probabilities allow us to infer the weather based on Bob's mood. If Bob is happy, it's more likely to be sunny. If Bob is grumpy, it's more likely to be rainy. The probabilities change based on the new information.

We use hidden Markov models and Bayes' theorem to infer the weather from Bob's mood. By analyzing past data, we calculate transition probabilities and emission probabilities. This helps us determine the likelihood of certain weather patterns based on Bob's mood.

A friendly introduction to Bayes Theorem and Hidden Markov Models
  • 2018.03.27
  • www.youtube.com
Announcement: New Book by Luis Serrano! Grokking Machine Learning. bit.ly/grokkingML40% discount code: serranoytA friendly introduction to Bayes Theorem and ...
 

Shannon Entropy and Information Gain




Hi, I'm Luis Serrano, and I'm here to talk about Shannon entropy and information gain. If you're interested in a more detailed explanation, I have written a blog post on this topic; you can find the link in the comments section.

Let's start by introducing the concept of entropy, which originates from physics. We can illustrate it using the three states of water: solid (ice), liquid (water), and gas (water vapor). Each state has a different level of entropy, which measures the speed at which particles inside an object are moving. Ice has low entropy because its particles are moving slowly, making it a stable substance. Water has medium entropy, as the particles move a bit faster. Water vapor has high entropy because the particles inside it move very quickly.

Entropy is not only a concept in physics but also appears in mathematics, particularly in probability theory. To demonstrate this, let's consider an example with three buckets, each containing balls of different colors. Bucket 1 has four red balls, Bucket 2 has three red balls and one blue ball, and Bucket 3 has two red balls and two blue balls. Based on intuition, we can infer that Bucket 1 has low entropy, Bucket 2 has medium entropy, and Bucket 3 has high entropy.

To validate our intuition, we can measure entropy by looking at how many ways the balls in each bucket can be rearranged. In the first bucket, with four red balls, there is only one distinguishable arrangement, since all the balls look identical. For the second bucket, there are a handful of distinguishable arrangements, and for the third bucket there are even more. We can count these arrangements with the binomial coefficient, which gives a quantitative measure of entropy. Based on how much rearrangement is possible, we can confirm that Bucket 1 has low entropy, Bucket 2 has medium entropy, and Bucket 3 has high entropy.
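
Concretely, the number of distinguishable arrangements of four balls is the binomial coefficient "four choose the number of blue balls":

```python
from math import comb

# Number of distinguishable color arrangements of 4 balls.
print(comb(4, 0))  # Bucket 1: 4 red, 0 blue -> 1 arrangement
print(comb(4, 1))  # Bucket 2: 3 red, 1 blue -> 4 arrangements
print(comb(4, 2))  # Bucket 3: 2 red, 2 blue -> 6 arrangements
```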

However, there is a more precise way to determine entropy based on information. Let's imagine playing a game with these buckets. We start with a particular arrangement of balls and draw them randomly, trying to reproduce the exact sequence of the original arrangement. If we succeed, we win a significant amount of money. Otherwise, we win nothing. Now, the question arises: which bucket is the best to play the game with, and which one is the worst?

Upon reflection, we realize that Bucket 1 is the best choice because all the balls are red, making it easier to reproduce the original sequence. Bucket 2 is the medium choice since it contains a mix of red and blue balls, and Bucket 3 is the worst choice as we have no clue which color we will draw. We can calculate the probability of winning in each game by considering the likelihood of drawing a specific ball from the bucket. For Bucket 1, the probability of winning is 100% since all the balls are red. For Bucket 2, the probability is lower due to the presence of blue balls, and for Bucket 3, it is the lowest as there is an equal chance of drawing either red or blue balls.
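
Assuming each ball is drawn with replacement (so the proportions in the bucket never change) and that we must match the original four-ball sequence exactly, the winning probabilities work out roughly as follows:

```python
# Probability of reproducing the original sequence, drawing with replacement.
# Bucket 1: four red; Bucket 2: three red, one blue; Bucket 3: two of each.
p_bucket1 = 1.0 ** 4              # 1.0    (every draw is certainly red)
p_bucket2 = (3/4) ** 3 * (1/4)    # ≈ 0.105
p_bucket3 = (1/2) ** 4            # 0.0625

print(p_bucket1, p_bucket2, p_bucket3)
```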

Now, let's summarize the probabilities and their corresponding entropy levels in a table. We can observe that Bucket 1 has a high probability of winning, resulting in low entropy. Bucket 2 has a moderate probability, indicating medium entropy. Lastly, Bucket 3 has the lowest probability of winning, leading to high entropy.

To arrive at a formula for entropy, we use logarithms. The logarithm of a product equals the sum of the logarithms, so instead of working with a product of probabilities we can work with a sum: taking the logarithm of the product of probabilities turns it into a sum of the logarithms of the individual probabilities. Averaging these log-probabilities (and flipping the sign, since logarithms of probabilities are negative) gives entropy as the average information content, or uncertainty, associated with each outcome.

The formula for entropy is given by:

Entropy = - (p1 * log(p1) + p2 * log(p2) + ... + pn * log(pn))

where p1, p2, ..., pn represent the probabilities of the different outcomes or states. The logarithm is typically taken in base 2, so entropy is measured in bits.

Applying this formula to our example, let's calculate the entropy for each bucket. In Bucket 1, where all the balls are red, the probability of drawing a red ball is 1 (100%). Thus, the entropy for Bucket 1 is:

Entropy(Bucket 1) = - (1 * log2(1)) = 0

Since the logarithm of 1 is 0, the entropy is 0, indicating no uncertainty at all: we always know which color we will draw.

For Bucket 2, there are three red balls and one blue ball. The probability of drawing a red ball is 3/4, while the probability of drawing a blue ball is 1/4. Therefore, the entropy for Bucket 2 is:

Entropy(Bucket 2) = - (3/4 * log2(3/4) + 1/4 * log2(1/4))

Calculating the values, we get:

Entropy(Bucket 2) ≈ 0.811

This value represents a moderate level of uncertainty.

Moving on to Bucket 3, where there are two red balls and two blue balls, the probability of drawing a red ball or a blue ball is 1/2 each. Thus, the entropy for Bucket 3 is:

Entropy(Bucket 3) = - (1/2 * log2(1/2) + 1/2 * log2(1/2))

Simplifying the expression, we find:

Entropy(Bucket 3) = - (1/2 * (-1) + 1/2 * (-1)) = 1

The entropy for Bucket 3 is 1 bit, the highest of the three buckets, reflecting maximum uncertainty: each draw is equally likely to be red or blue.
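
As a quick check on these three values, here is a minimal sketch of the entropy formula in code:

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits: sum of -p * log2(p), skipping zero probabilities."""
    return sum(-p * log2(p) for p in probabilities if p > 0)

print(entropy([1.0]))        # Bucket 1: 0.0
print(entropy([3/4, 1/4]))   # Bucket 2: ≈ 0.811
print(entropy([1/2, 1/2]))   # Bucket 3: 1.0
```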

Entropy quantifies the level of uncertainty, or information, in a system. Using probabilities, we can calculate it as the average information content associated with the different outcomes. Higher entropy means greater uncertainty; lower entropy means more predictable outcomes. In decision trees, information gain is simply the decrease in entropy obtained by splitting the data on a feature, which is why these two ideas usually appear together. Understanding entropy and information gain is valuable in information theory, machine learning, and data analysis, as it lets us measure the complexity and predictability of systems.

Shannon Entropy and Information Gain
  • 2017.11.04
  • www.youtube.com
 

Naive Bayes classifier: A friendly approach



Naive Bayes classifier: A friendly approach

Hi, I'm Luis Serrano, and in this video, we'll explore the Naive Bayes classifier. Bayes' theorem is crucial in probability and incredibly useful in machine learning. Rather than seeing it as a complex formula involving probability ratios, let's think of it as the probability of an event happening given that we have information about another event. Naive Bayes extends this idea by making naive assumptions that simplify the math when several events are involved.

To illustrate, let's build a spam detector. We start with a dataset of 100 emails, where 25 are spam and 75 are not. Our goal is to identify properties that correlate with spam emails. Let's focus on the word "buy." Among the spam emails, 20 contain "buy," while 5 non-spam emails have it. Based on this, we can conclude that if an email contains "buy," there's an 80% chance it's spam.

Now, let's consider another word, "cheap." Among the spam emails, 15 have "cheap," and among the non-spam emails, 10 have it. If an email contains "cheap," there's a 60% chance it's spam.

But what if we want to analyze both "buy" and "cheap" together? Among the spam emails, 12 contain both words, and there are no instances of this combination among the non-spam emails. If an email contains both "buy" and "cheap," it's 100% likely to be spam. However, a 100% certainty seems too strong and unrealistic for a classifier.

The issue arises because we found no non-spam emails containing both "buy" and "cheap." We could collect more data, but there is an alternative: the naive assumption that, within each class, the two words appear independently of each other. Under that assumption, the fraction of non-spam emails containing both words is the product of the individual fractions, 5/75 × 10/75 ≈ 0.9%, which corresponds to about two thirds of an email among the 75 non-spam ones. Likewise, 20/25 × 15/25 = 48% of the spam emails, or 12 emails, are expected to contain both words.

Now, with this assumption, we can compute the probability of an email being spam if it contains both "buy" and "cheap." By applying Bayes' theorem, we find that the probability is approximately 94.737%.
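
Here is a minimal sketch of the whole calculation using the counts from the example above (the variable names are ours, purely for illustration):

```python
# Email counts from the example: 100 emails, 25 spam and 75 non-spam.
n_spam, n_ham = 25, 75
buy   = {"spam": 20, "ham": 5}
cheap = {"spam": 15, "ham": 10}

# Naive assumption: "buy" and "cheap" appear independently within each class,
# so P(buy and cheap | class) = P(buy | class) * P(cheap | class).
p_both_spam = (buy["spam"] / n_spam) * (cheap["spam"] / n_spam)   # 0.48
p_both_ham  = (buy["ham"]  / n_ham)  * (cheap["ham"]  / n_ham)    # ≈ 0.0089

# Bayes' theorem with priors P(spam) = 25/100 and P(non-spam) = 75/100.
numerator   = p_both_spam * (n_spam / 100)
denominator = numerator + p_both_ham * (n_ham / 100)
print(numerator / denominator)  # ≈ 0.94737
```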

In short, the Naive Bayes classifier amounts to filling out a table of counts from the data. When certain combinations of events are too sparse to count reliably, we make the naive assumption that the events are independent. That assumption rarely holds exactly, but it simplifies the calculations enormously and still lets us estimate the probabilities we need to build a spam classifier.

Naive Bayes classifier: A friendly approach
  • 2019.02.10
  • www.youtube.com