
Lecture 12.1 — Boltzmann machine learning [Neural Networks for Machine Learning]

In the previous video, I demonstrated how a Boltzmann machine can be used as a probabilistic model for binary data vectors. Now, let's delve into the Boltzmann machine learning algorithm.

Initially, the Boltzmann machine learning algorithm was slow and noisy, making it impractical. Several techniques were later developed that speed it up significantly, making it practical and effective; in fact, it was used successfully in a million-dollar machine learning competition (the Netflix Prize, covered in lecture 12.5).

The Boltzmann machine learning algorithm is unsupervised: it requires no labels. It builds a model of a set of input vectors (they could equally be output vectors; no input-to-output mapping is involved) by maximizing the product of the probabilities that the Boltzmann machine assigns to the training vectors, which is equivalent to maximizing the sum of their log probabilities.

Equivalently, it maximizes the probability that, if we let the network settle to its stationary distribution N separate times with no external input and sampled the visible vector once each time, we would obtain exactly the N training cases.

Learning in the Boltzmann machine can be challenging due to the interactions between the weights. Each weight needs to know about other weights to determine the appropriate direction for change. Surprisingly, a simple learning algorithm using only local information is sufficient. The learning rule is based on the difference of two correlations, capturing the information needed for weight updates.

The learning rule consists of two terms. The first term increases the weights proportionally to the product of unit activities observed when presenting the data. This term is akin to the storage term in a Hopfield network. However, without control, the weights would keep growing, leading to instability. The second term decreases the weights proportional to the product of unit activities when sampling from the model's distribution. This term helps eliminate spurious minima and maintains stability.

The derivative of the log probability of a visible vector with respect to a weight is remarkably simple: it is just the difference between the expected product of the two connected units' activities with the data clamped and the same expectation when the network samples freely from its model. This simplicity arises because log probabilities are linear in the energy function, which in turn is linear in the weights.
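
In symbols, with s_i the binary state of unit i, ε a learning rate, and angle brackets denoting expectations in the two phases, the rule can be written as:

```latex
\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}}
  \;=\; \langle s_i s_j \rangle_{\text{data}} \;-\; \langle s_i s_j \rangle_{\text{model}},
\qquad
\Delta w_{ij} \;=\; \varepsilon \bigl( \langle s_i s_j \rangle_{\text{data}}
  - \langle s_i s_j \rangle_{\text{model}} \bigr).
```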

The negative phase in the learning process serves to unlearn, reducing the influence of certain configurations. It raises the energy of configurations that contribute strongly to the partition function, thus reducing their contribution.

To collect the statistics required for the learning rule, the original procedure runs two phases. In the positive phase, a data vector is clamped on the visible units and the hidden units are updated repeatedly until the network reaches thermal equilibrium; the correlations between pairs of units are then sampled and averaged over all the data vectors. In the negative phase, nothing is clamped and the network settles to equilibrium with no external interference; the pairwise correlations are again sampled, many times.
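
The following is a minimal sketch of this two-phase statistics collection, assuming a Boltzmann machine with a symmetric weight matrix `W` (zero diagonal), no separate bias terms, the first `n_visible` units visible, and binary data rows as NumPy float arrays; all names are illustrative, not from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(state, W, clamped):
    """One sequential update of every unclamped unit (W symmetric, zero diagonal)."""
    for i in range(len(state)):
        if i in clamped:
            continue
        p_on = sigmoid(W[i] @ state)   # total input from all other units
        state[i] = 1.0 if rng.random() < p_on else 0.0
    return state

def pairwise_stats(W, data, n_visible, burn_in=50, n_samples=10):
    """Approximate <s_i s_j> in the positive (clamped) and negative phases."""
    n = W.shape[0]
    clamped = set(range(n_visible))
    pos = np.zeros_like(W)
    for v in data:                                   # positive phase
        s = np.concatenate([v, rng.integers(0, 2, n - n_visible).astype(float)])
        for _ in range(burn_in):
            gibbs_sweep(s, W, clamped)
        for _ in range(n_samples):
            gibbs_sweep(s, W, clamped)
            pos += np.outer(s, s)
    pos /= len(data) * n_samples

    neg = np.zeros_like(W)
    s = rng.integers(0, 2, n).astype(float)          # negative phase: free-running
    for _ in range(burn_in):
        gibbs_sweep(s, W, set())
    for _ in range(n_samples * len(data)):
        gibbs_sweep(s, W, set())
        neg += np.outer(s, s)
    neg /= n_samples * len(data)
    return pos, neg
```

The weight update would then be `W += eps * (pos - neg)`, keeping the diagonal at zero.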

Determining how many repetitions are needed in the negative phase is difficult. The energy landscape of a Boltzmann machine typically contains multiple modes with similar energies, and the negative phase must sample extensively for all of them to be adequately represented.

The learning process in a Boltzmann machine involves iteratively updating the weights based on the correlations obtained from the positive and negative phases. By comparing the correlations in these phases, the algorithm can adjust the weights to improve the model's representation of the data.

One important aspect to note is that the Boltzmann machine learning algorithm is a form of Hebbian learning. It follows the principle proposed by Donald Hebb in 1949: synapses in the brain strengthen connections between neurons that are active together. In the Boltzmann machine, the positive phase raises the weights in proportion to the product of unit activities observed when the data is presented, which is exactly Hebb's idea.

However, without a counterbalancing mechanism, the weights would keep growing indefinitely, leading to instability. The negative phase serves this purpose by reducing the weights based on the product of unit activities when sampling from the model's distribution. This balance ensures that the Boltzmann machine remains stable during the learning process.

It's worth noting the theoretical underpinnings in terms of energy functions and probability distributions: at thermal equilibrium, the probability of a global configuration is proportional to the exponential of its negative energy. Because the energy is linear in the weights, the log probability is too, which makes the derivatives simple and the weight updates cheap to compute.

The Boltzmann machine learning algorithm is an unsupervised learning approach that aims to build a model of input or output vectors. By maximizing the probabilities assigned to training vectors, the algorithm adjusts the weights using a combination of positive and negative correlations. This learning process, along with techniques like contrastive divergence, helps the Boltzmann machine capture complex patterns in the data.


Lecture 12.2 — More efficient ways to get the statistics [Neural Networks for Machine Learning]

In this video, the speaker provides detailed information on how to speed up the Boltzmann machine learning algorithm by employing clever techniques to maintain Markov chains near the equilibrium distribution and utilizing mean field methods. While acknowledging that the material is advanced and not part of the course curriculum, the speaker assures viewers that they can skip this video unless they have a keen interest in optimizing deep Boltzmann machines.

The speaker discusses the difficulty of reaching thermal equilibrium when starting from a random state: it may take a very long time, and there is no easy way to tell when equilibrium has been reached. To address this, the speaker suggests starting from the state obtained the previous time that same data vector was seen. This stored state, referred to as a "particle," serves as an interpretation of the data vector in the hidden units and offers a warm start: if the weights have changed only slightly, a few updates of the units in a particle bring it back to equilibrium. Particles can be used for both the positive phase (when a data vector is clamped) and the negative phase (when nothing is clamped).

To collect statistics efficiently, the speaker introduces a method by Radford Neal in 1992. In the positive phase, data-specific particles are employed, with each particle representing a configuration of hidden units along with the associated data vector. Sequential updates are performed on the hidden units in each particle, with the relevant data vector clamped. The probabilities of connected unit pairs are then averaged across all particles. In the negative phase, fantasy particles, representing global configurations, are utilized. After each weight update, the units in each fantasy particle are sequentially updated. Again, the probabilities of connected unit pairs are averaged across all fantasy particles. The learning rule is defined as the change in weights, proportional to the difference between the averages obtained with data and fantasy particles.
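
A sketch of one such update with persistent particles, under the same illustrative conventions as the earlier sketch (symmetric `W`, zero diagonal, first `n_visible` units visible); this is a simplified reading of the procedure as summarized here, not Neal's actual algorithm in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sweep(s, W, clamped=()):
    """One sequential update of all unclamped units (W symmetric, zero diagonal)."""
    for i in range(len(s)):
        if i not in clamped:
            s[i] = float(rng.random() < sigmoid(W[i] @ s))
    return s

def particle_step(W, data, data_particles, fantasy_particles,
                  n_visible, eps=0.001, k=2):
    """One weight update using persistent particles as warm starts."""
    pos = np.zeros_like(W)
    for v, s in zip(data, data_particles):
        s[:n_visible] = v                       # clamp this particle's data vector
        for _ in range(k):                      # a few sweeps restore equilibrium
            sweep(s, W, clamped=range(n_visible))
        pos += np.outer(s, s)
    neg = np.zeros_like(W)
    for s in fantasy_particles:                 # nothing clamped
        for _ in range(k):
            sweep(s, W)
        neg += np.outer(s, s)
    W += eps * (pos / len(data) - neg / len(fantasy_particles))
    return W
```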

While this learning rule works well for full batch learning, applying it to mini-batches poses challenges. Due to multiple weight updates in mini-batch learning, the stored data-specific particles for each data vector may no longer be close to thermal equilibrium. To overcome this issue, the speaker proposes making an assumption that the set of good explanations (states of hidden units interpreting a data vector) is unimodal when a data vector is clamped. This assumption enables the use of a mean field approximation, which provides an efficient method for approaching thermal equilibrium or an approximation thereof with the data.
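
Concretely, the mean field method replaces stochastic binary states with real-valued probabilities p_i and iterates the first update below; the damped version (second equation, with 0 < λ < 1) helps avoid the oscillations that the plain update can produce:

```latex
p_i^{t+1} = \sigma\Bigl(b_i + \sum_j p_j^{\,t}\, w_{ij}\Bigr),
\qquad
p_i^{t+1} = \lambda\, p_i^{t} + (1-\lambda)\,\sigma\Bigl(b_i + \sum_j p_j^{\,t}\, w_{ij}\Bigr).
```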

To implement efficient mini-batch learning, the speaker suggests a special architecture called a deep Boltzmann machine: a Boltzmann machine with no connections within a layer and no skip-layer connections. Because every connection joins adjacent layers, half of the units (alternate layers) can be updated in parallel, alternating with the other half, which makes the fantasy-particle updates efficient.
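
A sketch of one alternating parallel update for a fantasy particle in such a layered net; the `weights[k]` convention (connecting layer k to layer k+1) and the even/odd schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def alternating_update(layers, weights):
    """Update even-numbered layers in parallel, then odd-numbered ones.

    layers:  list of binary state vectors, layer 0 = visible.
    weights: weights[k] has shape (len(layers[k]), len(layers[k+1])).
    Units within a layer have no connections, so given the neighboring
    layers they are conditionally independent and can all be sampled at once.
    """
    for parity in (0, 1):
        for k in range(parity, len(layers), 2):
            x = np.zeros_like(layers[k])
            if k > 0:
                x = x + layers[k - 1] @ weights[k - 1]   # input from below
            if k + 1 < len(layers):
                x = x + weights[k] @ layers[k + 1]       # input from above
            layers[k] = (rng.random(x.shape) < sigmoid(x)).astype(float)
    return layers
```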

The speaker discusses the successful application of deep Boltzmann machines using mean field for positive phase learning and alternating updates of layers for the negative phase. Russ Salakhutdinov, for example, used this approach to model MNIST digits, and the generated data closely resembled the actual MNIST dataset.

Furthermore, the speaker addresses the challenge of estimating negative statistics with only a limited number of negative examples (fantasy particles). Typically, the global configuration space for interesting problems is highly multimodal. However, the learning process interacts with the Markov chain used for gathering negative statistics, effectively increasing its mixing rate. The speaker explains that when the fantasy particles outnumber the positive data in a mode of the energy surface, the energy is raised, leading the particles to escape from that mode. This interaction between learning and the Markov chain enables exploration of multiple modes, even with a limited number of particles.

This property of the learning algorithm, where the energy surface is manipulated to enhance the mixing rate of the Markov chain, is a crucial aspect of the Boltzmann machine's effectiveness. The learning process actively drives the fantasy particles to explore different modes of the energy surface, allowing the model to escape from local minima that the Markov chain alone would struggle to overcome in a reasonable time.

To further illustrate this concept, imagine the energy surface as a landscape whose valleys represent different modes, with the fantasy particles as explorers navigating it. Initially, some modes contain more fantasy particles than the data would warrant. The learning algorithm responds by raising the energy surface in those regions, which pushes the particles out of the overpopulated modes.

By raising the energy surface, the learning algorithm encourages the fantasy particles to move away from overpopulated modes, seeking alternative modes with fewer particles. As they explore different regions of the energy landscape, the particles eventually escape from the initially dominant modes and distribute themselves across multiple modes more in line with the data distribution.

This process allows the Boltzmann machine to uncover various modes of the energy surface, effectively capturing the complex multimodal structure of the underlying data distribution. While the Markov chain alone might struggle to escape from local minima, the active manipulation of the energy surface by the learning algorithm enables the exploration of different modes, leading to a more accurate representation of the data.

In summary, the interaction between the learning algorithm and the Markov chain used to gather negative statistics is a key factor in the Boltzmann machine's effectiveness. The learning process dynamically adjusts the energy surface, encouraging the fantasy particles to explore different modes and escape from local minima. This ability to explore the energy landscape enhances the model's capacity to capture the complex distribution of the underlying data, resulting in improved performance and more accurate representations of the data.


Lecture 12.3 — Restricted Boltzmann Machines [Neural Networks for Machine Learning]

Restricted Boltzmann machines (RBMs) simplify the architecture: there is a single layer of hidden units, with no connections between hidden units and none between visible units. This makes it easy to compute the equilibrium distribution of the hidden units when the visible units are clamped. The learning algorithm for general Boltzmann machines is slow, but a shortcut discovered in 1998 led to an efficient learning algorithm for RBMs. The RBM is a bipartite graph, so given the states of one layer, the units of the other layer are conditionally independent and can be updated in parallel.

Because of this restricted connectivity, the expected values of the connections between visible and hidden units can be computed quickly, in parallel, when a data vector is clamped. A learning algorithm for RBMs introduced by Tieleman in 2008 (persistent contrastive divergence) clamps a data vector on the visible units, computes the expected values of the connections, and averages them over the data vectors in the mini-batch. In the negative phase, fantasy particles (global configurations) are kept; each particle is updated a few times after every weight update, and the expected values of the connections are averaged over the fantasy particles. This algorithm builds good density models for sets of binary vectors.

Another learning algorithm for RBMs, contrastive divergence, is faster but not as good at building density models. It runs an alternating chain of updates between the visible and hidden units, and the learning rule updates the weights based on the difference between the expected values of the connections at the beginning and at the end of the chain. Running the chain all the way to thermal equilibrium turns out not to be necessary; even one full step produces effective learning.
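
Here is a minimal sketch of that one-full-step procedure, commonly written as CD-1; matrix shapes and names are illustrative, and following common practice the reconstruction uses probabilities rather than sampled binary values.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, a, b, v0, eps=0.05):
    """One CD-1 update for an RBM.

    W: visible-by-hidden weights; a, b: visible and hidden biases;
    v0: a mini-batch of binary data rows.
    """
    # Positive phase: hidden probabilities given the data.
    h0 = sigmoid(v0 @ W + b)
    h0_sample = (rng.random(h0.shape) < h0).astype(float)
    # One full step: reconstruct the visibles, then recompute hidden probs.
    v1 = sigmoid(h0_sample @ W.T + a)
    h1 = sigmoid(v1 @ W + b)
    # Update: difference between data statistics and reconstruction statistics.
    W += eps * (v0.T @ h0 - v1.T @ h1) / len(v0)
    a += eps * (v0 - v1).mean(axis=0)
    b += eps * (h0 - h1).mean(axis=0)
    return W, a, b
```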

The shortcut works because the Markov chain wanders away from the data towards the equilibrium distribution. By changing the weights to lower the probability of reconstructions and raise the probability of the data after one full step, the chain stops wandering away from the data. The learning stops when the data and the reconstructions have the same distribution. The energy surface in the space of global configurations is modified to create an energy minimum at the data point during learning.

However, the shortcut fails for regions of the data space that are far from any data. Persistent particles, whose states are remembered and given additional updates after each weight change, can help address this. A good compromise between speed and correctness is to start with small weights and CD-1, and to increase the number of steps (CD-3, CD-5, and so on) as the weights grow; this keeps the learning working reasonably well even as the mixing rate of the Markov chain decreases.

It is also important to consider regions of the data space that the model favors but that are far from any actual data points. These regions, known as low-energy holes, distort the normalization term (the partition function). Persistent particles help here too: because they retain their states and keep being updated, they eventually find and fill up these low-energy holes, improving the model's performance.

The RBM learning algorithm using the shortcut and various techniques, such as CD with different numbers of steps and the use of persistent particles, allows for efficient learning and the construction of effective density models for sets of binary vectors. While the shortcut deviates from maximum likelihood learning and has theoretical limitations, it has proven to work well in practice, leading to a resurgence of interest in Boltzmann machine learning.


Lecture 12.4 — An example of RBM learning [Neural Networks for Machine Learning]

In this video, we will demonstrate a simple example of a restricted Boltzmann machine (RBM) learning a model of handwritten twos. Once the model is trained, we will assess its ability to reconstruct twos and observe its behavior when given a different digit to reconstruct. Additionally, we will examine the weights obtained by training a larger RBM on all digit classes, which learns a wide range of features that are effective for reconstructing and modeling various digit classes.

The RBM used in this example takes 16x16 pixel images of twos as input and has 50 binary hidden units that act as feature detectors. When presented with a data case, the RBM activates the feature detectors using the weights on the connections from pixels to feature detectors; each binary neuron makes a stochastic decision to adopt a state of 1 or 0. The RBM then uses these activations to reconstruct the data by making a binary decision for each pixel. The weights are updated by incrementing the weights between active pixels and active feature detectors when processing the data, and decrementing them when processing the reconstruction.

Initially, the weights are random, and the reconstructions have lower energy than the data. Through training on hundreds of digit examples and weight adjustments, the weights gradually form patterns. Many feature detectors start as global detectors, becoming more localized as training progresses. The final weights reveal that each neuron has become a different feature detector, with most detectors being local in nature. For example, a feature detector may detect the top of a two by activating its white pixels when the top of a two is present and its black pixels when there is nothing.

After learning the model, we can assess its reconstruction abilities. When given a test example of a two, the reconstruction is generally faithful, albeit slightly blurry. However, if we provide a test example from a different digit class, such as a three, the RBM reconstructs an image that resembles a two rather than a three. This behavior occurs because the RBM has primarily learned feature detectors specific to twos and lacks detectors for certain characteristics of other digits.

Furthermore, we showcase feature detectors learned in the first hidden layer of a larger RBM trained on all ten digit classes. These feature detectors exhibit a wide variety of patterns. Some detect specific features like slanted lines, while others capture long-range or spatial regularities introduced by the normalization of the data. Overall, the RBM demonstrates its capability to learn complex ways of representing and detecting features in the input data.

Additionally, I would like to point out that the RBM used in this demonstration consists of 500 hidden units, allowing it to model all ten digit classes. This model has undergone extensive training using a technique called contrastive divergence. As a result, it has acquired a diverse set of feature detectors.

Examining the feature detectors in the hidden layer, we observe intriguing patterns. For instance, there is a feature detector, denoted by the blue box, that appears suitable for detecting the presence of diagonal lines. On the other hand, the feature detector in the red box exhibits a unique characteristic. It prefers to activate pixels located very near the bottom of the image and dislikes pixels in a specific row positioned 21 pixels above the bottom. This behavior stems from the normalization of the data, where digits cannot exceed a height of 20 pixels. Consequently, a pixel activated in the positive weight region cannot simultaneously activate in the negative weight region, resulting in this long-range regularity being learned.

Furthermore, another feature detector, highlighted in the green box, demonstrates an interesting property. It detects the bottom position of a vertical stroke and can detect it in multiple positions while disregarding intermediate positions. This behavior resembles the least significant digit in a binary number, which alternates between being active and inactive as the magnitude of the number increases. It showcases the RBM's ability to develop complex representations of spatial relationships and positions.

These examples illustrate the RBM's capacity to learn and extract meaningful features from the input data. By adjusting the weights during the learning process, the RBM aims to make the data have low energy while maintaining higher energy for the reconstructions. This learning mechanism enables the RBM to model and reconstruct digit images effectively, capturing both global and local features of the digits in its learned representations.


Lecture 12.5 — RBMs for collaborative filtering [Neural Networks for Machine Learning]

In this video, we will discuss the application of Restricted Boltzmann Machines (RBMs) in collaborative filtering, specifically in the context of the Netflix competition. Collaborative filtering involves predicting how much a user would like a product based on their preferences for other products and the preferences of other users. The Netflix competition challenges participants to predict how much a user will like a movie based on their ratings of other movies.

The training data for this competition consists of a large dataset with a hundred million ratings for eighteen thousand movies by half a million users. To tackle the challenge of missing ratings for most movies, an important trick is employed when using RBMs. By utilizing this trick, models can be trained effectively and prove to be useful in practice, as demonstrated by the winning entry in the competition.

The approach treats each user as a training case: a user is represented as a vector of movie ratings, with one visible unit per movie, and each visible unit is a five-way softmax (one alternative per star rating) rather than a binary unit. The RBM has these softmax visible units plus binary hidden units. Crucially, each user's RBM contains visible units only for the movies that user actually rated, and the weights for a movie are shared by all users who rated it; this weight sharing keeps the number of parameters manageable. CD (contrastive divergence) learning is used to train the RBMs, starting with CD1 and later moving to CD3, CD5, and CD9.
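
A sketch of how the missing-rating trick and the five-way softmax visibles might look for a single user; the shared weight tensor `W` of shape (n_movies, 5, n_hidden), the omission of biases, and all names are illustrative assumptions, not the competition code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def user_cd1(W, rated, stars, eps=0.01):
    """One CD-1 step for one user.

    W:     shared weights, shape (n_movies, 5, n_hidden).
    rated: indices of the movies this user rated.
    stars: their ratings, 0..4.  Only the rated movies' softmax units
           exist in this user's RBM -- that is the missing-data trick.
    """
    Wu = W[rated]                                   # (m, 5, H) slice for this user
    v0 = np.zeros((len(rated), 5))
    v0[np.arange(len(rated)), stars] = 1.0          # one-hot encode the ratings
    h0 = sigmoid(np.einsum('mr,mrh->h', v0, Wu))    # hidden probs from the data
    h0s = (rng.random(h0.shape) < h0).astype(float)
    v1 = softmax(np.einsum('mrh,h->mr', Wu, h0s))   # reconstruct each rating
    h1 = sigmoid(np.einsum('mr,mrh->h', v1, Wu))
    W[rated] += eps * (np.einsum('mr,h->mrh', v0, h0)
                       - np.einsum('mr,h->mrh', v1, h1))
```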

The RBM models perform comparably to matrix factorization methods commonly used in collaborative filtering. However, they yield different errors. Combining the predictions of RBMs with those of matrix factorization models results in significant improvements. The winning group in the Netflix competition utilized multiple RBM models and matrix factorization models in their ensemble to achieve better predictions.

In summary, the application of Restricted Boltzmann Machines (RBMs) in collaborative filtering for the Netflix competition involved treating each user as a training case, using RBMs with visible units representing movies and binary hidden units. By leveraging weight sharing among users who have rated the same movie, the RBMs can handle the large dataset effectively.

The RBMs were trained using CD learning, with iterations of CD1, CD3, CD5, and CD9, and they performed similarly to matrix factorization models commonly used in collaborative filtering. However, the combination of RBMs and matrix factorization models led to a significant improvement in predictions. The winning entry in the Netflix competition employed multiple RBM models and matrix factorization models in their ensemble, showcasing the effectiveness of this approach.

The utilization of RBMs in collaborative filtering demonstrates their capability to handle large and sparse datasets, such as the Netflix dataset with millions of ratings. By modeling the relationships between users and movies, RBMs provide a powerful tool for making accurate predictions and improving recommendation systems.

The successful application of RBMs in collaborative filtering showcases their usefulness in the field of machine learning and recommendation systems, and it highlights the potential for utilizing ensemble approaches to further enhance prediction accuracy.


Lecture 13.1 — The ups and downs of backpropagation [Neural Networks for Machine Learning]

The video discusses the history of backpropagation, highlighting its origins in the 1970s and 1980s and why it fell out of favor in the 1990s. It challenges the popular belief that backpropagation failed due to its inability to handle multiple layers of nonlinear features. Instead, the main reasons for its abandonment were the limited computing power and small datasets available at the time.

Backpropagation was invented independently several times: by Bryson and Ho in the late 1960s, by Paul Werbos in 1974, by Rumelhart, Hinton, and Williams in 1981, and by David Parker and Yann LeCun in 1985. Initially it didn't work well for certain tasks, causing researchers to abandon it. However, in 1986 a paper by Rumelhart, Hinton, and Williams demonstrated its potential for learning multiple layers of nonlinear feature detectors.

By the late 1990s, most machine learning researchers had given up on backpropagation, favoring support vector machines (SVMs) instead. The popular explanation was that backpropagation struggled with multiple hidden layers and recurrent networks. However, from a historical perspective, the real reasons for its failure were the limited computing power and small labeled datasets, which prevented backpropagation from shining in complex tasks like vision and speech.

Different types of machine learning tasks have different requirements. In statistics, low-dimensional data with noise requires separating true structure from noise. Bayesian neural nets can handle this well, while non-Bayesian neural nets like backpropagation are not as effective. Support vector machines and Gaussian processes are more suitable for such tasks. In artificial intelligence, high-dimensional data with complex structure requires finding appropriate representations, which backpropagation can learn by leveraging multiple layers and ample computation power.

The limitations of support vector machines are discussed, noting that they are viewed as an extension of perceptrons with the kernel trick. They rely on non-adaptive features and one layer of adaptive weights. While they work well, they cannot learn multiple layers of representation. The video also briefly mentions a historical document from 1995, a bet between Larry Jackel and Vladimir Vapnik regarding the theoretical understanding and future use of big neural nets trained with backpropagation. Ultimately, both sides of the bet were proven wrong, as the limitations were practical rather than theoretical.

The failure of backpropagation in the 1990s can be attributed to the limitations of computing power and small datasets, rather than its inherent capabilities. It still had potential for complex tasks and eventually became successful when larger datasets and more powerful computers became available. The video emphasizes the importance of considering different machine learning tasks and their specific requirements when choosing the appropriate algorithms.


Lecture 13.2 — Belief Nets [Neural Networks for Machine Learning]

I abandoned backpropagation in the 1990s due to its reliance on a large number of labels, which were scarce at the time. However, I was inspired by the success of human learning with few explicit labels. To preserve the benefits of gradient descent learning without the need for extensive labels, I explored alternative objective functions. Generative models, which aim to model the input data rather than predict labels, aligned well with this pursuit. Graphical models, a concept combining discrete graph structures with real-valued computations, emerged as a promising approach in statistics and artificial intelligence. While Boltzmann machines were early examples of undirected graphical models, in 1992 Radford Neal introduced directed graphical models called sigmoid belief nets, employing the same kind of units as Boltzmann machines. The challenge then became how to learn these sigmoid belief nets.

Learning sigmoid belief nets encountered multiple issues. Deep networks with multiple hidden layers suffered from slow learning, and poor weight initialization was found to contribute to this problem. Backpropagation also tended to get stuck in suboptimal local optima, which, although reasonably good, were far from optimal for deep nets. While retreating to simpler models that allowed convex optimization was a possibility, it did not address the complexity of real-world data. To overcome these limitations, unsupervised learning emerged as a solution. With unsupervised learning, we could still use the efficiency and simplicity of gradient methods and stochastic mini-batch descent for weight adjustment, but the focus shifted to modeling the structure of the sensory input rather than the input-output relationship. The weights would be adjusted to maximize the probability of the generative model producing the observed sensory input.

Two primary problems arose: the inference problem and the learning problem. The inference problem involved inferring the states of unobserved variables, aiming to derive probability distributions over these variables given that they were not independent of each other. The learning problem involved adjusting the interactions between variables to make the network more likely to generate the training data. It entailed determining which nodes influenced others and the strength of their effect.

The marriage of graphical models and neural networks had a unique dynamic. Early graphical models relied on expert-defined graph structures and conditional probabilities, aiming to solve the inference problem. Neural networks, by contrast, prioritized learning and avoided hand-wired knowledge. Although neural networks lacked interpretability and the sparse connectivity that makes inference easy, they had the advantage of learning from training data.

Neural network versions of belief nets were nevertheless developed. When constructing generative models from idealized neurons, two types emerged: energy-based models and causal models. Energy-based models used symmetric connections among binary stochastic neurons, giving Boltzmann machines. Learning general Boltzmann machines proved challenging, and restricting the connectivity made learning easier (restricted Boltzmann machines), but at the cost of the power that comes from multiple hidden layers. Causal models, which use directed acyclic graphs of binary stochastic neurons, gave rise to sigmoid belief nets. In 1992, Neal demonstrated that sigmoid belief nets were somewhat easier to learn than Boltzmann machines. In a sigmoid belief net, all variables are binary stochastic neurons, and data generation involves making stochastic decisions layer by layer, ultimately producing unbiased samples of visible values.

By adopting causal models or hybrid approaches, we could overcome the limitations of backpropagation and leverage unsupervised learning to model the structure of sensory input effectively.

Before delving into causal belief nets made of neurons, it is essential to provide some background on the relationship between artificial intelligence (AI) and probability. In the 1970s and early 1980s, there was strong resistance within the AI community towards probability. Probability was considered unfavorable, and AI researchers preferred discrete symbol processing without probabilistic elements. A notable exception was John von Neumann, who saw a potential connection between formal logic and thermodynamics, particularly the work of Boltzmann, but his ideas did not gain traction during his lifetime.

Eventually, probabilities found their way into AI through the development of graphical models, which combine graph theory and probability theory. In the 1980s, AI researchers were working on practical problems involving uncertainty, such as medical diagnosis and mineral exploration. Despite the aversion to probabilities, it became clear that using them was more effective than ad-hoc methods. Graphical models, introduced by Pearl, Heckerman, Lauritzen, and others, provided a framework for representing uncertainty and making probabilistic computations based on graph structures.

Graphical models encompass various types of models, and one subset is belief nets. Belief nets are directed acyclic graphs of stochastic variables. Such graphs often have sparsely connected nodes, and for sparse graphs there are efficient inference algorithms that compute the probabilities of unobserved nodes. However, these algorithms become exponentially expensive when applied to densely connected networks.

A belief net serves as a generative model, and its inference problem involves determining the states of unobserved variables, resulting in probability distributions over these variables. The learning problem focuses on adjusting the interactions between variables to increase the likelihood of generating the observed training data.

In the context of neural networks, there is a connection between graphical models and neural networks. Early graphical models relied on expert-defined graph structures and conditional probabilities, primarily addressing the inference problem. On the other hand, neural networks emphasized learning from training data and avoided handcrafted knowledge. While neural networks lacked interpretability and sparse connectivity, they offered the advantage of adaptability through learning.

To construct generative models with idealized neurons, two main types can be considered. Energy-based models, such as Boltzmann machines, connect binary stochastic neurons symmetrically; learning Boltzmann machines is challenging. The other option is causal models, which use directed acyclic graphs composed of binary stochastic neurons. In 1992, Neal introduced sigmoid belief nets, which were easier to learn than Boltzmann machines. Sigmoid belief nets are causal models in which all variables are binary stochastic neurons.

To generate data from a causal model like a sigmoid belief net, stochastic decisions are made layer by layer, starting from the top layer and cascading down to the visible effects. This process yields an unbiased sample of the kind of visible vectors the network believes in.
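
A sketch of that top-down generation (ancestral sampling), assuming `weights[k]` maps layer k+1 down to layer k and `biases[k]` holds layer k's biases; names and conventions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate(weights, biases):
    """Ancestral sampling from a sigmoid belief net.

    weights[k] maps layer k+1 (parents) down to layer k (children);
    biases[k] holds layer k's biases, with layer 0 the visible layer.
    """
    # Top layer: independent stochastic binary decisions from its biases.
    p = sigmoid(biases[-1])
    s = (rng.random(p.shape) < p).astype(float)
    # Cascade the stochastic decisions down, one layer at a time.
    for W, b in zip(reversed(weights), reversed(biases[:-1])):
        p = sigmoid(s @ W + b)                  # p(child = 1 | parent states)
        s = (rng.random(p.shape) < p).astype(float)
    return s                                    # an unbiased sample of visibles
```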

By adopting unsupervised learning and utilizing causal models or hybrid approaches, it is possible to overcome the limitations of backpropagation and leverage the power of unsupervised learning to effectively model the structure of sensory input. These advancements provide a promising avenue for addressing the challenges posed by deep neural networks and pave the way for more sophisticated and efficient learning algorithms.

In conclusion, the exploration of belief nets and their connection to neural networks has opened up new possibilities for AI and probabilistic modeling. The initial resistance towards probability in AI has been overcome, and graphical models have emerged as a powerful framework for representing uncertainty and making probabilistic computations.

Belief nets, specifically sigmoid belief nets, offer an alternative approach to generative modeling compared to energy-based models like Boltzmann machines. By using directed acyclic graphs and binary stochastic neurons, sigmoid belief nets provide a means to generate data and to learn from training sets more effectively.

The integration of unsupervised learning with causal models or hybrid approaches has the potential to address the limitations of backpropagation in deep neural networks. By modeling the structure of sensory input and maximizing the probability of observed data, these approaches offer a way to leverage the efficiency and simplicity of gradient methods while capturing the complexity of real-world data.

The evolution of AI and the embrace of probability have reshaped the field, enabling researchers to develop more robust and adaptable models. As the journey continues, further advancements in probabilistic modeling, neural networks, and unsupervised learning are likely to emerge, leading to more sophisticated and intelligent AI systems.

By combining the strengths of graphical models and neural networks, researchers can continue to push the boundaries of AI, unlocking new possibilities for understanding, learning, and decision-making in complex and uncertain environments.


Lecture 13.3 — Learning sigmoid belief nets [Neural Networks for Machine Learning]

The video discusses the challenges of learning sigmoid belief nets and introduces two different methods for addressing these challenges. Unlike Boltzmann machines, sigmoid belief nets do not require two different phases for learning, making the process simpler. They are locally normalized models, eliminating the need to deal with partition functions and their derivatives.

Learning in sigmoid belief nets becomes easy if we can obtain unbiased samples from the posterior distribution over hidden units given observed data. However, obtaining unbiased samples is difficult due to a phenomenon called "explaining away," which affects the posterior distribution. This phenomenon arises from the anti-correlation between hidden causes when an observed effect occurs.
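
To make this concrete, consider the classic earthquake/truck illustration, with numbers of the kind used in the lecture (treat the specific values as illustrative): two rare hidden causes, h1 = "earthquake" and h2 = "truck hits house," and one observed effect, v = "house jumps":

```latex
p(h_i{=}1) \;=\; \sigma(-10) \;\approx\; 4.5\times 10^{-5},
\qquad
p(v{=}1 \mid h_1, h_2) \;=\; \sigma(-20 + 20\,h_1 + 20\,h_2).
```

Given v = 1, the configurations (h1, h2) = (1,0) and (0,1) each have unnormalized log probability around -10.7, while (0,0) and (1,1) sit near -20. The posterior therefore has two separated modes, and the two causes are strongly anti-correlated even though they are independent a priori.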

Learning in deep sigmoid belief nets with multiple layers of hidden variables becomes even more challenging. The posterior distribution over the first layer of hidden variables is not factorial due to explaining away, and correlations between hidden variables exist in both the prior and posterior. Computing the prior term for the first layer requires integrating out all possible patterns of activity in higher layers, making the learning process complex.

Two methods for learning deep belief nets are discussed: the Monte Carlo method and variational methods. The Monte Carlo method involves running a Markov chain to approximate the posterior distribution and obtain samples. However, it can be slow for large deep belief nets. Variational methods, on the other hand, aim to obtain approximate samples from a different distribution that approximates the posterior. Although not unbiased, these samples can still be used for maximum likelihood learning, and by pushing up the lower bound on the log probability, improvements can be made in modeling the data.

Learning in sigmoid belief nets poses challenges, particularly in deep networks, but the Monte Carlo method and variational methods provide approaches to address these difficulties and make learning feasible.


Lecture 13.4 — The wake-sleep algorithm [Neural Networks for Machine Learning]

The wake-sleep algorithm is a learning method used for directed graphical models like sigmoid belief nets. It consists of two phases: the wake phase and the sleep phase. Unlike Boltzmann machines, which are used for undirected graphical models, the wake-sleep algorithm is specifically designed for sigmoid belief nets.

The algorithm is part of variational learning, a machine learning approach that approximates the posterior distribution to learn complicated graphical models. Instead of computing the exact posterior distribution, which is often difficult, variational learning approximates it with a cheaper approximation. Then, maximum likelihood learning is applied based on this approximation.

Surprisingly, the learning still works well, because it is driven by two factors: making the model better at generating the observed data, and changing the weights so that the model's true posterior moves closer to the approximating distribution. This second effect is what allows variational learning to work for sigmoid belief nets.

The wake-sleep algorithm utilizes two sets of weights: generative weights and recognition weights. In the wake phase, data is fed into the visible layer, and a forward pass is performed using the recognition weights. Stochastic binary decisions are made for each hidden unit independently, generating stochastic binary states. These states are treated as samples from the true posterior distribution, and maximum likelihood learning is applied to the generative weights.

In the sleep phase, the process is reversed. Starting from a random vector in the top hidden layer, binary states are generated layer by layer using the generative weights, producing a "dream." The recognition weights are then trained to recover the hidden states that actually generated this dreamed data.
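
A sketch of both phases for a net with a single hidden layer (so the top-level prior is just the generative hidden biases); the parameter names and the single-layer simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v, R, bh_rec, G, bv_gen, bh_gen, eps=0.05):
    """One wake-sleep update for a single-hidden-layer sigmoid belief net.

    R (visible->hidden) and bh_rec are the recognition parameters;
    G (hidden->visible), bv_gen, and bh_gen are the generative parameters.
    """
    # Wake phase: recognize hidden states for the data, then train the
    # generative weights to reconstruct the data from those states.
    h = sample(sigmoid(v @ R + bh_rec))
    p_v = sigmoid(h @ G + bv_gen)              # generative prediction of v
    G += eps * np.outer(h, v - p_v)            # local delta rule
    bv_gen += eps * (v - p_v)
    bh_gen += eps * (h - sigmoid(bh_gen))      # fit the top-level prior
    # Sleep phase: generate a "dream" from the model, then train the
    # recognition weights to recover the hidden states that produced it.
    h_d = sample(sigmoid(bh_gen))
    v_d = sample(sigmoid(h_d @ G + bv_gen))
    q_h = sigmoid(v_d @ R + bh_rec)            # recognition prediction of h
    R += eps * np.outer(v_d, h_d - q_h)
    bh_rec += eps * (h_d - q_h)
```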

The wake-sleep algorithm has flaws, such as the recognition weights not following the correct gradient and incorrect mode-averaging due to the independence approximation. Despite these limitations, some researchers, like Karl Friston, believe it resembles how the brain works. However, others think that better algorithms will be discovered in the future.

The wake-sleep algorithm approximates the posterior distribution and alternates between wake and sleep phases to learn a generative model. Despite its limitations, it has been influential in the field of machine learning.


Lecture 14.1 — Learning layers of features by stacking RBMs [Neural Networks for Machine Learning]

In this video, the speaker discusses a different approach to learning sigmoid belief nets. While working on sigmoid belief nets, they shifted their focus to Boltzmann machines and discovered that restricted Boltzmann machines could be learned efficiently. They realized that by treating the features learned by a restricted Boltzmann machine as data, they could apply another restricted Boltzmann machine to model the correlations between those features. This led to the idea of stacking multiple RBMs to learn multiple layers of nonlinear features, which sparked a resurgence of interest in deep neural networks.

The speaker then explores what model you get when the stacked Boltzmann machines are combined into one. While one might expect a multi-layer Boltzmann machine, a student, Yee-Whye Teh, discovered that the result is more like a sigmoid belief net. This unexpected finding solved the problem of learning deep sigmoid belief nets by focusing on learning undirected models like Boltzmann machines.

The speaker describes the process of training a layer of features that directly receive input from pixels and using the activation patterns of those features to learn another layer of features. This process can be repeated to learn multiple layers, with each layer modeling the correlated activity in the layer below. It is proven that adding another layer of features improves a variational lower bound on the log probability of generating the data.

To combine the Boltzmann machines into one model, the speaker explains the procedure of learning each machine individually and then composing them. The resulting combined model is called a deep belief net, whose top two layers form a restricted Boltzmann machine (undirected) while the layers below form a directed sigmoid belief net. The speaker also discusses the benefits of stacking Boltzmann machines and the concept of averaging factorial distributions, demonstrating that the average of two factorial distributions is not itself factorial. The video further covers the learning process for stacked Boltzmann machines and the fine-tuning of the composite model with a variation of the wake-sleep algorithm, whose three learning stages involve adjusting generative and recognition weights, sampling hidden and visible units, and updating the weights of the top-level RBM with contrastive divergence.
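
A sketch of the greedy layer-wise procedure, using CD-1 inside each RBM; layer sizes, epoch counts, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, epochs=10, eps=0.05):
    """Train one RBM with CD-1 on binary data rows (a sketch)."""
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    a = np.zeros(n_visible)                  # visible biases
    b = np.zeros(n_hidden)                   # hidden biases
    for _ in range(epochs):
        h0 = sigmoid(data @ W + b)
        h0s = (rng.random(h0.shape) < h0).astype(float)
        v1 = sigmoid(h0s @ W.T + a)          # one-step reconstruction
        h1 = sigmoid(v1 @ W + b)
        W += eps * (data.T @ h0 - v1.T @ h1) / len(data)
        a += eps * (data - v1).mean(axis=0)
        b += eps * (h0 - h1).mean(axis=0)
    return W, a, b

def stack_rbms(data, layer_sizes):
    """Greedy layer-wise training: each RBM models the features of the one below."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, a, b = train_rbm(x, n_hidden)
        layers.append((W, a, b))
        x = sigmoid(x @ W + b)               # feature activations become new "data"
    return layers
```

For the digit example below, something like `stack_rbms(binary_images, [500, 500])` would produce two feature layers.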

An example is presented where 500 binary hidden units are used to learn all ten digit classes in 28x28 pixel images. After training the RBM, the learned features are used for recognition and generation tasks.

The video highlights the unexpected discovery of using stacked Boltzmann machines to learn deep belief nets and provides insights into the learning and fine-tuning processes involved.
