Weight initialization methods in neural networks

When creating a neural network, before its first training run, we need to somehow set its initial weights. This seemingly simple task is of great importance for the subsequent training of the neural network and has a significant impact on the final result of the entire work.

The fact is that gradient descent, the method most often used for training neural networks, cannot distinguish a local minimum of a function from its global minimum. In practice, various techniques are applied to mitigate this problem, and we will talk about them a bit later. Nevertheless, the question remains open.

The second point is that gradient descent is an iterative process. Therefore, the total training time of a neural network directly depends on how far our starting point lies from the final solution.

Moreover, let's not forget about the laws of mathematics and the peculiarities of the activation functions that we discussed in the previous section of this book.

Initializing weights with a single value

Probably the first thing that comes to mind is to take a certain constant (0 or 1) and initialize all weights with this single value. Unfortunately, this is far from the best option, for reasons rooted in the laws of mathematics.

Using zero as a synaptic coefficient is often fatal to neural networks. In this case, the weighted sum of the input data is zero. As we know from the previous section, many versions of the activation function return 0 in such a case, and the neuron remains deactivated. Consequently, no signal goes further down the neural network.

The derivative of such a function with respect to x_i will be zero. Consequently, during the training of the neural network, the error gradient through such a neuron will also not be passed to the preceding layers, paralyzing the training process.

Using 0 for the initialization of synaptic (weight) coefficients results in an untrainable neural network, which in most cases will generate 0 (depending on the activation function) regardless of the input data received.
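The effect is easy to observe. Below is a small sketch (hypothetical layer sizes, plain NumPy rather than any specific framework) of a two-layer ReLU network with all weights set to zero:

```python
import numpy as np

# Hypothetical layer sizes; all weights initialized to zero
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # a batch of 4 input vectors

W1 = np.zeros((3, 5))                # first layer weights
W2 = np.zeros((5, 2))                # second layer weights

h = np.maximum(0.0, x @ W1)          # ReLU hidden layer: every weighted sum is 0
y = h @ W2                           # output: all zeros, regardless of x
print(np.all(y == 0))                # True

# Backward pass: with W2 = 0, no error signal reaches the first layer either
upstream = np.ones_like(y)           # pretend dL/dy = 1 everywhere
grad_h = upstream @ W2.T             # all zeros — training is paralyzed
print(np.abs(grad_h).sum())          # 0.0
```

Whatever input batch we feed in, both the forward signal and the backward error gradient die out at the very first zero-weight layer.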

Using a non-zero constant as the weighting factor also has disadvantages. The input layer of the neural network receives a set of initial data, and all neurons of the subsequent layer process this dataset in the same way. Within a single neuron, by the laws of mathematics, the constant can be factored out of the formula for the weighted sum; as a result, the first stage merely scales the sum of the input values. The weights can change during training, but this observation concerns only the first layer of neurons, the one receiving the initial data.

If we look at a neural layer as a whole, all neurons in the same layer receive the same dataset. Because they use the same coefficient, all neurons generate the same signal. As a consequence, all neurons of one layer work synchronously, as a single neuron. This, in turn, means that the same value arrives at every input of every neuron of the subsequent layer, and so on from layer to layer throughout the neural network.

The applied learning algorithms cannot single out an individual neuron among a large number of identical values. Therefore, all weights are changed synchronously during training. Each layer, except the first one after the input, receives its own weight value, uniform across the entire layer. The result is a linear scaling of the output of what is effectively a single neuron.

Initializing the synaptic coefficients with a single number other than zero causes the neural network to degenerate down to one neuron.
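This symmetry is easy to demonstrate. In the sketch below (hypothetical sizes, plain NumPy), a layer whose weights are all the same constant produces identical activations in every neuron:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3))          # a batch of input vectors

W = np.full((3, 5), 0.5)             # every weight is the same constant
h = np.tanh(x @ W)                   # activations of the 5 neurons

# All 5 columns of h are identical — the layer behaves as a single neuron
print(np.allclose(h, h[:, [0]]))     # True
```

Since the activations of all five neurons match exactly, the gradients flowing back to their weights match as well, and the symmetry is never broken during training.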

Initializing weights with random values

Since we cannot initialize a neural network with a single number, let's try initializing it with random values. For maximum efficiency, let's keep in mind what was said above: we need to make sure that no two synaptic coefficients are the same. A continuous uniform distribution helps with this.

As practice shows, this approach yields results, though unfortunately not always. Due to the random selection of weights, it is sometimes necessary to initialize the neural network several times before the desired result is achieved. The range from which the weights are drawn also has a significant impact: if the gap between the minimum and maximum is too large, some neurons will be singled out while others are completely ignored.
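A minimal sketch of such initialization might look as follows; the range limit of 0.5 is an arbitrary illustrative choice, not a recommendation:

```python
import numpy as np

def uniform_init(n_in, n_out, limit, rng):
    """Weights from a continuous uniform distribution on [-limit, limit].

    The range `limit` is a hypothetical choice: too wide a range amplifies
    some neurons excessively, too narrow a range suppresses the signal.
    """
    return rng.uniform(-limit, limit, size=(n_in, n_out))

W = uniform_init(3, 5, limit=0.5, rng=np.random.default_rng(2))
print(W.shape)                       # (3, 5)
```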

Moreover, in deep neural networks, there is a risk of the so-called "gradient explosion" and "gradient vanishing".

The gradient explosion manifests itself when weights greater than one are used. In this case, as the initial data is repeatedly multiplied by factors greater than one, the weighted sum grows exponentially with each layer. At the same time, generating a large number at the output often leads to a large error.

During the training process, we will use an error gradient to adjust the weights. In order to pass the error gradient from the output layer to each neuron of our network, we need to multiply the obtained error by the weights. As a result, the error gradient, just like the weighted sum, will grow exponentially as it progresses through the layers of the neural network.

As a consequence, at some point we will get a number that exceeds the range of values our data types can represent, and we will be unable to train or use the network any further.

The opposite situation occurs if we choose weight values close to zero. Repeatedly multiplying the initial data by weights less than one steadily reduces the weighted sum, and this decay progresses exponentially as the number of layers of the neural network grows.

As a consequence, during training we may encounter a situation where the gradient of a small error, as it passes through the layers, falls below the available numeric precision. For such neurons, the error gradient becomes zero, and they stop learning.
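Both effects can be observed numerically. The sketch below (hypothetical layer width and depth) propagates a signal through a stack of linear layers and measures its average magnitude:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=(1, 100))        # hypothetical 100-dimensional input

def mean_signal(layers, scale):
    """Propagate x through `layers` linear layers with weights ~ N(0, scale^2)."""
    h = x
    for _ in range(layers):
        W = rng.normal(0.0, scale, size=(100, 100))
        h = h @ W
    return np.abs(h).mean()

small = mean_signal(30, 0.05)        # per-layer gain below 1: signal decays
large = mean_signal(30, 0.5)         # per-layer gain above 1: signal explodes
print(small < 1e-3, large > 1e6)     # True True
```

With 100 inputs per neuron, a weight scale of 0.05 multiplies the signal magnitude by roughly 0.5 per layer, while a scale of 0.5 multiplies it by roughly 5; over 30 layers these factors compound exponentially in opposite directions.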

At the time of writing this book, the common practice is to initialize neurons using the Xavier method, proposed in 2010. Xavier Glorot and Yoshua Bengio proposed initializing the neural network with random numbers from a normal distribution centered at 0 and with a variance (σ²) equal to 1/n, where n is the number of inputs of the neuron.

This approach enables generating synaptic coefficients such that the mean of the neuron activations is zero and their variance is the same across all layers of the neural network. Xavier initialization is most relevant when the hyperbolic tangent (tanh) is used as the activation function.
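Based on the description above, a sketch of Xavier initialization (the normal-distribution variant with variance 1/n, taking n as the number of inputs) could look like this:

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Xavier initialization: zero-mean normal with variance 1/n,
    where n is taken here as the number of inputs of the neuron."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

W = xavier_init(256, 128, np.random.default_rng(4))
print(abs(W.std() - 1.0 / 16.0) < 0.005)   # True: std is close to 1/sqrt(256)
```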

The theoretical justification for this approach was given in the article "Understanding the difficulty of training deep feedforward neural networks".

Xavier initialization gives good results when using sigmoid activation functions. But when ReLU is used as an activation function, it is not as efficient. This is due to the characteristics of the ReLU itself.

Since ReLU only passes positive weighted-sum values and zeroes out negative ones, probability theory tells us that half of the neurons will be deactivated most of the time. Consequently, the neurons of the subsequent layer receive only half of the information, and the weighted sum of their inputs is smaller. As the number of layers in the neural network increases, this effect intensifies: fewer and fewer neurons reach the threshold value, and more and more information is lost as it passes through the neural network.

A solution was proposed by Kaiming He in February 2015 in the article "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". The article suggests initializing the weights of neurons with ReLU activation from a normal distribution with a variance (σ²) equal to 2/n. When PReLU is used as the activation, the variance of the distribution should be 2/((1+a²)·n), where a is the slope of the negative part of PReLU. This method of initializing synaptic weights is called "He initialization".
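A sketch of He initialization under the same assumptions (n taken as the number of inputs of the neuron, a as the PReLU negative slope) might look as follows:

```python
import numpy as np

def he_init(n_in, n_out, rng, a=0.0):
    """He initialization: variance 2/n for ReLU; with PReLU slope `a`,
    the variance becomes 2 / ((1 + a**2) * n)."""
    std = np.sqrt(2.0 / ((1.0 + a ** 2) * n_in))
    return rng.normal(0.0, std, size=(n_in, n_out))

W = he_init(256, 128, np.random.default_rng(5))
print(abs(W.std() - np.sqrt(2.0 / 256.0)) < 0.005)   # True
```

Note that with a = 0, PReLU reduces to ReLU and the formula collapses to the plain 2/n variance.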

Initializing with a random orthogonal matrix

In December 2013, Andrew M. Saxe presented a three-layer neural network in the form of matrix multiplications in the article "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks", thereby showing the correspondence between a neural network and singular value decomposition (SVD). The synaptic weight matrix of the first layer is represented by an orthogonal matrix whose vectors give the coordinates of the initial data in some n-dimensional space.

Since the vectors of an orthogonal matrix are orthonormal, the projections of the initial data that they generate are completely independent. This approach allows the neural network to be prepared in such a way that each neuron learns to recognize its own feature in the input data, independently of the training of the other neurons in the same layer.

However, the method is not widely used, primarily because of the complexity of generating orthogonal matrices. The advantages of the method become more pronounced as the number of layers of the neural network grows. Therefore, in practice, initialization with orthogonal matrices is found in deep neural networks when initialization with random values does not yield results.
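One common way to generate a random orthogonal matrix (not necessarily the one used in the cited article) is the QR decomposition of a random Gaussian matrix. The sketch below assumes the number of inputs is not smaller than the number of outputs:

```python
import numpy as np

def orthogonal_init(n_in, n_out, rng):
    """Random orthogonal weight matrix via QR decomposition of a Gaussian
    matrix. Assumes n_in >= n_out, so the columns of Q are orthonormal."""
    a = rng.normal(size=(n_in, n_out))
    q, r = np.linalg.qr(a)
    q *= np.sign(np.diag(r))         # fix signs so the result is uniformly random
    return q

W = orthogonal_init(6, 4, np.random.default_rng(6))
print(np.allclose(W.T @ W, np.eye(4)))   # True: columns are orthonormal
```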

Using pre-trained neural networks

This method can hardly be referred to as initialization, but its practical application is becoming increasingly popular. The essence of the method is as follows: to solve the problem, use a neural network that was trained on the same or similar data but solves different tasks. A series of lower layers are taken from a pre-trained neural network. These layers have already been trained to extract features from the initial data. Then, a few new layers of neurons are added, which will solve the given task based on the already extracted features.

In the first step, the pre-trained layers are frozen and only the new layers are trained. If training fails to produce the desired result, the borrowed layers are unfrozen and the neural network is retrained as a whole.
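Conceptually, freezing amounts to skipping certain layers during the weight update. The code below is an illustration in plain NumPy, not any framework's actual API:

```python
import numpy as np

# A minimal sketch of "freezing" borrowed layers: a frozen layer is simply
# skipped during the gradient-descent weight update.
rng = np.random.default_rng(7)

layers = {
    "pretrained": rng.normal(size=(8, 8)),   # borrowed feature-extraction layer
    "new_head":   rng.normal(size=(8, 2)),   # freshly added task-specific layer
}
frozen = {"pretrained": True, "new_head": False}

def sgd_step(layers, grads, lr=0.01):
    """Update only the unfrozen layers."""
    for name in layers:
        if not frozen[name]:
            layers[name] = layers[name] - lr * grads[name]

grads = {name: np.ones_like(W) for name, W in layers.items()}
before_pre = layers["pretrained"].copy()
before_head = layers["new_head"].copy()
sgd_step(layers, grads)
print(np.array_equal(layers["pretrained"], before_pre))  # True — frozen layer untouched
```

Unfreezing the borrowed layers then simply means flipping their flag in `frozen` before continuing the training.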

A variation of this method is the approach in which we first create a multilayer neural network and train it to extract various features from the initial data. This can be done with unsupervised learning algorithms that divide the data into classes, or with autoencoder algorithms. In the latter, the neural network first extracts features from the initial data and then tries to reconstruct the original data from the extracted features.

After pre-training, the layers of neurons responsible for feature extraction are taken, and additional layers of neurons for solving the given task are added to them.

When constructing deep networks, this approach can help train the neural network faster compared to training a large neural network directly. This is because, during one training pass, a smaller neural network requires fewer operations to be performed compared to training a deep neural network. In addition, smaller neural networks are less prone to the risk of gradient explosion or vanishing.

In the practical part of the book, we will return to the process of initializing neural networks and in practice evaluate the advantages and disadvantages of each method.