Dropout

We continue studying methods for improving the convergence of neural networks. In this section, let's consider the Dropout technique.

When a neural network is trained, each neuron receives a large number of features, and the influence of each individual feature is hard to assess. As a result, the errors of some neurons are smoothed out by the correct values of others, and these errors accumulate at the network output. Training then stops at some local minimum whose error is still too large to meet our requirements. This effect is called co-adaptation of features: the influence of each feature effectively adjusts to its surrounding environment. It would be better to achieve the opposite effect, where the environment is decomposed into individual features and the impact of each one is evaluated separately.

To combat complex co-adaptation of features, in July 2012 a group of scientists from the University of Toronto proposed, in the paper "Improving neural networks by preventing co-adaptation of feature detectors", randomly excluding some neurons during the training process. Reducing the number of features during training increases the significance of each remaining one, while the constantly changing quantitative and qualitative composition of features reduces the risk of their co-adaptation. This method is called Dropout.

Applying this method can be compared to working with ensembles of decision trees: by excluding some neurons at random, we obtain a new neural network with its own weights at every training iteration. By the rules of combinatorics, the number of such network configurations is quite large.

At the same time, during the operation of the trained neural network all features and neurons are evaluated, so we obtain the most accurate and independent assessment of the current state of the studied environment.

In their paper, the authors also point out that the method can be used to improve the quality of pre-trained models.

Dropout implementation model for a perceptron with two hidden layers

Describing the proposed solution from a mathematical point of view, we can say that each individual neuron is excluded from the process with a given probability P. Thus, the neuron participates in the neural network training process with probability q = 1 - P.

To determine which neurons are excluded, the method uses a pseudorandom number generator: each neuron is dropped independently of the others, which spreads the exclusions as uniformly as possible across iterations. In practice, we generate a vector of binary flags of the same size as the input sequence: 1 marks a feature that is used, 0 marks an excluded element.
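As a rough sketch of how such a mask might be generated (plain NumPy; the keep probability of 0.8, the vector size, and the seed are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)      # fixed seed only to make the example reproducible

q = 0.8                              # probability of keeping a neuron, q = 1 - P
x = rng.standard_normal(10)          # stand-in for the input sequence vector

# Each mask element is drawn independently: 1 with probability q (feature used),
# 0 with probability P = 1 - q (feature excluded).
mask = (rng.random(x.shape) < q).astype(x.dtype)
```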

However, excluding some of the analyzed features inevitably reduces the sum at the input of the neuron's activation function. To compensate for this effect, we multiply the value of each remaining feature by a factor of 1/q. This coefficient clearly scales the values up, since the probability q always lies in the range from 0 to 1.

Putting this together, each element of the Dropout result vector is computed as

Di = 1/q * mi * xi

Where:

  • Di = the i-th element of the Dropout result vector
  • q = the probability of keeping a neuron during the training process
  • mi = the corresponding element of the masking vector
  • xi = the i-th element of the input sequence vector
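A minimal NumPy sketch of this forward step under the same assumptions (the function name dropout_forward and its signature are illustrative, not taken from the paper):

```python
import numpy as np

def dropout_forward(x: np.ndarray, q: float, rng: np.random.Generator):
    """Forward pass of Dropout during training: Di = mi * xi / q."""
    mask = (rng.random(x.shape) < q).astype(x.dtype)  # mi: 1 = keep, 0 = drop
    out = mask * x / q                                # compensate the dropped sum with 1/q
    return out, mask                                  # the mask is kept for the backward pass

rng = np.random.default_rng(0)
x = rng.standard_normal(5)            # stand-in for the input sequence vector
out, mask = dropout_forward(x, q=0.8, rng=rng)
```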

During the backward pass of the training process, the error gradient is multiplied by the derivative of the function above. Since that derivative with respect to xi is simply mi/q, the backward pass of Dropout mirrors the forward pass: it applies the same masking vector (and the same 1/q scaling) that was saved during the forward pass.
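A matching sketch of the backward step, reusing the mask saved on the forward pass (again an assumption-laden illustration rather than the authors' code):

```python
import numpy as np

def dropout_backward(grad_out: np.ndarray, mask: np.ndarray, q: float) -> np.ndarray:
    """Backward pass: dL/dxi = dL/dDi * mi / q, i.e. the same mask and 1/q scaling."""
    return grad_out * mask / q
```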

During the operation of the trained neural network, the masking vector is filled with ones, which lets values pass through unchanged in both directions.
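In code this can be as trivial as the sketch below: with a mask of ones and no 1/q compensation at this stage, the layer reduces to an identity mapping (a sketch, assuming the training-time scaling shown earlier):

```python
import numpy as np

def dropout_inference(x: np.ndarray) -> np.ndarray:
    """During operation every neuron participates: the mask is all ones."""
    mask = np.ones_like(x)
    return mask * x        # identical to x, so values pass through unchanged
```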

In practice, the coefficient 1/q is constant throughout training, so we can compute it once and write it into the masking tensor in place of the ones. This eliminates the need to recompute the coefficient and to multiply it by the mask's ones on every training iteration.
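One possible way to implement this shortcut (a sketch; the helper name make_scaled_mask and the tensor layout are assumptions, not taken from the article) is to store 0 or 1/q directly in the masking tensor, so both passes reduce to a single element-wise multiplication:

```python
import numpy as np

def make_scaled_mask(shape, q: float, rng: np.random.Generator) -> np.ndarray:
    """Masking tensor that already contains the coefficient:
    1/q where a neuron is kept, 0 where it is dropped."""
    return np.where(rng.random(shape) < q, 1.0 / q, 0.0)

rng = np.random.default_rng(7)
x = rng.standard_normal(6)
grad_out = rng.standard_normal(6)      # hypothetical gradient arriving from the next layer

mask = make_scaled_mask(x.shape, q=0.8, rng=rng)
out = mask * x            # forward pass: one multiplication, no separate 1/q step
grad_x = mask * grad_out  # backward pass reuses the very same tensor
```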