Dropout

As we continue discussing ways to improve model convergence, let's consider the Dropout method.

When training a neural network, a large number of features is fed into each neuron, and it is difficult to assess the influence of each one. As a result, the errors of some neurons are smoothed out by the correct values of others, errors accumulate at the output of the network, and training stops at a local minimum with a relatively large error. This effect is known as feature co-adaptation: the influence of each feature effectively adapts to its surrounding environment. It would be better to achieve the opposite effect, where the environment is decomposed into individual features and the influence of each feature is evaluated separately.

To combat the complex co-adaptation of features, in July 2012 a group of scientists from the University of Toronto proposed, in the article Improving neural networks by preventing co-adaptation of feature detectors, randomly excluding some of the neurons during the learning process. Reducing the number of features during training increases the significance of each one, while the constant variation in the quantitative and qualitative composition of features reduces the risk of their co-adaptation. This method is called Dropout. Some compare its application to decision trees: by excluding some of the neurons, we obtain a new neural network with its own weights at each training iteration, and by the rules of combinatorics the variability of such networks is quite high.

During the operation of the neural network, all features and neurons are evaluated. Thus, we get the most accurate and independent assessment of the current state of the environment under consideration.

In their paper, the authors of the method also mention the possibility of using it to improve the quality of pre-trained models.

Describing the proposed solution from a mathematical point of view, we can say that each individual neuron is dropped out of the process with a certain given probability p; in other words, the neuron participates in training the neural network with probability q = 1 − p.

To determine the list of neurons to be dropped out, a pseudo-random number generator with a uniform distribution is used. This approach provides the most even exclusion of neurons possible. In practice, we will generate a masking vector with a size equal to the input sequence: for the features that are kept, we will set 1 in the vector, and for the excluded elements, 0.
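As an illustration of this step, here is a minimal sketch of how such a masking vector could be generated with NumPy (the names p, q, inputs, and mask are illustrative, not part of the original method's code):

```python
import numpy as np

p = 0.3          # probability of dropping a neuron
q = 1.0 - p      # probability of keeping a neuron

inputs = np.array([0.5, -1.2, 0.8, 0.1, 2.3])

# Draw a uniform random number for each element and keep the neuron
# when the draw falls below q; the result is a 0/1 masking vector
# with the same size as the input sequence.
mask = (np.random.rand(inputs.size) < q).astype(inputs.dtype)
print(mask)      # e.g. [1. 0. 1. 1. 1.]
```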

However, excluding some of the analyzed features inevitably reduces the sum at the input of the neuron's activation function. To compensate for this effect, we will multiply the value of each feature by a factor of 1/q. This coefficient obviously increases the values, since the probability q always lies in the range from 0 to 1.

Thus, the Dropout operation can be written as:

di = (1/q) * xi * ni

where:

  • di = elements of the Dropout result vector
  • q = probability of using a neuron in the learning process
  • xi = elements of the masking vector
  • ni = elements of the input sequence
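Putting the masking vector and the 1/q coefficient together, a minimal NumPy sketch of the feed-forward step described by the formula above might look like this (dropout_forward is a hypothetical helper, not a library function):

```python
import numpy as np

def dropout_forward(inputs, p):
    """Apply Dropout to an input vector during training.

    Returns the masked and rescaled outputs together with the masking
    vector, which is needed again on the backward pass.
    """
    q = 1.0 - p
    # 0/1 masking vector: 1 keeps the neuron, 0 drops it.
    mask = (np.random.rand(*inputs.shape) < q).astype(inputs.dtype)
    # di = (1/q) * xi * ni
    outputs = mask * inputs / q
    return outputs, mask

x = np.array([0.5, -1.2, 0.8, 0.1, 2.3])
out, mask = dropout_forward(x, p=0.3)
```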

During backpropagation in the training process, the error gradient is multiplied by the derivative of the above function. Since that derivative with respect to each input element is xi/q, the backward pass for Dropout is similar to the feed-forward pass: the gradient is multiplied element-wise by the same masking vector that was used on the feed-forward pass.
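Under the same assumptions, the backward pass reuses the masking vector stored on the feed-forward pass, for example:

```python
def dropout_backward(grad_output, mask, p):
    """Propagate the error gradient through the Dropout layer.

    The gradient flows only through the neurons that were kept on the
    feed-forward pass and is scaled by the same 1/q coefficient.
    """
    q = 1.0 - p
    return grad_output * mask / q
```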

During the operation of the neural network, the masking vector is filled with ones, allowing values to be transmitted in both directions without hindrance.

In practice, the coefficient 1/q is constant throughout training, so we can easily calculate it once and write it into the masking tensor instead of 1. This way, we eliminate the repeated recalculation of the coefficient and the extra multiplication in each training iteration.
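This optimization could be sketched as follows: the masking tensor stores 1/q directly instead of 1, so both the forward and backward passes reduce to a single element-wise multiplication, and during operation the mask is simply filled with ones. The Dropout class below is an illustrative sketch, not the original implementation:

```python
import numpy as np

class Dropout:
    """Illustrative Dropout layer storing 1/q directly in the mask."""

    def __init__(self, p):
        self.q = 1.0 - p
        self.mask = None

    def forward(self, inputs, training=True):
        if training:
            # Kept neurons receive the precomputed 1/q instead of 1,
            # dropped neurons receive 0.
            self.mask = np.where(np.random.rand(*inputs.shape) < self.q,
                                 1.0 / self.q, 0.0)
        else:
            # During operation the masking tensor is filled with ones.
            self.mask = np.ones_like(inputs)
        return inputs * self.mask

    def backward(self, grad_output):
        # The backward pass is the same element-wise multiplication.
        return grad_output * self.mask
```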