Regularization

In the pursuit of a lower neural network error, we often make our model more complex. What a disappointment it can be when, after long and meticulous work, we achieve an acceptable error on the training set, only to watch the model's error soar during testing. This situation is quite common and is known as model overfitting.

The reasons for this phenomenon are quite mundane and relate to the imperfections, or more precisely, the complexity of our world. Both the raw data and the target values for the training and test datasets were obtained not under controlled laboratory conditions but taken from real life. Hence, in addition to the analyzed features, they include a number of unaccounted-for factors which, for various reasons, we attributed to so-called noise at the design stage.

During training, we expect the model to extract significant features from the given volume of raw data and establish relationships between those features and the expected outcome. However, an excessively complex model can discover relationships between random fluctuations that don't actually exist: it ends up "memorizing" the training dataset, and we get an error close to zero on the training sample. The test dataset, meanwhile, contains its own random noise deviations that don't fit the patterns learned from the training dataset, which confuses the model. As a result, we see a striking difference between the error of the neural network on the training and test samples.

The regularization methods discussed in this section are designed to exclude or minimize the influence of random noise and to emphasize the regular features during model training. In the practice of training neural networks, you will most commonly encounter two methods: L1 and L2 regularization. Both are built on adding a norm of the weights to the loss function.

L1-regularization

L1-regularization is often referred to as lasso regression (the L1 norm is also known as the Manhattan norm). The essence of this method lies in adding the sum of the absolute values of the weights to the loss function:

LL1(Y, Y', W) = L(Y, Y') + λ · ∑|wi|

Where:

  • LL1(Y,Y',W) = loss function with L1-regularization
  • L(Y,Y') = one of the loss functions discussed earlier
  • λ = regularization coefficient (penalty)
  • wi = ith weighting coefficient

During training, we minimize the loss function, and its value now depends directly on the sum of the absolute weight values. Thus, we introduce into model training an additional incentive to select weights as close to zero as possible.
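To make this concrete, here is a minimal Python/NumPy sketch of such a loss, using mean squared error as the base loss L(Y, Y'); the function name l1_loss and the choice of MSE are illustrative assumptions, not something fixed by the text above.

```python
import numpy as np

def l1_loss(y_true, y_pred, weights, lam=0.01):
    """MSE base loss plus the L1 penalty: LL1 = L + lam * sum(|w_i|)."""
    base_loss = np.mean((y_true - y_pred) ** 2)   # L(Y, Y')
    l1_penalty = lam * np.sum(np.abs(weights))    # lam * sum |w_i|
    return base_loss + l1_penalty

# Toy usage: the penalty grows with the magnitude of the weights.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.9, 3.2])
w = np.array([0.5, -0.3, 0.0, 2.0])
print(l1_loss(y_true, y_pred, w))  # MSE 0.02 + penalty 0.028 = 0.048
```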

The partial derivative of such a loss function takes the form:

∂LL1/∂wi = ∂L/∂wi + λ · sign(wi)

Here we deliberately leave the derivative of the base loss function unexpanded in order to isolate the direct influence of the regularization term.

The function sign(wi) returns the sign of the weight when it is non-zero and 0 when the weight is zero. Since λ is a constant, the regularization term contributes the same fixed step toward zero at every weight update, regardless of the weight's magnitude, on top of the learning-rate-scaled gradient of the base loss. As a result, when training the neural network, the model will set the weights of features that do not have a direct impact on the outcome to zero, which completely eliminates the influence of that random noise on the result.
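A sketch of one gradient-descent step under L1 regularization, under the same illustrative assumptions (the names sgd_step_l1, eta, and lam are ours): note that the regularization contribution is a constant step of eta * lam toward zero, independent of the weight's size.

```python
import numpy as np

def sgd_step_l1(w, grad, eta=0.1, lam=0.01):
    """One SGD update with L1: w <- w - eta * (dL/dw + lam * sign(w))."""
    return w - eta * (grad + lam * np.sign(w))

# With a zero base gradient, every non-zero weight shrinks by the same
# fixed step eta * lam = 0.001, regardless of its magnitude.
w = np.array([0.05, -0.05, 0.5])
for _ in range(5):
    w = sgd_step_l1(w, grad=np.zeros_like(w))
print(w)  # roughly [0.045, -0.045, 0.495]
```

In practice, implementations often pair this rule with a soft-thresholding step so that small weights land exactly on zero instead of oscillating around it.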

L1 regularization introduces a penalty for large weights, thus enabling the selection of important features and mitigating the influence of random noise on the final outcome.

L2-regularization

L2, or ridge, regularization, like L1 regularization, introduces a penalty for large weights into the loss function. However, in this case the L2 norm is used, which is the sum of the squares of the weights. As a result, the loss function takes the following form:

LL2(Y, Y', W) = L(Y, Y') + λ · ∑wi²

Similar to L1-regularization, we add to the model training process a pressure to use weighting coefficients as close to zero as possible. Let's look at the derivative of our loss function:

∂LL2/∂wi = ∂L/∂wi + 2 · λ · wi

In the L2-regularization derivative formula, the penalty λ is multiplied by the weight. This means that during training the penalty is not constant but dynamic: it decreases in proportion to the weight itself, so each weight receives an individual penalty based on its magnitude. Hence, unlike with L1 regularization, during training the weights of features that have no direct impact on the outcome will decrease, but they will never reach exactly zero, short of the limits of floating-point precision.
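A matching sketch for the L2 update, with the same illustrative names: here the regularization step is proportional to the weight itself, so the weights decay geometrically and approach, but never exactly reach, zero.

```python
import numpy as np

def sgd_step_l2(w, grad, eta=0.1, lam=0.01):
    """One SGD update with L2: w <- w - eta * (dL/dw + 2 * lam * w)."""
    return w - eta * (grad + 2 * lam * w)

# With a zero base gradient, each step multiplies every weight by
# (1 - 2 * eta * lam) = 0.998: the shrinkage is proportional to the
# weight, so the weights approach zero but never exactly reach it.
w = np.array([0.05, -0.05, 0.5])
for _ in range(5):
    w = sgd_step_l2(w, grad=np.zeros_like(w))
print(w)  # roughly [0.0495, -0.0495, 0.4950]
```

This multiplicative shrinkage is why L2 regularization is also commonly called weight decay.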

L2 regularization introduces a penalty for large weights, thus enhancing the influence of important features and reducing, though not eliminating, the impact of random noise on the final outcome.

Elastic Net

As mentioned above, L1-regularization simplifies the model by zeroing out the weights of parameters that do not directly affect the expected outcome. This approach is justified when we are reasonably confident that there is a small number of redundant features whose exclusion can only improve model performance.

If, however, we understand that the overall result is a combination of small contributions from all the features used and the exclusion of any feature would worsen the model performance, then in such a scenario, using L2 regularization is justified.

But which method should we use when our model receives an obviously excessive number of features, and we don't understand each feature's individual impact on the outcome? Excluding some of them might simplify the model and improve its performance, while excluding others would degrade it.

In such cases, Elastic Net regularization is applied. This method adds penalties based on both the L1 and L2 norms of the weights to the loss function, combining the advantages of L1 and L2 regularization:

LEN(Y, Y', W) = L(Y, Y') + λ1 · ∑|wi| + λ2 · ∑wi²

Please note that in the Elastic Net formula, the L1 and L2 terms each have their own regularization coefficient. Thus, by changing the coefficients λ1 and λ2, we can control the regularization behavior. Setting both to zero gives model optimization without regularization; with λ1 > 0 and λ2 = 0 we have pure L1 regularization, and with λ1 = 0 and λ2 > 0 we get pure L2 regularization.
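A minimal sketch of such a combined loss, again with MSE assumed as the base loss and lam1, lam2 as illustrative names for λ1 and λ2: setting either coefficient to zero recovers the corresponding pure regularization.

```python
import numpy as np

def elastic_net_loss(y_true, y_pred, w, lam1=0.01, lam2=0.01):
    """LEN = L + lam1 * sum(|w_i|) + lam2 * sum(w_i^2), with MSE as L."""
    base = np.mean((y_true - y_pred) ** 2)
    return base + lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

y_true = np.array([1.0, 2.0])
y_pred = np.array([0.9, 2.1])
w = np.array([0.5, -0.3, 2.0])

print(elastic_net_loss(y_true, y_pred, w, lam1=0.0,  lam2=0.0))   # no regularization
print(elastic_net_loss(y_true, y_pred, w, lam1=0.01, lam2=0.0))   # pure L1
print(elastic_net_loss(y_true, y_pred, w, lam1=0.0,  lam2=0.01))  # pure L2
```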