6. Batch normalization backpropagation methods

In the previous sections, we began studying the batch normalization algorithm. To implement it in our library, we created a separate neural layer in the form of the CNeuronBatchNorm class and built the methods for class initialization and the feed-forward pass. Now it's time to move on to building the backpropagation algorithm for our class. Let me remind you that the backpropagation algorithm in all neural layers of our library is represented by four virtual methods:

  • The CalcOutputGradient method for calculating the error gradient at the output of the neural network,
  • The CalcHiddenGradient method for propagating the gradient through the hidden layer,
  • The CalcDeltaWeights method for calculating weight adjustment values, and
  • The UpdateWeights method for updating the weight matrix.

All of them were declared in our CNeuronBase neural layer base class. They are overridden in each new class as needed.

In this class, we will override only two of these methods: the propagation of the error gradient through the hidden layer and the calculation of the weight adjustment values.

We will not override the method for calculating the error gradient at the output of the neural network because I do not know of a scenario where batch normalization would need to be used as the last layer of a neural network. Moreover, experiments show that using batch normalization immediately before the results layer of a neural network can adversely affect the model's performance.

As for the method that updates the weight matrix, we intentionally designed the buffer for the matrix of trainable parameters in such a way that the method from the parent class can be used to update its parameters.

Now let's move on to the practical part and look at the implementation of the first of these backpropagation methods, CalcHiddenGradient. This virtual method was defined in the CNeuronBase neural layer base class and is overridden in each new neural layer class to implement its specific algorithm. In its parameters, the method receives a pointer to the object of the previous neural layer and returns the logical result of the operations.

In the method body, we add a control block in which we check the validity of pointers both to the previous layer object received in the parameters and to the internal objects used in the method operation. We have talked about the importance of such a process on multiple occasions because accessing an object through an invalid pointer leads to a critical error and a complete termination of the program.

bool CNeuronBatchNorm::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- control block
   if(!prevLayer || !prevLayer.GetOutputs() || !prevLayer.GetGradients() ||
      !m_cActivation || !m_cWeights)
      return false;

Next, we need to adjust the error gradient obtained from the next layer to the derivative of the activation function of our layer. In the base class, we have encapsulated all the work with the activation function into a separate object of the CActivation class. Therefore, now, to adjust the error gradient, we should simply call the appropriate method of this class and provide a pointer to the error gradient buffer of our class as a parameter. As always, do not forget to check the result of the operation.

//--- adjust the error gradient to the derivative of the activation function
   if(!m_cActivation.Derivative(m_cGradients))
      return false;

After that, we check the size of the specified normalization batch. If it is not more than one, simply copy the gradient buffer data of the current layer to the buffer of the previous layer. Then we exit the method with the result of copying the data.

//--- check the size of the normalization batch
   if(m_iBatchSize <= 1)
     {
      prevLayer.GetGradients().m_mMatrix = m_cGradients.m_mMatrix;
      if(m_cOpenCL && !prevLayer.GetGradients().BufferWrite())
         return false;
      return true;
     }

Next, we sequentially calculate the gradients for all functions of the algorithm.

I suggest going through the process and looking at the mathematical formulas for the propagation of the error gradient. At the initial stage, we have the error gradient for the results of our normalization layer, which corresponds to the values of the scaling and shift function. Let me remind you of the formula:

  yᵢ = γ·x̂ᵢ + β

To adjust the error gradient, we need to multiply it by the derivative of this function with respect to x̂ᵢ. According to the differentiation rules, the shift β acts as a constant and its derivative is zero, while the derivative of the product γ·x̂ᵢ is equal to the second factor. Thus, our derivative will be equal to the scaling factor γ:

  Gx̂ᵢ = Gᵢ·γ

where Gᵢ is the gradient of the i-th element at the output of the scaling and shift function.

In the method code, this operation will be expressed in the following lines.

//--- branching of the algorithm by computing device
   if(!m_cOpenCL)
     {
      MATRIX mat_inputs = prevLayer.GetOutputs().m_mMatrix;
      if(!mat_inputs.Reshape(1, prevLayer.Total()))
         return false;
      VECTOR inputs = mat_inputs.Row(0);
      CBufferType *inputs_grad = prevLayer.GetGradients();
      ulong total = m_cOutputs.Total();
      VECTOR gnx = m_cGradients.Row(0) * m_cWeights.Col(0);

Let's move on. We determine the normalized value using the formula:

  x̂ᵢ = (xᵢ − μ) / √(σ² + ε)

From here, we need to distribute the error gradient to each of its components. I will not show the entire process of deriving the partial derivative formulas. I will only provide the ready-made formulas for calculating the error gradient presented by the authors of the method in the article Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (here m is the size of the normalization batch):

  ∂L/∂σ² = Σᵢ Gx̂ᵢ·(xᵢ − μ)·(−1/2)·(σ² + ε)^(−3/2)
  ∂L/∂μ = −Σᵢ Gx̂ᵢ / √(σ² + ε) + ∂L/∂σ²·Σᵢ(−2·(xᵢ − μ)) / m
  ∂L/∂xᵢ = Gx̂ᵢ / √(σ² + ε) + ∂L/∂σ²·2·(xᵢ − μ) / m + ∂L/∂μ / m
  ∂L/∂γ = Σᵢ Gᵢ·x̂ᵢ
  ∂L/∂β = Σᵢ Gᵢ

The last two formulas will be needed for the next method, where we will propagate the error gradient to the level of the trainable parameter matrix. Therefore, in the code of this method, we implement only the first three.

      VECTOR temp = MathPow(MathSqrt(m_cBatchOptions.Col(1) + 1e-32), -1);
      VECTOR gvar = (inputs - m_cBatchOptions.Col(0)) /
                    (-2 * MathPow(m_cBatchOptions.Col(1) + 1.0e-32, 3.0 / 2.0)) * gnx;
      VECTOR gmu = temp * (-1) * gnx - gvar * 2 * 
                          (inputs - m_cBatchOptions.Col(0)) / (TYPE)m_iBatchSize;
      VECTOR gx = temp * gnx + gmu / (TYPE)m_iBatchSize + gvar * 2 * 
                          (inputs - m_cBatchOptions.Col(0)) / (TYPE)m_iBatchSize;

Note that the formulas are the sums of the values across the entire normalization dataset. We perform calculations only for the current value. Nevertheless, we do not deviate from the above formulas. The reason is that our dataset is stretched over time, and we return the error gradient at each step. During the period between updates of the trainable parameters of our class, we accumulate the error gradient on them, thereby summing it over the entire duration of our normalization dataset stretched along the time scale.

Now we only need to save the obtained error gradient into the corresponding element of the buffer and check the result of the operation.

      if(!inputs_grad.Row(gx, 0))
         return false;
      if(!inputs_grad.Reshape(prevLayer.Rows(), prevLayer.Cols()))
         return false;
     }
   else  // OpenCL block
     {
      return false;
     }
//---
   return true;
  }

As a result of performing these operations, we obtained a filled buffer for the gradient tensor of the previous layer. So, the task set for this method has been completed, and we can conclude the branching of the algorithm depending on the used device. We will set a temporary stub for the block of organizing multi-threaded computing using OpenCL, as in similar cases when working with other methods. Thus, we finish working on our CNeuronBatchNorm::CalcHiddenGradient method at this point.

We will continue to organize the process of the backpropagation pass. Let's move on to the next method CNeuronBatchNorm::CalcDeltaWeights. Usually, this method is responsible for distributing the error gradient to the level of the weight matrix. But in our case, we have slightly different trainable parameters, on which we will distribute the error gradient.

The CalcDeltaWeights method, like the previous one, receives a pointer to the previous layer object in its parameters. However, in this case, this is more a matter of satisfying the inheritance requirements than a functional necessity. The formulas for propagating the error gradient to the trainable parameters have already been provided above, but I will repeat them here for reference:

  ∂L/∂γ = Σᵢ Gᵢ·x̂ᵢ
  ∂L/∂β = Σᵢ Gᵢ

As can be seen from these formulas, the error gradients of the parameters do not depend directly on the values of the previous layer. The gradient of the scaling factor depends on the normalized value, while the gradient of the shift is equal to the error gradient at the output of the batch normalization layer. Of course, the normalized value itself depends on the values of the previous layer, but to avoid recalculating it, we saved the normalized values to a buffer during the feed-forward pass. Therefore, in the body of this method, we will not access the elements of the previous layer, and there is no point in spending time checking the received pointer to it. At the same time, we will not exclude the control block entirely, since we check not only external pointers but also pointers to internal objects.

bool CNeuronBatchNorm::CalcDeltaWeights(CNeuronBase *prevLayer, bool read)
  {
//--- control block
   if(!m_cGradients || !m_cDeltaWeights)
      return false;

After successfully passing the control block, we check the size of the normalization batch: it must be greater than one. Otherwise, we simply exit the method.

//--- check the size of the normalization batch
   if(m_iBatchSize <= 1)
      return true;

After successfully passing all the controls, we proceed to the implementation of the method's algorithm. As always, we implement it in two versions: using standard MQL5 tools and using OpenCL multi-threaded computing technology. Therefore, before continuing, we create a branching of the algorithm depending on the device used for the computing operations.

//--- branching of the algorithm by the computing device
   if(!m_cOpenCL)
     {

In the branch implementing the algorithm with standard MQL5 tools, we use matrix operations. Following the formulas provided above, we determine the error gradients for the scaling factor and the shift, add the obtained values to the previously accumulated error gradients of the corresponding elements, and write the results back to the error gradient accumulation buffer.

      VECTOR grad = m_cGradients.Row(0);
      VECTOR delta = m_cBatchOptions.Col(2) * grad + m_cDeltaWeights.Col(0);
      if(!m_cDeltaWeights.Col(delta, 0))
         return false;
      if(!m_cDeltaWeights.Col(grad + m_cDeltaWeights.Col(1), 1))
         return false;

After completing all the operations, we will have a fully updated error gradient buffer at the level of the batch normalization layer's trainable parameters. In other words, the task for this method is solved, and we close the branch of the algorithm depending on the computing device, along with the entire method. However, first, we add a stub in the block of the multi-threaded computing algorithm using OpenCL.

     }
   else  // OpenCL block
     {
      return false;
     }
//---
   return true;
  }

Above, we have overridden two methods of the backpropagation algorithm. The method for updating the weights (in this case, the trainable parameters) was inherited from the parent class. Thus, the work on the backpropagation methods, in terms of organizing the process using standard MQL5 tools, can be considered complete. Let's move on to the file handling methods.