Organizing multi-threaded computations in the batch normalization class

We continue working on our batch normalization class CNeuronBatchNorm. In the previous sections, we fully implemented the functionality of the class using standard MQL5 tools. To complete the work on the class according to our concept, it remains to supplement its functionality with the ability to perform multi-threaded mathematical operations using OpenCL. Recall that the implementation of this functionality can be roughly divided into two sub-processes:

  • Create an OpenCL program.
  • Modify the methods of the main program to organize data exchange with the context and call the OpenCL program.

Let's start by creating the OpenCL program. First, we implement the BatchNormFeedForward feed-forward pass kernel. In the kernel parameters, we pass pointers to four data buffers and two constants:

  • inputs — buffer of raw data (previous layer results)
  • options — normalization parameter buffer
  • weights — trainable parameter matrix buffer (named after the class buffer)
  • output — result buffer
  • batch — size of the normalization batch
  • total — size of the result buffer

__kernel void BatchNormFeedForward(__global TYPE *inputs,
                                   __global TYPE *options,
                                   __global TYPE *weights,
                                   __global TYPE *output,
                                   int batch,
                                   int total)
  {

The last parameter is necessary because we use vector variables of the TYPE4 type to optimize the computation process. This approach enables parallelism not at the software level but at the microprocessor level. A vector of four elements of type double completely fills a 256-bit microprocessor register, which allows calculations to be performed on the entire vector in a single clock cycle. Thus, in one clock cycle, the microprocessor performs operations on four elements of our data array. OpenCL supports vector variables of 2, 3, 4, 8, and 16 elements. Before choosing a vector dimension, please check the specifications of your hardware.
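
For readers who want to see how such type aliases can be declared, below is a minimal illustrative sketch; the actual definitions belong to the common header of our OpenCL program, and the USE_DOUBLE switch is an assumed name used only for this example.

// Illustrative sketch only: the real definitions live in the common header
// of the OpenCL program; USE_DOUBLE is an assumed switch name.
#ifdef USE_DOUBLE
#define TYPE  double                 // 64-bit scalar
#define TYPE4 double4                // 4 x 64 bits fill a 256-bit register
#else
#define TYPE  float                  // 32-bit scalar
#define TYPE4 float4                 // 4 x 32 bits
#endif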

In the kernel body, we immediately identify the current thread ID. We will need it to determine the offsets into the tensor buffers for the values being analyzed.

We also check the size of the normalization batch. If it is not greater than one, we simply copy the corresponding elements from the input data buffer to the result buffer and terminate further execution of the kernel.

   int n = get_global_id(0);
   if(batch <= 1)
     {
      D4ToArray(output, ToVect4(inputs, n * 4, 1, total, 0), n * 4, 1, total, 0);
      return;
     }

Please note that when calling the functions that convert tensor values into the vector representation and back, we multiply the thread identifier by four to obtain the offset in the tensor. This is because, when using vector operations with TYPE4, each thread simultaneously processes four elements of the tensor. Therefore, the number of launched threads will be four times smaller than the size of the processed tensor.
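
The ToVect4 and D4ToArray helper functions used above read four tensor elements into a vector and write a vector back to the tensor, respectively. Their actual implementations are part of our OpenCL library; the sketch below only reconstructs their assumed semantics from the calls in this section: the flat index of element k is shift + (start + k) * window, and the size argument protects the "tail" thread from out-of-range access.

// Assumed semantics, reconstructed from the calls in this section;
// the library versions may differ in details.
TYPE4 ToVect4(__global TYPE *array, int start, int window, int size, int shift)
  {
   int i0 = shift + (start + 0) * window;
   int i1 = shift + (start + 1) * window;
   int i2 = shift + (start + 2) * window;
   int i3 = shift + (start + 3) * window;
   return (TYPE4)((i0 < size ? array[i0] : 0),
                  (i1 < size ? array[i1] : 0),
                  (i2 < size ? array[i2] : 0),
                  (i3 < size ? array[i3] : 0));
  }

void D4ToArray(__global TYPE *array, TYPE4 value,
               int start, int window, int size, int shift)
  {
   int i0 = shift + (start + 0) * window;
   int i1 = shift + (start + 1) * window;
   int i2 = shift + (start + 2) * window;
   int i3 = shift + (start + 3) * window;
   if(i0 < size) array[i0] = value.s0;
   if(i1 < size) array[i1] = value.s1;
   if(i2 < size) array[i2] = value.s2;
   if(i3 < size) array[i3] = value.s3;
  }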

If the normalization batch size is greater than one and we continue program execution, we need to determine the offsets in the data and normalization parameter tensor buffers, taking into account the identifier of the current thread and the size of the vector operations (TYPE4).

   int shift = n * 4;
   int shift_options = n * 3 * 4;
   int shift_weights = n * 2 * 4;

We now move on directly to the execution of our algorithm. First, we create a vector with the analyzed input data and calculate the exponential average. We use the previously stored mean and variance to detect the first iteration: we divide the accumulated value by the normalization batch size only on the second and subsequent iterations. This is because the average of the first element is the element itself.

After determining the mean, we find the deviation of the current value from the mean and calculate the dataset variance.
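
In formula form, the running estimates computed below can be written as follows (a reconstruction matching the code, where $m$ is the normalization batch size, $x_t$ the current input, and $\mu_{t-1}$, $\sigma^2_{t-1}$ the previously stored values):

$$\mu_t = \frac{(m-1)\,\mu_{t-1} + x_t}{m}, \qquad \sigma^2_t = \frac{(m-1)\,\sigma^2_{t-1} + (x_t - \mu_t)^2}{m}$$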

   TYPE4 inp = ToVect4(inputs, shift, 1, total, 0);
   TYPE4 mean = ToVect4(options, shift, 3, total * 3, 0) *
                ((TYPE)batch - 1) + inp;
   if(options[shift_options] != 0 && options[shift_options + 1] > 0)
      mean /= (TYPE4)batch;
   TYPE4 delt = inp - mean;
   TYPE4 variance = ToVect4(options, shift, 3, total * 3, 1) *
                    ((TYPE)batch - 1) + pow(delt, 2);
   if(options[shift_options + 1] > 0)
      variance /= (TYPE4)batch;

Having mean and sample variance, we can easily calculate the normalized value of the parameter.
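
For reference, this is the core normalization step of the algorithm:

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where the small constant $\epsilon$ (1e-37 in the code below) protects against division by zero.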

   TYPE4 nx = delt / sqrt(variance + 1e-37f);

Next, according to the batch normalization algorithm, we need to perform the shift and scaling of the normalized value. But before that, I want to remind you that at the initial stage, we initialized the trainable parameter buffer with zero values. As a result, we would get 0 for all results regardless of the previously normalized value.

Therefore, we check for a zero value for the scaling factor and replace it with one if necessary.

   if(weights[shift_weights] == 0)
      D4ToArray(weights, (TYPE4)1, shift, 2, total * 2, 0);

Note that we check for equality to zero only the first element of the analyzed vector of values, while we replace the entire vector with ones. This approach is acceptable because I expect zero values only on the first pass, when all buffer elements are equal to zero and need to be replaced. The coefficients will subsequently be trained and optimized and, therefore, will differ from zero during model training.

After such a simple operation, we can safely perform the scaling and shift.

   TYPE4 res = ToVect4(weights, shift, 2, total * 2, 0) * nx +
               ToVect4(weights, shift, 2, total * 2, 1);

Now we only need to save the received data to the appropriate buffer elements. In doing so, we maintain not only the last result but also the intermediate values we need.

   D4ToArray(options, mean, shift, 3, total * 3, 0);
   D4ToArray(options, variance, shift, 3, total * 3, 1);
   D4ToArray(options, nx, shift, 3, total * 3, 2);
   D4ToArray(output, res, shift, 1, total, 0);
  }

This completes the work on the BatchNormFeedForward feed-forward kernel, and we can move on to the backpropagation kernels.

To implement the backpropagation algorithm, we create two kernels: one for propagating the error gradient to the level of the previous layer, and the other for propagating the error gradient to the level of the trainable parameter matrix.

We start by creating BatchNormCalcHiddenGradient, the kernel that propagates the error gradient through the hidden layer of the neural network. In the parameters of this kernel, we pass five data buffers and two constants:

  • inputs — buffer of input data (previous layer results)
  • options — normalization parameter buffer
  • weights — trainable parameter matrix buffer (named after the class buffer)
  • gradient — error gradient buffer at the result level of the current layer
  • gradient_inputs — error gradient buffer at the level of the previous layer results (in this case the kernel result)
  • batch — size of the normalization batch
  • total — size of the result buffer

__kernel void BatchNormCalcHiddenGradient(__global TYPE *options,
                                          __global TYPE *gradient,
                                          __global TYPE *inputs,
                                          __global TYPE *gradient_inputs,
                                          __global TYPE *weights,
                                          int batch,
                                          int total)
  {

At the beginning of the kernel, as in the feed-forward kernel, we determine the identifier of the current thread and check the normalization batch size. If the normalization batch size is not greater than one, we simply copy the error gradients from the current layer buffer to the previous layer buffer and stop executing the kernel.

   int n = get_global_id(0);
   int shift = n * 4;
   if(batch <= 1)
     {
      D4ToArray(gradient_inputs, ToVect4(gradient, shift, 1, total, 0),
                shift, 1, total, 0);
      return;
     }

If, however, the size of the normalization batch is greater than one, and we continue with the kernel operations, then we have to propagate the error gradient throughout the chain from the results level of the current layer to the results level of the previous layer. Below are the mathematical formulas we have to implement.
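
Here is a per-element reconstruction of those formulas, consistent with the code below ($g_y$ denotes the error gradient at the output of the current layer, $\gamma$ the scaling factor, $m$ the normalization batch size, and $\epsilon$ a small constant):

$$g_{\hat{x}} = g_y \cdot \gamma$$

$$g_{\sigma^2} = -\frac{x - \mu}{2\,(\sigma^2 + \epsilon)^{3/2}}\; g_{\hat{x}}$$

$$g_{\mu} = -\frac{g_{\hat{x}}}{\sqrt{\sigma^2 + \epsilon}} - g_{\sigma^2}\,\frac{2\,(x - \mu)}{m}$$

$$g_x = \frac{g_{\hat{x}}}{\sqrt{\sigma^2 + \epsilon}} + \frac{g_{\mu}}{m} + g_{\sigma^2}\,\frac{2\,(x - \mu)}{m}$$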

   TYPE4 inp = ToVect4(inputs, shift, 1, total, 0);
   TYPE4 gnx = ToVect4(gradient, shift, 1, total, 0) *
               ToVect4(weights, shift, 2, total * 2, 0);
   TYPE4 temp = 1 / sqrt(ToVect4(options, shift, 3, total * 3, 1) + 1e-37f);
   TYPE4 delt = inp - ToVect4(options, shift, 3, total * 3, 0);
   TYPE4 gvar = delt / (-2 * pow(ToVect4(options, shift, 3, total * 3, 1) +
                                 1.0e-37f, 3.0f / 2.0f)) * gnx;
   TYPE4 gmu = (-temp) * gnx - gvar * 2 * delt / (TYPE4)batch;
   TYPE4 gx = temp * gnx + gmu / (TYPE4)batch + gvar * 2 * delt / (TYPE4)batch;

After the calculation, we save the result of the operations and complete the kernel.

   D4ToArray(gradient_inputs, gx, shift, 1, total, 0);
  }

This completes the first kernel in the implementation of the backpropagation algorithm of our batch normalization class, and we move on to the final part of the OpenCL program: the creation of the second backpropagation kernel, BatchNormCalcDeltaWeights, in which the error gradient is propagated to the level of the trainable parameter matrix.

We pass three data buffers to this kernel in parameters. These are:

  • options — normalization parameter buffer
  • delta_weights — error gradient buffer at the trainable parameter matrix level (in this case the result of the kernel)
  • gradients — error gradient buffer at the result level of the current layer

__kernel void BatchNormCalcDeltaWeights(__global TYPE *options,
                                        __global TYPE *delta_weights,
                                        __global TYPE *gradients)
  {

We need to implement only two mathematical formulas in this kernel:
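
These are the standard gradients for the scale and offset parameters, written here in a form that matches the code below ($g_y$ denotes the error gradient at the output of the current layer, and $\hat{x}$ is the normalized value saved during the feed-forward pass); on each step they are added to the accumulated gradient buffer:

$$\nabla\gamma = g_y \cdot \hat{x}, \qquad \nabla\beta = g_y$$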

As you can see, the operations are quite simple and do not require much code to implement. This time, we don't even use vector operations.

At the beginning of the kernel, we define the ID of the current thread and the offset in the buffers with the normalization parameter tensors. The offset in the error gradient buffers will match the thread ID.

   const int n = get_global_id(0);
   int shift_options = n * 3;
   int shift_weights = n * 2;

To reduce the number of global memory accesses, we first save the error gradient value in a local variable, and then calculate and immediately write the corresponding values for the current step to the elements of the error gradient accumulation buffer using the formulas mentioned above.

   TYPE grad = gradients[n];
   delta_weights[shift_weights] += grad * options[shift_options + 2];
   delta_weights[shift_weights + 1] += grad;
  }

As you can see, the error gradients are written to the corresponding buffer elements. So, the task assigned to this kernel is complete and we can finish its operation.

We have created all three kernels to implement feed-forward and backpropagation passes in our batch normalization class. We can now proceed to make changes to the main program to organize data exchange with the OpenCL context and call the corresponding program kernel.

Let's start this work as usual by creating constants for OpenCL kernels. Go to defines.mqh and add program kernel identifier constants at the beginning.

#define def_k_BatchNormFeedForward        37
#define def_k_BatchNormCalcHiddenGradient 38
#define def_k_BatchNormCalcDeltaWeights   39

Then add the kernel parameter identifiers.

//--- feed-forward pass of batch normalization
#define def_bnff_inputs                0
#define def_bnff_options               1
#define def_bnff_weights               2
#define def_bnff_outputs               3
#define def_bnff_batch                 4
#define def_bnff_total                 5

//--- gradient distribution through the batch normalization layer
#define def_bnhgr_options              0
#define def_bnhgr_gradient             1
#define def_bnhgr_inputs               2
#define def_bnhgr_gradient_inputs      3
#define def_bnhgr_weights              4
#define def_bnhgr_batch                5
#define def_bnhgr_total                6

//--- gradient distribution to the optimized batch normalization parameters
#define def_bndelt_options             0
#define def_bndelt_delta_weights       1
#define def_bndelt_gradient            2

The next step is to initialize the new kernels in the OpenCL program. To do this, we go to CNet::InitOpenCL, the OpenCL program initialization method of the main dispatch class of our model. First, we change the total number of kernels used.

   if(!m_cOpenCL.SetKernelsCount(40))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.KernelCreate(def_k_BatchNormFeedForward,
                                   "BatchNormFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.KernelCreate(def_k_BatchNormCalcHiddenGradient,
                                   "BatchNormCalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.KernelCreate(def_k_BatchNormCalcDeltaWeights,
                                   "BatchNormCalcDeltaWeights"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

Now that the kernels have been created and we can access them, we move on to working with the methods of our batch normalization class.

As is already the tradition, we will start with the feed-forward method. We will only make changes related to the implementation of the multi-threaded algorithm using OpenCL. The rest of the method code remains unchanged.

According to the algorithm of preparing a kernel for launch, we must first pass all the necessary data to the OpenCL context memory. So, we check that the required data buffers have been created in the context memory.

bool CNeuronBatchNorm::FeedForward(CNeuronBase *prevLayer)
  {
   ......
//--- branching of the algorithm by the computing device
   if(!m_cOpenCL)
     {
//--- Implementation using MQL5 tools
   ......
     }
   else  // OpenCL block
     {
 //--- checking data buffers
      CBufferType *inputs = prevLayer.GetOutputs();
      if(inputs.GetIndex() < 0)
         return false;
      if(m_cBatchOptions.GetIndex() < 0)
         return false;
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;

In the next step, we pass pointers to data buffers and the values of required constants to the kernel parameters.

 //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormFeedForward,
                                      def_bnff_inputs, inputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormFeedForward,
                                      def_bnff_weights, m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormFeedForward,
                                      def_bnff_options, m_cBatchOptions.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormFeedForward,
                                      def_bnff_outputs, m_cOutputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_BatchNormFeedForward,
                                def_bnff_total, (int)m_cOutputs.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_BatchNormFeedForward,
                                def_bnff_batch, m_iBatchSize))
         return false;

After completing the preparatory work, we proceed to enqueue the kernel for execution. First, we need to fill two dynamic arrays: in one of them, we specify the dimensions of the task space, and in the other, the offset in each dimension of the task space. We will run the kernel in a one-dimensional task space with zero offset. The number of threads to run will be four times smaller than the size of the result tensor of the current layer. However, since the tensor size will not always be a multiple of four, and we need to run the computations for all elements of the result tensor, we provide an additional thread to process the "tail" of the tensor that is not a multiple of four.

After calculating the number of threads and filling the arrays, we call the kernel queuing method.

 //--- queuing for execution
      int off_set[] = {0};
      int NDRange[] = {(int)(m_cOutputs.Total() + 3) / 4};
      if(!m_cOpenCL.Execute(def_k_BatchNormFeedForward, 1, off_set, NDRange))
         return false;
     }
//---
   if(!m_cActivation.Activation(m_cOutputs))
      return false;
//---
   return true;
  }

This completes the work with the feed-forward method, in which we have already implemented full functionality, including the ability to organize parallel computations on GPUs using OpenCL technology.

Let's proceed to the backpropagation methods, in which we need to perform similar work. When implementing the backpropagation process in pure MQL5, we overrode two methods. Consequently, we need to supplement both of them with multi-threaded computing functionality. First, let's add this functionality to CNeuronBatchNorm::CalcHiddenGradient, the method that propagates the error gradient to the previous neural layer. As with the feed-forward method, we first check for the necessary data buffers in the OpenCL context memory.

bool CNeuronBatchNorm::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
   ......
//--- branching of the algorithm by the computing device
   if(!m_cOpenCL)
     {
//--- Implementation using MQL5 tools
   ......
     }
   else  // OpenCL block
     {
 //--- checking data buffers
      CBufferType *inputs = prevLayer.GetOutputs();
      CBufferType *inputs_grad = prevLayer.GetGradients();
      if(inputs.GetIndex() < 0)
         return false;
      if(m_cBatchOptions.GetIndex() < 0)
         return false;
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;
      if(inputs_grad.GetIndex() < 0)
         return false;

Then, according to our algorithm for implementing multi-threaded computations using OpenCL, we pass the parameters to the kernel being launched.

 //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcHiddenGradient,
                                      def_bnhgr_inputs, inputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcHiddenGradient,
                                      def_bnhgr_weights, m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcHiddenGradient,
                                      def_bnhgr_options, m_cBatchOptions.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcHiddenGradient,
                                      def_bnhgr_gradient, m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcHiddenGradient,
                                      def_bnhgr_gradient_inputs, inputs_grad.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_BatchNormCalcHiddenGradient,
                                def_bnhgr_total, (int)m_cOutputs.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_BatchNormCalcHiddenGradient,
                                def_bnhgr_batch, m_iBatchSize))
         return false;

After passing all the parameters, we prepare to place the kernel in the run queue. Recall that when creating the kernel, we provided for the use of vector operations with the TYPE4 type. Accordingly, we reduce the number of launched threads by a factor of four. Now we call the kernel queuing method.

 //--- queuing
      int off_set[] = {0};
      int NDRange[] = { (int)(m_cOutputs.Total() + 3) / 4 };
      if(!m_cOpenCL.Execute(def_k_BatchNormCalcHiddenGradient, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

This concludes the CNeuronBatchNorm::CalcHiddenGradient method, which propagates the error gradient through the hidden layer. We now need to repeat the operations for the second backpropagation method, CNeuronBatchNorm::CalcDeltaWeights.

Again, we repeat the algorithm for queuing the kernel. This time the kernel uses three data buffers.

bool CNeuronBatchNorm::CalcDeltaWeights(CNeuronBase *prevLayer, bool read)
  {
   ......
//--- branching of the algorithm by the computing device
   if(!m_cOpenCL)
     {
//--- Implementation using MQL5 tools
   ......
     }
   else
     {
 //--- check data buffers
      if(m_cBatchOptions.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;

Then we pass pointers to the created buffers as parameters to the launched kernel.

 //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcDeltaWeights,
                                      def_bndelt_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcDeltaWeights,
                                      def_bndelt_options, m_cBatchOptions.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_BatchNormCalcDeltaWeights,
                                      def_bndelt_gradient, m_cGradients.GetIndex()))
         return false;

Then we place the kernel in the execution queue. This time the number of threads will be equal to the number of elements in the results tensor of our batch normalization layer.

      //--- queuing
      int off_set[] = {0};
      int NDRange[] = {(int)m_cOutputs.Total()};
      if(!m_cOpenCL.Execute(def_k_BatchNormCalcDeltaWeights, 1, off_set, NDRange))
         return false;
      if(read && !m_cDeltaWeights.BufferRead())
         return false;
     }
//---
   return true;
  }

This concludes our work on the CNeuronBatchNorm batch normalization class. It is ready for use, as it now fully implements the batch data normalization algorithm. We have implemented the algorithm in two versions: using standard MQL5 tools and using OpenCL multi-threaded computing technology. This gives the user the opportunity to choose the technology according to their requirements.

I now propose to look at the implementation of the batch normalization method in Python.