Organizing multi-threaded operations in Dropout

We continue implementing the Dropout technology. In the previous sections, we fully implemented its algorithm using standard MQL5 capabilities. Now we move on to implementing the algorithm with multi-threaded computations on the GPU using OpenCL. We have performed this kind of work many times within the framework of this book. Let me repeat, however, that it requires effort in two directions: first we create the OpenCL program itself, and then, on the main program side, we organize the data exchange with the OpenCL context in which that program runs and implement the calls of its kernels.

As always, this work begins with the creation of the OpenCL program. In this case, we don't have to write much code on the OpenCL side. Moreover, we will use the same kernel to implement both the feed-forward and backpropagation passes. How is that possible? Let's recall which operations we need to implement.

In the feed-forward pass, we perform data masking. The masking vector is generated with MQL5 on the main program side, so on the OpenCL side we only need to apply it to the initial data. To do this, we element-wise multiply the initial data tensor by the masking vector.

Therefore, for the feed-forward pass, we need to create a kernel for element-wise multiplication of two tensors of the same size.

During the backpropagation pass, the error gradient must be propagated through the masking operation. Let's take a closer look at the masking formula. Here 1/q is a constant that is defined at the class initialization stage and does not change throughout model training and operation, while xi is a masking vector element that can take only two values: 1 or 0. Therefore, the entire masking process can be represented as multiplying the original value by a constant. As you know, the derivative of such an operation is the constant by which the multiplication is performed.
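
To make this reasoning explicit, the masking of a single value and its derivative can be written out as follows. The symbols zi for the masked data element and yi for the result are introduced here purely for illustration; 1/q and xi are as described above:

$$y_i = \frac{1}{q}\, x_i\, z_i, \qquad \frac{\partial y_i}{\partial z_i} = \frac{1}{q}\, x_i$$

In other words, the error gradient passes through the masking operation multiplied by exactly the same factor that was applied to the value in the feed-forward pass.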

In our case, to adjust the error gradient, we need to element-wise multiply the error gradient of the current layer by the masking vector.

Thus, in both the feed-forward and backpropagation passes, we element-wise multiply some tensor by the masking vector. Therefore, to implement both passes on the OpenCL side, it is sufficient to create one kernel for the element-wise multiplication of two vectors. This is a fairly simple task, and using vector variables to optimize the process does not complicate it.

To do this, we create the MaskMult masking kernel. In its parameters, the kernel receives pointers to three data buffers: two of them contain the source data, and the third one is used to write the results. Also, since we use vector operations, the total number of threads will be smaller than the number of buffer elements, so we will not be able to determine the size of the data tensors from the number of running threads. Therefore, we pass the required buffer size in an additional kernel parameter.
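
For example (the figures here are purely illustrative), with vector operations each work item processes four buffer elements, so a buffer of 1,000 elements requires only 250 work items. Rounding up in integer arithmetic looks like this:

// purely illustrative: one work item processes 4 buffer elements
int total   = 1000;              // assumed buffer size, for illustration only
int threads = (total + 3) / 4;   // 250 work items; rounds up when total is not a multiple of 4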

In the kernel body, we determine the ID of the current thread and transfer the necessary data from the buffers into local vector variables. Then we multiply the two vector variables and write the result from the local vector variable back to the scalar data buffer.

__kernel void MaskMult(__global TYPE *inputs,
                       __global TYPE *mask,
                       __global TYPE *outputs,
                       int outputs_total)
  {
   const int n = get_global_id(0) * 4;
//---
   TYPE4 out = ToVect4(inputs, n, 1, outputs_total, 0) *
               ToVect4(mask, n, 1, outputs_total, 0);
   D4ToArray(outputs, out, n, 1, outputs_total, 0);
  }

As you can see, the entire kernel code fits into three lines. Of course, this is made possible by the previously created functions that transfer data between a scalar buffer and a local vector variable.
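
For reference, the idea behind these helper functions can be sketched as shown below. This is only a simplified illustration based on how the kernel above calls them; the actual ToVect4 and D4ToArray implemented earlier in the book may be organized differently.

// Simplified sketch only: the actual helper functions were implemented earlier in the book.
// ToVect4 gathers up to four elements of a scalar buffer into a vector variable,
// substituting zeros for positions outside the buffer.
TYPE4 ToVect4(__global TYPE *array, int start, int step, int size, int shift)
  {
   TYPE temp[4] = {0, 0, 0, 0};
   for(int i = 0; i < 4; i++)
     {
      int index = start + i * step;
      if(index < size)
         temp[i] = array[shift + index];
     }
   return (TYPE4)(temp[0], temp[1], temp[2], temp[3]);
  }
// D4ToArray writes the components of a vector variable back to the scalar buffer,
// skipping positions outside the buffer.
void D4ToArray(__global TYPE *array, TYPE4 value, int start, int step, int size, int shift)
  {
   TYPE temp[4];
   temp[0] = value.s0;
   temp[1] = value.s1;
   temp[2] = value.s2;
   temp[3] = value.s3;
   for(int i = 0; i < 4; i++)
     {
      int index = start + i * step;
      if(index < size)
         array[shift + index] = temp[i];
     }
  }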

Once the OpenCL kernel is created, we proceed to implement the functionality on the main program side. First, we need to create constants to refer to OpenCL program elements. To do this, we open the defines.mqh file and specify constants for the kernel and its parameters.

#define def_k_MaskMult                40

//--- data masking
#define def_mask_inputs                0
#define def_mask_mask                  1
#define def_mask_outputs               2
#define def_mask_total                 3

Then we move on to the model dispatcher class. In the OpenCL context initialization method, we change the total number of kernels and then create a kernel in the context.

bool CNet::InitOpenCL(void)
  {
   ......
   if(!m_cOpenCL.SetKernelsCount(41))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   ......

   if(!m_cOpenCL.KernelCreate(def_k_MaskMult, "MaskMult"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
//---
   return true;
  }

Once the preparatory work has been completed, we move on to working directly with the methods of our CNeuronDropout class. As always, let's start with the CNeuronDropout::FeedForward method and implement the following processes in this method:

  • Pass information to the OpenCL context.
  • Pass parameters to the OpenCL kernel.
  • Place the kernel in the run queue.
  • Download the kernel results.
  • Clear context memory.

Moving on to the feed-forward method: the changes will only affect the multi-threaded operations block, while the rest of the method code remains unchanged.

The Dropout class can operate in two modes: training and production use. We have created a kernel for the training mode but have not prepared anything for the second case. In production mode, all we need is to copy data from one buffer to another, which is easy to do with MQL5 tools. However, we have minimized data exchange between the OpenCL context and the main program, so on the main program side the contents of the buffers will not be up to date. To copy the data with MQL5, we would first have to load it from the OpenCL context into the main program memory, copy it from one buffer to another, and then return it to the OpenCL context in another buffer for subsequent operations. This completely contradicts our policy of minimizing data exchange between the OpenCL context and the main program.

The second option is to use a single kernel in both operation modes. In production mode, the masking buffer would simply be filled with ones. This approach would also work. However, we prepare the masking buffer on the main program side because OpenCL does not provide a pseudo-random number generator, so before executing the kernel we would have to pass the contents of the masking buffer from the main program to the OpenCL context. In training mode this is a forced measure, but why waste time on this unnecessary operation in production mode? Should we take a step back and prepare another kernel after all?

I found another solution. We already have a kernel that performs the linear activation function. Its mathematical representation is:

$$f(x) = a \cdot x + b$$

If we consider the special case of a = 1 and b = 0, we get a simple copy of the data.

Thus, there is no need to load additional buffers into the OpenCL context memory. Instead, we only pass the two constants a = 1 and b = 0 in the kernel parameters.
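
As a reminder, the linear activation kernel itself was created earlier in the book. A simplified sketch of what it computes is shown below; the kernel name and layout here are illustrative only, and the real kernel may, for example, use vector operations:

// Illustrative sketch only: the actual linear activation kernel was implemented earlier in the book.
// With a = 1 and b = 0, each output element simply repeats the corresponding input element,
// so the data is copied entirely inside the OpenCL context.
__kernel void LineActivationSketch(__global TYPE *inputs,
                                   __global TYPE *outputs,
                                   TYPE a, TYPE b)
  {
   const int i = get_global_id(0);
   outputs[i] = a * inputs[i] + b;
  }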

The algorithm for working with the kernel remains the same: check the presence of buffers in the context's memory, pass the kernel parameters, and enqueue the kernel.

bool CNeuronDropout::FeedForward(CNeuronBase *prevLayer)
  {
   ......
//--- branching of the algorithm depending on the execution device
   if(!m_cOpenCL)
     {
   ......
     }
   else  // OpenCL block
     {
      //--- operation mode flag check
      if(!m_bTrain)
        {
         //--- check data buffers
         CBufferType *inputs = prevLayer.GetOutputs();
         if(inputs.GetIndex() < 0)
            return false;
         if(m_cOutputs.GetIndex() < 0)
            return false;
         //--- pass parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_LineActivation,
                                            def_activ_inputs, inputs.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_LineActivation,
                                       def_activ_outputs, m_cOutputs.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_LineActivation,
                                                     def_activ_param_a, (TYPE)1))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_LineActivation,
                                                     def_activ_param_b, (TYPE)0))
            return false;
         uint offset[] = {0};
         uint NDRange[] = {(uint)m_cOutputs.Total()};
         if(!m_cOpenCL.Execute(def_k_LineActivation, 1, offset, NDRange))
            return false;
        }

To organize the work during training, we repeat the above algorithm, this time enqueuing our new MaskMult kernel.

      else
        {
         //--- check data buffers
         CBufferType *inputs = prevLayer.GetOutputs();
         if(inputs.GetIndex() < 0)
            return false;
         if(!m_cDropOutMultiplier.BufferCreate(m_cOpenCL))
            return false;
         if(m_cOutputs.GetIndex() < 0)
            return false;
         //--- pass parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                            def_mask_inputs, inputs.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                def_mask_mask, m_cDropOutMultiplier.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                       def_mask_outputs, m_cOutputs.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_MaskMult, def_mask_total, total))
            return false;
         //--- enqueuing
         int off_set[] = {0};
         int NDRange[] = {(int)(total + 3) / 4};
         if(!m_cOpenCL.Execute(def_k_MaskMult, 1, off_set, NDRange))
            return false;
        }
     }
//---
   return true;
  }

This concludes the feed-forward method. Let's proceed to implement similar operations in the CNeuronDropout::CalcHiddenGradient backpropagation method. Let me remind you that in this case we use the same kernels for the backpropagation pass. The calling algorithm does not change; the changes only affect which buffers are used.

bool CNeuronDropout::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
   ......
//--- branching of the algorithm depending on the execution device
   ulong total = m_cOutputs.Total();
   if(!m_cOpenCL)
     {
   ......
     }

   else  // OpenCL block
     {
      //--- operation mode flag check
      if(!m_bTrain)
        {
         //--- checking data buffers
         CBufferType *grad = prevLayer.GetGradients();
         if(grad.GetIndex() < 0)
            return false;
         if(m_cGradients.GetIndex() < 0)
            return false;
         //--- passing parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_LineActivation,
                                   def_activ_inputs, m_cGradients.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_LineActivation,
                                          def_activ_outputs, grad.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_LineActivation,
                                                 def_activ_param_a, (TYPE)1))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_LineActivation,
                                                 def_activ_param_b, (TYPE)0))
            return false;
         uint offset[] = {0};
         uint NDRange[] = {(uint)m_cOutputs.Total()};
         if(!m_cOpenCL.Execute(def_k_LineActivation, 1, offset, NDRange))
            return false;
        }

Next comes the operating mode used during training.

      else
        {
         //--- check data buffers
         CBufferType *prev = prevLayer.GetGradients();
         if(prev.GetIndex() < 0)
            return false;
         if(m_cDropOutMultiplier.GetIndex() < 0)
            return false;
         if(m_cGradients.GetIndex() < 0)
            return false;
         //--- pass parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                        def_mask_inputs, m_cGradients.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                  def_mask_mask, m_cDropOutMultiplier.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_MaskMult,
                                               def_mask_outputs, prev.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_MaskMult, def_mask_total, total))
            return false;
         //--- enqueuing
         int off_set[] = {0};
         int NDRange[] = {(int)(total + 3) / 4};
         if(!m_cOpenCL.Execute(def_k_MaskMult, 1, off_set, NDRange))
            return false;
        }
     }
//---
   return true;
  }

Note that during the backpropagation pass we no longer load the masking data into the OpenCL context: we expect it to have remained in the context memory since the feed-forward pass.

Congratulations, we have completed the work on the methods of the Dropout implementation class. We have done quite a lot of work and implemented the Dropout algorithm both with MQL5 and with multi-threaded operations using OpenCL. Now we can test the models. But first, I suggest looking at the implementation of this approach in Python using the TensorFlow library.