Organizing parallel computing in convolutional networks using OpenCL

In the previous section, we already created classes for two new types of neural layers. These are the convolutional and pooling layers. These types of layers are key in the architecture of convolutional neural networks. By alternating convolutional and pooling layers, we can create a model that searches for the key components of the desired object in the array of source data while simultaneously reducing the size of the processed information without sacrificing the overall performance of the model. This approach also helps filter out noise from the source data.

Reducing the information volume lowers the cost of processing it. Furthermore, we can parallelize computations in the convolutional and pooling layers using multi-threaded calculations in OpenCL. This will reduce the time spent on calculations while the total volume of operations remains unchanged, making the training and operation of the neural network much faster.

To organize multi-threaded operations using OpenCL, we need to perform two blocks of operations:

  • Write additional kernels in the previously created OpenCL program (opencl_program.cl).
  • Organize the process of interaction with the OpenCL context on the side of the main program.

Before organizing the transfer of data from the main program to the OpenCL context, it is necessary to understand when and what data will be needed. Therefore, we will begin our work by making changes to the OpenCL program.

Pooling layer

The creation of kernels in the OpenCL program and the construction of classes in the main program will start with the implementation of methods for the pooling layer. Feed-forward operations will be implemented in the ProofFeedForward kernel. We will transfer two data buffers from the main program to the kernel:

  • inputs: a vector of input data
  • outputs: a vector for writing results

To prevent an array out-of-bounds error, we will pass the size of the initial data vector to the kernel in the inputs_total parameter.

Let me remind you that in the convolutional neural network algorithm, the pooling layer follows the convolutional layer of neurons. In turn, the convolutional layer includes several filters, so the pooling layer receives the results of multiple filters in a single buffer and must process each filter separately. To logically divide the common buffer of convolutional layer results by filters, we will pass the kernel the size of the output vector of one filter in the input_neurons parameter.

In the kernel parameters, we also specify the window size for analyzing the initial data (window), the step for moving the window (step), and the activation function (activation).

__kernel void ProofFeedForward(__global TYPE *inputs,
                               __global TYPE *outputs,
                               int inputs_total,
                               int input_neurons,
                               int window,
                               int step,
                               int activation)

We will run this kernel in a two-dimensional task space: each kernel instance will process one element of the results array of one filter. The index of the processed element is given by the thread identifier in the dimension with index 0, and the total number of threads in that dimension gives us the number of elements in the output of one filter (neurons). Using this data, we determine the offset (shift) to the beginning of the analyzed data window within the filter's array of initial data.

  {
   const int n = get_global_id(0);
   const int w = get_global_id(1);
   const int neurons = get_global_size(0);
   const int window_out = get_global_size(1);
   int shift = n * step;

The second dimension, with index 1, indicates the index of the analyzed filter. Accordingly, we determine the offset to the beginning of the processed filter in the initial data array (shift_inp) and the index of the processed element in the results array (out). Don't forget to check for out-of-range access to the result array.

Let's prepare a variable to store intermediate values of the current element of the result vector (s).

   int out = w * neurons + n;
   int shift_inp = w * input_neurons;
   TYPE s = 0;
   TYPE k = (TYPE)1 / (TYPE)window;
   TYPE4 k4 = (TYPE4)(k);

The values in the pooling layer will be computed in a nested loop. In it, we will iterate through the elements of the input data that fall within the analyzed window and assemble the resulting value according to the activation function.

Let me remind you that in our implementation, the pooling layer can receive one of two activation functions:

  • Average pooling, which involves taking the arithmetic mean of the elements within the input data window.
  • Max pooling, which involves selecting the maximum element within the input data window.

When calculating the arithmetic mean, we will not first sum all the elements and then divide by the size of the analyzed window. Instead, each element is divided by the window size, and the resulting quotients are summed. This allows us to obtain the final result within the body of the loop and eliminates the division operation after the loop. On its own, a division after the loop would not be critical, but it would have to apply to every variant of operations in the loop. In our case, division is needed only for the arithmetic mean; for Max pooling, it is redundant and would require an additional check of the activation function to work correctly. By moving the division inside the loop, we eliminate that additional check and apply the division only where the calculated value actually requires it.
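For example, with window = 4 and input values (1, 2, 3, 4), the loop accumulates 1*0.25 + 2*0.25 + 3*0.25 + 4*0.25 = 2.5, which is already the arithmetic mean, so no division is required after the loop.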

Please note that we use vector operations with TYPE4 data type to speed up the process. Consequently, the step of the loop through the elements of the window is equal to four.

   for(int i = 0; i < window; i += 4)
      switch(activation)
        {
         case 0:
            s += dot(ToVect4(inputs, i, 1, min(shift_inp + input_neurons, inputs_total),
                             shift_inp + shift), k4);
            break;
         case 1:
            s = Max4(ToVect4(inputs, i, 1, min(shift_inp + input_neurons, inputs_total),
                             shift_inp + shift), s);
            break;
         default:
            break;
        }
   outputs[out] = s;
  }

After exiting the loop that iterates over the elements of the analyzed window, we will save the obtained value into the corresponding element of the result vector and exit the kernel.
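The ToVect4 and Max4 helper functions used above are defined elsewhere in opencl_program.cl. For reference, here is a minimal sketch of what they might look like; the actual implementations may differ in details such as out-of-range handling.

//--- Sketch of the vector helpers (illustrative only; see opencl_program.cl
//--- for the actual implementations)
TYPE4 ToVect4(__global TYPE *array, int start, int step, int size, int shift)
  {
   int p0 = shift + start;
   int p1 = p0 + step;
   int p2 = p1 + step;
   int p3 = p2 + step;
//--- elements whose index reaches the size bound are replaced with zeros
   return (TYPE4)((p0 < size ? array[p0] : 0),
                  (p1 < size ? array[p1] : 0),
                  (p2 < size ? array[p2] : 0),
                  (p3 < size ? array[p3] : 0));
  }

TYPE Max4(TYPE4 vect, TYPE max_v)
  {
//--- maximum of the four vector components and the running maximum
   return fmax(fmax(fmax(vect.s0, vect.s1), fmax(vect.s2, vect.s3)), max_v);
  }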

We have examined the feed-forward kernel and can now proceed to build the algorithm for the backpropagation pass. As discussed earlier when building the algorithm in MQL5, in the pooling layer the backpropagation pass simply propagates the error gradient through the layer. Therefore, constructing the backpropagation pass comes down to writing the algorithm of the ProofCalcHiddenGradient gradient propagation kernel.

The new kernel will communicate with the external program through four data buffers:

  • inputs: buffer for the results of the preceding layer
  • gradient_inputs: buffer for the gradients of the preceding layer (in this case, it is used to record the results of the kernel operation)
  • outputs: buffer for the results of the forward pass of the current layer
  • gradients: buffer for the gradients at the results level of the current layer

Buffer size control will be organized using the inputs_total and outputs_total parameters. The names of the parameters correspond to the buffers whose sizes they store.

It is important to note that, unlike a fully connected layer, neurons in the pooling layer have limited connections to neurons in the previous layer. We will define connection zones using the window and step parameters. You can see that parameters of the same name were declared in the forward pass kernel. We have also retained their functional significance.

Let's add parameters for the number of elements per filter output and the activation function being used.

__kernel void ProofCalcHiddenGradient(__global TYPE *inputs,
                                      __global TYPE *gradient_inputs,
                                      __global TYPE *outputs,
                                      __global TYPE *gradients,
                                      int inputs_total,
                                      int outputs_total,
                                      int window,
                                      int step,
                                      int neurons,
                                      int activation)

When organizing multi-threaded computations, it's important to consider the issue of concurrent attempts to write to the same buffer elements from different threads. Therefore, the most suitable algorithms are those in which each thread is provided with its own objects for writing data, and these objects do not intersect with objects being written to by other threads.

Following the logic mentioned above, we will create an algorithm in which each thread collects gradients and writes them to a separate element of the gradient buffer of the previous layer. Note one difference between this approach and the one we adopted in the MQL5 implementation. When using the Max pooling activation function, if there are two or more elements with values equal to the maximum, the gradient will be transferred in full to all such elements. In the MQL5 implementation of the main program, we passed the gradient to only one element. Considering the precision of the real-valued variables used, we assess the risk of encountering such a situation as minimal and accept it.
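For example, if the values (3, 7, 7, 2) fall within one window and Max pooling outputs 7, the OpenCL version will pass the full error gradient to both elements equal to 7, while the MQL5 version passed it to only one of them.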

At the beginning of the kernel body, let's determine the ordinal number of the required element and the filter from the thread identifiers. The total number of threads will give us the number of elements of one filter in the input data buffer (input_neurons) and the number of filters (window_out). Based on this data, we determine the first (start) and last (stop) elements of the results vector that are affected by the processed element. When defining the influence zone, we need to keep in mind the limits of the data buffer dimension for each filter. Therefore, the first element cannot be less than 0, and the last element cannot be greater than the number of elements in one filter (neurons).

  {
   const int n = get_global_id(0);
   const int w = get_global_id(1);
   const int input_neurons = get_global_size(0);
   const int window_out = get_global_size(1);
//---
   int start = n - window + step;
   start = max((start - start % step) / step, 0);
   int stop = min((n - n % step) / step + 1, neurons);
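To check the logic with a concrete example: for window = 4, step = 2, and input element n = 5, we get start = (3 - 1) / 2 = 1 and stop = (5 - 1) / 2 + 1 = 3. Indeed, only output elements 1 (covering inputs 2 to 5) and 2 (covering inputs 4 to 7) depend on input element 5.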

Next, we determine the offset of the analyzed element in the common initial data buffer. At the same time, do not forget to check for going beyond the array of initial data.

After that, we will prepare the necessary internal variables. First of all, this is a variable for collecting intermediate values of the gradient (grad) and the value of the current element in the source data buffer (inp).

The last variable is needed because, when using Max pooling, we will constantly compare the value of an element in the source data with the value from the results buffer. For technical reasons, accessing internal variables is much faster than accessing elements of a global array buffer. This is related to where the data is stored: internal variables reside in private memory, while buffers reside in global memory. Private memory is small, and we cannot copy the entire array there, but accessing it takes minimal time. Global memory is much larger, but access to it is significantly slower. To reduce the overall running time of the program, we move a frequently used value from the global to the private memory of the OpenCL context.

   TYPE grad = 0;
   int shift_inp = w * input_neurons + n;
   if(shift_inp >= inputs_total)
      return;
   TYPE inp = inputs[shift_inp];

Next, we will organize a loop in which we will iterate over the elements that fall within the influence zone of the analyzed element of the input data. Inside the loop, we will first determine the offset of the processed element in the error gradient buffer and immediately check that it does not go beyond the boundaries of the array. Then we will transfer the gradient in accordance with the activation function used.

For Average pooling, we simply divide the value of the error gradient by the size of the input data window and add the resulting value to the accumulated error gradient of the analyzed source data element. Please note that we will divide the error gradient by the size of the input data window, and not by the zone of influence. Indeed, the error obtained during the feed-forward pass is influenced by all the elements of the input data that affect the specific value.

In the case of Max pooling, we will first compare the value of the corresponding elements at the output and input of the neural layer. Only if they match will we transmit the error gradient in full.

After exiting the loop, we will save the computed gradient value in the gradient error buffer of the previous layer and conclude the execution of the kernel.

   for(int o = start; o < stop; o++)
     {
      int shift_g = w * neurons + o;
      if(shift_g >= outputs_total)
         break;
      switch(activation)
        {
         case 0:
            grad += gradients[shift_g] / (TYPE)window;
            break;
         case 1:
            grad += (outputs[shift_g] == inp ? gradients[shift_g] : 0);
            break;
         default:
            break;
        }
     }
   gradient_inputs[shift_inp] = grad;
  }

The above two kernels cover the forward and backward pass processes in the pooling layer. Now we can move on to working with the convolutional layer.

Convolutional layer

For the convolutional layer, we also have to implement forward and backward pass algorithms. Similarly to the kernels discussed earlier, the forward pass algorithm will be described in the ConvolutionFeedForward kernel. A convolutional layer, like a fully connected one, has a weight matrix and an activation function; however, the kernel itself only calculates the weighted sums, while the activation function is applied separately by the activation function object. Therefore, to communicate with the main program, we need three data buffers:

  • inputs: input data buffer
  • weights: matrix of weights
  • outputs: vector of results

In addition to buffers, for the proper functioning of the new kernel, the following parameters will be required:

  • inputs_total: size of the input data array
  • window: size of the analyzed window of the source data
  • step: step of the source data window
  • window_out: number of filters in the layer

__kernel void ConvolutionFeedForward(__global TYPE *inputs,
                                     __global TYPE *weights,
                                     __global TYPE *outputs,
                                     int inputs_total,
                                     int window,
                                     int step,
                                     int window_out)

Building the algorithm of the kernel itself mirrors the construction of the analogous kernel for a fully connected neuron. Just like in the fully connected layer, the number of threads will be tied to the number of elements in the output buffer. However, considering the specific nature of the convolutional layer's operation, we will be guided not by the total number of elements in the buffer but by the number of elements in the results buffer of a single filter. In this case, the results of the n-th element of all filters will be calculated in one thread.

At the beginning of the kernel, we will carry out preparatory work. We will determine the index of the processed element in the filter results buffer based on the thread number. The total number of threads will give us the number of elements in the output of each filter. From the obtained data and information from the kernel parameters, we will calculate the offset to the beginning of the analyzed window in the source data buffer and the size of the weight matrix being used.

  {
   const int n = get_global_id(0);
   const int neurons = get_global_size(0);
   const int weights_total = (window + 1) * window_out;
   int shift = n * step;

Since we decided to process all the filters sequentially in one thread, the next thing we do is organize a filter iteration loop. Inside the loop, we determine the offset to the processed element in the general result buffer and the offset in the weight matrix. At this point, we will also check for any out-of-bounds access to the weight matrix and prepare an internal variable for collecting the resulting value. We will initialize the variable with the bias element.

   for(int w = 0; w < window_out; w++)
     {
      int out = w * neurons + n;
      int shift_weights = w * (window + 1);
      if((shift_weights + window) >= weights_total)
         break;
      TYPE s = weights[shift_weights + window];

We will directly calculate the weighted sum of the analyzed input data window in a nested loop. Inside this loop, we will iterate through the elements of the analyzed window of input data and multiply them by the corresponding weights. To reduce the execution time, we use vector operations. Do not forget that the loop step must be increased to the size of the vector variables used.

      for(int i = 0; i < window; i += 4)
         s += dot(ToVect4(inputs, i, 1, inputs_total, shift),
                  ToVect4(weights, i, 1, shift_weights + window, shift_weights));
      outputs[out] = s;
     }
  }

After collecting the weighted sum, we write the resulting value to the result buffer.
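To make the indexing logic easier to verify, below is a scalar CPU reference of the same computation in plain C. It is an illustration only: the function name and signature are hypothetical and not part of the library.

// Scalar reference for ConvolutionFeedForward (hypothetical helper for
// checking the kernel logic on the CPU). Each filter stores 'window'
// weights followed by one bias element.
void ConvolutionFeedForwardCPU(const double *inputs, const double *weights,
                               double *outputs, int inputs_total, int window,
                               int step, int window_out, int neurons)
  {
   for(int w = 0; w < window_out; w++)
      for(int n = 0; n < neurons; n++)
        {
         const double *filter = weights + w * (window + 1);
         double s = filter[window];              // bias element
         int shift = n * step;                   // start of the input window
         for(int i = 0; i < window; i++)
            if(shift + i < inputs_total)
               s += inputs[shift + i] * filter[i];
         outputs[w * neurons + n] = s;
        }
  }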

Next, we move on to creating kernels for the backward pass process. Unlike the pooling layer, the convolutional layer contains a weight matrix. Therefore, as in the analogous process for a fully connected layer, we will need to create more than one kernel.

We will build the process as before, following the algorithm of the backpropagation pass. The error gradient will be adjusted by the derivative of the activation function in full, just as we did for the fully connected layer. Let's start working on the convolutional layer by creating the ConvolutionCalcHiddenGradient kernel, which propagates the gradient through the layer.

In this case, propagating the gradient to the lower layer does not depend on the input data and the results of the forward pass. Therefore, for our kernel to work, we will give it three data buffers:

  • gradient_inputs: buffer for the error gradients of the preceding layer (in this case, the result buffer)
  • weights: weight matrix
  • gradients: buffer for the error gradients at the input of the current layer

In addition to data buffers, a number of parameters are required for the correct operation of the kernel:

  • window: size of the input data window (the number of input data elements analyzed by one neuron of the current layer)
  • step: step of moving the window along the array of initial data
  • window_out: number of filters in the current convolutional layer
  • neurons: number of elements at the output of one filter

__kernel void ConvolutionCalcHiddenGradient(__global TYPE *gradient_inputs,
                                            __global TYPE *weights,
                                            __global TYPE *gradients,
                                            int window,
                                            int step,
                                            int window_out,
                                            int neurons)

The kernel will be launched in a multi-threaded mode with the number of threads equal to the number of elements in the gradient error buffer of the previous layer, which is also equal to the number of elements in the input data buffer.

As usual, at the beginning of the kernel, we determine the ordinal number of the element being processed by the number of the current thread and the number of elements in the gradient buffer of the previous layer by the total number of running threads. Additionally, we calculate the size of the weight matrix based on the size of the input data window and the number of filters in the current convolutional layer.

  {
   const int n = get_global_id(0);
   const int inputs_total = get_global_size(0);
   int weights_total = (window + 1) * window_out;

Continuing the preparatory work, let's determine the zone of influence of the current element in the result buffer of one filter and prepare an internal variable to record the intermediate results of the accumulation of the error gradient for the processed element.

   TYPE grad = 0;
   int w_start = n % step;
   int r_start = max((n - window + step) / step, 0);
   int total = (window - w_start + step - 1) / step;
   total = min((n + step) / step, total);

Let me remind you that when creating the convolution layer class in the main program, we decided to consider the array of initial data as a single whole and apply all filters to the total amount of data. Therefore, each element of the input data affects the results of all filters. This means that we have to collect the error gradient on each element of the initial data from all filters. Therefore, to collect error gradients, we need a system of nested loops with iteration of filters and elements of each filter.

The outer loop iterates over the elements of the error gradient vector at the output of the current neural layer. In it, we will determine the offset to a specific element in the gradient vector and immediately check for going beyond the filter size.

   for(int i = 0; i < total; i++)
     {
      int row = r_start + i;
      if(row >= neurons)
         break;

In the body of the nested loop, we will first determine the offset in the gradient buffer of the error at the output of the current layer and the weight matrix. Then, we will add the product of the values of these elements to the previously accumulated error gradient for the analyzed element of the original data.

      for(int wo = 0; wo < window_out; wo++)
        {
         int shift_g = row + wo * neurons;
         int shift_w = w_start + (total - i - 1) * step + wo * (window + 1);
         grad += gradients[shift_g] * weights[shift_w];
        }
     }
   gradient_inputs[n] = grad;
  }

After completion of all iterations and exiting from the block of two nested loops, the value of the accumulated gradient is stored in the error gradient buffer of the previous layer.
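To verify the indexing with a concrete example: for window = 4, step = 2, and input element n = 5, we get w_start = 1, r_start = 1, and total = 2, so the loop visits output elements 1 and 2. Output element 1 covers inputs 2 to 5, where element 5 occupies position 3 within the window, and indeed w_start + (total - 0 - 1) * step = 3; output element 2 covers inputs 4 to 7, where element 5 occupies position 1, matching w_start + (total - 1 - 1) * step = 1.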

The distribution of the error gradient through the hidden layers of the neural network, in accordance with the backpropagation algorithm, is followed by the transfer of the error gradient to the weights. To perform this functionality, we create the ConvolutionCalcDeltaWeights kernel.

For the correct operation of the kernel, three data buffers are required:

  • inputs: input data buffer
  • delta_weights: buffer for the accumulated error gradients of the weight matrix (in this case, the results buffer)
  • gradients: buffer for the error gradients of the current layer (at the results level)

The gradient buffer contains the values of the error gradients already corrected for the derivative of the activation function. This procedure is performed before passing the error gradient to the previous layer. Therefore, adjusting for the derivative of the activation function at this stage will be unnecessary.

In addition to the data buffers, we need to introduce a few parameters in order to build the algorithm correctly:

  • inputs_total: total number of elements in the input data buffer
  • step: step of moving the analyzed data window along the source data array
  • neurons: number of elements at the output of one filter

__kernel void ConvolutionCalcDeltaWeights(__global TYPE *inputs,
                                          __global TYPE *delta_weights,
                                          __global TYPE *gradients,
                                          int inputs_total,
                                          int step,
                                          int neurons)

It can be noticed that among the parameters, there are no variables to indicate the size of the window for the analyzed data and the number of filters in the current convolutional layer. This is due to a change in the approach to creating threads for operations. This kernel will collect error gradients at the level of the weight matrix, so it is quite logical to run the kernel for each weight. Furthermore, the weight matrix is represented as a two-dimensional table, where each row corresponds to a separate filter, and the elements within each row are the weights of the corresponding filter.

The OpenCL technology allows threads to be launched in two-dimensional space, with two indices for each thread. Let's use this property and create threads for this kernel in two dimensions. In the first dimension, the number of threads will be equal to the number of weights in one filter. In the second dimension, the number of threads will correspond to the number of filters used.

In the body of the kernel, we will determine the position of the analyzed element in the weight matrix and its dimensions. It should be recalled here that each filter has a bias weight, so the size of the analyzed data window will be one element less than the number of threads in the first dimension (the dimension with index 0).
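For example, a convolutional layer with a window of 4 elements and 8 filters stores (4 + 1) * 8 = 40 weights, so the kernel will run in a 5 by 8 task space: one thread per weight, where the last thread in each row of the first dimension corresponds to the bias element.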

Right there, we will determine the position of the analyzed element in the one-dimensional buffer of the weight matrix and the offset to the beginning of the corresponding filter in the error gradient buffer. And of course, let's prepare a variable to store intermediate values of the accumulated error gradient.

  {
   const int inp_w = get_global_id(0);
   const int w = get_global_id(1);
   const int window = get_global_size(0) - 1;
   const int window_out = get_global_size(1);
//---
   int shift_delt = w * (window + 1) + inp_w;
   TYPE value = 0;

Next comes the process of directly calculating the error gradient. Here we must remember that for the bias element, there are no corresponding elements in the source data buffer. Therefore, the gradient will be transferred to this element in full. In order not to check at each iteration of the loop, we will do it once before starting the loop. In the loop, we will iterate through the elements of the error gradient buffer and the original data, while the element of the weight matrix remains unchanged.

Thus, first, we check whether the current element of the weight matrix is a bias, and then we organize a loop to iterate through all the error gradient elements of the corresponding filter. Inside the loop, we will sum up the error gradient adjusted for the corresponding value of the initial data buffer.

After exiting the loop, we add the obtained value to the previously accumulated error gradient for the analyzed element of the weight matrix. Let me remind you that we will not update the weight matrix at each iteration of the backward pass; we only accumulate the error gradient. The weight matrix is updated by a command from the main program after processing the batch of data specified by the user.

   if(inp_w == window)
     {
      for(int n = 0; n < neurons; n++)
         value += gradients[w * neurons + n];
     }
   else
       for(int n = 0; n < neurons; n++)
        {
         int shift_inp = n * step + inp_w;
         if(shift_inp >= inputs_total)
            break;
         value += inputs[shift_inp] * gradients[w * neurons + n];
        }
   delta_weights[shift_delt] += value;
  }
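As a quick check of the indexing: with step = 2 and neurons = 3, the thread for weight inp_w = 1 of filter w accumulates inputs[1] * gradients[w * neurons + 0] + inputs[3] * gradients[w * neurons + 1] + inputs[5] * gradients[w * neurons + 2], that is, exactly the input elements that were multiplied by this weight during the forward pass.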

After distributing the error gradient to the weight matrix through the backpropagation algorithm, the next step is to update the matrix. The weights are adjusted in the direction of the anti-gradient. As mentioned before while creating the convolutional layer using MQL5, the previously established process for the fully connected layer fully meets the requirements for working with convolutional layers as well. Therefore, we will not create separate kernels and blocks of the main program but will use the previously created solution.

Implementing functionality on the side of the main program

After supplementing the OpenCL program with new kernels, we have to embed code blocks into the main program to organize the process of data exchange and launch kernels for execution at the right time and with the right amount of information. Let's take a closer look at how this can be implemented.

As a reminder, when building a fully connected neural layer, we started similar work by declaring constants. Now we will do the same: we will declare constants for calling each kernel.

#define def_k_ProofFeedForward            21
#define def_k_ProofHiddenGradients        22
#define def_k_ConvolutionFeedForward      23
#define def_k_ConvolutionHiddenGradients  24
#define def_k_ConvolutionDeltaWeights     25

We will also declare parameter constants for each kernel. The constants of the parameters must strictly correspond to the ordinal number of the parameter in the OpenCL program kernel. Parameter numbering starts from zero.

//--- feed-forward pass of the pooling layer
#define def_prff_inputs                   0
#define def_prff_outputs                  1
#define def_prff_inputs_total             2
#define def_prff_input_neurons            3
#define def_prff_window                   4
#define def_prff_step                     5
#define def_prff_activation               6

//--- gradient distribution through the pooling layer
#define def_prhgr_inputs                  0
#define def_prhgr_gradient_inputs         1
#define def_prhgr_outputs                 2
#define def_prhgr_gradients               3
#define def_prhgr_inputs_total            4
#define def_prhgr_outputs_total           5
#define def_prhgr_window                  6
#define def_prhgr_step                    7
#define def_prhgr_neurons                 8
#define def_prhgr_activation              9

//--- feed-forward pass of the convolutional layer
#define def_cff_inputs                    0
#define def_cff_weights                   1
#define def_cff_outputs                   2
#define def_cff_inputs_total              3
#define def_cff_window                    4
#define def_cff_step                      5
#define def_cff_window_out                6

//--- gradient distribution through the convolutional layer
#define def_convhgr_gradient_inputs       0
#define def_convhgr_weights               1
#define def_convhgr_gradients             2
#define def_convhgr_window                3
#define def_convhgr_step                  4
#define def_convhgr_window_out            5
#define def_convhgr_neurons               6

//--- distribution of the gradient to the weight matrix of the convolutional layer
#define def_convdelt_inputs               0
#define def_convdelt_delta_weights        1
#define def_convdelt_gradients            2
#define def_convdelt_inputs_total         3
#define def_convdelt_step                 4
#define def_convdelt_neurons              5

After declaring the constants, we need to update the list of kernels used from the OpenCL program. Let me remind you that this work is carried out in the CNet::InitOpenCL method. Here we need to change the number of used kernels to 26.

   if(!m_cOpenCL.SetKernelsCount(26))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

Let's create entry points for new kernels.

   if(!m_cOpenCL.KernelCreate(def_k_ProofFeedForward"ProofFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_ProofHiddenGradients,
                                               "ProofCalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_ConvolutionFeedForward,
                                                "ConvolutionFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_ConvolutionHiddenGradients,
                                         "ConvolutionCalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_ConvolutionDeltaWeights,
                                            "ConcolutionCalcDeltaWeights"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

Further work will continue directly in the relevant methods. Remember that during the construction of classes, we implemented branching in many methods depending on the device used for executing operations. We have already written the MQL5 part. Now we will describe the algorithm for working with the OpenCL context.

We will supplement the methods in the same sequence in which we created them earlier. Let's start with the feed-forward method of the pooling layer, CNeuronProof::FeedForward. To work correctly, this method uses two data buffers: initial data and results. At the beginning of the block, we check for the presence of the specified buffers in the OpenCL context: a valid buffer index indicates that the buffer has already been passed to the context.

bool CNeuronProof::FeedForward(CNeuronBase *prevLayer)
  {
//--- Control block
   if(!prevLayer || !m_cOutputs ||
      !prevLayer.GetOutputs())
      return false;
   CBufferType *input_data = prevLayer.GetOutputs();
//--- Algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
     // The MQL5 block is missing here
     }
   else // Block of operations with OpenCL
     {
      //--- check the availability of buffers in the OpenCL context
      if(input_data.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;

If there is data in the OpenCL context, we will pass pointers to the data buffers and parameters necessary for its operation to the kernel. At each step, we also check the result of the operations. This is crucial because launching a kernel with incomplete information can lead to a critical error and the halt of the entire program.

      //--- Send parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofFeedForward, def_prff_inputs,
                                                         input_data.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofFeedForward, def_prff_outputs,
                                                         m_cOutputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofFeedForward, def_prff_inputs_total,
                                                            input_data.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofFeedForward, def_prff_window,
                                                                     m_iWindow))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofFeedForward, def_prff_step, m_iStep))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofFeedForward, def_prff_activation,
                                                            (int)m_eActivation))
         return false;
      uint input_neurons = (input_data.Total() + m_iWindowOut - 1) / m_iWindowOut;
      if(!m_cOpenCL.SetArgument(def_k_ProofFeedForward, def_prff_input_neurons,
                                                                 input_neurons))
         return false;

Once all the necessary information is passed to the kernel, we then need to specify the number of threads for kernel execution and the initial offset in the task space. After that, we initiate the execution of the kernel and complete the method.

       //--- Queuing up the kernel for execution
      uint off_set[] = {0, 0};
      uint NDRange[] = {m_iNeurons, m_iWindowOut};
      if(!m_cOpenCL.Execute(def_k_ProofFeedForward, 2, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

After adding the code for the CNeuronProof::FeedForward forward pass method of the pooling layer, let's do the same work in the CNeuronProof::CalcHiddenGradient backward pass method. Unlike the forward pass, the error gradient distribution kernel through the pooling layer uses four data buffers:

  • initial data
  • feed-forward results
  • error gradients at the output of the neural layer
  • error gradients at the source data level (the result buffer in this case).

The first two buffers are used to determine which elements should receive the gradient when Max pooling is used.

Therefore, we have to load all four buffers into the memory of the OpenCL context.

bool CNeuronProof::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- Control block
   if(!prevLayer || !m_cOutputs ||
      !m_cGradients || !prevLayer.GetOutputs() ||
      !prevLayer.GetGradients())
      return false;
   CBufferType *input_data = prevLayer.GetOutputs();
   CBufferType *input_gradient = prevLayer.GetGradients();
   if(!input_gradient.BufferInit(input_data.Rows(), input_data.Cols(), 0))
      return false;
//--- Algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
     // The MQL5 block is missing here
     }
   else    // Block of operations with OpenCL
     {
      //--- check for buffers in the OpenCL context
      if(input_data.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;
      if(input_gradient.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;

If there is data in the memory of the OpenCL context, we will pass pointers to buffers and necessary constants to the kernel parameters. At the same time, do not forget to control the results of the operations.

      //--- Send parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofHiddenGradients,
                                         def_prhgr_inputs, input_data.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofHiddenGradients,
                                        def_prhgr_outputs, m_cOutputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofHiddenGradients,
                                    def_prhgr_gradients, m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ProofHiddenGradients,
                            def_prhgr_gradient_inputs, input_gradient.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                      def_prhgr_inputs_total, input_data.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                                     def_prhgr_window, m_iWindow))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                                         def_prhgr_step, m_iStep))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                        def_prhgr_activation, (int)m_eActivation))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                                   def_prhgr_neurons, m_iNeurons))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ProofHiddenGradients,
                                     def_prhgr_outputs_total, m_cOutputs.Total()))
         return false;

Then we specify the number of threads to run the kernel and the offset in the task area. After that, we will put the kernel in the execution queue.

Please note that when launching the forward pass kernel, the number of threads is equal to the number of elements at the output of one filter in the pooling layer. When running a backward pass kernel, the number of threads is equal to the number of elements in one filter of the previous neural layer.
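For example, if the previous convolutional layer outputs 100 elements per filter and the pooling layer uses a window and step of 2, the forward pass kernel runs 50 threads per filter (m_iNeurons), while the backward pass kernel runs 100 threads per filter (input_neurons).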

      //--- Queuing up the kernel for execution
      uint input_neurons = (input_data.Total() + m_iWindowOut - 1) / m_iWindowOut;
      uint off_set[] = {0, 0};
      uint NDRange[] = {input_neurons, m_iWindowOut};
      if(!m_cOpenCL.Execute(def_k_ProofHiddenGradients, 2, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

This completes the work with the pooling layer class. We move on to do a similar job with the CNeuronConv convolutional layer class.

The convolutional neural layer, unlike the pooling layer, has a weight matrix and an activation function. The CNeuronConv::FeedForward forward pass method of the convolutional layer requires transferring three buffers to the OpenCL context memory:

  • initial data
  • weight matrix
  • results buffer

The kernel itself only computes the weighted sums; the activation function is applied separately by the activation function object after the kernel completes its work.

Let's start working in the CNeuronConv::FeedForward forward pass method by checking the availability of buffers in use in the context of OpenCL.

bool CNeuronConv::FeedForward(CNeuronBase *prevLayer)
  {
//--- control block
   if(!prevLayer || !m_cOutputs || !m_cWeights || !prevLayer.GetOutputs())
      return false;
   CBufferType *input_data = prevLayer.GetOutputs();
   ulong total = input_data.Total();
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
     // The MQL5 block is missing here
     }
   else
     {
      //--- checking data buffers
      if(input_data.GetIndex() < 0)
         return false;
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;

Then we need to pass buffer pointers to the corresponding kernel. In addition, in the kernel parameters, we will pass some constants necessary for the correct operation of the algorithm. Among the passed parameters will be the size of the analyzed window, the window step and the number of filters. At each step, we control the process of performing operations.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionFeedForward,
                                          def_cff_inputs, input_data.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionFeedForward,
                                         def_cff_weights, m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionFeedForward,
                                         def_cff_outputs, m_cOutputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionFeedForward,
                                       def_cff_inputs_total, input_data.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionFeedForward,
                                                      def_cff_window, m_iWindow))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionFeedForward,
                                                          def_cff_step, m_iStep))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionFeedForward,
                                               def_cff_window_out, m_iWindowOut))
         return false;

After passing all the necessary data to the kernel, we specify the number of threads to start and initiate its queuing.

      //--- put the kernel in the execution queue
      int off_set[] = {0};
      int NDRange[] = {(int)m_iNeurons};
      if(!m_cOpenCL.Execute(def_k_ConvolutionFeedForward, 1, off_set, NDRange))
         return false;
     }
   if(!m_cActivation.Activation(m_cOutputs))
      return false;
//---
   return true;
  }

Finally, we call the activation function and exit the method.

That's all for the feed-forward pass. Let's proceed to the backpropagation pass in the convolutional neural layer. As you remember, the backpropagation pass includes three sub-processes:

  • Distributing the error gradient over the neural network from the result to the initial data.
  • Distributing the error gradient to the weight matrix of each neural layer.
  • Adjusting the weight matrix towards the anti-gradient.

From the methods already implemented using MQL5, we know that no new method was created for the last sub-process. Instead, it is suggested to use the ready-made method of the fully connected neural layer, where we have already implemented multi-threaded computations using OpenCL tools. Therefore, at this stage, we have to refine only the methods of the first two sub-processes.

The CNeuronConv::CalcHiddenGradient method is responsible for distributing the error gradient across the convolutional layer. Correct execution of the algorithm of this method requires the presence of three data buffers:

  • Buffer for error gradients at the output of the neural layer (obtained from the next layer in the process of executing a similar method).
  • Weight matrix.
  • Buffer for error gradients at the input data level (in this case, it acts as a buffer for the results of the method).

Therefore, at the beginning of the block of work with the OpenCL technology, we check the presence of the necessary buffers in the context memory.

bool CNeuronConv::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- control block
   if(!prevLayer || !prevLayer.GetOutputs() || !prevLayer.GetGradients() ||
      !m_cGradients || !m_cWeights)
      return false;
//--- adjust error gradients to the derivative of the activation function
   if(m_cActivation)
     {
      if(!m_cActivation.Derivative(m_cGradients))
         return false;
     }
//--- algorithm branching depending on the operating device
   CBufferType *input_gradient = prevLayer.GetGradients();
   if(!m_cOpenCL)
     {
     //The MQL5 block is missing here
     }

   else // Block for working with OpenCL
     {
      //--- checking data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(input_gradient.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;

The next step is to pass the necessary data to the kernel parameters. Among them are pointers to the data buffers used.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionHiddenGradients,
                        def_convhgr_gradient_inputs, input_gradient.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionHiddenGradients,
                                    def_convhgr_weights, m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionHiddenGradients,
                                def_convhgr_gradients, m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionHiddenGradients,
                                               def_convhgr_neurons, m_iNeurons))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionHiddenGradients,
                                                 def_convhgr_window, m_iWindow))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionHiddenGradients,
                                                     def_convhgr_step, m_iStep))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionHiddenGradients,
                                          def_convhgr_window_out, m_iWindowOut))
         return false;

Next, we will specify the number of threads equal to the number of elements in the source data buffer and enqueue the kernel for execution.

      //--- put the kernel in the execution queue
      int NDRange[] = {(int)input_gradient.Total()};
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_ConvolutionHiddenGradients, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

To complete the work on the backpropagation pass methods in the convolutional network, we need to make similar changes to the method for distributing the error gradient to the weight matrix, CNeuronConv::CalcDeltaWeights, taking into account the specifics of this method.

The algorithm of the error gradient distribution method to the weight matrix requires the presence of three buffers:

  • Error gradient at the output level of the neural layer.
  • Initial data buffer.
  • Buffer for accumulating error gradients at the weight matrix level.

Let's check the presence of the specified buffers in the memory of the OpenCL context. Let me remind you that we proceed from the assumption that there is enough video memory to store the entire model. If the model does not completely fit in the memory of your OpenCL device, then you will need to load the necessary data into the context memory before launching each kernel. After the completion of the kernel, free up memory to load the next batch of data.

bool CNeuronConv::CalcDeltaWeights(CNeuronBase *prevLayer)
  {
//--- control block
   if(!prevLayer || !prevLayer.GetOutputs() || !m_cGradients || !m_cDeltaWeights)
      return false;
//--- algorithm branching depending on the operating device
   CBufferType *input_data = prevLayer.GetOutputs();
   if(!m_cOpenCL)
     {
     // The MQL5 block is missing here
     }
   else // Block for working with OpenCL
     {
      //--- checking data buffers
      if(m_cGradients.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(input_data.GetIndex() < 0)
         return false;

Then we pass the necessary parameters to the kernel corresponding to our sub-process. Let me remind you that it is very important to maintain the correspondence between the kernel ID, the parameter ID, and the value being passed. We also control the result of the operations at each step.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionDeltaWeights,
                        def_convdelt_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionDeltaWeights,
                                    def_convdelt_inputs, input_data.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_ConvolutionDeltaWeights,
                               def_convdelt_gradients, m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionDeltaWeights,
                                 def_convdelt_inputs_total, input_data.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionDeltaWeights,
                                              def_convdelt_neurons, m_iNeurons))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_ConvolutionDeltaWeights,
                                                    def_convdelt_step, m_iStep))
         return false;

When all the necessary information is transferred to the kernel, we specify the number of threads. In this case, we decided to use a two-dimensional thread distribution:

  • by the number of filters
  • by the number of weights in one filter

To do this, we specify two parameters in the NDRange array. Each parameter specifies the size of the corresponding task area. We send the kernel to the execution queue.

      //--- put the kernel in the execution queue
      uint NDRange[] = {m_iWindow + 1, m_iWindowOut};
      uint off_set[] = {0, 0};
      if(!m_cOpenCL.Execute(def_k_ConvolutionDeltaWeights, 2, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

Now we have already created three types of fully functional neural layers for our neural network builder and can compare their effectiveness in solving a practical problem. I suggest doing some experiments in the next chapter. But before proceeding to the "field tests", we still have to check the correctness of the methods for transferring gradients.