Creating an OpenCL program

We will start porting the calculations by creating the OpenCL program itself. This starting point is a natural choice: we have already organized the entire process using MQL5 tools, so the whole sequence of operations is clear and transparent. The computations will be performed in the OpenCL program, while the main program will organize the transfer of data and the calls to the OpenCL program. The latter is much easier to organize once we know exactly what data must be passed to which kernel.

The program code will be written to a separate file, opencl_program.cl. The file name and extension can be anything, since we will later load the file into the main program code as a resource. I use the *.cl extension because it is commonly used to denote OpenCL programs. In general, using standard extensions makes projects with complex file structures easier to read.

For the feed-forward pass, similar to the MQL5 implementation, we will create the PerceptronFeedForward kernel. To carry out the operations, we need the result vector of the previous layer (inputs) and the weight matrix (weights) as initial data. The result of the operations will be written to the result vector (outputs). In addition to the data buffers, we need to know the number of neurons in the previous layer to avoid reading beyond the array bounds.

As we discussed earlier, the number of threads will correspond to the number of neurons in the layer. Therefore, at the beginning of the kernel, we need to call the get_global_id function, which will return the thread identifier to us. Let's use this value as the ordinal number of the neuron being processed.

The weight matrix in our buffer is represented as a vector in which N weights of the first neuron go sequentially, followed by N weights of the second neuron, and so on. The value of N is one element greater than the number of neurons in the previous layer since we are using the bias element. We already have the neuron's ordinal number and the number of neurons in the previous layer, so we can determine the offset in the weight vector to the first weight of our neuron.
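For illustration, here is a sketch of the buffer layout for a previous layer of three neurons, so that N = 4 values are stored per neuron (w[n][i] denotes the i-th weight of neuron n, and b[n] its bias weight):

// index:  0       1       2       3    |  4       5       6       7    |  8   ...
// value:  w[0][0] w[0][1] w[0][2] b[0] |  w[1][0] w[1][1] w[1][2] b[1] |  w[2][0] ...
// offset to the first weight of neuron n: shift = n * (inputs_total + 1)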

Next, we will create a local variable to accumulate the sum of products and initialize it with the bias weight. After that, we organize a loop that calculates the sum of the products of the input signal and the weights for the specific neuron handled by our thread. OpenCL supports vector operations, which allow the processor to perform one operation on several values simultaneously. In the proposed implementation, I used the double4 type, a vector of four elements of type double. To convert data from an array of discrete values into a vector, I created the ToVect4 function, which we will discuss a bit later. To obtain the sum of products, I used the built-in dot function, a vector operation that returns the scalar product of two vectors. This allowed us to use a loop step of 4 and thereby reduce the number of iterations fourfold.
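In scalar terms, each iteration of the vectorized loop in the kernel below performs the equivalent of four ordinary multiply-accumulate operations (near the end of the array, ToVect4 pads the missing elements with zeros):

// one loop iteration with a step of 4 is equivalent to:
s += inputs[i]     * weights[shift + i] +
     inputs[i + 1] * weights[shift + i + 1] +
     inputs[i + 2] * weights[shift + i + 2] +
     inputs[i + 3] * weights[shift + i + 3];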

It should be noted that double4 is not the only vector data type supported by OpenCL. The availability of vector types depends on the technical specifications of the hardware in use. In my personal opinion, double4 is the most versatile type for a wide range of available equipment. When creating your own libraries, you can use whichever data type is optimal for your hardware.

After completing the loop iterations, we save the accumulated sum of products to the corresponding element of the result buffer.

The full kernel code is presented below.

__kernel void PerceptronFeedForward(__global TYPE *inputs,
                                    __global TYPE *weights,
                                    __global TYPE *outputs,
                                    int inputs_total)
  {
   const int n = get_global_id(0);
   const int weights_total = get_global_size(0) * (inputs_total + 1);
   int shift = n * (inputs_total + 1);
   TYPE s = weights[shift + inputs_total];
   for(int i = 0; i < inputs_total; i += 4)
      s += dot(ToVect4(inputs, i, 1, inputs_total, 0),
               ToVect4(weights, i, 1, weights_total, shift));
   outputs[n] = s;
  }

Let's examine the algorithm of the ToVect4 function. As parameters, the function receives a pointer to an array of discrete values, the position of the first element to copy, the step between two copied elements, the size of the data array, and the offset in the array to the first copied element. The first-element position and the offset serve a similar purpose but are used differently. The offset determines the displacement in the data array from the element with index 0; in the feed-forward kernel, it points to the first weight of the processed neuron. The position of the first element to copy, by contrast, is measured in units of the step; in the example above, it runs through the neurons of the preceding layer. With a step of one element, one parameter can easily be expressed in terms of the other. The difference becomes more noticeable in the backpropagation pass, where the step for copying weights will be equal to the size of the weight vector of one neuron.

At the beginning of the function, we initialize the result vector with zeros and ensure that the step is at least one element. Then we check how many elements can be taken from the source array starting from the initial position, taking into account the offset and the step. This check is necessary to avoid accessing data beyond the array boundaries. Next, we fill the result vector with the available values, leaving the missing elements at zero. In effect, the source array is padded to a multiple of four, while the final result of the calculations remains unchanged, and the use of vector operations reduces the overall computation time.

TYPE4 ToVect4(__global TYPE *array, int start, int step, int size, int shift)
  {
   TYPE4 result = (TYPE4)(0, 0, 0, 0);
   step = max(1, step);
   int st = start * step + shift;
   if(st < size)
     {
      int k = (size - st + step - 1) / step;

      switch(k)
        {
         case 0:
            break;
         case 1:
            result = (TYPE4)(array[st], 0, 0, 0);
            break;
         case 2:
            result = (TYPE4)(array[st], array[st + step], 0, 0);
            break;
         case 3:
            result = (TYPE4)(array[st], array[st + step], array[st + 2 * step], 0);
            break;
         default:
            result = (TYPE4)(array[st], array[st + step], array[st + 2 * step], array[st + 3 * step]);
            break;
        }
     }
   return result;
  }

To fully understand the forward pass, let's explore the activation functions. In general, they repeat the relevant implementations in MQL5, except for the kernel design. For example, below is the code for the implementation of a sigmoid. The kernel parameters include pointers to the input data buffers, output buffers, and activation function parameters. In the kernel body, we first determine the thread identifier, which indicates the ordinal number of the processed element in the data buffer, and then organize the process of calculating the activation function. As you can see, the code for calculating the value of the function is very similar to the relevant MQL5 code presented in the activation function description section.

__kernel void SigmoidActivation(__global TYPE *inputs,
                                __global TYPE *outputs,
                                const TYPE a, const TYPE b)
  {
   size_t i = get_global_id(0);
   outputs[i] = a / (1 + exp(-inputs[i])) - b;
  }

The same can be said about the implementation of the Swish activation function.

__kernel void SwishActivation(__global TYPE *inputs,
                              __global TYPE *outputs,
                              const TYPE b)
  {
   size_t i = get_global_id(0);
   TYPE value = inputs[i];
   outputs[i] = value / (1 + exp(-b * value));
  }

However, implementing the Softmax function presents some difficulties. They stem from the need to transfer data between threads in order to calculate the sum of the exponentials of all values in the neural layer. To resolve the issue, I decided to divide the process into several stages. In addition, we will take advantage of a work-group's ability to use shared arrays in local memory.

In the kernel parameters, we pass pointers to the buffers of input data and results, as well as the size of the input data buffer. In the kernel body, we first get all the necessary identifiers. These are the global IDs in two dimensions, and the thread ID in the local group. We will talk about the use of two dimensions in the global task space later.

__kernel void SoftMaxActivation(__global TYPE *inputs,
                                __global TYPE *outputs,
                                const ulong total)
  {
   uint i = (uint)get_global_id(0);
   uint l = (uint)get_local_id(0);
   uint h = (uint)get_global_id(1);
   uint ls = min((uint)get_local_size(0), (uint)LOCAL_SIZE);

Then we will create a local array in which we allocate one element for each thread to sum up exponential values.

Note that OpenCL does not allow the creation of dynamic arrays. Therefore, the size of the local array must be determined before the program is compiled. So, we need to look for some kind of compromise. Excessive size leads to inefficient memory utilization. On the other hand, if the array size is too small, this limits the number of active parallel threads. Of course, solving such a task is much easier when you know the parameters of the device you are using and the architecture of the model. Therefore, we need a mechanism that allows us to easily and quickly change this parameter before compiling the program. The best solution for this is to use macro substitution, just like we did for the data type. In the code, we specify the LOCAL_SIZE constant, the value of which we assign in our constant file defines.mqh.
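As a minimal sketch, the corresponding definitions in defines.mqh might look as follows; the specific values here are assumptions and should be tuned to your hardware, and the macros are assumed to be substituted into the OpenCL source when the program is built:

#define LOCAL_SIZE   256        // size of the local array shared by a work-group
#define TYPE         float      // base data type of the OpenCL program
#define TYPE4        float4     // matching 4-component vector type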

Next, we organize a loop in which each thread sums its part of the exponential values. The resulting value is written to the corresponding element of the local array.

   __local TYPE temp[LOCAL_SIZE];
   uint count = 0;
   for(count = l; (count < total && l < ls); count += ls)
     {
      uint shift = h * total + count;
      temp[l] = (count > l ? temp[l] : 0) + exp(inputs[shift]);
     }
   barrier(CLK_LOCAL_MEM_FENCE);

After the loop operations are complete, we set a barrier that allows all the threads of the local group to be synchronized. This operation suspends the execution of each thread, waiting for all threads within the local group to complete their loop iterations.

After obtaining several partial sums computed by the individual threads, we need to combine them into a single sum. To do this, we will organize another loop. In its body, we sum the values of the local array in pairs: we divide the array into two equal halves, and each thread in the first half adds the value from the second half to its own element. In the next iteration, we halve the number of active threads and sum the values obtained in the previous iteration. The loop repeats until the sum of all elements is collected in the first element of the array.

Here we insert a barrier in the body of the loop because before the start of each subsequent iteration, all threads must finish the previous iteration.

   count = ls;
   do
     {
      count = (count + 1) / 2;
      temp[l] += (l < count && (l + count) < ls ? temp[l + count] : 0);
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);

After obtaining the sum of the exponentials of all the values in the buffer, we can calculate the value of each element after the activation function. This is what we will do in the next loop.

//---
   TYPE sum = temp[0];
   for(count = l; count < total; count += ls)
     {
      uint shift = h * total + count;
      outputs[shift] = exp(inputs[shift]) / (sum + 1e-37f);
     }
  }

A forward pass is followed by a backward pass. It begins with determining the error gradients at the output of the neural layer. This function is performed by the CalcOutputGradient kernel. In its parameters, the kernel receives pointers to three vectors: the first two, with target values and feed-forward results, are the input data, while the third is used to store the calculation results. The parameters also specify the loss function to be used. The kernel code completely repeats the algorithm of the corresponding method previously written in MQL5, including the branching of the algorithm depending on the loss function used.

__kernel void CalcOutputGradient(__global TYPE *target,
                                 __global TYPE *outputs,
                                 __global TYPE *gradients,
                                 int loss_function)
  {
   const int n = get_global_id(0);
   switch(loss_function)
     {
      case 0:
         gradients[n] = target[n] - outputs[n];
         break;
      case 1:
         gradients[n] = 2 * (target[n] - outputs[n]);
         break;
      case 2:
         gradients[n] = -target[n] /
                        (outputs[n] + 1e-37f) * log(outputs[n] + 1e-37f);
         break;
      case 3:
         gradients[n] = (target[n] - outputs[n]) /
                        (outputs[n] * (outputs[n] - 1) + 1e-37f);
         break;
      default:
         gradients[n] = target[n] - outputs[n];
         break;
     }
  }

The next step in our backpropagation process is to adjust the error gradient based on the derivative of the activation function. Recall that we have moved the activation function and all related processes to a separate class. Furthermore, each activation function has its own class. We will create a separate kernel for each activation function. Similarly, we will create a separate kernel for determining the derivative of each activation function.

When creating kernels for calculating derivative functions, we take into account the specific features of each of them. For example, the derivative of a linear activation function will always be its parameter a, and I see no reason to put this in a separate kernel. To adjust the error gradient to the derivative of this function, we can use the forward pass kernel by specifying 0 instead of the b parameter.

A similar situation arises with LReLU. A linear relationship is used here as well, but the slope coefficient depends on the value fed into the activation function. Therefore, we need to create a new kernel that receives in its parameters pointers to three data buffers and a leakage coefficient, whereas the forward pass kernel received only pointers to two buffers and the coefficient.

In the kernel body, we define the global thread identifier, which points to the element to be processed in the data buffers. Then we check the value of the corresponding element before the activation function: if it is greater than 0, we use a coefficient of 1; otherwise, we use the leakage factor. The error gradient obtained from the subsequent neural layer is multiplied by the selected coefficient, and the result of the operation is written to the corresponding element of the result buffer.

__kernel void LReLUDerivative(__global TYPE *outputs,
                              __global TYPE *output_gr,
                              __global TYPE *input_gr,
                              const TYPE a)
  {
   size_t i = get_global_id(0);
   input_gr[i] = (outputs[i] > 0 ? (TYPE)1 : a) * output_gr[i];
  }

The value of derivatives of S-shaped functions, such as the sigmoid and the hyperbolic tangent, can be easily calculated through the result of the activation function.
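For the function variants used here, the derivatives can be expressed through the output value $y$. For the sigmoid with parameters $a$ and $b$, where $y = \frac{a}{1+e^{-x}} - b$ and $z = y + b$:

$$\frac{dy}{dx} = z\left(1 - \frac{z}{a}\right)$$

For the hyperbolic tangent:

$$\frac{dy}{dx} = 1 - y^2$$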

It is this approach that we will embed into the kernel algorithm for calculating the derivatives of these functions. The general approaches to the organization of kernel work remain the same.

__kernel void SigmoidDerivative(__global TYPE *outputs,
                                __global TYPE *output_gr,
                                __global TYPE *input_gr,
                                const TYPE a, const TYPE b)
  {
   size_t i = get_global_id(0);
   if(a == 0)
      input_gr[i] = 0;
   else
     {
      TYPE z = clamp(outputs[i] + b, (TYPE)0, a);
      input_gr[i] = z * (1 - z / a) * output_gr[i];
     }
  }

__kernel void TanhDerivative(__global TYPE *outputs,
                             __global TYPE *output_gr,
                             __global TYPE *input_gr)
  {
   size_t i = get_global_id(0);
   input_gr[i] = (1 - pow(outputs[i], 2)) * output_gr[i];
  }

To calculate the derivative of the Swish function, we will need the values both before and after activation. The mathematical formula of the derivative is presented below.
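$$f(x) = x\,\sigma(\beta x), \qquad f'(x) = \beta f(x) + \sigma(\beta x)\bigl(1 - \beta f(x)\bigr)$$

where $\sigma$ denotes the sigmoid function.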

As you can see, the derivative of the function is expressed in terms of the activation function value, sigmoid values, and the activation function's parameter β. Substituting the sigmoid formula, we get the following expression.
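$$f'(x) = \beta f(x) + \frac{1 - \beta f(x)}{1 + e^{-\beta x}}$$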

To implement the formula mentioned, you need to pass pointers to four data buffers to the kernel: the corresponding values before and after activation, the error gradient from the subsequent layer, and the results buffer.

In the kernel body, calculate the derivative value using the provided formula and then multiply this value by the error gradient obtained from the subsequent layer. The result of the operation will be written to the corresponding element of the results buffer.

__kernel void SwishDerivative(__global TYPE *outputs,
                              __global TYPE *output_gr,
                              __global TYPE *input_gr,
                              const TYPE b,
                              __global TYPE *inputs)
  {
   size_t i = get_global_id(0);
   TYPE by = b * outputs[i];
   input_gr[i] = (by + (1 - by) / (1 + exp(-b * inputs[i]))) * output_gr[i];
  }

The kernel for calculating the derivative of the Softmax function is the most difficult one. As with the S-shaped functions, the values of the activation function itself are sufficient to calculate its derivative. However, to compute the value of one element of the vector during the feed-forward pass, we used the values of all elements of the vector before the activation function (when calculating the total sum of exponentials). Therefore, the value of each element after activation depends on all elements of the vector before activation. This means that each element before activation must receive its share of the error from every element at the output of the neural layer. In general, the derivative of the Softmax function is calculated using the formula below.
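$$\frac{\partial y_i}{\partial x_j} = y_i\,(\delta_{ij} - y_j), \qquad \delta_{ij} = \begin{cases}1, & i = j \\ 0, & i \neq j\end{cases}$$

Accordingly, the gradient collected by element $i$ at the input of the activation function is:

$$g^{in}_i = \sum_j y_j\,(\delta_{ij} - y_i)\,g^{out}_j$$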

In the parameters of the SoftMaxDerivative kernel, we will pass pointers to three data buffers: the values of the activation function, the error gradient from the next layer, and the result buffer.

In the kernel body, we define the global thread ID, which this time points only to an element in the results buffer. The global thread identifier in the second dimension is used when working with a matrix to which the Softmax function was applied row by row; in that case, this identifier helps us determine the offset to the analyzed data.

Next, we prepare two private variables: one to store the value of the corresponding element after activation, and the other to accumulate the total gradient error value.

After that, we organize a loop for collecting error gradients from all elements at the output of the neural layer. Each specific gradient is calculated using the above formula.

After completing the loop iterations, we will save the accumulated sum of gradients in the corresponding element of the results buffer.

__kernel void SoftMaxDerivative(__global TYPE *outputs,
                                __global TYPE *output_gr,
                                __global TYPE *input_gr)
  {
   size_t i = get_global_id(0);
   size_t outputs_total = get_global_size(0);
   size_t shift = get_global_id(1) * outputs_total;
   TYPE output = outputs[shift + i];
   TYPE result = 0;
   for(int j = 0; j < outputs_total; j++)
      result += outputs[shift + j] * output_gr[shift + j] *
                                    ((TYPE)(i == j ? 1 : 0) - output);
   input_gr[shift + i] = result;
  }

After adjusting the error gradient for the derivative of the activation function, we need to distribute the obtained values to the neurons of the previous layer. As mentioned above, here we allocate the threads according to the number of neurons in the lower neural layer: for each neuron, we collect the gradients from all the neurons that depend on it.
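In the notation of the kernel below, the gradient of neuron $n$ in the previous layer is assembled as the weighted sum of the gradients of all neurons $o$ in the current layer:

$$g^{(l-1)}_n = \sum_{o} w_{o,n}\, g^{(l)}_o$$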

The distribution of the error gradient through the neural layer is carried out in the CalcHiddenGradient kernel. Pointers to three arrays are passed to the kernel:

  • weights: a matrix of weights;
  • gradients: an array of gradients adjusted for the derivative of the activation function;
  • gradient_inputs: an array for recording the gradients of the preceding layer.

In addition, the parameters indicate the number of neurons in the upper layer (the size of the gradients array). The kernel construction algorithm is very similar to the forward pass method, as we also use the dot and ToVect4 functions. The difference lies in the arrays being used: during the forward pass, we took the input signal and multiplied it by the weights, whereas now we multiply the error gradient by the weights. There is one more point concerning the use of the ToVect4 function with the weight matrix. When we considered this function for the feed-forward pass, we noted the similar roles of the start parameter (the first element to copy) and the shift parameter, and we used a step of one element. Now, iterating over the array of gradients, we must select the matching weights. In the feed-forward pass, the weights of one neuron followed in order, while in the backward pass we take the weights across the weight matrix. In the vector representation of the weight matrix, the step between two copied elements is therefore one element more than the number of neurons in the previous layer (due to the bias element), while the shift equals the ordinal number of the processed neuron in the lower layer.

We do not specify the number of neurons in the lower neural layer in the parameters; instead, we obtain it with the get_global_size function, which returns the total number of launched kernel threads, and use it (plus the bias element) as the step for reading values from the weight matrix. Since we launched one thread for each neuron of the previous layer, the number of threads corresponds to the number of neurons in the layer. Here, we also calculate the number of elements in the weight matrix by multiplying the number of neurons in the current layer by the number of neurons in the previous layer plus the bias element.

In other respects, we also use vector operations that allow us to utilize a loop in steps of 4 elements.

__kernel void CalcHiddenGradient(__global TYPE *gradient_inputs,
                                 __global TYPE *weights,
                                 __global TYPE *gradients,
                                 int outputs_total)
  {
   const int n = get_global_id(0);
   const int inputs_total = get_global_size(0);
   int weights_total = (inputs_total + 1) * outputs_total;
//---
   TYPE grad = 0;
   for(int o = 0; o < outputs_total; o += 4)
      grad += dot(ToVect4(gradients, o, 1, outputs_total, 0),
                  ToVect4(weights, o, (inputs_total + 1), weights_total, n));
   gradient_inputs[n] = grad;
  }

If you look at the error backpropagation algorithm, we have already reached the finish line. After propagating the error gradient, all we have to do is update the weight matrix. However, we remember from the MQL5 implementation that the weight matrix will not be updated after each iteration. Again, the process of updating the weight matrix will be divided into 2 stages:

  1. Accumulating error gradients over a certain interval.
  2. Averaging of the accumulated gradient and adjustment of the weight matrix.

We will collect error gradients for each weight in the CalcDeltaWeights kernel. In the kernel parameters, pointers to the array of data from the output of the previous layer, the gradient array, and an array for accumulating deltas needed for weight adjustments are passed.

All weights are calculated independently, so we can run the error calculation for each weighting factor in a separate thread. To make the structure of the threads more visual and understandable, we will create a task space in two dimensions. The first dimension will be equal to the number of neurons in the current layer, and the second will be equal to the number of neurons in the previous layer.
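As an illustration of this task space, the host-side launch might look like the following sketch (MQL5; kernel_handle is an assumed name for the handle obtained earlier via CLKernelCreate):

uint offset[2] = {0, 0};
uint work[2];
work[0] = outputs_total;   // first dimension: neurons in the current layer
work[1] = inputs_total;    // second dimension: neurons in the previous layer
if(!CLExecute(kernel_handle, 2, offset, work))
   return(false);          // kernel launch failed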

In the kernel body, we determine the dimensions of the weight matrix from the task space, along with the position of the analyzed element in this matrix. After that, we determine the offset of this element in the result buffer.

Let's not forget that we accumulate the error gradient until the moment the weights are actually updated. Therefore, we add the product of the corresponding error gradient and the output element of the previous layer to the previously accumulated sum.

Note that we are using an additional bias element. For this element, a constant value of the incoming element equal to one is used. We didn't take it into account when creating the task space, but we must also accumulate an error gradient for it. From a mathematical point of view, the derivative of multiplication by 1 is equal to 1. This means that the error gradient for a given element is equal to the error gradient before the activation function of the corresponding neuron. To avoid duplicating the iteration for writing the bias weight error gradient, we will perform this iteration only for the thread with index 0 in the second dimension.

__kernel void CalcDeltaWeights(__global TYPE *inputs,
                               __global TYPE *delta_weights,
                               __global TYPE *gradients)
  {
   const int n = get_global_id(0);
   const int outputs_total = get_global_size(0);
   const int i = get_global_id(1);
   const int inputs_total = get_global_size(1);
//---
   TYPE grad = gradients[n];
   int shift = n * (inputs_total + 1);
   delta_weights[shift + i] = inputs[i] * grad + delta_weights[shift + i];
   if(i == 0)
      delta_weights[shift + inputs_total] += grad;
  }

Now, we need to organize the process of updating the matrix of weights. We have studied and already implemented several methods for updating weights in the main program. When implementing this process in MQL5, we created a dispatch method that redirected the logical chain of operations to the method corresponding to the selected method for updating the weights. Then, within these methods, we defined the device to perform the operations. To maintain the integrity of this approach, we have to create several kernels by analogy with the main program, implementing all the methods used to optimize the weight matrix.

All kernels were created using a single approach and therefore have many features in common. However, there are differences due to the specific features of each method. Let's start exploring kernels with the stochastic gradient descent method.

In the kernel parameters, we pass pointers to the matrix of accumulated gradients and the matrix of weights. In addition to pointers to matrices, the kernel parameters include the total number of elements in the weight matrix, the batch size for averaging, the learning rate, and regularization parameters.

As before, we will use vector operations, so the number of threads will be four times smaller than the size of the weight matrix. In the kernel body, we first define the offset to the working elements of our thread in the arrays and load them into vector variables. When reading the accumulated deltas, we immediately divide the obtained values by the batch size, which gives us the average gradient. After that, we adjust the weights for the regularization coefficients and for the average of the accumulated deltas, taking into account the learning rate.

In conclusion, we return the obtained values to the weight matrix and reset the array of accumulated deltas.
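In the notation of the kernel below, with batch size $B$, learning rate $\eta$, regularization parameters $\lambda_1$ and $\lambda_2$, and accumulated deltas $\delta$, the update can be written as:

$$w_t = w_{t-1} - (\lambda_1 + \lambda_2\,w_{t-1}) + \frac{\eta}{B}\,\delta$$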

__kernel void SGDUpdate(__global TYPE *delta_weights,
                        __global TYPE *weights,
                        int total,
                        int batch_size,
                        TYPE learningRate,
                        TYPE Lambda1,
                        TYPE Lambda2
                       )
  {
   int start = 4 * get_global_id(0);
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE lr = learningRate / ((TYPE)batch_size);
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   weights4 += (TYPE4)(lr) * delta4;
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

Next, we studied the accumulated momentum method, and we will create the kernels implementing the methods in the same sequence. The MomentumUpdate kernel algorithm is very similar to the stochastic gradient descent kernel discussed above. The main differences are an additional array for storing the accumulated momentum, the momentum-based update of the weights, and the momentum smoothing parameter.

In the kernel body, as in the previous method, we read the corresponding array values into vector variables, averaging the accumulated gradient as we do so. Then we adjust the weights for the regularization parameters. After that, we first update the momentum of the weight change, taking into account the average gradient and the previously accumulated momentum, and only then adjust the weights based on the updated momentum. Before exiting the kernel, we transfer the values of the vector variables to the corresponding elements of the weight and momentum matrices and reset the array of accumulated deltas to zero.
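In the same notation, with smoothing factor $\beta$ and momentum $v$, the kernel below implements:

$$v_t = \eta\,\frac{\delta}{B} + \beta\,v_{t-1}, \qquad w_t = w_{t-1} - (\lambda_1 + \lambda_2\,w_{t-1}) + v_t$$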

__kernel void MomentumUpdate(__global TYPE *delta_weights,
                             __global TYPE *weights,
                             __global TYPE *momentum,
                             int total, int batch_size,
                             TYPE learningRate,
                             TYPE beta,
                             TYPE Lambda1, TYPE Lambda2)
  {
   int start = 4 * get_global_id(0);
//---
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0) /
                                                      ((TYPE4)batch_size);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE4 momentum4 = ToVect4(momentum, start, 1, total, 0);
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   momentum4 = (TYPE4)(learningRate) * delta4 + (TYPE4)(beta) * momentum4;
   weights4 += momentum4;
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(momentum, momentum4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

The AdaGrad optimization method, like the accumulated momentum method, uses a single array to accumulate moments. But unlike the previous method, we will sum up the squares of the gradients and there is no smoothing coefficient.

The approach to the use of the accumulated moment has also changed. In the previous method, we used accumulated momentum to reduce the randomness of updates and maintain smooth movements in the direction of the anti-gradient. Now, in the adaptive gradient method, the accumulated square of gradients is used to decrease the learning rate with each iteration. This is reflected in the kernel code below.
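With $\bar g = \delta/B$ denoting the averaged gradient and $G$ the accumulated sum of its squares, the kernel below implements:

$$G_t = G_{t-1} + \bar g^2, \qquad w_t = w_{t-1} - (\lambda_1 + \lambda_2\,w_{t-1}) + \frac{\eta\,\bar g}{\sqrt{G_t + \epsilon}}$$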

__kernel void AdaGradUpdate(__global TYPE *delta_weights,
                            __global TYPE *weights,
                            __global TYPE *momentum,
                            int total, int batch_size,
                            TYPE learningRate,
                            TYPE Lambda1, TYPE Lambda2)
  {
   int start = 4 * get_global_id(0);
//---
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0) /
                                                      ((TYPE4)batch_size);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE4 momentum4 = ToVect4(momentum, start, 1, total, 0);
//---
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   momentum4 = momentum4 + pow(delta4, 2);
   weights4 += learningRate * delta4 / sqrt(momentum4 + 1.0e-37f);
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(momentum, momentum4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

The main problem of the adaptive gradient method is the constant accumulation of the gradient square. With long-term training, this can lead to a decrease in the learning rate to zero and a practical stop in the training of the neural network. This problem is solved in RMSProp by introducing a smoothing factor for accumulating gradient squares. This allows us to limit the growth of the accumulated sum of squares of gradients and thereby limit the decrease in the learning rate.
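In the same notation, with smoothing factor $\beta$, the kernel below implements:

$$G_t = \beta\,G_{t-1} + (1-\beta)\,\bar g^2, \qquad w_t = w_{t-1} - (\lambda_1 + \lambda_2\,w_{t-1}) + \frac{\eta\,\bar g}{\sqrt{G_t} + \epsilon}$$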

Otherwise, the kernel algorithm repeats the previously considered methods for updating weights.

__kernel void RMSPropUpdate(__global TYPE *delta_weights,
                            __global TYPE *weights,
                            __global TYPE *momentum,
                            int total, int batch_size,
                            TYPE learningRate,
                            TYPE beta,
                            TYPE Lambda1, TYPE Lambda2)
  {
   int start = 4 * get_global_id(0);
//---
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0) /
                                                      ((TYPE4)batch_size);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE4 momentum4 = ToVect4(momentum, start, 1, total, 0);
//---
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   momentum4 = beta * momentum4 + (1 - beta) * pow(delta4, 2);
   weights4 += delta4 * learningRate / (sqrt(momentum4) + 1.0e-37f);
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(momentum, momentum4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

In the AdaDelta optimization method, the authors tried to eliminate the learning rate. The price for this was the introduction of an additional momentum buffer and a second smoothing factor.

The method uses two exponential averages. The first averages the squared values of the corresponding weight, while the second, as in the two previous methods, averages the squared gradients for this weight. Instead of the learning rate, the ratio of the square roots of the two specified averages is used. As a result, we obtain a method in which an increase in the absolute value of the weight increases the update step, while an increase in the absolute value of the error gradient decreases it.
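In the notation of the kernel below, with smoothing factors $\beta_1$ and $\beta_2$:

$$E[w^2]_t = \beta_1\,E[w^2]_{t-1} + (1-\beta_1)\,w^2$$

$$E[g^2]_t = \beta_2\,E[g^2]_{t-1} + (1-\beta_2)\,\bar g^2$$

$$w_t = w_{t-1} - (\lambda_1 + \lambda_2\,w_{t-1}) + \bar g\,\frac{\sqrt{E[w^2]_t}}{\sqrt{E[g^2]_t} + \epsilon}$$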

Let's consider the implementation of this method in the AdaDeltaUpdate kernel. In the parameters, pointers to four data arrays are passed to the kernel:

  • delta_weights: an array of accumulated error gradients;
  • weights: a matrix of weights;
  • momentumW: a matrix of exponential means of the squared weights;
  • momentumG: a matrix of exponential means of the squared error gradients.

In addition to the pointers to arrays, the kernel parameters include the size of the arrays, batch size, two exponential smoothing coefficients, and regularization parameters.

In the kernel body, we define the shift to the elements to be processed in this thread and read the necessary elements from the arrays into vector variables for further processing. The next step is to adjust the weights based on regularization parameters and update the weight moments and gradients. After that, we will update the weights themselves. Finally, write the new values to the data arrays and reset the array of accumulated gradients.

__kernel void AdaDeltaUpdate(__global TYPE *delta_weights,
                             __global TYPE *weights,
                             __global TYPE *momentumW,
                             __global TYPE *momentumG,
                             int total, int batch_size,
                             TYPE beta1, TYPE beta2,
                             TYPE Lambda1, TYPE Lambda2)
  {
   int start = 4 * get_global_id(0);
//---
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0) /
                                                      ((TYPE4)batch_size);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE4 momentumW4 = ToVect4(momentumW, start, 1, total, 0);
   TYPE4 momentumG4 = ToVect4(momentumG, start, 1, total, 0);
//---
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   momentumW4 = beta1 * momentumW4 + (1 - beta1) * pow(weights4, 2);
   momentumG4 = beta2 * momentumG4 + (1 - beta2) * pow(delta4, 2);
   weights4 += delta4 * sqrt(momentumW4) / (sqrt(momentumG4) + 1.0e-37f);
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(momentumW, momentumW4, start, 1, total, 0);
   D4ToArray(momentumG, momentumG4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

The last method we studied was the Adam adaptive moment estimation method. Below are the mathematical formulas of this method.
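In the notation used throughout this section, where $\bar g_t$ is the batch-averaged gradient:

$$m_t = \beta_1\,m_{t-1} + (1-\beta_1)\,\bar g_t$$

$$v_t = \beta_2\,v_{t-1} + (1-\beta_2)\,\bar g_t^2$$

$$\hat m_t = \frac{m_t}{1-\beta_1}, \qquad \hat v_t = \frac{v_t}{1-\beta_2}$$

$$w_t = w_{t-1} + \eta\,\frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Note that the classic formulation raises the smoothing factors in the bias-correction denominators to the power of the iteration number; the kernel below uses the simplified form shown here.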

Compared to the methods discussed earlier, the formulas may appear more complex, but there is nothing daunting about them, and the method is quite implementable. Like AdaDelta, the method uses two buffers to accumulate moments: in the first, we accumulate the momentum of the gradients, and in the second, the momentum of the squared gradients. Both buffers use exponential smoothing, but each has its own smoothing factor. In addition, the learning rate makes its return in this method.

Let's consider the implementation of the method in the AdamUpdate kernel. In the kernel parameters, we will pass pointers to the data arrays:

  • delta_weights: accumulated gradient deltas;
  • weights: a matrix of weights;
  • momentumM: a matrix of accumulated gradients;
  • momentumV: a matrix of accumulated gradient squares.

We will also pass the size of the arrays, batch size, learning rate, smoothing coefficients, and regularization parameters in the kernel parameters.

At the beginning of the kernel, as in the implementations of the previous optimization methods, we define the shift to the processed elements in the arrays. To avoid unnecessary confusion, we use the elements of all arrays synchronously: the arrays have the same size, and the shift to the corresponding elements is identical.

Let's copy the processed array elements into vector variables. As always, we will immediately divide the accumulated deltas by the batch size and store the arithmetic mean of the error gradient.

Next, we calculate the updated values of the accumulated moments and adjust them, as proposed by the authors of the method.

After the preparatory work, we adjust the weights. As before, we first apply the regularization parameters and then take a step according to the rules of the optimization method: the weight changes by the product of the learning rate and the ratio of the first moment of the gradient to the square root of its second moment. Since the accumulated deltas already point in the direction of error reduction, the kernel adds this step to the current weight.

At the end of the kernel, we will save the obtained values to the corresponding arrays. Do not forget to reset the array of accumulated deltas.

__kernel void AdamUpdate(__global TYPE *delta_weights,
                         __global TYPE *weights,
                         __global TYPE *momentumM,
                         __global TYPE *momentumV,
                         int total, int batch_size,
                         TYPE learningRate,
                         TYPE beta1, TYPE beta2,
                         TYPE Lambda1, TYPE Lambda2)
  {
   int start = 4 * get_global_id(0);
//---
   TYPE4 delta4 = ToVect4(delta_weights, start, 1, total, 0) /
                                                      ((TYPE4)batch_size);
   TYPE4 weights4 = ToVect4(weights, start, 1, total, 0);
   TYPE4 momentumM4 = ToVect4(momentumM, start, 1, total, 0);
   TYPE4 momentumV4 = ToVect4(momentumV, start, 1, total, 0);
//---
   momentumM4 = beta1 * momentumM4 + (1 - beta1) * delta4;
   momentumV4 = beta2 * momentumV4 + (1 - beta2) * pow(delta4, 2);
   TYPE4 m = momentumM4 / (1 - beta1);
   TYPE4 v = momentumV4 / (1 - beta2);
   weights4 -= (TYPE4)(Lambda1) + Lambda2 * weights4;
   weights4 += learningRate * m / (sqrt(v) + 1.0e-37f);
   D4ToArray(weights, weights4, start, 1, total, 0);
   D4ToArray(momentumM, momentumM4, start, 1, total, 0);
   D4ToArray(momentumV, momentumV4, start, 1, total, 0);
   D4ToArray(delta_weights, (TYPE4)(0), start, 1, total, 0);
  }

At this stage, we have completed the work on writing the OpenCL program. Of course, we will return to this work when implementing other architectural solutions for neural layers. However, in terms of a fully connected neural layer, this work can be considered complete. We save the code we've written and move on to implementing the processes of data exchange between the main program and the OpenCL kernels, as well as the functions for invoking the kernels. We have to do this work on the side of the main program.