Organizing parallel computing in the attention block

In the previous sections, we built a working attention block algorithm using standard MQL5 language capabilities. You can now add an attention block to your model and test the quality of the Self-Attention mechanism. However, look at the block structure: in its operation, we used five internal layers and created an algorithm for transferring data between them in both the forward and backward directions. It is also important to note that each element of the sequence, described by a value vector, is processed with shared weight matrices, but independently of the other elements. This makes it easy to distribute the operations across parallel threads and thereby complete the full set of operations in less time. Recall that from the very beginning we decided to build our library so that it could use both technologies, giving users the opportunity to independently test and choose the most suitable technology for their specific use case.

As before, we organize the parallel computing unit using OpenCL. To use this technology, we will need to complete two stages of work:

  • Create an OpenCL program
  • Make changes to the main program

We will add the OpenCL program code to the previously created file opencl_program.cl. It is in this file that we have collected all the kernels of the OpenCL program used by the previously created classes. To organize the operation of our attention class, we will need to create six kernels, in which we will organize the flow of information between the internal neural layers in both the forward and backward directions.

First, we'll create the AttentionFeedForward kernel. Below is a brief recap of the sequence of operations during the feed-forward pass through the Self-Attention block; a compact matrix form of steps 2 through 4 is given right after the list:

  1. The source data is fed into three internal convolutional neural layers: m_cQuerys, m_cKeys, and m_cValues.

  2. The m_cQuerys and m_cKeys result tensors are multiplied to obtain the m_cScores dependency matrix.

  3. The values of the m_cScores matrix are divided by the square root of the size of the description vector of one element of the m_cKeys sequence and normalized row-wise (in terms of the m_cQuerys queries) with the Softmax function.

  4. The normalized m_cScores matrix is multiplied by the m_cValues result tensor to obtain the Self-Attention results.

  5. The results of the Self-Attention block are added to the original data and normalized.

  6. The obtained tensor serves as the input data for a block of two convolutional layers: m_cFF1 and m_cFF2.
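
In compact matrix form, steps 2 through 4 can be summarized by the classic scaled dot-product attention formula:

$$Attention(Q, K, V) = softmax\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$

where Q, K, and V are the result tensors of m_cQuerys, m_cKeys, and m_cValues, d_k is the size of the key vector of one sequence element, and Softmax is applied row-wise.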

Points 1 and 6 are covered by using the previously discussed convolutional layer class, which already implements a multi-threaded computation block. So, we will need to implement the remaining points in a new kernel.

To organize the specified operations, we will need to pass five data buffers and two parameters to the kernel. To make the program code more readable, the names of the buffers and variables are aligned with the names of the corresponding matrices in the algorithm description.

__kernel void AttentionFeedForward(__global TYPE *querys,
                                   __global TYPE *keys,
                                   __global TYPE *scores,
                                   __global TYPE *values,
                                   __global TYPE *outputs,
                                   int window,
                                   int key_size)
  {

As you may have noticed from the description of the Self-Attention algorithm, the primary analytical unit in this method is an element of the sequence, described by a value vector. For language models, this is usually a word. In the case of financial market analysis, we use a bar. It is precisely between these elements of the sequence that the coefficients of mutual dependencies are determined. Taking into account these coefficients, the values of the element description vectors are adjusted. Therefore, it is quite logical to divide the operations into threads based on the elements of the sequence.

Therefore, in the body of the kernel, the first thing we will do is determine the element of the sequence being analyzed based on the identifier of our thread. At the same time, the total number of running threads will indicate the number of elements in the sequence. Here, we will also immediately determine the offset in the query tensor and the dependency coefficient matrix to the first analyzed value.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   int shift_query = key_size * q;
   int shift_scores = units * q;
   TYPE summ = 0;

To normalize data with the Softmax function, we need the sum of the exponents of all normalized values. To calculate it, we add a variable with an initial zero value.

After completing the preparatory work, we will calculate the values of the one row of the dependency coefficient matrix that relates to the analyzed element of the sequence. For this, we create a loop with the number of iterations equal to the number of elements in the sequence. In the body of the loop, we will multiply the Query vector of the analyzed sequence element by each vector of the Key tensor in turn. For each vector product, we will take the exponential value and write it into the corresponding element of the Score matrix. We will also add this value to our accumulator variable summ for the subsequent normalization.
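
Taken together with the normalization loop that follows, this computes, for the analyzed element q, one row of the dependency coefficient matrix; in formula form it can be written as

$$Score_{q,s} = \frac{\exp\left(\dfrac{Q_q \cdot K_s}{\sqrt{d_k}}\right)}{\sum_{j} \exp\left(\dfrac{Q_q \cdot K_j}{\sqrt{d_k}}\right)}$$

where d_k is the size of the key vector (key_size) and the sum over j runs over all elements of the sequence.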

   for(int s = 0; s < units; s++)
     {
      TYPE score = 0;
      int shift_key = key_size * s;
      for(int k = 0; k < key_size; k++)
         score += querys[shift_query + k] * keys[shift_key + k];
      score = exp(score / sqrt((TYPE)key_size));
      summ += score;
      scores[shift_scores + s] = score;
     }

After the loop completes, our variable summ will accumulate the sum of all elements of our vector from the dependency coefficients tensor. To complete the normalization of the given vector values, all we have to do is divide the value of each of its elements by the total sum of all the values of the vector.

   for(int s = 0; s < units; s++)
      scores[shift_scores + s] /= summ;

In the analyzed vector, we obtained the coefficients of dependencies of the analyzed element of the sequence on the rest of its elements. The sum of all coefficients will be equal to one.

Next, according to the algorithm, we need to multiply each vector of the Value tensor by the corresponding element of the resulting vector of dependency coefficients. The resulting vectors need to be added up. The final vector of values will be the result of the Self-Attention block.

Before passing the data further, we need to add the obtained data to the tensor of input data and normalize them. In the body of the kernel, I propose focusing on determining the results of the Self-Attention block. It will be more efficient to perform matrix addition and data normalization separately across the entire neural layer.

Let's look at the implementation of such a solution. To avoid recalculating at each iteration, we first determine the offset in the tensors of the initial data and results. The tensors have the same dimension, so the offset will be the same for both cases. Then, we will set up a system of two nested loops: in the outer loop, we will iterate over the elements of the vector of the analyzed element of the sequence, and in the inner loop, we will perform the actual computation of the values for each element of the result vector. For this purpose, the number of iterations in the inner loop will be equal to the number of elements in the sequence. In the body of this loop, we will multiply the values of the Value tensor elements by the corresponding dependency coefficients from the Score matrix. We will accumulate the resulting products in the local variable query. After completing the iterations of the inner loop, we will write the result into the corresponding element of the result tensor.

   shift_query = window * q;
   for(int i = 0; i < window; i++)
     {
      TYPE query = 0;
      for(int v = 0; v < units; v++)
         query += values[window * v + i] * scores[shift_scores + v];
      outputs[shift_query + i] = query;
     }
  }

With this, we complete the work on the first feed-forward kernel. The next step is to create a kernel for adding two tensors. Sometimes it is more economical to do such work with matrix operations on the main program side: the operation is trivial, and the overhead of transferring data to the OpenCL context would not be justified. Here we have the opposite situation. We organize the entire process on the OpenCL context side, and all the information is already in the context memory, so performing the operation on the main program side would require copying the data. No transfer is needed if the computation is performed within the context. Therefore, we created a kernel called Sum, in which we simply add the elements of two buffers with the same index and store the result in the element of a third buffer with the same index.

__kernel void Sum(__global TYPE *inputs1,
                  __global TYPE *inputs2,
                  __global TYPE *outputs)
  {
   const int n = get_global_id(0);
//---
   outputs[n] = inputs1[n] + inputs2[n];
  }

The data normalization process has a more complex architecture. As you know, its process is expressed by the following mathematical formulas:
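
In this implementation, which uses no trainable scale and offset parameters, the normalization can be written as

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}, \qquad y_i = \frac{x_i - \mu}{\sigma}$$

where n is the total number of elements in the normalized buffer.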

As you can notice, to calculate the normalized value of each element in the sequence, you need the arithmetic mean and the root mean square deviation of the entire sequence. To calculate them, we need to organize data exchange between individual threads. We will solve this problem in a way similar to the multi-threaded implementation of the Softmax activation function, that is, via an array in local memory. We will need two blocks that sum values across the entire vector, because we cannot compute the variance before calculating the arithmetic mean, and we cannot calculate the normalized values until we determine the variance.

The normalization process is organized in the LayerNormalize kernel. In the parameters, the kernel receives pointers to 3 buffers:

  • Source data buffer
  • Results buffer
  • Buffer for recording standard deviation parameters

We need the last buffer, for the standard deviations, in order to save these values and pass them to the backpropagation kernel.

Additionally, we will pass two parameters to the kernel: the total number of elements in the buffer being normalized and the offset in the buffer of root mean square deviations. Let me remind you that within one attention neural layer we perform data normalization twice: we normalize the results of the Self-Attention block and of the FeedForward block.

__kernel void LayerNormalize(__global TYPE *inputs,
                             __global TYPE *outputs,
                             __global TYPE *stds,
                             const int total,
                             const int std_shift)
  {

In the kernel body, we define thread identifiers and initialize a local data array.

   uint i = (uint)get_global_id(0);
   uint l = (uint)get_local_id(0);
   uint ls = min((uint)get_local_size(0), (uint)LOCAL_SIZE);
   __local TYPE temp[LOCAL_SIZE];

First, we will determine the arithmetic mean of the buffer elements. To do this, we organize a loop in which each thread sums its share of the buffer values and stores the result in its own element of the local array. Since we are calculating the arithmetic mean of the entire buffer, we divide the obtained partial sum by the number of elements in the buffer.

   uint count = 0;
   do
     {
      uint shift = count * ls + l;
      temp[l] = (count > 0 ? temp[l] : 0) + (shift < total ? inputs[shift] : 0);
      count++;
     }
   while((count * ls + l) < total);
   temp[l] /= (TYPE)total;
   barrier(CLK_LOCAL_MEM_FENCE);

We will synchronize the work of threads using the barrier function. Since the calculations of the threads do not overlap, we only need one barrier at the end of the block.

Next, we need to combine the partial sums into a single value. We will organize another loop, in which we collect the arithmetic mean of the buffer into the element of the local array with index 0. The result is then saved in a private variable.

   count = ls;
   do
     {
      count = (count + 1) / 2;
      temp[l] += (l < count ? temp[l + count] : 0);
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);
//---
   TYPE mean = (TYPE)temp[0];

I would like to draw your attention once again to the arrangement of barriers. Here you need to pay special attention to the program logic: every thread must reach each barrier, and all threads must reach the barriers in the same order.

After determining the arithmetic mean, we repeat the loops and calculate the standard deviation.

   count = 0;
   do
     {
      uint shift = count * ls + l;
      temp[l] = (count > 0 ? temp[l] : 0) + (shift < total ? (TYPE)pow(inputs[shift] - mean, 2) : 0);
      count++;
     }
   while((count * ls + l) < total);
   temp[l] /= (TYPE)total;
   barrier(CLK_LOCAL_MEM_FENCE);

   count = ls;
   do
     {
      count = (count + 1) / 2;
      temp[l] += (l < count ? temp[l + count] : 0);
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);
//---
   TYPE std = (TYPE)sqrt(temp[0]);
   if(l == 0)
      stds[std_shift] = std;

We save the obtained standard deviation into a buffer. To avoid simultaneous writes by all threads, we will save the value in only one thread. To achieve this, we will perform a thread index check before the operation of writing a value to the buffer.

Now that we have calculated the averages, we can normalize the original data. It's important to note that the limitation of the workgroup size may not allow us to allocate a separate thread for each element of the input data buffer. Therefore, we will also implement data normalization in a loop.

   count = 0;
   while((count * ls + l) < total)
     {
      uint shift = count * ls + l;
      outputs[shift] = (inputs[shift] - mean) / (std + 1e-37f);
      count++;
     }
  }

This concludes our work with the feed-forward kernels. Continuing our additions to the OpenCL program, we move on to building the backward pass. Its algorithm completely mirrors the path taken above, but in the reverse direction: we have to propagate the error gradient from the output of the Self-Attention block to the internal neural layers m_cQuerys, m_cKeys, and m_cValues.

The simplest seems to be the calculation of the error gradient for the internal neural layer, m_cValues. Let me remind you that to obtain the result of the Self-Attention block, we multiplied the matrix of dependency coefficients m_cScores by the tensor of the results of the neural layer m_cValues. Therefore, to obtain the error gradient at the output level of the specified neural layer, we need to multiply the gradient obtained from previous operations by the derivative of the last operation. In this case, we have to multiply the matrix of dependency coefficients by the tensor of error gradients from previous operations.

After determining the error gradient on the internal neural layer m_cValues, we need to distribute the error gradient to two more internal neural layers, m_cQuerys and m_cKeys. However, in order to bring the error gradient to the level of the specified neural layers, it is necessary to pass it through the matrix of dependency coefficients.

When implementing this in MQL5, we did not create an additional buffer for error gradients at the level of the dependency coefficient matrix. In OpenCL, however, there is no way to allocate a dynamic array for recording the intermediate error gradient values at the level of the dependency coefficient matrix. Therefore, here we will create two temporary data buffers: the first for the error gradients of the normalized coefficients, and the second for the error gradients corrected by the derivative of the Softmax function.

Note that when we recalculate the error gradient to the level of the m_cQuerys and m_cKeys neural layers, the same elements of the error gradient matrix of the dependency coefficients are used in different operation threads. Therefore, we will divide the entire backpropagation algorithm within the attention layer into two blocks. In the first block, we will propagate the error gradient to the level of the internal m_cValues neural layer and of the m_cScores coefficient matrix. In the second block, we will propagate the error gradient to the two other neural layers: m_cQuerys and m_cKeys.

We implement the first block of operations in the AttentionCalcScoreGradient kernel. In the parameters of this kernel, we pass pointers to six data buffers and one parameter:

  • scores — dependency coefficient matrix buffer
  • scores_temp — buffer of error gradients at the level of the normalized dependency coefficient matrix
  • scores_grad — buffer of error gradients at the level of the dependency coefficient matrix, adjusted for the derivative of the normalization function
  • values — buffer of the Values tensor (results buffer of the m_cValues neural layer)
  • values_grad — buffer of error gradients at the level of the results of the m_cValues neural layer
  • outputs_grad — buffer of error gradients at the output of the Self-Attention block
  • window — size of the description vector of one element of the sequence in the Values tensor

Please note that the scores_temp and scores_grad buffers have no counterparts on the main program side. The reason is that we only need error gradients at the level of the dependency coefficient matrix to perform the operations of the current backward pass. However, OpenCL does not have the ability to create dynamic arrays. We created the specified buffers instead.

__kernel void AttentionCalcScoreGradient(__global TYPE *scores,
                                         __global TYPE *scores_grad,
                                         __global TYPE *values,
                                         __global TYPE *values_grad,
                                         __global TYPE *outputs_grad,
                                         __global TYPE *scores_temp,
                                         int window)
  {

The feed-forward algorithm involves normalizing the dependency coefficient matrix Score with the Softmax function in the context of Query requests. So, after determining the error gradients at the coefficient matrix level, it is necessary to adjust these values based on the derivative of the data normalization operation. Therefore, it would be logical to divide the operations into threads in the same manner. Moreover, such a distribution of operations into threads would be entirely appropriate for propagating the error gradient to the level of values within the internal neural layer.

At the beginning of the kernel, we do a little preparatory work. From the thread identifier, we determine the ordinal number of the analyzed value vector and of the corresponding row of the dependency coefficient matrix. The total number of running threads tells us the dimensions of the tensors. We also immediately determine the offsets in the data buffers to the first elements of the analyzed vectors.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   int shift_value = window * q;
   int shift_score = units * q;

Next, we will propagate the error gradient to the level of the internal neural layer m_cValues. As mentioned above, to determine the error gradient, we need to multiply the transposed matrix of dependency coefficients by the gradient tensor at the output of the Self-Attention block.
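
In matrix form, this step can be written as follows, where S is the normalized Score matrix and G_out is the tensor of error gradients at the output of the Self-Attention block:

$$\nabla Values = S^{T} \cdot G_{out}$$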

Within the kernel, we will define the error gradient for only one vector of element description. As you know, with a feed-forward pass, each element of the sequence in the Value tensor leaves its mark in the formation of all elements of the sequence of results of the Self-Attention block. Consequently, each element of the Value tensor must receive its share of the error gradient from all elements of the results tensor of the Self-Attention block. The measure of influence will be the corresponding dependence coefficient from the Score matrix. Thus, each element of the sequence of the Value tensor corresponds to one column in the dependency coefficient matrix Score. This explains the use of the transposed Score matrix in the formula above.

To organize this process, we will create a system of two nested loops. The number of iterations in the first loop is equal to the size of the vector describing one element of the sequence in the Value tensor. It should be noted that the error gradient tensor at the output of the Self-Attention block has the same dimensions. In the nested loop with a number of iterations equal to the number of elements in the sequence, we will iterate over the values of the corresponding column of the dependency coefficient matrix Score and the gradient vector of errors at the level of the Self-Attention block results. In this case, we will multiply the corresponding elements and sum the resulting products into a private variable. After completing the iterations of the inner loop, copy the accumulated sum of products to the error gradient buffer of the internal convolutional layer m_cValues.

//--- Distributing the gradient on Values
   for(int i = 0; i < window; i++)
     {
      TYPE grad = 0;
      for(int g = 0; g < units; g++)
         grad += scores[units * g + q] * outputs_grad[window * g + i];
      values_grad[shift_value + i] = grad;
     }

After the loop system completes, the first part of our task, propagating the error gradient to the internal neural layer m_cValues, can be considered complete.

The second part of our kernel is devoted to determining the error gradient at the level of the dependency coefficient matrix.

In the feed-forward pass, each element of the Query sequence is multiplied with all the elements of the Key sequence to form a single dependency coefficient matrix vector Score. Each such vector is normalized by the function Softmax. After that, we multiply it by the Value tensor. As a result of these operations, we obtain the corrected vector representation of one element of the sequence in the tensor of the Self-Attention block results. Thus, one element of the Query sequence interacts with all elements of the Key and Value tensors to form a vector describing one element of the result sequence. Therefore, to distribute the error gradient to a specific vector from the Query tensor, we need to take one corresponding error gradient vector of one element of the sequence at the level of the Self-Attention block and first multiply it by the transposed tensor of Value. Thus, we obtain an error vector at the level of the dependency coefficient matrix Score. Next, we need to adjust the resulting vector to the derivative of the Softmax function. It is this part of the error gradient distribution that we implement in this kernel. To further propagate the error gradient to the level of the internal neural layers m_cQuerys and m_cKeys, we will create another kernel a little later.

The error gradient distribution algorithm described above in matrix form can be represented as follows:

  1. Error gradient at the Score matrix level.

  2. Adjusting the error gradient for the derivative of the Softmax function.

Let's summarize the entire calculation into one formula:
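
In the same notation (S — the normalized Score matrix, G_out — the gradient at the Self-Attention output, V — the Values tensor, I — the identity matrix), the two steps for row q of the Score matrix can be written as

$$G_{S} = G_{out} \cdot V^{T}, \qquad G'_{S,\,qk} = \sum_{i} S_{qi}\left(I_{ik} - S_{qk}\right)G_{S,\,qi}$$

and, combined into one formula,

$$G'_{S,\,qk} = \sum_{i} S_{qi}\left(I_{ik} - S_{qk}\right)\left(G_{out}\,V^{T}\right)_{qi}$$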

First, let's propagate the error gradient to the level of the dependency coefficient matrix Score. Thanks to the division of operations into parallel threads within the kernel, we determine the error gradient for only one row. To calculate this error gradient vector, we take the error gradient vector of one element of the sequence at the level of the Self-Attention block results and multiply it by the transposed matrix of the results of the internal layer m_cValues. In practice, we will use the algorithm described above when calculating the error gradients for the m_cValues layer: we create a system of two nested loops. This time, however, the number of iterations of the outer loop is equal to the number of elements in the sequence, while the nested loop repeats its operations for the number of elements in the vector describing one sequence element. This difference follows from the size of the resulting vector and from the logic of the operations performed: remember that in the forward pass, each element in a row of the dependency coefficient matrix corresponds to one description vector of a sequence element in the Values tensor.

//--- Gradient distribution on Score
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      for(int i = 0; i < window; i++)
         grad += outputs_grad[shift_value + i] * values[window * k + i];
      scores_temp[shift_score + k] = grad;
     }

After transferring the error gradient to the level of the dependency coefficient matrix, we need to adjust the obtained values using the derivative of the Softmax normalization function. Just as in the forward pass, where obtaining one normalized value required processing the entire vector being normalized, calculating one adjusted value requires all the elements of both vectors: the error gradients at the level of the dependency coefficient matrix and the normalized coefficient vector itself.

The matrix expression of the adjustment for the derivative of the Softmax function is given above. For practical implementation, we create a system of two nested loops. Both loops have the same number of iterations, equal to the size of the vector being normalized, which in this case is the number of elements in the sequence. During the operations, we need to accumulate the sum of error gradients coming from each element of the normalized vector. For this, we create a private variable grad in the body of the outer loop. In addition, to reduce the number of accesses to global memory, we store the repeatedly used element in the private variable score. Let me remind you that accessing global memory is more time-consuming, so by reducing the number of accesses to global memory buffers, we reduce the overall time spent on operations. In the body of the nested loop, we multiply the elements and add the resulting products to the previously created private variable grad.

Please note that we have replaced the identity matrix with the expression (int)(i==k). The logical expression will give us the true value only on the diagonal of the matrix. Translating a boolean value into an integer will substitute 1 for true values and 0 for false values. Thus, such a short notation allows us to obtain the values of the identity matrix directly in the operation thread, without the need to first generate and save it.

//--- Adjust for the Softmax derivative
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      TYPE score = scores[shift_score + k];
      for(int i = 0; i < units; i++)
         grad += scores[shift_score + i] *
                     ((int)(i == k) - score) * scores_temp[shift_score + i];
      scores_grad[shift_score + k] = grad;
     }
  }

After completing the iterations of the loop system, we will obtain the error gradients at the level of the dependency coefficient matrix, adjusted for the derivative of the Softmax function.

With that, we conclude the first backpropagation kernel and move on to creating the second kernel AttentionCalcHiddenGradient, in which we will propagate the error gradient to the internal neural layers m_cQuerys and m_cKeys. To do this, in the kernel parameters we need to pass pointers to five data buffers and one constant:

  • querys — buffer of results of the internal neural layer m_cQuerys
  • querys_grad — buffer of error gradients of the internal neural layer m_cQuerys
  • keys — buffer of results of the internal neural layer m_cKeys
  • keys_grad — buffer of error gradients of the internal neural layer m_cKeys
  • scores_grad — buffer of error gradients of dependency coefficient matrix m_cScores
  • key_size — size of the key vector of one element

__kernel void AttentionCalcHiddenGradient(__global TYPE *querys,
                                          __global TYPE *querys_grad,
                                          __global TYPE *keys,
                                          __global TYPE *keys_grad,
                                          __global TYPE *scores_grad,
                                          int key_size)
  {

Following the analogy with all the kernels discussed earlier, we will distribute the operations into threads in the context of a single element of the sequence. At the beginning of the kernel, we will perform preparatory work and determine the offsets in the data buffers to the first element of the vector of the analyzed element.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   int shift_query = key_size * q;
   int shift_score = units * q;

In the AttentionCalcScoreGradient kernel discussed above, we have already adjusted the error gradient of the dependency coefficient matrix to the derivative of the Softmax normalization function. However, during the feed-forward pass, before normalizing the matrix, we divided all its elements by the square root of the dimension of the key vector. Now we need to adjust the error gradient for the derivative of the mentioned operation. Similar to the multiplication operation, we will need to divide all the values of the error gradient buffer of the dependency coefficient matrix by the same constant.

Let's determine the value of the constant and store it in a private variable.

//--- Distribute the gradient on Querys and Keys
   const TYPE k = 1 / sqrt((TYPE)key_size);

This concludes the preparatory work. Now we can proceed directly to recalculating the error gradients. To obtain dependency coefficients, we multiplied two tensors (Query and Keys). We have already encountered derivatives of multiplication operations more than once. To obtain error gradients for one of the tensors, we need to multiply the error gradient tensor at the level of the dependency coefficient matrix by the second tensor. Since the Query and Key tensors have the same dimensions, we can calculate the error gradients for both tensors in the same loop system.
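
In matrix form, and taking the correction factor into account, the gradients computed in this kernel can be written as

$$\nabla Q = \frac{1}{\sqrt{d_k}}\;G'_{S} \cdot K, \qquad \nabla K = \frac{1}{\sqrt{d_k}}\;G'^{T}_{S} \cdot Q$$

where G'_S is the adjusted gradient of the dependency coefficient matrix computed by the previous kernel and d_k is key_size.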

Let's create a system of two nested loops. The outer loop has a number of iterations equal to the size of the key vector of one sequence element. In the nested loop, we iterate through the vectors of the opposite tensor and the corresponding error gradients of the dependency coefficient matrix. Therefore, the number of its iterations will be equal to the number of elements in the analyzed sequence.

The products obtained in the nested loop need to be summed. To accumulate these sums, we create two private variables, grad_q and grad_k, before declaring the nested loop.

Also, please note the following: to reduce the number of computational operations, we do not multiply each product in the body of the nested loop by the previously calculated correction factor. Instead, we use the distributive property and take the constant factor out of the brackets.

Thus, there is no need to multiply by the correction factor at every iteration of the nested loop. We simply multiply the accumulated sum by it once before writing the result to the data buffer.

   for(int i = 0; i < key_size; i++)
     {
      TYPE grad_q = 0;
      TYPE grad_k = 0;
      for(int s = 0; s < units; s++)
        {
         grad_q += keys[key_size * s + i] * scores_grad[shift_score + s];
         grad_k += querys[key_size * s + i] * scores_grad[units * s + q];
        }
      querys_grad[shift_query + i] = grad_q * k;
      keys_grad[shift_query + i] = grad_k * k;
     }
  }

At the output of the loop system, we get the error gradients for the two internal neural layers m_cQuerys and m_cKeys, so the task of this kernel is solved. Together with the previously discussed AttentionCalcScoreGradient kernel, we have now distributed the error gradient to all internal neural layers; further propagation of the error gradient to the previous layer will be carried out by the well-tested methods of the internal neural layers, implemented using standard MQL5 tools.

The backpropagation kernels discussed above skipped the operations of adding the result buffers and normalizing the data that we performed during the feed-forward pass. The derivative of a sum of two functions is equal to the sum of their derivatives, so for adding error gradients we can reuse the Sum kernel from the feed-forward pass; we just need to specify the correct data buffers.

To adjust the error gradient for the data normalization function, however, we will have to create an additional kernel. Below are the error gradient correction formulas.
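
A reconstruction of these corrections, consistent with the kernel code below and written in terms of the normalized outputs y_i, the gradients g_i at the normalization output, the standard deviation σ saved during the forward pass, and the buffer size n, is

$$\frac{\partial L}{\partial \sigma^{2}} = -\sum_{i=1}^{n}\frac{g_i\,y_i}{2\sigma^{2}}, \qquad \frac{\partial L}{\partial \mu} = -\sum_{i=1}^{n}\frac{g_i}{\sigma} - \frac{\partial L}{\partial \sigma^{2}} \cdot \frac{2\sigma}{n}\sum_{i=1}^{n} y_i$$

$$\frac{\partial L}{\partial x_i} = \frac{g_i}{\sigma} + \frac{1}{n}\left(2\,\frac{\partial L}{\partial \sigma^{2}}\,y_i\,\sigma + \frac{\partial L}{\partial \mu}\right)$$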

As you can see, in the formulas provided above, when calculating derivatives with respect to the means, the sum of values across the entire value buffer is used. However, unlike the forward pass, we have the ability to calculate all three sums in parallel.

In the kernel parameters, we pass pointers to four data buffers:

  • outputs — buffer of forward pass normalization results
  • out_gradient — buffer of gradients at the output of the normalization block
  • inp_gradient — buffer for writing adjusted gradients
  • stds — buffer of standard deviations calculated during the feed-forward pass

Also, in the parameters we will indicate the size of the buffers and the offset in the standard deviation buffer.

__kernel void LayerNormalizeGradient(__global TYPE *outputs,
                                     __global TYPE *out_gradient,
                                     __global TYPE *inp_gradient,
                                     __global TYPE *stds,
                                     const int total,
                                     const int std_shift)
  {
   uint i = (uint)get_global_id(0);
   uint l = (uint)get_local_id(0);

In the kernel body we define thread identifiers and at the same time declare local data arrays. There will be three of them. In one, we will collect the derivative of the root mean square deviation, and the other two are intended for the terms in the derivative formula of the arithmetic mean.

   uint ls = min((uint)get_local_size(0), (uint)LOCAL_SIZE);
   __local TYPE dSTD[LOCAL_SIZE];
   __local TYPE dMean1[LOCAL_SIZE];
   __local TYPE dMean2[LOCAL_SIZE];

As with the feed-forward pass, each thread will first collect its share of the total.

   uint count = 0;
   do
     {
      uint shift = count * ls + l;
      dSTD[l] = (count > 0 ? dSTD[l] : 0) - 
                (shift < total ? out_gradient[shift] * outputs[shift] /
                (2 * (pow(stds[std_shift], (TYPE)2) + 1e-37f)) : 0);
      dMean1[l] = (count > 0 ? dMean1[l] : 0) - 
                (shift < total ? out_gradient[shift] /
                (stds[std_shift] + 1e-37f) : 0);
      dMean2[l] = (count > 0 ? dMean2[l] : 0) -
                  (shift < total ? 2 * outputs[shift] * stds[std_shift] /
                  (TYPE)total : 0);
      count++;
     }
   while((count * ls + l) < total);
   barrier(CLK_LOCAL_MEM_FENCE);

In the next loop, we will collect the sum in the first elements of the array.

   count = ls;
   do
     {
      count = (count + 1) / 2;
      dSTD[l] += (l < count ? dSTD[l + count] : 0);
      dMean1[l] += (l < count ? dMean1[l + count] : 0);
      dMean2[l] += (l < count ? dMean2[l + count] : 0);
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);
//---
   TYPE dstd = dSTD[0];
   TYPE dmean = dMean1[0] + dstd * dMean2[0];

We transfer the resulting values into private variables. When calculating the derivative with respect to the arithmetic mean, we multiply the second term by the derivative of the standard deviation and add it to the first term.

At this stage, we have enough data to adjust the error gradient for each buffer element. Let's organize another loop, in the body of which this work will be performed.

//---
   count = 0;
   while((count * ls + l) < total)
     {
      uint shift = count * ls + l;
      inp_gradient[shift] = out_gradient[shift] / (stds[std_shift] + 1e-32f) + 
                (2 * dstd * outputs[shift] * stds[std_shift]  + dmean) / total;
      count++;
     }
  }

This concludes our work with the OpenCL program. Now we need to proceed with the second part and set up the preparatory work for launching multi-threaded computations on the main program side.

First, let's add constants for working with kernels to the defines.mqh file. We need to add constants for identifying the kernels themselves and their variables. To name the constants, we use the previously agreed rules that apply to all constants within our project:

  • All constants begin with the prefix def.
  • Kernels begin with the prefix def_k.
  • Kernel parameter constants contain, after the def prefix, an abbreviated reference to their kernel.

#define def_k_AttentionFeedForward     28
#define def_k_AttentionScoreGradients  29
#define def_k_AttentionHiddenGradients 30
#define def_k_Sum                      31
#define def_k_LayerNormalize           32
#define def_k_LayerNormalizeGradient   33

//--- feed-forward pass of the attention block
#define def_attff_querys               0
#define def_attff_keys                 1
#define def_attff_scores               2
#define def_attff_values               3
#define def_attff_outputs              4
#define def_attff_window               5
#define def_attff_key_size             6

//--- determine the gradient on the matrix of dependence coefficients of the attention block
#define def_attscr_scores              0
#define def_attscr_scores_grad         1
#define def_attscr_values              2
#define def_attscr_values_grad         3
#define def_attscr_outputs_grad        4
#define def_attscr_scores_temp         5
#define def_attscr_window              6

//--- gradient distribution through the attention block
#define def_atthgr_querys              0
#define def_atthgr_querys_grad         1
#define def_atthgr_keys                2
#define def_atthgr_keys_grad           3
#define def_atthgr_scores_grad         4
#define def_atthgr_key_size            5

//--- sum of vectors 
#define def_sum_inputs1                0
#define def_sum_inputs2                1
#define def_sum_outputs                2

//--- vector normalization
#define def_layernorm_inputs           0
#define def_layernorm_outputs          1
#define def_layernorm_std              2
#define def_layernorm_vector_size      3
#define def_layernorm_std_shift        4

//--- vector normalization gradient
#define def_layernormgr_outputs        0
#define def_layernormgr_out_grad       1
#define def_layernormgr_inp_grad       2
#define def_layernormgr_std            3
#define def_layernormgr_vector_size    4
#define def_layernormgr_std_shift      5

After that, we will need to add the declaration of the new kernels to the code of the main neural network dispatcher class. Like all previously created kernels, we will add the declaration of new kernels to the CNet::InitOpenCL method. In it, we will first update the total number of kernels used in the program.

   if(!m_cOpenCL.SetKernelsCount(34))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

After this, we will declare the kernels themselves.

   if(!m_cOpenCL.KernelCreate(def_k_AttentionFeedForward,
                                        "AttentionFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_AttentionScoreGradients,
                                        "AttentionCalcScoreGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_AttentionHiddenGradients,
                                        "AttentionCalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_Sum, "Sum"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_LayerNormalize, "LayerNormalize"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_LayerNormalizeGradient,
                                             "LayerNormalizeGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

Then we move on to the attention mechanism class CNeuronAttention and make changes to its methods in terms of working with OpenCL technology.

Let's start with the feed-forward method CNeuronAttention::FeedForward. In this method, we need to organize the procedure for calling the feed-forward kernel AttentionFeedForward. We have already built similar processes multiple times, so the algorithm is as follows:

  1. Check the presence of data buffers in the OpenCL context.
  2. Pass parameters to the kernel, including pointers to data buffers.
  3. Queue the kernel to perform operations.

While doing so, we must ensure proper control of the operations to avoid potential critical errors during the program execution.

bool CNeuronAttention::FeedForward(CNeuronBase *prevLayer)
  {
//--- calculation of vectors Query, Key, Value
   .....
//--- Branching the algorithm on the computing device
   MATRIX out;
   if(!m_cOpenCL)
     {
   // MQL5 block
   .....
     }
   else // OpenCL block
     {
      //--- checking data buffers
      if(m_cQuerys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cKeys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cValues.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cScores.GetIndex() < 0)
         return false;
      if(m_cAttentionOut.GetOutputs().GetIndex() < 0)
         return false;

With all the necessary buffers in the OpenCL context, we will set up the transfer of pointers to them as kernel parameters.

      //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward, def_attff_keys,
                                                   m_cKeys.GetOutputs().GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward, def_attff_outputs,
                                            m_cAttentionOut.GetOutputs().GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward, def_attff_querys,
                                                  m_cQuerys.GetOutputs().GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward, def_attff_scores,
                                                               m_cScores.GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward, def_attff_values,
                                                  m_cValues.GetOutputs().GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgument(def_k_AttentionFeedForward, def_attff_key_size,
                                                                        m_iKeysSize))
         return false;

      if(!m_cOpenCL.SetArgument(def_k_AttentionFeedForward, def_attff_window,
                                                                          m_iWindow))
         return false;

Next comes the procedure for placing the kernel in the execution queue. First, let's indicate the number of required threads to launch and the offset. Only after that, we will call the kernel launch function, providing it with information about the number of instances to be launched.

      //--- put the kernel into the execution queue
      int off_set[] = {0};
      int NDRange[] = {m_iUnits};
      if(!m_cOpenCL.Execute(def_k_AttentionFeedForward, 1, off_set, NDRange))
         return false;
     }

This concludes the algorithm for launching the kernel of the Self-Attention block. However, we still need to add the contents of two buffers and normalize the data in the result buffer. Following the algorithm, first, we find the sum of two vectors (initial data and Self-Attention results). This operation is quite general and can be widely used outside of our neural attention layer class CNeuronAttention. Therefore, I decided to add it as a separate method to the data buffer class CBufferType::SumArray.

In the parameters, the SumArray method receives a pointer to the buffer to be added. At the very beginning of the method, we check the received pointer and the size of the received buffer: for the operation to succeed, the sizes of the current buffer (the first addend) and the received buffer (the second addend) must be equal.

bool CBufferType::SumArray(CBufferType *src)
  {
//--- check the source data array
   if(!src || src.Total() != Total())
      return false;

Like all the methods discussed earlier, the algorithm of this method is split into two branches depending on the execution device. In the block of operations performed with MQL5 tools, we first bring the matrix of the received buffer to the format of the current buffer. Then we perform the matrix addition. The result of the operation is saved in the matrix of the current buffer.

   if(!m_cOpenCL)
     {
      //--- change the matrix size
      MATRIX temp = src.m_mMatrix;
      if(!temp.Reshape(Rows(), Cols()))
         return false;
      //--- add matrices
      m_mMatrix += temp;
     }

The algorithm for the block of multi-threaded operations is similar to the one discussed above. First, we check whether the data of the received buffer is present in the OpenCL context and, if necessary, create its buffer there. Please note that we only check the received buffer: earlier, when branching the algorithm by computing device, we already checked the current buffer's pointer to the OpenCL context, so we consider the data of the current buffer to have already been transferred to the context.

The control block is followed by passing parameters to the kernel and placing it in the execution queue.

   else
     {
      if(src.GetIndex() < 0 && !src.BufferCreate(m_cOpenCL))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_Sum, def_sum_inputs1, m_myIndex))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_Sum, def_sum_inputs2, src.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_Sum, def_sum_outputs, m_myIndex))
         return false;
      uint off_set[] = {0};
      uint NDRange[] = {(uint)Total()};
      if(!m_cOpenCL.Execute(def_k_Sum, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

The data normalization process is organized in the CNeuronAttention::NormlizeBuffer method. However, while following the general rules for constructing the algorithm, there are two exceptions in this method. First, we eliminated the block for checking the presence of buffers in the OpenCL context. In this case, the risk of using unloaded buffers is minimal. The reason is that before calling this method, the used buffers have already been checked multiple times, and rechecking them would be unnecessary.

bool CNeuronAttention::NormlizeBuffer(CBufferType *buffer,
                                      CBufferType *std,
                                      uint std_shift)
  {
   if(!m_cOpenCL)
     {
    // MQL5 block
   .....
     }
   else
     {
      if(!m_cOpenCL.SetArgumentBuffer(def_k_LayerNormalize,
                                      def_layernorm_inputs, buffer.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_LayerNormalize,
                                     def_layernorm_outputs, buffer.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_LayerNormalize,
                                            def_layernorm_std, std.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_LayerNormalize,
                              def_layernorm_vector_size, (int)buffer.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_LayerNormalize,
                                           def_layernorm_std_shift, std_shift))
         return false;

The second point is related to the use of a local data array and thread synchronization. The reason is that thread synchronization is only available within a work group. We need to explicitly specify its size. The normalization algorithm in the kernel is structured in such a way that the workgroup size cannot be greater than the size of the local array. Let me remind you that the size of the local array is determined by the LOCAL_SIZE constant. At the same time, the number of threads cannot be greater than the size of the normalized buffer. Therefore, in the array indicating the dimension of the task space, we will indicate the smaller of the two values. Since we normalize the values of the entire buffer in one batch, the dimensions of the global and local task space will be the same.

Once we have determined the problem dimensions, we enqueue the kernel for execution.

      int NDRange[] = {(int)MathMin(buffer.Total(), LOCAL_SIZE)};
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_LayerNormalize, 1, off_set, NDRange, NDRange))
         return false;
     }
//---
   return true;
  }

This concludes the block that uses OpenCL technology in the feed-forward method of our attention class, and with it the work on this method. The rest of its code remains unchanged; the complete code is given in the section describing the construction of this method using standard MQL5 tools.

We are now moving on to working on one of the backpropagation methods — the method of distributing the error gradient through a hidden layer CNeuronAttention::CalcHiddenGradient. The algorithm of our actions remains the same. We will only make an adjustment for the use of two kernels sequentially.

I would like to remind you that when creating backpropagation kernels, we determined the need to use two additional buffers for storing intermediate values of error gradients of the dependency coefficient matrix. So let's take a step back and declare additional buffers: m_cScoreGrad and m_cScoreTemp.

class CNeuronAttention    :  public CNeuronBase
  {
protected:
   .....
   int               m_cScoreGrad;
   int               m_cScoreTemp;
   .....
  };

However, in this case, we will not declare instances of buffer objects in main memory. We will not use these buffers to exchange data between the OpenCL context and the main program. They are needed only for temporary storage of data transferred between kernels. This means that their presence in the OpenCL context memory is enough for us. In the main program, we will only declare variables to store pointers to buffers.

Let's get back to working on the CNeuronAttention::CalcHiddenGradient method. First, we check the availability and, if necessary, create new data buffers in the OpenCL context, used in the first kernel. We intentionally do not create data buffers for the second kernel right away to ensure more efficient memory usage. This will allow us to use larger data buffers when OpenCL context memory resources are limited.

bool CNeuronAttention::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
   .....
//--- branching the algorithm across the computing device
   if(!m_cOpenCL)
     {
   // MQL5 block
   .....
     }
   else // OpenCL block
     {
      //--- check data buffers
      if(m_cValues.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cValues.GetGradients().GetIndex() < 0)
         return false;
      if(m_cScores.GetIndex() < 0)
         return false;
      if(m_cAttentionOut.GetGradients().GetIndex() < 0)
         return false;
      if(m_cScoreGrad < 0)
         return false;
      if(m_cScoreTemp < 0)
         return false;

After checking all the necessary buffers, we will pass pointers to them to the kernel.

      //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
              def_attscr_outputs_grad, m_cAttentionOut.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                         def_attscr_scores, m_cScores.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                            def_attscr_scores_grad, m_cScoreGrad))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                            def_attscr_scores_temp, m_cScoreTemp))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                            def_attscr_values, m_cValues.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                     def_attscr_values_grad, m_cValues.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionScoreGradients,
                                                    def_attscr_window, m_iWindow))
         return false;

In addition to the pointers to data buffers, we pass the size of the vector describing one element of the sequence to the kernel.

After passing all the parameters, specify the number of required parallel threads and invoke the function to enqueue the kernel.

      //--- Place the kernel in the execution queue
      int off_set[] = {0};
      int NDRange[] = {m_iUnits};
      if(!m_cOpenCL.Execute(def_k_AttentionScoreGradients, 1, off_set, NDRange))
         return false;

Now we move on to working on the next kernel. Let's check the availability of buffers required for the new kernel.

      if(m_cQuerys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cQuerys.GetGradients().GetIndex() < 0)
         return false;
      if(m_cKeys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cKeys.GetGradients().GetIndex() < 0)
         return false;

After checking all the necessary data buffers, we will pass pointers to them to the kernel.

      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                                 def_atthgr_keys, m_cKeys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                          def_atthgr_keys_grad, m_cKeys.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                             def_atthgr_querys, m_cQuerys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                      def_atthgr_querys_grad, m_cQuerys.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                                             def_atthgr_scores_grad, m_cScoreGrad))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionHiddenGradients,
                                                 def_atthgr_key_size, m_iKeysSize))
         return false;

In addition to the pointers to data buffers, we pass the size of the key vector for one element of the sequence to the kernel parameters.

After finishing the transfer of all the necessary data to the kernel, we initialize the enqueuing of its execution. The arrays with the specified offset and the number of required kernel instances for execution are already prepared after launching the previous kernel, and we don't need to set them again. Therefore, we simply invoke the function to enqueue the kernel.

      if(!m_cOpenCL.Execute(def_k_AttentionHiddenGradients, 1, off_set, NDRange))
         return false;

At this point, we conclude our work on building the methods of our attention class and can proceed to test its functionality.