Organizing parallel computing for Multi-Head Self-Attention

We continue our steady progress on the path of knowledge, building a library for creating machine learning models within the MQL5 environment. In this section, we plan to complete work on CNeuronMHAttention, another neural layer class, which implements the Multi-Head Self-Attention algorithm. In the previous sections, we fully implemented this algorithm using standard MQL5 tools. Now let's supplement its functionality with the ability to use OpenCL technology to organize the computation process in multi-threaded mode on GPU resources.

We have already implemented similar work for each of the previously discussed neural layers. Let me remind you of the general algorithm for constructing this process. First, we create an OpenCL program. Next, we enhance the main program code with the functionality for calling this program and passing the necessary data in both directions: we send the input data to the program before its execution and read back the results after it completes.

As usual, we start by creating an OpenCL program. But do we really need to create a new program? Why do we need new kernels? The answer seems obvious: we need to implement the new functionality. But let's recall that we inherited our class from a similar class implementing the Self-Attention algorithm. We have repeatedly talked about the continuity of these algorithms. So, can we reuse the kernels created earlier to implement the processes of this class?

Considering the similarity of processes, it would be more advantageous for us to use the same kernels for both implementations. Firstly, this reduces the number of OpenCL objects, and the system can only handle a limited number of them. Secondly, it's always more convenient to maintain and optimize one object rather than duplicating shared blocks across multiple similar objects, be it kernels or classes.

So, how can we implement this? The previously created kernels operate within a single attention head. Of course, on the main program side, we could copy data into separate buffers and sequentially invoke the kernels for each attention head. This approach is possible, but it is irrational: excessive data copying is in itself not the best solution, and calling the kernels sequentially for each attention head rules out calculating all attention heads simultaneously in parallel threads.

In fact, we can use the previously created kernels without unnecessary copying, by implementing some minor modifications.

One thing we have done from the very beginning when creating this class is make full use of concatenated data buffers. That is, all our data buffers contain the data of all attention heads at once. When transferring data to the context memory, we transfer the data of all attention heads. This means that on the OpenCL side, we can work with all attention heads in parallel. We just need to correctly determine the offset in the data buffer to the required values. These are the changes that we must make to the kernels.

To determine this offset, we need to know the total number of attention heads used and the ordinal number of the current attention head. We can pass the total number in the kernel parameters, but we can't do the same for the index of the current head: to pass such data through parameters, we would need a loop with a sequential kernel call for each attention head, which is exactly what we are trying to avoid.

Let's remember how the function for queuing a kernel is organized. The CLExecute function has a work_dim parameter, which is responsible for the dimension of the task space. The function also receives the dynamic array global_work_size[] as a parameter, which indicates the total number of tasks to be performed in each dimension.

bool  CLExecute(
// handle to the OpenCL program kernel 
   int          kernel,
// dimension of the problem space  
   uint         work_dim,
// initial offset in task space 
   const uint&  global_work_offset[],
// total number of tasks 
   const uint&  global_work_size[]
   );

Earlier we used only one dimension, and now we can use two. We will continue to use one dimension for iterating over the elements of the sequence and the other dimension for iterating over the attention heads.
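For illustration, a minimal sketch of queuing a kernel over a two-dimensional task space might look as follows (the kernel handle and the units and heads values here are placeholders; the actual calls used in our classes are shown later in this section):

   uint offset[] = {0, 0};          // no offset in either dimension
   uint work[]   = {units, heads};  // dimension 0: sequence elements, dimension 1: attention heads
   if(!CLExecute(kernel, 2, offset, work))
      return false;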

Well, a solution has been found, and we can begin implementation. But there is one more question: should we create a new kernel or not? Everything points towards modifying the previously created one. But in this case, after finishing the work, we will have to take a step back and adjust the methods of the CNeuronAttention class. Otherwise, we will get a critical error when trying to launch the kernel.

For my part, I decided to make changes to the previously created kernels and the corresponding methods of the main program. You can choose your preferred option.

Now, let's look at the changes made to the feed-forward kernel.
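The kernel header itself presumably keeps the same set of parameters as the single-head version; a sketch of it, with the parameter order reconstructed from the buffers used in the kernel body, might look like this:

__kernel void AttentionFeedForward(__global TYPE *querys,
                                   __global TYPE *keys,
                                   __global TYPE *scores,
                                   __global TYPE *values,
                                   __global TYPE *outputs,
                                   int key_size,
                                   int window)
  {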

In the kernel body, we request the identifiers of the launched thread and the total number of threads in two dimensions. The first dimension provides the index of the processed query and the sequence length. The second dimension indicates the index of the active attention head and the total number of heads.

We also determine the offset to the beginning of the vector being analyzed in the query tensor and the attention coefficient matrix.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   const int h = get_global_id(1);
   const int heads = get_global_size(1);
   int shift_query = key_size * (q * heads + h);
   int shift_scores = units * (q * heads + h);

As you can see, compared to the previous version, the kernel differs by having a second dimension that accounts for attention heads. Accordingly, the calculation of offsets is also adjusted to account for multi-head attention.

Next, we create a system of two nested loops to calculate one row of the dependency coefficient matrix. This is because, to calculate one element of the sequence at the output of the attention block, we need an entire row of dependency coefficients. Such a calculation applies to one attention head.

Also, before starting the loop system, we will prepare a local variable summ to sum all the values of the vector. We will need this sum later to normalize the vector values.

The outer loop has a number of iterations equal to the number of sequence elements. Its counter simultaneously indicates the analyzed element in the key tensor Key and the column number in the attention coefficient matrix. In the body of the loop, we determine the offset in the key tensor to the beginning of the vector of the analyzed sequence element and prepare a variable for accumulating the result of multiplying the two vectors.

In the nested loop with a number of iterations equal to the size of the key vector, we will perform the operation of multiplying the query vector by the key vector.

After completing the iterations in the nested loop, we will take the exponential of the obtained result from the vector multiplication, record the resulting value in the tensor of attention coefficient matrices, and add it to our sum of vector values.
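For reference, per attention head, this loop system together with the subsequent normalization computes the standard row-wise softmax of the scaled dot products:

\[ Score_{q,s} = \frac{\exp\left(Q_q \cdot K_s / \sqrt{d_{key}}\right)}{\sum_{j=0}^{units-1} \exp\left(Q_q \cdot K_j / \sqrt{d_{key}}\right)} \]

where d_{key} corresponds to the key_size parameter, and the denominator is the sum accumulated in the summ variable.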

   TYPE summ = 0;
   for(int s = 0; s < units; s++)
     {
      TYPE score = 0;
      int shift_key = key_size * (s * heads + h);
      for(int k = 0; k < key_size; k++)
         score += querys[shift_query + k] * keys[shift_key + k];
      score = exp(score / sqrt((TYPE)key_size));
      summ += score;
      scores[shift_scores + s] = score;
     }

After completing all iterations of the loop system, we will have a vector with computed but unnormalized attention coefficients for one query vector with respect to all key vectors. To complete the vector normalization process, we need to divide the contents of the vector by the sum of all its values, which we have collected in the summ variable.

To perform this operation, we will create another loop with the number of iterations equal to the number of elements in the sequence.

   for(int s = 0; s < units; s++)
      scores[shift_scores + s] /= summ;

As you can see, this block differs from the previous implementation only in terms of calculating the offsets of elements in tensors. Now that we have the normalized attention vector for one query with respect to all elements in the key tensor sequence, we can calculate the weighted vector for one element of the sequence at the output of one attention head. To do this, we will create a system of two nested loops.

First, we will determine the offset in the result tensor to the beginning of the vector for the analyzed element.

Then we create an outer loop based on the number of elements in the result vector. In the body of the loop, we first prepare a variable for accumulating the value of one element of the vector. Next, we create a nested loop with a number of iterations equal to the number of sequence elements, in which we iterate through all the elements of the value tensor. From the description vector of each element, we take the one value corresponding to the outer loop counter and multiply it by the element of the normalized attention coefficient vector indexed by the nested loop counter. After completing the full cycle of iterations of the nested loop, the query variable contains one value of the description vector for the analyzed element of the attention block sequence. We write it to the corresponding element of the kernel results buffer.
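For reference, per attention head, this loop system computes the product of the normalized coefficient matrix and the value tensor:

\[ Out_{q,i} = \sum_{v=0}^{units-1} Score_{q,v} \cdot V_{v,i} \]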

   shift_query = window * (q * heads + h);
   for(int i = 0; i < window; i++)
     {
      TYPE query = 0;
      for(int v = 0; v < units; v++)
         query += values[window * (v * heads + h) + i] * scores[shift_scores + v];
      outputs[shift_query + i] = query;
     }
  }

After completing the iterations of the outer loop, we will obtain a complete description vector for one element of the sequence in the result tensor buffer.

As you can see, the operations of one kernel result in one description vector for an element of the sequence in the result tensor of one attention head. To calculate the complete tensor, we need to launch a task pool with a size equal to the product of the number of elements in the sequence and the number of attention heads. This is what we do when running the kernel in a two-dimensional task space.

To transform the kernel from the single-head attention plane to multi-head attention, we simply needed to organize the kernel launch in a two-dimensional space and adjust the offset calculations in the data buffers.

Let's do a similar job with backpropagation kernels. As you may recall, in the Self-Attention block, in contrast to the implementation of other neural layers, we implemented the propagation of the error gradient through the internal space of the hidden neural layer using two consecutive kernels. So, we need to transfer both kernels into the area of multi-head attention. However, let's consider things in order.

First, we will look at the AttentionCalcScoreGradient kernel. The kernel parameters remain unchanged: we have the same data buffers and a single constant, the size of the description vector of one element.

__kernel void AttentionCalcScoreGradient(__global TYPE *scores,
                                         __global TYPE *scores_grad,
                                         __global TYPE *values,
                                         __global TYPE *values_grad,
                                         __global TYPE *outputs_grad,
                                         __global TYPE *scores_temp,
                                         int window)
  {

In the kernel body, similar to the feed-forward kernel, we add the retrieval of thread identification in the second dimension and adjust the calculation of offsets in data buffers accordingly.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   const int h = get_global_id(1);
   const int heads = get_global_size(1);
   int shift_value = window * (q * heads + h);
   int shift_score = units * (q * heads + h);

We do not change the kernel algorithm. As with the implementation of the Self-Attention algorithm, the kernel can be logically divided into two blocks.

In the first block, we distribute the error gradient to the value tensor Values. Here we create a system of two nested loops. The outer loop has a number of iterations equal to the size of the description vector of one sequence element in the value tensor. In the loop body, we create a local variable to accumulate the error gradient of the analyzed element.

It should be understood that during the feed-forward pass, each element in the sequence of the value tensor has a significant influence on the value of each element in the sequence of the result tensor. The strength of this influence is determined by the corresponding column of the attention coefficient matrix, where each row corresponds to one element in the sequence tensor of results. Therefore, to obtain the error gradient vector for one element in the sequence tensor of values, we need to multiply the corresponding column of the attention coefficient matrix by the error gradient tensor at the level of the attention block results.

To perform this operation, we organize a nested loop with the number of iterations equal to the number of elements in the sequence. In the body of this loop, we will multiply two vectors and write the result to the corresponding element of the error gradient buffer of the value tensor.
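In matrix form, per attention head, this corresponds to:

\[ \nabla V = Score^{\top} \cdot \nabla Out \]

where \nabla Out is the error gradient at the level of the attention block results.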

//--- Distributing the gradient on Values
   for(int i = 0; i < window; i++)
     {
      TYPE grad = 0;
      for(int g = 0; g < units; g++)
         grad += scores[units * (g * heads + h) + q] *
                 outputs_grad[window * (g * heads + h) + i];
      values_grad[shift_value + i] = grad;
     }

Here, we made changes only in terms of determining the offsets to the analyzed elements in the data buffers.

The second block of this kernel is responsible for propagating the gradient to the level of the dependency coefficient matrix. First, we create a system of two nested loops and calculate the error gradient for one row of the dependency coefficient matrix. There is a very important point here: we calculate the error gradient specifically for a matrix row, not a column. Since the normalization of the matrix with the Softmax function was performed row-wise, the adjustment for the Softmax derivative must also be applied row-wise. To determine the error gradient for one row of the matrix, we need to take the corresponding vector from the error gradient tensor at the level of the attention block results and multiply it by the value tensor of the corresponding attention head.

To perform the multiplication operation, we organize a nested loop.

//--- Gradient distribution on Score
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      for(int i = 0; i < window; i++)
         grad += outputs_grad[shift_value + i] * 
                 values[window * (k * heads + h) + i];
      scores_temp[shift_score + k] = grad;
     }

After running a full cycle of iterations of our loop system, we obtain a single row of error gradients for the dependency coefficient matrix. Before passing the error gradient further, it is necessary to adjust it by the derivative of the Softmax function.
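For reference, the row-wise Jacobian of the Softmax function gives the following adjustment for each element k of the row:

\[ \nabla S_k = \sum_{i=0}^{units-1} \nabla A_i \cdot A_i \left(\delta_{ik} - A_k\right) \]

where A is the normalized row of the coefficient matrix (the scores buffer), \nabla A is the unadjusted gradient accumulated in the previous block (the scores_temp buffer), and \delta_{ik} is the Kronecker delta.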

//--- Adjust for the Softmax derivative
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      TYPE score = scores[shift_score + k];
      for(int i = 0; i < units; i++)
         grad += scores[shift_score + i] *
                 ((int)(i == k) - score) * scores_temp[shift_score + i];
      scores_grad[shift_score + k] = grad;
     }
  }

The operation results are written into the corresponding elements of the error gradient tensor.

This completes the work with the first kernel of the backpropagation algorithm. As you may have noticed, the changes affected only the definition of the offset in the data buffers and the additional dimension of the task space.

Let's move on to the second kernel of the AttentionCalcHiddenGradient error backpropagation algorithm. In this kernel, we need to propagate the error gradient from the dependency coefficient matrix to the buffers of the m_cQuerys and m_cKeys internal neural layers.

This operation is not difficult from a mathematical point of view. We have already determined the error gradient at the level of the dependency coefficient matrix in the previous kernel. Now we need to multiply this error gradient by the opposite tensor: by the key tensor to obtain the query gradients, and by the query tensor to obtain the key gradients.
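In matrix form, per attention head, this corresponds to:

\[ \nabla Q = \frac{\nabla S \cdot K}{\sqrt{d_{key}}}, \qquad \nabla K = \frac{\nabla S^{\top} \cdot Q}{\sqrt{d_{key}}} \]

where the division by \sqrt{d_{key}} accounts for the scaling applied to the scores during the feed-forward pass.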

As in the previous kernel, the kernel header and parameters have not changed at all. Here we see the same set of buffers and parameters.

__kernel void AttentionCalcHiddenGradient(__global TYPE *querys,
                                          __global TYPE *querys_grad,
                                          __global TYPE *keys,
                                          __global TYPE *keys_grad,
                                          __global TYPE *scores_grad,
                                          int key_size)
  {

In the kernel body, we identify the thread in two dimensions of tasks. The second dimension has been added for the identification of the active attention head. We adjust the offsets in the gradient buffers accordingly, ensuring they are aligned with the elements of the sequence being analyzed.

   const int q = get_global_id(0);
   const int units = get_global_size(0);
   const int h = get_global_id(1);
   const int heads = get_global_size(1);
   int shift_query = key_size * (q * heads + h);
   int shift_score = units * (q * heads + h);

As mentioned earlier, in the kernel body, we need to distribute the error gradient to two internal neural layers from a single source. The same algorithm is used for gradient error distribution in both directions. And both recipient vectors have the same size. All of this allows us to calculate the error gradient for both tensors in parallel within the body of a single loop system. The number of iterations in the outer loop is equal to the size of the vector for which we are calculating the error gradient. In its body, we prepare variables for accumulating the error gradients and create a nested loop with a number of iterations equal to the number of elements in the sequence. In the body of the nested loop, we simultaneously calculate values from the product of two pairs of vectors.

//--- Propagate the gradient on Querys and Keys
   const TYPE k = 1 / sqrt((TYPE)key_size);
//---
   for(int i = 0; i < key_size; i++)
     {
      TYPE grad_q = 0;
      TYPE grad_k = 0;
      for(int s = 0; s < units; s++)
        {
         grad_q += keys[key_size * (s * heads + h) + i] *
                   scores_grad[shift_score + s];
         grad_k += querys[key_size * (s * heads + h) + i] *
                   scores_grad[units * (s * heads + h) + q];
        }
      querys_grad[shift_query + i] = grad_q * k;
      keys_grad[shift_query + i] = grad_k * k;
     }
  }

After exiting the nested loop, each variable holds one value of the error gradient vector of its respective tensor. We write these values into the corresponding elements of the gradient buffers. After completing the full number of iterations of the loop system, we obtain the two desired error gradient vectors.

This completes our work with the OpenCL program kernels. We have only made slight changes to the kernels of the Self-Attention algorithm to transfer them to the area of multi-head attention.

Now we have to supplement the main program with the functionality for calling these kernels from the methods of both the CNeuronAttention class and the CNeuronMHAttention class. We usually start this work by creating constants for working with the kernels. But in this case, the constants have already been created.

The next step would normally be to create the kernels in the OpenCL context. But this time we did not create new kernels: the ones we slightly adjusted are already declared in the body of the main program. Therefore, we skip this step too.

Let's move on to making changes directly to the class methods. For the updated kernels to work in the CNeuronAttention class, we add a second element to the offset and task space arrays. For the offset, we specify 0 in both dimensions. For the task space, we leave the first value unchanged and set the second element of the array to 1 (indicating the use of a single attention head). Additionally, when enqueueing the kernel for execution, we specify the two-dimensionality of the task space.

      int off_set[] = {0, 0};
      int NDRange[] = {m_iUnits, 1};
      if(!m_cOpenCL.Execute(def_k_AttentionFeedForward, 2, off_set, NDRange))
         return false;

After this, we can fully use the updated feed-forward kernel.

We make the same simple adjustments to the calls of all three kernels in the methods of the CNeuronAttention class.
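For example, a sketch of the analogous adjustment for the score-gradient kernel call in the CNeuronAttention backpropagation method might look like this (a sketch only; the surrounding buffer checks and parameter passing remain as before, and only the task space changes):

      int off_set[] = {0, 0};
      int NDRange[] = {m_iUnits, 1};
      if(!m_cOpenCL.Execute(def_k_AttentionScoreGradients, 2, off_set, NDRange))
         return false;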

So, we have restored the functionality of the methods of the CNeuronAttention class, which implements the Self-Attention algorithm. There are also some changes on the main program side.

Let's move on to working on our CNeuronMHAttention class with the implementation of the Multi-Head Self-Attention algorithm. As usual, we'll start with the feed-forward method. Before we queue the kernel for operations, we need to do some preparatory work. First of all, we check the presence of the necessary buffers in the OpenCL context memory.

bool CNeuronMHAttention::FeedForward(CNeuronBase *prevLayer)
  {
   ......
//--- branching of the algorithm across the computing device
   MATRIX out;
   if(!m_cOpenCL)
     {
   ......
     }
   else // OpenCL block
     {
      //--- check data buffers
      if(m_cQuerys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cKeys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cValues.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cScores.GetIndex() < 0)
         return false;
      if(m_cAttentionOut.GetOutputs().GetIndex() < 0)
         return false;

After checking all the necessary buffers, we pass pointers to the buffers to the kernel parameters. There we also pass the constants necessary for the operation of the kernel.

Please note that when passing parameters to the kernel, we specified the m_iKeysSize variable, which contains the size of the key vector for one element of the sequence, twice: both for the key vector size parameter and for the value vector size parameter. Having two separate parameters in the kernel is a necessary measure: when using a single attention head, the size of the value vector must equal the size of the input data vector, as required by the Self-Attention algorithm. However, when using multi-head attention, the W0 matrix allows us to use a different size for the value vector.

      //--- pass parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward,
                                 def_attff_keys, m_cKeys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward,
                      def_attff_outputs, m_cAttentionOut.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward,
                             def_attff_querys, m_cQuerys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward,
                                          def_attff_scores, m_cScores.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionFeedForward,
                             def_attff_values, m_cValues.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionFeedForward,
                                                def_attff_key_size, m_iKeysSize))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionFeedForward,
                                                  def_attff_window, m_iKeysSize))
         return false;

This concludes the preparatory work, and we can move on to organizing the kernel launch procedure. To do this, we indicate the size of the problem space in two dimensions. In the first dimension, we indicate the size of the sequence; in the second one, we indicate the number of attention heads. Let's call the method for adding the kernel to the execution queue.

      //--- putting the kernel into the execution queue
      int off_set[] = {0, 0};
      int NDRange[] = {m_iUnits, m_iHeads};
      if(!m_cOpenCL.Execute(def_k_AttentionFeedForward, 2, off_set, NDRange))
         return false;
     }

Here, we conclude our work on the feed-forward method and move on to the CalcHiddenGradient method, which propagates the error gradient through the hidden layer. To implement this method, we have prepared two kernels that need to be launched sequentially. First, we will run the AttentionCalcScoreGradient kernel, which propagates the error gradient up to the dependency coefficient matrix.

The algorithm for carrying out the preparatory work and launching the kernel is similar to what we used above when launching the forward pass kernel.

bool CNeuronMHAttention::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- branching of the algorithm across the computing device
   if(!m_cOpenCL)
     {
   ......
   // MQL5 block
     }
   else // OpenCL block
     {
      //--- check data buffers
      if(m_cValues.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cValues.GetGradients().GetIndex() < 0)
         return false;
      if(m_cScores.GetIndex() < 0)
         return false;
      if(m_cAttentionOut.GetGradients().GetIndex() < 0)
         return false;
      if(m_cScoreGrad < 0)
         return false;
      if(m_cScoreTemp < 0)
         return false;

After checking the buffers, we pass pointers to them and the necessary constants as parameters to the kernel.

      //--- passing parameters to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
              def_attscr_outputs_grad, m_cAttentionOut.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                          def_attscr_scores, m_cScores.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                            def_attscr_scores_grad, m_cScoreGrad))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                                            def_attscr_scores_temp, m_cScoreTemp))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                             def_attscr_values, m_cValues.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionScoreGradients,
                      def_attscr_values_grad, m_cValues.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionScoreGradients,
                                                  def_attscr_window, m_iKeysSize))
         return false;

We place the kernel in the queue for performing operations. As with the feed-forward pass, we create a two-dimensional task space. In the first dimension, we specify the number of elements being analyzed in the sequence, and in the second dimension, we specify the number of attention heads.

      //--- place the kernel into the execution queue
      int off_set[] = {0, 0};
      int NDRange[] = {m_iUnits, m_iHeads};
      if(!m_cOpenCL.Execute(def_k_AttentionScoreGradients, 2, off_set, NDRange))
         return false;

We immediately begin the preparatory work for launching the second kernel by checking the data buffers in the OpenCL context memory. Only the buffers that we did not check when launching the first kernel need to be verified.

      //--- check data buffers
      if(m_cQuerys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cQuerys.GetGradients().GetIndex() < 0)
         return false;
      if(m_cKeys.GetOutputs().GetIndex() < 0)
         return false;
      if(m_cKeys.GetGradients().GetIndex() < 0)
         return false;

We pass pointers to data buffers to the parameters of the second kernel. We also add the necessary constants there.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                                def_atthgr_keys, m_cKeys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                         def_atthgr_keys_grad, m_cKeys.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                            def_atthgr_querys, m_cQuerys.GetOutputs().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                     def_atthgr_querys_grad, m_cQuerys.GetGradients().GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AttentionHiddenGradients,
                                            def_atthgr_scores_grad, m_cScoreGrad))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AttentionHiddenGradients,
                                                def_atthgr_key_size, m_iKeysSize))
         return false;

After completing the preparatory work, we call the method for placing the kernel in the tasks execution queue. Please note that this time we are not creating new arrays specifying the task space because it has not changed, and we can use the existing arrays from the previous kernel launch.

      //--- place the kernel into the execution queue
      if(!m_cOpenCL.Execute(def_k_AttentionHiddenGradients, 2, off_set, NDRange))
         return false;
     }

This concludes our work on the implementation of the Multi-Head Self-Attention algorithm, including general and multi-threaded calculations. We have implemented all the functionality of the CNeuronMHAttention class. Now we can proceed with comprehensive testing of its performance using training and testing datasets.