5. Multi-Head Self-Attention backpropagation methods

We are confidently moving forward in our learning path. Let's proceed with the implementation of our Multi-Head Self-Attention class. In the previous sections, we have already implemented initialization methods and feed-forward methods. However, the neural layer training algorithm is based on the error gradient backpropagation algorithm. We now proceed to implement backpropagation methods.

We have already mentioned that the Multi-Head Self-Attention algorithm is a logical extension of Self-Attention. That's why we created our class based on the CNeuronAttention class. And yes, the processes are all very similar. However, there are still some minor differences in the implementation of multi-head attention. To implement these differences, we created a new class CNeuronMHAttention.

As we progress in creating the methods of the class, let's take a look at the implementation of these differences in the methods of the backpropagation algorithm.

In the parent class, we have overridden three virtual methods to implement the backpropagation algorithm:

  • CNeuronAttention::CalcHiddenGradient — method for calculating the error gradient through the hidden layer
  • CNeuronAttention::CalcDeltaWeights — method for calculating the error gradient to the level of the weights matrix
  • CNeuronAttention::UpdateWeights — method for updating the weights

So, we will also need to override the corresponding methods to organize the multi-head attention backpropagation pass. Let's start with CalcHiddenGradient, the method that distributes the error gradient through the hidden layer of the neural network.

As in the parent class method, in the parameters of the method, we receive a pointer to the object of the previous neural layer. It is in its error gradient buffer that we are going to record the result of the work being done.

At the beginning of the CNeuronMHAttention::CalcHiddenGradient method body, there is the customary and essential attribute of any method: a check of pointers to the objects used in the method. Here, as in the similar method of the parent class, we will perform control checks only for pointers to objects that will be directly accessed from this method without using the methods of internal neural layers. The reason is that all inner neural layer methods have a similar block of controls. By calling them, we again validate the passed pointers to objects. This is an additional cost in resources and time. We can't disable the checks in the methods of the nested neural layers, so we will eliminate explicit duplication of controls in the current method.

We should immediately point out that we only exclude explicit duplication, not possible duplication. It's a fine line, but behind it lie great risks.

Explicit duplication is duplication that will happen anyway. If we see such duplication, we try to keep only one control point, placed before the first use of the object, whenever possible.

Note that there must be at least one control point before the object is accessed for the first time.

I call duplication possible when it can occur only under certain circumstances and may not happen at all in others. We do not eliminate such duplication because the risk of a critical error in the absence of control outweighs the potential benefit of improved program performance.

bool CNeuronMHAttention::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- check the relevance of all objects
   if(!m_cOutputs || !m_cGradients ||
      m_cOutputs.Total() != m_cGradients.Total())
      return false;

After successfully passing the control block, we proceed directly to the error gradient distribution procedure. As you may recall, in the feed-forward pass, the data is normalized at the output of the neural layer. Therefore, we need to adjust the error gradient by the derivative of the normalization function. In the parent class, we implemented this procedure in a separate method, CNeuronAttention::NormlizeBufferGradient. Now we just need to call it with the appropriate parameters.

//--- scale the gradient to normalization
   if(!NormlizeBufferGradient(m_cOutputs, m_cGradients, GetPointer(m_cStd), 1))
      return false;
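For reference only: if the forward normalization of a vector of $N$ elements computes $y_i = (x_i - \mu)/\sigma$, the standard analytic gradient with respect to the pre-normalization values is

$$\frac{\partial L}{\partial x_i} = \frac{1}{\sigma}\left(g_i - \frac{1}{N}\sum_{j=1}^{N} g_j - \frac{y_i}{N}\sum_{j=1}^{N} g_j\,y_j\right), \qquad g_i = \frac{\partial L}{\partial y_i}.$$

This is given here only as the textbook form of the normalization backward pass; to what extent NormlizeBufferGradient reproduces it, or applies a simplified adjustment based on the standard deviation stored in m_cStd, is determined by the parent-class implementation and is not reproduced here.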

Next, we run the error gradient through the inner neural layers of the Feed Forward block. These are the two convolutional layers: m_cFF2 and m_cFF1. To propagate the gradient through these neural layers, we sequentially call the analogous methods of the mentioned neural layers. Don't forget to check the results of the operations.

//--- propagate the error gradient through the Feed Forward block
   if(!m_cFF2.CalcHiddenGradient(GetPointer(m_cFF1)))
      return false;
   if(!m_cFF1.CalcHiddenGradient(GetPointer(m_cW0)))
      return false;

After passing the error gradient via the Feed Forward block, we recall that before normalizing the data at the output of the neural layer, we added up the tensors of the results of the Multi-Head Self-Attention and Feed Forward blocks. Hence, we must also propagate the error gradient along both directions. For this purpose, after obtaining the error gradient from the Feed Forward block in the buffer of the inner neural layer m_cW0, we add up the two tensors.
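The justification is the usual differentiation rule for a sum: if the forward pass computes $y = a + b$, then

$$\frac{\partial L}{\partial a} = \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y},$$

so the gradient arriving at the layer output is simply added to the gradient that the Feed Forward block has just written into the m_cW0 buffer.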

   if(!m_cW0.GetGradients().SumArray(m_cGradients))
      return false;

Let's adjust it for the derivative of the data normalization process.

//--- adjust the gradient for normalization
   if(!NormlizeBufferGradient(m_cW0.GetOutputs(), m_cW0.GetGradients(),
                                                          GetPointer(m_cStd), 0))
      return false;

We continue utilizing the methods of the internal neural layers. We call the gradient distribution method of the m_cW0 convolutional layer and check the result of the operations.

//--- distribution of the error gradient by attention heads
   if(!m_cW0.CalcHiddenGradient(GetPointer(m_cAttentionOut)))
      return false;

Next, we need to propagate the error gradient from the concatenated result of the Multi-Head Self-Attention block to the internal neural layers m_cQuerys, m_cKeys, and m_cValues. As you may recall, in the feed-forward pass, the path to m_cAttentionOut from the specified inner neural layers was completely recreated inside the method. Similarly, we will have to recreate the progression of the reverse signal.

Since we are creating a new block of operations, according to our concept, we need to organize two parallel implementation branches: one using standard MQL5 tools and one in the paradigm of multi-threaded operations using OpenCL.

//--- branching of the algorithm by computing device
   if(!m_cOpenCL)
     {
      MATRIX gradients[];
      MATRIX querys[], querys_grad = MATRIX::Zeros(m_iHeads, m_iUnits * m_iKeysSize);
      MATRIX keys[], keys_grad = MATRIX::Zeros(m_iHeads, m_iUnits * m_iKeysSize);
      MATRIX values[], values_grad = MATRIX::Zeros(m_iHeads, m_iUnits * m_iKeysSize);
      MATRIX attention_grad = m_cAttentionOut.GetGradients().m_mMatrix;

As always, in this section, we will consider the implementation using MQL5. We will proceed to the organization of multi-threaded operations later.

So, first, we're going to do some preparatory work. As in the forward pass, in this block, we organize the work separately for individual attention heads. As all the data is stored in concatenated buffers, we will prepare local matrices and split the buffers into individual matrices according to the attention heads.

      if(!m_cQuerys.GetOutputs().m_mMatrix.Vsplit(m_iHeads, querys) ||
         !m_cKeys.GetOutputs().m_mMatrix.Vsplit(m_iHeads, keys) ||
         !m_cValues.GetOutputs().m_mMatrix.Vsplit(m_iHeads, values) ||
         !attention_grad.Reshape(m_iUnits, m_iHeads * m_iKeysSize) ||
         !attention_grad.Vsplit(m_iHeads, gradients))
         return false;

Next, we will create a loop with the number of iterations equal to the number of attention heads used.

      for(int head = 0; head < m_iHeads; head++)
        {

During the feed-forward pass, the values of the concatenated results buffer are assembled by multiplying the results tensor of the m_cValues neural layer by the corresponding elements of the dependency coefficient matrix and summing the resulting vectors. Now we need to organize the reverse process: propagating the error gradient along these two directions.

First, we transfer the error gradient to the inner neural layer m_cValues. Before that, let's do some preparatory work.

To propagate the gradient to the m_cValues neural layer, it is necessary to multiply the error gradient matrix by the dependency coefficient matrix. Hence, we first need to extract such a matrix for the attention head we analyze.

We then multiply the matrices and add the result to a local copy of the concatenated gradient matrix of the m_cValues layer.
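In matrix form, the feed-forward pass of each attention head computed its output as the product of the dependency coefficient matrix and the Values tensor, so the gradient with respect to Values is obtained by multiplying the transposed coefficient matrix by the output gradient:

$$O_h = S_h\,V_h \quad\Longrightarrow\quad \nabla V_h = S_h^{\top}\,\nabla O_h.$$

The code below computes exactly this product and then transposes and reshapes the result into one row of the concatenated values_grad matrix.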

      //--- gradient propagation to Values
         MATRIX score = MATRIX::Zeros(1, m_iUnits * m_iUnits);
         if(!score.Row(m_cScores.m_mMatrix.Row(head), 0) ||
            !score.Reshape(m_iUnits, m_iUnits))
            return false;
         MATRIX temp = (score.Transpose().MatMul(gradients[head])).Transpose();
         if(!temp.Reshape(1, m_iUnits * m_iKeysSize) ||
            !values_grad.Row(temp.Row(0), head))
            return false;

After that, we will propagate the gradient along the second path of the algorithm, through the dependency coefficient matrix to the neural layers m_cQuerys and m_cKeys. In essence, we first need to determine the error gradient at the level of the dependency coefficient matrix and then propagate the error gradient from there to the specified internal neural layers.

Here we should recall that the dependency coefficient matrix is normalized row-wise by the Softmax function, in the context of Query. To properly adjust the error gradient for the derivative of the Softmax function, we need at least the full vector of error gradients for the values involved in a single normalization operation. We can write it into a local matrix.

The task is clear, and we can proceed to implementation. To propagate the error gradient to the dependency coefficient matrix, it is sufficient to multiply the obtained gradient by the transposed matrix of the results from the last feed-forward pass of the m_cValues neural layer.

After obtaining the error gradient vector at the dependency coefficient matrix level, we should adjust it using the derivative of the Softmax function.
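The adjustment itself is standard. For a row vector $s$ produced by Softmax, the Jacobian is

$$\frac{\partial s_i}{\partial z_j} = s_i\left(\delta_{ij} - s_j\right), \qquad J = \operatorname{diag}(s) - s\,s^{\top}.$$

In addition, since the forward pass scaled the product $Q K^{\top}$ by $1/\sqrt{d_k}$ before applying Softmax, the chain rule introduces the same factor on the way back, so each row of the gradient becomes $g' = J\,g/\sqrt{d_k}$, with $d_k$ equal to m_iKeysSize. This is exactly what the loop below implements.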

We will organize a loop in which we adjust the error gradient using the derivative of the Softmax normalization function.

         //--- gradient distribution up to Score
         gradients[head] = gradients[head].MatMul(values[head].Transpose());
         //--- gradient correction by Softmax derivative
         for(int r = 0; r < m_iUnits; r++)
           {
            MATRIX ident = MATRIX::Identity(m_iUnits, m_iUnits);
            MATRIX ones = MATRIX::Ones(m_iUnits, 1);
            MATRIX result = MATRIX::Zeros(1, m_iUnits);
            if(!result.Row(score.Row(r), 0))
               return false;
            result = ones.MatMul(result);
            result = result.Transpose() * (ident - result);
            if(!gradients[head].Row(result.MatMul(gradients[head].Row(r)) / 
                                                          sqrt(m_iKeysSize), r))
               return false;
           }

In the next step, we distribute the error gradient to the result values of the m_cQuerys and m_cKeys neural layers. However, we will not immediately write the values into the data buffers of the specified neural layers. We will only accumulate the sums of the error gradients into the pre-prepared matrices querys_grad and keys_grad.

Technically, we multiply the adjusted error gradient by the opposite tensor: multiplying it by the Keys matrix gives us the error gradient for Querys, and vice versa. We then reformat the obtained matrices and add them to the corresponding local matrices.
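In matrix form, because the unnormalized dependency coefficients of each head were computed as $Q_h K_h^{\top}/\sqrt{d_k}$ during the feed-forward pass, the adjusted gradient $\nabla S_h$ (which already carries the $1/\sqrt{d_k}$ factor) propagates as

$$\nabla Q_h = \nabla S_h\,K_h, \qquad \nabla K_h = \nabla S_h^{\top}\,Q_h,$$

and these are precisely the two products computed below before reshaping into rows of the concatenated querys_grad and keys_grad matrices.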

         //--- gradient propagation to Querys and Keys
         temp = (gradients[head].MatMul(keys[head])).Transpose();
         if(!temp.Reshape(1, m_iUnits * m_iKeysSize) ||
            !querys_grad.Row(temp.Row(0), head))
            return false;
         temp = (gradients[head].Transpose().MatMul(querys[head])).Transpose();
         if(!temp.Reshape(1, m_iUnits * m_iKeysSize) ||
            !keys_grad.Row(temp.Row(0), head))
            return false;
        }

After completing the iterations of the loop, we obtain concatenated matrices of error gradients for all internal layers. Finally, we need to format the matrices as required and copy the values into the respective data buffers.

      if(!querys_grad.Reshape(m_iHeads * m_iKeysSize, m_iUnits) ||
         !keys_grad.Reshape(m_iHeads * m_iKeysSize, m_iUnits) ||
         !values_grad.Reshape(m_iHeads * m_iKeysSize, m_iUnits))
         return false;
      m_cQuerys.GetGradients().m_mMatrix = querys_grad.Transpose();
      m_cKeys.GetGradients().m_mMatrix = keys_grad.Transpose();
      m_cValues.GetGradients().m_mMatrix = values_grad.Transpose();
     }
   else // OpenCL block
     {
      return false;
     }

As a result, we have propagated the error gradient to the level of the internal neural layers. We have successfully addressed the previously set task and can close the block in which the algorithm branches by computing device. In the multi-threaded operations branch, we will temporarily exit the method with a false result and complete this part later.

We haven't propagated the error gradient to the previous layer yet. We will further propagate the error gradient using internal neural layer methods.

We've already filled the error gradient buffers of all the inner layers. We only need to call the method for error gradient propagation through the layer to obtain the error gradient at the level of the original data. However, one question remains open: all three internal neural layers (m_cQuerys, m_cKeys, m_cValues) use the same tensor from the previous layer as their input data. This means that all three layers must pass the error gradient to the previous layer's buffer. In addition, the result of the Multi-Head Self-Attention block was added to the tensor of the original data before normalization. Hence, this is the fourth path of the error gradient that we need to pass to the previous layer level.
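Written as a formula, if $x$ denotes the output tensor of the previous layer, the total gradient that must end up in its buffer is the sum of four components:

$$\nabla x = g_{W0} + g_{Q \to x} + g_{K \to x} + g_{V \to x},$$

where $g_{W0}$ is the gradient already stored in the m_cW0 buffer (the residual connection that added the original data to the Multi-Head Self-Attention output) and the remaining three terms are the gradients passed back through the m_cQuerys, m_cKeys, and m_cValues layers.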

However, our gradient propagation methods are constructed so that when the error gradient is saved in the buffer of the previous layer, it overwrites the previous values, erasing the prior information. This is done intentionally to avoid unnecessary buffer-clearing operations before each iteration of the backpropagation pass. To address this issue, after running the CalcHiddenGradient method for each internal neural layer, we will copy the error gradient data to a separate buffer, where we will accumulate it with the previously stored values.

At this point, we should recall that the error gradient at the output of the Multi-Head Self-Attention block is already contained in the error gradient buffer of the m_cW0 neural layer. It might seem that this buffer would be suitable for accumulating the error gradient for the previous layer, but that is a misconception: if we were to accumulate the error gradient in it right now, we would distort the data during the subsequent propagation of the error gradient to the weight matrix of that layer. At the same time, we can propagate the error gradient to the weight matrix of the m_cW0 layer right away, since all the data needed for that is already available. So we call the CalcDeltaWeights method of this neural layer and only then use its gradient buffer to accumulate the total error gradient.

//--- propagate the error gradient to the previous layer
   if(!m_cW0.CalcDeltaWeights(GetPointer(m_cAttentionOut), false))
      return false;
   CBufferType *attention_grad = m_cW0.GetGradients();
   if(!m_cValues.CalcHiddenGradient(prevLayer))
      return false;
   if(!attention_grad.SumArray(prevLayer.GetGradients()))
      return false;
   if(!m_cQuerys.CalcHiddenGradient(prevLayer))
      return false;
   if(!attention_grad.SumArray(prevLayer.GetGradients()))
      return false;
   if(!m_cKeys.CalcHiddenGradient(prevLayer))
      return false;
   if(!prevLayer.GetGradients().SumArray(attention_grad))
      return false;
//---
   return true;
  }

Attention should be paid to the last group of commands. In the previous operations, we accumulated the data from the gradient buffer of the previous layer in the buffer of the internal neural layer, but at the end of the method we reversed the process: we took the accumulated error gradient from the internal layer's buffer and added it to the values in the buffer of the previous layer. It is in the buffer of the previous layer that we need to obtain the result; from it, the methods of the previous layer will take the error gradient and distribute it further through the neural network.

This completes the task set for this method. We complete the method with a positive result.

Next, we will work on two more methods that will continue the execution of the error backpropagation algorithm in this class.

After propagating the error through all the neural layers of our network, we need to propagate the error gradient to the level of each weight. Our CNeuronMHAttention class does not contain a separate buffer for the weight matrix: all trainable parameters are encapsulated in the internal neural layers. Therefore, the only thing we need to do in CalcDeltaWeights, the method that propagates the error gradient to the weight matrices, is to sequentially call the same method for all inner layers. At the same time, we check the results of the operations.

Recall that in the previous method, we have already passed the error gradient to the weight matrix of the m_cW0 inner layer. It is necessary to exclude it from this iteration.

bool CNeuronMHAttention::CalcDeltaWeights(CNeuronBase *prevLayer, bool read)
  {
//--- call the same method for all inner layers
   if(!m_cFF2.CalcDeltaWeights(GetPointer(m_cFF1), false))
      return false;
   if(!m_cFF1.CalcDeltaWeights(GetPointer(m_cW0), false))
      return false;
   if(!m_cQuerys.CalcDeltaWeights(prevLayer, false))
      return false;
   if(!m_cKeys.CalcDeltaWeights(prevLayer, false))
      return false;
   if(!m_cValues.CalcDeltaWeights(prevLayer, read))
      return false;
//---
   return true;
  }

After propagating the error gradients to the weight matrices, the only remaining step is to update the weights of our internal neural layers. This functionality is assigned to the UpdateWeights method. Despite the complexity of the class itself, the method for updating the weight matrices turns out to be very concise and straightforward. It was object inheritance that helped us with this.

We created our CNeuronMHAttention class as a descendant of the CNeuronAttention class and added only one new object, the inner m_cW0 neural layer. In the UpdateWeights method of the convolutional neural layers we use, all operations are performed only on elements within the object, without accessing data from other objects. That is why we can call the analogous method of the parent class, where this process is already implemented for the inherited objects. After successfully executing the parent class method, we only need to update the weight matrix of the m_cW0 internal neural layer.

bool CNeuronMHAttention::UpdateWeights(int batch_size, TYPE learningRate,
                                       VECTOR &Beta, VECTOR &Lambda)
  {
//--- call the method of the parent class
   if(!CNeuronAttention::UpdateWeights(batch_size, learningRate, Beta, Lambda))
      return false;
//--- call the same method for all inner layers
   if(!m_cW0.UpdateWeights(batch_size, learningRate, Beta, Lambda))
      return false;
      return false;
//---
   return true;
  }

Of course, we verify the result of all operations and return a boolean value indicating their execution to the caller.
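To illustrate how the three overridden methods cooperate within a single training iteration, here is a minimal sketch. In the actual library, these calls are orchestrated by the neural network class; the wrapper function and the object names mha and prev are assumptions made purely for illustration, and only the three method signatures are taken from this section.

//--- hypothetical fragment of one backpropagation step for a CNeuronMHAttention layer
bool BackpropStep(CNeuronMHAttention *mha, CNeuronBase *prev,
                  int batch_size, TYPE lr, VECTOR &beta, VECTOR &lambda)
  {
//--- 1. distribute the error gradient from this layer down to the previous one
   if(!mha.CalcHiddenGradient(prev))
      return false;
//--- 2. propagate the accumulated gradient to the weight matrices of the inner layers
   if(!mha.CalcDeltaWeights(prev, true))
      return false;
//--- 3. update the trainable parameters of all inner layers
   if(!mha.UpdateWeights(batch_size, lr, beta, lambda))
      return false;
//---
   return true;
  }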

Thus, we are nearing the completion of the Multi-Head Self-Attention technology implementation class. We have already implemented the whole algorithm using standard MQL5 tools. You can even create a script and test how it works. However, we still need to supplement our class with file handling methods.