5. Multi-Head Self-Attention feed-forward method

We have already organized the process of creating and initializing the CNeuronMHAttention multi-head attention class. Now that all the internal objects of the class are in place, we can move on to organizing the forward pass.

The virtual FeedForward method is responsible for implementing the feed-forward pass in all classes of our library. Adhering to the general organization of classes and their methods, as well as the principles of inheritance, in this class we retain the previously defined structure of methods and override the FeedForward method. Like the corresponding method of the parent class, the feed-forward method receives in its parameters a pointer to the object of the previous neural layer. Following our well-tested approach, at the beginning of the method we organize a block of controls in which we check the validity of the pointers to all dynamic objects used in the method. In this case, we check the pointer to the neural layer received in the parameters, its result buffer, and the result buffer of the internal layer holding the concatenated output of the attention block.

bool CNeuronMHAttention::FeedForward(CNeuronBase *prevLayer)
  {
//--- check the relevance of all objects
   if(!prevLayer || !prevLayer.GetOutputs() ||
      !m_cAttentionOut.GetOutputs())
      return false;

After successfully passing the control block, we generate the concatenated tensors of queries, keys, and values: Query, Key, and Value. To do this, we call the feed-forward methods of the internal convolutional layers m_cQuerys, m_cKeys, and m_cValues. The correspondence between the tensor names in the Multi-Head Self-Attention architecture and the names of the invoked objects is intentional: it makes the code more readable and makes it easier to follow the algorithm being implemented.
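In the notation of the original Transformer paper, this step projects the input sequence X with separate trainable weight matrices for each attention head; in our implementation, the projections of all heads are produced in concatenated form by a single convolutional layer per tensor:

$$Q_i = X\,W_i^Q, \qquad K_i = X\,W_i^K, \qquad V_i = X\,W_i^V, \qquad i = 1,\dots,h$$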

   if(!m_cQuerys.FeedForward(prevLayer))
      return false;
   if(!m_cKeys.FeedForward(prevLayer))
      return false;
   if(!m_cValues.FeedForward(prevLayer))
      return false;

Be sure to check the result of each operation.

Next, according to the Multi-Head Self-Attention algorithm, we have to determine the dependence coefficients between the elements of the sequence and produce the concatenated result of all attention heads. This functionality is the link between the internal neural layers. It does not involve any other objects and will be implemented entirely within this method.

As you remember, when building all processes in the methods of our library classes, we create two branches of the algorithm: standard MQL5 tools and multi-threaded calculations on the GPU using OpenCL. As always, in this section we will consider the implementation using standard MQL5 tools and will return to the multi-threaded OpenCL implementation later.

Now we need to determine how to organize the work. We have three dimensions:

  • Attention heads
  • Sequence elements
  • Vector with the description of one element of the sequence

Matrix operations allow us to work only with two-dimensional matrices. One of the two dimensions will be the vector describing a single element of the sequence. It's not hard to guess that, in most cases, the sequence length will be tens of times larger than the number of attention heads. Therefore, we will create a loop iterating over the attention heads and, within the loop, process the whole sequence of the corresponding attention head with matrix operations.

Before organizing the loop, we need to do a little preparatory work. Let's divide the concatenated results of the previous step of the algorithm into separate matrices, one per attention head. For this, we will use dynamic arrays of matrices, which give us a semblance of three-dimensional matrices: the index of an element in the array indicates the attention head, and each element of the array is a tabular matrix whose rows represent individual elements of the sequence. For convenience, let's give the arrays names that correspond to their content.

//--- branching of the algorithm by the computing device
   MATRIX out;
   if(!m_cOpenCL)
     {
      if(!out.Init(m_iHeads, m_iUnits * m_iKeysSize))
         return false;
      MATRIX querys[], keys[], values[];
      if(!m_cQuerys.GetOutputs().m_mMatrix.Vsplit(m_iHeads, querys))
         return false;
      if(!m_cKeys.GetOutputs().m_mMatrix.Vsplit(m_iHeads, keys))
         return false;
      if(!m_cValues.GetOutputs().m_mMatrix.Vsplit(m_iHeads, values))
         return false;

After completing the preparatory work, we can proceed directly to the operations of calculating dependency coefficients. When solving such a problem, we used matrix operations in the forward pass method of the parent class CNeuronAttention. Now we will use the same algorithm, but we need to repeat it in a loop with the number of iterations equal to the number of attention heads.

According to the Multi-Head Self-Attention algorithm, the dependence coefficients are divided by the square root of the dimension of the Key vector, and the obtained values are then normalized with the Softmax function row-wise, that is, in the context of individual Query elements.
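In formula form, for each attention head i this step can be written as follows, where d_k is the size of the Key vector (m_iKeysSize) and Softmax is applied row by row, separately for each query:

$$\mathrm{Score}_i = \mathrm{Softmax}\!\left(\frac{Q_i\,K_i^T}{\sqrt{d_k}}\right)$$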

Following the algorithm, we multiply the querys matrix by the transposed keys matrix, divide the result by the square root of the key vector dimension, and immediately take the element-wise exponential. Then we compute the row-by-row sums of the resulting matrix and organize a nested loop for normalizing the data.

      for(int head = 0; head < m_iHeads; head++)
        {
         //--- define Scores
         MATRIX sc = exp(querys[head].MatMul(keys[head].Transpose()) /
                                                                sqrt(m_iKeysSize));
         VECTOR sum = sc.Sum(1);
         for(uint r = 0; r < sc.Rows(); r++)
            if(!sc.Row(sc.Row(r) / sum[r], r))
               return false;

As you can see, the algorithm completely repeats similar operations of the parent class.

Now that we have the calculated matrix of dependence coefficients between elements, we can continue with the Multi-Head Self-Attention algorithm and compute the part of the concatenated result tensor corresponding to the analyzed attention head. To do this, we simply multiply the matrix of dependence coefficients by the Values matrix.

         //--- output of the attention block
         MATRIX temp = sc.MatMul(values[head]).Transpose();

Special attention should be paid to gathering results into a single concatenated tensor. The entire logic of constructing the algorithm assumes that the tensor of the concatenated result will be a tabular matrix. Each row of the matrix will contain a vector of the concatenated result of a single element of the sequence. I solved this problem as follows.

As a result of the multiplication operation, we obtained a tabular matrix where the number of rows equals the number of elements in the sequence, and the number of columns equals the size of the vector describing one element of the sequence. We transpose the matrix, reshape it into a row matrix, and add this resulting row to the concatenated matrix. At this stage, in the concatenated matrix, each attention head will have its own row.

We do the same with the matrix of dependency coefficients.

         if(!temp.Reshape(1, m_iUnits * m_iKeysSize))
            return false;
         if(!sc.Reshape(1, m_iUnits * m_iUnits))
            return false;
         if(!m_cScores.m_mMatrix.Row(sc.Row(0), head))
            return false;
         if(!out.Row(temp.Row(0), head))
            return false;
        }

Once the iterations of the loop are completed and the results of all the attention heads are obtained, we will reformat the concatenated matrix. We will make the number of columns equal to the number of elements of the sequence and transpose the matrix. As a result, we will have a number of rows equal to the number of elements in the analyzed sequence. This is the format we need to pass to the next convolutional layer of our multi-head attention block. We will save the matrix to the results buffer of the inner layer m_cAttentionOut.

      if(!out.Reshape(m_iHeads * m_iKeysSize, m_iUnits))
         return false;
      m_cAttentionOut.GetOutputs().m_mMatrix = out.Transpose();
     }
   else // OpenCL block
     {
      return false;
     }

This concludes the branching of the algorithm depending on the device used to execute the operations. For the block of multi-threaded operations using OpenCL, we have placed a temporary stub that simply returns false; we will return to it in the following sections. Now let's go back to using the methods of our internal neural layers.

We continue to follow the Multi-Head Self-Attention algorithm. At the next stage, we need to reduce the dimensionality of the concatenated tensor of results from all attention heads to the size of the original data tensor. For this purpose, the algorithm provides a trainable matrix W0. This matrix has a dual purpose. First, it changes the dimension of the tensor. Second, it performs a weighted summation of the outputs of all attention heads into a single whole, thereby determining the influence of each attention head on the final result.
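In the notation of the original paper, this step can be written as shown below, where head_i denotes the output of the i-th attention head and W0 is the trainable matrix implemented here by the m_cW0 convolutional layer:

$$\mathrm{MultiHead}(Q,K,V) = \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^0$$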

To accomplish this task, we will use the object of the convolutional layer. We have already created a convolutional neural layer m_cW0, and now we have to call its forward pass method. In the parameters, we pass to the method a pointer to the object of the m_cAttentionOut neural layer. Do not forget to check the result of the operation.

   if(!m_cW0.FeedForward(GetPointer(m_cAttentionOut)))
      return false;

After the successful completion of these operations, the result buffer of the m_cW0 convolutional layer will contain the output of the Multi-Head Self-Attention block. According to the Transformer algorithm, we now need to add the obtained result to the original data and normalize the resulting tensor using the following formulas:
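A compact way to write these operations, assuming the same normalization by the mean and standard deviation of each vector that we used in the parent class, is:

$$x' = x + \mathrm{MHSA}(x)$$

$$\mathrm{Norm}(x') = \frac{x' - \mu_{x'}}{\sqrt{\sigma_{x'}^2}}$$

where x is the original data, MHSA(x) is the output of the Multi-Head Self-Attention block after the W0 projection, and μ and σ² are the mean and variance of the vector being normalized.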

When working on the parent class CNeuronAttention, we created separate methods for these operations. Now let's make use of the results of the work done earlier.

//--- add to the initial data and normalize
   if(!m_cW0.GetOutputs().SumArray(prevLayer.GetOutputs()))
      return false;
   if(!NormlizeBuffer(m_cW0.GetOutputs(), GetPointer(m_cStd), 0))
      return false;

And, of course, don't forget to monitor the process of executing operations at every step.

Monitoring the process of executing operations is very important and should become a good habit, especially when dealing with such a large number of operations.

This concludes the Multi-Head Self-Attention block of the Transformer encoder algorithm. Next comes its second block, Feed Forward. Within this block, we need to propagate the signal through two neural layers, which we do by sequentially calling their feed-forward methods.
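For reference, the original Transformer paper defines this block as shown below, with ReLU as the inner activation; in our class, m_cFF1 and m_cFF2 use the activation functions assigned when the layers were initialized:

$$\mathrm{FFN}(x) = \max(0,\; x\,W_1 + b_1)\,W_2 + b_2$$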

//--- FeedForward
   if(!m_cFF1.FeedForward(GetPointer(m_cW0)))
      return false;
   if(!m_cFF2.FeedForward(GetPointer(m_cFF1)))
      return false;

At the end of the forward pass algorithm, we will need to repeat the data normalization procedure. This time we add the result buffers of the Multi-Head Self-Attention and Feed Forward blocks.

//--- add to the output of attention and normalize
   if(!m_cOutputs.SumArray(m_cW0.GetOutputs()))
      return false;
   if(!NormlizeBuffer(m_cOutputs, GetPointer(m_cStd), 1))
      return false;
//---
   return true;
  }

The normalization procedure completes the feed-forward method. After the specified process completes successfully, we exit the method with a result of true. Let's move on to the implementation of the backpropagation method.