5. Self-Attention feed-forward method

We have already created the class structure for implementing the attention mechanism and even written the object initialization method. In this section, we will organize the feed-forward pass process.

As you know, in the base class of the neural network, we have created a virtual method CNeuronBase::FeedForward which is responsible for organizing the feed-forward pass. In each new class, we override this method to organize the relevant process according to the algorithm of the implemented architectural solution. By doing so, we kind of personalize the method for each class. At the same time, the external program does not need to know anything about the organization of the process within the class. It doesn't even need to know the type of neural layer. It simply calls the FeedForward method of the next object and passes it a pointer to the previous layer of the neural network. In this way, we have shifted the functionality of dispatching and checking the required object type from our program to the system.
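To illustrate this, here is a minimal sketch of such a call on the side of the external program (the Forward helper function and its parameters are assumed here purely for the example). Whatever type of layer the current pointer actually refers to, the virtual dispatch executes the proper FeedForward override.

bool Forward(CNeuronBase *previous, CNeuronBase *current)
  {
//--- the override to run is selected by the actual type of the 'current' object
   if(!current || !current.FeedForward(previous))
      return false;
   return true;
  }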

Let's go back to our CNeuronAttention::FeedForward method. Just like the parent class method, it receives in its parameters a pointer to the object of the previous layer. This is consistent with the principles of method inheritance and overriding. Since we receive a pointer to an object, it would be customary to begin the method with a block checking the validity of the received pointer. However, in this case, we will omit it. The reason is that the use of static internal objects allows us to skip checking their pointers. As for the pointer to the previous neural layer, we will use it for the feed-forward pass of the internal convolutional layers m_cQuerys, m_cKeys, and m_cValues. They already contain the relevant controls, so we do not need to duplicate them.

In accordance with the Self-Attention algorithm, we need to define the Query, Key, and Value vectors for each element of the sequence. As you remember, it was for this functionality that we created the first three convolutional layers. Therefore, to solve this problem, we just need to call the FeedForward methods of the named internal layers. In each call, we pass the pointer to the previous neural layer received in the parameters of our CNeuronAttention::FeedForward method.

   if(!m_cQuerys.FeedForward(prevLayer))
      return false;
   if(!m_cKeys.FeedForward(prevLayer))
      return false;
   if(!m_cValues.FeedForward(prevLayer))
      return false;

Next in the Self-Attention algorithm, we need to determine the dependency coefficients and fill in the Score matrix. At this point, it's essential to recall our paradigm of creating classes capable of running both on the CPU and using the GPU tools. Each time we build a new process, we create a branching of the algorithm depending on the computing device in use, and this method will not be an exception. We will start with the implementation using MQL5 tools and return to the OpenCL branch a little later.

For convenience, we copy the result matrices of the m_cQuerys and m_cKeys convolutional layers into local matrix variables.

//--- branching of the algorithm by the computing device
   MATRIX out;
   if(!m_cOpenCL)
     {
      MATRIX querys = m_cQuerys.GetOutputs().m_mMatrix;
      MATRIX keys = m_cKeys.GetOutputs().m_mMatrix;

After completing the preparatory work, we need to "roll up our sleeves" and build a new process. The Self-Attention method involves row-wise normalization of the dependency coefficient matrix using the Softmax function.

The main feature of such normalization lies in obtaining a series of positive values that sum up to 1. Thus, by multiplying the normalized dependency coefficients with the values of Value vectors of the corresponding sequence elements and then summing up these vectors within one Query, we expect to obtain new vectors within the same range of values.
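For reference, here is the standard Self-Attention formulation from the original Transformer paper, where $d_k$ is the size of one Key vector (stored in our class in m_iKeysSize):

$$Score = softmax\left(\frac{Q \times K^T}{\sqrt{d_k}}\right), \qquad softmax(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$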

Let's look at the implementation of this process. First, we organize the calculation of the dependency coefficients in the Score matrix. According to the Self-Attention algorithm, each element of the matrix represents the dot product of the corresponding Query and Key vectors. In this case, the matrix row indicates the position of the vector in the Querys matrix, and its column indicates the position in the Keys matrix.

Here, it is important to carefully consider the choice of elements to be multiplied. Let's recall how we organized the output of the results to the buffer of the convolutional layer. To enable the operation of the pooling layer in the context of filters, we have organized the sequential output of filters. First, in the first row of the result buffer matrix, we output all the elements of the result of one filter. Then, in the next row, we write the elements of the next filter, and so on. This organization of the buffer is convenient for the transparent operation of the pooling layer within the filters. In this case, within the vector of one element of the sequence, we need to use one value from each filter. In other words, we need a transposed matrix.
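As a small illustration (the notation here is only for this example), suppose the layer has three filters and the sequence contains N elements. In its normal mode, the convolutional layer writes the results as

$$\begin{pmatrix} f_{1,1} & f_{1,2} & \cdots & f_{1,N} \\ f_{2,1} & f_{2,2} & \cdots & f_{2,N} \\ f_{3,1} & f_{3,2} & \cdots & f_{3,N} \end{pmatrix}$$

where $f_{i,j}$ is the output of filter $i$ for sequence element $j$. For the attention mechanism, each row should instead describe one element of the sequence; that is, we need the $N \times 3$ transpose of this matrix, in which row $j$ contains the vector $(f_{1,j}, f_{2,j}, f_{3,j})$.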

Reorganizing the buffer data in such a way that the first elements of all filters come first, then the second elements of all filters, and so on, would require additional resources on each feed-forward pass. It would be much easier to organize a convenient arrangement directly in the convolutional layer. However, this would disrupt the operation of both the pooling layer and subsequent convolutional layers when building convolutional models. Therefore, it was decided to introduce a flag into the operation of the convolutional layer that determines how the values are arranged in the result buffer. You may have already guessed this when I mentioned the new SetTransposedOutput method of the convolutional layer while describing the initialization method; back then, I promised to return to a description of its functionality. Such a solution has helped us keep the structure of the feed-forward pass method transparent and avoid additional time and resource costs for data reorganization. Let's finish working with the feed-forward pass method, and then we will revisit the changes in the convolutional layer.

Taking into account the transposed output of the convolutional layers, to obtain the matrix of dependency coefficients we need to multiply the Querys matrix by the transposed Keys matrix. It may sound a little strange to transpose the output of the convolutional layer and then transpose the Keys matrix again. However, we will use the transposed convolutional layer output more than once. Of course, with the help of the introduced flag, we could transpose only the output of the m_cQuerys layer and leave the m_cKeys layer unchanged. But in that case, there would be a risk of confusion with the matrix dimensions, which would make the code more difficult to read and understand. Therefore, I decided to unify the dimensions of the matrices used.

Please note that simultaneously with the calculation of the matrix product, we will prepare the data for normalization according to the Softmax formula above. For this purpose, we will immediately divide the obtained matrix by the square root of the Key vector size and take the exponent of the resulting values.

Then we will take the row-wise sums of the matrix values and divide each row of the Scores matrix by the corresponding element of the resulting vector. MQL5 matrix operations do not allow dividing a matrix by a vector. Therefore, we will organize a loop in which we sequentially divide each row by the sum of its values.

      //--- define Scores
      MATRIX scores = MathExp(querys.MatMul(keys.Transpose()) / sqrt(m_iKeysSize));
      //--- normalize Scores
      VECTOR summs = scores.Sum(1);
      for(int r = 0; r < m_iUnits; r++)
         if(!scores.Row(scores.Row(r) / summs[r], r))
            return false;
      m_cScores.m_mMatrix = scores;

After normalizing the data in the matrix of dependency coefficients between the elements of the sequence, we transfer these values to our data buffer m_cScores.

At this stage, we have computed and normalized the dependency coefficients between all elements of the sequence. Now, according to the Self-Attention algorithm, we need to calculate the weighted sum of the Value vectors for each Query. To do this, we just need to multiply the matrix of dependency coefficients by the result matrix of the convolutional layer m_cValues. Again, it is precisely because of the transposed output of the convolutional layer that we do not need to transpose the result matrix of the m_cValues layer.

      //--- the output of the attention block
      MATRIX values = m_cValues.GetOutputs().m_mMatrix;
      out = scores.MatMul(values);

The product of the matrices gives us the result of the Self-Attention mechanism. But we will go a little further and build the entire Encoder block of the Transformer. According to its algorithm, the results of Self-Attention are added to the buffer of the original data, and the obtained values are normalized within the neural layer. The following formulas are used to normalize the data.
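What follows is the standard layer normalization from the Transformer architecture (the trainable scale and offset parameters of the full formulation are omitted here): $\mu$ and $\sigma^2$ are the mean and variance over the $n$ elements of the normalized buffer, and the small constant $\epsilon$ protects against division by zero.

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2, \qquad y_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$$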

To perform this operation, we first bring the format of the Self-Attention block result matrix into accordance with the format of the initial data matrix and then add the two matrices. The result is normalized in a dedicated NormlizeBuffer method.

      //--- add to initial data and normalize
      if(!out.Reshape(prevLayer.Rows(), prevLayer.Cols()))
         return false;
      m_cAttentionOut.GetOutputs().m_mMatrix = out + 
                                             prevLayer.GetOutputs().m_mMatrix;
      if(!NormlizeBuffer(m_cAttentionOut.GetOutputs(), GetPointer(m_cStd), 0))
         return false;
     }

With this, the first block of operations is completed, and we close the MQL5 branch of the algorithm split by computing device. For the block of operations using OpenCL, we will temporarily return an error value and come back to it later.

   else // OpenCL block
     {
      return false;
     }

Let's continue working with the Encoder algorithm and move on to the second block of operations. Here it is necessary to pass the signal of each element of the sequence through two fully connected layers. As you remember, we decided to organize this work using two convolutional layers. At first glance, there is nothing complicated about it: we simply call the feed-forward methods of each convolutional layer sequentially.

//--- call the feed-forward methods of the Feed Forward block layers
   if(!m_cFF1.FeedForward(GetPointer(m_cAttentionOut)))
      return false;
   if(!m_cFF2.FeedForward(GetPointer(m_cFF1)))
      return false;

Here, correct operation is possible only thanks to the transposed output buffer of the convolutional neural layers. Only this arrangement allows the layers to operate on each individual element of the sequence.

After the forward pass through the two convolutional layers, just as after determining the attention results, it is necessary to add the obtained results to the data fed into the first convolutional layer and normalize the resulting sums. We have already performed such a task above. Here we use the same algorithm, only with different data buffers.

//--- add to the output of attention and normalize
   if(!m_cOutputs.SumArray(m_cAttentionOut.GetOutputs()))
      return false;
//--- normalize
   if(!NormlizeBuffer(m_cOutputs, GetPointer(m_cStd), 1))
      return false;
//---
   return true;
  }

It should be noted that thanks to the buffer substitution organized in the initialization method, we obtain the results of the second convolutional layer from the result buffer of the current layer. In the same buffer, we will save the results of data normalization.

After the completion of the operations, we exit the feed-forward method with a positive result.

Now let's take a look at the changes made to the convolutional layer class. First, we add the m_bTransposedOutput variable to store the output arrangement flag. This Boolean flag indicates the need to transpose the result matrix when writing it to the buffer. By default, we set its value to false, which corresponds to the normal mode of operation.

class CNeuronConv    :  public CNeuronProof
  {
protected:
   bool              m_bTransposedOutput;
 
public:
   bool              SetTransposedOutput(const bool value);
   ....
  };

To control the value of the flag, let's create the SetTransposedOutput method. The functionality of the method is quite simple: we save the flag value and resize the result and error gradient buffers accordingly.

bool CNeuronConv::SetTransposedOutput(const bool value)
  {
   m_bTransposedOutput = value;
   if(value)
     {
      if(!m_cOutputs.BufferInit(m_iNeurons, m_iWindowOut, 0))
         return false;
      if(!m_cGradients.BufferInit(m_iNeurons, m_iWindowOut, 0))
         return false;
     }
   else
     {
      if(!m_cOutputs.BufferInit(m_iWindowOut, m_iNeurons, 0))
         return false;
      if(!m_cGradients.BufferInit(m_iWindowOut, m_iNeurons, 0))
         return false;
     }
//---
   return true;
  }
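As a reminder of how the flag is used: in the attention layer initialization method described earlier, we enable the transposed output for the internal convolutional layers with calls of roughly the following form (a sketch; the exact sequence was shown when describing the initialization method).

   if(!m_cQuerys.SetTransposedOutput(true))
      return false;
   if(!m_cKeys.SetTransposedOutput(true))
      return false;

Similar calls are made for the other internal convolutional layers of the attention block.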

However, as you understand, the presence of the flag and even a method that changes it will not by itself affect the data output to the buffer. For the flag to take effect, we have to make some changes to the feed-forward method. We are not changing the algorithm or the calculation logic at all; the changes only involve rearranging the matrices when multiplying the input data by the weight matrix, depending on the state of the m_bTransposedOutput flag.

bool CNeuronConv::FeedForward(CNeuronBase *prevLayer)
  {
//--- control block
    ....
//--- branching the algorithm depending on the execution device
   if(!m_cOpenCL)
     {
    ....
      //--- Calculating the weighted sum of the elements of the input window
      if(m_bTransposedOutput)
         m = m.MatMul(m_cWeights.m_mMatrix.Transpose());
      else
         m = m_cWeights.m_mMatrix.MatMul(m.Transpose());
      m_cOutputs.m_mMatrix = m;
     }
   else  // OpenCL block
     {
    ....
     }
//---
   if(!m_cActivation.Activation(m_cOutputs))
      return false;
//---
   return true;
  }

After making changes to the feed-forward method, we need to make similar adjustments to the backpropagation methods because the error gradient must be propagated back along the same path to the point of error occurrence. Otherwise, the results of training the neural network will be unpredictable. First, we make changes to the method that distributes the gradient to the hidden layer, CNeuronConv::CalcHiddenGradient.

bool CNeuronConv::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- control block
    ....
//--- correction of error gradients to the derivative of the activation function
    ....
//--- branching the algorithm depending on the execution device
   CBufferType *input_gradient = prevLayer.GetGradients();
   if(!m_cOpenCL)
     {
      MATRIX g = m_cGradients.m_mMatrix;
      if(m_bTransposedOutput)
        {
         if(!g.Reshape(m_iNeurons, m_iWindowOut))
            return false;
        }
      else
        {
         if(!g.Reshape(m_iWindowOut, m_iNeurons))
            return false;
         g = g.Transpose();
        }
    ....
     }
   else  // OpenCL block
     {
    ....
     }
//---
   return true;
  }

Then we make the relevant changes in the CNeuronConv::CalcDeltaWeights method for distributing the gradient to the weight matrix level.

bool CNeuronConv::CalcDeltaWeights(CNeuronBase *prevLayer)
  {
//--- control block
    ....
//--- branching the algorithm depending on the execution device
   CBufferType *input_data = prevLayer.GetOutputs();
   if(!m_cOpenCL)
     {
    ....
      //---
      MATRIX g = m_cGradients.m_mMatrix;
      if(m_bTransposedOutput)
        {
         if(!g.Reshape(m_iNeurons, m_iWindowOut))
            return false;
         g = g.Transpose();
        }
      else
        {
         if(!g.Reshape(m_iWindowOut, m_iNeurons))
            return false;
        }
      m_cDeltaWeights.m_mMatrix += g.MatMul(inp);
     }
   else  // OpenCL block
     {
    ....
     }
//---
   return true;
  }

As you can see, the changes are not that significant, but they provide additional flexibility in configuring the layer.