Building Multi-Head Self-Attention in MQL5

When implementing the Multi-Head Self-Attention block, we can note its strong similarity with the previously considered Self-Attention block. This is not surprising, because Multi-Head Self-Attention is a logical development of Self-Attention technology. Therefore, when creating a new class, it would be quite logical to inherit not from the neural layer base class CNeuronBase but from the attention block class CNeuronAttention.

With this inheritance option, in addition to the methods and objects of the base class, we also inherit the objects of the CNeuronAttention class, including:

  • m_cQuerys — convolutional layer for the formation of the query tensor Query
  • m_cKeys — convolutional layer for the formation of the key tensor Key
  • m_cValues — convolutional layer for the formation of the value tensor Value
  • m_cScores — buffer for the matrix of dependency coefficients
  • m_cAttentionOut — base neural layer for recording the results of the Self-Attention block operation
  • m_cFF1 and m_cFF2 — convolutional layers of the Feed Forward block

As we defined in the section describing the architectural solution, all objects will be used for their intended purpose. We will only increase their size in proportion to the number of attention heads. Thus, to implement the Multi-Head Self-Attention algorithm, we just need to add the internal layer of the W0 matrix and a variable for recording the number of attention heads.
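
As a reminder, here is the computation we are implementing, where h is the number of attention heads and d is the size of the key vector of one sequence element:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) × W0
head_i = Attention(Q × WQ_i, K × WK_i, V × WV_i)
Attention(Q, K, V) = Softmax(Q × K^T / √d) × V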

class CNeuronMHAttention    :  public CNeuronAttention
  {
protected:
   CNeuronConv       m_cW0;
 
   int               m_iHeads;
 
public:
                     CNeuronMHAttention(void);
                    ~CNeuronMHAttention(void);
   //---
   virtual bool      Init(const CLayerDescription *desc) override;
   virtual bool      SetOpenCL(CMyOpenCL *opencl) override;
   virtual bool      FeedForward(CNeuronBase *prevLayer) override;
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer) override;
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer, bool read) override;
   virtual bool      UpdateWeights(int batch_size, TYPE learningRate,
                                   VECTOR &Beta, VECTOR &Lambda) override;
   //--- file operation methods
   virtual bool      Save(const int file_handle) override;
   virtual bool      Load(const int file_handle) override;
   //--- object identification method
   virtual int       Type(void) override const { return(defNeuronMHAttention);  }
  };

Regarding the class methods, we will override the standard set of methods:

  • Init — class initialization method
  • SetOpenCL — method for specifying the handle of the OpenCL context to be used
  • FeedForward — forward pass method
  • CalcHiddenGradient — method of distributing the gradient error through the hidden layer
  • CalcDeltaWeights — method of distributing the error gradient to the level of the matrix of weights of the current neural layer
  • UpdateWeights — method for updating the matrix of weights of the coefficients of the current neural layer
  • Save — method of saving neural layer data to a file
  • Load — method of loading neural layer data from a file
  • Type — method for identifying the type of neural layer

Let's start with the class constructor. It is here that we would create instances of the objects necessary for the full functioning of the class and initialize the internal variables with default values. Above, we declared only one new object, the convolutional layer m_cW0. We use static objects, just like in the parent class, so nothing needs to be created dynamically; in the constructor, we only have to specify the initial value for the number of attention heads. The class destructor remains empty.

CNeuronMHAttention::CNeuronMHAttention(void) :  m_iHeads(8)
  {
  }

In the next step, we will deal with the class initialization method. Although most of the objects were inherited from the parent class, we cannot use its initialization method, since the Multi-Head Self-Attention algorithm requires different tensor sizes for them. Therefore, we will have to rewrite the initialization method completely. At the same time, we will build it on an algorithm similar to the corresponding method of the parent class.

Like the similar methods of all previously discussed classes, this method receives in its parameters a pointer to an object describing the configuration of the neural layer being created. We immediately organize a block of checks of the received data. First of all, we check the validity of the received pointer. Only after confirming that the pointer is valid do we check its contents:

  • The type of the neural layer to be created in the configuration description must match the type of the class (the type parameter).
  • The layer you create must have at least one element of the sequence to be analyzed (the count parameter).
  • The size of the description vector of one source data element must be greater than zero (the window parameter).
  • The size of the key vector of one element of the sequence must be greater than zero (the window_out parameter).
  • There must be at least one attention head (the step parameter).

bool CNeuronMHAttention::Init(const CLayerDescription *desc)
  {
//--- check the initial data
   if(!desc || desc.type != Type() ||
      desc.count <= 0 || desc.window <= 0 || desc.window_out <= 0 ||
      desc.step <= 0)
      return false;

It probably looks strange to use the step parameter to specify the number of attention heads. But, as you may recall, within the implementation of attention mechanisms, the step size of the input data window is always equal to the size of the window itself. Therefore, this parameter is free. To avoid an unnecessary increase in the size of the neural layer description object, we decided to make the most efficient use of the existing class variables. However, if code readability is a higher priority for you, you can always define the necessary number of variables to describe the architecture of the neural layer being created and name them accordingly.
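
For illustration, here is how the description of such a layer might be filled on the main program side. This is only a sketch: the numeric values are arbitrary, and the remaining fields (such as the optimization method) are set according to your model configuration.

CLayerDescription *desc = new CLayerDescription();
desc.type       = defNeuronMHAttention; // multi-head attention layer
desc.count      = 40;                   // elements in the analyzed sequence
desc.window     = 10;                   // description vector size of one element
desc.window_out = 8;                    // key vector size of one element
desc.step       = 8;                    // number of attention heads
desc.activation = AF_NONE;              // activation transferred to the FF2 layer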

After successfully passing through the control block, we will save the key parameters of the description of the neural layer being created into local variables.

//--- saving the constants
   m_iWindow = desc.window;
   m_iUnits = desc.count;
   m_iKeysSize = desc.window_out;
   m_iHeads = desc.step;

As in the similar methods of all previously discussed classes, the next step is to call a method of the base neural layer, in which the inherited objects are initialized. We cannot call the method of the parent class, because it would create objects of the wrong sizes, which we would then have to modify, doing the same job twice. Therefore, we "jump over the head" of the parent and directly access the method of the base class.

Please note that before calling the method of the base class, we need to make some adjustments to the description of the architecture of the neural layer being created. At the same time, we do not know what plans the user has for the layer description object received in the parameters. Remember what we discussed about objects and the pointers to them: in the parameters, we received a pointer to an object, so any changes we make to it will be reflected on the side of the main program. If the user reuses a single object to describe multiple neural layers, there is a high probability of an error when creating subsequent layers, or of layers being created with an incorrect architecture. Therefore, we will create a new object to describe the architecture of the neural layer and populate it with the necessary parameters.

In the parent class, we worked out a technique of substituting the pointers to the result and error gradient buffers. Therefore, it does not matter with what sizes these buffers are created in the base class method: any values greater than zero for the layer size and result window will do. To avoid performing unnecessary operations, we keep them minimal.

To avoid creating unnecessary objects, we set the size of the source data window to zero and disable the activation function.

We keep the neural layer type that we received in the description from the user.

Next, we call the method of the base neural layer, passing it the prepared description.

//--- call the initialization method of the parent class
   CLayerDescription *temp = new CLayerDescription();
   if(!temp)
      return false;
   temp.type = desc.type;
   temp.optimization = desc.optimization;
   temp.activation = AF_NONE;
   temp.count = desc.count;
   temp.window_out = 1;
   temp.window = 0;
   if(!CNeuronBase::Init(temp))
     {
      delete temp;
      return false;
     }

In the neural layer description prepared above, we change the type of the created object and its size. This is enough to create an object for the concatenated results of all attention heads.

//--- initialize AttentionOut
   temp.type = defNeuronBase;
   temp.count = (int)(m_iUnits * m_iKeysSize * m_iHeads);
   if(!m_cAttentionOut.Init(temp))
     {
      delete temp;
      return false;
     }
   if(!m_cAttentionOut.GetOutputs().m_mMatrix.Reshape(m_iUnits, m_iKeysSize * m_iHeads) ||
      !m_cAttentionOut.GetGradients().m_mMatrix.Reshape(m_iUnits, m_iKeysSize * m_iHeads))
      return false;

After initializing the object, we slightly change the format of its result and error gradient buffers, reshaping them into matrices with one row per element of the sequence.

Next, we have to create internal convolutional neural layers. First, we will create internal neural layers to form the Query, Key, and Value tensors. All of them receive a sequence of initial data as input. Therefore, in the window and step parameters, we will specify the size of the vector describing one element of the source data sequence.

The number of filters of the used convolutional layer, specified in the window_out parameter, should correspond to the size of the key vector of one element of the sequence. However, when discussing the architectural solution of this class, we determined the use of concatenated tensors. Therefore, we will increase the number of filters in proportion to the number of attention heads created.

The number of elements in the sequence at all stages remains constant. Therefore, we can write to the count parameter the number of elements of the original sequence received from an external program.

The Multi-Head Self-Attention architecture does not provide an activation function for the neural layers that are created. Therefore, in the activation parameter, we leave the constant AF_NONE.

The optimization method for the parameters of all neural layers is the same, and we leave this parameter unchanged.

//--- create a description for the inner neural layers
   if(!temp)
      return false;
   temp.type = defNeuronConv;
   temp.window = m_iWindow;
   temp.window_out = (int)(m_iKeysSize * m_iHeads);
   temp.step = m_iWindow;
   temp.count = m_iUnits;

First, we initialize the inner layer to create the query tensor Query. We check the result of the operation in order to exclude possible critical errors in the further execution of the method code.

//--- initializing Querys
   if(!m_cQuerys.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cQuerys.SetTransposedOutput(true);

After successful initialization of the convolutional neural layer, we set the flag to transpose the result tensor. I'd like to remind you that we introduced this flag to enable the retrieval of a result tensor in which each row contains elements not from a single filter but from all filters for one sequence element.
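
Schematically, with the flag set, the result matrix of m_cQuerys takes the following form (assuming the filters of each attention head are laid out sequentially within a row):

// one row per sequence element, all filters of all heads in one row:
// element 0: [ head 1 query | head 2 query | ... | head h query ]
// element 1: [ head 1 query | head 2 query | ... | head h query ]
// ...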

Similarly, we initialize convolutional neural layer objects to create Key and Value tensors.

//--- initialize Keys
   if(!m_cKeys.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cKeys.SetTransposedOutput(true);

Please note that during the initialization of the convolutional neural layer object forming the Value tensor, we do not align the number of used filters with the size of the input data window, as was done in the single-head attention class CNeuronAttention. The use of the W0 matrix allows us to avoid this rule. Reducing the dimensionality of the vector can indeed help save resources and reduce the execution time of operations. After recreating the complete algorithm of the Multi-Head Self-Attention method, you will be able to assess the advantages and disadvantages of such an implementation on practical examples.

//--- initialize Values
   if(!m_cValues.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cValues.SetTransposedOutput(true);
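
To summarize the sizes at this point, the tensors pass through the block with the following dimensions (using the variable names introduced above):

// input data                      : m_iUnits x m_iWindow
// Query, Key, Values              : m_iUnits x (m_iKeysSize * m_iHeads)
// attention output (concatenated) : m_iUnits x (m_iKeysSize * m_iHeads)
// output of W0                    : m_iUnits x m_iWindow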

After initializing the first group of internal convolutional layers, following the algorithm of the Multi-Head Self-Attention mechanism, we initialize the buffer for the dependency coefficient matrix m_cScores. Fill it with zero values, specifying the required buffer size. Again, let's draw a parallel with the CNeuronAttention class. If previously we created a square matrix with a side length equal to the number of elements in the sequence, now we need as many of these matrices as there are attention heads. At the same time, we have agreed to use a concatenated matrix. Therefore, we will increase the buffer size in proportion to the number of attention heads used. Unfortunately, MQL5 does not support three-dimensional matrices. Within the two-dimensional matrix, we will use rows to distribute the buffer across attention heads.

//--- initialize Scores
   if(!m_cScores.BufferInit(m_iHeads, m_iUnits * m_iUnits))
     {
      delete temp;
      return false;
     }
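
To illustrate this layout, an individual dependency coefficient could be read as shown below. The helper function is hypothetical and is not part of the class; it only demonstrates the index arithmetic, assuming each row stores its units-by-units matrix in row-major order.

// hypothetical helper: coefficient of attention head h between
// query element i and key element j in the concatenated buffer
double ScoreAt(CBufferType *scores, const int h, const int i,
               const int j, const int units)
  {
   return scores.m_mMatrix[h][i * units + j];
  }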

Now it's time to initialize the additional convolutional layer that performs the functionality of matrix W0 in the Multi-Head Self-Attention algorithm. Let's adjust the description of the architecture of the neural layer being created.

The type of neural layer to be created has already been specified, so we don't need to specify it again.

We determine the size of the input data window as the product of the size of the description vector of one sequence element in the Values tensor and the number of attention heads. In this implementation, we made that vector the same size as the key vector in the Key tensor. So, the size of the input data window is determined as the product of the key vector size of one sequence element and the number of attention heads (m_iKeysSize * m_iHeads).

We will equate the size of the step of the source data window to the size of the window itself.

According to the Multi-Head Self-Attention algorithm, matrix W0 is used to align the sizes of the tensor of results from the multi-head attention block with the tensor of input data. Therefore, we will specify the number of filters in this convolutional layer equal to the size of the description vector of one element of the sequence of initial data fed to the input of the Multi-Head Self-Attention block.

The Multi-Head Self-Attention algorithm does not provide an activation function for this matrix. Therefore, in the appropriate field, we leave the AF_NONE constant.

The optimization method for the weight matrices of all layers in the neural network, including the internal layers of individual blocks, is the same. Therefore, we leave the parameters indicating the optimization method used unchanged.

//--- initialize W0
   temp.window = (int)(m_iKeysSize * m_iHeads);
   temp.step = temp.window;
   temp.window_out = m_iWindow;
   if(!m_cW0.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cW0.SetTransposedOutput(true);

After specifying all the necessary parameters for describing the created neural layer, we call the initialization method of our convolutional neural layer m_cW0.Init and check the results of the operations.

At the end of the initialization block of the convolutional layer m_cW0 we set the flag for transposing the result tensor.

This concludes the work on initializing the objects of the Multi-Head Self-Attention block. Next, let's move on to work on the Feed Forward block. The functionality and architecture of this block are completely transferred from the CNeuronAttention class. However, since we had to completely redefine the initialization method of the class, we will repeat the actions for initializing the internal layers m_cFF1 and m_cFF2.

The algorithm for initializing the neural layer remains the same. We will prepare a description of the neural layer to be created and call the method of its initialization. To describe the convolutional neural layer m_cFF1, we will use the description object of the convolutional neural layer which has already been used more than once in this method. Therefore, we will only specify the parameters that are being changed, as the rest are already contained in the neural layer description object.

  • The size of the source data window (window) is equal to the size of the description vector of one element of the source data tensor sequence fed to the input of our Multi-Head Self-Attention block. We receive this parameter from an external program and save it in the m_iWindow variable. Consequently, we can pass the value of the specified variable as a parameter.
  • We will set the step size of the input data window (step) equal to the size of the input data window itself.
  • Number of filters used (window_out): according to the transformer architecture proposed by the authors, the output size of the first layer of the Feed Forward block is four times larger than the size of the original data. Let's use this coefficient. However, during the implementation of your practical tasks, you can always modify this coefficient or even add it to the configuration description of the created neural layer and conduct practical tests to determine the most suitable coefficient for your specific tasks.
  • The activation function (activation): for this layer, the authors suggest using ReLU. We replaced it with the similar Swish function, whose graph closely follows that of ReLU but which has no kinks and is differentiable over its entire domain (see the formula after this list).
  • The optimization parameters of the weight matrix remain unchanged.
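
For reference, the Swish function with parameter β (here β = 1, as set in activation_params below) is computed as:

Swish(x) = x × σ(β × x) = x / (1 + e^(−β × x))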

//--- initialize FF1
   temp.window = m_iWindow;
   temp.step = temp.window;
   temp.window_out = temp.window * 4;
   temp.activation = AF_SWISH;
   temp.activation_params[0] = 1;
   temp.activation_params[1] = 0;
   if(!m_cFF1.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cFF1.SetTransposedOutput(true);

After we have specified all the parameters in the configuration description of the created convolutional neural layer, we will call its initialization method and check the result of the operations.

Only upon successful initialization of the convolutional neural layer object, we will set the flag for transposing the result tensor.

Now we can proceed to initialize the last object used in the class — the second convolutional layer of the Feed Forward block, m_cFF2. As a result of this neural layer's operation, we return to the dimensions of the original data tensor. Therefore, in the description object of the created neural layer, we need to swap the values of the input data window and the number of used filters. Typically, such an operation requires a local variable to temporarily store one of the values. But in our case, the window size and window step parameters are equal, so we can do without one. First, we write the number of filters of the previous layer into the source data window size parameter. Next, in the number of filters parameter, we specify the window step value of the previous convolutional layer. Finally, we set the step of the source data window equal to the window size.

The transformer architecture does not provide an activation function for this layer. But we will give the user an opportunity to experiment. To do this, we transfer the activation function and its parameters from the user-provided architecture description into the description of the layer being created.

//--- initialize FF2
   temp.window = temp.window_out;
   temp.window_out = temp.step;
   temp.step = temp.window;
   temp.activation = desc.activation;
   temp.activation_params = desc.activation_params;
   if(!m_cFF2.Init(temp))
     {
      delete temp;
      return false;
     }
   m_cFF2.SetTransposedOutput(true);
   delete temp;

Once all the necessary parameters for describing the structure of the created neural layer are specified, we call its initialization method and set the flag for transposing the result tensor. At the same time, do not forget to check the results of the operations.

Now that all the necessary objects are initialized, we can safely delete the local neural layer description object without any risk of error.

Next, we apply the technique refined in the CNeuronAttention class and substitute the pointers to the result and error gradient buffers of our multi-head attention class with the corresponding buffers of the internal convolutional layer m_cFF2. This allows us to eliminate unnecessary data copying between buffers and saves the memory that duplicate data would occupy. To do this, we first check the pointers and, if necessary, delete the previously created objects that are no longer needed. Then we store the pointers to the buffers of the convolutional layer m_cFF2 in the corresponding variables.

//--- to avoid copying buffers, replace them
   if(!SetOutputs(m_cFF2.GetOutputs()))
      return false;
   if(m_cGradients)
      delete m_cGradients;
   m_cGradients = m_cFF2.GetGradients();
//---
   SetOpenCL(m_cOpenCL);
//---
   return true;
  }

In conclusion, we pass the pointer to the OpenCL context in use to all the objects used in the method. After that, we exit the method with a positive result.

This concludes our work on the class initialization method. However, one question remains open. At the end of the initialization method, we called the method for passing the OpenCL context pointer. We haven't overridden it yet, so the similar method of the parent class would be called instead. It is functional enough, but it does not cover the objects declared in the body of this class. There is only one such object: the convolutional layer m_cW0. Therefore, the method will be relatively short.

Like the similar methods of all the previously discussed classes, the CNeuronMHAttention::SetOpenCL method receives in its parameters a pointer to the object for working with the OpenCL context, which we have to distribute to all internal objects. First, it would be necessary to check the validity of the received pointer. Instead, we call the similar method of the parent class, which already contains all the controls and passes the pointer to the inherited objects. Thus, after the parent class method completes, we only have to pass the pointer to the new objects declared in the body of this class. However, in this case, we pass not the pointer received in the parameters but the pointer stored in the member variable inherited from the parent class. The reason is that the parent class method has already validated the received pointer, saved it to that variable, and passed it to all the objects we inherited. Therefore, in order for all objects to work in the same context, we pass the already validated pointer to the internal objects.

bool CNeuronMHAttention::SetOpenCL(CMyOpenCL *opencl)
  {
//--- call a method of a parent class
   CNeuronAttention::SetOpenCL(opencl);
//--- call a similar method for the internal layer
   m_cW0.SetOpenCL(m_cOpenCL);
//---
   return(!!m_cOpenCL);
  }

After passing the pointer to all internal objects (in this case, a single convolutional layer), we exit the method and return a result indicating the validity of the context pointer in use.

With that, we conclude the process of creating and initializing our multi-head attention class object and move on to the next stage, which is setting up the feed-forward pass.