Building a GPT model using MQL5

Before you start working on the GPT model, don't expect to get some kind of beast at the end of this section that can solve any problem. We are only building the model algorithms; how well they perform will depend on the computational resources involved. Of course, we will run these algorithms and evaluate their results. But first things first.

Let's briefly recap the algorithm:

  1. The Multi-Head Self-Attention block receives as input a tensor of initial data, in which each element of the sequence is represented by a token (a vector of values).

One sequence is used for all heads (threads). The actions in steps 2-6 are identical for each attention head.

  2. For each token, three vectors (Query, Key, Value) are calculated by multiplying the token vector by the corresponding trainable weight matrix W.
  3. By multiplying the Query and Key vectors, we determine the pairwise dependencies between the elements of the sequence. At this step, the Query vector of each element of the sequence is multiplied by the Key vectors of the current and all previous elements of the sequence.
  4. The matrix of the obtained dependence coefficients is normalized using the Softmax function in the context of each query (Query). A zero attention coefficient is set for the subsequent elements of the sequence.
  5. As a result of steps 3 and 4, we get a square Score matrix with a dimension equal to the number of elements in the sequence, where the sum of all elements in the context of each Query is equal to one.
  6. Then we multiply the normalized attention coefficients by the Value vectors of the corresponding elements of the sequence, add the resulting vectors, and obtain the attention-adjusted value for each element of the sequence.
  7. Next, we determine the weighted attention result. To do this, we multiply the concatenated tensor of the results of all attention heads by the trainable matrix W0.
  8. The resulting tensor is added to the input sequence and normalized.
  9. The Multi-Head Self-Attention mechanism is followed by the two fully connected layers of the Feed Forward block. The first (hidden) layer contains four times as many neurons as the input sequence and uses the ReLU activation function (we use the Swish function instead). The dimension of the second layer is equal to the dimension of the input sequence, and its neurons do not use an activation function.
  10. The result of the fully connected layers is summed with the tensor fed into the Feed Forward block, and the resulting tensor is normalized.
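
For reference, steps 2 through 6 for a single attention head correspond to the standard scaled dot-product attention formula, in which the causal mask of step 4 zeroes the coefficients of future elements:

\[ \mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \]

Here \(d_k\) is the size of the Key vector of one sequence element (the m_iKeysSize variable of the class below).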

Now that we have refreshed the basic steps of the process, let's proceed with the implementation. To implement the new type of neural layer, let's create a new class CNeuronGPT, inheriting from the CNeuronBase neural layer base class of our model. Despite using the Self-Attention algorithm in the model, I chose not to inherit from our existing classes of neural layers using attention mechanisms. This is due to some peculiarities in the model implementation, which we will become familiar with during the process.

Perhaps one of the main differences is the ability to build multiple homogeneous layers within one class. Previously, we used separate layers to implement parts of the model functionality, while now we are talking about creating several full-fledged copies of the layer, each with its own weights. To achieve this, in the body of the class we declare not individual neural layers but entire collections of layers. Among them, you will see variable names familiar from the previous classes, but they now refer to collections of neural layers. At the same time, we have preserved the functionality hidden behind the object names. Additionally, we have added two new variables:

  • m_iLayers — number of neural layers in the block
  • m_iCurrentPosition — number of the current element in the sequence

class CNeuronGPT    :  public CNeuronBase
  {
protected:
   CArrayLayers      m_cQuerys;
   CArrayLayers      m_cKeys;
   CArrayLayers      m_cValues;
   CArrayLayers      m_cScores;
   CArrayLayers      m_cAttentionOut;
   CArrayLayers      m_cW0;
   CArrayLayers      m_cFF1;
   CArrayLayers      m_cFF2;
   //---
   int               m_iLayers;
   int               m_iWindow;
   int               m_iUnits;
   int               m_iKeysSize;
   int               m_iHeads;
   CBufferType       m_dStd[];
   int               m_iCurrentPosition;
   int               m_iScoreTemp;
 
   virtual bool      NormlizeBuffer(CBufferType *buffer, CBufferType *std,
                                                              uint std_shift);
   virtual bool      NormlizeBufferGradient(CBufferType *output,
                     CBufferType *gradient, CBufferType *std, uint std_shift);
public:
                     CNeuronGPT(void);
                    ~CNeuronGPT(void);
   //---
   virtual bool      Init(const CLayerDescription *desc) override;
   virtual bool      SetOpenCL(CMyOpenCL *opencl) override;
   virtual bool      FeedForward(CNeuronBase *prevLayer) override;
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer) override;
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer, bool read) override;
   virtual bool      UpdateWeights(int batch_size, TYPE learningRate,
                                   VECTOR &Beta, VECTOR &Lambda) override;
   //---
   virtual int       GetUnits(void) const { return m_iUnits;   }
   virtual int       GetLayers(void) const { return m_iLayers; }
   //--- methods for operations with files
   virtual bool      Save(const int file_handle) override;
   virtual bool      Load(const int file_handle) override;
   //--- object identification methods
   virtual int       Type(void) const override { return(defNeuronGPT);  }
  };

The addition of the m_iCurrentPosition variable is the second architectural feature of this model. We have already said that GPT is an autoregressive model: at each step, it returns one element of the sequence, which is fed back to the input at the next iteration. We mentioned something similar when discussing recurrent models. However, in recurrent models the hidden state was added to the current state of the environment, whereas a GPT language model generates the new state itself. Of course, when working with financial markets, we deviate slightly from this feedback loop and feed in the actual new state, but we preserve the signal processing principles.

The logic is as follows: if only one element of the sequence is updated at each new iteration, there is no need to recalculate the same values every time; that would be inefficient. Instead, we recalculate only the new element of the sequence and, for the previous elements, reuse the values from earlier iterations. This is why we introduce the m_iCurrentPosition variable, which stores the index of the current element in the sequence. We will get acquainted with its usage as we proceed with the implementation.
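
Here is a minimal sketch of this idea (the actual feed-forward code comes later in this section): the position index is advanced as a ring buffer, so only one row of the Keys and Values buffers has to be refreshed per iteration.

//--- sketch only: advance the index of the current sequence element
   m_iCurrentPosition++;
   if(m_iCurrentPosition >= m_iUnits)
      m_iCurrentPosition = 0;
//--- only row m_iCurrentPosition of the Keys and Values buffers receives
//--- the freshly computed vectors of the new state; the remaining
//--- m_iUnits - 1 rows keep the values calculated at previous iterations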

Let's take things step by step. As usual, we start working on the class methods with the constructor. In it, we initialize the variables with their initial values. As in the attention mechanism classes discussed earlier, we use static objects, so there is nothing to instantiate in the class constructor. The class destructor remains empty.

CNeuronGPT::CNeuronGPT(void) :   m_iHeads(8),
                                 m_iWindow(0),
                                 m_iKeysSize(0),
                                 m_iUnits(0),
                                 m_iLayers(0),
                                 m_iCurrentPosition(0)
  {
  }

Following our previously used pattern of working with classes, next, we will construct the initialization method of the class. This method is inherited from the parent class CNeuronBase and is overridden in each new class.

In the parameters, the method receives a pointer to an object describing the created neural layer, and we immediately perform a validity check on the received pointer, as well as verify the presence of the specified minimum necessary parameters for the correct initialization of the class instance.

bool CNeuronGPT::Init(const CLayerDescription *desc)
  {
//--- checking the initial data
   if(!desc || desc.type != Type() || desc.count <= 0 || desc.window <= 0 ||
      desc.window_out <= 0 || desc.step <= 0 || desc.layers <= 0)
      return false;

After successfully passing the control block, we save the received parameters to the appropriate variables of our class.

//--- save the constants
   m_iWindow   = desc.window;
   m_iUnits    = desc.count;
   m_iKeysSize = desc.window_out;
   m_iHeads    = desc.step;
   m_iLayers   = desc.layers;
   if(!ArrayResize(m_dStd, m_iLayers))
      return false;
   for(int l = 0; l < m_iLayers; l++)
      if(!m_dStd[l].BufferInit(1, 2, 1))
         return false;
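
To make the mapping between the description fields and the class constants concrete, here is a purely illustrative description of a GPT block; the specific values are arbitrary and serve only as an example:

//--- illustrative example of describing a GPT block (values are arbitrary)
   CLayerDescription *gpt = new CLayerDescription();
   gpt.type       = defNeuronGPT;
   gpt.count      = 60;   // m_iUnits: depth of the analyzed sequence
   gpt.window     = 10;   // m_iWindow: size of one element's description vector
   gpt.window_out = 8;    // m_iKeysSize: size of the Key vector of one element
   gpt.step       = 4;    // m_iHeads: number of attention heads
   gpt.layers     = 3;    // m_iLayers: number of homogeneous GPT layers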

Then, similar to the previously created classes using the attention mechanism, we will slightly adjust the description of the created neural layer and call the initialization method of the parent class. I would like to remind you that in the description of the created neural layer, we set the window size parameter of the input data to zero before calling the method of the parent class. This allows us to remove unused buffer objects from the parent class.

//--- call the initialization method of the parent class
   CLayerDescription *temp = new CLayerDescription();
   if(!temp || !temp.Copy(desc))
      return false;
   temp.window_out = 1;
   temp.window     = 0;
   temp.activation = AF_NONE;
   if(!CNeuronBase::Init(temp))
     {
      delete temp;
      return false;
     }
   delete temp;

After that, we create a loop with the number of iterations equal to the number of homogeneous neural layers created. All other objects will be created in the body of this loop.

//--- run a loop to create objects of internal layers
   for(int layer = 0; layer < m_iLayers; layer++)
     {

The operations in the loop body are very similar to the operations performed in the class initialization methods using the Self-Attention mechanism, but there are still differences.

Firstly, within the loop body, we create an instance of the CLayerDescription object to describe the neural layers being created and fill it with the necessary data. Since we have decided to feed the neural network only the state update rather than the entire pattern, I chose to forgo convolutional neural layers and opted for a basic fully connected layer. Therefore, in the type field of the layer description object, we set the constant defNeuronBase. The input data window size will be equal to the size of the vector describing one element of the sequence, because the entire volume of input data is treated as the description of a single element of the sequence.

Next, we recall that the model uses the Multi-Head Self-Attention mechanism, so from a single vector of initial data we need to create three vectors (Query, Key, Value) for each attention head. Let me remind you of another detail: when implementing Multi-Head Self-Attention, we used concatenated vectors. Now we are going further: we will not only create a single tensor for all attention heads, but we will also combine all three of the entities mentioned above (Query, Key, Value) in it at once. However, since it will contain only one element of the sequence, its size will not be that large. In the count field, we specify a size sufficient for three vectors of one element of the Key tensor for each attention head, that is, 3 * m_iKeysSize * m_iHeads. As before, the new layer will not have an activation function. We will use the parameter optimization method specified by the user in the neural layer description received in the method parameters.

      temp = new CLayerDescription();
      if(!temp)
         return false;
      temp.type = defNeuronBase;
      temp.window = m_iWindow;
      temp.count = (int)(3 * m_iKeysSize * m_iHeads);
      temp.activation = AF_NONE;
      temp.optimization = desc.optimization;

After creating the neural layer description object and specifying all the necessary parameters, we create the first internal neural layer, Querys, and initialize it using the prepared description object. It is essential to monitor the result of every operation. Only after the first two operations complete successfully do we add the layer to the corresponding collection.

      //--- initialize Querys
      CNeuronBase *Querys = new CNeuronBase();
      if(!Querys)
        {
         delete temp;
         return false;
        }
      if(!Querys.Init(temp))
        {
         delete Querys;
         delete temp;
         return false;
        }
      if(!m_cQuerys.Add(Querys))
        {
         delete Querys;
         delete temp;
         return false;
        }

Despite creating a concatenated tensor, we have kept the name Querys for the neural layer, maintaining continuity with the previously created attention mechanism classes. However, we will also create internal neural layers for Keys and Values, although with different parameters.

We will use the internal neural layers Keys and Values to accumulate historical data on the received current states. It is, so to speak, the memory of our neural layer, and it should be sufficient to store the entire pattern being analyzed. However, since we have already calculated the state of these vectors in the fully connected neural layer Querys, we do not need matrices of weights in them. Therefore, before initializing the mentioned internal neural layers, we will make a change to the description object of the neural layer: we will set the size of the input data window to zero and ensure that the neural layer has enough elements to store the entire pattern description tensor.

      //--- initialize Keys
      CNeuronBase *Keys = new CNeuronBase();
      if(!Keys)
        {
         delete temp;
         return false;
        }
      temp.window = 0;
      temp.count = (int)(m_iUnits * m_iKeysSize * m_iHeads);
      if(!Keys.Init(temp))
        {
         delete Keys;
         delete temp;
         return false;
        }
      if(!Keys.GetOutputs().Reshape(m_iUnits, m_iKeysSize * m_iHeads))
         return false;
      if(!m_cKeys.Add(Keys))
        {
         delete Keys;
         delete temp;
         return false;
        }

The rest of the algorithm for creating an internal neural layer is similar to creating the Querys layer:

  • Create a new instance of the neural layer object.
  • Initialize the neural layer.
  • Add the neural layer to the corresponding collection.

      //--- initialize Values
      CNeuronBase *Values = new CNeuronBase();
      if(!Values)
        {
         delete temp;
         return false;
        }
      if(!Values.Init(temp))
        {
         delete Values;
         delete temp;
         return false;
        }
      if(!Values.GetOutputs().Reshape(m_iUnits, m_iKeysSize * m_iHeads))
         return false;
      if(!m_cValues.Add(Values))
        {
         delete Values;
         delete temp;
         return false;
        }
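
Since this create, initialize, and add pattern repeats for every internal layer, it could in principle be wrapped in a small helper function. The sketch below is hypothetical and is not part of the library code, which deliberately keeps the explicit form:

//--- hypothetical helper illustrating the repeating pattern (not library code)
bool AddInternalLayer(CArrayLayers &collection, CLayerDescription *desc)
  {
   CNeuronBase *layer = new CNeuronBase();
   if(!layer)
      return false;
   if(!layer.Init(desc) || !collection.Add(layer))
     {
      delete layer;
      return false;
     }
   return true;
  }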

After creating the Querys, Keys, and Values neural layers, we proceed to create the Score dependency coefficient matrix. There are implementation nuances here as well. In the Self-Attention algorithm, this matrix is square, with each side equal to the number of elements in the sequence. Each element of the matrix represents the coefficient of the pairwise relationship between elements of the sequence: the rows of the matrix correspond to the vectors of the Query tensor, and the columns correspond to the vectors of the Key tensor.

Now, let's think about what such a matrix looks like when we have a single Query vector describing only the last state: the Score matrix degenerates into a vector, one per attention head. For example, with m_iUnits = 60 and m_iHeads = 4, instead of four 60×60 matrices we only need four vectors of 60 coefficients each. Like Keys and Values, the neural layer holding the Score dependency coefficients does not contain a weight matrix. Therefore, we adjust the number of elements in the neural layer and create a new internal layer using the algorithm described above. Let's take advantage of the opportunity and make the buffer a rectangular matrix whose rows correspond to the attention heads.

      //--- initialize Scores
      CNeuronBase *Scores = new CNeuronBase();
      if(!Scores)
        {
         delete temp;
         return false;
        }
      temp.count = (int)(m_iUnits * m_iHeads);
      if(!Scores.Init(temp))
        {
         delete Scores;
         delete temp;
         return false;
        }
      if(!Scores.GetOutputs().Reshape(m_iHeads, m_iUnits))
         return false;
      if(!m_cScores.Add(Scores))
        {
         delete Scores;
         delete temp;
         return false;
        }

The next object we will create is a neural layer for the concatenated output of the attention heads, AttentionOut. Here, the situation is similar to that of the dependency coefficient matrix. We have already discussed why the matrix of dependency coefficients degenerates into a vector; to obtain the result of an attention head according to the Self-Attention algorithm, we need to multiply the matrix of dependency coefficients by the Value tensor.

But in our case, with one Query vector at the output, we also get one vector for each attention head. Therefore, we will specify the correct layer size and execute the algorithm for its initialization.

      //--- initialize AttentionOut
      CNeuronBase *AttentionOut = new CNeuronBase();
      if(!AttentionOut)
        {
         delete temp;
         return false;
        }
      temp.count = (int)(m_iKeysSize * m_iHeads);
      if(!AttentionOut.Init(temp))
        {
         delete AttentionOut;
         delete temp;
         return false;
        }
      if(!AttentionOut.GetOutputs().Reshape(m_iHeads, m_iKeysSize))
         return false;
      if(!m_cAttentionOut.Add(AttentionOut))
        {
         delete AttentionOut;
         delete temp;
         return false;
        }

Following the multi-head attention algorithm, our next step will be to organize the results of all attention heads into a unified vector and adjust its size to match the size of the input data vector. In the algorithm of the Multi-Head Self-Attention mechanism, this operation is performed using the W0 matrix. However, we will perform this operation using a basic fully connected neural layer without an activation function.

Again, we will create a new instance of the neural layer object. Do not forget to check the result of the operation.

      //--- initialize W0
      CNeuronBase *W0 = new CNeuronBase();
      if(!W0)
        {
         delete temp;
         return false;
        }

In the neural layer description object, we enter the necessary parameters:

  • The size of the input data window is equal to the size of the previously created layer for the concatenated results of attention heads.
  • The number of elements at the output of the neural layer is equal to the size of the source data vector.
  • The activation function is not used.

We initialize the neural layer using the neural layer description object.

      temp.window = temp.count;
      temp.count = m_iWindow;
      temp.activation = AF_NONE;
      if(!W0.Init(temp))
        {
         delete W0;
         delete temp;
         return false;
        }
      if(!m_cW0.Add(W0))
        {
         delete W0;
         delete temp;
         return false;
        }

After the successful initialization of the neural layer object, we add it to the appropriate collection.

This concludes the work on initializing the objects of the Multi-Head Self-Attention mechanism, and we just have to create two neural layers of the Feed Forward block. The first neural layer has four times as many neurons in its output as the tensor received as input, and it is activated using the Swish function.

      //--- initialize FF1
      CNeuronBase *FF1 = new CNeuronBase();
      if(!FF1)
        {
         delete temp;
         return false;
        }
      temp.window = m_iWindow;
      temp.count = temp.window * 4;
      temp.activation = AF_SWISH;
      temp.activation_params[0] = 1;
      temp.activation_params[1] = 0;
      if(!FF1.Init(temp))
        {
         delete FF1;
         delete temp;
         return false;
        }
      if(!m_cFF1.Add(FF1))
        {
         delete FF1;
         delete temp;
         return false;
        }
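
As a reminder, the Swish function is defined as

\[ \mathrm{swish}(x)=x\cdot\sigma(\beta x)=\frac{x}{1+e^{-\beta x}} \]

The value 1 written to activation_params[0] above corresponds to β = 1, at which Swish coincides with the SiLU function.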

The second neural layer of the Feed Forward block has no activation function; it returns the tensor to the size of the initial data. Here we also use a basic fully connected neural layer. We make the necessary adjustments to the neural layer description object and initialize the layer.

      //--- initialize FF2
      CNeuronBase *FF2 = new CNeuronBase();
      if(!FF2)
        {
         delete temp;
         return false;
        }
      temp.window = temp.count;
      temp.count = m_iWindow;
      temp.activation = AF_NONE;
      if(!FF2.Init(temp))
        {
         delete FF2;
         delete temp;
         return false;
        }
      if(!m_cFF2.Add(FF2))
        {
         delete FF2;
         delete temp;
         return false;
        }
      delete temp;
     }

We check the results of the operations at each step and add the created neural layer to the appropriate collection.

At this stage, we have created all the objects necessary for the operation of a single neural layer. We remove the description object of the neural layer and proceed to the next iteration of our loop, where we will create objects for the operation of the next layer.

Thus, upon completing all iterations of the loop, we will obtain objects for the operation of as many neural layers as the user specified when calling the initialization method of this neural layer.

Furthermore, to avoid copying data between the buffers of internal neural layers and the current layer, we will replace the pointers to the result and gradient buffers of the current layer.

//--- to avoid copying buffers, we will replace them
   if(m_cFF2.Total() < m_iLayers)
      return false;
   if(m_cOutputs)
      delete m_cOutputs;
   CNeuronBase *neuron = m_cFF2.At(m_iLayers - 1);
   if(!neuron)
      return false;
   m_cOutputs = neuron.GetOutputs();
   if(m_cGradients)
      delete m_cGradients;
   m_cGradients = neuron.GetGradients();

In conclusion, we call the method for distributing pointers to the OpenCL context among the class object and exit the initialization method.

   SetOpenCL(m_cOpenCL);
//---
   return true;
  }

To fully address the issue of class initialization, I suggest considering a method for distributing the OpenCL context object pointer among the internal layer objects.

Despite the change in the type of internal objects, from a neural layer to a collection of neural layers, the structure and algorithm of the pointer propagation method to the OpenCL context have not changed much. This became possible thanks to the similar method we previously wrote in the neural layer collection class.

In the parameters, our SetOpenCL method receives a pointer to the OpenCL context object. In the body of the method, we first call the relevant method of the parent class, where all the necessary controls are already implemented and the pointer is saved in the corresponding class variable. After that, we pass the stored pointer to each of the internal neural layer collections by calling their similar methods. Additionally, when a valid context is provided, we create a temporary OpenCL buffer for the Score matrix (its handle is stored in m_iScoreTemp) and move the m_dStd buffers to the context; otherwise, we free those buffers.

bool CNeuronGPT::SetOpenCL(CMyOpenCL *opencl)
  {
   CNeuronBase::SetOpenCL(opencl);
   m_cQuerys.SetOpencl(m_cOpenCL);
   m_cKeys.SetOpencl(m_cOpenCL);
   m_cValues.SetOpencl(m_cOpenCL);
   m_cScores.SetOpencl(m_cOpenCL);
   m_cAttentionOut.SetOpencl(m_cOpenCL);
   m_cW0.SetOpencl(m_cOpenCL);
   m_cFF1.SetOpencl(m_cOpenCL);
   m_cFF2.SetOpencl(m_cOpenCL);
   if(m_cOpenCL)
     {
      uint size = sizeof(TYPE) * m_iUnits * m_iHeads;
      m_iScoreTemp = m_cOpenCL.AddBuffer(size, CL_MEM_READ_WRITE);
      for(int l = 0; l < m_iLayers; l++)
         m_dStd[l].BufferCreate(m_cOpenCL);
     }
   else
     {
      for(int l = 0; l < m_iLayers; l++)
         m_dStd[l].BufferFree();
     }
//---
   return(!!m_cOpenCL);
  }

Thus, we conclude the class initialization and proceed directly to implementing the neural layer operational algorithm. As always, we will start with the implementation of the feed-forward method.