Organizing parallel computing in the GPT model

We continue to work on our GPT model class, CNeuronGPT. In the previous sections, we recreated the model algorithm using standard MQL5 tools. Now it is time to supplement the model with the ability to perform mathematical operations in multi-threaded mode using the computing power of the GPU. This is exactly the opportunity provided by OpenCL.

To organize this process, we have to perform two subtasks:

  • Create an OpenCL program.
  • Organize the call of the OpenCL program from the main program.

Let's start by creating the executable program on the OpenCL side. In this program, we need to implement the part of the algorithm that is not covered by the internal object methods. There are two such blocks: one in the feed-forward pass, and a second, mirroring the first, in the error gradient propagation method of the backpropagation pass.

To execute the feed-forward algorithm, we will create the GPTFeedForward kernel. In part, the kernel algorithm will resemble a similar kernel for classes using attention mechanisms. This is not surprising since they all use the Self-Attention mechanism. However, each implementation has its nuances. Last time, instead of creating a new kernel for organizing multi-head attention, we were able to quickly modify the existing kernel of the Self-Attention algorithm. Now, creating a new kernel seems less costly compared to trying to create a universal kernel for all tasks.

Unlike the implementation of the Multi-Head Self-Attention mechanism, in which we moved the kernel into a two-dimensional task space, in this implementation we return to a one-dimensional space. The reason is that the task cannot be split into parallel threads across the elements of the Query sequence: the GPT model implementation processes only one query per iteration. So we are left with splitting the work across threads only in the context of attention heads.

In the parameters of the GPTFeedForward kernel, we continue to pass pointers to five data buffers. However, the number of scalar variables increases: earlier we obtained the sequence size from the dimension of the task space, but now we have to specify it explicitly in the kernel parameters. An additional parameter specifies the position of the current element in the key and value sequences.

__kernel void GPTFeedForward(__global TYPE *querys,
                             __global TYPE *keys,
                             __global TYPE *scores,
                             __global TYPE *values,
                             __global TYPE *outputs,
                             int key_size,
                             int units,
                             int current)
  {

As mentioned earlier, the created kernel will operate in a one-dimensional space, divided across attention heads. Therefore, the first thing we do in the body of the kernel is determine the active attention head from the thread identifier and obtain the total number of attention heads from the total number of running threads.

   const int h = get_global_id(0);
   const int heads = get_global_size(0);
   int shift_query = key_size * h;
   int shift_scores = units * h;

We immediately determine the offset in the Query and Score (dependency coefficient matrix) tensors.

Next, according to the Self-Attention algorithm that is being built, we determine the dependency coefficients between the elements of the sequence. To do this, we multiply the Query vector by the Key tensor. To implement these operations, we will create a system of two nested loops. The outer loop will iterate over the elements of the Key tensor sequence and, accordingly, the elements of the Score vector with dependency coefficients. In the body of the loop, we will define the offset in the Key tensor and prepare a local variable to count the intermediate values.

After that, we organize a nested loop with the number of iterations equal to the size of the description vector of one element of the sequence. In the body of this loop, we multiply the pair of vectors. The resulting value is divided by the square root of the vector dimension, and we take its exponential. We write the result of the operation into the corresponding element of the dependency coefficient vector and add it to the cumulative sum of all elements of the dependency coefficient vector for subsequent data normalization.

Here we should keep in mind the concatenated results buffer of the Querys layer: the key and value vectors of the current element of the sequence have not yet been transferred to their corresponding buffers. Therefore, while iterating over the key tensor, we check whether we are accessing the current element and, if so, first copy its data into the keys buffer. Of course, for the current operations we could take the data directly from the querys buffer, but we will need it again in subsequent iterations, so transferring it to the buffer is inevitable.

   TYPE summ = 0;
   for(int s = 0; s < units; s++)
     {
      TYPE score = 0;
      int shift_key = key_size * (s * heads + h);
      for(int k = 0; k < key_size; k++)
        {
         if(s == current)
            keys[shift_key + k] = querys[shift_query + k + heads * key_size];
         score += querys[shift_query + k] * keys[shift_key + k];
        }
      score = exp(score / sqrt((TYPE)key_size));
      summ += score;
      scores[shift_scores + s] = score;
     }

As a result of performing the full cycle of iterations of the two-loop system created above, we get a vector of dependency coefficients. According to the Self-Attention algorithm, before further use, the obtained coefficients must be normalized by the Softmax function. By taking the exponential of the vector products, we have already performed part of that function's algorithm. To complete the normalization, we just need to divide the values stored in the vector by their total sum, which we prudently accumulated in the local variable summ. Therefore, we organize another loop with the number of iterations equal to the size of the dependency coefficient vector. In the body of this loop, we divide all the values of the vector by the value of the local variable summ.

   for(int s = 0; s < units; s++)
      scores[shift_scores + s] /= summ;

Thus, after completing the iterations of the loop, the Score vector holds normalized dependency coefficients whose total sum within the block equals one. In fact, the obtained coefficients give us an idea of the share of influence of each element of the sequence from the Value tensor on the final value of the analyzed element of the sequence in the results tensor of the current attention head.
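
For reference, the coefficients computed by the two loops above correspond to the standard scaled dot-product attention weights, written here in the notation of this kernel:

$$Score_s = \frac{\exp\left(\frac{Q \cdot K_s}{\sqrt{key\_size}}\right)}{\sum_{j=0}^{units-1}\exp\left(\frac{Q \cdot K_j}{\sqrt{key\_size}}\right)}$$

where $Q$ is the query vector of the current element and $K_s$ is the key vector of element $s$ of the sequence.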

This means that in order to obtain the final values, we need to multiply the Score vector of normalized dependency coefficients by the Value tensor. To perform this operation, we need another system of two nested loops. But before running it, we determine the offset in the results tensor to the beginning of the vector of the analyzed sequence element.

The outer loop, with the number of iterations equal to the size of the vector describing one element of the sequence, indicates the ordinal number of the collected element in the result vector. The nested loop, with the number of iterations equal to the number of elements in the sequence, helps correlate the vectors of the Value tensor with the dependency coefficients from the Score vector. In the body of the nested loop, we multiply the vector from the Value tensor by the corresponding element dependency coefficient. The resulting values of the products will be accumulated in a local variable. After completing the iterations of the inner loop, we save the obtained value in the buffer of the Self-Attention block results.

   shift_query = key_size * h;
   for(int i = 0; i < key_size; i++)
     {
      TYPE query = 0;
      for(int v = 0; v < units; v++)
        {
         if(v == current)
            values[key_size * (v * heads + h) + i] = 
                                      querys[(2 * heads + h) * key_size + i];
         query += values[key_size * (v * heads + h) + i] * 
                                                    scores[shift_scores + v];
        }
      outputs[shift_query + i] = query;
     }
  }

As a result of completing the full cycle of iterations within the loop system, we obtain the vector describing one element of the sequence in the tensor of results for one attention head. The task assigned to this kernel has been completed, and we can exit it.

This concludes the work with the feed-forward kernel, and we move further along. Now we need to organize the backpropagation process. The implementation of this task will be split into two kernels. In the GPTCalcScoreGradient kernel, we will propagate the error gradient to the vector of the dependency coefficients. In the GPTCalcHiddenGradient kernel, we will continue the propagation of the error gradient up to the level of the Query and Key tensors.

Let's take it step by step. In its parameters, the GPTCalcScoreGradient kernel receives pointers to six data buffers and three variables:

  • scores — buffer for the vector of dependency coefficients
  • scores_grad — buffer for the error gradient vector at the level of dependency coefficients
  • values — buffer for the Value tensor
  • values_grad — buffer for the error gradient tensor at the Value level
  • outputs_grad — error gradient tensor buffer at the result level of the Self-Attention block
  • scores_temp — buffer for writing intermediate values
  • window — size of the vector describing one element of the sequence in the Value tensor
  • units — number of elements in the sequence
  • current — ordinal number of the current item in the Value stack

__kernel void GPTCalcScoreGradient(__global TYPE *scores,
                                   __global TYPE *scores_grad,
                                   __global TYPE *values,
                                   __global TYPE *values_grad,
                                   __global TYPE *outputs_grad,
                                   __global TYPE *scores_temp,
                                   int window,
                                   int units,
                                   int current)
  {

As with the feed-forward pass, we run the kernel in a one-dimensional space sized by the number of attention heads used. In the body of the kernel, we immediately determine the active attention head from the thread identifier and obtain the total number of attention heads from the total number of launched threads.

   const int h = get_global_id(0);
   const int heads = get_global_size(0);
   int shift_value = window * (2 * heads + h);
   int shift_score = units * h;

We also determine the offset in the tensors of the error gradients at the level of Value and in the vector of the dependency coefficients. Note that the offset in the tensor of error gradients for Value and in the Value tensor itself will be different in this case.

In this implementation of the GPT model, we used one internal neural layer to generate a concatenated tensor containing the values of Query, Key, and Value for all attention heads. Accordingly, we assemble the error gradient into a similar concatenated tensor of the error gradients for the specified neural layer. However, this tensor contains only the current element of the sequence. At the same time, the Value stack tensor contains complete information about the entire sequence but only for the Value tensor.
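
To illustrate this layout, below is a minimal sketch (a hypothetical helper derived from the offsets used in these kernels, not part of the library itself) of how a vector of a given entity and attention head is located inside the concatenated buffer for one sequence element:

// Hypothetical helper illustrating the assumed layout of the concatenated
// buffer of one sequence element: [ Query | Key | Value ], where each segment
// holds 'heads' vectors of 'window' elements.
// entity: 0 - Query, 1 - Key, 2 - Value; h - index of the attention head.
int ConcatShift(int entity, int h, int heads, int window)
  {
   return window * (entity * heads + h);
  }
// For example, shift_value = ConcatShift(2, h, heads, window),
// which matches the expression window * (2 * heads + h) used above.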

After the preparatory work, we distribute the error gradient to the Value tensor. As mentioned above, we distribute the error gradient only for the current element of the sequence. To do this, we organize a loop with the number of iterations equal to the size of the description vector of one element of the sequence in the Value tensor. In the body of the loop, we multiply the error gradient vector at the level of the Self-Attention block results by the corresponding dependency coefficient from the Score vector. The obtained values are saved in the concatenated error gradient buffer.

//--- Gradient distribution to Values
   for(int i = 0; i < window; i++)
      values_grad[shift_value + i] = scores[units * h + current] * 
                                                   outputs_grad[window * h + i];

After calculating the error gradient on the Value tensor, we determine the error gradient at the level of the dependency coefficient vector. For this operation, we need a system of two loops: an outer loop with the number of iterations equal to the number of elements in the sequence and a nested loop with the number of iterations equal to the size of the description vector of one element in the Value tensor. In the body of the nested loop, we multiply two vectors (the Value vector and the error gradient). The resulting value is stored in the temporary data buffer.

//--- Gradient distribution to Score
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      for(int i = 0; i < window; i++)
         grad += outputs_grad[window * h + i] * 
                                        values[window * (k * heads + h) + i];
      scores_temp[shift_score + k] = grad;
     }

After completing the full cycle of iterations within the loop system, the temporary buffer contains a fully populated error gradient vector for the dependency coefficient vector. But before distributing the error gradient further, we first need to adjust it for the derivative of the Softmax function.

Let's organize another system of two nested loops. Both loops will have the number of iterations equal to the number of elements in the sequence. In the body of the nested loop, we will calculate the derivative of the function using the formula shown below.
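
The adjustment implemented in the loop below corresponds to the standard Jacobian of the Softmax function:

$$\frac{\partial L}{\partial x_k} = \sum_{i=0}^{units-1} Score_i \left(\delta_{ik} - Score_k\right) \frac{\partial L}{\partial Score_i}$$

where $\delta_{ik}$ equals 1 when $i = k$ and 0 otherwise, and $\frac{\partial L}{\partial Score_i}$ are the intermediate values accumulated in the scores_temp buffer.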

//--- Adjust to the Softmax derivative
   for(int k = 0; k < units; k++)
     {
      TYPE grad = 0;
      TYPE score = scores[shift_score + k];
      for(int i = 0; i < units; i++)
         grad += scores[shift_score + i] * ((int)(i == k) - score) *
                                                scores_temp[shift_score + i];
      scores_grad[shift_score + k] = grad;
     }
  }

We will save the obtained values in the error gradient buffer at the level of the dependency coefficient vector. At this stage, we complete the work of the first kernel and move on to the second one.

In the second kernel of the backpropagation process, GPTCalcHiddenGradient, we have to propagate the error gradient further and bring it to the level of the Query and Key tensors.

In its parameters, the GPTCalcHiddenGradient kernel receives pointers to four data buffers and three variables.

__kernel void GPTCalcHiddenGradient(__global TYPE *querys,
                                    __global TYPE *querys_grad,
                                    __global TYPE *keys,
                                    __global TYPE *scores_grad,
                                    int key_size,
                                    int units,
                                    int current)
  {

Note that although we talked about distributing the gradient to the Query and Key tensors, the kernel parameters contain a pointer only to the Query error gradient buffer. This is possible thanks to the concatenated buffer, in which we have already saved the error gradient at the level of the Value tensor. Now we add the error gradients at the level of the Query and Key tensors to the same buffer.

In the kernel body, we determine the ordinal number of the analyzed attention head from the thread identifier and obtain the number of attention heads in use from the total number of launched tasks.

   const int h = get_global_id(0);
   const int heads = get_global_size(0);
   int shift_query = key_size * h;
   int shift_key = key_size * (heads + h);
   int shift_score = units * h;

Here we also define the offsets in the data buffers to the beginning of the analyzed vectors.

Next, we organize a system of two nested loops, in the body of which we will determine the error gradients at the level of the tensors we are looking for. To do this, multiply the error gradient at the level of the dependency coefficient vector by the opposite tensor.
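
For reference, under the scaled dot-product attention computed in the feed-forward pass, these gradients take the following form (a standard result, written in the notation of this kernel):

$$\frac{\partial L}{\partial Q_i} = \frac{1}{\sqrt{key\_size}}\sum_{s=0}^{units-1} K_{s,i}\,\frac{\partial L}{\partial Score_s}, \qquad \frac{\partial L}{\partial K_{current,\,i}} = \frac{1}{\sqrt{key\_size}}\, Q_i\,\frac{\partial L}{\partial Score_{current}}$$

The Key gradient is computed only for the current element of the sequence, since only that key vector was produced during the current pass.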

//--- Gradient distribution on Querys and Keys
   const TYPE k = 1 / sqrt((TYPE)key_size);
//---
   for(int i = 0; i < key_size; i++)
     {
      TYPE grad_q = 0;
      TYPE grad_k = 0;
      for(int s = 0; s < units; s++)
        {
         grad_q += keys[key_size * (s * heads + h) + i] *
                                               scores_grad[shift_score + s];
         if(s == current)
            grad_k += querys[key_size * h + i] *
                                           scores_grad[units * h + current];
        }
      querys_grad[shift_query + i] = grad_q * k;
      querys_grad[shift_key + i] = grad_k * k;
     }
  }

Note that we calculate the error gradient only for the current element of the sequence and save the obtained values in the corresponding elements of the error gradient buffer.

As a result of all iterations of our loop system, we get a fully filled concatenated tensor of error gradients of all three entities (Query, Key, and Value). We complete the work on building the OpenCL program and move on to building the functionality on the side of the main program.

To make it more convenient to manage the constructed kernels in the main program, let's create named constants for calling kernels and accessing their elements. To do this, we open our constants file defines.mqh and create kernel access constants in it.

#define def_k_GPTFeedForward           34
#define def_k_GPTScoreGradients        35
#define def_k_GPTHiddenGradients       36

Then we add access constants to kernel parameters.

//--- GPT feed-forward pass
#define def_gptff_querys               0
#define def_gptff_keys                 1
#define def_gptff_scores               2
#define def_gptff_values               3
#define def_gptff_outputs              4
#define def_gptff_key_size             5
#define def_gptff_units                6
#define def_gptff_current              7

//--- determine the gradient at the matrix of GPT dependency coefficients
#define def_gptscr_scores              0
#define def_gptscr_scores_grad         1
#define def_gptscr_values              2
#define def_gptscr_values_grad         3
#define def_gptscr_outputs_grad        4
#define def_gptscr_scores_temp         5
#define def_gptscr_window              6
#define def_gptscr_units               7
#define def_gptscr_current             8

//--- gradient distribution via GPT
#define def_gpthgr_querys              0
#define def_gpthgr_querys_grad         1
#define def_gpthgr_keys                2
#define def_gpthgr_scores_grad         3
#define def_gpthgr_key_size            4
#define def_gpthgr_units               5
#define def_gpthgr_current             6

After that, we go to the dispatch service class of the CNet neural network model and, in the OpenCL initialization method InitOpenCL, we change the total number of kernels in our program. Next, we initialize the creation of new kernels in the OpenCL context.

bool CNet::InitOpenCL(void)
  {
   ......
   if(!m_cOpenCL.SetKernelsCount(37))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   ......
   if(!m_cOpenCL.KernelCreate(def_k_GPTFeedForward, "GPTFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.KernelCreate(def_k_GPTScoreGradients, "GPTCalcScoreGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.KernelCreate(def_k_GPTHiddenGradients, "GPTCalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
//---
   return true;
  }

This concludes the preparatory work, and we can go directly to the methods of our CNeuronGPT class. In them, we have to perform three stages of work to call each kernel:

  • Preparing the input data and transferring it to the memory of the OpenCL context.
  • Placing the kernel in the execution queue.
  • Loading the results of program execution into the memory of the main program.

First, we modify the CNeuronGPT::FeedForward method. In the block for organizing multi-threaded computing using OpenCL, we first check for the presence of an already created buffer in the memory of the OpenCL context.

bool CNeuronGPT::FeedForward(CNeuronBase *prevLayer)
  {
   ......
   for(int layer = 0; layer < m_iLayers; layer++)
     {
   ......
      //--- branching of the algorithm by the computing device
      if(!m_cOpenCL)
        {
         // Program block using standard MQL5 tools
   ......
        }
      else // OpenCL block
        {
         //--- checking data buffers
         if(Querys.GetOutputs().GetIndex() < 0)
            return false;
         if(Keys.GetOutputs().GetIndex() < 0)
            return false;
         if(Values.GetOutputs().GetIndex() < 0)
            return false;
         if(Scores.GetOutputs().GetIndex() < 0)
            return false;
         if(AttentionOut.GetOutputs().GetIndex() < 0)
            return false;

When all buffers have been created, and those that are necessary for kernel operation have been passed to the OpenCL context memory, we pass pointers to the used data buffers and the necessary constants to the kernel parameters.

         //--- pass parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward,
                                def_gptff_keys, Keys.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward,
                     def_gptff_outputs, AttentionOut.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward,
                            def_gptff_querys, Querys.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward,
                            def_gptff_scores, Scores.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTFeedForward,
                            def_gptff_values, Values.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward,
                                             def_gptff_key_size, m_iKeysSize))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward,
                                                   def_gptff_units, m_iUnits))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTFeedForward,
                                       def_gptff_current, m_iCurrentPosition))
            return false;

At this stage, the preparatory work is complete. Let's move on to placing the kernel in the execution queue. Here, we first create two dynamic arrays in which we specify the offset and the number of threads to run in each dimension of the task space. Then we call the m_cOpenCL.Execute method, which places the kernel in the queue.

         //--- Place a kernel in the queue
         int off_set[] = {0};
         int NDRange[] = {m_iHeads};
          if(!m_cOpenCL.Execute(def_k_GPTFeedForward, 1, off_set, NDRange))
            return false;
        }

This concludes the CNeuronGPT::FeedForward method. But we still have to do similar work in the CNeuronGPT::CalcHiddenGradient method of the backpropagation algorithm.

Let me remind you that in order to implement the backpropagation method, we have created two kernels that will be called sequentially one after the other. Therefore, the kernel maintenance work must be repeated for each of them.

First, let's create data buffers for the first kernel.

bool CNeuronGPT::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
   ......
   for(int layer = m_iLayers - 1; layer >= 0; layer--)
     {
   ......
      //--- branching of the algorithm by the computing device
      attention_grad = AttentionOut.GetGradients();
      if(!m_cOpenCL)
        {
         // Program block using standard MQL5 tools
   ......
        }
      else // OpenCL block
        {
         //--- checking data buffers
         if(Values.GetOutputs().GetIndex() < 0)
            return false;
         if(Querys.GetGradients().GetIndex() < 0)
            return false;
         if(Scores.GetOutputs().GetIndex() < 0)
            return false;
         if(attention_grad.GetIndex() < 0)
            return false;
         if(Scores.GetGradients().GetIndex() < 0)
            return false;
         if(m_iScoreTemp < 0)
            return false;

Following our algorithm for working with the OpenCL context, after creating data buffers and passing all the necessary information to the context memory, we pass pointers to the used data buffers and constants for executing the program algorithm to the parameters of the kernel being launched.

         //--- pass parameters to the kernel
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                           def_gptscr_outputs_grad, attention_grad.GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                            def_gptscr_scores, Scores.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                     def_gptscr_scores_grad, Scores.GetGradients().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                                         def_gptscr_scores_temp, m_iScoreTemp))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                            def_gptscr_values, Values.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTScoreGradients,
                     def_gptscr_values_grad, Querys.GetGradients().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTScoreGradients,
                                               def_gptscr_window, m_iKeysSize))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTScoreGradients,
                                                   def_gptscr_units, m_iUnits))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTScoreGradients,
                                       def_gptscr_current, m_iCurrentPosition))
            return false;

Note that instead of the Value tensor error gradient buffer, we pass a pointer to the gradient buffer of the inner neural layer Querys. This is because we used a concatenated error gradient buffer for all three tensors. To eliminate the subsequent data copy operation, we will immediately write the data to the concatenated buffer.

After that, we perform the operation of placing the kernel in the queue. Let me remind you that we are launching a kernel to perform tasks in one-dimensional space in the context of attention heads.

Let's specify the offset in the task space and the number of threads to be started in the corresponding dynamic arrays. After that, we call the method of queuing our kernel.

         //--- Place the kernel in queue
         int off_set[] = {0};
         int NDRange[] = {m_iHeads};
          if(!m_cOpenCL.Execute(def_k_GPTScoreGradients, 1, off_set, NDRange))
            return false;

This concludes the work on the first kernel, and we move on to building a similar algorithm for the second kernel of the backpropagation pass.

Now we will check the additional buffers in the memory of the OpenCL context.

         if(Querys.GetOutputs().GetIndex() < 0)
            return false;
         if(Keys.GetOutputs().GetIndex() < 0)
            return false;

We pass the parameters to the kernel.

         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTHiddenGradients,
                                def_gpthgr_keys, Keys.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTHiddenGradients,
                            def_gpthgr_querys, Querys.GetOutputs().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTHiddenGradients,
                     def_gpthgr_querys_grad, Querys.GetGradients().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgumentBuffer(def_k_GPTHiddenGradients,
                     def_gpthgr_scores_grad, Scores.GetGradients().GetIndex()))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTHiddenGradients,
                                             def_gpthgr_key_size, m_iKeysSize))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTHiddenGradients,
                                                   def_gpthgr_units, m_iUnits))
            return false;
         if(!m_cOpenCL.SetArgument(def_k_GPTHiddenGradients,
                                       def_gpthgr_current, m_iCurrentPosition))
            return false;

After that, we will put the kernel in the execution queue. Note that this time we do not create arrays of offset and dimensionality of the task space. We simply use the arrays created during the execution of the previous kernel without modification.

          if(!m_cOpenCL.Execute(def_k_GPTHiddenGradients, 1, off_set, NDRange))
            return false;
        }

This completes the work on building the GPT model class, and we can proceed to evaluate the results of the work done.