Implementing functionality on the main program side

Implementing the functionality on the main program side will require some knowledge of how the process is organized, as well as some effort. Let's start with the preparatory work. First, in our definitions file, we need to load the OpenCL program written above as a resource and assign its contents to a string variable. Here, we will also prepend to the program predefined macro substitutions for the data types and the size of the local array.

#resource "opencl_program.cl" as string OCLprogram
//---
#define TYPE                         float
#define LOCAL_SIZE                   256
const string ExtType = StringFormat("#define TYPE %s\r\n"
                                    "#define TYPE4 %s4\r\n"
                                    "#define LOCAL_SIZE %d\r\n",
                                     typename(TYPE),typename(TYPE),LOCAL_SIZE);
#define cl_program                   ExtType+OCLprogram
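
With the values defined above, ExtType expands to the following three lines, which are prepended to the kernel source whenever cl_program is used:

#define TYPE float
#define TYPE4 float4
#define LOCAL_SIZE 256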

When declaring kernels in the main program, the CLKernelCreate function returns a handle. To work with OpenCL technology, we will use the CMyOpenCL class, which is derived from the standard COpenCL class. The aforementioned classes implement arrays for storing handles. A specific kernel is accessed by an index in the array. To simplify working with these indices and make the program code more readable, let's add constants for the indices of all the kernels created above. To explicitly identify the kernel index in the program code, we will start all named kernel constants with def_k.

//+------------------------------------------------------------------+
//| OpenCL Kernels                                                   |
//+------------------------------------------------------------------+
#define def_k_PerceptronFeedForward    0
#define def_k_LineActivation           1
#define def_k_SigmoidActivation        2
#define def_k_SigmoidDerivative        3
#define def_k_TANHActivation           4
#define def_k_TANHDerivative           5
#define def_k_LReLuActivation          6
#define def_k_LReLuDerivative          7
#define def_k_SoftMAXActivation        8
#define def_k_SoftMAXDerivative        9
#define def_k_SwishActivation          10
#define def_k_SwishDerivative          11
#define def_k_CalcOutputGradient       12
#define def_k_CalcHiddenGradient       13
#define def_k_CalcDeltaWeights         14
#define def_k_SGDUpdate                15
#define def_k_MomentumUpdate           16
#define def_k_AdaGradUpdate            17
#define def_k_RMSPropUpdate            18
#define def_k_AdaDeltaUpdate           19
#define def_k_AdamUpdate               20

To specify parameters when calling kernels, we can also use indices. However, now they are not specified explicitly. Instead, the serial number in the list of OpenCL kernel parameters is used. All kernels use their own set of parameters, so we will define named constants for all created kernels. To avoid confusion between identical parameters of different kernels, we will include a pointer to the respective kernel in the constant name. For example, the parameter constants for the forward pass kernel of the basic fully connected layer will start with def_pff.

//--- perceptron feed forward pass
#define def_pff_inputs                 0
#define def_pff_weights                1
#define def_pff_outputs                2
#define def_pff_inputs_total           3

We will declare constants for all written kernels in a similar way.

//--- calculating the error gradient of the result layer
#define def_outgr_target               0
#define def_outgr_outputs              1
#define def_outgr_gradients            2
#define def_outgr_loss_function        3

//--- calculating the error gradient of the hidden layer
#define def_hidgr_gradient_inputs      0
#define def_hidgr_weights              1
#define def_hidgr_gradients            2
#define def_hidgr_outputs_total        3

//--- calculating the error gradient at the level of the weight matrix
#define def_delt_inputs                0
#define def_delt_delta_weights         1
#define def_delt_gradients             2

//--- parameter optimization by stochastic gradient descent
#define def_sgd_delta_weights          0
#define def_sgd_weights                1
#define def_sgd_total                  2
#define def_sgd_batch_size             3
#define def_sgd_learningRate           4
#define def_sgd_Lambda1                5
#define def_sgd_Lambda2                6

//--- parameter optimization using the moment method
#define def_moment_delta_weights       0
#define def_moment_weights             1
#define def_moment_momentum            2
#define def_moment_total               3
#define def_moment_batch_size          4
#define def_moment_learningRate        5
#define def_moment_beta                6
#define def_moment_Lambda1             7
#define def_moment_Lambda2             8

//--- parameter optimization using the AdaGrad method
#define def_adagrad_delta_weights      0
#define def_adagrad_weights            1
#define def_adagrad_momentum           2
#define def_adagrad_total              3
#define def_adagrad_batch_size         4
#define def_adagrad_learningRate       5
#define def_adagrad_Lambda1            6
#define def_adagrad_Lambda2            7

//--- parameter optimization using the RMSProp method
#define def_rms_delta_weights          0
#define def_rms_weights                1
#define def_rms_momentum               2
#define def_rms_total                  3
#define def_rms_batch_size             4
#define def_rms_learningRate           5
#define def_rms_beta                   6
#define def_rms_Lambda1                7
#define def_rms_Lambda2                8

//--- parameter optimization using the AdaDelta method
#define def_adadelt_delta_weights      0
#define def_adadelt_weights            1
#define def_adadelt_momentumW          2
#define def_adadelt_momentumG          3
#define def_adadelt_total              4
#define def_adadelt_batch_size         5
#define def_adadelt_beta1              6
#define def_adadelt_beta2              7
#define def_adadelt_Lambda1            8
#define def_adadelt_Lambda2            9

//--- parameter optimization using the Adam method
#define def_adam_delta_weights         0
#define def_adam_weights               1
#define def_adam_momentumM             2
#define def_adam_momentumV             3
#define def_adam_total                 4
#define def_adam_batch_size            5
#define def_adam_learningRate          6
#define def_adam_beta1                 7
#define def_adam_beta2                 8
#define def_adam_Lambda1               9
#define def_adam_Lambda2               10

//--- activation functions
#define def_activ_inputs               0
#define def_activ_outputs              1
#define def_activ_param_a              2
#define def_activ_param_b              3

//--- adjusting the gradient to the derivative of the activation function
#define def_deactgr_outputs            0
#define def_deactgr_gradients          1
#define def_deactgr_deact_gradient     2
#define def_deactgr_act_param_a        3
#define def_deactgr_act_param_b        4

I intentionally provided a complete set of constants above to offer you a reference guide. It will assist in reading and understanding the code for our next steps in implementing OpenCL technology into the project.

After describing the constants, we will move on to creating classes that will be responsible for servicing OpenCL tools. We have already mentioned them multiple times. It's time to learn more about their features.

First, this is the CMyOpenCL class. It inherits from the COpenCL class of the MQL5 standard library. The standard library is well written and has sufficient functionality to organize the work. However, I personally found one aspect inconvenient: when working with buffers for data exchange between the main program and the OpenCL context, the same approach is used as with the other objects of the process. When creating a buffer, we have to specify its index in the general array of buffers in advance. This is a perfectly workable option when we know all the buffers and their quantity beforehand. However, our case is a little more complicated.

class CMyOpenCL   :  public COpenCL
  {
public:
                     CMyOpenCL(void)   {};
                    ~CMyOpenCL(void)   {};
   //--- initialization and shutdown
   virtual bool      Initialize(const string program, const bool show_log = true);
   //---
   template<typename T>
   int               AddBufferFromArray(T &data[], const uint data_array_offset,
                                   const uint data_array_count, const uint flags);
   int               AddBufferFromArray(MATRIX &data,
                                  const uint data_array_offset, const uint flags);
   int               AddBuffer(const uint size_in_bytes, const uint flags);
   bool              CheckBuffer(const int index);
   //---
   bool              BufferFromMatrix(const int buffer_index, MATRIX &data,
                                  const uint data_array_offset, const uint flags);
   bool              BufferRead(const int buffer_index, MATRIX &data,
                                                     const uint cl_buffer_offset);
   bool              BufferWrite(const int buffer_index, MATRIX &data,
                                                     const uint cl_buffer_offset);
  };

Earlier, we discussed that the number of buffers used for accumulating moments can vary depending on the chosen method for updating weights. In addition, we cannot know in advance how many neural layers the user will use to solve their tasks. Hence, I needed a dynamic array to store the handles of data buffers. This problem was solved by adding a small AddBufferFromArray method. The parameters of this method are similar to those of the BufferFromArray method of the parent class, except for the buffer index. The method body contains a loop that searches for empty cells in the buffer handle storage array. The first empty cell found is used to create the buffer. When there are no free elements in the array, the method expands it. The buffer itself is created by calling the above-mentioned parent class method.

As a result of the operations, the method returns the index of the created buffer. If errors occur during operations, the method will return the INVALID_HANDLE constant.

I'd like to point out another aspect: the method is implemented as a function template. This allows one method to create buffers from arrays of different data types.

template<typename T>
int CMyOpenCL::AddBufferFromArray(T &data[], const uint data_array_offset,
                                  const uint data_array_count, const uint flags
                                 )
  {
   int result=INVALID_HANDLE;
   for(int i=0; i<m_buffers_total; i++)
     {
      if(m_buffers[i]!=INVALID_HANDLE)
         continue;
      result=i;
      break;
     }
//---
   if(result<0)
     {
      if(ArrayResize(m_buffers,m_buffers_total+1)>0)
        {
         m_buffers_total=ArraySize(m_buffers);
         result=m_buffers_total-1;
         m_buffers[result]=INVALID_HANDLE;
        }
      else
         return result;
     }
//---
   if(!BufferFromArray(result,data,data_array_offset,data_array_count,flags))
      return INVALID_HANDLE;
//---
   return result;
  }

The method created above allows creating buffers from arrays of any data type, but it is not applicable when working with matrices. Therefore, the method was overloaded; the algorithm remains unchanged.

int CMyOpenCL::AddBufferFromArray(MATRIX &data,
                                  const uint data_array_offset,
                                  const uint flags
                                 )
  {
//--- Search for a free element in a dynamic array of pointers
   int result = -1;
   for(int i = 0; i < m_buffers_total; i++)
     {
      if(m_buffers[i] != INVALID_HANDLE)
         continue;
      result = i;
      break;
     }
//--- If a free item is not found, add a new item to the array
   if(result < 0)
     {
      if(ArrayResize(m_buffers, m_buffers_total + 1) > 0)
        {
         m_buffers_total = ArraySize(m_buffers);
         result = m_buffers_total - 1;
         m_buffers[result] = INVALID_HANDLE;
        }
      else
         return result;
     }
//--- Create a buffer in the OpenCL context
   if(!BufferFromMatrix(result, data, data_array_offset, flags))
      return -1;
   return result;
  }

Looking ahead a bit, I want to mention that we won't always be creating buffers based on ready-made arrays. Sometimes we just need to create a buffer in the OpenCL context without duplicating it in main memory. For example, a particular buffer may be used only to obtain results, so there is no need to load its data into the context before performing operations. As mentioned before, copying data is an expensive operation, and we want to minimize such operations. Therefore, it is easier to simply create a data buffer of a certain size in the context without copying any data. For such cases, we will create the AddBuffer method. As you can see, its algorithm is almost identical to that of the previous methods of the class. The only difference is that this method receives the buffer size in bytes as a parameter instead of an array. At the end of the method, we call the BufferCreate method of the parent class, which creates a buffer of the specified size in the OpenCL context.

int CMyOpenCL::AddBuffer(const uint size_in_bytes, const uint flags)
  {
//--- Search for a free element in a dynamic array of pointers
   int result = -1;
   for(int i = 0; i < m_buffers_total; i++)
     {
      if(m_buffers[i] != INVALID_HANDLE)
         continue;
      result = i;
      break;
     }
//--- If a free item is not found, add a new item to the array
   if(result < 0)
     {
      if(ArrayResize(m_buffers, m_buffers_total + 1) > 0)
        {
         m_buffers_total = ArraySize(m_buffers);
         result = m_buffers_total - 1;
         m_buffers[result] = INVALID_HANDLE;
        }
      else
         return result;
     }
//--- Create a buffer in the OpenCL context
   if(!BufferCreate(result, size_in_bytes, flags))
      return -1;
   return result;
  }

We also created methods for reading (BufferRead) and writing (BufferWrite) the data of an OpenCL context buffer to and from a matrix in main memory. The algorithms of the two methods are identical, so let's consider the data reading method as an example. In its parameters, it receives the buffer identifier in the dynamic array of our class, a matrix for the data, and an offset in the context buffer.

Please do not confuse the buffer identifier in the dynamic class array with the buffer handle in the OpenCL context. The class is organized so that we pass to the external program only the ordinal number of the element in the dynamic array of our class; that element stores the handle of the buffer. As a result, when a buffer is created in the context through the class, the external program does not get direct access to the buffer created in the context. All work with the buffer should be done through the class methods.

In the method body, we first check the received buffer index against the size of our dynamic array. We then check the validity of the stored buffer handle. In addition, we check the validity of the OpenCL context and program handles. Only after all of these checks pass do we call the function for reading data from the buffer. Don't forget to check the result of every operation. At the end of the method, we return the logical result of the operations.

bool CMyOpenCL::BufferRead(const int buffer_index, MATRIX &data,
                                     const uint cl_buffer_offset)
  {
//--- checking parameters
   if(buffer_index < 0 || buffer_index >= m_buffers_total || data.Rows() <= 0)
      return(false);
   if(m_buffers[buffer_index] == INVALID_HANDLE)
      return(false);
   if(m_context == INVALID_HANDLE || m_program == INVALID_HANDLE)
      return(false);
//--- reading buffer data from the OpenCL context
   if(!CLBufferRead(m_buffers[buffer_index], cl_buffer_offset, data))
      return(false);
//---
   return(true);
  }
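
The write method mirrors this logic. A minimal sketch, assuming the matrix overload of CLBufferWrite behaves symmetrically to the CLBufferRead call above, might look like this:

bool CMyOpenCL::BufferWrite(const int buffer_index, MATRIX &data,
                                     const uint cl_buffer_offset)
  {
//--- checking parameters
   if(buffer_index < 0 || buffer_index >= m_buffers_total || data.Rows() <= 0)
      return(false);
   if(m_buffers[buffer_index] == INVALID_HANDLE)
      return(false);
   if(m_context == INVALID_HANDLE || m_program == INVALID_HANDLE)
      return(false);
//--- writing matrix data to the buffer in the OpenCL context
   if(!CLBufferWrite(m_buffers[buffer_index], cl_buffer_offset, data))
      return(false);
//---
   return(true);
  }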

The second class that we will create and use to transfer data between the main program and the OpenCL context is the CBufferType data buffer class. It is created as a descendant of the CObject base class. Since the parent class provides only the most basic functionality, we need to create all the required functionality ourselves.

In addition to creating new methods in the new class, two new variables have appeared:

  • m_cOpenCL — a pointer to an object of the CMyOpenCL class
  • m_myIndex — the index of the current buffer in the dynamic array for storing buffer handles in the CMyOpenCL class.

The m_mMatrix matrix for storing data has also been introduced. Here we have slightly deviated from the generally accepted rules of class design. It is customary to restrict access to internal variables and to build all interaction with them through class methods. Each such method limits direct access to the internal variables and requires additional time for its auxiliary operations. Of course, this approach allows complete control over changes in variable states. However, in building neural models we aim to minimize the time spent on each iteration, as milliseconds per iteration can add up to significant overhead across repeated calls. That is why we declared the m_mMatrix data matrix in the public section. The fact that the class will be used to store and transmit data within our project, and that all buffers will be private or protected members of other classes, minimizes our risks.

class CBufferType : public CObject
  {
protected:
   CMyOpenCL*        m_cOpenCL;     // OpenCL context object
   int               m_myIndex;     // data buffer index in context
public:
                     CBufferType(void);
                    ~CBufferType(void);
   //--- data matrix
   MATRIX            m_mMatrix;
   //--- method of initializing the buffer with initial values
   virtual bool      BufferInit(const ulong rows, const ulong columns,
                                                          const TYPE value = 0);
   //--- create a new buffer in the OpenCL context
   virtual bool      BufferCreate(CMyOpenCL *opencl);
   //--- delete the buffer in the context of OpenCL
   virtual bool      BufferFree(void);
   //--- read buffer data from the OpenCL context
   virtual bool      BufferRead(void);
   //--- write buffer data to the OpenCL context
   virtual bool      BufferWrite(void);
   //--- get the buffer index
   virtual int       GetIndex(void);
   //--- change the buffer index
   virtual bool      SetIndex(int index)
                       {
                        if(!m_cOpenCL.BufferFree(m_myIndex))
                           return false;
                        m_myIndex = index;
                        return true;
                       }
   //--- copy buffer data to an array
   virtual int       GetData(TYPE &values[], bool load = true);
   virtual int       GetData(MATRIX &values, bool load = true);
   virtual int       GetData(CBufferType *values, bool load = true);
   //--- calculate the average value of the data buffer
   virtual TYPE      MathMean(void);
   //--- vector operations
   virtual bool      SumArray(CBufferType *src);
   virtual int       Scaling(TYPE value);
   virtual bool      Split(CBufferType *target1, CBufferType *target2,
                                                            const int position);
   virtual bool      Concatenate(CBufferType *target1, CBufferType *target2,
                                    const int positions1, const int positions2);
   //--- methods for working with files
   virtual bool      Save(const int file_handle);
   virtual bool      Load(const int file_handle);
   //--- class identifier
   virtual int       Type(void)              const { return defBuffer;              }
   //--- methods for working with the data matrix
   ulong             Rows(void)              const { return m_mMatrix.Rows();       }
   ulong             Cols(void)              const { return m_mMatrix.Cols();       }
   uint              Total(void)             const { return (uint)(m_mMatrix.Rows() * 
                                                                 m_mMatrix.Cols()); }
   TYPE              At(uint index)          const { return m_mMatrix.Flat(index);  }
   TYPE              operator[](ulong index) const { return m_mMatrix.Flat(index); }
   VECTOR            Row(ulong row)                { return m_mMatrix.Row(row);     }
   VECTOR            Col(ulong col)                { return m_mMatrix.Col(col);     }
   bool              Row(VECTOR &vec,  ulong row)  { return m_mMatrix.Row(vec, row); }
   bool              Col(VECTOR &vec,  ulong col)  { return m_mMatrix.Col(vec, col); }
   bool              Activation(MATRIX &mat_out, ENUM_ACTIVATION_FUNCTION func)
                                      { return m_mMatrix.Activation(mat_out, func); }
   bool              Derivative(MATRIX &mat_out, ENUM_ACTIVATION_FUNCTION func)
                                      { return m_mMatrix.Derivative(mat_out, func); }
   bool              Reshape(ulong rows, ulong cols)
                                      { return m_mMatrix.Reshape(rows, cols);       }
//---
   bool              Update(uint index, TYPE value)
                       {
                        if(index >= Total())
                           return false;
                         m_mMatrix.Flat(index, value);
                        return true;
                       }

   bool              Update(uint row, uint col, TYPE value)
                       {
                        if(row >= Rows() || col >= Cols())
                           return false;
                         m_mMatrix[row][col] = value;
                        return true;
                       }
  };

The set of class methods is quite diverse. Some of them mirror matrix functions and perform the same role: working with the data matrix. Others implement the interaction with the OpenCL context. Let's take a closer look at some of them.

In the class constructor, we will only set the initial values of the new variables. They are filled with empty values.

CBufferType::CBufferType(void)  : m_myIndex(-1)
  {
   m_cOpenCL = NULL;
  }

In the class destructor, we will perform memory cleaning operations. Here we'll clear the buffer in the context of OpenCL.

CBufferType::~CBufferType(void)
  {
   if(m_cOpenCL && m_myIndex >= 0 && m_cOpenCL.BufferFree(m_myIndex))
     {
      m_myIndex = -1;
      m_cOpenCL = NULL;
     }
  }

We have already used the BufferInit buffer initialization method in the neural layer class constructor. The main functionality of this method is to create a matrix of a specified size and populate it with initial values. The buffer size and initial values are specified in the method parameters. As part of this project, we will fill arrays with zero values during the initialization of the neural network and reset the buffers of accumulated deltas after updating the weight matrix.

bool CBufferType::BufferInit(ulong rows, ulong columns, TYPE value)
  {
   if(rows <= 0 || columns <= 0)
      return false;
   m_mMatrix = MATRIX::Full(rows, columns, value);
   if(m_cOpenCL)
     {
      CMyOpenCL *opencl=m_cOpenCL;
      BufferFree();
      return BufferCreate(opencl);
     }
//---
   return true;
  }
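
For illustration, creating a buffer and initializing it with zeros could look like this (the sizes and variable name here are arbitrary):

CBufferType *buffer = new CBufferType();
if(!buffer || !buffer.BufferInit(1, 10, 0))   // 1 row, 10 columns, filled with 0
   return false;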

The next method is to create a buffer in the OpenCL context. In parameters, the method receives a pointer to an instance of the CMyOpenCL class in the context of which the buffer should be created.

The method starts with a control block. First, we check the validity of the received pointer: if it is invalid, we delete the buffer previously created in the OpenCL context and exit the method.

bool CBufferType::BufferCreate(CMyOpenCL *opencl)
  {
//--- initial data validation block
   if(!opencl)
     {
      BufferFree();
      return false;
     }

Then we check whether it matches the previously saved pointer. If the pointers are identical and the buffer index is already saved, we won't create a new buffer in the OpenCL context but will simply copy the data from the matrix to the data exchange buffer again. To do this, we call the BufferWrite method. This method has its own set of checks, which we will become familiar with a bit later, and it returns a logical result. We exit the method, returning the result of this data-writing method.

//--- if the received pointer matches the one previously saved,
//--- simply copy the buffer contents into the context memory
   if(opencl == m_cOpenCL && m_myIndex >= 0)
      return BufferWrite();

The subsequent code of the method will be executed only if we have not exited the method during the preceding operations. Here, we check the validity of the previously saved pointer to an instance of the CMyOpenCL class and the presence of an index in the dynamic array storing handles of data buffers. If this condition is met, we must clear the memory and delete the existing buffer using the BufferFree method before continuing operations. Only after successfully deleting the old buffer do we have the right to open a new one. Otherwise, uncontrolled use of memory resources will lead to memory shortages and corresponding consequences.

//--- checking for a previously saved pointer to the OpenCL context
//--- if available, remove the buffer from the unused context
   if(m_cOpenCL && m_myIndex >= 0)
     {
      if(m_cOpenCL.BufferFree(m_myIndex))
        {
         m_myIndex = -1;
         m_cOpenCL = NULL;
        }
      else
         return false;
     }

At the end of the method, we initiate the creation of a new data buffer in the specified context by calling the AddBufferFromArray method discussed above. The index returned by the call is stored in the m_myIndex variable. If the buffer is created successfully, we save the CMyOpenCL instance pointer received in the method parameters before exiting.

//--- create a new buffer in the specified OpenCL context
   if((m_myIndex = opencl.AddBufferFromArray(m_mMatrix, 0, CL_MEM_READ_WRITE)) < 0)
      return false;
   m_cOpenCL = opencl;
//---
   return true;
  }

In this method, we used two new methods: one for clearing the buffer and the other for writing data. The BufferFree method is responsible for clearing the buffer. The method algorithm is quite simple. First, we check for the presence of a stored pointer to an instance of the CMyOpenCL class and an index in the dynamic buffer array. If they are available, call the CMyOpenCL class buffer cleaning method and specify the buffer index to delete. If the buffer is successfully removed from the context, clear the pointer to the CMyOpenCL class instance and the buffer index variable.

It should be noted that calling this method clears memory and deletes the buffer only in the context of OpenCL. At the same time, the data matrix itself and its contents remain in RAM. We will be able to exploit this property to use OpenCL context memory more efficiently a little later.

bool CBufferType::BufferFree(void)
  {
//--- checking for a previously saved pointer to the OpenCL context
//--- if available, remove the buffer from the unused context
   if(m_cOpenCL && m_myIndex >= 0)
      if(m_cOpenCL.BufferFree(m_myIndex))
        {
         m_myIndex = -1;
         m_cOpenCL = NULL;
         return true;
        }
   if(m_myIndex >= 0)
      m_myIndex = -1;
//---
   return false;
  }

Next, let's consider the methods for transferring information between the main program and the OpenCL context. This work is done in two similar methods: BufferRead and BufferWrite. Despite the opposite directions of the operation, the algorithm of the methods is identical. At the beginning of each method, a control block checks the validity of the pointer to the CMyOpenCL instance and the presence of an index in the dynamic buffer array. Only after the control block has been passed successfully is the same-named method of the OpenCL context class called, with the buffer index, the matrix, and the offset in the OpenCL buffer specified.

bool CBufferType::BufferRead(void)
  {
   if(!m_cOpenCL || m_myIndex < 0)
      return false;
//---
   return m_cOpenCL.BufferRead(m_myIndex, m_mMatrix, 0);
  }

bool CBufferType::BufferWrite(void)
  {
   if(!m_cOpenCL || m_myIndex < 0)
      return false;
//---
   return m_cOpenCL.BufferWrite(m_myIndex, m_mMatrix, 0);
  }

We have also created separate methods for getting and directly setting the buffer index in the dynamic array of buffer handles: GetIndex and SetIndex. Their code is straightforward, so I don't even move them outside the class declaration block (GetIndex is sketched below for reference).
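
A minimal sketch of GetIndex, which presumably just returns the stored index:

   virtual int       GetIndex(void)    { return m_myIndex; }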

We've added three GetData methods of the same name to the class. They all perform the same function: copying the matrix data into a given receiver. The difference is in the data receiver, which can be a dynamic array, a matrix, or another instance of the CBufferType class.

In the first case, the method parameters contain a reference to the array and a flag indicating whether the data must be read from the OpenCL context before copying. The flag is a necessary measure. As you may have noticed when we considered the method for reading data from the context, if there is no pointer to the CMyOpenCL object or no index in the dynamic buffer array, that method returns false. This would block copying data from a buffer object that has no buffer created in the OpenCL context. The flag allows us to control this process.

At the beginning of the method, we check the flag and, if necessary, read the data from the context. Only then do we resize the receiver array and run the data-copying loop. Finally, the method returns the number of copied items.

int CBufferType::GetData(TYPE &values[], bool load = true)
  {
   if(load && !BufferRead())
      return -1;
   if(ArraySize(values) != (int)Total() &&
      ArrayResize(values, Total()) <= 0)
      return -1;
//---
   for(uint i = 0; i < Total(); i++)
      values[i] = m_mMatrix.Flat(i);
   return (int)Total();
  }

The other two methods are built on the basis of a similar algorithm but they take into account the specifics of the receiver object.

int CBufferType::GetData(MATRIX &values, bool load = true)
  {
   if(load && !BufferRead())
      return -1;
//---
   values = m_mMatrix;
   return (int)Total();
  }

int CBufferType::GetData(CBufferType *values, bool load = true)
  {
   if(!values)
      return -1;
   if(load && !BufferRead())
      return -1;
   values.m_mMatrix.Copy(m_mMatrix);
   return (int)values.Total();
  }

Now that we have prepared constants and classes for working with the OpenCL context, we can continue to work on organizing the process directly in our neural network classes.

When creating the methods of our neural network base class, we did not implement two methods: UseOpenCL and InitOpenCL. As the names suggest, they are designed to initialize and control the use of OpenCL. The first one switches the operating mode, enabling or disabling the use of OpenCL. The second one initializes an instance of the CMyOpenCL class.

Let's take a step back and fill these gaps. In the parameters of the UseOpenCL method, we will specify the new state as a logical value. Using a logical value to convey a binary state to enable/disable a function seems intuitive to me. It is quite logical to use true to enable the functionality and false to turn it off.

In the method body, we will organize the algorithm to branch depending on the state being set. When we receive a command to disable the functionality, we check the current pointer to the CMyOpenCL class instance stored in the m_cOpenCL variable. If the pointer is invalid, the functionality has not been initialized before, and we have nothing to disable. In this case, we just update the flag for the use of the technology and exit the method.

If the functionality was previously activated and a signal to deactivate it has now been received, we will initiate the process of cleaning up the object and deleting it. After that, we will distribute a new (empty) pointer to neural network objects, save the flag, and exit the method.

void CNet::UseOpenCL(bool value)
  {
   if(!value)
     {
      if(!m_cOpenCL)
        {
         m_bOpenCL = value;
         return;
        }
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      if(!!m_cLayers)
         m_cLayers.SetOpencl(m_cOpenCL);
      m_bOpenCL = value;
      return;
     }

Further operations will be performed only when the OpenCL functionality is enabled. When we receive a signal to enable the use of OpenCL, we start the process of creating and initializing a new instance of the CMyOpenCL class, which is placed in a separate InitOpenCL method.

Before exiting the method, save the new flag for using OpenCL and distribute the pointer to the new object across all objects of the neural network. To do this, we will pass a new pointer into the dynamic array object storing the layers of the neural network, and from there, the pointer will be passed down the hierarchical chain to each object in the neural network.

//---
   if(!!m_cOpenCL)
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
     }
   m_bOpenCL = InitOpenCL();
   if(!!m_cLayers)
      m_cLayers.SetOpencl(m_cOpenCL);
   return;
  }
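
From the client code, switching the computing device then comes down to a single call (assuming a CNet instance named net):

CNet net;
//--- ... create and initialize the network ...
net.UseOpenCL(true);    // enable computations in the OpenCL context
net.UseOpenCL(false);   // fall back to CPU computations on MQL5 matrices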

The actual process of creating a new instance of the CMyOpenCL class and initializing it is placed in a separate InitOpenCL method.

At the beginning of the method, we check for a previously saved pointer to a CMyOpenCL object. At this point, the question arises as to what to do if a previously instantiated object exists. We could continue using the already initialized instance of the class or create a new one. Using the existing object seems less labor-intensive at this stage. However, in that case we might need an additional method to restart the functionality in the event of some error. This is additional effort that would likely require an extra control system for the entire project code.

We chose the forced restart option. Therefore, if we have a valid pointer to a previously created CMyOpenCL instance, we first delete its contents from memory, and then the object itself. Only after clearing the memory do we start creating and initializing a new object. The OpenCL context and program are created in the COpenCL::Initialize method. As a parameter, we pass to it the text variable containing our program. Remember, we wrote our program code into it from a file resource.

bool CNet::InitOpenCL(void)
  {
//--- Delete previously created OpenCL objects
   if(!!m_cOpenCL)
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
     }
//--- Create a new object to work with OpenCL
   m_cOpenCL = new CMyOpenCL();
   if(!m_cOpenCL)
      return false;
//--- Initialize the object for working with OpenCL
   if(!m_cOpenCL.Initialize(cl_program, true))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

Next, let's specify the number of kernels and buffers used. Above, we declared constants for 21 kernels (indices 0 through 20), each using no more than 4 data buffers. I intentionally don't specify a large number of buffers at this stage: thanks to our new method, the array will automatically expand when a new data buffer is created. However, the number of kernels in the program is static and does not depend on the neural network architecture.

   if(!m_cOpenCL.SetKernelsCount(21))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }
   if(!m_cOpenCL.SetBuffersCount(4))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

After that, we will initialize all program kernels and save the handles for calling them into an array within the CMyOpenCL class object.

We are not creating all the data buffers one by one at this stage for one simple reason: their quantity depends on the architecture of the neural network and may exceed the available OpenCL context memory capacity. If it is insufficient, dynamic memory allocation can be used. This implies loading buffers as needed and subsequently freeing memory when a specific data buffer is not planned to be used. However, this approach leads to an increase in the overhead of copying data between the main memory and the OpenCL context. Therefore, its use is justified only if there is a lack of GPU memory.

The kernel creation algorithm is identical. Here are just a few examples.

   if(!m_cOpenCL.KernelCreate(def_k_PerceptronFeedForward, "PerceptronFeedForward"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_CalcOutputGradient, "CalcOutputGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_CalcHiddenGradient, "CalcHiddenGradient"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

   if(!m_cOpenCL.KernelCreate(def_k_CalcDeltaWeights, "CalcDeltaWeights"))
     {
      m_cOpenCL.Shutdown();
      delete m_cOpenCL;
      return false;
     }

So we have come to the stage of organizing work with the OpenCL context directly in the neural layer class. When creating many class methods, we branched the method algorithm depending on the device for performing operations. Then we created the process organization code using MQL5 and left gaps in the process organization on the OpenCL side. Let's go back and fill in these gaps.

We will start with the direct pass method. We have previously discussed the organization of operations using MQL5. Now let's look at the implementation of working with the OpenCL context.

bool CNeuronBase::FeedForward(CNeuronBase * prevLayer)
  {
//--- control block
   if(!prevLayer || !m_cOutputs || !m_cWeights ||
      !prevLayer.GetOutputs() || !m_cActivation)
      return false;
   CBufferType *input_data = prevLayer.GetOutputs();
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      if(m_cWeights.Cols() != (input_data.Total() + 1))
         return false;
      //---
      MATRIX m = input_data.m_mMatrix;
      if(!m.Reshape(1, input_data.Total() + 1))
         return false;
      m[0][m.Cols() - 1] = 1;
      m_cOutputs.m_mMatrix = m.MatMul(m_cWeights.m_mMatrix.Transpose());
     }

First, we check that the initial data buffer, the weight matrix buffer, and the result buffer have valid buffer indices. The logic here is simple: if we receive a pointer to a data object with an existing buffer in the method parameters, we assume that the data has already been loaded into the OpenCL context. Above, when creating a data buffer in the CBufferType class, we immediately created a buffer in the OpenCL context. Therefore, the absence of a buffer index may indicate an error, in which case we exit the method with a false result. If you use dynamic memory allocation, then at this point you will need to create all the data buffers used by this kernel and copy the contents of the source data buffers into the OpenCL context.

   else // OpenCL block
     {
      //--- checking data buffers
      if(input_data.GetIndex() < 0)
         return false;
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;

Then we pass the parameters to the feed-forward kernel: buffer indices for the buffers and specific values for the discrete parameters.

      //--- passing arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_PerceptronFeedForward, def_pff_inputs,
                                                           input_data.GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_PerceptronFeedForward, def_pff_weights,
                                                            m_cWeights.GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgumentBuffer(def_k_PerceptronFeedForward, def_pff_outputs,
                                                            m_cOutputs.GetIndex()))
         return false;

      if(!m_cOpenCL.SetArgument(def_k_PerceptronFeedForward, def_pff_inputs_total,
                                                               input_data.Total()))
         return false;

In the NDRange array, we will specify the number of parallel threads, which equals the number of neurons in the current layer, and launch the kernel. Note that the Execute method does not literally start kernel execution, but only queues it. The kernel is actually launched when you try to read the results of its operation. However, we will not download the results of each kernel's operations. Instead, we will queue the feed-forward pass through the entire network and download only the results of the model from the last layer. Reading them triggers the execution of the entire accumulated queue of operations. Thus, we reduce the amount of transferred data and the time spent on downloading it.

In the case of dynamic memory allocation, after queuing the kernel, it will be necessary to load all changes from the OpenCL context into the data matrices and delete unused buffers from the context. Note that you need to download the contents of all buffers whose data changes during the kernel operation.

      //--- putting the kernel in the execution queue
      uint off_set[] = {0};
      uint NDRange[] = {m_cOutputs.Total()};
      if(!m_cOpenCL.Execute(def_k_PerceptronFeedForward, 1, off_set, NDRange))
         return false;
     }
//---
   return m_cActivation.Activation(m_cOutputs);
  }

After performing the above-described operations, we call the activation method of the required activation function class and exit the method.

It is also necessary to supplement the code for backpropagation methods. In the gradient computation kernel at the output of the neural network, three buffers are used: for target values, for the results of the last feed-forward pass, and for writing the obtained gradients. We'll check them at the beginning of the OpenCL block.

bool CNeuronBase::CalcOutputGradient(CBufferType *target, ENUM_LOSS_FUNCTION loss)
  {
//--- control block
   if(!target || !m_cOutputs || !m_cGradients ||
      target.Total() < m_cOutputs.Total() ||
      m_cGradients.Total() < m_cOutputs.Total())
      return false;

//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      switch(loss)
        {
         case LOSS_MAE:
            m_cGradients.m_mMatrix = target.m_mMatrix - m_cOutputs.m_mMatrix;
            break;
         case LOSS_MSE:
            m_cGradients.m_mMatrix = (target.m_mMatrix - m_cOutputs.m_mMatrix) * 2;
            break;
         case LOSS_CCE:
            m_cGradients.m_mMatrix=target.m_mMatrix/(m_cOutputs.m_mMatrix+FLT_MIN)*
                                     log(m_cOutputs.m_mMatrix) * (-1);
            break;
         case LOSS_BCE:
            m_cGradients.m_mMatrix = (target.m_mMatrix-m_cOutputs.m_mMatrix)/
                                     (MathPow(m_cOutputs.m_mMatrix,2) -
                                      m_cOutputs.m_mMatrix+FLT_MIN);
            break;
         default:
            m_cGradients.m_mMatrix = target.m_mMatrix - m_cOutputs.m_mMatrix;
            break;
        }
     }

   else // OpenCL block
     {
      //--- checking data buffers
      if(target.GetIndex() < 0)
         return false;
      if(m_cOutputs.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;

Next, we specify their indices in the kernel parameters, along with the loss function used.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcOutputGradient, def_outgr_target,
                                                                target.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcOutputGradient, def_outgr_outputs,
                                                            m_cOutputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcOutputGradient, def_outgr_gradients,
                                                          m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_CalcOutputGradient, def_outgr_loss_function,
                                                                        (int)loss))
         return false;

The number of independent operation threads launched equals the number of neurons at the output of our model.

Start the kernel execution and complete the method.

      //--- put the kernel in the execution queue
      uint NDRange[] = { m_cOutputs.Total() };
      uint off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_CalcOutputGradient, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

The process of distributing the gradient through a hidden layer to the neurons of the previous layer is divided into two sub-processes. In the first, we adjust the error gradient by the derivative of the activation function; in the second, we distribute the error gradient to the neurons of the previous layer according to their influence on the final result. We have created a separate kernel for each sub-process. The correction of the error gradient by the derivative of the activation function is handled by a separate activation function class. Therefore, in the CalcHiddenGradient method, we only have to launch the error gradient distribution kernel of the OpenCL program.

bool CNeuronBase::CalcHiddenGradient(CNeuronBase *prevLayer)
  {
//--- adjust the incoming gradient by the derivative of the activation function.
   if(!m_cActivation.Derivative(m_cGradients))
      return false;
//--- check the buffers of the previous layer
   if(!prevLayer)
      return false;
   CBufferType *input_data = prevLayer.GetOutputs();
   CBufferType *input_gradient = prevLayer.GetGradients();
   if(!input_data || !input_gradient ||
      input_data.Total() != input_gradient.Total())
      return false;
//--- check the match between the size of the input data buffer and the weight matrix
   if(!m_cWeights || m_cWeights.Cols() != (input_data.Total() + 1))
      return false;
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      MATRIX grad = m_cGradients.m_mMatrix.MatMul(m_cWeights.m_mMatrix);
      grad.Reshape(input_data.Rows(), input_data.Cols());
      input_gradient.m_mMatrix = grad;
     }

Again, at the beginning of the OpenCL block, we check for the availability of previously created buffers in the OpenCL context for the current kernel to work.

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(input_gradient.GetIndex() < 0)
         return false;
      if(m_cGradients.GetIndex() < 0)
         return false;

After successfully passing the control block, we will pass the buffer handles and the number of neurons in the layer to the kernel.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcHiddenGradient,
                            def_hidgr_gradient_inputs, input_gradient.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcHiddenGradient, def_hidgr_weights,
                                                            m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcHiddenGradient, def_hidgr_gradients,
                                                          m_cGradients.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_CalcHiddenGradient, def_hidgr_outputs_total,
                                                            m_cGradients.Total()))
         return false;

The number of threads in this case equals the number of neurons in the previous layer. We write this value into the first element of the NDRange array and launch the kernel.

      //--- put the kernel in the execution queue
      uint NDRange[] = {input_data.Total()};
      uint off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_CalcHiddenGradient, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

After propagating the error gradient to all neurons in our network according to their influence on the final result, the next step is to organize the process of updating the weight matrix. We have divided this process into two sub-processes. The weight matrix will not always be updated after every iteration. Therefore, at each iteration we calculate the error gradient for each weight and add it to a separate buffer. Upon receiving a command from the main program, we adjust the weight matrix using the accumulated error gradient divided by the batch size, that is, by its average value.

Error gradients are accumulated in the CalcDeltaWeights method. To perform the kernel operations of this method, we need three buffers:

  • the buffer of the results of the last direct pass of the previous layer,
  • the current layer's gradient buffer,
  • the buffer for accumulating weight gradients.

bool CNeuronBase::CalcDeltaWeights(CNeuronBase *prevLayer, bool read)
  {
//--- control block
   if(!prevLayer || !m_cDeltaWeights || !m_cGradients)
      return false;
   CBufferType *Inputs = prevLayer.GetOutputs();
   if(!Inputs)
      return false;
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      MATRIX m = Inputs.m_mMatrix;
      m.Resize(1, Inputs.Total() + 1);
      m[0][Inputs.Total()] = 1;
      m = m_cGradients.m_mMatrix.Transpose().MatMul(m);
      m_cDeltaWeights.m_mMatrix += m;
     }

First, as usual, we check the availability of used buffers in the OpenCL context.

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cGradients.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(Inputs.GetIndex() < 0)
         return false;

We pass their indices to the kernel parameters.

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcDeltaWeights,
                              def_delt_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcDeltaWeights, def_delt_inputs,
                                                               Inputs.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_CalcDeltaWeights, def_delt_gradients,
                                                          m_cGradients.GetIndex()))
         return false;

In this case, we will use a two-dimensional task space to launch the kernel. In one dimension, we specify the number of neurons in the current layer, and in the other dimension, the number of neurons in the previous layer.

After the preparatory work is completed, we will start the kernel execution.

Then we will check the data reading flag and, if necessary, load the result of operations from the context.

And of course, do not forget to monitor the process of performing operations at every step.

      //--- put the kernel in the execution queue
      uint NDRange[] = {m_cGradients.Total(), Inputs.Total()};
      uint off_set[] = {0, 0};
      if(!m_cOpenCL.Execute(def_k_CalcDeltaWeights, 2, off_set, NDRange))
         return false;
      if(read && !m_cDeltaWeights.BufferRead())
         return false;
     }
//---
   return true;
  }

We are successfully moving forward in the process of creating our project. To complete the work on the fully connected neuron, we need to describe the sub-process of updating the weight matrix. In our project, we decided to implement several algorithms for updating the weights. We have created our own kernel for each algorithm for updating the weight matrix. Let's add calls to these kernels to the corresponding methods of our class.

We will start with the stochastic gradient descent method. The implementation of this method requires only two buffers: accumulated deltas and the weight matrix. We check the availability of these buffers in the OpenCL context.

bool CNeuronBase::SGDUpdate(int batch_size, TYPE learningRate, VECTOR &Lambda)
  {
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      TYPE lr = learningRate / ((TYPE)batch_size);
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      m_cWeights.m_mMatrix += m_cDeltaWeights.m_mMatrix * lr;
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }
   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;

Then we pass their indices to the kernel parameters. In addition, we need to pass the training parameters to the kernel:

  • batch_size
  • learningRate
  • Lambda vector (regularization parameters)

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_SGDUpdate, def_sgd_delta_weights,
                                                     m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_SGDUpdate, def_sgd_weights,
                                                          m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_SGDUpdate, def_sgd_total,
                                                        (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_SGDUpdate, def_sgd_batch_size, batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_SGDUpdate, def_sgd_learningRate,
                                                                   learningRate))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_SGDUpdate, def_sgd_Lambda1, Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_SGDUpdate, def_sgd_Lambda2, Lambda[1]))
         return false;

Let's determine the number of threads to launch. We will launch roughly four times fewer threads than there are elements in the weight matrix. This effect is achieved through the use of vector operations: each thread processes a TYPE4 vector of four elements.

Please note the following about the algorithm for determining the number of threads. We cannot simply divide the number of elements by four, because we cannot be sure that it will always be a multiple of four, yet we must be sure that the threads cover all the neurons of our layer. So we need something similar to rounding up to an integer. Instead, we use the property of integer division to discard the fractional part, that is, to round down: before dividing by the vector size, we increase the number of elements by one less than the vector size. For example, with 10 elements and a vector size of 4, (10 + 3) / 4 = 13 / 4 gives 3 in integer arithmetic, and three threads cover all 10 elements. When using this trick, be particularly careful with the data types involved, because the desired effect is only achieved when all variables in the operation are integers.

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_SGDUpdate, 1, off_set, NDRange))
         return false;
     }
   return true;
  }

After the preparatory work, we queue the kernel for execution.

The weight matrix update using the momentum method adds a buffer for storing moments and a momentum averaging coefficient. Otherwise, the principles of the algorithm laid down in the previous method are preserved.

bool CNeuronBase::MomentumUpdate(int batch_size, TYPE learningRate,
                                 VECTOR &Beta, VECTOR &Lambda)
  {
   if(Beta[0] == 0)
      return SGDUpdate(batch_size, learningRate, Lambda);
//--- control block
   if(!m_cMomenum[0])
      return false;
   if(m_cMomenum[0].Total() < m_cWeights.Total())
      return false;
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      TYPE lr = learningRate / ((TYPE)batch_size);
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      m_cMomenum[0].m_mMatrix = m_cDeltaWeights.m_mMatrix * lr + 
                                        m_cMomenum[0].m_mMatrix * Beta[0] ;
      m_cWeights.m_mMatrix += m_cMomenum[0].m_mMatrix;
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(m_cMomenum[0].GetIndex() < 0)
         return false;

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_MomentumUpdate,
                          def_moment_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_MomentumUpdate, def_moment_weights,
                                                         m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_MomentumUpdate,
                                 def_moment_momentum, m_cMomenum[0].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_total,
                                                        (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_batch_size,
                                                                    batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_learningRate,
                                                                  learningRate))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_Lambda1,
                                                                     Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_Lambda2,
                                                                     Lambda[1]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_MomentumUpdate, def_moment_beta, Beta[0]))
         return false;

As before, we set the number of threads to four times fewer than the number of elements in the weight matrix and start performing the operations.

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_MomentumUpdate, 1, off_set, NDRange))
         return false;
     }
   return true;
  }

Pay close attention to the constants used for the kernels and their parameters. Despite the similarity of the operations, a small oversight or a typo in a constant can lead to a critical error and program termination.

Let's move on to the next implementation. The AdaGrad optimization method is implemented in the AdaGradUpdate method and in the corresponding kernel, which we identify by the def_k_AdaGradUpdate constant. To avoid possible errors when specifying parameters, all parameter constants for this kernel start with def_adagrad_. As you can see, all constant names are intuitive and logically connected, which reduces the risk of error. This approach is very convenient when there is a large number of constants.

The AdaGrad method, like the accumulated momentum method, uses a moment accumulation buffer. However, unlike the previous method, there is no averaging coefficient here. At this point, we don't care about the differences in how the parameters and buffers are used; we are only interested in their availability. Their use is already described in the OpenCL program kernel, and here we only organize the transfer of data from the main program to the OpenCL context.
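
In the same notation as before (with $g = \Delta w / batch\_size$ and the small constant $\varepsilon = 10^{-32}$ guarding against division by zero), the AdaGrad update implemented in the CPU branch below reads:

$$G_t = G_{t-1} + g^2, \qquad w_t = w_{t-1}\,(1 - \lambda_1) - \lambda_0 + \frac{\eta}{\sqrt{G_t} + \varepsilon}\, g$$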

The algorithm for organizing the process of working with the OpenCL context in the AdaGradUpdate method is similar to that used in the methods described earlier.

  • First, check for the required buffers in the OpenCL context.
  • Then pass pointers to the buffers and the optimization parameters to the kernel.
  • Finally, start the kernel execution.

bool CNeuronBase::AdaGradUpdate(int batch_size, TYPE learningRate, VECTOR &Lambda)
  {
//--- control block
   if(!m_cMomenum[0])
      return false;
   if(m_cMomenum[0].Total() < m_cWeights.Total())
      return false;
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      MATRIX delta = m_cDeltaWeights.m_mMatrix / ((TYPE)batch_size);
      MATRIX G = m_cMomenum[0].m_mMatrix = m_cMomenum[0].m_mMatrix + delta.Power(2);
      G = MathPow(MathSqrt(G) + 1e-32, -1);
      G = G * learningRate;
      m_cWeights.m_mMatrix += G * delta;
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(m_cMomenum[0].GetIndex() < 0)
         return false;

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaGradUpdate,
                           def_adagrad_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaGradUpdate, def_adagrad_weights,
                                                           m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaGradUpdate, def_adagrad_momentum,
                                                        m_cMomenum[0].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaGradUpdate, def_adagrad_total,
                                                          (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaGradUpdate, def_adagrad_batch_size,
                                                                      batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaGradUpdate, def_adagrad_learningRate,
                                                                    learningRate))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaGradUpdate, def_adagrad_Lambda1,
                                                                       Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaGradUpdate, def_adagrad_Lambda2,
                                                                       Lambda[1]))
         return false;

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_AdaGradUpdate, 1, off_set, NDRange))
         return false;
     }
   return true;
  }

The RMSProp optimization method is functionally similar to AdaGrad, but it includes a coefficient for averaging the accumulated momentum.
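
The only change relative to the AdaGrad formula above is in how the second moment is accumulated; the weight update itself stays the same ($\beta$ again stands for Beta[0]):

$$G_t = \beta\, G_{t-1} + (1 - \beta)\, g^2, \qquad w_t = w_{t-1}\,(1 - \lambda_1) - \lambda_0 + \frac{\eta}{\sqrt{G_t} + \varepsilon}\, g$$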

We're following the established framework: check the availability of OpenCL context buffers, then send pointers to buffers and optimization parameters to the kernel while also ensuring the use of the proper method and constant naming:

  • RMSPropUpdate method
  • def_k_RMSPropUpdate kernel constant
  • def_rms_ parameter constants

After specifying the parameters, launch the kernel.

bool CNeuronBase::RMSPropUpdate(int batch_size, TYPE learningRate,
                                VECTOR &Beta, VECTOR &Lambda)
  {
//--- control block
   if(!m_cMomenum[0])
      return false;
   if(m_cMomenum[0].Total() < m_cWeights.Total())
      return false;
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      MATRIX delta = m_cDeltaWeights.m_mMatrix / ((TYPE)batch_size);
      MATRIX G = m_cMomenum[0].m_mMatrix = m_cMomenum[0].m_mMatrix * Beta[0] +
                                                delta.Power(2) * (1 - Beta[0]);
      G = MathPow(MathSqrt(G) + 1e-32, -1);
      G = G * learningRate;
      m_cWeights.m_mMatrix += G * delta;
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(m_cMomenum[0].GetIndex() < 0)
         return false;

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_RMSPropUpdate, def_rms_delta_weights,
                                                      m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_RMSPropUpdate, def_rms_weights,
                                                           m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_RMSPropUpdate, def_rms_momentum,
                                                        m_cMomenum[0].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_total,
                                                          (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_batch_size,
                                                                      batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_learningRate,
                                                                    learningRate))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_Lambda1, Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_Lambda2, Lambda[1]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_RMSPropUpdate, def_rms_beta, Beta[0]))
         return false;

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_RMSPropUpdate, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

The developers of the AdaDelta method opted not to use a learning rate, compensating for it by introducing an additional moment buffer with its own averaging coefficient. Accordingly, we will use one more buffer in this kernel.
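
As implemented in the CPU branch below, with $\beta_1$ and $\beta_2$ standing for Beta[0] and Beta[1] (note that in this implementation the first moment $W$ accumulates the squared weights rather than the squared updates):

$$W_t = \beta_1 W_{t-1} + (1-\beta_1)\,w^2, \quad G_t = \beta_2 G_{t-1} + (1-\beta_2)\,g^2, \quad w_t = w_{t-1}(1-\lambda_1) - \lambda_0 + \frac{\sqrt{W_t}}{\sqrt{G_t}+\varepsilon}\, g$$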

When setting kernel parameters, again, mind the naming:

  • AdaDeltaUpdate method
  • def_k_AdaDeltaUpdate kernel constant
  • def_adadelt_ parameter constants

Furthermore, for the constructed neural network to be seamlessly portable, we need to ensure that the buffers are used consistently between the MQL5 implementation and the OpenCL context. Within a single platform, the order in which the momentum arrays are used makes no difference: whatever we call them, their contents will match the context in which they are used. However, when transferring a pre-trained neural network to another platform, a mismatch will likely produce unexpected results. Keep in mind the purpose and functionality of these arrays: the moments are used only when updating the weight matrix during training and do not participate in the feed-forward pass, so the impact of mixed-up buffers will only become apparent when we attempt to retrain the network. This should not be neglected: even a network that has served us for a long time will need periodic refinement to keep its weights relevant in our changing world.

Taking into account the above, we will pass pointers to the loaded buffers and training parameters to the kernel.

Let's calculate the number of required threads and launch the kernel.

bool CNeuronBase::AdaDeltaUpdate(int batch_size, VECTOR &Beta, VECTOR &Lambda)
  {
//--- control block
   for(int i = 0; i < 2; i++)
     {
      if(!m_cMomenum[i])
         return false;
      if(m_cMomenum[i].Total() < m_cWeights.Total())
         return false;
     }
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      MATRIX delta = m_cDeltaWeights.m_mMatrix / ((TYPE)batch_size);
      MATRIX W = m_cMomenum[0].m_mMatrix = m_cMomenum[0].m_mMatrix * Beta[0] +
                                  m_cWeights.m_mMatrix.Power(2) * (1 - Beta[0]);
      m_cMomenum[1].m_mMatrix = m_cMomenum[1].m_mMatrix * Beta[1] + 
                                                 delta.Power(2) * (1 - Beta[1]);
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      W = MathSqrt(W) / (MathSqrt(m_cMomenum[1].m_mMatrix) + 1e-32);
      m_cWeights.m_mMatrix += W * delta;
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(m_cMomenum[0].GetIndex() < 0)
         return false;
      if(m_cMomenum[1].GetIndex() < 0)
         return false;

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaDeltaUpdate,
                           def_adadelt_delta_weights, m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaDeltaUpdate, def_adadelt_weights,
                                                           m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaDeltaUpdate, def_adadelt_momentumW,
                                                        m_cMomenum[0].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdaDeltaUpdate, def_adadelt_momentumG,
                                                        m_cMomenum[1].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_total,
                                                          (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_batch_size,
                                                                      batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_Lambda1,
                                                                       Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_Lambda2,
                                                                       Lambda[1]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_beta1, Beta[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdaDeltaUpdate, def_adadelt_beta2, Beta[1]))
         return false;

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_AdaDeltaUpdate, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

Our description of the operations performed in the fully connected neural layer is nearing completion. One method remains to be described: the weight update method based on the Adam optimization algorithm. Although last on the list, this method is not of lesser importance. Like AdaDelta, Adam employs two momentum buffers, but unlike AdaDelta, it brings the learning rate back.
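
In the notation used above, the Adam update as implemented in the CPU branch below (the bias correction uses the constant factors $1-\beta_1$ and $1-\beta_2$, exactly as in the code):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g^2, \quad w_t = w_{t-1}(1-\lambda_1) - \lambda_0 + \eta\,\frac{m_t/(1-\beta_1)}{\sqrt{v_t/(1-\beta_2)}}$$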

Let's recap the main stages of our algorithm and highlight key checkpoints:

  • Verify the presence of the necessary data in the OpenCL context memory.
  • Pass pointers to data buffers and training parameters to the kernel. Ensure naming consistency: the AdamUpdate method, the def_k_AdamUpdate kernel constant, and the def_adam_... parameter constants.
  • Monitor the consistent use of buffers between MQL5 and the OpenCL context.
  • Execute the kernel.

bool CNeuronBase::AdamUpdate(int batch_size, TYPE learningRate,
                             VECTOR &Beta, VECTOR &Lambda)
  {
//--- control block
   for(int i = 0; i < 2; i++)
     {
      if(!m_cMomenum[i])
         return false;
      if(m_cMomenum[i].Total() != m_cWeights.Total())
         return false;
     }
//--- algorithm branching depending on the operating device
   if(!m_cOpenCL)
     {
      MATRIX delta = m_cDeltaWeights.m_mMatrix / ((TYPE)batch_size);
      m_cMomenum[0].m_mMatrix = m_cMomenum[0].m_mMatrix * Beta[0] +
                                                      delta * (1 - Beta[0]);
      m_cMomenum[1].m_mMatrix = m_cMomenum[1].m_mMatrix * Beta[1] +
                                           MathPow(delta,2) * (1 - Beta[1]);
      MATRIX M = m_cMomenum[0].m_mMatrix / (1 - Beta[0]);
      MATRIX V = m_cMomenum[1].m_mMatrix / (1 - Beta[1]);
      m_cWeights.m_mMatrix -= m_cWeights.m_mMatrix * Lambda[1] + Lambda[0];
      m_cWeights.m_mMatrix += M * learningRate  / MathSqrt(V);
      m_cDeltaWeights.m_mMatrix.Fill(0);
     }

   else // OpenCL block
     {
      //--- check data buffers
      if(m_cWeights.GetIndex() < 0)
         return false;
      if(m_cDeltaWeights.GetIndex() < 0)
         return false;
      if(m_cMomenum[0].GetIndex() < 0)
         return false;
      if(m_cMomenum[1].GetIndex() < 0)
         return false;

      //--- pass arguments to the kernel
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdamUpdate, def_adam_delta_weights,
                                                    m_cDeltaWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdamUpdate, def_adam_weights,
                                                         m_cWeights.GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdamUpdate, def_adam_momentumM,
                                                      m_cMomenum[0].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgumentBuffer(def_k_AdamUpdate, def_adam_momentumV,
                                                      m_cMomenum[1].GetIndex()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_total,
                                                       (int)m_cWeights.Total()))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_batch_size,
                                                                    batch_size))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_Lambda1, Lambda[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_Lambda2, Lambda[1]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_beta1, Beta[0]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_beta2, Beta[1]))
         return false;
      if(!m_cOpenCL.SetArgument(def_k_AdamUpdate, def_adam_learningRate,
                                                                  learningRate))
         return false;

      //--- put the kernel in the execution queue
      int NDRange[] = { (int)((m_cWeights.Total() + 3) / 4) };
      int off_set[] = {0};
      if(!m_cOpenCL.Execute(def_k_AdamUpdate, 1, off_set, NDRange))
         return false;
     }
//---
   return true;
  }

We have completed the description of the processes of a fully connected neural layer and have reached the stage where we can review the work done and assess the initial results. In fact, we have already created enough base classes to build a small perceptron model with several fully connected layers. One of them will serve as the receiver of the input data (the input layer), the last neural layer will produce the results (the output layer), and the hidden layers will sit in between.