Organizing parallel computing using OpenCL

In the previous chapters, we became acquainted with how a fully connected neural layer operates when implemented in MQL5. Let me remind you that in our implementation, we used matrix operations to multiply the input data vector by the weight matrix. The signal flows sequentially from one neural layer to the next, and we cannot start operations on the subsequent layer until the operations on the previous one are fully completed. At the same time, the result of the operations within one neuron of a layer does not depend on the operations performed on other neurons of the same layer. Consequently, we can reduce the time spent processing a single neural layer if we organize parallel computation: the more neurons we process simultaneously, the less time we spend processing one signal and training the neural network as a whole.

As we have already discussed, OpenCL technology will help us organize parallel computations. Of course, this will require extra work to set up the process. Let's consider which processes we should transfer to OpenCL to make it as efficient as possible. Let me remind you that, due to the overhead of transferring data between devices, we can achieve a real performance gain only with a large number of concurrent threads of operations.

The first thing we can move to OpenCL is the computation of the forward pass. The operations on each individual neuron can be transferred to parallel computing: first, we calculate the weighted sum of the input signal for each neuron, and then we compute its activation function.
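To illustrate, below is a minimal sketch of an OpenCL kernel for the forward pass, in which each work-item processes one neuron: it accumulates the weighted sum of the input vector and applies an activation function. The buffer names, the weight matrix layout (one row per neuron, with the bias stored in the last column), and the choice of a sigmoid activation are assumptions made purely for illustration, not the final implementation.

```
// Forward pass sketch: one work-item per neuron of the layer.
// Buffer names, weight layout, and the sigmoid activation are illustrative assumptions.
__kernel void FeedForward(__global const float *inputs,   // outputs of the previous layer
                          __global const float *weights,  // weight matrix of this layer
                          __global float *outputs,        // results of this layer
                          const int inputs_total)         // number of inputs per neuron
  {
   const int n = get_global_id(0);              // index of the neuron handled by this work-item
   const int shift = n * (inputs_total + 1);    // start of the neuron's row in the weight matrix
   float sum = weights[shift + inputs_total];   // begin with the bias term
   for(int i = 0; i < inputs_total; i++)
      sum += inputs[i] * weights[shift + i];    // weighted sum of the input signal
   outputs[n] = 1.0f / (1.0f + exp(-sum));      // sigmoid activation as an example
  }
```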

We can also move the operations of the backward pass into the realm of parallel computations. Let's break down the steps of the backward pass.

The deviation of the calculated values from the reference values at the output layer of the neural network can easily be split into separate threads, one per neuron.
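As a sketch, this step reduces to a single element-wise operation with one work-item per output neuron. The buffer names and the use of a plain difference as the error measure are assumptions for illustration.

```
// Output-layer error sketch: one work-item per output neuron.
// A plain difference is used here; the actual loss derivative may differ.
__kernel void CalcOutputError(__global const float *outputs,  // values produced by the network
                              __global const float *targets,  // reference (target) values
                              __global float *errors)         // resulting deviations
  {
   const int n = get_global_id(0);
   errors[n] = targets[n] - outputs[n];
  }
```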

Next, we can adjust the obtained deviation for each neuron by the derivative of the activation function. As a result of this operation, we obtain the error gradient before the activation function of the neuron.
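A sketch of this step for an assumed sigmoid activation might look as follows; the derivative is expressed through the already computed neuron output, and all names are illustrative.

```
// Gradient sketch: multiply the error by the derivative of the activation
// function (sigmoid assumed), giving the gradient before the activation.
__kernel void CalcGradient(__global const float *outputs,    // activated outputs of the layer
                           __global const float *errors,     // deviations after the activation
                           __global float *gradients)        // gradients before the activation
  {
   const int n = get_global_id(0);
   const float out = outputs[n];
   gradients[n] = errors[n] * out * (1.0f - out);  // sigmoid derivative f' = f * (1 - f)
  }
```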

Continuing the backpropagation process, the next step is to distribute the resulting error gradient to the neurons of the previous layer. In a fully connected neural layer, every neuron of the previous layer is connected to every neuron of the subsequent layer, so each element of the error gradient vector contains a contribution from every neuron of the previous layer. There are two seemingly equivalent approaches here:

  • We can create a thread for each element of the error gradient vector and, within each thread, iterate over all neurons of the previous layer, adding the corresponding component of the error gradient to each of them.
  • Conversely, we can create a thread for each neuron of the previous layer and have it gather the error gradient components from all neurons of the subsequent layer.

Despite their apparent equivalence, the first approach has several drawbacks. Since we would be summing up error gradient components coming from different neurons of the subsequent layer, the receiving gradient buffer would have to be initialized to zero before the operations start, which means additional costs in time and resources. There are also technical nuances: working with global memory is slower than working with a thread's private memory, so it is preferable to accumulate values in fast memory and write them to global memory once. The most serious problem with this approach, however, is the significant likelihood of several threads attempting to write to the gradient of the same neuron of the previous layer at the same time, which is highly undesirable for us.

Based on the combination of the above factors, the second option becomes more attractive for implementation.
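A minimal sketch of the chosen approach is shown below: each work-item corresponds to one neuron of the previous layer, accumulates its share of the gradient from all neurons of the subsequent layer in private memory, and writes the result to global memory once. The buffer names and weight layout follow the assumptions of the previous sketches and are illustrative only.

```
// Hidden-layer gradient sketch: one work-item per neuron of the previous layer.
__kernel void CalcHiddenGradient(__global const float *next_gradients, // gradients of the subsequent layer
                                 __global const float *weights,        // weight matrix of the subsequent layer
                                 __global float *gradients,            // result: gradients of this layer
                                 const int next_total,                 // neurons in the subsequent layer
                                 const int current_total)              // neurons in this layer
  {
   const int i = get_global_id(0);     // neuron of the previous layer handled by this work-item
   float sum = 0.0f;                   // accumulate in fast private memory
   for(int n = 0; n < next_total; n++)
      sum += next_gradients[n] * weights[n * (current_total + 1) + i];
   gradients[i] = sum;                 // single write to global memory, no write conflicts
  }
```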

Splitting the following two processes into threads (calculating the deltas for weight adjustment and directly updating the weight matrix) raises no questions, as each weight belongs to exactly one connection between two neurons and does not affect the others.
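For completeness, here is a sketch of a weight update using plain gradient descent, in which each work-item handles exactly one weight; the learning rate, buffer names, and the simple delta rule are, again, illustrative assumptions rather than the final training algorithm.

```
// Weight update sketch: one work-item per weight (plain gradient descent).
__kernel void UpdateWeights(__global float *weights,          // weight matrix being updated
                            __global const float *gradients,  // gradients of the receiving layer
                            __global const float *inputs,     // outputs of the previous layer
                            const int inputs_total,           // inputs per neuron (excluding bias)
                            const float learning_rate)
  {
   const int w = get_global_id(0);            // flat index of the weight
   const int n = w / (inputs_total + 1);      // receiving neuron
   const int i = w % (inputs_total + 1);      // input index; the last one is the bias
   const float in = (i == inputs_total ? 1.0f : inputs[i]);
   weights[w] += learning_rate * gradients[n] * in;  // delta rule update
  }
```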