Architecture and implementation principles of a fully connected layer

When constructing the base class of the neural network and the dynamic array for storing pointers to neural layers, we defined the main methods and interfaces for data exchange between the neural network manager and its components. These interfaces dictate the basic public methods of all our neural layer classes. Let me briefly summarize those sections and highlight the key class methods we have yet to write, along with their functionality.

Please note that all neural layer objects must be descendants of the CObject base class. This is the fundamental requirement for placing pointers to instances of these objects into the dynamically created array we've designed.

Adhering to the general principles of object organization, we will initialize internal variables and constants in the class constructor. In the destructor, we will perform memory cleanup: deleting all internal instances of various classes and clearing arrays.
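As a rough sketch (member names here are illustrative and not necessarily those we will settle on; CBufferType is the data buffer class described below), the constructor and destructor might look like this:

```mql5
#include <Object.mqh>

class CNeuronBase : public CObject
  {
protected:
   CBufferType      *m_cOutputs;    // buffer of neuron output states
   CBufferType      *m_cWeights;    // weight matrix buffer

public:
                     CNeuronBase(void) : m_cOutputs(NULL),
                                         m_cWeights(NULL)
                       { }
                    ~CNeuronBase(void)
                       {
                        // delete dynamically created internal objects
                        if(CheckPointer(m_cOutputs) != POINTER_INVALID)
                           delete m_cOutputs;
                        if(CheckPointer(m_cWeights) != POINTER_INVALID)
                           delete m_cWeights;
                       }
  };
```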

The Init method receives as a parameter an instance of the CLayerDescription class containing the description of the neural layer to be created. This method must therefore build the entire internal architecture required for the proper functioning of our neural layer. In particular, we will need to create several arrays to store the data.
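The exact contents of CLayerDescription will be defined later; as an assumption, a minimal description object could look like this (the field set here is illustrative, not final):

```mql5
#include <Object.mqh>

// A minimal sketch of the layer description object
class CLayerDescription : public CObject
  {
public:
   int               type;          // type of the neural layer to create
   int               count;         // number of neurons in the layer
   int               window;        // size of the input data window
   int               activation;    // activation function of the neurons
   int               optimization;  // weight optimization method
   int               batch;         // batch size for weight updates
  };
```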

The first is an array for recording the states at the outputs of the neurons. Its size will be equal to the number of neurons in our layer.

We will also need an array to store the weights. This will be a matrix in which the size of the first dimension equals the number of neurons in our layer, while the second dimension is one element larger than the input data array. For a fully connected neural layer, the input data consists of the output values of the neurons of the previous layer, so the size of the second dimension is the size of the previous layer plus one. The added element will serve to adjust the bias. For example, a layer of 5 neurons receiving data from a layer of 10 neurons needs a 5×11 weight matrix.

For the backward pass, we will need an array to store gradients (deviations of calculated values from reference values at the output of neurons). Its size will correspond to the number of neurons in our layer.

Additionally, depending on the training method, we might need one or two matrices to store accumulated moments. The sizes of these matrices will be equal to the size of the weight matrix.

We will not always update the weights after every iteration of the backward pass. The weights may instead be updated after a full pass over the training sample or after a batch of a given size. At the same time, we will not store the intermediate states of all neurons and their inputs. Instead, after each iteration of the backward pass, we will calculate the required change for each weight as if we were updating the weights at every iteration, but rather than changing the weights, we will accumulate the resulting deltas in a separate array. When an update is due, we will simply take the average delta over the period and adjust the weights accordingly. For this purpose, we will need one more matrix with a size equal to that of the weight matrix.
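Collecting the buffers listed above, the data members of the layer class might be declared as follows (a sketch; the names are illustrative):

```mql5
#include <Object.mqh>

// Data buffers of the fully connected layer (sketch)
class CNeuronBase : public CObject
  {
protected:
   CBufferType      *m_cOutputs;       // output states: 1 x [neurons]
   CBufferType      *m_cGradients;     // error gradients: 1 x [neurons]
   CBufferType      *m_cWeights;       // weights: [neurons] x [inputs + 1 bias]
   CBufferType      *m_cDeltaWeights;  // accumulated deltas: size of m_cWeights
   CBufferType      *m_cMomentum[2];   // one or two moment matrices, per optimizer
  };
```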

To store all these arrays, we will create a special class called CBufferType. It will inherit from the CObject base class, adding the functionality necessary to operate as a data buffer.
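The complete class will be described later. A minimal sketch, assuming the buffer wraps the native MQL5 matrix type (the actual class will add further methods, including ones for exchanging data with OpenCL), might look like this:

```mql5
#include <Object.mqh>

// A minimal sketch of the data buffer class
class CBufferType : public CObject
  {
public:
   matrix            m_mMatrix;     // underlying data storage

   // create a buffer of the given size and fill it with a constant value
   bool              BufferInit(const ulong rows, const ulong cols,
                                const double value = 0)
                       {
                        if(!m_mMatrix.Init(rows, cols))
                           return false;
                        m_mMatrix.Fill(value);
                        return true;
                       }
   // total number of elements in the buffer
   ulong             Total(void) { return m_mMatrix.Rows() * m_mMatrix.Cols(); }
  };
```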

In addition to creating the arrays and matrices, we need to fill them with initial values. We will fill all arrays except the weights with zeros and initialize the weight matrix with random values.
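Inside Init, this could look roughly as follows. This is a sketch under the assumptions above: outputs and inputs stand for the neuron count of the current layer and the size of the input data array taken from the layer description, and the creation of the buffer objects with new is omitted for brevity:

```mql5
// Fragment of the Init method (sketch)
bool CNeuronBase::Init(const CLayerDescription *description)
  {
   ulong outputs = (ulong)description.count;   // neurons in this layer
   ulong inputs  = (ulong)description.window;  // size of the input data array
   // (buffer objects are assumed to be created beforehand with new CBufferType())

   // zero-fill the output, gradient, and delta buffers
   if(!m_cOutputs.BufferInit(1, outputs, 0))
      return false;
   if(!m_cGradients.BufferInit(1, outputs, 0))
      return false;
   if(!m_cDeltaWeights.BufferInit(outputs, inputs + 1, 0))
      return false;

   // initialize the weights with small random values, e.g. in [-0.5, 0.5]
   if(!m_cWeights.BufferInit(outputs, inputs + 1, 0))
      return false;
   for(ulong r = 0; r < outputs; r++)
      for(ulong c = 0; c <= inputs; c++)
         m_cWeights.m_mMatrix[r][c] = (double)MathRand() / 32767.0 - 0.5;
   return true;
  }
```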

In addition to data arrays, our class will also use local variables. We will need to save the activation and optimization parameters of the neurons. We will store the type of the optimization method in a variable, while for the activation functions we will create a whole hierarchy of separate classes inheriting from a common base class.

Let me remind you that we are building a universal platform for creating neural networks and running them in the MetaTrader 5 terminal. We plan to provide users with the ability to utilize multi-threaded computations using OpenCL technology. All objects in our neural network will operate in the same context, which will reduce the time spent on unnecessary data transfers. The actual instance of the class for working with OpenCL will be created in the base neural network class, and a pointer to the created object will be passed to all elements of the neural network. Therefore, every object that makes up the neural network, including our neural layer, should have a SetOpenCL method for receiving this pointer and a variable for storing it.
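For example, using the COpenCL class from the standard library (the final implementation may use its own wrapper class instead), the method might be as simple as storing the shared pointer:

```mql5
#include <Object.mqh>
#include <OpenCL/OpenCL.mqh>

// Sketch: shared OpenCL context handling in the neural layer class
class CNeuronBase : public CObject
  {
protected:
   COpenCL          *m_cOpenCL;    // pointer to the shared OpenCL context object

public:
   // store the pointer; the context object itself is owned by the network manager
   virtual bool      SetOpenCL(COpenCL *opencl)
                       {
                        m_cOpenCL = opencl;
                        return (CheckPointer(m_cOpenCL) != POINTER_INVALID);
                       }
  };
```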

The forward pass will be organized in the FeedForward method. The only parameter of this method will be a pointer to the CNeuronBase object of the previous layer of the neural network. We will need the output states of the neurons from the previous layer, which will form the incoming data stream. To access them, let's create the GetOutputs method.
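In mathematical terms, the method computes a weighted sum of the inputs for every neuron and applies the activation function. A simplified CPU-only sketch, relying on the members drafted above and reducing the activation step to a comment:

```mql5
// Forward pass sketch: z = W * [inputs, 1], then activation
bool CNeuronBase::FeedForward(CNeuronBase *prevLayer)
  {
   if(CheckPointer(prevLayer) == POINTER_INVALID)
      return false;

   // extend the previous layer's outputs with a constant 1 for the bias element
   ulong n = prevLayer.GetOutputs().m_mMatrix.Cols();
   vector x(n + 1);
   for(ulong i = 0; i < n; i++)
      x[i] = prevLayer.GetOutputs().m_mMatrix[0][i];
   x[n] = 1;

   // one weighted sum per neuron of the current layer
   vector z = m_cWeights.m_mMatrix.MatMul(x);

   // ...here the activation function object would transform z...
   return m_cOutputs.m_mMatrix.Row(z, 0);
  }
```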

The backward pass, unlike the forward pass, will be divided into several methods:

  • CalcOutputGradient calculates the error gradient at the output layer of the neural network against the reference values.
  • CalcHiddenGradient propagates the error gradient through the hidden layer from output to input. As a result, we pass the error gradients on to the previous layer. To access the gradient array of the previous layer, we will need an accessor method — GetGradients.
  • CalcDeltaWeights calculates the necessary changes in weights based on the analysis of the last iteration.
  • UpdateWeights is a method to directly update the weights.

Let's not forget the file-handling and identification methods common to all objects, namely Save, Load, and Type. Putting everything together, the public interface of our class might look like the sketch below.
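This sketch is assembled purely from the methods discussed above; the parameter lists are assumptions at this point and will be refined during implementation:

```mql5
#include <Object.mqh>
#include <OpenCL/OpenCL.mqh>

// A possible public interface of the base neural layer class (sketch)
class CNeuronBase : public CObject
  {
public:
                     CNeuronBase(void);
                    ~CNeuronBase(void);
   virtual bool      Init(const CLayerDescription *description);
   virtual bool      SetOpenCL(COpenCL *opencl);
   // forward pass
   virtual bool      FeedForward(CNeuronBase *prevLayer);
   virtual CBufferType *GetOutputs(void);
   // backward pass
   virtual bool      CalcOutputGradient(CBufferType *target);
   virtual bool      CalcHiddenGradient(CNeuronBase *prevLayer);
   virtual CBufferType *GetGradients(void);
   virtual bool      CalcDeltaWeights(CNeuronBase *prevLayer);
   virtual bool      UpdateWeights(int batch_size);
   // file operations and identification
   virtual bool      Save(const int file_handle);
   virtual bool      Load(const int file_handle);
   virtual int       Type(void);
  };
```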

In our object detailing, we will stop at the level of the neural layer class and will not create separate objects for each neuron. There are a number of reasons for this. The most obvious ones are:

  • Using the Softmax activation function involves working with the entire neural layer.
  • Using the Dropout and Layer Normalization methods requires the processing of the entire neural layer data.
  • This approach allows us to efficiently organize multi-threaded computations based on matrix operations.

Let's take a closer look at matrix operations and see how they allow us to distribute computations across multiple parallel threads. Consider a small example with three elements at the input (vector Inputs) and two neurons in the layer. Each neuron has its own weight vector, W1 and W2 respectively, and each weight vector contains three elements.

According to the mathematical model of the neuron, we need to multiply the input data vector element-wise by the weight vector, sum the resulting products, and apply the activation function to the sum. Essentially the same process, apart from the activation function, is achieved through matrix multiplication.
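For the first neuron in our example, denoting the i-th element of Inputs by $x_i$ and the i-th weight of neuron j by $w_{j,i}$ (notation introduced here for clarity), the pre-activation value is:

$$z_1 = \sum_{i=1}^{3} w_{1,i}\,x_i = w_{1,1}x_1 + w_{1,2}x_2 + w_{1,3}x_3$$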

Matrix multiplication is an operation whose result is a matrix. Each element of the new matrix is obtained as the sum of the element-wise products of a row of the first matrix with a column of the second matrix.

Thus, to obtain the sum of element-wise products of the input data vector and the weight vector of one of the neurons, it is necessary to multiply the input data row vector by the weight column vector.

This rule is applicable to any matrices. The only condition is that the number of columns in the first matrix must be equal to the number of rows in the second matrix. Therefore, we can assemble the weight vectors of all neurons in the layer into a single matrix W, where each column will represent the weight vector of an individual neuron.
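In our example, W has three rows (one per input element) and two columns (one per neuron), and multiplying the input row vector by it yields both weighted sums at once:

$$Z = Inputs \times W =
\begin{pmatrix} x_1 & x_2 & x_3 \end{pmatrix}
\begin{pmatrix} w_{1,1} & w_{2,1} \\ w_{1,2} & w_{2,2} \\ w_{1,3} & w_{2,3} \end{pmatrix}
= \begin{pmatrix} z_1 & z_2 \end{pmatrix}$$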

You can see that the computation of any element in vector Z is independent of the other elements of the same vector. Accordingly, we can load the matrices of input data and weights into memory and then concurrently compute the values of all elements in the output vector.

We can go even further and load not just a single vector of input data, but a matrix where the rows represent individual states of the system. When working with time series data, each row represents a snapshot of the system's state at a certain moment in time. As a result, we will increase the number of parallel threads of operations and potentially reduce the time to process data.
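With m such states stacked into an m×3 matrix, a single multiplication produces the pre-activation values of both neurons for all m states at once, and every element of the result can still be computed independently:

$$Z = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} \\ \vdots & \vdots & \vdots \\ x_{m,1} & x_{m,2} & x_{m,3} \end{pmatrix} \times W, \qquad Z \in \mathbb{R}^{m \times 2}$$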

Naturally, we can also use multi-threading to calculate the activation function values for each independent element of matrix Z. An exception might be the use of the Softmax activation function due to the peculiarities of its computation. However, even in this case, parallelization of computations at different stages of the function calculation is possible.
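The reason Softmax stands apart is its shared denominator. For each element,

$$\sigma(z_j) = \frac{e^{z_j}}{\sum_{k} e^{z_k}}$$

so computing any one output requires a reduction (the sum of exponents) over the entire layer. The exponents themselves and the final divisions, however, can still be computed in parallel, which is the stage-wise parallelization mentioned above.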