Neural Networks in Trading: Time Series Forecasting Using Adaptive Modal Decomposition (ACEFormer)

MetaTrader 5 — Trading systems | 3 July 2026, 12:52

Dmitriy Gizlyk

Introduction

The financial market is a complex and dynamic system in which every price movement results from the intricate interaction of numerous factors. It reflects virtually everything, from macroeconomic information flows and company-specific news to investors' emotional swings and the cold calculations of algorithmic trading strategies. Within this vast mixture of signals, noise, and distortions, extracting meaningful information and identifying genuine market trends is not merely an interesting challenge but a strategic necessity.

The ability to accurately forecast market direction can provide a sustainable competitive advantage. One of the greatest challenges is informational noise — the frequent, and often meaningless, micro-fluctuations in price generated by short-term trades, news headlines, or random algorithmic activity. These fluctuations often prevent analytical models from capturing the underlying trend.

Efforts to develop predictive models date back to the late twentieth century. Early neural network architectures demonstrated that it was, in principle, possible to train models to forecast market movements. However, these approaches were unable to retain information over long time horizons and quickly lost track of events that had occurred only a short time earlier.

The introduction of LSTM networks improved this situation. Thanks to their memory mechanisms, LSTM models were capable of preserving important patterns over extended periods, and they were quickly adopted for time series forecasting. Nevertheless, the problem proved less straightforward than it seemed. Financial time series differ from conventional sequential data. They are irregular, often with uneven intervals between ticks, and they contain a large number of short-lived spikes that carry little meaningful information about the underlying market trend.

High-frequency trading poses an especially significant challenge. It generates what is commonly referred to as market noise — repeated quote fluctuations occurring within extremely short time intervals. These fluctuations obscure genuine trends, increase data instability, and overwhelm predictive models with insignificant events. As a result, even sophisticated neural architectures may begin focusing on distracting short-term variations rather than the information that truly matters.

To address these challenges, the paper "An End-to-End Structure with Novel Position Mechanism and Improved EMD for Stock Forecasting" introduces the ACEFormer framework, an integrated architecture for financial time series analysis designed specifically for high-frequency trading environments. Rather than representing a single predictive model, ACEFormer combines several complementary components, each addressing a distinct task: noise filtering, irregular temporal interval modeling, and selective attention to the most informative market movements.

The first stage of the ACEFormer architecture performs data denoising. Here it uses a modified ACEEMD (Alias Complete Ensemble Empirical Mode Decomposition with Adaptive Noise) algorithm. The method is based on Empirical Mode Decomposition (EMD) but incorporates several improvements. This helps eliminate two major limitations of conventional EMD: the end effect and mode mixing. By removing the first Intrinsic Mode Function (IMF), which contains most of the high-frequency oscillations, ACEEMD effectively suppresses market noise while preserving the critical turning points that characterize the underlying trend.

Following this pre-filtering, the denoised data is passed to the temporal-awareness module. Because financial market events occur at irregular time intervals, conventional attention mechanisms are not well suited to modeling them. To address this issue, the authors incorporate a Time-Aware module that processes feature values while explicitly accounting for the elapsed time between observations. This enables the model to better capture event sequences and identify causal relationships.

The resulting features are then processed by an enhanced attention block. Unlike the standard Attention mechanism, this module is specifically designed for financial data, where identifying critical change points while ignoring insignificant fluctuations is essential. By placing greater emphasis on informative regions of the time series, the model does not dissipate its attention on noisy elements and concentrates on potentially relevant information.

In the final stage, a fully connected neural network is used. It aggregates the extracted features and produces the final prediction of future price direction. Consequently, the ACEFormer architecture encompasses the entire forecasting pipeline — from noise reduction and temporal modeling to attention-based feature extraction and final prediction.

The ACEFormer algorithm

The ACEFormer algorithm is a multi-stage framework for processing time series data with the objective of accurately forecasting price movements in financial markets. Its core principle is the sequential and adaptive suppression of market noise, followed by the extraction of informative features and the generation of forecasts that account for long-term market trends. This approach is particularly effective in high-frequency trading environments, where meaningful signals are often obscured by numerous random fluctuations and market noise.

The process begins with the input time series 𝑆={𝑠₁,𝑠₂,…,𝑠ₙ}, where each vector 𝑠ᵢ contains the price, trading volume, and other market indicators observed at time step 𝑖. To prepare the data for model training, the sequence is extended with a zero-padding segment of length 𝑝. This enables the model to predict the next 𝑝 time steps despite the absence of explicit information about the future in the source data: 𝐷=[𝑠₁,𝑠₂,…,𝑠ₙ,0,0,…,0] ∈ ℝ^((𝑛+𝑝)×𝑑), where 𝑑 is the number of features. These zeros help the model preserve the sequence structure and forecast future values.
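
The padding step can be illustrated with a minimal Python sketch. The helper name pad_future and the sample values are hypothetical; they only show how the n observed rows are followed by p zero rows.

```python
def pad_future(series, horizon):
    """Append `horizon` all-zero rows to a list of feature vectors."""
    if not series:
        raise ValueError("series must contain at least one time step")
    width = len(series[0])                              # d, the number of features
    padded = [list(row) for row in series]              # copy the n observed steps
    padded += [[0.0] * width for _ in range(horizon)]   # p placeholder steps
    return padded


# Hypothetical data: n = 3 observations with d = 2 features (price, volume).
history = [[1.10, 250.0], [1.12, 310.0], [1.11, 280.0]]
extended = pad_future(history, horizon=2)               # (n + p) x d rows
```

The result has n + p rows of equal width, so the downstream layers see a fixed-size tensor in which the last p positions are the ones to be forecast.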

The next stage performs signal smoothing using convolutional filters. Two convolutional filters, 𝑓 and 𝑔, are applied sequentially to reduce random fluctuations and stabilize the input sequence. This preprocessing step suppresses transient spikes and improves the quality of the data presented to subsequent stages of the model.

Following the initial smoothing stage, the adaptive Empirical Mode Decomposition algorithm (ACEEMD) is applied to suppress high-frequency noise. The process begins by adding Gaussian noise 𝑛ᵢ(𝑡) to, and subtracting it from, each element of the input time series, producing two new sequences: 𝑝𝑒ᵢ(𝑡)=𝑥(𝑡)+𝑛ᵢ(𝑡) and 𝑝𝑚ᵢ(𝑡)=𝑥(𝑡)−𝑛ᵢ(𝑡).

Each of these sequences is subsequently decomposed using Empirical Mode Decomposition (EMD), extracting the first Intrinsic Mode Function (IMF).

The corresponding IMF components obtained from both decompositions are then summed into a single component IMF₁(𝑡). This component is subtracted from the original signal, yielding the denoised time series 𝑟₁(𝑡)=𝑥(𝑡)−IMF₁(𝑡), which is then forwarded to the subsequent processing stages.

To enable the model to capture the temporal ordering of events, positional encoding is added to the input representation, preserving information about the location of the data in the sequence. A linear projection of the data is then performed.

One of the defining characteristics of the ACEFormer architecture is its probabilistic attention module, which plays a central role in improving the model's generalization capability. Probabilistic attention is a computationally efficient variant of the conventional Self-Attention mechanism that eliminates many irrelevant attention connections. Rather than computing attention over the entire sequence, the mechanism focuses exclusively on the most informative time steps. To achieve this, an importance score is first estimated for every position. In ACEFormer, this score is defined as the maximum projection of each Query onto a randomly sampled subset of Keys. After normalization, the most informative positions are selected, and the Self-Attention computation is performed only for this subset. Consequently, attention is evaluated not over the full sequence but over a compact subset that is highly likely to contain the most informative temporal events.

The probabilistic attention module in ACEFormer is not merely a technical optimization but a strategic design choice. It enables the model to adapt more effectively to dynamic market conditions, where the significance of individual dependencies changes continuously over time. This approach produces more robust and reliable forecasts when operating on noisy and highly volatile financial data.

As a result, probabilistic attention allows the ACEFormer model to focus on genuinely informative patterns while filtering out irrelevant dependencies and random market fluctuations. This improves the model's ability to extract meaningful relationships and generate accurate forecasts, particularly for predicting future price direction in financial markets.

Following probabilistic attention, the resulting feature representations are processed by a convolutional layer and a max-pooling operation. These components further enhance local feature extraction and improve the model's representation of important patterns. The convolution operation emphasizes the regions of the time series that contain the most informative signals for subsequent prediction.

The final data-processing stage employs a conventional Self-Attention mechanism. This module enables every element of the sequence to access the global context, allowing the model to capture dependencies between events separated by long temporal intervals.

To obtain forecast values for a given planning horizon, a fully connected network is used.

Overall, the ACEFormer algorithm consists of several stages, beginning with denoising and ending with the generation of accurate forecasts. Each stage contributes to the model's ability to handle noisy and highly volatile financial time series, identify long-term market trends, and predict future price movements with high accuracy.

The authors' visualization of the ACEFormer framework is presented below.

Implementation in MQL5

Having examined the theoretical foundations of the ACEFormer framework, we can now proceed to its practical implementation in MQL5. We begin with the probabilistic attention module. It is one of the core components of the architecture that provides high computational efficiency while preserving the quality of data representations.

Before discussing the implementation details, it is worth reiterating the conceptual advantage of probabilistic attention. This mechanism represents a compromise between predictive accuracy and computational efficiency. Unlike conventional attention, which processes the entire sequence, probabilistic attention selectively focuses on the most informative elements. This strategy substantially reduces both memory consumption and computational cost without losing model quality, particularly when processing long sequences.

The implementation presented in this article is divided into three sequentially executed kernels. Each kernel performs a specific task: from estimating importance, through selecting the best queries, to computing the final contextual representations. Let us examine this pipeline step by step.

The first stage estimates the importance of each query. This operation is performed by the ProbAttentionQueryImp kernel. The kernel receives the following inputs:

the Query matrix (querys);
the combined matrix of Keys and Values (keys_values);
the index array index_keys, which stores the sampled Key indices associated with each Query.

In this context, the Keys refer to randomly sampled elements used to estimate Query importance. The sampling process is employed solely for statistical evaluation rather than for computing the final attention scores. Its purpose is to measure how strongly each Query responds to a representative subset of Keys.

__kernel void ProbAttentionQueryImp(__global const float* querys,
                                    __global const float2* __attribute__((aligned(8))) keys_values,
                                    __global const float* index_keys,
                                    __global float* querys_imp,
                                    const int dimension
                                   )
  {
   const size_t id_q = get_global_id(0);
   const size_t total_q = get_global_size(0);
   const size_t ind_k = get_local_id(1);
   const size_t total_ind = get_local_size(1);
   const size_t id_h = get_global_id(2);
   const size_t total_h = get_global_size(2);

The kernel is executed over a three-dimensional task space, with each dimension serving a distinct purpose in organizing the parallel computation. The first dimension spans the sequence of Queries. The second corresponds to the number of sampled Keys associated with each Query. This number may vary depending on the model configuration and the analysis depth. The third dimension represents the attention heads — independent processing units that simultaneously analyze different aspects of the input sequence.

Particular attention should be paid to the behavior of the attention heads. Each head operates on its own independently sampled subset of Keys. This design enables multiple complementary views of the same sequence, allowing each head to discover distinct relationships and structural patterns. As a result, the overall architecture becomes more robust: if one head underestimates an important region of the sequence, another may still capture it. Collectively, the attention heads produce a richer and more expressive representation of the original signal, significantly improving the quality of the attention mechanism and the informativeness of the resulting context.

Execution threads are organized into work groups along the second dimension. To facilitate communication between parallel threads within each work group, a shared array is allocated in the OpenCL device's local memory.

__local float temp[LOCAL_ARRAY_SIZE][2];
const int ls = min((int)total_ind, (int)LOCAL_ARRAY_SIZE);

The next step is to compute the offsets within the input buffers corresponding to the current execution thread. The offset into the Query buffer is determined directly from the thread identifier in the first dimension. The offset into the Key buffer is computed by first retrieving the sampled Key index from the indexing buffer and then converting this index into the corresponding buffer offset.

const int shift_q = dimension * (id_q * total_h + id_h);
const int id_k = index_keys[total_ind * id_q * total_h + ind_k * total_h + id_h];
const int shift_k = dimension * (id_k * total_h + id_h);
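
As a rough CPU-side illustration of this index arithmetic, the following Python sketch reproduces the two offset formulas. The function names token_offset and sample_slot are hypothetical. The layouts they assume ([position][head][dimension] for tokens and [query][sample][head] for the index buffer) are inferred from the kernel expressions rather than stated explicitly.

```python
def token_offset(position, head, heads, dimension):
    """First element of the (position, head) vector in a flat
    [position][head][dimension] buffer, as in shift_q and shift_k."""
    return dimension * (position * heads + head)


def sample_slot(query, sample, head, samples, heads):
    """Slot of one sampled Key index in a flat [query][sample][head]
    index buffer, as in the index_keys lookup."""
    return samples * query * heads + sample * heads + head


# Hypothetical sizes: 2 heads, 4 elements per head, 3 sampled Keys per Query.
offset = token_offset(position=3, head=1, heads=2, dimension=4)       # 28
slot = sample_slot(query=1, sample=2, head=1, samples=3, heads=2)     # 11
```

Under these assumptions every (position, head) pair maps to its own non-overlapping run of dimension elements, which is what allows the threads to read their vectors without coordination.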

For every Query–Key pair, the kernel computes their dot product, which measures their degree of similarity. This operation consists of element-wise multiplication followed by accumulation of the resulting products.

   float sum = 0;
#pragma unroll
   for(int d = 0; d < dimension; d++)
      sum += IsNaNOrInf(querys[shift_q + d] * keys_values[shift_k + d].s0, 0);

Next, using the shared local-memory array, the threads cooperatively compute both the sum and the maximum of these dot products within each work group. This parallel reduction efficiently produces aggregate statistics for every sampled subset.

   int id_t = ind_k % ls;
#pragma unroll
   for(int i = 0; i < total_ind; i += ls)
     {
      //--- only the threads of the current block [i, i + ls) update their slot
      if(i <= ind_k && (i + ls) > ind_k)
        {
         temp[id_t][0] = IsNaNOrInf((i == 0 ? 0 : temp[id_t][0]) + sum, 0);
         temp[id_t][1] = (i == 0 ? IsNaNOrInf(sum, MIN_VALUE) : fmax(temp[id_t][1], IsNaNOrInf(sum, MIN_VALUE)));
        }
      //--- every thread of the work group must reach the barrier
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   int count = ls;
#pragma unroll
   do
     {
      count = (count + 1) / 2;
      if(ind_k < count && (ind_k + count) < ls)
        {
         temp[ind_k][0] += temp[ind_k + count][0];
         temp[ind_k + count][0] = 0;
         temp[ind_k][1] = fmax(temp[ind_k + count][1], temp[ind_k][1]);
        }
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);
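
The halving scheme may be easier to follow when replayed sequentially. The sketch below is a Python approximation, not the kernel itself: tree_reduce is a hypothetical helper, and each pass of the inner loop stands in for the threads that run between two barriers.

```python
def tree_reduce(values):
    """Replay the halving reduction on a non-empty list; returns (sum, max)."""
    if not values:
        raise ValueError("values must not be empty")
    sums = list(values)   # plays the role of temp[...][0]
    maxs = list(values)   # plays the role of temp[...][1]
    size = len(values)
    count = size
    while True:
        count = (count + 1) // 2
        for k in range(count):            # threads with k < count are active
            if k + count < size:
                sums[k] += sums[k + count]
                sums[k + count] = 0.0     # cleared so a later pass cannot add it twice
                maxs[k] = max(maxs[k], maxs[k + count])
        if count <= 1:
            break
    return sums[0], maxs[0]


total, peak = tree_reduce([3.0, -1.0, 4.0, 1.0, 5.0])
```

Because the active range is rounded up at every step, an odd-sized array makes some slots visible in two consecutive passes. Zeroing the consumed sum keeps the total correct, while the maximum needs no such care because repeating a comparison does not change the result.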

The importance score of the current Query is then computed as the difference between the maximum and the average dot product. The larger this value, the more informative the Query is considered to be. The resulting importance scores are stored in the output buffer querys_imp.

 if(ind_k == 0)
    querys_imp[id_q * total_h + id_h] = IsNaNOrInf(temp[0][1] - temp[0][0] / total_ind, MIN_VALUE);
}
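
For checking the kernel output on small inputs, the same score can be computed on the CPU. This is a minimal single-head sketch in Python; the function name query_importance and the sample vectors are hypothetical.

```python
def query_importance(query, keys, sampled):
    """Maximum minus mean of the dot products between one Query
    and the Keys whose indices are listed in `sampled`."""
    dots = [sum(q * k for q, k in zip(query, keys[i])) for i in sampled]
    return max(dots) - sum(dots) / len(dots)


# Hypothetical data: three Keys of dimension 2, all of them sampled.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
selective = query_importance([2.0, 0.0], keys, [0, 1, 2])   # responds unevenly
flat = query_importance([0.0, 0.0], keys, [0, 1, 2])        # responds equally
```

A Query that reacts to all sampled Keys in the same way scores zero, whereas a Query with a pronounced peak scores higher, which is why the difference serves as a measure of informativeness.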

The next stage selects the most informative Queries. This task is performed by the TopKImportanceToIndex kernel. Instead of relying on computationally expensive sorting algorithms, the implementation employs a simple yet robust ranking strategy.

For each Query, the kernel counts, in parallel, the number of Queries with higher importance scores. If this count is smaller than the specified threshold top_k, the current Query is included in the final index list. Although straightforward, this approach is particularly well suited for GPU execution because it requires minimal synchronization and does not depend on auxiliary data structures.

__kernel void TopKImportanceToIndex(__global const float* importance,
                                   __global float* indexes,
                                   const int top_k
                                  )
  {
   const size_t id_q = get_global_id(0);
   const size_t total_q = get_global_size(0);
   const size_t id_h = get_global_id(1);
   const size_t total_h = get_global_size(1);
//---
   float imp = importance[id_q * total_h + id_h];
   int pos = 0;
#pragma unroll
   for(int i = 0; i < total_q; i++)
     {
      if(i == id_q)
         continue;
      float val = importance[i * total_h + id_h];
      if(val > imp || (i < id_q && val >= imp))
         pos++;
      if(pos >= top_k)
         break;
     }
//---
   if(pos < top_k)
      indexes[pos * total_h + id_h] = (float)id_q;
  }
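
The ranking rule can be reproduced on the CPU with a few lines of Python. This is a sketch for a single attention head; top_k_indices is a hypothetical name, and the list returned here stands in for the indexes buffer.

```python
def top_k_indices(importance, top_k):
    """Select the top_k most important positions by counting, for each
    element, how many others outrank it. Ties go to the lower index."""
    selected = [-1] * top_k                     # slots not filled stay at -1
    for q, imp in enumerate(importance):
        pos = 0
        for i, val in enumerate(importance):
            if i == q:
                continue
            if val > imp or (i < q and val >= imp):
                pos += 1
        if pos < top_k:
            selected[pos] = q                   # rank doubles as output slot
    return selected


chosen = top_k_indices([0.2, 0.9, 0.9, 0.1, 0.5], top_k=3)
```

The tie-breaking term is what guarantees that every element receives a unique rank, so two Queries with equal scores never compete for the same output slot.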

The third and final stage computes the attention mechanism itself. This operation is implemented by the QIndexAttention kernel. Its purpose is to generate the final contextual representation for each selected Query.

The kernel receives the complete sets of Query, Key, and Value vectors as input. A key design decision here is to avoid creating additional copies of the selected Query subset, which reduces memory usage and improves computational efficiency. Instead, the kernel operates through an index buffer containing references to the most informative Queries identified during the previous stage.

It is worth noting that the Key and Value tokens are stored together in a single data buffer. This arrangement simplifies memory access patterns and improves cache utilization. Specifically, the implementation uses the float2 vector type, where the first component represents the Key and the second represents the corresponding Value. Treating each Key–Value pair as a single logical entity reduces memory-access overhead while yielding a more compact and efficient implementation.

__kernel void QIndexAttention(__global const float *q,
                              __global const float2* kv,
                              __global float *scores,
                              __global const float *indexes,
                              __global float *out,
                              const int dimension,
                              const int heads_kv
                             )
  {
//--- init
   const int ind_q = get_global_id(0);
   const int k = get_local_id(1);
   const int h = get_global_id(2);
   const int total_q = get_global_size(0);
   const int total_k = get_local_size(1);
   const int heads = get_global_size(2);

This kernel also operates over a three-dimensional task space. However, the first dimension (Queries) covers only the selected subset of the most informative tokens, whereas the second dimension (Keys) covers the complete sequence. As before, execution threads are grouped into work groups along the second dimension.

Within the kernel, the current execution thread is identified across all dimensions of the task space. These identifiers are then used to compute the appropriate offsets into the input buffers.

const int h_kv = h % heads_kv;
const int q_id = (int)(indexes[ind_q * heads + h] + 0.001f);
const int shift_q = dimension * (q_id * heads + h);
const int shift_kv = dimension * (heads_kv * k + h_kv);
const int shift_s = total_k * (ind_q *  heads + h) + k;

Notice that before computing the offset into the Query buffer, the kernel first retrieves the corresponding Query index from the buffer containing the selected high-importance elements.

A shared array is then allocated in local memory to enable data exchange between threads belonging to the same work group.

__local float temp[LOCAL_ARRAY_SIZE];
const uint ls = min((uint)total_k, (uint)LOCAL_ARRAY_SIZE);

The first computational stage evaluates the dot products between the Query and Key vectors, producing an array of intermediate values commonly referred to as raw scores. These scores quantify the relevance of each Query–Key pair and serve as the basis for the subsequent attention computation.

//--- Score
   float score = 0;
   if(q_id >= 0)
     {
#pragma unroll
      for(int d = 0; d < dimension; d++)
         score += IsNaNOrInf(q[shift_q + d] * kv[shift_kv + d].s0, 0);
     }

To improve numerical stability, the Softmax normalization is implemented in a modified form. Within each work group, the maximum value among all Scores is first identified.

//--- max of score
#pragma unroll
   for(int i = 0; i < total_k; i += ls)
     {
      if(k >= i && k < (i + ls))
         temp[k % ls] = (i == 0 ? score : fmax(temp[k % ls], score));
      barrier(CLK_LOCAL_MEM_FENCE);
     }
//---
   uint count = ls;
#pragma unroll
   do
     {
      count = (count + 1) / 2;
      if(k < count && (k + count) < ls)
         temp[k] = fmax(temp[k + count], temp[k]);
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);

Each Score is then shifted by subtracting this maximum value. This transformation prevents exponential overflow by ensuring that every exponent is less than or equal to zero. Consequently, the exponential values remain within the interval from 0 to 1.

score = IsNaNOrInf(exp(score - temp[0]), 0);

The exponentials are then summed, and each value is divided by the resulting total, converting the raw scores into final weights.

//--- sum of exp
#pragma unroll
   for(int i = 0; i < total_k; i += ls)
     {
      if(k >= i && k < (i + ls))
         temp[k % ls] = (i == 0 ? 0 : temp[k % ls]) + score;
      barrier(CLK_LOCAL_MEM_FENCE);
     }
//---
   count = ls;
#pragma unroll
   do
     {
      count = (count + 1) / 2;
      if(k < count && (k + count) < ls)
        {
         temp[k] += temp[k + count];
         temp[k + count] = 0;
        }
      barrier(CLK_LOCAL_MEM_FENCE);
     }
   while(count > 1);
//--- score
   if(temp[0] > 0)
      score /= temp[0];
   scores[shift_s] = score;
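
The effect of the shift can be seen in a small Python sketch. The function name stable_softmax and the input values are hypothetical. The scores are chosen so large that a direct exponent would overflow in double precision.

```python
import math


def stable_softmax(scores):
    """Softmax with the maximum subtracted before exponentiation."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]   # every exponent is <= 0
    total = sum(exps)                             # >= 1, the peak contributes exp(0)
    return [e / total for e in exps]


# math.exp(1000.0) alone would raise OverflowError; the shifted form does not.
weights = stable_softmax([1000.0, 1001.0, 1002.0])
```

Subtracting a constant from every score does not change the result, because the common factor cancels between numerator and denominator. The weights therefore depend only on the differences between scores.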

Finally, these attention weights are applied to the Value tensor. The weighted Value vectors are accumulated to form a single context vector that captures the semantic representation of the input sequence from the perspective of the current Query.

//--- out
#pragma unroll
   for(int d = 0; d < dimension; d++)
     {
      float val = kv[shift_kv + d].s1 * score;
#pragma unroll
      for(int i = 0; i < total_k; i += ls)
        {
         if(k >= i && k < (i + ls))
            temp[k % ls] = (i == 0 ? 0 : temp[k % ls]) + val;
         barrier(CLK_LOCAL_MEM_FENCE);
        }
      //---
      uint count = ls;
#pragma unroll
      do
        {
         count = (count + 1) / 2;
         if(k < count && (k + count) < ls)
           {
            temp[k] += temp[k + count];
            temp[k + count] = 0;
           }
         barrier(CLK_LOCAL_MEM_FENCE);
        }
      while(count > 1);
      //---
      if(k == 0)
         out[dimension * (ind_q * heads + h) + d] = temp[0];
      barrier(CLK_LOCAL_MEM_FENCE);
     }
  }
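
To cross-check the kernel on small inputs, the same computation can be written as a plain Python reference. This is a single-head sketch under simplifying assumptions: indexed_attention is a hypothetical name, Keys and Values are passed as separate lists rather than a combined float2 buffer, and, like the kernel above, the scores are not scaled by the vector dimension.

```python
import math


def indexed_attention(queries, keys, values, selected):
    """Context vectors for the Query rows listed in `selected` only."""
    dim = len(values[0])
    outputs = []
    for q_id in selected:                          # indirection via the index list
        scores = [sum(a * b for a, b in zip(queries[q_id], key)) for key in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs


# Hypothetical data: three Queries, two Keys and Values of dimension 2.
queries = [[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
context = indexed_attention(queries, keys, values, selected=[1, 0])
```

The Query matrix is never copied or reordered. Only the index list decides which rows are processed, and the output contains one row per selected index in the order given.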

Taken together, the three kernels implement a coherent and efficient probabilistic attention scheme. It begins with a fast, approximate estimation of Query importance, followed by the selection of the most promising Queries, and concludes with the full computation performed only on a limited yet informative subset. This approach not only accelerates the processing of long sequences but also preserves a high level of predictive accuracy. At the same time, it significantly reduces the volume of intermediate data and the number of accesses to global memory.

However, the process described so far covers only the forward pass — the stage in which the model generates predictions from the input data. To enable learning, it is also necessary to implement backpropagation — allowing the trainable parameters of every component to be updated according to their contribution to the model's final output.

In this implementation, we made a deliberate architectural decision to propagate gradients exclusively through the attention mechanism, excluding the Query selection stage from the backpropagation pass. At first glance, this may appear to be a simplification, but it is in fact a carefully considered design choice based on the underlying computational structure.

Both stages discussed above — the selection of informative Queries and the attention computation itself — rely on the same fundamental operation: matching Query tokens with Key tokens. During the Query selection stage, a sampled subset of Keys is evaluated against the complete set of Queries to estimate the importance of each element based on its response. During the attention stage, the perspective is reversed: the previously selected Queries are evaluated against the complete Key sequence. In other words, both stages operate on the same entities but from different viewpoints. This symmetry allows us to eliminate redundant computations and establish an efficient gradient flow by updating the model parameters through a single computational pathway.

This design offers several important advantages. First, it reduces computational cost because gradients propagate through only one information path. Second, it improves numerical stability by eliminating potential conflicts between two parallel sources of gradients. Third, it results in a cleaner and more elegant architecture with fewer dependencies, simplifying both implementation and testing. Most importantly, all the information about the importance is already contained in the gradient signals. Consequently, a single parameter update simultaneously improves both the attention mechanism and the Query selection procedure, effectively reusing the information learned during training.

The QIndexAttentionGradients kernel implements error backpropagation through the attention mechanism. It is responsible for distributing gradients accurately across the three key components: Query, Key, and Value. The computational domain is organized along three dimensions:

the most important Queries;
the token dimension;
the attention heads.

This provides a high degree of parallelism and enables efficient utilization of GPU computational resources.

__kernel void QIndexAttentionGradients(__global const float* q,
                                       __global float* q_g,
                                       __global const float2* kv,
                                       __global float2* kv_g,
                                       __global const float* indexes,
                                       __global const float* scores,
                                       __global const float* gradient,
                                       const int kunits, const int heads_kv
                                      )
  {
//--- init
   const int ind_q = get_global_id(0);
   const int d = get_global_id(1);
   const int h = get_global_id(2);
   const int qunits = get_global_size(0);
   const int dimension = get_global_size(1);
   const int heads = get_global_size(2);

At the beginning of execution, each thread determines its coordinates in the task space. The actual Query index is retrieved from the indexes array to establish the correct correspondence with the elements stored in global memory. All required memory offsets are then computed.

const int h_kv = h % heads_kv;
const int q_id = (int)(indexes[ind_q * heads + h] + 0.001f);
const int shift_q = dimension * (q_id * heads + h) + d;
const int shift_s = (ind_q * heads + h) * kunits;
const int shift_g = h * dimension + d;

The first stage computes gradients with respect to the Value vectors. This implementation supports using fewer attention heads for Keys and Values (heads_kv) than for Queries. This reduces both memory consumption and computational cost while preserving the flexibility of the architecture. However, it also requires a special approach during backpropagation.

Because Value vectors may be shared across multiple attention heads, their gradients must aggregate contributions from every head whose output depends on those Values. This ensures that error information is propagated correctly to Values participating in multiple attention pathways.

For each Value position, the algorithm iterates over all attention heads that potentially reference that Value. Within this loop, the weighted contribution of each head is computed as the product of the output gradient and the normalized attention weight (score) obtained during the feed-forward pass. These contributions are accumulated and ultimately stored in the second component of the float2 structure within the kv_g gradient buffer.

This procedure guarantees consistent and accurate gradient propagation even when the number of heads for Keys and Values differs from the number of Query heads. Consequently, the model can be trained correctly despite the structural asymmetry between the attention components.

//--- Calculating Value's gradients
   int step_score = kunits * heads;
   if(h < heads_kv)
     {
#pragma unroll
      for(int v = ind_q; v < kunits; v += qunits)
        {
         float grad = 0;
         for(int hq = h; hq < heads; hq += heads_kv)
           {
            int shift_score = hq * kunits + v;
            for(int g = 0; g < qunits; g++)
               grad += IsNaNOrInf(gradient[shift_g + dimension * (hq - h + g * heads)], 0) *
                       scores[shift_score + g * step_score];
           }
         int shift_v = dimension * (heads_kv * v + h) + d;
         kv_g[shift_v].s1 = IsNaNOrInf(grad, 0);
        }
     }
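
The aggregation over shared heads can be sketched in Python for a single Value element. The names query_heads_for and value_gradient are hypothetical, and the nested lists stand in for slices of the scores and gradient buffers at one fixed Key position and one fixed vector component.

```python
def query_heads_for(kv_head, heads, heads_kv):
    """Query heads that read Key/Value head `kv_head` under h % heads_kv."""
    return list(range(kv_head, heads, heads_kv))


def value_gradient(weights, out_grad, kv_head, heads, heads_kv):
    """Gradient of one Value element.

    weights[q][h]  - attention weight of selected Query q in head h
    out_grad[q][h] - output gradient for the same Query and head
    """
    total = 0.0
    for h in query_heads_for(kv_head, heads, heads_kv):
        for q in range(len(weights)):
            total += weights[q][h] * out_grad[q][h]
    return total


# Hypothetical setup: 4 Query heads share 2 Key/Value heads, one selected Query.
weights = [[0.5, 0.25, 0.5, 0.75]]
out_grad = [[2.0, 4.0, 2.0, 4.0]]
grad_head0 = value_gradient(weights, out_grad, 0, heads=4, heads_kv=2)
grad_head1 = value_gradient(weights, out_grad, 1, heads=4, heads_kv=2)
```

With four Query heads and two Key/Value heads, heads 0 and 2 contribute to the first Value head and heads 1 and 3 to the second, mirroring the hq loop in the kernel.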

The next stage computes gradients with respect to the Queries. This step is somewhat more involved because it requires evaluating the derivative of the Softmax function. For each Query, the output gradient corresponding to the current position is retrieved, after which the algorithm performs two nested loops over the Keys. The first loop computes the contribution of each attention weight, while the second accounts for the influence of every Key through its normalized value. This procedure propagates the error signal accurately through the Softmax distribution while preserving the probabilistic structure of the attention mechanism. Finally, the accumulated Query gradient is written into the q_g buffer at the previously computed offset.

//--- Calculating Query's gradients
   float grad = 0;
   //--- the output buffer is laid out as [query][head][dimension]
   float out_g = IsNaNOrInf(gradient[shift_g + ind_q * heads * dimension], 0);
   int shift_kv = h_kv * dimension + d;
#pragma unroll
   for(int k = 0; (k < kunits && out_g != 0); k++)
     {
      float sc_g = 0;
      float sc = scores[shift_s + k];
      if(sc == 0)
         continue;
      for(int v = 0; v < kunits; v++)
         sc_g += scores[shift_s + v] * out_g * kv[shift_kv + v * heads_kv * dimension].s1 *
                 ((float)(k == v) - sc);
      grad += sc_g * kv[shift_kv + k * heads_kv * dimension].s0;
     }
   q_g[shift_q] = grad;
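The Softmax part of this calculation can be isolated in a few lines of plain C. The sketch below is not taken from the attached library; the function names are hypothetical, and dp stands for the gradient arriving at each attention weight (in the kernel above this role is played by the output gradient multiplied by the Value component). The first function mirrors the nested-loop form. The second uses an algebraically equivalent folded form that needs only one pass to compute the shared dot product.

```c
#include <assert.h>

/* Softmax backward pass in two equivalent forms (illustrative sketch).
   p  - Softmax outputs for one Query (they sum to 1)
   dp - gradient of the loss with respect to each p[v]
   ds - output: gradient with respect to each input logit s[k] */

/* Direct form: ds[k] = sum_v dp[v] * p[v] * (delta(k,v) - p[k]) */
void softmax_backward_naive(const float *p, const float *dp, float *ds, int n)
{
    assert(n > 0);
    for (int k = 0; k < n; k++) {
        float acc = 0.0f;
        for (int v = 0; v < n; v++)
            acc += dp[v] * p[v] * ((float)(k == v) - p[k]);
        ds[k] = acc;
    }
}

/* Folded form: ds[k] = p[k] * (dp[k] - sum_v p[v] * dp[v]) */
void softmax_backward_folded(const float *p, const float *dp, float *ds, int n)
{
    assert(n > 0);
    float dot = 0.0f;
    for (int v = 0; v < n; v++)
        dot += p[v] * dp[v];
    for (int k = 0; k < n; k++)
        ds[k] = p[k] * (dp[k] - dot);
}
```

Both functions return the same logit gradients. The folded form costs O(n) per Query instead of O(n^2), which may be worth considering when the number of Keys is large.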

The subsequent stage computes gradients with respect to the Keys, one of the most delicate parts of the backpropagation pass. The objective here is to determine precisely how each Key contributed to the model's final prediction through the attention mechanism.

As noted earlier, the implementation may use different numbers of attention heads for Queries and Keys. Therefore, each Key gradient must accumulate contributions from every attention head in which that Key participated.

During the feed-forward pass, every Query–Key pair produces a scalar attention score that is normalized by the Softmax function. The normalized values are stored in the scores buffer. However, these values alone are insufficient for computing the gradients. Since Softmax is a nonlinear transformation, backpropagation requires evaluating its derivative. Although the Softmax outputs have already been computed and stored, the sensitivity of the entire function with respect to each input logit must still be calculated. This derivative consists of both diagonal and off-diagonal terms of the Softmax formula. Consequently, when computing the gradient for a particular Key, the algorithm must iterate over every Query associated with that Key and accumulate their contributions.
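For reference, the diagonal and off-diagonal terms mentioned above follow from the standard Softmax derivative. The notation here is assumed for illustration and does not appear in the source code: p_{qv} is the stored attention weight of Query q on position v, s_{qk} is the corresponding logit before normalization, taken as the dot product of Q_q and K_k, and any scaling factor applied to the logits is omitted.

```latex
\frac{\partial p_{qv}}{\partial s_{qk}} = p_{qv}\,(\delta_{kv} - p_{qk}),
\qquad
\frac{\partial L}{\partial s_{qk}} = \sum_{v} \frac{\partial L}{\partial p_{qv}}\; p_{qv}\,(\delta_{kv} - p_{qk}),
\qquad
\frac{\partial L}{\partial K_{k}} = \sum_{q} \frac{\partial L}{\partial s_{qk}}\; Q_{q}
```

The term with k = v is the diagonal part, and all other terms are off-diagonal. The sum over q in the last expression is the reason every Query that attended to a given Key has to be visited when that Key's gradient is accumulated.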

A key aspect of this procedure is the use of the selected Query indices to reconstruct the correct dependency chain. Without these indices, the gradient distribution would be incorrect.

The algorithm iterates over every relevant pair, evaluates the required Softmax derivative terms, and multiplies them by the corresponding output error gradients. The results are accumulated into the gradient for that Key and written to the first component (s0) of the kv_g buffer, which stores gradients for both Keys and Values.

//--- Calculating Key's gradients
   if(h < heads_kv)
     {
#pragma unroll
      for(int k = ind_q; k < kunits; k += qunits)
        {
         int shift_k = dimension * (heads_kv * k + h_kv) + d;
         grad = 0;
         for(int hq = h; hq < heads; hq += heads_kv)
           {
            int shift_score = hq * kunits + k;
            float val = kv[shift_k + heads_kv * dimension].s1;
            for(int scr = 0; scr < qunits; scr++)
              {
               float sc_g = 0;
               int shift_sc = scr * kunits * heads;
               float sc = scores[shift_sc + k];
               if(sc == 0)
                  continue;
               for(int v = 0; v < kunits; v++)
                  sc_g += scores[shift_sc + v] * gradient[shift_g + scr * dimension] *
                          val * ((float)(k == v) - sc);
               grad += IsNaNOrInf(sc_g * 
                                  q[(hq + (int)(indexes[scr * heads + hq] + 0.001f) * heads) * dimension + d], 0);
              }
           }
         kv_g[shift_k].s0 = IsNaNOrInf(grad, 0);
        }
     }
  }
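The lookup of the selected Query through the indexes buffer can be illustrated separately. The C sketch below uses a hypothetical helper name and reproduces only the offset arithmetic from the listing above. The explanation of the 0.001f term is an assumption: since the positions are stored as floating-point values, the small offset presumably protects against a value slightly below an integer being truncated to the previous index.

```c
#include <assert.h>

/* Illustrative helper (hypothetical name): offset of component d of the
   Query row selected for slot scr and head hq.
   indexes holds integer positions stored as floats. Adding 0.001f before
   the cast keeps a value such as 2.9999998f from being truncated to 2. */
int selected_query_offset(const float *indexes, int scr, int hq,
                          int heads, int dimension, int d)
{
    assert(heads > 0 && dimension > 0);
    int pos = (int)(indexes[scr * heads + hq] + 0.001f);
    return (hq + pos * heads) * dimension + d;
}
```

For example, with heads = 2 and dimension = 4, a stored position of 3 for slot 0 and head 0 maps to offset (0 + 3 * 2) * 4 = 24 in the Query buffer.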

This concludes our discussion of the probabilistic attention algorithms implemented within the OpenCL program. We have examined each major stage in sequence — from estimating Query importance and selecting the most informative elements to computing attention and implementing the backpropagation pass. Every kernel has been carefully adapted to the ACEFormer architecture and optimized for efficient execution on GPU devices.

The complete implementation, including the source code for all kernels described above, is provided in the attachment.

The next stage of our work will focus on implementing the probabilistic attention algorithms within the main application. At this level, the OpenCL program is integrated with the model logic, buffer management, and computation synchronization. However, the scope of the present article has already reached a reasonable limit, so we will continue this discussion in the next article.

Conclusion

In this article, we introduced the ACEFormer framework — an architecture designed for highly efficient processing of sequential data under constrained computational resources. Its key strengths — modularity, adaptability, and computational efficiency — form the foundation of the implementation.

ACEFormer addresses the problem of scaling attention mechanisms to long sequences. Rather than processing the entire input sequence exhaustively, it employs a probabilistic selection mechanism to identify the most informative elements. This is intended to reduce computational overhead substantially while keeping forecast quality close to that of full attention. Such an approach is particularly valuable in environments where every microsecond of execution time and every megabyte of memory matter, such as algorithmic trading platforms.

In the practical part of this work, we examined in detail the OpenCL implementation of all major components of probabilistic attention. The next article will continue with their integration at the level of the main application.

Programs Used in the Article

#	Name	Type	Description
1	Research.mq5	Expert Advisor	Expert Advisor for collecting samples
2	ResearchRealORL.mq5	Expert Advisor	Expert Advisor for collecting samples using the Real-ORL method
3	Study.mq5	Expert Advisor	Expert Advisor for offline model training
4	StudyOnline.mq5	Expert Advisor	Expert Advisor for online model training
5	Test.mq5	Expert Advisor	Expert Advisor for model testing
6	Trajectory.mqh	Class library	System state and model architecture description structure
7	NeuroNet.mqh	Class library	A library of classes for creating a neural network
8	NeuroNet.cl	Code library	OpenCL program code

Translated from Russian by MetaQuotes Ltd.
Original article: https://www.mql5.com/ru/articles/18004

Attached files: MQL5.zip (2720.44 KB)
