Neural Networks in Trading: Actor—Director—Critic (Final Part)

MetaTrader 5 — Trading systems | 16 June 2026, 13:18

Dmitriy Gizlyk

Introduction

In the previous article, we became acquainted with the theoretical aspects of the Actor–Director–Critic framework, an extended version of the Actor–Critic architecture. The classical Actor–Critic architecture has become a cornerstone of many successful reinforcement learning (RL) algorithms. In this setup, the agent is divided into two components: the Actor proposes actions, while the Critic evaluates them based on rewards received from the environment. This combination enables the gradual development of a strategy in which actions become increasingly purposeful and evaluations progressively more accurate. However, despite its elegance and effectiveness, this architecture encounters several significant limitations in real-world applications, particularly in trading.

The main problem appears during the early stages of training. At this point, the Actor has not yet learned which actions are beneficial, while the Critic is unable to evaluate them adequately because it is also just beginning to learn. This creates a kind of blind wandering: the agent performs numerous random, inefficient, and sometimes even harmful actions while receiving uninformative or delayed feedback. In market environments, where mistakes can be costly, such an approach becomes excessively risky and unstable.

To address this fundamental issue, the authors of the Actor–Director–Critic framework introduced a new component, the Director, which adds a second evaluation channel to the system. Unlike the Critic, which produces a continuous assessment of an action based on environmental rewards, the Director classifies actions in binary terms: whether they fit or do not fit the given strategy. This enables the agent to navigate the decision space more quickly and effectively.

It is important to emphasize that the Director is not a filter and does not restrict the agent's freedom of action. On the contrary, it complements the Critic's evaluation by providing more categorical feedback. While the Critic may be uncertain about the quality of an action (especially during the early stages of training), the Director can immediately indicate whether the action conforms to the behavioral model it has learned. This helps the Actor avoid repeating actions that are already identified as erroneous, thereby conserving training resources and accelerating the development of a robust strategy.

The result is a synergistic interaction among three components: the Actor learns to select actions, the Critic learns to evaluate them in terms of expected reward, and the Director explicitly identifies actions that should be avoided altogether. This creates a dual-feedback system — continuous and binary — that enables the agent to eliminate unpromising directions more rapidly and focus on productive strategies.

The authors' visualization of the Actor–Director–Critic framework is presented below.

The authors' visualization of the Actor–Director–Critic framework

In the practical section of the previous article, a detailed description of the architecture of the trainable models was provided. It should be noted that several substantial modifications were introduced compared to the authors' original implementation of the framework.

First and foremost, we adapted the proposed concepts to the HiSSD multi-agent framework discussed earlier. Furthermore, in our implementation, both the Director and the Critic are trained on latent representations of the Agent's shared skills — compressed features extracted from the internal layers of the environment-state encoder, rather than on the raw environmental features themselves. This approach enables the formation of a generalized representation of action patterns, allowing actions to be evaluated within the context of global behavioral logic rather than a specific market situation. As a result, the system receives more stable and strategically meaningful feedback, which is critically important under conditions of limited information and high uncertainty, as is often the case in financial markets.

In this article, the focus shifts to model training. Several modifications were also introduced to this process compared to the original approach.

One of the key differences is the division of training into two stages. During the first stage, all system components undergo offline training using a pre-collected training dataset. This stage allows the agent to accumulate initial experience and establish a baseline behavioral strategy without affecting live trading operations. Each model is trained using its own objective function, specifically tailored to its role within the architecture.

The second stage consists of online fine-tuning. At this point, the agent begins interacting directly with the environment. The Actor refines its policy in real time, while the Critic and Director continue learning, improving the quality of their evaluations and classifications based on updated data. This stage enables the agent to adapt to current market conditions while preserving the strategic direction established during offline training.

This two-stage approach provides a balance between stability and adaptability. The agent first develops a reliable initial policy and subsequently refines it under real-world conditions. As a result, we expect to obtain a robust and efficiently trainable system capable of adapting to changing market dynamics without sacrificing strategic consistency.

Offline Training

The algorithm for the first training stage (offline training) is implemented as the Expert Advisor "…\Experts\ADC\Study.mq5". The majority of its code was inherited from the previous project, which is not surprising. The architectural design of the trainable models is largely based on concepts developed within the HiSSD framework, with which we are already familiar.

This continuity allowed us to leverage proven solutions without reinventing the wheel. Nevertheless, some modifications were unavoidable.

The addition of two new components (the Critic and the Director) significantly expanded the system's functionality. Their integration required changes to the model-training logic. Within the scope of this article, we will focus exclusively on the Train method, which contains nearly the entire model-training process.

As before, this method accepts no parameters, obtaining all required data from previously initialized global variables. At the beginning of the method, a number of local variables are created and initialized for temporary storage of intermediate information during training.

void Train(void)
  {
//---
   vector<float> probability = vector<float>::Full(Buffer.Size(), 1.0f / Buffer.Size());
//---
   vector<float> result, target, state;
   matrix<float> fstate = matrix<float>::Zeros(1, NForecast * BarDescr);
   bool Stop = false;
//---
   uint ticks = GetTickCount();

Once the preparatory stage is complete, we proceed to the training process itself, which is organized as a system of nested loops. The outer loop iterates over mini-batches for a specified number of training iterations.

for(int iter = 0; (iter < Iterations && !IsStopped() && !Stop); iter += Batch)
  {
   int tr = SampleTrajectory(probability);
   int start = (int)((MathRand() * MathRand() / MathPow(32767, 2)) * (Buffer[tr].Total - 2 - NForecast - Batch));
   if(start <= 0)
     {
      iter -= Batch;
      continue;
     }

At the start of each iteration of the outer loop, a single trajectory is sampled from the experience replay buffer. A specific environment state is then randomly selected from this trajectory and used as the starting point for constructing a new mini-batch of training data. This approach ensures diversity among training examples.
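
The start index in the fragment above is drawn as the product of two uniform random numbers, which concentrates the selection closer to the beginning of the trajectory. The sampling step can be sketched as follows. The helper name SampleStart and the values passed to it in OnStart are illustrative assumptions and are not part of the article's code.

```mql5
//--- Hypothetical helper (illustration only): returns the index of the first
//--- state of a mini-batch, or -1 when the trajectory is too short
int SampleStart(const int total, const int forecast, const int batch)
  {
   //--- leave room for the mini-batch itself and for the look-ahead window
   int range = total - 2 - forecast - batch;
   if(range <= 0)
      return -1;
   //--- MathRand() returns an integer in the range [0, 32767]
   double u = MathRand() / 32767.0;
   double v = MathRand() / 32767.0;
   //--- the product of two uniform values is concentrated near zero
   return (int)(u * v * range);
  }
//--- Script entry point: draw a few indices for an assumed trajectory length
void OnStart()
  {
   MathSrand((int)GetTickCount());
   for(int n = 0; n < 5; n++)
      PrintFormat("start = %d", SampleStart(1000, 12, 64));
  }
```

As in the original fragment, the caller is expected to skip a non-positive result and draw again.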

Before processing the newly formed mini-batch, we perform a mandatory reset of the internal temporary data buffers of all trainable models. This is particularly important for recurrent modules, which, as is well known, possess memory and can retain contextual information from previous states, accumulating knowledge across earlier time steps. However, when switching between unrelated trajectory segments, this memory can become problematic.

The context stored in the hidden states of the previous mini-batch is no longer relevant. It has no relationship to the new sample and may distort the generated signals. Therefore, resetting temporary states before processing each new mini-batch is not merely a precaution—it is a necessary condition for the correct and independent analysis of the current historical data segment.

if(
   !cEncoder.Clear()
   || !cTask.Clear()
   || !cActor.Clear()
   || !cProbability.Clear()
   || !cDirector.Clear()
   || !cCritic.Clear()
)
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }
result = vector<float>::Zeros(NActions);

After initializing the mini-batch, we proceed to the next stage: organizing a nested loop that sequentially iterates through environment states. These states are processed strictly in chronological order, exactly as they were originally observed by the agent during its interaction with the trading environment.

Preserving historical order is not simply a formality but a fundamental requirement for the effective training of recurrent model components. Unlike conventional fully connected layers, which operate on isolated inputs, recurrent neural networks maintain hidden states based on accumulated information from previous time steps. Their strength lies in their ability to identify temporal dependencies, behavioral patterns, and recurring signals within time-series data.

If the temporal structure of the data is disrupted—even partially—the model may lose its ability to recognize causal relationships that emerge specifically through temporal dynamics. Consequently, the value of such training examples is significantly diminished.

For this reason, every time step within a mini-batch is processed strictly in sequence — without shuffling, omissions, or deviations. This approach allows recurrent components to build an internal contextual representation progressively, accumulating knowledge about the evolution of the market environment.

for(int i = start; i < MathMin(Buffer[tr].Total, start + Batch); i++)
  {
   if(!state.Assign(Buffer[tr].States[i].state) ||
      MathAbs(state).Sum() == 0 ||
      !bState.AssignArray(state))
     {
      iter -= Batch + start - i;
      break;
     }
   //---

Within the body of the nested loop, we load the description of the analyzed environment state from the experience replay buffer and generate the corresponding timestamp harmonics.

bTime.Clear();
double time = (double)Buffer[tr].States[i].account[7];
double x = time / (double)(D'2024.01.01' - D'2023.01.01');
bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
x = time / (double)PeriodSeconds(PERIOD_MN1);
bTime.Add((float)MathCos(x != 0 ? 2.0 * M_PI * x : 0));
x = time / (double)PeriodSeconds(PERIOD_W1);
bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
x = time / (double)PeriodSeconds(PERIOD_D1);
bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
if(bTime.GetIndex() >= 0)
   bTime.BufferWrite();
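
For readers who want to experiment with the encoding separately, the same four harmonics can be gathered into a small standalone function. This is only a sketch: the name TimeHarmonics and the sample date are assumptions, and the function returns a plain vector instead of writing to the bTime buffer.

```mql5
//--- Hypothetical helper (illustration only): yearly, monthly, weekly and
//--- daily harmonics of a timestamp, in the same order as in the listing above
vector<float> TimeHarmonics(const datetime t)
  {
   vector<float> h = vector<float>::Zeros(4);
   double time = (double)t;
   //--- length of a non-leap year in seconds
   double year = (double)(D'2024.01.01' - D'2023.01.01');
   h[0] = (float)MathSin(2.0 * M_PI * time / year);
   h[1] = (float)MathCos(2.0 * M_PI * time / (double)PeriodSeconds(PERIOD_MN1));
   h[2] = (float)MathSin(2.0 * M_PI * time / (double)PeriodSeconds(PERIOD_W1));
   h[3] = (float)MathSin(2.0 * M_PI * time / (double)PeriodSeconds(PERIOD_D1));
   return h;
  }
//--- Script entry point: print the harmonics for an arbitrary sample date
void OnStart()
  {
   vector<float> h = TimeHarmonics(D'2024.03.15 12:00');
   for(ulong k = 0; k < h.Size(); k++)
      PrintFormat("harmonic[%d] = %.4f", (int)k, h[k]);
  }
```

Every component lies in the range [-1, 1], so the timestamp reaches the model already normalized.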

Next, we construct the account-state representation vector.

//--- Account
float PrevBalance = Buffer[tr].States[MathMax(i - 1, 0)].account[0];
float PrevEquity = Buffer[tr].States[MathMax(i - 1, 0)].account[1];
float profit = float(bState[0] / _Point * (result[0] - result[3]));
bAccount.Clear();
bAccount.Add(1);
bAccount.Add((PrevEquity + profit) / PrevEquity);
bAccount.Add(profit / PrevEquity);
bAccount.Add(MathMax(result[0] - result[3], 0));
bAccount.Add(MathMax(result[3] - result[0], 0));
bAccount.Add((bAccount[3] > 0 ? profit / PrevEquity : 0));
bAccount.Add((bAccount[4] > 0 ? profit / PrevEquity : 0));
bAccount.Add(0);
bAccount.AddArray(GetPointer(bTime));
if(bAccount.GetIndex() >= 0)
   bAccount.BufferWrite();

At this point, it is worth highlighting an important methodological technique employed during offline training. This refers to the use of a so-called "near-perfect trajectory". Within this approach, we deliberately relax chronological isolation, allowing the algorithm to peek ahead at future environment states that are already available within the training dataset.

Using this information, we construct an enhanced action tensor that reflects strategically more informed decisions. Naturally, these decisions will, with high probability, differ from the actions originally taken by the agent during its interaction with the market. Rather than replicating the agent's behavior, this tensor serves as an approximate benchmark toward which the agent should converge during training. This is precisely why we refer to it as a "near-perfect" trajectory.

However, such a strategy inevitably creates discrepancies between the actual actions performed by the agent and the idealized actions generated retrospectively. As a result, corresponding adjustments must also be made to other parts of the model.

In particular, we must recalculate the account-state representation vector so that it remains consistent with the actions derived from the near-perfect trajectory. Otherwise, the Agent would be trained on inconsistent data in which actions and their consequences no longer correspond to one another. Such inconsistencies would inevitably distort the training signals.

After preparing the input data set, we sequentially execute the forward-pass methods of the trainable models, during which various predicted values are generated. The objective of model training is to minimize the deviation between these predicted values and the desired target outcomes.

//--- Feed Forward
if(!cEncoder.feedForward((CBufferFloat*)GetPointer(bState), 1, false, (CBufferFloat*)NULL))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }
if(!cTask.feedForward((CBufferFloat*)GetPointer(bState), 1, false, GetPointer(cEncoder), LatentLayer))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }
if(!cActor.feedForward((CBufferFloat*)GetPointer(bAccount), 1, false, GetPointer(cTask), -1))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }
if(!cProbability.feedForward(GetPointer(cEncoder), LatentLayer, (CBufferFloat*)NULL))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }

It is worth noting that at this stage we do not perform forward passes through the Agent's action-evaluation models — the Critic and the Director. This is a direct consequence of the "near-perfect trajectory" approach. However, we will return to this topic shortly.

Once the feed-forward operations of the trainable models have been successfully completed, we proceed to the generation of target values. First, a sequence of future environment states is loaded from the experience replay buffer over a predefined planning horizon.

//--- Look for target
target = vector<float>::Zeros(NActions);
bActions.AssignArray(target);
if(!state.Assign(Buffer[tr].States[i + NForecast].state) ||
   !state.Resize(NForecast * BarDescr) ||
   MathAbs(state).Sum() == 0)
  {
   iter -= Batch + start - i;
   break;
  }
if(!fstate.Resize(1, NForecast * BarDescr) ||
   !fstate.Row(state, 0) ||
   !fstate.Reshape(NForecast, BarDescr))
  {
   iter -= Batch + start - i;
   break;
  }
for(int j = 0; j < NForecast / 2; j++)
  {
   if(!fstate.SwapRows(j, NForecast - j - 1))
     {
      PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
      Stop = true;
      break;
     }
  }
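
The Reshape and SwapRows calls above turn the flat vector into a matrix with one row per bar and then reverse the order of the rows. The effect is easy to verify on a toy example. The dimensions and values below are assumptions chosen purely for illustration.

```mql5
//--- Script entry point: reshape a flat vector and reverse the row order
void OnStart()
  {
   const ulong bars = 3, features = 2;         // assumed toy dimensions
   vector<float> flat = vector<float>::Zeros(bars * features);
   for(ulong k = 0; k < flat.Size(); k++)
      flat[k] = (float)k;                      // 0, 1, 2, 3, 4, 5
   //--- place the flat vector into a single row, then split it into bars
   matrix<float> m = matrix<float>::Zeros(1, bars * features);
   if(!m.Row(flat, 0) || !m.Reshape(bars, features))
      return;
   //--- swap rows symmetrically around the middle to reverse their order
   for(ulong j = 0; j < bars / 2; j++)
      if(!m.SwapRows(j, bars - j - 1))
         return;
   //--- rows are now (4 5), (2 3), (0 1)
   for(ulong r = 0; r < m.Rows(); r++)
      PrintFormat("row %d: %.0f %.0f", (int)r, m[r][0], m[r][1]);
  }
```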

These data serve as target values for our environment state encoder. Consequently, we can perform a backpropagation pass through this model to optimize its parameters and minimize the discrepancy between predicted and target values.

//--- State Encoder
Result.AssignArray(fstate);
if(!cEncoder.backProp(Result, (CBufferFloat*)NULL, NULL))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }

The information about future environment states also forms the basis for constructing the "near-perfect" trading-operation tensor. Importantly, each new operation is generated taking into account the previous one, enabling us to model a coherent trading strategy rather than isolated impulsive actions. As a result, a contextually linked trading sequence is created, in which every subsequent action is determined by the logic of the preceding one. This is particularly important during offline training, where the agent does not receive real-time feedback from the environment and must instead extract patterns from already recorded data.

This method helps mitigate the influence of market noise, which manifests itself in real-world data through random fluctuations. Within the "near-perfect" trajectory, the agent's behavior becomes smoother, more rational, and more strategically consistent. Potentially, this can accelerate model training while fostering a robust trading logic that can later be adapted during online operation.

The process of constructing a "near-perfect" trading operation was described in detail in the previous article, and there is little value in repeating it here.

The resulting tensor of a "near-perfect" trading operation serves as the target value for training our high-level Actor module.

//--- Actor Policy
if(!cActor.backProp(GetPointer(bActions), (CNet*)GetPointer(cTask), -1))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }

At this stage, we move on to the most important component of the training architecture — the training of the Actor's action-evaluation models, namely the Critic and the Director. These components are responsible for generating the feedback signals that guide the agent's strategy, helping it distinguish productive actions from ineffective ones.

Here, however, we encounter a fundamental challenge: the space of possible actions in any given environment state is extremely large. Evaluating the entire set of decisions that the Actor could potentially make is practically infeasible within reasonable computational resources. Furthermore, many of these actions will never actually be executed by the agent and therefore provide little or no training value.

For this reason, Critic training typically follows a local evaluation strategy. Rather than attempting to cover the entire action space, we focus on a neighborhood around the actual decisions made by the agent at the current time step. It is within this local subspace that optimization directions are sought — vector displacements that may potentially increase the objective function and, consequently, improve profitability.

This is where the "near-perfect trajectory" approach provides a significant advantage. Because we have access to a reference behavior generated through limited look-ahead into future states, we can shift the focus of evaluation from what the agent actually did to what it should have done. In other words, we train the Critic not merely to distinguish good actions from bad ones, but to orient its evaluations toward actions that are close to the strategic ideal represented by the "near-perfect" trajectory.

Within this paradigm, we perform a forward pass through the Critic to evaluate the "near-perfect" trading operation.

//--- Critic
if(!cCritic.feedForward(GetPointer(bActions), 1, false, (CNet*)GetPointer(cEncoder), LatentLayer))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }

A backpropagation pass is then executed immediately to minimize the deviation between the Critic's evaluation and the value calculated from actual historical market data.

float reward = float((result[0] - result[3]) * fstate[0, 0] / Point());
Result.Clear();
if(!Result.Add(reward)
   || !cCritic.backProp(Result, (CNet*)GetPointer(cEncoder), LatentLayer)
   || !cEncoder.backPropGradient((CBufferFloat*)NULL, (CBufferFloat*)NULL, LatentLayer, true)
  )
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }

The Director's training process is conceptually similar to that of the Critic but involves several important distinctions. Like any classifier, the Director requires a balanced set of positive and negative examples from which it can learn the boundary between "good" and "bad" agent actions.

The positive examples are relatively straightforward. These consist of the trading operations generated within the framework of the "near-perfect" trajectory. Such actions represent strategically sound decisions derived from analysis of future environment states and therefore deserve favorable classification.

However, a reliable classifier cannot be trained using positive examples alone. It also requires a representative pool of negative examples that exhibit behavior inconsistent with our objectives. This introduces a methodological challenge: by its nature, the training dataset does not cover the entire space of potential actions. It contains only those decisions that the agent actually made in the past.

To address this issue, we adopted a simple yet effective heuristic. As negative examples, we use sets of random values generated within the permissible range of the agent's action space. With high probability, such random actions do not align with the strategic objectives of the model and can therefore be regarded as behavioral noise or mistakes.

It is important to note that the alternation between positive and negative examples during training is also randomized. This prevents the model from overfitting to either category and helps establish a more robust classification boundary. Such an approach makes the Director more flexible and generalizable while producing sharper and more confident signals—an especially valuable property under the high levels of uncertainty characteristic of financial markets.

As a result, the Director becomes a powerful binary guide capable of decisively rejecting unproductive actions and steering the agent toward more promising regions of the strategic decision space.

//--- Director
Result.Clear();
if((MathRand() / 32767.0) > 0.5)
   Result.Add(1);
else
  {
   target = vector<float>::Zeros(NActions);
   for(int a = 0; a < NActions; a++)   // separate counter, so the outer loop index i is not hidden
      target[a] = float(MathRand() / 32767.0);
   bActions.AssignArray(target);
   Result.Add(0);
  }
if(!cDirector.feedForward(GetPointer(bActions), 1, false, (CNet*)GetPointer(cEncoder), LatentLayer)
   || !cDirector.backProp(Result, (CNet*)GetPointer(cEncoder), LatentLayer)
   || !cEncoder.backPropGradient((CBufferFloat*)NULL, (CBufferFloat*)NULL, LatentLayer, true)
  )
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   Stop = true;
   break;
  }
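
The labeling rule used in the fragment above can be isolated into a small function, which makes it easier to check the balance of classes. The sketch below is illustrative only: the name MakeDirectorSample, the vector length of 6, and the constant reference action are assumptions.

```mql5
//--- Hypothetical helper (illustration only): with probability of about 0.5
//--- keeps the reference action (label 1), otherwise overwrites it with
//--- uniform noise in [0, 1] (label 0)
float MakeDirectorSample(vector<float> &actions)
  {
   if(MathRand() / 32767.0 > 0.5)
      return 1.0f;
   for(ulong a = 0; a < actions.Size(); a++)
      actions[a] = (float)(MathRand() / 32767.0);
   return 0.0f;
  }
//--- Script entry point: generate a few labeled samples
void OnStart()
  {
   MathSrand((int)GetTickCount());
   //--- stand-in for a "near-perfect" action tensor
   vector<float> reference = vector<float>::Full(6, 0.25f);
   for(int n = 0; n < 4; n++)
     {
      vector<float> sample = reference;
      float label = MakeDirectorSample(sample);
      PrintFormat("label = %.0f, first component = %.3f", label, sample[0]);
     }
  }
```

Because the class is chosen independently at every step, the Director sees no fixed pattern in the order of positive and negative examples.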

Next, we train the low-level Controller and the predictive model responsible for estimating the probabilities of future price movement direction. These code blocks were transferred unchanged from the previous project, where they were described in detail. Therefore, I suggest leaving them for independent study. The complete source code of the offline model training Expert Advisor can be found in the attachment.

Online Training

The next stage of our work is the fine-tuning of the trainable models in an online setting, where we encounter an entirely different class of challenges and constraints. While the offline stage emphasized deep, generalized learning based on historical data and the "near-perfect trajectory" approach, the online stage focuses exclusively on adaptation to real-time conditions.

One of the key advantages of online learning is the direct availability of feedback from the environment. The Actor makes a decision, the action is executed, and its outcome becomes known almost immediately. This allows the agent's behavior to be adjusted promptly, reinforcing successful strategies while discarding ineffective ones.

However, this approach is not without significant limitations. The most important of these is the inability to look ahead into future states, as was possible when constructing the "near-perfect trajectory" during offline training. In the online setting, the agent must make decisions based solely on the current state and its existing policy, without access to any "knowledge from the future". This fundamentally changes the learning conditions and necessitates a transition to a more classical Reinforcement Learning paradigm.

Learning is therefore driven by trial and error, and the quality of decisions can only be assessed retrospectively. As a result, the importance of the evaluation models — the Critic and the Director — increases significantly. These components serve as an internal advisory system for the Actor, guiding its behavior. Unlike the offline stage, however, they are now fine-tuned continuously during live trading activity, adapting in real time to changing market conditions.

Let us examine the implementation step by step. The online training algorithm is implemented within the Expert Advisor "...\Experts\ADC\StudyOnline.mq5". Given the limited scope of this article, we will focus on the OnTick method, which handles incoming tick events and contains the main training algorithm.

void OnTick()
  {
//---
   if(!IsNewBar())
      return;

First, it is important to note that our models analyze historical data only on closed candles and are not designed to react to every incoming tick. Consequently, until the next candle closes, there is no need to perform a detailed analysis of the environment state. The outcome would remain unchanged. Therefore, to minimize unnecessary computations, the method begins by checking whether a new bar has closed. If not, we wait for the next tick.
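
The IsNewBar function itself is not shown in this article. One common way to implement such a check is to compare the opening time of the current bar with the value remembered from the previous call. The sketch below is an assumption about how this could be done and may differ from the author's implementation, hence the different name.

```mql5
//--- Hypothetical new-bar check (illustration only): returns true once per bar.
//--- The static variable makes it suitable for a single symbol/timeframe pair.
bool IsNewBarSketch(const string symbol, const ENUM_TIMEFRAMES timeframe)
  {
   static datetime last_open = 0;
   datetime open_time = iTime(symbol, timeframe, 0);
   //--- iTime returns 0 when the data is not available yet
   if(open_time == 0 || open_time == last_open)
      return false;
   last_open = open_time;
   return true;
  }
//--- Tick handler: all heavy processing is skipped until a new bar appears
void OnTick()
  {
   if(!IsNewBarSketch(_Symbol, _Period))
      return;
   Print("New bar opened at ", TimeToString(iTime(_Symbol, _Period, 0)));
  }
```

Note that this variant also returns true on the very first call after the program starts.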

Once a new bar has closed, we request historical data from the terminal over a specified depth and construct the input data buffers.

   int bars = CopyRates(Symb.Name(), TimeFrame, iTime(Symb.Name(), TimeFrame, 1), HistoryBars, Rates);
   if(!ArraySetAsSeries(Rates, true))
      return;
//---
   RSI.Refresh();
   CCI.Refresh();
   ATR.Refresh();
   MACD.Refresh();
   Symb.Refresh();
   Symb.RefreshRates();
//---
   float atr = 0;
   for(int b = 0; b < (int)HistoryBars; b++)
     {
      float open = (float)Rates[b].open;
      float rsi = (float)RSI.Main(b);
      float cci = (float)CCI.Main(b);
      atr = (float)ATR.Main(b);
      float macd = (float)MACD.Main(b);
      float sign = (float)MACD.Signal(b);
      if(rsi == EMPTY_VALUE || cci == EMPTY_VALUE || atr == EMPTY_VALUE ||
         macd == EMPTY_VALUE || sign == EMPTY_VALUE)
         continue;
      //---
      int shift = b * BarDescr;
      sState.state[shift] = (float)(Rates[b].close - open);
      sState.state[shift + 1] = (float)(Rates[b].high - open);
      sState.state[shift + 2] = (float)(Rates[b].low - open);
      sState.state[shift + 3] = (float)(Rates[b].tick_volume / 1000.0f);
      sState.state[shift + 4] = rsi;
      sState.state[shift + 5] = cci;
      sState.state[shift + 6] = atr;
      sState.state[shift + 7] = macd;
      sState.state[shift + 8] = sign;
     }
//---

We then load information on the account state and currently open positions.

   sState.account[0] = (float)AccountInfoDouble(ACCOUNT_BALANCE);
   sState.account[1] = (float)AccountInfoDouble(ACCOUNT_EQUITY);
//---
   double buy_value = 0, sell_value = 0, buy_profit = 0, sell_profit = 0;
   double position_discount = 0;
   double multiplyer = 1.0 / (60.0 * 60.0 * 10.0);
   int total = PositionsTotal();
   datetime current = TimeCurrent();
   for(int i = 0; i < total; i++)
     {
      if(PositionGetSymbol(i) != Symb.Name())
         continue;
      double profit = PositionGetDouble(POSITION_PROFIT);
      switch((int)PositionGetInteger(POSITION_TYPE))
        {
         case POSITION_TYPE_BUY:
            buy_value += PositionGetDouble(POSITION_VOLUME);
            buy_profit += profit;
            break;
         case POSITION_TYPE_SELL:
            sell_value += PositionGetDouble(POSITION_VOLUME);
            sell_profit += profit;
            break;
        }
      position_discount += profit - (current - PositionGetInteger(POSITION_TIME)) * multiplyer * MathAbs(profit);
     }
   sState.account[2] = (float)buy_value;
   sState.account[3] = (float)sell_value;
   sState.account[4] = (float)buy_profit;
   sState.account[5] = (float)sell_profit;
   sState.account[6] = (float)position_discount;
   sState.account[7] = (float)Rates[0].time;

Next, we generate timestamp harmonics.

   bTime.Clear();
   double time = (double)Rates[0].time;
   double x = time / (double)(D'2024.01.01' - D'2023.01.01');
   bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
   x = time / (double)PeriodSeconds(PERIOD_MN1);
   bTime.Add((float)MathCos(x != 0 ? 2.0 * M_PI * x : 0));
   x = time / (double)PeriodSeconds(PERIOD_W1);
   bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
   x = time / (double)PeriodSeconds(PERIOD_D1);
   bTime.Add((float)MathSin(x != 0 ? 2.0 * M_PI * x : 0));
   if(bTime.GetIndex() >= 0)
      bTime.BufferWrite();
//---
   bAccount.Clear();
   bAccount.Add((float)((sState.account[0] - PrevBalance) / PrevBalance));
   bAccount.Add((float)(sState.account[1] / PrevBalance));
   bAccount.Add((float)((sState.account[1] - PrevEquity) / PrevEquity));
   bAccount.Add(sState.account[2]);
   bAccount.Add(sState.account[3]);
   bAccount.Add((float)(sState.account[4] / PrevBalance));
   bAccount.Add((float)(sState.account[5] / PrevBalance));
   bAccount.Add((float)(sState.account[6] / PrevBalance));
   bAccount.AddArray(GetPointer(bTime));
//---
   if(bAccount.GetIndex() >= 0)
      if(!bAccount.BufferWrite())
         return;
//---
   bState.AssignArray(sState.state);

It is important to note that the resulting representation of the current environment state will be used for two distinct purposes. Naturally, it will serve as input for a feed-forward pass through the Agent, resulting in the generation of new trading actions. However, rewards from the environment for those actions will only become available after the next bar is formed. This creates a gap in time.

On the other hand, at this stage we are able to evaluate the effectiveness of the actions taken by the Agent at the previous time step. It is in our interest to perform this evaluation before updating the model states, which still contain the results of the analysis performed on the previous environment state.

Therefore, we first feed the newly constructed environment state representation into the target models and generate a predicted estimate of the state under the assumption that the Agent follows its current behavioral policy.

if(!bFirstRun)
  {
   //--- Target Nets
   if(!cEncoder[1].feedForward((CBufferFloat*)GetPointer(bState), 1, false, (CBufferFloat*)NULL)
      || !cTask[1].feedForward((CBufferFloat*)GetPointer(bState), 1, false,
                                                    (CNet*)GetPointer(cEncoder[1]), LatentLayer)
      || !cActor[1].feedForward((CBufferFloat*)GetPointer(bAccount), 1, false,
                                                                        GetPointer(cTask[1]), -1)
      || !cCritic[2].feedForward(GetPointer(cActor[1]), -1, GetPointer(cEncoder[1]), LatentLayer)
      || !cCritic[3].feedForward(GetPointer(cActor[1]), -1, GetPointer(cEncoder[1]), LatentLayer)
      || !cCritic[4].feedForward(GetPointer(cActor[1]), -1, GetPointer(cEncoder[1]), LatentLayer)
      || !cCritic[5].feedForward(GetPointer(cActor[1]), -1, GetPointer(cEncoder[1]), LatentLayer)
     )
     {
      PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
      return;
     }

The use of target models plays an important role in maintaining a coherent and stable behavioral strategy while minimizing the influence of market noise. This allows the Agent to consider not only immediate rewards but also expected future returns.

We then proceed to train the Critic. Recall that the authors of the Actor–Director–Critic framework proposed the use of two Critics operating in parallel, each with its own pair of target models. For the first Critic, we construct the target value for the Agent's most recent actions by averaging the estimates of its two target models, applying the discount factor, and adding the reward obtained at the current step (the change in equity). We then run a backpropagation pass through the model, which still holds the feed-forward results from the previous step.

//--- Critic 1
cCritic[2].getResults(Result);
float reward = Result[0];
cCritic[4].getResults(Result);
reward = (reward + Result[0]) / 2 * DiscFactor + float(sState.account[1] - PrevEquity);
Result.Clear();
if(!Result.Add(reward)
   || !cCritic[0].backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   return;
  }
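The target construction above can be restated as a small standalone function. The following is only a C++ sketch of the arithmetic, not code from the attached library; the function name `CriticTarget` and its parameter names are hypothetical and chosen for illustration.

```cpp
#include <cassert>

// Hypothetical helper (not part of the article's library).
// q_target_a, q_target_b : estimates produced by the two target Critics
// discount               : discount factor (DiscFactor in the article's code)
// equity_now, equity_prev: account equity at the current and the previous bar
float CriticTarget(float q_target_a, float q_target_b, float discount,
                   float equity_now, float equity_prev)
  {
   assert(discount >= 0.0f && discount <= 1.0f);
   // Average the two target estimates to smooth out individual errors
   float mean_estimate = (q_target_a + q_target_b) / 2.0f;
   // Immediate reward: equity change over the last bar
   float reward = equity_now - equity_prev;
   return mean_estimate * discount + reward;
  }
```

For example, with target estimates of 2 and 4, a discount factor of 0.5, and an equity increase of 5, the target is 3 * 0.5 + 5 = 6.5.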

It should be noted that the original framework proposes propagating the error gradient based on the minimum estimate of the performed actions. In our implementation, however, we adopt a slightly different approach. We will propagate the error gradient from the model with the lowest average evaluation error. Thus we shift from the minimum estimate to a more accurate one.

The second important aspect of online training concerns timely Actor policy updates. The most obvious and technically simple solution is to use a fixed iteration counter. After a predefined number of steps, the Agent's strategy is updated. This approach is entirely reasonable in a true online-learning environment, where each environment state is unique and cannot be revisited.

However, we intend to use the powerful simulation capabilities of the MetaTrader 5 Strategy Tester. This enables us to replay the same sequence of events multiple times, effectively simulating an online process while allowing repeated training passes.

This introduces a potential issue. If we rely on a naive fixed-iteration update schedule, then each training run will update the Actor's policy at exactly the same environment states. This significantly reduces the variability of the training process, introduces artificial biases, and hinders the agent's ability to learn robust patterns.

To avoid this effect, we implemented a stochastic approach to triggering policy updates. Instead of using a deterministic counter, we generate a random integer and perform a policy update only when that value is divisible by a predefined number. This mechanism preserves the required average frequency of optimization while making the timing of individual updates hard to predict. As a result, it helps prevent overfitting to specific data segments.

if(cCritic[0].getRecentAverageError() <= cCritic[1].getRecentAverageError() &&
   (MathRand() % ActorUpdate) == 0)
   if(!cActor[0].backPropGradient((CNet*)GetPointer(cTask[0]), -1, -1, false)
      || !cTask[0].backPropGradient((CNet*)GetPointer(cEncoder[0]), LatentLayer, -1, true)
     )
     {
      PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
      return;
     }
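To illustrate the effect of the random trigger in isolation, here is a minimal C++ sketch. It is not the author's code: the function `StochasticUpdateSteps` and the explicit seed are assumptions made for the example, and the additional comparison of Critic errors used in the Expert Advisor is omitted. The random range 0..32767 mirrors that of MQL5's `MathRand()`.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Hypothetical helper: returns the step indices at which a policy update
// would fire when the trigger is "random integer divisible by divisor".
std::vector<int> StochasticUpdateSteps(int steps, int divisor, unsigned seed)
  {
   assert(steps >= 0 && divisor > 0);
   std::mt19937 rng(seed);
   std::uniform_int_distribution<int> dist(0, 32767); // same range as MathRand()
   std::vector<int> update_steps;
   for(int i = 0; i < steps; i++)
      if(dist(rng) % divisor == 0)     // fires with probability of about 1/divisor
         update_steps.push_back(i);
   return update_steps;
  }
```

With a divisor of 10, roughly one step in ten triggers an update, but the specific steps depend on the random sequence, so repeated passes over the same history generally update the policy at different environment states.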

The same procedure is repeated for the second Critic.

//--- Critic 2
cCritic[3].getResults(Result);
reward = Result[0];
cCritic[5].getResults(Result);
reward = (reward + Result[0]) / 2 * DiscFactor + float(sState.account[1] - PrevEquity);
Result.Clear();
if(!Result.Add(reward)
   || !cCritic[1].backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer))
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   return;
  }
if(cCritic[0].getRecentAverageError() > cCritic[1].getRecentAverageError() &&
   (MathRand() % ActorUpdate) == 0)
   if(!cActor[0].backPropGradient((CNet*)GetPointer(cTask[0]), -1, -1, false)
      || !cTask[0].backPropGradient((CNet*)GetPointer(cEncoder[0]), LatentLayer, -1, true)
     )
     {
      PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
      return;
     }

The Director's training process is considerably simpler. Profitable trading operations are treated as positive examples, while all other trades are treated as negative examples.

//--- Director
Result.Clear();
if((sState.account[1] - PrevEquity) > 0)
   Result.Add(1);
else
   Result.Add(0);
if(!cDirector.backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
   || !cActor[0].backPropGradient((CNet*)GetPointer(cTask[0]), -1, -1, false)
   || !cTask[0].backPropGradient((CNet*)GetPointer(cEncoder[0]), LatentLayer, -1, true)
  )
  {
   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
   return;
  }

Here, we also update the parameters of the predictive model based on the direction of the recently closed candlestick.

 //--- Probability
 vector<float> target = vector<float>::Zeros(NActions / 3);
 if(sState.state[0] > 0)
    target[0] = 1;
 else
    if(sState.state[0] < 0)
       target[1] = 1;
 if(!Result.AssignArray(target)
    || !cProbability.backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
    || !cEncoder[0].backPropGradient((CBufferFloat*)NULL, (CBufferFloat*)NULL, LatentLayer)
   )
   {
    PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
    return;
   }
}
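The target vector for the direction model can be sketched as follows. This C++ fragment is an illustration only: the name `DirectionTarget` is hypothetical, and it assumes that the first element of the state holds the signed change of the last closed candlestick and that `NActions / 3` equals 2, as the six Actor outputs used later suggest.

```cpp
#include <vector>

// Hypothetical helper: builds the training target for the direction model.
// candle_delta: signed change of the last closed candle
//               (assumed to correspond to sState.state[0])
std::vector<float> DirectionTarget(float candle_delta)
  {
   std::vector<float> target(2, 0.0f);   // [up, down]
   if(candle_delta > 0)
      target[0] = 1.0f;                  // bullish candle
   else if(candle_delta < 0)
      target[1] = 1.0f;                  // bearish candle
   return target;                        // a flat candle leaves both elements at zero
  }
```

Note that a candle with zero change produces an all-zero target, so neither direction is reinforced in that case.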

Once the model-optimization iterations have been completed, we proceed to generate a new trading operation. The previously constructed representation of the current environment state is now passed to the trainable models, including the models that evaluate the Actor's actions (the Director and both Critics).

//--- New state
   if(!cEncoder[0].feedForward((CBufferFloat*)GetPointer(bState), 1, false, (CBufferFloat*)NULL)
      || !cTask[0].feedForward((CBufferFloat*)GetPointer(bState), 1, false, (CNet*)GetPointer(cEncoder[0]),
                                                                                                   LatentLayer)
      || !cActor[0].feedForward((CBufferFloat*)GetPointer(bAccount), 1, false, (CNet*)GetPointer(cTask[0]), -1)
      || !cProbability.feedForward((CNet*)GetPointer(cEncoder[0]), LatentLayer, (CBufferFloat*)NULL)
      || !cDirector.feedForward((CNet*)GetPointer(cActor[0]), -1, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
      || !cCritic[0].feedForward((CNet*)GetPointer(cActor[0]), -1, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
      || !cCritic[1].feedForward((CNet*)GetPointer(cActor[0]), -1, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
     )
     {
      PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
      return;
     }

We then store in global variables the information that will be required when processing the next closed candlestick.

PrevBalance = sState.account[0];
PrevEquity = sState.account[1];

After that, we move on to performing trading operations. First, we obtain the output vector produced by our Actor.

   vector<float> temp;
   cActor[0].getResults(temp);
//---
   if(temp.Size() < NActions)
      temp = vector<float>::Zeros(NActions);

Mutually offsetting position volumes are removed from the resulting tensor.

double min_lot = Symb.LotsMin();
double step_lot = Symb.LotsStep();
double stops = (MathMax(Symb.StopsLevel(), 1) + Symb.Spread()) * Symb.Point();
if(temp[0] >= temp[3])
  {
   temp[0] -= temp[3];
   temp[3] = 0;
  }
else
  {
   temp[3] -= temp[0];
   temp[0] = 0;
  }
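The netting step can be expressed as a pure function. This is a C++ sketch with a hypothetical name, `NetVolumes`; it assumes both volumes are non-negative, which the article does not state explicitly.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: removes mutually offsetting buy and sell volumes.
// Returns {net_buy, net_sell}; at most one of them is non-zero.
std::pair<float, float> NetVolumes(float buy, float sell)
  {
   assert(buy >= 0.0f && sell >= 0.0f);  // assumption: volumes are non-negative
   if(buy >= sell)
      return {buy - sell, 0.0f};
   return {0.0f, sell - buy};
  }
```

For instance, a proposed buy volume of 0.5 and a sell volume of 0.25 reduce to a net buy of 0.25 and no sell volume.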

We then proceed to decode the Actor's output. If the long-position volume is below the minimum lot, or if the take-profit or stop-loss distance is too small relative to the minimum stop level, any previously opened long positions are closed.

//--- buy control
   if(temp[0] < min_lot || (temp[1] * MaxTP * Symb.Point()) <= 2 * stops ||
                                 (temp[2] * MaxSL * Symb.Point()) <= stops)
     {
      if(buy_value > 0)
         CloseByDirection(POSITION_TYPE_BUY);
     }

If we need to open or hold long positions, the generated values are converted into actual trading volumes and price levels.

else
  {
   double buy_lot = min_lot + MathRound((double)(temp[0] - min_lot) / step_lot) * step_lot;
   double buy_tp = NormalizeDouble(Symb.Ask() + temp[1] * MaxTP * Symb.Point(), Symb.Digits());
   double buy_sl = NormalizeDouble(Symb.Ask() - temp[2] * MaxSL * Symb.Point(), Symb.Digits());

If long positions are already open, we trail their stop-loss and take-profit levels.

if(buy_value > 0)
   TrailPosition(POSITION_TYPE_BUY, buy_sl, buy_tp);

The volume of the current position is then adjusted by either partially closing the position or adding the required amount. The latter case also includes the opening of a new position.

 if(buy_value != buy_lot)
   {
    if((buy_value - buy_lot) >= min_lot)
       ClosePartial(POSITION_TYPE_BUY, buy_value - buy_lot);
    else
       if((buy_lot - buy_value) >= min_lot)
          if(!Trade.Buy(buy_lot - buy_value, Symb.Name(), Symb.Ask(), buy_sl, buy_tp))
             if(Trade.CheckResultRetcode() == 10019)
               {
                Result.Clear();
                Result.Add(0);
                if(!cDirector.backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
                   || !cActor[0].backPropGradient((CNet*)GetPointer(cTask[0]), -1, -1, false)
                   || !cTask[0].backPropGradient((CNet*)GetPointer(cEncoder[0]), LatentLayer, -1, true)
                  )
                  {
                   PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
                   return;
                  }
               }
   }
}

It is important to note that if an "Insufficient funds" error (trade server return code 10019, TRADE_RETCODE_NO_MONEY) occurs while opening a new position or increasing an existing one, immediate feedback is provided through the Director, indicating that the decision should be considered negative.
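The conversion of the Actor's raw volume into a tradable lot size, used in the branch above, can be isolated as shown below. This is a C++ sketch with a hypothetical function name, `NormalizeLot`; `std::round` stands in for MQL5's `MathRound`, and the sketch assumes the raw volume has already passed the minimum-lot check.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: snaps a raw volume to the symbol's lot grid.
// min_lot and step_lot correspond to Symb.LotsMin() and Symb.LotsStep().
double NormalizeLot(double raw_volume, double min_lot, double step_lot)
  {
   assert(min_lot > 0.0 && step_lot > 0.0 && raw_volume >= min_lot);
   // Number of whole steps above the minimum lot, rounded to the nearest step
   double steps = std::round((raw_volume - min_lot) / step_lot);
   return min_lot + steps * step_lot;
  }
```

With a minimum lot and step of 0.01, a raw volume of 0.237 is rounded to 0.24, up to floating-point precision.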

Short positions are adjusted in a similar manner.

//--- sell control
   if(temp[3] < min_lot || (temp[4] * MaxTP * Symb.Point()) <= 2 * stops ||
                                  (temp[5] * MaxSL * Symb.Point()) <= stops)
     {
      if(sell_value > 0)
         CloseByDirection(POSITION_TYPE_SELL);
     }
   else
     {
      double sell_lot = min_lot + MathRound((double)(temp[3] - min_lot) / step_lot) * step_lot;
      double sell_tp = NormalizeDouble(Symb.Bid() - temp[4] * MaxTP * Symb.Point(), Symb.Digits());
      double sell_sl = NormalizeDouble(Symb.Bid() + temp[5] * MaxSL * Symb.Point(), Symb.Digits());
      if(sell_value > 0)
         TrailPosition(POSITION_TYPE_SELL, sell_sl, sell_tp);
      if(sell_value != sell_lot)
        {
         if((sell_value - sell_lot) >= min_lot)
            ClosePartial(POSITION_TYPE_SELL, sell_value - sell_lot);
         else
            if((sell_lot - sell_value) >= min_lot)
               if(!Trade.Sell(sell_lot - sell_value, Symb.Name(), Symb.Bid(), sell_sl, sell_tp))
                  if(Trade.CheckResultRetcode() == 10019)
                    {
                     Result.Clear();
                     Result.Add(0);
                     if(!cDirector.backProp(Result, (CNet*)GetPointer(cEncoder[0]), LatentLayer)
                        || !cActor[0].backPropGradient((CNet*)GetPointer(cTask[0]), -1, -1, false)
                        || !cTask[0].backPropGradient((CNet*)GetPointer(cEncoder[0]), LatentLayer, -1, true)
                       )
                       {
                        PrintFormat("%s -> %d", __FUNCTION__, __LINE__);
                        return;
                       }
                    }
        }
     }

Finally, the method checks whether it is time to update the target models. If so, soft-update procedures blend the parameters of the trainable models into their corresponding target models using the coefficient tau. For each Critic, only one of its two target models, chosen at random, is updated at a time.

   bFirstRun = false;
//---
   if((int(Rates[0].time / PeriodSeconds(TimeFrame)) % TragetUpdate) == 0)
     {
      if(MathRand() / 32767.0 > 0.5)
         cCritic[2].WeightsUpdate(GetPointer(cCritic[0]), tau);
      else
         cCritic[4].WeightsUpdate(GetPointer(cCritic[0]), tau);
      if(MathRand() / 32767.0 > 0.5)
         cCritic[3].WeightsUpdate(GetPointer(cCritic[1]), tau);
      else
         cCritic[5].WeightsUpdate(GetPointer(cCritic[1]), tau);
      cEncoder[1].WeightsUpdate(GetPointer(cEncoder[0]), tau);
      cTask[1].WeightsUpdate(GetPointer(cTask[0]), tau);
      cActor[1].WeightsUpdate(GetPointer(cActor[0]), tau);
     }
   if(PrevBalance < 50)
      ExpertRemove();
  }
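For readers unfamiliar with soft updates, the following C++ sketch shows the conventional form of the operation (Polyak averaging). It is an assumption that `WeightsUpdate` behaves this way; the internals of that method are not shown in this article, and the function name `SoftUpdate` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conventional soft update: target <- tau * source + (1 - tau) * target.
// Assumption: this is only the textbook form, not the library's actual code.
void SoftUpdate(std::vector<float> &target, const std::vector<float> &source, float tau)
  {
   assert(target.size() == source.size());
   assert(tau >= 0.0f && tau <= 1.0f);
   for(std::size_t i = 0; i < target.size(); i++)
      target[i] = tau * source[i] + (1.0f - tau) * target[i];
  }
```

A small tau makes the target model follow the trainable model slowly, which keeps the target estimates used in Critic training from changing abruptly.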

If the account balance falls below 50, the Expert Advisor stops itself. Otherwise, the method ends and waits for the next candlestick to close.

The complete source code of the online model training Expert Advisor is provided in the attachment.

Testing

We have invested significant effort in adapting and implementing the core concepts of the Actor–Director–Critic framework in MQL5, integrating its components into the architecture of the trainable models. The interaction logic between the Actor, Director, and Critic was carefully designed, and several original approaches to agent training were implemented. It is now time for the final, and perhaps most nerve-racking, stage: evaluating the effectiveness of the proposed solutions using real historical market data.

The framework is evaluated on historical data under conditions that closely resemble real-world trading. This makes it possible to objectively assess whether the selected architectural and algorithmic solutions can successfully cope with the dynamics and uncertainty of financial markets. Moreover, such testing reveals both the strengths and weaknesses of the current implementation and also helps outline directions for further improvement and optimization.

The training dataset was generated using random agent runs in the MetaTrader 5 Strategy Tester, allowing us to collect a broad spectrum of behavioral scenarios. The dataset includes historical EURUSD M1 data for the entire year of 2024.

Initial model training was performed offline without updating the training dataset until prediction errors stabilized. We then switched to the MetaTrader 5 Strategy Tester and continued fine-tuning the model parameters until stable performance was achieved.

An objective assessment of the learned trading policy can only be obtained by evaluating the trained models on data outside the training sample. To test the performance, we used historical data from January through March 2025. Since this period was not used during training, the results reflect performance on unseen data rather than memorization of the training sample, which gives them real practical significance.

All other parameters, including the market environment, timeframe, execution simulation model, and terminal settings, were left unchanged. This ensured a clean evaluation of the learned strategy itself, without interference from external factors.

The testing results are presented below and provide a clear illustration of the agent's behavioral model.

Testing results

During the testing period, the model executed 684 trades. Of these, 268 were closed profitably, resulting in a win rate of slightly above 39%. Nevertheless, the model generated an overall profit during the test period because the average winning trade was nearly twice the size of the average losing trade.
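The relationship between the win rate and the required payoff ratio can be checked with simple arithmetic. The C++ sketch below uses hypothetical function names and ignores commissions and swaps; it only computes the break-even ratio implied by the trade counts reported above.

```cpp
#include <cassert>

// Share of profitable trades among all trades.
double WinRate(int wins, int total)
  {
   assert(total > 0 && wins >= 0 && wins <= total);
   return (double)wins / total;
  }

// Ratio of average win to average loss at which expected profit is zero:
// wins * avg_win = losses * avg_loss  =>  avg_win / avg_loss = losses / wins
double BreakEvenPayoffRatio(int wins, int total)
  {
   assert(wins > 0 && wins < total);
   return (double)(total - wins) / wins;
  }
```

For 268 winning trades out of 684, the win rate is about 39.2% and the break-even ratio is 416 / 268, or about 1.55. An average win close to twice the average loss is above that threshold, which is consistent with the overall profit reported.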

Conclusion

In this work, we explored the theoretical foundations of the Actor–Director–Critic framework and implemented our own interpretation of the proposed concepts using MQL5. The framework was fully integrated into an existing multi-agent architecture, resulting in a modular, flexible, and efficiently trainable agent capable of considering not only local action evaluations (through the Critic) but also the broader strategic context of behavioral logic (through the Director). This approach provides more accurate and robust feedback to the Actor, enabling the agent to discard ineffective actions more quickly and explore productive regions of the policy space more efficiently.

The conducted testing confirmed the viability of the proposed approach and demonstrated that the Actor–Director–Critic framework is capable of making more balanced decisions while exhibiting confident behavior even under conditions of market uncertainty.

However, it is important to note that the programs presented in this article are intended solely as demonstrative examples showcasing the capabilities of the framework. Before applying the proposed solutions in live trading environments, the models should be trained on a more representative dataset and subjected to comprehensive testing and validation.

Programs Used in the Article

#	Name	Type	Description
1	Research.mq5	Expert Advisor	Expert Advisor for collecting samples
2	ResearchRealORL.mq5	Expert Advisor	Expert Advisor for collecting samples using the Real-ORL method
3	Study.mq5	Expert Advisor	Expert Advisor for offline model training
4	StudyOnline.mq5	Expert Advisor	Expert Advisor for online model training
5	Test.mq5	Expert Advisor	Expert Advisor for model testing
6	Trajectory.mqh	Class library	System state and model architecture description structure
7	NeuroNet.mqh	Class library	A library of classes for creating a neural network
8	NeuroNet.cl	Code library	OpenCL program code

Translated from Russian by MetaQuotes Ltd.
Original article: https://www.mql5.com/ru/articles/17819
