Deep Reinforcement Learning in MQL5: A Primer
Most algorithmic traders are stuck in the paradigm of "If-Then" logic. If RSI > 70, Then Sell. If MA(50) crosses MA(200), Then Buy.
This is Static Logic. The problem? The market is Dynamic.
The frontier of quantitative finance is moving away from static rules and towards Deep Reinforcement Learning (DRL). This is the same technology (like AlphaZero) that taught itself to play Chess and Go better than any human grandmaster, simply by playing millions of games against itself.
But can we apply this to MetaTrader 5? Can we build an EA that starts with zero knowledge and learns to trade profitably by trial and error?
In this technical primer, I will guide you through the theory, the architecture, and the code required to bring DRL into the MQL5 environment.
The Theory: How DRL Differs from Supervised Learning
In traditional Machine Learning (Supervised Learning), we feed the model historical data (Features) and tell it what happened (Labels). We say: "Here is a Hammer candle. Price went up next. Learn this."
In Reinforcement Learning, there are no labels. There is only an Agent interacting with an Environment.
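To make the contrast concrete, here is the bare agent-environment loop in code. This is a minimal, runnable sketch using a standard Gymnasium toy environment (CartPole) rather than a market, with a random policy standing in for the agent; the point is only that there are no labels, just actions and rewards:

import gymnasium as gym

env = gym.make("CartPole-v1")
state, _ = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()                           # the "agent" acts (randomly here)
    state, reward, terminated, truncated, _ = env.step(action)   # the environment responds
    total_reward += reward                                       # the only feedback the agent gets
    if terminated or truncated:
        state, _ = env.reset()
print("Cumulative reward:", total_reward)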
The Markov Decision Process (MDP)
To implement this in trading, we map the market to an MDP structure:
- The Agent: Your Trading Bot.
- The Environment: The Market (MetaTrader 5).
- The State (S): What the agent sees (Candle Open, High, Low, Close, Moving Averages, Account Equity).
- The Action (A): What the agent can do (here: 0=Buy, 1=Sell, 2=Hold; you can extend the set with an explicit Close action).
- The Reward (R): The feedback loop. If the agent buys and equity increases, R = +1. If equity decreases, R = -1.
The goal of the Agent is not to predict the next price. Its goal is to maximize the Cumulative Reward over time. It learns a Policy (strategy) that maps States to Actions.
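In other words, the agent maximizes the expected discounted return G = r_0 + γ·r_1 + γ²·r_2 + …, where the discount factor γ (typically close to 1) makes rewards further in the future count slightly less. A tiny numeric sketch with made-up rewards:

rewards = [1.0, -0.5, 2.0, 0.3]     # hypothetical per-step rewards along one episode
gamma = 0.99                        # discount factor

G = sum((gamma ** k) * r for k, r in enumerate(rewards))
print(G)                            # the quantity the policy is trained to maximize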
The Architecture: Bridging Python and MQL5
Here is the hard truth: You cannot train DRL models efficiently inside MQL5.
MQL5 is a C++-like language optimized for execution speed, not for the heavy matrix calculus required for backpropagation in neural networks. Python (with PyTorch or TensorFlow) is the industry standard for training.
Therefore, the professional workflow is a Hybrid Architecture:
- Training (Python): We create a custom "Gym Environment" that simulates MT5 data. We train the agent using algorithms like PPO (Proximal Policy Optimization) or A2C.
- Export (ONNX): We freeze the trained "Brain" (Neural Network) into an ONNX file.
- Inference (MQL5): We load the ONNX file into the EA. The EA feeds live market data (State) to the ONNX model, which returns the optimal move (Action).
Step 1: The Training Code (Python Snippet)
We use the stable-baselines3 library to handle the heavy lifting. The key is defining the environment.
import gymnasium as gym                                  # SB3 >= 2.0 uses the Gymnasium API
import numpy as np
from stable_baselines3 import PPO

class MT5TrainEnv(gym.Env):
    def __init__(self, data):
        super().__init__()
        self.data = data
        self.action_space = gym.spaces.Discrete(3)       # 0=Buy, 1=Sell, 2=Hold
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf,
                                                shape=(20,), dtype=np.float32)
    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.i = 0
        return self._get_next_candle(), {}               # your own feature extraction

    def step(self, action):
        # Calculate Profit/Loss based on action
        reward = self._calculate_reward(action)          # your own reward logic
        state = self._get_next_candle()                  # next 20-feature observation
        self.i += 1
        done = self.i >= len(self.data) - 1              # episode ends when the history runs out
        return state, reward, done, False, {}            # Gymnasium 5-tuple

# 2. Train the Model
env = MT5TrainEnv(historical_data)
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)

# 3. Export to ONNX for MQL5: stable-baselines3 has no built-in .to_onnx() helper,
#    so the trained policy is exported with torch.onnx.export (see the sketch below).
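The export step deserves one clarification: stable-baselines3 does not ship a .to_onnx() method. The usual route is to wrap the trained policy in a small torch.nn.Module and pass it to torch.onnx.export. The sketch below is one possible wrapper, continuing from the snippet above; it returns one probability per action (which is what the MQL5 code in Step 2 expects) and ends with an optional onnxruntime sanity check. Treat it as a starting point, not the only way to export.

import numpy as np
import torch as th
import onnxruntime as ort

class OnnxablePolicy(th.nn.Module):
    """Wraps the trained SB3 policy so the exported graph returns per-action probabilities."""
    def __init__(self, policy):
        super().__init__()
        self.policy = policy
    def forward(self, obs):
        # get_distribution() runs the feature extractor and the actor head;
        # .distribution.probs is the softmax over the three actions
        return self.policy.get_distribution(obs).distribution.probs

dummy_obs = th.randn(1, 20)                      # must match the observation shape
th.onnx.export(OnnxablePolicy(model.policy), dummy_obs, "RatioX_DRL_Brain.onnx")

# Optional sanity check before copying the file into MQL5\Files
session = ort.InferenceSession("RatioX_DRL_Brain.onnx")
state = np.random.randn(1, 20).astype(np.float32)
print(session.run(None, {session.get_inputs()[0].name: state})[0])   # expect a 1x3 probability vector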
Step 2: The Execution Code (MQL5 Snippet)
In MetaTrader 5, we don't train. We just execute. We use the native OnnxRun function.
#include <Trade\Trade.mqh>                       // CTrade helper for sending orders
CTrade Trade;
long   onnx_handle = INVALID_HANDLE;             // handle of the trained network

int OnInit()
{
   // Load the trained brain
   onnx_handle = OnnxCreate("RatioX_DRL_Brain.onnx", ONNX_DEFAULT);
   if(onnx_handle == INVALID_HANDLE) return INIT_FAILED;
   // Shapes must match the Python export: 1x20 input, 1x3 output
   const long in_shape[]  = {1, 20};
   const long out_shape[] = {1, 3};
   OnnxSetInputShape(onnx_handle, 0, in_shape);
   OnnxSetOutputShape(onnx_handle, 0, out_shape);
   return INIT_SUCCEEDED;
}

void OnTick()
{
   // 1. Get Current State (must match the Python observation shape)
   float state_vector[20];
   FillStateVector(state_vector);                // Custom function to get RSI, MA, etc.
   // 2. Ask the AI for the Action
   float output_data[3];                         // one score per action
   if(!OnnxRun(onnx_handle, ONNX_NO_CONVERSION, state_vector, output_data))
      return;
   // 3. Execute
   int action = ArrayMaximum(output_data);       // index of the highest score (built-in argmax)
   if(action == 0) Trade.Buy(1.0);
   if(action == 1) Trade.Sell(1.0);
}
The Reality Check: Why Isn't Everyone Doing This?
The theory is beautiful. The reality is brutal. DRL in finance faces three massive hurdles:
- The Simulation-to-Reality Gap: An agent might learn to exploit a specific quirk in your backtest data (overfitting) that does not exist in the live market.
- Non-Stationarity: In the game of Go, the rules never change. In the Market, the "rules" (volatility, correlation, liquidity) change every day. A bot trained on 2020 data might fail in 2025.
- Reward Hacking: The bot might discover that "not trading" is the safest way to avoid losing money, so it learns to do nothing. Or it might take insane risks to chase a high reward if the penalty for drawdown isn't high enough (a simple reward-shaping counter-measure is sketched after this list).
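A common counter-measure is reward shaping: build the drawdown penalty and an inactivity cost directly into the reward signal during training. A minimal sketch in Python; the function name and weights are illustrative, not recommendations:

def shaped_reward(equity_change: float, drawdown: float, idle_steps: int) -> float:
    """Illustrative reward shaping: discourage both reckless risk and doing nothing."""
    reward = equity_change                 # base reward: change in account equity
    reward -= 2.0 * max(drawdown, 0.0)     # drawdown hurts more than gains help
    if idle_steps > 50:                    # small tax on sitting flat forever
        reward -= 0.01
    return reward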
The Solution: Hybrid Intelligence
At Ratio X, we spent two years researching pure DRL. Our conclusion? You cannot trust a Neural Network with your entire wallet.
This is why we built the MLAI 2.0 Engine as a Hybrid System.
- We use Machine Learning to detect the probability of a regime change (Trend vs. Range).
- We use Hard-Coded Logic (C++) to manage Risk, Stops, and Execution.
The AI provides the "Context," and the classical code provides the "Safety." This combination allows us to capture the adaptability of AI without the chaotic unpredictability of a pure DRL agent.
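Conceptually, the gating pattern looks something like the sketch below. This is a generic illustration of "AI for context, hard rules for safety", not the MLAI engine itself; the names and thresholds are made up for the example:

def allow_long_entry(p_trend: float, open_risk_pct: float, spread_points: float) -> bool:
    """Hard-coded safety rules are checked first; the ML signal is only the final filter."""
    if open_risk_pct > 2.0:        # risk cap: never exceeded, whatever the model says
        return False
    if spread_points > 30.0:       # execution filter: skip illiquid conditions
        return False
    return p_trend > 0.65          # the ML regime probability provides the "context"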
Experience The Hybrid Advantage (60% OFF)
We want you to see the difference between "Static Logic" and "Hybrid AI" yourself.
For this article only, we are releasing 10 Discount Coupons that offer our biggest discount ever: 60% OFF the Ratio X Trader's Toolbox.
🧪 DEVELOPER'S FLASH SALE
Use Code: MQLFRIEND60
(Only 10 uses allowed. Get 60% OFF Lifetime Access.)
Includes: MLAI Engine, AI Quantum, and Gold Fury; the Source Codes Vault is available as an upgrade.
💙 Impact: 10% of all Ratio X sales are donated directly to Childcare Institutions in Brazil.