Discussing the article: "MQL5 Wizard Techniques you should know (Part 45): Reinforcement Learning with Monte-Carlo"

 

Check out the new article: MQL5 Wizard Techniques you should know (Part 45): Reinforcement Learning with Monte-Carlo.

Monte Carlo is the fourth reinforcement learning algorithm we are considering, with the aim of exploring its implementation in wizard-assembled Expert Advisors. Though anchored in random sampling, it offers a wide range of simulation approaches that we can look to exploit.

With the Monte Carlo algorithm, Q-Values are only updated after the completion of an episode. An episode is a batch of cycles. For this article, we have assigned this number of cycles to the input parameter 'm_episodes_size', which is optimizable or adjustable. Monte Carlo is credited with being quite robust to market variability because it can better simulate a wide range of possible market scenarios, allowing traders to determine how different strategies perform under a variety of conditions. This variability helps traders understand potential trade-offs, risks, and returns, enabling them to make more informed decisions.
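As a rough illustration of how such end-of-episode gating can be wired up, here is a minimal MQL5 sketch. Only the input parameter m_episodes_size is taken from the article; the struct, class, and method names are illustrative assumptions, not the article's actual code.

    input int m_episodes_size = 12;       // cycles per episode (optimizable, per the article)

    struct Sstep
    {  int      state;                    // index of the visited state
       int      action;                   // action taken in that state
       double   reward;                   // reward observed after acting
    };

    class Cmc_sketch
    {
    protected:
       Sstep    m_buffer[];               // per-episode log of state-action-reward steps
       int      m_count;                  // cycles completed in the current episode
    public:
       Cmc_sketch(void) : m_count(0) { }
       // Called once per cycle: log the step, but touch the Q-table
       // only when the episode completes
       void Step(int State, int Action, double Reward)
       {  ArrayResize(m_buffer, m_count + 1);
          m_buffer[m_count].state  = State;
          m_buffer[m_count].action = Action;
          m_buffer[m_count].reward = Reward;
          m_count++;
          if(m_count >= m_episodes_size)  // episode complete:
          {  UpdateQ();                   //  one batched Q-value update
             ArrayResize(m_buffer, 0);    //  then reset for the next episode
             m_count = 0;
          }
       }
       void UpdateQ(void) { /* average episode returns into Q-values, see the return sketch below */ }
    };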

This edge, it is argued, stems from its 'long-term performance insight', which contrasts with traditional methods that tend to focus on short-term outcomes. What is meant by this is that the infrequent updates Monte Carlo performs, since they happen only once per episode, evade the market noise that Q-Learning and SARSA are bound to run into, given that they execute their updates more frequently. Monte Carlo therefore strives to assess the long-term performance of trading strategies by evaluating cumulative rewards over time. By analysing multiple such episodes, traders can gain insights into the overall profitability and sustainability of their strategies.
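To make this contrast concrete, the two update schedules can be stated side by side; this is the standard textbook formulation, not a quote from the article:

    SARSA, one bootstrapped update on every step t:
    Q(s_t, a_t) ← Q(s_t, a_t) + α · [ R_{t+1} + γ · Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
    (Q-Learning replaces Q(s_{t+1}, a_{t+1}) with the maximum of Q(s_{t+1}, a') over actions a'.)

    Monte Carlo, one update per episode using the full observed return G_t:
    Q(s_t, a_t) ← Q(s_t, a_t) + α · [ G_t − Q(s_t, a_t) ]

With an episode of m_episodes_size cycles, the temporal-difference methods touch the Q-table m_episodes_size times for every single Monte Carlo update, which is where the per-step noise exposure comes from.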

The Monte Carlo algorithm computes action-value estimates based on average returns of state-action pairs across multiple cycles within a single episode. This allows traders to better assess which actions (e.g., buying or selling) are most likely to yield favourable outcomes based on historical performance. This updating of the Q-Values stems from having the reward component of these Q-Values, the discounted return G_t, determined as follows (a short sketch of this computation appears after the definitions below):

G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + … + γ^(T−t−1)·R_T

Where:

  • R_{t+1}, R_{t+2}, …, R_T are the rewards received at each step after time t.
  • γ (gamma) is the discount factor (0 ≤ γ ≤ 1), which sets by how much future rewards are "discounted" (i.e., valued less than immediate rewards).
  • T represents the time step at which the episode ends (terminal state or episode size in cycles).
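As referenced above, here is a minimal MQL5 sketch of computing G_t and folding it into a running-average Q-value. The function names and the incremental-average form are illustrative assumptions rather than the article's actual implementation.

    // Discounted return G_t, given the episode's rewards observed
    // after step t, in order: Rewards[0] = R_{t+1}, Rewards[1] = R_{t+2}, ...
    double GetReturn(const double &Rewards[], double Gamma)
    {  double _g = 0.0;
       double _discount = 1.0;
       for(int i = 0; i < ArraySize(Rewards); i++)
       {  _g += _discount * Rewards[i];   // adds γ^k · R_{t+1+k}
          _discount *= Gamma;
       }
       return(_g);
    }

    // Incremental average of returns for one state-action pair:
    // Q ← Q + (G − Q) / N, where N counts visits to (s, a)
    void UpdateAverage(double &Q, int &Visits, double G)
    {  Visits++;
       Q += (G - Q) / (double)Visits;
    }

For example, with Gamma = 0.9 and rewards {1.0, −0.5, 2.0}, GetReturn gives 1.0 + 0.9·(−0.5) + 0.81·2.0 = 2.17, and that single figure is what gets averaged into Q(s, a) at the end of the episode.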


    Author: Stephen Njuki

     
    Hi Mr Njuki,

    I hope you're well.

I'm simply enquiring about the optimization that was performed in 2022 for the expert advisor. Could you please elaborate on which pricing model was used?

    Kind regards,