Discussing the article: "MQL5 Wizard Techniques you should know (Part 55): SAC with Prioritized Experience Replay"

Check out the new article: MQL5 Wizard Techniques you should know (Part 55): SAC with Prioritized Experience Replay.
Replay buffers in Reinforcement Learning are particularly important with off-policy algorithms like DQN or SAC. This puts the spotlight on how this memory buffer is sampled. While default options with SAC, for instance, use random selection from the buffer, Prioritized Experience Replay buffers fine-tune this by sampling from the buffer based on a TD-error score. We review the importance of Reinforcement Learning and, as always, examine just this hypothesis (not the cross-validation) in a wizard-assembled Expert Advisor.
Prioritized Experience Replay (PER) buffers and typical replay buffers (which sample at random) are both used in RL with off-policy algorithms like DQN and SAC because they allow past experiences to be stored and sampled. PER differs from a typical replay buffer in how those past experiences are prioritized and sampled.
With the typical replay buffer, experiences are sampled uniformly at random, meaning any of the past experiences has an equal probability of being selected regardless of its importance or relevance to the learning process. With PER, past experiences are sampled based on their 'priority', a property that is often quantified by the magnitude of the Temporal Difference error. This error serves as a proxy for learning potential. Each experience gets assigned a value of this error, and experiences with high values get sampled more frequently. This prioritization can be implemented using a proportional or rank-based approach.
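As a rough sketch of the proportional approach (the function and parameter names here are illustrative, not taken from the article), each experience i is given priority p_i = |TD-error_i| + epsilon and is sampled with probability P(i) = p_i^alpha / sum_k p_k^alpha, where alpha tunes how aggressive the prioritization is:

// Illustrative only: proportional prioritization, P(i) = p_i^alpha / sum_k p_k^alpha,
// with p_i = |TD-error_i| + epsilon so that no experience ever gets zero probability.
void ComputeSamplingProbabilities(const double &td_errors[], double alpha, double epsilon, double &probabilities[])
{
   int n = ArraySize(td_errors);
   ArrayResize(probabilities, n);
   double sum = 0.0;
   for(int i = 0; i < n; i++)
   {
      probabilities[i] = MathPow(MathAbs(td_errors[i]) + epsilon, alpha);
      sum += probabilities[i];
   }
   for(int i = 0; i < n; i++)
      probabilities[i] /= sum;  // normalize so the probabilities sum to 1
}

With alpha = 0 this collapses back to uniform sampling, which is one way to see the typical buffer as a special case of PER.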
Typical replay buffers also do not introduce any sampling bias. PER does, and this could unfairly skew the learning process, which is why PER uses importance-sampling weights to adjust the impact of each sampled experience. Typical replay buffers are thus computationally lighter, since they do far fewer things in the background than PER. On the flip side, PER provides more focused and constructive learning, which the typical buffers do not.
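One common way to compute those correction weights (again a sketch with hypothetical names, not the article's code) is w_i = (N * P(i))^(-beta), normalized by the largest weight so updates are only ever scaled down; beta is usually annealed towards 1 over training:

// Illustrative only: importance-sampling weights w_i = (N * P(i))^(-beta).
// Over-sampled (high-priority) experiences get weights below 1, which tones
// down their contribution to the gradient update and corrects the bias.
void ComputeISWeights(const double &probabilities[], double beta, double &weights[])
{
   int n = ArraySize(probabilities);
   ArrayResize(weights, n);
   double max_w = 0.0;
   for(int i = 0; i < n; i++)
   {
      weights[i] = MathPow(n * probabilities[i], -beta);
      if(weights[i] > max_w)
         max_w = weights[i];
   }
   for(int i = 0; i < n; i++)
      weights[i] /= max_w;  // normalize by the maximum weight
}

Each weight then multiplies the corresponding sample's loss term during the network update.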
It goes without saying, therefore, that implementing PER is more complex than a typical replay buffer. The reason this is emphasized here is that PER requires an additional class to maintain the priority queue, often referred to as the 'sum-tree'. This data structure allows for more efficient sampling of experiences based on their priority. PER tends to lead to faster convergence and better performance, as it focuses on experiences that are more informative or challenging for the agent.
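For illustration only, a minimal sum-tree along these lines could look as follows in MQL5; this is a generic sketch, not the class used in the article. Leaves hold the priorities, every internal node holds the sum of its children, so updating a priority and sampling by cumulative priority both take O(log n):

// Illustrative sum-tree sketch (not the article's class): leaves hold priorities,
// internal nodes hold the sum of their children.
class CSumTree
{
private:
   int    m_capacity;     // number of leaves (maximum stored experiences)
   double m_tree[];       // binary tree stored as a flat array of size 2*capacity - 1

public:
   // assumes capacity is a power of two so the flat array forms a complete binary tree
   void Init(int capacity)
   {
      m_capacity = capacity;
      ArrayResize(m_tree, 2 * capacity - 1);
      ArrayInitialize(m_tree, 0.0);
   }

   // set the priority of leaf 'index' and propagate the change up to the root
   void Update(int index, double priority)
   {
      int    tree_index  = index + m_capacity - 1;
      double change      = priority - m_tree[tree_index];
      m_tree[tree_index] = priority;
      while(tree_index > 0)
      {
         tree_index = (tree_index - 1) / 2;
         m_tree[tree_index] += change;
      }
   }

   // walk down from the root to find the leaf whose cumulative priority covers 'value'
   int Sample(double value)
   {
      int parent = 0;
      while(true)
      {
         int left  = 2 * parent + 1;
         int right = left + 1;
         if(left >= ArraySize(m_tree))
            break;                        // reached a leaf
         if(value <= m_tree[left])
            parent = left;
         else
         {
            value -= m_tree[left];
            parent = right;
         }
      }
      return(parent - (m_capacity - 1));  // convert tree index back to leaf index
   }

   double Total() { return(m_tree[0]); }  // sum of all priorities (the root)
};

Sampling then amounts to drawing a uniform random number in [0, Total()) and passing it to Sample(), which walks down the tree and returns an experience index with probability proportional to its priority.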
Author: Stephen Njuki