Discussing the article: "MetaTrader 5 Machine Learning Blueprint (Part 1): Data Leakage and Timestamp Fixes"

 

Check out the new article: MetaTrader 5 Machine Learning Blueprint (Part 1): Data Leakage and Timestamp Fixes.

Before we can even begin to make use of ML in our trading on MetaTrader 5, it’s crucial to address one of the most overlooked pitfalls—data leakage. This article unpacks how data leakage, particularly the MetaTrader 5 timestamp trap, can distort our model's performance and lead to unreliable trading signals. By diving into the mechanics of this issue and presenting strategies to prevent it, we pave the way for building robust machine learning models that deliver trustworthy predictions in live trading environments. 

Data snooping or data leakage might seem subtle, but its impact on machine learning models can be monumental—and devastating. Imagine studying for a test where you unknowingly peek at the answers beforehand. Your perfect score feels earned, but it's actually cheating. This is precisely what happens when we use MetaTrader 5's default timestamps in machine learning—data leakage unexpectedly corrupts your model's integrity.

How MetaTrader 5's Timestamps Trick You

[Figure: EURUSD M5 chart in MetaTrader 5]

MetaTrader 5 labels the 5-minute bar starting at 18:55, i.e., the 2nd-last bar above, as:
Time          Open      High      Low       Close
2 Apr 18:55   1.08718   1.08724   1.08668   1.08670

By timestamping at the start, MetaTrader 5 implies this bar's data was available at 18:55:00—a full 5 minutes before it actually closed! If your model uses this in training, it's like giving a student exam answers 5 minutes before the test begins. To counteract this, we should avoid using MetaTrader 5's precompiled time-bars, instead using tick data to create the bars we use in our models.
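As a rough illustration of the idea above, the following minimal sketch aggregates raw ticks into 5-minute bars and stamps each bar with the time it closes rather than the time it opens. It is not the article's implementation: the function name `build_time_bars` is hypothetical, the tick times and the year are invented for the example, and only the four prices are taken from the 18:55 bar above.

```python
from datetime import datetime, timedelta


def build_time_bars(ticks, minutes=5):
    """Aggregate (timestamp, price) ticks into OHLC bars labelled by their END time."""
    width = timedelta(minutes=minutes)
    bars = {}
    # Sort by time only, so ticks sharing a timestamp keep their original order.
    for ts, price in sorted(ticks, key=lambda t: t[0]):
        # Floor the tick time to the start of its interval, then shift to the end.
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        start = midnight + ((ts - midnight) // width) * width
        end = start + width
        bar = bars.get(end)
        if bar is None:
            bars[end] = {"time": end, "open": price, "high": price,
                         "low": price, "close": price}
        else:
            bar["high"] = max(bar["high"], price)
            bar["low"] = min(bar["low"], price)
            bar["close"] = price
    return [bars[key] for key in sorted(bars)]


# Hypothetical ticks: prices match the 18:55 bar above; times and year are invented.
ticks = [
    (datetime(2025, 4, 2, 18, 55, 1), 1.08718),
    (datetime(2025, 4, 2, 18, 56, 30), 1.08724),
    (datetime(2025, 4, 2, 18, 58, 10), 1.08668),
    (datetime(2025, 4, 2, 18, 59, 58), 1.08670),
    (datetime(2025, 4, 2, 19, 0, 2), 1.08672),
]
bars = build_time_bars(ticks)
for bar in bars:
    print(bar["time"], bar["open"], bar["high"], bar["low"], bar["close"])
```

With this labelling, the bar that MetaTrader 5 would stamp 18:55 carries the timestamp 19:00, the first moment at which its close is actually known.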

Author: Patrick Murimi Njoroge

 

The activity-driven bars do not solve all problems you mentioned for time bars. For example, you wrote:

The Subtle Intra-Bar Leakage: However, a more subtle form of data leakage can still occur within the very formation of that time bar. If a significant event transpires midway through a 1-minute bar (e.g., at 09:00:35), any features derived from that bar (such as its high price or a flag for the event) will inevitably incorporate this information by the bar's end.

If you build equal-volume, equal-range, or other tick-based custom bars, you will still mark each bar with a single label, and it will leak (or, more precisely, blur) information about the high price across the entire bar.

The only way to solve this is to build "bars" with the specific features you are going to use in mind. For example, if highs or lows are the main features, you could try zigzag "bars" with the extremums marked with their exact times.

Actually, the approach with constant timeframes, and specifically limiting them to M1 is problematic in the context of data leakage in MT5. Labelling M1 bars with ending time is not much better than with beginning time, imho.


For those who are interested in building custom bars (charts) natively in MT5, there is an article with an MQL5 implementation of equal-volume, equal-range, and renko bars. Of course, you can mark the bars with their ending time in the open-source code.

Custom symbols: Practical basics
  • www.mql5.com
The article is devoted to the programmatic generation of custom symbols which are used to demonstrate some popular methods for displaying quotes. It describes a suggested variant of minimally invasive adaptation of Expert Advisors for trading a real symbol from a derived custom symbol chart. MQL source codes are attached to this article.
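For readers who want to experiment outside MQL5 first, here is a minimal Python sketch of equal-volume bars marked with their ending time. It is not the code attached to the linked article: the function name `build_volume_bars`, the tick tuples, and the volume threshold are all hypothetical, and for simplicity a tick's volume is never split across two bars.

```python
from datetime import datetime


def build_volume_bars(ticks, volume_per_bar):
    """Group (timestamp, price, volume) ticks into bars of roughly equal volume.

    Each bar is labelled with the time of its LAST tick (its ending time).
    A trailing bar that never reaches the threshold is discarded.
    """
    bars, current = [], None
    for ts, price, volume in ticks:
        if current is None:
            current = {"open": price, "high": price, "low": price, "volume": 0}
        current["high"] = max(current["high"], price)
        current["low"] = min(current["low"], price)
        current["close"] = price
        current["volume"] += volume
        current["time"] = ts  # end-time label, updated until the bar closes
        if current["volume"] >= volume_per_bar:
            bars.append(current)
            current = None
    return bars


# Hypothetical ticks, invented purely for illustration.
ticks = [
    (datetime(2025, 4, 2, 18, 55, 1), 1.0870, 4),
    (datetime(2025, 4, 2, 18, 55, 9), 1.0875, 3),
    (datetime(2025, 4, 2, 18, 55, 40), 1.0872, 5),
    (datetime(2025, 4, 2, 18, 56, 5), 1.0871, 6),
    (datetime(2025, 4, 2, 18, 57, 30), 1.0869, 4),
    (datetime(2025, 4, 2, 18, 57, 45), 1.0868, 2),
]
volume_bars = build_volume_bars(ticks, volume_per_bar=10)
for bar in volume_bars:
    print(bar["time"], bar["open"], bar["high"], bar["low"], bar["close"], bar["volume"])
```

Note that the first bar's high is set by its second tick, well before the bar's ending time, which is the blurring effect described above.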
 
Stanislav Korotky #:

The activity-driven bars do not solve all problems you mentioned for time bars. For example, you wrote:

If you build equal-volume, equal-range, or other tick-based custom bars, you will still mark each bar with a single label, and it will leak (or, more precisely, blur) information about the high price across the entire bar.

The only way to solve this is to build "bars" with the specific features you are going to use in mind. For example, if highs or lows are the main features, you could try zigzag "bars" with the extremums marked with their exact times.

Actually, the approach with constant timeframes, and specifically limiting them to M1 is problematic in the context of data leakage in MT5. Labelling M1 bars with ending time is not much better than with beginning time, imho.


For those who are interested in building custom bars (charts) natively in MT5, there is an article with an MQL5 implementation of equal-volume, equal-range, and renko bars. Of course, you can mark the bars with their ending time in the open-source code.

The activity-driven bars aim to improve the statistical properties of the information contained in the bars, such as reduced heteroskedasticity and improved normality. The solution I have proposed to the subtle intra-bar leakage is to label bars by their end times, so that every event that occurs within the bar has already happened by the bar's timestamp. A useful example is when you train your model on features derived from the timestamp, such as Fourier transformations. If you use the MetaTrader 5 convention, where bars are labelled by the start of the period, you are misinforming your model. The distinction may not matter much for some models, but it has a huge impact on those that aim to exploit the cyclical nature of markets. I hope I have clarified my intent.
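To make the point about timestamp-derived features concrete, here is a small sketch. It assumes the features in question are simple sine/cosine encodings of the time of day, which is one common reading of "Fourier" features and not necessarily the exact transformation used in the article. The function name `cyclical_time_features` and the date are hypothetical.

```python
import math
from datetime import datetime, timedelta


def cyclical_time_features(ts, period_minutes):
    """Encode the time of day as a point on a circle with the given period."""
    minutes = ts.hour * 60 + ts.minute + ts.second / 60
    angle = 2 * math.pi * minutes / period_minutes
    return math.sin(angle), math.cos(angle)


# The same 5-minute bar under the two labelling conventions (year is invented).
start_label = datetime(2025, 4, 2, 18, 55)           # MetaTrader 5 convention
end_label = start_label + timedelta(minutes=5)       # end-of-bar convention

for name, period in (("daily cycle", 24 * 60), ("hourly cycle", 60)):
    sin_start, cos_start = cyclical_time_features(start_label, period)
    sin_end, cos_end = cyclical_time_features(end_label, period)
    phase_shift_deg = 360 * 5 / period  # 5 minutes expressed as a phase angle
    print(f"{name}: start=({sin_start:.4f}, {cos_start:.4f}) "
          f"end=({sin_end:.4f}, {cos_end:.4f}) shift={phase_shift_deg:.2f} deg")
```

The shorter the cycle the model is meant to pick up, the larger the phase error introduced by start-of-bar labels: 5 minutes is 1.25 degrees of a daily cycle but 30 degrees of an hourly one.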
  

 
Stanislav Korotky #:

The activity-based bars don't solve all the problems you mentioned for time bars. For example, you wrote:

If you create bars of the same volume, range, or other tick-based custom bars, you'll be marking such a bar with a single label anyway, and information about the maximum price will leak (or more accurately, blur) across the entire bar.

The only way to solve this problem is to create "bars" with the specific features you will be using in mind. For example, if highs or lows are the main characteristics, you should try to create a "zigzag bar" with the extremums marked at their exact times.

The constant timeframe approach, and in particular the limitation to M1, is problematic in the context of the MT5 data leak. Marking M1 bars with the end time is imho not much better than with the start time.


For those interested in creating custom bars (charts) natively in MT5, there is the article with the MQL5 implementation of Equal Volume, Equal Range and Renko bars. Of course, you can mark the bars with end time in the open source code.

What do you mean when you state "If you create bars of the same volume, range, or other tick-based custom bars, you'll be marking such a bar with a single label anyway, and information about the maximum price will leak (or more accurately, blur) across the entire bar"?

 
Patrick Murimi Njoroge #:

What do you mean when you state "If you create bars of the same volume, range, or other tick-based custom bars, you'll be marking such a bar with a single label anyway, and information about the maximum price will leak (or more accurately, blur) across the entire bar"?

I don't understand what's unclear. My sentence was a direct reply to your sentence, quoted in my previous post, so you can see the context. No matter how you form the bars, every property of the bar is attributed to a single timestamp, and the actual "event" behind that property does not match that time.
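The mismatch described here can be made visible with a short sketch that keeps, for each bar, the exact tick time at which the high and the low occurred. This is only an illustration under stated assumptions: the function name `summarize_bar` is hypothetical, the tick times and year are invented, the prices reuse the 18:55 bar from the first post, and the bar is labelled with the time of its last tick.

```python
from datetime import datetime


def summarize_bar(ticks):
    """Collapse one bar's (timestamp, price) ticks, keeping the time of each extreme."""
    first_ts, first_price = ticks[0]
    bar = {"open": first_price, "high": first_price, "low": first_price,
           "high_time": first_ts, "low_time": first_ts}
    for ts, price in ticks[1:]:
        if price > bar["high"]:
            bar["high"], bar["high_time"] = price, ts
        if price < bar["low"]:
            bar["low"], bar["low_time"] = price, ts
    bar["close"] = ticks[-1][1]
    bar["time"] = ticks[-1][0]  # single label: time of the bar's last tick
    return bar


# Hypothetical ticks for one bar; only the prices come from the thread.
bar_ticks = [
    (datetime(2025, 4, 2, 18, 55, 1), 1.08718),
    (datetime(2025, 4, 2, 18, 56, 30), 1.08724),
    (datetime(2025, 4, 2, 18, 58, 10), 1.08668),
    (datetime(2025, 4, 2, 18, 59, 58), 1.08670),
]
bar = summarize_bar(bar_ticks)
print("bar label:", bar["time"])
print("high", bar["high"], "seen at", bar["high_time"],
      "lag:", bar["time"] - bar["high_time"])
print("low ", bar["low"], "seen at", bar["low_time"],
      "lag:", bar["time"] - bar["low_time"])
```

Storing `high_time` and `low_time` alongside the bar is one way to keep the event times that a single label loses, and it is a step toward the zigzag-style "bars" suggested earlier in the thread.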