Discussing the article: "Data Science and ML (Part 29): Essential Tips for Selecting the Best Forex Data for AI Training Purposes"

 

Check out the new article: Data Science and ML (Part 29): Essential Tips for Selecting the Best Forex Data for AI Training Purposes.

In this article, we dive deep into the crucial aspects of choosing the most relevant and high-quality Forex data to enhance the performance of AI models.

Traders have an abundance of data and information at their disposal: indicators (MetaTrader 5 ships with more than 36 built-in indicators), symbol pairs (more than 100 symbols, which can also serve as data for correlation strategies), news, which is valuable data for traders, and more. The point I'm trying to raise is that there is abundant information for traders to use in manual trading, or when building Artificial Intelligence models to help us make smart trading decisions in our trading robots.

Out of all the information we have at hand, some of it is bound to be bad (that is just common sense). Not all indicators, data, or strategies are useful for a particular trading symbol, strategy, or situation. How do we determine the right information for trading and for machine learning models to achieve maximum efficiency and profitability? This is where feature selection comes into play.
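As an illustration of what such correlation-based feature filtering looks like in practice, here is a minimal sketch on synthetic OHLC-like data. The column names, the random-walk generator, and the 0.99 threshold are my own illustrative assumptions, not taken from the article:

```python
# Minimal sketch: correlation-based feature filtering on synthetic OHLC-like data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
close = 1.10 + rng.normal(0, 0.001, 500).cumsum()  # synthetic price path
df = pd.DataFrame({
    "open":  close + rng.normal(0, 0.0002, 500),
    "high":  close + np.abs(rng.normal(0, 0.0005, 500)),
    "low":   close - np.abs(rng.normal(0, 0.0005, 500)),
    "close": close,
})

corr = df.corr().abs()
# Keep only the upper triangle so each feature pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
# Flag every column that is >99% correlated with an earlier column
to_drop = [c for c in upper.columns if (upper[c] > 0.99).any()]
print("highly correlated, candidates to drop:", to_drop)
```

On a random-walk series like this, all four price columns end up almost perfectly correlated, so the filter flags everything except the first column — which is precisely the behavior the comments below take issue with.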

Author: Omega J Msigwa

 
Thank you for your clear and well-written article. It is exactly what I was trying to understand, and I had been working away at checking correlations myself. Thanks also for the Python file, as it makes an easy template for me to adapt. Hopefully, after some analysis, I will be able to say thanks for opening my eyes to what is possible.
 
«By combining or removing strongly correlated features, you can simplify the model without losing important information. For example, in the correlation matrix shown above, the variables Open, High, and Low have 100% correlation. Their actual correlation is 99-plus % (the displayed values are rounded). In this case, you can exclude some of these variables and keep only one, or apply the dimensionality reduction techniques we will look at later.»
This kills market data. It is a classic example of so-called "data cleaning", based on a bias that takes its roots straight from learning on stationary data.

For example, in this article (https://link.springer.com/article/10.1186/s40854-024-00622-6), the authors show that OHLC is not just four numbers, but a single topological object.

If we keep only Close, we lose information about volatility within the bar. A 99% correlation is "noise" for linear regression, but that remaining 1% difference is a "signal" for the trader (shadow length, breakout strength). Removing the "correlated" prices turns a candlestick chart into a line chart, destroying the very essence of candlestick analysis.


"The correlation coefficient ... evaluates only linear relationships between numerical variables."

The author himself admits the limitation of the method, yet still suggests using it for feature selection.
The market is not linear. The same article introduces the concept of structural constraints (e.g., High ≥ Close). Pearson correlation does not see these constraints. If we follow the logic of the first article and remove the "redundant" High/Low, the model stops understanding the limits of admissible values. As a result, we get an algorithm that cannot tell the difference between a "calm market" and a "market with huge tails" when their opening prices coincide.
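To make the "calm market vs. market with huge tails" point concrete, here is a toy illustration with hypothetical price values: two bars share the same Open and Close, so a Close-only model cannot distinguish them, while High and Low reveal very different intra-bar ranges:

```python
# Hypothetical bars: identical Open/Close, very different intra-bar ranges.
calm_bar = {"open": 1.1000, "high": 1.1005, "low": 1.0998, "close": 1.1002}
wild_bar = {"open": 1.1000, "high": 1.1080, "low": 1.0920, "close": 1.1002}

def true_range(bar):
    """Intra-bar volatility proxy: High minus Low."""
    return bar["high"] - bar["low"]

# Close alone is identical for both bars...
assert calm_bar["close"] == wild_bar["close"]
# ...but the ranges differ by more than an order of magnitude
print(true_range(calm_bar), true_range(wild_bar))
```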


"By reducing the dimensionality... we simplify the model and reduce the computational cost."

This is penny-wise and pound-foolish.
You can transform the data (an unconstrained transformation) rather than throw it away to simplify the model. Instead of removing High and Low because of their correlation with Open, transform them into relative values (candle spread, position of the close relative to the extremes). The dimensionality stays the same (or becomes slightly lower), but the informativeness (the geometry) is preserved at 100%, and the correlation problem disappears.
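A minimal sketch of this transformation, with hypothetical feature names (body, range, close_pos) of my own choosing rather than anything prescribed by the cited paper:

```python
# Sketch: replace raw High/Low with relative candle geometry.
import pandas as pd

df = pd.DataFrame({
    "open":  [1.1000, 1.1010, 1.0990],
    "high":  [1.1020, 1.1030, 1.1010],
    "low":   [1.0990, 1.1000, 1.0970],
    "close": [1.1010, 1.1005, 1.1000],
})

candle_range = df["high"] - df["low"]  # candle spread (intra-bar volatility)
features = pd.DataFrame({
    "body":      df["close"] - df["open"],                  # signed body size
    "range":     candle_range,
    "close_pos": (df["close"] - df["low"]) / candle_range,  # 0 = at low, 1 = at high
})
print(features)
```

The derived features are no longer dominated by the shared price level, so they are far less mutually correlated, yet the full candle geometry (body, spread, shadow position) is still recoverable.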

A structural VAR and VECM modeling method for open-high-low-close data contained in candlestick chart - Financial Innovation
  • 2024.03.05
  • link.springer.com
The structural modeling of open-high-low-close (OHLC) data contained within the candlestick chart is crucial to financial practice. However, the inherent constraints in OHLC data pose immense challenges to its structural modeling. Models that fail to process these constraints may yield results deviating from those of the original OHLC data...