Discussing the article: "Overcoming The Limitation of Machine Learning (Part 1): Lack of Interoperable Metrics"

 

Check out the new article: Overcoming The Limitation of Machine Learning (Part 1): Lack of Interoperable Metrics.

There is a powerful and pervasive force quietly corrupting the collective efforts of our community to build reliable trading strategies that employ AI in any shape or form. This article establishes that part of the problem we face is rooted in blind adherence to "best practices". By furnishing the reader with simple, real-world, market-based evidence, we will show why we must refrain from such conduct and instead adopt domain-bound best practices if our community is to stand any chance of recovering the latent potential of AI.

Imagine you’re in a lottery-style competition. You and 99 other people are randomly selected to play for a $1,000,000 jackpot. The rules are simple: you must guess the heights of the other 99 participants. The winner is the person with the smallest total error across their 99 guesses.

Now, here’s the twist: for this example, imagine the average global human height is 1.1 meters. If you simply guess 1.1 meters for everyone, you might actually win the jackpot, even though every single prediction is technically wrong. Why? Because in noisy, uncertain environments, guessing the average tends to produce the smallest overall error.
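A minimal simulation sketch of this thought experiment (the 1.1 m average, the spreads, and the absolute-error scoring are illustrative assumptions, not figures from the article):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 99 participants, heights drawn around an assumed
# 1.1 m global average with some spread (both numbers are illustrative).
heights = rng.normal(loc=1.1, scale=0.15, size=99)

# Strategy A: always guess the global average.
error_mean_guess = np.abs(heights - 1.1).sum()

# Strategy B: honest individual guesses, modelled as noisy estimates
# of each true height.
guesses = heights + rng.normal(loc=0.0, scale=0.25, size=99)
error_individual = np.abs(heights - guesses).sum()

print(f"Total error, guessing the average: {error_mean_guess:.2f} m")
print(f"Total error, noisy individual guesses: {error_individual:.2f} m")
# When individual estimates are noisier than the population spread,
# the constant average guess tends to win, despite every guess being wrong.
```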


Author: Gamuchirai Zororo Ndawana

 

"However, our strategy demonstrates the ability to recover and stay on track, which is exactly what we strive for."

I always thought that what one should strive for is for a strategy to generate profits :)

 
Maxim Dmitrievsky # :

"However, our strategy demonstrates the ability to recover and stay on track, which is exactly what we are aiming for."

I've always thought that one should strive for a strategy to bring profits :)

Yes indeed, but unfortunately we still have no standardised machine learning metrics that are aware of the difference between profit and loss.
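To illustrate what such a metric could look like, here is a toy sketch (not a standard library function) that scores forecasts by the sign-based PnL they would have generated rather than by error magnitude:

```python
import numpy as np

def pnl_aware_score(y_true, y_pred, cost=0.0):
    """Toy PnL-aware metric: trade in the direction of each forecast,
    earn the realised return, and pay a per-trade cost. Illustrative
    sketch only, not a standard library metric."""
    y_true = np.asarray(y_true, dtype=float)   # realised returns
    y_pred = np.asarray(y_pred, dtype=float)   # forecast returns
    positions = np.sign(y_pred)                # long (+1), flat (0), short (-1)
    pnl = positions * y_true - cost * np.abs(positions)
    return pnl.sum()

# Two forecasts with identical RMSE (0.02) but opposite PnL:
y = np.array([0.01, -0.01])
print(pnl_aware_score(y, np.array([0.03, -0.03])))   # signs right: +0.02
print(pnl_aware_score(y, np.array([-0.01, 0.01])))   # signs wrong: -0.02
```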

 

Thanks for the article, @Gamuchirai Zororo Ndawana

I agree with @Maxim Dmitrievsky that the ultimate goal is profitability. The idea of "recover and stay on track" makes sense as robustness and drawdown control, but it does not replace profit.

On metrics: it is true that there is no standardized ML metric that is PnL-aware, although in practice models are validated with Sharpe, Sortino, Calmar, profit factor, max DD, plus asymmetric losses or rewards in reinforcement learning (RL-style) that do incorporate PnL, costs, and turnover.
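For reference, a minimal sketch of how these PnL-based statistics are typically computed from a series of per-period strategy returns (the annualisation factor and exact conventions are assumptions; definitions vary across the literature):

```python
import numpy as np

def evaluation_stats(returns, periods_per_year=252):
    """Common PnL-based statistics from per-period strategy returns.
    Conventions vary across the literature; these are one typical set."""
    r = np.asarray(returns, dtype=float)
    ann = np.sqrt(periods_per_year)
    sharpe = ann * r.mean() / r.std(ddof=1)
    sortino = ann * r.mean() / r[r < 0].std(ddof=1)  # downside risk only
    equity = np.cumprod(1.0 + r)                     # compounded equity curve
    peak = np.maximum.accumulate(equity)
    max_dd = ((peak - equity) / peak).max()          # worst peak-to-trough loss
    cagr = equity[-1] ** (periods_per_year / len(r)) - 1.0
    calmar = cagr / max_dd
    profit_factor = r[r > 0].sum() / -r[r < 0].sum()
    return dict(sharpe=sharpe, sortino=sortino, calmar=calmar,
                max_drawdown=max_dd, profit_factor=profit_factor)

rng = np.random.default_rng(1)
print(evaluation_stats(rng.normal(0.0005, 0.01, size=1000)))
```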

Technically, I would review two key points in the article:
  • The examples contain look-ahead bias (features using i + HORIZON), which invalidates the evaluation (see the sketch after this list);
  • The DRS test that "sums to zero" is tautological because the two labels are antisymmetric by construction; it does not prove market understanding.
Even so, the reminder not to select by RMSE or MAE on returns is useful.
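To make the first bullet concrete, here is a hypothetical reconstruction of that kind of leak; the variable names and data are invented for illustration and may not match the article's code:

```python
import numpy as np
import pandas as pd

HORIZON = 5
prices = pd.Series(100 + np.cumsum(np.random.default_rng(2).normal(size=500)))

# Leaky: a "feature" at time i built from the price at i + HORIZON uses
# information that does not exist yet at prediction time.
leaky_feature = prices.shift(-HORIZON) - prices

# Corrected: features may only use data up to and including time i;
# only the label is allowed to look HORIZON steps ahead.
feature = prices - prices.shift(HORIZON)   # past change, known at time i
label = prices.shift(-HORIZON) - prices    # future change, the target

# Drop rows where the label is undefined before any fitting or scoring.
data = pd.DataFrame({"feature": feature, "label": label}).dropna()
```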

Practical suggestion: walk-forward testing, costs and slippage, asymmetric or quantile losses or utility-based objectives, and penalizing turnover to avoid mean hugging. (Pragmatic take: align the loss with how you make money.)
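A rough sketch of the walk-forward suggestion with a turnover penalty; the data, model, and cost figure are placeholders, not a recommendation of any particular setup:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit

# Minimal walk-forward sketch: fit only on the past, score only on the
# out-of-sample segment that follows it, and deduct a per-trade cost.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 4))
y = X @ np.array([0.02, -0.01, 0.0, 0.015]) + rng.normal(0, 0.05, 1000)

COST = 0.0002  # assumed cost per unit of position change
oos_pnl = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge().fit(X[train_idx], y[train_idx])
    pos = np.sign(model.predict(X[test_idx]))
    trades = np.abs(np.diff(pos, prepend=0.0))   # turnover penalty
    oos_pnl.append((pos * y[test_idx] - COST * trades).sum())

print("Out-of-sample PnL per fold:", np.round(oos_pnl, 3))
```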
 

Quoted: Yes indeed, but unfortunately we still have no standardised machine learning metrics that are aware of the difference between profit and loss.

Answer: Profit and loss columns will only exist if your back-tested product, or the flat market, is as good as the forward market you are using against the subsequent portfolio or basket of indices that will follow this line of orders.

There are some indices and newly founded ETFs coming out, or being produced on an increasing basis, for this intended usage, and they will produce these results and profit margins, such as the Dow Jones 30 index, as well as many other indices which have been created for this intended use. Peter Matty

 
Miguel Angel Vico Alba # :

Thanks for the article, @Gamuchirai Zororo Ndawana

I agree with @Maxim Dmitrievsky that the ultimate goal is profitability. The idea of "recover and stay on track" makes sense as robustness and drawdown control, but it does not replace profit.

On metrics: it is true that there is no standardized ML metric that is PnL-aware, although in practice models are validated with Sharpe, Sortino, Calmar, profit factor, max DD, plus asymmetric losses or rewards in reinforcement learning (RL-style) that do incorporate PnL, costs, and turnover.

Technically, I would review two key points in the article:
  • The examples contain look-ahead bias (features using i + HORIZON), which invalidates the evaluation;
  • The DRS test that "sums to zero" is tautological because the two labels are antisymmetric by construction; it does not prove market understanding.
Even so, the reminder not to select by RMSE or MAE on returns is useful. Practical suggestion: walk-forward testing, costs and slippage, asymmetric or quantile losses or utility-based objectives, and penalizing turnover to avoid mean hugging. (Pragmatic take: align the loss with how you make money.)

Sometimes I wonder if the translation tools we rely on may fail to capture the original message. Your response offers a lot more talking points than what I understood from @Maxim Dmitrievsky's original message.

Thank you for pointing out that oversight with the look-ahead bias (features with i + HORIZON); those are the bugs I hate most, because they necessitate an entire re-test. But this time it will be done more thoughtfully.

You've also provided valuable feedback on the validation measures used in practice; the Sharpe ratio must be akin to a universal gold standard. I need to learn more about Calmar and Sortino to develop an opinion on those, thank you for that.

I agree with you that the two terms are antisymmetric by design, and the test is that the models should remain antisymmetric; any deviation from this expectation fails the test. If one or both models have unacceptable bias, then their predictions will not remain antisymmetric as we expect.
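A sketch of that antisymmetry check as described here (the article's exact DRS construction may differ; the data and model below are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Train one model on the label and a second on its negation, then test
# whether their predictions mirror each other.
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(0, 0.1, 500)

up = LinearRegression().fit(X, y)      # predicts the return
down = LinearRegression().fit(X, -y)   # predicts its negation

residual = up.predict(X) + down.predict(X)
print("max |p_up + p_down|:", np.abs(residual).max())
# For a symmetric learner like OLS this is ~0 by construction; the check
# only becomes informative when a biased model breaks the mirror.
```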

However, the notion of profit is only a simple illustration I gave to highlight the problem. None of the metrics we have today inform us when mean hugging is happening. None of the literature on statistical learning tells us why mean hugging is happening. Unfortunately it's happening due to the best practices we follow, and this is just one of many ways I wish to get more conversations started on the dangers of best practices.
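One crude diagnostic sketch for detecting mean hugging (this is illustrative, not an established metric): compare the dispersion of the forecasts with the dispersion of the target.

```python
import numpy as np

def mean_hugging_ratio(y_true, y_pred):
    """Dispersion of forecasts relative to dispersion of the target.
    Values near 0 suggest the model collapses toward the mean;
    thresholds are a judgment call, not an established standard."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return y_pred.std() / y_true.std()

# A constant-mean forecaster scores ~0, while a forecaster that tracks
# the target's variability scores closer to 1.
rng = np.random.default_rng(5)
y = rng.normal(0.0, 0.01, size=1000)
print(mean_hugging_ratio(y, np.full_like(y, y.mean())))              # ~0.0
print(mean_hugging_ratio(y, 0.8 * y + rng.normal(0, 0.002, 1000)))   # ~0.8
```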

This article was more of a cry for help, for us to come together and design new protocols from the ground up. New standards. New objectives that our optimisers work on directly, that are tailored for our interests.