How do you detect when your EA stops matching its backtest? - page 2

 
This thread evolved way beyond my original CUSUM question — and that's exactly what I was hoping for.

The progression from output-level monitoring (CUSUM on equity) → input-level filtering (Enrique's outlier cap) → generalized external filtering (fxsaber's BestInterval with any indicator) is a clean framework.

Key takeaway for me: the filter doesn't need to live inside the strategy logic. Decoupling detection from execution opens up a much wider design space. Appreciate the insights from everyone here.
 
michael schouten:

I've been running live EAs for a while and kept running into the same blind spot: by the time I notice performance has degraded, I'm already 10-15 trades deep into the drawdown. Watching the equity curve doesn't catch it early — too noisy.

What I've started doing is comparing live trades against the backtest distribution statistically, trade by trade, instead of waiting for monthly review:

  • For each closed trade, compute the pip outcome
  • Maintain a rolling CUSUM (cumulative sum of deviations from backtest mean R)
  • Alert when CUSUM crosses a threshold calibrated to the backtest's own variance

The math is basically Page's CUSUM test — standard SPC stuff, but I haven't seen much discussion of applying it to EA monitoring specifically. Most people either (a) eyeball equity, or (b) wait for X losing trades in a row, which is way too late.
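For anyone who wants to try this, the per-trade monitor described above can be sketched in a few lines. This is just an illustration, not my production code — the slack `k` and threshold `h` are the usual SPC parameters and the defaults here are arbitrary:

```python
def cusum_monitor(trades_r, mu_bt, sigma_bt, k=0.5, h=4.0):
    """One-sided lower CUSUM (Page's test) on per-trade outcomes.

    mu_bt / sigma_bt come from the backtest trade distribution.
    k is the slack (in sigma units): drift smaller than k is ignored.
    h is the decision threshold (in sigma units).
    Returns the 1-based index of the first alarming trade, or None.
    """
    s = 0.0
    for i, r in enumerate(trades_r, start=1):
        z = (r - mu_bt) / sigma_bt       # standardise against the backtest
        s = max(0.0, s - z - k)          # accumulate downside drift only
        if s > h:
            return i                     # live results drifting below backtest
    return None
```

Feed it each closed trade's outcome as it arrives; the one-sided form only alarms on underperformance, which is what matters for decay detection.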

Two things I'm still figuring out:

  1. Threshold calibration — I'm using 4σ but it feels arbitrary. Has anyone tuned this for forex specifically? The tail behaviour is non-Gaussian enough that standard SPC assumptions feel shaky.

  2. Regime changes vs. genuine strategy decay — CUSUM fires on both. Any ideas how to tell them apart without waiting weeks?

Curious how others here handle this. Do you monitor per-trade deviation, or something else entirely?

Your approach is actually quite solid. Using CUSUM on trade outcomes is a much better early warning system than watching the equity curve, which is usually too noisy to detect small but persistent drift.

Regarding threshold calibration: instead of assuming something like 4σ, it's better to calibrate the threshold using the backtest trade distribution itself. One simple method is to run a Monte Carlo simulation on the backtest trades (shuffle or bootstrap the trade list thousands of times) and apply the same CUSUM logic to those sequences. Then measure how often the alarm triggers. This lets you pick a threshold based on a desired false-alarm rate (for example 1% or 5%) rather than relying on Gaussian assumptions, which rarely hold for trading returns.
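To make that concrete, here is one way the bootstrap calibration could look. It's a sketch under my own assumptions — the helper `max_cusum`, the slack `k=0.5`, and the defaults are all illustrative choices, not tuned values:

```python
import random

def max_cusum(seq, mu, sigma, k=0.5):
    """Peak of the one-sided lower CUSUM statistic over a trade sequence."""
    s = peak = 0.0
    for r in seq:
        s = max(0.0, s - (r - mu) / sigma - k)
        peak = max(peak, s)
    return peak

def calibrate_threshold(backtest_r, n_live, false_alarm=0.05,
                        n_boot=5000, seed=42):
    """Choose the CUSUM threshold h so that only `false_alarm` of
    bootstrapped backtest sequences of length n_live alarm by chance."""
    rng = random.Random(seed)
    mu = sum(backtest_r) / len(backtest_r)
    var = sum((r - mu) ** 2 for r in backtest_r) / (len(backtest_r) - 1)
    sigma = var ** 0.5
    # Null distribution of CUSUM peaks under "live behaves like the backtest"
    peaks = sorted(max_cusum(rng.choices(backtest_r, k=n_live), mu, sigma)
                   for _ in range(n_boot))
    return peaks[int((1 - false_alarm) * n_boot) - 1]
```

The returned threshold is just the (1 - false_alarm) quantile of CUSUM peaks over sequences resampled from the backtest, so no distributional assumption is needed.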

Also, it's usually better to measure outcomes in R (profit divided by risk per trade) instead of raw pips, since that keeps the distribution more stable across different volatility conditions.

For distinguishing regime changes vs strategy decay, one practical approach is running multiple monitors at different horizons — for example, a short window (~20 trades) and a longer window (~100+ trades). If only the short window fires, it's often just a temporary regime shift. If both the short and long monitors start drifting, that's a stronger indication the strategy edge may actually be degrading.
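A rough sketch of that two-horizon idea — the window sizes and the z threshold here are placeholders to be tuned per strategy, and the "regime"/"decay" labels are just my shorthand for short-only vs both-windows drift:

```python
from collections import deque

class MultiHorizonMonitor:
    """Short vs long rolling-window drift check against the backtest mean."""

    def __init__(self, mu_bt, sigma_bt, short=20, long=100, z_alert=2.0):
        self.mu, self.sigma, self.z = mu_bt, sigma_bt, z_alert
        self.short_win = deque(maxlen=short)
        self.long_win = deque(maxlen=long)

    def _drifting(self, win):
        if len(win) < win.maxlen:
            return False                       # window not yet full
        mean = sum(win) / len(win)
        se = self.sigma / len(win) ** 0.5      # std error of the window mean
        return (mean - self.mu) / se < -self.z

    def update(self, r):
        self.short_win.append(r)
        self.long_win.append(r)
        s = self._drifting(self.short_win)
        l = self._drifting(self.long_win)
        if s and l:
            return "decay"     # both horizons drifting: edge likely degrading
        if s:
            return "regime"    # short-only: plausibly a temporary regime shift
        return "ok"
```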

Another useful metric to monitor alongside trade results is MAE/MFE drift. If your maximum adverse excursion starts increasing compared to the backtest distribution while win rate stays similar, it often indicates a change in market conditions before the PnL degradation becomes obvious.
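A crude way to track that MAE drift — the window and factor below are placeholders (a percentile comparison against the full backtest MAE distribution would be more principled, but this shows the shape of it):

```python
from collections import deque

def mae_drift_alarm(live_mae, backtest_mae, window=30, factor=1.25):
    """Return the 1-based live-trade index where the rolling average MAE
    first exceeds the backtest average MAE by `factor`, else None."""
    bt_avg = sum(backtest_mae) / len(backtest_mae)
    win = deque(maxlen=window)
    for i, mae in enumerate(live_mae, start=1):
        win.append(mae)
        if len(win) == window and sum(win) / window > factor * bt_avg:
            return i
    return None
```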
 
avantikajain jain #:
Your approach is actually quite solid. Using CUSUM on trade outcomes is a much better early warning system than watching the equity curve, which is usually too noisy to detect small but persistent drift. [...]
Good additions. Monte Carlo on bootstrapped trade sequences for threshold calibration is cleaner than assuming normality — that's going into my next iteration.

The MAE/MFE drift point is sharp. Adverse excursion increasing before win rate drops is exactly the kind of leading indicator that CUSUM on PnL alone misses. That's monitoring the quality of the trade, not just the outcome.

Multi-horizon CUSUM (short vs long window) for separating regime shift from decay is practical. Short fires alone = weather. Both fire = climate change. Simple heuristic but effective.