Does the Currency Strength Meter Concept Survive Real Trading Costs?

13 June 2026, 18:53
Jan Kahlert

Currency strength meters are one of the most popular indicator concepts in retail forex. The pitch is always the same: rank the eight major currencies by momentum, buy the strongest against the weakest, profit. It sounds plausible — and it is almost never tested the way a hypothesis should be tested.

So I tested it. Not to build a product, but to answer one question honestly:

Does cross-sectional currency-strength momentum — strongest currency long against weakest short — have a net edge on intraday-to-daily horizons after realistic retail costs?

Spoiler: no. Not after costs, and — more interestingly — not even before costs. Below is the full setup and the numbers, so you can check the reasoning yourself.

Why this is a falsification study, not a backtest


Most "I backtested X" posts share a fatal flaw: the author keeps adjusting parameters until something looks good, then shows you the survivor. That is curve-fitting with extra steps.

This study was designed the other way around:

1. The entire hypothesis grid was frozen before the first result was computed. Every lookback, holding period, evaluation time and normalization scheme was declared up front. Nothing was added, removed, or refined afterwards. Ideas that came up during the analysis went into a notes file — untested.

2. Every configuration was compared against a random baseline with identical trade frequency and identical costs (100 random seeds per evaluation-time/holding-period combination). A configuration only counts if it clearly beats what random trading achieves under the same conditions.

3. Acceptance criteria were fixed in advance. A configuration qualifies as a candidate only if it meets all three: net profit factor ≥ 1.15 in-sample, positive net result in at least 6 of 8 in-sample years, and a net result above the 95% quantile of the random baseline.

4. An out-of-sample window (2023–today) was reserved and never touched. It would only have been opened to confirm in-sample candidates.

This design cannot produce a sellable strategy. It can only produce an honest answer.
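
The three acceptance criteria can be written down as a single predicate. What follows is a minimal sketch, not the study's actual code: the function names `profit_factor` and `is_candidate` and the per-trade inputs (`net_pips`, `years`) are assumptions made for illustration.

```python
import numpy as np

def profit_factor(pips):
    """Sum of winning trades divided by the absolute sum of losing trades."""
    pips = np.asarray(pips, dtype=float)
    gains = pips[pips > 0].sum()
    losses = -pips[pips < 0].sum()
    return float("inf") if losses == 0 else float(gains / losses)

def is_candidate(net_pips, years, baseline_q95,
                 min_pf=1.15, min_positive_years=6):
    """Apply the three pre-registered criteria to one configuration.

    net_pips     : net result of every in-sample trade, in pips
    years        : calendar year of every trade (same length as net_pips)
    baseline_q95 : 95% quantile of the random baseline's net result
    """
    net_pips = np.asarray(net_pips, dtype=float)
    years = np.asarray(years)
    yearly = [net_pips[years == y].sum() for y in np.unique(years)]
    positive_years = sum(total > 0 for total in yearly)
    return bool(profit_factor(net_pips) >= min_pf
                and positive_years >= min_positive_years
                and net_pips.sum() > baseline_q95)
```

All three conditions are joined with `and`, so a configuration that clears the profit-factor bar but misses the baseline quantile still fails.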

The setup

| Parameter | Value |
| --- | --- |
| Universe | 8 majors (USD, EUR, GBP, JPY, CHF, AUD, NZD, CAD) → 28 pairs |
| Data | M5 bars, broker feed, 2005–2026 (~42 million bars), cached locally |
| Data quality | Gap audit per pair; in-sample window 2015–2022 has < 1.8% missing intraweek bars on every pair |
| Strength definition | Equal-weight mean of each currency's 7 pair ROCs, sign-corrected (base +, quote −); computed both ATR-normalized and raw |
| Signal | At the evaluation time, go long the top currency against the bottom currency via the direct pair; one position at a time |
| Grid | 4 lookbacks (1h, 4h, 1d, 5d) × 4 holding periods (1h, 4h, 1d, 5d) × 3 evaluation times (London open, NY open, daily close) × 2 normalizations = 96 configurations |
| Execution | Signal on bar close, entry at the open of the next M5 bar; a hard assertion in the code enforces entry time > signal time (no lookahead) |
| In-sample | 2015–2022 |
| Out-of-sample | 2023+ (reserved, untouched) |

No stop loss, no take profit, no filters. Deliberately. The point is to measure whether the effect itself exists — exactly as the strength-meter concept is sold — not to engineer trade management around a non-effect.
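
For readers who want to see the grid enumerated, here is a short sketch of how the 96 configurations could be generated up front. The label strings (`london_open`, `atr`, and so on) and the constant names are illustrative assumptions; only the counts and the horizon lengths come from the setup above.

```python
from itertools import product

# Horizon labels mapped to M5 bar counts (60 / 5 = 12 bars per hour).
BARS_M5 = {"1h": 12, "4h": 48, "1d": 288, "5d": 1440}

LOOKBACKS = ("1h", "4h", "1d", "5d")
HOLDS = ("1h", "4h", "1d", "5d")
EVAL_TIMES = ("london_open", "ny_open", "daily_close")   # hypothetical labels
NORMALIZATIONS = ("atr", "raw")                          # hypothetical labels

# Freeze the grid as an immutable tuple before any result is computed.
GRID = tuple(product(LOOKBACKS, HOLDS, EVAL_TIMES, NORMALIZATIONS))
assert len(GRID) == 96
```

Storing the grid as a tuple and asserting its length is a cheap way to notice if a configuration is added or dropped later.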

Getting 42 million bars out of MetaTrader 5


The MetaTrader 5 Python API makes the data side surprisingly painless. The export script connects to a running terminal and only ever reads from it (it calls no order functions at all), pulls M5 history per pair with copy_rates_range, and caches everything as Parquet.

All later runs work purely on the cache; the terminal is never touched again.

from datetime import datetime, timezone
from pathlib import Path
import pandas as pd
import MetaTrader5 as mt5

TERMINAL = r"<PATH-TO-MT5>\terminal64.exe"
DATA = Path("data")

if not mt5.initialize(path=TERMINAL):  # read-only use: no order functions are called
    raise RuntimeError(f"MT5 initialize failed: {mt5.last_error()}")

def fetch_pair(symbol, start_year, end_year):
    """M5 bars year by year, one-day overlap + dedupe, broker time (tz-naive)."""
    mt5.symbol_select(symbol, True)
    frames = []
    for year in range(start_year, end_year + 1):
        start = datetime(year, 1, 1, tzinfo=timezone.utc)
        stop  = datetime(year + 1, 1, 2, tzinfo=timezone.utc)   # overlap -> dedupe
        rates = mt5.copy_rates_range(symbol, mt5.TIMEFRAME_M5, start, stop)
        if rates is not None and len(rates):
            frames.append(pd.DataFrame(rates))
    full = pd.concat(frames, ignore_index=True)
    full["time"] = pd.to_datetime(full["time"], unit="s")
    return full.drop_duplicates("time").sort_values("time").reset_index(drop=True)

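# canonical_pairs(), resolve_symbol(), `available` and `end_year` come from the
# full export script and are not shown in this excerpt.
for base, quote in canonical_pairs():          # 28 pairs from the 8 majors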
    symbol = resolve_symbol(base, quote, available)
    df = fetch_pair(symbol, 2005, end_year)
    cols = ["time", "open", "high", "low", "close", "tick_volume", "spread"]
    df[cols].to_parquet(DATA / f"{base}{quote}.parquet", index=False)

mt5.shutdown()

Two practical notes if you reproduce this. First, check your Python version: the MetaTrader5 package works with Python 3.12 but does not install on 3.14 at the time of writing. Second, never run two Python processes against the same terminal; the API silently returns empty results under contention.

Before anything else, the cache went through a gap audit: missing intraweek M5 slots per pair, per year, with weekends and the daily midnight rollover excluded as benign.

The in-sample window 2015–2022 came out below 1.8% missing bars on every one of the 28 pairs. If your data fails this kind of audit, fix the data before you compute a single statistic — every number downstream inherits its quality.

The coverage heatmap below makes this visible at a glance: the dark 2013 band is a broker history artifact that sits entirely before the test window, while coverage is uniformly dense from 2015 onward.


[Figure: coverage heatmap of M5 bars per pair and year]
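
The gap audit described above could be sketched roughly as follows. This is a simplified illustration, not the audit used in the study. It assumes bar timestamps aligned to the 5-minute grid, treats Saturday and Sunday (by the timestamps' own clock) as the weekend, and does not exclude the daily midnight rollover. The function name is hypothetical.

```python
import pandas as pd

def missing_share_by_year(times, freq="5min"):
    """Share of expected Monday-to-Friday M5 slots without a bar, per year."""
    times = pd.DatetimeIndex(times).unique().sort_values()
    # Every slot that could exist between the first and the last cached bar.
    expected = pd.date_range(times[0], times[-1], freq=freq)
    expected = expected[expected.dayofweek < 5]        # drop Sat/Sun slots
    present = pd.Series(expected.isin(times), index=expected)
    return 1.0 - present.groupby(present.index.year).mean()

# Usage: one trading week with 144 bars (12 hours) removed.
full = pd.date_range("2021-01-04", periods=288 * 5, freq="5min")
kept = full[:100].append(full[244:])
print(missing_share_by_year(kept))    # about 0.10 for 2021
```

Running this per pair and stacking the results gives the pair-by-year table that a coverage heatmap is drawn from.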

The strength definition is the one the indicator world actually uses, just written down explicitly. For each currency, take the rate of change of its 7 pairs over the lookback, flip the sign where the currency is the quote, and average:


import numpy as np
import pandas as pd

def true_range(high, low, close):
    prev = close.shift(1)
    hl, hc, lc = high - low, (high - prev).abs(), (low - prev).abs()
    return pd.DataFrame(np.fmax(np.fmax(hl, hc), lc),
                        index=close.index, columns=close.columns)

def momentum(close, lookback, scheme, high=None, low=None):
    """ROC per pair over `lookback` bars — raw or ATR-normalized."""
    diff = close - close.shift(lookback)
    if scheme == "raw":
        return diff / close.shift(lookback)
    atr = true_range(high, low, close).rolling(
        lookback, min_periods=max(2, lookback // 2)).mean()
    return diff / atr.replace(0.0, np.nan)              # ATR-normalized variant

def incidence_matrix(symbols):
    """Signed currency x pair matrix: +1 base, -1 quote, averaged per currency."""
    ccys = sorted({c for s in symbols for c in (s[:3], s[3:6])})
    M = pd.DataFrame(0.0, index=ccys, columns=symbols)
    for s in symbols:
        M.loc[s[:3], s] += 1.0       # base  -> +
        M.loc[s[3:6], s] -= 1.0      # quote -> -
    return M.div((M != 0).sum(axis=1), axis=0)          # equal-weight mean

def strength_frame(mom, symbols):
    """Strength per currency over time = incidence . momentum."""
    M = incidence_matrix(symbols)
    vals = mom[symbols].values @ M[symbols].values.T
    return pd.DataFrame(vals, index=mom.index, columns=M.index)


The ATR-normalized variant divides each pair's ROC by that pair's ATR over the same lookback before averaging, so that one volatile pair (hello, GBPNZD) cannot dominate a currency's basket. Both variants are part of the pre-registered grid — this is explicitly not a tuning knob.

The signal is then trivial: at each evaluation time, rank the eight currencies and go long the top against the bottom via their direct pair.
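
A minimal sketch of that ranking step, assuming plain six-letter symbols without broker suffixes and a strength snapshot passed as a dictionary. The function name `pick_trade` is made up for this example.

```python
def pick_trade(strength, symbols):
    """Return (symbol, direction) for long-strongest / short-weakest.

    strength : dict mapping currency code to its strength at evaluation time
    symbols  : iterable of tradable six-letter pair names, e.g. "EURJPY"
    direction: +1 = buy the pair, -1 = sell the pair
    """
    top = max(strength, key=strength.get)
    bottom = min(strength, key=strength.get)
    if top + bottom in symbols:
        return top + bottom, 1       # strongest is the base: buy the pair
    if bottom + top in symbols:
        return bottom + top, -1      # strongest is the quote: sell the pair
    raise KeyError(f"no direct pair for {top}/{bottom}")

print(pick_trade({"EUR": 0.02, "USD": 0.005, "JPY": -0.025},
                 ["EURUSD", "EURJPY", "USDJPY"]))   # ('EURJPY', 1)
```

The direction flip matters because market convention fixes which currency is the base: being long JPY against EUR means selling EURJPY.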

Costs: the part everyone skips


Costs were modeled from a live spread snapshot of my ECN account (raw spreads), plus slippage and commission:

- Per-pair raw spread (zero-spread snapshot artifacts floored at 0.1 pips)

- 0.3 pips slippage per side

- 4 USD per lot round-turn commission, converted to pips per pair via the actual pip value of the quote currency

- A stress variant with 2× spread was computed for every configuration, since a single snapshot understates the average spread

| Pair (examples) | Spread (pips) | Slippage RT (pips) | Commission RT (pips) | Total round-turn |
| --- | --- | --- | --- | --- |
| EURUSD | 0.10 | 0.6 | 0.40 | 1.10 pips |
| GBPJPY | 0.20 | 0.6 | 0.64 | 1.44 pips |
| USDJPY | 0.30 | 0.6 | 0.64 | 1.54 pips |
| AUDNZD | 0.70 | 0.6 | 0.69 | 1.99 pips (most expensive group) |


Across the 28 pairs, total round-turn costs land between roughly 1.0 and 2.1 pips (base case). That number is the hurdle every single trade has to clear before it earns a cent. If you have never converted your commission into pips, do it — it changes how you look at any intraday system.

The conversion is the part people get wrong, so here it is in code. A flat USD commission per lot translates into a different pip cost on every pair, because the pip value of one lot depends on the quote currency:


# Live spread snapshot in POINTS, per pair — edit for your own broker.
SPREAD_POINTS = {
    "EURUSD": 1, "GBPUSD": 1, "USDCHF": 0, "USDJPY": 3, "USDCAD": 0,
    "AUDUSD": 1, "AUDNZD": 7, "AUDCAD": 1, "AUDCHF": 4, "AUDJPY": 3,
    "CHFJPY": 3, "EURGBP": 2, "EURAUD": 0, "EURCHF": 1, "EURJPY": 1,
    "EURNZD": 0, "EURCAD": 1, "GBPCHF": 2, "GBPJPY": 2, "CADCHF": 4,
    "CADJPY": 0, "GBPAUD": 2, "GBPCAD": 9, "GBPNZD": 5, "NZDCAD": 2,
    "NZDCHF": 3, "NZDJPY": 5, "NZDUSD": 1,
}
POINTS_TO_PIPS = 0.1            # fractional (5-/3-digit) pricing
SPREAD_FLOOR_PIPS = 0.1        # replace 0-spread snapshot artifacts
SLIPPAGE_PIPS_PER_SIDE = 0.3
STRESS_SPREAD_MULT = 2.0       # stress variant = 2x spread
COMMISSION_USD_PER_LOT_ROUNDTURN = 4.0
CONTRACT_SIZE = 100_000        # 1 standard lot

def spread_pips(sym):
    pts = SPREAD_POINTS[sym]
    return SPREAD_FLOOR_PIPS if pts <= 0 else pts * POINTS_TO_PIPS

def commission_pips(pip_value_usd_per_lot):
    """Flat USD commission -> pips, different on every pair (the pip value of one
    lot depends on the quote currency). USD-quote ~10 USD/pip -> 0.4 pips RT."""
    return COMMISSION_USD_PER_LOT_ROUNDTURN / pip_value_usd_per_lot

def total_cost_pips(sym, pip_value_usd_per_lot, stress=False):
    sp = spread_pips(sym) * (STRESS_SPREAD_MULT if stress else 1.0)
    return sp + 2.0 * SLIPPAGE_PIPS_PER_SIDE + commission_pips(pip_value_usd_per_lot)
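
The cost functions take the pip value per lot as an input. One way to obtain it is sketched below. The helper names are hypothetical, and the USDJPY rate of 160 is an illustrative assumption (the article does not state the rate used), chosen because it reproduces the 0.64 pips commission shown in the table.

```python
LOT_SIZE = 100_000   # units of base currency in one standard lot

def pip_size(symbol):
    """0.01 for JPY-quoted pairs, 0.0001 for everything else."""
    return 0.01 if symbol[3:6] == "JPY" else 0.0001

def pip_value_usd_per_lot(symbol, usd_per_quote):
    """USD value of a one-pip move on one lot.

    usd_per_quote: USD price of one unit of the quote currency
                   (1.0 for USD-quoted pairs, 1 / USDJPY for JPY-quoted pairs).
    """
    return pip_size(symbol) * LOT_SIZE * usd_per_quote

def round_turn_pips(spread, pip_value_usd, slippage_per_side=0.3,
                    commission_usd=4.0):
    """Spread + slippage on both sides + commission converted to pips."""
    return spread + 2.0 * slippage_per_side + commission_usd / pip_value_usd

eurusd = round_turn_pips(0.1, pip_value_usd_per_lot("EURUSD", 1.0))
usdjpy = round_turn_pips(0.3, pip_value_usd_per_lot("USDJPY", 1.0 / 160.0))
print(round(eurusd, 2), round(usdjpy, 2))   # 1.1 1.54
```

Because the quote-currency rate moves over time, the commission in pips is not constant either; a single conversion rate is a simplification.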

A simulator that cannot look ahead


Lookahead bias is the most common way a backtest lies, and it usually sneaks in silently: a signal computed on a bar's close gets filled at that same bar's open.

The simulator here makes the mistake structurally impossible — the signal is computed on bar close, the fill happens at the open of the next M5 bar, and a hard assertion guards the invariant on every single trade:


def _next_valid(arr, pos, n):
    while pos < n and not np.isfinite(arr[pos]):   # skip missing bars of the pair
        pos += 1
    return pos if pos < n else None

# signal computed on the bar CLOSE at evaluation time `te`:
# entry = first bar strictly AFTER te with a valid open (no lookahead)
ep = _next_valid(open_arr, times.searchsorted(te, side="right"), n)
entry_time, entry_price = times[ep], open_arr[ep]
assert entry_time > te, f"LOOKAHEAD: entry {entry_time} <= signal {te}"

# exit = first bar at/after entry_time + holding period (exit by time, no SL/TP)
xp = _next_valid(open_arr, times.searchsorted(entry_time + holding_td, side="left"), n)
exit_time, exit_price = times[xp], open_arr[xp]

gross_pips = (exit_price - entry_price) / pip * direction
net_pips   = gross_pips - costs.total_cost_pips(symbol, pip_value_usd)

If the assertion can never fire, why keep it? Because code gets refactored, and an assertion is a tripwire that survives refactoring. It costs nothing, and it converts a silent corruption into a loud crash.

The engine itself was verified against a hand-calculated mini dataset (three currencies, three pairs, every intermediate number recomputed manually) before the first real run.
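
To give a feel for what such a hand check looks like, here is a toy three-currency example. The ROC values are invented for illustration and are not the dataset used in the study. The matrix construction mirrors the sign convention described earlier (base +, quote −, equal-weight mean).

```python
import numpy as np

symbols = ["EURUSD", "EURJPY", "USDJPY"]
roc = np.array([0.01, 0.03, 0.02])        # hypothetical ROC per pair

ccys = sorted({c for s in symbols for c in (s[:3], s[3:6])})
M = np.zeros((len(ccys), len(symbols)))
for j, s in enumerate(symbols):
    M[ccys.index(s[:3]), j] = 1.0          # base  -> +
    M[ccys.index(s[3:6]), j] = -1.0        # quote -> -
M /= (M != 0).sum(axis=1, keepdims=True)   # equal-weight mean per currency

strength = dict(zip(ccys, M @ roc))

# Hand calculation:
#   EUR = (+0.01 + 0.03) / 2 =  0.020
#   USD = (-0.01 + 0.02) / 2 =  0.005
#   JPY = (-0.03 - 0.02) / 2 = -0.025
assert np.isclose(strength["EUR"], 0.020)
assert np.isclose(strength["USD"], 0.005)
assert np.isclose(strength["JPY"], -0.025)
```

With these numbers EUR ranks first and JPY last, so the signal would be a long EURJPY position.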

The random baseline: the cheapest lie detector in quant research


Every configuration was compared against random trading under identical conditions: same entry timestamps, random pair, random direction, same holding period, same cost model — 100 seeds per evaluation-time/holding-period combination.

The question it answers: "If my signal contained zero information, what results would this exact trading pattern produce anyway?"


def random_signals(ev_times, symbols, seed):
    """Random pair + direction at each evaluation timestamp (deterministic)."""
    rng = np.random.default_rng(seed)
    pidx = rng.integers(0, len(symbols), size=len(ev_times))
    dirs = rng.choice([-1, 1], size=len(ev_times))
    return {ev_times[i]: (symbols[pidx[i]], int(dirs[i]))
            for i in range(len(ev_times))}

# 100 seeds per (evaluation time x holding period); identical entry times & costs
baseline_q95 = {}
for ev_name, evt in ev_times_map.items():
    seed_sigs = [random_signals(evt, symbols, s) for s in range(100)]
    for hold_name, hold_td in HOLDINGS.items():
        sums = np.array([
            run_config(open_df, high_df, low_df, close_df, pip,
                       eval_times=evt, holding_td=hold_td, signals=seed_sigs[s]
                       )["net_base_pips"].sum()
            for s in range(100)])
        baseline_q95[(ev_name, hold_name)] = np.quantile(sums, 0.95)

This matters because costs and drift create artifacts. With ~1.5 pips taken out of every trade, all baselines are deeply negative on average — so a strategy can be "profitable relative to random" while still losing money, or look impressive in absolute pips while doing nothing a dart-throwing monkey wouldn't.

Only the comparison against the baseline's 95% quantile separates signal from structure.


The result: 0 candidates out of 96


| Metric (in-sample 2015–2022) | Value |
| --- | --- |
| Configurations meeting all 3 criteria | 0 / 96 |
| Median gross profit factor | 0.949 |
| Median net profit factor (base costs) | 0.871 |
| Median net profit factor (2× spread stress) | 0.856 |
| Configurations with positive net result | 11 / 96 |
| Configurations beating the random baseline's 95% quantile | 5 / 96 (vs. ~4.8 expected by pure chance) |
| Median trades per configuration | ~2,070 |
| Median win rate | 49.2% |


The distribution of all 96 net profit factors sits almost entirely below 1.0, and not a single configuration reaches the pre-registered 1.15 threshold.

[Figure: distribution of net profit factors across the 96 configurations]

Two things in this table matter more than everything else.

First, the gross number. The median profit factor before any costs is 0.949. This is not the story of a real edge being eaten by spread. There is no edge to eat.

Buying the strongest currency against the weakest, as defined by its own momentum, performed slightly worse than a coin flip across 8 years and roughly 2,000 trades per configuration — before the broker took anything.

Second, the baseline comparison. Out of 96 configurations tested against the 95% quantile of a random-trading distribution, you expect about 4.8 to pass by pure chance (96 × 5%). Exactly 5 passed. The "winners" appear at precisely the rate that luck predicts.
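
The arithmetic behind that expectation is easy to check. The sketch below treats the 96 configurations as independent tests, which is a simplifying assumption: the configurations share data and overlapping parameters, so the true distribution is wider than the binomial one.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_configs, alpha = 96, 0.05
expected_passes = n_configs * alpha              # 96 x 5% = 4.8
p_five_or_more = prob_at_least(5, n_configs, alpha)

print(f"expected passes by chance: {expected_passes:.1f}")
print(f"P(at least 5 pass by chance): {p_five_or_more:.2f}")
```

Under this idealized model, seeing five or more passes is roughly a one-in-two event, so five passes carry no evidence of signal.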


The best configuration — and why it proves the point


Here are the top 5 of 96 by net profit factor:

| Configuration | Trades | Gross PF | Net PF (base) | Net PF (stress) | Positive years | Beats baseline q95 |
| --- | --- | --- | --- | --- | --- | --- |
| 4h lookback, 5d hold, London, ATR-norm. | 2,069 | 1.145 | 1.116 | 1.111 | 7 / 8 | yes |
| 4h lookback, 1d hold, London, raw | 2,073 | 1.167 | 1.110 | 1.099 | 6 / 8 | yes |
| 4h lookback, 1d hold, London, ATR-norm. | 2,073 | 1.147 | 1.090 | 1.080 | 5 / 8 | yes |
| 4h lookback, 5d hold, London, raw | 2,069 | 1.108 | 1.080 | 1.074 | 6 / 8 | yes |
| 1h lookback, 5d hold, NY, ATR-norm. | 2,070 | 1.094 | 1.064 | 1.059 | 5 / 8 | no |


The best one — 4h lookback, 5-day hold, evaluated at London open — reaches a net PF of 1.116 with 7 of 8 positive years and beats the random baseline. If I showed you only this row, it would look like a publishable strategy.

That is exactly the trap. With 96 parallel hypotheses, the best of the batch will almost always look decent. The question is whether it looks better than the best of 96 random tries would.

It doesn't: it misses the pre-registered PF threshold, its single losing year (2021, −2,089 pips) wipes out a large slice of the cumulative result, and the cluster it belongs to (4h/London) is the kind of pattern you find in any random grid after the fact.

Whenever someone shows you one great backtest, the first question to ask is: out of how many?
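
A small Monte Carlo makes the "out of how many?" point tangible. The sketch below is an illustration under strong assumptions, not a reproduction of the study's baseline: each of 96 hypothetical strategies has zero edge, zero costs, and independent, normally distributed trade outcomes. It reports the best and the median profit factor of the batch.

```python
import numpy as np

def best_pf_of_random_grid(n_configs=96, n_trades=2070, seed=0):
    """Best and median profit factor among zero-edge random strategies."""
    rng = np.random.default_rng(seed)
    pips = rng.standard_normal((n_configs, n_trades))   # zero-mean outcomes
    gains = np.where(pips > 0, pips, 0.0).sum(axis=1)
    losses = np.where(pips < 0, -pips, 0.0).sum(axis=1)
    pf = gains / losses
    return float(pf.max()), float(np.median(pf))

best, median = best_pf_of_random_grid()
print(f"best PF of 96 zero-edge strategies: {best:.3f} (median {median:.3f})")
```

The median stays near 1.0 while the best of the batch sits visibly above it, purely through selection. Real configurations are correlated, so the numbers will differ, but the mechanism is the same.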

For contrast, the bottom of the table: the short-horizon configurations (1h lookback, 1h hold) reach net profit factors as low as 0.25.

At ~1.5 pips round-turn cost against a typical 1-hour move, the math is unwinnable — the cost is a massive fraction of the expected move. This is why so many high-frequency retail systems that look fine on idealized spreads die instantly on real accounts.
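
The relationship between cost and move size can be made explicit with a deliberately crude model: every trade either gains or loses the same gross amount, and the round-turn cost is paid either way. The 10-pip and 100-pip move sizes below are hypothetical inputs, not measurements from the study.

```python
def breakeven_win_rate(move_pips, cost_pips):
    """Win rate needed to break even when trades end at +move or -move gross.

    Net win  = move - cost, net loss = move + cost, so
    p * (move - cost) = (1 - p) * (move + cost)  =>  p = 0.5 + cost / (2 * move)
    """
    if move_pips <= 0:
        raise ValueError("move_pips must be positive")
    return 0.5 + cost_pips / (2.0 * move_pips)

print(breakeven_win_rate(10.0, 1.5))    # 0.575
print(breakeven_win_rate(100.0, 1.5))   # 0.5075
```

In this model the same 1.5 pips demand a 57.5% hit rate on 10-pip moves but only about 50.8% on 100-pip moves, which is why short holding periods suffer most.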


[Figure: heatmap of net pips per configuration and year]

Net pips per configuration × year. The visual takeaway: no row is consistently green. The few green patches sit in different years for different configurations, which is what noise looks like.
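
A table like the one behind this heatmap can be built from a flat trade log with a single pivot. The column names (`config`, `entry_time`, `net_pips`) and the choice to assign each trade to the year of its entry are assumptions for this sketch.

```python
import pandas as pd

def yearly_net_pips(trades):
    """Configuration x year table of summed net pips."""
    with_year = trades.assign(year=trades["entry_time"].dt.year)
    return with_year.pivot_table(index="config", columns="year",
                                 values="net_pips", aggfunc="sum")

# Tiny made-up trade log to show the shape of the result.
trades = pd.DataFrame({
    "config": ["a", "a", "a", "b"],
    "entry_time": pd.to_datetime(
        ["2015-03-02", "2015-06-01", "2016-01-04", "2015-03-02"]),
    "net_pips": [10.0, -4.0, -3.0, 5.0],
})
table = yearly_net_pips(trades)
positive_years = (table > 0).sum(axis=1)   # input for the "6 of 8 years" check
print(table)
```

Counting the positive cells per row gives the "positive years" figure used in the acceptance criteria.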

What this result means — and what it doesn't


Falsified: top-vs-bottom currency-strength momentum on 1-hour to 5-day horizons, evaluated at session opens or daily close, with or without volatility normalization, after realistic ECN retail costs — across 96 pre-registered configurations, 2015–2022.

The effect does not exist gross, and costs only deepen the hole.

Not tested, and therefore not falsified: other mechanisms built on the same strength calculation (mean-reversion of strength extremes, strength divergence between correlated currencies, multi-week horizons, strength as a filter on an independent entry signal rather than as the signal itself).

Those are different hypotheses. If I ever test one, it gets its own pre-registered grid first — and there is still an untouched 2023+ out-of-sample window waiting for it.

The out-of-sample window was deliberately not opened for this study. With zero in-sample candidates, there was nothing to confirm, and running the near-misses through it "just to see" would have burned the only clean data for any post-hoc story the results might tempt me to tell.

Takeaways


1. Convert your costs into pips before you backtest anything intraday. 1.0–2.1 pips round-turn is the real hurdle on a raw-spread ECN account — and it is fatal to most short-horizon ideas on its own.

2. Count your hypotheses. If you test 96 variants, ~5 will beat a 95% significance bar by luck. Showing the best one is not evidence; it is selection.

3. A random baseline with identical costs is the cheapest lie detector there is. It costs a few lines of code, and in this study it showed that the "promising" configurations appear at exactly the rate chance predicts.

4. A clean negative is a result. This study closed a question I would otherwise have revisited every few months. That is worth more than another indicator on a chart.


Method details for reproduction: strength = sign-corrected equal-weight mean of a currency's 7 pair ROCs over the lookback (ATR-normalized variant divides each pair's ROC by its ATR over the same lookback); signals on M5 bar close, fills at the next bar's open; one position per configuration at a time; exits strictly by time; in-sample 2015–2022, ~2,070 trades per configuration; random baseline = 100 seeds with identical entry times, random pair and direction, identical cost model.

Disclosure: The research question, the hypothesis grid, and the acceptance criteria were defined by me and frozen before the first result was computed. The implementation (data export, simulation engine, cost model) and the draft of this article were built with Claude (Anthropic) as a coding and writing assistant, working phase by phase with manual sign-off gates.

Verification relied on mechanisms rather than trust: a unit-sanity suite checked against a hand-calculated mini dataset, a hard-coded lookahead assertion, deterministic seeding with reproduced runs, and a random baseline with identical costs. All artifacts (per-configuration results, yearly heatmaps, baseline distributions) are reported in the article exactly as produced.