MetaTrader 5 Machine Learning Blueprint (Part 18): Sequential Bootstrap, Corrected — Clone, Class Erasure, and the Comparison Toolkit

MetaTrader 5 — Trading systems | 29 June 2026, 14:49

Patrick Murimi Njoroge

Introduction

After the calibration step in ModelDevelopmentPipeline, a diagnostic check revealed that the out-of-fold Brier score of the sequential bootstrap arm was 0.2442, matching the standard bootstrap arm's 0.2441 to three decimal places. The two numbers should not be that close. Part 5 demonstrated, both analytically and through Monte Carlo simulation, that sequential bootstrap draws samples of higher average uniqueness than standard bootstrapping, and that this uniqueness advantage propagates into better out-of-bag discrimination. Near-identical out-of-fold Brier scores from two samplers that differ in their construction principle are not a coincidence; they are a symptom.

Tracing the symptom reveals two defects operating in sequence. The first is a type erasure: _apply_sequential_bagging transferred the fitted estimators from a SequentiallyBootstrappedBaggingClassifier into a plain BaggingClassifier shell, discarding the class identity that CalibratorCV would need to treat the sequential arm differently from the standard one. With the genuine class erased, CalibratorCV saw two standard BaggingClassifier pipelines and produced identical fold refits for both. The second defect is a shape mismatch that a naive removal of the erasure would expose: sklearn's clone() preserves constructor parameters — including the full samples_info_sets of length N — but each fold refit receives only n_train < N rows, causing the classifier's indicator-matrix shape check to raise a ValueError before any training begins.

Both defects are corrected in this article. We retain the genuine SequentiallyBootstrappedBaggingClassifier in the production pipeline. We also add _find_seq_bagging() to CalibratorCV.fit() to detect the sequential classifier and re-inject fold-sliced samples_info_sets before each refit. Finally, we ship a standalone bootstrap_comparison module to show the before-and-after difference at both the sampling and predictive levels. All source code for this series is in the GitHub repository.

Two Compounding Defects

The two defects are not independent. The first defect — the type erasure — is the cause; the second defect — the shape mismatch — is the consequence that prevents the simplest correction from working. Understanding the dependency is what determines the order of the fixes.

After _apply_sequential_bagging fit a SequentiallyBootstrappedBaggingClassifier, it transferred the fitted base estimators to a plain BaggingClassifier and returned that shell inside a Pipeline. The rationale was deployment convenience: the SequentiallyBootstrappedBaggingClassifier requires price_bars_index at fit time, and discarding that dependency after training made the saved artifact simpler. The consequence, however, was that every downstream consumer of best_model — including CalibratorCV — received a standard BaggingClassifier. The sequential character of the ensemble was invisible to the calibrator, and the calibrator's fold refits used standard uniform sampling for both the sequential and the standard arms.

Removing the type erasure (i.e., returning the SequentiallyBootstrappedBaggingClassifier from _apply_sequential_bagging) does not immediately fix calibration. When CalibratorCV clones the pipeline for a fold refit, sklearn's clone() reproduces all constructor parameters, including samples_info_sets, which at that point contains all N observations. The fold's training set contains n_train < N rows. When SequentiallyBootstrappedBaggingClassifier._fit() calls get_active_indices(samples_info_sets, price_bars_index), it builds an active-index map of length N, then checks that length against the n_train samples passed to fit(). The check raises a ValueError. The original CalibratorCV catches only TypeError, so the ValueError propagates and the calibration step aborts.

The correction must therefore do two things simultaneously: retain the genuine class so CalibratorCV can detect it, and then re-inject a properly row-sliced samples_info_sets before each fold refit so the shape check passes.

Defect 1 — Type Erasure in _apply_sequential_bagging

The original method trained a SequentiallyBootstrappedBaggingClassifier and then constructed a fresh BaggingClassifier, copied the fitted estimator list, features, and class metadata across, and returned that shell:

# Original — type erasure after sequential bootstrap fit
bag = apply_seq_bootstrap(
    X=X, y=y, estimator=MyPipeline(base_est.steps),
    n_estimators=int(bagging_n),
    max_samples=bagging_samples,   # read from model_params
    samples_info_sets=self.events["t1"],
    price_bars_index=self.bar_data.index,
    ...
)

standard_bag = BaggingClassifier(estimator=MyPipeline(base_est.steps), ...)
standard_bag.estimators_          = bag.estimators_
standard_bag.estimators_features_ = bag.estimators_features_
standard_bag.classes_             = bag.classes_
standard_bag.n_classes_           = bag.n_classes_

return Pipeline([("seq_bag", standard_bag)])   # SB type is gone

From this point forward, anything that received best_model (analyze_features, calibrate_model, ONNX export) saw a BaggingClassifier. The only entity that needed the original type for correct behavior was CalibratorCV, and it received a shell instead. The calibrator's fold refits trained a standard ensemble on both arms, produced practically indistinguishable OOF probabilities, and measured Brier scores that agreed to three decimal places.

Defect 2 — Why Removing the Erasure Alone Breaks

The fix must preserve the class identity. But sklearn's clone(estimator) builds a new, unfitted instance of the same class from the constructor parameters returned by get_params(deep=False), cloning nested estimators and deep-copying every other value. For SequentiallyBootstrappedBaggingClassifier, samples_info_sets is a constructor parameter, so clone() copies it in full (all N rows) into the cloned object. The fold refit then calls fold_clf.fit(X_train, y_train), where X_train has shape (n_train, n_features) and n_train < N.

Inside SequentiallyBootstrappedBaggingClassifier._fit(), the code builds an active-index map from the full samples_info_sets:

if self.active_indices_ is None:
    self.active_indices_ = get_active_indices(
        self.samples_info_sets, self.price_bars_index
    )

if len(self.active_indices_) != n_samples:
    raise ValueError(
        f"Indicator matrix shape {len(self.active_indices_)} "
        f"does not match number of samples {n_samples}"
    )

len(active_indices_) equals N (the full series length); n_samples equals n_train. The check raises a ValueError, which the original CalibratorCV's except TypeError clause does not catch. The calibration step crashes with no useful diagnostic.

The shape mismatch is not a defect in the shape check — that check is correct. The mismatch is a defect in how the full samples_info_sets is propagated into fold clones. The fix must supply a fold-sliced series instead of the full one.

Fix 1 — Retaining Classifier Identity in _apply_sequential_bagging

The corrected method removes the conversion step and returns the genuine SequentiallyBootstrappedBaggingClassifier directly:

# Corrected — SB classifier identity is retained
bag = apply_seq_bootstrap(
    X=X, y=y, estimator=MyPipeline(base_est.steps),
    n_estimators=int(bagging_n),
    max_samples=1.0,   # always full draw; see "The max_samples=1.0 Convention"
    samples_info_sets=self.events["t1"],
    price_bars_index=self.bar_data.index,
    ...
)

return Pipeline([("seq_bag", bag)])   # genuine SB type preserved

Inference — predict and predict_proba — is inherited from BaggingClassifier and requires only the fitted estimator list, which is present. The price_bars_index is not needed after training, so the deployment concern that motivated the original shell is resolved without giving anything up.

The max_samples=1.0 Convention

The original code read bagging_max_samples from model_params, which was typically set to the dataset's average uniqueness following AFML §6.2. That recommendation applies to a standard BaggingClassifier, where uniform sampling has no built-in mechanism for reducing overlap: limiting the sample to a fraction equal to the average uniqueness controls redundancy by drawing fewer observations. The sequential sampler already encodes a draw probability proportional to uniqueness at each step, so it corrects for overlap within the draw itself. Setting max_samples below 1.0 on a sequential ensemble applies the avgU remedy twice and produces samples smaller than necessary. The corrected method hardcodes max_samples=1.0 for the sequential arm; the standard arm continues to use the avgU fraction.

Fix 2 — _find_seq_bagging and Fold Re-injection in CalibratorCV.fit()

Detecting the Sequential Classifier

The detection helper performs a depth-one tree walk over the estimator passed to CalibratorCV. It returns the first SequentiallyBootstrappedBaggingClassifier it finds, or None if the estimator is a standard bagging variant:

def _find_seq_bagging(estimator):
    """
    Locate a SequentiallyBootstrappedBaggingClassifier inside estimator.

    Handles both a bare classifier and one nested as a step in a
    (My)Pipeline. Returns the instance if found, otherwise None.
    """
    if isinstance(estimator, SequentiallyBootstrappedBaggingClassifier):
        return estimator
    if hasattr(estimator, "steps"):
        for _, step in estimator.steps:
            if isinstance(step, SequentiallyBootstrappedBaggingClassifier):
                return step
    return None

The implementation uses isinstance because only one temporal sampler class exists today. A runtime-checkable Protocol can be introduced when additional sampler variants are added.
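For readers curious what the Protocol-based variant might look like, here is a minimal sketch using only the standard library. TemporalSampler, find_temporal_sampler, FakeSequentialBagger, and FakePipeline are hypothetical names invented for this illustration and are not part of the afml package. The sketch assumes that any future sampler-aware classifier would expose the same two attributes that the re-injection code touches.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TemporalSampler(Protocol):
    """Hypothetical structural type: any object carrying these two attributes."""

    samples_info_sets: Any
    active_indices_: Any


def find_temporal_sampler(estimator):
    """Same depth-one walk as _find_seq_bagging, but matching by structure."""
    if isinstance(estimator, TemporalSampler):
        return estimator
    for _, step in getattr(estimator, "steps", []):
        if isinstance(step, TemporalSampler):
            return step
    return None


class FakeSequentialBagger:
    """Stand-in for a sampler-aware classifier (illustration only)."""

    def __init__(self, samples_info_sets):
        self.samples_info_sets = samples_info_sets
        self.active_indices_ = None


class FakePipeline:
    """Stand-in exposing a sklearn-style .steps list."""

    def __init__(self, steps):
        self.steps = steps


plain = object()
sampler = FakeSequentialBagger(samples_info_sets=[3, 4, 5])

found_in_pipe = find_temporal_sampler(FakePipeline([("scale", plain), ("seq_bag", sampler)]))
found_in_plain = find_temporal_sampler(FakePipeline([("clf", plain)]))
```

Structural matching is looser than a concrete isinstance check: any object that happens to carry both attributes will match, which is one reason to keep the concrete check while only one sampler class exists.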

Fold-Level Re-injection

At the start of CalibratorCV.fit(), the detection runs once on the unfitted estimator. If an SB classifier is found, its samples_info_sets is captured as seq_t1 and validated against the full training set size:

seq_step = _find_seq_bagging(self.estimator)
if seq_step is not None:
    seq_t1 = getattr(seq_step, "samples_info_sets", None)
    if seq_t1 is None:
        raise ValueError(
            "The estimator contains a SequentiallyBootstrappedBagging-"
            "Classifier but samples_info_sets is not set."
        )
    if len(seq_t1) != n_samples:
        raise ValueError(
            f"samples_info_sets length ({len(seq_t1)}) does not match "
            f"the number of training rows ({n_samples})."
        )
else:
    seq_t1 = None

Inside the fold loop, the cloned estimator receives the fold-sliced samples_info_sets before fit() is called. active_indices_ is reset to None so the classifier recomputes the active-index map from the sliced series rather than retaining a map built on all N observations:

for train_idx, test_idx in self.cv_.split(X_, y_):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train = y[train_idx]
    sw_train = sample_weight[train_idx]

    fold_clf = clone(self.estimator)

    if seq_t1 is not None:
        fold_seq = _find_seq_bagging(fold_clf)
        fold_seq.samples_info_sets = seq_t1.iloc[train_idx]
        fold_seq.active_indices_ = None   # force recompute for this fold

    try:
        fold_clf.fit(X_train, y_train, sample_weight=sw_train)
    except TypeError:
        fold_clf.fit(X_train, y_train)

    oof_probs[test_idx] = fold_clf.predict_proba(X_test)[:, 1]

Phase 3 — the full-data refit that becomes CalibratorCV.estimator_ — receives the same treatment, this time with the unsliced seq_t1 because all N rows are used:

self.estimator_ = clone(self.estimator)
if seq_t1 is not None:
    full_seq = _find_seq_bagging(self.estimator_)
    full_seq.samples_info_sets = seq_t1
    full_seq.active_indices_ = None
self.estimator_.fit(X, y, sample_weight=sample_weight)

The fold-level re-injection means sequential bootstrap is genuinely applied on every cross-validation split, not just on the initial full-data training run. Without this, the OOF probabilities reflect standard bagging on all folds; the comparison is measuring nothing meaningful.
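The clone-then-re-inject mechanics can be reproduced on a toy estimator. The sketch below is not the afml implementation: ToySequentialClassifier is a hypothetical stand-in that keeps only the length check, and a NumPy array replaces the pandas Series used for samples_info_sets in the real pipeline (where the slice is taken with .iloc).

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone


class ToySequentialClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in: only the constructor parameter and shape check."""

    def __init__(self, samples_info_sets=None):
        self.samples_info_sets = samples_info_sets

    def fit(self, X, y):
        if len(self.samples_info_sets) != len(X):
            raise ValueError(
                f"samples_info_sets length {len(self.samples_info_sets)} "
                f"does not match number of samples {len(X)}"
            )
        self.classes_ = np.unique(y)
        return self


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)
t1 = np.arange(10) + 3                 # toy label end positions, length N = 10
train_idx = np.array([0, 1, 2, 3, 4, 5])

template = ToySequentialClassifier(samples_info_sets=t1)

# Naive fold refit: clone() carries the full-length parameter into the copy.
fold_clf = clone(template)
cloned_length = len(fold_clf.samples_info_sets)
try:
    fold_clf.fit(X[train_idx], y[train_idx])
    naive_error = None
except ValueError as exc:
    naive_error = str(exc)

# Re-injection: overwrite the parameter with the fold slice before fitting.
fold_clf = clone(template)
fold_clf.samples_info_sets = t1[train_idx]
fold_clf.fit(X[train_idx], y[train_idx])
sliced_length = len(fold_clf.samples_info_sets)
```

The first refit fails on the length check because the clone still holds all ten entries; the second succeeds because the slice matches the six training rows.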

OOB Metrics and Memory Diagnostics

The Phase-3 full-data refit stored in CalibratorCV.estimator_ is a pipeline whose last step is the fitted bagging classifier. The _extract_bagging_clf helper peels the MyPipeline wrapper to reach that estimator:

def _extract_bagging_clf(fitted_pipe):
    """Return the fitted bagging estimator from a pipeline-wrapped object."""
    if hasattr(fitted_pipe, "steps"):
        return fitted_pipe.steps[-1][1]
    return fitted_pipe

compute_custom_oob_metrics reconstructs out-of-bag predictions from estimators_samples_ (the standard path) or _estimators_samples (the sequential path) without requiring oob_score=True at fit time. It handles both BaggingClassifier and SequentiallyBootstrappedBaggingClassifier through the same interface. The function returns a dictionary containing f1, precision, recall, pwa, neg_log_loss, accuracy, coverage, and — for binary classification — auc.
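The idea behind reconstructing OOB predictions from stored sample indices can be sketched in plain Python. This is a simplified illustration, not compute_custom_oob_metrics itself: oob_majority_vote is a hypothetical helper, the "estimators" are plain callables, and hard majority votes stand in for the probability averaging a real implementation would likely use.

```python
from collections import Counter


def oob_majority_vote(n_obs, in_bag_indices, predict_fns, X):
    """Aggregate votes only from estimators that did not see each observation."""
    votes = [Counter() for _ in range(n_obs)]
    for in_bag, predict in zip(in_bag_indices, predict_fns):
        seen = set(in_bag)
        for i in range(n_obs):
            if i not in seen:                 # observation i is out-of-bag here
                votes[i][predict(X[i])] += 1
    preds = [v.most_common(1)[0][0] if v else None for v in votes]
    coverage = sum(p is not None for p in preds) / n_obs
    return preds, coverage


X = [0.1, 0.4, 0.6, 0.9]

# Each toy estimator is a threshold rule; in_bag lists the rows it was trained on.
predict_fns = [lambda x: int(x > 0.5)] * 3
in_bag_indices = [[0, 0, 1, 2], [1, 2, 3, 3], [0, 2, 2, 3]]

oob_preds, oob_coverage = oob_majority_vote(len(X), in_bag_indices, predict_fns, X)
```

Row 2 appears in every bag, so it receives no OOB vote and lowers coverage below 1.0; the coverage entry in the returned dictionary reports exactly this kind of gap.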

estimate_ensemble_size accumulates the shallow memory footprint of the estimator list and the stored sample-index arrays. It returns a value in megabytes. The estimate is a lower bound: it uses sys.getsizeof, which reports only the memory attributed directly to each object and not the objects it references. It still provides a consistent relative comparison across the two arms within the same run.
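Why a shallow estimate is a lower bound is easy to see on a small example. The helper shallow_size_mb below is hypothetical and only illustrates the behavior of sys.getsizeof; it is not the body of estimate_ensemble_size.

```python
import sys


def shallow_size_mb(objects):
    """Sum sys.getsizeof over the objects themselves, ignoring what they reference."""
    return sum(sys.getsizeof(obj) for obj in objects) / 1024**2


# A list holding one large payload: the list object is tiny, the payload is not.
payload = bytes(5 * 1024**2)          # 5 MiB buffer
container = [payload]

container_only = shallow_size_mb([container])             # list object alone
container_and_payload = shallow_size_mb([container, payload])
```

The container alone reports well under a kilobyte even though it keeps five megabytes alive, so a shallow sum understates real usage while still ranking two similarly built ensembles consistently.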

Both functions are called on cal.estimator_ after each CalibratorCV.fit() call in compare_predictions. The results populate two new fields in PredictionComparison: oob_metrics (a DataFrame, rows = metric names, columns = {"standard", "sequential"}) and memory_mb (a dict). Both appear in BootstrapComparison.summary() and in a dedicated OOB row of the comparison figure.

The bootstrap_comparison Module

The module provides three entry points: compare_sampling, compare_predictions, and the orchestrator compare_bootstrap_methods, which returns a BootstrapComparison object combining both levels. A convenience wrapper, compare_from_pipeline, pulls the required inputs directly from a fitted ModelDevelopmentPipeline.

Sampling-Level Comparison

compare_sampling is model-free and inexpensive. For n_repeats independent draws of length n_obs under each method, it records the average uniqueness of the drawn set and accumulates how often each observation is selected. On data with overlapping triple-barrier labels, sequential bootstrap consistently achieves higher average uniqueness because the indicator-matrix-weighted draw probability actively discourages selection of observations that cover bar ranges already represented in the current sample.
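The mechanism can be reproduced on a three-label toy problem with the standard library alone. The sketch below is not compare_sampling; all function names are hypothetical, labels are represented as inclusive (start_bar, end_bar) spans rather than a t1 Series, and the draw rule follows the textbook description of sequential bootstrap (draw probability proportional to the uniqueness a candidate would have if appended to the current sample).

```python
import random


def bar_concurrency(sample, spans, n_bars):
    """Count how many drawn observations are active on each bar."""
    counts = [0] * n_bars
    for obs in sample:
        start, end = spans[obs]
        for bar in range(start, end + 1):
            counts[bar] += 1
    return counts


def uniqueness(obs, counts, spans):
    """Mean of 1/concurrency over the bars spanned by one observation."""
    start, end = spans[obs]
    return sum(1.0 / counts[b] for b in range(start, end + 1)) / (end - start + 1)


def average_uniqueness(sample, spans, n_bars):
    counts = bar_concurrency(sample, spans, n_bars)
    return sum(uniqueness(o, counts, spans) for o in sample) / len(sample)


def draw_weights(sample, spans, n_bars):
    """Uniqueness each candidate would have if appended to the current sample."""
    weights = []
    for cand in range(len(spans)):
        counts = bar_concurrency(sample + [cand], spans, n_bars)
        weights.append(uniqueness(cand, counts, spans))
    return weights


def sequential_draw(spans, n_bars, rng):
    sample = []
    for _ in range(len(spans)):
        weights = draw_weights(sample, spans, n_bars)
        sample.append(rng.choices(range(len(spans)), weights=weights)[0])
    return sample


def standard_draw(spans, rng):
    return [rng.randrange(len(spans)) for _ in range(len(spans))]


# Three labels on six bars: the first two overlap on bar 2, the third is isolated.
spans = [(0, 2), (2, 3), (4, 5)]
n_bars, n_repeats = 6, 2000
rng = random.Random(0)

seq_mean = sum(
    average_uniqueness(sequential_draw(spans, n_bars, rng), spans, n_bars)
    for _ in range(n_repeats)
) / n_repeats
std_mean = sum(
    average_uniqueness(standard_draw(spans, rng), spans, n_bars)
    for _ in range(n_repeats)
) / n_repeats
```

After the second label has been drawn, draw_weights lowers the weight of drawing it again and of the label that overlaps it, and raises the relative weight of the isolated label; that is the "discouragement" described above.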

Predictive-Level Comparison

compare_predictions builds a pipeline for each arm, wraps it in CalibratorCV over PurgedKFold, and collects both OOF and OOB diagnostics. The sequential pipeline uses max_samples=1.0 (full draw, sampler corrects overlap internally); the standard pipeline uses the dataset average uniqueness as its per-estimator sample fraction, following AFML §6.2. Because CalibratorCV now re-injects samples_info_sets per fold, the sequential arm actually exercises sequential bootstrap on every split and every Phase-3 refit.

Usage

from sklearn.tree import DecisionTreeClassifier
from afml.ensemble import compare_bootstrap_methods

result = compare_bootstrap_methods(
    base_estimator=DecisionTreeClassifier(max_depth=4),
    X=features,
    y=events["bin"],
    samples_info_sets=events["t1"],
    price_bars_index=bars.index,
    n_estimators=100,
    n_splits=5,
    pct_embargo=0.01,
    sample_weight=events["tW"],
    n_repeats=200,
)

print(result.summary())
result.plot(save_path="bootstrap_comparison.png")

For a pipeline already trained with ModelDevelopmentPipeline:

from afml.ensemble import compare_from_pipeline

result = compare_from_pipeline(pipeline)
print(result.summary())

Because _apply_sequential_bagging now retains the genuine SequentiallyBootstrappedBaggingClassifier, compare_from_pipeline can unwrap pipeline.best_model, reach the SB classifier, read the base estimator from its .estimator attribute, and pass that estimator to the new comparison runs, without requiring samples_info_sets or price_bars_index to be explicitly re-supplied.

Results

The Before-and-After Brier Comparison

Figure 1 reproduces the production symptom on the left and the corrected output on the right. Before the fix, both arms produce a Brier score of 0.244x, agreeing to three decimal places and visually indistinguishable. After the fix, the sequential arm reaches 0.2387 against the standard arm's 0.2441, a difference of 0.0054 Brier points. Small in absolute terms, this gap is consistent with the uniqueness advantage measured in Part 5 and with the OOB accuracy gap of 2.3 percentage points seen in Table 1 of that article.

OOF Brier score before and after the fold-resampler fix

Figure 1. Two-panel illustration of the fold-resampler fix effect on OOF Brier score

Left panel (Before fix): Both the standard and sequential bootstrap arms produce an out-of-fold Brier score of 0.244x, indistinguishable to three decimal places, because type erasure caused both fold refits to use standard bagging.
Right panel (After fix): The sequential arm reaches 0.2387 while the standard arm holds at 0.2441, confirming that the sequential sampler is now exercised on every fold.
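As a reading aid for these numbers, the Brier score for binary labels is the mean squared difference between the predicted probability and the realized outcome. The function below is a minimal illustration with made-up inputs, not the brier_score shipped in afml.calibration.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


labels = [0, 1, 1, 0]

# A constant 0.5 forecast carries no information about the outcome.
uninformed = brier_score([0.5, 0.5, 0.5, 0.5], labels)

# Forecasts that lean the right way on every row score lower (better).
informative = brier_score([0.2, 0.8, 0.7, 0.4], labels)
```

A constant forecast of 0.5 scores exactly 0.25 on any binary dataset, which puts the 0.24 range in context: both arms sit only slightly below the uninformed baseline, so a gap of a few thousandths is a meaningful share of the available headroom.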

Sampling-Level Comparison

Figure 2 shows the compare_sampling output over 1,000 independent draws. The uniqueness distribution of the sequential bootstrap (blue) is shifted rightward relative to the standard distribution (red), with means of 0.70 and 0.60 respectively — reproducing the Monte Carlo result from Part 5. The per-observation selection rate panel (right) shows the standard bootstrap concentrated near the theoretical value of 0.632 — the probability that a given observation is selected at least once in a full-size draw with replacement — while the sequential distribution is wider, reflecting the uniqueness-weighted selection probabilities that the sequential sampler applies to favor low-concurrency observations.

Sampling comparison — standard vs. sequential bootstrap

Figure 2. Two-panel illustration of sampling-level comparison from compare_sampling (1,000 draws)

Uniqueness distribution (left): Sequential bootstrap draws have a mean average uniqueness of 0.70 against 0.60 for standard bootstrap. Dashed vertical lines mark the means; the sequential distribution is narrower, indicating more consistent de-duplication across draws.
Selection rate (right): Standard bootstrap concentrates near 0.632 (the theoretical rate for uniform with-replacement sampling). Sequential bootstrap's distribution is broader: observations that are concurrently active with many already-drawn observations are selected less often; isolated observations are selected more often.
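The 0.632 reference value follows from a short calculation: under uniform sampling with replacement, an observation is missed by all n draws with probability (1 - 1/n)^n, which tends to 1/e as n grows. The snippet below evaluates this for two sample sizes; selection_probability is a hypothetical helper written for this illustration.

```python
import math


def selection_probability(n_obs):
    """P(an observation appears at least once in n_obs uniform draws with replacement)."""
    return 1.0 - (1.0 - 1.0 / n_obs) ** n_obs


limit = 1.0 - math.exp(-1.0)          # large-n limit, approximately 0.632
p_49 = selection_probability(49)      # small sample: slightly above the limit
p_1000 = selection_probability(1000)  # close to the limit
```

For small samples the exact value sits slightly above the limit, so a standard-bootstrap histogram centered a little to the right of 0.632 is expected rather than suspicious.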

OOF and OOB Metrics Comparison

Figure 3 shows the OOF and OOB metrics from bootstrap_comparison on the EURUSD tick-M5 dataset (49 labeled events, 1,075 price bars). On this 49-observation subset, standard bootstrap edges out sequential on four of the five OOF metrics: Brier score (0.235 vs 0.244), ECE (0.093 vs 0.119), PWA (0.657 vs 0.602), and accuracy (0.611 vs 0.592). Sequential records a marginal improvement on OOF neg-log-loss (−0.668 vs −0.673), a difference of 0.005. This direction is consistent with the small sample: with 49 labeled events, the sequential sampler has limited room to avoid concurrently active observations within a single fold, and the 1.14× uniqueness advantage measured at the sampling level does not propagate reliably into better fold-level calibration at this scale.

The OOB panel tells a different story for one metric. Sequential bootstrap achieves an OOB AUC of 0.629 against the standard arm's 0.556; the gap of 0.073 is the largest separation in the figure. F1 and accuracy show the opposite ordering at smaller magnitudes (standard 0.621 vs sequential 0.610 for F1; standard 0.633 vs sequential 0.612 for accuracy). Coverage is identical at 1.000 for both arms, confirming that the sequential sampler does not systematically exclude observations from OOB estimation. The AUC advantage aligns with the recall-oriented effect that sequential bootstrap produces by sampling low-concurrency observations more frequently. Calibration metrics such as Brier score and ECE, which penalize errors in the predicted probabilities rather than errors in ranking, do not reflect this on a dataset of 49 observations, because the benefit of higher uniqueness materialises as ranking improvement before it shows up as better probability estimates.

Sequential vs. standard bootstrap — EURUSD tick-M5

Figure 3. Two-panel illustration of OOF and OOB metrics from bootstrap_comparison

OOF metrics (left): Standard bootstrap scores better on Brier score (0.235 vs 0.244) and ECE (0.093 vs 0.119), where lower is better, and on PWA (0.657 vs 0.602) and accuracy (0.611 vs 0.592), where higher is better. Sequential bootstrap returns a marginally better neg-log-loss (−0.668 vs −0.673). The 49-observation subset limits the uniqueness advantage from propagating into OOF calibration.
OOB metrics (right): Sequential bootstrap achieves the larger separation on AUC (0.629 vs 0.556), reflecting the ranking improvement that flows from sampling low-concurrency observations more often. F1 and accuracy favour standard bootstrap by smaller margins. Coverage is 1.000 for both arms, confirming that neither sampler systematically excludes observations from OOB estimation.

The afml.cache Refactoring

The caching infrastructure introduced in Part 6 accumulated three structural defects over successive iterations. At least three modules — backtest_cache, cv_cache, and incremental_bar_cache — contained separate key-generation implementations that diverged in their handling of DataFrames, sklearn estimators, and scipy distributions. The serialization strategy was inconsistent: some paths used pickle directly while others called joblib. And cache_monitoring.py had missing top-level imports (joblib.Memory, os, time) that caused an ImportError at every monitoring callsite. The March 2026 refactoring consolidates all of this into a single source of truth and adds two new capabilities: source-hash-based stale detection and dataset-access contamination tracking.

Unified Cache Core

unified_cache.py is the new core module. It exports UnifiedCacheKeyGenerator, the @cacheable decorator, CacheStats, and initialize_cache_system. The CACHE_DIRS dict is built from the AFML_CACHE environment variable when set, and falls back to appdirs.user_cache_dir("afml") otherwise. This is the same mechanism the pipeline already uses for Google Drive routing.

UnifiedCacheKeyGenerator.generate_key() handles every type in the AFML stack: sklearn BaseEstimator instances are hashed via get_params(deep=True), scipy frozen distributions via their args and kwds, DataFrames via shape, column list, DatetimeIndex endpoints, and a 1%-sample content hash, and primitives via direct string encoding. Two optional flags extend the key. time_aware=True scans bound arguments for the first DatetimeIndex parameter (t1, data, prices, index) and appends the date range; a run on January data cannot return the cached result of a March run. auto_versioning=True hashes the function's source code and closure values; any edit to the decorated function automatically produces a cache miss on the next call.

import pandas as pd

from afml.cache import cacheable, initialize_cache_system

initialize_cache_system()

# Invalidates automatically when source changes or date range shifts.
@cacheable(time_aware=True, auto_versioning=True)
def fit_model(data: pd.DataFrame, estimator, params: dict):
    # expensive refit — only runs on cache miss
    ...
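To make the effect of the two flags concrete, here is a rough standard-library sketch of how a key might change with the date range and with the function body. It is not UnifiedCacheKeyGenerator: make_cache_key and code_fingerprint are hypothetical, and the fingerprint hashes bytecode and constants instead of source text so that the example runs even where source files are unavailable.

```python
import hashlib
import json


def code_fingerprint(func):
    """Hash bytecode and constants; a stand-in for hashing the source text."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()


def make_cache_key(func, args, date_range=None, auto_versioning=False):
    """Hypothetical key: function name + arguments (+ date range, + code hash)."""
    parts = {"func": func.__qualname__, "args": repr(args)}
    if date_range is not None:            # analogue of time_aware=True
        parts["date_range"] = [str(d) for d in date_range]
    if auto_versioning:                   # analogue of auto_versioning=True
        parts["code"] = code_fingerprint(func)
    blob = json.dumps(parts, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()


def fit_v1(x):
    return x + 1


def fit_v2(x):
    return x + 2


# Pretend fit_v2 is fit_v1 after an edit: same name, different body.
fit_v2.__qualname__ = fit_v1.__qualname__

key_jan = make_cache_key(fit_v1, (1,), date_range=("2026-01-01", "2026-01-31"))
key_mar = make_cache_key(fit_v1, (1,), date_range=("2026-03-01", "2026-03-31"))

plain_v1 = make_cache_key(fit_v1, (1,))
plain_v2 = make_cache_key(fit_v2, (1,))
versioned_v1 = make_cache_key(fit_v1, (1,), auto_versioning=True)
versioned_v2 = make_cache_key(fit_v2, (1,), auto_versioning=True)
```

Without the code fingerprint the edited function collides with the original and would be served a stale result; with it, the edit produces a different key and therefore a cache miss.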

CacheStats tracks hits and misses per function in a defaultdict protected by a threading.Lock(). The stats dict is flushed to cache_stats.json every 25 calls so that progress survives a crash without incurring a write on every call. The CacheAnalyzer context manager snapshots stats on entry and reports the delta on exit, which makes it convenient for isolating the cache contribution of a specific code block in a notebook or script.
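The lock-plus-periodic-flush pattern can be sketched as follows. SimpleCacheStats is a hypothetical, stripped-down illustration and not the CacheStats class from unified_cache.py; the file name and the flush interval of 25 are taken from the description above.

```python
import json
import os
import tempfile
import threading
from collections import defaultdict


class SimpleCacheStats:
    """Hypothetical hit/miss tracker flushed to disk every `flush_every` calls."""

    def __init__(self, path, flush_every=25):
        self.path = path
        self.flush_every = flush_every
        self.flush_count = 0
        self._calls = 0
        self._lock = threading.Lock()
        self._stats = defaultdict(lambda: {"hits": 0, "misses": 0})

    def record(self, func_name, hit):
        with self._lock:
            self._stats[func_name]["hits" if hit else "misses"] += 1
            self._calls += 1
            if self._calls % self.flush_every == 0:
                self._flush()

    def _flush(self):
        # Called with the lock held, so the written snapshot is consistent.
        with open(self.path, "w") as fh:
            json.dump(self._stats, fh)
        self.flush_count += 1


stats_path = os.path.join(tempfile.mkdtemp(), "cache_stats.json")
stats = SimpleCacheStats(stats_path)


def worker():
    for i in range(50):
        stats.record("fit_model", hit=(i % 2 == 0))


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(stats_path) as fh:
    on_disk = json.load(fh)
```

With four threads recording 50 calls each, the counter reaches 200 and the file is rewritten eight times; at most 24 calls can be lost if the process dies between flushes.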

Incremental Bar Cache

IncrementalBarCache.get_or_extend() wraps make_bars with a five-case decision tree whose correctness rests on a single determinism invariant:

cached([t₀, t₁]) + incremental([t₁, t₂]) == from_scratch([t₀, t₂])

Only tick bars satisfy this invariant under the current make_bars implementation. Tick bar membership is positional (arange // bar_size), so the trailing n % bar_size ticks are a self-contained leftover: prepend them to new ticks and the boundary bar is reconstructed exactly. Volume and dollar bars do not satisfy the invariant because their boundaries fall on multiples of a cumulative metric that almost never coincides with the last complete tick; the residual offset would have to seed the extension's cumulative sum, but make_bars exposes no such seed. Time bars produce a duplicate bar for the clock bin that straddles the cache boundary. Information bars carry evolving EWM expectations that feed back into bar-closing decisions; extending them incrementally would need the full accumulator state threaded through, which is not yet implemented. For all non-tick bar types, get_or_extend falls back to a full recomputation, which is correct and cheap (a single cumsum/resample + groupby).
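The tick-bar case of the invariant can be checked on a handful of prices. The sketch below is a simplified illustration and not make_bars: tick_bars and extend_tick_bars are hypothetical helpers that build OHLC tuples by position and carry the incomplete trailing ticks forward as a leftover.

```python
def tick_bars(prices, bar_size):
    """Positional tick bars: returns (complete OHLC bars, leftover ticks)."""
    n_complete = len(prices) // bar_size
    bars = []
    for k in range(n_complete):
        chunk = prices[k * bar_size:(k + 1) * bar_size]
        bars.append((chunk[0], max(chunk), min(chunk), chunk[-1]))
    return bars, prices[n_complete * bar_size:]


def extend_tick_bars(cached_bars, leftover, new_prices, bar_size):
    """Prepend the cached leftover to the new ticks and append the resulting bars."""
    new_bars, new_leftover = tick_bars(leftover + new_prices, bar_size)
    return cached_bars + new_bars, new_leftover


old_ticks = [1.10, 1.12, 1.11, 1.13, 1.09, 1.14, 1.15]   # 7 ticks: 2 bars + 1 leftover
new_ticks = [1.16, 1.12, 1.10, 1.11, 1.17]               # 5 ticks arriving later

cached, leftover = tick_bars(old_ticks, bar_size=3)
incremental, tail = extend_tick_bars(cached, leftover, new_ticks, bar_size=3)
from_scratch, scratch_tail = tick_bars(old_ticks + new_ticks, bar_size=3)
```

The boundary bar is rebuilt from the single leftover tick plus the first two new ticks, which is exactly the bar a from-scratch run produces. No comparable leftover exists for volume or dollar bars, because their boundaries depend on a running cumulative sum rather than on position.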

from afml.cache import IncrementalBarCache
from pathlib import Path

cache = IncrementalBarCache(Path("bar_cache"))

# Dollar bars: not incrementally extendable, so new ticks trigger a full recompute.
bars = cache.get_or_extend(
    ticks_2021_2024, bar_type="dollar", bar_size=1_000_000, price="mid_price"
)

# Auto-calibrated dollar imbalance bars (information bar path).
bars = cache.get_or_extend(
    ticks_2021_2024, bar_type="dollar_imbalance", target_timeframe="M15"
)

Information bars accept two mutually exclusive calibration paths. target_timeframe auto-calibrates initial EWM threshold parameters from the target clock-time cadence (the recommended entry point). exp_ticks_init and exp_imbalance_init supply manual initial parameters. The two paths produce distinct cache keys by design: switching calibration mode produces a cache miss rather than returning bars built under different EWM assumptions. Thread safety is implemented via os.replace(), which is atomic on both POSIX and Windows; concurrent reads are safe and concurrent writes are last-writer-wins.
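The write-then-swap pattern behind os.replace() looks roughly like this. atomic_pickle_dump is a hypothetical helper for illustration; the actual persistence code in IncrementalBarCache may differ in file format and error handling.

```python
import os
import pickle
import tempfile


def atomic_pickle_dump(obj, target_path):
    """Write to a temp file in the same directory, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(target_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            pickle.dump(obj, fh)
            fh.flush()
            os.fsync(fh.fileno())
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_path, target_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


cache_dir = tempfile.mkdtemp()
bar_file = os.path.join(cache_dir, "bars.pkl")

atomic_pickle_dump({"n_bars": 10}, bar_file)
atomic_pickle_dump({"n_bars": 12}, bar_file)      # last writer wins

with open(bar_file, "rb") as fh:
    loaded = pickle.load(fh)
stray_temp_files = [f for f in os.listdir(cache_dir) if f.endswith(".tmp")]
```

Creating the temporary file in the target directory keeps both paths on the same filesystem, which os.replace() needs in order to perform a rename rather than fail.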

Cache Monitoring and Selective Cleaning

CacheMonitor consumes CacheStats and produces two outputs. get_efficiency_report() returns a DataFrame sorted by hit rate with per-function call counts, average computation times, and disk usage. analyze_cache_patterns() flags four categories: high-miss-rate functions (hit_rate < 0.5 and total_calls > 10), unused caches (last access more than seven days ago), oversized entries (cache_size_mb > 100), and optimization candidates (total_calls > 50 and hit_rate < 0.3). The joblib.Memory instance, which the previous version created at module import time, was the source of the ImportError; it is now a lazy property initialized on first access.

from afml.cache import print_cache_health, analyze_cache_patterns

# Print hit rate, top/worst performers, stale caches, recommendations.
print_cache_health(detailed=True)

# Programmatic access to flagged categories.
patterns = analyze_cache_patterns()
print(patterns["high_miss_rate_functions"])
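The four flagging rules are simple threshold checks, and the sketch below applies them to three made-up records. FuncStats and flag_patterns are hypothetical; the thresholds come from the description above, and only the high_miss_rate_functions key is taken from the article, the other three key names being assumptions.

```python
from dataclasses import dataclass


@dataclass
class FuncStats:
    """Hypothetical per-function record used only for this illustration."""

    name: str
    hits: int
    misses: int
    cache_size_mb: float
    days_since_access: float

    @property
    def total_calls(self):
        return self.hits + self.misses

    @property
    def hit_rate(self):
        return self.hits / self.total_calls if self.total_calls else 0.0


def flag_patterns(stats):
    """Apply the four threshold rules described in the text."""
    flags = {
        "high_miss_rate_functions": [],
        "unused_caches": [],            # assumed key name
        "oversized_caches": [],         # assumed key name
        "optimization_candidates": [],  # assumed key name
    }
    for s in stats:
        if s.hit_rate < 0.5 and s.total_calls > 10:
            flags["high_miss_rate_functions"].append(s.name)
        if s.days_since_access > 7:
            flags["unused_caches"].append(s.name)
        if s.cache_size_mb > 100:
            flags["oversized_caches"].append(s.name)
        if s.total_calls > 50 and s.hit_rate < 0.3:
            flags["optimization_candidates"].append(s.name)
    return flags


example_stats = [
    FuncStats("fit_model", hits=10, misses=90, cache_size_mb=20, days_since_access=1),
    FuncStats("make_bars", hits=40, misses=10, cache_size_mb=250, days_since_access=0),
    FuncStats("old_labels", hits=3, misses=2, cache_size_mb=1, days_since_access=30),
]
flags = flag_patterns(example_stats)
```

A function can land in more than one category: fit_model here is both a high-miss-rate function and an optimization candidate, since the second rule is a stricter version of the first.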

SelectiveCacheCleaner adds source-hash-based stale detection. FunctionTracker records an MD5 hash of each decorated function's source code and closure values. When the hash changes, the entry is flagged as stale and removed on the next call to clean_stale(). This makes the cache safe during active development: editing a cached function automatically causes re-execution on the next call without requiring a manual cache clear. Policy-based cleaning is also available: clean_old_entries(days=30) removes files older than 30 days, clean_large_files(max_size_mb=500) removes oversized entries, and clean_by_module("afml.backtest") targets a specific module. Full cache reset requires the user to type "yes" at a confirmation prompt.

Data Access Tracking

DataAccessTracker logs every dataset access with its temporal range, stated purpose, and the caller's file, function, and line number. analyze_contamination() counts non-excluded accesses per dataset and maps the count to one of four warning levels: CLEAN (0 accesses), ACCEPTABLE (1–2), WARNING (3–10), and CONTAMINATED (>10). The thresholds are calibrated to iterative ML development: a test set accessed more than twice during hyperparameter search is statistically compromised even if each access was labeled a different purpose. The accesses are written in append mode so a crash mid-session does not corrupt the log.

import pandas as pd

from afml.cache import log_data_access, print_contamination_report

# Call this once before each data load.
log_data_access(
    dataset_name="EURUSD_H1",
    start_date=pd.Timestamp("2020-01-01"),
    end_date=pd.Timestamp("2023-12-31"),
    purpose="optimize",   # "train" | "test" | "validate" | "optimize" | "analyze"
)

# At the end of an experiment, audit the full access history.
print_contamination_report()
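The mapping from access counts to warning levels is small enough to sketch directly. contamination_level and analyze_log are hypothetical helpers, and the "excluded" flag on a log entry is an assumed representation of what the article calls non-excluded accesses; the real DataAccessTracker may store this differently.

```python
def contamination_level(n_accesses):
    """Map a non-excluded access count to the four levels described above."""
    if n_accesses == 0:
        return "CLEAN"
    if n_accesses <= 2:
        return "ACCEPTABLE"
    if n_accesses <= 10:
        return "WARNING"
    return "CONTAMINATED"


def analyze_log(access_log):
    """Count accesses per dataset, skipping entries marked as excluded."""
    counts = {}
    for entry in access_log:
        counts.setdefault(entry["dataset"], 0)
        if not entry.get("excluded", False):
            counts[entry["dataset"]] += 1
    return {name: contamination_level(n) for name, n in counts.items()}


# Made-up log: three optimize passes over the test set, two training loads,
# and one excluded inspection of a holdout set.
access_log = (
    [{"dataset": "EURUSD_H1_test", "purpose": "optimize"}] * 3
    + [{"dataset": "EURUSD_H1_train", "purpose": "train"}] * 2
    + [{"dataset": "EURUSD_H1_holdout", "purpose": "analyze", "excluded": True}]
)
levels = analyze_log(access_log)
```

The third optimize pass is what moves the test set from ACCEPTABLE to WARNING, matching the rule that more than two accesses during a search already compromise it.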

Backtest, CV, and Startup Layers

BacktestCache and cv_cache are now thin wrappers over @cacheable. All key-generation and persistence logic that previously duplicated the core has been removed from both. cv_cacheable is a backwards-compatibility shim; existing decorated functions continue to work without modification. startup_script.py coordinates initialization in the correct order: it calls initialize_cache_system(), verifies port availability, runs a smoke-test of cache functionality, and only then starts the MQL5 bridge. Stale-cache cleanup can be toggled on startup by uncommenting one line in run_cache_startup().

Conclusion

Three changes are required to make the sequential bootstrap comparison scientifically valid in this pipeline. First, _apply_sequential_bagging must return the genuine SequentiallyBootstrappedBaggingClassifier rather than a plain BaggingClassifier shell; without this, downstream consumers cannot distinguish the sequential arm from the standard arm. Second, CalibratorCV.fit() must detect the SB classifier via _find_seq_bagging() and re-inject a fold-sliced samples_info_sets before each refit; without this, retaining the genuine class would produce a ValueError on the first fold. Third, max_samples on the sequential arm must be 1.0; the avgU-as-max_samples remedy from AFML §6.2 applies to uniform sampling only and would undersample the sequential ensemble unnecessarily.

The bootstrap_comparison module is the instrument that exposed all three defects: only by building a two-arm comparison that expected measurable separation between the arms did the identical Brier scores become visible as a defect rather than a result. The module is now part of the afml.ensemble package and can be applied to any pipeline that uses both sampler types, either from scratch via compare_bootstrap_methods or from a trained pipeline via compare_from_pipeline.

Attached Files

	File	Module	Description
1.	bootstrap_comparison.py	afml.ensemble	New comparison toolkit: compare_sampling, compare_predictions, compare_bootstrap_methods, compare_from_pipeline. Returns a BootstrapComparison with .summary() and .plot(). Exposes OOF Brier, ECE, log-loss, PWA, accuracy alongside OOB F1, AUC, coverage, and ensemble memory for both bootstrap arms.
2.	calibration.py	afml.calibration	Updated CalibratorCV.fit() with sequential-bootstrap awareness. New _find_seq_bagging() helper traverses the estimator tree; detected classifiers receive a fold-sliced samples_info_sets and a reset active_indices_ before each refit. Also contains fit_platt_scaling, analyze_calibrated_cross_val_scores, and calibration metrics (brier_score, expected_calibration_error, compute_reliability).
3.	model_development.py	afml.production	Updated ModelDevelopmentPipeline._apply_sequential_bagging(): returns the genuine SequentiallyBootstrappedBaggingClassifier rather than a plain BaggingClassifier shell, and hardcodes max_samples=1.0 for the sequential arm. Also contains the standalone apply_seq_bootstrap helper and the full ModelDevelopmentPipeline class.
4.	oob_metrics.py	afml.ensemble	compute_custom_oob_metrics reconstructs OOB predictions from estimators_samples_ without requiring oob_score=True; handles both RandomForestClassifier and BaggingClassifier variants including SequentiallyBootstrappedBaggingClassifier. estimate_ensemble_size returns a shallow memory estimate in MB.
5.	sb_bagging.py	afml.ensemble	SequentiallyBootstrappedBaggingClassifier and SequentiallyBootstrappedBaggingRegressor, both extending SequentiallyBootstrappedBaseBagging. Implements sequential bootstrap sampling via seq_bootstrap, parallel estimator construction, and OOB score computation without the standard bagging OOB infrastructure.
6.	unified_cache.py	afml.cache	New core module. Exports UnifiedCacheKeyGenerator (single key-generation source of truth for DataFrames, sklearn estimators, scipy distributions, and primitives), the @cacheable decorator with time_aware and auto_versioning flags, thread-safe CacheStats, CacheAnalyzer context manager, and initialize_cache_system. Respects the AFML_CACHE environment variable for cache-directory routing.
7.	cache_monitoring.py	afml.cache	Rewritten monitoring layer. CacheMonitor aggregates FunctionCacheStats dataclasses into a CacheHealthReport. get_efficiency_report() returns a per-function DataFrame sorted by hit rate. analyze_cache_patterns() flags high-miss-rate, unused, oversized, and optimization-candidate functions. Previously missing imports (joblib.Memory, os, time) are resolved via lazy loading.
8.	selective_cleaner.py	afml.cache	SelectiveCacheCleaner with policy-based cleaning: clean_stale() uses FunctionTracker source-code hashes to auto-invalidate entries when decorated functions change; clean_old_entries(days), clean_large_files(max_size_mb), clean_by_module(), and clean_by_function() apply size- and age-based policies. Full reset requires an explicit confirmation string.
9.	backtest_cache.py	afml.cache	Collapsed to a thin layer over @cacheable(time_aware=True, auto_versioning=True). Retains BacktestMetadata dataclass and manual save_result() for workflows that require explicit metadata tagging. All key-generation logic removed in favor of the unified core.
10.	cv_cache.py	afml.cache	Backwards-compatibility shim. cv_cacheable forwards to the unified @cacheable decorator so that existing decorated cross-validation functions require no modification. Provides a reference implementation of a @cacheable-decorated clf_hyper_fit function.
11.	data_access_tracker.py	afml.cache	New module. DataAccessTracker logs every dataset access with temporal range, purpose, and caller location to a CSV. analyze_contamination() maps access count to four warning levels: CLEAN (0), ACCEPTABLE (1–2), WARNING (3–10), CONTAMINATED (>10). The log is written in append mode for crash-safety. get_contamination_report() returns a summary DataFrame; print_contamination_report() formats it for console display.
12.	incremental_bar_cache.py	afml.cache	IncrementalBarCache.get_or_extend() wraps make_bars with a five-case decision tree. Enforces the determinism invariant for all ten AFML bar types; only tick bars support true incremental extension, while all others trigger a full recomputation. Information bars accept either a target_timeframe (auto-calibrated) or manual exp_ticks_init/exp_imbalance_init parameters; the two paths produce distinct cache keys. Thread safety via atomic os.replace().
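The access-count thresholds listed for data_access_tracker.py can be written out as a small mapping. This is a sketch of the rule as stated in the table, not the module's code; the function name contamination_level is hypothetical.

```python
def contamination_level(access_count: int) -> str:
    """Map a dataset access count to the four warning levels from the table."""
    if access_count < 0:
        raise ValueError("access_count must be non-negative")
    if access_count == 0:
        return "CLEAN"
    if access_count <= 2:
        return "ACCEPTABLE"
    if access_count <= 10:
        return "WARNING"
    return "CONTAMINATED"


levels = [contamination_level(n) for n in (0, 1, 2, 3, 10, 11)]
```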

Further Reading

López de Prado, M. (2018). Advances in Financial Machine Learning, Chapter 4 (Sequential Bootstrap), Chapter 6 (Ensemble Methods). John Wiley & Sons.
