Feature Engineering for ML (Part 5): Microstructural Features in Python

MetaTrader 5 — Trading systems | 10 June 2026, 11:48

Patrick Murimi Njoroge

Introduction

The preceding articles in this series treated time as a feature in its own right: fractional differentiation preserves memory across a stationary series, and cyclical encoding embeds the Fourier structure of the trading calendar into the feature matrix. Both operate on bar-level data. Microstructural features work differently. They treat each bar not as a single observation but as a compressed summary of many individual trades, and they ask what those trades reveal about the market's internal state at the moment the bar closed.

Chapter 19 of López de Prado's Advances in Financial Machine Learning surveys three generations of microstructure research. The first generation, typified by the Roll (1984) model, infers the effective bid-ask spread from the serial covariance of price changes alone, without needing quote data. The second generation introduces strategic trade models — Kyle's Lambda, Amihud's Lambda, and Hasbrouck's Lambda — that quantify price impact as a function of signed order flow. The third generation, represented by VPIN (Easley et al., 2012), estimates the probability that a trade is informed by comparing buy and sell volume within equal-volume buckets rather than equal-time windows.

The implementation lives in afml.features.microstructure. The primary entry point, compute_all_microfeatures(), requires only an OHLCV DataFrame and computes the full spread and impact suite using Numba-accelerated kernels. When raw tick data is also available, two additional functions — bar_microstructure_features() and vpin() — extend the output with per-bar imbalance statistics and the volume-synchronised informed-trading measure. All three paths converge on the same bar-indexed feature DataFrame that downstream models expect.

Before describing the implementation, one design decision deserves a mention upfront: the tick preprocessing strategy that makes Numba parallelization viable across both layers. It was the non-obvious constraint that drove the architecture, and it is described in detail in the section "The Tick Preprocessing Requirement".

What Microstructural Information Measures

Standard OHLCV features assume that price formation is exogenous: the market opens, trades happen, and the bar records the outcome. Microstructure theory takes the opposite view. It models the trading process itself as the mechanism through which private information enters prices. Two participants are assumed: an informed trader who knows the true asset value and acts on that knowledge, and an uninformed market maker who provides liquidity without that knowledge. The spread and the price impact coefficient are the equilibrium outcomes of this strategic interaction.

The tick rule is the earliest classification tool in this framework. It assigns a direction b_t ∈ {−1, +1} to each trade: +1 if the transaction price rose relative to the prior trade, −1 if it fell, and the previous direction if the price was unchanged. The raw series {b_t} is a feature in its own right, but the chapter identifies five productive transformations: a Kalman filter on the expected future direction E_t[b_{t+1}]; structural break detection on those predictions (covered in Part 7 of this series); entropy of the {b_t} sequence (Chapter 18, covered in the next article); t-values from the Wald-Wolfowitz runs test on {b_t}; and fractional differentiation of the cumulative series Σ_i b_i (Chapter 5, Part 1 of this series). All five transformations encode different aspects of how directional information accumulates within a bar or across bars.
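The tick rule is simple enough to sketch in a few lines of plain NumPy. The function name tick_rule and the initial_direction argument below are illustrative assumptions; this is not the module's internal _tick_rule(), only a readable reference for what such a classifier does.

```python
import numpy as np

def tick_rule(prices, initial_direction=1):
    """Classify each trade as buy (+1) or sell (-1) from price changes alone."""
    prices = np.asarray(prices, dtype=np.float64)
    b = np.empty(prices.shape[0], dtype=np.int8)
    prev = initial_direction              # assumed direction for the first trade
    for t in range(prices.shape[0]):
        if t > 0:
            dp = prices[t] - prices[t - 1]
            if dp > 0:
                prev = 1                  # uptick: buyer-initiated
            elif dp < 0:
                prev = -1                 # downtick: seller-initiated
            # dp == 0: carry the previous direction forward
        b[t] = prev
    return b

prices = [1.1000, 1.1001, 1.1001, 1.0999, 1.0999, 1.1002]
b = tick_rule(prices)
cum_b = np.cumsum(b)                      # the cumulative series used for fractional differentiation
print(b.tolist())                         # [1, 1, 1, -1, -1, 1]
print(cum_b.tolist())                     # [1, 2, 3, 2, 1, 2]
```

The choice of direction for the very first trade is arbitrary, which is why it is exposed as a parameter here.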

The Roll model formalized this intuition in 1984 by showing that, under the simplest model of a bid-ask spread, consecutive price changes must be negatively autocorrelated. A buy that hits the ask lifts the price; the next trade, reverting to the bid, depresses it. The covariance of consecutive price changes therefore encodes the half-spread. The formula is:

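$$c_t = \sqrt{\max\{0,\ -\operatorname{cov}[\Delta p_t,\ \Delta p_{t-1}]\}}, \qquad \text{Roll spread}_t = 2\,c_t$$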

The max operator is necessary because positive covariance (which violates the model's assumptions) appears frequently in empirical data, particularly for illiquid instruments or during trending periods. The implementation sets the spread to NaN rather than zero in that case, which is the correct signal: the Roll model does not apply when the covariance is positive.
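The Roll estimate can be illustrated with a short rolling loop. This is a sketch under stated assumptions, not the library's roll_measure(): the name roll_spread_sketch, the population covariance, and the exact window alignment are choices made here for exposition. Positive covariance yields NaN, as described above.

```python
import numpy as np

def roll_spread_sketch(close, window=20):
    """Rolling Roll spread: 2 * sqrt(-cov(dp_t, dp_{t-1})), NaN when cov >= 0."""
    close = np.asarray(close, dtype=np.float64)
    dp = np.diff(close)                       # dp[i] = close[i + 1] - close[i]
    out = np.full(close.shape[0], np.nan)
    for t in range(window + 1, close.shape[0]):
        cur = dp[t - window:t]                # the last `window` changes up to bar t
        lag = dp[t - window - 1:t - 1]        # the same changes lagged by one bar
        cov = np.mean((cur - cur.mean()) * (lag - lag.mean()))
        if cov < 0:                           # the model applies only to negative covariance
            out[t] = 2.0 * np.sqrt(-cov)
    return out

# Simulated bid-ask bounce: constant mid of 100, half-spread 0.01, random trade side.
rng = np.random.default_rng(0)
side = rng.choice([-1.0, 1.0], size=5000)
close = 100.0 + 0.01 * side
est = roll_spread_sketch(close, window=500)[-1]
print(est)                                    # should land near the true spread of 0.02
```

On a pure bounce process the estimate recovers the full spread; on a trending series the covariance turns positive and the output stays NaN.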

The strategic trade models of the second generation add order flow. Kyle (1985) showed that the price impact per unit of signed order flow, his lambda, measures the market maker's inference about whether an incoming order is informed. A high Kyle's Lambda means that a given quantity of buy volume moves the price by a large amount, implying that market makers suspect informed trading and widen their quotes. The Amihud (2002) ILLIQ ratio approximates the same concept from daily data without requiring a signed order flow classifier: it divides the absolute log return by dollar volume and averages over a window. Hasbrouck's (2009) lambda refines this by regressing log returns on signed square-root dollar volume, which produces a model more consistent with the theoretical price impact literature.

VPIN (Easley, López de Prado, and O'Hara, 2012) applies a different approach entirely. Rather than operating on price changes, it partitions the tick stream into equal-volume buckets and measures the imbalance between buyer-initiated and seller-initiated volume within each bucket. High VPIN indicates that trades are concentrated on one side of the market, which the model interprets as evidence of informed participation. VPIN attracted controversy after the 2010 Flash Crash, partly because the measure's predictive validity depends critically on how volume is classified as buyer- or seller-initiated, a problem the tick rule only approximates.

Architecture: Two Layers, One Decision

The implementation is divided into two distinct layers based on data availability. The first layer, exposed through compute_all_microfeatures(), requires only an OHLCV bar DataFrame. It computes all spread and impact measures that can be derived from bar-level close, high, low, and volume data. The second layer consists of bar_microstructure_features() and vpin(), both of which require the raw tick stream. They compute per-bar imbalance statistics and the volume-synchronised informed-trading measure by reprocessing the raw trades that compose each bar.

Figure 1. Architecture of the two-layer microstructure feature design

Left path (blue): compute_all_microfeatures() accepts a pre-aggregated OHLCV DataFrame and dispatches to seven Numba-accelerated kernel functions, each computing one feature family.
Right path (orange): bar_microstructure_features() and vpin() accept the raw tick DataFrame alongside the OHLCV bar DataFrame. Bar boundaries are located with a single np.searchsorted call; the resulting bar_starts and bar_ends index arrays are then passed into _bar_features_kernel, which computes all four imbalance statistics across all bars in one parallel pass.
Convergence node (green): Both paths produce a bar-indexed feature DataFrame that plugs directly into the ML pipeline built in earlier parts of the ML Blueprint series.
Numba badge: Inner loops in both layers are compiled with Numba. Kernels whose iterations are independent use @njit(parallel=True, cache=True); the prange loop in _bar_features_kernel executes bar-level imbalance computation across all bars simultaneously.

The choice between the two layers is straightforward. If tick data is unavailable — which is the default for most historical OHLCV datasets from MetaTrader 5 — call compute_all_microfeatures() with include_bar_features=False and include_vpin=False. If tick data is available from a broker feed or a saved tick CSV, pass it as tick_df to unlock the bar-level imbalance columns and VPIN. The column names are consistent across both paths for the shared features, so models trained on OHLCV data can be extended with tick features without retraining from scratch.

First Generation: Roll Spread and Corwin-Schultz

The Roll spread kernel operates purely on close prices. It computes the rolling covariance of consecutive price differences over a configurable window and maps the result to an effective spread estimate. The kernel is implemented as a bare @njit loop — no Pandas, no vectorized operations — because the rolling covariance is a doubly nested sum that vectorization would not simplify without materializing a large intermediate array.

from afml.features.microstructure import roll_measure, roll_impact

# Close prices and volume from an OHLCV DataFrame
spread = roll_measure(ohlcv["close"], window=20)
impact = roll_impact(ohlcv["close"], ohlcv["volume"], window=20)

roll_impact normalises the Roll spread by the contemporaneous dollar volume, close × volume. This produces a measure of spread per unit of dollar volume traded, which is more comparable across instruments than the raw spread in price units. Both functions return a pd.Series of float32 with the same index as the input.

The Corwin-Schultz (2012) estimator uses only high and low prices to infer both the bid-ask spread and the volatility. Its derivation rests on two observations. The high is almost always a buyer-initiated trade and the low a seller-initiated trade, so the squared log range (ln H/L)² of any interval contains both a variance component and a spread component. The variance component grows in proportion to the length of the interval, while the spread component does not. Comparing the sum of two single-period squared ranges (β) with the squared range of the combined two-period high and low (γ) therefore separates the two components. Solving the system yields:

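$$\beta_t = \mathbb{E}\left[\sum_{j=0}^{1}\left(\ln\frac{H_{t-j}}{L_{t-j}}\right)^{2}\right]$$

$$\gamma_t = \left(\ln\frac{H_{t-1,t}}{L_{t-1,t}}\right)^{2}$$

$$\alpha_t = \frac{\sqrt{2\beta_t}-\sqrt{\beta_t}}{3-2\sqrt{2}} - \sqrt{\frac{\gamma_t}{3-2\sqrt{2}}}$$

$$S_t = \frac{2\left(e^{\alpha_t}-1\right)}{1+e^{\alpha_t}}$$

Here H_{t-1,t} and L_{t-1,t} denote the high and low over the two bars t−1 and t combined.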

Negative alpha values, which occur when the variance component exceeds what the model can account for, are set to NaN. This happens more frequently than the original paper suggests when applied to foreign exchange data at sub-daily frequencies, where the high-low range is dominated by tick noise rather than genuine price discovery.
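The chain from β and γ to the spread can be sketched for a single pair of adjacent bars. The function name corwin_schultz_sketch is hypothetical, and the sketch makes one simplifying assumption: β is taken from each two-bar pair directly rather than averaged over a longer sample, so it is not a drop-in replacement for corwin_schultz_spread(). Negative α is mapped to NaN, following the convention described above.

```python
import numpy as np

def corwin_schultz_sketch(high, low):
    """Per-bar Corwin-Schultz spread from two adjacent bars (no beta averaging)."""
    high = np.asarray(high, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    n = high.shape[0]
    spread = np.full(n, np.nan)
    k = 3.0 - 2.0 * np.sqrt(2.0)
    hl2 = np.log(high / low) ** 2                  # squared log range of each bar
    for t in range(1, n):
        beta = hl2[t] + hl2[t - 1]                 # sum of the two single-bar ranges
        two_bar_high = max(high[t], high[t - 1])
        two_bar_low = min(low[t], low[t - 1])
        gamma = np.log(two_bar_high / two_bar_low) ** 2
        alpha = (np.sqrt(2.0 * beta) - np.sqrt(beta)) / k - np.sqrt(gamma / k)
        if alpha >= 0:                             # negative alpha is left as NaN
            spread[t] = 2.0 * (np.exp(alpha) - 1.0) / (1.0 + np.exp(alpha))
    return spread

# Two identical bars: the range does not widen over two bars, so all of it is read as spread.
flat = corwin_schultz_sketch([101.0, 101.0], [100.0, 100.0])
# A trending pair: the two-bar range is much wider than either bar, so alpha is negative.
trending = corwin_schultz_sketch([101.0, 103.0], [100.0, 102.0])
print(flat[1], trending[1])
```

The two toy cases show the estimator's logic at its extremes: a range that does not grow with the horizon is attributed entirely to the spread, and a range that grows faster than a random walk would allow falls outside the model.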

from afml.features.microstructure import corwin_schultz_spread

cs = corwin_schultz_spread(ohlcv["high"], ohlcv["low"])
# Returns a DataFrame with columns cs_spread and cs_sigma
print(cs.dtypes)  # cs_spread    float32, cs_sigma    float32

The Beckers-Parkinson high-low volatility estimator

Beckers (1983) showed that volatility estimators derived from the high-low price range are more accurate than close-to-close estimators. Parkinson (1980) demonstrated that for a geometric Brownian motion observed continuously,

$$\mathbb{E}\left[\frac{1}{T}\sum_{t=1}^{T}\left(\ln\frac{H_t}{L_t}\right)^{2}\right] = k_1\,\sigma_{HL}^{2}$$

where k1 = 4 ln 2. The Corwin-Schultz estimator builds directly on this result: volatility is embedded in the two-day high-low range γ, and it is subtracted out when deriving the spread estimate α. The cs_sigma column returned by corwin_schultz_spread() is therefore the Beckers-Parkinson volatility as a byproduct of the spread model. It is a bid-ask-adjusted intraday volatility estimate, and it is a more precise volatility feature than the close-to-close standard deviation for instruments where bid-ask bounce constitutes a significant fraction of the observed price range.
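As a worked instance of the Parkinson relation, the following sketch estimates volatility from the mean squared log range. The name parkinson_sigma is an assumption for illustration, and the function ignores the bid-ask adjustment that distinguishes cs_sigma from the plain Parkinson estimate.

```python
import numpy as np

def parkinson_sigma(high, low):
    """Parkinson volatility: sqrt(mean((ln H/L)^2) / (4 ln 2))."""
    high = np.asarray(high, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    k1 = 4.0 * np.log(2.0)
    return np.sqrt(np.mean(np.log(high / low) ** 2) / k1)

# Ten bars, each with a log range of exactly 0.02.
low = np.full(10, 100.0)
high = low * np.exp(0.02)
sigma = parkinson_sigma(high, low)
print(sigma)        # 0.02 / sqrt(4 ln 2), roughly 0.012
```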

Second Generation: Kyle, Amihud, and Hasbrouck Lambdas

All three second-generation estimators relate price changes to order flow. They differ in how they model the relationship. Kyle's Lambda uses the OLS slope from regressing price changes on signed volume, so it requires a tick direction classifier. Amihud's ILLIQ uses unsigned returns divided by dollar volume and requires no classifier. Hasbrouck's Lambda is a refinement of Kyle's that replaces linear signed volume with the signed square root of dollar volume, so it also depends on the classifier; the square-root form produces a better fit to the theoretical price impact curve derived from inventory models.

The rolling OLS computation for Kyle's Lambda is the most computationally intensive element in the bar-level layer. Each bar requires a window of w prior observations, and for each window the kernel accumulates the running sums Σx, Σy, Σx², and Σxy together with the observation count (Σy² is added for the t-statistic). The Numba JIT kernel processes these in a single pass:

# The Numba kernel (internal — shown for exposition)
# x = b_t * v_t   (signed volume),  y = Δp_t   (price change)
# OLS slope = (n·Σxy - Σx·Σy) / (n·Σx² - (Σx)²)

from afml.features.microstructure import kyle_lambda, amihud_lambda, hasbrouck_lambda

kl = kyle_lambda(ohlcv["close"], ohlcv["volume"], window=20)
# Returns DataFrame — columns: kyle_lambda (float32), kyle_lambda_t (float32)
al = amihud_lambda(ohlcv["close"], ohlcv["volume"], window=20)
# Returns Series  — column:  amihud_lambda (float32)
hl = hasbrouck_lambda(ohlcv["close"], ohlcv["volume"], window=20)
# Returns DataFrame — columns: hasbrouck_lambda (float32), hasbrouck_lambda_t (float32)

Each function accepts either a pd.Series or a NumPy array. When a Series is provided, the result carries the same index; when an array is provided, a zero-based integer index is used. The b parameter (tick direction) defaults to None, in which case the tick rule is applied automatically using the internal _tick_rule() function in the same module. Supplying a pre-computed b array avoids recomputing tick directions when calling the direction-dependent estimators in sequence.

The three lambdas are not interchangeable. Kyle's Lambda is sensitive to the quality of the tick rule classifier, because noise in the direction signal biases the OLS slope toward zero. Amihud's ILLIQ is robust to classification error but conflates illiquidity with volatility: a large absolute return and a small dollar volume both inflate the measure, regardless of whether the return was driven by a large informed order or by a random shock. Hasbrouck's Lambda occupies the middle ground: it is more robust than Kyle's to classification noise (because the square-root transformation compresses outliers) but retains the directional component that Amihud discards. For the feature pipeline, include all three and let the downstream feature importance step determine which is most predictive in the current regime.

The t-statistic as a companion feature

kyle_lambda() and hasbrouck_lambda() each return a two-column DataFrame: the OLS slope and its rolling t-statistic. The t-statistic captures something the raw lambda does not: the reliability of the estimate. During thin markets, a small number of trades can produce a large lambda simply because the denominator of the regression is near zero. The t-statistic will be close to zero in that case, signalling that the lambda estimate is underpowered. Including both the coefficient and its t-statistic as separate features gives the model a mechanism to discount lambda values that arise from sparse data.

The t-statistic is derived from the centered OLS residuals. With running sums Σx, Σy, Σx², Σxy, Σy², and n observations in the window:

# Centered quantities (with-intercept OLS, computed inside the Numba kernel)
# sxx_c = Σx² − (Σx)²/n     ← centered sum of squares for x
# sxy_c = Σxy − ΣxΣy/n      ← centered cross-product
# syy_c = Σy² − (Σy)²/n     ← centered sum of squares for y
#
# slope = sxy_c / sxx_c
# RSS   = syy_c − sxy_c² / sxx_c   (df = n − 2)
# s²    = RSS / (n − 2)
# t     = slope / √(s² / sxx_c)
#
# tstat → 0 when the window is thin or the regression is underpowered.
# tstat = NaN when RSS ≤ 0 (numerical guard) or n < 3.
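The centered-sum formulas above translate directly into code. The sketch below evaluates them for a single window in plain NumPy; the function name ols_slope_tstat is hypothetical, and the sums are recomputed from scratch rather than updated incrementally as a rolling kernel would do. The toy x and y values are made up for illustration.

```python
import numpy as np

def ols_slope_tstat(x, y):
    """With-intercept OLS slope and its t-statistic from raw sums over one window."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    n = x.shape[0]
    if n < 3:                                   # need at least one residual degree of freedom
        return np.nan, np.nan
    sx, sy = x.sum(), y.sum()
    sxx, sxy, syy = (x * x).sum(), (x * y).sum(), (y * y).sum()
    sxx_c = sxx - sx * sx / n                   # centered sum of squares for x
    sxy_c = sxy - sx * sy / n                   # centered cross-product
    syy_c = syy - sy * sy / n                   # centered sum of squares for y
    if sxx_c <= 0:                              # no variation in x: slope undefined
        return np.nan, np.nan
    slope = sxy_c / sxx_c
    rss = syy_c - sxy_c ** 2 / sxx_c
    if rss <= 0:                                # numerical guard
        return slope, np.nan
    s2 = rss / (n - 2)
    return slope, slope / np.sqrt(s2 / sxx_c)

# x plays the role of signed volume, y the price change.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
slope, tstat = ols_slope_tstat(x, y)
print(slope, tstat)
```

A rolling version would call this on each trailing window, or maintain the five sums incrementally by adding the newest observation and subtracting the oldest.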

Amihud's ILLIQ is a mean ratio rather than a regression slope, so a regression t-statistic is not applicable; it is returned as a plain Series. compute_all_microfeatures() includes all five columns — kyle_lambda, kyle_lambda_t, amihud_lambda, hasbrouck_lambda, hasbrouck_lambda_t — in the output DataFrame.
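For comparison with the regression-based lambdas, the mean-ratio form of Amihud's measure can be sketched as follows. The name amihud_illiq_sketch and the trailing-window alignment are assumptions made for this example, not the behaviour of amihud_lambda().

```python
import numpy as np

def amihud_illiq_sketch(close, volume, window=20):
    """Rolling mean of |log return| / dollar volume over a trailing window."""
    close = np.asarray(close, dtype=np.float64)
    volume = np.asarray(volume, dtype=np.float64)
    n = close.shape[0]
    ratio = np.full(n, np.nan)
    # The first bar has no return, so its ratio stays NaN.
    ratio[1:] = np.abs(np.diff(np.log(close))) / (close[1:] * volume[1:])
    out = np.full(n, np.nan)
    for t in range(window, n):
        out[t] = np.mean(ratio[t - window + 1:t + 1])
    return out

close = np.array([100.0, 110.0, 99.0, 108.9])
volume = np.array([10.0, 10.0, 20.0, 5.0])
illiq = amihud_illiq_sketch(close, volume, window=1)
print(illiq)
```

Because the numerator is an absolute return, the output is never negative, which is why no sign classifier is involved.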

Third Generation: VPIN

VPIN departs from the regression-based estimators in two ways. First, it operates on volume-time rather than bar-time: the tick stream is partitioned into buckets of equal total volume V*, not equal time intervals. Second, it does not model price directly; instead it measures order flow imbalance as a fraction of total bucket volume:

$$\mathrm{VPIN} = \frac{\sum_{\tau=1}^{n}\left|V_{\tau}^{B}-V_{\tau}^{S}\right|}{n\,V^{*}}$$

where the sum runs over a rolling window of n buckets, Vτ^B and Vτ^S are the buyer-initiated and seller-initiated volume within bucket τ, and V* is the target bucket size.

The implementation uses a two-pass Numba kernel. The first pass is sequential: it fills buckets tick by tick, accumulating buy and sell volume within each bucket. This pass cannot be parallelised because each tick's bucket assignment depends on how much volume the preceding ticks contributed. The second pass computes the rolling sum of absolute imbalances across bucket windows; this pass uses prange because each window is independent:

from afml.features.microstructure import vpin

vpin_series = vpin(
    volume=tick_df["volume"],
    close=tick_df["mid_price"],
    bucket_size=None,   # defaults to total_volume / (50 * n_buckets)
    n_buckets=50,
)
# Index is the tick index of each bucket window's last tick
# Forward-fill onto bar index before joining to OHLCV features

The default bucket_size targets 50 × n_buckets equally sized buckets across the full tick series, which matches the convention in the original paper. The result is indexed by tick rather than by bar, so it must be forward-filled before joining to the bar-level feature matrix. compute_all_microfeatures() handles this alignment automatically when tick_df is provided.
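The two passes described above can be sketched in pure Python and NumPy. This is an illustration under stated assumptions rather than the module's kernel: the name vpin_sketch is hypothetical, trade direction is supplied explicitly (for example from a tick rule), and a trade that overflows a bucket is split across buckets, which is one common convention.

```python
import numpy as np

def vpin_sketch(volume, direction, bucket_size, n_buckets):
    """VPIN over equal-volume buckets; returns one value per completed bucket."""
    imbalances = []                           # |V_B - V_S| for each completed bucket
    buy = sell = filled = 0.0
    # Pass 1 (sequential): fill buckets tick by tick.
    for v, b in zip(np.asarray(volume, dtype=np.float64), np.asarray(direction)):
        remaining = v
        while remaining > 0:
            take = min(remaining, bucket_size - filled)
            if b > 0:
                buy += take
            else:
                sell += take
            filled += take
            remaining -= take
            if filled >= bucket_size:         # bucket complete: record and reset
                imbalances.append(abs(buy - sell))
                buy = sell = filled = 0.0
    imb = np.asarray(imbalances, dtype=np.float64)
    # Pass 2 (independent windows): rolling sum of imbalances over n_buckets.
    out = np.full(imb.shape[0], np.nan)
    for i in range(n_buckets - 1, imb.shape[0]):
        out[i] = imb[i - n_buckets + 1:i + 1].sum() / (n_buckets * bucket_size)
    return out

one_sided = vpin_sketch(np.ones(40), np.ones(40), bucket_size=10, n_buckets=2)
balanced = vpin_sketch(np.ones(40), np.tile([1, -1], 20), bucket_size=10, n_buckets=2)
print(one_sided)    # NaN for the first bucket, then 1.0: all volume on one side
print(balanced)     # NaN for the first bucket, then 0.0: buys and sells cancel in every bucket
```

The two extreme cases bracket the measure: fully one-sided flow gives 1, and flow that balances inside every bucket gives 0.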

A qualification is warranted here. VPIN's predictive performance in live trading has been contested in the literature (Andersen and Bondarenko, 2014, showed that VPIN did not precede the Flash Crash as its proponents claimed). Its value in an ML pipeline is therefore not as a standalone signal but as a regime variable: high VPIN indicates order flow imbalance, which the model can combine with the lambda estimates and entropy features to characterise the current microstructural regime.

Bar-Level Imbalance Features

The bar_microstructure_features() function computes four per-bar imbalance statistics directly from the tick data that compose each bar. These are the same statistics used to define imbalance bars (covered in Part 1 of the "Beyond the Clock" series), applied here as features on already-formed bars rather than as bar-formation criteria.

from afml.features.microstructure import bar_microstructure_features

imb = bar_microstructure_features(tick_df, ohlcv_df)
# Returns DataFrame with columns:
#   tick_imbalance   : Σ b_t            (net signed tick count)
#   volume_imbalance : Σ b_t · v_t      (net signed volume)
#   dollar_imbalance : Σ b_t · v_t · p_t (net signed dollar flow)
#   buy_fraction     : fraction of buy ticks

The function maps each tick to its parent bar using np.searchsorted on the bar timestamp index, then dispatches to the _bar_features_kernel Numba kernel which processes all bars in parallel. Each bar is independent of the others once the tick-to-bar mapping is known, making this an embarrassingly parallel problem.

The mapping relies on the bar timestamp convention established in the alternative bars implementation: each bar's index timestamp is the last tick's time plus one microsecond. Ticks that fall within the half-open interval (prev_bar_time, bar_time] belong to that bar. This convention must be consistent between the bar-formation code and any downstream feature computation that rejoins tick data to bars.
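The mapping and the four statistics can be sketched with integer timestamps. The function name bar_imbalance_sketch is hypothetical and the loop is serial; it only illustrates the index arithmetic. Using side="right" makes each bar's right edge inclusive, matching the (prev_bar_time, bar_time] convention, and the sketch assumes every tick before the first bar time belongs to the first bar.

```python
import numpy as np

def bar_imbalance_sketch(tick_times, price, volume, direction, bar_times):
    """Locate each bar's tick slice once, then compute four imbalance statistics."""
    tick_times = np.asarray(tick_times)
    bar_times = np.asarray(bar_times)
    price = np.asarray(price, dtype=np.float64)
    volume = np.asarray(volume, dtype=np.float64)
    direction = np.asarray(direction, dtype=np.float64)
    # bar_ends[k] = number of ticks with time <= bar_times[k]
    bar_ends = np.searchsorted(tick_times, bar_times, side="right")
    bar_starts = np.concatenate(([0], bar_ends[:-1]))
    feats = np.full((bar_times.shape[0], 4), np.nan)
    for k in range(bar_times.shape[0]):       # each iteration touches only its own slice
        s, e = bar_starts[k], bar_ends[k]
        if e <= s:
            continue                          # empty bar: leave NaN
        b, v, p = direction[s:e], volume[s:e], price[s:e]
        feats[k, 0] = b.sum()                 # tick imbalance
        feats[k, 1] = (b * v).sum()           # volume imbalance
        feats[k, 2] = (b * v * p).sum()       # dollar imbalance
        feats[k, 3] = np.mean(b > 0)          # buy fraction
    return bar_starts, bar_ends, feats

tick_times = [10, 20, 30, 40, 50, 60]         # e.g. microseconds
bar_times = [31, 61]                          # last tick time of each bar plus one
starts, ends, feats = bar_imbalance_sketch(
    tick_times,
    price=[1.0, 1.1, 1.0, 0.9, 0.8, 1.0],
    volume=[2, 1, 3, 1, 1, 4],
    direction=[1, 1, -1, -1, -1, 1],
    bar_times=bar_times,
)
print(starts.tolist(), ends.tolist())         # [0, 3] [3, 6]
print(feats)
```

Because every iteration reads a disjoint slice and writes a distinct output row, the loop body is the kind of work that can be moved into a parallel loop without synchronisation.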

The Tick Preprocessing Requirement

The most consequential design decision in the microstructure module is not which features to compute but how bar boundaries are identified in the tick stream. The naive approach is to call np.searchsorted(bar_times, tick_times) every time a feature function needs to slice the ticks belonging to a particular bar. For a tick file with N ticks, B bars, and K feature functions, this incurs K independent O(N log B) binary-search passes over the tick array; for the full feature suite, K is seven or more.

Figure 2. Two-panel illustration of the tick preprocessing problem

Left panel: The naive approach calls searchsorted once per feature family. With seven feature functions, that is seven full scans of the tick array, each of O(N log B) complexity for N ticks and B bars. For a million-tick file, the overhead dominates the actual feature computation.
Right panel: The preprocessed approach performs a single np.searchsorted call to produce bar_starts and bar_ends index arrays, then passes them directly into _bar_features_kernel. All K features are computed in one parallel pass; no further scanning of the tick array occurs.

The design in bar_microstructure_features() solves this problem by consolidating all bar-level tick work into a single pass. A single np.searchsorted call maps every tick to its bar, producing integer arrays bar_starts and bar_ends that are computed once and shared across all four features. The Numba kernel _bar_features_kernel then receives those index arrays and computes tick imbalance, volume imbalance, dollar imbalance, and buy fraction for every bar simultaneously via prange:

from afml.features.microstructure import bar_microstructure_features

# Single searchsorted → bar_starts / bar_ends → one parallel kernel pass
imb = bar_microstructure_features(tick_df, ohlcv_df)
# Four float32 columns computed across all bars in one call:
#   tick_imbalance, volume_imbalance, dollar_imbalance, buy_fraction

The key property is that each bar's computation is independent of every other bar once the index arrays are known. This is what makes prange viable: there is no shared mutable state between iterations. Had the four features been implemented as separate functions each calling np.searchsorted internally, the tick array would be scanned four times for no benefit. Consolidating them into one kernel call avoids that overhead and is the pattern to follow when adding new per-bar tick features in future.

The implication for the data pipeline is that tick data must be retained alongside the bar DataFrame any time per-bar tick features are needed. If you are working with bars produced by a third-party library that does not provide access to the underlying ticks, the full per-bar feature set is unavailable; only the OHLCV-derived features from compute_all_microfeatures() can be computed.

Additional Features from Microstructural Datasets

Sections 19.3 through 19.5 cover features that market microstructure theory derives from first principles. Section 19.6 of the chapter takes a different approach: it catalogues observable patterns in microstructural data that are likely informative even without a theoretical derivation. These features are not currently implemented in afml.features.microstructure, but they represent the natural next layer for anyone with access to full order-book or FIX message data.

Distribution of order sizes

Easley et al. (2016) document that round-lot trade sizes — 5, 10, 25, 50, 100, 200, 500 — appear at frequencies far exceeding their neighbours. On E-mini S&P 500 futures, size 100 is 16.8 times more frequent than size 99 and size 500 is 57.1 times more frequent than size 499. This excess arises from human "GUI traders" who click buttons corresponding to round quantities. Algorithmic traders ("silicon traders") randomise their order sizes specifically to avoid leaving this footprint. A useful feature is therefore the ratio of round-lot volume to total volume in a window. A rising ratio suggests increasing human participation; a falling ratio suggests a market increasingly dominated by automated flow, which carries different informational content.
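A round-size volume ratio of this kind takes only a few lines. The sketch below is hypothetical (these features are not in the module): the function name is invented here, and the set of round sizes is simply the list quoted above, which a real implementation would tune per instrument.

```python
import numpy as np

# Round sizes taken from the list in the text; treat as a configurable assumption.
ROUND_SIZES = np.array([5, 10, 25, 50, 100, 200, 500])

def round_size_volume_ratio(trade_sizes):
    """Share of total traded volume that comes from round-sized trades."""
    sizes = np.asarray(trade_sizes, dtype=np.int64)
    total = sizes.sum()
    if total == 0:
        return np.nan
    is_round = np.isin(sizes, ROUND_SIZES)
    return sizes[is_round].sum() / total

ratio = round_size_volume_ratio([100, 99, 1, 50])
print(ratio)        # 0.6: 150 of 250 contracts traded in round sizes
```

Applied per bar or per rolling window, the ratio becomes a time series whose trend tracks the mix of manual and automated flow.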

Cancellation rates and predatory order flow

Eisler et al. (2012) and Easley et al. (2012) study the impact of market orders, limit orders, and cancellations on the bid-ask spread. High cancellation rates are associated with four categories of predatory algorithms: quote stuffers (overwhelm the exchange with messages to slow competitors), quote danglers (force squeezed traders to chase adverse prices), liquidity squeezers (trade in the same direction as a distressed large investor who is forced to unwind, draining available liquidity), and pack hunters (independently acting predators that spontaneously coordinate to trigger cascading effects). Each category leaves a distinct signature in the ratio of cancellations to executions and in the time distribution of order arrivals. Measuring the cancellation rate and the proportion of market orders in each time bar is a computationally inexpensive feature that captures aggregate predatory pressure.

TWAP execution algorithm detection

Easley et al. (2012) show that large institutional orders executed through TWAP algorithms produce a characteristic intra-minute volume distribution: volume concentrates at the beginning of each minute, consistently across hours of the day. Computing the order imbalance at the start of every minute and testing for a persistent component across consecutive minutes is a feature that can anticipate the remaining inventory of a large TWAP order. The feature is most informative near the open and close of major equity and futures markets, where TWAP execution is most common.

Options market features

Muravyev et al. (2013) and Cremers and Weinbaum (2010) find that option trades (as opposed to quotes) contain information not reflected in the underlying stock price. The put-call parity implied stock price — derived by inverting the no-arbitrage relationship using observed option trade prices — provides a signal about where informed traders are expressing directional views. Deviations between this implied price and the actual bid-ask range of the underlying tend to resolve in favour of the underlying, but the direction and magnitude of the deviation before resolution is informative. For ML pipelines that include equity options data, computing the volatility spread across strikes and its rate of change as a feature matrix is the practical implementation of this observation.

Serial correlation of signed order flow

Tóth et al. (2011) show that the signs of consecutive trades on London Stock Exchange stocks are positively autocorrelated for many days. On timescales below a few hours, this persistence is attributable primarily to order splitting by large institutional participants rather than to herding. A feature measuring the first-order autocorrelation of signed volume in a rolling window is therefore a proxy for the presence of a large order being worked in the market. This feature is complementary to VPIN: VPIN captures the magnitude of the imbalance, while signed-volume autocorrelation captures its temporal persistence.
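A rolling first-order autocorrelation of signed volume could look like the sketch below. The function name and the window convention are assumptions for illustration; the feature is not part of the module.

```python
import numpy as np

def signed_volume_autocorr(signed_volume, window=50):
    """Rolling lag-1 Pearson autocorrelation of signed volume (b_t * v_t)."""
    x = np.asarray(signed_volume, dtype=np.float64)
    n = x.shape[0]
    out = np.full(n, np.nan)
    for t in range(window - 1, n):
        w = x[t - window + 1:t + 1]
        a, b = w[:-1], w[1:]                  # each value paired with its successor
        da, db = a - a.mean(), b - b.mean()
        denom = np.sqrt((da * da).sum() * (db * db).sum())
        if denom > 0:                         # constant windows have no defined correlation
            out[t] = (da * db).sum() / denom
    return out

persistent = signed_volume_autocorr([1, 1, 1, 1, -1, -1, -1, -1], window=8)[-1]
alternating = signed_volume_autocorr([1, -1, 1, -1, 1, -1], window=6)[-1]
print(persistent, alternating)                # 0.75 and -1.0
```

Runs of same-signed flow push the value toward +1, which is the footprint expected from a large order being split into child orders.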

What Is Microstructural Information?

Section 19.7 is the chapter's theoretical contribution that is most frequently overlooked. López de Prado argues that the microstructure literature uses the word "information" without a precise definition, and proposes a formal quantitative measure grounded in signal processing and statistical learning theory.

The argument proceeds in six steps. First, construct a feature matrix X = {Xt} from all available microstructural features — VPIN, Kyle's lambda, cancellation rates, and so on. Second, assign labels yt ∈ {0, 1} indicating whether each observation resulted in a market-making profit (1) or loss (0), using the triple-barrier method from Chapter 3. Third, fit a classifier on the training set (X, y). Fourth, as new out-of-sample observations arrive at time τ > T, compute the cross-entropy loss Lτ of the classifier's predictions. Fifth, fit a kernel density estimator on the array of negative cross-entropy losses {−Lt}. Sixth, estimate the microstructural information at time τ as:

$$\phi_\tau = F\left[-L_\tau\right]$$

where F is the cumulative distribution function of the KDE. The result φτ ∈ (0, 1) measures how predictable the market maker's environment is at time τ. Under normal conditions, the classifier's cross-entropy loss is low — market makers can forecast their own adverse selection risk with reasonable accuracy, and φτ is high. When an informed trader is present, the cross-entropy loss rises — the market maker's model becomes uninformed — and φτ falls toward zero.
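The last two steps, fitting a KDE on past negative losses and mapping the newest loss through its CDF, can be sketched without external dependencies beyond NumPy. The function names, the Gaussian kernel, and the Silverman-style default bandwidth are all assumptions made here; the chapter does not prescribe them. The loss history is made-up illustrative data.

```python
import math
import numpy as np

def cross_entropy_loss(y_true, p_pred, eps=1e-12):
    """Binary cross-entropy of one prediction; p_pred is the predicted P(y = 1)."""
    p = min(max(float(p_pred), eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def microstructural_information(past_losses, latest_loss, bandwidth=None):
    """phi = F(-latest_loss), with F the CDF of a Gaussian KDE fitted on -past_losses."""
    neg = -np.asarray(past_losses, dtype=np.float64)
    if bandwidth is None:
        # Rule-of-thumb bandwidth; an assumption, not part of the chapter's recipe.
        bandwidth = 1.06 * neg.std(ddof=1) * neg.shape[0] ** (-0.2)
    # The CDF of a Gaussian KDE is the average of normal CDFs centred on each sample.
    z = (-latest_loss - neg) / (bandwidth * math.sqrt(2.0))
    return float(np.mean([0.5 * (1.0 + math.erf(v)) for v in z]))

past_losses = [0.1, 0.2, 0.3, 0.4, 0.5]       # illustrative out-of-sample losses
phi_calm = microstructural_information(past_losses, latest_loss=0.05)
phi_stress = microstructural_information(past_losses, latest_loss=2.0)
print(phi_calm, phi_stress)                   # high when the loss is low, near zero when it spikes
```

A loss far above anything in the history maps to a value close to zero, which is the regime signal the definition is designed to produce.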

This definition of microstructural information has a direct connection to the 2010 Flash Crash. Market makers kept providing liquidity at exceedingly tight spreads because their models assigned a low probability to adverse selection. The cross-entropy loss of those predictions was rising throughout the cascade, but without a mechanism to measure and monitor it in real time, market makers could not widen their quotes in response. The implication is that φτ — estimated from any sufficiently rich feature matrix — should be included in every market-making and execution model as a regime indicator.

For the ML feature pipeline, φτ is not a standalone prediction target but an additional input feature: a scalar that summarises the current microstructural regime as measured by the predictability of the market-making problem. It is computed offline by retraining the classifier periodically on the most recent labelled window, evaluating its out-of-sample cross-entropy, fitting the KDE on a rolling buffer of those losses, and mapping the latest loss through the CDF. This workflow is a natural extension of the CPCV backtesting framework developed in Part 16 of the ML Blueprint series, where the meta-labeling step already produces the classifier and the labelled outcome series needed to build it.

Results: Features on Synthetic Market Data

The figure below shows all four feature families computed on 250 synthetic EURUSD-like bars with two embedded high-volatility regimes (shaded). The synthetic data uses a geometric random walk with a multiplicative volatility factor of 2.5 in the first regime and 1.8 in the second.

Figure 3. 4-panel illustration of microstructure features on 250 synthetic EURUSD-like bars

Panel (a): The synthetic price series, a geometric random walk, with high and low bands. The two orange-shaded regions mark the elevated-volatility regimes injected into the data generator.
Panel (b): Roll spread (solid, blue) and Corwin-Schultz spread (dashed, green), both in basis points. The Corwin-Schultz estimate is visibly more volatile because it is computed from adjacent-bar high-low ranges rather than a rolling covariance window. Both measures spike during the shaded regimes.
Panel (c): Kyle's Lambda (normalised, purple, left axis) and Amihud's ILLIQ (orange, right axis). Kyle's Lambda shows sign changes because the synthetic tick rule, derived from a random walk, produces noisy direction signals that intermittently reverse the OLS slope. Amihud's ILLIQ remains non-negative by construction and rises in the first high-volatility regime where volume also increases.
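Panel (d): VPIN, bounded in [0, 1]. On the symmetric random walk, VPIN hovers near 0.50 with no sustained drift. Perfectly balanced buy and sell volume within every bucket would give a VPIN of 0, so this level reflects random per-bucket imbalance in the synthetic ticks rather than persistent one-sided flow. The measure departs from this baseline on real directional data where sustained order-flow imbalance is present.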

The panel (c) result — Kyle's Lambda showing negative values on synthetic data — is worth examining directly. Under the Kyle model, the slope of price changes on signed volume should be positive: buy pressure pushes prices up and sell pressure pushes them down. On a random walk without serial correlation in the tick rule, the OLS slope is centred near zero and can flip negative for many windows. This is not a bug in the implementation; it is the expected behaviour of a regression-based estimator on data that does not exhibit the information asymmetry the model was designed to detect. On real tick data from an equity or forex market, Kyle's Lambda is typically positive and increases during periods of thin liquidity.

Conclusion

This article implemented the full microstructure feature suite from AFML Chapter 19 in afml.features.microstructure. The primary entry point compute_all_microfeatures() requires only OHLCV data. When tick data is available, bar_microstructure_features() and vpin() extend the feature matrix with per-bar imbalance statistics and the volume-synchronised informed-trading measure. The key design constraint is that bar boundaries must be resolved in a single np.searchsorted pass, with the resulting index arrays shared across all per-bar computations; this is what makes prange parallelization across bars viable.

The implemented feature families span three generations of microstructure research: Roll (1984) and Corwin-Schultz (2012) spread and volatility estimators; Kyle's Lambda, Amihud's ILLIQ, and Hasbrouck's Lambda; and VPIN. Kyle's Lambda and Hasbrouck's Lambda each return both the OLS slope and its rolling t-statistic, giving downstream models a signal for when a lambda estimate is underpowered. Section 19.6 of the chapter identifies five additional feature families — order size distribution, cancellation rates, TWAP detection, options market features, and serial correlation of signed order flow — that require full order-book or FIX data and are not yet in afml.features.microstructure. They represent the natural next implementation layer for anyone with exchange-grade data access.

Section 19.7 introduced the most theoretically grounded definition of microstructural information in the chapter: the cross-entropy loss of a market maker's prediction model, mapped through a KDE-fitted CDF to produce φτ ∈ (0, 1). This scalar summarises the current regime in terms of adverse selection risk and is a natural companion feature for the CPCV-validated models built in the ML Blueprint series.
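A minimal sketch of that mapping is given below, under several assumptions that are not taken from the article's code. The market maker's predictions are simulated, the per-bar loss is the mean binary cross-entropy over a fixed number of observations, the KDE uses a Gaussian kernel with a normal-reference rule-of-thumb bandwidth, and the CDF is evaluated at the negative loss so that a large loss maps to a small φ. The sign convention in particular is an assumption; the text above states only that the loss is mapped through the fitted CDF.

```python
import math
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Per-observation binary cross-entropy loss."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def gaussian_kde_cdf(samples, x):
    """CDF of a Gaussian KDE fitted on `samples`, evaluated at `x`."""
    samples = np.asarray(samples, dtype=float)
    x = np.atleast_1d(np.asarray(x, dtype=float))
    # Normal-reference rule-of-thumb bandwidth (an assumed choice).
    h = 1.06 * samples.std(ddof=1) * len(samples) ** (-0.2)
    z = (x[:, None] - samples[None, :]) / h
    erf = np.vectorize(math.erf)
    # Average of the Gaussian kernel CDFs centred on each sample.
    return (0.5 * (1.0 + erf(z / math.sqrt(2.0)))).mean(axis=1)

# Hypothetical data: binary outcomes and simulated model probabilities.
rng = np.random.default_rng(2)
n_bars, per_bar = 100, 20
y = rng.integers(0, 2, size=n_bars * per_bar).astype(float)
p = np.clip(0.5 + rng.normal(0.0, 0.15, size=y.size), 0.05, 0.95)

loss = cross_entropy(y, p).reshape(n_bars, per_bar).mean(axis=1)
phi = gaussian_kde_cdf(-loss, -loss)  # assumed convention: CDF of negative loss
```

Each evaluation point is also one of the KDE's own samples, so every value of phi lies strictly inside (0, 1), and the bar with the largest loss receives the smallest phi.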

The next article will port the bar-level feature layer to MQL5, implementing the Roll, Corwin-Schultz, and impact measure kernels as a reusable include file that any Expert Advisor can link against at runtime.

References

López de Prado, M. (2018). Advances in Financial Machine Learning. Wiley. Chapter 19.
Roll, R. (1984). A Simple Implicit Measure of the Effective Bid-Ask Spread in an Efficient Market. Journal of Finance, 39(4), 1127–1139.
Parkinson, M. (1980). The Extreme Value Method for Estimating the Variance of the Rate of Return. Journal of Business, 53, 61–65.
Beckers, S. (1983). Variances of Security Price Returns Based on High, Low, and Closing Prices. Journal of Business, 56, 97–112.
Corwin, S. A., & Schultz, P. (2012). A Simple Way to Estimate Bid-Ask Spreads from Daily High and Low Prices. Journal of Finance, 67(2), 719–760.
Kyle, A. S. (1985). Continuous Auctions and Insider Trading. Econometrica, 53(6), 1315–1335.
Amihud, Y. (2002). Illiquidity and stock returns: cross-section and time-series effects. Journal of Financial Markets, 5(1), 31–56.
Hasbrouck, J. (2009). Trading costs and returns for U.S. equities: Estimating effective costs from daily data. Journal of Finance, 64(3), 1445–1477.
Easley, D., Kiefer, N., O'Hara, M., & Paperman, J. (1996). Liquidity, Information, and Infrequently Traded Stocks. Journal of Finance, 51(4), 1405–1436.
Easley, D., López de Prado, M., & O'Hara, M. (2011). The Microstructure of the Flash Crash. Journal of Portfolio Management, 37(2), 118–128.
Easley, D., López de Prado, M., & O'Hara, M. (2012). Flow Toxicity and Liquidity in a High-frequency World. Review of Financial Studies, 25(5), 1457–1493.
Easley, D., López de Prado, M., & O'Hara, M. (2016). Discerning information from trade data. Journal of Financial Economics, 120(2), 269–286.
Andersen, T. G., & Bondarenko, O. (2014). VPIN and the Flash Crash. Journal of Financial Markets, 17, 1–46.
Eisler, Z., Bouchaud, J., & Kockelkoren, J. (2012). The price impact of order book events: market orders, limit orders and cancellations. Quantitative Finance, 12(9), 1395–1419.
Tóth, B., Palit, I., Lillo, F., & Farmer, J. (2011). Why is order flow so persistent? Working paper. arXiv:1108.1632.
Muravyev, D., Pearson, N., & Broussard, J. (2013). Is there price discovery in equity options? Journal of Financial Economics, 107(2), 259–283.
Cremers, M., & Weinbaum, D. (2010). Deviations from Put-Call Parity and Stock Return Predictability. Journal of Financial and Quantitative Analysis, 45(2), 335–367.
O'Hara, M. (1995). Market Microstructure Theory. Blackwell.
Hasbrouck, J. (2007). Empirical Market Microstructure. Oxford University Press.

Attached files

microstructure.py (31.48 KB)
