Discussing the article: "Statistical Arbitrage Through Cointegrated Stocks (Final): Data Analysis with Specialized Database"
Check out the new article: Statistical Arbitrage Through Cointegrated Stocks (Final): Data Analysis with Specialized Database.
The article shows how to pair SQLite (OLTP) with DuckDB (OLAP) for statistical arbitrage data processing. DuckDB’s columnar engine, ASOF JOIN, and array functions accelerate core tasks such as quote–trade alignment and the Rolling Windows Eigenvector Comparison (RWEC), with measured speedups from 2x to 23x over SQLite on larger inputs. You get simpler queries and faster analytics while keeping trade execution in SQLite.
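To make the quote–trade alignment concrete: an ASOF JOIN attaches, to each trade, the most recent quote at or before the trade's timestamp. In DuckDB this is a single `ASOF JOIN` clause; the sketch below reproduces the same logic in plain Python (function name and tuple layout are illustrative, not from the article).

```python
def asof_join(trades, quotes):
    """For each (ts, price) trade, attach the latest (ts, bid, ask) quote
    whose timestamp is <= the trade timestamp -- the alignment that
    DuckDB's ASOF JOIN performs in one clause."""
    quotes = sorted(quotes, key=lambda q: q[0])
    joined = []
    i = -1  # index of the last quote known to be at or before the trade
    for ts, price in sorted(trades, key=lambda t: t[0]):
        # Advance through quotes that are still at or before this trade.
        while i + 1 < len(quotes) and quotes[i + 1][0] <= ts:
            i += 1
        bid, ask = (quotes[i][1], quotes[i][2]) if i >= 0 else (None, None)
        joined.append((ts, price, bid, ask))
    return joined
```

In DuckDB SQL, the equivalent query would read roughly `SELECT ... FROM trades ASOF JOIN quotes ON trades.ts >= quotes.ts`, with the engine doing the "latest earlier row" matching for you.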
Any trading activity requires data. At a minimum, we need the asset's current price to enter a buy or sell operation. Usually, we want a bit more: at least some price history is useful for insight into how the asset price has moved over time. Soon we start calculating maxima and minima, ranges, averages, average true ranges, volume-weighted averages, and price history views. With more data, we can better understand how the price arrived at its current value. With enough data, we may even speculate on where the price could go in the following hours, days, or months. We may use the price data to build and read candle patterns, or to create more or less complex indicators that ease price history visualization. It is not uncommon, for a single asset, to find ourselves sourcing historical price data spanning dozens of months.
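A few of the statistics listed above are simple enough to sketch directly; the helper names below are illustrative, not from the article.

```python
def sma(closes, n):
    """Simple moving average of the last n closing prices."""
    return sum(closes[-n:]) / n

def vwap(prices, volumes):
    """Volume-weighted average price over the series."""
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)

def true_range(high, low, prev_close):
    """True range: the largest of the three classic spans."""
    return max(high - low, abs(high - prev_close), abs(low - prev_close))

def atr(bars, n=14):
    """Average true range over the last n bars; bars are (high, low, close)."""
    trs = [true_range(h, l, bars[i - 1][2])
           for i, (h, l, c) in enumerate(bars) if i > 0]
    return sum(trs[-n:]) / min(n, len(trs))
```

Each of these needs only a window of history, which is why even a single-asset strategy ends up keeping months of bars around.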
When it comes to pairs trading, we need at least double the data. We must calculate the mean spread continuously, and we need still more historical data to check whether the pair relationship is strong enough and still holds. Maybe we also want a dynamic standard deviation threshold based on volatility. So besides requiring more data, we also require more computational power: loading the data for both symbols, then calculating the spread, the mean, the standard deviation threshold, and now the volatility, all in real time.
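The spread/mean/threshold loop described above can be condensed into a rolling z-score of the spread; a common convention (assumed here, not quoted from the article) is to trade when |z| crosses a threshold such as 2.

```python
from statistics import mean, stdev

def spread_zscore(prices_a, prices_b, hedge_ratio, window):
    """Rolling z-score of the pair spread a - hedge_ratio * b.
    Returns one z value per fully-formed window; a large |z| suggests
    the spread has strayed from its rolling mean."""
    spread = [a - hedge_ratio * b for a, b in zip(prices_a, prices_b)]
    zs = []
    for i in range(window, len(spread) + 1):
        w = spread[i - window:i]
        m, s = mean(w), stdev(w)
        zs.append((spread[i - 1] - m) / s if s > 0 else 0.0)
    return zs
```

Note that every new tick forces a recomputation of the window statistics for the pair, which is exactly the extra real-time load the paragraph above describes.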
Yet the data and computational power required by pairs trading strategies are usually a fraction of what statistical arbitrage strategies demand. We have baskets composed of an arbitrary number of stocks for which we must calculate the cointegration vector and the respective portfolio weights; run a Rolling Windows Eigenvector Comparison (RWEC) to check portfolio weight stability; and run a Chow Test with a Cumulative Sum of Squares to anticipate structural breaks. All of this must run in real time for live trade monitoring. Eventually, we need to run it for several baskets simultaneously, that is, we must fetch price data for dozens of symbols and process it as quickly as possible to react in time to alerts, breaks, and market disruptions.
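As one plausible reading of the eigenvector-stability idea (the article's actual RWEC may differ in detail): compute the dominant eigenvector of the basket's covariance matrix over consecutive rolling windows and compare neighbours via absolute cosine similarity. Values near 1 mean the implied weights are stable; a drop hints at a structural change. Function and variable names below are illustrative.

```python
import numpy as np

def rolling_eigvec_similarity(returns, window):
    """returns: (T, n_assets) array of per-period returns.
    For each rolling window, take the eigenvector of the largest
    covariance eigenvalue and compare it to the previous window's
    via |cosine similarity| (abs() makes the sign convention irrelevant)."""
    sims, prev = [], None
    for start in range(len(returns) - window + 1):
        w = returns[start:start + window]
        cov = np.cov(w, rowvar=False)        # rows are observations
        vals, vecs = np.linalg.eigh(cov)     # eigenvalues ascending
        v = vecs[:, -1]                      # dominant eigenvector, unit norm
        if prev is not None:
            sims.append(abs(float(v @ prev)))
        prev = v
    return sims
```

Running this for every basket on every update is what pushes the workload toward the analytical (OLAP) side that the article delegates to DuckDB.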
Author: Jocimar Lopes