Module D · Research to Reality - Chapter 15

Honest Backtesting and Validation

The full validation scorecard: walk-forward, purged and embargoed cross-validation, the deflated Sharpe, the probability of backtest overfitting, a block bootstrap, and parameter-stability maps.

STAT · RISK

What you'll learn

· Walk-forward distributions
· Why ordinary CV leaks
· Purged and embargoed CV
· The deflated Sharpe
· Probability of backtest overfitting
· Ridge vs plateau

A backtest is a measurement, and the act of searching quietly contaminates it. A backtest simply replays a trading rule over past data to see how it would have done. Every threshold you tried, every window you swept, every pair you scanned uses up a little of your statistical confidence. By the time a strategy "works", you have often just tested it into existence. The previous chapters took the HDFCBANK / KOTAKBANK pairs trade out of sample once and watched the tempting in-sample net Sharpe of 1.01 collapse to 0.34. (In-sample is the data you used to build and tune the rule; out-of-sample is fresh data you never touched - the only honest test. The Sharpe ratio is return divided by risk.) That was the first line of defence. This chapter is the full validation scorecard - the battery of tests a quant runs before believing a single number, run on the same pair and the same z-score rule, with the real numbers in this data window. The point is not to bless the backtest. It is to subtract the part that was never real and report what little is left, without flinching.

The validation funnel: each ring subtracts edge that was never real

A single backtest answers exactly one question - could a rule have fit this past? - and the answer is almost always yes. Validation is a funnel that asks the harder questions one after another, and each question removes some apparent edge. What drips out of the bottom is the honest part. Keep this picture in mind for the whole chapter. The wide mouth is the tempting in-sample Sharpe. Every ring below it is one section that follows.

Read it top to bottom: nothing here is hidden, and nothing here flatters.

The setup on trial is the same as before. The spread is s = A - b·B in log prices. The hedge ratio b is estimated on the 2019-2023 train window and then frozen. We use a trailing 60-day z-score, enter at plus-or-minus two, exit at the mean, hard-stop at four, fill on the next bar, and charge realistic delivery (CNC) costs on every position change. Every test below works on the one daily net-return series this produces. Keep two reference numbers in mind throughout: in-sample net Sharpe near 1.0, out-of-sample near 0.3.
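The rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the chapter's own code: the function names, the flat `cost_per_change` charge, and the choice to block re-entry while |z| is beyond the stop are all hypothetical stand-ins.

```python
import numpy as np

def zscore_positions(spread, window=60, entry=2.0, stop=4.0):
    """Position held during each bar: +1 long the spread, -1 short, 0 flat.

    The decision at the close of bar t uses the trailing `window` bars
    ending at t, and is filled on bar t+1 (no look-ahead).
    """
    spread = np.asarray(spread, dtype=float)
    n = len(spread)
    target = np.zeros(n)          # position decided at each close
    pos = 0.0
    for t in range(n):
        if t + 1 >= window:
            hist = spread[t - window + 1 : t + 1]
            sd = hist.std(ddof=1)
            z = (spread[t] - hist.mean()) / sd if sd > 0 else 0.0
            if pos == 0.0:
                # enter against the deviation, but not beyond the stop level
                if entry <= abs(z) < stop:
                    pos = -np.sign(z)
            elif abs(z) >= stop or pos * z >= 0:
                # hard stop, or the z-score has come back through the mean
                pos = 0.0
        target[t] = pos
    # next-bar fill: what was decided at t is held during t+1
    return np.concatenate(([0.0], target[:-1]))

def net_returns(spread, held, cost_per_change=0.001):
    """Daily net return: position times spread change, minus a cost per unit of turnover.

    cost_per_change is a hypothetical flat charge, not the chapter's CNC cost model.
    """
    spread = np.asarray(spread, dtype=float)
    d_spread = np.diff(spread, prepend=spread[0])
    turnover = np.abs(np.diff(held, prepend=0.0))
    return held * d_spread - cost_per_change * turnover
```

Shifting the decided position by one bar before multiplying by the spread change is what enforces the next-bar fill. Skipping that shift is one of the easiest ways to leak tomorrow into today.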

Walk-forward: the distribution, not the average

One out-of-sample number is just a single draw from a noisy process. Quote it on its own and you are bluffing. Walk-forward testing turns that one held-out test into many. You estimate the hedge ratio on data up to a cut-off date, trade the next block blind, step the cut-off forward, and repeat. Each block is called a fold. The anchored version grows the training window from a fixed start. The rolling version uses a fixed-length trailing window that forgets old regimes. The honest output is never the average. It is the whole spread of fold Sharpes.

EX 1 (chart) · Walk-forward fold Sharpe distribution: one number hides everything · STAT · ch15/01_cell8.py

Across 14 folds, the anchored scheme lands a median fold Sharpe of +0.47, with a middle-half range (the interquartile range, where half the folds land) of [+0.29, +1.10]. Of those folds, 93% are positive, but the worst is -0.96 and the best a lucky +3.30. The rolling scheme does slightly better in the middle (median +0.57) but spreads much wider around it (IQR [+0.10, +1.67], worst -0.95). The shape is the tell. This is not a tight cloud sitting confidently above zero. It is a low pile of mediocre folds, with a couple of lucky spikes pulling the average up. And the headline in-sample 1.01 is roughly twice the median fold in either scheme. A lone backtest shows you the best case, not the typical one.
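As a rough sketch of the mechanics, the two walk-forward schemes differ only in where the training window starts. The helper names and the fold sizes below are illustrative assumptions, not the chapter's actual settings.

```python
import numpy as np

def walk_forward_folds(n_obs, first_train, test_len, mode="anchored"):
    """Yield (train_idx, test_idx) pairs that step forward through time.

    anchored: training always starts at 0 and grows.
    rolling:  training is a fixed-length trailing window of `first_train` bars.
    """
    if mode not in ("anchored", "rolling"):
        raise ValueError("mode must be 'anchored' or 'rolling'")
    cut = first_train
    while cut + test_len <= n_obs:
        start = 0 if mode == "anchored" else cut - first_train
        yield np.arange(start, cut), np.arange(cut, cut + test_len)
        cut += test_len

def annualised_sharpe(returns, periods=252):
    """Annualised Sharpe of a daily return series (zero if there is no variation)."""
    r = np.asarray(returns, dtype=float)
    sd = r.std(ddof=1)
    return float(np.sqrt(periods) * r.mean() / sd) if sd > 0 else 0.0

def fold_summary(fold_sharpes):
    """Report the distribution of fold Sharpes rather than a single average."""
    s = np.asarray(fold_sharpes, dtype=float)
    q25, q50, q75 = np.percentile(s, [25, 50, 75])
    return {
        "median": float(q50),
        "iqr": (float(q25), float(q75)),
        "frac_positive": float((s > 0).mean()),
        "worst": float(s.min()),
        "best": float(s.max()),
    }
```

In use, the hedge ratio would be re-estimated on each `train_idx`, the rule traded blind on `test_idx`, and `annualised_sharpe` of each test block collected into the list passed to `fold_summary`.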

Heads up

A single out-of-sample Sharpe is one sample from a wide spread of outcomes. When the median fold is roughly half the headline, a quarter of folds are near-flat, and the worst loses money outright, the honest summary is "around 0.5, but I have seen -0.96", not "0.57". Anyone who shows you one number has chosen which draw to show you.

Purged and embargoed cross-validation: ordinary k-fold leaks

The natural next thought is cross-validation. Instead of one cut between train and test, you split the data into many folds and rotate which fold is the test set, then average the results. But ordinary k-fold cross-validation leaks on a trading strategy. The leak is built in, not a bug you can patch. A leak means the test set is secretly contaminated by information from the training set. Two threads tie nearby days together. First, overlapping labels: a reversion trade opened on day t is judged by the spread's move over the next H days, so day t's outcome is built from the same returns as t+1, t+2, and so on - here the holding period is about H = 39 days. Second, trailing features: today's z-score is computed from the prior 60 days. Now shuffle the days into folds at random - the standard move when rows are assumed independent - and nearly every test day has a near-twin sitting in the training set. The model is then tested on data it has effectively already seen.

The fix (Lopez de Prado): contiguous folds, purge the overlapping label window, embargo the residual autocorrelation.

Measured exactly on this strategy, the fraction of test days that have a training neighbour inside the label window is 100% for shuffled k-fold, 16.6% for blocked (contiguous) folds, and 0% once you purge plus-or-minus H days and add a small embargo. Purging means deleting the training days whose label window overlaps the test block; an embargo adds a short gap after the test block to break any leftover spillover. That leak is not cosmetic - it inflates the score. Pick the entry threshold on each training fold and score it out of fold: the naive shuffled-CV Sharpe is 0.78. Switch to purged and embargoed folds and it drops to 0.50 - a 55% inflation removed. The longer the holding period, the bigger that gap.
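The purge-and-embargo construction, and the leak measure used above, can be sketched as follows. This is an illustration under assumptions: the function names, the five-fold split, and the five-day embargo are hypothetical, and only H = 39 comes from the text.

```python
import numpy as np

def purged_folds(n_obs, n_folds=5, horizon=39, embargo=5):
    """Contiguous test blocks; training drops every day whose label window
    could overlap the test block, plus a short embargo after it."""
    edges = np.linspace(0, n_obs, n_folds + 1).astype(int)
    idx = np.arange(n_obs)
    for k in range(n_folds):
        lo, hi = edges[k], edges[k + 1]            # test block is [lo, hi)
        keep = (idx < lo - horizon) | (idx >= hi + horizon + embargo)
        yield idx[keep], idx[lo:hi]

def leak_fraction(train, test, horizon):
    """Share of test days with at least one training day within `horizon` days."""
    train = np.sort(np.asarray(train))
    test = np.asarray(test)
    pos = np.searchsorted(train, test)
    left = np.abs(test - train[np.clip(pos - 1, 0, len(train) - 1)])
    right = np.abs(train[np.clip(pos, 0, len(train) - 1)] - test)
    return float((np.minimum(left, right) <= horizon).mean())

# Compare against a shuffled split on the same number of days.
n, H = 2347, 39
rng = np.random.default_rng(0)
perm = rng.permutation(n)
shuffled_test, shuffled_train = perm[: n // 5], perm[n // 5 :]
shuffled_leak = leak_fraction(shuffled_train, shuffled_test, H)
purged_leak = max(leak_fraction(tr, te, H) for tr, te in purged_folds(n, 5, H, 5))
```

The purge is symmetric because a training day just before the block has a label that runs into it, while a training day just after the block is scored on returns the test labels already used.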

EX 2 (chart) · Naive shuffled CV reports 0.78; purged and embargoed CV reports 0.50 · STAT · ch15/03_naive_shuffled_vs_purged_embargo_cv_shar.py

Key idea

The k-fold leak is not a coding mistake you can fix with a cleaner library. It is baked into the fact that trading labels overlap in time. Shuffle the rows and you hand the model near-twins of the test set. Roughly a third of the apparent cross-validated edge here (0.28 of 0.78) was that leak, and it vanished the moment the folds were made contiguous and purged.

The Deflated Sharpe Ratio: pay a toll for every trial

Suppose you tried 64 different configurations and reported only the best one. Its Sharpe is biased upward simply because you took the maximum over many noisy tries. Even with zero true edge, the best of 64 random strategies looks good. The Deflated Sharpe Ratio (Bailey and Lopez de Prado) prices that in. Instead of comparing your Sharpe against zero, it compares it against the Sharpe you would expect to beat by luck alone after N tries. It also corrects the test for returns that are not bell-shaped.

EX 3 (chart) · The Deflated Sharpe: deflate the best of 64 trials against the luck benchmark · STAT · ch15/04_build_the_grid_of_trials_we_searched_pri.py

Here the best of 64 configs has a daily Sharpe of 0.070 (annualised 1.11), with returns far from bell-shaped - skew +1.54, kurtosis 18.0 - over T = 2347 days. Tested against zero, the Probabilistic Sharpe Ratio is 1.000: it looks rock-solid, all but certain. But the expected best Sharpe from 64 lucky tries is 0.56 annualised. Deflate against that benchmark instead of zero, and the Deflated Sharpe drops to 0.962. It still clears the usual 0.95 credibility line - just barely. Try a few hundred configurations instead of 64 and the same observed Sharpe falls straight through 0.95 into "not credible". The toll grows with how hard you searched.
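The arithmetic can be sketched with the standard Probabilistic Sharpe Ratio formula, plugging in the rounded figures quoted above. Two assumptions are made here: the kurtosis of 18.0 is treated as non-excess (normal = 3), and the 0.56 luck benchmark is taken as given, since the cross-trial spread of Sharpes needed to rederive it is not stated. With rounded inputs the result lands near the quoted value rather than matching it exactly.

```python
from math import e, sqrt
from statistics import NormalDist

_NORMAL = NormalDist()

def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew, kurt):
    """P(true Sharpe > benchmark). Sharpe values are per period (daily here);
    kurt is assumed to be non-excess kurtosis (3 for a normal)."""
    denom = sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - sr_benchmark) * sqrt(n_obs - 1) / denom)

def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected best Sharpe among n_trials zero-edge strategies,
    given the standard deviation of Sharpe estimates across trials."""
    gamma = 0.5772156649015329                     # Euler-Mascheroni constant
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * e))
    return sr_std * ((1.0 - gamma) * a + gamma * b)

# Figures quoted in the text (rounded), converted to daily units where needed.
sr_daily, n_days, skew, kurt = 0.070, 2347, 1.54, 18.0
luck_daily = 0.56 / sqrt(252)

psr_vs_zero = probabilistic_sharpe(sr_daily, 0.0, n_days, skew, kurt)
dsr = probabilistic_sharpe(sr_daily, luck_daily, n_days, skew, kurt)
```

`expected_max_sharpe` shows why the toll rises with search effort: for a fixed spread of trial Sharpes, the benchmark grows with the number of trials, so the same observed Sharpe clears it by less and less.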

Heads up

The Deflated Sharpe here charged for 64 explicit configs. It cannot see the pairs you quietly threw away, the windows you tried last week, or the thousands of backtests in published papers you read before choosing this one. The true number of tries - and so the true haircut - is always larger than any you can write down. A PSR of 1.000 against zero means nothing unless you also say how many doors you opened to find it.

Probability of Backtest Overfitting

The Deflated Sharpe asks "is this number real?" PBO - the probability of backtest overfitting - asks the deeper question: is my whole way of picking the best config overfit? PBO is the chance that the setting which looks best in your backtest is just luck that will not repeat. We measure it with Combinatorially-Symmetric Cross-Validation (Bailey, Borwein, Lopez de Prado, Zhu). Take the T x N grid of returns from every trial, chop time into S blocks, and for every symmetric way of splitting those blocks into an in-sample half and an out-of-sample half: pick the best config in-sample, find its rank out of sample, and record where it landed. PBO is the fraction of splits where the in-sample winner falls below the out-of-sample median.

EX 4 (chart) · PBO via CSCV: the in-sample winner is a coin flip out of sample · RISK · ch15/05_cscv_over_the_trial_grid.py

Over 70 symmetric splits of S = 8 blocks across the N = 64 configs, PBO = 54%. The config that looks best in sample lands below median out of sample more often than not - no better than picking one at random. The scatter makes it concrete. The in-sample-best config carries a median in-sample Sharpe of +1.25, but its median out-of-sample Sharpe is just +0.66, and the cloud sprays across the diagonal instead of hugging it. Choosing the best config added almost no real out-of-sample information.
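A compact sketch of the CSCV procedure follows, assuming a T x N array of daily net returns with one column per config. The function names are hypothetical, and each column is assumed to have non-zero variance in every half.

```python
from itertools import combinations

import numpy as np

def column_sharpes(r):
    """Per-period Sharpe of each column (assumes non-zero variance)."""
    return r.mean(axis=0) / r.std(axis=0, ddof=1)

def pbo_cscv(returns, n_blocks=8):
    """Return (PBO, number of splits) for a T x N matrix of trial returns."""
    returns = np.asarray(returns, dtype=float)
    n_cfg = returns.shape[1]
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)
    below_median = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        oos_ids = [b for b in range(n_blocks) if b not in is_ids]
        r_is = returns[np.concatenate([blocks[b] for b in is_ids])]
        r_oos = returns[np.concatenate([blocks[b] for b in oos_ids])]
        best = int(np.argmax(column_sharpes(r_is)))   # in-sample winner
        oos = column_sharpes(r_oos)
        rank = int((oos < oos[best]).sum()) + 1       # 1 = worst, N = best
        omega = rank / (n_cfg + 1)                    # relative rank in (0, 1)
        below_median.append(omega < 0.5)
    return float(np.mean(below_median)), len(below_median)
```

With 8 blocks there are C(8, 4) = 70 ways to pick the in-sample half, which is where the 70 splits come from. Every block serves in sample in exactly half of them, so no stretch of history is privileged.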

Note

A PBO near 50% is the signature of a search that learned the noise, not the signal. It does not mean the strategy is worthless. It means the act of choosing the best config gives you essentially no edge over a coin flip. If your ranking of configs does not survive being re-ranked out of sample, you have not found a setting. You have found a story.

What survives, and how sure we are

Two questions remain. How wide is the error bar on the surviving Sharpe? And is the edge a fragile ridge or a sturdy plateau? Take the error bar first. The Sharpe is a single estimate with a fat tail, and on roughly 600 out-of-sample days that tail is wide. A block bootstrap measures it. Bootstrapping means resampling your own data many times to see how much a number wobbles. The stationary block bootstrap (Politis and Romano) resamples blocks of consecutive days - which keeps the natural clustering of real returns - a few thousand times, and reads the Sharpe off each resample.

EX 5 (chart) · Block-bootstrap confidence interval: zero is inside it · STAT · ch15/06_cell18.py

The out-of-sample point Sharpe of 0.34 comes with a 95% bootstrap interval of [-0.47, 1.15], and zero sits squarely inside it. In plain terms, we cannot rule out "no edge" at the 5% level. About 20% of the bootstrap resamples produce a negative Sharpe. Even the full-sample interval, [0.30, 1.31], is wide enough that the headline number is far less certain than one clean figure ever admits. The honest read is not "Sharpe 0.34" but "anywhere from a money-loser to a respectable edge, with roughly one resample in five below zero, and I cannot tell which from this data."
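For readers who want to see the resampling itself, here is a minimal sketch of a stationary bootstrap in the Politis-Romano style: blocks start at random points, have geometrically distributed lengths, and wrap around the end of the sample. The mean block length of 20 days and the function names are illustrative assumptions, not the chapter's settings.

```python
import numpy as np

def stationary_bootstrap_indices(n, mean_block, rng):
    """One resample of indices: consecutive days, restarting at a random
    point with probability 1/mean_block on each step."""
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    restart = rng.random(n) < 1.0 / mean_block
    starts = rng.integers(n, size=n)
    for t in range(1, n):
        idx[t] = starts[t] if restart[t] else (idx[t - 1] + 1) % n
    return idx

def bootstrap_sharpe_ci(returns, n_boot=2000, mean_block=20, periods=252, seed=0):
    """95% percentile interval for the annualised Sharpe, plus the share of
    resamples whose Sharpe is negative."""
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    sharpes = np.empty(n_boot)
    for b in range(n_boot):
        s = r[stationary_bootstrap_indices(len(r), mean_block, rng)]
        sharpes[b] = np.sqrt(periods) * s.mean() / s.std(ddof=1)
    lo, hi = np.percentile(sharpes, [2.5, 97.5])
    return float(lo), float(hi), float((sharpes < 0).mean())
```

Resampling single days instead of blocks would break up volatility clusters and typically make the interval look narrower than it should.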

A ridge is an instant red flag. A plateau is necessary but not sufficient - the whole plateau can still be in-sample luck.

Now the shape. Picture the net Sharpe as a surface over the settings you can choose. A ridge is a thin spike - nudge a setting and the edge vanishes, a classic sign of overfitting. A plateau is a broad flat top - the edge survives across a whole neighbourhood of settings. Sweeping the net Sharpe across entry-versus-exit and window-versus-entry slices, the base config sits on a broad, gently varying plateau, not a spike. The base full-sample net Sharpe is 0.82, its 3x3 neighbourhood averages 0.79 with a standard deviation of only 0.08, and 96% of all swept cells are positive. That is the good news - the exact thresholds were not curve-fitted to a coincidence. The bad news is that it is a low plateau. The one-knob-at-a-time table shows the out-of-sample Sharpe still depends on the choices: lengthening the z-score window to 90 days lifts it to 0.99, while 60 or 120 days leave it at 0.34, a tighter stop drops it to 0.19, and doubling the assumed cost takes it to 0.22. The robustness to small changes is real, but it is the robustness of a marginal edge.
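The ridge-versus-plateau check reduces to a few summary statistics over a grid of Sharpes. The sketch below assumes a 2-D array of net Sharpes indexed by two swept parameters. The function name is hypothetical, and the sample standard deviation is an assumption, since the text does not say which convention it uses.

```python
import numpy as np

def neighbourhood_stats(grid, i, j, radius=1):
    """Summarise the cells around (i, j) on a parameter-sweep grid of Sharpes."""
    g = np.asarray(grid, dtype=float)
    patch = g[max(i - radius, 0) : i + radius + 1,
              max(j - radius, 0) : j + radius + 1]
    return {
        "base": float(g[i, j]),
        "neighbour_mean": float(patch.mean()),
        "neighbour_std": float(patch.std(ddof=1)),   # sample std (assumed)
        "frac_positive": float((g > 0).mean()),      # over the whole sweep
    }

# Two made-up grids for contrast: a flat top and a single bright spike.
plateau = np.full((5, 5), 0.80)
plateau[2, 2] = 0.82
ridge = np.full((5, 5), 0.05)
ridge[2, 2] = 1.50

plateau_stats = neighbourhood_stats(plateau, 2, 2)
ridge_stats = neighbourhood_stats(ridge, 2, 2)
```

A small neighbourhood standard deviation with a mean close to the base value is the plateau signature. A base value far above its neighbourhood mean is the ridge. Neither statistic says anything about whether the level of the plateau is high enough to trade.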

EX 6 (chart) · Parameter-stability heatmaps: a broad but low plateau, not a bright ridge · RISK · ch15/07_cell20.py

Put every layer on one page and the story is plain. Each ring of the funnel removed some apparent edge, and what reaches the bottom is thin and uncertain.

Validation layer	What it asks	Result in this window
Single in-sample backtest	Could a rule fit the past?	Net Sharpe 1.01
Walk-forward (anchored)	Survive re-estimated blind folds?	Median fold 0.47, 93% > 0, worst -0.96
Walk-forward (rolling)	Same, adapting to drift	Median fold 0.57, 86% > 0, best +3.07
Purged + embargoed CV	Strip the overlap leak	0.78 -> 0.50 (55% inflation removed)
Deflated Sharpe (64 trials)	Beat the best of N lucky tries?	PSR-vs-0 = 1.000 drops to DSR 0.96, barely above 0.95
PBO via CSCV	Is selection itself overfit?	54% - a coin flip
Block-bootstrap 95% CI	How sure is the Sharpe?	0.34, CI [-0.47, 1.15], zero inside
Stability sweep	Ridge or plateau?	Plateau, 96% cells > 0, base only 0.34 OOS

Key idea

When validation works, it feels like a disappointment. The job of these tests is not to bless the curve. It is to subtract the part that was never real and report what is left. On this pair, what is left is a thin, uncertain, out-of-sample edge whose confidence interval includes zero. That is the method succeeding, not failing.

Where this breaks

Validation is a defence, not a guarantee. Each tool here fails quietly in its own way. An expert keeps those failure modes in view.

Validation has its own overfitting. Run walk-forward, PBO and the rest, then tweak the strategy until they pass, and you have simply overfit to the validators instead. These tools work only when you run them once, on a hypothesis you fixed in advance. Used over and over, they become just another search that itself needs deflating.
The number of trials is always underestimated. The Deflated Sharpe charged for 64 explicit configs. It cannot price the pairs you discarded, the windows you tried last month, or the published papers that steered you here. The true N, and the true haircut, is larger than anything you can write down.
Purging assumes you know the holding period. We purged plus-or-minus 39 days, taken from a half-life estimate. If a spread refuses to revert and positions last longer than that, the purge window is too short and the leak survives. Purged CV controls the overlap, not the deeper drift in the relationship that walk-forward is meant to catch.
The bootstrap assumes the past looks like the future. A block bootstrap can only reshuffle the regime you actually observed. It cannot invent the crash you never saw, the borrow squeeze, or the day the cointegration simply ends. Its interval is a floor on your uncertainty, not the whole of it.
PBO and stability maps are relative, not absolute. A low PBO and a broad plateau only tell you that within your grid the selection is stable. They say nothing about whether the entire grid is one big pool of in-sample luck. A robust, well-validated, marginal edge is still marginal.
None of this brings back real-world frictions. A strategy can pass every test on this page and still be untradeable once you add borrow fees, financing, position-limit bans, two-legged execution risk, and a point-in-time universe. Treat the short leg as a research abstraction unless you have a real way to implement it.

The bottom line: on HDFCBANK / KOTAKBANK, what survives walk-forward, purged cross-validation, the Deflated Sharpe, a 54% PBO, a bootstrap interval that brackets zero, and a low plateau is a thin, uncertain edge that the data cannot reliably tell apart from nothing. The rarest skill in quantitative trading is the discipline to compute all of this before you fall in love with a curve - and to walk away when the funnel comes up empty. Educational content only, not investment advice.