Walk-Forward Testing & Robustness
Split in- and out-of-sample, roll the window forward, and avoid fooling yourself.
- Train/test split
- In- vs out-of-sample
- Rolling walk-forward
- Parameter stability
- Overfitting red flags
- Deflated expectations
Here is the uncomfortable truth that ends most trading systems before they begin: any strategy can be made to look brilliant on the past. Give me enough parameters to tune and a fixed slice of history, and I'll hand you a gorgeous equity curve that means absolutely nothing. The market never repeats that exact history, so a strategy fitted to it has learned a story that will never be told again.
Last chapter we found "best" parameters by sweeping a grid. But best on what? On the very data we tuned them on - which is like grading your own exam after seeing the answer key. This chapter introduces the single most important discipline in quantitative trading: walk-forward testing. The idea is simple and almost old-fashioned in its honesty - tune on one slice of history, then judge the result on a different slice the strategy has never seen. If the edge survives that, you might have something. If it doesn't, you've been fooling yourself, and it's far better to learn that on a screen than with real money.
In-sample and out-of-sample
Two terms you'll use forever:
- In-sample (or training) data is the history you're allowed to tune on. You can optimise, fit, and fiddle here as much as you like.
- Out-of-sample (or test) data is held back, sealed in an envelope. The strategy never sees it during tuning. It's the exam.
The split is brutally simple: cut the timeline in two. Everything before the cut is in-sample; everything after is out-of-sample. The order is sacred - the future must stay in the future. You never shuffle market data the way you might shuffle a deck, because that would let tomorrow's information leak into yesterday's decisions.
# Split history into IN-SAMPLE (train) and OUT-OF-SAMPLE (test) -- never shuffle time.
import os
from datetime import datetime, timedelta
from openalgo import api
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
# Cut the timeline once. The FIRST chunk is for tuning; the LAST is held back,
# untouched, to judge the result. Order matters -- the future must stay future.
cut = int(len(close) * 0.7)
in_sample = close.iloc[:cut]
out_sample = close.iloc[cut:]
print(f"Total daily bars : {len(close)}")
print(f"In-sample (train): {len(in_sample)} bars {in_sample.index[0].date()} -> {in_sample.index[-1].date()}")
print(f"Out-of-sample : {len(out_sample)} bars {out_sample.index[0].date()} -> {out_sample.index[-1].date()}")
print("\nWe tune ONLY on the in-sample. The out-of-sample is a sealed exam paper.")

Output:
Total daily bars : 610
In-sample (train): 427 bars 2024-01-05 -> 2025-09-23
Out-of-sample : 183 bars 2025-09-24 -> 2026-06-23

We tune ONLY on the in-sample. The out-of-sample is a sealed exam paper.
This is why, when we reach machine learning in Chapter 30, we always pass train_test_split(..., shuffle=False). The default shuffles rows randomly - fine for photos of cats, catastrophic for time series. Shuffling lets the model peek at the future, and a model that has seen the future always looks like a genius.
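To see the leak concretely without any ML library, here is a minimal sketch. The helper names (`chronological_split`, `shuffled_split`, `leaks_future`) are made up for illustration, and plain integers stand in for trading days; the shuffled version only imitates what a shuffling splitter does.

```python
import random

def chronological_split(rows, train_frac=0.7):
    """Cut an already time-ordered list once; nothing is reordered."""
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def shuffled_split(rows, train_frac=0.7, seed=0):
    """Imitate a shuffling splitter: mix the rows first, then cut."""
    mixed = rows[:]
    random.Random(seed).shuffle(mixed)
    cut = int(len(mixed) * train_frac)
    return mixed[:cut], mixed[cut:]

def leaks_future(train, test):
    """True if any training row is dated after the earliest test row."""
    return max(train) > min(test)

days = list(range(100))  # hypothetical stand-in for 100 consecutive trading days

train, test = chronological_split(days)
print("chronological split leaks future:", leaks_future(train, test))

train, test = shuffled_split(days)
print("shuffled split leaks future     :", leaks_future(train, test))
```

With the chronological cut every training day precedes every test day. After shuffling, the training set contains days that come later than some test days, which is exactly the peek at the future described above.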
Step one: optimise on the training slice
Walk-forward is a two-step dance, and step one is exactly the optimisation you already know - but performed only on the in-sample data. We sweep the EMA grid on the training slice and pick the combo with the best Sharpe. Notice we never even touch the test data here.
# Step 1 of walk-forward: find the best parameter ON THE TRAINING DATA ONLY.
import os
from datetime import datetime, timedelta
import pandas as pd
import vectorbt as vbt
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
train = close.iloc[:int(len(close) * 0.7)]    # tune only on this slice
combos = [(f, s) for f in (10, 20, 30) for s in (40, 50, 60) if f < s]
ent, ext = {}, {}
for f, s in combos:
    fe, se = ta.ema(train, f), ta.ema(train, s)
    ent[(f, s)] = (fe > se) & (fe.shift(1) <= se.shift(1))
    ext[(f, s)] = (fe < se) & (fe.shift(1) >= se.shift(1))
cols = pd.MultiIndex.from_tuples(combos, names=["fast", "slow"])
pf = vbt.Portfolio.from_signals(train, pd.DataFrame(ent).set_axis(cols, axis=1),
                                pd.DataFrame(ext).set_axis(cols, axis=1),
                                init_cash=100000, fees=0.001, freq="1D")
best = tuple(int(x) for x in pf.sharpe_ratio().idxmax())
print(f"Tuned on {len(train)} in-sample bars.")
print(f"Best combo by Sharpe: EMA {best[0]}/{best[1]}")
print(f"In-sample Sharpe : {pf.sharpe_ratio()[best]:.2f} (looks great -- but it SHOULD)")

Output:
Tuned on 427 in-sample bars.
Best combo by Sharpe: EMA 30/60
In-sample Sharpe : 0.71 (looks great -- but it SHOULD)
The in-sample Sharpe will usually look good. It should - we hand-picked the combo that scored highest on exactly this data. That number is not evidence of skill; it's evidence that we optimised. The real question comes next.
Step two: grade it out-of-sample
Now the moment of truth. We take the winner from training - and only the winner - and run it on the held-back test data. Then we put the two returns side by side.
# Step 2: take the winner from training and grade it on UNSEEN out-of-sample data.
import os
from datetime import datetime, timedelta
import pandas as pd
import vectorbt as vbt
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
def run(c, f, s):                             # backtest one combo on a slice
    fe, se = ta.ema(c, f), ta.ema(c, s)
    e = (fe > se) & (fe.shift(1) <= se.shift(1))
    x = (fe < se) & (fe.shift(1) >= se.shift(1))
    return vbt.Portfolio.from_signals(c, e, x, init_cash=100000, fees=0.001, freq="1D")
def best_on(c, combos):                       # pick best Sharpe combo on a slice
    ent = {cb: ((ta.ema(c, cb[0]) > ta.ema(c, cb[1])) &
                (ta.ema(c, cb[0]).shift(1) <= ta.ema(c, cb[1]).shift(1))) for cb in combos}
    ext = {cb: ((ta.ema(c, cb[0]) < ta.ema(c, cb[1])) &
                (ta.ema(c, cb[0]).shift(1) >= ta.ema(c, cb[1]).shift(1))) for cb in combos}
    cols = pd.MultiIndex.from_tuples(combos, names=["fast", "slow"])
    pf = vbt.Portfolio.from_signals(c, pd.DataFrame(ent).set_axis(cols, axis=1),
                                    pd.DataFrame(ext).set_axis(cols, axis=1),
                                    init_cash=100000, fees=0.001, freq="1D")
    return tuple(int(x) for x in pf.sharpe_ratio().idxmax())
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
cut = int(len(close) * 0.7)
combos = [(f, s) for f in (10, 20, 30) for s in (40, 50, 60) if f < s]
best = best_on(close.iloc[:cut], combos)      # learn on train
in_s = run(close.iloc[:cut], *best).total_return() * 100
out_s = run(close.iloc[cut:], *best).total_return() * 100
print(f"Chosen on training: EMA {best[0]}/{best[1]}")
print(f"In-sample return : {in_s:6.2f} % (what we tuned for)")
print(f"Out-of-sample : {out_s:6.2f} % (the only number that counts)")

Output:
Chosen on training: EMA 30/60
In-sample return :  15.90 % (what we tuned for)
Out-of-sample :  -4.57 % (the only number that counts)
The in-sample number is what you tuned for; the out-of-sample number is the only one that counts. A strategy that earned a great in-sample return but stumbles out-of-sample hasn't found an edge - it has memorised noise. When the two are close, you've found something that might generalise. When the out-of-sample badly trails the in-sample, that gap is the cost of your overfitting, paid back to you as a warning instead of a loss.
A note on what you'll see: a simple EMA crossover on a single stock won't print heroic returns, and an out-of-sample slice of a few months can easily come back negative. That's realistic - most simple ideas are mediocre, and walk-forward is the tool that tells you so before you bet on them.
Rolling the window forward
One split is a snapshot; real trading is a film. In practice you'd re-tune your strategy periodically - say every quarter - using recent history, then trade those settings until the next re-tune. Walk-forward testing mimics exactly that: tune on a window, test on the window immediately after, then slide both windows forward and repeat.
Stitch all those out-of-sample test results together and you get something precious: a track record built entirely from data the strategy had never seen at the moment of each decision. It's the closest thing to live trading you can get without risking capital.
# Roll the window forward: tune, test, slide, repeat -- mimicking real re-tuning.
import os
from datetime import datetime, timedelta
import pandas as pd
import vectorbt as vbt
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
COMBOS = [(f, s) for f in (10, 20, 30) for s in (40, 50, 60) if f < s]
def signals(c, f, s):
    fe, se = ta.ema(c, f), ta.ema(c, s)
    return ((fe > se) & (fe.shift(1) <= se.shift(1)),
            (fe < se) & (fe.shift(1) >= se.shift(1)))
def best_on(c):
    ent = {cb: signals(c, *cb)[0] for cb in COMBOS}
    ext = {cb: signals(c, *cb)[1] for cb in COMBOS}
    cols = pd.MultiIndex.from_tuples(COMBOS, names=["fast", "slow"])
    pf = vbt.Portfolio.from_signals(c, pd.DataFrame(ent).set_axis(cols, axis=1),
                                    pd.DataFrame(ext).set_axis(cols, axis=1),
                                    init_cash=100000, fees=0.001, freq="1D")
    return tuple(int(x) for x in pf.sharpe_ratio().idxmax())
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
train_len, test_len = 250, 100                # window sizes in bars
print(f"{'window':7}{'tuned EMA':12}{'OOS return %':>13}")
i = 0
while i + train_len + test_len <= len(close):
    tr = close.iloc[i:i + train_len]
    te = close.iloc[i + train_len:i + train_len + test_len]
    f, s = best_on(tr)                        # tune on train window
    e, x = signals(te, f, s)
    oos = vbt.Portfolio.from_signals(te, e, x, init_cash=100000, fees=0.001,
                                     freq="1D").total_return() * 100
    print(f"{i // test_len + 1:<7}{f'{f}/{s}':12}{oos:13.2f}")   # test on next window
    i += test_len                             # slide forward

Output:
window tuned EMA    OOS return %
1      30/60               -3.30
2      30/60               -6.07
3      30/60               -4.15
Each row here is its own little experiment: tune on 250 bars, trade the next 100, slide on. The training windows overlap, but every test window is fresh data, so no single window gets to cherry-pick its own answer key.
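The window arithmetic is easy to get wrong by one bar, so it can help to look at it in isolation. The sketch below reproduces only the index logic of the loop above; `walk_forward_windows` is a hypothetical helper name, and 610 is the bar count from the earlier run.

```python
def walk_forward_windows(n_bars, train_len, test_len):
    """Yield (train_start, train_end, test_end) index triples.

    Train is bars [train_start, train_end); test is [train_end, test_end).
    Both windows slide forward by test_len, so test windows never overlap.
    """
    i = 0
    while i + train_len + test_len <= n_bars:
        yield i, i + train_len, i + train_len + test_len
        i += test_len

windows = list(walk_forward_windows(n_bars=610, train_len=250, test_len=100))
for n, (a, b, c) in enumerate(windows, 1):
    print(f"window {n}: train [{a}:{b})  test [{b}:{c})")

# Bars after the last test window are never scored.
print("unused bars at the end:", 610 - windows[-1][2])
```

With these sizes the three test windows cover bars 250 to 550 back to back, and the final 60 bars are left over because they cannot fill a whole test window.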
Parameter stability: the quiet truth-teller
Here's a test that catches overfitting before you even look at returns. Across all those rolling windows, did the optimiser keep picking the same settings? If window after window lands on roughly the same EMA lengths, your strategy is stable - it's responding to a persistent feature of the market. If the "best" combo ricochets wildly from one window to the next, the optimiser is just chasing random noise, and whatever it picks next is anyone's guess.
# Parameter STABILITY: does the optimiser pick the same settings each window?
import os
from datetime import datetime, timedelta
import pandas as pd
import vectorbt as vbt
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
COMBOS = [(f, s) for f in (10, 20, 30) for s in (40, 50, 60) if f < s]
def best_on(c):
    ent, ext = {}, {}
    for f, s in COMBOS:
        fe, se = ta.ema(c, f), ta.ema(c, s)
        ent[(f, s)] = (fe > se) & (fe.shift(1) <= se.shift(1))
        ext[(f, s)] = (fe < se) & (fe.shift(1) >= se.shift(1))
    cols = pd.MultiIndex.from_tuples(COMBOS, names=["fast", "slow"])
    pf = vbt.Portfolio.from_signals(c, pd.DataFrame(ent).set_axis(cols, axis=1),
                                    pd.DataFrame(ext).set_axis(cols, axis=1),
                                    init_cash=100000, fees=0.001, freq="1D")
    return tuple(int(x) for x in pf.sharpe_ratio().idxmax())
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
picks = []
train_len, step = 250, 100
i = 0
while i + train_len <= len(close):
    picks.append(best_on(close.iloc[i:i + train_len]))
    i += step
print("Best combo chosen in each rolling training window:")
for n, (f, s) in enumerate(picks, 1):
    print(f" window {n}: EMA {f}/{s}")
unique = len(set(picks))
print(f"\nDistinct picks across {len(picks)} windows: {unique}")
print("Few distinct picks -> STABLE and trustworthy. Jumping around -> noise-fitting.")

Output:
Best combo chosen in each rolling training window:
 window 1: EMA 30/60
 window 2: EMA 30/60
 window 3: EMA 30/60
 window 4: EMA 10/40

Distinct picks across 4 windows: 2
Few distinct picks -> STABLE and trustworthy. Jumping around -> noise-fitting.
Stability is more reassuring than any single backtest number. A strategy whose ideal parameters barely move through time is one you can actually trust to re-tune and keep trading. A strategy whose parameters won't sit still is telling you, loudly, that it has no real edge - listen to it.
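Counting distinct picks is one summary; another is how often the most popular pick wins. Here is a small sketch using the four picks printed above. The `stability` helper is a hypothetical name, and what share counts as "stable enough" is a judgment call rather than a rule.

```python
from collections import Counter

def stability(picks):
    """Return (most common pick, share of windows that chose it)."""
    pick, count = Counter(picks).most_common(1)[0]
    return pick, count / len(picks)

# The four picks from the rolling training windows above.
picks = [(30, 60), (30, 60), (30, 60), (10, 40)]

top, share = stability(picks)
print(f"Modal pick EMA {top[0]}/{top[1]} chosen in {share:.0%} of windows")
```

Read the share with one caveat in mind: with 250-bar training windows sliding by 100 bars, neighbouring windows share 150 bars of history, so some agreement between consecutive picks is built in.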
Reading the overfitting red flags
Let's make the warning signs concrete. When you lay in-sample and out-of-sample returns side by side for several combos, the gap between them is the overfitting tell. A combo that looks spectacular in training and falls apart in testing has fitted the past's noise. The combo whose two columns stay closest - even if neither is the absolute highest - is the more honest bet.
# The overfitting tell: a big gap between in-sample glory and out-of-sample reality.
import os
from datetime import datetime, timedelta
import pandas as pd
import vectorbt as vbt
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
COMBOS = [(f, s) for f in (5, 10, 20) for s in (30, 50) if f < s]
def run(c, f, s):
    fe, se = ta.ema(c, f), ta.ema(c, s)
    e = (fe > se) & (fe.shift(1) <= se.shift(1))
    x = (fe < se) & (fe.shift(1) >= se.shift(1))
    return vbt.Portfolio.from_signals(c, e, x, init_cash=100000, fees=0.001, freq="1D")
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
cut = int(len(close) * 0.6)
train, test = close.iloc[:cut], close.iloc[cut:]
print(f"{'EMA':9}{'in-sample %':>13}{'out-sample %':>14}{' gap':>8}")
for f, s in COMBOS:
    ins = run(train, f, s).total_return() * 100
    oos = run(test, f, s).total_return() * 100
    print(f"{f'{f}/{s}':9}{ins:13.2f}{oos:14.2f}{ins - oos:8.1f}")
print("\nA combo that shines in-sample but collapses out-of-sample is OVERFIT.")
print("Trust the setting whose two columns stay closest together.")

Output:
EMA        in-sample %  out-sample %     gap
5/30            -11.89         -6.86    -5.0
5/50             -3.03         -7.06     4.0
10/30           -17.24         -7.40    -9.8
10/50            -7.06         -9.18     2.1
20/30            -5.75        -10.88     5.1
20/50            -5.26         -8.11     2.8

A combo that shines in-sample but collapses out-of-sample is OVERFIT.
Trust the setting whose two columns stay closest together.
The classic red flags, gathered in one place:
- A large in-sample / out-of-sample gap - the bigger the drop-off, the more you overfit.
- Unstable parameters - the best settings jump around from window to window.
- Too few trades - a curve built on a handful of trades is luck wearing a costume.
- Fragility - a tiny change in a parameter flips the result from wonderful to dreadful.
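The fragility flag can be given a rough number: for each combo, how much does the result change if you move one step on the grid? The sketch below does this for the in-sample column of the table above. The helper names `neighbours` and `fragility` are made up for illustration, and "one grid step" is just one possible way to define a small parameter change.

```python
FAST, SLOW = (5, 10, 20), (30, 50)

# In-sample returns (%) copied from the red-flag table above.
in_sample = {(5, 30): -11.89, (5, 50): -3.03, (10, 30): -17.24,
             (10, 50): -7.06, (20, 30): -5.75, (20, 50): -5.26}

def neighbours(f, s):
    """Combos one grid step away in either the fast or the slow length."""
    fi, si = FAST.index(f), SLOW.index(s)
    out = []
    for j in (fi - 1, fi + 1):
        if 0 <= j < len(FAST):
            out.append((FAST[j], s))
    for j in (si - 1, si + 1):
        if 0 <= j < len(SLOW):
            out.append((f, SLOW[j]))
    return out

def fragility(results):
    """Largest absolute change in result when moving one grid step."""
    return {cb: max(abs(r - results[nb]) for nb in neighbours(*cb))
            for cb, r in results.items()}

# Print the combos from least to most fragile.
for cb, worst in sorted(fragility(in_sample).items(), key=lambda kv: kv[1]):
    print(f"EMA {cb[0]}/{cb[1]}: worst one-step change {worst:.2f} points")
```

A combo whose neighbours all score about the same sits on a plateau; one whose neighbours differ by many points sits on a spike, and spikes are where overfit parameters tend to live.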
Even a genuinely good in-sample result deserves suspicion. If you test 50 combos, one is likely to look great by pure chance - the same way that, among 50 coin-flippers, someone will probably get five heads in a row. Professionals "deflate" their expectations for exactly this reason: the more things you tried, the luckier your apparent winner probably is. Walk-forward is the antidote, because luck rarely survives on data the strategy never saw.
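The arithmetic behind that lucky-winner effect fits in one line. The sketch below assumes, purely for illustration, that each combo with no real edge has a 5% chance of looking great on its own and that combos are independent of each other. Real parameter combos tested on the same price history are correlated, so treat the result as a rough upper bound rather than a measurement.

```python
def chance_of_lucky_winner(n_trials, p_single):
    """P(at least one of n independent zero-edge trials looks great)."""
    return 1 - (1 - p_single) ** n_trials

# 9 is the size of this chapter's 3 x 3 EMA grid; 0.05 is an assumed fluke rate.
for n in (1, 9, 50, 200):
    p = chance_of_lucky_winner(n, 0.05)
    print(f"{n:>4} combos tried -> {p:.0%} chance of at least one fluke winner")
```

Under these assumptions, 50 tries already make a fluke winner far more likely than not, which is why the size of the search belongs next to every backtest result you report.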
Walk-forward by hand
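To be sure the mechanics aren't VectorBT magic, here's the same walk-forward loop written in plain pandas - positions from an EMA comparison, returns from pct_change, equity from a running product. This version is deliberately simpler than the VectorBT one: it ignores fees, uses a smaller grid, and tunes on total return rather than Sharpe, so the numbers won't match exactly. Every moving part is visible, and the conclusion is the same: the stitched-together out-of-sample record is your honest expectation, not the flattering in-sample dream.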
# Walk-forward without VectorBT: plain pandas returns, so you see every moving part.
import os
from datetime import datetime, timedelta
from openalgo import api, ta
client = api(
    api_key=os.getenv("OPENALGO_API_KEY", "your_api_key_here"),
    host=os.getenv("OPENALGO_HOST", "http://127.0.0.1:5000"),
)
def strat_return(c, f, s):
    # Hold long while fast EMA is above slow; earn the bar's return when in.
    pos = (ta.ema(c, f) > ta.ema(c, s)).astype(int).shift(1).fillna(0)
    daily = c.pct_change().fillna(0)
    equity = (1 + pos * daily).prod()         # gross growth multiple
    return (equity - 1) * 100
end = datetime.now().strftime("%Y-%m-%d")
start = (datetime.now() - timedelta(days=900)).strftime("%Y-%m-%d")
close = client.history(symbol="RELIANCE", exchange="NSE", interval="D",
                       start_date=start, end_date=end)["close"]
combos = [(f, s) for f in (10, 20) for s in (40, 50, 60) if f < s]
train_len, test_len, i, oos_total = 250, 100, 0, []
while i + train_len + test_len <= len(close):
    tr = close.iloc[i:i + train_len]
    te = close.iloc[i + train_len:i + train_len + test_len]
    best = max(combos, key=lambda cb: strat_return(tr, *cb))   # tune on train
    oos_total.append(strat_return(te, *best))                  # score on test
    print(f"window {i // test_len + 1}: tuned EMA {best[0]}/{best[1]} -> OOS {oos_total[-1]:6.2f} %")
    i += test_len
print(f"\nAverage out-of-sample return per window: {sum(oos_total) / len(oos_total):.2f} %")
print("This stitched out-of-sample record is the honest expectation -- not the in-sample dream.")

Output:
window 1: tuned EMA 20/60 -> OOS  -9.08 %
window 2: tuned EMA 10/60 -> OOS  -1.94 %
window 3: tuned EMA 20/60 -> OOS  -4.38 %

Average out-of-sample return per window: -5.13 %
This stitched out-of-sample record is the honest expectation -- not the in-sample dream.
When this honest number is good, you have real reason to proceed to live trading. When it's poor, you've just saved yourself a costly lesson - and that's a win too.
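The per-window average is one way to summarise the record. If you had actually traded the windows back to back with the same capital rolling over, the returns would compound instead. A minimal sketch, using the three out-of-sample figures printed above (`compound` is a hypothetical helper name):

```python
def compound(returns_pct):
    """Chain per-window percentage returns into one cumulative percentage."""
    growth = 1.0
    for r in returns_pct:
        growth *= 1 + r / 100
    return (growth - 1) * 100

# Per-window out-of-sample returns (%) from the plain-pandas run above.
oos = [-9.08, -1.94, -4.38]

print(f"Simple average per window : {sum(oos) / len(oos):.2f} %")
print(f"Compounded over all three : {compound(oos):.2f} %")
```

The average describes a typical 100-bar stretch, while the compounded figure (about -14.7 % here) describes what the account would have done over the whole 300 bars. Both are honest; just be clear which one you are quoting.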
Try it yourself
- In 04_rolling_walk_forward.py, shrink train_len to 150 and test_len to 60. Do more, smaller windows make the picks more or less stable?
- Run the stability check on a different stock. Count the distinct picks - would you trust this strategy on that name?
- In the red-flag example, change the split from 0.6 to 0.5. Do the in-sample/out-of-sample gaps widen or narrow, and why might that be?
Recap
- Tuning and judging on the same data is self-deception; walk-forward tunes in-sample and grades out-of-sample.
- Never shuffle time-series data - the future must stay future (hence shuffle=False for ML splits).
- Rolling the window forward builds an honest track record from data the strategy never saw at decision time.
- Parameter stability across windows is a powerful, return-free sign of a real edge; jumpy parameters signal noise-fitting.
- The in-sample / out-of-sample gap, unstable parameters, too few trades, and fragility are the classic overfitting red flags.
- The more combos you test, the luckier your apparent winner - deflate your expectations and trust the out-of-sample record.
You now know how to test for an edge the honest way. In the final chapter we make a different leap: instead of hand-coding rules, we let a machine learn the rules from data - training a classifier and a neural network to read the market, then wiring the whole thing into a complete, runnable trading bot.