Calibration loop
How wrong is our backtest? We measure it.
Every night, the adaptive engine compares its backtest predictions against actual forward-test results for every strategy with 8 or more live trades. When a gap shows up, we diagnose it (slippage, entry timing, exit execution, regime mismatch, stale data) and update the model. Every adjustment is logged here.
Most backtesters do not publish this. The published numbers tend to come from the backtest, not from how the backtest compares to live execution. We do both and publish the gap.
Live calibration state
625 gap measurements logged · tracking since 4/23/2026
Entry delay (bars)
Number of bars added to entry timing in backtests to simulate the cron-cycle delay between signal and real-money execution. 0 = no delay needed.
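As a rough illustration of what an entry delay does inside a backtest, the sketch below moves a fill from the signal bar to a later bar. The `Bar` shape, the `delayedEntryPrice` name, and the choice to fill at the delayed bar's open are assumptions made for this example, not the engine's actual code.

```typescript
// Hypothetical bar shape; field names are assumptions for this sketch.
interface Bar {
  open: number;
  close: number;
}

/**
 * Simulated entry price for a signal fired on bars[signalIndex].
 * With a delay of 0 the signal bar's close is used (assumption);
 * otherwise the open of the bar `entryDelayBars` later is used.
 * Returns null when the required bar is outside the series.
 */
function delayedEntryPrice(
  bars: Bar[],
  signalIndex: number,
  entryDelayBars: number
): number | null {
  if (!Number.isInteger(entryDelayBars) || entryDelayBars < 0) {
    throw new Error("entryDelayBars must be a non-negative integer");
  }
  const signalBar = bars[signalIndex];
  if (signalBar === undefined) return null;
  if (entryDelayBars === 0) return signalBar.close;

  const fillBar = bars[signalIndex + entryDelayBars];
  return fillBar === undefined ? null : fillBar.open;
}
```

A larger delay generally means the backtest enters at a price further from the one that triggered the signal, which is the effect the calibration is trying to reproduce.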
Slippage multiplier
How much real fill prices differ from the prices the backtest assumed. 1.0 = backtest matches reality. >1.0 = real slippage is worse than the backtest assumed (the backtest was too optimistic).
slippage_mult:ranging_high_vol
slippage_mult:ranging_low_vol
slippage_mult:ranging_med_vol
slippage_mult:strong_trend_high_vol
slippage_mult:strong_trend_low_vol
slippage_mult:strong_trend_med_vol
slippage_mult:trending_high_vol
slippage_mult:trending_low_vol
slippage_mult:weak_trend_high_vol
slippage_mult:weak_trend_low_vol
slippage_mult:weak_trend_med_vol
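The per-regime keys above suggest a lookup keyed by regime name. The following is a minimal sketch of how such a multiplier might scale a backtest's assumed slippage. The `CalibrationState` type, the fallback to 1.0 for a missing key, and the treatment of slippage as a fraction of price are all assumptions for illustration.

```typescript
// Flat key/value calibration state, using the "slippage_mult:<regime>" key pattern.
// Storing it as a plain record is an assumption for this sketch.
type CalibrationState = Record<string, number>;

// Look up the multiplier for a regime. Falling back to 1.0 (no adjustment)
// when the key is missing or invalid is an assumption.
function slippageMultiplier(state: CalibrationState, regime: string): number {
  const value = state[`slippage_mult:${regime}`];
  return value !== undefined && Number.isFinite(value) && value > 0 ? value : 1.0;
}

// Scale the backtest's assumed slippage (a fraction of price) by the multiplier
// and return the adjusted fill price. Buys fill higher, sells fill lower.
function adjustedFillPrice(
  price: number,
  assumedSlippage: number,
  multiplier: number,
  side: "buy" | "sell"
): number {
  const slip = assumedSlippage * multiplier;
  return side === "buy" ? price * (1 + slip) : price * (1 - slip);
}
```

With a multiplier of 1.5 and an assumed slippage of 0.1%, a buy at 100 would be modelled as filling near 100.15 instead of 100.10. The 1.5 here is an illustrative number, not a measured value.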
Stop overshoot %
Average distance by which real stop fills overshoot the stop level, as a % of price. Used to calibrate stop-loss assumptions in backtests so reported P&L matches reality.
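To make the stop overshoot concrete, here is a small sketch that measures the average overshoot from observed stop fills and applies it to a simulated stop. The `StopFill` shape and both function names are hypothetical, and using the absolute distance from the stop level is an assumption.

```typescript
// Hypothetical record of one real stop execution.
interface StopFill {
  stopLevel: number;
  fillPrice: number;
}

// Average distance between fill and stop level, as a % of the stop level.
// Using the absolute distance is an assumption for this sketch.
function averageOvershootPct(fills: StopFill[]): number {
  if (fills.length === 0) return 0;
  let total = 0;
  for (const f of fills) {
    total += (Math.abs(f.fillPrice - f.stopLevel) / f.stopLevel) * 100;
  }
  return total / fills.length;
}

// Simulated stop fill: a long position's stop sells below the stop level,
// a short position's stop buys back above it.
function simulatedStopFill(
  stopLevel: number,
  overshootPct: number,
  side: "long" | "short"
): number {
  const overshoot = stopLevel * (overshootPct / 100);
  return side === "long" ? stopLevel - overshoot : stopLevel + overshoot;
}
```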
TP tightening factor
How much we tighten the take-profit threshold in backtests to match real-world execution. 1.0 = no adjustment needed. >1.0 = real fills happen later than backtest assumed.
Win-rate offset
Forward-test win rate minus backtest predicted win rate, averaged across calibrated genes. Positive = forward beats backtest. Negative = forward underperforms backtest predictions.
Knowledge map: what we actually know
We have 1086 graduated strategies across 7 market regimes, which means 7602 cells in the (strategy, regime) grid. Most of those cells have zero forward observations. Each row below shows how many cells have any data, how many have statistical confidence (≥8 trades), and how many have graduated to high-confidence (≥20 trades with positive posterior).
Why this matters: a strategy with zero trades in the current regime has only its backtest prior as evidence. The post-launch explore/exploit policy will reserve 30% of paper-trading slots for genes with high uncertainty in the current regime, deliberately filling these gaps over time. The table below shows the live state today.
The structural gap, stated clearly: we can only forward-test a fixed number of strategies concurrently (24 active slots). On 2026-04-30 a gate audit found that ~99% of previously-approved strategies had Deflated Sharpe fluke probabilities flagging them as noise. 549 of 595 prior approvals were retroactively retired. The bench is now 15 graduated strategies that clear the corrected gate. With auto-rotation (retire after 30 forward trades or 21 days active) and the bandit selector, throughput now matches real-signal throughput rather than approval-volume throughput.
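The slot budget and rotation rule described above can be sketched in a few lines. The constants come from the text (24 slots, 30% exploration, 30 trades, 21 days). The function names, rounding the exploration share down, and treating both rotation limits as inclusive are assumptions.

```typescript
const ACTIVE_SLOTS = 24; // concurrent forward-test slots
const EXPLORE_FRACTION = 0.3; // share reserved for high-uncertainty genes
const MAX_FORWARD_TRADES = 30; // rotate out after this many forward trades
const MAX_DAYS_ACTIVE = 21; // or after this many days active

// Split the slot budget between exploration and exploitation.
// Rounding the exploration share down is an assumption.
function splitSlots(
  totalSlots: number,
  exploreFraction: number
): { explore: number; exploit: number } {
  const explore = Math.floor(totalSlots * exploreFraction);
  return { explore, exploit: totalSlots - explore };
}

// Auto-rotation check: a strategy leaves its slot once either limit is reached.
function shouldRotateOut(forwardTrades: number, daysActive: number): boolean {
  return forwardTrades >= MAX_FORWARD_TRADES || daysActive >= MAX_DAYS_ACTIVE;
}

// Under these assumptions, 24 slots split into 7 explore and 17 exploit.
const slotPlan = splitSlots(ACTIVE_SLOTS, EXPLORE_FRACTION);
```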
| Regime | Observed | Confident (≥8) | Graduated (≥20) | Avg posterior |
|---|---|---|---|---|
| strong trend med vol | 6/542 (1.1%) | 0 | 0 | 0.592 |
| weak trend low vol | 6/542 (1.1%) | 0 | 0 | 0.592 |
| weak trend med vol | 6/542 (1.1%) | 0 | 0 | 0.593 |
| ranging high vol | 4/542 (0.7%) | 0 | 0 | 0.594 |
| ranging low vol (current) | 4/542 (0.7%) | 0 | 0 | 0.594 |
| trending high vol | 1/542 (0.2%) | 0 | 0 | 0.593 |
| trending low vol | 1/542 (0.2%) | 0 | 0 | 0.593 |
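The Observed, Confident, and Graduated columns in the table above follow from the per-cell trade counts and the thresholds stated earlier (any data, ≥8 trades, ≥20 trades with a positive posterior). A minimal sketch of that tally follows. The `Cell` shape is hypothetical, and "positive posterior" is passed in as a boolean because the page does not define how it is computed.

```typescript
// One (strategy, regime) cell. `posteriorPositive` is supplied by the caller,
// since the exact definition of a positive posterior is not given here.
interface Cell {
  trades: number;
  posteriorPositive: boolean;
}

interface RegimeSummary {
  total: number;
  observed: number; // cells with any forward trades
  confident: number; // cells with >= 8 trades
  graduated: number; // cells with >= 20 trades and a positive posterior
  observedPct: number; // observed / total, in percent
}

function summarizeRegime(cells: Cell[]): RegimeSummary {
  let observed = 0;
  let confident = 0;
  let graduated = 0;
  for (const cell of cells) {
    if (cell.trades > 0) observed += 1;
    if (cell.trades >= 8) confident += 1;
    if (cell.trades >= 20 && cell.posteriorPositive) graduated += 1;
  }
  const total = cells.length;
  const observedPct = total === 0 ? 0 : (observed / total) * 100;
  return { total, observed, confident, graduated, observedPct };
}
```

Note that in this sketch the columns are cumulative: a graduated cell is also counted as confident and observed.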
Performance by context
Where the engine actually performs vs where it does not. Sliced from 269 live v2 trades (overall avg -0.12% per trade). The CI-lower column is the pessimistic edge of a 90% confidence interval. Small slices with high variance show negative CI even when the average is positive, which is the correct read. As the dataset grows, per-slice gap-vs-backtest calibration becomes available.
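For readers who want to reproduce a CI-lower style figure on their own trade data, one common approach is a normal approximation around the sample mean. The sketch below assumes that method. The page does not state how its interval is computed, so treat `ciLower90` as an illustration and not as the engine's formula.

```typescript
// Lower edge of a two-sided 90% confidence interval for the mean per-trade
// return, using a normal approximation (an assumption for this sketch).
function ciLower90(returnsPct: number[]): number | null {
  const n = returnsPct.length;
  if (n === 0) return null;

  let sum = 0;
  for (const r of returnsPct) sum += r;
  const mean = sum / n;

  // With a single trade there is no spread to estimate, so the bound
  // collapses to the observation itself.
  if (n === 1) return mean;

  let squaredDeviations = 0;
  for (const r of returnsPct) squaredDeviations += (r - mean) * (r - mean);
  const sampleVariance = squaredDeviations / (n - 1);
  const standardError = Math.sqrt(sampleVariance / n);

  const Z_90 = 1.645; // standard normal quantile for a two-sided 90% interval
  return mean - Z_90 * standardError;
}
```

Because the standard error shrinks with the square root of N, a small slice with a positive average can still have a negative lower bound, which is the pattern described above.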
By coin
| Slice | N | Avg | CI low |
|---|---|---|---|
| ETH | 79 | +0.02% | -0.24% |
| XRP | 68 | -0.43% | -0.74% |
| BTC | 64 | +0.25% | +0.00% |
| BNB | 48 | -0.48% | -0.70% |
| SOL | 10 | +0.24% | -0.47% |
By 1h regime
| Slice | N | Avg | CI low |
|---|---|---|---|
| weak trend low vol | 77 | -0.59% | -0.81% |
| ranging low vol | 67 | +0.04% | -0.18% |
| weak trend med vol | 58 | -0.02% | -0.29% |
| strong trend med vol | 17 | -0.90% | -1.16% |
| strong trend low vol | 17 | +1.19% | +0.89% |
| ranging high vol | 12 | +1.25% | +0.63% |
| ranging med vol | 10 | -1.05% | -2.49% |
| unlabeled | 5 | +1.41% | +0.01% |
| weak trend high vol | 3 | -0.12% | -0.32% |
| trending high vol | 2 | -1.35% | -1.54% |
| trending low vol | 1 | -1.83% | -1.83% |
By timeframe
| Slice | N | Avg | CI low |
|---|---|---|---|
| 5m | 219 | -0.18% | -0.33% |
| 1h | 37 | +0.19% | -0.17% |
| 15m | 13 | -0.05% | -0.17% |
Predictions tracked across the product
The calibration loop above is one self-correction system. We are extending the same pattern (claim, observe, gap, diagnose, act) to every system that makes a prediction or claim: the /prove verdict, the AI verdict text, the reasoning narratives, the reliability score, and so on. Each row is a domain we are now instrumenting publicly. Resolution takes time: most loops have a 30+ day window between claim and outcome, so most rows below will show 0 resolved for the first month.
| Domain | Total | Resolved | Pending | Earliest pending | Next resolves |
|---|---|---|---|---|---|
| prove_verdict | 402 | 0 | 402 | 4/25/2026 | 5/25/2026 |
| ai_verdict_text | 294 | 0 | 294 | 4/25/2026 | — |
Schema, helper library, and full domain list at src/lib/predictions.ts in the codebase. Adding a new self-correction loop is a single recordPrediction() call.
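The real schema and signature live in src/lib/predictions.ts and are not reproduced on this page. Purely to illustrate the claim, observe, resolve pattern, here is a hypothetical in-memory sketch. Every name and field below (`PredictionRecord`, `recordPredictionSketch`, `resolvePredictionSketch`, `summarizeDomain`) is invented for this example and may differ from the actual helper.

```typescript
// Hypothetical record shape; the real schema may differ.
interface PredictionRecord {
  id: number;
  domain: string;
  claim: string;
  madeAt: Date;
  resolvesAt: Date | null; // null when no resolution date is scheduled
  outcome: string | null; // null while pending
}

const predictionStore: PredictionRecord[] = [];
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Log a claim at the moment it is made.
function recordPredictionSketch(
  domain: string,
  claim: string,
  madeAt: Date,
  resolveAfterDays: number | null
): PredictionRecord {
  const resolvesAt =
    resolveAfterDays === null
      ? null
      : new Date(madeAt.getTime() + resolveAfterDays * MS_PER_DAY);
  const record: PredictionRecord = {
    id: predictionStore.length + 1,
    domain,
    claim,
    madeAt,
    resolvesAt,
    outcome: null,
  };
  predictionStore.push(record);
  return record;
}

// Attach the observed outcome once it is known. Returns false for unknown ids.
function resolvePredictionSketch(id: number, outcome: string): boolean {
  for (const record of predictionStore) {
    if (record.id === id) {
      record.outcome = outcome;
      return true;
    }
  }
  return false;
}

// Counts matching the Total / Resolved / Pending columns for one domain.
function summarizeDomain(domain: string): { total: number; resolved: number; pending: number } {
  let total = 0;
  let resolved = 0;
  for (const record of predictionStore) {
    if (record.domain !== domain) continue;
    total += 1;
    if (record.outcome !== null) resolved += 1;
  }
  return { total, resolved, pending: total - resolved };
}
```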
Recent gap measurements
13 forward trades, P&L gap -1.062%, win rate gap -61.5pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
20 forward trades, P&L gap -0.788%, win rate gap -58.9pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
12 forward trades, P&L gap -0.653%, win rate gap -56.8pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
29 forward trades show a small gap (+0.513% P&L, -7.3pp win rate). Within noise; no calibration adjustment needed.
20 forward trades, P&L gap +0.396%, win rate gap -17.0pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
15 forward trades, P&L gap +0.405%, win rate gap -13.5pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
8 forward trades, P&L gap -1.424%, win rate gap -57.0pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
9 forward trades, P&L gap -0.964%, win rate gap -49.2pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
Across 9 forward trades the live P&L came in -0.253% vs backtest, with win rate 33% (vs 60% predicted). Diagnosed as slippage drift: backtest was too optimistic on fill prices. Action: adjust_backtest.
19 forward trades, P&L gap +0.240%, win rate gap -8.6pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
13 forward trades, P&L gap +0.372%, win rate gap -4.0pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
18 forward trades, P&L gap +0.275%, win rate gap +6.5pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
9 forward trades, P&L gap +0.040%, win rate gap +4.1pp. Diagnosis: aligned. Action: none.
15 forward trades, P&L gap +0.191%, win rate gap -3.7pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
15 forward trades, P&L gap +0.284%, win rate gap -2.3pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
15 forward trades, P&L gap +0.292%, win rate gap -3.5pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
11 forward trades, P&L gap -0.051%, win rate gap -3.6pp. Diagnosis: time_exits_dominating. Action: review_regime_or_tighten_tp.
Across 8 forward trades the live P&L came in -0.124% vs backtest, with win rate 38% (vs 49% predicted). Diagnosed as slippage drift: backtest was too optimistic on fill prices. Action: adjust_backtest.
15 forward trades, P&L gap -2.716%, win rate gap -49.5pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
Win rate dropped from 46% in backtest to 33% in 72 forward trades (-12.4pp). Entries are happening at worse prices than backtest assumed. Action: adjust_backtest.
33 forward trades show a small gap (-0.017% P&L, -6.3pp win rate). Within noise; no calibration adjustment needed.
24 forward trades, P&L gap +0.021%, win rate gap -26.4pp. Diagnosis: insufficient_validation_sample. Action: investigate_backtest.
13 forward trades, P&L gap -1.062%, win rate gap -61.5pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
20 forward trades, P&L gap -0.788%, win rate gap -58.9pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
12 forward trades, P&L gap -0.653%, win rate gap -56.8pp. Diagnosis: overfit_or_lookahead_suspected. Action: investigate_backtest.
Methodology
Trigger. The calibrator runs at 02:00 UTC daily. For every active gene with 8 or more forward-test trades, it pulls the original backtest result and the live trade history and computes the gap on five dimensions: average P&L per trade, win rate, take-profit hit rate, time-exit rate, and average bars held.
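The five-dimension gap described in the Trigger step can be sketched as a simple field-by-field difference. The `StrategyStats` and `Gap` shapes are hypothetical. Computing each gap as forward minus backtest follows the win-rate offset definition earlier on this page. Expressing rates as fractions and rate gaps in percentage points is an assumption.

```typescript
// Hypothetical summary statistics for one gene, from either source.
interface StrategyStats {
  avgPnlPct: number; // average P&L per trade, in %
  winRate: number; // fraction, 0..1
  tpHitRate: number; // fraction, 0..1
  timeExitRate: number; // fraction, 0..1
  avgBarsHeld: number;
}

// Forward-minus-backtest gap; rate gaps are in percentage points (pp).
interface Gap {
  pnlGapPct: number;
  winRateGapPp: number;
  tpHitRateGapPp: number;
  timeExitRateGapPp: number;
  barsHeldGap: number;
}

const MIN_FORWARD_TRADES = 8; // eligibility threshold from the text

// Returns null when the gene does not yet have enough forward trades.
function computeGap(
  backtest: StrategyStats,
  forward: StrategyStats,
  forwardTrades: number
): Gap | null {
  if (forwardTrades < MIN_FORWARD_TRADES) return null;
  return {
    pnlGapPct: forward.avgPnlPct - backtest.avgPnlPct,
    winRateGapPp: (forward.winRate - backtest.winRate) * 100,
    tpHitRateGapPp: (forward.tpHitRate - backtest.tpHitRate) * 100,
    timeExitRateGapPp: (forward.timeExitRate - backtest.timeExitRate) * 100,
    barsHeldGap: forward.avgBarsHeld - backtest.avgBarsHeld,
  };
}
```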
Diagnosis. Gaps are categorized by signature. A P&L drop with a similar win rate suggests slippage drift. A win-rate drop suggests entry timing problems (cron delay, gate misfire). A lower TP hit rate suggests exit execution lag. A pattern mismatch with the regime suggests the backtest tested a different market structure from the current one.
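A signature-based diagnosis of this kind is essentially a small rule table. The sketch below shows one way to encode the four signatures from the Diagnosis step. The tolerance values, the order in which rules are checked, the label names, and the `regimeMatchesBacktest` flag are all hypothetical. The production calibrator's labels and thresholds are not given on this page.

```typescript
type Diagnosis =
  | "aligned"
  | "slippage_drift"
  | "entry_timing"
  | "exit_execution_lag"
  | "regime_mismatch";

// Inputs to the rule table; gaps are forward minus backtest.
interface GapSignature {
  pnlGapPct: number;
  winRateGapPp: number;
  tpHitRateGapPp: number;
  regimeMatchesBacktest: boolean; // supplied by the caller (hypothetical)
}

// Hypothetical tolerances below which a gap is treated as noise.
const PNL_TOLERANCE_PCT = 0.1;
const WIN_RATE_TOLERANCE_PP = 5;
const TP_RATE_TOLERANCE_PP = 5;

function diagnose(gap: GapSignature): Diagnosis {
  // The backtest covered a different market structure than the current one.
  if (!gap.regimeMatchesBacktest) return "regime_mismatch";
  // Win rate fell: entries are likely landing at worse times or prices.
  if (gap.winRateGapPp < -WIN_RATE_TOLERANCE_PP) return "entry_timing";
  // Take-profit is hit less often: exits are executing late.
  if (gap.tpHitRateGapPp < -TP_RATE_TOLERANCE_PP) return "exit_execution_lag";
  // P&L fell while win rate stayed similar: fills are worse than assumed.
  if (gap.pnlGapPct < -PNL_TOLERANCE_PCT) return "slippage_drift";
  return "aligned";
}
```

Checking the win-rate rule before the P&L rule is what makes "P&L drop with similar win rate" the slippage case: by the time the P&L rule runs, a large win-rate drop has already been ruled out.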
Action. Slippage drift updates the global slippage multiplier, which feeds the next round of backtests. Entry timing adds entry-delay bars to backtest assumptions. Regime mismatches narrow the gene's activation window rather than adjusting the global model.
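The split between global and per-gene adjustments in the Action step can be sketched as a dispatch on the diagnosis. The type names, the 10% multiplier step, and the one-bar delay increment are hypothetical values chosen for illustration. The page does not state the real step sizes.

```typescript
type ActionDiagnosis = "slippage_drift" | "entry_timing" | "regime_mismatch" | "aligned";

// Global model parameters that feed the next round of backtests.
interface GlobalCalibration {
  slippageMultiplier: number;
  entryDelayBars: number;
}

// Per-gene activation window, expressed as the regimes it may trade in.
interface GeneWindow {
  id: string;
  activeRegimes: string[];
}

const SLIPPAGE_STEP = 1.1; // hypothetical: raise the multiplier by 10%
const DELAY_STEP_BARS = 1; // hypothetical: add one bar of entry delay

// Returns updated copies; the inputs are not mutated.
function applyAction(
  diagnosis: ActionDiagnosis,
  global: GlobalCalibration,
  gene: GeneWindow,
  currentRegime: string
): { global: GlobalCalibration; gene: GeneWindow } {
  switch (diagnosis) {
    case "slippage_drift":
      return {
        global: { ...global, slippageMultiplier: global.slippageMultiplier * SLIPPAGE_STEP },
        gene,
      };
    case "entry_timing":
      return {
        global: { ...global, entryDelayBars: global.entryDelayBars + DELAY_STEP_BARS },
        gene,
      };
    case "regime_mismatch":
      // Narrow only this gene's window; the global model is left untouched.
      return {
        global,
        gene: { ...gene, activeRegimes: gene.activeRegimes.filter((r) => r !== currentRegime) },
      };
    default:
      return { global, gene };
  }
}
```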
Honesty principle. When the gap is positive (forward beats backtest, like in the recent RSI(7) entries) we log it as a minor gap and take no action. Being right by accident is still being miscalibrated. Same scrutiny in both directions.