Regime classifier validity

Every night we run a statistical test on every strategy: does its mean P&L actually differ across market regimes? If our regime labels are noise, this page shows it. If they're real, it shows that too.

Latest run: 2026-05-30 (UTC) • 39 nightly runs on file • Method: Welch's t-test (p<0.05 significance threshold).

Is the classifier load-bearing?

13.5% of regime-pair comparisons show statistically different P&L

2.7× more discrimination than the 5% chance threshold

2,623 pair tests across 40 strategies

If the classifier produced random labels, only about 5% of pair-tests would pass p<0.05 by chance. The observed 13.5% is 2.7 times that rate, which indicates that the labels carry real information about conditional strategy performance. A drop below ~10% would be the signal that the classifier needs to be rebuilt.
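The headline arithmetic can be reproduced from the counts on this page. The sketch below is only a sanity check, not the validator's own code: the count of 354 significant results is the sum of the three per-timeframe counts reported further down, and the z-score uses a normal approximation that assumes the pair tests are independent, which they are not (tests that share a strategy or a regime are correlated), so it overstates the evidence.

```python
import math

# Counts taken from this page: 38 + 183 + 133 significant results
# (the three per-timeframe counts) out of 2,623 pair tests in total.
significant = 38 + 183 + 133
total_tests = 2623
alpha = 0.05  # share expected to pass by chance under random labels

rate = significant / total_tests
lift = rate / alpha

# Normal approximation to a one-sided binomial test of rate > alpha.
# ASSUMPTION: pair tests are treated as independent. They are not, so this
# z-score is an upper bound on how strong the evidence really is.
std_err = math.sqrt(alpha * (1 - alpha) / total_tests)
z_score = (rate - alpha) / std_err

print(f"significant share: {rate:.1%}")   # 13.5%
print(f"lift over chance:  {lift:.1f}x")  # 2.7x
print(f"z-score vs 5%:     {z_score:.1f}")
```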

Significance % over time

Nightly share of strategy-regime pair-tests that came back statistically significant (p<0.05). The 5% line marks chance. If the series trends down toward that line, the classifier is losing discriminating power and needs rebuilding.

[Chart: nightly significance % over time, with one series each for the 1-hour, 4-hour, and Daily regimes]
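A monitoring rule like the one described here could be sketched as follows. Everything in this block is an assumption made for illustration: the function name, the 7-night averaging window, and the status labels are hypothetical, and the sample series is made up rather than taken from the real nightly runs. Only the 5% chance level and the ~10% rebuild signal come from this page.

```python
from statistics import mean

CHANCE_RATE = 0.05    # share expected from random labels (from this page)
WARNING_RATE = 0.10   # approximate rebuild signal named on this page

def classifier_status(nightly_rates, window=7):
    """Label the recent trend from nightly significant shares (oldest first).

    The 7-night window is an assumed smoothing choice, not the page's method.
    """
    if not nightly_rates:
        raise ValueError("need at least one nightly run")
    recent = mean(nightly_rates[-window:])
    if recent <= CHANCE_RATE:
        return "at chance"
    if recent < WARNING_RATE:
        return "rebuild warning"
    return "healthy"

# Made-up values for illustration only; NOT the real nightly history.
example_rates = [0.150, 0.142, 0.138, 0.131, 0.136, 0.129, 0.135]
print(classifier_status(example_rates))
```

Averaging over a short window keeps a single noisy night from triggering a rebuild decision.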

Which timeframe's regime separates strategies best?

Same test, broken down by which higher-timeframe regime the trade was tagged with. The timeframe with the highest share of significant tests is the one that matters most for strategy selection right now.

Window: 39 nightly runs so far.

Regime timeframe	Significant %	Significant tests	Total tests
1-hour	15.9%	183	1,149
4-hour	13.1%	133	1,012
Daily	8.2%	38	462
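The per-timeframe figures above can be checked and ranked with a few lines. This is an illustrative sketch using only the counts shown on this page; the helper name `significant_share` is hypothetical.

```python
# (significant, total) pair tests per regime timeframe, copied from this page.
counts = {
    "1-hour": (183, 1149),
    "4-hour": (133, 1012),
    "daily": (38, 462),
}

def significant_share(significant: int, total: int) -> float:
    """Fraction of pair tests with p < 0.05; NaN when nothing was tested."""
    return significant / total if total else float("nan")

shares = {tf: significant_share(s, n) for tf, (s, n) in counts.items()}
best_timeframe = max(shares, key=shares.get)

# The three timeframes together should account for every pair test.
pooled_significant = sum(s for s, _ in counts.values())
pooled_total = sum(n for _, n in counts.values())

for tf, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tf:>6}: {share:.1%}")
print(f"pooled: {pooled_significant} of {pooled_total}")
print(f"best separating timeframe: {best_timeframe}")
```

The pooled total matches the 2,623 pair tests in the headline, so the three timeframes together account for every test in the overall figure.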

Strategies with the strongest regime-conditional edge

Biggest measurable performance gaps between regimes. These are the strategies where deploying in one regime vs another changes the outcome materially.

Strategy	Regime A	Mean A	n_A	vs. Regime B	Mean B	n_B	Gap	p
ema-13-80-v2	trending_high_vol	-1.709%	3	weak_trend_high_vol	+2.331%	7	-4.040%	0.038
ema-13-80-v1	trending_high_vol	-2.147%	5	weak_trend_high_vol	+1.398%	10	-3.545%	0.004
ema-13-80-v2	weak_trend_low_vol	-0.543%	19	weak_trend_high_vol	+2.331%	7	-2.874%	0.008
ema-13-80-v2	weak_trend_med_vol	-0.350%	14	weak_trend_high_vol	+2.331%	7	-2.681%	0.010
ema-13-80-v1	trending_high_vol	-2.147%	5	ranging_med_vol	+0.444%	21	-2.591%	0.013
ema-13-80-v2	ranging_low_vol	-0.192%	19	weak_trend_high_vol	+2.331%	7	-2.523%	0.014
ema-13-80-v1	trending_high_vol	-2.147%	5	strong_trend_low_vol	+0.372%	8	-2.519%	0.048
ema-13-80-v3	trending_low_vol	-0.828%	5	weak_trend_high_vol	+1.569%	9	-2.397%	0.021
ema-slow	trending_low_vol	-1.284%	3	ranging_med_vol	+0.962%	6	-2.246%	0.001
ema-13-80-v1	trending_low_vol	-0.828%	5	weak_trend_high_vol	+1.398%	10	-2.226%	0.022
D_mfi_14_30_williams_r_-85_tp1.25_b120	weak_trend_low_vol	-1.662%	3	weak_trend_med_vol	+0.494%	3	-2.156%	0.028
ema-13-80-v3	ranging_low_vol	-0.486%	56	weak_trend_high_vol	+1.569%	9	-2.055%	0.027

Gap = Mean A − Mean B. A positive gap means regime A is better for that strategy; every row above has a negative gap, so regime B is the better regime in each. These figures include the v1 baseline portfolio because the v2 research genes still have small samples. The table will shift to v2 as forward-test trade counts grow.
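As a small worked example of the sign convention, the first table row can be recomputed directly. The function names below are hypothetical and exist only for this illustration.

```python
def regime_gap(mean_a: float, mean_b: float) -> float:
    """Gap = Mean A - Mean B, in percentage points."""
    return mean_a - mean_b

def better_regime(regime_a: str, mean_a: float, regime_b: str, mean_b: float):
    """Return the regime with the higher mean P&L, or None on a tie."""
    gap = regime_gap(mean_a, mean_b)
    if gap > 0:
        return regime_a
    if gap < 0:
        return regime_b
    return None

# First row of the table: ema-13-80-v2, trending_high_vol vs weak_trend_high_vol.
gap = regime_gap(-1.709, 2.331)
print(f"{gap:+.3f}%")  # -4.040%
print(better_regime("trending_high_vol", -1.709, "weak_trend_high_vol", 2.331))
```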

Which regime boundaries matter most?

Pairs of regime labels that most often produced statistically different strategy P&L. Pairs near the top are the real regime boundaries. Pairs that rarely appear here are candidates for merging, because they aren't doing distinct work.

Regime A	Regime B	Strategies that differ	Avg gap
trending_high_vol	weak_trend_high_vol	20	+0.744%
trending_high_vol	ranging_med_vol	18	+0.541%
ranging_low_vol	weak_trend_high_vol	16	+0.586%
trending_high_vol	weak_trend_low_vol	16	+0.368%
trending_high_vol	weak_trend_med_vol	14	+0.343%
ranging_low_vol	weak_trend_low_vol	13	+0.213%
trending_high_vol	strong_trend_med_vol	12	+0.389%
ranging_low_vol	ranging_med_vol	12	+0.334%
trending_high_vol	strong_trend_low_vol	11	+0.610%
ranging_low_vol	weak_trend_med_vol	11	+0.126%
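One way this kind of ranking could be produced is to group the significant results by unordered regime pair and count distinct strategies. The sketch below runs on the 12 rows of the top-gap table from this page, so its counts are far smaller than the figures listed here, which cover every significant test. Averaging the absolute gap is an assumption: the page does not say how "avg gap" is defined.

```python
from collections import defaultdict

# (strategy, regime A, regime B, gap in %) from the 12-row top-gap table on this page.
rows = [
    ("ema-13-80-v2", "trending_high_vol", "weak_trend_high_vol", -4.040),
    ("ema-13-80-v1", "trending_high_vol", "weak_trend_high_vol", -3.545),
    ("ema-13-80-v2", "weak_trend_low_vol", "weak_trend_high_vol", -2.874),
    ("ema-13-80-v2", "weak_trend_med_vol", "weak_trend_high_vol", -2.681),
    ("ema-13-80-v1", "trending_high_vol", "ranging_med_vol", -2.591),
    ("ema-13-80-v2", "ranging_low_vol", "weak_trend_high_vol", -2.523),
    ("ema-13-80-v1", "trending_high_vol", "strong_trend_low_vol", -2.519),
    ("ema-13-80-v3", "trending_low_vol", "weak_trend_high_vol", -2.397),
    ("ema-slow", "trending_low_vol", "ranging_med_vol", -2.246),
    ("ema-13-80-v1", "trending_low_vol", "weak_trend_high_vol", -2.226),
    ("D_mfi_14_30_williams_r_-85_tp1.25_b120", "weak_trend_low_vol", "weak_trend_med_vol", -2.156),
    ("ema-13-80-v3", "ranging_low_vol", "weak_trend_high_vol", -2.055),
]

def boundary_summary(rows):
    """Return (pair, distinct strategies, mean |gap|), most frequent pair first."""
    strategies = defaultdict(set)
    abs_gaps = defaultdict(list)
    for strategy, regime_a, regime_b, gap in rows:
        pair = tuple(sorted((regime_a, regime_b)))  # A-vs-B equals B-vs-A
        strategies[pair].add(strategy)
        abs_gaps[pair].append(abs(gap))  # ASSUMPTION: average of absolute gaps
    summary = [
        (pair, len(strategies[pair]), sum(abs_gaps[pair]) / len(abs_gaps[pair]))
        for pair in strategies
    ]
    return sorted(summary, key=lambda item: (-item[1], -item[2]))

summary = boundary_summary(rows)
for (regime_a, regime_b), n_strategies, avg_gap in summary:
    print(f"{regime_a} vs {regime_b}: {n_strategies} strategies, avg |gap| {avg_gap:.3f}%")
```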

How this works

Our classifier tags every trade with the market regime at entry (and exit) on 1h, 4h, and 1d timeframes. Labels include trending_low_vol, ranging_high_vol, etc.
For every strategy with enough trades, we group its P&L by the regime label at entry and compute mean and variance per regime.
For every pair of regimes with ≥3 trades each, we run Welch's t-test to check whether mean P&L differs between the two regimes. p<0.05 means a gap that large would be expected less than 5% of the time if the two regimes had the same mean.
If the share of significant pair-tests is well above 5%, the classifier is producing labels that carry real information. A share at or near 5% is what random labels would produce, and would mean the classifier needs rebuilding.
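The steps above can be sketched in a few dozen lines. This is a Python illustration under stated assumptions, not the private JavaScript validator: all function names are hypothetical, the p-value comes from numerically integrating the Student's t density with Simpson's rule (chosen here to keep the example dependency-free), and the sample trades are made up. With SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` computes the same test.

```python
import math
from itertools import combinations
from statistics import mean, variance

MIN_TRADES = 3  # per-regime minimum stated on this page
ALPHA = 0.05    # significance threshold stated on this page

def t_two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value for Student's t, via Simpson's rule on the density."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2 * total * h / 3  # P(-|t| < T < |t|), using symmetry
    return max(0.0, 1.0 - central_mass)

def welch_t_test(a, b):
    """Return (t, df, p) for Welch's unequal-variance t-test, or None."""
    n_a, n_b = len(a), len(b)
    se2_a, se2_b = variance(a) / n_a, variance(b) / n_b
    if se2_a + se2_b == 0:
        return None  # both samples are constant, so the statistic is undefined
    t_stat = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df, t_two_sided_p(t_stat, df)

def significant_pairs(pnl_by_regime):
    """Test every regime pair with enough trades; return (tested, significant)."""
    eligible = {r: v for r, v in pnl_by_regime.items() if len(v) >= MIN_TRADES}
    tested, significant = 0, []
    for regime_a, regime_b in combinations(sorted(eligible), 2):
        result = welch_t_test(eligible[regime_a], eligible[regime_b])
        if result is None:
            continue
        tested += 1
        if result[2] < ALPHA:
            significant.append((regime_a, regime_b, result[2]))
    return tested, significant

# Made-up per-trade P&L (%) for one strategy, grouped by regime at entry.
example = {
    "trending_low_vol": [-1.2, -0.8, -1.5, -0.9],
    "ranging_med_vol": [0.9, 1.1, 0.7, 1.3],
    "ranging_high_vol": [0.4, -0.2],  # skipped: fewer than 3 trades
}
tested, significant = significant_pairs(example)
print(f"{len(significant)} of {tested} pair tests significant")
```

Summing `tested` and `len(significant)` over all strategies would give the denominator and numerator of the page's headline percentage.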

The validator runs at 03:00 UTC nightly from src/worker/adaptive-v2/regime-validator.js (private repo). The full method is documented above.

Not financial advice. Regime labels are descriptive, not predictive. A strategy that has performed well in one regime historically may not continue to do so. Past performance does not guarantee future results.