The brutal evaluation
Most ML evaluation runs on data whose shape the model has already seen. You train on 2018-2024, you test on 2025. The distribution is similar. The model looks great. Then you ship and watch a 2026 macro event blow through every backtest assumption.
What we changed
We added a deliberately hostile eval set we call the brutal pack. The construction rule:
If a setup looks high-probability on the surface (clean breakout, rising volume, momentum scan match) but the realized outcome was a loss, it goes in.
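A minimal sketch of that rule as a filter. The `Setup` fields and the "any one surface signal counts" threshold are illustrative assumptions, not our production schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    """One historical setup. All field names here are hypothetical."""
    setup_id: str
    clean_breakout: bool
    rising_volume: bool
    momentum_scan_match: bool
    realized_r: float  # realized outcome in R-multiples; negative means a loss


def looks_high_probability(s: Setup) -> bool:
    # Assumption: a single surface signal is enough to "look good".
    return s.clean_breakout or s.rising_volume or s.momentum_scan_match


def build_brutal_pack(setups):
    """Keep setups that looked good on the surface but realized a loss."""
    return [s for s in setups if looks_high_probability(s) and s.realized_r < 0]


history = [
    Setup("a", True, True, False, -1.0),    # looked good, lost: goes in
    Setup("b", True, False, False, 2.0),    # looked good, won: stays out
    Setup("c", False, False, False, -0.5),  # never looked good: stays out
]
brutal_pack = build_brutal_pack(history)
```

Only setup `a` survives the filter here: it is the only one that both looked attractive and lost.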
Cherry-picked? Yes. On purpose. The point isn't to estimate average performance; the point is to stress the abstention edge. A 60% win rate on the full population is meaningless if the 40% losses are all clustered in the same regime.
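To make the clustering point concrete, here is a toy calculation with invented numbers (the regime labels `trend` and `shock` are placeholders): ten trades, 60% winners overall, with every loss in a single regime.

```python
from collections import defaultdict


def win_rate_by_regime(trades):
    """trades: iterable of (regime_label, won) pairs."""
    counts = defaultdict(lambda: [0, 0])  # regime -> [wins, total]
    for regime, won in trades:
        counts[regime][0] += int(won)
        counts[regime][1] += 1
    return {regime: wins / total for regime, (wins, total) in counts.items()}


# Invented data: 6 wins in one regime, 4 losses in another.
trades = [("trend", True)] * 6 + [("shock", False)] * 4

overall = sum(won for _, won in trades) / len(trades)
per_regime = win_rate_by_regime(trades)
```

The headline number is 0.6, but the per-regime split is 1.0 in `trend` and 0.0 in `shock`. The average says nothing about what happens when the regime flips.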
What we measure
- Brutal pass rate: of the hostile-set cases, what fraction did the cluster correctly mark as NO_TRADE? Target: ≥ 85%.
- Brutal slippage: of the cases where the cluster did execute, what was the realized R-multiple? Target: not statistically worse than the full population.
- Calibration drift on brutal: does the model's confidence stay calibrated under the hostile distribution, or does it spike into overconfidence?
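One way these three measurements could be computed, as a sketch rather than our actual harness. The function names are made up for illustration, and the choice of a permutation test for "not statistically worse" and of expected calibration error (ECE) for drift are assumptions; other tests and calibration metrics would work too.

```python
import random
from statistics import mean


def brutal_pass_rate(decisions):
    """Fraction of brutal-pack cases where the decision was NO_TRADE."""
    return sum(d == "NO_TRADE" for d in decisions) / len(decisions)


def one_sided_permutation_p(brutal_r, population_r, n_iter=2000, seed=0):
    """P-value for 'executed brutal trades have a lower mean R than the
    full population'. A small value suggests they are worse."""
    rng = random.Random(seed)
    observed = mean(population_r) - mean(brutal_r)
    pooled = list(brutal_r) + list(population_r)
    k = len(brutal_r)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into pseudo-brutal and pseudo-population.
        if mean(pooled[k:]) - mean(pooled[:k]) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)


def expected_calibration_error(confidences, outcomes, n_bins=5):
    """Weighted mean of |average confidence - hit rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = mean(c for c, _ in b)
            hit_rate = mean(y for _, y in b)
            ece += len(b) / total * abs(avg_conf - hit_rate)
    return ece


# Invented inputs, for illustration only.
pass_rate = brutal_pass_rate(["NO_TRADE"] * 9 + ["TRADE"])
p_value = one_sided_permutation_p([-1.0] * 10, [1.0] * 10)
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
```

Calibration drift is then the ECE on the brutal pack minus the ECE on the standard holdout. A large positive gap, with confidence running above the hit rate, is the overconfidence spike.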
What "92% should be NO_TRADE" means
Across our holdout test (not the brutal pack, just standard out-of-sample data), we observe that ~92% of all bar-by-bar evaluations have no positive expected value after slippage. The market is mostly chop. The minority of bars that are tradeable have to clear a high bar: calibration, conviction, governor checks, earnings veto, regime check.
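The "every gate must pass" logic can be sketched as a single conjunction. The field names and the `MIN_CONVICTION` threshold below are hypothetical stand-ins for the checks listed above, and the real gates are more involved than booleans:

```python
from dataclasses import dataclass

MIN_CONVICTION = 0.7  # hypothetical threshold, chosen only for the example


@dataclass(frozen=True)
class BarEvaluation:
    """Gate results for one bar. All field names are illustrative."""
    calibrated: bool
    conviction: float
    governor_ok: bool
    earnings_veto: bool
    regime_ok: bool


def decide(bar: BarEvaluation) -> str:
    # A bar is tradeable only if every gate passes; any failure means NO_TRADE.
    checks = (
        bar.calibrated,
        bar.conviction >= MIN_CONVICTION,
        bar.governor_ok,
        not bar.earnings_veto,
        bar.regime_ok,
    )
    return "TRADE" if all(checks) else "NO_TRADE"


bars = [
    BarEvaluation(True, 0.9, True, False, True),   # clears every gate
    BarEvaluation(True, 0.9, True, True, True),    # blocked by earnings veto
    BarEvaluation(True, 0.4, True, False, True),   # conviction too low
    BarEvaluation(False, 0.9, True, False, True),  # fails calibration
]
decisions = [decide(b) for b in bars]
no_trade_share = decisions.count("NO_TRADE") / len(decisions)
```

Because the gates are conjunctive, each additional check can only shrink the tradeable set. That is how a NO_TRADE share as high as ~92% arises on real data.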
This is the opposite of how trading "AI" is usually marketed. The honest answer is: most of the time, the right move is no move.
If your tool generates 50 signals a day, it's selling you signals — not edge.