Calibrated conviction beats raw confidence
"The model is 87% confident this is a BUY." Great. Out of all the times it said 87%, how often was it actually right?
That question is the whole game. Without an answer to it, the 87% is a vibe. With one, you have a probability you can size to.
What calibration means
A model is calibrated when its stated confidence matches its empirical hit rate. If it says 70% across 1,000 cases, ~700 should resolve as predicted. If it says 95% and only 60% resolve, it's overconfident — and any position sizing tied to that 95% is broken.
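To make "stated confidence matches empirical hit rate" concrete, here is a minimal Python sketch that buckets predictions by stated confidence and compares each bucket's average confidence with how often it was right. The function name `reliability_table`, the ten equal-width bins, and the toy data are assumptions made for illustration, not a description of our eval pack.

```python
from collections import defaultdict


def reliability_table(confidences, outcomes, n_bins=10):
    """Group predictions into equal-width confidence bins and compare the
    mean stated confidence with the empirical hit rate in each bin.

    confidences: floats in [0, 1]; outcomes: 1 if the call resolved as
    predicted, 0 otherwise.
    """
    bins = defaultdict(list)
    for conf, hit in zip(confidences, outcomes):
        # Clamp conf == 1.0 into the top bin instead of creating an extra one.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    rows = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_conf = sum(c for c, _ in pairs) / len(pairs)
        hit_rate = sum(h for _, h in pairs) / len(pairs)
        rows.append({
            "bin": idx,
            "n": len(pairs),
            "mean_conf": mean_conf,
            "hit_rate": hit_rate,
            # Positive gap = overconfident, negative gap = underconfident.
            "gap": mean_conf - hit_rate,
        })
    return rows


# Toy data (invented): the model said 0.95 a hundred times and was right 60 times.
for row in reliability_table([0.95] * 100, [1] * 60 + [0] * 40):
    print(row)
```

On this toy data the single populated bin reports a hit rate of 0.60 against a stated confidence of about 0.95, a gap of roughly 0.35. That is the overconfident case described above.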
Out-of-the-box LLM softmax scores are generally not calibrated. They're tuned to pick the right answer, not to report an honest probability for it. Sizing on them as if they were probabilities leaks edge on every trade.
How we calibrate
We don't ship the raw decoder output. Every probability the cluster surfaces to your app passes through a post-hoc calibration layer trained on holdout outcomes. The technique we use is well-studied; we won't say which one here, because that would let a competitor copy our setup without the boring parts that make it work. The honest version:
- We measure calibration on out-of-sample data, not training data.
- We publish the gap (the difference between predicted and actual hit rate) in our internal eval pack.
- When the gap exceeds 5% in any confidence bin, the head doesn't ship.
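The three rules above can be sketched end to end. The calibrator below is histogram binning, picked only because it fits in a few lines; it is a stand-in, not the method we use. The names `fit_histogram_calibrator`, `max_bin_gap`, and `head_can_ship` are hypothetical. Reading "5%" as 5 percentage points of absolute gap is an assumption, and so are the ten equal-width bins and the toy data.

```python
def bin_index(p, n_bins=10):
    # Equal-width bins over [0, 1]; p == 1.0 is clamped into the top bin.
    return min(int(p * n_bins), n_bins - 1)


def fit_histogram_calibrator(raw_scores, outcomes, n_bins=10):
    """Histogram binning (a stand-in technique): replace each raw score
    with the hit rate observed for its bin on the fitting split."""
    totals = [0] * n_bins
    hits = [0] * n_bins
    for score, hit in zip(raw_scores, outcomes):
        i = bin_index(score, n_bins)
        totals[i] += 1
        hits[i] += hit
    # Bins with no fitting data fall back to their midpoint (arbitrary choice).
    table = [
        hits[i] / totals[i] if totals[i] else (i + 0.5) / n_bins
        for i in range(n_bins)
    ]
    return lambda score: table[bin_index(score, n_bins)]


def max_bin_gap(probs, outcomes, n_bins=10):
    """Largest |mean predicted probability - hit rate| over non-empty bins."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, hit in zip(probs, outcomes):
        i = bin_index(p, n_bins)
        sums[i] += p
        hits[i] += hit
        counts[i] += 1
    gaps = [
        abs(sums[i] / counts[i] - hits[i] / counts[i])
        for i in range(n_bins)
        if counts[i]
    ]
    if not gaps:
        raise ValueError("no predictions to evaluate")
    return max(gaps)


def head_can_ship(probs, outcomes, max_gap=0.05, n_bins=10):
    # Assumption: the 5% threshold is an absolute gap of 0.05 in any bin.
    return max_bin_gap(probs, outcomes, n_bins) <= max_gap


# Toy data (invented). Fit on one split, evaluate on a separate one.
fit_scores, fit_hits = [0.95] * 100, [1] * 60 + [0] * 40
calibrate = fit_histogram_calibrator(fit_scores, fit_hits)

eval_scores, eval_hits = [0.95] * 50, [1] * 30 + [0] * 20
raw_ok = head_can_ship(eval_scores, eval_hits)
calibrated_ok = head_can_ship([calibrate(s) for s in eval_scores], eval_hits)
print(raw_ok, calibrated_ok)
```

With these numbers the raw scores fail the gate (a gap of about 0.35 in the top bin) and the calibrated ones pass. The calibrator is fit on one split and the gap is measured on another, mirroring the out-of-sample rule in the first bullet.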
Why this maps to position size
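If 70% confidence is honestly a 70% hit rate, then a Kelly-fraction position size is well-defined: Kelly needs a win probability and a payoff ratio, and only a calibrated number can stand in for the first. Our tier ladder is conservative on top of that. We cap below full Kelly to survive bad streaks: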
- ≥ 0.85 calibrated → 6x base size (rare, ~5% of signals)
- ≥ 0.75 → 3x
- ≥ 0.65 → 2x
- ≥ 0.60 → 1x
- < 0.60 → 0 (NO_TRADE)
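Read as code, the ladder above is a lookup from calibrated probability to a multiple of base size. The sketch below is one way to write it; the names `TIERS` and `size_multiplier` are hypothetical, and the thresholds are copied from the list as written.

```python
# (calibrated probability floor, multiple of base size), highest tier first.
TIERS = [
    (0.85, 6),
    (0.75, 3),
    (0.65, 2),
    (0.60, 1),
]


def size_multiplier(calibrated_p):
    """Return the base-size multiple for a calibrated probability.

    A return value of 0 means NO_TRADE.
    """
    if not 0.0 <= calibrated_p <= 1.0:
        raise ValueError("calibrated_p must be in [0, 1]")
    for floor, multiple in TIERS:
        if calibrated_p >= floor:
            return multiple
    return 0  # below the 0.60 floor


for p in (0.87, 0.78, 0.66, 0.60, 0.59):
    print(p, size_multiplier(p))
```

Keeping the tiers in one ordered table means the ladder can change without touching the lookup logic. The input must already be the calibrated probability; feeding in the raw score defeats the point.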
The 0.60 floor isn't a guess. It's the point below which the calibration curve gets noisy and the expected value of trading is negative net of slippage.
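For readers who want the arithmetic, here is a hedged sketch of how a calibrated probability feeds a Kelly fraction, and how a per-trade cost pushes the break-even probability above the no-cost level. The payoff ratio of 1.0 and the cost of 0.10 per unit risked are invented for illustration; we haven't stated our actual figures here, so the break-even this prints is not the 0.60 floor. All function names are hypothetical.

```python
def kelly_fraction(p, payoff_ratio):
    """Full-Kelly fraction of bankroll for a bet that wins with probability p
    and pays payoff_ratio units per unit risked: f* = p - (1 - p) / payoff_ratio."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if payoff_ratio <= 0:
        raise ValueError("payoff_ratio must be positive")
    return p - (1.0 - p) / payoff_ratio


def expected_value(p, payoff_ratio, cost):
    """Expected profit per unit risked, net of a per-trade cost such as slippage."""
    return p * payoff_ratio - (1.0 - p) - cost


def break_even_probability(payoff_ratio, cost):
    """Win probability at which expected_value is exactly zero."""
    return (1.0 + cost) / (1.0 + payoff_ratio)


# Assumed, illustrative numbers only.
PAYOFF_RATIO = 1.0  # win 1 unit per unit risked
COST = 0.10         # slippage and fees per unit risked

for p in (0.50, 0.60, 0.70, 0.85):
    print(
        p,
        round(kelly_fraction(p, PAYOFF_RATIO), 3),
        round(expected_value(p, PAYOFF_RATIO, COST), 3),
    )
print("break-even p:", round(break_even_probability(PAYOFF_RATIO, COST), 3))
```

Two things stand out even with made-up inputs. Full Kelly grows quickly with probability, which is why capping below it matters. And any nonzero cost moves the break-even probability above the no-cost level, so there is always a confidence floor below which trading loses money on average.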
If your AI tool tells you 80% but won't tell you how often "80%" was right, your position size is being chosen by a marketing department.