
Calibrated conviction beats raw confidence

2026-06-03 · Qovaryx Team

"The model is 87% confident this is a BUY." Great. Out of all the times it said 87%, how often was it actually right?

That last question is the whole game. Without an answer, the 87% is a vibe. With one, you have a probability you can size to.

What calibration means

A model is calibrated when its stated confidence matches its empirical hit rate. If it says 70% across 1,000 cases, ~700 should resolve as predicted. If it says 95% and only 60% resolve, it's overconfident — and any position sizing tied to that 95% is broken.
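To make the definition concrete, here is a minimal sketch of the check: bucket predictions by stated confidence and compare each bucket's average confidence with how often it was right. The function name `reliability_table` and the ten equal-width bins are our assumptions for illustration, not a description of any production pipeline.

```python
from collections import defaultdict


def reliability_table(confidences, outcomes, n_bins=10):
    """Compare stated confidence with observed hit rate, bin by bin.

    confidences: stated probabilities in [0, 1]
    outcomes:    1 if the call resolved as predicted, else 0
    Returns a list of (bin_index, count, mean_confidence, hit_rate).
    """
    if len(confidences) != len(outcomes):
        raise ValueError("confidences and outcomes must have the same length")
    bins = defaultdict(list)
    for conf, hit in zip(confidences, outcomes):
        # Equal-width bins; a confidence of exactly 1.0 goes in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    table = []
    for idx in sorted(bins):
        rows = bins[idx]
        mean_conf = sum(c for c, _ in rows) / len(rows)
        hit_rate = sum(h for _, h in rows) / len(rows)
        table.append((idx, len(rows), mean_conf, hit_rate))
    return table


# The overconfident case from the text: the model says 95%, only 60% resolve.
confs = [0.95] * 1000
hits = [1] * 600 + [0] * 400
for idx, n, mean_conf, hit_rate in reliability_table(confs, hits):
    print(f"bin {idx}: n={n} stated={mean_conf:.2f} actual={hit_rate:.2f}")
# prints: bin 9: n=1000 stated=0.95 actual=0.60
```

A calibrated model shows stated and actual close together in every row. The row above is off by 35 points.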

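Out-of-the-box LLM softmax scores are generally not calibrated. Training and tuning reward picking the right answer, not reporting honest probabilities, so nothing forces the score to match the hit rate. Size off them as if they were probabilities and you mis-size every position.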

How we calibrate

We don't ship the raw decoder output. Every probability the cluster surfaces to your app passes through a post-hoc calibration layer trained on holdout outcomes. The method we use is well-studied; we won't say which one here, because that would let a competitor copy our setup without the boring parts that make it work. The honest version:

We measure calibration on out-of-sample data, not training data.
We report the gap (the difference between predicted and actual hit rate) for every bin in our internal eval pack.
When the gap exceeds 5% in any bin, the head doesn't ship.
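For readers who want to see the mechanics, here is a sketch of that loop under loudly stated assumptions. The calibration method is not named above, so this example swaps in histogram binning, one well-known post-hoc technique, purely as a stand-in. The bin count, the function names, and the reading of "5%" as an absolute gap of 0.05 are also our assumptions.

```python
N_BINS = 10      # assumption: equal-width bins; no bin count is given in the text
MAX_GAP = 0.05   # the 5% threshold, read here as an absolute gap


def bin_index(p, n_bins=N_BINS):
    """Map a probability in [0, 1] to an equal-width bin index."""
    return min(int(p * n_bins), n_bins - 1)


def fit_histogram_calibrator(raw_scores, outcomes, n_bins=N_BINS):
    """Histogram binning (a stand-in method): each raw-score bin maps to the
    hit rate observed in that bin on holdout data. Empty bins fall back to
    the bin midpoint."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, hit in zip(raw_scores, outcomes):
        i = bin_index(score, n_bins)
        hits[i] += hit
        counts[i] += 1
    return [hits[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
            for i in range(n_bins)]


def calibrate(raw_score, table):
    """Look up the calibrated probability for one raw score."""
    return table[bin_index(raw_score, len(table))]


def max_bin_gap(probs, outcomes, n_bins=N_BINS):
    """Largest |mean predicted probability - hit rate| over non-empty bins."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, hit in zip(probs, outcomes):
        i = bin_index(p, n_bins)
        sums[i] += p
        hits[i] += hit
        counts[i] += 1
    gaps = [abs(sums[i] / counts[i] - hits[i] / counts[i])
            for i in range(n_bins) if counts[i]]
    if not gaps:
        raise ValueError("no predictions to evaluate")
    return max(gaps)


def ships(probs, outcomes, max_gap=MAX_GAP):
    """Gate: pass only if every non-empty bin is within max_gap."""
    return max_bin_gap(probs, outcomes) <= max_gap


# Fit on one holdout split, then judge on a separate out-of-sample split.
holdout_scores = [0.95] * 1000
holdout_hits = [1] * 600 + [0] * 400
table = fit_histogram_calibrator(holdout_scores, holdout_hits)

test_scores = [0.95] * 500
test_hits = [1] * 300 + [0] * 200
calibrated = [calibrate(s, table) for s in test_scores]

print("raw passes gate:", ships(test_scores, test_hits))
print("calibrated passes gate:", ships(calibrated, test_hits))
```

The point of the two splits is that the calibrator never sees the data it is graded on. Grading it on its own fitting data would make the gap look smaller than it is.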

Why this maps to position size

If a stated 70% really is a 70% hit rate, then a Kelly-fraction position size is well-defined once you also know the payoff odds. Our tier ladder is conservative on top of that. We cap below full Kelly to survive bad streaks:

≥ 0.85 calibrated → 6x base size (rare, ~5% of signals)
≥ 0.75 → 3x
≥ 0.65 → 2x
≥ 0.60 → 1x
< 0.60 → 0 (NO_TRADE)
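As a rough sketch of how the ladder sits next to Kelly, the snippet below computes the textbook full-Kelly fraction for a binary bet and looks up the tier multiplier from the table above. The even payoff ratio, the 1% `base_fraction`, and the function names are illustrative assumptions. The ladder itself only gives multiples of an unspecified base size.

```python
# Tier ladder from the text: (minimum calibrated probability, size multiplier).
TIERS = [(0.85, 6), (0.75, 3), (0.65, 2), (0.60, 1)]


def kelly_fraction(p, payoff_ratio=1.0):
    """Full-Kelly bankroll fraction for a binary bet: f* = p - (1 - p) / b,
    where b is the win amount per unit risked. Never negative."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if payoff_ratio <= 0:
        raise ValueError("payoff_ratio must be positive")
    return max(0.0, p - (1.0 - p) / payoff_ratio)


def size_multiplier(calibrated_p):
    """Multiplier on base size; 0 means NO_TRADE."""
    for floor, multiplier in TIERS:
        if calibrated_p >= floor:
            return multiplier
    return 0


def position_fraction(calibrated_p, base_fraction=0.01, payoff_ratio=1.0):
    """Tier size as a bankroll fraction, never above full Kelly.
    base_fraction is a hypothetical value chosen for this example."""
    tier_size = size_multiplier(calibrated_p) * base_fraction
    return min(tier_size, kelly_fraction(calibrated_p, payoff_ratio))


for p in (0.55, 0.62, 0.70, 0.80, 0.90):
    print(f"p={p:.2f} kelly={kelly_fraction(p):.2f} "
          f"tier={size_multiplier(p)}x size={position_fraction(p):.3f}")
```

With these assumed numbers the ladder stays far below full Kelly at every tier. That only holds if the probability going in is calibrated; an inflated input inflates both the Kelly fraction and the tier.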

The 0.60 floor isn't a guess. It's the point below which the calibration curve gets noisy and the expected value of trading is negative net of slippage.
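The slippage half of that argument can be worked through with simple arithmetic. The sketch below computes expected value per trade and the break-even hit rate for a binary outcome. The win, loss, and cost figures are made-up illustrations, chosen so the break-even lands at 0.60; they are not our actual numbers.

```python
def expected_value(p, win=1.0, loss=1.0, cost=0.0):
    """Expected profit per trade: win with probability p, lose otherwise,
    and pay a fixed round-trip cost (slippage, fees) either way."""
    return p * win - (1.0 - p) * loss - cost


def breakeven_probability(win=1.0, loss=1.0, cost=0.0):
    """Hit rate at which expected_value is zero: (loss + cost) / (win + loss)."""
    return (loss + cost) / (win + loss)


# Illustrative assumption: symmetric payoff, cost equal to 20% of one unit.
floor = breakeven_probability(win=1.0, loss=1.0, cost=0.2)
print(f"break-even hit rate: {floor:.2f}")
for p in (0.55, 0.60, 0.65):
    print(f"p={p:.2f} EV={expected_value(p, cost=0.2):+.2f}")
```

Without costs the break-even for a symmetric payoff is 50%. Costs push it up, which is why a signal can be right more often than not and still lose money.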

If your AI tool tells you 80% but won't tell you how often "80%" was right, your position size is being chosen by a marketing department.