
The brutal evaluation

2026-05-31 · Qovaryx Team

Most ML evaluation runs on data whose shape the model has already seen. You train on 2018-2024 and test on 2025. The distribution is similar, so the model looks great. Then you ship and watch a 2026 macro event blow through every backtest assumption.

What we changed

We added a deliberately hostile eval set we call the brutal pack. The construction rule:

If a setup looks high-probability on the surface (clean breakout, rising volume, momentum scan match) but the realized outcome was a loss, it goes in.
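The construction rule can be read as a simple filter. Below is a minimal sketch, not our production code: the `Setup` fields and function names are hypothetical, and it assumes "looks high-probability" means all three surface signals fire at once and that a loss means negative realized P&L.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Setup:
    # Hypothetical field names; the rule only names the surface signals.
    clean_breakout: bool
    rising_volume: bool
    momentum_scan_match: bool
    realized_pnl: float  # realized outcome; negative means a loss


def looks_high_probability(s: Setup) -> bool:
    # Assumption: all three surface signals must fire together.
    return s.clean_breakout and s.rising_volume and s.momentum_scan_match


def build_brutal_pack(setups: list[Setup]) -> list[Setup]:
    # Keep only setups that looked good on the surface but realized a loss.
    return [s for s in setups if looks_high_probability(s) and s.realized_pnl < 0]
```

A winner with the same surface signals is excluded by design, which is what makes the set hostile rather than representative.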

Cherry-picked? Yes. On purpose. The point isn't to estimate average performance; the point is to stress the abstention edge, the decision of when not to trade. A 60% win rate on the full population is misleading if the losing 40% of trades all cluster in the same regime.
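To see why an aggregate win rate can hide regime clustering, here is a small sketch with made-up numbers. The regime labels and the `win_rate_by_regime` helper are illustrative assumptions, not part of our pipeline.

```python
from collections import defaultdict


def win_rate_by_regime(trades):
    """trades: iterable of (regime_label, won) pairs; returns {regime: win rate}."""
    wins = defaultdict(int)
    totals = defaultdict(int)
    for regime, won in trades:
        totals[regime] += 1
        wins[regime] += int(won)
    return {regime: wins[regime] / totals[regime] for regime in totals}


# Made-up illustration: 60 wins in a "trend" regime, 40 losses in a "shock" regime.
trades = [("trend", True)] * 60 + [("shock", False)] * 40
overall = sum(won for _, won in trades) / len(trades)
per_regime = win_rate_by_regime(trades)
```

In this extreme case the overall win rate is 60%, yet every trade taken in the "shock" regime loses. The aggregate figure says nothing about that.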

What we measure

What "92% should be NO_TRADE" means

On our holdout test (the standard out-of-sample set, not the brutal pack), roughly 92% of all bar-by-bar evaluations show no positive expected value after slippage. The market is mostly chop. The minority of bars that are tradeable have to clear a high bar: calibration, conviction, governor checks, earnings veto, regime check.
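One way to picture abstain-by-default is a gate that returns NO_TRADE unless everything passes. This is a hedged sketch: the `BarEval` fields, the `decide` function, and the boolean form of each check are assumptions for illustration. The real checks are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarEval:
    # Hypothetical fields; the checks are named above but not defined.
    expected_value: float  # expected value for the bar, before costs
    slippage: float        # estimated slippage cost, same units
    calibrated: bool
    conviction_ok: bool
    governor_ok: bool
    earnings_veto: bool    # True means the veto fires and blocks the trade
    regime_ok: bool


def decide(bar: BarEval) -> str:
    # Abstain by default: no positive expected value after slippage, no trade.
    if bar.expected_value - bar.slippage <= 0:
        return "NO_TRADE"
    gates_pass = (
        bar.calibrated
        and bar.conviction_ok
        and bar.governor_ok
        and not bar.earnings_veto
        and bar.regime_ok
    )
    return "TRADE" if gates_pass else "NO_TRADE"


def no_trade_fraction(bars: list[BarEval]) -> float:
    # Share of bar-by-bar evaluations that end in abstention.
    if not bars:
        raise ValueError("need at least one bar")
    return sum(decide(b) == "NO_TRADE" for b in bars) / len(bars)
```

Note that the fraction computed here counts every abstention, including bars blocked by a gate, so it can only be equal to or higher than the share of bars failing the expected-value test alone.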

This is the opposite of how trading "AI" is usually marketed. The honest answer is: most of the time, the right move is no move.

If your tool generates 50 signals a day, it's selling you signals, not edge.

Not financial advice. Architecture notes describe what we built, not how to trade. Options trading involves substantial risk of loss.