How we build benchmarks that still mean something.
A benchmark is only useful while it still separates models and survives contamination. Most public evals saturate, leak into training data, or measure the scaffold instead of the model. This is the method CodeSOTA uses to avoid that — the same one behind every number on the public registry.
It is a method page, not a price list. We are not taking commercial engagements yet — this is how the work is done and where it is heading.
The question is not which model scores highest, but which system wins for this workflow, at this cost, with these failure modes, against the real alternatives.
A leaderboard rank that ignores cost, latency, and how the thing breaks is not a decision — it is a screenshot.
Four things that make an eval worth trusting.
Skip any one of them and the number stops meaning what it claims to.
We define representative tasks from actual workflows. A suite can stay private as a hold-out set, publish later, or become a repeatable regression set — but it is built to mirror the job, not to look impressive.
Models, prompts, retrieval, tools, scaffolds, and agent loops are varied independently, so you can see whether a result comes from the model or the system around it. Most reported wins are scaffold wins in disguise.
Every result is paired with latency, cost, operating constraints, and failure modes. The output is a decision surface — when does this system win, where does it fail, what does it cost — not a leaderboard screenshot.
Score tables, method notes, a re-runnable harness, and citations for every claim. If a number cannot be reproduced or sourced, it does not ship. This is the same standard the public registry is held to.
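To make "varied independently" concrete, here is a minimal sketch of a full factorial grid with per-factor averages. It is an illustration under stated assumptions, not the CodeSOTA harness: the factor names, the level names, and the `full_grid` and `main_effect` helpers are all hypothetical.

```python
from itertools import product

# Hypothetical factors and levels; placeholders, not real systems.
FACTORS = {
    "model": ["model-a", "model-b"],
    "prompt": ["terse", "detailed"],
    "scaffold": ["single-shot", "agent-loop"],
}


def full_grid(factors):
    """Return every combination of factor levels as a list of config dicts."""
    names = list(factors)
    return [
        dict(zip(names, combo))
        for combo in product(*(factors[name] for name in names))
    ]


def main_effect(results, factor, factors):
    """Mean score per level of one factor, averaged over all other factors.

    `results` is a list of (config, score) pairs covering the full grid,
    so every level has at least one score.
    """
    effect = {}
    for level in factors[factor]:
        scores = [score for config, score in results if config[factor] == level]
        effect[level] = sum(scores) / len(scores)
    return effect
```

If the gap between levels of `scaffold` is much larger than the gap between levels of `model`, the headline win belongs to the scaffold, which is the distinction the second point above is drawing.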
From a vague claim to a reproducible decision.
| Stage | What it covers |
|---|---|
| Input | Task definition, candidate systems, real-world constraints, the question being answered |
| Data | Private samples, public benchmarks, papers, docs, and collected failure examples |
| Run | Models, prompts, tools, harnesses, graders — with cost and latency tracked per run |
| Output | Score tables, frontier charts, failure analysis, and a reproducible re-run harness |
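The Run and Output rows can be sketched as a per-run record plus a Pareto frontier over score, cost, and latency. This is a minimal illustration: the `RunResult` fields and the `dominates` and `pareto_frontier` helpers are assumptions made for the example, not the registry's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    system: str       # hypothetical system label
    score: float      # task score, higher is better
    cost_usd: float   # cost of the run, lower is better
    latency_s: float  # wall-clock latency, lower is better


def dominates(a, b):
    """True if `a` is no worse than `b` on every axis and better on at least one."""
    no_worse = (
        a.score >= b.score
        and a.cost_usd <= b.cost_usd
        and a.latency_s <= b.latency_s
    )
    strictly_better = (
        a.score > b.score
        or a.cost_usd < b.cost_usd
        or a.latency_s < b.latency_s
    )
    return no_worse and strictly_better


def pareto_frontier(runs):
    """Keep only the runs that no other run dominates."""
    return [r for r in runs if not any(dominates(other, r) for other in runs)]
```

A system that stays on the frontier is a defensible choice for some budget; a system that falls off it is beaten on every axis by something else, whatever its rank on score alone.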
Contamination-resistant, by construction.
Public benchmarks leak. Once a test set is on the internet it eventually lands in a training run, and the score stops measuring capability and starts measuring memorisation. A private hold-out suite — never published, or published only after it has done its job — is the only way to keep a number honest.
We rank every environment we touch by discriminative power — how far apart it pulls the best and worst models, penalised as the leader hits the ceiling. An environment nobody can fail, or that everybody fails, is not worth a run. That is the same lens applied across the RL-environment index.
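One way to turn that description into a number is sketched below. The text does not give the exact formula, so this is an assumption: scores are taken to lie in [0, 1], the spread is best minus worst, and a linear penalty kicks in once the leader passes a hypothetical `saturation_start` threshold.

```python
def discriminative_power(scores, saturation_start=0.9):
    """Spread between best and worst model, discounted as the leader saturates.

    Assumes scores lie in [0, 1]. The linear penalty and the default
    threshold are illustrative choices, not a published formula.
    """
    if len(scores) < 2:
        raise ValueError("need at least two model scores")
    if not 0.0 <= saturation_start < 1.0:
        raise ValueError("saturation_start must be in [0, 1)")

    best, worst = max(scores), min(scores)
    spread = best - worst

    if best <= saturation_start:
        penalty = 1.0
    else:
        # Falls linearly from 1 at the threshold to 0 at a perfect score.
        penalty = max(0.0, (1.0 - best) / (1.0 - saturation_start))

    return spread * penalty
```

Under this sketch, an environment where every model scores 0.0 and one where every model scores 1.0 both come out at zero, matching the claim that neither is worth a run.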
A reward you cannot game: audio-verify.
Synthesised speech → whisper.cpp ASR → a structured-field reward. No learned judge — the reward is objectively verifiable, so it cannot be flattered. A structured field either survives the round-trip or it does not.
| Measure | Best system | Baseline |
|---|---|---|
| Structured entity recovery (higher is better) | 1.00 | 0.42 |
| Structured WER (lower is better) | 0.03 | 0.42 |
| Plain WER (lower is better) | 0.025 | 0.088 |
Plain WER separates the systems by only 0.063 in absolute terms (0.025 vs 0.088); the structured-field reward opens a 0.58 gap on entity recovery (1.00 vs 0.42). Measuring the right thing is the whole game.
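The difference between the two kinds of measure can be sketched in a few lines. This is an illustration only: it does not call whisper.cpp, and the normalisation rule, the `wer` and `entity_recovery` helpers, and the exact-match definition of recovery are assumptions rather than the audio-verify implementation.

```python
import re


def normalise(text):
    """Lowercase, replace anything but letters, digits and spaces, split into words."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalise(reference), normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def entity_recovery(expected_fields, transcript):
    """Fraction of expected field values found intact in the transcript."""
    if not expected_fields:
        raise ValueError("expected_fields must not be empty")
    padded = " " + " ".join(normalise(transcript)) + " "
    hits = sum(
        1 for value in expected_fields.values()
        if " " + " ".join(normalise(value)) + " " in padded
    )
    return hits / len(expected_fields)
```

For a made-up reference "Ship order 4417 to Anna Kowalski on Friday" transcribed as "ship order 4470 to anna kowalski on friday", one wrong word out of eight gives a WER of 0.125, yet only two of the three fields (order id, name, day) survive. A single mis-heard digit barely moves plain WER but breaks the one field that matters.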
Working on something where this matters?
We are not running paid engagements yet, but we are always interested in hard evaluation problems — places where the public benchmarks have saturated and you genuinely can’t tell systems apart. If that is you, say hello and tell us what you’re measuring.