§ 00 · Premise

How we build benchmarks that still mean something.

A benchmark is only useful while it still separates models and survives contamination. Most public evals saturate, leak into training data, or measure the scaffold instead of the model. This is the method CodeSOTA uses to avoid that — the same one behind every number on the public registry.

It is a method page, not a price list. We are not taking commercial engagements yet — this is how the work is done and where it is heading.

§ 01 · The question

Not which model scores highest — but which system wins for this workflow, at this cost, with these failure modes, against the real alternatives.

A leaderboard rank that ignores cost, latency, and how the thing breaks is not a decision — it is a screenshot.

§ 02 · Principles

Four things that make an eval worth trusting.

Skip any one of them and the number stops meaning what it claims to.

§ 01
Tasks from real work, not benchmark theater

We define representative tasks from actual workflows. A suite can stay private as a hold-out set, publish later, or become a repeatable regression set — but it is built to mirror the job, not to look impressive.

§ 02
Separate the model from the harness

Models, prompts, retrieval, tools, scaffolds, and agent loops are varied independently, so you can see whether a result comes from the model or the system around it. Most reported wins are scaffold wins in disguise.
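One way to vary components independently is a one-factor-at-a-time sweep around a fixed baseline. The sketch below is illustrative only: the factor names, the option values, and the helper `one_factor_variants` are hypothetical and are not taken from the CodeSOTA harness.

```python
from typing import Dict, Iterator, List

Config = Dict[str, str]


def one_factor_variants(baseline: Config, options: Dict[str, List[str]]) -> Iterator[Config]:
    """Yield configs that differ from the baseline in exactly one factor."""
    for factor, values in options.items():
        for value in values:
            if value == baseline[factor]:
                continue  # skip the baseline's own setting
            variant = dict(baseline)
            variant[factor] = value
            yield variant


# Hypothetical factors and values, chosen only for illustration.
baseline = {"model": "model-a", "prompt": "plain", "scaffold": "single-shot"}
options = {
    "model": ["model-a", "model-b"],
    "prompt": ["plain", "few-shot"],
    "scaffold": ["single-shot", "agent-loop"],
}

variants = list(one_factor_variants(baseline, options))
for v in variants:
    changed = [k for k in v if v[k] != baseline[k]]
    print(changed[0], "->", v[changed[0]])
```

Because each variant changes a single factor, any score difference against the baseline can be attributed to that factor rather than to the system as a whole. A full factorial grid would also expose interactions between factors, at the cost of more runs.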

§ 03
Score on the frontier, not a single number

Every result is paired with latency, cost, operating constraints, and failure modes. The output is a decision surface — when does this system win, where does it fail, what does it cost — not a leaderboard screenshot.
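A decision surface of this kind can be approximated with a Pareto frontier: keep every system that no other system beats on all axes at once. This is a minimal sketch under assumptions; the `Result` fields and the example numbers are placeholders, not registry data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Result:
    name: str
    score: float      # higher is better
    cost_usd: float   # lower is better
    latency_s: float  # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.score >= b.score and a.cost_usd <= b.cost_usd
                and a.latency_s <= b.latency_s)
    better = (a.score > b.score or a.cost_usd < b.cost_usd
              or a.latency_s < b.latency_s)
    return no_worse and better


def pareto_frontier(results: List[Result]) -> List[Result]:
    """Keep only the results that no other result dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]


# Placeholder numbers for illustration only, not measured results.
candidates = [
    Result("system-a", score=0.90, cost_usd=1.00, latency_s=2.0),
    Result("system-b", score=0.80, cost_usd=0.20, latency_s=1.0),
    Result("system-c", score=0.70, cost_usd=0.50, latency_s=1.5),
]
frontier = pareto_frontier(candidates)
print([r.name for r in frontier])  # system-c is dominated by system-b
```

Here system-a and system-b both stay on the frontier because each wins on a different axis, which is exactly the case a single ranked number hides.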

§ 04
Reproducible and source-backed

Score tables, method notes, a re-runnable harness, and citations for every claim. If a number cannot be reproduced or sourced, it does not ship. This is the same standard the public registry is held to.

§ 03 · Method

From a vague claim to a reproducible decision.

Input: Task definition, candidate systems, real-world constraints, the question being answered
Data: Private samples, public benchmarks, papers, docs, and collected failure examples
Run: Models, prompts, tools, harnesses, graders, with cost and latency tracked per run
Output: Score tables, frontier charts, failure analysis, and a reproducible re-run harness

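Tracking cost and latency per run can be as simple as wrapping each call in a timer and recording token usage. The following is a sketch, not the actual harness: `RunRecord`, `tracked_run`, the `(output, tokens_used)` return shape, and the flat per-token price are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class RunRecord:
    system: str
    task_id: str
    score: float
    latency_s: float
    cost_usd: float


def tracked_run(
    system: str,
    task_id: str,
    call: Callable[[str], Tuple[str, int]],  # assumed to return (output, tokens_used)
    grade: Callable[[str], float],
    usd_per_1k_tokens: float,                # assumed flat price, for illustration
) -> RunRecord:
    """Run one task, timing the call and pricing it from its token count."""
    start = time.perf_counter()
    output, tokens_used = call(task_id)
    latency_s = time.perf_counter() - start
    cost_usd = tokens_used / 1000 * usd_per_1k_tokens
    return RunRecord(system, task_id, grade(output), latency_s, cost_usd)


# Stub system and grader so the sketch runs without any external service.
def stub_call(task_id: str) -> Tuple[str, int]:
    return f"answer for {task_id}", 2000


def stub_grade(output: str) -> float:
    return 1.0 if output.startswith("answer") else 0.0


record = tracked_run("stub-system", "task-001", stub_call, stub_grade, usd_per_1k_tokens=0.5)
print(record.score, record.cost_usd)
```

Storing one record per run, rather than only an aggregate, keeps the raw material needed for frontier charts and failure analysis later.
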
§ 04 · Why it holds up

Contamination-resistant, by construction.

Public benchmarks leak. Once a test set is on the internet it eventually lands in a training run, and the score stops measuring capability and starts measuring memorisation. A private hold-out suite — never published, or published only after it has done its job — is the only way to keep a number honest.

We rank every environment we touch by discriminative power — how far apart it pulls the best and worst models, penalised as the leader hits the ceiling. An environment nobody can fail, or that everybody fails, is not worth a run. That is the same lens applied across the RL-environment index.
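The exact scoring formula is not given here, so the following is only one possible reading of "spread, penalised as the leader hits the ceiling". The function name, the linear penalty, and the `penalty_weight` default are assumptions, not the formula used for the index.

```python
from typing import Sequence


def discriminative_power(
    scores: Sequence[float],
    ceiling: float = 1.0,
    penalty_weight: float = 0.5,  # assumed value; how hard saturation is punished
) -> float:
    """Spread between best and worst model, shrunk as the best nears the ceiling."""
    if not scores:
        raise ValueError("need at least one score")
    best, worst = max(scores), min(scores)
    spread = (best - worst) / ceiling
    saturation = best / ceiling  # 1.0 means the leader has hit the ceiling
    return spread * (1.0 - penalty_weight * saturation)


print(discriminative_power([1.0, 1.0, 1.0]))  # nobody fails: no spread
print(discriminative_power([0.0, 0.0, 0.0]))  # everybody fails: no spread
print(discriminative_power([0.9, 0.2]))       # wide spread, some headroom left
print(discriminative_power([1.0, 0.3]))       # same spread, leader at the ceiling
```

Under this sketch the last two environments have the same spread, but the one whose leader has already hit the ceiling ranks lower, since it has no room left to separate future models.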

§ 05 · Worked example

A reward you cannot game: audio-verify.

Synthesised speech → whisper.cpp ASR → a structured-field reward. No learned judge — the reward is objectively verifiable, so it cannot be flattered. A structured field either survives the round-trip or it does not.
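A structured-field reward of this kind can be sketched as a pure function over the ASR transcript. The code below does not call whisper.cpp and is not the audio-verify implementation; the normalisation rules, the field names, and the helper names are assumptions for illustration.

```python
import re
from typing import Dict


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def field_survives(value: str, transcript: str) -> bool:
    """True if the field value appears as whole words in the transcript."""
    # Pad with spaces so "12" does not match inside "312".
    return f" {normalize(value)} " in f" {normalize(transcript)} "


def entity_recovery(expected: Dict[str, str], transcript: str) -> float:
    """Fraction of expected fields that survive the speech round-trip."""
    if not expected:
        raise ValueError("expected must contain at least one field")
    hits = sum(field_survives(v, transcript) for v in expected.values())
    return hits / len(expected)


# Hypothetical fields and transcripts, standing in for real ASR output.
expected = {"order_id": "A17", "city": "Lisbon"}
print(entity_recovery(expected, "Order A17 ships to Lisbon."))  # both fields survive
print(entity_recovery(expected, "Order A70 ships to Lisbon."))  # order_id is lost
```

The check is deterministic string matching, so there is no judge model to persuade: a field is either present in the transcript or it is not.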

Measure                      Best system   Baseline
Structured entity recovery   1.00          0.42
Structured WER               0.03          0.42
Plain WER                    0.025         0.088

Plain WER barely separates the systems (0.025 vs 0.088, an absolute gap of 0.063); the structured-field reward opens a +0.58 gap on entity recovery (1.00 vs 0.42). Measuring the right thing is the whole game.
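For reference, plain WER is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below uses the standard dynamic-programming formulation; the example sentences are hypothetical and unrelated to the numbers in the table.

```python
from typing import List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one dropped word out of six.
print(wer("send forty two units to lisbon", "send forty units to lisbon"))
```

In this example a single dropped word yields a WER of 1/6, yet the quantity field is now wrong. Plain WER weights every word equally, which is why it can look nearly identical across systems that differ sharply on whether the structured fields survive.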

§ 06 · Get in touch

Working on something where this matters?

We are not running paid engagements yet, but we are always interested in hard evaluation problems — places where the public benchmarks have saturated and you genuinely can’t tell systems apart. If that is you, say hello and tell us what you’re measuring.