Frontier models and agents are shipping faster than anyone can independently confirm what they can do. The claims come from the labs that build them, on evaluations those same labs select, often on data the model has already seen. The result is a field that runs on announcements — and a public that learns a system’s real limits only after it has been deployed into something that matters.
Codesota exists to close that gap. We measure capability the way a metrology lab measures anything: against a fixed standard, with the method written down, and the date stamped on every reading. What can this model do that last quarter’s could not? Where does the agent break? Which benchmark is saturated, which is contaminated, which still discriminates? Those are measurement questions, and they deserve measured answers — not a press cycle.
Concretely, each benchmark is a task–dataset–metric triple with a declared direction, a fixed split, a reproducibility package and a dated submission. Each row carries a verification tier — self-reported, community-reproduced, or Codesota-reproduced — and each score is stamped with the day it was run. We also run original studies where the existing measures fail: the TTS Elo study is one, built because a WER number tells you almost nothing about whether a voice is actually preferred. The full standard is on the methodology page.
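As a rough illustration of the record described above, here is a minimal sketch in Python. It is not Codesota's actual schema: the class names, field names, and the `better` helper are all hypothetical, chosen only to mirror the wording of the paragraph (triple, direction, split, reproducibility package, verification tier, run date).

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


class Tier(Enum):
    SELF_REPORTED = "self-reported"
    COMMUNITY_REPRODUCED = "community-reproduced"
    CODESOTA_REPRODUCED = "codesota-reproduced"


@dataclass(frozen=True)
class Benchmark:
    # The task-dataset-metric triple, plus its declared direction and fixed split.
    task: str
    dataset: str
    metric: str
    direction: Direction
    split: str


@dataclass(frozen=True)
class Submission:
    benchmark: Benchmark
    model: str
    score: float
    tier: Tier
    run_date: date        # the day the score was produced
    repro_package: str    # hypothetical: an identifier or path for the package


def better(a: Submission, b: Submission) -> Submission:
    """Return the stronger of two submissions on the same benchmark."""
    if a.benchmark != b.benchmark:
        raise ValueError("scores on different benchmarks are not comparable")
    if a.benchmark.direction is Direction.HIGHER_IS_BETTER:
        return a if a.score >= b.score else b
    return a if a.score <= b.score else b
```

Storing the direction on the benchmark rather than on each score means a lower-is-better metric such as WER cannot be ranked the wrong way round, and the frozen dataclasses keep a submitted row from being edited in place.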
A registry, not a leaderboard. A leaderboard is a view of “who is on top right now”; a registry is the record that makes that view legible. When a model regresses between checkpoints, the preceding score stays visible so the regression itself is visible. When a score turns out to be wrong, the correction is visible too. Nothing is silently deleted.
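The registry/leaderboard distinction can be sketched as an append-only log with the leaderboard derived from it. This is an assumption-laden illustration, not the project's implementation: `Entry`, `Registry`, the `corrects` field, and the higher-is-better ordering are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    model: str
    score: float
    run_date: date
    corrects: Optional[int] = None  # index of the earlier entry this one corrects
    note: str = ""


class Registry:
    """Append-only record: entries are added, never edited or deleted."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> int:
        if entry.corrects is not None and not (0 <= entry.corrects < len(self._entries)):
            raise IndexError("a correction must point at an existing entry")
        self._entries.append(entry)
        return len(self._entries) - 1

    def history(self, model: str) -> List[Entry]:
        # Full record for one model, including regressions and corrected rows.
        return [e for e in self._entries if e.model == model]

    def leaderboard(self) -> List[Entry]:
        # Derived view: latest uncorrected entry per model, best first.
        # Assumes a higher-is-better metric for the sake of the sketch.
        superseded = {e.corrects for e in self._entries if e.corrects is not None}
        latest: Dict[str, Entry] = {}
        for i, e in enumerate(self._entries):
            if i in superseded:
                continue
            current = latest.get(e.model)
            if current is None or e.run_date >= current.run_date:
                latest[e.model] = e
        return sorted(latest.values(), key=lambda e: e.score, reverse=True)
```

In this sketch a regression or a correction only ever adds a row: `history` still returns the earlier score, while `leaderboard` is recomputed from the log each time it is asked for.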
And the measurement is independent. The project can earn its keep through custom benchmarks, commissioned evaluations, and tasteful, clearly labelled sponsorship, but none of it buys a ranking position, a better score, or a silent reshuffle. Paid work is disclosed inline where it appears, and the public method is not for sale. Nobody grades their own homework here.