Codesota · Methodology
The editorial standard of evidence
Issue: April 22, 2026

How we measure the state of the art.

Codesota is an open registry of machine-learning benchmarks. The value of a registry depends entirely on the standard of evidence behind each row. This page is the standard — plainly stated, so that a reader can check whether we hold to it.

What follows is how a number becomes a published score — what we require from a benchmark, what we require from a result, what “verified” means, how corrections and retractions are recorded, and what we refuse to do.

§ 01 · Scope

What counts as a benchmark.

The four things a task must have before it becomes a row in the registry.

A benchmark on Codesota is not simply a dataset. It is a task-dataset-metric triple with enough structure that two people can run it and compare their answers. That minimum structure is four items: a declared metric with a direction, a fixed test split, a reproducibility script, and a dated submission.

A declared metric means the benchmark page states which number is being reported and whether higher or lower is better. Half the confusion in the field comes from tables that leave this implicit; ours do not.

A fixed test split means the slice of data used to produce the score is the same slice used by every other row in the table. Private test sets are recorded separately, and never mixed with public scores in the same ranking.

A reproducibility script means there is some path by which the score can be re-executed. For open checkpoints this is a command, a commit hash and an environment; for closed models it is the API endpoint, version string, prompt template and decoding parameters. Without one path or the other, the row does not publish as a verified score.

A dated submission means every score carries the day it was run and, separately, the day it was verified. That separation matters when you are reading the table six months later and trying to tell whether a “current” number is still current.
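
For concreteness, here is what one row might carry, sketched as a Python dict. The field names are illustrative, not the registry's actual schema; the authoritative shape is whatever /data/benchmarks.json serves.

    # Illustrative sketch of a registry row; field names are hypothetical.
    row = {
        "benchmark": "ExampleBench",                      # hypothetical task
        "metric": {"name": "accuracy", "higher_is_better": True},
        "test_split": "test-v1",                          # the fixed public split
        "reproducibility": {                              # open-checkpoint variant
            "command": "python eval.py --split test-v1",
            "commit": "abc1234",
            "environment": "python3.11 / torch2.3 / cuda12.1",
        },
        "submitted": "2026-03-01",                        # day the score was run
        "verified": "2026-03-14",                         # day it was verified
    }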

§ 02 · Evidence

What counts as a result.

Three fields define a result. Missing any of the three marks it claim-only.

A result has three fields behind it: code (a frozen commit or a container digest), an environment (declared explicitly, not “use the latest”), and a seed or decoding configuration (recorded, not assumed).

If any of the three is missing, the submission is not rejected — it is recorded as claim-only. It still appears on the model's page, still carries its date and source, and still counts toward the historical record. It just does not receive a verified badge and does not contribute to SOTA lines until the gap is filled.
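
As a minimal sketch (not the site's actual code), the check behind that rule could look like this, assuming a submission arrives as a dict with the three fields named above:

    def evidence_tier(submission):
        """Return 'verified-eligible' only when all three evidence fields
        are present; otherwise 'claim-only'. A claim-only submission is
        still listed, it simply carries no verified badge."""
        required = ("code", "environment", "seed_or_decoding")
        if all(submission.get(field) for field in required):
            return "verified-eligible"
        return "claim-only"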

This is deliberately a soft fail rather than a hard one. Many useful numbers in the literature predate the practice of pinning commits; refusing to list them would erase a large part of the field's memory. Listing them as claim-only keeps them visible while keeping the verification tier honest.

Conversely, a reproducibility package that does not run on its own declared environment — the script errors, the weights do not load, the reported number cannot be regenerated within the benchmark's stated tolerance — is held back from publication until the submitter resolves it, or declines to.

§ 03 · Verification

What counts as verified.

Three tiers, marked explicitly on every row. A score is allowed to move between tiers; it is not allowed to be unmarked.

Verification is not a boolean. A score on Codesota sits in one of three tiers, marked explicitly on the row and in the JSON. A reader should never have to guess which tier they are looking at.

Tier · What it means
Self-reported · The author's number, with a source link. No independent reproduction. Listed with date and source; not counted toward the SOTA line until upgraded.
Community-reproduced · An independent party ran the reproducibility package and obtained a score within the benchmark's declared tolerance. The reproducer and their commit/environment are recorded alongside the original.
Codesota-reproduced · Run on our own infrastructure from the submitter's code and weights. The run is wrapped as a container, and the container digest is published next to the score.
Fig 1 · Verification tiers. The copper marker is reserved for Codesota-reproduced rows.

Tier changes are recorded, not overwritten. When a self-reported score is later reproduced, the row moves up a tier and the verification date is added; the original submission date is preserved.
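
A hypothetical sketch of that discipline, with illustrative field names: an upgrade appends a verification event and moves the current tier; it never rewrites the original submission date.

    def upgrade_tier(row, new_tier, verified_on, reproducer):
        # Append the event; earlier events and dates are preserved.
        row.setdefault("events", []).append({
            "tier": new_tier,          # e.g. "community-reproduced"
            "verified": verified_on,   # the verification date is added
            "reproducer": reproducer,  # who ran it, and from what package
        })
        row["tier"] = new_tier
        # row["submitted"] is deliberately never touched.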

Where a model cannot be reproduced at all — a proprietary endpoint, a withdrawn checkpoint — the row stays at self-reported or at its highest-achievable tier, and the reason is recorded in the row's notes rather than hidden behind a badge.

§ 04 · History

Scores are dated.

The table never silently forgets.

Every score on Codesota is stamped with the day it was published and the day it was verified. When a successor model lands and a row slips off the top, the older row does not disappear — it stays in the history, ranked by its date.

This is the single most important difference between a registry and a leaderboard. A leaderboard is a view of “who is on top right now”; a registry is the record that makes that view legible. If a model regresses between checkpoints — and they do regress — the preceding score stays visible so the regression itself is visible.

The same discipline applies to the benchmark itself. If a benchmark is updated (a split refreshed, a metric refined), rows from the prior version are not retroactively rewritten. They are marked as scored against the earlier version of the benchmark, and new rows accrue against the current one. A reader can still see the continuity; they are just not misled about which split produced which number.
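
In code terms, a sketch (field names assumed, not the registry's schema): rows carry the benchmark version they were scored against, and the SOTA line for the current version simply filters on it rather than rewriting the old rows.

    def sota_line(rows, version, higher_is_better=True):
        """Running best score over time, restricted to rows scored
        against one benchmark version. Illustrative only."""
        scoped = sorted(
            (r for r in rows if r["benchmark_version"] == version),
            key=lambda r: r["submitted"],
        )
        line, best = [], None
        for r in scoped:
            improved = best is None or (
                r["score"] > best if higher_is_better else r["score"] < best
            )
            if improved:
                best = r["score"]
                line.append((r["submitted"], r["model"], best))
        return line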

The record is append-only in spirit. Corrections are possible; silent deletions are not.

§ 05 · Refusals

What we don’t do.

A short list of policies that are worth stating plainly.

Most of what keeps a registry honest is what it refuses to do. In no particular order:

  • 01
    We do not take payment for listings.
    A model is listed because it has a verifiable score, not because its team paid for placement. Rankings are decided on published methodology, independently of any commercial relationship.
  • 02
    We do not hide negative results.
    Poor scores appear alongside strong ones. If a reproduction run disagrees with an author’s reported number, both are recorded with the discrepancy visible.
  • 03
    We do not re-run evaluations to flatter a vendor.
    An evaluation is run once per submission. If a vendor disagrees with the result, the correct response is a new submission with a new reproducibility package — not a quiet re-run of the old one.
  • 04
    We do not weight scores by training-set overlap.
    We expose the context — training cutoff, contamination flags, saturation notes — and leave the reasoning to the reader. Re-weighting post-hoc to flatter (or punish) a specific vendor is the same category error as selective re-running.
  • 05
    We do not silently retract rows.
    When a row is retracted, it stays on the page with a strikethrough and a link to the retraction note. Readers who cited it can still find it; the retraction is part of the public record.
  • 06
    We do not fill missing data with estimates.
    If a model has not been run on a benchmark, the cell is blank. Aggregate scores only average over benchmarks that were actually evaluated, and the coverage is shown alongside the aggregate. (A sketch of this rule follows the list.)
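
The missing-data rule in the last item is mechanical enough to sketch. Assuming a model's cells arrive as a mapping from benchmark name to a score or None, the aggregate and its coverage might be computed like this (illustrative, not the site's code):

    def aggregate(cells):
        """Average only the benchmarks actually evaluated; report
        coverage alongside. A blank cell is never imputed."""
        evaluated = [v for v in cells.values() if v is not None]
        if not evaluated:
            return None, 0.0
        mean = sum(evaluated) / len(evaluated)
        coverage = len(evaluated) / len(cells)
        return mean, coverage

For example, {"A": 71.4, "B": None, "C": 68.0} aggregates to 69.7 with coverage 2/3: the blank cell lowers the coverage, not the mean.
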
§ 06 · Contribute

How to submit a result.

Five steps. No editorial board, no review panel — just a reproduction run and a dated row.

  1. Check the benchmark exists.
    Browse the task index or the registry to confirm the benchmark you are targeting is already tracked. If it is not, open a benchmark-proposal issue before submitting a score — we would rather define the task once than reshape the table around a one-off submission.
  2. Prepare the reproducibility package.
    A frozen commit (or container digest), the declared environment, the seed or decoding parameters, and a one-command invocation that reproduces the reported score on the declared split. For API-only models, the equivalent is the endpoint version and a complete prompt/decoding specification. (A sketch of such a package follows the list.)
  3. Submit via /submit or a pull request.
    The submission form accepts a link to the reproducibility package and the reported score; contributors may equivalently open a pull request against the JSON for the benchmark. Either route is fine.
  4. We reproduce.
    The submission is queued for reproduction. A run either matches within tolerance (the row is marked Codesota-reproduced), matches via an independent community run (community-reproduced), or remains self-reported until it can be verified. We do not re-run the same package repeatedly to try to match a preferred number.
  5. Publish, dated, with source tier.
    The row appears with its submission date, verification date, tier, and a link back to the reproducibility package. The JSON at /data/benchmarks.json updates in the same cycle.
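
What step 2 asks for is concrete enough to sketch. A hypothetical package, expressed as a Python dict purely for illustration (Codesota does not mandate this shape):

    package = {
        "commit": "9f2c4e1",                  # frozen commit or container digest
        "environment": "python3.11; torch==2.3.0; cuda12.1",
        "seed": 1234,                         # or the full decoding parameters
        "decoding": {"temperature": 0.0, "max_tokens": 2048},
        "invocation": "python eval.py --split test-v1 --seed 1234",
        "reported_score": 71.4,               # the number the command must regenerate
    }
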
§ 07 · Corrections

How corrections work.

The procedure a reader triggers when they believe a score is wrong.

When a reader spots an error — a wrong number, a mis-attributed checkpoint, a benchmark definition that drifted — they file it through the site's feedback channel or by opening an issue on the public GitHub repository.

We then attempt verification. Where a reproducibility package exists, the score is re-run from it. Where one does not, we check the cited source. If the reported correction stands, the row is updated in place — but the update is visible: the row carries a correction note with the date and the reason. A reader returning to a cell they cited previously can see what changed and why.
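
A sketch of what "visible" means mechanically, with hypothetical field names: the cell changes, and a dated note recording the old value is appended next to it.

    def correct(row, new_score, date, reason):
        # The update itself is in place; the trail is not.
        row.setdefault("notes", []).append({
            "kind": "correction",
            "date": date,
            "reason": reason,
            "previous_score": row["score"],  # what the cell said before
        })
        row["score"] = new_score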

If the correction does not stand — we cannot reproduce it, or the cited source is itself wrong — the original row remains, annotated with the report and the verification outcome. The dissent is part of the public record even when it does not change the number.

Where a correction materially changes the conclusion a reader would draw from the table — the top model is no longer the top model, the regression flipped direction — it is recorded as a retraction rather than a silent edit. See the next section.

§ 08 · Retractions

How retractions are recorded.

A retracted row is not a deleted row.

A retraction happens when a published score turns out to be materially wrong — a bug in the evaluation harness, a mis-identified checkpoint, an undisclosed contamination of the test split — and correcting the cell in place would rewrite history in a way that misleads readers who had cited it.

In that case, the row stays on the page. The numbers are struck through, the row is marked retracted, and a link is added to a retraction note describing what happened, when it was discovered, and what the revised position is. A fresh row — with the correct number and a new date — is added below; it does not overwrite the retracted one.

This is mildly ugly on the page and is meant to be. A retraction is an unusual event, not a routine one, and the page makes that visible. The alternative — silent deletion — is the failure mode that made aggregator sites unreliable in their late stages, and it is not available to us.

§ 09 · Access

Open data, everywhere.

Every number on the site is also available as JSON. No paywall, no signup, no crawler trap.

The page you are reading and the JSON a program would consume are the same data, rendered differently. Every benchmark row on the site has a JSON representation, and every aggregated view has a bulk dump.

Resource · Where
Full benchmark dump · /data/benchmarks.json
API reference · /api — query-time endpoints, auth, limits
Source repository · github.com/kwikiel/codesota — registry, site, build history
Changelog · /changelog — registry updates and corrections
Fig 2 · The same rows render as JSON, HTML and CSV from the same underlying files.
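
Consuming the dump needs nothing beyond a standard library. A minimal sketch in Python; the host below is a placeholder, and the field names are assumptions about the dump's shape rather than a documented contract:

    import json
    from urllib.request import urlopen

    BASE = "https://codesota.example"  # placeholder host: substitute the real site
    with urlopen(BASE + "/data/benchmarks.json") as resp:
        rows = json.load(resp)

    # Hypothetical field names; check the dump itself for the real schema.
    reproduced = [r for r in rows if r.get("tier") == "codesota-reproduced"]
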
§ 10 · Disclosure

Who runs this, and how it is funded.

A page on methodology is not complete without a page on incentives.

Codesota is maintained by Kacper Wikiel in Warsaw, built in public on GitHub. The registry, the site, and the infrastructure are one repository.

The project is currently self-directed and not venture-backed. Day-to-day running costs (hosting, reproduction infrastructure, inference for API-only benchmarks) are covered directly. Where paid work exists — consulting, custom benchmarking, commissioned evaluations for teams who want a specific model profiled — it is flagged inline on the page where the result appears, and does not influence the ordering of the public registry.

We have no financial relationship with any model vendor that would put a finger on the scale for a specific listing. If that ever changes, the relationship will be disclosed on the affected page before it affects any row.

Editorial judgement — which benchmarks to include, how to describe them, how to frame context in the surrounding copy — is, and remains, the responsibility of the site owner. Methodology questions and disputes are welcome; the address is in the margin.

§ 11 · Contamination tax

Two scores, not one.

The gap between a benchmark's gold tests and an independently-written test set is the contamination tax. We publish both numbers.

A benchmark score is only meaningful if you trust the test set. Public benchmarks leak — their problems end up in training corpora, their gold answers in fine-tuning data, their evaluation harnesses in instruction-tuning sets. A model that has seen the test set scores higher than a model that hasn’t, even when the underlying capability is the same. That gap is the contamination tax, and historically nobody has been quoting it.

We publish two scores per benchmark wherever we can. The first is the canonical gold score on the benchmark’s declared test split. The second is an independent score on a parallel test set: same task, same difficulty distribution, but generated independently of the original benchmark — re-transcribed ground truth, freshly sampled problems past the model’s training cutoff, or a held-out commercial split that has never been released publicly. The two numbers are reported side-by-side on the benchmark page, and their difference renders as a single mono-spaced column: the tax.

A small tax (under ~3 points absolute) means the model has generalised — its score on a clean evaluation matches its score on the public one. A large tax (10+ points) means either the public benchmark is contaminated, or the model is gaming surface features that do not transfer. Both are worth knowing. Neither is captured by a single leaderboard number.
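
For a higher-is-better metric, the column is a single subtraction, with the reading thresholds quoted above. A sketch, not the production code:

    def contamination_tax(gold_score, independent_score):
        """Gold minus independent, in absolute points, assuming a
        higher-is-better metric. Thresholds follow the text above."""
        tax = gold_score - independent_score
        if tax < 3:
            reading = "small: the model generalises"
        elif tax >= 10:
            reading = "large: contamination, or surface features that do not transfer"
        else:
            reading = "intermediate"
        return tax, reading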

The methodology is closer to how clinical trials handle bias than to how leaderboards usually work. The gold score is the registered analysis. The independent score is the replication arm. We do not pretend the gold number is contamination-free; we measure how much it isn’t, and we name the gap.

Our first target is OmniDocBench, the OCR benchmark with the most public attention and therefore the most contamination risk. Coverage will expand from there to coding (LiveCodeBench, SWE-bench Pro), reasoning (GPQA, HLE), and agentic tasks. Where we cannot run the independent set ourselves, the column publishes as pending rather than as a fabricated number.

§ 12 · FAQ

Frequently asked, honestly answered.

Questions that come up about the methodology. Schema.org FAQPage markup is embedded for search.

Q01 · Why does Codesota not accept self-reported scores without reproduction?

Self-reported scores are recorded, but they are labelled claim-only until they are reproduced — either by an independent run or by a signed container hash. Most of the drift that made late-stage leaderboards unreliable came from self-reported numbers that nobody ever re-executed. A claim without a reproduction is still useful signal, but it is not evidence.

Q02 · What about closed models that cannot be reproduced?

Closed, API-only models are run against the public test split through their official endpoint, with the date, model identifier, prompt template and harness commit recorded. These rows are labelled API-verified rather than fully reproduced — the weights are not public, so another party cannot re-execute the exact same artefact. Where an API version string changes, we treat it as a new submission.

Q03 · How do you handle benchmark contamination?

Where a benchmark uses a continuously refreshed split (for example LiveCodeBench) we prefer it to static splits for frontier comparisons. Where a benchmark is static, we record the model training-data cutoff alongside the score, and we flag benchmarks known to be saturated or likely contaminated. We do not, and cannot, certify that a given training set did not include a given test item; we only expose the information needed to reason about it.

Q04 · Do you weight benchmarks by training-set overlap or vendor association?

No. A 2026 Berkeley RDI study showed that several widely-cited agent benchmarks can be exploited to near-perfect scores without solving any real tasks; re-weighting scores post-hoc to flatter a particular vendor would be the same failure mode in reverse. Codesota reports the metric a benchmark defines, on the split it defines, and exposes the context (date, harness, contamination flags) rather than rolling everything into a single opinion.

Q05 · What happens when a score turns out to be wrong?

If a published score turns out to be wrong — an evaluation bug, a miscounted split, a mis-identified checkpoint — we correct it in place, but the row carries a visible correction note with the date and the reason. If the result was materially misleading, the row is retracted: it stays on the page with a strikethrough and a link to the retraction note. No silent deletion.

Q06 · Is Codesota paid to list or promote specific models?

No. Editorial rankings are not for sale. Codesota is currently self-directed and not investor-backed. When paid work (consulting, custom benchmarking, commissioned evaluations) exists, it is labelled inline on the affected page and does not change the ordering of the public registry.

Q07 · Where can I get the raw data?

Every score that appears on the site is also available as JSON. The main dataset is served at /data/benchmarks.json; per-area JSON is linked from each benchmark page. The API reference at /api documents the query-time endpoints.


All routes verified live · April 2026