Codesota · About
A note from the editor
Issue: April 22, 2026

We measure what AI can actually do.

Codesota is an independent measurement project for machine-learning capability. We test what models and agents can really do — on fixed, dated, reproducible benchmarks — and publish every result as an open record that no vendor can edit, retire, or buy.

The discipline is unfashionable: every score carries the date it was run, a path to reproduce it, and a verification tier. A registry, not a leaderboard, built to outlast its authors. Started 10 December 2025, after Meta retired Papers with Code.

§ 01 · Mission

Why measurement matters.

Frontier models and agents are shipping faster than anyone can independently confirm what they can do. The claims come from the labs that build them, on evaluations those same labs select, often on data the model has already seen. The result is a field that runs on announcements — and a public that learns a system’s real limits only after it has been deployed into something that matters.

Codesota exists to close that gap. We measure capability the way a metrology lab measures anything: against a fixed standard, with the method written down, and the date stamped on every reading. What can this model do that last quarter’s could not? Where does the agent break? Which benchmark is saturated, which is contaminated, which still discriminates? Those are measurement questions, and they deserve measured answers — not a press cycle.

Concretely, each benchmark is a task–dataset–metric triple with a declared direction, a fixed split, a reproducibility package, and a dated submission. Each row carries a verification tier (self-reported, community-reproduced, or Codesota-reproduced), and each score is stamped with the day it was run. We also run original studies where the existing measures fail: the text-to-speech (TTS) Elo study is one, built because a word error rate (WER) number tells you almost nothing about whether a voice is actually preferred. The full standard is on the methodology page.
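
A minimal sketch of how such a row could be modelled, offered as an illustration rather than the registry's actual schema. Every class name, field name, and value below is an assumption made for the example; only the concepts (triple, direction, split, tier, run date, reproducibility package) come from the text above.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


class Tier(Enum):
    SELF_REPORTED = "self-reported"
    COMMUNITY_REPRODUCED = "community-reproduced"
    CODESOTA_REPRODUCED = "codesota-reproduced"


@dataclass(frozen=True)
class Benchmark:
    """A task-dataset-metric triple with a declared direction and a fixed split."""
    task: str
    dataset: str
    metric: str
    direction: Direction
    split: str


@dataclass(frozen=True)
class Row:
    """One dated, tiered score on one benchmark."""
    benchmark: Benchmark
    model: str
    score: float
    run_date: date        # the day the score was run
    tier: Tier
    repro_package: str    # hypothetical: path or URL of the reproducibility package


def better(a: Row, b: Row) -> Row:
    """Return the better of two rows on the same benchmark, honouring its direction."""
    if a.benchmark != b.benchmark:
        raise ValueError("rows belong to different benchmarks")
    if a.benchmark.direction is Direction.HIGHER_IS_BETTER:
        return a if a.score >= b.score else b
    return a if a.score <= b.score else b


# Placeholder values for illustration only; these are not registry data.
asr = Benchmark("speech-recognition", "example-set", "WER",
                Direction.LOWER_IS_BETTER, "test")
r1 = Row(asr, "model-a", 5.2, date(2026, 1, 10), Tier.SELF_REPORTED, "repro/model-a")
r2 = Row(asr, "model-b", 4.8, date(2026, 2, 3), Tier.CODESOTA_REPRODUCED, "repro/model-b")
print(better(r1, r2).model)  # model-b, because a lower WER is better
```

Declaring the direction on the benchmark, not on the row, means a comparison can never silently treat an error rate as an accuracy.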

A registry, not a leaderboard. A leaderboard is a view of “who is on top right now”; a registry is the record that makes that view legible. When a model regresses between checkpoints, the preceding score stays visible so the regression itself is visible. When a score turns out to be wrong, the correction is visible too. Nothing is silently deleted.
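
The distinction can be sketched in a few lines, assuming an in-memory, append-only list of entries. This is an illustration of the idea, not the site's implementation; `Entry`, `Registry`, the `corrects` field, and the sample scores are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Entry:
    entry_id: int
    model: str
    score: float
    run_date: date
    corrects: Optional[int] = None  # id of the earlier entry this one corrects


class Registry:
    """Append-only record; the leaderboard is only a view computed from it."""

    def __init__(self) -> None:
        self._entries: list[Entry] = []

    def append(self, model: str, score: float, run_date: date,
               corrects: Optional[int] = None) -> Entry:
        if corrects is not None and not 0 <= corrects < len(self._entries):
            raise KeyError(f"no entry with id {corrects}")
        entry = Entry(len(self._entries), model, score, run_date, corrects)
        self._entries.append(entry)  # nothing is ever removed or overwritten
        return entry

    def history(self) -> tuple[Entry, ...]:
        """Every entry ever recorded, including regressed and corrected ones."""
        return tuple(self._entries)

    def leaderboard(self, higher_is_better: bool = True) -> list[Entry]:
        """Latest uncorrected entry per model, best score first."""
        corrected = {e.corrects for e in self._entries if e.corrects is not None}
        latest: dict[str, Entry] = {}
        for e in self._entries:
            if e.entry_id in corrected:
                continue
            current = latest.get(e.model)
            if current is None or e.run_date >= current.run_date:
                latest[e.model] = e
        return sorted(latest.values(), key=lambda e: e.score, reverse=higher_is_better)


# Placeholder values for illustration only.
reg = Registry()
a1 = reg.append("model-a", 71.0, date(2026, 1, 10))
a2 = reg.append("model-a", 68.5, date(2026, 3, 2))   # a regression between checkpoints
b1 = reg.append("model-b", 70.0, date(2026, 2, 1))
b2 = reg.append("model-b", 66.0, date(2026, 2, 1), corrects=b1.entry_id)  # a correction

print([(e.model, e.score) for e in reg.leaderboard()])
print(len(reg.history()))
```

In this sketch the leaderboard shows only the current state, while `history()` still holds the earlier 71.0 for model-a and the withdrawn 70.0 for model-b, so both the regression and the correction remain inspectable.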

And the measurement is independent. The project can earn its keep through custom benchmarks, commissioned evaluations, and tasteful, clearly labelled sponsorship, but none of it buys a ranking position, a better score, or a silent reshuffle. Paid work is disclosed inline where it appears, and the public method is not for sale. The people whose systems are being measured do not get to grade their own homework.

§ 02 · Principles

What stays fixed.

The project may change shape — grow, hire, earn money. These are the commitments that do not move regardless, because they are what makes the record worth trusting.

  • 01
    The record comes first.
    Codesota may grow, take funding, or become a company — that future is open. What is fixed is the order of priorities: the integrity of the public record is never subordinate to a business goal. If the two ever conflict, the record wins, and the conflict is disclosed.
  • 02
    Every score is labelled, not laundered.
    A model announcement is not silently promoted into a verified result. Each score carries a verification tier — self-reported, community-reproduced, or Codesota-reproduced — and a date. Self-reported numbers are welcome; they are simply marked as such, never dressed up as something stronger than they are.
  • 03
    One editor, in the open.
    Editorial judgement — what to include, how to describe it, how to contextualise a score — is one person’s, made in public. There is no hidden review panel. Disputes are welcome in GitHub issues, where the disagreement is preserved alongside the record.
  • 04
    No leaderboard theatre.
    We do not re-rank scores to flatter a vendor, re-run an evaluation until the number improves, or quietly delete rows that age badly. When a result is corrected, the correction is visible. The record is append-only in spirit: a corrected row stays in view next to its correction.
  • 05
    The ranking is not for sale.
    Codesota can earn its keep: custom benchmarks, commissioned evaluations, and tasteful, clearly labelled sponsorship are all fair game. What money cannot buy is a place in the public ranking, a better score, or a quiet reshuffle. Anything paid is disclosed where it appears.
  • 06
    Not a replacement for the paper.
    The registry is a record of results; the primary literature is still the primary literature. Every row links back to its source, and the source is what you cite.

§ 03 · Masthead

Who runs this.

Codesota is written and maintained by Kacper Wikiel, in Warsaw. The repository, the registry and the site are one codebase, public on GitHub.

Day-to-day work is writing editorial pages, running reproductions on submitted checkpoints, and curating the task taxonomy. Contributors who submit reproduction runs or corrections are credited on the row rather than abstracted into a masthead count.

Today the project is self-directed and self-funded. It may not stay that way (funding, a team, or a company are all on the table), but the integrity commitments above travel with it. When paid work exists, whether consulting on benchmark design, commissioned evaluations, custom profiling, or clearly labelled sponsorship, it is declared on the page it affects, and it does not change the ordering of the public registry. If that arrangement ever changes, the change is announced on this page before any row is affected.

Editorial judgement — which benchmarks to include, how to describe them, how to contextualise a score — is, and remains, the responsibility of the editor. Dissent is welcome in GitHub issues; the record is public either way.

§ 04 · Contact

How to write in.

Four routes, in order of preference.

  1. Open a GitHub issue.
    Corrections, new benchmark proposals, methodology disputes. The repo is github.com/kwikiel/codesota.com. Public, so the discussion is preserved.
  2. Submit a result.
    If you have trained a model and have a reproducibility package ready, the submission form is the right door. We run it, publish the score, and date the row.
  3. Partnership or consulting.
    Benchmark-design work, custom evaluations, commissioned profiling: see /consulting. Listing placement is not part of the arrangement.
  4. Press / republication.
    Registry data is published under CC BY 4.0. Cite codesota.com with the snapshot date; the JSON at /data/benchmarks.json is the canonical form.
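
For republication, a small helper can keep the snapshot date attached to every citation. The sketch below is only a suggestion: the citation wording, the function name, and the `https://` scheme on the URL are assumptions, and the layout of the JSON itself is deliberately not assumed here.

```python
from datetime import date

# Site and path are taken from the text above; the https scheme is an assumption.
REGISTRY_JSON_URL = "https://codesota.com/data/benchmarks.json"
LICENSE = "CC BY 4.0"


def cite(snapshot: date) -> str:
    """Build a citation string that carries the snapshot date (hypothetical wording)."""
    return (
        f"Codesota, codesota.com, registry snapshot of {snapshot.isoformat()}, "
        f"{LICENSE}. Data: {REGISTRY_JSON_URL}"
    )


print(cite(date(2026, 4, 22)))
```

Using the date the JSON was downloaded as the snapshot date lets a reader of the republished figures see how old they are.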

All routes verified live · April 2026