Codesota · How it works
9,168 results · 303 added this month · Updated 2026-05-13
§ 00 · The flywheel

How Codesota compounds.

Papers with Code gave the field a shape: the <Task, Dataset, Metric> tuple, the leaderboard, the code link. We inherit that shape and add three things it was missing — provenance for every score, an append-only audit log, and a paper-discovery feed keyed to verified scores, not press releases.

Each of those three compounds on the other two. A verified score writes a row in the log; the log makes the paper it came from richer on /papers; the paper’s leaderboard entry makes the score easier to verify next time.
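
Concretely, one registry row reduces to something like the sketch below. The field names are guesses assembled from the descriptions on this page, not the actual benchmark_results schema:

    from dataclasses import dataclass
    from datetime import date

    # Illustrative shape of one registry row; field names are guesses
    # from this page's descriptions, not the real benchmark_results schema.
    @dataclass
    class BenchmarkResult:
        task: str           # e.g. "Visual Question Answering"
        dataset: str        # e.g. "VQA v2"
        metric: str         # e.g. "overall accuracy"
        model: str          # the system that produced the score
        value: float        # the claimed score
        paper_url: str      # the paper making the claim
        source_url: str     # where the number can be checked
        result_date: date   # when the result was reported
        verified: bool      # did an editor confirm the claim?
        is_sota: bool       # best verified score at insert time?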

§ 01 · The cycle

Four turns. Each feeds the next.

  01 · Seed from PWC

    79,817 papers · 9,327 benchmarks · 5,628 datasets from seven years of Papers with Code. Frozen source, trust grade ≈ C.

    Inherited · read-only
  02 · Verify & date

    New scores arrive through submissions, arXiv extraction and editor audits. Each row records source URL, verification date, and whether we believed the claim.

    93% of current rows verified
  03 · Append to log

    /log records every benchmark_results insert, newest first, grouped by day. New-SOTA rows are marked; unverified rows still appear, flagged. (A sketch of the append step follows this list.)

    303 rows added in last 30 days
  04 · Publish in /papers

    Any paper with a verified score surfaces on /papers with its top leaderboard entry and code link. No paper shows up without at least one score we recorded.

    34,667 papers discoverable
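
A minimal sketch of the append step in turn 03, assuming a SQLite-style benchmark_results table with the columns named above and a higher-is-better metric; the real storage layer isn’t specified on this page:

    import sqlite3

    def append_result(db: sqlite3.Connection, row: dict) -> None:
        """Append-only: rows are inserted, never updated or deleted in place."""
        # Prior best verified score for this benchmark (assumes higher is better).
        prior = db.execute(
            "SELECT MAX(value) FROM benchmark_results"
            " WHERE benchmark = ? AND verified = 1",
            (row["benchmark"],),
        ).fetchone()[0]
        row["is_sota"] = int(prior is None or row["value"] > prior)
        db.execute(
            "INSERT INTO benchmark_results"
            " (benchmark, model, value, source_url, result_date, verified, is_sota)"
            " VALUES (:benchmark, :model, :value, :source_url, :result_date, :verified, :is_sota)",
            row,
        )
        db.commit()
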
§ 02 · State of the build

What’s shipped, what’s next.

Live · shipping
  • /log

    Append-only ledger of every benchmark result, with delta-vs-prior-SOTA and source URL. (Delta computation sketched after this list.)

  • /papers

    Paper discovery keyed to verified scores — only papers with at least one recorded benchmark result.

  • /lineage/vqa

    Editorial lineage graph: attention path + branches with live SOTA pulled from the registry.

  • /llm

    LLM-specific leaderboard with reasoning, code and math subtask breakdowns.

  • /ocr

    OCR benchmarks — layout, handwriting, table structure across 16+ models.

  • /submit (gated)

    Signed-in users submit paper + score; we review and append to the log.
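
How the delta-vs-prior-SOTA column on /log could be computed: a sketch assuming higher-is-better metrics and rows that carry the value, verified and result_date fields described above; the actual implementation may differ.

    def deltas_vs_prior_sota(rows: list[dict]) -> list[dict]:
        """Annotate each row with its gap to the best verified score before it."""
        best, out = None, []
        for r in sorted(rows, key=lambda r: r["result_date"]):
            delta = None if best is None else round(r["value"] - best, 4)
            out.append({**r, "delta_vs_sota": delta})
            if r["verified"] and (best is None or r["value"] > best):
                best = r["value"]
        return out[::-1]  # /log displays newest first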

On the roadmap
  • LLM-assisted extraction with human review

    Paper PDFs → structured score extraction → queued on /dashboard/review → editor approves each row before append.

  • Per-paper detail pages

    /paper/<id> with every claimed score, each marked verified/unverified and cross-linked to the active leaderboard.

  • Rich-link cards (og:image, oEmbed)

    When a Codesota benchmark URL is shared anywhere, it renders a live SOTA card instead of a plain link.

  • arXiv annotator extension

    Overlay verified SOTA context on arXiv abstract pages — the long bet.

  • Freshness cron + flag button

    Weekly audit for NULL source_url / result_date, plus a user-facing flag-a-score button. (Audit query sketched after this list.)
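
The freshness audit on the roadmap might reduce to a query like this; table and column names mirror the item above and are assumptions:

    import sqlite3

    def weekly_freshness_audit(db: sqlite3.Connection) -> list[tuple]:
        """Rows missing the provenance fields every entry is supposed to carry."""
        return db.execute(
            "SELECT rowid, benchmark, model FROM benchmark_results"
            " WHERE source_url IS NULL OR result_date IS NULL"
        ).fetchall()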

§ 03 · Contribute

Three ways to move the needle.

Submit

New score for a tracked benchmark

Paper URL + model + score + source. We verify and append — usually within 48h. Signed-in users only; this is the fastest way to write to the log.

Submit a result →
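
For orientation, a submission reduces to roughly the payload below. Every value is a placeholder, and the actual /submit form may ask for more or different fields:

    # Illustrative submission payload; the real /submit form fields may differ.
    submission = {
        "paper_url": "https://arxiv.org/abs/2405.00000",  # placeholder ID
        "model": "MyModel-7B",                            # placeholder name
        "benchmark": "VQA v2",                            # a tracked benchmark
        "score": 84.3,                                    # the claimed number
        "source_url": "https://example.com/table-3",      # where we can check it
    }
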
Flag

Claim that doesn’t match source

Incorrect score, broken link, wrong metric, stale SOTA. Any benchmark page has a flag button — we investigate and correct with a log entry.

Your flags →
Build

On the open JSON feed

Full registry lives at /data/benchmarks.json — no API key, no rate limit. Cite in papers, embed in dashboards, wire into agents.

Open the feed ↗
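
A minimal consumer of the feed. The host below is a placeholder for the site’s origin, and the JSON layout beyond “full registry” is left to inspection:

    import json
    import urllib.request

    # Placeholder host; the feed lives at /data/benchmarks.json on the site's origin.
    URL = "https://codesota.example/data/benchmarks.json"

    with urllib.request.urlopen(URL) as resp:
        registry = json.load(resp)

    # Structure beyond "full registry" is not documented here; inspect and adapt.
    print(type(registry).__name__)
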
§ 04 · Why it has to be this shape

Research infrastructure can’t depend on goodwill.

When Meta shut down Papers with Code in July 2025, seven years of task graphs, leaderboards and paper-to-code linkages went to a redirect. The data wasn’t lost — the paperswithcode-data repo is still frozen on GitHub — but the living, updating version of it was.

Open data from day one. Community contributions as the primary write path, not as an afterthought. No single point of failure — the benchmarks JSON is mirrorable, the log is append-only, the source URLs are checkable. Value compounds because every new score makes the next verification cheaper.

That’s the flywheel. Not a marketing graphic — a concrete commitment that every surface of this site feeds the others, and all of it stays readable even if the rest of the internet changes its mind about what’s load-bearing.

Spin the flywheel.

Every verified submission pins one more score; every correction leaves a log entry; every paper with a tracked benchmark turns up on /papers. The more turns, the harder the data is to move.

Submit a result · Read the log · What happened to PWC