Benchmarking is fixable.
Papers report scores without methodology. Leaderboards silently mix apples and oranges. Dataset links rot. Human baselines get collected once and never updated. Codesota exists to push back against that — slowly, row by row, with evidence.
This page is an honest note about how to help. The project is built in public by one person and a small rotating circle of contributors. There is no Discord with thousands of members, no bounty program with quoted dollar amounts. There is a GitHub repo, a working registry, and a standing invitation.
Why the registry needs hands.
A widely cited paper reports a number on ImageNet. The evaluation used a different test split and non-standard augmentation, and the code repo now returns a 404. The number is quoted in dozens of slides. Nobody flags it because nobody owns the table.
A benchmark has twelve entries from the last year. Three use multi-scale test-time augmentation (worth one or two points). Five use a different dataset version. One was quietly retracted by its authors. The ranking treats all twelve as comparable. It is not a ranking; it is a collage.
Ask a coding agent to audit a page and it produces a well-formatted report with plausible citations. Half the scores are rounded wrong. Two of the papers do not exist. The only reliable fix is a human who reads the primary source and writes down what they actually found.
How a contribution lands.
Four steps. Use AI tools freely to do the mechanical work — the editorial judgement must be yours.
Find a target. A benchmark with stale rows, a paper we have not covered, or a page where the numbers no longer line up with the source. Open issues on GitHub are a starting shortlist.
Do the work. Reproduce a score, file a correction, or draft a page. Cite primary sources (the paper, the model card, the appendix), not tweets. If you use an LLM, verify every claim it makes.
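The mechanical half of this step scripts well; the judgement about what a gap means does not. A minimal sketch in Python, with an invented audit_score helper and made-up numbers; the tolerance is per-benchmark and is yours to justify:

```python
# Minimal sketch of the mechanical half of a reproduction: compare what a
# paper reports against what you measured. Names and numbers here are
# illustrative, not part of Codesota's tooling.

def audit_score(reported: float, reproduced: float, tolerance: float) -> str:
    """Flag a row when the reproduced score drifts past a per-benchmark
    tolerance (pick the tolerance from known run-to-run variance)."""
    gap = reproduced - reported
    verdict = "ok" if abs(gap) <= tolerance else "FLAG"
    return f"{verdict}: reported {reported:.2f}, reproduced {reproduced:.2f} (gap {gap:+.2f})"

# Example: a hypothetical ImageNet top-1 entry with a 0.2-point tolerance.
print(audit_score(reported=76.50, reproduced=76.42, tolerance=0.2))  # ok
print(audit_score(reported=76.50, reproduced=75.10, tolerance=0.2))  # FLAG
```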
Submit. Open a pull request, or send the write-up via /submit. Data files go as JSON; editorial notes as Markdown.
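For orientation, a data-file row could look like the sketch below. Every field name here is invented for illustration; the registry's real schema is whatever the existing files in the repo use, so copy one of those rather than this sketch.

```json
{
  "benchmark": "imagenet-1k-val",
  "model": "example-net",
  "metric": "top1_accuracy",
  "reported": 76.5,
  "reproduced": 76.42,
  "evidence": "link to the paper section or model card you checked",
  "notes": "single-crop, no TTA; dataset version stated explicitly"
}
```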
Get credit. Accepted work is merged with your name attached. If a number you produced lands on the front page, the chart annotation cites you directly.
Who this is for.
Portfolio pieces that are not another MNIST classifier. A merged reproduction with your name on a public registry is a stronger signal than a LeetCode streak.
Tired of citing scores you cannot reproduce? Fix the benchmark data you actually depend on. Corrections are preserved in the history, so the repair is legible.
You notice when a number does not add up. That instinct, applied to one table for one afternoon, is often worth more than a week of automated scraping.
Straight answers.
- Do I need a PhD?
- No. The easiest contributions are research-skill work: read the paper carefully, check the numbers, check the links, write down what is actually there. (The link checking is the one genuinely mechanical part; a sketch follows this list.)
- Is there payment?
- Not yet. The project is self-funded and does not have a budget for contributor payments. Accepted work is credited on the page, in the commit history, and in the site's changelog.
- Can I use ChatGPT or Claude?
- Yes. Use them freely to do mechanical work. Anything the model produces you are responsible for verifying — unchecked model output is the bug we are trying to fix, not the solution.
- What if I claim a task and cannot finish?
- Drop it. No penalty. Leave a comment on the issue so someone else can pick it up.
- How long until a submission is reviewed?
- It depends: usually days, sometimes a couple of weeks. One person is currently reviewing; the queue is public on GitHub.
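As promised above, here is what the mechanical link-checking part of an audit can look like. A minimal sketch, standard library only; the input format (a text file with one URL per line) and the script name are assumptions for this example, not project tooling:

```python
# check_links.py: report the HTTP status of each URL in a text file.
import sys
import urllib.error
import urllib.request

def check(url: str, timeout: float = 10.0) -> str:
    # HEAD keeps the check cheap; some servers reject HEAD (405),
    # in which case the result is inconclusive rather than dead.
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "registry-link-check"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"{resp.status} {url}"
    except urllib.error.HTTPError as err:
        return f"{err.code} {url}"
    except OSError as err:  # DNS failure, refused connection, timeout
        return f"ERR  {url} ({err})"

if __name__ == "__main__":
    # Usage: python check_links.py urls.txt   (one URL per line)
    for line in open(sys.argv[1], encoding="utf-8"):
        url = line.strip()
        if url:
            print(check(url))
```

A 404 from this script is a lead, not a verdict: repos get renamed and hosts move, so the human step is confirming whether the artifact still exists anywhere before marking a row as broken.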