Benchmarking is fixable.
Papers report scores without methodology. Leaderboards silently mix apples and oranges. Dataset links rot. Human baselines get collected once and never updated. Codesota exists to push back against that — slowly, row by row, with evidence.
This page is an honest note about how to help. The project is built in public by one person and a small rotating circle of contributors. There is no Discord with thousands of members, no bounty program with quoted dollar amounts. There is a GitHub repo, a working registry, and a standing invitation.
Why the registry needs hands.
A widely cited paper reports a number on ImageNet. The evaluation used a different test split and non-standard augmentation, and the code repo now returns a 404. The number is quoted in dozens of slides. Nobody flags it because nobody owns the table.
A benchmark has twelve entries from the last year. Three use multi-scale test-time augmentation (worth one or two points). Five use a different dataset version. One was quietly retracted by its authors. The ranking treats all twelve as comparable. It is not a ranking; it is a collage.
Ask a coding agent to audit a page and it produces a well-formatted report with plausible citations. Half the scores are rounded wrong. Two of the papers do not exist. The only reliable fix is a human who reads the primary source and writes down what they actually found.
How a contribution lands.
Four steps. Use AI tools freely to do the mechanical work — the editorial judgement must be yours.
Find a target. A benchmark with stale rows, a paper we have not covered, or a page where the numbers no longer line up with the source. Open issues on GitHub are a starting shortlist.
Do the work. Reproduce a score, file a correction, or draft a page. Cite primary sources (the paper, the model card, the appendix), not tweets. If you use an LLM, verify every claim it makes.
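The mechanical half of this step scripts well; the judgement about what a gap means does not. A minimal sketch in Python, with an invented audit_score helper and made-up numbers; the tolerance is per-benchmark and is yours to justify:

```python
# Minimal sketch of the mechanical half of a reproduction: compare what a
# paper reports against what you measured. Names and numbers here are
# illustrative, not part of Codesota's tooling.

def audit_score(reported: float, reproduced: float, tolerance: float) -> str:
    """Flag a row when the reproduced score drifts past a per-benchmark
    tolerance (pick the tolerance from known run-to-run variance)."""
    gap = reproduced - reported
    verdict = "ok" if abs(gap) <= tolerance else "FLAG"
    return f"{verdict}: reported {reported:.2f}, reproduced {reproduced:.2f} (gap {gap:+.2f})"

# Example: a hypothetical ImageNet top-1 entry with a 0.2-point tolerance.
print(audit_score(reported=76.50, reproduced=76.42, tolerance=0.2))  # ok
print(audit_score(reported=76.50, reproduced=75.10, tolerance=0.2))  # FLAG
```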
Submit. Open a pull request, or send the write-up via /submit. Data files go as JSON; editorial notes as Markdown.
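For orientation, a data-file row could look like the sketch below. Every field name here is invented for illustration; the registry's real schema is whatever the existing files in the repo use, so copy one of those rather than this sketch.

```json
{
  "benchmark": "imagenet-1k-val",
  "model": "example-net",
  "metric": "top1_accuracy",
  "reported": 76.5,
  "reproduced": 76.42,
  "evidence": "link to the paper section or model card you checked",
  "notes": "single-crop, no TTA; dataset version stated explicitly"
}
```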
Get credit. Accepted work is merged with your name attached. If a number you produced lands on the front page, the chart annotation cites you directly.
Who this is for.
Portfolio pieces that are not another MNIST classifier. A merged reproduction with your name on a public registry is a stronger signal than a LeetCode streak.
Tired of citing scores you cannot reproduce? Fix the benchmark data you actually depend on. Corrections are preserved in the history, so the repair is legible.
You notice when a number does not add up. That instinct, applied to one table for one afternoon, is often worth more than a week of automated scraping.
Straight answers.
- Do I need a PhD?
- No. The easiest contributions are research-skill work: read the paper carefully, check the numbers, check the links, write down what is actually there. (The link checking is the one genuinely mechanical part; a sketch follows this list.)
- Is there payment?
- Not yet. The project is self-funded and does not have a budget for contributor payments. Accepted work is credited on the page, in the commit history, and in the site's changelog.
- Can I use ChatGPT or Claude?
- Yes. Use them freely to do mechanical work. Anything the model produces you are responsible for verifying — unchecked model output is the bug we are trying to fix, not the solution.
- What if I claim a task and cannot finish?
- Drop it. No penalty. Leave a comment on the issue so someone else can pick it up.
- How long until a submission is reviewed?
- It depends: usually days, sometimes a couple of weeks. One person is currently reviewing; the queue is public on GitHub.
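As promised above, here is what the mechanical link-checking part of an audit can look like. A minimal sketch, standard library only; the input format (a text file with one URL per line) and the script name are assumptions for this example, not project tooling:

```python
# check_links.py: report the HTTP status of each URL in a text file.
import sys
import urllib.error
import urllib.request

def check(url: str, timeout: float = 10.0) -> str:
    # HEAD keeps the check cheap; some servers reject HEAD (405),
    # in which case the result is inconclusive rather than dead.
    req = urllib.request.Request(url, method="HEAD",
                                 headers={"User-Agent": "registry-link-check"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return f"{resp.status} {url}"
    except urllib.error.HTTPError as err:
        return f"{err.code} {url}"
    except OSError as err:  # DNS failure, refused connection, timeout
        return f"ERR  {url} ({err})"

if __name__ == "__main__":
    # Usage: python check_links.py urls.txt   (one URL per line)
    for line in open(sys.argv[1], encoding="utf-8"):
        url = line.strip()
        if url:
            print(check(url))
```

A 404 from this script is a lead, not a verdict: repos get renamed and hosts move, so the human step is confirming whether the artifact still exists anywhere before marking a row as broken.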