Can You Trust This Data?
Yes. Here's exactly how we collect, verify, and maintain benchmark results.
Where Does the Data Come From?
Four sources, in order of trust:
Our own testing
We run models locally on the same benchmarks. No vendor APIs, no marketing numbers. Same test conditions, reproducible results.
Official leaderboards
AlphaXiv and similar platforms that maintain verified benchmark results.
Published papers
arXiv, ACL, NeurIPS, CVPR, and peer-reviewed venues. Only when they report results on standard benchmarks.
Vendor-reported (grain of salt)
Numbers from vendor documentation or announcements. Labeled as "vendor-claimed" and prioritized for independent verification. We aim to replace these with our own testing.
How Do You Verify Results?
Five-tier verification. Every result is labeled:
Self-tested
We ran the model locally on the benchmark. Same conditions, our hardware, reproducible. Highest trust.
Verified
Confirmed via an official leaderboard, reproducible code, or a peer-reviewed paper with verifiable claims.
Paper-reported
Published in a paper but not yet on an official leaderboard. We include it with the source paper linked.
Pending
Recently announced, awaiting confirmation. Removed after 30 days if unverified.
Vendor-claimed
From vendor documentation only. Treat with skepticism. Prioritized for independent testing. Will be upgraded or corrected once we verify.
When sources conflict: Our own testing overrides official leaderboards, which override papers with code, which override papers without code. Vendor claims are always lowest priority.
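As a rough sketch of how this precedence could be applied in code: the priority table below mirrors the tier order above, but the exact label strings, names, and helper function are illustrative, not our published tooling.

```python
# Sketch only: lower number = higher trust, following the tier order above.
SOURCE_PRIORITY = {
    "self-tested": 0,       # we ran the benchmark ourselves
    "verified": 1,          # official leaderboard, reproducible code, or peer review
    "paper-reported": 2,    # published paper, not yet on a leaderboard
    "pending": 3,           # recently announced, awaiting confirmation
    "vendor-claimed": 4,    # vendor documentation only
}

def resolve_conflict(results):
    """Given several reports for the same model and benchmark,
    keep the one from the most trusted source."""
    return min(results, key=lambda r: SOURCE_PRIORITY[r["verification"]])

# Example: a self-tested result wins over a vendor claim.
reports = [
    {"score": 71.2, "verification": "vendor-claimed"},
    {"score": 68.9, "verification": "self-tested"},
]
print(resolve_conflict(reports))  # {'score': 68.9, 'verification': 'self-tested'}
```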
What Gets Excluded?
Four hard boundaries:
Marketing claims
No blog posts, press releases, or promotional materials unless backed by verifiable data.
Estimated or interpolated values
Missing data stays blank. We never fill gaps with estimates.
Paid placements
Rankings are performance-only. No vendor can pay for favorable positioning.
Cherry-picked metrics
We report standard metrics as defined by dataset creators. No selective highlighting.
How Is the CodeSOTA Score Calculated?
The CodeSOTA Score is a weighted average across multiple benchmarks, designed to give you one number to compare models.
Benchmark Weights
Score Normalization
All scores are normalized to a 0-100 scale (see the sketch below):
- Higher-is-better metrics (accuracy, F1): used directly if already reported on a 0-100 scale, multiplied by 100 if reported on a 0-1 scale
- Lower-is-better metrics (CER, edit distance): inverted as 100 - score
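A minimal sketch of these normalization rules, assuming the direction and scale of each metric are known per benchmark; the function below is illustrative, not our production code.

```python
def normalize(score, lower_is_better=False, scale_0_1=False):
    """Map a raw benchmark score onto the 0-100 scale."""
    if scale_0_1:
        score *= 100          # e.g. an F1 of 0.87 becomes 87
    if lower_is_better:
        score = 100 - score   # e.g. a CER of 4.2 becomes 95.8
    return round(score, 2)

print(normalize(0.87, scale_0_1=True))       # 87.0
print(normalize(4.2, lower_is_better=True))  # 95.8
```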
Handling Missing Data
We only average benchmarks where the model was actually tested (see the sketch after this list):
- A minimum of 2 benchmarks is required for an aggregate score
- Coverage is shown as "X/8 benchmarks" so you know how much data backs the score
- Missing data is shown as "--", never estimated
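A sketch of how the aggregate could be computed under these rules. The benchmark names and weights below are placeholders, not the actual weight table, and `None` stands in for the "--" shown on the site.

```python
def codesota_score(normalized, weights, total_benchmarks=8):
    """normalized: {benchmark: 0-100 score or None if the model was not tested}."""
    tested = {b: s for b, s in normalized.items() if s is not None}
    coverage = f"{len(tested)}/{total_benchmarks} benchmarks"
    if len(tested) < 2:
        return None, coverage          # not enough data for an aggregate
    weight_sum = sum(weights[b] for b in tested)
    score = sum(weights[b] * s for b, s in tested.items()) / weight_sum
    return round(score, 1), coverage

weights = {"bench_a": 2.0, "bench_b": 1.0, "bench_c": 1.0}     # illustrative only
scores  = {"bench_a": 82.0, "bench_b": None, "bench_c": 74.5}  # None = "--"
print(codesota_score(scores, weights))  # (79.5, '2/8 benchmarks')
```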
Score Tiers
The goal is transparency: you can always click through to see individual benchmark scores and make your own judgment.
Can I Access the Raw Data?
Yes. All data is available as JSON files at:
/data/*.json
Each file includes model names, paper references, source URLs, verification status, and raw metrics.
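A minimal example of reading the files with Python's standard library, assuming a local checkout where the files sit under data/. The exact keys inside each file may vary, so inspect the records you get back.

```python
import glob
import json

# Count top-level entries in each published JSON file.
for path in sorted(glob.glob("data/*.json")):
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    print(f"{path}: {len(records)} top-level entries")
```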
How Do I Report an Error?
Three ways to contribute:
Corrections: Open a GitHub issue with a link to the correct source.
Additions: Submit a pull request updating the relevant JSON file. Include verification sources.
New benchmarks: Open a discussion issue. We prioritize standardized benchmarks with active research.
Questions about our methodology? Reach out via GitHub Issues.