Can You Trust This Data?

Yes. Here's exactly how we collect, verify, and maintain benchmark results.

Where Does the Data Come From?

Four sources, in order of trust:

1. Our own testing

We run models locally on the same benchmarks. No vendor APIs, no marketing numbers. Same test conditions, reproducible results.

2. Official leaderboards

AlphaXiv and similar platforms that maintain verified benchmark results.

3. Published papers

arXiv preprints and peer-reviewed venues such as ACL, NeurIPS, and CVPR. Only when they report results on standard benchmarks.

4. Vendor-reported (grain of salt)

Numbers from vendor documentation or announcements. Labeled as "vendor-claimed" and prioritized for independent verification. We aim to replace these with our own testing.

How Do You Verify Results?

Five-tier verification. Every result is labeled:

SELF-TESTED

We ran this model locally on the benchmark. Same conditions, our hardware, reproducible. Highest trust.

VERIFIED

Confirmed via official leaderboard, reproducible code, or peer-reviewed paper with verifiable claims.

PAPER-CLAIMED

Published in paper but not yet on official leaderboard. We include it with the source paper linked.

PENDING VERIFICATION

Recently announced, awaiting confirmation. Removed after 30 days if unverified.

VENDOR-CLAIMED

From vendor documentation only. Treat with skepticism. Prioritized for independent testing. Will be upgraded or corrected once we verify.

When sources conflict: Our own testing overrides official leaderboards, which override papers with code, which override papers without code. Vendor claims are always lowest priority.
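
In code, that priority ordering might look like the sketch below. The label strings and the resolve_conflict helper are illustrative only, not our production implementation.

    # Sketch of the source-priority rule described above. The label strings
    # and this helper are illustrative, not the production implementation.
    SOURCE_PRIORITY = {
        "self-tested": 0,          # our own testing wins
        "leaderboard": 1,          # official leaderboards
        "paper-with-code": 2,
        "paper-without-code": 3,
        "vendor-claimed": 4,       # always lowest priority
    }

    def resolve_conflict(results):
        """Pick the result from the most trusted source.

        results: list of dicts like {"source": "leaderboard", "score": 87.2}.
        """
        return min(results, key=lambda r: SOURCE_PRIORITY[r["source"]])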

What Gets Excluded?

Four hard boundaries:

Marketing claims

No blog posts, press releases, or promotional materials unless backed by verifiable data.

Estimated or interpolated values

Missing data stays blank. We never fill gaps with estimates.

Paid placements

Rankings are performance-only. No vendor can pay for favorable positioning.

Cherry-picked metrics

We report standard metrics as defined by dataset creators. No selective highlighting.

How is the CodeSOTA Score Calculated?

The CodeSOTA Score is a weighted average across multiple benchmarks, designed to give you one number to compare models.

Benchmark Weights

Primary (3x weight)
OmniDocBench, OCRBench v2, olmOCR-Bench
Comprehensive, widely used benchmarks

Secondary (2x weight)
CHURRO-DS, CC-OCR
Specialized but important

Tertiary (1x weight)
KITAB-Bench, ThaiOCRBench, MME-VideoOCR
Language-specific or narrow in scope
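
Expressed as a data structure, the tiers above might look like this sketch. The BENCHMARK_WEIGHTS dict is illustrative and reused in the examples below; it is not our production code.

    # Benchmark tiers and weights as listed above (illustrative sketch).
    BENCHMARK_WEIGHTS = {
        "OmniDocBench": 3, "OCRBench v2": 3, "olmOCR-Bench": 3,   # primary
        "CHURRO-DS": 2, "CC-OCR": 2,                              # secondary
        "KITAB-Bench": 1, "ThaiOCRBench": 1, "MME-VideoOCR": 1,   # tertiary
    }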

Score Normalization

All scores are normalized to a 0-100 scale:

  • Higher-is-better metrics (accuracy, F1): used directly if already 0-100, scaled if 0-1
  • Lower-is-better metrics (CER, edit distance): inverted (100 - score)
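
A minimal normalization sketch, assuming higher-is-better metrics arrive on a 0-1 or 0-100 scale and lower-is-better metrics arrive as percentages:

    def normalize(score, lower_is_better=False):
        """Map a raw benchmark metric onto a 0-100 scale (sketch)."""
        if 0.0 <= score <= 1.0:
            score *= 100              # scale 0-1 metrics up to 0-100
            # assumes 0-100 metrics are always > 1; adjust if not
        if lower_is_better:
            score = 100 - score       # invert CER/edit distance so higher is better
        return score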

Handling Missing Data

We only average benchmarks where the model was actually tested:

  • Minimum 2 benchmarks required for an aggregate score
  • Coverage shown as "X/8 benchmarks" so you can judge how much to trust the aggregate
  • Missing data is shown as "--", never estimated
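
Putting it together, here is one plausible reading of the 3x/2x/1x weighting and the coverage rule, reusing the illustrative BENCHMARK_WEIGHTS dict from above:

    def codesota_score(model_scores, weights=BENCHMARK_WEIGHTS):
        """Weighted average over benchmarks the model was actually tested on (sketch).

        model_scores: normalized 0-100 values, e.g. {"OmniDocBench": 81.4, "CC-OCR": 77.0}.
        Missing benchmarks are simply absent, never estimated.
        """
        tested = {b: s for b, s in model_scores.items() if b in weights}
        coverage = f"{len(tested)}/{len(weights)} benchmarks"
        if len(tested) < 2:
            return None, coverage     # not enough coverage for an aggregate score
        total_weight = sum(weights[b] for b in tested)
        weighted_sum = sum(weights[b] * s for b, s in tested.items())
        return round(weighted_sum / total_weight, 1), coverage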

Score Tiers

90-100: Excellent
80-89: Good
70-79: Average
60-69: Below Average
<60: Poor
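
Mapping an aggregate score to a tier is then a simple threshold check, sketched here for illustration:

    def score_tier(score):
        """Translate a 0-100 aggregate score into the tier labels above."""
        if score >= 90:
            return "Excellent"
        if score >= 80:
            return "Good"
        if score >= 70:
            return "Average"
        if score >= 60:
            return "Below Average"
        return "Poor"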

The goal is transparency: you can always click through to see individual benchmark scores and make your own judgment.

Can I Access the Raw Data?

Yes. All data is available as JSON files at:

/data/*.json

Each file includes model names, paper references, source URLs, verification status, and raw metrics.

Format: JSON
License: CC BY 4.0
Update frequency: Weekly
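
A minimal loading sketch, assuming a local checkout with the files under data/ and a list-of-records layout per file; field names such as "verification_status" are assumptions, so check the files for the exact schema:

    import json
    from pathlib import Path

    # Load every published data file and print a few fields per record.
    # Field names here are assumptions; inspect the JSON for the real schema.
    for path in Path("data").glob("*.json"):
        records = json.loads(path.read_text())
        for entry in records:
            print(path.name, entry.get("model"), entry.get("verification_status"))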

How Do I Report an Error?

Three ways to contribute:

Corrections: Open a GitHub issue with a link to the correct source.

Additions: Submit a pull request updating the JSON file. Include verification sources.

New benchmarks: Open a discussion issue. We prioritize standardized benchmarks with active research.

Contribute on GitHub

Questions about our methodology? Reach out via GitHub Issues.