Codesota · Methodology
The editorial standard of evidence
Issue: April 22, 2026

How we measure the state of the art.

Codesota is an open registry of machine-learning benchmarks. The value of a registry depends entirely on the standard of evidence behind each row. This page is the standard — plainly stated, so that a reader can check whether we hold to it.

What follows is how a number becomes a published score — what we require from a benchmark, what we require from a result, what “verified” means, how corrections and retractions are recorded, and what we refuse to do.

§ 01 · Scope

What counts as a benchmark.

The four things a task must have before it becomes a row in the registry.

A benchmark on Codesota is not simply a dataset. It is a task-dataset-metric triple with enough structure that two people can run it and compare their answers. That minimum structure is four items: a declared metric with a direction, a fixed test split, a reproducibility script, and a dated submission.

A declared metric means the benchmark page states which number is being reported and whether higher or lower is better. Half the confusion in the field comes from tables that leave this implicit; ours do not.

A fixed test split means the slice of data used to produce the score is the same slice used by every other row in the table. Private test sets are recorded separately, and never mixed with public scores in the same ranking.

A reproducibility script means there is some path by which the score can be re-executed. For open checkpoints this is a command, a commit hash and an environment; for closed models it is the API endpoint, version string, prompt template and decoding parameters. Without one path or the other, the row does not publish as a verified score.

A dated submission means every score carries the day it was run and, separately, the day it was verified. That separation matters when you are reading the table six months later and trying to tell whether a “current” number is still current.
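
For concreteness, here is what one row might carry, sketched as a Python dict. The field names are illustrative, not the registry's actual schema; the authoritative shape is whatever /data/benchmarks.json serves.

    # Illustrative sketch of a registry row; field names are hypothetical.
    row = {
        "benchmark": "ExampleBench",                      # hypothetical task
        "metric": {"name": "accuracy", "higher_is_better": True},
        "test_split": "test-v1",                          # the fixed public split
        "reproducibility": {                              # open-checkpoint variant
            "command": "python eval.py --split test-v1",
            "commit": "abc1234",
            "environment": "python3.11 / torch2.3 / cuda12.1",
        },
        "submitted": "2026-03-01",                        # day the score was run
        "verified": "2026-03-14",                         # day it was verified
    }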

§ 02 · Evidence

What counts as a result.

Three fields define a result. Missing any of the three marks it claim-only.

A result has three fields behind it: code (a frozen commit or a container digest), an environment (declared explicitly, not “use the latest”), and a seed or decoding configuration (recorded, not assumed).

If any of the three is missing, the submission is not rejected — it is recorded as claim-only. It still appears on the model's page, still carries its date and source, and still counts toward the historical record. It just does not receive a verified badge and does not contribute to SOTA lines until the gap is filled.
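
As a minimal sketch (not the site's actual code), the check behind that rule could look like this, assuming a submission arrives as a dict with the three fields named above:

    def evidence_tier(submission):
        """Return 'verified-eligible' only when all three evidence fields
        are present; otherwise 'claim-only'. A claim-only submission is
        still listed, it simply carries no verified badge."""
        required = ("code", "environment", "seed_or_decoding")
        if all(submission.get(field) for field in required):
            return "verified-eligible"
        return "claim-only"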

This is deliberately a soft fail rather than a hard one. Many useful numbers in the literature predate the practice of pinning commits; refusing to list them would erase a large part of the field's memory. Listing them as claim-only keeps them visible while keeping the verification tier honest.

Conversely, a reproducibility package that does not run on its own declared environment — the script errors, the weights do not load, the reported number cannot be regenerated within the benchmark's stated tolerance — is held back from publication until the submitter resolves it, or declines to.

§ 03 · Verification

What counts as verified.

Three tiers, marked explicitly on every row. A score is allowed to move between tiers; it is not allowed to be unmarked.

Verification is not a boolean. A score on Codesota sits in one of three tiers, marked explicitly on the row and in the JSON. A reader should never have to guess which tier they are looking at.

Tier · What it means
Self-reported · The author's number, with a source link. No independent reproduction. Listed with date and source; not counted toward the SOTA line until upgraded.
Community-reproduced · An independent party ran the reproducibility package and obtained a score within the benchmark's declared tolerance. The reproducer and their commit/environment are recorded alongside the original.
Codesota-reproduced · Run on our own infrastructure from the submitter's code and weights. The run is wrapped as a container, and the container digest is published next to the score.
Fig 1 · Verification tiers. The copper marker is reserved for Codesota-reproduced rows.

Tier changes are recorded, not overwritten. When a self-reported score is later reproduced, the row moves up a tier and the verification date is added; the original submission date is preserved.
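
A hypothetical sketch of that discipline, with illustrative field names: an upgrade appends a verification event and moves the current tier; it never rewrites the original submission date.

    def upgrade_tier(row, new_tier, verified_on, reproducer):
        # Append the event; earlier events and dates are preserved.
        row.setdefault("events", []).append({
            "tier": new_tier,          # e.g. "community-reproduced"
            "verified": verified_on,   # the verification date is added
            "reproducer": reproducer,  # who ran it, and from what package
        })
        row["tier"] = new_tier
        # row["submitted"] is deliberately never touched.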

Where a model cannot be reproduced at all — a proprietary endpoint, a withdrawn checkpoint — the row stays at self-reported or at its highest-achievable tier, and the reason is recorded in the row's notes rather than hidden behind a badge.

§ 04 · History

Scores are dated.

The table never silently forgets.

Every score on Codesota is stamped with the day it was published and the day it was verified. When a successor model lands and a row slips off the top, the older row does not disappear — it stays in the history, ranked by its date.

This is the single most important difference between a registry and a leaderboard. A leaderboard is a view of “who is on top right now”; a registry is the record that makes that view legible. If a model regresses between checkpoints — and they do regress — the preceding score stays visible so the regression itself is visible.

The same discipline applies to the benchmark itself. If a benchmark is updated (a split refreshed, a metric refined), rows from the prior version are not retroactively rewritten. They are marked as scored against the earlier version of the benchmark, and new rows accrue against the current one. A reader can still see the continuity; they are just not misled about which split produced which number.
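
In code terms, a sketch (field names assumed, not the registry's schema): rows carry the benchmark version they were scored against, and the SOTA line for the current version simply filters on it rather than rewriting the old rows.

    def sota_line(rows, version, higher_is_better=True):
        """Running best score over time, restricted to rows scored
        against one benchmark version. Illustrative only."""
        scoped = sorted(
            (r for r in rows if r["benchmark_version"] == version),
            key=lambda r: r["submitted"],
        )
        line, best = [], None
        for r in scoped:
            improved = best is None or (
                r["score"] > best if higher_is_better else r["score"] < best
            )
            if improved:
                best = r["score"]
                line.append((r["submitted"], r["model"], best))
        return line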

The record is append-only in spirit. Corrections are possible; silent deletions are not.

§ 05 · Refusals

What we don’t do.

A short list of policies that are worth stating plainly.

Most of what keeps a registry honest is what it refuses to do. In no particular order:

  • 01
    We do not take payment for listings.
    A model is listed because it has a verifiable score, not because its team paid for placement. Rankings are decided on published methodology, independently of any commercial relationship.
  • 02
    We do not hide negative results.
    Poor scores appear alongside strong ones. If a reproduction run disagrees with an author’s reported number, both are recorded with the discrepancy visible.
  • 03
    We do not re-run evaluations to flatter a vendor.
    An evaluation is run once per submission. If a vendor disagrees with the result, the correct response is a new submission with a new reproducibility package — not a quiet re-run of the old one.
  • 04
    We do not weight scores by training-set overlap.
    We expose the context — training cutoff, contamination flags, saturation notes — and leave the reasoning to the reader. Re-weighting post-hoc to flatter (or punish) a specific vendor is the same category error as selective re-running.
  • 05
    We do not silently retract rows.
    When a row is retracted, it stays on the page with a strikethrough and a link to the retraction note. Readers who cited it can still find it; the retraction is part of the public record.
  • 06
    We do not fill missing data with estimates.
    If a model has not been run on a benchmark, the cell is blank. Aggregate scores only average over benchmarks that were actually evaluated, and the coverage is shown alongside the aggregate. (A sketch of this rule follows the list.)
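
The missing-data rule in the last item is mechanical enough to sketch. Assuming a model's cells arrive as a mapping from benchmark name to a score or None, the aggregate and its coverage might be computed like this (illustrative, not the site's code):

    def aggregate(cells):
        """Average only the benchmarks actually evaluated; report
        coverage alongside. A blank cell is never imputed."""
        evaluated = [v for v in cells.values() if v is not None]
        if not evaluated:
            return None, 0.0
        mean = sum(evaluated) / len(evaluated)
        coverage = len(evaluated) / len(cells)
        return mean, coverage

For example, {"A": 71.4, "B": None, "C": 68.0} aggregates to 69.7 with coverage 2/3: the blank cell lowers the coverage, not the mean.
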
§ 06 · Contribute

How to submit a result.

Five steps. No editorial board, no review panel — just a reproduction run and a dated row.

  1. Check the benchmark exists.
    Browse the task index or the registry to confirm the benchmark you are targeting is already tracked. If it is not, open a benchmark-proposal issue before submitting a score — we would rather define the task once than reshape the table around a one-off submission.
  2. Prepare the reproducibility package.
    A frozen commit (or container digest), the declared environment, the seed or decoding parameters, and a one-command invocation that reproduces the reported score on the declared split. For API-only models, the equivalent is the endpoint version and a complete prompt/decoding specification. (A sketch of such a package follows the list.)
  3. Submit via /submit or a pull request.
    The submission form accepts a link to the reproducibility package and the reported score; contributors may equivalently open a pull request against the JSON for the benchmark. Either route is fine.
  4. We reproduce.
    The submission is queued for reproduction. A run either matches within tolerance (the row is marked Codesota-reproduced), matches via an independent community run (community-reproduced), or remains self-reported until it can be verified. We do not re-run the same package repeatedly to try to match a preferred number.
  5. Publish, dated, with source tier.
    The row appears with its submission date, verification date, tier, and a link back to the reproducibility package. The JSON at /data/benchmarks.json updates in the same cycle.
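
What step 2 asks for is concrete enough to sketch. A hypothetical package, expressed as a Python dict purely for illustration (Codesota does not mandate this shape):

    package = {
        "commit": "9f2c4e1",                  # frozen commit or container digest
        "environment": "python3.11; torch==2.3.0; cuda12.1",
        "seed": 1234,                         # or the full decoding parameters
        "decoding": {"temperature": 0.0, "max_tokens": 2048},
        "invocation": "python eval.py --split test-v1 --seed 1234",
        "reported_score": 71.4,               # the number the command must regenerate
    }
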
§ 07 · Corrections

How corrections work.

The procedure a reader triggers when they believe a score is wrong.

When a reader spots an error — a wrong number, a mis-attributed checkpoint, a benchmark definition that drifted — they file it through the site's feedback channel or by opening an issue on the public GitHub repository.

We then attempt verification. Where a reproducibility package exists, the score is re-run from it. Where one does not, we check the cited source. If the reported correction stands, the row is updated in place — but the update is visible: the row carries a correction note with the date and the reason. A reader returning to a cell they cited previously can see what changed and why.
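
A sketch of what "visible" means mechanically, with hypothetical field names: the cell changes, and a dated note recording the old value is appended next to it.

    def correct(row, new_score, date, reason):
        # The update itself is in place; the trail is not.
        row.setdefault("notes", []).append({
            "kind": "correction",
            "date": date,
            "reason": reason,
            "previous_score": row["score"],  # what the cell said before
        })
        row["score"] = new_score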

If the correction does not stand — we cannot reproduce it, or the cited source is itself wrong — the original row remains, annotated with the report and the verification outcome. The dissent is part of the public record even when it does not change the number.

Where a correction materially changes the conclusion a reader would draw from the table — the top model is no longer the top model, the regression flipped direction — it is recorded as a retraction rather than a silent edit. See the next section.

§ 08 · Retractions

How retractions are recorded.

A retracted row is not a deleted row.

A retraction happens when a published score turns out to be materially wrong — a bug in the evaluation harness, a mis-identified checkpoint, an undisclosed contamination of the test split — and correcting the cell in place would rewrite history in a way that misleads readers who had cited it.

In that case, the row stays on the page. The numbers are struck through, the row is marked retracted, and a link is added to a retraction note describing what happened, when it was discovered, and what the revised position is. A fresh row — with the correct number and a new date — is added below; it does not overwrite the retracted one.

This is mildly ugly on the page and is meant to be. A retraction is an unusual event, not a routine one, and the page makes that visible. The alternative — silent deletion — is the failure mode that made aggregator sites unreliable in their late stages, and it is not available to us.

§ 09 · Access

Open data, everywhere.

Every number on the site is also available as JSON. No paywall, no signup, no crawler trap.

The page you are reading and the JSON a program would consume are the same data, rendered differently. Every benchmark row on the site has a JSON representation, and every aggregated view has a bulk dump.

Resource · Where
Full benchmark dump · /data/benchmarks.json
API reference · /api — query-time endpoints, auth, limits
Source repository · github.com/kwikiel/codesota — registry, site, build history
Changelog · /changelog — registry updates and corrections
Fig 2 · The same rows render as JSON, HTML and CSV from the same underlying files.
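
Consuming the dump needs nothing beyond a standard library. A minimal sketch in Python; the host below is a placeholder, and the field names are assumptions about the dump's shape rather than a documented contract:

    import json
    from urllib.request import urlopen

    BASE = "https://codesota.example"  # placeholder host: substitute the real site
    with urlopen(BASE + "/data/benchmarks.json") as resp:
        rows = json.load(resp)

    # Hypothetical field names; check the dump itself for the real schema.
    reproduced = [r for r in rows if r.get("tier") == "codesota-reproduced"]
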
§ 10 · Disclosure

Who runs this, and how it is funded.

A page on methodology is not complete without a page on incentives.

Codesota is maintained by Kacper Wikiel in Warsaw, built in public on GitHub. The registry, the site, and the infrastructure are one repository.

The project is currently self-directed and not venture-backed. Day-to-day running costs (hosting, reproduction infrastructure, inference for API-only benchmarks) are covered directly. Where paid work exists — consulting, custom benchmarking, commissioned evaluations for teams who want a specific model profiled — it is flagged inline on the page where the result appears, and does not influence the ordering of the public registry.

We have no financial relationship with any model vendor that would put a finger on the scale for a specific listing. If that ever changes, the relationship will be disclosed on the affected page before it affects any row.

Editorial judgement — which benchmarks to include, how to describe them, how to frame context in the surrounding copy — is, and remains, the responsibility of the site owner. Methodology questions and disputes are welcome; the address is in the margin.

§ 11 · Contamination tax

Two scores, not one.

The gap between a benchmark's gold tests and an independently-written test set is the contamination tax. We publish both numbers.

A benchmark score is only meaningful if you trust the test set. Public benchmarks leak — their problems end up in training corpora, their gold answers in fine-tuning data, their evaluation harnesses in instruction-tuning sets. A model that has seen the test set scores higher than a model that hasn’t, even when the underlying capability is the same. That gap is the contamination tax, and historically nobody has been quoting it.

We publish two scores per benchmark wherever we can. The first is the canonical gold score on the benchmark’s declared test split. The second is an independent score on a parallel test set: same task, same difficulty distribution, but generated independently of the original benchmark — re-transcribed ground truth, freshly sampled problems past the model’s training cutoff, or a held-out commercial split that has never been released publicly. The two numbers are reported side-by-side on the benchmark page, and their difference renders as a single mono-spaced column: the tax.

A small tax (under ~3 points absolute) means the model has generalised — its score on a clean evaluation matches its score on the public one. A large tax (10+ points) means either the public benchmark is contaminated, or the model is gaming surface features that do not transfer. Both are worth knowing. Neither is captured by a single leaderboard number.
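
For a higher-is-better metric, the column is a single subtraction, with the reading thresholds quoted above. A sketch, not the production code:

    def contamination_tax(gold_score, independent_score):
        """Gold minus independent, in absolute points, assuming a
        higher-is-better metric. Thresholds follow the text above."""
        tax = gold_score - independent_score
        if tax < 3:
            reading = "small: the model generalises"
        elif tax >= 10:
            reading = "large: contamination, or surface features that do not transfer"
        else:
            reading = "intermediate"
        return tax, reading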

The methodology is closer to how clinical trials handle bias than to how leaderboards usually work. The gold score is the registered analysis. The independent score is the replication arm. We do not pretend the gold number is contamination-free; we measure how much it isn’t, and we name the gap.

Our first target is OmniDocBench, the OCR benchmark with the most public attention and therefore the most contamination risk. Coverage will expand from there to coding (LiveCodeBench, SWE-bench Pro), reasoning (GPQA, HLE), and agentic tasks. Where we cannot run the independent set ourselves, the column publishes as pending rather than as a fabricated number.

§ 12 · FAQ

Frequently asked, honestly answered.

Questions that come up about the methodology. Schema.org FAQPage markup is embedded for search.

Q01 · Why does Codesota not accept self-reported scores without reproduction?

Self-reported scores are recorded, but they are labelled claim-only until they are reproduced — either by an independent run or by a signed container hash. Most of the drift that made late-stage leaderboards unreliable came from self-reported numbers that nobody ever re-executed. A claim without a reproduction is still useful signal, but it is not evidence.

Q02 · What about closed models that cannot be reproduced?

Closed, API-only models are run against the public test split through their official endpoint, with the date, model identifier, prompt template and harness commit recorded. These rows are labelled API-verified rather than fully reproduced — the weights are not public, so another party cannot re-execute the exact same artefact. Where an API version string changes, we treat it as a new submission.

Q03 · How do you handle benchmark contamination?

Where a benchmark uses a continuously refreshed split (for example LiveCodeBench) we prefer it to static splits for frontier comparisons. Where a benchmark is static, we record the model training-data cutoff alongside the score, and we flag benchmarks known to be saturated or likely contaminated. We do not, and cannot, certify that a given training set did not include a given test item; we only expose the information needed to reason about it.

Q04 · Do you weight benchmarks by training-set overlap or vendor association?

No. A 2026 Berkeley RDI study showed that several widely-cited agent benchmarks can be exploited to near-perfect scores without solving any real tasks; re-weighting scores post-hoc to flatter a particular vendor would be the same failure mode in reverse. Codesota reports the metric a benchmark defines, on the split it defines, and exposes the context (date, harness, contamination flags) rather than rolling everything into a single opinion.

Q05 · What happens when a score turns out to be wrong?

If a published score turns out to be wrong — an evaluation bug, a miscounted split, a mis-identified checkpoint — we correct it in place, but the row carries a visible correction note with the date and the reason. If the result was materially misleading, the row is retracted: it stays on the page with a strikethrough and a link to the retraction note. No silent deletion.

Q06 · Is Codesota paid to list or promote specific models?

No. Editorial rankings are not for sale. Codesota is currently self-directed and not investor-backed. When paid work (consulting, custom benchmarking, commissioned evaluations) exists, it is labelled inline on the affected page and does not change the ordering of the public registry.

Q07 · Where can I get the raw data?

Every score that appears on the site is also available as JSON. The main dataset is served at /data/benchmarks.json; per-area JSON is linked from each benchmark page. The API reference at /api documents the query-time endpoints.


All routes verified live · April 2026