How CodeSOTA ensures benchmark results are trustworthy.
Vendor-provided benchmark results lack independent verification. Marketing materials cherry-pick metrics. Academic benchmarks use outdated datasets.
CodeSOTA runs independent evaluations with versioned datasets, deterministic configurations, and public methodology. Every verified result is reproducible.
Six pieces of evidence behind every verified benchmark result.
SHA-256 hash of the exact dataset version used for evaluation. Guarantees same test data across runs. Prevents dataset drift.
sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0
Exact parameters: temperature, top-p, max tokens, system prompts, API versions. Version-controlled configuration files stored in Git.
Pinned Docker images with locked dependency versions. GPU/CPU specs, Python/library versions documented. Reproducible anywhere.
Transparent pricing breakdown: API costs (input/output tokens), compute costs (GPU hours), infrastructure overhead. Updated with vendor pricing changes.
Public GitHub repository with evaluation scripts. Character Error Rate (CER), Word Error Rate (WER), Tree-Edit-Distance-based Similarity (TEDS) for table structure, F1 scores — all open-source implementations; a minimal CER sketch follows this list.
github.com/codesota/ocr-benchmarks
Timestamp of benchmark execution. Tracks freshness. Models improve, APIs change — date context is critical for interpretation.
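To make the metric code concrete, here is a minimal CER implementation in Python. It is a sketch in the spirit of the open-source metrics, not the repository's actual code; the sample strings are invented for illustration.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """CER = character edits / reference length; WER applies the same formula to word tokens."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)

print(cer("Faktura VAT 2025", "Faktura VAT 2O25"))  # 0.0625: one substituted character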
Independent benchmark execution, in five stages.
Pinned Docker images for each model evaluation. Deterministic configurations with fixed random seeds. API version locking to prevent silent changes.
docker pull codesota/paddle-ocr:v2.8.1
docker run --gpus all -e SEED=42 codesota/paddle-ocr:v2.8.1 evaluate
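The SEED environment variable above feeds a seed-pinning step inside the container. A hypothetical sketch of that step, assuming a Python harness; the numpy/torch calls are only needed if the model stack uses them.

import os
import random

def set_deterministic(seed: int) -> None:
    random.seed(seed)  # stdlib RNG
    # If the model stack uses numpy or torch, pin those RNGs as well, e.g.:
    #   numpy.random.seed(seed)
    #   torch.manual_seed(seed)

set_deterministic(int(os.environ.get("SEED", "42")))  # honors -e SEED=42 from docker run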
Semantic versioning (v1.2.3) for datasets. Cached datasets with SHA-256 integrity checks. Immutable storage — versions never change after publication.
dataset: ocr-invoices-eu-v1.3.0
hash: sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4
size: 10,000 documents (Polish, German, Czech)
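An integrity check can be as simple as re-hashing the cached archive against the pinned manifest. A sketch in Python; the file name and helper are illustrative, and the expected value below is truncated exactly as in the manifest above.

import hashlib

def verify_dataset(path: str, expected: str) -> None:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MiB chunks
            h.update(chunk)
    digest = "sha256:" + h.hexdigest()
    if digest != expected:
        raise RuntimeError(f"dataset drift detected: {digest} != {expected}")

# Illustrative call; in practice the expected value is the full 64-character digest.
verify_dataset("ocr-invoices-eu-v1.3.0.tar", "sha256:a3f2b9c8d1e4f5a6b7c8d9e0f1a2b3c4")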
Deterministic evaluation scripts with version control. Automated pipelines run identical code for every model. Parallel execution for efficiency, isolated environments for integrity.
python evaluate.py \
  --model paddle-ocr-vl \
  --dataset ocr-invoices-eu-v1.3.0 \
  --config configs/paddle_deterministic.yaml
Cross-checking against vendor claims where available. Statistical analysis for outlier detection. Re-runs for suspicious results. Human review of failure cases.
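One way to flag a suspicious result for a re-run: check each run's score against the median of its siblings. A minimal sketch; the tolerance value is an illustrative assumption, not a published CodeSOTA threshold.

from statistics import median

def needs_rerun(scores: list[float], tolerance: float = 0.02) -> bool:
    """Flag the batch if any run strays from the median by more than `tolerance`."""
    m = median(scores)
    return any(abs(s - m) > tolerance for s in scores)

print(needs_rerun([0.943, 0.941, 0.944]))  # False: runs agree
print(needs_rerun([0.943, 0.941, 0.897]))  # True: third run deviates, schedule a re-run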
Git commit hash linking results to exact code version. Benchmark version tagging (v2025.01). Public changelog tracking all methodology changes.
commit: 7a3f2b9c (2025-01-15)
benchmark: ocr-invoices-eu v1.3.0
methodology: v2025.01 (no breaking changes since v2024.12)
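Stamping a result with the exact code version takes one Git call. A sketch of how such a record could be produced; the metadata layout mirrors the example above but is otherwise an assumption.

import json
import subprocess
from datetime import date

commit = subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()
record = {
    "commit": commit,  # links the result to the exact code version
    "date": date.today().isoformat(),
    "benchmark": "ocr-invoices-eu v1.3.0",
    "methodology": "v2025.01",
}
print(json.dumps(record, indent=2))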
All evaluation code, configuration files, and metric implementations are open-source. Fork, audit, reproduce.
Every verified result includes reproduction instructions. Run the same evaluation yourself with the provided Docker images and datasets. Deterministic by design.
Methodology changes tracked in Git. Breaking changes increment the major version. Public changelog with rationale for every modification. No silent updates.
Timestamp on every benchmark result. "Last verified" dates visible on all leaderboards. Automatic staleness warnings for results older than 90 days. Context is critical.
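The 90-day rule reduces to a date comparison. A minimal sketch; the function name is illustrative, while the cutoff matches the text above.

from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)

def is_stale(last_verified: date, today: date) -> bool:
    """True if a result's verification date is past the staleness cutoff."""
    return today - last_verified > STALE_AFTER

print(is_stale(date(2025, 1, 15), date(2025, 6, 1)))  # True: show a staleness warning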
Not all results are equal. We distinguish three tiers: unverified, verified, and continuously monitored.
Unverified · Results submitted by third parties (vendors, researchers, community). Not independently reproduced by CodeSOTA. Provided for completeness but flagged as unverified.
Use case · Quick discovery of new models, initial comparison, tracking vendor claims
Verified · Independently reproduced by CodeSOTA. Meets all six badge criteria: dataset hash, prompt/config, runtime, cost, metric code, verification date. Single-run validation.
Use case · Procurement decisions, RFP benchmarking, vendor selection, technical evaluations
Continuously monitored · Automated regular reruns (weekly/monthly). Tracks model drift, API changes, performance degradation. Alerts on significant deviations; a minimal drift check is sketched below. Highest confidence tier.
Use case · Production monitoring, SLA tracking, regression detection, long-term reliability assessment
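For the monitored tier, a deviation alert can be as small as comparing each rerun to the verified baseline. A sketch under the assumption of a fixed threshold; CodeSOTA's actual alerting criteria are not specified here.

def drift_alert(baseline: float, latest: float, threshold: float = 0.02) -> str | None:
    """Return an alert message if the latest rerun drifts beyond the threshold."""
    delta = latest - baseline
    if abs(delta) > threshold:
        return f"drift {delta:+.3f} vs verified baseline {baseline:.3f}"
    return None

print(drift_alert(0.943, 0.912))  # "drift -0.031 vs verified baseline 0.943"
print(drift_alert(0.943, 0.940))  # None: within tolerance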
Independent validation for procurement teams.
Verification process: 2-3 weeks for standard OCR models. Custom datasets and private evaluations available.
CodeSOTA accepts no vendor investment, equity, or revenue-sharing agreements with OCR providers. We make money from private evaluations and enterprise consulting, not from vendors seeking favorable rankings.
Vendors may pay for verification services (benchmark execution, badge issuance), but verification is pass/fail — we publish results as-is, favorable or not. Payment does not influence methodology or ranking.
Any financial relationship with a benchmarked vendor (consulting, evaluation fees, partnerships) is disclosed on the relevant benchmark page.
Methodology changes are never made at vendor request. All changes go through public review with rationale documented in Git changelog.