The open, callable registry of SOTA.
Codesota connects papers, code, datasets, models, and benchmark results in one public registry. Browse it like a research index, or call /api/sota from an agent, notebook, or dashboard.
No paywall. No signup. No sponsored leaderboards. Each result carries its source type, date, metric direction, and provenance trail, so a SOTA claim can be inspected before it is reused.
One task alias in, one dated pick out.
Rows keep metric, date, source, and trust status together.
Saturated datasets point to successors instead of pretending nothing changed.
New scores enter through benchmark-aware provenance fields.
The registry is now callable.
The April releases moved Codesota from a static reference toward infrastructure: an API, lineage maps, contamination accounting, and structured score submission.
Full changelog →
Registry as product
/api/sota returns the current dated, sourced SOTA pick per task with runners-up and a stable snapshot id.
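A minimal sketch of calling it from a notebook, assuming a task alias in the path (matching the curl example further down) and response fields named pick, runners_up, and snapshot_id; the actual schema may differ.

```python
import requests

# Minimal sketch of a /api/sota call. Field names (pick, runners_up,
# snapshot_id) are assumptions inferred from the description above,
# not a documented schema.
resp = requests.get("https://www.codesota.com/api/sota/code", timeout=10)
resp.raise_for_status()
data = resp.json()

pick = data.get("pick", {})
print(pick.get("model"), pick.get("metric"), pick.get("score"), pick.get("date"))
print("snapshot:", data.get("snapshot_id"))
print("runners-up:", len(data.get("runners_up", [])))
```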
Structured score intake
The new score form validates benchmark ids, captures provenance, and queues submissions for review.
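For illustration, a hypothetical payload with the kind of provenance fields described here; the field names are placeholders, not the form's actual schema.

```python
# Hypothetical submission payload; field names are placeholders and
# do not reflect the real intake schema.
submission = {
    "benchmark_id": "swe-bench-verified",  # checked against known benchmark ids
    "model": "example-model-7b",
    "metric": "resolve rate",
    "score": 0.613,
    "source_type": "paper",                # paper, vendor, or independent run
    "source_url": "https://example.org/paper",
    "evaluation_date": "2026-04-01",
}
# A submission like this is queued for review rather than published directly.
```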
Where benchmarks move next
Coding, vision, audio, and agentic maps now track saturated benchmarks, successors, and branch tasks.
Contamination tax
Gold-vs-independent scoring makes benchmark leakage visible as a measurable gap, not a footnote.
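Concretely, the gap is just the difference between a score on the public (gold) split and a score on an independently held-out split; the numbers in the sketch below are made up.

```python
# Illustrative numbers only: one model scored on two splits.
gold_split_score = 0.874         # public split, where leakage is possible
independent_split_score = 0.791  # independently run, held-out split

contamination_tax = gold_split_score - independent_split_score
print(f"contamination tax: {contamination_tax:.1%}")  # prints 8.3%
```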
Current top results.
A transverse slice of the registry: the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by Codesota; unshaded rows cite the paper or vendor. This is the human view of the same registry exposed through /api/sota.
- Results: 1,008
- Models tracked: 163
- Datasets indexed: 98
- Research areas: 17
| Area | Benchmark | Leading model | Metric | Score | Results |
|---|---|---|---|---|---|
| Code | HumanEval | o4-mini (high) | pass@1 | 99.3% | 33 |
| Code | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | 39 |
| Code | LiveCodeBench | DeepSeek-R1-0528 | pass@1 | 73.3% | 22 |
| Reasoning | MMLU-Pro | N/A | accuracy | N/A | 0 |
| Reasoning | GPQA Diamond | o3 | accuracy | 82.8% | 17 |
| Reasoning | Humanity's Last Exam | N/A | accuracy | N/A | 0 |
| Math | MATH | o4-mini (high) | accuracy | 98.2% | 29 |
| Math | AIME 2024 | o1-preview | accuracy | 83.3% | 3 |
| Math | GSM8K | o1-preview | accuracy | 97.8% | 5 |
| Vision | ImageNet-1K | coca-finetuned | top-1 | 91.0% | 22 |
| Vision | COCO detection | co-detr-swin-l | mAP | 66.0% | 17 |
| Vision | ADE20K | ONE-PEACE | mIoU | 63.0% | 13 |
| VQA | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | 23 |
| VQA | TextVQA | Qwen2.5-VL 72B | accuracy | 85.5% | 9 |
| OCR | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | 74 |
| OCR | OmniDocBench | mineru-2.5 | layout mAP | 97.5% | 47 |
| OCR | ParseBench | LlamaParse Agentic | accuracy | 84.9% | 14 |
| OCR | OCR · CER | mistral-ocr-3 | CER (lower) | 3.7 | 1 |
| Speech | WildASR | Gemini 3 Pro | WER (lower) | 2.8 | 14 |
| Speech | VoiceBench | Ultravox-GLM-4P7 | overall | 88.9% | 13 |
| Audio | ESC-50 | BEATs | accuracy | 98.1% | 4 |
| Embeddings | MTEB | NV-Embed-v2 | avg | 72.3% | 6 |
curl https://www.codesota.com/api/sota/code?tier=sota
The frontier climbs.
HumanEval is the oldest public code-generation benchmark, and it is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.
Below, small multiples plot the real SOTA envelope across 7 modalities. Every point is a dated benchmark result from the registry; each step up is a submission that beat the running best. X-axis is calendar time.
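The envelope itself is just a running maximum over dated results. A minimal sketch, assuming higher-is-better metrics and illustrative rows rather than the registry's actual schema:

```python
from datetime import date

# SOTA envelope: sort results by date, keep each one that beats the
# running best. Rows and field names here are illustrative only.
results = [
    {"date": date(2023, 3, 1), "score": 67.0},
    {"date": date(2023, 9, 1), "score": 84.1},
    {"date": date(2024, 2, 1), "score": 79.3},  # below the envelope, dropped
    {"date": date(2024, 11, 1), "score": 90.2},
]

envelope, best = [], float("-inf")
for row in sorted(results, key=lambda r: r["date"]):
    if row["score"] > best:
        best = row["score"]
        envelope.append(row)

print([(r["date"].isoformat(), r["score"]) for r in envelope])
```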
Scores age. Benchmarks move.
The registry does not treat every benchmark as equally current. Lineages show where attention moved, which datasets saturated, and which successors carry better signal.
Browse all lineages →
Attention moved from short Python functions to repository-scale engineering and held-out commercial splits.
Classification gave way to detection, segmentation, and video-scale visual understanding.
The useful frontier shifted from sound labels toward instruction-following audio-language models.
Agent benchmarks now track long-horizon tool use, web work, operating systems, and software tasks.
Seventeen research areas. One registry.
Everyone tracks frontier LLM scores. We also track what your pipeline depends on: OCR, ASR, detection, retrieval, and embedded inference, with the same standard of evidence.
Frontier models on MMLU, GPQA, MATH, AIME.
HumanEval, SWE-bench Verified, LiveCodeBench.
Long-horizon autonomy, tool use, OpenRouter flow.
Layout, handwriting, table extraction.
WER on WildASR and industry splits.
Voice clarity, fingerprint robustness.
ImageNet, CIFAR, linear probe.
COCO, LVIS zero-shot, detection.
VQA-v2, TextVQA, chart reasoning.
MTEB avg, BEIR, hybrid retrieval.
ESC-50, AudioSet, sound event detection.
CheXpert, MIMIC-CXR, MedQA.
Habitat, LIBERO-Long, manipulation.
MVTec-AD, DAGM, NEU-DET.
Speed, cost and energy on real silicon.
LLMs on Hailo-10H, edge chip catalog.
What we have been writing.
Sections are written like issues: a leaderboard at the top, methodology below it, then essays that explain what the numbers mean, not just what they are.
The callable SOTA registry
One CORS-open endpoint for current, dated, sourced SOTA picks by task.
Papers with their scores
Recent ML papers cross-linked to models, code, datasets, and benchmark rows.
Every ML task
The task index with canonical benchmarks, top results, and trust grades.
Evidence-ranked leaderboards
Benchmark pages sorted by result coverage, verification density, and source quality.
Model cards
Model pages that show where each system ranks and which papers introduced it.
How results are trusted
Source types, reproduction status, corrections, and verification rules.
Benchmark evolution
How tasks move from saturated datasets to harder evaluation regimes.
What happened to PWC
Why the old registry mattered and what a stricter successor needs to fix.
Help fix a row
Submit a result, challenge a score, or add missing paper evidence.
Trained something that beats the table?
Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
If Codesota informed your research, please cite the registry. A citation helps other readers find reproduced, dated numbers. It also helps us keep independent benchmarks sustainable.
@misc{codesota2026,
title = {Codesota: The Open Registry of State-of-the-Art Machine Learning},
author = {Wikiel, Kacper},
year = {2026},
url = {https://codesota.com},
note = {Accessed: 2026-04-28}
}
@misc{codesota-registry2026,
title = {Codesota Benchmark Registry},
author = {Wikiel, Kacper},
year = {2026},
howpublished = {\url{https://codesota.com/data/benchmarks.json}},
note = {Open JSON registry of sourced and dated benchmark results}
}
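The second entry points at the raw JSON registry. A minimal sketch of loading it, assuming only that the URL returns a JSON document; its internal structure is not specified here.

```python
import json
import urllib.request

# Load the open JSON registry cited above. The only assumption is that
# the URL returns JSON; its internal structure is not specified here.
with urllib.request.urlopen("https://codesota.com/data/benchmarks.json") as resp:
    registry = json.load(resp)

print(type(registry).__name__)
```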