Codesota · The Open Registry
Human-readable pages · machine-readable SOTA API
Issue: April 27, 2026
Callable registry · 17 research areas · 1,008 results

The open, callable registry of SOTA.

Codesota connects papers, code, datasets, models, and benchmark results in one public registry. Browse it like a research index, or call /api/sota from an agent, notebook, or dashboard.

No paywall. No signup. No sponsored leaderboards. Each result carries its source type, date, metric direction, and provenance trail, so a SOTA claim can be inspected before it is reused.

Try /api/sota · Browse tasks
CORS-open · cache by snapshot_id
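The endpoint can be called from anything that speaks HTTP. A minimal Python sketch, assuming the response exposes the snapshot_id and per-result provenance fields described above; the field names here are illustrative guesses, not a confirmed schema (see the API docs for the real one):

# Sketch: fetch current SOTA picks for code tasks and inspect provenance.
# Field names (snapshot_id, results, source_type, ...) are assumptions.
import requests

resp = requests.get(
    "https://www.codesota.com/api/sota/code",
    params={"tier": "sota"},
    timeout=10,
)
resp.raise_for_status()
payload = resp.json()

# Cache by snapshot_id: an unchanged id means the registry slice
# is unchanged, so a cached copy can be reused.
print("snapshot:", payload.get("snapshot_id"))

for row in payload.get("results", []):
    # Check source type, date, and metric before reusing a SOTA claim.
    print(row.get("benchmark"), row.get("model"), row.get("score"),
          row.get("source_type"), row.get("date"))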
§ 02 · Dashboard

Current top results.

A cross-section of the registry: the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by Codesota; unshaded rows cite the paper or vendor. This is the human view of the same registry exposed through /api/sota.


Results: 1,008
Models tracked: 163
Datasets indexed: 98
Research areas: 17
API schema →
Top score · canonical benchmark

Area       | Benchmark            | Leading model        | Metric       | Score | Results
Code       | HumanEval            | o4-mini (high)       | pass@1       | 99.3% | 33
Code       | SWE-bench Verified   | Claude Opus 4.7      | resolve rate | 87.6% | 39
Code       | LiveCodeBench        | DeepSeek-R1-0528     | pass@1       | 73.3% | 22
Reasoning  | MMLU-Pro             | N/A                  | accuracy     | N/A   | 0
Reasoning  | GPQA Diamond         | o3                   | accuracy     | 82.8% | 17
Reasoning  | Humanity's Last Exam | N/A                  | accuracy     | N/A   | 0
Math       | MATH                 | o4-mini (high)       | accuracy     | 98.2% | 29
Math       | AIME 2024            | o1-preview           | accuracy     | 83.3% | 3
Math       | GSM8K                | o1-preview           | accuracy     | 97.8% | 5
Vision     | ImageNet-1K          | coca-finetuned       | top-1        | 91.0% | 22
Vision     | COCO detection       | co-detr-swin-l       | mAP          | 66.0% | 17
Vision     | ADE20K               | ONE-PEACE            | mIoU         | 63.0% | 13
VQA        | VQA-v2               | Qwen2-VL 72B         | accuracy     | 87.6% | 23
VQA        | TextVQA              | Qwen2.5-VL 72B       | accuracy     | 85.5% | 9
OCR        | OCRBench v2          | Qwen2.5-VL-72B       | overall      | 63.70 | 74
OCR        | OmniDocBench         | mineru-2.5           | layout mAP   | 97.5% | 47
OCR        | ParseBench           | LlamaParse Agentic   | accuracy     | 84.9% | 14
OCR        | OCR · CER            | mistral-ocr-3        | CER (lower)  | 3.7   | 1
Speech     | WildASR              | Gemini 3 Pro         | WER (lower)  | 2.8   | 14
Speech     | VoiceBench           | Ultravox-GLM-4P7     | overall      | 88.9% | 13
Audio      | ESC-50               | BEATs                | accuracy     | 98.1% | 4
Embeddings | MTEB                 | NV-Embed-v2          | avg          | 72.3% | 6
Fig 2 · Each row shows the leading value on the canonical benchmark, with higher- or lower-is-better declared in the metric label. Scores are drawn from the open JSON at /data/benchmarks.json.
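The same view can be rebuilt from the open JSON. A sketch, assuming each record carries benchmark, model, metric, score, and a direction flag for lower-is-better metrics; these field names are assumptions about the file layout, not its documented schema:

# Sketch: recompute the leading value per benchmark from the open JSON,
# respecting metric direction. Record field names are assumed.
import requests

rows = requests.get(
    "https://www.codesota.com/data/benchmarks.json", timeout=10
).json()

best = {}
for r in rows:
    lower_is_better = r.get("direction") == "lower"  # e.g. WER, CER
    cur = best.get(r["benchmark"])
    if (cur is None
            or (lower_is_better and r["score"] < cur["score"])
            or (not lower_is_better and r["score"] > cur["score"])):
        best[r["benchmark"]] = r

for name, r in sorted(best.items()):
    print(f"{name}: {r['model']} = {r['score']} ({r['metric']})")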
API mirror
curl "https://www.codesota.com/api/sota/code?tier=sota"
Docs
§ 03 · What's moving

The frontier climbs.

HumanEval is one of the oldest widely used public code-generation benchmarks, and it is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.

Below, small multiples plot the real SOTA envelope across seven modalities. Every point is a dated benchmark result from the registry; each step up is a submission that beat the running best. The x-axis is calendar time; a code sketch of the running-best rule follows the panels.

[Fig 3 step chart · HumanEval pass@1 (%): successive SOTA-setting submissions, annotated from Claude 3.5 Sonnet (Oct 2024) through o1-preview, Qwen2.5-Coder-32B-Instruct, GPT-4.1 mini, GPT-4.1, o3-mini, o4-mini, o3-mini (high), and o4-mini (high).]
Fig 3 · Leading HumanEval pass@1 submissions in the registry, ordered by value. Dots mark SOTA-setting rows; names annotate each step up.
Code · LiveCodeBench: 91.7% pass@1, ↑
SWE-bench Verified: 87.6% resolve, ↑
Reasoning · MMLU-Pro: 86.60 acc, ↑
Vision · COCO det.: 66.1% mAP, ↑
VQA · MMMU: 86.00 acc, ↑
OCR · OCRBench v2: 45.00 overall, ↑
Embeddings · MTEB: 72.3% avg, ↑
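The running-best rule behind Fig 3 and these panels fits in a few lines. A sketch, assuming dated (date, model, score) records and a higher-is-better metric:

# Sketch: keep only submissions that beat the running best, in date order.
def sota_envelope(results):
    """results: iterable of (date, model, score) tuples, higher-is-better."""
    steps, running_best = [], float("-inf")
    for date, model, score in sorted(results):  # chronological order
        if score > running_best:                # a new SOTA-setting step
            running_best = score
            steps.append((date, model, score))
    return steps

For a lower-is-better metric such as WER or CER, start from float("inf") and flip the comparison.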
§ 04 · Lineages

Scores age. Benchmarks move.

The registry does not treat every benchmark as equally current. Lineages show where attention moved, which datasets saturated, and which successors carry better signal.

Browse all lineages →
Coding
HumanEval → HumanEval+ → LiveCodeBench → SWE-bench → SWE-bench Verified → SWE-bench Pro

Attention moved from short Python functions to repository-scale engineering and held-out commercial splits.

Vision
ImageNet → COCO → SA-1B → SA-V

Classification gave way to detection, segmentation, and video-scale visual understanding.

Audio
ESC-50 → AudioSet → Clotho → AudioBench → VoiceBench

The useful frontier shifted from sound labels toward instruction-following audio-language models.

Agentic
GAIA → WebArena → OSWorld → tau-bench → SWE-bench Pro

Agent benchmarks now track long-horizon tool use, web work, operating systems, and software tasks.

§ 05 · Jump in

Seventeen research areas. One registry.

Everyone tracks frontier LLM scores. We also track what your pipeline depends on: OCR, ASR, detection, retrieval, and embedded inference, with the same standard of evidence.

Domain · 92.9%
LLM reasoning

Frontier models on MMLU, GPQA, MATH, AIME.

Domain · 99.3%
Code generation

HumanEval, SWE-bench Verified, LiveCodeBench.

Domain · 87.6%
Agentic

Long-horizon autonomy, tool use, OpenRouter flow.

Domain · 97.5%
OCR / documents

Layout, handwriting, table extraction.

Domain · 2.8
Speech-to-text

WER on WildASR and industry splits.

Domain · 88.9%
Text-to-speech

Voice clarity, fingerprint robustness.

Domain · 91.0%
Vision · classification

ImageNet, CIFAR, linear probe.

Domain · 66.0%
Vision · detection

COCO, LVIS zero-shot, detection.

Domain · 87.6%
Multimodal / VQA

VQA-v2, TextVQA, chart reasoning.

Domain · 72.3%
Embeddings / retrieval

MTEB avg, BEIR, hybrid retrieval.

Domain · 98.1%
Audio classification

ESC-50, AudioSet, sound event detection.

Domain · 89.20
Medical imaging

CheXpert, MIMIC-CXR, MedQA.

Domain
Robotics

Habitat, LIBERO-Long, manipulation.

Domain · 99.60
Industrial AI

MVTec-AD, DAGM, NEU-DET.

Domain
Hardware · inference

Speed, cost and energy on real silicon.

Domain
Embedded AI

LLMs on Hailo-10H, edge chip catalog.

Fig 4 · Each tile links to a dedicated section with its own leaderboard, methodology, and priced hardware comparisons where applicable.
§ 06 · Read
Flagship content

What we have been writing.

Sections are written like issues: a leaderboard at the top, methodology below it, then essays that explain what the numbers mean, not just what they are.

API

The callable SOTA registry

One CORS-open endpoint for current, dated, sourced SOTA picks by task.

Read →
Papers

Papers with their scores

Recent ML papers cross-linked to models, code, datasets, and benchmark rows.

Read →
Tasks

Every ML task

The task index with canonical benchmarks, top results, and trust grades.

Read →
Leaderboards

Evidence-ranked leaderboards

Benchmark pages sorted by result coverage, verification density, and source quality.

Read →
Models

Model cards

Model pages that show where each system ranks and which papers introduced it.

Read →
Method

How results are trusted

Source types, reproduction status, corrections, and verification rules.

Read →
Lineages

Benchmark evolution

How tasks move from saturated datasets to harder evaluation regimes.

Read →
Archive

What happened to PWC

Why the old registry mattered and what a stricter successor needs to fix.

Read →
Contribute

Help fix a row

Submit a result, challenge a score, or add missing paper evidence.

Read →
§ 07 · Register

Trained something that beats the table?

Submit a checkpoint, paper result, or correction with structured benchmark provenance. We validate the score, cross-check the source, and add the row to the registry with its date and evidence trail.
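As an illustration of what structured provenance looks like, here is a hypothetical payload; every field name is invented for this sketch, and the Contribute page defines the actual submission format:

# Illustrative only: an invented submission shape, not the real schema.
submission = {
    "benchmark": "SWE-bench Verified",
    "model": "my-model-v1",               # hypothetical checkpoint name
    "metric": "resolve rate",
    "score": 0.881,
    "date": "2026-04-20",
    "source_type": "paper",               # e.g. paper | vendor | reproduction
    "evidence": {
        "paper_url": "https://arxiv.org/abs/<id>",      # placeholder
        "code_url": "https://github.com/<org>/<repo>",  # placeholder
        "eval_notes": "official harness, single attempt per task",
    },
}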

§ 08 · Cite this work
BibTeX · APA · Plain

If Codesota informed your research, please cite the registry. A citation helps other readers find reproduced, dated numbers. It also helps us keep independent benchmarks sustainable.

BibTeX · registry entry
@misc{codesota2026,
  title        = {Codesota: The Open Registry of State-of-the-Art Machine Learning},
  author       = {Wikiel, Kacper},
  year         = {2026},
  url          = {https://codesota.com},
  note         = {Accessed: 2026-04-28}
}
BibTeX · the open JSON dataset
@misc{codesota-registry2026,
  title        = {Codesota Benchmark Registry},
  author       = {Wikiel, Kacper},
  year         = {2026},
  howpublished = {\url{https://codesota.com/data/benchmarks.json}},
  note         = {Open JSON registry of sourced and dated benchmark results}
}
APA · plain text
Wikiel, K. (2026). Codesota: The Open Registry of State-of-the-Art Machine Learning. Retrieved April 28, 2026, from https://codesota.com