Codesota · The Open Registry
Every benchmark reproduced · every score dated · every claim traced to code
Issue: April 22, 2026
Live registry · 17 research areas · 988 results

The state of the art,
measured honestly.

Codesota is the open registry ML engineers consult before choosing a model — benchmarks linked to code, scores cross-checked against the paper, and original analysis of how the market actually uses these models. A calmer, stricter successor to Papers with Code.

No paywall. No signup. No sponsored leaderboards. Every result carries its source type — reproduced, paper, or vendor-reported — so you can decide how much to believe each number.
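That source-type label lends itself to programmatic filtering. A minimal sketch in Python of how a reader might rank rows by evidence level — the `model`, `score`, and `source` field names here are illustrative, not the registry's actual schema:

```python
# Filter registry rows by evidence type before comparing scores.
# The row layout below is illustrative; consult /data/benchmarks.json
# for the real schema.
ROWS = [
    {"model": "a", "score": 99.3, "source": "reproduced"},
    {"model": "b", "score": 99.9, "source": "vendor"},
    {"model": "c", "score": 98.7, "source": "paper"},
]

# Lower rank = stronger evidence.
TRUST_ORDER = {"reproduced": 0, "paper": 1, "vendor": 2}

def best(rows, max_trust="paper"):
    """Leading score among rows at or above a minimum evidence level."""
    cutoff = TRUST_ORDER[max_trust]
    eligible = [r for r in rows if TRUST_ORDER[r["source"]] <= cutoff]
    return max(eligible, key=lambda r: r["score"])

print(best(ROWS)["model"])  # vendor-reported row is excluded
```

With the default cutoff, the vendor-reported 99.9 never enters the comparison — which is the point of carrying the label on every row.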

§ 01 · Dashboard

Current state of the art.

A cross-section of the registry — the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by Codesota; unshaded rows cite the paper or vendor.

Results · 988
Models tracked · 163
Datasets indexed · 97
Research areas · 17
Full registry →

Top score · canonical benchmark
Area | Benchmark | Leading model | Metric | Score | Results
Code | HumanEval | o4-mini (high) | pass@1 | 99.3% | 33
Code | SWE-bench Verified | Claude Opus 4.7 | resolve rate | 87.6% | 39
Code | LiveCodeBench | DeepSeek-R1-0528 | pass@1 | 73.3% | 22
Reasoning | MMLU-Pro | n/a | accuracy | n/a | 0
Reasoning | GPQA Diamond | o3 | accuracy | 82.8% | 17
Reasoning | Humanity's Last Exam | n/a | accuracy | n/a | 0
Math | MATH | o4-mini (high) | accuracy | 98.2% | 29
Math | AIME 2024 | o1-preview | accuracy | 83.3% | 3
Math | GSM8K | o1-preview | accuracy | 97.8% | 5
Vision | ImageNet-1K | coca-finetuned | top-1 | 91.0% | 22
Vision | COCO detection | co-detr-swin-l | mAP | 66.0% | 17
Vision | ADE20K | ONE-PEACE | mIoU | 63.0% | 13
VQA | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | 23
VQA | TextVQA | Qwen2.5-VL 72B | accuracy | 85.5% | 9
OCR | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | 74
OCR | OmniDocBench | mineru-2.5 | layout mAP | 97.5% | 47
OCR | ParseBench | LlamaParse Agentic | accuracy | 84.9% | 14
OCR | OCR · CER | mistral-ocr-3 | CER (lower) | 3.7 | 1
Speech | WildASR | Gemini 3 Pro | WER (lower) | 2.8 | 14
Speech | VoiceBench | Ultravox-GLM-4P7 | overall | 88.9% | 13
Audio | ESC-50 | BEATs | accuracy | 98.1% | 4
Embeddings | MTEB | NV-Embed-v2 | avg | 72.3% | 6
Fig 2 · Each row shows the leading value on the canonical benchmark, with higher- or lower-is-better declared in the metric label. Scores are drawn from the open JSON at /data/benchmarks.json.
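The per-benchmark leader can be recomputed from that open JSON. A minimal sketch, assuming each entry carries `benchmark`, `model`, `metric`, and `score` fields (the file's real schema may differ), with the lower-is-better flag inferred from the metric label as the caption describes:

```python
# Group results by benchmark and pick the leading score, respecting
# lower-is-better metrics such as WER and CER. Field names are assumed.
results = [
    {"benchmark": "WildASR", "model": "x", "metric": "WER (lower)", "score": 3.1},
    {"benchmark": "WildASR", "model": "y", "metric": "WER (lower)", "score": 2.8},
    {"benchmark": "HumanEval", "model": "z", "metric": "pass@1", "score": 99.3},
]

def leaders(rows):
    out = {}
    for r in rows:
        lower_is_better = "lower" in r["metric"]
        cur = out.get(r["benchmark"])
        if cur is None:
            out[r["benchmark"]] = r
        elif r["score"] != cur["score"] and \
                (r["score"] < cur["score"]) == lower_is_better:
            out[r["benchmark"]] = r
    return out

board = leaders(results)
print(board["WildASR"]["model"])  # lowest WER wins
```

On ties the earlier row is kept, which matches a dated registry where the first submission to reach a value holds the record.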
§ 02 · What's moving

The frontier climbs.

HumanEval — the oldest public code-generation benchmark — is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.

Below, small multiples plot the real SOTA envelope across 7 modalities. Every point is a dated benchmark result from the registry; each step up is a submission that beat the running best. X-axis is calendar time.
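The envelope in each panel is a running maximum over date-ordered results: a point enters the step chart only if it beats every earlier score. A minimal sketch with illustrative data (for lower-is-better panels the comparison flips):

```python
# SOTA envelope: keep only the submissions that beat the running best,
# scanning results in date order. The rows below are illustrative.
from datetime import date

results = [
    (date(2024, 10, 1), "Claude 3.5 Sonnet", 92.0),
    (date(2024, 12, 1), "o1-preview", 95.1),
    (date(2025, 2, 1), "mid-tier model", 94.0),   # no step: below running best
    (date(2025, 6, 1), "o4-mini (high)", 99.3),
]

def envelope(rows):
    steps, best = [], float("-inf")
    for day, model, score in sorted(rows):   # sorts by date first
        if score > best:
            best = score
            steps.append((day, model, score))
    return steps

print([m for _, m, _ in envelope(results)])  # the 94.0 row is skipped
```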

[Step chart · HumanEval pass@1 (%), axis roughly 90–100. SOTA-setting submissions, in order: Claude 3.5 Sonnet (Oct 2024), o1-preview, Qwen2.5-Coder-32B-Instruct, GPT-4.1 mini, gpt-4.1, o3-mini, o4-mini, o3-mini (high), o4-mini (high).]
Fig 3 · Leading HumanEval pass@1 submissions in the registry, ordered by value. Dots mark SOTA-setting rows; names annotate each step up.
[Small multiples · latest SOTA value per panel; x-axis ticks mark calendar years:
Code · LiveCodeBench: 91.7% pass@1, ↑ ('25–'26)
SWE-bench Verified: 87.6% resolve, ↑ ('26)
Reasoning · MMLU-Pro: 86.60 acc, ↑
Vision · COCO det.: 66.1% mAP, ↑ ('21–'26)
VQA · MMMU: 86.00 acc, ↑ ('25–'26)
OCR · OCRBench v2: 45.00 overall, ↑ ('25–'26)
Embeddings · MTEB: 72.3% avg, ↑ ('24–'26)]
§ 03 · Jump in

Sixteen domains. One registry.

Everyone tracks frontier LLM scores. We also track what your pipeline depends on — OCR, ASR, detection, retrieval, embedded inference — with the same standard of evidence.

LLM reasoning · 92.9% · Frontier models on MMLU, GPQA, MATH, AIME.
Code generation · 99.3% · HumanEval, SWE-bench Verified, LiveCodeBench.
Agentic · 87.6% · Long-horizon autonomy, tool use, OpenRouter flow.
OCR / documents · 97.5% · Layout, handwriting, table extraction.
Speech-to-text · 2.8 WER · WER on WildASR and industry splits.
Text-to-speech · 88.9% · Voice clarity, fingerprint robustness.
Vision · classification · 91.0% · ImageNet, CIFAR, linear probe.
Vision · detection · 66.0% · COCO, LVIS zero-shot, detection.
Multimodal / VQA · 87.6% · VQA-v2, TextVQA, chart reasoning.
Embeddings / retrieval · 72.3% · MTEB avg, BEIR, hybrid retrieval.
Audio classification · 98.1% · ESC-50, AudioSet, sound event detection.
Medical imaging · 89.20 · CheXpert, MIMIC-CXR, MedQA.
Robotics · Habitat, LIBERO-Long, manipulation.
Industrial AI · 99.60 · MVTec-AD, DAGM, NEU-DET.
Hardware · inference · Speed, cost and energy on real silicon.
Embedded AI · LLMs on Hailo-10H, edge chip catalog.

Fig 4 · Each tile links to a dedicated section with its own leaderboard, methodology and — where applicable — priced hardware comparisons.
§ 04 · Read
Flagship content

What we have been writing.

Sections are written like issues: a leaderboard at the top, the methodology below it, then essays that explain what the numbers mean — not just what they are.

Section · LLM reasoning: Frontier LLMs across MMLU, GPQA, MATH and AIME with verification notes. Read →
Section · OCR & documents: Layout, handwriting, structured extraction — with ParseBench write-up. Read →
Section · Speech: ASR accuracy, voice fingerprints, and a deep dive on TTS robustness. Read →
Section · Vision: Classification, detection, segmentation — priced against hardware. Read →
Section · Agentic AI: Long-horizon autonomy, tool use, OpenRouter market trends. Read →
Index · Every ML task: The alphabetical register with trust grades for every canonical benchmark. Read →
Essay · Voice fingerprints: Why TTS benchmarks miss the acoustic fingerprint that actually matters. Read →
Archive · Papers with Code: The successor project to the archived Meta registry. Read →
Guide · Choosing a TTS model: A practitioner guide to speech synthesis trade-offs. Read →
§ 05 · Register

Trained something
that beats the table?

Submit a checkpoint or a paper result. We verify open-weight models against the public benchmark, cross-check vendor-reported numbers against the source, and add the row to the registry with its date and code trail.
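For sampled code benchmarks such as HumanEval, reproduction runs typically report the standard unbiased pass@k estimator from the HumanEval paper: draw n samples per problem, count c correct, and estimate pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generations (c of them correct),
    passes the unit tests."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws: certain success
    return 1.0 - comb(n - c, k) / comb(n, k)

# 10 generations per problem, 3 of which pass:
print(round(pass_at_k(10, 3, 1), 3))  # 0.3
```

Averaging this estimate across all problems gives the benchmark score; reporting n alongside the score keeps the number reproducible.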

§ 06 · Cite this work
BibTeX · APA · Plain

If Codesota informed your research, please cite the registry. A citation helps other readers find reproduced, dated numbers — and helps us keep independent benchmarks sustainable.

BibTeX · registry entry
@misc{codesota2026,
  title        = {Codesota: The Open Registry of State-of-the-Art Machine Learning},
  author       = {Wikiel, Kacper},
  year         = {2026},
  url          = {https://codesota.com},
  note         = {Accessed: 2026-04-24}
}
BibTeX · the open JSON dataset
@misc{codesota-registry2026,
  title        = {Codesota Benchmark Registry},
  author       = {Wikiel, Kacper},
  year         = {2026},
  howpublished = {\url{https://codesota.com/data/benchmarks.json}},
  note         = {Open JSON registry of reproduced benchmark results}
}
APA · plain text
Wikiel, K. (2026). Codesota: The Open Registry of State-of-the-Art Machine Learning. Retrieved 2026-04-24, from https://codesota.com