Codesota · The Open Registry
Every benchmark reproduced · every score dated · every claim traced to code
Issue: April 22, 2026
Live registry · 17 research areas · 995 results

The state of the art,
measured honestly.

Codesota is the open registry ML engineers consult before choosing a model — benchmarks linked to code, scores cross-checked against the paper, and original analysis of how the market actually uses these models. A calmer, stricter successor to Papers with Code.

No paywall. No signup. No sponsored leaderboards. Every result carries its source type — reproduced, paper, or vendor-reported — so you can decide how much to believe each number.
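Those source-type labels are machine-readable, so a consumer of the registry can sort by evidence strength. A hypothetical sketch of that idea — the `source_type` field name and values are assumptions for illustration, not a published schema:

```python
# Rank rows by evidence strength: independently reproduced results
# first, then paper-cited, then vendor-reported. The "source_type"
# field name is an assumption for illustration.
TRUST = {"reproduced": 0, "paper": 1, "vendor": 2}

def by_evidence(rows):
    """Sort rows so the most trustworthy evidence comes first."""
    return sorted(rows, key=lambda r: TRUST.get(r["source_type"], len(TRUST)))

rows = [
    {"model": "vendor-claim", "source_type": "vendor"},
    {"model": "rerun-locally", "source_type": "reproduced"},
]
print(by_evidence(rows)[0]["model"])  # rerun-locally
```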

§ 01 · Dashboard

Current state of the art.

A transverse slice of the registry — the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by Codesota; unshaded rows cite the paper or vendor.

Results: 995 · Models tracked: 164 · Datasets indexed: 98 · Research areas: 17
Full registry →
Top score · canonical benchmark
Area | Benchmark | Leading model | Metric | Score | Results
Code | HumanEval | o4-mini (high) | pass@1 | 99.3% | 33
Code | SWE-bench Verified | Claude Opus 4.5 | resolve rate | 80.9% | 38
Code | LiveCodeBench | DeepSeek-R1-0528 | pass@1 | 73.3% | 22
Reasoning | MMLU | o3 | accuracy | 92.9% | 19
Reasoning | GPQA Diamond | o3 | accuracy | 82.8% | 17
Math | MATH | o4-mini (high) | accuracy | 98.2% | 29
Math | AIME 2024 | o1-preview | accuracy | 83.3% | 3
Math | GSM8K | o1-preview | accuracy | 97.8% | 5
Vision | ImageNet-1K | coca-finetuned | top-1 | 91.0% | 22
Vision | COCO detection | co-detr-swin-l | mAP | 66.0% | 17
Vision | ADE20K | ONE-PEACE | mIoU | 63.0% | 13
VQA | VQA-v2 | Qwen2-VL 72B | accuracy | 87.6% | 23
VQA | TextVQA | Qwen2.5-VL 72B | accuracy | 85.5% | 9
OCR | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | 74
OCR | OmniDocBench | mineru-2.5 | layout mAP | 97.5% | 47
OCR | ParseBench | LlamaParse Agentic | accuracy | 84.9% | 14
OCR | OCR · CER | mistral-ocr-3 | CER (lower) | 3.7 | 1
Speech | WildASR | Gemini 3 Pro | WER (lower) | 2.8 | 14
Speech | VoiceBench | Ultravox-GLM-4P7 | overall | 88.9% | 13
Audio | ESC-50 | BEATs | accuracy | 98.1% | 4
Embeddings | MTEB | NV-Embed-v2 | avg | 72.3% | 6
Fig 2 · Each row shows the leading value on its canonical benchmark; whether higher or lower is better is declared in the metric label. Scores are drawn from the open JSON at /data/benchmarks.json.
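Because the direction of "better" lives in the metric label, a consumer has to invert the comparison for lower-is-better metrics like CER and WER. A minimal sketch of reading the open data — only the URL /data/benchmarks.json comes from this page; the row schema below is an assumption for illustration:

```python
import json

def leader(rows, lower_is_better=False):
    """Pick the leading row for one benchmark, respecting metric direction."""
    pick = min if lower_is_better else max
    return pick(rows, key=lambda r: r["score"])

# Assumed shape of one benchmark's rows in /data/benchmarks.json.
sample = json.loads("""[
    {"model": "mistral-ocr-3", "metric": "CER (lower)", "score": 3.7},
    {"model": "other-ocr",     "metric": "CER (lower)", "score": 5.2}
]""")
print(leader(sample, lower_is_better=True)["model"])  # mistral-ocr-3
```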
§ 02 · What's moving

The frontier climbs.

HumanEval — one of the oldest widely used public code-generation benchmarks — is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.

To the right, small multiples sketch the trajectory across 17 modalities. The final dot on each line is the present leading score from the registry; intermediate points visualise the climb, but do not claim a calendar date for each step.

[Step chart · HumanEval pass@1 (%): Claude 3.5 Sonnet (Oct 2024) → o1-preview → Qwen2.5-Coder-32B-Instruct → GPT-4.1 mini → GPT-4.1 → o3-mini → o4-mini → o3-mini (high) → o4-mini (high)]
Fig 3 · Leading HumanEval pass@1 submissions in the registry, ordered by value. Dots mark SOTA-setting rows; names annotate each step up.
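The step chart is just a running maximum over the raw rows: walk submissions in date order and keep only those that beat every earlier score. A sketch with hypothetical (model, score) pairs; real rows come from the registry:

```python
def sota_steps(submissions):
    """Keep only submissions that set a new best score (a running maximum)."""
    steps, best = [], float("-inf")
    for model, score in submissions:  # assumed already in date order
        if score > best:
            steps.append((model, score))
            best = score
    return steps

# Hypothetical data for illustration.
subs = [("model-a", 92.0), ("model-b", 91.5), ("model-c", 99.3)]
print(sota_steps(subs))  # [('model-a', 92.0), ('model-c', 99.3)]
```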
Code · HumanEval: 99.3% pass@1 ↑
SWE-bench Verified: 80.9% resolve ↑
Math · MATH: 98.2% acc ↑
Reasoning · MMLU: 92.9% acc ↑
Reasoning · GPQA: 82.8% acc ↑
Vision · ImageNet-1K: 91.0% top-1 ↑
Vision · COCO det.: 66.0% mAP ↑
VQA-v2: 87.6% acc ↑
OCR · OCRBench v2: 63.70 overall ↑
OCR · OmniDocBench: 97.5% layout mAP ↑
OCR · ParseBench: 84.9% acc ↑
OCR · CER: 3.7 CER ↓
Speech · WildASR: 2.8 WER ↓
Speech · VoiceBench: 88.9% overall ↑
Audio · ESC-50: 98.1% acc ↑
Embeddings · MTEB: 72.3% avg ↑
Code · HumanEval+: 87.2% pass@1 ↑
§ 03 · Jump in

Sixteen domains. One registry.

Everyone tracks frontier LLM scores. We also track what your pipeline depends on — OCR, ASR, detection, retrieval, embedded inference — with the same standard of evidence.

LLM reasoning · 92.9%
Frontier models on MMLU, GPQA, MATH, AIME.

Code generation · 99.3%
HumanEval, SWE-bench Verified, LiveCodeBench.

Agentic · 80.9%
Long-horizon autonomy, tool use, OpenRouter flow.

OCR / documents · 97.5%
Layout, handwriting, table extraction.

Speech-to-text · 2.8
WER on WildASR and industry splits.

Text-to-speech · 88.9%
Voice clarity, fingerprint robustness.

Vision · classification · 91.0%
ImageNet, CIFAR, linear probe.

Vision · detection · 66.0%
COCO, LVIS zero-shot, detection.

Multimodal / VQA · 87.6%
VQA-v2, TextVQA, chart reasoning.

Embeddings / retrieval · 72.3%
MTEB avg, BEIR, hybrid retrieval.

Audio classification · 98.1%
ESC-50, AudioSet, sound event detection.

Medical imaging · 89.20
CheXpert, MIMIC-CXR, MedQA.

Robotics
Habitat, LIBERO-Long, manipulation.

Industrial AI · 99.60
MVTec-AD, DAGM, NEU-DET.

Hardware · inference
Speed, cost and energy on real silicon.

Embedded AI
LLMs on Hailo-10H, edge chip catalog.
Fig 4 · Each tile links to a dedicated section with its own leaderboard, methodology and — where applicable — priced hardware comparisons.
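The speech tiles report WER, where lower is better. For reference, WER is word-level edit distance divided by reference length — a textbook sketch of the metric, not Codesota's evaluation harness:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / len(ref)

print(round(100 * wer("the cat sat", "the cat sit"), 1))  # 33.3
```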
§ 04 · Read
Flagship content

What we have been writing.

Sections are written like issues: a leaderboard at the top, the methodology below it, then essays that explain what the numbers mean — not just what they are.

Section · LLM reasoning: Frontier LLMs across MMLU, GPQA, MATH and AIME with verification notes.

Section · OCR & documents: Layout, handwriting, structured extraction, with the ParseBench write-up.

Section · Speech: ASR accuracy, voice fingerprints, and a deep dive on TTS robustness.

Section · Vision: Classification, detection, segmentation, priced against hardware.

Section · Agentic AI: Long-horizon autonomy, tool use, OpenRouter market trends.

Index · Every ML task: The alphabetical register with trust grades for every canonical benchmark.

Essay · Voice fingerprints: Why TTS benchmarks miss the acoustic fingerprint that actually matters.

Archive · Papers with Code: The successor project to the archived Meta registry.

Guide · Choosing a TTS model: A practitioner guide to speech synthesis trade-offs.
§ 05 · Register

Trained something
that beats the table?

Submit a checkpoint or a paper result. We verify open-weight models against the public benchmark, cross-check vendor-reported numbers against the source, and add the row to the registry with its date and code trail.
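For code benchmarks, verifying against the public benchmark typically means re-sampling completions and scoring pass@k. The standard unbiased estimator comes from the original HumanEval paper (Chen et al., 2021); whether Codesota's harness uses exactly this is an assumption:

```python
def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per task, c of them correct.

    Computes 1 - C(n-c, k) / C(n, k) via a stable product form.
    """
    if n - c < k:
        return 1.0  # too few failures to fill a draw of k; pass@k is 1
    p = 1.0
    for i in range(n - c + 1, n + 1):
        p *= 1 - k / i
    return 1 - p

print(round(pass_at_k(n=10, c=5, k=1), 6))  # 0.5
```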