The state of the art,
measured honestly.
CodeSOTA is the open registry ML engineers consult before choosing a model: benchmarks linked to code, scores cross-checked against the paper, and original analysis of how the market actually uses these models. A calmer, stricter successor to Papers with Code.
No paywall. No signup. No sponsored leaderboards. Every result carries its source type — reproduced, paper, or vendor-reported — so you can decide how much to believe each number.
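As a sketch of what that looks like in data terms (field names here are illustrative, not the registry's actual schema):

```python
from dataclasses import dataclass
from typing import Literal

# How much to trust a number: did we reproduce it, does the paper
# report it, or does only the vendor claim it?
SourceType = Literal["reproduced", "paper", "vendor-reported"]

@dataclass(frozen=True)
class Result:
    model: str
    benchmark: str
    metric: str
    score: float
    source: SourceType  # the trust grade attached to every row
    evidence: str       # link to the paper, eval log, or announcement

row = Result(
    model="o3",
    benchmark="GPQA Diamond",
    metric="accuracy",
    score=82.8,
    source="paper",
    evidence="https://example.com/paper",  # placeholder URL
)
```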
Current state of the art.
A cross-section of the registry: the leading published score on each canonical benchmark, grouped by area. Shaded rows are independently verified by CodeSOTA; unshaded rows cite the paper or vendor.
- Results: 995
- Models tracked: 164
- Datasets indexed: 98
- Research areas: 17
| Area | Benchmark | Leading model | Metric | Score | Results |
|---|---|---|---|---|---|
| Code | HumanEval | o4-mini (high) | pass@1 | 99.3% | 33 |
| Code | SWE-bench Verified | Claude Opus 4.5 | resolve rate | 80.9% | 38 |
| Code | LiveCodeBench | DeepSeek-R1-0528 | pass@1 | 73.3% | 22 |
| Reasoning | MMLU | o3 | accuracy | 92.9% | 19 |
| Reasoning | GPQA Diamond | o3 | accuracy | 82.8% | 17 |
| Math | MATH | o4-mini (high) | accuracy | 98.2% | 29 |
| Math | AIME 2024 | o1 | accuracy | 83.3% | 3 |
| Math | GSM8K | o1-preview | accuracy | 97.8% | 5 |
| Vision | ImageNet-1K | coca-finetuned | top-1 | 91.0% | 22 |
| Vision | COCO detection | co-detr-swin-l | mAP | 66.0% | 17 |
| Vision | ADE20K | ONE-PEACE | mIoU | 63.0% | 13 |
| VQA | VQA-v2 | Qwen2-VL-72B | accuracy | 87.6% | 23 |
| VQA | TextVQA | Qwen2.5-VL-72B | accuracy | 85.5% | 9 |
| OCR | OCRBench v2 | Qwen2.5-VL-72B | overall | 63.70 | 74 |
| OCR | OmniDocBench | mineru-2.5 | layout mAP | 97.5% | 47 |
| OCR | ParseBench | LlamaParse Agentic | accuracy | 84.9% | 14 |
| OCR | OCR · CER | mistral-ocr-3 | CER (lower is better) | 3.7 | 1 |
| Speech | WildASR | Gemini 3 Pro | WER (lower is better) | 2.8 | 14 |
| Speech | VoiceBench | Ultravox-GLM-4P7 | overall | 88.9% | 13 |
| Audio | ESC-50 | BEATs | accuracy | 98.1% | 4 |
| Embeddings | MTEB | NV-Embed-v2 | avg | 72.3% | 6 |
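A note on the pass@1 figures in the Code rows: by convention they follow the unbiased estimator from the HumanEval paper, which samples n completions per problem and counts the c that pass the tests. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).
    n: completions sampled, c: completions that pass, k: attempt budget."""
    if n - c < k:
        return 1.0  # too few failures for any size-k draw to miss
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g. 150 of 200 samples pass -> pass@1 reduces to c/n = 0.75
print(pass_at_k(200, 150, 1))
```

Averaged over problems, pass@1 is simply the fraction of samples that pass; larger k rewards breadth of attempts.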
The frontier climbs.
HumanEval, one of the oldest public code-generation benchmarks, is nearing saturation. The step chart shows each successive SOTA-setting submission in the registry; the current leader is a reasoning-augmented mini model, not a frontier flagship.
To the right, small multiples sketch the trajectory across 17 modalities. The final dot on each line is the current leading score from the registry; intermediate points visualise the climb but do not claim a calendar date for each step.
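For what the step chart actually computes: a SOTA-setting submission is one that raises the running maximum over dated results. A minimal sketch, with hypothetical data:

```python
from datetime import date

# Hypothetical (date, score) results for one benchmark;
# real registry data differs.
results = [
    (date(2022, 3, 1), 47.0),
    (date(2021, 7, 1), 28.8),
    (date(2022, 9, 1), 44.5),  # below the running max: not a step
    (date(2023, 3, 1), 67.0),
]

def sota_steps(results):
    """Keep only results that raise the running maximum --
    the points a step chart draws."""
    best, steps = float("-inf"), []
    for when, score in sorted(results):
        if score > best:
            best = score
            steps.append((when, score))
    return steps

print(sota_steps(results))  # three steps; the 2022-09 result is skipped
```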
Sixteen domains. One registry.
Everyone tracks frontier LLM scores. We also track what your pipeline depends on — OCR, ASR, detection, retrieval, embedded inference — with the same standard of evidence.
- Frontier models on MMLU, GPQA, MATH, AIME.
- HumanEval, SWE-bench Verified, LiveCodeBench.
- Long-horizon autonomy, tool use, OpenRouter flow.
- Layout, handwriting, table extraction.
- WER on WildASR and industry splits (see the sketch after this list).
- Voice clarity, fingerprint robustness.
- ImageNet, CIFAR, linear probe.
- COCO, LVIS zero-shot, detection.
- VQA-v2, TextVQA, chart reasoning.
- MTEB avg, BEIR, hybrid retrieval.
- ESC-50, AudioSet, sound event detection.
- CheXpert, MIMIC-CXR, MedQA.
- Habitat, LIBERO-Long, manipulation.
- MVTec-AD, DAGM, NEU-DET.
- Speed, cost and energy on real silicon.
- LLMs on Hailo-10H, edge chip catalog.
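Since several speech rows are scored by word error rate, here is the metric itself: word-level edit distance (substitutions + deletions + insertions) divided by the reference length. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))  # 1 insertion / 3 words -> 0.333
```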
What we have been writing.
Sections are written like issues: a leaderboard at the top, the methodology below it, then essays that explain what the numbers mean — not just what they are.
- LLM reasoning: Frontier LLMs across MMLU, GPQA, MATH and AIME with verification notes.
- OCR & documents: Layout, handwriting, structured extraction, with the ParseBench write-up.
- Speech: ASR accuracy, voice fingerprints, and a deep dive on TTS robustness.
- Vision: Classification, detection, segmentation, priced against hardware.
- Agentic AI: Long-horizon autonomy, tool use, OpenRouter market trends.
- Every ML task: The alphabetical register with trust grades for every canonical benchmark.
- Voice fingerprints: Why TTS benchmarks miss the acoustic fingerprint that actually matters.
- Papers with Code: The successor project to the archived Meta registry.
- Choosing a TTS model: A practitioner guide to speech synthesis trade-offs.
Trained something
that beats the table?
Submit a checkpoint or a paper result. We verify open-weight models against the public benchmark, cross-check vendor-reported numbers against the source, and add the row to the registry with its date and code trail.
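A hypothetical submission record, to show the fields that matter; the names are illustrative, not the actual intake format:

```python
# Illustrative only: the real intake format may differ.
submission = {
    "model": "my-model-7b",                 # hypothetical checkpoint
    "benchmark": "SWE-bench Verified",
    "metric": "resolve rate",
    "score": 41.2,                          # hypothetical number
    "source": "reproduced",                 # or "paper" / "vendor-reported"
    "code": "https://github.com/you/eval",  # placeholder: the code trail
    "date": "2025-01-15",
}
```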