Which benchmark should you trust?
Tasks answer what problem you are solving; benchmark status answers whether the evidence behind a score is still useful. This page separates active evaluations from saturated, superseded, and unmapped leaderboards so that old scores do not masquerade as current capability.
TTS speed vs quality vs cost.
Compare Gradium, ElevenLabs, Cartesia, OpenAI, and other TTS providers on the metrics that matter for voice agents: WER, critical entity accuracy, p95 first-byte latency, severe error count, and cost per 1K characters.
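A minimal sketch of how the operational metrics above can be computed, assuming WER is word-level edit distance over reference length, p95 first-byte latency is the 95th percentile of measured latencies, and cost is normalised per 1K characters; the transcripts, latencies, and prices below are placeholders, not measured provider results.

```python
# Placeholder metric helpers for TTS comparison; numbers are illustrative only.
from statistics import quantiles


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)


def p95_ms(latencies_ms: list[float]) -> float:
    """95th percentile of measured first-byte latencies, in milliseconds."""
    return quantiles(latencies_ms, n=100)[94]


def cost_per_1k_chars(total_usd: float, characters: int) -> float:
    """Spend normalised to cost per 1K synthesised characters."""
    return total_usd / characters * 1000


# Round-trip scoring: synthesise the text, transcribe it with an ASR model,
# then compare the transcript against the input.
print(wer("pay invoice 4812 by march third",
          "pay invoice 4812 by march 3rd"))          # 1 substitution in 6 words, ~0.17
print(p95_ms([180, 220, 205, 240, 950, 210, 190]))   # tail latency of sample requests
print(cost_per_1k_chars(total_usd=1.20, characters=48_000))  # 0.025 USD per 1K chars
```

Severe error count and critical entity accuracy additionally need labelled entities (names, numbers, dates) in the reference text, so they are not sketched here.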
Status before scores.
A leaderboard with many rows can still be obsolete. Start with benchmark status, then inspect result density and source quality.
Active
Still discriminates frontier systems. Use these for current model comparisons.
Saturating
Useful, but ceiling effects or contamination risks are visible. Read the successor context.
Saturated
Good historical anchor, weak frontier signal. Prefer the successor benchmark.
Superseded
Replaced by a cleaner, harder, or more representative evaluation artifact.
Unmapped
Tracked leaderboard without curated lineage status yet. Treat as coverage backlog.
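A minimal sketch of the triage order described above (status first, then result density and verification), assuming each catalog row carries the benchmark, status, results, and verified fields shown in the tables below; the example rows are copied from this page for illustration.

```python
# Illustrative triage over catalog rows; values copied from the tables on this page.
rows = [
    {"benchmark": "OmniDocBench", "status": "Active", "results": 47, "verified": 11},
    {"benchmark": "HumanEval", "status": "Saturated", "results": 33, "verified": 15},
    {"benchmark": "GPQA", "status": "Active", "results": 17, "verified": 0},
    {"benchmark": "ABIDE I", "status": "Unmapped", "results": 21, "verified": 0},
]

# 1. Status first: keep only benchmarks that still discriminate frontier systems.
current = [r for r in rows if r["status"] in ("Active", "Saturating")]

# 2. Then result density and source quality: prefer rows with more verified results.
current.sort(key=lambda r: (r["verified"], r["results"]), reverse=True)

# 3. Unmapped rows are not discarded; they remain visible as coverage backlog.
backlog = [r for r in rows if r["status"] == "Unmapped"]

for r in current:
    print(f'{r["benchmark"]:15} {r["results"]:3} results, {r["verified"]} verified')
```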
Active and saturating benchmarks.
These are the first places to look for present-day model comparisons. Saturating benchmarks are still shown, but with the caveat that successor benchmarks may matter more.
OCRBench v2
Unmapped task. 74 results, 1 verified.
olmOCR-Bench
Document Parsing. 55 results, 0 verified.
OmniDocBench
Document Parsing. 47 results, 11 verified.
Terminal-Bench 2.0
Unmapped task. 20 results, 20 verified.
GPQA
Unmapped task. 17 results, 0 verified.
ParseBench
Document Parsing. 14 results, 14 verified.
SWE-Bench Verified
Code Generation. 39 results, 1 verified.
MATH
Mathematical Reasoning. 29 results, 0 verified.
Benchmarks replace each other.
A leaderboard is only useful if you know whether the benchmark is current, saturated, superseded, or still carrying the field. Codesota treats lineage as part of benchmark quality, not editorial decoration.
Coding Benchmarks
How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks where frontier focus has migrated; branches show specialised variants and successors that remain active in their own right.
Agentic AI Benchmarks
How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json) covers SWE-bench and its successors in depth — this lineage focuses on the broader question of agent-task evaluation: web navigation, API use, desktop control, and the multi-step planning that connects language model capabilities to real-world action. Branches include OSWorld (visual desktop agents) and tau-bench (function-calling reliability).
Mathematical Reasoning Benchmarks
How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces the shift from linguistic arithmetic (GSM8K) to formal mathematical proof and open research problems. Branches include the AIME competition track, which became a frontier benchmark after o1 broke it open, and FrontierMath, which sources unpublished problems from professional mathematicians.
OCR Benchmarks
How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. The attention path tracks where frontier focus has moved; branches show language-specific forks and metric-isolated variants.
Multimodal Reasoning Benchmarks
How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perception. When VQA-v2 saturated, the field needed benchmarks that tested whether models could integrate vision and language for genuine reasoning, not pattern matching. This lineage tracks that shift from ScienceQA through MMMU, MathVista, and into the expert-difficulty frontier.
NLP Benchmarks
How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU tasks; SuperGLUE raised the floor when GLUE saturated; BIG-bench expanded coverage to hundreds of tasks. The shift around 2023 was conceptual as much as technical — once models passed human baselines on NLU tasks, the interesting question became not 'does the model understand language' but 'can it reason'. Branches include SQuAD (reading comprehension), HellaSwag (commonsense completion), and WinoGrande (Winograd schemas).
Visual Question Answering
From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.
Reasoning Benchmarks
How evaluations of language-model reasoning evolved from broad knowledge testing to expert-level problem solving that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-coverage factual sweep through specialist tracks like GPQA, to HLE — a 2,500-question exam designed by domain experts where top models still score below 35%. Branches include BIG-Bench Hard (multi-step reasoning) and ARC-AGI (fluid abstract reasoning), which each probe different failure modes than the main knowledge-testing spine.
Text-to-Speech Benchmarks
How TTS evaluation evolved from single-speaker naturalness datasets toward production benchmarks that test intelligibility, voice similarity, latency, streaming behavior, and information preservation. The lineage separates beauty metrics like MOS from operational metrics such as WER round-trip, critical entity accuracy, and first-byte latency.
Speech Recognition Benchmarks
How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect real deployment environments. The spine tracks where word error rate evaluation moved as clean-speech performance saturated; branches cover speaker verification (VoxCeleb), noisy conditions (LibriSpeech-other, GigaSpeech), and multilingual evaluation (FLEURS, Common Voice).
Audio Understanding Benchmarks
How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with language understanding. The lineage runs from ESC-50 (2015) through AudioSet (2017) to audio-text retrieval and captioning benchmarks (Clotho, AudioCaps — popularised by the CLAP model), then to VoiceBench and AudioBench which test audio-language model instruction following. Branches include MUSDB18 (music source separation) and MusicNet (symbolic music).
Vision Benchmarks
How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The lineage reflects a structural shift: early benchmarks measured closed-set accuracy on fixed categories; modern benchmarks ask models to segment anything a user points at, including in video. Branches include CIFAR and Pascal VOC (historically important precursors) and ADE20K / Open Images (semantic and large-scale detection offshoots). SAM and SAM 2 are the reference *models* Meta shipped alongside their respective benchmarks — included here only as the systems that established SOTA on each.
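The lineage files mentioned above (for example coding.json) are not reproduced on this page, so the record below is a hypothetical shape only, meant to illustrate how an attention path, branches, and per-benchmark status fit together; the field names are assumptions, while the statuses are taken from the table further down.

```python
# Hypothetical lineage record; the real schema of coding.json is not shown here.
coding_lineage = {
    "area": "Code & Software Engineering",
    # The attention path: where frontier leaderboard focus migrated over time.
    "attention_path": ["HumanEval", "SWE-Bench", "SWE-Bench Verified"],
    # Branches: specialised variants that remain active in their own right.
    "branches": {
        "HumanEval": ["HumanEval+", "MBPP", "MBPP+"],
    },
    # Per-benchmark status, which the catalog joins onto its result rows.
    "status": {
        "HumanEval": "Saturated",
        "HumanEval+": "Active",
        "MBPP": "Saturated",
        "MBPP+": "Active",
        "SWE-Bench": "Superseded",
        "SWE-Bench Verified": "Saturating",
    },
}
```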
Where the result rows are.
| Area | Benchmarks | Results | Verified |
|---|---|---|---|
| Unmapped | 39 | 267 | 38 |
| Vision & Documents | 25 | 228 | 58 |
| Code & Software Engineering | 8 | 101 | 43 |
| Language & Knowledge | 11 | 41 | 0 |
| Structured Data & Forecasting | 3 | 38 | 27 |
| Multimodal Media | 5 | 33 | 30 |
| Robotics, Control & RL | 1 | 12 | 3 |
| Audio & Speech | 7 | 8 | 8 |
All benchmark artifacts.
Sorted by result count. Status comes from curated lineage when available. Unmapped rows stay visible as coverage backlog.
| Benchmark | Task | Metric | Status | Lineage | Year | Results | Verified |
|---|---|---|---|---|---|---|---|
| OCRBench v2 | Unmapped task | overall-en-private | Active | | 2024 | 74 | 1 (1%) |
| olmOCR-Bench | Document Parsing | pass-rate | Active | | 2024 | 55 | 0 (0%) |
| OmniDocBench | Document Parsing | composite | Active | | 2024 | 47 | 11 (23%) |
| SWE-Bench Verified | Code Generation | resolve-rate | Saturating | | 2024 | 39 | 1 (3%) |
| HumanEval | Code Generation | pass@1 | Saturated | | 2021 | 33 | 15 (45%) |
| MATH | Mathematical Reasoning | accuracy | Saturating | | 2021 | 29 | 0 (0%) |
| VQA v2.0 | Visual Question Answering | accuracy | Saturated | | 2017 | 23 | 20 (87%) |
| ImageNet-1K | Image Classification | top-1-accuracy | Saturated | | 2012 | 22 | 6 (27%) |
| Cora | Node Classification | accuracy | Unmapped | N/A | 2000 | 21 | 21 (100%) |
| ABIDE I | Unmapped task | accuracy | Unmapped | N/A | 2012 | 21 | 0 (0%) |
| Terminal-Bench 2.0 | Unmapped task | accuracy | Active | | 2026 | 20 | 20 (100%) |
| MMLU | Unmapped task | accuracy | Saturated | | 2021 | 19 | 0 (0%) |
| Open Graph Benchmark | Node Classification | accuracy-ogbn-arxiv | Unmapped | N/A | 2020 | 17 | 6 (35%) |
| GPQA | Unmapped task | accuracy | Active | | 2024 | 17 | 0 (0%) |
| COCO | Object Detection | mAP | Saturating | | 2014 | 17 | 0 (0%) |
| Atari 2600 | Unmapped task | human-normalized-score | Unmapped | N/A | 2013 | 16 | 1 (6%) |
| CIFAR-100 | Image Classification | accuracy | Unmapped | N/A | 2009 | 15 | 3 (20%) |
| MBPP | Code Generation | pass@1 | Saturated | | 2021 | 14 | 12 (86%) |
| ParseBench | Document Parsing | accuracy | Active | | 2026 | 14 | 14 (100%) |
| FUNSD | Unmapped task | f1 | Saturated | | 2019 | 13 | 13 (100%) |
| ADE20K | Semantic Segmentation | mIoU | Active | | 2016 | 13 | 0 (0%) |
| MuJoCo | Continuous Control | average-return | Unmapped | N/A | 2012 | 12 | 3 (25%) |
| CC-OCR | Unmapped task | multi-scene-f1 | Unmapped | N/A | 2024 | 12 | 0 (0%) |
| MVTec AD | Unmapped task | auroc | Unmapped | N/A | 2019 | 11 | 0 (0%) |
| CIFAR-10 | Image Classification | accuracy | Unmapped | N/A | 2009 | 11 | 8 (73%) |
| IAM | Handwriting Recognition | cer | Active | | 1999 | 8 | 8 (100%) |
| ImageNet Linear Probe | Image Classification | top-1-accuracy | Unmapped | N/A | 2012 | 8 | 5 (63%) |
| KITAB-Bench | Document OCR | cer | Active | | 2024 | 8 | 0 (0%) |
| CheXpert | Unmapped task | auroc | Unmapped | N/A | 2019 | 7 | 0 (0%) |
| MME-VideoOCR | Unmapped task | total-accuracy | Unmapped | N/A | 2024 | 6 | 0 (0%) |
| HumanEval+ | Code Generation | pass@1 | Active | | 2023 | 5 | 5 (100%) |
| GSM8K | Mathematical Reasoning | accuracy | Saturated | | 2021 | 5 | 0 (0%) |
| NoCaps | Image Captioning | cider | Unmapped | N/A | 2019 | 5 | 5 (100%) |
| OK-VQA | Visual Question Answering | accuracy | Active | | 2019 | 5 | 5 (100%) |
| ThaiOCRBench | Document OCR | ted-score | Active | | 2024 | 5 | 0 (0%) |
| AudioSet | Audio Classification | map | Saturating | | 2017 | 4 | 4 (100%) |
| ESC-50 | Audio Classification | accuracy | Saturated | | 2015 | 4 | 4 (100%) |
| MBPP+ | Code Generation | pass@1 | Active | | 2023 | 4 | 4 (100%) |
| ARC-Challenge | Unmapped task | accuracy | Unmapped | N/A | 2018 | 4 | 0 (0%) |
| HellaSwag | Unmapped task | accuracy | Unmapped | N/A | 2019 | 4 | 0 (0%) |
| NIH ChestX-ray14 | Unmapped task | auroc | Unmapped | N/A | 2017 | 4 | 0 (0%) |
| APPS | Code Generation | pass@1 | Unmapped | N/A | 2021 | 3 | 3 (100%) |
| CodeContests | Code Generation | pass@1 | Active | | 2022 | 3 | 3 (100%) |
| AIME 2024 | Mathematical Reasoning | accuracy | Active | | 2024 | 3 | 0 (0%) |
| CommonsenseQA | Unmapped task | accuracy | Unmapped | N/A | 2019 | 3 | 0 (0%) |
| MAWPS | Unmapped task | accuracy | Unmapped | N/A | 2016 | 3 | 0 (0%) |
| MIMIC-CXR | Unmapped task | auroc | Unmapped | N/A | 2019 | 3 | 0 (0%) |
| RLBench | Unmapped task | success-rate | Unmapped | N/A | 2020 | 3 | 3 (100%) |
| Severstal Steel Defect | Unmapped task | dice | Unmapped | N/A | 2019 | 3 | 0 (0%) |
| SVAMP | Unmapped task | accuracy | Unmapped | N/A | 2021 | 3 | 0 (0%) |
| VisA | Unmapped task | auroc | Unmapped | N/A | 2022 | 3 | 0 (0%) |
| WinoGrande | Unmapped task | accuracy | Unmapped | N/A | 2019 | 3 | 0 (0%) |
| Cityscapes | Semantic Segmentation | mIoU | Unmapped | N/A | 2016 | 3 | 3 (100%) |
| LogiQA | Logical Reasoning | accuracy | Unmapped | N/A | 2020 | 2 | 0 (0%) |
| ReClor | Logical Reasoning | accuracy | Unmapped | N/A | 2020 | 2 | 0 (0%) |
| ABIDE II | Unmapped task | accuracy | Unmapped | N/A | 2017 | 2 | 0 (0%) |
| COVID-19 Image Data Collection | Unmapped task | auroc | Unmapped | N/A | 2020 | 2 | 0 (0%) |
| HotpotQA | Unmapped task | f1 | Unmapped | N/A | 2018 | 2 | 0 (0%) |
| RSNA Pneumonia Detection | Unmapped task | map | Unmapped | N/A | 2018 | 2 | 0 (0%) |
| StrategyQA | Unmapped task | accuracy | Unmapped | N/A | 2021 | 2 | 0 (0%) |
| VinDr-CXR | Unmapped task | auroc | Unmapped | N/A | 2022 | 2 | 0 (0%) |
| ImageNet-V2 | Image Classification | top-1-accuracy | Unmapped | N/A | 2019 | 2 | 0 (0%) |
| NEU-DET | Unmapped task | map | Unmapped | N/A | 2013 | 1 | 0 (0%) |
| PadChest | Unmapped task | auroc | Unmapped | N/A | 2020 | 1 | 0 (0%) |
| Weld Defect X-Ray | Unmapped task | map | Unmapped | N/A | 2021 | 1 | 0 (0%) |
| Common Voice | Automatic Speech Recognition | wer | Unmapped | N/A | 2019 | 0 | N/A |
| LibriSpeech | Automatic Speech Recognition | wer-test-clean | Saturated | | 2015 | 0 | N/A |
| LJ Speech | Text-to-Speech | mos | Saturating | | 2017 | 0 | N/A |
| TTS Intelligibility | Text-to-Speech | critical-entity-accuracy | Active | | 2026 | 0 | N/A |
| VCTK | Text-to-Speech | mos | Active | | 2019 | 0 | N/A |
| SWE-Bench | Code Generation | resolve-rate | Superseded | | 2023 | 0 | N/A |
| CNN/DailyMail | Text Summarization | rouge-1 | Unmapped | N/A | 2015 | 0 | N/A |
| CoNLL-2003 | Named Entity Recognition | f1 | Unmapped | N/A | 2003 | 0 | N/A |
| GLUE | Text Classification | average-score | Unmapped | N/A | 2018 | 0 | N/A |
| SNLI | Natural Language Inference | accuracy | Unmapped | N/A | 2015 | 0 | N/A |
| SQuAD v2.0 | Question Answering | f1 | Unmapped | N/A | 2018 | 0 | N/A |
| SuperGLUE | Text Classification | average-score | Unmapped | N/A | 2019 | 0 | N/A |
| COCO Captions | Image Captioning | cider | Unmapped | N/A | 2015 | 0 | N/A |
| GQA | Visual Question Answering | accuracy | Saturated | | 2019 | 0 | N/A |
| M4 Competition | Time-Series Forecasting | smape | Unmapped | N/A | 2018 | 0 | N/A |
| ACDC | Unmapped task | mean-dsc | Unmapped | N/A | 2017 | 0 | N/A |
| BraTS 2023 | Unmapped task | mean-dice-wt-tc-et | Unmapped | N/A | 2023 | 0 | N/A |
| BTCV | Unmapped task | mean-dsc | Unmapped | N/A | 2015 | 0 | N/A |
| DocLayNet | Unmapped task | mAP | Unmapped | N/A | 2022 | 0 | N/A |
| KolektorSDD2 | Unmapped task | auroc | Unmapped | N/A | 2021 | 0 | N/A |
| MVTec 3D-AD | Unmapped task | auroc | Unmapped | N/A | 2021 | 0 | N/A |
| reVISION | Unmapped task | accuracy | Unmapped | N/A | 2025 | 0 | N/A |
| Synapse Multi-Organ CT | Unmapped task | mean-dsc | Unmapped | N/A | 2015 | 0 | N/A |
| CodeSOTA Polish | Document OCR | cer | Unmapped | N/A | 2025 | 0 | N/A |
| CTW1500 | Scene Text Detection | f1 | Unmapped | N/A | 2019 | 0 | N/A |
| ICDAR 2015 | Scene Text Detection | f1 | Unmapped | N/A | 2015 | 0 | N/A |
| ICDAR 2019 ArT | Scene Text Detection | f1 | Unmapped | N/A | 2019 | 0 | N/A |
| IMPACT-PSNC | Document OCR | cer | Unmapped | N/A | 2012 | 0 | N/A |
| Pascal VOC 2012 | Object Detection | mAP | Unmapped | N/A | 2012 | 0 | N/A |
| PolEval 2021 OCR | Document OCR | cer | Unmapped | N/A | 2021 | 0 | N/A |
| Polish EMNIST Extension | Handwriting Recognition | accuracy | Unmapped | N/A | 2020 | 0 | N/A |
| SROIE | Document OCR | f1 | Unmapped | N/A | 2019 | 0 | N/A |
| Total-Text | Scene Text Detection | f1 | Unmapped | N/A | 2017 | 0 | N/A |
| Union14M | Scene Text Detection | accuracy | Unmapped | N/A | 2023 | 0 | N/A |
Add a benchmark or result.
If a benchmark is missing, submit the paper or the leaderboard source. If a row is stale, submit the correction with a source link and the metric definition.