Codesota · Benchmarks
Status · lineage · evidence density

Which benchmark should you trust?

Tasks answer what problem you are solving. Benchmarks answer whether the evidence is still useful. This page separates active evaluations from saturated, superseded, and unmapped leaderboards so old scores do not masquerade as current capability.

Browse tasks · Submit result · Benchmark lineages
New · Text-to-Speech benchmark

TTS speed vs quality vs cost.

Compare Gradium, ElevenLabs, Cartesia, OpenAI, and other TTS providers on the metrics that matter for voice agents: WER, critical entity accuracy, p95 first-byte latency, severe error count, and cost per 1K characters.

WER ↓ · Entity accuracy ↑ · TTFB p95 ↓ · Cost / 1K ↓
Benchmark output
Quality: normalized WER · CER · exact match
Information fidelity: numbers · dates · names · emails
Speed: TTFB p50/p95 · total latency
Cost: estimated run cost · cost per 1K chars
Open benchmark →
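For readers who want to sanity-check rows like these offline, here is a minimal sketch of how the headline metrics can be computed. It is not the benchmark's actual harness or schema: the record fields (reference, hypothesis, ttfb_ms, characters, cost_usd) are hypothetical, and in practice the hypothesis transcript would come from an ASR round-trip over the synthesized audio.

```python
# Sketch of the headline TTS metrics: word error rate (WER), p95
# time-to-first-byte, and cost per 1K characters. Field names on the
# result rows are illustrative, not Codesota's actual schema.
from math import ceil

def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # standard Levenshtein DP over words
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)

def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def cost_per_1k_chars(total_cost_usd: float, total_chars: int) -> float:
    return 1000 * total_cost_usd / max(total_chars, 1)

# Made-up example rows in the shape the functions above expect.
rows = [
    {"reference": "call me at five thirty", "hypothesis": "call me at five thirty",
     "ttfb_ms": 180, "characters": 23, "cost_usd": 0.0004},
    {"reference": "the meeting is on march third", "hypothesis": "the meeting is on march bird",
     "ttfb_ms": 240, "characters": 29, "cost_usd": 0.0005},
]
print("mean WER:", sum(wer(r["reference"], r["hypothesis"]) for r in rows) / len(rows))
print("TTFB p95 (ms):", p95([r["ttfb_ms"] for r in rows]))
print("cost / 1K chars ($):", cost_per_1k_chars(sum(r["cost_usd"] for r in rows),
                                                sum(r["characters"] for r in rows)))
```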
Benchmarks: 99 · With results: 65 · Result rows: 1,008 · Verified rows: 435
§ 01 · Status filters

Status before scores.

A leaderboard with many rows can still be obsolete. Start with benchmark status, then inspect result density and source quality.

17 benchmarks

Active

Still discriminates frontier systems. Use these for current model comparisons.

5 benchmarks

Saturating

Still useful, but ceiling effects or contamination risks are visible. Read the successor context.

10 benchmarks

Saturated

Good historical anchor, weak frontier signal. Prefer the successor benchmark.

1 benchmark

Superseded

Replaced by a cleaner, harder, or more representative evaluation artifact.

66 benchmarks

Unmapped

Tracked leaderboard without curated lineage status yet. Treat as coverage backlog.
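As a concrete illustration of "status before scores", the sketch below ranks a benchmark list by curated status first and by evidence density only within a status. The Benchmark record and the Status ordering are a hypothetical data model, not Codesota's; the example rows are taken from the index in § 05.

```python
# Sketch of "status before scores": status dominates the ranking;
# verified and result counts only break ties within a status.
from dataclasses import dataclass
from enum import IntEnum

class Status(IntEnum):
    # Lower value = stronger current signal.
    ACTIVE = 0
    SATURATING = 1
    SATURATED = 2
    SUPERSEDED = 3
    UNMAPPED = 4

@dataclass
class Benchmark:
    name: str
    status: Status
    results: int
    verified: int

def rank(benchmarks: list[Benchmark]) -> list[Benchmark]:
    # Prefer active status, then more verified rows, then more results overall.
    return sorted(benchmarks, key=lambda b: (b.status, -b.verified, -b.results))

catalog = [
    Benchmark("HumanEval", Status.SATURATED, results=33, verified=15),
    Benchmark("ParseBench", Status.ACTIVE, results=14, verified=14),
    Benchmark("SWE-Bench Verified", Status.SATURATING, results=39, verified=1),
]
for b in rank(catalog):
    print(f"{b.status.name:<11} {b.name:<20} {b.results} results, {b.verified} verified")
```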

§ 02 · Current signal

Active and saturating benchmarks.

These are the first places to look for present-day model comparisons. Saturating benchmarks are still shown, but with the caveat that successor benchmarks may matter more.

Active · Unmapped

OCRBench v2

Unmapped task. 74 results, 1 verified.

Active · Vision & Documents

olmOCR-Bench

Document Parsing. 55 results, 0 verified.

Active · Vision & Documents

OmniDocBench

Document Parsing. 47 results, 11 verified.

Active · Unmapped

Terminal-Bench 2.0

Unmapped task. 20 results, 20 verified.

Active · Unmapped

GPQA

Unmapped task. 17 results, 0 verified.

Active · Vision & Documents

ParseBench

Document Parsing. 14 results, 14 verified.

Saturating · Code & Software Engineering

SWE-Bench Verified

Code Generation. 39 results, 1 verified.

Saturating · Language & Knowledge

MATH

Mathematical Reasoning. 29 results, 0 verified.

§ 03 · Lineage

Benchmarks replace each other.

A leaderboard is only useful if you know whether the benchmark is current, saturated, superseded, or still carrying the field. Codesota treats lineage as part of benchmark quality, not editorial decoration.

13 benchmarks · 9 active

Coding Benchmarks

How code-generation evaluation moved from short Python functions to repository-scale software engineering. The attention path tracks where frontier focus has migrated; branches show specialised variants and successors that remain active in their own right.

8 benchmarks · 8 active

Agentic AI Benchmarks

How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json) covers SWE-bench and its successors in depth — this lineage focuses on the broader question of agent-task evaluation: web navigation, API use, desktop control, and the multi-step planning that connects language model capabilities to real-world action. Branches include OSWorld (visual desktop agents) and tau-bench (function-calling reliability).

6 benchmarks · 5 active

Mathematical Reasoning Benchmarks

How mathematical reasoning evaluation evolved from grade-school word problems through competition mathematics to research-frontier problems that current AI cannot reliably solve. The lineage traces the shift from linguistic arithmetic (GSM8K) to formal mathematical proof and open research problems. Branches include the AIME competition track, which became a frontier benchmark after o1 broke it open, and FrontierMath, which sources unpublished problems from professional mathematicians.

12 benchmarks · 9 active

OCR Benchmarks

How optical character recognition evaluation moved from word-level handwriting transcription to whole-document parsing with tables, charts and layout. The attention path tracks the frontier focus; branches show language-specific forks and metric-isolated variants.

6 benchmarks · 6 active

Multimodal Reasoning Benchmarks

How vision-language model evaluation moved beyond visual question answering (covered in the VQA lineage) into multimodal reasoning — science, mathematics, chart understanding, and expert-level perception. When VQA-v2 saturated, the field needed benchmarks that tested whether models could integrate vision and language for genuine reasoning, not pattern matching. This lineage tracks that shift from ScienceQA through MMMU, MathVista, and into the expert-difficulty frontier.

7 benchmarks · 2 active

NLP Benchmarks

How natural language understanding evaluation evolved from narrow task-specific tests to multi-task suites, and then was eclipsed by 'reasoning' as the frontier label. GLUE unified disparate NLU tasks; SuperGLUE raised the floor when GLUE saturated; BIG-bench expanded coverage to hundreds of tasks. The shift around 2023 was conceptual as much as technical — once models passed human baselines on NLU tasks, the interesting question became not 'does the model understand language' but 'can it reason'. Branches include SQuAD (reading comprehension), HellaSwag (commonsense completion), and WinoGrande (Winograd schemas).

9 benchmarks · 6 active

Visual Question Answering

From the original image+question task to broad multimodal reasoning. The attention path tracks where leaderboard focus has moved; branches show specialized variants that remain active.

6 benchmarks · 5 active

Reasoning Benchmarks

How evaluations of language-model reasoning evolved from broad knowledge testing to expert-level problem solving that frontier models still cannot reliably solve. The lineage runs from MMLU's wide-coverage factual sweep through specialist tracks like GPQA, to HLE — a 2,500-question exam designed by domain experts where top models still score below 35%. Branches include BIG-Bench Hard (multi-step reasoning) and ARC-AGI (fluid abstract reasoning), which each probe different failure modes than the main knowledge-testing spine.

5 benchmarks · 5 active

Text-to-Speech Benchmarks

How TTS evaluation evolved from single-speaker naturalness datasets toward production benchmarks that test intelligibility, voice similarity, latency, streaming behavior, and information preservation. The lineage separates beauty metrics like MOS from operational metrics such as WER round-trip, critical entity accuracy, and first-byte latency.

7 benchmarks · 6 active

Speech Recognition Benchmarks

How automatic speech recognition evaluation evolved from clean read speech on LibriSpeech, through multi-speaker and noisy conditions, toward naturalistic and multilingual benchmarks that reflect real deployment environments. The spine tracks where word error rate evaluation moved as clean-speech performance saturated; branches cover speaker verification (VoxCeleb), noisy conditions (LibriSpeech-other, GigaSpeech), and multilingual evaluation (FLEURS, Common Voice).

7 benchmarks · 6 active

Audio Understanding Benchmarks

How audio AI evaluation evolved from environmental sound classification on small datasets through large-scale event detection to foundation-model-era benchmarks that combine audio perception with language understanding. The lineage runs from ESC-50 (2015) through AudioSet (2017) to audio-text retrieval and captioning benchmarks (Clotho, AudioCaps — popularised by the CLAP model), then to VoiceBench and AudioBench which test audio-language model instruction following. Branches include MUSDB18 (music source separation) and MusicNet (symbolic music).

7 benchmarks · 4 active

Vision Benchmarks

How computer vision evaluation moved from image classification on ImageNet through object detection and dense prediction on COCO, to open-world promptable segmentation with SA-1B and SA-V. The lineage reflects a structural shift: early benchmarks measured closed-set accuracy on fixed categories; modern benchmarks ask models to segment anything a user points at, including in video. Branches include CIFAR and Pascal VOC (historically important precursors) and ADE20K / Open Images (semantic and large-scale detection offshoots). SAM and SAM 2 are the reference *models* Meta shipped alongside their respective benchmarks — included here only as the systems that established SOTA on each.

§ 04 · Evidence density

Where the result rows are.

Area · Benchmarks · Results · Verified
Unmapped · 39 · 267 · 38
Vision & Documents · 25 · 228 · 58
Code & Software Engineering · 8 · 101 · 43
Language & Knowledge · 11 · 41 · 0
Structured Data & Forecasting · 3 · 38 · 27
Multimodal Media · 5 · 33 · 30
Robotics, Control & RL · 1 · 12 · 3
Audio & Speech · 7 · 8 · 8
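The area totals above are plain sums over row-level records. A minimal sketch of that aggregation, using a few rows from the index in § 05 and an assumed (area, results, verified) record shape:

```python
# Sketch of the per-area aggregation: group benchmark rows by area,
# then sum benchmarks, result rows, and verified rows.
from collections import defaultdict

rows = [
    # (area, results, verified) -- a few real rows from the index below.
    ("Vision & Documents", 47, 11),           # OmniDocBench
    ("Vision & Documents", 14, 14),           # ParseBench
    ("Code & Software Engineering", 39, 1),   # SWE-Bench Verified
]

totals = defaultdict(lambda: {"benchmarks": 0, "results": 0, "verified": 0})
for area, results, verified in rows:
    totals[area]["benchmarks"] += 1
    totals[area]["results"] += results
    totals[area]["verified"] += verified

for area, t in sorted(totals.items(), key=lambda kv: -kv[1]["results"]):
    rate = t["verified"] / t["results"] if t["results"] else 0.0
    print(f"{area}: {t['benchmarks']} benchmarks, {t['results']} results, "
          f"{t['verified']} verified ({rate:.0%})")
```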
§ 05 · Benchmark index

All benchmark artifacts.

Sorted by result count. Status comes from curated lineage when available. Unmapped rows stay visible as coverage backlog.

Benchmark · Task · Metric · Status · Lineage · Year · Results · Verified
OCRBench v2
Unmapped task · overall-en-private · Active · 2024 · 74 · 1 (1%)
olmOCR-Bench
Document Parsing · pass-rate · Active · 2024 · 55 · 0 (0%)
OmniDocBench
OmniDocBench v1.5
Document Parsing · composite · Active · 2024 · 47 · 11 (23%)
SWE-Bench Verified
SWE-bench Verified Subset
Code Generation · resolve-rate · Saturating · 2024 · 39 · 1 (3%)
HumanEval
HumanEval: Hand-Written Evaluation Set
Code Generation · pass@1 · Saturated · 2021 · 33 · 15 (45%)
MATH
Mathematics Aptitude Test of Heuristics
Mathematical Reasoning · accuracy · Saturating · 2021 · 29 · 0 (0%)
VQA v2.0
Visual Question Answering v2.0
Visual Question Answering · accuracy · Saturated · 2017 · 23 · 20 (87%)
ImageNet-1K
ImageNet Large Scale Visual Recognition Challenge 2012
Image Classification · top-1-accuracy · Saturated · 2012 · 22 · 6 (27%)
Cora
Cora Citation Network
Node Classification · accuracy · Unmapped · N/A · 2000 · 21 · 21 (100%)
ABIDE I
Autism Brain Imaging Data Exchange I
Unmapped task · accuracy · Unmapped · N/A · 2012 · 21 · 0 (0%)
Terminal-Bench 2.0
Terminal-Bench 2.0: Terminal Environment Agent Benchmark
Unmapped task · accuracy · Active · 2026 · 20 · 20 (100%)
MMLU
Massive Multitask Language Understanding
Unmapped task · accuracy · Saturated · 2021 · 19 · 0 (0%)
Open Graph Benchmark
Open Graph Benchmark (OGB)
Node Classification · accuracy-ogbn-arxiv · Unmapped · N/A · 2020 · 17 · 6 (35%)
GPQA
Graduate-Level Google-Proof Q&A
Unmapped task · accuracy · Active · 2024 · 17 · 0 (0%)
COCO
Microsoft COCO: Common Objects in Context
Object Detection · mAP · Saturating · 2014 · 17 · 0 (0%)
Atari 2600
Arcade Learning Environment (Atari 2600)
Unmapped task · human-normalized-score · Unmapped · N/A · 2013 · 16 · 1 (6%)
CIFAR-100
Canadian Institute for Advanced Research 100
Image Classification · accuracy · Unmapped · N/A · 2009 · 15 · 3 (20%)
MBPP
Mostly Basic Python Problems
Code Generation · pass@1 · Saturated · 2021 · 14 · 12 (86%)
ParseBench
ParseBench: A Document Parsing Benchmark for AI Agents
Document Parsing · accuracy · Active · 2026 · 14 · 14 (100%)
FUNSD
Form Understanding in Noisy Scanned Documents
Unmapped task · f1 · Saturated · 2019 · 13 · 13 (100%)
ADE20K
ADE20K Scene Parsing Benchmark
Semantic Segmentation · mIoU · Active · 2016 · 13 · 0 (0%)
MuJoCo
Multi-Joint dynamics with Contact
Continuous Control · average-return · Unmapped · N/A · 2012 · 12 · 3 (25%)
CC-OCR
Comprehensive Challenge OCR
Unmapped task · multi-scene-f1 · Unmapped · N/A · 2024 · 12 · 0 (0%)
MVTec AD
MVTec Anomaly Detection Dataset
Unmapped task · auroc · Unmapped · N/A · 2019 · 11 · 0 (0%)
CIFAR-10
Canadian Institute for Advanced Research 10
Image Classification · accuracy · Unmapped · N/A · 2009 · 11 · 8 (73%)
IAM
IAM Handwriting Database
Handwriting Recognition · cer · Active · 1999 · 8 · 8 (100%)
ImageNet Linear Probe
ImageNet-1K Linear Probe Evaluation
Image Classification · top-1-accuracy · Unmapped · N/A · 2012 · 8 · 5 (63%)
KITAB-Bench
KITAB Arabic OCR Benchmark
Document OCR · cer · Active · 2024 · 8 · 0 (0%)
CheXpert
CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels
Unmapped task · auroc · Unmapped · N/A · 2019 · 7 · 0 (0%)
MME-VideoOCR
MME Video OCR Benchmark
Unmapped task · total-accuracy · Unmapped · N/A · 2024 · 6 · 0 (0%)
HumanEval+
HumanEval+ Extended Version
Code Generation · pass@1 · Active · 2023 · 5 · 5 (100%)
GSM8K
Grade School Math 8K
Mathematical Reasoning · accuracy · Saturated · 2021 · 5 · 0 (0%)
NoCaps
Novel Object Captioning at Scale
Image Captioning · cider · Unmapped · N/A · 2019 · 5 · 5 (100%)
OK-VQA
Outside Knowledge Visual Question Answering
Visual Question Answering · accuracy · Active · 2019 · 5 · 5 (100%)
ThaiOCRBench
Thai OCR Benchmark
Document OCR · ted-score · Active · 2024 · 5 · 0 (0%)
AudioSet
Audio Classification · map · Saturating · 2017 · 4 · 4 (100%)
ESC-50
Environmental Sound Classification 50
Audio Classification · accuracy · Saturated · 2015 · 4 · 4 (100%)
MBPP+
MBPP+ Extended Version
Code Generation · pass@1 · Active · 2023 · 4 · 4 (100%)
ARC-Challenge
AI2 Reasoning Challenge
Unmapped task · accuracy · Unmapped · N/A · 2018 · 4 · 0 (0%)
HellaSwag
Unmapped task · accuracy · Unmapped · N/A · 2019 · 4 · 0 (0%)
NIH ChestX-ray14
NIH Clinical Center Chest X-ray Dataset
Unmapped task · auroc · Unmapped · N/A · 2017 · 4 · 0 (0%)
APPS
Automated Programming Progress Standard
Code Generation · pass@1 · Unmapped · N/A · 2021 · 3 · 3 (100%)
CodeContests
CodeContests Competitive Programming
Code Generation · pass@1 · Active · 2022 · 3 · 3 (100%)
AIME 2024
American Invitational Mathematics Examination 2024
Mathematical Reasoning · accuracy · Active · 2024 · 3 · 0 (0%)
CommonsenseQA
Unmapped task · accuracy · Unmapped · N/A · 2019 · 3 · 0 (0%)
MAWPS
Math Word Problem Repository
Unmapped task · accuracy · Unmapped · N/A · 2016 · 3 · 0 (0%)
MIMIC-CXR
MIMIC-CXR: Medical Information Mart for Intensive Care - Chest X-ray
Unmapped task · auroc · Unmapped · N/A · 2019 · 3 · 0 (0%)
RLBench
Robot Learning Benchmark (RLBench)
Unmapped task · success-rate · Unmapped · N/A · 2020 · 3 · 3 (100%)
Severstal Steel Defect
Severstal Steel Defect Detection
Unmapped task · dice · Unmapped · N/A · 2019 · 3 · 0 (0%)
SVAMP
Simple Variations on Arithmetic Math Word Problems
Unmapped task · accuracy · Unmapped · N/A · 2021 · 3 · 0 (0%)
VisA
Visual Anomaly Dataset
Unmapped task · auroc · Unmapped · N/A · 2022 · 3 · 0 (0%)
WinoGrande
Unmapped task · accuracy · Unmapped · N/A · 2019 · 3 · 0 (0%)
Cityscapes
Cityscapes Dataset
Semantic Segmentation · mIoU · Unmapped · N/A · 2016 · 3 · 3 (100%)
LogiQA
Logical Reasoning · accuracy · Unmapped · N/A · 2020 · 2 · 0 (0%)
ReClor
Reading Comprehension Dataset Requiring Logical Reasoning
Logical Reasoning · accuracy · Unmapped · N/A · 2020 · 2 · 0 (0%)
ABIDE II
Autism Brain Imaging Data Exchange II
Unmapped task · accuracy · Unmapped · N/A · 2017 · 2 · 0 (0%)
COVID-19 Image Data Collection
Unmapped task · auroc · Unmapped · N/A · 2020 · 2 · 0 (0%)
HotpotQA
Unmapped task · f1 · Unmapped · N/A · 2018 · 2 · 0 (0%)
RSNA Pneumonia Detection
RSNA Pneumonia Detection Challenge
Unmapped task · map · Unmapped · N/A · 2018 · 2 · 0 (0%)
StrategyQA
Unmapped task · accuracy · Unmapped · N/A · 2021 · 2 · 0 (0%)
VinDr-CXR
VinDr-CXR: Vietnamese Dataset for Chest Radiograph
Unmapped task · auroc · Unmapped · N/A · 2022 · 2 · 0 (0%)
ImageNet-V2
ImageNet-V2 Matched Frequency
Image Classification · top-1-accuracy · Unmapped · N/A · 2019 · 2 · 0 (0%)
NEU-DET
NEU Surface Defect Database
Unmapped task · map · Unmapped · N/A · 2013 · 1 · 0 (0%)
PadChest
PadChest: A Large Chest X-ray Image Dataset
Unmapped task · auroc · Unmapped · N/A · 2020 · 1 · 0 (0%)
Weld Defect X-Ray
X-Ray Weld Defect Detection Dataset
Unmapped task · map · Unmapped · N/A · 2021 · 1 · 0 (0%)
Common Voice
Mozilla Common Voice
Automatic Speech Recognition · wer · Unmapped · N/A · 2019 · 0 · N/A
LibriSpeech
LibriSpeech ASR Corpus
Automatic Speech Recognition · wer-test-clean · Saturated · 2015 · 0 · N/A
LJ Speech
The LJ Speech Dataset
Text-to-Speech · mos · Saturating · 2017 · 0 · N/A
TTS Intelligibility
English TTS Intelligibility Benchmark
Text-to-Speech · critical-entity-accuracy · Active · 2026 · 0 · N/A
VCTK
CSTR VCTK Corpus
Text-to-Speech · mos · Active · 2019 · 0 · N/A
SWE-Bench
SWE-bench: Software Engineering Benchmark
Code Generation · resolve-rate · Superseded · 2023 · 0 · N/A
CNN/DailyMail
CNN/DailyMail Summarization
Text Summarization · rouge-1 · Unmapped · N/A · 2015 · 0 · N/A
CoNLL-2003
CoNLL-2003 Named Entity Recognition
Named Entity Recognition · f1 · Unmapped · N/A · 2003 · 0 · N/A
GLUE
General Language Understanding Evaluation
Text Classification · average-score · Unmapped · N/A · 2018 · 0 · N/A
SNLI
Stanford Natural Language Inference
Natural Language Inference · accuracy · Unmapped · N/A · 2015 · 0 · N/A
SQuAD v2.0
Stanford Question Answering Dataset v2.0
Question Answering · f1 · Unmapped · N/A · 2018 · 0 · N/A
SuperGLUE
Text Classification · average-score · Unmapped · N/A · 2019 · 0 · N/A
COCO Captions
Image Captioning · cider · Unmapped · N/A · 2015 · 0 · N/A
GQA
GQA: Visual Reasoning in the Real World
Visual Question Answering · accuracy · Saturated · 2019 · 0 · N/A
M4 Competition
M4 Forecasting Competition
Time-Series Forecasting · smapi · Unmapped · N/A · 2018 · 0 · N/A
ACDC
Automated Cardiac Diagnosis Challenge
Unmapped task · mean-dsc · Unmapped · N/A · 2017 · 0 · N/A
BraTS 2023
Brain Tumor Segmentation Challenge 2023
Unmapped task · mean-dice-wt-tc-et · Unmapped · N/A · 2023 · 0 · N/A
BTCV
Beyond The Cranial Vault Multi-Organ CT Segmentation
Unmapped task · mean-dsc · Unmapped · N/A · 2015 · 0 · N/A
DocLayNet
DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis
Unmapped task · mAP · Unmapped · N/A · 2022 · 0 · N/A
KolektorSDD2
Kolektor Surface Defect Dataset 2
Unmapped task · auroc · Unmapped · N/A · 2021 · 0 · N/A
MVTec 3D-AD
MVTec 3D Anomaly Detection Dataset
Unmapped task · auroc · Unmapped · N/A · 2021 · 0 · N/A
reVISION
reVISION Polish Vision-Language Benchmark
Unmapped task · accuracy · Unmapped · N/A · 2025 · 0 · N/A
Synapse Multi-Organ CT
Synapse Multi-Organ Abdominal CT Segmentation
Unmapped task · mean-dsc · Unmapped · N/A · 2015 · 0 · N/A
CodeSOTA Polish
CodeSOTA Polish OCR Benchmark
Document OCR · cer · Unmapped · N/A · 2025 · 0 · N/A
CTW1500
Curved Text in the Wild 1500
Scene Text Detection · f1 · Unmapped · N/A · 2019 · 0 · N/A
ICDAR 2015
ICDAR 2015 Incidental Scene Text
Scene Text Detection · f1 · Unmapped · N/A · 2015 · 0 · N/A
ICDAR 2019 ArT
ICDAR 2019 Arbitrary-Shaped Text
Scene Text Detection · f1 · Unmapped · N/A · 2019 · 0 · N/A
IMPACT-PSNC
IMPACT Polish Digital Libraries Ground Truth
Document OCR · cer · Unmapped · N/A · 2012 · 0 · N/A
Pascal VOC 2012
Pascal Visual Object Classes Challenge 2012
Object Detection · mAP · Unmapped · N/A · 2012 · 0 · N/A
PolEval 2021 OCR
PolEval 2021 OCR Post-Correction Task
Document OCR · cer · Unmapped · N/A · 2021 · 0 · N/A
Polish EMNIST Extension
EMNIST Extended with Polish Diacritics
Handwriting Recognition · accuracy · Unmapped · N/A · 2020 · 0 · N/A
SROIE
Scanned Receipts OCR and Information Extraction
Document OCR · f1 · Unmapped · N/A · 2019 · 0 · N/A
Total-Text
Scene Text Detection · f1 · Unmapped · N/A · 2017 · 0 · N/A
Union14M
Union14M: A Unified Benchmark for Scene Text Recognition
Scene Text Detection · accuracy · Unmapped · N/A · 2023 · 0 · N/A
§ 06 · Missing coverage

Add a benchmark or result.

If a benchmark is missing, submit the paper or the leaderboard source. If a row is stale, submit the correction with a source link and the metric definition.

Submit result · Contribution guide