Codesota · Tasks · Vol. II · Decision-first benchmark discovery · Issue: April 22, 2026
§ 00 · Index

AI tasks and benchmark evidence.

Find the benchmark evidence behind any AI capability.

Start with a capability — OCR, coding agents, ASR, RAG, VQA, forecasting — then inspect the benchmark, metric, leading model, date, and trust grade.

SOTA is not a model. It is a claim about a model on a benchmark. This view currently surfaces 62 task pages across nine capability areas.

§ 01 · Capability finder

Find a benchmark by capability.

What are you trying to evaluate?

§ 02 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
9 · Capability areas · stable top-level ontology
147 · Tasks catalogued · 62 with published SOTA
780 · Datasets indexed · canonical benchmark per task marked
9,166 · Benchmark results · all dated · verified where possible
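For readers who pull these figures programmatically, the sketch below shows the kind of counter queries that could sit behind this block. It assumes a hypothetical registry schema (tables named capability_areas, tasks, datasets, results, and a sota_result_id column); the real Postgres schema is not published here.

```python
# Minimal sketch of the § 02 counters, assuming a hypothetical registry schema.
# Table and column names are illustrative, not the actual Codesota schema.
import psycopg2

QUERIES = {
    "capability_areas":  "SELECT count(*) FROM capability_areas",
    "tasks_catalogued":  "SELECT count(*) FROM tasks",
    "tasks_with_sota":   "SELECT count(*) FROM tasks WHERE sota_result_id IS NOT NULL",
    "datasets_indexed":  "SELECT count(*) FROM datasets",
    "benchmark_results": "SELECT count(*) FROM results",
}

def registry_counts(dsn: str) -> dict[str, int]:
    """Run each counter query against the registry and return label -> count."""
    counts = {}
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for label, sql in QUERIES.items():
                cur.execute(sql)
                counts[label] = cur.fetchone()[0]
    return counts
```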
§ 03 · Product map

Three different questions.

Start with the capability. Then inspect the evidence.

Tasks · Start with the capability
A task describes what the system must do: OCR, code generation, ASR, retrieval, VQA, detection.
→ Find benchmark

Leaderboards · Then inspect the evidence
A benchmark supplies evidence: metric definition, result counts, source quality, trust grade, and current top rows.
→ Open leaderboards

Lineages · Check if the benchmark still matters
A lineage tells you whether a benchmark is canonical, saturated, stale, contested, or replaced by a harder successor.
→ View evolution
§ 04 · Area map

Nine stable capability areas.

Choose the area closest to the decision you are making. Domains, modalities, methods, and safety properties are overlays — not separate root taxonomies.

01 · 16 tasks

Language & Knowledge

Retrieval, QA, reasoning, factuality, text classification, and knowledge-heavy language tasks.

Use when: evaluating answers grounded in text, documents, memory, retrieval, or general language knowledge.

Common trap: chat preference is not retrieval quality; use task-specific evidence for RAG, QA, and factuality.

Start with: Commonsense Reasoning · Mathematical Reasoning · Multi-step Reasoning
02 · 11 tasks

Vision & Documents

OCR, layout, tables, parsing, detection, segmentation, and visual extraction.

Use when: evaluating document AI or visual extraction systems.

Common trap: OCR is not document understanding; layout, tables, handwriting, and VQA need separate evidence.

Start with: Document OCR · Scene Text Detection · Document Layout Analysis
03 · 6 tasks

Audio & Speech

ASR, TTS, speaker intelligence, music, sound events, and audio-language tasks.

Use when: choosing speech, voice, transcription, or audio understanding systems.

Common trap: low WER does not guarantee diarization, latency, speaker identity, or noisy-call performance.

Start with: Speech Recognition · Audio Captioning · Music Generation
04 · 3 tasks

Multimodal Media

VQA, image-text retrieval, video QA, document VQA, image generation, and editing.

Use when: the input or output crosses text, image, video, audio, or document boundaries.

Common trap: one multimodal score can hide failures in OCR-heavy, spatial, chart, or video tasks.

Start with: Visual Question Answering · Image Captioning · Text-to-Image Generation
05 · 7 tasks

Code & Software Engineering

Code generation, repair, repository work, tests, security, and UI/mobile code.

Use when: evaluating software output, coding assistants, or repository-level model behavior.

Common trap: HumanEval-style synthesis is not the same as issue resolution or production engineering.

Start with: Code Generation · React Native Code Generation · Code Translation
06 · 9 tasks

Agents & Tool Use

Tool calling, web agents, OS tasks, long-horizon autonomy, and workflow execution.

Use when: the model must plan, use tools, recover from errors, or act over multiple steps.

Common trap: agent benchmark wins may depend on scaffolding, tools, browser setup, and budget, not only the base model.

Start with: SWE-bench · Task agents · Autonomous Coding
07 · 5 tasks

Structured Data & Forecasting

Tables, tabular prediction, time series, anomalies, recommenders, graphs, and optimization.

Use when: choosing evidence for numerical, temporal, graph, or business-data decisions.

Common trap: forecasting and tabular claims are sensitive to split design, leakage, and horizon definition.

Start with: Node Classification · Tabular Classification · Link Prediction
08 · 2 tasks

Robotics, Control & RL

Game play, continuous control, navigation, manipulation, VLA models, drones, and driving.

Use when: evaluating embodied systems, control policies, or simulation-to-real claims.

Common trap: simulator scores rarely transfer cleanly to hardware without environment and safety evidence.

Start with: Atari Games · Continuous Control
09 · 3 tasks

Science, Medicine & Industry

Medical, clinical, scientific, industrial, legal, finance, climate, and compliance AI.

Use when: the decision depends on a regulated domain, specialized data, or high-cost errors.

Common trap: general benchmarks can miss domain drift, licensing, annotation quality, and clinical workflow constraints.

Start with: Disease Classification · Anomaly Detection · Medical Image Segmentation
§ 05 · Capability area

Language & Knowledge.

Language understanding, retrieval, QA, RAG, factuality, and knowledge extraction. Reasoning appears here as a capability tag, not as a separate root.


Tasks: 16 · Verified SOTA: 10 · Results: 317
Language & Knowledge · 16 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.
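As a rough illustration of what "current registered leader" means operationally, the sketch below picks the best registered result for one benchmark and metric while respecting the declared metric direction. The record fields (model, value, higher_is_better, measured_on) are illustrative, not the registry's actual schema, and the second example row is invented; only the Whisper Large-v2 figure comes from the table below.

```python
# Minimal sketch: selecting the registered leader for one benchmark + metric.
# Field names are illustrative assumptions, not the registry schema.
from dataclasses import dataclass
from datetime import date

@dataclass
class Result:
    model: str
    value: float
    higher_is_better: bool   # metric direction, declared per dataset
    measured_on: date

def registered_leader(results: list[Result]) -> Result | None:
    """Best result currently registered — relative to the register, not universal."""
    if not results:
        return None  # corresponds to "Not enough registered evidence"
    direction = results[0].higher_is_better  # all rows share the dataset's direction
    return max(results, key=lambda r: r.value if direction else -r.value)

# WER is lower-is-better, so 11.20 beats the (invented) 12.40 entry.
asr = [
    Result("Whisper Large-v2", 11.20, higher_is_better=False, measured_on=date(2023, 3, 1)),
    Result("hypothetical-asr-model", 12.40, higher_is_better=False, measured_on=date(2022, 11, 1)),
]
print(registered_leader(asr).model)  # Whisper Large-v2
```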

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Massive Multitask Language Understanding · Registered leader: o3 · 92.9% · accuracy · Trust B · Fragmented* · 82 results

02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Mathematics Aptitude Test of Heuristics · Registered leader: Claude Opus 4.5 · 90.7% · accuracy · Canonical* · 79 results

03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Graduate-Level Google-Proof Q&A Diamond · Registered leader: Gemini 2.5 Pro · 84.0% · accuracy · Trust B · Canonical* · 53 results

04 · Question Answering
Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,…
question-answering
Natural Questions: a Benchmark for Question Answering Research · Not enough registered evidence · Trust B · Stale · 26 results

05 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
summarization
CNN/DailyMail Summarization · Registered leader: BRIO · 47.8% · rouge-1 · Canonical* · 15 results

06 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
LogiQA · Registered leader: GPT-4o · 56.3% · accuracy · Canonical* · 12 results

07 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Stanford Natural Language Inference · Registered leader: GPT-4o · 92.6% · accuracy · Canonical* · 8 results

08 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
text-ranking
BEIR · Registered leader: NV-Embed-v2 · 62.65 · ndcg@10 · Canonical* · 8 results

09 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
token-classification
CoNLL-2003 Named Entity Recognition · Registered leader: GLiNER-multitask · 93.8% · f1 · Canonical* · 7 results

10 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Math Word Problem Repository · Registered leader: GPT-4o · 97.2% · accuracy · Canonical* · 6 results

11 · Text Embeddings
Generating dense vector embeddings for retrieval, ranking, clustering, and semantic search.
feature-extraction
Why this benchmark?: MTEB-style evidence is the practical first stop for embedding and retrieval model choices.
What it measures: Retrieval, ranking, classification, clustering, and semantic similarity tasks.
What it misses: Your corpus, latency budget, reranker pairing, and domain-specific relevance labels.
MTEB Leaderboard · Registered leader: NV-Embed-v2 · 72.31 · avg-score · Canonical · 6 results

12 · Entity Linking
Linking mentions to knowledge base entities.
AIDA-CoNLL-YAGO (test-b) · Registered leader: GENRE · 93.30 · micro_f1 · Sparse* · 3 results

13 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
FB15k-237 Knowledge Graph Completion · Registered leader: NBFNet · 0.415 · mrr · Sparse* · 3 results

14 · Relation Extraction
Extracting relationships between entities from text.
TAC Relation Extraction Dataset · Registered leader: LUKE · 72.7% · f1 · Sparse* · 3 results

15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
sentence-similarity
STS Benchmark · Registered leader: GTE-Qwen2-7B-instruct · 88.40 · spearman · Sparse* · 3 results

16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
table-question-answering
WikiTableQuestions · Registered leader: GPT-4 · 75.3% · accuracy · Sparse* · 3 results
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 06 · Capability area

Vision & Documents.

Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.


Tasks: 11 · Verified SOTA: 8 · Results: 2,039
Vision & Documents · 11 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Document OCR
Reading text, structure, and layout from document images.
Why this benchmark?: OCRBench v2 is a broad first stop for document OCR and visual text extraction comparisons.
What it measures: Text recognition, document understanding subtasks, and multi-scenario OCR capability.
What it misses: Invoice extraction, handwriting-heavy archives, table fidelity, and local-language production OCR.
When not to use: Do not use it as the only evidence for production document automation.
OCRBench v2 · Registered leader: Qwen2.5-VL-72B · 63.70 · overall · Canonical · 829 results

02 · Scene Text Detection
Detecting text regions in natural scene images
coco-text · Registered leader: CLIP4STR-L · 81.90 · 1-1-accuracy · Trust A · Canonical* · 581 results

03 · Document Layout Analysis
Analyzing the layout structure of documents
d4la · Registered leader: DoPTA · 70.7% · map · Canonical* · 133 results

04 · Scene Text Recognition
Recognizing text in natural scene images
cute80 · Registered leader: CPPD · 99.7% · accuracy · Canonical* · 127 results

05 · Document Parsing
Parsing document structure and content
Why this benchmark?: OmniDocBench is a stronger starting point when the output must preserve reading order, layout, tables, and formulas.
What it measures: End-to-end document parsing quality across text, tables, formulas, and layout-sensitive output.
What it misses: Vendor-specific workflow behavior, low-resource languages, and private document templates.
OmniDocBench v1.5 · Registered leader: Mistral OCR 3 · 91.63 · reading-order · Emerging · 117 results

06 · Table Recognition
Detecting and parsing tables in documents
icdar2013-table-structure-recognition · Registered leader: Proposed System (With post-processing) · 95.46 · f-measure · Fragmented · 71 results

07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
OCRBench v2 · Registered leader: mistral-ocr-2512 · 25.20 · overall-en-private · Canonical* · 66 results

08 · Document Image Classification
Classifying documents by type or category
aip · Registered leader: ResNet-RS (ResNet-200 + RS training tricks) · 83.40 · top-1-accuracy-verb · Canonical* · 62 results

09 · Handwriting Recognition
Recognizing handwritten text
No canonical benchmark registered · Not enough registered evidence · Missing* · 40 results

10 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
document-question-answering
Form Understanding in Noisy Scanned Documents · Not enough registered evidence · Canonical* · 7 results

11 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
image-segmentation
ADE20K Scene Parsing Benchmark · Registered leader: InternImage-H · 62.9% · mIoU · Canonical* · 6 results
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 07 · Capability area

Audio & Speech.

ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.


Tasks: 6 · Verified SOTA: 1 · Results: 37
Audio & Speech · 6 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
automatic-speech-recognition
Why this benchmark?: ASR leaderboards remain useful when the split, language, domain, and WER normalization are explicit.
What it measures: Transcription error rate on fixed audio corpora.
What it misses: Diarization, streaming latency, noisy calls, code-switching, and downstream extraction quality.
Mozilla Common Voice · Registered leader: Whisper Large-v2 · 11.20 · wer · Canonical · 22 results

02 · Audio Captioning
Generating text descriptions of audio content.
AudioCaps · Registered leader: AudioCaps baseline (TopDown+Align) · 36.9% · spider · Sparse* · 3 results

03 · Music Generation
Generating music from text, audio, or other inputs.
MusicCaps · Registered leader: MusicLM · 4.000 · fad · Sparse* · 3 results

04 · Sound Event Detection
Detecting and localizing sound events in audio.
Domestic Environment Sound Event Detection (DCASE Task 4) · Registered leader: ATST-SED · 58.10 · event-f1 · Sparse* · 3 results

05 · Speaker Verification
Verifying speaker identity from voice samples.
VoxCeleb1 Original Test Set (VoxCeleb1-O) · Registered leader: ResNet-34 (AM-Softmax, VoxCeleb2) · 1.180 · eer · Sparse* · 3 results

06 · Speech Translation
Translating spoken audio directly to another language.
MuST-C English-German tst-COMMON · Registered leader: SeamlessM4T v2 Large · 37.1% · bleu · Sparse* · 3 results
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 08 · Capability area

Multimodal Media.

Cross-modal tasks only: VQA, image-text retrieval, video QA, document VQA, text-to-image, image editing, and any-to-any media models.


Tasks: 3 · Verified SOTA: 2 · Results: 49
Multimodal Media · 3 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
visual-question-answering
Why this benchmark?: VQA is best treated as a family: chart, document, spatial, and general image QA can disagree.
What it measures: Question answering over visual inputs on fixed task distributions.
What it misses: OCR-heavy documents, long-context image sets, grounding, and tool-assisted visual reasoning.
Visual Question Answering v2.0 · Registered leader: Qwen2-VL 72B · 87.6% · accuracy · Fragmented · 47 results

02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
image-to-text
COCO Captions · Registered leader: BLIP-2 · 145.8 · CIDEr · Trust A · Sparse* · 2 results

03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
text-to-image
DPG-Bench · Not enough registered evidence · Contested · 0 results
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 09 · Capability area

Code & Software Engineering.

Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.


Tasks: 7 · Verified SOTA: 6 · Results: 263
Code & Software Engineering · 7 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Why this benchmark?: LiveCodeBench is harder to memorize than older static coding sets and tracks recent coding ability.
What it measures: Competitive programming style code generation on dated, rolling problems.
What it misses: Repository repair, tool use, test writing, and long-horizon software engineering work.
When not to use: Use SWE-bench or repository benchmarks for agentic coding work.
LiveCodeBench · Registered leader: DeepSeek-R1-0528 · 73.3% · pass@1 · Trust B · Canonical · 196 results

02 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Callstack Incubator React Native Evaluation Suite · Registered leader: Composer 2 · 98.90 · navigation-satisfaction · Canonical* · 40 results

03 · Code Translation
Converting code between programming languages.
TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Registered leader: Claude Sonnet 4 · 89.40 · computational-accuracy · Canonical* · 7 results

04 · Bug Detection
Identifying bugs and vulnerabilities in code.
Bugs2Fix: Learning to Rewrite Buggy Code · Registered leader: GPT-4o · 78.6% · accuracy · Canonical* · 6 results

05 · Code Completion
Predicting the next tokens in code sequences.
Cross-File Code Completion Evaluation · Registered leader: Claude Sonnet 4 · 44.50 · exact-match · Canonical* · 6 results

06 · Program Repair
Automatically fixing bugs in code.
Defects4J: A Database of Real Faults in Java Programs · Registered leader: SRepair · 101.0 · correct-patches · Canonical* · 5 results

07 · Code Summarization
Generating natural language descriptions of code.
CodeXGLUE Code-to-Text Python subset · Registered leader: CodeT5-base · 20.0% · bleu · Sparse* · 3 results
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 10 · Capability area

Agents & Tool Use.

Tool calling, web and desktop agents, browser automation, long-horizon autonomy, multi-agent coordination, and agent safety.


Tasks: 9 · Verified SOTA: 5 · Results: 184
Agents & Tool Use · 9 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
Why this benchmark?: SWE-bench Verified is the usual first benchmark for autonomous agents repairing real GitHub issues.
What it measures: Patch generation that resolves repository issues against validation tests.
What it misses: Product judgment, multi-day maintenance work, security review, and UI-heavy tasks.
SWE-bench Verified — Agentic Leaderboard · Registered leader: Claude Mythos Preview · 93.90 · resolve-rate · Trust B · Canonical · 81 results

02 · Task agents
AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks…
No canonical benchmark registered · Not enough registered evidence · Missing* · 35 results

03 · Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal…
SWE-bench Verified (Agentic) · Registered leader: Claude Opus 4.5 · 80.90 · pct_resolved · Trust B · Canonical* · 23 results

04 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · Registered leader: CoAct-1 · 60.76 · success-rate · Canonical* · 19 results

05 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
No canonical benchmark registered · Not enough registered evidence · Missing* · 8 results

06 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Human-Calibrated Autonomy Software Tasks · Registered leader: Claude Opus 4 · 55.00 · success-rate · Canonical* · 6 results

07 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Research Engineering Benchmark · Registered leader: o3 · 0.380 · normalized-score · Canonical* · 5 results

08 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
METR Autonomy Evaluation: Time Horizon · Registered leader: Claude Opus 4 · 60.00 · task-horizon-minutes · Canonical* · 5 results

09 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
No canonical benchmark registered · Not enough registered evidence · Missing* · 2 results
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 11 · Capability area

Structured Data & Forecasting.

Tables, tabular classification and regression, time-series forecasting, anomaly detection, recommender systems, graph learning, and optimization.


Tasks: 5 · Verified SOTA: 3 · Results: 19
Structured Data & Forecasting · 5 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
graph-ml
Cora Citation Network · Registered leader: ACNet · 83.5% · accuracy · Canonical* · 6 results

02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
tabular-classification
OpenML-CC18 · Registered leader: AutoGluon-Tabular · 88.5% · accuracy · Canonical* · 5 results

04 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Open Graph Benchmark - ogbg-molhiv · Registered leader: DGN · 79.70 · roc_auc · Sparse* · 3 results

05 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
tabular-regression
California Housing · Registered leader: XGBoost · 0.453 · rmse · Sparse* · 2 results
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 12 · Capability area

Robotics, Control & RL.

Game playing, continuous control, manipulation, navigation, embodied instruction following, VLA models, drones, and autonomous driving.


Tasks: 2 · Verified SOTA: 0 · Results: 21
Robotics, Control & RL · 2 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
reinforcement-learning
Arcade Learning Environment (Atari 2600) · Registered leader: Go-Explore · 40000.0 · human-normalized-score · Canonical* · 12 results

02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Multi-Joint dynamics with Contact · Registered leader: TD-MPC2 (317M params) · 960.0 · average-return · Canonical* · 9 results
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 13 · Capability area

Science, Medicine & Industry.

A domain layer for medical imaging, clinical text, drug discovery, protein modeling, industrial inspection, remote sensing, climate, legal, finance, and compliance AI.


Tasks: 3 · Verified SOTA: 3 · Results: 110
Science, Medicine & Industry · 3 tasks · sorted by result count, then name

Current registered leader means the best result currently registered for this benchmark and metric. It does not mean universally best model.

Task · Best first benchmark · Current registered leader · Trust · Status · Results

01 · Disease Classification
Diagnosing diseases from medical images or data.
Autism Brain Imaging Data Exchange I · Registered leader: SSAE + Softmax (Explainable ASD) · 98.2% · accuracy · Canonical* · 57 results

02 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
MVTec Anomaly Detection Dataset · Registered leader: AnomalyGPT · 97.40 · auroc · Canonical* · 27 results

03 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Automated Cardiac Diagnosis Challenge · Registered leader: MedNeXt-L · 92.65 · mean-dsc · Canonical* · 26 results
Fig 13 · Each row links to the task page with full history. Shaded rows mark independently verified registered leaders. Benchmark statuses are manual where curated and otherwise inferred from available registry evidence.
§ 14 · Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A · Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.

B · Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.

C · Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.

F · Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade; we do not erase the prior grade.
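A minimal sketch of how an append-only grade history could be kept so that a regrade never erases the prior grade. The class and field names here are hypothetical illustrations, not the benchmark page's actual data model.

```python
# Hypothetical append-only trust-grade history for one canonical dataset.
from dataclasses import dataclass, field
from datetime import date

GRADES = ("A", "B", "C", "F")  # the four-point scale described above

@dataclass
class TrustHistory:
    dataset: str
    entries: list[tuple[date, str, str]] = field(default_factory=list)  # (when, grade, reason)

    def regrade(self, when: date, grade: str, reason: str) -> None:
        """Record a new grade; earlier entries are kept, never overwritten."""
        if grade not in GRADES:
            raise ValueError(f"unknown grade: {grade}")
        self.entries.append((when, grade, reason))

    @property
    def current(self) -> str | None:
        return self.entries[-1][1] if self.entries else None
```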

§ 15 · Standing columns

Capability buckets, not benchmarks.

These are HuggingFace pipeline-tag categories: they group concrete tasks thematically and are not themselves measurable. Use them to navigate to the real rankings.

Image + Text → Video · Animate a still image guided by a text prompt.
Video → Video · Video editing, style transfer, super-resolution.
Image → 3D · Generate a 3D mesh or NeRF from one or more images.
Text → 3D · Generate a 3D asset from a text prompt.
Image → Video · Animate a still image into a short clip.
Unconditional Image Generation · Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 15 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 16 · Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
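Put in data-shape form, the three requirements could look roughly like the sketch below. The field names and the publishable gate are illustrative assumptions, not the registry's actual submission schema.

```python
# Illustrative shape of a publishable submission record; not the real schema.
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Submission:
    model: str
    benchmark: str
    metric: str
    higher_is_better: bool   # metric direction, declared on the dataset
    value: float
    measured_on: date        # every score carries a date
    commit_sha: str          # frozen commit, not just a repo link
    environment: str         # declared environment (e.g. lockfile or image digest)
    seed: int                # recorded seed

def publishable(sub: Submission, reproduced_ok: bool) -> bool:
    """If it does not run, the row does not publish."""
    has_provenance = all([sub.commit_sha, sub.environment, sub.measured_on])
    return reproduced_ok and has_provenance
```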

Decision signal

What were you trying to decide today?
