Codesota · Tasks · Vol. II
Capability-first task ontology
Issue: April 22, 2026
§ 00 · Index

Every AI capability,
mapped to benchmark evidence.

Tasks are capabilities. Benchmarks are evidence. Domains, modalities, and safety properties are filters. This page groups 62 task pages into nine stable capability areas, then shows the canonical benchmark, leading model, and trust grade for each row.

Reasoning, safety, robustness, multilingual coverage, and vertical domains are treated as cross-cutting overlays rather than competing top-level roots.
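To make that shape concrete, here is a minimal sketch of the register's data model in Python. The names (`CapabilityArea`, `Task`, `EvidenceRow`) and fields are illustrative assumptions, not the registry's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class CapabilityArea(Enum):
    """The nine stable top-level roots (§ 03). A task belongs to exactly one."""
    LANGUAGE_KNOWLEDGE = "language-knowledge"
    VISION_DOCUMENTS = "vision-documents"
    AUDIO_SPEECH = "audio-speech"
    MULTIMODAL_MEDIA = "multimodal-media"
    CODE_SOFTWARE = "code-software-engineering"
    AGENTS_TOOL_USE = "agents-tool-use"
    STRUCTURED_DATA = "structured-data-forecasting"
    ROBOTICS_CONTROL_RL = "robotics-control-rl"
    SCIENCE_MEDICINE_INDUSTRY = "science-medicine-industry"


@dataclass
class Task:
    """A capability. Overlays (reasoning, safety, multilingual, domain)
    filter tasks; they never become additional roots."""
    slug: str
    area: CapabilityArea
    overlays: set[str] = field(default_factory=set)


@dataclass
class EvidenceRow:
    """Evidence for a task: one dated, graded score on one benchmark."""
    task_slug: str
    benchmark: str
    model: str
    metric: str
    score: float
    date: str         # ISO date; every score carries one (§ 15)
    trust_grade: str  # "A" | "B" | "C" | "F" (§ 13)


# "Reasoning" stays a filter over Language & Knowledge, not a tenth root:
commonsense = Task("commonsense-reasoning",
                   CapabilityArea.LANGUAGE_KNOWLEDGE, {"reasoning"})
```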

§ 01 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
9 · Capability areas · Stable top-level ontology
147 · Tasks catalogued · 62 with evidence rows
780 · Datasets indexed · Canonical scope now labelled
9,164 · Benchmark results · All dated, verified where possible
§ 02 · Product map

Three different questions.

Tasks are taxonomy. Leaderboards are evidence. Lineages are benchmark history.

Tasks · Start with the problem
Use this page when you know the capability you care about: OCR, code generation, ASR, retrieval, VQA, detection.
Browse taxonomy

Leaderboards · Then inspect the evidence
Use benchmark pages when you need result counts, source quality, trust badges, metric definitions, and current top rows.
Open leaderboards

Lineages · Check if the benchmark still matters
Use lineage pages when a benchmark looks saturated, outdated, contaminated, or replaced by a harder successor.
View evolution
§ 03 · Area map

Nine stable capability areas.

Domains, modalities, methods, and safety properties are filters. They do not compete with the top-level task ontology.

01 · Language & Knowledge · 16 tasks
02 · Vision & Documents · 11 tasks
03 · Audio & Speech · 6 tasks
04 · Multimodal Media · 3 tasks
05 · Code & Software Engineering · 7 tasks
06 · Agents & Tool Use · 9 tasks
07 · Structured Data & Forecasting · 5 tasks
08 · Robotics, Control & RL · 2 tasks
09 · Science, Medicine & Industry · 3 tasks
§ 04 · Capability area

Language & Knowledge.

Language understanding, retrieval, QA, RAG, factuality, and knowledge extraction. Reasoning appears here as a capability tag, not as a separate root.


Tasks: 16 · Verified SOTA: 10 · Results: 317
Language & Knowledge · 16 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Massive Multitask Language Understanding · legacy · legacy · ambiguous
MMLU is saturated and better treated as general knowledge / legacy LLM eval, not canonical commonsense reasoning.
o3 · 92.9% accuracy · 82 results
02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Mathematics Aptitude Test of Heuristics · Claude Opus 4.5 · 90.7% accuracy · 79 results
03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Graduate-Level Google-Proof Q&A Diamond · Gemini 2.5 Pro · 84.0% accuracy · 53 results
04 · Question Answering
Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,…
Natural Questions: a Benchmark for Question Answering Research · 26 results
05 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
CNN/DailyMail Summarization · BRIO · 47.8% rouge-1 · 15 results
06 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
LogiQA · GPT-4o · 56.3% accuracy · 12 results
07 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Stanford Natural Language Inference · GPT-4o · 92.6% accuracy · 8 results
08 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
BEIR legacy retrieval · legacy · legacy · ambiguous
Legacy retrieval snapshot. Split modern retrieval, reranking, multilingual, and long-context RAG evals before calling this current SOTA.
NV-Embed-v2 · 62.65 ndcg@10 · 8 results
09 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
CoNLL-2003 Named Entity Recognition · GLiNER-multitask · 93.8% f1 · 7 results
10 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Math Word Problem Repository · GPT-4o · 97.2% accuracy · 6 results
11 · Text Embeddings
Generating dense vector embeddings for retrieval, ranking, clustering, and semantic search.
Legacy MTEB English, 2024 snapshot · historical · legacy · ambiguous
NV-Embed-v2 is a historical MTEB English 56-task snapshot, not a fresh 2026 embedding frontier.
NV-Embed-v2 · 72.31 avg-score · 6 results
12 · Entity Linking
Linking mentions to knowledge base entities.
AIDA-CoNLL-YAGO (test-b) · GENRE · 93.30 micro_f1 · 3 results
13 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
FB15k-237 Knowledge Graph Completion · NBFNet · 0.415 mrr · 3 results
14 · Relation Extraction
Extracting relationships between entities from text.
TAC Relation Extraction Dataset · LUKE · 72.7% f1 · 3 results
15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
STS Benchmark · GTE-Qwen2-7B-instruct · 88.40 spearman · 3 results
16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
WikiTableQuestions · GPT-4 · 75.3% accuracy · 3 results
Fig 04 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 05 · Capability area

Vision & Documents.

Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.


Tasks: 11 · Verified SOTA: 6 · Results: 2,039
Vision & Documents · 11 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Document OCR
Reading text, structure, and layout from document images.
OCRBench v2 public overall · submetric · aging · ambiguous
Scope is public overall. Do not compare directly with English-private OCRBench v2 or full document parsing metrics.
Qwen2.5-VL-72B · 63.70 overall · 829 results
02 · Scene Text Detection
Detecting text regions in natural scene images.
COCO-Text detection scope needs review · misclassified · stale · misclassified
CLIP4STR-style scene text recognition rows do not belong under detection. Detection needs region metrics such as precision, recall, F-measure, or hmean.
581 results
03 · Document Layout Analysis
Analyzing the layout structure of documents.
D4LA · DoPTA · 70.7% map · 133 results
04 · Scene Text Recognition
Recognizing text in natural scene images.
CUTE80 · CPPD · 99.7% accuracy · 127 results
05 · Document Parsing
Parsing document structure and content.
OmniDocBench v1.5 · submetric · aging · ambiguous
Reading order is only one OmniDocBench facet. Summary SOTA needs text, layout, table TEDS, reading order, and end-to-end structure facets.
Mistral OCR 3 · 91.63 reading-order · 117 results
06 · Table Recognition
Detecting and parsing tables in documents.
ICDAR2013 table structure (legacy) · legacy · legacy · ambiguous
ICDAR2013 is too narrow for 2026 table recognition. Promote PubTables-1M, PubTabNet, FinTabNet, or table-specific document parsing metrics.
Proposed System (with post-processing) · 95.46 f-measure · 71 results
07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
OCRBench v2 · needs coverage · stale · ambiguous
Fold this into OCR unless the metric scope is explicit: public overall, English-private, recognition, understanding, or full parsing.
66 results
08 · Document Image Classification
Classifying documents by type or category.
aip · ResNet-RS (ResNet-200 + RS training tricks) · 83.40 top-1-accuracy · 62 results
09 · Handwriting Recognition
Recognizing handwritten text.
40 results
10 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
Form Understanding in Noisy Scanned Documents · 7 results
11 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
ADE20K Scene Parsing Benchmark · InternImage-H · 62.9% mIoU · 6 results
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 06 · Capability area

Audio & Speech.

ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.


Tasks: 6 · Verified SOTA: 1 · Results: 35
Audio & Speech · 6 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
Mozilla Common Voice · Whisper Large-v2 · 11.20 wer · 20 results
02 · Audio Captioning
Generating text descriptions of audio content.
AudioCaps · historical · stale · ambiguous
Baseline-style AudioCaps rows should not read as current leading audio-language SOTA without a refresh.
AudioCaps baseline (TopDown+Align) · 36.9% spider · 3 results
03 · Music Generation
Generating music from text, audio, or other inputs.
MusicCaps · historical · stale · ambiguous
MusicLM is historically important, but this needs MusicCaps/MusicBench, human eval, and proprietary/open splits.
MusicLM · 4.000 fad · 3 results
04 · Sound Event Detection
Detecting and localizing sound events in audio.
Domestic Environment Sound Event Detection (DCASE Task 4) · ATST-SED · 58.10 event-f1 · 3 results
05 · Speaker Verification
Verifying speaker identity from voice samples.
VoxCeleb1 Original Test Set (VoxCeleb1-O) · ResNet-34 (AM-Softmax, VoxCeleb2) · 1.180 eer · 3 results
06 · Speech Translation
Translating spoken audio directly to another language.
MuST-C English-German tst-COMMON · SeamlessM4T v2 Large · 37.1% bleu · 3 results
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 07 · Capability area

Multimodal Media.

Cross-modal tasks only: VQA, image-text retrieval, video QA, document VQA, text-to-image, image editing, and any-to-any media models.


Tasks: 3 · Verified SOTA: 2 · Results: 49
Multimodal Media · 3 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
Visual Question Answering v2.0 · Qwen2-VL 72B · 87.6% accuracy · 47 results
02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
COCO Captions · legacy · legacy · ambiguous
COCO captioning is legacy and saturated. Add NoCaps, Flickr30k, caption QA, or preference-based caption evals.
BLIP-2 · 145.8 CIDEr · 2 results
03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
DPG-Bench · 0 results
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 08 · Capability area

Code & Software Engineering.

Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.


Tasks: 7 · Verified SOTA: 6 · Results: 263
Code & Software Engineering · 7 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
LiveCodeBench · Gemini 3 Pro Preview · 91.7% pass@1 · 196 results
02 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Callstack Incubator React Native Evaluation Suite · Composer 2 · 98.90 navigation-satisfaction · 40 results
03 · Code Translation
Converting code between programming languages.
TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Claude Sonnet 4 · 89.40 computational-accuracy · 7 results
04 · Bug Detection
Identifying bugs and vulnerabilities in code.
Bugs2Fix: Learning to Rewrite Buggy Code · GPT-4o · 78.6% accuracy · 6 results
05 · Code Completion
Predicting the next tokens in code sequences.
Cross-File Code Completion Evaluation · Claude Sonnet 4 · 44.50 exact-match · 6 results
06 · Program Repair
Automatically fixing bugs in code.
Defects4J: A Database of Real Faults in Java Programs · SRepair · 101.0 correct-patches · 5 results
07 · Code Summarization
Generating natural language descriptions of code.
CodeXGLUE Code-to-Text Python subset · CodeT5-base · 20.0% bleu · 3 results
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 09 · Capability area

Agents & Tool Use.

Tool calling, web and desktop agents, browser automation, long-horizon autonomy, multi-agent coordination, and agent safety.


Tasks: 9 · Verified SOTA: 5 · Results: 184
Agents & Tool Use · 9 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
SWE-bench Verified — Agentic Leaderboard · Claude Mythos Preview · 93.90 resolve-rate · 81 results
02 · Task agents
AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks…
35 results
03 · Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal…
SWE-bench Verified (Agentic) · Claude Opus 4.5 · 80.90 pct_resolved · 23 results
04 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · CoAct-1 · 60.76 success-rate · 19 results
05 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
8 results
06 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Human-Calibrated Autonomy Software Tasks · Claude Opus 4 · 55.00 success-rate · 6 results
07 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Research Engineering Benchmark · o3 · 0.380 normalized-score · 5 results
08 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
METR Autonomy Evaluation: Time Horizon · Claude Opus 4 · 60.00 task-horizon-minutes · 5 results
09 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
2 results
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 10 · Capability area

Structured Data & Forecasting.

Tables, tabular classification and regression, time-series forecasting, anomaly detection, recommender systems, graph learning, and optimization.


Tasks: 5 · Verified SOTA: 3 · Results: 19
Structured Data & Forecasting · 5 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
Cora Citation Network · ACNet · 83.5% accuracy · 6 results
02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
OpenML-CC18 · AutoGluon-Tabular · 88.5% accuracy · 5 results
03 · Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta…
Open Graph Benchmark - ogbl-collab · PROXI · 70.98 hits_at_50 · 3 results
04 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Open Graph Benchmark - ogbg-molhiv · DGN · 79.70 roc_auc · 3 results
05 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
California Housing · XGBoost · 0.453 rmse · 2 results
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 11 · Capability area

Robotics, Control & RL.

Game playing, continuous control, manipulation, navigation, embodied instruction following, VLA models, drones, and autonomous driving.


Tasks: 2 · Verified SOTA: 0 · Results: 21
Robotics, Control & RL · 2 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
Arcade Learning Environment (Atari 2600) · legacy · legacy
Classic RL benchmark. Keep separate from modern embodied, VLA, robotics manipulation, and navigation tasks.
Go-Explore · 40000.0 human-normalized-score · 12 results
02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Multi-Joint dynamics with Contact · submetric · aging · ambiguous
MuJoCo control is a narrow simulation slice. Split from robotics manipulation, navigation, and VLA evaluations.
TD-MPC2 (317M params) · 960.0 average-return · 9 results
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 12 · Capability area

Science, Medicine & Industry.

A domain layer for medical imaging, clinical text, drug discovery, protein modeling, industrial inspection, remote sensing, climate, legal, finance, and compliance AI.


Tasks: 3 · Verified SOTA: 3 · Results: 110
Science, Medicine & Industry · 3 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Disease Classification
Diagnosing diseases from medical images or data.
Autism Brain Imaging Data Exchange I · claim-only · stale · ambiguous
High ABIDE accuracy claims are leakage-risk until subject-level split, site-held-out validation, preprocessing, confound control, and external validation are verified.
SSAE + Softmax (Explainable ASD) · 98.2% accuracy · 57 results
02 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
MVTec Anomaly Detection Dataset · submetric · aging · ambiguous
MVTec AD rows must split image-level classification, pixel-level localization, zero/few/full-shot, and AUROC/AUPRO metric scopes.
AnomalyGPT · 97.40 auroc · 27 results
03 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Automated Cardiac Diagnosis Challenge · MedNeXt-L · 92.65 mean-dsc · 26 results
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 13
Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A · Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.

B · Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.

C · Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.

F · Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade; we do not erase the prior grade.
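One way to read the scale mechanically: a short sketch, with names assumed rather than taken from the register's code.

```python
from enum import Enum


class TrustGrade(Enum):
    """Four-point believability scale from this section."""
    A = "reproduced, dated, code"   # full path visible and re-run
    B = "partial reproduction"      # known weaknesses, caveat preserved
    C = "claim-only"                # authors say so; not reproduced
    F = "contested or retracted"    # stays visible, never cited as SOTA

    def clears_trust_bar(self) -> bool:
        # Only reproduced or partially reproduced rows count toward a
        # "verified SOTA" cell; C and F rows remain visible but inert.
        return self in (TrustGrade.A, TrustGrade.B)


def regrade(history: list[tuple[str, TrustGrade]], date: str,
            new_grade: TrustGrade) -> list[tuple[str, TrustGrade]]:
    """Regrading appends; the prior grade is preserved, never erased."""
    return history + [(date, new_grade)]
```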

§ 14 · Standing columns

Capability buckets, not benchmarks.

HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.

Image + Text → Video · Animate a still image guided by a text prompt.
Video → Video · Video editing, style transfer, super-resolution.
Image → 3D · Generate a 3D mesh or NeRF from one or more images.
Text → 3D · Generate a 3D asset from a text prompt.
Image → Video · Animate a still image into a short clip.
Unconditional Image Generation · Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 14 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 15
Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
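As a sketch of what "carries code" could mean mechanically (the field names and the 40-character SHA check are assumptions, not Codesota's actual submission format):

```python
from dataclasses import dataclass


@dataclass
class Submission:
    repo_url: str
    commit_sha: str   # a frozen commit, not a movable branch or tag
    environment: str  # e.g. a lockfile hash or container image digest
    seed: int         # the recorded seed for the evaluation run


def publishable(sub: Submission, run_succeeded: bool) -> bool:
    """If it does not run, the row does not publish."""
    frozen = bool(sub.repo_url) and len(sub.commit_sha) == 40
    return frozen and bool(sub.environment) and run_succeeded
```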

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
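The point of declaring direction is that every comparison collapses to one branch; a minimal sketch, assuming a boolean flag stored on the dataset record:

```python
def improves(new: float, old: float, higher_is_better: bool) -> bool:
    """Compare two scores under the dataset's declared metric direction."""
    return new > old if higher_is_better else new < old


# WER (word error rate) is lower-is-better; accuracy is higher-is-better.
assert improves(10.9, 11.2, higher_is_better=False)  # a better WER
assert improves(93.1, 92.9, higher_is_better=True)   # a better accuracy
```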

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
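And a sketch of the append-only history that makes "never silently forgets" literal (the record shape is an assumption):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedScore:
    date: str   # ISO 8601, e.g. "2026-04-22"
    model: str
    score: float


def record(history: list[DatedScore], entry: DatedScore) -> list[DatedScore]:
    """Append and keep date order; regressions stay visible because
    nothing is ever deleted or overwritten."""
    return sorted(history + [entry], key=lambda s: s.date)
```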