Codesota · Tasks · Vol. II
Capability-first task ontology
Issue: April 22, 2026
§ 00 · Index

Every AI capability,
mapped to benchmark evidence.

Tasks are capabilities. Benchmarks are evidence. Domains, modalities, and safety properties are filters. This page groups 62 task pages into nine stable capability areas, then shows the canonical benchmark, leading model, and trust grade for each row.

Reasoning, safety, robustness, multilingual coverage, and vertical domains are treated as cross-cutting overlays rather than competing top-level roots.
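To make that shape concrete, here is a minimal sketch of the register's data model in Python. The names (`CapabilityArea`, `Task`, `EvidenceRow`) and fields are illustrative assumptions, not the registry's actual schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class CapabilityArea(Enum):
    """The nine stable top-level roots (§ 03). A task belongs to exactly one."""
    LANGUAGE_KNOWLEDGE = "language-knowledge"
    VISION_DOCUMENTS = "vision-documents"
    AUDIO_SPEECH = "audio-speech"
    MULTIMODAL_MEDIA = "multimodal-media"
    CODE_SOFTWARE = "code-software-engineering"
    AGENTS_TOOL_USE = "agents-tool-use"
    STRUCTURED_DATA = "structured-data-forecasting"
    ROBOTICS_CONTROL_RL = "robotics-control-rl"
    SCIENCE_MEDICINE_INDUSTRY = "science-medicine-industry"


@dataclass
class Task:
    """A capability. Overlays (reasoning, safety, multilingual, domain)
    filter tasks; they never become additional roots."""
    slug: str
    area: CapabilityArea
    overlays: set[str] = field(default_factory=set)


@dataclass
class EvidenceRow:
    """Evidence for a task: one dated, graded score on one benchmark."""
    task_slug: str
    benchmark: str
    model: str
    metric: str
    score: float
    date: str         # ISO date; every score carries one (§ 15)
    trust_grade: str  # "A" | "B" | "C" | "F" (§ 13)


# "Reasoning" stays a filter over Language & Knowledge, not a tenth root:
commonsense = Task("commonsense-reasoning",
                   CapabilityArea.LANGUAGE_KNOWLEDGE, {"reasoning"})
```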

§ 01 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
9 · Capability areas · Stable top-level ontology
147 · Tasks catalogued · 62 with evidence rows
780 · Datasets indexed · Canonical scope now labelled
9,164 · Benchmark results · All dated, verified where possible
§ 02 · Product map

Three different questions.

Tasks are taxonomy. Leaderboards are evidence. Lineages are benchmark history.

Tasks · Start with the problem
Use this page when you know the capability you care about: OCR, code generation, ASR, retrieval, VQA, detection.
Browse taxonomy

Leaderboards · Then inspect the evidence
Use benchmark pages when you need result counts, source quality, trust badges, metric definitions, and current top rows.
Open leaderboards

Lineages · Check if the benchmark still matters
Use lineage pages when a benchmark looks saturated, outdated, contaminated, or replaced by a harder successor.
View evolution
§ 03 · Area map

Nine stable capability areas.

Domains, modalities, methods, and safety properties are filters. They do not compete with the top-level task ontology.

01 · Language & Knowledge · 16 tasks
02 · Vision & Documents · 11 tasks
03 · Audio & Speech · 6 tasks
04 · Multimodal Media · 3 tasks
05 · Code & Software Engineering · 7 tasks
06 · Agents & Tool Use · 9 tasks
07 · Structured Data & Forecasting · 5 tasks
08 · Robotics, Control & RL · 2 tasks
09 · Science, Medicine & Industry · 3 tasks
§ 04 · Capability area

Language & Knowledge.

Language understanding, retrieval, QA, RAG, factuality, and knowledge extraction. Reasoning appears here as a capability tag, not as a separate root.


Tasks: 16 · Verified SOTA: 10 · Results: 317
Language & Knowledge · 16 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Massive Multitask Language Understanding · legacy · legacy · ambiguous
MMLU is saturated and better treated as general knowledge / legacy LLM eval, not canonical commonsense reasoning.
o3 · 92.9% accuracy · 82 results
02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Mathematics Aptitude Test of Heuristics · Claude Opus 4.5 · 90.7% accuracy · 79 results
03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Graduate-Level Google-Proof Q&A Diamond · Gemini 2.5 Pro · 84.0% accuracy · 53 results
04 · Question Answering
Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,…
Natural Questions: a Benchmark for Question Answering Research · 26 results
05 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
CNN/DailyMail Summarization · BRIO · 47.8% rouge-1 · 15 results
06 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
LogiQA · GPT-4o · 56.3% accuracy · 12 results
07 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Stanford Natural Language Inference · GPT-4o · 92.6% accuracy · 8 results
08 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
BEIR legacy retrieval · legacy · legacy · ambiguous
Legacy retrieval snapshot. Split modern retrieval, reranking, multilingual, and long-context RAG evals before calling this current SOTA.
NV-Embed-v2 · 62.65 ndcg@10 · 8 results
09 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
CoNLL-2003 Named Entity Recognition · GLiNER-multitask · 93.8% f1 · 7 results
10 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Math Word Problem Repository · GPT-4o · 97.2% accuracy · 6 results
11 · Text Embeddings
Generating dense vector embeddings for retrieval, ranking, clustering, and semantic search.
Legacy MTEB English, 2024 snapshot · historical · legacy · ambiguous
NV-Embed-v2 is a historical MTEB English 56-task snapshot, not a fresh 2026 embedding frontier.
NV-Embed-v2 · 72.31 avg-score · 6 results
12 · Entity Linking
Linking mentions to knowledge base entities.
AIDA-CoNLL-YAGO (test-b) · GENRE · 93.30 micro_f1 · 3 results
13 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
FB15k-237 Knowledge Graph Completion · NBFNet · 0.415 mrr · 3 results
14 · Relation Extraction
Extracting relationships between entities from text.
TAC Relation Extraction Dataset · LUKE · 72.7% f1 · 3 results
15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
STS Benchmark · GTE-Qwen2-7B-instruct · 88.40 spearman · 3 results
16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
WikiTableQuestions · GPT-4 · 75.3% accuracy · 3 results
Fig 04 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 05 · Capability area

Vision & Documents.

Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.


Tasks: 11 · Verified SOTA: 6 · Results: 2,039
Vision & Documents · 11 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Document OCR
Reading text, structure, and layout from document images.
OCRBench v2 public overall · submetric · aging · ambiguous
Scope is public overall. Do not compare directly with English-private OCRBench v2 or full document parsing metrics.
Qwen2.5-VL-72B · 63.70 overall · 829 results
02 · Scene Text Detection
Detecting text regions in natural scene images.
COCO-Text detection scope needs review · misclassified · stale · misclassified
CLIP4STR-style scene text recognition rows do not belong under detection. Detection needs region metrics such as precision, recall, F-measure, or hmean.
581 results
03 · Document Layout Analysis
Analyzing the layout structure of documents.
D4LA · DoPTA · 70.7% map · 133 results
04 · Scene Text Recognition
Recognizing text in natural scene images.
CUTE80 · CPPD · 99.7% accuracy · 127 results
05 · Document Parsing
Parsing document structure and content.
OmniDocBench v1.5 · submetric · aging · ambiguous
Reading order is only one OmniDocBench facet. Summary SOTA needs text, layout, table TEDS, reading order, and end-to-end structure facets.
Mistral OCR 3 · 91.63 reading-order · 117 results
06 · Table Recognition
Detecting and parsing tables in documents.
ICDAR2013 table structure (legacy) · legacy · legacy · ambiguous
ICDAR2013 is too narrow for 2026 table recognition. Promote PubTables-1M, PubTabNet, FinTabNet, or table-specific document parsing metrics.
Proposed System (with post-processing) · 95.46 f-measure · 71 results
07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
OCRBench v2 · needs coverage · stale · ambiguous
Fold this into OCR unless the metric scope is explicit: public overall, English-private, recognition, understanding, or full parsing.
66 results
08 · Document Image Classification
Classifying documents by type or category.
aip · ResNet-RS (ResNet-200 + RS training tricks) · 83.40 top-1-accuracy · 62 results
09 · Handwriting Recognition
Recognizing handwritten text.
40 results
10 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
Form Understanding in Noisy Scanned Documents · 7 results
11 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
ADE20K Scene Parsing Benchmark · InternImage-H · 62.9% mIoU · 6 results
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 06 · Capability area

Audio & Speech.

ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.


Tasks: 6 · Verified SOTA: 1 · Results: 35
Audio & Speech · 6 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
Mozilla Common Voice · Whisper Large-v2 · 11.20 wer · 20 results
02 · Audio Captioning
Generating text descriptions of audio content.
AudioCaps · historical · stale · ambiguous
Baseline-style AudioCaps rows should not read as current leading audio-language SOTA without a refresh.
AudioCaps baseline (TopDown+Align) · 36.9% spider · 3 results
03 · Music Generation
Generating music from text, audio, or other inputs.
MusicCaps · historical · stale · ambiguous
MusicLM is historically important, but this needs MusicCaps/MusicBench, human eval, and proprietary/open splits.
MusicLM · 4.000 fad · 3 results
04 · Sound Event Detection
Detecting and localizing sound events in audio.
Domestic Environment Sound Event Detection (DCASE Task 4) · ATST-SED · 58.10 event-f1 · 3 results
05 · Speaker Verification
Verifying speaker identity from voice samples.
VoxCeleb1 Original Test Set (VoxCeleb1-O) · ResNet-34 (AM-Softmax, VoxCeleb2) · 1.180 eer · 3 results
06 · Speech Translation
Translating spoken audio directly to another language.
MuST-C English-German tst-COMMON · SeamlessM4T v2 Large · 37.1% bleu · 3 results
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 07 · Capability area

Multimodal Media.

Cross-modal tasks only: VQA, image-text retrieval, video QA, document VQA, text-to-image, image editing, and any-to-any media models.


Tasks: 3 · Verified SOTA: 2 · Results: 49
Multimodal Media · 3 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
Visual Question Answering v2.0 · Qwen2-VL 72B · 87.6% accuracy · 47 results
02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
COCO Captions · legacy · legacy · ambiguous
COCO captioning is legacy and saturated. Add NoCaps, Flickr30k, caption QA, or preference-based caption evals.
BLIP-2 · 145.8 CIDEr · 2 results
03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
DPG-Bench · 0 results
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 08 · Capability area

Code & Software Engineering.

Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.


Tasks: 7 · Verified SOTA: 6 · Results: 263
Code & Software Engineering · 7 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
LiveCodeBench · Gemini 3 Pro Preview · 91.7% pass@1 · 196 results
02 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Callstack Incubator React Native Evaluation Suite · Composer 2 · 98.90 navigation-satisfaction · 40 results
03 · Code Translation
Converting code between programming languages.
TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Claude Sonnet 4 · 89.40 computational-accuracy · 7 results
04 · Bug Detection
Identifying bugs and vulnerabilities in code.
Bugs2Fix: Learning to Rewrite Buggy Code · GPT-4o · 78.6% accuracy · 6 results
05 · Code Completion
Predicting the next tokens in code sequences.
Cross-File Code Completion Evaluation · Claude Sonnet 4 · 44.50 exact-match · 6 results
06 · Program Repair
Automatically fixing bugs in code.
Defects4J: A Database of Real Faults in Java Programs · SRepair · 101.0 correct-patches · 5 results
07 · Code Summarization
Generating natural language descriptions of code.
CodeXGLUE Code-to-Text Python subset · CodeT5-base · 20.0% bleu · 3 results
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 09 · Capability area

Agents & Tool Use.

Tool calling, web and desktop agents, browser automation, long-horizon autonomy, multi-agent coordination, and agent safety.


Tasks: 9 · Verified SOTA: 5 · Results: 184
Agents & Tool Use · 9 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
SWE-bench Verified — Agentic Leaderboard · Claude Mythos Preview · 93.90 resolve-rate · 81 results
02 · Task agents
AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks…
35 results
03 · Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal…
SWE-bench Verified (Agentic) · Claude Opus 4.5 · 80.90 pct_resolved · 23 results
04 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · CoAct-1 · 60.76 success-rate · 19 results
05 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
8 results
06 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Human-Calibrated Autonomy Software Tasks · Claude Opus 4 · 55.00 success-rate · 6 results
07 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Research Engineering Benchmark · o3 · 0.380 normalized-score · 5 results
08 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
METR Autonomy Evaluation: Time Horizon · Claude Opus 4 · 60.00 task-horizon-minutes · 5 results
09 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
2 results
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 10 · Capability area

Structured Data & Forecasting.

Tables, tabular classification and regression, time-series forecasting, anomaly detection, recommender systems, graph learning, and optimization.


Tasks: 5 · Verified SOTA: 3 · Results: 19
Structured Data & Forecasting · 5 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
Cora Citation Network · ACNet · 83.5% accuracy · 6 results
02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
OpenML-CC18 · AutoGluon-Tabular · 88.5% accuracy · 5 results
03 · Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta…
Open Graph Benchmark - ogbl-collab · PROXI · 70.98 hits_at_50 · 3 results
04 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Open Graph Benchmark - ogbg-molhiv · DGN · 79.70 roc_auc · 3 results
05 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
California Housing · XGBoost · 0.453 rmse · 2 results
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 11 · Capability area

Robotics, Control & RL.

Game playing, continuous control, manipulation, navigation, embodied instruction following, VLA models, drones, and autonomous driving.


Tasks: 2 · Verified SOTA: 0 · Results: 21
Robotics, Control & RL · 2 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
Arcade Learning Environment (Atari 2600) · legacy · legacy
Classic RL benchmark. Keep separate from modern embodied, VLA, robotics manipulation, and navigation tasks.
Go-Explore · 40000.0 human-normalized-score · 12 results
02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Multi-Joint dynamics with Contact · submetric · aging · ambiguous
MuJoCo control is a narrow simulation slice. Split from robotics manipulation, navigation, and VLA evaluations.
TD-MPC2 (317M params) · 960.0 average-return · 9 results
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 12 · Capability area

Science, Medicine & Industry.

A domain layer for medical imaging, clinical text, drug discovery, protein modeling, industrial inspection, remote sensing, climate, legal, finance, and compliance AI.


Tasks: 3 · Verified SOTA: 3 · Results: 110
Science, Medicine & Industry · 3 tasks · sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Disease Classification
Diagnosing diseases from medical images or data.
Autism Brain Imaging Data Exchange I · claim-only · stale · ambiguous
High ABIDE accuracy claims are leakage-risk until subject-level split, site-held-out validation, preprocessing, confound control, and external validation are verified.
SSAE + Softmax (Explainable ASD) · 98.2% accuracy · 57 results
02 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
MVTec Anomaly Detection Dataset · submetric · aging · ambiguous
MVTec AD rows must split image-level classification, pixel-level localization, zero/few/full-shot, and AUROC/AUPRO metric scopes.
AnomalyGPT · 97.40 auroc · 27 results
03 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Automated Cardiac Diagnosis Challenge · MedNeXt-L · 92.65 mean-dsc · 26 results
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 13
Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A · Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.

B · Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.

C · Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.

F · Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade; we do not erase the prior grade.
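One way to read the scale mechanically: a short sketch, with names assumed rather than taken from the register's code.

```python
from enum import Enum


class TrustGrade(Enum):
    """Four-point believability scale from this section."""
    A = "reproduced, dated, code"   # full path visible and re-run
    B = "partial reproduction"      # known weaknesses, caveat preserved
    C = "claim-only"                # authors say so; not reproduced
    F = "contested or retracted"    # stays visible, never cited as SOTA

    def clears_trust_bar(self) -> bool:
        # Only reproduced or partially reproduced rows count toward a
        # "verified SOTA" cell; C and F rows remain visible but inert.
        return self in (TrustGrade.A, TrustGrade.B)


def regrade(history: list[tuple[str, TrustGrade]], date: str,
            new_grade: TrustGrade) -> list[tuple[str, TrustGrade]]:
    """Regrading appends; the prior grade is preserved, never erased."""
    return history + [(date, new_grade)]
```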

§ 14 · Standing columns

Capability buckets, not benchmarks.

HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.

Image + Text → Video · Animate a still image guided by a text prompt.
Video → Video · Video editing, style transfer, super-resolution.
Image → 3D · Generate a 3D mesh or NeRF from one or more images.
Text → 3D · Generate a 3D asset from a text prompt.
Image → Video · Animate a still image into a short clip.
Unconditional Image Generation · Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 14 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 15
Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
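As a sketch of what "carries code" could mean mechanically (the field names and the 40-character SHA check are assumptions, not Codesota's actual submission format):

```python
from dataclasses import dataclass


@dataclass
class Submission:
    repo_url: str
    commit_sha: str   # a frozen commit, not a movable branch or tag
    environment: str  # e.g. a lockfile hash or container image digest
    seed: int         # the recorded seed for the evaluation run


def publishable(sub: Submission, run_succeeded: bool) -> bool:
    """If it does not run, the row does not publish."""
    frozen = bool(sub.repo_url) and len(sub.commit_sha) == 40
    return frozen and bool(sub.environment) and run_succeeded
```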

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
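The point of declaring direction is that every comparison collapses to one branch; a minimal sketch, assuming a boolean flag stored on the dataset record:

```python
def improves(new: float, old: float, higher_is_better: bool) -> bool:
    """Compare two scores under the dataset's declared metric direction."""
    return new > old if higher_is_better else new < old


# WER (word error rate) is lower-is-better; accuracy is higher-is-better.
assert improves(10.9, 11.2, higher_is_better=False)  # a better WER
assert improves(93.1, 92.9, higher_is_better=True)   # a better accuracy
```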

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
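And a sketch of the append-only history that makes "never silently forgets" literal (the record shape is an assumption):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatedScore:
    date: str   # ISO 8601, e.g. "2026-04-22"
    model: str
    score: float


def record(history: list[DatedScore], entry: DatedScore) -> list[DatedScore]:
    """Append and keep date order; regressions stay visible because
    nothing is ever deleted or overwritten."""
    return sorted(history + [entry], key=lambda s: s.date)
```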