Document OCR
Reading text, structure, and layout from document images.
- Benchmark: OCRBench v2
- Leader: Qwen2.5-VL-72B · 63.70 overall
- Trust: Unknown
Find the benchmark evidence behind any AI capability.
Start with a capability — OCR, coding agents, ASR, RAG, VQA, forecasting — then inspect the benchmark, metric, leading model, date, and trust grade.
SOTA is not a model. It is a claim about a model on a benchmark. This view currently surfaces 62 task pages across nine capability areas.
What are you trying to evaluate?
- Reading text, structure, and layout from document images.
- Analyzing the layout structure of documents.
- Parsing document structure and content.
- Detecting and parsing tables in documents.
- Recognizing handwritten text.
- Document QA and extraction beyond plain OCR; inspect layout and parsing evidence separately.
Start with the capability. Then inspect the evidence.
A task describes what the system must do: OCR, code generation, ASR, retrieval, VQA, detection. Find benchmark →
A benchmark supplies evidence: metric definition, result counts, source quality, trust grade, and current top rows. Open leaderboards →
A lineage tells you whether a benchmark is canonical, saturated, stale, contested, or replaced by a harder successor. View evolution →
Choose the area closest to the decision you are making. Domains, modalities, methods, and safety properties are overlays — not separate root taxonomies.
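A minimal sketch of that navigation model, with illustrative names rather than Codesota's actual schema: a task, the benchmarks that evidence it, and a lineage status per benchmark.

```python
from dataclasses import dataclass, field
from enum import Enum

class LineageStatus(Enum):
    CANONICAL = "canonical"   # the accepted first benchmark for the task
    SATURATED = "saturated"   # top scores have plateaued near the ceiling
    STALE = "stale"           # no credible new results being registered
    CONTESTED = "contested"   # validity or scoring is disputed
    REPLACED = "replaced"     # superseded by a harder successor

@dataclass
class Benchmark:
    name: str                 # e.g. "OCRBench v2" (example from this page)
    metric: str               # e.g. "overall", "accuracy", "wer"
    higher_is_better: bool    # metric direction, declared on the dataset
    status: LineageStatus

@dataclass
class Task:
    name: str                 # what the system must do, e.g. "Document OCR"
    benchmarks: list[Benchmark] = field(default_factory=list)
```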
Retrieval, QA, reasoning, factuality, text classification, and knowledge-heavy language tasks.
Use when: evaluating answers grounded in text, documents, memory, retrieval, or general language knowledge.
Common trap: chat preference is not retrieval quality; use task-specific evidence for RAG, QA, and factuality.
OCR, layout, tables, parsing, detection, segmentation, and visual extraction.
Use when: evaluating document AI or visual extraction systems.
Common trap: OCR is not document understanding; layout, tables, handwriting, and VQA need separate evidence.
ASR, TTS, speaker intelligence, music, sound events, and audio-language tasks.
Use when: choosing speech, voice, transcription, or audio understanding systems.
Common trap: low WER does not guarantee diarization, latency, speaker identity, or noisy-call performance.
VQA, image-text retrieval, video QA, document VQA, image generation, and editing.
Use when: the input or output crosses text, image, video, audio, or document boundaries.
Common trap: one multimodal score can hide failures in OCR-heavy, spatial, chart, or video tasks.
Code generation, repair, repository work, tests, security, and UI/mobile code.
Use when: evaluating software output, coding assistants, or repository-level model behavior.
Common trap: HumanEval-style synthesis is not the same as issue resolution or production engineering.
Tool calling, web agents, OS tasks, long-horizon autonomy, and workflow execution.
Use when: the model must plan, use tools, recover from errors, or act over multiple steps.
Common trap: agent benchmark wins may depend on scaffolding, tools, browser setup, and budget, not only the base model.
Tables, tabular prediction, time series, anomalies, recommenders, graphs, and optimization.
Use when: choosing evidence for numerical, temporal, graph, or business-data decisions.
Common trap: forecasting and tabular claims are sensitive to split design, leakage, and horizon definition.
Game play, continuous control, navigation, manipulation, VLA models, drones, and driving.
Use when: evaluating embodied systems, control policies, or simulation-to-real claims.
Common trap: simulator scores rarely transfer cleanly to hardware without environment and safety evidence.
Medical, clinical, scientific, industrial, legal, finance, climate, and compliance AI.
Use when: the decision depends on a regulated domain, specialized data, or high-cost errors.
Common trap: general benchmarks can miss domain drift, licensing, annotation quality, and clinical workflow constraints.
Language understanding, retrieval, QA, RAG, factuality, and knowledge extraction. Reasoning appears here as a capability tag, not as a separate root.
Current registered leader means the best result currently registered for this benchmark and metric. It does not mean the universally best model.
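A minimal sketch of that rule, assuming only a list of registered results and the benchmark's declared metric direction; the field names are illustrative, not Codesota's API.

```python
from dataclasses import dataclass

@dataclass
class Result:
    model: str
    score: float

def registered_leader(results: list[Result], higher_is_better: bool) -> Result | None:
    """Best *registered* result for one benchmark and one metric."""
    if not results:
        return None  # rendered as "Not enough registered evidence"
    if higher_is_better:
        return max(results, key=lambda r: r.score)
    return min(results, key=lambda r: r.score)

# Accuracy (higher is better) vs. WER (lower is better):
print(registered_leader([Result("a", 92.9), Result("b", 90.7)], True).model)   # "a"
print(registered_leader([Result("x", 11.2), Result("y", 13.5)], False).model)  # "x"
```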
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Commonsense Reasoning Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social… | Massive Multitask Language Understanding | Registered leader o3 92.9% · accuracy | B | Fragmented* | 82 | |
| Mathematical Reasoning Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco… | Mathematics Aptitude Test of Heuristics | Registered leader Claude Opus 4.5 90.7% · accuracy | — | Canonical* | 79 | |
| Multi-step Reasoning Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili… | Graduate-Level Google-Proof Q&A Diamond | Registered leader Gemini 2.5 Pro 84.0% · accuracy | B | Canonical* | 53 | |
| Question Answering Question answering now spans extractive reading comprehension, open-domain retrieval QA, multi-hop reasoning,… question-answering | Natural Questions: a Benchmark for Question Answering Research | Not enough registered evidence | B | Stale | 26 | |
| Text Summarization Text summarization compresses documents while preserving key information — a task that became dramatically mor… summarization | CNN/DailyMail Summarization | Registered leader BRIO 47.8% · rouge-1 | — | Canonical* | 15 | |
| Logical Reasoning Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak… | LogiQA | Registered leader GPT-4o 56.3% · accuracy | — | Canonical* | 12 | |
| Natural Language Inference Determining entailment relationships between sentences (SNLI, MNLI). | Stanford Natural Language Inference | Registered leader GPT-4o 92.6% · accuracy | — | Canonical* | 8 | |
| Text Ranking Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C… text-ranking | BEIR | Registered leader NV-Embed-v2 62.65 · ndcg@10 | — | Canonical* | 8 | |
| Named Entity Recognition Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u… token-classification | CoNLL-2003 Named Entity Recognition | Registered leader GLiNER-multitask 93.8% · f1 | — | Canonical* | 7 | |
| Arithmetic Reasoning Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca… | Math Word Problem Repository | Registered leader GPT-4o 97.2% · accuracy | — | Canonical* | 6 | |
| Text Embeddings Generating dense vector embeddings for retrieval, ranking, clustering, and semantic search. feature-extraction Why this benchmark?: MTEB-style evidence is the practical first stop for embedding and retrieval model choices. What it measures: Retrieval, ranking, classification, clustering, and semantic similarity tasks. What it misses: Your corpus, latency budget, reranker pairing, and domain-specific relevance labels. | MTEB Leaderboard | Registered leader NV-Embed-v2 72.31 · avg-score | — | Canonical | 6 | |
| Entity Linking Linking mentions to knowledge base entities. | AIDA-CoNLL-YAGO (test-b) | Registered leader GENRE 93.30 · micro_f1 | — | Sparse* | 3 | |
| Knowledge Graph Completion Predicting missing links in knowledge graphs. | FB15k-237 Knowledge Graph Completion | Registered leader NBFNet 0.415 · mrr | — | Sparse* | 3 | |
| Relation Extraction Extracting relationships between entities from text. | TAC Relation Extraction Dataset | Registered leader LUKE 72.7% · f1 | — | Sparse* | 3 | |
| Semantic Textual Similarity Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti… sentence-similarity | STS Benchmark | Registered leader GTE-Qwen2-7B-instruct 88.40 · spearman | — | Sparse* | 3 | |
| Table Question Answering Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s… table-question-answering | WikiTableQuestions | Registered leader GPT-4 75.3% · accuracy | — | Sparse* | 3 |
Images, video frames, OCR, layout, tables, document parsing, detection, segmentation, and visual anomaly detection.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Document OCR Reading text, structure, and layout from document images. Why this benchmark?: OCRBench v2 is a broad first stop for document OCR and visual text extraction comparisons. What it measures: Text recognition, document understanding subtasks, and multi-scenario OCR capability. What it misses: Invoice extraction, handwriting-heavy archives, table fidelity, and local-language production OCR. When not to use: Do not use it as the only evidence for production document automation. | OCRBench v2 | Registered leader Qwen2.5-VL-72B 63.70 · overall | — | Canonical | 829 | |
| Scene Text Detection Detecting text regions in natural scene images | coco-text | Registered leader CLIP4STR-L 81.90 · 1-1-accuracy | A | Canonical* | 581 | |
| Document Layout Analysis Analyzing the layout structure of documents | d4la | Registered leader DoPTA 70.7% · map | — | Canonical* | 133 | |
| Scene Text Recognition Recognizing text in natural scene images | cute80 | Registered leader CPPD 99.7% · accuracy | — | Canonical* | 127 | |
| Document Parsing Parsing document structure and content. Why this benchmark?: OmniDocBench is a stronger starting point when the output must preserve reading order, layout, tables, and formulas. What it measures: End-to-end document parsing quality across text, tables, formulas, and layout-sensitive output. What it misses: Vendor-specific workflow behavior, low-resource languages, and private document templates. | OmniDocBench v1.5 | Registered leader Mistral OCR 3 91.63 · reading-order | — | Emerging | 117 | |
| Table Recognition Detecting and parsing tables in documents | icdar2013-table-structure-recognition | Registered leader Proposed System (with post-processing) 95.46 · f-measure | — | Fragmented | 71 | |
| General OCR Capabilities Comprehensive benchmarks covering multiple aspects of OCR performance. | OCRBench v2 | Registered leader mistral-ocr-2512 25.20 · overall-en-private | — | Canonical* | 66 | |
| Document Image Classification Classifying documents by type or category | aip | Registered leader ResNet-RS (ResNet-200 + RS training tricks) 83.40 · top-1-accuracy-verb | — | Canonical* | 62 | |
| Handwriting Recognition Recognizing handwritten text | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 40 | |
| Document Understanding Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —… document-question-answering | Form Understanding in Noisy Scanned Documents | Not enough registered evidence | — | Canonical* | 7 | |
| Semantic Segmentation Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton… image-segmentation | ADE20K Scene Parsing Benchmark | Registered leader InternImage-H 62.9% · mIoU | — | Canonical* | 6 |
ASR, TTS, speaker intelligence, music, sound events, audio-language understanding, and audio safety.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Speech Recognition Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a… automatic-speech-recognition Why this benchmark?: ASR leaderboards remain useful when the split, language, domain, and WER normalization are explicit. What it measures: Transcription error rate on fixed audio corpora. What it misses: Diarization, streaming latency, noisy calls, code-switching, and downstream extraction quality. | Mozilla Common Voice | Registered leader Whisper Large-v2 11.20 · wer | — | Canonical | 22 | |
| Audio Captioning Generating text descriptions of audio content. | AudioCaps | Registered leader AudioCaps baseline (TopDown+Align) 36.9% · spider | — | Sparse* | 3 | |
| Music Generation Generating music from text, audio, or other inputs. | MusicCaps | Registered leader MusicLM 4.000 · fad | — | Sparse* | 3 | |
| Sound Event Detection Detecting and localizing sound events in audio. | Domestic Environment Sound Event Detection (DCASE Task 4) | Registered leader ATST-SED 58.10 · event-f1 | — | Sparse* | 3 | |
| Speaker Verification Verifying speaker identity from voice samples. | VoxCeleb1 Original Test Set (VoxCeleb1-O) | Registered leader ResNet-34 (AM-Softmax, VoxCeleb2) 1.180 · eer | — | Sparse* | 3 | |
| Speech Translation Translating spoken audio directly to another language. | MuST-C English-German tst-COMMON | Registered leader SeamlessM4T v2 Large 37.1% · bleu | — | Sparse* | 3 |
Cross-modal tasks only: VQA, image-text retrieval, video QA, document VQA, text-to-image, image editing, and any-to-any media models.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Visual Question Answering Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu… visual-question-answering Why this benchmark?: VQA is best treated as a family: chart, document, spatial, and general image QA can disagree. What it measures: Question answering over visual inputs on fixed task distributions. What it misses: OCR-heavy documents, long-context image sets, grounding, and tool-assisted visual reasoning. | Visual Question Answering v2.0 | Registered leader Qwen2-VL 72B 87.6% · accuracy | — | Fragmented | 47 | |
| Image Captioning Image captioning — generating natural language descriptions of images — was the task that launched the modern… image-to-text | COCO Captions | Registered leader BLIP-2 145.8 · CIDEr | A | Sparse* | 2 | |
| Text-to-Image Generation Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)… text-to-image | DPG-Bench | Not enough registered evidence | — | Contested | 0 |
Code generation, completion, repair, repository understanding, tests, vulnerability work, UI code, and mobile app code generation.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Code Generation Generating code from natural language descriptions (HumanEval, MBPP). Why this benchmark?: LiveCodeBench is harder to memorize than older static coding sets and tracks recent coding ability. What it measures: Competitive programming style code generation on dated, rolling problems. What it misses: Repository repair, tool use, test writing, and long-horizon software engineering work. When not to use: Use SWE-bench or repository benchmarks for agentic coding work. | LiveCodeBench | Registered leader DeepSeek-R1-0528 73.3% · pass@1 | B | Canonical | 196 | |
| React Native Code Generation Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,… | Callstack Incubator React Native Evaluation Suite | Registered leader Composer 2 98.90 · navigation-satisfaction | — | Canonical* | 40 | |
| Code Translation Converting code between programming languages. | TransCoder Evaluation on GeeksForGeeks Algorithmic Problems | Registered leader Claude Sonnet 4 89.40 · computational-accuracy | — | Canonical* | 7 | |
| Bug Detection Identifying bugs and vulnerabilities in code. | Bugs2Fix: Learning to Rewrite Buggy Code | Registered leader GPT-4o 78.6% · accuracy | — | Canonical* | 6 | |
| Code Completion Predicting the next tokens in code sequences. | Cross-File Code Completion Evaluation | Registered leader Claude Sonnet 4 44.50 · exact-match | — | Canonical* | 6 | |
| Program Repair Automatically fixing bugs in code. | Defects4J: A Database of Real Faults in Java Programs | Registered leader SRepair 101.0 · correct-patches | — | Canonical* | 5 | |
| Code Summarization Generating natural language descriptions of code. | CodeXGLUE Code-to-Text Python subset | Registered leader CodeT5-base 20.0% · bleu | — | Sparse* | 3 |
Tool calling, web and desktop agents, browser automation, long-horizon autonomy, multi-agent coordination, and agent safety.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| SWE-bench SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for… Why this benchmark?: SWE-bench Verified is the usual first benchmark for autonomous agents repairing real GitHub issues. What it measures: Patch generation that resolves repository issues against validation tests. What it misses: Product judgment, multi-day maintenance work, security review, and UI-heavy tasks. | SWE-bench Verified — Agentic Leaderboard | Registered leader Claude Mythos Preview 93.90 · resolve-rate | B | Canonical | 81 | |
| Task agents AI agents are autonomous software systems that use artificial intelligence to achieve goals and complete tasks… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 35 | |
| Autonomous Coding Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal… | SWE-bench Verified (Agentic) | Registered leader Claude Opus 4.5 80.90 · pct_resolved | B | Canonical* | 23 | |
| Web & Desktop Agents Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by… | OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | Registered leader CoAct-1 60.76 · success-rate | — | Canonical* | 19 | |
| Tool Use Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 8 | |
| HCAST HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton… | Human-Calibrated Autonomy Software Tasks | Registered leader Claude Opus 4 55.00 · success-rate | — | Canonical* | 6 | |
| RE-Bench RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin… | Research Engineering Benchmark | Registered leader o3 0.380 · normalized-score | — | Canonical* | 5 | |
| Time Horizon Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the… | METR Autonomy Evaluation: Time Horizon | Registered leader Claude Opus 4 60.00 · task-horizon-minutes | — | Canonical* | 5 | |
| Bioinformatics Agents LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre… | No canonical benchmark registered | Not enough registered evidence | — | Missing* | 2 |
Tables, tabular classification and regression, time-series forecasting, anomaly detection, recommender systems, graph learning, and optimization.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Node Classification Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct… graph-ml | Cora Citation Network | Registered leader ACNet 83.5% · accuracy | — | Canonical* | 6 | |
| Tabular Classification Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain… tabular-classification | OpenML-CC18 | Registered leader AutoGluon-Tabular 88.5% · accuracy | — | Canonical* | 5 | |
| Link Prediction Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta… | Open Graph Benchmark - ogbl-collab | Registered leader PROXI 70.98 · hits_at_50 | — | Sparse* | 3 | |
| Molecular Property Prediction Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo… | Open Graph Benchmark - ogbg-molhiv | Registered leader DGN 79.70 · roc_auc | — | Sparse* | 3 | |
| Tabular Regression Tabular regression — predicting continuous values from structured data — powers everything from house-price es… tabular-regression | California Housing | Registered leader XGBoost 0.453 · rmse | — | Sparse* | 2 |
Game playing, continuous control, manipulation, navigation, embodied instruction following, VLA models, drones, and autonomous driving.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Atari Games Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix… reinforcement-learning | Arcade Learning Environment (Atari 2600) | Registered leader Go-Explore 40000.0 · human-normalized-score | — | Canonical* | 12 | |
| Continuous Control Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O… | Multi-Joint dynamics with Contact | Registered leader TD-MPC2 (317M params) 960.0 · average-return | — | Canonical* | 9 |
A domain layer for medical imaging, clinical text, drug discovery, protein modeling, industrial inspection, remote sensing, climate, legal, finance, and compliance AI.
| Task | Best first benchmark | Current registered leader | Trust | Status | Results | Actions |
|---|---|---|---|---|---|---|
| Disease Classification Diagnosing diseases from medical images or data. | Autism Brain Imaging Data Exchange I | Registered leader SSAE + Softmax (Explainable ASD) 98.2% · accuracy | — | Canonical* | 57 | |
| Anomaly Detection Detecting defects and anomalies in manufacturing (MVTec AD, VisA). | MVTec Anomaly Detection Dataset | Registered leader AnomalyGPT 97.40 · auroc | — | Canonical* | 27 | |
| Medical Image Segmentation Segmenting organs and abnormalities in medical images. | Automated Cardiac Diagnosis Challenge | Registered leader MedNeXt-L 92.65 · mean-dsc | — | Canonical* | 26 |
Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.
A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade; we don't erase the prior grade.
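A hedged sketch of that ledger, assuming the four points are the letters A through D (the tables above show only A and B) and that a regrade appends to the history instead of overwriting it:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Regrade:
    grade: str    # assumed scale: "A", "B", "C", or "D"
    on: date
    reason: str   # e.g. "test set found in a public training corpus"

@dataclass
class TrustRecord:
    dataset: str
    history: list[Regrade] = field(default_factory=list)  # append-only

    def regrade(self, grade: str, on: date, reason: str) -> None:
        # Publish the regrade; the prior entries stay on record.
        self.history.append(Regrade(grade, on, reason))

    @property
    def current(self) -> str | None:
        # No history yet is rendered as "—" / Unknown in the tables above.
        return self.history[-1].grade if self.history else None
```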
HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.
True omni: accepts any modality in (text + image + audio + video) AND generates multiple modalities out (including speech, not just text). The narrowest open-weights category — Qwen3-Omni · Vita · Mini-Omni. Proprietary: GPT-4o · Gemini 3 · Sesame CSM.
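Read as a predicate, one plausible interpretation of that definition; the modality sets come from the sentence above, the rest is illustrative:

```python
MODALITIES = {"text", "image", "audio", "video"}

def is_true_omni(inputs: set[str], outputs: set[str]) -> bool:
    accepts_all_in = MODALITIES <= inputs    # any of text/image/audio/video in
    multiple_out = len(outputs) >= 2         # generates multiple modalities out
    includes_speech = "speech" in outputs    # including speech, not just text
    return accepts_all_in and multiple_out and includes_speech

# A text-in, text+speech-out model fails the input test:
print(is_true_omni({"text"}, {"text", "speech"}))  # False
```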
Vision-Language Models that read images and produce text answers.
Image editing and inpainting conditioned on text prompts.
Animate a still image guided by a text prompt.
Multimodal LLMs that listen and respond in text.
Speech translation, voice conversion, audio enhancement.
Video editing, style transfer, super-resolution.
Generate a 3D mesh or NeRF from one or more images.
Generate a 3D asset from a text prompt.
Music, sound effects, environmental audio from text.
Animate a still image into a short clip.
Generative image models without text conditioning (DCGAN, StyleGAN era).
Most leaderboards are a ledger of claims. Authors submit a number, a banner appears, and the number stands until the next banner appears. Codesota is different in three ordinary ways.
First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
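Taken together, the three properties fit in a single submission record. A hedged sketch with illustrative field names; only the requirements themselves come from the paragraphs above:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Submission:
    model: str
    commit_sha: str         # a frozen commit, not just a repo link
    environment: str        # declared environment, e.g. a lockfile digest
    seed: int               # recorded seed
    metric: str
    higher_is_better: bool  # metric direction, declared on the dataset
    score: float
    run_date: date          # every score carries a date

LEDGER: list[Submission] = []  # append-only: regressions stay on record

def publish(submission: Submission, ran_successfully: bool) -> None:
    if not ran_successfully:
        return  # if it does not run, the row does not publish
    LEDGER.append(submission)
```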