Codesota · Tasks · Vol. II
The index of every machine-learning task
Issue: April 22, 2026
§ 00 · Index

Every machine-learning task,
indexed.

A register of the 75 tasks our editors track, grouped by area and sorted by result count. Each row names the canonical benchmark, the leading model, and a trust grade that tells you how much to believe the number.

Shaded rows mark independently verified state of the art. Dates and scores are in tabular mono; descriptions in serif; navigation in sans.

§ 01 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
18 · Research areas · Grouping the index top-down
121 · Tasks catalogued · 75 with published SOTA
371 · Datasets indexed · Canonical benchmark per task marked
9,102 · Benchmark results · All dated · verified where possible
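
The headline figures are plain aggregates. A minimal sketch of how a 10-minute refresh job could pull them from a Postgres registry, assuming an illustrative schema (tables areas, tasks, datasets, results, and a nullable tasks.sota_result_id); the real Codesota layout is not published:

# Hypothetical sketch only; table and column names are assumptions.
import psycopg2

COUNTS_QUERY = """
SELECT
  (SELECT count(*) FROM areas)    AS research_areas,
  (SELECT count(*) FROM tasks)    AS tasks_catalogued,
  (SELECT count(*) FROM tasks
     WHERE sota_result_id IS NOT NULL) AS tasks_with_published_sota,
  (SELECT count(*) FROM datasets) AS datasets_indexed,
  (SELECT count(*) FROM results)  AS benchmark_results;
"""

def fetch_counts(dsn: str) -> dict:
    # One round trip per refresh; cheap enough to run every 10 minutes.
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(COUNTS_QUERY)
        columns = [col[0] for col in cur.description]
        return dict(zip(columns, cur.fetchone()))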
§ 02 · Product map

Three different questions.

Tasks are taxonomy. Leaderboards are evidence. Lineages are benchmark history.

Tasks · Start with the problem
Use this page when you know the capability you care about: OCR, code generation, ASR, retrieval, VQA, detection.
Browse taxonomy

Leaderboards · Then inspect the evidence
Use benchmark pages when you need result counts, source quality, trust badges, metric definitions, and current top rows.
Open leaderboards

Lineages · Check if the benchmark still matters
Use lineage pages when a benchmark looks saturated, outdated, contaminated, or replaced by a harder successor.
View evolution
§ 03 · Area

Multimodal.

Models that read, see, hear — and sometimes do all three at once. The most crowded frontier; also the least standardised.


Tasks: 3 · Verified SOTA: 2 · Results: 49

Multimodal · 3 tasks
Sorted by result count, then name

01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
Benchmark: Visual Question Answering v2.0 · Model: Qwen2-VL 72B · Score: 87.6% (accuracy) · Results: 47

02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
Benchmark: COCO Captions · Model: BLIP-2 · Score: 145.8% (CIDEr) · Results: 2

03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
Benchmark: DPG-Bench · Results: 0
Fig 03 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
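
Every area table in this register is ordered the same way: result count descending, task name ascending on ties. A minimal sketch of the rule, with an assumed Task shape rather than the registry's actual record:

# Illustrative only; the Task shape is an assumption.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    results: int  # benchmark results recorded in the register

def register_order(tasks: list[Task]) -> list[Task]:
    # Result count descending; name ascending breaks ties.
    return sorted(tasks, key=lambda t: (-t.results, t.name))

rows = register_order([
    Task("Text-to-Image Generation", 0),
    Task("Visual Question Answering", 47),
    Task("Image Captioning", 2),
])
assert [t.name for t in rows] == [
    "Visual Question Answering",
    "Image Captioning",
    "Text-to-Image Generation",
]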
§ 04 · Area

Computer Vision.

Pixels in, structure out: detection, segmentation, depth. The oldest leaderboards in the register.


Tasks: 13 · Verified SOTA: 8 · Results: 2,129

Computer Vision · 13 tasks
Sorted by result count, then name

01 · Optical Character Recognition
Extracting text from document images
Benchmark: KITAB Arabic OCR Benchmark · Model: Surya · Score: 4.950 (cer) · Results: 829

02 · Scene Text Detection
Detecting text regions in natural scene images
Benchmark: coco-text · Model: CLIP4STR-L · Score: 81.90 (1-1-accuracy) · Results: 581

03 · Document Layout Analysis
Analyzing the layout structure of documents
Benchmark: d4la · Model: DoPTA · Score: 70.7% (map) · Results: 133

04 · Scene Text Recognition
Recognizing text in natural scene images
Benchmark: cute80 · Model: CPPD · Score: 99.7% (accuracy) · Results: 127

05 · Document Parsing
Parsing document structure and content
Benchmark: OmniDocBench v1.5 · Model: Mistral OCR 3 · Score: 91.63 (reading-order) · Results: 117

06 · Table Recognition
Detecting and parsing tables in documents
Benchmark: icdar2013-table-structure-recognition · Model: Proposed System (with post-processing) · Score: 95.46 (f-measure) · Results: 71

07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
Benchmark: OCRBench v2 · Model: mistral-ocr-2512 · Score: 25.20 (overall-en-private) · Results: 66

08 · Document Image Classification
Classifying documents by type or category
Benchmark: aip · Model: ResNet-RS (ResNet-200 + RS training tricks) · Score: 83.40 (top-1-accuracy-verb) · Results: 62

09 · Object Detection
Detecting and localizing objects in images with bounding boxes and class labels.
Benchmark: Microsoft Common Objects in Context · Model: ScyllaNet · Score: 66.12 (box-map) · Results: 46

10 · Image Classification
Image classification is the task that launched modern deep learning — AlexNet's 2012 ImageNet win cut error ra…
Benchmark: ImageNet Large Scale Visual Recognition Challenge 2012 · Model: CoCa (finetuned) · Score: 91.00 (top-1-accuracy) · Results: 44

11 · Handwriting Recognition
Recognizing handwritten text
Results: 40

12 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
Benchmark: Form Understanding in Noisy Scanned Documents · Results: 7

13 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
Benchmark: ADE20K Scene Parsing Benchmark · Model: InternImage-H · Score: 62.9% (mIoU) · Results: 6
Fig 04 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 05 · Area

Natural Language Processing.

Text in, text out. Reasoning, retrieval, rewriting. Everything an LLM is measured on — and several things it is rarely measured on well.


Tasks: 17 · Verified SOTA: 17 · Results: 5,995

Natural Language Processing · 17 tasks
Sorted by result count, then name

01 · Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, ques…
Benchmark: Open Polish LLM Leaderboard · Model: Meta-Llama-3.1-405B-Instruct-FP8 · Score: 93.44 (belebele) · Results: 3,728

02 · Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & t…
Benchmark: Polish Linguistic and Cultural Competency Benchmark · Model: Gemini-3.1-Pro-Preview · Score: 100.0 (geography) · Results: 1,155

03 · Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky question…
Benchmark: Complex Polish Text Understanding Benchmark · Model: Qwen/Qwen3.5-35B-A3B thinking (API) · Score: 4.702 (tricky-questions) · Results: 465

04 · Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities,…
Benchmark: Polish Multi-Turn Benchmark · Model: Phi-4 · Score: 10.00 (stem) · Results: 450

05 · Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emo…
Benchmark: Polish Emotional Intelligence Benchmark (EQ-Bench v2 PL) · Model: Mistral-Large-Instruct-2407 · Score: 78.07 (eq-score) · Results: 101

06 · Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (20…
Benchmark: Stanford Question Answering Dataset v2.0 · Model: DeBERTa-v3-large · Score: 91.4% (f1) · Results: 24

07 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
Benchmark: CNN/DailyMail Summarization · Model: BRIO · Score: 47.8% (rouge-1) · Results: 15

08 · Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the…
Benchmark: SuperGLUE · Model: DeBERTa-v3-large · Score: 91.40 (average-score) · Results: 12

09 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Benchmark: Stanford Natural Language Inference · Model: GPT-4o · Score: 92.6% (accuracy) · Results: 8

10 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
Benchmark: BEIR · Model: NV-Embed-v2 · Score: 62.65 (ndcg@10) · Results: 8

11 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
Benchmark: CoNLL-2003 Named Entity Recognition · Model: GLiNER-multitask · Score: 93.8% (f1) · Results: 7

12 · Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powerin…
Benchmark: MTEB Leaderboard · Model: NV-Embed-v2 · Score: 72.31 (avg-score) · Results: 6

13 · Machine Translation
Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer…
Benchmark: WMT'23 · Model: GPT-4 · Score: 84.10 (comet) · Results: 4

14 · Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict w…
Benchmark: GLUE · Model: DeBERTa-v3-large · Score: 91.37 (avg-score) · Results: 3

15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
Benchmark: STS Benchmark · Model: GTE-Qwen2-7B-instruct · Score: 88.40 (spearman) · Results: 3

16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
Benchmark: WikiTableQuestions · Model: GPT-4 · Score: 75.3% (accuracy) · Results: 3

17 · Zero-Shot Classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on —…
Benchmark: XNLI · Model: GPT-4 · Score: 87.4% (accuracy) · Results: 3
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 06 · Area

Audio.

Sound in, symbols out. Captioning, music generation, and environmental sound events.


Tasks: 3 · Verified SOTA: 0 · Results: 9

Audio · 3 tasks
Sorted by result count, then name

01 · Audio Captioning
Generating text descriptions of audio content.
Benchmark: AudioCaps · Model: AudioCaps baseline (TopDown+Align) · Score: 36.9% (spider) · Results: 3

02 · Music Generation
Generating music from text, audio, or other inputs.
Benchmark: MusicCaps · Model: MusicLM · Score: 4.000 (fad) · Results: 3

03 · Sound Event Detection
Detecting and localizing sound events in audio.
Benchmark: Domestic Environment Sound Event Detection (DCASE Task 4) · Model: ATST-SED · Score: 58.10 (event-f1) · Results: 3
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 07 · Area

Speech.

Voice in, words out — and back again. Recognition, synthesis, speaker verification, translation, cloning.


Tasks: 5 · Verified SOTA: 2 · Results: 40

Speech · 5 tasks
Sorted by result count, then name

01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
Benchmark: Mozilla Common Voice · Model: Whisper Large-v2 · Score: 11.20 (wer) · Results: 20

02 · Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness…
Benchmark: CSTR VCTK Corpus · Model: NaturalSpeech 3 · Score: 4.360 (mos) · Results: 11

03 · Speaker Verification
Verifying speaker identity from voice samples.
Benchmark: VoxCeleb1 Original Test Set (VoxCeleb1-O) · Model: ResNet-34 (AM-Softmax, VoxCeleb2) · Score: 1.180 (eer) · Results: 3

04 · Speech Translation
Translating spoken audio directly to another language.
Benchmark: MuST-C English-German tst-COMMON · Model: SeamlessM4T v2 Large · Score: 37.1% (bleu) · Results: 3

05 · Voice Cloning
Replicating a speaker's voice characteristics.
Benchmark: LibriTTS test-clean zero-shot TTS evaluation · Model: VALL-E · Score: 5.900 (wer) · Results: 3
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 08 · Area

Reinforcement Learning.

Policies, rewards, environments. Where progress is hardest to verify and easiest to overclaim.


Tasks: 2 · Verified SOTA: 0 · Results: 21

Reinforcement Learning · 2 tasks
Sorted by result count, then name

01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
Benchmark: Arcade Learning Environment (Atari 2600) · Model: Go-Explore · Score: 40000.0 (human-normalized-score) · Results: 12

02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Benchmark: Multi-Joint dynamics with Contact · Model: TD-MPC2 (317M params) · Score: 960.0 (average-return) · Results: 9
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 09 · Area

Agentic AI.

A section of the register covering 8 tasks with canonical benchmarks.


Tasks: 8 · Verified SOTA: 5 · Results: 149

Agentic AI · 8 tasks
Sorted by result count, then name

01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
Benchmark: SWE-bench Verified — Agentic Leaderboard · Model: Claude Mythos Preview · Score: 93.90 (resolve-rate) · Results: 81

02 · Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal…
Benchmark: SWE-bench Verified (Agentic) · Model: Claude Opus 4.5 · Score: 80.90 (pct_resolved) · Results: 23

03 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
Benchmark: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · Model: CoAct-1 · Score: 60.76 (success-rate) · Results: 19

04 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
Results: 8

05 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Benchmark: Human-Calibrated Autonomy Software Tasks · Model: Claude Opus 4 · Score: 55.00 (success-rate) · Results: 6

06 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Benchmark: Research Engineering Benchmark · Model: o3 · Score: 0.380 (normalized-score) · Results: 5

07 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
Benchmark: METR Autonomy Evaluation: Time Horizon · Model: Claude Opus 4 · Score: 60.00 (task-horizon-minutes) · Results: 5

08 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
Results: 2
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 10 · Area

Computer Code.

A section of the register covering 6 tasks with canonical benchmarks.


Tasks: 6 · Verified SOTA: 5 · Results: 223

Computer Code · 6 tasks
Sorted by result count, then name

01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Benchmark: SWE-bench Verified Subset · Model: Claude Opus 4.7 · Score: 87.60 (resolve-rate) · Results: 196

02 · Code Translation
Converting code between programming languages.
Benchmark: TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Model: Claude Sonnet 4 · Score: 89.40 (computational-accuracy) · Results: 7

03 · Bug Detection
Identifying bugs and vulnerabilities in code.
Benchmark: Bugs2Fix: Learning to Rewrite Buggy Code · Model: GPT-4o · Score: 78.6% (accuracy) · Results: 6

04 · Code Completion
Predicting the next tokens in code sequences.
Benchmark: Cross-File Code Completion Evaluation · Model: Claude Sonnet 4 · Score: 44.50 (exact-match) · Results: 6

05 · Program Repair
Automatically fixing bugs in code.
Benchmark: Defects4J: A Database of Real Faults in Java Programs · Model: SRepair · Score: 101.0 (correct-patches) · Results: 5

06 · Code Summarization
Generating natural language descriptions of code.
Benchmark: CodeXGLUE Code-to-Text Python subset · Model: CodeT5-base · Score: 20.0% (bleu) · Results: 3
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 11 · Area

Graphs.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks: 3 · Verified SOTA: 1 · Results: 12

Graphs · 3 tasks
Sorted by result count, then name

01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
Benchmark: Cora Citation Network · Model: ACNet · Score: 83.5% (accuracy) · Results: 6

02 · Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta…
Benchmark: Open Graph Benchmark - ogbl-collab · Model: PROXI · Score: 70.98 (hits_at_50) · Results: 3

03 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Benchmark: Open Graph Benchmark - ogbg-molhiv · Model: DGN · Score: 79.70 (roc_auc) · Results: 3
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 12 · Area

Industrial Inspection.

A section of the register covering 1 task with a canonical benchmark.


Tasks: 1 · Verified SOTA: 1 · Results: 27

Industrial Inspection · 1 task
Sorted by result count, then name

01 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
Benchmark: MVTec Anomaly Detection Dataset · Model: AnomalyGPT · Score: 97.40 (auroc) · Results: 27
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 13 · Area

Knowledge Base.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks: 3 · Verified SOTA: 0 · Results: 9

Knowledge Base · 3 tasks
Sorted by result count, then name

01 · Entity Linking
Linking mentions to knowledge base entities.
Benchmark: AIDA-CoNLL-YAGO (test-b) · Model: GENRE · Score: 93.30 (micro_f1) · Results: 3

02 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
Benchmark: FB15k-237 Knowledge Graph Completion · Model: NBFNet · Score: 0.415 (mrr) · Results: 3

03 · Relation Extraction
Extracting relationships between entities from text.
Benchmark: TAC Relation Extraction Dataset · Model: LUKE · Score: 72.7% (f1) · Results: 3
Fig 13 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 14 · Area

Medical.

A section of the register covering 2 tasks with canonical benchmarks.


Tasks: 2 · Verified SOTA: 2 · Results: 83

Medical · 2 tasks
Sorted by result count, then name

01 · Disease Classification
Diagnosing diseases from medical images or data.
Benchmark: Autism Brain Imaging Data Exchange I · Model: SSAE + Softmax (Explainable ASD) · Score: 98.2% (accuracy) · Results: 57

02 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Benchmark: Automated Cardiac Diagnosis Challenge · Model: MedNeXt-L · Score: 92.65 (mean-dsc) · Results: 26
Fig 14 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 15 · Area

Mobile Development.

A section of the register covering 1 task with a canonical benchmark.


Tasks: 1 · Verified SOTA: 1 · Results: 40

Mobile Development · 1 task
Sorted by result count, then name

01 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Benchmark: Callstack Incubator React Native Evaluation Suite · Model: Composer 2 · Score: 98.90 (navigation-satisfaction) · Results: 40
Fig 15 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 16 · Area

Reasoning.

A section of the register covering 5 tasks with canonical benchmarks.


Tasks: 5 · Verified SOTA: 3 · Results: 234

Reasoning · 5 tasks
Sorted by result count, then name

01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Benchmark: Massive Multitask Language Understanding · Model: o3 · Score: 92.9% (accuracy) · Results: 82

02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Benchmark: Mathematics Aptitude Test of Heuristics · Model: Claude Opus 4.5 · Score: 90.7% (accuracy) · Results: 79

03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Benchmark: Graduate-Level Google-Proof Q&A · Model: Gemini 2.5 Pro · Score: 84.0% (accuracy) · Results: 55

04 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
Benchmark: LogiQA · Model: GPT-4o · Score: 56.3% (accuracy) · Results: 12

05 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Benchmark: Math Word Problem Repository · Model: GPT-4o · Score: 97.2% (accuracy) · Results: 6
Fig 16 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 17 · Area

Time Series.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks: 3 · Verified SOTA: 3 · Results: 82

Time Series · 3 tasks
Sorted by result count, then name

01 · Time Series Forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (…
Benchmark: M4 Forecasting Competition · Model: TiDE · Score: 13.95 (smape) · Results: 75

02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
Benchmark: OpenML-CC18 · Model: AutoGluon-Tabular · Score: 88.5% (accuracy) · Results: 5

03 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
Benchmark: California Housing · Model: XGBoost · Score: 0.453 (rmse) · Results: 2
Fig 17 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 18 · Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A · Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.

B · Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.

C · Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.

F · Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade, we don't erase the prior.
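
In code, the invariant is an append-only history. A minimal sketch, with illustrative names (TrustGrade, Regrade, DatasetRecord) rather than the registry's actual schema:

# Illustrative model only; field names are assumptions.
from dataclasses import dataclass, field
from datetime import date
from enum import Enum
from typing import Optional

class TrustGrade(Enum):
    A = "Reproduced, dated, code"
    B = "Partial reproduction"
    C = "Claim-only"
    F = "Contested or retracted"

@dataclass(frozen=True)
class Regrade:
    on: date
    grade: TrustGrade
    reason: str

@dataclass
class DatasetRecord:
    name: str
    history: list = field(default_factory=list)

    def regrade(self, on: date, grade: TrustGrade, reason: str) -> None:
        # Append-only: the prior grade stays on the record, never erased.
        self.history.append(Regrade(on, grade, reason))

    @property
    def current_grade(self) -> Optional[TrustGrade]:
        return self.history[-1].grade if self.history else None

record = DatasetRecord("ExampleBench")  # hypothetical dataset
record.regrade(date(2026, 1, 5), TrustGrade.B, "evaluator overlap documented")
record.regrade(date(2026, 3, 2), TrustGrade.F, "split leakage confirmed")
# record.history still shows the earlier B; only current_grade moves.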

§ 19 · Standing columns

Capability buckets, not benchmarks.

HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.

Image + Text → Video · Animate a still image guided by a text prompt.
Video → Video · Video editing, style transfer, super-resolution.
Image → 3D · Generate a 3D mesh or NeRF from one or more images.
Text → 3D · Generate a 3D asset from a text prompt.
Image → Video · Animate a still image into a short clip.
Unconditional Image Generation · Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 19 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 20 · Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
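
Taken together, the three rules form a publication gate. A minimal sketch, with assumed field names (commit_sha, metric_direction, scored_on) standing in for whatever the real pipeline records:

# Hedged sketch of the three rules; not the actual Codesota pipeline.
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"

@dataclass(frozen=True)
class Submission:
    commit_sha: str                        # frozen commit, not just a repo link
    environment: str                       # declared environment (lockfile, image digest)
    seed: Optional[int]                    # recorded seed
    metric_direction: Optional[Direction]  # declared on the dataset, not per row
    score: float
    scored_on: Optional[date]              # every score carries a date

def publishable(s: Submission) -> bool:
    # Rule 1: runnable provenance. Rule 2: metric direction. Rule 3: a date.
    has_code = bool(s.commit_sha) and bool(s.environment) and s.seed is not None
    return has_code and s.metric_direction is not None and s.scored_on is not None

If any of the three is missing, the row does not publish; the table never silently forgets.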