Codesota · Tasks · Vol. II
The index of every machine-learning task
Issue: April 22, 2026
§ 00 · Index

Every machine-learning task,
indexed.

A register of the 75 tasks our editors track, grouped by area. Each row names the canonical benchmark, the leading model, and a trust grade that tells you how much to believe the number.

Shaded rows mark independently verified state of the art. Dates and scores are in tabular mono; descriptions in serif; navigation in sans.

§ 01 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
18 · Research areas · Grouping the index top-down
120 · Tasks catalogued · 75 with published SOTA
368 · Datasets indexed · Canonical benchmark per task marked
9,082 · Benchmark results · All dated · verified where possible
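Read literally, those four figures are just counts over the registry. Below is a minimal sketch of how they could be pulled, assuming psycopg2 as the driver and guessing at table names (areas, tasks, datasets, results) and a sota_result_id column; none of this is the live schema.

# Hypothetical schema; table and column names are illustrative guesses.
import psycopg2

COUNT_QUERIES = {
    "research_areas":    "SELECT count(*) FROM areas",
    "tasks_catalogued":  "SELECT count(*) FROM tasks",
    "tasks_with_sota":   "SELECT count(*) FROM tasks WHERE sota_result_id IS NOT NULL",
    "datasets_indexed":  "SELECT count(*) FROM datasets",
    "benchmark_results": "SELECT count(*) FROM results",
}

def registry_counts(dsn: str) -> dict[str, int]:
    """Run each headline count against the registry in a single connection."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        counts = {}
        for name, sql in COUNT_QUERIES.items():
            cur.execute(sql)
            counts[name] = cur.fetchone()[0]
        return counts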
§ 02 · Area

Multimodal.

Models that read, see, hear — and sometimes do all three at once. The most crowded frontier; also the least standardised.


Tasks 3 · Verified SOTA 2 · Results 49
Multimodal · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
Visual Question Answering v2.0 · Qwen2-VL 72B · 87.6% accuracy · 47 results
02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
COCO Captions · BLIP-2 · 145.8% CIDEr · 2 results
03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
DPG-Bench · 0 results
Fig 02 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 03 · Area

Computer Vision.

Pixels in, structure out: detection, segmentation, depth. The oldest leaderboards in the register.


Tasks 13 · Verified SOTA 8 · Results 2,129
Computer Vision · 13 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Optical Character Recognition
Extracting text from document images
KITAB Arabic OCR Benchmark · Surya · 4.950 cer · 829 results
02 · Scene Text Detection
Detecting text regions in natural scene images
coco-text · CLIP4STR-L · 81.90 1-1-accuracy · 581 results
03 · Document Layout Analysis
Analyzing the layout structure of documents
d4la · DoPTA · 70.7% map · 133 results
04 · Scene Text Recognition
Recognizing text in natural scene images
cute80 · CPPD · 99.7% accuracy · 127 results
05 · Document Parsing
Parsing document structure and content
OmniDocBench v1.5 · Mistral OCR 3 · 91.63 reading-order · 117 results
06 · Table Recognition
Detecting and parsing tables in documents
icdar2013-table-structure-recognition · Proposed System (With post-processing) · 95.46 f-measure · 71 results
07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
OCRBench v2 · mistral-ocr-2512 · 25.20 overall-en-private · 66 results
08 · Document Image Classification
Classifying documents by type or category
aip · ResNet-RS (ResNet-200 + RS training tricks) · 83.40 top-1-accuracy-verb · 62 results
09 · Object Detection
Detecting and localizing objects in images with bounding boxes and class labels.
Microsoft Common Objects in Context · ScyllaNet · 66.12 box-map · 46 results
10 · Image Classification
Image classification is the task that launched modern deep learning — AlexNet's 2012 ImageNet win cut error ra…
ImageNet Large Scale Visual Recognition Challenge 2012 · CoCa (finetuned) · 91.00 top-1-accuracy · 44 results
11 · Handwriting Recognition
Recognizing handwritten text
40 results
12 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
Form Understanding in Noisy Scanned Documents · 7 results
13 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
ADE20K Scene Parsing Benchmark · InternImage-H · 62.9% mIoU · 6 results
Fig 03 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 04 · Area

Natural Language Processing.

Text in, text out. Reasoning, retrieval, rewriting. Everything an LLM is measured on — and several things it is rarely measured on well.


Tasks 17 · Verified SOTA 17 · Results 5,995
Natural Language Processing · 17 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, ques…
Open Polish LLM Leaderboard · Meta-Llama-3.1-405B-Instruct-FP8 · 93.44 belebele · 3,728 results
02 · Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & t…
Polish Linguistic and Cultural Competency Benchmark · Gemini-3.1-Pro-Preview · 100.0 geography · 1,155 results
03 · Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky question…
Complex Polish Text Understanding Benchmark · Qwen/Qwen3.5-35B-A3B thinking (API) · 4.702 tricky-questions · 465 results
04 · Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities,…
Polish Multi-Turn Benchmark · Phi-4 · 10.00 stem · 450 results
05 · Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emo…
Polish Emotional Intelligence Benchmark (EQ-Bench v2 PL) · Mistral-Large-Instruct-2407 · 78.07 eq-score · 101 results
06 · Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (20…
Stanford Question Answering Dataset v2.0 · DeBERTa-v3-large · 91.4% f1 · 24 results
07 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
CNN/DailyMail Summarization · BRIO · 47.8% rouge-1 · 15 results
08 · Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the…
SuperGLUE · DeBERTa-v3-large · 91.40 average-score · 12 results
09 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Stanford Natural Language Inference · GPT-4o · 92.6% accuracy · 8 results
10 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
BEIR · NV-Embed-v2 · 62.65 ndcg@10 · 8 results
11 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
CoNLL-2003 Named Entity Recognition · GLiNER-multitask · 93.8% f1 · 7 results
12 · Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powerin…
MTEB Leaderboard · NV-Embed-v2 · 72.31 avg-score · 6 results
13 · Machine Translation
Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer…
WMT'23 · GPT-4 · 84.10 comet · 4 results
14 · Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict w…
GLUE · DeBERTa-v3-large · 91.37 avg-score · 3 results
15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
STS Benchmark · GTE-Qwen2-7B-instruct · 88.40 spearman · 3 results
16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
WikiTableQuestions · GPT-4 · 75.3% accuracy · 3 results
17 · Zero-Shot Classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on —…
XNLI · GPT-4 · 87.4% accuracy · 3 results
Fig 04 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 05 · Area

Audio.

Sound in, symbols out. Speech recognition, speaker diarisation, music and environmental audio.


Tasks 3 · Verified SOTA 0 · Results 9
Audio · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Audio Captioning
Generating text descriptions of audio content.
AudioCaps · AudioCaps baseline (TopDown+Align) · 36.9% spider · 3 results
02 · Music Generation
Generating music from text, audio, or other inputs.
MusicCaps · MusicLM · 4.000 fad · 3 results
03 · Sound Event Detection
Detecting and localizing sound events in audio.
Domestic Environment Sound Event Detection (DCASE Task 4) · ATST-SED · 58.10 event-f1 · 3 results
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 06 · Area

Speech.

Spoken language in, text or voices out. Recognition, synthesis, speaker verification, speech translation, and voice cloning.


Tasks 5 · Verified SOTA 2 · Results 40
Speech · 5 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
Mozilla Common Voice · Whisper Large-v2 · 11.20 wer · 20 results
02 · Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness…
CSTR VCTK Corpus · NaturalSpeech 3 · 4.360 mos · 11 results
03 · Speaker Verification
Verifying speaker identity from voice samples.
VoxCeleb1 Original Test Set (VoxCeleb1-O) · ResNet-34 (AM-Softmax, VoxCeleb2) · 1.180 eer · 3 results
04 · Speech Translation
Translating spoken audio directly to another language.
MuST-C English-German tst-COMMON · SeamlessM4T v2 Large · 37.1% bleu · 3 results
05 · Voice Cloning
Replicating a speaker's voice characteristics.
LibriTTS test-clean zero-shot TTS evaluation · VALL-E · 5.900 wer · 3 results
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 07 · Area

Reinforcement Learning.

Policies, rewards, environments. Where progress is hardest to verify and easiest to overclaim.


Tasks 2 · Verified SOTA 0 · Results 21
Reinforcement Learning · 2 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
Arcade Learning Environment (Atari 2600) · Go-Explore · 40000.0 human-normalized-score · 12 results
02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Multi-Joint dynamics with Contact · TD-MPC2 (317M params) · 960.0 average-return · 9 results
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 08 · Area

Agentic AI.

A section of the register covering 8 tasks with canonical benchmarks.


Tasks 8 · Verified SOTA 5 · Results 129
Agentic AI · 8 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
SWE-bench Verified — Agentic Leaderboard · Claude Mythos Preview · 93.90 resolve-rate · 81 results
02 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · CoAct-1 · 60.76 success-rate · 19 results
03 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
8 results
04 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Human-Calibrated Autonomy Software Tasks · Claude Opus 4 · 55.00 success-rate · 6 results
05 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Research Engineering Benchmark · o3 · 0.380 normalized-score · 5 results
06 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
METR Autonomy Evaluation: Time Horizon · Claude Opus 4 · 60.00 task-horizon-minutes · 5 results
07 · Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most comme…
SWE-bench Verified (Agentic) · Claude Opus 4.5 · 80.90 pct_resolved · 3 results
08 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
2 results
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 09 · Area

Computer Code.

A section of the register covering 6 tasks with canonical benchmarks.


Tasks 6 · Verified SOTA 5 · Results 223
Computer Code · 6 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
SWE-bench Verified Subset · Claude Opus 4.7 · 87.60 resolve-rate · 196 results
02 · Code Translation
Converting code between programming languages.
TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Claude Sonnet 4 · 89.40 computational-accuracy · 7 results
03 · Bug Detection
Identifying bugs and vulnerabilities in code.
Bugs2Fix: Learning to Rewrite Buggy Code · GPT-4o · 78.6% accuracy · 6 results
04 · Code Completion
Predicting the next tokens in code sequences.
Cross-File Code Completion Evaluation · Claude Sonnet 4 · 44.50 exact-match · 6 results
05 · Program Repair
Automatically fixing bugs in code.
Defects4J: A Database of Real Faults in Java Programs · SRepair · 101.0 correct-patches · 5 results
06 · Code Summarization
Generating natural language descriptions of code.
CodeXGLUE Code-to-Text Python subset · CodeT5-base · 20.0% bleu · 3 results
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 10 · Area

Graphs.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 1 · Results 12
Graphs · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
Cora Citation Network · ACNet · 83.5% accuracy · 6 results
02 · Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta…
Open Graph Benchmark - ogbl-collab · PROXI · 70.98 hits_at_50 · 3 results
03 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Open Graph Benchmark - ogbg-molhiv · DGN · 79.70 roc_auc · 3 results
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 11 · Area

Industrial Inspection.

A section of the register covering 1 task with canonical benchmarks.


Tasks 1 · Verified SOTA 1 · Results 27
Industrial Inspection · 1 task
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
MVTec Anomaly Detection Dataset · AnomalyGPT · 97.40 auroc · 27 results
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 12 · Area

Knowledge Base.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 0 · Results 9
Knowledge Base · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Entity Linking
Linking mentions to knowledge base entities.
AIDA-CoNLL-YAGO (test-b) · GENRE · 93.30 micro_f1 · 3 results
02 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
FB15k-237 Knowledge Graph Completion · NBFNet · 0.415 mrr · 3 results
03 · Relation Extraction
Extracting relationships between entities from text.
TAC Relation Extraction Dataset · LUKE · 72.7% f1 · 3 results
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 13 · Area

Medical.

A section of the register covering 2 tasks with canonical benchmarks.


Tasks 2 · Verified SOTA 2 · Results 83
Medical · 2 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Disease Classification
Diagnosing diseases from medical images or data.
Autism Brain Imaging Data Exchange I · SSAE + Softmax (Explainable ASD) · 98.2% accuracy · 57 results
02 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Automated Cardiac Diagnosis Challenge · MedNeXt-L · 92.65 mean-dsc · 26 results
Fig 13 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 14 · Area

Mobile Development.

A section of the register covering 1 task with canonical benchmarks.


Tasks 1 · Verified SOTA 1 · Results 40
Mobile Development · 1 task
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Callstack Incubator React Native Evaluation Suite · Composer 2 · 98.90 navigation-satisfaction · 40 results
Fig 14 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 15 · Area

Reasoning.

A section of the register covering 5 tasks with canonical benchmarks.


Tasks 5 · Verified SOTA 3 · Results 234
Reasoning · 5 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Massive Multitask Language Understanding · o3 · 92.9% accuracy · 82 results
02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Mathematics Aptitude Test of Heuristics · Claude Opus 4.5 · 90.7% accuracy · 79 results
03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Graduate-Level Google-Proof Q&A · Gemini 2.5 Pro · 84.0% accuracy · 55 results
04 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
LogiQA · GPT-4o · 56.3% accuracy · 12 results
05 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Math Word Problem Repository · GPT-4o · 97.2% accuracy · 6 results
Fig 15 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 16 · Area

Time Series.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 3 · Results 82
Time Series · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Time Series Forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (…
M4 Forecasting Competition · TiDE · 13.95 smape · 75 results
02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
OpenML-CC18 · AutoGluon-Tabular · 88.5% accuracy · 5 results
03 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
California Housing · XGBoost · 0.453 rmse · 2 results
Fig 16 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 17
Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A
Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.
B
Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.
C
Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.
F
Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade, we don't erase the prior.
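A minimal sketch of how the four grades could travel through downstream code, assuming Python; TrustGrade, shaded_as_sota, and cite_with_caution are illustrative names, and shading only grade A as state of the art is an assumption drawn from the definitions above, not a statement of the registry's actual rule.

from enum import Enum

class TrustGrade(Enum):
    """The four-point scale described above, from most to least believable."""
    A = "reproduced-dated-code"      # full path visible, reproduced against a held-out test set
    B = "partial-reproduction"       # checks out, with known caveats preserved alongside the number
    C = "claim-only"                 # author-reported, not reproduced
    F = "contested-or-retracted"     # documented contamination, leakage, or withdrawal

def shaded_as_sota(grade: TrustGrade) -> bool:
    """Assumption: only fully reproduced (A) rows get the shaded state-of-the-art treatment."""
    return grade is TrustGrade.A

def cite_with_caution(grade: TrustGrade) -> bool:
    """B-grade rows keep their caveat next to the score; C and F should not be cited as SOTA."""
    return grade is TrustGrade.B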

§ 18 · Standing columns

Capability buckets, not benchmarks.

HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.

Standing column

Image + Text → Video

Animate a still image guided by a text prompt.

Standing column

Video → Video

Video editing, style transfer, super-resolution.

Standing column

Image → 3D

Generate a 3D mesh or NeRF from one or more images.

Standing column

Text → 3D

Generate a 3D asset from a text prompt.

Standing column

Image → Video

Animate a still image into a short clip.

Standing column

Unconditional Image Generation

Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 18 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 19
Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
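A minimal sketch of the provenance one row would have to carry under that rule; the Submission fields and the publishable check are assumptions about shape, not the registry's actual pipeline.

from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    # Field names are illustrative, not the registry's real schema.
    repo_url: str       # where the code lives
    commit_sha: str     # frozen commit, not a moving branch
    environment: str    # declared environment, e.g. a lockfile or image digest
    seed: int           # recorded seed for the evaluated run
    score: float
    metric: str
    recorded_on: str    # ISO date stamp, since every score carries a date

def publishable(sub: Submission, reran_successfully: bool) -> bool:
    """If it does not run, the row does not publish: require a successful re-run plus full provenance."""
    provenance = all([sub.repo_url, sub.commit_sha, sub.environment, sub.recorded_on])
    return reran_successfully and provenance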

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
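A minimal sketch of what declaring a direction buys, assuming the direction is stored per metric; the two sets below name a few metrics from this issue purely as examples.

# Direction-aware comparison: error metrics improve downward, accuracy-style metrics upward.
HIGHER_IS_BETTER = {"accuracy", "f1", "mIoU", "ndcg@10"}
LOWER_IS_BETTER = {"wer", "cer", "rmse", "fad", "eer"}

def improves(metric: str, new: float, current_best: float) -> bool:
    """True if `new` beats `current_best` under the metric's declared direction."""
    if metric in LOWER_IS_BETTER:
        return new < current_best
    if metric in HIGHER_IS_BETTER:
        return new > current_best
    raise ValueError(f"metric {metric!r} has no declared direction")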

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
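A minimal sketch of the append-only record that implies; DatedResult, record, and headline are illustrative names, not the production code.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DatedResult:
    model: str
    score: float
    recorded_on: date           # every score carries a date

history: list[DatedResult] = []   # append-only; regressions stay visible

def record(result: DatedResult) -> None:
    """Append, never overwrite: prior rows are preserved even when a model regresses."""
    history.append(result)

def headline(higher_is_better: bool) -> DatedResult:
    """The number shown in the register is derived from the full, dated history."""
    key = (lambda r: r.score) if higher_is_better else (lambda r: -r.score)
    return max(history, key=key)

In this sketch the history list is the table itself, and the headline score is only a view over it; that is what keeps a register from silently forgetting.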