Codesota · Tasks · Vol. II
The index of every machine-learning task
Issue: April 22, 2026
§ 00 · Index

Every machine-learning task,
indexed.

A register of the 75 tasks our editors track, grouped by area. Each row names the canonical benchmark, the leading model, and a trust grade that tells you how much to believe the number.

Shaded rows mark independently verified state of the art. Dates and scores are in tabular mono; descriptions in serif; navigation in sans.

§ 01 · Counts

The register, by the numbers.

Figures sourced from the live Postgres registry · updated every 10 min
18 · Research areas · Grouping the index top-down
120 · Tasks catalogued · 75 with published SOTA
368 · Datasets indexed · Canonical benchmark per task marked
9,082 · Benchmark results · All dated · verified where possible
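Read literally, those four figures are just counts over the registry. Below is a minimal sketch of how they could be pulled, assuming psycopg2 as the driver and guessing at table names (areas, tasks, datasets, results) and a sota_result_id column; none of this is the live schema.

# Hypothetical schema; table and column names are illustrative guesses.
import psycopg2

COUNT_QUERIES = {
    "research_areas":    "SELECT count(*) FROM areas",
    "tasks_catalogued":  "SELECT count(*) FROM tasks",
    "tasks_with_sota":   "SELECT count(*) FROM tasks WHERE sota_result_id IS NOT NULL",
    "datasets_indexed":  "SELECT count(*) FROM datasets",
    "benchmark_results": "SELECT count(*) FROM results",
}

def registry_counts(dsn: str) -> dict[str, int]:
    """Run each headline count against the registry in a single connection."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        counts = {}
        for name, sql in COUNT_QUERIES.items():
            cur.execute(sql)
            counts[name] = cur.fetchone()[0]
        return counts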
§ 02 · Area

Multimodal.

Models that read, see, hear — and sometimes do all three at once. The most crowded frontier; also the least standardised.


Tasks 3 · Verified SOTA 2 · Results 49
Multimodal · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Visual Question Answering
Visual question answering (VQA) is the original multimodal reasoning task — given an image and a natural langu…
Visual Question Answering v2.0 · Qwen2-VL 72B · 87.6% accuracy · 47 results
02 · Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern…
COCO Captions · BLIP-2 · 145.8% CIDEr · 2 results
03 · Text-to-Image Generation
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022)…
DPG-Bench · 0 results
Fig 02 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 03 · Area

Computer Vision.

Pixels in, structure out: detection, segmentation, depth. The oldest leaderboards in the register.


Tasks 13 · Verified SOTA 8 · Results 2,129
Computer Vision · 13 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Optical Character Recognition
Extracting text from document images
KITAB Arabic OCR Benchmark · Surya · 4.950 cer · 829 results
02 · Scene Text Detection
Detecting text regions in natural scene images
coco-text · CLIP4STR-L · 81.90 1-1-accuracy · 581 results
03 · Document Layout Analysis
Analyzing the layout structure of documents
d4la · DoPTA · 70.7% map · 133 results
04 · Scene Text Recognition
Recognizing text in natural scene images
cute80 · CPPD · 99.7% accuracy · 127 results
05 · Document Parsing
Parsing document structure and content
OmniDocBench v1.5 · Mistral OCR 3 · 91.63 reading-order · 117 results
06 · Table Recognition
Detecting and parsing tables in documents
icdar2013-table-structure-recognition · Proposed System (With post-processing) · 95.46 f-measure · 71 results
07 · General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
OCRBench v2 · mistral-ocr-2512 · 25.20 overall-en-private · 66 results
08 · Document Image Classification
Classifying documents by type or category
aip · ResNet-RS (ResNet-200 + RS training tricks) · 83.40 top-1-accuracy-verb · 62 results
09 · Object Detection
Detecting and localizing objects in images with bounding boxes and class labels.
Microsoft Common Objects in Context · ScyllaNet · 66.12 box-map · 46 results
10 · Image Classification
Image classification is the task that launched modern deep learning — AlexNet's 2012 ImageNet win cut error ra…
ImageNet Large Scale Visual Recognition Challenge 2012 · CoCa (finetuned) · 91.00 top-1-accuracy · 44 results
11 · Handwriting Recognition
Recognizing handwritten text
40 results
12 · Document Understanding
Document understanding requires parsing visually rich documents — invoices, forms, scientific papers, tables —…
Form Understanding in Noisy Scanned Documents · 7 results
13 · Semantic Segmentation
Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins auton…
ADE20K Scene Parsing Benchmark · InternImage-H · 62.9% mIoU · 6 results
Fig 03 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 04 · Area

Natural Language Processing.

Text in, text out. Reasoning, retrieval, rewriting. Everything an LLM is measured on — and several things it is rarely measured on well.


Tasks 17 · Verified SOTA 17 · Results 5,995
Natural Language Processing · 17 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, ques…
Open Polish LLM Leaderboard · Meta-Llama-3.1-405B-Instruct-FP8 · 93.44 belebele · 3,728 results
02 · Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & t…
Polish Linguistic and Cultural Competency Benchmark · Gemini-3.1-Pro-Preview · 100.0 geography · 1,155 results
03 · Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky question…
Complex Polish Text Understanding Benchmark · Qwen/Qwen3.5-35B-A3B thinking (API) · 4.702 tricky-questions · 465 results
04 · Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities,…
Polish Multi-Turn Benchmark · Phi-4 · 10.00 stem · 450 results
05 · Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emo…
Polish Emotional Intelligence Benchmark (EQ-Bench v2 PL) · Mistral-Large-Instruct-2407 · 78.07 eq-score · 101 results
06 · Question Answering
Extractive and abstractive question answering is one of the oldest NLP benchmarks, from the original SQuAD (20…
Stanford Question Answering Dataset v2.0 · DeBERTa-v3-large · 91.4% f1 · 24 results
07 · Text Summarization
Text summarization compresses documents while preserving key information — a task that became dramatically mor…
CNN/DailyMail Summarization · BRIO · 47.8% rouge-1 · 15 results
08 · Text Classification
Text classification is the gateway drug of NLP — sentiment analysis, spam detection, topic labeling — and the…
SuperGLUE · DeBERTa-v3-large · 91.40 average-score · 12 results
09 · Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Stanford Natural Language Inference · GPT-4o · 92.6% accuracy · 8 results
10 · Text Ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by C…
BEIR · NV-Embed-v2 · 62.65 ndcg@10 · 8 results
11 · Named Entity Recognition
Named entity recognition (NER) extracts structured mentions — people, organizations, locations, dates — from u…
CoNLL-2003 Named Entity Recognition · GLiNER-multitask · 93.8% f1 · 7 results
12 · Feature Extraction
Feature extraction — generating dense vector embeddings from text — is the unsung infrastructure layer powerin…
MTEB Leaderboard · NV-Embed-v2 · 72.31 avg-score · 6 results
13 · Machine Translation
Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer…
WMT'23 · GPT-4 · 84.10 comet · 4 results
14 · Fill-Mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict w…
GLUE · DeBERTa-v3-large · 91.37 avg-score · 3 results
15 · Semantic Textual Similarity
Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detecti…
STS Benchmark · GTE-Qwen2-7B-instruct · 88.40 spearman · 3 results
16 · Table Question Answering
Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a s…
WikiTableQuestions · GPT-4 · 75.3% accuracy · 3 results
17 · Zero-Shot Classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on —…
XNLI · GPT-4 · 87.4% accuracy · 3 results
Fig 04 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 05 · Area

Audio.

Sound in, symbols out. Speech recognition, speaker diarisation, music and environmental audio.


Tasks 3 · Verified SOTA 0 · Results 9
Audio · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Audio Captioning
Generating text descriptions of audio content.
AudioCaps · AudioCaps baseline (TopDown+Align) · 36.9% spider · 3 results
02 · Music Generation
Generating music from text, audio, or other inputs.
MusicCaps · MusicLM · 4.000 fad · 3 results
03 · Sound Event Detection
Detecting and localizing sound events in audio.
Domestic Environment Sound Event Detection (DCASE Task 4) · ATST-SED · 58.10 event-f1 · 3 results
Fig 05 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 06 · Area

Speech.

Spoken language in, text or voices out. Recognition, synthesis, speaker verification, speech translation, and voice cloning.


Tasks 5 · Verified SOTA 2 · Results 40
Speech · 5 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Speech Recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a…
Mozilla Common Voice · Whisper Large-v2 · 11.20 wer · 20 results
02 · Text-to-Speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness…
CSTR VCTK Corpus · NaturalSpeech 3 · 4.360 mos · 11 results
03 · Speaker Verification
Verifying speaker identity from voice samples.
VoxCeleb1 Original Test Set (VoxCeleb1-O) · ResNet-34 (AM-Softmax, VoxCeleb2) · 1.180 eer · 3 results
04 · Speech Translation
Translating spoken audio directly to another language.
MuST-C English-German tst-COMMON · SeamlessM4T v2 Large · 37.1% bleu · 3 results
05 · Voice Cloning
Replicating a speaker's voice characteristics.
LibriTTS test-clean zero-shot TTS evaluation · VALL-E · 5.900 wer · 3 results
Fig 06 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 07 · Area

Reinforcement Learning.

Policies, rewards, environments. Where progress is hardest to verify and easiest to overclaim.


Tasks 2 · Verified SOTA 0 · Results 21
Reinforcement Learning · 2 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Atari Games
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pix…
Arcade Learning Environment (Atari 2600) · Go-Explore · 40000.0 human-normalized-score · 12 results
02 · Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the O…
Multi-Joint dynamics with Contact · TD-MPC2 (317M params) · 960.0 average-return · 9 results
Fig 07 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 08 · Area

Agentic AI.

A section of the register covering 8 tasks with canonical benchmarks.


Tasks 8 · Verified SOTA 5 · Results 129
Agentic AI · 8 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for…
SWE-bench Verified — Agentic Leaderboard · Claude Mythos Preview · 93.90 resolve-rate · 81 results
02 · Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by…
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · CoAct-1 · 60.76 success-rate · 19 results
03 · Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like…
8 results
04 · HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI auton…
Human-Calibrated Autonomy Software Tasks · Claude Opus 4 · 55.00 success-rate · 6 results
05 · RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineerin…
Research Engineering Benchmark · o3 · 0.380 normalized-score · 5 results
06 · Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the…
METR Autonomy Evaluation: Time Horizon · Claude Opus 4 · 60.00 task-horizon-minutes · 5 results
07 · Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most comme…
SWE-bench Verified (Agentic) · Claude Opus 4.5 · 80.90 pct_resolved · 3 results
08 · Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpre…
2 results
Fig 08 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 09 · Area

Computer Code.

A section of the register covering 6 tasks with canonical benchmarks.


Tasks 6 · Verified SOTA 5 · Results 223
Computer Code · 6 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
SWE-bench Verified Subset · Claude Opus 4.7 · 87.60 resolve-rate · 196 results
02 · Code Translation
Converting code between programming languages.
TransCoder Evaluation on GeeksForGeeks Algorithmic Problems · Claude Sonnet 4 · 89.40 computational-accuracy · 7 results
03 · Bug Detection
Identifying bugs and vulnerabilities in code.
Bugs2Fix: Learning to Rewrite Buggy Code · GPT-4o · 78.6% accuracy · 6 results
04 · Code Completion
Predicting the next tokens in code sequences.
Cross-File Code Completion Evaluation · Claude Sonnet 4 · 44.50 exact-match · 6 results
05 · Program Repair
Automatically fixing bugs in code.
Defects4J: A Database of Real Faults in Java Programs · SRepair · 101.0 correct-patches · 5 results
06 · Code Summarization
Generating natural language descriptions of code.
CodeXGLUE Code-to-Text Python subset · CodeT5-base · 20.0% bleu · 3 results
Fig 09 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 10 · Area

Graphs.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 1 · Results 12
Graphs · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Node Classification
Node classification — assigning labels to vertices in a graph using both node features and neighborhood struct…
Cora Citation Network · ACNet · 83.5% accuracy · 6 results
02 · Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-ta…
Open Graph Benchmark - ogbl-collab · PROXI · 70.98 hits_at_50 · 3 results
03 · Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from mo…
Open Graph Benchmark - ogbg-molhiv · DGN · 79.70 roc_auc · 3 results
Fig 10 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 11 · Area

Industrial Inspection.

A section of the register covering 1 task with canonical benchmarks.


Tasks 1 · Verified SOTA 1 · Results 27
Industrial Inspection · 1 task
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
MVTec Anomaly Detection Dataset · AnomalyGPT · 97.40 auroc · 27 results
Fig 11 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 12 · Area

Knowledge Base.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 0 · Results 9
Knowledge Base · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Entity Linking
Linking mentions to knowledge base entities.
AIDA-CoNLL-YAGO (test-b) · GENRE · 93.30 micro_f1 · 3 results
02 · Knowledge Graph Completion
Predicting missing links in knowledge graphs.
FB15k-237 Knowledge Graph Completion · NBFNet · 0.415 mrr · 3 results
03 · Relation Extraction
Extracting relationships between entities from text.
TAC Relation Extraction Dataset · LUKE · 72.7% f1 · 3 results
Fig 12 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 13 · Area

Medical.

A section of the register covering 2 tasks with canonical benchmarks.


Tasks 2 · Verified SOTA 2 · Results 83
Medical · 2 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Disease Classification
Diagnosing diseases from medical images or data.
Autism Brain Imaging Data Exchange I · SSAE + Softmax (Explainable ASD) · 98.2% accuracy · 57 results
02 · Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Automated Cardiac Diagnosis Challenge · MedNeXt-L · 92.65 mean-dsc · 26 results
Fig 13 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 14 · Area

Mobile Development.

A section of the register covering 1 task with canonical benchmarks.


Tasks 1 · Verified SOTA 1 · Results 40
Mobile Development · 1 task
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · React Native Code Generation
Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation,…
Callstack Incubator React Native Evaluation Suite · Composer 2 · 98.90 navigation-satisfaction · 40 results
Fig 14 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 15 · Area

Reasoning.

A section of the register covering 5 tasks with canonical benchmarks.


Tasks 5 · Verified SOTA 3 · Results 234
Reasoning · 5 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Commonsense Reasoning
Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social…
Massive Multitask Language Understanding · o3 · 92.9% accuracy · 82 results
02 · Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have beco…
Mathematics Aptitude Test of Heuristics · Claude Opus 4.5 · 90.7% accuracy · 79 results
03 · Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capabili…
Graduate-Level Google-Proof Q&A · Gemini 2.5 Pro · 84.0% accuracy · 55 results
04 · Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weak…
LogiQA · GPT-4o · 56.3% accuracy · 12 results
05 · Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models ca…
Math Word Problem Repository · GPT-4o · 97.2% accuracy · 6 results
Fig 15 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 16 · Area

Time Series.

A section of the register covering 3 tasks with canonical benchmarks.


Tasks 3 · Verified SOTA 3 · Results 82
Time Series · 3 tasks
Sorted by result count, then name
# · Task · Canonical benchmark · Leading model · Score · Results
01 · Time Series Forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (…
M4 Forecasting Competition · TiDE · 13.95 smape · 75 results
02 · Tabular Classification
Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain…
OpenML-CC18 · AutoGluon-Tabular · 88.5% accuracy · 5 results
03 · Tabular Regression
Tabular regression — predicting continuous values from structured data — powers everything from house-price es…
California Housing · XGBoost · 0.453 rmse · 2 results
Fig 16 · Each row links to the task page with full history. Shaded rows mark independently verified state of the art; empty score cells mean no benchmark in the register yet clears our trust bar.
§ 17
Trust grades

What the letters mean.

Benchmarks are not equally believable. Some are held out behind a private evaluator; some ship their test set as part of the training corpus. We grade the canonical dataset of every task on a four-point scale and show the letter next to the score.

A
Reproduced · dated · code
The full path is visible: a public checkpoint, a frozen commit, a declared environment, and a score we (or a signed reproducer) ran against a held-out test set. Contamination controlled, metric direction declared, date stamped.
B
Partial reproduction
Known weaknesses — evaluator overlap, public answer keys, a missing seed — but the submission otherwise checks out. Cite with caution; we preserve the caveat alongside the number.
C
Claim-only
The authors say so. We have not reproduced it and cannot yet. Shown in the register for completeness, but do not treat as state of the art.
F
Contested or retracted
The benchmark is considered unreliable: documented contamination, split leakage, or a score withdrawn by its authors. The row remains visible — leaderboards that silently forget are worse than leaderboards that argue in public.

A dataset can be regraded in public at any time; the history is preserved on the benchmark page. We publish the regrade, we don't erase the prior.
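A minimal sketch of how the four grades could travel through downstream code, assuming Python; TrustGrade, shaded_as_sota, and cite_with_caution are illustrative names, and shading only grade A as state of the art is an assumption drawn from the definitions above, not a statement of the registry's actual rule.

from enum import Enum

class TrustGrade(Enum):
    """The four-point scale described above, from most to least believable."""
    A = "reproduced-dated-code"      # full path visible, reproduced against a held-out test set
    B = "partial-reproduction"       # checks out, with known caveats preserved alongside the number
    C = "claim-only"                 # author-reported, not reproduced
    F = "contested-or-retracted"     # documented contamination, leakage, or withdrawal

def shaded_as_sota(grade: TrustGrade) -> bool:
    """Assumption: only fully reproduced (A) rows get the shaded state-of-the-art treatment."""
    return grade is TrustGrade.A

def cite_with_caution(grade: TrustGrade) -> bool:
    """B-grade rows keep their caveat next to the score; C and F should not be cited as SOTA."""
    return grade is TrustGrade.B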

§ 18 · Standing columns

Capability buckets, not benchmarks.

HuggingFace pipeline-tag categories. These group concrete tasks thematically; they are not themselves measurable. Use them to navigate to the real rankings.

Standing column

Image + Text → Video

Animate a still image guided by a text prompt.

Standing column

Video → Video

Video editing, style transfer, super-resolution.

Standing column

Image → 3D

Generate a 3D mesh or NeRF from one or more images.

Standing column

Text → 3D

Generate a 3D asset from a text prompt.

Standing column

Image → Video

Animate a still image into a short clip.

Standing column

Unconditional Image Generation

Generative image models without text conditioning (DCGAN, StyleGAN era).

Fig 18 · Standing columns exist to aid navigation, not to be ranked. Follow any link to the underlying task's leaderboard.
§ 19
Methodology

Why this register can be trusted.

Most leaderboards are a ledger of claims. Authors submit a number, a banner appears; the number stands until the next banner appears. Codesota is different in three ordinary ways.

First, every submission carries code. Not a repo link alone — a frozen commit, a declared environment, a recorded seed. If it does not run, the row does not publish.
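A minimal sketch of the provenance one row would have to carry under that rule; the Submission fields and the publishable check are assumptions about shape, not the registry's actual pipeline.

from dataclasses import dataclass

@dataclass(frozen=True)
class Submission:
    # Field names are illustrative, not the registry's real schema.
    repo_url: str       # where the code lives
    commit_sha: str     # frozen commit, not a moving branch
    environment: str    # declared environment, e.g. a lockfile or image digest
    seed: int           # recorded seed for the evaluated run
    score: float
    metric: str
    recorded_on: str    # ISO date stamp, since every score carries a date

def publishable(sub: Submission, reran_successfully: bool) -> bool:
    """If it does not run, the row does not publish: require a successful re-run plus full provenance."""
    provenance = all([sub.repo_url, sub.commit_sha, sub.environment, sub.recorded_on])
    return reran_successfully and provenance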

Second, every benchmark has a metric direction. Higher-is-better and lower-is-better are declared on the dataset; no ambiguity reaches the reader.
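A minimal sketch of what declaring a direction buys, assuming the direction is stored per metric; the two sets below name a few metrics from this issue purely as examples.

# Direction-aware comparison: error metrics improve downward, accuracy-style metrics upward.
HIGHER_IS_BETTER = {"accuracy", "f1", "mIoU", "ndcg@10"}
LOWER_IS_BETTER = {"wer", "cer", "rmse", "fad", "eer"}

def improves(metric: str, new: float, current_best: float) -> bool:
    """True if `new` beats `current_best` under the metric's declared direction."""
    if metric in LOWER_IS_BETTER:
        return new < current_best
    if metric in HIGHER_IS_BETTER:
        return new > current_best
    raise ValueError(f"metric {metric!r} has no declared direction")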

Third, every score carries a date. When a model regresses — and they do — the record is preserved. The table never silently forgets.
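A minimal sketch of the append-only record that implies; DatedResult, record, and headline are illustrative names, not the production code.

from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DatedResult:
    model: str
    score: float
    recorded_on: date           # every score carries a date

history: list[DatedResult] = []   # append-only; regressions stay visible

def record(result: DatedResult) -> None:
    """Append, never overwrite: prior rows are preserved even when a model regresses."""
    history.append(result)

def headline(higher_is_better: bool) -> DatedResult:
    """The number shown in the register is derived from the full, dated history."""
    key = (lambda r: r.score) if higher_is_better else (lambda r: -r.score)
    return max(history, key=key)

In this sketch the history list is the table itself, and the headline score is only a view over it; that is what keeps a register from silently forgetting.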