Every ML task — current SOTA, and how much to trust it

Machine Translation

Machine translation is the oldest AI grand challenge, from rule-based systems in the 1950s to the transformer revolution sparked by "Attention Is All You Need" (2017) — literally the architecture that now powers all of AI. Google's multilingual T5 and Meta's NLLB-200 pushed translation to 200+ languages, but the real disruption came from GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. The unsolved frontier is low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation where preserving style, humor, and cultural nuance remains beyond any system. BLEU scores are increasingly seen as unreliable — human evaluation and newer metrics like COMET and BLEURT are becoming the standard.

84.10comet

by GPT-4✓

4 results · 2 datasets

Fill-Mask

Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.

91.37avg-score

by DeBERTa-v3-large✓

by GTE-Qwen2-7B-instruct✓

Semantic Textual Similarity

Semantic similarity measures how close two pieces of text are in meaning — the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark scores climbed from 70 (GloVe averages) to 86+ with Sentence-BERT, and now exceed 92 with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier — models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.

88.40spearman

Table Question Answering

Table question answering bridges natural language and structured data — asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows — exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.

75.3%accuracy

by GPT-4✓

3 results · 2 datasets

Zero-Shot Classification

Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on — the ultimate test of language understanding and generalization. The breakthrough was the natural language inference (NLI) trick: reframe classification as "does this text entail the label?" using models fine-tuned on MNLI, pioneered by Yin et al. (2019) and popularized by BART-large-MNLI. Today, instruction-tuned LLMs have largely subsumed this approach — GPT-4, Claude, and Llama 3 can classify into arbitrary taxonomies via prompting with near-supervised accuracy. The remaining challenge is consistency and calibration: LLMs are powerful but their predictions can be brittle to prompt phrasing, making them unreliable for high-stakes automated pipelines without careful engineering.

87.4%accuracy

by GPT-4✓

by AudioCaps baseline (TopDown+Align)

Audio

3 tasks

Audio Captioning

Generating text descriptions of audio content.

36.9%spider

Music Generation

Generating music from text, audio, or other inputs.

4.000fad

by MusicLM

Sound Event Detection

Detecting and localizing sound events in audio.

58.10event-f1

by ATST-SED

by ResNet-34 (AM-Softmax, VoxCeleb2)

Speech

5 tasks

Speech Recognition

Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). Assembly AI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.

11.20wer

by Whisper Large-v2✓

20 results · 4 datasets

Text-to-Speech

Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark, VALL-E (Microsoft), and F5-TTS demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy — MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.

4.360mos

by NaturalSpeech 3✓

11 results · 2 datasets

Speaker Verification

Verifying speaker identity from voice samples.

1.180eer

Speech Translation

Translating spoken audio directly to another language.

37.1%bleu

by SeamlessM4T v2 Large

Voice Cloning

Replicating a speaker's voice characteristics.

5.900wer

by VALL-E

40000.0human-normalized-score

Reinforcement Learning

2 tasks

Atari Games

Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency — not just final performance — is the new frontier. The benchmark's age is both its strength (decades of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).

by Go-Explore

12 results · 1 dataset

Continuous Control

Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.

960.0average-return

by TD-MPC2 (317M params)

9 results · 1 dataset

Agentic AI

7 tasks

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.

93.90resolve-rate

by Claude Mythos Preview✓

81 results · 1 dataset

Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.

60.76success-rate

by CoAct-1✓

19 results · 2 datasets

Tool Use

Benchmarks measuring AI agents ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.

No benchmark yet meets our trust bar.

Propose one →

HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.

55.00success-rate

by Claude Opus 4✓

RE-Bench

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.

0.380normalized-score

by o3✓

60.00task-horizon-minutes

Time Horizon

Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value scales exponentially with reliable autonomy duration: an agent that works reliably for 8 hours is not 16x more valuable than one that works for 30 minutes — it's qualitatively different, enabling entirely new categories of delegatable work.

by Claude Opus 4✓

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

80.90pct_resolved

by Claude Opus 4.5

124 results · 10 datasets

Computer Code

6 tasks

Code Generation

Generating code from natural language descriptions (HumanEval, MBPP).

80.90resolve-rate

by Claude Opus 4.5

Code Translation

Converting code between programming languages.

89.40computational-accuracy

by Claude Sonnet 4✓

7 results · 1 dataset

Bug Detection

Identifying bugs and vulnerabilities in code.

78.6%accuracy

by GPT-4o✓

Code Completion

Predicting the next tokens in code sequences.

44.50exact-match

by Claude Sonnet 4✓

Program Repair

Automatically fixing bugs in code.

101.0correct-patches

by SRepair✓

Code Summarization

Generating natural language descriptions of code.

20.0%bleu

by CodeT5-base

Graphs

3 tasks

Node Classification

Node classification — assigning labels to vertices in a graph using both node features and neighborhood structure — is the flagship task for Graph Neural Networks. GCN (Kipf & Welling, 2017) established the Cora/Citeseer/PubMed benchmark trinity, but these datasets are tiny by modern standards and results have saturated well above 85% accuracy. The field has moved toward large-scale heterogeneous graphs (ogbn-arxiv, ogbn-products from OGB) and the unsettled debate over whether simple MLPs with neighborhood features can match GNNs, as shown by SIGN and SGC ablations.

83.5%accuracy

by ACNet✓

6 results · 2 datasets

Link Prediction

Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-target discovery, and social network recommendation. TransE (2013) launched the knowledge graph embedding era, and the field matured through DistMult, RotatE, and CompGCN, benchmarked on FB15k-237 and WN18RR. The current frontier is inductive link prediction (generalizing to unseen entities), where GNN-based methods like NBFNet and foundation models like ULTRA (2024) show that a single model can transfer across entirely different knowledge graphs without retraining.

70.98hits_at_50

by PROXI

Molecular Property Prediction

Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from molecular structure — is the workhorse task of AI-driven drug discovery. GNNs operate on molecular graphs while transformer approaches (ChemBERTa, Uni-Mol) use SMILES strings or 3D coordinates. MoleculeNet (2018) and the Therapeutic Data Commons (TDC) provide standardized benchmarks, but the real bottleneck is distribution shift: models trained on known chemical space struggle with novel scaffolds, and the gap between leaderboard accuracy and actual wet-lab utility remains the field's central challenge.

79.70roc_auc

by DGN

Industrial Inspection

1 task

Anomaly Detection

Detecting defects and anomalies in manufacturing (MVTec AD, VisA).

97.40auroc

by AnomalyGPT✓

27 results · 7 datasets

Knowledge Base

3 tasks

Entity Linking

Linking mentions to knowledge base entities.

93.30micro_f1

by GENRE

Knowledge Graph Completion

Predicting missing links in knowledge graphs.

0.415mrr

by NBFNet

Relation Extraction

Extracting relationships between entities from text.

72.7%f1

by LUKE

by SSAE + Softmax (Explainable ASD)✓

Medical

2 tasks

Disease Classification

Diagnosing diseases from medical images or data.

98.2%accuracy

57 results · 9 datasets

Medical Image Segmentation

Segmenting organs and abnormalities in medical images.

92.65mean-dsc

by MedNeXt-L✓

26 results · 4 datasets

Mobile Development

1 task

React Native Code Generation

Evaluating AI models on generating correct, production-quality React Native implementations. Covers animation, navigation, state management, lists, and platform APIs using real-world libraries (Reanimated, React Navigation, Zustand, FlashList).

98.90navigation-satisfaction

by Composer 2✓

40 results · 1 dataset

Reasoning

5 tasks

Mathematical Reasoning

Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.

90.7%accuracy

by Claude Opus 4.5✓

62 results · 4 datasets

Commonsense Reasoning

Commonsense reasoning — answering questions that require everyday knowledge about how the physical and social world works — is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (HellaSwag went from 95% to near-ceiling by 2023), forcing a shift to harder tests like ARC-Challenge and Winoground. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.

91.6%accuracy

by Claude Opus 4.5✓

61 results · 6 datasets

Multi-step Reasoning

Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.

84.0%accuracy

by Gemini 2.5 Pro✓

55 results · 5 datasets

Logical Reasoning

Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.

56.3%accuracy

by GPT-4o

12 results · 4 datasets

Arithmetic Reasoning

Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.

97.2%accuracy

by GPT-4o

6 results · 2 datasets

Time Series

3 tasks

Time Series Forecasting

Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (2023), Google's TimesFM (2024), and Amazon's Chronos showed that a single pretrained model can zero-shot forecast diverse series, rivaling task-specific statistical models like ETS and ARIMA. Yet the Monash benchmark and M-competition lineage (M4, M5) reveal an uncomfortable truth: simple ensembles of statistical methods still win on many univariate tasks. The real battle now is multivariate long-horizon forecasting, where PatchTST and iTransformer compete with state-space models like Mamba.

13.95smapi

by TiDE✓

75 results · 6 datasets

Tabular Classification

Tabular classification — predicting discrete labels from structured rows and columns — remains the one domain where gradient-boosted trees (XGBoost, LightGBM, CatBoost) stubbornly rival deep learning. Despite years of effort, neural approaches like TabNet (2019) and FT-Transformer (2021) only match tree methods on certain splits, and a 2022 NeurIPS study by Grinsztajn et al. confirmed that trees still dominate on medium-sized datasets. The real frontier is AutoML systems (AutoGluon, FLAML) that ensemble both paradigms, and the emerging question of whether foundation models pretrained on millions of tables can finally tip the balance.

88.5%accuracy

by AutoGluon-Tabular✓