ML Tasks
Canonical benchmarks and live SOTA leaderboards across 118 tasks.
Multimodal
10 tasks
Visual Question Answering
visual-question-answering
Visual question answering (VQA) is the original multimodal reasoning task: given an image and a natural language question, produce the correct answer. The VQA dataset (2015) and its rebalanced successor VQAv2 (2017) defined the field, and later benchmarks like GQA, OK-VQA, and TextVQA pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's legacy is establishing that vision-language models need more than pattern matching; they need genuine visual understanding.
Image Captioning
image-to-text
Image captioning, the generation of natural language descriptions of images, launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated. The real frontier is dense captioning: generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as a training signal for other vision-language tasks than as a standalone evaluation.
Any-to-Any
any-to-any
Any-to-any models are the endgame of multimodal AI: a single architecture that can accept and generate any combination of text, images, audio, and video. GPT-4o (2024) was among the first production models to natively process and generate across modalities in real time, and Gemini 2.0 pushed this further with interleaved multimodal outputs. The technical challenge is enormous: unifying tokenization across modalities, preventing modality imbalance (where the model favors text over other outputs), and maintaining quality competitive with specialist models in each domain. Meta's Chameleon and open efforts like NExT-GPT explored this space, but true any-to-any generation at frontier quality remains the province of the largest labs.
Audio-Text-to-Text
audio-text-to-text
Audio-text-to-text is the backbone of voice assistants that actually understand context: models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers and noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far more demanding than what leaderboards capture.
Cross-Modal Retrieval
visual-document-retrieval
Cross-modal retrieval finds the best match between items in different modalities: given text, find the right image; given an image, find the right caption. CLIP (2021) revolutionized the field by learning a shared embedding space from 400M image-text pairs, spawning an entire ecosystem of models like SigLIP, EVA-CLIP, and OpenCLIP that power everything from search engines to generative model guidance. The challenge has shifted from coarse retrieval to fine-grained discrimination: telling apart nearly identical images based on subtle textual differences, or retrieving across underrepresented domains and languages. Recall@K on Flickr30K and COCO may look saturated, but real-world deployment exposes failures on long-tail queries and compositional descriptions.
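The Recall@K metric mentioned above can be sketched in a few lines. This is a minimal illustration, not a benchmark implementation: it assumes one correct candidate per query (the candidate with the same index), whereas Flickr30K and COCO pair each image with several captions. The function name `recall_at_k` and the toy scores are made up for the example.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-image score matrix: 3 text queries x 3 candidate images.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # correct image (index 1) is only ranked second
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
```

In this toy case the second query misses at K=1 but is recovered at K=2, which is the kind of difference the metric is meant to expose.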
Image-Text-to-Image
image-text-to-image
Image-text-to-image covers instruction-guided image editing: taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
image-text-to-text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF can reduce hallucination about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding: models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
image-text-to-video
Image-text-to-video is one of generative AI's hardest unsolved frontiers: animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.
Text-to-Image Generation
text-to-image
Text-to-image generation went from "interesting research" to cultural phenomenon in 18 months. DALL-E 2 (2022) proved diffusion models could produce photorealistic images from text, Stable Diffusion democratized it as open source, and Midjourney v5/v6 set the aesthetic bar that even non-technical users now expect. DALL-E 3 (2023) solved the prompt-following problem by training on highly descriptive captions, Flux pushed open-source quality to near-commercial levels, and Ideogram cracked reliable text rendering in images. The remaining frontiers are compositional generation (multiple objects with specified spatial relationships), consistent character identity across images, and the still-unsolved challenge of reliable hand and finger anatomy.
Video Understanding
video-text-to-text
Video understanding asks models to reason over temporal sequences: answering questions, generating summaries, or detecting events across minutes or hours of footage. Early approaches like VideoBERT and TimeSformer processed short clips, but Gemini 1.5 Pro's million-token context (2024) enabled reasoning over hour-long videos natively, and GPT-4o brought real-time video comprehension. The core bottleneck remains temporal reasoning at scale: models can describe individual frames well but struggle to track causal chains, count repetitions, or understand temporal ordering across long sequences. Video-MME and EgoSchema are pushing evaluation beyond simple recognition toward genuine temporal understanding.
Computer Vision
27 tasks
Optical Character Recognition
Extracting text from document images
Scene Text Detection
Detecting text regions in natural scene images
Scene Text Recognition
Recognizing text in natural scene images
Document Layout Analysis
Analyzing the layout structure of documents
Document Parsing
Parsing document structure and content
Document Image Classification
Classifying documents by type or category
General OCR Capabilities
Comprehensive benchmarks covering multiple aspects of OCR performance.
Handwriting Recognition
Recognizing handwritten text
Table Recognition
Detecting and parsing tables in documents
Object Detection
object-detection
Object detection, finding what's in an image and where, is the backbone of autonomous vehicles, surveillance, and robotics. The two-stage R-CNN lineage (2014–2017) gave way to single-shot detectors like YOLO, now in its 11th iteration and still getting faster. DETR (2020) proved transformers could replace hand-designed components like NMS entirely, spawning a family of end-to-end detectors that dominate COCO leaderboards above 60 mAP. The field's current obsession is open-vocabulary detection that works on any object described in natural language, not just fixed categories.
Image Classification
image-classification
Image classification is the task that launched modern deep learning: AlexNet's 2012 ImageNet win cut top-5 error from roughly 26% to roughly 15% and triggered the entire neural network renaissance. The progression from VGGNet to ResNet to Vision Transformers traces the intellectual history of the field itself. Today's frontier models like EVA-02 and SigLIP push top-1 accuracy to around 90% on ImageNet, but the real action has shifted to efficiency (MobileNet, EfficientNet) and robustness under distribution shift. It is still the default benchmark for new architectures, and the foundation that every other vision task builds on.
Document Understanding
document-question-answering
Document understanding requires parsing visually rich documents (invoices, forms, scientific papers, tables) where layout and typography carry as much meaning as the text itself. LayoutLMv3 (2022) and Donut pioneered layout-aware pretraining, but the game changed when GPT-4V and Claude 3 demonstrated that general-purpose multimodal LLMs could match or exceed specialist models on DocVQA and InfographicsVQA without fine-tuning. The persistent challenges are multi-page reasoning, handling handwritten text mixed with print, and accurately extracting structured data from complex table layouts. This task sits at the intersection of OCR, layout analysis, and language understanding, making it one of the highest-value enterprise AI applications.
Semantic Segmentation
image-segmentation
Semantic segmentation assigns a class label to every pixel. It is the dense prediction problem that underpins autonomous driving, medical imaging, and satellite analysis. FCN (2015) showed you could repurpose classifiers for pixel labeling, DeepLab introduced atrous convolutions and CRFs, and SegFormer (2021) proved transformers dominate here too. State-of-the-art on Cityscapes exceeds 85 mIoU, but ADE20K with its 150 classes remains brutally challenging. The frontier has moved toward universal segmentation models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture.
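For readers unfamiliar with the mIoU figures quoted above, the following is a minimal sketch of how mean intersection-over-union can be computed from per-pixel labels. It works on a single flattened label list; benchmark toolkits typically accumulate intersections and unions over the whole dataset and handle an "ignore" label, which this illustration omits. The function name `mean_iou` and the toy labels are assumptions for the example.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that appear in either pred or target."""
    ious = []
    for c in range(num_classes):
        # Pixels where both prediction and ground truth are class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either prediction or ground truth is class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Six "pixels" with three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)  # per-class IoU: 1/3, 2/3, 1/2
```

Leaderboards usually report this value multiplied by 100, so the toy result of 0.5 would be listed as 50 mIoU.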
Depth Estimation
depth-estimation
Depth estimation recovers 3D structure from 2D images, a problem that haunted computer vision for decades before deep learning cracked monocular depth prediction. The field shifted dramatically with MiDaS (2019) showing that mixing diverse training data beats task-specific models, then again with Depth Anything (2024) proving foundation model scale changes everything. Modern systems achieve roughly 5% relative error on NYU Depth V2, but real-world robustness (handling reflections, transparency, and extreme lighting) remains the frontier. Depth estimation is critical for autonomous driving, AR/VR, and robotics, where accurate 3D perception is non-negotiable.
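The "relative error" above is assumed here to mean the absolute relative error (AbsRel) commonly reported for monocular depth. A minimal sketch under that assumption, skipping pixels with no valid ground truth; the function name and toy values are illustrative only:

```python
def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over pixels with valid (positive) ground truth."""
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in valid) / len(valid)


# Depths in metres; a ground-truth value of 0.0 marks a missing measurement.
gt = [1.0, 2.0, 4.0, 0.0]
pred = [1.1, 1.9, 4.2, 3.0]
err = abs_rel_error(pred, gt)  # (0.1 + 0.05 + 0.05) / 3
```

Because the error is normalized by the true depth, a 10 cm mistake at 1 m counts twice as much as the same mistake at 2 m.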
Image Feature Extraction
image-feature-extraction
Image feature extraction produces dense vector representations that encode visual semantics: the hidden layer outputs that power retrieval, clustering, similarity search, and transfer learning. The field progressed from hand-crafted descriptors (SIFT, SURF) to CNN features (ResNet, EfficientNet) to self-supervised vision transformers like DINOv2 (2023), which produces features so rich they rival task-specific models on segmentation, depth, and classification without any fine-tuning. DINOv2's success proved that visual foundation models can match the "extract and use everywhere" paradigm that BERT established in NLP. The quality of your feature extractor determines the ceiling for virtually every downstream vision task.
Image-to-3D
image-to-3d
Image-to-3D reconstruction infers full 3D geometry from one or a few images, a fundamentally ill-posed problem that recent models solve with learned geometric priors. Traditional multi-view stereo required dozens of calibrated views, but single-image methods like One-2-3-45 (2023) and TripoSR leverage large-scale 3D training data to hallucinate plausible geometry from a single photo. 3D Gaussian Splatting (2023) revolutionized the representation side, enabling real-time rendering of reconstructed scenes. The practical gap is clear: scanned objects still look better than generated ones, but the convenience of snap-and-reconstruct is reshaping e-commerce product visualization and AR content creation.
Image-to-Image
image-to-image
Image-to-image translation covers a vast family of tasks (super-resolution, style transfer, inpainting, colorization, denoising) unified by the idea of learning a mapping between image domains. Pix2Pix (2017) and CycleGAN showed paired and unpaired translation were both learnable, but diffusion models rewrote the playbook entirely. ControlNet (2023) demonstrated that conditioning Stable Diffusion on edges, depth, or poses gives surgical control over generation, while models like SUPIR push restoration quality beyond what was thought possible. It is the Swiss army knife of visual AI: nearly every creative and restoration workflow runs through some form of image-to-image.
Image-to-Video
image-to-video
Image-to-video generation animates a single still image into a coherent video sequence. It is one of the hardest generation tasks because it demands both visual fidelity and temporal consistency. Stable Video Diffusion (2023) proved that fine-tuning image diffusion models on video data produces remarkably stable motion, and Runway's Gen-3 and Kling showed commercial viability. The key challenge remains physics-aware motion: objects should move naturally, lighting should evolve consistently, and the camera should behave like a real one. A cornerstone of the emerging AI filmmaking pipeline.
Keypoint Detection
keypoint-detection
Keypoint detection localizes specific anatomical or structural landmarks (body joints, facial features, hand articulations), enabling pose estimation, gesture recognition, and motion capture. OpenPose (2017) first demonstrated real-time multi-person pose estimation, and the field has since progressed through HRNet, ViTPose, and RTMPose pushing both accuracy and speed. Modern systems detect 133 whole-body keypoints (body + hands + face) in real-time on mobile devices. The applications span from sports biomechanics (analyzing an athlete's form frame-by-frame) to sign language recognition and AR avatar puppeteering.
Mask Generation
mask-generation
Mask generation produces pixel-precise segmentation masks for objects, and Meta's Segment Anything (SAM, 2023) transformed it from a specialized task into a foundational capability. Trained on 11M images with 1B+ masks, SAM demonstrated that a single promptable model (click a point, draw a box, or provide text) could segment virtually anything. SAM 2 (2024) extended this to video with real-time tracking, while EfficientSAM and FastSAM address the original's computational cost. The "foundation model" moment for segmentation, analogous to what GPT-3 was for NLP.
Text-to-3D
text-to-3d
Text-to-3D generates 3D assets (meshes, NeRFs, or Gaussian splats) from text descriptions alone, a capability that barely existed before DreamFusion (2022) showed score distillation sampling could lift 2D diffusion priors into 3D. The field moves at breakneck speed: Magic3D added coarse-to-fine generation, Instant3D achieved single-shot inference, and Meshy and Tripo brought commercial quality. Multi-view consistency remains the core challenge, exemplified by the "Janus problem" where different viewpoints produce contradictory details. The promise of democratizing 3D content creation for games, VR, and e-commerce is driving massive investment.
Text-to-Video
text-to-video
Text-to-video generation is among the most ambitious frontiers in generative AI: synthesizing temporally coherent, physically plausible video from text prompts alone. The field exploded in 2024 with Sora demonstrating cinematic-quality generation, followed by open models like CogVideoX and Mochi pushing accessibility. The core technical challenge is maintaining consistency across frames: characters shouldn't morph, physics should hold, and camera motion should feel intentional. Quality is improving at a staggering pace, but generation still takes minutes per clip and artifacts remain visible under scrutiny. The gap between demos and reliable production tools is real.
Unconditional Image Generation
unconditional-image-generation
Unconditional image generation, producing realistic images from pure noise, is the purest test of a generative model's learned distribution. GANs dominated for years (the ProGAN, StyleGAN, and StyleGAN3 line pushed FID below 3 on FFHQ), but diffusion models dethroned them in both quality and diversity starting with DDPM (2020). The FID metric itself is now questioned as models produce images indistinguishable from real photos. Historically the proving ground for new generative architectures, though the field's energy has largely migrated to conditional generation (text-to-image) where practical applications live.
Video Classification
video-classification
Video classification, recognizing actions and events in clips, extends image understanding into the temporal domain, requiring models to reason about motion, context, and temporal ordering. The field evolved from hand-crafted features (HOG, optical flow) through 3D CNNs (C3D, I3D) to video transformers like TimeSformer and VideoMAE that treat frames as spatiotemporal tokens. Kinetics-400 accuracy now exceeds 88%, but the real challenge is long-form video understanding where events unfold over minutes, not seconds. Essential for content moderation, sports analytics, and security applications.
Video-to-Video
video-to-video
Video-to-video translation transforms existing footage by applying style transfer, temporal super-resolution, relighting, or motion retargeting while preserving temporal coherence across frames. The naive approach of processing frames independently produces unwatchable flicker, so the core technical challenge is enforcing cross-frame consistency. Diffusion-based approaches like Rerender-A-Video and TokenFlow (2023) showed that propagating attention features between frames substantially reduces this flicker. The practical frontier is real-time processing for live video: current methods are offline and slow, but the creative potential for film post-production, video editing, and content repurposing is enormous.
Zero-Shot Image Classification
zero-shot-image-classification
Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training: you describe categories in text, and the model matches. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive accuracy on ImageNet without ever seeing a labeled example. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift this represents is profound: instead of collecting labeled datasets for every new domain, you just describe what you're looking for.
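The "describe categories in text, and the model matches" step can be sketched as a cosine-similarity comparison followed by a softmax. The embeddings below are made-up 3-dimensional placeholders standing in for the outputs of a real image and text encoder, and the `temperature` value is an illustrative assumption rather than any specific model's setting.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb, label_embs, temperature=0.01):
    """Softmax over temperature-scaled similarities between image and label prompts."""
    labels = list(label_embs)
    logits = [cosine(image_emb, label_embs[name]) / temperature for name in labels]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(labels, exps)}


# Placeholder embeddings; a real system would obtain these from trained encoders.
label_embs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a dog": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_emb = [0.8, 0.2, 0.1]
probs = zero_shot_classify(image_emb, label_embs)
best = max(probs, key=probs.get)
```

Adding a new class only requires adding a new text prompt to `label_embs`; no retraining is involved, which is the property the paragraph above describes.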
Zero-Shot Object Detection
zero-shot-object-detection
Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning. It is the open-vocabulary dream of detection. Grounding DINO (2023) married DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World showed different paths to the same goal. The technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs "the blue car" in the same scene). This is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.
Natural Language Processing
19 tasks
Polish LLM General
General-purpose evaluation of language models on Polish language tasks: sentiment, reading comprehension, question answering, cyberbullying detection, and emotional intelligence.
Polish Cultural Competency
Evaluating language models on Polish linguistic and cultural knowledge across art & entertainment, culture & tradition, geography, grammar, history, and vocabulary.
Polish Text Understanding
Evaluating language models on understanding Polish text: sentiment, implicatures, phraseology, tricky questions, and hallucination resistance.
Polish Conversation Quality
Evaluating language models on multi-turn conversation quality in Polish across coding, extraction, humanities, math, reasoning, roleplay, STEM, and writing.
Polish Emotional Intelligence
Evaluating language models on emotional intelligence in Polish: understanding emotional states, predicting emotional responses, and nuanced sentiment analysis.
Text Summarization
summarization
Text summarization compresses documents while preserving key information, a task that became dramatically more capable with LLMs but also harder to evaluate. PEGASUS (2020) and BART set the encoder-decoder baseline, but GPT-4 and Claude produce summaries that human evaluators often prefer over reference summaries, breaking ROUGE as a meaningful metric. CNN/DailyMail and XSum remain standard benchmarks, but the field is moving toward long-document summarization (books, legal filings, earnings calls) where 100K+ token context windows are finally making single-pass summarization feasible. The core unsolved problem is faithfulness: even frontier models hallucinate facts in roughly 5-15% of summaries, making factual consistency the critical metric that separates production-ready from demo-ready.
Text Classification
text-classification
Text classification (sentiment analysis, spam detection, topic labeling) is the gateway drug of NLP and the task where transformers first proved their dominance over LSTMs. BERT (2018) set the template, but the real revolution came when instruction-tuned LLMs like GPT-4 and Llama 3 started matching fine-tuned classifiers zero-shot, threatening to make task-specific training obsolete. SST-2, AG News, and IMDB remain standard benchmarks, though the field increasingly cares about multilingual and low-resource performance where English-centric models still stumble. The open question: does a 70B parameter model doing classification via prompting actually beat a 100M fine-tuned encoder when you factor in latency and cost?
Question Answering
question-answering
Question answering, in both extractive and abstractive forms, is one of the oldest NLP tasks, with benchmarks running from the original SQuAD (2016) to the harder, more realistic Natural Questions and TriviaQA. Models such as ALBERT surpassed the human baseline on SQuAD 2.0 by 2019, effectively saturating the benchmark, but real-world QA over noisy documents, multi-hop reasoning (HotpotQA, MuSiQue), and long-context grounding remain far from solved. The paradigm has shifted from standalone QA models to retrieval-augmented generation (RAG), where the bottleneck moved from answer extraction to retrieval quality. Modern systems like Perplexity and Google's AI Overviews show that production QA is now an end-to-end pipeline problem, not a single-model benchmark.
Natural Language Inference
Determining entailment relationships between sentences (SNLI, MNLI).
Text Ranking
text-ranking
Text ranking is the invisible backbone of every search engine and RAG pipeline. The field was transformed by ColBERT (2020) introducing late interaction, then by instruction-tuned embedding models like E5-Mistral and GTE-Qwen that turned general LLMs into retrieval engines. MS MARCO and BEIR remain the standard battlegrounds, but the real test is zero-shot transfer: can a model trained on web search generalize to legal documents, scientific papers, and code? The gap between supervised and zero-shot performance has shrunk from 15+ points to under 3 in two years.
Named Entity Recognition
token-classification
Named entity recognition (NER) extracts structured mentions (people, organizations, locations, dates) from unstructured text, making it foundational to knowledge graphs, financial compliance, and clinical NLP. CoNLL-2003 English F1 scores have been above 93% since BERT, and current leaders like UniNER and GLiNER push past 95%, but these numbers mask the real difficulty: nested entities, emerging entity types, and cross-lingual transfer where performance drops 10-20 points. The shift from sequence labeling to generative NER (framing extraction as text generation) has opened the door for LLMs to compete, though latency-sensitive production systems still rely on encoder models like DeBERTa-v3 and SpanBERT.
Feature Extraction
feature-extraction
Feature extraction, generating dense vector embeddings from text, is the unsung infrastructure layer powering semantic search, RAG pipelines, clustering, and recommendation systems. Sentence-BERT (2019) made it practical, but the field exploded in 2023-2024 with instruction-tuned embedding models like E5-Mistral, GTE-Qwen2, and Nomic Embed that turned decoder-only LLMs into embedding engines, pushing MTEB scores past 70 average across 50+ tasks. The key insight was that pre-training scale transfers to embedding quality: a 7B parameter embedding model crushes a 110M one on zero-shot retrieval. Matryoshka representation learning (Kusupati et al., 2022) added the ability to truncate embeddings to a shorter prefix without retraining, making deployment flexible across latency and storage budgets.
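The truncation step that Matryoshka-style embeddings allow can be sketched as "keep a prefix, then re-normalize". This is only the deployment-side operation; it assumes the embedding model was trained with a Matryoshka objective, since truncating an ordinary embedding this way generally degrades quality. The function name and the toy vector are illustrative.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale the result to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


# Toy 4-dimensional embedding shortened to 2 dimensions.
full = [3.0, 4.0, 12.0, 0.5]
small = truncate_and_normalize(full, 2)  # [0.6, 0.8]
```

Re-normalizing after truncation keeps dot products equivalent to cosine similarity, so an index built on the shorter vectors can be queried the same way as one built on full-length vectors.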
Machine Translation
translation
Machine translation is the oldest AI grand challenge, running from rule-based systems in the 1950s to the transformer revolution sparked by "Attention Is All You Need" (2017), a translation paper whose architecture now powers most of modern AI. Massively multilingual models such as Google's mT5 (101 languages) and Meta's NLLB-200 (200 languages) pushed coverage far beyond the major language pairs, but the real disruption came from GPT-4 and Claude matching or beating specialized MT systems on WMT benchmarks for high-resource pairs like English-German and English-Chinese. The unsolved frontier is low-resource languages (under 1M parallel sentences), where dedicated models like NLLB still dominate, and literary translation, where preserving style, humor, and cultural nuance remains beyond any system. BLEU scores are increasingly seen as unreliable; human evaluation and newer metrics like COMET and BLEURT are becoming the standard.
Fill-Mask
fill-mask
Fill-mask (masked language modeling) is the original BERT pretraining objective: mask 15% of tokens, predict what goes there. It powered the encoder revolution that dominated NLP from 2018 to 2022 and remains the training signal behind models like RoBERTa, DeBERTa, and XLM-RoBERTa that still run most production classification and NER systems. As a standalone task it has limited direct applications, but probing what a model predicts for masked slots became a key technique for analyzing bias, factual knowledge, and linguistic competence stored in model weights. The task has faded from the research spotlight as decoder-only (GPT-style) pretraining proved more scalable, but encoder models trained with MLM remain the most cost-efficient option for tasks that need fast inference on structured prediction.
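The "mask 15% of tokens" step can be sketched as follows. The 80/10/10 split between `[MASK]`, a random token, and the unchanged token follows the recipe described in the BERT paper; selecting each token independently with probability 0.15 is a simplification made for this sketch, and the function name, vocabulary, and seed are illustrative.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for masked language modeling.

    labels[i] holds the original token at positions the model must predict,
    and None at positions that do not contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append("[MASK]")           # 80%: replace with the mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                inputs.append(tok)                # 10%: keep the original token
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels


tokens = "the cat sat on the mat".split() * 50
vocab = sorted(set(tokens))
inputs, labels = mask_tokens(tokens, vocab)
num_selected = sum(1 for label in labels if label is not None)
```

Leaving some selected tokens unchanged or randomly replaced keeps the model from relying on the `[MASK]` symbol, which never appears at inference time.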
Semantic Textual Similarity
sentence-similarity
Semantic similarity measures how close two pieces of text are in meaning, and it is the foundation of duplicate detection, paraphrase mining, and retrieval. STS Benchmark scores climbed from 70 (GloVe averages) to 86+ with Sentence-BERT, and now exceed 92 with models like GTE-Qwen2 and E5-Mistral that leverage billion-parameter backbones. The real shift was from symmetric similarity (are these two sentences paraphrases?) to asymmetric retrieval (does this passage answer this query?), driven by the RAG revolution that made embedding quality a production-critical metric. Cross-lingual semantic similarity remains a hard frontier: models trained primarily on English still lose 5-10 points when comparing sentences across language families, despite multilingual pretraining.
Table Question Answering
table-question-answering
Table question answering bridges natural language and structured data: asking "what was Q3 revenue?" over a spreadsheet and getting the right cell or computed answer. Google's TAPAS (2020) pioneered joint table-text pre-training, and TAPEX trained on synthetic SQL execution traces to teach models tabular reasoning. The field shifted dramatically when GPT-4 and Claude demonstrated they could reason over tables in-context without any table-specific fine-tuning, often matching or beating specialized models on WikiTableQuestions and SQA. The hard frontier is multi-step numerical reasoning over large tables with hundreds of rows, exactly the kind of task where tool-augmented LLMs that generate and execute code are pulling ahead of pure neural approaches.
Zero-Shot Classification
zero-shot-classification
Zero-shot classification asks a model to categorize text into labels it has never been explicitly trained on, the ultimate test of language understanding and generalization. The breakthrough was the natural language inference (NLI) trick: reframe classification as "does this text entail the label?" using models fine-tuned on MNLI, pioneered by Yin et al. (2019) and popularized by BART-large-MNLI. Today, instruction-tuned LLMs have largely subsumed this approach: GPT-4, Claude, and Llama 3 can classify into arbitrary taxonomies via prompting with near-supervised accuracy. The remaining challenge is consistency and calibration: LLMs are powerful but their predictions can be brittle to prompt phrasing, making them unreliable for high-stakes automated pipelines without careful engineering.
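The NLI reframing described above can be sketched as: turn each candidate label into a hypothesis sentence, score how strongly the text entails it, and pick the best-scoring label. In this sketch the entailment scorer is passed in as a function; `toy_entail_score` is a deliberately crude word-overlap stand-in, not an NLI model, and the hypothesis template is an assumed example.

```python
def zero_shot_nli(text, labels, entail_score, template="This text is about {}."):
    """Pick the label whose hypothesis sentence is most entailed by the text."""
    scores = {
        label: entail_score(text, template.format(label)) for label in labels
    }
    best = max(scores, key=scores.get)
    return best, scores


def toy_entail_score(premise, hypothesis):
    """Stand-in scorer: counts shared words between premise and hypothesis.

    A real system would return the entailment probability from a model
    fine-tuned on an NLI dataset such as MNLI.
    """
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = set(hypothesis.lower().replace(".", "").split())
    return len(premise_words & hypothesis_words)


text = "The new sports stadium opens on Friday"
best, scores = zero_shot_nli(text, ["sports", "politics", "cooking"], toy_entail_score)
```

The label set is just a list of strings, so changing the taxonomy means changing that list rather than retraining a classifier.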
Language Modeling
text-generation
Language modeling, predicting the next token, is the pretraining objective that accidentally became the foundation of modern AI. From GPT-2's "too dangerous to release" moment in 2019 to GPT-4, Claude, Llama 3, and Gemini, scaling language models has produced emergent capabilities no one predicted from loss curves alone. Perplexity on benchmarks like WikiText-103 and Penn Treebank is essentially a historical artifact now; the field evaluates via downstream tasks (MMLU, HumanEval, MATH) because raw perplexity stopped correlating with usefulness years ago. The frontier has moved to mixture-of-experts architectures (Mixtral, DeepSeek-V3), longer context windows (1M+ tokens), and efficient inference: the model is no longer the bottleneck; serving it is.
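For reference, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. A minimal sketch, assuming natural-log probabilities are already available for every token in the evaluation text (the function name and toy values are illustrative):

```python
import math


def perplexity(token_log_probs):
    """exp(mean negative log-likelihood), with natural-log token probabilities."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 0.25 to every token in a 4-token text.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
```

A perplexity of 4 here means the model is, on average, as uncertain as if it were choosing uniformly among four tokens at each step.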
Reading Comprehension
Understanding and answering questions about passages.
Audio
7 tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Audio Classification
audio-classification
Audio classification identifies what's happening in a sound (music genre, environmental sounds, speaker emotion, language identification) and underpins everything from content moderation to smart home devices. Audio Spectrogram Transformer (AST) and BEATs brought ImageNet-level transfer learning to audio by treating spectrograms as images, achieving >95% accuracy on ESC-50 and roughly 0.5 mAP on AudioSet's 527-class ontology. The paradigm shifted with audio foundation models like CLAP (contrastive language-audio pretraining) and Whisper's encoder, which provide general-purpose audio representations that transfer to downstream tasks with minimal fine-tuning. The hard problems remain fine-grained classification in noisy real-world conditions, rare sound event detection with few examples, and efficient on-device inference for always-listening applications.
Audio-to-Audio
audio-to-audio
Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer: any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was reshaped by neural models such as Meta's Demucs-based denoiser and DCCRN (a top entry in Microsoft's DNS Challenge), and neural noise suppression is now common in video-call software; voice conversion took a leap with RVC and So-VITS-SVC, which enable convincing voice cloning from small amounts of target-speaker audio and sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stem extraction largely a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.
Text-to-Audio
text-to-audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions, a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
Voice Activity Detection
voice-activity-detection
Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?", and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
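To make concrete what a "simpler model" looks like, here is a naive energy-threshold VAD. It is a baseline sketch only, not how Silero VAD or pyannote.audio work; the frame size and threshold are arbitrary assumptions. Because it only measures loudness, a cough or background music would trigger it just as speech does.

```python
from typing import List, Sequence


def energy_vad(samples: Sequence[float], frame_size: int, threshold: float) -> List[bool]:
    """Flag each non-overlapping frame as speech if its mean energy exceeds a threshold.

    A trailing partial frame is dropped for simplicity.
    """
    flags = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        mean_energy = sum(s * s for s in frame) / frame_size
        flags.append(mean_energy > threshold)
    return flags


# Toy signal: one silent frame followed by one loud frame.
signal = [0.0, 0.0, 0.0, 0.0, 0.5, -0.5, 0.5, -0.5]
speech_flags = energy_vad(signal, frame_size=4, threshold=0.01)
```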
Speech
5 tasks
Speech Recognition
automatic-speech-recognition
Automatic speech recognition went from a specialized pipeline (acoustic model + language model + decoder) to a single end-to-end model with OpenAI's Whisper (2022), which was trained on 680K hours of web audio and became the de facto open-source standard overnight. Whisper large-v3 hits under 5% word error rate on LibriSpeech clean, and commercial APIs from Google, AWS, and Deepgram compete fiercely on noisy, accented, and multilingual speech where error rates are 2-3x higher. The real frontier is real-time streaming ASR at conversational latency (<500ms), code-switching between languages mid-sentence, and robust recognition of domain-specific terminology (medical, legal, technical). AssemblyAI's Universal-2 and Deepgram's Nova-3 currently lead production benchmarks, but the gap with fine-tuned Whisper variants is narrow.
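Word error rate, the metric quoted above, is the word-level edit distance between the reference transcript and the system output, divided by the number of reference words. The sketch below is a plain implementation that assumes whitespace tokenization and no text normalization; real evaluations usually normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.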
Text-to-Speech
text-to-speech
Text-to-speech has undergone a stunning transformation from robotic concatenation to near-human expressiveness in under five years. ElevenLabs, OpenAI's TTS, and XTTS-v2 produce speech that most listeners cannot distinguish from recordings, while open models like Bark and F5-TTS, along with Microsoft's research model VALL-E, demonstrated that voice cloning from 3-second samples is now a commodity capability. The frontier has moved beyond intelligibility (solved) to prosody, emotion control, and real-time streaming at under 200ms latency for conversational AI. Evaluation remains messy: MOS (Mean Opinion Score) is subjective and expensive, and automated metrics like UTMOS only loosely correlate with human preference, making benchmark comparisons unreliable.
Speaker Verification
Verifying speaker identity from voice samples.
Speech Translation
Translating spoken audio directly to another language.
Voice Cloning
Replicating a speaker's voice characteristics.
Reinforcement Learning
3 tasks
Atari Games
reinforcement-learning
Atari games became the canonical RL benchmark when DeepMind's DQN (2013) learned to play Breakout from raw pixels, but the goalposts keep moving. Agent57 (2020) was the first to achieve superhuman scores on all 57 games, and recent work like BBF and MEME shows that sample efficiency, not just final performance, is the new frontier. The benchmark's age is both its strength (more than a decade of comparable results) and weakness (it doesn't capture the open-ended reasoning modern RL needs).
Continuous Control
Continuous control — learning smooth motor commands in simulated physics — was transformed by MuJoCo and the OpenAI Gym suite in the mid-2010s. SAC (2018) and TD3 became reliable baselines, but the field shifted toward harder locomotion (humanoid parkour, dexterous hands) and sim-to-real transfer after DeepMind's dm_control and Isaac Gym raised the bar. DreamerV3 (2023) showed that world-model approaches can match or beat model-free methods across dozens of control tasks with a single hyperparameter set, signaling a move toward generalist RL agents.
Offline RL
Offline RL — learning policies from fixed datasets without further environment interaction — matters because most real-world domains (healthcare, robotics, autonomous driving) can't afford online exploration. CQL (2020) and IQL (2022) established strong baselines on the D4RL benchmark, but the field was disrupted by Decision Transformer (2021), which recast RL as sequence modeling. The latest wave uses pretrained language models as policy backbones, blurring the line between offline RL and in-context learning, with benchmarks like CORL tracking reproducibility across dozens of algorithms.
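The sequence-modeling view used by Decision Transformer conditions each step on the return-to-go, the sum of rewards from that step to the end of the trajectory. The sketch below shows only that data preparation, assuming undiscounted returns; the helper names and the string placeholders for states and actions are illustrative.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Return-to-go at step t is the sum of rewards from t to the end (undiscounted)."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def to_sequence(
    states: Sequence[Any], actions: Sequence[Any], rewards: Sequence[float]
) -> List[Tuple[float, Any, Any]]:
    """Arrange an offline trajectory as (return-to-go, state, action) triples."""
    return list(zip(returns_to_go(rewards), states, actions))


trajectory = to_sequence(
    states=["s0", "s1", "s2"],
    actions=["a0", "a1", "a2"],
    rewards=[1.0, 0.0, 2.0],
)
```

At inference time the first return-to-go is set to a desired target return, and the model is asked to predict the action that follows.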
Adversarial
2 tasks
Agentic AI
6 tasks
SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 189-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines: every task has known completion times from skilled human professionals, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.
Time Horizon
Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value scales exponentially with reliable autonomy duration: an agent that works reliably for 8 hours is not 16x more valuable than one that works for 30 minutes — it's qualitatively different, enabling entirely new categories of delegatable work.
Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
Computer Code
6 tasks
Code Generation
Generating code from natural language descriptions (HumanEval, MBPP).
Code Translation
Converting code between programming languages.
Bug Detection
Identifying bugs and vulnerabilities in code.
Code Completion
Predicting the next tokens in code sequences.
Program Repair
Automatically fixing bugs in code.
Code Summarization
Generating natural language descriptions of code.
Graphs
4 tasks
Node Classification
graph-ml
Node classification (assigning labels to vertices in a graph using both node features and neighborhood structure) is the flagship task for Graph Neural Networks. GCN (Kipf & Welling, 2017) established the Cora/Citeseer/PubMed benchmark trinity, but these datasets are tiny by modern standards and results on them have long since saturated. The field has moved toward large-scale graphs (ogbn-arxiv and ogbn-products from OGB) and the unsettled debate over whether simple MLPs with precomputed neighborhood features can match GNNs, as suggested by SGC and SIGN.
Link Prediction
Link prediction — inferring missing or future edges in a graph — underpins knowledge graph completion, drug-target discovery, and social network recommendation. TransE (2013) launched the knowledge graph embedding era, and the field matured through DistMult, RotatE, and CompGCN, benchmarked on FB15k-237 and WN18RR. The current frontier is inductive link prediction (generalizing to unseen entities), where GNN-based methods like NBFNet and foundation models like ULTRA (2024) show that a single model can transfer across entirely different knowledge graphs without retraining.
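As a concrete illustration of knowledge graph embedding, TransE scores a triple (head, relation, tail) by how close head + relation lands to tail in embedding space. The sketch below uses hand-picked 2-D toy vectors, which are purely hypothetical; trained embeddings would be learned from data with a margin-based loss.

```python
import math
from typing import Dict, List, Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """TransE plausibility: negative L2 distance between (head + relation) and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical toy embeddings chosen so that paris + capital_of == france.
entities: Dict[str, List[float]] = {
    "paris": [1.0, 0.0],
    "france": [1.0, 1.0],
    "japan": [3.0, 2.0],
}
capital_of = [0.0, 1.0]

# Link prediction as ranking: which tail best completes (paris, capital_of, ?)
candidates = ["france", "japan"]
ranked = sorted(
    candidates,
    key=lambda name: transe_score(entities["paris"], capital_of, entities[name]),
    reverse=True,
)
```

Benchmarks such as FB15k-237 and WN18RR evaluate exactly this kind of ranking over all candidate entities.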
Molecular Property Prediction
Molecular property prediction — estimating toxicity, solubility, binding affinity, or other properties from molecular structure — is the workhorse task of AI-driven drug discovery. GNNs operate on molecular graphs while transformer approaches (ChemBERTa, Uni-Mol) use SMILES strings or 3D coordinates. MoleculeNet (2018) and the Therapeutic Data Commons (TDC) provide standardized benchmarks, but the real bottleneck is distribution shift: models trained on known chemical space struggle with novel scaffolds, and the gap between leaderboard accuracy and actual wet-lab utility remains the field's central challenge.
Graph Classification
graph-ml
Graph classification (predicting a label for an entire graph, not individual nodes) matters for molecular screening, social network analysis, and program verification. GIN (Xu et al., 2019) formalized the connection between GNN expressiveness and the Weisfeiler-Leman graph isomorphism test, and the TU datasets became standard benchmarks. Recent work on graph transformers (GPS, Exphormer) and higher-order GNNs pushes beyond WL limits, while OGB's ogbg-molhiv and ogbg-molpcba provide more rigorous large-scale evaluation than the classic small-graph benchmarks.
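The Weisfeiler-Leman limit can be seen directly with a small color-refinement sketch. This is a minimal version of the 1-WL test on unlabeled graphs given as adjacency lists; the function name and the example graphs are illustrative. Each round, a node's new color combines its old color with the sorted multiset of its neighbors' colors, and two graphs with different color histograms are certainly non-isomorphic.

```python
from collections import Counter
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def wl_histogram(graph: Graph, rounds: int = 3) -> Counter:
    """Run 1-WL color refinement and return the histogram of final node colors.

    Colors are nested tuples, so they are directly comparable across graphs.
    They grow quickly with `rounds`, which is fine for tiny examples only.
    """
    colors = {node: () for node in graph}  # every node starts with the same color
    for _ in range(rounds):
        colors = {
            node: (colors[node], tuple(sorted(colors[nb] for nb in graph[node])))
            for node in graph
        }
    return Counter(colors.values())


# A 6-cycle and two disjoint triangles: non-isomorphic, but every node has degree 2.
hexagon: Graph = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles: Graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}

# A 3-node path and a triangle differ in degree sequence, so 1-WL separates them.
path: Graph = {0: [1], 1: [0, 2], 2: [1]}
triangle: Graph = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

wl_fails = wl_histogram(hexagon) == wl_histogram(two_triangles)
wl_succeeds = wl_histogram(path) != wl_histogram(triangle)
```

The hexagon versus two-triangles pair is the standard example of graphs that 1-WL, and therefore standard message-passing GNNs, cannot tell apart.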
Industrial Inspection
4 tasks
Anomaly Detection
Detecting defects and anomalies in manufacturing (MVTec AD, VisA).
Steel Defect Detection
Detecting defects in steel production: rolled-in scale, patches, pitting.
Surface Defect Detection
Detecting scratches, dents, and surface imperfections on materials.
Weld Inspection
Detecting weld defects: porosity, cracks, lack of fusion, slag inclusion.
Knowledge Base
3 tasks
Entity Linking
Linking mentions to knowledge base entities.
Knowledge Graph Completion
Predicting missing links in knowledge graphs.
Relation Extraction
Extracting relationships between entities from text.
Medical
4 tasks
Disease Classification
Diagnosing diseases from medical images or data.
Medical Image Segmentation
Segmenting organs and abnormalities in medical images.
Clinical NLP
Processing clinical notes and medical text.
Drug Discovery
Predicting molecular properties and drug interactions.
Methodology
4 tasks
Continual Learning
Learning new tasks without forgetting old ones.
Few-Shot Learning
Learning from very few examples.
Self-Supervised Learning
Learning representations without labeled data.
Transfer Learning
Transferring knowledge between tasks and domains.
Mobile Development
1 task
Reasoning
5 tasks
Mathematical Reasoning
Mathematical reasoning benchmarks — GSM8K, MATH, Minerva, and the competition-level AIME/AMC tests — have become the primary yardstick for frontier model intelligence. OpenAI's o1 and o3 (2024-2025) cracked problems that were previously out of reach by scaling inference-time compute with search and verification. The MATH benchmark went from ~50% (GPT-4, early 2023) to >90% (o1, late 2024) in under two years, but Olympiad-level problems (FrontierMath, Putnam) and formal theorem proving (Lean 4) remain far from solved, preserving mathematical reasoning as the clearest ladder for measuring progress.
Commonsense Reasoning
Commonsense reasoning (answering questions that require everyday knowledge about how the physical and social world works) is measured by benchmarks like CommonsenseQA, PIQA, and HellaSwag. Large language models have largely saturated early benchmarks (HellaSwag went from under 50% for 2019-era models to roughly 95%, about human level, with GPT-4 in 2023), forcing a shift to harder tests like ARC-Challenge and WinoGrande. The uncomfortable insight is that scale alone buys enormous commonsense performance, but adversarial probing still reveals brittle failures on spatial reasoning, temporal logic, and physical intuition that humans find trivial.
Multi-step Reasoning
Multi-step reasoning — maintaining coherent inference chains across 5+ sequential steps — is the meta-capability that determines whether a model can solve complex real-world problems or only handle one-hop questions. Benchmarks like StrategyQA, MuSiQue, and BIG-Bench Hard isolate this ability, and the performance gap between single-step and multi-step tasks remains the widest failure mode of current LLMs. Techniques like chain-of-thought, tree-of-thought, and iterative refinement help, but error accumulation across steps means that 95% per-step accuracy yields only 60% accuracy over 10 steps — a fundamental scaling challenge.
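The 95%-to-60% figure follows from multiplying per-step success probabilities. The sketch below makes the simplifying assumptions explicit: errors are independent across steps, and a single wrong step fails the whole chain. The function name is illustrative.

```python
def chain_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent per-step errors."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    return per_step_accuracy ** steps


ten_step = chain_success_probability(0.95, 10)     # about 0.599
twenty_step = chain_success_probability(0.95, 20)  # about 0.358
```

Under the same assumptions, keeping a 10-step chain above 90% overall would require roughly 99% accuracy per step.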
Logical Reasoning
Logical reasoning — formal deduction, constraint satisfaction, and syllogistic inference — exposes a core weakness in autoregressive language models: they pattern-match rather than prove. Benchmarks like LogiQA, FOLIO, and the ReClor reading comprehension test push models toward deductive rigor, and performance improves substantially with chain-of-thought and self-consistency decoding. But systematic evaluations (2023-2024) show that even frontier models fail on problems requiring more than 3-4 reasoning steps, and neurosymbolic approaches that compile to SAT solvers or proof assistants remain more reliable for true logical correctness.
Arithmetic Reasoning
Arithmetic reasoning — solving computation-heavy problems stated in natural language — tests whether models can reliably execute multi-step calculations. GPT-4 and Claude showed dramatic improvement over GPT-3 on benchmarks like GSM8K's arithmetic subset, but systematic errors on large-number multiplication and multi-digit division persist. Chain-of-thought prompting (Wei et al., 2022) was the breakthrough technique, and tool-augmented approaches (letting models call a calculator) essentially solve the task — making the pure reasoning version a test of memorization vs. genuine computation.
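A calculator tool of the kind mentioned above can be as small as a restricted expression evaluator. The sketch below is one hypothetical way to build it in Python with the standard `ast` module, accepting only numbers and the four basic operators; how a model decides to call the tool is outside the scope of the example.

```python
import ast
import operator

# Only these binary operators are allowed; anything else is rejected.
_ALLOWED_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculate(expression: str):
    """Evaluate a basic arithmetic expression without using eval()."""

    def _eval(node):
        if (
            isinstance(node, ast.Constant)
            and isinstance(node.value, (int, float))
            and not isinstance(node.value, bool)
        ):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_OPS:
            return _ALLOWED_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


# Large-number multiplication is exact here because Python integers
# have arbitrary precision.
product = calculate("123456789 * 987654321")
```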
Robots
4 tasks
Robot Manipulation
Robot manipulation — grasping, placing, and using tools — is where sim-to-real and foundation models meet physical dexterity. DexNet (2017) pioneered data-driven grasp planning, but the field accelerated when contact-rich manipulation was tackled with RL in simulation (DexterousHands, 2023) and then transferred to real hardware. Current state-of-the-art combines diffusion policies (Chi et al., 2023) with large pretrained vision encoders to achieve robust 6-DOF manipulation from a handful of demonstrations, though deformable objects and multi-step assembly remain unsolved.
Robot Navigation
Autonomous navigation — moving through unstructured environments while avoiding obstacles — spans indoor service robots to outdoor last-mile delivery. Classical SLAM (simultaneous localization and mapping) methods like ORB-SLAM still dominate mapping, but end-to-end learning approaches using habitat simulators (Habitat 2.0, iGibson) show promise for semantic navigation ("go to the kitchen"). The Habitat Challenge results reveal that modular pipelines (map → plan → act) consistently beat monolithic learned policies, suggesting that full end-to-end navigation is still years away from displacing classical stacks in production.
Robotics
robotics
End-to-end robotics (learning perception, planning, and control in a single model) entered a new era with vision-language-action (VLA) models. Google's RT-2 (2023) showed that a web-pretrained VLM could directly output robot actions, and the open-source Open X-Embodiment dataset (2023) unified data from 22 robot types across 21 institutions. The key tension is generalization: lab demos on specific robots are plentiful, but a single policy that transfers across embodiments, tasks, and environments remains the holy grail, with π₀ (Physical Intelligence, 2024) and Google's RT-X pushing this frontier.
Sim-to-Real Transfer
Sim-to-real transfer (training policies in simulation and deploying on physical hardware) is the bridge between unlimited virtual data and messy reality. Domain randomization (Tobin et al., 2017) was the first scalable approach, and OpenAI's Rubik's cube hand (2019) showed it could work for dexterous manipulation. The modern toolkit combines photorealistic rendering (Isaac Sim), GPU-accelerated physics (MuJoCo MJX), system identification, and real-world fine-tuning, but the gap persists for contact-rich tasks where simulation physics diverge from reality. Narrowing this gap is existential for robotics: it determines whether lab results actually work in factories and homes.
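Domain randomization amounts to resampling simulator parameters for every training episode so the policy cannot overfit to one set of physics or visuals. The sketch below shows only the sampling step; the parameter names and ranges are hypothetical and not taken from any particular simulator.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    friction: float
    mass_scale: float
    light_intensity: float


# Hypothetical ranges; in practice these are tuned per robot and task.
PARAM_RANGES = {
    "friction": (0.5, 1.5),
    "mass_scale": (0.8, 1.2),
    "light_intensity": (0.3, 1.0),
}


def sample_sim_params(rng: random.Random) -> SimParams:
    """Draw one randomized simulator configuration, uniformly within each range."""
    return SimParams(
        **{name: rng.uniform(low, high) for name, (low, high) in PARAM_RANGES.items()}
    )


rng = random.Random(0)  # seeded so the sampled episodes are reproducible
episode_params = [sample_sim_params(rng) for _ in range(3)]
```

Each episode would then configure the simulator with its `SimParams` before rollout, so the real world ideally looks like just another sample from the training distribution.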
Time Series
4 tasks
Time Series Forecasting
time-series-forecasting
Time-series forecasting exploded in 2023-2025 when foundation models crossed over from NLP. Nixtla's TimeGPT (2023), Google's TimesFM (2024), and Amazon's Chronos showed that a single pretrained model can zero-shot forecast diverse series, rivaling task-specific statistical models like ETS and ARIMA. Yet the Monash benchmark and M-competition lineage (M4, M5) reveal an uncomfortable truth: simple ensembles of statistical methods still win on many univariate tasks. The real battle now is multivariate long-horizon forecasting, where PatchTST and iTransformer compete with state-space models like Mamba.
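To give a sense of how simple a competitive statistical baseline can be, here is the seasonal naive forecast, which repeats the last observed season. It is simpler than the ETS and ARIMA models named above and is shown only as an illustration; the function name and toy series are assumptions.

```python
from typing import List, Sequence


def seasonal_naive(history: Sequence[float], season_length: int, horizon: int) -> List[float]:
    """Forecast by repeating the most recent full season of observations."""
    if season_length <= 0 or len(history) < season_length:
        raise ValueError("history must cover at least one full season")
    last_season = history[-season_length:]
    return [last_season[step % season_length] for step in range(horizon)]


# Two seasons of length 3, forecast 5 steps ahead.
forecast = seasonal_naive([1, 2, 3, 4, 5, 6], season_length=3, horizon=5)
```

Any learned forecaster that cannot beat this on a strongly seasonal series is not yet earning its complexity.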
Tabular Classification
tabular-classification
Tabular classification (predicting discrete labels from structured rows and columns) remains the one domain where gradient-boosted trees (XGBoost, LightGBM, CatBoost) stubbornly hold their own against deep learning. Despite years of effort, neural approaches like TabNet (2019) and FT-Transformer (2021) only match tree methods on certain splits, and a 2022 NeurIPS study by Grinsztajn et al. confirmed that trees still dominate on medium-sized datasets. The real frontier is AutoML systems (AutoGluon, FLAML) that ensemble both paradigms, and the emerging question of whether foundation models pretrained on millions of tables can finally tip the balance.
Tabular Regression
tabular-regression
Tabular regression (predicting continuous values from structured data) powers everything from house-price estimation to demand forecasting and shares the same tree-vs-neural tension as classification. XGBoost and LightGBM remain brutally effective defaults, but recent work on differentiable trees and table-aware transformers such as TabPFN (introduced for classification in 2022 and later extended to regression) showed that meta-learned priors can beat tuned GBDTs on small datasets in seconds. The challenge is distribution shift: real-world regression targets drift over time, and most benchmarks (UCI, Kaggle) are static snapshots that hide this problem entirely.
Time Series Classification
Classifying time series patterns.