Active Benchmarks by Domain

Browse verified, actively maintained benchmarks by problem domain. These are the recommended datasets for evaluating your models.

Finding the right benchmark

1. Pick your domain: What type of problem? (vision, language, audio, etc.)

2. Find your task: What specific problem are you solving?

3. Choose dataset: Which benchmark fits your use case?

4. Compare results: How does your model stack up?

Browse by problem domain

Natural Language Processing

Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.

16 tasks, 27 datasets, 5981 results
Polish LLM General, Polish Cultural Competency, Polish Text Understanding, Polish Conversation Quality, +12 more

Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

23 tasks, 188 datasets, 2039 results
Optical Character Recognition, Scene Text Detection, Document Layout Analysis, Scene Text Recognition, +19 more

Reasoning

Testing if your model can think logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks, 20 datasets, 232 results
Commonsense Reasoning, Mathematical Reasoning, Multi-step Reasoning, Logical Reasoning, +1 more

Computer Code

Developing AI coding assistants? Test code generation, completion, translation, bug detection, and repair capabilities.

6 tasks, 15 datasets, 223 results
Code Generation, Code Translation, Bug Detection, Code Completion, +2 more

Agentic AI

Benchmarks for autonomous agents, software engineering agents, web agents, desktop agents, and terminal-based task execution.

10 tasks, 19 datasets, 184 results
SWE-bench, Task agents, Autonomous Coding, Web & Desktop Agents, +6 more

Computer Vision

Research focused on enabling computers to interpret and understand visual information from images and videos, including tasks such as image classification, object detection, segmentation, and visual recognition.

15 tasks, 202 datasets, 94 results
Object Detection, Image Classification, Image segmentation, OCR, +11 more

Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks, 15 datasets, 83 results
Disease Classification, Medical Image Segmentation, Drug Discovery, Clinical NLP

Time-series

2 tasks, 7 datasets, 75 results
Time-series forecasting, Time-series classification

Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

10 tasks, 26 datasets, 49 results
Visual Question Answering, Image Captioning, Audio-Text-to-Text, Text-to-Image Generation, +6 more

Mobile Development

Benchmarks evaluating AI code generation for mobile platforms (React Native, Flutter, Swift, Kotlin). They test real-world patterns: navigation, animation, state management, and platform APIs.

1 task, 1 dataset, 40 results
React Native Code Generation

Natural Language Processing

The field of AI concerned with the interaction between computers and human language, encompassing text understanding, generation, translation, sentiment analysis, and question answering.

3 tasks, 66 datasets, 31 results
Language Modeling, Text classification, Machine Translation

Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

3 tasks, 6 datasets, 28 results
Speech Recognition, Speaker Verification, Speech Translation

Industrial Inspection

Building quality control systems? Benchmark anomaly detection, defect classification, and automated visual inspection for manufacturing.

4 tasks, 10 datasets, 27 results
Anomaly Detection, Steel Defect Detection, Surface Defect Detection, Weld Inspection

Reinforcement Learning

Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.

3 tasks, 3 datasets, 21 results
Atari Games, Continuous Control, Offline RL

Audio

Research on processing, understanding, and generating audio signals, including speech recognition, music generation, sound classification, and audio synthesis.

5 tasks, 52 datasets, 14 results
Text-to-speech, Voice cloning, Automatic Speech Recognition, Audio-Language Models, +1 more

Graphs

Working with network data? Test graph learning models on node classification, link prediction, and molecular property tasks.

4 tasks, 5 datasets, 12 results
Node Classification, Molecular Property Prediction, Link Prediction, Graph Classification

Knowledge Base

Building knowledge systems? Evaluate graph completion, relation extraction, and entity linking performance.

3 tasks, 3 datasets, 9 results
Relation Extraction, Entity Linking, Knowledge Graph Completion

Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

6 tasks, 8 datasets, 9 results
Music Generation, Sound Event Detection, Audio Captioning, Audio-to-Audio, +2 more

General

A broad category encompassing machine learning research and tasks that don't fit specifically into vision or language domains, including general ML methods, optimization, and cross-domain approaches.

11 tasks, 87 datasets, 8 results
Coding Agents, Video-Language Models, Reinforcement Learning, Retrieval, +7 more

Time Series

Predicting future trends or detecting anomalies? Benchmark forecasting accuracy and pattern recognition in sequential data.

2 tasks, 2 datasets, 7 results
Tabular Classification, Tabular Regression

Other

2 tasks, 9 datasets, 0 results
Robotics, Other

Robots

Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.

3 tasks, 3 datasets, 0 results
Robot Manipulation, Robot Navigation, Sim-to-Real Transfer

Adversarial

Need to test model robustness? Benchmark resilience against adversarial attacks and evaluate defense mechanisms.

2 tasks, 2 datasets, 0 results
Adversarial Robustness, Adversarial Attacks

Methodology

Improving learning efficiency? Test self-supervised, few-shot, transfer, and continual learning approaches.

4 tasks, 4 datasets, 0 results
Self-Supervised Learning, Transfer Learning, Few-Shot Learning, Continual Learning

24 Research Areas

147 Tasks

780 Datasets

9166 Benchmark Results

Browse Ontology | Vote on Benchmarks (188) | Saturated & Legacy | PWC Archive