Active Benchmarks by Domain
Browse verified, actively maintained benchmarks by problem domain. These are the recommended datasets for evaluating your models.
Finding the right benchmark
Pick your domain
What type of problem? (vision, language, audio, etc.)
Find your task
What specific problem are you solving?
Choose dataset
Which benchmark fits your use case?
Compare results
How does your model stack up?
Browse by problem domain
Natural Language Processing
Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.
Computer Vision
Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.
Reasoning
Testing whether your model can reason logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.
Computer Code
Developing AI coding assistants? Test code generation, completion, translation, bug detection, and repair capabilities.
Agentic AI
Benchmarks for autonomous agents, software engineering agents, web agents, desktop agents, and terminal-based task execution.
Computer Vision
Research focused on enabling computers to interpret and understand visual information from images and videos, including tasks such as image classification, object detection, segmentation, and visual recognition.
Medical
Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.
Time-series
Multimodal
Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.
Mobile Development
Benchmarks evaluating AI code generation for mobile platforms: React Native, Flutter, Swift, and Kotlin. They test real-world patterns such as navigation, animation, state management, and platform APIs.
Natural Language Processing
The field of AI concerned with the interaction between computers and human language, encompassing text understanding, generation, translation, sentiment analysis, and question answering.
Speech
Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.
Industrial Inspection
Building quality control systems? Benchmark anomaly detection, defect classification, and automated visual inspection for manufacturing.
Reinforcement Learning
Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.
Audio
Research on processing, understanding, and generating audio signals, including speech recognition, music generation, sound classification, and audio synthesis.
Graphs
Working with network data? Test graph learning models on node classification, link prediction, and molecular property tasks.
Knowledge Base
Building knowledge systems? Evaluate graph completion, relation extraction, and entity linking performance.
Audio
Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.
General
A broad category encompassing machine learning research and tasks that don't fit specifically into vision or language domains, including general ML methods, optimization, and cross-domain approaches.
Time Series
Predicting future trends or detecting anomalies? Benchmark forecasting accuracy and pattern recognition in sequential data.
Other
Robots
Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.
Adversarial
Need to test model robustness? Benchmark resilience against adversarial attacks and evaluate defense mechanisms.
Methodology
Improving learning efficiency? Test self-supervised, few-shot, transfer, and continual learning approaches.
24 Research Areas
147 Tasks
780 Datasets
9,166 Benchmark Results