What task are you working on?

Find the right benchmarks to evaluate your model. Browse by problem domain to discover datasets, evaluation metrics, and current state-of-the-art results.

Papers With Code Archive

1,500+ historical benchmark results with SOTA timeline charts

Browse Archive

Finding the right benchmark

1. Pick your domain: What type of problem? (vision, language, audio, etc.)
2. Find your task: What specific problem are you solving?
3. Choose dataset: Which benchmark fits your use case?
4. Compare results: How does your model stack up?

Browse by problem domain

Computer Vision

Building systems that understand images and video? Find benchmarks for recognition, detection, segmentation, and document analysis tasks.

10 tasks, 28 datasets
Scene Text Detection, Document OCR, Handwriting Recognition, Document Understanding, +6 more

Natural Language Processing

Processing and understanding text? Evaluate your models on language understanding, generation, translation, and information extraction benchmarks.

9 tasks, 6 datasets
Language Modeling, Machine Translation, Question Answering, Text Classification, +5 more

Reasoning

Testing whether your model can reason logically? Benchmark math problem solving, commonsense understanding, and multi-step reasoning capabilities.

5 tasks, 15 datasets
Mathematical Reasoning, Commonsense Reasoning, Logical Reasoning, Multi-step Reasoning, +1 more

Computer Code

Developing AI coding assistants? Test code generation, completion, translation, bug detection, and repair capabilities.

6 tasks, 8 datasets
Code Generation, Code Completion, Code Translation, Code Summarization, +2 more

Speech

Working with voice and audio? Evaluate speech-to-text accuracy, voice synthesis quality, and speaker identification performance.

5 tasks, 4 datasets
Speech Recognition, Text-to-Speech, Speaker Verification, Speech Translation, +1 more

Medical

Building healthcare AI? Find benchmarks for medical imaging, disease diagnosis, clinical text processing, and drug discovery.

4 tasks, 9 datasets
Medical Image Segmentation, Disease Classification, Drug Discovery, Clinical NLP

Audio

Processing general audio signals? Test your models on sound classification, event detection, music analysis, and source separation.

4 tasks, 2 datasets
Audio Classification, Sound Event Detection, Music Generation, Audio Captioning

Time Series

Predicting future trends or detecting anomalies? Benchmark forecasting accuracy and pattern recognition in sequential data.

2 tasks, 1 dataset
Time Series Forecasting, Time Series Classification

Industrial Inspection

Building quality control systems? Benchmark anomaly detection, defect classification, and automated visual inspection for manufacturing.

4 tasks, 7 datasets
Anomaly Detection, Weld Inspection, Surface Defect Detection, Steel Defect Detection

Graphs

Working with network data? Test graph learning models on node classification, link prediction, and molecular property tasks.

4 tasks, 2 datasets
Node Classification, Link Prediction, Graph Classification, Molecular Property Prediction

Multimodal

Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.

5 tasks, 2 datasets
Image Captioning, Visual Question Answering, Text-to-Image Generation, Video Understanding, +1 more

Robots

Building robotic systems? Find benchmarks for manipulation, navigation, and simulation-to-reality transfer.

3 tasks
Robot Manipulation, Robot Navigation, Sim-to-Real Transfer

Reinforcement Learning

Training agents to make decisions? Benchmark your policies on game playing, continuous control, and offline learning tasks.

3 tasks, 2 datasets
Atari Games, Continuous Control, Offline RL

Knowledge Base

Building knowledge systems? Evaluate graph completion, relation extraction, and entity linking performance.

3 tasks
Knowledge Graph Completion, Relation Extraction, Entity Linking

Adversarial

Need to test model robustness? Benchmark resilience against adversarial attacks and evaluate defense mechanisms.

2 tasks
Adversarial Robustness, Adversarial Attacks

Methodology

Improving learning efficiency? Test self-supervised, few-shot, transfer, and continual learning approaches.

4 tasks
Self-Supervised Learning, Transfer Learning, Few-Shot Learning, Continual Learning

16 Research Areas

73 Tasks

86 Datasets

258 Benchmark Results