Agentic AI
Measuring autonomous AI capabilities? METR benchmarks track time horizon, multi-step reasoning, and sustained task performance, all commonly used as indicators of progress toward AGI.
Tasks in Agentic AI
Time Horizon
The length of task, measured by how long it takes a human expert, that an AI agent can complete autonomously at a given success rate, typically 50% (METR).
HCAST
Human-Calibrated Autonomy Software Tasks: 90 tasks across cybersecurity, AI R&D, and engineering.
RE-Bench
Research Engineering tasks requiring experimentation and implementation.
SWE-bench
Resolving real GitHub issues autonomously.
Autonomous Coding
Extended coding tasks without human intervention.