Agentic AI

Measuring autonomous AI capabilities? METR benchmarks track time horizon, multi-step reasoning, and sustained task performance, all key metrics for AGI progress.

5 tasks, 0 datasets

Tasks in Agentic AI

Time Horizon

The length of task, measured in skilled-human completion time, that an AI agent can complete autonomously at a given success rate, typically 50% (METR).

0 datasets

HCAST

Human-Calibrated Autonomy Software Tasks: 90 tasks across cybersecurity, AI R&D, and engineering.

0 datasets

RE-Bench

Machine learning research engineering tasks that require experimentation and implementation.

0 datasets

SWE-bench

Resolving real GitHub issues autonomously.

0 datasets

Autonomous Coding

Extended coding tasks without human intervention.

0 datasets
