# Agentic AI Benchmarks
Measuring autonomous AI capabilities with METR's time horizon evaluations. The critical benchmark category for tracking progress toward AGI.
## Why Agentic Benchmarks Matter
Traditional benchmarks (MMLU, HumanEval, etc.) measure single-turn responses. Agentic benchmarks measure sustained autonomous performance - the ability to work independently on complex tasks over extended periods.
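Concretely, a single-turn benchmark scores one completion, while an agentic harness loops the model against a task environment until the task is solved, the agent gets stuck, or a time budget runs out. The sketch below shows the shape of such a harness; the `Agent` and `Environment` interfaces and the toy task in the demo are hypothetical illustrations, not METR's actual tooling.

```python
import time
from typing import Protocol


class Agent(Protocol):
    """Hypothetical agent interface: maps the latest observation to the next action."""
    def act(self, observation: str) -> str: ...


class Environment(Protocol):
    """Hypothetical task environment, e.g. a sandboxed repo with a failing test."""
    def reset(self) -> str: ...
    def step(self, action: str) -> tuple[str, bool]: ...  # (observation, task_solved)


def run_agentic_eval(agent: Agent, env: Environment,
                     time_budget_s: float, max_steps: int = 200) -> dict:
    """Run one agentic episode: many model calls judged as a whole trajectory."""
    observation = env.reset()
    start = time.monotonic()
    steps = 0
    solved = False
    while steps < max_steps and time.monotonic() - start <= time_budget_s:
        action = agent.act(observation)          # plan, edit files, run tests, ...
        observation, solved = env.step(action)   # the environment executes the action
        steps += 1
        if solved:
            break
    return {"solved": solved, "steps": steps,
            "elapsed_s": time.monotonic() - start}


if __name__ == "__main__":
    # Trivial stand-ins so the harness runs end to end.
    class ScriptedAgent:
        def act(self, observation: str) -> str:
            return "submit_fix"

    class ToyEnv:
        def reset(self) -> str:
            return "task: make the failing test pass"

        def step(self, action: str) -> tuple[str, bool]:
            return "tests passing", action == "submit_fix"

    print(run_agentic_eval(ScriptedAgent(), ToyEnv(), time_budget_s=60))
```

A real harness would also sandbox the tools and record the full transcript; the point is that scoring depends on the entire trajectory, not a single response.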
METR's evaluations are uniquely positioned to track AGI progress because they measure:
- Multi-step reasoning - Planning and executing long chains of actions
- Error recovery - Detecting and fixing mistakes autonomously
- Real-world tasks - Actual software engineering, not synthetic problems
- Time horizon - How long the agent can work before it fails or needs help (see the estimation sketch after this list)
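One way to estimate a 50% time horizon, broadly in the spirit of METR's methodology, is to fit a logistic success curve against the (log) human completion time of each task and read off where the curve crosses 50%. The sketch below does this with made-up task outcomes; the `results` data and the simple gradient-descent fit are illustrative assumptions, not METR's actual pipeline.

```python
import math

# Hypothetical task outcomes: (human completion time in minutes, agent succeeded?).
results = [(2, True), (4, True), (8, True), (15, True), (30, True),
           (60, True), (120, True), (120, False), (240, False), (480, False)]


def estimate_time_horizon(results, target=0.5, lr=0.1, iters=20000):
    """Fit P(success) = sigmoid(a + b * log2(minutes)) by gradient descent and
    return the task length at which the predicted success rate equals `target`."""
    xs = [math.log2(minutes) for minutes, _ in results]
    ys = [1.0 if ok else 0.0 for _, ok in results]
    a = b = 0.0
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / len(xs)
        b -= lr * grad_b / len(xs)
    # Invert the fitted curve: a + b * x = logit(target), then undo the log2.
    x_target = (math.log(target / (1 - target)) - a) / b
    return 2 ** x_target


if __name__ == "__main__":
    print(f"estimated 50% time horizon: {estimate_time_horizon(results):.0f} minutes")
```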
## METR Leaderboard
| Model | Provider | 50% Time Horizon | 80% Time Horizon | HCAST | Date |
|---|---|---|---|---|---|
| GPT-5.1-Codex-Max | OpenAI | 160 min | 30 min | 48% | Nov 2025 |
| GPT-5 | OpenAI | 137 min | 26 min | 42% | Aug 2025 |
| o1-preview | OpenAI | 120 min | 22 min | 35% | Sep 2024 |
| GPT-4o | OpenAI | 90 min | 18 min | 22% | Jun 2024 |
| Claude 3 Opus | Anthropic | 75 min | 15 min | 15% | Mar 2024 |
| Claude 2.1 | Anthropic | 45 min | 10 min | 10% | Dec 2023 |
| GPT-4 | OpenAI | 15 min | 5 min | 8% | Mar 2023 |
Source: evaluations.metr.org | Tasks: github.com/METR/public-tasks
## Implications for AGI Timeline
The rapid improvement in agentic capabilities suggests that autonomous AI systems capable of extended independent work may arrive sooner than traditional benchmark saturation would indicate.
Key milestones to watch:
- 4-hour horizon - Full workday tasks become feasible
- 8-hour horizon - Single-day projects achievable autonomously
- Multi-day horizon - Complex software projects, research tasks
Aggressive extrapolations of the doubling trend put the 8-hour milestone only a couple of years out; a plain fit to the leaderboard above, sketched below, lands around 2027.
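That extrapolation can be sanity-checked against the 50% time-horizon column above by fitting an exponential trend (a straight line in log space) and solving for when it crosses 240 and 480 minutes. The sketch below does a plain least-squares fit to the table's point estimates; the fractional-year dates are approximations, and the result is only as reliable as the leaderboard values themselves.

```python
import math

# 50% time-horizon points from the leaderboard above, as (fractional year, minutes).
points = [
    (2023.2, 15),   # GPT-4, Mar 2023
    (2023.9, 45),   # Claude 2.1, Dec 2023
    (2024.2, 75),   # Claude 3 Opus, Mar 2024
    (2024.5, 90),   # GPT-4o, Jun 2024
    (2024.7, 120),  # o1-preview, Sep 2024
    (2025.6, 137),  # GPT-5, Aug 2025
    (2025.9, 160),  # GPT-5.1-Codex-Max, Nov 2025
]


def fit_log_trend(points):
    """Least-squares fit of log2(horizon in minutes) against calendar year."""
    xs = [year for year, _ in points]
    ys = [math.log2(minutes) for _, minutes in points]
    n = len(points)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar  # log2(minutes) ~= slope * year + intercept


if __name__ == "__main__":
    slope, intercept = fit_log_trend(points)
    print(f"doubling time: {12 / slope:.1f} months")
    for label, minutes in [("4-hour", 240), ("8-hour", 480)]:
        year = (math.log2(minutes) - intercept) / slope
        print(f"{label} horizon ({minutes} min) crossed around {year:.1f}")
```

With the table's values this gives a doubling time of roughly ten to eleven months and puts the 4-hour and 8-hour crossings around 2026 and 2027 respectively; treat both as rough, since the fit is sensitive to which models and dates are included.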