Path to AGI | Updated Dec 2024

Agentic AI Benchmarks

Measuring autonomous AI capabilities with METR's time horizon evaluations. The critical benchmark category for tracking progress toward AGI.

  • 7 months - Doubling Time: time horizon capabilities double every 7 months on average
  • 160 min - Current SOTA: GPT-5.1-Codex-Max 50% time horizon (Dec 2024)
  • ~20 hrs - 6-Month Projection: estimated upper bound by April 2026 (METR forecast)
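The figures above can be checked with a simple extrapolation. This is a minimal sketch, assuming a constant 7-month doubling from the Dec 2024 SOTA; the function name is mine, not METR's. Note that a constant-rate extrapolation over the 16 months to April 2026 gives roughly 13 hours, so the ~20 hr figure reflects METR's upper-bound (accelerating) scenario.

```python
import math

def project_horizon(current_min: float, months_ahead: float,
                    doubling_months: float = 7.0) -> float:
    """Extrapolate a time horizon assuming exponential growth
    with a fixed doubling time."""
    return current_min * 2 ** (months_ahead / doubling_months)

# From 160 min (Dec 2024), 16 months forward to April 2026:
projected = project_horizon(160, 16)
print(f"{projected / 60:.1f} hours")  # → 13.0 hours at a constant 7-month doubling
```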

Why Agentic Benchmarks Matter

Traditional benchmarks (MMLU, HumanEval, etc.) measure single-turn responses. Agentic benchmarks measure sustained autonomous performance: the ability to work independently on complex tasks over extended periods.

METR's evaluations are uniquely positioned to track AGI progress because they measure:

  • Multi-step reasoning - Planning and executing long chains of actions
  • Error recovery - Detecting and fixing mistakes autonomously
  • Real-world tasks - Actual software engineering, not synthetic problems
  • Time horizon - How long an agent can work before failing or needing human help
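The time-horizon metric can be illustrated with a toy estimator. The sketch below fits a logistic curve of success probability against log task duration and reads off where it crosses 50%; this is a simplified illustration of the idea, not METR's actual estimator, and the `runs` data is hypothetical.

```python
import math

def fifty_pct_horizon(results):
    """Fit P(success) = sigmoid(a - b*log2(minutes)) by gradient
    ascent on the log-likelihood; the 50% horizon is where the
    curve crosses 0.5, i.e. 2**(a/b) minutes."""
    a, b, lr = 0.0, 1.0, 0.01
    for _ in range(20000):
        ga = gb = 0.0
        for minutes, ok in results:
            x = math.log2(minutes)
            p = 1 / (1 + math.exp(-(a - b * x)))
            ga += (ok - p)          # dLL/da
            gb += (ok - p) * (-x)   # dLL/db
        a += lr * ga
        b += lr * gb
    return 2 ** (a / b)

# Hypothetical runs: short tasks mostly succeed, long tasks mostly fail.
runs = [(5, 1), (10, 1), (20, 1), (40, 1), (40, 0),
        (80, 1), (80, 0), (160, 0), (320, 0)]
print(f"estimated 50% horizon: {fifty_pct_horizon(runs):.0f} min")
```

The crossover lands between the longest mostly-solved tasks and the shortest mostly-failed ones, which is exactly what the leaderboard's "50% time horizon" column reports.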

METR Leaderboard

Model                 Provider    50% Time Horizon   80% Time Horizon   HCAST   Date
GPT-5.1-Codex-Max *   OpenAI      160 min            30 min             48%     Dec 2024
GPT-5                 OpenAI      137 min            26 min             42%     Dec 2024
o1-preview            OpenAI      120 min            22 min             35%     Sep 2024
GPT-4o                OpenAI      90 min             18 min             22%     Jun 2024
Claude 3 Opus         Anthropic   75 min             15 min             15%     Mar 2024
Claude 2.1            Anthropic   45 min             10 min             10%     Dec 2023
GPT-4                 OpenAI      15 min             5 min              8%      Mar 2023

* current SOTA

Source: evaluations.metr.org | Tasks: github.com/METR/public-tasks
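The 7-month doubling figure can be recovered from the table itself. This sketch fits a least-squares line to log2(50% horizon) against months elapsed since the GPT-4 entry; the variable names are mine, and the fit is an illustration rather than METR's published methodology.

```python
import math

# (months since Mar 2023, 50% time horizon in minutes) from the table above
points = [(0, 15), (9, 45), (12, 75), (15, 90),
          (18, 120), (21, 137), (21, 160)]

def doubling_time(data):
    """Least-squares slope of log2(horizon) vs. time gives doublings
    per month; its reciprocal is the doubling time in months."""
    n = len(data)
    xs = [t for t, _ in data]
    ys = [math.log2(m) for _, m in data]
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1 / slope

print(f"fitted doubling time: {doubling_time(points):.1f} months")
# ≈ 6.4 months, consistent with the 7-month headline figure
```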

Implications for AGI Timeline

The rapid improvement in agentic capabilities suggests that autonomous AI systems capable of extended independent work may arrive sooner than traditional benchmark saturation would indicate.

Key milestones to watch:

  • 4-hour horizon - Half-workday tasks become feasible
  • 8-hour horizon - Full-workday projects achievable autonomously
  • Multi-day horizon - Complex software projects, research tasks

METR's projections suggest the 8-hour milestone could be reached by late 2025 under aggressive extrapolation.
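The milestone dates follow directly from the doubling-time model: the months needed to grow from the current horizon to a target is the doubling time times log2 of the ratio. A minimal sketch, assuming the trend holds from the Dec 2024 SOTA of 160 minutes (function name is mine):

```python
import math

def months_to_reach(target_min, current_min=160, doubling_months=7.0):
    """Months until the 50% horizon reaches target_min, assuming the
    7-month doubling trend continues from the Dec 2024 SOTA."""
    return doubling_months * math.log2(target_min / current_min)

for hours in (4, 8):
    print(f"{hours}-hour horizon: ~{months_to_reach(hours * 60):.0f} "
          f"months after Dec 2024")
```

The 8-hour (480 min) milestone comes out to about 11 months, i.e. late 2025, matching the projection above; the 4-hour milestone arrives in roughly 4 months under the same assumption.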

Related Resources