Codesota · Agentic · Five benchmarks · thirty-plus models · one registry · Issue: April 22, 2026
Editorial · Agentic coding

Coding agents,
measured.

Autocomplete is a solved parlour trick. The harder question is how a model behaves when it is given tools, a repo, and an hour to fix a real bug. Five benchmarks on this page attempt to answer it, each from a different angle.

What follows is the registry as of April 2026 — software engineering, security analysis, observability instrumentation, autonomous task horizon, long-horizon startup planning, and now agent-memory coverage.

§ 01 · Software engineering

SWE-bench Verified.

500 hand-verified GitHub issues drawn from twelve popular Python repositories. This table uses the official all-agent Verified leaderboard; the mini-SWE-agent v2 bash-only slice is lower and should be read separately.


Metric: Resolve rate · higher is better
Suite: Verified subset · 500 tasks
Source: swebench.com
# · Model · Provider · Agent / Scaffold · Date · Resolve
01 · Claude Opus 4.5 medium · Anthropic / UIUC · live-SWE-agent · Dec 2025 · 79.2%
02 · Claude Opus 4.5 · Anthropic / Sonar · Sonar Foundation Agent · Dec 2025 · 79.2%
03 · Doubao-Seed-Code · ByteDance · TRAE · Sep 2025 · 78.8%
04 · Gemini 3 Pro Preview · Google / UIUC · live-SWE-agent · Nov 2025 · 77.4%
05 · Claude Sonnet 4 + GPT-5 · Atlassian · Rovo Dev · Sep 2025 · 76.8%
06 · Claude Sonnet 4 · EPAM · AI/Run Developer Agent · Aug 2025 · 76.8%
07 · Claude Opus 4.5 high · Anthropic / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 76.8%
08 · Mixed frontier models · ACoder · ACoder · Aug 2025 · 76.4%
09 · Gemini 3 Flash high · Google / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.8%
10 · MiniMax M2.5 high · MiniMax / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.8%
11 · Claude Opus 4.6 · Anthropic / SWE-agent · mini-SWE-agent v2 · Feb 2026 · 75.6%
Fig 1 · Resolve rate on the official SWE-bench Verified all-agent leaderboard. Copper rows mark the current top score. The mini-SWE-agent v2 bash-only slice currently tops out at 76.8%, so scaffold labels are part of the result.
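
To see what one of those 500 tasks actually contains, the Verified split is public on Hugging Face. A minimal inspection sketch in Python, assuming the `datasets` library is installed; the field names follow the dataset's published schema.

```python
# Load the SWE-bench Verified split and look at a single task instance.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
print(len(ds))  # 500 hand-verified instances

task = ds[0]
print(task["instance_id"])               # unique task identifier
print(task["repo"])                      # the Python repository the issue comes from
print(task["problem_statement"][:300])   # the GitHub issue text the agent is given
# "patch" holds the maintainer's fix and "test_patch" the grading tests;
# an agent is scored on whether its own patch makes those tests pass.
```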
§ 02 · Security

BinaryAudit, reverse-engineered.

33 tasks testing whether an agent can spot a backdoor or time-bomb planted inside a roughly 40 MB compiled binary. Tools on the table: Ghidra, radare2, patience.


Metric: Detection rate · higher is better
Caveat: False-positive column is adjacent; read them together
Source: QuesmaOrg/binaryaudit
# · Model · Provider · Detect · False +
01 · Gemini 3.1 Pro Preview · Google · 49% · 12%
02 · Claude Opus 4.6 · Anthropic · 49% · 8%
03 · GPT-5.2 Codex XHigh · OpenAI · 46% · 14%
04 · Gemini 3 Pro Preview · Google · 44% · 9%
05 · GPT-5.3 Codex XHigh · OpenAI · 42% · 11%
06 · Claude Sonnet 4.6 · Anthropic · 31% · 7%
07 · DeepSeek v3.2 · DeepSeek · 18% · 22%
08 · Grok 4.1-Fast · xAI · 12% · 86%
Fig 2 · Detection and false-positive rates on BinaryAudit. A false-positive rate that dwarfs the detection rate — Grok 4.1-Fast's 12% / 86% — is an agent flagging everything as suspicious rather than reading the binary.
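
The scores above come from the benchmark's own harness; the sketch below only shows the flavour of a first triage pass an agent might run, using radare2's Python bindings. A hypothetical sketch, assuming `r2pipe` is installed; the binary path and the list of suspicious imports are illustrative, not part of BinaryAudit.

```python
# Hypothetical first-pass triage of a compiled binary with radare2 via r2pipe.
import r2pipe

r2 = r2pipe.open("target.bin")      # illustrative path, not a benchmark artifact
r2.cmd("aaa")                       # full analysis: functions, xrefs, strings
funcs = r2.cmdj("aflj") or []       # recovered functions as JSON
print(f"{len(funcs)} functions recovered")

# Cross-reference a few classically suspicious imports.
for imp in ("sym.imp.system", "sym.imp.execve", "sym.imp.popen"):
    xrefs = r2.cmd(f"axt @ {imp}")  # call sites that reach the import
    if xrefs.strip():
        print(f"{imp} referenced from:\n{xrefs}")

r2.quit()
# An agent that stops at this point flags everything that shells out;
# the false-positive column above punishes exactly that shortcut.
```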
§ 03 · Observability

OTelBench, across eleven languages.

23 tasks asking an agent to add distributed tracing, metrics, and logging to a real codebase using OpenTelemetry SDKs. Eleven languages on the table; the overall field average sits at 14%.


Metric: Pass rate · higher is better
Field avg.: 14% across all tested models
Source: QuesmaOrg/otel-bench
# · Model · Provider · Pass
01 · claude-opus-4.5 · Anthropic · 29%
02 · gpt-5.2 · OpenAI · 26%
03 · claude-sonnet-4.5 · Anthropic · 22%
04 · gemini-3-flash-preview · Google · 19%
05 · gemini-3-pro-preview · Google · 16%
06 · gpt-5.2-codex · OpenAI · 16%
07 · gpt-5.1 · OpenAI · 14%
08 · glm-4.7 · Z.ai · 13%
09 · deepseek-v3.2 · DeepSeek · 12%
10 · gpt-5.1-codex-max · OpenAI · 12%
11 · kimi-k2-thinking · Moonshot AI · 7%
12 · claude-haiku-4.5 · Anthropic · 6%
13 · grok-4 · xAI · 4%
14 · grok-4.1-fast · xAI · 3%
Fig 3 · OTelBench pass rates, fourteen models ranked. The distance between first (29%) and last (3%) is wider than the spread frontier models now show on saturated knowledge benchmarks like MMLU — agentic workloads still separate the field.
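
For a sense of scale, the kind of change a single OTelBench task expects looks roughly like the sketch below: wire a tracer provider into the service, then wrap the hot path in a span. A minimal Python sketch, assuming the `opentelemetry-sdk` package; the service and span names are placeholders, and real tasks also cover metrics and logs across the other ten languages.

```python
# Minimal manual tracing with the OpenTelemetry Python SDK.
# "checkout-service" and charge_card() are placeholder names.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def charge_card(order_id: str, amount_cents: int) -> None:
    # Each unit of work becomes a span; attributes carry the context a trace needs.
    with tracer.start_as_current_span("charge_card") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("charge.amount_cents", amount_cents)
        # existing business logic stays untouched

charge_card("ord-42", 1999)
```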
§ 04 · Autonomy

METR Time Horizon.

How long can an agent work on its own before it fails or asks for help? The 50% time horizon is the length of task, measured by how long it takes a skilled human, at which the agent succeeds half the time. The record has roughly doubled every 4.3 months since 2023.


Metric: 50% time horizon · longer is better
Suite: TH 1.1 · 228-task HCAST · Inspect framework
Source: metr.org/time-horizons
# · Model · Provider · Date · TH-50
01 · Claude Opus 4.6 · Anthropic · Feb 2026 · ~12 hr
02 · GPT-5.3-Codex · OpenAI · Feb 2026 · 350 min
03 · GPT-5.2 · OpenAI · Dec 2025 · 352 min
04 · Claude Opus 4.5 · Anthropic · Nov 2025 · 293 min
05 · Gemini 3 Pro · Google · Nov 2025 · 224 min
06 · GPT-5.1-Codex-Max · OpenAI · Nov 2025 · 224 min
07 · GPT-5 · OpenAI · Aug 2025 · 203 min
08 · o3 · OpenAI · Apr 2025 · 120 min
09 · Claude Opus 4 · Anthropic · 2025 · 101 min
10 · Claude 3.7 Sonnet · Anthropic · Feb 2025 · 60 min
11 · o1 · OpenAI · Dec 2024 · 39 min
Fig 4 · 50% time horizon on METR TH 1.1. The December-2024 top (o1, 39 min) is now the bottom of the table — in fourteen months the horizon has extended from tens of minutes to roughly twelve hours.
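
The doubling claim is easy to sanity-check from any two rows of the table. A quick arithmetic sketch in Python using the o1 and Claude Opus 4.6 rows; note that the 4.3-month figure in the text is METR's fit over a longer window back to 2023, so a two-point estimate will not land on it exactly.

```python
# Implied doubling time between two table rows:
# o1 at 39 min (Dec 2024) and Claude Opus 4.6 at ~12 hr ≈ 720 min (Feb 2026).
import math

months_elapsed = 14                      # Dec 2024 to Feb 2026
horizon_start_min = 39
horizon_end_min = 12 * 60

doublings = math.log2(horizon_end_min / horizon_start_min)   # ~4.2 doublings
doubling_time_months = months_elapsed / doublings            # ~3.3 months

print(f"{doublings:.1f} doublings in {months_elapsed} months")
print(f"implied doubling time: {doubling_time_months:.1f} months")
```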
§ 05 · Long-horizon planning

YC-Bench, a simulated year.

The agent is handed $200K and twelve months. It hires, fires, picks contracts, and handles adversarial clients in a partially observable world. Scores are averaged across three seeds. Only three models ended the year above $1M; half the field finished below its starting capital.


Metric: Ending net worth · higher is better
Seeds: 3 · bankruptcy column counts failed seeds
Source: collinear-ai/yc-bench
# · Model · Provider · Net worth · Bankrupt
01 · Claude Opus 4.6 · Anthropic · $1.27M · 0/3
02 · GLM-5 · Zhipu AI · $1.21M · 0/3
03 · GPT-5.4 · OpenAI · $1.00M · 0/3
04 · Kimi-K2.5 · Moonshot AI · $409K · 1/3
05 · Gemini 3 Flash · Google · $394K · 0/3
06 · Gemini 3.1 Flash Lite · Google · $203K · 1/3
07 · GPT-5.4 Mini · OpenAI · $138K · 1/3
08 · Claude Sonnet 4.6 · Anthropic · $104K · 2/3
09 · Qwen 3.5-397B · Alibaba · $91K · 1/3
10 · Gemini 3.1 Pro · Google · $66K · 1/3
11 · GPT-5.4 Nano · OpenAI · $39K · 1/3
12 · Grok 4.20 Beta · xAI · $25K · 2/3
Fig 5 · Ending net worth averaged over three seeds. Bankruptcies are tallied separately so that a high average driven by one lucky run is visible as such.
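
To make the averaging point concrete: with only three seeds, one strong run can prop up the mean even when the other two fail. A toy sketch; the per-seed figures are hypothetical, chosen only so the average lands on the Claude Sonnet 4.6 row above.

```python
# Hypothetical per-seed outcomes: two bankruptcies and one good run still
# average to $104K, the same headline figure a steadier model could post.
seeds = [0, 0, 312_000]   # illustrative numbers, not published per-seed data

mean_net_worth = sum(seeds) / len(seeds)
bankruptcies = sum(1 for s in seeds if s <= 0)

print(f"mean net worth: ${mean_net_worth:,.0f}")        # $104,000
print(f"bankrupt seeds: {bankruptcies}/{len(seeds)}")   # 2/3
```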
§ 06 · Memory

Agent memory, before the run.

Long-running agents fail when they repeat mistakes they have already solved, preserve stale beliefs, ignore deletes, or cannot show why a preflight decision was made. Memory benchmarks belong in this agentic registry, but local regression artifacts are not the same thing as an official leaderboard score.


Track: Agent Memory Benchmark + preflight-memory artifacts
Gate: No Audrey score until an official AMB run exists
Source: Evilander/Audrey
# · Artifact · Scope · Status · Source
01 · Agent Memory Benchmark (AMB) · Provider harness · Track for official scores · vectorize-io/agent-memory-benchmark
02 · Audrey memory artifacts · Local deterministic evidence · Evidence only; no leaderboard claim · HF report + raw artifacts
03 · Audrey AMB provider request · Evaluation route · Pending official harness run · AMB issue #11
Fig 6 · Agent-memory coverage queue. The Audrey rows are local deterministic regression/performance evidence; CodeSOTA should promote them to scored leaderboard rows only after the AMB harness produces comparable results.
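
What "local deterministic evidence" means in practice is a regression test over the memory layer itself rather than a scored benchmark run. A hypothetical sketch of one such check, targeting the ignore-deletes failure mode named above; the `MemoryStore` interface is invented for illustration and is not the AMB harness or the Audrey artifact format.

```python
# Hypothetical regression check: a deleted fact must not resurface in recall.
# MemoryStore, remember(), forget() and recall() are illustrative names only.
class MemoryStore:
    def __init__(self) -> None:
        self._facts: dict[str, str] = {}

    def remember(self, key: str, value: str) -> None:
        self._facts[key] = value

    def forget(self, key: str) -> None:
        self._facts.pop(key, None)

    def recall(self, query: str) -> list[str]:
        return [v for k, v in self._facts.items() if query in k]

def test_deletes_are_respected() -> None:
    store = MemoryStore()
    store.remember("db.password.rotation", "rotated on 2026-03-01")
    store.forget("db.password.rotation")
    assert store.recall("db.password") == [], "stale belief survived a delete"

test_deletes_are_respected()
```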
§ 07 · Commentary

Why agentic is not code completion.

A code-completion benchmark asks: given this prefix, what is the next token? An agentic benchmark asks: given a goal, a shell, and an hour of wall-clock, what does the model do? The two metrics measure different things, and the scores do not transfer.

On HumanEval, a strong 2023 model clears 90% pass@1. On SWE-bench Verified the same class of model struggles to clear 50% — because solving a real issue requires reading the repo, running tests, interpreting a stack trace, and revising a patch. The failure modes are not compilation errors. They are bad plans.

That is also why scaffolds matter. mini-SWE-agent v2, SWE-agent, Aider, Cline, claude-code — each shapes the model's environment differently. A number without its scaffold is not a meaningful number; the tables above keep them together for exactly that reason.

The scored benchmarks on this page are the closest thing we have to a real job description: fix bugs, audit binaries, instrument services, work alone for a shift, plan a year. The memory queue adds another axis: can the agent avoid repeating itself when the facts change?

§ 08 · Related

What to read next.

Cross-linked · April 2026

Three places to go from here.

Adoption data · OpenRouter models
Inverted view of OpenRouter — every model in the catalog, every agent that uses it, ranked by spend, volume, and adoption.

Adoption data · OpenRouter trends
Vendor share over time. Where the dollar shifted month-over-month and what flipped on the chart.

Sister hub · LLM benchmarks
The full register of frontier LLM benchmarks. Reasoning, code, multimodal, and the rest.