Time Horizon (2024)

METR Autonomy Evaluation: Time Horizon

Measures the length of tasks that AI agents can complete autonomously. The task horizon is the task length, in minutes, at which an agent succeeds 50% of the time. A higher value means the agent can handle longer multi-step tasks without human intervention.
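As a rough illustration of how a 50% task horizon can be read off evaluation results, the sketch below fits a logistic curve to success against log2 task length and solves for the length at which predicted success is 50%. The trial data, the function names (`fit_logistic`, `task_horizon_minutes`), and the plain Newton solver are all assumptions made for this example; this is not METR's code or data.

```python
import math

# Hypothetical results: (task length in minutes, successes, attempts).
# These numbers are invented for illustration only.
TRIALS = [(1, 4, 4), (2, 4, 4), (4, 3, 4), (8, 2, 4),
          (16, 1, 4), (32, 0, 4), (64, 0, 4)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iters=25):
    """Fit P(success) = sigmoid(a + b*x) by Newton's method on the log-likelihood."""
    a = b = 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g_a += y - p          # gradient w.r.t. intercept
            g_b += (y - p) * x    # gradient w.r.t. slope
            h_aa += w             # entries of the (negated) Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def task_horizon_minutes(trials):
    """Return the task length (minutes) where fitted success probability is 50%."""
    xs, ys = [], []
    for minutes, successes, attempts in trials:
        xs += [math.log2(minutes)] * attempts
        ys += [1.0] * successes + [0.0] * (attempts - successes)
    a, b = fit_logistic(xs, ys)
    # sigmoid(a + b*x) = 0.5 exactly when a + b*x = 0, i.e. x = -a/b in log2 minutes.
    return 2.0 ** (-a / b)


horizon = task_horizon_minutes(TRIALS)
print(f"Estimated 50% task horizon: {horizon:.1f} minutes")
```

Because the invented data is symmetric around 8 minutes in log space, the fitted horizon comes out at about 8 minutes. The solver assumes the data contains both successes and failures that overlap in task length; perfectly separable data would not give a finite fit.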

Metrics: task-horizon-minutes
Current State of the Art

Claude Opus 4 (Anthropic): 60 task-horizon-minutes

task-horizon-minutes Progress Over Time

Showing 3 breakthroughs from Apr 2025 to Sep 2025

[Chart: task-horizon-minutes (y-axis) against Date (x-axis), Apr 2025 to Sep 2025]

Key Milestones

Apr 2025: Claude 3.7 Sonnet, 14.0
Claude 3.7 Sonnet (Feb 2025). ~14 min task horizon. METR autonomy evaluation.

Jun 2025: o3, 30.0 (+114.3%)
OpenAI o3 (Apr 2025). ~30 min task horizon. METR autonomy evaluation.

Sep 2025: Claude Opus 4 (current SOTA), 60.0 (+100.0%)
Claude Opus 4 (2025). ~1 hour task horizon. Estimated from METR trajectory and Anthropic model card.

Total Improvement: 328.6%
Time Span: 5 months
Breakthroughs: 3
Current SOTA: 60.0
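The step-to-step and total improvement percentages follow directly from the three milestone scores. A minimal sketch of that arithmetic, assuming the percentages are simple relative changes rounded to one decimal place (the helper name `pct_change` is made up for this example):

```python
# Milestone scores in task-horizon-minutes, taken from the list above.
milestones = [("Claude 3.7 Sonnet", 14.0), ("o3", 30.0), ("Claude Opus 4", 60.0)]


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Improvement of each milestone over the previous one.
step_gains = [
    round(pct_change(prev[1], curr[1]), 1)
    for prev, curr in zip(milestones, milestones[1:])
]

# Improvement of the latest milestone over the first one.
total_gain = round(pct_change(milestones[0][1], milestones[-1][1]), 1)

print(step_gains, total_gain)
```

Under that assumption, the step gains come out as 114.3 and 100.0 and the total as 328.6, matching the figures shown.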

Top Models Performance Comparison

Top 5 models ranked by task-horizon-minutes

Rank  Model               task-horizon-minutes  % of best
1     Claude Opus 4       60.0                  100.0%
2     o3                  30.0                  50.0%
3     Claude 3.7 Sonnet   14.0                  23.3%
4     o1                  4.0                   6.7%
5     GPT-4 Turbo (2024)  2.0                   3.3%

Best Score: 60.0
Top Model: Claude Opus 4
Models Compared: 5
Score Range: 58.0
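The "% of best" figures and the score range can be reproduced from the five scores. The sketch below assumes "% of best" means each score divided by the top score, and that the score range is the top score minus the lowest one; the variable names are illustrative only.

```python
# Scores in task-horizon-minutes, as listed in the comparison above.
scores = {
    "Claude Opus 4": 60.0,
    "o3": 30.0,
    "Claude 3.7 Sonnet": 14.0,
    "o1": 4.0,
    "GPT-4 Turbo (2024)": 2.0,
}

best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Each model's score as a percentage of the best score, to one decimal place.
pct_of_best = {name: round(s / best_score * 100.0, 1) for name, s in scores.items()}

# Spread between the highest and lowest score.
score_range = best_score - min(scores.values())

for name, pct in pct_of_best.items():
    print(f"{name}: {scores[name]} ({pct}% of best)")
print("Score range:", score_range)
```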

task-horizon-minutes (Primary)

#  Model                    Organization  Score (min)  Date
1  Claude Opus 4 (API)      Anthropic     60           Apr 2025
2  o3 (API)                 OpenAI        30           Apr 2025
3  Claude 3.7 Sonnet (API)  Anthropic     14           Apr 2025
4  o1 (API)                 OpenAI        4            Apr 2025
5  GPT-4 Turbo (2024)       OpenAI        2            Apr 2025

Related Papers (1)

METR: Measuring Autonomy in AI Systems (2025 Update)
Apr 2025. Models: Claude Opus 4, o3, Claude 3.7 Sonnet +2 more