Time Horizon (2024)
METR Autonomy Evaluation: Time Horizon
Measures the length of tasks that AI agents can complete autonomously. The task horizon is the task length, in minutes, at which the agent succeeds 50% of the time. Higher is better: the agent can handle longer multi-step tasks without human intervention.
Metrics: task-horizon-minutes
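One way to make the 50%-success definition concrete is to fit a logistic curve of success against the logarithm of task length and read off the length at which the curve crosses 50%. The sketch below does this in plain Python with Newton's method. The fitting procedure, the function names (`fit_logistic`, `time_horizon_minutes`), and the ten task records are illustrative assumptions, not data or code from the evaluation itself.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, iterations=25):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method (two parameters)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g_a += y - p
            g_b += (y - p) * x
            # Entries of the (negated) Hessian, X^T W X.
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def time_horizon_minutes(lengths_min, successes):
    """Return the task length (minutes) at which fitted success equals 50%."""
    xs = [math.log2(t) for t in lengths_min]
    a, b = fit_logistic(xs, successes)
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0, so x = -a / b.
    return 2.0 ** (-a / b), a, b


# Hypothetical records: task length in minutes and whether the agent succeeded.
lengths = [1, 2, 4, 8, 15, 30, 60, 120, 240, 480]
outcomes = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]

horizon, intercept, slope = time_horizon_minutes(lengths, outcomes)
print(f"Estimated 50% task horizon: {horizon:.1f} minutes")
```

Fitting against log2 of the task length means the slope describes how the log-odds of success change each time the task length doubles.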
Current State of the Art
Claude Opus 4 (Anthropic): 60 task-horizon-minutes
task-horizon-minutes Progress Over Time
Showing 3 breakthroughs from Apr 2025 to Sep 2025
Key Milestones
Apr 2025
Claude 3.7 Sonnet
Claude 3.7 Sonnet (Feb 2025). ~14 min task horizon. METR autonomy evaluation.
14.0
Sep 2025
Claude Opus 4 (Current SOTA)
Claude Opus 4 (2025). ~1 hour task horizon. Estimated from METR trajectory and Anthropic model card.
60.0
+100.0%
Total Improvement: 328.6%
Time Span: 5 months
Breakthroughs: 3
Current SOTA: 60.0
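The summary figures follow from the milestone values by simple arithmetic, which the short sketch below reproduces. The helper name `percent_change` is hypothetical. Treating 30 minutes as the score preceding the +100.0% step is an assumption: it is the value that makes the step consistent, and it matches the o3 entry in the comparison table.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


first_milestone = 14.0   # Claude 3.7 Sonnet, Apr 2025
current_sota = 60.0      # Claude Opus 4, Sep 2025

# Total improvement from the first milestone to the current SOTA.
total_improvement = round(percent_change(first_milestone, current_sota), 1)

# Assumed previous best of 30 minutes, which yields the +100.0% step.
last_step = percent_change(30.0, current_sota)

# Time span in months between Apr 2025 and Sep 2025.
span_months = (2025 - 2025) * 12 + (9 - 4)

print(total_improvement, last_step, span_months)
```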
Top Models Performance Comparison
Top 5 models ranked by task-horizon-minutes
Best Score: 60.0
Top Model: Claude Opus 4
Models Compared: 5
Score Range: 58.0
task-horizon-minutes (Primary)
| # | Model | Provider | Access | Score (minutes) | Date |
|---|---|---|---|---|---|
| 1 | Claude Opus 4 | Anthropic | API | 60 | Apr 2025 |
| 2 | o3 | OpenAI | API | 30 | Apr 2025 |
| 3 | Claude 3.7 Sonnet | Anthropic | API | 14 | Apr 2025 |
| 4 | o1 | OpenAI | API | 4 | Apr 2025 |
| 5 | GPT-4 Turbo (2024) | OpenAI | | 2 | Apr 2025 |
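The comparison summary (best score, top model, models compared, score range) can be re-derived from the table rows. The snippet below is a minimal sketch that copies the five scores from the table into a dictionary; the variable names are illustrative only.

```python
# Scores in task-horizon-minutes, copied from the table above.
scores = {
    "Claude Opus 4": 60,
    "o3": 30,
    "Claude 3.7 Sonnet": 14,
    "o1": 4,
    "GPT-4 Turbo (2024)": 2,
}

top_model = max(scores, key=scores.get)
best_score = scores[top_model]
# Score range is the gap between the best and worst listed scores.
score_range = best_score - min(scores.values())
models_compared = len(scores)

print(top_model, best_score, models_compared, score_range)
```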
Related Papers (1)
METR: Measuring Autonomy in AI Systems (2025 Update)
Apr 2025. Models: Claude Opus 4, o3, Claude 3.7 Sonnet, +2 more