Time Horizon
Time horizon, meaning how long an AI agent can work autonomously before requiring human correction, is arguably the single most important meta-metric for agentic AI. METR operationalizes it as the length of task, measured by how long the task takes a skilled human, that an agent completes at a given success rate (typically 50%). METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value grows faster than linearly with reliable autonomy duration: an agent that works reliably for 8 hours is not merely 16x more valuable than one that works for 30 minutes; it is qualitatively different, enabling entirely new categories of delegatable work.
Time horizon benchmarks measure how long an AI agent can maintain coherent, goal-directed behavior on tasks requiring hours to days of sustained effort. Current agents reliably handle tasks up to ~30 minutes but degrade significantly on multi-hour tasks, with performance dropping as task complexity and duration increase.
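One way to turn this idea into a single number is to fit a logistic curve of success against the log of task length (in human time) and read off where predicted success crosses a threshold such as 50%. The sketch below is a minimal illustration of that idea, not METR's actual code: the run data is made up, and the names fit_logistic and time_horizon_minutes, the learning rate, and the step count are all assumptions chosen for the example.

```python
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    """Fit P(success) = sigmoid(a + b*x) by plain gradient descent on log-loss."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def time_horizon_minutes(runs, threshold=0.5):
    """runs: list of (human_minutes, succeeded) pairs.

    Returns the task length at which predicted success equals `threshold`.
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if succeeded else 0.0 for _, succeeded in runs]
    mean_x = sum(xs) / len(xs)          # center x so gradient descent converges quickly
    a, b = fit_logistic([x - mean_x for x in xs], ys)
    logit = math.log(threshold / (1.0 - threshold))
    # Solve sigmoid(a + b * (x - mean_x)) = threshold for x, then undo the log2.
    return 2.0 ** (mean_x + (logit - a) / b)

# Hypothetical agent runs: task length in human minutes, and whether the agent succeeded.
runs = [
    (2, True), (4, True), (8, True),
    (15, True), (15, False),
    (30, True), (30, False),
    (60, True), (60, False),
    (120, False), (240, False), (480, False),
]

horizon_50 = time_horizon_minutes(runs, threshold=0.5)
horizon_80 = time_horizon_minutes(runs, threshold=0.8)
print(f"50% horizon: {horizon_50:.0f} min, 80% horizon: {horizon_80:.0f} min")
```

For this made-up data the 50% horizon lands at roughly half an hour, and the 80% horizon is shorter, which is the general pattern: demanding higher reliability shrinks the task length an agent can be trusted with.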
History
Early code and agent benchmarks (HumanEval, GAIA) test tasks completable in minutes
SWE-bench and WebArena push task horizons to 10-30 minutes
RE-Bench tests up to 8-hour agent runs, revealing diminishing returns beyond 2 hours
METR reports on evaluating agents on tasks with 1-day to 1-week horizons
Claude and GPT-4 handle multi-turn conversations spanning hours of interaction
Background agent modes (Claude Code, Devin) enable multi-hour autonomous operation
Time horizon becomes a key axis for comparing agent architectures
How Time Horizon Works
Task Decomposition
Long-horizon tasks are broken into subtasks, with the agent maintaining a high-level plan across the full duration.
Working Memory Management
The agent tracks progress, intermediate results, and context across potentially thousands of steps and tool calls.
Error Recovery
Over long horizons, errors are inevitable — the agent must detect, diagnose, and recover from failures without losing overall progress.
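The detect-and-recover loop described here can be sketched as a retry wrapper that snapshots state before each step and rolls back on failure. This is an illustrative pattern only, not how any particular agent product works; run_with_recovery, the step functions, and the max_retries default are hypothetical names and values.

```python
import copy

def run_with_recovery(steps, state, max_retries=2):
    """Run each step in order; on failure, roll back to the last good state and retry."""
    log = []
    for step in steps:
        snapshot = copy.deepcopy(state)          # last known-good state
        for attempt in range(max_retries + 1):
            try:
                step(state)
                log.append((step.__name__, "ok", attempt))
                break
            except Exception as exc:
                # Discard any partial changes the failed attempt made.
                state.clear()
                state.update(copy.deepcopy(snapshot))
                log.append((step.__name__, f"failed: {exc}", attempt))
        else:
            # Retries exhausted: stop and escalate instead of building on a bad state.
            return state, log, False
    return state, log, True

# Hypothetical steps; run_tests fails once after making a partial change.
def write_code(state):
    state["files"] = ["main.py"]

_calls = {"n": 0}

def run_tests(state):
    _calls["n"] += 1
    state["tests_run"] = True
    if _calls["n"] == 1:
        raise RuntimeError("flaky test runner")
    state["tests_passed"] = True

final_state, log, ok = run_with_recovery([write_code, run_tests], {})
```

The key design choice is that a failed attempt never leaves partial changes behind, so earlier progress (here, the written file) survives while the failing step is retried.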
Priority Management
The agent must decide what to work on next, when to pivot, and when to seek clarification — mimicking human project management.
State Persistence
External memory, file-based notes, and checkpoint mechanisms prevent context loss across long sessions.
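A file-based checkpoint is one simple way to realize the plan tracking and state persistence described above. The sketch below assumes a JSON file holding a list of finished subtasks and free-form notes; the file name, the state layout, and the functions save_checkpoint, load_checkpoint, and run_plan are hypothetical, and the "work" is a stand-in.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot leave a corrupt file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"done": [], "notes": []}
    with open(path) as f:
        return json.load(f)

def run_plan(path, subtasks, stop_after=None):
    """Work through subtasks in order, skipping ones already finished.

    stop_after simulates a session that ends after that many subtasks.
    """
    state = load_checkpoint(path)
    executed = 0
    for name in subtasks:
        if name in state["done"]:
            continue
        if stop_after is not None and executed >= stop_after:
            break
        state["notes"].append(f"finished {name}")   # stand-in for real work
        state["done"].append(name)
        save_checkpoint(path, state)                # persist after every subtask
        executed += 1
    return state

plan = ["read issue", "reproduce bug", "write fix", "run tests"]
with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "agent_state.json")
    first_session = run_plan(ckpt, plan, stop_after=2)   # interrupted early
    second_session = run_plan(ckpt, plan)                # resumes from the file
```

Because the second session reads the checkpoint instead of relying on conversation context, it picks up at "write fix" without repeating the first two subtasks.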
Current Landscape
Time horizon is emerging as a critical capability axis in 2025. Most benchmarks test tasks completable in minutes, but real-world value requires hours or days of coherent work. Current agents show reliable performance up to ~30 minutes, degraded but useful performance at 1-2 hours, and significant struggles beyond that. The key bottleneck is not raw capability but sustained coherence — maintaining goals, context, and quality over extended periods.
Key Challenges
Context degradation — models lose track of early decisions and context as conversations grow long
Error compounding — small mistakes early in a long task cascade into large failures
Planning horizon — agents struggle to anticipate consequences of current decisions 100+ steps ahead
Cost scaling — long-horizon tasks consume large amounts of compute and API credits
Evaluation difficulty — measuring partial progress on incomplete long-horizon tasks is hard
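The error-compounding point can be made concrete with a little arithmetic. The sketch below assumes, as a deliberate simplification, that every step succeeds independently with the same probability and that any single failure sinks the whole task (no recovery); real agents violate both assumptions, so treat the numbers as intuition only. The function names are hypothetical.

```python
import math

def task_success_probability(per_step_success, n_steps):
    """Chance of finishing n_steps with no failure, assuming independent steps."""
    return per_step_success ** n_steps

def steps_until_half(per_step_success):
    """Number of steps at which the overall success chance falls to 50%."""
    return math.log(0.5) / math.log(per_step_success)

for n in (10, 100, 1000):
    print(n, round(task_success_probability(0.99, n), 4))

print(round(steps_until_half(0.99), 1), round(steps_until_half(0.999), 1))
```

Under these assumptions a 99%-reliable step gives only about a 37% chance of getting through 100 steps cleanly, and the 50% point arrives after roughly 69 steps. Raising per-step reliability to 99.9% pushes that point out about tenfold, which is why error recovery matters so much at long horizons.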
Quick Recommendations
Long-horizon task evaluation: RE-Bench / METR task suites. Most rigorous evaluation of multi-hour agent performance.
Extended autonomous operation: Claude Code background mode / Devin. Designed for multi-hour autonomous coding with persistence.
Research on long-horizon agents: OpenHands + external memory. Flexible framework for studying time-horizon scaling.
What's Next
The frontier is reliable multi-day autonomous operation. Key advances needed: (1) persistent memory architectures that don't degrade with scale, (2) hierarchical planning that bridges minutes to days, (3) self-monitoring systems that detect and correct drift from goals. Expect time horizon to become a primary metric alongside accuracy for agent evaluation.
Related Tasks
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 189-task benchmark from METR designed to measure AI autonomy against human-calibrated baselines: tasks are timed on people skilled in the relevant domain, enabling direct human-vs-AI comparison. Tasks span machine learning engineering, cybersecurity, software engineering, and general reasoning at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
SWE-bench
SWE-bench, which asks agents to resolve real GitHub issues from popular Python repositories, became the defining benchmark for AI software engineering after its 2023 release by Princeton researchers. Resolution rates on SWE-bench Verified (a human-validated subset of 500 problems) went from ~4% with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass the repository's test suite. These skills require genuine multi-file reasoning, not just code generation.
Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.