Agentic AI

Time Horizon

Time horizon — how long an AI agent can work autonomously before requiring human correction — is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value scales nonlinearly with reliable autonomy duration: an agent that works reliably for 8 hours is not merely 16x more valuable than one that works for 30 minutes — it's qualitatively different, enabling entirely new categories of delegatable work.


Time horizon benchmarks measure how long an AI agent can maintain coherent, goal-directed behavior on tasks requiring hours to days of sustained effort. Current agents reliably handle tasks up to ~30 minutes but degrade significantly on multi-hour tasks, with performance dropping as task complexity and duration increase.

History

2023

Early agent benchmarks (HumanEval, GAIA) test tasks completable in minutes

2024

SWE-bench and WebArena push task horizons to 10-30 minutes

2024

RE-Bench tests up to 8-hour agent runs, revealing diminishing returns beyond 2 hours

2024

METR reports on evaluating agents on tasks with 1-day to 1-week horizons

2024

Claude and GPT-4 handle multi-turn conversations spanning hours of interaction

2025

Background agent modes (Claude Code, Devin) enable multi-hour autonomous operation

2025

Time-horizon becomes a key axis for comparing agent architectures

How Time Horizon Works

Time Horizon Pipeline
1. Task Decomposition

Long-horizon tasks are broken into subtasks, with the agent maintaining a high-level plan across the full duration.

2. Working Memory Management

The agent tracks progress, intermediate results, and context across potentially thousands of steps and tool calls.

3. Error Recovery

Over long horizons, errors are inevitable — the agent must detect, diagnose, and recover from failures without losing overall progress.

4. Priority Management

The agent must decide what to work on next, when to pivot, and when to seek clarification — mimicking human project management.

5. State Persistence

External memory, file-based notes, and checkpoint mechanisms prevent context loss across long sessions.
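
The loop below is a minimal sketch of how these five pieces fit together. `decompose` and `run_subtask` stand in for LLM-backed planning and execution calls that are not specified here; only the control flow (plan once, checkpoint after every step, retry on failure, defer what cannot be recovered) is illustrated.

```python
import json
from pathlib import Path

CHECKPOINT = Path("agent_state.json")

def load_state():
    """Resume from the last checkpoint if one exists (state persistence)."""
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text())
    return {"plan": [], "done": [], "notes": []}

def save_state(state):
    """Write progress to disk so a crash or context reset loses nothing."""
    CHECKPOINT.write_text(json.dumps(state, indent=2))

def run_long_horizon_task(goal, decompose, run_subtask, max_retries=2):
    """Skeleton of a long-horizon agent loop.

    `decompose(goal)` and `run_subtask(subtask, notes)` are hypothetical
    LLM-backed callables; only the control flow is shown here.
    """
    state = load_state()
    if not state["plan"]:                        # 1. task decomposition
        state["plan"] = decompose(goal)
        save_state(state)

    for subtask in state["plan"]:
        if subtask in state["done"]:             # skip work already finished
            continue
        for attempt in range(max_retries + 1):   # 3. error recovery
            try:
                result = run_subtask(subtask, state["notes"])
                state["notes"].append({"subtask": subtask, "result": result})
                state["done"].append(subtask)    # 2. working memory update
                save_state(state)                # 5. checkpoint after each step
                break
            except Exception as err:
                if attempt == max_retries:
                    # 4. priority management: defer and surface for review
                    state["notes"].append({"subtask": subtask, "error": str(err)})
                    save_state(state)
    return state
```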

Current Landscape

Time horizon is emerging as a critical capability axis in 2025. Most benchmarks test tasks completable in minutes, but real-world value requires hours or days of coherent work. Current agents show reliable performance up to ~30 minutes, degraded but useful performance at 1-2 hours, and significant struggles beyond that. The key bottleneck is not raw capability but sustained coherence — maintaining goals, context, and quality over extended periods.
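
One way to make claims like "reliable up to ~30 minutes" precise is a 50% time horizon in the style METR popularized: the human task length at which the agent's success rate crosses 50%. The sketch below fits a logistic curve to invented success/failure records tagged with human baseline times; the numbers are made up and the fit is a simplification, not METR's exact methodology.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results: human baseline minutes per task, and whether the
# agent succeeded. Real data would come from a suite like HCAST or RE-Bench.
human_minutes = np.array([2, 5, 8, 15, 30, 45, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Model success probability against log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
model = LogisticRegression().fit(X, agent_success)

# p = 0.5 where coef * x + intercept = 0, so x = -intercept / coef.
x50 = -model.intercept_[0] / model.coef_[0][0]
print(f"Estimated 50% time horizon: {2 ** x50:.0f} human-minutes")
```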

Key Challenges

Context degradation — models lose track of early decisions and context as conversations grow long

Error compounding — small mistakes early in a long task cascade into large failures (quantified in the sketch after this list)

Planning horizon — agents struggle to anticipate consequences of current decisions 100+ steps ahead

Cost scaling — long-horizon tasks consume large amounts of compute and API credits

Evaluation difficulty — measuring partial progress on incomplete long-horizon tasks is hard
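
The error-compounding challenge can be quantified with a back-of-the-envelope model: if each step succeeds independently with probability p, an n-step task completes with probability p^n. The snippet below simply evaluates that formula for a few illustrative values.

```python
# Error compounding: if each step succeeds independently with probability p,
# an n-step task completes with probability p**n. Illustrative numbers only.
for p in (0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"p={p}, steps={n}: task success ~ {p ** n:.3%}")
# p=0.99 gives roughly 37% at 100 steps and about 0.004% at 1000 steps,
# which is why small per-step reliability gains translate into much longer
# usable time horizons.
```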

Quick Recommendations

Long-horizon task evaluation: RE-Bench / METR task suites (most rigorous evaluation of multi-hour agent performance)

Extended autonomous operation: Claude Code background mode / Devin (designed for multi-hour autonomous coding with persistence)

Research on long-horizon agents: OpenHands + external memory (flexible framework for studying time-horizon scaling)

What's Next

The frontier is reliable multi-day autonomous operation. Key advances needed: (1) persistent memory architectures that don't degrade with scale, (2) hierarchical planning that bridges minutes to days, (3) self-monitoring systems that detect and correct drift from goals. Expect time-horizon to become a primary metric alongside accuracy for agent evaluation.

Benchmarks & SOTA

Related Tasks

HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
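
As a toy illustration of what human calibration enables, the snippet below compares invented agent completion times against human baselines; the record format and numbers are hypothetical, not HCAST data.

```python
# Each record: (task, human baseline minutes, agent minutes or None if failed).
results = [
    ("fix-flaky-test", 25, 6),
    ("add-api-endpoint", 90, 140),
    ("refactor-auth-module", 240, None),
]

for task, human_min, agent_min in results:
    if agent_min is None:
        print(f"{task}: agent failed (human baseline {human_min} min)")
    else:
        ratio = human_min / agent_min
        print(f"{task}: agent runs at {ratio:.1f}x human speed")
```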

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
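
For reference, the Verified split is distributed on the Hugging Face Hub and can be inspected directly. The dataset ID and field names below match the public release at the time of writing; check the SWE-bench documentation if the schema has changed.

```python
from datasets import load_dataset

# Load the 500-problem Verified split from the Hugging Face Hub.
ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")

example = ds[0]
print(example["instance_id"])               # repository + issue identifier
print(example["repo"])                      # source repository
print(example["problem_statement"][:300])   # the GitHub issue text

# A generated patch is judged on whether it makes the FAIL_TO_PASS tests
# pass while keeping the PASS_TO_PASS tests green.
print(example["FAIL_TO_PASS"])
print(example["PASS_TO_PASS"])
```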

Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
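
The observe-decide-act loop behind such agents can be sketched in a few lines. The Playwright calls below are real browser-automation APIs, but `choose_action` is a hypothetical placeholder for the model policy that grounds the instruction into a click, fill, or stop action.

```python
from playwright.sync_api import sync_playwright

def choose_action(instruction, page_text):
    """Hypothetical policy: in a real agent this is an LLM call that maps the
    instruction plus the current page observation to a grounded action."""
    raise NotImplementedError

def run_web_task(instruction, start_url, max_steps=20):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            observation = page.inner_text("body")   # crude DOM-text observation
            action = choose_action(instruction, observation)
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "fill":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "done":
                break
        browser.close()
```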

