Time Horizon

Time horizon, the length of time an AI agent can work autonomously before requiring human correction, is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest that current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value grows superlinearly with reliable autonomy duration: an agent that works reliably for 8 hours is not merely 16x more valuable than one that works for 30 minutes. It is qualitatively different, enabling entirely new categories of delegatable work.

Datasets: 1 · Results: 0 · Canonical metric: task-horizon-minutes
Canonical Benchmark

METR Time Horizon

Measures the length of tasks AI agents can reliably complete autonomously. Task horizon is the task length, measured by how long the task takes a skilled human, at which the agent succeeds 50% of the time. Higher is better: the agent can handle longer multi-step tasks without human intervention.
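As a rough illustration of how a 50% horizon of this kind could be estimated, the sketch below fits a logistic curve of success against the log of task length and solves for the length at which predicted success is 50%. It is a minimal sketch under stated assumptions, not METR's actual pipeline: the run data are invented, and the names `fit_logistic` and `task_horizon_minutes` are hypothetical.

```python
import math


def fit_logistic(xs, ys, iterations=25):
    """Fit p(success) = sigmoid(a + b * x) with Newton's method (two parameters)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0           # gradient of the log-likelihood
        h_aa = h_ab = h_bb = 0.0  # negated Hessian (positive definite)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def task_horizon_minutes(runs):
    """Task length (minutes) at which the fitted success probability is 50%.

    `runs` is a list of (human_minutes, succeeded) pairs (an assumed format).
    """
    xs = [math.log2(minutes) for minutes, _ in runs]
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    a, b = fit_logistic(xs, ys)
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0, so x = -a / b.
    return 2.0 ** (-a / b)


# Invented example data: task length in human minutes -> successes out of 4 runs.
successes_out_of_4 = {1: 4, 2: 4, 4: 4, 8: 3, 16: 2, 32: 1, 64: 0, 128: 0, 256: 0}
runs = [(m, i < k) for m, k in successes_out_of_4.items() for i in range(4)]

horizon = task_horizon_minutes(runs)
print(f"Estimated 50% task horizon: {horizon:.1f} minutes")
```

The toy data are symmetric around the 16-minute tasks, so the estimate lands at about 16 minutes. Fitting on the log of task length is a design choice made here because task lengths span several orders of magnitude; real evaluation data would also need non-separable outcomes for the fit to converge.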

Primary metric: task-horizon-minutes

Top 10

Leading models on METR Time Horizon.

No results yet.


All datasets

1 dataset tracked for this task.

