
Time Horizon.

Time horizon, meaning how long an AI agent can work autonomously before requiring human correction, is arguably the single most important meta-metric for agentic AI. METR's evaluations suggest current frontier agents degrade significantly after 30-60 minutes of autonomous operation, while human software engineers can sustain productive work for hours. The metric matters because economic value grows faster than linearly with reliable autonomy duration: an agent that works reliably for 8 hours is not merely 16x more valuable than one that works for 30 minutes. It is qualitatively different, because it enables entirely new categories of delegatable work.

Datasets: 1
Results: 5
Canonical metric: task-horizon-minutes
§ 02 · Canonical benchmark

The reference dataset.

METR Time Horizon

Measures the length of tasks AI agents can reliably complete autonomously. Task horizon is the task length, measured by how long the task takes a human expert, at which the agent succeeds 50% of the time. Higher = the agent can handle longer multi-step tasks without human intervention.
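To make the 50%-success definition concrete, here is a minimal sketch of how such a horizon could be estimated from per-task results. It assumes a logistic curve fitted to success against the log of task length, and it is not METR's actual code. The function names (`fit_logistic`, `time_horizon_minutes`) and the toy `runs` data are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=3000, lr=0.1):
    """Fit P(success) = sigmoid(a + b * x) by gradient ascent on the mean log-likelihood."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(a + b * x)
            grad_a += err
            grad_b += err * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b


def time_horizon_minutes(runs, p=0.5):
    """Task length (minutes) at which the fitted success probability equals p.

    runs: iterable of (human_minutes, succeeded) pairs. Hypothetical input format.
    """
    log_len = [math.log2(minutes) for minutes, _ in runs]
    mean = sum(log_len) / len(log_len)
    xs = [x - mean for x in log_len]  # centering keeps the fit well conditioned
    ys = [1.0 if ok else 0.0 for _, ok in runs]
    a, b = fit_logistic(xs, ys)
    if b >= 0:
        raise ValueError("success does not fall with task length; horizon undefined")
    target = math.log(p / (1.0 - p))  # logit of the requested success rate
    return 2.0 ** (mean + (target - a) / b)


# Toy data (made up): the agent passes short tasks and fails long ones.
runs = [
    (2, True), (2, True), (4, True), (4, True),
    (8, True), (8, False), (16, True), (16, False),
    (32, True), (32, False), (64, False), (64, False),
    (128, False), (128, False),
]

print(round(time_horizon_minutes(runs), 3))  # 16.0, since the toy data is symmetric around 16 minutes
```

Asking for a stricter success rate, for example `time_horizon_minutes(runs, p=0.8)`, yields a shorter horizon on the same data, which is why the success threshold must be stated alongside any reported horizon.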

Primary metric: task-horizon-minutes
§ 03 · Top 10

Leading models.

Leading models on METR Time Horizon.

#   Model                 task-horizon-minutes   Year   Source
1   Claude Opus 4         60.0                   2025   paper
2   o3                    30.0                   2025   paper
3   Claude 3.7 Sonnet     14.0                   2025   paper
4   o1                    4.00                   2025   paper
5   GPT-4 Turbo (2024)    2.00                   2025   paper


§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

METR Time Horizon (canonical) · 5 results · task-horizon-minutes · Top: Claude Opus 4, 60.0
§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Task agents · Tool Use