Codesota · Benchmark · METR Time Horizon

METR Time Horizon.

Measures the length of tasks that AI agents can complete autonomously. The task horizon is the task length, measured by how long the task takes a skilled human, at which an agent succeeds 50% of the time. A higher value means the agent can handle longer multi-step tasks without human intervention.
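To make the 50% threshold concrete, here is a minimal sketch of one way to estimate such a horizon: fit a logistic curve of success against the log of task length, then solve for the length where the predicted success rate is 0.5. The attempt data, function names, and fitting settings below are all hypothetical and chosen for illustration; this is not presented as METR's exact procedure.

```python
import math

def fit_logistic(lengths_min, outcomes, lr=0.1, steps=20000):
    """Fit P(success) = sigmoid(a + b * log2(length)) by gradient ascent
    on the mean log-likelihood. Returns the intercept a and slope b."""
    xs = [math.log2(t) for t in lengths_min]
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for x, y in zip(xs, outcomes):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += y - p
            grad_b += (y - p) * x
        a += lr * grad_a / n
        b += lr * grad_b / n
    return a, b

def horizon_minutes(a, b, p=0.5):
    """Task length (minutes) at which the fitted success probability is p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - a) / b)

# Hypothetical attempts: (task length in minutes, successes out of 4 tries).
attempts = [(1, 4), (2, 4), (4, 3), (8, 3), (16, 2), (32, 1), (64, 1), (128, 0)]

lengths, outcomes = [], []
for minutes, wins in attempts:
    for i in range(4):
        lengths.append(minutes)
        outcomes.append(1 if i < wins else 0)

a, b = fit_logistic(lengths, outcomes)
h50 = horizon_minutes(a, b)
print(f"estimated 50% task horizon: {h50:.1f} minutes")
```

The slope `b` comes out negative because success falls as tasks get longer. Passing a different `p` to `horizon_minutes` (for example 0.8) gives a stricter horizon from the same fit.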

§ 01 · Leaderboard

Results by metric.

Task Horizon Minutes

Task Horizon Minutes is the reported evaluation metric for METR Time Horizon. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Task Horizon Minutes: verified, paper, vendor, community, unverified
| Rank | Model | Trust | Score (minutes) | Year | Notes |
|------|-------|-------|-----------------|------|-------|
| 01 | Claude Opus 4 | verified | 60 | 2025 | Claude Opus 4 (2025). ~1 hour task horizon. Estimated from METR trajectory and Anthropic model card. |
| 02 | o3 | verified | 30 | 2025 | OpenAI o3 (Apr 2025). ~30 min task horizon. METR autonomy evaluation. |
| 03 | Claude 3.7 Sonnet | verified | 14 | 2025 | Claude 3.7 Sonnet (Feb 2025). ~14 min task horizon. METR autonomy evaluation. |
| 04 | o1 | verified | 4 | 2025 | OpenAI o1 (Sep 2024). ~4 min task horizon. METR autonomy evaluation. |
| 05 | GPT-4 Turbo (2024) | verified | 2 | 2025 | GPT-4 Turbo (Apr 2024). ~2 min task horizon. METR autonomy eval, metr.org/research/autonomy-evals/ |
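For readers who want to work with these rows programmatically, the following sketch ranks results under the "higher is better" rule and renders minutes in a readable unit. The `Result` type and the helper names are hypothetical; only the model names, trust tiers, and scores come from the leaderboard above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    model: str
    trust: str
    minutes: float  # Task Horizon Minutes

# Scores copied from the leaderboard rows, deliberately out of order.
ROWS = [
    Result("o1", "verified", 4),
    Result("Claude Opus 4", "verified", 60),
    Result("GPT-4 Turbo (2024)", "verified", 2),
    Result("o3", "verified", 30),
    Result("Claude 3.7 Sonnet", "verified", 14),
]

def rank(rows):
    """Sort descending by score, since higher is better for this metric."""
    return sorted(rows, key=lambda r: r.minutes, reverse=True)

def format_horizon(minutes):
    """Render a score as minutes, or as hours once it reaches 60 minutes."""
    if minutes >= 60:
        return f"{minutes / 60:g} h"
    return f"{minutes:g} min"

ranked = rank(ROWS)
for position, row in enumerate(ranked, start=1):
    print(f"{position:02d}  {row.model:<20} {row.trust:<9} {format_horizon(row.minutes)}")
```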
§ 04 · Submit a result

Add to the leaderboard.
