Measures the length of tasks AI agents can complete autonomously. The task horizon is the task length, measured by how long the task takes a skilled human, at which an agent succeeds 50% of the time. A higher value means the agent can handle longer multi-step tasks without human intervention.
Task Horizon Minutes is the reported evaluation metric for METR Time Horizon, with scores expressed in minutes. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
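To make the "50% success" idea concrete, here is a minimal sketch of one way such a horizon can be estimated: fit a logistic curve of success against the log of human task time, then read off the task length where the curve crosses 50%. This is an illustration under stated assumptions, not the benchmark's actual pipeline. The function names (`fit_logistic`, `estimate_horizon_minutes`) and the attempt data are made up for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, max_iter=100, tol=1e-12):
    """Fit P(success) = sigmoid(a + b * x) by Newton's method.

    Assumes the data are not perfectly separable (some short tasks fail
    or some long tasks succeed); otherwise the fit has no finite optimum.
    """
    a = b = 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient w.r.t. intercept
            g1 += (y - p) * x      # gradient w.r.t. slope
            h00 += w               # entries of the 2x2 information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 0.0:
            raise ValueError("degenerate data: cannot fit a curve")
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            break
    return a, b


def estimate_horizon_minutes(attempts):
    """attempts: list of (human_minutes, succeeded) pairs (hypothetical format)."""
    xs = [math.log2(minutes) for minutes, _ in attempts]
    ys = [1.0 if ok else 0.0 for _, ok in attempts]
    a, b = fit_logistic(xs, ys)
    if b >= 0:
        raise ValueError("success does not decrease with task length")
    # sigmoid(a + b * x) = 0.5 exactly when a + b * x = 0, i.e. x = -a / b
    return 2.0 ** (-a / b)


# Made-up data: 4 attempts per task length, as (human_minutes, successes out of 4)
counts = [(1, 4), (2, 4), (4, 3), (8, 2), (16, 1), (32, 0), (64, 0)]
attempts = [(m, i < k) for m, k in counts for i in range(4)]

horizon = estimate_horizon_minutes(attempts)
print(f"{horizon:.1f} minutes")  # 8.0 minutes for this symmetric toy data
```

The toy data are symmetric around the 8-minute tasks, so the fitted curve crosses 50% at 8 minutes. Working in log-time reflects the fact that task lengths span orders of magnitude.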
Higher is better
| Rank | Model | Trust | Score (minutes) | Year | Source |
|---|---|---|---|---|---|
| 01 | Claude Opus 4 | verified | 60 | 2025 | Paper |
| 02 | o3 | verified | 30 | 2025 | Paper |
| 03 | Claude 3.7 Sonnet | verified | 14 | 2025 | Paper |
| 04 | o1 | verified | 4 | 2025 | Paper |
| 05 | GPT-4 Turbo (2024) | verified | 2 | 2025 | Paper |