HCAST comprises 90 realistic software engineering tasks calibrated against human completion times. It tests whether agents can complete tasks that take humans between 15 minutes and 4 hours. The primary metric is the success rate across all tasks.
Success Rate is the reported evaluation metric for HCAST. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Source |
|---|---|---|---|---|---|
| 01 | Claude Opus 4 | verified | 55 | 2025 | Paper |
| 02 | o3 | verified | 49 | 2025 | Paper |
| 03 | Claude 3.7 Sonnet | verified | 38 | 2025 | Paper |
| 04 | o1 | verified | 28 | 2025 | Paper |
| 05 | Claude 3.5 Sonnet | verified | 18 | 2025 | Paper |
| 06 | GPT-4 Turbo (2024) | verified | 12 | 2023 | Paper |
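
As a rough illustration of the metric behind the Score column, the sketch below computes a success rate as the share of tasks an agent completed. It assumes every task counts equally and that each attempt has a simple pass/fail outcome; the benchmark's actual scoring may weight or aggregate tasks differently. The `TaskResult` type, the `success_rate` function, and the task identifiers are hypothetical names used only for this example.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one agent attempt (hypothetical structure, not an HCAST format)."""
    task_id: str
    succeeded: bool


def success_rate(results: list[TaskResult]) -> float:
    """Return the percentage of tasks that succeeded, weighting all tasks equally."""
    if not results:
        raise ValueError("success rate is undefined for an empty result set")
    passed = sum(1 for r in results if r.succeeded)
    return 100.0 * passed / len(results)


# Made-up outcomes for four illustrative tasks (not real benchmark data).
results = [
    TaskResult("task-001", True),
    TaskResult("task-002", False),
    TaskResult("task-003", True),
    TaskResult("task-004", False),
]

print(f"Success rate: {success_rate(results):.1f}%")
```

Under this equal-weight assumption, a higher value means the agent completed a larger share of the task set, which matches the "higher is better" reading of the leaderboard.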