HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
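The two views described above, pass/fail success rate (the canonical metric) and human-vs-AI speed comparison, can be sketched as follows. This is a minimal illustration; the `TaskResult` schema and its field names are hypothetical, not an HCAST data format:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    # Hypothetical per-task record; HCAST does not prescribe this schema.
    task_id: str
    human_minutes: float            # calibrated human completion time
    agent_succeeded: bool
    agent_minutes: Optional[float]  # agent wall-clock time, if it finished

def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks the agent completed: the canonical pass/fail view."""
    return sum(r.agent_succeeded for r in results) / len(results)

def speed_ratio(r: TaskResult) -> Optional[float]:
    """Human time divided by agent time; > 1 means faster than the human baseline."""
    if not r.agent_succeeded or not r.agent_minutes:
        return None
    return r.human_minutes / r.agent_minutes

# Toy data: one solved task, one failed task.
results = [
    TaskResult("fix-bug", human_minutes=15.0, agent_succeeded=True, agent_minutes=5.0),
    TaskResult("refactor-module", human_minutes=240.0, agent_succeeded=False, agent_minutes=None),
]
print(success_rate(results))    # 0.5
print(speed_ratio(results[0]))  # 3.0 (agent 3x faster than the human baseline)
```

The speed ratio is only defined for solved tasks, which is why the human calibration adds information beyond the raw success rate.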

Datasets: 1 · Results: 0 · Canonical metric: success-rate · Canonical benchmark

90 realistic software engineering tasks calibrated against human performance times. Tests whether agents can complete tasks that take humans 15 minutes to 4 hours. Primary metric: success rate across all tasks.


Top 10

Leading models on HCAST.

No results yet.

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Agentic AI.