HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a benchmark from METR designed to measure AI autonomy against human-calibrated baselines: every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of a bare pass/fail rate, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Beyond bug fixing, HCAST evaluates AI coding agents on a broad spectrum of software engineering work, including feature implementation, refactoring, documentation, testing, and code review. This gives a more holistic view of autonomous coding capability than SWE-bench alone.
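The human-vs-AI comparison that calibration enables can be illustrated with a small sketch. The task names and completion times below are hypothetical, not real HCAST data:

```python
# Sketch of the human-vs-AI speed comparison HCAST's calibration enables.
# Task names and times are hypothetical, not real HCAST data.

human_baseline_minutes = {"fix-null-deref": 15, "add-csv-export": 120}
ai_minutes = {"fix-null-deref": 3, "add-csv-export": 240}

def speed_ratio(ai_t: float, human_t: float) -> float:
    """Ratio of AI time to the human baseline (<1.0 means faster than human)."""
    return ai_t / human_t

ratios = {task: speed_ratio(ai_minutes[task], t)
          for task, t in human_baseline_minutes.items()}
```

Here the agent would be 5x faster than the human baseline on the bug fix but 2x slower on the feature task, which is exactly the kind of per-task-type contrast a single pass/fail number hides.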
History
SWE-bench establishes automated evaluation of coding agents on real GitHub issues
Growing recognition that SWE-bench primarily tests bug fixing, not broader software engineering
Multiple teams propose extensions covering feature work, refactoring, and documentation
HCAST framework proposed to unify evaluation across diverse software task types
HCAST adopted alongside SWE-bench for more comprehensive agent evaluation
Results show agents perform well on localized bug fixes but poorly on architectural changes
How HCAST Works
Task Sampling
Tasks are drawn from multiple categories: bug fixes, feature additions, refactoring, test writing, documentation, and code review.
Environment Setup
Each task includes a repository snapshot, task description, and evaluation criteria specific to the task type.
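A task bundle of this shape might be modeled as follows. The field names and example values are illustrative assumptions, not HCAST's actual schema:

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    """Illustrative task bundle; field names are assumptions, not HCAST's schema."""
    task_id: str
    category: str              # e.g. "bug_fix", "refactoring", "code_review"
    repo_snapshot: str         # commit hash or path of the frozen repository
    description: str           # natural-language task statement given to the agent
    evaluation_criteria: dict  # category-specific checks

spec = TaskSpec(
    task_id="example-001",
    category="bug_fix",
    repo_snapshot="deadbeef",
    description="Fix the crash on empty input.",
    evaluation_criteria={"tests_pass": True, "max_diff_lines": 50},
)
```

Keeping the evaluation criteria inside the task spec is what lets each task type (bug fix vs. documentation vs. review) carry its own notion of success.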
Agent Execution
The coding agent works on the task with access to the repository, terminal, and test infrastructure.
Multi-Dimensional Evaluation
Beyond test-pass rates, evaluation includes code quality metrics, diff minimality, documentation accuracy, and review quality.
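One simple way to fold several such dimensions into a single number is a weighted average. The metric names and weights below are made up for illustration and are not HCAST's actual scoring rule:

```python
def composite_score(metrics: dict, weights: dict) -> float:
    """Weighted average of per-dimension metrics, each in [0, 1].
    Metric names and weights are illustrative, not HCAST's scoring rule."""
    total_weight = sum(weights.values())
    return sum(weights[k] * metrics[k] for k in weights) / total_weight

score = composite_score(
    metrics={"tests_pass": 1.0, "diff_minimality": 0.6, "doc_accuracy": 0.8},
    weights={"tests_pass": 0.5, "diff_minimality": 0.25, "doc_accuracy": 0.25},
)
```

Even with this toy weighting, the key property holds: an agent that passes tests with a sprawling, poorly documented diff scores lower than one that passes with a tight, well-documented change.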
Capability Profiling
Results are broken down by task type, producing a capability profile showing agent strengths and weaknesses across software engineering dimensions.
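The per-task-type breakdown can be sketched as a simple aggregation. The input format (a list of task-type/pass pairs) is an assumption for illustration:

```python
from collections import defaultdict

def capability_profile(results):
    """Turn (task_type, passed) pairs into per-type success rates.
    The input format is an assumption for illustration."""
    by_type = defaultdict(list)
    for task_type, passed in results:
        by_type[task_type].append(passed)
    return {t: sum(outcomes) / len(outcomes) for t, outcomes in by_type.items()}

profile = capability_profile([
    ("bug_fix", True), ("bug_fix", False),
    ("refactoring", False), ("refactoring", False),
])
# profile -> {"bug_fix": 0.5, "refactoring": 0.0}
```

The resulting dictionary is the "capability profile": one success rate per task type rather than one number for the whole benchmark.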
Current Landscape
HCAST represents the field's recognition that autonomous coding evaluation needs to go beyond bug fixing. In 2025, the benchmark reveals that current agents are strongest at localized bug fixes (40-50% success) and weakest at architectural refactoring (<10%) and feature development requiring design decisions (<20%). This capability profile is more informative than a single accuracy number and guides development priorities.
Key Challenges
Evaluation design — measuring 'good refactoring' or 'helpful code review' is inherently subjective
Task diversity — software engineering encompasses too many skill types for any single benchmark to cover
Baseline calibration — human performance varies dramatically across task types, making comparisons difficult
Reproducibility — non-deterministic agent behavior means scores vary across runs
Cost — comprehensive evaluation across many task types is expensive in compute and API calls
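The reproducibility challenge above is usually addressed by repeating runs and reporting both a central value and the spread, rather than a single score. A minimal sketch, with made-up scores:

```python
import statistics

def summarize_runs(scores):
    """Mean and sample standard deviation over repeated agent runs.
    The scores below are made-up examples, not real benchmark results."""
    return statistics.mean(scores), statistics.stdev(scores)

mean, sd = summarize_runs([0.42, 0.48, 0.45, 0.39])
```

Reporting mean plus or minus the standard deviation makes it clear whether a leaderboard gap between two agents is larger than their run-to-run noise.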
Quick Recommendations
Comprehensive agent evaluation
HCAST + SWE-bench Verified
Broadest coverage of software engineering capabilities
Feature development testing
HCAST feature subset
Specifically tests the harder task of adding new functionality, not just fixing bugs
Agent development
Use HCAST profiles to identify weaknesses
Capability profiles show exactly which software engineering skills to improve
What's Next
Expect HCAST-style holistic evaluation to become standard, with additional dimensions covering: security vulnerability remediation, performance optimization, legacy code modernization, and collaborative development (working alongside human developers on shared branches).
Related Tasks
Task agents
AI agents are autonomous software systems that use artificial intelligence to pursue goals and complete tasks on behalf of users, perceiving their environment, making decisions, and acting without constant human intervention. They draw on capabilities such as reasoning, memory, planning, and learning, often built on large language models (LLMs) and other AI tools, to interpret information and carry out complex workflows across industries.
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.
Bioinformatics Agents
LLM-agent benchmarks for computational biology — exploring datasets, running multi-step analyses, and interpreting biological results.