HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 189-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
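The human-vs-AI comparison above can be sketched as a simple time ratio. The task names, times, and function below are invented for illustration — they are not HCAST's actual data or API:

```python
# Hypothetical illustration of human-calibrated comparison.
# All task names and completion times are invented.

def speed_ratio(human_minutes: float, agent_minutes: float) -> float:
    """How many times slower (>1) or faster (<1) the agent is vs. the human baseline."""
    return agent_minutes / human_minutes

baselines = {          # human-calibrated completion times (invented values)
    "fix_off_by_one": 12.0,
    "add_cli_flag": 45.0,
}
agent_times = {
    "fix_off_by_one": 120.0,   # agent took 2 hours on a 12-minute task
    "add_cli_flag": 30.0,      # agent beat the human baseline
}

for task, human_t in baselines.items():
    ratio = speed_ratio(human_t, agent_times[task])
    label = "slower" if ratio > 1 else "faster"
    print(f"{task}: {ratio:.1f}x {label} than the human baseline")
```

This per-task ratio is what makes calibrated benchmarks more informative than a single pass rate: the same agent can be 10x slower on one task type and faster than human on another.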
Beyond bug fixing, HCAST evaluates AI coding agents on a broad spectrum of software engineering tasks — including feature implementation, refactoring, documentation, testing, and code review. It provides a more holistic view of autonomous coding capability than SWE-bench alone.
History
SWE-bench establishes automated evaluation of coding agents on real GitHub issues
Growing recognition that SWE-bench primarily tests bug fixing, not broader software engineering
Multiple teams propose extensions covering feature work, refactoring, and documentation
HCAST framework proposed to unify evaluation across diverse software task types
HCAST adopted alongside SWE-bench for more comprehensive agent evaluation
Results show agents perform well on localized bug fixes but poorly on architectural changes
How HCAST Works
Task Sampling
Tasks are drawn from multiple categories: bug fixes, feature additions, refactoring, test writing, documentation, and code review.
Environment Setup
Each task includes a repository snapshot, task description, and evaluation criteria specific to the task type.
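A task record of this shape could be sketched as follows. The field names and values are assumptions for illustration, not the actual HCAST schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    category: str                # e.g. "bug_fix", "refactor", "docs"
    repo_snapshot: str           # path to the frozen repository state
    description: str             # natural-language task statement
    eval_criteria: list = field(default_factory=list)

spec = TaskSpec(
    task_id="example-001",
    category="bug_fix",
    repo_snapshot="snapshots/example-001.tar.gz",
    description="Fix the crash when parsing empty config files.",
    eval_criteria=["tests pass", "diff touches at most 2 files"],
)
```

Keeping the evaluation criteria in the task record is what lets different task types (a refactor vs. a documentation fix) be scored by different rules within one harness.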
Agent Execution
The coding agent works on the task with access to the repository, terminal, and test infrastructure.
Multi-Dimensional Evaluation
Beyond test-pass rates, evaluation includes code quality metrics, diff minimality, documentation accuracy, and review quality.
Capability Profiling
Results are broken down by task type, producing a capability profile showing agent strengths and weaknesses across software engineering dimensions.
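The profiling step above amounts to aggregating per-task results by category. A minimal sketch, assuming an invented input format of (category, passed) pairs rather than the actual HCAST result schema:

```python
from collections import defaultdict

def capability_profile(results):
    """Aggregate per-task pass/fail records into per-category success rates.

    `results` is a list of (category, passed) pairs — an invented
    format for illustration, not the real HCAST data layout.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

profile = capability_profile([
    ("bug_fix", True), ("bug_fix", True), ("bug_fix", False),
    ("refactor", False), ("refactor", False),
    ("feature", True), ("feature", False),
])
# Per-category rates expose strengths (bug fixes) vs. weaknesses (refactoring)
# that a single aggregate accuracy number would hide.
```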
Current Landscape
HCAST represents the field's recognition that autonomous coding evaluation needs to go beyond bug fixing. As of 2025, the benchmark shows that current agents are strongest at localized bug fixes (40-50% success) and weakest at architectural refactoring (<10%) and feature development requiring design decisions (<20%). This capability profile is more informative than a single accuracy number and guides development priorities.
Key Challenges
Evaluation design — measuring 'good refactoring' or 'helpful code review' is inherently subjective
Task diversity — software engineering encompasses too many skill types for any single benchmark to cover
Baseline calibration — human performance varies dramatically across task types, making comparisons difficult
Reproducibility — non-deterministic agent behavior means scores vary across runs
Cost — comprehensive evaluation across many task types is expensive in compute and API calls
Quick Recommendations
Comprehensive agent evaluation: HCAST + SWE-bench Verified (broadest coverage of software engineering capabilities)
Feature development testing: HCAST feature subset (specifically tests the harder task of adding new functionality, not just fixing bugs)
Agent development: use HCAST capability profiles to identify weaknesses (the profiles show exactly which software engineering skills to improve)
What's Next
Expect HCAST-style holistic evaluation to become standard, with additional dimensions covering: security vulnerability remediation, performance optimization, legacy code modernization, and collaborative development (working alongside human developers on shared branches).
Related Tasks
Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.