
HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
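The human-calibrated comparison can be sketched as a speed ratio per solved task. The records and field names below are purely illustrative, not METR's actual data schema:

```python
from statistics import median

# Hypothetical HCAST-style records: each task carries a human baseline time
# (minutes, from professional engineers) and the agent's completion time.
results = [
    {"task": "fix-null-deref", "human_min": 15,  "agent_min": 6,   "passed": True},
    {"task": "add-csv-export", "human_min": 90,  "agent_min": 120, "passed": True},
    {"task": "refactor-auth",  "human_min": 240, "agent_min": 300, "passed": False},
]

def speed_ratios(records):
    """Agent time / human time for tasks the agent actually solved.

    Ratio < 1 means faster than the human baseline; failed tasks are
    excluded rather than assigned an arbitrary penalty time.
    """
    return {r["task"]: r["agent_min"] / r["human_min"]
            for r in records if r["passed"]}

ratios = speed_ratios(results)
print(ratios)                   # per-task speed ratios
print(median(ratios.values()))  # summary: median ratio across solved tasks
```

How to score failed tasks (exclude vs. penalize) is itself a design choice; excluding them, as here, can flatter an agent that only attempts easy tasks.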


HCAST evaluates AI coding agents on a broad spectrum of software engineering tasks beyond bug fixing, including feature implementation, refactoring, documentation, testing, and code review. It provides a more holistic view of autonomous coding capability than SWE-bench alone.

History

2023

SWE-bench establishes automated evaluation of coding agents on real GitHub issues

2024

Growing recognition that SWE-bench primarily tests bug fixing, not broader software engineering

2024

Multiple teams propose extensions covering feature work, refactoring, and documentation

2024

HCAST framework proposed to unify evaluation across diverse software task types

2025

HCAST adopted alongside SWE-bench for more comprehensive agent evaluation

2025

Results show agents perform well on localized bug fixes but poorly on architectural changes

How HCAST Works

HCAST Pipeline
1. Task Sampling

Tasks are drawn from multiple categories: bug fixes, feature additions, refactoring, test writing, documentation, and code review.
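A stratified draw over these categories might look like the sketch below. This is a hypothetical sampling scheme, not necessarily how METR assembles the suite:

```python
import random
from collections import defaultdict

# Toy task pool; the categories mirror the ones listed above.
CATEGORIES = ["bug_fix", "feature", "refactor", "tests", "docs", "review"]
pool = [{"id": f"t{i}", "category": CATEGORIES[i % len(CATEGORIES)]}
        for i in range(60)]

def stratified_sample(tasks, per_category, seed=0):
    """Draw an equal number of tasks from each category so the suite
    doesn't over-represent any one skill (e.g. bug fixing)."""
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for t in tasks:
        by_cat[t["category"]].append(t)
    return [t for cat in sorted(by_cat)
            for t in rng.sample(by_cat[cat], per_category)]

suite = stratified_sample(pool, per_category=2)
print(len(suite))  # 12 tasks: 2 from each of the 6 categories
```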

2. Environment Setup

Each task includes a repository snapshot, task description, and evaluation criteria specific to the task type.
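One plausible shape for such a task bundle, sketched with illustrative field names (the benchmark's real on-disk format may differ):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    """Illustrative HCAST-style task bundle, not METR's actual schema."""
    task_id: str
    category: str            # "bug_fix", "feature", "refactor", ...
    repo_snapshot: str       # commit hash or path of the frozen repository
    description: str         # natural-language task statement shown to the agent
    eval_criteria: dict = field(default_factory=dict)  # per-task-type checks

task = Task(
    task_id="hcast-0042",                       # made-up identifier
    category="feature",
    repo_snapshot="deadbeef",                   # placeholder commit hash
    description="Add a --json flag to the export command.",
    eval_criteria={"tests": ["test_export_json"], "max_diff_lines": 200},
)
print(task.category)
```

Keeping evaluation criteria inside the bundle lets each task type carry its own checks: a refactoring task can cap diff size, a documentation task can require accuracy checks.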

3. Agent Execution

The coding agent works on the task with access to the repository, terminal, and test infrastructure.
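A minimal execution-loop sketch, assuming a hypothetical harness in which `propose_patch` stands in for the agent's model call and `run_tests` for the task's evaluation criteria (neither is METR's actual scaffold):

```python
def run_task(workspace, propose_patch, run_tests, max_steps=5):
    """Agent-execution loop sketch: the agent mutates the workspace,
    the harness re-runs the task's tests, and the loop stops on
    success or when the step budget is exhausted."""
    for step in range(1, max_steps + 1):
        propose_patch(workspace)     # agent edits the working copy in place
        if run_tests(workspace):     # task-specific evaluation criteria
            return {"solved": True, "steps": step}
    return {"solved": False, "steps": max_steps}

# Toy usage: "workspace" is a dict, the "agent" nudges a value,
# and the "tests" check for the target state.
ws = {"answer": 0}
result = run_task(
    ws,
    propose_patch=lambda w: w.update(answer=w["answer"] + 1),
    run_tests=lambda w: w["answer"] == 3,
)
print(result)  # solved on the third step
```

A real harness would run the agent in an isolated container with repository, terminal, and test access, but the control flow is the same.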

4. Multi-Dimensional Evaluation

Beyond test-pass rates, evaluation includes code quality metrics, diff minimality, documentation accuracy, and review quality.
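One way such dimensions could be combined into a single score is a weighted mean; the dimension names echo the ones above, but the weights are assumptions, not published HCAST numbers:

```python
# Assumed weights for illustration only; each dimension scores in [0, 1].
WEIGHTS = {"tests_passed": 0.5, "diff_minimality": 0.2,
           "code_quality": 0.2, "doc_accuracy": 0.1}

def aggregate(scores: dict) -> float:
    """Weighted mean across evaluation dimensions; missing dimensions
    count as zero rather than being skipped."""
    assert all(0.0 <= v <= 1.0 for v in scores.values())
    return sum(WEIGHTS[dim] * scores.get(dim, 0.0) for dim in WEIGHTS)

print(aggregate({"tests_passed": 1.0, "diff_minimality": 0.6,
                 "code_quality": 0.8, "doc_accuracy": 1.0}))
```

In practice the per-dimension scores are usually reported separately as well, since a single aggregate hides exactly the trade-offs multi-dimensional evaluation is meant to expose.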

5. Capability Profiling

Results are broken down by task type, producing a capability profile showing agent strengths and weaknesses across software engineering dimensions.
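Profiling reduces to grouping per-task outcomes by category. The outcomes below are made up, shaped to mirror the pattern this page describes (strong on bug fixes, weak on refactoring):

```python
from collections import defaultdict

# Hypothetical (category, solved?) outcomes from a benchmark run.
runs = [("bug_fix", True), ("bug_fix", True), ("bug_fix", False),
        ("feature", False), ("feature", True),
        ("refactor", False), ("refactor", False)]

def capability_profile(outcomes):
    """Success rate per task category -> the agent's capability profile."""
    totals = defaultdict(lambda: [0, 0])   # category -> [solved, attempted]
    for category, solved in outcomes:
        totals[category][0] += int(solved)
        totals[category][1] += 1
    return {cat: solved / n for cat, (solved, n) in totals.items()}

profile = capability_profile(runs)
print(profile)  # per-category success rates
```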

Current Landscape

HCAST represents the field's recognition that autonomous coding evaluation needs to go beyond bug fixing. In 2025, the benchmark reveals that current agents are strongest at localized bug fixes (40-50% success) and weakest at architectural refactoring (<10%) and feature development requiring design decisions (<20%). This capability profile is more informative than a single accuracy number and guides development priorities.

Key Challenges

Evaluation design — measuring 'good refactoring' or 'helpful code review' is inherently subjective

Task diversity — software engineering encompasses too many skill types for any single benchmark to cover

Baseline calibration — human performance varies dramatically across task types, making comparisons difficult

Reproducibility — non-deterministic agent behavior means scores vary across runs

Cost — comprehensive evaluation across many task types is expensive in compute and API calls
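For the reproducibility challenge in particular, one common mitigation is to report a mean with a standard error over repeated runs rather than a single score. The run scores below are made up:

```python
from statistics import mean, stdev

# Hypothetical overall scores from five repeated runs of the same agent.
run_scores = [0.42, 0.47, 0.39, 0.45, 0.44]

def score_with_error(scores):
    """Mean score and standard error of the mean across repeated runs."""
    m = mean(scores)
    se = stdev(scores) / len(scores) ** 0.5
    return m, se

m, se = score_with_error(run_scores)
print(f"{m:.3f} ± {se:.3f}")
```

Reporting the error bar also makes leaderboard comparisons honest: two agents whose intervals overlap should not be ranked on the point estimates alone.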

Quick Recommendations

Comprehensive agent evaluation: HCAST + SWE-bench Verified. Broadest coverage of software engineering capabilities.

Feature development testing: the HCAST feature subset. Specifically tests the harder task of adding new functionality, not just fixing bugs.

Agent development: use HCAST capability profiles to identify weaknesses. The profiles show exactly which software engineering skills to improve.

What's Next

Expect HCAST-style holistic evaluation to become standard, with additional dimensions covering: security vulnerability remediation, performance optimization, legacy code modernization, and collaborative development (working alongside human developers on shared branches).

Benchmarks & SOTA

Related Tasks

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.

Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.

RE-Bench

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.
