HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 189-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
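The human-vs-AI comparison above can be sketched as a simple time ratio. The task names, times, and function below are invented for illustration — they are not HCAST's actual data or API:

```python
# Hypothetical illustration of human-calibrated comparison.
# All task names and completion times are invented.

def speed_ratio(human_minutes: float, agent_minutes: float) -> float:
    """How many times slower (>1) or faster (<1) the agent is vs. the human baseline."""
    return agent_minutes / human_minutes

baselines = {          # human-calibrated completion times (invented values)
    "fix_off_by_one": 12.0,
    "add_cli_flag": 45.0,
}
agent_times = {
    "fix_off_by_one": 120.0,   # agent took 2 hours on a 12-minute task
    "add_cli_flag": 30.0,      # agent beat the human baseline
}

for task, human_t in baselines.items():
    ratio = speed_ratio(human_t, agent_times[task])
    label = "slower" if ratio > 1 else "faster"
    print(f"{task}: {ratio:.1f}x {label} than the human baseline")
```

This per-task ratio is what makes calibrated benchmarks more informative than a single pass rate: the same agent can be 10x slower on one task type and faster than human on another.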
Beyond bug fixing, HCAST evaluates AI coding agents on a broad spectrum of software engineering tasks — including feature implementation, refactoring, documentation, testing, and code review. It provides a more holistic view of autonomous coding capability than SWE-bench alone.
History
SWE-bench establishes automated evaluation of coding agents on real GitHub issues
Growing recognition that SWE-bench primarily tests bug fixing, not broader software engineering
Multiple teams propose extensions covering feature work, refactoring, and documentation
HCAST framework proposed to unify evaluation across diverse software task types
HCAST adopted alongside SWE-bench for more comprehensive agent evaluation
Results show agents perform well on localized bug fixes but poorly on architectural changes
How HCAST Works
Task Sampling
Tasks are drawn from multiple categories: bug fixes, feature additions, refactoring, test writing, documentation, and code review.
Environment Setup
Each task includes a repository snapshot, task description, and evaluation criteria specific to the task type.
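A task record of this shape could be sketched as follows. The field names and values are assumptions for illustration, not the actual HCAST schema:

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    category: str                # e.g. "bug_fix", "refactor", "docs"
    repo_snapshot: str           # path to the frozen repository state
    description: str             # natural-language task statement
    eval_criteria: list = field(default_factory=list)

spec = TaskSpec(
    task_id="example-001",
    category="bug_fix",
    repo_snapshot="snapshots/example-001.tar.gz",
    description="Fix the crash when parsing empty config files.",
    eval_criteria=["tests pass", "diff touches at most 2 files"],
)
```

Keeping the evaluation criteria in the task record is what lets different task types (a refactor vs. a documentation fix) be scored by different rules within one harness.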
Agent Execution
The coding agent works on the task with access to the repository, terminal, and test infrastructure.
Multi-Dimensional Evaluation
Beyond test-pass rates, evaluation includes code quality metrics, diff minimality, documentation accuracy, and review quality.
Capability Profiling
Results are broken down by task type, producing a capability profile showing agent strengths and weaknesses across software engineering dimensions.
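The profiling step above amounts to aggregating per-task results by category. A minimal sketch, assuming an invented input format of (category, passed) pairs rather than the actual HCAST result schema:

```python
from collections import defaultdict

def capability_profile(results):
    """Aggregate per-task pass/fail records into per-category success rates.

    `results` is a list of (category, passed) pairs — an invented
    format for illustration, not the real HCAST data layout.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}

profile = capability_profile([
    ("bug_fix", True), ("bug_fix", True), ("bug_fix", False),
    ("refactor", False), ("refactor", False),
    ("feature", True), ("feature", False),
])
# Per-category rates expose strengths (bug fixes) vs. weaknesses (refactoring)
# that a single aggregate accuracy number would hide.
```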
Current Landscape
HCAST represents the field's recognition that autonomous coding evaluation needs to go beyond bug fixing. As of 2025, the benchmark shows that current agents are strongest at localized bug fixes (40-50% success) and weakest at architectural refactoring (<10%) and feature development requiring design decisions (<20%). This capability profile is more informative than a single accuracy number and guides development priorities.
Key Challenges
Evaluation design — measuring 'good refactoring' or 'helpful code review' is inherently subjective
Task diversity — software engineering encompasses too many skill types for any single benchmark to cover
Baseline calibration — human performance varies dramatically across task types, making comparisons difficult
Reproducibility — non-deterministic agent behavior means scores vary across runs
Cost — comprehensive evaluation across many task types is expensive in compute and API calls
Quick Recommendations
Comprehensive agent evaluation: HCAST + SWE-bench Verified (broadest coverage of software engineering capabilities)
Feature development testing: HCAST feature subset (specifically tests the harder task of adding new functionality, not just fixing bugs)
Agent development: use HCAST capability profiles to identify weaknesses (the profiles show exactly which software engineering skills to improve)
What's Next
Expect HCAST-style holistic evaluation to become standard, with additional dimensions covering: security vulnerability remediation, performance optimization, legacy code modernization, and collaborative development (working alongside human developers on shared branches).
Related Tasks
Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.