
Definitive Guide · Updated March 2026

Agentic AI Benchmarks Explained

SWE-bench, RE-bench, HCAST, WebArena, OSWorld, GAIA, and τ-bench. What they measure, who leads them, where they fall short, and how to use them to evaluate AI agents for your own use case.

Topics: Software Engineering · Research · Web Navigation · Desktop Use · Safety · Tool Use

What Makes Agentic Benchmarks Different

Multi-step, not single-shot

Traditional benchmarks test one question, one answer. Agentic benchmarks require chains of 5-200 actions — and a single mistake in the middle can cascade.
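The cascade effect is easy to quantify under a simplifying assumption: if each step succeeds independently with probability p, an n-step chain succeeds with probability p^n. This idealized model (real step failures are correlated) shows why long chains are brutal:

```python
# Probability that an agent completes an n-step chain when every step
# succeeds independently with probability p (an idealized model).
def chain_success(p: float, n: int) -> float:
    return p ** n

# Even a 95%-reliable step rate collapses over long chains:
for n in (5, 20, 100):
    print(n, round(chain_success(0.95, n), 3))
```

At 95% per-step reliability, a 5-step task succeeds about 77% of the time, a 20-step task about 36%, and a 100-step task under 1%.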

Tool use, not just reasoning

Agents must decide which tools to call, with what arguments, in what order. The combinatorial space is enormous compared to multiple-choice evaluation.

Real environments, not synthetic puzzles

The best agentic benchmarks run in actual codebases, real websites, and live operating systems — not in sanitized JSON input/output pairs.

Open-ended evaluation

There is rarely one correct answer. A bug can be fixed in many valid ways. A web task can be completed through different UI paths. Evaluation must check outcomes, not exact outputs.

The taxonomy at a glance

Agentic benchmarks can be grouped by what kind of autonomy they test. The higher the level, the harder the benchmark and the wider the human-AI gap.

Level 1 — Tool Calling: τ-bench, GAIA
Level 2 — Web / UI Navigation: WebArena, OSWorld
Level 3 — Code Engineering: SWE-bench
Level 4 — Research & Safety: RE-bench, HCAST

(ordered easiest → hardest for agents)

Benchmark-by-Benchmark Deep Dive

Each benchmark below includes what it measures, how it works, current leaderboard standings, key findings, and known limitations.

1. SWE-bench Verified

Software Engineering Benchmark (Verified subset) · Software Engineering · 2023 (Verified: 2024) · Princeton NLP / OpenAI

What it measures

Whether an agent can autonomously resolve real GitHub issues — reading codebases, localizing bugs, writing patches, and passing existing test suites.

How it works

  1. Agent receives a GitHub issue description and the full repository at the relevant commit.
  2. It must produce a code patch that resolves the issue.
  3. The patch is validated by running the project's own unit/integration tests.
  4. An instance is "resolved" only if all relevant tests pass and no existing tests break.
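The verdict in step 4 can be written down compactly. SWE-bench instances distinguish the issue-revealing tests (FAIL_TO_PASS) from the tests that already passed (PASS_TO_PASS); a minimal sketch of the resolution rule, with illustrative test names:

```python
def is_resolved(fail_to_pass: set[str], pass_to_pass: set[str],
                passing_after: set[str]) -> bool:
    """SWE-bench-style verdict: the issue's target tests must now pass,
    and none of the previously passing tests may regress."""
    return fail_to_pass <= passing_after and pass_to_pass <= passing_after

# The patch fixes the bug without breaking the old test:
is_resolved({"test_bug"}, {"test_old"}, {"test_bug", "test_old"})  # True
# The patch fixes the bug but breaks an existing test:
is_resolved({"test_bug"}, {"test_old"}, {"test_bug"})              # False
```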

Key findings

  • Scaffolding matters as much as model quality — the same model can vary 15+ points depending on the agent framework.
  • Agents still struggle with large codebases (>100k lines) where localization is the bottleneck.
  • Most resolved issues are small, localized bug fixes — multi-file architectural changes remain extremely difficult.

Limitations

  • Python-only — no coverage of JavaScript, Rust, Go, or other languages.
  • Issues are self-contained; real engineering involves cross-repo dependencies and ambiguous requirements.
  • Test-based validation can miss subtle regressions not covered by existing tests.
  • Verified subset was curated partly with help from OpenAI, raising neutrality questions.

Leaderboard (% resolved)

  1. codex-1 (OpenAI) — 62.3% (Feb 2026)
  2. Claude 3.5 Sonnet + SWE-agent — 55.0% (Jan 2026)
  3. Amazon Q Developer — 52.4% (Dec 2025)
  4. Devlo — 50.8% (Jan 2026)
  5. AutoCodeRover v2 — 47.6% (Nov 2025)
  6. Aider + GPT-4o — 45.3% (Oct 2025)

Dataset: 500 verified instances from 12 Python repos

2. RE-bench

Research Engineering Benchmark · Research Engineering · 2024 · METR

What it measures

Whether an agent can tackle open-ended ML research engineering tasks — optimizing training loops, implementing novel architectures, and debugging performance issues — given extended time budgets (up to 8 hours).

How it works

  1. Agent receives a research engineering task with a clear metric to optimize (e.g., reduce loss, improve accuracy).
  2. It has access to a compute environment with GPUs, libraries, and starter code.
  3. Tasks are scored on a 0-100 scale based on how close the agent gets to expert-level performance.
  4. Time budgets range from 1 to 8 hours, measuring how agents allocate effort.
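The 0-100 scale in step 3 can be sketched as a normalization between a starter solution and an expert reference. The exact formula here is an assumption, not RE-bench's published scorer:

```python
def normalized_score(agent: float, start: float, expert: float) -> float:
    """Map a raw metric onto 0-100, where the starter solution scores 0
    and the human-expert reference scores 100 (formula is illustrative)."""
    span = expert - start
    raw = (agent - start) / span if span else 0.0
    return max(0.0, min(100.0, 100.0 * raw))
```

Clamping at both ends means an agent that degrades the starter solution scores 0 rather than negative, and one that beats the expert caps at 100.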

Key findings

  • Agents earn roughly a quarter of the score per hour that human experts do, but narrow the gap as time budgets grow.
  • The biggest failure mode is getting stuck in unproductive loops — repeating the same failed approach without pivoting.
  • Agents are better at optimization tasks (hyperparameter tuning) than creative design tasks (novel architectures).

Limitations

  • Only 7 tasks — too few for statistical confidence across different capability dimensions.
  • Expert baselines were set by specific researchers; different experts might achieve different scores.
  • Tasks were designed in 2024 and may not reflect the evolving frontier of ML research.
  • Compute-bound tasks advantage agents with larger inference budgets.

Leaderboard (score / 100)

  1. Claude 3.5 Sonnet (8h budget) — 42/100 (Dec 2025)
  2. o1-preview (8h budget) — 38/100 (Nov 2025)
  3. GPT-4o (8h budget) — 34/100 (Oct 2025)
  4. Claude 3 Opus (8h budget) — 31/100 (Sep 2025)
  Baseline: Human experts (2h budget) — 86/100

Dataset: 7 challenging ML engineering tasks

3. HCAST

Human-Calibrated Autonomy Safety Tasks · Autonomous Safety · 2025 · METR

What it measures

Whether AI agents can safely and reliably complete realistic autonomous tasks — and critically, whether they know when to stop, ask for help, or refuse unsafe actions.

How it works

  1. Tasks span categories like web research, data analysis, system administration, and communication.
  2. Each task has both a capability score (did it succeed?) and a safety score (did it avoid harmful actions?).
  3. Human contractors complete the same tasks to establish calibrated baselines.
  4. Tasks include adversarial elements — e.g., instructions that seem reasonable but would cause harm if followed blindly.
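The two axes in step 2 make HCAST results two-dimensional rather than a single score. A sketch of how such a dual aggregate could be computed (field names are illustrative, not HCAST's schema):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    completed: bool       # did the agent finish the task?
    unsafe_actions: int   # harmful actions taken during the run

def hcast_summary(episodes: list[Episode]) -> tuple[float, float]:
    """Aggregate the two axes HCAST-style reports use: task completion
    rate, and the share of runs that took no unsafe action."""
    n = len(episodes)
    completion = sum(e.completed for e in episodes) / n
    safety = sum(e.unsafe_actions == 0 for e in episodes) / n
    return completion, safety
```

Tracking the axes separately is what surfaces the key finding below: a model can raise the first number while lowering the second.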

Key findings

  • Higher capability does not always correlate with higher safety — o1 completes more tasks but takes more unsafe actions.
  • Agents rarely ask for clarification even when tasks are genuinely ambiguous, leading to silent failures.
  • The gap between agent and human performance is largest on tasks requiring real-world judgment (e.g., "is this email appropriate to send?").

Limitations

  • Safety evaluation is inherently subjective — reasonable people disagree on what constitutes "harmful" in edge cases.
  • The adversarial tasks may not reflect the full range of real-world safety challenges.
  • Human baselines vary significantly by contractor experience level.
  • Relatively new benchmark; methodology is still evolving.

Leaderboard (% completed / % safe)

  1. Claude 3.5 Sonnet — 34% tasks, 91% safety (Jan 2026)
  2. GPT-4o — 31% tasks, 87% safety (Dec 2025)
  3. o1 — 37% tasks, 82% safety (Jan 2026)
  4. Gemini 1.5 Pro — 28% tasks, 85% safety (Nov 2025)
  Baseline: Human contractors — 78% tasks, 96% safety

Dataset: 144 tasks across 12 skill categories

4. WebArena

WebArena: A Realistic Web Environment for Building Autonomous Agents · Web Navigation · 2023 · CMU

What it measures

Whether an agent can complete realistic tasks on real websites — shopping, forum management, code repository navigation, content management, and map-based tasks.

How it works

  1. Five self-hosted websites simulate realistic web apps: e-commerce (OneStopShop), forums (Reddit clone), GitLab, CMS, and OpenStreetMap.
  2. Agent receives a natural language instruction (e.g., "Find the cheapest wireless mouse and add it to cart").
  3. It interacts through browser actions: click, type, scroll, navigate.
  4. Success is measured by functional correctness — did the task actually get completed?
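The observe-act loop in steps 2-4 can be sketched in a few lines. The `env` interface, action tuples, and `check_success` hook here are illustrative placeholders, not WebArena's actual API:

```python
# Minimal WebArena-style episode loop: observe the page, choose a
# browser action, and grade the *final environment state* rather than
# the agent's transcript.
def run_episode(env, policy, max_steps: int = 30) -> bool:
    obs = env.observe()              # e.g. accessibility tree or screenshot
    for _ in range(max_steps):
        action = policy(obs)         # e.g. ("click", elem) / ("type", elem, text)
        if action[0] == "stop":
            break
        obs = env.step(action)
    return env.check_success()       # functional check on the end state
```

Grading the end state is what lets many different UI paths count as success, as the taxonomy section noted.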

Key findings

  • Raw model performance (14%) vs. agent scaffolding (35%+) shows the framework matters more than the model for web tasks.
  • Agents fail most on tasks requiring long action sequences (>15 steps) — they lose track of state.
  • Visual grounding (screenshot-based) agents are catching up to DOM-based agents but still trail by ~5%.

Limitations

  • Self-hosted websites are static snapshots — no other users, no dynamic content, no CAPTCHAs.
  • Tasks are unambiguous by design, unlike real-world web tasks which often require interpretation.
  • No evaluation of efficiency — an agent that takes 200 actions to complete a 3-action task still "passes."
  • Does not test recovery from errors (website down, session expired, etc.).

Leaderboard (% task success)

  1. AgentOccam (GPT-4o) — 39.5% (Jan 2026)
  2. AWM + GPT-4o — 35.5% (Nov 2025)
  3. SteP + GPT-4o — 33.1% (Oct 2025)
  4. BrowserGym + Claude 3.5 — 31.2% (Dec 2025)
  5. WebVoyager + GPT-4V — 26.4% (Sep 2025)
  6. GPT-4 (direct) — 14.4% (Jul 2025)

Dataset: 812 tasks across 5 real websites

5. OSWorld

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · Desktop Computer Use · 2024 · HKU / Salesforce

What it measures

Whether an agent can use a full desktop OS — opening applications, editing documents, managing files, configuring settings, and chaining multi-app workflows.

How it works

  1. Agent controls a virtual machine through screenshots and keyboard/mouse actions.
  2. Tasks span Office apps, web browsers, file managers, terminals, and system settings.
  3. Evaluation checks the actual OS state after the agent acts (file contents, app states, configs).
  4. Some tasks require chaining multiple applications (e.g., "download the CSV from email, open it in LibreOffice, sort by column B").
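The state check in step 3 is the distinctive part: the grader inspects the machine, not the agent's output. A minimal sketch using file contents as the inspected state (paths and expected values are illustrative):

```python
import os

def check_os_state(expectations: dict[str, str]) -> bool:
    """OSWorld-style grading sketch: after the agent acts, verify the
    machine itself — here, that each file exists with the expected
    contents. Real OSWorld checks also cover app state and configs."""
    for path, expected in expectations.items():
        if not os.path.exists(path):
            return False
        with open(path) as f:
            if f.read().strip() != expected:
                return False
    return True
```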

Key findings

  • Even the best agents solve less than 1 in 4 tasks — desktop computer use remains extremely challenging for AI.
  • Agents are weakest on tasks requiring precise spatial reasoning (drag-and-drop, window resizing, menu navigation).
  • Multi-app workflows see ~3x failure rates compared to single-app tasks.

Limitations

  • VM-based evaluation is slow and expensive to run at scale.
  • Tasks are biased toward Linux (Ubuntu) — Windows and macOS coverage is thinner.
  • Screen resolution and rendering differences can cause brittleness in visual agents.
  • No multi-turn interaction — agent cannot ask the user for clarification mid-task.

Leaderboard (% task success)

  1. Claude 3.5 Sonnet (computer use) — 22.0% (Jan 2026)
  2. UI-TARS (ByteDance) — 18.8% (Dec 2025)
  3. SeeClick + GPT-4o — 16.2% (Nov 2025)
  4. CogAgent — 13.7% (Oct 2025)
  5. GPT-4V (direct) — 12.2% (Sep 2025)
  Baseline: Human performance — 72.4%

Dataset: 369 tasks across Ubuntu, Windows, and macOS

6. GAIA

General AI Assistants Benchmark · General AI Assistance · 2023 · Meta FAIR / HuggingFace

What it measures

Whether an agent can answer questions that are simple for humans but require multi-step tool use for AI — web browsing, file processing, calculation, and reasoning in combination.

How it works

  1. Questions are designed to be trivially solvable by humans (>90% accuracy) but hard for AI.
  2. Three levels: L1 (1-2 tool uses), L2 (5-10 steps), L3 (long chains with file analysis).
  3. Agents must use tools: web search, code execution, file reading, calculator.
  4. Exact-match evaluation — the answer must be precisely correct.
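Exact-match scorers usually apply light normalization before comparing. The rules below (case, whitespace, trailing period) are a plausible sketch; GAIA's official scorer differs in detail:

```python
def exact_match(pred: str, gold: str) -> bool:
    """GAIA-style exact-match sketch with light normalization.
    Even this forgiving version rejects '41' vs '42'."""
    norm = lambda s: s.strip().strip(".").lower()
    return norm(pred) == norm(gold)
```

The limitation noted below follows directly: any formatting variant the normalizer does not anticipate ("3,000" vs "3000") scores as wrong.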

Key findings

  • Humans still outperform the best AI agents by 16+ points, primarily on Level 2 and Level 3 tasks.
  • Agent failures cluster around tool selection errors — using the wrong tool rather than using it incorrectly.
  • The gap between Level 1 (~90% for top agents) and Level 3 (~45%) shows compounding failures in multi-step reasoning.

Limitations

  • Exact-match scoring penalizes correct answers with slightly different formatting.
  • Some questions rely on web content that changes over time, causing instability in scores.
  • Level 3 questions are so hard that even top agents score below 50%, making statistical analysis noisy.
  • Does not measure user experience — only factual correctness of the final answer.

Leaderboard (% correct)

  1. o1 + tools (OpenAI) — 75.6% (Feb 2026)
  2. Gemini Ultra + tools — 72.3% (Jan 2026)
  3. Claude 3.5 Sonnet + tools — 70.1% (Jan 2026)
  4. HuggingFace Agents — 56.4% (Nov 2025)
  5. AutoGPT — 42.8% (Sep 2025)
  Baseline: Human (non-expert) — 92%

Dataset: 466 questions across 3 difficulty levels

7. τ-bench

Tool-Agent-User Benchmark · Tool Use & Conversation · 2024 · Sierra AI

What it measures

Whether an agent can handle realistic multi-turn customer service conversations requiring database lookups, policy adherence, and tool calls — without hallucinating actions or violating business rules.

How it works

  1. Agent plays a customer service representative with access to databases and tools (order lookup, flight rebooking, refund processing).
  2. A simulated user follows a scripted scenario with specific needs and constraints.
  3. The agent must use tools correctly AND follow domain-specific policies (return windows, fare rules, etc.).
  4. Evaluation checks both task completion and policy compliance at each turn.
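The policy gate in step 3 is the part agents most often fail: the tool call can be well-formed yet still break a business rule. A sketch with a hypothetical 30-day return window (not an actual τ-bench rule):

```python
from datetime import date

RETURN_WINDOW_DAYS = 30  # hypothetical retail policy

def refund_allowed(purchase: date, today: date) -> bool:
    return (today - purchase).days <= RETURN_WINDOW_DAYS

def handle_refund(purchase: date, today: date) -> str:
    """Gate the refund tool call behind the policy check: completing
    the task while violating the policy still counts as a failure."""
    if not refund_allowed(purchase, today):
        return "refused: outside return window"
    return "refund issued"
```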

Key findings

  • Airline tasks are significantly harder than retail due to more complex policies (fare classes, connection rules, rebooking logic).
  • The most common failure mode is policy violation — agents complete the task but break a business rule in the process.
  • Larger models are disproportionately better at policy compliance than at raw task completion.

Limitations

  • Only two domains — retail and airline — which may not generalize to other verticals.
  • Simulated users follow scripts and don't behave like frustrated real customers.
  • Policy rules are explicit; real-world policies often have undocumented exceptions.
  • No evaluation of tone, empathy, or customer satisfaction — only functional correctness.

Leaderboard (% pass rate)

  1. Claude 3.5 Sonnet — 68.2% retail / 52.1% airline (Jan 2026)
  2. GPT-4o — 64.7% retail / 49.8% airline (Dec 2025)
  3. Gemini 1.5 Pro — 59.3% retail / 43.6% airline (Nov 2025)
  4. Claude 3 Haiku — 52.1% retail / 38.4% airline (Oct 2025)
  5. Llama 3.1 70B — 44.8% retail / 31.2% airline (Sep 2025)

Dataset: 2 domains (retail, airline), 200+ conversations

Cross-Benchmark Comparison

How far are the best AI agents from human performance on each benchmark? The gap ranges from 16% (GAIA) to 50% (OSWorld).

Benchmark          | Best Agent | Human   | Gap    | Difficulty | Real-World Relevance
SWE-bench Verified | 62%        | ~77%*   | 15%    | Hard       | Very High
RE-bench           | 42/100     | 86/100  | 44 pts | Very Hard  | High
HCAST              | 37%        | 78%     | 41%    | Hard       | Very High
WebArena           | 39.5%      | 78.2%   | 39%    | Hard       | High
OSWorld            | 22%        | 72.4%   | 50%    | Very Hard  | Very High
GAIA               | 75.6%      | 92%     | 16%    | Medium     | Medium
τ-bench (retail)   | 68.2%      | ~95%    | 27%    | Medium     | Very High

* SWE-bench human baseline is estimated from professional developer resolution rates on similar issue sets.

The Gap Between Benchmarks and Reality

Benchmarks are too clean

Real tasks involve ambiguous requirements, missing documentation, political constraints, and changing goals mid-execution. No benchmark captures this yet.

No cost accounting

An agent that spends $50 in API calls to fix a $5 bug "passes" SWE-bench. Real deployment requires cost-effective solutions, not just correct ones.

Single-attempt evaluation

Most benchmarks give agents one shot. In reality, humans iterate — fix, test, debug, retry. Agents that can learn from failures would score differently in multi-attempt settings.
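Multi-attempt performance is usually summarized as pass@k. The standard unbiased estimator (popularized by the HumanEval/Codex evaluation) computes, from n sampled attempts of which c succeeded, the probability that at least one of k random attempts succeeds:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# An agent that succeeds on half of 20 sampled attempts reaches
# roughly 98% when allowed 5 tries per task.
```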

No collaboration testing

Real agents will work alongside humans, ask questions, present options, and delegate. No current benchmark measures collaborative intelligence.

Overfitting risk is real

With published benchmarks, agent developers can (and do) optimize for specific test patterns. Performance on held-out real-world tasks is often 20-40% lower.

Safety is an afterthought

Only HCAST explicitly measures safety. An agent that resolves 60% of SWE-bench by blindly executing untested code would be dangerous in production.

How to Evaluate Agents for Your Use Case

Public benchmarks are a starting point, not the answer. Here is a practical framework for evaluating whether an AI agent will work for your specific needs.

1. Map your task to the closest benchmark

If you need a coding agent, start with SWE-bench results. Customer service? Check τ-bench. Web automation? WebArena. This gives you a rough ceiling for what to expect.

2. Build your own eval set (20-50 examples)

Take real tasks from your team's last month of work. Run the agent on them. Your private eval is 10x more predictive than any public benchmark because it captures your specific domain, data, and edge cases.
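A private eval harness of the kind described above needs only a few lines to start. `agent_fn`, `grader`, and the example format are placeholders for your own integration:

```python
def run_evals(examples: list[dict], agent_fn, grader) -> float:
    """Run the agent over your own examples and return the pass rate.
    `grader` decides success per example (exact match, test suite,
    human rubric — whatever fits your domain)."""
    passed = sum(grader(ex, agent_fn(ex["input"])) for ex in examples)
    return passed / len(examples)
```

Keep the examples and graders in version control; the monthly re-runs recommended in step 5 are only meaningful if the eval set stays fixed.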

3. Measure total cost, not just accuracy

Track API spend, latency, and human review time per task. An agent that solves 60% of tasks but requires a human to verify every output may cost more than just doing it manually.
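The comparison above comes down to cost per *successfully* resolved task, which folds in both human review time and the failure rate. All inputs below are per-task averages and the numbers are illustrative:

```python
def cost_per_resolved(api_cost: float, review_minutes: float,
                      hourly_rate: float, resolve_rate: float) -> float:
    """Fully loaded cost per resolved task: API spend plus human review
    time, divided by the fraction of tasks that actually land."""
    per_task = api_cost + (review_minutes / 60.0) * hourly_rate
    return per_task / resolve_rate

# e.g. $2 of API calls, 6 min of review at $90/h, 60% resolve rate:
# (2 + 9) / 0.6 ≈ $18.33 per resolved task
```

If that figure exceeds what a human costs to do the task directly, the agent's accuracy number is irrelevant.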

4. Test failure modes, not just successes

When the agent fails, HOW does it fail? Silent wrong answers are worse than refusals. Data corruption is worse than timeouts. The failure taxonomy matters more than the success rate.

5. Re-evaluate monthly

Agent capabilities are improving rapidly. An agent that scored 30% three months ago may score 50% today with a new model or framework. Set calendar reminders to re-run your evals.

Benchmark scores are sourced from published papers, official leaderboards, and verified reproductions. Scores change frequently as new agents are released. Last updated March 2026.