
Definitive Guide · Updated March 2026

Agentic AI Benchmarks Explained

SWE-bench, RE-bench, HCAST, WebArena, OSWorld, GAIA, and τ-bench. What they measure, who leads them, where they fall short, and how to use them to evaluate AI agents for your own use case.

Topics: Software Engineering · Research · Web Navigation · Desktop Use · Safety · Tool Use

What Makes Agentic Benchmarks Different

Multi-step, not single-shot

Traditional benchmarks test one question, one answer. Agentic benchmarks require chains of 5-200 actions — and a single mistake in the middle can cascade.
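The cascade effect is easy to quantify under a simplifying assumption: if each step succeeds independently with probability p, an n-step chain succeeds with probability p^n. This idealized model (real step failures are correlated) shows why long chains are brutal:

```python
# Probability that an agent completes an n-step chain when every step
# succeeds independently with probability p (an idealized model).
def chain_success(p: float, n: int) -> float:
    return p ** n

# Even a 95%-reliable step rate collapses over long chains:
for n in (5, 20, 100):
    print(n, round(chain_success(0.95, n), 3))
```

At 95% per-step reliability, a 5-step task succeeds about 77% of the time, a 20-step task about 36%, and a 100-step task under 1%.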

Tool use, not just reasoning

Agents must decide which tools to call, with what arguments, in what order. The combinatorial space is enormous compared to multiple-choice evaluation.

Real environments, not synthetic puzzles

The best agentic benchmarks run in actual codebases, real websites, and live operating systems — not in sanitized JSON input/output pairs.

Open-ended evaluation

There is rarely one correct answer. A bug can be fixed in many valid ways. A web task can be completed through different UI paths. Evaluation must check outcomes, not exact outputs.

The taxonomy at a glance

Agentic benchmarks can be grouped by what kind of autonomy they test. The higher the level, the harder the benchmark and the wider the human-AI gap.

Level 1 — Tool Calling: τ-bench, GAIA
Level 2 — Web / UI Navigation: WebArena, OSWorld
Level 3 — Code Engineering: SWE-bench
Level 4 — Research & Safety: RE-bench, HCAST

(ordered easiest → hardest for agents)

Benchmark-by-Benchmark Deep Dive

Each benchmark below includes what it measures, how it works, current leaderboard standings, key findings, and known limitations.

1. SWE-bench Verified

Software Engineering Benchmark (Verified subset) · Software Engineering · 2023 (Verified: 2024) · Princeton NLP / OpenAI

What it measures

Whether an agent can autonomously resolve real GitHub issues — reading codebases, localizing bugs, writing patches, and passing existing test suites.

How it works

  1. Agent receives a GitHub issue description and the full repository at the relevant commit.
  2. It must produce a code patch that resolves the issue.
  3. The patch is validated by running the project's own unit/integration tests.
  4. An instance is "resolved" only if all relevant tests pass and no existing tests break.
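The verdict in step 4 can be written down compactly. SWE-bench instances distinguish the issue-revealing tests (FAIL_TO_PASS) from the tests that already passed (PASS_TO_PASS); a minimal sketch of the resolution rule, with illustrative test names:

```python
def is_resolved(fail_to_pass: set[str], pass_to_pass: set[str],
                passing_after: set[str]) -> bool:
    """SWE-bench-style verdict: the issue's target tests must now pass,
    and none of the previously passing tests may regress."""
    return fail_to_pass <= passing_after and pass_to_pass <= passing_after

# The patch fixes the bug without breaking the old test:
is_resolved({"test_bug"}, {"test_old"}, {"test_bug", "test_old"})  # True
# The patch fixes the bug but breaks an existing test:
is_resolved({"test_bug"}, {"test_old"}, {"test_bug"})              # False
```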

Key findings

  • Scaffolding matters as much as model quality — the same model can vary 15+ points depending on the agent framework.
  • Agents still struggle with large codebases (>100k lines) where localization is the bottleneck.
  • Most resolved issues are small, localized bug fixes — multi-file architectural changes remain extremely difficult.

Limitations

  • Python-only — no coverage of JavaScript, Rust, Go, or other languages.
  • Issues are self-contained; real engineering involves cross-repo dependencies and ambiguous requirements.
  • Test-based validation can miss subtle regressions not covered by existing tests.
  • Verified subset was curated partly with help from OpenAI, raising neutrality questions.

Leaderboard (% resolved)

  1. codex-1 (OpenAI) — 62.3% (Feb 2026)
  2. Claude 3.5 Sonnet + SWE-agent — 55.0% (Jan 2026)
  3. Amazon Q Developer — 52.4% (Dec 2025)
  4. Devlo — 50.8% (Jan 2026)
  5. AutoCodeRover v2 — 47.6% (Nov 2025)
  6. Aider + GPT-4o — 45.3% (Oct 2025)

Dataset: 500 verified instances from 12 Python repos

2. RE-bench

Research Engineering Benchmark · Research Engineering · 2024 · METR

What it measures

Whether an agent can tackle open-ended ML research engineering tasks — optimizing training loops, implementing novel architectures, and debugging performance issues — given extended time budgets (up to 8 hours).

How it works

  1. Agent receives a research engineering task with a clear metric to optimize (e.g., reduce loss, improve accuracy).
  2. It has access to a compute environment with GPUs, libraries, and starter code.
  3. Tasks are scored on a 0-100 scale based on how close the agent gets to expert-level performance.
  4. Time budgets range from 1 to 8 hours, measuring how agents allocate effort.
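The 0-100 scale in step 3 can be sketched as a normalization between a starter solution and an expert reference. The exact formula here is an assumption, not RE-bench's published scorer:

```python
def normalized_score(agent: float, start: float, expert: float) -> float:
    """Map a raw metric onto 0-100, where the starter solution scores 0
    and the human-expert reference scores 100 (formula is illustrative)."""
    span = expert - start
    raw = (agent - start) / span if span else 0.0
    return max(0.0, min(100.0, 100.0 * raw))
```

Clamping at both ends means an agent that degrades the starter solution scores 0 rather than negative, and one that beats the expert caps at 100.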

Key findings

  • Agents earn roughly a quarter of the score per hour that human experts do, but narrow the gap as time budgets grow.
  • The biggest failure mode is getting stuck in unproductive loops — repeating the same failed approach without pivoting.
  • Agents are better at optimization tasks (hyperparameter tuning) than creative design tasks (novel architectures).

Limitations

  • Only 7 tasks — too few for statistical confidence across different capability dimensions.
  • Expert baselines were set by specific researchers; different experts might achieve different scores.
  • Tasks were designed in 2024 and may not reflect the evolving frontier of ML research.
  • Compute-bound tasks advantage agents with larger inference budgets.

Leaderboard (score / 100)

  1. Claude 3.5 Sonnet (8h budget) — 42/100 (Dec 2025)
  2. o1-preview (8h budget) — 38/100 (Nov 2025)
  3. GPT-4o (8h budget) — 34/100 (Oct 2025)
  4. Claude 3 Opus (8h budget) — 31/100 (Sep 2025)
  Baseline: Human experts (2h budget) — 86/100

Dataset: 7 challenging ML engineering tasks

3. HCAST

Human-Calibrated Autonomy Safety Tasks · Autonomous Safety · 2025 · METR

What it measures

Whether AI agents can safely and reliably complete realistic autonomous tasks — and critically, whether they know when to stop, ask for help, or refuse unsafe actions.

How it works

  1. Tasks span categories like web research, data analysis, system administration, and communication.
  2. Each task has both a capability score (did it succeed?) and a safety score (did it avoid harmful actions?).
  3. Human contractors complete the same tasks to establish calibrated baselines.
  4. Tasks include adversarial elements — e.g., instructions that seem reasonable but would cause harm if followed blindly.
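The two axes in step 2 make HCAST results two-dimensional rather than a single score. A sketch of how such a dual aggregate could be computed (field names are illustrative, not HCAST's schema):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    completed: bool       # did the agent finish the task?
    unsafe_actions: int   # harmful actions taken during the run

def hcast_summary(episodes: list[Episode]) -> tuple[float, float]:
    """Aggregate the two axes HCAST-style reports use: task completion
    rate, and the share of runs that took no unsafe action."""
    n = len(episodes)
    completion = sum(e.completed for e in episodes) / n
    safety = sum(e.unsafe_actions == 0 for e in episodes) / n
    return completion, safety
```

Tracking the axes separately is what surfaces the key finding below: a model can raise the first number while lowering the second.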

Key findings

  • Higher capability does not always correlate with higher safety — o1 completes more tasks but takes more unsafe actions.
  • Agents rarely ask for clarification even when tasks are genuinely ambiguous, leading to silent failures.
  • The gap between agent and human performance is largest on tasks requiring real-world judgment (e.g., "is this email appropriate to send?").

Limitations

  • Safety evaluation is inherently subjective — reasonable people disagree on what constitutes "harmful" in edge cases.
  • The adversarial tasks may not reflect the full range of real-world safety challenges.
  • Human baselines vary significantly by contractor experience level.
  • Relatively new benchmark; methodology is still evolving.

Leaderboard (% completed / % safe)

  1. Claude 3.5 Sonnet — 34% tasks, 91% safety (Jan 2026)
  2. GPT-4o — 31% tasks, 87% safety (Dec 2025)
  3. o1 — 37% tasks, 82% safety (Jan 2026)
  4. Gemini 1.5 Pro — 28% tasks, 85% safety (Nov 2025)
  Baseline: Human contractors — 78% tasks, 96% safety

Dataset: 144 tasks across 12 skill categories

4. WebArena

WebArena: A Realistic Web Environment for Building Autonomous Agents · Web Navigation · 2023 · CMU

What it measures

Whether an agent can complete realistic tasks on real websites — shopping, forum management, code repository navigation, content management, and map-based tasks.

How it works

  1. Five self-hosted websites simulate realistic web apps: e-commerce (OneStopShop), forums (Reddit clone), GitLab, CMS, and OpenStreetMap.
  2. Agent receives a natural language instruction (e.g., "Find the cheapest wireless mouse and add it to cart").
  3. It interacts through browser actions: click, type, scroll, navigate.
  4. Success is measured by functional correctness — did the task actually get completed?
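The observe-act loop in steps 2-4 can be sketched in a few lines. The `env` interface, action tuples, and `check_success` hook here are illustrative placeholders, not WebArena's actual API:

```python
# Minimal WebArena-style episode loop: observe the page, choose a
# browser action, and grade the *final environment state* rather than
# the agent's transcript.
def run_episode(env, policy, max_steps: int = 30) -> bool:
    obs = env.observe()              # e.g. accessibility tree or screenshot
    for _ in range(max_steps):
        action = policy(obs)         # e.g. ("click", elem) / ("type", elem, text)
        if action[0] == "stop":
            break
        obs = env.step(action)
    return env.check_success()       # functional check on the end state
```

Grading the end state is what lets many different UI paths count as success, as the taxonomy section noted.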

Key findings

  • Raw model performance (14%) vs. agent scaffolding (35%+) shows the framework matters more than the model for web tasks.
  • Agents fail most on tasks requiring long action sequences (>15 steps) — they lose track of state.
  • Visual grounding (screenshot-based) agents are catching up to DOM-based agents but still trail by ~5%.

Limitations

  • Self-hosted websites are static snapshots — no other users, no dynamic content, no CAPTCHAs.
  • Tasks are unambiguous by design, unlike real-world web tasks which often require interpretation.
  • No evaluation of efficiency — an agent that takes 200 actions to complete a 3-action task still "passes."
  • Does not test recovery from errors (website down, session expired, etc.).

Leaderboard (% task success)

  1. AgentOccam (GPT-4o) — 39.5% (Jan 2026)
  2. AWM + GPT-4o — 35.5% (Nov 2025)
  3. SteP + GPT-4o — 33.1% (Oct 2025)
  4. BrowserGym + Claude 3.5 — 31.2% (Dec 2025)
  5. WebVoyager + GPT-4V — 26.4% (Sep 2025)
  6. GPT-4 (direct) — 14.4% (Jul 2025)

Dataset: 812 tasks across 5 real websites

5. OSWorld

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments · Desktop Computer Use · 2024 · HKU / Salesforce

What it measures

Whether an agent can use a full desktop OS — opening applications, editing documents, managing files, configuring settings, and chaining multi-app workflows.

How it works

  1. Agent controls a virtual machine through screenshots and keyboard/mouse actions.
  2. Tasks span Office apps, web browsers, file managers, terminals, and system settings.
  3. Evaluation checks the actual OS state after the agent acts (file contents, app states, configs).
  4. Some tasks require chaining multiple applications (e.g., "download the CSV from email, open it in LibreOffice, sort by column B").
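The state check in step 3 is the distinctive part: the grader inspects the machine, not the agent's output. A minimal sketch using file contents as the inspected state (paths and expected values are illustrative):

```python
import os

def check_os_state(expectations: dict[str, str]) -> bool:
    """OSWorld-style grading sketch: after the agent acts, verify the
    machine itself — here, that each file exists with the expected
    contents. Real OSWorld checks also cover app state and configs."""
    for path, expected in expectations.items():
        if not os.path.exists(path):
            return False
        with open(path) as f:
            if f.read().strip() != expected:
                return False
    return True
```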

Key findings

  • Even the best agents solve less than 1 in 4 tasks — desktop computer use remains extremely challenging for AI.
  • Agents are weakest on tasks requiring precise spatial reasoning (drag-and-drop, window resizing, menu navigation).
  • Multi-app workflows see ~3x failure rates compared to single-app tasks.

Limitations

  • VM-based evaluation is slow and expensive to run at scale.
  • Tasks are biased toward Linux (Ubuntu) — Windows and macOS coverage is thinner.
  • Screen resolution and rendering differences can cause brittleness in visual agents.
  • No multi-turn interaction — agent cannot ask the user for clarification mid-task.

Leaderboard (% task success)

  1. Claude 3.5 Sonnet (computer use) — 22.0% (Jan 2026)
  2. UI-TARS (ByteDance) — 18.8% (Dec 2025)
  3. SeeClick + GPT-4o — 16.2% (Nov 2025)
  4. CogAgent — 13.7% (Oct 2025)
  5. GPT-4V (direct) — 12.2% (Sep 2025)
  Baseline: Human performance — 72.4%

Dataset: 369 tasks across Ubuntu, Windows, and macOS

6. GAIA

General AI Assistants Benchmark · General AI Assistance · 2023 · Meta FAIR / HuggingFace

What it measures

Whether an agent can answer questions that are simple for humans but require multi-step tool use for AI — web browsing, file processing, calculation, and reasoning in combination.

How it works

  1. Questions are designed to be trivially solvable by humans (>90% accuracy) but hard for AI.
  2. Three levels: L1 (1-2 tool uses), L2 (5-10 steps), L3 (long chains with file analysis).
  3. Agents must use tools: web search, code execution, file reading, calculator.
  4. Exact-match evaluation — the answer must be precisely correct.
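Exact-match scorers usually apply light normalization before comparing. The rules below (case, whitespace, trailing period) are a plausible sketch; GAIA's official scorer differs in detail:

```python
def exact_match(pred: str, gold: str) -> bool:
    """GAIA-style exact-match sketch with light normalization.
    Even this forgiving version rejects '41' vs '42'."""
    norm = lambda s: s.strip().strip(".").lower()
    return norm(pred) == norm(gold)
```

The limitation noted below follows directly: any formatting variant the normalizer does not anticipate ("3,000" vs "3000") scores as wrong.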

Key findings

  • Humans still outperform the best AI agents by 16+ points, primarily on Level 2 and Level 3 tasks.
  • Agent failures cluster around tool selection errors — using the wrong tool rather than using it incorrectly.
  • The gap between Level 1 (~90% for top agents) and Level 3 (~45%) shows compounding failures in multi-step reasoning.

Limitations

  • Exact-match scoring penalizes correct answers with slightly different formatting.
  • Some questions rely on web content that changes over time, causing instability in scores.
  • Level 3 questions are so hard that even top agents score below 50%, making statistical analysis noisy.
  • Does not measure user experience — only factual correctness of the final answer.

Leaderboard (% correct)

  1. o1 + tools (OpenAI) — 75.6% (Feb 2026)
  2. Gemini Ultra + tools — 72.3% (Jan 2026)
  3. Claude 3.5 Sonnet + tools — 70.1% (Jan 2026)
  4. HuggingFace Agents — 56.4% (Nov 2025)
  5. AutoGPT — 42.8% (Sep 2025)
  Baseline: Human (non-expert) — 92%

Dataset: 466 questions across 3 difficulty levels

7. τ-bench

Tool-Agent-User Benchmark · Tool Use & Conversation · 2024 · Sierra AI

What it measures

Whether an agent can handle realistic multi-turn customer service conversations requiring database lookups, policy adherence, and tool calls — without hallucinating actions or violating business rules.

How it works

  1. Agent plays a customer service representative with access to databases and tools (order lookup, flight rebooking, refund processing).
  2. A simulated user follows a scripted scenario with specific needs and constraints.
  3. The agent must use tools correctly AND follow domain-specific policies (return windows, fare rules, etc.).
  4. Evaluation checks both task completion and policy compliance at each turn.
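The policy gate in step 3 is the part agents most often fail: the tool call can be well-formed yet still break a business rule. A sketch with a hypothetical 30-day return window (not an actual τ-bench rule):

```python
from datetime import date

RETURN_WINDOW_DAYS = 30  # hypothetical retail policy

def refund_allowed(purchase: date, today: date) -> bool:
    return (today - purchase).days <= RETURN_WINDOW_DAYS

def handle_refund(purchase: date, today: date) -> str:
    """Gate the refund tool call behind the policy check: completing
    the task while violating the policy still counts as a failure."""
    if not refund_allowed(purchase, today):
        return "refused: outside return window"
    return "refund issued"
```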

Key findings

  • Airline tasks are significantly harder than retail due to more complex policies (fare classes, connection rules, rebooking logic).
  • The most common failure mode is policy violation — agents complete the task but break a business rule in the process.
  • Larger models are disproportionately better at policy compliance than at raw task completion.

Limitations

  • Only two domains — retail and airline — which may not generalize to other verticals.
  • Simulated users follow scripts and don't behave like frustrated real customers.
  • Policy rules are explicit; real-world policies often have undocumented exceptions.
  • No evaluation of tone, empathy, or customer satisfaction — only functional correctness.

Leaderboard (% pass rate)

  1. Claude 3.5 Sonnet — 68.2% retail / 52.1% airline (Jan 2026)
  2. GPT-4o — 64.7% retail / 49.8% airline (Dec 2025)
  3. Gemini 1.5 Pro — 59.3% retail / 43.6% airline (Nov 2025)
  4. Claude 3 Haiku — 52.1% retail / 38.4% airline (Oct 2025)
  5. Llama 3.1 70B — 44.8% retail / 31.2% airline (Sep 2025)

Dataset: 2 domains (retail, airline), 200+ conversations

Cross-Benchmark Comparison

How far are the best AI agents from human performance on each benchmark? The gap ranges from 16% (GAIA) to 50% (OSWorld).

Benchmark          | Best Agent | Human   | Gap    | Difficulty | Real-World Relevance
SWE-bench Verified | 62%        | ~77%*   | 15%    | Hard       | Very High
RE-bench           | 42/100     | 86/100  | 44 pts | Very Hard  | High
HCAST              | 37%        | 78%     | 41%    | Hard       | Very High
WebArena           | 39.5%      | 78.2%   | 39%    | Hard       | High
OSWorld            | 22%        | 72.4%   | 50%    | Very Hard  | Very High
GAIA               | 75.6%      | 92%     | 16%    | Medium     | Medium
τ-bench (retail)   | 68.2%      | ~95%    | 27%    | Medium     | Very High

* SWE-bench human baseline is estimated from professional developer resolution rates on similar issue sets.

The Gap Between Benchmarks and Reality

Benchmarks are too clean

Real tasks involve ambiguous requirements, missing documentation, political constraints, and changing goals mid-execution. No benchmark captures this yet.

No cost accounting

An agent that spends $50 in API calls to fix a $5 bug "passes" SWE-bench. Real deployment requires cost-effective solutions, not just correct ones.

Single-attempt evaluation

Most benchmarks give agents one shot. In reality, humans iterate — fix, test, debug, retry. Agents that can learn from failures would score differently in multi-attempt settings.
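Multi-attempt performance is usually summarized as pass@k. The standard unbiased estimator (popularized by the HumanEval/Codex evaluation) computes, from n sampled attempts of which c succeeded, the probability that at least one of k random attempts succeeds:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # every size-k sample must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)

# An agent that succeeds on half of 20 sampled attempts reaches
# roughly 98% when allowed 5 tries per task.
```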

No collaboration testing

Real agents will work alongside humans, ask questions, present options, and delegate. No current benchmark measures collaborative intelligence.

Overfitting risk is real

With published benchmarks, agent developers can (and do) optimize for specific test patterns. Performance on held-out real-world tasks is often 20-40% lower.

Safety is an afterthought

Only HCAST explicitly measures safety. An agent that resolves 60% of SWE-bench by blindly executing untested code would be dangerous in production.

How to Evaluate Agents for Your Use Case

Public benchmarks are a starting point, not the answer. Here is a practical framework for evaluating whether an AI agent will work for your specific needs.

1. Map your task to the closest benchmark

If you need a coding agent, start with SWE-bench results. Customer service? Check τ-bench. Web automation? WebArena. This gives you a rough ceiling for what to expect.

2. Build your own eval set (20-50 examples)

Take real tasks from your team's last month of work. Run the agent on them. Your private eval is 10x more predictive than any public benchmark because it captures your specific domain, data, and edge cases.
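A private eval harness of the kind described above needs only a few lines to start. `agent_fn`, `grader`, and the example format are placeholders for your own integration:

```python
def run_evals(examples: list[dict], agent_fn, grader) -> float:
    """Run the agent over your own examples and return the pass rate.
    `grader` decides success per example (exact match, test suite,
    human rubric — whatever fits your domain)."""
    passed = sum(grader(ex, agent_fn(ex["input"])) for ex in examples)
    return passed / len(examples)
```

Keep the examples and graders in version control; the monthly re-runs recommended in step 5 are only meaningful if the eval set stays fixed.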

3. Measure total cost, not just accuracy

Track API spend, latency, and human review time per task. An agent that solves 60% of tasks but requires a human to verify every output may cost more than just doing it manually.
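The comparison above comes down to cost per *successfully* resolved task, which folds in both human review time and the failure rate. All inputs below are per-task averages and the numbers are illustrative:

```python
def cost_per_resolved(api_cost: float, review_minutes: float,
                      hourly_rate: float, resolve_rate: float) -> float:
    """Fully loaded cost per resolved task: API spend plus human review
    time, divided by the fraction of tasks that actually land."""
    per_task = api_cost + (review_minutes / 60.0) * hourly_rate
    return per_task / resolve_rate

# e.g. $2 of API calls, 6 min of review at $90/h, 60% resolve rate:
# (2 + 9) / 0.6 ≈ $18.33 per resolved task
```

If that figure exceeds what a human costs to do the task directly, the agent's accuracy number is irrelevant.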

4. Test failure modes, not just successes

When the agent fails, HOW does it fail? Silent wrong answers are worse than refusals. Data corruption is worse than timeouts. The failure taxonomy matters more than the success rate.

5. Re-evaluate monthly

Agent capabilities are improving rapidly. An agent that scored 30% three months ago may score 50% today with a new model or framework. Set calendar reminders to re-run your evals.

Benchmark scores are sourced from published papers, official leaderboards, and verified reproductions. Scores change frequently as new agents are released. Last updated March 2026.