Agentic AI Benchmarks
How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json) covers SWE-bench and its successors in depth — this lineage focuses on the broader question of agent-task evaluation: web navigation, API use, desktop control, and the multi-step planning that connects language model capabilities to real-world action. Branches include OSWorld (visual desktop agents) and tau-bench (function-calling reliability).
The agentic benchmark space is the fastest-moving in AI evaluation — within 18 months, the field went from no standard benchmark to a fragmented landscape of task families: web navigation (WebArena), software engineering (SWE-bench), computer use (OSWorld), function calling (tau-bench), and generalist tasks (GAIA). The core measurement problem hasn't been solved: most benchmarks score final-state binary success, which masks the variance in how agents fail. SWE-bench Verified is the single most-cited agentic benchmark but is actively being deprecated by its own creators (see coding.json). As of 2025, no single benchmark serves as the universal agent evaluation — GAIA's generalist tasks and SWE-bench Pro's software-engineering track are the closest to consensus anchor points.
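To make the measurement problem concrete, here is a minimal sketch of final-state binary scoring in the style most of these benchmarks use; the names and result structure are illustrative, not any particular harness's API:

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative only: a generic final-state, binary-success evaluator.
# Not any benchmark's actual harness.

@dataclass
class EpisodeResult:
    task_id: str
    final_state: dict          # whatever the environment exposes after the rollout
    steps_taken: int
    error: str | None = None   # crash, timeout, etc.

def binary_success(result: EpisodeResult, checker: Callable[[dict], bool]) -> int:
    """1 if the end state satisfies the task's checker, else 0.
    How the agent got there (steps, partial progress, failure mode) is
    discarded, which is the masking problem described above."""
    if result.error is not None:
        return 0
    return int(checker(result.final_state))

def success_rate(results: list[EpisodeResult],
                 checkers: dict[str, Callable[[dict], bool]]) -> float:
    scores = [binary_success(r, checkers[r.task_id]) for r in results]
    return sum(scores) / len(scores) if scores else 0.0
```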
Attention path plus branches.
Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches. Click any node to jump to its detail.
Nodes in detail.
WebArena
812 tasks across 4 fully functional web applications (Reddit, GitLab, e-commerce, CMS). Realistic task completion with binary success scoring. GPT-4 achieved ~14% at launch; top agents approached 45% by late 2024. The web-agent standard alongside GAIA.
AgentBench
8 diverse environments: OS, database, knowledge graph, digital card game, lateral thinking puzzles, household tasks, web shopping, and web browsing. Measures LLMs across the full breadth of agentic task types in a single framework. GPT-4 scored 3.6 overall; open-source models trailed significantly at launch.
GAIA
466 real-world questions spanning three difficulty levels and requiring web browsing, calculation, and multi-hop reasoning, each with a short, factual, verifiable answer. Humans solve ~92% of tasks; GPT-4 with plugins scored 15% at launch. The first widely adopted general-agent benchmark.
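Because every GAIA answer is a short verifiable string, scoring reduces to normalize-then-compare. A sketch of that idea follows; the normalization rules here are illustrative, not GAIA's official scorer:

```python
import re

def normalize(answer: str) -> str:
    """Illustrative normalization: lowercase, trim, drop thousands separators
    and stray punctuation so '1,234' and '1234' compare equal."""
    text = answer.strip().lower()
    text = text.replace(",", "")
    text = re.sub(r"[^\w\s.\-]", "", text)
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, gold: str) -> bool:
    """Binary credit: the agent's final answer either matches or it doesn't."""
    return normalize(prediction) == normalize(gold)

# e.g. exact_match(" 1,234 ", "1234") -> True
```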
OSWorld
369 real computer tasks across Windows, macOS, and Ubuntu, performed through GUI screenshots and actions: file management, app control, web browsing, and multi-app workflows. GPT-4V scored ~14% at launch. Tests visual desktop control, a capability orthogonal to SWE-bench's code-level operations.
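The screenshot-in, action-out loop these desktop benchmarks assume can be sketched as follows; `env` and `agent` are hypothetical stand-ins, and OSWorld's real harness executes concrete mouse and keyboard commands inside a VM rather than this toy action schema:

```python
from dataclasses import dataclass
from typing import Literal, Union

# Hypothetical action schema for a screenshot-driven desktop agent.
@dataclass
class Click:
    x: int
    y: int
    button: Literal["left", "right"] = "left"

@dataclass
class TypeText:
    text: str

Action = Union[Click, TypeText]

def run_episode(env, agent, max_steps: int = 15) -> dict:
    """Generic rollout: the agent sees a screenshot, emits an action, and only
    the final machine state is handed to the benchmark's checker (pass/fail)."""
    obs = env.reset()                 # screenshot bytes + window metadata
    for _ in range(max_steps):
        action = agent.act(obs)       # an Action from the schema above
        obs, done = env.step(action)
        if done:
            break
    return env.final_state()          # scored by a task-specific checker
```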
tau-bench
Two domains (airline, retail) with 115 and 167 tasks involving simulated user dialogue and tool calling under realistic policy constraints. Measures whether agents complete multi-turn customer-service tasks reliably without violating the rules. Pass rate across 5 trials (tau-bench's pass^k metric) is the headline number; top models score 40–60%.
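A sketch of how a pass^k-style reliability number can be computed from repeated trials, assuming each task has been run at least k times; field names are illustrative and this is not tau-bench's actual evaluation code:

```python
from collections import defaultdict

def pass_hat_k(trial_results: list[tuple[str, bool]], k: int = 5) -> float:
    """pass^k: the fraction of tasks the agent solves in *all* k trials.
    Unlike a per-trial success rate, this penalizes agents that only succeed
    intermittently, which is the reliability question tau-bench targets.
    `trial_results` holds (task_id, succeeded) pairs, k entries per task."""
    by_task: dict[str, list[bool]] = defaultdict(list)
    for task_id, ok in trial_results:
        by_task[task_id].append(ok)
    eligible = [runs for runs in by_task.values() if len(runs) >= k]
    if not eligible:
        return 0.0
    return sum(all(runs[:k]) for runs in eligible) / len(eligible)

# An agent that solves a task in 4 of 5 trials still scores 0 on it at k=5.
```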
SWE-bench Verified
500 human-verified real GitHub issues. The dominant agentic-coding benchmark until 2025, when OpenAI stopped evaluating on it. Full history in the coding lineage. Included here as the bridge between web-agent benchmarks and the software-engineering agent track.
SWE-bench Pro
1,865 contamination-controlled software engineering tasks from 41 business repos. GPT-5 and Claude Opus 4.1 score ~23% here vs >70% on Verified. The current software-agent frontier. Full detail in the coding lineage.