7 benchmarks · 6 edges · Updated 2026-04-27
Benchmark lineage

Agentic AI Benchmarks

How evaluation of AI agents evolved from structured task completion in synthetic environments through real-world software engineering to open-ended computer use. The coding lineage (see coding.json) covers SWE-bench and its successors in depth — this lineage focuses on the broader question of agent-task evaluation: web navigation, API use, desktop control, and the multi-step planning that connects language model capabilities to real-world action. Branches include OSWorld (visual desktop agents) and tau-bench (function-calling reliability).

Editor's note

The agentic benchmark space is the fastest-moving in AI evaluation — within 18 months, the field went from no standard benchmark to a fragmented landscape of task families: web navigation (WebArena), software engineering (SWE-bench), computer use (OSWorld), function calling (tau-bench), and generalist tasks (GAIA). The core measurement problem hasn't been solved: most benchmarks score final-state binary success, which masks the variance in how agents fail. SWE-bench Verified is the single most-cited agentic benchmark but has been deprecated by its own creators (see coding.json). As of 2025, no single benchmark serves as the universal agent evaluation — GAIA's generalist tasks and SWE-bench Pro's software-engineering track are the closest to consensus anchor points.
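To see why final-state binary scoring masks failure variance, consider a minimal sketch (all names here are hypothetical, not taken from any of these benchmarks' harnesses): two qualitatively different failure trajectories collapse to the same score.

    from dataclasses import dataclass

    @dataclass
    class Trajectory:
        steps: list[str]       # actions the agent took
        final_state_ok: bool   # does the goal condition hold at the end?

    def binary_success(traj: Trajectory) -> int:
        # Final-state binary scoring: 1 if the goal state holds, else 0.
        # Everything about *how* the agent got there (or failed) is discarded.
        return 1 if traj.final_state_ok else 0

    # Two very different failures...
    looped = Trajectory(steps=["click"] * 50, final_state_ok=False)        # stuck in a loop
    near_miss = Trajectory(steps=["search", "open", "submit"],
                           final_state_ok=False)                           # almost correct
    # ...receive identical scores; the distinction is invisible to a leaderboard.
    assert binary_success(looped) == binary_success(near_miss) == 0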

§ 01 · Lineage graph

Attention path plus branches.

Solid arrows follow the attention path — the dashed arrow marks a scope shift (leaderboard attention jumping between tasks). Thin grey arcs drop down to specialized branches.

[Lineage graph legend: attention path · scope shift · branch / fork · active · saturating · saturated / superseded]
[Nodes: GAIA (Nov 2023) · AgentBench (Aug 2023) · WebArena (Jul 2023) · OSWorld (Apr 2024) · tau-bench (Jun 2024) · SWE-bench Verified (Aug 2024, SOTA 87.60) · SWE-bench Pro (Sep 2025)]
GAIA → AgentBench · scope shift
GAIA tests generalist factual task completion; AgentBench tests LLMs across 8 structured interactive environments. Concurrent 2023 benchmarks that approached agent evaluation from different angles — GAIA through real-world tasks, AgentBench through controlled environments.
AgentBench → WebArena · scope shift · attention
WebArena focused AgentBench's broad multi-environment approach onto a single domain — web navigation — with fully functional applications. The specificity made WebArena easier to iterate on and it became the web-agent standard.
WebArena → OSWorld · scope shift
OSWorld extends from web-browser agents to full desktop computer use — GUI-level actions across applications, not just browser navigation. A natural scope expansion once web-only agents hit a ceiling.
WebArena → tau-bench · scope shift
tau-bench moves from open-ended web navigation to constrained tool-calling in customer-service workflows — testing reliability and rule-following rather than exploration. A different failure mode from WebArena's navigation challenges.
GAIA → SWE-bench Verified · scope shift · attention
SWE-bench Verified narrowed the generalist-agent task to software engineering specifically — real GitHub issues on real codebases. Where leaderboard attention concentrated from 2024 onward.
SWE-bench Verified → SWE-bench Pro · direct successor · attention
Verified was deprecated by OpenAI in September 2025 due to contamination and shortcut-reward flaws. SWE-bench Pro adds held-out commercial repos and contamination controls. The current software-agent frontier.
§ 02 · Benchmarks in this lineage

Nodes in detail.

Jul 2023 · Active

WebArena

WebArena: Realistic Web Navigation Environment

812 tasks across 4 fully functional web applications (Reddit, GitLab, e-commerce, CMS). Realistic task completion with binary success scoring. GPT-4 achieved ~14% at launch; top agents approached 45% by late 2024. The web-agent standard alongside GAIA.

Zhou et al. (CMU / Meta / Google / Amazon) · paper
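WebArena's binary scoring is grounded in the environment rather than the transcript: an evaluator inspects the final state of the live application. A sketch of that pattern under assumed details; the task, checker, and score_task names below are illustrative, not WebArena's actual evaluator API.

    import requests

    def check_gitlab_issue_closed(base_url: str, project_id: int,
                                  issue_iid: int, token: str) -> bool:
        # Hypothetical final-state checker: did the agent actually close the issue?
        # WebArena-style evaluation queries the live app's backend, not the
        # agent's own claim of success.
        resp = requests.get(
            f"{base_url}/api/v4/projects/{project_id}/issues/{issue_iid}",
            headers={"PRIVATE-TOKEN": token},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json()["state"] == "closed"

    def score_task(checker, *args) -> int:
        # Binary success: the checker inspects final state and returns True/False.
        return 1 if checker(*args) else 0

The design choice matters: because grading reads the backend state, an agent that merely claims success still scores 0.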
Aug 2023 · Active

AgentBench

AgentBench: Evaluating LLMs as Agents

8 diverse environments: OS, database, knowledge graph, digital card game, lateral thinking puzzles, house-holding, web shopping, and web browsing. Measures LLMs across the full breadth of agentic task types in a single framework. GPT-4 scored 3.6 overall; open-source models trailed significantly at launch.

Liu et al. (Tsinghua University) · paper
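Running one model across 8 heterogeneous environments requires a shared protocol. A minimal sketch of the idea, with hypothetical class and method names (AgentBench's real interface differs): every environment reduces to a text-in, text-out step loop.

    from abc import ABC, abstractmethod

    class AgentEnv(ABC):
        # Hypothetical unified interface: each of the 8 environments would
        # implement reset/step, so one harness can drive them all.

        @abstractmethod
        def reset(self) -> str:
            """Return the initial observation (task description)."""

        @abstractmethod
        def step(self, action: str) -> tuple[str, bool, float]:
            """Apply an agent action; return (observation, done, reward)."""

    def run_episode(env: AgentEnv, agent, max_turns: int = 30) -> float:
        # agent is any callable mapping observation text to action text.
        obs = env.reset()
        for _ in range(max_turns):
            obs, done, reward = env.step(agent(obs))
            if done:
                return reward
        return 0.0  # ran out of turns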
Nov 2023 · Active

GAIA

GAIA: General AI Assistants Benchmark

466 real-world tasks at three difficulty levels — browsing, calculation, multi-hop reasoning — each with a factual, verifiable answer. Human respondents score ~92% overall; GPT-4 with plugins scored 15% at launch. The first widely-adopted general-agent benchmark.

Mialon et al. (Meta AI / HuggingFace) · paper
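Because every GAIA task has a single factual answer, scoring can be a string comparison after normalization. A simplified sketch of that quasi-exact-match idea; the official scorer applies more normalization rules than this.

    import re

    def normalize(ans: str) -> str:
        # Lowercase, trim, and drop commas/whitespace so that "1,234" and
        # " 1234 " compare equal. A simplification, not GAIA's official scorer.
        return re.sub(r"[,\s]+", "", ans.strip().lower())

    def score(prediction: str, gold: str) -> int:
        return int(normalize(prediction) == normalize(gold))

    assert score(" 1,234 ", "1234") == 1
    assert score("Paris", "paris ") == 1
    assert score("Lyon", "Paris") == 0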
Apr 2024 · Active

OSWorld

OSWorld: Visual Computer-Use Agent Benchmark

369 real computer tasks across Windows, macOS, and Ubuntu using GUI screenshots and actions — file management, app control, web browsing, multi-app workflows. GPT-4V scored ~14% at launch. Tests the visual-desktop-control capability orthogonal to SWE-bench's code-level operations.

Xie et al. · paper
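OSWorld-style agents perceive through screenshots and act through GUI primitives. A hedged sketch of that observation/action loop using pyautogui; the action schema below is invented for illustration, and the benchmark defines its own observation and action spaces.

    import io
    import pyautogui  # GUI automation library; the agent acts at this level

    def observe() -> bytes:
        # Grab the current screen as the agent's visual observation.
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        return buf.getvalue()

    def act(action: dict) -> None:
        # Execute one GUI-level action from a toy (hypothetical) schema.
        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"], interval=0.05)
        elif action["type"] == "hotkey":
            pyautogui.hotkey(*action["keys"])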
Jun 2024 · Active

tau-bench

tau-bench: Tool-Agent-User Benchmark

Two domains (airline, retail) with 115 and 167 tasks involving simulated user dialogue and tool calling against real policy constraints. Measures whether agents complete multi-turn customer service tasks reliably without violating domain rules. The headline metric is pass^k, the probability that all k independent trials of a task succeed; top models score 40–60% at k=1, and scores drop sharply as k grows.

Yao et al. (Princeton / Salesforce) · paper
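tau-bench's reliability focus shows up in how pass^k is computed: given n recorded trials of a task with c successes, the probability that k freshly drawn trials all succeed can be estimated combinatorially, the all-succeed counterpart of the familiar pass@k estimator. A sketch:

    from math import comb

    def pass_hat_k(n: int, c: int, k: int) -> float:
        # Unbiased estimate of pass^k: probability that k i.i.d. trials of the
        # same task ALL succeed, given c successes observed in n trials.
        if c < k:
            return 0.0
        return comb(c, k) / comb(n, k)

    print(pass_hat_k(5, 4, 1))  # 0.8
    print(pass_hat_k(5, 4, 4))  # 0.2

A task solved in 4 of 5 recorded trials estimates 0.8 at k=1 but only 0.2 at k=4, which is exactly the reliability gap that single-run pass rates hide.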
Aug 2024 · Saturating

SWE-bench Verified

SWE-bench Verified (reference node — see coding lineage)

500 human-verified real GitHub issues. The dominant agentic-coding benchmark until 2025, when OpenAI stopped evaluating on it. Full history in the coding lineage. Included here as the bridge between web-agent benchmarks and the software-engineering agent track.

OpenAI + SWE-bench team · paper
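As a bridge note: SWE-bench-style grading is also final-state and binary, but the "state" is a test suite. Apply the agent's patch and require the issue's previously failing tests to pass; the real harness also re-runs passing tests to catch regressions. A rough sketch with hypothetical arguments, not the official grader:

    import subprocess

    def grade_patch(repo_dir: str, patch_file: str,
                    fail_to_pass: list[str]) -> bool:
        # Apply the agent's patch to a clean checkout of the repo...
        subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
        # ...then require every fail-to-pass test to now succeed.
        result = subprocess.run(
            ["python", "-m", "pytest", "-x", *fail_to_pass],
            cwd=repo_dir, capture_output=True,
        )
        return result.returncode == 0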
Sep 2025 · Active

SWE-bench Pro

SWE-bench Pro (Scale AI) — see coding lineage

1,865 contamination-controlled software engineering tasks from 41 business repos. GPT-5 and Claude Opus 4.1 score ~23% here vs >70% on Verified. The current software-agent frontier. Full detail in the coding lineage.

Scale AI · paper