Web & Desktop Agents

Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.

Web agents autonomously navigate websites, fill forms, click buttons, and complete multi-step tasks in a browser. Benchmarks like WebArena and Mind2Web test this capability, with current best agents completing 30-40% of complex web tasks — a dramatic improvement from near-zero in 2023 but still far from reliable.

History

2017

MiniWoB benchmark introduced for simple web interaction tasks

2022

WebShop simulates e-commerce navigation with 1.2M products

2023

Mind2Web released — 2K real-world web tasks across 137 websites

2023

WebArena provides a realistic, self-hosted web environment with 812 tasks

2024

GPT-4V + SoM (Set-of-Mark) prompting achieves 15% on WebArena

2024

Claude 3.5 Sonnet with computer use capability enables direct screenshot-based web interaction

2024

VisualWebArena extends WebArena with vision-dependent tasks

2024

Agent-E and BrowserUse frameworks reach 30%+ on WebArena

2025

Claude computer use and OpenAI Operator represent commercial web agent products

2025

Multi-tab, multi-site tasks emerge as the next challenge frontier

How Web & Desktop Agents Work

Web & Desktop Agents Pipeline

1

Task Interpretation

The agent receives a natural language task (e.g., 'Book the cheapest flight from NYC to London on March 15') and plans the sequence of web interactions needed.

2

Page Observation

The agent observes the current page state via screenshot (vision) or DOM/accessibility tree (structured) representation.

3

Action Selection

Based on the current state and goal, the agent selects an action: click, type, scroll, navigate, or wait.

4

Execution

The action is executed in the browser environment, producing a new page state.

5

Progress Tracking

The agent evaluates whether the task is progressing toward completion, potentially backtracking or trying alternative approaches on errors.
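
The five steps above can be sketched as a simple observe-act loop. This is a minimal illustration rather than any particular framework's implementation: `ToyBrowser`, `Action`, `choose_action`, and `run_agent` are hypothetical names, the browser is an in-memory stand-in, and the rule-based `choose_action` takes the place of a model call. The `goal` dictionary stands for the output of step 1 (the task already interpreted into target field values).

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str            # "type", "click", or "done"
    target: str = ""     # element identifier (hypothetical)
    text: str = ""       # text to type, if any


@dataclass
class ToyBrowser:
    """Hypothetical in-memory stand-in for a real browser session."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> dict:
        # Step 2: return a structured view of the current page state.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        # Step 4: apply the action, producing a new page state.
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def choose_action(goal: dict, obs: dict) -> Action:
    # Step 3: rule-based placeholder for the model's action selection.
    for name, value in goal.items():
        if obs["fields"].get(name) != value:
            return Action("type", name, value)
    if not obs["submitted"]:
        return Action("click", "submit")
    return Action("done")


def run_agent(goal: dict, browser: ToyBrowser, max_steps: int = 10) -> list:
    history = []
    for _ in range(max_steps):      # the step budget bounds runaway loops
        obs = browser.observe()
        action = choose_action(goal, obs)
        if action.kind == "done":   # Step 5: goal state reached, stop
            break
        browser.execute(action)
        history.append(action)
    return history
```

Calling `run_agent({"origin": "NYC", "destination": "London"}, ToyBrowser())` fills both fields and then clicks submit. A real agent would replace `choose_action` with a model call and add backtracking when an observation shows that an action failed.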

Current Landscape

Web agents in 2025 work well for structured, predictable web tasks (form filling, data extraction, simple navigation) but remain unreliable for complex multi-step workflows on dynamic sites. The field uses two paradigms: DOM-based (parsing HTML/accessibility trees) and vision-based (screenshot understanding). Vision-based approaches are more general but slower and more expensive. Commercial products (Claude computer use, Operator) are emerging but still require human oversight for important tasks.

Key Challenges

Observation complexity — real web pages have thousands of DOM elements, requiring effective filtering and attention

Dynamic content — JavaScript-heavy sites, pop-ups, and loading states make action timing critical

Error recovery — wrong clicks can lead to unrecoverable states; agents need to detect and recover from mistakes

Authentication and CAPTCHAs — real-world web tasks often require login, 2FA, and CAPTCHA solving

Safety — autonomous web agents can make irreversible actions (purchases, deletions, form submissions)
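
To make the observation-complexity challenge concrete, here is a minimal sketch of one common mitigation: pruning a page down to its interactive elements before showing it to the model. It uses only Python's standard `html.parser`. The tag list, the visibility heuristic, and the names `InteractiveElementCollector` and `filter_observation` are assumptions for illustration, not the method of any benchmark or product named here.

```python
from html.parser import HTMLParser

# Assumed set of tags an agent can act on; real systems use richer rules.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementCollector(HTMLParser):
    """Collects interactive elements and skips obviously hidden ones."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        # Heuristic visibility check: hidden inputs and the `hidden` attribute.
        if attr_map.get("type") == "hidden" or "hidden" in attr_map:
            return
        self.elements.append({
            "index": len(self.elements),   # stable handle the model can refer to
            "tag": tag,
            "id": attr_map.get("id") or "",
            "label": attr_map.get("aria-label") or attr_map.get("placeholder") or "",
        })


def filter_observation(html: str) -> list:
    collector = InteractiveElementCollector()
    collector.feed(html)
    return collector.elements
```

On a page with thousands of nodes, a filter like this typically leaves a much shorter list of candidates, each with an index the agent can cite in its chosen action. It does not handle elements hidden by CSS or rendered by JavaScript, which is part of why dynamic content is a separate challenge.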

Quick Recommendations

Browser automation research: WebArena + Claude 3.5 Sonnet. Most realistic benchmark with strongest multimodal model.

Production web automation: Claude computer use / OpenAI Operator. Commercial products with safety guardrails for real web interaction.

Open-source framework: BrowserUse / Playwright + LLM. Flexible, extensible framework for building custom web agents.

Simple web tasks: Selenium + GPT-4o vision. Cost-effective for well-defined, repetitive web workflows.

What's Next

The frontier is reliable multi-site, multi-session web agents that can handle real-world complexity — authentication, dynamic content, error recovery, and safety constraints. Expect advances in: (1) hybrid DOM+vision observation, (2) persistent browser state and memory, (3) human-in-the-loop confirmation for irreversible actions.
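
Human-in-the-loop confirmation, item (3) above, can be sketched as a small gate in front of action execution. This is an illustrative pattern under stated assumptions: the set of irreversible action kinds and the name `guarded_execute` are hypothetical, and `confirm` stands in for whatever approval prompt a real product would show.

```python
from typing import Callable

# Hypothetical set of action kinds treated as irreversible.
IRREVERSIBLE_KINDS = {"purchase", "delete", "submit_form"}


def guarded_execute(
    kind: str,
    execute: Callable[[], str],
    confirm: Callable[[str], bool],
) -> str:
    """Run `execute` only if the action is reversible or a human approves it."""
    if kind in IRREVERSIBLE_KINDS and not confirm(kind):
        return "blocked"
    return execute()
```

Reversible actions such as scrolling pass straight through, while a purchase runs only when `confirm` returns `True`. Keeping the gate outside the model means a mistaken action choice cannot bypass it.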

Benchmarks & SOTA

Related Tasks

HCAST

HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.

Autonomous Coding

Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?

SWE-bench

SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.

RE-Bench

RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.
