Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve roughly 15-40% success on realistic web tasks, depending on the benchmark and scaffolding, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
Web agents autonomously navigate websites, fill forms, click buttons, and complete multi-step tasks in a browser. Benchmarks like WebArena and Mind2Web test this capability; the best current agents complete 30-40% of complex web tasks — a dramatic improvement from near-zero in 2023, but still far from reliable.
History
MiniWoB benchmark introduced for simple web interaction tasks
WebShop simulates e-commerce navigation with 1.2M products
Mind2Web released — 2K real-world web tasks across 137 websites
WebArena provides a realistic, self-hosted web environment with 812 tasks
GPT-4V + SoM (Set-of-Mark) prompting achieves 15% on WebArena
Claude 3.5 Sonnet with computer use capability enables direct screenshot-based web interaction
VisualWebArena extends WebArena with vision-dependent tasks
Agent-E and BrowserUse frameworks reach 30%+ on WebArena
Claude computer use and OpenAI Operator launch as commercial web-agent products
Multi-tab, multi-site tasks emerge as the next challenge frontier
How Web & Desktop Agents Work
Task Interpretation
The agent receives a natural language task (e.g., 'Book the cheapest flight from NYC to London on March 15') and plans the sequence of web interactions needed.
Page Observation
The agent observes the current page state via a screenshot (vision-based) or a structured DOM/accessibility-tree representation.
Action Selection
Based on the current state and goal, the agent selects an action: click, type, scroll, navigate, or wait.
Execution
The action is executed in the browser environment, producing a new page state.
Progress Tracking
The agent evaluates whether the task is progressing toward completion, backtracking or trying alternative approaches when it detects errors.
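The steps above form an observe-act loop. A minimal sketch, with the LLM policy and browser stubbed out; in a real agent, the observation would be a screenshot or accessibility tree and `policy` would call a model:

```python
# Minimal sketch of the observe -> select -> execute -> track loop.
# Everything here is illustrative: Action, policy, and the "pages" list
# stand in for real browser state and a real LLM call.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str          # "click", "type", "scroll", "navigate", "done"
    target: str = ""   # e.g. a CSS selector or numbered element id
    text: str = ""     # text to type, if any

def policy(goal: str, observation: str) -> Action:
    # Stub standing in for an LLM call: declare success once the
    # goal text appears in the observed page state.
    if goal.lower() in observation.lower():
        return Action("done")
    return Action("click", target="#next")

def run_agent(goal: str, pages: list[str], max_steps: int = 10) -> bool:
    steps = 0
    for observation in pages:            # each "page" is one observed state
        action = policy(goal, observation)
        if action.kind == "done":
            return True                  # task judged complete
        steps += 1
        if steps >= max_steps:
            break                        # step budget exhausted: give up
    return False

# Usage: the agent clicks through two states, then sees the goal text.
print(run_agent("flight booked", ["home page", "search results", "flight booked!"]))
```

The step budget matters in practice: without it, an agent stuck on a dynamic page can loop indefinitely.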
Current Landscape
Web agents in 2025 work well for structured, predictable web tasks (form filling, data extraction, simple navigation) but remain unreliable for complex multi-step workflows on dynamic sites. The field uses two paradigms: DOM-based (parsing HTML/accessibility trees) and vision-based (screenshot understanding). Vision-based approaches are more general but slower and more expensive. Commercial products (Claude computer use, Operator) are emerging but still require human oversight for important tasks.
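The two paradigms differ mainly in what the model is shown. A sketch of how each observation might be packaged as a chat message; the message shapes follow the common multimodal chat-completions convention, and the strings here are placeholders rather than real model inputs:

```python
# DOM-based vs. vision-based observations as model inputs (illustrative).
import base64

def dom_message(tree: str) -> dict:
    # Structured paradigm: serialize the accessibility tree as plain text.
    return {"role": "user",
            "content": f"Accessibility tree:\n{tree}\nChoose the next action."}

def vision_message(png_bytes: bytes) -> dict:
    # Vision paradigm: send the raw screenshot, base64-encoded.
    b64 = base64.b64encode(png_bytes).decode()
    return {"role": "user", "content": [
        {"type": "text", "text": "Here is the current page. Choose the next action."},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}}]}
```

The cost difference in the landscape above follows directly from this: a compact text tree is a few hundred tokens, while each screenshot costs image tokens and a slower vision forward pass.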
Key Challenges
Observation complexity — real web pages have thousands of DOM elements, requiring effective filtering and attention
Dynamic content — JavaScript-heavy sites, pop-ups, and loading states make action timing critical
Error recovery — wrong clicks can lead to unrecoverable states; agents need to detect and recover from mistakes
Authentication and CAPTCHAs — real-world web tasks often require login, 2FA, and CAPTCHA solving
Safety — autonomous web agents can make irreversible actions (purchases, deletions, form submissions)
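The observation-complexity challenge above is commonly handled by filtering the page down to its interactive elements and numbering them so the model can reference them by id. A minimal sketch using Python's stdlib HTML parser; real agents typically work from the browser's accessibility tree instead:

```python
# Compact a page to numbered interactive elements (illustrative filter set).
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class InteractiveElements(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []
    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            # Assign a short numeric id the model can cite, e.g. "click [1]".
            self.elements.append((len(self.elements), tag, dict(attrs)))

def compact_observation(html: str) -> list[tuple[int, str, dict]]:
    parser = InteractiveElements()
    parser.feed(html)
    return parser.elements

page = '<div><p>Welcome</p><input name="q"><button id="go">Search</button></div>'
for idx, tag, attrs in compact_observation(page):
    print(idx, tag, attrs)
# The model now sees 2 actionable elements instead of the full DOM.
```

The same numbered-element idea underlies Set-of-Mark prompting for vision-based agents, where the marks are drawn onto the screenshot rather than listed as text.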
Quick Recommendations
Browser automation research: WebArena + Claude 3.5 Sonnet. The most realistic benchmark paired with the strongest multimodal model.
Production web automation: Claude computer use / OpenAI Operator. Commercial products with safety guardrails for real web interaction.
Open-source framework: BrowserUse / Playwright + LLM. Flexible, extensible frameworks for building custom web agents.
Simple web tasks: Selenium + GPT-4o vision. Cost-effective for well-defined, repetitive web workflows.
What's Next
The frontier is reliable multi-site, multi-session web agents that can handle real-world complexity — authentication, dynamic content, error recovery, and safety constraints. Expect advances in: (1) hybrid DOM+vision observation, (2) persistent browser state and memory, (3) human-in-the-loop confirmation for irreversible actions.
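Point (3), human-in-the-loop confirmation, can be as simple as a policy gate in the action executor. A hypothetical sketch; the action names and the `confirm` callback are illustrative assumptions, not any product's API:

```python
# Gate irreversible actions behind explicit confirmation before execution.
IRREVERSIBLE = {"purchase", "delete", "submit"}

def execute(action: str, confirm) -> str:
    # `confirm` is a callback that asks a human (or stricter policy)
    # whether this specific action may proceed.
    if action in IRREVERSIBLE and not confirm(action):
        return "blocked"
    return "executed"

print(execute("click", confirm=lambda a: False))     # executed: reversible
print(execute("purchase", confirm=lambda a: False))  # blocked: no approval
```

Production systems refine this with per-site allowlists and dollar-amount thresholds, but the core pattern is the same: the agent proposes, a gate disposes.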
Benchmarks & SOTA
OSWorld
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
369 real computer tasks across Windows, macOS, and Ubuntu requiring GUI interaction. Tests agents operating full desktop apps like spreadsheets, image editors, and terminals. Much harder than web-only benchmarks.
State of the Art: CoAct-1 (Salesforce), 60.76% success rate.
WebArena
WebArena: A Realistic Web Environment for Building Autonomous Agents
812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repos, CMS). Tests ability to complete real-world browser tasks like making purchases, posting content, or querying databases.
State of the Art: Agent-E (GPT-4o, Emergence AI), 73% success rate.
Related Tasks
Task agents
AI agents are autonomous software systems that pursue goals on behalf of users: they perceive their environment, make decisions, and act without constant human intervention. They draw on capabilities like reasoning, memory, planning, and learning, often built on large language models (LLMs) and other AI tools, to interpret information and carry out complex workflows across industries.
Autonomous Coding
Agent benchmarks where systems complete coding, terminal, repository, or developer-workflow tasks with minimal human intervention.
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Tool Use
Benchmarks measuring AI agents' ability to use tools and APIs to complete real-world tasks across domains like retail and airline customer service.