Agentic AI

Web & Desktop Agents

Web and desktop agents are AI systems that operate browsers and GUIs to complete real tasks. They are benchmarked on WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. The task combines multimodal understanding with sequential decision-making, so progress here is a practical indicator of when AI assistants can reliably act on a user's behalf.


Web agents autonomously navigate websites, fill forms, click buttons, and complete multi-step tasks in a browser. Benchmarks like WebArena and Mind2Web test this capability, with current best agents completing 30-40% of complex web tasks — a dramatic improvement from near-zero in 2023 but still far from reliable.

History

2017: MiniWoB benchmark introduced for simple web interaction tasks

2022: WebShop simulates e-commerce navigation with 1.2M products

2023: Mind2Web released with 2K real-world web tasks across 137 websites

2023: WebArena provides a realistic, self-hosted web environment with 812 tasks

2024: GPT-4V + SoM (Set-of-Mark) prompting achieves 15% on WebArena

2024: Claude 3.5 Sonnet with computer use capability enables direct screenshot-based web interaction

2024: VisualWebArena extends WebArena with vision-dependent tasks

2024: Agent-E and BrowserUse frameworks reach 30%+ on WebArena

2025: Claude computer use and OpenAI Operator represent commercial web agent products

2025: Multi-tab, multi-site tasks emerge as the next challenge frontier

How Web & Desktop Agents Work

1. Task Interpretation: The agent receives a natural language task (e.g., "Book the cheapest flight from NYC to London on March 15") and plans the sequence of web interactions needed.

2. Page Observation: The agent observes the current page state via screenshot (vision) or DOM/accessibility tree (structured) representation.

3. Action Selection: Based on the current state and goal, the agent selects an action: click, type, scroll, navigate, or wait.

4. Execution: The action is executed in the browser environment, producing a new page state.

5. Progress Tracking: The agent evaluates whether the task is progressing toward completion, potentially backtracking or trying alternative approaches on errors.
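
The five steps above can be sketched as a simple control loop. The following is a minimal illustration, not any particular framework's implementation: `ToyBrowser`, `Action`, `scripted_policy`, and `run_agent` are hypothetical names, the "browser" is an in-memory login form, and the policy is a hand-written stand-in for the model call a real agent would make. The scripted policy also ignores the task text, so step 1 (planning from the instruction) is only represented by the `task` argument.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str         # "click", "type", "scroll", "navigate", or "wait"
    target: str = ""  # element name (hypothetical identifier scheme)
    text: str = ""    # text to enter for "type" actions


@dataclass
class ToyBrowser:
    """Hypothetical stand-in for a browser showing a two-field login form."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> dict:
        # Step 2: a structured page state. A real agent would read a
        # screenshot or the DOM/accessibility tree instead.
        return {"filled": sorted(self.fields), "submitted": self.submitted}

    def execute(self, action: Action) -> None:
        # Step 4: apply the action, producing a new page state.
        if action.kind == "type":
            self.fields[action.target] = action.text
        elif action.kind == "click" and action.target == "submit":
            # The form only submits once both fields are filled.
            self.submitted = {"user", "password"}.issubset(self.fields)


def scripted_policy(task: str, obs: dict) -> Optional[Action]:
    """Step 3 placeholder: a real agent would query a model here."""
    if "user" not in obs["filled"]:
        return Action("type", "user", "alice")
    if "password" not in obs["filled"]:
        return Action("type", "password", "secret")
    if not obs["submitted"]:
        return Action("click", "submit")
    return None  # Step 5: nothing left to do, task judged complete


def run_agent(task: str, env: ToyBrowser,
              policy: Callable[[str, dict], Optional[Action]],
              max_steps: int = 10) -> tuple:
    """Observe, select, execute, and check progress until done or out of steps."""
    history = []
    for _ in range(max_steps):
        obs = env.observe()
        action = policy(task, obs)
        if action is None:
            return True, history
        env.execute(action)
        history.append(action)
    return False, history  # step budget exhausted without completing


done, steps = run_agent("Log in as alice", ToyBrowser(), scripted_policy)
print(done, [a.kind for a in steps])
```

The `max_steps` budget is a design choice worth keeping in any real loop: it bounds cost and stops an agent that is stuck repeating an action that no longer changes the page.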

Current Landscape

Web agents in 2025 work well for structured, predictable web tasks (form filling, data extraction, simple navigation) but remain unreliable for complex multi-step workflows on dynamic sites. The field uses two paradigms: DOM-based (parsing HTML/accessibility trees) and vision-based (screenshot understanding). Vision-based approaches are more general but slower and more expensive. Commercial products (Claude computer use, Operator) are emerging but still require human oversight for important tasks.
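
To make the DOM-based paradigm concrete, here is a minimal sketch that reduces raw HTML to a short, numbered list of interactive elements, the kind of compact observation a text-only model could act on. It uses only Python's standard `html.parser`. The choice of tags, the label priority, and the output format are assumptions made for illustration; real agents typically work from the rendered DOM or accessibility tree rather than static HTML.

```python
from html.parser import HTMLParser

# Assumption: these tags are treated as "interactive" for this sketch.
INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}


class InteractiveElementExtractor(HTMLParser):
    """Collects (tag, label) pairs for elements an agent could act on."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag not in INTERACTIVE_TAGS:
            return
        attr_map = dict(attrs)
        if attr_map.get("type") == "hidden":
            return  # hidden inputs are not visible to a user
        # Pick the first available attribute as a human-readable label.
        label = (attr_map.get("aria-label") or attr_map.get("name")
                 or attr_map.get("id") or attr_map.get("href") or "")
        self.elements.append((tag, label))


def summarize_page(html: str) -> str:
    """Return one numbered line per interactive element."""
    parser = InteractiveElementExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(
        f"[{i}] <{tag}> {label}".rstrip()
        for i, (tag, label) in enumerate(parser.elements)
    )


page = (
    '<div><p>Welcome</p><input name="q" type="text">'
    '<input type="hidden" name="csrf" value="x">'
    '<button id="search-btn">Search</button>'
    '<a href="/help">Help</a><span>footer</span></div>'
)
print(summarize_page(page))
```

The numeric indices give the model a stable way to refer to an element ("click [1]"), which is the same idea Set-of-Mark prompting applies to screenshots.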

Key Challenges

Observation complexity: real web pages have thousands of DOM elements, requiring effective filtering and attention

Dynamic content: JavaScript-heavy sites, pop-ups, and loading states make action timing critical

Error recovery: wrong clicks can lead to unrecoverable states; agents need to detect and recover from mistakes

Authentication and CAPTCHAs: real-world web tasks often require login, 2FA, and CAPTCHA solving

Safety: autonomous web agents can take irreversible actions (purchases, deletions, form submissions)
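
The action-timing problem raised under dynamic content is commonly handled by polling for a readiness condition before acting. The helper below is a generic sketch of that pattern, not a specific library's API: `wait_until` and its parameters are hypothetical names, and the "page" is simulated with a counter. Browser automation libraries such as Playwright ship their own built-in waiting, which would normally be preferred over a hand-rolled loop.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout_s: float = 5.0,
               poll_s: float = 0.1) -> bool:
    """Poll `condition` until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False  # caller decides whether to retry or backtrack
        time.sleep(poll_s)


# Simulated page that finishes loading on the third check.
checks = {"count": 0}


def page_loaded() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


if wait_until(page_loaded, timeout_s=1.0, poll_s=0.01):
    print("page ready, safe to act")
else:
    print("timed out, consider error recovery")
```

Returning `False` on timeout, rather than raising, leaves the recovery decision to the agent loop, which ties this helper to the error-recovery challenge listed above.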

Quick Recommendations

Browser automation research: WebArena + Claude 3.5 Sonnet. Most realistic benchmark with strongest multimodal model.

Production web automation: Claude computer use / OpenAI Operator. Commercial products with safety guardrails for real web interaction.

Open-source framework: BrowserUse / Playwright + LLM. Flexible, extensible framework for building custom web agents.

Simple web tasks: Selenium + GPT-4o vision. Cost-effective for well-defined, repetitive web workflows.

What's Next

The frontier is reliable multi-site, multi-session web agents that can handle real-world complexity — authentication, dynamic content, error recovery, and safety constraints. Expect advances in: (1) hybrid DOM+vision observation, (2) persistent browser state and memory, (3) human-in-the-loop confirmation for irreversible actions.
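
Human-in-the-loop confirmation for irreversible actions, item (3) above, can be sketched as a thin gate in front of the executor. This is an illustrative assumption about how such a gate might look, not how any commercial product implements it: `guarded_execute`, the `IRREVERSIBLE_KINDS` labels, and the `confirm` callback are all hypothetical, and the human is simulated by a function that always declines.

```python
from typing import Callable

# Hypothetical action labels treated as irreversible for this sketch.
IRREVERSIBLE_KINDS = {"purchase", "delete", "submit_form"}


def guarded_execute(kind: str,
                    execute: Callable[[], None],
                    confirm: Callable[[str], bool]) -> str:
    """Run `execute` only if the action is reversible or a human approves it."""
    if kind in IRREVERSIBLE_KINDS and not confirm(kind):
        return "blocked"
    execute()
    return "executed"


log = []
# A scroll is reversible, so it runs without asking anyone.
print(guarded_execute("scroll", lambda: log.append("scroll"), lambda k: False))
# A purchase is held back because the (simulated) human declines.
print(guarded_execute("purchase", lambda: log.append("purchase"), lambda k: False))
```

In practice the hard part is classification: deciding which clicks are irreversible on an arbitrary site is itself a grounding problem, so a fixed label set like the one here is only a starting point.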
