Web & Desktop Agents
Web and desktop agents — AI systems that operate browsers and GUIs to complete real tasks — are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Current agents (GPT-4V + Playwright, Claude Computer Use) achieve 15-35% success on realistic web tasks, far below human performance. The core difficulty is grounding: mapping high-level instructions ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here directly predicts when AI assistants can truly act on your behalf.
Web agents autonomously navigate websites, fill forms, click buttons, and complete multi-step tasks in a browser. Benchmarks like WebArena and Mind2Web test this capability, with current best agents completing 30-40% of complex web tasks — a dramatic improvement from near-zero in 2023 but still far from reliable.
History
MiniWoB benchmark introduced for simple web interaction tasks
WebShop simulates e-commerce navigation with 1.2M products
Mind2Web released — 2K real-world web tasks across 137 websites
WebArena provides a realistic, self-hosted web environment with 812 tasks
GPT-4V + SoM (Set-of-Mark) prompting achieves 15% on WebArena
Claude 3.5 Sonnet with computer use capability enables direct screenshot-based web interaction
VisualWebArena extends WebArena with vision-dependent tasks
Agent-E and BrowserUse frameworks reach 30%+ on WebArena
Claude computer use and OpenAI Operator launch as commercial web agent products
Multi-tab, multi-site tasks emerge as the next challenge frontier
How Web & Desktop Agents Work
Task Interpretation
The agent receives a natural language task (e.g., "Book the cheapest flight from NYC to London on March 15") and plans the sequence of web interactions needed.
Page Observation
The agent observes the current page state via screenshot (vision) or DOM/accessibility tree (structured) representation.
Action Selection
Based on the current state and goal, the agent selects an action: click, type, scroll, navigate, or wait.
Execution
The action is executed in the browser environment, producing a new page state.
Progress Tracking
The agent evaluates whether the task is progressing toward completion, potentially backtracking or trying alternative approaches on errors.
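The loop above can be sketched in a few dozen lines. This is a minimal illustration with a stubbed page object standing in for a real browser and a rule-based policy standing in for an LLM call; names like `StubPage` and `select_action` are hypothetical, not from any real agent framework.

```python
# Observe -> select action -> execute -> track progress, against a stub page.
from dataclasses import dataclass, field

@dataclass
class StubPage:
    """Stands in for a live browser page: form fields plus a submit state."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> dict:
        # A real agent would return a screenshot or accessibility tree here.
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def execute(self, action: dict) -> None:
        # A real agent would drive Playwright/Selenium here.
        if action["op"] == "type":
            self.fields[action["target"]] = action["text"]
        elif action["op"] == "click" and action["target"] == "submit":
            self.submitted = True

def select_action(goal: dict, state: dict) -> dict:
    """Policy step: an LLM call in practice; simple rules in this sketch."""
    for name, value in goal.items():
        if state["fields"].get(name) != value:
            return {"op": "type", "target": name, "text": value}
    if not state["submitted"]:
        return {"op": "click", "target": "submit"}
    return {"op": "done"}

def run_agent(page: StubPage, goal: dict, max_steps: int = 10) -> bool:
    for _ in range(max_steps):                 # step budget = crude progress tracking
        state = page.observe()                 # 2. page observation
        action = select_action(goal, state)    # 3. action selection
        if action["op"] == "done":
            return True                        # goal satisfied
        page.execute(action)                   # 4. execution
    return False                               # budget exhausted: task failed

page = StubPage()
done = run_agent(page, {"origin": "NYC", "destination": "London"})
```

The step budget is the simplest form of progress tracking; production agents replace it with an LLM judge that inspects the page state and decides whether to continue, backtrack, or give up.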
Current Landscape
Web agents in 2025 work well for structured, predictable web tasks (form filling, data extraction, simple navigation) but remain unreliable for complex multi-step workflows on dynamic sites. The field uses two paradigms: DOM-based (parsing HTML/accessibility trees) and vision-based (screenshot understanding). Vision-based approaches are more general but slower and more expensive. Commercial products (Claude computer use, Operator) are emerging but still require human oversight for important tasks.
Key Challenges
Observation complexity — real web pages have thousands of DOM elements, requiring effective filtering and attention
Dynamic content — JavaScript-heavy sites, pop-ups, and loading states make action timing critical
Error recovery — wrong clicks can lead to unrecoverable states; agents need to detect and recover from mistakes
Authentication and CAPTCHAs — real-world web tasks often require login, 2FA, and CAPTCHA solving
Safety — autonomous web agents can make irreversible actions (purchases, deletions, form submissions)
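The observation-complexity challenge is usually attacked by filtering: reduce a page with thousands of DOM nodes to the handful of interactive elements an agent can act on. A minimal sketch using only the standard-library HTML parser (real agents more often work from the browser's accessibility tree, and the class name here is illustrative):

```python
# Filter a full DOM down to interactive elements with compact descriptions.
from html.parser import HTMLParser

INTERACTIVE = {"a", "button", "input", "select", "textarea"}

class InteractiveElementFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag in INTERACTIVE:
            attr = dict(attrs)
            # Keep just enough context for an LLM to pick a click/type target.
            self.elements.append({
                "tag": tag,
                "id": attr.get("id"),
                "hint": attr.get("aria-label") or attr.get("placeholder"),
            })

html = """
<html><body>
  <div class="hero"><p>Welcome!</p></div>
  <input id="origin" placeholder="From" />
  <input id="destination" placeholder="To" />
  <button id="search" aria-label="Search flights">Go</button>
  <span>decorative</span>
</body></html>
"""

f = InteractiveElementFilter()
f.feed(html)
# f.elements now lists only the actionable elements, not the whole page
```

Serializing this short element list into the prompt, rather than the raw HTML, is what keeps observation size tractable on real pages.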
Quick Recommendations
Browser automation research
WebArena + Claude 3.5 Sonnet
Most realistic benchmark with strongest multimodal model
Production web automation
Claude computer use / OpenAI Operator
Commercial products with safety guardrails for real web interaction
Open-source framework
BrowserUse / Playwright + LLM
Flexible, extensible framework for building custom web agents
Simple web tasks
Selenium + GPT-4o vision
Cost-effective for well-defined, repetitive web workflows
What's Next
The frontier is reliable multi-site, multi-session web agents that can handle real-world complexity — authentication, dynamic content, error recovery, and safety constraints. Expect advances in: (1) hybrid DOM+vision observation, (2) persistent browser state and memory, (3) human-in-the-loop confirmation for irreversible actions.
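The human-in-the-loop confirmation idea in (3) can be sketched as a guard around the execution step: read-only actions pass through, while irreversible ones are held until a human approves. Everything here (`guarded_execute`, `IRREVERSIBLE_OPS`, the `confirm` callback) is a hypothetical illustration, not the mechanism used by any shipping product.

```python
# Gate irreversible actions behind an injected human-confirmation callback.
IRREVERSIBLE_OPS = {"purchase", "delete", "submit_form"}

def guarded_execute(action: dict, execute, confirm) -> str:
    """Run `execute(action)` only if the action is safe or explicitly approved."""
    if action["op"] in IRREVERSIBLE_OPS:
        if not confirm(f"Allow irreversible action: {action}?"):
            return "blocked"
    execute(action)
    return "executed"

log = []
deny = lambda prompt: False  # a real UI would surface `prompt` to the user

result_click = guarded_execute({"op": "click", "target": "next"}, log.append, deny)
result_buy = guarded_execute({"op": "purchase", "item": "flight"}, log.append, deny)
```

Injecting `confirm` as a callback keeps the policy testable and lets the same agent run fully autonomous (auto-approve) or supervised (prompt the user) without code changes.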
Benchmarks & SOTA
WebArena
WebArena: A Realistic Web Environment for Building Autonomous Agents
812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repos, CMS). Tests ability to complete real-world browser tasks like making purchases, posting content, or querying databases.
State of the Art
Agent-E (GPT-4o)
Emergence AI
73% success rate
OSWorld
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
369 real computer tasks across Windows, macOS, and Ubuntu requiring GUI interaction. Tests agents operating full desktop apps like spreadsheets, image editors, and terminals. Much harder than web-only benchmarks.
State of the Art
Claude Opus 4
Anthropic
38% success rate
Related Tasks
HCAST
HCAST (Human-Calibrated Autonomy Software Tasks) is a 90-task benchmark from METR designed to measure AI autonomy with human-calibrated baselines — every task has known completion times from professional software engineers, enabling direct human-vs-AI comparison. Tasks span realistic software engineering scenarios at varying difficulty levels, from simple bug fixes to complex architectural changes. The human calibration is what makes HCAST distinctive: instead of just pass/fail, it reveals whether AI agents are 10x slower, equally fast, or approaching superhuman speed on specific task types.
Autonomous Coding
Autonomous coding — AI systems that write, debug, and ship software without human guidance — is the most commercially immediate agentic capability. Benchmarks range from function-level synthesis (HumanEval, MBPP) to full-repository tasks (SWE-bench), and the field moved from autocomplete to genuine software engineering when Cognition's Devin (2024) and open alternatives like SWE-Agent and OpenHands demonstrated multi-file, multi-step coding workflows. The frontier is extended autonomy: can an agent maintain a codebase over days, not just resolve a single issue?
SWE-bench
SWE-bench — resolving real GitHub issues from popular Python repositories — became the defining benchmark for AI software engineering after its 2023 release by Princeton. The verified subset (500 curated problems) went from ~4% resolution rate with raw GPT-4 to over 50% with agentic scaffolds like SWE-Agent and Amazon Q Developer by mid-2025. What makes it uniquely challenging is the need to navigate large codebases, write tests, and produce patches that pass CI — skills that require genuine multi-file reasoning, not just code generation.
RE-Bench
RE-Bench (Research Engineering Benchmark) from METR evaluates AI agents on 7 open-ended ML research engineering tasks requiring genuine experimentation — training models, analyzing data, and iterating on approaches over extended time horizons up to 8 hours. Unlike pass/fail coding benchmarks, RE-Bench uses continuous scoring that measures quality of results, capturing the difference between a mediocre and excellent solution. It revealed a critical finding: current frontier models (as of late 2024) plateau after ~2 hours of autonomous work while human experts continue improving, exposing the "long-horizon reliability" gap in agentic AI.