Codesota · Tasks · Web & Desktop Agents

Web & Desktop Agents.

Web and desktop agents are AI systems that operate browsers and GUIs to complete real tasks. They are benchmarked by WebArena, VisualWebArena, Mind2Web, and OSWorld. Baseline agents such as GPT-4V + Playwright and Claude Computer Use achieve 15-35% success on realistic web tasks, far below human performance, although the top entries tracked on this page score higher (63.5 on OSWorld, 95.6 on WebArena). The core difficulty is grounding: mapping a high-level instruction ("book a flight under $300") to pixel-level or DOM-level actions across unpredictable, dynamic interfaces. This is where multimodal understanding meets sequential decision-making, and progress here is a direct indicator of when AI assistants will be able to act reliably on a user's behalf.
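To make the grounding step concrete, here is a minimal sketch in Python. It is not how any of the listed agents work: the `Element` and `Action` types, the `ground` function, and the toy page are all hypothetical, and simple word overlap stands in for the multimodal model a real agent would use to choose a target.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """A simplified DOM node (hypothetical schema, for illustration only)."""
    element_id: str
    role: str   # e.g. "button" or "textbox"
    label: str  # visible text or accessible name


@dataclass(frozen=True)
class Action:
    """A DOM-level action the agent would hand to a browser driver."""
    kind: str        # "click" or "type"
    element_id: str
    text: str = ""


def ground(step: str, elements: list[Element]) -> Action | None:
    """Map one instruction step to an action on the best-matching element.

    Assumption: word overlap between the step and the element label is the
    matching score. Real agents replace this with a learned model.
    """
    step_words = set(step.lower().split())
    best, best_overlap = None, 0
    for el in elements:
        overlap = len(step_words & set(el.label.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = el, overlap
    if best is None:
        return None  # nothing on the page matches this step
    kind = "type" if best.role == "textbox" else "click"
    return Action(kind=kind, element_id=best.element_id)


# Toy page standing in for a flight-search form.
page = [
    Element("e1", "textbox", "Destination city"),
    Element("e2", "button", "Search flights"),
    Element("e3", "button", "Sign in"),
]
action = ground("click search flights", page)
print(action)
```

Even this toy version shows why grounding is brittle: a relabeled button or a reordered page changes which element wins, and a step that matches nothing yields no action at all.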

Datasets: 2 · Results: 39 · Canonical metric: success-rate
§ 02 · Canonical benchmark

The reference dataset.

OSWorld

369 real computer tasks across Windows, macOS, and Ubuntu requiring GUI interaction. Tests agents operating full desktop apps like spreadsheets, image editors, and terminals. Much harder than web-only benchmarks.

Primary metric: success-rate
View full leaderboard →
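The page does not spell out how success-rate is computed. A minimal sketch, assuming it is the percentage of tasks whose final state passes the benchmark's check, with each task scored as a plain pass or fail; the `success_rate` helper and the sample outcomes are hypothetical.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Return the percentage of tasks that succeeded.

    Assumption: every task counts equally and there is no partial credit.
    """
    if not outcomes:
        raise ValueError("no task outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical run of four tasks, three of which passed their checks.
outcomes = [True, False, True, True]
rate = success_rate(outcomes)
print(f"{rate:.1f}")
```

Published scores may differ from this simple ratio if a benchmark awards partial credit or evaluates on a subset of its tasks.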
§ 03 · Top 10

Leading models.

Leading models on OSWorld.

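| # | Model | success-rate | Year | Source |
|---|-------|--------------|------|--------|
| 1 | Agent S3 w/ bBoN | 63.5 | 2025 | paper |
| 2 | GLM-5V-Turbo | 62.3 | 2026 | paper |
| 3 | CoAct-1 | 60.8 | 2026 | paper |
| 4 | JEDI-7B with o3 planner | 51.0 | 2025 | paper |
| 5 | UI-TARS-2 | 47.5 | 2026 | paper |
| 6 | GTA1 (7B) | 45.2 | 2026 | paper |
| 7 | UI-TARS-1.5 | 42.5 | 2026 | paper |
| 8 | Agent S2 (Gemini 2.5) | 41.4 | 2026 | paper |
| 9 | Holo2-8B | 39.9 | 2026 | paper |
| 10 | Qwen3-VL-235B-A22B-Thinking | 38.1 | 2025 | paper |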

§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

OSWorld (canonical): 28 results · success-rate · Top: Agent S3 w/ bBoN 63.5
WebArena: 11 results · success-rate · Top: Qwen3-235B-A22B 95.6
§ 05 · Related tasks

Other tasks in Agentic AI.

Agent Memory · Autonomous Coding · Bioinformatics Agents · HCAST · RE-Bench · SWE-bench · Task agents · Time Horizon