Web & Desktop Agents (2023)

WebArena: A Realistic Web Environment for Building Autonomous Agents

A benchmark of 812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repositories, CMS). It tests an agent's ability to complete real-world browser tasks such as making purchases, posting content, or querying databases.

Samples: 812
Metrics: success-rate
Current State of the Art

Agent-E (GPT-4o)

Emergence AI

73

success-rate

success-rate Progress Over Time

Showing 2 breakthroughs from Jul 2023 to Jul 2024

[Chart: success-rate (y-axis) over Date (x-axis), Jul 2023 to Jul 2024]

Key Milestones

Jul 2023
GPT-4 Turbo (2024)

GPT-4 Turbo baseline on WebArena. Table 1, arxiv:2307.13854. Original paper result.

14.9
Jul 2024
Agent-E (GPT-4o) (Current SOTA)

Agent-E v3.5 on WebArena. arxiv:2407.13032. Hierarchical agent with DOM distillation, GPT-4o.

73.0 (+389.9%)

Total Improvement: 389.9%
Time Span: 1y
Breakthroughs: 2
Current SOTA: 73.0
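
The total improvement figure is consistent with a relative gain of the Jul 2024 score over the Jul 2023 baseline. The following is a minimal sketch of that arithmetic, assuming the page computes improvement as (current - baseline) / baseline; the function and variable names are illustrative and not taken from the site.

```python
def relative_improvement(baseline: float, current: float) -> float:
    """Percentage gain of `current` over `baseline` (assumed formula)."""
    return (current - baseline) / baseline * 100.0


# Scores taken from the Key Milestones listed above.
baseline_score = 14.9  # GPT-4 Turbo (2024), Jul 2023
sota_score = 73.0      # Agent-E (GPT-4o), Jul 2024

improvement = relative_improvement(baseline_score, sota_score)
print(f"Total improvement: {improvement:.1f}%")
```

Under this assumption the result rounds to the 389.9% reported above.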

Top Models Performance Comparison

Top 6 models ranked by success-rate

Rank | Model | success-rate | % of best
1 | Agent-E (GPT-4o) | 73.0 | 100.0%
2 | OpenAI Operator (CUA) | 58.1 | 79.6%
3 | Claude Opus 4 | 55.0 | 75.3%
4 | Agent Q (GPT-4o) | 50.5 | 69.2%
5 | Claude 3.7 Sonnet | 35.1 | 48.1%
6 | GPT-4 Turbo (2024) | 14.9 | 20.4%

Best Score: 73.0
Top Model: Agent-E (GPT-4o)
Models Compared: 6
Score Range: 58.1
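
The "% of best" column and the score range can be reproduced from the raw scores. This is a small sketch under the assumption that "% of best" is each score divided by the top score, and that the score range is the gap between the highest and lowest score; the dictionary and variable names are illustrative.

```python
# success-rate scores as listed in the comparison above.
scores = {
    "Agent-E (GPT-4o)": 73.0,
    "OpenAI Operator (CUA)": 58.1,
    "Claude Opus 4": 55.0,
    "Agent Q (GPT-4o)": 50.5,
    "Claude 3.7 Sonnet": 35.1,
    "GPT-4 Turbo (2024)": 14.9,
}

# Top model and its score.
best_model = max(scores, key=scores.get)
best_score = scores[best_model]

# Each score as a percentage of the best score (assumed definition).
pct_of_best = {
    model: round(score / best_score * 100, 1)
    for model, score in scores.items()
}

# Gap between the highest and lowest score (assumed definition).
score_range = round(best_score - min(scores.values()), 1)

for model, pct in pct_of_best.items():
    print(f"{model}: {scores[model]} ({pct}% of best)")
print(f"Score range: {score_range}")
```

With these definitions the computed values match the percentages and the 58.1 score range shown on the page.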

success-rate (Primary)

# | Model | Organization | Score | Date
1 | Agent-E (GPT-4o) | Emergence AI | 73 | Jul 2024
2 | OpenAI Operator (CUA) | OpenAI | 58.1 | Jan 2025
3 | Claude Opus 4 (API) | Anthropic | 55 | Apr 2025
4 | Agent Q (GPT-4o) | MultiOn | 50.5 | Jul 2023
5 | Claude 3.7 Sonnet (API) | Anthropic | 35.1 | Feb 2025
6 | GPT-4 Turbo (2024) | OpenAI | 14.9 | Jul 2023

Related Papers (3)

Claude 3.7 Sonnet System Card
Feb 2025. Models: Claude 3.7 Sonnet

WebArena: A Realistic Web Environment for Building Autonomous Agents
Jul 2023. Models: Agent-E (GPT-4o), Agent Q (GPT-4o), GPT-4 Turbo (2024)

WebArena Benchmark - Web & Desktop Agents | CodeSOTA