Web & Desktop Agents, 2023
WebArena: A Realistic Web Environment for Building Autonomous Agents
A benchmark of 812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repositories, CMS). It tests an agent's ability to complete real-world browser tasks such as making purchases, posting content, or querying databases.
Current State of the Art
Agent-E (GPT-4o), Emergence AI
Success rate: 73
Success Rate Progress Over Time
Showing 2 breakthroughs from Jul 2023 to Jul 2024
Key Milestones
Total Improvement: 389.9%
Time Span: 1y
Breakthroughs: 2
Current SOTA: 73.0
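The Total Improvement figure is consistent with a relative gain of the current SOTA over the lowest score on the leaderboard. The sketch below reproduces that arithmetic under two assumptions the page does not state: that the baseline is GPT-4 Turbo (2024) at 14.9, and that improvement is computed as (current - baseline) / baseline. The function name `relative_improvement` is hypothetical.

```python
def relative_improvement(baseline: float, current: float) -> float:
    """Return the percentage gain of `current` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline * 100.0


# Assumed baseline: GPT-4 Turbo (2024) at 14.9.
# Current SOTA: Agent-E (GPT-4o) at 73.0.
improvement = relative_improvement(14.9, 73.0)
print(f"{improvement:.1f}%")  # prints 389.9%
```

Under these assumptions the result matches the 389.9% shown above; a different baseline would give a different figure.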
Top Models Performance Comparison
Top 6 models ranked by success rate
Best Score: 73.0
Top Model: Agent-E (GPT-4o)
Models Compared: 6
Score Range: 58.1
Primary metric: success-rate
| # | Model | Organization | Score | Paper / Code | Date |
|---|---|---|---|---|---|
| 1 | Agent-E (GPT-4o) | Emergence AI | 73 | | Jul 2023 |
| 2 | OpenAI Operator (CUA) | OpenAI | 58.1 | | Jan 2025 |
| 3 | Claude Opus 4 (API) | Anthropic | 55 | | Apr 2025 |
| 4 | Agent Q (GPT-4o) | MultiOn | 50.5 | | Jul 2023 |
| 5 | Claude 3.7 Sonnet (API) | Anthropic | 35.1 | | Feb 2025 |
| 6 | GPT-4 Turbo (2024) | OpenAI | 14.9 | | Jul 2023 |
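The summary figures for this comparison (best score, top model, models compared, score range) can be recomputed from the table rows. The sketch below is illustrative only: the `Entry` type and `summarize` function are hypothetical names, and it assumes Score Range means the highest score minus the lowest.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    organization: str
    score: float


# Scores transcribed from the leaderboard table.
LEADERBOARD = [
    Entry("Agent-E (GPT-4o)", "Emergence AI", 73.0),
    Entry("OpenAI Operator (CUA)", "OpenAI", 58.1),
    Entry("Claude Opus 4", "Anthropic", 55.0),
    Entry("Agent Q (GPT-4o)", "MultiOn", 50.5),
    Entry("Claude 3.7 Sonnet", "Anthropic", 35.1),
    Entry("GPT-4 Turbo (2024)", "OpenAI", 14.9),
]


def summarize(entries: list[Entry]) -> dict:
    """Rank entries by score and derive the summary statistics."""
    ranked = sorted(entries, key=lambda e: e.score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    return {
        "top_model": best.model,
        "best_score": best.score,
        "models_compared": len(ranked),
        # Assumption: range is max minus min, rounded to one decimal.
        "score_range": round(best.score - worst.score, 1),
    }


summary = summarize(LEADERBOARD)
print(summary)
```

Rounding the range to one decimal avoids floating-point noise in the subtraction.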
Related Papers (3)
METR: Measuring Autonomy in AI Systems (2025 Update)
Apr 2025 · Models: Claude Opus 4
Claude 3.7 Sonnet System Card
Feb 2025 · Models: Claude 3.7 Sonnet
WebArena: A Realistic Web Environment for Building Autonomous Agents
Jul 2023 · Models: Agent-E (GPT-4o), Agent Q (GPT-4o), GPT-4 Turbo (2024)