Web & Desktop Agents · 2024
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
369 real computer tasks across Windows, macOS, and Ubuntu that require GUI interaction. Tests whether agents can operate full desktop applications such as spreadsheets, image editors, and terminals; substantially harder than web-only benchmarks.
Current State of the Art
Claude Opus 4 (Anthropic): 38% success rate
Success Rate Progress Over Time
Showing 2 breakthroughs from Feb 2025 to Apr 2025
Key Milestones
- Total improvement: 304.3%
- Time span: 2 months
- Breakthroughs: 2
- Current SOTA: 38.0% success rate
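The "Total improvement" figure is consistent with the relative gain from the pre-breakthrough baseline (UFO's 9.4, Apr 2024) to the current SOTA (38.0). The formula below is an assumption inferred from the displayed numbers, not something the leaderboard documents:

```python
# Assumed derivation of the 304.3% "Total improvement" stat:
# relative gain from the pre-breakthrough baseline score to the SOTA score.
baseline = 9.4   # UFO (GPT-4V), best open score before the tracked breakthroughs
sota = 38.0      # Claude Opus 4, current SOTA

improvement_pct = (sota - baseline) / baseline * 100
print(f"{improvement_pct:.1f}%")  # 304.3%
```

The same convention explains the "Score range: 31.5" stat as the spread between the top and bottom entries of the table (38.0 minus 6.5).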
Top Models Performance Comparison
Top 5 models ranked by success rate (primary metric)
- Best score: 38.0
- Top model: Claude Opus 4
- Models compared: 5
- Score range: 31.5
| # | Model | Organization | Score | Date |
|---|---|---|---|---|
| 1 | Claude Opus 4 (API) | Anthropic | 38.0 | Apr 2025 |
| 2 | Claude 3.7 Sonnet (API) | Anthropic | 22.0 | Feb 2025 |
| 3 | Claude Computer Use | Anthropic | 14.9 | Apr 2024 |
| 4 | UFO (GPT-4V) (open source) | Microsoft | 9.4 | Apr 2024 |
| 5 | GPT-4 Turbo (2024) | OpenAI | 6.5 | Apr 2024 |
Related Papers (3)
METR: Measuring Autonomy in AI Systems (2025 Update)
Apr 2025 · Models: Claude Opus 4

Claude 3.7 Sonnet System Card
Feb 2025 · Models: Claude 3.7 Sonnet

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Apr 2024 · Models: Claude Computer Use, UFO (GPT-4V), GPT-4 Turbo (2024)