Web & Desktop Agents (2024)

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

369 real computer tasks that require GUI interaction, run in an environment supporting Ubuntu, Windows, and macOS. It tests agents operating full desktop applications such as spreadsheets, image editors, and terminals, which makes it much harder than web-only benchmarks.

Samples: 369
Metrics: success-rate
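As an illustration of the metric, here is a minimal sketch that assumes success-rate is the percentage of tasks an agent completes, with each task scored as a plain pass or fail. The function name and the outcome list are hypothetical and not taken from the benchmark's code.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Percentage of tasks marked successful (assumes binary per-task outcomes)."""
    if not outcomes:
        raise ValueError("no task outcomes")
    return 100.0 * sum(outcomes) / len(outcomes)

# Hypothetical run over 369 tasks in which 140 pass; these outcomes are
# invented for illustration and are not reported results.
outcomes = [True] * 140 + [False] * 229
rate = success_rate(outcomes)
print(f"{rate:.1f}")
```

If the benchmark awards partial credit on some tasks, the sum would be taken over per-task scores rather than booleans.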
Paper / Website
Current State of the Art

Claude Opus 4

Anthropic

38

success-rate

success-rate Progress Over Time

Showing 2 breakthroughs from Feb 2024 to Apr 2024

[Chart: success-rate (y-axis, ticks from 6.5 to 40.9) against date (x-axis, Feb 2024 to Apr 2024)]

Key Milestones

Feb 2024
UFO (GPT-4V)

UFO with GPT-4V (Windows-focused), evaluated on an OSWorld subset. arXiv:2402.07939.

9.4
Apr 2024
Claude Opus 4 (Current SOTA)

Claude Opus 4 on OSWorld. Anthropic model card, 2025. State-of-the-art GUI agent capability.

38.0
+304.3%
Total Improvement
304.3%
Time Span
2m
Breakthroughs
2
Current SOTA
38.0
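
The 304.3% total improvement is consistent with a relative gain computed from the two milestone scores above. A minimal sketch, assuming the figure is (latest - first) / first; the function name is illustrative and not taken from the site:

```python
def relative_improvement(first: float, latest: float) -> float:
    """Percentage gain of `latest` over `first`."""
    if first <= 0:
        raise ValueError("baseline score must be positive")
    return (latest - first) / first * 100.0

# Milestone scores listed above: UFO (GPT-4V) at 9.4, Claude Opus 4 at 38.0.
improvement = relative_improvement(9.4, 38.0)
print(f"{improvement:.1f}%")
```

A relative gain of this size reflects the low baseline as much as the latest score; the absolute gain is 28.6 points.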

Top Models Performance Comparison

Top 5 models ranked by success-rate

Rank  Model                 success-rate  % of best
1     Claude Opus 4         38.0          100.0%
2     Claude 3.7 Sonnet     22.0          57.9%
3     Claude Computer Use   14.9          39.2%
4     UFO (GPT-4V)          9.4           24.7%
5     GPT-4 Turbo (2024)    6.5           17.1%
Best Score
38.0
Top Model
Claude Opus 4
Models Compared
5
Score Range
31.5
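
The "% of best" column and the score range can be reproduced from the five listed scores. A short sketch, assuming "% of best" means each score divided by the top score and "Score Range" means best minus worst; the variable names are illustrative:

```python
# success-rate values as listed in the comparison above.
scores = {
    "Claude Opus 4": 38.0,
    "Claude 3.7 Sonnet": 22.0,
    "Claude Computer Use": 14.9,
    "UFO (GPT-4V)": 9.4,
    "GPT-4 Turbo (2024)": 6.5,
}

best = max(scores.values())

# Each model's score as a percentage of the top score, to one decimal place.
pct_of_best = {model: round(s / best * 100, 1) for model, s in scores.items()}

# Gap between the best and worst listed scores.
score_range = round(best - min(scores.values()), 1)

for model, pct in pct_of_best.items():
    print(f"{model}: {pct}%")
print(f"Score range: {score_range}")
```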

success-rate (Primary)

Related Papers (3)

Claude 3.7 Sonnet System Card
Feb 2025 | Models: Claude 3.7 Sonnet
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Apr 2024 | Models: Claude Computer Use, UFO (GPT-4V), GPT-4 Turbo (2024)


OSWorld Benchmark - Web & Desktop Agents | CodeSOTA