Codesota · Benchmark · OSWorld

OSWorld.

369 real computer tasks that require GUI interaction, built on an environment that supports Ubuntu, Windows, and macOS. It tests agents operating full desktop apps such as spreadsheets, image editors, and terminals, which makes it much harder than web-only benchmarks.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.


Success Rate

Success Rate is the reported evaluation metric for OSWorld. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
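As a minimal sketch of how a success rate of this kind can be computed, the snippet below assumes each task yields a binary pass/fail outcome and reports the share of passed tasks as a percentage. The function name `success_rate` and the example run (150 of 369 tasks solved) are hypothetical and do not correspond to any entry on this leaderboard.

```python
from typing import Sequence


def success_rate(outcomes: Sequence[bool]) -> float:
    """Return the percentage of tasks solved, rounded to two decimals.

    Assumption: every task is scored as a plain pass (True) or fail (False).
    """
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return round(100.0 * sum(outcomes) / len(outcomes), 2)


# Hypothetical run: 150 of the 369 tasks solved.
outcomes = [True] * 150 + [False] * 219
print(success_rate(outcomes))  # 40.65
```

Because the denominator is the number of tasks attempted, scores from runs on a subset of tasks or with different step budgets are not directly comparable.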

Trust tiers for Success Rate: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score (%) | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | Agent S3 w/ bBoN | unverified | 63.5 | 2025 | Paper, Code | |
| 02 | GLM-5V-Turbo | unverified | 62.3 | 2026 | Paper, Code | |
| 03 | CoAct-1 | verified | 60.76 | 2026 | Source | CoAct-1 on OSWorld, 60.76% success rate (SOTA as of Feb 2026). Salesforce, arxiv:2508.03923, Aug 2025. Combines GUI and programmatic code execution. |
| 04 | JEDI-7B with o3 planner | unverified | 51 | 2025 | Paper, Code | |
| 05 | UI-TARS-2 | verified | 47.5 | 2026 | Source | UI-TARS-2 on OSWorld, 47.5% success rate. ByteDance Seed, arxiv:2509.02544, Sep 2025. Multi-turn RL trained. |
| 06 | GTA1 (7B) | verified | 45.2 | 2026 | Source | GTA1 (7B) on OSWorld, 45.2% success rate. Salesforce AI Research, arxiv:2507.05791, Jul 2025. ICLR 2026 paper. |
| 07 | UI-TARS-1.5 | verified | 42.5 | 2026 | Source | UI-TARS-1.5 on OSWorld, 42.5% success rate (100 steps). ByteDance, released Apr 2025. |
| 08 | Agent S2 (Gemini 2.5) | verified | 41.4 | 2026 | Source | Agent S2 with Gemini 2.5 on OSWorld, 41.4% (50 steps). From OSWorld-Human paper, arxiv:2506.16042, Jun 2025. |
| 09 | Holo2-8B | unverified | 39.9 | 2026 | Paper | |
| 10 | Qwen3-VL-235B-A22B-Thinking | unverified | 38.1 | 2025 | Paper, Code | |
| 11 | OpenAI CUA (o1) | verified | 38.1 | 2026 | Source | OpenAI Computer-Using Agent (CUA/Operator) on OSWorld, 38.1% success rate. Announced Jan 2025. |
| 12 | Holo2-4B | unverified | 37.7 | 2026 | Paper | |
| 13 | Holo2-30B-A3B | unverified | 37.4 | 2026 | Paper | |
| 14 | Agent S2 (Claude 3.7) | verified | 34.5 | 2026 | Source | Agent S2 with Claude 3.7 Sonnet on OSWorld, 34.5% (50 steps). Simular AI, arxiv:2504.00906, Apr 2025. |
| 15 | Agent S2 w/ Claude-3.7-Sonnet | unverified | 34.5 | 2025 | Paper, Code | |
| 16 | Qwen3-VL-8B-Instruct | unverified | 33.9 | 2025 | Paper, Code | |
| 17 | Agent S2 w/ Claude-3.5-Sonnet | unverified | 33.7 | 2025 | Paper, Code | |
| 18 | Qwen3-VL-235B-A22B-Instruct | unverified | 31.6 | 2025 | Paper, Code | |
| 19 | Claude 3.7 Sonnet | verified | 28 | 2026 | Source | Claude 3.7 Sonnet on OSWorld, top of leaderboard at release (Feb 2025), 100 steps. From OSWorld-Human paper (arxiv:2506.16042). |
| 20 | UI-TARS-72B | verified | 24.6 | 2026 | Source | UI-TARS-72B on OSWorld, 24.6% success rate (50 steps). ByteDance, arxiv:2501.12326, Jan 2025. |
| 21 | Claude Computer Use | verified | 22 | 2026 | Source | Claude 3.5 Sonnet computer use with extended steps on OSWorld. Anthropic announcement Oct 2024. |
| 22 | Agent S w/ GPT-4o | unverified | 20.58 | 2024 | Paper, Code | |
| 23 | Agent S w/ Claude-3.5 | unverified | 20.48 | 2024 | Paper, Code | |
| 24 | UFO (GPT-4V) | verified | 9.40 | 2024 | Paper, Source | UFO GPT-4V (Windows-focused). Evaluated on OSWorld subset. arxiv:2402.07939. |
| 25 | Qwen2.5-VL-72B | unverified | 8.83 | 2025 | Paper, Code | |
| 26 | Kimi-VL-A3B-Instruct | unverified | 8.22 | 2025 | Paper, Code | |
| 27 | GPT-4 Turbo (2024) | verified | 6.50 | 2024 | Paper | GPT-4V (screenshot-only) on OSWorld. Table 3, arxiv:2404.07972. Screenshot-based GUI agent. |
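Since the ranking mixes trust tiers, the top entry can differ depending on which tiers a reader accepts. The sketch below illustrates this with four rows copied from the leaderboard above. The `Entry` type and the `rank` helper are hypothetical names introduced only for illustration, not part of any Codesota API.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    model: str
    trust: str    # one of the trust tiers listed for Success Rate
    score: float  # success rate in percent, higher is better


# Four rows taken from the leaderboard above.
ENTRIES = [
    Entry("Agent S3 w/ bBoN", "unverified", 63.5),
    Entry("GLM-5V-Turbo", "unverified", 62.3),
    Entry("CoAct-1", "verified", 60.76),
    Entry("UI-TARS-2", "verified", 47.5),
]


def rank(entries: List[Entry], trust: Optional[str] = None) -> List[Entry]:
    """Sort entries by score, best first, optionally keeping one trust tier."""
    rows = [e for e in entries if trust is None or e.trust == trust]
    return sorted(rows, key=lambda e: e.score, reverse=True)


print(rank(ENTRIES)[0].model)              # Agent S3 w/ bBoN
print(rank(ENTRIES, "verified")[0].model)  # CoAct-1
```

Restricted to verified entries, CoAct-1 leads, which is consistent with its note describing it as SOTA even though two unverified entries report higher scores.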