Who leads the OSWorld benchmark?

Agent S3 w/ bBoN currently leads OSWorld with a Success Rate of 63.5%.

What is the state-of-the-art score on OSWorld?

The state-of-the-art result on OSWorld is a 63.5% Success Rate, achieved by Agent S3 w/ bBoN as of 2026.

How many models are tracked on OSWorld?

Codesota tracks 27 models on OSWorld.

When was the OSWorld leaderboard last updated?

The OSWorld leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2024.

OSWorld.

Name: OSWorld Benchmark Results
Creator: Unknown
Published: 2024-01-01
License: https://creativecommons.org/licenses/by/4.0/

OSWorld comprises 369 real computer tasks across Windows, macOS, and Ubuntu that require GUI interaction. It tests agents operating full desktop applications such as spreadsheets, image editors, and terminals, which makes it much harder than web-only benchmarks.

§ 02 · Leaderboard

Results by metric.

Success Rate

Success Rate is the reported evaluation metric for OSWorld. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better
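As a rough illustration of the metric, the sketch below computes a success rate as the percentage of tasks an agent completes. It assumes every task is scored pass or fail and weighted equally; the page does not spell out the formula, so treat this as an assumption. The function name `success_rate` and the run of 123 solved tasks are hypothetical, not published results.

```python
def success_rate(passed: int, total: int) -> float:
    """Return the percentage of tasks completed successfully.

    Assumes each task is pass/fail and counts equally (an assumption,
    not a formula confirmed by the leaderboard).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    return 100.0 * passed / total


TOTAL_TASKS = 369  # task count stated in the benchmark description

# Hypothetical run: 123 of 369 tasks solved (illustrative numbers only).
rate = success_rate(123, TOTAL_TASKS)
print(f"{rate:.1f}%")  # prints "33.3%"
```

Because higher is better, two agents evaluated on the same task set can be compared directly by this percentage; runs with different step budgets or task subsets may not be comparable.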

Trust tiers for Success Rate: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year	Links	Notes
01	Agent S3 w/ bBoN	unverified	63.5	2025	Paper, Code	-
02	GLM-5V-Turbo	unverified	62.3	2026	Paper, Code	-
03	CoAct-1	verified	60.76	2026	Source	CoAct-1 on OSWorld, 60.76% success rate (SOTA as of Feb 2026). Salesforce, arxiv:2508.03923, Aug 2025. Combines GUI and programmatic code execution.
04	JEDI-7B with o3 planner	unverified	51	2025	Paper, Code	-
05	UI-TARS-2	verified	47.5	2026	Source	UI-TARS-2 on OSWorld, 47.5% success rate. ByteDance Seed, arxiv:2509.02544, Sep 2025. Multi-turn RL trained.
06	GTA1 (7B)	verified	45.2	2026	Source	GTA1 (7B) on OSWorld, 45.2% success rate. Salesforce AI Research, arxiv:2507.05791, Jul 2025. ICLR 2026 paper.
07	UI-TARS-1.5	verified	42.5	2026	Source	UI-TARS-1.5 on OSWorld, 42.5% success rate (100 steps). ByteDance, released Apr 2025.
08	Agent S2 (Gemini 2.5)	verified	41.4	2026	Source	Agent S2 with Gemini 2.5 on OSWorld, 41.4% (50 steps). From OSWorld-Human paper, arxiv:2506.16042, Jun 2025.
09	Holo2-8B	unverified	39.9	2026	Paper	-
10	Qwen3-VL-235B-A22B-Thinking	unverified	38.1	2025	Paper, Code	-
11	OpenAI CUA (o1)	verified	38.1	2026	Source	OpenAI Computer-Using Agent (CUA/Operator) on OSWorld, 38.1% success rate. Announced Jan 2025.
12	Holo2-4B	unverified	37.7	2026	Paper	-
13	Holo2-30B-A3B	unverified	37.4	2026	Paper	-
14	Agent S2 (Claude 3.7)	verified	34.5	2026	Source	Agent S2 with Claude 3.7 Sonnet on OSWorld, 34.5% (50 steps). Simular AI, arxiv:2504.00906, Apr 2025.
15	Agent S2 w/ Claude-3.7-Sonnet	unverified	34.5	2025	Paper, Code	-
16	Qwen3-VL-8B-Instruct	unverified	33.9	2025	Paper, Code	-
17	Agent S2 w/ Claude-3.5-Sonnet	unverified	33.7	2025	Paper, Code	-
18	Qwen3-VL-235B-A22B-Instruct	unverified	31.6	2025	Paper, Code	-
19	Claude 3.7 Sonnet	verified	28	2026	Source	Claude 3.7 Sonnet on OSWorld, top of leaderboard at release (Feb 2025), 100 steps. From OSWorld-Human paper (arxiv:2506.16042).
20	UI-TARS-72B	verified	24.6	2026	Source	UI-TARS-72B on OSWorld, 24.6% success rate (50 steps). ByteDance, arxiv:2501.12326, Jan 2025.
21	Claude Computer Use	verified	22	2026	Source	Claude 3.5 Sonnet computer use with extended steps on OSWorld. Anthropic announcement Oct 2024.
22	Agent S w/ GPT-4o	unverified	20.58	2024	Paper, Code	-
23	Agent S w/ Claude-3.5	unverified	20.48	2024	Paper, Code	-
24	UFO (GPT-4V)	verified	9.40	2024	Paper, Source	UFO GPT-4V (Windows-focused). Evaluated on OSWorld subset. arxiv:2402.07939.
25	Qwen2.5-VL-72B	unverified	8.83	2025	Paper, Code	-
26	Kimi-VL-A3B-Instruct	unverified	8.22	2025	Paper, Code	-
27	GPT-4 Turbo (2024)	verified	6.50	2024	Paper	GPT-4V (screenshot-only) on OSWorld. Table 3, arxiv:2404.07972. Screenshot-based GUI agent.
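
The muted-row rule stated before the table (a row is muted when an earlier or same-year result already scored better) can be sketched in a few lines. This is only an illustration working at year granularity with a few rows copied from the table above; the `Result` class and `is_muted` function are hypothetical names, and the site's actual logic may use finer-grained dates or other tie-breaking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    model: str
    score: float  # Success Rate in percent
    year: int     # year the leaderboard attaches to the result


def is_muted(row: Result, rows: list[Result]) -> bool:
    """True if an earlier or same-year result already scored better."""
    return any(
        other.year <= row.year and other.score > row.score
        for other in rows
        if other is not row
    )


# A handful of rows copied from the table above (not the full leaderboard).
rows = [
    Result("Agent S3 w/ bBoN", 63.5, 2025),
    Result("GLM-5V-Turbo", 62.3, 2026),
    Result("JEDI-7B with o3 planner", 51.0, 2025),
    Result("Agent S w/ GPT-4o", 20.58, 2024),
    Result("Agent S w/ Claude-3.5", 20.48, 2024),
    Result("GPT-4 Turbo (2024)", 6.50, 2024),
]

# Rows that were state of the art when published, under this reading of the rule.
sota_at_publication = [r.model for r in rows if not is_muted(r, rows)]
print(sota_at_publication)  # ['Agent S3 w/ bBoN', 'Agent S w/ GPT-4o']
```

Under this reading, GLM-5V-Turbo is muted despite ranking second, because the 2025 result of 63.5 predates it and scores higher.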
