
WebArena.

WebArena comprises 812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repositories, CMS). It tests an agent's ability to complete real-world browser tasks such as making purchases, posting content, or querying databases.

§ 01 · Leaderboard

Results by metric.


Accuracy

Accuracy is one of the evaluation metrics reported for WebArena. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Links |
|------|-------|-------|-------|------|-------|
| 01 | Qwen3-235B-A22B | unverified | 95.6 | 2025 | Paper, Code |

Success Rate

Success Rate is one of the evaluation metrics reported for WebArena.

Higher is better

Trust tiers for Success Rate: verified, paper, vendor, community, unverified

| Rank | Model | Notes | Trust | Score | Year | Links |
|------|-------|-------|-------|-------|------|-------|
| 01 | Agent-E (GPT-4o) | Agent-E v3.5 on WebArena. arXiv:2407.13032. Hierarchical agent with DOM distillation, GPT-4o. | verified | 73 | 2023 | Paper, Source |
| 02 | OpenAI Operator (CUA) | OpenAI CUA (Operator) on WebArena; launched Jan 23, 2025. | verified | 58.1 | 2025 | Source |
| 03 | Claude Opus 4 | Claude Opus 4 on WebArena. Estimated from Anthropic model card, 2025. | verified | 55 | 2025 | Paper, Source |
| 04 | Agent Q (GPT-4o) | Agent Q (GPT-4o) on WebArena. arXiv:2408.07199. MCTS + DPO self-play. | verified | 50.5 | 2023 | Paper, Source |
| 05 | Claude 3.7 Sonnet | Claude 3.7 Sonnet on WebArena. Reported in agentic benchmarks comparison, 2025. | verified | 35.1 | 2025 | Paper, Source |
| 06 | GPT-4 Turbo (2024) | GPT-4 Turbo baseline on WebArena. Table 1, arXiv:2307.13854. Original paper result. | verified | 14.9 | 2023 | Paper |
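
To make the Success Rate scores above easier to interpret, here is a minimal sketch of the arithmetic, assuming a score is the percentage of the full 812-task set that an agent completes. That assumption may not hold for every row, since some sources evaluate on a subset of tasks. The function names `success_rate` and `approx_tasks_solved` are hypothetical helpers introduced only for illustration.

```python
TOTAL_TASKS = 812  # task count stated in the benchmark description


def success_rate(completed: int, total: int = TOTAL_TASKS) -> float:
    """Return the percentage of tasks completed successfully."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and total")
    return 100.0 * completed / total


def approx_tasks_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a reported percentage back to an approximate task count.

    Assumes the score was measured on all `total` tasks (an assumption,
    not something the leaderboard states for each entry).
    """
    if not 0.0 <= rate_percent <= 100.0:
        raise ValueError("rate_percent must be between 0 and 100")
    return round(rate_percent / 100.0 * total)


if __name__ == "__main__":
    # Under the full-set assumption, the listed 14.9 score would
    # correspond to roughly this many solved tasks.
    print(approx_tasks_solved(14.9))
```

Under the same assumption, a difference of one percentage point corresponds to about eight tasks, which is worth keeping in mind when comparing entries with close scores.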

Score 0 10

Score 0 10 is one of the evaluation metrics reported for WebArena.

Higher is better

Trust tiers for Score 0 10: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Links |
|------|-------|-------|-------|------|-------|
| 01 | Holo3-35B-A3B | unverified | 64.8 | 2026 | Paper |
| 02 | Holo2-30B-A3B | unverified | 46.3 | 2026 | Paper |
| 03 | Holo2-8B | unverified | 42.2 | 2026 | Paper |
| 04 | Holo2-4B | unverified | 41 | 2026 | Paper |