WebArena comprises 812 long-horizon web navigation tasks across realistic web environments (e-commerce, social media, code repositories, CMS). It tests an agent's ability to complete real-world browser tasks such as making purchases, posting content, or querying databases.
Accuracy is the reported evaluation metric for WebArena. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Qwen3-235B-A22B | unverified | 95.6 | 2025 | Paper ↗, Code ↗ |
Success Rate is the reported evaluation metric for WebArena. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Agent-E (GPT-4o) | verified | 73 | 2023 | Paper ↗, Source ↗ |
| 02 | OpenAI Operator (CUA) | verified | 58.1 | 2025 | Source ↗ |
| 03 | Claude Opus 4 | verified | 55 | 2025 | Paper ↗, Source ↗ |
| 04 | Agent Q (GPT-4o) | verified | 50.5 | 2023 | Paper ↗, Source ↗ |
| 05 | Claude 3.7 Sonnet | verified | 35.1 | 2025 | Paper ↗, Source ↗ |
| 06 | GPT-4 Turbo (2024) | verified | 14.9 | 2023 | Paper ↗ |
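For readers unsure how a success-rate figure like those above is typically derived, here is a minimal sketch. It assumes the metric is the percentage of tasks an agent completes successfully, rounded to one decimal place. The function name `success_rate` and the sample outcomes are hypothetical, and whether every listed score was computed over the full 812-task set is not confirmed by this page.

```python
from typing import Iterable

# Task count taken from the benchmark description on this page.
WEBARENA_TASK_COUNT = 812


def success_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks marked successful, rounded to one decimal.

    Assumption: each outcome is a per-task pass/fail flag, and the metric is
    the plain fraction of passes. This is an illustration, not the official
    evaluation harness.
    """
    results = list(outcomes)
    if not results:
        raise ValueError("no task outcomes provided")
    return round(100.0 * sum(results) / len(results), 1)


# Made-up data for illustration only: 406 passes out of 812 tasks.
example = [True] * 406 + [False] * (WEBARENA_TASK_COUNT - 406)
print(success_rate(example))  # prints 50.0
```

Under this assumption, a one-point difference between two entries corresponds to roughly eight tasks out of 812, which is worth keeping in mind when comparing closely ranked models.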
Score 0 10 is the reported evaluation metric for WebArena. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.
Higher is better
| Rank | Model | Trust | Score | Year | Links |
|---|---|---|---|---|---|
| 01 | Holo3-35B-A3B | unverified | 64.8 | 2026 | Paper ↗ |
| 02 | Holo2-30B-A3B | unverified | 46.3 | 2026 | Paper ↗ |
| 03 | Holo2-8B | unverified | 42.2 | 2026 | Paper ↗ |
| 04 | Holo2-4B | unverified | 41 | 2026 | Paper ↗ |