Model arena, by the numbers.
Not benchmarks — real human preferences. Millions of blind pairwise comparisons across text, code, vision, documents, search, image generation, and video generation. Scores below are Bradley-Terry estimates, the same family as chess Elo.
Who leads where.
Anthropic leads Text, Code, Document, and Search. Google leads Vision, Text-to-Image, and Text-to-Video. The Best value column lists the cheapest model within two points of the leader — the one you'd ship first.
- Source: arena.ai · fka LMSYS
- Metric: Bradley-Terry · higher is better
- Updated: March 2026
- Method: /methodology
| Category | Leader | Provider | Score | Votes | Best value |
|---|---|---|---|---|---|
| Text | Claude Opus 4.6 | Anthropic | 1502 | 800K+ | Grok-4.1 ($0.20 / $0.50) |
| Code | Claude Opus 4.6 | Anthropic | 1548 | 210K+ | GLM-5 ($1 / $3.20, MIT) |
| Vision | Gemini 3 Pro | Google | 1290 | 716K+ | Gemini 3 Flash ($0.50 / $3) |
| Document | Claude Opus 4.6 | Anthropic | 1524 | 44K+ | Claude Haiku 4.5 ($1 / $5) |
| Search | Claude Opus 4.6 Search | Anthropic | 1255 | 248K+ | Grok-4-fast ($0.20 / $0.50) |
| Text-to-Image | Gemini 3.1 Flash Image | Google | 1266 | 4.3M | qwen-image (Apache 2.0) |
| Text-to-Video | Veo 3.1 Audio 1080p | Google | 1381 | 247K+ | Kandinsky 5.0 (MIT) |
How arena rankings work.
Preference elicitation, not exam scoring. The arena doesn't ask which model is correct — it asks which one a human prefers, averaged across hundreds of thousands of people.
A prompt, two anonymous answers.
A user submits a prompt. Two unnamed models answer side by side. The user picks the better response without knowing which model is which. Bias from brand, cost, or training-set advertising drops out.
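To make the flow concrete, here is a minimal sketch of what one blind vote captures. The field names are hypothetical, not arena.ai's actual schema; the voter only ever sees slots A and B, and the model identities live in the record but are never shown before the verdict.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class BlindVote:
    prompt: str
    model_a: str   # identity hidden from the voter until after the verdict
    model_b: str   # identity hidden from the voter until after the verdict
    verdict: Literal["A", "B", "tie", "both_bad"]

vote = BlindVote(
    prompt="Summarise this clause in plain English.",
    model_a="model-x",
    model_b="model-y",
    verdict="A",
)
```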
Votes aggregate via Bradley-Terry.
Each comparison shifts both models' scores relative to the expected outcome — the same estimator family as chess Elo. Millions of votes collapse into one score per model and one ordering per category, refreshed weekly.
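A minimal sketch of how that aggregation can work, assuming the classic minorisation-maximisation fit for Bradley-Terry strengths. The Elo-style rescaling at the end is an illustrative convention, not the arena's exact formula, and ties, deduplication, and per-category weighting are all ignored here.

```python
from collections import defaultdict
import math

def bradley_terry(votes, iters=200):
    """votes: list of (winner, loser) model-name pairs."""
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # MM update: wins divided by the expected exposure against every opponent
            denom = sum(
                pair_counts[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m
            )
            new[m] = wins[m] / denom if denom else strength[m]
        mean = sum(new.values()) / len(new)   # normalise so the scale stays identifiable
        strength = {m: s / mean for m, s in new.items()}
    return strength

# Toy votes with made-up model names, purely for illustration
votes = [("opus", "gemini"), ("opus", "grok"), ("gemini", "grok"),
         ("grok", "gemini"), ("opus", "gemini")]
strength = bradley_terry(votes)
# Map strengths onto an Elo-like scale (illustrative anchor, not the arena's constant)
scores = {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```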
Confidence intervals decide ties.
Each score carries a 95% interval. Models with overlapping intervals are statistically tied — not ranked. More votes tighten the interval, and the leaderboard only separates two models when the evidence is there.
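One common way to build such intervals is a bootstrap over the votes, sketched below reusing the bradley_terry helper and votes list from the previous block. This illustrates the idea, not the arena's exact interval construction.

```python
import math
import random
from collections import defaultdict

def bootstrap_intervals(votes, n_boot=200, alpha=0.05):
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = random.choices(votes, k=len(votes))   # resample votes with replacement
        for model, s in bradley_terry(resampled).items():
            if s > 0:  # a model with no wins in this replicate has strength 0; skip it
                samples[model].append(1000 + 400 * math.log10(s))
    intervals = {}
    for model, vals in samples.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        intervals[model] = (lo, hi)
    return intervals

def tied(a, b, intervals):
    (lo_a, hi_a), (lo_b, hi_b) = intervals[a], intervals[b]
    return lo_a <= hi_b and lo_b <= hi_a   # overlapping 95% intervals: statistical tie

intervals = bootstrap_intervals(votes)
print(tied("gemini", "grok", intervals))
```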
What a preference ranking does — and doesn't — measure.
Arena scores tell you which model a typical user picks when shown two unnamed answers side by side. That is useful and it is real. It is not the same as accuracy on a graded task.
A model can win the arena while failing graded benchmarks, because users reward confident prose, formatting, and length. A model can lose the arena while acing graded benchmarks, because the user prompts skew toward creative tasks that graded suites don't cover. Treat the arena as one dimension of evidence, pair it with the task registry at /tasks, and route between them on your own workload.
We mirror arena.ai here, categorised and cross-linked. For the editorial methodology behind Codesota's wider registry — what counts as reproduced, what a retraction looks like, how we record dates — see /methodology.
Triangulate before you ship.
Arena preference is one axis. Graded benchmarks are another. The interesting model is the one that wins on both — or fails on either in a way you can explain.
All arena numbers on this page and its subpages are mirrored from arena.ai. Scores update as new votes arrive; we refresh the snapshot monthly and date every page.
The task registry
Graded benchmarks — OCR, ASR, MTEB, detection, retrieval — where a model is measured against a reference, not against preference. Complement, not substitute.
Open /tasks →
LLM leaderboards
Codesota’s own LLM rollup — one table, every frontier model, every standard graded benchmark. Cross-reference before you pick.
Open /llm →
Methodology
Why these numbers can be trusted — what reproduction means, what a retraction looks like, and how we keep the registry honest across years.
Read /methodology →