Codesota · The Arena Ledger
Human preference · seven categories · millions of votes
Updated: March 2026
Live registry · arena.ai · seven categories

Model arena, by the numbers.

Not benchmarks — real human preferences. Millions of blind pairwise comparisons across text, code, vision, documents, search, image generation, and video generation. Scores below are Bradley-Terry estimates, the same family as chess Elo.
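
In the standard Bradley-Terry form (the textbook model; arena.ai's exact parameterisation may differ), each model carries a latent strength θ, and the gap between two strengths fixes a win probability:

    P(A beats B) = exp(θ_A) / (exp(θ_A) + exp(θ_B)) = 1 / (1 + exp(θ_B − θ_A))

Elo is the same logistic curve on a rescaled axis, which is why the two read interchangeably here.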

§ 01 · Leaderboard

Who leads where.

Anthropic leads Text, Code, Document, and Search. Google leads Vision, Text-to-Image, and Text-to-Video. The best-value column lists the cheapest model within two points of the leader: the one you'd ship first (sketched in code after the table).


Source · arena.ai · fka LMSYS
Metric · Bradley-Terry · higher is better
Updated · March 2026
Method · /methodology
Category · Leader · Score · Best value
Seven categories · sorted by release order
Category        Leader                  Provider   Score  Votes  Best value
Text            Claude Opus 4.6         Anthropic  1502   800K+  Grok-4.1 ($0.20 / $0.50)
Code            Claude Opus 4.6         Anthropic  1548   210K+  GLM-5 ($1 / $3.20, MIT)
Vision          Gemini 3 Pro            Google     1290   716K+  Gemini 3 Flash ($0.50 / $3)
Document        Claude Opus 4.6         Anthropic  1524   44K+   Claude Haiku 4.5 ($1 / $5)
Search          Claude Opus 4.6 Search  Anthropic  1255   248K+  Grok-4-fast ($0.20 / $0.50)
Text-to-Image   Gemini 3.1 Flash Image  Google     1266   4.3M   qwen-image (Apache 2.0)
Text-to-Video   Veo 3.1 Audio 1080p     Google     1381   247K+  Kandinsky 5.0 (MIT)
Fig 2 · Prices in the best-value column are vendor-published rates per million tokens (input / output); open-weight models carry a licence note instead. Bradley-Terry scores carry confidence intervals; models with overlapping CIs are statistically tied.
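
The best-value rule is mechanical enough to state in code. A minimal Python sketch, assuming flat score/price records; field names are illustrative, the Opus price and the Grok-4.1 score are hypothetical, and only Grok-4.1's input price comes from the table above:

```python
# Best value: the cheapest model within two Bradley-Terry points of the leader.
def best_value(models: list[dict], margin: float = 2.0) -> dict:
    leader = max(m["score"] for m in models)
    contenders = [m for m in models if leader - m["score"] <= margin]
    return min(contenders, key=lambda m: m["input_price"])

text = [
    {"name": "Claude Opus 4.6", "score": 1502, "input_price": 15.00},  # price hypothetical
    {"name": "Grok-4.1", "score": 1501, "input_price": 0.20},          # score hypothetical
]
print(best_value(text)["name"])  # -> Grok-4.1
```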
Text · 800K+
1502
Claude Opus 4.6
Anthropic leads with thinking variants at the top
Full Text table →
Code · 210K+
1548
Claude Opus 4.6
Claude holds top 5 positions · GLM-5 best open-source
Full Code table →
Search · 248K+
1255
Claude Opus 4.6 Search
Grok-4-fast offers the strongest value per dollar
Full Search table →
§ 02 · How

How arena rankings work.

Preference elicitation, not exam scoring. The arena doesn't ask which model is correct — it asks which one a human prefers, averaged across hundreds of thousands of people.

Step one

A prompt, two anonymous answers.

A user submits a prompt. Two unnamed models answer side by side. The user picks the better response without knowing which model is which. Bias from brand, cost, or marketing drops out.
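
A minimal sketch of what one battle could record, assuming a design like the one described (not arena.ai's actual schema). The point is that identities resolve only after the vote:

```python
from dataclasses import dataclass

@dataclass
class Battle:
    prompt: str
    answer_a: str   # shown to the voter only as "Model A"
    answer_b: str   # shown to the voter only as "Model B"
    vote: str       # "a", "b", or "tie", cast before names are revealed
    model_a: str    # identity attached server-side, after the vote
    model_b: str
```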

Step two

Votes aggregate via Bradley-Terry.

Each comparison updates both models' scores against their expected outcome, the same family of estimator as chess Elo. Millions of votes collapse into one score per model and a single ordering per category, refreshed every week.
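
A minimal sketch of that update in its online Elo form, using chess conventions (K-factor, 400-point logistic scale). Bradley-Terry is normally fit over the full vote history rather than one vote at a time, so treat this as the intuition behind the estimator, not arena.ai's pipeline:

```python
def expected(r_a: float, r_b: float) -> float:
    """Win probability for A under the logistic (Elo / Bradley-Terry) model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, outcome_a: float, k: float = 4.0):
    """One vote: outcome_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    delta = k * (outcome_a - expected(r_a, r_b))   # surprise times K-factor
    return r_a + delta, r_b - delta
```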

Step three

Confidence intervals decide ties.

Each score carries a 95% interval. Models with overlapping intervals are statistically tied — not ranked. More votes tighten the interval, and the leaderboard only separates two models when the evidence is there.
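
The tie rule reduces to an interval-overlap check. A minimal sketch, assuming each score ships as a (low, high) 95% interval:

```python
def statistically_tied(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two 95% confidence intervals overlap."""
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    return lo_a <= hi_b and lo_b <= hi_a

print(statistically_tied((1495, 1509), (1500, 1514)))  # True: overlapping, tied
print(statistically_tied((1495, 1509), (1512, 1526)))  # False: ranked apart
```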

§ 03 · Footnote

What a preference ranking does — and doesn't — measure.

Arena scores tell you which model a typical user picks when shown two unnamed answers side by side. That is useful and it is real. It is not the same as accuracy on a graded task.

A model can win the arena while failing graded benchmarks, because users reward confident prose, formatting, and length. A model can lose the arena while acing graded benchmarks, because arena prompts skew toward creative tasks that graded suites don't cover. Treat the arena as one dimension of evidence, pair it with the task registry at /tasks, and route between them on your own workload.

We mirror arena.ai here, categorised and cross-linked. For the editorial methodology behind Codesota's wider registry — what counts as reproduced, what a retraction looks like, how we record dates — see /methodology.

§ 04 · Elsewhere

Triangulate before you ship.

Arena preference is one axis. Graded benchmarks are another. The interesting model is the one that wins on both — or fails on either in a way you can explain.

All arena numbers on this page and its subpages are mirrored from arena.ai. Scores update as new votes arrive; we refresh the snapshot monthly and date every page.

Door 01

The task registry

Graded benchmarks — OCR, ASR, MTEB, detection, retrieval — where a model is measured against a reference, not against preference. Complement, not substitute.

Open /tasks →
Door 02

LLM leaderboards

Codesota’s own LLM rollup — one table, every frontier model, every standard graded benchmark. Cross-reference before you pick.

Open /llm →
Door 03

Methodology

Why these numbers can be trusted — what reproduction means, what a retraction looks like, and how we keep the registry honest across years.

Read /methodology →