Model arena, by the numbers.
Not benchmarks — real human preferences. Millions of blind pairwise comparisons across text, code, vision, documents, search, image generation, and video generation. Scores below are Bradley-Terry estimates, the same family as chess Elo.
Who leads where.
Anthropic leads Text, Code, Document, and Search. Google leads Vision, Text-to-Image, and Text-to-Video. The Best value column lists the cheapest model within two points of the leader — the one you'd ship first.
- Source: arena.ai · fka LMSYS
- Metric: Bradley-Terry · higher is better
- Updated: March 2026
- Method: /methodology
| Category | Leader | Provider | Score | Votes | Best value |
|---|---|---|---|---|---|
| Text | Claude Opus 4.6 | Anthropic | 1502 | 800K+ | Grok-4.1 ($0.20 / $0.50) |
| Code | Claude Opus 4.6 | Anthropic | 1548 | 210K+ | GLM-5 ($1 / $3.20, MIT) |
| Vision | Gemini 3 Pro | Google | 1290 | 716K+ | Gemini 3 Flash ($0.50 / $3) |
| Document | Claude Opus 4.6 | Anthropic | 1524 | 44K+ | Claude Haiku 4.5 ($1 / $5) |
| Search | Claude Opus 4.6 Search | Anthropic | 1255 | 248K+ | Grok-4-fast ($0.20 / $0.50) |
| Text-to-Image | Gemini 3.1 Flash Image | Google | 1266 | 4.3M | qwen-image (Apache 2.0) |
| Text-to-Video | Veo 3.1 Audio 1080p | Google | 1381 | 247K+ | Kandinsky 5.0 (MIT) |
How arena rankings work.
Preference elicitation, not exam scoring. The arena doesn't ask which model is correct — it asks which one a human prefers, averaged across hundreds of thousands of people.
A prompt, two anonymous answers.
A user submits a prompt. Two unnamed models answer side by side. The user picks the better response without knowing which model is which. Bias from brand, cost, or training-set advertising drops out.
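To make the flow concrete, here is a minimal sketch of what one blind vote captures. The field names are hypothetical, not arena.ai's actual schema; the voter only ever sees slots A and B, and the model identities live in the record but are never shown before the verdict.

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class BlindVote:
    prompt: str
    model_a: str   # identity hidden from the voter until after the verdict
    model_b: str   # identity hidden from the voter until after the verdict
    verdict: Literal["A", "B", "tie", "both_bad"]

vote = BlindVote(
    prompt="Summarise this clause in plain English.",
    model_a="model-x",
    model_b="model-y",
    verdict="A",
)
```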
Votes aggregate via Bradley-Terry.
Each comparison shifts both models' scores relative to the expected outcome — the same estimator family as chess Elo. Millions of votes collapse into one score per model and one ordering per category, refreshed weekly.
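A minimal sketch of how that aggregation can work, assuming the classic minorisation-maximisation fit for Bradley-Terry strengths. The Elo-style rescaling at the end is an illustrative convention, not the arena's exact formula, and ties, deduplication, and per-category weighting are all ignored here.

```python
from collections import defaultdict
import math

def bradley_terry(votes, iters=200):
    """votes: list of (winner, loser) model-name pairs."""
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
        models.update((winner, loser))

    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # MM update: wins divided by the expected exposure against every opponent
            denom = sum(
                pair_counts[frozenset((m, o))] / (strength[m] + strength[o])
                for o in models if o != m
            )
            new[m] = wins[m] / denom if denom else strength[m]
        mean = sum(new.values()) / len(new)   # normalise so the scale stays identifiable
        strength = {m: s / mean for m, s in new.items()}
    return strength

# Toy votes with made-up model names, purely for illustration
votes = [("opus", "gemini"), ("opus", "grok"), ("gemini", "grok"),
         ("grok", "gemini"), ("opus", "gemini")]
strength = bradley_terry(votes)
# Map strengths onto an Elo-like scale (illustrative anchor, not the arena's constant)
scores = {m: 1000 + 400 * math.log10(s) for m, s in strength.items()}
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```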
Confidence intervals decide ties.
Each score carries a 95% interval. Models with overlapping intervals are statistically tied — not ranked. More votes tighten the interval, and the leaderboard only separates two models when the evidence is there.
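One common way to build such intervals is a bootstrap over the votes, sketched below reusing the bradley_terry helper and votes list from the previous block. This illustrates the idea, not the arena's exact interval construction.

```python
import math
import random
from collections import defaultdict

def bootstrap_intervals(votes, n_boot=200, alpha=0.05):
    samples = defaultdict(list)
    for _ in range(n_boot):
        resampled = random.choices(votes, k=len(votes))   # resample votes with replacement
        for model, s in bradley_terry(resampled).items():
            if s > 0:  # a model with no wins in this replicate has strength 0; skip it
                samples[model].append(1000 + 400 * math.log10(s))
    intervals = {}
    for model, vals in samples.items():
        vals.sort()
        lo = vals[int(alpha / 2 * len(vals))]
        hi = vals[int((1 - alpha / 2) * len(vals)) - 1]
        intervals[model] = (lo, hi)
    return intervals

def tied(a, b, intervals):
    (lo_a, hi_a), (lo_b, hi_b) = intervals[a], intervals[b]
    return lo_a <= hi_b and lo_b <= hi_a   # overlapping 95% intervals: statistical tie

intervals = bootstrap_intervals(votes)
print(tied("gemini", "grok", intervals))
```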
What a preference ranking does — and doesn't — measure.
Arena scores tell you which model a typical user picks when shown two unnamed answers side by side. That is useful and it is real. It is not the same as accuracy on a graded task.
A model can win the arena while failing graded benchmarks, because users reward confident prose, formatting, and length. A model can lose the arena while acing graded benchmarks, because the user prompts skew toward creative tasks that graded suites don't cover. Treat the arena as one dimension of evidence, pair it with the task registry at /tasks, and route between them on your own workload.
We mirror arena.ai here, categorised and cross-linked. For the editorial methodology behind Codesota's wider registry — what counts as reproduced, what a retraction looks like, how we record dates — see /methodology.
Triangulate before you ship.
Arena preference is one axis. Graded benchmarks are another. The interesting model is the one that wins on both — or fails on either in a way you can explain.
All arena numbers on this page and its subpages are mirrored from arena.ai. Scores update as new votes arrive; we refresh the snapshot monthly and date every page.
The task registry
Graded benchmarks — OCR, ASR, MTEB, detection, retrieval — where a model is measured against a reference, not against preference. Complement, not substitute.
Open /tasks →
LLM leaderboards
Codesota’s own LLM rollup — one table, every frontier model, every standard graded benchmark. Cross-reference before you pick.
Open /llm →
Methodology
Why these numbers can be trusted — what reproduction means, what a retraction looks like, and how we keep the registry honest across years.
Read /methodology →