AI Document Arena
Which AI Understands Documents Best?
Head-to-head Elo rankings across 13 frontier models on real-world document tasks — PDFs, contracts, research papers, and more. Updated March 2026.
Key Findings
Anthropic Dominates Documents
Anthropic holds 4 of the top 6 positions, including the top two. Claude Opus 4.6 leads the nearest non-Anthropic model, GPT-5.4, by 41 Elo points. Their document-centric training clearly pays off.
Context Window Matters
Six of the top eight models support a 1M-token context window; the only exceptions are two 200K Claude models. The ability to ingest entire books or large contract bundles in one pass translates directly into better document understanding scores.
Best Value Pick
Claude Haiku 4.5 ranks 9th at $1/$5 per million tokens — 5× cheaper than Opus 4.6. For high-volume document pipelines it delivers near-frontier quality at a fraction of the cost.
Full Leaderboard
Elo ± 95% CI · 43,670 votes total

| # | Model | Provider | Elo | CI | Votes | Input $/M | Output $/M | Context |
|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-6 | Anthropic | 1524 | ±12 | 4,336 | $5.00 | $25 | 1M |
| 🥈 | claude-sonnet-4-6 | Anthropic | 1491 | ±14 | 1,813 | $3.00 | $15 | 1M |
| 🥉 | gpt-5.4 | OpenAI | 1483 | ±16 | 1,349 | $2.50 | $15 | 1.1M |
| 4 | claude-opus-4-5 | Anthropic | 1473 | ±11 | 6,112 | $5.00 | $25 | 200K |
| 5 | gemini-3.1-pro-preview | Google | 1457 | ±9 | 3,972 | $2.00 | $12 | 1M |
| 6 | claude-sonnet-4-5 | Anthropic | 1450 | ±11 | 6,375 | $3.00 | $15 | 200K |
| 7 | gemini-3-pro | Google | 1447 | ±8 | 8,872 | $2.00 | $12 | 1M |
| 8 | gemini-2.5-pro | Google | 1430 | ±8 | 6,766 | $1.25 | $10 | 1M |
| 9 | claude-haiku-4-5 | Anthropic | 1427 | ±12 | 5,678 | $1.00 | $5 | 200K |
| 10 | gemini-3-flash | Google | 1424 | ±9 | 7,303 | $0.50 | $3 | 1M |
| 11 | gpt-5.2-high | OpenAI | 1413 | ±9 | 5,867 | $1.75 | $14 | 400K |
| 12 | gpt-5.1 | OpenAI | 1408 | ±8 | 7,021 | $1.25 | $10 | 400K |
| 13 | gpt-5.2 | OpenAI | 1408 | ±8 | 8,280 | $1.75 | $14 | 400K |
Provider Breakdown
Anthropic · 5 models
OpenAI · 4 models
Google · 4 models

Anthropic Dominates Documents
4 of the top 6 models come from a single lab
In the broader LLM Arena, no single provider commands such a decisive lead on a specific task type. The Document Arena tells a different story: Anthropic's models take ranks #1, #2, #4, and #6, with Claude Opus 4.6 sitting 41 points clear of the nearest competitor outside Anthropic, GPT-5.4 at 1483.
The pattern holds across model tiers. Even Claude Haiku 4.5 — the budget option at rank #9 — outperforms mid-range GPT-5.x and Gemini models in document comprehension. This suggests a systematic advantage, likely rooted in Constitutional AI training that emphasizes careful, faithful reading of source material.
Notably, Claude Opus 4.5 (rank #4, 200K context) outranks all four 1M-context Gemini models, and Claude Opus 4.6 (rank #1, 1M context) beats every non-Anthropic model outright, confirming that context window is not the sole explanatory variable.
Context Window Matters
1M-token models hold six of the top eight positions
Long-context capability is not just a marketing feature: in the Document Arena it correlates strongly with Elo ranking. Six of the eight models in the top half support 1M tokens or more, enough to process a 700-page novel, a 500-page contract bundle, or an entire codebase in a single call.
The three 200K context models (Claude Opus 4.5, Sonnet 4.5, Haiku 4.5) cluster at ranks 4, 6, and 9 — still competitive, but capped in the kinds of tasks where full-document ingestion matters most. OpenAI's GPT-5.x series with 400K context occupies the bottom three ranks.
For practitioners, this has a direct implication: if your document workflow involves files longer than 150,000 words (~200K tokens), the jump to a 1M context model is not optional — it determines whether the model can even attempt the task.
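A quick way to act on that rule is to estimate token count from word count and compare it against the window with some headroom. A minimal sketch in Python: the 1.33 tokens-per-word ratio is a common estimate for English text and varies by tokenizer, and the 10% headroom is our assumption, not an arena recommendation.

```python
def fits_context(word_count: int, context_tokens: int,
                 tokens_per_word: float = 1.33,   # rough English-text estimate
                 headroom: float = 0.90) -> bool:  # reserve 10% for prompt + response
    """Rough check: does a document of `word_count` words fit in a model's
    context window once the prompt and the response are accounted for?"""
    return word_count * tokens_per_word <= context_tokens * headroom

print(fits_context(150_000, 200_000))    # False: ~200K tokens overflows a 200K window
print(fits_context(150_000, 1_000_000))  # True: fits a 1M window comfortably
```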
Price-Per-Page Analysis
Assuming ~750 input tokens per document page, plus a ~150-token output per page.
| Model | Elo Rank | Cost / 100 pages | Cost / 1,000 pages | Value score |
|---|---|---|---|---|
| gemini-3-flash | #10 | 8.3¢ | $0.825 | 724 |
| claude-haiku-4-5 | #9 | 15.0¢ | $1.500 | 647 |
| gemini-2.5-pro | #8 | 24.4¢ | $2.437 | 595 |
| gpt-5.1 | #12 | 24.4¢ | $2.437 | 586 |
| gemini-3.1-pro-preview | #5 | 33.0¢ | $3.300 | 576 |
| gpt-5.4 | #3 | 41.3¢ | $4.125 | 565 |
| claude-sonnet-4-5 | #6 | 45.0¢ | $4.500 | 545 |
| claude-sonnet-4-6 | #2 | 45.0¢ | $4.500 | 560 |
| claude-opus-4-6 | #1 | 75.0¢ | $7.500 | 529 |
Value score = Elo ÷ log₁₀(cost index), where the cost index is normalized to $1 per 1M tokens. Higher is better. Gemini 3 Flash tops the raw value score; Claude Haiku 4.5 leads among the Anthropic models.
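The cost columns can be reproduced directly from the leaderboard prices under the stated per-page assumption. A minimal sketch in Python: the value-score line applies the formula above using cents per 1,000 pages as the cost index, which lands close to, but not exactly on, the published scores, so the arena's exact normalization evidently differs slightly.

```python
import math

IN_TOK_PER_PAGE = 750    # stated assumption: ~750 input tokens per page
OUT_TOK_PER_PAGE = 150   # stated assumption: ~150 output tokens per page

# name: (Elo, input $/M tokens, output $/M tokens), from the leaderboard
MODELS = {
    "gemini-3-flash":   (1424, 0.50, 3.00),
    "claude-haiku-4-5": (1427, 1.00, 5.00),
    "claude-opus-4-6":  (1524, 5.00, 25.00),
}

def cost_dollars(in_price: float, out_price: float, pages: int) -> float:
    """Dollar cost of running `pages` document pages through a model."""
    in_cost = pages * IN_TOK_PER_PAGE / 1e6 * in_price
    out_cost = pages * OUT_TOK_PER_PAGE / 1e6 * out_price
    return in_cost + out_cost

for name, (elo, p_in, p_out) in MODELS.items():
    per_100 = cost_dollars(p_in, p_out, 100)          # 0.0825 for gemini-3-flash
    cents_per_1k = cost_dollars(p_in, p_out, 1000) * 100
    value = elo / math.log10(cents_per_1k)            # approximate value score
    print(f"{name:18s} {per_100 * 100:5.1f}¢/100pp  value ≈ {value:.0f}")
```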
About the Document Arena
The Document Arena is newer and more focused than general-purpose LLM arenas. With 43,670 votes cast across 13 models, it has crossed the threshold where Elo ratings become statistically meaningful — the average confidence interval of ±10 Elo points means rank separations of 15+ points are reliable signal, not noise.
Tasks include: extracting tables from scanned PDFs, summarizing 200-page legal contracts, answering questions from multi-document sets, comparing versions of technical specifications, and parsing structured data from unstructured reports.
Votes are collected blind — evaluators see responses labeled A and B, not model names. This eliminates brand bias and focuses ratings on output quality. Elo ratings are recalculated after every vote using a K-factor of 32.
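For readers unfamiliar with the mechanics, this is what a single K = 32 Elo update looks like after one vote. A standard Elo sketch, not the arena's actual implementation:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise vote with K-factor 32."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Example: a 1450-rated model wins a blind vote against a 1500-rated one.
new_a, new_b = elo_update(1450, 1500, a_won=True)
print(round(new_a), round(new_b))  # 1468 1482; the mild upset moves ~18 points
```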
Frequently Asked Questions
Which AI is best for document understanding in 2026?
Claude Opus 4.6, at 1524 Elo, 41 points ahead of the best non-Anthropic model (GPT-5.4).

What is the best AI for PDF analysis?
Arena tasks include table extraction from scanned PDFs; Claude Opus 4.6 ranks first overall, while Claude Haiku 4.5 is the strongest budget option for high-volume PDF pipelines.

How does the AI Document Arena work?
Evaluators vote blind on pairs of anonymized responses labeled A and B; Elo ratings are recalculated after every vote with a K-factor of 32.

Does context window size matter for document AI?
Largely, yes: six of the top eight models support 1M-token contexts, though two 200K Claude models at ranks #4 and #6 show it is not the only factor.

What is the most cost-effective AI for document processing?
Gemini 3 Flash tops the value score at 8.3¢ per 100 pages; Claude Haiku 4.5 delivers near-frontier quality at $1/$5 per million tokens.
Prices as of March 2026. Elo ratings from blind pairwise evaluations. Context window sizes reflect maximum supported input. Arena methodology follows Chatbot Arena (LMSYS) conventions. Updated regularly as new votes are collected.