Live Arena — 43,670 votes cast

AI Document Arena

Which AI Understands Documents Best?

Head-to-head Elo rankings across 13 frontier models on real-world document tasks — PDFs, contracts, research papers, and more. Updated March 2026.

13 models · 43,670 votes · 3 providers · 116-point Elo spread

Key Findings

👑

Anthropic Dominates Documents

Anthropic holds 4 of the top 6 positions, including the top two. Claude Opus 4.6 leads the nearest non-Anthropic model, GPT-5.4, by 41 Elo points. Their document-centric training clearly pays off.

📄

Context Window Matters

Six of the eight models in the top half support a 1M token context window. The ability to ingest entire books or large contract bundles in one pass translates directly into better document understanding scores.

💡

Best Value Pick

Claude Haiku 4.5 ranks 9th at $1/$5 per million tokens — 5× cheaper than Opus 4.6. For high-volume document pipelines it delivers near-frontier quality at a fraction of the cost.

Full Leaderboard

Elo ± 95% CI · 43,670 votes total
| Rank | Model | Provider | Elo | 95% CI | Votes | Price (in/out per 1M) | Context |
|------|-------|----------|-----|--------|-------|-----------------------|---------|
| 🥇 #1 | claude-opus-4-6 | Anthropic | 1524 | ±12 | 4,336 | $5 / $25 | 1M |
| 🥈 #2 | claude-sonnet-4-6 | Anthropic | 1491 | ±14 | 1,813 | $3 / $15 | 1M |
| 🥉 #3 | gpt-5.4 | OpenAI | 1483 | ±16 | 1,349 | $2.50 / $15 | 1.1M |
| #4 | claude-opus-4-5 | Anthropic | 1473 | ±11 | 6,112 | $5 / $25 | 200K |
| #5 | gemini-3.1-pro-preview | Google | 1457 | ±9 | 3,972 | $2 / $12 | 1M |
| #6 | claude-sonnet-4-5 | Anthropic | 1450 | ±11 | 6,375 | $3 / $15 | 200K |
| #7 | gemini-3-pro | Google | 1447 | ±8 | 8,872 | $2 / $12 | 1M |
| #8 | gemini-2.5-pro | Google | 1430 | ±8 | 6,766 | $1.25 / $10 | 1M |
| #9 | claude-haiku-4-5 | Anthropic | 1427 | ±12 | 5,678 | $1 / $5 | 200K |
| #10 | gemini-3-flash | Google | 1424 | ±9 | 7,303 | $0.50 / $3 | 1M |
| #11 | gpt-5.2-high | OpenAI | 1413 | ±9 | 5,867 | $1.75 / $14 | 400K |
| #12 | gpt-5.1 | OpenAI | 1408 | ±8 | 7,021 | $1.25 / $10 | 400K |
| #13 | gpt-5.2 | OpenAI | 1408 | ±8 | 8,280 | $1.75 / $14 | 400K |

Provider Breakdown

| Provider | Models | Avg Elo | Best model | Top Elo | Rank range |
|----------|--------|---------|------------|---------|------------|
| Anthropic | 5 | 1473 | claude-opus-4-6 | 1524 | #1 – #9 |
| OpenAI | 4 | 1428 | gpt-5.4 | 1483 | #3 – #13 |
| Google | 4 | 1440 | gemini-3.1-pro-preview | 1457 | #5 – #10 |

🏆

Anthropic Dominates Documents

4 of the top 6 models come from a single lab

In the broader LLM Arena, no single provider commands such a decisive lead on a specific task type. The Document Arena tells a different story: Anthropic's models take ranks #1, #2, #4, and #6, with Claude Opus 4.6 sitting 41 points clear of the nearest competitor outside Anthropic.

The pattern holds across model tiers. Even Claude Haiku 4.5 — the budget option at rank #9 — outperforms mid-range GPT-5.x and Gemini models in document comprehension. This suggests a systematic advantage, likely rooted in Constitutional AI training that emphasizes careful, faithful reading of source material.

Notably, both Claude Opus 4.5 (rank #4, 200K ctx) and Claude Opus 4.6 (rank #1, 1M ctx) beat all non-Anthropic models in their context tier — confirming that context window is not the sole explanatory variable.

📐

Context Window Matters

1M token models take six of the top eight positions

Long-context capability is not just a marketing feature: in the Document Arena it correlates with Elo ranking. Six of the eight models in the top half support 1M tokens or more, allowing them to process a 700-page novel, a 500-page contract bundle, or an entire codebase in a single call.

The three 200K context models (Claude Opus 4.5, Sonnet 4.5, Haiku 4.5) cluster at ranks 4, 6, and 9: still competitive, but capped in the kinds of tasks where full-document ingestion matters most. OpenAI's 400K context models (gpt-5.2-high, gpt-5.1, gpt-5.2) occupy the bottom three ranks.

For practitioners, this has a direct implication: if your document workflow involves files longer than 150,000 words (~200K tokens), the jump to a 1M context model is not optional — it determines whether the model can even attempt the task.

| Context tier | Models | Avg Elo |
|--------------|--------|---------|
| 1M / 1.1M | 7 | 1465 |
| 400K | 3 | 1410 |
| 200K | 3 | 1450 |
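
For practitioners sizing a workflow against these tiers, a back-of-envelope token estimate is usually enough. Below is a minimal Python sketch (not part of the Arena tooling) assuming ~1.33 tokens per English word, the same ratio behind the 150,000-word ≈ 200K-token figure above; the function name and the 4K output reserve are illustrative.

```python
# Rough check: does a document fit in a 200K vs. 1M context window?
# Assumes ~1.33 tokens per English word; real tokenizer counts vary
# by model and by content (tables and code tokenize more densely).

TOKENS_PER_WORD = 1.33

def fits_in_context(word_count: int, context_tokens: int,
                    output_reserve: int = 4_000) -> bool:
    """True if the document plus a small output budget fits the window."""
    estimated_tokens = int(word_count * TOKENS_PER_WORD)
    return estimated_tokens + output_reserve <= context_tokens

# A 500-page contract bundle at ~500 words per page is ~250,000 words:
words = 500 * 500
print(fits_in_context(words, 200_000))    # False: overflows a 200K window
print(fits_in_context(words, 1_000_000))  # True: fits a 1M window
```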

Price-Per-Page Analysis

Assumes ~750 input tokens per document page plus a ~150-token output per page, priced at each model's per-1M-token rates. A short calculator reproducing these figures follows the table.

| Model | Elo rank | Cost / 100 pages | Cost / 1,000 pages | Value score |
|-------|----------|------------------|--------------------|-------------|
| gemini-3-flash | #10 | 8.3¢ | $0.825 | 724 |
| claude-haiku-4-5 | #9 | 15.0¢ | $1.500 | 647 |
| gemini-2.5-pro | #8 | 24.4¢ | $2.437 | 595 |
| gpt-5.1 | #12 | 24.4¢ | $2.437 | 586 |
| gemini-3.1-pro-preview | #5 | 33.0¢ | $3.300 | 576 |
| gpt-5.4 | #3 | 41.3¢ | $4.125 | 565 |
| claude-sonnet-4-5 | #6 | 45.0¢ | $4.500 | 545 |
| claude-sonnet-4-6 | #2 | 45.0¢ | $4.500 | 560 |
| claude-opus-4-6 | #1 | 75.0¢ | $7.500 | 529 |

Value score = Elo ÷ log₁₀(cost index). Higher is better. Gemini 3 Flash posts the top value score; Claude Haiku 4.5 leads among models that also rank in the Elo top nine.
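
The cost columns follow mechanically from the leaderboard prices. Here is a minimal Python sketch under the stated assumptions (750 input + 150 output tokens per page); the model subset and function name are ours, and the prices are the March 2026 figures from the table above.

```python
# Per-page document costs from per-1M-token prices (input, output).
PRICES = {
    "gemini-3-flash":   (0.50, 3.00),
    "claude-haiku-4-5": (1.00, 5.00),
    "claude-opus-4-6":  (5.00, 25.00),
}

IN_TOKENS_PER_PAGE = 750    # assumed input tokens per document page
OUT_TOKENS_PER_PAGE = 150   # assumed output tokens per page

def cost_for_pages(model: str, pages: int) -> float:
    """Dollar cost of processing `pages` pages with the given model."""
    in_price, out_price = PRICES[model]
    in_cost = pages * IN_TOKENS_PER_PAGE * in_price / 1_000_000
    out_cost = pages * OUT_TOKENS_PER_PAGE * out_price / 1_000_000
    return in_cost + out_cost

for model in PRICES:
    print(f"{model}: ${cost_for_pages(model, 1_000):.3f} per 1,000 pages")
# gemini-3-flash: $0.825, claude-haiku-4-5: $1.500, claude-opus-4-6: $7.500
```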

About the Document Arena

The Document Arena is newer and more focused than general-purpose LLM arenas. With 43,670 votes cast across 13 models, it has crossed the threshold where Elo ratings become statistically meaningful: with an average confidence interval of ±10 Elo points, two independent intervals combine to roughly ±14 (√(10² + 10²)), so rank separations of 15+ points are reliable signal, not noise.

Tasks include: extracting tables from scanned PDFs, summarizing 200-page legal contracts, answering questions from multi-document sets, comparing versions of technical specifications, and parsing structured data from unstructured reports.

Votes are collected blind — evaluators see responses labeled A and B, not model names. This eliminates brand bias and focuses ratings on output quality. Elo ratings are recalculated after every vote using a K-factor of 32.
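
For readers who want the mechanics, the sketch below implements the standard Elo update with K = 32 as described above. It is a minimal illustration, not the Arena's actual code, and the function names are ours.

```python
# One Elo update per blind A/B vote. The winner takes points from the
# loser in proportion to how surprising the result was.
K = 32

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """New (rating_a, rating_b) after a single vote."""
    score_a = 1.0 if a_won else 0.0
    delta = K * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Example: a 1427-rated model upsets a 1524-rated one and gains ~20 points.
print(update(1427, 1524, a_won=True))
```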

Frequently Asked Questions

Which AI is best for document understanding in 2026?
Claude Opus 4.6 leads the AI Document Arena with an Elo rating of 1524, based on 4,336 head-to-head comparisons. It is followed by Claude Sonnet 4.6 (1491) and GPT-5.4 (1483). Anthropic models occupy 4 of the top 6 positions.
What is the best AI for PDF analysis?
For PDF analysis, Claude Opus 4.6 ranks first overall, while Claude Haiku 4.5 offers the best value at $1/$5 per million tokens while still ranking 9th out of 13 models. All top performers support at least 200K context windows, with 1M context models taking six of the top eight spots.
How does the AI Document Arena work?
The Arena uses pairwise Elo ratings. Voters compare two AI models side-by-side on the same document task and pick the better response. Results are aggregated using the Elo rating system — the same method used in chess rankings. Higher Elo means consistently better document understanding.
Does context window size matter for document AI?
Yes — six of the top eight models in the Document Arena support 1M token context windows. This allows them to process entire books, lengthy contracts, or large codebases in a single pass. The 200K context models (Claude Opus 4.5, Claude Sonnet 4.5, Claude Haiku 4.5) still rank competitively but trail the 1M models at the top of the table.
What is the most cost-effective AI for document processing?
Claude Haiku 4.5 at $1 input / $5 output per million tokens offers the best value: it ranks 9th out of 13 models while being 5x cheaper than Claude Opus 4.6. For high-volume document workflows, Gemini 3 Flash ($0.50/$3) is even cheaper but ranks 10th.

Prices as of March 2026. Elo ratings from blind pairwise evaluations. Context window sizes reflect maximum supported input. Arena methodology follows Chatbot Arena (LMSYS) conventions. Updated regularly as new votes are collected.