AI Document Arena
Which AI Understands Documents Best?
Head-to-head Elo rankings across 13 frontier models on real-world document tasks — PDFs, contracts, research papers, and more. Updated March 2026.
Key Findings
Anthropic Dominates Documents
Anthropic holds 4 of the top 6 positions, including the top two. Claude Opus 4.6 leads the nearest non-Anthropic model, GPT-5.4, by 41 Elo points. Their document-centric training clearly pays off.
Context Window Matters
Six of the top eight models support a 1M-token context window; the only exceptions are two 200K Claude models. The ability to ingest entire books or large contract bundles in one pass translates directly into better document understanding scores.
Best Value Pick
Claude Haiku 4.5 ranks 9th at $1/$5 per million tokens — 5× cheaper than Opus 4.6. For high-volume document pipelines it delivers near-frontier quality at a fraction of the cost.
Full Leaderboard
Elo ± 95% CI · 43,670 votes total

| # | Model | Provider | Elo | CI | Votes | Input $/M | Output $/M | Context |
|---|---|---|---|---|---|---|---|---|
| 🥇 | claude-opus-4-6 | Anthropic | 1524 | ±12 | 4,336 | $5.00 | $25 | 1M |
| 🥈 | claude-sonnet-4-6 | Anthropic | 1491 | ±14 | 1,813 | $3.00 | $15 | 1M |
| 🥉 | gpt-5.4 | OpenAI | 1483 | ±16 | 1,349 | $2.50 | $15 | 1.1M |
| 4 | claude-opus-4-5 | Anthropic | 1473 | ±11 | 6,112 | $5.00 | $25 | 200K |
| 5 | gemini-3.1-pro-preview | Google | 1457 | ±9 | 3,972 | $2.00 | $12 | 1M |
| 6 | claude-sonnet-4-5 | Anthropic | 1450 | ±11 | 6,375 | $3.00 | $15 | 200K |
| 7 | gemini-3-pro | Google | 1447 | ±8 | 8,872 | $2.00 | $12 | 1M |
| 8 | gemini-2.5-pro | Google | 1430 | ±8 | 6,766 | $1.25 | $10 | 1M |
| 9 | claude-haiku-4-5 | Anthropic | 1427 | ±12 | 5,678 | $1.00 | $5 | 200K |
| 10 | gemini-3-flash | Google | 1424 | ±9 | 7,303 | $0.50 | $3 | 1M |
| 11 | gpt-5.2-high | OpenAI | 1413 | ±9 | 5,867 | $1.75 | $14 | 400K |
| 12 | gpt-5.1 | OpenAI | 1408 | ±8 | 7,021 | $1.25 | $10 | 400K |
| 13 | gpt-5.2 | OpenAI | 1408 | ±8 | 8,280 | $1.75 | $14 | 400K |
Provider Breakdown
Anthropic · 5 models
OpenAI · 4 models
Google · 4 models

Anthropic Dominates Documents
4 of the top 6 models come from a single lab
In the broader LLM Arena, no single provider commands such a decisive lead on a specific task type. The Document Arena tells a different story: Anthropic's models take ranks #1, #2, #4, and #6, with Claude Opus 4.6 sitting 41 points clear of the nearest competitor outside Anthropic, GPT-5.4 at 1483.
The pattern holds across model tiers. Even Claude Haiku 4.5 — the budget option at rank #9 — outperforms mid-range GPT-5.x and Gemini models in document comprehension. This suggests a systematic advantage, likely rooted in Constitutional AI training that emphasizes careful, faithful reading of source material.
Notably, Claude Opus 4.5 (rank #4, 200K context) outranks all four 1M-context Gemini models, and Claude Opus 4.6 (rank #1, 1M context) beats every non-Anthropic model outright, confirming that context window is not the sole explanatory variable.
Context Window Matters
1M-token models hold six of the top eight positions
Long-context capability is not just a marketing feature: in the Document Arena it correlates strongly with Elo ranking. Six of the eight models in the top half support 1M tokens or more, enough to process a 700-page novel, a 500-page contract bundle, or an entire codebase in a single call.
The three 200K context models (Claude Opus 4.5, Sonnet 4.5, Haiku 4.5) cluster at ranks 4, 6, and 9 — still competitive, but capped in the kinds of tasks where full-document ingestion matters most. OpenAI's GPT-5.x series with 400K context occupies the bottom three ranks.
For practitioners, this has a direct implication: if your document workflow involves files longer than 150,000 words (~200K tokens), the jump to a 1M context model is not optional — it determines whether the model can even attempt the task.
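A quick way to act on that rule is to estimate token count from word count and compare it against the window with some headroom. A minimal sketch in Python: the 1.33 tokens-per-word ratio is a common estimate for English text and varies by tokenizer, and the 10% headroom is our assumption, not an arena recommendation.

```python
def fits_context(word_count: int, context_tokens: int,
                 tokens_per_word: float = 1.33,   # rough English-text estimate
                 headroom: float = 0.90) -> bool:  # reserve 10% for prompt + response
    """Rough check: does a document of `word_count` words fit in a model's
    context window once the prompt and the response are accounted for?"""
    return word_count * tokens_per_word <= context_tokens * headroom

print(fits_context(150_000, 200_000))    # False: ~200K tokens overflows a 200K window
print(fits_context(150_000, 1_000_000))  # True: fits a 1M window comfortably
```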
Price-Per-Page Analysis
Assuming ~750 input tokens per document page, plus a ~150-token output per page.
| Model | Elo Rank | Cost / 100 pages | Cost / 1,000 pages | Value score |
|---|---|---|---|---|
| gemini-3-flash | #10 | 8.3¢ | $0.825 | 724 |
| claude-haiku-4-5 | #9 | 15.0¢ | $1.500 | 647 |
| gemini-2.5-pro | #8 | 24.4¢ | $2.437 | 595 |
| gpt-5.1 | #12 | 24.4¢ | $2.437 | 586 |
| gemini-3.1-pro-preview | #5 | 33.0¢ | $3.300 | 576 |
| gpt-5.4 | #3 | 41.3¢ | $4.125 | 565 |
| claude-sonnet-4-5 | #6 | 45.0¢ | $4.500 | 545 |
| claude-sonnet-4-6 | #2 | 45.0¢ | $4.500 | 560 |
| claude-opus-4-6 | #1 | 75.0¢ | $7.500 | 529 |
Value score = Elo ÷ log₁₀(cost index), where the cost index is normalized to $1 per 1M tokens. Higher is better. Gemini 3 Flash tops the raw value score; Claude Haiku 4.5 leads among the Anthropic models.
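The cost columns can be reproduced directly from the leaderboard prices under the stated per-page assumption. A minimal sketch in Python: the value-score line applies the formula above using cents per 1,000 pages as the cost index, which lands close to, but not exactly on, the published scores, so the arena's exact normalization evidently differs slightly.

```python
import math

IN_TOK_PER_PAGE = 750    # stated assumption: ~750 input tokens per page
OUT_TOK_PER_PAGE = 150   # stated assumption: ~150 output tokens per page

# name: (Elo, input $/M tokens, output $/M tokens), from the leaderboard
MODELS = {
    "gemini-3-flash":   (1424, 0.50, 3.00),
    "claude-haiku-4-5": (1427, 1.00, 5.00),
    "claude-opus-4-6":  (1524, 5.00, 25.00),
}

def cost_dollars(in_price: float, out_price: float, pages: int) -> float:
    """Dollar cost of running `pages` document pages through a model."""
    in_cost = pages * IN_TOK_PER_PAGE / 1e6 * in_price
    out_cost = pages * OUT_TOK_PER_PAGE / 1e6 * out_price
    return in_cost + out_cost

for name, (elo, p_in, p_out) in MODELS.items():
    per_100 = cost_dollars(p_in, p_out, 100)          # 0.0825 for gemini-3-flash
    cents_per_1k = cost_dollars(p_in, p_out, 1000) * 100
    value = elo / math.log10(cents_per_1k)            # approximate value score
    print(f"{name:18s} {per_100 * 100:5.1f}¢/100pp  value ≈ {value:.0f}")
```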
About the Document Arena
The Document Arena is newer and more focused than general-purpose LLM arenas. With 43,670 votes cast across 13 models, it has crossed the threshold where Elo ratings become statistically meaningful — the average confidence interval of ±10 Elo points means rank separations of 15+ points are reliable signal, not noise.
Tasks include: extracting tables from scanned PDFs, summarizing 200-page legal contracts, answering questions from multi-document sets, comparing versions of technical specifications, and parsing structured data from unstructured reports.
Votes are collected blind — evaluators see responses labeled A and B, not model names. This eliminates brand bias and focuses ratings on output quality. Elo ratings are recalculated after every vote using a K-factor of 32.
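For readers unfamiliar with the mechanics, this is what a single K = 32 Elo update looks like after one vote. A standard Elo sketch, not the arena's actual implementation:

```python
def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update after one pairwise vote with K-factor 32."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return rating_a + delta, rating_b - delta

# Example: a 1450-rated model wins a blind vote against a 1500-rated one.
new_a, new_b = elo_update(1450, 1500, a_won=True)
print(round(new_a), round(new_b))  # 1468 1482; the mild upset moves ~18 points
```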
Frequently Asked Questions
Which AI is best for document understanding in 2026?
Claude Opus 4.6, at 1524 Elo, 41 points ahead of the best non-Anthropic model (GPT-5.4).

What is the best AI for PDF analysis?
Arena tasks include table extraction from scanned PDFs; Claude Opus 4.6 ranks first overall, while Claude Haiku 4.5 is the strongest budget option for high-volume PDF pipelines.

How does the AI Document Arena work?
Evaluators vote blind on pairs of anonymized responses labeled A and B; Elo ratings are recalculated after every vote with a K-factor of 32.

Does context window size matter for document AI?
Largely, yes: six of the top eight models support 1M-token contexts, though two 200K Claude models at ranks #4 and #6 show it is not the only factor.

What is the most cost-effective AI for document processing?
Gemini 3 Flash tops the value score at 8.3¢ per 100 pages; Claude Haiku 4.5 delivers near-frontier quality at $1/$5 per million tokens.
Prices as of March 2026. Elo ratings from blind pairwise evaluations. Context window sizes reflect maximum supported input. Arena methodology follows Chatbot Arena (LMSYS) conventions. Updated regularly as new votes are collected.