ParseBench
ParseBench is LlamaIndex's 2026 document-parsing benchmark: ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government), scored by roughly 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge is used. The overall score is the unweighted mean of the five dimension scores.
Metric: accuracy (higher is better).
All scores below are from ParseBench Table 5; every entry is a 2026 community submission.

| Rank | Model | Overall | Tables | Charts | Content Faithfulness | Semantic Formatting | Visual Grounding | Notes |
|---|---|---|---|---|---|---|---|---|
| 1 | LlamaParse Agentic | 84.9 | 90.7 | 78.1 | 89.7 | 85.2 | 80.6 | SOTA on ParseBench: best overall and best-in-column on Tables, Charts, Semantic Formatting, and Visual Grounding. Cost ~$0.012/page. |
| 2 | LlamaParse Cost Effective | 71.9 | 73.2 | 66.7 | 88.0 | 73.0 | 58.6 | Cost-effective mode: competitive with Gemini 3 Flash minimal at ~1/10 the cost. |
| 3 | Google Gemini 3 Flash | 71.0 | 89.9 | 64.8 | 86.2 | 58.4 | 56.0 | Default high thinking, evaluated as a VLM parser. Strongest VLM overall; best Tables score among VLMs. |
| 4 | Reducto | 67.8 | 70.3 | 57.0 | 86.4 | 56.8 | 68.7 | Default non-agentic pipeline; second-best specialised parser overall. |
| 5 | Qwen 3 VL | 62.0 | 74.7 | 28.2 | 87.6 | 64.2 | 55.2 | Parse-with-layout pipeline; visual grounding uses a separate layout-only pipeline, with 4 pages excluded where that pipeline failed. |
| 6 | Azure Document Intelligence | 59.6 | 86.0 | 1.6 | 84.9 | 51.9 | 73.8 | Prebuilt layout model; best non-LlamaParse visual grounding. |
| 7 | Extend | 55.8 | 85.1 | 1.6 | 84.1 | 47.4 | 60.7 | Extend parse pipeline. |
| 8 | Dots OCR 1.5 | 55.8 | 85.2 | 0.9 | 90.0 | 47.0 | 55.8 | Strongest content faithfulness in the benchmark, but charts collapse to 0.9. |
| 9 | Docling | 50.6 | 66.4 | 52.8 | 66.9 | 1.0 | 66.1 | OSS pipeline; visual grounding excludes 13 pages where the pipeline failed. |
| 10 | Google Cloud Document AI | 50.4 | 55.1 | 1.4 | 83.7 | 50.5 | 61.3 | Layout parser. |
| 11 | AWS Textract | 47.9 | 84.6 | 6.0 | 74.8 | 3.7 | 70.4 | Layout pipeline; strong grounding but near-zero on charts and semantic formatting. |
| 12 | OpenAI GPT-5 Mini | 46.8 | 69.8 | 30.1 | 82.3 | 45.8 | 6.2 | Evaluated as a VLM parser with reasoning set to medium. |
| 13 | LandingAI | 45.2 | 73.7 | 10.9 | 88.6 | 27.9 | 25.1 | ADE parse pipeline. |
| 14 | Anthropic Haiku 4.5 | 45.2 | 77.2 | 13.8 | 78.7 | 49.4 | 6.7 | Claude Haiku 4.5 with extended thinking, evaluated as a VLM parser. |
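The overall column can be reproduced from the five sub-scores, since ParseBench defines it as their unweighted mean. A minimal sketch, using the LlamaParse Agentic sub-scores from Table 5 (the dict keys are illustrative names, not ParseBench identifiers):

```python
def overall_score(sub_scores: dict[str, float]) -> float:
    """Unweighted mean of the five dimension scores, rounded to one decimal."""
    return round(sum(sub_scores.values()) / len(sub_scores), 1)

# Sub-scores for the LlamaParse Agentic row of ParseBench Table 5.
llamaparse_agentic = {
    "tables": 90.7,
    "charts": 78.1,
    "content_faithfulness": 89.7,
    "semantic_formatting": 85.2,
    "visual_grounding": 80.6,
}

print(overall_score(llamaparse_agentic))  # → 84.9, matching the leaderboard
```

The same check holds for the other rows, which is a quick way to spot transcription errors when copying scores from the source table.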