ParseBench


LlamaIndex 2026 document parsing benchmark. ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge. Overall score = unweighted mean of the five dimensions.
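The scoring rule can be checked with a short sketch. The function name and dictionary layout below are illustrative, not from the benchmark's code; the sub-scores are the LlamaParse Agentic numbers reported in Table 5, and the unweighted mean reproduces the published overall score.

```python
def parsebench_overall(scores: dict[str, float]) -> float:
    """Overall ParseBench score: unweighted mean of the five dimension scores."""
    dims = ["tables", "charts", "content_faithfulness",
            "semantic_formatting", "visual_grounding"]
    return sum(scores[d] for d in dims) / len(dims)

# LlamaParse Agentic sub-scores from ParseBench Table 5.
agentic = {
    "tables": 90.7,
    "charts": 78.1,
    "content_faithfulness": 89.7,
    "semantic_formatting": 85.2,
    "visual_grounding": 80.6,
}

print(round(parsebench_overall(agentic), 1))  # 84.9, matching the leaderboard
```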

Benchmark Stats

Models: 14
Papers: 14
Metrics: 1


Metric: accuracy (higher is better)

Rank | Model | Source | Score | Year
1 | LlamaParse Agentic | Community | 84.9 | 2026

SOTA on ParseBench: highest overall score (84.9), best-in-column on Tables (90.7), Charts (78.1), Semantic Formatting (85.2), and Visual Grounding (80.6). Cost ~$0.012/page. ParseBench Table 5. Sub-scores: tables 90.7, charts 78.1, content-faithfulness 89.7, semantic-formatting 85.2, visual-grounding 80.6.
2 | LlamaParse Cost Effective | Community | 71.9 | 2026

LlamaParse in cost-effective mode: competitive with Gemini 3 Flash minimal at ~1/10 the cost. ParseBench Table 5. Sub-scores: tables 73.2, charts 66.7, content-faithfulness 88.0, semantic-formatting 73.0, visual-grounding 58.6.
3 | Google Gemini 3 Flash | Community | 71.0 | 2026

Gemini 3 Flash at its default high thinking setting, evaluated as a VLM parser. Strongest VLM overall on ParseBench; 89.9 on Tables (best-in-column). ParseBench Table 5. Sub-scores: tables 89.9, charts 64.8, content-faithfulness 86.2, semantic-formatting 58.4, visual-grounding 56.0.
4 | Reducto | Community | 67.8 | 2026

Reducto (default non-agentic pipeline). Second-best specialised parser overall. ParseBench Table 5. Sub-scores: tables 70.3, charts 57.0, content-faithfulness 86.4, semantic-formatting 56.8, visual-grounding 68.7.
5 | Qwen 3 VL | Community | 62.0 | 2026

Qwen 3 VL evaluated via a parse-with-layout pipeline. Visual grounding uses a separate layout-only pipeline; 4 pages excluded where that pipeline failed. ParseBench Table 5. Sub-scores: tables 74.7, charts 28.2, content-faithfulness 87.6, semantic-formatting 64.2, visual-grounding 55.2.
6 | Azure Document Intelligence | Community | 59.6 | 2026

Azure Document Intelligence (prebuilt layout). Best non-LlamaParse visual grounding (73.8). ParseBench Table 5. Sub-scores: tables 86.0, charts 1.6, content-faithfulness 84.9, semantic-formatting 51.9, visual-grounding 73.8.
7 | Extend | Community | 55.8 | 2026

Extend parse pipeline. ParseBench Table 5. Sub-scores: tables 85.1, charts 1.6, content-faithfulness 84.1, semantic-formatting 47.4, visual-grounding 60.7.
8 | Dots OCR 1.5 | Community | 55.8 | 2026

Dots OCR 1.5: strongest content-faithfulness score in the benchmark (90.0), but charts collapse to 0.9. ParseBench Table 5. Sub-scores: tables 85.2, charts 0.9, content-faithfulness 90.0, semantic-formatting 47.0, visual-grounding 55.8.
9 | Docling | Community | 50.6 | 2026

Docling OSS pipeline. Visual grounding score (66.1) excludes 13 pages where the pipeline failed. ParseBench Table 5. Sub-scores: tables 66.4, charts 52.8, content-faithfulness 66.9, semantic-formatting 1.0, visual-grounding 66.1.
10 | Google Cloud Document AI | Community | 50.4 | 2026

Google Cloud Document AI (layout parser). ParseBench Table 5. Sub-scores: tables 55.1, charts 1.4, content-faithfulness 83.7, semantic-formatting 50.5, visual-grounding 61.3.
11 | AWS Textract | Community | 47.9 | 2026

AWS Textract via its layout pipeline. Strong on grounding (70.4) but near-zero on charts (6.0) and formatting (3.7). ParseBench Table 5. Sub-scores: tables 84.6, charts 6.0, content-faithfulness 74.8, semantic-formatting 3.7, visual-grounding 70.4.
12 | OpenAI GPT-5 Mini | Community | 46.8 | 2026

GPT-5 Mini evaluated as a VLM parser on ParseBench with reasoning set to medium. ParseBench Table 5. Sub-scores: tables 69.8, charts 30.1, content-faithfulness 82.3, semantic-formatting 45.8, visual-grounding 6.2.
13 | LandingAI | Community | 45.2 | 2026

LandingAI ADE parse pipeline. ParseBench Table 5. Sub-scores: tables 73.7, charts 10.9, content-faithfulness 88.6, semantic-formatting 27.9, visual-grounding 25.1.
14 | Anthropic Haiku 4.5 | Community | 45.2 | 2026

Claude Haiku 4.5 with extended thinking enabled, evaluated as a VLM parser. ParseBench Table 5. Sub-scores: tables 77.2, charts 13.8, content-faithfulness 78.7, semantic-formatting 49.4, visual-grounding 6.7.
