ParseBench

Name: ParseBench Benchmark Results
Creator: Unknown
License: https://creativecommons.org/licenses/by/4.0/

LlamaIndex 2026 document parsing benchmark. ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge. Overall score = unweighted mean of the five dimensions.
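
As a minimal sketch of the "unweighted mean of the five dimensions" rule, the snippet below recomputes an overall score from the sub-scores listed for LlamaParse Agentic on this page. The function name `overall_score` and the dictionary keys are illustrative assumptions, not part of any ParseBench tooling, and rounding to one decimal is assumed from how scores are displayed here.

```python
from statistics import mean

# Dimension keys are hypothetical labels chosen for this sketch.
DIMENSIONS = (
    "tables",
    "charts",
    "content_faithfulness",
    "semantic_formatting",
    "visual_grounding",
)


def overall_score(sub_scores: dict[str, float]) -> float:
    """Return the unweighted mean of the five dimension scores, rounded to 0.1."""
    missing = [d for d in DIMENSIONS if d not in sub_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(mean(sub_scores[d] for d in DIMENSIONS), 1)


# Sub-scores as listed for LlamaParse Agentic in the leaderboard.
llamaparse_agentic = {
    "tables": 90.7,
    "charts": 78.1,
    "content_faithfulness": 89.7,
    "semantic_formatting": 85.2,
    "visual_grounding": 80.6,
}

print(overall_score(llamaparse_agentic))  # 84.9
```

Because the listed sub-scores are themselves rounded, a recomputed mean can differ from the listed overall score in the last digit.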

Paper Leaderboard

Benchmark Stats

Models: 14

Papers: 14

Metrics: 1

Metric: accuracy (higher is better)

Rank	Model	Notes	Source	Score	Year	Paper
1	LlamaParse Agentic	SOTA on ParseBench: highest overall score (84.9), best-in-column on Tables (90.7), Charts (78.1), Semantic Formatting (85.2), and Visual Grounding (80.6). Cost ~$0.012/page. ParseBench Table 5. Sub-scores: tables 90.7, charts 78.1, content-faithfulness 89.7, semantic-formatting 85.2, visual-grounding 80.6.	Community	84.9	2026	Source
2	LlamaParse Cost Effective	LlamaParse in cost-effective mode: competitive with Gemini 3 Flash minimal at ~1/10 the cost. ParseBench Table 5. Sub-scores: tables 73.2, charts 66.7, content-faithfulness 88, semantic-formatting 73, visual-grounding 58.6.	Community	71.9	2026	Source
3	Google Gemini 3 Flash	Gemini 3 Flash at default high thinking, evaluated as a VLM parser. Strongest VLM overall on ParseBench; 89.9 on Tables (second-best in column, behind LlamaParse Agentic's 90.7). ParseBench Table 5. Sub-scores: tables 89.9, charts 64.8, content-faithfulness 86.2, semantic-formatting 58.4, visual-grounding 56.	Community	71	2026	Source
4	Reducto	Reducto (default non-agentic pipeline). Second-best specialised parser overall. ParseBench Table 5. Sub-scores: tables 70.3, charts 57, content-faithfulness 86.4, semantic-formatting 56.8, visual-grounding 68.7.	Community	67.8	2026	Source
5	Qwen 3 VL	Qwen 3 VL evaluated via a parse-with-layout pipeline. Visual grounding uses a separate layout-only pipeline; 4 pages excluded where that pipeline failed. ParseBench Table 5. Sub-scores: tables 74.7, charts 28.2, content-faithfulness 87.6, semantic-formatting 64.2, visual-grounding 55.2.	Community	62	2026	Source
6	Azure Document Intelligence	Azure Document Intelligence (prebuilt layout). Best non-LlamaParse visual grounding (73.8). ParseBench Table 5. Sub-scores: tables 86, charts 1.6, content-faithfulness 84.9, semantic-formatting 51.9, visual-grounding 73.8.	Community	59.6	2026	Source
7	Extend	Extend parse pipeline. ParseBench Table 5. Sub-scores: tables 85.1, charts 1.6, content-faithfulness 84.1, semantic-formatting 47.4, visual-grounding 60.7.	Community	55.8	2026	Source
8	Dots OCR 1.5	Dots OCR 1.5: strongest content-faithfulness score in the benchmark (90.0), but charts collapse to 0.9. ParseBench Table 5. Sub-scores: tables 85.2, charts 0.9, content-faithfulness 90, semantic-formatting 47, visual-grounding 55.8.	Community	55.8	2026	Source
9	Docling	Docling OSS pipeline. Visual grounding score (66.1) excludes 13 pages where the pipeline failed. ParseBench Table 5. Sub-scores: tables 66.4, charts 52.8, content-faithfulness 66.9, semantic-formatting 1, visual-grounding 66.1.	Community	50.6	2026	Source
10	Google Cloud Document AI	Google Cloud Document AI (layout parser). ParseBench Table 5. Sub-scores: tables 55.1, charts 1.4, content-faithfulness 83.7, semantic-formatting 50.5, visual-grounding 61.3.	Community	50.4	2026	Source
11	AWS Textract	AWS Textract via its layout pipeline. Strong on grounding (70.4) but near-zero on charts (6.0) and formatting (3.7). ParseBench Table 5. Sub-scores: tables 84.6, charts 6, content-faithfulness 74.8, semantic-formatting 3.7, visual-grounding 70.4.	Community	47.9	2026	Source
12	OpenAI GPT-5 Mini	GPT-5 Mini evaluated as a VLM parser on ParseBench with reasoning set to medium. ParseBench Table 5. Sub-scores: tables 69.8, charts 30.1, content-faithfulness 82.3, semantic-formatting 45.8, visual-grounding 6.2.	Community	46.8	2026	Source
13	LandingAI	LandingAI ADE parse pipeline. ParseBench Table 5. Sub-scores: tables 73.7, charts 10.9, content-faithfulness 88.6, semantic-formatting 27.9, visual-grounding 25.1.	Community	45.2	2026	Source
14	Anthropic Haiku 4.5	Claude Haiku 4.5 with extended thinking enabled, evaluated as a VLM parser. ParseBench Table 5. Sub-scores: tables 77.2, charts 13.8, content-faithfulness 78.7, semantic-formatting 49.4, visual-grounding 6.7.	Community	45.2	2026	Source
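
The per-row "Sub-scores:" strings can be checked against the best-in-column claims with a small script. The following is a rough sketch only: the helper names `parse_sub_scores` and `best_per_dimension` are hypothetical, the regular expression assumes the "name value" layout used in the rows above, and only three rows are included for brevity.

```python
import re

# Matches pairs such as "content-faithfulness 89.7" or "visual-grounding 56".
SUB_SCORE_RE = re.compile(r"([a-z-]+) (\d+(?:\.\d+)?)")


def parse_sub_scores(notes: str) -> dict[str, float]:
    """Extract {dimension: score} from the text following 'Sub-scores:'."""
    _, _, tail = notes.partition("Sub-scores:")
    return {name: float(value) for name, value in SUB_SCORE_RE.findall(tail)}


def best_per_dimension(rows: dict[str, str]) -> dict[str, tuple[str, float]]:
    """Return the top-scoring (model, score) pair for each dimension."""
    best: dict[str, tuple[str, float]] = {}
    for model, notes in rows.items():
        for dim, score in parse_sub_scores(notes).items():
            if dim not in best or score > best[dim][1]:
                best[dim] = (model, score)
    return best


# Three rows copied from the leaderboard; the remaining rows are omitted here.
rows = {
    "LlamaParse Agentic": "Sub-scores: tables 90.7, charts 78.1, content-faithfulness 89.7, semantic-formatting 85.2, visual-grounding 80.6.",
    "Google Gemini 3 Flash": "Sub-scores: tables 89.9, charts 64.8, content-faithfulness 86.2, semantic-formatting 58.4, visual-grounding 56.",
    "Dots OCR 1.5": "Sub-scores: tables 85.2, charts 0.9, content-faithfulness 90, semantic-formatting 47, visual-grounding 55.8.",
}

for dim, (model, score) in best_per_dimension(rows).items():
    print(f"{dim}: {model} ({score})")
```

On these three rows the script reports LlamaParse Agentic as best on tables and Dots OCR 1.5 as best on content faithfulness, matching the notes in the table.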
