
ParseBench.

LlamaIndex 2026 document parsing benchmark. ~2,078 human-verified pages from ~1,211 enterprise documents (insurance, finance, government) with 169K rule-based tests across five dimensions: tables (GTRM), charts (ChartDataPointMatch), content faithfulness, semantic formatting, and visual grounding. No LLM-as-judge. Overall score = unweighted mean of the five dimensions.
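As a minimal sketch of the aggregation rule described above, the overall score can be reproduced from the five sub-scores. The function name, the dictionary keys, and the rounding to one decimal place are assumptions made for illustration; they are not taken from the ParseBench code. The sample values are the LlamaParse Agentic sub-scores listed on this page.

```python
from statistics import mean

# The five ParseBench dimensions; key names are illustrative.
DIMENSIONS = (
    "tables",
    "charts",
    "content-faithfulness",
    "semantic-formatting",
    "visual-grounding",
)


def overall_score(sub_scores: dict[str, float]) -> float:
    """Unweighted mean of the five dimension scores, rounded to one decimal."""
    missing = [d for d in DIMENSIONS if d not in sub_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    return round(mean(sub_scores[d] for d in DIMENSIONS), 1)


# Sub-scores reported for LlamaParse Agentic on this page.
llamaparse_agentic = {
    "tables": 90.7,
    "charts": 78.1,
    "content-faithfulness": 89.7,
    "semantic-formatting": 85.2,
    "visual-grounding": 80.6,
}

print(overall_score(llamaparse_agentic))  # 84.9
```

Because the mean is unweighted, a near-zero score on a single dimension (charts, for several parsers) pulls the overall score down by up to a fifth of the scale regardless of how many pages that dimension covers.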

§ 01 · Leaderboard

Results by metric.


accuracy

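Accuracy is the reported evaluation metric for ParseBench. The values listed here are overall scores: the unweighted mean of the five dimension sub-scores given with each entry. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.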

Higher is better

Trust tiers for accuracy: verified, paper, vendor, community, unverified
Rank · Model · Trust · Score · Year
01 · LlamaParse Agentic
SOTA on ParseBench: highest overall score (84.9), best-in-column on Tables (90.7), Charts (78.1), Semantic Formatting (85.2), and Visual Grounding (80.6). Cost ~$0.012/page. ParseBench Table 5. Sub-scores: tables 90.7, charts 78.1, content-faithfulness 89.7, semantic-formatting 85.2, visual-grounding 80.6.
verified · 84.9 · 2026
02 · LlamaParse Cost Effective
LlamaParse in cost-effective mode: competitive with Gemini 3 Flash minimal at ~1/10 the cost. ParseBench Table 5. Sub-scores: tables 73.2, charts 66.7, content-faithfulness 88, semantic-formatting 73, visual-grounding 58.6.
verified · 71.9 · 2026
03 · Google Gemini 3 Flash
Gemini 3 Flash at default high thinking, evaluated as a VLM parser. Strongest VLM overall on ParseBench; 89.9 on Tables, second only to LlamaParse Agentic (90.7). ParseBench Table 5. Sub-scores: tables 89.9, charts 64.8, content-faithfulness 86.2, semantic-formatting 58.4, visual-grounding 56.
verified · 71 · 2026
04 · Gemini 3 Flash
Gemini 3 Flash at default high thinking, evaluated as a VLM parser. Strongest VLM overall on ParseBench; 89.9 on Tables, second only to LlamaParse Agentic (90.7). ParseBench Table 5. Sub-scores: tables 89.9, charts 64.8, content-faithfulness 86.2, semantic-formatting 58.4, visual-grounding 56.
verified · 71 · 2026
05 · Reducto
Reducto (default non-agentic pipeline). Highest-scoring specialised parser overall after the two LlamaParse modes. ParseBench Table 5. Sub-scores: tables 70.3, charts 57, content-faithfulness 86.4, semantic-formatting 56.8, visual-grounding 68.7.
verified · 67.8 · 2026
06 · Qwen 3 VL
Qwen 3 VL evaluated via a parse-with-layout pipeline. Visual grounding uses a separate layout-only pipeline; 4 pages excluded where that pipeline failed. ParseBench Table 5. Sub-scores: tables 74.7, charts 28.2, content-faithfulness 87.6, semantic-formatting 64.2, visual-grounding 55.2.
verified · 62 · 2026
07 · Qwen3-VL-4B
Qwen 3 VL evaluated via a parse-with-layout pipeline. Visual grounding uses a separate layout-only pipeline; 4 pages excluded where that pipeline failed. ParseBench Table 5. Sub-scores: tables 74.7, charts 28.2, content-faithfulness 87.6, semantic-formatting 64.2, visual-grounding 55.2.
verified · 62 · 2026
08 · Azure Document Intelligence
Azure Document Intelligence (prebuilt layout). Best non-LlamaParse visual grounding (73.8). ParseBench Table 5. Sub-scores: tables 86, charts 1.6, content-faithfulness 84.9, semantic-formatting 51.9, visual-grounding 73.8.
verified · 59.6 · 2026
09 · Dots OCR 1.5
Dots OCR 1.5: strongest content-faithfulness score in the benchmark (90.0), but charts collapse to 0.9. ParseBench Table 5. Sub-scores: tables 85.2, charts 0.9, content-faithfulness 90, semantic-formatting 47, visual-grounding 55.8.
verified · 55.8 · 2026
10 · Extend
Extend parse pipeline. ParseBench Table 5. Sub-scores: tables 85.1, charts 1.6, content-faithfulness 84.1, semantic-formatting 47.4, visual-grounding 60.7.
verified · 55.8 · 2026
11 · Docling
Docling OSS pipeline. Visual grounding score (66.1) excludes 13 pages where the pipeline failed. ParseBench Table 5. Sub-scores: tables 66.4, charts 52.8, content-faithfulness 66.9, semantic-formatting 1, visual-grounding 66.1.
verified · 50.6 · 2026
12 · Google Cloud Document AI
Google Cloud Document AI (layout parser). ParseBench Table 5. Sub-scores: tables 55.1, charts 1.4, content-faithfulness 83.7, semantic-formatting 50.5, visual-grounding 61.3.
verified · 50.4 · 2026
13 · AWS Textract
AWS Textract via its layout pipeline. Strong on grounding (70.4) but near-zero on charts (6.0) and formatting (3.7). ParseBench Table 5. Sub-scores: tables 84.6, charts 6, content-faithfulness 74.8, semantic-formatting 3.7, visual-grounding 70.4.
verified · 47.9 · 2026
14 · GPT-5 mini
GPT-5 Mini evaluated as a VLM parser on ParseBench with reasoning set to medium. ParseBench Table 5. Sub-scores: tables 69.8, charts 30.1, content-faithfulness 82.3, semantic-formatting 45.8, visual-grounding 6.2.
verified · 46.8 · 2026
15 · OpenAI GPT-5 Mini
GPT-5 Mini evaluated as a VLM parser on ParseBench with reasoning set to medium. ParseBench Table 5. Sub-scores: tables 69.8, charts 30.1, content-faithfulness 82.3, semantic-formatting 45.8, visual-grounding 6.2.
verified · 46.8 · 2026
16 · Anthropic Haiku 4.5
Claude Haiku 4.5 with extended thinking enabled, evaluated as a VLM parser. ParseBench Table 5. Sub-scores: tables 77.2, charts 13.8, content-faithfulness 78.7, semantic-formatting 49.4, visual-grounding 6.7.
verified · 45.2 · 2026
17 · LandingAI
LandingAI ADE parse pipeline. ParseBench Table 5. Sub-scores: tables 73.7, charts 10.9, content-faithfulness 88.6, semantic-formatting 27.9, visual-grounding 25.1.
verified · 45.2 · 2026
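Since the overall score hides large per-dimension differences, it can help to re-rank entries on a single dimension. The sketch below does this for four entries, using the sub-scores listed above; the `ENTRIES` structure and the `rank_by` helper are hypothetical names introduced here for illustration only.

```python
# Sub-scores copied from four leaderboard entries above.
# ENTRIES and rank_by are illustrative names, not part of ParseBench.
ENTRIES = {
    "LlamaParse Agentic": {
        "tables": 90.7, "charts": 78.1, "content-faithfulness": 89.7,
        "semantic-formatting": 85.2, "visual-grounding": 80.6,
    },
    "Azure Document Intelligence": {
        "tables": 86, "charts": 1.6, "content-faithfulness": 84.9,
        "semantic-formatting": 51.9, "visual-grounding": 73.8,
    },
    "Dots OCR 1.5": {
        "tables": 85.2, "charts": 0.9, "content-faithfulness": 90,
        "semantic-formatting": 47, "visual-grounding": 55.8,
    },
    "AWS Textract": {
        "tables": 84.6, "charts": 6, "content-faithfulness": 74.8,
        "semantic-formatting": 3.7, "visual-grounding": 70.4,
    },
}


def rank_by(entries: dict[str, dict[str, float]], dimension: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs sorted by one dimension, highest first."""
    pairs = [(name, scores[dimension]) for name, scores in entries.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for name, score in rank_by(ENTRIES, "content-faithfulness"):
    print(f"{score:5.1f}  {name}")
```

On content faithfulness, Dots OCR 1.5 moves ahead of LlamaParse Agentic in this subset, while on visual grounding it drops to last, which is the kind of trade-off the overall ranking does not show.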
Lineage

ParseBench in context.

This benchmark (1)
active · 2026-01
ParseBench
None yet — this is the current frontier.
