Codesota · Benchmark · OCRBench v2
South China University of Technology

OCRBench v2.

Tests 8 core OCR capabilities across 23 tasks. Evaluates large multimodal models (LMMs) on capabilities such as text recognition, referring, and extraction.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

Overall (Chinese)

Overall Zh Private, the overall score on the Chinese private split, is one of the reported evaluation metrics for OCRBench v2. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Overall (Chinese): verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | Qwen2.5-VL-72B | paper | 63.7 | 2025 | From Qwen2.5-VL-72B-Instruct model card benchmark table. |
| 02 | gemini-25-pro | paper | 62.2 | 2025 | Chinese, Private split. #1 on Chinese |
| 03 | Gemini 2.5 Pro | unverified | 62.2 | 2025 | Chinese, Private split. #1 on Chinese |
| 04 | Qianfan-OCR | paper | 60.77 | 2025 | Baidu Qianfan-OCR 4B (Qwen3-4B + Qianfan-ViT), Apache 2.0, 192 langs. Layout-as-Thought. #1 on zh |
| 05 | minicpm-v-4.5-8b | unverified | 58.8 | 2025 | Chinese, Private split. #4 overall |
| 06 | sail-vl2-8b | paper | 57.6 | 2025 | |
| 07 | claude-3.5-sonnet | unverified | 48.4 | 2024 | |
| 08 | InternVL2.5-78B | paper | 46.2 | 2025 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 09 | Qwen2-VL-72B | paper | 46.1 | 2024 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 10 | gpt-4o-2024 | unverified | 45.7 | 2024 | |

English Score

English Score is one of the reported evaluation metrics for OCRBench v2.

Higher is better

Trust tiers for English Score: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Ovis2.5-9B | unverified | 63.4 | 2025 |
| 02 | Intern-S1-Pro | unverified | 60.1 | 2026 |

Overall (English)

Overall En Private, the overall score on the English private split, is one of the reported evaluation metrics for OCRBench v2.

Higher is better

Trust tiers for Overall (English): verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | seed-1.6-vision | paper | 62.2 | 2025 | English, Private split. #1 on OCRBench v2 |
| 02 | Seed1.6-vision | unverified | 62.2 | 2025 | English, Private split. #1 on OCRBench v2 |
| 03 | Qwen2.5-VL-72B | paper | 61.5 | 2025 | From Qwen2.5-VL-72B-Instruct model card benchmark table. HF: Qwen/Qwen2.5-VL-72B-Instruct. |
| 04 | qwen3-omni-30b | paper | 61.3 | 2025 | |
| 05 | Nemotron Nano V2 VL | unverified | 61.2 | 2025 | |
| 06 | nemotron-nano-v2-vl | paper | 61.2 | 2025 | |
| 07 | gemini-25-pro | paper | 59.3 | 2025 | |
| 08 | Gemini 2.5 Pro | unverified | 59.3 | 2025 | |
| 09 | llama-3.1-nemotron-nano-vl-8b | paper | 56.4 | 2025 | |
| 10 | Qianfan-OCR | paper | 56 | 2025 | Baidu Qianfan-OCR 4B (Qwen3-4B + Qianfan-ViT), Apache 2.0, 192 langs. Layout-as-Thought. |
| 11 | gpt-4o | paper | 55.5 | 2024 | Listed as GPT5-2025-08-07 on leaderboard |
| 12 | ovis2.5-8b | unverified | 54.1 | 2025 | |
| 13 | gemini-1.5-pro | unverified | 51.6 | 2024 | |
| 14 | sail-vl2-8b | paper | 49.3 | 2025 | |
| 15 | minicpm-v-4.5-8b | unverified | 48.4 | 2025 | |
| 16 | Qwen2-VL-72B | paper | 47.8 | 2024 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 17 | gpt-4o-2024 | paper | 47.6 | 2024 | GPT-4o baseline (not GPT5-2025-08-07) |
| 18 | claude-3.5-sonnet | paper | 47.5 | 2024 | |
| 19 | internvl3.5-14b | unverified | 47.1 | 2025 | |
| 20 | step-1v | unverified | 46.8 | 2024 | |
| 21 | grok4 | unverified | 45 | 2025 | |
| 22 | InternVL2.5-78B | paper | 45 | 2025 | From Qwen2.5-VL-72B-Instruct model card comparison table. |
| 23 | GPT-4o mini | unverified | 44.1 | 2024 | |
| 24 | gpt-4o-mini | paper | 44.1 | 2024 | |
| 25 | Claude Sonnet 4 | unverified | 42.4 | 2025 | Claude-sonnet-4-20250514 |
| 26 | claude-sonnet-4 | paper | 42.4 | 2025 | Claude-sonnet-4-20250514 |
| 27 | qwen2.5-vl-7b | unverified | 41.8 | 2025 | |
| 28 | deepseek-vl2-small | paper | 41 | 2024 | |
| 29 | pixtral-12b | unverified | 38.4 | 2024 | |
| 30 | phi-4-multimodal | unverified | 38.1 | 2025 | |
| 31 | glm-4v-9b | unverified | 37.1 | 2024 | |
| 32 | molmo-7b | unverified | 33.9 | 2024 | |
| 33 | llava-ov-7b | paper | 33.7 | 2024 | |
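
Several models in the Overall (English) table appear twice under different spellings and trust tiers, for example seed-1.6-vision (paper) and Seed1.6-vision (unverified). The Python sketch below shows one possible way to collapse such rows. It rests on two assumptions that the page does not state: that the tier order in the legend (verified, paper, vendor, community, unverified) reflects decreasing trust, and that names differing only in case and punctuation refer to the same model. The helper names `normalize` and `dedupe` are hypothetical and are not part of any Codesota tooling.

```python
import re

# Tier order as listed in the leaderboard legend. Treating it as a
# precedence ranking is an assumption, not documented Codesota behavior.
TIER_ORDER = ["verified", "paper", "vendor", "community", "unverified"]


def normalize(name: str) -> str:
    """Hypothetical matching key: lowercase, keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def dedupe(rows):
    """Keep one row per normalized name, preferring the more trusted tier."""
    best = {}
    for model, trust, score in rows:
        key = normalize(model)
        current = best.get(key)
        if current is None or TIER_ORDER.index(trust) < TIER_ORDER.index(current[1]):
            best[key] = (model, trust, score)
    # Higher is better on this metric, so sort by score in descending order.
    return sorted(best.values(), key=lambda row: row[2], reverse=True)


# A few rows copied from the Overall (English) table.
rows = [
    ("seed-1.6-vision", "paper", 62.2),
    ("Seed1.6-vision", "unverified", 62.2),
    ("Nemotron Nano V2 VL", "unverified", 61.2),
    ("nemotron-nano-v2-vl", "paper", 61.2),
    ("gpt-4o", "paper", 55.5),
    ("gpt-4o-2024", "paper", 47.6),
]

for model, trust, score in dedupe(rows):
    print(f"{model:<22}{trust:<12}{score}")
```

Under these assumptions the six sample rows reduce to four, and each duplicated model keeps its paper-tier entry. Names such as gpt-4o and gpt-4o-2024 stay separate because they differ in more than case and punctuation.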

Chinese Score

Chinese Score is one of the reported evaluation metrics for OCRBench v2.

Higher is better

Trust tiers for Chinese Score: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|------|-------|-------|-------|------|
| 01 | Intern-S1-Pro | unverified | 60.6 | 2026 |
| 02 | Ovis2.5-9B | unverified | 58 | 2025 |

Overall Zh Public

Overall Zh Public, the overall score on the Chinese public split, is one of the reported evaluation metrics for OCRBench v2.

Higher is better

Trust tiers for Overall Zh Public: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | InternVL3-14B | paper | 55.7 | 2025 | Table 3, arXiv:2501.00321. Highest on Chinese public split, narrowly ahead of Qwen2.5-VL-7B (55.6). |
| 02 | Qwen2.5-VL-7B | paper | 55.6 | 2025 | Table 3, arXiv:2501.00321. |
| 03 | Ovis2-8B | paper | 49.2 | 2025 | Table 3, arXiv:2501.00321. |
| 04 | Gemini 1.5 Pro | paper | 43.1 | 2024 | Table 3, arXiv:2501.00321. |
| 05 | DeepSeek-VL2-Small | paper | 42.7 | 2024 | Table 3, arXiv:2501.00321. |
| 06 | Step-1V | paper | 42.6 | 2024 | Table 3, arXiv:2501.00321. |
| 07 | MiniCPM-o-2.6 | paper | 41.1 | 2024 | Table 3, arXiv:2501.00321. |
| 08 | Claude 3.5 Sonnet | paper | 39.6 | 2024 | Table 3, arXiv:2501.00321. |
| 09 | GLM-4V-9B | paper | 36.6 | 2024 | Table 3, arXiv:2501.00321. |
| 10 | GPT-4o | paper | 32.2 | 2024 | Table 3, arXiv:2501.00321. |

Overall En Public

Overall En Public, the overall score on the English public split, is one of the reported evaluation metrics for OCRBench v2.

Higher is better

Trust tiers for Overall En Public: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|------|-------|-------|-------|------|-------|
| 01 | InternVL3-14B | paper | 52.6 | 2025 | Highest score on English public split. Table 2, arXiv:2501.00321. |
| 02 | Gemini 1.5 Pro | paper | 51.9 | 2024 | Table 2, arXiv:2501.00321. Gemini-1.5-Pro. |
| 03 | Ovis2-8B | paper | 47.7 | 2025 | Table 2, arXiv:2501.00321. |
| 04 | Step-1V | paper | 46.7 | 2024 | Table 2, arXiv:2501.00321. |
| 05 | Qwen2.5-VL-7B | paper | 46.7 | 2025 | Table 2, arXiv:2501.00321. Same as Step-1V average (46.7). |
| 06 | GPT-4o | paper | 46.5 | 2024 | Table 2, arXiv:2501.00321. |
| 07 | Claude 3.5 Sonnet | paper | 45.2 | 2024 | Table 2, arXiv:2501.00321. claude-3-5-sonnet-20241022. |
| 08 | MiniCPM-o-2.6 | paper | 45.1 | 2024 | Table 2, arXiv:2501.00321. |
| 09 | DeepSeek-VL2-Small | paper | 43.3 | 2024 | Table 2, arXiv:2501.00321. |
| 10 | GLM-4V-9B | paper | 42.6 | 2024 | Table 2, arXiv:2501.00321. |
| 11 | Pixtral-12B | paper | 40.3 | 2024 | Table 2, arXiv:2501.00321. |
| 12 | LLaVA-OneVision-7B | paper | 36.4 | 2024 | Table 2, arXiv:2501.00321. |
| 13 | Cambrian-1-8B | paper | 34.7 | 2024 | Table 2, arXiv:2501.00321. |
| 14 | Molmo-7B | paper | 34.5 | 2024 | Table 2, arXiv:2501.00321. |

§ 03 · Lineage

OCRBench v2 in context.

Predecessors (1)
OCRBench (superseded, 2023-05)
Relative to this predecessor, OCRBench v2 has 10× more items, human verification, EN+ZH parity, and four public/private splits to combat contamination. The original v1 saturated within 18 months; v2 reopened the gap.

This benchmark (1)
OCRBench v2 (active, 2024-12)
