OCRBench v2
South China University of Technology
Comprehensive benchmark evaluating 8 OCR capabilities across 23 tasks in 31 scenarios.
Total Results: 32
Models Tested: 27
Metrics: 2
Last Updated: 2025-12-21
Overall (English)
Average score on English private test set
Higher is better
| Rank | Model | Score | Source | Notes |
|---|---|---|---|---|
| 1 | seed-1.6-vision | 62.2 | alphaxiv-leaderboard | English, Private split. #1 on OCRBench v2 |
| 2 | qwen3-omni-30b | 61.3 | alphaxiv-leaderboard | |
| 3 | nemotron-nano-v2-vl | 61.2 | alphaxiv-leaderboard | |
| 4 | gemini-25-pro | 59.3 | alphaxiv-leaderboard | |
| 5 | llama-3.1-nemotron-nano-vl-8b | 56.4 | ocrbench-v2-leaderboard | |
| 6 | gpt-4o | 55.5 | alphaxiv-leaderboard | Listed as GPT5-2025-08-07 on leaderboard |
| 7 | ovis2.5-8b | 54.1 | ocrbench-v2-leaderboard | |
| 8 | gemini-1.5-pro | 51.6 | ocrbench-v2-leaderboard | |
| 9 | sail-vl2-8b | 49.3 | ocrbench-v2-leaderboard | |
| 10 | minicpm-v-4.5-8b | 48.4 | ocrbench-v2-leaderboard | |
| 11 | gpt-4o-2024 | 47.6 | ocrbench-v2-leaderboard | GPT-4o baseline (not GPT5-2025-08-07) |
| 12 | claude-3.5-sonnet | 47.5 | ocrbench-v2-leaderboard | |
| 13 | internvl3.5-14b | 47.1 | ocrbench-v2-leaderboard | |
| 14 | step-1v | 46.8 | ocrbench-v2-leaderboard | |
| 15 | grok4 | 45.0 | ocrbench-v2-leaderboard | |
| 16 | gpt-4o-mini | 44.1 | ocrbench-v2-leaderboard | |
| 17 | claude-sonnet-4 | 42.4 | ocrbench-v2-leaderboard | Claude-sonnet-4-20250514 |
| 18 | qwen2.5-vl-7b | 41.8 | ocrbench-v2-leaderboard | |
| 19 | deepseek-vl2-small | 41.0 | ocrbench-v2-leaderboard | |
| 20 | pixtral-12b | 38.4 | ocrbench-v2-leaderboard | |
| 21 | phi-4-multimodal | 38.1 | ocrbench-v2-leaderboard | |
| 22 | glm-4v-9b | 37.1 | ocrbench-v2-leaderboard | |
| 23 | molmo-7b | 33.9 | ocrbench-v2-leaderboard | |
| 24 | llava-ov-7b | 33.7 | ocrbench-v2-leaderboard | |
| 25 | idefics3-8b | 26.0 | ocrbench-v2-leaderboard | |
| 26 | mistral-ocr-2512 | 25.2 | codesota-verified | Verified via CodeSOTA benchmark. 7,400 English samples. Mistral OCR is a pure OCR model (text extraction only), not designed for VQA, chart parsing, or structured extraction tasks. Strong on full-page OCR (79.1%) and document parsing (55.2%). |
| 27 | docowl2 | 23.4 | ocrbench-v2-leaderboard | |
Overall (Chinese)
Average score on Chinese private test set
Higher is better
| Rank | Model | Score | Source | Notes |
|---|---|---|---|---|
| 1 | gemini-25-pro | 62.2 | alphaxiv-leaderboard | Chinese, Private split. #1 on Chinese |
| 2 | minicpm-v-4.5-8b | 58.8 | ocrbench-v2-leaderboard | Chinese, Private split. #4 overall |
| 3 | sail-vl2-8b | 57.6 | ocrbench-v2-leaderboard | |
| 4 | claude-3.5-sonnet | 48.4 | ocrbench-v2-leaderboard | |
| 5 | gpt-4o-2024 | 45.7 | ocrbench-v2-leaderboard | |
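
All five models in the Chinese table also appear in the English table, so the two scores can be compared per model. The snippet below is a minimal sketch of that comparison, not part of the benchmark itself: the scores are copied from the two tables above, and the helper name `language_gaps` is a hypothetical one chosen for illustration. The two private test sets differ, so the gap is only a rough indicator.

```python
# Scores copied from the English and Chinese leaderboard tables above.
english = {
    "gemini-25-pro": 59.3,
    "minicpm-v-4.5-8b": 48.4,
    "sail-vl2-8b": 49.3,
    "claude-3.5-sonnet": 47.5,
    "gpt-4o-2024": 47.6,
}
chinese = {
    "gemini-25-pro": 62.2,
    "minicpm-v-4.5-8b": 58.8,
    "sail-vl2-8b": 57.6,
    "claude-3.5-sonnet": 48.4,
    "gpt-4o-2024": 45.7,
}


def language_gaps(en, zh):
    """Hypothetical helper: (model, en, zh, zh - en) rows, largest gap first.

    Only models present in both dictionaries are included.
    """
    shared = en.keys() & zh.keys()
    rows = [(m, en[m], zh[m], round(zh[m] - en[m], 1)) for m in shared]
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    print(f"{'model':<20} {'en':>5} {'zh':>5} {'gap':>6}")
    for model, en, zh, gap in language_gaps(english, chinese):
        print(f"{model:<20} {en:>5.1f} {zh:>5.1f} {gap:>+6.1f}")
```

A positive gap means the model scores higher on the Chinese set than on the English set. On these numbers, minicpm-v-4.5-8b and sail-vl2-8b show the largest positive gaps, while gpt-4o-2024 is the only listed model that scores lower on Chinese.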