Benchmark for OCR across multi-scene, multilingual, and document parsing tasks.
F1 score on multi-scene text reading
Higher is better
| # | Model | Score | Source |
|---|---|---|---|
| ★ | gemini-15-pro | 83.25% | |
| 2 | qwen2-vl-72b | 77.95% | |
| 3 | internvl2-76b | 76.92% | |
| 4 | gpt-4o | 76.4% | |
| 5 | claude-35-sonnet | 72.87% | |
F1 score on multilingual text (10 languages)
Higher is better
| # | Model | Score | Source |
|---|---|---|---|
| ★ | gemini-15-pro | 78.97% | |
| 2 | gpt-4o | 73.44% | |
F1 score on key information extraction
Higher is better
| # | Model | Score | Source |
|---|---|---|---|
| ★ | qwen2-vl-72b | 71.76% | |
| 2 | gemini-15-pro | 67.28% | |
| 3 | claude-35-sonnet | 64.58% | |
| 4 | gpt-4o | 63.45% | |
Average score on document parsing
Higher is better
| # | Model | Score | Source |
|---|---|---|---|
| ★ | gemini-15-pro | 62.37 | |
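Most of the leaderboards above report F1. As a reminder of how a word-level F1 score for text reading can be computed, here is a minimal sketch; the exact-match, multiset-counting scheme shown is an illustrative assumption, not the benchmark's official scorer.

```python
from collections import Counter

def f1_score(predicted_words, reference_words):
    """Word-level F1 with multiset matching: a true positive is a word
    present in both prediction and reference, counted with multiplicity.
    Illustrative only; not the benchmark's official implementation."""
    pred = Counter(predicted_words)
    ref = Counter(reference_words)
    tp = sum((pred & ref).values())  # overlap, counted with multiplicity
    if tp == 0:
        return 0.0
    precision = tp / sum(pred.values())  # fraction of predictions that match
    recall = tp / sum(ref.values())      # fraction of reference words found
    return 2 * precision * recall / (precision + recall)

# One extra predicted word: precision 2/3, recall 1.0, F1 = 0.8
print(f1_score(["stop", "sign", "ahead"], ["stop", "sign"]))
```

Scores in the tables are then averaged over the evaluation set (and, for the multilingual track, across the 10 languages).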