Visual Question Answering · 2019
TextVQA: Towards VQA Models That Can Read
A Visual Question Answering dataset requiring models to read and reason about text in natural images. It contains 45,336 questions about 28,408 images drawn from the Open Images dataset. Questions require OCR-based reasoning, e.g. "What does the sign say?". TextVQA is a standard benchmark for evaluating text understanding within visual scenes; results are reported with ANLS and exact-match accuracy metrics.
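The ANLS metric mentioned above can be sketched in a few lines. This follows the standard ST-VQA-style definition (one minus the normalized Levenshtein distance, zeroed out below a 0.5 threshold, taking the best match over all reference answers); the function names are illustrative and not taken from any official evaluation script.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], tau: float = 0.5) -> float:
    """Per-question ANLS: best normalized similarity over the reference
    answers, with scores below the threshold tau clipped to 0."""
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, 1.0 - nl if nl < tau else 0.0)
    return best
```

The dataset-level score is the mean of `anls` over all questions; exact-match accuracy is the stricter special case where only `nl == 0` counts.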
Current State of the Art
Qwen2.5-VL 72B
Alibaba
85.5
accuracy
Accuracy Progress Over Time
Showing 6 breakthroughs from Jan 2023 to Feb 2025
Key Milestones
Mar 2023
GPT-4V
TextVQA val. GPT-4V. Reported in multiple papers (Qwen2-VL Table 1, InternVL2 Table 3).
78.0
+83.5%
Feb 2025
Qwen2.5-VL 72B (Current SOTA)
TextVQA val. Qwen2.5-VL 72B. Table 2. arxiv:2502.13923
85.5
+0.7%
Total Improvement
101.2%
Time Span
2y 1m
Breakthroughs
6
Current SOTA
85.5
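The percentage figures in the milestones above are plain relative gains between scores; a minimal sketch (the helper name is ours, not from the source):

```python
def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of `new` over the baseline `old`."""
    return (new - old) / old * 100

# Total improvement: current SOTA (85.5) over the Jan 2023 baseline, BLIP-2 (42.5)
total = relative_gain(85.5, 42.5)  # ~101.2%
# Latest step: Qwen2.5-VL 72B (85.5) over Qwen2-VL 72B (84.9)
step = relative_gain(85.5, 84.9)   # ~0.7%
```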
Top Models Performance Comparison
Top 9 models ranked by accuracy
Best Score
85.5
Top Model
Qwen2.5-VL 72B
Models Compared
9
Score Range
43.0
Accuracy (primary metric)
| # | Model | Score | Date |
|---|---|---|---|
| 1 | Qwen2.5-VL 72B (Open Source, Alibaba) | 85.5 | Feb 2025 |
| 2 | Qwen2-VL 72B (Open Source, Alibaba) | 84.9 | Sep 2024 |
| 3 | InternVL2-76B (Open Source, Shanghai AI Lab) | 84.4 | Apr 2024 |
| 4 | Llama 3.2 Vision 90B (Open Source, Meta) | 83.4 | Jul 2024 |
| 5 | Gemini 1.5 Pro (API, Google) | 82.2 | Feb 2024 |
| 6 | GPT-4V (API, OpenAI) | 78.0 | Mar 2023 |
| 7 | GPT-4o (API, OpenAI) | 77.4 | Oct 2024 |
| 8 | LLaVA-1.5 (Open Source, UW-Madison / Microsoft) | 61.3 | Oct 2023 |
| 9 | BLIP-2 (Open Source, Salesforce) | 42.5 | Jan 2023 |
Related Papers (9)
Qwen2.5-VL Technical Report
Feb 2025 · Models: Qwen2.5-VL 72B
SWE-bench Verified
Oct 2024 · Models: GPT-4o
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Sep 2024 · Models: Qwen2-VL 72B
The Llama 3 Herd of Models
Jul 2024 · Models: Llama 3.2 Vision 90B
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Apr 2024 · Models: InternVL2-76B
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Feb 2024 · Models: Gemini 1.5 Pro
Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)
Oct 2023 · Models: LLaVA-1.5
GPT-4 Technical Report
Mar 2023 · Models: GPT-4V