Visual Question Answering · benchmark dataset · 2019 · EN

TextVQA: Towards VQA Models That Can Read.

A Visual Question Answering dataset that requires models to read and reason about text in natural images. It contains 45,336 questions about 28,408 images from the Open Images dataset. Questions require OCR-based reasoning, e.g. "What does the sign say?". It is a standard benchmark for evaluating text understanding within visual scenes. Listed metrics: ANLS and exact-match accuracy.
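The two metric names in the description can be made concrete with a minimal sketch. It assumes the commonly used ANLS definition (normalized Levenshtein similarity, zeroed when the normalized distance reaches 0.5, best match over the reference answers) and a simple lowercase and whitespace normalization. The function names are illustrative, and the exact protocol behind each leaderboard score is not stated on this page.

```python
from typing import Dict, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    # Assumption: compare case-insensitively with collapsed whitespace.
    return " ".join(text.lower().split())


def anls_single(prediction: str, references: Sequence[str],
                threshold: float = 0.5) -> float:
    """Best similarity against any reference; 0 when too far from all of them."""
    pred = normalize(prediction)
    best = 0.0
    for ref in references:
        gt = normalize(ref)
        longest = max(len(pred), len(gt))
        distance = levenshtein(pred, gt) / longest if longest else 0.0
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def exact_match_single(prediction: str, references: Sequence[str]) -> float:
    pred = normalize(prediction)
    return 1.0 if any(pred == normalize(ref) for ref in references) else 0.0


def evaluate(predictions: Sequence[str],
             references: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Average both metrics over all questions, reported on a 0-100 scale."""
    n = len(predictions)
    pairs = list(zip(predictions, references))
    anls = sum(anls_single(p, r) for p, r in pairs) / n
    exact = sum(exact_match_single(p, r) for p, r in pairs) / n
    return {"anls": 100.0 * anls, "exact_match": 100.0 * exact}
```

ANLS gives partial credit for near-miss OCR readings (one wrong character in a four-letter word still scores 0.75), while exact match gives none, so the two numbers can diverge noticeably on text-heavy questions.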

§ 01 · Leaderboard

Best published scores.

23 results indexed across 1 metric. Shaded row marks current SOTA; ties broken by submission date.


Primary metric: accuracy · higher is better · 23 rows
| # | Model | Org | Submitted | Paper / code | accuracy |
|---|---|---|---|---|---|
| 01 | Ovis2.5-9B | | Aug 2025 | Ovis2.5 Technical Report · code | 91.20 |
| 02 | Qwen2-VL 72B | Open · Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 85.50 |
| 03 | Qwen2.5-VL 72B | Open · Alibaba | Feb 2025 | Qwen2.5-VL Technical Report | 85.50 |
| 04 | Qwen2-VL 72B | Open · Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… | 84.90 |
| 05 | Llama 3-V (405B) | | Jul 2024 | The Llama 3 Herd of Models · code | 84.80 |
| 06 | InternVL2-76B | Open · Shanghai AI Lab | Apr 2024 | InternVL: Scaling up Vision Foundation Models and Aligni… | 84.40 |
| 07 | Qwen2-VL 7B | Alibaba | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 84.30 |
| 08 | MiniCPM-o 4.5-Instruct | | Apr 2026 | MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal … · code | 83.80 |
| 09 | Qwen2.5-VL-72B | | Feb 2025 | Qwen2.5-VL Technical Report · code | 83.50 |
| 10 | Llama 3.2 Vision 90B | Open · Meta | Jul 2024 | The Llama 3 Herd of Models | 83.40 |
| 11 | BLIP3-o (8B) | | May 2025 | BLIP3-o: A Family of Fully Open Unified Multimodal Model… · code | 83.10 |
| 12 | Gemini 1.5 Pro | API · Google | Feb 2024 | Gemini 1.5: Unlocking multimodal understanding across mi… | 82.20 |
| 13 | Aria | | Oct 2024 | Aria: An Open Multimodal Native Mixture-of-Experts Model · code | 81.10 |
| 14 | Qianfan-OCR | Open · Baidu Qianfan | Mar 2026 | Qianfan-OCR: A Unified End-to-End Model for Document Int… · code | 80 |
| 15 | Qwen2-VL-2B | | Sep 2024 | Qwen2-VL: Enhancing Vision-Language Model's Perception o… · code | 79.70 |
| 16 | GPT-4V | | Mar 2023 | GPT-4 Technical Report | 78 |
| 17 | GPT-4o | API · OpenAI | Oct 2024 | SWE-bench Verified | 77.40 |
| 18 | MiniCPM-Llama3-V 2.5 | | Aug 2024 | MiniCPM-V: A GPT-4V Level MLLM on Your Phone · code | 76.60 |
| 19 | ZAYA1-VL-8B | | May 2026 | pwc-dump · code | 74.40 |
| 20 | LLaVA-1.5 | Open · UW-Madison / Microsoft | Oct 2023 | Improved Baselines with Visual Instruction Tuning (LLaVA… | 61.30 |
| 21 | AIMv2 ViT-3B/14 + Llama 3.0 8B | | Nov 2024 | Multimodal Autoregressive Pre-training of Large Vision E… · code | 58.20 |
| 22 | BLIP-2 | Open · Salesforce | Jan 2023 | BLIP-2: Bootstrapping Language-Image Pre-training with F… | 42.50 |
| 23 | Flamingo (32-shot) | | Apr 2022 | Flamingo: a Visual Language Model for Few-Shot Learning · code | 37.90 |
Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.
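The ordering rule (sort by score, break ties by submission date) can be sketched as a two-part sort key. This is an illustration, not the site's implementation: it assumes the earlier submission wins a tie, which matches rows 02 and 03, and the `Entry` type is hypothetical. The table gives only month and year, so the day values below are placeholders.

```python
from datetime import date
from typing import List, NamedTuple


class Entry(NamedTuple):
    model: str
    submitted: date   # day of month is a placeholder; the table lists month only
    accuracy: float


def rank(entries: List[Entry]) -> List[Entry]:
    # Highest accuracy first; on equal accuracy the earlier submission ranks
    # higher (assumption inferred from rows 02 and 03).
    return sorted(entries, key=lambda e: (-e.accuracy, e.submitted))


entries = [
    Entry("Qwen2.5-VL 72B", date(2025, 2, 1), 85.50),
    Entry("Ovis2.5-9B", date(2025, 8, 1), 91.20),
    Entry("Qwen2-VL 72B", date(2024, 9, 1), 85.50),
]

for position, entry in enumerate(rank(entries), start=1):
    print(f"{position:02d} {entry.model} {entry.accuracy:.2f}")
```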
§ 03 · Progress

8 steps of state of the art.

Each row below marks a model that broke the previous record on accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.

SOTA line · accuracy
  1. Apr 29, 2022 · Flamingo (32-shot) · 37.90
  2. Jan 30, 2023 · BLIP-2 · Salesforce · 42.50
  3. Mar 15, 2023 · GPT-4V · 78
  4. Feb 15, 2024 · Gemini 1.5 Pro · Google · 82.20
  5. Apr 25, 2024 · InternVL2-76B · Shanghai AI Lab · 84.40
  6. Jul 31, 2024 · Llama 3-V (405B) · 84.80
  7. Sep 18, 2024 · Qwen2-VL 72B · Alibaba · 85.50
  8. Aug 15, 2025 · Ovis2.5-9B · 91.20
Fig 3 · SOTA-setting models only. 8 entries span Apr 2022 to Aug 2025.
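The selection rule behind this list (keep only entries that beat the previous record) amounts to a running maximum over results in date order. A minimal sketch, assuming a result must strictly exceed the prior best to count as a new step; the function name is illustrative, and the LLaVA-1.5 day of month is a placeholder because the leaderboard lists only Oct 2023.

```python
from datetime import date
from typing import Iterable, List, Tuple

Result = Tuple[date, str, float]  # (release date, model, accuracy)


def sota_steps(results: Iterable[Result]) -> List[Result]:
    """Return, in date order, the results that set a new best score."""
    steps: List[Result] = []
    best = float("-inf")
    for when, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:  # assumption: a tie does not count as a new record
            steps.append((when, model, score))
            best = score
    return steps


results = [
    (date(2024, 2, 15), "Gemini 1.5 Pro", 82.20),
    (date(2022, 4, 29), "Flamingo (32-shot)", 37.90),
    (date(2023, 10, 1), "LLaVA-1.5", 61.30),   # below the GPT-4V record, so skipped
    (date(2023, 3, 15), "GPT-4V", 78.0),
    (date(2023, 1, 30), "BLIP-2", 42.50),
]

for when, model, score in sota_steps(results):
    print(when.isoformat(), model, score)
```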
§ 04 · Literature

17 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.

What a submission needs
  • 01 · A public checkpoint or API endpoint
  • 02 · A reproduction script with frozen commit + seed
  • 03 · Declared evaluation environment (Python, deps)
  • 04 · One row per metric declared by this dataset
  • 05 · A contact so we can follow up on discrepancies
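Before submitting, the five requirements above could be checked locally with something like the sketch below. Everything here is hypothetical: the `Submission` fields, the `missing_items` helper, and the example values are illustrations of the checklist, not the site's actual submission format, which is described in the submission guide.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Submission:
    # Hypothetical field names mirroring the five listed requirements.
    model_location: str            # 01: public checkpoint URL or API endpoint
    repro_commit: str              # 02: frozen commit of the reproduction script
    seed: Optional[int]            # 02: random seed used for the run
    environment: Dict[str, str]    # 03: e.g. {"python": "3.11", "torch": "2.3"}
    scores: Dict[str, float]       # 04: one score per declared metric
    contact: str                   # 05: where to send follow-up questions


def missing_items(sub: Submission,
                  declared_metrics: Sequence[str] = ("accuracy",)) -> List[str]:
    """List which of the five requirements are not yet satisfied."""
    problems: List[str] = []
    if not sub.model_location:
        problems.append("01 checkpoint or API endpoint")
    if not sub.repro_commit or sub.seed is None:
        problems.append("02 frozen commit + seed")
    if "python" not in sub.environment:
        problems.append("03 evaluation environment")
    absent = [m for m in declared_metrics if m not in sub.scores]
    if absent:
        problems.append("04 missing metrics: " + ", ".join(absent))
    if not sub.contact:
        problems.append("05 contact")
    return problems


draft = Submission(
    model_location="https://example.org/checkpoint",  # placeholder URL
    repro_commit="abc1234",                           # placeholder commit
    seed=None,
    environment={"python": "3.11"},
    scores={},
    contact="author@example.org",
)
print(missing_items(draft))
```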