Codesota · Registry · Multimodal · Visual question answering · 15 providers · verified 2026-04
§ 00 · Task

Visual question answering,
measured.

Ask a question about an image; receive an answer in text. Every frontier LLM is a vision-language model now, so the buyer question has shifted from “which VQA specialist?” to which VLM, on which axes, at what price.

Fifteen providers compared on cost per image, multi-image reasoning, video, OCR-in-image quality and license. Pricing normalised to standard-resolution images using each vendor’s own image-token formula.

§ 01 · Leaderboard

Fifteen providers, side by side.

Grouped: frontier · doc · open
| # | Provider / model | Tier | License | Cost / 1K img | Multi-img | Video | Resolution | OCR-in-img |
|---|---|---|---|---|---|---|---|---|
| 01 | OpenAI: GPT-5 vision · GPT-4o vision | Frontier | Proprietary API | $5–10 | | | Up to 2048×2048 | Strong |
| 02 | Anthropic: Claude Opus 4.7 · Sonnet 4.6 | Frontier | Proprietary API | $5–15 | | | Up to 1568px short edge | Strong |
| 03 | Google: Gemini 3 Pro · Gemini 3 Ultra | Frontier | Proprietary API | $2–8 | | | Up to 3072×3072 | Strong |
| 04 | xAI: Grok 4 Vision | Frontier | Proprietary API | ~$5–10 | | | Standard | Decent |
| 05 | Mistral: Pixtral Large · Pixtral 12B | Frontier | Hybrid | ~$3–8 | | | Native arbitrary | Strong |
| 06 | Amazon Web Services: Textract Queries · AnalyzeDocument | Doc | Proprietary API | ~$30–65 | | | PDF / image, document-optimised | Strong |
| 07 | Microsoft Azure: Document Intelligence · Custom Query | Doc | Proprietary API | ~$10–50 | | | PDF / image, document-optimised | Strong |
| 08 | Google Cloud: Document AI · Custom Extractor | Doc | Proprietary API | ~$30–65 | | | PDF / image, document-optimised | Strong |
| 09 | Reducto: Reducto | Doc | Proprietary API | ~$5–20 | | | PDF / image, document-optimised | Strong |
| 10 | Mathpix: Mathpix · Convert / Query | Doc | Proprietary API | ~$5–15 | | | PDF / image | Strong (math) |
| 11 | Alibaba (open): Qwen2.5-VL-72B · Qwen3-VL | Open | Open weights | Self-host | | | Native arbitrary (no resize) | Strong |
| 12 | Shanghai AI Lab (open): InternVL 3 (38B / 78B) | Open | Open weights | Self-host | | | Up to 4K native | Strong |
| 13 | Allen Institute (AI2): Molmo 72B · Molmo-D | Open | Open weights | Self-host | | | Standard | Decent |
| 14 | ByteDance / academic (open): LLaVA-OneVision (7B / 72B) | Open | Open weights | Self-host | | | Standard | Decent |
| 15 | DeepSeek (open): DeepSeek-VL2 | Open | Open weights | Self-host | | | Native arbitrary | Strong |
Fig 01 · Shaded row marks the multi-image reasoning leader; frontier VLMs otherwise sit within 1–3 points of each other on MMMU and MMBench. Pricing is token-derived and scales with resolution.
§ 02 · The task

What the metric actually measures.

Visual question answering asks a model to read an image and answer a natural-language question about it. The archetypal output is a short string — “three”, “red”, “a golden retriever” — graded against human-written references using exact or soft matching.

The classic VQAv2 and GQA benchmarks were designed in 2017–2019 for specialist models trained from scratch. Frontier VLMs trained on internet-scale image-text data now saturate them zero-shot: the human ceiling on VQAv2 is around 80%, and current frontier models sit at 84–87%. Past that point a two-point delta reflects dataset quirks, not model quality.

The interesting evaluations in 2026 are MMMU (multi-discipline reasoning), MathVista (visual math), ChartQA, DocVQA and MMBench. Leaderboards on each of these diverge sharply — no single number captures “best VLM”, and your own task should be your primary eval.

What the metric misses: calibration. A model that confidently invents an answer is strictly worse than one that says “I’m not sure.” VLMs systematically over-predict counts and over-affirm presence (“is there a cat?” “yes”, even with no cat in frame). Leaderboards don’t penalise that; your production eval should.
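One way a production eval could penalise confident wrong answers is sketched below. It is a minimal illustration, not a method this page or any leaderboard uses: the `Prediction` type, the `wrong_penalty` weight and both function names are hypothetical, and answers are compared with a simple case-insensitive exact match.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Prediction:
    answer: Optional[str]  # None means the model abstained ("I'm not sure")
    reference: str         # ground-truth answer


def _norm(text: str) -> str:
    return text.strip().lower()


def penalised_score(preds: List[Prediction], wrong_penalty: float = 1.0) -> float:
    """Mean score: +1 for correct, 0 for abstaining, -wrong_penalty for wrong."""
    total = 0.0
    for p in preds:
        if p.answer is None:
            continue  # abstention neither gains nor loses
        if _norm(p.answer) == _norm(p.reference):
            total += 1.0
        else:
            total -= wrong_penalty
    return total / len(preds)


def false_affirmation_rate(preds: List[Prediction]) -> float:
    """Share of questions whose reference is 'no' that the model answered 'yes'."""
    negatives = [p for p in preds if _norm(p.reference) == "no"]
    if not negatives:
        return 0.0
    wrong_yes = sum(
        1 for p in negatives if p.answer is not None and _norm(p.answer) == "yes"
    )
    return wrong_yes / len(negatives)


preds = [
    Prediction("yes", "no"),   # invented a cat
    Prediction(None, "no"),    # abstained
    Prediction("no", "no"),    # correct
    Prediction("yes", "yes"),  # correct
]
print(penalised_score(preds))         # 0.25
print(false_affirmation_rate(preds))  # 1 of 3 negatives wrongly affirmed
```

Under plain accuracy the abstention and the invented answer both count as misses; here the invented answer costs more, which is the behaviour the paragraph above argues for.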

§ 03 · Benchmarks

The canonical seven, graded in public.

Useful for academic comparison and open-weights training. Frontier API providers don’t report against these consistently — treat as historical context.
| # | Benchmark | Scale | Year | What it measures | Link |
|---|---|---|---|---|---|
| 01 | MMMU | 11.5K questions · 30 disciplines · college-level | 2024 | Massive Multi-discipline Multimodal Understanding. The reference frontier benchmark: questions span art, business, medicine, science, technology. Hard for humans (88% expert ceiling); current SOTA ~86%. | leaderboard (on Codesota) |
| 02 | MMMU-Pro | Vision-only · 10 answer choices | 2024 | Harder MMMU variant. Vision-only questions (no text shortcuts) and ten answer choices instead of four. Current frontier sits at ~82% on Gemini 3.1 Pro Preview. | leaderboard (on Codesota) |
| 03 | MathVista | 6,141 questions · math + visual reasoning | 2024 | Tests mathematical reasoning over diagrams, geometry, and charts. The discriminator between a captioning model dressed up as a VLM and a model that actually reasons over visual structure. | page |
| 04 | ChartQA | 32K Q&A pairs over 21K charts | 2022 | Real charts from Pew, OECD, Statista. Tests chart reading, value extraction, comparative reasoning. Frontier VLMs at 85–90%; Molmo / older open-weights still in the 60s. | page |
| 05 | DocVQA | 50K questions · 12K document images | 2021 | Industry documents (invoices, forms, reports). The standard for document VQA. AWS Textract Query / Document AI / Reducto are all evaluated here. | page |
| 06 | MMBench | 3,217 multiple-choice questions · 20 abilities | 2023 | Rounded capability suite covering perception, reasoning, OCR, knowledge. The default “am I generally a competent VLM” eval; most papers report it. | leaderboard (on Codesota) |
| 07 | VQAv2 | 1.1M questions · COCO images | 2017 | The canonical VQA dataset. Saturated. Still reported for historical comparability; treat anything above 80% as ceiling-pinned. | leaderboard (on Codesota) |
§ 04 · Lineage

How VQA evolved.


Community attention doesn’t stay on one benchmark. It migrates — from VQA to VQAv2 when language priors made the original too easy; from VQAv2 to MMMU when balanced pairs saturated above 85%; from MMMU to MMMU-Pro when vision-only questions fixed MMMU’s text shortcuts. Meanwhile, specialised branches (GQA, OK-VQA, TextVQA, ScienceQA) peeled off to test capabilities the mainline benchmarks miss.

VQA (2015) → VQAv2 (2017) → MMMU (2023) → MMMU-Pro (2024)

Attention path only. The full graph at /lineage/vqa also shows the five specialised branches (GQA, OK-VQA, A-OKVQA, TextVQA, ScienceQA) and carries live SOTA scores pulled from the registry.

§ 05 · Methodology

What VQA-accuracy means here.

The original VQAv2 metric gives full credit when at least three of ten human annotators produced the same answer string (after light normalisation). It is forgiving of synonyms by construction — “sofa” and “couch” both score — and unforgiving of extra words.
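The rule above is commonly written as min(matches / 3, 1). The sketch below implements that simplified form; the official evaluation script additionally averages over leave-one-out subsets of the ten annotators and applies a fuller normalisation, so the `normalise` helper here is an assumed stand-in, not the reference implementation.

```python
import re
from typing import Sequence

_ARTICLES = {"a", "an", "the"}


def normalise(answer: str) -> str:
    """Assumed light normalisation: lowercase, drop punctuation and articles."""
    answer = re.sub(r"[^\w\s]", "", answer.lower())
    return " ".join(t for t in answer.split() if t not in _ARTICLES)


def vqa_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """Simplified VQA accuracy: full credit once three annotators agree."""
    pred = normalise(prediction)
    matches = sum(1 for h in human_answers if normalise(h) == pred)
    return min(matches / 3.0, 1.0)


humans = ["couch"] * 6 + ["sofa"] * 3 + ["bench"]
print(vqa_accuracy("sofa", humans))        # 1.0, three annotators said "sofa"
print(vqa_accuracy("The couch.", humans))  # 1.0 after normalisation
print(vqa_accuracy("bench", humans))       # one annotator: a third of full credit
print(vqa_accuracy("a red sofa", humans))  # 0.0, the extra word breaks the match
```

The last two calls show the two properties the paragraph describes: synonyms score only because annotators happened to supply them, and any extra word drops the match to zero.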

Newer suites (MMMU, MMBench) use multiple-choice or short-answer formats evaluated by exact match, which eliminates the ten-annotator quirk but introduces format-sensitivity of its own. DocVQA uses ANLS (Average Normalised Levenshtein Similarity) to tolerate OCR-style near-misses on long strings.
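A minimal sketch of ANLS as it is usually defined: for each question, take the best similarity (one minus the length-normalised edit distance) across the reference answers, zero it out when the normalised distance reaches a threshold of 0.5, and average over questions. Lowercasing and whitespace stripping are assumptions of this sketch, and the function names are illustrative.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,
) -> float:
    """Average Normalised Levenshtein Similarity over a set of questions."""
    scores = []
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            # Near-misses keep partial credit; anything too far off scores 0.
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0


# One OCR-style slip in a 12-character string keeps most of the credit.
print(anls(["inv0ice 1234"], [["invoice 1234"]]))
# A completely different string falls past the threshold.
print(anls(["abc"], [["xyz"]]))  # 0.0
```

The threshold is what separates an OCR near-miss from a wrong answer: without it, two unrelated strings of similar length would still earn partial credit.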

Pricing figures in § 01 are token-derived: each provider prices images as a function of resolution and detail flag, then bills tokens at the model’s text rate. Numbers above assume standard-resolution input; high-detail can cost 2–4× more.
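The arithmetic behind a token-derived price can be sketched as follows. Every number here is a hypothetical placeholder: the tile size, tokens per tile, base tokens and the per-million-token rate are chosen only to illustrate the calculation, and each vendor's real formula should be taken from its own pricing documentation.

```python
import math


def image_tokens(
    width: int,
    height: int,
    tile: int = 512,             # hypothetical tile edge in pixels
    tokens_per_tile: int = 200,  # hypothetical per-tile token charge
    base_tokens: int = 100,      # hypothetical fixed per-image charge
) -> int:
    """Tokens billed for one image under an assumed tile-based formula."""
    tiles = math.ceil(width / tile) * math.ceil(height / tile)
    return base_tokens + tiles * tokens_per_tile


def cost_per_1k_images(tokens_per_image: int, usd_per_million_tokens: float) -> float:
    """Cost in USD of 1,000 images billed at the model's text-token rate."""
    return 1000 * tokens_per_image * usd_per_million_tokens / 1_000_000


standard = image_tokens(1024, 1024)   # 4 tiles  -> 900 tokens
detailed = image_tokens(2048, 2048)   # 16 tiles -> 3300 tokens
print(cost_per_1k_images(standard, 5.0))  # 4.5
print(cost_per_1k_images(detailed, 5.0))  # 16.5
```

With these placeholder values, doubling each edge raises the bill by roughly 3.7×, which is the kind of resolution sensitivity the high-detail caveat refers to.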

§ 06 · Related

Where to read next.

Frontier LLM: LLM-specific leaderboard with denser, faster-moving data.

Methodology: How we grade benchmarks, reproduce runs, and record retractions.
