
Visual Question Answering

Visual question answering (VQA) is the original multimodal reasoning task: given an image and a natural-language question, produce the correct answer. The original VQA dataset (2015) defined the field and VQAv2 (2017) rebalanced it, while modern benchmarks such as GQA, OK-VQA, and TextVQA have pushed toward compositional reasoning, external knowledge, and OCR-dependent understanding. The task was largely "solved" in its classic form once multimodal LLMs arrived, with GPT-4V and Gemini saturating the standard benchmarks, but adversarial and compositional variants still expose systematic failures in spatial reasoning and counting. VQA's lasting contribution is establishing that vision-language models need more than pattern matching; they need genuine visual understanding.

6 datasets · 56 results · Canonical metric: accuracy

Canonical Benchmark

VQA v2.0

265K images with 1.1M questions, balanced to reduce the language biases found in v1.

Primary metric: accuracy
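
The VQA accuracy metric scores a prediction against the 10 human answers collected per question: an answer earns credit in proportion to how many annotators gave it, capped at full credit once three agree. Below is a simplified sketch of that rule; the official evaluation script additionally normalizes answer strings and averages over leave-one-out subsets of the annotators.

```python
# Simplified sketch of VQA accuracy: a prediction scores
# min(#matching human answers / 3, 1.0) against the 10 collected answers.
# The official script also normalizes strings (case, punctuation, number
# words) and averages over leave-one-out annotator subsets.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    matches = sum(answer == prediction for answer in human_answers)
    return min(matches / 3.0, 1.0)

# Example: three of the ten annotators answered "2", so the prediction gets full credit.
answers = ["2", "2", "2", "two", "3", "2 dogs", "two", "3", "2 or 3", "couple"]
print(vqa_accuracy("2", answers))  # 1.0
```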

Top 10

Leading models on VQA v2.0.

Rank  Model            Accuracy  Year  Source
1     Qwen2-VL 72B         87.6  2024  paper
2     InternVL2-76B        87.2  2024  paper
3     Gemini 1.5 Pro       86.5  2024  paper
4     PaLI-X 55B           86.1  2023  paper
5     NVLM-D 1.0 72B       85.4  2024  paper
6     NVLM-X 1.0 72B       85.2  2024  paper
7     NVLM-H 1.0 72B       85.2  2024  paper
8     VILA-1.5 40B         84.3  2024  paper
9     LLaVA-NeXT 34B       83.7  2024  paper
10    LLaVA-NeXT 13B       82.8  2024  paper

All datasets

6 datasets tracked for this task.


Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
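
A minimal sketch of what that looks like with the transformers pipeline is below; the checkpoint name and image path are illustrative placeholders, not something this page specifies.

```python
# Minimal sketch of VQA inference via the Hugging Face transformers pipeline.
# dandelin/vilt-b32-finetuned-vqa is an illustrative Hub checkpoint; any
# VQA-capable model can be substituted, and the image path is a placeholder.
from transformers import pipeline

vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
result = vqa(image="path/to/image.jpg", question="What color is the car?")
print(result)  # e.g. [{"score": 0.92, "answer": "red"}, ...]
```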
