What the metric actually measures.
Visual question answering asks a model to read an image and answer a natural-language question about it. The archetypal output is a short string — “three”, “red”, “a golden retriever” — graded against human-written references using exact or soft matching.
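To make “soft matching” concrete, here is a minimal Python sketch of the commonly cited simplified VQAv2 rule, under which an answer earns full credit once at least three annotators gave it. The function names and the light normalisation (lowercasing, stripping punctuation and articles) are illustrative assumptions; the official evaluation script normalises more aggressively and averages over annotator subsets.

```python
import string

ARTICLES = {"a", "an", "the"}


def normalize(answer: str) -> str:
    """Lowercase, drop ASCII punctuation and articles, collapse whitespace."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    tokens = [tok for tok in answer.split() if tok not in ARTICLES]
    return " ".join(tokens)


def exact_match(prediction: str, references: list[str]) -> float:
    """1.0 if the normalised prediction equals any normalised reference."""
    pred = normalize(prediction)
    return float(any(pred == normalize(ref) for ref in references))


def soft_accuracy(prediction: str, references: list[str]) -> float:
    """Simplified VQA-style score: min(number of matching annotators / 3, 1)."""
    pred = normalize(prediction)
    matches = sum(pred == normalize(ref) for ref in references)
    return min(matches / 3.0, 1.0)
```

The difference shows up when annotators disagree: if only two of ten references say “red”, exact match against any reference still awards full credit, while the soft score awards two thirds.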
The classic VQAv2 and GQA benchmarks were designed in 2017–2019 for specialist models trained from scratch. Frontier VLMs trained on internet-scale image-text data now saturate them zero-shot: the human ceiling on VQAv2 is around 80%, and current frontier models sit at 84–87%. Once scores exceed the human ceiling, a two-point delta mostly reflects how well a model fits dataset quirks, not how good the model is.
The interesting evaluations in 2026 are MMMU (multi-discipline reasoning), MathVista (visual math), ChartQA, DocVQA and MMBench. Model rankings differ sharply from one of these leaderboards to the next, so no single number captures “best VLM”, and your own task should be your primary eval.
What the metric misses: calibration. A model that confidently invents an answer is strictly worse than one that says “I’m not sure”, yet plain accuracy scores both as zero. VLMs systematically over-predict counts and over-affirm presence (“is there a cat?” → “yes”, even with no cat in frame). Leaderboards don’t penalise that; your production eval should.
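One way a production eval could penalise confident errors is sketched below in Python. Everything here is a hypothetical illustration rather than an established metric: the `Item` record, the abstention phrase list, the scoring of +1 for correct, 0 for abstaining and a configurable penalty for a wrong answer, and the `false_yes_rate` probe for presence questions whose true answer is “no”.

```python
from dataclasses import dataclass

# Hypothetical phrases treated as abstentions; tune these for your own model.
ABSTAIN_PHRASES = ("i'm not sure", "i am not sure", "cannot tell", "unknown")


@dataclass
class Item:
    prediction: str
    reference: str


def is_abstention(prediction: str) -> bool:
    """True if the prediction contains one of the abstention phrases."""
    text = prediction.lower().replace("\u2019", "'")  # straighten curly apostrophes
    return any(phrase in text for phrase in ABSTAIN_PHRASES)


def penalised_score(items: list[Item], wrong_penalty: float = 1.0) -> float:
    """Mean of +1 (correct), 0 (abstained), -wrong_penalty (confidently wrong)."""
    if not items:
        raise ValueError("items must not be empty")
    total = 0.0
    for item in items:
        if is_abstention(item.prediction):
            continue  # abstaining neither earns nor costs anything
        if item.prediction.strip().lower() == item.reference.strip().lower():
            total += 1.0
        else:
            total -= wrong_penalty
    return total / len(items)


def false_yes_rate(predictions: list[str]) -> float:
    """Share of 'yes' answers on questions whose true answer is 'no'."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    yes_count = sum(p.strip().lower().startswith("yes") for p in predictions)
    return yes_count / len(predictions)
```

Setting `wrong_penalty=0.0` recovers plain accuracy, which makes it easy to see how much of a model's score depends on guessing. Feeding `false_yes_rate` only questions about objects absent from the image gives a direct read on over-affirmation.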