Vision-language models, read honestly
Updated March 2026
Guide · Vision language

What multimodal models actually do, in 2026.

Six headline VLMs — GPT-5 Vision, Claude Opus 4, Gemini 2.5 Pro, Qwen2.5-VL, InternVL2.5, LLaVA-OneVision — scored on the benchmarks that matter, with the failure modes their evaluations quietly reveal.

Scores are the authors' and leaderboards' own, preserved verbatim. Where a benchmark is known to be saturated or contaminated, that is stated in the row.

§ 01 · Scope

Multimodal is not just image + text.

Six capability tracks, at very different stages of maturity.

The word “multimodal” has expanded far beyond feeding a single image to a language model. In 2026 there are six distinguishable capability tracks, and conflating them is the main reason a model chosen for one job disappoints on another.

Track · Maturity · Where it stands
Vision + Language · Mature · Single image understanding, OCR, chart reading, visual QA. The most developed modality. Near-human on documents.
Multi-Image Reasoning · Maturing · Comparing multiple images, finding differences, understanding image sequences. Models handle 5–10 images well, degrade beyond that.
Video Understanding · Active research · Temporal reasoning, event detection, long-form comprehension. Gemini leads. Most models still sample frames rather than process motion.
Audio + Vision · Emerging · Joint audio-visual reasoning. Gemini 2.5 Pro processes video with audio natively. Others require separate ASR pipelines.
Interleaved Documents · Maturing · PDFs with mixed text, tables, figures, and charts. Critical for enterprise. Qwen2.5-VL and Claude Opus 4 lead here.
Spatial / 3D · Early · Depth, 3D layouts, physical spatial relationships from 2D images. All models struggle. Active research frontier.
§ 02 · Benchmarks

The six questions we actually ask.

Each benchmark measures something different. No single number tells the whole story.

MMMU (Massive Multi-discipline Multimodal Understanding)
College-level reasoning across 30 subjects requiring both image understanding and domain knowledge
Examples · Art history analysis, circuit diagram solving, medical image interpretation
Status · Far from saturated. Human expert: ~88.6%. Best model: ~74.8%.
MathVista (Mathematical reasoning in Visual contexts)
Math problem solving from charts, geometry diagrams, scientific figures, and word problems with visual elements
Examples · Reading bar chart values and computing percentages, solving geometry from diagrams
Status · Active. Human baseline: ~60% (surprisingly low). The best models now exceed it.
RealWorldQA (Real World Question Answering)
Practical visual understanding from real-world photos — spatial reasoning, navigation, everyday comprehension
Examples · Reading street signs, estimating distances, understanding physical layouts
Status · Active. Tests practical intelligence that benchmarks often miss.
ChartQA (Chart Question Answering)
Extracting data and answering questions about charts and plots
Examples · Finding max values in bar charts, computing trends from line graphs, reading pie chart segments
Status · Approaching saturation. Best models at ~88-89%. Human: ~92%.
DocVQA (Document Visual Question Answering)
Extracting information from scanned documents, forms, receipts, and reports
Examples · Reading values from invoices, finding dates in contracts, parsing table entries
Status · Near saturation. Open-source models (Qwen2.5-VL) hit 96.4%. Human: ~98%.
Video-MME (Video Multi-Modal Evaluation)
Understanding video content across short (< 2min), medium (4-15min), and long (30-60min) clips
Examples · Summarizing events, temporal ordering, cause-effect reasoning across scenes
Status · Very active. Best model: ~75.2% (Gemini). Huge room for improvement.
§ 03 · Comparison

Where each model wins.

Accuracy percentages on standard splits. Column leaders: GPT-5 Vision on MMMU, MathVista, and ChartQA; Claude Opus 4 on RealWorldQA; Qwen2.5-VL-72B on DocVQA; Gemini 2.5 Pro on Video-MME.

Model · MMMU · MathVista · RealWorldQA · ChartQA · DocVQA · Video-MME · Type
GPT-5 Vision · OpenAI · Jan 2026
74.8 · 67.2 · 72.4 · 88.6 · 95.1 · 68.3 · API
Claude Opus 4 · Anthropic · Mar 2026
72.1 · 65.8 · 74.6 · 86.9 · 94.8 · 64.7 · API
Gemini 2.5 Pro · Google · Mar 2025
72.7 · 63.9 · 70.8 · 88.2 · 93.4 · 75.2 · API
Qwen2.5-VL-72B · Alibaba · Jan 2025
70.2 · 61.4 · 68.7 · 86.1 · 96.4 · 61.8 · Open source
InternVL2.5-78B · Shanghai AI Lab · Dec 2024
70.1 · 62.8 · 67.5 · 85.4 · 94.9 · 60.2 · Open source
LLaVA-OneVision-72B · LLaVA Team / ByteDance · Aug 2024
62.4 · 57.6 · 64.2 · 80 · 91.3 · 58.4 · Open source
Fig 1 · Scores from published papers, official model cards and the OpenCompass leaderboard.
§ 04 · Dossier

Each model's honest profile.

Strengths and weaknesses as the authors and independent reviewers report them.

GPT-5 Vision · OpenAI · API

Strengths
  • Most consistent performer across all modalities
  • Strong spatial reasoning and counting
  • Excellent chart and diagram interpretation
  • Native tool use with vision input
Weaknesses
  • Expensive at scale ($2.50/M input tokens for images)
  • Occasional hallucination on fine-grained text in images
  • Video limited to sampled frames, not true temporal modeling
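For orientation, a minimal sketch of a single-image request through the OpenAI Python SDK's chat completions interface. The model identifier gpt-5-vision and the example URL are placeholders, not confirmed names; check the current model listing before running.

```python
# Minimal sketch: one image + one question via an OpenAI-style
# chat.completions endpoint. The model name below is a placeholder;
# use whatever identifier your account actually exposes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5-vision",  # placeholder model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the y-axis maximum in this chart?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```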

Claude Opus 4 · Anthropic · API

Strengths
  • Best-in-class instruction following with visual context
  • Precise bounding box and region understanding
  • Strongest refusal of misleading visual prompts
  • Multi-page document reasoning with 200K context
Weaknesses
  • Highest cost per query ($15/M input tokens)
  • Slower inference than competitors
  • Video understanding behind GPT-5 and Gemini
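A comparable sketch for the Anthropic Messages API, which takes images as base64-encoded content blocks. The model string claude-opus-4 and the file name are placeholders; substitute whatever identifier your account exposes.

```python
# Minimal sketch: base64-encoded image + instruction via the Anthropic
# Messages API. The model string is a placeholder; pick the current
# Opus identifier from the model listing.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("invoice_page_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4",  # placeholder model id
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
                },
                {"type": "text", "text": "List every line item and its amount as JSON."},
            ],
        }
    ],
)
print(message.content[0].text)
```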

Gemini 2.5 Pro · Google · API

Strengths
  • Best video understanding — processes up to 1 hour natively
  • Interleaved audio + video + text in single query
  • 1M token context window for long documents
  • Competitive pricing ($1.25/M input)
Weaknesses
  • Spatial reasoning slightly behind GPT-5
  • Inconsistent on complex table extraction
  • Occasional refusal on benign medical/scientific images
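A minimal sketch of Gemini's native video path using the google-generativeai SDK's File API (the newer google-genai client exposes an equivalent flow). The model id gemini-2.5-pro, the file name, and the question are placeholders.

```python
# Minimal sketch: upload a video file and ask a temporal question.
# Model id and file name are placeholders, not confirmed values.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

video = genai.upload_file("factory_floor.mp4")
while video.state.name == "PROCESSING":  # wait until the upload is processed
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")  # placeholder model id
response = model.generate_content(
    [video, "At what timestamp does the forklift first enter the frame?"]
)
print(response.text)
```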

Qwen2.5-VL-72B · Alibaba · Open source

Strengths
  • Highest DocVQA score of any model (96.4%)
  • Open source (Apache 2.0) — full data privacy
  • HTML-based document parsing with bounding boxes
  • Efficient 7B variant rivals GPT-4o-mini
Weaknesses
  • Requires A100/H100 GPUs for 72B inference
  • Lower MMMU than API models
  • Video understanding significantly behind Gemini
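A minimal local-inference sketch following the published Qwen2.5-VL usage pattern with Hugging Face transformers, shown with the 7B variant so it fits on a single large GPU. It assumes a transformers release that ships the Qwen2.5-VL classes plus the qwen-vl-utils helper package; the image path and prompt are placeholders.

```python
# Local Qwen2.5-VL inference sketch (7B variant) via transformers.
# Assumes: recent transformers with Qwen2.5-VL support, qwen-vl-utils installed.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # 7B fits a single large consumer GPU
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///data/scan_0042.png"},  # placeholder path
        {"type": "text", "text": "Extract the invoice number and total amount."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```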

InternVL2.5-78B · Shanghai AI Lab · Open source

Strengths
  • Strong all-around open-source VLM
  • Dynamic resolution — adapts tile count to image complexity
  • Multilingual vision-language support (8+ languages)
  • Active community and rapid iteration
Weaknesses
  • Large model footprint (78B parameters)
  • Slightly behind Qwen2.5-VL on documents
  • Training data transparency concerns

LLaVA-OneVision-72B · LLaVA Team / ByteDance · Open source

Strengths
  • Unified architecture for image, multi-image, and video
  • Efficient training recipe — strong results from academic compute
  • Single-image, multi-image, and video from one checkpoint
  • Well-documented, easy to fine-tune
Weaknesses
  • Benchmark scores trail frontier models by 8-12 points
  • Older architecture showing its age in 2026
  • Less competitive on document understanding
§ 05 · Failure modes

How VLMs break, predictably.

Every model in the comparison fails in these same ways. Shipping a feature without a guard for them is a bug.

The most useful thing about a VLM is how it fails, and that is exactly what its paper's abstract rarely discusses. Below are the seven recurring failure modes shared by every model in this guide; the severity marker tracks how often each one causes production regressions rather than benchmark noise. A guard sketch for the counting and OCR modes follows the list.

Counting objects · high
Models consistently miscount objects in cluttered scenes. Ask "how many red cars?" in a parking lot photo and expect errors once the count exceeds ~7.
Spatial relationships · high
Left/right, above/below, and relative-position questions remain unreliable. Models may say object A is left of B when it is clearly right.
Fine-grained text in images · medium
Small text, watermarks, and text at angles get misread or hallucinated. License plates, serial numbers, and distant signage are particularly problematic.
Temporal reasoning in video · high
Most VLMs sample frames rather than processing true video. They miss motion, speed, and cause-effect that happens between sampled frames.
Multi-step visual reasoning · medium
Chains of visual inference (A implies B, B implies C from the image) degrade rapidly. Accuracy drops ~15% per reasoning step.
Negation and absence · medium
Asking "is there NOT a dog in this image?" or "what is missing from this scene?" triggers frequent errors. Models are biased toward confirming presence.
Hallucinated OCR · high
When text is partially visible or blurry, models confidently fabricate plausible-looking text rather than admitting uncertainty.
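A minimal guard sketch for the two modes that most often reach production, counting and OCR. It assumes a hypothetical ask_vlm(image, prompt) wrapper around whichever model you deploy, and treats disagreement or explicit abstention as a signal to escalate rather than trust a single confident answer; the vote count and abstention token are illustrative choices, not tuned values.

```python
# Guard sketch for the counting and hallucinated-OCR failure modes above.
# `ask_vlm(image, prompt)` is a hypothetical wrapper around whatever VLM
# you actually deploy; thresholds here are illustrative, not tuned.
from collections import Counter

def count_with_consensus(ask_vlm, image, thing: str, votes: int = 3) -> int | None:
    """Ask the same counting question several times; only trust a majority."""
    answers = []
    for _ in range(votes):
        reply = ask_vlm(image, f"How many {thing} are in this image? Answer with a number only.")
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits:
            answers.append(int(digits))
    if not answers:
        return None
    value, freq = Counter(answers).most_common(1)[0]
    return value if freq >= 2 else None  # no majority -> escalate to human review

def read_text_or_abstain(ask_vlm, image, field: str) -> str | None:
    """Force the model to admit uncertainty instead of fabricating OCR output."""
    reply = ask_vlm(
        image,
        f"Read the {field} exactly as printed. "
        "If any character is blurry or cut off, answer UNREADABLE.",
    )
    return None if "UNREADABLE" in reply.upper() else reply.strip()
```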
§ 06 · Video

Video is hard.

The gap between answering questions about a video and truly understanding temporal dynamics is still wide.

System · Video-MME · What it does
Gemini 2.5 Pro · 75.2 · Processes up to 1 hour of video natively with audio. The only model that does not rely purely on frame sampling.
GPT-5 Vision · 68.3 · Samples frames (see the sketch after this table). Works well for surveillance review and content moderation. Struggles with fast-action sports and temporal ordering.
Open-source cluster · 58–62 · LLaVA-OneVision, InternVL2.5, Qwen2.5-VL all cluster here. Usable for short clips. Long-video comprehension remains a significant gap vs. Gemini.
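Everything below Gemini in the table reduces video to still frames before the model ever sees it. A minimal sketch of that preprocessing step with OpenCV; the two-second interval and the clip name are illustrative, and the downstream VLM call is left out.

```python
# Frame-sampling sketch: most VLM video pipelines reduce a clip to a
# handful of still frames like this before querying the model.
# The 2-second interval is an illustrative choice, not a recommendation.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list:
    """Return evenly spaced frames (as BGR arrays) from a video file."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = sample_frames("delivery_dock.mp4")  # placeholder clip name
# Each frame is then encoded (e.g. JPEG + base64) and sent to the VLM as an
# ordered image sequence, which is exactly why motion between samples,
# speed, and fine temporal ordering are invisible to the model.
```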
Works reliably
  • Scene description and summarisation
  • Object identification across frames
  • Text / subtitle extraction from video
  • Action recognition (walking, running, cooking)
  • Content moderation and safety screening
Still unreliable
  • Precise temporal ordering of events
  • Counting occurrences of repeated actions
  • Understanding cause-and-effect chains
  • Speed and motion estimation
  • Multi-person interaction tracking
§ 07 · Distinction

Generation and understanding are different worlds.

Both are called multimodal; they are not the same capability.

Text-to-image generation (DALL·E 3, Midjourney v7, Stable Diffusion 3.5, Flux) and vision-language understanding (the models on this page) are fundamentally different capabilities, despite sharing the multimodal label.

Image generation
  • Converts text descriptions into pixel data
  • Diffusion-based architectures dominate
  • Evaluated by FID, aesthetic scores, human preference
  • Rapid commoditisation in 2025–2026
  • Text rendering in images still imperfect
Vision understanding (VLMs)
  • Converts visual data into language and reasoning
  • Transformer-based with vision encoders
  • Evaluated by MMMU, DocVQA, ChartQA, etc.
  • Still improving rapidly on hard reasoning tasks
  • The models covered in this guide

The convergence trend. Some models now merge the two: GPT-5 can generate and understand images; Gemini 2.5 Pro generates images natively. But the underlying architectures remain different, and specialisation still wins on benchmarks. A great generator is not a great understander, and vice versa.

§ 08 · Decision

Match the constraint.

Six common use cases, each with the primary and secondary pick and the one-line reason.

Use case · Pick · Why
Document extraction at scale · Qwen2.5-VL-72B (self-hosted) or Gemini 2.5 Pro (API) · Highest DocVQA scores. Qwen for privacy/cost. Gemini for ease.
Complex reasoning over images · GPT-5 Vision or Claude Opus 4 · Best MMMU scores. Claude for instruction adherence. GPT-5 for breadth.
Video analysis and surveillance · Gemini 2.5 Pro · Only model with native long-video processing. Nearly a 7-point Video-MME lead.
Budget-friendly prototyping · Gemini 2.5 Pro or Qwen2.5-VL-7B · Gemini at $1.25/M input. Qwen 7B runs on a single consumer GPU.
Privacy-critical / air-gapped · Qwen2.5-VL or InternVL2.5 · Open source. Deploy anywhere. No data leaves your infrastructure.
Multi-page document comprehension · Claude Opus 4 or Gemini 2.5 Pro · 200K and 1M context respectively. Both handle interleaved images well.