Vision-language models, read honestly
Updated March 2026
Guide · Vision language

What multimodal models actually do, in 2026.

Six headline VLMs — GPT-5 Vision, Claude Opus 4, Gemini 2.5 Pro, Qwen2.5-VL, InternVL2.5, LLaVA-OneVision — scored on the benchmarks that matter, with the failure modes their evaluations quietly reveal.

Scores are the authors' and leaderboards' own, preserved verbatim. Where a benchmark is known to be saturated or contaminated, that is stated in the row.

§ 01 · Scope

Multimodal is not just image + text.

Six capability tracks, at very different stages of maturity.

The word “multimodal” has expanded far beyond feeding a single image to a language model. In 2026 there are six distinguishable capability tracks, and conflating them is the main reason a model chosen for one job disappoints on another.

Track · Maturity · Where it stands
Vision + Language · Mature · Single image understanding, OCR, chart reading, visual QA. The most developed modality. Near-human on documents.
Multi-Image Reasoning · Maturing · Comparing multiple images, finding differences, understanding image sequences. Models handle 5–10 images well, degrade beyond that.
Video Understanding · Active research · Temporal reasoning, event detection, long-form comprehension. Gemini leads. Most models still sample frames rather than process motion.
Audio + Vision · Emerging · Joint audio-visual reasoning. Gemini 2.5 Pro processes video with audio natively. Others require separate ASR pipelines.
Interleaved Documents · Maturing · PDFs with mixed text, tables, figures, and charts. Critical for enterprise. Qwen2.5-VL and Claude Opus 4 lead here.
Spatial / 3D · Early · Depth, 3D layouts, physical spatial relationships from 2D images. All models struggle. Active research frontier.
§ 02 · Benchmarks

The six questions we actually ask.

Each benchmark measures something different. No single number tells the whole story.

MMMU (Massive Multi-discipline Multimodal Understanding)
College-level reasoning across 30 subjects requiring both image understanding and domain knowledge
Examples · Art history analysis, circuit diagram solving, medical image interpretation
Status · Far from saturated. Human expert: ~88.6%. Best model: ~74.8%.
MathVista (Mathematical reasoning in Visual contexts)
Math problem solving from charts, geometry diagrams, scientific figures, and word problems with visual elements
Examples · Reading bar chart values and computing percentages, solving geometry from diagrams
Status · Active. Human baseline: ~60% (surprisingly low). The best models now exceed it.
RealWorldQA (Real World Question Answering)
Practical visual understanding from real-world photos — spatial reasoning, navigation, everyday comprehension
Examples · Reading street signs, estimating distances, understanding physical layouts
Status · Active. Tests practical intelligence that benchmarks often miss.
ChartQA (Chart Question Answering)
Extracting data and answering questions about charts and plots
Examples · Finding max values in bar charts, computing trends from line graphs, reading pie chart segments
Status · Approaching saturation. Best models at ~88-89%. Human: ~92%.
DocVQA (Document Visual Question Answering)
Extracting information from scanned documents, forms, receipts, and reports
Examples · Reading values from invoices, finding dates in contracts, parsing table entries
Status · Near saturation. Open-source models (Qwen2.5-VL) hit 96.4%. Human: ~98%.
Video-MME (Video Multi-Modal Evaluation)
Understanding video content across short (< 2min), medium (4-15min), and long (30-60min) clips
Examples · Summarizing events, temporal ordering, cause-effect reasoning across scenes
Status · Very active. Best model: ~75.2% (Gemini). Huge room for improvement.
§ 03 · Comparison

Where each model wins.

Accuracy percentages on standard splits. Column leaders: GPT-5 Vision on MMMU, MathVista, and ChartQA; Claude Opus 4 on RealWorldQA; Qwen2.5-VL-72B on DocVQA; Gemini 2.5 Pro on Video-MME.

Model · MMMU · MathVista · RealWorldQA · ChartQA · DocVQA · Video-MME · Type
GPT-5 Vision · OpenAI · Jan 2026
74.8 · 67.2 · 72.4 · 88.6 · 95.1 · 68.3 · API
Claude Opus 4 · Anthropic · Mar 2026
72.1 · 65.8 · 74.6 · 86.9 · 94.8 · 64.7 · API
Gemini 2.5 Pro · Google · Mar 2025
72.7 · 63.9 · 70.8 · 88.2 · 93.4 · 75.2 · API
Qwen2.5-VL-72B · Alibaba · Jan 2025
70.2 · 61.4 · 68.7 · 86.1 · 96.4 · 61.8 · Open source
InternVL2.5-78B · Shanghai AI Lab · Dec 2024
70.1 · 62.8 · 67.5 · 85.4 · 94.9 · 60.2 · Open source
LLaVA-OneVision-72B · LLaVA Team / ByteDance · Aug 2024
62.4 · 57.6 · 64.2 · 80 · 91.3 · 58.4 · Open source
Fig 1 · Scores from published papers, official model cards and the OpenCompass leaderboard.
§ 04 · Dossier

Each model's honest profile.

Strengths and weaknesses as the authors and independent reviewers report them.

GPT-5 Vision · OpenAI · API

Strengths
  • Most consistent performer across all modalities
  • Strong spatial reasoning and counting
  • Excellent chart and diagram interpretation
  • Native tool use with vision input
Weaknesses
  • Expensive at scale ($2.50/M input tokens for images)
  • Occasional hallucination on fine-grained text in images
  • Video limited to sampled frames, not true temporal modeling
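For orientation, a minimal sketch of a single-image request through the OpenAI Python SDK's chat completions interface. The model identifier gpt-5-vision and the example URL are placeholders, not confirmed names; check the current model listing before running.

```python
# Minimal sketch: one image + one question via an OpenAI-style
# chat.completions endpoint. The model name below is a placeholder;
# use whatever identifier your account actually exposes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5-vision",  # placeholder model id
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is the y-axis maximum in this chart?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```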

Claude Opus 4 · Anthropic · API

Strengths
  • Best-in-class instruction following with visual context
  • Precise bounding box and region understanding
  • Strongest refusal of misleading visual prompts
  • Multi-page document reasoning with 200K context
Weaknesses
  • Highest cost per query ($15/M input tokens)
  • Slower inference than competitors
  • Video understanding behind GPT-5 and Gemini
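A comparable sketch for the Anthropic Messages API, which takes images as base64-encoded content blocks. The model string claude-opus-4 and the file name are placeholders; substitute whatever identifier your account exposes.

```python
# Minimal sketch: base64-encoded image + instruction via the Anthropic
# Messages API. The model string is a placeholder; pick the current
# Opus identifier from the model listing.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("invoice_page_3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-opus-4",  # placeholder model id
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {"type": "base64", "media_type": "image/png", "data": image_b64},
                },
                {"type": "text", "text": "List every line item and its amount as JSON."},
            ],
        }
    ],
)
print(message.content[0].text)
```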

Gemini 2.5 Pro · Google · API

Strengths
  • Best video understanding — processes up to 1 hour natively
  • Interleaved audio + video + text in single query
  • 1M token context window for long documents
  • Competitive pricing ($1.25/M input)
Weaknesses
  • Spatial reasoning slightly behind GPT-5
  • Inconsistent on complex table extraction
  • Occasional refusal on benign medical/scientific images
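A minimal sketch of Gemini's native video path using the google-generativeai SDK's File API (the newer google-genai client exposes an equivalent flow). The model id gemini-2.5-pro, the file name, and the question are placeholders.

```python
# Minimal sketch: upload a video file and ask a temporal question.
# Model id and file name are placeholders, not confirmed values.
import os
import time
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

video = genai.upload_file("factory_floor.mp4")
while video.state.name == "PROCESSING":  # wait until the upload is processed
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-2.5-pro")  # placeholder model id
response = model.generate_content(
    [video, "At what timestamp does the forklift first enter the frame?"]
)
print(response.text)
```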

Qwen2.5-VL-72B · Alibaba · Open source

Strengths
  • Highest DocVQA score of any model (96.4%)
  • Open source (Apache 2.0) — full data privacy
  • HTML-based document parsing with bounding boxes
  • Efficient 7B variant rivals GPT-4o-mini
Weaknesses
  • Requires A100/H100 GPUs for 72B inference
  • Lower MMMU than API models
  • Video understanding significantly behind Gemini
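A minimal local-inference sketch following the published Qwen2.5-VL usage pattern with Hugging Face transformers, shown with the 7B variant so it fits on a single large GPU. It assumes a transformers release that ships the Qwen2.5-VL classes plus the qwen-vl-utils helper package; the image path and prompt are placeholders.

```python
# Local Qwen2.5-VL inference sketch (7B variant) via transformers.
# Assumes: recent transformers with Qwen2.5-VL support, qwen-vl-utils installed.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"  # 7B fits a single large consumer GPU
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///data/scan_0042.png"},  # placeholder path
        {"type": "text", "text": "Extract the invoice number and total amount."},
    ],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs.input_ids.shape[1]:],
                             skip_special_tokens=True)[0])
```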

InternVL2.5-78B · Shanghai AI Lab · Open source

Strengths
  • Strong all-around open-source VLM
  • Dynamic resolution — adapts tile count to image complexity
  • Multilingual vision-language support (8+ languages)
  • Active community and rapid iteration
Weaknesses
  • Large model footprint (78B parameters)
  • Slightly behind Qwen2.5-VL on documents
  • Training data transparency concerns

LLaVA-OneVision-72B · LLaVA Team / ByteDance · Open source

Strengths
  • Unified architecture for image, multi-image, and video
  • Efficient training recipe — strong results from academic compute
  • Single-image, multi-image, and video from one checkpoint
  • Well-documented, easy to fine-tune
Weaknesses
  • Benchmark scores trail frontier models by 8-12 points
  • Older architecture showing its age in 2026
  • Less competitive on document understanding
§ 05 · Failure modes

How VLMs break, predictably.

Every model in the comparison fails in these same ways. Shipping a feature without a guard for them is a bug.

The most useful thing about a VLM is how it fails, and that is exactly what its paper's abstract rarely discusses. Below are the seven recurring failure modes shared by every model in this guide; the severity marker tracks how often each one causes production regressions rather than benchmark noise. A guard sketch for the counting and OCR modes follows the list.

Counting objects · high
Models consistently miscount objects in cluttered scenes. Ask "how many red cars?" in a parking lot photo and expect errors once the count exceeds ~7.
Spatial relationships · high
Left/right, above/below, and relative-position questions remain unreliable. Models may say object A is left of B when it is clearly right.
Fine-grained text in images · medium
Small text, watermarks, and text at angles get misread or hallucinated. License plates, serial numbers, and distant signage are particularly problematic.
Temporal reasoning in video · high
Most VLMs sample frames rather than processing true video. They miss motion, speed, and cause-effect that happens between sampled frames.
Multi-step visual reasoning · medium
Chains of visual inference (A implies B, B implies C from the image) degrade rapidly. Accuracy drops ~15% per reasoning step.
Negation and absence · medium
Asking "is there NOT a dog in this image?" or "what is missing from this scene?" triggers frequent errors. Models are biased toward confirming presence.
Hallucinated OCR · high
When text is partially visible or blurry, models confidently fabricate plausible-looking text rather than admitting uncertainty.
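A minimal guard sketch for the two modes that most often reach production, counting and OCR. It assumes a hypothetical ask_vlm(image, prompt) wrapper around whichever model you deploy, and treats disagreement or explicit abstention as a signal to escalate rather than trust a single confident answer; the vote count and abstention token are illustrative choices, not tuned values.

```python
# Guard sketch for the counting and hallucinated-OCR failure modes above.
# `ask_vlm(image, prompt)` is a hypothetical wrapper around whatever VLM
# you actually deploy; thresholds here are illustrative, not tuned.
from collections import Counter

def count_with_consensus(ask_vlm, image, thing: str, votes: int = 3) -> int | None:
    """Ask the same counting question several times; only trust a majority."""
    answers = []
    for _ in range(votes):
        reply = ask_vlm(image, f"How many {thing} are in this image? Answer with a number only.")
        digits = "".join(ch for ch in reply if ch.isdigit())
        if digits:
            answers.append(int(digits))
    if not answers:
        return None
    value, freq = Counter(answers).most_common(1)[0]
    return value if freq >= 2 else None  # no majority -> escalate to human review

def read_text_or_abstain(ask_vlm, image, field: str) -> str | None:
    """Force the model to admit uncertainty instead of fabricating OCR output."""
    reply = ask_vlm(
        image,
        f"Read the {field} exactly as printed. "
        "If any character is blurry or cut off, answer UNREADABLE.",
    )
    return None if "UNREADABLE" in reply.upper() else reply.strip()
```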
§ 06 · Video

Video is hard.

The gap between answering questions about a video and truly understanding temporal dynamics is still wide.

System · Video-MME · What it does
Gemini 2.5 Pro · 75.2 · Processes up to 1 hour of video natively with audio. The only model that does not rely purely on frame sampling.
GPT-5 Vision · 68.3 · Samples frames (see the sketch after this table). Works well for surveillance review and content moderation. Struggles with fast-action sports and temporal ordering.
Open-source cluster · 58–62 · LLaVA-OneVision, InternVL2.5, Qwen2.5-VL all cluster here. Usable for short clips. Long-video comprehension remains a significant gap vs. Gemini.
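Everything below Gemini in the table reduces video to still frames before the model ever sees it. A minimal sketch of that preprocessing step with OpenCV; the two-second interval and the clip name are illustrative, and the downstream VLM call is left out.

```python
# Frame-sampling sketch: most VLM video pipelines reduce a clip to a
# handful of still frames like this before querying the model.
# The 2-second interval is an illustrative choice, not a recommendation.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n_seconds: float = 2.0) -> list:
    """Return evenly spaced frames (as BGR arrays) from a video file."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(fps * every_n_seconds))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames

frames = sample_frames("delivery_dock.mp4")  # placeholder clip name
# Each frame is then encoded (e.g. JPEG + base64) and sent to the VLM as an
# ordered image sequence, which is exactly why motion between samples,
# speed, and fine temporal ordering are invisible to the model.
```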
Works reliably
  • Scene description and summarisation
  • Object identification across frames
  • Text / subtitle extraction from video
  • Action recognition (walking, running, cooking)
  • Content moderation and safety screening
Still unreliable
  • Precise temporal ordering of events
  • Counting occurrences of repeated actions
  • Understanding cause-and-effect chains
  • Speed and motion estimation
  • Multi-person interaction tracking
§ 07 · Distinction

Generation and understanding are different worlds.

Both are called multimodal; they are not the same capability.

Text-to-image generation (DALL·E 3, Midjourney v7, Stable Diffusion 3.5, Flux) and vision-language understanding (the models on this page) are fundamentally different capabilities, despite sharing the multimodal label.

Image generation
  • Converts text descriptions into pixel data
  • Diffusion-based architectures dominate
  • Evaluated by FID, aesthetic scores, human preference
  • Rapid commoditisation in 2025–2026
  • Text rendering in images still imperfect
Vision understanding (VLMs)
  • Converts visual data into language and reasoning
  • Transformer-based with vision encoders
  • Evaluated by MMMU, DocVQA, ChartQA, etc.
  • Still improving rapidly on hard reasoning tasks
  • The models covered in this guide

The convergence trend. Some models now merge the two: GPT-5 can generate and understand images; Gemini 2.5 Pro generates images natively. But the underlying architectures remain different, and specialisation still wins on benchmarks. A great generator is not a great understander, and vice versa.

§ 08 · Decision

Match the constraint.

Six common use cases, each with the primary and secondary pick and the one-line reason.

Use case · Pick · Why
Document extraction at scale · Qwen2.5-VL-72B (self-hosted) or Gemini 2.5 Pro (API) · Highest DocVQA scores. Qwen for privacy/cost. Gemini for ease.
Complex reasoning over images · GPT-5 Vision or Claude Opus 4 · Best MMMU scores. Claude for instruction adherence. GPT-5 for breadth.
Video analysis and surveillance · Gemini 2.5 Pro · Only model with native long-video processing. Nearly a 7-point Video-MME lead.
Budget-friendly prototyping · Gemini 2.5 Pro or Qwen2.5-VL-7B · Gemini at $1.25/M input. Qwen 7B runs on a single consumer GPU.
Privacy-critical / air-gapped · Qwen2.5-VL or InternVL2.5 · Open source. Deploy anywhere. No data leaves your infrastructure.
Multi-page document comprehension · Claude Opus 4 or Gemini 2.5 Pro · 200K and 1M context respectively. Both handle interleaved images well.