The word “multimodal” has expanded far beyond feeding a single image to a language model. In 2026 there are six distinguishable capability tracks, and conflating them is the main reason a model chosen for one job disappoints on another.
| Track | Maturity | Where it stands |
|---|---|---|
| Vision + Language | Mature | Single image understanding, OCR, chart reading, visual QA. The most developed modality. Near-human on documents. |
| Multi-Image Reasoning | Maturing | Comparing multiple images, finding differences, understanding image sequences. Models handle 5–10 images well but degrade beyond that. |
| Video Understanding | Active research | Temporal reasoning, event detection, long-form comprehension. Gemini leads. Most models still sample frames rather than process motion. |
| Audio + Vision | Emerging | Joint audio-visual reasoning. Gemini 2.5 Pro processes video with audio natively. Others require separate ASR pipelines. |
| Interleaved Documents | Maturing | PDFs with mixed text, tables, figures, and charts. Critical for enterprise. Qwen2.5-VL and Claude Opus 4 lead here. |
| Spatial / 3D | Early | Depth, 3D layouts, physical spatial relationships from 2D images. All models struggle. Active research frontier. |
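The difference between the multi-image and interleaved-document tracks shows up concretely in how the input is structured. A minimal sketch, assuming an OpenAI-style "content parts" chat message format (a common convention, but exact field names vary by provider; the function names here are hypothetical and no request is actually sent):

```python
# Sketch of two input shapes, assuming an OpenAI-style "content parts"
# message format. Field names vary by provider; nothing is sent over the wire.

def multi_image_message(question: str, image_urls: list[str]) -> dict:
    """Multi-image reasoning: one question with several images attached."""
    parts = [{"type": "text", "text": question}]
    parts += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {"role": "user", "content": parts}

def interleaved_document_message(blocks: list[tuple[str, str]]) -> dict:
    """Interleaved documents: text and figures kept in original page order,
    so the model sees each figure next to the prose that references it."""
    parts = []
    for kind, value in blocks:
        if kind == "text":
            parts.append({"type": "text", "text": value})
        else:  # treat anything else as an image reference
            parts.append({"type": "image_url", "image_url": {"url": value}})
    return {"role": "user", "content": parts}

# Multi-image: all images after the question, order mostly irrelevant.
msg = multi_image_message(
    "What changed between these two frames?",
    ["https://example.com/a.png", "https://example.com/b.png"],
)

# Interleaved: order carries meaning, as in a PDF with inline figures.
doc = interleaved_document_message([
    ("text", "Section 2 introduces the architecture."),
    ("image", "https://example.com/fig2.png"),
    ("text", "As Figure 2 shows, the encoder is shared."),
])
```

The point of the sketch: in the multi-image case the images are peers of one question, while in the interleaved case the ordering of text and image parts is itself the signal, which is why document understanding stresses models differently than batch image comparison.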