Multimodal
Combining vision and language? Evaluate image captioning, visual QA, text-to-image generation, and cross-modal retrieval models.
Multimodal AI in 2025 has moved from research demos to production-ready systems. The gap between proprietary and open-source models has narrowed dramatically, with practical choices now spanning from edge devices to frontier reasoning.
State of the Field (Dec 2025)
- Gemini 3 Pro leads proprietary models with breakthrough reasoning scores on Humanity's Last Exam, while Gemini 3 Flash matches previous-generation Pro performance at lower cost and latency
- Open-source models have achieved near-parity: InternVL3-78B hits 72.2% on MMMU, Molmo 2 leads open models in video understanding and grounding tasks, and Qwen 2.5 VL handles 29 languages and hour-long videos
- Hallucination remains the critical deployment blocker: models confidently describe non-existent objects, and grounding objectives surprisingly don't fix this in open-ended generation
- Spatial reasoning and 3D understanding lag behind: even frontier models struggle with orientation tasks (56% vs. 95.7% human accuracy), limiting robotics and embodied AI applications
Quick Recommendations
General-purpose multimodal reasoning (production API)
Gemini 3 Flash
Matches Gemini 2.5 Pro performance at lower cost and latency. Best efficiency-capability tradeoff for API usage.
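For illustration, a minimal image-QA call with Google's google-genai Python SDK. The model identifier "gemini-3-flash" is an assumption here, so verify it against the currently published model list:

```python
# Minimal sketch: image QA via the google-genai SDK.
# ASSUMPTION: the model id "gemini-3-flash" -- check the current
# model list before use.
from google import genai
from PIL import Image

client = genai.Client()  # reads the GEMINI_API_KEY environment variable

image = Image.open("invoice.png")  # placeholder input image
response = client.models.generate_content(
    model="gemini-3-flash",
    contents=[image, "What is the total amount on this invoice?"],
)
print(response.text)
```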
Open-source general multimodal
InternVL3-78B
72.2% MMMU, state-of-the-art among open models. Reasonable compute requirements for on-prem deployment.
Video understanding and tracking
Molmo 2
Leading open-weight model for video QA, dense captioning, and multi-object tracking; its 9M video training examples show in the results.
Document understanding and OCR
Llama 3.2 Vision 90B
73.6% VQAv2, 70.7% DocVQA. Meta's focus on document tasks delivers practical results for enterprise.
Edge deployment (resource-constrained)
Qwen 2.5 VL-7B
Strong performance at 7B parameters. Handles variable-resolution inputs and 29 languages; deployable on modest hardware.
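For local inference, a minimal sketch following the usage pattern from the official Qwen2.5-VL model card (requires transformers >= 4.49 plus the qwen-vl-utils helper package); the image path is a placeholder:

```python
# Minimal sketch: single-image QA with Qwen2.5-VL-7B via transformers.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "photo.jpg"},  # placeholder image path
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

# Build the chat prompt and collect the vision inputs it references.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the model's answer is decoded.
trimmed = generated[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```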
Scientific and technical diagrams
DeepSeek-VL2
MoE architecture optimized for technical reasoning. Outperforms generalist models on specialized scientific content.
Multi-image reasoning
Pixtral (Mistral AI)
Native multi-image processing, strong instruction-following. Architectural modularity aids practical deployment.
Long-context document reasoning
MACT framework on top of base model
Multi-agent collaboration outperforms monolithic scaling: decompose the task into planning, execution, and judgment agents (sketched below).
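A sketch of that decomposition in the spirit of MACT, not the authors' code; `call_model` is a hypothetical stub for whatever base model you deploy:

```python
# Illustrative planner/executor/judge decomposition for long-document QA.
# HYPOTHETICAL: call_model is a stand-in for your base multimodal model.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire up your base model here")

def plan(question: str, context: str) -> list[str]:
    # Planning agent: break the question into ordered sub-steps.
    out = call_model(f"Context: {context}\nQuestion: {question}\n"
                     "List the sub-steps needed to answer, one per line.")
    return [s.strip() for s in out.splitlines() if s.strip()]

def execute(step: str, context: str) -> str:
    # Execution agent: solve one sub-step against the retrieved context.
    return call_model(f"Context: {context}\nSub-task: {step}\nAnswer concisely.")

def judge(question: str, draft: str) -> bool:
    # Judgment agent: accept or reject the assembled answer.
    verdict = call_model(f"Question: {question}\nDraft: {draft}\n"
                         "Reply ACCEPT if well-supported, else REJECT.")
    return verdict.strip().upper().startswith("ACCEPT")

def answer(question: str, context: str, max_tries: int = 2) -> str:
    draft = ""
    for _ in range(max_tries):
        draft = "\n".join(execute(s, context) for s in plan(question, context))
        if judge(question, draft):
            break
    return draft
```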
Hallucination-critical applications
Base model + MARINE framework
Training-free hallucination reduction via open-source vision model guidance. Works across diverse LVLMs.
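The underlying mechanism is a classifier-free-guidance-style blend of grounded and ungrounded logits at decode time. A generic sketch of that blend follows; the exact weighting in the MARINE paper may differ, so treat gamma as a tunable assumption:

```python
# Generic classifier-free-guidance logit blend of the kind MARINE builds on.
# logits_plain: next-token logits from the LVLM alone.
# logits_grounded: logits from the same model conditioned on detected-object
# guidance from an open-source vision model.
import torch

def blend_logits(logits_plain: torch.Tensor,
                 logits_grounded: torch.Tensor,
                 gamma: float = 1.5) -> torch.Tensor:
    # gamma > 0 shifts decoding toward tokens supported by visual evidence.
    return logits_plain + gamma * (logits_grounded - logits_plain)
```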
Frontier reasoning (cost no object)
Gemini 3 Pro
Tops LMArena for vision tasks, breakthrough scores on reasoning benchmarks. Vendor support and reliability.
Tasks & Benchmarks
Cross-Modal Retrieval
Retrieving items across different modalities (image-text).
Image Captioning
Generating text descriptions of images (COCO Captions).
Text-to-Image Generation
Generating images from text descriptions (Stable Diffusion, DALL-E).
Video Understanding
Understanding and reasoning about video content.
Visual Question Answering
Answering questions about images (VQA, GQA).
Key Datasets
COCO Captions (Image Captioning): 330K images with 5 captions each. Standard benchmark for image captioning.
VQA v2 (Visual Question Answering): 265K images with 1.1M questions. Balanced to reduce the language biases found in v1.
Honest Takes
Open-source has caught up for most use cases
Unless you need absolute frontier reasoning, InternVL3-78B or Molmo 2 will serve you better than paying per-token for proprietary APIs. The performance gap has collapsed while deployment flexibility remains massive.
Video understanding is still the wild west
Despite claims, most models fail hard on videos over 15 minutes. If your use case involves long-form video, budget for custom fine-tuning. The benchmarks don't reflect real-world complexity.
Grounding doesn't fix hallucination
Research shows spatial grounding training has little to no effect on object hallucination in captions. You'll need explicit verification pipelines, not architectural fixes.
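One possible verification pass, assuming you extract object mentions from the caption yourself: check each mention against a zero-shot detector such as OWL-ViT via the transformers pipeline, and flag anything the detector cannot find:

```python
# Post-hoc hallucination check: verify caption objects with OWL-ViT.
from transformers import pipeline

detector = pipeline(
    "zero-shot-object-detection", model="google/owlvit-base-patch32"
)

def unverified_objects(image_path: str, objects: list[str],
                       threshold: float = 0.2) -> list[str]:
    # Flag caption objects the detector cannot locate in the image.
    detections = detector(image_path, candidate_labels=objects)
    found = {d["label"] for d in detections if d["score"] >= threshold}
    return [o for o in objects if o not in found]

# e.g. for the caption "a dog chasing a frisbee in a park":
print(unverified_objects("photo.jpg", ["dog", "frisbee", "tree"]))
```

The detection threshold is a deployment knob: lower it and you miss fewer hallucinations but flag more correct mentions.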
Scientific domains are still underserved
Gemini 2.5 Pro and o3 struggle on chemistry Olympiad problems. If you work with specialized diagrams (molecular structures, technical schematics), expect to build domain-specific solutions.
MoE is the new scaling paradigm
Mixture-of-experts architectures like DeepSeek-VL2 deliver comparable performance at a fraction of the compute. Dense models are increasingly a poor cost-performance choice.