Models that read, see, hear — and sometimes all three at once. The most crowded frontier in the index, and the least standardised: every leaderboard has a slightly different test split.
Multimodal AI in 2025 has moved from research demos to production-ready systems. The gap between proprietary and open-source models has narrowed dramatically, with practical choices now spanning from edge devices to frontier reasoning.
Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.
Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.
| # | Task | Benchmark | Leading model | Score |
|---|---|---|---|---|
| 01 | Image Captioning | COCO Captions | BLIP-2 | 145.8 CIDEr |
| 02 | Visual Question Answering | MMBench | Qwen2.5-VL 72B | 90.5% accuracy |
Matches Gemini 2.5 Pro performance at lower cost and latency. Best efficiency-capability tradeoff for API usage.
72.2% MMMU, state-of-the-art among open models. Reasonable compute requirements for on-prem deployment.
Leading open-weight model for video QA, dense captioning, and multi-object tracking; its 9M-example video training set shows.
73.6% VQAv2, 70.7% DocVQA. Meta's focus on document tasks delivers practical results for enterprise.
Strong performance at 7B parameters. Handles variable resolution and 29 languages, deployable on modest hardware.
MoE architecture optimized for technical reasoning. Better than generalist models on specialized scientific content.
Native multi-image processing, strong instruction-following. Architectural modularity aids practical deployment.
Multi-agent collaboration outperforms monolithic scaling. Decompose into planning, execution, and judgment agents (see the sketch after this list).
Training-free hallucination reduction via open-source vision model guidance. Works across diverse LVLMs.
Tops LMArena for vision tasks, breakthrough scores on reasoning benchmarks. Vendor support and reliability.
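The planning/execution/judgment decomposition mentioned above is straightforward to prototype. A minimal sketch, assuming a hypothetical `call_model` helper in place of whatever VLM client you actually use; the role prompts and accept/reject loop are illustrative, not a published recipe:

```python
# Minimal planner / executor / judge loop for visual question answering.
# `call_model` is a hypothetical stand-in for your VLM API client.
from dataclasses import dataclass


def call_model(role_prompt: str, task: str) -> str:
    """Hypothetical single-model call; replace with a real API client."""
    return f"[{role_prompt.split('.')[0]}] response to: {task[:40]}"


@dataclass
class MultiAgentVQA:
    max_rounds: int = 2

    def answer(self, question: str, image_ref: str) -> str:
        task = f"Image: {image_ref}\nQuestion: {question}"
        draft = ""
        for _ in range(self.max_rounds):
            # Planner: break the question into visual sub-steps.
            plan = call_model("You are a planner. List the visual steps needed.", task)
            # Executor: follow the plan and produce a candidate answer.
            draft = call_model("You are an executor. Follow the plan and answer.",
                               f"{task}\nPlan: {plan}")
            # Judge: accept the answer or name the flaw for the next round.
            verdict = call_model("You are a judge. Reply ACCEPT or name the flaw.",
                                 f"{task}\nAnswer: {draft}")
            if verdict.startswith("ACCEPT"):
                return draft
            task += f"\nPrevious flaw: {verdict}"  # feed the critique back in
        return draft
```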
Unless you need absolute frontier reasoning, InternVL3-78B or Molmo 2 will serve you better than paying per-token for proprietary APIs. The performance gap has collapsed while deployment flexibility remains massive.
Despite claims, most models fail hard on videos over 15 minutes. If your use case involves long-form video, budget for custom fine-tuning. The benchmarks don't reflect real-world complexity.
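Until long-video context windows actually hold up, the usual stopgap is chunk-and-aggregate. A minimal sketch, assuming a hypothetical `describe_clip` call standing in for any short-context VLM; the window and overlap sizes are arbitrary:

```python
# Chunk a long video into overlapping windows, caption each window with a
# short-context model, then stitch the window summaries in a text-only pass.

def windows(duration_s: float, window_s: float = 60.0, overlap_s: float = 5.0):
    """Yield (start, end) windows covering the full duration, with overlap."""
    start, step = 0.0, window_s - overlap_s
    while start < duration_s:
        yield start, min(start + window_s, duration_s)
        start += step


def describe_clip(video_path: str, start: float, end: float) -> str:
    # Hypothetical placeholder for a per-clip VLM call.
    return f"summary of {video_path} [{start:.0f}s-{end:.0f}s]"


def summarise_video(video_path: str, duration_s: float) -> str:
    # Map: caption each window inside the model's comfortable context.
    parts = [describe_clip(video_path, s, e) for s, e in windows(duration_s)]
    # Reduce: join here; in practice, a final text-only model pass merges them.
    return "\n".join(parts)


print(summarise_video("talk.mp4", duration_s=45 * 60))  # a 45-minute video
```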
Research shows spatial grounding training has little to no effect on object hallucination in captions. You'll need explicit verification pipelines, not architectural fixes.
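Concretely, a verification pipeline checks every object a caption mentions against independent detector output. A minimal sketch; `detect_objects` is a hypothetical stand-in for an open-vocabulary detector, and the token-intersection extractor is deliberately naive:

```python
# Flag caption objects that no detection supports, instead of trusting the
# captioner's grounding.

def detect_objects(image_path: str) -> set[str]:
    # Hypothetical placeholder for an open-vocabulary detector.
    return {"dog", "frisbee", "grass"}


def extract_object_mentions(caption: str, vocabulary: set[str]) -> set[str]:
    """Naive extraction: intersect caption tokens with a known object vocabulary."""
    tokens = {t.strip(".,!?").lower() for t in caption.split()}
    return tokens & vocabulary


def verify_caption(caption: str, image_path: str, vocabulary: set[str]) -> list[str]:
    """Return mentioned objects with no supporting detection."""
    detected = detect_objects(image_path)
    mentioned = extract_object_mentions(caption, vocabulary)
    return sorted(mentioned - detected)


vocab = {"dog", "cat", "frisbee", "grass", "ball"}
flags = verify_caption("A dog and a cat play with a frisbee.", "img.jpg", vocab)
print(flags)  # ['cat'] -> unsupported mention; re-caption or reject
```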
Gemini 2.5 Pro and o3 struggle on chemistry Olympiad problems. If you work with specialized diagrams (molecular structures, technical schematics), expect to build domain-specific solutions.
Mixture-of-experts architectures like DeepSeek-VL deliver comparable performance at a fraction of the compute. Dense models are increasingly a poor cost-performance choice.
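The compute claim is routing arithmetic: with top-2 gating over 8 experts, each token activates a quarter of the expert parameters. A toy numpy sketch of generic top-k gating (illustrative shapes, not DeepSeek-VL's actual router):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 8, 2
tokens = rng.standard_normal((4, d_model))            # 4 tokens
router = rng.standard_normal((d_model, n_experts))    # gating projection
experts = rng.standard_normal((n_experts, d_model, d_model))

logits = tokens @ router                              # (4, n_experts)
chosen = np.argsort(logits, axis=-1)[:, -top_k:]      # top-2 experts per token

out = np.zeros_like(tokens)
for t, row in enumerate(chosen):
    w = np.exp(logits[t, row])
    w /= w.sum()                                      # softmax over chosen experts
    for weight, e in zip(w, row):
        out[t] += weight * (tokens[t] @ experts[e])   # only 2 of 8 experts run

print(f"active expert compute per token: {top_k}/{n_experts} of dense")
```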
The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
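In code terms, one registry row might look like the following; the field names and types are assumptions for illustration, not the actual Postgres schema:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


@dataclass(frozen=True)
class Result:
    task: str
    benchmark: str            # exactly one canonical dataset per task
    model: str
    score: float
    metric: str
    direction: Direction      # every score carries a metric direction
    recorded_on: date         # dated scores
    reproduced: bool | None   # None = reproduction status unknown
    contested: bool = False   # contested benchmarks are marked, never deleted


# Append-only history: a regression adds a new row and leaves the prior
# record visible rather than overwriting it.
history: list[Result] = [
    Result("vqa", "MMBench", "Qwen2.5-VL 72B", 90.5, "accuracy",
           Direction.HIGHER_IS_BETTER, date.today(), None),  # placeholder date
]
```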
Sibling area hubs, the unified task index and the methodology that binds them.