Models that read, see, hear — and sometimes all three at once. The most crowded frontier in the index, and the least standardised: every leaderboard has a slightly different test split.
Multimodal AI in 2025 has moved from research demos to production-ready systems. The gap between proprietary and open-source models has narrowed dramatically, with practical choices now spanning from edge devices to frontier reasoning.
Each task opens onto a leaderboard of its canonical benchmark, with the full submission history and dated scores. Tasks without an indexed result are listed elsewhere in the register; the table below is sorted by result count.
Leading scores for the headline benchmarks in this area, drawn from the registry. Shaded rows mark the top result per task; follow any row into the full leaderboard.
| # | Task | Benchmark | Leading model | Score |
|---|---|---|---|---|
| 01 | Image Captioning | COCO Captions | BLIP-2 | 145.8 CIDEr |
| 02 | Visual Question Answering | MMBench | Qwen2.5-VL 72B | 90.5% accuracy |
Matches Gemini 2.5 Pro performance at lower cost and latency. Best efficiency-capability tradeoff for API usage.
72.2% MMMU, state-of-the-art among open models. Reasonable compute requirements for on-prem deployment.
Leading open-weight model for video QA, dense captioning, and multi-object tracking; its 9M-example video training set shows.
73.6% VQAv2, 70.7% DocVQA. Meta's focus on document tasks delivers practical results for enterprise.
Strong performance at 7B parameters. Handles variable resolution and 29 languages, deployable on modest hardware.
MoE architecture optimized for technical reasoning. Better than generalist models on specialized scientific content.
Native multi-image processing, strong instruction-following. Architectural modularity aids practical deployment.
Multi-agent collaboration outperforms monolithic scaling. Decompose into planning, execution, and judgment agents (see the sketch after this list).
Training-free hallucination reduction via open-source vision model guidance. Works across diverse LVLMs.
Tops LMArena for vision tasks, breakthrough scores on reasoning benchmarks. Vendor support and reliability.
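The planning/execution/judgment decomposition mentioned above is straightforward to prototype. A minimal sketch, assuming a hypothetical `call_model` helper in place of whatever VLM client you actually use; the role prompts and accept/reject loop are illustrative, not a published recipe:

```python
# Minimal planner / executor / judge loop for visual question answering.
# `call_model` is a hypothetical stand-in for your VLM API client.
from dataclasses import dataclass


def call_model(role_prompt: str, task: str) -> str:
    """Hypothetical single-model call; replace with a real API client."""
    return f"[{role_prompt.split('.')[0]}] response to: {task[:40]}"


@dataclass
class MultiAgentVQA:
    max_rounds: int = 2

    def answer(self, question: str, image_ref: str) -> str:
        task = f"Image: {image_ref}\nQuestion: {question}"
        draft = ""
        for _ in range(self.max_rounds):
            # Planner: break the question into visual sub-steps.
            plan = call_model("You are a planner. List the visual steps needed.", task)
            # Executor: follow the plan and produce a candidate answer.
            draft = call_model("You are an executor. Follow the plan and answer.",
                               f"{task}\nPlan: {plan}")
            # Judge: accept the answer or name the flaw for the next round.
            verdict = call_model("You are a judge. Reply ACCEPT or name the flaw.",
                                 f"{task}\nAnswer: {draft}")
            if verdict.startswith("ACCEPT"):
                return draft
            task += f"\nPrevious flaw: {verdict}"  # feed the critique back in
        return draft
```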
Unless you need absolute frontier reasoning, InternVL3-78B or Molmo 2 will serve you better than paying per-token for proprietary APIs. The performance gap has collapsed while deployment flexibility remains massive.
Despite claims, most models fail hard on videos over 15 minutes. If your use case involves long-form video, budget for custom fine-tuning. The benchmarks don't reflect real-world complexity.
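Until long-video context windows actually hold up, the usual stopgap is chunk-and-aggregate. A minimal sketch, assuming a hypothetical `describe_clip` call standing in for any short-context VLM; the window and overlap sizes are arbitrary:

```python
# Chunk a long video into overlapping windows, caption each window with a
# short-context model, then stitch the window summaries in a text-only pass.

def windows(duration_s: float, window_s: float = 60.0, overlap_s: float = 5.0):
    """Yield (start, end) windows covering the full duration, with overlap."""
    start, step = 0.0, window_s - overlap_s
    while start < duration_s:
        yield start, min(start + window_s, duration_s)
        start += step


def describe_clip(video_path: str, start: float, end: float) -> str:
    # Hypothetical placeholder for a per-clip VLM call.
    return f"summary of {video_path} [{start:.0f}s-{end:.0f}s]"


def summarise_video(video_path: str, duration_s: float) -> str:
    # Map: caption each window inside the model's comfortable context.
    parts = [describe_clip(video_path, s, e) for s, e in windows(duration_s)]
    # Reduce: join here; in practice, a final text-only model pass merges them.
    return "\n".join(parts)


print(summarise_video("talk.mp4", duration_s=45 * 60))  # a 45-minute video
```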
Research shows spatial grounding training has little to no effect on object hallucination in captions. You'll need explicit verification pipelines, not architectural fixes.
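Concretely, a verification pipeline checks every object a caption mentions against independent detector output. A minimal sketch; `detect_objects` is a hypothetical stand-in for an open-vocabulary detector, and the token-intersection extractor is deliberately naive:

```python
# Flag caption objects that no detection supports, instead of trusting the
# captioner's grounding.

def detect_objects(image_path: str) -> set[str]:
    # Hypothetical placeholder for an open-vocabulary detector.
    return {"dog", "frisbee", "grass"}


def extract_object_mentions(caption: str, vocabulary: set[str]) -> set[str]:
    """Naive extraction: intersect caption tokens with a known object vocabulary."""
    tokens = {t.strip(".,!?").lower() for t in caption.split()}
    return tokens & vocabulary


def verify_caption(caption: str, image_path: str, vocabulary: set[str]) -> list[str]:
    """Return mentioned objects with no supporting detection."""
    detected = detect_objects(image_path)
    mentioned = extract_object_mentions(caption, vocabulary)
    return sorted(mentioned - detected)


vocab = {"dog", "cat", "frisbee", "grass", "ball"}
flags = verify_caption("A dog and a cat play with a frisbee.", "img.jpg", vocab)
print(flags)  # ['cat'] -> unsupported mention; re-caption or reject
```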
Gemini 2.5 Pro and o3 struggle on chemistry Olympiad problems. If you work with specialized diagrams (molecular structures, technical schematics), expect to build domain-specific solutions.
Mixture-of-experts architectures like DeepSeek-VL deliver comparable performance at a fraction of the compute. Dense models are increasingly a poor cost-performance choice.
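The compute claim is routing arithmetic: with top-2 gating over 8 experts, each token activates a quarter of the expert parameters. A toy numpy sketch of generic top-k gating (illustrative shapes, not DeepSeek-VL's actual router):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 8, 2
tokens = rng.standard_normal((4, d_model))            # 4 tokens
router = rng.standard_normal((d_model, n_experts))    # gating projection
experts = rng.standard_normal((n_experts, d_model, d_model))

logits = tokens @ router                              # (4, n_experts)
chosen = np.argsort(logits, axis=-1)[:, -top_k:]      # top-2 experts per token

out = np.zeros_like(tokens)
for t, row in enumerate(chosen):
    w = np.exp(logits[t, row])
    w /= w.sum()                                      # softmax over chosen experts
    for weight, e in zip(w, row):
        out[t] += weight * (tokens[t] @ experts[e])   # only 2 of 8 experts run

print(f"active expert compute per token: {top_k}/{n_experts} of dense")
```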
The benchmarks above come from the same Postgres registry that powers the wider Codesota index. Each task has exactly one canonical dataset. Each score carries a metric direction, a date and — where possible — a reproduction status.
When a score regresses, the prior record stays visible. When a benchmark is contested, we mark it rather than delete it. The goal is a register that argues in public.
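In code terms, one registry row might look like the following; the field names and types are assumptions for illustration, not the actual Postgres schema:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class Direction(Enum):
    HIGHER_IS_BETTER = "higher"
    LOWER_IS_BETTER = "lower"


@dataclass(frozen=True)
class Result:
    task: str
    benchmark: str            # exactly one canonical dataset per task
    model: str
    score: float
    metric: str
    direction: Direction      # every score carries a metric direction
    recorded_on: date         # dated scores
    reproduced: bool | None   # None = reproduction status unknown
    contested: bool = False   # contested benchmarks are marked, never deleted


# Append-only history: a regression adds a new row and leaves the prior
# record visible rather than overwriting it.
history: list[Result] = [
    Result("vqa", "MMBench", "Qwen2.5-VL 72B", 90.5, "accuracy",
           Direction.HIGHER_IS_BETTER, date.today(), None),  # placeholder date
]
```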
Sibling area hubs, the unified task index and the methodology that binds them.