Image Captioning
Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontiers are dense captioning (region-level descriptions with boxes) and paragraph-level description that captures the spatial relationships, attributes, and background context brief captions miss. Captioning's importance now lies more in its role as a training signal for other vision-language tasks than as a standalone evaluation.
Image captioning models generate natural language descriptions of images, ranging from brief one-line summaries to detailed paragraphs covering objects, relationships, actions, and scene context. Once a flagship vision-language task, captioning is now a commodity capability of any VLM but remains critical for accessibility, search indexing, and training data generation.
History
Show and Tell (Vinyals et al., Google, 2015) applies a CNN encoder + LSTM decoder to image captioning, establishing the paradigm; Show, Attend and Tell adds visual attention shortly after
MS COCO Captioning Challenge (2015) drives rapid progress; attention mechanisms become standard
Oscar (Microsoft, 2020) uses object tags as anchor points between vision and language, achieving SOTA on COCO Captions
BLIP (Salesforce, 2022) introduces bootstrapped language-image pretraining with captioning and filtering, generating high-quality synthetic captions at scale
CoCa (Google, 2022) unifies contrastive and captioning objectives in a single encoder-decoder model
BLIP-2 (2023) achieves 145.8 CIDEr on COCO Captions with a frozen ViT + Q-Former + frozen LLM architecture
LLaVA and InstructBLIP (2023) show that instruction-tuned VLMs generate richer, more detailed captions than task-specific models
Frontier VLMs (GPT-4o, Claude 3.5, Gemini) produce paragraph-length captions with contextual understanding, scene reasoning, and cultural references
Captioning shifts from a task to a capability — every VLM does it; focus moves to controllable detail level, factual accuracy, and domain-specific captioning
How Image Captioning Works
Visual Feature Extraction
The image is encoded by a vision model (ViT, ConvNeXt, or SigLIP encoder) into a sequence of visual features representing patches or regions of the image.
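The patch step above can be sketched in plain NumPy — `patchify` is a toy helper named here for illustration, splitting an image into the flattened patch sequence a ViT-style encoder would then linearly embed:

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into a sequence of flattened patches,
    as a ViT-style encoder does before linear embedding."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    return (
        image.reshape(h // patch, patch, w // patch, patch, c)
             .transpose(0, 2, 1, 3, 4)          # group the two grid axes together
             .reshape(-1, patch * patch * c)    # one row per patch
    )

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 patch vectors.
feats = patchify(np.zeros((224, 224, 3)), patch=16)
print(feats.shape)  # (196, 768)
```

The 196-token sequence is what the bridging stage below consumes; a real encoder would additionally apply a learned embedding and transformer layers on top of these raw patches.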
Feature Bridging
Visual features are projected into the language model's embedding space. Methods range from simple linear projections (LLaVA) to learned queries (BLIP-2's Q-Former) that compress visual information into a fixed number of tokens.
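Both bridging styles can be illustrated at toy scale. The dimensions below (196 patches, 768-d features, a 4096-d LLM space, 32 queries) are illustrative assumptions, not any specific model's configuration, and the single-head attention is a simplification of what Q-Former actually does:

```python
import numpy as np

rng = np.random.default_rng(0)
visual = rng.normal(size=(196, 768))        # patch features from the vision encoder

# Style 1: LLaVA-like linear projection — one LLM "soft token" per patch.
W_proj = rng.normal(size=(768, 4096)) * 0.02
linear_tokens = visual @ W_proj             # (196, 4096)

# Style 2: Q-Former-like compression — a fixed set of 32 learned queries
# cross-attends over the patches and emits a fixed-length summary.
queries = rng.normal(size=(32, 768))
scores = queries @ visual.T / np.sqrt(768)  # (32, 196) attention logits
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)     # softmax over patches
compressed = attn @ visual                  # (32, 768): fixed token budget

print(linear_tokens.shape, compressed.shape)
```

The trade-off is visible in the shapes: linear projection preserves all 196 tokens (more detail, more LLM context consumed), while learned queries cap the visual input at 32 tokens regardless of image size.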
Caption Generation
An autoregressive language model generates the caption token by token, attending to both the visual features and previously generated text. The decoding strategy shapes the output: beam search favors high-likelihood captions, while nucleus (top-p) sampling trades some likelihood for diversity.
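Nucleus (top-p) sampling, one of the decoding strategies mentioned above, fits in a few lines of NumPy. `nucleus_sample` is a hypothetical helper written for this sketch, not a library function:

```python
import numpy as np

def nucleus_sample(logits: np.ndarray, p: float = 0.9, rng=None) -> int:
    """Sample a token id from the smallest set of tokens whose
    cumulative probability exceeds p (top-p / nucleus sampling)."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                         # softmax
    order = np.argsort(probs)[::-1]              # most likely first
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1         # size of the nucleus
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()       # renormalize inside the nucleus
    return int(rng.choice(keep, p=kept))

# With p=0.9 the lowest-probability token here falls outside the nucleus,
# so it can never be sampled.
logits = np.array([2.0, 1.0, 0.1, -1.0])
token = nucleus_sample(logits, p=0.9, rng=np.random.default_rng(0))
print(token)
```

In a captioning decoder this would be called once per generated token, with the logits conditioned on the visual tokens and the caption prefix so far.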
Optional Refinement
Post-processing may include factual grounding (checking generated descriptions against detected objects), length control, and style adaptation (alt-text vs. detailed description vs. artistic interpretation).
Current Landscape
Image captioning in 2025 is a solved commodity task for standard use cases — any modern VLM produces captions that are generally accurate and fluent. The frontier has moved to controllable captioning (specify detail level, style, and focus), domain-specific captioning (medical images, satellite imagery, scientific figures), and dense captioning (describing every region of an image with bounding boxes). Captioning's biggest impact is now behind the scenes: synthetic caption generation (à la BLIP-2's CapFilt) is the backbone of training data pipelines for both vision-language models and text-to-image generators. The quality of these synthetic captions directly determines downstream model quality.
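The caption-then-filter idea behind CapFilt reduces to a simple loop: generate candidate captions in bulk, score each against the image with an image-text similarity model, and keep only those above a threshold. The sketch below uses random vectors as stand-ins for real CLIP-style embeddings, purely to show the filtering logic; the names and threshold are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Stand-in embeddings: one image vector and several candidate captions
# produced by a bulk captioner. A real pipeline would embed both with
# the same image-text model (e.g. a CLIP-style scorer).
image_emb = rng.normal(size=64)
caption_embs = {f"caption_{i}": rng.normal(size=64) for i in range(5)}

threshold = 0.0  # keep captions whose image-text similarity clears the bar
kept = {name: cosine(image_emb, emb)
        for name, emb in caption_embs.items()
        if cosine(image_emb, emb) > threshold}
print(sorted(kept))
```

The filter threshold is the key knob in such pipelines: set too low it admits hallucinated captions into the training set, set too high it discards usable data.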
Key Challenges
Hallucination — captioning models frequently mention objects, attributes, or relationships not present in the image
Detail level control — users want different caption styles (brief alt-text vs. exhaustive description) and models struggle to calibrate appropriately
Factual accuracy — captions may misidentify species, landmarks, brands, or cultural artifacts without grounding in external knowledge
Bias — captioning models reflect dataset biases in gender, ethnicity, and cultural representation
Evaluation — CIDEr and BLEU correlate poorly with human judgment of caption quality; human evaluation remains the gold standard
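The evaluation gap in the last bullet is easy to demonstrate: BLEU-style n-gram precision rewards word overlap regardless of whether the caption makes sense. A minimal sketch of clipped unigram precision (the helper name is illustrative; real BLEU also combines higher-order n-grams and a brevity penalty):

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, references: list[str]) -> float:
    """Modified unigram precision as in BLEU: each candidate word counts
    at most as often as it appears in any single reference."""
    cand = Counter(candidate.lower().split())
    max_ref = Counter()
    for ref in references:
        for word, count in Counter(ref.lower().split()).items():
            max_ref[word] = max(max_ref[word], count)
    clipped = sum(min(count, max_ref[word]) for word, count in cand.items())
    return clipped / sum(cand.values())

refs = ["a dog runs on the beach"]
# A scrambled caption that garbles the scene still scores perfectly,
# because every word it uses appears in the reference.
print(clipped_unigram_precision("the beach runs on a dog", refs))  # 1.0
```

This is one concrete reason word-overlap metrics diverge from human judgment, and why human evaluation remains the gold standard for caption quality.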
Quick Recommendations
Best overall
GPT-4o
Generates the most detailed, contextually aware captions with minimal hallucination; understands cultural and situational context
Best for accessibility (alt text)
Claude 3.5 Sonnet
Excels at concise, accurate descriptions suitable for screen readers; follows alt-text best practices when prompted
Best for training data generation
BLIP-2 + Qwen2.5-VL-7B
BLIP-2 for fast bulk captioning, VLM for quality verification; cost-effective pipeline for generating millions of captions
Open source
Qwen2.5-VL-7B
Strong captioning quality in a deployable 7B model; handles diverse image types including documents, charts, and natural scenes
Lightweight / edge
PaliGemma-2-3B
Google's 3B captioning model optimized for fast inference; suitable for mobile and edge deployment
What's Next
Captioning will evolve toward interactive and grounded descriptions — captions that reference specific image regions, adapt to user preferences, and update dynamically as images change. Expect tighter integration with accessibility tools (real-time scene description for visually impaired users), automated alt-text generation as a browser/OS-level feature, and culturally aware captioning that adapts descriptions for different audiences. Dense captioning with spatial grounding (every object described with coordinates) will become standard for training data generation.
Benchmarks & SOTA
COCO Captions
330K images with 5 captions each. Standard benchmark for image captioning.
State of the Art: BLIP-2 (Salesforce), 145.8 CIDEr
NoCaps (Novel Object Captioning at Scale)
15K validation images from Open Images with 166K captions. Tests zero-shot generalization to novel objects not seen during captioning model training.
State of the Art: no results tracked yet
Related Tasks
Audio-Text-to-Text
Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.
Image-Text-to-Image
Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.
Image-Text-to-Text
Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF produces models that refuse to hallucinate about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.
Image-Text-to-Video
Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.