
Image Captioning

Image captioning — generating natural language descriptions of images — was the task that launched the modern vision-language era when Show and Tell (2015) paired CNNs with RNNs. The field progressed through BLIP, BLIP-2, and CoCa, each improving grounding and descriptive richness, until multimodal LLMs effectively subsumed it as a special case of image-text-to-text. COCO Captions and NoCaps remain standard benchmarks, but CIDEr and SPICE scores have largely saturated — the real frontier is dense captioning, generating paragraph-level descriptions that capture spatial relationships, attributes, and background context that brief captions miss. Captioning's importance now lies more in its role as training signal for other vision-language tasks than as a standalone evaluation.


Image captioning models generate natural language descriptions of images, ranging from brief one-line summaries to detailed paragraphs covering objects, relationships, actions, and scene context. Once a flagship vision-language task, captioning is now a commodity capability of any VLM but remains critical for accessibility, search indexing, and training data generation.

History

2015

Show and Tell (Vinyals et al., Google) applies the encoder-decoder paradigm to image captioning, establishing CNN+LSTM as the standard architecture; Show, Attend and Tell (Xu et al.) adds visual attention the same year

2016

MS COCO Captioning Challenge drives rapid progress; attention mechanisms become standard

2020

Oscar (Microsoft) uses object tags as anchor points between vision and language, achieving SOTA on COCO Captions

2022

BLIP (Salesforce) introduces bootstrapped language-image pretraining with captioning and filtering, generating high-quality synthetic captions at scale

2022

CoCa (Google) unifies contrastive and captioning objectives in a single encoder-decoder model

2023

BLIP-2 achieves 145.8 CIDEr on COCO Captions with a frozen ViT + Q-Former + frozen LLM architecture

2023

LLaVA and InstructBLIP show that instruction-tuned VLMs generate richer, more detailed captions than task-specific models

2024

Frontier VLMs (GPT-4o, Claude 3.5, Gemini) produce paragraph-length captions with contextual understanding, scene reasoning, and cultural references

2025

Captioning shifts from a task to a capability — every VLM does it; focus moves to controllable detail level, factual accuracy, and domain-specific captioning

How Image Captioning Works

Image Captioning Pipeline
1. Visual Feature Extraction

The image is encoded by a vision model (ViT, ConvNeXt, or SigLIP encoder) into a sequence of visual features representing patches or regions of the image.
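As a minimal sketch of this step (NumPy only; the random projection stands in for a trained patch-embedding layer, and real ViTs also add position embeddings and often a class token):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an H x W x C image into non-overlapping flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    return (
        image.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
             .transpose(0, 2, 1, 3, 4)          # group pixels by patch
             .reshape(-1, patch_size * patch_size * c)
    )

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))               # dummy 224x224 RGB image
patches = patchify(image)                        # (196, 768): 14x14 patches
proj = rng.standard_normal((768, 512)) * 0.02    # stand-in for learned embedding
features = patches @ proj                        # (196, 512) visual token sequence
```

The output is the sequence of visual tokens that the rest of the pipeline consumes.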

2. Feature Bridging

Visual features are projected into the language model's embedding space. Methods range from simple linear projections (LLaVA) to learned queries (BLIP-2's Q-Former) that compress visual information into a fixed number of tokens.
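Both bridging styles can be sketched in a few lines of NumPy (random weights stand in for trained parameters; the function names are illustrative). The key contrast: a linear projection keeps one token per patch, while learned queries compress to a fixed count:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridge_linear(visual, W):
    """LLaVA-style: project every patch feature into the LM embedding space."""
    return visual @ W                                # (N, d_lm): token count unchanged

def bridge_queries(visual, queries, Wk, Wv):
    """Q-Former-style: K learned queries cross-attend over N patch features,
    compressing them to K tokens regardless of image resolution."""
    keys, values = visual @ Wk, visual @ Wv          # (N, d)
    attn = softmax(queries @ keys.T / np.sqrt(queries.shape[-1]))  # (K, N)
    return attn @ values                             # (K, d)

rng = np.random.default_rng(0)
visual = rng.standard_normal((196, 512))             # patch features from the encoder
W = rng.standard_normal((512, 4096)) * 0.02          # linear projection to LM width
queries = rng.standard_normal((32, 256))             # 32 learned query vectors
Wk = rng.standard_normal((512, 256)) * 0.02
Wv = rng.standard_normal((512, 256)) * 0.02

linear_tokens = bridge_linear(visual, W)             # (196, 4096)
query_tokens = bridge_queries(visual, queries, Wk, Wv)  # (32, 256)
```

The fixed 32-token output is why Q-Former-style bridges keep LM context cost constant even for high-resolution inputs.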

3. Caption Generation

An autoregressive language model generates the caption token by token, attending to both the visual features and previously generated text. Beam search or nucleus sampling controls output diversity.
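The decoding loop itself is simple; a greedy sketch (a toy deterministic scoring function replaces a trained language model, so the example runs without weights):

```python
import numpy as np

def generate_caption(visual_tokens, step_fn, bos=0, eos=1, max_len=12):
    """Greedy autoregressive decoding: at each step the model scores the next
    token given the visual tokens and the tokens emitted so far."""
    tokens = [bos]
    for _ in range(max_len):
        logits = step_fn(visual_tokens, tokens)      # (vocab,)
        nxt = int(np.argmax(logits))                 # greedy; swap in beam/nucleus here
        tokens.append(nxt)
        if nxt == eos:
            break
    return tokens

# Toy stand-in for a trained captioner: deterministic logits from the visual
# features and the current caption length (purely illustrative).
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 100))

def toy_step(visual_tokens, tokens):
    ctx = visual_tokens.mean(0)[:32] + len(tokens)
    return ctx @ W

caption_ids = generate_caption(rng.standard_normal((196, 512)), toy_step)
```

Beam search replaces the single `argmax` with the top-k continuations of several kept hypotheses; nucleus sampling instead samples from the smallest token set whose probability mass exceeds a threshold.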

4. Optional Refinement

Post-processing may include factual grounding (checking generated descriptions against detected objects), length control, and style adaptation (alt-text vs. detailed description vs. artistic interpretation).
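A crude version of the grounding check can be sketched as word-level set difference against detector output (the object vocabulary here is illustrative; real systems match against synonym lexicons, as in the CHAIR hallucination metric, rather than exact strings):

```python
def grounding_check(caption, detected_objects):
    """Flag caption words naming objects the detector did not find in the image."""
    vocab = {"dog", "cat", "ball", "car", "person", "tree", "frisbee"}  # illustrative
    mentioned = {w.strip(".,").lower() for w in caption.split()} & vocab
    return sorted(mentioned - set(detected_objects))

hallucinated = grounding_check(
    "A dog chasing a frisbee near a car.",
    detected_objects=["dog", "frisbee"],
)
# hallucinated == ["car"]
```

Flagged captions can be regenerated, edited, or dropped depending on the application.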

Current Landscape

Image captioning in 2025 is a solved commodity task for standard use cases — any modern VLM produces captions that are generally accurate and fluent. The frontier has moved to controllable captioning (specify detail level, style, and focus), domain-specific captioning (medical images, satellite imagery, scientific figures), and dense captioning (describing every region of an image with bounding boxes). Captioning's biggest impact is now behind the scenes: synthetic caption generation (à la BLIP's CapFilt) is the backbone of training data pipelines for both vision-language models and text-to-image generators. The quality of these synthetic captions directly determines downstream model quality.
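The caption-then-filter bootstrap reduces to a short loop; a sketch with stand-in functions (`captioner` and `scorer` are placeholders for a captioning model and an image-text matching head, and the threshold is illustrative):

```python
def capfilt(images, captioner, scorer, threshold=0.3):
    """CapFilt-style bootstrapping: generate a synthetic caption per image,
    keep only pairs the image-text matcher scores above a threshold."""
    kept = []
    for img in images:
        caption = captioner(img)
        if scorer(img, caption) >= threshold:
            kept.append((img, caption))
    return kept

# Toy run with fake images and scoring, purely to show the control flow.
images = ["img_a", "img_b", "img_c"]
toy_captioner = lambda img: f"a photo of {img}"
toy_scorer = lambda img, cap: 0.1 if img == "img_b" else 0.9
pairs = capfilt(images, toy_captioner, toy_scorer)   # img_b filtered out
```

In production the filter threshold trades caption yield against pair quality, which is why the filtering model matters as much as the captioner.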

Key Challenges

Hallucination — captioning models frequently mention objects, attributes, or relationships not present in the image

Detail level control — users want different caption styles (brief alt-text vs. exhaustive description) and models struggle to calibrate appropriately

Factual accuracy — captions may misidentify species, landmarks, brands, or cultural artifacts without grounding in external knowledge

Bias — captioning models reflect dataset biases in gender, ethnicity, and cultural representation

Evaluation — CIDEr and BLEU correlate poorly with human judgment of caption quality; human evaluation remains the gold standard
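The evaluation problem is easy to demonstrate with a toy surface-overlap score (in the spirit of unigram BLEU; real metrics add n-grams and TF-IDF weighting but share the failure mode):

```python
def unigram_f1(candidate, reference):
    """Toy word-overlap F1: rewards shared surface words, not shared meaning."""
    c, r = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(c), overlap / len(r)
    return 2 * p * rec / (p + rec)

reference  = "a man is riding a brown horse"
wrong      = "a man is riding a brown bike"   # factually wrong, high overlap
paraphrase = "someone on horseback"           # correct, zero word overlap

# The wrong caption outscores the correct paraphrase under surface overlap.
```

This is why a factually wrong caption can outscore a correct paraphrase, and why human evaluation persists despite its cost.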

Quick Recommendations

Best overall

GPT-4o

Generates the most detailed, contextually aware captions with minimal hallucination; understands cultural and situational context

Best for accessibility (alt text)

Claude 3.5 Sonnet

Excels at concise, accurate descriptions suitable for screen readers; follows alt-text best practices when prompted

Best for training data generation

BLIP-2 + Qwen2.5-VL-7B

BLIP-2 for fast bulk captioning, VLM for quality verification; cost-effective pipeline for generating millions of captions

Open source

Qwen2.5-VL-7B

Strong captioning quality in a deployable 7B model; handles diverse image types including documents, charts, and natural scenes

Lightweight / edge

PaliGemma-2-3B

Google's 3B captioning model optimized for fast inference; suitable for mobile and edge deployment

What's Next

Captioning will evolve toward interactive and grounded descriptions — captions that reference specific image regions, adapt to user preferences, and update dynamically as images change. Expect tighter integration with accessibility tools (real-time scene description for visually impaired users), automated alt-text generation as a browser/OS-level feature, and culturally aware captioning that adapts descriptions for different audiences. Dense captioning with spatial grounding (every object described with coordinates) will become standard for training data generation.

Benchmarks & SOTA

Related Tasks

Audio-Text-to-Text

Audio-text-to-text is the backbone of voice assistants that actually understand context — models that jointly process speech and text to generate grounded responses. Whisper (2022) cracked robust transcription, but the real leap came when Gemini 1.5 and GPT-4o (2024) began reasoning natively over audio tokens alongside text, eliminating the lossy ASR-then-LLM pipeline. The key challenges are handling overlapping speakers, noisy environments, and preserving prosodic cues like sarcasm or hesitation that pure transcription destroys. Benchmarks like SUPERB and Dynamic-SUPERB are expanding, but real-world spoken dialogue understanding remains far ahead of what leaderboards capture.

Image-Text-to-Image

Image-text-to-image covers instruction-guided image editing — taking a source image plus a text command and producing a modified result. InstructPix2Pix (2023) demonstrated this could work zero-shot, and subsequent models like DALL-E 3's inpainting, Ideogram, and Stable Diffusion's img2img pipelines made it practical. The core difficulty is surgical precision: users want "change the dress to red" without altering the face, background, or lighting, which requires disentangling content from style at a level current architectures still fumble. Benchmarks are fragmented across editing fidelity, instruction-following, and identity preservation, making unified comparison difficult.

Image-Text-to-Text

Image-text-to-text exploded from a research curiosity to the dominant AI interface in under two years. GPT-4V (2023) proved multimodal LLMs could reason over images, Gemini 1.5 scaled to million-token contexts mixing text and vision, and Claude 3 showed that careful RLHF markedly reduces hallucination about image content. MMMU and MMBench have become the standard evaluation gauntlet, but the real challenge is grounding — models still confabulate spatial relationships and struggle with fine-grained visual reasoning. This is the task that turned chatbots into visual assistants.

Image-Text-to-Video

Image-text-to-video is generative AI's hardest unsolved frontier — animating a still image according to a text prompt while maintaining temporal coherence and physical plausibility. Stable Video Diffusion (2023) and Runway Gen-2 showed early promise, Sora (2024) raised the bar dramatically with minute-long physically consistent clips, and Kling and Veo 2 pushed quality further. The fundamental challenge is that video generation requires implicit world models: objects must persist, lighting must evolve consistently, and motion must obey approximate physics across dozens of frames. Evaluation is still largely human-judged, with FVD and CLIP-temporal scores poorly correlating with perceived quality.

Image Captioning Benchmarks - Multimodal - CodeSOTA