Image Captioning
Generate a natural-language description of an image. Once a dedicated subfield with custom encoder-decoder architectures, now a side effect of any frontier vision-language model (VLM) — and the textbook example of a task that modern VLMs have quietly solved.
What it looks like
Image captioning is deceptively hard because every image has multiple “correct” descriptions. Here are three examples showing how the same scene gets different valid captions — and why the evaluation metrics that follow have to account for that variance.
A busy kitchen scene
1. “A person slicing vegetables on a wooden cutting board.”
2. “Hands chopping fresh herbs next to a bowl of tomatoes.”
3. “Overhead shot of meal prep with colorful ingredients.”
Three valid captions — none is ‘wrong’; each emphasizes different salient objects. This is why captioning benchmarks use 5+ human references per image.
A cityscape at dusk
1. “A city skyline reflected in the water at dusk.”
2. “Tall buildings along a waterfront during the blue hour.”
3. “An illuminated skyline with lights shining on a river.”
Scenes with many valid paraphrases, like skylines, favor semantic metrics (SPICE) over n-gram metrics (BLEU): a model that says ‘tall buildings at night’ should score similarly to one saying ‘skyline at dusk’.
A dog at the beach
1. “A golden retriever standing on a sandy beach.”
2. “A brown dog near the waves on an empty shore.”
3. “A happy puppy looking at the camera by the sea.”
Notice the breed → color drift (‘golden retriever’ → ‘brown dog’). Good captioning systems get the high-salience noun right; great ones also get attributes and relations.
Why it still matters
Captioning has quietly become one of the most load-bearing capabilities in production AI — even when nobody calls it “captioning.” If your product does any of the following, you’re shipping an image captioner:
- Accessibility: alt text for the visually impaired.
- Content moderation: describe images before running policy checks.
- Search: index visual content with natural-language tags.
- E-commerce: auto-generate product descriptions from photos.
- Multimodal RAG: caption-then-retrieve pipelines for document AI.
The benchmarks haven’t caught up with what modern VLMs can do. COCO Captions — the canonical dataset — was designed in 2015 for specialist encoder-decoder models trained from scratch. Frontier 2026 VLMs trained on internet-scale image-text pairs routinely saturate it zero-shot.
That’s why the interesting action has moved to NoCaps (novel objects), long-form description benchmarks, and qualitative evals (attribute correctness, relation grounding, hallucination rate) that pure CIDEr scores can’t capture.
The metrics
Evaluation is the real bottleneck in image captioning. Here are the four metrics you’ll see in every paper, and when each one is load-bearing.
CIDEr
Consensus-based Image Description Evaluation · 0–10+ (higher is better)
TF-IDF weighted n-gram similarity against multiple references.
The dominant captioning metric since 2015. Rewards matching rare, image-specific words and down-weights generic filler like ‘a photo of’.
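To make the mechanism concrete, here is a minimal sketch of the TF-IDF weighted n-gram similarity at CIDEr’s core. It is illustrative, not the official scorer: real evaluation should use the reference implementation in pycocoevalcap, which averages over n = 1–4, scales by 10, and (in CIDEr-D) adds clipping and a length penalty. `doc_freq` (n-gram document frequencies over the reference corpus) and `num_images` are inputs you would precompute.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf(counts, doc_freq, num_images):
    # Rare n-grams (low document frequency) get high IDF weight;
    # generic filler like "a photo of" gets weight near zero.
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return dot / (norm(u) * norm(v))

def cider_n(candidate, references, doc_freq, num_images, n=4):
    """Average cosine similarity of TF-IDF weighted n-gram vectors
    between the candidate and each reference caption."""
    cand = tfidf(ngrams(candidate.split(), n), doc_freq, num_images)
    refs = [tfidf(ngrams(r.split(), n), doc_freq, num_images) for r in references]
    return sum(cosine(cand, r) for r in refs) / len(refs)
```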
SPICE
Semantic Propositional Image Caption Evaluation · 0.0–1.0
F1 over semantic scene graphs (objects, attributes, relations) extracted from captions.
Closer to how humans judge captions: ‘a red car next to a tree’ matches even if word order differs. Complements CIDEr.
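A hedged sketch of the scoring step, assuming the semantic tuples have already been extracted. The real SPICE does that extraction with a dependency parser and scene-graph rules (implemented in Java), and matches tuples via WordNet synonyms rather than exact equality:

```python
def spice_f1(candidate_tuples, reference_tuples):
    """F1 over semantic proposition tuples: objects ("car",),
    attributes ("car", "red"), relations ("car", "next to", "tree")."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# "a red car next to a tree" vs. "a tree beside a red car" — identical graphs:
cand = [("car",), ("tree",), ("car", "red"), ("car", "next to", "tree")]
ref  = [("car",), ("tree",), ("car", "red"), ("car", "next to", "tree")]
print(spice_f1(cand, ref))  # 1.0; word order never entered the comparison
```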
BLEU-4
Bilingual Evaluation Understudy (4-gram) · 0.0–1.0
Geometric mean of modified 1- to 4-gram precisions against reference captions, multiplied by a brevity penalty.
Borrowed from machine translation. Strict and n-gram based, so it undervalues paraphrasing. Still reported for historical comparability.
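For example, with NLTK (one of several implementations; papers usually report corpus-level BLEU via `corpus_bleu`, and smoothing choices shift the number):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a person slicing vegetables on a wooden cutting board".split(),
    "hands chopping fresh herbs next to a bowl of tomatoes".split(),
]
candidate = "a person chopping vegetables on a cutting board".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions plus brevity penalty.
# Smoothing avoids a hard zero when no 4-gram matches any reference.
score = sentence_bleu(
    references, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```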
METEOR
Metric for Evaluation of Translation with Explicit ORdering · 0.0–1.0
Harmonic mean of unigram precision/recall, with stemming and synonym matching.
More forgiving of lexical variation than BLEU. Correlates moderately well with human judgment on captions.
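NLTK ships an implementation here too (scores vary slightly across implementations, and recent NLTK versions expect pre-tokenized input):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # synonym matching needs WordNet
nltk.download("omw-1.4", quiet=True)

references = [
    "a golden retriever standing on a sandy beach".split(),
    "a brown dog near the waves on an empty shore".split(),
]
candidate = "a dog stands on the sand by the ocean".split()

print(f"METEOR: {meteor_score(references, candidate):.3f}")
```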
2026 leaders on COCO Captions
Frontier VLMs have pushed COCO CIDEr past 145 — once considered superhuman. At this point the score is more a measurement of dataset leakage than of model quality. Use NoCaps for honest evaluation.
| Model | Provider | COCO CIDEr |
|---|---|---|
| GPT-5.2 | OpenAI | 150+ |
| Claude Opus 4.6 | Anthropic | 148+ |
| Gemini 3 Pro | Google | 147+ |
| Qwen2.5-VL-72B | Alibaba | 143 |
| InternVL 3 | Shanghai AI Lab | 140 |
| LLaVA-OneVision | ByteDance / Academic | 136 |
Scores are approximate — frontier labs no longer optimize for COCO and rarely report exact numbers. Treat the ordering, not the decimals, as the signal.
Datasets
COCO Captions
The canonical captioning dataset. Each image has 5 human-written reference captions. Karpathy split (5K val + 5K test) remains the standard reporting protocol.
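A sketch of pulling those references with pycocotools, assuming you have downloaded the standard annotation file from cocodataset.org (the path is a placeholder):

```python
from pycocotools.coco import COCO

# Placeholder path; download annotations_trainval2014 from cocodataset.org.
coco = COCO("annotations/captions_val2014.json")

img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])  # typically 5 human references per image
```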
NoCaps
Novel Object Captioning at Scale. Validation images from Open Images containing objects not present in COCO training data — measures zero-shot generalization. Critical for evaluating VLMs, with the caveat that an honest zero-shot number requires the novel objects not to have appeared in pretraining data.
Flickr30k
Older but still reported alongside COCO for comparability. 5 captions per image sourced from Flickr photo descriptions.
Practical tips for 2026
Don’t train a specialist model. Unless you have a domain so niche that frontier VLMs fail (medical imaging, satellite, specialized manufacturing), use an off-the-shelf VLM with a good prompt and spend your budget on evaluation instead.
Prompt for the trade-off you want. “Describe this image in one sentence” produces COCO-style output. “Describe everything you see, including spatial relationships” produces dense captions suitable for accessibility or retrieval.
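A sketch using the OpenAI Python SDK's image input; the model name and image URL are placeholders, so swap in whichever VLM and client you actually use:

```python
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/kitchen.jpg"  # placeholder

def caption(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any captioning-capable VLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

short = caption("Describe this image in one sentence.")  # COCO-style
dense = caption("Describe everything you see, including spatial "
                "relationships between objects.")        # dense caption
```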
Watch for hallucinations. VLMs hallucinate non-existent objects more often on cluttered scenes. The CHAIR metric (Caption Hallucination Assessment with Image Relevance) is worth adding to your eval loop if your use case is accessibility or moderation.
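A minimal sketch of the two CHAIR variants, assuming object mentions have already been extracted from each caption and mapped to a canonical vocabulary (the reference implementation does this against COCO's 80 object categories with a synonym list):

```python
def chair(captions_objects, ground_truth_objects):
    """CHAIR-i: hallucinated object mentions / all object mentions.
    CHAIR-s: fraction of captions with at least one hallucination.
    Inputs: per-caption lists of mentioned objects, and the set of
    objects actually present in the corresponding image."""
    total = hallucinated = bad_captions = 0
    for mentioned, present in zip(captions_objects, ground_truth_objects):
        halluc = [obj for obj in mentioned if obj not in present]
        total += len(mentioned)
        hallucinated += len(halluc)
        bad_captions += bool(halluc)
    chair_i = hallucinated / max(total, 1)
    chair_s = bad_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# The caption mentioned a frisbee that isn't in the image:
print(chair([["dog", "beach", "frisbee"]], [{"dog", "beach", "wave"}]))
# (0.333..., 1.0)
```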
Cache aggressively. Caption generation is 10-50× more expensive per image than CLIP-style embedding. For search/retrieval use cases, embed first and only caption the top-k hits on demand.
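A sketch of that pipeline shape. `embed_text`, `embed_image`, and `caption_image` are hypothetical stand-ins for your CLIP-style encoder and VLM call, and the dict cache stands in for whatever store you run in production:

```python
import hashlib
import math

caption_cache: dict[str, str] = {}  # stand-in for Redis/SQLite/etc.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def search_then_caption(query: str, images: list[bytes], k: int = 5):
    """Embed everything cheaply, caption only the top-k hits, and never
    caption the same image twice. embed_text/embed_image/caption_image
    are hypothetical stand-ins for your encoder + VLM."""
    q = embed_text(query)  # cheap, embedding-based ranking
    ranked = sorted(images, key=lambda im: -cosine(q, embed_image(im)))
    results = []
    for im in ranked[:k]:
        key = hashlib.sha256(im).hexdigest()
        if key not in caption_cache:
            caption_cache[key] = caption_image(im)  # the 10-50x call
        results.append((im, caption_cache[key]))
    return results
```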