Why the references come in fives.
Every image has more than one correct caption. A kitchen scene can truthfully be described as “a person slicing vegetables”, “hands chopping herbs”, or “overhead shot of meal prep” — none is wrong; each emphasises different salient objects. This is why every canonical captioning dataset ships with five or more human references per image, and why the evaluation metrics are all some flavour of consensus.
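The consensus idea can be made concrete with a toy scorer (this is an illustrative sketch, not any standard metric): a candidate caption is compared against every reference, so wording that many annotators independently chose counts for more than an equally true but idiosyncratic phrasing.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def consensus_score(candidate: str, references: list[str]) -> float:
    # Average over all references rather than picking one "correct" caption:
    # each reference is a valid description, so agreement with several of
    # them is stronger evidence than perfect agreement with just one.
    return sum(unigram_precision(candidate, r) for r in references) / len(references)

refs = [
    "a person slicing vegetables",
    "hands chopping herbs",
    "overhead shot of meal prep",
]
score = consensus_score("a person chopping vegetables", refs)  # ≈ 0.33
```

Real metrics (BLEU, METEOR, CIDEr) are far more sophisticated, but they share this shape: one candidate, many references, some aggregation across them.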
Captioning has quietly become one of the most load-bearing capabilities in production AI, even when nobody calls it captioning. Accessibility alt text, content moderation, visual search indexing, e-commerce product descriptions, caption-then-retrieve pipelines for multimodal RAG — all of it is image captioning, dressed in other names.
And yet the benchmarks have not caught up. COCO Captions was designed in 2015 for specialist encoder–decoder models trained from scratch; frontier 2026 VLMs trained on internet-scale image-text pairs routinely saturate it zero-shot. At CIDEr scores above 145, the number is measuring dataset leakage more than model quality. The honest signal has moved to NoCaps (novel objects), long-form description benchmarks, and qualitative evals — attribute correctness, relation grounding, hallucination rate — that n-gram metrics cannot capture.
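For intuition about what CIDEr actually rewards, here is a stripped-down unigram sketch of its core mechanic: TF-IDF-weighted n-gram cosine similarity against the references. This is not the real metric — actual CIDEr uses 1–4-grams, stemming, a length penalty, and a ×10 scaling — but it shows why rare content words dominate the score and why memorised phrasing can inflate it.

```python
import math
from collections import Counter

def tfidf_vector(tokens: list[str], doc_freq: Counter, num_docs: int) -> dict:
    """TF-IDF weights: words common across the corpus are down-weighted."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(num_docs / (1 + doc_freq[w]))
            for w, c in tf.items()}

def cosine(u: dict, v: dict) -> float:
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate: str, references: list[str], corpus: list[str]) -> float:
    # doc_freq counts how many corpus captions contain each word, so generic
    # words ("a", "photo") carry little weight and content words carry the
    # score -- the TF-IDF idea at the heart of CIDEr.
    doc_freq = Counter(w for cap in corpus for w in set(cap.split()))
    n = len(corpus)
    cand_vec = tfidf_vector(candidate.split(), doc_freq, n)
    return sum(cosine(cand_vec, tfidf_vector(r.split(), doc_freq, n))
               for r in references) / len(references)
```

The leakage failure mode follows directly: a model that has memorised the reference distribution can reproduce exactly the high-IDF words the scorer rewards, without demonstrating any grounding in the image.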
What the metric misses: hallucination. VLMs invent non-existent objects most often in cluttered scenes, and the standard captioning metrics won’t flag it — a confidently wrong caption can score much like a confidently right one. The CHAIR metric (Caption Hallucination Assessment with Image Relevance) is worth adding if your use case is accessibility or moderation.
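CHAIR's core computation is simple enough to sketch. The real metric maps synonyms onto MSCOCO's 80 object categories; this simplified version matches exact words against a known object vocabulary, then reports the instance-level rate (CHAIRi, hallucinated mentions over all object mentions) and the sentence-level rate (CHAIRs, fraction of captions with at least one hallucination).

```python
def chair_scores(captions: list[str],
                 gt_objects: list[set],
                 vocab: set) -> tuple[float, float]:
    # vocab: the object words the metric knows about. Real CHAIR maps
    # synonyms to MSCOCO's 80 categories; this sketch matches exact tokens.
    hallucinated = mentioned = flagged = 0
    for cap, truth in zip(captions, gt_objects):
        mentions = [w for w in cap.lower().split() if w in vocab]
        bad = [w for w in mentions if w not in truth]
        mentioned += len(mentions)
        hallucinated += len(bad)
        flagged += bool(bad)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = flagged / len(captions)
    return chair_i, chair_s

# Hypothetical example: "frisbee" is mentioned but absent from the image.
captions = ["a dog on a couch with a frisbee", "a cat sleeping"]
gt = [{"dog", "couch"}, {"cat", "bed"}]
vocab = {"dog", "couch", "frisbee", "cat", "bed"}
ci, cs = chair_scores(captions, gt, vocab)  # ci = 0.25, cs = 0.5
```

Lower is better on both numbers, which makes CHAIR a useful complement to similarity metrics: a caption can score well on CIDEr while one in four of its object mentions is invented.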