Codesota · Registry · Computer vision · Image captioning · 6 models · 3 datasets · 4 metrics
§ 00 · Task

Image captioning,
by the numbers.

Generate a natural-language description of an image. Once a dedicated subfield with custom encoder–decoder architectures; now a side-effect of any frontier vision-language model — and the textbook example of a task modern VLMs have quietly saturated.

A CIDEr around 145 marks the effective ceiling of the canonical benchmark. Past that, the interesting signal lives in NoCaps (novel objects), long-form description, and qualitative hallucination audits.

§ 01 · Leaderboard

2026 leaders on COCO Captions.

Sorted by CIDEr · Karpathy split
# · Model · Provider · COCO CIDEr · Notes · Date
01 · GPT-5.2 · OpenAI · 150+ · Zero-shot via vision-language API; no fine-tuning on COCO. · Dec 2025
02 · Claude Opus 4.6 · Anthropic · 148+ · Leads on NoCaps novel-object generalisation. · Feb 2026
03 · Gemini 3 Pro · Google · 147+ · Strong zero-shot; tends to over-describe. · Nov 2025
04 · Qwen2.5-VL-72B · Alibaba · 143 · Best open-weights VLM for captioning. · 2025
05 · InternVL 3 · Shanghai AI Lab · 140 · Strong on fine-grained objects. · 2025
06 · LLaVA-OneVision · ByteDance / Academic · 136 · Baseline open-source reference model. · 2024
Fig 01 · Scores are approximate — frontier labs no longer optimise for COCO and rarely report exact decimals. Treat the ordering, not the digits, as signal; use NoCaps for honest zero-shot evaluation.
§ 02 · The task

Why the references come in fives.

Every image has more than one correct caption. A kitchen scene can truthfully be described as “a person slicing vegetables”, “hands chopping herbs”, or “overhead shot of meal prep” — none is wrong; each emphasises different salient objects. This is why every canonical captioning dataset ships with five or more human references per image, and why the evaluation metrics are all some flavour of consensus.

Captioning has quietly become one of the most load-bearing capabilities in production AI, even when nobody calls it captioning. Accessibility alt text, content moderation, visual search indexing, e-commerce product descriptions, caption-then-retrieve pipelines for multimodal RAG — all of it is image captioning, dressed in other names.

And yet the benchmarks have not caught up. COCO Captions was designed in 2015 for specialist encoder–decoder models trained from scratch; frontier 2026 VLMs trained on internet-scale image-text pairs routinely saturate it in zero-shot. At CIDEr above 145 the number is measuring dataset leakage more than model quality. The honest signal has moved to NoCaps (novel objects), long-form description benchmarks, and qualitative evals — attribute correctness, relation grounding, hallucination rate — that n-gram metrics cannot capture.

What the metric misses: hallucination. VLMs invent non-existent objects most often on cluttered scenes, and the standard captioning metrics won't flag it: a confidently wrong caption can score similarly to a confidently right one. The CHAIR metric (Caption Hallucination Assessment with Image Relevance) is worth adding if your use case is accessibility or moderation.
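The per-instance variant, CHAIR-i, is just the fraction of object mentions in a caption that don't exist in the image's annotated object set. A minimal sketch, assuming object mentions have already been extracted from the caption and normalised to the dataset's category vocabulary (the real metric does this synonym mapping against MSCOCO categories; the function name and example objects here are illustrative):

```python
def chair_i(caption_objects, image_objects):
    """CHAIR-i sketch: fraction of object mentions in the caption that do
    not appear in the image's ground-truth object set (lower is better)."""
    mentions = list(caption_objects)
    if not mentions:
        return 0.0  # no object mentions, nothing to hallucinate
    hallucinated = [obj for obj in mentions if obj not in image_objects]
    return len(hallucinated) / len(mentions)

# Caption mentions a dog that is not in the annotated scene:
score = chair_i(["person", "knife", "dog"], {"person", "knife", "vegetable"})
# → 1/3: one of three mentioned objects is hallucinated
```

CHAIR also has a sentence-level variant (CHAIR-s: fraction of captions with at least one hallucinated object), which is the number usually quoted in model cards.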

§ 03 · Benchmarks

The canonical three, in plain view.

COCO is saturated. NoCaps is the honest test. Flickr30k is historical context, not a measurement of 2026 capability.
# · Dataset · Scale · Year
01 · COCO Captions · 330K images, 1.5M captions · 2015
The canonical captioning dataset. Each image has five human-written reference captions. Karpathy split (5K val + 5K test) remains the standard reporting protocol.
02 · NoCaps · 15K images, 166K captions · 2019
Novel Object Captioning at Scale. Validation images from Open Images containing objects NOT in COCO training data — measures zero-shot generalisation. Critical for evaluating VLMs, which shouldn't have seen the specific novel objects during pretraining.
03 · Flickr30k · 31K images, 158K captions · 2014
Older but still reported alongside COCO for comparability. Five captions per image sourced from Flickr photo descriptions.
§ 04 · Methodology

The four metrics, defined.

Evaluation is the real bottleneck in image captioning. Every paper reports some combination of the four below; each is load-bearing on a different axis.

CIDEr
0 – 10+ (higher is better; leaderboard scores report CIDEr × 100, so 1.45 appears as 145)
Consensus-based Image Description Evaluation
TF-IDF weighted n-gram similarity against multiple references.
The dominant captioning metric since 2015. Rewards rare, image-specific words and penalises generic filler like “a photo of”.
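The core of CIDEr fits in a few lines: weight each n-gram by TF-IDF so that corpus-wide filler scores near zero, then take cosine similarity between candidate and reference vectors, averaged over the references. A simplified single-n sketch (the real metric averages n = 1..4, clips counts, and applies a length penalty; `cider_n` is an illustrative name, not a library API):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def cider_n(candidate, references, corpus_refs, n=4):
    """Simplified single-n CIDEr: TF-IDF weighted n-gram cosine similarity,
    averaged over the image's reference captions."""
    # Document frequency: in how many images' reference sets each n-gram appears.
    df = Counter()
    for refs in corpus_refs:
        seen = set()
        for ref in refs:
            seen |= set(ngrams(ref.split(), n))
        df.update(seen)
    num_images = len(corpus_refs)

    def tfidf(counts):
        # Rare, image-specific n-grams get high weight; generic filler gets ~0.
        return {g: c * math.log(num_images / max(df[g], 1)) for g, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(g, 0.0) for g, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    cand_vec = tfidf(ngrams(candidate.split(), n))
    ref_vecs = [tfidf(ngrams(r.split(), n)) for r in references]
    return sum(cosine(cand_vec, rv) for rv in ref_vecs) / len(ref_vecs)
```

Note that the IDF table is built over the whole evaluation corpus, which is why CIDEr is a corpus-level metric: a caption's score depends on how rare its words are across every image being evaluated.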
SPICE
0.0 – 1.0
Semantic Propositional Image Caption Evaluation
F1 over semantic scene graphs (objects, attributes, relations) extracted from captions.
Closer to how humans judge captions: “a red car next to a tree” matches even if word order differs. Complements CIDEr.
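The scoring step of SPICE, once the hard part (parsing captions into scene-graph tuples) is done, is a plain F1 over proposition sets. A sketch assuming the tuples are already extracted; `spice_f1` and the example tuples are illustrative, and the real metric also does WordNet synonym matching when comparing tuples:

```python
def spice_f1(candidate_tuples, reference_tuples):
    """SPICE scoring step: F1 over semantic proposition tuples
    (objects, attribute pairs, relation triples)."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    tp = len(cand & ref)  # propositions asserted by both candidate and references
    if tp == 0:
        return 0.0
    precision, recall = tp / len(cand), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

# "a red car next to a tree" as propositions, vs. the references' union graph:
cand = [("car",), ("tree",), ("car", "red"), ("car", "next to", "tree")]
refs = [("car",), ("tree",), ("car", "red"), ("tree", "tall"),
        ("car", "next to", "tree")]
# precision 4/4, recall 4/5 → F1 ≈ 0.889
```

Because matching happens at the proposition level, "a red car next to a tree" and "a tree beside a red car" can produce identical tuple sets, which is exactly the word-order insensitivity described above.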
BLEU-4
0.0 – 1.0
Bilingual Evaluation Understudy (4-gram)
Precision of 4-grams compared against reference captions.
Borrowed from machine translation. Strict and n-gram based, so it undervalues paraphrasing. Still reported for historical comparability.
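Sentence-level BLEU-4 can be sketched directly from its definition: clipped n-gram precisions for n = 1..4, combined by geometric mean, times a brevity penalty against the closest-length reference. A minimal version without smoothing (production evaluators aggregate counts at corpus level; `bleu4` is an illustrative name):

```python
import math
from collections import Counter

def bleu4(candidate, references):
    """Sentence-level BLEU-4 sketch: geometric mean of clipped 1-4-gram
    precisions times a brevity penalty. No smoothing, so any caption with
    zero matching n-grams at some order scores 0."""
    cand = candidate.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, 5):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        if not cand_ngrams:
            return 0.0  # candidate shorter than n tokens
        # Clip each candidate n-gram count by its max count in any single reference.
        max_ref = Counter()
        for ref in refs:
            rc = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
            for g, c in rc.items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_ngrams.items())
        if clipped == 0:
            return 0.0
        precisions.append(clipped / sum(cand_ngrams.values()))
    # Brevity penalty against the reference closest in length to the candidate.
    closest = min(refs, key=lambda r: abs(len(r) - len(cand)))
    bp = 1.0 if len(cand) > len(closest) else math.exp(1 - len(closest) / len(cand))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)
```

The hard zero on any missing n-gram order is the strictness mentioned above: a perfectly valid paraphrase that shares no 4-gram with any reference scores 0.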
METEOR
0.0 – 1.0
Metric for Evaluation of Translation with Explicit ORdering
Harmonic mean of unigram precision/recall, with stemming and synonym matching.
More forgiving than BLEU to lexical variation. Correlates moderately well with human judgement on captions.
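METEOR's core score is a recall-weighted harmonic mean of unigram precision and recall (recall weighted 9:1). A sketch of just that core on exact matches; the real metric adds stemming, WordNet synonym matching, and a fragmentation penalty for scrambled word order, all omitted here (`meteor_fmean` is an illustrative name):

```python
from collections import Counter

def meteor_fmean(candidate, reference):
    """METEOR's F_mean on exact unigram matches only: harmonic mean of
    precision and recall with recall weighted 9x."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matches = sum(min(cand[w], ref[w]) for w in cand)  # clipped unigram overlap
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 10 * precision * recall / (recall + 9 * precision)
```

The heavy recall weighting is why METEOR forgives paraphrase-heavy captions that BLEU punishes: covering the reference content matters far more than avoiding extra words.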
§ 05 · Related

Where to read next.

Cross-link · Frontier LLM
LLM-specific leaderboard with denser, faster-moving data.

Cross-link · Methodology
How we grade benchmarks, reproduce runs, and record retractions.


What were you looking for on image captioning?

Missing a model, a metric we skipped, a use case you need help with? Tell us — we reply within 48 hours and update pages based on what readers actually ask.

Real humans read every message. We track what people are asking for and prioritize accordingly.