Image Captioning
Generate a natural-language description of an image. Once a dedicated subfield with custom encoder-decoder architectures, now a side effect of any frontier vision-language model (VLM) — and the textbook example of a task that modern VLMs have quietly solved.
What it looks like
Image captioning is deceptively hard because every image has multiple “correct” descriptions. Here are three examples showing how the same scene gets different valid captions — and why the evaluation metrics that follow have to account for that variance.
A busy kitchen scene
1. “A person slicing vegetables on a wooden cutting board.”
2. “Hands chopping fresh herbs next to a bowl of tomatoes.”
3. “Overhead shot of meal prep with colorful ingredients.”
Three valid captions — none is ‘wrong’; each emphasizes different salient objects. This is why captioning benchmarks use 5+ human references per image.
A cityscape at dusk
1. “A city skyline reflected in the water at dusk.”
2. “Tall buildings along a waterfront during the blue hour.”
3. “An illuminated skyline with lights shining on a river.”
Scenes with many valid paraphrases, like skylines, favor semantic metrics (SPICE) over n-gram metrics (BLEU): a model that says ‘tall buildings at night’ should score similarly to one saying ‘skyline at dusk’.
A dog at the beach
1. “A golden retriever standing on a sandy beach.”
2. “A brown dog near the waves on an empty shore.”
3. “A happy puppy looking at the camera by the sea.”
Notice the breed → color drift (‘golden retriever’ → ‘brown dog’). Good captioning systems get the high-salience noun right; great ones also get attributes and relations.
Why it still matters
Captioning has quietly become one of the most load-bearing capabilities in production AI — even when nobody calls it “captioning.” If your product does any of the following, you’re shipping an image captioner:
- Accessibility: alt text for the visually impaired.
- Content moderation: describe images before running policy checks.
- Search: index visual content with natural-language tags.
- E-commerce: auto-generate product descriptions from photos.
- Multimodal RAG: caption-then-retrieve pipelines for document AI.
The benchmarks haven’t caught up with what modern VLMs can do. COCO Captions — the canonical dataset — was designed in 2015 for specialist encoder-decoder models trained from scratch. Frontier 2026 VLMs trained on internet-scale image-text pairs routinely saturate it zero-shot.
That’s why the interesting action has moved to NoCaps (novel objects), long-form description benchmarks, and qualitative evals (attribute correctness, relation grounding, hallucination rate) that pure CIDEr scores can’t capture.
The metrics
Evaluation is the real bottleneck in image captioning. Here are the four metrics you’ll see in every paper, and when each one is load-bearing.
CIDEr
Consensus-based Image Description Evaluation · 0–10+ (higher is better)
TF-IDF weighted n-gram similarity against multiple references.
The dominant captioning metric since 2015. Rewards matching rare, image-specific words and down-weights generic filler like ‘a photo of’.
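To make the mechanism concrete, here is a minimal sketch of the TF-IDF weighted n-gram similarity at CIDEr’s core. It is illustrative, not the official scorer: real evaluation should use the reference implementation in pycocoevalcap, which averages over n = 1–4, scales by 10, and (in CIDEr-D) adds clipping and a length penalty. `doc_freq` (n-gram document frequencies over the reference corpus) and `num_images` are inputs you would precompute.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def tfidf(counts, doc_freq, num_images):
    # Rare n-grams (low document frequency) get high IDF weight;
    # generic filler like "a photo of" gets weight near zero.
    total = sum(counts.values()) or 1
    return {g: (c / total) * math.log(num_images / max(doc_freq.get(g, 1), 1))
            for g, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm = lambda vec: math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return dot / (norm(u) * norm(v))

def cider_n(candidate, references, doc_freq, num_images, n=4):
    """Average cosine similarity of TF-IDF weighted n-gram vectors
    between the candidate and each reference caption."""
    cand = tfidf(ngrams(candidate.split(), n), doc_freq, num_images)
    refs = [tfidf(ngrams(r.split(), n), doc_freq, num_images) for r in references]
    return sum(cosine(cand, r) for r in refs) / len(refs)
```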
SPICE
Semantic Propositional Image Caption Evaluation · 0.0–1.0
F1 over semantic scene graphs (objects, attributes, relations) extracted from captions.
Closer to how humans judge captions: ‘a red car next to a tree’ matches even if word order differs. Complements CIDEr.
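A hedged sketch of the scoring step, assuming the semantic tuples have already been extracted. The real SPICE does that extraction with a dependency parser and scene-graph rules (implemented in Java), and matches tuples via WordNet synonyms rather than exact equality:

```python
def spice_f1(candidate_tuples, reference_tuples):
    """F1 over semantic proposition tuples: objects ("car",),
    attributes ("car", "red"), relations ("car", "next to", "tree")."""
    cand, ref = set(candidate_tuples), set(reference_tuples)
    if not cand or not ref:
        return 0.0
    matched = len(cand & ref)
    if matched == 0:
        return 0.0
    precision, recall = matched / len(cand), matched / len(ref)
    return 2 * precision * recall / (precision + recall)

# "a red car next to a tree" vs. "a tree beside a red car" — identical graphs:
cand = [("car",), ("tree",), ("car", "red"), ("car", "next to", "tree")]
ref  = [("car",), ("tree",), ("car", "red"), ("car", "next to", "tree")]
print(spice_f1(cand, ref))  # 1.0; word order never entered the comparison
```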
BLEU-4
Bilingual Evaluation Understudy (4-gram) · 0.0–1.0
Geometric mean of modified 1- to 4-gram precisions against reference captions, multiplied by a brevity penalty.
Borrowed from machine translation. Strict and n-gram based, so it undervalues paraphrasing. Still reported for historical comparability.
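For example, with NLTK (one of several implementations; papers usually report corpus-level BLEU via `corpus_bleu`, and smoothing choices shift the number):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a person slicing vegetables on a wooden cutting board".split(),
    "hands chopping fresh herbs next to a bowl of tomatoes".split(),
]
candidate = "a person chopping vegetables on a cutting board".split()

# BLEU-4: geometric mean of 1- to 4-gram precisions plus brevity penalty.
# Smoothing avoids a hard zero when no 4-gram matches any reference.
score = sentence_bleu(
    references, candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU-4: {score:.3f}")
```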
METEOR
Metric for Evaluation of Translation with Explicit ORdering · 0.0–1.0
Harmonic mean of unigram precision/recall, with stemming and synonym matching.
More forgiving of lexical variation than BLEU. Correlates moderately well with human judgment on captions.
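NLTK ships an implementation here too (scores vary slightly across implementations, and recent NLTK versions expect pre-tokenized input):

```python
import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)   # synonym matching needs WordNet
nltk.download("omw-1.4", quiet=True)

references = [
    "a golden retriever standing on a sandy beach".split(),
    "a brown dog near the waves on an empty shore".split(),
]
candidate = "a dog stands on the sand by the ocean".split()

print(f"METEOR: {meteor_score(references, candidate):.3f}")
```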
2026 leaders on COCO Captions
Frontier VLMs have pushed COCO CIDEr past 145 — once considered superhuman. At this point the score is more a measurement of dataset leakage than of model quality. Use NoCaps for honest evaluation.
| Model | Provider | COCO CIDEr |
|---|---|---|
| GPT-5.2 | OpenAI | 150+ |
| Claude Opus 4.6 | Anthropic | 148+ |
| Gemini 3 Pro | Google | 147+ |
| Qwen2.5-VL-72B | Alibaba | 143 |
| InternVL 3 | Shanghai AI Lab | 140 |
| LLaVA-OneVision | ByteDance / Academic | 136 |
Scores are approximate — frontier labs no longer optimize for COCO and rarely report exact numbers. Treat the ordering, not the decimals, as the signal.
Datasets
COCO Captions
The canonical captioning dataset. Each image has 5 human-written reference captions. Karpathy split (5K val + 5K test) remains the standard reporting protocol.
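A sketch of pulling those references with pycocotools, assuming you have downloaded the standard annotation file from cocodataset.org (the path is a placeholder):

```python
from pycocotools.coco import COCO

# Placeholder path; download annotations_trainval2014 from cocodataset.org.
coco = COCO("annotations/captions_val2014.json")

img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id)
for ann in coco.loadAnns(ann_ids):
    print(ann["caption"])  # typically 5 human references per image
```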
NoCaps
Novel Object Captioning at Scale. Validation images from Open Images containing objects not present in COCO training data — measures zero-shot generalization. Critical for evaluating VLMs, with the caveat that an honest zero-shot number requires the novel objects not to have appeared in pretraining data.
Flickr30k
Older but still reported alongside COCO for comparability. 5 captions per image sourced from Flickr photo descriptions.
Practical tips for 2026
Don’t train a specialist model. Unless you have a domain so niche that frontier VLMs fail (medical imaging, satellite, specialized manufacturing), use an off-the-shelf VLM with a good prompt and spend your budget on evaluation instead.
Prompt for the trade-off you want. “Describe this image in one sentence” produces COCO-style output. “Describe everything you see, including spatial relationships” produces dense captions suitable for accessibility or retrieval.
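A sketch using the OpenAI Python SDK's image input; the model name and image URL are placeholders, so swap in whichever VLM and client you actually use:

```python
from openai import OpenAI

client = OpenAI()
image_url = "https://example.com/kitchen.jpg"  # placeholder

def caption(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any captioning-capable VLM
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

short = caption("Describe this image in one sentence.")  # COCO-style
dense = caption("Describe everything you see, including spatial "
                "relationships between objects.")        # dense caption
```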
Watch for hallucinations. VLMs hallucinate non-existent objects more often on cluttered scenes. The CHAIR metric (Caption Hallucination Assessment with Image Relevance) is worth adding to your eval loop if your use case is accessibility or moderation.
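A minimal sketch of the two CHAIR variants, assuming object mentions have already been extracted from each caption and mapped to a canonical vocabulary (the reference implementation does this against COCO's 80 object categories with a synonym list):

```python
def chair(captions_objects, ground_truth_objects):
    """CHAIR-i: hallucinated object mentions / all object mentions.
    CHAIR-s: fraction of captions with at least one hallucination.
    Inputs: per-caption lists of mentioned objects, and the set of
    objects actually present in the corresponding image."""
    total = hallucinated = bad_captions = 0
    for mentioned, present in zip(captions_objects, ground_truth_objects):
        halluc = [obj for obj in mentioned if obj not in present]
        total += len(mentioned)
        hallucinated += len(halluc)
        bad_captions += bool(halluc)
    chair_i = hallucinated / max(total, 1)
    chair_s = bad_captions / max(len(captions_objects), 1)
    return chair_i, chair_s

# The caption mentioned a frisbee that isn't in the image:
print(chair([["dog", "beach", "frisbee"]], [{"dog", "beach", "wave"}]))
# (0.333..., 1.0)
```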
Cache aggressively. Caption generation is 10-50× more expensive per image than CLIP-style embedding. For search/retrieval use cases, embed first and only caption the top-k hits on demand.
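A sketch of that pipeline shape. `embed_text`, `embed_image`, and `caption_image` are hypothetical stand-ins for your CLIP-style encoder and VLM call, and the dict cache stands in for whatever store you run in production:

```python
import hashlib
import math

caption_cache: dict[str, str] = {}  # stand-in for Redis/SQLite/etc.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def search_then_caption(query: str, images: list[bytes], k: int = 5):
    """Embed everything cheaply, caption only the top-k hits, and never
    caption the same image twice. embed_text/embed_image/caption_image
    are hypothetical stand-ins for your encoder + VLM."""
    q = embed_text(query)  # cheap, embedding-based ranking
    ranked = sorted(images, key=lambda im: -cosine(q, embed_image(im)))
    results = []
    for im in ranked[:k]:
        key = hashlib.sha256(im).hexdigest()
        if key not in caption_cache:
            caption_cache[key] = caption_image(im)  # the 10-50x call
        results.append((im, caption_cache[key]))
    return results
```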