Why the references come in fives.
Every image has more than one correct caption. A kitchen scene can truthfully be described as “a person slicing vegetables”, “hands chopping herbs”, or “overhead shot of meal prep” — none is wrong; each emphasises different salient objects. This is why every canonical captioning dataset ships with five or more human references per image, and why the evaluation metrics are all some flavour of consensus.
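The consensus idea can be made concrete with a toy scorer (this is an illustrative sketch, not any standard metric): a candidate caption is compared against every reference, so wording that many annotators independently chose counts for more than an equally true but idiosyncratic phrasing.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate words that also appear in the reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

def consensus_score(candidate: str, references: list[str]) -> float:
    # Average over all references rather than picking one "correct" caption:
    # each reference is a valid description, so agreement with several of
    # them is stronger evidence than perfect agreement with just one.
    return sum(unigram_precision(candidate, r) for r in references) / len(references)

refs = [
    "a person slicing vegetables",
    "hands chopping herbs",
    "overhead shot of meal prep",
]
score = consensus_score("a person chopping vegetables", refs)  # ≈ 0.33
```

Real metrics (BLEU, METEOR, CIDEr) are far more sophisticated, but they share this shape: one candidate, many references, some aggregation across them.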
Captioning has quietly become one of the most load-bearing capabilities in production AI, even when nobody calls it captioning. Accessibility alt text, content moderation, visual search indexing, e-commerce product descriptions, caption-then-retrieve pipelines for multimodal RAG — all of it is image captioning, dressed in other names.
And yet the benchmarks have not caught up. COCO Captions was designed in 2015 for specialist encoder–decoder models trained from scratch; frontier 2026 VLMs trained on internet-scale image-text pairs routinely saturate it zero-shot. At CIDEr scores above 145, the number is measuring dataset leakage more than model quality. The honest signal has moved to NoCaps (novel objects), long-form description benchmarks, and qualitative evals — attribute correctness, relation grounding, hallucination rate — that n-gram metrics cannot capture.
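For intuition about what CIDEr actually rewards, here is a stripped-down unigram sketch of its core mechanic: TF-IDF-weighted n-gram cosine similarity against the references. This is not the real metric — actual CIDEr uses 1–4-grams, stemming, a length penalty, and a ×10 scaling — but it shows why rare content words dominate the score and why memorised phrasing can inflate it.

```python
import math
from collections import Counter

def tfidf_vector(tokens: list[str], doc_freq: Counter, num_docs: int) -> dict:
    """TF-IDF weights: words common across the corpus are down-weighted."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * math.log(num_docs / (1 + doc_freq[w]))
            for w, c in tf.items()}

def cosine(u: dict, v: dict) -> float:
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_like(candidate: str, references: list[str], corpus: list[str]) -> float:
    # doc_freq counts how many corpus captions contain each word, so generic
    # words ("a", "photo") carry little weight and content words carry the
    # score -- the TF-IDF idea at the heart of CIDEr.
    doc_freq = Counter(w for cap in corpus for w in set(cap.split()))
    n = len(corpus)
    cand_vec = tfidf_vector(candidate.split(), doc_freq, n)
    return sum(cosine(cand_vec, tfidf_vector(r.split(), doc_freq, n))
               for r in references) / len(references)
```

The leakage failure mode follows directly: a model that has memorised the reference distribution can reproduce exactly the high-IDF words the scorer rewards, without demonstrating any grounding in the image.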
What the metric misses: hallucination. VLMs invent non-existent objects most often in cluttered scenes, and the standard captioning metrics won’t flag it — a confidently wrong caption can score much like a confidently right one. The CHAIR metric (Caption Hallucination Assessment with Image Relevance) is worth adding if your use case is accessibility or moderation.
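CHAIR's core computation is simple enough to sketch. The real metric maps synonyms onto MSCOCO's 80 object categories; this simplified version matches exact words against a known object vocabulary, then reports the instance-level rate (CHAIRi, hallucinated mentions over all object mentions) and the sentence-level rate (CHAIRs, fraction of captions with at least one hallucination).

```python
def chair_scores(captions: list[str],
                 gt_objects: list[set],
                 vocab: set) -> tuple[float, float]:
    # vocab: the object words the metric knows about. Real CHAIR maps
    # synonyms to MSCOCO's 80 categories; this sketch matches exact tokens.
    hallucinated = mentioned = flagged = 0
    for cap, truth in zip(captions, gt_objects):
        mentions = [w for w in cap.lower().split() if w in vocab]
        bad = [w for w in mentions if w not in truth]
        mentioned += len(mentions)
        hallucinated += len(bad)
        flagged += bool(bad)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = flagged / len(captions)
    return chair_i, chair_s

# Hypothetical example: "frisbee" is mentioned but absent from the image.
captions = ["a dog on a couch with a frisbee", "a cat sleeping"]
gt = [{"dog", "couch"}, {"cat", "bed"}]
vocab = {"dog", "couch", "frisbee", "cat", "bed"}
ci, cs = chair_scores(captions, gt, vocab)  # ci = 0.25, cs = 0.5
```

Lower is better on both numbers, which makes CHAIR a useful complement to similarity metrics: a caption can score well on CIDEr while one in four of its object mentions is invented.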