Codesota · Benchmark · COCO Captions

COCO Captions.

330K images with 5 captions each. Standard benchmark for image captioning.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

cider

CIDEr is one of the evaluation metrics reported for COCO Captions; it scores a generated caption by the TF-IDF-weighted n-gram similarity between it and the human reference captions. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for cider: verified, paper, vendor, community, unverified.

Muted rows were not state of the art when published: an earlier or same-year result already scored better.

Rank	Model	Notes	Trust	Score	Year
01	PaLI-X-55B	PaLI-X 55B (scaling up multilingual vision-language). Google, 2023. CIDEr on Karpathy test split.	verified	149.2	2023
02	PaLI-17B	PaLI (Pathways Language and Image model) 17B. Google Research, ICLR 2023. CIDEr on Karpathy test split without CIDEr optimization.	verified	149.1	2022
03	BEiT-3	BEiT-3 (Image as a Foreign Language). Microsoft, CVPR 2023. CIDEr on Karpathy test split.	verified	147.6	2022
04	BLIP-2 (OPT 2.7B)	BLIP-2 with frozen OPT-2.7B. Salesforce, ICML 2023. CIDEr on Karpathy test split.	verified	145.8	2023
05	OFA	OFA-Huge (Unifying Architectures, Tasks, and Modalities). Alibaba DAMO, ICML 2022. CIDEr on Karpathy test split.	verified	145.3	2022
06	GIT2	GIT2 (5.1B parameters). Microsoft, 2022. CIDEr on Karpathy test split.	verified	145	2022
07	GIT	GIT (Generative Image-to-text Transformer). Microsoft, 2022. CIDEr on Karpathy test split.	verified	144.8	2022
08	SimVLM	SimVLM large. ICLR 2022. CIDEr on Karpathy test split.	verified	143.3	2022
09	VinVL	VinVL large model. CVPR 2021. CIDEr on Karpathy test split.	verified	140.9	2022
10	Chameleon-SFT		unverified	140.8	2024
11	BLIP	BLIP (Bootstrapping Language-Image Pre-training). ICML 2022. CIDEr on Karpathy test split.	verified	136.7	2022
12	CogVLM	CogVLM-17B zero-shot. Tsinghua KEG, Nov 2023. CIDEr on COCO Karpathy test split. Zero-shot result.	verified	126.4	2023
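The muting rule stated above the table can be checked mechanically. What follows is a minimal sketch, not Codesota's actual implementation: it assumes the rule is applied at year granularity and only over the rows listed in this table, and the names `Row`, `ROWS` and `split_by_sota` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    score: float
    year: int


# Scores and years copied from the cider table above.
ROWS = [
    Row("PaLI-X-55B", 149.2, 2023),
    Row("PaLI-17B", 149.1, 2022),
    Row("BEiT-3", 147.6, 2022),
    Row("BLIP-2 (OPT 2.7B)", 145.8, 2023),
    Row("OFA", 145.3, 2022),
    Row("GIT2", 145.0, 2022),
    Row("GIT", 144.8, 2022),
    Row("SimVLM", 143.3, 2022),
    Row("VinVL", 140.9, 2022),
    Row("Chameleon-SFT", 140.8, 2024),
    Row("BLIP", 136.7, 2022),
    Row("CogVLM", 126.4, 2023),
]


def split_by_sota(rows):
    """Split model names into (state of the art when published, muted).

    Assumption: a row is muted when any row from the same or an earlier
    year has a strictly higher score.
    """
    sota, muted = [], []
    for row in rows:
        beaten = any(o.year <= row.year and o.score > row.score for o in rows)
        (muted if beaten else sota).append(row.model)
    return sota, muted


sota, muted = split_by_sota(ROWS)
print("state of the art when published:", sota)
print("muted:", muted)
```

Under these assumptions only PaLI-17B (2022) and PaLI-X-55B (2023) stay unmuted. The site may use exact publication dates or results not shown here, so its muting can differ.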

CIDEr

CIDEr is one of the evaluation metrics reported for COCO Captions. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for CIDEr: verified, paper, vendor, community, unverified.

Rank	Model	Notes	Trust	Score	Year
01	BLIP-2	COCO Karpathy test split. FlanT5-XXL backbone. Table 12. arXiv:2301.12597	verified	145.8	2023
02	CoCa	COCO Karpathy test split. Single-model fine-tune. Table 4. arXiv:2205.01917	verified	143.6	2022

R 1

R 1 (recall at rank 1, usually written R@1) is a retrieval metric reported on COCO Captions: the percentage of queries whose top-ranked result is a correct match. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for R 1: verified, paper, vendor, community, unverified.

Rank	Model	Trust	Score	Year
01	BLIP ViT-L	unverified	65.1	2022
02	ALIGN	unverified	59.9	2021
03	AltCLIP	unverified	42.9	2022
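For readers unfamiliar with the metric, here is a small sketch of how a recall-at-k score can be computed, assuming "R 1" denotes Recall@1 as used in image-text retrieval. The function name `recall_at_k`, the similarity matrix and the ground-truth sets are made up for illustration; the page does not say how the listed models computed their scores.

```python
def recall_at_k(similarity, relevant, k=1):
    """Percentage of queries with at least one correct item in the top k.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that are correct for q.
    """
    hits = 0
    for scores, gold in zip(similarity, relevant):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if gold & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data: 3 queries scored against 3 candidates.
similarity = [
    [0.9, 0.1, 0.2],  # top candidate is 0 (correct)
    [0.3, 0.2, 0.8],  # top candidate is 2 (incorrect, the match is 1)
    [0.1, 0.7, 0.4],  # top candidate is 1 (correct)
]
relevant = [{0}, {1}, {1}]

r1 = recall_at_k(similarity, relevant, k=1)
r3 = recall_at_k(similarity, relevant, k=3)
print(round(r1, 2), r3)
```

In this toy case two of the three queries rank a correct candidate first, so Recall@1 is about 66.67, while Recall@3 is 100 because every candidate is inside the top 3.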

bleu-4

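BLEU-4 is one of the evaluation metrics reported for COCO Captions; it measures the n-gram precision (up to 4-grams) of a generated caption against the reference captions, with a penalty for captions that are too short. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for bleu-4: verified, paper, vendor, community, unverified.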

Rank	Model	Notes	Trust	Score	Year
01	GIT	GIT. BLEU-4 on Karpathy test split.	verified	44.1	2022
02	GIT2	GIT2. BLEU-4 on Karpathy test split.	verified	44.1	2022
03	OFA	OFA-Huge. BLEU-4 on Karpathy test split.	verified	43.9	2022
04	BLIP-2 (OPT 2.7B)	BLIP-2 with frozen OPT-2.7B. BLEU-4 on Karpathy test split.	verified	43.7	2023
05	VinVL	VinVL large model. CVPR 2021. BLEU-4 on Karpathy test split.	verified	41	2022
06	CoCa	CoCa. BLEU-4 on Karpathy test split.	verified	40.9	2022
07	SimVLM	SimVLM large. ICLR 2022. BLEU-4 on Karpathy test split.	verified	40.6	2022
08	BLIP	BLIP. ICML 2022. BLEU-4 on Karpathy test split.	verified	40.4	2022
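To make the BLEU-4 numbers above easier to interpret, here is a simplified sketch of corpus-level BLEU-4 with multiple references per caption. It assumes captions are already tokenized and lowercased, and the function names and example captions are illustrative. Published results typically come from the COCO caption evaluation toolkit, which applies its own tokenization, so this sketch is not expected to reproduce leaderboard scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(candidates, references):
    """candidates: list of token lists; references: list of lists of token lists."""
    matched, total = [0] * 4, [0] * 4
    cand_len = ref_len = 0
    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Use the reference length closest to the candidate (ties: shorter).
        ref_len += min((abs(len(r) - len(cand)), len(r)) for r in refs)[1]
        for n in range(1, 5):
            cand_counts = ngrams(cand, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngrams(r, n)  # element-wise max over references
            # Clip each candidate n-gram count by its max reference count.
            matched[n - 1] += sum((cand_counts & max_ref).values())
            total[n - 1] += sum(cand_counts.values())
    if min(matched) == 0:
        return 0.0  # some n-gram order has no match (no smoothing applied)
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / 4
    brevity = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return 100.0 * brevity * math.exp(log_precision)


full = "a man rides a horse on the beach".split()
refs = [[full, "a person riding a horse".split()]]

exact = corpus_bleu4([full], refs)                       # identical to a reference
short = corpus_bleu4(["a man rides a horse".split()], [[full]])  # correct but short
print(exact, round(short, 2))
```

The second call shows the brevity penalty at work: every n-gram of the short caption appears in the reference, so the precision is perfect, but the score still drops because the caption has 5 tokens against a reference of 8.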

spice

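SPICE is one of the evaluation metrics reported for COCO Captions; it compares the objects, attributes and relations parsed from a generated caption with those parsed from the reference captions. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better.

Trust tiers for spice: verified, paper, vendor, community, unverified.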

Rank	Model	Notes	Trust	Score	Year
01	SimVLM	SimVLM large. ICLR 2022. SPICE on Karpathy test split.	verified	25.4	2022
02	OFA	OFA-Huge. SPICE on Karpathy test split.	verified	24.8	2022
03	CoCa	CoCa. SPICE on Karpathy test split.	verified	24.7	2022

§ 04 · Submit a result

Add to the leaderboard.