Codesota · Benchmark · AudioCaps

AudioCaps.

Audio generation quality evaluated on AudioCaps captions

§ 01 · Leaderboard

Results by metric.

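R@10

R@10 (recall at rank 10) is a retrieval metric: the percentage of queries for which a correct match appears among the top 10 retrieved items. The entry below reports text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare results across sources and model families.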
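As an illustration, R 10, R 5 and R 1 are read here as recall at rank k for retrieval. That reading is an assumption based on the usual convention and on the T->A (text-to-audio) tag attached to the listed model. The sketch below computes the quantity as a percentage; the function name `recall_at_k` and the toy clip IDs are hypothetical and not taken from any listed paper.

```python
def recall_at_k(rankings, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    rankings: one ranked list of candidate IDs per query (best first).
    relevant: one set of correct IDs per query, in the same order.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")
    hits = sum(
        1
        for ranking, correct in zip(rankings, relevant)
        if any(item in correct for item in ranking[:k])
    )
    return 100.0 * hits / len(rankings)


# Hypothetical toy data: three text queries, each with ranked audio clip IDs.
rankings = [["a3", "a1", "a7"], ["a2", "a9", "a4"], ["a5", "a6", "a8"]]
relevant = [{"a1"}, {"a2"}, {"a0"}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, relevant, k):.1f}")
```

Because the top-k window only grows with k, R@1 can never exceed R@5, which can never exceed R@10 for the same model and test set.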

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

Rank	Model	Trust	Score	Year
01	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	unverified	83.7	2022

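R@5

R@5 (recall at rank 5) is a retrieval metric: the percentage of queries for which a correct match appears among the top 5 retrieved items. The entry below reports text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare results across sources and model families.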

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	unverified	71.9	2022

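R@1

R@1 (recall at rank 1) is a retrieval metric: the percentage of queries for which the top retrieved item is a correct match. The entry below reports text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare results across sources and model families.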

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A)	unverified	35.1	2022

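FAD

FAD (Fréchet Audio Distance) measures how far the distribution of generated audio lies from a reference set, here the AudioCaps test set, by comparing statistics of audio embeddings. Codesota tracks published model scores on this metric so readers can compare text-to-audio generation results across sources and model families.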
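For reference, FAD is normally expanded as Fréchet Audio Distance. Under the standard definition, Gaussians with means $\mu_r, \mu_g$ and covariances $\Sigma_r, \Sigma_g$ are fit to embeddings of the reference and generated audio, and the distance between them is:

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The embedding model and the reference set can differ between papers, so scores from different sources are strictly comparable only when those choices match. Because this is a distance, smaller values mean the generated audio is statistically closer to the reference.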

Lower is better

Trust tiers: verified, paper, vendor, community, unverified

Rank	Model	Notes	Trust	Score	Year
01	AudioLDM 2-AC-Large	AudioLDM 2 AudioCaps-finetuned large model (Liu et al., IEEE/ACM TASLP 2024). Best FAD on AudioCaps test set. Table II in paper.	verified	1.42	2024
02	TANGO	Ghosal et al., 2023. FAD on AudioCaps test set. Previous SOTA before AudioLDM 2.	verified	1.73	2023
03	AudioLDM 2-Full	Liu et al., IEEE/ACM TASLP 2024. FAD on AudioCaps test set. Table II in paper.	verified	1.78	2024
04	AudioLDM 2-Full-Large	Liu et al., IEEE/ACM TASLP 2024. FAD on AudioCaps test set. Table II in paper.	verified	1.86	2024
05	AudioLDM	Liu et al., ICML 2023. FAD on AudioCaps test set. Baseline comparison in AudioLDM 2 paper.	verified	4.48	2023
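The muting rule used on this page (a row is muted when an earlier or same-year result already scored better) can be sketched as follows. This is a minimal illustration, not Codesota's implementation: the `Row` type and `muted_models` function are hypothetical names, the year comparison is assumed to be `<=`, and FAD is treated as lower-is-better because it is a distance. The scores and years are the FAD rows listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    model: str
    score: float
    year: int


def muted_models(rows, lower_is_better):
    """Return the models beaten by an earlier or same-year result."""

    def better(a, b):
        return a < b if lower_is_better else a > b

    muted = set()
    for row in rows:
        for other in rows:
            # Assumption: "earlier or same-year" means other.year <= row.year.
            if (
                other is not row
                and other.year <= row.year
                and better(other.score, row.score)
            ):
                muted.add(row.model)
                break
    return muted


# FAD rows from the table above (score, publication year).
fad_rows = [
    Row("AudioLDM", 4.48, 2023),
    Row("AudioLDM 2-Full-Large", 1.86, 2024),
    Row("AudioLDM 2-Full", 1.78, 2024),
    Row("TANGO", 1.73, 2023),
    Row("AudioLDM 2-AC-Large", 1.42, 2024),
]

print(sorted(muted_models(fad_rows, lower_is_better=True)))
# prints ['AudioLDM', 'AudioLDM 2-Full', 'AudioLDM 2-Full-Large']
```

Under these assumptions only TANGO (best in 2023) and AudioLDM 2-AC-Large (best overall) remain unmuted, which agrees with the row notes describing them as the previous and current best FAD.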

CIDEr

CIDEr (Consensus-based Image Description Evaluation) scores a generated caption by its TF-IDF-weighted n-gram similarity to the reference captions. Codesota tracks published model scores on this metric so readers can compare audio captioning results across sources and model families.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

Rank	Model	Trust	Score	Year
01	Audio Flamingo 3	unverified	0.70	2025

SPIDEr

SPIDEr is the average of the SPICE and CIDEr caption metrics, combining semantic content matching (SPICE) with n-gram consensus (CIDEr). Codesota tracks published model scores on this metric so readers can compare audio captioning results across sources and model families.
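Some papers report CIDEr and SPICE separately rather than the combined score. Under the standard definition, SPIDEr is the arithmetic mean of the two, so it can be derived as in the sketch below. The function name `spider` and the input values are hypothetical placeholders, not scores from any model listed here.

```python
def spider(cider, spice):
    """SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    if cider < 0 or spice < 0:
        raise ValueError("CIDEr and SPICE scores are expected to be non-negative")
    return (cider + spice) / 2.0


# Hypothetical inputs, both on the same scale as they would appear in a paper.
example_cider = 0.60
example_spice = 0.16
print(f"SPIDEr = {spider(example_cider, example_spice):.2f}")
```

Both inputs need to be on the same scale before averaging; some papers print CIDEr multiplied by 100, which would otherwise dominate the mean.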

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

Rank	Model	Notes	Trust	Score	Year
01	AudioCaps baseline (TopDown+Align)	Original AudioCaps baseline. Seed entry, needs verification (the paper reports CIDEr, METEOR and SPICE separately).	paper	0.37	2019
02	EnCLAP-base	AudioCaps test, Table 2. ICASSP 2024.	paper	0.30	2024
03	Pengi	Zero/few-shot audio captioning. NeurIPS 2023.	paper	0.27	2023
