Codesota · Benchmark · AudioCaps

AudioCaps.

Audio generation quality evaluated on AudioCaps captions

§ 01 · Leaderboard

Results by metric.

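R@10

R@10 (recall at 10) is the percentage of queries for which the correct item appears among the top 10 retrieved results. The entry below is for text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.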

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score (%) | Year |
|---|---|---|---|---|
| 01 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | unverified | 83.7 | 2022 |

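R@5

R@5 (recall at 5) is the percentage of queries for which the correct item appears among the top 5 retrieved results. The entry below is for text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.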

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score (%) | Year |
|---|---|---|---|---|
| 01 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | unverified | 71.9 | 2022 |

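R@1

R@1 (recall at 1) is the percentage of queries for which the correct item is the top-ranked retrieved result. The entry below is for text-to-audio (T->A) retrieval on AudioCaps. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.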

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score (%) | Year |
|---|---|---|---|---|
| 01 | CLAP (HTSAT-RoBERTa, fusion, K2C Aug.; T->A) | unverified | 35.1 | 2022 |
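
The three retrieval metrics above differ only in the cutoff k. As a minimal sketch of how recall at k is commonly computed, assuming each query has exactly one correct target and that rankings are already sorted best-first (the function name, the toy identifiers, and the single-target assumption are illustrative and are not taken from the CLAP paper or its code):

```python
from typing import Sequence


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    targets: Sequence[str],
    k: int,
) -> float:
    """Percentage of queries whose target appears in the top-k ranked items."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not targets:
        raise ValueError("at least one query is required")
    # A query counts as a hit when its target is within the first k results.
    hits = sum(
        1 for ranking, target in zip(rankings, targets) if target in ranking[:k]
    )
    return 100.0 * hits / len(targets)


# Toy data (hypothetical): one ranked list of audio ids per text query.
rankings = [
    ["a1", "a7", "a3"],
    ["a5", "a2", "a9"],
    ["a4", "a8", "a3"],
    ["a6", "a1", "a4"],
]
targets = ["a1", "a2", "a3", "a4"]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(rankings, targets, k):.1f}")
```

Because the top-k set only grows with k, recall is non-decreasing in k, which matches the ordering of the R@1, R@5, and R@10 scores listed above.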

FAD

FAD (Fréchet Audio Distance) measures how closely the embedding statistics of generated audio match those of reference audio. The entries below are text-to-audio generation results on the AudioCaps test set. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Lower is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | AudioLDM 2-AC-Large | verified | 1.42 | 2024 | AudioLDM 2 AudioCaps-finetuned large model (Liu et al., IEEE/ACM TASLP 2024). Best FAD on AudioCaps test set. Table II in paper. |
| 02 | TANGO | verified | 1.73 | 2023 | TANGO (Ghosal et al., 2023). FAD on AudioCaps test set. Previous SOTA before AudioLDM 2. |
| 03 | AudioLDM 2-Full | verified | 1.78 | 2024 | AudioLDM 2-Full (Liu et al., IEEE/ACM TASLP 2024). FAD on AudioCaps test set. Table II in paper. |
| 04 | AudioLDM 2-Full-Large | verified | 1.86 | 2024 | AudioLDM 2-Full-Large (Liu et al., IEEE/ACM TASLP 2024). FAD on AudioCaps test set. Table II in paper. |
| 05 | AudioLDM | verified | 4.48 | 2023 | AudioLDM (Liu et al., ICML 2023). FAD on AudioCaps test set. Baseline comparison in AudioLDM 2 paper. |
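
For reference, the Fréchet distance behind FAD is usually written as below. This is the general definition, given as a sketch: it assumes Gaussians fitted to embeddings of reference audio (mean \mu_r, covariance \Sigma_r) and generated audio (mean \mu_g, covariance \Sigma_g). Which embedding model and reference set each listed paper used is not stated on this page, so the scores may not all be computed under identical settings.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The value is zero when the two fitted Gaussians coincide and grows as their means or covariances diverge.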

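CIDEr

CIDEr (Consensus-based Image Description Evaluation) scores a generated caption by its TF-IDF-weighted n-gram similarity to the reference captions. The entry below is an audio captioning result on AudioCaps. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.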

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year |
|---|---|---|---|---|
| 01 | Audio Flamingo 3 | unverified | 0.70 | 2025 |

SPIDEr

SPIDEr is the average of the CIDEr and SPICE captioning scores. The entries below are audio captioning results on AudioCaps. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers: verified, paper, vendor, community, unverified

| Rank | Model | Trust | Score | Year | Notes |
|---|---|---|---|---|---|
| 01 | AudioCaps baseline (TopDown+Align) | paper | 0.37 | 2026 | Original AudioCaps baseline. Seed entry, to be verified (paper reports CIDEr, METEOR, and SPICE separately). |
| 02 | EnCLAP-base | paper | 0.30 | 2026 | EnCLAP-base, AudioCaps test, Table 2. ICASSP 2024. |
| 03 | Pengi | paper | 0.27 | 2026 | Pengi zero/few-shot audio captioning. NeurIPS 2023. |
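
When a paper reports CIDEr and SPICE separately, as noted for the AudioCaps baseline above, a SPIDEr value can be derived from the two. A minimal sketch, assuming the standard definition of SPIDEr as their arithmetic mean (implementations typically use the CIDEr-D variant; the function name and the input scores below are hypothetical and are not taken from any listed paper):

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr as the arithmetic mean of CIDEr and SPICE."""
    return 0.5 * (cider + spice)


# Hypothetical scores, for illustration only.
example = spider(cider=0.60, spice=0.16)
print(f"SPIDEr = {example:.2f}")
```

Since CIDEr values are usually much larger than SPICE values, differences in SPIDEr between systems tend to be driven mostly by the CIDEr term.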
