Audio Captioning (2019, en)

AudioCaps

Audio captioning quality evaluated on AudioCaps.

Current State of the Art

AudioCaps baseline (TopDown+Align), Kim et al.: SPIDEr 0.369

AudioCaps: SPIDEr

3 results · 1 SOTA advance · higher is better
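SPIDEr is the unweighted average of CIDEr and SPICE, which is why a SPIDEr entry can be derived from a paper that reports those metrics separately. The sketch below is a minimal illustration, assuming both scores are already computed on the same fractional scale used on this page; the function name and the input values are hypothetical and are not taken from any of the listed papers. Computing CIDEr and SPICE themselves requires a captioning evaluation toolkit and is not shown.

```python
def spider(cider: float, spice: float) -> float:
    """Return SPIDEr, the unweighted mean of CIDEr and SPICE.

    Both inputs must use the same scale (here: fractional, not x100).
    """
    if cider < 0 or spice < 0:
        raise ValueError("CIDEr and SPICE scores must be non-negative")
    return (cider + spice) / 2.0


# Hypothetical metric values, for illustration only.
print(round(spider(cider=0.60, spice=0.14), 3))  # 0.37
```

Because the two components are weighted equally, a model can rank differently on SPIDEr than on CIDEr or SPICE alone.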

[Chart: SPIDEr on AudioCaps, all results and the SOTA frontier]

SPIDEr Progress Over Time

Showing 3 breakthroughs from May 2023 to Apr 2026

[Chart: SPIDEr (y-axis, 0.261 to 0.379) vs. date (x-axis, May 2023 to Apr 2026)]

Key Milestones

May 2023 · Pengi · SPIDEr 0.271
Pengi zero/few-shot audio captioning (NeurIPS 2023).

Jan 2024 · EnCLAP-base · SPIDEr 0.300 (+10.7%)
EnCLAP-base, AudioCaps test set, Table 2 (ICASSP 2024).

Apr 2026 · AudioCaps baseline (TopDown+Align) · SPIDEr 0.369 (+23.0%) · current SOTA
Original AudioCaps baseline. This is a seeded entry pending verification: the paper reports CIDEr, METEOR, and SPICE separately rather than SPIDEr.

Total improvement: 36.2%
Time span: 3 years
Breakthroughs: 3
Current SOTA: 0.369
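The improvement percentages above are relative gains: each new SOTA score divided by the previous one, minus one. The total improvement compares the current SOTA with the first milestone. A minimal sketch that recomputes them from the three SPIDEr scores listed on this page; the names `MILESTONES` and `relative_gain` are illustrative only.

```python
# SPIDEr scores of the SOTA advances, in chronological order (from this page).
MILESTONES = [
    ("Pengi", 0.271),
    ("EnCLAP-base", 0.300),
    ("AudioCaps baseline (TopDown+Align)", 0.369),
]


def relative_gain(old: float, new: float) -> float:
    """Relative improvement of `new` over `old`, as a percentage."""
    return (new / old - 1.0) * 100.0


# Step-over-step gains between consecutive SOTA advances.
for (prev_name, prev), (name, score) in zip(MILESTONES, MILESTONES[1:]):
    print(f"{name}: +{relative_gain(prev, score):.1f}% over {prev_name}")

# Total improvement from the first milestone to the current SOTA.
total = relative_gain(MILESTONES[0][1], MILESTONES[-1][1])
print(f"Total improvement: {total:.1f}%")
```

Running it reproduces the +10.7% and +23.0% steps and the 36.2% total. Note that relative gains compound multiplicatively (1.107 × 1.230 ≈ 1.362), so the total is not the sum of the steps.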

Top Models Performance Comparison

Top 3 models ranked by SPIDEr

Rank  Model                               SPIDEr  % of best
1     AudioCaps baseline (TopDown+Align)  0.369   100.0%
2     EnCLAP-base                         0.300   81.3%
3     Pengi                               0.271   73.4%

Best score: 0.369
Top model: AudioCaps baseline (TopDown+Align)
Models compared: 3
Score range: 0.098
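The "% of best" column and the score range are simple derived statistics: each score divided by the top score, and the top score minus the lowest. A minimal sketch that ranks the three listed models (higher SPIDEr is better) and recomputes both; `SCORES` and `percent_of_best` are hypothetical names chosen for illustration.

```python
from typing import Dict, List, Tuple

# SPIDEr scores from the ranking on this page.
SCORES: Dict[str, float] = {
    "AudioCaps baseline (TopDown+Align)": 0.369,
    "EnCLAP-base": 0.300,
    "Pengi": 0.271,
}


def percent_of_best(scores: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """Rank models by score (higher is better) and express each as % of the best."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best = ranked[0][1]
    return [(name, score, 100.0 * score / best) for name, score in ranked]


rows = percent_of_best(SCORES)
for rank, (name, score, pct) in enumerate(rows, start=1):
    print(f"{rank}. {name}: {score:.3f} ({pct:.1f}% of best)")

# Score range: best minus worst.
score_range = max(SCORES.values()) - min(SCORES.values())
print(f"Score range: {score_range:.3f}")
```

Expressing scores as a fraction of the leader keeps the comparison readable even though SPIDEr values are small numbers on an unbounded-above scale.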

SPIDEr (primary metric)

#  Model                               Authors / Org   SPIDEr  Paper / Code  Date
1  AudioCaps baseline (TopDown+Align)  Kim et al.      0.369   Open source   Apr 2026
2  EnCLAP-base                         KAIST / NAVER   0.300   Open source   Apr 2026
3  Pengi                               Microsoft       0.271   Open source   Apr 2026