Audio Captioning
Generating text descriptions of audio content.
Audio captioning generates natural language descriptions of audio content — 'a dog barking in a park with birds singing in the background.' It bridges audio understanding and language generation, with applications in accessibility, search, and content indexing. The field is young compared to image captioning but advancing rapidly through AudioSet pretraining and audio-language models.
History
Clotho (Drossos et al.) provides one of the first dedicated audio captioning datasets with 5K clips and crowd-sourced captions
AudioCaps (Kim et al.) scales to 46K audio-caption pairs sourced from AudioSet; becomes the primary benchmark
DCASE Audio Captioning Challenge drives community research; encoder-decoder models with PANNs achieve baseline performance
Audio-language pretraining (CLAP) enables better audio representations for captioning downstream
WavCaps (Mei et al.) combines ChatGPT-generated pseudo-captions with real data for large-scale pretraining
Pengi and SALMONN integrate audio understanding into LLMs for open-ended audio-language tasks
Qwen-Audio and SALMONN-13B achieve strong audio captioning as part of multi-task audio understanding
Audio-language models handle captioning, QA, and reasoning over audio in a unified framework
How Audio Captioning Works
Audio encoding
A pretrained audio encoder (BEATs, PANNs, or the Whisper encoder) converts audio into a sequence of embeddings
Projection
Audio embeddings are projected into the language model's embedding space via a learned linear or Q-Former adapter
Caption generation
An autoregressive language model generates a natural language description conditioned on the projected audio features
Decoding
Beam search or nucleus sampling produces the final caption; length and diversity can be controlled via decoding parameters
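The four steps above can be sketched end-to-end. This is a toy sketch with random weights and made-up dimensions, not a real model: the frame embeddings stand in for a pretrained encoder's output, the linear adapter for the learned projection, and the "language model" is reduced to a pooled context vector with greedy decoding. All names, sizes, and the tiny vocabulary are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; real systems use e.g. 768-d encoder frames and a 4096-d LLM space.
T, D_AUDIO, D_LM = 50, 32, 64
TOKENS = ["<bos>", "a", "dog", "barking", "in", "the", "park", "<eos>"]

# 1. Audio encoding: stand-in for frame embeddings from a pretrained encoder.
audio_embeddings = rng.standard_normal((T, D_AUDIO))

# 2. Projection: a learned linear adapter maps frames into the LM's space.
W_proj = rng.standard_normal((D_AUDIO, D_LM)) * 0.1
audio_prefix = audio_embeddings @ W_proj        # (T, D_LM) "soft prompt"

# 3-4. Generation: greedy decoding with a toy "language model". A real LM
# attends over the audio prefix with self-attention; here we just mean-pool it.
W_lm = rng.standard_normal((D_LM, len(TOKENS))) * 0.1

def greedy_decode(prefix, max_len=6):
    context = prefix.mean(axis=0)               # pooled audio context, (D_LM,)
    caption = []
    for _ in range(max_len):
        logits = context @ W_lm                 # one score per vocab token
        tok = int(np.argmax(logits))
        if TOKENS[tok] == "<eos>":
            break
        caption.append(TOKENS[tok])
        # fold the chosen token's embedding back into the context (toy recurrence)
        context = 0.9 * context + 0.1 * W_lm[:, tok]
    return caption

caption = greedy_decode(audio_prefix)
print(" ".join(caption))
```

In practice the projection is the main trainable bridge: the audio encoder and the LLM are usually frozen, and only the adapter (plus optionally LoRA layers in the LM) is tuned on audio-caption pairs.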
Current Landscape
Audio captioning in 2025 is still maturing, roughly where image captioning stood around 2017. The primary datasets (AudioCaps, Clotho) are small, and evaluation metrics are borrowed from the image and text domains. The most promising direction is integrating audio captioning into multimodal LLMs (Qwen-Audio, SALMONN) that treat captioning as one of many audio understanding tasks. Pseudo-labeling approaches (WavCaps) are helping scale training data beyond the limited manually annotated corpora.
Key Challenges
Data scarcity: AudioCaps (46K) is tiny compared to image captioning datasets (COCO Captions: 330K, LAION: billions)
Temporal ordering: describing a sequence of events ('first a door closes, then footsteps approach') requires temporal reasoning
Evaluation: CIDEr and METEOR from image captioning don't capture audio-specific quality; human evaluation is expensive
Ambiguous sounds: many sounds are indistinguishable without visual context (running water vs. rain vs. static)
Source counting and spatial description ('two dogs barking from the left') are not well captured by current models
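The evaluation challenge above is concrete: borrowed metrics like CIDEr reduce to weighted n-gram overlap with reference captions, which rewards surface phrasing rather than acoustic correctness. The sketch below is a simplified n-gram cosine overlap in that spirit, not the actual CIDEr (which additionally applies corpus-level TF-IDF weighting and up to 4-grams); it illustrates why 'barking' vs. 'barks' already costs score even when the audio is described correctly.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def cosine(c1, c2):
    """Cosine similarity between two n-gram count vectors."""
    common = set(c1) & set(c2)
    num = sum(c1[g] * c2[g] for g in common)
    den = math.sqrt(sum(v * v for v in c1.values())) * \
          math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def ngram_overlap(candidate, references, max_n=2):
    """Mean n-gram cosine overlap against references (simplified CIDEr-like score)."""
    cand = candidate.lower().split()
    scores = []
    for n in range(1, max_n + 1):
        c = Counter(ngrams(cand, n))
        per_ref = [cosine(c, Counter(ngrams(r.lower().split(), n)))
                   for r in references]
        scores.append(sum(per_ref) / len(per_ref))
    return sum(scores) / len(scores)

score = ngram_overlap("a dog barks in a park",
                      ["a dog barking in a park", "dogs bark at the park"])
print(f"{score:.3f}")
```

A caption that is acoustically accurate but uses different wording than the references scores poorly under any such metric, which is why human evaluation still matters for this task.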
Quick Recommendations
Best accuracy (AudioCaps)
Qwen-Audio or SALMONN-13B
Multimodal LLMs with audio encoders achieve top CIDEr scores on AudioCaps
Open-source captioning
WavCaps or EnCLAP
Strong captioning with publicly available weights; trainable on custom audio domains
General audio understanding
Pengi or LTU (Listen, Think, Understand)
Handle captioning, QA, and classification in a single model
Accessibility
Whisper (for speech) + audio captioner (for sounds)
Combined pipeline describes both speech content and environmental sounds for hearing-impaired users
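The combined accessibility pipeline boils down to merging two timestamped tracks: speech segments (e.g. from Whisper's transcription output) and environmental sound captions (from an audio captioner), interleaved chronologically into one caption stream. This is an illustrative sketch with invented example data; the segment format and merge rule are assumptions, not any library's actual API.

```python
def merge_caption_tracks(speech_segments, sound_events):
    """Interleave speech transcripts and sound captions by start time.

    Both inputs are lists of (start_sec, end_sec, text) tuples, as a speech
    recognizer and an audio captioner might plausibly produce.
    """
    track = [(start, end, f'Speech: "{text}"')
             for start, end, text in speech_segments]
    track += [(start, end, f"[{label}]")
              for start, end, label in sound_events]
    return sorted(track)

# Hypothetical outputs from the two models for a 8-second clip.
speech = [(0.0, 2.5, "Good morning everyone"), (6.0, 8.0, "Let's get started")]
sounds = [(2.5, 5.5, "door closing, footsteps approaching"),
          (5.5, 6.0, "chair scraping")]

for start, end, text in merge_caption_tracks(speech, sounds):
    print(f"{start:5.1f}-{end:5.1f}  {text}")
```

A production system would additionally handle overlapping segments (speech over background noise) and deduplicate repeated sound labels, but the core of the pipeline is this time-aligned merge.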
What's Next
Expect large-scale audio-language datasets (1M+ audio-caption pairs) created through LLM-assisted annotation pipelines. Dense audio captioning (describing each event with timestamps) will emerge as the standard task format. Multimodal models will handle audio + video captioning jointly, producing descriptions that integrate visual and auditory information. Real-time audio captioning for accessibility will become a product category.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
Audio-to-Audio
Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.