Audio

Audio Captioning

Generating text descriptions of audio content.


Audio captioning generates natural language descriptions of audio content — 'a dog barking in a park with birds singing in the background.' It bridges audio understanding and language generation, with applications in accessibility, search, and content indexing. The field is young compared to image captioning but advancing rapidly through AudioSet pretraining and audio-language models.

History

2019

AudioCaps (Kim et al.) introduces ~46K audio-caption pairs sourced from AudioSet; becomes the primary benchmark

2020

Clotho (Drossos et al.) provides one of the first dedicated audio captioning datasets, with ~5K clips and five crowd-sourced captions each

2021

DCASE Audio Captioning Challenge drives community research; encoder-decoder models with PANNs achieve baseline performance

2022

Audio-language pretraining (CLAP) enables better audio representations for captioning downstream

2023

WavCaps (Mei et al.) combines ChatGPT-generated pseudo-captions with real data for large-scale pretraining

2023

Pengi and SALMONN integrate audio understanding into LLMs for open-ended audio-language tasks

2024

Qwen-Audio and SALMONN-13B achieve strong audio captioning as part of multi-task audio understanding

2025

Audio-language models handle captioning, QA, and reasoning over audio in a unified framework

How Audio Captioning Works

1

Audio encoding

A pretrained audio encoder (BEATs, PANNs, or the Whisper encoder) converts the audio into a sequence of embeddings

2

Projection

Audio embeddings are projected into the language model's embedding space via a learned linear or Q-Former adapter

3

Caption generation

An autoregressive language model generates a natural language description conditioned on the projected audio features

4

Decoding

Beam search or nucleus sampling produces the final caption; length and diversity can be controlled via decoding parameters
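The four-step pipeline above can be sketched end to end with toy components. Everything here is illustrative: random untrained weights, a stand-in seven-word vocabulary, and greedy decoding in place of beam search — it shows the data flow (encode → project → condition → decode), not any real model's API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy caption vocabulary (real models use a subword tokenizer).
VOCAB = ["<bos>", "<eos>", "a", "dog", "barking", "in", "a_park"]

def encode_audio(waveform: np.ndarray, frame: int = 400) -> np.ndarray:
    """Step 1 stand-in for a pretrained encoder: one 8-dim embedding per frame."""
    n = len(waveform) // frame
    frames = waveform[: n * frame].reshape(n, frame)
    feats = np.stack(
        [frames.mean(1), frames.std(1), (frames ** 2).mean(1), np.abs(frames).max(1)],
        axis=1,
    )
    return np.tile(feats, (1, 2))  # (n_frames, 8)

def project(audio_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Step 2: learned linear adapter into the language model's embedding space."""
    return audio_emb @ W  # (n_frames, d_lm)

def generate(prefix_emb, lm_logits_fn, max_len=6):
    """Steps 3-4: greedy autoregressive decoding conditioned on the audio prefix."""
    tokens = [0]  # <bos>
    for _ in range(max_len):
        nxt = int(np.argmax(lm_logits_fn(prefix_emb, tokens)))
        tokens.append(nxt)
        if nxt == 1:  # <eos>
            break
    return [VOCAB[t] for t in tokens[1:] if t != 1]

# Wire the pieces together with random weights (untrained, shape-checking only).
wave = rng.standard_normal(16000)      # 1 s of fake 16 kHz audio
W = rng.standard_normal((8, 16))       # adapter: encoder dim 8 -> LM dim 16
prefix = project(encode_audio(wave), W)

def toy_lm(prefix_emb, tokens):
    """Deterministic toy 'LM': logits from the pooled audio prefix + last token."""
    pooled = prefix_emb.mean(0)
    logits = pooled[: len(VOCAB)] + 0.1 * tokens[-1]
    logits[tokens[-1]] = -1e9          # crude no-immediate-repeat constraint
    return logits

caption_tokens = generate(prefix, toy_lm)
print(caption_tokens)
```

In a real system the encoder and language model are large pretrained networks, only the adapter (and optionally the LM) is fine-tuned on caption pairs, and decoding uses beam search or nucleus sampling with length penalties instead of the greedy loop shown here.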

Current Landscape

Audio captioning in 2025 is still maturing — where image captioning was around 2017. The primary datasets (AudioCaps, Clotho) are small, and evaluation metrics are borrowed from image/text domains. The most promising direction is integrating audio captioning into multimodal LLMs (Qwen-Audio, SALMONN) that handle captioning as one of many audio understanding tasks. Pseudo-labeling approaches (WavCaps) are helping scale training data beyond the limited manually-annotated corpora.

Key Challenges

Data scarcity: AudioCaps (46K) is tiny compared to image captioning datasets (COCO Captions: 330K, LAION: billions)

Temporal ordering: describing a sequence of events ('first a door closes, then footsteps approach') requires temporal reasoning

Evaluation: CIDEr and METEOR from image captioning don't capture audio-specific quality; human evaluation is expensive

Polysemous sounds: many sounds are ambiguous without visual context (running water vs. rain vs. static)

Spatial reasoning: source counting and spatial description ('two dogs barking from the left') are not well captured by current models

Quick Recommendations

Best accuracy (AudioCaps)

Qwen-Audio or SALMONN-13B

Multimodal LLMs with audio encoders achieve top CIDEr scores on AudioCaps

Open-source captioning

WavCaps or EnCLAP

Strong captioning with publicly available weights; trainable on custom audio domains

General audio understanding

Pengi or LTU (Listen, Think, Understand)

Handle captioning, QA, and classification in a single model

Accessibility

Whisper (for speech) + audio captioner (for sounds)

Combined pipeline describes both speech content and environmental sounds for hearing-impaired users
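One way such a combined pipeline could present its output is by interleaving timestamped segments from the two models into a single transcript. The segment dicts below are illustrative stand-ins, not the actual output format of Whisper or any particular captioner:

```python
# Hypothetical timestamped outputs from a speech recognizer and a sound captioner.
speech = [
    {"start": 0.0, "end": 2.1, "text": "Hi, come on in."},
    {"start": 6.0, "end": 7.5, "text": "Take a seat."},
]
sounds = [
    {"start": 1.8, "end": 3.0, "text": "[door closing]"},
    {"start": 3.0, "end": 5.5, "text": "[footsteps approaching]"},
]

def merge_transcripts(speech, sounds):
    """Interleave speech and sound-event captions by start time for display."""
    tagged = [(seg["start"], seg["text"]) for seg in speech + sounds]
    return [text for _, text in sorted(tagged)]

merged = merge_transcripts(speech, sounds)
for line in merged:
    print(line)
```

Sorting by start time gives the reading order a caption viewer would display; a production system would also need overlap handling and latency-aware streaming.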

What's Next

Expect large-scale audio-language datasets (1M+ audio-caption pairs) created through LLM-assisted annotation pipelines. Dense audio captioning (describing each event with timestamps) will emerge as the standard task format. Multimodal models will handle audio + video captioning jointly, producing descriptions that integrate visual and auditory information. Real-time audio captioning for accessibility will become a product category.

Benchmarks & SOTA

No datasets indexed for this task yet.


Related Tasks

Music Generation

Generating music from text, audio, or other inputs.

Sound Event Detection

Detecting and localizing sound events in audio.

Text-to-Audio

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.

Audio-to-Audio

Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.
