
Text-to-Audio

Text-to-audio generation produces sound effects, music, and ambient audio from natural language descriptions, a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability AI's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells". AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics capture subjective quality poorly. The unsolved challenges are temporal coherence in long-form generation (beyond roughly 30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
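Fréchet Audio Distance compares the statistics of embeddings from real and generated audio (VGGish embeddings in the original FAD paper) under a Gaussian assumption. A minimal NumPy sketch of that computation is below; the function name and the toy embedding arrays are illustrative, not part of any benchmark's official tooling.

```python
import numpy as np

def frechet_audio_distance(emb_real, emb_gen):
    """Fréchet distance between two sets of audio embeddings.

    emb_real, emb_gen: (n_samples, dim) arrays of embeddings from a
    pretrained audio model. FAD = ||mu_r - mu_g||^2
        + Tr(cov_r) + Tr(cov_g) - 2 * Tr((cov_r cov_g)^{1/2}).
    """
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_g := emb_gen, rowvar=False)
    # Tr((cov_r cov_g)^{1/2}) via eigenvalues: the product of two SPD
    # matrices has real, non-negative eigenvalues, so the trace of its
    # square root is the sum of the eigenvalues' square roots.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    covmean_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * covmean_trace

# Sanity check on synthetic embeddings: identical distributions give FAD ~ 0,
# a pure mean shift of 3 in each of 8 dimensions gives FAD = 9 * 8 = 72.
rng = np.random.default_rng(0)
e = rng.normal(size=(2000, 8))
print(frechet_audio_distance(e, e))        # ~ 0
print(frechet_audio_distance(e, e + 3.0))  # ~ 72
```

Lower is better: a model whose generated-audio embeddings match the reference distribution closely scores near zero.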

Datasets: 1 · Results: 3 · Canonical metric: FAD
Canonical benchmark

AudioCaps (T2A)

AudioCaps captions used as prompts for text-to-audio generation models. It is the standard evaluation for AudioLDM, AudioGen, and Stable Audio.

Primary metric: FAD

Top 10

Leading models on AudioCaps (T2A).

Rank  Model              FAD   Year  Source
1     Stable Audio Open  2.57  2026  paper
2     AudioGen Medium    1.82  2026  paper
3     AudioLDM 2         1.42  2026  paper

All datasets

1 dataset tracked for this task.

Related tasks

Other tasks in Audio.

Run Inference

Looking to run a model? HuggingFace hosts inference endpoints for this task type.