Codesota · Tasks · Text-to-Audio
Audio · text-to-audio

Text-to-Audio.

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions. The field barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX then raised quality sharply, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, scored with Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics capture subjective quality poorly. The unsolved challenges are temporal coherence in long-form generation (longer than 30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
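Since FAD is the canonical metric on this page, a minimal sketch of the underlying calculation may help. It fits a Gaussian to each set of clip embeddings and computes the Fréchet distance between the two, ||mu_r - mu_g||^2 + tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). The function name `frechet_distance` and the random arrays are assumptions for illustration only: real FAD uses embeddings from a pretrained audio model (commonly VGGish), which this sketch leaves out.

```python
import numpy as np


def frechet_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    Each input has shape (num_clips, embedding_dim). Hypothetical helper,
    not the API of any particular FAD library.
    """
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.atleast_2d(np.cov(ref_emb, rowvar=False))
    sigma_g = np.atleast_2d(np.cov(gen_emb, rowvar=False))

    # tr((Sigma_r Sigma_g)^(1/2)) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative
    # for covariance matrices.
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    # Clip tiny negative values caused by floating-point noise.
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)


# Stand-in embeddings (assumption): 200 clips, 4 dimensions.
rng = np.random.default_rng(0)
ref = rng.normal(size=(200, 4))
gen = ref + 1.0  # same covariance, mean shifted by 1 in every dimension

print(frechet_distance(ref, ref))  # approximately 0: identical distributions
print(frechet_distance(ref, gen))  # approximately 4: squared norm of the mean shift
```

Lower is better. Scores are only comparable when the embedding model and the reference set are the same, so the absolute numbers from this toy example carry no meaning for AudioCaps results.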

1 dataset · 0 results · Canonical metric: fad
§ 02 · Canonical benchmark

The reference dataset.

AudioCaps (T2A)

AudioCaps captions are used as prompts for text-to-audio generation models. This is the standard evaluation set for AudioLDM, AudioGen, and Stable Audio.

Primary metric: fad
§ 03 · Top 10

Leading models.

Leading models on AudioCaps (T2A).

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

1 dataset tracked for this task.

AudioCaps (T2A) · CANONICAL · 0 results · fad
§ 05 · Related tasks

Other tasks in Audio.

Audio Captioning · Audio-to-Audio · Music Generation · Sound Event Detection · Voice Activity Detection