Audiotext-to-audio

Text-to-Audio

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.

2 datasets0 resultsView full task mapping →

Text-to-audio generates general sound effects, ambient soundscapes, and music from text descriptions. AudioLDM and MusicGen launched the field, and models like Stable Audio 2.0 and Udio now produce production-quality audio from natural language prompts. The task is rapidly maturing but still struggles with fine-grained temporal control and complex multi-source compositions.

History

2020

Jukebox (OpenAI) generates raw audio music with lyrics using VQ-VAE, showing neural audio generation is feasible

2023

AudioLDM (Liu et al.) applies latent diffusion to audio generation, producing sound effects from text prompts

2023

MusicGen (Meta) generates music from text and melody conditioning with a single-stage transformer over audio tokens

2023

Bark (Suno) generates speech, music, and sound effects in a unified model

2024

Stable Audio 2.0 (Stability AI) enables 3-minute high-quality audio generation with timing control

2024

Udio and Suno v3 produce near-professional-quality songs from text descriptions, including vocals

2024

AudioLDM 2 and Make-An-Audio 2 improve temporal coherence and multi-source generation

2025

ElevenLabs Sound Effects and commercial APIs make text-to-audio accessible for content creators at scale

How Text-to-Audio Works

1Text encodingThe text prompt is encoded …2Latent generationA latent diffusion model (o…3Audio decodingA neural vocoder (HiFi-GAN)…4Post-processingGenerated audio may be deno…Text-to-Audio Pipeline
1

Text encoding

The text prompt is encoded using CLAP (audio-language model) or FLAN-T5 to produce conditioning embeddings

2

Latent generation

A latent diffusion model (or autoregressive transformer) generates audio representations in a compressed latent space

3

Audio decoding

A neural vocoder (HiFi-GAN) or codec decoder converts latent representations to raw audio waveforms

4

Post-processing

Generated audio may be denoised, normalized, and trimmed to produce clean output

Current Landscape

Text-to-audio in 2025 is where text-to-image was in 2023: exciting, rapidly improving, but not yet production-reliable for all use cases. Sound effects and ambient audio are largely solved — models produce convincing environmental sounds. Music generation has made a stunning leap with Udio and Suno producing songs that casual listeners find impressive, though musicians note issues with repetition, structure, and production quality. The latent diffusion approach (AudioLDM, Stable Audio) dominates for sound effects, while codec-based transformers (MusicGen, SoundStorm) lead for music.

Key Challenges

Temporal precision: specifying exact timing ('thunder at 3 seconds, then rain fading in') is unreliable in current models

Complex compositions: generating multiple simultaneous sound sources with correct spatial relationships

Long-form coherence: maintaining musical structure (verse-chorus-bridge) over minutes-long generations

Copyright and training data concerns: models trained on copyrighted music face legal challenges

Audio quality still falls below studio recording quality, especially for music with vocals

Quick Recommendations

Best sound effects

ElevenLabs Sound Effects or AudioLDM 2

High-quality environmental sounds and foley from natural language descriptions

Music generation

Udio or Suno v3

Full song generation with vocals, instrumentation, and musical structure from text

Open-source music

MusicGen-large (3.3B)

Meta's open model generates instrumental music from text and melody prompts

Ambient/background audio

Stable Audio 2.0

Up to 3 minutes of high-quality ambient generation with timing control

Sound design (film/games)

AudioLDM 2 + manual layering

Generate individual sound elements and compose them in a DAW for full control

What's Next

Expect real-time audio generation for games and interactive media, full multimodal models that generate synchronized audio for video (Sora-style), and fine-grained control over individual instruments and sound layers. Music generation will converge with production tools (DAWs), enabling AI-assisted composition workflows. Personalized audio generation — models trained on your musical style or brand's sonic identity — will emerge as a commercial category.

Benchmarks & SOTA

Related Tasks

Something wrong or missing?

Help keep Text-to-Audio benchmarks accurate. Report outdated results, missing benchmarks, or errors.

0/2000