Text-to-Audio
Text-to-audio generates sound effects, ambient soundscapes, and music from natural language descriptions, a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability AI's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, and models like Stable Audio 2.0 and Udio now produce near-production-quality audio from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The task is rapidly maturing, yet it still struggles with temporal coherence in long-form generation (beyond roughly 30 seconds), fine-grained control over timing and structure, complex multi-source compositions, and harmonic consistency across full songs.
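The Fréchet Audio Distance mentioned above compares the statistics of embeddings (typically from a VGGish or CLAP encoder) computed on real versus generated audio. A minimal sketch of the computation, assuming the embeddings have already been extracted as NumPy arrays (the function name and signature are for illustration only):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each set of embeddings, then
    # compute the Fréchet distance between the two Gaussians:
    #   ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce tiny imaginary components.
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Identical embedding distributions yield a score near zero; larger values indicate generated audio whose embedding statistics diverge from the reference set.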
History
Jukebox (OpenAI, 2020) generates raw-audio music with lyrics using a VQ-VAE, demonstrating that neural music generation is feasible
AudioLDM (Liu et al., 2023) applies latent diffusion to audio generation, producing sound effects from text prompts
MusicGen (Meta, 2023) generates music from text and melody conditioning with a single-stage transformer over audio tokens
Bark (Suno, 2023) generates speech, music, and sound effects in a single unified model
AudioLDM 2 and Make-An-Audio 2 (2023) improve temporal coherence and multi-source generation
Stable Audio 2.0 (Stability AI, 2024) enables high-quality generation up to 3 minutes long with timing control
Udio and Suno v3 (2024) produce near-professional-quality songs from text descriptions, including vocals
ElevenLabs Sound Effects and other commercial APIs (2024) make text-to-audio accessible to content creators at scale
How Text-to-Audio Works
Text encoding
The text prompt is encoded with CLAP (Contrastive Language-Audio Pretraining) or a FLAN-T5 text encoder to produce conditioning embeddings
Latent generation
A latent diffusion model (or autoregressive transformer) generates audio representations in a compressed latent space
Audio decoding
A neural vocoder (e.g., HiFi-GAN) or a neural codec decoder (e.g., EnCodec) converts latent representations into raw audio waveforms
Post-processing
Generated audio may be denoised, normalized, and trimmed to produce clean output
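The four stages above can be sketched end to end. This is an illustrative toy pipeline, not any real model's API: the encoder, denoiser, and decoder below are stand-in functions whose names, shapes, and update rule are invented for the example.

```python
import numpy as np

def encode_text(prompt: str, dim: int = 512) -> np.ndarray:
    # Stand-in for a CLAP/FLAN-T5 encoder: map the prompt to a
    # fixed-size conditioning embedding (hash-seeded for the demo).
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def generate_latents(cond: np.ndarray, steps: int = 50,
                     latent_shape=(8, 64, 16)) -> np.ndarray:
    # Stand-in for the latent diffusion loop: start from noise and take
    # `steps` denoising updates nudged by the conditioning embedding.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(latent_shape)
    for t in range(steps, 0, -1):
        noise_estimate = 0.1 * x + 0.01 * cond.mean()  # toy "denoiser"
        x = x - (1.0 / t) * noise_estimate
    return x

def decode_audio(latents: np.ndarray, hop: int = 256) -> np.ndarray:
    # Stand-in for a vocoder/codec decoder: upsample latents to a
    # waveform and clip samples to [-1, 1].
    wave = np.repeat(latents.mean(axis=0).ravel(), hop)
    return np.clip(wave, -1.0, 1.0)

def postprocess(wave: np.ndarray, peak: float = 0.95) -> np.ndarray:
    # Peak-normalize, then trim leading/trailing near-silence.
    wave = wave * (peak / (np.abs(wave).max() + 1e-8))
    idx = np.flatnonzero(np.abs(wave) > 1e-3)
    return wave[idx[0]:idx[-1] + 1] if idx.size else wave

cond = encode_text("thunderstorm with distant church bells")
audio = postprocess(decode_audio(generate_latents(cond)))
```

Real systems differ in the details (e.g., autoregressive token generation instead of diffusion), but the encode-generate-decode-postprocess structure is common to most of the models discussed here.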
Current Landscape
Text-to-audio in 2025 is where text-to-image was in 2023: exciting, rapidly improving, but not yet production-reliable for all use cases. Sound effects and ambient audio are largely solved — models produce convincing environmental sounds. Music generation has made a stunning leap with Udio and Suno producing songs that casual listeners find impressive, though musicians note issues with repetition, structure, and production quality. The latent diffusion approach (AudioLDM, Stable Audio) dominates for sound effects, while codec-based transformers (MusicGen, SoundStorm) lead for music.
Key Challenges
Temporal precision: specifying exact timing ('thunder at 3 seconds, then rain fading in') is unreliable in current models
Complex compositions: generating multiple simultaneous sound sources with correct spatial relationships
Long-form coherence: maintaining musical structure (verse-chorus-bridge) over minutes-long generations
Copyright and training data concerns: models trained on copyrighted music face legal challenges
Audio quality: generated audio still falls short of studio recording quality, especially for music with vocals
Quick Recommendations
Best sound effects
ElevenLabs Sound Effects or AudioLDM 2
High-quality environmental sounds and foley from natural language descriptions
Music generation
Udio or Suno v3
Full song generation with vocals, instrumentation, and musical structure from text
Open-source music
MusicGen-large (3.3B)
Meta's open model generates instrumental music from text and melody prompts
Ambient/background audio
Stable Audio 2.0
Up to 3 minutes of high-quality ambient generation with timing control
Sound design (film/games)
AudioLDM 2 + manual layering
Generate individual sound elements and compose them in a DAW for full control
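The "manual layering" workflow above amounts to placing individually generated elements on a shared timeline and mixing them down. A minimal sketch with a hypothetical `layer_stems` helper (its name and signature are invented for illustration):

```python
import numpy as np

def layer_stems(stems, sr: int = 44100, peak: float = 0.9) -> np.ndarray:
    # stems: list of (waveform, gain, start_seconds) tuples, one per
    # generated element. Place each on a shared timeline, sum, then
    # peak-normalize the mix to avoid clipping.
    end = max(int(start * sr) + len(wave) for wave, _, start in stems)
    mix = np.zeros(end)
    for wave, gain, start in stems:
        i = int(start * sr)
        mix[i:i + len(wave)] += gain * np.asarray(wave)
    return mix * (peak / (np.abs(mix).max() + 1e-8))
```

In practice a DAW gives far finer control (fades, EQ, automation); this sketch only shows why generating elements separately makes timing trivially controllable even when the model itself is not.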
What's Next
Expect real-time audio generation for games and interactive media, full multimodal models that generate synchronized audio for video (Sora-style), and fine-grained control over individual instruments and sound layers. Music generation will converge with production tools (DAWs), enabling AI-assisted composition workflows. Personalized audio generation — models trained on your musical style or brand's sonic identity — will emerge as a commercial category.
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Audio-to-Audio
Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.