Text-to-Audio
Text-to-audio generates sound effects, ambient soundscapes, and music from natural language descriptions, a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability AI's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, and models like Stable Audio 2.0 and Udio now produce near-production-quality audio from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The task is rapidly maturing, yet it still struggles with temporal coherence in long-form generation (beyond roughly 30 seconds), fine-grained control over timing and structure, complex multi-source compositions, and harmonic consistency across full songs.
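The Fréchet Audio Distance mentioned above compares the statistics of embeddings (typically from a VGGish or CLAP encoder) computed on real versus generated audio. A minimal sketch of the computation, assuming the embeddings have already been extracted as NumPy arrays (the function name and signature are for illustration only):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each set of embeddings, then
    # compute the Fréchet distance between the two Gaussians:
    #   ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2})
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        # Numerical error can introduce tiny imaginary components.
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

Identical embedding distributions yield a score near zero; larger values indicate generated audio whose embedding statistics diverge from the reference set.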
History
Jukebox (OpenAI, 2020) generates raw-audio music with lyrics using a VQ-VAE, demonstrating that neural music generation is feasible
AudioLDM (Liu et al., 2023) applies latent diffusion to audio generation, producing sound effects from text prompts
MusicGen (Meta, 2023) generates music from text and melody conditioning with a single-stage transformer over audio tokens
Bark (Suno, 2023) generates speech, music, and sound effects in a single unified model
AudioLDM 2 and Make-An-Audio 2 (2023) improve temporal coherence and multi-source generation
Stable Audio 2.0 (Stability AI, 2024) enables high-quality generation up to 3 minutes long with timing control
Udio and Suno v3 (2024) produce near-professional-quality songs from text descriptions, including vocals
ElevenLabs Sound Effects and other commercial APIs (2024) make text-to-audio accessible to content creators at scale
How Text-to-Audio Works
Text encoding
The text prompt is encoded with CLAP (Contrastive Language-Audio Pretraining) or a FLAN-T5 text encoder to produce conditioning embeddings
Latent generation
A latent diffusion model (or autoregressive transformer) generates audio representations in a compressed latent space
Audio decoding
A neural vocoder (e.g., HiFi-GAN) or a neural codec decoder (e.g., EnCodec) converts latent representations into raw audio waveforms
Post-processing
Generated audio may be denoised, normalized, and trimmed to produce clean output
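The four stages above can be sketched end to end. This is an illustrative toy pipeline, not any real model's API: the encoder, denoiser, and decoder below are stand-in functions whose names, shapes, and update rule are invented for the example.

```python
import numpy as np

def encode_text(prompt: str, dim: int = 512) -> np.ndarray:
    # Stand-in for a CLAP/FLAN-T5 encoder: map the prompt to a
    # fixed-size conditioning embedding (hash-seeded for the demo).
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(dim)

def generate_latents(cond: np.ndarray, steps: int = 50,
                     latent_shape=(8, 64, 16)) -> np.ndarray:
    # Stand-in for the latent diffusion loop: start from noise and take
    # `steps` denoising updates nudged by the conditioning embedding.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(latent_shape)
    for t in range(steps, 0, -1):
        noise_estimate = 0.1 * x + 0.01 * cond.mean()  # toy "denoiser"
        x = x - (1.0 / t) * noise_estimate
    return x

def decode_audio(latents: np.ndarray, hop: int = 256) -> np.ndarray:
    # Stand-in for a vocoder/codec decoder: upsample latents to a
    # waveform and clip samples to [-1, 1].
    wave = np.repeat(latents.mean(axis=0).ravel(), hop)
    return np.clip(wave, -1.0, 1.0)

def postprocess(wave: np.ndarray, peak: float = 0.95) -> np.ndarray:
    # Peak-normalize, then trim leading/trailing near-silence.
    wave = wave * (peak / (np.abs(wave).max() + 1e-8))
    idx = np.flatnonzero(np.abs(wave) > 1e-3)
    return wave[idx[0]:idx[-1] + 1] if idx.size else wave

cond = encode_text("thunderstorm with distant church bells")
audio = postprocess(decode_audio(generate_latents(cond)))
```

Real systems differ in the details (e.g., autoregressive token generation instead of diffusion), but the encode-generate-decode-postprocess structure is common to most of the models discussed here.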
Current Landscape
Text-to-audio in 2025 is where text-to-image was in 2023: exciting, rapidly improving, but not yet production-reliable for all use cases. Sound effects and ambient audio are largely solved — models produce convincing environmental sounds. Music generation has made a stunning leap with Udio and Suno producing songs that casual listeners find impressive, though musicians note issues with repetition, structure, and production quality. The latent diffusion approach (AudioLDM, Stable Audio) dominates for sound effects, while codec-based transformers (MusicGen, SoundStorm) lead for music.
Key Challenges
Temporal precision: specifying exact timing ('thunder at 3 seconds, then rain fading in') is unreliable in current models
Complex compositions: generating multiple simultaneous sound sources with correct spatial relationships
Long-form coherence: maintaining musical structure (verse-chorus-bridge) over minutes-long generations
Copyright and training data concerns: models trained on copyrighted music face legal challenges
Audio quality: generated audio still falls short of studio recording quality, especially for music with vocals
Quick Recommendations
Best sound effects
ElevenLabs Sound Effects or AudioLDM 2
High-quality environmental sounds and foley from natural language descriptions
Music generation
Udio or Suno v3
Full song generation with vocals, instrumentation, and musical structure from text
Open-source music
MusicGen-large (3.3B)
Meta's open model generates instrumental music from text and melody prompts
Ambient/background audio
Stable Audio 2.0
Up to 3 minutes of high-quality ambient generation with timing control
Sound design (film/games)
AudioLDM 2 + manual layering
Generate individual sound elements and compose them in a DAW for full control
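The "manual layering" workflow above amounts to placing individually generated elements on a shared timeline and mixing them down. A minimal sketch with a hypothetical `layer_stems` helper (its name and signature are invented for illustration):

```python
import numpy as np

def layer_stems(stems, sr: int = 44100, peak: float = 0.9) -> np.ndarray:
    # stems: list of (waveform, gain, start_seconds) tuples, one per
    # generated element. Place each on a shared timeline, sum, then
    # peak-normalize the mix to avoid clipping.
    end = max(int(start * sr) + len(wave) for wave, _, start in stems)
    mix = np.zeros(end)
    for wave, gain, start in stems:
        i = int(start * sr)
        mix[i:i + len(wave)] += gain * np.asarray(wave)
    return mix * (peak / (np.abs(mix).max() + 1e-8))
```

In practice a DAW gives far finer control (fades, EQ, automation); this sketch only shows why generating elements separately makes timing trivially controllable even when the model itself is not.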
What's Next
Expect real-time audio generation for games and interactive media, full multimodal models that generate synchronized audio for video (Sora-style), and fine-grained control over individual instruments and sound layers. Music generation will converge with production tools (DAWs), enabling AI-assisted composition workflows. Personalized audio generation — models trained on your musical style or brand's sonic identity — will emerge as a commercial category.
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Audio-to-Audio
Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.