Music Generation
Generating music from text, audio, or other inputs.
Music generation creates original musical compositions from text descriptions, melodies, or other conditioning signals. Suno and Udio have brought the field to a mainstream audience with impressive full-song generation including vocals, while MusicGen and Stable Audio offer open and controllable alternatives. Quality has improved dramatically, but structure, originality, and artist control remain open challenges.
History
2016: WaveNet (DeepMind) demonstrates raw audio generation at unprecedented quality, hinting at music generation potential
2020: Jukebox (OpenAI) generates 1-minute music clips with singing in various styles using a VQ-VAE + autoregressive model
2023: MusicLM (Google) generates music from text descriptions with temporal consistency; not publicly released
2023: MusicGen (Meta) generates high-quality instrumental music from text and melody conditioning in a single model
2024: Suno v3 and Udio v1 generate full songs with vocals, lyrics, and musical structure from text prompts
2024: Stable Audio 2.0 (Stability AI) enables timed, structured music generation with stereo output up to 3 minutes
2024–2025: YuE and SongComposer focus on structured song generation with explicit verse/chorus control
2024–2025: Suno v4 and Udio v2 push quality closer to professional production; copyright lawsuits from major labels intensify
How Music Generation Works
Conditioning
Text descriptions, melody hints, or genre/mood tags provide the creative direction; encoded via CLAP, T5, or FLAN-T5
Audio tokenization
Music is represented as discrete tokens using neural audio codecs (EnCodec, DAC) at multiple quantization levels
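The multi-level quantization works as residual vector quantization (RVQ), the scheme EnCodec and DAC use: each codebook level quantizes the error left by the previous levels. A minimal numpy sketch with toy codebook sizes (the real codecs use much larger learned codebooks over encoder latents):

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook level quantizes
    whatever error the previous levels left behind."""
    residual = np.asarray(frame, dtype=float)
    codes = []
    for cb in codebooks:
        # nearest codeword to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        residual = residual - cb[idx]  # pass the leftover to the next level
    return codes

def rvq_decode(codes, codebooks):
    """Sum the chosen codewords across levels to rebuild the frame."""
    return sum(cb[i] for cb, i in zip(codebooks, codes))

# Toy setup: 4 quantization levels, 16 entries each, 8-dim latent frames.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]
frame = rng.normal(size=8)
codes = rvq_encode(frame, codebooks)    # one discrete token per level
approx = rvq_decode(codes, codebooks)   # lossy reconstruction
```

Each frame thus becomes a small stack of discrete tokens, one per quantization level, which is what the transformer in the next step generates.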
Sequence generation
A transformer generates audio token sequences autoregressively, conditioned on the text/melody embeddings
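The autoregressive loop itself is simple; a sketch with a toy logits function standing in for the transformer (the real model conditions on the text/melody embeddings and interleaves the per-codebook token streams):

```python
import numpy as np

def sample_tokens(logits_fn, n_steps, vocab_size=16, temperature=0.25, seed=0):
    """Generate audio tokens one step at a time: query the model for
    next-token logits given the prefix, then sample from the softmax."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(n_steps):
        logits = logits_fn(tokens)
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        tokens.append(int(rng.choice(vocab_size, p=probs)))
    return tokens

def toy_logits(prefix, vocab_size=16):
    """Stand-in for the transformer: strongly favor (last token + 1)."""
    nxt = (prefix[-1] + 1) % vocab_size if prefix else 0
    logits = np.zeros(vocab_size)
    logits[nxt] = 8.0
    return logits

tokens = sample_tokens(toy_logits, n_steps=8)
```

The temperature knob is the same quality/diversity trade-off familiar from text LLMs: lower values make the output more deterministic.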
Decoding
Audio tokens are decoded back to waveforms via the codec decoder, producing 24–48 kHz stereo audio
Post-processing
Loudness normalization, stereo enhancement, and optional mastering effects prepare the output for playback
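The loudness step can be approximated with plain RMS normalization; production pipelines use LUFS metering per ITU-R BS.1770, so this numpy sketch is a rough stand-in, not the real algorithm:

```python
import numpy as np

def normalize_loudness(wav, target_db=-14.0):
    """Scale a [-1, 1] waveform so its RMS level hits target_db dBFS,
    a crude proxy for streaming-style loudness targets."""
    rms = np.sqrt(np.mean(wav ** 2))
    target = 10 ** (target_db / 20)        # dB -> linear amplitude
    gain = target / max(rms, 1e-9)         # avoid divide-by-zero on silence
    return np.clip(wav * gain, -1.0, 1.0)  # hard clip as a crude limiter

# quiet 440 Hz test tone, 1 second at 32 kHz
sr = 32_000
t = np.arange(sr) / sr
tone = 0.05 * np.sin(2 * np.pi * 440 * t)
out = normalize_loudness(tone)
```

A real limiter would compress peaks gradually rather than hard-clipping, but the gain computation is the core of the step.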
Current Landscape
Music generation in 2025 has captured public imagination with Suno and Udio producing songs that impress casual listeners. However, the field faces existential questions about copyright, originality, and the role of AI in creative expression. MusicGen remains the dominant open-source option, while commercial tools focus on ease of use over controllability. The architecture landscape has converged on codec-based transformers (MusicGen, SoundStorm) and latent diffusion (Stable Audio), with the quality gap between the two approaches narrowing. Professional musicians increasingly view these tools as assistants (for demos, backgrounds, inspiration) rather than replacements.
Key Challenges
Long-form structure: generating coherent songs with verse-chorus-bridge structure over 3+ minutes is unreliable
Copyright: training on copyrighted music raises legal issues; major labels have filed lawsuits against Suno and Udio
Controllability: fine-grained control over specific instruments, harmony, and arrangement is limited in current models
Vocal quality: generated singing often has artifacts in pitch accuracy, pronunciation, and breath control
Originality: models tend to produce generic, averaged versions of their training distribution rather than novel compositions
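The scale of the long-form problem is easy to quantify. Assuming MusicGen-like numbers (an EnCodec-style codec at a 50 Hz frame rate with 4 codebooks; other models differ), a full song is an order of magnitude more tokens than the ~30-second clips most models are trained on:

```python
def tokens_for_clip(duration_s, frame_rate_hz=50, n_codebooks=4):
    """Discrete codec tokens a flattened autoregressive model must emit."""
    return int(duration_s * frame_rate_hz) * n_codebooks

short_clip = tokens_for_clip(30)   # typical training-clip length
full_song = tokens_for_clip(180)   # 3-minute song
```

Keeping verse/chorus structure coherent across tens of thousands of tokens is what makes full-song generation hard.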
Quick Recommendations
Full song generation
Suno v4 or Udio v2
Best quality for complete songs with vocals, lyrics, and production; consumer-friendly interface
Open-source instrumental
MusicGen-large (3.3B)
Meta's open model; controllable via text and melody conditioning; good for research and custom applications
Controllable generation
Stable Audio 2.0
Timing control and stereo output; more predictable structure than fully generative approaches
Background / ambient music
MusicGen-small or Riffusion
Lightweight models suitable for generating non-focal background music at scale
AI-assisted composition
AIVA (Amper Music was discontinued after its acquisition by Shutterstock)
Designed for professional workflows with DAW integration and stem-level control
What's Next
Expect fine-grained control interfaces where users can specify chord progressions, instrument arrangement, and structural elements while AI handles production. Multi-track generation (separate stems for each instrument) will enable DAW integration. Copyright-safe models trained exclusively on licensed or public domain music will emerge as a commercial category. Real-time collaborative composition between humans and AI will become the creative frontier.
Benchmarks & SOTA
No datasets indexed for this task yet.
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
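FAD compares Gaussians fit to embedding sets from real and generated audio (originally VGGish embeddings). A sketch that assumes diagonal covariances for simplicity; the full metric uses full covariance matrices and a matrix square root:

```python
import numpy as np

def fad_diagonal(emb_real, emb_gen):
    """Fréchet distance between Gaussians fit to two embedding sets,
    simplified to diagonal covariances."""
    mu_r, mu_g = emb_real.mean(axis=0), emb_gen.mean(axis=0)
    var_r, var_g = emb_real.var(axis=0), emb_gen.var(axis=0)
    mean_term = ((mu_r - mu_g) ** 2).sum()
    cov_term = (var_r + var_g - 2 * np.sqrt(var_r * var_g)).sum()
    return float(mean_term + cov_term)

rng = np.random.default_rng(1)
real = rng.normal(size=(512, 128))   # stand-in for real-audio embeddings
close = real + rng.normal(scale=0.1, size=real.shape)
far = real + 2.0                     # clearly shifted distribution
```

Lower is better; identical distributions score zero, which is why FAD rewards matching the reference distribution rather than any particular clip.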
Audio-to-Audio
Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.