Audio

Music Generation

Generating music from text, audio, or other inputs.


Music generation creates original musical compositions from text descriptions, melodies, or other conditioning signals. Suno and Udio have brought the field to a mainstream audience with impressive full-song generation including vocals, while MusicGen and Stable Audio offer open and controllable alternatives. Quality has improved dramatically, but structure, originality, and artist control remain open challenges.

History

2016

WaveNet (DeepMind) demonstrates raw audio generation at unprecedented quality, hinting at music generation potential

2020

Jukebox (OpenAI) generates 1-minute music clips with singing in various styles using VQ-VAE + autoregressive model

2023

MusicGen (Meta) generates high-quality instrumental music from text and melody conditioning in a single model

2023

MusicLM (Google) generates music from text descriptions with temporal consistency; model weights were never released, though Google later opened a limited public demo

2024

Suno v3 and Udio v1 generate full songs with vocals, lyrics, and musical structure from text prompts

2024

Stable Audio 2.0 (Stability AI) enables timed, structured music generation with stereo output up to 3 minutes

2024

YuE and SongComposer focus on structured song generation with explicit verse/chorus control

2025

Suno v4 and Udio v2 push quality closer to professional production; copyright lawsuits from major labels intensify

How Music Generation Works

Music Generation Pipeline
1

Conditioning

Text descriptions, melody hints, or genre/mood tags provide the creative direction; encoded via CLAP, T5, or FLAN-T5

2

Audio tokenization

Music is represented as discrete tokens using neural audio codecs (EnCodec, DAC) at multiple quantization levels
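Residual vector quantization, the scheme behind codecs like EnCodec and DAC, can be illustrated with a toy NumPy sketch. The codebooks below are random stand-ins rather than trained codec codebooks, and the frame dimension and level counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual vector quantizer: 4 codebooks of 256 entries over 8-dim frames.
# Real codecs learn these codebooks; here they are random placeholders.
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 256, 8
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM))

def rvq_encode(frames):
    """Quantize frames level by level; each level codes the residual left by the previous one."""
    residual = frames.copy()
    tokens = np.empty((NUM_LEVELS, len(frames)), dtype=np.int64)
    for level in range(NUM_LEVELS):
        # Nearest codebook entry for every frame at this level.
        dists = np.linalg.norm(residual[:, None, :] - codebooks[level], axis=-1)
        tokens[level] = dists.argmin(axis=1)
        residual = residual - codebooks[level][tokens[level]]
    return tokens

def rvq_decode(tokens):
    """Sum the selected entries across levels to approximately reconstruct the frames."""
    return sum(codebooks[level][tokens[level]] for level in range(NUM_LEVELS))

frames = rng.normal(size=(100, DIM))   # stand-in for encoder output frames
tokens = rvq_encode(frames)            # shape (4, 100): discrete audio tokens
recon = rvq_decode(tokens)
```

Each additional quantization level refines the reconstruction, which is why models can trade quality for sequence length by dropping the finer codebooks.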

3

Sequence generation

A transformer generates audio token sequences autoregressively, conditioned on the text/melody embeddings
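Stripped of the transformer itself, the autoregressive stage is a sampling loop over the token vocabulary, conditioned on the text embedding and the tokens emitted so far. This sketch uses a random stand-in for the model's forward pass and a single codebook stream for simplicity (MusicGen-style models interleave several):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 256  # codec vocabulary size, e.g. one codebook of a neural codec

def fake_logits(cond, tokens):
    """Stand-in for a transformer forward pass: a real model attends over
    the conditioning embedding and all previously generated tokens."""
    seed = hash((round(float(cond.sum()), 6), tuple(tokens))) % (2**32)
    return np.random.default_rng(seed).normal(size=VOCAB)

def generate(cond, steps, temperature=1.0):
    """Sample one audio token at a time from a temperature-scaled softmax."""
    tokens = []
    for _ in range(steps):
        logits = fake_logits(cond, tokens) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens

text_embedding = rng.normal(size=64)  # stand-in for a T5/CLAP text encoding
seq = generate(text_embedding, steps=50)
```

Lower temperatures concentrate probability mass on the model's top choices, which in real systems trades diversity for stability.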

4

Decoding

Audio tokens are decoded back to waveforms via the codec decoder, producing 24-48 kHz stereo audio

5

Post-processing

Loudness normalization, stereo enhancement, and optional mastering effects prepare the output for playback
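The loudness-normalization part of step 5 can be approximated with a simple RMS gain toward a target level. Full LUFS normalization (ITU-R BS.1770) adds frequency weighting and gating that this sketch skips, and the -14 dB target is an illustrative streaming-style assumption:

```python
import numpy as np

def normalize_loudness(audio, target_db=-14.0, peak_ceiling=0.99):
    """Scale audio so its RMS sits near target_db dBFS, then cap the peaks."""
    rms = np.sqrt(np.mean(audio**2))
    gain = 10 ** (target_db / 20) / max(rms, 1e-9)
    out = audio * gain
    peak = np.max(np.abs(out))
    if peak > peak_ceiling:  # crude peak limiting: scale the whole clip down
        out *= peak_ceiling / peak
    return out

sr = 24_000
t = np.linspace(0, 1.0, sr, endpoint=False)
quiet = 0.01 * np.sin(2 * np.pi * 440 * t)  # a quiet 440 Hz test tone
loudened = normalize_loudness(quiet)
```

Production mastering chains replace the last step with a true limiter so quiet passages are not pulled down along with the peaks.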

Current Landscape

Music generation in 2025 has captured public imagination with Suno and Udio producing songs that impress casual listeners. However, the field faces existential questions about copyright, originality, and the role of AI in creative expression. MusicGen remains the dominant open-source option, while commercial tools focus on ease of use over controllability. The architecture landscape has converged on codec-based transformers (MusicGen, SoundStorm) and latent diffusion (Stable Audio), with the quality gap between the two approaches narrowing. Professional musicians increasingly view these tools as assistants (for demos, backgrounds, inspiration) rather than replacements.

Key Challenges

Long-form structure: generating coherent songs with verse-chorus-bridge structure over 3+ minutes is unreliable

Copyright: training on copyrighted music raises legal issues; major labels have filed lawsuits against Suno and Udio

Controllability: fine-grained control over specific instruments, harmony, and arrangement is limited in current models

Vocal quality: generated singing often has artifacts in pitch accuracy, pronunciation, and breath control

Originality: models tend to produce generic, averaged versions of their training distribution rather than novel compositions

Quick Recommendations

Full song generation

Suno v4 or Udio v2

Best quality for complete songs with vocals, lyrics, and production; consumer-friendly interface

Open-source instrumental

MusicGen-large (3.3B)

Meta's open model; controllable via text and melody conditioning; good for research and custom applications

Controllable generation

Stable Audio 2.0

Timing control and stereo output; structure is more predictable than with prompt-only generation

Background / ambient music

MusicGen-small or Riffusion

Lightweight models suitable for generating non-focal background music at scale

AI-assisted composition

AIVA or Amper Music

Designed for professional workflows with DAW integration and stem-level control

What's Next

Expect fine-grained control interfaces where users can specify chord progressions, instrument arrangement, and structural elements while AI handles production. Multi-track generation (separate stems for each instrument) will enable DAW integration. Copyright-safe models trained exclusively on licensed or public domain music will emerge as a commercial category. Real-time collaborative composition between humans and AI will become the creative frontier.

Benchmarks & SOTA

No datasets indexed for this task yet.


Related Tasks

Audio Captioning

Generating text descriptions of audio content.

Sound Event Detection

Detecting and localizing sound events in audio.

Text-to-Audio

Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.
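Fréchet Audio Distance compares Gaussians fitted to embeddings (originally VGGish) of reference and generated audio, using the same formula as FID. The sketch below assumes diagonal covariances to avoid a matrix square root; the full metric uses dense covariances, and the embeddings here are random stand-ins:

```python
import numpy as np

def frechet_audio_distance(real_emb, fake_emb):
    """FAD between two embedding sets, with diagonal covariances for simplicity."""
    mu_r, mu_f = real_emb.mean(axis=0), fake_emb.mean(axis=0)
    var_r, var_f = real_emb.var(axis=0), fake_emb.var(axis=0)
    mean_term = np.sum((mu_r - mu_f) ** 2)
    cov_term = np.sum(var_r + var_f - 2 * np.sqrt(var_r * var_f))
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 128))   # stand-in embeddings of reference audio
close = rng.normal(0.1, 1.0, size=(500, 128))  # generated audio with similar statistics
far = rng.normal(1.0, 2.0, size=(500, 128))    # generated audio with very different statistics
# Lower FAD means the generated set's embedding statistics are closer to the reference.
```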

Audio-to-Audio

Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer — any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN, now used in every video call; voice conversion took a leap with RVC and So-VITS-SVC enabling zero-shot voice cloning that sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, bass from a mix) reached near-production quality with HTDemucs and Band-Split RNN, making stems extraction a solved problem for most music. The field is converging toward unified models that handle multiple audio transformations through natural language instructions, blurring the line with text-to-audio generation.
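Many separation and enhancement models (mask-predicting architectures such as Band-Split RNN; Demucs instead works in the waveform domain) are trained to regress a time-frequency mask. The "ideal ratio mask" target can be sketched on stand-in magnitude spectrograms:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in magnitude spectrograms (freq bins x frames) for two sources.
vocals = rng.random((257, 100))
accomp = rng.random((257, 100))
mixture = vocals + accomp  # magnitudes only add approximately in practice;
                           # the simplification keeps the sketch self-contained

# Ideal ratio mask: the per-bin fraction of energy belonging to the vocals.
# A trained separator predicts this mask from the mixture alone.
mask = vocals / (vocals + accomp + 1e-9)
est_vocals = mask * mixture
```

With the true mask, recovery is near-perfect by construction; the hard part is predicting the mask from the mixture, which is exactly what the neural network learns.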
