Audio-to-Audio
Audio-to-audio transforms input audio into modified output audio — covering speech enhancement, source separation, voice conversion, style transfer, and audio super-resolution. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN and now runs in every major video call; voice conversion took a leap with RVC and So-VITS-SVC, whose few-shot voice cloning (both require training on target-speaker data) sparked creative tools and deepfake concerns alike. Source separation (isolating vocals, drums, and bass from a mix) reached near-production quality with HTDemucs and BandSplitRNN, making stem extraction effectively solved for most music. The field has matured from simple noise reduction to neural models that separate overlapping speakers, enhance degraded recordings, and transform audio characteristics in real time — and it is converging toward unified models that handle multiple transformations through natural language instructions, blurring the line with text-to-audio generation.
History
Deep neural networks first applied to speech enhancement (denoising) in the mid-2010s, outperforming classical spectral subtraction and Wiener filtering
Wave-U-Net (Stoller et al., 2018) adapts U-Net to raw audio for source separation; Conv-TasNet (2019) follows with superior separation quality
DTLN (Dual-signal Transformation LSTM Network, 2020) enables real-time noise suppression on mobile devices
Demucs (Meta, 2019) achieves state-of-the-art music source separation, isolating vocals, drums, bass, and other instruments
HiFi-GAN (2020) and BigVGAN (2022) neural vocoders enable high-fidelity waveform synthesis, underpinning pipelines that upsample low-quality audio for super-resolution
AudioSR (Liu et al., 2023) introduces diffusion-based audio super-resolution from 4kHz inputs to 48kHz output
Demucs v4 (Hybrid Transformer, 2022) and BandSplitRNN push music separation to near-studio quality
NVIDIA Maxine and Krisp demonstrate commercial real-time audio enhancement with background noise and echo removal
Real-time voice conversion and style transfer models enable live audio transformation in consumer applications
How Audio-to-Audio Works
Input representation
Audio is converted to time-frequency representations (STFT spectrograms) or processed as raw waveforms
Mask estimation / generation
A neural network (U-Net, transformer, or TasNet) estimates a mask that isolates target audio from the mixture
Source reconstruction
The mask is applied to the mixture spectrogram, and inverse STFT or a neural decoder reconstructs the waveform
Post-processing
Phase reconstruction, artifact removal, and loudness normalization produce clean output audio
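The four steps above can be sketched end to end with a toy magnitude mask. Here a simple binary spectral gate stands in for the neural mask estimator — the gate, the STFT parameters, and the test signal are all illustrative assumptions, not any particular model's design:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_enhance(mixture, sr, mask_fn, nperseg=512):
    """Mask-based pipeline: STFT -> estimate mask -> apply -> iSTFT.
    `mask_fn` stands in for a learned mask estimator (assumption)."""
    _, _, Z = stft(mixture, fs=sr, nperseg=nperseg)
    mask = mask_fn(np.abs(Z))                  # one value in [0, 1] per TF bin
    _, out = istft(Z * mask, fs=sr, nperseg=nperseg)  # reuses the noisy phase
    return out[: len(mixture)]

# toy demo: a binary gate that keeps only bins above the median magnitude
sr = 16000
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s clean "target"
noisy = tone + 0.1 * rng.standard_normal(sr)          # additive white noise
gate = lambda mag: (mag > np.median(mag)).astype(float)
enhanced = mask_enhance(noisy, sr, gate)
```

Real systems replace the gate with a learned (often complex-valued) mask and add explicit phase estimation, which is what the post-processing step addresses.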
Current Landscape
Audio-to-audio in 2025 spans a diverse set of subtasks unified by the common thread of transforming input audio. Speech enhancement (noise removal) is effectively solved for real-time applications, with commercial products (Krisp, NVIDIA Maxine) used by millions daily. Music source separation has reached impressive quality with Demucs v4 and BandSplitRNN, enabling stem extraction for DJs, remixers, and music production. The emerging frontier is universal audio transformation models that handle multiple tasks (enhancement, separation, conversion, upsampling) with a single architecture.
Key Challenges
Real-time processing with low latency (<20ms) requires efficient architectures and aggressive optimization
Separation quality degrades significantly for music mixes with more than 4 instrument stems
Generalization across recording conditions: models trained on studio audio fail on phone recordings or field audio
Artifact-free processing: neural models can introduce subtle metallic or warbling artifacts in enhanced audio
Evaluation: objective metrics (SDR, PESQ, STOI) don't always correlate with perceived quality
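The metric/perception gap is easier to see with the metrics in hand. A minimal sketch of scale-invariant SDR (SI-SDR, a widely used variant of SDR; using it here rather than the classic BSS-Eval SDR is a simplification):

```python
import numpy as np

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: higher means the estimate is closer
    to the reference, ignoring overall gain differences."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # optimal scaling: project the estimate onto the reference
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))
```

Because the projection absorbs gain, SI-SDR is unchanged when the estimate is uniformly rescaled — which is also why it can miss gain-related artifacts a listener would notice.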
Quick Recommendations
Speech enhancement (real-time)
NVIDIA Maxine or Krisp SDK
Production-grade noise suppression and echo cancellation at <10ms latency
Music source separation
Demucs v4 (htdemucs)
Best open-source separation of vocals, drums, bass, and other stems from mixed tracks
Audio super-resolution
AudioSR or NVSR
Upsamples low-bandwidth audio (phone calls, old recordings) to 48kHz quality
Speech separation (cocktail party)
SepFormer or TF-GridNet
Separate overlapping speakers from a single microphone recording
Open-source denoising
DeepFilterNet 3
Real-time speech enhancement on CPU; open-source and lightweight
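The latency figures quoted above (<10ms, <20ms) follow directly from frame-based processing: a system must buffer at least one analysis frame before it can emit output. A back-of-envelope helper (illustrative arithmetic, not taken from any cited SDK):

```python
def frame_latency_ms(frame_samples: int, sample_rate_hz: int) -> float:
    """Minimum algorithmic latency of a frame-based enhancer:
    one full frame must be buffered before processing can start.
    (Model inference time and output buffering add on top of this.)"""
    return 1000.0 * frame_samples / sample_rate_hz

# a 480-sample frame at 48 kHz implies at least 10 ms of buffering,
# which is why real-time enhancers favor short frames and short hops
```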
What's Next
The frontier is unified audio transformation models that handle enhancement, separation, style transfer, and super-resolution in a single network. Real-time voice conversion (changing your voice to sound like someone else during a live call) is becoming practical. Expect generative approaches (diffusion, flow-matching) to replace mask-based methods for higher quality reconstruction, and on-device processing to enable privacy-preserving audio enhancement without cloud dependency.
Benchmarks & SOTA
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.