
Audio-to-Audio

Audio-to-audio encompasses speech enhancement, voice conversion, source separation, and style transfer: any task where audio goes in and transformed audio comes out. Speech enhancement (denoising) was transformed by neural models such as Meta's Demucs and DCCRN (a top performer in Microsoft's DNS Challenge), and neural noise suppression now ships in most mainstream video-calling software. Voice conversion took a leap with RVC and So-VITS-SVC, which made high-quality voice cloning widely accessible and sparked both creative tools and deepfake concerns. Source separation (isolating vocals, drums, and bass from a mix) reached near-production quality with HTDemucs and BandSplitRNN, making stem extraction practical for most music. The field is converging toward unified models that handle multiple audio transformations through natural-language instructions, blurring the line with text-to-audio generation.


Audio-to-audio transforms input audio into modified output audio — covering speech enhancement, source separation, style transfer, voice conversion, and audio super-resolution. The field has matured from simple noise reduction to neural models that can separate overlapping speakers, enhance degraded recordings, and transform audio characteristics in real-time.

History

2014

Deep neural networks first applied to speech enhancement (denoising), outperforming classical spectral subtraction

2018

Wave-U-Net (Stoller et al.) adapts U-Net to raw audio for source separation; Conv-TasNet follows with superior separation quality

2019

DTLN (Dual-signal Transformation LSTM Network) enables real-time noise suppression on mobile devices

2020

Demucs (Meta) achieves SOTA music source separation, isolating vocals, drums, bass, and other instruments

2021

Neural vocoders HiFi-GAN and (later) BigVGAN enable high-fidelity waveform decoding for audio super-resolution, upsampling low-quality audio to high-fidelity output

2022

AudioSR (Liu et al.) introduces diffusion-based audio super-resolution from 4kHz to 48kHz

2023

Demucs v4 (Hybrid Transformer) and BandSplitRNN push music separation to near-studio quality

2024

NVIDIA Maxine and Krisp demonstrate commercial real-time audio enhancement with background noise and echo removal

2025

Real-time voice conversion and style transfer models enable live audio transformation in consumer applications

How Audio-to-Audio Works

1

Input representation

Audio is converted to time-frequency representations (STFT spectrograms) or processed as raw waveforms

2

Mask estimation / generation

A neural network (U-Net, transformer, or TasNet) estimates a mask that isolates target audio from the mixture

3

Source reconstruction

The mask is applied to the mixture spectrogram, and inverse STFT or a neural decoder reconstructs the waveform

4

Post-processing

Phase reconstruction, artifact removal, and loudness normalization produce clean output audio
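The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production system: the mask here is an oracle ideal-ratio mask computed from the known clean and noise signals, standing in for the mask a trained network would estimate from the mixture alone.

```python
import numpy as np
from scipy.signal import stft, istft

sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440.0 * t)                    # target: a 440 Hz tone
noise = 0.5 * np.random.default_rng(0).standard_normal(sr)
mix = clean + noise

# 1) Input representation: STFT spectrograms of mixture and sources
_, _, Mix = stft(mix, fs=sr, nperseg=512)
_, _, Clean = stft(clean, fs=sr, nperseg=512)
_, _, Noise = stft(noise, fs=sr, nperseg=512)

# 2) Mask estimation: an oracle ideal-ratio mask; in a real system a
#    neural network predicts this from the mixture alone
mask = np.abs(Clean) / (np.abs(Clean) + np.abs(Noise) + 1e-8)

# 3) Source reconstruction: apply the mask to the complex mixture
#    spectrogram (implicitly reusing the mixture's phase) and invert
_, est = istft(mask * Mix, fs=sr, nperseg=512)
est = est[:sr]

def snr_db(ref, sig):
    # signal-to-noise ratio of sig against the reference, in dB
    err = ref - sig
    return 10 * np.log10(np.sum(ref ** 2) / np.sum(err ** 2))

print(f"mixture SNR: {snr_db(clean, mix):.1f} dB, "
      f"enhanced SNR: {snr_db(clean, est):.1f} dB")
```

Even this oracle setup shows the core trade-off of step 4: the mask is applied to the mixture's phase, so phase errors remain and motivate the neural-decoder and generative reconstruction approaches discussed below.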

Current Landscape

Audio-to-audio in 2025 spans a diverse set of subtasks unified by the common thread of transforming input audio. Speech enhancement (noise removal) is effectively solved for real-time applications, with commercial products (Krisp, NVIDIA Maxine) used by millions daily. Music source separation has reached impressive quality with Demucs v4 and BandSplitRNN, enabling stem extraction for DJs, remixers, and music production. The emerging frontier is universal audio transformation models that handle multiple tasks (enhance, separate, convert, upsample) with a single architecture.

Key Challenges

Real-time processing with low latency (<20ms) requires efficient architectures and aggressive optimization

Source separation quality degrades significantly when isolating more than four stems from a music mix

Generalization across recording conditions: models trained on studio audio fail on phone recordings or field audio

Artifact-free processing: neural models can introduce subtle metallic or warbling artifacts in enhanced audio

Evaluation: objective metrics (SDR, PESQ, STOI) don't always correlate with perceived quality
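As a concrete example of one such metric, the scale-invariant SDR (SI-SDR) widely reported in separation papers projects the estimate onto the reference before measuring error, so a simple gain change does not affect the score. The sketch below is a minimal implementation (the function name is our own):

```python
import numpy as np

def si_sdr_db(reference, estimate):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # project the estimate onto the reference to isolate the target component
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    error = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / (np.sum(error ** 2) + 1e-12))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
good = ref + 0.1 * rng.standard_normal(16000)   # mild distortion
bad = ref + 1.0 * rng.standard_normal(16000)    # heavy distortion
print(f"good: {si_sdr_db(ref, good):.1f} dB, bad: {si_sdr_db(ref, bad):.1f} dB")
```

Note that a high SI-SDR does not rule out the metallic or warbling artifacts mentioned above, which is exactly why listening tests still matter.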

Quick Recommendations

Speech enhancement (real-time)

NVIDIA Maxine or Krisp SDK

Production-grade noise suppression and echo cancellation at <10ms latency

Music source separation

Demucs v4 (htdemucs)

Best open-source separation of vocals, drums, bass, and other stems from mixed tracks

Audio super-resolution

AudioSR or NVSR

Upsamples low-bandwidth audio (phone calls, old recordings) to 48kHz quality

Speech separation (cocktail party)

SepFormer or TF-GridNet

Separate overlapping speakers from a single microphone recording

Open-source denoising

DeepFilterNet 3

Real-time speech enhancement on CPU; open-source and lightweight
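A quick way to see why the super-resolution recommendation above calls for a generative model rather than plain resampling: band-limited interpolation cannot restore frequency content that was discarded. The illustrative sketch below (numpy/scipy only, no neural model) downsamples a full-band signal to 8 kHz and back, and checks that essentially no energy returns above the narrowband version's 4 kHz Nyquist limit.

```python
import numpy as np
from scipy.signal import resample

sr_hi = 48000
wide = np.random.default_rng(1).standard_normal(sr_hi)  # 1 s of full-band noise

# simulate a narrowband recording: FFT-based resampling down to 8 kHz
narrow = resample(wide, 8000)
# naive "super-resolution": band-limited upsampling back to 48 kHz
up = resample(narrow, sr_hi)

spec = np.abs(np.fft.rfft(up))
freqs = np.fft.rfftfreq(sr_hi, 1 / sr_hi)
hi_energy = np.sum(spec[freqs > 4500] ** 2)   # above the 4 kHz Nyquist limit
lo_energy = np.sum(spec[freqs <= 4000] ** 2)  # surviving narrowband content
print(f"high-band / low-band energy ratio: {hi_energy / lo_energy:.2e}")
```

The missing high band must be synthesized from learned priors, which is what diffusion models like AudioSR and vocoder-based systems like NVSR do.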

What's Next

The frontier is unified audio transformation models that handle enhancement, separation, style transfer, and super-resolution in a single network. Real-time voice conversion (changing your voice to sound like someone else during a live call) is becoming practical. Expect generative approaches (diffusion, flow-matching) to replace mask-based methods for higher quality reconstruction, and on-device processing to enable privacy-preserving audio enhancement without cloud dependency.
