Audio-to-Audio
Audio-to-audio transforms input audio into modified output audio — covering speech enhancement, source separation, voice conversion, style transfer, and audio super-resolution. Speech enhancement (denoising) was revolutionized by Meta's Demucs and Microsoft's DCCRN and now runs in every major video call; voice conversion took a leap with RVC and So-VITS-SVC, whose few-shot voice cloning (both require training on target-speaker data) sparked creative tools and deepfake concerns alike. Source separation (isolating vocals, drums, and bass from a mix) reached near-production quality with HTDemucs and BandSplitRNN, making stem extraction effectively solved for most music. The field has matured from simple noise reduction to neural models that separate overlapping speakers, enhance degraded recordings, and transform audio characteristics in real time — and it is converging toward unified models that handle multiple transformations through natural language instructions, blurring the line with text-to-audio generation.
History
Deep neural networks first applied to speech enhancement (denoising) in the mid-2010s, outperforming classical spectral subtraction and Wiener filtering
Wave-U-Net (Stoller et al., 2018) adapts U-Net to raw audio for source separation; Conv-TasNet (2019) follows with superior separation quality
DTLN (Dual-signal Transformation LSTM Network, 2020) enables real-time noise suppression on mobile devices
Demucs (Meta, 2019) achieves state-of-the-art music source separation, isolating vocals, drums, bass, and other instruments
HiFi-GAN (2020) and BigVGAN (2022) neural vocoders enable high-fidelity waveform synthesis, underpinning pipelines that upsample low-quality audio for super-resolution
AudioSR (Liu et al., 2023) introduces diffusion-based audio super-resolution from 4kHz inputs to 48kHz output
Demucs v4 (Hybrid Transformer, 2022) and BandSplitRNN push music separation to near-studio quality
NVIDIA Maxine and Krisp demonstrate commercial real-time audio enhancement with background noise and echo removal
Real-time voice conversion and style transfer models enable live audio transformation in consumer applications
How Audio-to-Audio Works
Input representation
Audio is converted to time-frequency representations (STFT spectrograms) or processed as raw waveforms
Mask estimation / generation
A neural network (U-Net, transformer, or TasNet) estimates a mask that isolates target audio from the mixture
Source reconstruction
The mask is applied to the mixture spectrogram, and inverse STFT or a neural decoder reconstructs the waveform
Post-processing
Phase reconstruction, artifact removal, and loudness normalization produce clean output audio
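The four steps above can be sketched end to end with a toy magnitude mask. Here a simple binary spectral gate stands in for the neural mask estimator — the gate, the STFT parameters, and the test signal are all illustrative assumptions, not any particular model's design:

```python
import numpy as np
from scipy.signal import stft, istft

def mask_enhance(mixture, sr, mask_fn, nperseg=512):
    """Mask-based pipeline: STFT -> estimate mask -> apply -> iSTFT.
    `mask_fn` stands in for a learned mask estimator (assumption)."""
    _, _, Z = stft(mixture, fs=sr, nperseg=nperseg)
    mask = mask_fn(np.abs(Z))                  # one value in [0, 1] per TF bin
    _, out = istft(Z * mask, fs=sr, nperseg=nperseg)  # reuses the noisy phase
    return out[: len(mixture)]

# toy demo: a binary gate that keeps only bins above the median magnitude
sr = 16000
rng = np.random.default_rng(0)
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s clean "target"
noisy = tone + 0.1 * rng.standard_normal(sr)          # additive white noise
gate = lambda mag: (mag > np.median(mag)).astype(float)
enhanced = mask_enhance(noisy, sr, gate)
```

Real systems replace the gate with a learned (often complex-valued) mask and add explicit phase estimation, which is what the post-processing step addresses.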
Current Landscape
Audio-to-audio in 2025 spans a diverse set of subtasks unified by the common thread of transforming input audio. Speech enhancement (noise removal) is effectively solved for real-time applications, with commercial products (Krisp, NVIDIA Maxine) used by millions daily. Music source separation has reached impressive quality with Demucs v4 and BandSplitRNN, enabling stem extraction for DJs, remixers, and music production. The emerging frontier is universal audio transformation models that handle multiple tasks (enhancement, separation, conversion, upsampling) with a single architecture.
Key Challenges
Real-time processing with low latency (<20ms) requires efficient architectures and aggressive optimization
Separation quality degrades significantly for music mixes with more than 4 instrument stems
Generalization across recording conditions: models trained on studio audio fail on phone recordings or field audio
Artifact-free processing: neural models can introduce subtle metallic or warbling artifacts in enhanced audio
Evaluation: objective metrics (SDR, PESQ, STOI) don't always correlate with perceived quality
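The metric/perception gap is easier to see with the metrics in hand. A minimal sketch of scale-invariant SDR (SI-SDR, a widely used variant of SDR; using it here rather than the classic BSS-Eval SDR is a simplification):

```python
import numpy as np

def si_sdr(reference, estimate):
    """Scale-invariant SDR in dB: higher means the estimate is closer
    to the reference, ignoring overall gain differences."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # optimal scaling: project the estimate onto the reference
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    residual = estimate - target
    return 10 * np.log10(np.sum(target ** 2) / np.sum(residual ** 2))
```

Because the projection absorbs gain, SI-SDR is unchanged when the estimate is uniformly rescaled — which is also why it can miss gain-related artifacts a listener would notice.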
Quick Recommendations
Speech enhancement (real-time)
NVIDIA Maxine or Krisp SDK
Production-grade noise suppression and echo cancellation at <10ms latency
Music source separation
Demucs v4 (htdemucs)
Best open-source separation of vocals, drums, bass, and other stems from mixed tracks
Audio super-resolution
AudioSR or NVSR
Upsamples low-bandwidth audio (phone calls, old recordings) to 48kHz quality
Speech separation (cocktail party)
SepFormer or TF-GridNet
Separate overlapping speakers from a single microphone recording
Open-source denoising
DeepFilterNet 3
Real-time speech enhancement on CPU; open-source and lightweight
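The latency figures quoted above (<10ms, <20ms) follow directly from frame-based processing: a system must buffer at least one analysis frame before it can emit output. A back-of-envelope helper (illustrative arithmetic, not taken from any cited SDK):

```python
def frame_latency_ms(frame_samples: int, sample_rate_hz: int) -> float:
    """Minimum algorithmic latency of a frame-based enhancer:
    one full frame must be buffered before processing can start.
    (Model inference time and output buffering add on top of this.)"""
    return 1000.0 * frame_samples / sample_rate_hz

# a 480-sample frame at 48 kHz implies at least 10 ms of buffering,
# which is why real-time enhancers favor short frames and short hops
```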
What's Next
The frontier is unified audio transformation models that handle enhancement, separation, style transfer, and super-resolution in a single network. Real-time voice conversion (changing your voice to sound like someone else during a live call) is becoming practical. Expect generative approaches (diffusion, flow-matching) to replace mask-based methods for higher quality reconstruction, and on-device processing to enable privacy-preserving audio enhancement without cloud dependency.
Benchmarks & SOTA
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Text-to-audio generates sound effects, music, and ambient audio from natural language descriptions — a field that barely existed before AudioLDM (2023) adapted latent diffusion from images to spectrograms. Meta's AudioCraft, Stability's Stable Audio, and Google's MusicLM/MusicFX pushed quality dramatically, enabling production-ready sound design from prompts like "thunderstorm with distant church bells." AudioCaps and MusicCaps are the primary benchmarks, evaluated via Fréchet Audio Distance (FAD) and text-audio alignment scores, but human evaluation still dominates because automated metrics poorly capture subjective quality. The unsolved challenges are temporal coherence in long-form generation (>30 seconds), precise control over timing and structure, and music that maintains harmonic consistency across full songs.