Audio · added by community request · last verified 2026-06

Speech Enhancement (noise suppression).

Noisy speech in, clean speech out. The denoiser behind every video call, hearing aid, and the front end of many speech recognition pipelines. The field is tracked on two canonical benchmarks: VoiceBank+DEMAND — the academic standard, scored with intrusive metrics (PESQ, STOI) against a clean reference — and the Microsoft DNS Challenge, whose real-recording blind sets are scored with human MOS and the reference-free DNSMOS. Below: the SOTA trajectory from SEGAN (2017) to today’s Mamba/xLSTM and diffusion models, every number verified in the primary paper.

Every score links to the paper it comes from. No aggregator-site numbers.

§ 01 · How this task is scored

Four metrics, two regimes.

Intrusive metrics (PESQ, STOI, SI-SDR) need the clean reference signal, so they only exist on synthetic test sets. Reference-free metrics (DNSMOS) and human MOS are how real recordings get scored.

PESQ
−0.5 → 4.5 · higher better

Perceptual Evaluation of Speech Quality (ITU-T P.862). Compares enhanced audio against the clean reference and predicts a quality opinion score. The headline metric on VoiceBank+DEMAND — wideband PESQ unless stated otherwise. Needs the clean reference, so it only works on synthetic test sets.

STOI
0 → 1 (or %) · higher better

Short-Time Objective Intelligibility. Predicts how much of the words a listener can make out, not how pleasant the audio sounds. Saturated on VoiceBank+DEMAND — everything modern scores 0.95–0.96 — but still discriminative on harder, lower-SNR test sets.

SI-SDR
dB · higher better

Scale-Invariant Signal-to-Distortion Ratio. A waveform-level fidelity measure: how much of the output is the target signal vs residual noise and artifacts. Favored by the source-separation community; punishes generative models that produce plausible-but-different waveforms.
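The definition above can be made concrete with a minimal sketch in plain Python. It follows the common formulation (project the estimate onto the reference, then compare the energy of the projection with the energy of what is left over). The function name `si_sdr` and the `eps` guard are illustrative choices, not taken from any paper on this page.

```python
import math

def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB for two equal-length 1-D sequences."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    # Remove the mean so a DC offset counts as neither signal nor distortion.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target: alpha * t is the "target part".
    dot = sum(a * b for a, b in zip(e, t))
    t_energy = sum(b * b for b in t)
    alpha = dot / (t_energy + eps)
    projection = [alpha * b for b in t]
    # Everything not explained by the scaled target is residual noise/artifacts.
    residual = [a - p for a, p in zip(e, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))
```

Because the target is rescaled before the comparison, multiplying the estimate by a constant leaves the score unchanged. A generative model that produces a plausible but sample-wise different waveform ends up with a large residual, which is why the metric penalizes it.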

DNSMOS
1 → 5 · higher better

A neural network trained on human ratings from the Microsoft DNS Challenges that predicts MOS without needing a clean reference (arXiv:2010.15941). This is how recent DNS Challenge blind sets are scored — real recordings have no clean reference, so PESQ/STOI cannot be computed.

§ 02 · Benchmark

VoiceBank+DEMAND (Valentini).

The academic standard since 2016: 28 VoiceBank speakers mixed with DEMAND environmental noise for training, 2 unseen speakers + unseen noises for test. Wideband PESQ is the headline number. Nearly a decade of SOTA progress fits on one axis: 1.97 (noisy) → 2.16 (SEGAN, 2017) → 3.55 (SEMamba, 2024).

13 PESQ results · 5 STOI results · each row links to the source paper.

Trust tiers for PESQ (wideband): verified · paper · vendor · community · unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | SEMamba + PCS | paper | 3.69 | 2024 | Paper ↗ · Code ↗ | Same paper, with Perceptual Contrast Stretching post-processing. PCS sharpens spectral contrast specifically in ways PESQ rewards; read it as a PESQ-tuned variant, not a free win. |
| 02 | SEMamba | paper | 3.55 | 2024 | Paper ↗ · Code ↗ | “An Investigation of Incorporating Mamba for Speech Enhancement” (non-causal config). State-space backbone swapped into the MP-SENet recipe. |
| 03 | xLSTM-SENet2 | paper | 3.53 | 2025 | Paper ↗ · Code ↗ | “xLSTM-SENet: xLSTM for Single-Channel Speech Enhancement”, Table 4. Shows xLSTM matches Mamba and Conformer backbones at similar complexity. |
| 04 | MP-SENet | paper | 3.50 | 2023 | Paper ↗ · Code ↗ | “MP-SENet: A Speech Enhancement Model with Parallel Denoising of Magnitude and Phase Spectra”. First to denoise magnitude and phase in parallel; the recipe most 2024–2026 models build on. |
| 05 | CMGAN | paper | 3.41 | 2022 | Paper ↗ · Code ↗ | “CMGAN: Conformer-based Metric GAN for Speech Enhancement” (SSNR 11.10 dB in the same run). Conformer backbone + metric discriminator. |
| 06 | FRCRN | paper | 3.21 | 2022 | Paper ↗ · Code ↗ | “FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement”, Table 3 (WB-PESQ). Also 2nd, ICASSP 2022 DNS Challenge real-time fullband track. |
| 07 | MetricGAN+ | paper | 3.15 | 2021 | Paper ↗ · Code ↗ | “MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement”. Trains the discriminator to mimic PESQ itself, directly optimizing the eval metric. |
| 08 | DEMUCS (non-causal) | paper | 3.07 | 2020 | Paper ↗ · Code ↗ | “Real Time Speech Enhancement in the Waveform Domain”, Table 1 (H=64, S=2, U=2). Waveform-domain U-Net from the music source-separation lineage. |
| 09 | DEMUCS (causal) | paper | 2.93 | 2020 | Paper ↗ · Code ↗ | “Real Time Speech Enhancement in the Waveform Domain”, Table 1 (H=48, S=4, U=4 causal). Runs real-time on a laptop CPU; the deployable variant. |
| 10 | SGMSE+ | paper | 2.93 | 2022 | Paper ↗ · Code ↗ | “Speech Enhancement and Dereverberation with Diffusion-based Generative Models”, Table III. Score-based diffusion: lower PESQ than discriminative SOTA but markedly better cross-corpus generalization. |
| 11 | MetricGAN | paper | 2.86 | 2019 | Paper ↗ · Source ↗ | “MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization” (2019). Score as listed in the DEMUCS paper’s Table 1 comparison. |
| 12 | SEGAN | paper | 2.16 | 2017 | Paper ↗ · Code ↗ | “SEGAN: Speech Enhancement Generative Adversarial Network”, Table 1. The first end-to-end GAN speech enhancer; the result everyone since has measured against. |
| 13 | Noisy (unprocessed) | paper | 1.97 | 2016 | Paper ↗ | Baseline: the noisy test input itself. As reported in “Real Time Speech Enhancement in the Waveform Domain” (DEMUCS), Table 1. |

Scores are as self-reported in each paper (linked per row), extracted by us and not yet independently reproduced. SGMSE+ deliberately trades PESQ for generalization; see § 04.
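The muting rule stated above the table ("an earlier or same-year result already scored better") can be sketched as a small filter over the table's own rows. This is a literal reading of that sentence, not the page's actual rendering logic; the names `RESULTS` and `was_sota_when_published` are hypothetical.

```python
# (model, year, wideband PESQ) copied from the VoiceBank+DEMAND table.
RESULTS = [
    ("SEMamba + PCS", 2024, 3.69),
    ("SEMamba", 2024, 3.55),
    ("xLSTM-SENet2", 2025, 3.53),
    ("MP-SENet", 2023, 3.50),
    ("CMGAN", 2022, 3.41),
    ("FRCRN", 2022, 3.21),
    ("MetricGAN+", 2021, 3.15),
    ("DEMUCS (non-causal)", 2020, 3.07),
    ("DEMUCS (causal)", 2020, 2.93),
    ("SGMSE+", 2022, 2.93),
    ("MetricGAN", 2019, 2.86),
    ("SEGAN", 2017, 2.16),
    ("Noisy (unprocessed)", 2016, 1.97),
]

def was_sota_when_published(row, results):
    """True if no earlier-or-same-year row has a strictly higher score."""
    _, year, score = row
    return not any(y <= year and s > score for _, y, s in results)

# Rows that set a new best at publication time, oldest first.
frontier = sorted(
    (r for r in RESULTS if was_sota_when_published(r, RESULTS)),
    key=lambda r: r[1],
)
# Rows that would be muted under this literal reading of the rule.
muted = [r[0] for r in RESULTS if not was_sota_when_published(r, RESULTS)]
```

Under this strict reading, plain SEMamba is also muted, because SEMamba + PCS scored higher in the same year. The page may well treat post-processed variants from the same paper differently, so read the output as an illustration of the rule rather than a reproduction of the page's styling.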

§ 03 · Benchmark

Microsoft DNS Challenge.

The Deep Noise Suppression Challenge (Interspeech 2020 → ICASSP 2023) is the industrial benchmark: large-scale training data, real-recording blind test sets, real-time constraints, and human MOS / DNSMOS scoring. Blind-set rankings are subjective and not directly comparable across challenge years, so the table below uses the DNS-2020 synthetic no-reverb test set — the one configuration with a clean reference, where papers report comparable intrusive metrics.

2 results × 3 metrics (WB-PESQ · STOI · SI-SDR) — only rows we could verify in primary sources. Know another paper with DNS-2020 no-reverb numbers? Submit it below.

Trust tiers for WB-PESQ: verified · paper · vendor · community · unverified

Muted rows were not state of the art when published — an earlier or same-year result already scored better.

| Rank | Model | Trust | Score | Year | Links | Notes |
|---|---|---|---|---|---|---|
| 01 | FRCRN | paper | 3.23 | 2022 | Paper ↗ · Code ↗ | “FRCRN: Boosting Feature Representation using Frequency Recurrence for Monaural Speech Enhancement”, Table 2 (DNS-2020 non-blind test set, no reverb). |
| 02 | FullSubNet | paper | 2.78 | 2020 | Paper ↗ · Code ↗ | “FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement”, Table 1 (2.777, no-reverb set, 32 ms look-ahead, 5.6M params). |

Challenge placements for context: DCCRN won the DNS-2020 real-time track on MOS (arXiv:2008.00264); FRCRN took 2nd in the ICASSP 2022 real-time fullband track on MOS and word accuracy (arXiv:2206.07293). Those rankings are human-rated on blind sets and don’t reduce to a single comparable number — which is exactly the gap DNSMOS (arXiv:2010.15941) was built to fill.

§ 04 · How to read these numbers

Four caveats.

VoiceBank+DEMAND is close to saturated. The test set is small (824 clips, 2 speakers), the noise is mild, and STOI has been pinned at 0.96 since 2022. Post-2023 PESQ gains arrive in 0.02–0.05 increments — real, but small enough that training-recipe differences matter as much as architecture. Treat sub-0.05 deltas as noise.
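One way to check whether a small PESQ delta exceeds test-set sampling noise is a paired bootstrap over per-clip scores. The sketch below assumes you already have per-clip scores for two systems on the same clips; the function name, the resample count, and the 95% interval are illustrative choices, not a procedure used by any paper listed here.

```python
import random

def paired_bootstrap_delta(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean delta, (low, high)) with a 95% percentile interval.

    scores_a and scores_b hold per-clip scores for the same clips in the
    same order, so each resample keeps the pairing intact.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same clips")
    rng = random.Random(seed)
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    means = []
    for _ in range(n_resamples):
        # Resample clips with replacement and recompute the mean delta.
        sample = rng.choices(deltas, k=n)
        means.append(sum(sample) / n)
    means.sort()
    low = means[int(0.025 * n_resamples)]
    high = means[int(0.975 * n_resamples) - 1]
    return sum(deltas) / n, (low, high)
```

If the interval straddles zero, the test set cannot separate the two systems. Note that this captures only clip-sampling variance; differences caused by training seeds or recipes need repeated training runs to measure.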

Some models optimize PESQ directly. The MetricGAN line trains a discriminator to imitate PESQ; PCS post-processing stretches spectral contrast in ways PESQ rewards. That is legitimate research — but it means a PESQ gap does not always equal an audible quality gap. The DNS Challenge exists precisely because PESQ on synthetic mixtures stopped predicting human ratings on real recordings.

Discriminative vs generative is a real fork. Discriminative models (CMGAN, MP-SENet, SEMamba) win on in-domain PESQ. Diffusion models like SGMSE+ score lower in-domain (2.93) but generalize better to mismatched corpora and degradations — the property that matters if your deployment audio doesn’t look like the training set.

Causality is the deployment constraint. A video-call denoiser must be causal (no future audio) and run in a few milliseconds of CPU budget — DEMUCS-causal and FRCRN-class models live here. Non-causal, compute-heavy models top the leaderboard but cannot ship in real-time products. Always check which variant a paper’s headline number comes from.
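The causal constraint can be illustrated with a toy FIR filter, which is far simpler than any model on this page but shows the same dependency pattern. Both function names are hypothetical.

```python
def causal_filter(x, taps):
    """y[t] depends only on x[t], x[t-1], ...: no future samples."""
    return [
        sum(h * x[t - k] for k, h in enumerate(taps) if t - k >= 0)
        for t in range(len(x))
    ]

def centered_filter(x, taps):
    """Non-causal: y[t] also reads (len(taps) - 1) // 2 future samples."""
    half = (len(taps) - 1) // 2
    return [
        sum(
            h * x[t + half - k]
            for k, h in enumerate(taps)
            if 0 <= t + half - k < len(x)
        )
        for t in range(len(x))
    ]

signal = [float(n) for n in range(10)]
changed = list(signal)
changed[7] = 100.0  # alter one "future" sample
taps = [0.25, 0.5, 0.25]

# Causal outputs before index 7 are unaffected by the change at index 7.
causal_same = causal_filter(signal, taps)[:7] == causal_filter(changed, taps)[:7]
# The centered filter's output at index 6 already depends on sample 7.
centered_same = centered_filter(signal, taps)[6] == centered_filter(changed, taps)[6]
```

Here `causal_same` is `True` and `centered_same` is `False`: the centered filter cannot emit output 6 until sample 7 has arrived, which is the look-ahead latency a real-time product has to budget for.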

§ 05 · Datasets & resources

Primary sources.

VoiceBank+DEMAND (Valentini)
28+2 speakers · DEMAND noise · 2016

The canonical academic train/test split for single-channel speech enhancement, published by Valentini-Botinhao et al. Clean speech from the VoiceBank corpus mixed with DEMAND environmental noise at 0–15 dB SNR (train) and 2.5–17.5 dB (test).
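Mixing at a target SNR comes down to scaling the noise so that the clean-to-noise power ratio hits the requested value. The sketch below uses mean power over the whole clip; the dataset's own mixing procedure may measure levels differently (for example, over active speech only), so treat `mix_at_snr` as a hypothetical helper rather than the authors' script.

```python
import math

def mix_at_snr(clean, noise, snr_db):
    """Return clean + gain * noise, with gain chosen to reach snr_db."""
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = sum(s * s for s in clean) / len(clean)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_noise == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR_dB = 10 * log10(p_clean / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(clean, noise)]
```

For example, `mix_at_snr(clean, noise, 2.5)` would produce a mixture at the lowest test-set SNR quoted above, under the whole-clip power assumption.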

Microsoft DNS Challenge
Interspeech 2020 → ICASSP 2023 · github.com/microsoft/DNS-Challenge

Large-scale training corpus (hundreds of hours of clean speech + noise + room impulse responses) plus blind test sets of real recordings. The repository hosts every challenge round’s data and the official baselines (NSNet, NSNet2).

DNSMOS (P.835)
Reference-free neural MOS predictor · 2020/2022

“DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors” — and the P.835 follow-up that separates speech quality (SIG), background noise (BAK), and overall (OVRL). The standard way to score enhancement on real recordings with no clean reference.

Interspeech 2020 DNS Challenge paper
Challenge design, datasets, and subjective testing framework

“The INTERSPEECH 2020 Deep Noise Suppression Challenge” — defines the training data, the real-recording blind test set, and the ITU-T P.808 crowdsourced MOS methodology that the entire challenge series builds on.

Reply within 48 hours · No newsletter

What were you looking for on speech enhancement?

This page exists because a reader asked for it. Missing a model, a benchmark (DNSMOS rows, EARS, CHiME), or a deployment question? Tell us — we verify against primary sources and update the page.

Real humans read every message. We track what people are asking for and prioritize accordingly.