Noisy speech in, clean speech out. Speech enhancement is the denoiser behind video calls, hearing aids, and the front end of many speech recognition pipelines. The field is tracked on two canonical benchmarks: VoiceBank+DEMAND, the academic standard, scored with intrusive metrics (PESQ, STOI) against a clean reference; and the Microsoft DNS Challenge, whose real-recording blind sets are scored with human MOS and the reference-free DNSMOS. Below: the SOTA trajectory from SEGAN (2017) to today’s Mamba/xLSTM and diffusion models, with every number checked against the primary paper.
Every score links to the paper it comes from. No aggregator-site numbers.
Intrusive metrics (PESQ, STOI, SI-SDR) need the clean reference signal, so they only exist on synthetic test sets. Reference-free metrics (DNSMOS) and human MOS are how real recordings get scored.
PESQ: Perceptual Evaluation of Speech Quality (ITU-T P.862; the wideband extension is P.862.2). It compares enhanced audio against the clean reference and predicts a quality opinion score. It is the headline metric on VoiceBank+DEMAND, reported as wideband PESQ unless stated otherwise. Because it needs the clean reference, it only works on synthetic test sets.
STOI: Short-Time Objective Intelligibility. It predicts how many of the words a listener can make out, not how pleasant the audio sounds. It is saturated on VoiceBank+DEMAND, where everything modern scores 0.95–0.96, but it is still discriminative on harder, lower-SNR test sets.
SI-SDR: Scale-Invariant Signal-to-Distortion Ratio. A waveform-level fidelity measure: how much of the output is the target signal versus residual noise and artifacts. It is favored by the source-separation community and punishes generative models that produce plausible-but-different waveforms.
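To make the SI-SDR definition concrete, here is a minimal pure-Python sketch. It assumes the common formulation in which both signals are mean-removed and the estimate is projected onto the target. The function name `si_sdr` and the plain-list inputs are illustrative choices, not taken from any particular toolkit.

```python
import math


def si_sdr(estimate, target):
    """Scale-invariant SDR in dB between two equal-length sample sequences.

    Illustrative sketch: assumes mean removal and a projection of the
    estimate onto the target, as in the commonly used definition.
    """
    if len(estimate) != len(target) or not target:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the mean so a DC offset counts as neither signal nor distortion.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    t_energy = sum(x * x for x in t)
    if t_energy == 0.0:
        raise ValueError("target has no energy after mean removal")

    # alpha * t is the part of the estimate explained by the target.
    alpha = sum(a * b for a, b in zip(e, t)) / t_energy
    s_target = [alpha * x for x in t]

    # Everything else (noise, artifacts, resynthesis differences) is residual.
    residual = [a - s for a, s in zip(e, s_target)]

    num = sum(x * x for x in s_target)
    den = sum(x * x for x in residual)
    if den == 0.0:
        return float("inf")  # estimate is an exact scaled copy of the target
    return 10.0 * math.log10(num / den)
```

Because the estimate is rescaled by the projection, multiplying the output by a constant gain leaves the score unchanged, while any sample-level deviation from the reference, including a plausible resynthesis, lands in the residual.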
DNSMOS: a neural network trained on human ratings from the Microsoft DNS Challenges that predicts MOS without needing a clean reference (arXiv:2010.15941). This is how recent DNS Challenge blind sets are scored: real recordings have no clean reference, so PESQ/STOI cannot be computed.
The academic standard since 2016: 28 VoiceBank speakers mixed with DEMAND environmental noise for training, 2 unseen speakers + unseen noises for test. Wideband PESQ is the headline number. Nearly a decade of SOTA progress fits on one axis: 1.97 (noisy) → 2.16 (SEGAN, 2017) → 3.55 (SEMamba, 2024).
13 PESQ results · 5 STOI results · each row links to the source paper.
Muted rows were not state of the art when published — an earlier or same-year result already scored better.
| Rank | Model | Trust | WB-PESQ | Year | Links |
|---|---|---|---|---|---|
| 01 | SEMamba + PCS | paper | 3.69 | 2024 | Paper ↗ · Code ↗ |
| 02 | SEMamba | paper | 3.55 | 2024 | Paper ↗ · Code ↗ |
| 03 | xLSTM-SENet2 | paper | 3.53 | 2025 | Paper ↗ · Code ↗ |
| 04 | MP-SENet | paper | 3.50 | 2023 | Paper ↗ · Code ↗ |
| 05 | CMGAN | paper | 3.41 | 2022 | Paper ↗ · Code ↗ |
| 06 | FRCRN | paper | 3.21 | 2022 | Paper ↗ · Code ↗ |
| 07 | MetricGAN+ | paper | 3.15 | 2021 | Paper ↗ · Code ↗ |
| 08 | DEMUCS (non-causal) | paper | 3.07 | 2020 | Paper ↗ · Code ↗ |
| 09 | DEMUCS (causal) | paper | 2.93 | 2020 | Paper ↗ · Code ↗ |
| 10 | SGMSE+ | paper | 2.93 | 2022 | Paper ↗ · Code ↗ |
| 11 | MetricGAN | paper | 2.86 | 2019 | Paper ↗ · Source ↗ |
| 12 | SEGAN | paper | 2.16 | 2017 | Paper ↗ · Code ↗ |
| 13 | Noisy (unprocessed) | paper | 1.97 | 2016 | Paper ↗ |
Scores are as self-reported in each paper (linked per row): extracted by us, not yet independently reproduced. SGMSE+ deliberately trades PESQ for generalization; see § 04.
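The muting rule stated with the table ("an earlier or same-year result already scored better") can be sketched as a small filter over the rows. This is a reading of that sentence, not the page's own code: the `RESULTS` list is copied from the table, the function name is hypothetical, and treating two variants from the same paper (SEMamba and SEMamba + PCS) as competing same-year results is an assumption.

```python
# (model, wideband PESQ, year), copied from the VoiceBank+DEMAND table.
RESULTS = [
    ("SEMamba + PCS", 3.69, 2024),
    ("SEMamba", 3.55, 2024),
    ("xLSTM-SENet2", 3.53, 2025),
    ("MP-SENet", 3.50, 2023),
    ("CMGAN", 3.41, 2022),
    ("FRCRN", 3.21, 2022),
    ("MetricGAN+", 3.15, 2021),
    ("DEMUCS (non-causal)", 3.07, 2020),
    ("DEMUCS (causal)", 2.93, 2020),
    ("SGMSE+", 2.93, 2022),
    ("MetricGAN", 2.86, 2019),
    ("SEGAN", 2.16, 2017),
    ("Noisy (unprocessed)", 1.97, 2016),
]


def muted_rows(results):
    """Return the models beaten by a strictly better earlier or same-year row."""
    muted = set()
    for name, score, year in results:
        for other_name, other_score, other_year in results:
            if other_name == name:
                continue
            # Assumption: "same-year" includes variants from the same paper.
            if other_year <= year and other_score > score:
                muted.add(name)
                break
    return muted


if __name__ == "__main__":
    for model in sorted(muted_rows(RESULTS)):
        print("muted:", model)
```

Under these assumptions the muted set is DEMUCS (causal), FRCRN, SGMSE+, SEMamba, and xLSTM-SENet2. If the page counts SEMamba and its PCS variant as one entry, SEMamba would not be muted.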
The Deep Noise Suppression Challenge (Interspeech 2020 → ICASSP 2023) is the industrial benchmark: large-scale training data, real-recording blind test sets, real-time constraints, and human MOS / DNSMOS scoring. Blind-set rankings are subjective and not directly comparable across challenge years, so the table below uses the DNS-2020 synthetic no-reverb test set — the one configuration with a clean reference, where papers report comparable intrusive metrics.
2 results × 3 metrics (WB-PESQ · STOI · SI-SDR) — only rows we could verify in primary sources. Know another paper with DNS-2020 no-reverb numbers? Submit it below.
Challenge placements for context: DCCRN won the DNS-2020 real-time track on MOS (arXiv:2008.00264); FRCRN took 2nd in the ICASSP 2022 real-time fullband track on MOS and word accuracy (arXiv:2206.07293). Those rankings are human-rated on blind sets and don’t reduce to a single comparable number — which is exactly the gap DNSMOS (arXiv:2010.15941) was built to fill.
VoiceBank+DEMAND is close to saturated. The test set is small (824 clips, 2 speakers), the noise is mild, and STOI has been pinned at 0.96 since 2022. Post-2023 PESQ gains arrive in 0.02–0.05 increments — real, but small enough that training-recipe differences matter as much as architecture. Treat sub-0.05 deltas as noise.
Some models optimize PESQ directly. The MetricGAN line trains a discriminator to imitate PESQ; PCS post-processing stretches spectral contrast in ways PESQ rewards. That is legitimate research — but it means a PESQ gap does not always equal an audible quality gap. The DNS Challenge exists precisely because PESQ on synthetic mixtures stopped predicting human ratings on real recordings.
Discriminative vs generative is a real fork. Discriminative models (CMGAN, MP-SENet, SEMamba) win on in-domain PESQ. Diffusion models like SGMSE+ score lower in-domain (2.93) but generalize better to mismatched corpora and degradations — the property that matters if your deployment audio doesn’t look like the training set.
Causality is the deployment constraint. A video-call denoiser must be causal (no future audio) and run in a few milliseconds of CPU budget — DEMUCS-causal and FRCRN-class models live here. Non-causal, compute-heavy models top the leaderboard but cannot ship in real-time products. Always check which variant a paper’s headline number comes from.
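The "no future audio" constraint can be illustrated with a toy FIR filter. This is only a sketch of what causality means at the sample level, not a model of DEMUCS or FRCRN: the function names, the three-tap smoothing kernel, and the probe test are all hypothetical.

```python
def fir_filter(signal, taps, causal=True):
    """Apply an FIR filter with zero padding outside the signal.

    causal=True:  output[t] depends on signal[t-k+1 .. t] only.
    causal=False: the window is centered on t, so it also reads future samples.
    """
    k = len(taps)
    offset = k - 1 if causal else (k - 1) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for i, w in enumerate(taps):
            j = t - offset + i
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def depends_on_future(filter_fn, length=16, probe=8):
    """Check whether changing sample `probe` alters any output before it."""
    base = [0.0] * length
    bumped = list(base)
    bumped[probe] = 1.0
    before = filter_fn(base)[:probe]
    after = filter_fn(bumped)[:probe]
    return any(a != b for a, b in zip(before, after))


TAPS = [0.25, 0.5, 0.25]  # illustrative smoothing kernel


def causal_filter(x):
    return fir_filter(x, TAPS, causal=True)


def noncausal_filter(x):
    return fir_filter(x, TAPS, causal=False)


if __name__ == "__main__":
    print("causal reads future:", depends_on_future(causal_filter))
    print("non-causal reads future:", depends_on_future(noncausal_filter))
```

A centered window cannot emit output for time t until the samples after t have arrived, so any lookahead shows up as added latency in a streaming system. The same probe idea (perturb a later sample, compare earlier outputs) is one way to sanity-check that a model variant labeled causal really is.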
The canonical academic train/test split for single-channel speech enhancement, published by Valentini-Botinhao et al. Clean speech from the VoiceBank corpus is mixed with DEMAND environmental noise at 0, 5, 10, and 15 dB SNR (train) and at 2.5, 7.5, 12.5, and 17.5 dB SNR (test).
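Mixing clean speech with noise at a target SNR comes down to scaling the noise before adding it. The sketch below assumes a plain average-power definition of SNR over the whole clip. The dataset's actual level-measurement procedure is not described here, so treat the function `mix_at_snr` and its inputs as illustrative only.

```python
import math


def mix_at_snr(clean, noise, snr_db):
    """Return clean + gain * noise so the clean-to-noise power ratio is snr_db.

    Assumption: SNR is computed from average power over the full clip.
    """
    if len(clean) != len(noise) or not clean:
        raise ValueError("clean and noise must be non-empty and of equal length")

    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_clean == 0.0 or p_noise == 0.0:
        raise ValueError("both signals need non-zero power")

    # snr_db = 10 * log10(p_clean / (gain**2 * p_noise)), solved for gain.
    gain = math.sqrt(p_clean / (p_noise * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]


if __name__ == "__main__":
    # Synthetic stand-ins for a speech clip and a noise clip.
    clean = [math.sin(0.1 * i) for i in range(1000)]
    noise = [((i * 37) % 11 - 5) / 5.0 for i in range(1000)]
    noisy = mix_at_snr(clean, noise, snr_db=2.5)
    print(len(noisy))
```

Lower SNR values mean a larger noise gain, which is why harder, lower-SNR test sets separate models that look identical on this benchmark's comparatively mild conditions.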
DNS Challenge data: a large-scale training corpus (hundreds of hours of clean speech + noise + room impulse responses) plus blind test sets of real recordings. The repository hosts every challenge round’s data and the official baselines (NSNet, NSNet2).
“DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality metric to evaluate Noise Suppressors”, and the P.835 follow-up that separates speech quality (SIG), background noise (BAK), and overall (OVRL). The standard way to score enhancement on real recordings with no clean reference.
“The INTERSPEECH 2020 Deep Noise Suppression Challenge”: defines the training data, the real-recording blind test set, and the ITU-T P.808 crowdsourced MOS methodology that the entire challenge series builds on.
This page exists because a reader asked for it. Missing a model, a benchmark (DNSMOS rows, EARS, CHiME), or a deployment question? Tell us: we verify against primary sources and update the page.