Voice Activity Detection
Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?" — and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
History
ITU-T G.729 Annex B standardizes VAD for telephony compression based on energy and spectral features
WebRTC VAD (Google) ships in Chrome — GMM-based, fast enough for real-time web applications
DNN-based VAD outperforms energy-based methods on noisy speech; LSTM-VAD becomes competitive
Personal VAD (Google) detects target speaker activity, filtering out non-target speakers
Silero VAD launches — tiny model (1.6MB) with excellent accuracy; rapidly adopted in open-source ASR pipelines
pyannote.audio 2.0 provides end-to-end speaker diarization with integrated neural VAD
Silero VAD v4 and v5 improve accuracy on challenging conditions (far-field, music contamination, whispered speech)
VAD is increasingly integrated into unified speech processing pipelines rather than used as a standalone module
How Voice Activity Detection Works
Frame extraction
Audio is divided into short frames (10-30ms) with overlapping windows for continuous processing
Feature extraction
Mel-frequency features or learned embeddings are computed for each frame; some models operate on raw waveforms
Classification
A small neural network (LSTM, GRU, or 1D CNN) classifies each frame as speech or non-speech
Smoothing
Raw frame-level predictions are smoothed with minimum duration constraints (e.g., speech segments > 250ms) and hangover schemes
Segmentation
Continuous speech/non-speech labels are converted to timestamped segments for downstream processing
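The five steps above can be sketched end-to-end. This is a minimal illustration in pure Python, not a production implementation: a log-energy threshold stands in for the neural classifier, and the frame sizes, threshold, and minimum-duration values are illustrative choices taken from the ranges quoted above.

```python
import math

SAMPLE_RATE = 16000
FRAME_MS = 30          # frame length (within the 10-30 ms range above)
HOP_MS = 10            # overlapping windows: 10 ms hop
MIN_SPEECH_MS = 250    # minimum-duration constraint from the smoothing step

def frames(signal, sr=SAMPLE_RATE):
    """Step 1: slice audio into overlapping 30 ms frames every 10 ms."""
    flen, hop = sr * FRAME_MS // 1000, sr * HOP_MS // 1000
    for start in range(0, len(signal) - flen + 1, hop):
        yield start / sr, signal[start:start + flen]

def log_energy(frame):
    """Step 2 (stand-in): log RMS energy instead of mel features."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return math.log(rms + 1e-10)

def classify(energy, threshold=-4.0):
    """Step 3 (stand-in): a fixed threshold instead of an LSTM/GRU/CNN."""
    return energy > threshold

def segments(signal, sr=SAMPLE_RATE):
    """Steps 4-5: convert frame labels to timestamped segments, then
    smooth by dropping segments shorter than the minimum duration."""
    labels = [(t, classify(log_energy(f))) for t, f in frames(signal, sr)]
    segs, start = [], None
    for t, is_speech in labels:
        if is_speech and start is None:
            start = t
        elif not is_speech and start is not None:
            segs.append((start, t))
            start = None
    if start is not None:
        segs.append((start, labels[-1][0] + FRAME_MS / 1000))
    return [(a, b) for a, b in segs if (b - a) * 1000 >= MIN_SPEECH_MS]

# Toy input: 0.5 s silence, 0.5 s loud tone, 0.5 s silence.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(8000)]
audio = [0.0] * 8000 + tone + [0.0] * 8000
print(segments(audio))  # one segment roughly spanning 0.5 s - 1.0 s
```

A real system replaces `log_energy` and `classify` with a trained model's per-frame speech probability; the framing, smoothing, and segmentation scaffolding stays essentially the same.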
Current Landscape
VAD in 2025 is mature infrastructure — rarely discussed but universally deployed. Every smart speaker, phone call, ASR system, and video conferencing tool runs VAD. Silero VAD has become the de facto open-source standard, replacing WebRTC VAD in most new projects due to superior accuracy with similar speed. The task is increasingly absorbed into larger models: Whisper includes implicit VAD, and end-to-end speech models detect speech boundaries as part of their processing. Standalone VAD research has slowed as the problem is considered solved for most practical applications.
Key Challenges
Background music and TV speech can trigger false positives — distinguishing target speech from played-back audio
Whispered and quiet speech near the noise floor is frequently missed by energy-based and even neural VADs
Far-field and reverberant environments: VAD accuracy degrades significantly at distances > 3 meters from the microphone
Overlapping speech: when multiple people speak simultaneously, VAD must still detect speech activity
Ultra-low latency: real-time applications need VAD decisions within 10-30ms, constraining model complexity
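The latency constraint rules out waiting for future context: each frame's decision must be emitted immediately. A common compromise is a hangover scheme, where the speech decision is held for a few frames after the probability drops, suppressing flicker without adding lookahead delay. The sketch below is illustrative; the thresholds and hangover length are hypothetical values, not taken from any particular system.

```python
class StreamingVAD:
    """Frame-by-frame VAD gate with a hangover scheme: once speech is
    detected, the decision is held for `hangover` extra frames so brief
    dips (plosive gaps, short pauses) don't cut the segment. Latency is
    one frame (e.g. 10-30 ms); thresholds here are illustrative."""

    def __init__(self, on_threshold=0.6, off_threshold=0.4, hangover=8):
        self.on, self.off = on_threshold, off_threshold  # hysteresis band
        self.hangover = hangover
        self.active = False
        self.hold = 0

    def step(self, speech_prob):
        """Feed one frame's speech probability (from any classifier);
        returns the smoothed speech/non-speech decision immediately."""
        if speech_prob >= self.on:
            self.active, self.hold = True, self.hangover
        elif speech_prob < self.off:
            if self.hold > 0:
                self.hold -= 1          # keep holding through the dip
            else:
                self.active = False
        return self.active

vad = StreamingVAD()
# A short dip mid-utterance is bridged; trailing silence ends the segment.
probs = [0.1, 0.9, 0.9, 0.2, 0.2, 0.9] + [0.1] * 12
decisions = [vad.step(p) for p in probs]
```

The hysteresis band (separate on/off thresholds) and the hangover counter trade a slightly delayed segment end for far fewer spurious cuts, which is usually the right trade for downstream ASR.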
Quick Recommendations
Best open-source VAD
Silero VAD v5
1.6MB model, sub-millisecond inference on CPU, 95%+ accuracy; integrates with any pipeline
Speaker diarization VAD
pyannote.audio 3.1 VAD
Optimized for segmentation that feeds into speaker diarization; handles overlapping speech
Embedded / IoT
WebRTC VAD or TensorFlow Lite VAD
Runs on microcontrollers; minimal compute and memory footprint
Telephony / VoIP
Silero VAD or RNNoise (with VAD output)
Handles telephone-band audio with codec artifacts; real-time on low-power devices
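As a concrete starting point for the Silero recommendation above, the model loads via `torch.hub` from the official `snakers4/silero-vad` repository. This sketch assumes `torch` is installed and network access is available for the first download; the synthetic noise input is a stand-in for real audio (noise will typically yield no segments, whereas real speech returns start/end timestamps).

```python
import torch

# Download/load Silero VAD from the official repo (cached after first run).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad',
                              trust_repo=True)
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# Stand-in input: 1 s of silence + 1 s of noise at 16 kHz.
# For real audio, use read_audio('<path>', sampling_rate=16000) instead.
wav = torch.cat([torch.zeros(16000), 0.5 * torch.randn(16000)])

segments = get_speech_timestamps(wav, model, sampling_rate=16000,
                                 return_seconds=True)
print(segments)  # list of {'start': ..., 'end': ...} dicts per utterance
```

`VADIterator` from the same `utils` tuple provides the streaming, chunk-by-chunk interface for real-time use.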
What's Next
VAD will evolve into voice presence detection — not just 'is someone speaking?' but 'who is speaking, to whom, and with what intent?' Integration with speaker verification (is it an authorized speaker?) and wake word detection will create unified voice activation systems. On-device models will become even smaller (<500KB) while handling challenging conditions like background TV, competing conversations, and acoustic echo.
Benchmarks & SOTA
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Generating sound effects, music, and ambient audio from natural language descriptions.