Voice Activity Detection
Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?" — and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
History
ITU-T G.729 Annex B standardizes VAD for telephony compression based on energy and spectral features
WebRTC VAD (Google) ships in Chrome — GMM-based, fast enough for real-time web applications
DNN-based VAD outperforms energy-based methods on noisy speech; LSTM-VAD becomes competitive
Personal VAD (Google) detects target speaker activity, filtering out non-target speakers
Silero VAD launches — tiny model (1.6MB) with excellent accuracy; rapidly adopted in open-source ASR pipelines
pyannote.audio 2.0 provides end-to-end speaker diarization with integrated neural VAD
Silero VAD v4 and v5 improve accuracy on challenging conditions (far-field, music contamination, whispered speech)
VAD is increasingly integrated into unified speech processing pipelines rather than used as a standalone module
How Voice Activity Detection Works
Frame extraction
Audio is divided into short frames (10-30ms) with overlapping windows for continuous processing
Feature extraction
Mel-frequency features or learned embeddings are computed for each frame; some models operate on raw waveforms
Classification
A small neural network (LSTM, GRU, or 1D CNN) classifies each frame as speech or non-speech
Smoothing
Raw frame-level predictions are smoothed with minimum duration constraints (e.g., speech segments > 250ms) and hangover schemes
Segmentation
Continuous speech/non-speech labels are converted to timestamped segments for downstream processing
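The five steps above can be sketched end-to-end. This is a minimal illustration in pure Python, not a production implementation: a log-energy threshold stands in for the neural classifier, and the frame sizes, threshold, and minimum-duration values are illustrative choices taken from the ranges quoted above.

```python
import math

SAMPLE_RATE = 16000
FRAME_MS = 30          # frame length (within the 10-30 ms range above)
HOP_MS = 10            # overlapping windows: 10 ms hop
MIN_SPEECH_MS = 250    # minimum-duration constraint from the smoothing step

def frames(signal, sr=SAMPLE_RATE):
    """Step 1: slice audio into overlapping 30 ms frames every 10 ms."""
    flen, hop = sr * FRAME_MS // 1000, sr * HOP_MS // 1000
    for start in range(0, len(signal) - flen + 1, hop):
        yield start / sr, signal[start:start + flen]

def log_energy(frame):
    """Step 2 (stand-in): log RMS energy instead of mel features."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return math.log(rms + 1e-10)

def classify(energy, threshold=-4.0):
    """Step 3 (stand-in): a fixed threshold instead of an LSTM/GRU/CNN."""
    return energy > threshold

def segments(signal, sr=SAMPLE_RATE):
    """Steps 4-5: convert frame labels to timestamped segments, then
    smooth by dropping segments shorter than the minimum duration."""
    labels = [(t, classify(log_energy(f))) for t, f in frames(signal, sr)]
    segs, start = [], None
    for t, is_speech in labels:
        if is_speech and start is None:
            start = t
        elif not is_speech and start is not None:
            segs.append((start, t))
            start = None
    if start is not None:
        segs.append((start, labels[-1][0] + FRAME_MS / 1000))
    return [(a, b) for a, b in segs if (b - a) * 1000 >= MIN_SPEECH_MS]

# Toy input: 0.5 s silence, 0.5 s loud tone, 0.5 s silence.
tone = [math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(8000)]
audio = [0.0] * 8000 + tone + [0.0] * 8000
print(segments(audio))  # one segment roughly spanning 0.5 s - 1.0 s
```

A real system replaces `log_energy` and `classify` with a trained model's per-frame speech probability; the framing, smoothing, and segmentation scaffolding stays essentially the same.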
Current Landscape
VAD in 2025 is mature infrastructure — rarely discussed but universally deployed. Every smart speaker, phone call, ASR system, and video conferencing tool runs VAD. Silero VAD has become the de facto open-source standard, replacing WebRTC VAD in most new projects due to superior accuracy with similar speed. The task is increasingly absorbed into larger models: Whisper includes implicit VAD, and end-to-end speech models detect speech boundaries as part of their processing. Standalone VAD research has slowed as the problem is considered solved for most practical applications.
Key Challenges
Background music and TV speech can trigger false positives — distinguishing target speech from played-back audio
Whispered and quiet speech near the noise floor is frequently missed by energy-based and even neural VADs
Far-field and reverberant environments: VAD accuracy degrades significantly at distances > 3 meters from the microphone
Overlapping speech: when multiple people speak simultaneously, VAD must still detect speech activity
Ultra-low latency: real-time applications need VAD decisions within 10-30ms, constraining model complexity
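The latency constraint rules out waiting for future context: each frame's decision must be emitted immediately. A common compromise is a hangover scheme, where the speech decision is held for a few frames after the probability drops, suppressing flicker without adding lookahead delay. The sketch below is illustrative; the thresholds and hangover length are hypothetical values, not taken from any particular system.

```python
class StreamingVAD:
    """Frame-by-frame VAD gate with a hangover scheme: once speech is
    detected, the decision is held for `hangover` extra frames so brief
    dips (plosive gaps, short pauses) don't cut the segment. Latency is
    one frame (e.g. 10-30 ms); thresholds here are illustrative."""

    def __init__(self, on_threshold=0.6, off_threshold=0.4, hangover=8):
        self.on, self.off = on_threshold, off_threshold  # hysteresis band
        self.hangover = hangover
        self.active = False
        self.hold = 0

    def step(self, speech_prob):
        """Feed one frame's speech probability (from any classifier);
        returns the smoothed speech/non-speech decision immediately."""
        if speech_prob >= self.on:
            self.active, self.hold = True, self.hangover
        elif speech_prob < self.off:
            if self.hold > 0:
                self.hold -= 1          # keep holding through the dip
            else:
                self.active = False
        return self.active

vad = StreamingVAD()
# A short dip mid-utterance is bridged; trailing silence ends the segment.
probs = [0.1, 0.9, 0.9, 0.2, 0.2, 0.9] + [0.1] * 12
decisions = [vad.step(p) for p in probs]
```

The hysteresis band (separate on/off thresholds) and the hangover counter trade a slightly delayed segment end for far fewer spurious cuts, which is usually the right trade for downstream ASR.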
Quick Recommendations
Best open-source VAD
Silero VAD v5
1.6MB model, sub-millisecond inference on CPU, 95%+ accuracy; integrates with any pipeline
Speaker diarization VAD
pyannote.audio 3.1 VAD
Optimized for segmentation that feeds into speaker diarization; handles overlapping speech
Embedded / IoT
WebRTC VAD or TensorFlow Lite VAD
Runs on microcontrollers; minimal compute and memory footprint
Telephony / VoIP
Silero VAD or RNNoise (with VAD output)
Handles telephone-band audio with codec artifacts; real-time on low-power devices
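As a concrete starting point for the Silero recommendation above, the model loads via `torch.hub` from the official `snakers4/silero-vad` repository. This sketch assumes `torch` is installed and network access is available for the first download; the synthetic noise input is a stand-in for real audio (noise will typically yield no segments, whereas real speech returns start/end timestamps).

```python
import torch

# Download/load Silero VAD from the official repo (cached after first run).
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad',
                              trust_repo=True)
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

# Stand-in input: 1 s of silence + 1 s of noise at 16 kHz.
# For real audio, use read_audio('<path>', sampling_rate=16000) instead.
wav = torch.cat([torch.zeros(16000), 0.5 * torch.randn(16000)])

segments = get_speech_timestamps(wav, model, sampling_rate=16000,
                                 return_seconds=True)
print(segments)  # list of {'start': ..., 'end': ...} dicts per utterance
```

`VADIterator` from the same `utils` tuple provides the streaming, chunk-by-chunk interface for real-time use.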
What's Next
VAD will evolve into voice presence detection — not just 'is someone speaking?' but 'who is speaking, to whom, and with what intent?' Integration with speaker verification (is it an authorized speaker?) and wake word detection will create unified voice activation systems. On-device models will become even smaller (<500KB) while handling challenging conditions like background TV, competing conversations, and acoustic echo.
Benchmarks & SOTA
Related Tasks
Audio Captioning
Generating text descriptions of audio content.
Music Generation
Generating music from text, audio, or other inputs.
Sound Event Detection
Detecting and localizing sound events in audio.
Text-to-Audio
Generating sound effects, music, and ambient audio from natural language descriptions.