Voice Activity Detection
Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.
How Voice Activity Detection Works
A technical deep-dive into Voice Activity Detection. From signal processing fundamentals to neural networks that power modern speech systems.
What is Voice Activity Detection?
VAD answers a deceptively simple question: "Is someone speaking right now?" This binary classification task is foundational to nearly every speech processing system.
VAD in Action
```
Time:     0s        1s        2s        3s        4s        5s
          |---------|---------|---------|---------|---------|
Audio:    ~~~~~~~~~~#############~~~~~##########~~~~~~~~~~~~
           silence     speech            speech     silence
Output:   [0,0,0,0,0][1,1,1,1,1,1,1][0,0,0][1,1,1,1][0,0,0,0]
Segments: (1.0s - 2.3s)  (2.8s - 3.8s)
```
VAD processes an audio stream and outputs temporal boundaries where speech occurs. The core challenge: distinguishing human voice from silence, noise, music, and other sounds.
Why VAD Matters
- ASR preprocessing: Strip silence before sending audio to speech recognition. Reduces latency and cost, and improves accuracy.
- Turn-taking: Detect when a user stops speaking so the AI assistant can respond. Critical for conversational systems.
- Segmentation: Split long recordings into manageable chunks. First step in speaker diarization pipelines.
- Bandwidth savings: VoIP systems use VAD to transmit only during speech, cutting bandwidth by 50-70%.
The Challenge: Beyond Simple Silence Detection
Easy cases:
- Clear speech in a quiet room
- Complete silence between utterances
- Single speaker, close microphone

Hard cases:
- Speech with background music
- Whispering, breathing, laughter
- Far-field audio, overlapping speakers
- Non-speech vocalizations (coughing, "hmm")
Signal Processing Approach
Before deep learning, VAD relied on handcrafted acoustic features. These methods are still valuable: fast, interpretable, and require no training data.
The Core Insight
Speech has distinctive acoustic properties: voiced sounds are quasi-periodic, most energy sits below roughly 4 kHz, and the amplitude envelope varies at syllable rates. Hand-crafted features exploit these regularities to separate speech from other sounds.
Common Signal Features
| Feature | Description | Formula | Effectiveness |
|---|---|---|---|
| Energy / RMS | Volume level of the signal | sqrt(sum(x^2) / N) | Basic |
| Zero-Crossing Rate | How often signal crosses zero | sum(sign(x[n]) != sign(x[n-1])) / N | Moderate |
| Spectral Centroid | Center of mass of the spectrum | sum(f * \|X(f)\|) / sum(\|X(f)\|) | Good |
| Spectral Flux | Rate of spectral change | sum((\|X(f)\| - \|X_prev(f)\|)^2) | Good |
| MFCC | Mel-frequency cepstral coefficients | DCT(log(mel_filterbank(\|FFT\|))) | Excellent |
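As a concrete illustration of two of these features, here is a small NumPy sketch of the spectral centroid and spectral flux computations from the table (function names and the epsilon guard are illustrative):

```python
import numpy as np

def spectral_centroid(frame, sample_rate=16000):
    """Center of mass of the magnitude spectrum, in Hz."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-10)

def spectral_flux(frame, prev_frame):
    """Sum of squared differences between consecutive magnitude spectra."""
    spectrum = np.abs(np.fft.rfft(frame))
    prev_spectrum = np.abs(np.fft.rfft(prev_frame))
    return np.sum((spectrum - prev_spectrum) ** 2)
```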
Energy-based VAD: The Simplest Approach
1. Split the audio into overlapping frames (20-30 ms)
2. Compute the RMS energy of each frame
3. Compare the energy to a threshold (adaptive or fixed)
4. Mark frames above the threshold as speech
5. Apply smoothing to remove spurious detections
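These five steps fit in a few lines of NumPy. A minimal sketch, with frame size, hop, and threshold values chosen only for illustration:

```python
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=25, hop_ms=10, threshold=0.02):
    """Label each frame as speech (1) or non-speech (0) by RMS energy.

    `audio` is assumed to be a float NumPy array scaled to [-1, 1].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    labels = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))          # step 2: RMS energy
        labels.append(1 if rms > threshold else 0)  # steps 3-4: fixed threshold
    # Step 5: simple smoothing -- a frame keeps its label only if the
    # majority of its 5-frame neighborhood agrees.
    labels = np.array(labels)
    smoothed = np.copy(labels)
    for i in range(len(labels)):
        window = labels[max(0, i - 2):i + 3]
        smoothed[i] = 1 if window.mean() >= 0.5 else 0
    return smoothed
```

In practice the threshold is usually set adaptively, for example relative to a running estimate of the noise floor, rather than fixed as in this sketch.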
Limitations:
- Fails with background noise (traffic, AC)
- Cannot distinguish speech from music
- Misses whispered or soft speech
- Threshold tuning is environment-specific
Zero-Crossing Rate (ZCR)
ZCR counts how often the signal changes sign per unit time. Voiced speech (vowels) has low ZCR due to periodicity. Unvoiced speech (consonants like 's', 'f') and noise have high ZCR.
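A sketch of the computation, assuming the frame is a NumPy array of samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])
```

Energy and ZCR are often combined: high energy with low ZCR suggests voiced speech, while low energy with high ZCR suggests fricatives or noise.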
Modern Neural Approaches
Neural networks learn to detect speech from data, handling noise and edge cases that defeat hand-crafted rules. The architecture is simple: audio features in, speech probability out.
Neural VAD Architecture
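A minimal PyTorch sketch of such a model: acoustic features go in, per-frame speech probabilities come out. Layer types and sizes are illustrative and do not correspond to any particular production model.

```python
import torch
import torch.nn as nn

class FrameVAD(nn.Module):
    """Frame-wise speech/non-speech classifier: features in, probability out."""
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, features):                     # (batch, frames, n_features)
        hidden_states, _ = self.encoder(features)
        logits = self.classifier(hidden_states)      # (batch, frames, 1)
        return torch.sigmoid(logits).squeeze(-1)     # speech probability per frame
```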
Leading Neural VAD Systems
- Silero VAD: Small, fast model optimized for production. Uses LSTM layers trained on a massive multilingual dataset. Runs 4000x faster than real-time on CPU.
- pyannote.audio: State-of-the-art accuracy using Transformers. Part of the larger pyannote speaker diarization toolkit. Heavier, but more accurate in challenging conditions.
- WebRTC VAD: Not neural but GMM-based. Extremely fast C library used in browsers and VoIP. Less accurate, but runs anywhere with near-zero latency.
- Whisper: OpenAI Whisper outputs a no_speech_prob for each segment. If you are already using Whisper for ASR, you get VAD for free. Works well but adds ASR overhead.
How Neural VAD Learns
Neural VAD is trained as binary classification. Each frame gets a label: 1 for speech, 0 for non-speech. The model learns to distinguish the spectral patterns of human voice from everything else.
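A sketch of one training step for the illustrative FrameVAD model above, using per-frame binary cross-entropy; the batch shapes and random data are placeholders:

```python
import torch
import torch.nn as nn

model = FrameVAD()                               # illustrative model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                           # per-frame binary cross-entropy

features = torch.randn(8, 200, 40)               # 8 clips, 200 frames, 40 features each
labels = torch.randint(0, 2, (8, 200)).float()   # 1 = speech, 0 = non-speech

optimizer.zero_grad()
probs = model(features)                          # (8, 200) speech probabilities
loss = loss_fn(probs, labels)
loss.backward()
optimizer.step()
```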
Frame-level vs Segment-level Output
VAD models typically output per-frame probabilities, but applications need speech segments. Understanding this conversion is crucial.
Threshold choice directly controls sensitivity: a lower threshold classifies more frames as speech, a higher one fewer.
Frame-level Output
The model processes audio in overlapping frames (typically 30ms with 10ms hop). For each frame, it outputs a probability between 0 and 1.
| Frame | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 |
| Prob | 0.05 | 0.08 | 0.12 | 0.45 | 0.78 | 0.92 | 0.95 | 0.88 |
| Label | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Converting to Segments
1. Thresholding: Convert probabilities to binary using a threshold (typically 0.5).
2. Hysteresis: Use different thresholds for speech start (higher) and end (lower) to reduce flickering.
3. Minimum duration: Discard speech segments shorter than a minimum (e.g., 250ms). Removes noise spikes.
4. Merging: Merge segments separated by short silence (e.g., 100ms). Handles natural pauses in speech.
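The thresholding, merging, and minimum-duration steps fit in a short helper. A minimal sketch (hysteresis omitted for brevity; the hop size and duration values are illustrative defaults):

```python
def probs_to_segments(probs, hop_s=0.01, threshold=0.5,
                      min_speech_s=0.25, min_silence_s=0.10):
    """Turn per-frame speech probabilities into (start, end) segments in seconds."""
    labels = [p >= threshold for p in probs]          # thresholding

    # Collect runs of consecutive speech frames as raw segments
    segments, start = [], None
    for i, is_speech in enumerate(labels + [False]):  # sentinel closes a trailing run
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * hop_s, i * hop_s))
            start = None

    # Merge segments separated by a silence shorter than min_silence_s
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_silence_s:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Drop segments shorter than the minimum speech duration
    return [s for s in merged if s[1] - s[0] >= min_speech_s]
```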
Tuning Parameters for Your Use Case
For offline transcription (favor recall):
- Lower threshold (0.3-0.4) to capture all speech
- Longer min_speech (300-500ms)
- Generous padding (200ms) around segments

For real-time interaction (favor responsiveness):
- Higher threshold (0.5-0.6) to reduce false positives
- Shorter min_silence (50-100ms) for responsiveness
- Hysteresis to prevent flickering
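Expressed as parameter presets for the Silero get_speech_timestamps call shown in the code section below; the exact numbers are illustrative starting points, not recommendations from the Silero authors:

```python
# Transcription / offline: favor recall, keep natural pauses inside one segment
transcription_params = dict(
    threshold=0.35,                 # lower threshold to capture soft speech
    min_speech_duration_ms=400,     # ignore very short blips
    speech_pad_ms=200,              # generous padding around segments
)

# Real-time / interactive: favor responsiveness, suppress false positives
realtime_params = dict(
    threshold=0.55,                 # higher threshold to cut false positives
    min_silence_duration_ms=80,     # end segments quickly after speech stops
)
```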
VAD Models Comparison
From lightweight WebRTC to state-of-the-art pyannote. Choose based on your latency, accuracy, and deployment constraints.
| Model | Type | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Silero VAD | Neural | ~4000x realtime | 95%+ on most data | ONNX, PyTorch, TFLite. Excellent balance of speed and accuracy. |
| WebRTC VAD | GMM-based | ~10000x realtime | 85-90% | Extremely fast, C library. Good for embedded/real-time. |
| pyannote.audio VAD | Neural | ~100x realtime | 97%+ | State-of-the-art accuracy, GPU recommended. |
| Whisper VAD | Byproduct | ~5-20x realtime | 93-97% | Uses Whisper's no_speech_prob. Great if already using Whisper. |
| Energy-based | Heuristic | ~50000x realtime | 70-80% | No dependencies. Fails in noisy environments. |
Choosing the Right VAD
Choose Silero VAD if you:
- Need a balance of speed and accuracy
- Deploy to edge devices or CPU-only
- Build streaming applications
- Want simple Python/ONNX integration

Choose WebRTC VAD if you have:
- Extreme latency requirements (<1ms)
- Embedded, mobile, or browser targets
- Simple audio conditions
- A preference for minimal dependencies

Choose pyannote.audio if:
- Maximum accuracy is critical
- The audio is challenging (noise, overlap)
- A GPU is available for inference
- You are building a diarization pipeline

Choose Whisper VAD if you:
- Already use Whisper for ASR
- Want timestamps alongside transcription
- Do batch processing (not real-time)
- Only need segment-level output
For most applications, Silero VAD offers the best trade-off. It runs 4000x faster than real-time on CPU, achieves 95%+ accuracy, and comes in multiple formats (PyTorch, ONNX, TFLite). The streaming API handles chunked audio gracefully. Unless you have extreme latency needs (WebRTC) or accuracy needs (pyannote), start with Silero.
Benchmarks and Evaluation
Standard datasets for evaluating VAD systems. Detection Error Rate (DER) is the primary metric.
| Dataset | Domain | Duration | Metric | Description |
|---|---|---|---|---|
| AVA-Speech | Movies | 45h | AUC | Diverse audio conditions from films |
| AMI Corpus | Meetings | 100h | DER | Multi-party meeting recordings |
| DIHARD III | Challenging | 40h | DER | Child speech, web video, clinical |
| VoxConverse | Broadcast | 50h | DER | News, interviews, podcasts |
Understanding VAD Metrics
VAD evaluation is time-based: errors are measured in seconds of misclassified audio, not counts. A 1-second false alarm is worse than a 10ms false alarm.
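A sketch of this time-based accounting, computing missed-speech and false-alarm durations from per-frame reference and hypothesis labels (the 10ms hop is an assumption):

```python
import numpy as np

def vad_error_times(reference, hypothesis, hop_s=0.01):
    """Missed-speech and false-alarm time, given per-frame 0/1 labels."""
    reference = np.asarray(reference, dtype=bool)
    hypothesis = np.asarray(hypothesis, dtype=bool)
    missed_s = np.sum(reference & ~hypothesis) * hop_s       # speech labeled as non-speech
    false_alarm_s = np.sum(~reference & hypothesis) * hop_s  # non-speech labeled as speech
    return missed_s, false_alarm_s
```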
DIHARD: The Hardest VAD Challenge
DIHARD (Diarization in Hard conditions) includes the most challenging audio: child speech, restaurant noise, web video, clinical interviews. Even state-of-the-art systems struggle here.
Code Examples
Get started with VAD in Python. From simple energy-based to production-ready Silero.
```python
import torch

# Load Silero VAD model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False
)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Read audio (automatically resampled to 16kHz)
audio = read_audio('audio.wav')

# Get speech timestamps
speech_timestamps = get_speech_timestamps(
    audio,
    model,
    threshold=0.5,                  # Speech probability threshold
    min_speech_duration_ms=250,     # Minimum speech segment
    min_silence_duration_ms=100,    # Minimum silence between segments
    sampling_rate=16000
)

# Output: [{'start': 1000, 'end': 5000}, {'start': 7000, 'end': 12000}]
for segment in speech_timestamps:
    start_sec = segment['start'] / 16000
    end_sec = segment['end'] / 16000
    print(f"Speech: {start_sec:.2f}s - {end_sec:.2f}s")
```
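For streaming use (mentioned in the comparison above), Silero ships a VADIterator utility, unpacked from utils in the snippet above. A hedged sketch; chunk-size requirements vary between Silero releases, with 512 samples at 16 kHz being a common choice:

```python
# Streaming sketch with the VADIterator utility unpacked above.
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)

chunk_size = 512                          # samples per chunk at 16 kHz (version-dependent)
for i in range(0, len(audio), chunk_size):
    chunk = audio[i:i + chunk_size]
    if len(chunk) < chunk_size:
        break                             # drop the final partial chunk
    event = vad_iterator(chunk, return_seconds=True)
    if event is not None:                 # {'start': ...} or {'end': ...} in seconds
        print(event)

vad_iterator.reset_states()               # reset before processing the next stream
```

Quick Reference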
For batch / offline processing:
- Silero VAD (best balance)
- pyannote.audio (max accuracy)
- Tune thresholds for your audio

For real-time / streaming:
- WebRTC VAD (sub-ms latency)
- Silero streaming API
- Small frame sizes (10-30ms)

Typical starting parameters:
- threshold: 0.3-0.6
- min_speech_duration: 250ms
- min_silence_duration: 100ms
Use Cases
- ✓ ASR preprocessing
- ✓ Smart speaker wake word
- ✓ Meeting segmentation
- ✓ Audio compression
- ✓ Noise reduction
Architectural Patterns
Energy-Based VAD
Simple threshold on audio energy/volume.
Pros:
- Very fast
- No ML needed
- Low latency

Cons:
- Fails with noise
- No speech/noise distinction
Neural VAD
Train neural network to classify speech frames.
Pros:
- Robust to noise
- Accurate
- Handles music

Cons:
- Higher latency
- Needs training data
Self-Supervised VAD
Use pre-trained audio models for VAD.
Pros:
- Best accuracy
- Handles edge cases

Cons:
- Larger models
- More compute
Quick Facts
- Input: Audio
- Output: Structured data
- Implementations: 4 open source, 0 API
- Patterns: 3 approaches