
Voice Activity Detection

Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.

How Voice Activity Detection Works

A technical deep-dive into Voice Activity Detection. From signal processing fundamentals to neural networks that power modern speech systems.

1. What is Voice Activity Detection?

VAD answers a deceptively simple question: "Is someone speaking right now?" This binary classification task is foundational to nearly every speech processing system.

VAD in Action


Time ->   0s        1s        2s        3s        4s        5s
         |---------|---------|---------|---------|---------|
Audio:   ~~~....~~~~~~###########~~~~....~########~~~~~~~....
                     |         |         |       |
                    Speech    Silence   Speech  Silence

Output:  [0,0,0,0,0][1,1,1,1,1,1,1][0,0,0][1,1,1,1][0,0,0,0]

Segments: [          (1.0s - 2.3s)       (2.8s - 3.8s)      ]

VAD processes an audio stream and outputs temporal boundaries where speech occurs. The core challenge: distinguishing human voice from silence, noise, music, and other sounds.

Output Labels

  • SPEECH: human speech detected (e.g., 0.5s - 3.2s)
  • NON-SPEECH: silence, noise, or music (e.g., 0.0s - 0.5s)
  • OVERLAP: multiple speakers, supported by some models (e.g., 2.1s - 2.8s)

Why VAD Matters

ASR Preprocessing

Strip silence before sending audio to speech recognition. This reduces latency and cost, and improves accuracy.

Turn-Taking

Detect when a user stops speaking so the AI assistant can respond. Critical for conversational systems.

Audio Segmentation

Split long recordings into manageable chunks. First step in speaker diarization pipelines.

Compression

VoIP systems use VAD to transmit only during speech, saving bandwidth by 50-70%.

The Challenge: Beyond Simple Silence Detection

Easy Cases
  • Clear speech in a quiet room
  • Complete silence between utterances
  • Single speaker, close microphone
Hard Cases
  • Speech with background music
  • Whispering, breathing, laughter
  • Far-field audio, overlapping speakers
  • Non-speech vocalizations (coughing, "hmm")
2. Signal Processing Approach

Before deep learning, VAD relied on handcrafted acoustic features. These methods are still valuable: they are fast, interpretable, and require no training data.

The Core Insight

Speech has distinctive acoustic properties. It is:

  • Louder than silence: higher energy / RMS amplitude
  • Quasi-periodic: harmonic structure from the vocal cords
  • Spectrally distinct: formants in the 300Hz-3400Hz range

Common Signal Features

Feature            | Description                         | Formula                             | Effectiveness
Energy / RMS       | Volume level of the signal          | sqrt(sum(x^2) / N)                  | Basic
Zero-Crossing Rate | How often the signal crosses zero   | sum(sign(x[n]) != sign(x[n-1])) / N | Moderate
Spectral Centroid  | Center of mass of the spectrum      | sum(f * |X(f)|) / sum(|X(f)|)       | Good
Spectral Flux      | Rate of spectral change             | sum((|X(f)| - |X_prev(f)|)^2)       | Good
MFCC               | Mel-frequency cepstral coefficients | DCT(log(mel_filterbank(|FFT|)))     | Excellent
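
As an illustration of two entries in the table, here is a short numpy sketch of spectral centroid and spectral flux computed on single frames; the function names and the 16 kHz rate are assumptions for the example.

import numpy as np

def spectral_centroid(frame, sample_rate=16000):
    # Center of mass of the magnitude spectrum, in Hz (see the table above).
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return float(np.sum(freqs * mag) / (np.sum(mag) + 1e-10))

def spectral_flux(frame, prev_frame):
    # Sum of squared differences between consecutive magnitude spectra.
    mag = np.abs(np.fft.rfft(frame))
    prev_mag = np.abs(np.fft.rfft(prev_frame))
    return float(np.sum((mag - prev_mag) ** 2))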

Energy-based VAD: The Simplest Approach

Algorithm
  1. Split audio into overlapping frames (20-30ms)
  2. Compute RMS energy of each frame
  3. Compare energy to threshold (adaptive or fixed)
  4. Mark frames above threshold as speech
  5. Apply smoothing to remove spurious detections
Limitations
  • Fails with background noise (traffic, AC)
  • Cannot distinguish speech from music
  • Misses whispered or soft speech
  • Threshold tuning is environment-specific
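
A minimal sketch of the five steps above in plain numpy. The frame size, hop, and fixed threshold are illustrative; as the limitations note, the threshold would need retuning per environment.

import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=25, hop_ms=10, threshold=0.02):
    # Step 1: split into overlapping frames.
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    flags = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))   # Step 2: RMS energy
        flags.append(rms > threshold)        # Steps 3-4: fixed threshold
    # Step 5: smoothing by majority vote over a 5-frame window (removes isolated spikes).
    smoothed = []
    for i in range(len(flags)):
        window = flags[max(0, i - 2):i + 3]
        smoothed.append(sum(window) > len(window) // 2)
    return smoothed  # one boolean per hop (10 ms here)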

Zero-Crossing Rate (ZCR)

ZCR counts how often the signal changes sign per unit time. Voiced speech (vowels) has low ZCR due to periodicity. Unvoiced speech (consonants like 's', 'f') and noise have high ZCR.

  • Low ZCR: voiced sounds ("ah", "ee"), ~500-1500 crossings/sec
  • High ZCR: unvoiced sounds ("ss", "ff", "sh"), ~3000-5000 crossings/sec
  • Very high ZCR: white noise, static, >5000 crossings/sec
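
A short numpy sketch of the computation, expressed as crossings per second so the result lines up with the ranges above.

import numpy as np

def zero_crossing_rate(frame, sample_rate=16000):
    # Count sign changes and normalize to crossings per second.
    signs = np.sign(frame)
    signs[signs == 0] = 1                  # treat exact zeros as positive
    crossings = np.sum(signs[1:] != signs[:-1])
    return float(crossings * sample_rate / len(frame))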
3. Modern Neural Approaches

Neural networks learn to detect speech from data, handling noise and edge cases that defeat hand-crafted rules. The architecture is simple: audio features in, speech probability out.

Neural VAD Architecture

  1. Audio Input: raw waveform (16kHz), 16000 samples/sec
  2. Framing: split into overlapping frames (30ms frames, 10ms hop)
  3. Features: extract spectral features (MFCCs, mel spectrogram)
  4. Neural Network: LSTM / Transformer / CNN, processing each frame
  5. Sigmoid: speech probability, 0.0 - 1.0 per frame
  6. Smoothing: hysteresis and minimum-duration rules produce speech segments
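
A minimal PyTorch sketch of the pipeline above: log-mel features feed an LSTM whose per-frame outputs pass through a sigmoid head. The layer sizes and feature settings are illustrative, not taken from any particular model.

import torch
import torch.nn as nn
import torchaudio

class FrameVAD(nn.Module):
    # Mel spectrogram -> LSTM -> per-frame speech probability.
    def __init__(self, n_mels=40, hidden=64):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms window, 10 ms hop
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, waveform):                   # waveform: (batch, samples)
        feats = self.melspec(waveform)             # (batch, n_mels, frames)
        feats = feats.clamp(min=1e-10).log()       # log-mel features
        out, _ = self.lstm(feats.transpose(1, 2))  # (batch, frames, hidden)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # (batch, frames), in [0, 1]

probs = FrameVAD()(torch.randn(1, 16000))          # 1 s of audio -> roughly 100 frame probabilities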

Leading Neural VAD Systems

Silero VAD

Small, fast model optimized for production. Uses LSTM layers trained on massive multilingual dataset. Runs 4000x faster than real-time on CPU.

Size: ~2MB | Latency: ~10ms | Formats: ONNX, PyTorch, TFLite

pyannote.audio VAD

State-of-the-art accuracy using Transformers. Part of the larger pyannote speaker diarization toolkit. Heavier but more accurate in challenging conditions.

Size: ~100MB | GPU: Recommended | Accuracy: 97%+

WebRTC VAD

Not neural but GMM-based. Extremely fast C library used in browsers and VoIP. Less accurate but runs anywhere with near-zero latency.

Size: ~50KB | Language: C (Python bindings) | Latency: <1ms

Whisper-based VAD

OpenAI Whisper outputs a no_speech_prob for each segment. If you are already using Whisper for ASR, you get VAD for free. Works well but adds ASR overhead.

Model: Whisper decoder | Metric: no_speech_prob | Latency: ~1s chunks
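
If Whisper is already in the pipeline, its per-segment no_speech_prob can double as a coarse VAD signal. A minimal sketch with the openai-whisper package; the 0.6 cutoff is illustrative (Whisper itself combines no_speech_prob with other checks such as average log-probability).

import whisper  # pip install openai-whisper

model = whisper.load_model("base")
result = model.transcribe("audio.wav")

# Each decoded segment carries a no_speech_prob; low values indicate speech.
for seg in result["segments"]:
    if seg["no_speech_prob"] < 0.6:   # illustrative threshold
        print(f"Speech: {seg['start']:.2f}s - {seg['end']:.2f}s  {seg['text'].strip()}")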

How Neural VAD Learns

Neural VAD is trained as binary classification. Each frame gets a label: 1 for speech, 0 for non-speech. The model learns to distinguish the spectral patterns of human voice from everything else.

  • Training Data: thousands of hours with frame-level labels, often generated via forced alignment.
  • Loss Function: binary cross-entropy on the speech probability; class weighting handles imbalance.
  • Data Augmentation: added noise, reverb, and music make the model robust to real-world conditions.
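
A minimal sketch of that objective, assuming per-frame logits from a model and 0/1 frame labels; the shapes and the class weight are illustrative.

import torch
import torch.nn as nn

logits = torch.randn(8, 300, requires_grad=True)   # stands in for per-frame model outputs (batch, frames)
labels = torch.randint(0, 2, (8, 300)).float()     # 1 = speech, 0 = non-speech (e.g., from forced alignment)

# pos_weight > 1 up-weights speech frames when non-speech dominates the training data.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor(2.0))
loss = criterion(logits, labels)
loss.backward()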
4. Frame-level vs Segment-level Output

VAD models typically output per-frame probabilities, but applications need speech segments. Understanding this conversion is crucial.


Frame-level Output

The model processes audio in overlapping frames (typically 30ms with 10ms hop). For each frame, it outputs a probability between 0 and 1.

Frame:    0      1      2      3      4      5      6      7
Time (s): 0.00   0.01   0.02   0.03   0.04   0.05   0.06   0.07
Prob:     0.05   0.08   0.12   0.45   0.78   0.92   0.95   0.88
Label:    0      0      0      0      1      1      1      1

Converting to Segments

Step 1: Thresholding

Convert probabilities to binary using a threshold (typically 0.5).

is_speech = prob > threshold
Step 2: Hysteresis

Use different thresholds for speech start (higher) and end (lower) to reduce flickering.

start_thresh=0.6, end_thresh=0.4
Step 3: Minimum Duration

Discard speech segments shorter than a minimum (e.g., 250ms). Removes noise spikes.

min_speech_duration_ms=250
Step 4: Gap Merging

Merge segments separated by short silence (e.g., 100ms). Handles natural pauses in speech.

min_silence_duration_ms=100
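
The four steps above can be collapsed into one post-processing function. This is a minimal illustration rather than any library's implementation; probs is assumed to hold one probability per 10 ms hop, and the thresholds and durations are the example values from the steps.

def probs_to_segments(probs, hop_s=0.01,
                      start_thresh=0.6, end_thresh=0.4,
                      min_speech_s=0.25, min_silence_s=0.10):
    # Steps 1-2: hysteresis thresholding -- enter speech above start_thresh, exit below end_thresh.
    segments, in_speech, start = [], False, 0.0
    for i, p in enumerate(probs):
        t = i * hop_s
        if not in_speech and p > start_thresh:
            in_speech, start = True, t
        elif in_speech and p < end_thresh:
            segments.append((start, t))
            in_speech = False
    if in_speech:
        segments.append((start, len(probs) * hop_s))

    # Step 3: discard segments shorter than the minimum speech duration.
    segments = [(s, e) for s, e in segments if e - s >= min_speech_s]

    # Step 4: merge segments separated by less than the minimum silence duration.
    merged = []
    for s, e in segments:
        if merged and s - merged[-1][1] < min_silence_s:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged  # list of (start_s, end_s) tuples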

Tuning Parameters for Your Use Case

For ASR Preprocessing
  • Lower threshold (0.3-0.4) to capture all speech
  • Longer min_speech (300-500ms)
  • Generous padding (200ms) around segments
For Turn-Taking Detection
  • Higher threshold (0.5-0.6) to reduce false positives
  • Shorter min_silence (50-100ms) for responsiveness
  • Hysteresis to prevent flickering
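
These two profiles can be written as keyword presets for Silero's get_speech_timestamps (shown in the code examples section). A sketch under the assumption that speech_pad_ms is available as a padding option; tune the exact values on your own audio.

# Hypothetical presets for Silero's get_speech_timestamps (see section 7).
asr_preprocessing = dict(
    threshold=0.35,               # lower threshold: capture all speech
    min_speech_duration_ms=400,   # longer minimum speech segment
    min_silence_duration_ms=100,
    speech_pad_ms=200,            # generous padding around segments (assumed Silero option)
)

turn_taking = dict(
    threshold=0.55,               # higher threshold: fewer false positives
    min_speech_duration_ms=250,
    min_silence_duration_ms=75,   # shorter silence gap for responsiveness
)

# speech_timestamps = get_speech_timestamps(audio, model, sampling_rate=16000, **asr_preprocessing)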
5. VAD Models Comparison

From lightweight WebRTC to state-of-the-art pyannote. Choose based on your latency, accuracy, and deployment constraints.

Model              | Type      | Speed            | Accuracy          | Notes
Silero VAD         | Neural    | ~4000x realtime  | 95%+ on most data | ONNX, PyTorch, TFLite. Excellent balance of speed and accuracy.
WebRTC VAD         | GMM-based | ~10000x realtime | 85-90%            | Extremely fast, C library. Good for embedded/real-time.
pyannote.audio VAD | Neural    | ~100x realtime   | 97%+              | State-of-the-art accuracy, GPU recommended.
Whisper VAD        | Byproduct | ~5-20x realtime  | 93-97%            | Uses Whisper's no_speech_prob. Great if already using Whisper.
Energy-based       | Heuristic | ~50000x realtime | 70-80%            | No dependencies. Fails in noisy environments.

Choosing the Right VAD

Use Silero VAD when:
  • Need a balance of speed and accuracy
  • Deploying to edge devices or CPU-only
  • Building streaming applications
  • Want simple Python/ONNX integration
Use WebRTC VAD when:
  • Extreme latency requirements (<1ms)
  • Embedded systems, mobile, browser
  • Simple audio conditions
  • Minimal dependencies preferred
Use pyannote.audio when:
  • Maximum accuracy is critical
  • Challenging audio (noise, overlap)
  • GPU available for inference
  • Building a diarization pipeline
Use Whisper VAD when:
  • Already using Whisper for ASR
  • Want timestamps with transcription
  • Batch processing (not real-time)
  • Segment-level output is sufficient
Silero VAD: The Sweet Spot

For most applications, Silero VAD offers the best trade-off. It runs 4000x faster than real-time on CPU, achieves 95%+ accuracy, and comes in multiple formats (PyTorch, ONNX, TFLite). The streaming API handles chunked audio gracefully. Unless you have extreme latency needs (WebRTC) or accuracy needs (pyannote), start with Silero.

6. Benchmarks and Evaluation

Standard datasets for evaluating VAD systems. Detection Error Rate (DER) is the primary metric, though some benchmarks report AUC instead.

Dataset     | Domain      | Duration | Metric | Description
AVA-Speech  | Movies      | 45h      | AUC    | Diverse audio conditions from films
AMI Corpus  | Meetings    | 100h     | DER    | Multi-party meeting recordings
DIHARD III  | Challenging | 40h      | DER    | Child speech, web video, clinical
VoxConverse | Broadcast   | 50h      | DER    | News, interviews, podcasts

Understanding VAD Metrics

  • Detection Error Rate (DER): percentage of time misclassified; lower is better. DER = (FA + Miss) / Total
  • False Alarm (FA): non-speech detected as speech; wastes ASR resources. FA = FP_duration / Total
  • Miss Rate: speech detected as non-speech; loses transcription. Miss = FN_duration / Total

VAD evaluation is time-based: errors are measured in seconds of misclassified audio, not counts. A 1-second false alarm is worse than a 10ms false alarm.
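
A minimal time-based scoring sketch following the formulas above, assuming aligned per-frame 0/1 labels for reference and hypothesis at a 10 ms hop; real scoring toolkits additionally handle segment alignment and forgiveness collars.

def vad_error_rates(ref, hyp, hop_s=0.01):
    # ref/hyp: per-frame labels, 1 = speech, 0 = non-speech, same length.
    assert len(ref) == len(hyp)
    total = len(ref) * hop_s
    fa = sum(hop_s for r, h in zip(ref, hyp) if r == 0 and h == 1)    # false alarm time
    miss = sum(hop_s for r, h in zip(ref, hyp) if r == 1 and h == 0)  # missed speech time
    return {"FA": fa / total, "Miss": miss / total, "DER": (fa + miss) / total}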

DIHARD: The Hardest VAD Challenge

DIHARD (Diarization in Hard conditions) includes the most challenging audio: child speech, restaurant noise, web video, clinical interviews. Even state-of-the-art systems struggle here.

Domains: child speech, restaurant, web video, clinical interviews, sociolinguistic recordings.
7. Code Examples

Get started with VAD in Python. From simple energy-based to production-ready Silero.

Silero VAD (Recommended, production ready)

Install: pip install torch torchaudio

import torch

# Load Silero VAD model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False
)

(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Read audio (automatically resamples to 16kHz)
audio = read_audio('audio.wav')

# Get speech timestamps
speech_timestamps = get_speech_timestamps(
    audio,
    model,
    threshold=0.5,           # Speech probability threshold
    min_speech_duration_ms=250,  # Minimum speech segment
    min_silence_duration_ms=100, # Minimum silence between segments
    sampling_rate=16000
)

# Output: [{'start': 1000, 'end': 5000}, {'start': 7000, 'end': 12000}]
for segment in speech_timestamps:
    start_sec = segment['start'] / 16000
    end_sec = segment['end'] / 16000
    print(f"Speech: {start_sec:.2f}s - {end_sec:.2f}s")
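
WebRTC VAD (streaming / embedded)

Install: pip install webrtcvad

For the real-time/embedded path (the WebRTC VAD row in the comparison above), a minimal sketch using the webrtcvad Python bindings. It assumes a 16-bit mono WAV at one of the supported sample rates; aggressiveness mode 2 is illustrative.

import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most aggressive)

with wave.open('audio.wav', 'rb') as wf:
    assert wf.getnchannels() == 1 and wf.getsampwidth() == 2  # 16-bit mono PCM required
    sample_rate = wf.getframerate()                           # must be 8k, 16k, 32k, or 48k
    frame_samples = int(sample_rate * 0.03)                   # 30 ms frames (10/20/30 ms allowed)
    offset = 0
    while True:
        frame = wf.readframes(frame_samples)
        if len(frame) < frame_samples * 2:                    # 2 bytes per sample
            break
        print(f"{offset / sample_rate:.2f}s  speech={vad.is_speech(frame, sample_rate)}")
        offset += frame_samples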

Quick Reference

For Production
  • Silero VAD (best balance)
  • pyannote.audio (max accuracy)
  • Tune thresholds for your audio
For Real-time
  • WebRTC VAD (sub-ms latency)
  • Silero streaming API
  • Small frame sizes (10-30ms)
Key Parameters
  • threshold: 0.3-0.6
  • min_speech_duration: 250ms
  • min_silence_duration: 100ms

Use Cases

  • ASR preprocessing
  • Smart speaker wake word
  • Meeting segmentation
  • Audio compression
  • Noise reduction

Architectural Patterns

Energy-Based VAD

Simple threshold on audio energy/volume.

Pros:
  • + Very fast
  • + No ML needed
  • + Low latency
Cons:
  • - Fails with noise
  • - No speech/noise distinction

Neural VAD

Train neural network to classify speech frames.

Pros:
  • + Robust to noise
  • + Accurate
  • + Handles music
Cons:
  • - Higher latency
  • - Needs training data

Self-Supervised VAD

Use pre-trained audio models for VAD.

Pros:
  • + Best accuracy
  • + Handles edge cases
Cons:
  • - Larger models
  • - More compute

Implementations

Open Source

  • Silero VAD (MIT): Best open-source VAD. Fast, accurate, production-ready.
  • WebRTC VAD (BSD 3-Clause): Google's VAD from WebRTC. Very fast, low latency.
  • pyannote VAD (MIT): Part of the pyannote pipeline. Good with diarization.
  • SpeechBrain VAD (Apache 2.0): CRDNN-based VAD. High accuracy.


Quick Facts

  • Input: Audio
  • Output: Structured Data
  • Implementations: 4 open source, 0 API
  • Patterns: 3 approaches

Submit Results