Voice Activity Detection
Detect when speech is present in audio. Essential preprocessing for ASR, diarization, and voice interfaces.
How Voice Activity Detection Works
A technical deep-dive into Voice Activity Detection. From signal processing fundamentals to neural networks that power modern speech systems.
What is Voice Activity Detection?
VAD answers a deceptively simple question: "Is someone speaking right now?" This binary classification task is foundational to nearly every speech processing system.
VAD in Action
```
Time:     0s        1s        2s        3s        4s        5s
          |---------|---------|---------|---------|---------|
Audio:    ~~~~~~~~~~#############~~~~~##########~~~~~~~~~~~~
           silence     speech            speech     silence
Output:   [0,0,0,0,0][1,1,1,1,1,1,1][0,0,0][1,1,1,1][0,0,0,0]
Segments: (1.0s - 2.3s)  (2.8s - 3.8s)
```
VAD processes an audio stream and outputs temporal boundaries where speech occurs. The core challenge: distinguishing human voice from silence, noise, music, and other sounds.
Why VAD Matters
- ASR preprocessing: Strip silence before sending audio to speech recognition. Reduces latency and cost, and improves accuracy.
- Turn-taking: Detect when a user stops speaking so the AI assistant can respond. Critical for conversational systems.
- Segmentation: Split long recordings into manageable chunks. First step in speaker diarization pipelines.
- Bandwidth savings: VoIP systems use VAD to transmit only during speech, cutting bandwidth by 50-70%.
The Challenge: Beyond Simple Silence Detection
Easy cases:
- Clear speech in a quiet room
- Complete silence between utterances
- Single speaker, close microphone

Hard cases:
- Speech with background music
- Whispering, breathing, laughter
- Far-field audio, overlapping speakers
- Non-speech vocalizations (coughing, "hmm")
Signal Processing Approach
Before deep learning, VAD relied on handcrafted acoustic features. These methods are still valuable: fast, interpretable, and require no training data.
The Core Insight
Speech has distinctive acoustic properties: voiced sounds are quasi-periodic, most energy sits below roughly 4 kHz, and the amplitude envelope varies at syllable rates. Hand-crafted features exploit these regularities to separate speech from other sounds.
Common Signal Features
| Feature | Description | Formula | Effectiveness |
|---|---|---|---|
| Energy / RMS | Volume level of the signal | sqrt(sum(x^2) / N) | Basic |
| Zero-Crossing Rate | How often signal crosses zero | sum(sign(x[n]) != sign(x[n-1])) / N | Moderate |
| Spectral Centroid | Center of mass of the spectrum | sum(f * \|X(f)\|) / sum(\|X(f)\|) | Good |
| Spectral Flux | Rate of spectral change | sum((\|X(f)\| - \|X_prev(f)\|)^2) | Good |
| MFCC | Mel-frequency cepstral coefficients | DCT(log(mel_filterbank(\|FFT\|))) | Excellent |
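As a concrete illustration of two of these features, here is a small NumPy sketch of the spectral centroid and spectral flux computations from the table (function names and the epsilon guard are illustrative):

```python
import numpy as np

def spectral_centroid(frame, sample_rate=16000):
    """Center of mass of the magnitude spectrum, in Hz."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-10)

def spectral_flux(frame, prev_frame):
    """Sum of squared differences between consecutive magnitude spectra."""
    spectrum = np.abs(np.fft.rfft(frame))
    prev_spectrum = np.abs(np.fft.rfft(prev_frame))
    return np.sum((spectrum - prev_spectrum) ** 2)
```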
Energy-based VAD: The Simplest Approach
1. Split the audio into overlapping frames (20-30 ms)
2. Compute the RMS energy of each frame
3. Compare the energy to a threshold (adaptive or fixed)
4. Mark frames above the threshold as speech
5. Apply smoothing to remove spurious detections
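These five steps fit in a few lines of NumPy. A minimal sketch, with frame size, hop, and threshold values chosen only for illustration:

```python
import numpy as np

def energy_vad(audio, sample_rate=16000, frame_ms=25, hop_ms=10, threshold=0.02):
    """Label each frame as speech (1) or non-speech (0) by RMS energy.

    `audio` is assumed to be a float NumPy array scaled to [-1, 1].
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    labels = []
    for start in range(0, len(audio) - frame_len + 1, hop_len):
        frame = audio[start:start + frame_len]
        rms = np.sqrt(np.mean(frame ** 2))          # step 2: RMS energy
        labels.append(1 if rms > threshold else 0)  # steps 3-4: fixed threshold
    # Step 5: simple smoothing -- a frame keeps its label only if the
    # majority of its 5-frame neighborhood agrees.
    labels = np.array(labels)
    smoothed = np.copy(labels)
    for i in range(len(labels)):
        window = labels[max(0, i - 2):i + 3]
        smoothed[i] = 1 if window.mean() >= 0.5 else 0
    return smoothed
```

In practice the threshold is usually set adaptively, for example relative to a running estimate of the noise floor, rather than fixed as in this sketch.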
Limitations:
- Fails with background noise (traffic, AC)
- Cannot distinguish speech from music
- Misses whispered or soft speech
- Threshold tuning is environment-specific
Zero-Crossing Rate (ZCR)
ZCR counts how often the signal changes sign per unit time. Voiced speech (vowels) has low ZCR due to periodicity. Unvoiced speech (consonants like 's', 'f') and noise have high ZCR.
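A sketch of the computation, assuming the frame is a NumPy array of samples:

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(frame)
    return np.mean(signs[1:] != signs[:-1])
```

Energy and ZCR are often combined: high energy with low ZCR suggests voiced speech, while low energy with high ZCR suggests fricatives or noise.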
Modern Neural Approaches
Neural networks learn to detect speech from data, handling noise and edge cases that defeat hand-crafted rules. The architecture is simple: audio features in, speech probability out.
Neural VAD Architecture
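A minimal PyTorch sketch of such a model: acoustic features go in, per-frame speech probabilities come out. Layer types and sizes are illustrative and do not correspond to any particular production model.

```python
import torch
import torch.nn as nn

class FrameVAD(nn.Module):
    """Frame-wise speech/non-speech classifier: features in, probability out."""
    def __init__(self, n_features=40, hidden=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, features):                     # (batch, frames, n_features)
        hidden_states, _ = self.encoder(features)
        logits = self.classifier(hidden_states)      # (batch, frames, 1)
        return torch.sigmoid(logits).squeeze(-1)     # speech probability per frame
```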
Leading Neural VAD Systems
- Silero VAD: Small, fast model optimized for production. Uses LSTM layers trained on a massive multilingual dataset. Runs 4000x faster than real-time on CPU.
- pyannote.audio: State-of-the-art accuracy using Transformers. Part of the larger pyannote speaker diarization toolkit. Heavier, but more accurate in challenging conditions.
- WebRTC VAD: Not neural but GMM-based. Extremely fast C library used in browsers and VoIP. Less accurate, but runs anywhere with near-zero latency.
- Whisper: OpenAI Whisper outputs a no_speech_prob for each segment. If you are already using Whisper for ASR, you get VAD for free. Works well but adds ASR overhead.
How Neural VAD Learns
Neural VAD is trained as binary classification. Each frame gets a label: 1 for speech, 0 for non-speech. The model learns to distinguish the spectral patterns of human voice from everything else.
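A sketch of one training step for the illustrative FrameVAD model above, using per-frame binary cross-entropy; the batch shapes and random data are placeholders:

```python
import torch
import torch.nn as nn

model = FrameVAD()                               # illustrative model sketched above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()                           # per-frame binary cross-entropy

features = torch.randn(8, 200, 40)               # 8 clips, 200 frames, 40 features each
labels = torch.randint(0, 2, (8, 200)).float()   # 1 = speech, 0 = non-speech

optimizer.zero_grad()
probs = model(features)                          # (8, 200) speech probabilities
loss = loss_fn(probs, labels)
loss.backward()
optimizer.step()
```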
Frame-level vs Segment-level Output
VAD models typically output per-frame probabilities, but applications need speech segments. Understanding this conversion is crucial.
Threshold choice directly controls sensitivity: a lower threshold classifies more frames as speech, a higher one fewer.
Frame-level Output
The model processes audio in overlapping frames (typically 30ms with 10ms hop). For each frame, it outputs a probability between 0 and 1.
| Frame | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| Time (s) | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | 0.05 | 0.06 | 0.07 |
| Prob | 0.05 | 0.08 | 0.12 | 0.45 | 0.78 | 0.92 | 0.95 | 0.88 |
| Label | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
Converting to Segments
1. Thresholding: Convert probabilities to binary using a threshold (typically 0.5).
2. Hysteresis: Use different thresholds for speech start (higher) and end (lower) to reduce flickering.
3. Minimum duration: Discard speech segments shorter than a minimum (e.g., 250ms). Removes noise spikes.
4. Merging: Merge segments separated by short silence (e.g., 100ms). Handles natural pauses in speech.
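The thresholding, merging, and minimum-duration steps fit in a short helper. A minimal sketch (hysteresis omitted for brevity; the hop size and duration values are illustrative defaults):

```python
def probs_to_segments(probs, hop_s=0.01, threshold=0.5,
                      min_speech_s=0.25, min_silence_s=0.10):
    """Turn per-frame speech probabilities into (start, end) segments in seconds."""
    labels = [p >= threshold for p in probs]          # thresholding

    # Collect runs of consecutive speech frames as raw segments
    segments, start = [], None
    for i, is_speech in enumerate(labels + [False]):  # sentinel closes a trailing run
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * hop_s, i * hop_s))
            start = None

    # Merge segments separated by a silence shorter than min_silence_s
    merged = []
    for seg in segments:
        if merged and seg[0] - merged[-1][1] < min_silence_s:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)

    # Drop segments shorter than the minimum speech duration
    return [s for s in merged if s[1] - s[0] >= min_speech_s]
```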
Tuning Parameters for Your Use Case
For offline transcription (favor recall):
- Lower threshold (0.3-0.4) to capture all speech
- Longer min_speech (300-500ms)
- Generous padding (200ms) around segments

For real-time interaction (favor responsiveness):
- Higher threshold (0.5-0.6) to reduce false positives
- Shorter min_silence (50-100ms) for responsiveness
- Hysteresis to prevent flickering
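Expressed as parameter presets for the Silero get_speech_timestamps call shown in the code section below; the exact numbers are illustrative starting points, not recommendations from the Silero authors:

```python
# Transcription / offline: favor recall, keep natural pauses inside one segment
transcription_params = dict(
    threshold=0.35,                 # lower threshold to capture soft speech
    min_speech_duration_ms=400,     # ignore very short blips
    speech_pad_ms=200,              # generous padding around segments
)

# Real-time / interactive: favor responsiveness, suppress false positives
realtime_params = dict(
    threshold=0.55,                 # higher threshold to cut false positives
    min_silence_duration_ms=80,     # end segments quickly after speech stops
)
```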
VAD Models Comparison
From lightweight WebRTC to state-of-the-art pyannote. Choose based on your latency, accuracy, and deployment constraints.
| Model | Type | Speed | Accuracy | Notes |
|---|---|---|---|---|
| Silero VAD | Neural | ~4000x realtime | 95%+ on most data | ONNX, PyTorch, TFLite. Excellent balance of speed and accuracy. |
| WebRTC VAD | GMM-based | ~10000x realtime | 85-90% | Extremely fast, C library. Good for embedded/real-time. |
| pyannote.audio VAD | Neural | ~100x realtime | 97%+ | State-of-the-art accuracy, GPU recommended. |
| Whisper VAD | Byproduct | ~5-20x realtime | 93-97% | Uses Whisper's no_speech_prob. Great if already using Whisper. |
| Energy-based | Heuristic | ~50000x realtime | 70-80% | No dependencies. Fails in noisy environments. |
Choosing the Right VAD
Choose Silero VAD if you:
- Need a balance of speed and accuracy
- Deploy to edge devices or CPU-only
- Build streaming applications
- Want simple Python/ONNX integration

Choose WebRTC VAD if you have:
- Extreme latency requirements (<1ms)
- Embedded, mobile, or browser targets
- Simple audio conditions
- A preference for minimal dependencies

Choose pyannote.audio if:
- Maximum accuracy is critical
- The audio is challenging (noise, overlap)
- A GPU is available for inference
- You are building a diarization pipeline

Choose Whisper VAD if you:
- Already use Whisper for ASR
- Want timestamps alongside transcription
- Do batch processing (not real-time)
- Only need segment-level output
For most applications, Silero VAD offers the best trade-off. It runs 4000x faster than real-time on CPU, achieves 95%+ accuracy, and comes in multiple formats (PyTorch, ONNX, TFLite). The streaming API handles chunked audio gracefully. Unless you have extreme latency needs (WebRTC) or accuracy needs (pyannote), start with Silero.
Benchmarks and Evaluation
Standard datasets for evaluating VAD systems. Detection Error Rate (DER) is the primary metric.
| Dataset | Domain | Duration | Metric | Description |
|---|---|---|---|---|
| AVA-Speech | Movies | 45h | AUC | Diverse audio conditions from films |
| AMI Corpus | Meetings | 100h | DER | Multi-party meeting recordings |
| DIHARD III | Challenging | 40h | DER | Child speech, web video, clinical |
| VoxConverse | Broadcast | 50h | DER | News, interviews, podcasts |
Understanding VAD Metrics
VAD evaluation is time-based: errors are measured in seconds of misclassified audio, not counts. A 1-second false alarm is worse than a 10ms false alarm.
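A sketch of this time-based accounting, computing missed-speech and false-alarm durations from per-frame reference and hypothesis labels (the 10ms hop is an assumption):

```python
import numpy as np

def vad_error_times(reference, hypothesis, hop_s=0.01):
    """Missed-speech and false-alarm time, given per-frame 0/1 labels."""
    reference = np.asarray(reference, dtype=bool)
    hypothesis = np.asarray(hypothesis, dtype=bool)
    missed_s = np.sum(reference & ~hypothesis) * hop_s       # speech labeled as non-speech
    false_alarm_s = np.sum(~reference & hypothesis) * hop_s  # non-speech labeled as speech
    return missed_s, false_alarm_s
```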
DIHARD: The Hardest VAD Challenge
DIHARD (Diarization in Hard conditions) includes the most challenging audio: child speech, restaurant noise, web video, clinical interviews. Even state-of-the-art systems struggle here.
Code Examples
Get started with VAD in Python. From simple energy-based to production-ready Silero.
```python
import torch

# Load Silero VAD model
model, utils = torch.hub.load(
    repo_or_dir='snakers4/silero-vad',
    model='silero_vad',
    force_reload=False
)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# Read audio (automatically resampled to 16kHz)
audio = read_audio('audio.wav')

# Get speech timestamps
speech_timestamps = get_speech_timestamps(
    audio,
    model,
    threshold=0.5,                  # Speech probability threshold
    min_speech_duration_ms=250,     # Minimum speech segment
    min_silence_duration_ms=100,    # Minimum silence between segments
    sampling_rate=16000
)

# Output: [{'start': 1000, 'end': 5000}, {'start': 7000, 'end': 12000}]
for segment in speech_timestamps:
    start_sec = segment['start'] / 16000
    end_sec = segment['end'] / 16000
    print(f"Speech: {start_sec:.2f}s - {end_sec:.2f}s")
```
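For streaming use (mentioned in the comparison above), Silero ships a VADIterator utility, unpacked from utils in the snippet above. A hedged sketch; chunk-size requirements vary between Silero releases, with 512 samples at 16 kHz being a common choice:

```python
# Streaming sketch with the VADIterator utility unpacked above.
vad_iterator = VADIterator(model, threshold=0.5, sampling_rate=16000)

chunk_size = 512                          # samples per chunk at 16 kHz (version-dependent)
for i in range(0, len(audio), chunk_size):
    chunk = audio[i:i + chunk_size]
    if len(chunk) < chunk_size:
        break                             # drop the final partial chunk
    event = vad_iterator(chunk, return_seconds=True)
    if event is not None:                 # {'start': ...} or {'end': ...} in seconds
        print(event)

vad_iterator.reset_states()               # reset before processing the next stream
```

Quick Reference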
For batch / offline processing:
- Silero VAD (best balance)
- pyannote.audio (max accuracy)
- Tune thresholds for your audio

For real-time / streaming:
- WebRTC VAD (sub-ms latency)
- Silero streaming API
- Small frame sizes (10-30ms)

Typical starting parameters:
- threshold: 0.3-0.6
- min_speech_duration: 250ms
- min_silence_duration: 100ms
Use Cases
- ✓ ASR preprocessing
- ✓ Smart speaker wake word
- ✓ Meeting segmentation
- ✓ Audio compression
- ✓ Noise reduction
Architectural Patterns
Energy-Based VAD
Simple threshold on audio energy/volume.
Pros:
- Very fast
- No ML needed
- Low latency

Cons:
- Fails with noise
- No speech/noise distinction
Neural VAD
Train neural network to classify speech frames.
Pros:
- Robust to noise
- Accurate
- Handles music

Cons:
- Higher latency
- Needs training data
Self-Supervised VAD
Use pre-trained audio models for VAD.
Pros:
- Best accuracy
- Handles edge cases

Cons:
- Larger models
- More compute
Quick Facts
- Input: Audio
- Output: Structured data
- Implementations: 4 open source, 0 API
- Patterns: 3 approaches