Keyword Spotting
Detect wake words and short commands with low latency and tiny footprints.
How Keyword Spotting Works
A technical deep-dive into wake word detection. From the power constraints that shaped the field to the streaming architectures that make "Hey Siri" feel instantaneous.
The Problem: Always Listening, Never Draining
You want your device to respond the instant you say its name. But running full speech recognition 24/7 would drain the battery in hours. The solution is two-stage detection: a tiny, always-on "spotter" waits for just your wake word, then hands off to the heavy ASR engine.
Interactive Demo: Keyword Detection in Action
Watch how the model's confidence spikes only during the keyword region. The smoothing prevents false triggers from momentary high scores.
The Two-Stage Architecture
The KWS model runs continuously on a low-power DSP or neural accelerator. The main CPU and ASR engine only wake up when the keyword is detected.
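To make the handoff concrete, here is a minimal sketch of the two-stage loop in Python. The `kws_score` and `run_full_asr` functions and the chunk source are placeholders for illustration, not a real DSP or ASR API; on real hardware, stage 1 runs on the low-power core and stage 2 only spins up after a detection.

```python
import random

WAKE_THRESHOLD = 0.5  # typical default; tune per deployment


def kws_score(chunk: bytes) -> float:
    """Stage 1: tiny always-on spotter. Stubbed with a random score for illustration."""
    return random.random()


def run_full_asr() -> str:
    """Stage 2: heavyweight ASR engine, only started after the spotter fires. Stubbed here."""
    return "<transcript>"


def main_loop(chunks):
    for chunk in chunks:                        # always-on, low-power path
        if kws_score(chunk) > WAKE_THRESHOLD:   # wake word suspected
            print("Wake word detected, waking the main CPU / ASR engine")
            print(run_full_asr())               # stage 2 runs only now
            break                               # hand control to the application


main_loop(chunks=[b"\x00" * 2560 for _ in range(100)])  # fake 80 ms int16 chunks at 16 kHz
```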
The Constraints That Shape Everything
Keyword spotting operates under severe constraints that full ASR systems never face. Every design decision balances power, latency, accuracy, and model size.
- Power: must run on <5mW to enable months of battery life
- Latency: detection must feel instant (<200ms from utterance end)
- Accuracy: high recall (don't miss wake words) with low false accepts
- Size: must fit in <100KB for embedded deployment
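A rough back-of-the-envelope calculation shows why the power ceiling matters. The battery capacity and power figures below are illustrative assumptions, not numbers from this page.

```python
# Illustrative battery math (assumed numbers): a 1000 mAh cell at 3.7 V stores about 3.7 Wh.
battery_wh = 1.0 * 3.7              # 1000 mAh * 3.7 V = 3.7 Wh

for power_mw in (5, 50, 500):       # always-on spotter vs. increasingly hungry pipelines
    hours = battery_wh / (power_mw / 1000)
    print(f"{power_mw:>4} mW -> {hours:7.0f} h (~{hours / 24:.0f} days)")
# At 5 mW this cell lasts roughly a month; at 500 mW it is gone within a day.
```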
The Accuracy Trade-off: Recall vs False Accepts
- High recall: users get frustrated if they have to repeat the wake word. Target: >95% detection rate across noisy environments, accents, and varying speaking styles.
- Low false accepts: nothing is worse than your device randomly activating. Target: <1 false activation per day during normal conversation and media playback.
The sensitivity parameter lets users trade off between these. Higher sensitivity catches more true activations but also more false ones. Most systems default to ~0.5.
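One common convention, assumed here rather than taken from any particular SDK, maps sensitivity to a score threshold so that higher sensitivity lowers the bar to trigger:

```python
# Assumed convention (not tied to a specific SDK): higher sensitivity -> lower threshold.
def threshold_from_sensitivity(sensitivity: float) -> float:
    return 1.0 - sensitivity


frame_scores = [0.30, 0.48, 0.55, 0.72]   # made-up per-frame confidences

for sensitivity in (0.3, 0.5, 0.7):
    thr = threshold_from_sensitivity(sensitivity)
    hits = [s for s in frame_scores if s >= thr]
    print(f"sensitivity={sensitivity:.1f} -> threshold={thr:.2f} -> {len(hits)} trigger(s)")
```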
Feature Extraction: MFCC and Beyond
Raw 16kHz audio arrives at 16,000 samples per second. We need a compact representation that captures what matters for keyword recognition while being cheap to compute.
The MFCC Pipeline
MFCCs have been the workhorse of speech processing for decades. They compress audio into ~13 numbers per frame while preserving the information that distinguishes phonemes.
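As a concrete reference, here is a minimal MFCC extraction sketch using librosa (one common choice of library) with typical KWS settings: 25ms windows, 10ms hop, 40 mel bands, 13 coefficients. The synthetic sine wave simply stands in for a real recording.

```python
import librosa
import numpy as np

# One second of 16 kHz audio; a synthetic tone stands in for a real recording
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,          # 13 coefficients per frame, the classic KWS choice
    n_fft=400,          # 25 ms analysis window at 16 kHz
    hop_length=160,     # 10 ms hop -> ~100 frames per second
    n_mels=40,          # 40 mel filterbanks before the DCT
)
print(mfcc.shape)       # (13, ~101): 13 coefficients x frames
```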
Feature Extraction Methods
- MFCC: the classic choice for keyword spotting. Compact representation (13-40 coefficients) that captures vocal tract shape while being robust to volume changes.
- Log-mel filterbanks: the log energies from the mel filterbanks, used directly. More information than MFCCs but larger. Used by modern neural approaches.
- Learned features: let the neural network learn features from raw audio. Requires more data and compute but can discover optimal representations.
For most embedded KWS applications, MFCCs with 13 coefficients remain the best choice. They are compact, cheap to compute, and well-supported by every framework. Use 40 log-mel filterbanks only if you have compute budget for larger CNN/Transformer models.
Small Footprint Model Architectures
The key insight: we are not trying to transcribe arbitrary speech. We only need to recognize 1-10 specific phrases. This dramatically simplifies the model architecture.
Depthwise Separable CNN: The Workhorse
Standard convolution computes all filter-channel combinations at once. Depthwise separable convolution factorizes this into two steps, dramatically reducing parameters and compute.
Standard convolution: K filters, each of size H x W x C_in. For 64 3x3 filters on 64 channels: 64 * 3 * 3 * 64 = 36,864 params.
Depthwise separable convolution: a depthwise step (one H x W filter per channel) followed by a pointwise (1x1) step. For the same 64 filters on 64 channels: (3 * 3 * 64) + (64 * 64) = 4,672 params.
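The same arithmetic can be checked directly in PyTorch. The block below is a minimal sketch with biases omitted so the counts match the numbers above.

```python
import torch
import torch.nn as nn

c = 64  # channels in and out

standard = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False),  # depthwise: 3*3*64
    nn.Conv2d(c, c, kernel_size=1, bias=False),                       # pointwise: 64*64
)


def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


print(n_params(standard))              # 36864
print(n_params(depthwise_separable))   # 4672

x = torch.randn(1, c, 49, 13)          # e.g. 49 frames x 13 MFCCs
print(depthwise_separable(x).shape)    # torch.Size([1, 64, 49, 13])
```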
Architecture Comparison
| Architecture | Model Size | MACs | Latency | Accuracy | Notes |
|---|---|---|---|---|---|
| DS-CNN (Depthwise Separable CNN) | ~20-100KB | ~5-20M | ~5ms | ~95% | Splits convolution into depthwise (spatial) and pointwise (channel) operations. Dramatic parameter reduction with minimal accuracy loss. |
| DSCNN-L (Large DS-CNN) | ~500KB | ~50M | ~15ms | ~97% | Scaled-up depthwise separable CNN with more layers and channels. Better accuracy at the cost of size. |
| TC-ResNet (Temporal Convolution ResNet) | ~300KB | ~30M | ~10ms | ~96% | 1D convolutions along the time axis with residual connections. Excellent for capturing temporal patterns in speech. |
| Attention RNN (LSTM with Attention) | ~200KB | ~40M | ~20ms | ~95% | Recurrent architecture with an attention mechanism. Good for variable-length keywords but harder to optimize. |
| MatchboxNet (NVIDIA MatchboxNet) | ~75KB | ~10M | ~8ms | ~97% | QuartzNet-style architecture scaled for embedded use. Jasper/QuartzNet blocks with 1D convolutions. |
| Conformer-S (Small Streaming Conformer) | ~1MB | ~100M | ~30ms | ~98% | Hybrid attention-convolution architecture adapted for streaming. State-of-the-art accuracy but higher cost. |
Extreme constraint: <100KB, <10ms
- DS-CNN (small)
- TFLite Micro
- 13 MFCCs

Balanced: <500KB, <20ms
- DS-CNN (large) or TC-ResNet
- ONNX Runtime
- 40 log-mel

Maximum accuracy: size flexible
- Conformer or attention models
- PyTorch/TensorFlow
- 80 log-mel or raw waveform
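As an illustration of the first recipe (a small DS-CNN on 13 MFCCs), here is a rough PyTorch sketch. The layer sizes are illustrative and untuned, not a reference implementation of any published model.

```python
import torch
import torch.nn as nn


def ds_block(channels: int) -> nn.Sequential:
    """One depthwise separable block: depthwise 3x3, then pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )


class SmallDSCNN(nn.Module):
    def __init__(self, n_classes: int = 12, channels: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(10, 4), stride=(2, 2), padding=(5, 1), bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ds_block(channels) for _ in range(4)])
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                      # x: (batch, 1, frames, mfcc)
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3))                 # global average pool over time and frequency
        return self.head(x)


model = SmallDSCNN()
print(sum(p.numel() for p in model.parameters()))   # ~23k parameters
logits = model(torch.randn(1, 1, 49, 13))           # 49 frames x 13 MFCCs
print(logits.shape)                                  # torch.Size([1, 12])
```

With roughly 23k parameters this lands near 90KB in float32 and around 23KB after int8 quantization, which fits the extreme-constraint budget once quantized.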
Streaming Inference: Real-time Detection
Keywords do not arrive in neat 1-second chunks. They can start at any moment and span chunk boundaries. Streaming inference processes audio continuously with a sliding window, maintaining state between chunks.
The Ring Buffer: Why It Matters
Imagine the user says "Hey Jarvis" right at the boundary between two audio chunks. If we only process each chunk independently, we would miss the keyword because half of it is in each chunk.
1. Maintain a ring buffer of ~1-2 seconds of audio.
2. On each new chunk, slide the window forward.
3. Run inference on the entire window.
4. The keyword is always fully contained in some window.
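Here is a minimal NumPy sketch of this loop, using a simple slide-and-append buffer in place of true circular indexing. `model_score` is a placeholder for whatever KWS model runs on the full window.

```python
import numpy as np

SR = 16000
WINDOW = SR * 2          # keep the last 2 seconds of audio
CHUNK = 1280             # ~80 ms of new audio per step

ring = np.zeros(WINDOW, dtype=np.int16)


def model_score(window: np.ndarray) -> float:
    """Hypothetical stand-in for the KWS model's confidence on the window."""
    return float(np.abs(window).mean() > 100)  # placeholder energy check, not a real model


def push_chunk(ring: np.ndarray, chunk: np.ndarray) -> np.ndarray:
    # Slide the window: drop the oldest samples, append the newest chunk
    return np.concatenate([ring[len(chunk):], chunk])


for step in range(100):                          # in practice: read from the microphone
    loud = 4000 if step == 50 else 0             # fake a burst of energy at step 50
    chunk = np.full(CHUNK, loud, dtype=np.int16)
    ring = push_chunk(ring, chunk)
    if model_score(ring) > 0.5:                  # the keyword, if present, sits fully inside the window
        print("keyword candidate at step", step)
```

Note that the placeholder keeps firing for every window that still contains the burst, which is exactly why the smoothing and refractory logic below matters.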
- Smoothing: a single high-confidence frame could be noise. Require N consecutive frames above the threshold before calling `trigger_wake()`.
- Refractory period: after a detection, suppress triggers for 2-3 seconds so the same keyword does not fire multiple times, then re-arm with `allow_trigger()`.
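Both rules fit in a small state machine. The sketch below is illustrative; the frame counts and threshold are assumptions to tune per deployment, and the `trigger_wake()`/`allow_trigger()` hooks from the text are where the caller plugs in.

```python
class Debouncer:
    def __init__(self, threshold=0.5, min_consecutive=3, refractory_frames=30):
        self.threshold = threshold
        self.min_consecutive = min_consecutive       # N frames above threshold to trigger
        self.refractory_frames = refractory_frames   # ~2.4 s at 80 ms per frame
        self.streak = 0
        self.cooldown = 0

    def update(self, score: float) -> bool:
        """Feed one frame score; returns True exactly when a wake event should fire."""
        if self.cooldown > 0:                        # refractory period: ignore everything
            self.cooldown -= 1
            return False
        self.streak = self.streak + 1 if score > self.threshold else 0
        if self.streak >= self.min_consecutive:      # smoothing: require N consecutive hits
            self.streak = 0
            self.cooldown = self.refractory_frames
            return True                              # caller runs trigger_wake() here
        return False


deb = Debouncer()
scores = [0.1, 0.9, 0.2, 0.8, 0.85, 0.9, 0.95, 0.9]  # one noisy spike, then a real keyword
print([deb.update(s) for s in scores])
# [False, False, False, False, False, True, False, False]
```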
KWS Systems and Frameworks
From open-source community projects to commercial solutions. Choose based on your customization needs, deployment target, and budget.
| System | Type | Keywords | Speed | Size | Notes |
|---|---|---|---|---|---|
| OpenWakeWord | Open Source | Custom trainable | ~5ms per inference | ~1.5MB per model | Python/ONNX, easy custom keyword training, community models available |
| Porcupine | Commercial | Custom trainable | ~2ms per inference | ~2MB per model | Picovoice product, free tier, many languages, on-device |
| Snowboy | Open Source | Custom trainable | ~5ms per inference | ~1MB per model | Deprecated but still used, Raspberry Pi compatible |
| Mycroft Precise | Open Source | Custom trainable | ~10ms per inference | ~500KB per model | TensorFlow Lite, Mycroft assistant, Python |
| TFLite Micro | Framework | Train your own | ~5-20ms | ~20-100KB | Google's microcontroller ML, runs on Cortex-M4+ |
| Google Speech Commands | Pre-trained | 35 fixed commands | ~10ms | ~500KB | Yes/No/Up/Down/etc, benchmark standard |
- Getting started: begin with OpenWakeWord. It's free, easy to train on custom keywords, and has pre-built models for common wake words.
- Production: Porcupine offers the best balance of accuracy, latency, and cross-platform support. A free tier is available.
- Microcontrollers: train your own model with TFLite Micro. Full control over architecture, training data, and deployment.
- Benchmarking: the Google Speech Commands dataset with 35 keywords is the standard benchmark for comparing KWS architectures.
Code Examples
Get started with keyword spotting in Python. From pre-trained models to training your own.
from openwakeword import Model
import pyaudio
import numpy as np

# Load OpenWakeWord model
model = Model(
    wakeword_models=["hey_jarvis"],   # use a built-in or custom model
    inference_framework="onnx"
)

# Audio stream settings
CHUNK = 1280                # ~80 ms at 16 kHz (the frame size the model expects)
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

p = pyaudio.PyAudio()
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)

print("Listening for wake word...")
try:
    while True:
        # Read audio chunk
        audio_bytes = stream.read(CHUNK)
        audio_array = np.frombuffer(audio_bytes, dtype=np.int16)

        # Run wake word detection; predict() returns {model_name: latest_score}
        prediction = model.predict(audio_array)

        # Check if any wake word was detected
        for wake_word, score in prediction.items():
            if score > 0.5:
                print(f"Wake word detected: {wake_word} ({score:.2%})")
                # Trigger your ASR pipeline here
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()

# Training a custom wake word:
# 1. Collect 3-5 positive samples of your wake word
# 2. Use OpenWakeWord's training script
# 3. Fine-tune on your recordings
# 4. Export to ONNX for deployment

Quick Reference
- OpenWakeWord for custom keywords
- Google Speech Commands dataset
- 13 MFCCs, DS-CNN architecture
- Porcupine for cross-platform
- TFLite Micro for MCUs
- Streaming with ring buffer
- Power: <5mW target
- Latency: <200ms
- Model: <100KB
- Accuracy: >95% recall
Use Cases
- ✓ Voice wake word
- ✓ On-device commands
- ✓ Industrial alarms
- ✓ Assistive devices
Architectural Patterns
- Tiny CNN on MFCCs: lightweight conv models on spectrograms.
- Streaming Transformers: low-latency attention for continuous audio.
Implementations
API Services
- Picovoice Porcupine (Picovoice): commercial-grade embedded KWS.
Quick Facts
- Input: Audio
- Output: Structured Data
- Implementations: 2 open source, 1 API
- Patterns: 2 approaches