Keyword Spotting
Detect wake words and short commands with low latency and tiny footprints.
How Keyword Spotting Works
A technical deep-dive into wake word detection. From the power constraints that shaped the field to the streaming architectures that make "Hey Siri" feel instantaneous.
The Problem: Always Listening, Never Draining
You want your device to respond the instant you say its name. But running full speech recognition 24/7 would drain the battery in hours. The solution is two-stage detection: a tiny, always-on "spotter" waits for just your wake word, then hands off to the heavy ASR engine.
Interactive Demo: Keyword Detection in Action
Watch how the model's confidence spikes only during the keyword region. The smoothing prevents false triggers from momentary high scores.
The Two-Stage Architecture
The KWS model runs continuously on a low-power DSP or neural accelerator. The main CPU and ASR engine only wake up when the keyword is detected.
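To make the handoff concrete, here is a minimal sketch of the two-stage loop in Python. The `kws_score` and `run_full_asr` functions and the chunk source are placeholders for illustration, not a real DSP or ASR API; on real hardware, stage 1 runs on the low-power core and stage 2 only spins up after a detection.

```python
import random

WAKE_THRESHOLD = 0.5  # typical default; tune per deployment


def kws_score(chunk: bytes) -> float:
    """Stage 1: tiny always-on spotter. Stubbed with a random score for illustration."""
    return random.random()


def run_full_asr() -> str:
    """Stage 2: heavyweight ASR engine, only started after the spotter fires. Stubbed here."""
    return "<transcript>"


def main_loop(chunks):
    for chunk in chunks:                        # always-on, low-power path
        if kws_score(chunk) > WAKE_THRESHOLD:   # wake word suspected
            print("Wake word detected, waking the main CPU / ASR engine")
            print(run_full_asr())               # stage 2 runs only now
            break                               # hand control to the application


main_loop(chunks=[b"\x00" * 2560 for _ in range(100)])  # fake 80 ms int16 chunks at 16 kHz
```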
The Constraints That Shape Everything
Keyword spotting operates under severe constraints that full ASR systems never face. Every design decision balances power, latency, accuracy, and model size.
- Power: must run on <5mW to enable months of battery life
- Latency: detection must feel instant (<200ms from utterance end)
- Accuracy: high recall (don't miss wake words) with low false accepts
- Size: must fit in <100KB for embedded deployment
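A rough back-of-the-envelope calculation shows why the power ceiling matters. The battery capacity and power figures below are illustrative assumptions, not numbers from this page.

```python
# Illustrative battery math (assumed numbers): a 1000 mAh cell at 3.7 V stores about 3.7 Wh.
battery_wh = 1.0 * 3.7              # 1000 mAh * 3.7 V = 3.7 Wh

for power_mw in (5, 50, 500):       # always-on spotter vs. increasingly hungry pipelines
    hours = battery_wh / (power_mw / 1000)
    print(f"{power_mw:>4} mW -> {hours:7.0f} h (~{hours / 24:.0f} days)")
# At 5 mW this cell lasts roughly a month; at 500 mW it is gone within a day.
```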
The Accuracy Trade-off: Recall vs False Accepts
- High recall: users get frustrated if they have to repeat the wake word. Target: >95% detection rate across noisy environments, accents, and varying speaking styles.
- Low false accepts: nothing is worse than your device randomly activating. Target: <1 false activation per day during normal conversation and media playback.
The sensitivity parameter lets users trade off between these. Higher sensitivity catches more true activations but also more false ones. Most systems default to ~0.5.
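One common convention, assumed here rather than taken from any particular SDK, maps sensitivity to a score threshold so that higher sensitivity lowers the bar to trigger:

```python
# Assumed convention (not tied to a specific SDK): higher sensitivity -> lower threshold.
def threshold_from_sensitivity(sensitivity: float) -> float:
    return 1.0 - sensitivity


frame_scores = [0.30, 0.48, 0.55, 0.72]   # made-up per-frame confidences

for sensitivity in (0.3, 0.5, 0.7):
    thr = threshold_from_sensitivity(sensitivity)
    hits = [s for s in frame_scores if s >= thr]
    print(f"sensitivity={sensitivity:.1f} -> threshold={thr:.2f} -> {len(hits)} trigger(s)")
```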
Feature Extraction: MFCC and Beyond
Raw 16kHz audio arrives at 16,000 samples per second. We need a compact representation that captures what matters for keyword recognition while being cheap to compute.
The MFCC Pipeline
MFCCs have been the workhorse of speech processing for decades. They compress audio into ~13 numbers per frame while preserving the information that distinguishes phonemes.
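As a concrete reference, here is a minimal MFCC extraction sketch using librosa (one common choice of library) with typical KWS settings: 25ms windows, 10ms hop, 40 mel bands, 13 coefficients. The synthetic sine wave simply stands in for a real recording.

```python
import librosa
import numpy as np

# One second of 16 kHz audio; a synthetic tone stands in for a real recording
sr = 16000
y = 0.1 * np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)

mfcc = librosa.feature.mfcc(
    y=y,
    sr=sr,
    n_mfcc=13,          # 13 coefficients per frame, the classic KWS choice
    n_fft=400,          # 25 ms analysis window at 16 kHz
    hop_length=160,     # 10 ms hop -> ~100 frames per second
    n_mels=40,          # 40 mel filterbanks before the DCT
)
print(mfcc.shape)       # (13, ~101): 13 coefficients x frames
```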
Feature Extraction Methods
- MFCC: the classic choice for keyword spotting. Compact representation (13-40 coefficients) that captures vocal tract shape while being robust to volume changes.
- Log-mel filterbanks: the log energies from the mel filterbanks, used directly. More information than MFCCs but larger. Used by modern neural approaches.
- Learned features: let the neural network learn features from raw audio. Requires more data and compute but can discover optimal representations.
For most embedded KWS applications, MFCCs with 13 coefficients remain the best choice. They are compact, cheap to compute, and well-supported by every framework. Use 40 log-mel filterbanks only if you have compute budget for larger CNN/Transformer models.
Small Footprint Model Architectures
The key insight: we are not trying to transcribe arbitrary speech. We only need to recognize 1-10 specific phrases. This dramatically simplifies the model architecture.
Depthwise Separable CNN: The Workhorse
Standard convolution computes all filter-channel combinations at once. Depthwise separable convolution factorizes this into two steps, dramatically reducing parameters and compute.
Standard convolution: K filters, each of size H x W x C_in. For 64 3x3 filters on 64 channels: 64 * 3 * 3 * 64 = 36,864 params.
Depthwise separable convolution: a depthwise step (one H x W filter per channel) followed by a pointwise (1x1) step. For the same 64 filters on 64 channels: (3 * 3 * 64) + (64 * 64) = 4,672 params.
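The same arithmetic can be checked directly in PyTorch. The block below is a minimal sketch with biases omitted so the counts match the numbers above.

```python
import torch
import torch.nn as nn

c = 64  # channels in and out

standard = nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c, bias=False),  # depthwise: 3*3*64
    nn.Conv2d(c, c, kernel_size=1, bias=False),                       # pointwise: 64*64
)


def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


print(n_params(standard))              # 36864
print(n_params(depthwise_separable))   # 4672

x = torch.randn(1, c, 49, 13)          # e.g. 49 frames x 13 MFCCs
print(depthwise_separable(x).shape)    # torch.Size([1, 64, 49, 13])
```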
Architecture Comparison
| Architecture | Model Size | MACs | Latency | Accuracy | Notes |
|---|---|---|---|---|---|
| DS-CNN (Depthwise Separable CNN) | ~20-100KB | ~5-20M | ~5ms | ~95% | Splits convolution into depthwise (spatial) and pointwise (channel) operations. Dramatic parameter reduction with minimal accuracy loss. |
| DSCNN-L (Large DS-CNN) | ~500KB | ~50M | ~15ms | ~97% | Scaled-up depthwise separable CNN with more layers and channels. Better accuracy at the cost of size. |
| TC-ResNet (Temporal Convolution ResNet) | ~300KB | ~30M | ~10ms | ~96% | 1D convolutions along the time axis with residual connections. Excellent for capturing temporal patterns in speech. |
| Attention RNN (LSTM with Attention) | ~200KB | ~40M | ~20ms | ~95% | Recurrent architecture with an attention mechanism. Good for variable-length keywords but harder to optimize. |
| MatchboxNet (NVIDIA MatchboxNet) | ~75KB | ~10M | ~8ms | ~97% | QuartzNet-style architecture scaled for embedded use. Jasper/QuartzNet blocks with 1D convolutions. |
| Conformer-S (Small Streaming Conformer) | ~1MB | ~100M | ~30ms | ~98% | Hybrid attention-convolution architecture adapted for streaming. State-of-the-art accuracy but higher cost. |
Extreme constraint: <100KB, <10ms
- DS-CNN (small)
- TFLite Micro
- 13 MFCCs

Balanced: <500KB, <20ms
- DS-CNN (large) or TC-ResNet
- ONNX Runtime
- 40 log-mel

Maximum accuracy: size flexible
- Conformer or attention models
- PyTorch/TensorFlow
- 80 log-mel or raw waveform
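As an illustration of the first recipe (a small DS-CNN on 13 MFCCs), here is a rough PyTorch sketch. The layer sizes are illustrative and untuned, not a reference implementation of any published model.

```python
import torch
import torch.nn as nn


def ds_block(channels: int) -> nn.Sequential:
    """One depthwise separable block: depthwise 3x3, then pointwise 1x1."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
        nn.Conv2d(channels, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )


class SmallDSCNN(nn.Module):
    def __init__(self, n_classes: int = 12, channels: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=(10, 4), stride=(2, 2), padding=(5, 1), bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
        )
        self.blocks = nn.Sequential(*[ds_block(channels) for _ in range(4)])
        self.head = nn.Linear(channels, n_classes)

    def forward(self, x):                      # x: (batch, 1, frames, mfcc)
        x = self.blocks(self.stem(x))
        x = x.mean(dim=(2, 3))                 # global average pool over time and frequency
        return self.head(x)


model = SmallDSCNN()
print(sum(p.numel() for p in model.parameters()))   # ~23k parameters
logits = model(torch.randn(1, 1, 49, 13))           # 49 frames x 13 MFCCs
print(logits.shape)                                  # torch.Size([1, 12])
```

With roughly 23k parameters this lands near 90KB in float32 and around 23KB after int8 quantization, which fits the extreme-constraint budget once quantized.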
Streaming Inference: Real-time Detection
Keywords do not arrive in neat 1-second chunks. They can start at any moment and span chunk boundaries. Streaming inference processes audio continuously with a sliding window, maintaining state between chunks.
The Ring Buffer: Why It Matters
Imagine the user says "Hey Jarvis" right at the boundary between two audio chunks. If we only process each chunk independently, we would miss the keyword because half of it is in each chunk.
1. Maintain a ring buffer of ~1-2 seconds of audio.
2. On each new chunk, slide the window forward.
3. Run inference on the entire window.
4. The keyword is always fully contained in some window.
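Here is a minimal NumPy sketch of this loop, using a simple slide-and-append buffer in place of true circular indexing. `model_score` is a placeholder for whatever KWS model runs on the full window.

```python
import numpy as np

SR = 16000
WINDOW = SR * 2          # keep the last 2 seconds of audio
CHUNK = 1280             # ~80 ms of new audio per step

ring = np.zeros(WINDOW, dtype=np.int16)


def model_score(window: np.ndarray) -> float:
    """Hypothetical stand-in for the KWS model's confidence on the window."""
    return float(np.abs(window).mean() > 100)  # placeholder energy check, not a real model


def push_chunk(ring: np.ndarray, chunk: np.ndarray) -> np.ndarray:
    # Slide the window: drop the oldest samples, append the newest chunk
    return np.concatenate([ring[len(chunk):], chunk])


for step in range(100):                          # in practice: read from the microphone
    loud = 4000 if step == 50 else 0             # fake a burst of energy at step 50
    chunk = np.full(CHUNK, loud, dtype=np.int16)
    ring = push_chunk(ring, chunk)
    if model_score(ring) > 0.5:                  # the keyword, if present, sits fully inside the window
        print("keyword candidate at step", step)
```

Note that the placeholder keeps firing for every window that still contains the burst, which is exactly why the smoothing and refractory logic below matters.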
- Smoothing: a single high-confidence frame could be noise. Require N consecutive frames above the threshold before calling `trigger_wake()`.
- Refractory period: after a detection, suppress triggers for 2-3 seconds so the same keyword does not fire multiple times, then re-arm with `allow_trigger()`.
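Both rules fit in a small state machine. The sketch below is illustrative; the frame counts and threshold are assumptions to tune per deployment, and the `trigger_wake()`/`allow_trigger()` hooks from the text are where the caller plugs in.

```python
class Debouncer:
    def __init__(self, threshold=0.5, min_consecutive=3, refractory_frames=30):
        self.threshold = threshold
        self.min_consecutive = min_consecutive       # N frames above threshold to trigger
        self.refractory_frames = refractory_frames   # ~2.4 s at 80 ms per frame
        self.streak = 0
        self.cooldown = 0

    def update(self, score: float) -> bool:
        """Feed one frame score; returns True exactly when a wake event should fire."""
        if self.cooldown > 0:                        # refractory period: ignore everything
            self.cooldown -= 1
            return False
        self.streak = self.streak + 1 if score > self.threshold else 0
        if self.streak >= self.min_consecutive:      # smoothing: require N consecutive hits
            self.streak = 0
            self.cooldown = self.refractory_frames
            return True                              # caller runs trigger_wake() here
        return False


deb = Debouncer()
scores = [0.1, 0.9, 0.2, 0.8, 0.85, 0.9, 0.95, 0.9]  # one noisy spike, then a real keyword
print([deb.update(s) for s in scores])
# [False, False, False, False, False, True, False, False]
```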
KWS Systems and Frameworks
From open-source community projects to commercial solutions. Choose based on your customization needs, deployment target, and budget.
| System | Type | Keywords | Speed | Size | Notes |
|---|---|---|---|---|---|
| OpenWakeWord | Open Source | Custom trainable | ~5ms per inference | ~1.5MB per model | Python/ONNX, easy custom keyword training, community models available |
| Porcupine | Commercial | Custom trainable | ~2ms per inference | ~2MB per model | Picovoice product, free tier, many languages, on-device |
| Snowboy | Open Source | Custom trainable | ~5ms per inference | ~1MB per model | Deprecated but still used, Raspberry Pi compatible |
| Mycroft Precise | Open Source | Custom trainable | ~10ms per inference | ~500KB per model | TensorFlow Lite, Mycroft assistant, Python |
| TFLite Micro | Framework | Train your own | ~5-20ms | ~20-100KB | Google's microcontroller ML, runs on Cortex-M4+ |
| Google Speech Commands | Pre-trained | 35 fixed commands | ~10ms | ~500KB | Yes/No/Up/Down/etc, benchmark standard |
- Getting started: begin with OpenWakeWord. It's free, easy to train on custom keywords, and has pre-built models for common wake words.
- Production: Porcupine offers the best balance of accuracy, latency, and cross-platform support. A free tier is available.
- Microcontrollers: train your own model with TFLite Micro. Full control over architecture, training data, and deployment.
- Benchmarking: the Google Speech Commands dataset with 35 keywords is the standard benchmark for comparing KWS architectures.
Code Examples
Get started with keyword spotting in Python. From pre-trained models to training your own.
from openwakeword import Model
import pyaudio
import numpy as np

# Load OpenWakeWord model
model = Model(
    wakeword_models=["hey_jarvis"],   # use a built-in or custom model
    inference_framework="onnx"
)

# Audio stream settings
CHUNK = 1280                # ~80 ms at 16 kHz (the frame size the model expects)
FORMAT = pyaudio.paInt16
CHANNELS = 1
RATE = 16000

p = pyaudio.PyAudio()
stream = p.open(
    format=FORMAT,
    channels=CHANNELS,
    rate=RATE,
    input=True,
    frames_per_buffer=CHUNK
)

print("Listening for wake word...")
try:
    while True:
        # Read audio chunk
        audio_bytes = stream.read(CHUNK)
        audio_array = np.frombuffer(audio_bytes, dtype=np.int16)

        # Run wake word detection; predict() returns {model_name: latest_score}
        prediction = model.predict(audio_array)

        # Check if any wake word was detected
        for wake_word, score in prediction.items():
            if score > 0.5:
                print(f"Wake word detected: {wake_word} ({score:.2%})")
                # Trigger your ASR pipeline here
except KeyboardInterrupt:
    pass
finally:
    stream.stop_stream()
    stream.close()
    p.terminate()

# Training a custom wake word:
# 1. Collect 3-5 positive samples of your wake word
# 2. Use OpenWakeWord's training script
# 3. Fine-tune on your recordings
# 4. Export to ONNX for deployment

Quick Reference
- OpenWakeWord for custom keywords
- Google Speech Commands dataset
- 13 MFCCs, DS-CNN architecture
- Porcupine for cross-platform
- TFLite Micro for MCUs
- Streaming with ring buffer
- Power: <5mW target
- Latency: <200ms
- Model: <100KB
- Accuracy: >95% recall
Use Cases
- ✓ Voice wake word
- ✓ On-device commands
- ✓ Industrial alarms
- ✓ Assistive devices
Architectural Patterns
- Tiny CNN on MFCCs: lightweight conv models on spectrograms.
- Streaming Transformers: low-latency attention for continuous audio.
Implementations
API Services
- Picovoice Porcupine (Picovoice): commercial-grade embedded KWS.
Quick Facts
- Input: Audio
- Output: Structured Data
- Implementations: 2 open source, 1 API
- Patterns: 2 approaches