Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?" — and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
Voice activity detection in movies with dense speech labels
Leading models on AVA-Speech.
No results yet. Be the first to contribute.
Didn't find the model, metric, or dataset you needed? Tell us in one line. We read every message and reply within 48 hours.
2 datasets tracked for this task.
Still looking for something on Voice Activity Detection? A missing model, a stale score, a benchmark we should cover — drop it here and we'll handle it.
Real humans read every message. We track what people are asking for and prioritize accordingly.