Voice Activity Detection
Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?", and getting it wrong degrades everything downstream in a speech pipeline. Silero VAD became the open-source standard by shipping a model under 2 MB that runs in real time on CPU with reported accuracy above 95%, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100 ms latency VAD a critical infrastructure component.
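To make the core decision concrete, here is a minimal frame-energy VAD sketch. This is a deliberately naive baseline, not Silero's or pyannote's method: it labels each fixed-length frame as speech or silence by comparing its RMS level against a dB threshold, which is exactly the kind of approach that background music and non-speech vocalizations defeat. The function name and threshold are illustrative choices, not from any library.

```python
import numpy as np

def energy_vad(signal, sample_rate=16000, frame_ms=30, threshold_db=-35.0):
    """Label each frame as speech (True) or non-speech (False) by
    short-time RMS energy. A toy baseline: learned models such as
    Silero VAD are far more robust to music, noise, and whispers."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    flags = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        # RMS level in dB relative to full scale, floored to avoid log(0)
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        db = 20 * np.log10(max(rms, 1e-10))
        flags.append(db > threshold_db)
    return flags

# Synthetic check: half a second of silence followed by a 220 Hz tone
sr = 16000
t = np.arange(sr // 2) / sr
silence = np.zeros(sr // 2)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
flags = energy_vad(np.concatenate([silence, tone]), sr)
```

On this synthetic signal the silent frames fall below the threshold and the tone frames exceed it; real evaluation uses labeled corpora such as AVA-Speech, where energy thresholds alone score poorly.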
AVA-Speech
Voice activity detection in movies with dense speech labels
Top 10
Leading models on AVA-Speech.
No results yet. Be the first to contribute.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Audio.
Looking to run a model? Hugging Face hosts inference for this task type.