
Voice Activity Detection

Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?" — and getting it wrong ruins everything downstream in speech pipelines. Silero VAD became the open-source standard by shipping a model under 2MB that runs in real-time on CPU with >95% accuracy, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made sub-100ms latency VAD a critical infrastructure component.
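To make the framing concrete, here is a minimal sketch of the simplest possible VAD: frame-level energy thresholding. This is purely illustrative — production models such as Silero VAD or pyannote.audio's segmentation model are learned and far more robust to the music, crowd noise, and whispered speech mentioned above. The function name, frame size, and threshold are arbitrary choices for the example, not values from any real system.

```python
import numpy as np

def energy_vad(samples, sample_rate=16000, frame_ms=30, threshold_db=-40.0):
    """Toy energy-based VAD: flag each frame as speech (True) or not.

    Splits audio into fixed-size frames, computes per-frame RMS energy
    in dB, and compares against a fixed threshold. Exactly the kind of
    "simpler model" that coughs, laughs, and background music fool.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    flags = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame ** 2)) + 1e-12  # avoid log(0)
        flags.append(20 * np.log10(rms) > threshold_db)
    return flags

# Half a second of silence followed by half a second of a 440 Hz tone:
sr = 16000
t = np.arange(sr // 2) / sr
signal = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = energy_vad(signal, sr)
```

Because each 30 ms frame is scored independently, this also hints at why low-latency streaming VAD is feasible: the decision for a frame needs only that frame's samples, so per-frame latency is bounded by the frame length.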

2 datasets · 0 results · Canonical metric: accuracy
Canonical Benchmark

AVA-Speech

Voice activity detection in movies with dense speech labels

Primary metric: accuracy

Top 10

Leading models on AVA-Speech.

No results yet.

All datasets

2 datasets tracked for this task.


Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
