Codesota · Tasks · Voice Activity Detection

Audio · voice-activity-detection

Voice Activity Detection.

Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?", and getting it wrong degrades every downstream stage of a speech pipeline. Silero VAD became the open-source standard by shipping a model under 2 MB that runs in real time on CPU with reported accuracy above 95%, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle extreme conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made VAD with sub-100 ms latency a critical infrastructure component.
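For contrast with the neural models named above, here is a minimal sketch of the kind of simple energy-threshold detector that noise, music, and coughs tend to fool. It is not Silero VAD or pyannote.audio; the function name `energy_vad`, the 30 ms frame length, the -35 dB threshold, and the three-frame hangover are all illustrative assumptions, and samples are assumed to be floats in [-1, 1].

```python
import math


def energy_vad(samples, sample_rate=16000, frame_ms=30,
               threshold_db=-35.0, hangover_frames=3):
    """Return one boolean per frame: True where the frame is judged speech.

    Hypothetical baseline: a frame counts as speech when its RMS level
    exceeds threshold_db. A short hangover keeps the decision True for a
    few frames after the level drops, so word endings are not clipped.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    decisions = []
    hang = 0  # remaining hangover frames
    # Only complete frames are scored; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        level_db = 20.0 * math.log10(max(rms, 1e-10))  # avoid log(0)
        if level_db > threshold_db:
            decisions.append(True)
            hang = hangover_frames
        elif hang > 0:
            decisions.append(True)
            hang -= 1
        else:
            decisions.append(False)
    return decisions
```

Because the decision depends only on loudness, any sufficiently loud sound (music, a door slam, a cough) is labelled speech, and quiet or whispered speech is missed. Each decision also waits for a full frame plus any buffering, so the frame length directly affects the latency budget.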

Datasets: 2 · Results: none yet · Canonical metric: accuracy

§ 02 · Canonical benchmark

The reference dataset.

AVA-Speech

Voice activity detection in movies with dense speech labels

Primary metric: accuracy
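The page does not state how accuracy is computed for this benchmark. One common reading, assumed here, is frame-level accuracy: the fraction of frames whose speech/non-speech decision matches the reference label. The helper name `frame_accuracy` is hypothetical, and the official AVA-Speech evaluation protocol may differ.

```python
def frame_accuracy(reference, predicted):
    """Fraction of frames where the predicted label equals the reference.

    Both inputs are equal-length sequences of truthy/falsy values,
    one per frame (True = speech, False = non-speech).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if not reference:
        raise ValueError("at least one frame is required")
    correct = sum(1 for r, p in zip(reference, predicted) if bool(r) == bool(p))
    return correct / len(reference)


# Usage: three of four frames agree with the reference.
score = frame_accuracy([True, True, False, False], [True, False, False, False])
```

Accuracy can look high on recordings that are mostly silence or mostly speech, so it is worth checking the class balance of a dataset before comparing scores.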


§ 03 · Top 10

Leading models.

Leading models on AVA-Speech.

No results yet. Be the first to contribute.


§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

§ 05 · Related tasks

Other tasks in Audio.

Audio Captioning · Audio-to-Audio · Music Generation · Sound Event Detection · Text-to-Audio
