
Voice Activity Detection.

Voice activity detection (VAD) answers the deceptively simple question "is someone speaking right now?", and getting it wrong degrades everything downstream in a speech pipeline. Silero VAD became a de facto open-source standard by shipping a model under 2 MB that runs in real time on CPU with reported accuracy above 95%, while pyannote.audio's segmentation model pushed the state of the art for overlapping speech detection. Production VAD must handle difficult conditions: background music, crowd noise, whispered speech, and non-speech vocalizations (coughs, laughs) that fool simpler models. Modern systems increasingly combine VAD with speaker diarization ("who spoke when") in unified models, and the rise of real-time conversational AI has made VAD with latency under 100 ms a critical infrastructure component.
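To make the "simpler models" mentioned above concrete, here is a minimal sketch of a classic energy-threshold VAD. It is not how Silero VAD or pyannote.audio work (both are neural models); it is only a baseline for illustration. The function names, the 30 ms frame size, the -35 dB threshold, and the three-frame hangover are all assumptions chosen for the example, and the input is assumed to be mono samples scaled to [-1, 1].

```python
import math


def frame_energy_db(frame):
    """Return the RMS level of one frame in dB, floored at -100 dB."""
    if not frame:
        return -100.0
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(rms) if rms > 1e-5 else -100.0


def energy_vad(samples, sample_rate=16000, frame_ms=30,
               threshold_db=-35.0, hangover_frames=3):
    """Label each non-overlapping frame as speech (True) or non-speech (False).

    A frame counts as speech when its RMS level exceeds threshold_db.
    After a speech frame, the next hangover_frames quiet frames are still
    labelled speech so that short pauses do not chop a word in half.
    All default values are illustrative assumptions, not tuned settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    decisions = []
    hang = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if frame_energy_db(frame) > threshold_db:
            decisions.append(True)
            hang = hangover_frames
        elif hang > 0:
            decisions.append(True)
            hang -= 1
        else:
            decisions.append(False)
    return decisions
```

A detector like this reacts to any loud sound, which is why music, crowd noise, and coughs fool it, and why whispered speech falls below its threshold.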

Datasets: 2 · Results: 0 · Canonical metric: accuracy
§ 02 · Canonical benchmark

The reference dataset.

AVA-Speech

Voice activity detection in movies with dense speech labels

Primary metric: accuracy
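The page does not say how accuracy is computed, so the following is only a sketch of one common reading: frame-level accuracy over binary speech / non-speech labels. The function name and the binary-label assumption are illustrative and may not match the leaderboard's actual scoring.

```python
def frame_accuracy(reference, predicted):
    """Fraction of frames where the predicted label equals the reference.

    Both inputs are equal-length sequences of per-frame labels
    (for example 1 for speech, 0 for non-speech).
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    if len(reference) == 0:
        raise ValueError("cannot score an empty sequence")
    correct = sum(1 for r, p in zip(reference, predicted) if r == p)
    return correct / len(reference)
```

For example, `frame_accuracy([1, 1, 0, 0], [1, 0, 0, 0])` scores three of four frames as correct. Plain accuracy can look high on recordings that are mostly silence, so it is worth reading alongside miss and false-alarm counts.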
§ 03 · Top 10

Leading models.

Leading models on AVA-Speech.

No results yet. Be the first to contribute.

§ 04 · All datasets

Tracked datasets.

2 datasets tracked for this task.

AVA-Speech · CANONICAL · 0 results · accuracy
DIHARD · 0 results · der
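The `der` label is not expanded on the page; assuming it stands for diarization error rate, the sketch below shows the speech-only part of that metric, which is what applies when no speaker labels are involved: missed speech plus false alarms, divided by the total reference speech. Full DER also adds speaker confusion and is usually computed on time segments, often with a scoring collar; this frame-based version and its function name are simplifying assumptions.

```python
def detection_error_rate(reference, predicted):
    """Return (missed + false alarm) / total reference speech, in frames.

    Inputs are equal-length sequences of per-frame labels, truthy for
    speech and falsy for non-speech. The result can exceed 1.0 when
    false alarms outnumber the reference speech frames.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    total_speech = sum(1 for r in reference if r)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")
    missed = sum(1 for r, p in zip(reference, predicted) if r and not p)
    false_alarm = sum(1 for r, p in zip(reference, predicted) if p and not r)
    return (missed + false_alarm) / total_speech
```

Unlike accuracy, this rate ignores correctly rejected silence, so a system cannot score well simply by predicting non-speech everywhere.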
§ 05 · Related tasks

Other tasks in Audio.

Audio Captioning · Audio-to-Audio · Music Generation · Sound Event Detection · Text-to-Audio