Speaker Diarization
Determine "who spoke when" in audio. Vital for meetings, call centers, and transcription QA.
How Speaker Diarization Works
A technical deep-dive into speaker diarization: the art of answering "who spoke when?" From voice fingerprints and clustering to handling overlapping speech.
The Problem: Who Spoke When?
Speech recognition tells you what was said. Speaker diarization tells you who said it. Picture a meeting recording: you need to split the audio stream into segments and label each with a speaker identity.
A timeline of a two-person conversation makes the problem concrete: speakers sometimes overlap, and overlap is one of the hardest challenges in diarization.
Why Is This Hard?
- Unknown speakers: we do not know how many speakers exist or what they sound like. The system must discover speaker identities from the audio alone.
- Overlapping speech: in natural conversation, people interrupt and talk over each other. Roughly 10-20% of meeting audio contains overlap.
- Within-speaker variability: a speaker's voice changes with emotional state, microphone distance, and background noise. The system must recognize them regardless.
- Between-speaker similarity: two people can sound alike, especially with similar age, gender, and accent. The system must distinguish subtle differences.
Where Is This Used?
Diarization shows up wherever "who said what" matters: meeting attribution, call center QA, podcast chaptering, and court recordings.
The Diarization Pipeline
Most diarization systems follow a four-stage pipeline. Think of it as: find speech, fingerprint it, group similar fingerprints, assign labels.
1. Voice activity detection (VAD). First, find where speech occurs. VAD strips silence and noise, leaving only speech segments. This reduces processing and prevents clustering silence as a "speaker."
   Output: [(0.0s-2.5s), (2.8s-5.2s), ...] speech regions
2. Embedding extraction. For each speech segment, extract a "voice fingerprint": a fixed-length vector (192-512 dims) that captures who the speaker is, not what they said.
   Output: [0.23, -0.15, 0.82, ...] 256-dim vector
3. Clustering. Group embeddings by similarity. Segments from the same speaker should cluster together. The algorithm discovers how many speakers exist and which segments belong to each.
   Output: Cluster labels [0, 1, 0, 2, 1, ...]
4. Labeling and post-processing. Map cluster IDs back to time segments. Optionally merge short segments, smooth boundaries, and handle edge cases. The output is the final diarization.
   Output: [(0.0s, 2.5s, "SPEAKER_00"), ...]
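A minimal sketch of these four stages as plain Python. The helpers run_vad, embed_segment, and cluster_embeddings are hypothetical placeholders for a VAD model, an embedding model, and a clustering routine, not real APIs.

```python
# Hypothetical glue code for the four-stage pipeline.
def diarize(audio_path):
    # 1. VAD: list of (start, end) speech regions in seconds
    speech_regions = run_vad(audio_path)

    # 2. Embeddings: one fixed-length vector per speech region
    embeddings = [embed_segment(audio_path, start, end)
                  for start, end in speech_regions]

    # 3. Clustering: one cluster ID per region; K is discovered automatically
    cluster_ids = cluster_embeddings(embeddings)

    # 4. Labeling: map cluster IDs back onto the timeline
    return [(start, end, f"SPEAKER_{cid:02d}")
            for (start, end), cid in zip(speech_regions, cluster_ids)]
```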
Speaker Embeddings: Voice Fingerprints
The core insight: a neural network can learn to compress seconds of audio into a compact vector where same speaker = similar vectors, different speakers = distant vectors.
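A sketch of this property using SpeechBrain's pretrained ECAPA-TDNN model. The speechbrain package, the spkrec-ecapa-voxceleb model ID, and the file names are assumptions, and the import path differs slightly between SpeechBrain versions.

```python
# Sketch: same speaker -> similar embeddings, different speakers -> distant ones.
# Assumes 16 kHz mono WAV files; file names are illustrative.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier  # speechbrain.inference in >= 1.0

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def embed(path: str) -> torch.Tensor:
    signal, sample_rate = torchaudio.load(path)    # (channels, samples)
    return encoder.encode_batch(signal).squeeze()  # 192-dim ECAPA embedding

alice_1 = embed("alice_clip1.wav")
alice_2 = embed("alice_clip2.wav")
bob_1 = embed("bob_clip1.wav")

cosine = torch.nn.CosineSimilarity(dim=0)
print(f"same speaker:      {cosine(alice_1, alice_2):.2f}")  # expect high
print(f"different speaker: {cosine(alice_1, bob_1):.2f}")    # expect noticeably lower
```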
What Do Embedding Dimensions Capture?
While individual dimensions are not directly interpretable, they collectively encode aspects of voice identity such as the speaker's fundamental frequency range.
Popular Embedding Models
- ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN. 192-dim embeddings. Current favorite for diarization.
- x-vector: time-delay neural network (TDNN) based. 512-dim embeddings. Well-established, good baseline.
- TitaNet: NVIDIA's speaker model with Squeeze-and-Excitation blocks. State-of-the-art accuracy.
How Are Embeddings Trained?
Large speaker recognition datasets with millions of utterances from thousands of speakers. VoxCeleb (7000+ speakers) is the standard.
- VoxCeleb1: 1251 celebrities, 150K utterances
- VoxCeleb2: 6112 speakers, 1M utterances
- CN-Celeb: 1000 Chinese celebrities
The model learns to classify speakers (softmax) or to minimize distance between same-speaker pairs while maximizing different-speaker pairs (contrastive/triplet loss).
A common choice is additive angular margin (AAM) softmax: angular distance with a margin penalty for better separation.
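To illustrate the margin idea, here is a compact PyTorch sketch of an additive angular margin (AAM-softmax / ArcFace-style) loss. The dimensions, margin, and scale are illustrative, not taken from any particular recipe.

```python
# Sketch: penalize the angle to the true speaker's class center by a margin m,
# then scale the logits and apply cross-entropy.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmax(nn.Module):
    def __init__(self, emb_dim=192, num_speakers=6000, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, embeddings, speaker_ids):
        # Cosine similarity between L2-normalized embeddings and class centers
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to the true-speaker logit
        is_target = F.one_hot(speaker_ids, cosine.size(1)).bool()
        logits = torch.where(is_target, torch.cos(theta + self.margin), cosine)
        return F.cross_entropy(self.scale * logits, speaker_ids)

loss_fn = AAMSoftmax()
batch_embeddings = torch.randn(8, 192)         # from the encoder
batch_speakers = torch.randint(0, 6000, (8,))  # training labels
loss = loss_fn(batch_embeddings, batch_speakers)
```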
Clustering: Grouping Voice Fingerprints
Given N embedding vectors, cluster them into K groups where K (the number of speakers) is unknown. This is unsupervised learning with a twist: we must also discover K.
- AHC (agglomerative hierarchical clustering): merges similar segments bottom-up
- Spectral clustering: graph-based clustering using affinity matrix eigenvectors
- PLDA (Probabilistic Linear Discriminant Analysis): probabilistic scoring of speaker similarity
- Online clustering: real-time clustering as audio streams in
Spectral Clustering: The Modern Standard
Build an affinity matrix of pairwise embedding similarities, take the eigenvectors of its graph Laplacian, and cluster in that low-dimensional space. The key advantage: spectral clustering can find non-convex clusters and is robust to noise. The number of speakers K can be estimated from the eigenvalue gap.
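A sketch of eigengap-based speaker counting followed by spectral clustering with scikit-learn. The embeddings here are random placeholders, so the estimated K is only meaningful with real speaker embeddings.

```python
# Sketch: build an affinity matrix, estimate K from the Laplacian's eigenvalue
# gap, then run spectral clustering on the precomputed affinity.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.randn(40, 192)  # placeholder: N segments x 192 dims
affinity = np.clip(cosine_similarity(embeddings), 0.0, 1.0)

# Unnormalized graph Laplacian L = D - A; small eigenvalues correspond to clusters
laplacian = np.diag(affinity.sum(axis=1)) - affinity
eigvals = np.sort(np.linalg.eigvalsh(laplacian))

# The largest gap among the smallest eigenvalues suggests the number of speakers
num_speakers = int(np.argmax(np.diff(eigvals[:10]))) + 1

labels = SpectralClustering(
    n_clusters=num_speakers, affinity="precomputed", random_state=0
).fit_predict(affinity)
print(num_speakers, labels)
```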
The Threshold Problem
Without knowing the number of speakers, we need a threshold to decide when embeddings are "similar enough" to belong to the same speaker. This threshold is domain-dependent: telephone calls, meetings, and broadcast audio typically call for different settings.
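A sketch of threshold-based agglomerative clustering with scikit-learn. The 0.7 cosine-distance cutoff is illustrative and would be tuned per domain; note that scikit-learn versions before 1.2 spell the metric parameter "affinity".

```python
# Sketch: let a distance threshold, not a fixed speaker count, decide K.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.randn(40, 192)  # placeholder speaker embeddings
clusterer = AgglomerativeClustering(
    n_clusters=None,         # unknown number of speakers
    distance_threshold=0.7,  # domain-dependent cosine-distance cutoff
    metric="cosine",         # "affinity" in scikit-learn < 1.2
    linkage="average",
)
cluster_ids = clusterer.fit_predict(embeddings)
print("estimated speakers:", cluster_ids.max() + 1)
```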
Overlap Handling: When People Talk Over Each Other
In natural conversation, 10-20% of speech involves multiple speakers talking simultaneously. Traditional pipelines fail here because they assign each frame to exactly one speaker.
Approaches to Overlap
- Ignore it: assign each segment to the dominant speaker only (the traditional pipeline behavior)
- Overlap-aware segmentation: per-frame, multi-label prediction of which speakers are active
- End-to-end diarization: a single neural network outputs all active speakers directly
End-to-End Neural Diarization (EEND)
The breakthrough: instead of a pipeline, train a single neural network that directly outputs which speakers are active at each frame. The output is a multi-hot vector per frame.
1. A self-attention encoder processes audio frames
2. Each frame outputs an activation for each speaker slot
3. Multiple speakers can be active simultaneously
4. Trained with a permutation-invariant training (PIT) loss (see the sketch below)
In practice:
- pyannote.audio 3.1 uses local EEND
- NeMo MSDD for multi-scale decoding
- SA-EEND with self-attention
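A sketch of the permutation-invariant training idea, assuming two speaker slots and framewise binary activity targets; real EEND recipes differ in model and loss details.

```python
# Sketch: score the per-frame multi-label predictions under every permutation
# of speaker slots and keep the best match (permutation-invariant training).
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce_loss(pred, target):
    """pred, target: (frames, speaker_slots), values in [0, 1]."""
    num_slots = pred.size(1)
    losses = [
        F.binary_cross_entropy(pred[:, list(perm)], target)
        for perm in permutations(range(num_slots))
    ]
    return torch.stack(losses).min()

frames, slots = 200, 2
pred = torch.sigmoid(torch.randn(frames, slots))       # model's speaker activities
target = torch.randint(0, 2, (frames, slots)).float()  # reference activities
print(pit_bce_loss(pred, target))
```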
Models and Tools
From end-to-end neural systems to traditional pipelines. Choose based on your accuracy needs, whether you need transcription, and deployment constraints.
| Model | Type | Architecture | DER | Notes |
|---|---|---|---|---|
| pyannote.audio 3.1 | End-to-end | Segmentation + Embedding + Clustering | ~11% (AMI) | Best open-source. Handles overlap. HuggingFace token required. |
| NeMo MSDD | Neural | Multi-scale Diarization Decoder | ~8% (AMI) | NVIDIA. State-of-the-art accuracy. GPU required. |
| WhisperX | Pipeline | Whisper + pyannote + alignment | ~12% (AMI) | Best for transcription + diarization. Word-level timestamps. |
| Simple Diarizer | Basic | Embedding + AHC clustering | ~15% (AMI) | Educational. Good for understanding. No overlap handling. |
| Kaldi PLDA | Traditional | x-vector + PLDA + AHC | ~10% (AMI) | Proven production system. Complex setup. |
Choosing the Right Tool
Choose pyannote.audio 3.1 when:
- You need the best open-source accuracy
- Overlap handling is important
- A Python/PyTorch environment is available
- You can obtain a HuggingFace token

Choose WhisperX when:
- You need transcription + diarization
- Word-level timestamps are required
- Whisper-quality transcription is needed
- You are okay with a slightly higher DER

Choose NeMo MSDD when:
- Maximum accuracy is critical
- An NVIDIA GPU is available
- You are deploying in an enterprise setting
- Your domain is telephony or meetings

Choose a managed API (such as AssemblyAI) when:
- You want a managed API
- You have no infrastructure to maintain
- You have budget for per-minute pricing
- You need additional features (summaries, etc.)
Evaluation Metrics
- Diarization Error Rate (DER): total time incorrectly attributed (missed speech, false alarms, and speaker confusion), as a fraction of total speech time. The standard metric.
- Jaccard Error Rate (JER): a per-speaker metric based on the Jaccard overlap between reference and hypothesis speaking time. More sensitive to short segments.
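A sketch of computing DER with pyannote.metrics, assuming that package is installed; the timestamps and labels below are illustrative.

```python
# Sketch: DER between a reference annotation and a diarization hypothesis.
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 2.5)] = "alice"
reference[Segment(2.8, 5.2)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 2.6)] = "SPEAKER_00"
hypothesis[Segment(2.6, 5.2)] = "SPEAKER_01"

metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.1%}")
```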
Code Examples
Get started with speaker diarization in Python. From high-level APIs to building your own pipeline.
```python
from pyannote.audio import Pipeline
import torch
# Load pre-trained speaker diarization pipeline
# Requires a HuggingFace token (free; you must accept the model's terms)
pipeline = Pipeline.from_pretrained(
"pyannote/speaker-diarization-3.1",
use_auth_token="YOUR_HF_TOKEN"
)
# Optional: use GPU for faster inference
pipeline.to(torch.device("cuda"))
# Run diarization on audio file
diarization = pipeline("audio.wav")
# Iterate over speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
# Output:
# 0.0s - 2.5s: SPEAKER_00
# 2.8s - 5.2s: SPEAKER_01
# 5.0s - 7.5s: SPEAKER_00 (note: overlaps with previous)
# ...
# Export to RTTM format (standard diarization format)
with open("output.rttm", "w") as f:
diarization.write_rttm(f)
# Customize pipeline parameters
diarization = pipeline(
"audio.wav",
min_speakers=2, # Minimum number of speakers
max_speakers=5, # Maximum number of speakers
# Or let it auto-detect:
# num_speakers=None
)
```

Quick Reference
- NeMo MSDD (~8% DER)
- pyannote.audio 3.1 (~11% DER)
- End-to-end handles overlap
- WhisperX for Whisper quality
- AssemblyAI for managed API
- Word-level speaker labels
- VAD + Embeddings + Clustering
- ECAPA-TDNN for embeddings
- Spectral clustering standard
Use Cases
- ✓ Meeting attribution
- ✓ Call center QA
- ✓ Podcast chaptering
- ✓ Court recordings
Architectural Patterns
Embedding Clustering
Extract speaker embeddings, cluster over time.
End-to-End Diarization
Joint VAD + speaker tagging in one model.
Quick Facts
- Input: Audio
- Output: Structured data
- Implementations: 3 open source, 0 API
- Patterns: 2 approaches