
Speaker Diarization

Determine 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.

How Speaker Diarization Works

A technical deep-dive into speaker diarization: the art of answering "who spoke when?" From voice fingerprints and clustering to handling overlapping speech.

1

The Problem: Who Spoke When?

Speech recognition tells you what was said. Speaker diarization tells you who said it. Picture a meeting recording: you need to split the audio stream into segments and label each with a speaker identity.

Timeline: A Two-Person Conversation (0-15s)

Segments alternate between Speaker A and Speaker B, with occasional overlap. This overlap is one of the hardest challenges in diarization.

Why Is This Hard?

No Prior Knowledge

We do not know how many speakers exist or what they sound like. The system must discover speaker identities from the audio alone.

Overlapping Speech

In natural conversation, people interrupt and talk over each other. Roughly 10-20% of meeting audio contains overlap.

Same Speaker, Different Conditions

A speaker's voice changes: emotional state, microphone distance, background noise. The system must recognize them regardless.

Different Speakers, Similar Voices

Two people can sound alike, especially with similar age, gender, and accent. The system must distinguish subtle differences.

Where Is This Used?

Meeting Transcription: attribute quotes to speakers
Call Centers: separate agent from customer
Media Indexing: search by speaker in podcasts
Voice Assistants: personalize to each user
2

The Diarization Pipeline

Most diarization systems follow a four-stage pipeline. Think of it as: find speech, fingerprint it, group similar fingerprints, assign labels.

VAD (Voice Activity Detection: find speech regions) -> Speaker Embeddings (extract voice fingerprints) -> Clustering (group segments by similarity) -> Label Assignment (who spoke when)
Stage 1: Voice Activity Detection

First, find where speech occurs. VAD strips silence and noise, leaving only speech segments. This reduces processing and prevents clustering silence as a "speaker."

Input: 60 min audio
Output: [(0.0s-2.5s), (2.8s-5.2s), ...] speech regions
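
A minimal sketch of this stage, assuming the Silero VAD model loaded through torch.hub (the helper-function names can shift between releases):

import torch

# Load the Silero VAD model and its helper functions (assumed hub entry point)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

# Convert sample indices to seconds: [(0.0, 2.5), (2.8, 5.2), ...]
regions = [(ts["start"] / 16000, ts["end"] / 16000) for ts in speech]
print(regions)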
Stage 2: Speaker Embeddings

For each speech segment, extract a "voice fingerprint" - a fixed-length vector (192-512 dims) that captures who the speaker is, not what they said.

Input: Audio segment (1-3 seconds)
Output: [0.23, -0.15, 0.82, ...] 256-dim vector
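
A sketch of the extraction step using SpeechBrain's pretrained ECAPA-TDNN model (speechbrain/spkrec-ecapa-voxceleb); the import path differs slightly across SpeechBrain versions:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder (192-dim embeddings)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("segment.wav")   # one 1-3 second speech segment
embedding = encoder.encode_batch(signal)      # shape: (1, 1, 192)
print(embedding.squeeze().shape)              # torch.Size([192])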
Stage 3: Clustering

Group embeddings by similarity. Segments from the same speaker should cluster together. The algorithm discovers how many speakers exist and which segments belong to each.

Input: 100 embeddings
Output: Cluster labels [0, 1, 0, 2, 1, ...]
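
A sketch of this stage with scikit-learn's agglomerative clustering over cosine distances; embeddings is assumed to be the (N, 192) array produced in stage 2, and the 0.7 distance threshold is purely illustrative:

from sklearn.cluster import AgglomerativeClustering

# embeddings: (N, 192) array, one row per speech segment (assumed from stage 2)
clusterer = AgglomerativeClustering(
    n_clusters=None,           # let the threshold decide how many speakers
    distance_threshold=0.7,    # illustrative; tune per domain
    metric="cosine",           # older scikit-learn versions use affinity="cosine"
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)
print(labels)                                    # e.g. [0, 1, 0, 2, 1, ...]
print("estimated speakers:", labels.max() + 1)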
Stage 4: Label Assignment

Map cluster IDs back to time segments. Optionally merge short segments, smooth boundaries, and handle edge cases. The output is the final diarization.

Input: Segments + cluster labels
Output: [(0.0s, 2.5s, "SPEAKER_00"), ...]
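
A toy sketch of the assignment step, with hard-coded segments and labels for illustration:

# Segments from VAD and cluster labels from stage 3 (hard-coded here)
segments = [(0.0, 2.5), (2.8, 5.2), (5.4, 7.1)]
labels = [0, 1, 0]

turns = [(start, end, f"SPEAKER_{label:02d}")
         for (start, end), label in zip(segments, labels)]

# Merge adjacent turns from the same speaker separated by a short gap
merged = []
for start, end, spk in turns:
    if merged and merged[-1][2] == spk and start - merged[-1][1] < 0.5:
        merged[-1] = (merged[-1][0], end, spk)
    else:
        merged.append((start, end, spk))

print(merged)  # [(0.0, 2.5, 'SPEAKER_00'), (2.8, 5.2, 'SPEAKER_01'), (5.4, 7.1, 'SPEAKER_00')]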
End-to-End Alternative: Systems like EEND replace this pipeline with a single neural network that directly outputs speaker assignments, which avoids error propagation between stages and handles overlapping speech natively. pyannote 3.1 takes a hybrid route, running a local EEND-style segmentation model before embedding extraction and clustering.
3

Speaker Embeddings: Voice Fingerprints

The core insight: a neural network can learn to compress seconds of audio into a compact vector where same speaker = similar vectors, different speakers = distant vectors.
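
To make "similar vs. distant" concrete, here is an illustrative cosine-similarity check on toy vectors; real embeddings are 192-512 dimensional and come from a trained model.

import numpy as np

# Hypothetical low-dimensional "embeddings", for the geometry only
emb_same_1 = np.array([0.9, 0.1, 0.3, 0.2])
emb_same_2 = np.array([0.8, 0.2, 0.35, 0.15])  # same speaker, different conditions
emb_other = np.array([0.1, 0.9, -0.2, 0.4])    # a different speaker

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb_same_1, emb_same_2))  # high: same speaker
print(cosine(emb_same_1, emb_other))   # much lower: different speakers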

What Do Embedding Dimensions Capture?

While individual dimensions are not directly interpretable, they encode various aspects of voice identity:

Example: pitch, ranging from a deep voice (~100 Hz) to a high voice (~300 Hz), i.e. the speaker's fundamental frequency range.

Popular Embedding Models

ECAPA-TDNN

Emphasized Channel Attention, Propagation and Aggregation in TDNN. 192-dim embeddings. Current favorite for diarization.

Used by: pyannote, SpeechBrain
x-vector

Time-delay neural network. 512-dim embeddings. Well-established, good baseline.

Used by: Kaldi, older systems
TitaNet

NVIDIA's speaker model. Squeeze-and-Excitation blocks. State-of-the-art accuracy.

Used by: NeMo

How Are Embeddings Trained?

Training Data

Large speaker recognition datasets with millions of utterances from thousands of speakers. VoxCeleb (7000+ speakers) is the standard.

  • - VoxCeleb1: 1251 celebrities, 150K utterances
  • - VoxCeleb2: 6112 speakers, 1M utterances
  • - CN-Celeb: 1000 Chinese celebrities
Training Objective

The model learns either to classify speakers (softmax) or to minimize the distance between same-speaker pairs while maximizing the distance between different-speaker pairs (contrastive/triplet loss).

AAM-Softmax: Large-margin softmax for speaker classification
Angular distance with margin penalty for better separation
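
A compact sketch of an additive angular margin (AAM) softmax head in PyTorch; the class name, the 0.2 margin, and the 30.0 scale are illustrative choices rather than a specific published recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """ArcFace-style large-margin softmax over speaker classes (sketch)."""
    def __init__(self, emb_dim=192, n_speakers=6000, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to each sample's target-speaker logit
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.scale * torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(logits, labels)

loss = AAMSoftmaxLoss()(torch.randn(8, 192), torch.randint(0, 6000, (8,)))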
4

Clustering: Grouping Voice Fingerprints

Given N embedding vectors, cluster them into K groups where K (the number of speakers) is unknown. This is unsupervised learning with a twist: we must also discover K.

1. Agglomerative (AHC) (pyannote, older systems)
Hierarchical clustering, merges similar segments bottom-up.
+ Works well with unknown speaker count, no initialization
- O(n^2) complexity, sensitive to threshold

2. Spectral Clustering (NeMo, many SOTA systems)
Graph-based clustering using affinity matrix eigenvectors.
+ Handles non-convex clusters, robust to noise
- Needs number of speakers or threshold

3. PLDA Scoring (Kaldi, traditional systems)
Probabilistic Linear Discriminant Analysis for speaker similarity.
+ Trained on speaker data, very accurate
- Requires domain-matched training data

4. Online Clustering (real-time systems)
Real-time clustering as audio streams in.
+ Low latency, works with streaming
- Cannot correct early mistakes

Spectral Clustering: The Modern Standard

  1. Affinity matrix: pairwise cosine similarity between all embeddings
  2. Graph Laplacian: L = D - A, where D is the degree matrix
  3. Eigenvectors: take the eigenvectors of the K smallest eigenvalues
  4. K-means: cluster in the eigenvector space

The key advantage: spectral clustering can find non-convex clusters and is robust to noise. The number of speakers K can be estimated from the eigenvalue gap.
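
A from-scratch sketch of these four steps in NumPy; embeddings is assumed to be an (N, D) array of L2-normalized vectors, and the eigengap heuristic for estimating K is the simplest possible version (production systems typically add affinity-matrix refinements):

import numpy as np
from sklearn.cluster import KMeans

def spectral_diarize(embeddings, max_speakers=8):
    # 1. Affinity matrix: pairwise cosine similarity (rows assumed L2-normalized)
    A = embeddings @ embeddings.T
    np.fill_diagonal(A, 0.0)
    A = np.clip(A, 0.0, None)

    # 2. Unnormalized graph Laplacian L = D - A
    D = np.diag(A.sum(axis=1))
    L = D - A

    # 3. Eigen-decomposition; estimate K from the largest gap among the smallest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(L)
    k = int(np.argmax(np.diff(eigvals[: max_speakers + 1]))) + 1

    # 4. K-means in the space spanned by the first K eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
    return labels, k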

The Threshold Problem

Without knowing the number of speakers, we need a threshold to decide when embeddings are "similar enough" to belong to the same speaker. This threshold is domain-dependent:

Telephony: usually 2 speakers (agent + customer). Threshold: ~0.3
Meetings: 3-10 speakers, varying conditions. Threshold: ~0.5
Broadcast: many speakers, studio quality. Threshold: ~0.7
5

Overlap Handling: When People Talk Over Each Other

In natural conversation, 10-20% of speech involves multiple speakers talking simultaneously. Traditional pipelines fail here because they assign each frame to exactly one speaker.

The Challenge

10-20% of meeting audio contains overlap
40%+ of DER errors come from overlap
2-4 speakers can overlap at once

Approaches to Overlap

Ignore Overlaps
Assign segment to dominant speaker only.
Accuracy: Low. Complexity: Simple.

Multi-label Classification
Per-frame: which speakers are active?
Accuracy: High. Complexity: Complex.

End-to-End EEND
Single neural network outputs all speakers.
Accuracy: Highest. Complexity: Very Complex.

End-to-End Neural Diarization (EEND)

The breakthrough: instead of a pipeline, train a single neural network that directly outputs which speakers are active at each frame. The output is a multi-hot vector per frame.

How it works:
  1. Self-attention encoder processes audio frames
  2. Each frame outputs an activation for each speaker slot
  3. Multiple speakers can be active simultaneously
  4. Trained with a permutation-invariant training (PIT) loss
Models:
  • - pyannote.audio 3.1 uses local EEND
  • - NeMo MSDD for multi-scale decoding
  • - SA-EEND with self-attention
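
A toy EEND-style model to make the idea concrete; the layer sizes and the 4-slot speaker limit here are illustrative, not the published EEND configuration:

import torch
import torch.nn as nn

class TinyEEND(nn.Module):
    def __init__(self, n_feats=23, d_model=256, n_heads=4, n_layers=4, max_speakers=4):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, max_speakers)

    def forward(self, feats):                 # feats: (batch, frames, n_feats)
        h = self.encoder(self.proj(feats))
        return torch.sigmoid(self.head(h))    # (batch, frames, max_speakers)

model = TinyEEND()
probs = model(torch.randn(1, 500, 23))        # 500 frames of log-mel features
active = probs > 0.5                          # multi-hot: several speakers can be on at once

Training scores these per-frame outputs with binary cross-entropy under the best permutation of speaker slots (the PIT loss), since slot order is arbitrary.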
6

Models and Tools

From end-to-end neural systems to traditional pipelines. Choose based on your accuracy needs, whether you need transcription, and deployment constraints.

Model | Type | Architecture | DER | Notes
pyannote.audio 3.1 | End-to-end | Segmentation + Embedding + Clustering | ~11% (AMI) | Best open-source. Handles overlap. HuggingFace token required.
NeMo MSDD | Neural | Multi-scale Diarization Decoder | ~8% (AMI) | NVIDIA. State-of-the-art accuracy. GPU required.
WhisperX | Pipeline | Whisper + pyannote + alignment | ~12% (AMI) | Best for transcription + diarization. Word-level timestamps.
Simple Diarizer | Basic | Embedding + AHC clustering | ~15% (AMI) | Educational. Good for understanding. No overlap handling.
Kaldi PLDA | Traditional | x-vector + PLDA + AHC | ~10% (AMI) | Proven production system. Complex setup.

Choosing the Right Tool

Use pyannote.audio when:
  • - You need best open-source accuracy
  • - Overlap handling is important
  • - Python/PyTorch environment available
  • - Can obtain HuggingFace token
Use WhisperX when:
  • - You need transcription + diarization
  • - Word-level timestamps are required
  • - Whisper quality transcription needed
  • - Okay with slightly lower DER
Use NeMo when:
  • - Maximum accuracy is critical
  • - NVIDIA GPU available
  • - Enterprise deployment
  • - Telephony or meeting domain
Use AssemblyAI when:
  • - You want a managed API
  • - No infrastructure to maintain
  • - Budget for per-minute pricing
  • - Need additional features (summaries, etc.)

Evaluation Metrics

Diarization Error Rate (DER)
DER = (FA + Miss + Confusion) / Total

The fraction of total speech time that is incorrectly attributed (false alarm + missed speech + speaker confusion). The standard metric.

Good: < 10%. Average: 10-20%. Poor: > 20%.
Jaccard Error Rate (JER)
JER = 1 - (Intersection / Union)

Per-speaker overlap metric. More sensitive to short segments.

Good: < 15%. Average: 15-25%. Poor: > 25%.
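
A sketch of computing DER with pyannote.metrics; the reference and hypothesis are built by hand here, and the metric finds the optimal speaker-label mapping internally:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 2.5)] = "alice"
reference[Segment(2.8, 5.2)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 2.4)] = "SPEAKER_00"
hypothesis[Segment(2.4, 5.2)] = "SPEAKER_01"

der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER = {der:.1%}")  # (false alarm + missed speech + confusion) / total speech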
7

Code Examples

Get started with speaker diarization in Python. From high-level APIs to building your own pipeline.

pyannote.audio 3.1 (Best Open Source)
pip install pyannote.audio
from pyannote.audio import Pipeline
import torch

# Load pre-trained speaker diarization pipeline
# Requires HuggingFace token (free, accepts terms)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)

# Optional: use GPU for faster inference
pipeline.to(torch.device("cuda"))

# Run diarization on audio file
diarization = pipeline("audio.wav")

# Iterate over speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# Output:
# 0.0s - 2.5s: SPEAKER_00
# 2.8s - 5.2s: SPEAKER_01
# 5.0s - 7.5s: SPEAKER_00  (note: overlaps with previous)
# ...

# Export to RTTM format (standard diarization format)
with open("output.rttm", "w") as f:
    diarization.write_rttm(f)

# Customize pipeline parameters
diarization = pipeline(
    "audio.wav",
    min_speakers=2,          # Minimum number of speakers
    max_speakers=5,          # Maximum number of speakers
    # Or let it auto-detect:
    # num_speakers=None
)
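
WhisperX (pip install whisperx)

For combined transcription and diarization, here is a sketch along the lines of the WhisperX README; function names and the DiarizationPipeline location have shifted between releases, so treat this as a starting point rather than a fixed API:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Align the transcript for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize (pyannote under the hood), then attach speaker labels to words
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])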

Quick Reference

For Best Accuracy
  • - NeMo MSDD (~8% DER)
  • - pyannote.audio 3.1 (~11% DER)
  • - End-to-end handles overlap
For Transcription
  • - WhisperX for Whisper quality
  • - AssemblyAI for managed API
  • - Word-level speaker labels
Key Concepts
  • - VAD + Embeddings + Clustering
  • - ECAPA-TDNN for embeddings
  • - Spectral clustering standard

Use Cases

  • Meeting attribution
  • Call center QA
  • Podcast chaptering
  • Court recordings

Architectural Patterns

Embedding Clustering

Extract speaker embeddings, cluster over time.

End-to-End Diarization

Joint VAD + speaker tagging in one model.

Implementations

Open Source

pyannote.audio (MIT): State-of-the-art diarization pipelines.
NVIDIA NeMo Diarization (Apache 2.0): Diarization with speaker embeddings.
Resemblyzer (MIT): Speaker embeddings for clustering.

Benchmarks

Quick Facts

Input: Audio
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for speaker diarization.

Submit Results