
Speaker Diarization

Determine 'who spoke when' in audio. Vital for meetings, call centers, and transcription QA.

How Speaker Diarization Works

A technical deep-dive into speaker diarization: the art of answering "who spoke when?" From voice fingerprints and clustering to handling overlapping speech.

1

The Problem: Who Spoke When?

Speech recognition tells you what was said. Speaker diarization tells you who said it. Picture a meeting recording: you need to split the audio stream into segments and label each with a speaker identity.

Timeline: A Two-Person Conversation (0-15s)

Segments alternate between Speaker A and Speaker B, with occasional overlap. This overlap is one of the hardest challenges in diarization.

Why Is This Hard?

No Prior Knowledge

We do not know how many speakers exist or what they sound like. The system must discover speaker identities from the audio alone.

Overlapping Speech

In natural conversation, people interrupt and talk over each other. Roughly 10-20% of meeting audio contains overlap.

Same Speaker, Different Conditions

A speaker's voice changes: emotional state, microphone distance, background noise. The system must recognize them regardless.

Different Speakers, Similar Voices

Two people can sound alike, especially with similar age, gender, and accent. The system must distinguish subtle differences.

Where Is This Used?

Meeting Transcription: attribute quotes to speakers
Call Centers: separate agent from customer
Media Indexing: search by speaker in podcasts
Voice Assistants: personalize to each user
2

The Diarization Pipeline

Most diarization systems follow a four-stage pipeline. Think of it as: find speech, fingerprint it, group similar fingerprints, assign labels.

VAD (Voice Activity Detection: find speech regions) -> Speaker Embeddings (extract voice fingerprints) -> Clustering (group segments by similarity) -> Label Assignment (who spoke when)
Stage 1: Voice Activity Detection

First, find where speech occurs. VAD strips silence and noise, leaving only speech segments. This reduces processing and prevents clustering silence as a "speaker."

Input: 60 min audio
Output: [(0.0s-2.5s), (2.8s-5.2s), ...] speech regions
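
A minimal sketch of this stage, assuming the Silero VAD model loaded through torch.hub (the helper-function names can shift between releases):

import torch

# Load the Silero VAD model and its helper functions (assumed hub entry point)
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, *_ = utils

wav = read_audio("audio.wav", sampling_rate=16000)
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

# Convert sample indices to seconds: [(0.0, 2.5), (2.8, 5.2), ...]
regions = [(ts["start"] / 16000, ts["end"] / 16000) for ts in speech]
print(regions)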
Stage 2: Speaker Embeddings

For each speech segment, extract a "voice fingerprint" - a fixed-length vector (192-512 dims) that captures who the speaker is, not what they said.

Input: Audio segment (1-3 seconds)
Output: [0.23, -0.15, 0.82, ...] 256-dim vector
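
A sketch of the extraction step using SpeechBrain's pretrained ECAPA-TDNN model (speechbrain/spkrec-ecapa-voxceleb); the import path differs slightly across SpeechBrain versions:

import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Pretrained ECAPA-TDNN speaker encoder (192-dim embeddings)
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

signal, sr = torchaudio.load("segment.wav")   # one 1-3 second speech segment
embedding = encoder.encode_batch(signal)      # shape: (1, 1, 192)
print(embedding.squeeze().shape)              # torch.Size([192])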
Stage 3: Clustering

Group embeddings by similarity. Segments from the same speaker should cluster together. The algorithm discovers how many speakers exist and which segments belong to each.

Input: 100 embeddings
Output: Cluster labels [0, 1, 0, 2, 1, ...]
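
A sketch of this stage with scikit-learn's agglomerative clustering over cosine distances; embeddings is assumed to be the (N, 192) array produced in stage 2, and the 0.7 distance threshold is purely illustrative:

from sklearn.cluster import AgglomerativeClustering

# embeddings: (N, 192) array, one row per speech segment (assumed from stage 2)
clusterer = AgglomerativeClustering(
    n_clusters=None,           # let the threshold decide how many speakers
    distance_threshold=0.7,    # illustrative; tune per domain
    metric="cosine",           # older scikit-learn versions use affinity="cosine"
    linkage="average",
)
labels = clusterer.fit_predict(embeddings)
print(labels)                                    # e.g. [0, 1, 0, 2, 1, ...]
print("estimated speakers:", labels.max() + 1)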
Stage 4: Label Assignment

Map cluster IDs back to time segments. Optionally merge short segments, smooth boundaries, and handle edge cases. The output is the final diarization.

Input: Segments + cluster labels
Output: [(0.0s, 2.5s, "SPEAKER_00"), ...]
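
A toy sketch of the assignment step, with hard-coded segments and labels for illustration:

# Segments from VAD and cluster labels from stage 3 (hard-coded here)
segments = [(0.0, 2.5), (2.8, 5.2), (5.4, 7.1)]
labels = [0, 1, 0]

turns = [(start, end, f"SPEAKER_{label:02d}")
         for (start, end), label in zip(segments, labels)]

# Merge adjacent turns from the same speaker separated by a short gap
merged = []
for start, end, spk in turns:
    if merged and merged[-1][2] == spk and start - merged[-1][1] < 0.5:
        merged[-1] = (merged[-1][0], end, spk)
    else:
        merged.append((start, end, spk))

print(merged)  # [(0.0, 2.5, 'SPEAKER_00'), (2.8, 5.2, 'SPEAKER_01'), (5.4, 7.1, 'SPEAKER_00')]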
End-to-End Alternative: Systems like EEND replace this pipeline with a single neural network that directly outputs speaker assignments, which avoids error propagation between stages and handles overlapping speech natively. pyannote 3.1 takes a hybrid route, running a local EEND-style segmentation model before embedding extraction and clustering.
3

Speaker Embeddings: Voice Fingerprints

The core insight: a neural network can learn to compress seconds of audio into a compact vector where same speaker = similar vectors, different speakers = distant vectors.
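
To make "similar vs. distant" concrete, here is an illustrative cosine-similarity check on toy vectors; real embeddings are 192-512 dimensional and come from a trained model.

import numpy as np

# Hypothetical low-dimensional "embeddings", for the geometry only
emb_same_1 = np.array([0.9, 0.1, 0.3, 0.2])
emb_same_2 = np.array([0.8, 0.2, 0.35, 0.15])  # same speaker, different conditions
emb_other = np.array([0.1, 0.9, -0.2, 0.4])    # a different speaker

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(emb_same_1, emb_same_2))  # high: same speaker
print(cosine(emb_same_1, emb_other))   # much lower: different speakers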

What Do Embedding Dimensions Capture?

While individual dimensions are not directly interpretable, they encode various aspects of voice identity:

Example: pitch, ranging from a deep voice (~100 Hz) to a high voice (~300 Hz), i.e. the speaker's fundamental frequency range.

Popular Embedding Models

ECAPA-TDNN

Emphasized Channel Attention, Propagation and Aggregation in TDNN. 192-dim embeddings. Current favorite for diarization.

Used by: pyannote, SpeechBrain
x-vector

Time-delay neural network. 512-dim embeddings. Well-established, good baseline.

Used by: Kaldi, older systems
TitaNet

NVIDIA's speaker model. Squeeze-and-Excitation blocks. State-of-the-art accuracy.

Used by: NeMo

How Are Embeddings Trained?

Training Data

Large speaker recognition datasets with millions of utterances from thousands of speakers. VoxCeleb (7000+ speakers) is the standard.

  • - VoxCeleb1: 1251 celebrities, 150K utterances
  • - VoxCeleb2: 6112 speakers, 1M utterances
  • - CN-Celeb: 1000 Chinese celebrities
Training Objective

The model learns either to classify speakers (softmax) or to minimize the distance between same-speaker pairs while maximizing the distance between different-speaker pairs (contrastive/triplet loss).

AAM-Softmax: Large-margin softmax for speaker classification
Angular distance with margin penalty for better separation
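
A compact sketch of an additive angular margin (AAM) softmax head in PyTorch; the class name, the 0.2 margin, and the 30.0 scale are illustrative choices rather than a specific published recipe:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AAMSoftmaxLoss(nn.Module):
    """ArcFace-style large-margin softmax over speaker classes (sketch)."""
    def __init__(self, emb_dim=192, n_speakers=6000, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        # Cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin only to each sample's target-speaker logit
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = self.scale * torch.where(target, torch.cos(theta + self.margin), cos)
        return F.cross_entropy(logits, labels)

loss = AAMSoftmaxLoss()(torch.randn(8, 192), torch.randint(0, 6000, (8,)))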
4

Clustering: Grouping Voice Fingerprints

Given N embedding vectors, cluster them into K groups where K (the number of speakers) is unknown. This is unsupervised learning with a twist: we must also discover K.

1. Agglomerative (AHC) (pyannote, older systems)
Hierarchical clustering, merges similar segments bottom-up.
+ Works well with unknown speaker count, no initialization
- O(n^2) complexity, sensitive to threshold

2. Spectral Clustering (NeMo, many SOTA systems)
Graph-based clustering using affinity matrix eigenvectors.
+ Handles non-convex clusters, robust to noise
- Needs number of speakers or threshold

3. PLDA Scoring (Kaldi, traditional systems)
Probabilistic Linear Discriminant Analysis for speaker similarity.
+ Trained on speaker data, very accurate
- Requires domain-matched training data

4. Online Clustering (real-time systems)
Real-time clustering as audio streams in.
+ Low latency, works with streaming
- Cannot correct early mistakes

Spectral Clustering: The Modern Standard

  1. Affinity matrix: pairwise cosine similarity between all embeddings
  2. Graph Laplacian: L = D - A, where D is the degree matrix
  3. Eigenvectors: take the eigenvectors of the K smallest eigenvalues
  4. K-means: cluster in the eigenvector space

The key advantage: spectral clustering can find non-convex clusters and is robust to noise. The number of speakers K can be estimated from the eigenvalue gap.
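
A from-scratch sketch of these four steps in NumPy; embeddings is assumed to be an (N, D) array of L2-normalized vectors, and the eigengap heuristic for estimating K is the simplest possible version (production systems typically add affinity-matrix refinements):

import numpy as np
from sklearn.cluster import KMeans

def spectral_diarize(embeddings, max_speakers=8):
    # 1. Affinity matrix: pairwise cosine similarity (rows assumed L2-normalized)
    A = embeddings @ embeddings.T
    np.fill_diagonal(A, 0.0)
    A = np.clip(A, 0.0, None)

    # 2. Unnormalized graph Laplacian L = D - A
    D = np.diag(A.sum(axis=1))
    L = D - A

    # 3. Eigen-decomposition; estimate K from the largest gap among the smallest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(L)
    k = int(np.argmax(np.diff(eigvals[: max_speakers + 1]))) + 1

    # 4. K-means in the space spanned by the first K eigenvectors
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(eigvecs[:, :k])
    return labels, k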

The Threshold Problem

Without knowing the number of speakers, we need a threshold to decide when embeddings are "similar enough" to belong to the same speaker. This threshold is domain-dependent:

Telephony: usually 2 speakers (agent + customer). Threshold: ~0.3
Meetings: 3-10 speakers, varying conditions. Threshold: ~0.5
Broadcast: many speakers, studio quality. Threshold: ~0.7
5

Overlap Handling: When People Talk Over Each Other

In natural conversation, 10-20% of speech involves multiple speakers talking simultaneously. Traditional pipelines fail here because they assign each frame to exactly one speaker.

The Challenge

10-20% of meeting audio contains overlap
40%+ of DER errors come from overlap
2-4 speakers can overlap at once

Approaches to Overlap

Ignore Overlaps
Assign segment to dominant speaker only.
Accuracy: Low. Complexity: Simple.

Multi-label Classification
Per-frame: which speakers are active?
Accuracy: High. Complexity: Complex.

End-to-End EEND
Single neural network outputs all speakers.
Accuracy: Highest. Complexity: Very Complex.

End-to-End Neural Diarization (EEND)

The breakthrough: instead of a pipeline, train a single neural network that directly outputs which speakers are active at each frame. The output is a multi-hot vector per frame.

How it works:
  1. Self-attention encoder processes audio frames
  2. Each frame outputs an activation for each speaker slot
  3. Multiple speakers can be active simultaneously
  4. Trained with a permutation-invariant training (PIT) loss
Models:
  • - pyannote.audio 3.1 uses local EEND
  • - NeMo MSDD for multi-scale decoding
  • - SA-EEND with self-attention
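
A toy EEND-style model to make the idea concrete; the layer sizes and the 4-slot speaker limit here are illustrative, not the published EEND configuration:

import torch
import torch.nn as nn

class TinyEEND(nn.Module):
    def __init__(self, n_feats=23, d_model=256, n_heads=4, n_layers=4, max_speakers=4):
        super().__init__()
        self.proj = nn.Linear(n_feats, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, max_speakers)

    def forward(self, feats):                 # feats: (batch, frames, n_feats)
        h = self.encoder(self.proj(feats))
        return torch.sigmoid(self.head(h))    # (batch, frames, max_speakers)

model = TinyEEND()
probs = model(torch.randn(1, 500, 23))        # 500 frames of log-mel features
active = probs > 0.5                          # multi-hot: several speakers can be on at once

Training scores these per-frame outputs with binary cross-entropy under the best permutation of speaker slots (the PIT loss), since slot order is arbitrary.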
6

Models and Tools

From end-to-end neural systems to traditional pipelines. Choose based on your accuracy needs, whether you need transcription, and deployment constraints.

Model | Type | Architecture | DER | Notes
pyannote.audio 3.1 | End-to-end | Segmentation + Embedding + Clustering | ~11% (AMI) | Best open-source. Handles overlap. HuggingFace token required.
NeMo MSDD | Neural | Multi-scale Diarization Decoder | ~8% (AMI) | NVIDIA. State-of-the-art accuracy. GPU required.
WhisperX | Pipeline | Whisper + pyannote + alignment | ~12% (AMI) | Best for transcription + diarization. Word-level timestamps.
Simple Diarizer | Basic | Embedding + AHC clustering | ~15% (AMI) | Educational. Good for understanding. No overlap handling.
Kaldi PLDA | Traditional | x-vector + PLDA + AHC | ~10% (AMI) | Proven production system. Complex setup.

Choosing the Right Tool

Use pyannote.audio when:
  • - You need best open-source accuracy
  • - Overlap handling is important
  • - Python/PyTorch environment available
  • - Can obtain HuggingFace token
Use WhisperX when:
  • - You need transcription + diarization
  • - Word-level timestamps are required
  • - Whisper quality transcription needed
  • - Okay with slightly lower DER
Use NeMo when:
  • - Maximum accuracy is critical
  • - NVIDIA GPU available
  • - Enterprise deployment
  • - Telephony or meeting domain
Use AssemblyAI when:
  • - You want a managed API
  • - No infrastructure to maintain
  • - Budget for per-minute pricing
  • - Need additional features (summaries, etc.)

Evaluation Metrics

Diarization Error Rate (DER)
DER = (FA + Miss + Confusion) / Total

The fraction of total speech time that is incorrectly attributed (false alarm + missed speech + speaker confusion). The standard metric.

Good: < 10%. Average: 10-20%. Poor: > 20%.
Jaccard Error Rate (JER)
JER = 1 - (Intersection / Union)

Per-speaker overlap metric. More sensitive to short segments.

Good: < 15%. Average: 15-25%. Poor: > 25%.
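
A sketch of computing DER with pyannote.metrics; the reference and hypothesis are built by hand here, and the metric finds the optimal speaker-label mapping internally:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 2.5)] = "alice"
reference[Segment(2.8, 5.2)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 2.4)] = "SPEAKER_00"
hypothesis[Segment(2.4, 5.2)] = "SPEAKER_01"

der = DiarizationErrorRate()(reference, hypothesis)
print(f"DER = {der:.1%}")  # (false alarm + missed speech + confusion) / total speech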
7

Code Examples

Get started with speaker diarization in Python. From high-level APIs to building your own pipeline.

pyannote.audio 3.1 (Best Open Source)
pip install pyannote.audio
from pyannote.audio import Pipeline
import torch

# Load pre-trained speaker diarization pipeline
# Requires HuggingFace token (free, accepts terms)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)

# Optional: use GPU for faster inference
pipeline.to(torch.device("cuda"))

# Run diarization on audio file
diarization = pipeline("audio.wav")

# Iterate over speaker turns
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")

# Output:
# 0.0s - 2.5s: SPEAKER_00
# 2.8s - 5.2s: SPEAKER_01
# 5.0s - 7.5s: SPEAKER_00  (note: overlaps with previous)
# ...

# Export to RTTM format (standard diarization format)
with open("output.rttm", "w") as f:
    diarization.write_rttm(f)

# Customize pipeline parameters
diarization = pipeline(
    "audio.wav",
    min_speakers=2,          # Minimum number of speakers
    max_speakers=5,          # Maximum number of speakers
    # Or let it auto-detect:
    # num_speakers=None
)
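
WhisperX (pip install whisperx)

For combined transcription and diarization, here is a sketch along the lines of the WhisperX README; function names and the DiarizationPipeline location have shifted between releases, so treat this as a starting point rather than a fixed API:

import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio)

# 2. Align the transcript for word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarize (pyannote under the hood), then attach speaker labels to words
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

for seg in result["segments"]:
    print(seg.get("speaker", "UNKNOWN"), seg["text"])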

Quick Reference

For Best Accuracy
  • - NeMo MSDD (~8% DER)
  • - pyannote.audio 3.1 (~11% DER)
  • - End-to-end handles overlap
For Transcription
  • - WhisperX for Whisper quality
  • - AssemblyAI for managed API
  • - Word-level speaker labels
Key Concepts
  • - VAD + Embeddings + Clustering
  • - ECAPA-TDNN for embeddings
  • - Spectral clustering standard

Use Cases

  • Meeting attribution
  • Call center QA
  • Podcast chaptering
  • Court recordings

Architectural Patterns

Embedding Clustering

Extract speaker embeddings, cluster over time.

End-to-End Diarization

Joint VAD + speaker tagging in one model.

Implementations

Open Source

pyannote.audio (MIT): State-of-the-art diarization pipelines.
NVIDIA NeMo Diarization (Apache 2.0): Diarization with speaker embeddings.
Resemblyzer (MIT): Speaker embeddings for clustering.

Benchmarks

Quick Facts

Input: Audio
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for speaker diarization.

Submit Results