Codesota · Speech · Voice fingerprints · A reproducible DSP analysis · Vol. II · Issue of April 22, 2026
Deep-dive · Speech

Voices, under the microscope.

Eleven open-source Kokoro-82M voices, rendered through five complementary DSP lenses and a Griffin-Lim round-trip. Every figure comes out of the same Python pipeline; every number is measured, not claimed.

A spectrogram is not a quality judgment. It is a coordinate system in which pitch, timbre, prosody, sibilance and noise floor become legible. The point of this page is to teach the eye to read it — first by showing the same prompt said eleven different ways, then by projecting a single voice through the five lenses used in production TTS and ASR.

Pipeline: librosa + matplotlib. Audio synthesised locally from Kokoro-82M at 24 kHz. Scripts in scripts/tts-samples/.

§ 01 · The voices

Same prompt, eleven voices.

All eleven samples come from Kokoro-82M (Apache 2.0). Same sentence, same sample rate, same window. The only variable is voice identity. Pink borders mark female voices; sand borders mark male. GB flags the two British voices.

Prompt:

The quick brown fox jumps over the lazy dog.

Mosaic · at a glance
af_heart 204 Hz · af_bella 191 Hz · af_nicole 148 Hz (airy) · af_sarah 202 Hz · af_sky 169 Hz · am_michael 119 Hz · am_adam 115 Hz (deepest) · am_fenrir 138 Hz · am_liam 125 Hz · bf_emma 178 Hz (GB, brightest) · bm_george 152 Hz (GB)
Pink borders mark female voices, sand borders mark male, GB flags the British accents; the number next to each voice name is the F0 median (pitch height).
Fig 1 · Border colour encodes gender (pink / sand). The number is F0 median — male voices cluster at 115–152 Hz, female at 148–204 Hz. bf_emma has the brightest sibilance; am_adam and am_michael sit the lowest; af_nicole is visibly sparse — its whispered delivery leaves only 22% of frames voiced.
The eleven · full spectrograms + audio
Kokoro · af_heart · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 154–322 Hz · median 204 Hz · centroid 2807 Hz · voiced ratio 69% of frames (strongly voiced)
Default voice — balanced formants, clean harmonic stack.

Kokoro · af_bella · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 155–288 Hz · median 191 Hz · centroid 3111 Hz · voiced ratio 79% of frames (strongly voiced)
Softer delivery. Second-highest voiced ratio of the female set.

Kokoro · af_nicole · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 141–185 Hz · median 148 Hz · centroid 3126 Hz · voiced ratio 22% of frames (airy / whispered)
Airy, whispered style — voiced ratio drops sharply (22%).

Kokoro · af_sarah · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 131–329 Hz · median 202 Hz · centroid 2860 Hz · voiced ratio 80% of frames (strongly voiced)
Mid-range female voice with the widest F0 swing of the set, and the highest voiced ratio (80%).

Kokoro · af_sky · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 97–275 Hz · median 169 Hz · centroid 2135 Hz · voiced ratio 75% of frames (strongly voiced)
Lower centroid than the other female voices — darker timbre.

Kokoro · am_michael · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 88–174 Hz · median 119 Hz · centroid 2514 Hz · voiced ratio 64% of frames (strongly voiced)
Neutral male. F0 median ~119 Hz.

Kokoro · am_adam · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 89–173 Hz · median 115 Hz · centroid 2188 Hz · voiced ratio 61% of frames (strongly voiced)
Confident, deeper. Lowest F0 median of the set (115 Hz).

Kokoro · am_fenrir · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 70–238 Hz · median 138 Hz · centroid 2423 Hz · voiced ratio 77% of frames (strongly voiced)
Deeper voice with denser harmonic stacking below 1 kHz. Highest voiced ratio of the male voices (77%).

Kokoro · am_liam · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 98–258 Hz · median 125 Hz · centroid 2089 Hz · voiced ratio 75% of frames (strongly voiced)
Pitched slightly above am_adam and am_michael, but with the lowest spectral centroid of the set: the darkest timbre of the eleven.

Kokoro · bf_emma · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 147–255 Hz · median 178 Hz · centroid 3274 Hz · voiced ratio 75% of frames (strongly voiced)
British RP. Highest spectral centroid of the set — brighter sibilance.

Kokoro · bm_george · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 108–225 Hz · median 152 Hz · centroid 2466 Hz · voiced ratio 72% of frames (strongly voiced)
British male. Highest F0 median of the male voices (152 Hz).

§ 02 · Five lenses

Five ways to look at a voice.

One clip — Kokoro-82M / af_heart — five projections. Each reveals a property the others hide. None of them is a quality metric; they are acoustic descriptors used in ASR front-ends, TTS training losses and voice-conversion systems for decades.

Each lens below notes the parameters that produced it; the full librosa calls live in scripts/tts-samples/analyze.py.

Pipeline, one diagram
Text"quick brown fox"PhonemesG2P · ARPAbet or IPAAcoustic modelTacotron · FastSpeechVITS · StyleTTS2Mel spectrogram80–128 bins × framesNeural vocoderHiFi-GAN · BigVGANWaveNet · iSTFTNetWaveform24 kHz PCM
Fig 2 · Every mainstream TTS stack since Tacotron (2017) follows this shape. The mel spectrogram sits one step from both ends — models emit it, humans can read it, vocoders consume it. The Kokoro-82M architecture (StyleTTS2 descendant) is end-to-end, but mel remains the internal quantity trained against.
The five projections
Mel spectrogram · what most TTS vocoders consume
0–8 kHz · 128 mel bins
Log-power perceptual spectrogram. Target of most modern TTS acoustic models, input of every neural vocoder. Harmonic stripes = voiced vowels. Dark columns = stops and word breaks. Bright diffuse regions up top = sibilants.
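
A minimal sketch of this render in librosa, using the parameters from the pipeline table in § 07 (the input path and figure styling are placeholders, not the exact script):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a 24 kHz Kokoro render (placeholder path).
y, sr = librosa.load("samples/af_heart.wav", sr=24000)

# Log-power mel spectrogram with the spectrograms.py parameters:
# 128 mel bins, 1024-sample FFT, 256-sample hop, 0-8 kHz.
S = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128, fmin=0, fmax=8000
)
S_db = librosa.power_to_db(S, ref=np.max)  # dB, 0 dB at the loudest bin

fig, ax = plt.subplots(figsize=(10, 3))
librosa.display.specshow(
    S_db, sr=sr, hop_length=256, x_axis="time", y_axis="mel",
    fmax=8000, cmap="magma", ax=ax,
)
fig.savefig("af_heart_mel.png", dpi=200)
```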

Waveform + RMS envelope · what the speaker hears
time-domain
The raw signal. Cyan shading is a short-time RMS envelope — useful for speaking-rate estimation and silence detection, but effectively blind to frequency content.

F0 contour · prosody
librosa.pyin, 70–500 Hz search
Probabilistic YIN tracker. Solid line = pitch; gaps = unvoiced or silent frames. A good F0 contour is how you measure prosody — does the sentence lift into a question, or drop into a declarative?
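
A sketch of the tracker call, with the same 70–500 Hz search as the analysis script (path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Probabilistic YIN; f0 is NaN on unvoiced or silent frames --
# those NaNs are the gaps in the plotted contour.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=70, fmax=500, sr=sr)
times = librosa.times_like(f0, sr=sr)

print(f"median F0: {np.nanmedian(f0):.1f} Hz")
print(f"voiced frames: {np.mean(voiced_flag) * 100:.0f}%")
```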

MFCC · ASR feature space
13 coefficients, DCT of log-mel
The feature that powered HMM speech recognition for three decades and still appears in modern ASR front-ends. First coefficient ≈ total energy; higher coefficients capture progressively finer spectral shape.
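
The computation itself is one call; a sketch (frame settings borrowed from the mel render above, path a placeholder):

```python
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# 13 MFCCs: DCT-II of the log-mel spectrogram, as in the pipeline table.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)

print(mfcc.shape)  # (13, n_frames): row 0 tracks overall energy,
                   # higher rows capture progressively finer spectral shape
```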

Spectral centroid + rolloff (Hz) · brightness over time
Centroid = where the mass of the spectrum sits. Rolloff = below which frequency 85% of the energy lives. Rising lines = sibilant consonants; falling lines = rounded vowels. Backdrop is a faded mel for context.
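
A sketch of both measures on 2048-sample frames, matching the § 05 methods note (path is a placeholder):

```python
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Centroid = spectral centre of mass per frame;
# rolloff = frequency below which 85% of the energy sits.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048)[0]
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]

print(f"mean centroid: {centroid.mean():.0f} Hz")
print(f"mean rolloff:  {rolloff.mean():.0f} Hz")
```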

Zero-crossing rate (crossings per frame) · voiced / unvoiced heuristic
How often the waveform crosses zero per frame. Low for vowels, high for fricatives. Combined with RMS, this is a surprisingly competent voiced-frame detector.
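
A sketch of that ZCR-plus-RMS heuristic; the thresholds here are illustrative assumptions, not values taken from analyze.py:

```python
import numpy as np
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

frame, hop = 2048, 256
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

# Crude heuristic: loud + few zero crossings -> likely a vowel;
# loud + many zero crossings -> likely a fricative. Thresholds are assumptions.
voiced_ish = (rms > 0.5 * rms.mean()) & (zcr < zcr.mean())
sibilant_ish = (rms > 0.5 * rms.mean()) & (zcr > 2 * zcr.mean())

print(f"voiced-ish frames: {voiced_ish.mean() * 100:.0f}%")
print(f"sibilant-ish frames: {sibilant_ish.mean() * 100:.0f}%")
```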

F0 vs brightness · voice space
Scatter plot: F0 median (Hz) on the x-axis against spectral centroid (Hz) on the y-axis, one labelled point per voice (af_heart, af_bella, af_nicole, af_sarah, af_sky, am_michael, am_adam, am_fenrir, am_liam, bf_emma, bm_george), dashed divider at ≈165 Hz, colour-coded female / male / British.
Fig 3 · Two numbers per voice: median F0 (pitch height) and mean spectral centroid (brightness). The ~165 Hz divider splits male and female clusters; the British voices pull harder on the brightness axis than the American — a regularity in Kokoro's training set, not a physical law.
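
If you want to regenerate this scatter without rerunning the DSP, the values in the Fig 5 table in § 05 are enough; a sketch (colours and styling are illustrative, not the page's exact plot code):

```python
import matplotlib.pyplot as plt

# (F0 median Hz, mean centroid Hz) copied from the Fig 5 table in § 05.
voices = {
    "af_heart": (203.8, 2807), "af_bella": (191.2, 3111), "af_nicole": (148.3, 3126),
    "af_sarah": (201.5, 2860), "af_sky": (168.9, 2135), "am_michael": (118.8, 2514),
    "am_adam": (115.0, 2188), "am_fenrir": (138.4, 2423), "am_liam": (124.7, 2089),
    "bf_emma": (178.4, 3274), "bm_george": (151.8, 2466),
}

fig, ax = plt.subplots()
for name, (f0, centroid) in voices.items():
    colour = "#e75480" if name[1] == "f" else "#d2b48c"  # pink = female, sand = male
    ax.scatter(f0, centroid, color=colour)
    ax.annotate(name, (f0, centroid), fontsize=7)

ax.axvline(165, linestyle="--", linewidth=0.8)  # the ~165 Hz male/female divider
ax.set_xlabel("F0 median (Hz)")
ax.set_ylabel("Spectral centroid (Hz)")
fig.savefig("voice_space.png", dpi=200)
```
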
Same voice · three prompts

Fix the voice (af_heart); vary the content. The sibilant sentence pushes energy into the 4–8 kHz band. The stop-consonant sentence carves deep vertical troughs at every /p/ and /b/ closure. Same weights, same window — the differences are entirely linguistic.

Kokoro · af_heart · reference · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

Balanced voiced/unvoiced content.

Kokoro · af_heart · sibilants · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

She sells seashells by the seashore, and six slippery snakes slithered south.

Sibilant-heavy. The 4–8 kHz band lights up.

Kokoro · af_heart · stops · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

Peter Piper picked a peck of pickled peppers. Bob baked big batches of bread.

Plosive-heavy. Vertical dark columns at every /p/ and /b/ closure.

§ 03 · Resynthesis

Mel → Griffin-Lim → WAV.

A mel spectrogram is magnitude-only. Phase is discarded. To turn mel back into audio you need to invent a phase that is consistent across frames. Modern TTS learns this mapping with a neural vocoder (HiFi-GAN, BigVGAN, iSTFTNet). The classical alternative is Griffin-Lim: iteratively alternate imposing the target magnitude and projecting onto time-consistent phase. It works, but it sounds like a heavily processed voicemail.

64 iterations. No learned prior. librosa.griffinlim on librosa.feature.inverse.mel_to_stft.
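
A sketch of the round-trip named above, assuming the same mel parameters as the rest of the page (the input and output paths are placeholders):

```python
import librosa
import soundfile as sf

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Forward: the same log-power mel used everywhere on this page.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128, fmax=8000, power=2.0
)

# Inverse: mel -> linear-frequency magnitude, then 64 Griffin-Lim iterations
# to invent a phase that is consistent across frames. No learned prior.
S = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=1024, power=2.0, fmax=8000)
y_gl = librosa.griffinlim(S, n_iter=64, hop_length=256, n_fft=1024)

sf.write("af_heart_griffinlim.wav", y_gl, sr)
```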

Kokoro · af_heart · original · Kokoro-82M + vocoder · Apache 2.0 · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

What Kokoro actually outputs: mel through the model's trained neural vocoder.

Kokoro · af_heart · Griffin-Lim · mel → GL, 64 iter · classical DSP · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

Same mel, phase recovered iteratively with no learned prior. Listen — speech is intelligible; timbre has been sandpapered.

Side-by-side mel spectrograms: original Kokoro output vs Griffin-Lim reconstruction
Fig 4 · Top, Kokoro's neural vocoder output. Bottom, the same mel reconstructed via mel→STFT inversion plus Griffin-Lim phase recovery (64 iterations). The magnitude envelope is nearly identical — that's what the mel captures. The between-harmonic energy changes — that's what phase carries.
§ 04 · Failure modes

Six things you can see in a spectrogram.

Spectrograms are how production TTS teams debug. The failure modes below show up as visual signatures before they show up in MOS scores. Each plate is a real DSP render from the same pipeline — healthy on top, failure on bottom.

Smeared harmonics

failure mode · 01 / 06
Mel spectrogram: healthy neural vocoder output on top with crisp harmonic stripes; oversmoothed vocoder output on bottom with fogged, low-contrast harmonics.
Cause
Vocoder undertrained, or mel-to-waveform gap. HiFi-GAN trained on studio speech but asked to synthesize a whispered voice.
What it looks like
Horizontal harmonic lines lose contrast; the whole spectrogram looks like fog.
Fix
Fine-tune vocoder on target-domain mel spectrograms. Or use Griffin-Lim baseline to confirm it's the vocoder, not the acoustic model.

Repeat / attention collapse

failure mode · 02 / 06
Mel spectrogram: healthy monotonic alignment on top; attention collapse on bottom with one syllable repeated three times in place.
Cause
Autoregressive TTS (Tacotron 1, early Tacotron 2) loses monotonic alignment and repeats a syllable, or skips words.
What it looks like
A periodic pattern that should be a word stretches into a held tone, or an entire word is missing.
Fix
Switch to FastSpeech/VITS-class non-autoregressive model, or add a monotonic attention constraint (GMM attention, Forward-Backward loss).

Flat prosody

failure mode · 03 / 06
Mel spectrogram: healthy live F0 contour on top with curving harmonic stack; flat-prosody collapse on bottom with perfectly horizontal harmonic stripes at a single F0.
Cause
Acoustic model averaged out pitch variance during training — classic result of using a mean-squared loss on F0.
What it looks like
F0 contour hugs a single value for the whole utterance. The mel spectrogram's harmonic stack stays at a fixed spacing.
Fix
Train with pitch as an explicit conditioning (FastPitch), add adversarial loss, or switch to flow-matching / diffusion.

Pronunciation drift on rare words

failure mode · 04 / 06
Mel spectrogram: healthy formant track matching expected vowels on top; pronunciation drift with wrong formant path and shifted pitch in a middle segment on bottom.
Cause
G2P coverage gap. "Anthropic" becomes "antrophic". Worse on brand names and technical terms.
What it looks like
The word exists (F0 moves, RMS rises), but the formant trajectory doesn't match the expected vowels.
Fix
Add a custom pronunciation lexicon, or use an LLM G2P head, or fall back to ARPAbet input on exact strings.

Clipping / peaks at segment boundaries

failure mode · 05 / 06
Mel spectrogram: healthy continuous waveform on top; streaming chunk boundary on bottom with a bright vertical broadband column at the 1.0 second splice point.
Cause
Streaming TTS concatenates chunks without crossfading.
What it looks like
A vertical bright column across every frequency band, happening exactly at a chunk boundary.
Fix
Overlap-add the chunks (typically 25–50 ms overlap) or emit an extra token of lookahead to let the vocoder decide the transition.
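
A minimal sketch of the overlap-add fix on two raw chunks; the 30 ms overlap and linear ramps are illustrative choices inside the 25–50 ms range above:

```python
import numpy as np

def crossfade_concat(a: np.ndarray, b: np.ndarray, sr: int = 24000,
                     overlap_ms: float = 30.0) -> np.ndarray:
    """Join two audio chunks with a linear crossfade instead of a hard splice."""
    n = min(int(sr * overlap_ms / 1000), len(a), len(b))
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    # Mix the tail of `a` into the head of `b` over the overlap window.
    seam = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], seam, b[n:]])

# chunk1, chunk2 = ...  # consecutive chunks from the streaming synthesizer
# audio = crossfade_concat(chunk1, chunk2)
```

Even a short linear ramp removes the broadband click; an equal-power (cosine) ramp holds perceived loudness a little steadier across the seam.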

Artifacts in sibilants

failure mode · 06 / 06
Mel spectrogram: healthy full 0–8 kHz band on top; low-pass-filtered output on bottom with the 4–8 kHz region dimmed during sibilant frames.
Cause
Vocoder sampling rate too low, or mel cut-off below 8 kHz.
What it looks like
Sibilants (/s/, /sh/) sound muffled; the spectrogram's 4–8 kHz region is dim even on clearly pronounced fricatives.
Fix
Raise sample rate to 24 or 48 kHz end-to-end. Extend mel fmax to 12 kHz.
§ 05 · Measured

Acoustic properties, to three digits.

Computed from the same clips as the spectrograms above. Descriptive, not evaluative. Pitch height, brightness and voiced ratio differ across voices — that is true by construction, not by training-run quality.

F0 via librosa.pyin. Centroid + rolloff via STFT on 2048-sample frames. Voiced ratio = fraction of frames where pyin returned a valid F0 estimate.
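
Those definitions map onto a handful of librosa calls; a sketch of a per-clip metrics function (key names and rounding are illustrative, not necessarily what all-metrics.json uses):

```python
import numpy as np
import librosa

def voice_metrics(path: str, sr: int = 24000) -> dict:
    y, _ = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=70, fmax=500, sr=sr)
    voiced = f0[~np.isnan(f0)]  # frames where pyin returned a valid F0 estimate
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048)[0]
    return {
        "f0_median_hz": float(np.median(voiced)),
        "f0_range_hz": (float(voiced.min()), float(voiced.max())),
        "f0_sigma_hz": float(np.std(voiced)),
        "centroid_hz": float(centroid.mean()),
        "rolloff85_hz": float(rolloff.mean()),
        "voiced_ratio": float(len(voiced) / len(f0)),
        "zcr": float(zcr.mean()),
    }
```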

Per-voice · measured
Voice | F0 median (Hz) | F0 range (Hz) | F0 σ (Hz) | Centroid (Hz) | Rolloff 85% (Hz) | Voiced | ZCR
Kokoro · af_heart | 203.8 | 153.6–321.6 | 36.6 | 2807 | 4929 | 69% | 0.143
Kokoro · af_bella | 191.2 | 155.3–288.2 | 29.8 | 3111 | 5759 | 79% | 0.164
Kokoro · af_nicole | 148.3 | 140.8–184.7 | 10.2 | 3126 | 5776 | 22% | 0.174
Kokoro · af_sarah | 201.5 | 130.6–329.2 | 43.6 | 2860 | 5090 | 80% | 0.148
Kokoro · af_sky | 168.9 | 97.3–275.2 | 40.2 | 2135 | 3512 | 75% | 0.124
Kokoro · am_michael | 118.8 | 88.2–174.4 | 22.7 | 2514 | 4944 | 64% | 0.106
Kokoro · am_adam | 115 | 89.2–173.4 | 18.4 | 2188 | 3863 | 61% | 0.120
Kokoro · am_fenrir | 138.4 | 70–238.2 | 38.9 | 2423 | 4367 | 77% | 0.123
Kokoro · am_liam | 124.7 | 97.9–258.2 | 39 | 2089 | 3879 | 75% | 0.106
Kokoro · bf_emma | 178.4 | 146.6–255.3 | 23.3 | 3274 | 5969 | 75% | 0.170
Kokoro · bm_george | 151.8 | 108–224.8 | 24.1 | 2466 | 4657 | 72% | 0.113
Fig 5 · F0 in Hz. Centroid and rolloff in Hz. Voiced ratio as a percentage of frames in the 2.5-second window. These are per-clip summaries of the frame-level series; downstream code works with the time-series, not the summaries.
Each lens · in production
Mel spectrogram
Training: target of Tacotron/FastSpeech/StyleTTS2 acoustic models; L1/L2 loss between predicted and ground-truth mel.
Production: input to every neural vocoder (HiFi-GAN, BigVGAN, iSTFTNet, UnivNet).
Debug: first place you look when TTS output sounds wrong. Muddy formants = bad acoustic model. Smeared harmonics = bad vocoder.

MFCC
Training: classical ASR acoustic-model input (HMM-GMM era). Still the default for speaker diarization and wake-word detection.
Production: most cloud ASR preprocesses to a mel-filterbank or MFCC-adjacent feature before the neural encoder.
Debug: coefficient 0 = total energy; higher coefficients = progressively finer spectral detail. Watch coefficient 1 for bass/treble drift.

F0 contour
Training: prosody loss in expressive TTS (FastPitch, EmoTTS); also the primary signal for singing-voice synthesis.
Production: voice-conversion systems use F0 trajectories as the pitch skeleton while replacing timbre.
Debug: gaps mean the tracker failed — usually because the signal was too noisy or the voice was whispered. Jumps of an octave are pyin doubling/halving errors.

Spectral centroid / rolloff
Training: auxiliary loss in some TTS systems to match the "brightness" of the target speaker.
Production: used in music information retrieval and in fast voice-activity detection.
Debug: a centroid that stays flat through an entire utterance suggests a model collapsing to the mean — a common failure mode when TTS is undertrained.

Zero-crossing rate
Training: feature for voiced/unvoiced classifiers in old-school vocoders.
Production: still used in cheap VAD, cough/cry detection, snoring classifiers.
Debug: a useful sanity check — a voice with very low ZCR and high RMS that suddenly spikes in ZCR is almost certainly hitting a sibilant.
Fig 6 · Matrix view. Where each representation sits in a real TTS or ASR system, and what you’d debug with it.
§ 06 · A short history

Eighty-five years of making speech visible.

Thirteen stops from Dudley’s VODER (1939) to Kokoro-82M (2025). Every representation arose from a frustration with the previous one, and every one has left a residue in the pipeline that rendered the spectrograms above.

Speech-as-picture predates computers. The path below is the shortest possible honest route from Homer Dudley pressing keys at the 1939 World's Fair to Kokoro-82M running on a laptop. Every representation below arose from a concrete frustration with the previous one, and every one of them has left a residue you can still see in today's TTS code.

1939 · electromechanical

VODER — speech as a keyboard instrument

Homer Dudley · Bell Telephone Laboratories · World's Fair, New York

Diagram: a relaxation oscillator (voiced buzz, F0) and a gas-discharge tube (unvoiced hiss), selected by a foot-pedal voicing switch, feed 10 resonance filters (500 Hz – 7500 Hz) and a summing amplifier into a speaker; the operator console adds a wrist bar for pitch inflection and extra keys for the t/d, p/b, k/g stop bursts. A single trained operator drives the keys; the machine is the vocal tract.

Dudley's VODER (Voice Operation Demonstrator) is the first documented machine to produce intelligible continuous speech from a non-speech source. There was no magnetic tape playback, no database of recordings, no computer. The machine had two sound sources — a buzzing relaxation oscillator for voiced sounds, a gas-discharge tube for unvoiced hiss — selected by a foot pedal. Ten bandpass filter keys on the operator's right hand gave fingertip control over ten vocal-tract resonances from roughly 500 Hz to 7.5 kHz. A wrist bar modulated pitch. Three extra keys produced the burst-release timings for stops like /t/, /p/, /k/.

It took the bank of operators — a team of women hand-picked from Bell's phone-service division — roughly a year of daily practice to play a sentence cleanly. A single demonstration required about the same rehearsal as a concerto. The result: a recognizable, somewhat ghostly, clearly human voice under manual control. Audio and film from the World's Fair survive; modern acousticians still point to VODER as the first proof that the vocal tract is, mathematically speaking, a filter bank.

The crucial move was conceptual. Dudley separated what drives the vocal folds from what the throat and mouth do to that source. That separation, and a bank of bandpass filters to shape the result, turned out to be universal enough that every later system is either a direct descendant or a refutation of it.

What survived
The source–filter model. VODER is a voicing source + hiss source + bank of resonance filters — the same three ideas that underlie LPC thirty years later, and every TTS vocoder still implicit in the mel spectrogram.
1952 · acoustic analysis

Sound Spectrograph — visible speech

Potter, Kopp, Green · Bell Labs · 'Visible Speech' (Van Nostrand, 1947; machine commercialized 1952)

Diagram: a magnetic tape loop (2.4 s) is scanned repeatedly while a variable bandpass filter (analog heterodyne) sweeps 0 → 4 kHz on each pass; an electric stylus burns the result onto electrosensitive paper. Frequency runs upward, time runs rightward, darker = louder.

The Sound Spectrograph is the ancestor of every figure on this page. Input speech was recorded onto a magnetic tape loop, typically about 2.4 seconds long. A rotating drum dragged the loop past a playback head over and over. In parallel, a single variable-frequency bandpass filter swept slowly from 0 to about 4 kHz through heterodyning. Each time the filter's center frequency completed another scan, an electric stylus burned a horizontal stripe onto sensitized paper wound around a synchronized drum. Darker stripes meant the filter had found more energy in that band.

The output was a rectangle. Time ran left to right, frequency ran bottom to top, darkness encoded amplitude. Linguists immediately saw that formants — the resonances Dudley's VODER filters had imposed manually — appeared as dark horizontal bands. The spectrograph gave the field its shared vocabulary: F1, F2, F3 are named after the tracks on a 1950 spectrogram. Dennis Klatt would later use spectrograms as targets when tuning his synthesizer rules. The training data of every modern TTS model is, structurally, just a digital spectrograph running on millions of utterances.

The machine is beautiful to look at. It's also incredibly slow: each 2.4-second clip takes about four minutes to render, because the filter has to sweep mechanically. The arrival of the FFT (Cooley & Tukey, 1965) made the process real-time on general-purpose computers and killed the physical spectrograph as a laboratory tool — but the visual it invented has never been replaced.

What survived
The time–frequency–intensity plot. Every diagram on this page, every loss function in modern TTS, every ASR feature — they all live in the coordinate space the 1952 spectrograph made literal.
1968 · source–filter formalism

Linear Prediction Coding — the vocal tract as an all-pole filter

Itakura & Saito · NTT (parallel work by Atal & Schroeder · Bell)

Diagram: an impulse train (voiced, period = 1/F0) or white noise (unvoiced), scaled by a gain g, drives an all-pole filter H(z) = 1/A(z) with 10–16 IIR coefficients to produce synthetic speech. The prediction equation is s[n] = Σ_{k=1..p} a_k · s[n−k] + e[n]: the current sample is a weighted sum of past samples plus the prediction error. The filter's pole pairs in the z-plane sit near the formants.

Linear prediction coding (LPC) formalized the intuition VODER made with wires. The claim: each speech sample can be predicted as a linear combination of the previous p samples plus a residual. The residual carries the excitation (a buzz when voiced, a hiss when not), and the predictor coefficients describe the vocal tract as an all-pole IIR filter. Ten to sixteen coefficients per 20-millisecond frame reproduce speech that is recognizably human.

Mathematically, the poles of H(z) = 1 / A(z) land near the formants. Fit the coefficients via the Levinson–Durbin recursion (O(p²) per frame, against O(p³) for a general linear solve), store them as line spectral pairs (LSP) for numerical robustness, transmit at a few kilobits per second, and you have the backbone of every low-bitrate speech codec of the 1980s and 1990s. The GSM mobile standard's RPE-LTP codec is a direct child of LPC. So is the DoD 2.4 kbit/s LPC-10 military codec. Klatt's formant synthesizer used LPC as a tuning aid.

For TTS, LPC didn't produce natural speech on its own. The residual is hard to predict from text. But the source–filter abstraction — excitation at one end, spectral envelope in the middle, radiation at the output — is the frame that every later vocoder (from Klatt's to WaveNet's) implicitly or explicitly inherits.

What survived
Source–filter decomposition is still how every classical vocoder (WORLD, STRAIGHT), every cellphone codec (AMR, EVS), and the conceptual frame of neural vocoders all think about speech.
1980 · formant synthesis

Klatt synthesizer — hand-tuned rules

Dennis Klatt · MIT · published in JASA 1980; commercialized as DECtalk

Diagram: three sources, voicing (AV), aspiration (AH) and frication (AF), feed two branches: a CASCADE of resonators F1–F5 for voiced vowels, and a PARALLEL bank A2·F2 … A6·F6 for fricatives, stops and nasals; cascade + parallel sum through a radiation characteristic into the synthesized waveform. Example rule for the vowel /i/: F1 = 270 Hz (B1 = 50 Hz), F2 = 2290 Hz (B2 = 60 Hz), F3 = 3010 Hz (B3 = 120 Hz). About 60 parameters, updated every 5 ms, with hand-written rules for every phoneme × context.

Klatt's formant synthesizer was the final triumph of the rule-based era. A voicing source, an aspiration source, and a frication source are combined in two parallel branches: a cascade of up to five digital resonators for voiced speech (each formant feeds the next), and a bank of parallel resonators with per-formant amplitudes for fricatives, affricates, and nasals. About 60 parameters, updated every 5 ms. That is enough to say anything.

The genius was in the rule set. Klatt wrote, by hand, the expected formant trajectories and voicing patterns for every English phoneme in every context. Tens of thousands of rules. The result, commercialized as DECtalk, became the voice that Stephen Hawking used for decades (Klatt's voice model PBJOHN remained on Hawking's speech computer until his death — he explicitly refused upgrades because he had come to identify with it).

What killed Klatt synthesis was naturalness. The speech was perfectly intelligible — more intelligible than most modern TTS for technical terms, numbers, and rare words — but it sounded like a robot. The hand-written rules couldn't capture the fine-grained prosody that makes a voice sound alive. Unit-selection TTS of the 1990s kept the intelligibility and added the naturalness, at the cost of a much larger footprint.

What survived
The idea that a small set of parameters, updated every few milliseconds, can drive high-quality speech — and that prosody is explicit and rule-governed. Modern expressive TTS rediscovers this idea as 'pitch conditioning'.
1990s · unit selection

Concatenative synthesis — cut, stitch, pray

AT&T Natural Voices · Edinburgh Festival · Black & Taylor · Hunt & Black 1996

"hello"target textPhoneme decomposetarget diphones#-hh-eheh-ll-owow-#unit-selection database∼4 hr of studio speech∼40,000 diphone instancesViterbiselected units → PSOLA concatseams audible when target and join costs mismatchtarget cost  Ct — how close each candidate matches prosody contextjoin cost  Cj — spectral continuity at the boundary between consecutive units

Concatenative synthesis replaced rules with recordings. A voice actor sits in a studio for a week and reads a couple thousand carefully chosen sentences. The audio is phonetically segmented, catalogued by linguistic context, and stored. At synthesis time, the system picks diphone or phone units from the database using Viterbi search over two costs: a target cost (how well a candidate matches the target context: preceding phoneme, following phoneme, stress, position in phrase) and a join cost (how spectrally continuous two consecutive candidates are at their boundary).

Inside the recorded domain, the result is astonishing. AT&T's Natural Voices and early Loquendo demos are still hard to distinguish from human speech if you only listen to short, prosaic utterances. The data model is the speaker, so pronunciation is correct by construction. Emotional range is limited by what the voice actor recorded; out-of-domain words (rare names, code-switched words, new brand terms) cause audible seams because the Viterbi search has to force-fit units that don't quite match.

The approach's unforgivable weakness is scale. Every new voice requires a new multi-day recording. Every new style (cheerful, solemn, whispered) either needs its own recorded set or shows audible mismatch. Databases grew to gigabytes — unshippable to mobile. A generation of practitioners wanted to go back to parametric synthesis, but with machine-learned parameters instead of Klatt's hand rules.

What survived
The pattern of choosing data over rules. Also: objective functions that trade off local continuity (join cost) against global target match (target cost) — the same bifurcation reappears in modern TTS prosody losses.
2005 · statistical parametric

HMM-based TTS — parametric, flexible, smooth

HTS Working Group · Nagoya Institute of Technology · Tokuda, Yoshimura, Zen

Diagram: context-dependent labels (quinphones such as #-h+eh-l-ow, h-eh+l-ow-#, eh-l+ow-#) each map to a 5-state left-to-right HMM (s1–s5); every state emits three parameter streams: spectrum (40 MCEP + Δ + ΔΔ), log F0 (an MSD-HMM handling voiced/unvoiced) and duration (an HSMM, one Gaussian per state).

HMM-based TTS returned to the parametric path with a statistical twist. Every phoneme in every context becomes a left-to-right hidden Markov model. Each state emits three parameter streams: a spectral envelope (mel cepstrum + deltas), a pitch contour (log F0 modeled with multi-space distributions that handle voiced/unvoiced regions), and a duration (modeled with a hidden semi-Markov extension so states can stay as long as needed). Training sees labeled speech, estimates Gaussian emissions for each state, then clusters states across contexts using decision trees so rare contexts still get sensible parameters.

At synthesis, the system picks a sequence of states from the decision trees using the target labels, generates parameters, and runs them through a vocoder — typically STRAIGHT or later WORLD. The flexibility was revolutionary. Want to make the voice happier? Retrain only on happy speech. Want a new speaker with an hour of audio? Adapt the existing model with MLLR. Want a new language? Just relabel. The downside was that everything sounded slightly muffled: the max-likelihood trajectory of a Gaussian HMM is inherently over-smoothed. HMM-TTS never reached unit-selection's peak naturalness, but it reached good-enough naturalness everywhere instead of excellent naturalness in-domain.

HTS is the direct ancestor of every modern TTS pipeline. Its labels became FastSpeech's phoneme-level features. Its three-stream modeling became FastPitch's separate pitch head. Its decision tree clusters foreshadowed speaker embeddings. The ideas are load-bearing even where the name is forgotten.

What survived
The habit of training separate heads for duration, pitch, and spectrum — still how expressive neural TTS structures its losses. Also, the dream of a small parametric model of a speaker that you can transfer and blend.
2016 · neural vocoder

WaveNet — autoregressive raw waveform

van den Oord et al. · DeepMind · arXiv 1609.03499

Diagram: a stack of causal convolutions with dilations 1, 2, 4, 8, 16 over the input samples (16 kHz raw audio, mu-law quantized to 256 levels), ending in a softmax over p(x_t | x_1..t−1).

WaveNet modeled raw audio directly: one sample at a time, conditioned on every previous sample in the sequence, with categorical outputs over 256 mu-law-quantized levels. The architectural contribution was the dilated causal convolution: by doubling the dilation rate each layer, a 30-layer stack reaches a receptive field of tens of thousands of samples (seconds of audio) while remaining a feed-forward convolution during training. The network was conditioned on linguistic features — essentially the outputs of a classical TTS front-end — that told it which phoneme, with which pitch, for which speaker.

Naturalness jumped to within a single MOS point of human speech, beating every concatenative system on the same voice. The cost was brutal: because the model is autoregressive at the sample level, inference at 16 kHz requires running the stack 16,000 times per second. On the original GPU implementations a minute of audio took hours. Parallel WaveNet (Oord et al. 2017) distilled the autoregressive model into an inverse-autoregressive flow that generates audio in a single parallel pass, but the distillation was fragile and the student model rarely matched the teacher.

WaveNet did two things that stuck. It proved that a neural vocoder could close the gap between statistical parametric speech and concatenative speech. And its stacked dilated convolutions became the backbone block for essentially every fast follow-up — Parallel WaveGAN, MelGAN, HiFi-GAN — even when they discarded the autoregressive frame.

What survived
The dilated-causal-convolution block (used in every fast waveform model since). The conditioning interface: a stack of linguistic features per audio frame maps into the waveform. And a philosophical point: raw waveforms can be learned end-to-end — the long-standing 'synthesis is too low-level for neural nets' prior was wrong.
2017 · seq2seq with attention

Tacotron — text to mel, end-to-end

Wang et al. · Google · Interspeech 2017 (Tacotron 1); Shen et al. 2018 (Tacotron 2)

Diagram: character embeddings ("HELLO") feed a CBHG encoder (conv + highway + biGRU); an attention matrix α_{i,j} aligns mel frame t to character i; an autoregressive LSTM decoder predicts mel frames; the mel spectrogram (80 × T) goes through Griffin-Lim or WaveNet to a waveform. Training loss: L = L1_mel(pred, target) + L_stop_token, with scheduled sampling on the decoder input; mel → audio was Griffin-Lim in the paper, replaced by WaveNet/HiFi-GAN in practice.

Tacotron was the first successful attempt at “put text in, get audio out, no linguistic front-end” — a single neural network trained end-to-end. An encoder (a CBHG block: 1D convs, highway layers, bidirectional GRU) consumed character embeddings and produced a per-character hidden state. An attention module aligned these hidden states to the target audio timeline. A decoder (an autoregressive LSTM) emitted mel-spectrogram frames one at a time, each conditioned on the previous frame and the attention-weighted encoder states. For the original paper the mel was turned back into audio via Griffin-Lim, the classical non-neural phase-recovery algorithm.

The elegance was the intermediate representation. Text→mel is a manageable supervised learning problem (aligned data at tens of milliseconds, not tens of microseconds). Mel→audio is a separable vocoder problem (trainable with a different loss on a different dataset, including data without paired text). Splitting the problem this way turned out to be a strictly better engineering choice than WaveNet's sample-level modeling for everything except the highest-end server deployments.

Tacotron 2 (2018) swapped the CBHG encoder for stacked convolutional and LSTM layers, used a location-sensitive attention to stabilize alignment on long utterances, and replaced Griffin-Lim with a WaveNet vocoder conditioned on the predicted mel. The result was mean-opinion-score parity with human recordings on short, neutral utterances. This is the point at which “neural TTS” stopped being a research curiosity and became a product surface.

What survived
The mel spectrogram as the canonical internal representation. The encoder–attention–decoder shape. The habit of pairing a text-to-mel model with a separate neural vocoder for the final step. If you train TTS today without using Tacotron's architecture, you are almost certainly reacting against it.
2019 · non-autoregressive · GAN vocoder

FastSpeech + HiFi-GAN — parallel and real-time

Ren et al. (Microsoft · FastSpeech 1/2) · Kong et al. (HiFi-GAN · NeurIPS 2020)

Diagram: a phoneme sequence of length N (h, ə, l) passes through 4 FFT blocks (self-attention + conv); a duration predictor emits a scalar per phoneme; a length regulator repeats each phoneme embedding by its duration, expanding to T frames (h h ə ə ə l l); 4 more FFT blocks emit the 80-bin mel. HiFi-GAN is the fast vocoder: a generator of transposed convolutions and residual blocks trained against a multi-period discriminator (MPD) and a multi-scale discriminator (MSD), with adversarial + mel-L1 + feature-matching losses, producing the output waveform.

FastSpeech replaced Tacotron's autoregressive decoder with a parallel one. The trick was an explicit duration predictor: for each phoneme, predict how many mel frames it should span, then use a length regulator to simply repeat the phoneme embedding that many times. The expanded sequence is decoded by a stack of self-attention + convolution blocks in parallel — no recurrence, no attention to encoder states at inference. FastSpeech 2 added explicit pitch and energy predictors, drawing on the HMM-TTS lineage.

HiFi-GAN solved the vocoder speed problem. Where WaveNet generated one sample at a time and Parallel WaveGAN used a complex flow-based training recipe, HiFi-GAN was a straight convolutional generator trained with two classes of discriminator: a multi-period discriminator (MPD) that looks at the signal rearranged into 2D patches with various period lengths (catching periodic artifacts), and a multi-scale discriminator (MSD) that looks at the signal at full rate, half rate, quarter rate (catching scale-specific distortion). Plus a mel-L1 reconstruction loss. Adversarial feature matching for stability.

Combined, FastSpeech + HiFi-GAN could generate 22-kHz speech at dozens of times real-time on a single consumer GPU, with quality indistinguishable from Tacotron 2 on most listening tests. This is the point at which the “ship a custom TTS” curve crossed below a hundred GPU-hours.

What survived
The length regulator pattern for aligning non-autoregressive TTS. The multi-discriminator adversarial training recipe (MPD + MSD). These two ideas in combination are the reason 2020-era TTS ran in real-time on a single GPU.
2021 · variational inference + adversarial

VITS — end-to-end, no vocoder step

Jaehyeon Kim · Jungil Kong · Juhee Son · NeurIPS 2021

Diagram: a text encoder (phonemes → h_text), monotonic alignment and a normalizing flow define the prior over z ~ N(μ, σ); a posterior encoder (spectrogram → z) and a HiFi-GAN-style decoder (z → waveform) close the loop, with a multi-period discriminator on the output audio. End-to-end training, no separate vocoder step: L = L_recon(x, x̂) + L_kl(q(z | x) ‖ p(z | c_text)) + L_adv(D, G) + L_dur. Monotonic Alignment Search finds the text↔audio alignment without attention. (Kim, Kong, Son — NeurIPS 2021)

VITS collapsed the FastSpeech + HiFi-GAN pipeline into one network. A posterior encoder reads the target spectrogram and emits a latent z. A HiFi-GAN-style decoder reconstructs the waveform from z. A prior over z is learned from the text side via a normalizing flow that transforms a standard Gaussian conditioned on the encoded text. The ELBO ties the two — reconstructing the target waveform and matching the flow-transformed Gaussian at training time. Monotonic Alignment Search (MAS) replaces attention with a dynamic-programming alignment between text and spectrogram frames.

The effect: one optimizer, one training loop, one set of weights. Quality matches or exceeds two-stage FastSpeech + HiFi-GAN while avoiding the mismatch that accumulates when you train text→mel and mel→audio independently. The decoder discriminators (inherited from HiFi-GAN) keep the waveform sharp. The flow prior keeps the latent space multi-modal so the model can express prosodic variation.

VITS was the first open model where “paper-quality end-to-end TTS” was within a few days of training reach for a hobbyist with a single GPU. Its open-source implementation drove most of the community TTS ecosystem (Coqui-AI, SambertHifiGan, StyleTTS) through 2022 and 2023.

What survived
Monotonic Alignment Search (MAS). A single training graph that produces waveforms directly, instead of two pipelines bolted together. This shape reappeared in StyleTTS2, XTTS, and every other 2023-era end-to-end system.
2023 · style diffusion + adversarial decoder

StyleTTS2 — diffusion for style, adversarial for audio

Yinghao Aaron Li et al. · NeurIPS 2023

Diagram: a BERT-style text encoder feeds a style-diffusion module (32-step denoising of the style vector s); prosody and duration predictors are conditioned on s; a mel decoder with style-adaptive LayerNorm (SALN) feeds an iSTFTNet vocoder trained with MRD + MPD + SFD adversarial losses. Style diffusion step: s_{t−1} = s_t − η · ε_θ(s_t, c_text, t); one style vector per utterance, applied via SALN in every layer. Kokoro-82M descends from this architecture.

StyleTTS2 introduced two ideas. First, an entire utterance's style — speaker identity, emotional tone, recording-room texture — is compressed into a single style vector s. This vector is used to modulate every LayerNorm in the acoustic decoder through style-adaptive layer norm (SALN): the affine parameters of LN are themselves predicted from s. You get rich conditioning without adding layers to the decoder trunk.

Second, the style vector is not predicted directly. It is produced by a tiny diffusion model conditioned on the text's semantic embedding. At inference, you run 32 denoising steps on a Gaussian latent and get a style vector. This is the only diffusion step in the pipeline — the audio itself is decoded with a fast iSTFT-based vocoder. The result is diffusion-quality expressivity without diffusion-cost inference.

StyleTTS2 was among the last open architectures before flow matching became the community default for new systems. Its quality on the LJSpeech benchmark was competitive with much larger closed models at ∼150M parameters. Kokoro-82M — the model powering every audio clip on this page — is a smaller, curated training of the same architecture, with the style diffusion frozen into a lookup of pre-computed per-voice style vectors so inference runs on CPU.

What survived
Style-adaptive layer normalization (SALN): a per-utterance style vector modulates every LayerNorm in the decoder. Plus the pattern of using a small diffusion process for style, not for audio. Kokoro-82M and every 'prompt your TTS with a reference clip' product builds on this.
2024continuous & discrete generative

Flow matching and masked generative TTS

F5-TTS (SWIvid), NaturalSpeech 3 (Microsoft), MaskGCT (Amphion)

Diagram: t = 0 → 0.25 → 0.5 → 0.75 → 1, from noise through latent to audio. F5-TTS: integrate a learned vector field v_θ(x_t, t) from noise to mel. MaskGCT: iteratively unmask tokens of a neural codec, confidence-ordered, parallel within a step.

Flow matching is the formulation that superseded DDPM-style diffusion for audio in 2024. Instead of learning the score of a noising process, flow matching directly learns a vector field that transports a simple distribution (pure noise) to the data distribution along straight or near-straight paths. Training loss is an MSE between the predicted vector field and a hand-chosen ground-truth field derived from a noise-to-data coupling. Inference is ODE integration — typically 4 to 16 Euler steps, an order of magnitude fewer than diffusion, with equal or better quality.

F5-TTS (flow matching on a diffusion-transformer trunk) brought this approach to open-source TTS. Unlike VITS or StyleTTS2, F5-TTS is a one-stage system that predicts a mel spectrogram directly via flow matching. It matches or exceeds commercial API TTS on standard MOS tests while being fully reproducible on public data. It is also the first fully permissively licensed (MIT) model where the community can legitimately finetune and redistribute derivatives.

MaskGCT and NaturalSpeech 3 take a different route: the target is a neural audio codec (EnCodec, SoundStream, DAC), which turns audio into a sequence of discrete tokens. The TTS model masks tokens and is trained to predict them from context, like BERT but generative. At inference, the model iteratively unmasks tokens in order of confidence, doing parallel prediction within each step. These discrete-token systems are currently the quality leaders on zero-shot voice cloning benchmarks.

What survived
Flow matching: a simpler training objective than score-based diffusion, fewer inference steps, better quality. Masked generative codec modelling (MaskGCT) for when you want parallel-token decoding over a neural codec instead of continuous latents. Both remain active areas in 2026.
2025 · the samples on this page

Kokoro-82M — the CPU-realtime community model

hexgrad · Apache 2.0 · Hugging Face repo hexgrad/Kokoro-82M

Diagram: G2P (misaki, multilingual) plus a pre-extracted per-voice style vector feed a StyleTTS2 acoustic model (~40M params) with a duration predictor and a SALN mel decoder, finished by iSTFTNet. Spec card: params ≈ 82M · sample rate 24 kHz · mel 80 bins · CPU realtime · Apache 2.0 · 54 voice packs in the stock release.

Kokoro-82M is a distillation and community-curated training of the StyleTTS2 architecture at roughly 82M parameters — about half the size of the reference StyleTTS2 release. The style diffusion step is pre-computed per voice and stored as a fixed embedding; inference is therefore a single forward pass through the acoustic model plus an iSTFTNet vocoder. End-to-end latency on a modern laptop CPU is under 400 ms for a full sentence.

The af_heart, am_michael, and nine other voices used in the voice library above all ship with the stock Kokoro release. Apache 2.0 license. 54 voice packs at the time of writing, covering American, British, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese. The spectrograms in this page's pipeline are rendered from audio that Kokoro generated locally, no API calls.

This is the end of the timeline, for now. The next entries will be the ElevenLabs and OpenAI comparison samples once their keys are wired in. The interesting thing is what the last 85 years have repeatedly shown: every representation introduced here — filter bank, spectrogram, LPC, formants, diphones, HMM parameters, mel-via-attention, dilated convolution audio, flow-matched codec tokens — is still visible somewhere in the pipeline on this page. The ladder is shorter than it looks.

What survived
The main point. In 2025, an open-source 82M-param model trained by one researcher (and community contributions for multilingual voice packs) runs locally, on CPU, with quality that 2019 commercial TTS would have envied. The curve continues.
§ 07 · Reproduce

Run it yourself.

Every figure on this page is generated from five Python scripts against a shared virtualenv. No API calls, no cached renders. Drop a new WAV into the samples directory and rerun — the pipeline will rebuild every spectrogram, every metric, every plate on this page.

Kokoro-82M runs on CPU; end-to-end rebuild under ten minutes on a modern laptop.
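
Stage 1 of the pipeline table below reduces to a few lines; a minimal sketch assuming the kokoro Python package's KPipeline interface (check the Kokoro-82M model card for the current call signature before relying on it):

```python
# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf

# lang_code "a" = American English in Kokoro's voice-pack naming (assumption).
pipeline = KPipeline(lang_code="a")
text = "The quick brown fox jumps over the lazy dog."

# The pipeline yields (graphemes, phonemes, audio) per generated segment at 24 kHz.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"af_heart_{i}.wav", audio, 24000)
```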

The pipeline
Stage | Script | What it does
1 · Synth | scripts/tts-samples/generate.py | Each voice renders the prompt at default settings, no prosody prompting. Saved at 24 kHz mono.
2 · Spectrograms | scripts/tts-samples/spectrograms.py | 128 mel bins, n_fft=1024, hop=256, fmin=0, fmax=8000, log-power dB, magma colormap.
3 · Analyse | scripts/tts-samples/analyze.py | Five lenses + all-metrics.json. F0 via librosa.pyin (70–500 Hz). MFCC 13 coef, DCT-II.
4 · Resynth | scripts/tts-samples/resynthesis.py | mel_to_stft → griffinlim, 64 iterations, no learned prior.
5 · Failures | scripts/tts-samples/failure_modes.py | Generates the six failure-mode plates by DSP-manipulating the healthy reference.
Fig 7 · Trim + pad: 35 dB top-db silence trim, then padded/truncated to a common 2.5 s window so column counts match exactly across every render. Rendered with librosa + matplotlib at 200 DPI.
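
The trim-and-pad step in that caption is two librosa calls; a sketch with the 35 dB threshold and 2.5 s window from the caption (function name is illustrative):

```python
import librosa

TARGET_SECONDS = 2.5

def load_common_window(path: str, sr: int = 24000):
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=35)       # strip leading/trailing silence
    # Pad or truncate so every render has the same number of frames.
    return librosa.util.fix_length(y, size=int(TARGET_SECONDS * sr))
```
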
What this is
  • A reproducible DSP pipeline you can run on any WAV.
  • Matched-format visualisations: same window, same colormap, same axes across every voice.
  • Measured acoustic descriptors, not vibes.
  • A framework ready to accept ElevenLabs, OpenAI, Cartesia, XTTS-v2, F5-TTS samples next.
What this is not
  • Not a quality judgment. You cannot read “better” off a spectrogram.
  • Not a MOS score. For that, see the comparison pages.
  • Not a claim about architecture — visible differences are usually training data, not model topology.
  • Not a vocoder shoot-out. That needs matched-mel input, a separate experiment.


§ 08 · Related

Neighbouring reads.

Cross-references on Codesota that continue the thread.

Speech register
The STT + TTS parent pillar.
ElevenLabs vs OpenAI TTS
Commercial head-to-head.
Best TTS for voice cloning
Zero-shot similarity & consent.
Guide · TTS models
Full landscape overview.
Methodology
How we verify every number.