Codesota · Speech · Voice fingerprints · A reproducible DSP analysis · Vol. II · Issue of April 22, 2026
Deep-dive · Speech

Voices, under the microscope.

Eleven open-source Kokoro-82M voices, rendered through five complementary DSP lenses and a Griffin-Lim round-trip. Every figure comes out of the same Python pipeline; every number is measured, not claimed.

A spectrogram is not a quality judgment. It is a coordinate system in which pitch, timbre, prosody, sibilance and noise floor become legible. The point of this page is to teach the eye to read it — first by showing the same prompt said eleven different ways, then by projecting a single voice through the five lenses used in production TTS and ASR.

Pipeline: librosa + matplotlib. Audio synthesised locally from Kokoro-82M at 24 kHz. Scripts in scripts/tts-samples/.

§ 01 · The voices

Same prompt, eleven voices.

All eleven samples come from Kokoro-82M (Apache 2.0). Same sentence, same sample rate, same window. The only variable is voice identity. Pink borders mark female voices; sand borders mark male. GB flags the two British voices.

Prompt:

The quick brown fox jumps over the lazy dog.

Mosaic · at a glance
af_heart 204 Hz · af_bella 191 Hz · af_nicole 148 Hz (airy) · af_sarah 202 Hz · af_sky 169 Hz · am_michael 119 Hz · am_adam 115 Hz (deepest) · am_fenrir 138 Hz · am_liam 125 Hz · bf_emma 178 Hz (GB, brightest) · bm_george 152 Hz (GB)
Pink borders mark female voices, sand borders mark male, GB flags the British accents; the number next to each voice name is the F0 median (pitch height).
Fig 1 · Border colour encodes gender (pink / sand). The number is F0 median — male voices cluster at 115–152 Hz, female at 148–204 Hz. bf_emma has the brightest sibilance; am_adam and am_michael sit the lowest; af_nicole is visibly sparse — its whispered delivery leaves only 22% of frames voiced.
The eleven · full spectrograms + audio
Kokoro · af_heart · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 154–322 Hz · median 204 Hz · centroid 2807 Hz · voiced ratio 69% of frames (strongly voiced)
Default voice — balanced formants, clean harmonic stack.

Kokoro · af_bella · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 155–288 Hz · median 191 Hz · centroid 3111 Hz · voiced ratio 79% of frames (strongly voiced)
Softer delivery. Second-highest voiced ratio of the female set.

Kokoro · af_nicole · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 141–185 Hz · median 148 Hz · centroid 3126 Hz · voiced ratio 22% of frames (airy / whispered)
Airy, whispered style — voiced ratio drops sharply (22%).

Kokoro · af_sarah · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 131–329 Hz · median 202 Hz · centroid 2860 Hz · voiced ratio 80% of frames (strongly voiced)
Mid-range female voice with the widest F0 swing of the set, and the highest voiced ratio (80%).

Kokoro · af_sky · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 97–275 Hz · median 169 Hz · centroid 2135 Hz · voiced ratio 75% of frames (strongly voiced)
Lower centroid than the other female voices — darker timbre.

Kokoro · am_michael · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 88–174 Hz · median 119 Hz · centroid 2514 Hz · voiced ratio 64% of frames (strongly voiced)
Neutral male. F0 median ~119 Hz.

Kokoro · am_adam · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 89–173 Hz · median 115 Hz · centroid 2188 Hz · voiced ratio 61% of frames (strongly voiced)
Confident, deeper. Lowest F0 median of the set (115 Hz).

Kokoro · am_fenrir · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 70–238 Hz · median 138 Hz · centroid 2423 Hz · voiced ratio 77% of frames (strongly voiced)
Deeper voice with denser harmonic stacking below 1 kHz. Highest voiced ratio of the male voices (77%).

Kokoro · am_liam · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 98–258 Hz · median 125 Hz · centroid 2089 Hz · voiced ratio 75% of frames (strongly voiced)
Pitched slightly above am_adam and am_michael, but with the lowest spectral centroid of the set: the darkest timbre of the eleven.

Kokoro · bf_emma · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 147–255 Hz · median 178 Hz · centroid 3274 Hz · voiced ratio 75% of frames (strongly voiced)
British RP. Highest spectral centroid of the set — brighter sibilance.

Kokoro · bm_george · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz
F0 range 108–225 Hz · median 152 Hz · centroid 2466 Hz · voiced ratio 72% of frames (strongly voiced)
British male. Highest F0 median of the male voices (152 Hz).

§ 02 · Five lenses

Five ways to look at a voice.

One clip — Kokoro-82M / af_heart — five projections. Each reveals a property the others hide. None of them is a quality metric; they are acoustic descriptors used in ASR front-ends, TTS training losses and voice-conversion systems for decades.

Each lens below notes the parameters that produced it; the full librosa calls live in scripts/tts-samples/analyze.py.

Pipeline, one diagram
Text"quick brown fox"PhonemesG2P · ARPAbet or IPAAcoustic modelTacotron · FastSpeechVITS · StyleTTS2Mel spectrogram80–128 bins × framesNeural vocoderHiFi-GAN · BigVGANWaveNet · iSTFTNetWaveform24 kHz PCM
Fig 2 · Every mainstream TTS stack since Tacotron (2017) follows this shape. The mel spectrogram sits one step from both ends — models emit it, humans can read it, vocoders consume it. The Kokoro-82M architecture (StyleTTS2 descendant) is end-to-end, but mel remains the internal quantity trained against.
The five projections
Mel spectrogram · what most TTS vocoders consume
0–8 kHz · 128 mel bins
Log-power perceptual spectrogram. Target of most modern TTS acoustic models, input of every neural vocoder. Harmonic stripes = voiced vowels. Dark columns = stops and word breaks. Bright diffuse regions up top = sibilants.
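
A minimal sketch of this render in librosa, using the parameters from the pipeline table in § 07 (the input path and figure styling are placeholders, not the exact script):

```python
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load a 24 kHz Kokoro render (placeholder path).
y, sr = librosa.load("samples/af_heart.wav", sr=24000)

# Log-power mel spectrogram with the spectrograms.py parameters:
# 128 mel bins, 1024-sample FFT, 256-sample hop, 0-8 kHz.
S = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128, fmin=0, fmax=8000
)
S_db = librosa.power_to_db(S, ref=np.max)  # dB, 0 dB at the loudest bin

fig, ax = plt.subplots(figsize=(10, 3))
librosa.display.specshow(
    S_db, sr=sr, hop_length=256, x_axis="time", y_axis="mel",
    fmax=8000, cmap="magma", ax=ax,
)
fig.savefig("af_heart_mel.png", dpi=200)
```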

Waveform + RMS envelope · what the speaker hears
time-domain
The raw signal. Cyan shading is a short-time RMS envelope — useful for speaking-rate estimation and silence detection, but effectively blind to frequency content.

F0 contour · prosody
librosa.pyin, 70–500 Hz search
Probabilistic YIN tracker. Solid line = pitch; gaps = unvoiced or silent frames. A good F0 contour is how you measure prosody — does the sentence lift into a question, or drop into a declarative?
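
A sketch of the tracker call, with the same 70–500 Hz search as the analysis script (path is a placeholder):

```python
import numpy as np
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Probabilistic YIN; f0 is NaN on unvoiced or silent frames --
# those NaNs are the gaps in the plotted contour.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=70, fmax=500, sr=sr)
times = librosa.times_like(f0, sr=sr)

print(f"median F0: {np.nanmedian(f0):.1f} Hz")
print(f"voiced frames: {np.mean(voiced_flag) * 100:.0f}%")
```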

MFCC · ASR feature space
13 coefficients, DCT of log-mel
The feature that powered HMM speech recognition for three decades and still appears in modern ASR front-ends. First coefficient ≈ total energy; higher coefficients capture progressively finer spectral shape.
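
The computation itself is one call; a sketch (frame settings borrowed from the mel render above, path a placeholder):

```python
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# 13 MFCCs: DCT-II of the log-mel spectrogram, as in the pipeline table.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)

print(mfcc.shape)  # (13, n_frames): row 0 tracks overall energy,
                   # higher rows capture progressively finer spectral shape
```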

Spectral centroid + rolloff (Hz) · brightness over time
Centroid = where the mass of the spectrum sits. Rolloff = below which frequency 85% of the energy lives. Rising lines = sibilant consonants; falling lines = rounded vowels. Backdrop is a faded mel for context.
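
A sketch of both measures on 2048-sample frames, matching the § 05 methods note (path is a placeholder):

```python
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Centroid = spectral centre of mass per frame;
# rolloff = frequency below which 85% of the energy sits.
centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048)[0]
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]

print(f"mean centroid: {centroid.mean():.0f} Hz")
print(f"mean rolloff:  {rolloff.mean():.0f} Hz")
```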

Zero-crossing rate (crossings per frame) · voiced / unvoiced heuristic
How often the waveform crosses zero per frame. Low for vowels, high for fricatives. Combined with RMS, this is a surprisingly competent voiced-frame detector.
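
A sketch of that ZCR-plus-RMS heuristic; the thresholds here are illustrative assumptions, not values taken from analyze.py:

```python
import numpy as np
import librosa

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

frame, hop = 2048, 256
zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame, hop_length=hop)[0]
rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

# Crude heuristic: loud + few zero crossings -> likely a vowel;
# loud + many zero crossings -> likely a fricative. Thresholds are assumptions.
voiced_ish = (rms > 0.5 * rms.mean()) & (zcr < zcr.mean())
sibilant_ish = (rms > 0.5 * rms.mean()) & (zcr > 2 * zcr.mean())

print(f"voiced-ish frames: {voiced_ish.mean() * 100:.0f}%")
print(f"sibilant-ish frames: {sibilant_ish.mean() * 100:.0f}%")
```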

F0 vs brightness · voice space
Scatter plot: F0 median (Hz) on the x-axis against spectral centroid (Hz) on the y-axis, one labelled point per voice (af_heart, af_bella, af_nicole, af_sarah, af_sky, am_michael, am_adam, am_fenrir, am_liam, bf_emma, bm_george), dashed divider at ≈165 Hz, colour-coded female / male / British.
Fig 3 · Two numbers per voice: median F0 (pitch height) and mean spectral centroid (brightness). The ~165 Hz divider splits male and female clusters; the British voices pull harder on the brightness axis than the American — a regularity in Kokoro's training set, not a physical law.
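
If you want to regenerate this scatter without rerunning the DSP, the values in the Fig 5 table in § 05 are enough; a sketch (colours and styling are illustrative, not the page's exact plot code):

```python
import matplotlib.pyplot as plt

# (F0 median Hz, mean centroid Hz) copied from the Fig 5 table in § 05.
voices = {
    "af_heart": (203.8, 2807), "af_bella": (191.2, 3111), "af_nicole": (148.3, 3126),
    "af_sarah": (201.5, 2860), "af_sky": (168.9, 2135), "am_michael": (118.8, 2514),
    "am_adam": (115.0, 2188), "am_fenrir": (138.4, 2423), "am_liam": (124.7, 2089),
    "bf_emma": (178.4, 3274), "bm_george": (151.8, 2466),
}

fig, ax = plt.subplots()
for name, (f0, centroid) in voices.items():
    colour = "#e75480" if name[1] == "f" else "#d2b48c"  # pink = female, sand = male
    ax.scatter(f0, centroid, color=colour)
    ax.annotate(name, (f0, centroid), fontsize=7)

ax.axvline(165, linestyle="--", linewidth=0.8)  # the ~165 Hz male/female divider
ax.set_xlabel("F0 median (Hz)")
ax.set_ylabel("Spectral centroid (Hz)")
fig.savefig("voice_space.png", dpi=200)
```
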
Same voice · three prompts

Fix the voice (af_heart); vary the content. The sibilant sentence pushes energy into the 4–8 kHz band. The stop-consonant sentence carves deep vertical troughs at every /p/ and /b/ closure. Same weights, same window — the differences are entirely linguistic.

Kokoro · af_heart · reference · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

Balanced voiced/unvoiced content.

Kokoro · af_heart · sibilants · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

She sells seashells by the seashore, and six slippery snakes slithered south.

Sibilant-heavy. The 4–8 kHz band lights up.

Kokoro · af_heart · stops · Kokoro-82M · Apache 2.0 · mel, 0–8 kHz

Peter Piper picked a peck of pickled peppers. Bob baked big batches of bread.

Plosive-heavy. Vertical dark columns at every /p/ and /b/ closure.

§ 03 · Resynthesis

Mel → Griffin-Lim → WAV.

A mel spectrogram is magnitude-only. Phase is discarded. To turn mel back into audio you need to invent a phase that is consistent across frames. Modern TTS learns this mapping with a neural vocoder (HiFi-GAN, BigVGAN, iSTFTNet). The classical alternative is Griffin-Lim: iteratively alternate imposing the target magnitude and projecting onto time-consistent phase. It works, but it sounds like a heavily processed voicemail.

64 iterations. No learned prior. librosa.griffinlim on librosa.feature.inverse.mel_to_stft.
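
A sketch of the round-trip named above, assuming the same mel parameters as the rest of the page (the input and output paths are placeholders):

```python
import librosa
import soundfile as sf

y, sr = librosa.load("samples/af_heart.wav", sr=24000)  # placeholder path

# Forward: the same log-power mel used everywhere on this page.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128, fmax=8000, power=2.0
)

# Inverse: mel -> linear-frequency magnitude, then 64 Griffin-Lim iterations
# to invent a phase that is consistent across frames. No learned prior.
S = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=1024, power=2.0, fmax=8000)
y_gl = librosa.griffinlim(S, n_iter=64, hop_length=256, n_fft=1024)

sf.write("af_heart_griffinlim.wav", y_gl, sr)
```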

Kokoro · af_heart · original · Kokoro-82M + vocoder · Apache 2.0 · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

What Kokoro actually outputs: mel through the model's trained neural vocoder.

Kokoro · af_heart · Griffin-Lim · mel → GL, 64 iter · classical DSP · mel, 0–8 kHz

The quick brown fox jumps over the lazy dog.

Same mel, phase recovered iteratively with no learned prior. Listen — speech is intelligible; timbre has been sandpapered.

Side-by-side mel spectrograms: original Kokoro output vs Griffin-Lim reconstruction
Fig 4 · Top, Kokoro's neural vocoder output. Bottom, the same mel reconstructed via mel→STFT inversion plus Griffin-Lim phase recovery (64 iterations). The magnitude envelope is nearly identical — that's what the mel captures. The between-harmonic energy changes — that's what phase carries.
§ 04 · Failure modes

Six things you can see in a spectrogram.

Spectrograms are how production TTS teams debug. The failure modes below show up as visual signatures before they show up in MOS scores. Each plate is a real DSP render from the same pipeline — healthy on top, failure on bottom.

Smeared harmonics

failure mode · 01 / 06
Mel spectrogram: healthy neural vocoder output on top with crisp harmonic stripes; oversmoothed vocoder output on bottom with fogged, low-contrast harmonics.
Cause
Vocoder undertrained, or mel-to-waveform gap. HiFi-GAN trained on studio speech but asked to synthesize a whispered voice.
What it looks like
Horizontal harmonic lines lose contrast; the whole spectrogram looks like fog.
Fix
Fine-tune vocoder on target-domain mel spectrograms. Or use Griffin-Lim baseline to confirm it's the vocoder, not the acoustic model.

Repeat / attention collapse

failure mode · 02 / 06
Mel spectrogram: healthy monotonic alignment on top; attention collapse on bottom with one syllable repeated three times in place.
Cause
Autoregressive TTS (Tacotron 1, early Tacotron 2) loses monotonic alignment and repeats a syllable, or skips words.
What it looks like
A periodic pattern that should be a word stretches into a held tone, or an entire word is missing.
Fix
Switch to FastSpeech/VITS-class non-autoregressive model, or add a monotonic attention constraint (GMM attention, Forward-Backward loss).

Flat prosody

failure mode · 03 / 06
Mel spectrogram: healthy live F0 contour on top with curving harmonic stack; flat-prosody collapse on bottom with perfectly horizontal harmonic stripes at a single F0.
Cause
Acoustic model averaged out pitch variance during training — classic result of using a mean-squared loss on F0.
What it looks like
F0 contour hugs a single value for the whole utterance. The mel spectrogram's harmonic stack stays at a fixed spacing.
Fix
Train with pitch as an explicit conditioning (FastPitch), add adversarial loss, or switch to flow-matching / diffusion.

Pronunciation drift on rare words

failure mode · 04 / 06
Mel spectrogram: healthy formant track matching expected vowels on top; pronunciation drift with wrong formant path and shifted pitch in a middle segment on bottom.
Cause
G2P coverage gap. "Anthropic" becomes "antrophic". Worse on brand names and technical terms.
What it looks like
The word exists (F0 moves, RMS rises), but the formant trajectory doesn't match the expected vowels.
Fix
Add a custom pronunciation lexicon, or use an LLM G2P head, or fall back to ARPAbet input on exact strings.

Clipping / peaks at segment boundaries

failure mode · 05 / 06
Mel spectrogram: healthy continuous waveform on top; streaming chunk boundary on bottom with a bright vertical broadband column at the 1.0 second splice point.
Cause
Streaming TTS concatenates chunks without crossfading.
What it looks like
A vertical bright column across every frequency band, happening exactly at a chunk boundary.
Fix
Overlap-add the chunks (typically 25–50 ms overlap) or emit an extra token of lookahead to let the vocoder decide the transition.
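
A minimal sketch of the overlap-add fix on two raw chunks; the 30 ms overlap and linear ramps are illustrative choices inside the 25–50 ms range above:

```python
import numpy as np

def crossfade_concat(a: np.ndarray, b: np.ndarray, sr: int = 24000,
                     overlap_ms: float = 30.0) -> np.ndarray:
    """Join two audio chunks with a linear crossfade instead of a hard splice."""
    n = min(int(sr * overlap_ms / 1000), len(a), len(b))
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = 1.0 - fade_out
    # Mix the tail of `a` into the head of `b` over the overlap window.
    seam = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], seam, b[n:]])

# chunk1, chunk2 = ...  # consecutive chunks from the streaming synthesizer
# audio = crossfade_concat(chunk1, chunk2)
```

Even a short linear ramp removes the broadband click; an equal-power (cosine) ramp holds perceived loudness a little steadier across the seam.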

Artifacts in sibilants

failure mode · 06 / 06
Mel spectrogram: healthy full 0–8 kHz band on top; low-pass-filtered output on bottom with the 4–8 kHz region dimmed during sibilant frames.
Cause
Vocoder sampling rate too low, or mel cut-off below 8 kHz.
What it looks like
Sibilants (/s/, /sh/) sound muffled; the spectrogram's 4–8 kHz region is dim even on clearly pronounced fricatives.
Fix
Raise sample rate to 24 or 48 kHz end-to-end. Extend mel fmax to 12 kHz.
§ 05 · Measured

Acoustic properties, to three digits.

Computed from the same clips as the spectrograms above. Descriptive, not evaluative. Pitch height, brightness and voiced ratio differ across voices — that is true by construction, not by training-run quality.

F0 via librosa.pyin. Centroid + rolloff via STFT on 2048-sample frames. Voiced ratio = fraction of frames where pyin returned a valid F0 estimate.
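
Those definitions map onto a handful of librosa calls; a sketch of a per-clip metrics function (key names and rounding are illustrative, not necessarily what all-metrics.json uses):

```python
import numpy as np
import librosa

def voice_metrics(path: str, sr: int = 24000) -> dict:
    y, _ = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(y, fmin=70, fmax=500, sr=sr)
    voiced = f0[~np.isnan(f0)]  # frames where pyin returned a valid F0 estimate
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=2048)[0]
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.85)[0]
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=2048)[0]
    return {
        "f0_median_hz": float(np.median(voiced)),
        "f0_range_hz": (float(voiced.min()), float(voiced.max())),
        "f0_sigma_hz": float(np.std(voiced)),
        "centroid_hz": float(centroid.mean()),
        "rolloff85_hz": float(rolloff.mean()),
        "voiced_ratio": float(len(voiced) / len(f0)),
        "zcr": float(zcr.mean()),
    }
```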

Per-voice · measured
Voice | F0 median (Hz) | F0 range (Hz) | F0 σ (Hz) | Centroid (Hz) | Rolloff 85% (Hz) | Voiced | ZCR
Kokoro · af_heart | 203.8 | 153.6–321.6 | 36.6 | 2807 | 4929 | 69% | 0.143
Kokoro · af_bella | 191.2 | 155.3–288.2 | 29.8 | 3111 | 5759 | 79% | 0.164
Kokoro · af_nicole | 148.3 | 140.8–184.7 | 10.2 | 3126 | 5776 | 22% | 0.174
Kokoro · af_sarah | 201.5 | 130.6–329.2 | 43.6 | 2860 | 5090 | 80% | 0.148
Kokoro · af_sky | 168.9 | 97.3–275.2 | 40.2 | 2135 | 3512 | 75% | 0.124
Kokoro · am_michael | 118.8 | 88.2–174.4 | 22.7 | 2514 | 4944 | 64% | 0.106
Kokoro · am_adam | 115 | 89.2–173.4 | 18.4 | 2188 | 3863 | 61% | 0.120
Kokoro · am_fenrir | 138.4 | 70–238.2 | 38.9 | 2423 | 4367 | 77% | 0.123
Kokoro · am_liam | 124.7 | 97.9–258.2 | 39 | 2089 | 3879 | 75% | 0.106
Kokoro · bf_emma | 178.4 | 146.6–255.3 | 23.3 | 3274 | 5969 | 75% | 0.170
Kokoro · bm_george | 151.8 | 108–224.8 | 24.1 | 2466 | 4657 | 72% | 0.113
Fig 5 · F0 in Hz. Centroid and rolloff in Hz. Voiced ratio as a percentage of frames in the 2.5-second window. These are per-clip summaries of the frame-level series; downstream code works with the time-series, not the summaries.
Each lens · in production
Mel spectrogram
Training: target of Tacotron/FastSpeech/StyleTTS2 acoustic models; L1/L2 loss between predicted and ground-truth mel.
Production: input to every neural vocoder (HiFi-GAN, BigVGAN, iSTFTNet, UnivNet).
Debug: first place you look when TTS output sounds wrong. Muddy formants = bad acoustic model. Smeared harmonics = bad vocoder.

MFCC
Training: classical ASR acoustic-model input (HMM-GMM era). Still the default for speaker diarization and wake-word detection.
Production: most cloud ASR preprocesses to a mel-filterbank or MFCC-adjacent feature before the neural encoder.
Debug: coefficient 0 = total energy; higher coefficients = progressively finer spectral detail. Watch coefficient 1 for bass/treble drift.

F0 contour
Training: prosody loss in expressive TTS (FastPitch, EmoTTS); also the primary signal for singing-voice synthesis.
Production: voice-conversion systems use F0 trajectories as the pitch skeleton while replacing timbre.
Debug: gaps mean the tracker failed — usually because the signal was too noisy or the voice was whispered. Jumps of an octave are pyin doubling/halving errors.

Spectral centroid / rolloff
Training: auxiliary loss in some TTS systems to match the "brightness" of the target speaker.
Production: used in music information retrieval and in fast voice-activity detection.
Debug: a centroid that stays flat through an entire utterance suggests a model collapsing to the mean — a common failure mode when TTS is undertrained.

Zero-crossing rate
Training: feature for voiced/unvoiced classifiers in old-school vocoders.
Production: still used in cheap VAD, cough/cry detection, snoring classifiers.
Debug: a useful sanity check — a voice with very low ZCR and high RMS that suddenly spikes in ZCR is almost certainly hitting a sibilant.
Fig 6 · Matrix view. Where each representation sits in a real TTS or ASR system, and what you’d debug with it.
§ 06 · A short history

Eighty-five years of making speech visible.

Thirteen stops from Dudley’s VODER (1939) to Kokoro-82M (2025). Every representation arose from a frustration with the previous one, and every one has left a residue in the pipeline that rendered the spectrograms above.

Speech-as-picture predates computers. The path below is the shortest possible honest route from Homer Dudley pressing keys at the 1939 World's Fair to Kokoro-82M running on a laptop. Every representation below arose from a concrete frustration with the previous one, and every one of them has left a residue you can still see in today's TTS code.

1939 · electromechanical

VODER — speech as a keyboard instrument

Homer Dudley · Bell Telephone Laboratories · World's Fair, New York

Diagram: a relaxation oscillator (voiced buzz, F0) and a gas-discharge tube (unvoiced hiss), selected by a foot-pedal voicing switch, feed 10 resonance filters (500 Hz – 7500 Hz) and a summing amplifier into a speaker; the operator console adds a wrist bar for pitch inflection and extra keys for the t/d, p/b, k/g stop bursts. A single trained operator drives the keys; the machine is the vocal tract.

Dudley's VODER (Voice Operation Demonstrator) is the first documented machine to produce intelligible continuous speech from a non-speech source. There was no magnetic tape playback, no database of recordings, no computer. The machine had two sound sources — a buzzing relaxation oscillator for voiced sounds, a gas-discharge tube for unvoiced hiss — selected by a foot pedal. Ten bandpass filter keys on the operator's right hand gave fingertip control over ten vocal-tract resonances from roughly 500 Hz to 7.5 kHz. A wrist bar modulated pitch. Three extra keys produced the burst-release timings for stops like /t/, /p/, /k/.

It took the bank of operators — a team of women hand-picked from Bell's phone-service division — roughly a year of daily practice to play a sentence cleanly. A single demonstration required about the same rehearsal as a concerto. The result: a recognizable, somewhat ghostly, clearly human voice under manual control. Audio and film from the World's Fair survive; modern acousticians still point to VODER as the first proof that the vocal tract is, mathematically speaking, a filter bank.

The crucial move was conceptual. Dudley separated what drives the vocal folds from what the throat and mouth do to that source. That separation, and a bank of bandpass filters to shape the result, turned out to be universal enough that every later system is either a direct descendant or a refutation of it.

What survived
The source–filter model. VODER is a voicing source + hiss source + bank of resonance filters — the same three ideas that underlie LPC thirty years later, and every TTS vocoder still implicit in the mel spectrogram.
1952 · acoustic analysis

Sound Spectrograph — visible speech

Potter, Kopp, Green · Bell Labs · 'Visible Speech' (Van Nostrand, 1947; machine commercialized 1952)

Diagram: a magnetic tape loop (2.4 s) is scanned repeatedly while a variable bandpass filter (analog heterodyne) sweeps 0 → 4 kHz on each pass; an electric stylus burns the result onto electrosensitive paper. Frequency runs upward, time runs rightward, darker = louder.

The Sound Spectrograph is the ancestor of every figure on this page. Input speech was recorded onto a magnetic tape loop, typically about 2.4 seconds long. A rotating drum dragged the loop past a playback head over and over. In parallel, a single variable-frequency bandpass filter swept slowly from 0 to about 4 kHz through heterodyning. Each time the filter's center frequency completed another scan, an electric stylus burned a horizontal stripe onto sensitized paper wound around a synchronized drum. Darker stripes meant the filter had found more energy in that band.

The output was a rectangle. Time ran left to right, frequency ran bottom to top, darkness encoded amplitude. Linguists immediately saw that formants — the resonances Dudley's VODER filters had imposed manually — appeared as dark horizontal bands. The spectrograph gave the field its shared vocabulary: F1, F2, F3 are named after the tracks on a 1950 spectrogram. Dennis Klatt would later use spectrograms as targets when tuning his synthesizer rules. The training data of every modern TTS model is, structurally, just a digital spectrograph running on millions of utterances.

The machine is beautiful to look at. It's also incredibly slow: each 2.4-second clip takes about four minutes to render, because the filter has to sweep mechanically. The arrival of the FFT (Cooley & Tukey, 1965) made the process real-time on general-purpose computers and killed the physical spectrograph as a laboratory tool — but the visual it invented has never been replaced.

What survived
The time–frequency–intensity plot. Every diagram on this page, every loss function in modern TTS, every ASR feature — they all live in the coordinate space the 1952 spectrograph made literal.
1968 · source–filter formalism

Linear Prediction Coding — the vocal tract as an all-pole filter

Itakura & Saito · NTT (parallel work by Atal & Schroeder · Bell)

Diagram: an impulse train (voiced, period = 1/F0) or white noise (unvoiced), scaled by a gain g, drives an all-pole filter H(z) = 1/A(z) with 10–16 IIR coefficients to produce synthetic speech. The prediction equation is s[n] = Σ_{k=1..p} a_k · s[n−k] + e[n]: the current sample is a weighted sum of past samples plus the prediction error. The filter's pole pairs in the z-plane sit near the formants.

Linear prediction coding (LPC) formalized the intuition VODER made with wires. The claim: each speech sample can be predicted as a linear combination of the previous p samples plus a residual. The residual carries the excitation (a buzz when voiced, a hiss when not), and the predictor coefficients describe the vocal tract as an all-pole IIR filter. Ten to sixteen coefficients per 20-millisecond frame reproduce speech that is recognizably human.

Mathematically, the poles of H(z) = 1 / A(z) land near the formants. Fit the coefficients via the Levinson–Durbin recursion (O(p²) per frame, against O(p³) for a general linear solve), store them as line spectral pairs (LSP) for numerical robustness, transmit at a few kilobits per second, and you have the backbone of every low-bitrate speech codec of the 1980s and 1990s. The GSM mobile standard's RPE-LTP codec is a direct child of LPC. So is the DoD 2.4 kbit/s LPC-10 military codec. Klatt's formant synthesizer used LPC as a tuning aid.

For TTS, LPC didn't produce natural speech on its own. The residual is hard to predict from text. But the source–filter abstraction — excitation at one end, spectral envelope in the middle, radiation at the output — is the frame that every later vocoder (from Klatt's to WaveNet's) implicitly or explicitly inherits.

What survived
Source–filter decomposition is still how every classical vocoder (WORLD, STRAIGHT), every cellphone codec (AMR, EVS), and the conceptual frame of neural vocoders all think about speech.
1980 · formant synthesis

Klatt synthesizer — hand-tuned rules

Dennis Klatt · MIT · published in JASA 1980; commercialized as DECtalk

Diagram: three sources, voicing (AV), aspiration (AH) and frication (AF), feed two branches: a CASCADE of resonators F1–F5 for voiced vowels, and a PARALLEL bank A2·F2 … A6·F6 for fricatives, stops and nasals; cascade + parallel sum through a radiation characteristic into the synthesized waveform. Example rule for the vowel /i/: F1 = 270 Hz (B1 = 50 Hz), F2 = 2290 Hz (B2 = 60 Hz), F3 = 3010 Hz (B3 = 120 Hz). About 60 parameters, updated every 5 ms, with hand-written rules for every phoneme × context.

Klatt's formant synthesizer was the final triumph of the rule-based era. A voicing source, an aspiration source, and a frication source are combined in two parallel branches: a cascade of up to five digital resonators for voiced speech (each formant feeds the next), and a bank of parallel resonators with per-formant amplitudes for fricatives, affricates, and nasals. About 60 parameters, updated every 5 ms. That is enough to say anything.

The genius was in the rule set. Klatt wrote, by hand, the expected formant trajectories and voicing patterns for every English phoneme in every context. Tens of thousands of rules. The result, commercialized as DECtalk, became the voice that Stephen Hawking used for decades (Klatt's voice model PBJOHN remained on Hawking's speech computer until his death — he explicitly refused upgrades because he had come to identify with it).

What killed Klatt synthesis was naturalness. The speech was perfectly intelligible — more intelligible than most modern TTS for technical terms, numbers, and rare words — but it sounded like a robot. The hand-written rules couldn't capture the fine-grained prosody that makes a voice sound alive. Unit-selection TTS of the 1990s kept the intelligibility and added the naturalness, at the cost of a much larger footprint.

What survived
The idea that a small set of parameters, updated every few milliseconds, can drive high-quality speech — and that prosody is explicit and rule-governed. Modern expressive TTS rediscovers this idea as 'pitch conditioning'.
1990s · unit selection

Concatenative synthesis — cut, stitch, pray

AT&T Natural Voices · Edinburgh Festival · Black & Taylor · Hunt & Black 1996

"hello"target textPhoneme decomposetarget diphones#-hh-eheh-ll-owow-#unit-selection database∼4 hr of studio speech∼40,000 diphone instancesViterbiselected units → PSOLA concatseams audible when target and join costs mismatchtarget cost  Ct — how close each candidate matches prosody contextjoin cost  Cj — spectral continuity at the boundary between consecutive units

Concatenative synthesis replaced rules with recordings. A voice actor sits in a studio for a week and reads a couple thousand carefully chosen sentences. The audio is phonetically segmented, catalogued by linguistic context, and stored. At synthesis time, the system picks diphone or phone units from the database using Viterbi search over two costs: a target cost (how well a candidate matches the target context: preceding phoneme, following phoneme, stress, position in phrase) and a join cost (how spectrally continuous two consecutive candidates are at their boundary).

Inside the recorded domain, the result is astonishing. AT&T's Natural Voices and early Loquendo demos are still hard to distinguish from human speech if you only listen to short, prosaic utterances. The data model is the speaker, so pronunciation is correct by construction. Emotional range is limited by what the voice actor recorded; out-of-domain words (rare names, code-switched words, new brand terms) cause audible seams because the Viterbi search has to force-fit units that don't quite match.

The approach's unforgivable weakness is scale. Every new voice requires a new multi-day recording. Every new style (cheerful, solemn, whispered) either needs its own recorded set or shows audible mismatch. Databases grew to gigabytes — unshippable to mobile. A generation of practitioners wanted to go back to parametric synthesis, but with machine-learned parameters instead of Klatt's hand rules.

What survived
The pattern of choosing data over rules. Also: objective functions that trade off local continuity (join cost) against global target match (target cost) — the same bifurcation reappears in modern TTS prosody losses.
2005 · statistical parametric

HMM-based TTS — parametric, flexible, smooth

HTS Working Group · Nagoya Institute of Technology · Tokuda, Yoshimura, Zen

Diagram: context-dependent labels (quinphones such as #-h+eh-l-ow, h-eh+l-ow-#, eh-l+ow-#) each map to a 5-state left-to-right HMM (s1–s5); every state emits three parameter streams: spectrum (40 MCEP + Δ + ΔΔ), log F0 (an MSD-HMM handling voiced/unvoiced) and duration (an HSMM, one Gaussian per state).

HMM-based TTS returned to the parametric path with a statistical twist. Every phoneme in every context becomes a left-to-right hidden Markov model. Each state emits three parameter streams: a spectral envelope (mel cepstrum + deltas), a pitch contour (log F0 modeled with multi-space distributions that handle voiced/unvoiced regions), and a duration (modeled with a hidden semi-Markov extension so states can stay as long as needed). Training sees labeled speech, estimates Gaussian emissions for each state, then clusters states across contexts using decision trees so rare contexts still get sensible parameters.

At synthesis, the system picks a sequence of states from the decision trees using the target labels, generates parameters, and runs them through a vocoder — typically STRAIGHT or later WORLD. The flexibility was revolutionary. Want to make the voice happier? Retrain only on happy speech. Want a new speaker with an hour of audio? Adapt the existing model with MLLR. Want a new language? Just relabel. The downside was that everything sounded slightly muffled: the max-likelihood trajectory of a Gaussian HMM is inherently over-smoothed. HMM-TTS never reached unit-selection's peak naturalness, but it reached good-enough naturalness everywhere instead of excellent naturalness in-domain.

HTS is the direct ancestor of every modern TTS pipeline. Its labels became FastSpeech's phoneme-level features. Its three-stream modeling became FastPitch's separate pitch head. Its decision tree clusters foreshadowed speaker embeddings. The ideas are load-bearing even where the name is forgotten.

What survived
The habit of training separate heads for duration, pitch, and spectrum — still how expressive neural TTS structures its losses. Also, the dream of a small parametric model of a speaker that you can transfer and blend.
2016 · neural vocoder

WaveNet — autoregressive raw waveform

van den Oord et al. · DeepMind · arXiv 1609.03499

Diagram: a stack of causal convolutions with dilations 1, 2, 4, 8, 16 over the input samples (16 kHz raw audio, mu-law quantized to 256 levels), ending in a softmax over p(x_t | x_1..t−1).

WaveNet modeled raw audio directly: one sample at a time, conditioned on every previous sample in the sequence, with categorical outputs over 256 mu-law-quantized levels. The architectural contribution was the dilated causal convolution: by doubling the dilation rate each layer, a 30-layer stack reaches a receptive field of tens of thousands of samples (seconds of audio) while remaining a feed-forward convolution during training. The network was conditioned on linguistic features — essentially the outputs of a classical TTS front-end — that told it which phoneme, with which pitch, for which speaker.

Naturalness jumped to within a single MOS point of human speech, beating every concatenative system on the same voice. The cost was brutal: because the model is autoregressive at the sample level, inference at 16 kHz requires running the stack 16,000 times per second. On the original GPU implementations a minute of audio took hours. Parallel WaveNet (Oord et al. 2017) distilled the autoregressive model into an inverse-autoregressive flow that generates audio in a single parallel pass, but the distillation was fragile and the student model rarely matched the teacher.

WaveNet did two things that stuck. It proved that a neural vocoder could close the gap between statistical parametric speech and concatenative speech. And its stacked dilated convolutions became the backbone block for essentially every fast follow-up — Parallel WaveGAN, MelGAN, HiFi-GAN — even when they discarded the autoregressive frame.

What survived
The dilated-causal-convolution block (used in every fast waveform model since). The conditioning interface: a stack of linguistic features per audio frame maps into the waveform. And a philosophical point: raw waveforms can be learned end-to-end — the long-standing 'synthesis is too low-level for neural nets' prior was wrong.
2017 · seq2seq with attention

Tacotron — text to mel, end-to-end

Wang et al. · Google · Interspeech 2017 (Tacotron 1); Shen et al. 2018 (Tacotron 2)

Diagram: character embeddings ("HELLO") feed a CBHG encoder (conv + highway + biGRU); an attention matrix α_{i,j} aligns mel frame t to character i; an autoregressive LSTM decoder predicts mel frames; the mel spectrogram (80 × T) goes through Griffin-Lim or WaveNet to a waveform. Training loss: L = L1_mel(pred, target) + L_stop_token, with scheduled sampling on the decoder input; mel → audio was Griffin-Lim in the paper, replaced by WaveNet/HiFi-GAN in practice.

Tacotron was the first successful attempt at “put text in, get audio out, no linguistic front-end” — a single neural network trained end-to-end. An encoder (a CBHG block: 1D convs, highway layers, bidirectional GRU) consumed character embeddings and produced a per-character hidden state. An attention module aligned these hidden states to the target audio timeline. A decoder (an autoregressive LSTM) emitted mel-spectrogram frames one at a time, each conditioned on the previous frame and the attention-weighted encoder states. For the original paper the mel was turned back into audio via Griffin-Lim, the classical non-neural phase-recovery algorithm.

The elegance was the intermediate representation. Text→mel is a manageable supervised learning problem (aligned data at tens of milliseconds, not tens of microseconds). Mel→audio is a separable vocoder problem (trainable with a different loss on a different dataset, including data without paired text). Splitting the problem this way turned out to be a strictly better engineering choice than WaveNet's sample-level modeling for everything except the highest-end server deployments.

Tacotron 2 (2018) swapped the CBHG encoder for stacked convolutional and LSTM layers, used a location-sensitive attention to stabilize alignment on long utterances, and replaced Griffin-Lim with a WaveNet vocoder conditioned on the predicted mel. The result was mean-opinion-score parity with human recordings on short, neutral utterances. This is the point at which “neural TTS” stopped being a research curiosity and became a product surface.

What survived
The mel spectrogram as the canonical internal representation. The encoder–attention–decoder shape. The habit of pairing a text-to-mel model with a separate neural vocoder for the final step. If you train TTS today without using Tacotron's architecture, you are almost certainly reacting against it.
2019 · non-autoregressive · GAN vocoder

FastSpeech + HiFi-GAN — parallel and real-time

Ren et al. (Microsoft · FastSpeech 1/2) · Kong et al. (HiFi-GAN · NeurIPS 2020)

Diagram: a phoneme sequence of length N (h, ə, l) passes through 4 FFT blocks (self-attention + conv); a duration predictor emits a scalar per phoneme; a length regulator repeats each phoneme embedding by its duration, expanding to T frames (h h ə ə ə l l); 4 more FFT blocks emit the 80-bin mel. HiFi-GAN is the fast vocoder: a generator of transposed convolutions and residual blocks trained against a multi-period discriminator (MPD) and a multi-scale discriminator (MSD), with adversarial + mel-L1 + feature-matching losses, producing the output waveform.

FastSpeech replaced Tacotron's autoregressive decoder with a parallel one. The trick was an explicit duration predictor: for each phoneme, predict how many mel frames it should span, then use a length regulator to simply repeat the phoneme embedding that many times. The expanded sequence is decoded by a stack of self-attention + convolution blocks in parallel — no recurrence, no attention to encoder states at inference. FastSpeech 2 added explicit pitch and energy predictors, drawing on the HMM-TTS lineage.

HiFi-GAN solved the vocoder speed problem. Where WaveNet generated one sample at a time and Parallel WaveGAN used a complex flow-based training recipe, HiFi-GAN was a straight convolutional generator trained with two classes of discriminator: a multi-period discriminator (MPD) that looks at the signal rearranged into 2D patches with various period lengths (catching periodic artifacts), and a multi-scale discriminator (MSD) that looks at the signal at full rate, half rate, quarter rate (catching scale-specific distortion). Plus a mel-L1 reconstruction loss. Adversarial feature matching for stability.

Combined, FastSpeech + HiFi-GAN could generate 22-kHz speech at dozens of times real-time on a single consumer GPU, with quality indistinguishable from Tacotron 2 on most listening tests. This is the point at which the “ship a custom TTS” curve crossed below a hundred GPU-hours.

What survived
The length regulator pattern for aligning non-autoregressive TTS. The multi-discriminator adversarial training recipe (MPD + MSD). These two ideas in combination are the reason 2020-era TTS ran in real-time on a single GPU.
2021 · variational inference + adversarial

VITS — end-to-end, no vocoder step

Jaehyeon Kim · Jungil Kong · Juhee Son · NeurIPS 2021

Diagram: a text encoder (phonemes → h_text), monotonic alignment and a normalizing flow define the prior over z ~ N(μ, σ); a posterior encoder (spectrogram → z) and a HiFi-GAN-style decoder (z → waveform) close the loop, with a multi-period discriminator on the output audio. End-to-end training, no separate vocoder step: L = L_recon(x, x̂) + L_kl(q(z | x) ‖ p(z | c_text)) + L_adv(D, G) + L_dur. Monotonic Alignment Search finds the text↔audio alignment without attention. (Kim, Kong, Son — NeurIPS 2021)

VITS collapsed the FastSpeech + HiFi-GAN pipeline into one network. A posterior encoder reads the target spectrogram and emits a latent z. A HiFi-GAN-style decoder reconstructs the waveform from z. A prior over z is learned from the text side via a normalizing flow that transforms a standard Gaussian conditioned on the encoded text. The ELBO ties the two — reconstructing the target waveform and matching the flow-transformed Gaussian at training time. Monotonic Alignment Search (MAS) replaces attention with a dynamic-programming alignment between text and spectrogram frames.

The effect: one optimizer, one training loop, one set of weights. Quality matches or exceeds two-stage FastSpeech + HiFi-GAN while avoiding the mismatch that accumulates when you train text→mel and mel→audio independently. The decoder discriminators (inherited from HiFi-GAN) keep the waveform sharp. The flow prior keeps the latent space multi-modal so the model can express prosodic variation.

VITS was the first open model where “paper-quality end-to-end TTS” was within a few days of training reach for a hobbyist with a single GPU. Its open-source implementation drove most of the community TTS ecosystem (Coqui-AI, SambertHifiGan, StyleTTS) through 2022 and 2023.

What survived
Monotonic Alignment Search (MAS). A single training graph that produces waveforms directly, instead of two pipelines bolted together. This shape reappeared in StyleTTS2, XTTS, and every other 2023-era end-to-end system.
2023 · style diffusion + adversarial decoder

StyleTTS2 — diffusion for style, adversarial for audio

Yinghao Aaron Li et al. · NeurIPS 2023

Diagram: a BERT-style text encoder feeds a style-diffusion module (32-step denoising of the style vector s); prosody and duration predictors are conditioned on s; a mel decoder with style-adaptive LayerNorm (SALN) feeds an iSTFTNet vocoder trained with MRD + MPD + SFD adversarial losses. Style diffusion step: s_{t−1} = s_t − η · ε_θ(s_t, c_text, t); one style vector per utterance, applied via SALN in every layer. Kokoro-82M descends from this architecture.

StyleTTS2 introduced two ideas. First, an entire utterance's style — speaker identity, emotional tone, recording-room texture — is compressed into a single style vector s. This vector is used to modulate every LayerNorm in the acoustic decoder through style-adaptive layer norm (SALN): the affine parameters of LN are themselves predicted from s. You get rich conditioning without adding layers to the decoder trunk.

Second, the style vector is not predicted directly. It is produced by a tiny diffusion model conditioned on the text's semantic embedding. At inference, you run 32 denoising steps on a Gaussian latent and get a style vector. This is the only diffusion step in the pipeline — the audio itself is decoded with a fast iSTFT-based vocoder. The result is diffusion-quality expressivity without diffusion-cost inference.

StyleTTS2 was among the last open architectures before flow matching became the community default for new systems. Its quality on the LJSpeech benchmark was competitive with much larger closed models at ∼150M parameters. Kokoro-82M — the model powering every audio clip on this page — is a smaller, curated training of the same architecture, with the style diffusion frozen into a lookup of pre-computed per-voice style vectors so inference runs on CPU.

What survived
Style-adaptive layer normalization (SALN): a per-utterance style vector modulates every LayerNorm in the decoder. Plus the pattern of using a small diffusion process for style, not for audio. Kokoro-82M and every 'prompt your TTS with a reference clip' product builds on this.
2024continuous & discrete generative

Flow matching and masked generative TTS

F5-TTS (SWIvid), NaturalSpeech 3 (Microsoft), MaskGCT (Amphion)

Diagram: t = 0 → 0.25 → 0.5 → 0.75 → 1, from noise through latent to audio. F5-TTS: integrate a learned vector field v_θ(x_t, t) from noise to mel. MaskGCT: iteratively unmask tokens of a neural codec, confidence-ordered, parallel within a step.

Flow matching is the formulation that superseded DDPM-style diffusion for audio in 2024. Instead of learning the score of a noising process, flow matching directly learns a vector field that transports a simple distribution (pure noise) to the data distribution along straight or near-straight paths. Training loss is an MSE between the predicted vector field and a hand-chosen ground-truth field derived from a noise-to-data coupling. Inference is ODE integration — typically 4 to 16 Euler steps, an order of magnitude fewer than diffusion, with equal or better quality.

F5-TTS (flow matching on a diffusion-transformer trunk) brought this approach to open-source TTS. Unlike VITS or StyleTTS2, F5-TTS is a one-stage system that predicts a mel spectrogram directly via flow matching. It matches or exceeds commercial API TTS on standard MOS tests while being fully reproducible on public data. It is also the first fully permissively licensed (MIT) model where the community can legitimately finetune and redistribute derivatives.

MaskGCT and NaturalSpeech 3 take a different route: the target is a neural audio codec (EnCodec, SoundStream, DAC), which turns audio into a sequence of discrete tokens. The TTS model masks tokens and is trained to predict them from context, like BERT but generative. At inference, the model iteratively unmasks tokens in order of confidence, doing parallel prediction within each step. These discrete-token systems are currently the quality leaders on zero-shot voice cloning benchmarks.

What survived
Flow matching: a simpler training objective than score-based diffusion, fewer inference steps, better quality. Masked generative codec modelling (MaskGCT) for when you want parallel-token decoding over a neural codec instead of continuous latents. Both remain active areas in 2026.
2025 · the samples on this page

Kokoro-82M — the CPU-realtime community model

hexgrad · Apache 2.0 · Hugging Face repo hexgrad/Kokoro-82M

Diagram: G2P (misaki, multilingual) plus a pre-extracted per-voice style vector feed a StyleTTS2 acoustic model (~40M params) with a duration predictor and a SALN mel decoder, finished by iSTFTNet. Spec card: params ≈ 82M · sample rate 24 kHz · mel 80 bins · CPU realtime · Apache 2.0 · 54 voice packs in the stock release.

Kokoro-82M is a distillation and community-curated training of the StyleTTS2 architecture at roughly 82M parameters — about half the size of the reference StyleTTS2 release. The style diffusion step is pre-computed per voice and stored as a fixed embedding; inference is therefore a single forward pass through the acoustic model plus an iSTFTNet vocoder. End-to-end latency on a modern laptop CPU is under 400 ms for a full sentence.

The af_heart, am_michael, and nine other voices used in the voice library above all ship with the stock Kokoro release. Apache 2.0 license. 54 voice packs at the time of writing, covering American, British, Japanese, Chinese, Spanish, French, Hindi, Italian, Portuguese. The spectrograms in this page's pipeline are rendered from audio that Kokoro generated locally, no API calls.

This is the end of the timeline, for now. The next entries will be the ElevenLabs and OpenAI comparison samples once their keys are wired in. The interesting thing is what the last 85 years have repeatedly shown: every representation introduced here — filter bank, spectrogram, LPC, formants, diphones, HMM parameters, mel-via-attention, dilated convolution audio, flow-matched codec tokens — is still visible somewhere in the pipeline on this page. The ladder is shorter than it looks.

What survived
The main point. In 2025, an open-source 82M-param model trained by one researcher (and community contributions for multilingual voice packs) runs locally, on CPU, with quality that 2019 commercial TTS would have envied. The curve continues.
§ 07 · Reproduce

Run it yourself.

Every figure on this page is generated from five Python scripts against a shared virtualenv. No API calls, no cached renders. Drop a new WAV into the samples directory and rerun — the pipeline will rebuild every spectrogram, every metric, every plate on this page.

Kokoro-82M runs on CPU; end-to-end rebuild under ten minutes on a modern laptop.
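
Stage 1 of the pipeline table below reduces to a few lines; a minimal sketch assuming the kokoro Python package's KPipeline interface (check the Kokoro-82M model card for the current call signature before relying on it):

```python
# pip install kokoro soundfile
from kokoro import KPipeline
import soundfile as sf

# lang_code "a" = American English in Kokoro's voice-pack naming (assumption).
pipeline = KPipeline(lang_code="a")
text = "The quick brown fox jumps over the lazy dog."

# The pipeline yields (graphemes, phonemes, audio) per generated segment at 24 kHz.
for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice="af_heart")):
    sf.write(f"af_heart_{i}.wav", audio, 24000)
```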

The pipeline
Stage | Script | What it does
1 · Synth | scripts/tts-samples/generate.py | Each voice renders the prompt at default settings, no prosody prompting. Saved at 24 kHz mono.
2 · Spectrograms | scripts/tts-samples/spectrograms.py | 128 mel bins, n_fft=1024, hop=256, fmin=0, fmax=8000, log-power dB, magma colormap.
3 · Analyse | scripts/tts-samples/analyze.py | Five lenses + all-metrics.json. F0 via librosa.pyin (70–500 Hz). MFCC 13 coef, DCT-II.
4 · Resynth | scripts/tts-samples/resynthesis.py | mel_to_stft → griffinlim, 64 iterations, no learned prior.
5 · Failures | scripts/tts-samples/failure_modes.py | Generates the six failure-mode plates by DSP-manipulating the healthy reference.
Fig 7 · Trim + pad: 35 dB top-db silence trim, then padded/truncated to a common 2.5 s window so column counts match exactly across every render. Rendered with librosa + matplotlib at 200 DPI.
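
The trim-and-pad step in that caption is two librosa calls; a sketch with the 35 dB threshold and 2.5 s window from the caption (function name is illustrative):

```python
import librosa

TARGET_SECONDS = 2.5

def load_common_window(path: str, sr: int = 24000):
    y, _ = librosa.load(path, sr=sr)
    y, _ = librosa.effects.trim(y, top_db=35)       # strip leading/trailing silence
    # Pad or truncate so every render has the same number of frames.
    return librosa.util.fix_length(y, size=int(TARGET_SECONDS * sr))
```
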
What this is
  • A reproducible DSP pipeline you can run on any WAV.
  • Matched-format visualisations: same window, same colormap, same axes across every voice.
  • Measured acoustic descriptors, not vibes.
  • A framework ready to accept ElevenLabs, OpenAI, Cartesia, XTTS-v2, F5-TTS samples next.
What this is not
  • Not a quality judgment. You cannot read “better” off a spectrogram.
  • Not a MOS score. For that, see the comparison pages.
  • Not a claim about architecture — visible differences are usually training data, not model topology.
  • Not a vocoder shoot-out. That needs matched-mel input, a separate experiment.


§ 08 · Related

Neighbouring reads.

Cross-references on Codesota that continue the thread.

Speech register
The STT + TTS parent pillar.
ElevenLabs vs OpenAI TTS
Commercial head-to-head.
Best TTS for voice cloning
Zero-shot similarity & consent.
Guide · TTS models
Full landscape overview.
Methodology
How we verify every number.