
Speech Emotion Recognition

Classify speaker emotion or affective state from voice.

How Speech Emotion Recognition Works

A technical deep-dive into Speech Emotion Recognition. How machines learn to hear not just what we say, but how we feel when we say it.

1. The Problem

Why words alone are not enough.

Consider the phrase: "That's great."

Spoken with rising pitch and high energy, it means genuine enthusiasm. With flat pitch and a sigh, it drips with sarcasm. With trembling voice, it might mask disappointment. The same words carry completely different meanings based on how they are spoken.

Speech Emotion Recognition (SER) extracts these paralinguistic cues - the pitch, rhythm, intensity, and voice quality that reveal our emotional state. This goes beyond speech-to-text; it is about understanding the music behind the words.

Customer Service

Detect frustrated callers in real-time. Route to specialists before escalation. Measure emotional journey across interactions.

Mental Health

Screen for depression markers in voice. Track mood over time. Alert caregivers to emotional changes.

Human-AI Interaction

Make voice assistants emotionally aware. Adapt responses based on user state. Create more empathetic AI companions.

The Core Challenges

1. Subjectivity: Different people express the same emotion differently. Cultural norms vary.
2. Mixed emotions: Real speech often contains multiple simultaneous emotional states.
3. Context dependency: The same acoustic patterns can signal different emotions in different contexts.
4. Acted vs natural: Most training data is acted; real emotions are subtler and more varied.
2. Acoustic Features

The building blocks of emotion in speech. What the model actually "hears".

1. Pitch (F0)

Fundamental frequency of the voice. Higher when excited or angry, lower when sad.

Extracted metrics: mean F0, F0 range, F0 contour, jitter
Example values: Happy: 180-250 Hz | Sad: 100-150 Hz
2. Energy/Intensity

Loudness of speech. Correlates strongly with arousal and emotional intensity.

Extracted metrics: RMS energy, energy envelope, shimmer, peak amplitude
Example values: Anger: +6 dB | Fear: +3 dB | Sadness: -4 dB
3. Tempo/Duration

Speaking rate and pause patterns. Emotions affect speech rhythm significantly.

Extracted metrics: speech rate, pause duration, syllable rate, rhythm
Example values: Anger: 5.2 syll/s | Sadness: 3.1 syll/s
4. Spectral Features

Frequency distribution characteristics. Captures voice quality and timbre.

Extracted metrics: MFCCs, spectral centroid, spectral flux, formants
Example values: 13-40 MFCCs + deltas typically used
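To make these metrics concrete, here is a minimal sketch of extracting them with librosa. The file name is a placeholder, and the pause-ratio heuristic is an illustrative simplification rather than a standard measure.

import librosa
import numpy as np

# Illustrative feature extraction; "speech.wav" is a placeholder path
audio, sr = librosa.load("speech.wav", sr=16000)

# Pitch (F0): probabilistic YIN gives an F0 track, NaN for unvoiced frames
f0, voiced_flag, voiced_prob = librosa.pyin(
    audio, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
f0 = f0[~np.isnan(f0)]
print(f"Mean F0: {f0.mean():.1f} Hz, range: {f0.min():.1f}-{f0.max():.1f} Hz")

# Energy/intensity: frame-level RMS energy
rms = librosa.feature.rms(y=audio)[0]
print(f"Mean RMS energy: {rms.mean():.4f}")

# Tempo/duration proxy: share of low-energy frames as a rough pause estimate
print(f"Approx. pause ratio: {(rms < 0.5 * rms.mean()).mean():.2f}")

# Spectral features: 13 MFCCs plus deltas, summarized by their means
mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)
features = np.concatenate([mfcc.mean(axis=1), librosa.feature.delta(mfcc).mean(axis=1)])
print("Utterance-level feature vector:", features.shape)  # (26,)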

Traditional vs Modern Feature Extraction

Traditional (Hand-crafted)
Audio -> MFCC + Pitch + Energy -> Statistics -> Classifier
  • + Interpretable, fast, works with small data
  • - Requires domain expertise, less accurate
Modern (Self-supervised)
Audio -> wav2vec2/HuBERT -> Embeddings -> Classifier
  • + Higher accuracy, learns rich representations
  • - Black-box, requires GPU, more data hungry
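And a corresponding sketch of the modern route: freeze a self-supervised encoder, mean-pool its frame embeddings into one vector per utterance, and fit a light classifier on top (scikit-learn's logistic regression here). The checkpoint choice and the tiny file list are assumptions for illustration; in practice you would train on a real corpus.

import numpy as np
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import LogisticRegression

# Frozen SSL embeddings + a simple classifier head (illustrative, not tuned)
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def embed(path):
    """Mean-pool wav2vec2 hidden states into one utterance-level vector."""
    audio, _ = librosa.load(path, sr=16000)
    inputs = extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

# Hypothetical labeled clips; swap in a real dataset such as RAVDESS or IEMOCAP
train = [("angry_01.wav", "anger"), ("sad_01.wav", "sadness"), ("happy_01.wav", "happiness")]
X = np.stack([embed(path) for path, _ in train])
y = [label for _, label in train]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed("test_clip.wav")]))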
3. The Arousal-Valence Model

Beyond discrete labels: representing emotions as continuous dimensions.

Categorical labels like "happy" or "angry" are intuitive but limiting. The circumplex model places emotions on two continuous axes:

Arousal (Activation)

How energized or activated the emotional state is.

Low (calm, tired) | High (excited, angry)
Valence (Pleasantness)

How positive or negative the emotional state is.

Negative (sad, angry) | Positive (happy, calm)

Emotion Circumplex (figure): anger and fear fall in the high-arousal/negative-valence quadrant, excitement in the high-arousal/positive-valence quadrant, surprise high in arousal with variable valence, sadness in the low-arousal/negative-valence quadrant, contentment in the low-arousal/positive-valence quadrant, and neutral near the center.
Continuous prediction allows capturing subtle emotional nuances and transitions.
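One practical use of the continuous view: predict arousal and valence first, then snap to the nearest discrete label only when a category is required. The coordinates below are rough quadrant placements for illustration, not values from any dataset.

import math

# Approximate (valence, arousal) placements in [-1, 1]; illustrative, not measured
CIRCUMPLEX = {
    "anger":       (-0.7,  0.8),
    "fear":        (-0.5,  0.6),
    "excitement":  ( 0.7,  0.8),
    "surprise":    ( 0.1,  0.7),
    "happiness":   ( 0.8,  0.5),
    "sadness":     (-0.7, -0.5),
    "contentment": ( 0.6, -0.4),
    "neutral":     ( 0.0,  0.0),
}

def nearest_category(valence, arousal):
    """Map a continuous (valence, arousal) prediction to the closest discrete label."""
    return min(CIRCUMPLEX, key=lambda e: math.dist((valence, arousal), CIRCUMPLEX[e]))

# The demo utterance in the next section predicts arousal 0.85, valence -0.75
print(nearest_category(valence=-0.75, arousal=0.85))  # -> anger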

Discrete Emotion Categories

Anger
High arousal, negative valence. Fast speech, high pitch variability.
Happiness
High arousal, positive valence. Higher pitch, increased energy.
Sadness
Low arousal, negative valence. Slower tempo, lower pitch.
Fear
High arousal, negative valence. Trembling voice, irregular rhythm.
Surprise
High arousal, variable valence. Sudden pitch rise, increased intensity.
Disgust
Medium arousal, negative valence. Lowered pitch, slower articulation.
Neutral
Low arousal, neutral valence. Baseline for comparison.
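Notice that the correlates above separate arousal far more reliably than valence: anger and happiness share high pitch and energy, and valence usually needs lexical or finer spectral cues. Below is a toy sketch of an arousal score built from the example values listed under Acoustic Features; the scaling constants are invented for illustration.

# Toy arousal estimate from relative loudness (dB) and speech rate; constants are
# invented for illustration and would normally be learned from data
def rough_arousal(rms_db_rel, syllables_per_sec):
    energy_term = max(-1.0, min(1.0, rms_db_rel / 6.0))                 # +6 dB (anger) -> 1.0
    tempo_term = max(-1.0, min(1.0, (syllables_per_sec - 4.0) / 1.5))   # 5.2 syll/s -> 0.8
    return 0.5 * energy_term + 0.5 * tempo_term                         # arousal in [-1, 1]

print(f"{rough_arousal(6.0, 5.2):+.2f}")   # anger-like input   -> +0.90
print(f"{rough_arousal(-4.0, 3.1):+.2f}")  # sadness-like input -> -0.63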
4. Interactive Demo

Explore how different utterances map to emotions. One example utterance and its analysis:

"I can't believe you did this to me!"
Predicted: anger (92% confidence)
Arousal: 0.85 (high) | Valence: -0.75 (negative)
Detected acoustic features: pitch high, energy very high, tempo fast, variation high
5. Models and Methods

From self-supervised transformers to specialized emotion models.

Model | Type | Architecture | Accuracy | Notes
wav2vec2-emotion | Self-supervised | wav2vec2 + classification head | ~75% (4-class) | Fine-tuned wav2vec2-base, good baseline
HuBERT-emotion | Self-supervised | HuBERT + pooling + classifier | ~78% (4-class) | Better representations than wav2vec2
emotion2vec | Specialized | Self-supervised pretraining on emotion | ~80% (4-class) | SOTA open-source, Alibaba DAMO
SpeechBrain | Toolkit | ECAPA-TDNN, wav2vec2 recipes | ~76% (4-class) | Production-ready, excellent docs
Hume AI | API | Proprietary multimodal | 48 emotions + dimensions | Most granular, includes prosody
OpenAI Whisper + LLM | Pipeline | ASR -> text -> emotion via LLM | Good for text emotions | Loses acoustic info, text-only analysis
wav2vec2 / HuBERT

Self-supervised models pretrained on massive unlabeled speech. Learn rich representations that transfer well to emotion recognition with minimal fine-tuning.

Best for: General-purpose, good baselines, multi-language support
emotion2vec

Purpose-built for emotion. Pretrained with emotion-aware objectives on diverse emotion datasets. Current open-source SOTA.

Best for: Maximum accuracy, production deployments
SpeechBrain

Complete toolkit with pretrained recipes. Includes data loaders, training loops, and evaluation metrics. Excellent documentation; a loading sketch follows these model notes.

Best for: Research, custom training, full control
Hume AI

Commercial API with 48+ fine-grained emotion categories. Includes prosody analysis and multimodal support (face + voice).

Best for: Production apps, granular emotions, no ML expertise needed
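As a concrete example of loading one of the pretrained options above, the SpeechBrain emotion model published on Hugging Face can be used through SpeechBrain's generic interface. The snippet follows the speechbrain/emotion-recognition-wav2vec2-IEMOCAP model card; the import path has moved between SpeechBrain versions, so treat it as a sketch rather than a guaranteed API.

# Sketch per the speechbrain/emotion-recognition-wav2vec2-IEMOCAP model card;
# in SpeechBrain releases before 1.0 this import lived under speechbrain.pretrained
from speechbrain.inference.interfaces import foreign_class

classifier = foreign_class(
    source="speechbrain/emotion-recognition-wav2vec2-IEMOCAP",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
)

# Returns class posteriors, the best score and index, and the text label
out_prob, score, index, text_lab = classifier.classify_file("speech.wav")
print(text_lab)  # e.g. ['ang'], ['hap'], ['sad'], or ['neu']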

Standard Datasets

Dataset | Size | Emotions | Type | Notes
IEMOCAP | 12 hours | 9 | Acted + Improv | Most cited, English
RAVDESS | 7356 clips | 8 | Acted | North American, balanced
CREMA-D | 7442 clips | 6 | Acted | Diverse actors, video+audio
MSP-IMPROV | 8.4 hours | 4 | Improvised | Natural interactions
CMU-MOSEI | 65 hours | 6 | In-the-wild | YouTube, multimodal
EmoV-DB | 7000 clips | 5 | Acted | Multi-language
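Acted corpora like RAVDESS are easy to bootstrap from because the labels live in the filenames: the third dash-separated field encodes the emotion. A small sketch for building a labeled file list follows; the directory path is a placeholder.

from pathlib import Path

# RAVDESS filename convention: the third dash-separated field is the emotion code
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_label(path):
    """e.g. 03-01-05-01-02-01-12.wav -> 'angry' (code '05' in the third field)."""
    return RAVDESS_EMOTIONS[path.stem.split("-")[2]]

# "RAVDESS/" is a placeholder for wherever the corpus was downloaded
dataset = [(p, ravdess_label(p)) for p in Path("RAVDESS/").rglob("*.wav")]
print(f"{len(dataset)} labeled clips")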
6. Code Examples

Get started with speech emotion recognition in Python.

HuggingFace Transformers (pip install transformers librosa)
Quick Start
from transformers import pipeline
import librosa

# Load pre-trained emotion recognition pipeline
classifier = pipeline(
    "audio-classification",
    model="ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"
)

# Load audio (16kHz required for wav2vec2)
audio, sr = librosa.load("speech.wav", sr=16000)

# Classify emotion
result = classifier(audio)

for pred in result[:3]:
    print(f"{pred['label']:12} {pred['score']:.3f}")

# Output:
# angry        0.743
# sad          0.142
# neutral      0.089
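For the ASR-then-text route from the models table, one option is to transcribe with Whisper and run a text emotion classifier on the transcript. Both checkpoints below are assumptions for illustration, and this path discards all acoustic cues; combining its scores with the acoustic classifier above is a common ensemble.

from transformers import pipeline

# Transcribe, then classify the transcript; both model names are illustrative choices.
# This route sees only the words, not the prosody.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
text_emotion = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
    top_k=3,
)

transcript = asr("speech.wav")["text"]
print(transcript)
print(text_emotion(transcript))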

Quick Reference

For Maximum Accuracy
  • emotion2vec (SOTA open-source)
  • Fine-tuned HuBERT
  • Ensemble with text analysis
For Quick Start
  • HuggingFace pipeline
  • SpeechBrain pretrained
  • Hume AI (no ML needed)
Key Takeaways
  • Pitch + energy + tempo = emotion
  • Arousal-valence captures nuance
  • Acted data differs from real

Use Cases

  • Call center quality
  • Health monitoring
  • Gaming NPCs
  • Voice analytics

Architectural Patterns

Spectrogram CNN/Transformer

Predict emotion from mel-spectrogram features (see the sketch after these patterns).

SSL Audio Fine-Tune

Fine-tune wav2vec2/HuBERT embeddings for emotion.
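A minimal PyTorch sketch of the spectrogram-CNN pattern; the layer sizes and 4-class output are arbitrary choices for illustration, not a reference architecture.

import torch
import torch.nn as nn

class MelCNN(nn.Module):
    """Small CNN over a mel spectrogram, pooled to one emotion prediction per clip."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse time/frequency into one vector
        self.head = nn.Linear(32, n_classes)  # emotion logits

    def forward(self, mel):                   # mel: (batch, 1, n_mels, frames)
        x = self.pool(self.conv(mel)).flatten(1)
        return self.head(x)

# e.g. a batch of two 64-band mel spectrograms, 300 frames each
logits = MelCNN()(torch.randn(2, 1, 64, 300))
print(logits.shape)  # torch.Size([2, 4])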

Implementations

Open Source

SpeechBrain SER (Apache 2.0): Recipes for IEMOCAP and CREMA-D.
Wav2Vec2-Emotion (Apache 2.0): SSL backbone fine-tuned for SER.
Emo-CLAP (Apache 2.0): Audio-text contrastive model for zero-shot emotion recognition.

Quick Facts

Input: Audio
Output: Structured Data
Implementations: 3 open source, 0 API
Patterns: 2 approaches

Submit Results