Academy Deep Dive

How Transformers
Actually Work

From the raw dot-product attention math to full encoder blocks and sentence embeddings. No hand-waving. Real numbers, real formulas, working code.

Architecture at a Glance

2017
Year introduced (Vaswani et al.)
130k+
Citations on original paper
O(n²d)
Attention complexity
~25 min
Estimated reading time

1. The Attention Mechanism

Attention is the core operation of every Transformer. Before Transformers, sequence models like LSTMs processed tokens one at a time, left to right. The attention mechanism lets every token look at every other token simultaneously, computing a weighted sum based on relevance.

Q, K, V: Three Separate Projections

Each input token has an embedding vector. We multiply it by three learned weight matrices to produce three different vectors:

Q = X · WQ   // Query: "what am I looking for?"
K = X · WK   // Key: "what do I contain?"
V = X · WV   // Value: "what information do I carry?"

Why three separate matrices? Because the role a token plays when asking a question (Query) is different from the role it plays when being asked about (Key), which is again different from the information it provides (Value). A single projection cannot capture all three roles. This factorization gives the model flexibility to learn asymmetric relationships: token A attending to token B does not require B to attend equally to A.

The Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

Let's break down every component:

Step 1: QKᵀ — Raw Attention Scores

Multiply the Query matrix by the transpose of the Key matrix. This computes a dot product between every pair of tokens. If Q and K are shaped (n × dk), the result is an (n × n) matrix where entry (i, j) measures how much token i should attend to token j. Higher dot product = more relevant.

Step 2: / √dk — Scaling

Divide every score by the square root of the key dimension. Why? The dot product of two dk-dimensional vectors with unit-variance components has variance dk, so when dk is large (e.g., 64), raw scores become large in magnitude. This pushes the softmax into saturated regions where its gradients are extremely small, making training unstable. Dividing by √dk keeps the variance of the scores at roughly 1, regardless of dimension size. For dk = 64, we divide by 8.
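You can verify the variance argument directly. A quick NumPy sketch (illustrative, not part of the reference implementation in section 8): sample random queries and keys with unit-variance components, and watch the dot-product variance grow with dk until dividing by √dk flattens it back to roughly 1.

```python
import numpy as np

# Empirical check: for q, k with unit-variance components, var(q . k) ~ d_k,
# and dividing by sqrt(d_k) restores variance ~1 at any dimension.
rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)           # 10,000 sample dot products
    scaled = dots / np.sqrt(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():7.1f}  "
          f"var(q.k / sqrt(d_k))={scaled.var():5.2f}")
```

The unscaled variance tracks dk almost exactly, which is why a fixed softmax temperature cannot work across head sizes.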

Step 3: softmax() — Normalize to Probabilities

Apply softmax row-wise so each row sums to 1. This turns raw scores into a probability distribution: for each token, we get a distribution over all tokens it should pay attention to. A score of 0.7 on position j means "70% of my attention goes to token j."

Step 4: × V — Weighted Sum of Values

Multiply the attention weights by the Value matrix. This produces a weighted combination of value vectors for each token. If token 0 attends 70% to token 1 and 30% to token 2, its output is 0.7 · V1 + 0.3 · V2. The output has the same shape as the input: (n × dk).

2. Numerical Walkthrough: 3 Tokens

Let's trace through the full attention computation with actual numbers. We have 3 tokens with dk = 4 (tiny, but the math is identical to dk = 64 in real models).

Input Embeddings X (3 tokens × 4 dims)

Token   d0    d1    d2    d3
"The"   1.0   0.0   1.0   0.0
"cat"   0.0   1.0   0.0   1.0
"sat"   1.0   1.0   0.0   0.0

After Projection (using identity W for clarity)

For this example, Q = K = V = X (identity weights). In real models, learned WQ, WK, WV produce different projections.

Q = K = V = X   (same as above, dk = 4)

Step 1: QKᵀ (Raw Scores)

Each entry (i, j) = dot product of Qᵢ and Kⱼ

         "The"  "cat"  "sat"
"The"     2.0    0.0    1.0
"cat"     0.0    2.0    1.0
"sat"     1.0    1.0    2.0

Note: each token has the highest dot product with itself (diagonal = 2.0). "The" and "cat" are orthogonal (score = 0).

Step 2: Divide by √dk = √4 = 2

         "The"  "cat"  "sat"
"The"    1.00   0.00   0.50
"cat"    0.00   1.00   0.50
"sat"    0.50   0.50   1.00

Step 3: Softmax (row-wise)

softmax([1.0, 0.0, 0.5]) = [e^1.0, e^0.0, e^0.5] / sum = [2.718, 1.0, 1.649] / 5.367 = [0.506, 0.186, 0.307]

         "The"  "cat"  "sat"   sum
"The"    0.506  0.186  0.307   1.0
"cat"    0.186  0.506  0.307   1.0
"sat"    0.274  0.274  0.452   1.0

Each token attends most to itself (~45–51%) but also spreads attention to others. "sat" attends equally to "The" and "cat" (0.274 each), keeping 0.452 for itself.

Step 4: Attention Weights × V = Output

Output for "The" = 0.506 · [1,0,1,0] + 0.186 · [0,1,0,1] + 0.307 · [1,1,0,0]

Token   d0     d1     d2     d3
"The"   0.814  0.494  0.506  0.186
"cat"   0.494  0.814  0.186  0.506
"sat"   0.726  0.726  0.274  0.274

Each output vector is a blend of all input value vectors, weighted by attention. "The" absorbed some information from "cat" and "sat" — this is how context flows between tokens.

3. Multi-Head Attention

A single attention head can only capture one type of relationship at a time. Multi-head attention runs h parallel attention heads, each with its own WQ, WK, WV projections, then concatenates and projects the results.

head_i = Attention(X · WQ_i, X · WK_i, X · WV_i)
MultiHead(X) = Concat(head_1, ..., head_h) · WO

What Different Heads Learn

Research has shown that different heads in trained Transformers specialize in different linguistic patterns. This is not engineered — it emerges from training:

Head A

Syntactic adjacency

Attends to the next/previous token. Captures local word order and bigram relationships.

Head B

Subject-verb agreement

The verb token attends strongly to its subject, even across long distances with intervening clauses.

Head C

Coreference resolution

Pronouns attend to their antecedents. "it" attends to "the cat" across the full context window.

Head D

Positional patterns

Some heads attend to specific relative positions (e.g., always 2 tokens back), forming a learned convolution.

Concatenation and Output Projection

If the model dimension is dmodel = 768 and we have h = 12 heads, each head works with dk = dmodel / h = 64 dimensions. After computing attention independently, the h output vectors (each 64-dim) are concatenated back to 768-dim and multiplied by WO (768 × 768) to produce the final output. The total compute is the same as a single head with full dimensionality, but the model gets 12 different "perspectives" on the input.

Dimensions at Each Step (BERT-base, h=12)

Input:             (seq_len × 768)
Per-head Q, K, V:  (seq_len × 64)   × 12 heads
Per-head output:   (seq_len × 64)   × 12 heads
Concatenated:      (seq_len × 768)   = 12 × 64
After WO:         (seq_len × 768)   back to model dim

4. The Full Transformer Block

A single Transformer layer ("block") wraps multi-head attention with feedforward networks, residual connections, and layer normalization. BERT-base stacks 12 of these blocks; GPT-3 stacks 96.

Data Flow Through One Transformer Block

Input: x (seq_len × dmodel)
  → Multi-Head Attention(x, x, x)
  → Add & Norm: LayerNorm(x + Attention(x))
  → Feed-Forward Network: FFN(x) = W2 · GELU(W1·x + b1) + b2
  → Add & Norm: LayerNorm(x + FFN(x))
Output: x' (seq_len × dmodel)

Residual Connections

The "Add" in "Add & Norm" is a residual (skip) connection: the input is added to the output of each sub-layer. This ensures gradients can flow directly from output to input, making deep networks (96+ layers) trainable. Without residuals, gradients vanish and deep Transformers fail to converge.

Layer Normalization

LayerNorm normalizes across the feature dimension (not the batch dimension like BatchNorm). For each token, it centers the dmodel-dimensional vector to mean 0 and variance 1, then applies learned scale and shift parameters. This stabilizes training and makes learning rate less sensitive.

Feed-Forward Network

A two-layer MLP applied independently to each token position. The inner dimension is typically 4× the model dimension (768 → 3072 → 768 in BERT-base). This is where the model stores factual knowledge — recent work shows FFN layers act as key-value memories.

Pre-Norm vs. Post-Norm

The original Transformer applies LayerNorm after the residual (Post-Norm). Most modern models (GPT-2+, LLaMA) use Pre-Norm: normalize first, then apply the sub-layer. Pre-Norm is easier to train at scale, though Post-Norm can achieve slightly better final quality with careful tuning.
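The two variants differ only in where the normalization sits. A minimal NumPy sketch with stand-in sub-layers (the `attn` and `ffn` lambdas here are placeholders, not real attention or FFN implementations):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector across the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Stand-in sub-layers so the sketch runs on its own; real blocks use
# multi-head attention and a two-layer MLP here.
attn = lambda x: x @ np.eye(x.shape[-1])
ffn = lambda x: np.tanh(x)

def post_norm_block(x):
    # Original Transformer: sub-layer first, then normalize the residual sum
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x):
    # GPT-2 / LLaMA style: normalize, apply the sub-layer, then add —
    # the residual path itself is never normalized, which helps deep stacks
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

x = np.random.randn(3, 8)
print(post_norm_block(x).shape, pre_norm_block(x).shape)  # (3, 8) (3, 8)
```

Note the Pre-Norm residual stream runs uninterrupted from input to output, which is the usual explanation for its better trainability at depth.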

5. Positional Encoding

Attention is permutation-invariant: if you shuffle the input tokens, the attention computation produces the same output (just shuffled). The model has no inherent notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention scores. Positional encodings inject order information.
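You can check this property numerically. A self-contained sketch (attention re-implemented inline with identity projections so it runs on its own): permuting the input rows permutes the output rows identically and changes nothing else.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    # Self-attention with identity projections (Q = K = V = X)
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))      # 6 tokens, d_k = 4
perm = rng.permutation(6)            # a random re-ordering of the tokens

out = attention(X)
out_shuffled = attention(X[perm])    # attend over the shuffled sequence

# Same output vectors, just re-ordered: attention alone carries no
# notion of token position.
print(np.allclose(out[perm], out_shuffled))  # True
```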

Sinusoidal Positional Encoding (Original)

PE(pos, 2i)   = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

Each dimension of the positional encoding oscillates at a different frequency. Low dimensions change rapidly (high frequency), encoding fine position differences. High dimensions change slowly (low frequency), encoding coarse position. This design has a key property: the encoding of position pos + k can be represented as a linear function of the encoding at pos, allowing the model to learn relative position patterns easily.
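The two formulas drop straight into NumPy. A sketch of the original encoding (the function name is ours, not from any library):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cos for odd dims."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2), i = 2·pair
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=16)
print(pe.shape)   # (128, 16)
print(pe[0])      # position 0: all sin terms are 0, all cos terms are 1
```

Plotting a few columns makes the frequency spectrum visible: column 0 oscillates every few positions, while the last columns barely move over 128 positions.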

Sinusoidal (Fixed)

  • Deterministic, computed once, not learned
  • Can generalize to longer sequences than seen during training
  • Used in the original Transformer (Vaswani et al., 2017)
  • No additional parameters

Learned Positions

  • A learned embedding matrix of shape (max_len × dmodel)
  • Used in BERT, GPT-2 (max 512 and 1024 positions respectively)
  • Cannot extrapolate beyond max_len without tricks
  • In practice, performs about the same as sinusoidal

Modern Alternatives: RoPE and ALiBi

Modern LLMs have moved beyond additive positional encodings. RoPE (Rotary Position Embedding, used in LLaMA, Mistral, GPT-NeoX) encodes position by rotating the Q and K vectors in 2D subspaces. This elegantly encodes relative position: the dot product between rotated Q and K depends only on the distance between tokens, not their absolute positions.

ALiBi (Attention with Linear Biases, used in BLOOM, MPT) takes an even simpler approach: no positional encoding at all. Instead, it subtracts a linear penalty from attention scores based on distance: score(i, j) − m · |i − j|, where m is a head-specific slope. This works surprisingly well and generalizes to longer sequences without fine-tuning.
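The bias itself is a few lines of NumPy. A sketch (the geometric slope sequence used here is one common choice; the paper derives head-specific slopes from the head count):

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head linear distance penalty: bias[h, i, j] = -m_h * |i - j|."""
    # Head-specific slopes: geometric sequence 1/2, 1/4, 1/8, ...
    slopes = 2.0 ** -np.arange(1, n_heads + 1)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])     # (seq_len, seq_len)
    return -slopes[:, None, None] * dist           # (n_heads, seq, seq)

bias = alibi_bias(seq_len=5, n_heads=4)
# Added to raw attention scores before softmax, per head:
#   scores = Q @ K.T / np.sqrt(d_k) + bias[h]
print(bias[0])   # head 0: 0 on the diagonal, -0.5 per token of distance
```

Because the penalty is linear in distance, nearby tokens keep most of their attention mass at any sequence length — the source of ALiBi's length extrapolation.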

6. From Tokens to Sentence Embeddings

A Transformer produces one output vector per token. But for tasks like semantic search, classification, or clustering, you need a single vector for the entire sentence. How do you collapse a sequence of vectors into one?

[CLS] Token Approach (BERT)

BERT prepends a special [CLS] token to every input. After passing through all 12 Transformer layers, the output vector at the [CLS] position is used as the sentence representation. The idea is that attention allows [CLS] to aggregate information from all other tokens.

# BERT input format:
[CLS] The cat sat on the mat [SEP]
# Sentence embedding = output at position 0 (the [CLS] token)
embedding = model_output[0]  # shape: (768,)

Limitation: The [CLS] token in base BERT is not actually optimized for sentence similarity. Without fine-tuning, [CLS] embeddings perform worse than simple word2vec averaging for semantic tasks.

Mean Pooling (Sentence-Transformers)

Average the output vectors across all tokens (excluding padding). This is the default in sentence-transformers and consistently outperforms [CLS] pooling for embedding quality.

# Mean pooling over all token outputs:
token_outputs = model_output       # shape: (seq_len, 768)
attention_mask = ...               # shape: (seq_len,), 1 for real tokens, 0 for padding
mask = attention_mask[:, None]     # expand to (seq_len, 1) so it broadcasts
embedding = (token_outputs * mask).sum(axis=0) / mask.sum()
# Result shape: (768,)

Why it works better: Every token contributes to the final embedding, so no information is bottlenecked through a single position. Content words with strong semantic signal directly influence the result.

Other Pooling Strategies

Max Pooling

Take the element-wise max across all tokens. Captures the strongest signal in each dimension. Rarely used in modern models.

Weighted Mean

Weight later layers or specific attention heads more. Some models learn these weights during fine-tuning (e.g., SentEval probing).

Last Token (Causal)

For decoder-only models (GPT), use the last token's output. It has attended to all previous tokens via causal masking.
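With padded batches, "last token" means the last real token, not the last position. A sketch using the attention mask to index it (array names are illustrative, with random stand-in outputs):

```python
import numpy as np

# Hypothetical batch: 2 sequences, max length 5, hidden size 8
token_outputs = np.random.randn(2, 5, 8)
attention_mask = np.array([
    [1, 1, 1, 0, 0],   # sequence 0 has 3 real tokens
    [1, 1, 1, 1, 1],   # sequence 1 uses the full length
])

# Index of the last real token in each sequence
last = attention_mask.sum(axis=1) - 1             # [2, 4]
embeddings = token_outputs[np.arange(2), last]    # (2, 8)
print(embeddings.shape)
```

Taking position -1 instead would silently return padding vectors for sequence 0.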

7. Scale: How Size Affects Quality

Transformer quality scales predictably with model size. The "Scaling Laws" research (Kaplan et al., 2020) showed that loss follows a power law with parameters, compute, and data. Doubling parameters gives diminishing but consistent returns.

Model         Layers  Heads  dmodel  Params  Year
BERT-base     12      12     768     110M    2018
BERT-large    24      16     1024    340M    2018
GPT-2         48      25     1600    1.5B    2019
GPT-3         96      96     12288   175B    2020
LLaMA-2 70B   80      64     8192    70B     2023
GPT-4 (est.)  ~120    ~96    ~12288  ~1.8T   2023

Parameter Count Formula

For a Transformer with L layers, dmodel hidden dims, and V vocab size:

Attention:  4 · d² per layer (WQ, WK, WV, WO)
FFN:        8 · d² per layer (4× inner)
Embeddings: V · d
Total ≈ 12 · L · d² + V · d
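Plugging in BERT-base's numbers (with its WordPiece vocabulary of 30,522 tokens) lands close to the quoted 110M — the remainder is biases, LayerNorm parameters, and position embeddings, which the formula ignores:

```python
def approx_params(n_layers, d_model, vocab_size):
    """Back-of-the-envelope: 12 * L * d^2 (attention + FFN) + V * d."""
    attention = 4 * d_model**2 * n_layers   # W_Q, W_K, W_V, W_O per layer
    ffn = 8 * d_model**2 * n_layers         # d -> 4d -> d per layer
    embeddings = vocab_size * d_model
    return attention + ffn + embeddings

total = approx_params(n_layers=12, d_model=768, vocab_size=30_522)
print(f"{total / 1e6:.1f}M")   # 108.4M — close to BERT-base's 110M
```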

Compute Cost

Training a Transformer requires roughly 6 · N · D FLOPs, where N is parameters and D is number of training tokens. GPT-3 (175B params, 300B tokens) required ~3.14 × 10²³ FLOPs, or about $4.6M on V100 GPUs at 2020 cloud prices.
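A one-line sanity check of the 6 · N · D rule against the GPT-3 numbers quoted above:

```python
N = 175e9   # parameters
D = 300e9   # training tokens
flops = 6 * N * D
print(f"{flops:.2e}")   # 3.15e+23 — consistent with the ~3.14e23 figure
```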

Emergent Abilities

Some capabilities only appear above certain scale thresholds. Few-shot learning emerges around 1B parameters. Chain-of-thought reasoning appears around 100B. These "phase transitions" are not yet fully understood and remain an active research topic.

8. Working Code: Attention in NumPy

Here is a complete, runnable implementation of scaled dot-product attention and multi-head attention in NumPy. No frameworks, no magic — every operation is explicit.

attention.py
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Scaled dot-product attention.
    Q, K, V: arrays of shape (seq_len, d_k)
    Returns: (output, attention_weights)
    """
    d_k = K.shape[-1]

    # Step 1: Compute raw attention scores
    scores = Q @ K.T                # (seq_len, seq_len)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)  # prevents gradient vanishing in softmax

    # Step 3: Softmax to get attention weights
    weights = softmax(scores)       # each row sums to 1

    # Step 4: Weighted sum of values
    output = weights @ V            # (seq_len, d_k)

    return output, weights

# --- Demo with 3 tokens, d_k = 4 ---
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # "The"
    [0.0, 1.0, 0.0, 1.0],   # "cat"
    [1.0, 1.0, 0.0, 0.0],   # "sat"
])

# Using identity projection (Q = K = V = X)
output, weights = scaled_dot_product_attention(X, X, X)

print("Attention weights:")
print(np.round(weights, 3))
# [[0.506, 0.186, 0.307],
#  [0.186, 0.506, 0.307],
#  [0.274, 0.274, 0.452]]

print("\nOutput:")
print(np.round(output, 3))
# [[0.814, 0.494, 0.506, 0.186],
#  [0.494, 0.814, 0.186, 0.506],
#  [0.726, 0.726, 0.274, 0.274]]
multi_head_attention.py
def multi_head_attention(X, n_heads, d_model):
    """
    Multi-head attention from scratch.
    Uses numpy (np) and scaled_dot_product_attention from attention.py above.
    X: (seq_len, d_model)
    n_heads: number of attention heads
    d_model: model dimension (must be divisible by n_heads)
    """
    seq_len = X.shape[0]
    d_k = d_model // n_heads  # per-head dimension

    # Initialize random projection matrices (normally these are learned)
    np.random.seed(42)
    W_Q = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_K = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_V = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_O = np.random.randn(d_model, d_model) * 0.1

    head_outputs = []

    for h in range(n_heads):
        # Project input to Q, K, V for this head
        Q = X @ W_Q[h]   # (seq_len, d_k)
        K = X @ W_K[h]   # (seq_len, d_k)
        V = X @ W_V[h]   # (seq_len, d_k)

        # Scaled dot-product attention
        output, _ = scaled_dot_product_attention(Q, K, V)
        head_outputs.append(output)  # (seq_len, d_k)

    # Concatenate all heads: (seq_len, n_heads * d_k) = (seq_len, d_model)
    concat = np.concatenate(head_outputs, axis=-1)

    # Final output projection
    result = concat @ W_O  # (seq_len, d_model)

    return result

# --- Demo: 3 tokens, 4 heads, d_model = 8 ---
X = np.random.randn(3, 8)  # 3 tokens, 8-dim embeddings
output = multi_head_attention(X, n_heads=4, d_model=8)
print(f"Input shape:  {X.shape}")    # (3, 8)
print(f"Output shape: {output.shape}")  # (3, 8)
print("\nEach token now contains information from all other tokens,")
print("aggregated through 4 different attention perspectives.")
full_transformer_block.py
def layer_norm(x, eps=1e-5):
    """Layer normalization over the last dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
    ))

def feed_forward(x, d_model, d_ff):
    """Position-wise feed-forward network."""
    np.random.seed(123)
    W1 = np.random.randn(d_model, d_ff) * 0.02
    b1 = np.zeros(d_ff)
    W2 = np.random.randn(d_ff, d_model) * 0.02
    b2 = np.zeros(d_model)

    hidden = gelu(x @ W1 + b1)  # (seq_len, d_ff)
    output = hidden @ W2 + b2    # (seq_len, d_model)
    return output

def transformer_block(x, n_heads, d_model, d_ff):
    """
    One complete Transformer encoder block.
    x: (seq_len, d_model)
    """
    # Sub-layer 1: Multi-head attention + residual + norm
    attn_out = multi_head_attention(x, n_heads, d_model)
    x = layer_norm(x + attn_out)     # Add & Norm

    # Sub-layer 2: Feed-forward + residual + norm
    ff_out = feed_forward(x, d_model, d_ff)
    x = layer_norm(x + ff_out)       # Add & Norm

    return x

# --- Full block: 3 tokens, 4 heads, d_model=8, d_ff=32 ---
X = np.random.randn(3, 8)
output = transformer_block(X, n_heads=4, d_model=8, d_ff=32)
print(f"Transformer block: {X.shape} -> {output.shape}")
# Transformer block: (3, 8) -> (3, 8)
# Same shape in, same shape out — stackable!

9. Key Papers

The Transformer story is told through a remarkably small set of papers. These are the ones every ML practitioner should read, in roughly chronological order.

Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Introduced the Transformer architecture
Venue
NeurIPS 2017
Citations
130,000+
BERT: Pre-training of Deep Bidirectional Transformers
Devlin, Chang, Lee, Toutanova
Bidirectional pre-training with [CLS] and MLM
Venue
NAACL 2019
Citations
95,000+
Improving Language Understanding by Generative Pre-Training
Radford, Narasimhan, Salimans, Sutskever
GPT-1: autoregressive pre-training
Venue
OpenAI 2018
Citations
12,000+
Language Models are Few-Shot Learners
Brown, Mann, Ryder, Subbiah, et al.
GPT-3: scaling laws and in-context learning
Venue
NeurIPS 2020
Citations
40,000+
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych
Mean pooling for sentence embeddings
Venue
EMNLP 2019
Citations
8,500+
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, Beyer, Kolesnikov, et al.
Vision Transformer (ViT)
Venue
ICLR 2021
Citations
35,000+
FlashAttention: Fast and Memory-Efficient Exact Attention
Dao, Fu, Ermon, Rudra, Re
IO-aware exact attention algorithm
Venue
NeurIPS 2022
Citations
3,000+

Keep Going

Now that you understand how Transformers work from the inside, explore the benchmarks where they dominate — or dive into more deep dives from the Academy.

Last updated: March 2026