From the raw dot-product attention math to full encoder blocks and sentence embeddings. No hand-waving. Real numbers, real formulas, working code.
Attention is the core operation of every Transformer. Before Transformers, sequence models like LSTMs processed tokens one at a time, left to right. The attention mechanism lets every token look at every other token simultaneously, computing a weighted sum based on relevance.
Each input token has an embedding vector. We multiply it by three learned weight matrices to produce three different vectors:
Why three separate matrices? Because the role a token plays when asking a question (Query) is different from the role it plays when being asked about (Key), which is again different from the information it provides (Value). A single projection cannot capture all three roles. This factorization gives the model flexibility to learn asymmetric relationships: token A attending to token B does not require B to attend equally to A.
Let's break down every component:
Multiply the Query matrix by the transpose of the Key matrix. This computes a dot product between every pair of tokens. If Q and K are shaped (n × dk), the result is an (n × n) matrix where entry (i, j) measures how much token i should attend to token j. Higher dot product = more relevant.
Divide every score by the square root of the key dimension. Why? Without scaling, when dk is large (e.g., 64), dot products become very large in magnitude. This pushes the softmax into saturated regions where gradients are extremely small, making training unstable. Dividing by √dk keeps the variance of the scores at roughly 1, regardless of dimension size. For dk = 64, we divide by 8.
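This variance argument is easy to verify empirically. The sketch below (an illustration, not part of the implementation later in this article) draws random vectors with unit-variance components and compares the variance of the dot products before and after scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Dot products of random vectors with unit-variance components
# have variance ~ d_k; scaling by sqrt(d_k) restores variance ~ 1.
q = rng.standard_normal((10000, d_k))
k = rng.standard_normal((10000, d_k))
raw = (q * k).sum(axis=-1)       # 10,000 raw dot products
scaled = raw / np.sqrt(d_k)

print(raw.var())     # ≈ 64
print(scaled.var())  # ≈ 1
```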
Apply softmax row-wise so each row sums to 1. This turns raw scores into a probability distribution: for each token, we get a distribution over all tokens it should pay attention to. A score of 0.7 on position j means "70% of my attention goes to token j."
Multiply the attention weights by the Value matrix. This produces a weighted combination of value vectors for each token. If token 0 attends 70% to token 1 and 30% to token 2, its output is 0.7 · V1 + 0.3 · V2. The output has the same shape as the input: (n × dk).
Let's trace through the full attention computation with actual numbers. We have 3 tokens with dk = 4 (tiny, but the math is identical to dk = 64 in real models).
| Token | d0 | d1 | d2 | d3 |
|---|---|---|---|---|
| "The" | 1.0 | 0.0 | 1.0 | 0.0 |
| "cat" | 0.0 | 1.0 | 0.0 | 1.0 |
| "sat" | 1.0 | 1.0 | 0.0 | 0.0 |
For this example, Q = K = V = X (identity weights). In real models, learned WQ, WK, WV produce different projections.
Each entry (i, j) = dot product of Qi and Kj
| "The" | "cat" | "sat" | |
| "The" | 2.0 | 0.0 | 1.0 |
| "cat" | 0.0 | 2.0 | 1.0 |
| "sat" | 1.0 | 1.0 | 2.0 |
Note: each token has the highest dot product with itself (diagonal = 2.0). "The" and "cat" are orthogonal (score = 0).
| "The" | "cat" | "sat" | |
| "The" | 1.00 | 0.00 | 0.50 |
| "cat" | 0.00 | 1.00 | 0.50 |
| "sat" | 0.50 | 0.50 | 1.00 |
softmax([1.0, 0.0, 0.5]) = [e^1.0, e^0.0, e^0.5] / sum = [2.718, 1.000, 1.649] / 5.367 = [0.506, 0.186, 0.307]
| "The" | "cat" | "sat" | sum | |
| "The" | 0.506 | 0.186 | 0.307 | 1.0 |
| "cat" | 0.186 | 0.506 | 0.307 | 1.0 |
| "sat" | 0.307 | 0.307 | 0.387 | 1.0 |
Each token attends most to itself (the largest weight in each row) but also spreads attention to the others. "sat" attends equally to "The" and "cat" (both 0.274).
Output for "The" = 0.506 · [1,0,1,0] + 0.186 · [0,1,0,1] + 0.307 · [1,1,0,0]
| Token | d0 | d1 | d2 | d3 |
|---|---|---|---|---|
| "The" | 0.813 | 0.494 | 0.506 | 0.186 |
| "cat" | 0.494 | 0.813 | 0.186 | 0.506 |
| "sat" | 0.726 | 0.726 | 0.274 | 0.274 |
Each output vector is a blend of all input value vectors, weighted by attention. "The" absorbed some information from "cat" and "sat" — this is how context flows between tokens.
A single attention head can only capture one type of relationship at a time. Multi-head attention runs h parallel attention heads, each with its own WQ, WK, WV projections, then concatenates and projects the results.
Research has shown that different heads in trained Transformers specialize in different linguistic patterns. This is not engineered — it emerges from training:
- **Previous/next-token heads:** attend to the adjacent token. Capture local word order and bigram relationships.
- **Subject–verb heads:** the verb token attends strongly to its subject, even across long distances with intervening clauses.
- **Coreference heads:** pronouns attend to their antecedents. "it" attends to "the cat" across the full context window.
- **Positional-offset heads:** some heads attend to specific relative positions (e.g., always 2 tokens back), forming a learned convolution.
If the model dimension is dmodel = 768 and we have h = 12 heads, each head works with dk = dmodel / h = 64 dimensions. After computing attention independently, the h output vectors (each 64-dim) are concatenated back to 768-dim and multiplied by an output projection WO (768 × 768) to produce the final output. The total compute is the same as a single head with full dimensionality, but the model gets 12 different "perspectives" on the input.
A single Transformer layer ("block") wraps multi-head attention with feedforward networks, residual connections, and layer normalization. BERT-base stacks 12 of these blocks; GPT-3 stacks 96.
The "Add" in "Add & Norm" is a residual (skip) connection: the input is added to the output of each sub-layer. This ensures gradients can flow directly from output to input, making deep networks (96+ layers) trainable. Without residuals, gradients vanish and deep Transformers fail to converge.
LayerNorm normalizes across the feature dimension (not the batch dimension like BatchNorm). For each token, it centers the dmodel-dimensional vector to mean 0 and variance 1, then applies learned scale and shift parameters. This stabilizes training and makes learning rate less sensitive.
A two-layer MLP applied independently to each token position. The inner dimension is typically 4× the model dimension (768 → 3072 → 768 in BERT-base). This is where the model stores factual knowledge — recent work shows FFN layers act as key-value memories.
The original Transformer applies LayerNorm after the residual (Post-Norm). Most modern models (GPT-2+, LLaMA) use Pre-Norm: normalize first, then apply the sub-layer. Pre-Norm is easier to train at scale, though Post-Norm can achieve slightly better final quality with careful tuning.
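The two orderings can be sketched in a few lines, using a bare LayerNorm and a stand-in linear map as a hypothetical placeholder for the real attention/FFN sub-layers:

```python
import numpy as np

def norm(x):
    # LayerNorm over the last dimension (learned scale/shift omitted)
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / (x.std(axis=-1, keepdims=True) + 1e-5)

def post_norm_block(x, sublayer):
    # Original Transformer: residual add first, then normalize
    return norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # GPT-2 / LLaMA style: normalize first, then residual add
    return x + sublayer(norm(x))

# Demo with a stand-in sub-layer (a fixed linear map)
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8)) * 0.1
sub = lambda x: x @ W

x = rng.standard_normal((3, 8))
print(post_norm_block(x, sub).shape)  # (3, 8)
print(pre_norm_block(x, sub).shape)   # (3, 8)
```

Note the structural difference: in Pre-Norm the raw residual stream `x` is passed through unnormalized, which is what keeps gradients well-behaved in very deep stacks.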
Attention is permutation-equivariant: if you shuffle the input tokens, the attention computation produces the same outputs, just shuffled the same way. The model has no inherent notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would produce the same set of attention scores, merely rearranged. Positional encodings inject order information.
Each dimension of the positional encoding oscillates at a different frequency. Low dimensions change rapidly (high frequency), encoding fine position differences. High dimensions change slowly (low frequency), encoding coarse position. This design has a key property: the encoding of position pos + k can be represented as a linear function of the encoding at pos, allowing the model to learn relative position patterns easily.
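The sinusoidal scheme from the original paper fits in a few lines of NumPy (a sketch following the published formula; `sinusoidal_encoding` is a name chosen here, not a library function):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'.

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / (10000 ** (i / d_model))    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=8)
print(pe.shape)  # (50, 8)
print(pe[0])     # position 0: all sin terms are 0, all cos terms are 1
```

The encoding is simply added to the token embeddings before the first Transformer block.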
Modern LLMs have moved beyond additive positional encodings. RoPE (Rotary Position Embedding, used in LLaMA, Mistral, GPT-NeoX) encodes position by rotating the Q and K vectors in 2D subspaces. This elegantly encodes relative position: the dot product between rotated Q and K depends only on the distance between tokens, not their absolute positions.
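A minimal sketch of the rotation idea follows; real implementations differ in how dimension pairs are laid out, and `rope_rotate` is illustrative, not any library's API:

```python
import numpy as np

def rope_rotate(x, positions, base=10000):
    """Rotate consecutive dimension pairs of x by position-dependent angles.

    x: (seq_len, d) with d even. Pair (2i, 2i+1) is rotated by
    angle theta_i * pos, where theta_i = base^(-2i/d).
    """
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # (d/2,) frequencies
    ang = positions[:, None] * theta[None, :]   # (seq_len, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: dot products depend only on the relative distance.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
pos_a = np.array([3.0, 7.0])    # q at 3, k at 7   -> distance 4
pos_b = np.array([10.0, 14.0])  # q at 10, k at 14 -> distance 4
qa, ka = rope_rotate(np.stack([q, k]), pos_a)
qb, kb = rope_rotate(np.stack([q, k]), pos_b)
print(np.isclose(qa @ ka, qb @ kb))  # True: same relative offset, same score
```

Because each pair is rotated by an orthogonal matrix, vector norms are preserved and only the angle between Q and K shifts with distance.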
ALiBi (Attention with Linear Biases, used in BLOOM, MPT) takes an even simpler approach: no positional encoding at all. Instead, it subtracts a linear penalty from attention scores based on distance: score(i, j) − m · |i − j|, where m is a head-specific slope. This works surprisingly well and generalizes to longer sequences without fine-tuning.
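The bias matrix is trivial to construct. A sketch with a hand-picked slope, rather than the head-specific geometric slopes used in practice:

```python
import numpy as np

def alibi_bias(seq_len, slope):
    """Head-specific linear distance penalty: -slope * |i - j|."""
    idx = np.arange(seq_len)
    return -slope * np.abs(idx[:, None] - idx[None, :])

bias = alibi_bias(4, slope=0.5)
# bias[i, j] is 0 on the diagonal and drops by 0.5 per token of distance;
# it is simply added to the pre-softmax attention scores, e.g.
#   scores = Q @ K.T / np.sqrt(d_k) + alibi_bias(n, slope)
print(bias)
```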
A Transformer produces one output vector per token. But for tasks like semantic search, classification, or clustering, you need a single vector for the entire sentence. How do you collapse a sequence of vectors into one?
BERT prepends a special [CLS] token to every input. After passing through all 12 Transformer layers, the output vector at the [CLS] position is used as the sentence representation. The idea is that attention allows [CLS] to aggregate information from all other tokens.
Limitation: The [CLS] token in base BERT is not actually optimized for sentence similarity. Without fine-tuning, [CLS] embeddings perform worse than simple word2vec averaging for semantic tasks.
Average the output vectors across all tokens (excluding padding). This is the default in sentence-transformers and consistently outperforms [CLS] pooling for embedding quality.
Why it works better: Every token contributes to the final embedding, so no information is bottlenecked through a single position. Content words with strong semantic signal directly influence the result.
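Mean pooling with a padding mask can be sketched as follows; the 0/1 mask convention mirrors common tokenizer output, but this is an illustration, not the sentence-transformers implementation:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions.

    token_embeddings: (seq_len, d) output of the final Transformer layer
    attention_mask:   (seq_len,) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, None].astype(float)    # (seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=0)  # (d,)
    count = mask.sum()                              # number of real tokens
    return summed / np.maximum(count, 1e-9)

# Demo: 4 positions, last one is padding
emb = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0],
                [9.0, 9.0]])   # padding row must not affect the result
mask = np.array([1, 1, 1, 0])
print(mean_pool(emb, mask))    # [3. 4.]
```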
- **Max pooling:** take the element-wise max across all tokens. Captures the strongest signal in each dimension. Rarely used in modern models.
- **Weighted pooling:** weight later layers or specific attention heads more. Some models learn these weights during fine-tuning (e.g., SentEval probing).
- **Last-token pooling:** for decoder-only models (GPT), use the last token's output. It has attended to all previous tokens via causal masking.
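The causal masking that makes last-token pooling work can be sketched by setting future positions to −∞ before the softmax (an illustration built on the same scaled dot-product attention as the full implementation below):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(Q, K):
    """Attention weights with a causal mask: token i sees only j <= i."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)           # future positions -> -inf
    return softmax(scores)

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
W = causal_attention_weights(Q, K)
print(np.round(W, 2))
# The upper triangle is 0: no token attends to the future.
# Only the last row is a distribution over all 4 tokens, which is
# why the last token's output can summarize the whole sequence.
```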
Transformer quality scales predictably with model size. The "Scaling Laws" research (Kaplan et al., 2020) showed that loss follows a power law with parameters, compute, and data. Doubling parameters gives diminishing but consistent returns.
| Model | Layers | Heads | dmodel | Params | Year |
|---|---|---|---|---|---|
| BERT-base | 12 | 12 | 768 | 110M | 2018 |
| BERT-large | 24 | 16 | 1024 | 340M | 2018 |
| GPT-2 | 48 | 25 | 1600 | 1.5B | 2019 |
| GPT-3 | 96 | 96 | 12288 | 175B | 2020 |
| LLaMA-2 70B | 80 | 64 | 8192 | 70B | 2023 |
| GPT-4 (est.) | ~120 | ~96 | ~12288 | ~1.8T | 2023 |
For a Transformer with L layers, dmodel hidden dims, and V vocab size, the parameter count is approximately 12 · L · dmodel² for the attention and feed-forward weights, plus V · dmodel for the token embeddings.
Training a Transformer requires roughly 6 · N · D FLOPs, where N is parameters and D is number of training tokens. GPT-3 (175B params, 300B tokens) required ~3.14 × 10²³ FLOPs, or about $4.6M on V100 GPUs at 2020 cloud prices.
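Both figures can be sanity-checked with a few lines of arithmetic (the 12 · L · dmodel² parameter count is a standard approximation that ignores biases and LayerNorm parameters):

```python
# Parameter count: ~12 * L * d_model^2 (attention + FFN) + V * d_model (embeddings)
L, d_model, V = 12, 768, 30522          # BERT-base
params = 12 * L * d_model**2 + V * d_model
print(f"BERT-base params ≈ {params / 1e6:.0f}M")   # ≈ 108M (paper reports 110M)

# Training compute: FLOPs ≈ 6 * N * D
N = 175e9                                # GPT-3 parameters
D = 300e9                                # GPT-3 training tokens
print(f"GPT-3 training FLOPs ≈ {6 * N * D:.2e}")   # 3.15e+23
```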
Some capabilities only appear above certain scale thresholds. Few-shot learning emerges around 1B parameters. Chain-of-thought reasoning appears around 100B. These "phase transitions" are not yet fully understood and remain an active research topic.
Here is a complete, runnable implementation of scaled dot-product attention and multi-head attention in NumPy. No frameworks, no magic — every operation is explicit.
import numpy as np
def softmax(x, axis=-1):
"""Numerically stable softmax."""
e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
return e_x / e_x.sum(axis=axis, keepdims=True)
def scaled_dot_product_attention(Q, K, V):
"""
Scaled dot-product attention.
Q, K, V: arrays of shape (seq_len, d_k)
Returns: (output, attention_weights)
"""
d_k = K.shape[-1]
# Step 1: Compute raw attention scores
scores = Q @ K.T # (seq_len, seq_len)
# Step 2: Scale by sqrt(d_k)
scores = scores / np.sqrt(d_k) # prevents gradient vanishing in softmax
# Step 3: Softmax to get attention weights
weights = softmax(scores) # each row sums to 1
# Step 4: Weighted sum of values
output = weights @ V # (seq_len, d_k)
return output, weights
# --- Demo with 3 tokens, d_k = 4 ---
X = np.array([
[1.0, 0.0, 1.0, 0.0], # "The"
[0.0, 1.0, 0.0, 1.0], # "cat"
[1.0, 1.0, 0.0, 0.0], # "sat"
])
# Using identity projection (Q = K = V = X)
output, weights = scaled_dot_product_attention(X, X, X)
print("Attention weights:")
print(np.round(weights, 3))
# [[0.506, 0.186, 0.307],
# [0.186, 0.506, 0.307],
# [0.274, 0.274, 0.452]]
print("\nOutput:")
print(np.round(output, 3))
# [[0.813, 0.494, 0.506, 0.186],
# [0.494, 0.813, 0.186, 0.506],
# [0.726, 0.726, 0.274, 0.274]]

def multi_head_attention(X, n_heads, d_model):
"""
Multi-head attention from scratch.
X: (seq_len, d_model)
n_heads: number of attention heads
d_model: model dimension (must be divisible by n_heads)
"""
seq_len = X.shape[0]
d_k = d_model // n_heads # per-head dimension
# Initialize random projection matrices (normally these are learned)
np.random.seed(42)
W_Q = np.random.randn(n_heads, d_model, d_k) * 0.1
W_K = np.random.randn(n_heads, d_model, d_k) * 0.1
W_V = np.random.randn(n_heads, d_model, d_k) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1
head_outputs = []
for h in range(n_heads):
# Project input to Q, K, V for this head
Q = X @ W_Q[h] # (seq_len, d_k)
K = X @ W_K[h] # (seq_len, d_k)
V = X @ W_V[h] # (seq_len, d_k)
# Scaled dot-product attention
output, _ = scaled_dot_product_attention(Q, K, V)
head_outputs.append(output) # (seq_len, d_k)
# Concatenate all heads: (seq_len, n_heads * d_k) = (seq_len, d_model)
concat = np.concatenate(head_outputs, axis=-1)
# Final output projection
result = concat @ W_O # (seq_len, d_model)
return result
# --- Demo: 3 tokens, 4 heads, d_model = 8 ---
X = np.random.randn(3, 8) # 3 tokens, 8-dim embeddings
output = multi_head_attention(X, n_heads=4, d_model=8)
print(f"Input shape: {X.shape}") # (3, 8)
print(f"Output shape: {output.shape}") # (3, 8)
print("\nEach token now contains information from all other tokens,")
print("aggregated through 4 different attention perspectives.")def layer_norm(x, eps=1e-5):
"""Layer normalization over the last dimension."""
mean = x.mean(axis=-1, keepdims=True)
std = x.std(axis=-1, keepdims=True)
return (x - mean) / (std + eps)
def gelu(x):
"""Gaussian Error Linear Unit activation."""
return 0.5 * x * (1 + np.tanh(
np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
))
def feed_forward(x, d_model, d_ff):
"""Position-wise feed-forward network."""
np.random.seed(123)
W1 = np.random.randn(d_model, d_ff) * 0.02
b1 = np.zeros(d_ff)
W2 = np.random.randn(d_ff, d_model) * 0.02
b2 = np.zeros(d_model)
hidden = gelu(x @ W1 + b1) # (seq_len, d_ff)
output = hidden @ W2 + b2 # (seq_len, d_model)
return output
def transformer_block(x, n_heads, d_model, d_ff):
"""
One complete Transformer encoder block.
x: (seq_len, d_model)
"""
# Sub-layer 1: Multi-head attention + residual + norm
attn_out = multi_head_attention(x, n_heads, d_model)
x = layer_norm(x + attn_out) # Add & Norm
# Sub-layer 2: Feed-forward + residual + norm
ff_out = feed_forward(x, d_model, d_ff)
x = layer_norm(x + ff_out) # Add & Norm
return x
# --- Full block: 3 tokens, 4 heads, d_model=8, d_ff=32 ---
X = np.random.randn(3, 8)
output = transformer_block(X, n_heads=4, d_model=8, d_ff=32)
print(f"Transformer block: {X.shape} -> {output.shape}")
# Transformer block: (3, 8) -> (3, 8)
# Same shape in, same shape out — stackable!

The Transformer story is told through a remarkably small set of papers. These are the ones every ML practitioner should read, in roughly chronological order.
Now that you understand how Transformers work from the inside, explore the benchmarks where they dominate — or dive into more deep dives from the Academy.