Academy Deep Dive

How Transformers
Actually Work

From the raw dot-product attention math to full encoder blocks and sentence embeddings. No hand-waving. Real numbers, real formulas, working code.

Architecture at a Glance

2017
Year introduced (Vaswani et al.)
130k+
Citations on original paper
O(n²d)
Attention complexity
~25 min
Estimated reading time

1. The Attention Mechanism

Attention is the core operation of every Transformer. Before Transformers, sequence models like LSTMs processed tokens one at a time, left to right. The attention mechanism lets every token look at every other token simultaneously, computing a weighted sum based on relevance.

Q, K, V: Three Separate Projections

Each input token has an embedding vector. We multiply it by three learned weight matrices to produce three different vectors:

Q = X · WQ   // Query: "what am I looking for?"
K = X · WK   // Key: "what do I contain?"
V = X · WV   // Value: "what information do I carry?"

Why three separate matrices? Because the role a token plays when asking a question (Query) is different from the role it plays when being asked about (Key), which is again different from the information it provides (Value). A single projection cannot capture all three roles. This factorization gives the model flexibility to learn asymmetric relationships: token A attending to token B does not require B to attend equally to A.

The Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dk) V

Let's break down every component:

Step 1: QKᵀ — Raw Attention Scores

Multiply the Query matrix by the transpose of the Key matrix. This computes a dot product between every pair of tokens. If Q and K are shaped (n × dk), the result is an (n × n) matrix where entry (i, j) measures how much token i should attend to token j. Higher dot product = more relevant.

Step 2: / √dk — Scaling

Divide every score by the square root of the key dimension. Why? The dot product of two dk-dimensional vectors with unit-variance components has variance dk, so when dk is large (e.g., 64), raw scores become large in magnitude. This pushes the softmax into saturated regions where its gradients are extremely small, making training unstable. Dividing by √dk keeps the variance of the scores at roughly 1, regardless of dimension size. For dk = 64, we divide by 8.
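You can verify the variance argument directly. A quick NumPy sketch (illustrative, not part of the reference implementation in section 8): sample random queries and keys with unit-variance components, and watch the dot-product variance grow with dk until dividing by √dk flattens it back to roughly 1.

```python
import numpy as np

# Empirical check: for q, k with unit-variance components, var(q . k) ~ d_k,
# and dividing by sqrt(d_k) restores variance ~1 at any dimension.
rng = np.random.default_rng(0)

for d_k in (4, 64, 512):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    dots = (q * k).sum(axis=1)           # 10,000 sample dot products
    scaled = dots / np.sqrt(d_k)
    print(f"d_k={d_k:4d}  var(q.k)={dots.var():7.1f}  "
          f"var(q.k / sqrt(d_k))={scaled.var():5.2f}")
```

The unscaled variance tracks dk almost exactly, which is why a fixed softmax temperature cannot work across head sizes.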

Step 3: softmax() — Normalize to Probabilities

Apply softmax row-wise so each row sums to 1. This turns raw scores into a probability distribution: for each token, we get a distribution over all tokens it should pay attention to. A score of 0.7 on position j means "70% of my attention goes to token j."

Step 4: × V — Weighted Sum of Values

Multiply the attention weights by the Value matrix. This produces a weighted combination of value vectors for each token. If token 0 attends 70% to token 1 and 30% to token 2, its output is 0.7 · V1 + 0.3 · V2. The output has the same shape as the input: (n × dk).

2. Numerical Walkthrough: 3 Tokens

Let's trace through the full attention computation with actual numbers. We have 3 tokens with dk = 4 (tiny, but the math is identical to dk = 64 in real models).

Input Embeddings X (3 tokens × 4 dims)

Token   d0    d1    d2    d3
"The"   1.0   0.0   1.0   0.0
"cat"   0.0   1.0   0.0   1.0
"sat"   1.0   1.0   0.0   0.0

After Projection (using identity W for clarity)

For this example, Q = K = V = X (identity weights). In real models, learned WQ, WK, WV produce different projections.

Q = K = V = X   (same as above, dk = 4)

Step 1: QKᵀ (Raw Scores)

Each entry (i, j) = dot product of Qᵢ and Kⱼ

         "The"  "cat"  "sat"
"The"     2.0    0.0    1.0
"cat"     0.0    2.0    1.0
"sat"     1.0    1.0    2.0

Note: each token has the highest dot product with itself (diagonal = 2.0). "The" and "cat" are orthogonal (score = 0).

Step 2: Divide by √dk = √4 = 2

         "The"  "cat"  "sat"
"The"    1.00   0.00   0.50
"cat"    0.00   1.00   0.50
"sat"    0.50   0.50   1.00

Step 3: Softmax (row-wise)

softmax([1.0, 0.0, 0.5]) = [e^1.0, e^0.0, e^0.5] / sum = [2.718, 1.0, 1.649] / 5.367 = [0.506, 0.186, 0.307]

         "The"  "cat"  "sat"   sum
"The"    0.506  0.186  0.307   1.0
"cat"    0.186  0.506  0.307   1.0
"sat"    0.274  0.274  0.452   1.0

Each token attends most to itself (~45–51%) but also spreads attention to others. "sat" attends equally to "The" and "cat" (0.274 each), keeping 0.452 for itself.

Step 4: Attention Weights × V = Output

Output for "The" = 0.506 · [1,0,1,0] + 0.186 · [0,1,0,1] + 0.307 · [1,1,0,0]

Token   d0     d1     d2     d3
"The"   0.814  0.494  0.506  0.186
"cat"   0.494  0.814  0.186  0.506
"sat"   0.726  0.726  0.274  0.274

Each output vector is a blend of all input value vectors, weighted by attention. "The" absorbed some information from "cat" and "sat" — this is how context flows between tokens.

3. Multi-Head Attention

A single attention head can only capture one type of relationship at a time. Multi-head attention runs h parallel attention heads, each with its own WQ, WK, WV projections, then concatenates and projects the results.

head_i = Attention(X · WQ_i, X · WK_i, X · WV_i)
MultiHead(X) = Concat(head_1, ..., head_h) · WO

What Different Heads Learn

Research has shown that different heads in trained Transformers specialize in different linguistic patterns. This is not engineered — it emerges from training:

Head A

Syntactic adjacency

Attends to the next/previous token. Captures local word order and bigram relationships.

Head B

Subject-verb agreement

The verb token attends strongly to its subject, even across long distances with intervening clauses.

Head C

Coreference resolution

Pronouns attend to their antecedents. "it" attends to "the cat" across the full context window.

Head D

Positional patterns

Some heads attend to specific relative positions (e.g., always 2 tokens back), forming a learned convolution.

Concatenation and Output Projection

If the model dimension is dmodel = 768 and we have h = 12 heads, each head works with dk = dmodel / h = 64 dimensions. After computing attention independently, the h output vectors (each 64-dim) are concatenated back to 768-dim and multiplied by WO (768 × 768) to produce the final output. The total compute is the same as a single head with full dimensionality, but the model gets 12 different "perspectives" on the input.

Dimensions at Each Step (BERT-base, h=12)

Input:             (seq_len × 768)
Per-head Q, K, V:  (seq_len × 64)   × 12 heads
Per-head output:   (seq_len × 64)   × 12 heads
Concatenated:      (seq_len × 768)   = 12 × 64
After WO:         (seq_len × 768)   back to model dim

4. The Full Transformer Block

A single Transformer layer ("block") wraps multi-head attention with feedforward networks, residual connections, and layer normalization. BERT-base stacks 12 of these blocks; GPT-3 stacks 96.

Data Flow Through One Transformer Block

Input: x (seq_len × dmodel)
  → Multi-Head Attention(x, x, x)
  → Add & Norm: LayerNorm(x + Attention(x))
  → Feed-Forward Network: FFN(x) = W2 · GELU(W1·x + b1) + b2
  → Add & Norm: LayerNorm(x + FFN(x))
Output: x' (seq_len × dmodel)

Residual Connections

The "Add" in "Add & Norm" is a residual (skip) connection: the input is added to the output of each sub-layer. This ensures gradients can flow directly from output to input, making deep networks (96+ layers) trainable. Without residuals, gradients vanish and deep Transformers fail to converge.

Layer Normalization

LayerNorm normalizes across the feature dimension (not the batch dimension like BatchNorm). For each token, it centers the dmodel-dimensional vector to mean 0 and variance 1, then applies learned scale and shift parameters. This stabilizes training and makes learning rate less sensitive.

Feed-Forward Network

A two-layer MLP applied independently to each token position. The inner dimension is typically 4× the model dimension (768 → 3072 → 768 in BERT-base). This is where the model stores factual knowledge — recent work shows FFN layers act as key-value memories.

Pre-Norm vs. Post-Norm

The original Transformer applies LayerNorm after the residual (Post-Norm). Most modern models (GPT-2+, LLaMA) use Pre-Norm: normalize first, then apply the sub-layer. Pre-Norm is easier to train at scale, though Post-Norm can achieve slightly better final quality with careful tuning.
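The two variants differ only in where the normalization sits. A minimal NumPy sketch with stand-in sub-layers (the `attn` and `ffn` lambdas here are placeholders, not real attention or FFN implementations):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector across the feature dimension
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Stand-in sub-layers so the sketch runs on its own; real blocks use
# multi-head attention and a two-layer MLP here.
attn = lambda x: x @ np.eye(x.shape[-1])
ffn = lambda x: np.tanh(x)

def post_norm_block(x):
    # Original Transformer: sub-layer first, then normalize the residual sum
    x = layer_norm(x + attn(x))
    x = layer_norm(x + ffn(x))
    return x

def pre_norm_block(x):
    # GPT-2 / LLaMA style: normalize, apply the sub-layer, then add —
    # the residual path itself is never normalized, which helps deep stacks
    x = x + attn(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

x = np.random.randn(3, 8)
print(post_norm_block(x).shape, pre_norm_block(x).shape)  # (3, 8) (3, 8)
```

Note the Pre-Norm residual stream runs uninterrupted from input to output, which is the usual explanation for its better trainability at depth.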

5. Positional Encoding

Attention is permutation-invariant: if you shuffle the input tokens, the attention computation produces the same output (just shuffled). The model has no inherent notion of word order. "The cat sat on the mat" and "mat the on sat cat the" would produce identical attention scores. Positional encodings inject order information.
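You can check this property numerically. A self-contained sketch (attention re-implemented inline with identity projections so it runs on its own): permuting the input rows permutes the output rows identically and changes nothing else.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X):
    # Self-attention with identity projections (Q = K = V = X)
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores) @ X

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))      # 6 tokens, d_k = 4
perm = rng.permutation(6)            # a random re-ordering of the tokens

out = attention(X)
out_shuffled = attention(X[perm])    # attend over the shuffled sequence

# Same output vectors, just re-ordered: attention alone carries no
# notion of token position.
print(np.allclose(out[perm], out_shuffled))  # True
```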

Sinusoidal Positional Encoding (Original)

PE(pos, 2i)   = sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel))

Each dimension of the positional encoding oscillates at a different frequency. Low dimensions change rapidly (high frequency), encoding fine position differences. High dimensions change slowly (low frequency), encoding coarse position. This design has a key property: the encoding of position pos + k can be represented as a linear function of the encoding at pos, allowing the model to learn relative position patterns easily.
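The two formulas drop straight into NumPy. A sketch of the original encoding (the function name is ours, not from any library):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cos for odd dims."""
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2), i = 2·pair
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model/2)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=16)
print(pe.shape)   # (128, 16)
print(pe[0])      # position 0: all sin terms are 0, all cos terms are 1
```

Plotting a few columns makes the frequency spectrum visible: column 0 oscillates every few positions, while the last columns barely move over 128 positions.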

Sinusoidal (Fixed)

  • Deterministic, computed once, not learned
  • Can generalize to longer sequences than seen during training
  • Used in the original Transformer (Vaswani et al., 2017)
  • No additional parameters

Learned Positions

  • A learned embedding matrix of shape (max_len × dmodel)
  • Used in BERT, GPT-2 (max 512 and 1024 positions respectively)
  • Cannot extrapolate beyond max_len without tricks
  • In practice, performs about the same as sinusoidal

Modern Alternatives: RoPE and ALiBi

Modern LLMs have moved beyond additive positional encodings. RoPE (Rotary Position Embedding, used in LLaMA, Mistral, GPT-NeoX) encodes position by rotating the Q and K vectors in 2D subspaces. This elegantly encodes relative position: the dot product between rotated Q and K depends only on the distance between tokens, not their absolute positions.

ALiBi (Attention with Linear Biases, used in BLOOM, MPT) takes an even simpler approach: no positional encoding at all. Instead, it subtracts a linear penalty from attention scores based on distance: score(i, j) − m · |i − j|, where m is a head-specific slope. This works surprisingly well and generalizes to longer sequences without fine-tuning.
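The bias itself is a few lines of NumPy. A sketch (the geometric slope sequence used here is one common choice; the paper derives head-specific slopes from the head count):

```python
import numpy as np

def alibi_bias(seq_len, n_heads):
    """Per-head linear distance penalty: bias[h, i, j] = -m_h * |i - j|."""
    # Head-specific slopes: geometric sequence 1/2, 1/4, 1/8, ...
    slopes = 2.0 ** -np.arange(1, n_heads + 1)
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])     # (seq_len, seq_len)
    return -slopes[:, None, None] * dist           # (n_heads, seq, seq)

bias = alibi_bias(seq_len=5, n_heads=4)
# Added to raw attention scores before softmax, per head:
#   scores = Q @ K.T / np.sqrt(d_k) + bias[h]
print(bias[0])   # head 0: 0 on the diagonal, -0.5 per token of distance
```

Because the penalty is linear in distance, nearby tokens keep most of their attention mass at any sequence length — the source of ALiBi's length extrapolation.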

6. From Tokens to Sentence Embeddings

A Transformer produces one output vector per token. But for tasks like semantic search, classification, or clustering, you need a single vector for the entire sentence. How do you collapse a sequence of vectors into one?

[CLS] Token Approach (BERT)

BERT prepends a special [CLS] token to every input. After passing through all 12 Transformer layers, the output vector at the [CLS] position is used as the sentence representation. The idea is that attention allows [CLS] to aggregate information from all other tokens.

# BERT input format:
[CLS] The cat sat on the mat [SEP]
# Sentence embedding = output at position 0 (the [CLS] token)
embedding = model_output[0]  # shape: (768,)

Limitation: The [CLS] token in base BERT is not actually optimized for sentence similarity. Without fine-tuning, [CLS] embeddings perform worse than simple word2vec averaging for semantic tasks.

Mean Pooling (Sentence-Transformers)

Average the output vectors across all tokens (excluding padding). This is the default in sentence-transformers and consistently outperforms [CLS] pooling for embedding quality.

# Mean pooling over all token outputs:
token_outputs = model_output       # shape: (seq_len, 768)
attention_mask = ...               # shape: (seq_len,), 1 for real tokens, 0 for padding
mask = attention_mask[:, None]     # expand to (seq_len, 1) so it broadcasts
embedding = (token_outputs * mask).sum(axis=0) / mask.sum()
# Result shape: (768,)

Why it works better: Every token contributes to the final embedding, so no information is bottlenecked through a single position. Content words with strong semantic signal directly influence the result.

Other Pooling Strategies

Max Pooling

Take the element-wise max across all tokens. Captures the strongest signal in each dimension. Rarely used in modern models.

Weighted Mean

Weight later layers or specific attention heads more. Some models learn these weights during fine-tuning (e.g., SentEval probing).

Last Token (Causal)

For decoder-only models (GPT), use the last token's output. It has attended to all previous tokens via causal masking.
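With padded batches, "last token" means the last real token, not the last position. A sketch using the attention mask to index it (array names are illustrative, with random stand-in outputs):

```python
import numpy as np

# Hypothetical batch: 2 sequences, max length 5, hidden size 8
token_outputs = np.random.randn(2, 5, 8)
attention_mask = np.array([
    [1, 1, 1, 0, 0],   # sequence 0 has 3 real tokens
    [1, 1, 1, 1, 1],   # sequence 1 uses the full length
])

# Index of the last real token in each sequence
last = attention_mask.sum(axis=1) - 1             # [2, 4]
embeddings = token_outputs[np.arange(2), last]    # (2, 8)
print(embeddings.shape)
```

Taking position -1 instead would silently return padding vectors for sequence 0.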

7. Scale: How Size Affects Quality

Transformer quality scales predictably with model size. The "Scaling Laws" research (Kaplan et al., 2020) showed that loss follows a power law with parameters, compute, and data. Doubling parameters gives diminishing but consistent returns.

Model         Layers  Heads  dmodel  Params  Year
BERT-base     12      12     768     110M    2018
BERT-large    24      16     1024    340M    2018
GPT-2         48      25     1600    1.5B    2019
GPT-3         96      96     12288   175B    2020
LLaMA-2 70B   80      64     8192    70B     2023
GPT-4 (est.)  ~120    ~96    ~12288  ~1.8T   2023

Parameter Count Formula

For a Transformer with L layers, dmodel hidden dims, and V vocab size:

Attention:  4 · d² per layer (WQ, WK, WV, WO)
FFN:        8 · d² per layer (4× inner)
Embeddings: V · d
Total ≈ 12 · L · d² + V · d
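Plugging in BERT-base's numbers (with its WordPiece vocabulary of 30,522 tokens) lands close to the quoted 110M — the remainder is biases, LayerNorm parameters, and position embeddings, which the formula ignores:

```python
def approx_params(n_layers, d_model, vocab_size):
    """Back-of-the-envelope: 12 * L * d^2 (attention + FFN) + V * d."""
    attention = 4 * d_model**2 * n_layers   # W_Q, W_K, W_V, W_O per layer
    ffn = 8 * d_model**2 * n_layers         # d -> 4d -> d per layer
    embeddings = vocab_size * d_model
    return attention + ffn + embeddings

total = approx_params(n_layers=12, d_model=768, vocab_size=30_522)
print(f"{total / 1e6:.1f}M")   # 108.4M — close to BERT-base's 110M
```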

Compute Cost

Training a Transformer requires roughly 6 · N · D FLOPs, where N is parameters and D is number of training tokens. GPT-3 (175B params, 300B tokens) required ~3.14 × 10²³ FLOPs, or about $4.6M on V100 GPUs at 2020 cloud prices.
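A one-line sanity check of the 6 · N · D rule against the GPT-3 numbers quoted above:

```python
N = 175e9   # parameters
D = 300e9   # training tokens
flops = 6 * N * D
print(f"{flops:.2e}")   # 3.15e+23 — consistent with the ~3.14e23 figure
```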

Emergent Abilities

Some capabilities only appear above certain scale thresholds. Few-shot learning emerges around 1B parameters. Chain-of-thought reasoning appears around 100B. These "phase transitions" are not yet fully understood and remain an active research topic.

8. Working Code: Attention in NumPy

Here is a complete, runnable implementation of scaled dot-product attention and multi-head attention in NumPy. No frameworks, no magic — every operation is explicit.

attention.py
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """
    Scaled dot-product attention.
    Q, K, V: arrays of shape (seq_len, d_k)
    Returns: (output, attention_weights)
    """
    d_k = K.shape[-1]

    # Step 1: Compute raw attention scores
    scores = Q @ K.T                # (seq_len, seq_len)

    # Step 2: Scale by sqrt(d_k)
    scores = scores / np.sqrt(d_k)  # prevents gradient vanishing in softmax

    # Step 3: Softmax to get attention weights
    weights = softmax(scores)       # each row sums to 1

    # Step 4: Weighted sum of values
    output = weights @ V            # (seq_len, d_k)

    return output, weights

# --- Demo with 3 tokens, d_k = 4 ---
X = np.array([
    [1.0, 0.0, 1.0, 0.0],   # "The"
    [0.0, 1.0, 0.0, 1.0],   # "cat"
    [1.0, 1.0, 0.0, 0.0],   # "sat"
])

# Using identity projection (Q = K = V = X)
output, weights = scaled_dot_product_attention(X, X, X)

print("Attention weights:")
print(np.round(weights, 3))
# [[0.506, 0.186, 0.307],
#  [0.186, 0.506, 0.307],
#  [0.274, 0.274, 0.452]]

print("\nOutput:")
print(np.round(output, 3))
# [[0.814, 0.494, 0.506, 0.186],
#  [0.494, 0.814, 0.186, 0.506],
#  [0.726, 0.726, 0.274, 0.274]]
multi_head_attention.py
def multi_head_attention(X, n_heads, d_model):
    """
    Multi-head attention from scratch.
    Uses numpy (np) and scaled_dot_product_attention from attention.py above.
    X: (seq_len, d_model)
    n_heads: number of attention heads
    d_model: model dimension (must be divisible by n_heads)
    """
    seq_len = X.shape[0]
    d_k = d_model // n_heads  # per-head dimension

    # Initialize random projection matrices (normally these are learned)
    np.random.seed(42)
    W_Q = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_K = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_V = np.random.randn(n_heads, d_model, d_k) * 0.1
    W_O = np.random.randn(d_model, d_model) * 0.1

    head_outputs = []

    for h in range(n_heads):
        # Project input to Q, K, V for this head
        Q = X @ W_Q[h]   # (seq_len, d_k)
        K = X @ W_K[h]   # (seq_len, d_k)
        V = X @ W_V[h]   # (seq_len, d_k)

        # Scaled dot-product attention
        output, _ = scaled_dot_product_attention(Q, K, V)
        head_outputs.append(output)  # (seq_len, d_k)

    # Concatenate all heads: (seq_len, n_heads * d_k) = (seq_len, d_model)
    concat = np.concatenate(head_outputs, axis=-1)

    # Final output projection
    result = concat @ W_O  # (seq_len, d_model)

    return result

# --- Demo: 3 tokens, 4 heads, d_model = 8 ---
X = np.random.randn(3, 8)  # 3 tokens, 8-dim embeddings
output = multi_head_attention(X, n_heads=4, d_model=8)
print(f"Input shape:  {X.shape}")    # (3, 8)
print(f"Output shape: {output.shape}")  # (3, 8)
print("\nEach token now contains information from all other tokens,")
print("aggregated through 4 different attention perspectives.")
full_transformer_block.py
def layer_norm(x, eps=1e-5):
    """Layer normalization over the last dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1 + np.tanh(
        np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)
    ))

def feed_forward(x, d_model, d_ff):
    """Position-wise feed-forward network."""
    np.random.seed(123)
    W1 = np.random.randn(d_model, d_ff) * 0.02
    b1 = np.zeros(d_ff)
    W2 = np.random.randn(d_ff, d_model) * 0.02
    b2 = np.zeros(d_model)

    hidden = gelu(x @ W1 + b1)  # (seq_len, d_ff)
    output = hidden @ W2 + b2    # (seq_len, d_model)
    return output

def transformer_block(x, n_heads, d_model, d_ff):
    """
    One complete Transformer encoder block.
    x: (seq_len, d_model)
    """
    # Sub-layer 1: Multi-head attention + residual + norm
    attn_out = multi_head_attention(x, n_heads, d_model)
    x = layer_norm(x + attn_out)     # Add & Norm

    # Sub-layer 2: Feed-forward + residual + norm
    ff_out = feed_forward(x, d_model, d_ff)
    x = layer_norm(x + ff_out)       # Add & Norm

    return x

# --- Full block: 3 tokens, 4 heads, d_model=8, d_ff=32 ---
X = np.random.randn(3, 8)
output = transformer_block(X, n_heads=4, d_model=8, d_ff=32)
print(f"Transformer block: {X.shape} -> {output.shape}")
# Transformer block: (3, 8) -> (3, 8)
# Same shape in, same shape out — stackable!

9. Key Papers

The Transformer story is told through a remarkably small set of papers. These are the ones every ML practitioner should read, in roughly chronological order.

Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
Introduced the Transformer architecture
Venue
NeurIPS 2017
Citations
130,000+
BERT: Pre-training of Deep Bidirectional Transformers
Devlin, Chang, Lee, Toutanova
Bidirectional pre-training with [CLS] and MLM
Venue
NAACL 2019
Citations
95,000+
Improving Language Understanding by Generative Pre-Training
Radford, Narasimhan, Salimans, Sutskever
GPT-1: autoregressive pre-training
Venue
OpenAI 2018
Citations
12,000+
Language Models are Few-Shot Learners
Brown, Mann, Ryder, Subbiah, et al.
GPT-3: scaling laws and in-context learning
Venue
NeurIPS 2020
Citations
40,000+
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych
Mean pooling for sentence embeddings
Venue
EMNLP 2019
Citations
8,500+
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Dosovitskiy, Beyer, Kolesnikov, et al.
Vision Transformer (ViT)
Venue
ICLR 2021
Citations
35,000+
FlashAttention: Fast and Memory-Efficient Exact Attention
Dao, Fu, Ermon, Rudra, Re
IO-aware exact attention algorithm
Venue
NeurIPS 2022
Citations
3,000+

Keep Going

Now that you understand how Transformers work from the inside, explore the benchmarks where they dominate — or dive into more deep dives from the Academy.

Last updated: March 2026