Why Neural Networks Need Embeddings:
The Matrix Operations Problem
The real answer is not “computers need numbers.” It is that matrix multiplication — the core operation of every neural network — is only defined over numerical vectors. This page shows you exactly what that means, with actual math.
The Real Problem: Matrix Multiplication Is Only Defined Over Numbers
Every tutorial says “computers can’t understand words, so we convert them to numbers.” That framing is misleading. Computers handle strings perfectly well — your browser is rendering these words right now. The actual constraint comes from how neural networks compute.
A single neuron in a neural network computes exactly one thing:

output = activation(W · x + b)

where:
W = weight matrix (learned parameters)
x = input vector (your data)
b = bias vector
activation = non-linear function (ReLU, sigmoid, etc.)
The critical operation is W · x — matrix-vector multiplication. This operation is only defined when both W and x contain numbers. Not strings, not categories, not booleans — numbers that you can multiply and add together.
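As a quick sanity check, the whole neuron computation is a few lines of numpy. This is a minimal sketch with illustrative values (the same toy W and x used in the walkthrough that follows, plus an arbitrary bias, with ReLU as the activation):

```python
import numpy as np

# One neuron layer: output = activation(W · x + b)
W = np.array([[0.5, -0.3, 0.8],
              [0.2,  0.7, -0.1]])   # 2×3 weight matrix (learned in practice)
x = np.array([1.0, 0.5, 0.3])       # 3-dim numerical input vector
b = np.array([0.1, -0.2])           # bias vector (illustrative values)

y = np.maximum(0, W @ x + b)        # ReLU(Wx + b)
print(np.round(y, 2))               # ≈ [0.69, 0.32]
```

Every value here is a float, so every multiplication in `W @ x` is defined.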
Step-by-step: What matrix multiplication actually does
Let’s say we have a tiny weight matrix W (2×3) and an input vector x (3×1):
W = | 0.5  -0.3   0.8 |      x = | 1.0 |
    | 0.2   0.7  -0.1 |          | 0.5 |
                                 | 0.3 |

W · x = | (0.5×1.0) + (-0.3×0.5) + (0.8×0.3) |
        | (0.2×1.0) + (0.7×0.5) + (-0.1×0.3) |

      = | 0.5 + (-0.15) + 0.24 |
        | 0.2 + 0.35 + (-0.03) |

      = | 0.59 |
        | 0.52 |

Every single element of the output required multiplication and addition between the weight and input values. Now try to imagine what happens if x contains a string:
x = | "cat" | ← What is 0.5 × "cat"?
| "sat" | ← What is -0.3 × "sat"?
| "mat" | ← What is 0.8 × "mat"?
W · x = | (0.5 × "cat") + (-0.3 × "sat") + (0.8 × "mat") |
| |
| ← UNDEFINED. Cannot compute. |The core constraint
The string “cat” has no defined multiplication operation with the float 0.5. This is not a software limitation — it is a mathematical one. Matrix multiplication requires elements from a field (a set with defined +, −, ×, ÷). Strings are not elements of any field.
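You can observe this undefined operation directly in Python (the exact exception message varies by Python and numpy version, but it is a TypeError in both cases):

```python
import numpy as np

# Multiplying a float by a string is undefined. (Note: int × str in Python
# repeats the string, which is concatenation, not arithmetic.)
try:
    0.5 * "cat"
except TypeError as e:
    print("float × str:", e)

# A vector of strings fails the same way inside matrix multiplication
W_row = np.array([[0.5, -0.3, 0.8]])
x = np.array(["cat", "sat", "mat"])  # dtype is a string type, not numeric
try:
    W_row @ x
except TypeError as e:
    print("W · x with strings raises:", type(e).__name__)
```

Both failures are the same underlying fact: no multiplication is defined between a float and a string.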
So the question becomes: how do we represent words as numerical vectors in a way that is useful to the network? Not just any numbers — numbers that encode something meaningful about the word’s identity and relationships.
Why ASCII and One-Hot Encoding Fail
The naive thought is: words are already made of characters, and characters have ASCII codes. Problem solved? Not even close. Let’s see exactly why.
Attempt 1: ASCII Values
"cat" → [99, 97, 116] (c=99, a=97, t=116) "dog" → [100, 111, 103] (d=100, o=111, g=103) "car" → [99, 97, 114] (c=99, a=97, r=114)
These are numbers, so the matrix multiplication is now defined. But the numbers encode the completely wrong information:
- ✕“cat” and “car” appear more similar than “cat” and “dog” — because they share the letters c, a. But semantically, cat and dog are far more related (both are animals, pets).
- ✕The numbers imply an ordering that does not exist — “d” (100) is not “greater than” “c” (99) in any linguistically meaningful way.
- ✕Different-length words produce different-size vectors — “cat” has 3 dimensions, “elephant” has 8. Neural network layers require fixed-size inputs.
Let’s prove the similarity problem with actual dot products. The dot product is the core of matrix multiplication — if two vectors have a higher dot product, the network treats them as more similar:
dot("cat", "car") = (99×99) + (97×97) + (116×114)
= 9801 + 9409 + 13224
= 32,434
dot("cat", "dog") = (99×100) + (97×111) + (116×103)
= 9900 + 10767 + 11948
= 32,615
# "cat"·"dog" ≈ "cat"·"car" — the network can't distinguish
# semantic similarity from spelling similarity

Attempt 2: One-Hot Encoding
The next idea is to assign each word a unique position in a vocabulary-sized vector. With a vocabulary of 5 words [cat, dog, car, tree, house]:
"cat" → [1, 0, 0, 0, 0] "dog" → [0, 1, 0, 0, 0] "car" → [0, 0, 1, 0, 0] "tree" → [0, 0, 0, 1, 0] "house" → [0, 0, 0, 0, 1]
This fixes the false-similarity problem from ASCII — every word is equally distant from every other word. But that is also its fatal flaw:
dot("cat", "dog") = (1×0) + (0×1) + (0×0) + (0×0) + (0×0) = 0
dot("cat", "car") = (1×0) + (0×0) + (0×1) + (0×0) + (0×0) = 0
dot("cat", "tree") = (1×0) + (0×0) + (0×0) + (0×1) + (0×0) = 0
# EVERY pair of different words has dot product = 0
# "cat" is exactly as similar to "dog" as it is to "car" or "tree"
# The network gets ZERO similarity signal

The two problems with one-hot vectors
One-hot vectors are orthogonal — their dot product is always 0. The network has to learn from scratch that “cat” and “kitten” are related. No structural hint is provided.
Real vocabularies have 30,000–100,000 tokens. Each one-hot vector has that many dimensions, with only a single 1. The weight matrix W for the first layer must be enormous (e.g., 768 × 50,000 = 38.4 million parameters) just to handle the input.
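The scale problem is simple arithmetic to verify, using the illustrative sizes from the text (768-dim hidden layer, 50,000-token vocabulary):

```python
# Weights needed by a first layer producing a 768-dim hidden vector,
# comparing a one-hot input against a dense embedding input
hidden_dim = 768
vocab_size = 50_000   # one-hot input width
embed_dim = 768       # dense embedding input width

one_hot_first_layer = hidden_dim * vocab_size   # one weight per vocab entry per unit
dense_first_layer = hidden_dim * embed_dim      # dense input is ~65× narrower

print(f"{one_hot_first_layer:,}")   # 38,400,000
print(f"{dense_first_layer:,}")     # 589,824 (~590K)
```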
Let’s see what the matrix multiplication looks like with one-hot inputs. Suppose W is a 3×5 weight matrix (3-dim output, vocabulary of 5):
W = | 0.2 0.8 -0.1 0.5 0.3 |
| 0.9 -0.2 0.4 0.1 -0.6 |
| 0.1 0.3 0.7 -0.4 0.2 |
x_cat = [1, 0, 0, 0, 0] (one-hot for "cat")
W · x_cat = | 0.2×1 + 0.8×0 + (-0.1)×0 + 0.5×0 + 0.3×0 | | 0.2 |
| 0.9×1 + (-0.2)×0 + 0.4×0 + 0.1×0 + (-0.6)×0 | = | 0.9 |
| 0.1×1 + 0.3×0 + 0.7×0 + (-0.4)×0 + 0.2×0 | | 0.1 |
# One-hot multiplication just SELECTS a column of W
# For "cat" (index 0): output = column 0 of W = [0.2, 0.9, 0.1]
# For "dog" (index 1): output = column 1 of W = [0.8, -0.2, 0.3]Key insight
Multiplying a one-hot vector by a weight matrix just selects one column of that matrix. This means the “first layer” of a network receiving one-hot inputs is really just a lookup table. Each word gets its own learned column vector. This is exactly what an embedding layer is — but without the wasted computation of multiplying by all those zeros.
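The column-selection equivalence is easy to verify in numpy, using the toy 3×5 W from above:

```python
import numpy as np

# Multiplying by a one-hot vector is identical to indexing a column of W
W = np.array([
    [0.2, 0.8, -0.1, 0.5, 0.3],
    [0.9, -0.2, 0.4, 0.1, -0.6],
    [0.1, 0.3, 0.7, -0.4, 0.2],
])
one_hot_cat = np.array([1, 0, 0, 0, 0])  # "cat" at index 0

full_multiply = W @ one_hot_cat   # 15 multiplications, 12 of them by zero
lookup = W[:, 0]                  # direct column lookup, no arithmetic

print(full_multiply)              # [0.2 0.9 0.1]
assert np.allclose(full_multiply, lookup)
```

An embedding layer exploits exactly this: it skips the multiplication and performs the lookup directly.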
The Embedding Solution: Learned Dense Vectors
An embedding is a dense, low-dimensional vector that is learned during training. Instead of a sparse 50,000-dimensional one-hot vector, each word gets a compact vector (typically 64 to 1024 dimensions) where each dimension encodes some latent feature.
The critical property: similar words end up with similar vectors. Not because we told the network that “cat” and “dog” are related, but because during training, words that appear in similar contexts get pushed toward similar regions of the embedding space.
How it works, mechanically

An embedding layer is a matrix with one row per vocabulary token. To embed a token, the network looks up the row at that token's ID, which is mathematically equivalent to multiplying a one-hot vector by the matrix, but implemented as a direct index. During training, backpropagation adjusts these rows just like any other weights, which is how semantic structure emerges.
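A minimal sketch of the lookup, assuming a toy 5-word vocabulary and 3-dim vectors (random values here; in a real model they are learned):

```python
import numpy as np

# An embedding layer is just a learned matrix plus integer indexing
np.random.seed(42)
vocab = {"cat": 0, "dog": 1, "car": 2, "tree": 3, "house": 4}
E = np.random.randn(5, 3)           # embedding table: one row per token

token_ids = [vocab[w] for w in ["cat", "dog", "cat"]]
vectors = E[token_ids]              # lookup: one row per token, no multiplication

print(vectors.shape)                # (3, 3): three tokens, 3 dims each
assert np.allclose(vectors[0], vectors[2])  # same word → same row
```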
Embeddings fix both problems
Semantic similarity: the dot product between the “cat” and “dog” embeddings is high because they share semantic features; the dot product between “cat” and “car” is low because they don’t. (With the 3-dim example below: dot(cat, dog) = 0.63 versus dot(cat, car) = 0.19.)

Compactness: instead of a 50,000-dim sparse vector, each word is a dense vector with every dimension carrying information (e.g., 768 dims, all non-zero). A first layer operating on 768-dim inputs needs 768 × 768 ≈ 590K parameters instead of 768 × 50,000 = 38.4M.
Concrete Walkthrough: 3-Dimensional Embeddings
Let’s work through a complete example with actual numbers. We will use 3-dimensional embeddings (real systems use 768+, but the math is identical).
Step 1: Define the embeddings
Imagine training has converged and produced these embedding vectors. Think of the three dimensions as loosely encoding [animal-ness, size, is-a-vehicle]:
"cat" = [0.8, 0.2, 0.1] ← high animal, small, not vehicle "dog" = [0.7, 0.3, 0.1] ← high animal, medium, not vehicle "car" = [0.1, 0.1, 0.9] ← not animal, small(?), very vehicle
Step 2: Define a weight matrix
Suppose the next layer has a 2×3 weight matrix W that has been trained to detect whether the input is a living thing (row 1) or a mechanical thing (row 2):
W = |  0.9   0.5  -0.8 |   ← "living thing detector"
    | -0.3  -0.1   1.0 |   ← "machine detector"

Step 3: Matrix multiply each word
W · “cat”
W · [0.8, 0.2, 0.1]
Row 1: (0.9×0.8) + (0.5×0.2) + (-0.8×0.1) = 0.72 + 0.10 - 0.08 = 0.74
Row 2: (-0.3×0.8) + (-0.1×0.2) + (1.0×0.1) = -0.24 - 0.02 + 0.10 = -0.16
Result: [0.74, -0.16] ← HIGH living, LOW machine
W · “dog”
W · [0.7, 0.3, 0.1]
Row 1: (0.9×0.7) + (0.5×0.3) + (-0.8×0.1) = 0.63 + 0.15 - 0.08 = 0.70
Row 2: (-0.3×0.7) + (-0.1×0.3) + (1.0×0.1) = -0.21 - 0.03 + 0.10 = -0.14
Result: [0.70, -0.14] ← HIGH living, LOW machine
W · “car”
W · [0.1, 0.1, 0.9]
Row 1: (0.9×0.1) + (0.5×0.1) + (-0.8×0.9) = 0.09 + 0.05 - 0.72 = -0.58
Row 2: (-0.3×0.1) + (-0.1×0.1) + (1.0×0.9) = -0.03 - 0.01 + 0.90 = 0.86
Result: [-0.58, 0.86] ← LOW living, HIGH machine
Results summary
| Word | Embedding | Output (W·x) | Interpretation |
|---|---|---|---|
| cat | [0.8, 0.2, 0.1] | [0.74, -0.16] | Living thing |
| dog | [0.7, 0.3, 0.1] | [0.70, -0.14] | Living thing |
| car | [0.1, 0.1, 0.9] | [-0.58, 0.86] | Machine |
cat and dog produce nearly identical outputs ([0.74, -0.16] vs [0.70, -0.14]), while car produces a completely different output ([-0.58, 0.86]). The network can now naturally group semantically similar words — this is impossible with ASCII or one-hot encodings.
Cosine similarity between outputs
To quantify how similar the network treats these words, we compute cosine similarity between their output vectors:
cos(cat_out, dog_out) = (0.74×0.70 + (-0.16)×(-0.14))
                      / (√(0.74² + 0.16²) × √(0.70² + 0.14²))
                      = (0.5180 + 0.0224) / (0.75710 × 0.71386)
                      = 0.5404 / 0.54046
                      = 0.9999 ← almost identical

cos(cat_out, car_out) = (0.74×(-0.58) + (-0.16)×0.86)
                      / (0.75710 × 1.03730)
                      = (-0.4292 - 0.1376) / 0.78534
                      = -0.5668 / 0.78534
                      = -0.7217 ← very different (negative = opposite)

A cosine similarity of 0.9999 between the cat and dog outputs means the network treats them as virtually the same category. A cosine similarity of -0.72 between cat and car means the network sees them as near-opposites. This is exactly the semantic structure we want.
Working Python Code
Here is every step from above as runnable Python. Copy this into a notebook or script to verify all the numbers yourself.
import numpy as np
from numpy.linalg import norm

# ─── 1. Define embeddings ───────────────────────────
# In practice these are learned; here we set them manually
embeddings = {
    "cat": np.array([0.8, 0.2, 0.1]),
    "dog": np.array([0.7, 0.3, 0.1]),
    "car": np.array([0.1, 0.1, 0.9]),
}

# ─── 2. Define a weight matrix ─────────────────────
# 2×3: maps 3-dim embeddings to 2-dim output
W = np.array([
    [0.9, 0.5, -0.8],   # "living thing" detector
    [-0.3, -0.1, 1.0],  # "machine" detector
])
bias = np.array([0.0, 0.0])  # zero bias for clarity

# ─── 3. Forward pass for each word ─────────────────
print("=== Forward pass: output = W @ embedding + bias ===\n")
outputs = {}
for word, emb in embeddings.items():
    out = W @ emb + bias
    outputs[word] = out
    print(f'  "{word}": embedding={emb} → output={np.round(out, 4)}')

# ─── 4. Compare with one-hot encoding ──────────────
print("\n=== One-hot comparison ===\n")
vocab = ["cat", "dog", "car", "tree", "house"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}
for w in ["cat", "dog", "car"]:
    print(f'  "{w}": one_hot={one_hot[w].astype(int)}')

# Dot products between one-hot vectors
for w1, w2 in [("cat", "dog"), ("cat", "car")]:
    dp = np.dot(one_hot[w1], one_hot[w2])
    print(f'  dot("{w1}", "{w2}") = {dp} ← always 0 for different words')

# ─── 5. Dot products between embeddings ────────────
print("\n=== Embedding dot products ===\n")
for w1, w2 in [("cat", "dog"), ("cat", "car"), ("dog", "car")]:
    dp = np.dot(embeddings[w1], embeddings[w2])
    print(f'  dot("{w1}", "{w2}") = {dp:.4f}')

# ─── 6. Cosine similarity of outputs ───────────────
print("\n=== Cosine similarity of network outputs ===\n")
for w1, w2 in [("cat", "dog"), ("cat", "car"), ("dog", "car")]:
    cos_sim = np.dot(outputs[w1], outputs[w2]) / (
        norm(outputs[w1]) * norm(outputs[w2])
    )
    print(f'  cos("{w1}", "{w2}") = {cos_sim:.4f}')

# ─── 7. ASCII encoding comparison ──────────────────
print("\n=== ASCII encoding (bad!) ===\n")
ascii_vecs = {
    "cat": np.array([ord(c) for c in "cat"]),
    "dog": np.array([ord(c) for c in "dog"]),
    "car": np.array([ord(c) for c in "car"]),
}
for word, vec in ascii_vecs.items():
    print(f'  "{word}" → {vec}')
for w1, w2 in [("cat", "dog"), ("cat", "car")]:
    dp = np.dot(ascii_vecs[w1], ascii_vecs[w2])
    print(f'  dot("{w1}", "{w2}") = {dp} ← meaningless similarity')

Expected output
=== Forward pass: output = W @ embedding + bias ===
"cat": embedding=[0.8 0.2 0.1] → output=[ 0.74 -0.16]
"dog": embedding=[0.7 0.3 0.1] → output=[ 0.7 -0.14]
"car": embedding=[0.1 0.1 0.9] → output=[-0.58 0.86]
=== One-hot comparison ===
"cat": one_hot=[1 0 0 0 0]
"dog": one_hot=[0 1 0 0 0]
"car": one_hot=[0 0 1 0 0]
dot("cat", "dog") = 0.0 ← always 0 for different words
dot("cat", "car") = 0.0 ← always 0 for different words
=== Embedding dot products ===
dot("cat", "dog") = 0.6300
dot("cat", "car") = 0.1900
dot("dog", "car") = 0.1900
=== Cosine similarity of network outputs ===
cos("cat", "dog") = 0.9998
cos("cat", "car") = -0.7213
cos("dog", "car") = -0.7399
=== ASCII encoding (bad!) ===
"cat" → [ 99 97 116]
"dog" → [100 111 103]
"car" → [ 99 97 114]
dot("cat", "dog") = 32615 ← meaningless similarity
dot("cat", "car") = 32434 ← meaningless similarityThe Evolution: From Lookup Tables to Contextual Embeddings
The idea of representing words as dense vectors has evolved dramatically across three generations, each solving a deeper problem.
Static Embeddings (Bengio, Collobert)
Learned as a side effect of neural language models. Each word gets one fixed vector regardless of context. The word “bank” has the same embedding whether it means a river bank or a financial bank. Already a huge improvement over one-hot, but limited.
Dedicated Embedding Models (Word2Vec, GloVe, FastText)
Models trained specifically to produce good embeddings. Word2Vec showed that embeddings capture analogies: vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”). GloVe combined global co-occurrence statistics with local context windows. Still static — one vector per word.
Contextual Embeddings (ELMo, BERT, GPT)
The same word gets different vectors depending on its context. “I sat by the river bank” and “I deposited money at the bank” produce different embeddings for “bank”. Transformers take a sequence of token embeddings as input, then use self-attention to produce contextualized representations. The initial embedding lookup is still the same matrix operation — the context-dependence comes from subsequent layers.
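A toy sketch of that context-dependence (illustrative only, not a real transformer): the static lookup for "bank" returns the same row every time, but one round of dot-product attention mixes in the neighboring tokens, so the resulting vector for "bank" differs between sentences. The vocabulary and random embedding values here are assumptions for the demo.

```python
import numpy as np

np.random.seed(0)
vocab = {"the": 0, "river": 1, "bank": 2, "money": 3, "at": 4}
E = np.random.randn(5, 4)                 # static embedding table (5 words, 4 dims)

def contextualize(tokens):
    X = E[[vocab[t] for t in tokens]]     # static lookup: same rows every time
    scores = X @ X.T                      # attention scores between all token pairs
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)     # softmax over each token's context
    return A @ X                          # each output is a context-weighted mix

out_river = contextualize(["the", "river", "bank"])
out_money = contextualize(["money", "at", "the", "bank"])

# "bank" is position 2 in the first sentence, position 3 in the second
print(np.allclose(out_river[2], out_money[3]))   # False: context changed the vector
```

The initial `E[...]` lookup is the same lookup-table operation described above; only the mixing step afterwards makes the representation contextual.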
What stays the same across all three eras
Every generation still starts with the same fundamental operation: converting discrete tokens into dense numerical vectors so that matrix multiplication is defined. The embedding layer in a modern GPT model performs exactly the same lookup-table operation as Word2Vec: it maps each token ID to a row in a learned matrix. What changed is what happens after the lookup: nothing (Word2Vec), a bidirectional LSTM (ELMo), or dozens of self-attention layers (modern GPTs).
Key Papers
The foundational research that developed the theory and practice of word embeddings, from the first neural language model to modern contextual representations.
Summary
Neural networks compute y = Wx + b. This requires W and x to be numerical — you cannot multiply a matrix by a string.
ASCII encoding gives numbers, but the wrong numbers. Character codes encode spelling, not meaning. “cat” and “car” look more similar than “cat” and “kitten.”
One-hot encoding destroys all similarity. Every word is orthogonal to every other word. The dot product between any two different words is 0.
Learned embeddings encode semantic similarity in the geometry of the vector space. Similar words get similar vectors, so the matrix multiplication produces similar outputs. The network naturally groups related concepts.
An embedding layer is a learned lookup table. It is mathematically equivalent to multiplying a one-hot vector by a weight matrix, but implemented as an index operation for efficiency.