Why Neural Networks Need Embeddings:
The Matrix Operations Problem

The real answer is not “computers need numbers.” It is that matrix multiplication — the core operation of every neural network — is only defined over numerical vectors. This page shows you exactly what that means, with actual math.

y = Wx + b · the fundamental equation
50,000+ · typical vocabulary size
768-dim · BERT embedding size
1960s · first distributed representations

The Real Problem: Matrix Multiplication Is Only Defined Over Numbers

Every tutorial says “computers can’t understand words, so we convert them to numbers.” That framing is misleading. Computers handle strings perfectly well — your browser is rendering these words right now. The actual constraint comes from how neural networks compute.

A single neuron in a neural network computes exactly one thing:

output = activation( W · x + b )

W = weight matrix (learned parameters)

x = input vector (your data)

b = bias vector

activation = non-linear function (ReLU, sigmoid, etc.)

The critical operation is W · x — matrix-vector multiplication. This operation is only defined when both W and x contain numbers. Not strings, not categories, not booleans — numbers that you can multiply and add together.

Step-by-step: What matrix multiplication actually does

Let’s say we have a tiny weight matrix W (2×3) and an input vector x (3×1):

W = | 0.5  -0.3   0.8 |    x = | 1.0 |
    | 0.2   0.7  -0.1 |        | 0.5 |
                                | 0.3 |

W · x = | (0.5×1.0) + (-0.3×0.5) + (0.8×0.3) |
        | (0.2×1.0) + (0.7×0.5) + (-0.1×0.3)  |

      = | 0.5 + (-0.15) + 0.24 |
        | 0.2 + 0.35 + (-0.03) |

      = | 0.59 |
        | 0.52 |
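This arithmetic is easy to verify with NumPy (a minimal check of the worked example above; the variable names are ours):

```python
import numpy as np

# The 2x3 weight matrix and 3-dim input vector from the worked example
W = np.array([[0.5, -0.3,  0.8],
              [0.2,  0.7, -0.1]])
x = np.array([1.0, 0.5, 0.3])

# Matrix-vector product: each output element is a sum of element-wise products
y = W @ x
print(np.round(y, 2))  # → [0.59 0.52]
```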

Every single element of the output required multiplication and addition between the weight and input values. Now try to imagine what happens if x contains a string:

x = | "cat" |    ← What is 0.5 × "cat"?
    | "sat" |    ← What is -0.3 × "sat"?
    | "mat" |    ← What is 0.8 × "mat"?

W · x = | (0.5 × "cat") + (-0.3 × "sat") + (0.8 × "mat") |
        |                                                    |
        |              ← UNDEFINED. Cannot compute.          |

The core constraint

The string “cat” has no defined multiplication operation with the float 0.5. This is not a software limitation — it is a mathematical one. Matrix multiplication requires elements from a field (a set with defined +, −, ×, ÷). Strings are not elements of any field.
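This constraint is visible directly in Python: no multiplication is defined between a float and a string. (Note that `3 * "cat"` does succeed, but as string repetition, not arithmetic.)

```python
import numpy as np

try:
    0.5 * "cat"          # no float x str multiplication exists
except TypeError as e:
    print(e)             # can't multiply sequence by non-int of type 'float'

print(3 * "cat")         # "catcatcat": repetition, not arithmetic

# NumPy rejects the whole matrix-vector product for the same reason
W = np.array([[0.5, -0.3, 0.8]])
x = np.array(["cat", "sat", "mat"])   # string dtype, not numeric
try:
    W @ x
except TypeError as e:
    print("matmul failed:", e)
```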

So the question becomes: how do we represent words as numerical vectors in a way that is useful to the network? Not just any numbers — numbers that encode something meaningful about the word’s identity and relationships.

Why ASCII and One-Hot Encoding Fail

The naive thought is: words are already made of characters, and characters have ASCII codes. Problem solved? Not even close. Let’s see exactly why.

Attempt 1: ASCII Values

"cat" → [99, 97, 116]     (c=99, a=97, t=116)
"dog" → [100, 111, 103]   (d=100, o=111, g=103)
"car" → [99, 97, 114]     (c=99, a=97, r=114)

These are numbers, so the matrix multiplication is now defined. But the numbers encode the completely wrong information:

  • “cat” and “car” appear more similar than “cat” and “dog” — because they share the letters c, a. But semantically, cat and dog are far more related (both are animals, pets).
  • The numbers imply an ordering that does not exist — “d” (100) is not “greater than” “c” (99) in any linguistically meaningful way.
  • Different-length words produce different-size vectors — “cat” has 3 dimensions, “elephant” has 8. Neural network layers require fixed-size inputs.

Let’s prove the similarity problem with actual dot products. The dot product is the core of matrix multiplication — if two vectors have a higher dot product, the network treats them as more similar:

dot("cat", "car") = (99×99) + (97×97) + (116×114)
                   = 9801 + 9409 + 13224
                   = 32,434

dot("cat", "dog") = (99×100) + (97×111) + (116×103)
                   = 9900 + 10767 + 11948
                   = 32,615

# "cat"·"dog" ≈ "cat"·"car" — the network can't distinguish
# semantic similarity from spelling similarity

Attempt 2: One-Hot Encoding

The next idea is to assign each word a unique position in a vocabulary-sized vector. With a vocabulary of 5 words [cat, dog, car, tree, house]:

"cat"   → [1, 0, 0, 0, 0]
"dog"   → [0, 1, 0, 0, 0]
"car"   → [0, 0, 1, 0, 0]
"tree"  → [0, 0, 0, 1, 0]
"house" → [0, 0, 0, 0, 1]

This fixes the false-similarity problem from ASCII — every word is equally distant from every other word. But that is also its fatal flaw:

dot("cat", "dog")   = (1×0) + (0×1) + (0×0) + (0×0) + (0×0) = 0
dot("cat", "car")   = (1×0) + (0×0) + (0×1) + (0×0) + (0×0) = 0
dot("cat", "tree")  = (1×0) + (0×0) + (0×0) + (0×1) + (0×0) = 0

# EVERY pair of different words has dot product = 0
# "cat" is exactly as similar to "dog" as it is to "car" or "tree"
# The network gets ZERO similarity signal

The two problems with one-hot vectors

1. No similarity information

One-hot vectors are orthogonal — their dot product is always 0. The network has to learn from scratch that “cat” and “kitten” are related. No structural hint is provided.

2. Extreme dimensionality

Real vocabularies have 30,000–100,000 tokens. Each one-hot vector has that many dimensions, with only a single 1. The weight matrix W for the first layer must be enormous (e.g., 768 × 50,000 = 38.4 million parameters) just to handle the input.

Let’s see what the matrix multiplication looks like with one-hot inputs. Suppose W is a 3×5 weight matrix (3-dim output, vocabulary of 5):

W = | 0.2   0.8  -0.1   0.5   0.3 |
    | 0.9  -0.2   0.4   0.1  -0.6 |
    | 0.1   0.3   0.7  -0.4   0.2 |

x_cat = [1, 0, 0, 0, 0]  (one-hot for "cat")

W · x_cat = | 0.2×1 + 0.8×0 + (-0.1)×0 + 0.5×0 + 0.3×0 |   | 0.2 |
            | 0.9×1 + (-0.2)×0 + 0.4×0 + 0.1×0 + (-0.6)×0 | = | 0.9 |
            | 0.1×1 + 0.3×0 + 0.7×0 + (-0.4)×0 + 0.2×0 |   | 0.1 |

# One-hot multiplication just SELECTS a column of W
# For "cat" (index 0): output = column 0 of W = [0.2, 0.9, 0.1]
# For "dog" (index 1): output = column 1 of W = [0.8, -0.2, 0.3]

Key insight

Multiplying a one-hot vector by a weight matrix just selects one column of that matrix. This means the “first layer” of a network receiving one-hot inputs is really just a lookup table. Each word gets its own learned column vector. This is exactly what an embedding layer is — but without the wasted computation of multiplying by all those zeros.
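This equivalence is easy to verify with the 3×5 matrix from this section: the full matrix product and a plain column lookup give identical results.

```python
import numpy as np

# The 3x5 weight matrix from the one-hot example above
W = np.array([[0.2,  0.8, -0.1,  0.5,  0.3],
              [0.9, -0.2,  0.4,  0.1, -0.6],
              [0.1,  0.3,  0.7, -0.4,  0.2]])

vocab = ["cat", "dog", "car", "tree", "house"]

for idx, word in enumerate(vocab):
    one_hot = np.zeros(len(vocab))
    one_hot[idx] = 1.0
    via_matmul = W @ one_hot    # full multiply: mostly wasted zero-products
    via_lookup = W[:, idx]      # what an embedding layer does: index a column
    assert np.array_equal(via_matmul, via_lookup)
    print(word, via_lookup)
```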

The Embedding Solution: Learned Dense Vectors

An embedding is a dense, low-dimensional vector that is learned during training. Instead of a sparse 50,000-dimensional one-hot vector, each word gets a compact vector (typically 64 to 1024 dimensions) where each dimension encodes some latent feature.

The critical property: similar words end up with similar vectors. Not because we told the network that “cat” and “dog” are related, but because during training, words that appear in similar contexts get pushed toward similar regions of the embedding space.

How it works, mechanically

1. Initialize randomly. Each word in the vocabulary gets a random vector. “cat” = [0.12, -0.87, 0.45, ...], “dog” = [-0.34, 0.56, 0.02, ...].
2. Train on a task. The network processes text (language modeling, classification, translation) and computes a loss.
3. Backpropagate gradients into the embedding. The gradient tells each embedding dimension which direction to shift to reduce the loss.
4. Converge to meaningful structure. Words that appear in similar contexts (“The cat sat on the mat” vs “The dog sat on the mat”) receive similar gradient signals, so their vectors drift together.
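The four steps above can be sketched numerically. This is a deliberately simplified stand-in for a real language-model loss: each word's embedding is pulled toward the mean embedding of its fixed context words, and the tiny vocabulary and contexts are invented for illustration. Because “cat” and “dog” share the same context, their vectors drift together.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "on", "mat"]
E = {w: rng.normal(size=3) for w in vocab}          # step 1: random init

# "The cat sat ..." / "The dog sat ...": cat and dog share context words
contexts = {"cat": ["the", "sat"], "dog": ["the", "sat"]}

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("before:", round(cos(E["cat"], E["dog"]), 3))

lr = 0.1
for _ in range(200):                                # steps 2-3: "train"
    for word, ctx in contexts.items():
        target = np.mean([E[c] for c in ctx], axis=0)
        grad = E[word] - target        # gradient of 0.5 * ||e - target||^2
        E[word] -= lr * grad           # gradient descent step

# Step 4: both words converged toward the same context vector,
# so their cosine similarity approaches 1.0
print("after: ", round(cos(E["cat"], E["dog"]), 3))
```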

Embeddings fix both problems

Similarity is encoded

The dot product between “cat” and “dog” embeddings is high because they share semantic features. The dot product between “cat” and “car” is low because they don’t.

dot(cat, dog) = 0.92
dot(cat, car) = 0.15
Compact dimensions

Instead of a 50,000-dim sparse vector, each word is a 768-dim dense vector. The first layer needs 768 × 768 = 590K parameters instead of 768 × 50,000 = 38.4M.

One-hot: 50,000 dims, 1 non-zero
Embedding: 768 dims, all non-zero
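The parameter counts quoted above are two lines of arithmetic:

```python
# First-layer weight matrix sizes for a 768-dim output
one_hot_params = 768 * 50_000   # fed 50,000-dim one-hot inputs
dense_params   = 768 * 768      # fed 768-dim embedding inputs

print(f"{one_hot_params:,}")    # 38,400,000 → 38.4M
print(f"{dense_params:,}")      # 589,824 → ~590K
```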

Concrete Walkthrough: 3-Dimensional Embeddings

Let’s work through a complete example with actual numbers. We will use 3-dimensional embeddings (real systems use 768+, but the math is identical).

Step 1: Define the embeddings

Imagine training has converged and produced these embedding vectors. Think of the three dimensions as loosely encoding [animal-ness, size, is-a-vehicle]:

"cat" = [0.8, 0.2, 0.1]    ← high animal, small, not vehicle
"dog" = [0.7, 0.3, 0.1]    ← high animal, medium, not vehicle
"car" = [0.1, 0.1, 0.9]    ← not animal, small(?), very vehicle

Step 2: Define a weight matrix

Suppose the next layer has a 2×3 weight matrix W that has been trained to detect whether the input is a living thing (row 1) or a mechanical thing (row 2):

W = | 0.9   0.5  -0.8 |    ← "living thing detector"
    |-0.3  -0.1   1.0 |    ← "machine detector"

Step 3: Matrix multiply each word

W · “cat”

W · [0.8, 0.2, 0.1]

Row 1: (0.9×0.8) + (0.5×0.2) + (-0.8×0.1) = 0.72 + 0.10 - 0.08 = 0.74
Row 2: (-0.3×0.8) + (-0.1×0.2) + (1.0×0.1) = -0.24 - 0.02 + 0.10 = -0.16

Result: [0.74, -0.16]   ← HIGH living, LOW machine

W · “dog”

W · [0.7, 0.3, 0.1]

Row 1: (0.9×0.7) + (0.5×0.3) + (-0.8×0.1) = 0.63 + 0.15 - 0.08 = 0.70
Row 2: (-0.3×0.7) + (-0.1×0.3) + (1.0×0.1) = -0.21 - 0.03 + 0.10 = -0.14

Result: [0.70, -0.14]   ← HIGH living, LOW machine

W · “car”

W · [0.1, 0.1, 0.9]

Row 1: (0.9×0.1) + (0.5×0.1) + (-0.8×0.9) = 0.09 + 0.05 - 0.72 = -0.58
Row 2: (-0.3×0.1) + (-0.1×0.1) + (1.0×0.9) = -0.03 - 0.01 + 0.90 = 0.86

Result: [-0.58, 0.86]   ← LOW living, HIGH machine

Results summary

Word | Embedding       | Output (W·x)   | Interpretation
-----|-----------------|----------------|---------------
cat  | [0.8, 0.2, 0.1] | [0.74, -0.16]  | Living thing
dog  | [0.7, 0.3, 0.1] | [0.70, -0.14]  | Living thing
car  | [0.1, 0.1, 0.9] | [-0.58, 0.86]  | Machine

cat and dog produce nearly identical outputs ([0.74, -0.16] vs [0.70, -0.14]), while car produces a completely different output ([-0.58, 0.86]). The network can now naturally group semantically similar words — this is impossible with ASCII or one-hot encodings.

Cosine similarity between outputs

To quantify how similar the network treats these words, we compute cosine similarity between their output vectors:

cos(cat_out, dog_out) = (0.74×0.70 + (-0.16)×(-0.14))
                        / (√(0.74² + 0.16²) × √(0.70² + 0.14²))
                      = (0.518 + 0.0224) / (0.75710 × 0.71386)
                      = 0.5404 / 0.54046
                      = 0.9999  ← almost identical

cos(cat_out, car_out) = (0.74×(-0.58) + (-0.16)×0.86)
                        / (0.75710 × 1.03730)
                      = (-0.4292 - 0.1376) / 0.78534
                      = -0.5668 / 0.78534
                      = -0.7217  ← very different (negative = opposite)

Cosine similarity of 0.9999 between the cat and dog outputs means the network treats them as virtually the same category. Cosine similarity of -0.72 between cat and car means the network sees them as opposites. This is exactly the semantic structure we want.

Working Python Code

Here is every step from above as runnable Python. Copy this into a notebook or script to verify all the numbers yourself.

numpy_embedding_demo.py
import numpy as np

# ─── 1. Define embeddings ───────────────────────────
# In practice these are learned; here we set them manually
embeddings = {
    "cat": np.array([0.8, 0.2, 0.1]),
    "dog": np.array([0.7, 0.3, 0.1]),
    "car": np.array([0.1, 0.1, 0.9]),
}

# ─── 2. Define a weight matrix ─────────────────────
# 2×3: maps 3-dim embeddings to 2-dim output
W = np.array([
    [0.9,  0.5, -0.8],   # "living thing" detector
    [-0.3, -0.1,  1.0],   # "machine" detector
])

bias = np.array([0.0, 0.0])  # zero bias for clarity

# ─── 3. Forward pass for each word ─────────────────
print("=== Forward pass: output = W @ embedding + bias ===\n")
outputs = {}
for word, emb in embeddings.items():
    out = W @ emb + bias
    outputs[word] = out
    print(f'  "{word}": embedding={emb} → output={np.round(out, 4)}')

# ─── 4. Compare with one-hot encoding ──────────────
print("\n=== One-hot comparison ===\n")
vocab = ["cat", "dog", "car", "tree", "house"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

for w in ["cat", "dog", "car"]:
    print(f'  "{w}": one_hot={one_hot[w].astype(int)}')

# Dot products between one-hot vectors
for w1, w2 in [("cat", "dog"), ("cat", "car")]:
    dp = np.dot(one_hot[w1], one_hot[w2])
    print(f'  dot("{w1}", "{w2}") = {dp}  ← always 0 for different words')

# ─── 5. Dot products between embeddings ────────────
print("\n=== Embedding dot products ===\n")
for w1, w2 in [("cat", "dog"), ("cat", "car"), ("dog", "car")]:
    dp = np.dot(embeddings[w1], embeddings[w2])
    print(f'  dot("{w1}", "{w2}") = {dp:.4f}')

# ─── 6. Cosine similarity of outputs ───────────────
print("\n=== Cosine similarity of network outputs ===\n")
from numpy.linalg import norm

for w1, w2 in [("cat", "dog"), ("cat", "car"), ("dog", "car")]:
    cos_sim = np.dot(outputs[w1], outputs[w2]) / (
        norm(outputs[w1]) * norm(outputs[w2])
    )
    print(f'  cos("{w1}", "{w2}") = {cos_sim:.4f}')

# ─── 7. ASCII encoding comparison ──────────────────
print("\n=== ASCII encoding (bad!) ===\n")
ascii_vecs = {
    "cat": np.array([ord(c) for c in "cat"]),
    "dog": np.array([ord(c) for c in "dog"]),
    "car": np.array([ord(c) for c in "car"]),
}

for word, vec in ascii_vecs.items():
    print(f'  "{word}" → {vec}')

for w1, w2 in [("cat", "dog"), ("cat", "car")]:
    dp = np.dot(ascii_vecs[w1], ascii_vecs[w2])
    print(f'  dot("{w1}", "{w2}") = {dp}  ← meaningless similarity')

Expected output

=== Forward pass: output = W @ embedding + bias ===

  "cat": embedding=[0.8 0.2 0.1] → output=[ 0.74 -0.16]
  "dog": embedding=[0.7 0.3 0.1] → output=[ 0.7  -0.14]
  "car": embedding=[0.1 0.1 0.9] → output=[-0.58  0.86]

=== One-hot comparison ===

  "cat": one_hot=[1 0 0 0 0]
  "dog": one_hot=[0 1 0 0 0]
  "car": one_hot=[0 0 1 0 0]
  dot("cat", "dog") = 0.0  ← always 0 for different words
  dot("cat", "car") = 0.0  ← always 0 for different words

=== Embedding dot products ===

  dot("cat", "dog") = 0.6300
  dot("cat", "car") = 0.1900
  dot("dog", "car") = 0.1900

=== Cosine similarity of network outputs ===

  cos("cat", "dog") = 0.9999
  cos("cat", "car") = -0.7217
  cos("dog", "car") = -0.7109

=== ASCII encoding (bad!) ===

  "cat" → [ 99  97 116]
  "dog" → [100 111 103]
  "car" → [ 99  97 114]
  dot("cat", "dog") = 32615  ← meaningless similarity
  dot("cat", "car") = 32434  ← meaningless similarity

The Evolution: From Lookup Tables to Contextual Embeddings

The idea of representing words as dense vectors has evolved dramatically across three generations, each solving a deeper problem.

2003–2012

Static Embeddings (Bengio, Collobert)

Learned as a side effect of neural language models. Each word gets one fixed vector regardless of context. The word “bank” has the same embedding whether it means a river bank or a financial bank. Already a huge improvement over one-hot, but limited.

2013–2016

Dedicated Embedding Models (Word2Vec, GloVe, FastText)

Models trained specifically to produce good embeddings. Word2Vec showed that embeddings capture analogies: vec(“king”) - vec(“man”) + vec(“woman”) ≈ vec(“queen”). GloVe combined global co-occurrence statistics with local context windows. Still static — one vector per word.

2017–now

Contextual Embeddings (ELMo, BERT, GPT)

The same word gets different vectors depending on its context. “I sat by the river bank” and “I deposited money at the bank” produce different embeddings for “bank”. Transformers take a sequence of token embeddings as input, then use self-attention to produce contextualized representations. The initial embedding lookup is still the same matrix operation — the context-dependence comes from subsequent layers.

What stays the same across all three eras

Every generation still starts with the same fundamental operation: converting discrete tokens into dense numerical vectors so that matrix multiplication is defined. The embedding layer in GPT-4 performs exactly the same lookup-table operation as Word2Vec — it maps each token ID to a row in a learned matrix. What changed is what happens after the lookup: nothing (Word2Vec), a shallow network (ELMo), or 96 layers of self-attention (GPT-4).
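That shared lookup can be sketched in a few lines; the vocabulary size, embedding width, and token IDs below are invented for illustration:

```python
import numpy as np

vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(42)

# A learned embedding matrix (here random; in a trained model these are weights)
E = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([101, 2009, 317])  # made-up token IDs for a 3-token input
embedded = E[token_ids]                 # the entire "embedding layer": row indexing
print(embedded.shape)                   # (3, 768)
```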

Key Papers

The foundational research that developed the theory and practice of word embeddings, from the first neural language model to modern contextual representations.

A Neural Probabilistic Language Model
Bengio, Ducharme, Vincent, Jauvin
JMLR 2003 · 12,000+ citations
First neural language model — introduced learned word embeddings

Efficient Estimation of Word Representations in Vector Space
Mikolov, Chen, Corrado, Dean
ICLR Workshop 2013 · 40,000+ citations
Word2Vec — introduced Skip-gram and CBOW for learning word embeddings

Distributed Representations of Words and Phrases and their Compositionality
Mikolov, Sutskever, Chen, Corrado, Dean
NeurIPS 2013 · 30,000+ citations
Word2Vec extensions — negative sampling, phrase detection

GloVe: Global Vectors for Word Representation
Pennington, Socher, Manning
EMNLP 2014 · 32,000+ citations
Combined global matrix factorization with local context windows

Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
NeurIPS 2017 · 130,000+ citations
Transformer architecture — embeddings + positional encoding as input representation

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Devlin, Chang, Lee, Toutanova
NAACL 2019 · 90,000+ citations
Contextual embeddings — same word gets different vectors based on context

Summary

1. Neural networks compute y = Wx + b. This requires W and x to be numerical — you cannot multiply a matrix by a string.

2. ASCII encoding gives numbers, but the wrong numbers. Character codes encode spelling, not meaning. “cat” and “car” look more similar than “cat” and “kitten.”

3. One-hot encoding destroys all similarity. Every word is orthogonal to every other word. The dot product between any two different words is 0.

4. Learned embeddings encode semantic similarity in the geometry of the vector space. Similar words get similar vectors, so the matrix multiplication produces similar outputs. The network naturally groups related concepts.

5. An embedding layer is a learned lookup table. It is mathematically equivalent to multiplying a one-hot vector by a weight matrix, but implemented as an index operation for efficiency.