Level 0: Foundations (~15 min)

What is an Embedding?

How neural networks convert text into numbers - and why those numbers capture meaning.

The Problem

Computers work with numbers. Neural networks are just matrix multiplications - they can only process numerical vectors. But most real-world data is not numbers: text, images, audio.

neural_network("cat") // Error: expected tensor, got string
neural_network([99, 97, 116]) // Works, but ASCII codes have no meaning

We need a way to represent "cat" as numbers where similar words get similar numbers. That's what embeddings do.

What an Embedding Actually Is

An embedding is a learned lookup table combined with a neural network transformation.

Step 1: Tokenization

First, text is split into tokens - subword units from a fixed vocabulary. "cat" might be one token, but "unbelievable" becomes ["un", "believ", "able"].

# Tokenization example
"cat" → [2368]                    # Single token ID
"unbelievable" → [348, 12871, 481]  # Three token IDs
"café" → [7467, 2634]              # Subword: "caf" + "é"

Step 2: Embedding Lookup

Each token ID maps to a row in a giant matrix - the embedding table. This is just a lookup: token 2368 → row 2368 of the matrix.

# Embedding table: 50,000 tokens × 768 dimensions
embedding_table.shape = (50000, 768)

# Lookup is just indexing
token_id = 2368  # "cat"
initial_vector = embedding_table[token_id]  # Shape: (768,)

Why 768 dimensions? It's a hyperparameter choice. More dimensions = more capacity to encode meaning, but also more compute. Common values: 384, 768, 1024, 1536.
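
The lookup itself is ordinary array indexing. A toy sketch, with random weights standing in for a trained table:

import numpy as np

vocab_size, dim = 50_000, 768
embedding_table = np.random.randn(vocab_size, dim).astype(np.float32)  # untrained here

token_ids = [2368, 348, 12871]                 # output of the tokenizer
initial_vectors = embedding_table[token_ids]   # Shape: (3, 768) - pure indexing
print(initial_vectors.shape)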

Step 3: Transformer Processing

Modern embeddings don't stop at lookup. The initial vectors pass through a transformer - layers of attention and feedforward networks that let each token "look at" other tokens.

# Simplified transformer forward pass
x = embedding_table[token_ids]     # (seq_len, 768) - initial lookup
x = x + positional_encoding        # Add position information

for layer in transformer_layers:   # Typically 12-24 layers
    x = layer.attention(x)         # Tokens attend to each other
    x = layer.feedforward(x)       # Non-linear transformation
    # (residual connections and layer norms omitted for clarity)

final_embedding = mean(x, axis=0)  # Mean-pool token vectors into one embedding
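
For comparison, here is a hedged, runnable version of the same idea with a real encoder via Hugging Face transformers (the model name is just an example). It runs the tokens through the network and mean-pools the final hidden states into one vector:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative choice
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)

# Mean-pool over tokens, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)         # (1, seq_len, 1)
final_embedding = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (1, 768)
print(final_embedding.shape)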

How Training Creates Meaning

The embedding table and transformer weights start as random numbers. Training adjusts them using gradient descent on specific tasks:

Contrastive Learning (Modern Approach)

The model learns that similar sentences should have similar embeddings:

# Training objective: similar pairs close, dissimilar pairs far
similar_pair = ("The cat sat on the mat", "A feline rested on the rug")
dissimilar = ("The cat sat on the mat", "Stock prices rose sharply")

emb1, emb2, emb3 = model.encode([similar_pair[0], similar_pair[1], dissimilar[1]])

# Simplified contrastive objective: reward high similarity for the
# paraphrase pair and penalize it for the unrelated pair
loss = -cos_sim(emb1, emb2) + cos_sim(emb1, emb3)  # Minimize this

# Backpropagation updates embedding table and transformer weights
loss.backward()
optimizer.step()

After millions of such examples, similar concepts naturally cluster together because the model was penalized whenever it placed them far apart.
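
In practice, a training batch contains many pairs at once, and the loss is a softmax over in-batch negatives (often called InfoNCE or multiple-negatives ranking loss). A minimal sketch with illustrative names:

import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """query_emb, pos_emb: (batch, dim) L2-normalized embeddings.
    Row i of pos_emb is the positive for row i of query_emb;
    every other row in the batch serves as a negative."""
    logits = query_emb @ pos_emb.T / temperature   # (batch, batch) similarity matrix
    labels = torch.arange(query_emb.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

# Toy usage with random, normalized vectors
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
print(info_nce_loss(q, p))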

Key Insight

The numbers in an embedding have no predefined meaning. Dimension 42 doesn't mean "animal-ness". The meaning emerges from training - it's whatever representation helps the model distinguish similar from dissimilar text.

Measuring Similarity: Cosine Distance

"Similar embeddings" means vectors pointing in similar directions. We measure this with cosine similarity:

import numpy as np

def cosine_similarity(a, b):
    """
    Measures angle between vectors. Returns -1 to 1.
    1 = identical direction (similar)
    0 = perpendicular (unrelated)
    -1 = opposite (rare in practice)
    """
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Example
emb_cat = model.encode("cat")      # Shape: (768,)
emb_dog = model.encode("dog")      # Shape: (768,)
emb_car = model.encode("car")      # Shape: (768,)

print(cosine_similarity(emb_cat, emb_dog))  # ~0.85 (similar)
print(cosine_similarity(emb_cat, emb_car))  # ~0.45 (less similar)

Most embedding models output normalized vectors (every vector has length 1), so cosine similarity reduces to a plain dot product.
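
You can check that equivalence directly (random vectors here, just to show the arithmetic):

import numpy as np

a = np.random.randn(768)
b = np.random.randn(768)
a /= np.linalg.norm(a)   # normalize to unit length
b /= np.linalg.norm(b)

print(np.dot(a, b))                                            # plain dot product
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # identical value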

See It In Action

Real embeddings have hundreds of dimensions. Below, they are projected to 2D using t-SNE (a dimensionality reduction algorithm) so you can see clustering patterns.

Note: 2D projection distorts distances. Points that look far apart in 2D might be close in 768D.

Word Embedding Space

[Interactive visualization: a 2D t-SNE projection of word embeddings. Similar words cluster together: animals (cat, dog, bird, fish), vehicles (car, truck, bus, bike), food (apple, banana, orange), and royalty (king, queen, prince).]
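
If you want to reproduce a plot like this yourself, here is a hedged sketch: embed a handful of words with the same model used later in this article and project them to 2D with scikit-learn's t-SNE.

from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE

words = ["cat", "dog", "bird", "fish", "car", "truck", "bus", "bike",
         "apple", "banana", "orange", "king", "queen", "prince"]

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(words)                 # (14, 384)

# perplexity must be smaller than the number of points; keep it low here
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(embeddings)

for word, (x, y) in zip(words, coords):
    print(f"{word:>8}: ({x:7.2f}, {y:7.2f})")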

Static vs Contextual Embeddings

There are two types of embeddings with fundamentally different behavior:

Static (Word2Vec, GloVe)

One vector per word. "bank" always gets the same embedding regardless of context.

"river bank" → bank = [0.2, 0.4, ...]
"bank account" → bank = [0.2, 0.4, ...]
# Same vector! Can't distinguish meanings
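
To poke at static embeddings directly, here is a hedged sketch using gensim's pre-trained GloVe vectors (the catalog name is an assumption about gensim's downloader, and the first call triggers a sizable download):

import gensim.downloader

# "glove-wiki-gigaword-100" is assumed to be in gensim's model catalog
glove = gensim.downloader.load("glove-wiki-gigaword-100")

print(glove["bank"][:5])                    # one fixed 100-dim vector, context-free
print(glove.most_similar("bank", topn=3))   # nearest neighbors in the static space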

Contextual (BERT, Transformers)

Different vector based on surrounding words. This is what modern models use.

"river bank" → bank = [0.8, 0.1, ...]
"bank account" → bank = [0.1, 0.9, ...]
# Different vectors! Context-aware
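
You can verify this behavior by pulling the vector at "bank"'s position out of a BERT-style model for two different sentences and comparing them (a hedged sketch; the model choice is illustrative):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]         # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]                       # vector at "bank"'s position

v1 = bank_vector("i sat by the river bank")
v2 = bank_vector("i opened a bank account")
print(torch.cosine_similarity(v1, v2, dim=0))  # well below 1.0 - context changed the vector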

Common Misconception

The "king - man + woman = queen" analogy works for Word2Vec but often fails with modern contextual embeddings. Don't assume vector arithmetic transfers to all embedding types.

Working Code

Here's how to generate embeddings in practice:

from sentence_transformers import SentenceTransformer

# Load a pre-trained embedding model (weights are downloaded on first use)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Generate embeddings
texts = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "Stock markets closed higher on Friday"
]

embeddings = model.encode(texts)  # Shape: (3, 384)

print(f"Embedding shape: {embeddings.shape}")
print(f"First 5 values of first embedding: {embeddings[0][:5]}")

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(embeddings)
print(f"\nSimilarity matrix:")
print(f"  Text 0 vs 1: {similarities[0][1]:.3f}")  # High - same meaning
print(f"  Text 0 vs 2: {similarities[0][2]:.3f}")  # Low - different topic

Install: pip install sentence-transformers scikit-learn

Key Takeaways

1. Embeddings = lookup table + transformer - Token IDs index into a learned matrix, then pass through attention layers.

2. Meaning emerges from training - Contrastive learning teaches the model to place similar text close together.

3. Cosine similarity measures closeness - Vectors pointing in similar directions (high cosine) represent similar meaning.

4. Modern embeddings are contextual - The same word gets different embeddings based on surrounding context.