Deep Dive — Embeddings

Why 768?
The Science Behind Embedding Dimensions

Embedding dimensions like 384, 768, and 1024 look arbitrary — but they are the product of transformer architecture constraints, GPU hardware alignment, and information-theoretic bounds. This page explains the factors that determine the number.

Common Dimensions

384  = 6 heads x 64 dims   (1.5 KB/vec)
768  = 12 heads x 64 dims  (3 KB/vec)
1024 = 16 heads x 64 dims  (4 KB/vec)
1536 = 24 heads x 64 dims  (6 KB/vec)
3072 = 48 heads x 64 dims  (12 KB/vec)

Why These Specific Numbers

Every common embedding dimension is a multiple of 64. That is not a coincidence — it is a direct consequence of transformer architecture and GPU hardware design.

The Transformer Architecture Constraint

In the original Transformer paper (Vaswani et al., 2017), the model dimension d_model is split equally across h attention heads, each operating on d_k = d_model / h dimensions.

# The fundamental equation
d_model = num_heads x d_head
# BERT-base
768 = 12 heads x 64 dims/head
# BERT-large
1024 = 16 heads x 64 dims/head

The head dimension of 64 was set in the original paper and has remained a near-universal default. Given that, the total dimension is just 64 x head_count.
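The decomposition can be checked in a few lines of Python (a minimal sketch; the 64-dim head size is a convention, not a hard requirement):

```python
# Check that a model dimension decomposes into an integer head count
# at the conventional 64-dim head size.
HEAD_DIM = 64

def num_heads(d_model: int, head_dim: int = HEAD_DIM) -> int:
    """Return the implied attention head count for an even split."""
    assert d_model % head_dim == 0, f"{d_model} is not a multiple of {head_dim}"
    return d_model // head_dim

for d in [384, 768, 1024, 1536, 3072]:
    print(f"{d} = {num_heads(d)} heads x {HEAD_DIM} dims/head")
```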

GPU Hardware Alignment

Modern GPU tensor cores process matrices in fixed tile sizes — typically 8x8, 16x16, or 32x32 blocks depending on the precision (FP16, BF16, INT8). Dimensions that are multiples of these tile sizes fill the hardware pipeline perfectly with no wasted compute.

# GPU tensor core tile sizes
FP16 WMMA: 16 x 16 x 16
INT8 WMMA: 32 x 8 x 16
# Common dimensions divisibility
384 / 8 = 48
768 / 16 = 48
1024 / 32 = 32
3072 / 32 = 96

A dimension like 700 would leave tensor cores partially idle every cycle, wasting 5-15% of throughput. That penalty compounds across billions of operations per forward pass.
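A rough sketch of the padding overhead alone (illustrative only: real kernels also lose throughput to misaligned memory access, so the pure pad-to-tile fraction understates the total penalty):

```python
import math

# Padding overhead when a dimension is rounded up to the next
# multiple of a tensor-core tile size.
def padded_waste(dim: int, tile: int) -> float:
    padded = math.ceil(dim / tile) * tile
    return (padded - dim) / padded

for dim in [700, 768]:
    for tile in [8, 16, 32]:
        waste = padded_waste(dim, tile)
        print(f"dim={dim:<4d} tile={tile:<2d} -> {waste:.2%} of padded compute wasted")
```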

Dimension Decomposition

Every common dimension is heads x 64

Dimension | Heads | Head Dim | Notable Model            | MTEB Avg
384       | 6     | 64       | all-MiniLM-L6            | ~56
768       | 12    | 64       | BERT / all-mpnet-base    | ~63
1024      | 16    | 64       | BERT-large / e5-large    | ~64
1536      | 24    | 64       | text-embedding-3-large   | ~64.6
3072      | 48    | 64       | GPT-style large models   | ~65

Information Capacity Theory

How many dimensions do you actually need? Information theory gives us both a lower bound and an intuition for why practical embeddings need far more than the minimum.

~18
Theoretical Minimum

log₂(250K) ≈ 18 dimensions to uniquely identify each token in a large vocabulary. This is the absolute floor from information theory — enough to distinguish tokens, but not enough to represent their meaning.

~100
Intrinsic Dimensionality

Research shows pre-trained language models have an intrinsic dimensionality of roughly 100 (Aghajanyan et al., 2021). This means the learned representations live on a ~100-dimensional manifold embedded in the higher-dimensional space.

768
Practical Sweet Spot

The gap between intrinsic (~100) and practical (768) exists because higher dimensions provide better optimization landscapes, reduce interference between features, and enable the model to separate more fine-grained semantic distinctions during training.

The Johnson-Lindenstrauss Lemma

The JL lemma provides a theoretical bound: to preserve pairwise distances between N points within a factor of (1 ± ε), you need at least O(log(N) / ε²) dimensions. For N = 1 million points and ε = 0.1 (10% distortion tolerance):

# Johnson-Lindenstrauss bound
d ≥ C * ln(N) / ε²
d ≥ 8 * ln(1,000,000) / 0.01
d ≥ 8 * 13.8 / 0.01
d ≥ 11,040
# With ε = 0.3 (30% tolerance)
d ≥ 8 * 13.8 / 0.09
d ≥ 1,227
# This is a worst-case bound for arbitrary point sets.
# Real language data has much more structure, so 768 works.

The JL bound is pessimistic because it applies to arbitrary point configurations. Natural language embeddings lie on a low-dimensional manifold within the ambient space, so far fewer dimensions suffice in practice to preserve the distance relationships that matter.
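Both bounds are easy to compute directly. A small sketch, keeping the constant C = 8 from the worked example (the exact constant depends on the proof variant; full-precision ln gives slightly different values than the rounded arithmetic above):

```python
import math

# Johnson-Lindenstrauss lower bound and the token-ID floor,
# with C = 8 as in the worked example above.
def jl_dims(n_points: int, eps: float, c: float = 8.0) -> int:
    """JL lower bound: d >= C * ln(N) / eps^2."""
    return math.ceil(c * math.log(n_points) / eps**2)

token_floor = math.ceil(math.log2(250_000))  # dims needed just to label tokens
print(f"token-ID floor: {token_floor} dims")
print(f"JL, N=1e6, eps=0.1: {jl_dims(1_000_000, 0.1):,} dims")
print(f"JL, N=1e6, eps=0.3: {jl_dims(1_000_000, 0.3):,} dims")
```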

The Compute-Quality Tradeoff

Doubling the embedding dimension doubles memory, roughly doubles search latency for brute-force retrieval — but the quality gains diminish sharply. The numbers below show why 768 is such a popular default: you get most of the quality at a fraction of the cost.

Dimension vs. Quality vs. Cost

Based on representative MTEB scores and float32 storage

Dim  | Mem / Vector | MTEB Avg | Search Latency | Quality / KB
384  | 1.5 KB       | ~56      | Fastest        | 37.3
768  | 3 KB         | ~63      | Fast           | 21.0
1024 | 4 KB         | ~64      | Medium         | 16.0
1536 | 6 KB         | ~64.6    | Moderate       | 10.8
3072 | 12 KB        | ~65      | Slower         | 5.4
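The Quality / KB column is simply the representative MTEB average divided by per-vector memory; a quick sketch reproducing it from the figures above:

```python
# Quality per KB = representative MTEB average / per-vector memory.
# (dim, KB per vector, MTEB average) taken from the table above.
rows = [(384, 1.5, 56.0), (768, 3.0, 63.0), (1024, 4.0, 64.0),
        (1536, 6.0, 64.6), (3072, 12.0, 65.0)]
for dim, kb, mteb in rows:
    print(f"{dim:>5d}d: {mteb / kb:5.1f} quality points per KB")
```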

Memory at Scale

Total vector storage (float32, no index overhead)

Documents | 384d   | 768d   | 1024d  | 3072d
10K       | 15 MB  | 30 MB  | 40 MB  | 120 MB
100K      | 150 MB | 300 MB | 400 MB | 1.2 GB
1M        | 1.5 GB | 3 GB   | 4 GB   | 12 GB
10M       | 15 GB  | 30 GB  | 40 GB  | 120 GB
100M      | 150 GB | 300 GB | 400 GB | 1.2 TB
At 100M documents, the difference between 384d and 3072d is 150 GB vs 1.2 TB — an 8x cost multiplier that directly impacts infrastructure spend.
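Raw storage follows directly from docs x dims x 4 bytes (float32). A small helper (the table rounds these figures, e.g. 153.6 GB is shown as 150 GB):

```python
# Raw float32 vector storage: docs x dims x 4 bytes, no index overhead.
def vector_storage_gb(n_docs: int, dim: int, bytes_per_float: int = 4) -> float:
    return n_docs * dim * bytes_per_float / 1e9

for dim in [384, 768, 1024, 3072]:
    print(f"100M docs @ {dim:>4d}d: {vector_storage_gb(100_000_000, dim):8.1f} GB")
```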

Matryoshka Embeddings

Matryoshka Representation Learning (Kusupati et al., 2022) is the modern answer to the dimension dilemma: train at full dimensionality, then truncate to any smaller dimension at inference time with graceful quality degradation. Like Russian nesting dolls, the most important information is packed into the earliest dimensions.

How It Works

During training, the loss function is computed at multiple truncation points simultaneously. The model is forced to produce useful representations at dimensions 64, 128, 256, 384, 512, and 768 — all within the same 768-dimensional vector.

# Matryoshka training loss
L = Σ w_m * loss(embed[:m], target)
# for m in [64, 128, 256, 384, 512, 768]
# At inference — just slice
full = model.encode(text) # [768]
small = full[:256] # [256] still useful
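A toy numpy sketch of the weighted multi-truncation loss above (illustrative only: it uses a simple 1 - cosine loss on random vectors with uniform weights, not the paper's exact contrastive objective):

```python
import numpy as np

# Toy multi-truncation loss: the same loss term evaluated at several
# truncation lengths of one 768-dim vector, summed with weights w_m.
rng = np.random.default_rng(0)
anchor, positive = rng.standard_normal((2, 768))

def cosine_loss(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity, re-normalizing after truncation."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

dims = [64, 128, 256, 384, 512, 768]
weights = {m: 1.0 for m in dims}  # uniform w_m
total = sum(weights[m] * cosine_loss(anchor[:m], positive[:m]) for m in dims)
print(f"Matryoshka-style total loss: {total:.3f}")
```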

Why Truncation Works

The Matryoshka loss encourages the model to encode the most important semantic information in the first dimensions. This aligns with a natural tendency in neural networks: learned representations show a spectral decay where early dimensions capture broad, high-variance features (topic, domain) and later dimensions encode finer distinctions (tone, style, entities).

Without Matryoshka training, truncation destroys quality unpredictably — the important information might be spread across all dimensions. With it, there is a smooth, predictable degradation curve.

Quality Retention by Dimension

Matryoshka-trained model (768d full), truncated at inference

Dimension | Quality Retained | MTEB Delta | Note
768       | 100%             | 0          | Full dimension
512       | 99.2%            | -0.5       | Minimal loss
384       | 98.1%            | -1.2       | Great tradeoff
256       | 96.5%            | -2.3       | Good for prototyping
128       | 92.8%            | -4.8       | Noticeable degradation
64        | 85.4%            | -9.7       | Topic-level only

Frequency and Spectral Analysis

Not all dimensions are created equal. Analysis of trained embedding models reveals a spectral structure: different dimension ranges encode qualitatively different types of information.

Low-frequency

Dims 0-128

Broad semantic features: topic (science vs. sports), domain (medical vs. legal), and language. These dimensions have the highest variance and carry the most information per bit. This is why Matryoshka truncation to 128d still captures topic-level similarity at 93% quality.

Mid-frequency

Dims 128-512

Subtopic and entity-level features: the difference between "cardiac surgery" and "cardiac imaging," or between "Python web frameworks" and "Python data science." This range is critical for retrieval quality in domain-specific applications.

High-frequency

Dims 512-768

Fine-grained distinctions: tone (formal vs. casual), style, specific entity mentions, and subtle semantic nuances. These dimensions have the lowest variance individually but collectively enable the model to distinguish near-duplicates and handle edge cases.

Connection to Fourier Analysis

This spectral decay mirrors classical Fourier analysis. Just as a signal can be decomposed into frequency components — low frequencies for the overall shape, high frequencies for fine detail — embedding dimensions organize from coarse to fine semantic resolution.

PCA of trained embedding matrices confirms this: the top principal components capture broad categorical distinctions, while lower components encode progressively finer differences. The eigenvalue spectrum typically follows a power law, meaning most information is concentrated in a small fraction of the dimensions — which is precisely why dimensionality reduction and Matryoshka truncation work as well as they do.
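A synthetic illustration of this concentration effect (random data with an assumed power-law exponent, not a trained model): scale Gaussian columns by a decaying spectrum and inspect how PCA variance piles up in the top components:

```python
import numpy as np

# Synthetic sketch: columns scaled by a power-law spectrum mimic the
# eigenvalue decay reported for trained embedding matrices.
rng = np.random.default_rng(42)
n, d = 2000, 768
spectrum = np.arange(1, d + 1) ** -0.8       # assumed power-law decay
X = rng.standard_normal((n, d)) * spectrum

Xc = X - X.mean(axis=0)                      # center for PCA
s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
var = s**2 / np.sum(s**2)                    # variance explained per component
for k in [16, 64, 128, 256]:
    print(f"top {k:>3d} components explain {var[:k].sum():.1%} of variance")
```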

Practical Guide: Choosing Your Dimension

The right embedding dimension depends on your use case, data volume, latency requirements, and budget. Here is a decision framework.

Prototype / MVP

384d
  • Fastest inference and indexing
  • Fits in memory at any scale
  • Good enough for topic-level retrieval
  • Model: all-MiniLM-L6-v2
  • Use case: internal search, chatbot retrieval, demos

Production

768-1024d
  • Best quality-per-dollar ratio
  • Strong on MTEB retrieval benchmarks
  • Handles fine-grained similarity well
  • Models: all-mpnet-base-v2, e5-large-v2
  • Use case: RAG systems, semantic search, recommendations

Quality-Critical

1536+
  • Maximum retrieval precision
  • Best for nuanced queries
  • Diminishing returns past 1536d
  • Models: text-embedding-3-large, Cohere v3
  • Use case: legal discovery, medical literature, code search

The Modern Strategy: Matryoshka + Adaptive

If you are starting a new project in 2025+, use a Matryoshka-trained model at 768d or 1024d. Start with 256d or 384d for development, then increase dimension at deployment based on your quality requirements. You get the best of both worlds: fast iteration during development and tunable quality in production — with a single model and a single index rebuild.

Working Code

Demonstrate Matryoshka truncation and measure the quality impact yourself.

matryoshka_demo.py (Python + sentence-transformers)
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a Matryoshka-trained model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Encode at full dimensionality
sentences = [
    "The mitochondria is the powerhouse of the cell",
    "Cellular respiration produces ATP in mitochondria",
    "The stock market crashed in 2008",
]

full_embeddings = model.encode(sentences)
print(f"Full shape: {full_embeddings.shape}")  # (3, 768)

# Truncate to different dimensions and measure similarity
for dim in [768, 512, 384, 256, 128, 64]:
    truncated = full_embeddings[:, :dim]

    # Normalize after truncation (critical!)
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    truncated = truncated / norms

    # Cosine similarity between first two (related) sentences
    sim_related = np.dot(truncated[0], truncated[1])
    # Cosine similarity between first and third (unrelated)
    sim_unrelated = np.dot(truncated[0], truncated[2])

    print(f"dim={dim:>4d}  related={sim_related:.4f}"
          f"  unrelated={sim_unrelated:.4f}"
          f"  gap={sim_related - sim_unrelated:.4f}")
Expected output (typical results):
Full shape: (3, 768)
dim= 768  related=0.8834  unrelated=0.1247  gap=0.7587
dim= 512  related=0.8791  unrelated=0.1198  gap=0.7593
dim= 384  related=0.8726  unrelated=0.1143  gap=0.7583
dim= 256  related=0.8612  unrelated=0.1089  gap=0.7523
dim= 128  related=0.8341  unrelated=0.0976  gap=0.7365
dim=  64  related=0.7854  unrelated=0.0812  gap=0.7042
Notice how the gap between related and unrelated similarity stays remarkably stable down to 256d, then starts degrading. The absolute similarities shift, but the discriminative power is preserved — which is what matters for retrieval.

Key Papers

The foundational research behind embedding dimension choices, from the original Transformer to Matryoshka representations.

Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
NeurIPS 2017 · 130,000+ citations

Introduced d_k = d_model/h and the 64-dim head convention

BERT: Pre-training of Deep Bidirectional Transformers
Devlin, Chang, Lee, Toutanova
NAACL 2019 · 95,000+ citations

Established 768 as the default embedding dimension

Matryoshka Representation Learning
Kusupati, Bhatt, Rege, Wallingford, Sinha, Ramanujan, Howard, Jain
NeurIPS 2022 · 300+ citations

Train once, truncate to any dimension with graceful degradation

An Elementary Proof of a Theorem of Johnson and Lindenstrauss
Dasgupta, Gupta
Random Structures & Algorithms 2003 · 2,500+ citations

Theoretical lower bound on dimensions needed to preserve distances

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych
EMNLP 2019 · 8,000+ citations

Made BERT embeddings practical for similarity search

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Aghajanyan, Zettlemoyer, Gupta
ACL 2021 · 600+ citations

Showed pre-trained models have low intrinsic dimensionality


TL;DR

Why these numbers

  • 768 = 12 heads x 64 — architecture constraint from BERT, not arbitrary
  • Always multiples of 64 — head dimension convention from the Transformer paper
  • Hardware aligned — GPU tensor cores need dimensions divisible by 8/16/32
  • Theoretically grounded — JL lemma and intrinsic dimensionality bound the range

What to do about it

  • Default to 768 — best quality/cost ratio for most applications
  • Use Matryoshka models — train once, truncate to any dimension at inference
  • Prototype at 384, deploy at 768+ — dimension is a tuning knob, not a fixed choice
  • Past 1024, gains are marginal — spend your budget on better data instead
Last updated: March 2026. Data from MTEB leaderboard, model documentation, and original papers.