Deep Dive — Embeddings

Why 768?
The Science Behind Embedding Dimensions

Embedding dimensions like 384, 768, and 1024 look arbitrary — but they are the product of transformer architecture constraints, GPU hardware alignment, and information-theoretic bounds. This page explains the factors that determine the number.

Common Dimensions

384  = 6 heads x 64 dims   (1.5 KB/vec)
768  = 12 heads x 64 dims  (3 KB/vec)
1024 = 16 heads x 64 dims  (4 KB/vec)
1536 = 24 heads x 64 dims  (6 KB/vec)
3072 = 48 heads x 64 dims  (12 KB/vec)

Why These Specific Numbers

Every common embedding dimension is a multiple of 64. That is not a coincidence — it is a direct consequence of transformer architecture and GPU hardware design.

The Transformer Architecture Constraint

In the original Transformer paper (Vaswani et al., 2017), the model dimension d_model is split equally across h attention heads, each operating on d_k = d_model / h dimensions.

# The fundamental equation
d_model = num_heads x d_head
# BERT-base
768 = 12 heads x 64 dims/head
# BERT-large
1024 = 16 heads x 64 dims/head

The head dimension of 64 was set in the original paper and has remained a near-universal default. Given that, the total dimension is just 64 x head_count.
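The decomposition can be checked in a few lines of Python (a minimal sketch; the 64-dim head size is a convention, not a hard requirement):

```python
# Check that a model dimension decomposes into an integer head count
# at the conventional 64-dim head size.
HEAD_DIM = 64

def num_heads(d_model: int, head_dim: int = HEAD_DIM) -> int:
    """Return the implied attention head count for an even split."""
    assert d_model % head_dim == 0, f"{d_model} is not a multiple of {head_dim}"
    return d_model // head_dim

for d in [384, 768, 1024, 1536, 3072]:
    print(f"{d} = {num_heads(d)} heads x {HEAD_DIM} dims/head")
```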

GPU Hardware Alignment

Modern GPU tensor cores process matrices in fixed tile sizes — typically 8x8, 16x16, or 32x32 blocks depending on the precision (FP16, BF16, INT8). Dimensions that are multiples of these tile sizes fill the hardware pipeline perfectly with no wasted compute.

# GPU tensor core tile sizes
FP16 WMMA: 16 x 16 x 16
INT8 WMMA: 32 x 8 x 16
# Common dimensions divisibility
384 / 8 = 48
768 / 16 = 48
1024 / 32 = 32
3072 / 32 = 96

A dimension like 700 would leave tensor cores partially idle every cycle, wasting 5-15% of throughput. That penalty compounds across billions of operations per forward pass.
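A rough sketch of the padding overhead alone (illustrative only: real kernels also lose throughput to misaligned memory access, so the pure pad-to-tile fraction understates the total penalty):

```python
import math

# Padding overhead when a dimension is rounded up to the next
# multiple of a tensor-core tile size.
def padded_waste(dim: int, tile: int) -> float:
    padded = math.ceil(dim / tile) * tile
    return (padded - dim) / padded

for dim in [700, 768]:
    for tile in [8, 16, 32]:
        waste = padded_waste(dim, tile)
        print(f"dim={dim:<4d} tile={tile:<2d} -> {waste:.2%} of padded compute wasted")
```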

Dimension Decomposition

Every common dimension is heads x 64

Dimension | Heads | Head Dim | Notable Model            | MTEB Avg
384       | 6     | 64       | all-MiniLM-L6            | ~56
768       | 12    | 64       | BERT / all-mpnet-base    | ~63
1024      | 16    | 64       | BERT-large / e5-large    | ~64
1536      | 24    | 64       | text-embedding-3-large   | ~64.6
3072      | 48    | 64       | GPT-style large models   | ~65

Information Capacity Theory

How many dimensions do you actually need? Information theory gives us both a lower bound and an intuition for why practical embeddings need far more than the minimum.

~18
Theoretical Minimum

log₂(250K) ≈ 18 dimensions to uniquely identify each token in a large vocabulary. This is the absolute floor from information theory — enough to distinguish tokens, but not enough to represent their meaning.

~100
Intrinsic Dimensionality

Research shows pre-trained language models have an intrinsic dimensionality of roughly 100 (Aghajanyan et al., 2021). This means the learned representations live on a ~100-dimensional manifold embedded in the higher-dimensional space.

768
Practical Sweet Spot

The gap between intrinsic (~100) and practical (768) exists because higher dimensions provide better optimization landscapes, reduce interference between features, and enable the model to separate more fine-grained semantic distinctions during training.

The Johnson-Lindenstrauss Lemma

The JL lemma provides a theoretical bound: to preserve pairwise distances between N points within a factor of (1 ± ε), you need at least O(log(N) / ε²) dimensions. For N = 1 million points and ε = 0.1 (10% distortion tolerance):

# Johnson-Lindenstrauss bound
d ≥ C * ln(N) / ε²
d ≥ 8 * ln(1,000,000) / 0.01
d ≥ 8 * 13.8 / 0.01
d ≥ 11,040
# With ε = 0.3 (30% tolerance)
d ≥ 8 * 13.8 / 0.09
d ≥ 1,227
# This is a worst-case bound for arbitrary point sets.
# Real language data has much more structure, so 768 works.

The JL bound is pessimistic because it applies to arbitrary point configurations. Natural language embeddings lie on a low-dimensional manifold within the ambient space, so far fewer dimensions suffice in practice to preserve the distance relationships that matter.
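Both bounds are easy to compute directly. A small sketch, keeping the constant C = 8 from the worked example (the exact constant depends on the proof variant; full-precision ln gives slightly different values than the rounded arithmetic above):

```python
import math

# Johnson-Lindenstrauss lower bound and the token-ID floor,
# with C = 8 as in the worked example above.
def jl_dims(n_points: int, eps: float, c: float = 8.0) -> int:
    """JL lower bound: d >= C * ln(N) / eps^2."""
    return math.ceil(c * math.log(n_points) / eps**2)

token_floor = math.ceil(math.log2(250_000))  # dims needed just to label tokens
print(f"token-ID floor: {token_floor} dims")
print(f"JL, N=1e6, eps=0.1: {jl_dims(1_000_000, 0.1):,} dims")
print(f"JL, N=1e6, eps=0.3: {jl_dims(1_000_000, 0.3):,} dims")
```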

The Compute-Quality Tradeoff

Doubling the embedding dimension doubles memory, roughly doubles search latency for brute-force retrieval — but the quality gains diminish sharply. The numbers below show why 768 is such a popular default: you get most of the quality at a fraction of the cost.

Dimension vs. Quality vs. Cost

Based on representative MTEB scores and float32 storage

Dim  | Mem / Vector | MTEB Avg | Search Latency | Quality / KB
384  | 1.5 KB       | ~56      | Fastest        | 37.3
768  | 3 KB         | ~63      | Fast           | 21.0
1024 | 4 KB         | ~64      | Medium         | 16.0
1536 | 6 KB         | ~64.6    | Moderate       | 10.8
3072 | 12 KB        | ~65      | Slower         | 5.4
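The Quality / KB column is simply the representative MTEB average divided by per-vector memory; a quick sketch reproducing it from the figures above:

```python
# Quality per KB = representative MTEB average / per-vector memory.
# (dim, KB per vector, MTEB average) taken from the table above.
rows = [(384, 1.5, 56.0), (768, 3.0, 63.0), (1024, 4.0, 64.0),
        (1536, 6.0, 64.6), (3072, 12.0, 65.0)]
for dim, kb, mteb in rows:
    print(f"{dim:>5d}d: {mteb / kb:5.1f} quality points per KB")
```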

Memory at Scale

Total vector storage (float32, no index overhead)

Documents | 384d   | 768d   | 1024d  | 3072d
10K       | 15 MB  | 30 MB  | 40 MB  | 120 MB
100K      | 150 MB | 300 MB | 400 MB | 1.2 GB
1M        | 1.5 GB | 3 GB   | 4 GB   | 12 GB
10M       | 15 GB  | 30 GB  | 40 GB  | 120 GB
100M      | 150 GB | 300 GB | 400 GB | 1.2 TB
At 100M documents, the difference between 384d and 3072d is 150 GB vs 1.2 TB — an 8x cost multiplier that directly impacts infrastructure spend.
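Raw storage follows directly from docs x dims x 4 bytes (float32). A small helper (the table rounds these figures, e.g. 153.6 GB is shown as 150 GB):

```python
# Raw float32 vector storage: docs x dims x 4 bytes, no index overhead.
def vector_storage_gb(n_docs: int, dim: int, bytes_per_float: int = 4) -> float:
    return n_docs * dim * bytes_per_float / 1e9

for dim in [384, 768, 1024, 3072]:
    print(f"100M docs @ {dim:>4d}d: {vector_storage_gb(100_000_000, dim):8.1f} GB")
```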

Matryoshka Embeddings

Matryoshka Representation Learning (Kusupati et al., 2022) is the modern answer to the dimension dilemma: train at full dimensionality, then truncate to any smaller dimension at inference time with graceful quality degradation. Like Russian nesting dolls, the most important information is packed into the earliest dimensions.

How It Works

During training, the loss function is computed at multiple truncation points simultaneously. The model is forced to produce useful representations at dimensions 64, 128, 256, 384, 512, and 768 — all within the same 768-dimensional vector.

# Matryoshka training loss
L = Σ w_m * loss(embed[:m], target)
# for m in [64, 128, 256, 384, 512, 768]
# At inference — just slice
full = model.encode(text) # [768]
small = full[:256] # [256] still useful
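A toy numpy sketch of the weighted multi-truncation loss above (illustrative only: it uses a simple 1 - cosine loss on random vectors with uniform weights, not the paper's exact contrastive objective):

```python
import numpy as np

# Toy multi-truncation loss: the same loss term evaluated at several
# truncation lengths of one 768-dim vector, summed with weights w_m.
rng = np.random.default_rng(0)
anchor, positive = rng.standard_normal((2, 768))

def cosine_loss(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity, re-normalizing after truncation."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(a @ b)

dims = [64, 128, 256, 384, 512, 768]
weights = {m: 1.0 for m in dims}  # uniform w_m
total = sum(weights[m] * cosine_loss(anchor[:m], positive[:m]) for m in dims)
print(f"Matryoshka-style total loss: {total:.3f}")
```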

Why Truncation Works

The Matryoshka loss encourages the model to encode the most important semantic information in the first dimensions. This aligns with a natural tendency in neural networks: learned representations show a spectral decay where early dimensions capture broad, high-variance features (topic, domain) and later dimensions encode finer distinctions (tone, style, entities).

Without Matryoshka training, truncation destroys quality unpredictably — the important information might be spread across all dimensions. With it, there is a smooth, predictable degradation curve.

Quality Retention by Dimension

Matryoshka-trained model (768d full), truncated at inference

Dimension | Quality Retained | MTEB Delta | Note
768       | 100%             | 0          | Full dimension
512       | 99.2%            | -0.5       | Minimal loss
384       | 98.1%            | -1.2       | Great tradeoff
256       | 96.5%            | -2.3       | Good for prototyping
128       | 92.8%            | -4.8       | Noticeable degradation
64        | 85.4%            | -9.7       | Topic-level only

Frequency and Spectral Analysis

Not all dimensions are created equal. Analysis of trained embedding models reveals a spectral structure: different dimension ranges encode qualitatively different types of information.

Low-frequency

Dims 0-128

Broad semantic features: topic (science vs. sports), domain (medical vs. legal), and language. These dimensions have the highest variance and carry the most information per bit. This is why Matryoshka truncation to 128d still captures topic-level similarity at 93% quality.

Mid-frequency

Dims 128-512

Subtopic and entity-level features: the difference between "cardiac surgery" and "cardiac imaging," or between "Python web frameworks" and "Python data science." This range is critical for retrieval quality in domain-specific applications.

High-frequency

Dims 512-768

Fine-grained distinctions: tone (formal vs. casual), style, specific entity mentions, and subtle semantic nuances. These dimensions have the lowest variance individually but collectively enable the model to distinguish near-duplicates and handle edge cases.

Connection to Fourier Analysis

This spectral decay mirrors classical Fourier analysis. Just as a signal can be decomposed into frequency components — low frequencies for the overall shape, high frequencies for fine detail — embedding dimensions organize from coarse to fine semantic resolution.

PCA of trained embedding matrices confirms this: the top principal components capture broad categorical distinctions, while lower components encode progressively finer differences. The eigenvalue spectrum typically follows a power law, meaning most information is concentrated in a small fraction of the dimensions — which is precisely why dimensionality reduction and Matryoshka truncation work as well as they do.
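A synthetic illustration of this concentration effect (random data with an assumed power-law exponent, not a trained model): scale Gaussian columns by a decaying spectrum and inspect how PCA variance piles up in the top components:

```python
import numpy as np

# Synthetic sketch: columns scaled by a power-law spectrum mimic the
# eigenvalue decay reported for trained embedding matrices.
rng = np.random.default_rng(42)
n, d = 2000, 768
spectrum = np.arange(1, d + 1) ** -0.8       # assumed power-law decay
X = rng.standard_normal((n, d)) * spectrum

Xc = X - X.mean(axis=0)                      # center for PCA
s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
var = s**2 / np.sum(s**2)                    # variance explained per component
for k in [16, 64, 128, 256]:
    print(f"top {k:>3d} components explain {var[:k].sum():.1%} of variance")
```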

Practical Guide: Choosing Your Dimension

The right embedding dimension depends on your use case, data volume, latency requirements, and budget. Here is a decision framework.

Prototype / MVP

384d
  • Fastest inference and indexing
  • Fits in memory at any scale
  • Good enough for topic-level retrieval
  • Model: all-MiniLM-L6-v2
  • Use case: internal search, chatbot retrieval, demos

Production

768-1024d
  • Best quality-per-dollar ratio
  • Strong on MTEB retrieval benchmarks
  • Handles fine-grained similarity well
  • Models: all-mpnet-base-v2, e5-large-v2
  • Use case: RAG systems, semantic search, recommendations

Quality-Critical

1536+
  • Maximum retrieval precision
  • Best for nuanced queries
  • Diminishing returns past 1536d
  • Models: text-embedding-3-large, Cohere v3
  • Use case: legal discovery, medical literature, code search

The Modern Strategy: Matryoshka + Adaptive

If you are starting a new project in 2025+, use a Matryoshka-trained model at 768d or 1024d. Start with 256d or 384d for development, then increase dimension at deployment based on your quality requirements. You get the best of both worlds: fast iteration during development and tunable quality in production — with a single model and a single index rebuild.

Working Code

Demonstrate Matryoshka truncation and measure the quality impact yourself.

matryoshka_demo.py (Python + sentence-transformers)
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a Matryoshka-trained model
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Encode at full dimensionality
sentences = [
    "The mitochondria is the powerhouse of the cell",
    "Cellular respiration produces ATP in mitochondria",
    "The stock market crashed in 2008",
]

full_embeddings = model.encode(sentences)
print(f"Full shape: {full_embeddings.shape}")  # (3, 768)

# Truncate to different dimensions and measure similarity
for dim in [768, 512, 384, 256, 128, 64]:
    truncated = full_embeddings[:, :dim]

    # Normalize after truncation (critical!)
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    truncated = truncated / norms

    # Cosine similarity between first two (related) sentences
    sim_related = np.dot(truncated[0], truncated[1])
    # Cosine similarity between first and third (unrelated)
    sim_unrelated = np.dot(truncated[0], truncated[2])

    print(f"dim={dim:>4d}  related={sim_related:.4f}"
          f"  unrelated={sim_unrelated:.4f}"
          f"  gap={sim_related - sim_unrelated:.4f}")
Expected output (typical results):
Full shape: (3, 768)
dim= 768  related=0.8834  unrelated=0.1247  gap=0.7587
dim= 512  related=0.8791  unrelated=0.1198  gap=0.7593
dim= 384  related=0.8726  unrelated=0.1143  gap=0.7583
dim= 256  related=0.8612  unrelated=0.1089  gap=0.7523
dim= 128  related=0.8341  unrelated=0.0976  gap=0.7365
dim=  64  related=0.7854  unrelated=0.0812  gap=0.7042
Notice how the gap between related and unrelated similarity stays remarkably stable down to 256d, then starts degrading. The absolute similarities shift, but the discriminative power is preserved — which is what matters for retrieval.

Key Papers

The foundational research behind embedding dimension choices, from the original Transformer to Matryoshka representations.

Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin
NeurIPS 2017 · 130,000+ citations

Introduced d_k = d_model/h and the 64-dim head convention

BERT: Pre-training of Deep Bidirectional Transformers
Devlin, Chang, Lee, Toutanova
NAACL 2019 · 95,000+ citations

Established 768 as the default embedding dimension

Matryoshka Representation Learning
Kusupati, Bhatt, Rege, Wallingford, Sinha, Ramanujan, Howard, Jain
NeurIPS 2022 · 300+ citations

Train once, truncate to any dimension with graceful degradation

An Elementary Proof of a Theorem of Johnson and Lindenstrauss
Dasgupta, Gupta
Random Structures & Algorithms 2003 · 2,500+ citations

Theoretical lower bound on dimensions needed to preserve distances

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych
EMNLP 2019 · 8,000+ citations

Made BERT embeddings practical for similarity search

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Aghajanyan, Zettlemoyer, Gupta
ACL 2021 · 600+ citations

Showed pre-trained models have low intrinsic dimensionality


TL;DR

Why these numbers

  • 768 = 12 heads x 64 — architecture constraint from BERT, not arbitrary
  • Always multiples of 64 — head dimension convention from the Transformer paper
  • Hardware aligned — GPU tensor cores need dimensions divisible by 8/16/32
  • Theoretically grounded — JL lemma and intrinsic dimensionality bound the range

What to do about it

  • Default to 768 — best quality/cost ratio for most applications
  • Use Matryoshka models — train once, truncate to any dimension at inference
  • Prototype at 384, deploy at 768+ — dimension is a tuning knob, not a fixed choice
  • Past 1024, gains are marginal — spend your budget on better data instead
Last updated: March 2026. Data from MTEB leaderboard, model documentation, and original papers.