
Text Embeddings in Practice

From Word2Vec to the MTEB leaderboard: how text embedding models evolved, how to choose one, and how to use them in production.

Prerequisite: This lesson assumes you understand what embeddings are and how they work mechanically. If terms like "cosine similarity," "contrastive learning," or "transformer encoder" are unfamiliar, start with Lesson 0.1: What is an Embedding?

The Evolution of Text Embedding Models

Lesson 0.1 traced the theoretical foundations from Harris (1954) through BERT (2018). This lesson picks up the practical thread: how did we get from "word vectors are interesting" to "here is a leaderboard with 200+ models ranked across 56 datasets"?

The answer is four architectural shifts, each unlocking a new class of applications. Understanding these shifts tells you why certain models are good at certain tasks and saves you from cargo-culting someone else's model choice.

Shift I: Static Word Vectors
2013

Word2Vec: One Vector Per Word

Mikolov et al. at Google showed that a stripped-down neural network trained to predict context words (Skip-gram) or center words (CBOW) produced word vectors where arithmetic encoded semantic relationships. Training on 100 billion words took a day on a single machine.

The limitation was fundamental: every word got exactly one vector. "Bank" in "river bank" and "bank robbery" shared the same embedding. And there was no principled way to combine word vectors into a sentence embedding — averaging worked surprisingly well for short texts, but lost word order and emphasis entirely.

# Word2Vec: one fixed vector per word
from gensim.models import Word2Vec

# `sentences` is an iterable of tokenized sentences: [["the", "cat", ...], ...]
model = Word2Vec(sentences, vector_size=300, window=5, sg=1)  # Skip-gram
vec_king = model.wv["king"]      # Shape: (300,) — always the same vector
vec_bank = model.wv["bank"]      # Same vector for river bank and bank account

# "Sentence embedding" = crude average
import numpy as np
sentence = "the cat sat on the mat"
sent_vec = np.mean([model.wv[w] for w in sentence.split()], axis=0)

Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop.

2014

GloVe: Count Meets Predict

Pennington, Socher, and Manning at Stanford bridged count-based methods (LSA) with prediction-based methods (Word2Vec) by training embeddings to reconstruct log co-occurrence ratios from an explicit word-word co-occurrence matrix. GloVe matched Word2Vec quality while making the connection to corpus statistics transparent. Pre-trained GloVe vectors (6B and 840B token variants) became the default initialization for NLP models for years.

Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP. 35,000+ citations.

2016

fastText: Subword Embeddings

Bojanowski et al. at Facebook AI solved the out-of-vocabulary problem by learning embeddings for character n-grams instead of whole words. A word's vector was the sum of its subword vectors, meaning even misspellings and neologisms got meaningful representations. This was critical for morphologically rich languages (Turkish, Finnish, Arabic) and foreshadowed the BPE tokenization that all modern transformers use.

Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135-146.

The static-word-vector ceiling

Word2Vec, GloVe, and fastText all share the same fundamental limit: they produce one vector per word (or subword), not per sentence. To get a sentence embedding, you average the word vectors — destroying word order, negation, and emphasis. "The dog bit the man" and "The man bit the dog" get identical embeddings. This ceiling drove the next shift.
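The order-blindness is easy to demonstrate: under averaging, any permutation of the same words produces the same sentence vector. A numpy sketch with random stand-in word vectors (the argument holds for any trained embeddings):

```python
# Averaging word vectors is order-invariant: permuting the words of a
# sentence yields an identical mean. Random vectors stand in for trained
# word embeddings here; only the averaging step matters.
import numpy as np

rng = np.random.default_rng(42)
vocab = {w: rng.normal(size=300) for w in ["the", "dog", "bit", "man"]}

def avg_embed(sentence):
    return np.mean([vocab[w] for w in sentence.split()], axis=0)

a = avg_embed("the dog bit the man")
b = avg_embed("the man bit the dog")
print(np.allclose(a, b))  # True — the two sentences are indistinguishable
```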

Shift II: Contextual Representations
October 2018

BERT: Context-Dependent Embeddings

Devlin, Chang, Lee, and Toutanova at Google released BERT, which produced different representations for the same word depending on surrounding context. "Bank" in "river bank" now got a genuinely different vector from "bank robbery."

But BERT had a critical problem for embeddings: comparing two sentences required feeding both through the model together as a cross-encoder. For N documents, finding the most similar pair required N(N-1)/2 forward passes — about 65 hours for a collection of 10,000 sentences.

# BERT cross-encoder: accurate but O(n²)
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Comparing a pair means encoding BOTH sentences in one forward pass:
inputs = tokenizer("the river bank", "the bank robbery", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)  # one full forward pass per PAIR

# To compare sentence A with 10,000 candidates:
# must run the model once for EACH pair (A, candidate_i).
# At ~10ms per pair = 100 seconds for one query.
# All-pairs on 10,000 docs: 10000·9999/2 ≈ 50M pairs ≈ 65 hours.

Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. 90,000+ citations.

Shift III: Bi-Encoder Sentence Embeddings
August 2019

Sentence-BERT: The Bi-Encoder Breakthrough

Nils Reimers and Iryna Gurevych at TU Darmstadt solved BERT's quadratic problem with an elegant trick: fine-tune BERT with a siamese network so each sentence could be independently encoded into a fixed-length vector. Compare by dot product, not by cross-attention.

The result: searching 10,000 sentences went from 65 hours to 5 seconds. Encode your corpus once, store the vectors in a database, find similar items with a single matrix multiply. This is the architecture behind every modern semantic search system, RAG pipeline, and recommendation engine.

# Sentence-BERT bi-encoder: encode once, search fast
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode corpus ONCE — O(n)
corpus_embeddings = model.encode(corpus_of_10000_docs)  # (10000, 384)

# Search ANY query in milliseconds — just a dot product
query_vec = model.encode("machine learning frameworks")  # (384,)
scores = corpus_embeddings @ query_vec  # (10000,) — instant
top_results = scores.argsort()[-5:][::-1]  # Top 5

"SBERT reduces the effort for finding the most similar pair from 65 hours with BERT to about 5 seconds with SBERT, while maintaining the accuracy from BERT."

Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.

Reimers went on to create the sentence-transformers Python library, which remains the most widely used tool for generating text embeddings. It now supports 5,000+ pre-trained models via Hugging Face.

Shift IV: The Embedding Arms Race (2022-present)
2022-2023

Contrastive Training at Scale

Three innovations supercharged embedding quality. First, hard negative mining: instead of training on random negatives, models learned from documents that were close but not quite right — the "almost correct" distractors that are hardest to distinguish. Second, multi-stage training: pre-train on billions of weakly-supervised text pairs (title-body, query-passage), then fine-tune on high-quality labeled data. Third, instruction-tuning: prepend task descriptions to inputs so the same model produces different embeddings optimized for retrieval vs. classification vs. clustering.
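The contrastive objective behind these models fits in a few lines of numpy. This is an illustrative InfoNCE loss with in-batch negatives, not any particular model's training code:

```python
# Sketch of the InfoNCE contrastive objective with in-batch negatives.
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """queries, positives: (batch, dim), L2-normalized.
    Row i of `positives` is the match for row i of `queries`;
    every other row in the batch serves as a negative."""
    sims = queries @ positives.T / temperature          # (batch, batch)
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # cross-entropy on the diagonal

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 64))
q /= np.linalg.norm(q, axis=1, keepdims=True)
p = q + 0.1 * rng.normal(size=(8, 64))                  # noisy "paraphrases" of q
p /= np.linalg.norm(p, axis=1, keepdims=True)

print(info_nce_loss(q, p))                       # low: each positive is nearest its query
print(info_nce_loss(q, np.roll(p, 1, axis=0)))   # high: positives misaligned
```

Hard negative mining changes only which rows end up in the batch: instead of random documents, you pick near-misses so the off-diagonal similarities are large and the loss is informative.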

The result was a Cambrian explosion of models. BAAI released BGE. Microsoft released E5. Alibaba released GTE. Jina AI released jina-embeddings. OpenAI shipped text-embedding-3. Cohere shipped embed-v3. Each claimed state-of-the-art on different benchmarks.

2024-present

Matryoshka Representations and Decoder-Based Models

Matryoshka Representation Learning (Kusupati et al., 2022) trained models so that truncating embeddings to fewer dimensions preserved most of the quality — like Russian nesting dolls, the first 256 dimensions of a 3072-dimensional embedding capture most of the information. OpenAI's text-embedding-3 and Cohere's embed-v3 both support this, letting you trade storage for quality at query time.

Meanwhile, decoder-based models like GTE-Qwen2 and E5-Mistral proved that large language models (LLMs) could be adapted into powerful embedding models. By pooling the last-token representation from a 7B-parameter decoder, these models achieved new MTEB records — at the cost of 10-50x more compute per embedding.

Kusupati, A. et al. (2022). Matryoshka Representation Learning. NeurIPS.
Li, Z. et al. (2024). GTE-Qwen2: Towards General Text Embeddings with Multi-stage Contrastive Learning.
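The last-token pooling these decoder-based models rely on is simple to sketch (numpy, with a random tensor standing in for the decoder's hidden states):

```python
# Last-token pooling: take the hidden state of the final non-padding token
# in each sequence as the embedding. Random data stands in for real
# decoder output; the indexing is the point.
import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 6, 8))               # (batch, seq_len, hidden)
attention_mask = np.array([[1, 1, 1, 1, 0, 0],    # 4 real tokens, 2 padding
                           [1, 1, 1, 1, 1, 1]])   # full-length sequence

last_idx = attention_mask.sum(axis=1) - 1          # [3, 5]
embeddings = hidden[np.arange(hidden.shape[0]), last_idx]   # (2, 8)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
print(embeddings.shape)  # (2, 8)
```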

The throughline: 2013 to 2026

2013-2016: Static word vectors. One embedding per word. Average for sentences. (Word2Vec, GloVe, fastText)
2018: Contextual but slow. Different embeddings per context, but O(n²) for search. (BERT, ELMo)
2019: Bi-encoder revolution. Encode once, search by dot product. Semantic search goes practical. (Sentence-BERT)
2022-now: Arms race. Hard negatives, multi-stage training, instruction-tuning, Matryoshka dims, decoder-based models.

MTEB: The Standard Benchmark

The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing embedding models. Created by Muennighoff et al. at Hugging Face in 2022, it evaluates models across 8 task types and 56+ datasets spanning retrieval, classification, clustering, semantic textual similarity (STS), reranking, pair classification, summarization, and bitext mining.

"No single text embedding method dominates across all tasks. Current models specialize on certain types of tasks, revealing significant room for improvement."

Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL.

MTEB matters because a model that excels at retrieval may be mediocre at clustering, and vice versa. The benchmark forces you to think about which task you care about, not just the average score. When evaluating models below, pay attention to the Retrieval column if you're building search/RAG, STS if you need fine-grained similarity, and Classification if you're labeling text.

Model Comparison: 2026 Landscape

The embedding landscape is crowded. Here are the models that matter, with real MTEB scores and the trade-offs that determine which one you should use.

| Model | Source | Dims | MTEB Avg | Retrieval | Max Tokens | Open? |
|---|---|---|---|---|---|---|
| GTE-Qwen2-7B | Alibaba | 3584 | 70.2 | 60.3 | 8192 | Yes |
| text-embedding-3-large | OpenAI | 3072 | 64.6 | 55.4 | 8191 | No |
| embed-english-v3.0 | Cohere | 1024 | 64.5 | 55.5 | 512 | No |
| BGE-large-en-v1.5 | BAAI | 1024 | 64.2 | 54.3 | 512 | Yes |
| E5-Mistral-7B | Microsoft | 4096 | 66.6 | 56.9 | 32768 | Yes |
| nomic-embed-text-v1.5 | Nomic AI | 768 | 62.3 | 53.0 | 8192 | Yes |
| jina-embeddings-v3 | Jina AI | 1024 | 65.5 | 56.0 | 8192 | Yes |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 56.3 | 41.9 | 256 | Yes |

Key Insight: MTEB Average Is Not the Full Story

A model with an MTEB average of 70 does, on average, outperform one at 56. But the average hides crucial task-specific variation. BGE-large scores 54.3 on retrieval — adequate for most RAG systems. GTE-Qwen2-7B scores 60.3 — meaningfully better, but it requires a GPU with 14GB+ VRAM and 50x more compute per embedding. The right model depends on your constraints, not just the leaderboard.

Production Code: Three Approaches

Every text embedding pipeline has the same three steps: choose a model, encode text into vectors, compare vectors by similarity. The code below is copy-paste ready. Run it.

Approach 1: sentence-transformers (Open Source, Local)

The most popular library. Free, runs locally, supports 5,000+ models. No API key required.

Best for: prototyping, self-hosted production, privacy-sensitive data, cost control.

# Install
pip install sentence-transformers faiss-cpu numpy
from sentence_transformers import SentenceTransformer
import numpy as np

# Load model (~1.2GB download first time for bge-large)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Encode documents — normalize for cosine similarity via dot product
documents = [
    "Python is a programming language",
    "JavaScript runs in the browser",
    "SQL is used to query databases",
    "Redis is an in-memory data store",
    "PostgreSQL is a relational database",
]
embeddings = model.encode(documents, normalize_embeddings=True)
# Shape: (5, 1024) — 5 documents, 1024 dimensions each

# Encode a query
query = "how to store data persistently"
query_vec = model.encode(query, normalize_embeddings=True)

# Similarity = dot product (vectors are normalized)
scores = embeddings @ query_vec
ranked = np.argsort(scores)[::-1]

print("Results:")
for idx in ranked:
    print(f"  {scores[idx]:.3f}: {documents[idx]}")
Expected output:
Results:
  0.701: PostgreSQL is a relational database
  0.654: Redis is an in-memory data store
  0.598: SQL is used to query databases
  0.412: Python is a programming language
  0.389: JavaScript runs in the browser

Approach 2: OpenAI Embeddings (API)

High quality, zero infrastructure. Pay per token. Supports Matryoshka dimensions (256-3072).

Best for: teams without GPU infrastructure, quality-critical production RAG, quick integration. Pricing: ~$0.13 per 1M tokens for text-embedding-3-large.

# Install
pip install openai numpy
from openai import OpenAI
import numpy as np

client = OpenAI()  # Uses OPENAI_API_KEY env variable

def embed_texts(texts: list[str], model="text-embedding-3-large", dims=1024):
    """Embed texts with OpenAI. Supports Matryoshka truncation via dims."""
    response = client.embeddings.create(
        model=model,
        input=texts,
        dimensions=dims,  # Matryoshka: 256, 512, 1024, or 3072
    )
    return np.array([d.embedding for d in response.data])

# Embed documents and query
doc_embeddings = embed_texts([
    "Python is a programming language",
    "Redis is an in-memory data store",
    "PostgreSQL is a relational database",
])
query_embedding = embed_texts(["how to store data persistently"])

# Compare
scores = doc_embeddings @ query_embedding.T
print(scores.squeeze())  # [0.41, 0.65, 0.71]

Matryoshka dimensions

OpenAI's text-embedding-3 models support flexible dimensions. Requesting dimensions=256 returns the first 256 components of the full 3072-dim vector, losing only ~2-5% quality while cutting storage by 12x. This is Matryoshka Representation Learning in action.
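For MRL-trained open models you can apply the same idea locally: truncate and re-normalize. A sketch (with OpenAI you pass dimensions to the API instead of truncating yourself):

```python
# Matryoshka truncation: keep the first k components, then re-normalize
# so cosine similarity still works. Only valid for MRL-trained models.
import numpy as np

def truncate(vec: np.ndarray, k: int) -> np.ndarray:
    v = vec[:k]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)   # stand-in embedding
full /= np.linalg.norm(full)
short = truncate(full, 256)
print(short.shape)  # (256,)
```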

Approach 3: Cohere Embeddings (API)

Strong multilingual support. Built-in input types for retrieval vs. classification. Compression-aware training.

Best for: multilingual applications, search-focused use cases, teams already on Cohere's platform.

# Install
pip install cohere numpy
import cohere
import numpy as np

co = cohere.ClientV2("your-api-key")

# Cohere distinguishes document vs query embeddings
# This matters: query embeddings are optimized for retrieval
doc_response = co.embed(
    texts=["Python is a programming language",
           "Redis is an in-memory data store",
           "PostgreSQL is a relational database"],
    model="embed-english-v3.0",
    input_type="search_document",   # For corpus documents
    embedding_types=["float"],
)
doc_embeddings = np.array(doc_response.embeddings.float_)

query_response = co.embed(
    texts=["how to store data persistently"],
    model="embed-english-v3.0",
    input_type="search_query",      # For queries — different optimization
    embedding_types=["float"],
)
query_embedding = np.array(query_response.embeddings.float_)

scores = doc_embeddings @ query_embedding.T
print(scores.squeeze())

Why separate input types? Cohere's embed-v3 uses asymmetric training: queries and documents are embedded differently because a short query like "data storage" and a long document about PostgreSQL occupy different semantic spaces. This typically improves retrieval by 2-5% nDCG over symmetric embedding.

Scaling Up: FAISS for Production Search

The dot-product approach above works for thousands of documents. For millions or billions, you need an approximate nearest neighbor (ANN) index. FAISS (Facebook AI Similarity Search) is the most widely used library.

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

# Simulate a larger corpus
documents = [
    "Python is a high-level programming language",
    "JavaScript enables interactive web pages",
    "SQL queries relational databases",
    "Redis provides in-memory caching",
    "PostgreSQL is an advanced relational database",
    "Docker containerizes applications for deployment",
    "Kubernetes orchestrates container workloads",
    "TensorFlow is a machine learning framework",
    "React is a JavaScript UI library",
    "FastAPI is a modern Python web framework",
]

# Encode and normalize
embeddings = model.encode(documents, normalize_embeddings=True).astype("float32")
dim = embeddings.shape[1]  # 1024

# Build FAISS index — IndexFlatIP = exact inner product (cosine for normalized vecs)
index = faiss.IndexFlatIP(dim)
index.add(embeddings)
print(f"Index contains {index.ntotal} vectors of dim {dim}")

# Search
query = "tools for deploying web applications"
query_vec = model.encode([query], normalize_embeddings=True).astype("float32")

# D = distances (similarities), I = indices
D, I = index.search(query_vec, k=3)

print("\nTop 3 results:")
for score, idx in zip(D[0], I[0]):
    print(f"  {score:.3f}: {documents[idx]}")
Expected output:
Index contains 10 vectors of dim 1024

Top 3 results:
  0.712: Docker containerizes applications for deployment
  0.653: Kubernetes orchestrates container workloads
  0.584: FastAPI is a modern Python web framework

Scaling beyond exact search

IndexFlatIP is exact but O(n) per query. For millions of vectors, use approximate indexes:

IndexIVFFlat: partitions vectors into clusters; searches only nearby clusters. 10-100x faster.
IndexIVFPQ: product quantization compresses vectors to ~32 bytes each. 100x less memory.
IndexHNSWFlat: graph-based index. Best recall/speed trade-off for <10M vectors.

Douze, M. et al. (2024). The FAISS Library. arXiv:2401.08281.
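The inverted-file (IVF) idea is worth seeing in miniature. A pure-numpy sketch, not a substitute for FAISS's implementation: assign vectors to centroids up front, then scan only the few clusters nearest the query:

```python
# Toy IVF index: cluster assignment at build time, cluster pruning at
# query time. Real IVF trains centroids with k-means; here we just
# sample them from the data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 32)).astype("float32")
data /= np.linalg.norm(data, axis=1, keepdims=True)

centroids = data[rng.choice(len(data), 16, replace=False)]
assignments = np.argmax(data @ centroids.T, axis=1)   # inverted lists

def ivf_search(query, k=5, nprobe=4):
    # Visit only the nprobe clusters whose centroids are closest to the query
    probe = np.argsort(query @ centroids.T)[-nprobe:]
    candidates = np.where(np.isin(assignments, probe))[0]
    scores = data[candidates] @ query
    return candidates[np.argsort(scores)[-k:][::-1]]

query = data[0]               # search for a vector that is in the index
print(ivf_search(query)[0])   # 0 — the query finds itself first
```

Instead of scoring all 1,000 vectors, each query scores only the candidates in its 4 probed clusters; nprobe is the recall/speed knob.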

Embedding Dimensions: The Storage-Quality Trade-off

Embedding dimension is the single most impactful architecture decision for production systems. It determines storage cost, search latency, and (to a point) embedding quality.

384 (Small): 1.5 KB/vec. MiniLM. Prototypes and edge devices.
768 (Medium): 3 KB/vec. Nomic, BGE-base. Good balance.
1024 (Large): 4 KB/vec. BGE-large, Cohere. Production sweet spot.
3072 (XL): 12 KB/vec. OpenAI large. Max quality, max cost.

What This Means at Scale

| Corpus Size | 384d (1.5 KB) | 1024d (4 KB) | 3072d (12 KB) |
|---|---|---|---|
| 100K docs | 150 MB | 400 MB | 1.2 GB |
| 1M docs | 1.5 GB | 4 GB | 12 GB |
| 10M docs | 15 GB | 40 GB | 120 GB |
| 100M docs | 150 GB | 400 GB | 1.2 TB |

float32 storage. Product quantization (PQ) can reduce this 4-8x with ~2-5% quality loss.

Practical rule of thumb

If your embeddings need to fit in RAM for fast search, work backward from your memory budget. For most RAG systems with under 1M documents, 1024 dimensions is the sweet spot: it fits comfortably in 4GB, gives you near-SOTA quality, and models like BGE-large or Cohere embed-v3 produce excellent results at this dimension. Go to 384 only if you're on edge devices or need to search 10M+ docs without quantization.
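Working backward from a memory budget is plain arithmetic: a float32 vector costs dims × 4 bytes, so an index can be sized in one line:

```python
# Back-of-envelope index sizing: float32 vectors cost dims * 4 bytes each.
def index_size_gb(n_docs: int, dims: int, bytes_per_dim: int = 4) -> float:
    """Raw vector storage in decimal GB (excludes index overhead)."""
    return n_docs * dims * bytes_per_dim / 1e9

for dims in (384, 1024, 3072):
    print(f"{dims}d, 1M docs: {index_size_gb(1_000_000, dims):.1f} GB")
```

For 1M documents this gives roughly 1.5, 4.1, and 12.3 GB, matching the table above.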

Model Comparison

| Model | Dimensions | MTEB Score | Cost | Best For |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.2 | Free (local) | Best open-source model |
| all-MiniLM-L6-v2 | 384 | 56.3 | Free (local) | Fast, lightweight |
| text-embedding-3-large | 3072 | 64.6 | Pay per use | Highest-quality OpenAI embedding |
| text-embedding-3-small | 1536 | 62.3 | Pay per use | Cost-effective API option |
| embed-english-v3.0 | 1024 | 64.5 | Pay per use | Strong performance |

Decision Guide: Choosing Your Model

Model choice is a function of four constraints: quality requirements, latency budget, infrastructure, and cost tolerance. Here is the decision tree.

Prototyping / Learning / Hackathons

Use all-MiniLM-L6-v2. Free, fast, a 22M-parameter model that runs on a laptop CPU.

384 dims | 5ms/sentence on CPU | MTEB 56.3 | Good enough to validate any idea before investing in infrastructure.

Production RAG / Semantic Search (Self-Hosted)

Use BAAI/bge-large-en-v1.5 or jina-embeddings-v3. SOTA quality without API costs or data leaving your infrastructure.

1024 dims | MTEB 64-65 | Requires a GPU for fast batch encoding (~500 docs/sec on A100). Once encoded, search is pure CPU.

Production RAG (Managed API)

Use text-embedding-3-large (OpenAI) or embed-english-v3.0 (Cohere). Zero infrastructure. Scale instantly.

1024-3072 dims | MTEB 64-65 | $0.13/1M tokens (OpenAI). Worth it when engineering time costs more than API fees. Cohere is best when you need asymmetric query/document embeddings.

Multilingual

Use BGE-M3 (open source, 100+ languages) or embed-multilingual-v3.0 (Cohere).

BGE-M3 supports hybrid dense+sparse+ColBERT retrieval in a single model. Cohere's multilingual model covers 100+ languages with strong cross-lingual transfer.

Chen, J. et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity. arXiv.

Maximum Quality (Research / High-Stakes)

Use GTE-Qwen2-7B or E5-Mistral-7B. Decoder-based models with the highest MTEB scores.

3584-4096 dims | MTEB 66-70 | Requires 14-28GB VRAM. 50x slower to encode than BGE-large. Use only when a 2-4 point MTEB improvement justifies the infrastructure cost.

Latency-Critical (Real-Time, Edge)

Use all-MiniLM-L6-v2 locally or nomic-embed-text-v1.5 with ONNX runtime.

API calls add 50-200ms network overhead. For search-as-you-type or real-time features, local inference with an optimized runtime (ONNX, TensorRT) is essential. MiniLM runs at 5ms/sentence on CPU.

Pitfalls

Five Mistakes That Break Embedding Systems

1. Mixing Models Between Index and Query

If you encode your corpus with bge-large-en-v1.5 and your queries with text-embedding-3-large, every similarity score will be meaningless. Different models produce vectors in different spaces with different dimensions. Always use the same model for indexing and querying.

2. Forgetting to Normalize

Cosine similarity requires normalized vectors (length = 1). If you use raw model output with faiss.IndexFlatIP, your dot products will be scaled by vector magnitude, not just direction. Either set normalize_embeddings=True in sentence-transformers or L2-normalize manually: vec / np.linalg.norm(vec).
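A two-dimensional example shows what goes wrong: without normalization, a large but misaligned vector out-scores a nearly parallel small one.

```python
# Raw dot product conflates direction and magnitude; cosine isolates direction.
import numpy as np

a = np.array([1.0, 0.0])
b_long = np.array([3.0, 4.0])     # 53° away from a, but magnitude 5
b_short = np.array([0.9, 0.1])    # nearly parallel to a, magnitude ~0.9

# Raw dot products: magnitude dominates and the misaligned vector "wins"
print(a @ b_long, a @ b_short)    # 3.0  0.9

# Normalized dot products = true cosine similarity: direction wins
unit = lambda v: v / np.linalg.norm(v)
print(unit(a) @ unit(b_long))     # 0.6
print(unit(a) @ unit(b_short))    # ~0.994
```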

3. Exceeding the Model's Max Token Length

Most models silently truncate input beyond their maximum context (512 tokens for BGE, 8192 for Nomic). A 2,000-word document through a 512-token model only embeds the first ~380 words. For long documents, chunk first, embed each chunk, then decide how to aggregate (max pooling, hierarchical search, or store chunks separately).
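A minimal word-window chunker illustrates the fix (a sketch; production systems should count tokens with the embedding model's own tokenizer rather than words):

```python
# Overlapping word-window chunking before embedding. The overlap keeps
# sentences that straddle a boundary visible in at least one chunk.
def chunk_words(text, max_words=300, overlap=50):
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

doc = "word " * 1000                  # a 1,000-word document
chunks = chunk_words(doc)
print(len(chunks), [len(c.split()) for c in chunks])  # 4 [300, 300, 300, 250]
```

Each chunk is then embedded separately; at query time you search over chunks and map hits back to their parent documents.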

4. Evaluating on the Wrong MTEB Task

A model with the highest MTEB average may rank 15th on the specific task you care about. If you're building search, look at the Retrieval subtask scores. If you're building a classifier, look at Classification. The aggregate score is marketing; the task-specific score is engineering.

5. Not Benchmarking on Your Own Data

MTEB uses academic datasets. Your data has its own vocabulary, document lengths, and query patterns. A model that wins on MS MARCO retrieval may underperform on your domain. Always create a small evaluation set (~100-500 query-document pairs) from your actual data and test 2-3 candidate models before committing to one for production.
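Such an eval set needs only a few lines of harness. A sketch computing recall@k and MRR over hypothetical ranked results:

```python
# Tiny retrieval evaluation: recall@k and mean reciprocal rank (MRR)
# over (query, relevant-doc) pairs. Input data here is hypothetical.
import numpy as np

def evaluate(ranked_lists, relevant, k=5):
    """ranked_lists[i]: doc indices returned for query i, best first.
    relevant[i]: index of the known-correct document for query i."""
    hits, rr = [], []
    for ranking, rel in zip(ranked_lists, relevant):
        hits.append(rel in ranking[:k])
        rank = list(ranking).index(rel) + 1 if rel in ranking else None
        rr.append(1.0 / rank if rank else 0.0)
    return float(np.mean(hits)), float(np.mean(rr))

# Example: 3 queries over a 3-document corpus
ranked = [[2, 0, 1], [1, 2, 0], [0, 1, 2]]
relevant = [2, 0, 1]               # the doc each query should retrieve
recall_at_5, mrr = evaluate(ranked, relevant)
print(recall_at_5, mrr)            # 1.0  0.611...
```

Run the same harness with each candidate model's rankings and compare the numbers on your data, not the leaderboard's.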

Key Takeaways

1. Four shifts define the field: static word vectors (2013) to contextual encoders (2018) to bi-encoder sentence embeddings (2019) to the modern arms race (2022+). Each solved the previous generation's key limitation.

2. MTEB is the benchmark — but read the subtasks: the average score is useful for rough ranking. The task-specific score (Retrieval, STS, Classification) is what determines real-world performance for your use case.

3. For most production use cases, BGE-large or an API model is the answer: BGE-large-en-v1.5 gives you MTEB 64+ quality at zero API cost. OpenAI and Cohere give you the same quality with zero infrastructure. GTE-Qwen2 wins benchmarks but demands serious GPU resources.

4. Dimension choice is a storage decision: 1024 dimensions is the production sweet spot. Go smaller for edge/latency. Go bigger only when benchmark gains justify the 3-8x storage increase.

5. Always benchmark on your own data: MTEB scores predict general quality but not domain-specific performance. Build a 100-query eval set from your real data before committing to a model.

Practice Exercises

Copy the code examples above and work through these exercises. Each builds on the previous.

  1. Compare two models. Encode the same 10 sentences with both all-MiniLM-L6-v2 (384d) and bge-large-en-v1.5 (1024d). Do the similarity rankings change? By how much?
  2. Test Matryoshka truncation. Use OpenAI's text-embedding-3-large with dimensions=3072, 1024, and 256. Compare retrieval quality on the same queries. How much quality do you lose?
  3. Build a semantic search engine. Load a dataset (Wikipedia paragraphs, your own documents, or a HuggingFace dataset). Index 10,000+ documents with FAISS and query interactively. Measure query latency with time.perf_counter().
  4. Evaluate on your domain. Create 50 query-document pairs from data relevant to your use case. Test 2-3 models and compute recall@5 and MRR. Does the MTEB winner also win on your data?
