Text Embeddings in Practice
From Word2Vec to the MTEB leaderboard: how text embedding models evolved, how to choose one, and how to use them in production.
Prerequisite: This lesson assumes you understand what embeddings are and how they work mechanically. If terms like "cosine similarity," "contrastive learning," or "transformer encoder" are unfamiliar, start with Lesson 0.1: What is an Embedding?
The Evolution of Text Embedding Models
Lesson 0.1 traced the theoretical foundations from Harris (1954) through BERT (2018). This lesson picks up the practical thread: how did we get from "word vectors are interesting" to "here is a leaderboard with 200+ models ranked across 56 datasets"?
The answer is four architectural shifts, each unlocking a new class of applications. Understanding these shifts tells you why certain models are good at certain tasks and saves you from cargo-culting someone else's model choice.
Word2Vec: One Vector Per Word
Mikolov et al. at Google showed that a stripped-down neural network trained to predict context words (Skip-gram) or center words (CBOW) produced word vectors where arithmetic encoded semantic relationships. Training was cheap: the models could learn high-quality vectors from billions of words in under a day on a single machine.
The limitation was fundamental: every word got exactly one vector. "Bank" in "river bank" and "bank robbery" shared the same embedding. And there was no principled way to combine word vectors into a sentence embedding — averaging worked surprisingly well for short texts, but lost word order and emphasis entirely.
# Word2Vec: one fixed vector per word
from gensim.models import Word2Vec
import numpy as np
# sentences: an iterable of tokenized sentences, e.g. [["the", "cat", "sat"], ...]
model = Word2Vec(sentences, vector_size=300, window=5, sg=1)  # Skip-gram
vec_king = model.wv["king"]  # Shape: (300,) — always the same vector
vec_bank = model.wv["bank"]  # Same vector for river bank and bank account
# "Sentence embedding" = crude average
sentence = "the cat sat on the mat"
sent_vec = np.mean([model.wv[w] for w in sentence.split()], axis=0)
GloVe: Count Meets Predict
Pennington, Socher, and Manning at Stanford bridged count-based methods (LSA) with prediction-based methods (Word2Vec) by training embeddings to reconstruct log co-occurrence ratios from an explicit word-word co-occurrence matrix. GloVe matched Word2Vec quality while making the connection to corpus statistics transparent. Pre-trained GloVe vectors (6B and 840B token variants) became the default initialization for NLP models for years.
— Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP. 35,000+ citations.
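For reference, the objective GloVe minimizes: each word-context vector pair is trained to reconstruct the log co-occurrence count, with a weighting function that caps the influence of very frequent pairs.

```latex
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
```

Here $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$, $w_i$ and $\tilde{w}_j$ are the word and context vectors, $b_i$ and $\tilde{b}_j$ are biases, and $f$ is the weighting function.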
fastText: Subword Embeddings
Bojanowski et al. at Facebook AI solved the out-of-vocabulary problem by learning embeddings for character n-grams instead of whole words. A word's vector was the sum of its subword vectors, meaning even misspellings and neologisms got meaningful representations. This was critical for morphologically rich languages (Turkish, Finnish, Arabic) and foreshadowed the BPE tokenization that all modern transformers use.
— Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135-146.
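The subword decomposition is easy to sketch. Below is a minimal reproduction of fastText's character n-gram extraction: the angle-bracket boundary markers are part of the scheme, and a word's vector is the sum of its n-gram vectors. (fastText additionally hashes n-grams into a fixed bucket table, which this sketch omits.)

```python
def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style character n-grams, with < and > as boundary markers.
    The word's embedding is the sum of the embeddings of these units."""
    w = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    grams.append(w)  # the full word (with markers) is also a unit
    return grams

# The paper's example: "where" with n=3
print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```

Because any string decomposes into n-grams, a misspelling like "wher" still shares most units with "where" and lands near it in embedding space.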
The static-word-vector ceiling
Word2Vec, GloVe, and fastText all share the same fundamental limit: they produce one vector per word (or subword), not per sentence. To get a sentence embedding, you average the word vectors — destroying word order, negation, and emphasis. "The dog bit the man" and "The man bit the dog" get identical embeddings. This ceiling drove the next shift.
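A tiny demonstration of the ceiling, using random toy vectors in place of trained ones: because averaging is order-blind, the two sentences collapse to the same point.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy static embeddings: one fixed 50-d vector per word, as in Word2Vec
vocab = {w: rng.standard_normal(50) for w in ["the", "dog", "bit", "man"]}

def avg_embed(sentence):
    """Averaged word vectors — the only sentence embedding static models offer."""
    return np.mean([vocab[w] for w in sentence.split()], axis=0)

a = avg_embed("the dog bit the man")
b = avg_embed("the man bit the dog")
print(np.allclose(a, b))  # True — same bag of words, same embedding
```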
BERT: Context-Dependent Embeddings
Devlin, Chang, Lee, and Toutanova at Google released BERT, which produced different representations for the same word depending on surrounding context. "Bank" in "river bank" now got a genuinely different vector from "bank robbery."
But BERT had a critical problem for embeddings: comparing two sentences required feeding both through the model together as a cross-encoder. For N documents, finding the most similar pair required N(N-1)/2 forward passes. Searching 10,000 sentences took 65 hours.
# BERT cross-encoder: accurate but O(n²)
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# To compare sentence A with 10,000 candidates:
# Must run model(A + candidate_i) for EACH of the 10,000 candidates
# At ~10ms per pair = 100 seconds for one query
# For all-pairs on 10,000 docs: 10000² / 2 = 50M pairs ≈ 65 hours
— Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. 90,000+ citations.
Sentence-BERT: The Bi-Encoder Breakthrough
Nils Reimers and Iryna Gurevych at TU Darmstadt solved BERT's quadratic problem with an elegant trick: fine-tune BERT with a siamese network so each sentence could be independently encoded into a fixed-length vector. Compare by dot product, not by cross-attention.
The result: searching 10,000 sentences went from 65 hours to 5 seconds. Encode your corpus once, store the vectors in a database, find similar items with a single matrix multiply. This is the architecture behind every modern semantic search system, RAG pipeline, and recommendation engine.
# Sentence-BERT bi-encoder: encode once, search fast
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Encode corpus ONCE — O(n)
corpus_embeddings = model.encode(corpus_of_10000_docs) # (10000, 384)
# Search ANY query in milliseconds — just a dot product
query_vec = model.encode("machine learning frameworks") # (384,)
scores = corpus_embeddings @ query_vec # (10000,) — instant
top_results = scores.argsort()[-5:][::-1]  # Top 5
"SBERT reduces the effort for finding the most similar pair from 65 hours with BERT to about 5 seconds with SBERT, while maintaining the accuracy from BERT."
— Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
Reimers went on to create the sentence-transformers Python library, which remains the most widely used tool for generating text embeddings. It now supports 5,000+ pre-trained models via Hugging Face.
Contrastive Training at Scale
Three innovations supercharged embedding quality. First, hard negative mining: instead of training on random negatives, models learned from documents that were close but not quite right — the "almost correct" distractors that are hardest to distinguish. Second, multi-stage training: pre-train on billions of weakly-supervised text pairs (title-body, query-passage), then fine-tune on high-quality labeled data. Third, instruction-tuning: prepend task descriptions to inputs so the same model produces different embeddings optimized for retrieval vs. classification vs. clustering.
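Why hard negatives matter falls out of the InfoNCE-style contrastive loss most of these models train with: an easy negative contributes almost nothing, while a near-miss produces a real gradient. A minimal numpy sketch (the temperature of 0.05 mirrors common practice; exact loss formulations vary by model):

```python
import numpy as np

def info_nce_loss(q, pos, negs, temperature=0.05):
    """InfoNCE loss for one query against one positive and k negatives.
    All vectors are assumed L2-normalized; the positive sits at index 0."""
    logits = np.concatenate([[q @ pos], negs @ q]) / temperature
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

q   = np.array([1.0, 0.0])
pos = np.array([1.0, 0.0])
easy_neg = np.array([[0.0, 1.0]])                # orthogonal: trivially rejected
hard_neg = np.array([[0.9, np.sqrt(1 - 0.81)]])  # cosine sim 0.9: "almost correct"

print(info_nce_loss(q, pos, easy_neg))  # ~0: nothing left to learn
print(info_nce_loss(q, pos, hard_neg))  # ~0.127: a real training signal
```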
The result was a Cambrian explosion of models. BAAI released BGE. Microsoft released E5. Alibaba released GTE. Jina AI released jina-embeddings. OpenAI shipped text-embedding-3. Cohere shipped embed-v3. Each claimed state-of-the-art on different benchmarks.
Matryoshka Representations and Decoder-Based Models
Matryoshka Representation Learning (Kusupati et al., 2022) trained models so that truncating embeddings to fewer dimensions preserved most of the quality — like Russian nesting dolls, the first 256 dimensions of a 3072-dimensional embedding capture most of the information. OpenAI's text-embedding-3 and Cohere's embed-v3 both support this, letting you trade storage for quality at query time.
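Truncation itself is trivial; the point of MRL is that the model is trained so the sketch below loses little quality. Note the re-normalization step, which is required if you compare truncated vectors by dot product. (This only works well for MRL-trained models; truncating an ordinary embedding discards information arbitrarily.)

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Matryoshka truncation: keep the leading dims, then re-normalize
    so cosine similarity via dot product still works."""
    v = vec[:dims]
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
full = rng.standard_normal(3072)
full /= np.linalg.norm(full)          # stand-in for a 3072-d model output

short = truncate_embedding(full, 256)  # 12x less storage per vector
print(short.shape)                     # (256,)
print(round(float(np.linalg.norm(short)), 6))  # 1.0
```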
Meanwhile, decoder-based models like GTE-Qwen2 and E5-Mistral proved that large language models (LLMs) could be adapted into powerful embedding models. By pooling the last-token representation from a 7B-parameter decoder, these models achieved new MTEB records — at the cost of 10-50x more compute per embedding.
— Kusupati, A. et al. (2022). Matryoshka Representation Learning. NeurIPS.
— Li, Z. et al. (2024). GTE-Qwen2: Towards General Text Embeddings with Multi-stage Contrastive Learning.
The throughline: 2013 to 2026
Static word vectors (2013) → contextual encoders (2018) → bi-encoder sentence embeddings (2019) → contrastive training at scale (2021+) → Matryoshka and decoder-based models (2022+).
MTEB: The Standard Benchmark
The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing embedding models. Created by Muennighoff et al. at Hugging Face in 2022, it evaluates models across 8 task types and 56+ datasets spanning retrieval, classification, clustering, semantic textual similarity (STS), reranking, pair classification, summarization, and bitext mining.
"No single text embedding method dominates across all tasks. Current models specialize on certain types of tasks, revealing significant room for improvement."
— Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL.
MTEB matters because a model that excels at retrieval may be mediocre at clustering, and vice versa. The benchmark forces you to think about which task you care about, not just the average score. When evaluating models below, pay attention to the Retrieval column if you're building search/RAG, STS if you need fine-grained similarity, and Classification if you're labeling text.
Model Comparison: 2026 Landscape
The embedding landscape is crowded. Here are the models that matter, with real MTEB scores and the trade-offs that determine which one you should use.
| Model | Source | Dims | MTEB Avg | Retrieval | Max Tokens | Open? |
|---|---|---|---|---|---|---|
| GTE-Qwen2-7B | Alibaba | 3584 | 70.2 | 60.3 | 8192 | Yes |
| text-embedding-3-large | OpenAI | 3072 | 64.6 | 55.4 | 8191 | No |
| embed-english-v3.0 | Cohere | 1024 | 64.5 | 55.5 | 512 | No |
| BGE-large-en-v1.5 | BAAI | 1024 | 64.2 | 54.3 | 512 | Yes |
| E5-Mistral-7B | Microsoft | 4096 | 66.6 | 56.9 | 32768 | Yes |
| nomic-embed-text-v1.5 | Nomic AI | 768 | 62.3 | 53.0 | 8192 | Yes |
| jina-embeddings-v3 | Jina AI | 1024 | 65.5 | 56.0 | 8192 | Yes |
| all-MiniLM-L6-v2 | Sentence-Transformers | 384 | 56.3 | 41.9 | 256 | Yes |
Key Insight: MTEB Average Is Not the Full Story
A model with MTEB average 70 is, on paper, better than one at 56. But the average hides crucial task-specific variation. BGE-large scores 54.3 on retrieval — adequate for most RAG systems. GTE-Qwen2-7B scores 60.3 — meaningfully better, but requires a GPU with 14GB+ VRAM and 50x more compute per embedding. The right model depends on your constraints, not just the leaderboard.
Production Code: Three Approaches
Every text embedding pipeline has the same three steps: choose a model, encode text into vectors, compare vectors by similarity. The code below is copy-paste ready. Run it.
Approach 1: sentence-transformers (Open Source, Local)
The most popular library. Free, runs locally, supports 5,000+ models. No API key required.
Best for: prototyping, self-hosted production, privacy-sensitive data, cost control.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load model (~1.2GB download first time for bge-large)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Encode documents — normalize for cosine similarity via dot product
documents = [
"Python is a programming language",
"JavaScript runs in the browser",
"SQL is used to query databases",
"Redis is an in-memory data store",
"PostgreSQL is a relational database",
]
embeddings = model.encode(documents, normalize_embeddings=True)
# Shape: (5, 1024) — 5 documents, 1024 dimensions each
# Encode a query
query = "how to store data persistently"
query_vec = model.encode(query, normalize_embeddings=True)
# Similarity = dot product (vectors are normalized)
scores = embeddings @ query_vec
ranked = np.argsort(scores)[::-1]
print("Results:")
for idx in ranked:
    print(f"  {scores[idx]:.3f}: {documents[idx]}")
Results:
  0.701: PostgreSQL is a relational database
  0.654: Redis is an in-memory data store
  0.598: SQL is used to query databases
  0.412: Python is a programming language
  0.389: JavaScript runs in the browser
Approach 2: OpenAI Embeddings (API)
High quality, zero infrastructure. Pay per token. Supports Matryoshka dimensions (256-3072).
Best for: teams without GPU infrastructure, quality-critical production RAG, quick integration. Pricing: ~$0.13 per 1M tokens for text-embedding-3-large.
from openai import OpenAI
import numpy as np
client = OpenAI() # Uses OPENAI_API_KEY env variable
def embed_texts(texts: list[str], model="text-embedding-3-large", dims=1024):
"""Embed texts with OpenAI. Supports Matryoshka truncation via dims."""
response = client.embeddings.create(
model=model,
input=texts,
dimensions=dims, # Matryoshka: 256, 512, 1024, or 3072
)
return np.array([d.embedding for d in response.data])
# Embed documents and query
doc_embeddings = embed_texts([
"Python is a programming language",
"Redis is an in-memory data store",
"PostgreSQL is a relational database",
])
query_embedding = embed_texts(["how to store data persistently"])
# Compare
scores = doc_embeddings @ query_embedding.T
print(scores.squeeze())  # [0.41, 0.65, 0.71]
Matryoshka dimensions
OpenAI's text-embedding-3 models support flexible dimensions. Requesting dimensions=256 returns the first 256 components of the full 3072-dim vector, losing only ~2-5% quality while cutting storage by 12x. This is Matryoshka Representation Learning in action.
Approach 3: Cohere Embeddings (API)
Strong multilingual support. Built-in input types for retrieval vs. classification. Compression-aware training.
Best for: multilingual applications, search-focused use cases, teams already on Cohere's platform.
import cohere
import numpy as np
co = cohere.ClientV2("your-api-key")
# Cohere distinguishes document vs query embeddings
# This matters: query embeddings are optimized for retrieval
doc_response = co.embed(
texts=["Python is a programming language",
"Redis is an in-memory data store",
"PostgreSQL is a relational database"],
model="embed-english-v3.0",
input_type="search_document", # For corpus documents
embedding_types=["float"],
)
doc_embeddings = np.array(doc_response.embeddings.float_)
query_response = co.embed(
texts=["how to store data persistently"],
model="embed-english-v3.0",
input_type="search_query", # For queries — different optimization
embedding_types=["float"],
)
query_embedding = np.array(query_response.embeddings.float_)
scores = doc_embeddings @ query_embedding.T
print(scores.squeeze())
Why separate input types? Cohere's embed-v3 uses asymmetric training: queries and documents are embedded differently because a short query like "data storage" and a long document about PostgreSQL occupy different semantic spaces. This typically improves retrieval by 2-5% nDCG over symmetric embedding.
Scaling Up: FAISS for Production Search
The dot-product approach above works for thousands of documents. For millions or billions, you need an approximate nearest neighbor (ANN) index. FAISS (Facebook AI Similarity Search) is the most widely used library.
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
# Simulate a larger corpus
documents = [
"Python is a high-level programming language",
"JavaScript enables interactive web pages",
"SQL queries relational databases",
"Redis provides in-memory caching",
"PostgreSQL is an advanced relational database",
"Docker containerizes applications for deployment",
"Kubernetes orchestrates container workloads",
"TensorFlow is a machine learning framework",
"React is a JavaScript UI library",
"FastAPI is a modern Python web framework",
]
# Encode and normalize
embeddings = model.encode(documents, normalize_embeddings=True).astype("float32")
dim = embeddings.shape[1] # 1024
# Build FAISS index — IndexFlatIP = exact inner product (cosine for normalized vecs)
index = faiss.IndexFlatIP(dim)
index.add(embeddings)
print(f"Index contains {index.ntotal} vectors of dim {dim}")
# Search
query = "tools for deploying web applications"
query_vec = model.encode([query], normalize_embeddings=True).astype("float32")
# D = distances (similarities), I = indices
D, I = index.search(query_vec, k=3)
print("\nTop 3 results:")
for score, idx in zip(D[0], I[0]):
    print(f"  {score:.3f}: {documents[idx]}")
Index contains 10 vectors of dim 1024

Top 3 results:
  0.712: Docker containerizes applications for deployment
  0.653: Kubernetes orchestrates container workloads
  0.584: FastAPI is a modern Python web framework
Scaling beyond exact search
IndexFlatIP is exact but O(n) per query. For millions of vectors, use approximate indexes such as IndexIVFFlat (inverted lists), IndexHNSWFlat (graph-based search), or IndexIVFPQ (inverted lists with product quantization), which trade a small recall loss for orders-of-magnitude lower query latency.
— Douze, M. et al. (2024). The FAISS Library. arXiv:2401.08281.
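The core inverted-file (IVF) idea can be sketched in plain numpy: partition the vectors into lists by nearest centroid, then scan only the few lists closest to the query. This toy version samples centroids from the data instead of running k-means as FAISS does, so treat it as an illustration of the mechanism, not a FAISS replacement.

```python
import numpy as np

def build_ivf(vectors, nlist=8, seed=0):
    """Toy IVF index: assign each vector to its nearest centroid's list.
    FAISS trains centroids with k-means; here we just sample data points."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), nlist, replace=False)]
    assign = np.argmax(vectors @ centroids.T, axis=1)  # nearest centroid (unit vecs)
    lists = {c: np.where(assign == c)[0] for c in range(nlist)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k=3, nprobe=2):
    """Probe only the nprobe closest lists instead of scanning everything."""
    probe = np.argsort(centroids @ query)[::-1][:nprobe]
    cand = np.concatenate([lists[int(c)] for c in probe])
    scores = vectors[cand] @ query
    order = np.argsort(scores)[::-1][:k]
    return cand[order], scores[order]

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 64)).astype("float32")
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalize for cosine-as-dot

centroids, lists = build_ivf(X, nlist=8)
ids, scores = ivf_search(X[0], X, centroids, lists, k=3, nprobe=3)
print(len(ids))  # 3 results, found by scanning only ~3/8 of the corpus
```

The nprobe knob is the recall/latency dial: higher nprobe scans more lists and approaches exact search.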
Embedding Dimensions: The Storage-Quality Trade-off
Embedding dimension is the single most impactful architecture decision for production systems. It determines storage cost, search latency, and (to a point) embedding quality.
- Small (384 dims): 1.5 KB/vec. MiniLM. Prototypes and edge devices.
- Medium (768 dims): 3 KB/vec. Nomic, BGE-base. Good balance.
- Large (1024 dims): 4 KB/vec. BGE-large, Cohere. Production sweet spot.
- XL (3072 dims): 12 KB/vec. OpenAI large. Max quality, max cost.
What This Means at Scale
| Corpus Size | 384d (1.5KB) | 1024d (4KB) | 3072d (12KB) |
|---|---|---|---|
| 100K docs | 150 MB | 400 MB | 1.2 GB |
| 1M docs | 1.5 GB | 4 GB | 12 GB |
| 10M docs | 15 GB | 40 GB | 120 GB |
| 100M docs | 150 GB | 400 GB | 1.2 TB |
float32 storage. Product quantization (PQ) can reduce this 4-8x with ~2-5% quality loss.
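The table values follow from simple arithmetic: documents × dimensions × 4 bytes per float32 component.

```python
def embedding_storage_gb(n_docs, dims, bytes_per_dim=4):
    """float32 storage for a corpus of embeddings, in GB (10^9 bytes)."""
    return n_docs * dims * bytes_per_dim / 1e9

print(embedding_storage_gb(1_000_000, 1024))    # 4.096 — the "4 GB" cell
print(embedding_storage_gb(1_000_000, 384))     # 1.536 — the "1.5 GB" cell
print(embedding_storage_gb(100_000_000, 3072))  # 1228.8 GB ≈ 1.2 TB
```

Divide by the PQ compression factor (4-8x) to estimate quantized footprints.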
Practical rule of thumb
If your embeddings need to fit in RAM for fast search, work backward from your memory budget. For most RAG systems with under 1M documents, 1024 dimensions is the sweet spot: it fits comfortably in 4GB, gives you near-SOTA quality, and models like BGE-large or Cohere embed-v3 produce excellent results at this dimension. Go to 384 only if you're on edge devices or need to search 10M+ docs without quantization.
Interactive Model Comparison
Explore different embedding models side-by-side. Compare code examples, dimensions, and use cases. All code snippets are copy-paste ready.
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = ['The cat sat on the mat', 'A dog played in the park', 'Machine learning is fascinating']
embeddings = model.encode(documents, normalize_embeddings=True)
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)
similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
Model Comparison
| Model | Dimensions | MTEB Score | Cost | Best For |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.23 | Free (local) | Best open-source model |
| all-MiniLM-L6-v2 | 384 | 56.3 | Free (local) | Fast, lightweight |
| text-embedding-3-large | 3072 | 64.6 | Pay per use | Highest quality OpenAI embedding |
| text-embedding-3-small | 1536 | 62.3 | Pay per use | Cost-effective API option |
| embed-english-v3.0 | 1024 | 64.5 | Pay per use | Strong performance |
Decision Guide: Choosing Your Model
Model choice is a function of four constraints: quality requirements, latency budget, infrastructure, and cost tolerance. Here is the decision tree.
Prototyping / Learning / Hackathons
Use all-MiniLM-L6-v2. Free, fast, 22MB model, runs on a laptop CPU.
384 dims | 5ms/sentence on CPU | MTEB 56.3 | Good enough to validate any idea before investing in infrastructure.
Production RAG / Semantic Search (Self-Hosted)
Use BAAI/bge-large-en-v1.5 or jina-embeddings-v3. SOTA quality without API costs or data leaving your infrastructure.
1024 dims | MTEB 64-65 | Requires a GPU for fast batch encoding (~500 docs/sec on A100). Once encoded, search is pure CPU.
Production RAG (Managed API)
Use text-embedding-3-large (OpenAI) or embed-english-v3.0 (Cohere). Zero infrastructure. Scale instantly.
1024-3072 dims | MTEB 64-65 | $0.13/1M tokens (OpenAI). Worth it when engineering time costs more than API fees. Cohere is best when you need asymmetric query/document embeddings.
Multilingual
Use BGE-M3 (open source, 100+ languages) or embed-multilingual-v3.0 (Cohere).
BGE-M3 supports hybrid dense+sparse+ColBERT retrieval in a single model. Cohere's multilingual model covers 100+ languages with strong cross-lingual transfer.
Maximum Quality (Research / High-Stakes)
Use GTE-Qwen2-7B or E5-Mistral-7B. Decoder-based models with the highest MTEB scores.
3584-4096 dims | MTEB 66-70 | Requires 14-28GB VRAM. 50x slower to encode than BGE-large. Use only when a 2-4 point MTEB improvement justifies the infrastructure cost.
Latency-Critical (Real-Time, Edge)
Use all-MiniLM-L6-v2 locally or nomic-embed-text-v1.5 with ONNX runtime.
API calls add 50-200ms network overhead. For search-as-you-type or real-time features, local inference with an optimized runtime (ONNX, TensorRT) is essential. MiniLM runs at 5ms/sentence on CPU.
Five Mistakes That Break Embedding Systems
1. Mixing Models Between Index and Query
If you encode your corpus with bge-large-en-v1.5 and your queries with text-embedding-3-large, every similarity score will be meaningless. Different models produce vectors in different spaces with different dimensions. Always use the same model for indexing and querying.
2. Forgetting to Normalize
Cosine similarity requires normalized vectors (length = 1). If you use raw model output with faiss.IndexFlatIP, your dot products will be scaled by vector magnitude, not just direction. Either set normalize_embeddings=True in sentence-transformers or L2-normalize manually: vec / np.linalg.norm(vec).
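A two-document toy example of how magnitude corrupts rankings: the longer vector wins the raw dot product even though the shorter one points almost exactly at the query.

```python
import numpy as np

docs = np.array([[1.0, 5.0],    # large magnitude, poorly aligned with query
                 [0.0, 1.0]])   # unit magnitude, perfectly aligned
query = np.array([0.0, 1.0])

raw = docs @ query              # [5.0, 1.0] — magnitude leaks into the score
normed = docs / np.linalg.norm(docs, axis=1, keepdims=True)
cosine = normed @ query         # [~0.98, 1.0] — direction only

print(np.argmax(raw), np.argmax(cosine))  # 0 1 — the ranking flips
```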
3. Exceeding the Model's Max Token Length
Most models silently truncate input beyond their maximum context (512 tokens for BGE, 8192 for Nomic). A 2,000-word document through a 512-token model only embeds the first ~380 words. For long documents, chunk first, embed each chunk, then decide how to aggregate (max pooling, hierarchical search, or store chunks separately).
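A minimal word-window chunker with overlap, as a starting point. The function name, the 350-word window, and the 50-word overlap are illustrative choices, not fixed rules; ~350 English words usually stays under a 512-token budget, but use the model's own tokenizer for exact accounting.

```python
def chunk_words(text, max_words=350, overlap=50):
    """Split a long document into overlapping word-window chunks.
    Overlap keeps sentences that straddle a boundary visible in both chunks."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))  # a synthetic 1,000-word document
chunks = chunk_words(doc)
print(len(chunks))  # 4
```

Embed each chunk separately, then aggregate at query time (e.g. score the document by its best-matching chunk).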
4. Evaluating on the Wrong MTEB Task
A model with the highest MTEB average may rank 15th on the specific task you care about. If you're building search, look at the Retrieval subtask scores. If you're building a classifier, look at Classification. The aggregate score is marketing; the task-specific score is engineering.
5. Not Benchmarking on Your Own Data
MTEB uses academic datasets. Your data has its own vocabulary, document lengths, and query patterns. A model that wins on MS MARCO retrieval may underperform on your domain. Always create a small evaluation set (~100-500 query-document pairs) from your actual data and test 2-3 candidate models before committing to one for production.
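recall@k and MRR are a few lines each; here is a sketch with a single toy query, assuming each query has exactly one relevant document.

```python
def recall_at_k(ranked_ids, relevant_id, k=5):
    """1.0 if the relevant doc appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def mrr(ranked_ids, relevant_id):
    """Reciprocal of the rank at which the relevant doc first appears."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

# One toy query: the model ranked the relevant doc (id 42) third
ranked = [7, 2, 42, 9, 1]
print(recall_at_k(ranked, 42, k=5))  # 1.0
print(mrr(ranked, 42))               # 0.3333...
```

Average both metrics over your full query set to compare candidate models.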
Key Takeaways
1. Four shifts define the field: static word vectors (2013) → contextual encoders (2018) → bi-encoder sentence embeddings (2019) → the modern arms race (2022+). Each solved the previous generation's key limitation.
2. MTEB is the benchmark, but read the subtasks: the average score is useful for rough ranking. The task-specific score (Retrieval, STS, Classification) is what determines real-world performance for your use case.
3. For most production use cases, BGE-large or an API model is the answer: BGE-large-en-v1.5 gives you MTEB 64+ quality at zero API cost. OpenAI and Cohere give you the same quality with zero infrastructure. GTE-Qwen2 wins benchmarks but demands serious GPU resources.
4. Dimension choice is a storage decision: 1024 dimensions is the production sweet spot. Go smaller for edge/latency. Go bigger only when benchmark gains justify the 3-8x storage increase.
5. Always benchmark on your own data: MTEB scores predict general quality but not domain-specific performance. Build a 100-query eval set from your real data before committing to a model.
Practice Exercises
Copy the code examples above and work through these exercises. Each builds on the previous.
1. Compare two models. Encode the same 10 sentences with both all-MiniLM-L6-v2 (384d) and bge-large-en-v1.5 (1024d). Do the similarity rankings change? By how much?
2. Test Matryoshka truncation. Use OpenAI's text-embedding-3-large with dimensions=3072, 1024, and 256. Compare retrieval quality on the same queries. How much quality do you lose?
3. Build a semantic search engine. Load a dataset (Wikipedia paragraphs, your own documents, or a HuggingFace dataset). Index 10,000+ documents with FAISS and query interactively. Measure query latency with time.perf_counter().
4. Evaluate on your domain. Create 50 query-document pairs from data relevant to your use case. Test 2-3 models and compute recall@5 and MRR. Does the MTEB winner also win on your data?
References
- Mikolov, T. et al. (2013). Efficient Estimation of Word Representations in Vector Space. ICLR Workshop.
- Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP.
- Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135-146.
- Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.
- Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
- Muennighoff, N. et al. (2023). MTEB: Massive Text Embedding Benchmark. EACL.
- Kusupati, A. et al. (2022). Matryoshka Representation Learning. NeurIPS.
- Chen, J. et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings.
- Li, Z. et al. (2024). GTE-Qwen2: Towards General Text Embeddings with Multi-stage Contrastive Learning.
- Douze, M. et al. (2024). The FAISS Library.