Text Embeddings Deep Dive
Not all embedding models are equal. Learn to evaluate them on MTEB and improve on the state of the art.
MTEB — Retrieval Subset
Measures retrieval quality using Normalized Discounted Cumulative Gain at rank 10.
Quick Start: Real Working Code
Let's start with code you can copy, paste, and run immediately:
Text Embedding with Sentence Transformers
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = ['The cat sat on the mat', 'A dog played in the park', 'Machine learning is fascinating']
embeddings = model.encode(documents, normalize_embeddings=True)
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)
similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
Build a Semantic Search Index with FAISS
from sentence_transformers import SentenceTransformer
import faiss
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = [
'Python is a programming language',
'JavaScript runs in the browser',
'SQL is used to query databases',
'Redis is an in-memory data store',
'PostgreSQL is a relational database'
]
embeddings = model.encode(documents, normalize_embeddings=True).astype('float32')
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
query = 'how to store data'
query_vec = model.encode([query], normalize_embeddings=True).astype('float32')
D, I = index.search(query_vec, k=3)
print('Top 3 results:')
for score, idx in zip(D[0], I[0]):
    print(f'{score:.3f}: {documents[idx]}')
SOTA Performance: MTEB Benchmark
The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing embedding models:
Current SOTA Embedding Models
The MTEB Evaluation Pipeline
MTEB organizes evaluation into task categories. For this lesson, we focus on Retrieval — measured by NDCG@10 (Normalized Discounted Cumulative Gain at rank 10).
What is NDCG@10?
NDCG@10 measures how well your model ranks relevant documents in the top 10 results. A perfect ranker scores 1.0 — it puts all relevant documents at the top. The "discounted" part means relevant docs at rank 1 count more than those at rank 10.
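To make the definition concrete, here is a minimal from-scratch NDCG@10 computation over binary relevance labels; the helper names `dcg_at_k` and `ndcg_at_k` are illustrative, not from any library:

```python
import numpy as np

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain: relevance at rank r is discounted by log2(r + 1)."""
    rels = np.asarray(relevances, dtype=float)[:k]
    ranks = np.arange(1, len(rels) + 1)
    return float(np.sum(rels / np.log2(ranks + 1)))

def ndcg_at_k(relevances, k=10):
    """NDCG@k = DCG of the actual ranking / DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# All relevant docs ranked first -> perfect score
print(ndcg_at_k([1, 1, 1, 0, 0]))  # 1.0
# The only relevant doc buried at rank 5 -> heavily discounted
print(ndcg_at_k([0, 0, 0, 0, 1]))  # ~0.387
```

Dividing by the ideal DCG is what makes the metric "normalized": a perfect ranking always scores exactly 1.0 regardless of how many relevant documents exist.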
# NDCG@10 intuition:
# Perfect ranking: [rel, rel, rel, irr, irr, ...] → NDCG@10 ≈ 1.0
# Decent ranking: [rel, irr, rel, irr, rel, ...] → NDCG@10 ≈ 0.7
# Poor ranking:    [irr, irr, irr, irr, rel, ...] → NDCG@10 ≈ 0.3
Running MTEB Programmatically
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
# Run on specific retrieval tasks
tasks = mteb.get_tasks(
tasks=["NFCorpus", "SciFact", "ArguAna", "TRECCOVID"],
languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/minilm")
# Extract NDCG@10 scores
for task_result in results:
    for split in task_result.scores:
        for score in task_result.scores[split]:
            if "ndcg_at_10" in score:
                print(f"{task_result.task_name}: {score['ndcg_at_10']:.4f}")
Embedding Model Categories
Sentence Transformers (Open Source)
Free to run locally, with no API costs. Best open-source option: bge-large-en-v1.5 (MTEB 64.23).
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embedding = model.encode('Hello world', normalize_embeddings=True)
OpenAI Embeddings (API)
High quality (MTEB ~64.6), pay-per-use. Easy integration but adds latency and cost.
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-large", input="Hello world")
embedding = response.data[0].embedding
Cohere Embeddings (API)
Strong multilingual support. Purpose-built for search and retrieval.
import cohere
co = cohere.Client('api-key')
response = co.embed(texts=["Hello world"], model="embed-english-v3.0", input_type="search_document")
Understanding Embedding Dimensions
Small (e.g., 384 dimensions)
Fast, low memory. Good for prototypes (all-MiniLM-L6-v2 uses 384).
Medium (e.g., 768–1024 dimensions)
Best balance of quality and cost. BGE-large uses 1024.
Large (e.g., 1536–3072 dimensions)
Highest quality. OpenAI's text-embedding-3-large uses 3072.
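Dimension choice translates directly into index memory. A quick back-of-the-envelope calculation, assuming uncompressed float32 storage (4 bytes per value):

```python
def index_size_mb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw storage for a flat (uncompressed) vector index, in megabytes."""
    return num_vectors * dims * bytes_per_value / 1e6

# One million documents at each dimension tier
for dims in (384, 1024, 3072):
    print(f"{dims:>5} dims: {index_size_mb(1_000_000, dims):8.1f} MB")
# 384 dims ->  1536.0 MB, 1024 dims -> 4096.0 MB, 3072 dims -> 12288.0 MB
```

At a million documents, the jump from 384 to 3072 dimensions is the difference between an index that fits comfortably in RAM and one that may not.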
Code Reference and Model Comparison
Explore different embedding models and see real code examples:
Select Embedding Model
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = ['The cat sat on the mat', 'A dog played in the park', 'Machine learning is fascinating']
embeddings = model.encode(documents, normalize_embeddings=True)
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)
similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
Model Comparison
| Model | Dimensions | MTEB Score | Cost | Best For |
|---|---|---|---|---|
| BAAI/bge-large-en-v1.5 | 1024 | 64.23 | Free (local) | Best open-source model |
| all-MiniLM-L6-v2 | 384 | 56.3 | Free (local) | Fast, lightweight |
| text-embedding-3-large | 3072 | 64.6 | Pay per use | Highest quality OpenAI embedding |
| text-embedding-3-small | 1536 | 62.3 | Pay per use | Cost-effective API option |
| embed-english-v3.0 | 1024 | 64.5 | Pay per use | Strong performance |
Reproduce
Replicate all-MiniLM-L6-v2 Retrieval NDCG@10
Run all-MiniLM-L6-v2 through the MTEB retrieval evaluation and reproduce its score of 52.3.
Steps
1. Install: pip install mteb sentence-transformers
2. Run the MTEB retrieval tasks: NFCorpus, SciFact, ArguAna, TREC-COVID
3. Average the NDCG@10 scores across all retrieval tasks
4. Your result should be within ±0.5 of 52.3
Reproduce Script
import mteb
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
# Retrieval tasks
tasks = mteb.get_tasks(
tasks=["NFCorpus", "SciFact", "ArguAna", "TRECCOVID"],
languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/minilm-retrieval")
# Extract and average NDCG@10
ndcg_scores = []
for task_result in results:
    for split in task_result.scores:
        for score in task_result.scores[split]:
            if "ndcg_at_10" in score:
                ndcg_scores.append(score["ndcg_at_10"])
                print(f"{task_result.task_name}: {score['ndcg_at_10']:.4f}")
avg = np.mean(ndcg_scores) * 100
print(f"\nAverage Retrieval NDCG@10: {avg:.2f}")
# Expected: ~52.3
Hint: Use output_folder to save results for inspection. Each task generates a JSON file with per-query scores.
Improve
Beat 52.3 Retrieval NDCG@10
Improve retrieval quality beyond the all-MiniLM-L6-v2 baseline. Focus on the retrieval subset — this is where real-world search quality matters.
Strategies to explore
Domain fine-tuning with MS MARCO
Fine-tune on MS MARCO passage pairs using contrastive learning. This is the standard approach for boosting retrieval scores.
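The core of this contrastive setup is the multiple-negatives ranking loss with in-batch negatives: each query should score its paired passage higher than every other passage in the batch. A numpy sketch of the loss itself, independent of any training framework (the `scale` value of 20 is an illustrative assumption):

```python
import numpy as np

def multiple_negatives_ranking_loss(query_emb, passage_emb, scale=20.0):
    """Cross-entropy over in-batch cosine similarities: query i's positive is
    passage i; all other passages in the batch act as negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = passage_emb / np.linalg.norm(passage_emb, axis=1, keepdims=True)
    logits = scale * q @ p.T                       # (batch, batch) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # target for row i is column i

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 32))
loss_random = multiple_negatives_ranking_loss(q, rng.normal(size=(8, 32)))
loss_aligned = multiple_negatives_ranking_loss(q, q)  # queries paired with themselves
print(loss_random, loss_aligned)  # aligned pairs give a much lower loss
```

Minimizing this loss pulls each query toward its positive passage and pushes it away from the in-batch negatives, which is exactly the gradient signal MS MARCO fine-tuning exploits.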
Instruction-tuned models
Try e5-large-v2 or gte-base — these models accept task-specific prefixes like "query:" and "passage:" for better retrieval.
Matryoshka representation learning
Train embeddings that work at multiple dimension sizes — truncate for speed, use full dimensions for quality.
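The inference side of this idea can be illustrated without any training: keep the first k dimensions of an embedding, then re-normalize so dot products remain valid cosine similarities. A numpy sketch (the vectors here are random stand-ins for real Matryoshka-trained embeddings):

```python
import numpy as np

def truncate_and_renormalize(embeddings, k):
    """Keep the first k dimensions, then re-normalize each row to unit length
    so dot products are still cosine similarities."""
    truncated = embeddings[:, :k]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

rng = np.random.default_rng(42)
full = rng.normal(size=(5, 1024))
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_and_renormalize(full, 256)  # 4x less memory, faster search
print(small.shape)                           # (5, 256)
```

With a model actually trained with Matryoshka representation learning, the truncated vectors retain most of their retrieval quality; truncating an ordinary model's embeddings degrades quality much faster.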
SetFit few-shot adaptation
Use SetFit to adapt a model with just 8-16 examples per task. Surprisingly effective for domain adaptation.
Note on the Pareto frontier: Larger models score higher but have higher latency. A submission that improves NDCG@10 while keeping model size under 100M parameters is especially valuable — it pushes the efficiency frontier.
Submit Your Result
Submit your MTEB retrieval evaluation result. Include your code repository so peers can verify your methodology.
Contribute to MTEB
Help us maintain the most accurate benchmark data. Submit new results, report issues, or suggest improvements.
Submit New Results
Share benchmark scores from recent papers or your own experiments
Report Data Issues
Found incorrect scores or broken links? Let us know
Build the Data Flywheel
Your contributions help make CodeSOTA better for everyone
Submit Benchmark Result
Submissions are reviewed manually to ensure data quality. For immediate contributions, consider submitting a pull request on GitHub.
Help improve this page
Missing a benchmark? Found an error? Have a suggestion? Your feedback feeds the CodeSOTA flywheel.