Level 1: Single Blocks (~35 min)

Text Embeddings Deep Dive

Not all embedding models are equal. Learn to evaluate them on MTEB and improve on the state of the art.

Target Benchmark

MTEB — Retrieval Subset

Measures retrieval quality using Normalized Discounted Cumulative Gain at rank 10.

Baseline to beat: 52.3 Retrieval NDCG@10 (all-MiniLM-L6-v2)

Quick Start: Real Working Code

Let's start with code you can copy, paste, and run immediately:

# Install dependencies
pip install sentence-transformers faiss-cpu numpy

Text Embedding with Sentence Transformers

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = ['The cat sat on the mat', 'A dog played in the park', 'Machine learning is fascinating']
embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)
similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
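Because encode(..., normalize_embeddings=True) returns unit-length vectors, the dot product above is exactly cosine similarity. A small numpy sketch with made-up vectors (not real embeddings) shows the equivalence:

```python
import numpy as np

# Two made-up vectors standing in for embeddings
a = np.array([3.0, 4.0])
b = np.array([1.0, 2.0])

# Normalize to unit length, as normalize_embeddings=True does
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

# Cosine similarity computed from the raw vectors...
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# ...equals a plain dot product of the normalized vectors
assert np.isclose(np.dot(a_unit, b_unit), cosine)
print(f'{cosine:.3f}')  # → 0.984
```

This is why normalizing at encode time pays off: ranking reduces to a matrix-vector product, which is also what lets FAISS's inner-product index (below) act as a cosine-similarity index.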

Build a Semantic Search Index with FAISS

from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = [
    'Python is a programming language',
    'JavaScript runs in the browser',
    'SQL is used to query databases',
    'Redis is an in-memory data store',
    'PostgreSQL is a relational database'
]
embeddings = model.encode(documents, normalize_embeddings=True).astype('float32')

index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

query = 'how to store data'
query_vec = model.encode([query], normalize_embeddings=True).astype('float32')
D, I = index.search(query_vec, k=3)

print('Top 3 results:')
for score, idx in zip(D[0], I[0]):
    print(f'{score:.3f}: {documents[idx]}')

SOTA Performance: MTEB Benchmark

The Massive Text Embedding Benchmark (MTEB) is the industry standard for comparing embedding models:

Current SOTA Embedding Models

bge-large-en-v1.5 (BAAI, open source): 64.23 MTEB avg
text-embedding-3-large (OpenAI, API): 64.6 MTEB avg
all-MiniLM-L6-v2 (Sentence Transformers): 56.3 MTEB avg

The MTEB Evaluation Pipeline

MTEB organizes evaluation into task categories. For this lesson, we focus on Retrieval — measured by NDCG@10 (Normalized Discounted Cumulative Gain at rank 10).

What is NDCG@10?

NDCG@10 measures how well your model ranks relevant documents in the top 10 results. A perfect ranker scores 1.0 — it puts all relevant documents at the top. The "discounted" part means relevant docs at rank 1 count more than those at rank 10.

# NDCG@10 intuition:
# Perfect ranking:    [rel, rel, rel, irr, irr, ...] → NDCG@10 ≈ 1.0
# Decent ranking:     [rel, irr, rel, irr, rel, ...] → NDCG@10 ≈ 0.7
# Poor ranking:       [irr, irr, irr, irr, rel, ...] → NDCG@10 ≈ 0.3
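The intuition above can be made concrete. Here is a minimal NDCG@10 implementation for binary relevance labels, a simplified sketch of what MTEB computes (the real metric also accounts for relevant documents outside the returned list):

```python
import numpy as np

def ndcg_at_10(relevances):
    """NDCG@10 for a list of binary relevance labels in ranked order."""
    rels = np.asarray(relevances[:10], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rels) + 2))  # ranks 1..10
    dcg = np.sum(rels * discounts)
    ideal = np.sort(rels)[::-1]            # best possible ordering
    idcg = np.sum(ideal * discounts)
    return dcg / idcg if idcg > 0 else 0.0

print(ndcg_at_10([1, 1, 1, 0, 0]))             # perfect ranking → 1.0
print(round(ndcg_at_10([0, 0, 1, 1, 1]), 3))   # relevant docs pushed down → 0.618
```

The log2 discount is what makes rank 1 worth roughly twice as much as rank 4: early positions dominate the score.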

Running MTEB Programmatically

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Run on specific retrieval tasks
tasks = mteb.get_tasks(
    tasks=["NFCorpus", "SciFact", "ArguAna", "TRECCOVID"],
    languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/minilm")

# Extract NDCG@10 scores
for task_result in results:
    for split in task_result.scores:
        for score in task_result.scores[split]:
            if "ndcg_at_10" in score:
                print(f"{task_result.task_name}: {score['ndcg_at_10']:.4f}")

Embedding Model Categories

Sentence Transformers (Open Source)

Free to run locally, with no API costs. Best open-source option: bge-large-en-v1.5 (MTEB avg 64.23).

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('BAAI/bge-large-en-v1.5')
embedding = model.encode('Hello world', normalize_embeddings=True)

OpenAI Embeddings (API)

High quality (MTEB ~64.6), pay-per-use. Easy integration but adds latency and cost.

from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(model="text-embedding-3-large", input="Hello world")
embedding = response.data[0].embedding

Cohere Embeddings (API)

Strong multilingual support. Purpose-built for search and retrieval.

import cohere
co = cohere.Client('api-key')
response = co.embed(texts=["Hello world"], model="embed-english-v3.0", input_type="search_document")

Understanding Embedding Dimensions

384 (Small): fast and low-memory; good for prototypes.

1024 (Medium): the best balance; used by bge-large-en-v1.5.

3072 (Large): highest quality; used by OpenAI's text-embedding-3-large.
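Dimension choice has direct storage implications. A quick back-of-the-envelope calculation for one million float32 vectors at each size:

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw storage for float32 vectors (excludes any index overhead)."""
    return num_vectors * dims * bytes_per_value / 1024**3

for dims in (384, 1024, 3072):
    print(f"{dims:>5} dims: {index_size_gb(1_000_000, dims):.2f} GB")
# → 384 dims: 1.43 GB, 1024 dims: 3.81 GB, 3072 dims: 11.44 GB
```

An 8x jump in dimensions is an 8x jump in memory and search cost, so the quality gain has to justify it.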

Code Reference and Model Comparison

The table below compares the embedding models covered in this lesson:


Model Comparison

Model                    Dimensions  MTEB Score  Cost          Best For
BAAI/bge-large-en-v1.5   1024        64.23       Free (local)  Best open-source model
all-MiniLM-L6-v2         384         56.3        Free (local)  Fast, lightweight
text-embedding-3-large   3072        64.6        Pay per use   Highest-quality OpenAI embedding
text-embedding-3-small   1536        62.3        Pay per use   Cost-effective API option
embed-english-v3.0       1024        64.5        Pay per use   Strong performance (Cohere)
Stage 1

Reproduce

Replicate all-MiniLM-L6-v2 Retrieval NDCG@10

Run all-MiniLM-L6-v2 through the MTEB retrieval evaluation and reproduce its average score of 52.3 across the four tasks below.

Steps

  1. Install: pip install mteb sentence-transformers
  2. Run the MTEB retrieval tasks: NFCorpus, SciFact, ArguAna, TRECCOVID
  3. Average the NDCG@10 scores across all four retrieval tasks
  4. Your result should be within ±0.5 of 52.3

Reproduce Script

import mteb
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# Retrieval tasks
tasks = mteb.get_tasks(
    tasks=["NFCorpus", "SciFact", "ArguAna", "TRECCOVID"],
    languages=["eng"]
)
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/minilm-retrieval")

# Extract and average NDCG@10
ndcg_scores = []
for task_result in results:
    for split in task_result.scores:
        for score in task_result.scores[split]:
            if "ndcg_at_10" in score:
                ndcg_scores.append(score["ndcg_at_10"])
                print(f"{task_result.task_name}: {score['ndcg_at_10']:.4f}")

avg = np.mean(ndcg_scores) * 100
print(f"\nAverage Retrieval NDCG@10: {avg:.2f}")
# Expected: ~52.3

Hint: Use output_folder to save results for inspection. Each task generates a JSON file with per-query scores.

Stage 2

Improve

Beat 52.3 Retrieval NDCG@10

Improve retrieval quality beyond the all-MiniLM-L6-v2 baseline. Focus on the retrieval subset — this is where real-world search quality matters.

Strategies to explore

Domain fine-tuning with MS MARCO

Fine-tune on MS MARCO passage pairs using contrastive learning. This is the standard approach for boosting retrieval scores.
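To see what the contrastive objective optimizes, here is a plain-numpy sketch of an InfoNCE-style loss with in-batch negatives (the idea behind sentence-transformers' MultipleNegativesRankingLoss), using made-up vectors rather than real MS MARCO embeddings:

```python
import numpy as np

def info_nce_loss(query_embs, passage_embs, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the passage
    at the same index; every other passage serves as a negative."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    logits = q @ p.T / temperature               # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
passages = queries + 0.1 * rng.normal(size=(4, 8))  # aligned pairs → low loss
print(f"aligned:  {info_nce_loss(queries, passages):.3f}")
print(f"shuffled: {info_nce_loss(queries, passages[::-1]):.3f}")
```

Minimizing this loss pulls each query toward its paired passage and pushes it away from the rest of the batch, which is exactly the geometry retrieval needs.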

Instruction-tuned models

Try e5-large-v2 or gte-base — these models accept task-specific prefixes like "query:" and "passage:" for better retrieval.
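The prefixing convention is simple string concatenation before encoding. A sketch of an e5-style helper (the exact prefix strings are model-specific, so check each model's card before relying on them):

```python
# e5-style prefixing: queries and passages are marked with different task
# prefixes. This helper only builds the strings; in practice you would
# pass the results to model.encode().
def with_prefix(texts, kind):
    prefix = {"query": "query: ", "passage": "passage: "}[kind]
    return [prefix + t for t in texts]

queries = with_prefix(["how to store data"], "query")
passages = with_prefix(["Redis is an in-memory data store"], "passage")
print(queries[0])   # → query: how to store data
print(passages[0])  # → passage: Redis is an in-memory data store
```

Forgetting the prefixes is a common silent failure: the model still returns embeddings, but retrieval quality drops well below the reported scores.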

Matryoshka representation learning

Train embeddings that work at multiple dimension sizes — truncate for speed, use full dimensions for quality.
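At inference time, using a Matryoshka embedding at a smaller size is just slicing and re-normalizing. A numpy sketch with a made-up vector (real Matryoshka models are trained so the leading dimensions carry most of the signal):

```python
import numpy as np

def truncate_embedding(emb, dims):
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = emb[:dims]
    return truncated / np.linalg.norm(truncated)

rng = np.random.default_rng(42)
full = rng.normal(size=1024)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # 4x less memory per vector
print(small.shape)                      # → (256,)
print(f"{np.linalg.norm(small):.3f}")   # → 1.000
```

Re-normalizing after truncation matters: it keeps dot products interpretable as cosine similarity, so the truncated vectors drop into the same FAISS inner-product index.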

SetFit few-shot adaptation

Use SetFit to adapt a model with just 8-16 examples per task. Surprisingly effective for domain adaptation.

Note on the Pareto frontier: Larger models score higher but have higher latency. A submission that improves NDCG@10 while keeping model size under 100M parameters is especially valuable — it pushes the efficiency frontier.

Submit Your Result

Submit your MTEB retrieval evaluation result. Include your code repository so peers can verify your methodology.
