Measuring the Quality of Text Understanding
MTEB is the definitive benchmark for text embedding models: 8 task categories, 56+ datasets, and 112+ languages. It is the benchmark that turned "which embedding model should I use?" from guesswork into science.
What is MTEB?
The Massive Text Embedding Benchmark (MTEB) was introduced by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers at Hugging Face in their 2022 paper. Before MTEB, comparing embedding models was chaos: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.
The benchmark evaluates embedding models across 8 distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarization, and bitext mining. This breadth is what makes MTEB special: a model that dominates retrieval might fail at clustering. MTEB catches that.
Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace MTEB leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision models.
- Task categories: Retrieval, Classification, Clustering, STS, Reranking, Pair Classification, Summarization, Bitext Mining
- Language coverage: English-focused core, with multilingual extensions via Tatoeba and BUCC
- Model submissions: over 5,000 on the HuggingFace leaderboard, growing every week
The Golden Datasets
MTEB's power comes from its datasets. These aren't synthetic toy problems. They're real-world datasets with human annotations, covering domains from medical retrieval to banking intent classification. Here are four that define what it means to have good embeddings.
STS Benchmark
The cornerstone STS dataset. Human annotators rated sentence pairs on a 0-5 scale of semantic equivalence. Used as THE standard test for embedding quality since 2017.
NFCorpus
Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.
ArguAna
Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.
Banking77
Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".
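As a concrete toy illustration of what Banking77 demands, here is a minimal nearest-centroid intent classifier over embeddings. The 4-d vectors and the `classify` helper are hypothetical stand-ins; a real pipeline would encode texts with a sentence-transformers model.

```python
import numpy as np

# Toy intent-classification sketch (Banking77-style). The vectors below
# are made up for illustration, not real embeddings.
train = {
    "card_arrival": np.array([[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.1, 0.0]]),
    "card_delivery_estimate": np.array([[0.7, 0.3, 0.6, 0.0], [0.6, 0.4, 0.7, 0.1]]),
}

def classify(query_vec):
    """Nearest-centroid classifier: average each intent's embeddings,
    then assign the query to the closest centroid by cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    centroids = {label: vecs.mean(axis=0) for label, vecs in train.items()}
    return max(centroids, key=lambda label: cos(query_vec, centroids[label]))

print(classify(np.array([0.85, 0.15, 0.05, 0.0])))  # "card_arrival"
```

The hard part of Banking77 is that the centroids of near-synonymous intents sit close together, so small embedding differences decide the label.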
Task Categories Deep Dive
MTEB evaluates embeddings across 8 fundamentally different tasks. A great embedding model must excel at all of them. Each task tests a different aspect of text understanding.
Retrieval
Given a query, find the most relevant documents from a corpus.
Classification
Classify text into categories using embeddings as features.
Clustering
Group semantically similar texts into clusters without labels.
Reranking
Given a query and candidate documents, reorder by relevance.
Semantic Textual Similarity
Predict the degree of semantic equivalence between sentence pairs.
Pair Classification
Determine the relationship between two texts (duplicate, paraphrase, entailment).
Summarization
Evaluate how well a summary captures the meaning of a source document.
Bitext Mining
Find translation pairs between two sets of sentences in different languages.
MTEB Leaderboard
15 models ranked by average score across all English tasks. Updated 2025-12-26.
| # | Model | Org | Type | Avg | Retrieval | Class. | Cluster. | STS | Rerank | Dims | Params |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | KaLM-Embedding-Gemma3-12B | Tencent | Open Source | 72.32 | 75.7 | 77.9 | 55.8 | 79.0 | 67.3 | 3840 | 11.76B |
| 2 | Qwen3-Embedding-8B | Qwen / Alibaba | Open Source | 70.58 | 70.9 | 74.0 | 57.6 | 81.1 | 65.6 | 4096 | 8B |
| 3 | Seed1.6-embedding-1215 | ByteDance | API | 70.26 | 66.0 | 76.8 | 56.8 | 75.9 | 66.2 | 1536 | — |
| 4 | llama-embed-nemotron-8b | NVIDIA | Open Source | 69.46 | 68.7 | 73.2 | 54.4 | 79.4 | 67.8 | 4096 | 8B |
| 5 | Qwen3-Embedding-4B | Qwen / Alibaba | Open Source | 69.45 | 69.6 | 72.3 | 57.1 | 80.9 | 65.1 | 2560 | 4B |
| 6 | gemini-embedding-001 | Google | API | 68.37 | 67.7 | 71.8 | 54.6 | 79.4 | 65.6 | 3072 | — |
| 7 | Octen-Embedding-8B | Octen | Open Source | 67.85 | 71.7 | 66.7 | 55.7 | 81.3 | 67.6 | 4096 | 8B |
| 8 | Qwen3-Embedding-0.6B | Qwen / Alibaba | Open Source | 64.34 | 64.7 | 66.8 | 52.3 | 76.2 | 61.4 | 1024 | 0.6B |
| 9 | multilingual-e5-large-instruct | Microsoft | Open Source | 63.22 | 57.1 | 64.9 | 50.8 | 76.8 | 62.6 | 1024 | 560M |
| 10 | gte-Qwen2-7B-instruct | Alibaba | Open Source | 62.51 | 60.1 | 61.5 | 52.8 | 74.0 | 65.5 | 3584 | 7B |
| 11 | text-multilingual-embedding-002 | Google | API | 62.16 | 59.7 | 64.6 | 47.8 | 76.1 | 61.2 | 768 | — |
| 12 | bge-m3 | BAAI | Open Source | 59.56 | 57.9 | 62.3 | 48.2 | 74.5 | 56.8 | 1024 | 568M |
| 13 | text-embedding-3-large | OpenAI | API | 58.96 | 56.1 | 62.5 | 45.2 | 72.5 | 54.1 | 3072 | — |
| 14 | voyage-3.5 | Voyage AI | API | 58.46 | 55.9 | 61.8 | 44.6 | 71.9 | 53.5 | 1024 | — |
| 15 | jina-embeddings-v3 | Jina AI | Open Source | 58.37 | 54.5 | 61.2 | 43.8 | 71.3 | 52.9 | 1024 | 570M |
SOTA Progress: 2019 to 2025
From Sentence-BERT's first dedicated sentence embeddings to today's 12B-parameter models scoring 72+. The evolution tracks three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).
- 2019, Sentence-BERT: Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.
- 2021, SimCSE: Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.
- 2022, E5: Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. The MTEB paper is published the same year.
- 2023, BGE: BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open source catches up to OpenAI.
- 2024, E5-Mistral: Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.
- 2024, gte-Qwen2: Alibaba shows that a Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.
- 2024, BGE-M3 and jina-embeddings-v3: Multi-granularity retrieval (dense + sparse + ColBERT-style multi-vector) and task-specific LoRA adapters emerge as efficiency-focused alternatives.
- 2025, Qwen3-Embedding: The Qwen3 family dominates with multi-task training across embedding and reranking tasks; the first models to consistently break 70.
- 2025, KaLM-Embedding: KaLM fine-tunes Gemma3-12B with contrastive learning to set the current SOTA. Open source leads all APIs.
Accuracy vs. Model Size
The MTEB leaderboard reveals a clear trend: LLM-based embeddings dominate, but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 12B for 72.32. The score-per-parameter efficiency matters for production deployments.
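A quick back-of-the-envelope comparison using the Avg and Params columns from the leaderboard table above. MTEB points per billion parameters is not an official metric, just a rough efficiency proxy:

```python
# Score and parameter count (in billions) from the leaderboard table.
leaderboard = {
    "KaLM-Embedding-Gemma3-12B": (72.32, 11.76),
    "Qwen3-Embedding-8B": (70.58, 8.0),
    "Qwen3-Embedding-4B": (69.45, 4.0),
    "Qwen3-Embedding-0.6B": (64.34, 0.6),
    "multilingual-e5-large-instruct": (63.22, 0.56),
}

# Rank models by MTEB points per billion parameters, best first.
for name, (score, params_b) in sorted(
    leaderboard.items(), key=lambda kv: kv[1][0] / kv[1][1], reverse=True
):
    print(f"{name:32s} {score / params_b:7.1f} pts per B params")
```

By this crude measure the sub-1B models are over an order of magnitude more efficient than the 12B SOTA, which is why they dominate latency-sensitive deployments.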
The LLM Embedding Revolution
Before 2024, embedding models were small encoder-only transformers: BERT, RoBERTa, XLM-R. They maxed out around 560M parameters and scored ~60 on MTEB. Then researchers discovered that decoder-only LLMs make better embedding backbones.
E5-Mistral proved it first: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that crush all encoder-only models. Now every top-5 model uses an LLM backbone: Gemma3, Qwen3, LLaMA. The old BERT-based paradigm is over for high-performance embeddings.
Open Source vs. API: The Gap Closed
In 2023, OpenAI's text-embedding-3-large was considered best-in-class. Today it ranks 13th on MTEB with 58.96, behind nine open-source models. The open-source community has completely overtaken proprietary APIs.
- KaLM-Gemma3-12B (open): 72.32 — beats the best API by nearly 4 points
- gemini-embedding-001 (API): 68.37 — best API, but 6th overall
- text-embedding-3-large (API): 58.96 — once the king, now 13th
- Qwen3-0.6B (open, tiny): 64.34 — a 600M model beats OpenAI
Run MTEB Yourself
MTEB is fully open-source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.
Full MTEB Evaluation
```shell
pip install mteb sentence-transformers
```

```python
# Run the full English benchmark
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Run all English tasks (56+ datasets)
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or run specific task types only
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"],
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")
```

Quick Start: Use Top Models
```python
# Option 1: SOTA (KaLM-Gemma3-12B, needs ~24GB VRAM)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KaLM-ai/KaLM-Embedding-Gemma3-12B-2511")
embeddings = model.encode(["What is machine learning?", "ML is a subset of AI."])
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: Best bang for the buck (Qwen3-0.6B, runs on CPU!)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"])
```

```shell
# Option 3: Production serving with HuggingFace TEI
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id Qwen/Qwen3-Embedding-0.6B
```

Key Papers
Essential reading for understanding MTEB and modern text embeddings.
Code & Implementations
Open-source repositories for training, evaluating, and serving embedding models.
- mteb: Official MTEB benchmark framework. Run evaluations on any model with a single command.
- sentence-transformers: The de facto library for text embeddings in Python. Load, fine-tune, and deploy embedding models.
- FlagEmbedding: BAAI's BGE embedding family. Includes bge-m3, bge-reranker, and training code.
- Qwen3-Embedding: Qwen model family including Qwen3-Embedding. Multi-task training for embedding and reranking.
- text-embeddings-inference (TEI): Production-grade serving for embedding models. Rust-based; supports batching and quantization.
- Instructor: Instruction-tuned embeddings. Pioneered the "Represent the X for Y" prompting approach.
MTEB vs. Other Embedding Benchmarks
| Benchmark | Tasks | Datasets | Focus | Year |
|---|---|---|---|---|
| MTEB | 8 | 56+ | Comprehensive embedding evaluation | 2022 |
| BEIR | 1 | 18 | Zero-shot retrieval only | 2021 |
| SentEval | 4 | 17 | Sentence representation probing | 2018 |
| USEB | 4 | 8 | Unified sentence embedding eval | 2022 |
| KILT | 1 | 11 | Knowledge-intensive language tasks | 2021 |
| AIR-Bench | 2 | 24 | Automated IR benchmark (LLM-judged) | 2024 |
Understanding the Metrics
NDCG@10 (Retrieval)
Normalized Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.
DCG@10 = Σ_{i=1..10} rel_i / log2(i+1), and NDCG@10 = DCG@10 / IDCG@10, where IDCG@10 is the DCG of the ideal ranking.
A score of 1.0 means all relevant documents appear at the top. Most models score 0.4-0.7, reflecting the difficulty of zero-shot retrieval.
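The metric is easy to compute directly. In this minimal sketch, `ndcg_at_10` is a hypothetical helper name: it accumulates DCG over graded relevances in rank order and normalizes by the ideal (descending-relevance) ranking.

```python
import math

def ndcg_at_10(relevances):
    """relevances: graded relevance of the top-ranked documents, in rank order."""
    def dcg(rels):
        # rel_i / log2(i+1) with 1-based rank i; enumerate is 0-based, hence i+2.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# The same relevant document scores less the deeper it is buried:
print(ndcg_at_10([1, 0, 0]))  # 1.0: the only relevant doc is ranked first
print(ndcg_at_10([0, 0, 1]))  # 0.5: the relevant doc is at rank 3
```

The log discount is what makes NDCG position-sensitive: swapping ranks 1 and 10 costs far more than swapping ranks 9 and 10.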
Spearman ρ (STS)
Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.
ρ = 1 - 6 Σ d_i² / (n(n² - 1)), where d_i is the rank difference for pair i and n is the number of pairs.
Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
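The formula can be computed in a few lines. This pure-Python `spearman_rho` sketch assumes no tied scores (libraries like SciPy handle ties properly); the example scores are hypothetical.

```python
def spearman_rho(model_scores, human_scores):
    """Spearman rank correlation via rho = 1 - 6*sum(d_i^2) / (n(n^2-1)).
    Valid when there are no tied scores."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank + 1  # 1-based ranks
        return r
    rm, rh = ranks(model_scores), ranks(human_scores)
    n = len(model_scores)
    d2 = sum((a - b) ** 2 for a, b in zip(rm, rh))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Cosine similarities vs. human 0-5 judgments for five hypothetical pairs.
# Orderings agree perfectly, so rho = 1.0 even though the scales differ.
print(spearman_rho([0.95, 0.80, 0.40, 0.10, 0.60], [5.0, 4.2, 2.1, 0.3, 3.5]))
```

Because only ranks matter, a model can use any similarity scale and still score 1.0, which is exactly why STS uses Spearman rather than Pearson.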
When to Use Embeddings
Text embeddings convert language into dense vectors that capture semantic meaning. Here are the primary use cases where MTEB-benchmarked models excel.
Semantic Document Search
Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.
RAG Retrieval
Retrieve context chunks for LLM generation. Embedding quality directly determines answer accuracy in retrieval-augmented pipelines.
Duplicate Detection
Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.
Clustering & Topic Modeling
Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.
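For the duplicate-detection case above, the standard recipe is a cosine-similarity threshold over normalized embeddings. In this sketch the embedding vectors and the 0.9 threshold are illustrative stand-ins; real embeddings would come from `model.encode(..., normalize_embeddings=True)`.

```python
import numpy as np

tickets = ["Card never arrived", "My card has not arrived yet", "Reset my password"]

# Hypothetical 3-d stand-ins for real embedding vectors.
emb = np.array([
    [0.71, 0.71, 0.0],
    [0.68, 0.73, 0.05],
    [0.0, 0.1, 0.99],
])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize rows

THRESHOLD = 0.9  # tune on held-out labeled pairs
sims = emb @ emb.T  # dot product of unit vectors = cosine similarity

for i in range(len(tickets)):
    for j in range(i + 1, len(tickets)):
        if sims[i, j] > THRESHOLD:
            print(f"possible duplicate ({sims[i, j]:.2f}): {tickets[i]!r} / {tickets[j]!r}")
```

For large corpora the all-pairs `emb @ emb.T` matrix is replaced by an approximate nearest-neighbor index, but the thresholding logic is the same.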
Architecture Patterns
Three common approaches to generating embeddings in production, each with distinct trade-offs.
Sentence Transformers
Models trained specifically for sentence and paragraph embedding. Run locally with full control.
Pros
- Optimized for retrieval, fast inference
- Many specialized variants available
Cons
- Fixed context length
- May need domain fine-tuning
LLM Embeddings via API
Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.
Pros
- High quality, long context
- No infrastructure to maintain
Cons
- Cost per token
- Data leaves your system
Sparse + Dense Hybrid
Combine BM25 with dense embeddings for better recall. Best of both worlds for production search.
Pros
- Handles exact matches well
- More robust for rare terms
Cons
- More complex pipeline
- Two indices to maintain
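One common way to combine the two indices in the hybrid pattern is Reciprocal Rank Fusion (RRF), which merges rankings without needing to calibrate BM25 and cosine scores against each other. The doc-ID lists below are hypothetical; in practice they would come from, say, an Elasticsearch BM25 query and a vector-index query.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc.
    rankings: list of ranked doc-ID lists, best first. Returns fused ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]   # exact-term matches
dense_hits = ["doc1", "doc5", "doc3"]  # semantic matches
print(rrf([bm25_hits, dense_hits]))  # docs found by both retrievers rise to the top
```

The constant k=60 is the conventional default from the original RRF paper; it damps the advantage of rank-1 hits so that agreement between retrievers outweighs any single retriever's top pick.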
Quick Start Code
Get started with embeddings in minutes. Two approaches: hosted API or local model.
OpenAI API

```shell
pip install openai
```

```python
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating',
]
embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')
```

Local with Sentence Transformers
```shell
pip install sentence-transformers numpy
```

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')
documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating',
]
embeddings = model.encode(documents, normalize_embeddings=True)

# Embeddings are unit-normalized, so dot product equals cosine similarity
query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)
similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
```

Track More Benchmarks
MTEB is one of many benchmarks we track. Explore our full catalog of NLP, computer vision, and reasoning benchmarks with live leaderboards.