Text Embedding Benchmark

Measuring the Quality of
Text Understanding

MTEB is the definitive benchmark for text embedding models. 8 task categories, 56+ datasets, 112+ languages. The benchmark that turned "which embedding model should I use?" from guesswork into science.

Benchmark Stats

56+
Datasets across 8 task types
8
Task Categories
72.32
SOTA Score (KaLM-Embedding)
15
Models Tracked

What is MTEB?

The Massive Text Embedding Benchmark (MTEB) was introduced by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers at Hugging Face in their 2022 paper. Before MTEB, comparing embedding models was chaos: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.

The benchmark evaluates embedding models across 8 distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarization, and bitext mining. This breadth is what makes MTEB special: a model that dominates retrieval might fail at clustering. MTEB catches that.

Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace MTEB leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision models.

Core Design
8 Task Types

Retrieval, Classification, Clustering, STS, Reranking, Pair Classification, Summarization, Bitext Mining

Coverage
112+ Languages

English-focused core with multilingual extensions via Tatoeba and BUCC

Adoption
5,000+ Submissions

On the HuggingFace leaderboard, growing every week

The Golden Datasets

MTEB's power comes from its datasets. These aren't synthetic toy problems. They're real-world datasets with human annotations, covering domains from medical retrieval to banking intent classification. Here are four that define what it means to have good embeddings.

STS Benchmark

The cornerstone STS dataset. Human annotators rated sentence pairs on a 0-5 scale of semantic equivalence. Used as THE standard test for embedding quality since 2017.

Semantic Textual Similarity
8,628 sentence pairs
Real Examples from the Dataset
Sentence A: "A plane is taking off."
Sentence B: "An air plane is taking off."
Score: 5.00 (perfect equivalence)
Sentence A: "A woman is playing the guitar."
Sentence B: "A man is playing the flute."
Score: 1.60 (different actions, different agents)
Sentence A: "A man is smoking."
Sentence B: "A man is skating."
Score: 0.50 (nearly unrelated)
Source: Cer et al., 2017 (SemEval)

NFCorpus

Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.

Retrieval
3,633 queries, 169,756 documents
Real Examples from the Dataset
Query: "Does caffeine affect blood pressure?"
Document: "Acute effects of coffee consumption on self-reported gastrointestinal symptoms, blood pressure and stress indices..."
Relevant (lay query matched to scientific abstract)
Query: "vitamin D deficiency symptoms"
Document: "The role of vitamin D in reducing cancer risk and progression..."
Relevant (symptom query matched to clinical review)
Source: Boteva et al., 2016 (NutritionFacts)

ArguAna

Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.

Retrieval
1,406 queries, 8,674 arguments
Real Examples from the Dataset
Argument: "Nuclear energy is clean and efficient, producing minimal greenhouse gases..."
Counterargument: "Nuclear waste remains radioactive for thousands of years with no safe long-term storage solution..."
Counter (topically similar but argumentatively opposed)
Source: Wachsmuth et al., 2018

Banking77

Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".

Classification
13,083 customer queries
Real Examples from the Dataset
Query: "Why was I charged twice for the same transaction?"
Intent: transaction_charged_twice (fine-grained intent classification)
Query: "My card doesn't work at ATMs abroad"
Intent: card_not_working (must distinguish from similar card intents)
Source: Casanueva et al., 2020

Task Categories Deep Dive

MTEB evaluates embeddings across 8 fundamentally different tasks. A great embedding model must excel at all of them. Each task tests a different aspect of text understanding.

Retrieval

NDCG@10 · 15 datasets

Given a query, find the most relevant documents from a corpus.

Example
"What is the capital of France?"
"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."
The model must rank documents about Paris as capital highest among thousands of candidates.
How it works: Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.
MS MARCO · NQ · HotpotQA · FiQA · +2 more
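The retrieval pipeline described above can be sketched in a few lines of NumPy: embed the query and the corpus, rank by cosine similarity, and keep the top-k. The 4-dimensional vectors here are made-up stand-ins for real model output, not MS MARCO data.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query; return top-k indices and all scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k], scores

# Toy 4-d "embeddings" standing in for real model output
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.8, 0.0],   # relevant
    [0.0, 1.0, 0.0, 1.0],   # off-topic
    [0.5, 0.5, 0.5, 0.5],   # partially related
])
ranking, scores = top_k(query, docs)
print(ranking)  # the relevant document (index 0) ranks first
```

In the real benchmark the corpus has hundreds of thousands of documents, so the similarity search runs against an approximate nearest-neighbor index rather than a dense matrix product.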

Classification

Accuracy · 12 datasets

Classify text into categories using embeddings as features.

Example
"This product broke after two days. Terrible quality."
Label: Negative
Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.
How it works: Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.
AmazonCounterfactual · Banking77 · EmotionClassification · TweetSentiment · +1 more
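The frozen-embeddings-plus-simple-classifier setup can be illustrated with a tiny kNN classifier, one of the two classifier choices mentioned above. The 2-d vectors are illustrative stand-ins for real sentence embeddings.

```python
import numpy as np

def knn_predict(train_vecs, train_labels, test_vecs, k=1):
    """Classify each test embedding by majority vote among its k nearest train embeddings."""
    # Normalize so the dot product equals cosine similarity
    tr = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    te = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    sims = te @ tr.T                              # (n_test, n_train) similarity matrix
    nearest = np.argsort(-sims, axis=1)[:, :k]
    preds = []
    for row in nearest:
        votes = [train_labels[i] for i in row]
        preds.append(max(set(votes), key=votes.count))
    return preds

# Toy embeddings: two sentiment clusters (illustrative, not real model output)
train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels = ["positive", "positive", "negative", "negative"]
test = np.array([[0.95, 0.15], [0.15, 0.95]])
print(knn_predict(train, labels, test))  # ['positive', 'negative']
```

The key point: the embedding model is frozen, so accuracy here measures how linearly separable the classes already are in embedding space.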

Clustering

V-measure · 11 datasets

Group semantically similar texts into clusters without labels.

Example
Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]
Expected: {Science: [0,1], Finance: [2,3]}
Embeddings of similar topics should be closer together than embeddings of different topics.
How it works: Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.
ArXiv Clustering (S2S) · Reddit Clustering · StackExchange Clustering · TwentyNewsgroups
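V-measure, the metric named above, is the harmonic mean of homogeneity (each cluster contains one class) and completeness (each class lands in one cluster). A minimal pure-Python sketch, using the science-vs-finance example from the text:

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def cond_entropy(labels, given):
    """H(labels | given): entropy of labels within each cluster of `given`, weighted by size."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, gg in zip(labels, given) if gg == g]
        h += (len(sub) / n) * entropy(sub)
    return h

def v_measure(truth, pred):
    h_c, h_k = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1 - cond_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1 - cond_entropy(pred, truth) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Science vs finance headlines, as in the example above
truth = ["sci", "sci", "fin", "fin"]
print(v_measure(truth, [0, 0, 1, 1]))  # 1.0: perfect clustering
print(v_measure(truth, [0, 1, 0, 1]))  # 0.0: clusters ignore the topic
```

MTEB uses the standard library implementations for this; the sketch just shows what the score rewards.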

Reranking

MAP · 4 datasets

Given a query and candidate documents, reorder by relevance.

Example
"How to fix segmentation fault in C?"
Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]
Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.
How it works: Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).
AskUbuntuDupQuestions · MindSmallReranking · SciDocsRR · StackOverflowDupQuestions

Semantic Textual Similarity

Spearman correlation · 10 datasets

Predict the degree of semantic equivalence between sentence pairs.

Example
"A man is playing a guitar." vs "A person plays a musical instrument."
Human score: 4.2 / 5.0 (highly similar)
Model cosine similarity should correlate with human judgments across thousands of sentence pairs.
How it works: Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.
STS Benchmark · STS12 · STS13 · STS14 · +4 more

Pair Classification

Avg Precision (AP) · 3 datasets

Determine the relationship between two texts (duplicate, paraphrase, entailment).

Example
"How do I reset my password?" vs "I forgot my login credentials, how to recover?"
Label: Duplicate
Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.
How it works: Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).
TwitterURLCorpus · SprintDuplicateQuestions · Quora Duplicate Questions (QQP subset)
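Using similarity as a classifier score, as described above, means ranking all pairs by cosine similarity and computing average precision of the true duplicates over that ranking. A sketch with made-up similarity values:

```python
def pair_classification_ap(similarities, labels):
    """Sort pairs by cosine similarity (descending), then compute the average
    precision of the positive (duplicate) pairs over that ranking."""
    order = sorted(range(len(similarities)), key=lambda i: -similarities[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy cosine similarities for four pairs (illustrative, not real model output)
sims = [0.92, 0.35, 0.81, 0.40]     # pairs 0 and 2 are true duplicates
labels = [1, 0, 1, 0]
print(pair_classification_ap(sims, labels))  # 1.0: both duplicates outrank both non-duplicates
```

A score of 1.0 means a single similarity threshold could separate duplicates from non-duplicates perfectly; real models fall short of that.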

Summarization

Spearman correlation · 1 dataset

Evaluate how well a summary captures the meaning of a source document.

Example
Source: [full news article about climate policy]
Summary: "New climate bill targets 50% emission reduction by 2030"
Embedding similarity between source and summary should correlate with human quality judgments.
How it works: Embed source documents and their summaries. Cosine similarity should correlate with human-rated summary quality scores.
SummEval

Bitext Mining

F1 · 2 datasets

Find translation pairs between two sets of sentences in different languages.

Example
EN: "The cat sat on the mat."
DE: "Die Katze saß auf der Matte."
Cross-lingual embeddings must place translations closer than non-translation pairs.
How it works: Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.
Tatoeba · BUCC
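The nearest-neighbor matching step described above is easy to sketch: normalize both sides, match each source sentence to its most similar target sentence, and score the mined pairs with F1. The 2-d vectors are illustrative stand-ins for real cross-lingual embeddings.

```python
import numpy as np

def mine_bitext(src_vecs, tgt_vecs, threshold=0.5):
    """Match each source sentence to its nearest target-language neighbor by
    cosine similarity; keep only pairs above a similarity threshold."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = s @ t.T
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((i, j))
    return pairs

def f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy cross-lingual embeddings: EN sentences 0,1 translate to DE sentences 1,0
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.1, 0.9], [0.9, 0.1]])
pairs = mine_bitext(src, tgt)
print(pairs, f1(pairs, [(0, 1), (1, 0)]))  # [(0, 1), (1, 0)] 1.0
```

Production bitext mining typically adds margin-based scoring rather than a raw threshold, but the nearest-neighbor core is the same.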

MTEB Leaderboard

15 models ranked by average score across all English tasks. Updated 2025-12-26.

Full leaderboard on HuggingFace →
| # | Model | Type | Avg | Retrieval | Class. | Cluster. | STS | Rerank | Dims | Params |
|---|-------|------|-----|-----------|--------|----------|-----|--------|------|--------|
| 1 | KaLM-Embedding-Gemma3-12B (KaLM-ai) | Open Source | 72.32 | 75.7 | 77.9 | 55.8 | 79.0 | 67.3 | 3840 | 11.76B |
| 2 | Qwen3-Embedding-8B (Qwen / Alibaba) | Open Source | 70.58 | 70.9 | 74.0 | 57.6 | 81.1 | 65.6 | 4096 | 8B |
| 3 | Seed1.6-embedding-1215 (ByteDance) | API | 70.26 | 66.0 | 76.8 | 56.8 | 75.9 | 66.2 | 1536 | — |
| 4 | llama-embed-nemotron-8b | Open Source | 69.46 | 68.7 | 73.2 | 54.4 | 79.4 | 67.8 | 4096 | 8B |
| 5 | Qwen3-Embedding-4B (Qwen / Alibaba) | Open Source | 69.45 | 69.6 | 72.3 | 57.1 | 80.9 | 65.1 | 2560 | 4B |
| 6 | gemini-embedding-001 (Google) | API | 68.37 | 67.7 | 71.8 | 54.6 | 79.4 | 65.6 | 3072 | — |
| 7 | — | Open Source | 67.85 | 71.7 | 66.7 | 55.7 | 81.3 | 67.6 | 4096 | 8B |
| 8 | Qwen3-Embedding-0.6B (Qwen / Alibaba) | Open Source | 64.34 | 64.7 | 66.8 | 52.3 | 76.2 | 61.4 | 1024 | 0.6B |
| 9 | multilingual-e5-large | Open Source | 63.22 | 57.1 | 64.9 | 50.8 | 76.8 | 62.6 | 1024 | 560M |
| 10 | — | Open Source | 62.51 | 60.1 | 61.5 | 52.8 | 74.0 | 65.5 | 3584 | 7B |
| 11 | text-multilingual-embedding-002 (Google) | API | 62.16 | 59.7 | 64.6 | 47.8 | 76.1 | 61.2 | 768 | — |
| 12 | bge-m3 (BAAI) | Open Source | 59.56 | 57.9 | 62.3 | 48.2 | 74.5 | 56.8 | 1024 | 568M |
| 13 | text-embedding-3-large (OpenAI) | API | 58.96 | 56.1 | 62.5 | 45.2 | 72.5 | 54.1 | 3072 | — |
| 14 | voyage-3.5 (Voyage AI) | API | 58.46 | 55.9 | 61.8 | 44.6 | 71.9 | 53.5 | 1024 | — |
| 15 | jina-embeddings-v3 | Open Source | 58.37 | 54.5 | 61.2 | 43.8 | 71.3 | 52.9 | 1024 | 570M |

SOTA Progress: 2019 to 2025

From Sentence-BERT's first dedicated sentence embeddings to today's 12B-parameter models scoring 72+. The evolution tracks three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).

2019 · Sentence-BERT · ~51 · Encoder-only

Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.

2021 · SimCSE · ~54 · Encoder-only

Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.

2022 · E5-base · ~57 · Encoder-only

Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. MTEB paper published.

2023 · bge-large-en-v1.5 · ~60 · Instruction-tuned

BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open-source catches up to OpenAI.

2024 Q1 · E5-Mistral-7B · ~62 · LLM-based

Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.

2024 Q2 · gte-Qwen2-7B · ~63 · LLM-based

Alibaba shows that Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.

2024 Q4 · bge-m3 / Jina v3 · ~59 · Instruction-tuned

Multi-granularity (dense + sparse + colbert) and task-LoRA adapters emerge as efficiency-focused alternatives.

2025 Q1 · Qwen3-Embedding-8B · ~70 · LLM-based

Qwen3 family dominates with multi-task training across embedding + reranking tasks. First models to consistently break 70.

2025 Q2 · KaLM-Gemma3-12B · 72.32 · LLM-based

KaLM fine-tunes Gemma3-12B with contrastive learning to set the current SOTA. Open-source leads over all APIs.

Accuracy vs. Model Size

The MTEB leaderboard reveals a clear trend: LLM-based embeddings dominate, but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 12B for 72.32. The score-per-parameter efficiency matters for production deployments.

Efficiency Leaders

Qwen3-Embedding-0.6B · 0.6B params · 64.34 avg · 107.2 pts/B
multilingual-e5-large · 560M params · 63.22 avg · 112.9 pts/B
bge-m3 · 568M params · 59.56 avg · 104.9 pts/B
jina-embeddings-v3 · 570M params · 58.37 avg · 102.4 pts/B

Absolute Performance Leaders

KaLM-Gemma3-12B · 12B params · 72.32 avg · 3840d vectors
Qwen3-Embedding-8B · 8B params · 70.58 avg · 4096d vectors
Seed1.6-embedding · API · 70.26 avg · 1536d vectors
llama-embed-nemotron-8b · 8B params · 69.46 avg · 4096d vectors

The LLM Embedding Revolution

Before 2024, embedding models were small encoder-only transformers: BERT, RoBERTa, XLM-R. They maxed out around 560M parameters and scored ~60 on MTEB. Then researchers discovered that decoder-only LLMs make better embedding backbones.

E5-Mistral proved it first: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that crush all encoder-only models. Now every top-5 model uses an LLM backbone: Gemma3, Qwen3, LLaMA. The old BERT-based paradigm is over for high-performance embeddings.

Open Source vs. API: The Gap Closed

In 2023, OpenAI's text-embedding-3-large was considered best-in-class. Today it ranks 13th on MTEB with 58.96, behind nine open-source models. The open-source community has completely overtaken proprietary APIs.

  • KaLM-Gemma3-12B (open): 72.32 — beats all APIs by 2+ points
  • gemini-embedding-001 (API): 68.37 — best API, but 6th overall
  • text-embedding-3-large (API): 58.96 — once the king, now 13th
  • Qwen3-0.6B (open, tiny): 64.34 — a 600M model beats OpenAI

Run MTEB Yourself

MTEB is fully open-source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.

Full MTEB Evaluation

Python
# Install first: pip install mteb sentence-transformers

# Run full English benchmark
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Run all English tasks (56+ datasets)
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or run specific task types
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"]
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")

Quick Start: Use Top Models

Python
# Option 1: SOTA (KaLM-Gemma3-12B — needs ~24GB VRAM)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KaLM-ai/KaLM-Embedding-Gemma3-12B-2511")
embeddings = model.encode(
    ["What is machine learning?", "ML is a subset of AI."],
    normalize_embeddings=True,  # unit vectors, so the dot product below is cosine similarity
)
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: Best bang-for-buck (Qwen3-0.6B — runs on CPU!)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"])

# Option 3: Production serving with HuggingFace TEI
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-embeddings-inference:latest \
#   --model-id Qwen/Qwen3-Embedding-0.6B

Key Papers

Essential reading for understanding MTEB and modern text embeddings.

MTEB: Massive Text Embedding Benchmark
Muennighoff, Tazi, Magne, Reimers | EACL 2023 | 1,200+ citations
Original benchmark paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych | EMNLP 2019 | 8,000+ citations
Foundation of modern embeddings
Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)
Wang, Yang, Wei, et al. | arXiv 2022 | 1,500+ citations
E5 embedding family
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Chen, Xiao, Zhang, et al. | ACL 2024 Findings | 600+ citations
Multi-granularity retrieval
Improving Text Embeddings with Large Language Models (E5-Mistral)
Wang, Yang, Wei, et al. | ACL 2024 | 500+ citations
LLM-based embeddings
Jina Embeddings v3: Task-LoRA Adapters for Multi-Task Embeddings
Sturua, Mohr, et al. | arXiv 2024 | 100+ citations
Task-specific adapters
GTE: General Text Embeddings
Li, Zhang, et al. | arXiv 2023 | 400+ citations
Alibaba GTE family
Qwen3 Technical Report
Qwen Team | arXiv 2025 | 50+ citations
Current top open-source family

Code & Implementations

Open-source repositories for training, evaluating, and serving embedding models.

MTEB vs. Other Embedding Benchmarks

| Benchmark | Tasks | Datasets | Focus | Year |
|-----------|-------|----------|-------|------|
| MTEB | 8 | 56+ | Comprehensive embedding evaluation | 2022 |
| BEIR | 1 | 18 | Zero-shot retrieval only | 2021 |
| SentEval | 4 | 17 | Sentence representation probing | 2018 |
| USEB | 4 | 8 | Unified sentence embedding eval | 2022 |
| KILT | 1 | 11 | Knowledge-intensive language tasks | 2021 |
| AIR-Bench | 2 | 24 | Automated IR benchmark (LLM-judged) | 2024 |

Understanding the Metrics

NDCG@10 · Retrieval

Normalized Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.

NDCG@10 = DCG@10 / IDCG@10
DCG@10 = Σ(rel_i / log2(i+1))

A score of 1.0 means all relevant documents appear at the top. Most models score 0.4-0.7, reflecting the difficulty of zero-shot retrieval.
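The two formulas above translate directly into code. This sketch follows the document's DCG definition, rel_i / log2(i + 1) with positions starting at 1, and uses a toy binary relevance list:

```python
from math import log2

def dcg_at_k(relevances, k=10):
    """DCG@k = sum of rel_i / log2(i + 1), positions i starting at 1."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    ideal = sorted(ranked_relevances, reverse=True)   # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# One relevant document (rel=1) ranked 3rd among the returned results:
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5, i.e. 1/log2(4) against an ideal of 1/log2(2)
```

The logarithmic discount is why a relevant document at rank 3 is worth half of one at rank 1.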

Spearman ρ · STS

Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.

ρ = 1 - (6 * Σd_i²) / (n(n²-1))
where d_i = rank difference for pair i

Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
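The rank-difference formula above can be sketched in pure Python. Note the formula assumes no tied values; ties need average ranks and the Pearson-on-ranks form instead. The similarity and human-score numbers are illustrative:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)).
    Assumes no tied values."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Model cosine similarities vs human STS scores (illustrative numbers)
cosine = [0.95, 0.41, 0.12, 0.77]
human = [5.0, 1.6, 0.5, 4.2]
print(spearman_rho(cosine, human))  # 1.0: identical ordering, perfect rank correlation
```

Because only the ordering matters, a model can score well on STS even if its raw similarities are compressed into a narrow range.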


When to Use Embeddings

Text embeddings convert language into dense vectors that capture semantic meaning. Here are the primary use cases where MTEB-benchmarked models excel.

Semantic Document Search

Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.

RAG Retrieval

Retrieve context chunks for LLM generation. Embedding quality directly determines answer accuracy in retrieval-augmented pipelines.

Duplicate Detection

Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.

Clustering & Topic Modeling

Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.

Architecture Patterns

Three common approaches to generating embeddings in production, each with distinct trade-offs.

Sentence Transformers

Models trained specifically for sentence and paragraph embedding. Run locally with full control.

Pros

  • Optimized for retrieval, fast inference
  • Many specialized variants available

Cons

  • Fixed context length
  • May need domain fine-tuning

LLM Embeddings via API

Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.

Pros

  • High quality, long context
  • No infrastructure to maintain

Cons

  • Cost per token
  • Data leaves your system

Sparse + Dense Hybrid

Combine BM25 with dense embeddings for better recall. Best of both worlds for production search.

Pros

  • Handles exact matches well
  • More robust for rare terms

Cons

  • More complex pipeline
  • Two indices to maintain
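One simple way to combine the two signals is score fusion: min-max normalize the BM25 and dense scores separately, then take a weighted sum. This is a minimal sketch of that idea (the scores are made up; reciprocal rank fusion is a popular alternative):

```python
import numpy as np

def hybrid_scores(bm25, dense, alpha=0.5):
    """Min-max normalize each score list, then blend: alpha*dense + (1-alpha)*BM25."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Doc 0: exact keyword match (high BM25); doc 2: semantic match (high dense score)
bm25_scores = [12.3, 1.1, 0.4]
dense_scores = [0.55, 0.20, 0.83]
combined = hybrid_scores(bm25_scores, dense_scores)
print(np.argsort(-combined))  # [0 2 1]: both the exact and the semantic match outrank doc 1
```

Normalizing before blending matters because BM25 scores are unbounded while cosine similarities live in [-1, 1]; without it one signal dominates.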

Quick Start Code

Get started with embeddings in minutes. Two approaches: hosted API or local model.

OpenAI API

pip install openai

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')

Local with Sentence Transformers

pip install sentence-transformers numpy

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')

Track More Benchmarks

MTEB is one of many benchmarks we track. Explore our full catalog of NLP, computer vision, and reasoning benchmarks with live leaderboards.