§ 00 · Benchmark

MTEB — measuring the quality
of text understanding.

The Massive Text Embedding Benchmark is the definitive evaluation suite for sentence and passage embeddings — eight task categories, fifty-six-plus datasets, over a hundred languages. The benchmark that turned “which embedding should I use?” from guesswork into science.

56+
Datasets
8
Task categories
72.32
Current SOTA (avg)
15
Models tracked
§ 01
Origins

Before MTEB, comparison was chaos.

The Massive Text Embedding Benchmark was introduced in 2022 by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers at Hugging Face. Before MTEB, comparing embedding models was a mess: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.

The benchmark evaluates embeddings across eight distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarisation, and bitext mining. That breadth is the point: a model that dominates retrieval can still fail at clustering. MTEB catches that.

Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision.

§ 02 · Leaderboard

15 models, ranked by average English score.

Updated 2025-12-26
| # | Model | Org | Type | Avg | Retrieval | Class. | Cluster. | STS | Rerank | Dims | Params | Links |
|---|-------|-----|------|-----|-----------|--------|----------|-----|--------|------|--------|-------|
| 01 | KaLM-Embedding-Gemma3-12B | Tencent | Open source | 72.32 | 75.7 | 77.9 | 55.8 | 79.0 | 67.3 | 3840 | 11.76B | paper · code · HF |
| 02 | Qwen3-Embedding-8B | Qwen / Alibaba | Open source | 70.58 | 70.9 | 74.0 | 57.6 | 81.1 | 65.6 | 4096 | 8B | paper · HF |
| 03 | Seed1.6-embedding-1215 | ByteDance | API | 70.26 | 66.0 | 76.8 | 56.8 | 75.9 | 66.2 | 1536 | – | – |
| 04 | llama-embed-nemotron-8b | NVIDIA | Open source | 69.46 | 68.7 | 73.2 | 54.4 | 79.4 | 67.8 | 4096 | 8B | HF |
| 05 | Qwen3-Embedding-4B | Qwen / Alibaba | Open source | 69.45 | 69.6 | 72.3 | 57.1 | 80.9 | 65.1 | 2560 | 4B | paper · HF |
| 06 | gemini-embedding-001 | Google | API | 68.37 | 67.7 | 71.8 | 54.6 | 79.4 | 65.6 | 3072 | – | – |
| 07 | Octen-Embedding-8B | Octen | Open source | 67.85 | 71.7 | 66.7 | 55.7 | 81.3 | 67.6 | 4096 | 8B | HF |
| 08 | Qwen3-Embedding-0.6B | Qwen / Alibaba | Open source | 64.34 | 64.7 | 66.8 | 52.3 | 76.2 | 61.4 | 1024 | 0.6B | paper · HF |
| 09 | multilingual-e5-large-instruct | Microsoft | Open source | 63.22 | 57.1 | 64.9 | 50.8 | 76.8 | 62.6 | 1024 | 560M | paper · HF |
| 10 | gte-Qwen2-7B-instruct | Alibaba | Open source | 62.51 | 60.1 | 61.5 | 52.8 | 74.0 | 65.5 | 3584 | 7B | paper · HF |
| 11 | text-multilingual-embedding-002 | Google | API | 62.16 | 59.7 | 64.6 | 47.8 | 76.1 | 61.2 | 768 | – | – |
| 12 | bge-m3 | BAAI | Open source | 59.56 | 57.9 | 62.3 | 48.2 | 74.5 | 56.8 | 1024 | 568M | paper · code · HF |
| 13 | text-embedding-3-large | OpenAI | API | 58.96 | 56.1 | 62.5 | 45.2 | 72.5 | 54.1 | 3072 | – | – |
| 14 | voyage-3.5 | Voyage AI | API | 58.46 | 55.9 | 61.8 | 44.6 | 71.9 | 53.5 | 1024 | – | – |
| 15 | jina-embeddings-v3 | Jina AI | Open source | 58.37 | 54.5 | 61.2 | 43.8 | 71.3 | 52.9 | 1024 | 570M | paper · HF |
Fig 02 · Average score across all English MTEB tasks. Full leaderboard with 5,000+ submissions lives at huggingface.co/spaces/mteb/leaderboard.
§ 03 · Datasets

The golden datasets.

MTEB’s power comes from its datasets. These aren’t synthetic toy problems — they’re real-world corpora with human annotations, covering domains from medical retrieval to banking intent classification. Four that define what it means to have good embeddings.

Semantic Textual Similarity

STS Benchmark

8,628 sentence pairs
Cer et al., 2017 (SemEval)

The cornerstone STS dataset. Human annotators rated sentence pairs on a 0–5 scale of semantic equivalence. It has served as the standard test of embedding quality since 2017.

Real examples
A"A plane is taking off."
B"An air plane is taking off."
5.00Perfect equivalence
A"A woman is playing the guitar."
B"A man is playing the flute."
1.60Different actions, different agents
A"A man is smoking."
B"A man is skating."
0.50Nearly unrelated
Retrieval

NFCorpus

3,244 queries, 9,964 documents
Boteva et al., 2016 (NutritionFacts)

Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.

Real examples
A"Does caffeine affect blood pressure?"
B"Acute effects of coffee consumption on self-reported gastrointestinal symptoms, blood pressure and stress indices..."
RelevantLay query matched to scientific abstract
A"vitamin D deficiency symptoms"
B"The role of vitamin D in reducing cancer risk and progression..."
RelevantSymptom query matched to clinical review
Retrieval

ArguAna

1,406 queries, 8,674 arguments
Wachsmuth et al., 2018

Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.

Real examples
A"Nuclear energy is clean and efficient, producing minimal greenhouse gases..."
B"Nuclear waste remains radioactive for thousands of years with no safe long-term storage solution..."
CounterTopically similar but argumentatively opposed
Classification

Banking77

13,083 customer queries
Casanueva et al., 2020

Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".

Real examples
A"Why was I charged twice for the same transaction?"
Intent: card_payment_wrong_exchange_rateFine-grained intent classification
A"My card doesn't work at ATMs abroad"
Intent: card_not_workingMust distinguish from similar card intents
§ 04 · Tasks

Eight tasks, eight kinds of understanding.

MTEB evaluates embeddings across eight fundamentally different tasks. A great embedding model must excel at all of them — each tests a different facet of text understanding.

Retrieval

NDCG@10
15 datasets

Given a query, find the most relevant documents from a corpus.

Example
"What is the capital of France?"
"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."
The model must rank documents about Paris as capital highest among thousands of candidates.
How it works — Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.
MS MARCO · NQ · HotpotQA · FiQA · +2 more
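
A minimal sketch of this protocol, assuming the Qwen3-Embedding-0.6B checkpoint from the leaderboard and a three-document toy corpus; the real evaluation encodes the full corpora above and scores the rankings with NDCG@10.

Retrieval sketch · Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

query = "What is the capital of France?"
corpus = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Berlin is the capital of Germany.",
]

# Encode query and documents independently, then rank by cosine similarity
q_emb = model.encode(query, normalize_embeddings=True)
d_emb = model.encode(corpus, normalize_embeddings=True)
scores = util.cos_sim(q_emb, d_emb)[0]

for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.3f}  {corpus[int(idx)]}")
# NDCG@10 is then computed from rankings like this against the relevance judgments.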

Classification

Accuracy
12 datasets

Classify text into categories using embeddings as features.

Example
"This product broke after two days. Terrible quality."
Label: Negative
Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.
How it works — Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.
AmazonCounterfactual · Banking77 · EmotionClassification · TweetSentiment · +1 more
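
A minimal sketch of the frozen-embedding protocol, assuming scikit-learn for the logistic regression; the toy reviews and labels are illustrative.

Classification sketch · Python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

train_texts = ["This product broke after two days.", "Works perfectly, very happy with it."]
train_labels = ["negative", "positive"]
test_texts = ["Terrible quality, do not buy.", "Excellent value for the price."]
test_labels = ["negative", "positive"]

# The embedding model stays frozen; only the small classifier is fitted
clf = LogisticRegression(max_iter=1000)
clf.fit(model.encode(train_texts), train_labels)
preds = clf.predict(model.encode(test_texts))
print("accuracy:", accuracy_score(test_labels, preds))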

Clustering

V-measure
11 datasets

Group semantically similar texts into clusters without labels.

Example
Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]
Expected: {Science: [0,1], Finance: [2,3]}
Embeddings of similar topics should be closer together than embeddings of different topics.
How it works — Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.
ArXiv Clustering (S2S) · Reddit Clustering · StackExchange Clustering · TwentyNewsgroups
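
A minimal sketch of the same loop on the four-headline example above, assuming scikit-learn for k-means and the V-measure score.

Clustering sketch · Python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

texts = ["quantum computing advances", "new qubit architecture",
         "stock market rally", "GDP growth forecast"]
true_labels = [0, 0, 1, 1]  # Science vs Finance

# Embed, cluster with k-means, compare predicted clusters to ground truth
embeddings = model.encode(texts, normalize_embeddings=True)
pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print("V-measure:", v_measure_score(true_labels, pred_labels))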

Reranking

MAP
4 datasets

Given a query and candidate documents, reorder by relevance.

Example
"How to fix segmentation fault in C?"
Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]
Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.
How it works — Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).
AskUbuntuDupQuestions · MindSmallReranking · SciDocsRR · StackOverflowDupQuestions
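
A minimal sketch of scoring one query's pre-selected candidates, assuming scikit-learn's average_precision_score for the AP step; the candidates and relevance labels are illustrative.

Reranking sketch · Python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import average_precision_score

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

query = "How to fix segmentation fault in C?"
candidates = [
    "Segfaults usually mean an invalid pointer dereference; run the binary under gdb.",
    "Python list comprehensions are a concise way to build lists.",
    "Check array bounds and uninitialised pointers; valgrind pinpoints the bad access.",
]
relevance = [1, 0, 1]  # ground-truth label per candidate

# Score each query-candidate pair by cosine similarity, then evaluate the ordering
scores = util.cos_sim(model.encode(query), model.encode(candidates))[0].numpy()
print("AP for this query:", average_precision_score(relevance, scores))
# MAP is the mean of these per-query average precisions over the whole dataset.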

Semantic Textual Similarity

Spearman correlation
10 datasets

Predict the degree of semantic equivalence between sentence pairs.

Example
"A man is playing a guitar." vs "A person plays a musical instrument."
Human score: 4.2 / 5.0 (highly similar)
Model cosine similarity should correlate with human judgments across thousands of sentence pairs.
How it works — Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.
STS Benchmark · STS12 · STS13 · STS14 · +4 more
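
A minimal sketch of the STS protocol, assuming scipy for the Spearman correlation; the three pairs and human scores are illustrative.

STS sketch · Python
from sentence_transformers import SentenceTransformer
from scipy.stats import spearmanr

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

pairs = [("A man is playing a guitar.", "A person plays a musical instrument."),
         ("A plane is taking off.", "An air plane is taking off."),
         ("A man is smoking.", "A man is skating.")]
human_scores = [4.2, 5.0, 0.5]

# Cosine similarity per pair (dot product of normalized embeddings)
emb_a = model.encode([a for a, _ in pairs], normalize_embeddings=True)
emb_b = model.encode([b for _, b in pairs], normalize_embeddings=True)
cosine = (emb_a * emb_b).sum(axis=1)

rho, _ = spearmanr(cosine, human_scores)
print(f"Spearman rho: {rho:.3f}")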

Pair Classification

Avg Precision (AP)
3 datasets

Determine the relationship between two texts (duplicate, paraphrase, entailment).

Example
"How do I reset my password?" vs "I forgot my login credentials, how to recover?"
Label: Duplicate
Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.
How it works — Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).
TwitterURLCorpus · SprintDuplicateQuestions · Quora Duplicate Questions (QQP subset)
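
A minimal sketch, again assuming scikit-learn for average precision; the duplicate and non-duplicate pairs are illustrative.

Pair classification sketch · Python
from sentence_transformers import SentenceTransformer
from sklearn.metrics import average_precision_score

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

pairs = [
    ("How do I reset my password?", "I forgot my login credentials, how to recover?"),
    ("How do I reset my password?", "What are your opening hours on Sunday?"),
]
labels = [1, 0]  # duplicate vs non-duplicate

# Cosine similarity is used directly as the classifier score
emb_a = model.encode([a for a, _ in pairs], normalize_embeddings=True)
emb_b = model.encode([b for _, b in pairs], normalize_embeddings=True)
scores = (emb_a * emb_b).sum(axis=1)
print("Average precision:", average_precision_score(labels, scores))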

Summarization

Spearman correlation
1 dataset

Evaluate how well a summary captures the meaning of a source document.

Example
Source: [full news article about climate policy]
Summary: "New climate bill targets 50% emission reduction by 2030"
Embedding similarity between source and summary should correlate with human quality judgments.
How it works — Embed source documents and their summaries. Cosine similarity should correlate with human-rated summary quality scores.
SummEval

Bitext Mining

F1
2 datasets

Find translation pairs between two sets of sentences in different languages.

Example
EN: "The cat sat on the mat."
DE: "Die Katze saß auf der Matte."
Cross-lingual embeddings must place translations closer than non-translation pairs.
How it works — Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.
Tatoeba · BUCC
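
A minimal sketch of the nearest-neighbour matching step; the multilingual checkpoint (paraphrase-multilingual-MiniLM-L12-v2) and the two-sentence toy sets are illustrative choices, not part of the benchmark.

Bitext mining sketch · Python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

english = ["The cat sat on the mat.", "I like coffee in the morning."]
german = ["Ich trinke morgens gern Kaffee.", "Die Katze saß auf der Matte."]
gold = {0: 1, 1: 0}  # gold alignment: English index -> German index

# Match each English sentence to its nearest neighbour among the German sentences
sims = util.cos_sim(model.encode(english), model.encode(german))
pred = sims.argmax(dim=1)

correct = sum(int(pred[i]) == j for i, j in gold.items())
print(f"matched {correct}/{len(gold)} pairs")  # precision, recall and F1 follow from these matches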
§ 05 · Timeline

SOTA progress, 2019 → 2025.

From Sentence-BERT’s first dedicated sentence embeddings to today’s 12B-parameter models scoring 72+. Three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).

2019
Sentence-BERT · ~51
Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.
Encoder-only
2021
SimCSE · ~54
Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.
Encoder-only
2022
E5-base · ~57
Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. MTEB paper published.
Encoder-only
2023
bge-large-en-v1.5 · ~60
BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open-source catches up to OpenAI.
Instruction-tuned
2024 Q1
E5-Mistral-7B · ~62
Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.
LLM-based
2024 Q2
gte-Qwen2-7B · ~63
Alibaba shows that Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.
LLM-based
2024 Q4
bge-m3 / Jina v3 · ~59
Multi-granularity (dense + sparse + colbert) and task-LoRA adapters emerge as efficiency-focused alternatives.
Instruction-tuned
2025 Q1
Qwen3-Embedding-8B · ~70
Qwen3 family dominates with multi-task training across embedding + reranking tasks. First models to consistently break 70.
LLM-based
2025 Q2
KaLM-Gemma3-12B · 72.32
KaLM fine-tunes Gemma3-12B with contrastive learning to set the current SOTA. Open-source leads over all APIs.
LLM-based
§ 06 · Trade-off

Accuracy vs model size.

LLM-based embeddings dominate — but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 12B for 72.32. Score-per-parameter efficiency matters for production deployments.

Efficiency leaders
| Model | Params | Score | pts / B |
|-------|--------|-------|---------|
| Qwen3-Embedding-0.6B | 0.6B | 64.34 | 107.2 |
| multilingual-e5-large | 560M | 63.22 | 112.9 |
| bge-m3 | 568M | 59.56 | 104.9 |
| jina-embeddings-v3 | 570M | 58.37 | 102.4 |
Absolute performance leaders
| Model | Params | Score | Dims |
|-------|--------|-------|------|
| KaLM-Gemma3-12B | 12B | 72.32 | 3840 |
| Qwen3-Embedding-8B | 8B | 70.58 | 4096 |
| Seed1.6-embedding | – | 70.26 | 1536 |
| llama-embed-nemotron-8b | 8B | 69.46 | 4096 |
§ 07a
Essay

The LLM embedding revolution.

Before 2024, embedding models were small encoder-only transformers — BERT, RoBERTa, XLM-R. They maxed out around 560M parameters and scored ~60 on MTEB. Then researchers discovered that decoder-only LLMs make better embedding backbones.

E5-Mistral proved it first: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that crush all encoder-only models. Now every top-5 model uses an LLM backbone — Gemma3, Qwen3, LLaMA. The old BERT-based paradigm is over for high-performance embeddings.

§ 07b
Essay

Open source vs API: the gap closed.

In 2023, OpenAI’s text-embedding-3-large was considered best-in-class. Today it ranks 13th on MTEB with 58.96, behind nine open-source models. The open-source community has completely overtaken proprietary APIs.

  • KaLM-Gemma3-12B (open) — 72.32, beats the best API (gemini-embedding-001) by nearly 4 points.
  • gemini-embedding-001 (API) — 68.37, best API but 6th overall.
  • text-embedding-3-large (API) — 58.96, once king, now 13th.
  • Qwen3-0.6B (open, tiny) — 64.34, a 600M model beats OpenAI.
§ 08 · Reproduce

Run MTEB yourself.

MTEB is fully open source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.

Full MTEB evaluation · Python
# Install: pip install mteb sentence-transformers

# Run full English benchmark
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Run all English tasks (56+ datasets)
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or run specific task types
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"]
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")
Quick start — use top models · Python
# Option 1: SOTA (KaLM-Gemma3-12B — needs ~24GB VRAM)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KaLM-ai/KaLM-Embedding-Gemma3-12B-2511")
embeddings = model.encode(
    ["What is machine learning?", "ML is a subset of AI."],
    normalize_embeddings=True,  # normalize so the dot product below is cosine similarity
)
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: Best bang-for-buck (Qwen3-0.6B — runs on CPU!)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"])

# Option 3: Production serving with HuggingFace TEI
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-embeddings-inference:latest \
#   --model-id Qwen/Qwen3-Embedding-0.6B
§ 09 · Papers

Key papers.

Essential reading for understanding MTEB and modern text embeddings.

MTEB: Massive Text Embedding Benchmark
Muennighoff, Tazi, Magne, Reimers · EACL 2023 · 1,200+ citations
Original benchmark paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych · EMNLP 2019 · 8,000+ citations
Foundation of modern embeddings
Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)
Wang, Yang, Wei, et al. · arXiv 2022 · 1,500+ citations
E5 embedding family
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Chen, Xiao, Zhang, et al. · ACL 2024 Findings · 600+ citations
Multi-granularity retrieval
Improving Text Embeddings with Large Language Models (E5-Mistral)
Wang, Yang, Wei, et al. · ACL 2024 · 500+ citations
LLM-based embeddings
Jina Embeddings v3: Task-LoRA Adapters for Multi-Task Embeddings
Sturua, Mohr, et al. · arXiv 2024 · 100+ citations
Task-specific adapters
GTE: General Text Embeddings
Li, Zhang, et al. · arXiv 2023 · 400+ citations
Alibaba GTE family
Qwen3 Technical Report
Qwen Team · arXiv 2025 · 50+ citations
Current top open-source family
§ 10 · Code

Implementations worth reading.

Open-source repositories for training, evaluating, and serving embedding models.

embeddings-benchmark/mteb · 2.1k stars

Official MTEB benchmark framework. Run evaluations on any model with a single command.

UKPLab/sentence-transformers · 15.8k stars

The de facto library for text embeddings in Python. Load, fine-tune, and deploy embedding models.

FlagOpen/FlagEmbedding · 8.2k stars

BAAI's BGE embedding family. Includes bge-m3, bge-reranker, and training code.

QwenLM/Qwen · 18k stars

Qwen model family including Qwen3-Embedding. Multi-task training for embedding + reranking.

huggingface/text-embeddings-inference · 3.4k stars

Production-grade serving for embedding models. Rust-based, supports batching, quantization.

HKUNLP/instructor-embedding · 1.3k stars

Instruction-tuned embeddings. Pioneered the "Represent the X for Y" prompting approach.

§ 11 · Context

MTEB vs other embedding benchmarks.

| Benchmark | Tasks | Datasets | Focus | Year |
|-----------|-------|----------|-------|------|
| MTEB | 8 | 56+ | Comprehensive embedding evaluation | 2022 |
| BEIR | 1 | 18 | Zero-shot retrieval only | 2021 |
| SentEval | 4 | 17 | Sentence representation probing | 2018 |
| USEB | 4 | 8 | Unified sentence embedding eval | 2022 |
| KILT | 1 | 11 | Knowledge-intensive language tasks | 2021 |
| AIR-Bench | 2 | 24 | Automated IR benchmark (LLM-judged) | 2024 |
§ 12 · Metrics

Understanding the numbers.

NDCG@10 · Retrieval

Normalised Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.

NDCG@10 = DCG@10 / IDCG@10
DCG@10 = Σ(rel_i / log₂(i+1))

A score of 1.0 means all relevant documents appear at the top. Most models score 0.4–0.7, reflecting the difficulty of zero-shot retrieval.
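
The formula above, implemented directly; the relevance list is an illustrative ranking produced by some model.

NDCG@10 from scratch · Python
import math

def dcg_at_k(relevances, k=10):
    # rel_i / log2(i + 1), with positions i starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=10):
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)  # best possible ordering
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of documents in the order the model ranked them
ranked_relevances = [3, 2, 0, 1, 0, 0, 2, 0, 0, 0]
print(f"NDCG@10 = {ndcg_at_k(ranked_relevances):.3f}")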

Spearman ρ · STS

Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.

ρ = 1 − (6 × Σd_i²) / (n(n²−1))
where d_i = rank difference for pair i

Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
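
The same formula computed from scratch, assuming no tied ranks; the similarity and human-score lists are illustrative.

Spearman ρ from scratch · Python
def spearman_rho(xs, ys):
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d_sq = sum((a - b) ** 2 for a, b in zip(rx, ry))  # sum of squared rank differences
    return 1 - (6 * d_sq) / (n * (n ** 2 - 1))

cosine_sims = [0.91, 0.42, 0.15, 0.77]   # model cosine similarities
human_scores = [4.8, 2.1, 1.0, 1.8]      # human 0-5 judgments for the same pairs
print(f"rho = {spearman_rho(cosine_sims, human_scores):.3f}")  # 0.800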

§ 13 · Access

Where to find it.

MTEB GitHub

Official benchmark code. Install with pip install mteb.

HuggingFace Leaderboard

Full leaderboard with 5,000+ submissions and filters.

Original Paper

Muennighoff et al., EACL 2023. Benchmark design and analysis.

MTEB Datasets

All 56+ datasets available on HuggingFace Datasets.

§ 14 · Applications

When to use embeddings.

Text embeddings convert language into dense vectors that capture semantic meaning. The primary use cases where MTEB-benchmarked models excel.

Semantic document search

Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.

RAG retrieval

Retrieve context chunks for LLM generation. Embedding quality directly determines answer accuracy in retrieval-augmented pipelines.

Duplicate detection

Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.

Clustering & topic modeling

Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.

§ 15 · Patterns

Three production architectures.

Three common approaches to generating embeddings in production, each with distinct trade-offs.

Sentence transformers

Models trained specifically for sentence and paragraph embedding. Run locally with full control.

Pros
  • Optimised for retrieval, fast inference
  • Many specialised variants available
Cons
  • Fixed context length
  • May need domain fine-tuning

LLM embeddings via API

Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.

Pros
  • High quality, long context
  • No infrastructure to maintain
Cons
  • Cost per token
  • Data leaves your system

Sparse + dense hybrid

Combine BM25 with dense embeddings for better recall. Best of both worlds for production search; a minimal fusion sketch follows the pros and cons below.

Pros
  • Handles exact matches well
  • More robust for rare terms
Cons
  • More complex pipeline
  • Two indices to maintain
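
A minimal fusion sketch, assuming the rank_bm25 package for the sparse side, the bge-large-en-v1.5 checkpoint from the quick start below for the dense side, and an illustrative 50/50 weight; production systems typically fuse at the index or retriever level instead.

Hybrid retrieval sketch · Python
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = ["The cat sat on the mat", "A dog played in the park", "Machine learning is fascinating"]
query = "pets resting at home"

# Sparse side: BM25 over whitespace-tokenized documents
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse = np.array(bm25.get_scores(query.lower().split()))

# Dense side: cosine similarity from normalized embeddings
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
dense = model.encode(docs, normalize_embeddings=True) @ model.encode(query, normalize_embeddings=True)

def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # weight between sparse and dense scores
hybrid = alpha * minmax(sparse) + (1 - alpha) * minmax(dense)
for doc, score in sorted(zip(docs, hybrid), key=lambda t: -t[1]):
    print(f"{score:.3f}  {doc}")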
§ 16 · Quick start

Embeddings in minutes.

Two approaches — hosted API or local model. Pick one.

OpenAI API · pip install openai
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')
Local with sentence-transformers · pip install sentence-transformers numpy
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')
§ 17 · More

Track more benchmarks.

MTEB is one of many benchmarks we track. Explore the full catalogue of NLP, computer vision, and reasoning benchmarks with live leaderboards.