MTEB evaluates embeddings across eight fundamentally different tasks. A great embedding model must excel at all of them — each tests a different facet of text understanding.
Retrieval
NDCG@10 · 15 datasets
Given a query, find the most relevant documents from a corpus.
Example
"What is the capital of France?"
"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."
The model must rank documents stating that Paris is the capital of France highest among thousands of candidates.
How it works — Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.
Datasets: MS MARCO, NQ, HotpotQA, FiQA, +2 more
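A minimal sketch of this retrieval pipeline, assuming the sentence-transformers library with an illustrative model choice (all-MiniLM-L6-v2); the corpus, relevance judgments, and the simplified NDCG@10 helper are toy stand-ins, not the official MTEB harness:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def ndcg_at_10(ranked_relevance):
    """Simplified NDCG@10: relevance gains discounted by log2 of rank position."""
    gains = np.asarray(ranked_relevance[:10], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    dcg = float((gains * discounts).sum())
    ideal = np.sort(np.asarray(ranked_relevance, dtype=float))[::-1][:10]
    idcg = float((ideal * discounts[:ideal.size]).sum())
    return dcg / idcg if idcg > 0 else 0.0

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query = "What is the capital of France?"
corpus = [
    "Paris is the capital and most populous city of France.",
    "The Eiffel Tower was completed in 1889.",
    "Berlin is the capital of Germany.",
]
qrels = {0: 1.0}  # hypothetical relevance judgments: corpus index -> graded relevance

# Encode the query and all documents independently, then rank by cosine similarity.
q_emb = model.encode([query], normalize_embeddings=True)
d_emb = model.encode(corpus, normalize_embeddings=True)
ranking = np.argsort(-(q_emb @ d_emb.T)[0])  # best match first
print("NDCG@10:", ndcg_at_10([qrels.get(int(i), 0.0) for i in ranking]))
```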
Classification
Accuracy · 12 datasets
Classify text into categories using embeddings as features.
Example
"This product broke after two days. Terrible quality."
Label: Negative
Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.
How it works — Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.
Datasets: AmazonCounterfactual, Banking77, EmotionClassification, TweetSentiment, +1 more
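A minimal sketch of the classification setup, assuming sentence-transformers and scikit-learn; the model name and the tiny train/test split are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

train_texts = [
    "This product broke after two days. Terrible quality.",
    "Absolutely love it, works perfectly.",
    "Awful experience, would not recommend.",
    "Great value and fast shipping.",
]
train_labels = ["negative", "positive", "negative", "positive"]
test_texts = ["Stopped working within a week.", "Exceeded my expectations."]
test_labels = ["negative", "positive"]

# The embedding model stays frozen; only the downstream classifier is trained.
X_train = model.encode(train_texts)
X_test = model.encode(test_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
print("accuracy:", accuracy_score(test_labels, clf.predict(X_test)))
```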
Clustering
V-measure · 11 datasets
Group semantically similar texts into clusters without labels.
Example
Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]
Expected: {Science: [0,1], Finance: [2,3]}
Embeddings of similar topics should be closer together than embeddings of different topics.
How it works — Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.
Datasets: ArXiv Clustering (S2S), Reddit Clustering, StackExchange Clustering, TwentyNewsgroups
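A minimal sketch of the clustering evaluation using scikit-learn's KMeans and v_measure_score; the model name and the toy texts and labels are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

texts = ["quantum computing advances", "new qubit architecture",
         "stock market rally", "GDP growth forecast"]
true_labels = [0, 0, 1, 1]  # ground-truth topics, used only for scoring

embeddings = model.encode(texts)
pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
# V-measure compares the predicted clustering to the labels and is
# invariant to how the cluster IDs happen to be permuted.
print("V-measure:", v_measure_score(true_labels, pred_labels))
```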
Reranking
MAP · 4 datasets
Given a query and candidate documents, reorder by relevance.
Example
"How to fix segmentation fault in C?"
Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]
Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.
How it works — Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).
Datasets: AskUbuntuDupQuestions, MindSmallReranking, SciDocsRR, StackOverflowDupQuestions
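A minimal sketch of reranking a pre-selected candidate list and scoring it with average precision; the model name, candidates, and relevance labels are invented for illustration, and MAP is simply this per-query AP averaged over all queries in the dataset:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def average_precision(ranked_relevance):
    """AP over a ranked candidate list with binary relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return float(np.mean(precisions)) if precisions else 0.0

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

query = "How to fix segmentation fault in C?"
candidates = [
    "A segfault usually means your C code dereferenced an invalid pointer.",
    "Use valgrind or gdb to find where the invalid memory access happens.",
    "Python list comprehensions are a concise way to build lists.",
]
is_relevant = [1, 1, 0]  # hypothetical labels for the pre-selected candidates

# Score each query-document pair by cosine similarity and reorder the candidates.
q_emb = model.encode([query], normalize_embeddings=True)
c_emb = model.encode(candidates, normalize_embeddings=True)
order = np.argsort(-(q_emb @ c_emb.T)[0])
print("AP:", average_precision([is_relevant[i] for i in order]))
```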
Semantic Textual Similarity
Spearman correlation · 10 datasets
Predict the degree of semantic equivalence between sentence pairs.
Example
"A man is playing a guitar." vs "A person plays a musical instrument."
Human score: 4.2 / 5.0 (highly similar)
Model cosine similarity should correlate with human judgments across thousands of sentence pairs.
How it works — Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.
Datasets: STS Benchmark, STS12, STS13, STS14, +4 more
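A minimal sketch of the STS evaluation using scipy's Spearman correlation; the sentence pairs and human scores are made-up stand-ins for an annotated dataset:

```python
import numpy as np
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

pairs = [
    ("A man is playing a guitar.", "A person plays a musical instrument."),
    ("A man is playing a guitar.", "The stock market fell sharply today."),
    ("A dog runs through the park.", "A dog is running outside."),
]
human_scores = [4.2, 0.1, 4.6]  # hypothetical 0-5 similarity annotations

left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
right = model.encode([b for _, b in pairs], normalize_embeddings=True)
cosine = np.sum(left * right, axis=1)  # row-wise cosine similarity
corr, _ = spearmanr(cosine, human_scores)
print("Spearman correlation:", corr)
```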
Pair Classification
Avg Precision (AP) · 3 datasets
Determine the relationship between two texts (duplicate, paraphrase, entailment).
Example
"How do I reset my password?" vs "I forgot my login credentials, how to recover?"
Label: Duplicate
Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.
How it works — Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).
Datasets: TwitterURLCorpus, SprintDuplicateQuestions, Quora Duplicate Questions (QQP subset)
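A minimal sketch using cosine similarity as the classifier score and scikit-learn's average_precision_score; the pairs and gold labels are illustrative:

```python
import numpy as np
from sklearn.metrics import average_precision_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

pairs = [
    ("How do I reset my password?",
     "I forgot my login credentials, how to recover?"),
    ("How do I reset my password?",
     "What is the shipping cost to Canada?"),
]
is_duplicate = [1, 0]  # hypothetical gold labels

left = model.encode([a for a, _ in pairs], normalize_embeddings=True)
right = model.encode([b for _, b in pairs], normalize_embeddings=True)
scores = np.sum(left * right, axis=1)  # cosine similarity as the classification score
print("average precision:", average_precision_score(is_duplicate, scores))
```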
Summarization
Spearman correlation · 1 dataset
Evaluate how well a summary captures the meaning of a source document.
Example
Source: [full news article about climate policy]
Summary: "New climate bill targets 50% emission reduction by 2030"
Similarity in embedding space should track human judgments of summary quality.
How it works — Embed machine-generated summaries and human-written reference summaries. Each machine summary is scored by its highest cosine similarity to any reference summary, and Spearman correlation with human quality ratings is reported.
Dataset: SummEval
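A minimal sketch of this scoring scheme; the model name, summaries, and human ratings are invented for illustration:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

machine_summaries = [
    "New climate bill targets 50% emission reduction by 2030",
    "Lawmakers debated a bill on Tuesday",
    "The weather was sunny in the capital",
]
human_summaries = [
    "Climate legislation aims to halve emissions by 2030",
    "A new bill sets a 2030 target of cutting emissions in half",
]
human_ratings = [4.8, 2.5, 1.0]  # hypothetical quality scores for the machine summaries

m_emb = model.encode(machine_summaries, normalize_embeddings=True)
h_emb = model.encode(human_summaries, normalize_embeddings=True)
# Score each machine summary by its best cosine similarity to any reference summary.
scores = (m_emb @ h_emb.T).max(axis=1)
corr, _ = spearmanr(scores, human_ratings)
print("Spearman correlation:", corr)
```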
Bitext Mining
F1 · 2 datasets
Find translation pairs between two sets of sentences in different languages.
Example
EN: "The cat sat on the mat."
DE: "Die Katze saß auf der Matte."
Cross-lingual embeddings must place translations closer than non-translation pairs.
How it works — Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.
Datasets: Tatoeba, BUCC
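A minimal sketch of nearest-neighbor matching across languages, assuming a multilingual sentence-transformers model (paraphrase-multilingual-MiniLM-L12-v2 as an illustrative choice); the sentences, gold alignment, and the simple F1 computation over predicted pairs are toy stand-ins:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

english = ["The cat sat on the mat.", "It is raining heavily.", "I like strong coffee."]
german = ["Ich mag starken Kaffee.", "Die Katze saß auf der Matte.", "Es regnet stark."]
gold_pairs = {(0, 1), (1, 2), (2, 0)}  # gold alignment: (english index, german index)

en_emb = model.encode(english, normalize_embeddings=True)
de_emb = model.encode(german, normalize_embeddings=True)
# Match each English sentence to its nearest German neighbor by cosine similarity.
predicted = {(i, int(np.argmax(en_emb[i] @ de_emb.T))) for i in range(len(english))}

tp = len(predicted & gold_pairs)
precision = tp / len(predicted)
recall = tp / len(gold_pairs)
f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
print("F1:", f1)
```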