MTEB Leaderboard
Massive Text Embedding Benchmark - Compare embedding models across retrieval, classification, clustering, semantic similarity, and more.
Last updated: 2025-12-26 | Source: HuggingFace MTEB
Understanding Text Embeddings and MTEB
Text embeddings convert words and sentences into vectors (lists of numbers) that capture meaning. MTEB (Massive Text Embedding Benchmark) tests how well these embeddings work across different tasks.
What Are Text Embeddings?
Think of embeddings as GPS coordinates for meaning. Just like GPS uses latitude and longitude to place locations in space, embeddings use numbers to place text in "meaning space."
GPS Analogy
Like GPS coordinates place cities on a map, embeddings place words in "meaning space."
Fingerprint Analogy
Each text gets a unique "fingerprint" - a pattern of numbers that captures its essence.
Color Mixing Analogy
Like colors can be described as RGB values, text can be described as dimension values.
Interactive: See Embeddings in Action
Key Insight
Notice how "The cat sat on the mat" and "A dog plays in the park" have high similarity (0.97) because they are both about animals, while "The neural network learns" is very different (0.14) because it is about technology.
Real Embedding Dimensions
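You can inspect a model's real dimensionality, and the raw numbers behind the "fingerprint", directly with sentence-transformers. A minimal sketch (model choice illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Each text becomes one fixed-length vector of floats.
vec = model.encode("The cat sat on the mat")
print(vec.shape)                                 # (384,) for this model
print(model.get_sentence_embedding_dimension())  # same number, read from the model
print(vec[:5])                                   # first few coordinates of the "fingerprint"
```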
Why MTEB Matters: No Single Metric Tells the Whole Story
An embedding model might be great at finding similar documents but terrible at classification. MTEB tests 8 different task types to give you the complete picture.
Retrieval
Finding relevant documents for a query
The Problem with Single Metrics
- Model A: 95% retrieval, 60% classification
- Model B: 80% retrieval, 90% classification
- If you only look at retrieval, you miss that Model B is better for classification tasks
The MTEB Solution
- Test across 8 task categories
- 56+ datasets covering different domains
- See the complete picture before choosing (a sketch for running a single MTEB task yourself follows below)
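You can also reproduce individual benchmark numbers yourself with the `mteb` Python package. A minimal sketch, assuming the package is installed and using an illustrative task and model (the API has shifted a bit between mteb versions):

```python
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Evaluate on a single classification task rather than the full benchmark.
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
```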
The Trade-offs: What You Gain vs What You Give Up
Every embedding model makes trade-offs. Understanding these helps you make better choices.
- Dimensions (see the storage sketch after this list)
- Speed vs Accuracy
- Open Source vs API
- Multilingual vs English
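As a back-of-the-envelope illustration of the dimensions trade-off: a dense vector index stores one float per dimension per document, so at float32 (4 bytes per value) the raw storage cost scales linearly with dimensionality.

```python
def index_size_gb(num_docs: int, dims: int, bytes_per_value: int = 4) -> float:
    """Approximate raw storage for a dense vector index (float32 by default)."""
    return num_docs * dims * bytes_per_value / 1e9

# One million documents at dimensions common on the leaderboard below.
for dims in (384, 1024, 3072, 4096):
    print(f"{dims:>4} dims -> {index_size_gb(1_000_000, dims):.1f} GB")
```

At one million documents, moving from 1024 to 4096 dimensions roughly quadruples the index from about 4 GB to about 16 GB, before any compression or quantization.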
How to Choose: Match Your Use Case
The best model depends on what you are building. For example:
RAG / Document Search
Retrieval directly measures how well the model finds relevant documents for queries.
Try It: See How Embeddings Measure Similarity
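In practice, "similarity" for RAG means embedding a query and candidate documents, then ranking the documents by cosine similarity. The sketch below shows that loop with sentence-transformers (model and texts are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

query = "How do I reset my password?"
docs = [
    "To reset your password, open Settings and choose 'Forgot password'.",
    "Our office is closed on public holidays.",
    "Billing questions are handled by the finance team.",
]

query_emb = model.encode(query)
doc_embs = model.encode(docs)

# Rank documents by cosine similarity to the query (highest first).
hits = util.semantic_search(query_emb, doc_embs, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.2f}  {docs[hit['corpus_id']]}")
```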
Quick Decision Guide
When Reading MTEB Scores:
1. Identify your primary use case (RAG, search, classification, etc.)
2. Focus on the relevant task category score, not just the average (see the pandas sketch after this list)
3. Consider dimensions and speed for your infrastructure
4. Check if you need multilingual support
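Step 2 is easy to act on in code: pull the task columns you care about out of the leaderboard and sort by them instead of the average. A minimal pandas sketch using a handful of API models copied from the table further down this page:

```python
import pandas as pd

# A few API models copied from the leaderboard table on this page.
df = pd.DataFrame(
    [
        ("Seed1.6-embedding-1215", 70.26, 66.05, 76.75),
        ("gemini-embedding-001",   68.37, 67.71, 71.82),
        ("text-embedding-3-large", 58.96, 56.12, 62.45),
        ("voyage-3.5",             58.46, 55.89, 61.78),
    ],
    columns=["model", "avg", "retrieval", "classification"],
)

# For a RAG system, rank by the retrieval column, not the overall average.
print(df.sort_values("retrieval", ascending=False))
```

In this small sample, gemini-embedding-001 ranks first on retrieval even though Seed1.6-embedding-1215 has the higher average, which is exactly the kind of difference step 2 is meant to catch.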
General Recommendations:
- Start simple: all-MiniLM-L6-v2 for prototyping
- Best open source: bge-m3 or e5-mistral-7b
- Best API: voyage-3 or text-embedding-3-large
- Multilingual: bge-m3 (100+ languages)
| # | Model | Type | Avg Score | Retrieval | Classification | Clustering | STS | Dims |
|---|---|---|---|---|---|---|---|---|
| 1 |  | Open Source | 72.32 | 75.66 | 77.88 | 55.77 | 79.02 | 3584 |
| 2 |  | Open Source | 70.58 | 70.88 | 74.00 | 57.65 | 81.08 | 4096 |
| 3 | Seed1.6-embedding-1215 (ByteDance) | API | 70.26 | 66.05 | 76.75 | 56.78 | 75.92 | 1536 |
| 4 | NVIDIA | Open Source | 69.46 | 68.69 | 73.21 | 54.35 | 79.41 | 4096 |
| 5 |  | Open Source | 69.45 | 69.60 | 72.33 | 57.15 | 80.86 | 2560 |
| 6 | gemini-embedding-001 (Google) | API | 68.37 | 67.71 | 71.82 | 54.59 | 79.40 | 3072 |
| 7 | Octen | Open Source | 67.85 | 71.68 | 66.68 | 55.68 | 81.27 | 4096 |
| 8 |  | Open Source | 64.34 | 64.65 | 66.83 | 52.33 | 76.17 | 1024 |
| 9 | Microsoft | Open Source | 63.22 | 57.12 | 64.94 | 50.75 | 76.81 | 1024 |
| 10 | Alibaba | Open Source | 62.51 | 60.08 | 61.55 | 52.77 | 73.98 | 3584 |
| 11 | text-multilingual-embedding-002 (Google) | API | 62.16 | 59.68 | 64.64 | 47.84 | 76.11 | 768 |
| 12 | BAAI | Open Source | 59.56 | 57.89 | 62.34 | 48.23 | 74.45 | 1024 |
| 13 | text-embedding-3-large (OpenAI) | API | 58.96 | 56.12 | 62.45 | 45.23 | 72.45 | 3072 |
| 14 | voyage-3.5 (Voyage AI) | API | 58.46 | 55.89 | 61.78 | 44.56 | 71.89 | 1024 |
| 15 | Jina AI | Open Source | 58.37 | 54.45 | 61.23 | 43.78 | 71.34 | 1024 |
What is MTEB?
The Massive Text Embedding Benchmark evaluates embedding models across 8 task categories and 56+ datasets covering 112+ languages.
Key Metrics
- Retrieval: NDCG@10
- Classification: Accuracy
- Clustering: V-measure
- STS: Spearman correlation
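To make these metric names concrete, the sketch below computes each one on tiny made-up inputs with scikit-learn and SciPy; MTEB applies them to its own datasets, so the numbers here are only illustrative.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, ndcg_score, v_measure_score

# Retrieval: NDCG@10 compares the ranking induced by scores to true relevance.
true_relevance = np.asarray([[3, 2, 0, 0, 1]])
predicted_scores = np.asarray([[0.9, 0.7, 0.2, 0.1, 0.4]])
print("NDCG@10:", ndcg_score(true_relevance, predicted_scores, k=10))

# Classification: plain accuracy of predicted labels.
print("Accuracy:", accuracy_score([0, 1, 1, 0], [0, 1, 0, 0]))

# Clustering: V-measure compares predicted clusters to true classes.
print("V-measure:", v_measure_score([0, 0, 1, 1], [1, 1, 0, 0]))

# STS: Spearman correlation between model similarities and human ratings.
print("Spearman:", spearmanr([0.1, 0.5, 0.9], [0.2, 0.4, 0.95]).correlation)
```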
Choosing a Model
Consider your use case: retrieval-focused apps should prioritize retrieval scores. For general use, look at the average score and dimensions (smaller = faster).
Quick Start: Use Top Models
```python
# Option 1: Open source SOTA (Qwen3-Embedding-4B)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Qwen/Qwen3-Embedding-4B')
embeddings = model.encode(['Hello world', 'How are you?'])

# Option 2: Lightweight (Qwen3-Embedding-0.6B - still beats OpenAI!)
model = SentenceTransformer('Qwen/Qwen3-Embedding-0.6B')
embeddings = model.encode(['Hello world', 'How are you?'])

# Option 3: API (Google Gemini)
import google.generativeai as genai

genai.configure(api_key='YOUR_API_KEY')
result = genai.embed_content(
    model='models/embedding-001',
    content=['Hello world', 'How are you?'],
)
```