Text Embeddings (15+ models, updated 2025-12-26)

MTEB Leaderboard

Massive Text Embedding Benchmark - Compare embedding models across retrieval, classification, clustering, semantic similarity, and more.

Last updated: 2025-12-26 | Source: HuggingFace MTEB

Understanding Text Embeddings and MTEB

Text embeddings convert words and sentences into vectors (lists of numbers) that capture meaning. MTEB (Massive Text Embedding Benchmark) tests how well these embeddings work across different tasks.

1. What Are Text Embeddings?

Think of embeddings as GPS coordinates for meaning. Just like GPS uses latitude and longitude to place locations in space, embeddings use numbers to place text in "meaning space."

GPS Analogy

Like GPS coordinates place cities on a map, embeddings place words in "meaning space."

Paris (48.8, 2.3) is close to London (51.5, -0.1) because they are both European capitals.

Fingerprint Analogy

Each text gets a unique "fingerprint" - a pattern of numbers that captures its essence.

Similar texts have similar fingerprints, making them easy to match.

Color Mixing Analogy

Like colors can be described as RGB values, text can be described as dimension values.

Red(255,0,0) is far from Blue(0,0,255) but close to Orange(255,165,0).

Example: Embeddings in Action

For the text "The cat sat on the mat...", a simplified 5-dimensional embedding might look like this:

  • animal: 0.82
  • action: 0.15
  • tech: 0.03
  • domestic: 0.88
  • location: 0.21

As a vector: [0.82, 0.15, 0.03, 0.88, 0.21]

Similarity to other texts:

  • The cat sat on the mat...: 1.00
  • A dog plays in the park...: 0.97
  • The neural network learns...: 0.14
  • Machine learning models t...: 0.12
Key Insight

Notice how "The cat sat on the mat" and "A dog plays in the park" have high similarity (0.97) because they are both about animals, while "The neural network learns" is very different (0.14) because it is about technology.
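Those similarity numbers come from comparing the embedding vectors, usually with cosine similarity. A minimal sketch of that comparison (the model name is just a common lightweight example; exact scores vary by model):

Python
# Compare sentence embeddings with cosine similarity (sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [
    'The cat sat on the mat',
    'A dog plays in the park',
    'The neural network learns from data',
]
emb = model.encode(texts)
print(util.cos_sim(emb[0], emb[1]))  # both about animals -> high similarity
print(util.cos_sim(emb[0], emb[2]))  # animals vs. technology -> low similarity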

Real Embedding Dimensions

  • 384: MiniLM
  • 768: BERT base
  • 1024: bge-m3
  • 3072: OpenAI text-embedding-3-large
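If you are not sure what a particular checkpoint produces, sentence-transformers can report the output dimensionality directly. A small sketch (the two model names are examples from the list above):

Python
# Check a model's embedding dimensionality before committing to it (sketch).
from sentence_transformers import SentenceTransformer

for name in ['all-MiniLM-L6-v2', 'BAAI/bge-m3']:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())  # 384, then 1024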
2. Why MTEB Matters: No Single Metric Tells the Whole Story

An embedding model might be great at finding similar documents but terrible at classification. MTEB tests 8 different task types to give you the complete picture.

MTEB Task Types

Retrieval: finding relevant documents for a query.
  • Use cases: RAG systems, search engines, documentation lookup
  • Metric: NDCG@10
  • Example: query 'How to train a neural network' -> find the best-matching docs
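A toy version of what the retrieval task measures: embed a query and candidate documents, rank the documents by cosine similarity, and (in the benchmark) score that ranking with NDCG@10 against known relevance labels. The documents and model below are illustrative, not from an MTEB dataset:

Python
# Rank documents for a query by embedding similarity (sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2')
query = 'How to train a neural network'
docs = [
    'A step-by-step guide to training neural networks with backpropagation',
    'Best hiking trails in the Alps',
    'Choosing learning rates and batch sizes for deep learning models',
]
scores = util.cos_sim(model.encode(query), model.encode(docs))[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda p: p[1], reverse=True):
    print(f'{score:.2f}  {doc}')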

The Problem with Single Metrics

  • Model A: 95% retrieval, 60% classification
  • Model B: 80% retrieval, 90% classification
  • If you only look at retrieval, you miss that Model B is better for classification tasks

The MTEB Solution

  • Tests across 8 task categories
  • 56+ datasets covering different domains
  • See the complete picture before choosing
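You can also reproduce individual leaderboard numbers yourself with the open-source mteb package. A hedged sketch (assumes the mteb package is installed; the task name is one example dataset, and API details can differ slightly between mteb versions):

Python
# Run a single MTEB task on a model of your choice (sketch).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
evaluation = MTEB(tasks=['Banking77Classification'])
results = evaluation.run(model, output_folder='results/all-MiniLM-L6-v2')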
3. The Trade-offs: What You Gain vs What You Give Up

Every embedding model makes trade-offs. Understanding these helps you make better choices.

Dimensions

  • 384 dims: faster, less memory, ~20MB models (e.g., all-MiniLM-L6-v2)
  • 3072 dims: more nuanced, better accuracy (e.g., text-embedding-3-large)

Insight: Higher dimensions capture more nuance but cost more to store and compute. 768-1024 is often the sweet spot.
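The storage side of this trade-off is easy to estimate. A back-of-the-envelope sketch for one million documents stored as float32 vectors (4 bytes per dimension, before any index overhead or compression):

Python
# Raw vector storage for 1M documents at different dimensionalities (sketch).
num_docs = 1_000_000
for dims in (384, 768, 1024, 3072):
    gb = num_docs * dims * 4 / 1e9
    print(f'{dims:>4} dims: ~{gb:.1f} GB of raw vectors')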

Speed vs Accuracy

  • Fast (~5ms): lightweight models, fewer layers (e.g., DistilBERT, MiniLM)
  • Slow (~200ms): large models, more accurate (e.g., e5-mistral-7b, gte-Qwen2)

Insight: For real-time applications, prioritize speed. For batch processing, maximize accuracy.
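If latency matters, measure it on your own hardware rather than trusting headline numbers; the 5 ms / 200 ms figures above are rough orders of magnitude. A minimal timing sketch:

Python
# Rough single-query encoding latency (sketch; numbers depend on hardware).
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model.encode('warm up')  # first call pays one-time initialization cost

start = time.perf_counter()
model.encode('How do I reset my password?')
print(f'latency: {(time.perf_counter() - start) * 1000:.1f} ms')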

Open Source vs API

  • Open source: free, private, customizable (e.g., bge-m3, e5-mistral)
  • API: no infrastructure to run, always updated (e.g., OpenAI, Voyage, Cohere)

Insight: APIs are easier to start with. Open source wins at scale (cost) and for privacy-sensitive data.

Multilingual vs English

  • English-only: often higher English performance (e.g., e5-large-v2, bge-large-en)
  • Multilingual: 100+ languages, broader coverage (e.g., bge-m3, multilingual-e5)

Insight: English-only models often score higher on English benchmarks. Use a multilingual model if you need other languages.
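A quick way to sanity-check multilingual behavior is to embed a sentence and its translation and confirm they land close together. A sketch (the model name is one common multilingual checkpoint; bge-m3 behaves similarly):

Python
# Cross-lingual similarity with a multilingual model (sketch).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
en = model.encode('Where is the train station?')
fr = model.encode('Où est la gare ?')
print(util.cos_sim(en, fr))  # close to 1.0 for a good multilingual model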
4. How to Choose: Match Your Use Case

The best model depends on what you are building. The example below shows the recommendation for one common use case.

Use case: RAG / Document Search

  • Prioritize this metric: Retrieval score
  • Why: Retrieval directly measures how well the model finds relevant documents for queries.
  • Recommended models: 1. voyage-3, 2. bge-m3, 3. text-embedding-3-large

Example: How Embeddings Measure Similarity

For a pair of texts that are semantically very similar, the cosine similarity of their embeddings is high (0.89 in this example), so an embedding model correctly identifies the shared meaning.
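Cosine similarity itself is a one-line computation over the two embedding vectors. A minimal sketch (the vectors below are made up for illustration, not real model output):

Python
# Cosine similarity: 1.0 means the vectors point in the same direction (sketch).
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([0.82, 0.15, 0.03, 0.88, 0.21],
                        [0.75, 0.20, 0.05, 0.80, 0.30]))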

Quick Decision Guide

When Reading MTEB Scores:

  1. Identify your primary use case (RAG, search, classification, etc.)
  2. Focus on the relevant task category score, not just the average
  3. Consider dimensions and speed for your infrastructure
  4. Check if you need multilingual support

General Recommendations:

  • Start simple: all-MiniLM-L6-v2 for prototyping
  • Best open source: bge-m3 or e5-mistral-7b
  • Best API: voyage-3 or text-embedding-3-large
  • Multilingual: bge-m3 (100+ languages)
Leaderboard: Top 15 Models

# | Model | Type | Avg Score | Retrieval | Classification | Clustering | STS | Dims
1 | | Open Source | 72.32 | 75.66 | 77.88 | 55.77 | 79.02 | 3584
2 | | Open Source | 70.58 | 70.88 | 74.00 | 57.65 | 81.08 | 4096
3 | Seed1.6-embedding-1215 (ByteDance) | API | 70.26 | 66.05 | 76.75 | 56.78 | 75.92 | 1536
4 | | Open Source | 69.46 | 68.69 | 73.21 | 54.35 | 79.41 | 4096
5 | | Open Source | 69.45 | 69.60 | 72.33 | 57.15 | 80.86 | 2560
6 | gemini-embedding-001 (Google) | API | 68.37 | 67.71 | 71.82 | 54.59 | 79.40 | 3072
7 | | Open Source | 67.85 | 71.68 | 66.68 | 55.68 | 81.27 | 4096
8 | | Open Source | 64.34 | 64.65 | 66.83 | 52.33 | 76.17 | 1024
9 | | Open Source | 63.22 | 57.12 | 64.94 | 50.75 | 76.81 | 1024
10 | | Open Source | 62.51 | 60.08 | 61.55 | 52.77 | 73.98 | 3584
11 | text-multilingual-embedding-002 (Google) | API | 62.16 | 59.68 | 64.64 | 47.84 | 76.11 | 768
12 | (BAAI) | Open Source | 59.56 | 57.89 | 62.34 | 48.23 | 74.45 | 1024
13 | text-embedding-3-large (OpenAI) | API | 58.96 | 56.12 | 62.45 | 45.23 | 72.45 | 3072
14 | voyage-3.5 (Voyage AI) | API | 58.46 | 55.89 | 61.78 | 44.56 | 71.89 | 1024
15 | | Open Source | 58.37 | 54.45 | 61.23 | 43.78 | 71.34 | 1024

What is MTEB?

The Massive Text Embedding Benchmark evaluates embedding models across 8 task categories and 56+ datasets covering 112+ languages.

Key Metrics

  • Retrieval: NDCG@10
  • Classification: Accuracy
  • Clustering: V-measure
  • STS: Spearman correlation
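These metrics map onto standard library calls; MTEB's harness computes them internally, but a toy sketch with scikit-learn and SciPy shows what each one is:

Python
# The four headline metrics on toy inputs (sketch).
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, ndcg_score, v_measure_score

print(ndcg_score([[3, 2, 0, 1]], [[0.9, 0.8, 0.3, 0.4]], k=10))  # Retrieval: NDCG@10
print(accuracy_score([1, 0, 1, 1], [1, 0, 0, 1]))                 # Classification: accuracy
print(v_measure_score([0, 0, 1, 1], [0, 0, 1, 0]))                # Clustering: V-measure
print(spearmanr([0.9, 0.1, 0.5], [0.8, 0.2, 0.6]).correlation)    # STS: Spearman correlation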

Choosing a Model

Consider your use case: retrieval-focused apps should prioritize retrieval scores. For general use, look at the average score and dimensions (smaller = faster).

Quick Start: Use Top Models

Python
# Option 1: Open Source SOTA (Qwen3-Embedding-4B)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Qwen/Qwen3-Embedding-4B')
embeddings = model.encode(['Hello world', 'How are you?'])

# Option 2: Lightweight (Qwen3-Embedding-0.6B - still beats OpenAI!)
model = SentenceTransformer('Qwen/Qwen3-Embedding-0.6B')
embeddings = model.encode(['Hello world', 'How are you?'])

# Option 3: API (Google Gemini)
import google.generativeai as genai
genai.configure(api_key='YOUR_API_KEY')
result = genai.embed_content(
    model='models/embedding-001',
    content=['Hello world', 'How are you?']
)
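# result['embedding'] holds one vector per input string.

# Follow-up sketch: turn the open-source embeddings from Option 1 or 2 into a
# similarity score (util.cos_sim ships with sentence-transformers).
from sentence_transformers import util
print(util.cos_sim(embeddings[0], embeddings[1]))  # similarity of the two inputs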