Text Embedding Benchmark

Measuring the Quality of
Text Understanding

MTEB is the definitive benchmark for text embedding models. 8 task categories, 56+ datasets, 112+ languages. The benchmark that turned "which embedding model should I use?" from guesswork into science.

Benchmark Stats

56+
Datasets across 8 task types
8
Task Categories
72.32
SOTA Score (KaLM-Embedding)
15
Models Tracked

What is MTEB?

The Massive Text Embedding Benchmark (MTEB) was introduced by Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers at Hugging Face in their 2022 paper. Before MTEB, comparing embedding models was chaos: each paper picked its own tasks, its own datasets, its own evaluation protocol. MTEB brought order.

The benchmark evaluates embedding models across 8 distinct task categories spanning retrieval, classification, clustering, semantic similarity, reranking, pair classification, summarization, and bitext mining. This breadth is what makes MTEB special: a model that dominates retrieval might fail at clustering. MTEB catches that.

Since its release, MTEB has become the standard evaluation suite for the embedding community. The HuggingFace MTEB leaderboard has over 5,000 model submissions. Every major embedding model from OpenAI, Google, Cohere, Alibaba, and Microsoft now reports MTEB scores. It is to embeddings what ImageNet was to vision models.

Core Design
8 Task Types

Retrieval, Classification, Clustering, STS, Reranking, Pair Classification, Summarization, Bitext Mining

Coverage
112+ Languages

English-focused core with multilingual extensions via Tatoeba and BUCC

Adoption
5,000+ Submissions

On the HuggingFace leaderboard, growing every week

The Golden Datasets

MTEB's power comes from its datasets. These aren't synthetic toy problems. They're real-world datasets with human annotations, covering domains from medical retrieval to banking intent classification. Here are four that define what it means to have good embeddings.

STS Benchmark

The cornerstone STS dataset. Human annotators rated sentence pairs on a 0-5 scale of semantic equivalence. Used as THE standard test for embedding quality since 2017.

Semantic Textual Similarity
8,628 sentence pairs
Real Examples from the Dataset
Sentence A: "A plane is taking off."
Sentence B: "An air plane is taking off."
Score: 5.00 (perfect equivalence)
Sentence A: "A woman is playing the guitar."
Sentence B: "A man is playing the flute."
Score: 1.60 (different actions, different agents)
Sentence A: "A man is smoking."
Sentence B: "A man is skating."
Score: 0.50 (nearly unrelated)
Source: Cer et al., 2017 (SemEval)

NFCorpus

Medical information retrieval: given a health query, find relevant scientific documents. Tests whether embeddings can bridge the gap between lay language and medical terminology.

Retrieval
3,633 queries, 169,756 documents
Real Examples from the Dataset
Query: "Does caffeine affect blood pressure?"
Document: "Acute effects of coffee consumption on self-reported gastrointestinal symptoms, blood pressure and stress indices..."
Relevant (lay query matched to scientific abstract)
Query: "vitamin D deficiency symptoms"
Document: "The role of vitamin D in reducing cancer risk and progression..."
Relevant (symptom query matched to clinical review)
Source: Boteva et al., 2016 (NutritionFacts)

ArguAna

Counterargument retrieval: given an argument on a controversial topic, find the best counterargument. One of the hardest retrieval tasks because surface-level similarity is misleading.

Retrieval
1,406 queries, 8,674 arguments
Real Examples from the Dataset
Argument: "Nuclear energy is clean and efficient, producing minimal greenhouse gases..."
Counterargument: "Nuclear waste remains radioactive for thousands of years with no safe long-term storage solution..."
Counter (topically similar but argumentatively opposed)
Source: Wachsmuth et al., 2018

Banking77

Intent detection in banking: classify customer messages into 77 fine-grained intents. Tests whether embeddings can distinguish between highly similar intents like "card_arrival" vs "card_delivery_estimate".

Classification
13,083 customer queries
Real Examples from the Dataset
Query: "Why was I charged twice for the same transaction?"
Intent: transaction_charged_twice (fine-grained intent classification)
Query: "My card doesn't work at ATMs abroad"
Intent: card_not_working (must distinguish from similar card intents)
Source: Casanueva et al., 2020

Task Categories Deep Dive

MTEB evaluates embeddings across 8 fundamentally different tasks. A great embedding model must excel at all of them. Each task tests a different aspect of text understanding.

Retrieval

NDCG@10 · 15 datasets

Given a query, find the most relevant documents from a corpus.

Example
"What is the capital of France?"
"Paris is the capital and most populous city of France, with an estimated population of 2,165,423."
The model must rank documents about Paris as capital highest among thousands of candidates.
How it works: Encode query and all documents independently. Rank by cosine similarity. NDCG@10 measures if relevant docs appear in top 10.
MS MARCO · NQ · HotpotQA · FiQA · +2 more
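The retrieval pipeline described above can be sketched in a few lines of NumPy: embed the query and the corpus, rank by cosine similarity, and keep the top-k. The 4-dimensional vectors here are made-up stand-ins for real model output, not MS MARCO data.

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query; return top-k indices and all scores."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity per document
    return np.argsort(-scores)[:k], scores

# Toy 4-d "embeddings" standing in for real model output
query = np.array([1.0, 0.0, 1.0, 0.0])
docs = np.array([
    [0.9, 0.1, 0.8, 0.0],   # relevant
    [0.0, 1.0, 0.0, 1.0],   # off-topic
    [0.5, 0.5, 0.5, 0.5],   # partially related
])
ranking, scores = top_k(query, docs)
print(ranking)  # the relevant document (index 0) ranks first
```

In the real benchmark the corpus has hundreds of thousands of documents, so the similarity search runs against an approximate nearest-neighbor index rather than a dense matrix product.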

Classification

Accuracy · 12 datasets

Classify text into categories using embeddings as features.

Example
"This product broke after two days. Terrible quality."
Label: Negative
Embeddings are used as features for a logistic regression classifier. No fine-tuning of the embedding model.
How it works: Embed all texts, fit a simple classifier (kNN or logistic regression) on train embeddings, evaluate on test set.
AmazonCounterfactual · Banking77 · EmotionClassification · TweetSentiment · +1 more
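The frozen-embeddings-plus-simple-classifier setup can be illustrated with a tiny kNN classifier, one of the two classifier choices mentioned above. The 2-d vectors are illustrative stand-ins for real sentence embeddings.

```python
import numpy as np

def knn_predict(train_vecs, train_labels, test_vecs, k=1):
    """Classify each test embedding by majority vote among its k nearest train embeddings."""
    # Normalize so the dot product equals cosine similarity
    tr = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    te = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    sims = te @ tr.T                              # (n_test, n_train) similarity matrix
    nearest = np.argsort(-sims, axis=1)[:, :k]
    preds = []
    for row in nearest:
        votes = [train_labels[i] for i in row]
        preds.append(max(set(votes), key=votes.count))
    return preds

# Toy embeddings: two sentiment clusters (illustrative, not real model output)
train = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
labels = ["positive", "positive", "negative", "negative"]
test = np.array([[0.95, 0.15], [0.15, 0.95]])
print(knn_predict(train, labels, test))  # ['positive', 'negative']
```

The key point: the embedding model is frozen, so accuracy here measures how linearly separable the classes already are in embedding space.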

Clustering

V-measure · 11 datasets

Group semantically similar texts into clusters without labels.

Example
Cluster: ["quantum computing advances", "new qubit architecture", "stock market rally", "GDP growth forecast"]
Expected: {Science: [0,1], Finance: [2,3]}
Embeddings of similar topics should be closer together than embeddings of different topics.
How it works: Embed all texts, run k-means or mini-batch k-means, compare predicted clusters to ground truth with V-measure.
ArXiv Clustering (S2S) · Reddit Clustering · StackExchange Clustering · TwentyNewsgroups
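V-measure, the metric named above, is the harmonic mean of homogeneity (each cluster contains one class) and completeness (each class lands in one cluster). A minimal pure-Python sketch, using the science-vs-finance example from the text:

```python
from collections import Counter
from math import log

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def cond_entropy(labels, given):
    """H(labels | given): entropy of labels within each cluster of `given`, weighted by size."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        sub = [l for l, gg in zip(labels, given) if gg == g]
        h += (len(sub) / n) * entropy(sub)
    return h

def v_measure(truth, pred):
    h_c, h_k = entropy(truth), entropy(pred)
    homogeneity = 1.0 if h_c == 0 else 1 - cond_entropy(truth, pred) / h_c
    completeness = 1.0 if h_k == 0 else 1 - cond_entropy(pred, truth) / h_k
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)

# Science vs finance headlines, as in the example above
truth = ["sci", "sci", "fin", "fin"]
print(v_measure(truth, [0, 0, 1, 1]))  # 1.0: perfect clustering
print(v_measure(truth, [0, 1, 0, 1]))  # 0.0: clusters ignore the topic
```

MTEB uses the standard library implementations for this; the sketch just shows what the score rewards.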

Reranking

MAP · 4 datasets

Given a query and candidate documents, reorder by relevance.

Example
"How to fix segmentation fault in C?"
Reorder: [doc_A (irrelevant), doc_B (relevant), doc_C (partial)] -> [doc_B, doc_C, doc_A]
Unlike retrieval, candidates are pre-selected. The model must reorder them by relevance.
How it works: Score each query-document pair by cosine similarity, reorder candidates. Evaluate with Mean Average Precision (MAP).
AskUbuntuDupQuestions · MindSmallReranking · SciDocsRR · StackOverflowDupQuestions

Semantic Textual Similarity

Spearman correlation · 10 datasets

Predict the degree of semantic equivalence between sentence pairs.

Example
"A man is playing a guitar." vs "A person plays a musical instrument."
Human score: 4.2 / 5.0 (highly similar)
Model cosine similarity should correlate with human judgments across thousands of sentence pairs.
How it works: Compute cosine similarity for each sentence pair. Measure Spearman rank correlation with human-annotated similarity scores.
STS Benchmark · STS12 · STS13 · STS14 · +4 more

Pair Classification

Avg Precision (AP) · 3 datasets

Determine the relationship between two texts (duplicate, paraphrase, entailment).

Example
"How do I reset my password?" vs "I forgot my login credentials, how to recover?"
Label: Duplicate
Cosine similarity between embeddings must separate duplicate pairs from non-duplicate pairs.
How it works: Compute cosine similarity for each pair. Use similarity as a classifier score. Evaluate with average precision (AP).
TwitterURLCorpus · SprintDuplicateQuestions · Quora Duplicate Questions (QQP subset)
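Using similarity as a classifier score, as described above, means ranking all pairs by cosine similarity and computing average precision of the true duplicates over that ranking. A sketch with made-up similarity values:

```python
def pair_classification_ap(similarities, labels):
    """Sort pairs by cosine similarity (descending), then compute the average
    precision of the positive (duplicate) pairs over that ranking."""
    order = sorted(range(len(similarities)), key=lambda i: -similarities[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

# Toy cosine similarities for four pairs (illustrative, not real model output)
sims = [0.92, 0.35, 0.81, 0.40]     # pairs 0 and 2 are true duplicates
labels = [1, 0, 1, 0]
print(pair_classification_ap(sims, labels))  # 1.0: both duplicates outrank both non-duplicates
```

A score of 1.0 means a single similarity threshold could separate duplicates from non-duplicates perfectly; real models fall short of that.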

Summarization

Spearman correlation · 1 dataset

Evaluate how well a summary captures the meaning of a source document.

Example
Source: [full news article about climate policy]
Summary: "New climate bill targets 50% emission reduction by 2030"
Embedding similarity between source and summary should correlate with human quality judgments.
How it works: Embed source documents and their summaries. Cosine similarity should correlate with human-rated summary quality scores.
SummEval

Bitext Mining

F1 · 2 datasets

Find translation pairs between two sets of sentences in different languages.

Example
EN: "The cat sat on the mat."
DE: "Die Katze saß auf der Matte."
Cross-lingual embeddings must place translations closer than non-translation pairs.
How it works: Embed sentences in both languages. Match each source sentence to its nearest neighbor in the target language. Evaluate with F1.
Tatoeba · BUCC
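The nearest-neighbor matching step described above is easy to sketch: normalize both sides, match each source sentence to its most similar target sentence, and score the mined pairs with F1. The 2-d vectors are illustrative stand-ins for real cross-lingual embeddings.

```python
import numpy as np

def mine_bitext(src_vecs, tgt_vecs, threshold=0.5):
    """Match each source sentence to its nearest target-language neighbor by
    cosine similarity; keep only pairs above a similarity threshold."""
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = s @ t.T
    pairs = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((i, j))
    return pairs

def f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy cross-lingual embeddings: EN sentences 0,1 translate to DE sentences 1,0
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[0.1, 0.9], [0.9, 0.1]])
pairs = mine_bitext(src, tgt)
print(pairs, f1(pairs, [(0, 1), (1, 0)]))  # [(0, 1), (1, 0)] 1.0
```

Production bitext mining typically adds margin-based scoring rather than a raw threshold, but the nearest-neighbor core is the same.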

MTEB Leaderboard

15 models ranked by average score across all English tasks. Updated 2025-12-26.

Full leaderboard on HuggingFace →
| # | Model | Type | Avg | Retrieval | Class. | Cluster. | STS | Rerank | Dims | Params |
|---|-------|------|-----|-----------|--------|----------|-----|--------|------|--------|
| 1 | KaLM-Embedding-Gemma3-12B (KaLM-ai) | Open Source | 72.32 | 75.7 | 77.9 | 55.8 | 79.0 | 67.3 | 3840 | 11.76B |
| 2 | Qwen3-Embedding-8B (Qwen / Alibaba) | Open Source | 70.58 | 70.9 | 74.0 | 57.6 | 81.1 | 65.6 | 4096 | 8B |
| 3 | Seed1.6-embedding-1215 (ByteDance) | API | 70.26 | 66.0 | 76.8 | 56.8 | 75.9 | 66.2 | 1536 | — |
| 4 | llama-embed-nemotron-8b | Open Source | 69.46 | 68.7 | 73.2 | 54.4 | 79.4 | 67.8 | 4096 | 8B |
| 5 | Qwen3-Embedding-4B (Qwen / Alibaba) | Open Source | 69.45 | 69.6 | 72.3 | 57.1 | 80.9 | 65.1 | 2560 | 4B |
| 6 | gemini-embedding-001 (Google) | API | 68.37 | 67.7 | 71.8 | 54.6 | 79.4 | 65.6 | 3072 | — |
| 7 | — | Open Source | 67.85 | 71.7 | 66.7 | 55.7 | 81.3 | 67.6 | 4096 | 8B |
| 8 | Qwen3-Embedding-0.6B (Qwen / Alibaba) | Open Source | 64.34 | 64.7 | 66.8 | 52.3 | 76.2 | 61.4 | 1024 | 0.6B |
| 9 | multilingual-e5-large | Open Source | 63.22 | 57.1 | 64.9 | 50.8 | 76.8 | 62.6 | 1024 | 560M |
| 10 | — | Open Source | 62.51 | 60.1 | 61.5 | 52.8 | 74.0 | 65.5 | 3584 | 7B |
| 11 | text-multilingual-embedding-002 (Google) | API | 62.16 | 59.7 | 64.6 | 47.8 | 76.1 | 61.2 | 768 | — |
| 12 | bge-m3 (BAAI) | Open Source | 59.56 | 57.9 | 62.3 | 48.2 | 74.5 | 56.8 | 1024 | 568M |
| 13 | text-embedding-3-large (OpenAI) | API | 58.96 | 56.1 | 62.5 | 45.2 | 72.5 | 54.1 | 3072 | — |
| 14 | voyage-3.5 (Voyage AI) | API | 58.46 | 55.9 | 61.8 | 44.6 | 71.9 | 53.5 | 1024 | — |
| 15 | jina-embeddings-v3 | Open Source | 58.37 | 54.5 | 61.2 | 43.8 | 71.3 | 52.9 | 1024 | 570M |

SOTA Progress: 2019 to 2025

From Sentence-BERT's first dedicated sentence embeddings to today's 12B-parameter models scoring 72+. The evolution tracks three eras: encoder-only (BERT/RoBERTa fine-tuning), instruction-tuned (E5, BGE), and LLM-based (Qwen3, KaLM).

2019 · Sentence-BERT · ~51 · Encoder-only

Reimers & Gurevych show that BERT with siamese fine-tuning creates meaningful sentence embeddings. The field is born.

2021 · SimCSE · ~54 · Encoder-only

Contrastive learning on unsupervised data (dropout as augmentation) pushes STS scores without labeled data.

2022 · E5-base · ~57 · Encoder-only

Microsoft shows that weakly-supervised contrastive pre-training on massive web data creates superior embeddings. MTEB paper published.

2023 · bge-large-en-v1.5 · ~60 · Instruction-tuned

BAAI's BGE family takes the lead with instruction-following and hard negative mining. Open-source catches up to OpenAI.

2024 Q1 · E5-Mistral-7B · ~62 · LLM-based

Microsoft proves that LLM backbones (Mistral-7B) create better embeddings than encoder-only models. A paradigm shift.

2024 Q2 · gte-Qwen2-7B · ~63 · LLM-based

Alibaba shows that Qwen2 backbone with GTE training matches E5-Mistral. LLM-based embeddings become the norm.

2024 Q4 · bge-m3 / Jina v3 · ~59 · Instruction-tuned

Multi-granularity (dense + sparse + colbert) and task-LoRA adapters emerge as efficiency-focused alternatives.

2025 Q1 · Qwen3-Embedding-8B · ~70 · LLM-based

Qwen3 family dominates with multi-task training across embedding + reranking tasks. First models to consistently break 70.

2025 Q2 · KaLM-Gemma3-12B · 72.32 · LLM-based

KaLM fine-tunes Gemma3-12B with contrastive learning to set the current SOTA. Open-source leads over all APIs.

Accuracy vs. Model Size

The MTEB leaderboard reveals a clear trend: LLM-based embeddings dominate, but efficiency varies wildly. Qwen3-Embedding-0.6B scores 64.34 with just 600M parameters, while KaLM-Gemma3-12B needs 12B for 72.32. The score-per-parameter efficiency matters for production deployments.

Efficiency Leaders

Qwen3-Embedding-0.6B · 0.6B params · 64.34 avg · 107.2 pts/B
multilingual-e5-large · 560M params · 63.22 avg · 112.9 pts/B
bge-m3 · 568M params · 59.56 avg · 104.9 pts/B
jina-embeddings-v3 · 570M params · 58.37 avg · 102.4 pts/B

Absolute Performance Leaders

KaLM-Gemma3-12B · 12B params · 72.32 avg · 3840d vectors
Qwen3-Embedding-8B · 8B params · 70.58 avg · 4096d vectors
Seed1.6-embedding · API · 70.26 avg · 1536d vectors
llama-embed-nemotron-8b · 8B params · 69.46 avg · 4096d vectors

The LLM Embedding Revolution

Before 2024, embedding models were small encoder-only transformers: BERT, RoBERTa, XLM-R. They maxed out around 560M parameters and scored ~60 on MTEB. Then researchers discovered that decoder-only LLMs make better embedding backbones.

E5-Mistral proved it first: take Mistral-7B, add contrastive fine-tuning, and you get embeddings that crush all encoder-only models. Now every top-5 model uses an LLM backbone: Gemma3, Qwen3, LLaMA. The old BERT-based paradigm is over for high-performance embeddings.

Open Source vs. API: The Gap Closed

In 2023, OpenAI's text-embedding-3-large was considered best-in-class. Today it ranks 13th on MTEB with 58.96, behind nine open-source models. The open-source community has completely overtaken proprietary APIs.

  • KaLM-Gemma3-12B (open): 72.32 — beats all APIs by 2+ points
  • gemini-embedding-001 (API): 68.37 — best API, but 6th overall
  • text-embedding-3-large (API): 58.96 — once the king, now 13th
  • Qwen3-0.6B (open, tiny): 64.34 — a 600M model beats OpenAI

Run MTEB Yourself

MTEB is fully open-source. Install it, pick a model, and benchmark it against the entire suite in a single script. Results are automatically formatted for submission to the HuggingFace leaderboard.

Full MTEB Evaluation

Python
# Install first: pip install mteb sentence-transformers

# Run full English benchmark
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Run all English tasks (56+ datasets)
evaluation = mteb.MTEB(tasks=mteb.get_tasks(languages=["eng"]))
results = evaluation.run(model, output_folder="results/qwen3-0.6b")

# Or run specific task types
retrieval_tasks = mteb.get_tasks(
    languages=["eng"],
    task_types=["Retrieval"]
)
evaluation = mteb.MTEB(tasks=retrieval_tasks)
results = evaluation.run(model, output_folder="results/retrieval-only")

Quick Start: Use Top Models

Python
# Option 1: SOTA (KaLM-Gemma3-12B — needs ~24GB VRAM)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("KaLM-ai/KaLM-Embedding-Gemma3-12B-2511")
embeddings = model.encode(
    ["What is machine learning?", "ML is a subset of AI."],
    normalize_embeddings=True,  # unit vectors, so the dot product below is cosine similarity
)
print(f"Similarity: {embeddings[0] @ embeddings[1]:.4f}")

# Option 2: Best bang-for-buck (Qwen3-0.6B — runs on CPU!)
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")
embeddings = model.encode(["Hello world", "Bonjour le monde"])

# Option 3: Production serving with HuggingFace TEI
# docker run --gpus all -p 8080:80 \
#   ghcr.io/huggingface/text-embeddings-inference:latest \
#   --model-id Qwen/Qwen3-Embedding-0.6B

Key Papers

Essential reading for understanding MTEB and modern text embeddings.

MTEB: Massive Text Embedding Benchmark
Muennighoff, Tazi, Magne, Reimers | EACL 2023 | 1,200+ citations
Original benchmark paper
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
Reimers, Gurevych | EMNLP 2019 | 8,000+ citations
Foundation of modern embeddings
Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5)
Wang, Yang, Wei, et al. | arXiv 2022 | 1,500+ citations
E5 embedding family
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity
Chen, Xiao, Zhang, et al. | ACL 2024 Findings | 600+ citations
Multi-granularity retrieval
Improving Text Embeddings with Large Language Models (E5-Mistral)
Wang, Yang, Wei, et al. | ACL 2024 | 500+ citations
LLM-based embeddings
Jina Embeddings v3: Task-LoRA Adapters for Multi-Task Embeddings
Sturua, Mohr, et al. | arXiv 2024 | 100+ citations
Task-specific adapters
GTE: General Text Embeddings
Li, Zhang, et al. | arXiv 2023 | 400+ citations
Alibaba GTE family
Qwen3 Technical Report
Qwen Team | arXiv 2025 | 50+ citations
Current top open-source family

Code & Implementations

Open-source repositories for training, evaluating, and serving embedding models.

MTEB vs. Other Embedding Benchmarks

| Benchmark | Tasks | Datasets | Focus | Year |
|-----------|-------|----------|-------|------|
| MTEB | 8 | 56+ | Comprehensive embedding evaluation | 2022 |
| BEIR | 1 | 18 | Zero-shot retrieval only | 2021 |
| SentEval | 4 | 17 | Sentence representation probing | 2018 |
| USEB | 4 | 8 | Unified sentence embedding eval | 2022 |
| KILT | 1 | 11 | Knowledge-intensive language tasks | 2021 |
| AIR-Bench | 2 | 24 | Automated IR benchmark (LLM-judged) | 2024 |

Understanding the Metrics

NDCG@10 · Retrieval

Normalized Discounted Cumulative Gain at rank 10. Measures how well the model ranks relevant documents in the top 10 results, with higher positions weighted more heavily.

NDCG@10 = DCG@10 / IDCG@10
DCG@10 = Σ(rel_i / log2(i+1))

A score of 1.0 means all relevant documents appear at the top. Most models score 0.4-0.7, reflecting the difficulty of zero-shot retrieval.
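The two formulas above translate directly into code. This sketch follows the document's DCG definition, rel_i / log2(i + 1) with positions starting at 1, and uses a toy binary relevance list:

```python
from math import log2

def dcg_at_k(relevances, k=10):
    """DCG@k = sum of rel_i / log2(i + 1), positions i starting at 1."""
    return sum(rel / log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(ranked_relevances, k=10):
    ideal = sorted(ranked_relevances, reverse=True)   # best possible ordering
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

# One relevant document (rel=1) ranked 3rd among the returned results:
print(ndcg_at_k([0, 0, 1, 0]))  # 0.5, i.e. 1/log2(4) against an ideal of 1/log2(2)
```

The logarithmic discount is why a relevant document at rank 3 is worth half of one at rank 1.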

Spearman ρ · STS

Spearman rank correlation between model cosine similarities and human similarity judgments. Measures whether the model's relative ordering of sentence pairs matches human intuition.

ρ = 1 - (6 * Σd_i²) / (n(n²-1))
where d_i = rank difference for pair i

Spearman correlation of 0.80+ indicates strong alignment with human judgment. Top models now exceed 0.81.
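The rank-difference formula above can be sketched in pure Python. Note the formula assumes no tied values; ties need average ranks and the Pearson-on-ranks form instead. The similarity and human-score numbers are illustrative:

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation via rho = 1 - 6 * sum(d_i^2) / (n (n^2 - 1)).
    Assumes no tied values."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Model cosine similarities vs human STS scores (illustrative numbers)
cosine = [0.95, 0.41, 0.12, 0.77]
human = [5.0, 1.6, 0.5, 4.2]
print(spearman_rho(cosine, human))  # 1.0: identical ordering, perfect rank correlation
```

Because only the ordering matters, a model can score well on STS even if its raw similarities are compressed into a narrow range.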


When to Use Embeddings

Text embeddings convert language into dense vectors that capture semantic meaning. Here are the primary use cases where MTEB-benchmarked models excel.

Semantic Document Search

Find relevant documents by meaning, not just keyword overlap. Embeddings enable natural-language queries over large corpora.

RAG Retrieval

Retrieve context chunks for LLM generation. Embedding quality directly determines answer accuracy in retrieval-augmented pipelines.

Duplicate Detection

Identify near-duplicate content, support tickets, or records using cosine similarity between embedding pairs.

Clustering & Topic Modeling

Group documents by semantic similarity. Embeddings provide dense features for k-means, HDBSCAN, or topic extraction.

Architecture Patterns

Three common approaches to generating embeddings in production, each with distinct trade-offs.

Sentence Transformers

Models trained specifically for sentence and paragraph embedding. Run locally with full control.

Pros

  • Optimized for retrieval, fast inference
  • Many specialized variants available

Cons

  • Fixed context length
  • May need domain fine-tuning

LLM Embeddings via API

Embedding endpoints from OpenAI, Cohere, Voyage, and others. Zero infrastructure to manage.

Pros

  • High quality, long context
  • No infrastructure to maintain

Cons

  • Cost per token
  • Data leaves your system

Sparse + Dense Hybrid

Combine BM25 with dense embeddings for better recall. Best of both worlds for production search.

Pros

  • Handles exact matches well
  • More robust for rare terms

Cons

  • More complex pipeline
  • Two indices to maintain
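One simple way to combine the two signals is score fusion: min-max normalize the BM25 and dense scores separately, then take a weighted sum. This is a minimal sketch of that idea (the scores are made up; reciprocal rank fusion is a popular alternative):

```python
import numpy as np

def hybrid_scores(bm25, dense, alpha=0.5):
    """Min-max normalize each score list, then blend: alpha*dense + (1-alpha)*BM25."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        return (x - x.min()) / span if span > 0 else np.zeros_like(x)
    return alpha * minmax(dense) + (1 - alpha) * minmax(bm25)

# Doc 0: exact keyword match (high BM25); doc 2: semantic match (high dense score)
bm25_scores = [12.3, 1.1, 0.4]
dense_scores = [0.55, 0.20, 0.83]
combined = hybrid_scores(bm25_scores, dense_scores)
print(np.argsort(-combined))  # [0 2 1]: both the exact and the semantic match outrank doc 1
```

Normalizing before blending matters because BM25 scores are unbounded while cosine similarities live in [-1, 1]; without it one signal dominates.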

Quick Start Code

Get started with embeddings in minutes. Two approaches: hosted API or local model.

OpenAI API

pip install openai

from openai import OpenAI

client = OpenAI()

def get_embedding(text: str, model='text-embedding-3-large'):
    response = client.embeddings.create(input=text, model=model)
    return response.data[0].embedding

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = [get_embedding(doc) for doc in documents]
print(f'Embedding dimension: {len(embeddings[0])}')

Local with Sentence Transformers

pip install sentence-transformers numpy

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('BAAI/bge-large-en-v1.5')

documents = [
    'The cat sat on the mat',
    'A dog played in the park',
    'Machine learning is fascinating'
]

embeddings = model.encode(documents, normalize_embeddings=True)

query = 'pets resting at home'
query_embedding = model.encode(query, normalize_embeddings=True)

similarities = np.dot(embeddings, query_embedding)
for doc, sim in zip(documents, similarities):
    print(f'{sim:.3f}: {doc}')

Track More Benchmarks

MTEB is one of many benchmarks we track. Explore our full catalog of NLP, computer vision, and reasoning benchmarks with live leaderboards.