What is an Embedding?
How neural networks convert text into numbers — and why those numbers capture meaning. From theory to a real MTEB benchmark contribution.
Target Benchmark
The Problem
Computers work with numbers. Neural networks are matrix multiplications — they can only process numerical vectors. But most real-world data is not numbers: text, images, audio.
We need a way to represent "cat" as numbers where similar concepts get similar numbers. ASCII codes won't work — "cat" and "kitten" would be far apart. That's what embeddings solve.
neural_network("cat")
# Error: expected tensor, got string
neural_network([99, 97, 116])
# Works, but ASCII codes have no meaning
neural_network(embedding("cat"))
# ✓ Learned representation that captures meaning
How Embeddings Work
An embedding is a learned lookup table combined with a neural network transformation. Three steps turn text into vectors.
Tokenization
Text is split into subword tokens from a fixed vocabulary.
"cat" → [2368]
"unbelievable" → [348, 12871, 481]
"café" → [7467, 2634]
Embedding Lookup
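A runnable toy version of these first two steps, assuming a hypothetical four-token vocabulary and a random (untrained) table in place of learned weights:

```python
import numpy as np

# Hypothetical 4-token vocabulary; real tokenizers learn ~50,000 subwords
vocab = {"un": 0, "believ": 1, "able": 2, "cat": 3}

# Embedding table: one row per token. Random here; real tables are learned.
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), 8))  # (4 tokens, 8 dims)

def embed_tokens(tokens):
    """Tokenization output (strings) -> token IDs -> rows of the table."""
    ids = [vocab[t] for t in tokens]
    return table[ids]  # shape: (num_tokens, 8)

vecs = embed_tokens(["un", "believ", "able"])
print(vecs.shape)  # (3, 8)
```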
Each token ID maps to a row in a learned matrix — the embedding table.
# 50,000 tokens × 768 dimensions
table.shape = (50000, 768)
vec = table[2368]  # Shape: (768,)
Transformer Processing
Attention layers let each token "look at" other tokens. Then pool to a single vector.
for layer in transformer:
    x = layer.attention(x)
    x = layer.feedforward(x)
output = mean(x)  # (768,)
How Training Creates Meaning
The weights start random. Contrastive learning adjusts them: similar sentences should have similar vectors, dissimilar sentences should be far apart.
After millions of pairs, meaning emerges. Dimension 42 doesn't mean "animal-ness" — the representation is whatever helps the model distinguish similar from dissimilar text.
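The contrastive objective can be sketched as an InfoNCE-style loss with in-batch negatives, here on toy NumPy vectors (real training backpropagates this loss through the full transformer):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """Each anchor should match its own positive; the other rows in the
    batch act as negatives. Lower loss = better separation."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) cosine similarities
    # cross-entropy with the diagonal (matching pair) as the correct class
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))
loss_aligned = info_nce(a, a)                       # positives identical to anchors
loss_random = info_nce(a, rng.normal(size=(4, 8)))  # unrelated "positives"
print(loss_aligned < loss_random)                   # aligned pairs give lower loss
```

Training nudges the network weights in whatever direction shrinks this loss, which is exactly the "similar together, dissimilar apart" pressure described above.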
Cosine Similarity
Measures the angle between vectors. 1 = identical, 0 = unrelated, -1 = opposite.
"The cat sat on the mat"
vs "A feline rested on the rug"
→ cosine similarity: 0.75 (similar)
"The cat sat on the mat"
vs "Stock prices rose sharply"
→ cosine similarity: 0.36 (unrelated)
Real values from BAAI/bge-small-en-v1.5. Single words separate less cleanly than sentences.
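Cosine similarity is just a dot product normalized by vector lengths; a minimal NumPy version:

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product of the two vectors divided by the product of their lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))                           # ≈ 1.0 (identical)
print(cosine_similarity(v, -v))                          # ≈ -1.0 (opposite)
print(cosine_similarity(v, np.array([-2.0, 1.0, 0.0])))  # 0.0 (orthogonal)
```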
Static vs Contextual
There are two fundamentally different types of embeddings:
Static (Word2Vec, GloVe)
One vector per word. "bank" always gets the same embedding regardless of context.
"river bank" → bank = [0.2, 0.4, ...]
"bank account" → bank = [0.2, 0.4, ...]
# Same! Can't distinguish.
Contextual (BERT, Transformers)
Different vector based on surrounding words. This is what modern models use.
"river bank" → bank = [0.8, 0.1, ...]
"bank account" → bank = [0.1, 0.9, ...]
# Different! Context-aware.
See It In Action
Real embeddings have 768+ dimensions. Below, we project them to 2D using t-SNE so you can see clustering patterns.
Note: 2D projection distorts distances. Points that look far apart in 2D might be close in 768D.
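The projection step can be sketched with scikit-learn's TSNE (random stand-in vectors here; in practice you would pass real embeddings from model.encode):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for real 768-dim embeddings (replace with model.encode(...) output)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(50, 768)).astype(np.float32)

# Project to 2D for plotting; perplexity must be smaller than the point count
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)
print(coords.shape)  # (50, 2)
```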
Word Embedding Space
Click on any word to see its nearest neighbors in the embedding space. Similar words cluster together!
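Finding nearest neighbors reduces to ranking every vector by cosine similarity against the query; a self-contained sketch on toy vectors:

```python
import numpy as np

def nearest_neighbors(embeddings, query_idx, k=3):
    """Rank every row by cosine similarity to the query row, excluding itself."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[query_idx]  # cosine against every row at once
    order = np.argsort(-sims)          # most similar first
    return [int(i) for i in order if i != query_idx][:k]

rng = np.random.default_rng(0)
vecs = rng.normal(size=(10, 8))
vecs[7] = vecs[2] + 0.01 * rng.normal(size=8)  # make row 7 nearly identical to row 2

print(nearest_neighbors(vecs, query_idx=2))    # row 7 ranks first
```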
Working Code
Copy-paste ready. Install with pip install sentence-transformers.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load a pre-trained embedding model (downloads ~90MB first time)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# Generate embeddings
texts = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "Stock markets closed higher on Friday",
]
embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}") # (3, 384)
# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)
print(f"Similar sentences: {sims[0][1]:.3f}") # ~0.83
print(f"Unrelated topics: {sims[0][2]:.3f}") # ~0.37

Shape: (3, 384)
Similar sentences: 0.828
Unrelated topics: 0.366
Actual values from BAAI/bge-small-en-v1.5 on Apple M-series.
How Embedding Models Are Benchmarked
The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task categories using curated "golden" datasets — collections where humans have labeled the correct answers.
The Golden Dataset Pattern: STS Benchmark
8,628 sentence pairs scored by humans from 0 (unrelated) to 5 (identical meaning)
| Sentence A | Sentence B | Human Score |
|---|---|---|
| "A plane is taking off" | "An air plane is taking off" | 5.0 |
| "A man is playing a guitar" | "A man is playing a flute" | 2.2 |
| "The woman is dancing" | "A man is riding a horse" | 0.4 |
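An STS-style evaluation compares the model's similarities with human ratings via Spearman rank correlation; a sketch with scipy using hypothetical model scores for the three pairs above:

```python
from scipy.stats import spearmanr

# Human ratings (0-5 scale) and hypothetical model cosine similarities
# for the same three pairs shown above
human_scores = [5.0, 2.2, 0.4]
model_sims = [0.96, 0.55, 0.12]

# Spearman compares rank order only, so the different scales don't matter
rho, _ = spearmanr(human_scores, model_sims)
print(rho)  # 1.0 when the model orders pairs exactly like the annotators
```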
Retrieval: Given a query, find relevant documents from a corpus (15 datasets)
STS: Score sentence pair similarity against human ratings (10 datasets)
Classification: Assign text to correct category using embedding kNN (12 datasets)
Clustering: Group similar texts without predefined labels (11 datasets)
Reranking: Reorder candidate documents by relevance to query (4 datasets)
Pair Classification: Detect paraphrases, duplicates, entailment (3 datasets)
The Overall Score
MTEB averages across all task categories. For BAAI/bge-small-en-v1.5, that's 51.68:
bge-small-en-v1.5 scores by category:
Retrieval: 47.8 (NDCG@10 across 15 datasets)
STS: 77.4 (Spearman on 10 datasets)
Classification: 63.2 (Accuracy on 12 datasets)
Clustering: 37.1 (V-measure on 11 datasets)
...
──────────────────────────────────────
Overall MTEB avg: 51.68 (each category weighted equally)
MTEB Leaderboard (Excerpt)
Selected models. Full leaderboard →
| # | Model | MTEB Avg | Params | Dims | Type |
|---|---|---|---|---|---|
| 1 | KaLM-Embedding-Gemma3-12B Tencent | 72.32 | 12B | 3840 | Open Source |
| 2 | Qwen3-Embedding-8B Alibaba | 70.58 | 8B | 4096 | Open Source |
| 3 | llama-embed-nemotron-8b NVIDIA | 69.46 | 8B | 4096 | Open Source |
| 4 | Qwen3-Embedding-4B Alibaba | 69.45 | 4B | 4096 | Open Source |
| 5 | gemini-embedding-001 Google | 68.37 | — | 3072 | API |
| 6 | stella_en_1.5B_v5 dunzhang | 66.2 | 1.5B | 8192 | Open Source |
| 7 | text-embedding-3-large OpenAI | 64.6 | — | 3072 | API |
| 8 | Qwen3-Embedding-0.6B Alibaba | 64.34 | 0.6B | 1024 | Open Source |
| 9 | bge-large-en-v1.5 BAAI | 64.23 | 326M | 1024 | Open Source |
| 10 | jina-embeddings-v3 Jina AI | 62.5 | 570M | 1024 | Open Source |
| 11 | all-MiniLM-L6-v2 Sentence Transformers | 56.3 | 22M | 384 | Open Source |
| 12 | bge-small-en-v1.5 BAAI | 51.68 | 33M | 384 | Open Source |
BAAI/bge-small-en-v1.5 (highlighted) is your reproduce target for this lesson.
Replicate BAAI/bge-small-en-v1.5 on MTEB
Run the model through the full MTEB evaluation pipeline and reproduce its published average score of 51.68.
pip install mteb sentence-transformers

import mteb
# Load the target model (mteb wrapper handles setup)
model = mteb.get_model("BAAI/bge-small-en-v1.5")
# Select all English tasks
tasks = mteb.get_tasks(languages=["eng"])
# Run evaluation (saves results to folder)
results = mteb.evaluate(
    model,
    tasks=tasks,
    output_folder="results/bge-small"
)
# Results are returned per task — inspect them
for task_result in results:
    print(f"{task_result.task_name}: {task_result.get_score():.4f}")
# Overall MTEB average: ~51.68

Target: Your score should be within ±0.5 of 51.68. Small differences are normal due to hardware, library versions, and evaluation subset selection. Save your results folder — you'll need it for the submission.
Beat 51.68 on MTEB
Now that you understand the pipeline, improve on it. There's no single right answer — this is real research.
Fine-tune on domain data
Use contrastive learning with domain-specific pairs (MS MARCO, NQ, NLI data) to adapt the model for specific MTEB task categories.
Try a different base model
bge-base (110M) scores higher than bge-small (33M). Or try instruction-tuned models like e5-large-v2 with query/passage prefixes.
Matryoshka training
Train embeddings that work at multiple dimensions — truncating to fewer dims with minimal quality loss. Pushes the efficiency frontier.
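At inference time, using a Matryoshka-trained model is just truncate-and-renormalize, sketched here on stand-in vectors:

```python
import numpy as np

def truncate_embeddings(embeddings, dims):
    """Keep the first `dims` dimensions and re-normalize so cosine
    similarity still behaves. Only works well if the model was trained
    with a Matryoshka-style loss on those prefixes."""
    cut = embeddings[:, :dims]
    return cut / np.linalg.norm(cut, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(3, 768))  # stand-ins for Matryoshka-trained embeddings
small = truncate_embeddings(full, 128)
print(small.shape)                     # (3, 128)
print(np.linalg.norm(small, axis=1))   # all ≈ 1.0 after renormalization
```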
Task-specific pooling
Instead of mean pooling, try [CLS] token, attention-weighted pooling, or late interaction (ColBERT-style).
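The pooling options can be compared on a toy matrix of token vectors (random stand-ins; in a real model the attention scores come from a small learned head):

```python
import numpy as np

rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(5, 8))  # 5 token hidden states, 8 dims each

# Mean pooling: average every token vector
mean_pooled = token_vecs.mean(axis=0)

# [CLS] pooling: take only the first token's vector
cls_pooled = token_vecs[0]

# Attention-weighted pooling: softmax over per-token scores
# (random scores here; in practice a small learned head produces them)
scores = rng.normal(size=5)
weights = np.exp(scores) / np.exp(scores).sum()
attn_pooled = weights @ token_vecs

print(mean_pooled.shape, cls_pooled.shape, attn_pooled.shape)  # (8,) (8,) (8,)
```

Late interaction (ColBERT-style) skips pooling entirely and compares token vectors across the query and document, which is a bigger architectural change than swapping the pooling function.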
This is real research. If your approach beats the baseline with a novel method, it's a genuine benchmark contribution. Your result goes on the leaderboard.
Submit Your Result
Submit your MTEB evaluation result. Include your code repository so peers can verify your methodology.
Contribute to MTEB
Help us maintain the most accurate benchmark data. Submit new results, report issues, or suggest improvements.
Submit New Results
Share benchmark scores from recent papers or your own experiments
Report Data Issues
Found incorrect scores or broken links? Let us know
Build the Data Flywheel
Your contributions help make CodeSOTA better for everyone
Submit Benchmark Result
Submissions are reviewed manually to ensure data quality. For immediate contributions, consider submitting a pull request on GitHub.
Key Papers
KaLM-Embedding-V2: Superior Training Techniques and Data Inspire A Versatile Embedding Model
Zhao, Hu, Shan et al. (Tencent)
Current SOTA on MMTEB (72.32)

MTEB: Massive Text Embedding Benchmark (EACL 2023)
Muennighoff et al.
The benchmark itself

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (EMNLP 2019)
Reimers & Gurevych
Foundation for sentence-transformers

C-Pack: Packaged Resources To Advance General Chinese Embedding (2023)
Xiao et al. (BAAI)
BGE embedding models

Matryoshka Representation Learning (NeurIPS 2022)
Kusupati et al.
Flexible-dimension embeddings

MMTEB: Massive Multilingual Text Embedding Benchmark (arXiv 2025)
Enevoldsen, Chung et al.
Multilingual extension (70+ authors)

GitHub Repositories
sentence-transformers
16k stars · UKPLab
The standard library for computing text embeddings. Supports 100+ models.
mteb
2.1k stars · embeddings-benchmark
Official MTEB evaluation toolkit. Run all tasks with one command.
FlagEmbedding
8.5k stars · BAAI
BGE model family. Among the strongest open-source embeddings.
FAISS
33k stars · Meta AI
Production vector search. Billions of vectors with millisecond queries.
Citations
If you use MTEB in your work, please cite both papers:
@article{muennighoff2022mteb,
author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
title = {MTEB: Massive Text Embedding Benchmark},
journal = {arXiv preprint arXiv:2210.07316},
year = {2022},
url = {https://arxiv.org/abs/2210.07316},
doi = {10.48550/ARXIV.2210.07316}
}

@article{enevoldsen2025mmteb,
author = {Enevoldsen, Kenneth and Chung, Isaac and {70+ co-authors}},
title = {MMTEB: Massive Multilingual Text Embedding Benchmark},
journal = {arXiv preprint arXiv:2502.13595},
year = {2025},
url = {https://arxiv.org/abs/2502.13595},
doi = {10.48550/arXiv.2502.13595}
}
Help improve this page
Missing a benchmark? Found an error? Have a suggestion? Your feedback feeds the CodeSOTA flywheel.