Academy — Level 0: Foundations

What is an Embedding?

How neural networks convert text into numbers — and why those numbers capture meaning. From theory to a real MTEB benchmark contribution.

Target Benchmark: MTEB (Massive Text Embedding Benchmark)
Reproduce target: 51.68 (BAAI/bge-small-en-v1.5)
Task categories evaluated: 8
Estimated lesson time: ~45 min

The Problem

Computers work with numbers. Neural networks are matrix multiplications — they can only process numerical vectors. But most real-world data is not numbers: text, images, audio.

We need a way to represent "cat" as numbers where similar concepts get similar numbers. ASCII codes won't work — "cat" and "kitten" would be far apart. That's what embeddings solve.

neural_network("cat")
# Error: expected tensor, got string

neural_network([99, 97, 116])
# Works, but ASCII codes have no meaning

neural_network(embedding("cat"))
# ✓ Learned representation that captures meaning

How Embeddings Work

An embedding is a learned lookup table combined with a neural network transformation. Three steps turn text into vectors.

Step 1

Tokenization

Text is split into subword tokens from a fixed vocabulary.

"cat" → [2368]
"unbelievable" → [348, 12871, 481]
"café" → [7467, 2634]
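The split above can be sketched with a greedy longest-match tokenizer. The vocabulary and IDs below are made up for illustration; real tokenizers (BPE, WordPiece) learn their vocabularies from data and use different IDs.

```python
# Toy greedy longest-match subword tokenizer.
# Vocabulary and IDs are invented for illustration only.
TOY_VOCAB = {"un": 1, "believ": 2, "able": 3, "cat": 4}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocabulary can't cover

    return ids

print(tokenize("cat"))           # [4]
print(tokenize("unbelievable"))  # [1, 2, 3]
```

Real vocabularies hold tens of thousands of pieces, so almost any string decomposes into known subwords.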
Step 2

Embedding Lookup

Each token ID maps to a row in a learned matrix — the embedding table.

# 50,000 tokens × 768 dimensions
table.shape = (50000, 768)
vec = table[2368]  # Shape: (768,)
Step 3

Transformer Processing

Attention layers let each token "look at" other tokens. Then pool to a single vector.

for layer in transformer:
    x = layer.attention(x)
    x = layer.feedforward(x)
output = mean(x)  # (768,)
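Steps 2 and 3 can be sketched end-to-end in numpy. This is a simplification: the table is random rather than learned, the token IDs are illustrative, and the attention layers are skipped entirely, leaving just lookup and mean pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: a toy embedding table (50,000 tokens x 768 dims, randomly
# initialized here; in a real model these weights are learned)
table = rng.standard_normal((50000, 768))

# Output of step 1: token IDs for an example sentence (illustrative)
token_ids = [2368, 348, 481]

# Lookup: each ID selects one row of the table
token_vecs = table[token_ids]           # shape (3, 768)

# Step 3, simplified: skip attention and mean-pool into one vector
sentence_vec = token_vecs.mean(axis=0)  # shape (768,)

print(token_vecs.shape, sentence_vec.shape)  # (3, 768) (768,)
```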

How Training Creates Meaning

The weights start random. Contrastive learning adjusts them: similar sentences should have similar vectors, dissimilar sentences should be far apart.

After millions of pairs, meaning emerges. Dimension 42 doesn't mean "animal-ness" — the representation is whatever helps the model distinguish similar from dissimilar text.
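A common form of this objective is an InfoNCE-style loss with in-batch negatives. The sketch below computes it forward-only on random vectors; a real trainer backpropagates this loss through the model to update the weights.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss over a batch of (anchor, positive) pairs.

    Each anchor's positive is the matching row of `positives`; all other
    rows in the batch act as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (the matching pair) as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 32))
loss_random = info_nce_loss(a, rng.standard_normal((4, 32)))
loss_aligned = info_nce_loss(a, a)  # identical pairs: near-minimal loss
print(f"aligned: {loss_aligned:.3f}  random: {loss_random:.3f}")
```

The loss drops as matching pairs become more similar than non-matching ones, which is exactly the pressure that makes "meaning emerge" in the learned vectors.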

Cosine Similarity

Measures the angle between vectors. 1 = identical, 0 = unrelated, -1 = opposite.

"The cat sat on the mat"
vs "A feline rested on the rug"
→ cosine similarity: 0.75 (similar)

"The cat sat on the mat"
vs "Stock prices rose sharply"
→ cosine similarity: 0.36 (unrelated)

Real values from BAAI/bge-small-en-v1.5. Single words separate less cleanly than sentences.
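Cosine similarity is simple to compute yourself: the dot product of the two vectors divided by the product of their lengths. A minimal numpy sketch:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # ≈ 1.0  (same direction)
print(cosine_similarity(a, -a))                           # ≈ -1.0 (opposite)
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))   # ≈ 0.0  (orthogonal)
```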

Static vs Contextual

There are two fundamentally different types of embeddings:

Static (Word2Vec, GloVe)

One vector per word. "bank" always gets the same embedding regardless of context.

"river bank" → bank = [0.2, 0.4, ...]
"bank account" → bank = [0.2, 0.4, ...]
# Same! Can't distinguish.

Contextual (BERT, Transformers)

Different vector based on surrounding words. This is what modern models use.

"river bank" → bank = [0.8, 0.1, ...]
"bank account" → bank = [0.1, 0.9, ...]
# Different! Context-aware.

See It In Action

Real embeddings have 768+ dimensions. Below, we project them to 2D using t-SNE so you can see clustering patterns.

Note: 2D projection distorts distances. Points that look far apart in 2D might be close in 768D.
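The projection itself is a one-liner with scikit-learn's TSNE. The sketch below uses random vectors drawn around three cluster centers as stand-ins for real embeddings; the cluster structure survives the trip down to 2D.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 12 points in 768 dimensions,
# drawn around three well-separated cluster centers
centers = rng.standard_normal((3, 768)) * 5
points = np.vstack([c + rng.standard_normal((4, 768)) for c in centers])

# Project to 2D; perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(points)
print(coords.shape)  # (12, 2)
```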

Word Embedding Space

[Figure: word embeddings projected to 2D (Dimension 1 vs Dimension 2). Four clusters appear: Animals (cat, dog, bird, fish), Vehicles (car, truck, bus, bike), Food (apple, banana, orange), and Royalty (king, queen, prince).]

Click on any word to see its nearest neighbors in the embedding space. Similar words cluster together!

Working Code

Copy-paste ready. Install with pip install sentence-transformers.

embed_and_compare.py (Python 3.10+)
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model (downloads ~90MB first time)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Generate embeddings
texts = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "Stock markets closed higher on Friday"
]

embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)

print(f"Similar sentences: {sims[0][1]:.3f}")   # ~0.83
print(f"Unrelated topics:  {sims[0][2]:.3f}")   # ~0.37
Expected output:
Shape: (3, 384)
Similar sentences: 0.828
Unrelated topics:  0.366

Actual values from BAAI/bge-small-en-v1.5 on Apple M-series.

How Embedding Models Are Benchmarked

The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task categories using curated "golden" datasets — collections where humans have labeled the correct answers.

The Golden Dataset Pattern: STS Benchmark

8,628 sentence pairs scored by humans from 0 (unrelated) to 5 (identical meaning)

Sentence A                   | Sentence B                    | Human Score
"A plane is taking off"      | "An air plane is taking off"  | 5.0
"A man is playing a guitar"  | "A man is playing a flute"    | 2.2
"The woman is dancing"       | "A man is riding a horse"     | 0.4

The benchmark measures how well your model's cosine similarity correlates with human judgments (Spearman rank correlation).
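That correlation is one scipy call. Below, the human scores come from the table above; the model similarities are hypothetical values a model might produce for the same pairs.

```python
from scipy.stats import spearmanr

# Human similarity ratings for the three sentence pairs (0-5 scale)
human_scores = [5.0, 2.2, 0.4]

# Hypothetical cosine similarities a model might produce for the same pairs
model_sims = [0.92, 0.41, 0.10]

# Spearman correlation compares rankings, not raw values, so the
# different scales (0-5 vs cosine's -1..1) don't matter
rho, _ = spearmanr(human_scores, model_sims)
print(f"{rho:.2f}")  # 1.00: the model ranks the pairs exactly as humans did
```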
Retrieval (metric: NDCG@10)

Given a query, find relevant documents from a corpus. 15 datasets.

STS (metric: Spearman ρ)

Score sentence pair similarity against human ratings. 10 datasets.

Classification (metric: Accuracy)

Assign text to the correct category using embedding kNN. 12 datasets.

Clustering (metric: V-measure)

Group similar texts without predefined labels. 11 datasets.

Reranking (metric: MAP)

Reorder candidate documents by relevance to a query. 4 datasets.

Pair Classification (metric: AP)

Detect paraphrases, duplicates, and entailment. 3 datasets.

The Overall Score

MTEB averages across all task categories. For BAAI/bge-small-en-v1.5, that's 51.68:

bge-small-en-v1.5 scores by category:
  Retrieval:        47.8   (NDCG@10 across 15 datasets)
  STS:              77.4   (Spearman on 10 datasets)
  Classification:   63.2   (Accuracy on 12 datasets)
  Clustering:       37.1   (V-measure on 11 datasets)
  ...
  ──────────────────────────────────────
  Overall MTEB avg: 51.68  (each category weighted equally)
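The averaging itself is a macro-average over categories. The sketch below uses only the four category scores listed above (the remaining categories are omitted here), so its result will not equal the full 51.68.

```python
# Category averages from the breakdown above (remaining categories omitted)
category_scores = {
    "Retrieval": 47.8,
    "STS": 77.4,
    "Classification": 63.2,
    "Clustering": 37.1,
}

# Macro-average: each *category* counts once, regardless of how many
# datasets it contains
overall = sum(category_scores.values()) / len(category_scores)
print(round(overall, 3))  # 56.375 for these four categories alone
```

Weighting categories equally (rather than datasets) keeps a category with many datasets, like Retrieval, from dominating the overall score.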

MTEB Leaderboard (Excerpt)

Selected models. Full leaderboard →

#  | Model                     | Org                   | MTEB Avg | Params | Dims | Type
1  | KaLM-Embedding-Gemma3-12B | Tencent               | 72.32    | 12B    | 3840 | Open Source
2  | Qwen3-Embedding-8B        | Alibaba               | 70.58    | 8B     | 4096 | Open Source
3  | llama-embed-nemotron-8b   | NVIDIA                | 69.46    | 8B     | 4096 | Open Source
4  | Qwen3-Embedding-4B        | Alibaba               | 69.45    | 4B     | 4096 | Open Source
5  | gemini-embedding-001      | Google                | 68.37    | n/a    | 3072 | API
6  | stella_en_1.5B_v5         | dunzhang              | 66.2     | 1.5B   | 8192 | Open Source
7  | Qwen3-Embedding-0.6B      | Alibaba               | 64.34    | 0.6B   | 1024 | Open Source
8  | text-embedding-3-large    | OpenAI                | 64.6     | n/a    | 3072 | API
9  | bge-large-en-v1.5         | BAAI                  | 64.23    | 326M   | 1024 | Open Source
10 | jina-embeddings-v3        | Jina AI               | 62.5     | 570M   | 1024 | Open Source
11 | all-MiniLM-L6-v2          | Sentence Transformers | 56.3     | 22M    | 384  | Open Source
12 | bge-small-en-v1.5         | BAAI                  | 51.68    | 33M    | 384  | Open Source

BAAI/bge-small-en-v1.5 (highlighted) is your reproduce target for this lesson.

Stage 1: Reproduce

Replicate BAAI/bge-small-en-v1.5 on MTEB

Run the model through the full MTEB evaluation pipeline and reproduce its published average score of 51.68.

Install
pip install mteb sentence-transformers
Model size
33M params (~90MB download)
Compute time
~2 hours on CPU, ~20 min GPU
reproduce_mteb.py
import mteb

# Load the target model (mteb wrapper handles setup)
model = mteb.get_model("BAAI/bge-small-en-v1.5")

# Select all English tasks
tasks = mteb.get_tasks(languages=["eng"])

# Run evaluation (saves results to folder)
results = mteb.evaluate(
    model,
    tasks=tasks,
    output_folder="results/bge-small"
)

# Results are returned per task — inspect them
for task_result in results:
    print(f"{task_result.task_name}: {task_result.get_score():.4f}")

# Overall MTEB average: ~51.68

Target: Your score should be within ±0.5 of 51.68. Small differences are normal due to hardware, library versions, and evaluation subset selection. Save your results folder — you'll need it for the submission.

Stage 2: Improve

Beat 51.68 on MTEB

Now that you understand the pipeline, improve on it. There's no single right answer — this is real research.

Fine-tune on domain data

Use contrastive learning with domain-specific pairs (MS MARCO, NQ, NLI data) to adapt the model for specific MTEB task categories.

Try a different base model

bge-base (110M) scores higher than bge-small (33M). Or try instruction-tuned models like e5-large-v2 with query/passage prefixes.

Matryoshka training

Train embeddings that work at multiple dimensions — truncating to fewer dims with minimal quality loss. Pushes the efficiency frontier.
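At inference time, using a Matryoshka embedding at a smaller size is just truncate-and-renormalize. A minimal sketch (the vector here is random; with a Matryoshka-trained model the leading dimensions carry the most information, so the truncated vector stays usable):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

rng = np.random.default_rng(0)
full = rng.standard_normal(768)          # stand-in for a real embedding
small = truncate_embedding(full, 128)    # 6x smaller index, same API

print(small.shape, round(float(np.linalg.norm(small)), 6))  # (128,) 1.0
```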

Task-specific pooling

Instead of mean pooling, try [CLS] token, attention-weighted pooling, or late interaction (ColBERT-style).
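The three single-vector pooling options can be compared in a few lines of numpy. The token vectors and the scoring vector `w` are random stand-ins here; in a real model `w` would be a trained parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 768))  # transformer output: 5 tokens x 768 dims

# Mean pooling: average all token vectors
mean_vec = x.mean(axis=0)

# [CLS] pooling: take only the first token's vector
cls_vec = x[0]

# Attention-weighted pooling: score each token with a vector w
# (random here, trained in practice), softmax, then weighted sum
w = rng.standard_normal(768)
scores = x @ w
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_vec = weights @ x

print(mean_vec.shape, cls_vec.shape, attn_vec.shape)  # all (768,)
```

Late interaction (ColBERT-style) is different in kind: it keeps all token vectors and compares them pairwise at query time instead of pooling to one vector.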

This is real research. If your approach beats the baseline with a novel method, it's a genuine benchmark contribution. Your result goes on the leaderboard.

Submit Your Result

Submit your MTEB evaluation result. Include your code repository so peers can verify your methodology.

Contribute to MTEB

Help us maintain the most accurate benchmark data. Submit new results, report issues, or suggest improvements.

Submit New Results

Share benchmark scores from recent papers or your own experiments

Report Data Issues

Found incorrect scores or broken links? Let us know

Build the Data Flywheel

Your contributions help make CodeSOTA better for everyone

Submit Benchmark Result

Submissions are reviewed manually to ensure data quality. For immediate contributions, consider submitting a pull request on GitHub.


Citations

If you use MTEB in your work, please cite both papers:

MTEB: Massive Text Embedding Benchmark (arXiv, 2022)
@article{muennighoff2022mteb,
  author    = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title     = {MTEB: Massive Text Embedding Benchmark},
  journal   = {arXiv preprint arXiv:2210.07316},
  year      = {2022},
  url       = {https://arxiv.org/abs/2210.07316},
  doi       = {10.48550/ARXIV.2210.07316}
}
MMTEB: Massive Multilingual Text Embedding Benchmark (arXiv, 2025)
@article{enevoldsen2025mmteb,
  author    = {Enevoldsen, Kenneth and Chung, Isaac and {70+ co-authors}},
  title     = {MMTEB: Massive Multilingual Text Embedding Benchmark},
  journal   = {arXiv preprint arXiv:2502.13595},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.13595},
  doi       = {10.48550/arXiv.2502.13595}
}

Help improve this page

Missing a benchmark? Found an error? Have a suggestion? Your feedback feeds the CodeSOTA flywheel.