Academy — Level 0: Foundations

What is an Embedding?

How neural networks convert text into numbers — and why those numbers capture meaning. From theory to a real MTEB benchmark contribution.

Target Benchmark: MTEB (Massive Text Embedding Benchmark)
Reproduce target: 51.68 (BAAI/bge-small-en-v1.5)
Task categories evaluated: 8
Estimated lesson time: ~45 min

The Problem

Computers work with numbers. Neural networks are matrix multiplications — they can only process numerical vectors. But most real-world data is not numbers: text, images, audio.

We need a way to represent "cat" as numbers where similar concepts get similar numbers. ASCII codes won't work — "cat" and "kitten" would be far apart. That's what embeddings solve.

neural_network("cat")
# Error: expected tensor, got string

neural_network([99, 97, 116])
# Works, but ASCII codes have no meaning

neural_network(embedding("cat"))
# ✓ Learned representation that captures meaning

How Embeddings Work

An embedding is a learned lookup table combined with a neural network transformation. Three steps turn text into vectors.

Step 1

Tokenization

Text is split into subword tokens from a fixed vocabulary.

"cat" → [2368]
"unbelievable" → [348, 12871, 481]
"café" → [7467, 2634]
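The split above can be sketched with a greedy longest-match tokenizer. The vocabulary and IDs below are made up for illustration; real tokenizers (BPE, WordPiece) learn their vocabularies from data and use different IDs.

```python
# Toy greedy longest-match subword tokenizer.
# Vocabulary and IDs are invented for illustration only.
TOY_VOCAB = {"un": 1, "believ": 2, "able": 3, "cat": 4}

def tokenize(text: str) -> list[int]:
    ids = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the vocabulary
        for j in range(len(text), i, -1):
            if text[i:j] in TOY_VOCAB:
                ids.append(TOY_VOCAB[text[i:j]])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocabulary can't cover

    return ids

print(tokenize("cat"))           # [4]
print(tokenize("unbelievable"))  # [1, 2, 3]
```

Real vocabularies hold tens of thousands of pieces, so almost any string decomposes into known subwords.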
Step 2

Embedding Lookup

Each token ID maps to a row in a learned matrix — the embedding table.

# 50,000 tokens × 768 dimensions
table.shape = (50000, 768)
vec = table[2368]  # Shape: (768,)
Step 3

Transformer Processing

Attention layers let each token "look at" other tokens. Then pool to a single vector.

for layer in transformer:
    x = layer.attention(x)
    x = layer.feedforward(x)
output = mean(x)  # (768,)
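Steps 2 and 3 can be sketched end-to-end in numpy. This is a simplification: the table is random rather than learned, the token IDs are illustrative, and the attention layers are skipped entirely, leaving just lookup and mean pooling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: a toy embedding table (50,000 tokens x 768 dims, randomly
# initialized here; in a real model these weights are learned)
table = rng.standard_normal((50000, 768))

# Output of step 1: token IDs for an example sentence (illustrative)
token_ids = [2368, 348, 481]

# Lookup: each ID selects one row of the table
token_vecs = table[token_ids]           # shape (3, 768)

# Step 3, simplified: skip attention and mean-pool into one vector
sentence_vec = token_vecs.mean(axis=0)  # shape (768,)

print(token_vecs.shape, sentence_vec.shape)  # (3, 768) (768,)
```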

How Training Creates Meaning

The weights start random. Contrastive learning adjusts them: similar sentences should have similar vectors, dissimilar sentences should be far apart.

After millions of pairs, meaning emerges. Dimension 42 doesn't mean "animal-ness" — the representation is whatever helps the model distinguish similar from dissimilar text.
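A common form of this objective is an InfoNCE-style loss with in-batch negatives. The sketch below computes it forward-only on random vectors; a real trainer backpropagates this loss through the model to update the weights.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style contrastive loss over a batch of (anchor, positive) pairs.

    Each anchor's positive is the matching row of `positives`; all other
    rows in the batch act as in-batch negatives.
    """
    # L2-normalize so dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (batch, batch) similarity matrix
    # Cross-entropy with the diagonal (the matching pair) as the target
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 32))
loss_random = info_nce_loss(a, rng.standard_normal((4, 32)))
loss_aligned = info_nce_loss(a, a)  # identical pairs: near-minimal loss
print(f"aligned: {loss_aligned:.3f}  random: {loss_random:.3f}")
```

The loss drops as matching pairs become more similar than non-matching ones, which is exactly the pressure that makes "meaning emerge" in the learned vectors.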

Cosine Similarity

Measures the angle between vectors. 1 = identical, 0 = unrelated, -1 = opposite.

"The cat sat on the mat"
vs "A feline rested on the rug"
→ cosine similarity: 0.75 (similar)

"The cat sat on the mat"
vs "Stock prices rose sharply"
→ cosine similarity: 0.36 (unrelated)

Real values from BAAI/bge-small-en-v1.5. Single words separate less cleanly than sentences.
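Cosine similarity is simple to compute yourself: the dot product of the two vectors divided by the product of their lengths. A minimal numpy sketch:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, a))                            # ≈ 1.0  (same direction)
print(cosine_similarity(a, -a))                           # ≈ -1.0 (opposite)
print(cosine_similarity(a, np.array([3.0, 0.0, -1.0])))   # ≈ 0.0  (orthogonal)
```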

Static vs Contextual

There are two fundamentally different types of embeddings:

Static (Word2Vec, GloVe)

One vector per word. "bank" always gets the same embedding regardless of context.

"river bank" → bank = [0.2, 0.4, ...]
"bank account" → bank = [0.2, 0.4, ...]
# Same! Can't distinguish.

Contextual (BERT, Transformers)

Different vector based on surrounding words. This is what modern models use.

"river bank" → bank = [0.8, 0.1, ...]
"bank account" → bank = [0.1, 0.9, ...]
# Different! Context-aware.

See It In Action

Real embeddings have 768+ dimensions. Below, we project them to 2D using t-SNE so you can see clustering patterns.

Note: 2D projection distorts distances. Points that look far apart in 2D might be close in 768D.
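The projection itself is a one-liner with scikit-learn's TSNE. The sketch below uses random vectors drawn around three cluster centers as stand-ins for real embeddings; the cluster structure survives the trip down to 2D.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Stand-in for real embeddings: 12 points in 768 dimensions,
# drawn around three well-separated cluster centers
centers = rng.standard_normal((3, 768)) * 5
points = np.vstack([c + rng.standard_normal((4, 768)) for c in centers])

# Project to 2D; perplexity must be smaller than the number of points
coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(points)
print(coords.shape)  # (12, 2)
```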

Word Embedding Space

[Figure: word embeddings projected to 2D (Dimension 1 vs Dimension 2). Four clusters appear: Animals (cat, dog, bird, fish), Vehicles (car, truck, bus, bike), Food (apple, banana, orange), and Royalty (king, queen, prince).]

Click on any word to see its nearest neighbors in the embedding space. Similar words cluster together!

Working Code

Copy-paste ready. Install with pip install sentence-transformers.

embed_and_compare.py (Python 3.10+)
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a pre-trained embedding model (downloads ~90MB first time)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')

# Generate embeddings
texts = [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps above a sleepy canine",
    "Stock markets closed higher on Friday"
]

embeddings = model.encode(texts)
print(f"Shape: {embeddings.shape}")  # (3, 384)

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
sims = cosine_similarity(embeddings)

print(f"Similar sentences: {sims[0][1]:.3f}")   # ~0.83
print(f"Unrelated topics:  {sims[0][2]:.3f}")   # ~0.37
Expected output:
Shape: (3, 384)
Similar sentences: 0.828
Unrelated topics:  0.366

Actual values from BAAI/bge-small-en-v1.5 on Apple M-series.

How Embedding Models Are Benchmarked

The Massive Text Embedding Benchmark (MTEB) evaluates models across 8 task categories using curated "golden" datasets — collections where humans have labeled the correct answers.

The Golden Dataset Pattern: STS Benchmark

8,628 sentence pairs scored by humans from 0 (unrelated) to 5 (identical meaning)

Sentence A                   | Sentence B                    | Human Score
"A plane is taking off"      | "An air plane is taking off"  | 5.0
"A man is playing a guitar"  | "A man is playing a flute"    | 2.2
"The woman is dancing"       | "A man is riding a horse"     | 0.4

The benchmark measures how well your model's cosine similarity correlates with human judgments (Spearman rank correlation).
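That correlation is one scipy call. Below, the human scores come from the table above; the model similarities are hypothetical values a model might produce for the same pairs.

```python
from scipy.stats import spearmanr

# Human similarity ratings for the three sentence pairs (0-5 scale)
human_scores = [5.0, 2.2, 0.4]

# Hypothetical cosine similarities a model might produce for the same pairs
model_sims = [0.92, 0.41, 0.10]

# Spearman correlation compares rankings, not raw values, so the
# different scales (0-5 vs cosine's -1..1) don't matter
rho, _ = spearmanr(human_scores, model_sims)
print(f"{rho:.2f}")  # 1.00: the model ranks the pairs exactly as humans did
```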
Retrieval (metric: NDCG@10)

Given a query, find relevant documents from a corpus. 15 datasets.

STS (metric: Spearman ρ)

Score sentence pair similarity against human ratings. 10 datasets.

Classification (metric: Accuracy)

Assign text to the correct category using embedding kNN. 12 datasets.

Clustering (metric: V-measure)

Group similar texts without predefined labels. 11 datasets.

Reranking (metric: MAP)

Reorder candidate documents by relevance to a query. 4 datasets.

Pair Classification (metric: AP)

Detect paraphrases, duplicates, and entailment. 3 datasets.

The Overall Score

MTEB averages across all task categories. For BAAI/bge-small-en-v1.5, that's 51.68:

bge-small-en-v1.5 scores by category:
  Retrieval:        47.8   (NDCG@10 across 15 datasets)
  STS:              77.4   (Spearman on 10 datasets)
  Classification:   63.2   (Accuracy on 12 datasets)
  Clustering:       37.1   (V-measure on 11 datasets)
  ...
  ──────────────────────────────────────
  Overall MTEB avg: 51.68  (each category weighted equally)
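The averaging itself is a macro-average over categories. The sketch below uses only the four category scores listed above (the remaining categories are omitted here), so its result will not equal the full 51.68.

```python
# Category averages from the breakdown above (remaining categories omitted)
category_scores = {
    "Retrieval": 47.8,
    "STS": 77.4,
    "Classification": 63.2,
    "Clustering": 37.1,
}

# Macro-average: each *category* counts once, regardless of how many
# datasets it contains
overall = sum(category_scores.values()) / len(category_scores)
print(round(overall, 3))  # 56.375 for these four categories alone
```

Weighting categories equally (rather than datasets) keeps a category with many datasets, like Retrieval, from dominating the overall score.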

MTEB Leaderboard (Excerpt)

Selected models. Full leaderboard →

#  | Model                     | Org                   | MTEB Avg | Params | Dims | Type
1  | KaLM-Embedding-Gemma3-12B | Tencent               | 72.32    | 12B    | 3840 | Open Source
2  | Qwen3-Embedding-8B        | Alibaba               | 70.58    | 8B     | 4096 | Open Source
3  | llama-embed-nemotron-8b   | NVIDIA                | 69.46    | 8B     | 4096 | Open Source
4  | Qwen3-Embedding-4B        | Alibaba               | 69.45    | 4B     | 4096 | Open Source
5  | gemini-embedding-001      | Google                | 68.37    | n/a    | 3072 | API
6  | stella_en_1.5B_v5         | dunzhang              | 66.2     | 1.5B   | 8192 | Open Source
7  | Qwen3-Embedding-0.6B      | Alibaba               | 64.34    | 0.6B   | 1024 | Open Source
8  | text-embedding-3-large    | OpenAI                | 64.6     | n/a    | 3072 | API
9  | bge-large-en-v1.5         | BAAI                  | 64.23    | 326M   | 1024 | Open Source
10 | jina-embeddings-v3        | Jina AI               | 62.5     | 570M   | 1024 | Open Source
11 | all-MiniLM-L6-v2          | Sentence Transformers | 56.3     | 22M    | 384  | Open Source
12 | bge-small-en-v1.5         | BAAI                  | 51.68    | 33M    | 384  | Open Source

BAAI/bge-small-en-v1.5 (highlighted) is your reproduce target for this lesson.

Stage 1: Reproduce

Replicate BAAI/bge-small-en-v1.5 on MTEB

Run the model through the full MTEB evaluation pipeline and reproduce its published average score of 51.68.

Install
pip install mteb sentence-transformers
Model size
33M params (~90MB download)
Compute time
~2 hours on CPU, ~20 min GPU
reproduce_mteb.py
import mteb

# Load the target model (mteb wrapper handles setup)
model = mteb.get_model("BAAI/bge-small-en-v1.5")

# Select all English tasks
tasks = mteb.get_tasks(languages=["eng"])

# Run evaluation (saves results to folder)
results = mteb.evaluate(
    model,
    tasks=tasks,
    output_folder="results/bge-small"
)

# Results are returned per task — inspect them
for task_result in results:
    print(f"{task_result.task_name}: {task_result.get_score():.4f}")

# Overall MTEB average: ~51.68

Target: Your score should be within ±0.5 of 51.68. Small differences are normal due to hardware, library versions, and evaluation subset selection. Save your results folder — you'll need it for the submission.

Stage 2: Improve

Beat 51.68 on MTEB

Now that you understand the pipeline, improve on it. There's no single right answer — this is real research.

Fine-tune on domain data

Use contrastive learning with domain-specific pairs (MS MARCO, NQ, NLI data) to adapt the model for specific MTEB task categories.

Try a different base model

bge-base (110M) scores higher than bge-small (33M). Or try instruction-tuned models like e5-large-v2 with query/passage prefixes.

Matryoshka training

Train embeddings that work at multiple dimensions — truncating to fewer dims with minimal quality loss. Pushes the efficiency frontier.
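At inference time, using a Matryoshka embedding at a smaller size is just truncate-and-renormalize. A minimal sketch (the vector here is random; with a Matryoshka-trained model the leading dimensions carry the most information, so the truncated vector stays usable):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    truncated = vec[:dims]
    return truncated / np.linalg.norm(truncated)

rng = np.random.default_rng(0)
full = rng.standard_normal(768)          # stand-in for a real embedding
small = truncate_embedding(full, 128)    # 6x smaller index, same API

print(small.shape, round(float(np.linalg.norm(small)), 6))  # (128,) 1.0
```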

Task-specific pooling

Instead of mean pooling, try [CLS] token, attention-weighted pooling, or late interaction (ColBERT-style).
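The three single-vector pooling options can be compared in a few lines of numpy. The token vectors and the scoring vector `w` are random stand-ins here; in a real model `w` would be a trained parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 768))  # transformer output: 5 tokens x 768 dims

# Mean pooling: average all token vectors
mean_vec = x.mean(axis=0)

# [CLS] pooling: take only the first token's vector
cls_vec = x[0]

# Attention-weighted pooling: score each token with a vector w
# (random here, trained in practice), softmax, then weighted sum
w = rng.standard_normal(768)
scores = x @ w
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attn_vec = weights @ x

print(mean_vec.shape, cls_vec.shape, attn_vec.shape)  # all (768,)
```

Late interaction (ColBERT-style) is different in kind: it keeps all token vectors and compares them pairwise at query time instead of pooling to one vector.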

This is real research. If your approach beats the baseline with a novel method, it's a genuine benchmark contribution. Your result goes on the leaderboard.

Submit Your Result

Submit your MTEB evaluation result. Include your code repository so peers can verify your methodology.

Contribute to MTEB

Help us maintain the most accurate benchmark data. Submit new results, report issues, or suggest improvements.

Submit New Results

Share benchmark scores from recent papers or your own experiments

Report Data Issues

Found incorrect scores or broken links? Let us know

Build the Data Flywheel

Your contributions help make CodeSOTA better for everyone

Submit Benchmark Result

Submissions are reviewed manually to ensure data quality. For immediate contributions, consider submitting a pull request on GitHub.


Citations

If you use MTEB in your work, please cite both papers:

MTEB: Massive Text Embedding Benchmark (arXiv, 2022)
@article{muennighoff2022mteb,
  author    = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Loïc and Reimers, Nils},
  title     = {MTEB: Massive Text Embedding Benchmark},
  journal   = {arXiv preprint arXiv:2210.07316},
  year      = {2022},
  url       = {https://arxiv.org/abs/2210.07316},
  doi       = {10.48550/ARXIV.2210.07316}
}
MMTEB: Massive Multilingual Text Embedding Benchmark (arXiv, 2025)
@article{enevoldsen2025mmteb,
  author    = {Enevoldsen, Kenneth and Chung, Isaac and {70+ co-authors}},
  title     = {MMTEB: Massive Multilingual Text Embedding Benchmark},
  journal   = {arXiv preprint arXiv:2502.13595},
  year      = {2025},
  url       = {https://arxiv.org/abs/2502.13595},
  doi       = {10.48550/arXiv.2502.13595}
}

Help improve this page

Missing a benchmark? Found an error? Have a suggestion? Your feedback feeds the CodeSOTA flywheel.