What is an Embedding?
How neural networks convert text into numbers - and why those numbers capture meaning.
72 Years of Turning Words into Numbers
Embeddings didn't emerge from a single paper or a single genius. They are the product of seven decades of converging ideas from linguistics, information theory, neural science, and statistical learning — each generation solving one limitation of the last, each breakthrough ignored for years before being rediscovered.
Understanding this history isn't optional context. It's the fastest way to understand why modern embeddings work the way they do, what trade-offs were made, and where the field is still fundamentally limited.
The Distributional Hypothesis
In 1954, at the University of Pennsylvania, linguist Zellig Harris published "Distributional Structure" in the journal Word. His thesis was radical for the time: the meaning of a word is not some abstract Platonic concept but is fully determined by the contexts in which it appears. Two words that show up in the same linguistic environments — surrounded by the same neighbors, in the same grammatical slots — must mean similar things.
"If we consider words or morphemes A and B to be more different in meaning than A and C, then we will often find that the distributions of A and B are more different than the distributions of A and C. In other words, difference of meaning correlates with difference of distribution."
— Harris, Z. (1954). Distributional Structure. Word, 10(2-3), 146–162.
Harris's student Noam Chomsky would take linguistics in a completely different direction — toward innate grammar and away from statistics. But it was Harris's empirical, data-driven insight that would eventually win out in computing. Three years later, the British linguist J.R. Firth coined the more memorable version:
"You shall know a word by the company it keeps."
— Firth, J.R. (1957). A Synopsis of Linguistic Theory 1930–1955.
Osgood's Semantic Differential
In 1957, psychologist Charles Osgood published The Measurement of Meaning, in which human subjects rated words on bipolar scales (good–bad, strong–weak, active–passive). He discovered that just three dimensions — evaluation, potency, and activity — explained most of the variance in how people judge word meaning. This was the first empirical evidence that semantic meaning could be captured in a low-dimensional numerical space. Today's 768-dimensional embeddings are the logical descendant of Osgood's three axes.
Vector Space Model & tf-idf
Gerard Salton at Cornell built the SMART information retrieval system, treating documents as vectors in word-space. Each dimension was a word; each value was a count. At Cambridge, Karen Spärck Jones (1972) introduced inverse document frequency (IDF) — the insight that rare words carry more information than common ones. The combination, tf-idf, remained the retrieval standard for four decades.
# tf-idf: first "embedding" — sparse, high-dimensional
vocab_size = 50000 # one dimension per word in vocabulary
doc_vector = np.zeros(vocab_size) # mostly zeros
doc_vector[word_to_id["cat"]] = tf("cat", doc) * idf("cat", corpus)
# Result: 50,000-dim sparse vector. "cat" and "feline" are orthogonal.
The fatal limitation: words are orthogonal. "cat" and "feline" share zero dimensions, so their similarity is 0 despite meaning the same thing. This is the problem embeddings solve.
Latent Semantic Analysis (LSA)
Scott Deerwester, Susan Dumais, and colleagues at Bellcore (Bell Communications Research) asked: what if we could compress the massive term-document matrix to reveal hidden structure? They applied truncated SVD (Singular Value Decomposition) to reduce 50,000-dimensional sparse vectors to ~300 dense dimensions. The result: words that never co-occurred directly but appeared in similar documents now had similar vectors.
# LSA: compress co-occurrence with SVD
term_doc_matrix # shape: (50000, 100000) — sparse
U, S, Vt = np.linalg.svd(term_doc_matrix, full_matrices=False) # in practice: scipy.sparse.linalg.svds
word_embeddings = U[:, :300] @ np.diag(S[:300]) # (50000, 300) — dense
# Now cos_sim("cat", "feline") > 0 even if they never co-occurred!
— Deerwester, S. et al. (1990). Indexing by Latent Semantic Analysis. JASIS, 41(6), 391–407.
LSA was the first true "embedding" — dense, low-dimensional, learned from data. But it had no notion of word order, couldn't handle polysemy ("bank" gets one vector), and the SVD was brutally expensive on large corpora. It dominated NLP for 15 years anyway.
Co-occurrence Refinements: HAL, PPMI, Random Indexing
A generation of researchers refined the count-based approach. HAL (Lund & Burgess, 1996) used sliding context windows instead of documents. PMI (Pointwise Mutual Information) replaced raw counts with a statistically principled measure of association — Levy & Goldberg would later prove (2014) that Word2Vec's Skip-gram was implicitly factorizing a PMI matrix, linking the neural and count-based worlds. Random Indexing (Kanerva et al., 2000) avoided building the full matrix entirely, foreshadowing modern approximate methods.
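The PMI refinement is simple enough to sketch directly. Below is a minimal numpy illustration of PPMI (positive PMI, the variant usually used for embeddings); the vocabulary and co-occurrence counts are invented for the example:

```python
import numpy as np

# Toy word-context co-occurrence counts (rows: words, cols: contexts).
# Vocabulary and counts are invented for illustration.
counts = np.array([
    [10.0, 2.0, 0.0],   # "cat"
    [ 8.0, 1.0, 0.0],   # "dog"
    [ 0.0, 1.0, 9.0],   # "stock"
])

total = counts.sum()
p_wc = counts / total                   # joint P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal P(c)

with np.errstate(divide="ignore"):
    pmi = np.log(p_wc / (p_w * p_c))    # PMI(w, c) = log P(w,c) / (P(w) P(c))
ppmi = np.maximum(pmi, 0.0)             # PPMI: clip negatives (and -inf for zero counts) to 0

print(ppmi)
```

Rows of this PPMI matrix are exactly the kind of word vectors Levy & Goldberg later connected to Skip-gram.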
Hinton's Distributed Representations
While linguists were counting co-occurrences, Geoffrey Hinton proposed something fundamentally different in the PDP (Parallel Distributed Processing) volumes: concepts should be encoded as patterns of activity across many neurons, not as single dedicated units. He called these "distributed representations."
"The aim is to find a set of features that allow each entity to be represented as a pattern of activity across many features, and each feature to be involved in representing many entities."
— Hinton, G. (1986). Learning Distributed Representations of Concepts. Proc. Cognitive Science Society.
The key insight: a localist representation (one neuron = one concept) requires N neurons for N concepts and can't generalize. A distributed representation uses N neurons to represent exponentially more concepts, and similar concepts automatically get similar patterns. This paper defined the theoretical blueprint that every learned embedding follows today.
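The localist/distributed contrast can be made concrete with a toy numpy sketch. The feature values below are invented for illustration — the point is only the geometry:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Localist: one neuron per concept — every pair is orthogonal, no similarity
cat_local = np.array([1.0, 0.0, 0.0])
dog_local = np.array([0.0, 1.0, 0.0])
car_local = np.array([0.0, 0.0, 1.0])
print(cos(cat_local, dog_local))  # 0.0 — "cat" and "dog" look totally unrelated

# Distributed: concepts as patterns over shared features
# (invented features: furry, alive, has_wheels, size)
cat_dist = np.array([0.9, 1.0, 0.0, 0.2])
dog_dist = np.array([0.8, 1.0, 0.0, 0.4])
car_dist = np.array([0.0, 0.0, 1.0, 0.8])
print(cos(cat_dist, dog_dist) > cos(cat_dist, car_dist))  # True — similar concepts share patterns
```

Similarity comes for free once concepts share features — exactly the property every learned embedding exploits.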
Elman's Word Representations from Prediction
Jeff Elman trained a simple recurrent neural network (SRN) to predict the next word in a sequence. When he analyzed the hidden states, he found that the network had spontaneously organized words into grammatical and semantic categories — nouns clustered with nouns, verbs with verbs, and within those clusters, animals grouped with animals. Nobody told the network about parts of speech. This was the first demonstration that meaningful word representations could emerge purely from the pressure to predict.
— Elman, J. (1990). Finding Structure in Time. Cognitive Science, 14(2), 179–211.
Bengio's Neural Probabilistic Language Model
This is the paper that invented the modern embedding. Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin at Université de Montréal proposed a model with three ideas that remain the foundation of every language model today:
- A learned embedding lookup table — each word in the vocabulary maps to a dense vector of dimension m
- A neural network that operates on concatenated embeddings — taking the last n word vectors as input to predict the next word
- Joint training of both components — the embeddings and the prediction network are trained together end-to-end via backpropagation
# Bengio et al. 2003 — the architecture every LLM still uses
C = nn.Embedding(vocab_size, m) # The lookup table: |V| × m
# For context words w_{t-n+1}, ..., w_{t-1}:
x = concat(C[w_{t-n+1}], ..., C[w_{t-1}]) # Shape: (n-1) × m
h = tanh(Hx + d) # Hidden layer
y = softmax(Wx + Uh + b) # P(w_t | context)
# Backprop through the loss updates BOTH the network AND the embedding table C
— Bengio, Y. et al. (2003). A Neural Probabilistic Language Model. JMLR, 3, 1137–1155.
The paper was ahead of its time. It took 10 years and dramatically faster hardware before the approach became practical at scale. But every modern embedding — from Word2Vec to GPT's token embeddings — is a direct descendant of Bengio's lookup table C.
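The pseudocode above can be fleshed out as a runnable numpy forward pass. Dimensions and random weights here are chosen arbitrarily for illustration — a real implementation would learn all of these via backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
V, m, n = 1000, 30, 4                        # vocab size, embedding dim, n-gram order (n-1 = 3 context words)

C = rng.normal(0, 0.1, (V, m))               # the embedding lookup table |V| × m
H = rng.normal(0, 0.1, (50, (n - 1) * m))    # input -> hidden weights
d = np.zeros(50)                             # hidden bias
U = rng.normal(0, 0.1, (V, 50))              # hidden -> output weights
W = rng.normal(0, 0.1, (V, (n - 1) * m))     # direct input -> output connections
b = np.zeros(V)                              # output bias

def predict_next(context_ids):
    x = np.concatenate([C[i] for i in context_ids])  # concatenated context embeddings
    h = np.tanh(H @ x + d)                           # hidden layer
    logits = W @ x + U @ h + b
    e = np.exp(logits - logits.max())                # numerically stable softmax
    return e / e.sum()                               # P(w_t | context)

probs = predict_next([17, 42, 256])
print(probs.shape, round(float(probs.sum()), 6))     # (1000,) 1.0
```

Training would compute the cross-entropy loss on the true next word and backprop into W, U, H, and the table C itself.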
Collobert & Weston: Pre-trained Embeddings for NLP
Ronan Collobert and Jason Weston at NEC Labs showed that embeddings pre-trained on unlabeled text could be transferred to improve downstream NLP tasks like named entity recognition and part-of-speech tagging. Their 2008 ICML paper and the expanded 2011 JMLR version ("Natural Language Processing (Almost) from Scratch") demonstrated that a single set of embeddings trained once could replace years of hand-engineered features. This was the birth of transfer learning for NLP — the same paradigm that now drives BERT, GPT, and every fine-tuned model.
— Collobert, R. & Weston, J. (2008). A Unified Architecture for NLP. ICML.
— Collobert, R. et al. (2011). NLP (Almost) from Scratch. JMLR, 12, 2493–2537.
Word2Vec — The Inflection Point
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean at Google released two papers in rapid succession that changed the field overnight. The insight was counterintuitive: make the model simpler and the training dramatically faster.
They stripped Bengio's architecture down to its skeleton. No hidden layer. No non-linearity. Just the embedding lookup and a linear prediction — either predicting context from a word (Skip-gram) or a word from context (CBOW). With negative sampling replacing the expensive softmax over the full vocabulary, Word2Vec could train on 100 billion words in a single day on one machine.
# Skip-gram with negative sampling (simplified)
# For each word w_t in the corpus, predict its context words
for w_t in corpus:
    for w_c in context_window(w_t, size=5):
        # Maximize: σ(v_wc · v_wt) — context word should be close
        loss += -log(sigmoid(dot(embed[w_c], embed[w_t])))
        # Minimize: σ(v_neg · v_wt) — random words should be far
        for w_neg in random_sample(vocab, k=5):
            loss += -log(sigmoid(-dot(embed[w_neg], embed[w_t])))
The famous king − man + woman ≈ queen demonstration made embeddings tangible to everyone. Vector arithmetic encoded analogies: Paris − France + Italy ≈ Rome. The AI community was electrified. Within 12 months, embeddings went from a niche research topic to the default first layer in every NLP system.
Why Word2Vec mattered more than it "should" have
Technically, Word2Vec was a simplification of Bengio 2003. Levy & Goldberg (2014) proved it was implicitly factorizing a shifted PMI matrix — a count-based method in disguise. But none of that diminishes its impact. By packaging the idea into fast, open-source C code with a memorable demo, Mikolov made embeddings accessible to every engineer and researcher. Adoption matters more than novelty.
— Mikolov, T. et al. (2013a). Efficient Estimation of Word Representations. ICLR Workshop.
— Mikolov, T. et al. (2013b). Distributed Representations of Words and Phrases. NeurIPS.
— Levy, O. & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. NeurIPS.
GloVe: Bridging the Count–Predict Divide
Jeffrey Pennington, Richard Socher, and Christopher Manning at Stanford argued that the count-based and prediction-based approaches were two sides of the same coin. GloVe (Global Vectors) explicitly constructed a co-occurrence matrix, then trained embeddings to reconstruct the log co-occurrence ratios. It matched Word2Vec on analogies while making the objective function's connection to corpus statistics transparent. The paper has 35,000+ citations.
— Pennington, J. et al. (2014). GloVe: Global Vectors for Word Representation. EMNLP.
fastText: Subword Embeddings
Piotr Bojanowski et al. at Facebook AI Research solved a critical Word2Vec limitation: out-of-vocabulary (OOV) words. Instead of one embedding per word, fastText learned embeddings for character n-grams — "where" was represented as the sum of embeddings for {<wh, whe, her, ere, re>}. A word never seen during training could still get a meaningful vector from its subword parts. This made embeddings practical for morphologically rich languages (Turkish, Finnish, Arabic) and for handling typos, slang, and neologisms. It also anticipated the subword tokenization (BPE) that all modern transformers use.
— Bojanowski, P. et al. (2017). Enriching Word Vectors with Subword Information. TACL, 5, 135–146.
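The subword idea can be sketched in a few lines of numpy. Hash-seeded random vectors stand in for fastText's learned n-gram table — everything below is invented for illustration, but the mechanism (a word vector is the combination of its character n-gram vectors) is the real one:

```python
import hashlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    w = f"<{word}>"  # fastText wraps words in boundary markers < >
    return {w[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)}

def ngram_vector(gram, dim=100):
    # Deterministic pseudo-embedding per n-gram (stand-in for a learned table)
    seed = int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "little")
    return np.random.default_rng(seed).standard_normal(dim)

def word_vector(word):
    grams = char_ngrams(word)
    return sum(ngram_vector(g) for g in grams) / len(grams)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "wherever" may never appear in training, but it shares subwords with "where"
print(cos(word_vector("where"), word_vector("wherever")))  # clearly positive — shared n-grams
print(cos(word_vector("where"), word_vector("bank")))      # near zero — no shared n-grams
```

An out-of-vocabulary word gets a meaningful vector purely from n-gram overlap with seen words — the property the prose above describes.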
"Attention Is All You Need"
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, and Polosukhin at Google Brain published the Transformer. The architectural innovation was replacing recurrence (processing words one-by-one) with self-attention — letting every token directly attend to every other token in parallel. This made training dramatically faster on GPUs and allowed models to capture long-range dependencies that RNNs struggled with.
For embeddings specifically, the Transformer meant something profound: the representation of a word was no longer a fixed vector but a function of its entire context. The word "bank" after 12 layers of self-attention would have completely different activations in "river bank" vs "bank robbery" — because the attention mechanism routes different information into the same position depending on what else is in the sequence.
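A single-head self-attention layer, sketched in numpy, shows the mechanism. The 8-dim vectors and random projection weights below are invented for illustration; the point is that "bank" enters with the same lookup vector both times, yet leaves with different representations:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 8  # toy embedding dimension
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.5 for _ in range(3))

# Toy static embeddings (invented); "bank" has exactly ONE input vector
emb = {w: rng.standard_normal(d) for w in ["river", "bank", "robbery"]}

def attend(tokens):
    X = np.stack([emb[t] for t in tokens])   # (seq_len, d) — same lookup every time
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)            # scaled dot-product attention
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)       # softmax attention weights
    return A @ V                             # each output mixes in the other tokens

out1 = attend(["river", "bank"])[1]          # "bank" next to "river"
out2 = attend(["bank", "robbery"])[0]        # "bank" next to "robbery"
print(np.allclose(out1, out2))               # False — same input vector, different outputs
```

Stack 12 such layers (plus feedforward blocks) and the divergence compounds — which is why contextual models separate word senses that static embeddings collapse.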
— Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS. 100,000+ citations.
ELMo: Embeddings from Language Models
Matthew Peters et al. at Allen AI showed that the internal states of a pre-trained bidirectional LSTM language model were powerful word representations. Unlike Word2Vec, ELMo gave each word a different embedding depending on its sentence. "ELMo" stood for Embeddings from Language Models — the name itself marked the shift. Plugging ELMo into existing NLP models improved performance on every benchmark by 2–25%. The contextual embedding era had begun.
— Peters, M. et al. (2018). Deep Contextualized Word Representations. NAACL.
BERT: The Pre-training Paradigm
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google released BERT (Bidirectional Encoder Representations from Transformers), and NLP changed permanently. BERT was pre-trained on two tasks:
- Masked Language Modeling (MLM) — randomly mask 15% of tokens, predict them from context
- Next Sentence Prediction (NSP) — predict whether two sentences are consecutive
Trained on 3.3 billion words (Wikipedia + BookCorpus), BERT set new state-of-the-art on 11 NLP benchmarks simultaneously. More importantly, it established the pre-train → fine-tune paradigm: train once on massive unlabeled data, then adapt to any task with minimal labeled examples. Every modern embedding model is architecturally a descendant of BERT.
— Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers. NAACL. 90,000+ citations.
Sentence-BERT: Making BERT Embeddings Practical
BERT could produce embeddings, but comparing two sentences required feeding both through the model together — so finding the most similar pair among n documents costs O(n²) full forward passes. Nils Reimers and Iryna Gurevych fine-tuned BERT with a siamese/triplet network structure so that each sentence could be independently encoded into a fixed vector. Finding the most similar pair among 10,000 sentences went from ~65 hours (BERT cross-encoder) to ~5 seconds (SBERT bi-encoder).
This unlocked the use case that drives most embedding adoption today: semantic search. Encode your corpus once, store the vectors, find similar items with a dot product. RAG, recommendation engines, duplicate detection — all downstream of this architectural choice.
— Reimers, N. & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. EMNLP.
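The bi-encoder pattern reduces to: encode once, then search with a matrix product. A numpy sketch with random stand-in vectors (in practice these would come from a model like SBERT):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are sentence embeddings, encoded ONCE and stored
corpus_emb = rng.standard_normal((10_000, 384))
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)  # unit-normalize

def search(query_emb, k=5):
    q = query_emb / np.linalg.norm(query_emb)
    scores = corpus_emb @ q               # one matrix-vector product, not 10k forward passes
    top = np.argsort(-scores)[:k]         # indices of the k most similar items
    return top, scores[top]

query = rng.standard_normal(384)
idx, scores = search(query)
print(idx.shape, scores.shape)            # (5,) (5,)
```

The model runs once per document at index time and once per query at search time; everything else is linear algebra, which is what makes vector databases possible.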
The Modern Embedding Landscape
The field has exploded. Models are now trained with sophisticated multi-stage pipelines: pre-training on billions of text pairs, hard-negative mining, knowledge distillation, and instruction-tuning for task-specific behavior.
E5 & E5-Mistral
Microsoft. Instruction-tuned embeddings using prompted LLMs as backbone.
BGE & BGE-M3
BAAI. Multi-lingual, multi-granularity. Hybrid dense+sparse retrieval.
OpenAI text-embedding-3
API-only. Matryoshka dimensions (256–3072). State-of-the-art on MTEB at launch.
Cohere embed-v3 & v4
Compression-aware training. int8/binary quantization with minimal quality loss.
GTE & GTE-Qwen2
Alibaba. Decoder-based architecture using Qwen2 as backbone. 8K context.
Nomic Embed
Fully open source. Reproducible training. 8K context window.
The MTEB benchmark now tracks 200+ embedding models across 56+ datasets spanning retrieval, classification, clustering, STS, and reranking. New models appear weekly. The race is for better multilingual coverage, longer context windows, lower latency, and — increasingly — multi-modal embeddings that unify text, images, and code in a single space.
The throughline: 1954 → 2026
Seven decades, one idea, refined relentlessly: words that appear in similar contexts should get similar vectors. Every advance solved a limitation of the previous generation, and every generation preserved that core principle.
The Problem
Computers work with numbers. Neural networks are chains of matrix multiplications and simple non-linearities - they can only process numerical vectors. But most real-world data is not numbers: text, images, audio.
neural_network("cat") # Error: expected tensor, got string
neural_network([99, 97, 116]) # Works, but ASCII codes have no meaning
We need a way to represent "cat" as numbers where similar words get similar numbers. That's what embeddings do.
What an Embedding Actually Is
An embedding is a learned lookup table combined with a neural network transformation.
Step 1: Tokenization
First, text is split into tokens - subword units from a fixed vocabulary. "cat" might be one token, but "unbelievable" becomes ["un", "believ", "able"].
# Tokenization example
"cat" → [2368] # Single token ID
"unbelievable" → [348, 12871, 481] # Three token IDs
"café" → [7467, 2634] # Subword: "caf" + "é"
Step 2: Embedding Lookup
Each token ID maps to a row in a giant matrix - the embedding table. This is just a lookup: token 2368 → row 2368 of the matrix.
# Embedding table: 50,000 tokens × 768 dimensions
embedding_table.shape = (50000, 768)
# Lookup is just indexing
token_id = 2368 # "cat"
initial_vector = embedding_table[token_id] # Shape: (768,)
Why 768 dimensions? It's a hyperparameter choice. More dimensions = more capacity to encode meaning, but also more compute. Common values: 384, 768, 1024, 1536.
Step 3: Transformer Processing
Modern embeddings don't stop at lookup. The initial vectors pass through a transformer - layers of attention and feedforward networks that let each token "look at" other tokens.
# Simplified transformer forward pass
x = embedding_table[token_ids] # (seq_len, 768) - initial lookup
x = x + positional_encoding # Add position information
for layer in transformer_layers: # 12-24 layers typically
x = layer.attention(x) # Tokens attend to each other
x = layer.feedforward(x) # Non-linear transformation
final_embedding = mean(x) # Pool to single vector
How Training Creates Meaning
The embedding table and transformer weights start as random numbers. Training adjusts them using gradient descent on specific tasks:
Contrastive Learning (Modern Approach)
The model learns that similar sentences should have similar embeddings:
# Training objective: similar pairs close, dissimilar pairs far
similar_pair = ("The cat sat on the mat", "A feline rested on the rug")
dissimilar = ("The cat sat on the mat", "Stock prices rose sharply")
emb1, emb2, emb3 = model.encode([similar_pair[0], similar_pair[1], dissimilar[1]])
# Contrastive loss: pull the similar pair together, push the dissimilar pair apart
loss = -cos_sim(emb1, emb2) + cos_sim(emb1, emb3) # Minimize this
# Backpropagation updates embedding table and transformer weights
loss.backward()
optimizer.step()
After millions of such examples, similar concepts naturally cluster together because the model was penalized whenever it placed them far apart.
Key Insight
The numbers in an embedding have no predefined meaning. Dimension 42 doesn't mean "animal-ness". The meaning emerges from training - it's whatever representation helps the model distinguish similar from dissimilar text.
Measuring Similarity: Cosine Distance
"Similar embeddings" means vectors pointing in similar directions. We measure this with cosine similarity:
from sentence_transformers import SentenceTransformer
import numpy as np
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
def cosine_similarity(a, b):
"""
Measures angle between vectors. Returns -1 to 1.
1 = identical direction (similar)
0 = perpendicular (unrelated)
-1 = opposite (rare in practice)
"""
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# Sentences separate better than single words
emb_a = model.encode("The cat sat on the mat")
emb_b = model.encode("A feline rested on the rug")
emb_c = model.encode("Stock prices rose sharply today")
print(cosine_similarity(emb_a, emb_b)) # ~0.75 (similar meaning)
print(cosine_similarity(emb_a, emb_c)) # ~0.36 (unrelated topic)
Most embedding models are normalized (all vectors have length 1), so cosine similarity simplifies to just the dot product.
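That equivalence is easy to verify with numpy (random vectors standing in for model outputs):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.standard_normal(384), rng.standard_normal(384)

# Unit-normalize, as most embedding models do before returning vectors
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(np.dot(a, b), cosine))  # True — for unit vectors, dot product IS cosine similarity
```

This is why vector databases default to dot-product (inner product) indexes when embeddings are pre-normalized: same ranking, fewer operations.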
See It In Action
Real embeddings have 768+ dimensions. Below, we project them to 2D using t-SNE (a dimensionality reduction algorithm) so you can see clustering patterns.
Note: 2D projection distorts distances. Points that look far apart in 2D might be close in 768D.
Word Embedding Space
Static vs Contextual Embeddings
There are two types of embeddings with fundamentally different behavior:
Static (Word2Vec, GloVe)
One vector per word. "bank" always gets the same embedding regardless of context.
"river bank" → bank = [0.2, 0.4, ...]
"bank account" → bank = [0.2, 0.4, ...]
# Same vector! Can't distinguish meanings
Contextual (BERT, Transformers)
Different vector based on surrounding words. This is what modern models use.
"river bank" → bank = [0.8, 0.1, ...]
"bank account" → bank = [0.1, 0.9, ...]
# Different vectors! Context-aware
"King − Man + Woman = Queen" Is Misleading
The most famous embedding demo is also the most misunderstood. Here's what actually happens.
Every introduction to embeddings — including this one, two sections ago — mentions the analogy king − man + woman ≈ queen. It's a beautiful story: vector arithmetic captures semantic relationships. Gender is a direction in embedding space. Subtract masculinity, add femininity, arrive at the female equivalent.
The problem is that the story is significantly more complicated than it appears, and the way it's typically presented obscures critical details about how embedding spaces actually work.
Problem 1: The Input Words Are Excluded From Results
When you compute king − man + woman and search for the nearest neighbor, the standard evaluation code silently removes "king", "man", and "woman" from the candidate set before ranking. This is baked into the original Word2Vec evaluation script and into gensim's most_similar() function.
# What the demo code actually does (gensim)
result = model.most_similar(
positive=["king", "woman"],
negative=["man"],
)
# Internally: compute v_king - v_man + v_woman
# Then: EXCLUDE "king", "man", "woman" from candidate results
# Then: return nearest neighbor from remaining vocabulary
# Without exclusion, "king" itself is often the top result
# because v_king - v_man + v_woman ≈ v_king (the operation barely moves it)
Why does this matter? Because the analogy vector is often closer to the input words themselves than to "queen." The exclusion trick hides this — it makes the result look cleaner than the underlying geometry justifies.
Problem 2: The Similarity Score Is Modest, Not Definitive
When tested with BERT embeddings, the cosine similarity between the analogy vector king − man + woman and "queen" is approximately 0.57. That's a shared variance of roughly 32% (r² = 0.57²). For context, the similarity between "man" and "woman" directly is 0.58 — essentially the same magnitude.
"Queen" wins, but not by a commanding margin. The runner-up is "kings" (a morphological variant of the input, not a gender analogy), and "princess", "female", and "women" are all close behind. This looks more like "the neighborhood of royalty + femininity" than a precise algebraic operation.
Problem 3: The Analogy Doesn't Generalize
The king/queen example is cherry-picked. It works because "king" and "queen" are among the most frequent words in training data, they co-occur heavily in the same documents, and the gender relationship is one of the strongest semantic axes in English text. Try less famous analogies and the results fall apart:
Studies on the Google Analogy Test Set show Word2Vec achieves roughly 40–75% accuracy depending on the category — with syntactic analogies (morphological patterns like verb tenses) performing far better than semantic ones (capitals, currencies, family relationships). The analogies that "work" tend to be the ones where the relationship is already encoded in surface-level co-occurrence patterns, not deep semantic understanding.
Problem 4: Embedding Spaces Are Not Linear
The analogy demo implies that semantic relationships are clean linear directions in embedding space — that "gender" is a single vector you can add or subtract. This is a significant oversimplification.
Modern embedding spaces are highly non-linear. The representations are produced by neural networks with many layers of non-linear transformations (ReLU, layer normalization, attention). The resulting geometry is curved, clustered, and anisotropic (vectors are not uniformly distributed in all directions — they tend to occupy a narrow cone in high-dimensional space).
Ethayarajh (2019) showed that in contextual models like BERT and GPT-2, word representations become increasingly anisotropic in higher layers — they occupy a smaller and smaller portion of the available space. Linear arithmetic in such spaces is geometrically dubious.
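The effect can be illustrated with a toy numpy experiment (synthetic vectors, not real BERT activations): isotropic random vectors have mean pairwise cosine near 0, while adding a shared offset — a crude stand-in for the narrow cone — pushes it far higher:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 768

def mean_pairwise_cos(X):
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    return (sims.sum() - n) / (n * (n - 1))  # average off-diagonal cosine

iso = rng.standard_normal((n, d))            # isotropic: directions fill the whole space
aniso = iso + 3.0 * rng.standard_normal(d)   # shared offset: vectors crowd into a cone

print(mean_pairwise_cos(iso))    # ≈ 0 — random pairs are near-orthogonal
print(mean_pairwise_cos(aniso))  # ≈ 0.9 — nearly everything looks "similar"
```

In the anisotropic regime, a raw cosine of 0.9 between two arbitrary vectors is unremarkable — which is why analogy arithmetic and absolute similarity thresholds are geometrically dubious in such spaces.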
— Ethayarajh, K. (2019). How Contextual are Contextualized Word Representations? EMNLP.
Problem 5: The Analogy Encodes (and Amplifies) Bias
If king − man + woman = queen works because embeddings capture societal associations, then so does doctor − man + woman = nurse and programmer − man + woman = homemaker.
Bolukbasi et al. (2016) documented this systematically in "Man is to Computer Programmer as Woman is to Homemaker?" — the same mechanism that produces the clean king/queen result also produces harmful stereotypical associations. The analogy isn't discovering ground truth about gender; it's reflecting (and concentrating) the statistical biases of its training corpus.
Presenting king/queen as proof that embeddings "understand gender" without noting that the same mechanism produces sexist associations is academically incomplete.
— Bolukbasi, T. et al. (2016). Man is to Computer Programmer as Woman is to Homemaker? NeurIPS.
So what's actually going on?
The king/queen analogy isn't completely wrong — it's incomplete. What it actually demonstrates:
1. Embeddings do capture semantic relationships, but they're noisy, approximate, and entangled — not clean linear axes.
2. The demo is hand-picked for a relationship (gender in royalty) that is exceptionally strong in training data. Most analogies don't produce such clean results.
3. The evaluation protocol is rigged (excluding input words). Without this, the result would look much less impressive.
4. Embeddings capture statistical co-occurrence, not understanding. The same mechanism that produces king/queen also produces doctor/nurse. What looks like knowledge is pattern matching on training data distributions.
None of this makes embeddings less useful — they're extraordinarily powerful for retrieval, clustering, and similarity search. But their power comes from learned statistical patterns over massive corpora, not from encoding clean algebraic relationships between concepts. Understanding this distinction matters for building reliable systems.
Working Code
Here's how to generate embeddings in practice:
from sentence_transformers import SentenceTransformer
import numpy as np
# Load a pre-trained embedding model (downloads ~400MB first time)
model = SentenceTransformer('BAAI/bge-small-en-v1.5')
# Generate embeddings
texts = [
"The quick brown fox jumps over the lazy dog",
"A fast auburn fox leaps above a sleepy canine",
"Stock markets closed higher on Friday"
]
embeddings = model.encode(texts) # Shape: (3, 384)
print(f"Embedding shape: {embeddings.shape}")
print(f"First 5 values of first embedding: {embeddings[0][:5]}")
# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)
print(f"\nSimilarity matrix:")
print(f" Text 0 vs 1: {similarities[0][1]:.3f}") # High - same meaning
print(f" Text 0 vs 2: {similarities[0][2]:.3f}") # Low - different topic
Install: pip install sentence-transformers
Key Takeaways
1. Embeddings = lookup table + transformer - Token IDs index into a learned matrix, then pass through attention layers.
2. Meaning emerges from training - Contrastive learning teaches the model to place similar text close together.
3. Cosine similarity measures closeness - Vectors pointing in similar directions (high cosine) represent similar meaning.
4. Modern embeddings are contextual - The same word gets different embeddings based on surrounding context.