Codesota · Building Blocks · § 02 · Image → Vector · Reading time ~ 18 min
§ 02 · The Building Blocks
ImageVector

Image embedding.

Convert images directly to dense vector representations for semantic search, clustering, and similarity matching — without ever generating a caption.

§ 02.1 · The problem

Pixels are a terrible index.

A 512×512 photograph is a quarter of a million numbers. Two photos of the same cat, taken a second apart, share almost none of them. Shift the camera two pixels to the left and the raw arrays become strangers. This is the central nuisance of computer vision: semantic identity does not live in pixel space.

An image embedding is the fix. It is a short list of numbers — typically 512, 768, or 1024 floats — learned so that photographs of similar things land near one another and photographs of unrelated things land far apart. The journey from pixels to embedding is one forward pass through a trained neural network. The vector you get back is the image’s coordinate in a semantic space.

“Two images are similar if a good model, shown them side by side, would agree they depict the same kind of thing. Embeddings are that judgment, compressed into a ruler.”

§ 02.2 · The geometry

A space you can walk through.

The clearest way to understand an embedding is to see one. Below: a miniature coordinate system and the cosine-similarity matrix between eight photographs. Notice how the cats cluster, the dogs cluster, and the landscapes hang somewhere else entirely.
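Cosine similarity, used throughout this section, is just the dot product of two vectors divided by the product of their lengths. A minimal sketch with NumPy, using the toy 8-dimensional cat and dog vectors from the figure; the sunset vector is invented for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 8-dimensional embeddings (illustrative values, not real CLIP output)
cat = np.array([0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44])
dog = np.array([0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78])
sunset = np.array([0.10, 0.95, 0.02, 0.05, 0.40, 0.30, 0.11, 0.25])

print(f'cat vs dog:    {cosine_similarity(cat, dog):.2f}')     # 0.95
print(f'cat vs sunset: {cosine_similarity(cat, sunset):.2f}')  # 0.34
```

Because the measure is an angle, not a distance, it ignores vector length: an image embedded with twice the magnitude still scores 1.00 against itself.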

How Image Embedding Works

Neural networks convert images and text into vectors (lists of numbers). Similar concepts have similar vectors.

Step 1 · Image becomes a vector. Each image maps to 768 dimensions (showing 8 here, with illustrative labels: animal, outdoor, food, furry, wild, water, small, cute):

  Photo of a cat → [0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44]
  Photo of a dog → [0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78]

Step 2 · Similar things have similar vectors. Cosine similarity between four example images:

               cat    dog    landscape  meal
    cat        1.00   0.95   0.29       0.34
    dog        0.95   1.00   0.48       0.39
    landscape  0.29   0.48   1.00       0.38
    meal       0.34   0.39   0.38       1.00

Key insight: the cat and dog vectors are similar (0.95) because both depict animals. The sunset landscape is far from either (0.29 and 0.48).

Step 3 · Search by text (same vector space). A text query gets embedded too, so a phrase can be compared against photos directly.

Step 4 · Visualizing in 2D (t-SNE projection). Animals, landscapes, and food each form their own cluster; similar items cluster together in vector space.

Fig. 1 · Toy projection. Real CLIP space is 768-dimensional — this is the shadow it casts on two axes.

Try It: Visual Similarity Search

See how CLIP embeds images and text into the same vector space. Click an image or type a query to find similar content.

Eight sample images: orange cat, gray cat, golden retriever, puppy, mountain sunset, beach, food plate, pancakes.

Image Similarity Matrix

Cosine similarity between CLIP embeddings. Higher = more similar.

                        1     2     3     4     5     6     7     8
  1 Orange cat         1.00  0.92  0.45  0.48  0.22  0.25  0.31  0.29
  2 Gray cat           0.92  1.00  0.43  0.46  0.24  0.23  0.28  0.27
  3 Golden retriever   0.45  0.43  1.00  0.89  0.28  0.31  0.35  0.33
  4 Puppy              0.48  0.46  0.89  1.00  0.26  0.29  0.32  0.31
  5 Mountain sunset    0.22  0.24  0.28  0.26  1.00  0.78  0.19  0.21
  6 Beach              0.25  0.23  0.31  0.29  0.78  1.00  0.22  0.24
  7 Food plate         0.31  0.28  0.35  0.32  0.19  0.22  1.00  0.85
  8 Pancakes           0.29  0.27  0.33  0.31  0.21  0.24  0.85  1.00
    (columns 1–8 follow the same order as the rows)
Fig. 2 · Precomputed cosine similarities from OpenAI CLIP ViT-L/14.

§ 02.3 · Architectural patterns

Three ways to compress an image into a ruler.

Pattern A

Direct CLIP / SigLIP

Embed images directly into a shared vision-language space. One forward pass. Text and images live in the same coordinate system, so a phrase and a photo can be compared directly.

Pros
  • Real-time capable on GPU
  • Multilingual text search possible
  • Single model, single forward pass
Cons
  • Misses fine-grained details
  • Struggles outside training distribution
Pattern B

CNN feature extraction

Take a pretrained ResNet or EfficientNet, strip the classification head, and use the penultimate layer as the embedding. Older trick; still surprisingly strong.

Pros
  • Well-understood, decade of tooling
  • Dozens of pretrained checkpoints
Cons
  • No text-to-image search
  • Often needs domain fine-tuning
Pattern C

Self-supervised (DINOv2)

DINOv2 never sees a caption — it learns purely from the pixels by asking each image to be consistent with different crops of itself. The result is embeddings that excel at fine-grained visual similarity: spotting the same dress in two photos, flagging near-duplicates, clustering on visual style rather than verbal category.

Pros
  • No captions or labels needed
  • Strongest at fine-grained similarity and near-duplicates
Cons
  • No text-to-image search (vision-only, no text encoder)


§ 02.4 · Implementations

The models worth knowing.

Model                  Origin        License     Best for
OpenAI CLIP ViT-L/14   OpenAI, 2021  MIT         General-purpose zero-shot
OpenCLIP (LAION)       LAION, 2022+  MIT         Scale: trained on 2B image-text pairs
SigLIP-SO400M          Google, 2023  Apache 2.0  Zero-shot accuracy, sigmoid loss
DINOv2                 Meta, 2023    Apache 2.0  Fine-grained similarity, duplicates
Fig. 3 · Open-source checkpoints we reach for first.

§ 02.5 · Code

Three recipes, forty lines each.

The shortest useful path from a folder of JPEGs to a searchable index. The third recipe is a complete visual-search system in roughly forty lines of Python.

CLIP Image Embedding with OpenCLIP

Embed images and search with text queries using OpenCLIP.

Install: pip install open-clip-torch pillow
import open_clip
import torch
from PIL import Image

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Embed an image
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Embed a text query
text = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f'Similarity: {similarity}')

SigLIP with Transformers

Google's SigLIP swaps CLIP's softmax for a sigmoid loss — calibrated probabilities and better zero-shot numbers.

Install: pip install transformers torch pillow
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# Load SigLIP
processor = AutoProcessor.from_pretrained('google/siglip-so400m-patch14-384')
model = AutoModel.from_pretrained('google/siglip-so400m-patch14-384')

# Prepare inputs
image = Image.open('photo.jpg')
texts = ['a cat sleeping', 'a dog running', 'a sunset']

inputs = processor(
    text=texts,
    images=image,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image  # image-text similarity
    probs = torch.sigmoid(logits)  # SigLIP uses sigmoid, not softmax

for text, prob in zip(texts, probs[0]):
    print(f'{text}: {prob:.3f}')

Build a Visual Search Index

Index thousands of images in a folder, then query them with natural language.

Install: pip install open-clip-torch faiss-cpu pillow tqdm
import open_clip
import torch
import faiss
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

# Setup
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Index all images in a folder
image_paths = list(Path('photos/').glob('*.jpg'))
embeddings = []

for path in tqdm(image_paths, desc='Indexing'):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
        feat /= feat.norm(dim=-1, keepdim=True)
    embeddings.append(feat.numpy())

# Build FAISS index
embeddings = np.vstack(embeddings).astype('float32')
index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product = cosine sim
index.add(embeddings)

# Search with text
query = 'sunset at the beach'
text_tokens = tokenizer([query])
with torch.no_grad():
    query_feat = model.encode_text(text_tokens)
    query_feat /= query_feat.norm(dim=-1, keepdim=True)

D, I = index.search(query_feat.numpy().astype('float32'), k=5)
print('Top 5 matches:')
for i, (dist, idx) in enumerate(zip(D[0], I[0])):
    print(f'  {i+1}. {image_paths[idx].name} (score: {dist:.3f})')

§ 02.6 · In the wild

Where this ends up shipping.

Use cases
  • Visual search in photo libraries. “Find the picture of my dog at the beach” without tags or filenames.
  • Similar-product retrieval. E-commerce “more like this” on pure visual appearance.
  • Deduplication. Near-duplicate detection across a million-image corpus in seconds.
  • Content-based recommendation. Cold-start item similarity when there’s no user interaction history yet.
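The deduplication case above reduces to thresholding pairwise cosine similarity over an embedding matrix. A minimal sketch, assuming the rows are already L2-normalized so the matrix product is the cosine-similarity matrix; the 0.95 threshold is a starting point to tune per model and corpus:

```python
import numpy as np

def find_near_duplicates(embeddings: np.ndarray, threshold: float = 0.95):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds threshold.

    Assumes each row of `embeddings` is L2-normalized, so the matrix
    product below is exactly the cosine-similarity matrix.
    """
    sims = embeddings @ embeddings.T
    # Upper triangle (k=1) skips self-similarity and mirrored pairs
    i, j = np.where(np.triu(sims, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))

# Toy example: rows 0 and 1 are near-identical, row 2 is unrelated
vecs = np.array([[1.0, 0.0], [0.999, 0.045], [0.0, 1.0]])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
print(find_near_duplicates(vecs))  # [(0, 1)]
```

The dense N×N product is fine up to tens of thousands of images; past that, swap it for the FAISS index from § 02.5 and query each vector for its top-k neighbors instead.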
Missing a model?

Fresh paper, stale data, or a checkpoint we missed — real humans read every message, and we reply within 48 hours.

Tell us