Image embedding.
Convert images directly to dense vector representations for semantic search, clustering, and similarity matching — without ever generating a caption.
Pixels are a terrible index.
A 512×512 photograph is a quarter of a million pixels, roughly three-quarters of a million numbers once you count the color channels. Two photos of the same cat, taken a second apart, share almost none of them. Shift the camera two pixels to the left and the raw arrays become strangers. This is the central nuisance of computer vision: semantic identity does not live in pixel space.
An image embedding is the fix. It is a short list of numbers — typically 512, 768, or 1024 floats — learned so that photographs of similar things land near one another and photographs of unrelated things land far apart. The journey from pixels to embedding is one forward pass through a trained neural network. The vector you get back is the image’s coordinate in a semantic space.
“Two images are similar if a good model, shown them side by side, would agree they depict the same kind of thing. Embeddings are that judgment, compressed into a ruler.”
A space you can walk through.
The clearest way to understand an embedding is to see one. Below: a miniature coordinate system and the cosine-similarity matrix between eight photographs. Notice how the cats cluster, the dogs cluster, and the landscapes hang somewhere else entirely.
[Interactive demo: how image embedding works. An image becomes a vector; similar things have similar vectors; text queries search the same vector space; embeddings visualized in 2D via a t-SNE projection. Includes a clickable visual similarity search and a cosine-similarity matrix between CLIP embeddings (higher = more similar).]
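The matrix itself is nothing exotic: once every embedding is scaled to unit length, cosine similarity is just a dot product, and the whole matrix is one matrix multiply. A minimal NumPy sketch, with made-up four-dimensional vectors standing in for real 768-dimensional embeddings:

import numpy as np

# Toy stand-ins for real embeddings: four dimensions instead of 768, values invented for illustration
embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],   # cat, sitting
    [0.8, 0.2, 0.1, 0.0],   # cat, stretching
    [0.1, 0.9, 0.1, 0.1],   # dog, running
    [0.0, 0.1, 0.9, 0.2],   # mountain landscape
])

# Scale each row to unit length so a dot product equals cosine similarity
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Pairwise similarity matrix in one matrix multiply
similarity = embeddings @ embeddings.T
print(np.round(similarity, 2))   # the two cats score ~0.98; cat vs. landscape ~0.04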
Three ways to compress an image into a ruler.
Direct CLIP / SigLIP
Embed images directly into a shared vision-language space. One forward pass. Text and images live in the same coordinate system, so a phrase and a photo can be compared directly.
- Real-time capable on GPU
- Multilingual text search possible
- Single model, single forward pass
- Misses fine-grained details
- Struggles outside training distribution
CNN feature extraction
Take a pretrained ResNet or EfficientNet, strip the classification head, and use the penultimate layer as the embedding. Older trick; still surprisingly strong.
- Well-understood, decade of tooling
- Dozens of pretrained checkpoints
- No text-to-image search
- Often needs domain fine-tuning
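A minimal sketch of the stripped-head trick, assuming torchvision's pretrained ResNet-50 and a placeholder photo.jpg; replacing the final fully connected layer with an identity leaves the 2048-dimensional pooled features as the embedding:

pip install torch torchvision pillow

import torch
from torchvision import models
from PIL import Image

# Pretrained ResNet-50; the weights object carries the matching preprocessing
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()   # drop the 1000-class head; forward() now returns pooled features
model.eval()
preprocess = weights.transforms()

image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    embedding = model(image)                            # shape (1, 2048)
    embedding /= embedding.norm(dim=-1, keepdim=True)   # normalize for cosine similarity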
Self-supervised (DINOv2)
DINOv2 never sees a caption — it learns purely from the pixels by asking each image to be consistent with different crops of itself. The result is embeddings that excel at fine-grained visual similarity: spotting the same dress in two photos, flagging near-duplicates, clustering on visual style rather than verbal category.
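A sketch of the same one-image-in, one-vector-out step with DINOv2, here via the facebook/dinov2-base checkpoint on Hugging Face and a placeholder photo.jpg; the CLS token of the final layer serves as the embedding:

pip install transformers torch pillow

from transformers import AutoImageProcessor, AutoModel
from PIL import Image
import torch

# Load DINOv2 ViT-B/14; it has no text tower, so this gives image embeddings only
processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')
model.eval()

image = Image.open('photo.jpg')
inputs = processor(images=image, return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]                    # CLS token, shape (1, 768)
embedding = embedding / embedding.norm(dim=-1, keepdim=True)   # normalize for cosine similarity

Because there is no text encoder, queries are image-to-image only; pair it with CLIP or SigLIP when text search is also needed.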
The models worth knowing.
| Model | Origin | License | Best for |
|---|---|---|---|
| OpenAI CLIP ViT-L/14 | OpenAI, 2021 | MIT | General-purpose zero-shot |
| OpenCLIP (LAION) | LAION, 2022+ | MIT | Scale — trained on 2B image-text pairs |
| SigLIP-SO400M | Google, 2023 | Apache 2.0 | Zero-shot accuracy, sigmoid loss |
| DINOv2 | Meta, 2023 | Apache 2.0 | Fine-grained similarity, duplicates |
Three recipes, forty lines each.
The shortest useful path from a folder of JPEGs to a searchable index. The third recipe is a complete visual-search system in roughly forty lines of Python.
CLIP Image Embedding with OpenCLIP
Embed images and search with text queries using OpenCLIP.
pip install open-clip-torch pillow

import open_clip
import torch
from PIL import Image

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Embed an image
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Embed a text query
text = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f'Similarity: {similarity}')

SigLIP with Transformers
Google's SigLIP swaps CLIP's softmax for a sigmoid loss — calibrated probabilities and better zero-shot numbers.
pip install transformers torch pillow

from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# Load SigLIP
processor = AutoProcessor.from_pretrained('google/siglip-so400m-patch14-384')
model = AutoModel.from_pretrained('google/siglip-so400m-patch14-384')

# Prepare inputs
image = Image.open('photo.jpg')
texts = ['a cat sleeping', 'a dog running', 'a sunset']
inputs = processor(
    text=texts,
    images=image,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits_per_image   # image-text similarity
probs = torch.sigmoid(logits)       # SigLIP uses sigmoid, not softmax

for text, prob in zip(texts, probs[0]):
    print(f'{text}: {prob:.3f}')

Build a Visual Search Index
Index thousands of images in a folder, then query them with natural language.
pip install open-clip-torch faiss-cpu pillow tqdm

import open_clip
import torch
import faiss
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

# Setup
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Index all images in a folder
image_paths = list(Path('photos/').glob('*.jpg'))
embeddings = []
for path in tqdm(image_paths, desc='Indexing'):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
        feat /= feat.norm(dim=-1, keepdim=True)
    embeddings.append(feat.numpy())

# Build FAISS index
embeddings = np.vstack(embeddings).astype('float32')
index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product = cosine sim
index.add(embeddings)

# Search with text
query = 'sunset at the beach'
text_tokens = tokenizer([query])
with torch.no_grad():
    query_feat = model.encode_text(text_tokens)
    query_feat /= query_feat.norm(dim=-1, keepdim=True)
D, I = index.search(query_feat.numpy().astype('float32'), k=5)

print('Top 5 matches:')
for i, (dist, idx) in enumerate(zip(D[0], I[0])):
    print(f'  {i+1}. {image_paths[idx].name} (score: {dist:.3f})')

Where this ends up shipping.
- Visual search in photo libraries. “Find the picture of my dog at the beach” without tags or filenames.
- Similar-product retrieval. E-commerce “more like this” on pure visual appearance.
- Deduplication. Near-duplicate detection across a million-image corpus in seconds; see the sketch after this list.
- Content-based recommendation. Cold-start item similarity when there’s no user interaction history yet.
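The deduplication case is a small extension of the third recipe. A rough sketch, assuming the index, embeddings, and image_paths variables built there, with an arbitrary 0.95 similarity threshold you would tune on your own data:

# Assumes the index, embeddings, and image_paths built in the visual-search recipe above
D, I = index.search(embeddings, k=2)   # 2 nearest neighbors per image; the closest is normally the image itself

THRESHOLD = 0.95   # arbitrary starting point; tune per dataset
for row, (sims, idxs) in enumerate(zip(D, I)):
    sim, neighbor = sims[1], idxs[1]          # skip position 0 (the image itself)
    if sim >= THRESHOLD and row < neighbor:   # row < neighbor reports each pair once
        print(f'{image_paths[row].name} ~ {image_paths[neighbor].name} (cos = {sim:.3f})')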