
Image Embedding

Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.

How Image Embedding Works

Neural networks convert images and text into vectors (lists of numbers). Similar concepts have similar vectors.

Step 1: Image becomes a vector

Photo of a cat → 768 dimensions (showing 8):
    animal 0.82, outdoor 0.15, food 0.03, furry 0.91, wild 0.22, water 0.08, small 0.67, cute 0.44
    [0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44]

Photo of a dog → 768 dimensions (showing 8):
    animal 0.88, outdoor 0.45, food 0.05, furry 0.85, wild 0.35, water 0.12, small 0.52, cute 0.78
    [0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78]
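
In code, this step is a single forward pass. A minimal sketch, assuming open_clip is installed, the same ViT-L-14 checkpoint used in the Code Examples section below, and a local photo.jpg: one image goes in, one L2-normalized 768-dimensional vector comes out.

import open_clip
import torch
from PIL import Image

# Load the same ViT-L-14 checkpoint used in the Code Examples section.
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()

# One image in, one vector out.
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    vec = model.encode_image(image)
    vec = vec / vec.norm(dim=-1, keepdim=True)  # unit length, ready for cosine similarity

print(vec.shape)  # torch.Size([1, 768]) -- one 768-dimensional vector per image
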
Step 2: Similar things have similar vectors

Cosine similarity between four example images (a photo of a cat, a photo of a dog, a sunset landscape, a meal):

            cat    dog    landscape  meal
cat         1.00   0.95   0.29       0.34
dog         0.95   1.00   0.48       0.39
landscape   0.29   0.48   1.00       0.38
meal        0.34   0.39   0.38       1.00

Key Insight: The cat and dog vectors are similar (0.95) because both are animals, while the landscape is far less similar to either (0.29 and 0.48).
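
To make the numbers concrete, here is the cosine similarity formula applied to the two illustrative 8-dimensional vectors from step 1 (real embeddings have 768 dimensions); it reproduces the cat/dog entry of the matrix above.

import numpy as np

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# The illustrative 8-dimensional vectors from step 1.
cat = np.array([0.82, 0.15, 0.03, 0.91, 0.22, 0.08, 0.67, 0.44])
dog = np.array([0.88, 0.45, 0.05, 0.85, 0.35, 0.12, 0.52, 0.78])

print(round(cosine_similarity(cat, dog), 2))  # 0.95 -- the cat/dog entry above
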
Step 3: Search by text (same vector space)

A text query is embedded by the same model's text encoder, so its vector can be compared directly against the stored image vectors (the OpenCLIP example under Code Examples below shows this end to end).

Step 4: Visualizing in 2D (t-SNE projection)

Projected down to two dimensions, the embeddings separate into clusters for animals, landscapes, and food: similar items cluster together in vector space.
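
The projection itself takes only a few lines. A minimal sketch, assuming scikit-learn and matplotlib are installed; the random array below is a stand-in for real image embeddings.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Stand-in for real, L2-normalized image embeddings (30 images, 768 dims each).
embeddings = np.random.rand(30, 768).astype('float32')
labels = ['animal'] * 10 + ['landscape'] * 10 + ['food'] * 10

# t-SNE squeezes 768 dimensions down to 2 while trying to keep neighbors close.
coords = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(embeddings)

for group in ('animal', 'landscape', 'food'):
    idx = [i for i, label in enumerate(labels) if label == group]
    plt.scatter(coords[idx, 0], coords[idx, 1], label=group)
plt.legend()
plt.title('Image embeddings projected to 2D with t-SNE')
plt.show()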

Try It: Visual Similarity Search

See how CLIP embeds images and text into the same vector space. Click an image or type a query to find similar content.

Sample images: Orange cat, Gray cat, Golden retriever, Puppy, Mountain sunset, Beach, Food plate, Pancakes.

Image Similarity Matrix

Cosine similarity between CLIP embeddings. Higher = more similar.

Rows and columns are in the same order: Orange cat, Gray cat, Golden retriever, Puppy, Mountain sunset, Beach, Food plate, Pancakes.

                      1      2      3      4      5      6      7      8
1 Orange cat        1.00   0.92   0.45   0.48   0.22   0.25   0.31   0.29
2 Gray cat          0.92   1.00   0.43   0.46   0.24   0.23   0.28   0.27
3 Golden retriever  0.45   0.43   1.00   0.89   0.28   0.31   0.35   0.33
4 Puppy             0.48   0.46   0.89   1.00   0.26   0.29   0.32   0.31
5 Mountain sunset   0.22   0.24   0.28   0.26   1.00   0.78   0.19   0.21
6 Beach             0.25   0.23   0.31   0.29   0.78   1.00   0.22   0.24
7 Food plate        0.31   0.28   0.35   0.32   0.19   0.22   1.00   0.85
8 Pancakes          0.29   0.27   0.33   0.31   0.21   0.24   0.85   1.00
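
With L2-normalized embeddings, a matrix like this is a single matrix product, because cosine similarity reduces to a dot product. A minimal sketch; the random array is a stand-in for the eight CLIP image embeddings.

import numpy as np

# Stand-in for the 8 CLIP image embeddings (one row per image).
embeddings = np.random.rand(8, 768).astype('float32')
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # unit length

similarity_matrix = embeddings @ embeddings.T  # shape (8, 8), diagonal = 1.0
print(np.round(similarity_matrix, 2))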

Use Cases

  • Visual search in photo libraries
  • Finding similar products
  • Image deduplication (see the sketch after this list)
  • Content-based recommendation
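
For deduplication, the same pairwise matrix can be thresholded to flag near-duplicate pairs. A minimal sketch; the random array is a stand-in for real image embeddings, and the 0.95 cutoff is an assumed value to tune per dataset.

import numpy as np

# Stand-in for L2-normalized embeddings of 100 images.
embeddings = np.random.rand(100, 768).astype('float32')
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

similarity = embeddings @ embeddings.T
similarity = np.triu(similarity, k=1)  # keep each pair once, drop self-matches

for i, j in np.argwhere(similarity > 0.95):
    print(f'images {i} and {j} look like near-duplicates (sim={similarity[i, j]:.3f})')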

Architectural Patterns

Direct CLIP/SigLIP Embedding

Embed images directly into a shared vision-language space. Fast, works well for general concepts.

Pros:
  • Single-step process
  • Real-time capable
  • Multilingual search possible
Cons:
  • May miss fine-grained details
  • Limited to training distribution

CNN Feature Extraction

Use a pre-trained CNN's (ResNet, EfficientNet) penultimate-layer activations as the embedding; a minimal sketch follows the pros and cons below.

Pros:
  • Well understood
  • Many pre-trained options
Cons:
  • No text-to-image search
  • Requires fine-tuning for specialized domains
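
A minimal sketch of this pattern, assuming a recent torchvision and a local photo.jpg: load a pre-trained ResNet-50, drop its classification head, and use the pooled 2048-dimensional activations as the embedding.

import torch
from torchvision import models
from PIL import Image

# Pre-trained ResNet-50 with the classifier removed: the output is the
# penultimate (pooled) layer, a 2048-dimensional feature vector.
weights = models.ResNet50_Weights.DEFAULT
backbone = models.resnet50(weights=weights)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = weights.transforms()  # the resize/normalize pipeline these weights expect

image = preprocess(Image.open('photo.jpg').convert('RGB')).unsqueeze(0)
with torch.no_grad():
    feat = backbone(image)                          # shape (1, 2048)
    feat = feat / feat.norm(dim=-1, keepdim=True)   # normalize for cosine similarity

print(feat.shape)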

Implementations

Open Source

OpenAI CLIP

MIT
Open Source

The original contrastive vision-language model. ViT-L/14 is the most used variant.

SigLIP

Apache 2.0
Open Source

Improved CLIP training with sigmoid loss. Better zero-shot performance.

OpenCLIP

MIT
Open Source

Open-source CLIP reproductions trained on LAION. Multiple model sizes available.

DINOv2

Apache 2.0
Open Source

Self-supervised vision features. Excellent for fine-grained visual similarity.
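
A minimal sketch of using DINOv2 through Hugging Face Transformers, assuming a version that ships DINOv2 and the facebook/dinov2-base checkpoint; DINOv2 has no text encoder, so the vectors are used for image-to-image similarity only.

import torch
from transformers import AutoImageProcessor, AutoModel
from PIL import Image

processor = AutoImageProcessor.from_pretrained('facebook/dinov2-base')
model = AutoModel.from_pretrained('facebook/dinov2-base')
model.eval()

image = Image.open('photo.jpg').convert('RGB')
inputs = processor(images=image, return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)
    # Use the CLS token as a global image descriptor (768 dims for the base model).
    embedding = outputs.last_hidden_state[:, 0]
    embedding = embedding / embedding.norm(dim=-1, keepdim=True)

print(embedding.shape)  # torch.Size([1, 768])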

Benchmarks

Code Examples

CLIP Image Embedding with OpenCLIP

Embed images and search with text queries using OpenCLIP

Install: pip install open-clip-torch pillow
import open_clip
import torch
from PIL import Image

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Embed an image
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Embed a text query
text = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f'Similarity: {similarity}')

SigLIP with Transformers

Use Google's SigLIP for better zero-shot performance

Install: pip install transformers torch pillow
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# Load SigLIP
processor = AutoProcessor.from_pretrained('google/siglip-so400m-patch14-384')
model = AutoModel.from_pretrained('google/siglip-so400m-patch14-384')

# Prepare inputs
image = Image.open('photo.jpg')
texts = ['a cat sleeping', 'a dog running', 'a sunset']

inputs = processor(
    text=texts,
    images=image,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits_per_image  # image-text similarity
    probs = torch.sigmoid(logits)  # SigLIP uses sigmoid, not softmax

for text, prob in zip(texts, probs[0]):
    print(f'{text}: {prob:.3f}')

Build a Visual Search Index

Index thousands of images for fast similarity search

Install: pip install open-clip-torch faiss-cpu pillow tqdm
import open_clip
import torch
import faiss
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

# Setup
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Index all images in a folder
image_paths = list(Path('photos/').glob('*.jpg'))
embeddings = []

for path in tqdm(image_paths, desc='Indexing'):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
        feat /= feat.norm(dim=-1, keepdim=True)
    embeddings.append(feat.numpy())

# Build FAISS index
embeddings = np.vstack(embeddings).astype('float32')
index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product = cosine sim
index.add(embeddings)

# Search with text
query = 'sunset at the beach'
text_tokens = tokenizer([query])
with torch.no_grad():
    query_feat = model.encode_text(text_tokens)
    query_feat /= query_feat.norm(dim=-1, keepdim=True)

D, I = index.search(query_feat.numpy().astype('float32'), k=5)
print('Top 5 matches:')
for i, (dist, idx) in enumerate(zip(D[0], I[0])):
    print(f'  {i+1}. {image_paths[idx].name} (score: {dist:.3f})')

Quick Facts

  • Input: Image
  • Output: Vector
  • Implementations: 4 open source, 0 API
  • Patterns: 2 approaches

Have benchmark data?

Help us track the state of the art for image embedding.

Submit Results