Image Embedding
Convert images directly to dense vector representations for semantic search, clustering, and similarity matching.
How Image Embedding Works
Neural networks convert images and text into vectors (lists of numbers). Similar concepts end up with similar vectors, as the toy sketch after this list illustrates.
- Image becomes a vector
- Similar things have similar vectors
- Search by text (same vector space)
- Visualizing in 2D (t-SNE projection)
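To make this concrete, here is a toy sketch with made-up 4-dimensional vectors; real models produce hundreds or thousands of dimensions, and all of the numbers and names below are purely illustrative:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of L2-normalized vectors (1.0 = same direction)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings; a real model would produce these from pixels or text
cat_image = np.array([0.9, 0.1, 0.0, 0.2])
kitten_image = np.array([0.8, 0.2, 0.1, 0.3])
car_image = np.array([0.1, 0.9, 0.7, 0.0])
cat_text = np.array([0.85, 0.15, 0.05, 0.25])  # text lands in the same space (CLIP-style)

print(cosine_similarity(cat_image, kitten_image))  # high: similar concepts
print(cosine_similarity(cat_image, car_image))     # low: different concepts
print(cosine_similarity(cat_text, cat_image))      # text query retrieves the cat image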
Visual similarity search (interactive demo): CLIP embeds images and text into the same vector space, so a text query can retrieve matching images.
Image similarity matrix (interactive demo): cosine similarity between CLIP embeddings; higher means more similar.
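Such a matrix is easy to reproduce offline once you have the embeddings: L2-normalize them and take pairwise dot products. A minimal sketch in NumPy, with random vectors standing in for real CLIP embeddings:

import numpy as np

# Stand-in for real image embeddings: shape (N, D)
embeddings = np.random.rand(4, 512).astype('float32')

# L2-normalize rows so dot products equal cosine similarities
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Entry [i, j] is the cosine similarity between images i and j
similarity_matrix = embeddings @ embeddings.T
print(similarity_matrix.round(3))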
Use Cases
- ✓ Visual search in photo libraries
- ✓ Finding similar products
- ✓ Image deduplication (see the sketch after this list)
- ✓ Content-based recommendation
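Deduplication in particular falls out of the embeddings directly: near-duplicate images have near-identical vectors. A minimal sketch, with random vectors standing in for real, L2-normalized image embeddings; the helper name find_near_duplicates and the 0.95 threshold are illustrative and should be tuned per dataset:

import numpy as np

def find_near_duplicates(embeddings, image_paths, threshold=0.95):
    # Pairwise cosine similarities (embeddings assumed L2-normalized)
    sims = embeddings @ embeddings.T
    pairs = []
    for i in range(len(image_paths)):
        for j in range(i + 1, len(image_paths)):
            if sims[i, j] >= threshold:
                pairs.append((image_paths[i], image_paths[j], float(sims[i, j])))
    return pairs

# Example usage with placeholder data
emb = np.random.rand(10, 768).astype('float32')
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
paths = [f'photo_{i}.jpg' for i in range(10)]
for a, b, score in find_near_duplicates(emb, paths):
    print(f'{a} ~ {b} (cosine {score:.3f})')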
Architectural Patterns
Direct CLIP/SigLIP Embedding
Embed images directly into a shared vision-language space. Fast, and works well for general concepts.
Pros:
- Single-step process
- Real-time capable
- Multilingual search possible

Cons:
- May miss fine-grained details
- Limited to training distribution
CNN Feature Extraction
Use the penultimate layer of a pre-trained CNN (ResNet, EfficientNet) as the embedding; see the sketch after this list.

Pros:
- Well understood
- Many pre-trained options

Cons:
- No text-to-image search
- Requires fine-tuning for specialized domains
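A minimal sketch of this pattern, using torchvision's ResNet-50 as one possible backbone (the weights enum and its transforms() helper assume torchvision 0.13+; the file name is a placeholder):

import torch
from torchvision import models
from PIL import Image

# Load a pre-trained ResNet-50 plus the preprocessing it was trained with
weights = models.ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
model.fc = torch.nn.Identity()  # drop the classifier head; forward() now returns penultimate features
model.eval()
preprocess = weights.transforms()

image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    embedding = model(image)  # shape: (1, 2048)
    embedding /= embedding.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
print(embedding.shape)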
Implementations
Open Source
OpenAI CLIP
MIT license. The original contrastive vision-language model. ViT-L/14 is the most used variant.
SigLIP
Apache 2.0 license. Improved CLIP training with a sigmoid loss. Better zero-shot performance.
OpenCLIP
MIT license. Open-source CLIP reproductions trained on LAION. Multiple model sizes available.
DINOv2
Apache 2.0 license. Self-supervised vision features. Excellent for fine-grained visual similarity.
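DINOv2 does not appear in the code examples below; here is a minimal sketch of image-to-image similarity with it, assuming the facebookresearch/dinov2 torch.hub entry point and the ViT-B/14 variant (one option among several; file names are placeholders):

import torch
from PIL import Image
from torchvision import transforms

# Load DINOv2 ViT-B/14 from torch.hub (downloads weights on first use)
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
model.eval()

# Standard ImageNet-style preprocessing; 224 is divisible by the 14-pixel patch size
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def embed(path):
    image = preprocess(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        feat = model(image)  # (1, 768) global image embedding for ViT-B/14
    return feat / feat.norm(dim=-1, keepdim=True)

a, b = embed('photo_a.jpg'), embed('photo_b.jpg')
print(float(a @ b.T))  # cosine similarity between the two images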
Code Examples
CLIP Image Embedding with OpenCLIP
Embed images and search with text queries using OpenCLIP
pip install open-clip-torch pillow

import open_clip
import torch
from PIL import Image

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Embed an image
image = preprocess(Image.open('photo.jpg')).unsqueeze(0)
with torch.no_grad():
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

# Embed a text query
text = tokenizer(['a photo of a cat', 'a photo of a dog'])
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)

# Compute similarity (scaled cosine similarity, softmax over the text candidates)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(f'Similarity: {similarity}')

SigLIP with Transformers
Use Google's SigLIP for better zero-shot performance
pip install transformers torch pillow

from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch

# Load SigLIP
processor = AutoProcessor.from_pretrained('google/siglip-so400m-patch14-384')
model = AutoModel.from_pretrained('google/siglip-so400m-patch14-384')

# Prepare inputs
image = Image.open('photo.jpg')
texts = ['a cat sleeping', 'a dog running', 'a sunset']
inputs = processor(
    text=texts,
    images=image,
    padding='max_length',
    return_tensors='pt'
)

# Get embeddings and image-text similarity scores
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits_per_image  # image-text similarity
probs = torch.sigmoid(logits)  # SigLIP uses sigmoid, not softmax

for text, prob in zip(texts, probs[0]):
    print(f'{text}: {prob:.3f}')

Build a Visual Search Index
Index thousands of images for fast similarity search
pip install open-clip-torch faiss-cpu pillow tqdm

import open_clip
import torch
import faiss
import numpy as np
from PIL import Image
from pathlib import Path
from tqdm import tqdm

# Setup
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='laion2b_s32b_b82k'
)
model.eval()
tokenizer = open_clip.get_tokenizer('ViT-L-14')

# Index all images in a folder
image_paths = list(Path('photos/').glob('*.jpg'))
embeddings = []
for path in tqdm(image_paths, desc='Indexing'):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        feat = model.encode_image(image)
        feat /= feat.norm(dim=-1, keepdim=True)
    embeddings.append(feat.numpy())

# Build FAISS index
embeddings = np.vstack(embeddings).astype('float32')
index = faiss.IndexFlatIP(embeddings.shape[1])  # Inner product on unit vectors = cosine similarity
index.add(embeddings)

# Search with text
query = 'sunset at the beach'
text_tokens = tokenizer([query])
with torch.no_grad():
    query_feat = model.encode_text(text_tokens)
    query_feat /= query_feat.norm(dim=-1, keepdim=True)
D, I = index.search(query_feat.numpy().astype('float32'), 5)  # top-5 nearest neighbors

print('Top 5 matches:')
for i, (dist, idx) in enumerate(zip(D[0], I[0])):
    print(f' {i+1}. {image_paths[idx].name} (score: {dist:.3f})')

Quick Facts
- Input: Image
- Output: Vector
- Implementations: 4 open source, 0 API
- Patterns: 2 approaches