Image Search with CLIP
How contrastive learning unified vision and language in a single embedding space — and why it changed everything downstream.
25 Years of Teaching Machines to See
CLIP didn't appear in a vacuum. It sits at the end of a long lineage of progressively more powerful approaches to image understanding — from hand-crafted feature descriptors to learned convolutional filters to attention-based transformers. Each generation solved one critical limitation of the last. Understanding this lineage is the fastest way to grasp why CLIP's specific architecture choices matter.
The fundamental question has remained the same since the 1960s: how do you represent an image as numbers that a computer can reason about? The answers have changed dramatically.
SIFT: Scale-Invariant Feature Transform
In 1999, David Lowe at the University of British Columbia introduced SIFT, refined in the 2004 IJCV paper that defined a decade of computer vision. SIFT detected "keypoints" in images — corners, edges, blobs — and described each one as a 128-dimensional vector based on local gradient orientations. These descriptors were invariant to scale, rotation, and partial changes in illumination.
# SIFT pipeline (conceptual)
keypoints = detect_extrema(DoG_pyramid(image))         # Find interesting points
descriptors = compute_gradient_histograms(keypoints)   # 128-dim descriptor per keypoint
# Match images by finding keypoints with similar descriptors
matches = brute_force_match(descriptors_A, descriptors_B)
Each image became a bag of keypoint descriptors, not a single vector. You couldn't compute "cosine similarity between two images" — you had to run a matching algorithm.
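That matching step can be sketched in plain NumPy. This is a hedged illustration, not Lowe's exact algorithm: random vectors stand in for real SIFT descriptors, and `ratio` is the ratio-test threshold (a match counts only if the nearest neighbor is clearly better than the second nearest):

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Brute-force match two sets of 128-dim descriptors
    using the ratio test between the two nearest neighbors."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)  # distance to every descriptor in B
        nearest, second = np.argsort(dists)[:2]
        if dists[nearest] < ratio * dists[second]:  # unambiguous match only
            matches.append((i, nearest))
    return matches

rng = np.random.default_rng(0)
desc_b = rng.normal(size=(50, 128))                            # "image B" descriptors
desc_a = desc_b[:10] + rng.normal(scale=0.01, size=(10, 128))  # noisy copies from "image A"
matches = match_descriptors(desc_a, desc_b)
print(len(matches))  # the noisy copies find their originals
```

Note that the output is a list of index pairs, not a similarity score: turning a bag of matches into a single "how similar are these images" number required extra machinery, which is exactly the gap learned embeddings later closed.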
SIFT and its successors (SURF, ORB) powered Google Image Search, panorama stitching, and augmented reality for over a decade. But they had a fundamental limitation: the features were hand-designed by humans. They captured low-level geometry — edges, corners, textures — but had no concept of semantics. A SIFT descriptor could match the same building from two angles but couldn't tell you it was a "church."
— Lowe, D. (2004). Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2), 91–110. 70,000+ citations.
HOG, Bag of Visual Words & Deformable Parts
The community built increasingly sophisticated pipelines on top of hand-crafted features. HOG (Dalal & Triggs, 2005) encoded gradient histograms in a grid pattern — the backbone of pedestrian detection for years. Bag of Visual Words (Csurka et al., 2004) borrowed from text retrieval: cluster SIFT descriptors into a "visual vocabulary," then represent each image as a histogram of visual word frequencies. DPM (Felzenszwalb et al., 2010) won PASCAL VOC detection three years running by modeling objects as collections of deformable parts. The entire edifice was hand-engineered feature extraction followed by an SVM classifier. It worked, but each new task required months of feature engineering.
AlexNet: The Inflection Point
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a deep convolutional neural network in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It won by a margin so large — 15.3% top-5 error vs. the runner-up's 26.2% — that it ended the debate about whether learned features could compete with hand-crafted ones.
AlexNet had 60 million parameters, trained on two GTX 580 GPUs for six days. The key insight was not architectural novelty (CNNs existed since LeCun's 1989 work) but scale: enough data (1.2M ImageNet images), enough compute (GPU training), and enough capacity (8 layers instead of 2–3). The penultimate layer — a 4096-dimensional vector — became the first widely used learned image embedding. Researchers quickly discovered you could use this "AlexNet feature" for any vision task, even ones the network was never trained on.
# The "AlexNet feature" — first practical learned image embedding
model = AlexNet(pretrained=True)
model.classifier = model.classifier[:-1]  # Remove final classification layer
image_embedding = model(image)            # Shape: (4096,)
# This 4096-dim vector transfers to new tasks without retraining
Deeper Networks: VGG, GoogLeNet, ResNet
The community rapidly scaled depth. VGG-16 (Simonyan & Zisserman, 2014) showed that uniform 3x3 convolutions stacked 16 layers deep could match more complex architectures. GoogLeNet/Inception (Szegedy et al., 2014) introduced multi-scale parallel convolutions. Then ResNet (He et al., 2015) solved the vanishing gradient problem with skip connections, pushing networks to 152 layers and achieving 3.57% top-5 error — surpassing estimated human performance on ImageNet.
Each of these networks produced an image embedding as a byproduct of classification. "ResNet features" became the default backbone for detection (Faster R-CNN), segmentation (U-Net), and retrieval throughout 2016–2020. But they all shared a fundamental constraint: the embedding only understood categories it was trained on. A ResNet trained on ImageNet's 1000 classes had no way to understand "a dog wearing a birthday hat" — that concept existed in no training label.
ViT: An Image is Worth 16x16 Words
In 2020, Alexey Dosovitskiy et al. at Google Brain asked a heretical question: what if you just chopped an image into patches and fed them to a standard Transformer — no convolutions at all? Split a 224x224 image into 196 patches of 16x16 pixels, linearly project each patch into a vector, add position embeddings, and run self-attention.
The result, Vision Transformer (ViT), matched or exceeded the best CNNs when pre-trained on enough data (JFT-300M, 300 million images). The architecture was simpler than any CNN — no pooling layers, no strided convolutions, no feature pyramid. Just patches and attention. ViT became the standard image encoder for CLIP four months later.
— Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words. ICLR.
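The patch-embedding step described above can be sketched in a few lines of NumPy. Dimensions follow ViT-B/16 (16x16 patches, width 768); the projection matrix and position embeddings are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # H x W x C input image

P = 16  # patch size
# Split into a 14x14 grid of 16x16x3 patches, then flatten each patch
patches = image.reshape(224 // P, P, 224 // P, P, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)
print(patches.shape)  # (196, 768): 196 flattened patches

W = rng.normal(size=(P * P * 3, 768))  # learned linear projection (random here)
tokens = patches @ W                   # (196, 768) patch tokens
pos = rng.normal(size=tokens.shape)    # learned position embeddings (random here)
tokens = tokens + pos                  # ready for the Transformer stack
print(tokens.shape)                    # (196, 768)
```

Everything after this step is a standard Transformer: the "no convolutions at all" claim really does hold, because the only image-specific operation is this reshape-and-project.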
CLIP: Contrastive Language-Image Pre-training
In 2021, Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh et al. at OpenAI combined two ideas: the ViT image encoder and a Transformer text encoder, trained together on 400 million image-text pairs scraped from the internet (the WIT dataset, never publicly released). The training objective was contrastive: given a batch of N image-text pairs, maximize the similarity of matching pairs while minimizing the similarity of all N² - N non-matching pairs.
The result was a shared embedding space where images and text with the same meaning occupied the same neighborhood. For the first time, you could search images with natural language, classify into arbitrary categories without retraining, and measure image-text relevance — all from a single model. CLIP achieved 76.2% zero-shot top-1 accuracy on ImageNet — matching a supervised ResNet-50 without seeing a single ImageNet training image.
SigLIP: Sigmoid Loss for Language-Image Pre-training
Xiaohua Zhai et al. at Google replaced CLIP's softmax-based contrastive loss with a simpler sigmoid loss that operates on individual pairs rather than requiring the full NxN similarity matrix. This removed the need for large batch sizes (CLIP needed 32,768) and enabled efficient training on smaller hardware. SigLIP achieved better performance with fewer resources and produced better-calibrated similarity scores that function as actual probabilities.
— Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV.
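A minimal NumPy sketch of the sigmoid loss shows why no batch-wide normalization is needed — each cell of the N×N similarity matrix is an independent binary decision. The scale `t` and bias `b` values roughly follow the paper's initialization but should be treated as illustrative, and the embeddings here are random stand-ins:

```python
import numpy as np

def siglip_loss(img, txt, t=10.0, b=-10.0):
    """Sigmoid pairwise loss: every (image, text) cell in the NxN
    matrix is an independent yes/no problem — no softmax over the batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T * t + b          # (N, N) scaled similarities
    labels = 2 * np.eye(len(img)) - 1     # +1 on the diagonal, -1 elsewhere
    # -log sigmoid(labels * logits), computed stably via logaddexp
    return np.logaddexp(0.0, -labels * logits).mean()

rng = np.random.default_rng(0)
img, txt = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
print(siglip_loss(img, txt))  # scalar loss for the batch
```

Because each cell is scored independently, the matrix can be computed in chunks across devices — this is what removes the giant-batch requirement.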
The throughline: 1999 → 2026
Twenty-five years of the same question, with progressively better answers.
Each generation replaced more hand-engineering with more learning. CLIP's breakthrough was replacing labeled categories with natural language descriptions — turning the entire internet into a training set.
How Contrastive Learning Works
CLIP's training objective is contrastive learning — arguably the most important idea in multimodal AI. The core principle is deceptively simple: given a batch of image-text pairs, push matching pairs together and push non-matching pairs apart in a shared embedding space.
But the details of how you push and pull — the loss function — have profound consequences for what the model learns.
The Contrastive Setup
Positive Pairs (diagonal)
Image of a sunset + "a beautiful sunset over the ocean"
These came from the same web page. Push their embeddings CLOSE together.
Negative Pairs (off-diagonal)
Image of a sunset + "a cat sitting on a keyboard"
Random pairings within the batch. Push their embeddings FAR apart.
InfoNCE Loss: The Math Behind CLIP
CLIP uses a symmetric variant of InfoNCE (Noise Contrastive Estimation), first introduced by van den Oord et al. (2018). For a batch of N image-text pairs, compute the NxN matrix of cosine similarities between all image and text embeddings, then apply cross-entropy loss twice — once treating each image as a query (which text matches?) and once treating each text as a query (which image matches?).
# InfoNCE loss for CLIP (actual PyTorch)
import torch
import torch.nn.functional as F
def clip_loss(image_embeds, text_embeds, temperature=0.07):
"""
image_embeds: (N, D) — L2-normalized image embeddings
text_embeds: (N, D) — L2-normalized text embeddings
temperature: learnable scalar (initialized to 0.07)
The i-th image matches the i-th text (diagonal = positive pairs).
"""
# Compute NxN similarity matrix, scaled by temperature
logits = (image_embeds @ text_embeds.T) / temperature # (N, N)
# Labels: the diagonal — image[i] matches text[i]
labels = torch.arange(len(image_embeds), device=image_embeds.device)
# Symmetric loss: image→text and text→image
loss_i2t = F.cross_entropy(logits, labels) # Each row: which text?
loss_t2i = F.cross_entropy(logits.T, labels) # Each column: which image?
return (loss_i2t + loss_t2i) / 2

The temperature parameter is crucial. A lower temperature makes the softmax distribution sharper — the model must be more confident about which pairs match. CLIP learns this parameter during training, starting at 0.07 and typically converging around 0.01. Getting the temperature wrong can collapse training entirely.
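The sharpening effect is easy to see in isolation. The similarity values below are illustrative, but the mechanism is exactly what the temperature does inside the loss:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

sims = np.array([0.9, 0.5, 0.1])  # cosine similarities of one image to 3 captions
print(softmax(sims / 1.0))        # temperature 1.0: mild preference for the best caption
print(softmax(sims / 0.07))       # temperature 0.07: near one-hot — the model must commit
```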
Why CLIP Needs Enormous Batch Sizes
In a batch of N pairs, each positive pair is contrasted against N-1 negative pairs. With a batch size of 32,768 (CLIP's actual training setting), each image is compared against 32,767 wrong captions. Larger batches provide more informative negatives — a harder "multiple choice test" — which forces more discriminative representations.
This is also CLIP's main limitation: you need hundreds of GPUs to maintain large batches. CLIP was trained on 256 V100 GPUs. This motivated SigLIP's sigmoid loss, which doesn't require the full NxN matrix and works with smaller batches.
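You can see the batch-size effect directly in the loss itself: for uninformative (random) embeddings, the symmetric InfoNCE loss sits at its chance level of log N, so every doubling of the batch raises the bar the encoders must clear. A sketch with random unit vectors and temperature 1 for readability:

```python
import numpy as np

def info_nce(img, txt, temperature=1.0):
    """Symmetric InfoNCE on L2-normalized embeddings (NumPy)."""
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def xent(m):
        # cross-entropy with the diagonal as the correct class
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
for n in (32, 128, 512):
    img = rng.normal(size=(n, 256))
    img /= np.linalg.norm(img, axis=1, keepdims=True)
    txt = rng.normal(size=(n, 256))
    txt /= np.linalg.norm(txt, axis=1, keepdims=True)
    print(n, round(info_nce(img, txt), 2), round(np.log(n), 2))  # loss ≈ log N
```

Useful encoders must drive the loss below log N; larger N makes that harder, which is the formal version of the "harder multiple choice test" above.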
CLIP Loss vs. SigLIP Loss
CLIP (Softmax / InfoNCE)
Treats each row/column as a classification problem: "which of these N items is the correct match?" Requires the full NxN matrix, so batch size directly determines the number of negatives.
loss = -log(exp(sim(i,t)/τ) / Σ_j exp(sim(i,t_j)/τ))

SigLIP (Sigmoid / Binary)
Treats each pair independently: "does this image match this text? Yes/no." No softmax normalization across the batch. Works with any batch size and can be chunked.
loss = -log(σ(y_ij · sim(i,j)/τ + b))
# y_ij = +1 if match, -1 if not

CLIP's Architecture
CLIP has two parallel encoders that produce vectors in the same space. The image encoder and text encoder are trained jointly from scratch — their only connection is the contrastive loss that pulls matching pairs together.
Image Encoder
Either a ResNet (RN50, RN101) or a Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14). The ViT variants are stronger. The image is split into patches, each patch is linearly projected, and self-attention produces a final [CLS] token embedding.
# ViT-L/14 image encoder
image (224×224×3)
→ 14×14-pixel patches (256 patches)
→ linear projection (each → 1024-dim)
→ + [CLS] token + position embeddings
→ 24 Transformer layers
→ [CLS] output → linear projection
→ L2-normalize → 768-dim embedding

Text Encoder
A standard Transformer (not BERT — CLIP uses its own tokenizer and architecture). Text is tokenized with BPE (49,152 vocab), padded/truncated to 77 tokens, and the [EOS] token's output is used as the text embedding.
# CLIP text encoder
text "a photo of a dog"
→ BPE tokenize → [49406, 320, 1125, ...]
→ token embedding lookup
→ + position embeddings
→ 12 Transformer layers (masked attention)
→ [EOS] output → linear projection
→ L2-normalize → 768-dim embedding

Both encoders project into the same 512 or 768-dimensional space. After L2 normalization, similarity is just a dot product. An image of a dog and the text "a photo of a dog" end up as nearby vectors. This shared space is what enables every downstream application — zero-shot classification, cross-modal search, and image generation guidance.
Zero-Shot Classification
CLIP's most celebrated capability: classify images into categories the model was never explicitly trained on. No fine-tuning, no labeled data, no retraining. Just text prompts that describe each category.
The Algorithm
1. Define candidate classes as text prompts. "a photo of a dog", "a photo of a cat", "a photo of a bird". Prompt engineering matters — "a photo of a {class}" works better than just the class name because CLIP was trained on image-caption pairs, not single words.
2. Encode all prompts once. Each class prompt becomes a vector in the shared space. Cache these — they don't change.
3. Encode the image. The image becomes a vector in the same space.
4. Pick the highest cosine similarity. The class whose text embedding is most similar to the image embedding wins. Apply softmax to get calibrated probabilities.
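Step 1 is often extended with prompt ensembling: encode several templates per class and average the normalized embeddings (the CLIP paper ensembles dozens of templates for ImageNet). The sketch below uses random vectors as stand-ins for real CLIP text embeddings; only the averaging logic is the point, and `encode_text` is a hypothetical placeholder for the model call:

```python
import numpy as np

templates = ["a photo of a {}", "a blurry photo of a {}", "a drawing of a {}"]
classes = ["dog", "cat"]

rng = np.random.default_rng(0)
def encode_text(prompt):
    # Placeholder for model.encode_text — returns a random 512-dim vector
    return rng.normal(size=512)

class_embeds = []
for c in classes:
    embs = np.stack([encode_text(t.format(c)) for t in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)   # normalize each template
    mean = embs.mean(axis=0)
    class_embeds.append(mean / np.linalg.norm(mean))      # renormalize the average
class_embeds = np.stack(class_embeds)                     # one vector per class
print(class_embeds.shape)  # (2, 512)
```

The averaged class vectors then replace the single-prompt embeddings in step 4 unchanged.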
Working Code with OpenCLIP
OpenAI's original CLIP is frozen at 2021 weights. For production, use OpenCLIP — the open-source reimplementation by LAION that provides newer, stronger models trained on larger datasets.
Zero-Shot Classification
import open_clip
import torch
from PIL import Image
# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()
# Prepare image and candidate labels
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
classes = ["dog", "cat", "bird", "car", "house"]
prompts = [f"a photo of a {c}" for c in classes]
text = tokenizer(prompts)
# Encode and classify
with torch.no_grad():
image_features = model.encode_image(image)
text_features = model.encode_text(text)
# Normalize and compute similarities
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
for cls, prob in zip(classes, probs[0]):
print(f" {cls}: {prob.item():.1%}")

Install: pip install open-clip-torch
Text-to-Image Search Over a Collection
import open_clip
import torch
from pathlib import Path
from PIL import Image
model, _, preprocess = open_clip.create_model_and_transforms(
'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()
# Step 1: Encode all images in a directory (do this once, save to disk)
image_dir = Path("./photos")
image_paths = list(image_dir.glob("*.jpg"))
image_embeddings = []
with torch.no_grad():
for path in image_paths:
img = preprocess(Image.open(path)).unsqueeze(0)
emb = model.encode_image(img)
emb /= emb.norm(dim=-1, keepdim=True)
image_embeddings.append(emb)
image_embeddings = torch.cat(image_embeddings) # (N, 768)
# Step 2: Search with natural language
query = "a dog playing in snow"
text = tokenizer([query])
with torch.no_grad():
text_emb = model.encode_text(text)
text_emb /= text_emb.norm(dim=-1, keepdim=True)
# Step 3: Rank by similarity
similarities = (text_emb @ image_embeddings.T).squeeze(0)
top_k = similarities.argsort(descending=True)[:5]
for idx in top_k:
print(f" {image_paths[idx].name}: {similarities[idx]:.3f}")

For production, store embeddings in a vector database (Qdrant, Pinecone, pgvector) instead of recomputing them per query. Encoding images is the expensive step — encoding text queries is nearly instant.
Image-to-Image Search
CLIP embeddings also enable reverse image search — find visually and semantically similar images by comparing image embeddings directly.
# Given a query image, find similar images in the collection
query_img = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
query_emb = model.encode_image(query_img)
query_emb /= query_emb.norm(dim=-1, keepdim=True)
# Same dot product — works because all embeddings share the same space
similarities = (query_emb @ image_embeddings.T).squeeze(0)
# Top results will be semantically similar, not just pixel-similar

Model Comparison: Choosing the Right CLIP
Since OpenAI released CLIP in 2021, the community has created improved variants with better performance, open training data, or more efficient training objectives. Here is the current landscape.
| Model | Training Data | Loss | Best For |
|---|---|---|---|
| OpenAI CLIP ViT-L/14 (2021) | WIT 400M (private) | InfoNCE | Baseline, widest library support |
| OpenCLIP ViT-G/14 (LAION, 2023) | LAION-2B (open) | InfoNCE | Best open-source accuracy |
| SigLIP ViT-SO400M (Google, 2023) | WebLI (private) | Sigmoid | Calibrated scores, smaller batches |
| EVA-02-CLIP-E/14+ (BAAI, 2023) | Merged datasets (18B params) | InfoNCE + distillation | Maximum ImageNet zero-shot accuracy |
| MetaCLIP ViT-H/14 (Meta, 2023) | CommonPool curated (open) | InfoNCE | Reproducible research |
| DFN ViT-H/14 (Apple, 2023) | Data-filtered 2B | InfoNCE | Quality over quantity in data curation |
ImageNet Zero-Shot Top-1 Accuracy
ImageNet zero-shot top-1 accuracy. For reference, a supervised ResNet-50 trained on ImageNet achieves ~76%. The best CLIP variants now exceed supervised baselines without seeing any ImageNet labels.
Where CLIP Fails
CLIP is extraordinarily capable, but its failure modes are well-documented and important to understand before deploying it in production.
1. Compositionality Failures
CLIP struggles with compositional understanding — understanding relationships between objects, not just their presence. The model encodes images and text as bag-of-concepts rather than structured representations.
Yuksekgonul et al. (2023) systematically demonstrated this in "When and Why Vision-Language Models Behave like Bags-of-Words" — CLIP performs near chance on benchmarks requiring understanding of spatial relations, attribute binding, or object ordering.
— Yuksekgonul, B. et al. (2023). When and Why Vision-Language Models Behave like Bags-of-Words. ICLR.
2. Counting and Quantity
CLIP cannot reliably distinguish "three dogs" from "one dog" or "five dogs." The contrastive objective optimizes for presence of concepts, not their quantity. This is a fundamental limitation of the global pooling operation that produces a single vector per image — fine-grained spatial information is discarded.
3. Typographic Attacks
Because CLIP aligns images with text, it can be fooled by text written in images. An image of an apple with "iPod" written on a sticky note gets classified as an iPod. The text encoder's representation of the word dominates the visual content. This is a known attack vector for any CLIP-based system.
4. Training Data Distribution Gaps
CLIP was trained on English-centric web data. It performs significantly worse on:
- Non-Western and specialized content: medical imagery, satellite photos, and other domains not well-represented on the English web
- Fine-grained classification: distinguishing 200 bird species or 100 car models (Stanford Cars drops to ~65% vs ~93% supervised)
- Abstract or symbolic content: diagrams, charts, handwritten text, non-photographic images
5. Social Biases
CLIP inherits and sometimes amplifies the biases present in its web-scraped training data. The original paper documented that CLIP disproportionately misclassifies images of people with darker skin tones, associates certain demographics with occupational stereotypes, and reflects Western-centric visual norms. Any production system using CLIP for content involving people should include bias mitigation strategies.
What this means in practice
CLIP is a semantic similarity engine, not a structured understanding system. It excels at:
- Broad category matching ("photos of dogs", "sunset landscapes")
- Cross-modal retrieval (text query → image results)
- Content filtering and moderation at scale
- Guiding image generation (Stable Diffusion, DALL-E)
It fails at tasks requiring compositional reasoning, counting, spatial relations, or domain expertise. For those, you need fine-tuned models or more recent architectures that build structured representations on top of CLIP features.
Why CLIP Mattered Beyond Classification
CLIP's impact extends far beyond image search. Its shared embedding space became the foundation for an entire generation of multimodal AI systems:
Image Generation
DALL-E 2 and Stable Diffusion use CLIP text embeddings to guide image generation. The text encoder converts "a painting of a cat in the style of Monet" into the same vector space that images occupy, telling the diffusion model what to generate.
Vision-Language Models
LLaVA, GPT-4V, and Gemini use CLIP (or SigLIP) as their "eyes" — the image encoder that converts pixels into tokens that a language model can understand. CLIP features are the bridge between visual and linguistic processing.
Content Moderation
Platforms use CLIP to detect policy-violating images without maintaining a fixed taxonomy. Instead of training a classifier on every possible violation category, describe the violation in natural language and measure embedding similarity.
E-commerce & Recommendations
Product search ("blue running shoes with white sole"), visual similarity ("more like this"), and cross-modal recommendations all run on CLIP-derived embeddings at companies like Pinterest, Shopify, and Amazon.
Key Takeaways
1. CLIP creates a shared embedding space for images and text — matching pairs are close together, enabling cross-modal search without task-specific training.
2. Contrastive learning (InfoNCE loss) is the training objective that makes it work: maximize similarity of matching pairs, minimize similarity of non-matching pairs across large batches.
3. Zero-shot classification works by prompt engineering — define categories as text, encode them, pick the closest to your image. No retraining needed.
4. Use OpenCLIP or SigLIP in production — open-source, better accuracy than the original CLIP, and actively maintained. SigLIP when you need calibrated probabilities.
5. Know the limitations — CLIP fails at compositionality, counting, and fine-grained discrimination. It is a semantic similarity engine, not a reasoning system.