Level 1: Single Blocks · ~25 min

Image Search with CLIP

How contrastive learning unified vision and language in a single embedding space — and why it changed everything downstream.

25 Years of Teaching Machines to See

CLIP didn't appear in a vacuum. It sits at the end of a long lineage of progressively more powerful approaches to image understanding — from hand-crafted feature descriptors to learned convolutional filters to attention-based transformers. Each generation solved one critical limitation of the last. Understanding this lineage is the fastest way to grasp why CLIP's specific architecture choices matter.

The fundamental question has remained the same since the 1960s: how do you represent an image as numbers that a computer can reason about? The answers have changed dramatically.

Era I: Hand-Crafted Features
1999–2004

SIFT: Scale-Invariant Feature Transform

David Lowe at the University of British Columbia published the paper that defined a decade of computer vision. SIFT detected "keypoints" in images — corners, edges, blobs — and described each one as a 128-dimensional vector based on local gradient orientations. These descriptors were invariant to scale, rotation, and partial changes in illumination.

# SIFT pipeline (conceptual)
keypoints = detect_extrema(DoG_pyramid(image))        # Find stable interest points in scale space
descriptors = compute_gradient_histograms(keypoints)  # One 128-dim descriptor per keypoint
# Match images by finding keypoints with similar descriptors
matches = brute_force_match(descriptors_A, descriptors_B)

Each image became a bag of keypoint descriptors, not a single vector. You couldn't compute "cosine similarity between two images" — you had to run a matching algorithm.
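That matching step can be sketched in plain NumPy — a toy illustration with synthetic descriptors and Lowe's ratio test, not the full SIFT pipeline:

```python
import numpy as np

def match_descriptors(des_a, des_b, ratio=0.75):
    """Brute-force matching with Lowe's ratio test: keep a match only if
    the best candidate is clearly better than the second best."""
    matches = []
    for i, d in enumerate(des_a):
        dists = np.linalg.norm(des_b - d, axis=1)   # distance to every descriptor in B
        j, k = np.argsort(dists)[:2]                # nearest and second-nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, j))
    return matches

# Toy 128-dim descriptors: two keypoints of image A reappear in image B with noise
rng = np.random.default_rng(0)
des_a = rng.normal(size=(4, 128))
des_b = np.vstack([des_a[:2] + 0.01 * rng.normal(size=(2, 128)),
                   rng.normal(size=(3, 128))])
print(match_descriptors(des_a, des_b))  # keypoints 0 and 1 should match their noisy copies
```

Note that the result is a list of keypoint correspondences, not a single similarity score — exactly the property that made whole-image comparison awkward in this era.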

SIFT and its successors (SURF, ORB) powered Google Image Search, panorama stitching, and augmented reality for over a decade. But they had a fundamental limitation: the features were hand-designed by humans. They captured low-level geometry — edges, corners, textures — but had no concept of semantics. A SIFT descriptor could match the same building from two angles but couldn't tell you it was a "church."

Lowe, D. (2004). Distinctive Image Features from Scale-Invariant Keypoints. IJCV, 60(2), 91–110. 70,000+ citations.

2005–2012

HOG, Bag of Visual Words & Deformable Parts

The community built increasingly sophisticated pipelines on top of hand-crafted features. HOG (Dalal & Triggs, 2005) encoded gradient histograms in a grid pattern — the backbone of pedestrian detection for years. Bag of Visual Words (Csurka et al., 2004) borrowed from text retrieval: cluster SIFT descriptors into a "visual vocabulary," then represent each image as a histogram of visual word frequencies. DPM (Felzenszwalb et al., 2010) won PASCAL VOC detection three years running by modeling objects as collections of deformable parts. The entire edifice was hand-engineered feature extraction followed by an SVM classifier. It worked, but each new task required months of feature engineering.

Era II: Learned Features
September 2012

AlexNet: The Inflection Point

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered a deep convolutional neural network in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). It won by a margin so large — 15.3% top-5 error vs. the runner-up's 26.2% — that it ended the debate about whether learned features could compete with hand-crafted ones.

AlexNet had 60 million parameters, trained on two GTX 580 GPUs for six days. The key insight was not architectural novelty (CNNs existed since LeCun's 1989 work) but scale: enough data (1.2M ImageNet images), enough compute (GPU training), and enough capacity (8 layers instead of 2–3). The penultimate layer — a 4096-dimensional vector — became the first widely used learned image embedding. Researchers quickly discovered you could use this "AlexNet feature" for any vision task, even ones the network was never trained on.

# The "AlexNet feature" — first practical learned image embedding
import torch
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
model.classifier = model.classifier[:-1]  # Drop the final 1000-way classification layer
model.eval()

with torch.no_grad():
    image_embedding = model(image)  # image: (1, 3, 224, 224) → shape (1, 4096)
# This 4096-dim vector transfers to new tasks without retraining

Krizhevsky, A. et al. (2012). ImageNet Classification with Deep Convolutional Neural Networks. NeurIPS.

2014–2015

Deeper Networks: VGG, GoogLeNet, ResNet

The community rapidly scaled depth. VGG-16 (Simonyan & Zisserman, 2014) showed that uniform 3x3 convolutions stacked 16 layers deep could match more complex architectures. GoogLeNet/Inception (Szegedy et al., 2014) introduced multi-scale parallel convolutions. Then ResNet (He et al., 2015) solved the vanishing gradient problem with skip connections, pushing networks to 152 layers and achieving 3.57% top-5 error — surpassing estimated human performance on ImageNet.

Each of these networks produced an image embedding as a byproduct of classification. "ResNet features" became the default backbone for detection (Faster R-CNN), segmentation (U-Net), and retrieval throughout 2016–2020. But they all shared a fundamental constraint: the embedding only understood categories it was trained on. A ResNet trained on ImageNet's 1000 classes had no way to understand "a dog wearing a birthday hat" — that concept existed in no training label.

Era III: Transformers Meet Vision
October 2020

ViT: An Image is Worth 16x16 Words

Alexey Dosovitskiy et al. at Google Brain asked a heretical question: what if you just chopped an image into patches and fed them to a standard Transformer — no convolutions at all? Split a 224x224 image into 196 patches of 16x16 pixels, linearly project each patch into a vector, add position embeddings, and run self-attention.

The result, Vision Transformer (ViT), matched or exceeded the best CNNs when pre-trained on enough data (JFT-300M, 300 million images). The architecture was simpler than any CNN — no pooling layers, no strided convolutions, no feature pyramid. Just patches and attention. ViT became the standard image encoder for CLIP four months later.

Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words. ICLR.

January 2021

CLIP: Contrastive Language-Image Pre-training

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh et al. at OpenAI combined two ideas: the ViT image encoder and a Transformer text encoder, trained together on 400 million image-text pairs scraped from the internet (the WIT dataset, never publicly released). The training objective was contrastive: given a batch of N image-text pairs, maximize the similarity of matching pairs while minimizing the similarity of all N² - N non-matching pairs.

The result was a shared embedding space where images and text with the same meaning occupied the same neighborhood. For the first time, you could search images with natural language, classify into arbitrary categories without retraining, and measure image-text relevance — all from a single model. CLIP achieved 76.2% zero-shot top-1 accuracy on ImageNet — matching a supervised ResNet-50 without seeing a single ImageNet training image.

Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. ICML.

2023

SigLIP: Sigmoid Loss for Language-Image Pre-training

Xiaohua Zhai et al. at Google replaced CLIP's softmax-based contrastive loss with a simpler sigmoid loss that operates on individual pairs rather than requiring the full NxN similarity matrix. This removed the need for large batch sizes (CLIP needed 32,768) and enabled efficient training on smaller hardware. SigLIP achieved better performance with fewer resources and produced better-calibrated similarity scores that function as actual probabilities.

Zhai, X. et al. (2023). Sigmoid Loss for Language Image Pre-Training. ICCV.

The throughline: 1999 → now

Twenty-five years. The same question, progressively better answers:

1999–2012 · Hand-crafted: SIFT, HOG, Bag of Visual Words — human-designed features, SVM classifiers
2012–2020 · Learned (CNN): AlexNet → ResNet — features emerge from supervised training on labeled images
2020–2021 · Learned (Transformer): ViT → CLIP — patches + attention, trained on language supervision instead of labels
2023–now · Efficient contrastive: SigLIP, EVA-CLIP — better losses, bigger data, and open-source models closing the gap

Each generation replaced more hand-engineering with more learning. CLIP's breakthrough was replacing labeled categories with natural language descriptions — turning the entire internet into a training set.

How Contrastive Learning Works

CLIP's training objective is contrastive learning — arguably the most important idea in multimodal AI. The core principle is deceptively simple: given a batch of image-text pairs, push matching pairs together and push non-matching pairs apart in a shared embedding space.

But the details of how you push and pull — the loss function — have profound consequences for what the model learns.

The Contrastive Setup

Positive Pairs (diagonal)

Image of a sunset + "a beautiful sunset over the ocean"

These came from the same web page. Push their embeddings CLOSE together.

Negative Pairs (off-diagonal)

Image of a sunset + "a cat sitting on a keyboard"

Random pairings within the batch. Push their embeddings FAR apart.

InfoNCE Loss: The Math Behind CLIP

CLIP uses a symmetric variant of InfoNCE (Noise Contrastive Estimation), first introduced by van den Oord et al. (2018). For a batch of N image-text pairs, compute the NxN matrix of cosine similarities between all image and text embeddings, then apply cross-entropy loss twice — once treating each image as a query (which text matches?) and once treating each text as a query (which image matches?).

# InfoNCE loss for CLIP (actual PyTorch)
import torch
import torch.nn.functional as F

def clip_loss(image_embeds, text_embeds, temperature=0.07):
    """
    image_embeds: (N, D) — L2-normalized image embeddings
    text_embeds:  (N, D) — L2-normalized text embeddings
    temperature:  learnable scalar (initialized to 0.07)

    The i-th image matches the i-th text (diagonal = positive pairs).
    """
    # Compute NxN similarity matrix, scaled by temperature
    logits = (image_embeds @ text_embeds.T) / temperature  # (N, N)

    # Labels: the diagonal — image[i] matches text[i]
    labels = torch.arange(len(image_embeds), device=image_embeds.device)

    # Symmetric loss: image→text and text→image
    loss_i2t = F.cross_entropy(logits, labels)       # Each row: which text?
    loss_t2i = F.cross_entropy(logits.T, labels)     # Each column: which image?

    return (loss_i2t + loss_t2i) / 2

The temperature parameter is crucial. A lower temperature makes the softmax distribution sharper — the model must be more confident about which pairs match. CLIP learns this parameter during training, starting at 0.07 and typically converging around 0.01. Getting the temperature wrong can collapse training entirely.
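The sharpening effect is easy to see directly. The snippet below applies softmax to the same three similarity scores (one match, two near-misses) at three temperatures:

```python
import torch

# One correct pair (0.30) and two near-miss negatives (0.28, 0.10)
sims = torch.tensor([0.30, 0.28, 0.10])

for tau in (1.0, 0.07, 0.01):
    probs = torch.softmax(sims / tau, dim=0)
    print(f"τ={tau}: {[round(p, 3) for p in probs.tolist()]}")
```

At τ=1.0 the distribution is nearly uniform — the 0.02 gap between the match and the hard negative barely registers. At τ=0.01 the match dominates, so the gradient strongly rewards separating near-identical pairs.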

Why CLIP Needs Enormous Batch Sizes

In a batch of N pairs, each positive pair is contrasted against N-1 negative pairs. With a batch size of 32,768 (CLIP's actual training setting), each image is compared against 32,767 wrong captions. Larger batches provide more informative negatives — a harder "multiple choice test" — which forces more discriminative representations.

This is also CLIP's main limitation: you need hundreds of GPUs to maintain large batches. CLIP was trained on 256 V100 GPUs. This motivated SigLIP's sigmoid loss, which doesn't require the full NxN matrix and works with smaller batches.

CLIP Loss vs. SigLIP Loss

CLIP (Softmax / InfoNCE)

Treats each row/column as a classification problem: "which of these N items is the correct match?" Requires the full NxN matrix, so batch size directly determines the number of negatives.

loss = -log(exp(sim(i,t_i)/τ) / Σ_j exp(sim(i,t_j)/τ))

SigLIP (Sigmoid / Binary)

Treats each pair independently: "does this image match this text? Yes/no." No softmax normalization across the batch. Works with any batch size and can be chunked.

loss = -log(σ(y_ij · (sim(i,j)/τ + b)))
# y_ij = +1 if match, -1 if not; b is a learnable bias

CLIP's Architecture

CLIP has two parallel encoders that produce vectors in the same space. The image encoder and text encoder are trained jointly from scratch — their only connection is the contrastive loss that pulls matching pairs together.

Image Encoder

Either a ResNet (RN50, RN101) or a Vision Transformer (ViT-B/32, ViT-B/16, ViT-L/14). The ViT variants are stronger. The image is split into patches, each patch is linearly projected, and self-attention produces a final [CLS] token embedding.

# ViT-L/14 image encoder
image (224×224×3)
  → 14×14-pixel patches (16×16 grid = 256 patches)
  → linear projection (each → 1024-dim)
  → + [CLS] token + position embeddings
  → 24 Transformer layers
  → [CLS] output → linear projection
  → L2-normalize → 768-dim embedding
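The patchify-and-project step can be reproduced with tensor ops alone — a minimal sketch using random weights and ViT-L/14 sizes, not the trained model:

```python
import torch

# A 224×224 RGB image becomes 256 patch tokens of width 1024 (ViT-L/14 sizes)
image = torch.randn(1, 3, 224, 224)

# Cut into non-overlapping 14×14 patches
patches = image.unfold(2, 14, 14).unfold(3, 14, 14)      # (1, 3, 16, 16, 14, 14)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 256, 3 * 14 * 14)

patch_embed = torch.nn.Linear(3 * 14 * 14, 1024)          # learned linear projection
tokens = patch_embed(patches)                             # (1, 256, 1024)
```

From here the real encoder prepends the [CLS] token, adds position embeddings, and runs the 24 Transformer layers.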

Text Encoder

A standard Transformer (not BERT — CLIP uses its own tokenizer and architecture). Text is tokenized with BPE (49,152 vocab), padded/truncated to 77 tokens, and the [EOS] token's output is used as the text embedding.

# CLIP text encoder
text "a photo of a dog"
  → BPE tokenize → [49406, 320, 1125, ...]
  → token embedding lookup
  → + position embeddings
  → 12 Transformer layers (masked attention)
  → [EOS] output → linear projection
  → L2-normalize → 768-dim embedding

Both encoders project into the same 512 or 768-dimensional space. After L2 normalization, similarity is just a dot product. An image of a dog and the text "a photo of a dog" end up as nearby vectors. This shared space is what enables every downstream application — zero-shot classification, cross-modal search, and image generation guidance.
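The "similarity is just a dot product" claim is easy to check numerically — toy random vectors below, standing in for real CLIP outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
img_emb = rng.normal(size=768)   # stand-in for an image embedding
txt_emb = rng.normal(size=768)   # stand-in for a text embedding

# L2-normalize, as CLIP does before comparing
img_emb /= np.linalg.norm(img_emb)
txt_emb /= np.linalg.norm(txt_emb)

dot = float(img_emb @ txt_emb)
cos = float(np.dot(img_emb, txt_emb)
            / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb)))
# After normalization, the dot product equals cosine similarity
```

This is why large-scale retrieval over CLIP embeddings reduces to a single matrix multiply (or an approximate nearest-neighbor lookup).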

Try It: Cross-Modal Search

This visualization shows CLIP's shared embedding space projected to 2D. Circles are image embeddings, triangles are text embeddings. Type a query to see how text embeddings align with matching images.

Note: 2D projection distorts distances. Points that appear far apart in 2D may be close in 768D space.

CLIP Shared Embedding Space

[Interactive demo: a 2D projection of the shared embedding space with ten classes — cat, dog, bird, car, airplane, ship, tree, flower, chair, laptop — grouped into Animals, Vehicles, Nature, and Objects. Try queries like "cat", "a photo of a dog", "vehicle", or "nature".]

Zero-Shot Classification

CLIP's most celebrated capability: classify images into categories the model was never explicitly trained on. No fine-tuning, no labeled data, no retraining. Just text prompts that describe each category.

The Algorithm

  1. Define candidate classes as text prompts

     "a photo of a dog", "a photo of a cat", "a photo of a bird". Prompt engineering matters — "a photo of a {class}" works better than just the class name because CLIP was trained on image-caption pairs, not single words.

  2. Encode all prompts once

     Each class prompt becomes a vector in the shared space. Cache these — they don't change.

  3. Encode the image

     The image becomes a vector in the same space.

  4. Pick the highest cosine similarity

     The class whose text embedding is most similar to the image embedding wins. Apply softmax to the scaled similarities for a probability-like distribution (note these scores are not truly calibrated — see SigLIP).

Working Code with OpenCLIP

OpenAI's original CLIP is frozen at 2021 weights. For production, use OpenCLIP — the open-source reimplementation by LAION that provides newer, stronger models trained on larger datasets.

Zero-Shot Classification

import open_clip
import torch
from PIL import Image

# Load model and preprocessing
model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()

# Prepare image and candidate labels
image = preprocess(Image.open("photo.jpg")).unsqueeze(0)
classes = ["dog", "cat", "bird", "car", "house"]
prompts = [f"a photo of a {c}" for c in classes]
text = tokenizer(prompts)

# Encode and classify
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize and compute similarities
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for cls, prob in zip(classes, probs[0]):
    print(f"  {cls}: {prob.item():.1%}")

Install: pip install open-clip-torch

Text-to-Image Search Over a Collection

import open_clip
import torch
from pathlib import Path
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'ViT-L-14', pretrained='datacomp_xl_s13b_b90k'
)
tokenizer = open_clip.get_tokenizer('ViT-L-14')
model.eval()

# Step 1: Encode all images in a directory (do this once, save to disk)
image_dir = Path("./photos")
image_paths = list(image_dir.glob("*.jpg"))
image_embeddings = []

with torch.no_grad():
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0)
        emb = model.encode_image(img)
        emb /= emb.norm(dim=-1, keepdim=True)
        image_embeddings.append(emb)

image_embeddings = torch.cat(image_embeddings)  # (N, 768)

# Step 2: Search with natural language
query = "a dog playing in snow"
text = tokenizer([query])
with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

# Step 3: Rank by similarity
similarities = (text_emb @ image_embeddings.T).squeeze(0)
top_k = similarities.argsort(descending=True)[:5]

for idx in top_k:
    print(f"  {image_paths[idx].name}: {similarities[idx]:.3f}")

For production, store embeddings in a vector database (Qdrant, Pinecone, pgvector) instead of recomputing them per query. Encoding images is the expensive step — encoding text queries is nearly instant.
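Even without a vector database, persisting the index avoids re-encoding. A minimal version of that caching pattern with `torch.save` — synthetic embeddings stand in for the real ones, and `clip_index.pt` is an arbitrary filename:

```python
import torch

# Build the index once and persist it (stand-ins for the real paths/embeddings)
paths = ["photos/a.jpg", "photos/b.jpg"]
embeddings = torch.nn.functional.normalize(torch.randn(2, 768), dim=-1)
torch.save({"paths": paths, "embeddings": embeddings}, "clip_index.pt")

# A later process reloads the index instead of re-encoding every image
index = torch.load("clip_index.pt")
query_emb = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)
scores = (query_emb @ index["embeddings"].T).squeeze(0)
best = index["paths"][scores.argmax()]
```

Swapping this file for a real vector database changes the lookup, not the embedding logic.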

Image-to-Image Search

CLIP embeddings also enable reverse image search — find visually and semantically similar images by comparing image embeddings directly.

# Given a query image, find similar images in the collection
query_img = preprocess(Image.open("query.jpg")).unsqueeze(0)
with torch.no_grad():
    query_emb = model.encode_image(query_img)
    query_emb /= query_emb.norm(dim=-1, keepdim=True)

# Same dot product — works because all embeddings share the same space
similarities = (query_emb @ image_embeddings.T).squeeze(0)
# Top results will be semantically similar, not just pixel-similar

Model Comparison: Choosing the Right CLIP

Since OpenAI released CLIP in 2021, the community has created improved variants with better performance, open training data, or more efficient training objectives. Here is the current landscape.

| Model | Org, Year | Training Data | Loss | Best For |
|---|---|---|---|---|
| OpenAI CLIP ViT-L/14 | OpenAI, 2021 | WIT 400M (private) | InfoNCE | Baseline, widest library support |
| OpenCLIP ViT-G/14 | LAION, 2023 | LAION-2B (open) | InfoNCE | Best open-source accuracy |
| SigLIP ViT-SO400M | Google, 2023 | WebLI (private) | Sigmoid | Calibrated scores, smaller batches |
| EVA-02-CLIP-E/14+ (18B params) | BAAI, 2023 | Merged datasets | InfoNCE + distillation | Maximum ImageNet zero-shot accuracy |
| MetaCLIP ViT-H/14 | Meta, 2023 | CommonPool curated (open) | InfoNCE | Reproducible research |
| DFN ViT-H/14 | Apple, 2023 | Data-filtered 2B | InfoNCE | Quality over quantity in data curation |

ImageNet Zero-Shot Top-1 Accuracy

DFN ViT-H/14               83.4%
EVA-02-CLIP-E/14+ (18B)    82.0%
MetaCLIP ViT-H/14 (2.5B)   80.5%
OpenCLIP ViT-G/14 (2B)     80.1%
SigLIP ViT-SO400M/14       78.4%
OpenAI CLIP ViT-L/14       75.5%
OpenAI CLIP ViT-B/32       63.2%

ImageNet zero-shot top-1 accuracy. For reference, a supervised ResNet-50 trained on ImageNet achieves ~76%. The best CLIP variants now exceed supervised baselines without seeing any ImageNet labels.

Limitations & Failure Modes

Where CLIP Fails

CLIP is extraordinarily capable, but its failure modes are well-documented and important to understand before deploying it in production.

1. Compositionality Failures

CLIP struggles with compositional understanding — understanding relationships between objects, not just their presence. The model encodes images and text as bag-of-concepts rather than structured representations.

"a horse riding an astronaut" scores similarly to "an astronaut riding a horse" — CLIP cannot distinguish word order
"a red car and a blue house" is confused with "a blue car and a red house" — an attribute binding failure

Yuksekgonul et al. (2023) systematically demonstrated this in "When and Why Vision-Language Models Behave like Bags-of-Words" — CLIP performs near chance on benchmarks requiring understanding of spatial relations, attribute binding, or object ordering.

Yuksekgonul, B. et al. (2023). When and Why Vision-Language Models Behave like Bags-of-Words. ICLR.

2. Counting and Quantity

CLIP cannot reliably distinguish "three dogs" from "one dog" or "five dogs." The contrastive objective optimizes for presence of concepts, not their quantity. This is a fundamental limitation of the global pooling operation that produces a single vector per image — fine-grained spatial information is discarded.

3. Typographic Attacks

Because CLIP aligns images with text, it can be fooled by text written in images. An image of an apple with "iPod" written on a sticky note gets classified as an iPod. The text encoder's representation of the word dominates the visual content. This is a known attack vector for any CLIP-based system.

Documented in the original CLIP paper, Section 7.1.

4. Training Data Distribution Gaps

CLIP was trained on English-centric web data. It performs significantly worse on:

  • Non-Western content and specialized domains: medical imagery, satellite photos, and other material not well-represented on the English web
  • Fine-grained classification: distinguishing 200 bird species or 100 car models (Stanford Cars drops to ~65% vs ~93% supervised)
  • Abstract or symbolic content: diagrams, charts, handwritten text, non-photographic images

5. Social Biases

CLIP inherits and sometimes amplifies the biases present in its web-scraped training data. The original paper documented that CLIP disproportionately misclassifies images of people with darker skin tones, associates certain demographics with occupational stereotypes, and reflects Western-centric visual norms. Any production system using CLIP for content involving people should include bias mitigation strategies.

What this means in practice

CLIP is a semantic similarity engine, not a structured understanding system. It excels at:

  • Broad category matching ("photos of dogs", "sunset landscapes")
  • Cross-modal retrieval (text query → image results)
  • Content filtering and moderation at scale
  • Guiding image generation (Stable Diffusion, DALL-E)

It fails at tasks requiring compositional reasoning, counting, spatial relations, or domain expertise. For those, you need fine-tuned models or more recent architectures that build structured representations on top of CLIP features.

Why CLIP Mattered Beyond Classification

CLIP's impact extends far beyond image search. Its shared embedding space became the foundation for an entire generation of multimodal AI systems:

Image Generation

DALL-E 2 and Stable Diffusion use CLIP text embeddings to guide image generation. The text encoder converts "a painting of a cat in the style of Monet" into the same vector space that images occupy, telling the diffusion model what to generate.

Vision-Language Models

LLaVA, GPT-4V, and Gemini use CLIP (or SigLIP) as their "eyes" — the image encoder that converts pixels into tokens that a language model can understand. CLIP features are the bridge between visual and linguistic processing.

Content Moderation

Platforms use CLIP to detect policy-violating images without maintaining a fixed taxonomy. Instead of training a classifier on every possible violation category, describe the violation in natural language and measure embedding similarity.

E-commerce & Recommendations

Product search ("blue running shoes with white sole"), visual similarity ("more like this"), and cross-modal recommendations all run on CLIP-derived embeddings at companies like Pinterest, Shopify, and Amazon.

Key Takeaways

  1. CLIP creates a shared embedding space for images and text — matching pairs are close together, enabling cross-modal search without task-specific training.

  2. Contrastive learning (InfoNCE loss) is the training objective that makes it work: maximize similarity of matching pairs, minimize similarity of non-matching pairs across large batches.

  3. Zero-shot classification works by prompt engineering — define categories as text, encode them, pick the closest to your image. No retraining needed.

  4. Use OpenCLIP or SigLIP in production — open-source, better accuracy than the original CLIP, and actively maintained. Choose SigLIP when you need calibrated probabilities.

  5. Know the limitations — CLIP fails at compositionality, counting, and fine-grained discrimination. It is a semantic similarity engine, not a reasoning system.
