Zero-Shot Image Classification
Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training: you describe categories in text, and the model matches images against those descriptions in a shared embedding space. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive ImageNet accuracy without using any ImageNet labels. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift is profound: instead of collecting a labeled dataset for every new domain, you just describe what you're looking for.
History
Lampert et al. introduce attribute-based zero-shot learning — classify unseen animals by transferring attribute descriptions
DeViSE (Frome et al.) projects images and word embeddings into shared space, first showing vision-language transfer
Zero-shot learning benchmarks (CUB, AWA2, SUN) are established; generalized ZSL (classifying both seen and unseen classes at test time) becomes the harder standard
CLIP (Radford et al.) trains on 400M image-text pairs with contrastive loss, achieving 76.2% zero-shot ImageNet top-1 — a paradigm shift
ALIGN (Google) scales to 1.8B noisy image-text pairs, showing that data quantity can compensate for noise
OpenCLIP reproduces CLIP with open data (LAION-5B); community can now train and customize CLIP models
EVA-CLIP scales to an 18B-parameter ViT, reaching 82.0% zero-shot ImageNet accuracy; SigLIP replaces the softmax-based contrastive loss with a sigmoid loss for better scaling
MetaCLIP (Meta) shows that data curation matters more than scale — matching CLIP quality with 400M curated samples vs. billions of unfiltered ones
SigLIP-SO400M-384 achieves 83.1% zero-shot ImageNet top-1; becomes the default vision encoder for VLMs (LLaVA, PaLI)
How Zero-Shot Image Classification Works
Dual Encoder Architecture
An image encoder (typically a ViT) and a text encoder (a transformer) independently produce embedding vectors in a shared space. The image encoder pools its output into a [CLS] token or mean-pooled representation; the text encoder maps class names or descriptions to text embeddings.
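A minimal sketch of the dual-encoder idea, with toy hand-written projection matrices standing in for the ViT and text transformer (all numbers below are made up for illustration, not from a real model):

```python
import math

def l2_normalize(v):
    # Project onto the unit sphere, as CLIP does before computing similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def linear(weights, features):
    # weights: rows of a projection matrix; features: pooled encoder output.
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

# Stand-in projection matrices for the image and text towers (toy values).
W_image = [[0.2, -0.1, 0.4, 0.3],
           [0.1, 0.5, -0.2, 0.0],
           [-0.3, 0.2, 0.1, 0.6]]
W_text = [[0.4, 0.1], [-0.2, 0.3], [0.5, -0.1]]

image_features = [1.0, 0.5, -0.5, 2.0]   # e.g. a pooled ViT representation
text_features = [0.8, -1.2]              # e.g. a pooled transformer representation

# Both towers project into the same 3-d space and are L2-normalized,
# so a dot product between them is a cosine similarity.
image_emb = l2_normalize(linear(W_image, image_features))
text_emb = l2_normalize(linear(W_text, text_features))
```

The essential property is that the two encoders never see each other's inputs; they only meet in the shared space where similarity is computed.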
Contrastive Training
During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. CLIP uses symmetric cross-entropy over a batch; SigLIP uses per-pair sigmoid loss, enabling larger effective batch sizes.
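The two objectives can be sketched in plain Python on a toy batch of two pairs. The embeddings, temperature, and bias values below are illustrative stand-ins, not trained parameters:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def clip_loss(img_embs, txt_embs, temperature=0.07):
    # CLIP-style symmetric cross-entropy over one batch.
    n = len(img_embs)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txt_embs] for img in img_embs]
    # Image->text direction: row i should select column i.
    loss_i2t = -sum(math.log(softmax(row)[i]) for i, row in enumerate(logits)) / n
    # Text->image direction: column j should select row j.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(math.log(softmax(col)[j]) for j, col in enumerate(cols)) / n
    return (loss_i2t + loss_t2i) / 2

def siglip_loss(img_embs, txt_embs, t=10.0, b=-10.0):
    # SigLIP-style per-pair sigmoid loss: label +1 on the diagonal, -1 elsewhere.
    # No batch-wide softmax is needed, which is what eases scaling.
    n = len(img_embs)
    total = 0.0
    for i, img in enumerate(img_embs):
        for j, txt in enumerate(txt_embs):
            z = 1.0 if i == j else -1.0
            logit = t * sum(a * c for a, c in zip(img, txt)) + b
            total += math.log(1.0 + math.exp(-z * logit))
    return total / n

imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]  # roughly matching pairs, low loss
```

Swapping the text rows so the pairs mismatch raises both losses, which is exactly the pull-together/push-apart behavior described above.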
Zero-Shot Inference
At test time, each candidate class is converted to a text prompt (e.g., 'a photo of a {class}') and encoded. The image is encoded, and cosine similarity between the image embedding and all text embeddings determines the predicted class. No training on the target classes is needed.
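The inference step reduces to a cosine-similarity argmax. A runnable sketch, where the vectors below are made-up stand-ins for real encoder outputs:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

classes = ["cat", "dog", "bird"]
# One text embedding per prompt "a photo of a {class}" (toy values).
text_embs = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.1],
    "bird": [0.0, 0.2, 0.95],
}
image_emb = [0.15, 0.85, 0.05]  # pretend encoder output for a dog photo

# Score the image against every class prompt and take the best match.
scores = {c: cosine(image_emb, text_embs[c]) for c in classes}
prediction = max(scores, key=scores.get)
print(prediction)  # dog
```

Note that adding a new class only requires encoding one more prompt; no weights change, which is why the vocabulary is open.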
Prompt Engineering
Prompt templates significantly affect accuracy: 'a photo of a {class}' outperforms the bare '{class}' by 3-5%. An ensemble of 80+ templates ('a painting of a {class}', 'a blurry photo of a {class}', etc.) gives roughly another 3% boost. CoOp/CoCoOp learn prompt vectors automatically instead of hand-crafting them.
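A sketch of prompt ensembling, averaging per-template text embeddings and re-normalizing into one classifier vector per class. The embed() function here is a hypothetical deterministic stand-in for a real text encoder, so the example runs without a model download:

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def embed(text, dim=4):
    # Fake "text encoder": bucket character codes into dim slots (illustration only).
    return l2_normalize([float(sum(ord(c) for c in text[i::dim]) + 1)
                         for i in range(dim)])

templates = [
    "a photo of a {}.",
    "a painting of a {}.",
    "a blurry photo of a {}.",
]

def class_embedding(name):
    # Encode every template, average the unit vectors, re-normalize.
    embs = [embed(t.format(name)) for t in templates]
    mean = [sum(e[i] for e in embs) / len(embs) for i in range(len(embs[0]))]
    return l2_normalize(mean)

classifier = {c: class_embedding(c) for c in ["cat", "dog"]}
```

Because the averaging happens on the text side, the ensemble costs nothing extra per image at inference time: the per-class vectors are computed once and cached.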
Current Landscape
Zero-shot image classification in 2025 is synonymous with CLIP-family models. SigLIP has emerged as the practical favorite, powering the vision encoder of most VLMs (LLaVA, Qwen-VL, PaLI-X). The accuracy ceiling on ImageNet has plateaued around 83%, but real-world utility keeps expanding as these models are deployed for content moderation, visual search, and robotics perception. The field has split: researchers push accuracy on established benchmarks, while practitioners care about domain transfer, calibration, and compositional understanding. Open-source models (OpenCLIP, MetaCLIP) have fully caught up with proprietary ones.
Key Challenges
Fine-grained discrimination — zero-shot models struggle to distinguish visually similar classes (dog breeds, bird species, car models) without specialized training
Compositionality — CLIP often fails on relational prompts ('a horse riding an astronaut' vs. 'an astronaut riding a horse') because the embedding doesn't capture word order well
Bias amplification — web-scraped training data encodes societal biases that affect classification (e.g., associating certain professions with specific demographics)
Calibration — zero-shot classifiers are poorly calibrated; similarity scores don't map cleanly to probabilities, making threshold-setting difficult in production
Domain gap — models trained on web-scraped image-text pairs underperform on specialized domains (medical, satellite, industrial) where web data is scarce
Quick Recommendations
Best zero-shot accuracy
SigLIP-SO400M-384 or EVA-CLIP-ViT-18B
83%+ ImageNet zero-shot; SigLIP is more practical (400M params) while EVA pushes the accuracy frontier
Open-source / customizable
OpenCLIP ViT-G/14 (LAION-2B)
Fully open training data and weights; 80.1% ImageNet zero-shot; easy to fine-tune on domain data
Efficient deployment
SigLIP-B/16-256 or MobileCLIP
~80% zero-shot accuracy at <100M params; runs on consumer GPUs and mobile devices
Domain adaptation
OpenCLIP + CLIP-Adapter or Tip-Adapter
Add a lightweight adapter with 1-16 shots per class; boosts domain-specific accuracy by 5-15% without full fine-tuning
Composed/relational queries
NegCLIP or SigLIP with hard negative mining
Better at order-sensitive prompts than standard CLIP through targeted training on compositional examples
What's Next
The frontier is compositional zero-shot understanding (relational, spatial, temporal descriptions), unified vision-language models that subsume CLIP as an emergent capability, and zero-shot classification in specialized domains through domain-adapted pretraining. Long-term, dedicated zero-shot classifiers may be absorbed into general-purpose VLMs that can classify, describe, and reason about images simultaneously.