
Zero-Shot Image Classification

Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training: you describe the categories in text, and the model matches each image to the closest description. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive ImageNet accuracy without ever seeing a labeled example. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift is profound: instead of collecting a labeled dataset for every new domain, you just describe what you're looking for.



History

2009

Lampert et al. introduce attribute-based zero-shot learning — classify unseen animals by transferring attribute descriptions

2013

DeViSE (Frome et al.) projects images and word embeddings into shared space, first showing vision-language transfer

2017

Zero-shot learning benchmarks (CUB, AWA2, SUN) established; generalized ZSL (both seen and unseen classes at test time) becomes the harder standard

2021

CLIP (Radford et al.) trains on 400M image-text pairs with contrastive loss, achieving 76.2% zero-shot ImageNet top-1 — a paradigm shift

2021

ALIGN (Google) scales to 1.8B noisy image-text pairs, showing that data quantity can compensate for noise

2022

OpenCLIP reproduces CLIP with open data (LAION-5B); community can now train and customize CLIP models

2023

EVA-CLIP scales the image encoder to an 18B-parameter ViT, achieving 82.0% zero-shot ImageNet; SigLIP replaces the softmax-based contrastive loss with a per-pair sigmoid loss for better scaling

2023

MetaCLIP (Meta) shows that data curation matters more than scale — matching CLIP quality with 400M curated samples vs. billions of unfiltered ones

2024

SigLIP-SO400M-384 achieves 83.1% zero-shot ImageNet top-1; becomes the default vision encoder for VLMs (LLaVA, PaLI)

How Zero-Shot Image Classification Works

1. Dual Encoder Architecture

An image encoder (typically a ViT) and a text encoder (a transformer) independently produce embedding vectors. The image encoder processes pixels into a [CLS] token or mean-pooled representation; the text encoder processes class names or descriptions into text embeddings.
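A minimal numpy sketch of the dual-encoder idea, with random linear projections standing in for the trained ViT and text transformer (all dimensions and weights here are illustrative placeholders, not real model shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project embeddings onto the unit sphere so cosine similarity
    # later reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the two encoders: each maps its modality into a
# shared 512-dim embedding space via a learned projection.
W_image = rng.standard_normal((768, 512))   # pooled ViT dim -> shared dim
W_text = rng.standard_normal((384, 512))    # pooled text dim -> shared dim

def encode_image(vit_features):             # (batch, 768) pooled ViT output
    return l2_normalize(vit_features @ W_image)

def encode_text(text_features):             # (batch, 384) pooled text output
    return l2_normalize(text_features @ W_text)

img_emb = encode_image(rng.standard_normal((4, 768)))
txt_emb = encode_text(rng.standard_normal((3, 384)))
print(img_emb.shape, txt_emb.shape)         # (4, 512) (3, 512)
```

The key design point is that the two towers never see each other's inputs; only the final projections land in the same space, which is what makes precomputing and caching text embeddings cheap.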

2. Contrastive Training

During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. CLIP uses symmetric cross-entropy over a batch; SigLIP uses per-pair sigmoid loss, enabling larger effective batch sizes.

3. Zero-Shot Inference

At test time, each candidate class is converted to a text prompt (e.g., 'a photo of a {class}') and encoded. The image is encoded, and cosine similarity between the image embedding and all text embeddings determines the predicted class. No training on the target classes is needed.
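A sketch of the inference step with hand-made mock embeddings; in practice the vectors come from the trained encoders after embedding the image and prompts like 'a photo of a {class}', and the values below are chosen only to illustrate the argmax over cosine similarities:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    # Both sides are L2-normalized, so dot product == cosine similarity.
    sims = class_text_embs @ image_emb
    return class_names[int(np.argmax(sims))], sims

def unit(v):
    return v / np.linalg.norm(v)

# Mock class embeddings for prompts "a photo of a cat/dog/car".
cat = unit(np.array([1.0, 0.2, 0.0]))
dog = unit(np.array([0.1, 1.0, 0.3]))
car = unit(np.array([0.0, 0.1, 1.0]))
text_embs = np.stack([cat, dog, car])

image = unit(np.array([0.9, 0.3, 0.1]))    # an image near the "cat" direction
label, sims = zero_shot_classify(image, text_embs, ["cat", "dog", "car"])
print(label)                                # -> cat
```

Note that the text embeddings depend only on the class list, so they can be computed once and reused for every image; per-image cost is a single matrix-vector product.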

4. Prompt Engineering

Prompt templates significantly affect accuracy: 'a photo of a {class}' outperforms the bare '{class}' by 3-5%. Ensembling 80+ templates ('a painting of a {class}', 'a blurry photo of a {class}', etc.) adds roughly another 3%. Methods like CoOp and CoCoOp learn the prompts automatically instead of hand-crafting them.
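Mechanically, prompt ensembling is just an average of per-template text embeddings, re-normalized. A sketch with a hypothetical stand-in encoder (fake_encode_text is not a real API; it hashes the prompt into a pseudo-random unit vector purely for demonstration):

```python
import numpy as np

TEMPLATES = [
    "a photo of a {}.",
    "a painting of a {}.",
    "a blurry photo of a {}.",
    "a sculpture of a {}.",
]

def ensemble_class_embedding(class_name, encode_text):
    # Encode every templated prompt, average the embeddings, then
    # re-normalize: the mean of unit vectors is not itself unit-length.
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Hypothetical encoder stand-in: maps a prompt to a pseudo-random unit vector.
def fake_encode_text(prompt):
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

emb = ensemble_class_embedding("tabby cat", fake_encode_text)
print(emb.shape)   # (512,)
```

The averaged embedding replaces the single-prompt embedding in the inference step, so ensembling adds no per-image cost at all.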

Current Landscape

Zero-shot image classification in 2025 is synonymous with CLIP-family models. SigLIP has emerged as the practical favorite, powering the vision encoder of most VLMs (LLaVA, Qwen-VL, PaLI-X). The accuracy ceiling on ImageNet has plateaued around 83%, but real-world utility keeps expanding as these models are deployed for content moderation, visual search, and robotics perception. The field has split: researchers push accuracy on established benchmarks, while practitioners care about domain transfer, calibration, and compositional understanding. Open-source (OpenCLIP, MetaCLIP) has fully caught up with proprietary models.

Key Challenges

Fine-grained discrimination — zero-shot models struggle to distinguish visually similar classes (dog breeds, bird species, car models) without specialized training

Compositionality — CLIP often fails on relational prompts ('a horse riding an astronaut' vs. 'an astronaut riding a horse') because the embedding doesn't capture word order well

Bias amplification — web-scraped training data encodes societal biases that affect classification (e.g., associating certain professions with specific demographics)

Calibration — zero-shot classifiers are poorly calibrated; similarity scores don't map cleanly to probabilities, making threshold-setting difficult in production

Domain gap — models trained on web-scraped image-text pairs underperform on specialized domains (medical, satellite, industrial) where web data is scarce
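On the calibration point above: a common first step is a temperature-scaled softmax over the similarity scores, with the temperature tuned on held-out labeled data. A minimal sketch (the temperature and similarity values are illustrative, not from any particular model):

```python
import numpy as np

def similarities_to_probs(sims, temperature=0.01):
    # Cosine similarities from CLIP-style models typically span a narrow
    # band, so dividing by a small temperature spreads them out before
    # the softmax; the temperature should be tuned on held-out data.
    z = sims / temperature
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

sims = np.array([0.31, 0.24, 0.22])      # one image vs. 3 class prompts
probs = similarities_to_probs(sims)
print(probs.round(3))
```

Even after temperature scaling these scores are only probability-like; for production thresholds (e.g. an "unknown" reject option), validating calibration on in-domain data remains necessary.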

Quick Recommendations

Best zero-shot accuracy

SigLIP-SO400M-384 or EVA-CLIP-ViT-18B

83%+ ImageNet zero-shot; SigLIP is more practical (400M params) while EVA pushes the accuracy frontier

Open-source / customizable

OpenCLIP ViT-G/14 (LAION-2B)

Fully open training data and weights; 80.1% ImageNet zero-shot; easy to fine-tune on domain data

Efficient deployment

SigLIP-B/16-256 or MobileCLIP

~80% zero-shot accuracy at <100M params; runs on consumer GPUs and mobile devices

Domain adaptation

OpenCLIP + CLIP-Adapter or Tip-Adapter

Add a lightweight adapter with 1-16 shots per class; boosts domain-specific accuracy by 5-15% without full fine-tuning

Composed/relational queries

NegCLIP or SigLIP with hard negative mining

Better at order-sensitive prompts than standard CLIP through targeted training on compositional examples
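The few-shot adapter route in the recommendations can be sketched in the spirit of Tip-Adapter: blend the zero-shot text logits with a training-free "cache" of support-image embeddings weighted by similarity. All data, dimensions, and the alpha/beta values below are illustrative:

```python
import numpy as np

def tip_adapter_logits(img_emb, text_embs, cache_keys, cache_labels,
                       alpha=1.0, beta=5.5):
    # Zero-shot branch: similarity of the query image to each class prompt.
    zero_shot = img_emb @ text_embs.T                        # (C,)
    # Few-shot branch: affinity to each cached support image, sharpened
    # by beta, then projected onto classes via one-hot support labels.
    affinity = np.exp(-beta * (1.0 - img_emb @ cache_keys.T))  # (K,)
    few_shot = affinity @ cache_labels                       # (K,)@(K,C)->(C,)
    return zero_shot + alpha * few_shot

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)

C, K, D = 3, 6, 32                                 # classes, shots, embed dim
text_embs = unit(rng.standard_normal((C, D)))      # ensemble prompt embeddings
cache_keys = unit(rng.standard_normal((K, D)))     # support-image embeddings
cache_labels = np.eye(C)[rng.integers(0, C, K)]    # one-hot labels, (K, C)

img = unit(rng.standard_normal(D))
logits = tip_adapter_logits(img, text_embs, cache_keys, cache_labels)
print(logits.shape)
```

Because no parameters are trained, this kind of cache adapter can be built from 1-16 labeled shots per class in seconds, which is why it pairs well with OpenCLIP for quick domain adaptation.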

What's Next

The frontier is compositional zero-shot understanding (relational, spatial, temporal descriptions), unified vision-language models that subsume CLIP as an emergent capability, and zero-shot classification in specialized domains through domain-adapted pretraining. Long-term, dedicated zero-shot classifiers may be absorbed into general-purpose VLMs that can classify, describe, and reason about images simultaneously.
