
Zero-Shot Image Classification

Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training: you describe the categories in text, and the model matches each image to the closest description. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive ImageNet accuracy without ever seeing a labeled example. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift is profound: instead of collecting a labeled dataset for every new domain, you just describe what you're looking for.



History

2009

Lampert et al. introduce attribute-based zero-shot learning — classify unseen animals by transferring attribute descriptions

2013

DeViSE (Frome et al.) projects images and word embeddings into shared space, first showing vision-language transfer

2017

Zero-shot learning benchmarks (CUB, AWA2, SUN) established; generalized ZSL (both seen and unseen classes at test time) becomes the harder standard

2021

CLIP (Radford et al.) trains on 400M image-text pairs with contrastive loss, achieving 76.2% zero-shot ImageNet top-1 — a paradigm shift

2021

ALIGN (Google) scales to 1.8B noisy image-text pairs, showing that data quantity can compensate for noise

2022

OpenCLIP reproduces CLIP with open data (LAION-5B); community can now train and customize CLIP models

2023

EVA-CLIP scales the image encoder to an 18B-parameter ViT, achieving 82.0% zero-shot ImageNet; SigLIP replaces the softmax-based contrastive loss with a per-pair sigmoid loss for better scaling

2023

MetaCLIP (Meta) shows that data curation matters more than scale — matching CLIP quality with 400M curated samples vs. billions of unfiltered ones

2024

SigLIP-SO400M-384 achieves 83.1% zero-shot ImageNet top-1; becomes the default vision encoder for VLMs (LLaVA, PaLI)

How Zero-Shot Image Classification Works

1. Dual Encoder Architecture

An image encoder (typically a ViT) and a text encoder (a transformer) independently produce embedding vectors. The image encoder processes pixels into a [CLS] token or mean-pooled representation; the text encoder processes class names or descriptions into text embeddings.
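A minimal numpy sketch of the dual-encoder idea, with random linear projections standing in for the trained ViT and text transformer (all dimensions and weights here are illustrative placeholders, not real model shapes):

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project embeddings onto the unit sphere so cosine similarity
    # later reduces to a plain dot product.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Stand-ins for the two encoders: each maps its modality into a
# shared 512-dim embedding space via a learned projection.
W_image = rng.standard_normal((768, 512))   # pooled ViT dim -> shared dim
W_text = rng.standard_normal((384, 512))    # pooled text dim -> shared dim

def encode_image(vit_features):             # (batch, 768) pooled ViT output
    return l2_normalize(vit_features @ W_image)

def encode_text(text_features):             # (batch, 384) pooled text output
    return l2_normalize(text_features @ W_text)

img_emb = encode_image(rng.standard_normal((4, 768)))
txt_emb = encode_text(rng.standard_normal((3, 384)))
print(img_emb.shape, txt_emb.shape)         # (4, 512) (3, 512)
```

The key design point is that the two towers never see each other's inputs; only the final projections land in the same space, which is what makes precomputing and caching text embeddings cheap.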

2. Contrastive Training

During training, matching image-text pairs are pulled together in embedding space while non-matching pairs are pushed apart. CLIP uses symmetric cross-entropy over a batch; SigLIP uses per-pair sigmoid loss, enabling larger effective batch sizes.

3. Zero-Shot Inference

At test time, each candidate class is converted to a text prompt (e.g., 'a photo of a {class}') and encoded. The image is encoded, and cosine similarity between the image embedding and all text embeddings determines the predicted class. No training on the target classes is needed.
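A sketch of the inference step with hand-made mock embeddings; in practice the vectors come from the trained encoders after embedding the image and prompts like 'a photo of a {class}', and the values below are chosen only to illustrate the argmax over cosine similarities:

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    # Both sides are L2-normalized, so dot product == cosine similarity.
    sims = class_text_embs @ image_emb
    return class_names[int(np.argmax(sims))], sims

def unit(v):
    return v / np.linalg.norm(v)

# Mock class embeddings for prompts "a photo of a cat/dog/car".
cat = unit(np.array([1.0, 0.2, 0.0]))
dog = unit(np.array([0.1, 1.0, 0.3]))
car = unit(np.array([0.0, 0.1, 1.0]))
text_embs = np.stack([cat, dog, car])

image = unit(np.array([0.9, 0.3, 0.1]))    # an image near the "cat" direction
label, sims = zero_shot_classify(image, text_embs, ["cat", "dog", "car"])
print(label)                                # -> cat
```

Note that the text embeddings depend only on the class list, so they can be computed once and reused for every image; per-image cost is a single matrix-vector product.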

4. Prompt Engineering

Prompt templates significantly affect accuracy: 'a photo of a {class}' outperforms the bare '{class}' by 3-5%. Ensembling 80+ templates ('a painting of a {class}', 'a blurry photo of a {class}', etc.) adds roughly another 3%. Methods like CoOp and CoCoOp learn the prompts automatically instead of hand-crafting them.
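Mechanically, prompt ensembling is just an average of per-template text embeddings, re-normalized. A sketch with a hypothetical stand-in encoder (fake_encode_text is not a real API; it hashes the prompt into a pseudo-random unit vector purely for demonstration):

```python
import numpy as np

TEMPLATES = [
    "a photo of a {}.",
    "a painting of a {}.",
    "a blurry photo of a {}.",
    "a sculpture of a {}.",
]

def ensemble_class_embedding(class_name, encode_text):
    # Encode every templated prompt, average the embeddings, then
    # re-normalize: the mean of unit vectors is not itself unit-length.
    embs = np.stack([encode_text(t.format(class_name)) for t in TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

# Hypothetical encoder stand-in: maps a prompt to a pseudo-random unit vector.
def fake_encode_text(prompt):
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)

emb = ensemble_class_embedding("tabby cat", fake_encode_text)
print(emb.shape)   # (512,)
```

The averaged embedding replaces the single-prompt embedding in the inference step, so ensembling adds no per-image cost at all.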

Current Landscape

Zero-shot image classification in 2025 is synonymous with CLIP-family models. SigLIP has emerged as the practical favorite, powering the vision encoder of most VLMs (LLaVA, Qwen-VL, PaLI-X). The accuracy ceiling on ImageNet has plateaued around 83%, but real-world utility keeps expanding as these models are deployed for content moderation, visual search, and robotics perception. The field has split: researchers push accuracy on established benchmarks, while practitioners care about domain transfer, calibration, and compositional understanding. Open-source (OpenCLIP, MetaCLIP) has fully caught up with proprietary models.

Key Challenges

Fine-grained discrimination — zero-shot models struggle to distinguish visually similar classes (dog breeds, bird species, car models) without specialized training

Compositionality — CLIP often fails on relational prompts ('a horse riding an astronaut' vs. 'an astronaut riding a horse') because the embedding doesn't capture word order well

Bias amplification — web-scraped training data encodes societal biases that affect classification (e.g., associating certain professions with specific demographics)

Calibration — zero-shot classifiers are poorly calibrated; similarity scores don't map cleanly to probabilities, making threshold-setting difficult in production

Domain gap — models trained on web-scraped image-text pairs underperform on specialized domains (medical, satellite, industrial) where web data is scarce
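On the calibration point above: a common first step is a temperature-scaled softmax over the similarity scores, with the temperature tuned on held-out labeled data. A minimal sketch (the temperature and similarity values are illustrative, not from any particular model):

```python
import numpy as np

def similarities_to_probs(sims, temperature=0.01):
    # Cosine similarities from CLIP-style models typically span a narrow
    # band, so dividing by a small temperature spreads them out before
    # the softmax; the temperature should be tuned on held-out data.
    z = sims / temperature
    z = z - z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

sims = np.array([0.31, 0.24, 0.22])      # one image vs. 3 class prompts
probs = similarities_to_probs(sims)
print(probs.round(3))
```

Even after temperature scaling these scores are only probability-like; for production thresholds (e.g. an "unknown" reject option), validating calibration on in-domain data remains necessary.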

Quick Recommendations

Best zero-shot accuracy

SigLIP-SO400M-384 or EVA-CLIP-ViT-18B

83%+ ImageNet zero-shot; SigLIP is more practical (400M params) while EVA pushes the accuracy frontier

Open-source / customizable

OpenCLIP ViT-G/14 (LAION-2B)

Fully open training data and weights; 80.1% ImageNet zero-shot; easy to fine-tune on domain data

Efficient deployment

SigLIP-B/16-256 or MobileCLIP

~80% zero-shot accuracy at <100M params; runs on consumer GPUs and mobile devices

Domain adaptation

OpenCLIP + CLIP-Adapter or Tip-Adapter

Add a lightweight adapter with 1-16 shots per class; boosts domain-specific accuracy by 5-15% without full fine-tuning

Composed/relational queries

NegCLIP or SigLIP with hard negative mining

Better at order-sensitive prompts than standard CLIP through targeted training on compositional examples
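The few-shot adapter route in the recommendations can be sketched in the spirit of Tip-Adapter: blend the zero-shot text logits with a training-free "cache" of support-image embeddings weighted by similarity. All data, dimensions, and the alpha/beta values below are illustrative:

```python
import numpy as np

def tip_adapter_logits(img_emb, text_embs, cache_keys, cache_labels,
                       alpha=1.0, beta=5.5):
    # Zero-shot branch: similarity of the query image to each class prompt.
    zero_shot = img_emb @ text_embs.T                        # (C,)
    # Few-shot branch: affinity to each cached support image, sharpened
    # by beta, then projected onto classes via one-hot support labels.
    affinity = np.exp(-beta * (1.0 - img_emb @ cache_keys.T))  # (K,)
    few_shot = affinity @ cache_labels                       # (K,)@(K,C)->(C,)
    return zero_shot + alpha * few_shot

rng = np.random.default_rng(0)
def unit(x): return x / np.linalg.norm(x, axis=-1, keepdims=True)

C, K, D = 3, 6, 32                                 # classes, shots, embed dim
text_embs = unit(rng.standard_normal((C, D)))      # ensemble prompt embeddings
cache_keys = unit(rng.standard_normal((K, D)))     # support-image embeddings
cache_labels = np.eye(C)[rng.integers(0, C, K)]    # one-hot labels, (K, C)

img = unit(rng.standard_normal(D))
logits = tip_adapter_logits(img, text_embs, cache_keys, cache_labels)
print(logits.shape)
```

Because no parameters are trained, this kind of cache adapter can be built from 1-16 labeled shots per class in seconds, which is why it pairs well with OpenCLIP for quick domain adaptation.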

What's Next

The frontier is compositional zero-shot understanding (relational, spatial, temporal descriptions), unified vision-language models that subsume CLIP as an emergent capability, and zero-shot classification in specialized domains through domain-adapted pretraining. Long-term, dedicated zero-shot classifiers may be absorbed into general-purpose VLMs that can classify, describe, and reason about images simultaneously.
