
Zero-Shot Image Classification

Zero-shot image classification uses vision-language models to categorize images into arbitrary classes never seen during training: you describe the categories in text, and the model matches each image to the closest description. CLIP (2021) proved this was viable at scale by training on 400M image-text pairs, achieving competitive accuracy on ImageNet without using any of its labeled examples. SigLIP, EVA-CLIP, and MetaCLIP have since pushed zero-shot ImageNet accuracy above 83%, closing the gap with supervised models. The paradigm shift this represents is profound: instead of collecting a labeled dataset for every new domain, you just describe what you're looking for.
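The matching step above reduces to comparing normalized embeddings: encode the image and one prompt per class, then take a temperature-scaled softmax over the cosine similarities. A minimal numpy sketch of this CLIP-style scoring, with random vectors standing in for real encoder outputs (all names and values here are illustrative):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """Score an image against arbitrary class-prompt embeddings (CLIP-style)."""
    # L2-normalize so the dot product becomes cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * txt @ img      # one similarity score per class
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for real CLIP encoder outputs (hypothetical)
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
class_prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
text_embs = rng.normal(size=(3, 512))
text_embs[0] += 0.5 * image_emb          # make the first prompt the closest
probs = zero_shot_classify(image_emb, text_embs)
print(class_prompts[int(probs.argmax())])  # → a photo of a cat
```

Because the class set is just a list of strings, swapping domains means editing `class_prompts`, not retraining anything.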

Datasets: 1 · Results: 4 · Canonical metric: top-1-accuracy

Canonical Benchmark

ImageNet Zero-Shot

Zero-shot classification accuracy on ImageNet without task-specific training

Primary metric: top-1-accuracy
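Top-1 accuracy counts a prediction correct only when the model's single highest-scoring class matches the label. A minimal sketch of the computation, with hypothetical scores:

```python
import numpy as np

def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    return float((logits.argmax(axis=1) == labels).mean())

# Hypothetical scores for 4 images over 3 classes
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 1.5, 0.1],
                   [0.9, 0.2, 0.4],
                   [0.1, 0.3, 0.2]])
labels = np.array([0, 1, 2, 1])
print(top1_accuracy(logits, labels))  # → 0.75 (3 of 4 correct)
```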

Top 10

Leading models on ImageNet Zero-Shot.

| Rank | Model | Top-1 (%) | Year | Source |
|------|-------|-----------|------|--------|
| 1 | EVA-CLIP-18B | 83.8 | 2024 | paper |
| 2 | SigLIP-SO400M | 83.2 | 2023 | paper |
| 3 | OpenCLIP ViT-G/14 | 80.1 | 2022 | paper |
| 4 | CLIP ViT-L/14 | 75.5 | 2021 | paper |

All datasets

1 dataset tracked for this task.


Run Inference

Looking to run a model? HuggingFace hosts inference for this task type.
