Image Classification Standard

ImageNet (ILSVRC)

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is one of the most influential benchmarks in computer vision. Launched in 2010, it became a primary catalyst for the deep learning revolution. Its classification task covers over 1.4 million images across 1,000 object categories.

SOTA Top-1: 91.0% (CoCa, ViT-G)
Total Images: 14.2M (full dataset)
Classes: 1,000 (ILSVRC subset)
Citations: 71k+ (original paper)

The Benchmark that Changed Everything

Before ImageNet, computer vision datasets were small and specialized. In 2009, researchers from Stanford and Princeton introduced a dataset of unprecedented scale, organized according to the WordNet hierarchy.

The annual ILSVRC competition (2010–2017) provided a standardized evaluation framework that allowed researchers to compare architectures fairly. The 2012 victory of AlexNet marked the definitive shift from hand-crafted features (like SIFT) to end-to-end learned representations via Convolutional Neural Networks (CNNs).

Key Innovations

  • Standardized 1,000-class subset for reproducible research.
  • Hierarchical structure enabling fine-grained classification.
  • Established Top-1 and Top-5 error as industry-standard metrics.
[Figure: ImageNet category distribution. Synset hierarchy, from "Mammal" to "Golden Retriever".]
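
The path from "Mammal" to "Golden Retriever" can be pictured as a child-to-parent map. The sketch below is a hand-written toy fragment: the node names and the `PARENT` map are illustrative assumptions, not real WordNet synset IDs or the actual WordNet path. It shows how a fine-grained label rolls up to coarser categories, and how two labels can be compared by their most specific shared category.

```python
# Toy child -> parent map; a hypothetical fragment, not real WordNet data.
PARENT = {
    "golden retriever": "retriever",
    "retriever": "dog",
    "dog": "mammal",
    "tabby cat": "cat",
    "cat": "mammal",
    "mammal": "animal",
}


def ancestors(label):
    """Return the path from a label up to the root of the toy hierarchy."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Return the most specific category shared by two labels, or None."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None


print(" -> ".join(ancestors("golden retriever")))
print(lowest_common_ancestor("golden retriever", "tabby cat"))
```

A structure like this is what lets a misclassification be graded by severity: confusing two dog breeds is a smaller error in the hierarchy than confusing a dog with a cat.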

Accuracy Evolution

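Accuracy on ImageNet rose rapidly during the ILSVRC era and has continued to climb since. Entries through 2017 report Top-5 accuracy; the 2022 entry reports Top-1 accuracy, so it is not directly comparable to the earlier figures.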

Key Milestones

Year  Model         Accuracy  Metric
2010  NEC-UIUC      71.8%     Top-5
2012  AlexNet       83.6%     Top-5
2014  GoogLeNet     93.3%     Top-5
2015  ResNet-152    96.43%    Top-5
2017  SENet         97.75%    Top-5
2022  CoCa (ViT-G)  91.0%     Top-1

Reference line: human Top-5 accuracy (~95%).
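
Since the ILSVRC-era milestones are Top-5 accuracies, each one corresponds to a Top-5 error of 100 minus the listed value. A small sketch of that arithmetic follows, using only the figures listed above. The `MILESTONES` and `top5_error` names are illustrative, and the 2022 CoCa entry is left out because it is a Top-1 figure.

```python
# Top-5 accuracies (%) for the ILSVRC-era milestones listed above.
MILESTONES = [
    ("NEC-UIUC", 2010, 71.8),
    ("AlexNet", 2012, 83.6),
    ("GoogLeNet", 2014, 93.3),
    ("ResNet-152", 2015, 96.43),
    ("SENet", 2017, 97.75),
]


def top5_error(accuracy_pct):
    """Convert a Top-5 accuracy in percent to a Top-5 error in percent."""
    return round(100.0 - accuracy_pct, 2)


for name, year, acc in MILESTONES:
    print(f"{year}  {name:<11} top-5 error = {top5_error(acc):.2f}%")
```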

Current SOTA Leaderboard

Metric: Top-1 Accuracy (%)
Rank  Model Architecture                 Organization  Top-1 Acc  Date     Notes
1     CoCa (ViT-G/14)                    Google        91.000%    2022-05  Finetuned; 2.1B params
2     SoViT-400M/14                      Google        90.300%    2023-05  Compute-optimal ViT shape
3     EVA-02 (ViT-L/14+)                 BAAI          90.000%    2023-03  Finetuned; 304M params, public data only
4     ViT-22B/14                         Google        89.510%    2023-02  22B params; finetuned on ImageNet-1K
5     InternViT-6B (InternVL)            OpenGVLab     88.200%    2024-06  CVPR 2024 Oral; 6B params
6     maxvit_base_tf_512.in1k            Google        86.598%    2023-04
7     coatnet_2_rw_224.sw_in12k_ft_in1k  Google        86.580%    2022-09
8     nextvit_large.bd_ssld_6m_in1k_384  ByteDance     86.542%    2022-11
9     swin_large.ms_in22k_ft_in1k        Microsoft     86.330%    2021-03
10    convnext_base.fb_in22k_ft_in1k     Meta AI       86.298%    2022-01

Note: All scores are finetuned Top-1 accuracy on the ImageNet-1K validation set. Most top models use ImageNet-21K or large-scale image-text data for pre-training. Last updated March 2026.

Dataset Variants

While ILSVRC 2012 is the "standard" ImageNet, the ecosystem has expanded to address specific challenges like scale, robustness, and distribution shift.

Variant               Images  Classes             Primary Use
ImageNet-1K (ILSVRC)  1.28M   1,000               Standard benchmark
ImageNet-21K          14M     21,841              Large-scale pre-training
ImageNet-v2           10K     1,000               Robustness testing
ImageNet-C / R        N/A     1,000 (C), 200 (R)  Corruption & rendition

The Evaluation Pipeline

01 Preprocessing: Resize the image, center crop it to the model's input resolution (typically 224x224 or 384x384), and normalize each channel.

02 Inference: Run a forward pass through the model to generate class logits.

03 Softmax: Convert the logits to a probability distribution over the 1,000 classes.

04 Scoring: Check whether the ground-truth label is the top prediction (Top-1).
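
The four steps can be sketched in plain Python for a single image. This is a minimal illustration rather than a reference implementation. The `MEAN` and `STD` values are the commonly used ImageNet channel statistics and are an assumption here, since the page does not specify them. The inference step is replaced by stand-in logits because it depends entirely on the model, and all function names are hypothetical.

```python
import math

# Assumed normalization constants (commonly used ImageNet channel statistics
# for RGB inputs scaled to [0, 1]); not specified by this page.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def center_crop_box(width, height, crop):
    """Step 01: (left, top, right, bottom) of a centered crop-by-crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Step 01: scale an 8-bit RGB pixel to [0, 1], then standardize per channel."""
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))


def softmax(logits):
    """Step 03: numerically stable softmax (subtract the max before exp)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_top1_correct(probs, label):
    """Step 04: True if the highest-probability class index equals the label."""
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    return predicted == label


# Step 02 stand-in: pretend a model produced these logits for a 4-class toy task.
logits = [1.2, 0.3, 4.1, -0.5]
probs = softmax(logits)
print(center_crop_box(256, 256, 224))
print(is_top1_correct(probs, label=2))
```

Softmax preserves the ordering of the logits, so the Top-1 decision is the same whether it is taken on logits or on probabilities. The softmax step matters when calibrated confidences are needed, not for the accuracy score itself.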

Top-1 Accuracy

The standard metric for ImageNet. It measures the percentage of evaluation images for which the model's highest-probability prediction exactly matches the ground-truth label. As of 2024, SOTA models exceed 90% Top-1 accuracy on the ImageNet-1K validation set.

Accuracy = (Correct Predictions) / (Total Images)

Top-5 Error

Top-5 error was the primary metric for the original ILSVRC competitions, when the task was much harder for the models of the time. A prediction counts as a success if the correct label is among the model's five highest-scoring predictions.

Error = 1 - (Correct in Top 5) / (Total Images)
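
Both formulas are special cases of a top-k hit rate computed over a set of images. The sketch below is one minimal way to express that, assuming per-image score lists and integer class labels. The function names and the toy scores are hypothetical and not part of any official evaluation toolkit.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of images whose label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one image")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this image.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


def top_k_error(scores, labels, k=5):
    """Error = 1 - (correct in top k) / (total images)."""
    return 1.0 - top_k_accuracy(scores, labels, k)


# Toy data: 3 images, 6 classes (hypothetical scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # label 1 is ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # label 2 is ranked second
    [0.40, 0.30, 0.10, 0.10, 0.08, 0.02],  # label 5 is ranked last
]
labels = [1, 2, 5]

print(f"Top-1 accuracy: {top_k_accuracy(scores, labels, k=1):.3f}")
print(f"Top-5 error:    {top_k_error(scores, labels, k=5):.3f}")
```

In this toy set only the first image is a Top-1 hit, while the first two are Top-5 hits. That gap is why Top-5 figures always look more forgiving than Top-1 figures for the same model.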

Related Benchmarks

Benchmark     Focus                       Scale               Key Difference
CIFAR-10/100  Small-scale classification  60k images (32x32)  Low resolution, toy dataset
COCO          Detection & Segmentation    330k images         Focus on object localization
PASCAL VOC    Object Recognition          11k images          Pre-dated ImageNet scale

Access the ImageNet Dataset

Ready to train your own models? The official ImageNet database is available for research and non-commercial use. Access requires registration and an institutional affiliation.
