ImageNet-1K

1.28M training images and 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.

Benchmark Stats

Models: 22 | Papers: 22 | Metrics: 1

SOTA History

Metric: top-1 accuracy (higher is better)
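Top-1 accuracy counts a prediction as correct only when the model's single highest-scoring class matches the ground-truth label. A minimal plain-Python sketch (the scores and labels below are made-up toy values, not outputs of any model on this leaderboard):

```python
def top1_accuracy(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        # argmax over the class scores for one example
        pred = max(range(len(scores)), key=lambda i: scores[i])
        if pred == label:
            correct += 1
    return correct / len(labels)

# Toy example: 4 "images", 3 classes, hand-written scores.
logits = [
    [0.1, 0.7, 0.2],    # predicts class 1
    [0.9, 0.05, 0.05],  # predicts class 0
    [0.2, 0.3, 0.5],    # predicts class 2
    [0.6, 0.3, 0.1],    # predicts class 0, label is 1 -> wrong
]
labels = [1, 0, 2, 1]
print(top1_accuracy(logits, labels))  # → 0.75
```

On ImageNet-1K the same computation runs over the 50K validation images, so a score of 76.15 means roughly 38,075 correct top-1 predictions.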

| Rank | Model | Source | Score | Year | Notes |
|---|---|---|---|---|---|
| 1 | coca-finetuned | Editorial | 91 | 2025 | Current SOTA on ImageNet-1K. 2.1B parameters. Contrastive Captioner architecture. |
| 2 | vit-g-14 | Editorial | 90.45 | 2025 | Giant ViT variant. 1.8B parameters. |
| 3 | EVA-02-L | Community | 90.056 | 2026 | EVA-02 ViT-L/14+, 304M params. MIM pre-training on Merged-38M, finetuned on IN-22K then IN-1K at 448x448. Source: timm results CSV (eva02_large_patch14_448.mim_m38m_ft_in22k_in1k). Paper: arXiv:2303.11331. |
| 4 | EVA-Giant | Community | 89.79 | 2026 | EVA ViT-Giant/14, 1B params. MIM pre-training on Merged-30M, finetuned on IN-22K then IN-1K at 560x560. Source: timm results CSV (eva_giant_patch14_560.m30m_ft_in22k_in1k). Paper: arXiv:2211.07636. |
| 5 | InternImage-H | Community | 89.6 | 2026 | 1.08B params with deformable convolutions. IN-22K pretraining plus joint ImageNet training at 640x640. Source: OpenGVLab/InternImage classification README. Paper: arXiv:2211.05778. |
| 6 | SigLIP-SO400M | Community | 89.41 | 2026 | Shape-Optimized SigLIP, 400M params, patch 14, res 378. Contrastive pre-training on WebLI, finetuned on IN-1K. Source: timm results CSV (vit_so400m_patch14_siglip_378.webli_ft_in1k). Paper: arXiv:2303.15343. |
| 7 | convnext-v2-huge | Editorial | 88.9 | 2025 | Best pure ConvNet. 650M parameters. Trained with FCMAE. |
| 8 | ViT-H/14 CLIP (LAION-2B) | Community | 88.634 | 2026 | ViT-H/14 CLIP pre-trained on LAION-2B, finetuned on IN-12K then IN-1K at 336px. Source: timm results CSV (vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k). Paper: arXiv:2212.07143. |
| 9 | ConvNeXt-XXLarge (CLIP LAION) | Community | 88.622 | 2026 | ConvNeXt-XXLarge, CLIP pre-trained on LAION-2B, model-soup finetuned on IN-1K. Source: timm results CSV (convnext_xxlarge.clip_laion2b_soup_ft_in1k). Paper: arXiv:2212.07143 (OpenCLIP). |
| 10 | vit-h-14 | Editorial | 88.55 | 2025 | Huge ViT variant. 632M parameters. |
| 11 | swin-large | Editorial | 87.3 | 2025 | Hierarchical Vision Transformer with shifted windows. |
| 12 | efficientnet-v2-l | Editorial | 85.7 | 2025 | Pretrained on ImageNet-21K, fine-tuned on IN-1K. |
| 13 | deit-b-distilled | Editorial | 85.2 | 2025 | Data-efficient ViT with distillation. Trained on ImageNet-1K only. |
| 14 | efficientnet-b7 | Editorial | 84.4 | 2025 | 66M parameters; 8.4x smaller than GPipe. |
| 15 | deit-b | Editorial | 83.1 | 2025 | Without distillation. Trained from scratch on ImageNet-1K. |
| 16 | convnext-v2-tiny | Editorial | 83 | 2025 | 28M parameters. Efficient variant. |
| 17 | vit-l-16 | Editorial | 82.7 | 2025 | Large ViT with ImageNet-21K pretraining. |
| 18 | vit-b-16 | Editorial | 81.2 | 2025 | Base ViT with ImageNet-21K pretraining. |
| 19 | resnet-50-a3 | Editorial | 80.4 | 2025 | ResNet Strikes Back: modern training recipe on the classic architecture. |
| 20 | resnet-152 | Editorial | 78.6 | 2025 | Original deep residual network; 10-crop evaluation. |
| 21 | efficientnet-b0 | Editorial | 77.1 | 2025 | Only 5.3M parameters. Baseline for compound scaling. |
| 22 | resnet-50 | Editorial | 76.15 | 2025 | Standard torchvision baseline. 25M parameters. |
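Several of the Community entries above cite scores taken from timm's published results CSV files. A hedged sketch of extracting a model's top-1 score from such a file with the standard csv module; the column names `model` and `top1` are assumed to match timm's results format (verify against the actual file), and the inline CSV here is an illustrative stand-in built from scores quoted above:

```python
import csv
import io

# Illustrative stand-in for a timm results CSV; real files are larger
# and carry more columns (top5, param_count, img_size, ...).
RESULTS_CSV = """model,top1
eva02_large_patch14_448.mim_m38m_ft_in22k_in1k,90.056
eva_giant_patch14_560.m30m_ft_in22k_in1k,89.79
vit_so400m_patch14_siglip_378.webli_ft_in1k,89.41
"""

def top1_for(model_name, csv_text):
    """Return the top-1 score recorded for model_name, or None if absent."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["model"] == model_name:
            return float(row["top1"])
    return None

print(top1_for("eva_giant_patch14_560.m30m_ft_in22k_in1k", RESULTS_CSV))  # → 89.79
```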

ImageNet-1K Leaderboard | CodeSOTA