# ImageNet-1K
1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
## SOTA History
Metric: top-1-accuracy (higher is better).
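
Concretely, a prediction counts as correct under top-1 only when the single highest-scoring class equals the ground-truth label. A minimal sketch in PyTorch (function name and shapes are illustrative):

```python
import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fraction of samples whose highest-scoring class matches the label.

    logits: (N, 1000) raw model outputs; labels: (N,) integer class ids.
    """
    preds = logits.argmax(dim=1)  # single best class per sample
    return (preds == labels).float().mean().item()

# Toy example: 4 samples over 1,000 classes.
logits = torch.randn(4, 1000)
labels = torch.tensor([5, 17, 512, 999])
print(f"top-1 accuracy: {top1_accuracy(logits, labels):.4f}")
```
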
| Rank | Model | Notes | Source | Top-1 (%) | Year | Paper |
|---|---|---|---|---|---|---|
| 1 | coca-finetuned | Contrastive Captioner (CoCa), 2.1B parameters, finetuned on IN-1K. Current SOTA in this list. | Editorial | 91.0 | 2022 | arXiv:2205.01917 |
| 2 | vit-g-14 | Giant ViT variant, 1.8B parameters. | Editorial | 90.45 | 2021 | arXiv:2106.04560 |
| 3 | EVA-02-L | EVA-02 ViT-L/14+, 304M params. MIM pre-training on Merged-38M, finetuned on IN-22K then IN-1K at 448×448. Source: timm results CSV (eva02_large_patch14_448.mim_m38m_ft_in22k_in1k). | Community | 90.056 | 2023 | arXiv:2303.11331 |
| 4 | EVA-Giant | EVA ViT-Giant/14, 1B params. MIM pre-training on Merged-30M, finetuned on IN-22K then IN-1K at 560×560. Source: timm results CSV (eva_giant_patch14_560.m30m_ft_in22k_in1k). | Community | 89.79 | 2022 | arXiv:2211.07636 |
| 5 | InternImage-H | 1.08B params with deformable convolutions. IN-22K pretraining plus joint ImageNet training at 640×640. Source: OpenGVLab/InternImage classification README. | Community | 89.6 | 2022 | arXiv:2211.05778 |
| 6 | SigLIP-SO400M | Shape-Optimized SigLIP, 400M params, patch 14 at 378×378. Contrastive pre-training on WebLI, finetuned on IN-1K. Source: timm results CSV (vit_so400m_patch14_siglip_378.webli_ft_in1k). | Community | 89.41 | 2023 | arXiv:2303.15343 |
| 7 | convnext-v2-huge | Best pure ConvNet in this list. 650M parameters, trained with FCMAE. | Editorial | 88.9 | 2023 | arXiv:2301.00808 |
| 8 | ViT-H/14 CLIP (LAION-2B) | CLIP pre-trained on LAION-2B, finetuned on IN-12K then IN-1K at 336×336. Source: timm results CSV (vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k). | Community | 88.634 | 2022 | arXiv:2212.07143 |
| 9 | ConvNeXt-XXLarge (CLIP LAION) | CLIP pre-trained on LAION-2B (OpenCLIP), model-soup finetuned on IN-1K. Source: timm results CSV (convnext_xxlarge.clip_laion2b_soup_ft_in1k). | Community | 88.622 | 2022 | arXiv:2212.07143 |
| 10 | vit-h-14 | Huge ViT variant, 632M parameters. | Editorial | 88.55 | 2020 | arXiv:2010.11929 |
| 11 | swin-large | Hierarchical Vision Transformer with shifted windows. | Editorial | 87.3 | 2021 | arXiv:2103.14030 |
| 12 | efficientnet-v2-l | Trained with progressive learning; this score is the ImageNet-1K-only result (the IN-21K-pretrained variant reaches 86.8). | Editorial | 85.7 | 2021 | arXiv:2104.00298 |
| 13 | deit-b-distilled | Data-efficient ViT with distillation; trained on ImageNet-1K only. | Editorial | 85.2 | 2020 | arXiv:2012.12877 |
| 14 | efficientnet-b7 | 66M parameters; 8.4× smaller than GPipe at comparable accuracy. | Editorial | 84.4 | 2019 | arXiv:1905.11946 |
| 15 | deit-b | Without distillation; trained from scratch on ImageNet-1K. | Editorial | 83.1 | 2020 | arXiv:2012.12877 |
| 16 | convnext-v2-tiny | 28M parameters; efficient variant. | Editorial | 83.0 | 2023 | arXiv:2301.00808 |
| 17 | vit-l-16 | Large ViT with ImageNet-21K pretraining. | Editorial | 82.7 | 2020 | arXiv:2010.11929 |
| 18 | vit-b-16 | Base ViT with ImageNet-21K pretraining. | Editorial | 81.2 | 2020 | arXiv:2010.11929 |
| 19 | resnet-50-a1 | ResNet Strikes Back: modern training recipe on the classic architecture. 80.4 corresponds to the A1 recipe; the lighter A3 recipe reaches 78.1. | Editorial | 80.4 | 2021 | arXiv:2110.00476 |
| 20 | resnet-152 | Original deep residual network; 10-crop evaluation. | Editorial | 78.6 | 2015 | arXiv:1512.03385 |
| 21 | efficientnet-b0 | Only 5.3M parameters; baseline for compound scaling. | Editorial | 77.1 | 2019 | arXiv:1905.11946 |
| 22 | resnet-50 | Standard torchvision baseline, 25M parameters. | Editorial | 76.15 | 2015 | arXiv:1512.03385 |
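
The Community rows cite timm results-CSV model identifiers, so those numbers can be re-checked locally. A hedged sketch, assuming timm, torchvision, a CUDA device, and an ImageNet-1K validation set in the standard synset-folder layout (the path is a placeholder; ImageFolder's alphabetical wnid ordering matches the usual class-index mapping):

```python
import timm
import torch
from timm.data import create_transform, resolve_data_config
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# Model id copied from rank 3 in the table above.
model = timm.create_model(
    "eva02_large_patch14_448.mim_m38m_ft_in22k_in1k", pretrained=True
).eval().cuda()

# Recover the exact eval preprocessing (resize, crop_pct, mean/std)
# the checkpoint was validated with.
cfg = resolve_data_config({}, model=model)
transform = create_transform(**cfg)

# Placeholder path; expects val/<wnid>/*.JPEG folders.
dataset = ImageFolder("/path/to/imagenet/val", transform=transform)
loader = DataLoader(dataset, batch_size=32, num_workers=8, pin_memory=True)

correct = total = 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images.cuda(non_blocking=True)).argmax(dim=1).cpu()
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1: {100 * correct / total:.3f}%")  # table reports 90.056
```

Other checkpoints from the table drop in by name; `timm.list_models('eva02*', pretrained=True)` enumerates related variants.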