Codesota · Benchmark · ImageNet-1K

ImageNet-1K.

1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.

§ 01 · SOTA history

Year over year.

§ 02 · Leaderboard

Results by metric.

top-1-accuracy

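Top-1 accuracy is the reported evaluation metric for ImageNet-1K: the percentage of validation images whose single highest-scoring predicted class matches the ground-truth label. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.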

Higher is better
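
As a rough illustration of how a top-1 score like the ones in this table is computed, the sketch below counts how often the highest-scoring class equals the label and reports the result as a percentage. The function name `top1_accuracy` and the toy data are hypothetical; a real evaluation would run a model over the 50K validation images and 1,000 classes, and individual papers may differ in preprocessing and crop settings.

```python
from typing import Sequence


def top1_accuracy(logits: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return top-1 accuracy in percent.

    A sample counts as correct when the index of its largest score
    equals the ground-truth label.
    """
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("need at least one sample")

    correct = 0
    for scores, label in zip(logits, labels):
        # argmax over class scores
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data with 3 classes instead of 1,000 (hypothetical values).
toy_logits = [
    [0.1, 2.0, -1.0],  # predicts class 1
    [1.5, 0.2, 0.3],   # predicts class 0
    [0.0, 0.1, 0.9],   # predicts class 2
    [0.4, 0.3, 0.2],   # predicts class 0
]
toy_labels = [1, 0, 1, 0]

print(top1_accuracy(toy_logits, toy_labels))  # 3 of 4 correct
```

On the full validation set a single extra correct image moves the score by 0.002 points, which is why some entries below carry three decimal places.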

Trust tiers for top-1-accuracy: verified · paper · vendor · community · unverified

Rank · Model · Trust · Score · Year
01 · coca-finetuned · paper · 91 · 2025
    Current SOTA on ImageNet-1K. 2.1B parameters. Contrastive Captioner architecture.
02 · CoCa (finetuned) · unverified · 91 · 2025
    Current SOTA on ImageNet-1K. 2.1B parameters. Contrastive Captioner architecture.
03 · vit-g-14 · paper · 90.45 · 2025
    Giant ViT variant. 1.8B parameters.
04 · ViT-G/14 · unverified · 90.45 · 2025
    Giant ViT variant. 1.8B parameters.
05 · SoViT-400m/14 · unverified · 90.3 · 2026
    SoViT-400m/14, shape-optimized ViT with 400M params. Finetuned on ImageNet-1K at 224px. 90.3% top-1 on IN-1K val. Surpasses ViT-g/14 (90.0%) at less than half the inference cost. NeurIPS 2023, paper revised Jan 2024. Source: arxiv:2305.13035 abstract.
06 · EVA-02-L · verified · 90.056 · 2026
    EVA-02 ViT-L/14+ 304M params. MIM pre-training on Merged-38M, finetuned on IN-22K then IN-1K at 448x448. Source: timm results CSV (eva02_large_patch14_448.mim_m38m_ft_in22k_in1k). Paper: arxiv:2303.11331.
07 · EVA-Giant · verified · 89.79 · 2026
    EVA ViT-Giant/14, 1B params. MIM pre-training on Merged-30M, finetuned on IN-22K then IN-1K at 560x560. Source: timm results CSV (eva_giant_patch14_560.m30m_ft_in22k_in1k). Paper: arxiv:2211.07636.
08 · InternImage-H · verified · 89.6 · 2026
    InternImage-H 1.08B params with deformable convolutions. IN-22K pretraining + joint ImageNet training, 640x640. Source: OpenGVLab/InternImage classification README. Paper: arxiv:2211.05778.
09 · AIMv2-3B · paper · 89.5 · 2026
    AIMv2-3B, multimodal autoregressive pre-training, 2.7B params, 448px. 89.5% top-1 on IN-1K val using attentive probing (frozen backbone + 2-layer attentive head). Apple, Nov 2024. Source: github.com/apple/ml-aim README table. Paper: arxiv:2411.14402.
10 · SigLIP-SO400M · verified · 89.41 · 2026
    Shape-Optimized SigLIP 400M, patch14, res 378. Contrastive pre-training on WebLI, finetuned on IN-1K. Source: timm results CSV (vit_so400m_patch14_siglip_378.webli_ft_in1k). Paper: arxiv:2303.15343.
11 · convnext-v2-huge · paper · 88.9 · 2025
    Best pure ConvNet. 650M parameters. Trained with FCMAE.
12 · ConvNeXt V2 Huge · unverified · 88.9 · 2025
    Best pure ConvNet. 650M parameters. Trained with FCMAE.
13 · ViT-H/14 CLIP (LAION-2B) · verified · 88.634 · 2026
    ViT-H/14 CLIP pre-trained on LAION-2B, finetuned on IN-12K then IN-1K at 336px. Source: timm results CSV (vit_huge_patch14_clip_336.laion2b_ft_in12k_in1k). Paper: arxiv:2212.07143.
14 · ConvNeXt-XXLarge (CLIP LAION) · verified · 88.622 · 2026
    ConvNeXt-XXLarge, CLIP pre-trained on LAION-2B, soup finetuned on IN-1K. Source: timm results CSV (convnext_xxlarge.clip_laion2b_soup_ft_in1k). Paper: arxiv:2212.07143 (OpenCLIP).
15 · ViT-H/14 · unverified · 88.55 · 2025
    Huge ViT variant. 632M parameters.
16 · vit-h-14 · paper · 88.55 · 2025
    Huge ViT variant. 632M parameters.
17 · InternViT-6B (InternVL) · unverified · 88.23 · 2026
    InternViT-6B, 6B-param vision encoder, patch14, 224px. 88.23% Acc@1 on IN-1K val (50k images) via linear probing (frozen backbone + linear head). OpenGVLab, CVPR 2024 Oral. Source: Hugging Face model card OpenGVLab/InternViT-6B-448px-V2_5. Paper: arxiv:2312.14238.
18 · swin-large · paper · 87.3 · 2025
    Hierarchical Vision Transformer with shifted windows.
19 · Swin Transformer Large · unverified · 87.3 · 2025
    Hierarchical Vision Transformer with shifted windows.
20 · efficientnet-v2-l · paper · 85.7 · 2025
    Pretrained on ImageNet-21K, fine-tuned on 1K.
21 · EfficientNetV2-L · unverified · 85.7 · 2025
    Pretrained on ImageNet-21K, fine-tuned on 1K.
22 · MambaVision-L2 · unverified · 85.3 · 2026
    MambaVision-L2, hybrid Mamba-Transformer backbone, 241M params. Finetuned on ImageNet-1K at 224px. 85.3% top-1. Sets new SOTA Pareto front for accuracy vs. throughput. NVIDIA, CVPR 2025. Source: arxiv:2407.08083 Table 1.
23 · deit-b-distilled · paper · 85.2 · 2025
    Data-efficient ViT with distillation. Trained on ImageNet-1K only.
24 · DeiT-B Distilled · unverified · 85.2 · 2025
    Data-efficient ViT with distillation. Trained on ImageNet-1K only.
25 · EfficientNet-B7 · unverified · 84.4 · 2025
    8.4x smaller than GPipe. 66M parameters.
26 · DeiT-B · unverified · 83.1 · 2025
    Without distillation. Trained from scratch on ImageNet-1K.
27 · ConvNeXt V2 Tiny · unverified · 83 · 2025
    28M parameters. Efficient variant.
28 · convnext-v2-tiny · paper · 83 · 2025
    28M parameters. Efficient variant.
29 · vit-l-16 · paper · 82.7 · 2025
    Large ViT with ImageNet-21K pretraining.
30 · ViT-L/16 · unverified · 82.7 · 2025
    Large ViT with ImageNet-21K pretraining.
31 · vit-b-16 · paper · 81.2 · 2025
    Base ViT with ImageNet-21K pretraining.
32 · ViT-B/16 · unverified · 81.2 · 2025
    Base ViT with ImageNet-21K pretraining.
33 · ResNet-50 (A3 training) · unverified · 80.4 · 2025
    ResNet Strikes Back. Modern training recipe on classic architecture.
34 · resnet-50-a3 · paper · 80.4 · 2025
    ResNet Strikes Back. Modern training recipe on classic architecture.
35 · resnet-152 · paper · 78.6 · 2025
    10-crop evaluation. Original deep residual network.
36 · efficientnet-b0 · paper · 77.1 · 2025
    Only 5.3M parameters. Baseline for compound scaling.
37 · resnet-50 · paper · 76.15 · 2025
    Standard torchvision baseline. 25M parameters.

Accuracy

Accuracy is the reported evaluation metric for ImageNet-1K. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Higher is better

Trust tiers for Accuracy: verified · paper · vendor · community · unverified

Rank · Model · Trust · Score · Year
01 · AIMv2 ViT-3B/14 448px · unverified · 89.5 · 2024
02 · BEiT-L+ · unverified · 89.5 · 2021
03 · ALIGN · unverified · 88.64 · 2021
04 · Vision Transformer (ViT-H/14) · unverified · 88.55 · 2020
05 · DINOv3 (7B) · unverified · 88.4 · 2025
06 · MAE (ViT-H, 448) · unverified · 87.8 · 2021
07 · ConvNeXt (XL) · unverified · 87.8 · 2022
08 · BiT-L · unverified · 87.54 · 2019
09 · DINOv2 (ViT-g/14) · unverified · 86.5 · 2023
10 · V-JEPA 2 ViT-g (1B, 384px) · unverified · 85.1 · 2025
11 · SigLIP 2 (g/16) · unverified · 85 · 2025
12 · ResNet-152 · unverified · 80.62 · 2015
13 · DINO (ViT-B/8) · unverified · 80.1 · 2021
14 · YOLO26x-cls · unverified · 79.9 · 2026
15 · YOLOv8x-cls · unverified · 79 · 2023
16 · YOLO26l-cls · unverified · 79 · 2026
17 · YOLO26m-cls · unverified · 78.1 · 2026
18 · YOLOv8m-cls · unverified · 76.8 · 2023
19 · YOLOv8l-cls · unverified · 76.8 · 2023
20 · CLIP · unverified · 76.2 · 2021
21 · YOLO26s-cls · unverified · 76 · 2026
22 · AltCLIP · unverified · 74.5 · 2022
23 · YOLOv8s-cls · unverified · 73.8 · 2023
24 · YOLO26n-cls · unverified · 71.4 · 2026
25 · YOLOv8n-cls · unverified · 69 · 2023
26 · CN-CLIP · unverified · 59.6 · 2022

Pass@1

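Pass@1 is a placeholder label in this table rather than a pass rate: the only entry records an FID (Fréchet Inception Distance) score for image generation, as its note explains. Codesota tracks published model scores on this metric so readers can compare state-of-the-art results across sources and model families.

Labeled "Higher is better", but the listed value is an FID score, where lower is better.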

Trust tiers for Pass@1: verified · paper · vendor · community · unverified

Rank · Model · Trust · Score · Year
01 · pMF-H + FD-loss · verified · 0.72 · N/A
    The paper uses FID (Fréchet Inception Distance) as the primary metric for image generation quality. In the provided schema, 'pass@1' is used as a placeholder for the FID score as it is the primary performance metric reported in the tables.
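
The entry note above says the score is an FID (Fréchet Inception Distance) value. For reference, the standard definition of FID compares Gaussian fits to Inception features of real and generated images. The symbols below are the conventional ones and do not come from this page:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Here $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of the features for real and generated images. Identical distributions give FID = 0, so a lower value indicates a closer match.
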
§ 03 · Lineage

ImageNet in context.

Predecessors (1)
CIFAR-10/100 · saturated · 2009-01
    ImageNet replaced CIFAR as the canonical vision benchmark when GPU compute made large-scale image classification practical. AlexNet's 2012 ImageNet win effectively ended CIFAR's era as the frontier benchmark.

This benchmark (1)
ImageNet · saturated · 2009-06