Who leads the ImageNet-1K benchmark?

CoCa (finetuned) currently leads ImageNet-1K with a score of 91 on top-1-accuracy.

What is the state-of-the-art score on ImageNet-1K?

The state-of-the-art result on ImageNet-1K is 91 (top-1-accuracy), achieved by CoCa (finetuned) as of 2026.

How many models are tracked on ImageNet-1K?

Codesota tracks 46 models on ImageNet-1K across 3 metrics.

When was the ImageNet-1K leaderboard last updated?

The ImageNet-1K leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2015.


Image Classification · benchmark dataset · 2012 · EN

ImageNet Large Scale Visual Recognition Challenge 2012.

Name: ImageNet Large Scale Visual Recognition Challenge 2012 Benchmark Results
Creator: Codesota
Published: 2015-01-01
License: https://creativecommons.org/licenses/by/4.0/

1.28M training images, 50K validation images across 1,000 object classes. The standard benchmark for image classification since 2012.
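As a rough illustration of the headline metric, the sketch below computes top-1 accuracy from a list of predicted class indices and a list of ground-truth labels. It assumes the scores on this page are percentages; `top1_accuracy` is an illustrative name, not part of any Codesota tooling.

```python
from typing import Sequence


def top1_accuracy(predicted: Sequence[int], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose top prediction equals the label."""
    if len(predicted) != len(labels):
        raise ValueError("predicted and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    # A sample counts as correct only when the single highest-ranked class matches.
    correct = sum(p == y for p, y in zip(predicted, labels))
    return 100.0 * correct / len(labels)


# Toy example with 4 samples; a real run would cover the 50K validation images.
print(top1_accuracy([3, 7, 7, 1], [3, 7, 2, 1]))  # 75.0
```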

Saturated benchmark

Benchmark near ceiling or stagnant — no meaningful SOTA movement in 2+ years


§ 01 · Leaderboard

Best published scores.

47 results indexed across 3 metrics. Shaded row marks current SOTA; ties broken by submission date.

Primary: top-1-accuracy · higher is better
All metrics: accuracy, pass@1, top-1-accuracy
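The ordering rule stated above (higher is better, ties broken by submission date) can be sketched as a two-key sort. The page does not say which direction the tie-break runs; the sketch assumes the earlier submission ranks first, which matches the two 89.50 rows at the top of the accuracy table but may not hold for every tie. `Row` and `rank` are illustrative names.

```python
from datetime import date
from typing import NamedTuple


class Row(NamedTuple):
    model: str
    submitted: date  # the page shows month and year only, so the day is set to 1
    score: float


def rank(rows):
    # Score descending, then submission date ascending (assumed tie-break direction).
    return sorted(rows, key=lambda r: (-r.score, r.submitted))


rows = [
    Row("AIMv2 ViT-3B/14 448px", date(2024, 11, 1), 89.50),
    Row("ALIGN", date(2021, 2, 1), 88.64),
    Row("BEiT-L+", date(2021, 6, 1), 89.50),
]

for position, row in enumerate(rank(rows), start=1):
    print(f"{position:02d} {row.model} {row.score:.2f}")
```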

accuracy

26 rows

#	Model	Org	Submitted	Paper / code	accuracy
01	BEiT-L+	—	Jun 2021	BEiT: BERT Pre-Training of Image Transformers · code	89.50
02	AIMv2 ViT-3B/14 448px	—	Nov 2024	Multimodal Autoregressive Pre-training of Large Vision E… · code	89.50
03	ALIGN	—	Feb 2021	Scaling Up Visual and Vision-Language Representation Lea… · code	88.64
04	Vision Transformer (ViT-H/14)	—	Oct 2020	An Image is Worth 16x16 Words: Transformers for Image Re… · code	88.55
05	DINOv3 (7B)	—	Aug 2025	DINOv3 · code	88.40
06	MAE (ViT-H, 448)	—	Nov 2021	Masked Autoencoders Are Scalable Vision Learners · code	87.80
07	ConvNeXt (XL)	—	Jan 2022	A ConvNet for the 2020s · code	87.80
08	BiT-L	—	Dec 2019	Big Transfer (BiT): General Visual Representation Learni… · code	87.54
09	DINOv2 (ViT-g/14)	—	Apr 2023	DINOv2: Learning Robust Visual Features without Supervis… · code	86.50
10	V-JEPA 2 ViT-g (1B, 384px)	—	Jun 2025	V-JEPA 2: Self-Supervised Video Models Enable Understand… · code	85.10
11	SigLIP 2 (g/16)	—	Feb 2025	SigLIP 2: Multilingual Vision-Language Encoders with Imp… · code	85
12	ResNet-152 · Open	Microsoft	Dec 2015	Deep Residual Learning for Image Recognition · code	80.62
13	DINO (ViT-B/8)	—	Apr 2021	Emerging Properties in Self-Supervised Vision Transforme… · code	80.10
14	YOLO26x-cls	—	Jan 2026	pwc-dump · code	79.90
15	YOLO26l-cls	—	Jan 2026	pwc-dump · code	79
16	YOLOv8x-cls	—	Jan 2023	pwc-dump · code	79
17	YOLO26m-cls	—	Jan 2026	pwc-dump · code	78.10
18	YOLOv8m-cls	—	Jan 2023	pwc-dump · code	76.80
19	YOLOv8l-cls	—	Jan 2023	pwc-dump · code	76.80
20	CLIP	—	Feb 2021	Learning Transferable Visual Models From Natural Languag… · code	76.20
21	YOLO26s-cls	—	Jan 2026	pwc-dump · code	76
22	AltCLIP	—	Nov 2022	AltCLIP: Altering the Language Encoder in CLIP for Exten… · code	74.50
23	YOLOv8s-cls	—	Jan 2023	pwc-dump · code	73.80
24	YOLO26n-cls	—	Jan 2026	pwc-dump · code	71.40
25	YOLOv8n-cls	—	Jan 2023	pwc-dump · code	69
26	CN-CLIP	—	Nov 2022	Chinese CLIP: Contrastive Vision-Language Pretraining in… · code	59.60

pass@1

1 row

#	Model	Org	Submitted	Paper / code	pass@1
01	pMF-H + FD-loss · Open	N/A	—	paper	0.720

top-1-accuracy · primary

20 rows

#	Model	Org	Submitted	Paper / code	top-1-accuracy
01	CoCa (finetuned) · Open	Google	Dec 2025	google-research	91
02	ViT-G/14 · Open	Google	Dec 2025	google-research	90.45
03	SoViT-400m/14 · Open	Google DeepMind	Apr 2026	neurips-2023	90.30
04	AIMv2-3B · Open	Apple	Apr 2026	arxiv-paper	89.50
05	ConvNeXt V2 Huge · Open	Meta	Dec 2025	meta-research	88.90
06	ViT-H/14 · Open	Google	Dec 2025	google-research	88.55
07	InternViT-6B (InternVL) · Open	OpenGVLab	Apr 2026	cvpr-2024	88.23
08	Swin Transformer Large · Open	Microsoft	Dec 2025	microsoft-research	87.30
09	EfficientNetV2-L · Open	Google	Dec 2025	google-research	85.70
10	MambaVision-L2 · Open	NVIDIA	Apr 2026	cvpr-2025	85.30
11	DeiT-B Distilled · Open	Meta	Dec 2025	meta-research	85.20
12	EfficientNet-B7 · Open	Google	Dec 2025	google-research	84.40
13	DeiT-B · Open	Meta	Dec 2025	meta-research	83.10
14	ConvNeXt V2 Tiny · Open	Meta	Dec 2025	meta-research	83
15	ViT-L/16 · Open	Google	Dec 2025	google-research	82.70
16	ViT-B/16 · Open	Google	Dec 2025	google-research	81.20
17	ResNet-50 (A3 training) · Open	Timm	Dec 2025	timm-research	80.40
18	ResNet-152 · Open	Microsoft	Dec 2025	microsoft-research	78.60
19	EfficientNet-B0 · Open	Google	Dec 2025	google-research	77.10
20	ResNet-50 · Open	Microsoft	Dec 2025	pytorch-vision	76.15

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.

§ 03 · Progress

1 step of state of the art.

Each row below marks a model that broke the previous record on top-1-accuracy. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
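The record-breaking rule described above can be sketched as a running maximum over results in date order. This is only an illustration of the stated rule, not Codesota's actual pipeline; `sota_steps` is an illustrative name and the sample results are hypothetical.

```python
from datetime import date


def sota_steps(results):
    """Return the entries that strictly beat every earlier score (higher is better)."""
    steps = []
    best = float("-inf")
    # Walk results in chronological order and keep only those that raise the record.
    for when, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            steps.append((when, model, score))
            best = score
    return steps


# Hypothetical results, not taken from the leaderboard.
results = [
    (date(2024, 1, 10), "model-a", 80.0),
    (date(2024, 6, 1), "model-b", 79.5),  # below the record, so not a step
    (date(2025, 3, 5), "model-c", 82.1),
]

for when, model, score in sota_steps(results):
    print(when.isoformat(), model, score)
```

Under this rule, a progress chart with a single step is what results when the highest-scoring entry also carries the earliest date, since no later entry can then beat it.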

SOTA line · top-1-accuracy

Dec 18, 2025 · CoCa (finetuned) · Google · 91

Fig 3 · SOTA-setting models only. 1 entry, dated Dec 2025.

§ 04 · Literature

16 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above.

DINOv3
Aug 2025 · DINOv3 (7B)

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Jun 2025 · V-JEPA 2 ViT-g (1B, 384px)

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Feb 2025 · SigLIP 2 (g/16)

Multimodal Autoregressive Pre-training of Large Vision Encoders
Nov 2024 · AIMv2 ViT-3B/14 448px

DINOv2: Learning Robust Visual Features without Supervision
Apr 2023 · DINOv2 (ViT-g/14)

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities
Nov 2022 · AltCLIP

Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese
Nov 2022 · CN-CLIP

A ConvNet for the 2020s
Jan 2022 · ConvNeXt (XL)

Masked Autoencoders Are Scalable Vision Learners
Nov 2021 · MAE (ViT-H, 448)

BEiT: BERT Pre-Training of Image Transformers
Jun 2021 · BEiT-L+

Emerging Properties in Self-Supervised Vision Transformers
Apr 2021 · DINO (ViT-B/8)

Learning Transferable Visual Models From Natural Language Supervision
Feb 2021 · CLIP

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Feb 2021 · ALIGN

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Oct 2020 · Vision Transformer (ViT-H/14)

Big Transfer (BiT): General Visual Representation Learning
Dec 2019 · BiT-L

Deep Residual Learning for Image Recognition
Dec 2015 · ResNet-152

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and — if it takes the top — annotate the step on the progress chart with your name.


What a submission needs

01 · A public checkpoint or API endpoint
02 · A reproduction script with frozen commit + seed
03 · Declared evaluation environment (Python, deps)
04 · One row per metric declared by this dataset
05 · A contact so we can follow up on discrepancies
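The five requirements above could be checked before submitting with a small pre-flight script. The sketch below is purely hypothetical: the `Submission` structure, its field names, and `missing_items` are assumptions made for illustration and do not reflect Codesota's actual submission format. The metric names are the ones listed under "All metrics" on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Metric names as listed under "All metrics" for this dataset.
DECLARED_METRICS = ("accuracy", "pass@1", "top-1-accuracy")


@dataclass
class Submission:
    """Hypothetical container for the five items; not an official schema."""
    checkpoint_url: str
    commit: str
    seed: Optional[int]
    python_version: str
    dependencies: List[str]
    contact: str
    scores: Dict[str, float] = field(default_factory=dict)


def missing_items(sub: Submission) -> List[str]:
    """List which of the five requirements are not yet satisfied."""
    problems = []
    if not sub.checkpoint_url:
        problems.append("01 checkpoint or API endpoint")
    if not sub.commit or sub.seed is None:
        problems.append("02 frozen commit + seed")
    if not sub.python_version or not sub.dependencies:
        problems.append("03 evaluation environment")
    for metric in DECLARED_METRICS:
        if metric not in sub.scores:
            problems.append(f"04 row for metric {metric}")
    if not sub.contact:
        problems.append("05 contact")
    return problems


# Placeholder values for illustration only.
example = Submission(
    checkpoint_url="https://example.org/model.ckpt",
    commit="0123abc",
    seed=0,
    python_version="3.11",
    dependencies=["torch"],
    contact="author@example.org",
    scores={"top-1-accuracy": 80.0},
)
print(missing_items(example))
```

Here the example reports two missing metric rows, because it declares a score for only one of the three metrics.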
