Who leads the ADE20K benchmark?

InternImage-H currently leads ADE20K with a score of 62.90 on mIoU.

What is the state-of-the-art score on ADE20K?

The state-of-the-art result on ADE20K is 62.90 (mIoU), achieved by InternImage-H as of 2026.

How many models are tracked on ADE20K?

Codesota tracks 19 models on ADE20K across 2 metrics.

When was the ADE20K leaderboard last updated?

The ADE20K leaderboard on Codesota includes results through 2026, with the earliest tracked result from 2021.


Semantic Segmentation · benchmark dataset · 2016 · EN

ADE20K Scene Parsing Benchmark.

Name: ADE20K Scene Parsing Benchmark Results
Creator: Codesota
Published: 2021-01-01
License: https://creativecommons.org/licenses/by/4.0/

20K training and 2K validation images, annotated with 150 object categories. A complex scene-parsing benchmark.


§ 01 · Leaderboard

Best published scores.

21 results indexed across 2 metrics. Shaded row marks current SOTA; ties broken by submission date.
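The ranking rule stated here (score first, then submission date) can be sketched in a few lines of Python. This is a minimal illustration, not Codesota's code: the `Result` type and `rank` function are hypothetical, and because the page does not say whether the earlier or the later submission wins a tie, the direction is exposed as a flag.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    model: str
    score: float     # mIoU; higher is better
    submitted: date  # used only to break exact score ties


def rank(results: list[Result], earlier_wins_ties: bool = True) -> list[Result]:
    """Sort by score (descending), breaking exact ties by submission date."""

    def sort_key(r: Result) -> tuple[float, int]:
        day = r.submitted.toordinal()
        # Negating the score puts the highest score first; the date term puts
        # the earlier (or, with the flag off, the later) submission first.
        return (-r.score, day if earlier_wins_ties else -day)

    return sorted(results, key=sort_key)


if __name__ == "__main__":
    rows = [
        # Day of month is a placeholder: the table lists only month and year.
        Result("Mask2Former (Swin-L), Meta AI", 57.30, date(2026, 3, 1)),
        Result("InternImage-H", 62.90, date(2025, 12, 18)),
        Result("Mask2Former (Swin-L), Meta AI / UIUC", 57.30, date(2025, 12, 1)),
    ]
    for position, r in enumerate(rank(rows), start=1):
        print(f"{position:02d}  {r.model}  {r.score:.2f}")
```

Keeping the tie-break direction as a parameter avoids hard-coding a rule the page does not document.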

Primary: mIoU · higher is better
All metrics: mIoU, miou
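For reference, mIoU is usually computed by pooling pixel counts over the whole evaluation set, taking per-class IoU as intersection divided by union, and averaging over classes. The sketch below is a stdlib-only illustration under stated assumptions, not the official evaluation script: predictions and labels are flat lists of class ids, pixels labelled 255 are ignored (a common convention when ADE20K's "other" label 0 is remapped to an ignore value), and classes absent from both prediction and ground truth are skipped. It returns a fraction; the leaderboard reports percentages.

```python
from collections import Counter
from typing import Sequence


def mean_iou(preds: Sequence[int], targets: Sequence[int], num_classes: int,
             ignore_index: int = 255) -> float:
    """Dataset-level mean IoU from flat per-pixel class ids.

    Counts are pooled over every pixel passed in (not averaged per image).
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    intersection, pred_area, target_area = Counter(), Counter(), Counter()
    for p, t in zip(preds, targets):
        if t == ignore_index:
            continue  # unlabelled pixels are excluded entirely
        pred_area[p] += 1
        target_area[t] += 1
        if p == t:
            intersection[t] += 1
    ious = []
    for c in range(num_classes):
        union = pred_area[c] + target_area[c] - intersection[c]
        if union == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(intersection[c] / union)
    if not ious:
        raise ValueError("no class has a non-empty union")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # Toy example with 3 classes; the last pixel is unlabelled (255).
    score = mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 255], num_classes=3)
    print(f"mIoU = {100 * score:.2f}")  # prints mIoU = 58.33
```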

mIoU · primary

6 rows

#	Model	Org	Submitted	Paper / code	mIoU
01	InternImage-H · Open	Shanghai AI Lab	Dec 2025	arXiv	62.90
02	BEiT-3 (ViT-L) · Open	Microsoft	Mar 2026	arXiv	62.80
03	DINOv2 (ViT-g) + Linear · Open	Meta AI	Mar 2026	arXiv	62.00
04	Mask2Former (Swin-L) · Open	Meta AI	Mar 2026	arXiv	57.30
05	Mask2Former (Swin-L) · Open	Meta AI / UIUC	Dec 2025	arXiv	57.30
06	Swin-L + UperNet · Open	Microsoft	Mar 2026	arXiv	53.50

miou

15 rows

#	Model	Org	Submitted	Paper / code	miou
01	DINOv3 + Mask2Former (simple)	—	Aug 2025	DINOv3 · code	62.60
02	EoMT (ViT-L)	—	Mar 2025	Your ViT is Secretly an Image Segmentation Model · code	58.40
03	BEiT-L+	—	Jun 2021	BEiT: BERT Pre-Training of Image Transformers · code	57.90
04	OneFormer (Swin-L)	—	Nov 2022	OneFormer: One Transformer to Rule Universal Image Segme… · code	57.00
05	Mask2Former + Swin-L-FaPN	—	Dec 2021	Masked-attention Mask Transformer for Universal Image Se… · code	56.40
06	Mask2Former (Swin-L) · Open	Meta AI / UIUC	Dec 2021	Masked-attention Mask Transformer for Universal Image Se… · code	56.10
07	DINOv3 + linear probe	—	Aug 2025	DINOv3 · code	55.90
08	ConvNeXt (XL)	—	Jan 2022	A ConvNet for the 2020s · code	54.00
09	MAE (ViT-H, 448)	—	Nov 2021	Masked Autoencoders Are Scalable Vision Learners · code	53.60
10	DINOv2 (ViT-g/14)	—	Apr 2023	DINOv2: Learning Robust Visual Features without Supervis… · code	53.00
11	Mask2Former + Swin-T	—	Dec 2021	Masked-attention Mask Transformer for Universal Image Se… · code	47.70
12	Mask2Former + ResNet-50	—	Dec 2021	Masked-attention Mask Transformer for Universal Image Se… · code	47.20
13	MaskFormer (Swin-T)	—	Jul 2021	Per-Pixel Classification is Not All You Need for Semanti… · code	46.70
14	SigLIP 2 (g/16)	—	Feb 2025	SigLIP 2: Multilingual Vision-Language Encoders with Imp… · code	45.40
15	SegFormer (MiT-B0)	—	May 2021	SegFormer: Simple and Efficient Design for Semantic Segm… · code	37.40

Fig 2 · Rows sorted by score within each metric. Shaded row marks SOTA. Dates reflect model or paper release where available, otherwise the date Codesota accessed the source.

§ 03 · Progress

1 step of state of the art.

Each row below marks a model that broke the previous record on mIoU. Intermediate submissions are kept in the leaderboard above; only SOTA-setting entries are re-listed here.

Higher scores win. Each subsequent entry improved upon the previous best.
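The record-tracking rule described here amounts to a running maximum over results taken in date order. The sketch below assumes a hypothetical `sota_steps` function and a (date, model, score) tuple layout; it is an illustration, not the site's implementation. An entry counts as a step only if it strictly beats the previous best, so equalling the record adds nothing.

```python
from datetime import date


def sota_steps(results):
    """Return the record-setting entries in chronological order.

    results: iterable of (submitted: date, model: str, score: float) tuples,
    where a higher score is better. An entry is kept only if it strictly beats
    every earlier entry, so equalling the record does not count as a step.
    """
    steps = []
    best = float("-inf")
    # sorted() is stable, so entries sharing a date keep their input order.
    for submitted, model, score in sorted(results, key=lambda r: r[0]):
        if score > best:
            best = score
            steps.append((submitted, model, score))
    return steps


if __name__ == "__main__":
    toy = [  # hypothetical results, not taken from the leaderboard
        (date(2023, 1, 1), "c", 55.0),
        (date(2021, 1, 1), "a", 50.0),
        (date(2024, 1, 1), "d", 55.0),
        (date(2022, 1, 1), "b", 48.0),
    ]
    for submitted, model, score in sota_steps(toy):
        print(submitted.isoformat(), model, f"{score:.2f}")
```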

SOTA line · mIoU

Dec 18, 2025 · InternImage-H · Shanghai AI Lab · 62.90

Fig 3 · SOTA-setting models only. 1 entry, dated Dec 2025.

§ 04 · Literature

11 papers tied to this benchmark.

Every paper below corresponds to at least one row in the leaderboard above. Click through for the arXiv preprint and, when available, the reference implementation.

DINOv3
Aug 2025 · DINOv3 + Mask2Former (simple), DINOv3 + linear probe
arXiv · Code

Your ViT is Secretly an Image Segmentation Model
Mar 2025 · EoMT (ViT-L)
arXiv · Code

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Feb 2025 · SigLIP 2 (g/16)
arXiv · Code

DINOv2: Learning Robust Visual Features without Supervision
Apr 2023 · DINOv2 (ViT-g/14)
arXiv · Code

OneFormer: One Transformer to Rule Universal Image Segmentation
Nov 2022 · OneFormer (Swin-L)
arXiv · Code

A ConvNet for the 2020s
Jan 2022 · ConvNeXt (XL)
arXiv · Code

Masked-attention Mask Transformer for Universal Image Segmentation
Dec 2021 · Mask2Former + Swin-L-FaPN, Mask2Former (Swin-L), Mask2Former + Swin-T, and 1 more
arXiv · Code

Masked Autoencoders Are Scalable Vision Learners
Nov 2021 · MAE (ViT-H, 448)
arXiv · Code

Per-Pixel Classification is Not All You Need for Semantic Segmentation
Jul 2021 · MaskFormer (Swin-T)
arXiv · Code

BEiT: BERT Pre-Training of Image Transformers
Jun 2021 · BEiT-L+
arXiv · Code

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
May 2021 · SegFormer (MiT-B0)
arXiv · Code

§ 06 · Contribute

Have a score that beats this table?

Submit a checkpoint and a reproduction script. We will run it, publish the score, and, if it takes the top spot, annotate the step on the progress chart with your name.


What a submission needs

01 A public checkpoint or API endpoint
02 A reproduction script with a frozen commit and seed
03 A declared evaluation environment (Python version, dependencies)
04 One row per metric declared by this dataset
05 A contact so we can follow up on discrepancies
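A submitter might check a package against this list before sending it. The sketch below is hypothetical: the field names and the `missing_items` helper are invented for illustration and are not Codesota's submission schema.

```python
# Hypothetical checklist: these keys are illustrative, not Codesota's schema.
REQUIRED_FIELDS = {
    "checkpoint": "public checkpoint or API endpoint",
    "repro_script": "reproduction script",
    "commit": "frozen commit",
    "seed": "random seed",
    "environment": "declared evaluation environment",
    "contact": "contact for follow-up",
}


def missing_items(submission: dict, declared_metrics: tuple[str, ...]) -> list[str]:
    """List what a submission still lacks; an empty list means it looks complete."""
    problems = [
        f"missing {description}"
        for key, description in REQUIRED_FIELDS.items()
        # A seed of 0 is valid, so only None and empty strings count as missing.
        if submission.get(key) in (None, "")
    ]
    reported = submission.get("metrics") or {}
    for metric in declared_metrics:
        if metric not in reported:
            problems.append(f"no row for metric {metric}")
    return problems


if __name__ == "__main__":
    draft = {
        "checkpoint": "https://example.com/model.ckpt",
        "repro_script": "eval.py",
        "commit": "abc1234",
        "seed": 0,
        "environment": "Python 3.11, requirements.txt",
        "metrics": {"mIoU": 50.0},
        "contact": "",
    }
    print(missing_items(draft, declared_metrics=("mIoU",)))
```

Passing the declared metrics in explicitly keeps the check independent of any one dataset's metric names.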
