Semantic Segmentation Benchmark

Parsing Every Pixel of a Scene

ADE20K is the gold standard for semantic segmentation: 27,574 images densely labeled with 150 object categories, created at MIT CSAIL. It is the benchmark that pushed scene understanding from object recognition to full pixel-level parsing.

Dataset Stats

27,574
Total Images (train + val)
150
Semantic Categories
62.9%
SOTA mIoU (InternImage-H)
23
Models Tracked

Inside the Dataset

Real images from ADE20K with their dense semantic annotations. Every pixel is labeled with one of 150 categories. The dataset spans indoor scenes (kitchens, bedrooms), outdoor scenes (streets, landscapes), and everything in between.

ADE20K dataset samples: original images (top row) with corresponding dense semantic segmentation masks (bottom row) showing pixel-level annotations across 150 categories

Top row: original images from ADE20K validation set. Bottom row: corresponding semantic segmentation masks where each color represents a different object category. Images sourced from the HuggingFace ADE20K dataset.

What is ADE20K?

ADE20K (MIT Scene Parsing Benchmark) is a large-scale dataset for semantic segmentation created by Bolei Zhou, Hang Zhao, and Antonio Torralba at MIT CSAIL. Unlike simpler benchmarks that handle a handful of classes, ADE20K demands that models understand 150 diverse object categories spanning stuff (sky, wall, floor) and things (person, car, chair).

Every pixel in every image is labeled. This makes ADE20K significantly harder than PASCAL VOC (21 classes) and more diverse than Cityscapes (focused only on driving scenes). It has become the standard benchmark for evaluating general-purpose semantic segmentation models.

The dataset covers both indoor and outdoor scenes with hierarchical annotations: objects, parts of objects, and even parts of parts. Images were sourced from the SUN and Places databases, providing exceptional scene diversity with an average of 19.5 object instances and 10.5 distinct classes per image.

Training Set
25,574 images

Dense pixel-level annotations across 150 categories

Validation Set
2,000 images

Used for model evaluation and leaderboard ranking

Full Ontology
3,169 object classes

712,812 labeled objects with parts, materials, and scene types

SOTA Progress: 2016 to 2025

From FCN's 40.4% mIoU in 2015 to InternImage-H's 62.9% in 2023, a 56% relative improvement in eight years. The CNN-to-Transformer transition in 2020–2021 drove the biggest single leap in performance.
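The relative improvement quoted above is plain arithmetic on the two endpoint scores:

```python
# Endpoint scores from the timeline above (mIoU, percent).
fcn_2015, internimage_2023 = 40.4, 62.9

# Relative gain = (new - old) / old
relative_gain = (internimage_2023 - fcn_2015) / fcn_2015
print(f"{relative_gain:.1%}")  # → 55.7%, i.e. roughly a 56% relative gain
```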

ADE20K SOTA progress timeline from 2016 to 2025, showing mIoU improvement from FCN (40.4%) through PSPNet, DeepLabv3+, SegFormer, Mask2Former to InternImage-H (62.9%)

Accuracy vs. Model Size

Bigger models generally score higher, but efficiency matters. SegFormer-B5 achieves 51.8 mIoU with just 85M parameters, while InternImage-H needs 1.08B for 62.9. The efficiency frontier shows the best mIoU achievable at each parameter count.

Scatter plot of ADE20K mIoU vs model parameters, showing efficiency frontier from SegFormer-B2 (28M, 46.5%) to InternImage-H (1.08B, 62.9%)
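The efficiency frontier in the scatter plot can be computed directly from (parameters, mIoU) pairs: a model is on the frontier if no other model matches or beats it on both axes. A minimal sketch using a handful of points from the leaderboard below:

```python
# (name, params in millions, mIoU) — values taken from this page's leaderboard.
models = [
    ("SegFormer-B5", 85, 51.8),
    ("Mask2Former (Swin-L)", 216, 57.7),
    ("Swin-L (UPerNet)", 234, 53.5),
    ("EVA-02-L", 304, 61.5),
    ("ViT-CoMer-L", 383, 62.1),
    ("InternImage-H", 1080, 62.9),
]

def pareto_frontier(points):
    """Keep models not dominated by any model with <= params and > mIoU."""
    frontier = []
    for name, params, miou in points:
        dominated = any(p <= params and m > miou
                        for n, p, m in points if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

Swin-L (UPerNet) drops out of this subset: Mask2Former (Swin-L) is both smaller and more accurate.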

ADE20K Leaderboard

23 models ranked by mIoU on the validation set. Single-scale evaluation unless noted.

Rank | Model | Organization | mIoU | Type | Params | Year
1 | InternImage-H | Shanghai AI Lab | 62.9 | Hybrid | 1.08B | 2023
2 | BEiT-3 (ViT-g) | Microsoft Research | 62.8 | Transformer | 1.9B | 2023
3 | ViT-CoMer-L | Fudan University | 62.1 | Hybrid | 383M | 2024
4 | EVA-02-L | BAAI | 61.5 | Transformer | 304M | 2023
5 | FD-SwinV2-G | Microsoft Research | 61.4 | Transformer | 3B | 2023
6 | DINOv2-g + Mask2Former | Meta AI | 61.4 | Transformer | 1.1B | 2023
7 | OneFormer (InternImage-H) | SHI Labs | 60.8 | Hybrid | 1.1B | 2023
8 | ViT-Adapter-L (BEiT) | Shanghai AI Lab | 60.5 | Transformer | 571M | 2022
9 | SERNet-Former v2 | Research | 59.4 | Transformer | n/a | 2024
10 | OneFormer (DiNAT-L) | SHI Labs | 58.4 | Transformer | 225M | 2022
11 | SeMask-L (Mask2Former) | KAIST | 58.2 | Transformer | n/a | 2022
12 | Mask2Former (Swin-L) | Meta AI | 57.7 | Transformer | 216M | 2022
13 | BEiT-L (UPerNet) | Microsoft Research | 57.0 | Transformer | 441M | 2022
14 | DeiT-L | Meta AI | 55.6 | Transformer | n/a | 2021
15 | ConvNeXt-XL (UPerNet) | Meta AI | 54.0 | CNN | 391M | 2022
16 | Seg-L-Mask/16 | INRIA | 53.6 | Transformer | n/a | 2021
17 | Swin-L (UPerNet) | Microsoft Research | 53.5 | Transformer | 234M | 2021
18 | SegFormer-B5 | NVIDIA | 51.8 | Transformer | 85M | 2021
19 | SETR-MLA (ViT-L) | Fudan University | 47.7 | Transformer | 310M | 2020
20 | HRNetV2-W48 | Microsoft Research | 46.2 | CNN | 66M | 2019
21 | DeepLabv3+ (Xception-71) | Google | 45.7 | CNN | 62M | 2018
22 | PSPNet (ResNet-101) | CUHK / SenseTime | 43.3 | CNN | 65M | 2017
23 | FCN-8s (VGG-16) | UC Berkeley | 40.4 | CNN | 134M | 2015

Class Distribution

ADE20K follows a Zipf's law distribution: "wall" covers 14.2% of all pixels while rare classes like "chandelier" appear in less than 0.1%. This long tail makes mIoU especially challenging — every rare class counts equally in the average.

Bar chart showing ADE20K class frequency distribution: wall (14.2%), building (8.1%), sky (7.8%) dominating, with a long tail of 120+ rare classes
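Per-class pixel frequencies like those in the bar chart come straight from counting class IDs in the annotation masks. A minimal numpy sketch on a toy mask (real ADE20K masks use IDs 1–150; the tiny mask here just keeps the arithmetic visible):

```python
import numpy as np

# Toy annotation standing in for an ADE20K mask: one class ID per pixel.
mask = np.array([
    [1, 1, 1, 2],
    [1, 1, 3, 2],
    [1, 1, 3, 3],
])

counts = np.bincount(mask.ravel(), minlength=151)  # per-class pixel counts
freq = counts / mask.size                          # fraction of all pixels
ranked = sorted(((f, c) for c, f in enumerate(freq) if f > 0), reverse=True)
for f, c in ranked:
    print(f"class {c}: {f:.1%}")
```

Summing these counts over the full training set reproduces the long-tail distribution shown above.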

Stuff vs. Things

ADE20K uniquely combines stuff (amorphous regions like sky, wall, grass) and things (countable objects like person, car, chair). Stuff classes dominate pixel coverage (61%) despite being only 35 of 150 categories, while 115 thing categories share just 32% of pixels.

This split is critical for architecture design: stuff benefits from large receptive fields (atrous convolutions, pyramid pooling), while things benefit from instance-aware features (Mask2Former, masked attention).

Pie charts showing ADE20K stuff vs things distribution: stuff covers 61% of pixels with 35 classes, things cover 32% with 115 classes
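The stuff/things pixel shares above are computed by partitioning class IDs into the two groups and counting mask pixels in each. A sketch with an illustrative split (the real ADE20K stuff and thing ID lists are much longer; IDs here are placeholders):

```python
import numpy as np

# Illustrative split: pretend IDs 1 (wall) and 2 (sky) are "stuff"
# and everything else is "things".
STUFF_IDS = {1, 2}

mask = np.array([
    [2, 2, 2, 2],   # sky
    [1, 1, 5, 1],   # wall with one "person" (5) pixel
    [1, 5, 5, 9],   # more things: person, "chair" (9)
])

is_stuff = np.isin(mask, list(STUFF_IDS))
stuff_share = is_stuff.mean()
print(f"stuff: {stuff_share:.0%}, things: {1 - stuff_share:.0%}")
```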

From CNNs to Transformers

ADE20K has tracked the entire evolution of segmentation architectures. FCN (2015) introduced end-to-end pixel classification. PSPNet added pyramid pooling, and DeepLabv3+ paired atrous spatial pyramid pooling with an encoder-decoder.

Then Transformers arrived. Swin Transformer + UPerNet broke CNN dominance in 2021. Mask2Former unified instance and semantic segmentation. Now InternImage pushes the limit with deformable convolutions at scale, reaching 62.9 mIoU.

Why 150 Classes is Hard

ADE20K's 150-class setup creates challenges that simpler benchmarks miss:

  • Long-tail distribution: "wall" and "sky" dominate, while "chandelier" and "van" appear rarely. Models must handle extreme class imbalance.
  • Stuff vs. things: Amorphous regions (sky, grass) require different reasoning than countable objects (chair, car).
  • Scale variation: A single image may contain a tiny "lamp" and a massive "building," demanding robust multi-scale reasoning.
  • Context dependency: A "shelf" in a kitchen vs. a "shelf" in a library look completely different. Models need scene-level understanding.

How Scene Parsing Works

From raw pixels to a complete understanding of every object in a scene. The typical pipeline for ADE20K evaluation.

Step 1: Encoder

Feature Extraction

A backbone network (Swin Transformer, ConvNeXt, InternImage) extracts multi-scale features from the input image at 1/4, 1/8, 1/16, and 1/32 resolution.

Step 2: Decoder

Pixel Classification

A decoder head (UPerNet, Mask2Former, or SegFormer's MLP decoder) fuses multi-scale features and predicts a class label for every pixel in the image.

Step 3: Evaluation

mIoU Scoring

Mean Intersection-over-Union (mIoU) measures the overlap between predicted and ground-truth masks, averaged across all 150 categories. Higher is better.
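The evaluation step above is typically implemented by accumulating a 150×150 confusion matrix over all validation images, then deriving per-class IoU from its diagonal. A minimal numpy sketch; the 255 ignore index is an assumption based on common ADE20K loader conventions, so check your own pipeline:

```python
import numpy as np

NUM_CLASSES = 150

def update_confusion(conf, pred, gt):
    """Accumulate a confusion matrix: rows = ground truth, cols = prediction.
    Pixels labeled 255 are treated as 'ignore' and excluded."""
    valid = gt != 255
    idx = gt[valid].astype(np.int64) * NUM_CLASSES + pred[valid].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES**2).reshape(NUM_CLASSES, NUM_CLASSES)
    return conf

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return np.nanmean(iou)

# Tiny worked example: two classes present, one ignored pixel.
conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
gt   = np.array([[0, 0, 1], [1, 1, 255]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
conf = update_confusion(conf, pred, gt)
print(mean_iou(conf))  # IoU(0) = 1/2, IoU(1) = 3/4 → mIoU = 0.625
```

In a real run, `update_confusion` is called once per validation image before the final `mean_iou`.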

Semantic Class Color Palette

ADE20K color palette showing 60 of 150 semantic classes with their assigned colors used in segmentation masks

ADE20K annotates 150 semantic classes for evaluation, drawn from a full ontology of 3,169 categories. Each class has an assigned RGB color for segmentation masks. The top 20 most frequent classes (shown right) account for the majority of labeled pixels.

wall
building
sky
floor
tree
ceiling
road
grass
sidewalk
person
earth
door
table
mountain
plant
curtain
chair
car
water
painting

Understanding the Metrics

mIoU (Primary)

Mean Intersection-over-Union computes, for each class, the number of correctly predicted pixels (the intersection) divided by the union of predicted and ground-truth pixels, then averages the result across all 150 classes.

mIoU = (1/150) * Σ( TP_i / (TP_i + FP_i + FN_i) )

Because it is class-averaged, rare categories are weighted equally with common ones. This makes mIoU a demanding, imbalance-aware metric.

Pixel Accuracy (Secondary)

Pixel accuracy measures the fraction of pixels correctly classified. While intuitive, it's biased toward dominant classes.

Pixel Acc = Correct Pixels / Total Pixels

A model predicting only "wall" and "sky" could achieve high pixel accuracy while failing on 148 other categories. That's why mIoU is preferred.
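A toy example makes the gap between the two metrics concrete. Consider 100 pixels that are 95% "wall" and 5% "chandelier", and a degenerate model that predicts "wall" everywhere:

```python
import numpy as np

# Ground truth: 95 "wall" (class 0) pixels, 5 "chandelier" (class 1) pixels.
gt = np.array([0] * 95 + [1] * 5)
pred = np.zeros(100, dtype=int)     # degenerate model: "wall" everywhere

pixel_acc = (pred == gt).mean()     # 0.95 — looks great

# Per-class IoU = TP / (TP + FP + FN)
iou_wall = 95 / (95 + 5 + 0)        # 5 chandelier pixels mislabeled as wall
iou_chand = 0 / (0 + 0 + 5)         # chandelier never predicted
miou = (iou_wall + iou_chand) / 2   # 0.475 — exposes the failure
print(pixel_acc, miou)
```

High pixel accuracy, low mIoU: exactly the failure mode described above.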

Key Papers

Essential reading for understanding ADE20K and its SOTA methods.

Semantic Understanding of Scenes Through the ADE20K Dataset
Zhou, Zhao, Puig, Xiao, Fidler, Barriuso, Torralba | IJCV 2019 | 5,000+ citations
Original dataset paper
Scene Parsing through ADE20K Dataset
Zhou, Zhao, Puig, Fidler, Barriuso, Torralba | CVPR 2017 | 3,200+ citations
Conference version
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Wang et al. | CVPR 2023
Current SOTA (62.9 mIoU)
Mask2Former: Masked-Attention Mask Transformer for Universal Image Segmentation
Cheng et al. | CVPR 2022
Unified segmentation framework
OneFormer: One Transformer to Rule Universal Image Segmentation
Jain et al. | CVPR 2023
Multi-task segmentation
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Xie et al. | NeurIPS 2021
Efficient transformer baseline
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Liu et al. | ICCV 2021 Best Paper | 15,000+ citations
Shifted window attention
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wang et al. | CVPR 2023
BEiT-3 (#2 on ADE20K)

Code & Implementations

Open-source repositories for training and evaluating models on ADE20K.

ADE20K vs. Other Segmentation Benchmarks

Benchmark | Images | Classes | Domain | Year
ADE20K | 27,574 | 150 | Indoor + outdoor scenes | 2016
COCO-Stuff | 164,000 | 171 | General scenes | 2018
Cityscapes | 25,000 | 30 | Urban driving only | 2016
Mapillary Vistas | 25,000 | 66 | Street-level imagery | 2017
PASCAL VOC 2012 | 11,530 | 21 | General objects | 2012
PASCAL Context | 10,103 | 459 | Whole-scene labeling | 2014

Access the Dataset

Track More Benchmarks

ADE20K is one of many benchmarks we track. Explore our full catalog of computer vision, NLP, and reasoning benchmarks with live leaderboards.