Semantic Segmentation Benchmark

Parsing Every Pixel of a Scene

ADE20K is the gold standard for semantic segmentation: 27,574 images densely labeled with 150 object categories, created at MIT CSAIL. It is the benchmark that pushed scene understanding from object recognition to full pixel-level parsing.

Dataset Stats

27,574
Total Images (train + val)
150
Semantic Categories
62.9%
SOTA mIoU (InternImage-H)
23
Models Tracked

Inside the Dataset

Real images from ADE20K with their dense semantic annotations. Every pixel is labeled with one of 150 categories. The dataset spans indoor scenes (kitchens, bedrooms), outdoor scenes (streets, landscapes), and everything in between.

ADE20K dataset samples: original images (top row) with corresponding dense semantic segmentation masks (bottom row) showing pixel-level annotations across 150 categories

Top row: original images from ADE20K validation set. Bottom row: corresponding semantic segmentation masks where each color represents a different object category. Images sourced from the HuggingFace ADE20K dataset.

What is ADE20K?

ADE20K (MIT Scene Parsing Benchmark) is a large-scale dataset for semantic segmentation created by Bolei Zhou, Hang Zhao, and Antonio Torralba at MIT CSAIL. Unlike simpler benchmarks that handle a handful of classes, ADE20K demands that models understand 150 diverse object categories spanning stuff (sky, wall, floor) and things (person, car, chair).

Every pixel in every image is labeled. This makes ADE20K significantly harder than PASCAL VOC (21 classes) and more diverse than Cityscapes (focused only on driving scenes). It has become the standard benchmark for evaluating general-purpose semantic segmentation models.

The dataset covers both indoor and outdoor scenes with hierarchical annotations: objects, parts of objects, and even parts of parts. Images were sourced from the SUN and Places databases, providing exceptional scene diversity with an average of 19.5 object instances and 10.5 distinct classes per image.

Training Set
25,574 images

Dense pixel-level annotations across 150 categories

Validation Set
2,000 images

Used for model evaluation and leaderboard ranking

Full Ontology
3,169 object classes

712,812 labeled objects with parts, materials, and scene types

SOTA Progress: 2016 to 2025

From FCN's 40.4% mIoU in 2015 to InternImage-H's 62.9% in 2023, a 56% relative improvement in eight years. The CNN-to-Transformer transition in 2020–2021 drove the biggest single leap in performance.
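The relative improvement quoted above is plain arithmetic on the two endpoint scores:

```python
# Endpoint scores from the timeline above (mIoU, percent).
fcn_2015, internimage_2023 = 40.4, 62.9

# Relative gain = (new - old) / old
relative_gain = (internimage_2023 - fcn_2015) / fcn_2015
print(f"{relative_gain:.1%}")  # → 55.7%, i.e. roughly a 56% relative gain
```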

ADE20K SOTA progress timeline from 2016 to 2025, showing mIoU improvement from FCN (40.4%) through PSPNet, DeepLabv3+, SegFormer, Mask2Former to InternImage-H (62.9%)

Accuracy vs. Model Size

Bigger models generally score higher, but efficiency matters. SegFormer-B5 achieves 51.8 mIoU with just 85M parameters, while InternImage-H needs 1.08B for 62.9. The efficiency frontier shows the best mIoU achievable at each parameter count.

Scatter plot of ADE20K mIoU vs model parameters, showing efficiency frontier from SegFormer-B2 (28M, 46.5%) to InternImage-H (1.08B, 62.9%)
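The efficiency frontier in the scatter plot can be computed directly from (parameters, mIoU) pairs: a model is on the frontier if no other model matches or beats it on both axes. A minimal sketch using a handful of points from the leaderboard below:

```python
# (name, params in millions, mIoU) — values taken from this page's leaderboard.
models = [
    ("SegFormer-B5", 85, 51.8),
    ("Mask2Former (Swin-L)", 216, 57.7),
    ("Swin-L (UPerNet)", 234, 53.5),
    ("EVA-02-L", 304, 61.5),
    ("ViT-CoMer-L", 383, 62.1),
    ("InternImage-H", 1080, 62.9),
]

def pareto_frontier(points):
    """Keep models not dominated by any model with <= params and > mIoU."""
    frontier = []
    for name, params, miou in points:
        dominated = any(p <= params and m > miou
                        for n, p, m in points if n != name)
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(models))
```

Swin-L (UPerNet) drops out of this subset: Mask2Former (Swin-L) is both smaller and more accurate.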

ADE20K Leaderboard

23 models ranked by mIoU on the validation set. Single-scale evaluation unless noted.

Rank | Model | Organization | mIoU | Type | Params | Year
1 | InternImage-H | Shanghai AI Lab | 62.9 | Hybrid | 1.08B | 2023
2 | BEiT-3 (ViT-g) | Microsoft Research | 62.8 | Transformer | 1.9B | 2023
3 | ViT-CoMer-L | Fudan University | 62.1 | Hybrid | 383M | 2024
4 | EVA-02-L | BAAI | 61.5 | Transformer | 304M | 2023
5 | FD-SwinV2-G | Microsoft Research | 61.4 | Transformer | 3B | 2023
6 | DINOv2-g + Mask2Former | Meta AI | 61.4 | Transformer | 1.1B | 2023
7 | OneFormer (InternImage-H) | SHI Labs | 60.8 | Hybrid | 1.1B | 2023
8 | ViT-Adapter-L (BEiT) | Shanghai AI Lab | 60.5 | Transformer | 571M | 2022
9 | SERNet-Former v2 | Research | 59.4 | Transformer | n/a | 2024
10 | OneFormer (DiNAT-L) | SHI Labs | 58.4 | Transformer | 225M | 2022
11 | SeMask-L (Mask2Former) | KAIST | 58.2 | Transformer | n/a | 2022
12 | Mask2Former (Swin-L) | Meta AI | 57.7 | Transformer | 216M | 2022
13 | BEiT-L (UPerNet) | Microsoft Research | 57.0 | Transformer | 441M | 2022
14 | DeiT-L | Meta AI | 55.6 | Transformer | n/a | 2021
15 | ConvNeXt-XL (UPerNet) | Meta AI | 54.0 | CNN | 391M | 2022
16 | Seg-L-Mask/16 | INRIA | 53.6 | Transformer | n/a | 2021
17 | Swin-L (UPerNet) | Microsoft Research | 53.5 | Transformer | 234M | 2021
18 | SegFormer-B5 | NVIDIA | 51.8 | Transformer | 85M | 2021
19 | SETR-MLA (ViT-L) | Fudan University | 47.7 | Transformer | 310M | 2020
20 | HRNetV2-W48 | Microsoft Research | 46.2 | CNN | 66M | 2019
21 | DeepLabv3+ (Xception-71) | Google | 45.7 | CNN | 62M | 2018
22 | PSPNet (ResNet-101) | CUHK / SenseTime | 43.3 | CNN | 65M | 2017
23 | FCN-8s (VGG-16) | UC Berkeley | 40.4 | CNN | 134M | 2015

Class Distribution

ADE20K follows a Zipf's law distribution: "wall" covers 14.2% of all pixels while rare classes like "chandelier" appear in less than 0.1%. This long tail makes mIoU especially challenging — every rare class counts equally in the average.

Bar chart showing ADE20K class frequency distribution: wall (14.2%), building (8.1%), sky (7.8%) dominating, with a long tail of 120+ rare classes
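Per-class pixel frequencies like those in the bar chart come straight from counting class IDs in the annotation masks. A minimal numpy sketch on a toy mask (real ADE20K masks use IDs 1–150; the tiny mask here just keeps the arithmetic visible):

```python
import numpy as np

# Toy annotation standing in for an ADE20K mask: one class ID per pixel.
mask = np.array([
    [1, 1, 1, 2],
    [1, 1, 3, 2],
    [1, 1, 3, 3],
])

counts = np.bincount(mask.ravel(), minlength=151)  # per-class pixel counts
freq = counts / mask.size                          # fraction of all pixels
ranked = sorted(((f, c) for c, f in enumerate(freq) if f > 0), reverse=True)
for f, c in ranked:
    print(f"class {c}: {f:.1%}")
```

Summing these counts over the full training set reproduces the long-tail distribution shown above.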

Stuff vs. Things

ADE20K uniquely combines stuff (amorphous regions like sky, wall, grass) and things (countable objects like person, car, chair). Stuff classes dominate pixel coverage (61%) despite being only 35 of 150 categories, while 115 thing categories share just 32% of pixels.

This split is critical for architecture design: stuff benefits from large receptive fields (atrous convolutions, pyramid pooling), while things benefit from instance-aware features (Mask2Former, masked attention).

Pie charts showing ADE20K stuff vs things distribution: stuff covers 61% of pixels with 35 classes, things cover 32% with 115 classes
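The stuff/things pixel shares above are computed by partitioning class IDs into the two groups and counting mask pixels in each. A sketch with an illustrative split (the real ADE20K stuff and thing ID lists are much longer; IDs here are placeholders):

```python
import numpy as np

# Illustrative split: pretend IDs 1 (wall) and 2 (sky) are "stuff"
# and everything else is "things".
STUFF_IDS = {1, 2}

mask = np.array([
    [2, 2, 2, 2],   # sky
    [1, 1, 5, 1],   # wall with one "person" (5) pixel
    [1, 5, 5, 9],   # more things: person, "chair" (9)
])

is_stuff = np.isin(mask, list(STUFF_IDS))
stuff_share = is_stuff.mean()
print(f"stuff: {stuff_share:.0%}, things: {1 - stuff_share:.0%}")
```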

From CNNs to Transformers

ADE20K has tracked the entire evolution of segmentation architectures. FCN (2015) introduced end-to-end pixel classification. PSPNet added pyramid pooling, and DeepLabv3+ paired atrous spatial pyramid pooling with an encoder-decoder.

Then Transformers arrived. Swin Transformer + UPerNet broke CNN dominance in 2021. Mask2Former unified instance and semantic segmentation. Now InternImage pushes the limit with deformable convolutions at scale, reaching 62.9 mIoU.

Why 150 Classes is Hard

ADE20K's 150-class setup creates challenges that simpler benchmarks miss:

  • Long-tail distribution: "wall" and "sky" dominate, while "chandelier" and "van" appear rarely. Models must handle extreme class imbalance.
  • Stuff vs. things: Amorphous regions (sky, grass) require different reasoning than countable objects (chair, car).
  • Scale variation: A single image may contain a tiny "lamp" and a massive "building," demanding robust multi-scale reasoning.
  • Context dependency: A "shelf" in a kitchen vs. a "shelf" in a library look completely different. Models need scene-level understanding.

How Scene Parsing Works

From raw pixels to a complete understanding of every object in a scene. The typical pipeline for ADE20K evaluation.

Step 1: Encoder

Feature Extraction

A backbone network (Swin Transformer, ConvNeXt, InternImage) extracts multi-scale features from the input image at 1/4, 1/8, 1/16, and 1/32 resolution.

Step 2: Decoder

Pixel Classification

A decoder head (UPerNet, Mask2Former, or SegFormer's MLP decoder) fuses multi-scale features and predicts a class label for every pixel in the image.

Step 3: Evaluation

mIoU Scoring

Mean Intersection-over-Union (mIoU) measures the overlap between predicted and ground-truth masks, averaged across all 150 categories. Higher is better.
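The evaluation step above is typically implemented by accumulating a 150×150 confusion matrix over all validation images, then deriving per-class IoU from its diagonal. A minimal numpy sketch; the 255 ignore index is an assumption based on common ADE20K loader conventions, so check your own pipeline:

```python
import numpy as np

NUM_CLASSES = 150

def update_confusion(conf, pred, gt):
    """Accumulate a confusion matrix: rows = ground truth, cols = prediction.
    Pixels labeled 255 are treated as 'ignore' and excluded."""
    valid = gt != 255
    idx = gt[valid].astype(np.int64) * NUM_CLASSES + pred[valid].astype(np.int64)
    conf += np.bincount(idx, minlength=NUM_CLASSES**2).reshape(NUM_CLASSES, NUM_CLASSES)
    return conf

def mean_iou(conf):
    """Per-class IoU = TP / (TP + FP + FN), averaged over classes that appear."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    denom = tp + fp + fn
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return np.nanmean(iou)

# Tiny worked example: two classes present, one ignored pixel.
conf = np.zeros((NUM_CLASSES, NUM_CLASSES), dtype=np.int64)
gt   = np.array([[0, 0, 1], [1, 1, 255]])
pred = np.array([[0, 1, 1], [1, 1, 0]])
conf = update_confusion(conf, pred, gt)
print(mean_iou(conf))  # IoU(0) = 1/2, IoU(1) = 3/4 → mIoU = 0.625
```

In a real run, `update_confusion` is called once per validation image before the final `mean_iou`.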

Semantic Class Color Palette

ADE20K color palette showing 60 of 150 semantic classes with their assigned colors used in segmentation masks

ADE20K annotates 150 semantic classes for evaluation, drawn from a full ontology of 3,169 categories. Each class has an assigned RGB color for segmentation masks. The top 20 most frequent classes (shown right) account for the majority of labeled pixels.

wall
building
sky
floor
tree
ceiling
road
grass
sidewalk
person
earth
door
table
mountain
plant
curtain
chair
car
water
painting

Understanding the Metrics

mIoU (Primary)

Mean Intersection-over-Union computes, for each class, the number of correctly predicted pixels (the intersection) divided by the union of predicted and ground-truth pixels, then averages the result across all 150 classes.

mIoU = (1/150) * Σ( TP_i / (TP_i + FP_i + FN_i) )

Because it is class-averaged, rare categories are weighted equally with common ones. This makes mIoU a demanding, imbalance-aware metric.

Pixel Accuracy (Secondary)

Pixel accuracy measures the fraction of pixels correctly classified. While intuitive, it's biased toward dominant classes.

Pixel Acc = Correct Pixels / Total Pixels

A model predicting only "wall" and "sky" could achieve high pixel accuracy while failing on 148 other categories. That's why mIoU is preferred.
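A toy example makes the gap between the two metrics concrete. Consider 100 pixels that are 95% "wall" and 5% "chandelier", and a degenerate model that predicts "wall" everywhere:

```python
import numpy as np

# Ground truth: 95 "wall" (class 0) pixels, 5 "chandelier" (class 1) pixels.
gt = np.array([0] * 95 + [1] * 5)
pred = np.zeros(100, dtype=int)     # degenerate model: "wall" everywhere

pixel_acc = (pred == gt).mean()     # 0.95 — looks great

# Per-class IoU = TP / (TP + FP + FN)
iou_wall = 95 / (95 + 5 + 0)        # 5 chandelier pixels mislabeled as wall
iou_chand = 0 / (0 + 0 + 5)         # chandelier never predicted
miou = (iou_wall + iou_chand) / 2   # 0.475 — exposes the failure
print(pixel_acc, miou)
```

High pixel accuracy, low mIoU: exactly the failure mode described above.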

Key Papers

Essential reading for understanding ADE20K and its SOTA methods.

Semantic Understanding of Scenes Through the ADE20K Dataset
Zhou, Zhao, Puig, Xiao, Fidler, Barriuso, Torralba | IJCV 2019 | 5,000+ citations
Original dataset paper
Scene Parsing through ADE20K Dataset
Zhou, Zhao, Puig, Fidler, Barriuso, Torralba | CVPR 2017 | 3,200+ citations
Conference version
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
Wang et al. | CVPR 2023
Current SOTA (62.9 mIoU)
Mask2Former: Masked-Attention Mask Transformer for Universal Image Segmentation
Cheng et al. | CVPR 2022
Unified segmentation framework
OneFormer: One Transformer to Rule Universal Image Segmentation
Jain et al. | CVPR 2023
Multi-task segmentation
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers
Xie et al. | NeurIPS 2021
Efficient transformer baseline
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Liu et al. | ICCV 2021 Best Paper | 15,000+ citations
Shifted window attention
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
Wang et al. | CVPR 2023
BEiT-3 (#2 on ADE20K)

Code & Implementations

Open-source repositories for training and evaluating models on ADE20K.

ADE20K vs. Other Segmentation Benchmarks

Benchmark | Images | Classes | Domain | Year
ADE20K | 27,574 | 150 | Indoor + outdoor scenes | 2016
COCO-Stuff | 164,000 | 171 | General scenes | 2018
Cityscapes | 25,000 | 30 | Urban driving only | 2016
Mapillary Vistas | 25,000 | 66 | Street-level imagery | 2017
PASCAL VOC 2012 | 11,530 | 21 | General objects | 2012
PASCAL Context | 10,103 | 459 | Whole-scene labeling | 2014

Access the Dataset

Track More Benchmarks

ADE20K is one of many benchmarks we track. Explore our full catalog of computer vision, NLP, and reasoning benchmarks with live leaderboards.