Parsing Every Pixel of a Scene
ADE20K is the gold standard for semantic segmentation. 27,574 images densely labeled with 150 object categories, created by MIT CSAIL. The benchmark that pushed scene understanding from object recognition to full pixel-level parsing.
Inside the Dataset
Real images from ADE20K with their dense semantic annotations. Every pixel is labeled with one of 150 categories. The dataset spans indoor scenes (kitchens, bedrooms), outdoor scenes (streets, landscapes), and everything in between.

Top row: original images from ADE20K validation set. Bottom row: corresponding semantic segmentation masks where each color represents a different object category. Images sourced from the HuggingFace ADE20K dataset.
What is ADE20K?
ADE20K (MIT Scene Parsing Benchmark) is a large-scale dataset for semantic segmentation created by Bolei Zhou, Hang Zhao, and Antonio Torralba at MIT CSAIL. Unlike simpler benchmarks that handle a handful of classes, ADE20K demands that models understand 150 diverse object categories spanning stuff (sky, wall, floor) and things (person, car, chair).
Every pixel in every image is labeled. This makes ADE20K significantly harder than PASCAL VOC (21 classes) and more diverse than Cityscapes (focused only on driving scenes). It has become the standard benchmark for evaluating general-purpose semantic segmentation models.
The dataset covers both indoor and outdoor scenes with hierarchical annotations: objects, parts of objects, and even parts of parts. Images were sourced from the SUN and Places databases, providing exceptional scene diversity with an average of 19.5 object instances and 10.5 distinct classes per image.
- Dense pixel-level annotations across 150 categories
- Used for model evaluation and leaderboard ranking
- 712,812 labeled objects with parts, materials, and scene types
SOTA Progress: 2016 to 2025
From FCN's 40.4% mIoU in 2015 to InternImage's 62.9% in 2023, a 56% relative improvement over eight years. The CNN-to-Transformer transition in 2020–2021 drove the biggest single leap in performance.

Accuracy vs. Model Size
Bigger models generally score higher, but efficiency matters. SegFormer-B5 achieves 51.8 mIoU with just 85M parameters, while InternImage-H needs 1.08B for 62.9. The efficiency frontier shows the best mIoU achievable at each parameter count.

ADE20K Leaderboard
23 models ranked by mIoU on the validation set. Single-scale evaluation unless noted.
| # | Model | Organization | mIoU | Type | Params | Year |
|---|---|---|---|---|---|---|
| 1 | InternImage-H | Shanghai AI Lab | 62.9 | Hybrid | 1.08B | 2023 |
| 2 | BEiT-3 (ViT-g) | Microsoft Research | 62.8 | Transformer | 1.9B | 2023 |
| 3 | ViT-CoMer-L | Fudan University | 62.1 | Hybrid | 383M | 2024 |
| 4 | EVA-02-L | BAAI | 61.5 | Transformer | 304M | 2023 |
| 5 | FD-SwinV2-G | Microsoft Research | 61.4 | Transformer | 3B | 2023 |
| 6 | DINOv2-g + Mask2Former | Meta AI | 61.4 | Transformer | 1.1B | 2023 |
| 7 | OneFormer (InternImage-H) | SHI Labs | 60.8 | Hybrid | 1.1B | 2023 |
| 8 | ViT-Adapter-L (BEiT) | Shanghai AI Lab | 60.5 | Transformer | 571M | 2022 |
| 9 | SERNet-Former v2 | Research | 59.4 | Transformer | — | 2024 |
| 10 | OneFormer (DiNAT-L) | SHI Labs | 58.4 | Transformer | 225M | 2022 |
| 11 | SeMask-L (Mask2Former) | KAIST | 58.2 | Transformer | — | 2022 |
| 12 | Mask2Former (Swin-L) | Meta AI | 57.7 | Transformer | 216M | 2022 |
| 13 | BEiT-L (UPerNet) | Microsoft Research | 57.0 | Transformer | 441M | 2022 |
| 14 | DeiT-L | Meta AI | 55.6 | Transformer | — | 2021 |
| 15 | ConvNeXt-XL (UPerNet) | Meta AI | 54.0 | CNN | 391M | 2022 |
| 16 | Seg-L-Mask/16 | INRIA | 53.6 | Transformer | — | 2021 |
| 17 | Swin-L (UPerNet) | Microsoft Research | 53.5 | Transformer | 234M | 2021 |
| 18 | SegFormer-B5 | NVIDIA | 51.8 | Transformer | 85M | 2021 |
| 19 | SETR-MLA (ViT-L) | Fudan University | 47.7 | Transformer | 310M | 2020 |
| 20 | HRNetV2-W48 | Microsoft Research | 46.2 | CNN | 66M | 2019 |
| 21 | DeepLabv3+ (Xception-71) | Google | 45.7 | CNN | 62M | 2018 |
| 22 | PSPNet (ResNet-101) | CUHK / SenseTime | 43.3 | CNN | 65M | 2017 |
| 23 | FCN-8s (VGG-16) | UC Berkeley | 40.4 | CNN | 134M | 2015 |
Class Distribution
ADE20K follows a Zipf's-law distribution: "wall" covers 14.2% of all pixels while rare classes like "chandelier" cover less than 0.1%. This long tail makes mIoU especially challenging, since every rare class counts equally in the average.
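Per-class pixel shares like these can be computed from any annotation mask by counting how many pixels carry each class ID. A minimal sketch in pure Python (the 4×4 `mask` is a made-up stand-in for a real full-resolution ADE20K annotation):

```python
from collections import Counter

# Toy 4x4 annotation mask; each integer is a class ID.
# In real ADE20K masks, IDs 1..150 are classes and 0 is "unlabeled".
mask = [
    [1, 1, 1, 2],
    [1, 1, 2, 2],
    [1, 1, 3, 2],
    [1, 1, 3, 2],
]

counts = Counter(pixel for row in mask for pixel in row)
total = sum(counts.values())

# Sort classes by pixel share, mimicking the Zipf-like long tail.
for class_id, n in counts.most_common():
    print(f"class {class_id}: {n / total:.1%} of pixels")
```

Running the same count over the whole training set (and dividing by total pixels) reproduces the skewed distribution described above.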

Stuff vs. Things
ADE20K uniquely combines stuff (amorphous regions like sky, wall, grass) and things (countable objects like person, car, chair). Stuff classes dominate pixel coverage (61%) despite being only 35 of 150 categories, while 115 thing categories share just 32% of pixels.
This split is critical for architecture design: stuff benefits from large receptive fields (atrous convolutions, pyramid pooling), while things benefit from instance-aware features (Mask2Former, masked attention).

From CNNs to Transformers
ADE20K has tracked the entire evolution of segmentation architectures. FCN (2015) introduced end-to-end pixel classification. PSPNet added pyramid pooling. DeepLabv3+ refined it with atrous convolutions.
Then Transformers arrived. Swin Transformer + UPerNet broke CNN dominance in 2021. Mask2Former unified instance and semantic segmentation. Now InternImage pushes the limit with deformable convolutions at scale, reaching 62.9 mIoU.
Why 150 Classes is Hard
ADE20K's 150-class setup creates challenges that simpler benchmarks miss:
- Long-tail distribution: "wall" and "sky" dominate, while "chandelier" and "van" appear rarely. Models must handle extreme class imbalance.
- Stuff vs. things: Amorphous regions (sky, grass) require different reasoning than countable objects (chair, car).
- Scale variation: A single image may contain a tiny "lamp" and a massive "building," demanding robust multi-scale reasoning.
- Context dependency: A "shelf" in a kitchen vs. a "shelf" in a library look completely different. Models need scene-level understanding.
How Scene Parsing Works
From raw pixels to a complete understanding of every object in a scene. The typical pipeline for ADE20K evaluation.
Feature Extraction
A backbone network (Swin Transformer, ConvNeXt, InternImage) extracts multi-scale features from the input image at 1/4, 1/8, 1/16, and 1/32 resolution.
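The 1/4 to 1/32 pyramid determines the spatial grids the decoder later fuses. A quick sanity check of those shapes for a square input (the 512-pixel crop size here is an illustrative choice, not a fixed ADE20K requirement):

```python
def pyramid_shapes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of each backbone stage for a given input resolution."""
    return [(height // s, width // s) for s in strides]

# A 512x512 crop yields 128x128, 64x64, 32x32, and 16x16 feature grids.
print(pyramid_shapes(512, 512))
```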
Pixel Classification
A decoder head (UPerNet, Mask2Former, or SegFormer MLP) fuses multi-scale features and predicts a class label for every pixel in the image.
mIoU Scoring
Mean Intersection-over-Union (mIoU) measures the overlap between predicted and ground-truth masks, averaged across all 150 categories. Higher is better.
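Per-class IoU is the intersection of predicted and ground-truth pixels for that class divided by their union; mIoU averages it over classes. A minimal sketch with flat label lists standing in for full-resolution masks (real ADE20K evaluators accumulate a 150×150 confusion matrix instead):

```python
def mean_iou(pred, gt, num_classes):
    """mIoU over classes present in either prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:               # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)

pred = [0, 0, 1, 1, 2, 2]
gt   = [0, 0, 1, 2, 2, 2]
print(mean_iou(pred, gt, num_classes=3))  # IoUs: 1.0, 0.5, 2/3
```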
Semantic Class Color Palette

ADE20K annotates 150 semantic classes for evaluation, drawn from a full ontology of 3,169 categories. Each class has an assigned RGB color for segmentation masks. The top 20 most frequent classes (shown right) account for the majority of labeled pixels.
Understanding the Metrics
mIoU (Primary)
Mean Intersection-over-Union computes, for each class, the overlap between predicted and ground-truth pixels divided by their union, then averages across all 150 classes.
Because it's class-averaged, rare categories weigh equally with common ones. This makes mIoU a demanding, imbalance-aware metric.
Pixel Accuracy (Secondary)
Pixel accuracy measures the fraction of pixels correctly classified. While intuitive, it's biased toward dominant classes.
A model predicting only "wall" and "sky" could achieve high pixel accuracy while failing on 148 other categories. That's why mIoU is preferred.
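That bias is easy to demonstrate with a toy mask where one "stuff" class dominates. The 10-pixel example below is contrived, but it shows a constant predictor scoring 80% pixel accuracy while its mIoU collapses:

```python
def pixel_accuracy(pred, gt):
    """Fraction of pixels whose predicted class matches the ground truth."""
    return sum(p == g for p, g in zip(pred, gt)) / len(gt)

def mean_iou(pred, gt, num_classes):
    """Class-averaged intersection-over-union."""
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and g == c for p, g in zip(pred, gt))
        union = sum(p == c or g == c for p, g in zip(pred, gt))
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious)

gt = [0] * 8 + [1, 2]    # class 0 ("wall") covers 80% of pixels
pred = [0] * 10          # degenerate model: predicts "wall" everywhere

print(pixel_accuracy(pred, gt))           # 0.8 -- looks decent
print(mean_iou(pred, gt, num_classes=3))  # (0.8 + 0 + 0) / 3, about 0.27
```

Class-averaging punishes the two missed classes with zero IoU each, which is exactly why the leaderboard ranks by mIoU rather than pixel accuracy.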
Key Papers
Essential reading for understanding ADE20K and its SOTA methods.
Code & Implementations
Open-source repositories for training and evaluating models on ADE20K.
- MMSegmentation: comprehensive semantic segmentation toolbox. Supports 50+ models on ADE20K out of the box.
- InternImage: current SOTA on ADE20K (62.9 mIoU). Deformable convolutions at scale.
- Mask2Former: universal segmentation with masked attention. Archived Jan 2025.
- OneFormer: one model for semantic, instance, and panoptic segmentation.
- SegFormer: efficient Mix Transformer design. Great accuracy/speed tradeoff.
- Swin Transformer: shifted window attention. ICCV 2021 Best Paper.
- semantic-segmentation-pytorch: official MIT CSAIL implementation for ADE20K evaluation.
- DINOv2: self-supervised ViT features. Strong ADE20K results with linear probing.
ADE20K vs. Other Segmentation Benchmarks
| Benchmark | Images | Classes | Domain | Year |
|---|---|---|---|---|
| ADE20K | 27,574 | 150 | Indoor + Outdoor scenes | 2016 |
| COCO-Stuff | 164,000 | 171 | General scenes | 2018 |
| Cityscapes | 25,000 | 30 | Urban driving only | 2016 |
| Mapillary Vistas | 25,000 | 66 | Street-level imagery | 2017 |
| PASCAL VOC 2012 | 11,530 | 21 | General objects | 2012 |
| PASCAL Context | 10,103 | 459 | Whole-scene labeling | 2014 |
Access the Dataset
Track More Benchmarks
ADE20K is one of many benchmarks we track. Explore our full catalog of computer vision, NLP, and reasoning benchmarks with live leaderboards.