Every pixel, labelled.
ADE20K is the MIT CSAIL scene-parsing benchmark: 27,574 images densely labelled across 150 object categories, the test bed that dragged segmentation models from object recognition to full pixel-level parsing.
The leaderboard below preserves 23 reported mIoU scores verbatim, with paper and code links where the original authors published them.
ADE20K validation, ranked by mIoU.
| # | Model | Vendor | Architecture | mIoU | Params | Year | Links |
|---|---|---|---|---|---|---|---|
| 01 | InternImage-H | Shanghai AI Lab | Deformable CNN + Mask2Former | 62.9 | 1.08B | 2023 | paper · code |
| 02 | BEiT-3 (ViT-g) | Microsoft Research | ViT-g + UPerNet | 62.8 | 1.9B | 2023 | paper |
| 03 | ViT-CoMer-L | Fudan University | ViT + CNN Bidirectional Fusion | 62.1 | 383M | 2024 | paper · code |
| 04 | EVA-02-L | BAAI | ViT-L + Mask2Former | 61.5 | 304M | 2023 | paper · code |
| 05 | FD-SwinV2-G | Microsoft Research | Swin V2 Giant | 61.4 | 3B | 2023 | paper |
| 06 | DINOv2-g + Mask2Former | Meta AI | ViT-g + Mask2Former | 61.4 | 1.1B | 2023 | paper · code |
| 07 | OneFormer (InternImage-H) | SHI Labs | InternImage-H + OneFormer | 60.8 | 1.1B | 2023 | paper · code |
| 08 | ViT-Adapter-L (BEiT) | Shanghai AI Lab | ViT-L + Adapter + Mask2Former | 60.5 | 571M | 2022 | paper · code |
| 09 | SERNet-Former v2 | Research | Efficient Transformer | 59.4 | — | 2024 | |
| 10 | OneFormer (DiNAT-L) | SHI Labs | DiNAT-L + OneFormer | 58.4 | 225M | 2022 | paper · code |
| 11 | SeMask-L (Mask2Former) | KAIST | Swin-L FaPN + Mask2Former | 58.2 | — | 2022 | paper |
| 12 | Mask2Former (Swin-L) | Meta AI | Swin-L + Masked Attention | 57.7 | 216M | 2022 | paper · code |
| 13 | BEiT-L (UPerNet) | Microsoft Research | ViT-L + UPerNet | 57.0 | 441M | 2022 | paper · code |
| 14 | DeiT-L | Meta AI | ViT-L + DeiT Distillation | 55.6 | — | 2021 | paper |
| 15 | ConvNeXt-XL (UPerNet) | Meta AI | Pure ConvNet + UPerNet | 54.0 | 391M | 2022 | paper · code |
| 16 | Seg-L-Mask/16 | INRIA | ViT-L Segmenter | 53.6 | — | 2021 | paper |
| 17 | Swin-L (UPerNet) | Microsoft Research | Swin Transformer + UPerNet | 53.5 | 234M | 2021 | paper · code |
| 18 | SegFormer-B5 | NVIDIA | Mix Transformer + MLP | 51.8 | 85M | 2021 | paper · code |
| 19 | SETR-MLA (ViT-L) | Fudan University | ViT-L + Multi-Level Aggregation | 47.7 | 310M | 2020 | paper |
| 20 | HRNetV2-W48 | Microsoft Research | High-Resolution Net | 46.2 | 66M | 2019 | paper · code |
| 21 | DeepLabv3+ (Xception-71) | Google | Atrous Separable Convolution | 45.7 | 62M | 2018 | paper |
| 22 | PSPNet (ResNet-101) | CUHK / SenseTime | Pyramid Scene Parsing | 43.3 | 65M | 2017 | paper |
| 23 | FCN-8s (VGG-16) | UC Berkeley | Fully Convolutional Network | 40.4 | 134M | 2015 | paper |
mIoU, averaged across 150 classes.
Mean Intersection-over-Union computes, for each class, the overlap between predicted and ground-truth pixels divided by their union (intersection / union, i.e. TP / (TP + FP + FN)), then averages across every one of the 150 categories. Because it is class-averaged, rare categories weigh equally with common ones — which is what makes ADE20K a demanding, imbalance-aware metric rather than a pixel-accuracy contest.
Pixel accuracy is reported as a secondary metric. A model that predicts only “wall” and “sky” can score very highly on pixel accuracy while failing on the other 148 categories; mIoU makes that trick impossible.
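To make the difference concrete, here is a minimal sketch of both metrics computed from a confusion matrix with NumPy. It is not the official evaluation code: the helper names are ours, and it assumes predictions use the same 1..150 label convention as the ADE20K annotations (with 0 meaning unlabelled and therefore ignored).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=150, ignore_index=0):
    """Build a num_classes x num_classes confusion matrix (rows = ground truth).

    Follows the ADE20K annotation convention: pixel value 0 is 'unlabelled'
    and classes run 1..150, so labels are shifted down by one after masking.
    Assumes the prediction also uses values 1..150.
    """
    mask = gt != ignore_index
    gt, pred = gt[mask] - 1, pred[mask] - 1
    joint = gt * num_classes + pred                    # one flat index per pixel
    return np.bincount(joint, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_and_pixel_acc(conf):
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp                         # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp                         # pixels of the class that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)             # intersection / union per class
    present = conf.sum(axis=1) > 0                     # average only over classes that occur
    return iou[present].mean(), tp.sum() / conf.sum()

# A degenerate model that always predicts class 1 ("wall", say) looks fine on
# pixel accuracy but collapses on mIoU:
gt = np.ones((2, 64, 64), dtype=np.int64)
gt[:, :, :16] = 2                                      # a second class covers 25% of pixels
pred = np.ones_like(gt)
miou, acc = miou_and_pixel_acc(confusion_matrix(pred, gt))
print(f"pixel accuracy {acc:.2f} vs mIoU {miou:.3f}")  # ~0.75 vs ~0.38
```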
Inside the MIT Scene Parsing set.
ADE20K was built by Bolei Zhou, Hang Zhao and Antonio Torralba at MIT CSAIL. Unlike simpler benchmarks, it demands that models understand 150 diverse categories spanning stuff (sky, wall, floor) and things (person, car, chair). Images were sourced from the SUN and Places databases and carry, on average, 19.5 object instances and 10.5 distinct classes each.
Every pixel in every image is labelled. That makes ADE20K significantly harder than PASCAL VOC (21 classes) and more diverse than Cityscapes (driving scenes only), which is why it became the standard test for general-purpose segmentation.
| Split | Count | Notes |
|---|---|---|
| Training | 25,574 | Dense pixel-level annotations, 150 categories |
| Validation | 2,000 | Used for model evaluation + leaderboard ranking |
| Full ontology | 3,169 | 712,812 labelled objects; parts, materials, scene types |
Reported, then reproduced.
Each row above cites the paper and, where the authors released code, the implementation repository. For ADE20K, the canonical evaluation server lives at MIT’s scene-parsing site; community reproductions typically run through OpenMMLab’s mmsegmentation toolbox against the 2,000-image validation split.
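For a back-of-the-envelope check outside mmsegmentation, the same accumulation can be run directly over the validation annotations. The sketch below reuses the confusion_matrix and miou_and_pixel_acc helpers from the metric section; the ADEChallengeData2016 directory layout is the standard scene-parsing release, while the preds/ folder of predicted label maps (matching filenames, values 1..150) is an assumption about where your model wrote its outputs.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# Placeholder paths: 'annotations/validation' follows the scene-parsing release
# layout (ADEChallengeData2016); 'preds/' is wherever your model saved its
# predicted label maps, with matching filenames and values 1..150.
GT_DIR = Path("ADEChallengeData2016/annotations/validation")
PRED_DIR = Path("preds")

conf = np.zeros((150, 150), dtype=np.int64)
for gt_path in sorted(GT_DIR.glob("*.png")):            # the 2,000 validation masks
    gt = np.array(Image.open(gt_path), dtype=np.int64)
    pred = np.array(Image.open(PRED_DIR / gt_path.name), dtype=np.int64)
    conf += confusion_matrix(pred, gt)                   # helper from the sketch above

miou, pixel_acc = miou_and_pixel_acc(conf)
print(f"mIoU {miou:.4f} | pixel accuracy {pixel_acc:.4f}")
```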
Scores without a paper link are treated as claim-only until a reproduction lands in the registry. The wider rule-set is in the Codesota methodology.
Essential reading for the ADE20K line.
Cross-links, sibling leaderboards.
- /ocr — pillar hub
- /ocr/benchmarks — full directory
- /ocr/benchmark/omnidocbench — document parsing SOTA
- /ocr/benchmark/imagenet-1k — image classification SOTA
- /ocr/benchmark/cifar-100 — 100-class recognition SOTA
- /ocr/benchmark/mvtec-ad — anomaly detection SOTA
- /ocr/results — every scored run in the registry