Codesota · Benchmark · ADE20K · Semantic segmentation · 27,574 images · 150 categories · SOTA 62.9% mIoU
§ 00 · ADE20K

Every pixel, labelled.

ADE20K is the MIT CSAIL scene-parsing benchmark: 27,574 images densely labelled across 150 object categories, the test bed that dragged segmentation models from object recognition to full pixel-level parsing.

The leaderboard below preserves 23 reported mIoU scores verbatim, with paper and code links where the original authors published them.

§ 01 · Leaderboard

ADE20K validation, ranked by mIoU.

# · Model · Vendor · Architecture · mIoU · Params · Year · Links
01 · InternImage-H · Shanghai AI Lab · Deformable CNN + Mask2Former · 62.9 · 1.08B · 2023 · paper · code
02 · BEiT-3 (ViT-g) · Microsoft Research · ViT-g + UPerNet · 62.8 · 1.9B · 2023 · paper
03 · ViT-CoMer-L · Fudan University · ViT + CNN Bidirectional Fusion · 62.1 · 383M · 2024 · paper · code
04 · EVA-02-L · BAAI · ViT-L + Mask2Former · 61.5 · 304M · 2023 · paper · code
05 · FD-SwinV2-G · Microsoft Research · Swin V2 Giant · 61.4 · 3B · 2023 · paper
06 · DINOv2-g + Mask2Former · Meta AI · ViT-g + Mask2Former · 61.4 · 1.1B · 2023 · paper · code
07 · OneFormer (InternImage-H) · SHI Labs · InternImage-H + OneFormer · 60.8 · 1.1B · 2023 · paper · code
08 · ViT-Adapter-L (BEiT) · Shanghai AI Lab · ViT-L + Adapter + Mask2Former · 60.5 · 571M · 2022 · paper · code
09 · SERNet-Former v2 · Research · Efficient Transformer · 59.4 · – · 2024 · –
10 · OneFormer (DiNAT-L) · SHI Labs · DiNAT-L + OneFormer · 58.4 · 225M · 2022 · paper · code
11 · SeMask-L (Mask2Former) · KAIST · Swin-L FaPN + Mask2Former · 58.2 · – · 2022 · paper
12 · Mask2Former (Swin-L) · Meta AI · Swin-L + Masked Attention · 57.7 · 216M · 2022 · paper · code
13 · BEiT-L (UPerNet) · Microsoft Research · ViT-L + UPerNet · 57.0 · 441M · 2022 · paper · code
14 · DeiT-L · Meta AI · ViT-L + DeiT Distillation · 55.6 · – · 2021 · paper
15 · ConvNeXt-XL (UPerNet) · Meta AI · Pure ConvNet + UPerNet · 54.0 · 391M · 2022 · paper · code
16 · Seg-L-Mask/16 · INRIA · ViT-L Segmenter · 53.6 · – · 2021 · paper
17 · Swin-L (UPerNet) · Microsoft Research · Swin Transformer + UPerNet · 53.5 · 234M · 2021 · paper · code
18 · SegFormer-B5 · NVIDIA · Mix Transformer + MLP · 51.8 · 85M · 2021 · paper · code
19 · SETR-MLA (ViT-L) · Fudan University · ViT-L + Multi-Level Aggregation · 47.7 · 310M · 2020 · paper
20 · HRNetV2-W48 · Microsoft Research · High-Resolution Net · 46.2 · 66M · 2019 · paper · code
21 · DeepLabv3+ (Xception-71) · Google · Atrous Separable Convolution · 45.7 · 62M · 2018 · paper
22 · PSPNet (ResNet-101) · CUHK / SenseTime · Pyramid Scene Parsing · 43.3 · 65M · 2017 · paper
23 · FCN-8s (VGG-16) · UC Berkeley · Fully Convolutional Network · 40.4 · 134M · 2015 · paper
Fig 1 · 23 models ranked by mIoU on the ADE20K validation set. Single-scale evaluation unless the original paper reports otherwise. Shaded row marks SOTA.
§ 02 · What it measures

mIoU, averaged across 150 classes.

Mean Intersection-over-Union computes, for each class, the ratio of correctly predicted pixels (the intersection of prediction and ground truth) to all pixels assigned to that class in either the prediction or the ground truth (the union), then averages the per-class IoU across all 150 categories. Because it is class-averaged, rare categories weigh equally with common ones, which is what makes ADE20K a demanding, imbalance-aware metric rather than a pixel-accuracy contest.

Pixel accuracy is reported as a secondary metric. A model that predicts only “wall” and “sky” can score very highly on pixel accuracy while failing on the other 148 categories; mIoU makes that trick impossible.
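
As a minimal sketch of both metrics, the NumPy snippet below scores a single prediction against its ground-truth map. The function name and the 150-class setting follow the description above, not any particular evaluation codebase.

```python
import numpy as np

NUM_CLASSES = 150  # ADE20K scene-parsing categories


def miou_and_pixel_accuracy(pred: np.ndarray, gt: np.ndarray, num_classes: int = NUM_CLASSES):
    """Compute mean IoU and pixel accuracy for one (H, W) prediction/ground-truth pair."""
    assert pred.shape == gt.shape
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        gt_c = gt == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent from both prediction and ground truth
        intersection = np.logical_and(pred_c, gt_c).sum()
        ious.append(intersection / union)
    miou = float(np.mean(ious)) if ious else 0.0
    pixel_acc = float((pred == gt).mean())
    return miou, pixel_acc
```

Classes absent from both prediction and ground truth are skipped, so they neither inflate nor deflate the class average.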

§ 03 · Dataset details

Inside the MIT Scene Parsing set.

ADE20K was built by Bolei Zhou, Hang Zhao, Antonio Torralba and colleagues at MIT CSAIL. Unlike simpler benchmarks, it demands that models understand 150 diverse categories spanning stuff (sky, wall, floor) and things (person, car, chair). Images were sourced from the SUN and Places databases, and each image contains on average 19.5 object instances and 10.5 distinct classes.

Every pixel in every image is labelled. That makes ADE20K significantly harder than PASCAL VOC (21 classes) and more diverse than Cityscapes (driving scenes only), which is why it became the standard test for general-purpose segmentation.

Split · Images · Notes
Training · 25,574 · Dense pixel-level annotations, 150 categories
Validation · 2,000 · Used for model evaluation + leaderboard ranking
Full ontology · 3,169 · 712,812 labelled objects; parts, materials, scene types
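
For orientation, the sketch below shows the label convention most ADE20K pipelines assume: annotation PNGs store 0 for unlabelled pixels and 1–150 for the categories, which are typically shifted to 0–149 with an ignore index before scoring. The file path and helper name are illustrative assumptions, not an official API.

```python
import numpy as np
from PIL import Image

IGNORE_INDEX = 255  # conventional "do not score" value


def load_ade20k_annotation(path: str) -> np.ndarray:
    """Read a SceneParse150-style annotation map and shift labels to 0-149."""
    raw = np.array(Image.open(path), dtype=np.int64)
    label = raw - 1                   # classes 1..150 -> 0..149
    label[raw == 0] = IGNORE_INDEX    # unlabelled pixels are excluded from scoring
    return label


# hypothetical path, shown only to illustrate the expected directory layout
mask = load_ade20k_annotation("ADEChallengeData2016/annotations/validation/ADE_val_00000001.png")
```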
§ 04 · How scores are verified

Reported, then reproduced.

Each row above cites the paper and, where the authors released code, the implementation repository. For ADE20K, the canonical evaluation server lives at MIT’s scene-parsing site; community reproductions typically run through OpenMMLab’s mmsegmentation toolbox against the 2,000-image validation split.

Scores without a paper link are treated as claim-only until a reproduction lands in the registry. The wider rule-set is in the Codesota methodology.
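
As a rough sketch of what such a reproduction run does, the loop below accumulates per-class intersections and unions across the whole 2,000-image validation split and divides only at the end, mirroring mmsegmentation-style dataset-level mIoU rather than an average of per-image scores. The function name and the (pred, gt) input format are assumptions for illustration.

```python
import numpy as np

NUM_CLASSES = 150
IGNORE_INDEX = 255


def evaluate_split(pairs, num_classes: int = NUM_CLASSES) -> float:
    """Dataset-level mIoU over an iterable of (pred, gt) label maps."""
    intersect = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for pred, gt in pairs:
        valid = gt != IGNORE_INDEX          # drop unlabelled pixels
        pred, gt = pred[valid], gt[valid]
        for c in range(num_classes):
            p, g = pred == c, gt == c
            intersect[c] += np.logical_and(p, g).sum()
            union[c] += np.logical_or(p, g).sum()
    iou = intersect / np.maximum(union, 1)  # divide once, after accumulation
    present = union > 0                     # classes never seen in the split are excluded
    return float(iou[present].mean())
```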

§ 05 · Key papers

Essential reading for the ADE20K line.

Title · Authors · Venue · Role
Semantic Understanding of Scenes Through the ADE20K Dataset · Zhou, Zhao, Puig, Xiao, Fidler, Barriuso, Torralba · IJCV 2019 · Original dataset paper
Scene Parsing through ADE20K Dataset · Zhou, Zhao, Puig, Fidler, Barriuso, Torralba · CVPR 2017 · Conference version
InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions · Wang et al. · CVPR 2023 Highlight · Current SOTA (62.9 mIoU)
Masked-attention Mask Transformer for Universal Image Segmentation · Cheng et al. · CVPR 2022 · Unified segmentation framework
OneFormer: One Transformer to Rule Universal Image Segmentation · Jain et al. · CVPR 2023 · Multi-task segmentation
SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers · Xie et al. · NeurIPS 2021 · Efficient transformer baseline
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows · Liu et al. · ICCV 2021 Best Paper · Shifted window attention
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks · Wang et al. · CVPR 2023 · BEiT-3 (#2 on ADE20K)
§ Final · Related benchmarks

Cross-links, sibling leaderboards.