ADE20K


20,210 training and 2,000 validation images densely annotated with 150 object and stuff categories. A standard benchmark for complex scene parsing (semantic segmentation).
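A minimal loading sketch, assuming the scene_parse_150 mirror of the benchmark on the Hugging Face Hub (the official release is distributed by MIT CSAIL; field names follow that mirror):

from datasets import load_dataset

# May require trust_remote_code=True depending on the datasets version.
ds = load_dataset("scene_parse_150")
sample = ds["validation"][0]
image = sample["image"]        # PIL image of the scene
mask = sample["annotation"]    # PIL image of per-pixel labels (0 typically unlabeled, 1-150 class ids)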

Benchmark Stats

Models: 13
Papers: 13
Metrics: 1

SOTA History

mIoU (higher is better)
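mIoU averages per-class intersection-over-union over the 150 categories; standard practice accumulates intersections and unions over the whole validation set before averaging. A minimal NumPy sketch (mean_iou is an illustrative helper, not benchmark tooling):

import numpy as np

def mean_iou(pred, gt, num_classes=150, ignore_index=255):
    # pred, gt: integer label maps of identical shape;
    # pixels equal to ignore_index in gt are excluded.
    pred, gt = np.asarray(pred).ravel(), np.asarray(gt).ravel()
    keep = gt != ignore_index
    pred, gt = pred[keep], gt[keep]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:              # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))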

Rank | Model | Source | Score (mIoU) | Year
1. ONE-PEACE

ONE-PEACE (Exploring One General Representation Model Toward Unlimited Modalities). 63.0 mIoU on ADE20K val (150 classes). Uses ViT-Adapter for dense prediction. SOTA at release. Alibaba DAMO Academy. arXiv May 2023.

Community | 63.0 | 2026
2. InternImage-H

InternImage (Exploring Large-Scale Vision Foundation Models with Deformable Convolutions). InternImage-H backbone achieves 62.9 mIoU on ADE20K val. Large-scale CNN foundation model built on deformable convolutions (DCNv3). CVPR 2023.

Editorial | 62.9 | 2025
3. ViT-Adapter-L (BEiT-3)

BEiT-3 (Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks). ViT-L backbone pre-trained via BEiT-3 multimodal masked modeling. With ViT-Adapter + Mask2Former, achieves 62.8 MS mIoU on ADE20K val — SOTA at time of publication. ICLR 2023 Spotlight (ViT-Adapter).

Community | 62.8 | 2026
4. ViT-CoMer-L

ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions. Large variant achieves 62.1 mIoU on ADE20K val. Pre-training-free, plain ViT backbone with bidirectional CNN interaction. CVPR 2024.

Community | 62.1 | 2026
5. DINOv2 ViT-g/14 + Mask2Former

DINOv2: Learning Robust Visual Features without Supervision. ViT-g/14 self-supervised on curated LVD-142M dataset via DINO+iBOT losses. With ViT-Adapter + Mask2Former decoder achieves 60.2 mIoU on ADE20K val. Near-frozen backbone. Meta AI. TMLR 2024.

Community | 60.2 | 2026
6. EVA-02-L + UperNet

EVA-02: A Visual Representation for Neon Genesis. EVA-02-L backbone with UperNet head achieves 60.1 mIoU (single-scale, 640² resolution) on ADE20K val. CLIP-aligned masked-image-modeling features improve dense prediction. arXiv 2023. BAAI / HUST.

Community | 60.1 | 2026
7. EoMT-L (DINOv2)

EoMT (Encoder-only Mask Transformer): Your ViT is Secretly an Image Segmentation Model. 59.5 mIoU on ADE20K val at 512x512 with DINOv2 ViT-L backbone. No decoder — jointly encodes patches and queries. CVPR 2025 Highlight.

Community | 59.5 | 2026
8. OneFormer (DiNAT-L)

OneFormer: One Transformer to Rule Universal Image Segmentation. Universal framework jointly trained on panoptic+instance+semantic tasks. DiNAT-L backbone achieves 58.3 mIoU (single-scale) on ADE20K val. First model to outperform specialized models on all three tasks simultaneously. CVPR 2023.

Community | 58.3 | 2026
9. Mask2Former (Swin-L)

Mask2Former (Masked-attention Mask Transformer for Universal Image Segmentation). Swin-L backbone achieves 57.3 mIoU on ADE20K val. Single universal architecture for panoptic, instance, and semantic segmentation. CVPR 2022. Meta AI (FAIR).

Editorial | 57.3 | 2025
10. Swin-L + UperNet

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Swin-L backbone + UperNet head achieves 53.5 mIoU on ADE20K val (multi-scale), ImageNet-22K pre-trained. Long-standing strong baseline for segmentation. ICCV 2021 Best Paper.

Community | 53.5 | 2026
11. SegMAN-L

SegMAN (Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation). Large variant achieves 53.2 mIoU on ADE20K val. Hybrid Mamba-style SSM + local attention encoder. CVPR 2025.

Community | 53.2 | 2026
12. SegFormer-B5

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. B5 variant (82M params) achieves 51.8 mIoU on ADE20K val (150 classes). Mix Transformer backbone + lightweight MLP decoder. NeurIPS 2021. NVIDIA. (An inference sketch with a public B5 checkpoint follows this list.)

Community | 51.8 | 2026
13. SeMask-L

SeMask: Semantically Masked Transformers for Semantic Segmentation. Swin-L encoder with Semantic Attention modules at multiple stages, paired with Mask2Former decoder. 49.35 mIoU on ADE20K val. ICCVW 2023.

Community | 49.35 | 2026
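As a usage example for entry 12, a minimal inference sketch assuming the Hugging Face transformers library and the nvidia/segformer-b5-finetuned-ade-640-640 checkpoint (SegFormer-B5 fine-tuned on ADE20K); scene.jpg is a placeholder input:

import torch
from PIL import Image
from transformers import AutoImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b5-finetuned-ade-640-640"    # 150-class ADE20K fine-tune
processor = AutoImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg").convert("RGB")        # placeholder scene photo
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                   # (1, 150, H/4, W/4)

# Upsample to the input resolution and take the per-pixel argmax class.
pred = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)[0]                                    # (H, W) ADE20K label map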
