ADE20K
Scene-parsing benchmark with 20K training and 2K validation images, densely annotated with 150 semantic categories.
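For orientation, here is a minimal sketch of reading the data, assuming the official SceneParsing release layout (`ADEChallengeData2016/` with `images/` and `annotations/` split into `training` and `validation`, and annotation PNGs that use 0 for unlabeled pixels and 1..150 for the classes). Paths and the peek logic are illustrative only.

```python
# Illustrative sketch: assumes the official SceneParsing release layout
# (ADEChallengeData2016/), where annotation PNGs store 0 for "unlabeled"
# and 1..150 for the semantic classes.
from pathlib import Path
import numpy as np
from PIL import Image

root = Path("ADEChallengeData2016")
img_dir = root / "images" / "validation"
ann_dir = root / "annotations" / "validation"

for img_path in sorted(img_dir.glob("*.jpg")):
    ann_path = ann_dir / (img_path.stem + ".png")
    image = np.array(Image.open(img_path).convert("RGB"))
    labels = np.array(Image.open(ann_path))   # (H, W) uint8, values 0..150
    mask = labels > 0                          # drop unlabeled pixels
    print(img_path.name, image.shape, np.unique(labels[mask])[:5])
    break                                      # just peek at the first sample
```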
Benchmark Stats
SOTA History
Metric: mIoU (higher is better)
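The scores below follow the standard mean-IoU protocol: per-class intersection and union are accumulated over the whole validation set, then averaged over the 150 classes. The following is a minimal NumPy sketch, not the official evaluation code, using the label convention from the release sketch above.

```python
import numpy as np

def mean_iou(preds, gts, num_classes=150):
    """Mean IoU over paired (prediction, ground-truth) integer label maps.

    Follows the convention of the release sketch above: label 0 is
    "unlabeled" and is ignored, labels 1..150 are the semantic classes.
    Per-class intersection/union are accumulated over the whole set
    before averaging (the standard semantic-segmentation protocol).
    """
    inter = np.zeros(num_classes, dtype=np.int64)
    union = np.zeros(num_classes, dtype=np.int64)
    for pred, gt in zip(preds, gts):
        valid = gt > 0                         # drop unlabeled pixels
        pred, gt = pred[valid], gt[valid]
        for c in range(1, num_classes + 1):
            p, g = pred == c, gt == c
            inter[c - 1] += np.logical_and(p, g).sum()
            union[c - 1] += np.logical_or(p, g).sum()
    present = union > 0                        # classes that appear at least once
    return float((inter[present] / union[present]).mean())
```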
| Rank | Model | Source | Score (mIoU) | Year | Paper |
|---|---|---|---|---|---|
| 1 | ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities. 63.0 mIoU on ADE20K val (150 classes). Uses ViT-Adapter for dense prediction. SOTA at release. Alibaba DAMO Academy. arXiv May 2023. | Community | 63.0 | 2023 | Source |
| 2 | InternImage-H: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. InternImage-H backbone with Mask2Former achieves 62.9 mIoU on ADE20K val. Shanghai AI Laboratory / OpenGVLab. CVPR 2023. | Editorial | 62.9 | 2023 | Source |
| 3 | ViT-Adapter-L (BEiT-3): BEiT-3 (Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks). Backbone pre-trained via BEiT-3 multimodal masked modeling; with ViT-Adapter + Mask2Former it achieves 62.8 mIoU (multi-scale) on ADE20K val, SOTA at time of publication. ICLR 2023 Spotlight (ViT-Adapter). | Community | 62.8 | 2023 | Source |
| 4 | ViT-CoMer-L: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions. Large variant achieves 62.1 mIoU on ADE20K val. Pre-training-free plain-ViT backbone with bidirectional CNN-Transformer interaction. CVPR 2024. | Community | 62.1 | 2024 | Source |
| 5 | DINOv2 ViT-g/14 + Mask2Former: DINOv2 (Learning Robust Visual Features without Supervision). ViT-g/14 self-supervised on the curated LVD-142M dataset via DINO + iBOT losses. With a ViT-Adapter + Mask2Former decoder it achieves 60.2 mIoU on ADE20K val with a near-frozen backbone. Meta AI. TMLR 2024. | Community | 60.2 | 2024 | Source |
| 6 | EVA-02-L + UperNet: EVA-02 (A Visual Representation for Neon Genesis). EVA-02-L backbone with a UperNet head achieves 60.1 mIoU (single-scale, 640×640) on ADE20K val. Language-aligned CLIP features improve dense prediction. BAAI. arXiv 2023. | Community | 60.1 | 2023 | Source |
| 7 | EoMT-L (DINOv2): Encoder-only Mask Transformer (Your ViT is Secretly an Image Segmentation Model). 59.5 mIoU on ADE20K val at 512×512 with a DINOv2 ViT-L backbone. No decoder; patches and queries are encoded jointly. CVPR 2025 Highlight. | Community | 59.5 | 2025 | Source |
| 8 | OneFormer (DiNAT-L): One Transformer to Rule Universal Image Segmentation. Universal framework jointly trained on panoptic, instance, and semantic tasks; DiNAT-L backbone achieves 58.3 mIoU (single-scale) on ADE20K val. First model to outperform specialized models on all three tasks simultaneously. CVPR 2023. | Community | 58.3 | 2023 | Source |
| 9 | Mask2Former (Swin-L): Masked-attention Mask Transformer for Universal Image Segmentation. Swin-L backbone; 57.3 mIoU on ADE20K val. CVPR 2022. | Editorial | 57.3 | 2022 | Source |
| 10 | Swin-L + UperNet: Swin Transformer (Hierarchical Vision Transformer using Shifted Windows). Swin-L backbone + UperNet head; 53.5 mIoU on ADE20K val (single-scale), ImageNet-22K pre-trained. A long-standing strong baseline for segmentation. ICCV 2021 Best Paper. | Community | 53.5 | 2021 | Source |
| 11 | SegMAN-L: SegMAN (Omni-scale Context Modeling with State Space Models and Local Attention for Semantic Segmentation). Large variant achieves 53.2 mIoU on ADE20K val. Hybrid Mamba SSM + local attention. CVPR 2025. | Community | 53.2 | 2025 | Source |
| 12 | SegFormer-B5: Simple and Efficient Design for Semantic Segmentation with Transformers. B5 variant (82M params) achieves 51.8 mIoU on ADE20K val (150 classes). Mix Transformer backbone + lightweight MLP decoder. NVIDIA. NeurIPS 2021. See the inference sketch below the table. | Community | 51.8 | 2021 | Source |
| 13 | SeMask-L: Semantically Masked Transformers for Semantic Segmentation. Swin-L encoder with semantic-attention modules at multiple stages, paired with a Mask2Former decoder. 49.35 mIoU on ADE20K val. ICCVW 2023. | Community | 49.35 | 2023 | Source |
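As a concrete usage example for one of the listed models, here is a minimal inference sketch for SegFormer-B5 with the Hugging Face `transformers` library. The checkpoint ID `nvidia/segformer-b5-finetuned-ade-640-640` is assumed to be available on the Hub; any other ADE20K-finetuned checkpoint should work the same way.

```python
# Illustrative sketch: assumes the nvidia/segformer-b5-finetuned-ade-640-640
# checkpoint is available on the Hugging Face Hub.
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b5-finetuned-ade-640-640"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # (1, 150, H/4, W/4)

# Upsample to the input resolution and take the per-pixel argmax.
# Note: such checkpoints typically index the 150 classes as 0..149,
# i.e. the raw annotation value minus 1 (the "unlabeled" 0 is dropped).
logits = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
pred = logits.argmax(dim=1)[0]                 # (H, W) label map
```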