Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in an image — no instance distinction, just dense pixel-wise classification. It is the dense prediction problem that underpins autonomous driving, medical imaging, and remote sensing. FCN (2015) showed classification CNNs could be repurposed for pixel labeling, DeepLab introduced atrous convolutions and CRF post-processing, and SegFormer (2021) showed transformers are highly competitive here too.
State-of-the-art Cityscapes mIoU now exceeds 85, while ADE20K — with its 150 classes — has climbed from 29% mIoU (FCN, 2015) to 62%+ (InternImage, 2023) yet remains challenging. The frontier has moved toward universal models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture, and open-vocabulary segmentation is redefining the task entirely.
History
Fully Convolutional Network (FCN) by Long et al. adapts classification CNNs for dense prediction, establishing the modern segmentation paradigm
U-Net introduces encoder-decoder with skip connections for biomedical segmentation — becomes the most-cited architecture in medical imaging
DeepLab v2 introduces atrous/dilated convolutions and CRF post-processing, pushing VOC mIoU past 79%
DeepLab v3+ combines atrous spatial pyramid pooling (ASPP) with encoder-decoder structure, achieving 89% on VOC
HRNet maintains high-resolution representations throughout the network, showing that preserving spatial resolution matters as much as depth for segmentation
SegFormer (Xie et al.) applies transformers to segmentation with a simple MLP decoder, achieving 84.0% Cityscapes mIoU at efficient compute
Mask2Former unifies semantic, instance, and panoptic segmentation under one masked-attention architecture
InternImage-H achieves 62.9% ADE20K mIoU using deformable convolutions at scale; Segment Anything (SAM) enables zero-shot segmentation of any object
SAM 2 extends to video segmentation with streaming memory; open-vocabulary segmenters (SAN, CAT-Seg) classify pixels into arbitrary text-described categories
How Semantic Segmentation Works
Encoder / Backbone
A pretrained backbone (ResNet, Swin, InternImage) extracts hierarchical features. Transformers capture global context; CNNs excel at local texture. Features are extracted at 1/4, 1/8, 1/16, 1/32 of input resolution.
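The stride-based pyramid described above can be sketched directly. A minimal illustration of the feature-map sizes a backbone produces for a 512×512 input — the strides match the text, while the channel counts are illustrative (ResNet-style) and not taken from any specific backbone:

```python
# Sketch: spatial sizes of a hierarchical feature pyramid for an H x W input.
# Strides (1/4 ... 1/32) follow the text; channel counts are illustrative.
def pyramid_shapes(h, w, strides=(4, 8, 16, 32), channels=(256, 512, 1024, 2048)):
    """Return (channels, height, width) for each pyramid level."""
    return [(c, h // s, w // s) for s, c in zip(strides, channels)]

shapes = pyramid_shapes(512, 512)
# The stride-4 level keeps the most spatial detail; the stride-32 level
# is semantically richest but spatially coarsest.
```

The decoder's job, covered below, is to recombine these levels back to full resolution.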
Context Aggregation
Modules like ASPP (multi-scale dilated convolutions), PPM (pyramid pooling), or transformer self-attention aggregate global context needed to resolve ambiguous pixels.
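The key property of the dilated convolutions inside ASPP is that dilation widens the receptive field without adding weights. A minimal 1-D sketch (not any library's implementation) makes this concrete: with dilation d, a kernel of size k covers d·(k−1)+1 input samples using the same k parameters:

```python
import numpy as np

# Sketch: 1-D dilated ("atrous") convolution. With dilation d, a kernel of
# size k spans d*(k-1)+1 samples while keeping only k learnable weights.
def dilated_conv1d(x, kernel, dilation=1):
    k = len(kernel)
    span = dilation * (k - 1) + 1            # effective receptive field
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
same_params = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1)  # field = 3
wider_field = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=3)  # field = 7
```

ASPP runs several such convolutions in parallel at different dilation rates and concatenates the results, capturing context at multiple scales at once.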
Decoder / Upsampling
Features are progressively upsampled to input resolution. U-Net-style skip connections recover spatial detail lost in downsampling. SegFormer uses a lightweight MLP decoder; Mask2Former uses masked cross-attention.
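A single U-Net-style decoder step can be sketched as upsample-then-concatenate. The sketch below uses nearest-neighbor upsampling as a simplified stand-in for the learned or bilinear upsampling real decoders use:

```python
import numpy as np

# Sketch of one U-Net-style decoder step: upsample the deeper feature map,
# then concatenate the matching encoder skip feature along channels.
def upsample2x(feat):
    """feat: (C, H, W) -> (C, 2H, 2W) by repeating pixels (nearest-neighbor)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(deep_feat, skip_feat):
    up = upsample2x(deep_feat)
    assert up.shape[1:] == skip_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([up, skip_feat], axis=0)  # fuse along channel axis

deep = np.zeros((64, 16, 16))   # deeper, coarser feature map
skip = np.zeros((32, 32, 32))   # encoder skip at twice the resolution
fused = decoder_step(deep, skip)  # shape (96, 32, 32)
```

In a real decoder a convolution follows the concatenation to mix the fused channels; repeating this step per level walks the features back up to input resolution.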
Per-Pixel Classification
A 1×1 convolution projects features to C channels (one per class), producing dense logits. Cross-entropy or dice loss supervises every pixel independently.
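The "supervise every pixel independently" step is just cross-entropy broadcast over the spatial grid. A minimal numpy sketch of per-pixel cross-entropy from dense logits of shape (C, H, W) and an integer label map of shape (H, W):

```python
import numpy as np

# Sketch: per-pixel cross-entropy loss over dense class logits.
def pixel_cross_entropy(logits, labels):
    C, H, W = logits.shape
    # softmax over the class dimension, numerically stabilized
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    # gather each pixel's predicted probability for its true class
    p_true = probs[labels, np.arange(H)[:, None], np.arange(W)[None, :]]
    return float(-np.log(p_true).mean())

logits = np.zeros((3, 2, 2))          # uniform logits over 3 classes
labels = np.array([[0, 1], [2, 0]])
loss = pixel_cross_entropy(logits, labels)  # -log(1/3) for uniform predictions
```

Dice loss replaces or complements this when foreground pixels are scarce, since it scores region overlap rather than averaging per-pixel error.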
Evaluation
Mean IoU (mIoU) — the intersection-over-union averaged across all classes — is the standard metric. Cityscapes (19 classes, urban), ADE20K (150 classes, diverse), and PASCAL VOC (21 classes) are the primary benchmarks.
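The metric itself is simple to compute: per class, IoU is intersection over union of the predicted and ground-truth masks, then the per-class scores are averaged. A minimal sketch (real evaluators accumulate a confusion matrix over the whole validation set rather than scoring one image):

```python
import numpy as np

# Sketch: mean IoU between a predicted and a ground-truth label map.
# IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over classes that appear.
def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, gt, num_classes=2)  # (1/2 + 2/3) / 2
```

Because mIoU averages over classes, a rare class counts as much as a dominant one — which is exactly why the class-imbalance challenge below hurts leaderboard scores.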
Current Landscape
Semantic segmentation in 2025 is split between task-specific architectures optimized for benchmarks and foundation models that segment anything. On the benchmark side, transformer-based models (Mask2Former, SegFormer) and scaled CNNs (InternImage, ConvNeXt) compete for ADE20K and Cityscapes leaderboard positions, with mIoU gains now in the sub-1% range. On the practical side, SAM/SAM 2 has fundamentally changed the game: segment any object with a click or text prompt, then classify the segments however you want. Open-vocabulary segmentation (CAT-Seg, SAN, ODISE) merges CLIP's language understanding with dense prediction, making fixed-class segmenters feel outdated.
Key Challenges
Class imbalance — rare classes (e.g., cyclists, poles in driving scenes) get overwhelmed by dominant classes (road, sky), requiring loss weighting or oversampling
Fine boundary precision — segmentation masks are typically blurry at object edges, which matters for medical imaging (tumor margins) and autonomous driving (pedestrian boundaries)
Annotation cost — pixel-level labeling takes 1-4 hours per image for complex scenes (Cityscapes reports 90 minutes average), making large-scale datasets extremely expensive
Domain gap — models trained on daytime urban scenes (Cityscapes) degrade on night, rain, snow, and different cities without domain adaptation
Computational cost of dense prediction at full resolution — most models operate at 512×512 or 1024×1024, but applications like satellite imagery need 4K+ resolution
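The loss-weighting fix mentioned for class imbalance is often implemented with inverse-frequency class weights. A hedged sketch — the inverse-frequency heuristic below is one common choice, not a standard prescribed by any particular paper:

```python
import numpy as np

# Sketch: inverse-frequency class weights to counter class imbalance.
# Rare classes (cyclists, poles) get large weights; dominant ones (road,
# sky) get small weights. One common heuristic among several.
def inverse_freq_weights(labels, num_classes):
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    weights = 1.0 / np.maximum(freq, 1e-8)        # rare -> large weight
    return weights / weights.sum() * num_classes  # normalize to mean 1

labels = np.array([[0, 0, 0, 1]])    # class 0 dominates 3:1
w = inverse_freq_weights(labels, 2)  # class 1 weighted up
```

These weights then scale the per-pixel cross-entropy terms, so errors on rare classes cost more than errors on dominant ones.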
Quick Recommendations
Best mIoU (general scenes)
InternImage-H or BEiT-3 + UperNet
62%+ mIoU on ADE20K; InternImage's deformable convolutions handle multi-scale objects well
Real-time (autonomous driving)
SegFormer-B2 or DDRNet-23
80%+ Cityscapes mIoU at 30+ FPS on a single GPU; SegFormer is easier to train, DDRNet is faster
Medical imaging
nnU-Net (auto-configured) or MedSAM
nnU-Net self-configures architecture, preprocessing, and training for any medical dataset; wins most segmentation challenges
Open-vocabulary / novel classes
SAM 2 + CLIP-based classifier (CAT-Seg, SAN)
Segment anything via prompts, then classify segments into arbitrary text categories — no per-class training needed
Low-annotation regime
SAM + pseudo-label pipeline
Use SAM for automatic mask proposals, manually verify a subset, then train a task-specific model on the pseudo-labels
What's Next
The future is universal segmentation models that handle semantic, instance, and panoptic segmentation with open-vocabulary class sets. SAM 3.0 and its successors will likely subsume traditional segmentation pipelines. Key research frontiers: 3D semantic segmentation for robotics (point clouds, NeRF-based), video segmentation with temporal consistency, and weakly-supervised methods that learn from image-level labels or natural language descriptions instead of expensive pixel annotations.
Benchmarks & SOTA
ADE20K
ADE20K Scene Parsing Benchmark
20K training, 2K validation images annotated with 150 object categories. Complex scene parsing benchmark.
State of the Art: InternImage-H (Shanghai AI Lab) — 62.9 mIoU
Cityscapes
Cityscapes Dataset
5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
No results tracked yet