
Semantic Segmentation

Semantic segmentation assigns a class label to every pixel — the dense prediction problem that underpins autonomous driving, medical imaging, and satellite analysis. FCN (2015) showed that classification networks could be repurposed for pixel labeling, DeepLab introduced atrous convolutions and CRF post-processing, and SegFormer (2021) showed that transformers excel here too. State-of-the-art on Cityscapes exceeds 85% mIoU, but ADE20K with its 150 classes remains brutally challenging. The frontier has moved toward universal segmentation models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture.


Semantic segmentation assigns a class label to every pixel in an image — no instance distinction, just dense pixel-wise classification. It's critical for autonomous driving, medical imaging, and remote sensing. ADE20K mIoU has climbed from 29% (FCN, 2015) to 62%+ (InternImage, 2023), and open-vocabulary segmentation is redefining the task entirely.

History

2015

Fully Convolutional Network (FCN) by Long et al. adapts classification CNNs for dense prediction, establishing the modern segmentation paradigm

2015

U-Net introduces encoder-decoder with skip connections for biomedical segmentation — becomes the most-cited architecture in medical imaging

2017

DeepLab v2 introduces atrous/dilated convolutions and CRF post-processing, pushing VOC mIoU past 79%

2018

DeepLab v3+ combines atrous spatial pyramid pooling (ASPP) with encoder-decoder structure, achieving 89% on VOC

2019

HRNet maintains high-resolution representations throughout the network, proving that resolution matters more than depth for segmentation

2021

SegFormer (Xie et al.) applies transformers to segmentation with a simple MLP decoder, achieving 84.0% Cityscapes mIoU at efficient compute

2021

Mask2Former unifies semantic, instance, and panoptic segmentation under one masked-attention architecture

2023

InternImage-H achieves 62.9% ADE20K mIoU using deformable convolutions at scale; Segment Anything (SAM) enables zero-shot segmentation of any object

2024

SAM 2 extends to video segmentation with streaming memory; open-vocabulary segmenters (SAN, CAT-Seg) classify pixels into arbitrary text-described categories

How Semantic Segmentation Works

Semantic Segmentation Pipeline
1. Encoder / Backbone

A pretrained backbone (ResNet, Swin, InternImage) extracts hierarchical features. Transformers capture global context; CNNs excel at local texture. Features are extracted at 1/4, 1/8, 1/16, 1/32 of input resolution.
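The stride convention above can be sketched in a few lines. This toy helper (hypothetical name `pyramid_shapes`) computes the feature-map sizes a backbone emits for a given input, assuming the common convention that stride-s features have shape ceil(H/s) × ceil(W/s):

```python
def pyramid_shapes(h, w, strides=(4, 8, 16, 32)):
    """Spatial sizes of backbone feature maps at each stride.

    Uses ceiling division, matching the usual padding convention
    so odd input sizes still produce a full pyramid.
    """
    return [(-(-h // s), -(-w // s)) for s in strides]

# A 1024x2048 Cityscapes image yields four feature levels:
print(pyramid_shapes(1024, 2048))
# [(256, 512), (128, 256), (64, 128), (32, 64)]
```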

2. Context Aggregation

Modules like ASPP (multi-scale dilated convolutions), PPM (pyramid pooling), or transformer self-attention aggregate global context needed to resolve ambiguous pixels.
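The reason atrous (dilated) convolutions aggregate context cheaply is arithmetic: a k×k kernel at dilation d spans d·(k−1)+1 pixels while keeping k² parameters. A minimal sketch, using DeepLab v3's default ASPP rates of 6, 12, and 18 as the example:

```python
def dilated_kernel_extent(kernel=3, dilation=1):
    """Effective spatial extent of a dilated (atrous) convolution:
    dilation * (kernel - 1) + 1 pixels, at constant parameter count."""
    return dilation * (kernel - 1) + 1

# ASPP-style 3x3 branches at rates 6, 12, 18 (DeepLab v3 defaults)
# cover progressively larger windows from the same 3x3 kernel:
print([dilated_kernel_extent(3, d) for d in (6, 12, 18)])
# [13, 25, 37]
```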

3. Decoder / Upsampling

Features are progressively upsampled to input resolution. U-Net-style skip connections recover spatial detail lost in downsampling. SegFormer uses a lightweight MLP decoder; Mask2Former uses masked cross-attention.
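The simplest possible decoder step is nearest-neighbour upsampling; the sketch below shows the mechanics on a tiny grid. Real decoders typically use bilinear interpolation or learned transposed convolutions instead, and add skip connections on top:

```python
def upsample_nearest(grid, factor):
    """Nearest-neighbour upsampling of a 2D label/feature grid by an
    integer factor: each coarse cell is replicated factor x factor times."""
    return [
        [grid[i // factor][j // factor]
         for j in range(len(grid[0]) * factor)]
        for i in range(len(grid) * factor)
    ]

coarse = [[0, 1],
          [2, 3]]
print(upsample_nearest(coarse, 2))
# [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
```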

4. Per-Pixel Classification

A 1×1 convolution projects features to C channels (one per class), producing dense logits. Cross-entropy or dice loss supervises every pixel independently.
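For a single pixel, the head reduces to: C logits in, softmax over classes, negative log-likelihood of the target class out. A minimal pure-Python sketch (helper names are illustrative):

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's C class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def pixel_cross_entropy(logits, target_class):
    """Per-pixel cross-entropy loss: -log softmax(logits)[target]."""
    return -math.log(softmax(logits)[target_class])

logits = [2.0, 0.5, -1.0]   # C = 3 class scores for one pixel
pred = max(range(len(logits)), key=logits.__getitem__)  # argmax -> label map
print(pred)                                      # 0
print(round(pixel_cross_entropy(logits, 0), 3))  # 0.241
```

At inference time the loss is dropped and the per-pixel argmax alone produces the label map.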

5. Evaluation

Mean IoU (mIoU) — the intersection-over-union averaged across all classes — is the standard metric. Cityscapes (19 classes, urban), ADE20K (150 classes, diverse), and PASCAL VOC (21 classes) are the primary benchmarks.
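The mIoU metric described above can be computed directly from flattened prediction and ground-truth label arrays — a minimal sketch (averaging only over classes that actually appear, one common convention):

```python
def mean_iou(preds, targets, num_classes):
    """Mean IoU over flat label lists: per class,
    IoU = TP / (TP + FP + FN), averaged over classes with a
    non-empty union (i.e. present in predictions or ground truth)."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(preds, targets):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

preds   = [0, 0, 1, 1, 2, 2]
targets = [0, 1, 1, 1, 2, 0]
print(round(mean_iou(preds, targets, 3), 3))  # 0.5
```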

Current Landscape

Semantic segmentation in 2025 is split between task-specific architectures optimized for benchmarks and foundation models that segment anything. On the benchmark side, transformer-based models (Mask2Former, SegFormer) and scaled CNNs (InternImage, ConvNeXt) compete for ADE20K and Cityscapes leaderboard positions, with mIoU gains now in the sub-1% range. On the practical side, SAM/SAM 2 has fundamentally changed the game: segment any object with a click or text prompt, then classify the segments however you want. Open-vocabulary segmentation (CAT-Seg, SAN, ODISE) merges CLIP's language understanding with dense prediction, making fixed-class segmenters feel outdated.

Key Challenges

Class imbalance — rare classes (e.g., cyclists, poles in driving scenes) get overwhelmed by dominant classes (road, sky), requiring loss weighting or oversampling

Fine boundary precision — segmentation masks are typically blurry at object edges, which matters for medical imaging (tumor margins) and autonomous driving (pedestrian boundaries)

Annotation cost — pixel-level labeling takes 1-4 hours per image for complex scenes (Cityscapes reports 90 minutes average), making large-scale datasets extremely expensive

Domain gap — models trained on daytime urban scenes (Cityscapes) degrade on night, rain, snow, and different cities without domain adaptation

Computational cost — dense prediction at full resolution is expensive; most models operate on 512×512 or 1024×1024 crops, while applications like satellite imagery need 4K+ resolution
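The class-imbalance remedy mentioned above — loss weighting — is often implemented as inverse-frequency weights. A toy sketch (hypothetical helper name, `eps` smooths classes with few or zero pixels):

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes, eps=1.0):
    """Weight each class's loss inversely to its pixel frequency,
    so rare classes contribute as much gradient as dominant ones."""
    counts = Counter(labels)
    total = len(labels)
    return [total / (num_classes * (counts.get(c, 0) + eps))
            for c in range(num_classes)]

# Toy driving scene: 90% "road" pixels, 9% "car", 1% "cyclist"
labels = [0] * 90 + [1] * 9 + [2] * 1
w = inverse_frequency_weights(labels, 3)
print([round(x, 2) for x in w])  # [0.37, 3.33, 16.67]
```

The rare cyclist class ends up weighted roughly 45× more heavily than road, which is why many pipelines clip or square-root these weights before use.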

Quick Recommendations

Best mIoU (general scenes)

InternImage-H or BEiT-3 + UperNet

62%+ mIoU on ADE20K; InternImage's deformable convolutions handle multi-scale objects well

Real-time (autonomous driving)

SegFormer-B2 or DDRNet-23

80%+ Cityscapes mIoU at 30+ FPS on a single GPU; SegFormer is easier to train, DDRNet is faster

Medical imaging

nnU-Net (auto-configured) or MedSAM

nnU-Net self-configures architecture, preprocessing, and training for any medical dataset; wins most segmentation challenges

Open-vocabulary / novel classes

SAM 2 + CLIP-based classifier (CAT-Seg, SAN)

Segment anything via prompts, then classify segments into arbitrary text categories — no per-class training needed

Low-annotation regime

SAM + pseudo-label pipeline

Use SAM for automatic mask proposals, manually verify a subset, then train a task-specific model on the pseudo-labels
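The "verify a subset" step above amounts to routing mask proposals by confidence. A minimal sketch of that triage logic — `filter_pseudo_labels` and the `(mask_id, confidence)` format are hypothetical, standing in for whatever scores your mask generator emits:

```python
def filter_pseudo_labels(mask_scores, threshold=0.9):
    """Split mask proposals into auto-accepted pseudo-labels and a
    manual-review queue, based on a confidence threshold."""
    keep, review = [], []
    for mask_id, score in mask_scores:
        (keep if score >= threshold else review).append(mask_id)
    return keep, review

proposals = [("m1", 0.97), ("m2", 0.81), ("m3", 0.93)]
keep, review = filter_pseudo_labels(proposals)
print(keep, review)  # ['m1', 'm3'] ['m2']
```

The threshold trades label noise against annotation effort: raising it shrinks the pseudo-label set but sends more masks to human verification.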

What's Next

The future is universal segmentation models that handle semantic, instance, and panoptic segmentation with open-vocabulary class sets. SAM 3.0 and its successors will likely subsume traditional segmentation pipelines. Key research frontiers: 3D semantic segmentation for robotics (point clouds, NeRF-based), video segmentation with temporal consistency, and weakly-supervised methods that learn from image-level labels or natural language descriptions instead of expensive pixel annotations.

