Semantic Segmentation
Semantic segmentation assigns a class label to every pixel in an image — no instance distinction, just dense pixel-wise classification. It is the dense prediction problem that underpins autonomous driving, medical imaging, and remote sensing. FCN (2015) showed classification CNNs could be repurposed for pixel labeling, DeepLab introduced atrous convolutions and CRF post-processing, and SegFormer (2021) showed transformers are highly competitive here too.
State-of-the-art Cityscapes mIoU now exceeds 85, while ADE20K — with its 150 classes — has climbed from 29% mIoU (FCN, 2015) to 62%+ (InternImage, 2023) yet remains challenging. The frontier has moved toward universal models like Mask2Former that handle semantic, instance, and panoptic segmentation in a single architecture, and open-vocabulary segmentation is redefining the task entirely.
History
Fully Convolutional Network (FCN) by Long et al. adapts classification CNNs for dense prediction, establishing the modern segmentation paradigm
U-Net introduces encoder-decoder with skip connections for biomedical segmentation — becomes the most-cited architecture in medical imaging
DeepLab v2 introduces atrous/dilated convolutions and CRF post-processing, pushing VOC mIoU past 79%
DeepLab v3+ combines atrous spatial pyramid pooling (ASPP) with encoder-decoder structure, achieving 89% on VOC
HRNet maintains high-resolution representations throughout the network, showing that preserving spatial resolution matters as much as depth for segmentation
SegFormer (Xie et al.) applies transformers to segmentation with a simple MLP decoder, achieving 84.0% Cityscapes mIoU at efficient compute
Mask2Former unifies semantic, instance, and panoptic segmentation under one masked-attention architecture
InternImage-H achieves 62.9% ADE20K mIoU using deformable convolutions at scale; Segment Anything (SAM) enables zero-shot segmentation of any object
SAM 2 extends to video segmentation with streaming memory; open-vocabulary segmenters (SAN, CAT-Seg) classify pixels into arbitrary text-described categories
How Semantic Segmentation Works
Encoder / Backbone
A pretrained backbone (ResNet, Swin, InternImage) extracts hierarchical features. Transformers capture global context; CNNs excel at local texture. Features are extracted at 1/4, 1/8, 1/16, 1/32 of input resolution.
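The stride-based pyramid described above can be sketched directly. A minimal illustration of the feature-map sizes a backbone produces for a 512×512 input — the strides match the text, while the channel counts are illustrative (ResNet-style) and not taken from any specific backbone:

```python
# Sketch: spatial sizes of a hierarchical feature pyramid for an H x W input.
# Strides (1/4 ... 1/32) follow the text; channel counts are illustrative.
def pyramid_shapes(h, w, strides=(4, 8, 16, 32), channels=(256, 512, 1024, 2048)):
    """Return (channels, height, width) for each pyramid level."""
    return [(c, h // s, w // s) for s, c in zip(strides, channels)]

shapes = pyramid_shapes(512, 512)
# The stride-4 level keeps the most spatial detail; the stride-32 level
# is semantically richest but spatially coarsest.
```

The decoder's job, covered below, is to recombine these levels back to full resolution.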
Context Aggregation
Modules like ASPP (multi-scale dilated convolutions), PPM (pyramid pooling), or transformer self-attention aggregate global context needed to resolve ambiguous pixels.
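The key property of the dilated convolutions inside ASPP is that dilation widens the receptive field without adding weights. A minimal 1-D sketch (not any library's implementation) makes this concrete: with dilation d, a kernel of size k covers d·(k−1)+1 input samples using the same k parameters:

```python
import numpy as np

# Sketch: 1-D dilated ("atrous") convolution. With dilation d, a kernel of
# size k spans d*(k-1)+1 samples while keeping only k learnable weights.
def dilated_conv1d(x, kernel, dilation=1):
    k = len(kernel)
    span = dilation * (k - 1) + 1            # effective receptive field
    out = np.zeros(len(x) - span + 1)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * dilation] for j in range(k))
    return out

x = np.arange(10, dtype=float)
same_params = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1)  # field = 3
wider_field = dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=3)  # field = 7
```

ASPP runs several such convolutions in parallel at different dilation rates and concatenates the results, capturing context at multiple scales at once.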
Decoder / Upsampling
Features are progressively upsampled to input resolution. U-Net-style skip connections recover spatial detail lost in downsampling. SegFormer uses a lightweight MLP decoder; Mask2Former uses masked cross-attention.
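A single U-Net-style decoder step can be sketched as upsample-then-concatenate. The sketch below uses nearest-neighbor upsampling as a simplified stand-in for the learned or bilinear upsampling real decoders use:

```python
import numpy as np

# Sketch of one U-Net-style decoder step: upsample the deeper feature map,
# then concatenate the matching encoder skip feature along channels.
def upsample2x(feat):
    """feat: (C, H, W) -> (C, 2H, 2W) by repeating pixels (nearest-neighbor)."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(deep_feat, skip_feat):
    up = upsample2x(deep_feat)
    assert up.shape[1:] == skip_feat.shape[1:], "spatial sizes must match"
    return np.concatenate([up, skip_feat], axis=0)  # fuse along channel axis

deep = np.zeros((64, 16, 16))   # deeper, coarser feature map
skip = np.zeros((32, 32, 32))   # encoder skip at twice the resolution
fused = decoder_step(deep, skip)  # shape (96, 32, 32)
```

In a real decoder a convolution follows the concatenation to mix the fused channels; repeating this step per level walks the features back up to input resolution.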
Per-Pixel Classification
A 1×1 convolution projects features to C channels (one per class), producing dense logits. Cross-entropy or dice loss supervises every pixel independently.
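The "supervise every pixel independently" step is just cross-entropy broadcast over the spatial grid. A minimal numpy sketch of per-pixel cross-entropy from dense logits of shape (C, H, W) and an integer label map of shape (H, W):

```python
import numpy as np

# Sketch: per-pixel cross-entropy loss over dense class logits.
def pixel_cross_entropy(logits, labels):
    C, H, W = logits.shape
    # softmax over the class dimension, numerically stabilized
    z = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    # gather each pixel's predicted probability for its true class
    p_true = probs[labels, np.arange(H)[:, None], np.arange(W)[None, :]]
    return float(-np.log(p_true).mean())

logits = np.zeros((3, 2, 2))          # uniform logits over 3 classes
labels = np.array([[0, 1], [2, 0]])
loss = pixel_cross_entropy(logits, labels)  # -log(1/3) for uniform predictions
```

Dice loss replaces or complements this when foreground pixels are scarce, since it scores region overlap rather than averaging per-pixel error.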
Evaluation
Mean IoU (mIoU) — the intersection-over-union averaged across all classes — is the standard metric. Cityscapes (19 classes, urban), ADE20K (150 classes, diverse), and PASCAL VOC (21 classes) are the primary benchmarks.
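The metric itself is simple to compute: per class, IoU is intersection over union of the predicted and ground-truth masks, then the per-class scores are averaged. A minimal sketch (real evaluators accumulate a confusion matrix over the whole validation set rather than scoring one image):

```python
import numpy as np

# Sketch: mean IoU between a predicted and a ground-truth label map.
# IoU_c = TP_c / (TP_c + FP_c + FN_c), averaged over classes that appear.
def mean_iou(pred, gt, num_classes):
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

gt   = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, gt, num_classes=2)  # (1/2 + 2/3) / 2
```

Because mIoU averages over classes, a rare class counts as much as a dominant one — which is exactly why the class-imbalance challenge below hurts leaderboard scores.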
Current Landscape
Semantic segmentation in 2025 is split between task-specific architectures optimized for benchmarks and foundation models that segment anything. On the benchmark side, transformer-based models (Mask2Former, SegFormer) and scaled CNNs (InternImage, ConvNeXt) compete for ADE20K and Cityscapes leaderboard positions, with mIoU gains now in the sub-1% range. On the practical side, SAM/SAM 2 has fundamentally changed the game: segment any object with a click or text prompt, then classify the segments however you want. Open-vocabulary segmentation (CAT-Seg, SAN, ODISE) merges CLIP's language understanding with dense prediction, making fixed-class segmenters feel outdated.
Key Challenges
Class imbalance — rare classes (e.g., cyclists, poles in driving scenes) get overwhelmed by dominant classes (road, sky), requiring loss weighting or oversampling
Fine boundary precision — segmentation masks are typically blurry at object edges, which matters for medical imaging (tumor margins) and autonomous driving (pedestrian boundaries)
Annotation cost — pixel-level labeling takes 1-4 hours per image for complex scenes (Cityscapes reports 90 minutes average), making large-scale datasets extremely expensive
Domain gap — models trained on daytime urban scenes (Cityscapes) degrade on night, rain, snow, and different cities without domain adaptation
Computational cost of dense prediction at full resolution — most models operate at 512×512 or 1024×1024, but applications like satellite imagery need 4K+ resolution
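The loss-weighting fix mentioned for class imbalance is often implemented with inverse-frequency class weights. A hedged sketch — the inverse-frequency heuristic below is one common choice, not a standard prescribed by any particular paper:

```python
import numpy as np

# Sketch: inverse-frequency class weights to counter class imbalance.
# Rare classes (cyclists, poles) get large weights; dominant ones (road,
# sky) get small weights. One common heuristic among several.
def inverse_freq_weights(labels, num_classes):
    counts = np.bincount(labels.ravel(), minlength=num_classes).astype(float)
    freq = counts / counts.sum()
    weights = 1.0 / np.maximum(freq, 1e-8)        # rare -> large weight
    return weights / weights.sum() * num_classes  # normalize to mean 1

labels = np.array([[0, 0, 0, 1]])    # class 0 dominates 3:1
w = inverse_freq_weights(labels, 2)  # class 1 weighted up
```

These weights then scale the per-pixel cross-entropy terms, so errors on rare classes cost more than errors on dominant ones.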
Quick Recommendations
Best mIoU (general scenes)
InternImage-H or BEiT-3 + UperNet
62%+ mIoU on ADE20K; InternImage's deformable convolutions handle multi-scale objects well
Real-time (autonomous driving)
SegFormer-B2 or DDRNet-23
80%+ Cityscapes mIoU at 30+ FPS on a single GPU; SegFormer is easier to train, DDRNet is faster
Medical imaging
nnU-Net (auto-configured) or MedSAM
nnU-Net self-configures architecture, preprocessing, and training for any medical dataset; wins most segmentation challenges
Open-vocabulary / novel classes
SAM 2 + CLIP-based classifier (CAT-Seg, SAN)
Segment anything via prompts, then classify segments into arbitrary text categories — no per-class training needed
Low-annotation regime
SAM + pseudo-label pipeline
Use SAM for automatic mask proposals, manually verify a subset, then train a task-specific model on the pseudo-labels
What's Next
The future is universal segmentation models that handle semantic, instance, and panoptic segmentation with open-vocabulary class sets. SAM 3.0 and its successors will likely subsume traditional segmentation pipelines. Key research frontiers: 3D semantic segmentation for robotics (point clouds, NeRF-based), video segmentation with temporal consistency, and weakly-supervised methods that learn from image-level labels or natural language descriptions instead of expensive pixel annotations.
Benchmarks & SOTA
ADE20K
ADE20K Scene Parsing Benchmark
20K training, 2K validation images annotated with 150 object categories. Complex scene parsing benchmark.
State of the Art: InternImage-H (Shanghai AI Lab) — 62.9 mIoU
Cityscapes
Cityscapes Dataset
5,000 images with fine annotations and 20,000 with coarse annotations of urban street scenes.
No results tracked yet