Image Segmentation: From FCN to SAM 2
A decade of pixel-level understanding. From hand-designed features to "click on anything, get a perfect mask."
What is Image Segmentation?
Image segmentation is the task of dividing an image into meaningful regions at the pixel level. Unlike object detection (which draws bounding boxes), segmentation produces exact masks that follow object boundaries. Unlike classification (which assigns one label to an entire image), segmentation labels every single pixel.
Understanding the history of segmentation isn't academic trivia. Each generation solved a specific limitation of the last, and those limitations explain why modern architectures like SAM look the way they do. The field evolved through five distinct eras, each driven by a key insight.
There are three types of segmentation:
Semantic Segmentation
All people share one label. No distinction between individuals.
Instance Segmentation
Each object gets a unique ID. Background is ignored.
Panoptic Segmentation
Every pixel labeled. Things get instance IDs, stuff gets class labels.
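To make the three output formats concrete, here is a toy numpy sketch of what each output array holds. Shapes and IDs are illustrative, not any particular library's convention:

```python
import numpy as np

H, W = 4, 4

# Semantic: each pixel holds a class ID (0 = background, 1 = person).
semantic = np.zeros((H, W), dtype=np.int64)
semantic[1:3, 0:2] = 1  # person A
semantic[1:3, 2:4] = 1  # person B, indistinguishable from A

# Instance: each pixel holds an instance ID; background (0) is ignored.
instance = np.zeros((H, W), dtype=np.int64)
instance[1:3, 0:2] = 1  # person #1
instance[1:3, 2:4] = 2  # person #2

# Panoptic: (class_id, instance_id) per pixel; "stuff" gets instance_id 0.
panoptic = np.stack([semantic, instance], axis=-1)

assert semantic.shape == (H, W)
assert panoptic.shape == (H, W, 2)
```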
A Decade of Pixel-Perfect Progress
Before 2014, image segmentation relied on hand-crafted features — superpixels, CRFs, random forests. The results were brittle and category-specific. Then fully convolutional networks changed everything, kicking off a rapid succession of architectural breakthroughs that culminated in models capable of segmenting anything with a single click.
Each generation solved one critical limitation of the last. Understanding this chain of problems and solutions is the fastest way to grasp why SAM's architecture looks the way it does.
Fully Convolutional Networks (FCN)
Jonathan Long, Evan Shelhamer, and Trevor Darrell at UC Berkeley had a deceptively simple insight: take an image classification network (like VGG-16), rip out the fully connected layers, and replace them with convolutional layers that produce spatial output maps instead of a single class label. The network could now accept any input size and produce a dense prediction — one label per pixel.
"We show that a fully convolutional network (FCN) trained end-to-end, pixels-to-pixels on semantic segmentation exceeds the state-of-the-art without further machinery."
— Long, J. et al. (2015). Fully Convolutional Networks for Semantic Segmentation. CVPR. 30,000+ citations.
FCN introduced skip connections from earlier layers to combine coarse, high-level semantics with fine, low-level spatial detail. The FCN-8s variant fused predictions from three different resolutions, achieving 62.2 mIoU on PASCAL VOC 2012 — a 20% relative improvement over previous methods.
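The fusion step can be sketched in a few lines of numpy. Nearest-neighbor upsampling stands in for the paper's learned deconvolution layers, and random score maps stand in for real VGG features:

```python
import numpy as np

rng = np.random.default_rng(0)

def upsample2x(x):
    # Nearest-neighbor 2x upsampling; FCN learns this as a deconvolution.
    return x.repeat(2, axis=0).repeat(2, axis=1)

C = 21  # PASCAL VOC: 20 classes + background
# Per-class score maps from three backbone stages (strides 32, 16, 8).
pool5 = rng.standard_normal((8, 8, C))    # coarsest, most semantic
pool4 = rng.standard_normal((16, 16, C))
pool3 = rng.standard_normal((32, 32, C))  # finest, best localized

# FCN-8s: fuse coarse semantics into finer maps, stage by stage.
fused = upsample2x(pool5) + pool4   # 16x16 (the FCN-16s level)
fused = upsample2x(fused) + pool3   # 32x32 (the FCN-8s level)
# A final 8x upsample would restore the 256x256 input resolution.
pred = fused.argmax(axis=-1)        # one class label per pixel
assert pred.shape == (32, 32)
```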
Why FCN mattered
Before FCN, segmentation pipelines were Rube Goldberg machines: extract hand-crafted features, run superpixel algorithms, apply CRFs for boundary refinement. FCN showed that a single end-to-end trained network could beat all of that. Every segmentation model since — U-Net, DeepLab, Mask R-CNN, SAM — is architecturally a descendant of FCN's core idea.
U-Net — The Encoder-Decoder That Won Medicine
Olaf Ronneberger, Philipp Fischer, and Thomas Brox at the University of Freiburg designed U-Net for a specific problem: biomedical image segmentation, where labeled data is scarce and boundary precision is critical. The architecture was symmetric — an encoder that contracts (downsamples), a decoder that expands (upsamples), and skip connections that concatenate encoder features directly to the corresponding decoder level.
# U-Net: the symmetric encoder-decoder
# Encoder (contracting path):
x1 = conv_block(input)        # 572×572 → 568×568, 64 channels
x2 = pool + conv_block(x1)    # 284×284 → 280×280, 128 channels
x3 = pool + conv_block(x2)    # 140×140 → 136×136, 256 channels
x4 = pool + conv_block(x3)    # 68×68 → 64×64, 512 channels
x5 = pool + conv_block(x4)    # 32×32 → 28×28, 1024 channels (bottleneck)
# Decoder (expanding path) — skip connections are the key:
d4 = up_conv(x5) + CROP_AND_CONCAT(x4)  # combine with encoder features
d3 = up_conv(d4) + CROP_AND_CONCAT(x3)
d2 = up_conv(d3) + CROP_AND_CONCAT(x2)
d1 = up_conv(d2) + CROP_AND_CONCAT(x1)
output = conv_1x1(d1)  # per-pixel class probabilities
— Ronneberger, O. et al. (2015). U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI. 75,000+ citations.
Two key innovations made U-Net practical where FCN wasn't: heavy data augmentation (elastic deformations, rotations) to learn from as few as 30 training images, and a weighted loss function that penalized errors at cell boundaries — critical for separating touching cells under a microscope. U-Net won the ISBI cell tracking challenge by a wide margin and became the de facto standard in medical imaging. Its encoder-decoder architecture with skip connections remains the backbone of segmentation to this day.
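The boundary-weighted loss is easy to sketch. This is a simplified stand-in for the paper's formulation, which derives the weight map from distances to the two nearest cells; here the weight map is simply given:

```python
import numpy as np

def weighted_pixel_ce(probs, labels, weight_map):
    """Per-pixel cross-entropy scaled by a precomputed weight map.

    U-Net's weight map is large near boundaries between touching cells,
    forcing the network to learn the thin separating background strips.
    """
    h, w, _ = probs.shape
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    return float(np.mean(weight_map * -np.log(p_true + 1e-12)))

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(2), size=(8, 8))  # softmax outputs, 2 classes
labels = rng.integers(0, 2, size=(8, 8))
weights = np.ones((8, 8))
weights[:, 3:5] = 10.0  # pretend columns 3-4 are a cell boundary
loss = weighted_pixel_ce(probs, labels, weights)
assert loss > 0
```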
Mask R-CNN — Adding a Mask Head
FCN and U-Net solved semantic segmentation — labeling every pixel with a class. But they couldn't distinguish between individual instances of the same class (person #1 vs person #2). Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick at Facebook AI Research (FAIR) solved this with an elegant extension: take Faster R-CNN (already excellent at detecting objects with bounding boxes), and add a small fully convolutional branch that predicts a binary mask for each detected object.
"Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. [...] Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN."
— He, K. et al. (2017). Mask R-CNN. ICCV. 20,000+ citations.
The critical innovation was RoIAlign — replacing the previous RoIPool with bilinear interpolation to avoid the harsh spatial quantization that destroyed fine mask details. This single change improved mask AP by 10–50% relative. Mask R-CNN achieved 37.1 AP on COCO instance segmentation and became the dominant framework for instance segmentation for the next four years.
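The bilinear interpolation at the heart of RoIAlign is a few lines of numpy. This is a minimal sketch of a single sample; real RoIAlign averages several such samples per output bin:

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Sample feat (H, W) at a continuous (y, x) coordinate.

    RoIPool snapped coordinates to integers (quantization); RoIAlign
    instead interpolates between the four surrounding feature values,
    preserving sub-pixel alignment between the box and the feature map.
    """
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, feat.shape[0] - 1)
    x1 = min(x0 + 1, feat.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx)
            + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx)
            + feat[y1, x1] * dy * dx)

feat = np.arange(16, dtype=float).reshape(4, 4)
# Halfway between feat[1, 1] = 5 and feat[1, 2] = 6:
assert np.isclose(bilinear_sample(feat, 1.0, 1.5), 5.5)
```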
The Detect-then-Segment paradigm
Mask R-CNN established a two-stage approach: first detect objects (propose bounding boxes), then segment within each box. This is the same paradigm SAM uses seven years later — the image encoder plays the role of the detection backbone, and the mask decoder segments within prompt-defined regions. The difference is that SAM replaces category-specific detection with any prompt type.
Panoptic Segmentation: Unifying the Tasks
Alexander Kirillov et al. at FAIR proposed the panoptic segmentation task and metric (PQ), arguing that semantic and instance segmentation should not be treated as separate problems. The Panoptic FPN (Feature Pyramid Network) model showed that a single backbone could simultaneously label every "stuff" pixel (sky, road, grass) and every "thing" instance (car #1, person #2). This unification simplified annotation pipelines and set the stage for models that reason about scenes holistically.
The DeepLab Family — Atrous Convolutions at Scale
Simultaneously with U-Net's rise in medicine, Liang-Chieh Chen and colleagues at Google were attacking a different limitation: the loss of spatial resolution caused by pooling layers. Their solution was atrous (dilated) convolutions — convolutions with holes that increase the receptive field without reducing resolution.
The DeepLab lineage progressed rapidly:
DeepLab v1 (2015)
Combined atrous convolutions with a CRF (Conditional Random Field) post-processing step to sharpen boundaries. 71.6 mIoU on PASCAL VOC.
DeepLab v2 (2017)
Introduced Atrous Spatial Pyramid Pooling (ASPP) — applying atrous convolutions at multiple rates (6, 12, 18, 24) in parallel, then fusing them. This captured objects at multiple scales simultaneously, a persistent challenge in segmentation. 79.7 mIoU on VOC.
— Chen, L.-C. et al. (2018). DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI.
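A 1-D numpy sketch shows why atrous convolution grows the receptive field for free; ASPP runs this at several rates in parallel and fuses the results:

```python
import numpy as np

def atrous_conv1d(x, kernel, rate):
    """1-D dilated convolution: taps are spaced `rate` samples apart.

    A 3-tap kernel at rate r covers 2*r + 1 input samples, so the
    receptive field grows with the rate while the parameter count and
    output resolution stay fixed -- DeepLab's core trick.
    """
    k = len(kernel)
    span = (k - 1) * rate
    out = np.zeros(len(x) - span)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * x[i + j * rate] for j in range(k))
    return out

x = np.arange(20, dtype=float)
k = np.array([1.0, 1.0, 1.0])
assert len(atrous_conv1d(x, k, rate=1)) == 18  # receptive field 3
assert len(atrous_conv1d(x, k, rate=6)) == 8   # receptive field 13
```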
DeepLab v3/v3+ (2017-2018)
Added batch normalization to ASPP, an image-level feature branch for global context, and an encoder-decoder structure. DeepLab v3+ achieved 89.0 mIoU on PASCAL VOC 2012 — a result that stood as the benchmark standard for years.
— Chen, L.-C. et al. (2018). Encoder-Decoder with Atrous Separable Convolution. ECCV. 15,000+ citations.
The multi-scale insight
A person photographed close-up occupies most of the image. The same person at a distance might be 20 pixels tall. Segmentation networks need to handle both extremes. ASPP solved this by looking at the same location through multiple receptive field sizes simultaneously — like viewing the image through zoom lenses of different focal lengths in parallel. SAM 2's hierarchical Hiera backbone achieves the same goal with a different mechanism (multi-scale feature pyramids from vision transformers).
PSPNet: Pyramid Pooling for Global Context
Hengshuang Zhao et al. at the Chinese University of Hong Kong proposed the Pyramid Scene Parsing Network (PSPNet), which pools features at four different scales (1x1, 2x2, 3x3, 6x6 grids), then upsamples and concatenates them. This gave the model global scene context — understanding that a pixel is part of a "boat" is easier if the network knows the scene is a "lake." PSPNet won the ImageNet Scene Parsing Challenge 2016 with 57.21 mIoU on ADE20K.
— Zhao, H. et al. (2017). Pyramid Scene Parsing Network. CVPR.
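The pyramid pooling module is simple enough to sketch in numpy. The real PSPNet applies this per channel and concatenates the branches with the original feature map:

```python
import numpy as np

def pyramid_pool(feat, grids=(1, 2, 3, 6)):
    """Average-pool feat (H, W) into g x g grids, then upsample back.

    Mirrors PSPNet's pyramid pooling: the 1x1 branch is pure global
    scene context, finer grids keep a coarse spatial layout.
    """
    H, W = feat.shape
    outs = []
    for g in grids:  # assumes H and W are divisible by each g
        pooled = feat.reshape(g, H // g, g, W // g).mean(axis=(1, 3))
        outs.append(pooled.repeat(H // g, axis=0).repeat(W // g, axis=1))
    return np.stack(outs)

rng = np.random.default_rng(0)
feat = rng.standard_normal((6, 6))
branches = pyramid_pool(feat)
assert branches.shape == (4, 6, 6)
assert np.isclose(branches[0, 0, 0], feat.mean())  # 1x1 = global average
```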
SegFormer & MaskFormer: Attention Replaces Convolution
The Transformer revolution that rewrote NLP in 2017 finally reached segmentation in 2021. SegFormer (Xie et al., NeurIPS 2021) replaced convolutional encoders with a hierarchical vision transformer, achieving better results with simpler decoders — no CRFs, no ASPP, just MLP heads on multi-scale transformer features. 51.0 mIoU on ADE20K.
MaskFormer (Cheng et al., NeurIPS 2021) reframed segmentation as a mask classification problem: predict N binary masks and assign a class to each, rather than predicting a class for each pixel independently. Mask2Former (2022) refined this with masked attention and multi-scale features, setting new records across semantic, instance, and panoptic segmentation with one architecture — 57.8 mIoU on ADE20K.
— Xie, E. et al. (2021). SegFormer. NeurIPS.
— Cheng, B. et al. (2022). Masked-attention Mask Transformer (Mask2Former). CVPR.
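MaskFormer's inference rule can be sketched in numpy: per-pixel class scores are the mask probabilities weighted by each query's class distribution. Random logits stand in for real network outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, H, W = 5, 3, 8, 8  # 5 mask queries, 3 classes

mask_logits = rng.standard_normal((N, H, W))  # one binary-mask logit map per query
class_logits = rng.standard_normal((N, C))    # one class prediction per query

# Per-pixel score for class c = sum over queries of
#   P(query covers pixel) * P(query is class c)
mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))        # sigmoid, (N, H, W)
class_probs = np.exp(class_logits)
class_probs /= class_probs.sum(axis=1, keepdims=True)  # softmax, (N, C)
pixel_scores = np.einsum('nhw,nc->chw', mask_probs, class_probs)

semantic_map = pixel_scores.argmax(axis=0)  # (H, W) class IDs
assert semantic_map.shape == (H, W)
```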
SAM: Segment Anything Model
Alexander Kirillov, Eric Mintun, Nikhila Ravi et al. at Meta AI made the leap from task-specific to foundation model. Instead of training on fixed category sets (COCO's 80 classes, ADE20K's 150 classes), SAM was trained on SA-1B — a dataset of 11 million images and 1.1 billion masks, collected through a data engine that used early model predictions to accelerate human annotation.
"We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. [...] Our goal is to build a foundation model for image segmentation."
— Kirillov, A. et al. (2023). Segment Anything. ICCV. 7,000+ citations in under two years.
The key architectural insight: decouple the heavy computation from the prompt. Run the image encoder once (expensive), then decode masks interactively for any prompt (cheap). This is the same asymmetry that makes search engines work — index once, query many times.
SAM 2: Images and Video, Unified
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu et al. at Meta extended SAM to video with a streaming memory architecture. The model maintains a memory bank of past predictions and uses cross-attention to propagate masks through time. A memory attention module conditions current-frame predictions on stored memory features, enabling objects to be tracked through occlusions, appearance changes, and camera motion.
"SAM 2 [...] achieves better segmentation accuracy while being 6x faster than SAM for images and produces the first whole-image high-quality video segmentation."
— Ravi, N. et al. (2024). SAM 2: Segment Anything in Images and Videos. arXiv.
SAM 2 replaced SAM's ViT-H backbone with Hiera (a hierarchical vision transformer), achieving both speed and accuracy improvements. The model was trained on the new SA-V dataset — 50,900 videos with 642,600 masklets (spatio-temporal masks), the largest video segmentation dataset ever created.
The throughline: 2014 → 2024
A decade. One problem — pixel-level understanding — refined through five paradigm shifts.
Every advance solved a limitation of the previous generation. Every generation preserved the core goal: assign the right label to every pixel in the image.
How SAM Works: Architecture
SAM's architecture has three components. The image encoder runs once per image (the expensive part), then prompts and mask decoding are lightweight — enabling real-time interactive segmentation.
Why This Architecture Is Clever
The image encoder (Hiera backbone) takes ~150ms per image on a modern GPU. The prompt encoder + mask decoder take ~6ms. By decoupling them, SAM enables a workflow where you encode the image once, then try dozens of prompts interactively — each taking only 6ms. This is what makes real-time interactive segmentation possible.
Compare this to Mask R-CNN, which must run the entire backbone for every new inference. SAM's decoupled design is directly inspired by search engine architecture: index (encode) once, query (prompt) many times.
SAM 2: Point Prompt Segmentation
The most intuitive way to use SAM: click on an object, get its mask. SAM 2 uses a hierarchical vision transformer (Hiera) for faster, more accurate predictions.
1. Click on object
2. Get precise mask
Refine with multiple points: positive (include) and negative (exclude)
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np
# Load model (various sizes available)
checkpoint = 'sam2_hiera_large.pt'
model_cfg = 'sam2_hiera_l.yaml'
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
# Load and set image
image = np.array(Image.open('photo.jpg'))
predictor.set_image(image)
# Point prompt - click on object
input_point = np.array([[500, 375]]) # x, y coordinates
input_label = np.array([1]) # 1 = foreground
# Get masks (returns multiple candidates)
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)
best_mask = masks[scores.argmax()]
Prompt Types
Positive Points (label=1)
Click on object you want to segment
Multiple points refine the selection
Negative Points (label=0)
Click to exclude regions from mask
Useful for separating overlapping objects
input_points = np.array([
    [500, 375],  # foreground point 1
    [520, 390],  # foreground point 2
    [100, 100],  # background point (exclude)
])
input_labels = np.array([1, 1, 0])  # 1=include, 0=exclude
masks, scores, _ = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False  # single refined mask
)
Automatic Mask Generation
SAM can also automatically segment everything in an image without any prompts. It uses a grid of points to discover all objects and outputs detailed metadata for each mask.
1. Sample point grid
2. All masks generated
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator
mask_generator = SAM2AutomaticMaskGenerator(
    build_sam2(model_cfg, checkpoint)
)
masks = mask_generator.generate(image)
# Each mask contains rich metadata
for mask in masks:
    print(f"Area: {mask['area']}, IoU: {mask['predicted_iou']:.2f}")
Mask Metadata
| Field | Description |
|---|---|
| segmentation | Binary mask (H x W boolean array) |
| area | Number of pixels in the mask |
| bbox | Bounding box [x, y, width, height] |
| predicted_iou | Model's confidence in mask quality (0-1) |
| stability_score | How stable the mask is to threshold changes |
mask_generator = SAM2AutomaticMaskGenerator(
    model=build_sam2(model_cfg, checkpoint),
    points_per_side=32,           # density of point grid
    pred_iou_thresh=0.88,         # min IoU score
    stability_score_thresh=0.95,  # min stability
    min_mask_region_area=100,     # filter tiny masks
)
Grounded SAM: Text-Prompt Segmentation
What if you could just say "segment the cat"? Grounded SAM combines a text-to-box detector (Grounding DINO) with SAM to enable natural language segmentation.
1. Text prompt
"cat . dog"
2. Grounding DINO detects
Grounding DINO → boxes
3. SAM segments each box
SAM → pixel masks
# Step 1: Grounding DINO detects objects from text
# Step 2: SAM segments within detected boxes
from groundingdino.util.inference import load_model, predict
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
# Load both models
grounding_model = load_model(
    'GroundingDINO_SwinB_cfg.py',      # model config
    'groundingdino_swinb_cogcoor.pth'  # checkpoint
)
sam_predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
# Detect objects with text prompt
text_prompt = "cat . dog . person" # classes separated by ' . '
boxes, logits, phrases = predict(
    model=grounding_model,
    image=image,
    caption=text_prompt,
    box_threshold=0.35,
    text_threshold=0.25
)
# Segment each detected box
sam_predictor.set_image(image)
for box, phrase in zip(boxes, phrases):
    masks, _, _ = sam_predictor.predict(box=box)
    print(f"Segmented: {phrase}")
Why Grounded SAM Matters
Traditional segmentation requires pre-defined classes. Grounded SAM works with any text description, enabling open-vocabulary segmentation. Ask for "the red car on the left" or "damaged areas on the wall" — it understands natural language. This bridges the gap between SAM's category-agnostic masks and the semantic labels that downstream applications need.
Real-World Applications
Background Removal
Product photography, portrait editing, compositing. One click removes backgrounds cleanly.
fg_mask = masks[scores.argmax()]
result = image * fg_mask[..., None]
Object Isolation for Editing
Select objects precisely for color correction, style transfer, or inpainting.
edited = image.copy()
edited[mask] = apply_effect(image[mask])
Video Object Tracking (SAM 2)
Segment once, track across frames. SAM 2 propagates masks through video automatically.
predictor.add_new_points(frame_idx=0, ...)
masks = predictor.propagate_in_video()
Interactive Annotation Tools
Speed up dataset labeling 10x. Click instead of tracing polygon boundaries.
annotation = mask_to_coco(mask, image_id)
dataset['annotations'].append(annotation)
SAM Model Variants
SAM 2 comes in multiple sizes to balance speed and quality. Choose based on your latency requirements and hardware.
| Model | Config | Parameters | Speed | Use Case |
|---|---|---|---|---|
| SAM 2 Tiny | sam2_hiera_t.yaml | 38M | Fastest | Real-time, mobile |
| SAM 2 Small | sam2_hiera_s.yaml | 46M | Fast | Balanced performance |
| SAM 2 Base+ | sam2_hiera_b+.yaml | 80M | Medium | Production quality |
| SAM 2 Large | sam2_hiera_l.yaml | 224M | Slower | Maximum accuracy |
Choosing the Right Size
The speed/quality tradeoff is not linear. SAM 2 Small achieves 93% of Large's quality at 3x the speed. For most production workloads, Base+ is the sweet spot. Use Large only for annotation pipelines where quality matters more than latency, and Tiny for real-time video applications on edge devices.
All variants share the same prompt encoder and mask decoder — only the image encoder backbone differs. This means the interactive prompting experience is equally fast across all sizes.
Segmentation Benchmarks
Segmentation models are evaluated on standard datasets. Key metrics include mIoU (mean Intersection over Union) for semantic segmentation and AP (Average Precision) for instance segmentation. The field has progressed from ~62 mIoU (FCN, 2014) to ~89 mIoU (DeepLab v3+, 2018) on PASCAL VOC, and from ~37 AP (Mask R-CNN, 2017) to ~50+ AP (Mask2Former, 2022) on COCO.
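mIoU itself is a few lines of numpy. A worked toy example with two classes; the same metric scales to 21 VOC or 150 ADE20K classes:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: per-class intersection / union, averaged over classes present."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both pred and gt
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
pred = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
# class 0: inter 3 / union 4 = 0.75; class 1: inter 4 / union 5 = 0.8
assert np.isclose(mean_iou(pred, gt, 2), (0.75 + 0.8) / 2)
```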
COCO Instance Segmentation
80 object categories, 330K images. Standard for instance segmentation. AP measures mask quality across IoU thresholds from 0.5 to 0.95.
ADE20K Semantic Segmentation
150 semantic classes, 25K images. Scene understanding benchmark. mIoU measures average overlap between predicted and ground truth masks across all classes.
SAM 2 Zero-Shot Performance
SA-V test zero-shot video object segmentation. J&F score combines Jaccard (region) and F-measure (boundary). SAM 2 Large achieves 46.5 J&F without any task-specific fine-tuning — a testament to the foundation model approach.
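A simplified numpy sketch of the J&F metric. J is plain mask IoU; F compares boundary pixels, here with exact matching rather than the distance-tolerant matching DAVIS and SA-V actually use:

```python
import numpy as np

def jaccard(pred, gt):
    """J: region similarity, plain mask IoU."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def boundary(mask):
    # A pixel is boundary if any 4-neighbor differs (simplified).
    m = mask.astype(bool)
    pad = np.pad(m, 1, mode='edge')
    interior = pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:]
    return m & ~interior

def f_measure(pred, gt):
    """F: boundary precision/recall harmonic mean."""
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 or bg.sum() == 0:
        return 1.0 if bp.sum() == bg.sum() else 0.0
    precision = (bp & bg).sum() / bp.sum()
    recall = (bp & bg).sum() / bg.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.zeros((8, 8), bool); gt[2:6, 2:6] = True
pred = np.zeros((8, 8), bool); pred[2:6, 2:6] = True
jf = (jaccard(pred, gt) + f_measure(pred, gt)) / 2
assert np.isclose(jf, 1.0)  # identical masks -> perfect J&F
```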
What SAM Cannot Do (Yet)
Foundation models are powerful but not omniscient. Understanding SAM's limitations is essential for building reliable systems.
No Semantic Labels
SAM produces masks, not labels. It can segment a dog from its background, but it doesn't know the object is a dog. You need an external classifier (like Grounding DINO or CLIP) to attach meaning to SAM's masks. This is by design — it keeps the model category-agnostic — but it means SAM alone cannot replace a full panoptic segmentation pipeline.
Struggles with Thin Structures and Transparency
Fine structures like bicycle spokes, chain-link fences, hair strands, and transparent objects (glass, water) remain challenging. The 64x64 feature map resolution means the mask decoder operates at 1/16th of the original resolution — fine details below this threshold are lost. The model also has no explicit mechanism for reasoning about transparency or refraction.
Ambiguity in Complex Scenes
A single click on a person's shirt might return the shirt, the torso, or the entire person — SAM returns three mask candidates at different granularities (sub-part, part, whole object) to handle this, but the user must choose. In automated pipelines, this ambiguity requires careful prompt engineering or multi-round refinement to resolve.
The honest assessment
SAM is a foundation model for mask generation, not a complete segmentation solution. Its power lies in being combined with other components:
1. SAM + Grounding DINO = open-vocabulary instance segmentation
2. SAM + CLIP = zero-shot semantic segmentation
3. SAM + tracking = video object segmentation (SAM 2 does this natively)
4. SAM + human-in-the-loop = interactive annotation at 10x the speed of polygon tracing
The composability is the point. SAM replaced the need for task-specific segmentation models by providing a universal mask primitive that other systems can build on.
Key Takeaways
1. Segmentation evolved through five eras — from FCN's end-to-end training (2014) through U-Net's encoder-decoder, Mask R-CNN's instance awareness, DeepLab's multi-scale reasoning, to SAM's foundation model approach (2023).
2. SAM 2 enables promptable segmentation — click on any object to get a precise mask, no training required. The decoupled encoder-decoder design makes interactive use real-time.
3. Grounded SAM adds semantic understanding — combine with Grounding DINO for natural language prompts like "segment the cat."
4. SAM is a mask primitive, not a complete solution — its power lies in composability. Pair it with classifiers, trackers, or human annotators for complete segmentation pipelines.