
Image Segmentation with SAM

Precise pixel-level understanding. Click on anything, get a perfect mask.

What is Image Segmentation?

Image segmentation is the task of dividing an image into meaningful regions at the pixel level. Unlike object detection (which draws bounding boxes), segmentation produces exact masks that follow object boundaries.

There are three types of segmentation:

Semantic Segmentation

Labels every pixel with a class (sky, road, person) but does not distinguish between instances. All people are labeled "person" with no separation.

Instance Segmentation

Separates individual object instances. Person 1, Person 2, Person 3 each get their own mask. Essential for counting and tracking.

Panoptic Segmentation

Combines both: instance masks for countable things (people, cars) plus semantic labels for uncountable stuff (sky, grass). The complete picture.
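
The difference is easiest to see in the shape of the outputs. Below is a minimal NumPy sketch; the arrays are placeholders purely for illustration.

# Output structures for the three segmentation types (illustrative only)
import numpy as np

H, W = 480, 640
semantic_map = np.zeros((H, W), dtype=np.int64)    # one class ID per pixel
instance_masks = np.zeros((3, H, W), dtype=bool)   # one boolean mask per object instance
instance_classes = np.array([2, 2, 2])             # e.g. all three instances are "person"
# Panoptic: a class ID plus an instance ID per pixel (instance ID 0 for "stuff")
panoptic = np.zeros((H, W, 2), dtype=np.int64)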

The SAM Revolution

Meta's Segment Anything Model (SAM) changed the game in 2023. Instead of training on specific categories, SAM can segment any object with just a point click. SAM 2 (2024) extended this to video with real-time tracking.

SAM 2: Point Prompt Segmentation

The most intuitive way to use SAM: click on an object, get its mask. SAM 2 uses a hierarchical vision transformer (Hiera) for faster, more accurate predictions.

# SAM 2 Point Prompt Segmentation
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np

# Load model (various sizes available)
checkpoint = 'sam2_hiera_large.pt'
model_cfg = 'sam2_hiera_l.yaml'
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load and set image
image = np.array(Image.open('photo.jpg'))
predictor.set_image(image)

# Point prompt - click on object
input_point = np.array([[500, 375]]) # x, y coordinates
input_label = np.array([1]) # 1 = foreground

# Get masks (returns multiple candidates)
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True
)
best_mask = masks[scores.argmax()]
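
To sanity-check the result, you can overlay the chosen mask on the image. A quick matplotlib sketch (the alpha value is arbitrary):

# Overlay the best mask on the image for a quick visual check
import matplotlib.pyplot as plt

plt.imshow(image)
plt.imshow(best_mask, alpha=0.5)  # semi-transparent mask overlay
plt.axis('off')
plt.show()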

Prompt Types

Positive Points (label=1)

Click on the object you want to segment. Multiple points refine the selection.

Negative Points (label=0)

Click to exclude regions from the mask. Useful for separating overlapping objects.

# Multiple points for refinement
input_points = np.array([
    [500, 375],  # foreground point 1
    [520, 390],  # foreground point 2
    [100, 100],  # background point (exclude)
])
input_labels = np.array([1, 1, 0])  # 1=include, 0=exclude

masks, scores, _ = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False  # single refined mask
)

Automatic Mask Generation

SAM can also automatically segment everything in an image without any prompts. It uses a grid of points to discover all objects and outputs detailed metadata for each mask.

# Automatic mask generation - segment everything
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

mask_generator = SAM2AutomaticMaskGenerator(
    build_sam2(model_cfg, checkpoint)
)

masks = mask_generator.generate(image)

# Each mask contains rich metadata
for mask in masks:
    print(f"Area: {mask['area']}, IoU: {mask['predicted_iou']:.2f}")

Mask Metadata

Field            Description
segmentation     Binary mask (H x W boolean array)
area             Number of pixels in the mask
bbox             Bounding box [x, y, width, height]
predicted_iou    Model's confidence in mask quality (0-1)
stability_score  How stable the mask is to threshold changes

# Fine-tune automatic generation
mask_generator = SAM2AutomaticMaskGenerator(
    model=build_sam2(model_cfg, checkpoint),
    points_per_side=32,           # density of the point grid
    pred_iou_thresh=0.88,         # min predicted IoU score
    stability_score_thresh=0.95,  # min stability score
    min_mask_region_area=100,     # filter out tiny masks
)
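
The metadata also makes post-processing straightforward. A small sketch that regenerates masks with the tuned generator, keeps only large, confident ones, and sorts them by area (the thresholds here are arbitrary):

# Filter and rank masks using their metadata (thresholds are illustrative)
masks = mask_generator.generate(image)
keep = [m for m in masks if m['area'] > 500 and m['predicted_iou'] > 0.9]
keep.sort(key=lambda m: m['area'], reverse=True)

for m in keep[:5]:
    x, y, w, h = m['bbox']
    print(f"bbox=({x}, {y}, {w}, {h}), stability={m['stability_score']:.2f}")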

Grounded SAM: Text-Prompt Segmentation

What if you could just say "segment the cat"? Grounded SAM combines a text-to-box detector (Grounding DINO) with SAM to enable natural language segmentation.

# Grounded SAM: Text to Segmentation
# Step 1: Grounding DINO detects objects from text
# Step 2: SAM segments within detected boxes

import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load both models (Grounding DINO takes its config file plus a checkpoint)
grounding_model = load_model('GroundingDINO_SwinB_cfg.py', 'groundingdino_swinb_cogcoor.pth')
sam_predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# load_image returns the original image (numpy) and a normalized tensor for DINO
image_source, image_tensor = load_image('photo.jpg')

# Detect objects with a text prompt
text_prompt = "cat . dog . person"  # classes separated by ' . '
boxes, logits, phrases = predict(
    model=grounding_model,
    image=image_tensor,
    caption=text_prompt,
    box_threshold=0.35,
    text_threshold=0.25
)

# Grounding DINO boxes are normalized cxcywh; convert to pixel xyxy for SAM
h, w, _ = image_source.shape
boxes_xyxy = box_convert(boxes * torch.tensor([w, h, w, h]), in_fmt='cxcywh', out_fmt='xyxy').numpy()

# Segment each detected box
sam_predictor.set_image(image_source)
for box, phrase in zip(boxes_xyxy, phrases):
    masks, _, _ = sam_predictor.predict(box=box, multimask_output=False)
    print(f"Segmented: {phrase}")

Why Grounded SAM Matters

Traditional segmentation requires pre-defined classes. Grounded SAM works with any text description, enabling open-vocabulary segmentation. Ask for "the red car on the left" or "damaged areas on the wall" - it understands natural language.

Real-World Applications

Background Removal

Product photography, portrait editing, compositing. One click removes backgrounds cleanly.

# Remove background
fg_mask = masks[scores.argmax()]
result = image * fg_mask[..., None]
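
For a transparent cutout rather than a black background, the mask can be written into an alpha channel. A small sketch assuming an RGB uint8 image:

# Save a transparent PNG using the mask as the alpha channel
rgba = np.dstack([image, (fg_mask * 255).astype(np.uint8)])
Image.fromarray(rgba).save('cutout.png')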

Object Isolation for Editing

Select objects precisely for color correction, style transfer, or inpainting.

# Apply edit only to mask
edited = image.copy()
edited[mask] = apply_effect(image[mask])
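
As a concrete stand-in for apply_effect, here is a sketch that brightens only the masked pixels with plain NumPy (the +60 offset is arbitrary):

# Example effect: brighten only the masked region
edited = image.copy()
edited[mask] = np.clip(image[mask].astype(np.int32) + 60, 0, 255).astype(np.uint8)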

Video Object Tracking (SAM 2)

Segment once, track across frames. SAM 2 propagates masks through video automatically.

# SAM 2 video tracking
predictor.add_new_points(frame_idx=0, ...)
masks = predictor.propagate_in_video()
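
A slightly fuller sketch of the video workflow, assuming the sam2 video predictor API (build_sam2_video_predictor, init_state, add_new_points_or_box, propagate_in_video) and a directory of JPEG frames:

# Video tracking sketch (assumes the sam2 video predictor API)
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(model_cfg, checkpoint)
state = video_predictor.init_state(video_path='frames_dir/')  # extracted JPEG frames

# Prompt the object once on the first frame
video_predictor.add_new_points_or_box(
    state, frame_idx=0, obj_id=1,
    points=np.array([[500, 375]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate the mask through the remaining frames
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    frame_masks = (mask_logits > 0.0).cpu().numpy()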

Interactive Annotation Tools

Speed up dataset labeling 10x. Click instead of tracing polygon boundaries.

# Export COCO format
annotation = mask_to_coco(mask, image_id)
dataset['annotations'].append(annotation)
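
One concrete way to implement the mask_to_coco helper is with pycocotools run-length encoding. A sketch (the annotation and category IDs are placeholders):

# Sketch of mask_to_coco using pycocotools RLE encoding
from pycocotools import mask as mask_utils

def mask_to_coco(mask, image_id, ann_id=1, category_id=1):
    rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
    rle['counts'] = rle['counts'].decode('utf-8')  # make the RLE JSON-serializable
    return {
        'id': ann_id,
        'image_id': image_id,
        'category_id': category_id,
        'segmentation': rle,
        'area': float(mask_utils.area(rle)),
        'bbox': [float(v) for v in mask_utils.toBbox(rle)],  # [x, y, width, height]
        'iscrowd': 0,
    }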

SAM Model Variants

SAM 2 comes in multiple sizes to balance speed and quality. Choose based on your latency requirements and hardware.

Model        Config              Parameters  Speed    Use Case
SAM 2 Tiny   sam2_hiera_t.yaml   38M         Fastest  Real-time, mobile
SAM 2 Small  sam2_hiera_s.yaml   46M         Fast     Balanced performance
SAM 2 Base+  sam2_hiera_b+.yaml  80M         Medium   Production quality
SAM 2 Large  sam2_hiera_l.yaml   224M        Slower   Maximum accuracy

Segmentation Benchmarks

Segmentation models are evaluated on standard datasets. Key metrics include mIoU (mean Intersection over Union) for semantic segmentation and AP (Average Precision) for instance segmentation.
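
IoU itself is easy to compute for a pair of binary masks; mIoU averages it over classes, and AP additionally sweeps detection confidence thresholds. A minimal sketch:

# Intersection over Union between two binary masks
import numpy as np

def iou(pred, gt):
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty
    return np.logical_and(pred, gt).sum() / union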

COCO Instance Segmentation

80 object categories, 330K images. Standard for instance segmentation.

View leaderboard on CodeSOTA

ADE20K Semantic Segmentation

150 semantic classes, 25K images. Scene understanding benchmark.

View leaderboard on CodeSOTA

SAM 2 Zero-Shot Performance

SAM 2 Large    46.5 J&F
SAM 2 Base+    45.0 J&F
SAM 2 Small    43.5 J&F
SAM 1 (ViT-H)  40.2 J&F

Zero-shot video object segmentation on the SA-V test set. The J&F score combines the Jaccard index (region overlap) with the F-measure (boundary accuracy).

Key Takeaways

1. SAM 2 enables promptable segmentation - click on any object to get a precise mask, no training required.

2. Multiple prompt types - positive/negative points, bounding boxes, or automatic grid-based generation.

3. Grounded SAM adds text understanding - combine with Grounding DINO for "segment the cat" natural language prompts.

4. SAM 2 extends to video - segment once, track across frames. Essential for video editing and analysis.