Image Segmentation with SAM
Precise pixel-level understanding. Click on anything, get a perfect mask.
What is Image Segmentation?
Image segmentation is the task of dividing an image into meaningful regions at the pixel level. Unlike object detection (which draws bounding boxes), segmentation produces exact masks that follow object boundaries.
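To make that distinction concrete, here is a minimal sketch (with toy, hypothetical numbers) of how the two outputs differ for the same object:
import numpy as np

# A detector describes an object with 4 numbers: a bounding box (illustrative values)
box = [120, 80, 340, 400]  # [x_min, y_min, x_max, y_max]

# A segmentation model describes the same object with a per-pixel mask:
# a boolean array with the same height and width as the image
mask = np.zeros((480, 640), dtype=bool)
mask[80:400, 120:340] = True  # in practice the True region follows the object outline

print(f"Box: 4 values; mask: {mask.size} values ({mask.sum()} pixels inside the object)")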
There are three types of segmentation:
Semantic Segmentation
Labels every pixel with a class (sky, road, person) but does not distinguish between instances. All people are labeled "person" with no separation.
Instance Segmentation
Separates individual object instances. Person 1, Person 2, Person 3 each get their own mask. Essential for counting and tracking.
Panoptic Segmentation
Combines both: instance masks for countable things (people, cars) plus semantic labels for uncountable stuff (sky, grass). The complete picture.
The SAM Revolution
Meta's Segment Anything Model (SAM) changed the game in 2023. Instead of training on specific categories, SAM can segment any object with just a point click. SAM 2 (2024) extended this to video with real-time tracking.
SAM 2: Point Prompt Segmentation
The most intuitive way to use SAM: click on an object, get its mask. SAM 2 uses a hierarchical vision transformer (Hiera) for faster, more accurate predictions.
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor
from PIL import Image
import numpy as np

# Load model (various sizes available)
checkpoint = 'sam2_hiera_large.pt'
model_cfg = 'sam2_hiera_l.yaml'
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Load and set image
image = np.array(Image.open('photo.jpg'))
predictor.set_image(image)

# Point prompt - click on object
input_point = np.array([[500, 375]])  # x, y coordinates
input_label = np.array([1])           # 1 = foreground

# Get masks (returns multiple candidates)
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,
)
best_mask = masks[scores.argmax()]
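A quick way to sanity-check the result (not part of the snippet above) is to overlay the best mask on the image; this sketch assumes matplotlib is installed:
import matplotlib.pyplot as plt

plt.imshow(image)
plt.imshow(best_mask, alpha=0.5, cmap='jet')  # semi-transparent mask overlay
plt.scatter(input_point[:, 0], input_point[:, 1], c='red', marker='*')  # the clicked point
plt.axis('off')
plt.savefig('mask_overlay.png', bbox_inches='tight')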
Prompt Types
Positive Points (label=1)
Click on object you want to segment
Multiple points refine the selection
Negative Points (label=0)
Click to exclude regions from mask
Useful for separating overlapping objects
input_points = np.array([
    [500, 375],  # foreground point 1
    [520, 390],  # foreground point 2
    [100, 100],  # background point (exclude)
])
input_labels = np.array([1, 1, 0])  # 1 = include, 0 = exclude

masks, scores, _ = predictor.predict(
    point_coords=input_points,
    point_labels=input_labels,
    multimask_output=False,  # single refined mask
)
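SAM 2 also accepts a bounding box as a prompt, which is handy when you already have detections. A brief sketch (the box coordinates are illustrative):
# Box prompt in [x_min, y_min, x_max, y_max] pixel coordinates (illustrative values)
input_box = np.array([425, 300, 700, 525])

masks, scores, _ = predictor.predict(
    box=input_box,
    multimask_output=False,  # a box is usually unambiguous, so one mask is enough
)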
Automatic Mask Generation
SAM can also automatically segment everything in an image without any prompts. It uses a grid of points to discover all objects and outputs detailed metadata for each mask.
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

mask_generator = SAM2AutomaticMaskGenerator(
    build_sam2(model_cfg, checkpoint)
)
masks = mask_generator.generate(image)

# Each mask comes with rich metadata
for mask in masks:
    print(f"Area: {mask['area']}, IoU: {mask['predicted_iou']:.2f}")
Mask Metadata
| Field | Description |
|---|---|
| segmentation | Binary mask (H x W boolean array) |
| area | Number of pixels in the mask |
| bbox | Bounding box [x, y, width, height] |
| predicted_iou | Model's confidence in mask quality (0-1) |
| stability_score | How stable the mask is to threshold changes |
mask_generator = SAM2AutomaticMaskGenerator(
    model=build_sam2(model_cfg, checkpoint),
    points_per_side=32,           # density of the point grid
    pred_iou_thresh=0.88,         # minimum predicted IoU score
    stability_score_thresh=0.95,  # minimum stability score
    min_mask_region_area=100,     # filter out tiny masks (pixels)
)
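The metadata makes post-processing straightforward. For example, a sketch that keeps only the largest, most confident masks and crops their regions:
masks = mask_generator.generate(image)

# Sort from largest to smallest and keep only high-confidence masks
masks_sorted = sorted(masks, key=lambda m: m['area'], reverse=True)
good_masks = [m for m in masks_sorted if m['predicted_iou'] > 0.9]

for m in good_masks[:5]:
    x, y, w, h = (int(v) for v in m['bbox'])  # [x, y, width, height] in pixels
    crop = image[y:y + h, x:x + w]            # crop the region covered by the mask
    print(f"{m['area']} px mask at ({x}, {y}), size {w}x{h}")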
Grounded SAM: Text-Prompt Segmentation
What if you could just say "segment the cat"? Grounded SAM combines a text-to-box detector (Grounding DINO) with SAM to enable natural language segmentation.
# Step 1: Grounding DINO detects boxes from a text prompt
# Step 2: SAM 2 segments within each detected box
import torch
from torchvision.ops import box_convert
from groundingdino.util.inference import load_model, load_image, predict
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

# Load both models (Grounding DINO needs its config file plus the weights)
grounding_model = load_model('GroundingDINO_SwinB_cfg.py', 'groundingdino_swinb_cogcoor.pth')
sam_predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

# Detect objects with a text prompt
image_source, image_tensor = load_image('photo.jpg')  # numpy image + preprocessed tensor
text_prompt = "cat . dog . person"  # classes separated by ' . '
boxes, logits, phrases = predict(
    model=grounding_model,
    image=image_tensor,
    caption=text_prompt,
    box_threshold=0.35,
    text_threshold=0.25,
)

# Grounding DINO returns normalized cxcywh boxes; SAM expects absolute xyxy pixels
h, w, _ = image_source.shape
boxes_xyxy = box_convert(
    boxes * torch.tensor([w, h, w, h]), in_fmt='cxcywh', out_fmt='xyxy'
).numpy()

# Segment each detected box
sam_predictor.set_image(image_source)
for box, phrase in zip(boxes_xyxy, phrases):
    masks, _, _ = sam_predictor.predict(box=box)
    print(f"Segmented: {phrase}")
Why Grounded SAM Matters
Traditional segmentation models require predefined classes. Grounded SAM works with any text description, enabling open-vocabulary segmentation: ask for "the red car on the left" or "damaged areas on the wall" and it understands the natural-language request.
Real-World Applications
Background Removal
Product photography, portrait editing, compositing. One click removes backgrounds cleanly.
fg_mask = masks[scores.argmax()]
result = image * fg_mask[..., None]
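For compositing you usually want transparency rather than a black background; a sketch that writes the cutout as an RGBA PNG, assuming image is an HxWx3 uint8 array:
# Use the mask as an alpha channel and save a transparent cutout
alpha = (fg_mask * 255).astype(np.uint8)
rgba = np.dstack([image, alpha])
Image.fromarray(rgba, mode='RGBA').save('cutout.png')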
Object Isolation for Editing
Select objects precisely for color correction, style transfer, or inpainting.
edited = image.copy()
edited[mask] = apply_effect(image[mask])
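apply_effect is a placeholder; as one concrete example, here is a sketch that desaturates everything outside the mask so the selected object stands out:
# Keep the masked object in color, turn the background grayscale
gray = image.mean(axis=2, keepdims=True).astype(image.dtype)  # simple luminance
edited = np.where(mask[..., None], image, gray)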
Video Object Tracking (SAM 2)
Segment once, track across frames. SAM 2 propagates masks through video automatically.
predictor.add_new_points(frame_idx=0, ...)
masks = predictor.propagate_in_video()
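The two lines above are abbreviated. A fuller sketch of the video workflow, based on the SAM 2 reference API, with the frame directory and click coordinates as placeholders:
from sam2.build_sam import build_sam2_video_predictor

video_predictor = build_sam2_video_predictor(model_cfg, checkpoint)
state = video_predictor.init_state(video_path='frames_dir/')  # directory of JPEG frames

# Click once on the object in the first frame
video_predictor.add_new_points_or_box(
    state, frame_idx=0, obj_id=1,
    points=np.array([[500, 375]], dtype=np.float32),
    labels=np.array([1], dtype=np.int32),
)

# Propagate the mask through the rest of the video
for frame_idx, obj_ids, mask_logits in video_predictor.propagate_in_video(state):
    frame_masks = (mask_logits > 0.0).cpu().numpy()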
Interactive Annotation Tools
Speed up dataset labeling 10x. Click instead of tracing polygon boundaries.
annotation = mask_to_coco(mask, image_id)
dataset['annotations'].append(annotation)
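mask_to_coco is a placeholder helper; one common implementation encodes the binary mask as COCO run-length encoding with pycocotools, sketched here:
from pycocotools import mask as mask_utils

def mask_to_coco(mask, image_id, category_id=1):
    # Encode the boolean mask as COCO RLE (requires a Fortran-ordered uint8 array)
    rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
    rle['counts'] = rle['counts'].decode('utf-8')  # make it JSON-serializable
    return {
        'image_id': image_id,
        'category_id': category_id,
        'segmentation': rle,
        'area': float(mask_utils.area(rle)),
        'bbox': mask_utils.toBbox(rle).tolist(),  # [x, y, width, height]
        'iscrowd': 0,
    }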
SAM Model Variants
SAM 2 comes in multiple sizes to balance speed and quality. Choose based on your latency requirements and hardware.
| Model | Config | Parameters | Speed | Use Case |
|---|---|---|---|---|
| SAM 2 Tiny | sam2_hiera_t.yaml | 38M | Fastest | Real-time, mobile |
| SAM 2 Small | sam2_hiera_s.yaml | 46M | Fast | Balanced performance |
| SAM 2 Base+ | sam2_hiera_b+.yaml | 80M | Medium | Production quality |
| SAM 2 Large | sam2_hiera_l.yaml | 224M | Slower | Maximum accuracy |
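Switching sizes only means pointing at a different config/checkpoint pair; a sketch using the file names from the official SAM 2 release:
# Smallest model for real-time or edge use
fast_predictor = SAM2ImagePredictor(
    build_sam2('sam2_hiera_t.yaml', 'sam2_hiera_tiny.pt')
)

# Largest model when accuracy matters more than latency
accurate_predictor = SAM2ImagePredictor(
    build_sam2('sam2_hiera_l.yaml', 'sam2_hiera_large.pt')
)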
Segmentation Benchmarks
Segmentation models are evaluated on standard datasets. Key metrics include mIoU (mean Intersection over Union) for semantic segmentation and AP (Average Precision) for instance segmentation.
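IoU itself is simple to compute from binary masks; a minimal sketch for a single class (mIoU averages this value over all classes):
def iou(pred_mask, gt_mask):
    """Intersection over Union between two boolean masks of the same shape."""
    intersection = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return intersection / union if union > 0 else 1.0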
COCO Instance Segmentation
80 object categories, 330K images. Standard for instance segmentation.
ADE20K Semantic Segmentation
150 semantic classes, 25K images. Scene understanding benchmark.
SAM 2 Zero-Shot Performance
Zero-shot video object segmentation on the SA-V test set. The J&F score combines the Jaccard index (region overlap) with the F-measure (boundary accuracy).
Key Takeaways
1. SAM 2 enables promptable segmentation: click on any object to get a precise mask, no training required.
2. Multiple prompt types: positive/negative points, bounding boxes, or automatic grid-based generation.
3. Grounded SAM adds text understanding: pair SAM with Grounding DINO for natural-language prompts like "segment the cat".
4. SAM 2 extends to video: segment once, track across frames. Essential for video editing and analysis.