Mask Generation
Mask generation produces pixel-precise segmentation masks for objects in images, typically in a class-agnostic way (segmenting "things" without labeling them). Meta's Segment Anything Model (SAM, 2023) transformed it from a specialized subtask of instance segmentation into a foundational capability: trained on 11M images with 1.1B masks, a single promptable model can segment virtually anything from a point click, a box, or a text prompt. SAM 2 (2024) extended this to video with real-time tracking, while EfficientSAM and FastSAM address the original's computational cost. It was the "foundation model" moment for segmentation, analogous to what GPT-3 was for NLP.
History
Selective Search and MCG generate object mask proposals for detection pipelines (R-CNN), producing ~2000 candidate masks per image
Mask R-CNN (He et al.) adds a mask head to Faster R-CNN, producing instance masks — 37.1% mask AP on COCO, establishing instance segmentation as a task
LVIS dataset introduces 1203 categories with long-tail distribution, exposing that models fail badly on rare objects
PointRend treats mask boundaries as rendering, applying iterative refinement for sharper edges — improving thin structures significantly
Mask2Former unifies semantic, instance, and panoptic segmentation with masked attention, achieving SOTA on all three
Segment Anything Model (SAM) trained on SA-1B (11M images, 1.1B masks) enables zero-shot segmentation from points, boxes, or text — paradigm shift
SAM 2 extends to video with streaming memory architecture; EfficientSAM and FastSAM make SAM-quality inference 50× faster
Grounded SAM combines Grounding DINO (text→boxes) with SAM (boxes→masks) for end-to-end text-prompted segmentation
SAM 2.1 improves video consistency; HQ-SAM (also released as SAM-HQ) pushes mask quality for fine structures (hair, fur, lace)
How Mask Generation Works
Image Encoding
SAM uses a ViT-H (632M params) to encode the image into dense feature embeddings in a single forward pass. This is the expensive step (~150ms on an A100), but its cost is amortized across multiple prompts on the same image.
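The encode-once, prompt-many pattern is what makes interactive use practical (the real `segment_anything` package exposes it via `SamPredictor.set_image` followed by repeated `predict` calls). The sketch below is a toy stand-in, not SAM itself: `PromptableSegmenter` and its fake feature thresholding are illustrative only, but the cost structure — one heavy encoder call, many cheap decoder calls — mirrors the real design.

```python
import numpy as np

class PromptableSegmenter:
    """Toy illustration of SAM's encode-once / prompt-many pattern.

    The heavy image encoder runs once per image; every prompt after
    that only pays for the lightweight decoder.
    """

    def __init__(self):
        self.embedding = None
        self.encoder_calls = 0

    def set_image(self, image: np.ndarray) -> None:
        # Stand-in for the ViT-H forward pass (~150 ms on an A100).
        self.encoder_calls += 1
        self.embedding = image.mean(axis=-1)  # fake dense features

    def predict(self, point_xy: tuple) -> np.ndarray:
        # Stand-in for the lightweight decoder: grow a region of
        # features similar to the clicked point's feature value.
        assert self.embedding is not None, "call set_image first"
        x, y = point_xy
        seed = self.embedding[y, x]
        return np.abs(self.embedding - seed) < 10  # boolean mask

seg = PromptableSegmenter()
rng = np.random.default_rng(0)
seg.set_image(rng.integers(0, 255, (64, 64, 3)).astype(float))
masks = [seg.predict(p) for p in [(5, 5), (20, 40), (60, 60)]]
print(seg.encoder_calls)  # 1 — one encoding amortized over three prompts
```

Three prompts, three masks, but the expensive encoder ran exactly once; this is why SAM can feel interactive despite a 632M-parameter backbone.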
Prompt Encoding
User prompts (point clicks, bounding boxes, rough masks, or text) are encoded into prompt embeddings. Points use positional encoding; boxes use corner coordinates; text uses CLIP-style encoding.
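For point prompts, SAM maps (x, y) coordinates into the embedding space with random Fourier features: coordinates are projected by a fixed Gaussian random matrix and passed through sin/cos. A minimal sketch, assuming coordinates normalized to the unit square (SAM resizes inputs to 1024x1024; the frequency count and seed here are arbitrary, not SAM's actual values):

```python
import numpy as np

def fourier_point_embedding(points: np.ndarray, num_freqs: int = 64,
                            seed: int = 0) -> np.ndarray:
    """Sketch of SAM-style positional encoding for point prompts.

    points: (n, 2) array of coordinates in [0, 1]^2.
    Returns an (n, 2 * num_freqs) embedding.
    """
    rng = np.random.default_rng(seed)
    freqs = rng.normal(size=(2, num_freqs))   # fixed at init, not learned per call
    proj = 2 * np.pi * points @ freqs         # (n, num_freqs)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1)

# Two clicks on an image, normalized by the (assumed) 1024-pixel input size.
pts = np.array([[100, 200], [512, 512]]) / 1024.0
emb = fourier_point_embedding(pts)
print(emb.shape)  # (2, 128)
```

In SAM the result is further summed with a learned embedding indicating the point's label (foreground vs. background click), which this sketch omits.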
Mask Decoder
A lightweight transformer decoder (just 2 blocks in SAM) cross-attends between prompt tokens and image features to produce mask logits. It outputs 3 masks at different granularity levels (whole object, part, subpart) plus confidence scores.
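Two simple post-decoder computations are worth seeing concretely: choosing among the 3 candidate masks by predicted confidence, and SAM's stability score, which measures how much a mask's boundary moves when the logit threshold is jittered. The helpers below are minimal sketches (the threshold and delta values are illustrative, not SAM's exact configuration):

```python
import numpy as np

def stability_score(logits: np.ndarray, thresh: float = 0.0,
                    delta: float = 1.0) -> float:
    """IoU between the mask thresholded slightly high and slightly low.

    A mask whose boundary barely moves under threshold jitter scores
    near 1.0; a fuzzy, low-margin mask scores much lower.
    """
    hi = logits > (thresh + delta)
    lo = logits > (thresh - delta)
    inter = np.logical_and(hi, lo).sum()
    union = np.logical_or(hi, lo).sum()
    return inter / union if union else 0.0

def pick_best(iou_preds: np.ndarray) -> int:
    """Select among the 3 candidate masks by predicted-IoU confidence."""
    return int(np.argmax(iou_preds))

# Sharp logits (large margins) are stable; fuzzy logits are not.
sharp = np.where(np.eye(8, dtype=bool), 5.0, -5.0)
fuzzy = np.where(np.eye(8, dtype=bool), 0.5, -0.5)
print(stability_score(sharp), stability_score(fuzzy))  # 1.0 0.0
print(pick_best(np.array([0.2, 0.9, 0.5])))            # 1
```

SAM's automatic mode uses both signals as filters, discarding candidate masks whose confidence or stability falls below a cutoff.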
Automatic Mask Generation
For full-image segmentation, SAM runs a 32×32 grid of point prompts, generates masks for each, then merges overlapping masks using NMS on mask IoU. This produces a full 'segment everything' output.
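The merging step above is greedy non-maximum suppression, but computed on mask IoU rather than box IoU. A self-contained sketch (the 0.7 threshold is illustrative, not SAM's exact setting):

```python
import numpy as np

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def mask_nms(masks, scores, iou_thresh: float = 0.7):
    """Greedy NMS on mask IoU: keep the highest-scoring mask, drop any
    remaining mask that overlaps a kept mask above iou_thresh, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(int(i))
    return keep

# Two near-duplicate masks of one object, plus a second distinct object.
a = np.zeros((32, 32), dtype=bool); a[0:16, 0:16] = True
b = np.zeros((32, 32), dtype=bool); b[0:16, 1:17] = True   # shifted copy of a
c = np.zeros((32, 32), dtype=bool); c[20:30, 20:30] = True # different object
kept = mask_nms([a, b, c], [0.90, 0.85, 0.95])
print(kept)  # [2, 0] — the near-duplicate b is suppressed
```

With a 32x32 point grid this reduces 1024 raw candidates (3072 counting all granularity levels) to a clean set of non-overlapping segments.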
Evaluation
IoU between predicted and ground-truth masks is the primary metric. SA-1B evaluation uses human quality ratings. For automatic mode, Average Recall across IoU thresholds measures how well the model discovers all objects.
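Both metrics are simple to state precisely. Below is a minimal sketch of mask IoU and a COCO-style Average Recall (AR): at each IoU threshold, the fraction of ground-truth masks matched by at least one prediction, averaged over thresholds from 0.5 to 0.95. (This omits the one-to-one matching constraint real COCO evaluation enforces.)

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def average_recall(pred_masks, gt_masks,
                   thresholds=np.arange(0.5, 1.0, 0.05)) -> float:
    """Recall averaged over IoU thresholds: at each threshold, count the
    ground-truth masks matched by at least one prediction."""
    recalls = []
    for t in thresholds:
        hits = sum(any(iou(p, g) >= t for p in pred_masks) for g in gt_masks)
        recalls.append(hits / len(gt_masks))
    return float(np.mean(recalls))

gt = np.zeros((16, 16), dtype=bool); gt[4:12, 4:12] = True
exact = gt.copy()
rough = np.zeros((16, 16), dtype=bool); rough[4:12, 4:14] = True  # two columns too wide
print(iou(exact, gt), iou(rough, gt))  # 1.0 0.8
ar = average_recall([rough], [gt])     # rough mask only clears the lower thresholds
```

The rough mask scores IoU 0.8, so it counts as a hit at the lenient thresholds but misses the strict ones, and AR lands well below 1.0; this is how AR rewards boundary precision, not just rough localization.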
Current Landscape
Mask generation in 2025 is defined entirely by the SAM family. Before SAM (2023), mask generation was a subtask within instance segmentation pipelines (Mask R-CNN, Mask2Former). SAM reframed it as a promptable foundation task — encode an image once, then generate masks interactively for any prompt. SAM 2 extended this to video. The ecosystem is now built around SAM: Grounded SAM for text prompting, HQ-SAM for fine details, EfficientSAM for speed, and Semantic-SAM for multi-granularity. The SA-1B dataset (1.1B masks) is the largest segmentation dataset ever created and has become a standard pretraining resource.
Key Challenges
Ambiguity — a single click on a person's eye could mean 'segment the eye,' 'the face,' or 'the whole person'; SAM returns 3 masks but the right one requires human selection
Fine structures — hair strands, fence wires, tree branches, and translucent objects (glass, smoke) remain difficult even for SAM; mask edges are often imprecise
Semantic understanding — SAM segments objects but doesn't know what they are; Grounded SAM addresses this but adds complexity and latency
Video consistency — SAM 2 improves temporal tracking but still loses objects during occlusion and reappearance, especially for similar-looking objects
Speed — SAM's ViT-H image encoder is too slow for real-time applications; EfficientSAM and MobileSAM trade accuracy for speed
Quick Recommendations
Best mask quality
SAM 2.1 (ViT-L/H) or HQ-SAM
Best general mask quality; HQ-SAM specifically improves fine-structure segmentation (hair, lace) with minimal overhead
Text-prompted segmentation
Grounded SAM 2 (Grounding DINO + SAM 2)
End-to-end: describe what to segment in text, get pixel-precise masks — no manual prompting needed
Video mask tracking
SAM 2.1
Click on an object in one frame, track its mask throughout the video with streaming memory; works on 24+ FPS video
Real-time / edge
EfficientSAM or MobileSAM
10-50× faster than SAM-H with ~95% mask quality; suitable for mobile and embedded applications
Automatic full-image segmentation
SAM 2 auto mode or Semantic-SAM
Segment everything in the image at multiple granularity levels without any prompts
What's Next
The frontier is real-time interactive segmentation on edge devices (AR glasses, robotics), video understanding with persistent object identity (not just masks but tracking who is who), and 3D mask generation from single images or video. SAM's paradigm will likely be absorbed into general-purpose vision foundation models that segment, detect, caption, and reason simultaneously. The long-term vision is spatial AI that maintains a 3D understanding of every object in a scene.