
Mask Generation

Mask generation produces pixel-precise segmentation masks for objects, and Meta's Segment Anything Model (SAM, 2023) transformed it from a specialized task into a foundational capability. Trained on 11M images with over 1.1B masks, SAM demonstrated that a single promptable model could segment virtually anything from a point click, a drawn box, or a text prompt. SAM 2 (2024) extended this to video with real-time tracking, while EfficientSAM and FastSAM address the original's computational cost. It was segmentation's "foundation model" moment, analogous to what GPT-3 was for NLP.


Mask generation produces pixel-precise segmentation masks for objects in images, typically in a class-agnostic way (segmenting 'things' without labeling them). The Segment Anything Model (SAM) by Meta (2023) transformed this from a niche task into a foundation capability — segment any object with a point click, box, or text prompt. SAM 2 extended it to video.

History

2014

Selective Search and MCG generate object mask proposals for detection pipelines (R-CNN), producing ~2000 candidate masks per image

2017

Mask R-CNN (He et al.) adds a mask head to Faster R-CNN, producing instance masks — 37.1% mask AP on COCO, establishing instance segmentation as a task

2019

LVIS dataset introduces 1203 categories with long-tail distribution, exposing that models fail badly on rare objects

2019

PointRend treats mask boundaries as rendering, applying iterative refinement for sharper edges — improving thin structures significantly

2021

Mask2Former unifies semantic, instance, and panoptic segmentation with masked attention, achieving SOTA on all three

2023

Segment Anything Model (SAM) trained on SA-1B (11M images, 1.1B masks) enables zero-shot segmentation from points, boxes, or text — paradigm shift

2024

SAM 2 extends to video with streaming memory architecture; EfficientSAM and FastSAM make SAM-quality inference 50× faster

2024

Grounded SAM combines Grounding DINO (text→boxes) with SAM (boxes→masks) for end-to-end text-prompted segmentation

2025

SAM 2.1 improves video consistency; HQ-SAM and SAM-HQ variants push mask quality for fine structures (hair, fur, lace)

How Mask Generation Works

Mask Generation Pipeline
1

Image Encoding

SAM uses a ViT-H (632M params) to encode the image into dense feature embeddings in a single forward pass. This is the expensive step (~150ms on A100) but is amortized across multiple prompts.
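The encode-once, prompt-many pattern can be sketched with stub functions. Everything below is an illustrative stand-in, not SAM's real encoder or decoder; only the workflow shape (one expensive encode amortized over cheap per-prompt decodes) mirrors the description above.

```python
import numpy as np

def encode_image(image):
    """Stub standing in for SAM's ViT-H encoder: one expensive
    forward pass producing a dense feature map (here random)."""
    h, w = image.shape[:2]
    return np.random.rand(256, h // 16, w // 16)  # C x H/16 x W/16

def decode_mask(features, point):
    """Stub standing in for the lightweight prompt-conditioned decoder:
    emits a toy circular mask around the prompt point."""
    _, fh, fw = features.shape
    ys, xs = np.mgrid[0:fh, 0:fw]
    cy, cx = point[1] // 16, point[0] // 16
    return ((ys - cy) ** 2 + (xs - cx) ** 2) < 16

image = np.zeros((256, 256, 3))
features = encode_image(image)           # expensive: run once per image
masks = [decode_mask(features, p)        # cheap: run once per prompt
         for p in [(64, 64), (128, 128), (200, 96)]]
```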

2

Prompt Encoding

User prompts (point clicks, bounding boxes, rough masks, or text) are encoded into prompt embeddings. Points use positional encoding; boxes use corner coordinates; text uses CLIP-style encoding.
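A minimal sketch of the point case, using random Fourier features in the spirit of SAM's positional encoding. The projection matrix `G` and the dimensions here are illustrative assumptions; the real prompt encoder additionally adds learned type embeddings (foreground vs. background point, box corner, etc.).

```python
import numpy as np

def point_embedding(xy, image_size, gaussian_matrix):
    """Random-Fourier positional encoding of a 2D point prompt:
    normalize coordinates, project with a fixed random matrix,
    then take sin/cos of the projections."""
    coords = np.asarray(xy, dtype=np.float64) / image_size  # -> [0, 1]
    coords = 2 * coords - 1                                 # -> [-1, 1]
    proj = coords @ gaussian_matrix                         # (2,) @ (2, D/2)
    return np.concatenate([np.sin(2 * np.pi * proj),
                           np.cos(2 * np.pi * proj)])       # (D,)

rng = np.random.default_rng(0)
G = rng.normal(size=(2, 128))                 # fixed random projection
emb = point_embedding((320, 240), 1024, G)    # 256-dim point embedding
```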

3

Mask Decoder

A lightweight transformer decoder (just 2 blocks in SAM) cross-attends between prompt tokens and image features to produce mask logits. It outputs 3 masks at different granularity levels (whole object, part, subpart) plus confidence scores.
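Consuming the decoder's multi-mask output can be as simple as keeping the candidate with the highest predicted-IoU score. The masks and scores below are made-up toy values standing in for real decoder outputs.

```python
import numpy as np

# Stand-ins for the decoder's three candidate masks (e.g. whole
# object / part / subpart) and their predicted-IoU confidences.
masks = np.stack([np.zeros((8, 8), bool),
                  np.ones((8, 8), bool),
                  np.eye(8, dtype=bool)])
iou_preds = np.array([0.55, 0.91, 0.78])

# Common heuristic: keep the highest-scoring mask.
best = int(np.argmax(iou_preds))
chosen_mask = masks[best]
```

In interactive use the three candidates are shown to the user instead, which is how SAM handles prompt ambiguity.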

4

Automatic Mask Generation

For full-image segmentation, SAM runs a 32×32 grid of point prompts, generates masks for each, then merges overlapping masks using NMS on mask IoU. This produces a full 'segment everything' output.
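The merge step can be sketched as greedy NMS on mask IoU. The grid spacing, toy masks, and threshold below are illustrative; the real automatic generator also filters candidates by predicted IoU and a stability score before merging.

```python
import numpy as np

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def mask_nms(masks, scores, iou_thresh=0.7):
    """Greedy NMS on mask IoU: keep the best-scoring mask,
    drop near-duplicates, repeat."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(mask_iou(masks[i], masks[j]) < iou_thresh for j in keep):
            keep.append(int(i))
    return keep

# 32x32 grid of point prompts over a hypothetical 1024px image
grid = np.stack(np.meshgrid(np.linspace(16, 1008, 32),
                            np.linspace(16, 1008, 32)), -1).reshape(-1, 2)

# Toy candidate masks: two near-duplicates and one distinct mask
m1 = np.zeros((8, 8), bool); m1[:4] = True
m2 = np.zeros((8, 8), bool); m2[:4] = True; m2[4, 0] = True
m3 = np.zeros((8, 8), bool); m3[:, :2] = True
keep = mask_nms([m1, m2, m3], np.array([0.9, 0.8, 0.7]))
```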

5

Evaluation

IoU between predicted and ground-truth masks is the primary metric. SA-1B evaluation uses human quality ratings. For automatic mode, Average Recall across IoU thresholds measures how well the model discovers all objects.
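Both metrics are straightforward to compute from boolean masks. A minimal sketch, with toy masks and the usual 0.50:0.95 threshold range:

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def average_recall(pred_masks, gt_masks,
                   thresholds=np.arange(0.5, 1.0, 0.05)):
    """AR: at each IoU threshold, the fraction of ground-truth masks
    matched by at least one prediction, averaged over thresholds."""
    recalls = []
    for t in thresholds:
        hit = sum(any(mask_iou(p, g) >= t for p in pred_masks)
                  for g in gt_masks)
        recalls.append(hit / len(gt_masks))
    return float(np.mean(recalls))

g = np.zeros((8, 8), bool); g[2:6, 2:6] = True
ar = average_recall([g], [g])   # a perfect prediction scores AR = 1.0
```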

Current Landscape

Mask generation in 2025 is defined entirely by the SAM family. Before SAM (2023), mask generation was a subtask within instance segmentation pipelines (Mask R-CNN, Mask2Former). SAM reframed it as a promptable foundation task — encode an image once, then generate masks interactively for any prompt. SAM 2 extended this to video. The ecosystem is now built around SAM: Grounded SAM for text prompting, HQ-SAM for fine details, EfficientSAM for speed, and Semantic-SAM for multi-granularity. The SA-1B dataset (1.1B masks) is the largest segmentation dataset ever created and has become a standard pretraining resource.

Key Challenges

Ambiguity — a single click on a person's eye could mean 'segment the eye,' 'the face,' or 'the whole person'; SAM returns 3 masks but the right one requires human selection

Fine structures — hair strands, fence wires, tree branches, and translucent objects (glass, smoke) remain difficult even for SAM; mask edges are often imprecise

Semantic understanding — SAM segments objects but doesn't know what they are; Grounded SAM addresses this but adds complexity and latency

Video consistency — SAM 2 improves temporal tracking but still loses objects during occlusion and reappearance, especially for similar-looking objects

Speed — SAM's ViT-H image encoder is too slow for real-time applications; EfficientSAM and MobileSAM trade accuracy for speed

Quick Recommendations

Best mask quality

SAM 2.1 (ViT-L/H) or HQ-SAM

Best general mask quality; HQ-SAM specifically improves fine-structure segmentation (hair, lace) with minimal overhead

Text-prompted segmentation

Grounded SAM 2 (Grounding DINO + SAM 2)

End-to-end: describe what to segment in text, get pixel-precise masks — no manual prompting needed

Video mask tracking

SAM 2.1

Click on an object in one frame, track its mask throughout the video with streaming memory; works on 24+ FPS video

Real-time / edge

EfficientSAM or MobileSAM

10-50× faster than SAM-H with ~95% mask quality; suitable for mobile and embedded applications

Automatic full-image segmentation

SAM 2 auto mode or Semantic-SAM

Segment everything in the image at multiple granularity levels without any prompts
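To give a feel for the streaming-memory idea behind SAM 2's video tracking, here is a deliberately simplified toy tracker: it keeps a small bank of recent embeddings per tracked object and matches each new frame's candidates by cosine similarity. The real model uses learned memory attention over spatial features; the class, thresholds, and 2D embeddings below are all illustrative.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

class StreamingMemoryTracker:
    """Toy streaming memory: a bounded bank of past embeddings per
    object; new candidates join the best-matching track or start one."""

    def __init__(self, bank_size=4, match_thresh=0.5):
        self.bank = {}                 # object id -> recent embeddings
        self.bank_size = bank_size
        self.match_thresh = match_thresh
        self.next_id = 0

    def update(self, embeddings):
        """Assign each candidate embedding in a frame an object id."""
        ids = []
        for emb in embeddings:
            best_id, best_sim = None, self.match_thresh
            for oid, mems in self.bank.items():
                sim = max(cosine(emb, m) for m in mems)
                if sim > best_sim:
                    best_id, best_sim = oid, sim
            if best_id is None:        # unseen object -> new track
                best_id = self.next_id
                self.next_id += 1
                self.bank[best_id] = []
            # keep only the most recent bank_size embeddings
            self.bank[best_id] = (self.bank[best_id] + [emb])[-self.bank_size:]
            ids.append(best_id)
        return ids

tracker = StreamingMemoryTracker()
f1 = tracker.update([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
f2 = tracker.update([np.array([0.9, 0.1])])   # rejoins the first track
```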

What's Next

The frontier is real-time interactive segmentation on edge devices (AR glasses, robotics), video understanding with persistent object identity (not just masks but tracking who is who), and 3D mask generation from single images or video. SAM's paradigm will likely be absorbed into general-purpose vision foundation models that segment, detect, caption, and reason simultaneously. The long-term vision is spatial AI that maintains a 3D understanding of every object in a scene.

