Image segmentation
Image segmentation is a computer vision technique that divides a digital image into multiple parts or "segments," where each segment contains pixels with similar characteristics, such as color, texture, or brightness. The goal is to simplify an image by changing its representation into something more meaningful and easier to analyze, often by identifying and locating objects, their boundaries, and different regions within the image. This process has wide-ranging applications, from medical image analysis to autonomous vehicles and satellite imagery.
Below are the standard benchmarks used to evaluate image segmentation models, along with current state-of-the-art results.
Benchmarks & SOTA
COCO 2017 Instance Segmentation
Microsoft COCO 2017 (Instance Segmentation)
The Microsoft COCO 2017 Instance Segmentation dataset (COCO 2017) is a large-scale benchmark for object detection and instance segmentation. It provides images with per-instance segmentation annotations (polygon masks and RLE), bounding boxes, and category labels for the standard COCO set of 80 detection/segmentation categories. The 2017 split commonly used for benchmarking comprises train2017 (~118,287 images, per HF mirrors), val2017 (5,000 images), and test splits; annotations are provided in COCO JSON format. COCO was introduced in Lin et al., "Microsoft COCO: Common Objects in Context" (ECCV 2014, arXiv:1405.0312) and is widely used for evaluating instance segmentation, object detection, and related tasks.
State of the Art
Segment Anything Model (SAM)
46.5
mAP
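As a quick illustration of the COCO JSON annotation format mentioned above, here is a minimal sketch that reads val2017 instance masks through the official pycocotools API; the annotation file path is an assumption about a local download, not part of the API.

```python
# Minimal sketch: decoding COCO 2017 instance masks with pycocotools.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # assumed local path

img_id = coco.getImgIds()[0]
ann_ids = coco.getAnnIds(imgIds=img_id, iscrowd=None)

for ann in coco.loadAnns(ann_ids):
    mask = coco.annToMask(ann)  # decodes polygon or RLE into an HxW binary mask
    name = coco.loadCats(ann["category_id"])[0]["name"]
    print(f"{name}: {int(mask.sum())} pixels")
```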
LVIS (Instance Segmentation)
LVIS is a large-scale, high-quality dataset for instance segmentation containing roughly 164k images and about 2 million instance annotations spanning more than 1,000 object categories (1,203 in LVIS v1.0). It focuses on long-tail object recognition, providing a larger and more detailed vocabulary than COCO. LVIS reuses the COCO images but with different splits and annotations optimized for instance segmentation. The dataset includes both common and rare object categories and provides standardized evaluation metrics such as mean Average Precision (mAP) for instance segmentation.
State of the Art
Segment Anything Model (SAM)
44.7
mAP
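For reference, the mask-AP number above is typically computed with the official lvis-api (pip install lvis); a minimal sketch follows, where both file paths are assumptions about local files.

```python
# Minimal sketch: LVIS instance-segmentation evaluation via the lvis-api.
from lvis import LVISEval

ANN_PATH = "lvis_v1_val.json"   # assumed ground-truth annotation file
RES_PATH = "segm_results.json"  # assumed COCO-style mask results file

lvis_eval = LVISEval(ANN_PATH, RES_PATH, "segm")
lvis_eval.run()            # evaluate, accumulate, summarize
lvis_eval.print_results()  # overall AP plus APr/APc/APf for rare/common/frequent classes
```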
BSDS500
Berkeley Segmentation Dataset (BSDS500)
The Berkeley Segmentation Dataset (BSDS500) is a widely used benchmark for image boundary detection and image segmentation. It contains 500 natural images (an extension of the earlier BSDS300) split into train/val/test (200 / 100 / 200). Each image has multiple human-labeled ground-truth segmentations (typically ~5 annotations per image), which serve as reference boundaries/segmentations for evaluation. The dataset is commonly used for contour/boundary detection and region segmentation research; standard evaluation measures include boundary precision/recall curves and summary F-measures, notably ODS (the best F at a single dataset-wide threshold) and OIS (the best F with a threshold chosen per image). The dataset and benchmark resources (download, code, evaluation scripts, and leaderboards) are hosted by the UC Berkeley Vision Group.
State of the Art
Segment Anything Model (SAM)
0.768
ODS
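To make the ODS/OIS distinction concrete, here is an illustrative sketch that assumes per-image precision/recall curves have already been produced by boundary matching (the step the official BSDS benchmark code performs); note the official ODS aggregates matched-pixel counts across the dataset rather than averaging per-image F as done here.

```python
# Illustrative sketch of ODS vs. OIS summary F-measures (simplified).
import numpy as np

def ods_ois(precision, recall, eps=1e-12):
    """precision, recall: (num_images, num_thresholds) arrays from boundary matching."""
    f = 2 * precision * recall / (precision + recall + eps)
    ods = f.mean(axis=0).max()  # best single threshold fixed for the whole dataset
    ois = f.max(axis=1).mean()  # best threshold chosen per image, then averaged
    return ods, ois

# Placeholder inputs, for shape illustration only.
p = np.random.rand(200, 99)
r = np.random.rand(200, 99)
print("ODS %.3f, OIS %.3f" % ods_ois(p, r))
```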
LoveDA
Land-cOVEr Domain Adaptation (LoveDA)
LoveDA (Land-cOVEr Domain Adaptation) is a high-resolution remote-sensing land-cover dataset created for semantic segmentation and domain-adaptive (cross-domain) semantic segmentation research. The dataset contains imagery from three different cities and is explicitly split into two domains (urban vs. rural) to study transferability and unsupervised domain adaptation. According to the authors, LoveDA comprises 5,987 high-spatial-resolution (HSR) images with 166,768 annotated objects covering seven common land-cover categories. The paper provides benchmarks of 11 semantic segmentation methods and 8 unsupervised domain-adaptation (UDA) methods. Code and data are hosted in the authors' GitHub repository (Junjue-Wang/LoveDA).
No results tracked yet
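As a concrete reference for the metric used on LoveDA-style semantic segmentation, here is a minimal mean-IoU sketch over seven land-cover classes; the class count matches the paper, but the label conventions and helper names are assumptions, not the authors' evaluation code.

```python
# Minimal sketch: mean IoU over 7 land-cover classes via a confusion matrix.
import numpy as np

NUM_CLASSES = 7  # LoveDA's seven land-cover categories

def confusion_matrix(pred, gt, n=NUM_CLASSES):
    valid = (gt >= 0) & (gt < n)                  # drop out-of-range/ignore labels
    idx = n * gt[valid].astype(int) + pred[valid]
    return np.bincount(idx, minlength=n * n).reshape(n, n)

def mean_iou(cm):
    inter = np.diag(cm)
    union = cm.sum(0) + cm.sum(1) - inter
    return np.nanmean(inter / np.maximum(union, 1))

pred = np.random.randint(0, NUM_CLASSES, (512, 512))  # placeholder prediction
gt = np.random.randint(0, NUM_CLASSES, (512, 512))    # placeholder ground truth
print(f"mIoU: {mean_iou(confusion_matrix(pred, gt)):.3f}")
```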
BRAVO (OOD)
BRAVO (BRAVO Semantic Segmentation / BRAVO Challenge dataset)
BRAVO is a benchmark and challenge dataset for evaluating out-of-distribution (OOD) robustness and reliability of semantic segmentation models in urban driving scenes. Created for the BRAVO Challenge (organized by Valeo and UNCV), BRAVO focuses on two reliability aspects: (1) semantic reliability (accuracy and calibration under perturbations) and (2) OOD reliability (detection and handling of unknown out-of-distribution content). The benchmark contains urban-scene images with diverse natural degradations and realistic-looking synthetic corruptions; in the BRAVO challenge setup, models are typically trained on Cityscapes (or other accepted training sets) and evaluated on BRAVO to measure OOD generalization. The BRAVO code/toolkit and evaluation protocol are available from the BRAVO Challenge repository (valeoai/bravo_challenge), and the challenge and results are described in the ECCV/UNCV BRAVO challenge papers (see arXiv:2409.15107).
No results tracked yet
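Since BRAVO scores calibration as part of semantic reliability, an expected-calibration-error (ECE) sketch may help; the binning scheme below is a common convention and an assumption here, not the official BRAVO protocol (see valeoai/bravo_challenge for that).

```python
# Illustrative sketch: expected calibration error over per-pixel confidences.
import numpy as np

def ece(confidence, correct, n_bins=15):
    """confidence: max-softmax per pixel; correct: boolean correctness flags."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
            total += in_bin.mean() * gap  # weight each bin by its pixel share
    return total

conf = np.random.rand(10_000)            # placeholder confidences
correct = np.random.rand(10_000) < conf  # placeholder correctness flags
print(f"ECE: {ece(conf, correct):.4f}")
```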
ADE20K
ADE20K
The **ADE20K** semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories in total, including "stuff" classes such as sky, road, and grass, and discrete objects such as person, car, and bed. Source: [Cooperative Image Segmentation and Restoration in Adverse Environmental Conditions](https://arxiv.org/abs/1911.00679) Image Source: [https://groups.csail.mit.edu/vision/datasets/ADE20K/](https://groups.csail.mit.edu/vision/datasets/ADE20K/)
No results tracked yet
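For orientation, here is a minimal sketch of inspecting one ADE20K (SceneParsing) annotation map with PIL/numpy; the directory layout is an assumption about the ADEChallengeData2016 download from the MIT page linked above, where each pixel stores a category index (0 = unlabeled, 1-150 = classes).

```python
# Minimal sketch: inspecting an ADE20K/SceneParsing label map.
import numpy as np
from PIL import Image

# Assumed path inside the ADEChallengeData2016 download.
ann = np.array(Image.open("ADEChallengeData2016/annotations/training/ADE_train_00000001.png"))

labels, counts = np.unique(ann, return_counts=True)
for label, count in zip(labels, counts):
    print(f"class {label}: {count} pixels")
```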
Cityscapes
Cityscapes is a large-scale dataset for semantic urban scene understanding. It provides high-quality pixel-level (fine) annotations for 5,000 images and coarse annotations for 20,000 images captured across 50 cities. The dataset includes dense semantic segmentation (30 classes), instance segmentation for vehicles and people, stereo pairs, preceding/trailing video frames, and rich metadata (GPS, vehicle odometry). It is used as a benchmark for pixel-level, instance-level, and panoptic semantic labeling.
No results tracked yet
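A minimal loading sketch via torchvision is shown below; it assumes the fine-annotation archives have already been downloaded manually to ./cityscapes (registration on cityscapes-dataset.com is required).

```python
# Minimal sketch: loading Cityscapes fine semantic labels with torchvision.
from torchvision.datasets import Cityscapes

ds = Cityscapes("./cityscapes", split="val", mode="fine", target_type="semantic")
img, seg = ds[0]  # PIL image and PIL label map (labelIds, not trainIds)
print(img.size, seg.size)
```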
Oxford-IIIT Pets
Oxford-IIIT Pet Dataset
The Oxford-IIIT Pet Dataset is a fine-grained image dataset of pets created by the Visual Geometry Group (VGG) at the University of Oxford. It contains images of 37 pet breeds (cats and dogs) with large variations in scale, pose, and lighting. The dataset provides per-image annotations including the breed label and species (cat/dog), a tight head ROI (bounding box), and pixel-level trimap segmentation (foreground / background / ignore). Common splits used in ML libraries (e.g., TensorFlow Datasets) have 3,680 training images and 3,669 test images (7,349 images total). Typical uses: breed classification (37-way), binary species classification (cat vs. dog), and segmentation/foreground extraction. Licensing: CC BY-SA 4.0 (as listed on several mirrors). Source / references: original VGG dataset page (robots.ox.ac.uk), TensorFlow Datasets entry (oxford_iiit_pet), and Hugging Face dataset mirrors (e.g., timm/oxford-iiit-pet).
No results tracked yet
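Since the splits above refer to the TensorFlow Datasets entry, here is a minimal TFDS loading sketch; the trimap value convention in the comment follows the TFDS documentation and should be double-checked against your version.

```python
# Minimal sketch: loading pet images and trimap masks via TensorFlow Datasets.
import tensorflow_datasets as tfds

ds, info = tfds.load("oxford_iiit_pet:3.*.*", with_info=True)
example = next(iter(ds["train"]))

image = example["image"]               # HxWx3 uint8 tensor
trimap = example["segmentation_mask"]  # HxWx1; 1 = pet, 2 = background, 3 = border
print(image.shape, trimap.shape, info.splits["train"].num_examples)
```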
PASCAL VOC 2012
PASCAL Visual Object Classes (VOC) 2012
PASCAL VOC 2012 (VOC2012) is a standard benchmark dataset from the PASCAL Visual Object Classes (VOC) challenge series for object recognition tasks including image classification, object detection, and pixel-level semantic segmentation. The VOC2012 release provides images collected from Flickr with high-quality annotations: bounding boxes and class labels for objects, and pixel-wise segmentation masks for a subset of images. It covers 20 common object classes plus background and has been widely used as a semantic segmentation benchmark (and for detection/classification). The commonly cited VOC2012 train/val collection contains 11,530 images (with ~27,450 ROI-tagged objects and ~6,929 segmentation annotations in the release), and the dataset is distributed together with devkit/evaluation code and documentation. Note that many images originate from Flickr and must be used in accordance with their license/terms.
No results tracked yet
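For the segmentation subset, torchvision ships a ready-made loader; a minimal sketch follows (download=True fetches the archive on first use, and the root directory is an assumption).

```python
# Minimal sketch: VOC2012 semantic segmentation split via torchvision.
from torchvision.datasets import VOCSegmentation

ds = VOCSegmentation("./voc", year="2012", image_set="val", download=True)
img, mask = ds[0]  # PIL image and PIL palette mask (0 = background, 255 = ignore border)
print(img.size, mask.size)
```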
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.
Object counting
Object counting is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models such as convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to produce a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.
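As a sketch of the detect-then-aggregate pipeline described above, the following uses an off-the-shelf torchvision detector and tallies confident detections per class; the model choice, placeholder image, and 0.5 threshold are illustrative assumptions.

```python
# Illustrative sketch: counting objects by aggregating detector outputs.
from collections import Counter

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()  # downloads weights on first use
image = torch.rand(3, 480, 640)  # placeholder RGB tensor in [0, 1]

with torch.no_grad():
    out = model([image])[0]  # dict with "boxes", "labels", "scores"

keep = out["scores"] > 0.5  # assumed confidence threshold
counts = Counter(out["labels"][keep].tolist())
print(counts)  # e.g. Counter({1: 3, 3: 2}) with COCO label ids
```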
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.