Zero-Shot Object Detection
Zero-shot object detection finds and localizes objects described by free-form text, without any task-specific fine-tuning: the open-vocabulary dream of detection. Grounding DINO (2023) made this practical by marrying DINO's detection architecture with grounded pre-training, reaching roughly 50% zero-shot AP on COCO from text queries alone, while OWL-ViT and YOLO-World showed different paths to the same goal. The central technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs. "the blue car" in the same scene). Combined with SAM for masks, these models form a fully open-vocabulary perception pipeline, and they are rapidly replacing traditional closed-set detectors in production because they eliminate the most painful step: collecting and annotating domain-specific training data.
History
Early zero-shot detection papers (Bansal et al., Rahman et al.) first attempt to detect unseen classes by transferring from seen categories via word embeddings
ViLD (Gu et al.) distills CLIP's open-vocabulary knowledge into a detector, enabling zero-shot detection on LVIS rare categories
OWL-ViT (Minderer et al.) adapts CLIP for detection by adding box prediction heads, achieving competitive zero-shot LVIS AP
GLIP (Li et al.) unifies grounding and detection by reformulating detection as phrase grounding — matching phrases to regions
Grounding DINO (Liu et al.) combines DINO-DETR with grounded pretraining, achieving 52.5% zero-shot COCO AP — first to rival supervised detectors
OWLv2 (Google) scales self-training on web data, improving open-vocabulary detection to 47.2% APrare on LVIS
Grounding DINO 1.5 (IDEA Research) adds edge deployment optimizations and improved text grounding; Florence-2 unifies detection with other tasks
Open-vocabulary detection becomes standard in robotics and content moderation; models like Grounding DINO 1.6 achieve near-supervised-level accuracy on common objects
How Zero-Shot Object Detection Works
Text and Image Encoding
The text query (e.g., 'person carrying a red bag') is encoded by a text encoder (BERT, CLIP text). The image is processed by a vision backbone (Swin, ViT). Both produce dense feature representations.
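The two encoding streams can be pictured as a pair of feature matrices projected into a shared dimension. The sketch below is purely illustrative: the shapes, random features, and linear projections are stand-ins for real BERT/Swin outputs, not actual model code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions): 6 text tokens, a 10x10 image feature map,
# BERT-like 768-dim text features, Swin-like 1024-dim image features.
num_tokens, num_patches = 6, 100
d_text, d_img, d_model = 768, 1024, 256

text_feats = rng.standard_normal((num_tokens, d_text))  # per-token text features
img_feats = rng.standard_normal((num_patches, d_img))   # per-patch image features

# Linear projections map both modalities into a shared d_model space
W_text = rng.standard_normal((d_text, d_model)) / np.sqrt(d_text)
W_img = rng.standard_normal((d_img, d_model)) / np.sqrt(d_img)

text_proj = text_feats @ W_text  # (num_tokens, d_model)
img_proj = img_feats @ W_img     # (num_patches, d_model)
print(text_proj.shape, img_proj.shape)  # (6, 256) (100, 256)
```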
Cross-Modal Fusion
Grounding DINO uses a feature enhancer module with cross-attention between text and image features at multiple scales. This allows each image region to attend to relevant text phrases and vice versa.
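The fusion step reduces to cross-attention applied in both directions. A minimal single-head numpy sketch, with illustrative shapes and no learned projections (the real feature enhancer is a deeper multi-scale transformer):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Single-head cross-attention: each query gathers from keys_values."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # (n_q, n_kv)
    return attn @ keys_values                             # (n_q, d)

rng = np.random.default_rng(0)
img = rng.standard_normal((100, 256))  # image-region features (shared dim)
txt = rng.standard_normal((6, 256))    # text-token features

# Bi-directional fusion with residual connections:
# image regions attend to text phrases, and vice versa
img_fused = img + cross_attend(img, txt)
txt_fused = txt + cross_attend(txt, img)
print(img_fused.shape, txt_fused.shape)  # (100, 256) (6, 256)
```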
Query Selection
A language-guided query selection module picks the most relevant image features as detection queries, ensuring that queries focus on regions matching the text description.
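Following the scheme described in the Grounding DINO paper, this amounts to ranking image features by their maximum similarity to any text token and keeping the top k. A simplified numpy sketch (shapes and the dot-product similarity are assumptions; the real module operates on fused multi-scale features):

```python
import numpy as np

def select_queries(img_feats, text_feats, num_queries):
    """Pick the image features with the highest max similarity to any text token."""
    sim = img_feats @ text_feats.T      # (num_regions, num_tokens)
    relevance = sim.max(axis=1)         # best-matching token score per region
    top = np.argsort(-relevance)[:num_queries]
    return img_feats[top], top

rng = np.random.default_rng(0)
img = rng.standard_normal((100, 256))
txt = rng.standard_normal((6, 256))
queries, idx = select_queries(img, txt, num_queries=10)
print(queries.shape)  # (10, 256)
```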
Box Prediction + Grounding
A transformer decoder refines detection queries into bounding boxes and computes text-region matching scores. Each predicted box is scored against each phrase in the input text. Hungarian matching assigns predictions to ground truth during training.
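The matching step can be illustrated with a toy cost matrix. The sketch below simplifies the real training loss (which combines focal, L1, and GIoU terms) to a score term plus an L1 box term, and solves the tiny assignment by brute force rather than the Hungarian algorithm proper; all numbers are illustrative.

```python
import numpy as np
from itertools import permutations

def match_cost(pred_scores, pred_boxes, gt_labels, gt_boxes):
    """Cost = -matching score + L1 box distance (simplified from the real loss)."""
    cls_cost = -pred_scores[:, gt_labels]                          # (n_pred, n_gt)
    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)
    return cls_cost + box_cost

def hungarian_brute(cost):
    """Optimal one-to-one assignment by brute force; fine for tiny examples
    (real implementations use scipy.optimize.linear_sum_assignment)."""
    n_pred, n_gt = cost.shape
    best = min(permutations(range(n_pred), n_gt),
               key=lambda p: sum(cost[p[j], j] for j in range(n_gt)))
    return list(best)

# 3 predicted boxes, 2 ground-truth objects, 2 text phrases (all illustrative)
pred_scores = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])  # box-vs-phrase scores
pred_boxes = np.array([[0, 0, 1, 1], [2, 2, 3, 3], [5, 5, 6, 6]], dtype=float)
gt_labels = [0, 1]
gt_boxes = np.array([[0, 0, 1, 1], [2, 2, 3, 3]], dtype=float)

cost = match_cost(pred_scores, pred_boxes, gt_labels, gt_boxes)
print(hungarian_brute(cost))  # [0, 1]: prediction 0 -> GT 0, prediction 1 -> GT 1
```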
Evaluation
Zero-shot AP on COCO (80 classes) and LVIS (1203 classes, with rare/common/frequent splits) are standard. LVIS APrare — accuracy on categories with <10 training images — specifically tests open-vocabulary capability.
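At the core of AP evaluation is matching predictions to ground truth by intersection-over-union. A minimal stdlib implementation for axis-aligned boxes (the full COCO protocol additionally sorts by confidence and averages precision over IoU thresholds 0.5 to 0.95):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Two 10x10 boxes overlapping by half horizontally: IoU = 50 / 150
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 0.333...
```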
Current Landscape
Zero-shot object detection in 2025 has moved from research curiosity to practical tool. Grounding DINO is the de facto standard, combining DETR-style detection with cross-modal grounding. The Grounded SAM pipeline (Grounding DINO → SAM) has become the Swiss Army knife of vision — detect and segment anything from text. Competition from OWLv2, YOLO-World, and Florence-2 is healthy but Grounding DINO maintains the accuracy lead. The biggest shift is in robotics and autonomous systems, where open-vocabulary detection eliminates the need to retrain detectors for each new environment or object set.
Key Challenges
Text-region grounding precision — models sometimes match the wrong text phrase to the right region, or detect the right concept in the wrong location
Fine-grained categories — distinguishing 'golden retriever' from 'labrador retriever' via text alone, without visual examples, remains difficult
Complex spatial queries — descriptions like 'the cup to the left of the laptop' require spatial reasoning that current models handle poorly
Speed — open-vocabulary detectors are 3-10× slower than closed-set detectors due to text encoding and cross-modal attention; real-time is challenging
Negative detection — knowing when an object is NOT present (avoiding false positives when queried about absent objects) is poorly handled
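In practice, the negative-detection problem is usually handled by thresholding the text-region matching scores: a query for an absent object should leave every box below the cutoff. A sketch of that filtering step; the 0.35 cutoff mirrors a box threshold commonly used in Grounding DINO demos, but it is an assumption here and the right value is dataset-dependent.

```python
import numpy as np

def detect(scores, threshold=0.35):
    """Keep only boxes whose best phrase-matching score clears the threshold.
    threshold=0.35 is an illustrative default, not a universal rule."""
    keep = scores.max(axis=1) >= threshold
    return np.flatnonzero(keep)

# Query for an absent object: all phrase scores stay low, nothing is returned
scores_absent = np.array([[0.12, 0.08], [0.20, 0.15]])
print(detect(scores_absent))   # []

# Query for a present object: one box clears the threshold
scores_present = np.array([[0.81, 0.05], [0.10, 0.12]])
print(detect(scores_present))  # [0]
```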
Quick Recommendations
Best zero-shot accuracy: Grounding DINO 1.5 Pro. 52%+ zero-shot COCO AP; the best open-vocabulary detector available, strong on both common and rare categories.
Scalable web deployment: OWLv2-L. Good balance of speed and accuracy; self-training on web data means it handles diverse internet imagery well.
Detection + segmentation: Grounding DINO + SAM 2 (Grounded SAM). Detect with text, segment with SAM; full open-vocabulary instance segmentation without any training.
Unified multi-task: Florence-2-Large. A single model handles detection, captioning, grounding, and segmentation; versatile for applications that need multiple capabilities.
Edge deployment: Grounding DINO 1.5 Edge or YOLO-World. YOLO-World adds CLIP text grounding to the YOLO architecture and runs at real-time speeds with acceptable accuracy loss.
What's Next
The field is converging toward unified vision-language models where detection is one capability among many (grounding, segmentation, VQA, captioning). Dedicated open-vocabulary detectors may be absorbed into general VLMs within 2-3 years. Key frontiers: real-time open-vocabulary detection on edge devices, 3D open-vocabulary detection from RGB-D or point clouds, and active detection (asking clarifying questions when the query is ambiguous).