Zero-Shot Object Detection
Zero-shot object detection localizes objects described by free-form text, without any task-specific fine-tuning, realizing the open-vocabulary goal of detection. Grounding DINO (2023) combined DINO's detection architecture with grounded pre-training to achieve state-of-the-art open-set detection, while OWL-ViT and YOLO-World demonstrated different paths to the same goal. The core technical challenge is grounding language precisely enough to distinguish similar objects ("the red car" vs "the blue car" in the same scene). This approach is rapidly replacing traditional closed-set detectors in production because it eliminates the most painful step: collecting and annotating domain-specific training data.
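The common recipe behind these models is to score each candidate box against each text query in a shared embedding space. Below is a minimal sketch of that scoring step, in the style of OWL-ViT's open-vocabulary head: region and text embeddings are compared by dot product, scaled, and passed through a per-query sigmoid. The encoder here is a random stand-in, and the temperature and threshold values are illustrative assumptions, not values from any released model.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512  # illustrative; real models use their own projection size


def fake_embed(n: int) -> np.ndarray:
    """Stand-in for a frozen image/text encoder: random L2-normalized
    embeddings in the shared region-text space (hypothetical)."""
    v = rng.normal(size=(n, EMBED_DIM))
    return v / np.linalg.norm(v, axis=1, keepdims=True)


# Four candidate boxes (x0, y0, x1, y1) proposed by the detector,
# and two free-form text queries.
boxes = np.array([[10, 10, 50, 50],
                  [60, 12, 95, 48],
                  [20, 60, 70, 95],
                  [5, 5, 98, 98]], dtype=float)
queries = ["the red car", "the blue car"]

region_feats = fake_embed(len(boxes))    # shape (4, 512)
text_feats = fake_embed(len(queries))    # shape (2, 512)

# Open-vocabulary scoring: cosine similarity between every region and
# every query (embeddings are unit-norm, so the dot product is cosine),
# scaled by a temperature, then squashed per query with a sigmoid so a
# box can match zero, one, or several queries independently.
temperature = 100.0  # assumed; learned in real models
logits = temperature * region_feats @ text_feats.T   # shape (4, 2)
probs = 1.0 / (1.0 + np.exp(-logits))

# Keep (box, query) pairs above a confidence threshold.
threshold = 0.5
for box_idx, query_idx in np.argwhere(probs > threshold):
    print(f"box {boxes[box_idx]} matches {queries[query_idx]!r} "
          f"(p={probs[box_idx, query_idx]:.2f})")
```

With real encoders, "the red car" and "the blue car" map to nearby but distinct text embeddings, so precise grounding comes down to whether the region features separate the two cars more than the queries resemble each other.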
LVIS Zero-Shot
Open-vocabulary object detection on 1203 LVIS categories
Top 10
Leading models on LVIS Zero-Shot.
All datasets
2 datasets tracked for this task.
Related tasks
Other tasks in Computer Vision.