Open-Vocabulary Object Detection
Object detection with an open vocabulary: detecting objects described by arbitrary text queries, rather than being limited to a fixed set of training categories.
Open-Vocabulary Object Detection is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
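For intuition, here is a minimal sketch of the task using the OWL-ViT open-vocabulary detector available through Hugging Face transformers. The checkpoint name, image path, text queries, and score threshold are illustrative assumptions, not part of any benchmark protocol.

```python
# Minimal open-vocabulary detection sketch using OWL-ViT via Hugging Face
# transformers. The queries are free-form text, not a fixed label set.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")  # placeholder image path
queries = [["a red bicycle", "a traffic light", "a person walking a dog"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```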
Benchmarks & SOTA
LVIS (Object Detection)
LVIS is a large-scale, high-quality dataset for object detection containing roughly 164k images and about 2M instance annotations spanning more than 1,000 object categories (1,203 in LVIS v1.0). It targets long-tail object recognition, providing a larger and more fine-grained vocabulary than COCO. LVIS reuses the images of the COCO dataset but with its own splits and annotations. Categories are grouped by how often they appear as rare, common, and frequent, and models are evaluated with standardized metrics such as mean Average Precision (mAP), often reported per group.
No results tracked yet
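For reference, LVIS mAP is usually computed with the official lvis-api; below is a minimal evaluation sketch assuming ground-truth annotations and COCO-format detection results are already on disk (both file paths are placeholders).

```python
# Sketch of LVIS evaluation with the official lvis-api
# (https://github.com/lvis-dataset/lvis-api). File paths are placeholders.
from lvis import LVISEval

lvis_eval = LVISEval(
    "lvis_v1_val.json",    # ground-truth annotations
    "my_detections.json",  # detections in COCO result format
    "bbox",                # evaluate boxes; "segm" evaluates masks
)
lvis_eval.run()
lvis_eval.print_results()  # prints AP, including rare/common/frequent splits
```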
ODinW13 (subset of ODinW)
Object Detection in the Wild (ODinW) — subset: ODinW13
Object Detection in the Wild (ODinW) is a benchmark suite, originating from the "Computer Vision in the Wild" (CVinW) community and its EvalAI challenge, that aggregates multiple public object-detection datasets to evaluate in-the-wild and zero-shot transfer performance of detectors. "ODinW13" refers to a fixed subset of 13 datasets from the ODinW collection that is commonly reported as a single number: the mAP averaged across those 13 datasets.

ODinW13 is not a stand-alone dataset with a single introducing paper; it is an evaluation collection used by many object-detection and open-vocabulary detection works (for example, ScaleDet, arXiv:2306.04849, reports results on "13 datasets from Object Detection in the Wild (ODinW)"). Because it is a reported subset of the broader benchmark rather than an independent dataset release, there is no canonical arXiv paper or Hugging Face dataset page to link to; papers using ODinW13 typically cite the CVinW/ELEVATER resources or the EvalAI challenge page when reporting results.
No results tracked yet
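Because the ODinW13 number is simply the mean mAP over its 13 member datasets, the aggregation itself is trivial; the sketch below uses hypothetical per-dataset scores purely for illustration.

```python
# Sketch: ODinW13 is reported as the unweighted mean of mAP over 13 datasets.
# The dataset names and scores below are hypothetical placeholders.
per_dataset_map = {
    "Aquarium": 0.18,
    "EgoHands": 0.45,
    "Raccoon": 0.31,
    # ... the remaining datasets in the ODinW13 split ...
}

odinw13 = sum(per_dataset_map.values()) / len(per_dataset_map)
print(f"ODinW13 average mAP: {odinw13:.3f}")
```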
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
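This evaluation is commonly organized into N-way, K-shot episodes; here is a minimal episode-sampling sketch, where the `pool` structure (class name to image paths) is a hypothetical stand-in for a real dataset.

```python
# Sketch of sampling one N-way, K-shot evaluation episode.
# `pool` maps class name -> list of image paths (hypothetical data layout).
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=15):
    classes = random.sample(sorted(pool), n_way)  # pick N classes
    support, query = [], []
    for cls in classes:
        examples = random.sample(pool[cls], k_shot + n_query)
        support += [(path, cls) for path in examples[:k_shot]]  # K labeled shots
        query += [(path, cls) for path in examples[k_shot:]]    # held-out queries
    return support, query
```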
Object counting
Object counting is a computer vision task that identifies and enumerates distinct objects within digital images and videos, distinguishing between object types, sizes, and shapes even in crowded or dynamically changing scenes. A typical pipeline first detects objects with a deep learning model such as a convolutional neural network (CNN) to recognize and localize instances, then aggregates the detections into a total count, as sketched below. The technology is applied in fields like manufacturing for quality control and production monitoring.
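A minimal sketch of that detect-then-aggregate pattern; the detection list is a hypothetical stand-in for real detector output.

```python
# Sketch of the detect-then-count pattern: run any object detector, keep
# confident detections, and aggregate counts per class.
from collections import Counter

# Hypothetical detector output: (label, confidence) pairs for one image.
detections = [
    ("bottle", 0.91), ("bottle", 0.88), ("cap", 0.75),
    ("bottle", 0.42),  # low-confidence detection, filtered out below
]

CONF_THRESHOLD = 0.5
counts = Counter(label for label, score in detections if score >= CONF_THRESHOLD)

print(dict(counts))  # e.g. {'bottle': 2, 'cap': 1}
```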
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation, which operates on static images, video segmentation must keep object identities consistent across the frames of a sequence.
OCR
OCR (Optical Character Recognition) is the task of converting an image containing text into machine-readable, editable, and searchable digital text. It takes scanned documents, photos, or image-only PDFs out of their static visual format so they can be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for a banking app, translating signs with Google Translate, and creating searchable archives from old documents.
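As a quick illustration, a common starting point is Tesseract via the pytesseract wrapper; the sketch below assumes the Tesseract binary is installed locally and the file name is a placeholder.

```python
# Minimal OCR sketch using pytesseract (requires the Tesseract binary to be
# installed on the system). The input path is a placeholder.
from PIL import Image
import pytesseract

image = Image.open("scanned_receipt.png")
text = pytesseract.image_to_string(image)  # returns extracted text as a string
print(text)
```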