Open-Vocabulary Object Detection
Object detection with an open vocabulary: detecting objects described by arbitrary text queries, rather than being limited to a fixed set of training categories.
Open-Vocabulary Object Detection is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.
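For intuition, here is a minimal sketch of the task using the OWL-ViT open-vocabulary detector available through Hugging Face transformers. The checkpoint name, image path, text queries, and score threshold are illustrative assumptions, not part of any benchmark protocol.

```python
# Minimal open-vocabulary detection sketch using OWL-ViT via Hugging Face
# transformers. The queries are free-form text, not a fixed label set.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg").convert("RGB")  # placeholder image path
queries = [["a red bicycle", "a traffic light", "a person walking a dog"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(f"{queries[0][label]}: {score:.2f} at {box.tolist()}")
```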
Benchmarks & SOTA
LVIS (Object Detection)
LVIS is a large-scale, high-quality dataset for object detection containing roughly 164k images and about 2M instance annotations spanning more than 1,000 object categories (1,203 in LVIS v1.0). It targets long-tail object recognition, providing a larger and more fine-grained vocabulary than COCO. LVIS reuses the images of the COCO dataset but with its own splits and annotations. Categories are grouped by how often they appear as rare, common, and frequent, and models are evaluated with standardized metrics such as mean Average Precision (mAP), often reported per group.
No results tracked yet
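For reference, LVIS mAP is usually computed with the official lvis-api; below is a minimal evaluation sketch assuming ground-truth annotations and COCO-format detection results are already on disk (both file paths are placeholders).

```python
# Sketch of LVIS evaluation with the official lvis-api
# (https://github.com/lvis-dataset/lvis-api). File paths are placeholders.
from lvis import LVISEval

lvis_eval = LVISEval(
    "lvis_v1_val.json",    # ground-truth annotations
    "my_detections.json",  # detections in COCO result format
    "bbox",                # evaluate boxes; "segm" evaluates masks
)
lvis_eval.run()
lvis_eval.print_results()  # prints AP, including rare/common/frequent splits
```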
ODinW13 (subset of ODinW)
Object Detection in the Wild (ODinW) — subset: ODinW13
Object Detection in the Wild (ODinW) is a benchmark suite, originating from the "Computer Vision in the Wild" (CVinW) community and its EvalAI challenge, that aggregates multiple public object-detection datasets to evaluate in-the-wild and zero-shot transfer performance of detectors. "ODinW13" refers to a fixed subset of 13 datasets from the ODinW collection that is commonly reported as a single number: the mAP averaged across those 13 datasets.

ODinW13 is not a stand-alone dataset with a single introducing paper; it is an evaluation collection used by many object-detection and open-vocabulary detection works (for example, ScaleDet, arXiv:2306.04849, reports results on "13 datasets from Object Detection in the Wild (ODinW)"). Because it is a reported subset of the broader benchmark rather than an independent dataset release, there is no canonical arXiv paper or Hugging Face dataset page to link to; papers using ODinW13 typically cite the CVinW/ELEVATER resources or the EvalAI challenge page when reporting results.
No results tracked yet
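Because the ODinW13 number is simply the mean mAP over its 13 member datasets, the aggregation itself is trivial; the sketch below uses hypothetical per-dataset scores purely for illustration.

```python
# Sketch: ODinW13 is reported as the unweighted mean of mAP over 13 datasets.
# The dataset names and scores below are hypothetical placeholders.
per_dataset_map = {
    "Aquarium": 0.18,
    "EgoHands": 0.45,
    "Raccoon": 0.31,
    # ... the remaining datasets in the ODinW13 split ...
}

odinw13 = sum(per_dataset_map.values()) / len(per_dataset_map)
print(f"ODinW13 average mAP: {odinw13:.3f}")
```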
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
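This evaluation is commonly organized into N-way, K-shot episodes; here is a minimal episode-sampling sketch, where the `pool` structure (class name to image paths) is a hypothetical stand-in for a real dataset.

```python
# Sketch of sampling one N-way, K-shot evaluation episode.
# `pool` maps class name -> list of image paths (hypothetical data layout).
import random

def sample_episode(pool, n_way=5, k_shot=1, n_query=15):
    classes = random.sample(sorted(pool), n_way)  # pick N classes
    support, query = [], []
    for cls in classes:
        examples = random.sample(pool[cls], k_shot + n_query)
        support += [(path, cls) for path in examples[:k_shot]]  # K labeled shots
        query += [(path, cls) for path in examples[k_shot:]]    # held-out queries
    return support, query
```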
Object counting
Object counting is a computer vision task that identifies and enumerates distinct objects within digital images and videos, distinguishing between object types, sizes, and shapes even in crowded or dynamically changing scenes. A typical pipeline first detects objects with a deep learning model such as a convolutional neural network (CNN) to recognize and localize instances, then aggregates the detections into a total count, as sketched below. The technology is applied in fields like manufacturing for quality control and production monitoring.
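A minimal sketch of that detect-then-aggregate pattern; the detection list is a hypothetical stand-in for real detector output.

```python
# Sketch of the detect-then-count pattern: run any object detector, keep
# confident detections, and aggregate counts per class.
from collections import Counter

# Hypothetical detector output: (label, confidence) pairs for one image.
detections = [
    ("bottle", 0.91), ("bottle", 0.88), ("cap", 0.75),
    ("bottle", 0.42),  # low-confidence detection, filtered out below
]

CONF_THRESHOLD = 0.5
counts = Counter(label for label, score in detections if score >= CONF_THRESHOLD)

print(dict(counts))  # e.g. {'bottle': 2, 'cap': 1}
```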
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation, which operates on static images, video segmentation must keep object identities consistent across the frames of a sequence.
OCR
OCR (Optical Character Recognition) is the task of converting an image containing text into machine-readable, editable, and searchable digital text. It takes scanned documents, photos, or image-only PDFs out of their static visual format so they can be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for a banking app, translating signs with Google Translate, and creating searchable archives from old documents.
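As a quick illustration, a common starting point is Tesseract via the pytesseract wrapper; the sketch below assumes the Tesseract binary is installed locally and the file name is a placeholder.

```python
# Minimal OCR sketch using pytesseract (requires the Tesseract binary to be
# installed on the system). The input path is a placeholder.
from PIL import Image
import pytesseract

image = Image.open("scanned_receipt.png")
text = pytesseract.image_to_string(image)  # returns extracted text as a string
print(text)
```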