Computer Vision

Few-Shot Image Classification

Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
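
As a minimal illustration of the episodic N-way K-shot protocol described above, the sketch below classifies query embeddings against class prototypes averaged from the support set. The embeddings and the prototype classifier are generic placeholders, not a specific method tracked on this page.

```python
import torch

def classify_episode(support, support_labels, query, n_way):
    # One prototype per class: the mean embedding of its support examples.
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(n_way)])
    # Assign each query to its nearest prototype (squared Euclidean distance).
    logits = -torch.cdist(query, prototypes) ** 2
    return logits.argmax(dim=1)

# Toy 5-way 1-shot episode with 64-dimensional placeholder embeddings.
support = torch.randn(5, 64)
support_labels = torch.arange(5)
query = torch.randn(15, 64)
print(classify_episode(support, support_labels, query, n_way=5))
```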

97 datasets · 0 results

Few-Shot Image Classification is a key task in computer vision. Below you will find the standard benchmarks used to evaluate models, along with current state-of-the-art results.

Benchmarks & SOTA

COCO Captions

COCO Captions Dataset

2015 · 0 results

COCO Captions contains over 1.5 million captions describing over 330,000 images. For training and validation images, five independent human-generated captions are provided for each image.
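
For reference, the caption annotations are typically read with the pycocotools API; a small sketch is shown below, assuming a standard local COCO download (the annotation path is illustrative).

```python
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_val2014.json")  # assumed local path
img_id = coco_caps.getImgIds()[0]
ann_ids = coco_caps.getAnnIds(imgIds=img_id)
for ann in coco_caps.loadAnns(ann_ids):
    print(ann["caption"])  # five independent human-written captions per image
```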

No results tracked yet

WebBench

0 results

No results tracked yet

REAL

0 results

WorkBench (listed here as REAL) is an action-based dataset for agents in a realistic workplace setting. It supports actions but not tool usage.

No results tracked yet

OSUniverse

0 results

No results tracked yet

ScreenSuite

0 results

No results tracked yet

Atari

0 results

The Atari dataset is used for reinforcement learning and involves training agents on Atari games using deep reinforcement learning. Agents are loaded from the RL Baselines Zoo package and trained on additional samples of each game.
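
For orientation, a minimal Atari environment loop using Gymnasium's ALE interface is sketched below; this is an illustrative random-policy rollout (requires the gymnasium Atari extras), not the RL Baselines Zoo training pipeline itself.

```python
import gymnasium as gym  # needs: pip install "gymnasium[atari,accept-rom-license]"

env = gym.make("ALE/Breakout-v5")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(1000):
    action = env.action_space.sample()  # random policy as a placeholder
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    if terminated or truncated:
        obs, info = env.reset()
print("episode return (random policy):", total_reward)
```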

No results tracked yet

MTEB

0 results

MTEB (Massive Text Embedding Benchmark) is a large-scale benchmark designed to measure the performance of text embedding models across diverse embedding tasks. It includes 56 datasets covering 8 tasks and supports over 112 different languages. It's easy to use and extensible, allowing new datasets to be added. MTEB tests how well embedding models work with different types of text and tasks, providing a complete picture of each model's strengths and weaknesses.
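
As an illustration of how such an evaluation is typically run, the sketch below uses the `mteb` package (API of recent releases) with a sentence-transformers model; the model and task names are examples, not results tracked here.

```python
import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # example embedding model
tasks = mteb.get_tasks(tasks=["Banking77Classification"])  # example MTEB task
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")
print(results)
```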

No results tracked yet

MMTEB

0 results

MMTEB (Massive Multilingual Text Embedding Benchmark) is a large-scale, open collaboration benchmark with over 500 tasks, covering many low-resource languages. It includes diverse and challenging tasks such as instruction following, long-document retrieval, and code retrieval. It also uses a downsampling technique to minimize evaluation resources.

No results tracked yet

COCO-Text

COCO-Text Dataset

2016 · 0 results

COCO-Text is a dataset for text detection and recognition. Based on MS COCO, it contains 63,686 images with 173,589 text instances. It includes both legible and illegible text.

No results tracked yet

7-Scenes

0 results

The 7-Scenes dataset is a collection of tracked RGB-D camera frames. It can be used to evaluate methods for dense tracking and mapping and relocalization techniques. The dataset contains 7 different indoor environments, each with 500-1000 image video sequences recorded by a handheld Kinect RGB-D camera at a resolution of 640x480. It includes ground truth camera tracks and dense 3D models obtained through Kinect Fusion.
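
Relocalization on 7-Scenes is usually scored with per-frame pose errors; below is a generic sketch of the translation and rotation error between an estimated and a ground-truth 4x4 camera pose (placeholder matrices, not dataset-specific code).

```python
import numpy as np

def pose_errors(T_est, T_gt):
    # Translation error in meters (7-Scenes poses are metric).
    t_err = np.linalg.norm(T_est[:3, 3] - T_gt[:3, 3])
    # Rotation error: angle of the relative rotation, in degrees.
    R_rel = T_est[:3, :3].T @ T_gt[:3, :3]
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return t_err, np.degrees(np.arccos(cos_angle))

T_est, T_gt = np.eye(4), np.eye(4)   # placeholder camera poses
print(pose_errors(T_est, T_gt))      # -> (0.0, 0.0)
```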

No results tracked yet

Cambridge

0 results

The Cambridge dataset is used for 3D Generation tasks.

No results tracked yet

COCO-WholeBody

COCO-WholeBody Dataset

2020 · 0 results

COCO-WholeBody is an extension of COCO with whole-body pose annotations including body, foot, face, and hand keypoints.

No results tracked yet

BLAB

0 results

BLAB (Brutally Long Audio Bench) is a dataset designed to evaluate the multimodal understanding abilities of audio language models (LMs). It focuses on audio perception and reasoning, with eight distinct audio tasks across four categories: localization, counting, emotion, and duration estimation. It assesses audio LMs that accept both text and audio as input and generate text.

No results tracked yet

AudioBench

0 results

AudioBench is a universal benchmark designed to evaluate Audio Large Language Models (AudioLLMs). It encompasses 8 distinct tasks and 26 datasets, including newly proposed datasets. Its task categories include Speech Understanding, Audio-Scene Understanding, Voice Understanding (Paralinguistic), Music Understanding, and Singlish Understanding.

No results tracked yet

SciVideoBench

SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models

0 results

The first comprehensive benchmark dedicated to scientific video reasoning. SciVideoBench evaluates models across Physics, Chemistry, Biology, and Medicine, covering both perceptual understanding and high-level reasoning tasks. It provides a rigorous benchmark for evaluating long-form video reasoning in domains where accuracy and explainability matter most. Features 1,000 high-quality, human-verified multiple-choice questions across 240+ scientific experiments with rich metadata including discipline, subject, timestamp breakdowns, and rationale.

No results tracked yet

OCRBench v2

OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning

0 results

OCRBench v2 is a large-scale bilingual text-centric benchmark with currently the most comprehensive set of tasks (4× more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios including street scene, receipt, formula, diagram, and so on), and thorough evaluation metrics, with a total of 10,000 human-verified question-answering pairs and a high proportion of difficult samples.

No results tracked yet

MRCR

Multi-turn Response Coherence and Relevance

0 results

MRCR (Multi-turn Response Coherence and Relevance) is a dataset for evaluating language models on multi-turn conversation quality, focusing on response coherence and relevance in dialogue contexts.

No results tracked yet

LOFT

Long-Context Frontiers

0 results

A benchmark of real-world tasks requiring context up to millions of tokens, designed to evaluate long-context language models' performance on in-context retrieval and reasoning.

No results tracked yet

ASV2015

0 results

The ASV2015 (Automatic Speaker Verification Spoofing and Countermeasures) dataset is used for automated speaker verification purposes. The dataset is generated from more than ten voice conversion and speech synthesis spoofing techniques.

No results tracked yet

CREMA-D

0 results

The CREMA-D dataset is a collection of 7,442 video and audio clips of 91 actors expressing six basic emotions: happy, sad, anger, fear, disgust, and neutral. The dataset was created for studying multimodal emotion recognition and includes perceptual ratings from crowd-sourced raters in audio-only, visual-only, and audio-visual modalities.

No results tracked yet

Fluent Speech Commands

0 results

The Fluent Speech Commands dataset is a collection of 30,043 English voice command utterances from 97 speakers, designed for training and evaluating systems that understand spoken language directly from audio without an intermediate text transcription. The dataset contains single-channel .wav files, each labeled with three slots (action, object, and location) that together represent the intent of the command, such as "turn up the heat in the kitchen". It was created by Fluent.ai and is released for academic research purposes only.

No results tracked yet

LibriCount

0 results

LibriCount is a dataset designed for speaker count estimation that simulates a "cocktail party" environment with up to 10 speakers. It includes audio wave files and JSON annotation files, which contain metadata like the ground truth number of speakers, speaker IDs, and vocal activity. The dataset consists of 5-second, 16kHz, 16-bit mono audio recordings mixed from random utterances from the LibriSpeech CleanTest dataset.

No results tracked yet

LibriSpeech-100h

0 results

The LibriSpeech dataset is a large corpus of approximately 1,000 hours of 16kHz read English speech, derived from public domain audiobooks from the LibriVox project. It is widely used for training and evaluating automatic speech recognition (ASR) systems, and the data has been meticulously segmented and aligned with corresponding text transcripts. LibriSpeech-100h is a smaller, 100-hour subset of the full corpus.

No results tracked yet

LibriSpeech-Male-Female

0 results

The LibriSpeech dataset is a large corpus of approximately 1,000 hours of 16kHz read English speech, derived from public domain audiobooks from the LibriVox project. It is widely used for training and evaluating automatic speech recognition (ASR) systems, and the data has been meticulously segmented and aligned with corresponding text transcripts. This variant is used for classifying male vs. female voices.

No results tracked yet

RAVDESS

0 results

The RAVDESS dataset is a collection of audio-visual recordings of 24 professional actors (12 male, 12 female) vocalizing emotions like calm, happy, sad, angry, fearful, disgust, and surprised. It includes both speech and song, with recordings in various modalities such as audio-only, video-only, and full audio-visual. This dataset is used for speech emotion recognition (SER) to train and test machine learning models.

No results tracked yet

Speech Commands V1

0 results

The Speech Commands V1 dataset is an audio dataset of 65,000 one-second audio clips of 30 single words, used to train and evaluate keyword spotting systems. It includes 20 core command words like "yes," "no," and "go," and 10 auxiliary words like "marvin" and "wow," along with background noise files. The audio was recorded by thousands of different people via crowdsourcing and is available under a Creative Commons BY 4.0 license.
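
The dataset can be loaded, for example, through torchaudio's built-in wrapper; note that torchaudio defaults to the v0.02 release, so the V1 data described above is requested explicitly in this illustrative sketch.

```python
import torchaudio

dataset = torchaudio.datasets.SPEECHCOMMANDS(
    root=".", url="speech_commands_v0.01", download=True)
waveform, sample_rate, label, speaker_id, utterance_number = dataset[0]
print(label, sample_rate, waveform.shape)  # one-second clip of a single word
```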

No results tracked yet

VoxCeleb1

0 results

VoxCeleb1 contains over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.

No results tracked yet

VoxLingua33

0 results

VoxLingua107 is a comprehensive speech dataset designed for training spoken language identification models. It comprises short speech segments sourced from YouTube videos, labeled based on the language indicated in the video title and description. The dataset covers 107 languages and contains a total of 6628 hours of speech data, averaging 62 hours per language. However, the actual amount of data per language varies significantly. Additionally, there is a separate development set consisting of 1609 speech segments from 33 languages, validated by at least two volunteers to ensure the accuracy of language representation; this 33-language development set is presumably the VoxLingua33 split referenced by this entry.

No results tracked yet

DESED

0 results

DESED is a dataset designed to recognize sound event classes in domestic environments. It is intended for sound event detection (SED: recognizing events together with their time boundaries) but can also be used for audio tagging (AT: indicating the presence of an event in an audio file). For now, the dataset comprises 10 event classes to recognize in 10-second audio files. Classes: Alarm/bell/ringing, Blender, Cat, Dog, Dishes, Electric shaver/toothbrush, Frying, Running water, Speech, Vacuum cleaner.

No results tracked yet

FSD18-Kaggle

0 results

FSDKaggle2018 is an audio dataset containing 11,073 audio files annotated with 41 labels of the AudioSet Ontology.

No results tracked yet

FSD50k

0 results

Freesound Dataset 50k (or FSD50K for short) is an open dataset of human-labeled sound events containing 51,197 Freesound clips unequally distributed in 200 classes drawn from the AudioSet Ontology. FSD50K has been created at the Music Technology Group of Universitat Pompeu Fabra.

No results tracked yet

UrbanSound 8k

0 results

The UrbanSound 8k dataset contains 8,732 labeled sound excerpts (<=4s) of urban sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, drilling, engine_idling, gun_shot, jackhammer, siren, and street_music. The classes are drawn from the urban sound taxonomy.

No results tracked yet

Vocal Imitation

0 results

The Vocal Imitation dataset is a collection of crowd-sourced recordings where people imitate a wide variety of sounds, like animal noises or mechanical sounds. The dataset was created to support research into systems that can understand and process vocal imitations, such as search engines that can be queried by imitating a sound (Query-by-Vocal Imitation or QBV).

No results tracked yet

SWE-Bench

0 results

SWE-bench is a benchmark built from diverse repositories that evaluates language models on software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. The full benchmark has 2,294 instances.
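
The instances can be pulled from the public Hugging Face release; a small sketch follows, with field names as documented in that release.

```python
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench", split="test")
example = swebench[0]
print(example["repo"], example["base_commit"])
print(example["problem_statement"][:300])  # issue text the model must resolve
```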

No results tracked yet

Free Music Archive (FMA) Small

0 results

The Free Music Archive (FMA) Small dataset is a subset of the FMA, containing 8,000 audio tracks of 30 seconds each, balanced across 8 genres (Electronic, Experimental, Folk, Hip-Hop, Instrumental, International, Pop, Rock). It is primarily used for music information retrieval research, such as training genre classification models, and includes both the audio clips and associated metadata.

No results tracked yet

MAESTRO

0 results

MAESTRO (MIDI and Audio Edited for Synchronous Tracks and Organization) is a dataset composed of over 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms.

No results tracked yet

NSynth-Instruments

0 results

The NSynth Dataset is a dataset for musical instrument identification. It contains 305,979 samples (musical notes), each with a unique pitch, timbre, and envelope, generated from 1,006 musical instruments.

No results tracked yet

X-ARES (kNN)

0 results

X-ARES (eXtensive Audio Representation and Evaluation Suite) is an audio encoder benchmark with kNN (K-nearest neighbor) evaluation method. This unparameterized evaluation uses pre-trained model embeddings directly for classification without training, testing the inherent quality of audio representations.
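
In the spirit of that unparameterized protocol, a generic kNN probe over frozen embeddings looks like the following; random arrays stand in for real encoder outputs, and this is not X-ARES code.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

train_emb = np.random.randn(1000, 768)   # embeddings from a frozen audio encoder
train_y = np.random.randint(0, 10, 1000)
test_emb = np.random.randn(200, 768)
test_y = np.random.randint(0, 10, 200)

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(train_emb, train_y)
print("accuracy:", accuracy_score(test_y, knn.predict(test_emb)))
```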

No results tracked yet

X-ARES (MLP)

0 results

X-ARES (eXtensive Audio Representation and Evaluation Suite) is an audio encoder benchmark with MLP (Linear Fine-Tuning) evaluation method. This method trains a linear layer using provided embeddings with predefined hyperparameters, assessing how effectively fixed representations can be adapted to specific tasks.

No results tracked yet

NTIRE 2024 Transparent Surface Challenge (relative)

NTIRE 2024: HR Depth from Images of Specular and Transparent Surfaces (Booster Dataset) (Relative Depth)

0 results

The 'NTIRE 2024 Transparent Surface Challenge' dataset is part of the "NTIRE 2024: HR Depth from Images of Specular and Transparent Surfaces Challenge," held with the New Trends in Image Restoration and Enhancement (NTIRE) workshop at CVPR 2024. The challenge targets advancing depth estimation in challenging scenarios, specifically high-resolution images of specular and transparent (non-Lambertian) surfaces. The affiliated dataset, referred to as the 'Booster' dataset, provides annotated images focusing on these types of surfaces to foster algorithms capable of accurate, high-resolution depth prediction in difficult visual conditions. The dataset supports both monocular and stereo depth estimation tracks and is intended to catalyze the field toward solving unsolved challenges in depth prediction.

No results tracked yet

SWE-Bench Pro

0 results

SWE-Bench Pro is a challenging benchmark for evaluating LLMs/agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. The public set contains 731 instances, and there is also a commercial set with 276 instances from private, proprietary codebases.

No results tracked yet

SAT-493M

SAT-493M (Maxar 493M satellite imagery pretraining dataset)

0 results

SAT-493M is a large-scale pretraining corpus of commercial Maxar satellite imagery used by the DINOv3 project. It contains approximately 493 million RGB, ortho-rectified image chips (tiles) at 512×512 pixels, sampled at ~0.6 meter ground sampling distance. The collection was assembled from Maxar high-resolution optical imagery and was used to pre-train DINOv3 satellite models (e.g., ViT-7B and distilled variants) to produce high-quality dense features for remote-sensing / overhead-vision tasks. The dataset is a proprietary/commercial compilation of Maxar imagery (not published as an open Hugging Face dataset) and is provided to the DINOv3 team under Maxar licensing; access and redistribution are therefore restricted. Primary sources: DINOv3 paper (arXiv:2508.10104) and the DINOv3 repository / model cards which describe models pretrained on “SAT-493M.”

No results tracked yet

LVD-1689M

LVD-1689M

0 results

LVD-1689M is a large curated web-image dataset used by the DINOv3 authors for self-supervised pretraining. According to the DINOv3 paper and the model README, LVD-1689M contains approximately 1,689 million (1.689B) images sampled from a much larger pool (~17 billion) of web images collected from public Instagram posts. The authors describe LVD-1689M as a curated subset intended for large-scale SSL pretraining (used in the DINOv3 pretraining mixture); the paper and associated model documentation state the subset was created via clustering and balanced sampling to improve diversity and downstream generalization. LVD-1689M is not listed as a standalone public dataset on Hugging Face; primary references are the DINOv3 paper (arXiv:2508.10104), the Meta AI DINOv3 project page, the facebookresearch/dinov3 GitHub repo, and multiple DINOv3 model cards on Hugging Face that state models were pretrained on "LVD-1689M".

No results tracked yet

ScreenSpot

0 results

The ScreenSpot-Pro dataset is designed for GUI grounding evaluation. It consists of 1,581 instruction-image pairs across 23 applications, with each sample including a high-resolution screenshot and a natural language instruction describing the target UI element. The dataset is categorized into 6 application types: Development and Programming, Creative Software, CAD and Engineering, Scientific and Analytical, Office Software, and Operating System Commons.

No results tracked yet

ImageNet-Hard

ImageNet-Hard

0 results

ImageNet-Hard is a robustness benchmark of "hard" ImageNet-scale examples curated to challenge modern vision models. It contains ~10.98k images gathered from multiple ImageNet variants and related benchmarks (ImageNet, ImageNet-V2, ImageNet-Sketch, ImageNet-C, ImageNet-R, ImageNet-ReaL, ImageNet-A, and ObjectNet). The set was created in the NeurIPS 2023 Datasets & Benchmarks work “ImageNet-Hard: The Hardest Images Remaining from a Study of the Power of Zoom and Spatial Biases in Image Classification” to capture the hardest remaining examples after studying model behavior under zoom and spatial-bias interventions. Images are provided with class labels and a metadata/origin field indicating their source dataset. The benchmark is intended for evaluating classification robustness and OOD / hard-example performance; the Hugging Face dataset card lists task category image-classification, MIT license, and size category ~10K–100K images.

No results tracked yet

IMC (Image Matching Challenge)

Image Matching Challenge — Phototourism (IMC-PT)

0 results

The Image Matching Challenge (IMC) Phototourism benchmark (IMC-PT) is part of the Image Matching Challenge / Image Matching Workshop (IMW) benchmark suite used to evaluate local features, matching methods and robust geometry estimation for camera pose estimation and multi-view reconstruction. The benchmark provides image collections (Phototourism scenes) with Colmap-derived (pseudo-)ground-truth camera poses, densified depth maps and co-visibility information; it is organized into tracks for stereo (image pairs) and multi-view reconstruction and supports restricted/unrestricted keypoint settings. Typical evaluation measures include camera-pose accuracy summarized as AUC of angular/positional thresholds (reported commonly as AUC@3°, AUC@5°, AUC@10°), plus runtime. The IMC site and the UBC-hosted pages provide downloadable training/validation sets (photo-tourism scenes), leaderboards, and an open-source evaluation pipeline (github: ubc-vision/image-matching-benchmark). The benchmark has been used in the IMW/CVPR workshop challenges (2020/2021 etc.) and is the reference benchmark for many recent local-feature and image-matching papers.
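
The AUC summary mentioned above integrates the fraction of image pairs whose pose error falls below a growing threshold; the following is a generic sketch of that computation with toy error values, not benchmark results.

```python
import numpy as np

def pose_auc(errors_deg, max_threshold_deg):
    # Area under the recall-vs-threshold curve, normalized to [0, 1].
    errors = np.sort(np.asarray(errors_deg, dtype=float))
    n = len(errors)
    inliers = errors[errors <= max_threshold_deg]
    recall = np.arange(1, len(inliers) + 1) / n
    x = np.concatenate(([0.0], inliers, [max_threshold_deg]))
    y = np.concatenate(([0.0], recall, recall[-1:] if len(recall) else [0.0]))
    return np.trapz(y, x) / max_threshold_deg

errors = [0.5, 1.2, 2.0, 4.5, 8.0, 15.0]  # per-pair pose errors in degrees (toy)
for t in (3, 5, 10):
    print(f"AUC@{t}°: {pose_auc(errors, t):.3f}")
```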

No results tracked yet

TAP-Vid (RGB-S)

TAP-Vid: A Benchmark for Tracking Any Point in a Video

0 results

TAP-Vid is a benchmark for the Tracking Any Point (TAP) problem: given a video and a set of query 2D points, the task is to track those physical/image points through time. Introduced by DeepMind et al., TAP-Vid includes both real-world videos with accurate human-annotated 2D point tracks and synthetic videos with dense ground-truth trajectories, enabling evaluation of long-range, deformable and occluded point motion. The benchmark provides standardized evaluation splits and metrics used in follow-up work (examples reported in the literature include AJ, delta_avg^vis and OA), and is widely used to evaluate point/pixel-level tracking methods (TAP models, TAPNet/TAPTR, CoTracker, etc.).
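
As an illustration, the position-accuracy component (delta_avg over visible points) averages the fraction of points tracked within 1/2/4/8/16 pixels of the ground truth; the sketch below uses placeholder arrays rather than the official evaluation code.

```python
import numpy as np

def delta_avg(pred_xy, gt_xy, visible, thresholds=(1, 2, 4, 8, 16)):
    dists = np.linalg.norm(pred_xy - gt_xy, axis=-1)[visible]
    return float(np.mean([(dists < t).mean() for t in thresholds]))

pred = np.random.rand(100, 2) * 256            # predicted point positions (toy)
gt = pred + np.random.randn(100, 2) * 3.0      # toy ground-truth positions
vis = np.ones(100, dtype=bool)                 # visibility mask
print(f"delta_avg over visible points: {delta_avg(pred, gt, vis):.3f}")
```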

No results tracked yet

CodeForces (CodeElo)

CodeElo

0 results

CodeElo is a competition-level code generation benchmark built from CodeForces problems and introduced in the paper “CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings” (arXiv:2501.01257). The Hugging Face dataset (Qwen/CodeElo) contains CodeForces problem metadata (problem id, url, title, difficulty rating, tags, contest division, time/memory limits, problem statement, IO examples, and notes) for the evaluation set (recent contest problems used by the benchmark). The benchmark standardizes evaluation by submitting solutions to the official CodeForces judge and computing Elo-style ratings for models, enabling direct comparison between LLMs and human competitors.

No results tracked yet

MMVP

MMVP (Multimodal Visual Patterns) Benchmark

0 results

MMVP (Multimodal Visual Patterns) is a small benchmark created to study systematic visual shortcomings of modern multimodal/vision-language models. It focuses on “CLIP‑blind” image pairs — images that CLIP-style embeddings consider similar despite clear visual differences — and categorizes failures into nine basic visual pattern classes (e.g., camera perspective, occlusion, small parts, etc.). The collection is intended for perception and reasoning evaluation of multimodal LLMs/VLMs (the authors evaluate models such as GPT-4V). The Hugging Face mirror contains ~300 images (MMVP) and a labeled subset/variant for VLM evaluation (MMVP_VLM, ~270 examples, 9 classes). The dataset was released with the paper “Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs” and the accompanying code/release on GitHub.

No results tracked yet

Language benchmarks (overall)

Language benchmarks (overall)

0 results

Aggregated metric computed by the paper authors representing the overall averaged performance across multiple language benchmarks listed in Table 11. This is not a standalone dataset of examples; rather it is a summary score (an aggregate evaluation) computed from a collection of language-task benchmark results reported in the paper. It represents an overall language evaluation across many language benchmarks (average across the listed language tasks).

No results tracked yet

Crossmodal-3600 (XM3600)

Crossmodal-3600: A Massively Multilingual Multimodal Evaluation Dataset

0 results

Crossmodal-3600 (XM3600) is an evaluation benchmark for massively multilingual image captioning. It contains 3,600 geographically-diverse images annotated with human-generated reference captions in 36 languages. The images were selected from regions where the target languages are spoken and captions were produced to be consistent in style across languages while avoiding direct translation artifacts. The dataset is intended for model selection and automatic evaluation of multilingual image-captioning systems (and has also been used as a golden reference in related image-text retrieval evaluations). The dataset and paper report experiments showing strong correlation between automatic metrics (using XM3600 as references) and human evaluations. Resources and metadata are available on the project page and a Hugging Face mirror.

No results tracked yet

MathVerse

0 results

MathVerse is an all-around visual math benchmark designed for an equitable and in-depth evaluation of multi-modal large language models (MLLMs). It also proposes a Chain-of-Thought evaluation strategy for a fine-grained assessment of the output answers.

No results tracked yet

HELMET

HELMET: How to Evaluate Long-context Language Models Effectively and Thoroughly

0 results

HELMET (How to Evaluate Long-context Language Models Effectively and Thoroughly) is a comprehensive benchmark for evaluating long-context language models (LCLMs). It comprises seven diverse, application-centric categories designed to test models' ability to process long inputs at multiple controllable lengths (paper reports support up to 128k tokens) and uses reliable, task-appropriate metrics and few-shot prompting. The benchmark is accompanied by code and data (available from the Princeton-NLP GitHub) and is hosted as a Hugging Face dataset (princeton-nlp/HELMET). HELMET was published as an ICLR 2025 paper (arXiv:2410.02694 / OpenReview entry) and is intended for evaluating LMs' long-input understanding and processing rather than generation-only tasks.

No results tracked yet

LVD-142M

LVD-142M

0 results

LVD-142M is the curated pretraining dataset used to train the DINOv2 family of self-supervised ViT models. According to the DINOv2 paper (arXiv:2304.07193) it is a deduplicated, automatically-assembled collection of roughly 142 million images built by retrieving and filtering images from multiple curated and uncurated sources (the paper reports the exact composition in Appendix Table 15). The dataset was created to provide a diverse, high-quality corpus for large-scale visual self-supervised pretraining (training a 1B-parameter ViT and distilled smaller variants). LVD-142M itself has not been published as a standalone dataset (the authors were asked about releasing the dataset / curation code in the facebookresearch/dinov2 GitHub issues but no public dataset release is available), and no official Hugging Face dataset page for LVD-142M could be found. Many model cards on Hugging Face list "Pretrain Dataset: LVD-142M" to indicate models were pretrained on it (e.g., Meta / timm DINOv2 model cards), but the dataset files/download are not publicly hosted by the authors.

No results tracked yet

Tanks and Temples (6)

Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction

0 results

Tanks and Temples is a widely-used benchmark for image-based 3D reconstruction and multi-view stereo (MVS). Introduced by Knapitsch et al. (Tanks and Temples: Benchmarking Large-Scale Scene Reconstruction), the benchmark provides high-resolution video/image sequences of real-world scenes and laser-scanned ground-truth geometry for evaluating reconstruction and novel-view-synthesis methods. The benchmark is organized into testing groups (commonly referred to as the “intermediate” and “advanced” sets) and is frequently used in the literature; many papers also evaluate on a standard 6-scene subset of the benchmark for out-of-domain novel-view-synthesis (NVS) evaluation. Official dataset/download and benchmark materials are hosted at the TanksAndTemples website and the original paper (SIGGRAPH / ACM TOG) provides dataset details and evaluation instructions.

No results tracked yet

MegaDepth (19)

MegaDepth

0 results

MegaDepth is a large-scale dataset for single-view depth prediction constructed from Internet photo collections using structure-from-motion (SfM) and multi-view stereo (MVS). The authors use COLMAP reconstructions to produce dense depth maps and masks for images of many outdoor landmark scenes; the published dataset comprises reconstructions for ≈196 distinct locations (landmarks) and tens to hundreds of images per scene, together with cleaned, scale-normalized dense depth maps and validity masks suitable for training and evaluating single-view depth and related 3D tasks. The dataset was introduced in: Li & Snavely, "MegaDepth: Learning Single-View Depth Prediction from Internet Photos" (CVPR 2018 / arXiv:1804.00607). Note: in the evaluation context referenced by this entry, the authors use a 19-scene subset of MegaDepth (referred to as “MegaDepth (19)”), specifically scenes indexed 5000–5018, as an out-of-domain novel-view-synthesis (NVS) evaluation split. This subset is not a separate release of MegaDepth but a selection of 19 scenes from the full MegaDepth reconstructions used for out-of-domain NVS testing.

No results tracked yet

DL3DV-Benchmarks (140)

DL3DV Benchmark (140 scenes)

0 results

DL3DV Benchmark (140 scenes) is a curated in-domain novel-view-synthesis (NVS) benchmark sampled from the DL3DV-10K dataset. It contains 140 real-world scenes (images + camera poses) together with Colmap reconstructions and scene labels, formatted to be compatible with NeRF toolchains (e.g., nerfstudio) and 3D Gaussian Splatting workflows. The repo also ships README/license information and reported baseline results from the DL3DV paper (e.g., ZipNeRF, 3DGS, MipNeRF-360, nerfacto, Instant-NGP). The subset is intended for feed-forward novel view synthesis evaluation and generalizable NeRF / 3D generation research. The dataset on Hugging Face is gated (terms-of-use required) and provides download instructions and a benchmark preview (DL3DV-Benchmark-Preview).

No results tracked yet

HiRoom

HiRoom

0 results

HiRoom — a Blender-rendered synthetic indoor dataset of 30 furnished/living-room scenes used by the authors of "Depth Anything 3" (DA3) as part of their visual-geometry benchmark. The dataset is used for evaluating any-view geometry / 3D reconstruction and camera-pose estimation in the DA3 paper; the paper reports using a reconstruction F1 threshold d = 0.05 m for HiRoom. No standalone Hugging Face dataset page or separate publication for HiRoom was found; the dataset appears in the DA3 benchmark and tech report (arXiv:2511.10647 / project page).

No results tracked yet

COCO test-challenge

COCO test-challenge Split

2014 · 0 results

COCO test-challenge evaluation split used for official challenge submissions.

No results tracked yet

Finance Agent

0 results

The Finance Agent dataset includes 537 expert-authored questions, covering tasks from information retrieval to complex financial modeling, simple retrieval, market research, and projections. Each question was validated through a rigorous review process to ensure accuracy and relevance.

No results tracked yet

TauBench

0 results

A conversational benchmark designed to test AI agents in dynamic, open-ended real-world scenarios. It specifically evaluates an agent's ability to interact with simulated human users and programmatic APIs while strictly adhering to domain-specific policies and maintaining consistent behavior, with domains in e-commerce and airline reservations.

No results tracked yet

COCO 2017 Panoptic Segmentation

Microsoft COCO (Common Objects in Context) — 2017 Panoptic Segmentation

0 results

Microsoft COCO 2017 Panoptic Segmentation is the COCO 2017 subset annotated for panoptic segmentation, a task that unifies instance segmentation for "thing" classes and semantic segmentation for "stuff" classes into a single per-pixel labeling. The 2017 release contains the standard COCO image splits (train2017 with ~118,000 images and val2017 with 5,000 images, ~123k images total) and panoptic annotations (JSON panoptic annotation files plus per-image panoptic PNG segment maps). Panoptic annotations provide both instance ids and semantic class ids so models can be evaluated on a single panoptic quality metric that accounts for both things and stuff. The dataset is derived from the Microsoft COCO collection introduced in Lin et al., "Microsoft COCO: Common Objects in Context" (arXiv:1405.0312 / ECCV 2014). Common distribution points and tools include the official COCO website (cocodataset.org), the cocodataset panoptic API (cocodataset/panopticapi on GitHub), and multiple Hugging Face dataset mirrors that expose the train/val splits and panoptic annotations (e.g., AISNP/COCO2017-panoptic).
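
The panoptic quality (PQ) metric combines matched-segment IoU with penalties for unmatched segments; below is a minimal sketch of the formula using toy IoU values, not real COCO output.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    # Matched segments (IoU > 0.5) contribute their IoU; unmatched predicted
    # and ground-truth segments count as false positives / false negatives.
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0

print(panoptic_quality([0.92, 0.81, 0.76], num_fp=1, num_fn=2))
```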

No results tracked yet

COCO 2017 Captions

Microsoft COCO Captions (COCO 2017 Captions)

0 results

Microsoft COCO Captions (COCO 2017 Captions) is the image-captioning portion of the MS COCO (Microsoft Common Objects in Context) benchmark. It provides human-written natural language captions describing the images; each annotated image has five independent captions. The COCO 2017 release reorganized the original COCO images into the train/val/test2017 splits (commonly used splits are train2017 with ~118,287 images and val2017 with 5,000 images; caption annotations cover the train+val captioned images, i.e. ~123k images with 5 captions each). COCO Captions is a standard benchmark for image-to-text / vision-language tasks and is widely used to train and evaluate image captioning and vision-language models using metrics such as BLEU, METEOR and CIDEr. (Original paper: "Microsoft COCO: Common Objects in Context", arXiv:1405.0312.)

No results tracked yet

COCO 2017 Stuff

COCO-Stuff (COCO 2017 Stuff / COCO-Stuff 164K)

0 results

COCO-Stuff (COCO 2017 Stuff) augments the MS COCO dataset with dense pixel-wise annotations for "stuff" classes (amorphous background regions like sky, grass, road). The COCO-Stuff v2 release annotates all ~164K images in the COCO 2017 collection with 91 stuff classes (in addition to the 80 COCO thing classes), enabling large-scale semantic segmentation and scene-understanding research focused on stuff/thing interactions and context. The annotations were produced with an efficient superpixel-based protocol that leverages COCO thing masks. (Original COCO dataset: arXiv:1405.0312; COCO-Stuff paper/announcement: arXiv:1612.03716 / CVPR 2018.)

No results tracked yet

COCO val2017 (Instance Segmentation)

COCO 2017 Object Detection (validation split)

0 results

COCO 2017 validation split (5K images) for instance segmentation evaluation. This split is used for instance segmentation tasks, where models are evaluated on their ability to detect objects and delineate each instance with a pixel-level segmentation mask.
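
Mask AP on this split is typically computed with pycocotools; the sketch below assumes a local annotations file and a results JSON produced by the model (both paths are illustrative).

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations/instances_val2017.json")    # assumed local path
coco_dt = coco_gt.loadRes("segm_results.json")           # model predictions
evaluator = COCOeval(coco_gt, coco_dt, iouType="segm")   # use "bbox" for box AP
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, AP_small/medium/large
```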

No results tracked yet

SimpleQA-Verified

0 results

SimpleQA-Verified is a reliable factuality benchmark for measuring the parametric knowledge of language models. The original set of 4,326 samples is reduced to 1,000 samples after various processing stages. The benchmark includes problems with varying string lengths, topics (e.g., Sports, Politics, Art, History, Geography, Music, Other), and answer types (e.g., Person, Other); some problems may require reasoning or be multi-step.

No results tracked yet

AndroidWorld

0 results

AndroidWorld is an environment and benchmark for autonomous agents, featuring 116 diverse tasks across 20 real-world apps, dynamic task instantiation for millions of unique variations, and durable reward signals for reliable evaluation. It is an open environment with access to millions of Android apps and websites, and has a lightweight footprint (2 GB memory, 8 GB disk).

No results tracked yet

AITW

0 results

AITW (Android in the Wild) is a large-scale dataset for device-control research, specifically for Android devices. It is significantly larger than previous datasets and contains human-collected data. The dataset consists of four multi-step datasets (GOOGLEAPPS, INSTALL, WEBSHOPPING, and GENERAL) and a single-step dataset (SINGLE). It includes observations represented by screenshots and pixel-based screen features.

No results tracked yet

Online-Mind2Web

0 results

Online-Mind2Web is a new benchmark that contains 300 diverse and realistic tasks spanning 136 websites, introduced to assess the current state of web agents. It also includes WebJudge, an automatic evaluation based on LLM-as-a-judge, to facilitate future agent development and evaluation.

No results tracked yet

COCO minival

COCO minival Split

2014 · 0 results

COCO minival evaluation split (5K images), a subset commonly used for validation during development.

No results tracked yet

AppWorld

0 results

The AppWorld dataset is designed for "Computer Use Agents". These agents possess capabilities such as multimodal reasoning, control over applications through simulated or API-driven inputs, memory management, and autonomy in executing multistep flows. They can adaptively interact with systems, perform actions, update files, navigate menus, and generate responses, effectively automating tasks across various applications by understanding user instructions.

No results tracked yet

Design2Code

0 results

Design2Code is a benchmark used for evaluating vision-language models on converting visual webpage designs into code.

No results tracked yet

ChartMimic_v2_Direct

0 results

The ChartMimic_v2_Direct dataset is used for evaluating large multimodal models (LMMs). More specifically, it focuses on evaluating LMMs' cross-modal reasoning capabilities through chart-to-code generation, encompassing visual understanding, code generation, and cross-modal reasoning. It is available on Hugging Face.

No results tracked yet

UniSvg

0 results

UniSVG is the first large-scale, multi-task, open-source SVG-centric dataset for unified generation and understanding, supporting Multimodal Large Language Model (MLLM) training and evaluation. It also includes the UniSVG benchmark and diverse evaluation metrics to assess SVG generation and understanding capabilities.

No results tracked yet

VideoMMMU

0 results

The Video-MMMU dataset is a multi-modal, multi-disciplinary benchmark designed to assess Large Multi-modal Models' (LMMs) ability to acquire and utilize knowledge from videos. It evaluates LMMs' knowledge acquisition capability from educational videos, focusing on video as a knowledge source and knowledge acquisition-based question design.

No results tracked yet

CharadesSTA

0 results

The Charades dataset contains 66,500 temporal annotations for 157 action classes, 41,104 labels for 46 object classes, and 27,847 textual descriptions of the videos. It is designed to guide research into unstructured video activity recognition and commonsense reasoning for daily human activities.

No results tracked yet

MuirBench

0 results

MuirBench is a benchmark dataset for vision language models. It is designed to evaluate robust multi-image understanding, covering various types of multi-image relations such as temporal, ordered-pages, or narrative relations. It also includes unanswerable questions to fairly assess multimodal LLMs. The dataset aims to identify the gap between current multi-modal language models and humans in understanding multiple image inputs.

No results tracked yet

VisuLogic

0 results

VisuLogic is a benchmark involving vision-language models, multimodal AI architectures that comprehend both image and text data. These models correlate information from both visual and natural language inputs and can be used for tasks such as image classification, object detection, image segmentation, optical character recognition, image recognition, visual inspection, captioning, summarization, image generation, image search and retrieval, and visual question answering.

No results tracked yet

ARC-AGI

0 results

The ARC-AGI dataset is an AI benchmark that measures progress towards general intelligence by testing a system's ability to efficiently acquire new skills outside of its training data. It requires the test taker (human or AI) to deduce underlying rules through abstraction, inference, and prior knowledge rather than brute-force or extensive training. It was used as the basis of the ARC Prize competition and determined the final leaderboard in 2020, 2022, 2023, and 2024.

No results tracked yet

ARC-AGI 2

0 results

ARC-AGI 2 is a dataset for general tasks, specifically compositional reasoning tasks that require the simultaneous application of rules or multiple interacting rules. It includes a semi-private set for testing remotely-hosted commercial models and a fully-private set for testing self-contained models during the ARC Prize competition. The dataset contains 1000 tasks for training and 120 tasks for evaluation, combining tasks from ARC-AGI-1 and new tasks.

No results tracked yet

GAIA

0 results

A landmark benchmark designed to evaluate General AI Assistants, posing real-world questions that are conceptually simple for humans but significantly challenging for most advanced AI systems. It requires AI models to demonstrate a combination of fundamental abilities, including reasoning, multi-modality handling, web browsing, and proficient tool use.

No results tracked yet

ToolBench

0 results

A massive-scale benchmark designed for evaluating and facilitating large language models in mastering over 16,000 real-world RESTful APIs. It functions as an instruction-tuning dataset for tool use, which was automatically generated using ChatGPT to enhance the general tool-use capabilities of large language models.

No results tracked yet

ComplexFuncBench

0 results

A benchmark specifically designed for the evaluation of complex function calling in LLMs. It addresses challenging scenarios across five key aspects: multi-step function calls within a single turn, function calls involving user-provided constraints, parameter value reasoning, calls with long parameter values, and calls requiring a 128k long-context length.

No results tracked yet

LiveMCPBench

0 results

A comprehensive benchmark designed to evaluate the ability of LLM agents to navigate and effectively utilize a large-scale Model Context Protocol (MCP) toolset in real-world scenarios, overcoming limitations of single-server environments.

No results tracked yet

MCP-Universe

0 results

A comprehensive framework and benchmark for developing, testing, and evaluating AI agents and LLMs through direct interaction with real-world Model Context Protocol (MCP) servers, rather than relying on simulations, covering domains like financial analysis and browser automation.

No results tracked yet

API-Bank

0 results

Evaluates an agent's ability to plan step-by-step API calls, retrieve relevant APIs, and correctly execute API calls to meet human needs based on understanding real-world API documentation. It features over 2,200 dialogues utilizing thousands of APIs.

No results tracked yet

LongVideoBench

0 results

LongVideoBench is a benchmark for long-context interleaved video-language understanding, addressing a gap in existing benchmarks for long video understanding. It proposes a new referring reasoning task to evaluate the abilities of large multimodal models (LMMs) and aims to address the single-frame bias problem in video understanding metrics. The benchmark covers a wide range of video lengths (up to one hour) and themes, with diverse question types and high-quality, manually annotated data. It is used to comprehensively evaluate proprietary and open-source models to understand their long-context multimodal modeling capabilities.

No results tracked yet

OVOBench

0 results

OVOBench is a dataset for evaluating Video Language Models (Video-LLMs) on their ability to understand real-world online video, specifically by finding temporal visual clues from ongoing input and waiting for sufficient evidence before responding. It evaluates capabilities such as backward tracing, real-time visual perception, and forward active responding.

No results tracked yet

COCO-Stuff

Common Objects in COntext-stuff

2018 · 0 results

COCO-Stuff augments COCO with pixel-level stuff annotations. It spans 164K images with 172 categories including 80 things, 91 stuff, and 1 unlabeled class.

No results tracked yet

WindowsAgentArena-V2

0 results

WindowsAgentArena-V2 is an updated benchmark comprising 141 tasks across 11 widely-used Windows applications, all derived from the original WindowsAgentArena, but with improvements. It is used for agent tasks.

No results tracked yet

HammerBench

0 results

HammerBench is a benchmark for evaluating agents in real mobile assistant scenarios, focusing on fine-grained function-calling and slot-filling tasks in interactive dialogues. It tests agents across multiple domains with diverse tools and query types, capturing various user behaviors like detailed vs. vague queries and single-turn vs. multi-turn interactions. It also allows evaluation of LLM performance under circumstances such as imperfect instructions.

No results tracked yet

GAIA2

0 results

GAIA2 is a dataset designed for agentic evaluation on real-life assistant tasks. It is used with the open Meta Agents Research Environments (ARE) framework to run, debug, and evaluate agents. The dataset includes scenarios for multi-step instruction following and tool-use (execution), cross-source information gathering (search), and clarification of conflicting requests (ambiguity handling). It also features temporal constraints, dynamic environment events, and multi-agent collaboration scenarios.

No results tracked yet

AgentBench

0 results

AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents. It provides insights into their strengths and limitations, serving as a standardized platform for future research and development in AI agent technologies. It also includes a trajectory dataset for behavior cloning training.

No results tracked yet

AssistantBench

0 results

No results tracked yet

SWE-PolyBench

0 results

SWE-PolyBench is a multi-language benchmark for repository-level evaluation of coding agents. It contains 2110 curated issues in four languages (Java, JavaScript, TypeScript, and Python), covering bug fixes, feature additions, and code refactoring.

No results tracked yet

WebVoyager

0 results

WebVoyager is a benchmark built around a multimodal web agent of the same name that integrates textual and visual information to address web tasks end-to-end. It uses a semi-automated approach to generate and filter 643 task queries, covering 15 websites with each website containing over 40 queries.

No results tracked yet

VisualWebArena

0 results

VisualWebArena is a benchmark for multimodal agents, comprising 910 realistic, visually-grounded web tasks across three environments: Classifieds, Shopping, and Reddit. It evaluates various capabilities of autonomous multimodal agents on complex web-based visual tasks.

No results tracked yet

Related Tasks

Open-Vocabulary Object Detection

Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.

Object counting

Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.

Video segmentation

Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.

OCR

OCR, or Optical Character Recognition, is the task of converting an image containing text into machine-readable, editable, and searchable digital text data. This involves converting scanned documents, photos, or image-only PDFs to text from their static visual format, enabling the document to be edited, searched, or used for data entry and other applications. Examples include digitizing receipts for your bank app, translating signs with Google Translate, or creating searchable archives from old documents.
