Depth estimation
Depth estimation is the computer vision task of inferring the 3D spatial structure of a scene from 2D images, typically producing a depth map that gives the distance of each pixel from the camera. It enables applications such as autonomous navigation, 3D reconstruction, and augmented reality by providing distance measurements for points in a scene. Depth can be estimated from a single camera (monocular) or multiple cameras (stereoscopic), and predictions can be either absolute, giving measurements in physical units such as meters, or relative, indicating only the ordering of distances without exact values.
Monocular depth estimation predicts per-pixel distance from a single 2D image — a fundamentally ill-posed problem that deep learning has made remarkably practical. Relative-depth models such as MiDaS and Depth Anything now generalize across scenes, while metric depth (actual meters) remains more dataset-dependent. The task enables 3D photography, AR, robotics, and autonomous driving without expensive LiDAR.
History
Saxena et al. pioneer supervised monocular depth with Markov Random Fields trained on small laser-scanned datasets
Eigen et al. use multi-scale CNNs to predict depth from a single image, establishing the deep learning baseline on NYUv2 and KITTI
Godard et al. introduce self-supervised monocular depth using left-right stereo consistency — no depth labels needed
Monodepth2 (Godard et al.) adds per-pixel minimum reprojection and auto-masking, becoming the dominant self-supervised method
MiDaS (Ranftl et al.) trains on mixed datasets (12 sources) to achieve robust zero-shot relative depth estimation across arbitrary images
DPT (Dense Prediction Transformer) by Ranftl et al. applies ViT to dense prediction, improving fine-grained depth details
ZoeDepth combines relative depth pretraining with metric depth fine-tuning, producing the first practical zero-shot metric depth model
Depth Anything v1 (ByteDance and academic collaborators) trains on 62M unlabeled images with self-training, achieving state-of-the-art zero-shot relative depth at the time
Depth Anything v2 trains a teacher on 595K synthetic images (from Hypersim, Virtual KITTI, etc.), then distills student models on pseudo-labeled real data, setting new SOTA on NYUv2 and KITTI
Depth Pro (Apple) and Metric3D v2 push zero-shot metric depth estimation — predicting actual distances without per-scene calibration
How Depth Estimation Works
Encoder
A pretrained backbone (DINOv2-ViT, Swin, ConvNeXt) extracts multi-scale features from the input image. ViT-based encoders dominate because global self-attention captures long-range depth cues (perspective, occlusion, relative size).
Decoder
A DPT-style decoder reassembles features at progressively higher resolutions using convolution and upsampling layers with skip connections from the encoder.
Depth Head
A final layer predicts either relative disparity (inverse depth, arbitrary scale) or metric depth (absolute meters). Relative depth is easier to learn across datasets; metric depth requires known camera intrinsics or focal length estimation.
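A minimal PyTorch sketch of this encoder-decoder pattern is below, assuming a generic pretrained timm backbone; the module names, channel widths, and fusion scheme are illustrative simplifications rather than the actual DPT architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm  # any backbone exposing multi-scale features works


class TinyDepthNet(nn.Module):
    """Illustrative encoder-decoder depth network (not the actual DPT)."""

    def __init__(self, backbone: str = "convnext_tiny"):
        super().__init__()
        # Encoder: pretrained backbone returning a coarse-to-fine feature pyramid.
        self.encoder = timm.create_model(backbone, pretrained=True, features_only=True)
        chs = self.encoder.feature_info.channels()  # e.g. [96, 192, 384, 768]
        # Decoder: project each scale to a shared width, fuse from coarse to fine.
        self.proj = nn.ModuleList(nn.Conv2d(c, 128, 1) for c in chs)
        self.fuse = nn.ModuleList(nn.Conv2d(128, 128, 3, padding=1) for _ in chs[:-1])
        # Depth head: predict per-pixel relative disparity (inverse depth).
        self.head = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 1, 1)
        )

    def forward(self, x):
        feats = self.encoder(x)                 # highest- to lowest-resolution features
        out = self.proj[-1](feats[-1])          # start from the coarsest scale
        for proj, fuse, feat in zip(
            reversed(self.proj[:-1]), reversed(self.fuse), reversed(feats[:-1])
        ):
            out = F.interpolate(out, size=feat.shape[-2:], mode="bilinear",
                                align_corners=False)
            out = fuse(out + proj(feat))        # skip connection from the encoder
        disparity = F.relu(self.head(out))      # non-negative inverse depth
        # Upsample to the input resolution for a dense per-pixel prediction.
        return F.interpolate(disparity, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
```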
Loss Functions
Scale-and-shift-invariant losses (for relative depth) compare predicted and ground-truth depth up to an affine transformation. Metric depth uses L1/L2 loss, sometimes with gradient-matching terms for edge sharpness.
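A minimal sketch of the scale-and-shift-invariant idea follows, assuming a per-image least-squares alignment of predicted disparity to ground truth before the residual is measured; the trimming and multi-scale gradient-matching terms used in MiDaS-style training are omitted.

```python
import torch


def ssi_loss(pred, gt, mask):
    """Scale-and-shift-invariant loss sketch (per image).
    pred, gt: (B, H, W) disparity maps; mask: (B, H, W) bool of valid GT pixels."""
    losses = []
    for p, g, m in zip(pred, gt, mask):
        p, g = p[m], g[m]                       # assumes every image has valid pixels
        # Least-squares scale s and shift t minimizing ||s * p + t - g||^2.
        A = torch.stack([p, torch.ones_like(p)], dim=1)            # (N, 2)
        s, t = torch.linalg.lstsq(A, g.unsqueeze(1)).solution.flatten()
        losses.append((s * p + t - g).abs().mean())                # aligned L1 residual
    return torch.stack(losses).mean()
```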
Evaluation
Key metrics: AbsRel (absolute relative error), δ₁ (% of pixels within 1.25× of ground truth), RMSE. NYUv2 (indoor) and KITTI (outdoor driving) are standard benchmarks.
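These metrics are straightforward to compute; a minimal NumPy sketch over valid ground-truth pixels:

```python
import numpy as np


def depth_metrics(pred, gt, valid):
    """AbsRel, RMSE and delta_1 over valid pixels (pred and gt in the same units)."""
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)        # mean absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))       # root mean squared error
    ratio = np.maximum(p / g, g / p)            # symmetric per-pixel ratio
    delta1 = np.mean(ratio < 1.25)              # fraction of pixels within 1.25x of GT
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta1": delta1}
```

Here valid is typically the set of pixels with ground-truth depth inside the dataset's evaluation range (for example, up to 80 m on KITTI).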
Current Landscape
Monocular depth estimation in 2025 is dominated by foundation models trained on massive mixed datasets. Depth Anything v2 is the clear leader for relative depth, while Depth Pro and Metric3D v2 are pushing metric depth toward practical zero-shot use. The self-supervised era (Monodepth2, etc.) has been largely superseded by models trained on synthetic data + pseudo-labels from teacher models. The field has split into two camps: those optimizing benchmark numbers on NYUv2/KITTI, and those building robust zero-shot models that work on any image from the internet. The latter camp is winning in practice.
Key Challenges
Scale ambiguity — a single image doesn't contain absolute distance information, so monocular models predict relative depth unless trained with metric supervision or camera intrinsics; at evaluation time this is often worked around with median scaling (see the sketch after this list)
Generalization across domains — models trained on indoor scenes (NYUv2) fail outdoors (KITTI) and vice versa; truly universal depth estimation requires massive mixed-domain training
Thin structures and object boundaries — depth maps are typically blurry at edges, which causes artifacts in 3D reconstruction and novel view synthesis
Transparent and reflective surfaces (glass, mirrors, water) violate the assumptions of both learning-based and geometric depth methods
Ground truth acquisition — LiDAR is sparse, stereo matching has errors at occlusion boundaries, and structured light only works indoors, limiting training data quality
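As a concrete illustration of handling scale ambiguity at evaluation time, here is a minimal sketch of the common median-scaling protocol, which aligns a scale-ambiguous prediction to ground truth with a single scalar before computing metric errors:

```python
import numpy as np


def median_scale(pred, gt, valid):
    """Align a scale-ambiguous prediction to ground truth with one scalar:
    multiply by median(gt) / median(pred) over valid pixels, then evaluate."""
    return pred * (np.median(gt[valid]) / np.median(pred[valid]))
```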
Quick Recommendations
Best zero-shot relative depth
Depth Anything v2-Large (ViT-L)
Best AbsRel on NYUv2 (0.043) and KITTI (0.042) without fine-tuning; works on arbitrary images out of the box (see the usage sketch after these recommendations)
Zero-shot metric depth
Depth Pro (Apple) or Metric3D v2
Predicts actual meters without per-scene scaling; Depth Pro estimates focal length jointly with depth
Real-time / mobile
Depth Anything v2-Small (ViT-S)
Runs at 30+ FPS on mobile GPUs with competitive accuracy; 25M params
Autonomous driving
Metric3D v2 fine-tuned on KITTI/nuScenes
Metric depth matters for planning; combine with temporal consistency for video
3D photography / NeRF
Depth Anything v2 as initialization for DUSt3R/MASt3R
Monocular depth provides strong priors for multi-view 3D reconstruction pipelines
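For the recommended relative-depth models, a minimal inference sketch using the Hugging Face transformers depth-estimation pipeline; the checkpoint id is an assumption, so substitute whichever Depth Anything v2 variant (or other model) you actually use:

```python
from PIL import Image
from transformers import pipeline

# Checkpoint id is an assumption; swap in the Depth Anything v2 size you need.
depth = pipeline("depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")

result = depth(Image.open("photo.jpg"))
# result["depth"] is a PIL image of relative depth for visualization;
# result["predicted_depth"] is the raw tensor (arbitrary scale, not meters).
result["depth"].save("depth.png")
```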
What's Next
The frontier is moving toward video depth estimation with temporal consistency (eliminating flickering between frames), 4D depth (dynamic scenes with moving objects), and integration with 3D reconstruction pipelines (DUSt3R, Gaussian splatting). Zero-shot metric depth will likely become standard within a year, eliminating the need for per-dataset calibration. The endgame is depth as a commodity feature embedded in every vision foundation model.
Benchmarks & SOTA
ETH3D (relative)
ETH3D Multi-View Stereo Benchmark Dataset (Relative Depth)
ETH3D is a comprehensive multi-view stereo and SLAM benchmark dataset designed for evaluating 3D reconstruction algorithms. Developed by the Computer Vision and Geometry Group at ETH Zurich, it features a wide variety of indoor and outdoor scenes, captured using both high-resolution DSLR cameras and synchronized multi-camera video systems. Ground truth geometry is obtained using high-precision laser scans. The benchmark consists of multiple challenges: high-res multi-view stereo with 13 training and 12 test scenes using DSLR images, low-res many-view stereo on video data with 5 training and 5 test sequences, and low-res two-view stereo with 27 training and 20 test frames. ETH3D is intended to advance research in 3D reconstruction by providing accurate ground truth and challenging scenarios, including mobile and hand-held camera use cases. The dataset offers rich visualizations and an online evaluation server.
No results tracked yet
Sintel (relative)
MPI Sintel Dataset (Depth and Optical Flow Benchmark) (Relative Depth)
The MPI Sintel Dataset is a synthetic dataset for the evaluation of optical flow and depth estimation algorithms, derived from the open source 3D animated short film 'Sintel' by the Blender Foundation. The dataset includes long sequences with large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. For depth estimation, it provides ground truth depth maps (in meters), camera data (intrinsic and extrinsic parameters), and image sequences, rendered under realistic and challenging conditions. It is widely used in benchmarking optical flow and monocular depth estimation methods. The Sintel dataset is notable for its diversity, complexity, and photorealistic synthetic scenes and is a key benchmark in both the optical flow and depth estimation research communities.
No results tracked yet
DIODE (relative)
DIODE: A Dense Indoor and Outdoor DEpth Dataset (Relative Depth)
DIODE (Dense Indoor/Outdoor DEpth) is a public RGB-D dataset that provides diverse, high-resolution color images coupled with accurate, dense, long-range depth measurements for both indoor and outdoor scenes acquired with a single sensor suite. It was introduced to enable and evaluate depth-estimation methods that generalize across scene domains; the dataset includes RGB images, dense depth maps and surface normals (where available), a development toolkit on GitHub, and a sample gallery on the project site. The authors describe DIODE as containing "thousands" of diverse scenes and provide data curation and processing scripts (diode-devkit). Primary references: the project site (diode-dataset.org) and the technical report (arXiv:1908.00463).
No results tracked yet
SUN RGB-D (metric)
SUN RGB-D: A RGB-D Scene Understanding Benchmark Suite
SUN RGB-D is a large-scale RGB-D (color + depth) indoor scene understanding benchmark introduced by Song, Lichtenberg, and Xiao. The dataset contains 10,335 real RGB-D images captured by four different sensors and densely annotated for a variety of scene-understanding tasks. Annotations include 146,617 2D polygons, tens of thousands of 3D bounding boxes (the paper reports 64,595 3D boxes), object orientations, 3D room layouts and scene categories. The dataset provides train/test splits (commonly reported as 5,285 train and 5,050 test images) and is used for tasks such as 2D/3D object detection, semantic/instance segmentation, scene classification and depth-related evaluations. Official dataset pages and the CVPR 2015 paper provide download links, annotation details and evaluation toolkits.
No results tracked yet
DIODE Outdoor (metric)
DIODE: A Dense Indoor and Outdoor DEpth Dataset
Outdoor split of DIODE used in zero-shot metric depth evaluation (reported in Table 5 of the Depth Anything paper, arXiv:2401.10891). DIODE (Dense Indoor/Outdoor DEpth) is a public RGB-D dataset that provides diverse, high-resolution color images paired with accurate, dense, long-range depth measurements covering both indoor and outdoor scenes captured with a single sensor suite. The dataset was introduced to enable research on depth estimation and cross-domain generalization (indoor↔outdoor) by providing dense ground-truth depth maps (and derived normals) for a variety of scene types and ranges. The authors release capture/processing tools (diode-devkit) and make the dataset and project resources available from the project website. This entry refers specifically to the Outdoor split commonly used for zero-shot metric depth evaluation.
No results tracked yet
iBims-1 (metric)
iBims-1 (independent Benchmark images and matched scans - version 1)
iBims-1 (independent Benchmark images and matched scans - version 1) is a high-quality RGB-D dataset created for evaluation of single-image (monocular) depth estimation methods. It was captured with a DSLR camera together with a high-precision laser scanner to provide high-resolution RGB images and highly accurate depth maps with low noise, sharp depth transitions, minimal occlusions and a large depth range. The dataset was designed to support geometry-aware evaluation metrics (e.g., edge/planarity preservation, absolute distance accuracy) and includes per-image masks for invalid/transparent regions and for planar or sharp depth-transition areas, as well as camera calibration parameters. The core release contains 100 RGB–depth image pairs from indoor scenes; the authors also provide an extension with additional variations (reported as 56 variants/extensions and several additional sequences and test images). The dataset and its evaluation protocol were introduced alongside the paper “Evaluation of CNN-based Single-Image Depth Estimation Methods” (ECCV Workshops 2018 / arXiv:1805.01328).
No results tracked yet
Virtual KITTI 2 (metric)
Virtual KITTI 2
Virtual KITTI 2 is a synthetic, photo-realistic driving dataset that is a revised and expanded version of the original Virtual KITTI. It provides synthetic "clones" of 5 sequences from the KITTI tracking benchmark (scenes: 01, 02, 06, 18, 20) together with multiple variants per sequence (e.g., different weather such as fog and rain, and modified camera configurations such as rotations). For each sequence and variant the dataset supplies multi-modal ground truth: RGB (stereo), dense depth, semantic (class) segmentation, instance segmentation, optical flow, scene flow, camera parameters and poses, and vehicle locations. The dataset was built with improved photorealism using a modern game engine and is intended for tasks such as depth estimation, segmentation, flow, and domain transfer / synthetic-to-real evaluation. The dataset is distributed for non-commercial research use under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license (copyright Naver Corporation). Reported statistics include a total of about 21,260 stereo pairs across the 5 cloned scenes (scene-level counts reported in the release). Sources: arXiv:2001.10773 (Virtual KITTI 2) and the Naver Labs Europe project page. Also referenced in the paper "Depth Anything" (arXiv:2401.10891) for zero-shot metric depth evaluation (Table 5).
No results tracked yet
DDAD (relative)
Dense Depth for Autonomous Driving (DDAD)
Dense Depth for Autonomous Driving (DDAD) is a long-range, multi-camera autonomous-driving depth dataset released by the Toyota Research Institute (TRI / TRI-ML). The dataset provides synchronized 6-camera RGB imagery together with LiDAR point clouds, poses, camera intrinsics/extrinsics and additional annotations (2D/3D boxes and semantic labels reported in the public repo/blog). According to the TRI-ML release and accompanying references used by later papers, the training split contains 12,650 samples (≈75,900 images across six cameras) and the validation split contains 3,950 samples (≈15,800 images) with ground-truth dense depth maps used for evaluation (depths evaluated to long range, e.g., up to 200 m). DDAD is distributed via the TRI-ML GitHub repository (TRI-ML/DDAD) rather than a standalone paper, and has been used as an unseen test domain for zero-shot and transfer depth evaluation in several works, referenced in workshop and paper supplements.
No results tracked yet
HyperSim (metric)
Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding
Hypersim is a photorealistic synthetic dataset for holistic indoor scene understanding (introduced by Roberts et al.). It contains 77,400 rendered images of 461 indoor scenes and provides dense per-pixel ground-truth annotations and complete scene information useful for tasks such as depth prediction, surface normals, semantic/instance segmentation, intrinsic decomposition (diffuse reflectance, illumination, non-diffuse residual), full scene geometry, material properties, and camera parameters. The dataset was created from a large repository of professionally authored 3D assets and renderings; the project provides code and data on GitHub and the paper was published at ICCV 2021. Hypersim is also used as a synthetic indoor dataset for zero-shot metric depth evaluation (reported in Table 5 of the Depth Anything paper, arXiv:2401.10891).
No results tracked yet
ScanNet
ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes
Large-scale RGB-D video dataset containing 2.5M views in 1513 scenes with 3D camera poses, surface reconstructions, and instance-level semantic segmentations. Used for depth estimation evaluation.
No results tracked yet
DA-2K
Depth Anything 2K Evaluation Benchmark
Versatile evaluation benchmark for relative monocular depth estimation with ~1K high-resolution images and ~2K sparse pixel-pair relative-depth annotations across eight representative scenarios (indoor, outdoor, underwater, aerial, transparent/reflective objects, etc.)
No results tracked yet
KITTI (metric)
KITTI Vision Benchmark Suite (Metric Depth)
The KITTI Vision Benchmark Suite is a dataset that allows for the training of complex deep learning models for depth completion and single image depth prediction tasks. It contains over 93 thousand depth maps with corresponding raw LiDAR scans and RGB images, aligned with the "raw data" of the KITTI dataset. It also provides manually selected images with unpublished depth maps to serve as a benchmark for these two challenging tasks.
No results tracked yet
NYUv2 (metric)
NYU Depth Dataset V2 (Metric Depth)
The NYUv2 dataset is used for depth estimation. It comprises video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect. It features 1,449 densely labeled pairs of aligned RGB and depth images, 464 scenes taken from 3 cities, and 407,024 unlabeled frames. Each object is labeled with a class and an instance number. The dataset has several components: Labeled (a subset of video data with dense multi-class labels, preprocessed to fill in missing depth values), Raw (raw RGB, depth, and accelerometer data from the Kinect), and Toolbox (functions for manipulating the data and labels).
No results tracked yet
KITTI (relative)
KITTI Vision Benchmark Suite (Relative Depth)
The KITTI Vision Benchmark Suite is a dataset that allows for the training of complex deep learning models for depth completion and single image depth prediction tasks. It contains over 93 thousand depth maps with corresponding raw LiDAR scans and RGB images, aligned with the "raw data" of the KITTI dataset. It also provides manually selected images with unpublished depth maps to serve as a benchmark for these two challenging tasks.
No results tracked yet
NYUv2 (relative)
NYU Depth Dataset V2 (Relative Depth)
The NYUv2 dataset is used for depth estimation. It comprises video sequences from a variety of indoor scenes, recorded by both the RGB and depth cameras of the Microsoft Kinect. It features 1,449 densely labeled pairs of aligned RGB and depth images, 464 scenes taken from 3 cities, and 407,024 unlabeled frames. Each object is labeled with a class and an instance number. The dataset has several components: Labeled (a subset of video data with dense multi-class labels, preprocessed to fill in missing depth values), Raw (raw RGB, depth, and accelerometer data from the Kinect), and Toolbox (functions for manipulating the data and labels).
No results tracked yet
KITTI Depth
Outdoor depth estimation from autonomous driving LiDAR data
No results tracked yet
NYU Depth V2
Indoor depth estimation from RGB-D sensor data
No results tracked yet
Related Tasks
Few-Shot Image Classification
Image classification with limited labeled examples per class (few-shot learning). Models are evaluated on their ability to classify images into categories with only a handful of training examples (typically 1-10) per class.
Open-Vocabulary Object Detection
Object detection with open vocabulary - detecting objects from arbitrary text descriptions without being limited to a fixed set of categories.
Object counting
Object counting in AI is a computer vision task that uses machine learning and image processing to identify and enumerate distinct objects within digital images and videos. It can differentiate between various object types, sizes, and shapes, even in crowded or dynamically changing scenes. The process typically involves object detection using deep learning models like convolutional neural networks (CNNs) to recognize and localize objects, followed by aggregation to provide a total count. This technology is applied in fields like manufacturing for quality control and production monitoring.
Video segmentation
Video segmentation is the task of partitioning video frames into multiple segments or objects. Unlike image segmentation which works on static images, video segmentation tracks objects across frames in a video sequence.