Depth Estimation
Depth estimation recovers 3D structure from 2D images — a problem that resisted computer vision for decades before deep learning made monocular depth prediction practical. The field shifted dramatically when MiDaS (2019) showed that mixing diverse training data beats task-specific models, and again when Depth Anything (2024) demonstrated that foundation-model scale changes everything. Modern systems achieve sub-5% relative error on NYU Depth V2, but real-world robustness — handling reflections, transparency, and extreme lighting — remains the frontier. Accurate depth is critical for autonomous driving, AR/VR, and robotics, where 3D perception is non-negotiable.
Monocular depth estimation predicts per-pixel distance from a single 2D image — a fundamentally ill-posed problem that deep learning has made remarkably practical. Relative depth models like MiDaS/Depth Anything now generalize across scenes, while metric depth (actual meters) remains dataset-dependent. It enables 3D photography, AR, robotics, and autonomous driving without expensive LiDAR.
History
Saxena et al. (2005) pioneer supervised monocular depth with Markov Random Fields, trained on small custom outdoor datasets
Eigen et al. (2014) use multi-scale CNNs to predict depth from a single image, establishing the deep learning baseline on NYUv2 and KITTI
Godard et al. (2017) introduce self-supervised monocular depth using left-right stereo consistency — no depth labels needed
Monodepth2 (Godard et al., 2019) adds per-pixel minimum reprojection and auto-masking, becoming the dominant self-supervised method
MiDaS (Ranftl et al., 2019) trains on mixed datasets (12 sources) to achieve robust zero-shot relative depth estimation across arbitrary images
DPT (Dense Prediction Transformer, Ranftl et al., 2021) applies ViT to dense prediction, improving fine-grained depth detail
ZoeDepth (2023) combines relative-depth pretraining with metric-depth fine-tuning, producing the first practical zero-shot metric depth model
Depth Anything v1 (HKU/ByteDance, 2024) trains on 62M unlabeled images with self-training, achieving the best zero-shot relative depth of its time
Depth Anything v2 (2024) trains on 595K synthetic images (from Hypersim, Virtual KITTI, etc.) plus pseudo-labeled real data, setting a new SOTA on NYUv2 and KITTI
Depth Pro (Apple, 2024) and Metric3D v2 (2024) push zero-shot metric depth estimation — predicting actual distances without per-scene calibration
How Depth Estimation Works
Encoder
A pretrained backbone (DINOv2-ViT, Swin, ConvNeXt) extracts multi-scale features from the input image. ViT-based encoders dominate because global self-attention captures long-range depth cues (perspective, occlusion, relative size).
Decoder
A DPT-style decoder reassembles features at progressively higher resolutions using convolution and upsampling layers with skip connections from the encoder.
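The coarse-to-fine fusion can be sketched with plain NumPy. The 2x nearest-neighbour upsampling and additive skip merge below are simplifications; real DPT decoders wrap each fusion step in learned convolutions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(deep, skip):
    """Upsample the coarser map and merge it with the encoder skip feature.
    (A learned conv before and after this merge is omitted for brevity.)"""
    return upsample2x(deep) + skip

# Encoder features at strides 32, 16, 8, 4 (coarse to fine), 64 channels each
feats = [np.random.rand(64, 512 // s, 512 // s) for s in (32, 16, 8, 4)]

x = feats[0]
for skip in feats[1:]:
    x = fuse(x, skip)   # progressively double the spatial resolution

print(x.shape)          # (64, 128, 128): stride-4 resolution of a 512x512 input
```

A final bilinear upsample and the depth head then bring this stride-4 map back to full image resolution.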
Depth Head
A final layer predicts either relative disparity (inverse depth, arbitrary scale) or metric depth (absolute meters). Relative depth is easier to learn across datasets; metric depth requires known camera intrinsics or focal length estimation.
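A standard way to lift relative disparity to metric depth is to fit the unknown scale and shift against a few sparse metric measurements (e.g. LiDAR returns) by least squares; models like ZoeDepth instead learn a metric head directly. A minimal NumPy sketch (the affine-disparity toy model and variable names are illustrative):

```python
import numpy as np

def align_to_metric(rel_disp, sparse_depth, mask):
    """Least-squares fit of scale s and shift t so that
    s * rel_disp + t approximates metric inverse depth on the masked
    pixels, then invert the aligned disparity to get meters everywhere."""
    target = 1.0 / sparse_depth[mask]                 # metric inverse depth
    A = np.stack([rel_disp[mask], np.ones(mask.sum())], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target, rcond=None)
    return 1.0 / (s * rel_disp + t)

# Toy check: a "model" whose output is disparity up to unknown scale/shift
true_depth = np.random.uniform(1.0, 10.0, (4, 4))
rel_disp = 2.0 / true_depth + 0.1                     # scaled, shifted disparity
mask = np.zeros((4, 4), dtype=bool)
mask[::2, ::2] = True                                 # 4 sparse metric samples
metric = align_to_metric(rel_disp, true_depth, mask)  # recovers true depth
```

Four sparse samples suffice here only because the toy relationship is exactly affine; real predictions need many samples and robust fitting.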
Loss Functions
Scale-and-shift-invariant losses (for relative depth) compare predicted and ground-truth depth up to an affine transformation. Metric depth uses L1/L2 loss, sometimes with gradient-matching terms for edge sharpness.
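In the least-squares variant of the scale-and-shift-invariant loss (as popularized by MiDaS), the optimal affine alignment has a closed form: fit scale and shift to the ground truth, then take the mean squared residual. A NumPy sketch on flattened disparity maps:

```python
import numpy as np

def ssi_loss(pred, gt):
    """Scale-and-shift-invariant MSE (least-squares variant).
    pred, gt: flattened disparity maps of equal length."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)   # optimal affine alignment
    return np.mean((s * pred + t - gt) ** 2)

gt = np.array([0.2, 0.5, 1.0, 2.0])
print(ssi_loss(gt, gt))              # ~0: identical maps
print(ssi_loss(3.0 * gt + 0.7, gt))  # ~0: affine transforms are free
print(ssi_loss(gt ** 2, gt))         # > 0: non-affine error is penalized
```

The invariance is exactly what lets relative-depth models train on datasets with incompatible depth scales.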
Evaluation
Key metrics: AbsRel (absolute relative error), δ₁ (% of pixels within 1.25× of ground truth), RMSE. NYUv2 (indoor) and KITTI (outdoor driving) are standard benchmarks.
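These metrics are straightforward to compute from dense predictions; a NumPy sketch, following the usual convention of evaluating only pixels with valid ground truth:

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, delta_1 and RMSE over valid (gt > 0) pixels, in meters."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)                 # relative error
    delta1 = np.mean(np.maximum(p / g, g / p) < 1.25)    # ratio within 1.25x
    rmse = np.sqrt(np.mean((p - g) ** 2))                # root mean squared error
    return abs_rel, delta1, rmse

gt = np.array([2.0, 4.0, 5.0, 0.0])      # last pixel has no ground truth
pred = np.array([2.2, 3.8, 6.0, 1.0])
abs_rel, delta1, rmse = depth_metrics(pred, gt)
# abs_rel ~= 0.117, delta1 = 1.0, rmse = 0.6
```

For relative-depth models, predictions are first aligned to the ground truth (median scaling or least-squares scale-and-shift) before these metrics are computed.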
Current Landscape
Monocular depth estimation in 2025 is dominated by foundation models trained on massive mixed datasets. Depth Anything v2 is the clear leader for relative depth, while Depth Pro and Metric3D v2 are pushing metric depth toward practical zero-shot use. The self-supervised era (Monodepth2, etc.) has been largely superseded by models trained on synthetic data + pseudo-labels from teacher models. The field has split into two camps: those optimizing benchmark numbers on NYUv2/KITTI, and those building robust zero-shot models that work on any image from the internet. The latter camp is winning in practice.
Key Challenges
Scale ambiguity — a single image doesn't contain absolute distance information, so monocular models predict relative depth unless trained with metric supervision or camera intrinsics
Generalization across domains — models trained on indoor scenes (NYUv2) fail outdoors (KITTI) and vice versa; truly universal depth estimation requires massive mixed-domain training
Thin structures and object boundaries — depth maps are typically blurry at edges, which causes artifacts in 3D reconstruction and novel view synthesis
Transparent and reflective surfaces (glass, mirrors, water) violate the assumptions of both learning-based and geometric depth methods
Ground truth acquisition — LiDAR is sparse, stereo matching has errors at occlusion boundaries, and structured light only works indoors, limiting training data quality
Quick Recommendations
Best zero-shot relative depth
Depth Anything v2-Large (ViT-L)
Best AbsRel on NYUv2 (0.043) and KITTI (0.042) without fine-tuning; works on arbitrary images out of the box
Zero-shot metric depth
Depth Pro (Apple) or Metric3D v2
Predicts actual meters without per-scene scaling; Depth Pro estimates focal length jointly with depth
Real-time / mobile
Depth Anything v2-Small (ViT-S)
Runs at 30+ FPS on mobile GPUs with competitive accuracy; 25M params
Autonomous driving
Metric3D v2 fine-tuned on KITTI/nuScenes
Metric depth matters for planning; combine with temporal consistency for video
3D photography / NeRF
Depth Anything v2 as initialization for DUSt3R/MASt3R
Monocular depth provides strong priors for multi-view 3D reconstruction pipelines
What's Next
The frontier is moving toward video depth estimation with temporal consistency (eliminating flickering between frames), 4D depth (dynamic scenes with moving objects), and integration with 3D reconstruction pipelines (DUSt3R, Gaussian splatting). Zero-shot metric depth will likely become standard within a year, eliminating the need for per-dataset calibration. The endgame is depth as a commodity feature embedded in every vision foundation model.