
Depth Estimation

Depth estimation recovers 3D structure from 2D images — a problem that haunted computer vision for decades before deep learning cracked monocular depth prediction. The field shifted dramatically with MiDaS (2019) showing that mixing diverse training data beats task-specific models, then again with Depth Anything (2024) proving foundation model scale changes everything. Modern systems achieve sub-5% relative error on NYU Depth V2, but real-world robustness — handling reflections, transparency, and extreme lighting — remains the frontier. Critical for autonomous driving, AR/VR, and robotics where accurate 3D perception is non-negotiable.


Monocular depth estimation predicts per-pixel distance from a single 2D image — a fundamentally ill-posed problem that deep learning has made remarkably practical. Relative depth models like MiDaS/Depth Anything now generalize across scenes, while metric depth (actual meters) remains dataset-dependent. It enables 3D photography, AR, robotics, and autonomous driving without expensive LiDAR.

History

2005

Saxena et al. pioneer supervised monocular depth with Markov Random Fields on small-scale indoor datasets

2014

Eigen et al. use multi-scale CNNs to predict depth from a single image, establishing the deep learning baseline on NYUv2 and KITTI

2017

Godard et al. introduce self-supervised monocular depth using left-right stereo consistency — no depth labels needed

2019

Monodepth2 (Godard et al.) adds per-pixel minimum reprojection and auto-masking, becoming the dominant self-supervised method

2020

MiDaS (Ranftl et al.) trains on mixed datasets (12 sources) to achieve robust zero-shot relative depth estimation across arbitrary images

2021

DPT (Dense Prediction Transformer) by Ranftl et al. applies ViT to dense prediction, improving fine-grained depth details

2023

ZoeDepth combines relative depth pretraining with metric depth fine-tuning, producing the first practical zero-shot metric depth model

2024

Depth Anything v1 (Tsinghua/ByteDance) trains on 62M unlabeled images with self-training, achieving the best zero-shot relative depth

2024

Depth Anything v2 trains on 595K synthetic images (from Hypersim, Virtual KITTI, etc.) with pseudo-labeled real data, setting new SOTA on NYUv2 and KITTI

2025

Depth Pro (Apple) and Metric3D v2 push zero-shot metric depth estimation — predicting actual distances without per-scene calibration

How Depth Estimation Works

Depth Estimation Pipeline
1. Encoder

A pretrained backbone (DINOv2-ViT, Swin, ConvNeXt) extracts multi-scale features from the input image. ViT-based encoders dominate because global self-attention captures long-range depth cues (perspective, occlusion, relative size).

2. Decoder

A DPT-style decoder reassembles features at progressively higher resolutions using convolution and upsampling layers with skip connections from the encoder.
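As a toy illustration of the decoder's resolution-doubling-plus-skip pattern, the numpy sketch below upsamples a coarse feature map and merges an encoder skip connection by addition. Real DPT decoders use learned reassembly operations and residual convolutional fusion units; this is only the structural skeleton.

```python
import numpy as np

def fuse(coarse, skip):
    """One toy fusion step: upsample the coarser feature map 2x by
    nearest-neighbor repetition, then merge the encoder skip
    connection by element-wise addition."""
    up = coarse.repeat(2, axis=0).repeat(2, axis=1)
    return up + skip

# A 4x4 coarse map fused with an 8x8 skip map yields an 8x8 output.
coarse = np.ones((4, 4))
skip = np.zeros((8, 8))
out = fuse(coarse, skip)
```

Stacking several such steps takes the decoder from the encoder's lowest resolution back up to (near) input resolution, which is why depth maps retain both global layout and local detail.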

3. Depth Head

A final layer predicts either relative disparity (inverse depth, arbitrary scale) or metric depth (absolute meters). Relative depth is easier to learn across datasets; metric depth requires known camera intrinsics or focal length estimation.
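The relative-to-metric relationship can be made concrete: if a model outputs disparity up to an unknown affine transform, a handful of sparse metric measurements (e.g. LiDAR returns) suffice to recover the scale and shift. A minimal numpy sketch with made-up numbers:

```python
import numpy as np

def disparity_to_metric_depth(disparity, scale, shift, eps=1e-6):
    """Convert relative inverse depth (disparity) to metric depth.

    Relative models predict disparity only up to an unknown affine
    transform; metric depth follows as 1 / (scale * d + shift) once
    scale and shift are recovered from sparse metric anchors.
    """
    metric_disparity = scale * disparity + shift
    return 1.0 / np.maximum(metric_disparity, eps)

# Toy example: two sparse LiDAR measurements pin down scale and shift
# via least squares in inverse-depth space.
pred_disp = np.array([0.2, 0.5])    # model output at two pixels (arbitrary units)
true_depth = np.array([10.0, 4.0])  # LiDAR ground truth (meters)
A = np.stack([pred_disp, np.ones_like(pred_disp)], axis=1)
scale, shift = np.linalg.lstsq(A, 1.0 / true_depth, rcond=None)[0]
depth = disparity_to_metric_depth(pred_disp, scale, shift)
```

Zero-shot metric models such as Depth Pro aim to skip this alignment step entirely by estimating scale (via focal length) from the image itself.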

4. Loss Functions

Scale-and-shift-invariant losses (for relative depth) compare predicted and ground-truth depth up to an affine transformation. Metric depth uses L1/L2 loss, sometimes with gradient-matching terms for edge sharpness.
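The scale-and-shift-invariant idea can be sketched in a few lines: solve a closed-form least-squares problem for the best affine alignment of prediction to ground truth, then penalize only the residual. (The MiDaS formulation works in disparity space with robust trimming and gradient-matching terms; this simplified version omits those.)

```python
import numpy as np

def ssi_loss(pred, target):
    """Scale-and-shift-invariant loss (simplified sketch).

    Solves in closed form for the scale s and shift b that best align
    the prediction to the target in the least-squares sense, then
    measures the remaining mean absolute error.
    """
    p = pred.ravel()
    t = target.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, b), *_ = np.linalg.lstsq(A, t, rcond=None)
    aligned = s * p + b
    return np.mean(np.abs(aligned - t))

# Any affine transform of the ground truth incurs (near-)zero loss,
# which is exactly what lets relative-depth models train on datasets
# with incompatible depth scales.
gt = np.linspace(1.0, 10.0, 100).reshape(10, 10)
loss_affine = ssi_loss(3.0 * gt + 0.7, gt)
```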

5. Evaluation

Key metrics: AbsRel (mean absolute relative error), δ₁ (the fraction of pixels whose predicted depth is within a factor of 1.25 of ground truth), and RMSE. NYUv2 (indoor) and KITTI (outdoor driving) are the standard benchmarks.
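These metrics are straightforward to implement; a numpy sketch follows (benchmark protocols additionally apply dataset-specific evaluation crops and depth caps, which are omitted here):

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth metrics over valid (gt > 0) pixels."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)          # AbsRel
    rmse = np.sqrt(np.mean((p - g) ** 2))         # RMSE
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)                # delta_1 accuracy
    return {"AbsRel": abs_rel, "RMSE": rmse, "delta1": delta1}

# Toy check: a perfect prediction scores 0 error and delta1 = 1.0;
# zeros in gt mark invalid pixels (e.g. missing LiDAR returns).
gt = np.array([[1.0, 2.0], [4.0, 0.0]])
perfect = depth_metrics(gt.copy(), gt)
scaled = depth_metrics(1.3 * gt, gt)  # 30% off everywhere
```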

Current Landscape

Monocular depth estimation in 2025 is dominated by foundation models trained on massive mixed datasets. Depth Anything v2 is the clear leader for relative depth, while Depth Pro and Metric3D v2 are pushing metric depth toward practical zero-shot use. The self-supervised era (Monodepth2, etc.) has been largely superseded by models trained on synthetic data + pseudo-labels from teacher models. The field has split into two camps: those optimizing benchmark numbers on NYUv2/KITTI, and those building robust zero-shot models that work on any image from the internet. The latter camp is winning in practice.

Key Challenges

Scale ambiguity — a single image doesn't contain absolute distance information, so monocular models predict relative depth unless trained with metric supervision or camera intrinsics

Generalization across domains — models trained on indoor scenes (NYUv2) fail outdoors (KITTI) and vice versa; truly universal depth estimation requires massive mixed-domain training

Thin structures and object boundaries — depth maps are typically blurry at edges, which causes artifacts in 3D reconstruction and novel view synthesis

Transparent and reflective surfaces — glass, mirrors, and water violate the assumptions of both learning-based and geometric depth methods

Ground truth acquisition — LiDAR is sparse, stereo matching has errors at occlusion boundaries, and structured light only works indoors, limiting training data quality

Quick Recommendations

Best zero-shot relative depth

Depth Anything v2-Large (ViT-L)

Best AbsRel on NYUv2 (0.043) and KITTI (0.042) without fine-tuning; works on arbitrary images out of the box

Zero-shot metric depth

Depth Pro (Apple) or Metric3D v2

Predicts actual meters without per-scene scaling; Depth Pro estimates focal length jointly with depth

Real-time / mobile

Depth Anything v2-Small (ViT-S)

Runs at 30+ FPS on mobile GPUs with competitive accuracy; 25M params

Autonomous driving

Metric3D v2 fine-tuned on KITTI/nuScenes

Metric depth matters for planning; combine with temporal consistency for video

3D photography / NeRF

Depth Anything v2 as initialization for DUSt3R/MASt3R

Monocular depth provides strong priors for multi-view 3D reconstruction pipelines

What's Next

The frontier is moving toward video depth estimation with temporal consistency (eliminating flickering between frames), 4D depth (dynamic scenes with moving objects), and integration with 3D reconstruction pipelines (DUSt3R, Gaussian splatting). Zero-shot metric depth will likely become standard within a year, eliminating the need for per-dataset calibration. The endgame is depth as a commodity feature embedded in every vision foundation model.

Benchmarks & SOTA

Related Tasks
