Depth Estimation Models
Last reviewed
May 11, 2026
Sources
15 citations
Review status
Source-backed
Revision
v2 · 2,496 words
See also: Computer Vision Models and Tasks
Depth estimation models are computer vision systems that predict per-pixel depth (distance from the camera to objects in the scene) from one or more images. The output is a depth map, an image where each pixel value represents the corresponding distance in metric units or a relative scale. Depth estimation is a core component of 3D scene understanding and powers autonomous driving, robotics, augmented reality, photogrammetry, novel view synthesis, and computational photography.
Three main settings exist. Monocular depth estimation predicts depth from a single image, an ill-posed problem because a 2D projection is consistent with infinitely many 3D scenes. Stereo depth estimation uses two images from a calibrated camera pair, recovering depth from disparity. Multi-view depth combines three or more images, sometimes with known camera poses (for example, recovered by structure-from-motion) and sometimes without. Outputs are distinguished as relative depth (ordering, up to an unknown scale and shift) or metric depth (calibrated distance in physical units).
Classical depth estimation relied on geometric cues. Stereo matching locates corresponding pixels between left and right images, then converts disparity to depth using the camera baseline and focal length. Block matching compares image patches over a search range; semi-global matching (SGM) by Heiko Hirschmüller (2008) aggregates costs along multiple directions and remains a strong classical baseline still used in production robotics. Structure from motion (SfM) recovers camera poses and a sparse point cloud from many overlapping images, and multi-view stereo (MVS) densifies the result, with COLMAP a widely used reference implementation.
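The disparity-to-depth conversion mentioned above follows the standard rectified-stereo relation, depth = focal length × baseline / disparity. The sketch below illustrates it with NumPy; the function name and the rig parameters (720 px focal length, 0.54 m baseline) are hypothetical values chosen for the example, not taken from any particular dataset.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity map (pixels) to depth (meters) for a rectified stereo pair.

    Pixels whose disparity is at or below min_disparity are treated as
    infinitely far away instead of dividing by zero.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical rig: 720 px focal length, 0.54 m baseline.
disparity = np.array([[72.0, 36.0], [9.0, 0.0]])
depth = disparity_to_depth(disparity, focal_px=720.0, baseline_m=0.54)
```

Because depth is inversely proportional to disparity, a one-pixel matching error matters far more for distant points (small disparity) than for nearby ones.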
Monocular depth from a single image is harder because there is no multi-view geometric constraint, only learned priors over typical scene structure. Early work used hand-engineered features (Make3D by Saxena et al., 2008) before deep learning reshaped the field. The first end-to-end deep monocular depth network was published by David Eigen, Christian Puhrsch, and Rob Fergus at NYU in 2014; it stacked a coarse global network and a fine local refinement network, trained with a scale-invariant loss on the NYU Depth v2 and KITTI datasets.
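As a rough illustration of the scale-invariant idea, the sketch below computes a log-space error of the form mean(δ²) − λ·mean(δ)², where δ is the per-pixel difference of log depths. With λ = 1 this is the variance of the log difference, so multiplying the prediction by any constant leaves it unchanged. The function name and the `lam` parameter are assumptions made for this example.

```python
import numpy as np

def scale_invariant_log_error(pred, target, lam=1.0):
    """Scale-invariant error in log space.

    delta = log(pred) - log(target); the result is
    mean(delta^2) - lam * mean(delta)^2. With lam=1 a global scale
    factor on pred has no effect; with lam=0 it is plain log-space MSE.
    Both inputs must be strictly positive.
    """
    delta = np.log(np.asarray(pred, dtype=np.float64)) - np.log(
        np.asarray(target, dtype=np.float64)
    )
    n = delta.size
    return np.mean(delta ** 2) - lam * (delta.sum() ** 2) / n ** 2

rng = np.random.default_rng(0)
target = rng.uniform(0.5, 10.0, size=(4, 5))
error_scaled = scale_invariant_log_error(3.0 * target, target)          # close to 0
error_plain = scale_invariant_log_error(3.0 * target, target, lam=0.0)  # (log 3)^2
```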
Deep stereo took off with DispNet (Mayer et al., 2015), an encoder-decoder CNN trained on the FlyingThings3D synthetic dataset that the authors released alongside the network. DispNet established a recipe later refined by GC-Net (Kendall et al., 2017), which built an explicit 3D cost volume, and PSMNet (Chang and Chen, 2018), which introduced pyramid pooling and a 3D CNN cost-volume regularizer.
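To make the cost-volume idea concrete, here is a minimal classical sketch: for every candidate disparity it compares each left-image pixel with the right-image pixel shifted by that disparity, then picks the cheapest candidate per pixel. It uses raw intensities and an absolute-difference cost; learned methods such as GC-Net and PSMNet build the volume from learned features and regularize it with 3D convolutions, which this example does not attempt. Function names are hypothetical.

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    """Absolute-difference cost volume of shape (max_disp, H, W).

    cost[d, y, x] compares left[y, x] with right[y, x - d]. Positions
    where x - d falls outside the image keep an infinite cost.
    """
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        cost[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return cost

def winner_take_all(cost):
    """Pick the lowest-cost disparity independently at each pixel."""
    return np.argmin(cost, axis=0)

# Synthetic pair: the left image is the right image shifted by 3 pixels.
rng = np.random.default_rng(0)
right = rng.random((8, 32))
left = rng.random((8, 32))
left[:, 3:] = right[:, :-3]
disp = winner_take_all(build_cost_volume(left, right, max_disp=8))
```

On this synthetic pair the recovered disparity is 3 wherever the shifted pixel exists in the right image; real images need patch-based costs or learned features to be this unambiguous.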
More recent work imports ideas from optical flow. RAFT-Stereo (Lahav Lipson, Zachary Teed, Jia Deng, 2021) adapts the iterative recurrent update operator from RAFT to stereo matching, using multi-level convolutional GRUs to propagate disparity information across the image. IGEV-Stereo (Xu et al., 2023) further combines geometry-encoded cost volumes with iterative refinement. Some recent methods instead treat two-view stereo as a special case of more general pointmap regression.
Following Eigen et al., supervised monocular networks pushed accuracy higher. DenseDepth (Alhashim and Wonka, 2018) used a pretrained DenseNet encoder. DORN (Fu et al., 2018) reformulated depth as ordinal regression. BTS (Lee et al., 2019) added local planar guidance layers. AdaBins (Bhat et al., 2020) adaptively binned the depth range per image with a transformer head.
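The adaptive-binning idea can be sketched as follows: the network predicts a set of bin widths for the whole image plus a per-pixel probability distribution over bins, and the depth at each pixel is the probability-weighted sum of bin centers. The code below shows only this final step under those assumptions; the function name, shapes, and example values are illustrative and not taken from the AdaBins code base.

```python
import numpy as np

def depth_from_adaptive_bins(bin_widths, probs, d_min, d_max):
    """Combine per-image bin widths and per-pixel bin probabilities into depth.

    bin_widths: (N,) positive widths, normalized here to sum to 1.
    probs:      (..., N) per-pixel probabilities over the N bins.
    Returns the expected depth, a weighted sum of bin centers.
    """
    widths = np.asarray(bin_widths, dtype=np.float64)
    widths = widths / widths.sum()
    edges = d_min + (d_max - d_min) * np.concatenate([[0.0], np.cumsum(widths)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return np.tensordot(np.asarray(probs, dtype=np.float64), centers, axes=([-1], [0]))

# Three bins over [0, 8] m with widths 2, 2, 4 -> centers at 1, 3, 6 m.
probs = np.array([[[0.0, 0.0, 1.0], [1 / 3, 1 / 3, 1 / 3]]])  # shape (1, 2, 3)
depth = depth_from_adaptive_bins([1.0, 1.0, 2.0], probs, d_min=0.0, d_max=8.0)
```

Taking the expectation rather than the most likely bin keeps the output continuous, which avoids the visible quantization steps of a hard bin assignment.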
A turning point came with vision transformers. DPT (Dense Prediction Transformer, Rene Ranftl, Alexey Bochkovskiy, Vladlen Koltun, Intel 2021) replaced the convolutional backbone with a Vision Transformer and assembled tokens from multiple stages into a dense decoder. DPT reported up to 28% relative improvement over fully-convolutional approaches on monocular depth.
The MiDaS family (Ranftl, Lasinger, Hafner, Schindler, Koltun, 2019, with v3.1 released August 2023 by Intel) tackled generalization rather than per-dataset accuracy. MiDaS trains a single model on a mixture of incompatible depth datasets using a robust scale-and-shift-invariant loss, producing affine-invariant relative depth that transfers zero-shot to unseen domains. MiDaS became the standard pretrained backbone for many depth pipelines and is widely used in Stable Diffusion ControlNet conditioning.
Ground-truth depth is expensive to collect, so self-supervised approaches train using only image sequences or stereo pairs as supervision. Monodepth (Clement Godard, Oisin Mac Aodha, Gabriel Brostow, 2016) trained a CNN on rectified stereo footage using a left-right disparity consistency loss and image reconstruction; the network never saw ground-truth depth yet outperformed some supervised baselines on KITTI.
SfMLearner (Tinghui Zhou et al., 2017) extended this to monocular video, jointly learning depth and ego-motion via a view-synthesis loss: render a target frame from a source frame using predicted depth and pose, then minimize reconstruction error. Monodepth2 (Godard, Mac Aodha, Michael Firman, Brostow, Niantic and UCL, 2018) added a minimum reprojection loss for occlusion handling, an auto-masking loss for static pixels, and a multi-scale sampling strategy. ManyDepth (Watson et al., Niantic, 2021) brought multi-frame inference at test time via a cost volume built from adjacent frames.
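The minimum reprojection and auto-masking ideas can be illustrated on already-warped images. The sketch below assumes the source frames have been warped into the target view elsewhere (using predicted depth and pose), uses a plain L1 photometric error instead of the SSIM-plus-L1 mix used in practice, and keeps only pixels where warping explains the target better than the unwarped source does. All function names are hypothetical.

```python
import numpy as np

def l1_photometric_error(a, b):
    """Per-pixel mean absolute difference over channels; inputs are (H, W, C)."""
    return np.abs(a - b).mean(axis=-1)

def min_reprojection_loss(target, warped_sources, raw_sources):
    """Per-pixel minimum reprojection error with an auto-mask.

    warped_sources: source frames warped into the target view.
    raw_sources:    the same source frames without any warping.
    A pixel is kept only if its best warped error beats its best
    unwarped error, which drops pixels that look static across frames.
    """
    warped_err = np.stack([l1_photometric_error(target, w) for w in warped_sources])
    identity_err = np.stack([l1_photometric_error(target, s) for s in raw_sources])
    min_warped = warped_err.min(axis=0)      # best source per pixel (handles occlusion)
    min_identity = identity_err.min(axis=0)
    mask = min_warped < min_identity
    loss = min_warped[mask].mean() if mask.any() else 0.0
    return loss, mask

# Toy example: each warped source is wrong at one (occluded) pixel,
# and one pixel is identical in an unwarped source (static).
rng = np.random.default_rng(0)
target = rng.random((4, 4, 3))
warped_a, warped_b = target.copy(), target.copy()
warped_a[0, 0] += 1.0
warped_b[3, 3] += 1.0
raw_a, raw_b = target + 0.5, target + 0.5
raw_a[1, 1] = target[1, 1]
loss, mask = min_reprojection_loss(target, [warped_a, warped_b], [raw_a, raw_b])
```

In the toy example the occluded pixels contribute no error because the other source explains them, and the static pixel at (1, 1) is excluded by the mask.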
From 2023 onward, depth research shifted toward foundation models that generalize across domains.
ZoeDepth (Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller, Intel, February 2023) bridged relative and metric depth. It pretrains a MiDaS-style backbone on 12 datasets, then adds lightweight metric-bins heads fine-tuned on NYU Depth v2 and KITTI, with a latent classifier routing each image to the appropriate head. The authors present the flagship ZoeD-M12-NK as the first single model to combine strong relative generalization with accurate metric scale.
Depth Anything (Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao, TikTok and HKU, January 2024) scaled monocular depth using a self-training data engine that pseudo-labeled approximately 62 million unlabeled internet images. A teacher trained on about 1.5 million labeled images with a MiDaS-style affine-invariant loss generates the pseudo-labels, and a DPT-style student is then trained on both labeled and pseudo-labeled images under strong augmentation. The authors released ViT-S, ViT-B, and ViT-L variants achieving strong zero-shot performance on six public benchmarks.
Depth Anything V2 (same team, June 2024, NeurIPS 2024) replaced labeled real images with synthetic data, scaled the teacher network, and used pseudo-labels on unlabeled real images to bridge the synthetic-to-real gap. The authors report that the resulting models are more than 10x faster and also more accurate than recent Stable Diffusion-based depth models, with model sizes from 25 million to 1.3 billion parameters.
Marigold (Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, Konrad Schindler, ETH Zurich, December 2023, CVPR 2024) finetunes Stable Diffusion for depth prediction, treating depth as an image modality the diffusion model can generate. Trained only on synthetic data on a single GPU, Marigold reported over 20% gains on some datasets and produces unusually sharp depth boundaries. Follow-up work includes GeoWizard and Lotus.
UniDepth (Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, Fisher Yu, ETH Zurich and INSAIT, March 2024) directly predicts metric 3D points without needing camera intrinsics at inference. It uses a self-promptable camera module conditioning the depth head, plus a pseudo-spherical output representation that decouples camera and depth. PatchFusion (Li et al., 2023) addresses high-resolution depth by fusing patch-level and global predictions for megapixel images.
A parallel line of work treats depth estimation as one face of a broader 3D vision problem. DUSt3R (Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, Jerome Revaud, Aalto University and Naver Labs Europe, December 2023) regresses pointmaps directly from two unposed images using a transformer built on CroCo cross-view completion pretraining. From the predicted pointmaps, depth, camera intrinsics, relative pose, and pixel correspondences can all be recovered.
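A pointmap stores a 3D point for every pixel. The sketch below shows the single-camera relation between a depth map and a pointmap by back-projecting pixels through a pinhole intrinsics matrix. This is only an illustration of the representation: DUSt3R regresses pointmaps directly and does not need intrinsics as input. The function name and the intrinsics values are hypothetical.

```python
import numpy as np

def depth_to_pointmap(depth, K):
    """Back-project a depth map (H, W) into camera-frame 3D points (H, W, 3).

    K is a 3x3 pinhole intrinsics matrix. Depth is measured along the
    optical axis, so the z coordinate of each point equals its depth.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T  # K^-1 applied to each homogeneous pixel
    return rays * depth[..., None]

# Hypothetical intrinsics: 100 px focal length, principal point at (u=2, v=1).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
points = depth_to_pointmap(np.full((3, 4), 2.0), K)
```

Going the other way, the z channel of a camera-frame pointmap is the depth map, which is why pointmap regression subsumes depth estimation.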
MASt3R (Vincent Leroy, Yohann Cabon, Jerome Revaud, Naver Labs Europe, June 2024) extends DUSt3R with a dense local feature head and a fast reciprocal matching scheme, framing image matching itself as a 3D problem.
| Model | Year | Group | Architecture | Notes |
|---|---|---|---|---|
| Eigen et al. | 2014 | NYU | Two-stage CNN | First end-to-end deep monocular depth |
| DispNet | 2015 | Brox group, Freiburg | Encoder-decoder CNN | Trained on FlyingThings3D synthetic stereo |
| Monodepth | 2016 | UCL | CNN with stereo loss | Self-supervised, left-right consistency |
| SfMLearner | 2017 | Berkeley | CNN | Self-supervised from monocular video |
| Monodepth2 | 2018 | Niantic / UCL | CNN | Minimum reprojection, auto-masking |
| AdaBins | 2020 | KAUST | CNN + transformer head | Adaptive depth binning |
| DPT | 2021 | Intel | Vision Transformer | Dense Prediction Transformer |
| RAFT-Stereo | 2021 | Princeton | Recurrent stereo | Iterative GRU updates |
| MiDaS v3.1 | 2023 | Intel | DPT backbone | Multi-dataset relative depth |
| ZoeDepth | 2023 | Intel | MiDaS + metric bins | Combined relative and metric |
| Marigold | Dec 2023 | ETH Zurich | Stable Diffusion finetune | Diffusion-based depth |
| DUSt3R | Dec 2023 | Aalto / Naver Labs | Transformer pointmap | Pose + depth + matching |
| Depth Anything | Jan 2024 | TikTok / HKU | DPT (ViT-S/B/L) | 62M pseudo-labeled images |
| UniDepth | Mar 2024 | ETH Zurich | Transformer | Direct metric 3D points |
| MASt3R | Jun 2024 | Naver Labs | DUSt3R + matching head | Dense matching as 3D problem |
| Depth Anything V2 | Jun 2024 | TikTok / HKU | DPT | Synthetic data, scaled teacher |

| Dataset | Year | Domain | Description |
|---|---|---|---|
| KITTI | 2012 (depth bench. 2017) | Driving | Stereo + LiDAR from cars, Geiger et al., Karlsruhe Institute of Technology |
| NYU Depth v2 | 2012 | Indoor | RGB-D from Microsoft Kinect, Silberman et al., NYU |
| Middlebury Stereo | 2001 (multiple updates) | Indoor objects | Classical stereo benchmark, Scharstein and Szeliski |
| Sintel | 2012 | Synthetic film | Optical flow and stereo from animated movie |
| Cityscapes | 2016 | Urban driving | Stereo pairs from European cities |
| ScanNet | 2017 | Indoor | RGB-D video of 1500+ rooms, Dai et al. |
| ETH3D | 2017 | Indoor + outdoor | High-resolution multi-view stereo |
| MegaDepth | 2018 | Internet photos | Crowd-sourced landmarks, Li and Snavely, Cornell |
| DIODE | 2019 | Mixed | Dense indoor and outdoor RGB-D |
| DDAD | 2020 | Driving | Dense Depth for Autonomous Driving, Toyota Research Institute |
| Hypersim | 2021 | Synthetic indoor | High-quality photorealistic renders, Apple |
| Spring | 2023 | Synthetic high-res | 4K stereo and flow benchmark |

| Metric | Formula or definition | Notes |
|---|---|---|
| AbsRel | `mean(abs(d - d*) / d*)` | Absolute relative error, primary metric; `d` is predicted depth, `d*` is ground truth |
| Sq Rel | `mean((d - d*)^2 / d*)` | Squared relative error |
| RMSE | `sqrt(mean((d - d*)^2))` | Root mean square error in meters |
| RMSE log | `sqrt(mean((log d - log d*)^2))` | Log-space RMSE |
| Threshold accuracy | Percent of pixels with `max(d/d*, d*/d) < 1.25` | Reported at thresholds 1.25, 1.25^2, 1.25^3 |
| Scale-invariant log error | Variance of the log difference `log d - log d*` | Introduced by Eigen et al. 2014 |
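The metrics in the table can be computed directly from a predicted and a ground-truth depth map. The following is a minimal NumPy sketch under two assumptions that are conventions of this example rather than of any benchmark: pixels with ground truth of zero or less are treated as missing, and threshold accuracy is returned as a fraction rather than a percentage.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth error metrics over pixels with valid ground truth.

    pred must be strictly positive wherever gt is valid (gt > 0).
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0  # assumption: non-positive ground truth marks missing depth
    pred, gt = pred[valid], gt[valid]
    ratio = np.maximum(pred / gt, gt / pred)
    return {
        "abs_rel": np.mean(np.abs(pred - gt) / gt),
        "sq_rel": np.mean((pred - gt) ** 2 / gt),
        "rmse": np.sqrt(np.mean((pred - gt) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }

gt = np.array([1.0, 2.0, 4.0, 0.0])    # last pixel has no ground truth
pred = np.array([1.1, 2.0, 6.0, 3.0])
metrics = depth_metrics(pred, gt)
```

In this example the relative errors over the three valid pixels are 0.1, 0, and 0.5, so `abs_rel` is 0.2, and two of the three pixels fall inside the 1.25 threshold.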
For relative depth, predictions are aligned to ground truth via least-squares scale and shift before computing metrics, since the output is only defined up to an affine transform.
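The least-squares alignment described above has a closed-form solution: find the scale `s` and shift `t` minimizing the squared difference between `s * pred + t` and the ground truth over valid pixels. The sketch below solves it with `numpy.linalg.lstsq`; the function name and the optional `mask` argument are assumptions of this example. Models that predict inverse depth are usually aligned in inverse-depth space, which this sketch leaves to the caller.

```python
import numpy as np

def align_scale_shift(pred, gt, mask=None):
    """Fit scale and shift so that scale * pred + shift best matches gt.

    Solves the two-parameter least-squares problem over pixels where
    mask is True (all pixels if mask is None) and returns the aligned
    prediction together with the fitted scale and shift.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, gt[mask], rcond=None)
    return scale * pred + shift, scale, shift

rng = np.random.default_rng(0)
relative = rng.uniform(0.0, 1.0, size=(6, 8))   # stand-in for a relative prediction
gt = 2.5 * relative + 0.7                        # ground truth differs by an affine map
aligned, scale, shift = align_scale_shift(relative, gt)
```

After alignment, the standard error metrics are computed on `aligned` rather than on the raw prediction.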
Depth Anything V2 has emerged as the practical default for monocular relative depth across general scenes, with open weights in three sizes. Marigold and successor diffusion-based methods tend to produce especially sharp depth boundaries and are favored for VFX where clean edges matter more than raw speed. DUSt3R and MASt3R are increasingly used in place of classical SfM pipelines for multi-view tasks, recovering camera poses, depth, and matches from unposed image collections through pairwise network predictions followed by a global alignment step. Metric foundation models such as UniDepth and Metric3D narrow the gap between relative and metric prediction; UniDepth does so without requiring camera intrinsics at inference, whereas Metric3D relies on known intrinsics to map images into a canonical camera space.
Depth estimation feeds 3D perception stacks across robotics and consumer products. In autonomous driving, depth from stereo or monocular networks complements LiDAR for obstacle distance, free-space estimation, and lane geometry. In robotic manipulation, depth informs grasp planning, object pose estimation, and collision avoidance. AR/VR systems rely on per-pixel depth for occlusion, surface alignment, and scene scanning. Computational photography pipelines use depth for portrait-mode bokeh, refocusing, and relighting. In film and VFX, depth maps drive matchmoving, set extension, and stereo conversion of monocular footage. Depth also serves as conditioning for generative AI, with ControlNet using MiDaS-style depth to guide Stable Diffusion image generation. Novel view synthesis pipelines like NeRF and 3D Gaussian Splatting accept depth priors to speed up training and improve geometry. Medical imaging uses depth from endoscopy for surgical guidance.
Monocular depth has fundamental scale ambiguity: a small object close to the camera and a large object far away can produce the same image. Models output relative depth unless trained with metric supervision and camera intrinsics. Transparent surfaces (glass, water) and specular reflections confuse both stereo and monocular networks because feature matching breaks down. Very distant points have low parallax in stereo and low gradient in monocular learning, so far-field accuracy degrades. Dynamic objects violate the static-scene assumption used by self-supervised video methods. High-resolution inference is computationally expensive, motivating patch-based approaches like PatchFusion. Indoor and outdoor scenes have different depth statistics. Metric depth typically requires known camera intrinsics, though UniDepth and similar models attempt to circumvent this.
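The scale ambiguity in the first sentence above follows from pinhole projection, where image size is proportional to object size divided by distance. The short example below uses hypothetical numbers (a 1000 px focal length and two fronto-parallel objects) to show that scaling size and distance by the same factor leaves the image unchanged.

```python
def projected_size_px(object_size_m, distance_m, focal_px):
    """Pinhole projection: image extent in pixels of a fronto-parallel object."""
    return focal_px * object_size_m / distance_m

focal_px = 1000.0  # hypothetical focal length in pixels
near_small = projected_size_px(0.5, 2.0, focal_px)   # 0.5 m object at 2 m
far_large = projected_size_px(5.0, 20.0, focal_px)   # 5 m object at 20 m
```

Both objects span 250 pixels, so a single image cannot separate the two cases without an additional cue such as known object size, camera height, or metric training data.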