Depth estimation is the task of predicting per-pixel depth values from one or more images, producing a dense depth map that encodes how far each point in the scene is from the camera. It sits at the core of 3D computer vision and enables machines to reason about the spatial layout of scenes. Given an input image of size H x W, a depth estimation model outputs a depth map of the same spatial dimensions, where each pixel stores a scalar depth value (in meters for metric depth, or in arbitrary units for relative depth).
Depth information is essential for applications that require geometric understanding of the world, including autonomous driving, augmented reality, robotics, 3D reconstruction, and computational photography. While dedicated depth sensors such as LiDAR, structured light projectors, and time-of-flight cameras can capture depth directly, they add cost, weight, and power consumption to a system. Estimating depth from ordinary RGB cameras is therefore a long-standing research goal in computer vision.
Formally, depth estimation seeks a mapping f that takes an image I (or a set of images) and produces a depth map D, where D(u, v) represents the distance from the camera to the scene surface visible at pixel (u, v). The problem can be approached in several ways depending on the number of input views.
Monocular depth estimation predicts depth from a single RGB image. This is inherently an ill-posed problem because infinitely many 3D scenes can project onto the same 2D image. Humans resolve this ambiguity using learned priors: perspective cues, texture gradients, occlusion relationships, known object sizes, and atmospheric haze. Early computational approaches struggled with this task, but deep learning models trained on large datasets have learned to exploit these same cues effectively.
Stereo depth estimation uses a pair of images captured from two horizontally offset cameras (a stereo rig) to compute depth through triangulation. The core idea comes from binocular vision: by identifying corresponding points in the left and right images and measuring their horizontal displacement (disparity), the depth at each pixel can be calculated using the formula:
z = (f * B) / d
where z is depth, f is the focal length, B is the baseline (distance between cameras), and d is the disparity in pixels. Larger disparities correspond to closer objects, and smaller disparities correspond to objects further away.
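The conversion is a one-liner in practice. The sketch below uses hypothetical KITTI-like calibration values (f ≈ 721 px, B ≈ 0.54 m) purely as an example:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters): z = f * B / d.

    Zero-disparity pixels (no match / infinitely far) map to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return (focal_length_px * baseline_m) / disparity

# Hypothetical KITTI-like rig: f ~ 721 px, baseline ~ 0.54 m.
depth = disparity_to_depth(np.array([97.3, 9.7]), 721.0, 0.54)
# The larger disparity (97.3 px) gives roughly 4 m; the smaller (9.7 px) roughly 40 m.
```

Note the inverse relationship: depth precision degrades quadratically with distance, which is why stereo rigs are far more accurate up close.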
Multi-view depth estimation extends the stereo concept to more than two images. By observing a scene from multiple viewpoints, depth can be inferred with higher accuracy and completeness. Techniques such as Structure from Motion (SfM) and Multi-View Stereo (MVS) fall into this category.
Stereo matching algorithms find pixel correspondences between rectified stereo image pairs and produce a disparity map, which can be converted to depth. The process typically follows four steps: cost computation, cost aggregation, disparity optimization, and disparity refinement.
Local methods evaluate a small neighborhood around each pixel and select the disparity that minimizes a matching cost (such as the sum of absolute differences or the census transform). Local methods are fast but sensitive to textureless regions and occlusions.
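A minimal local matcher along these lines can be sketched in a few lines of NumPy. The function name and the integral-image box-filter aggregation are illustrative, not taken from any particular implementation:

```python
import numpy as np

def sad_block_match(left, right, max_disp, radius=2):
    """Winner-takes-all local stereo matching with a sum-of-absolute-
    differences (SAD) cost over (2*radius+1)^2 windows.

    Assumes a rectified grayscale pair, so matches lie on the same row
    and candidate shifts are purely horizontal.
    """
    h, w = left.shape
    k = 2 * radius + 1
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # Per-pixel cost for disparity d: compare left pixels with
        # right pixels shifted d columns to the left.
        diff = np.abs(left[:, d:] - right[:, : w - d])
        # Box-filter aggregation over the window via an integral image.
        padded = np.pad(diff, radius, mode="edge")
        integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
        integral[1:, 1:] = padded.cumsum(0).cumsum(1)
        cost[d, :, d:] = (integral[k:, k:] - integral[:-k, k:]
                          - integral[k:, :-k] + integral[:-k, :-k])
    return cost.argmin(axis=0)  # per-pixel disparity with lowest cost
```

On a textureless wall the cost is nearly flat across disparities, which is exactly why these methods fail there.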
Global methods formulate stereo matching as an energy minimization problem over the entire image. They define a cost function that balances data fidelity (how well pixel intensities match) with smoothness (the assumption that neighboring pixels likely have similar depths). Graph cuts and belief propagation are classic optimization techniques for global stereo matching.
Semi-Global Matching (SGM), introduced by Heiko Hirschmuller in 2005, strikes a balance between local and global methods. SGM approximates a global 2D smoothness constraint by aggregating matching costs along multiple (typically 8 or 16) one-dimensional paths through the image. This yields near-global-quality results at a fraction of the computational cost. SGM has been widely adopted in real-time stereo applications, including robotics, advanced driver assistance systems, and satellite photogrammetry, because of its favorable accuracy-to-speed tradeoff and its suitability for parallel hardware implementations on FPGAs and GPUs.
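The heart of SGM is a simple 1-D recurrence. The sketch below aggregates costs along a single left-to-right path (P1 and P2 are the small and large smoothness penalties; names and default values are illustrative):

```python
import numpy as np

def sgm_aggregate_row(cost_row, P1=1.0, P2=8.0):
    """Aggregate matching costs along one scanline (left-to-right path).

    cost_row: (W, D) raw matching costs for one image row.
    Implements L(p, d) = C(p, d) + min(L(p-1, d),
                                       L(p-1, d-1) + P1,
                                       L(p-1, d+1) + P1,
                                       min_d' L(p-1, d') + P2)
                       - min_d' L(p-1, d')
    where the subtraction keeps the values from growing unboundedly.
    """
    W, D = cost_row.shape
    L = np.empty((W, D), dtype=np.float64)
    L[0] = cost_row[0]
    for x in range(1, W):
        prev = L[x - 1]
        prev_min = prev.min()
        # Candidates per disparity: stay, move +-1 with P1, jump with P2.
        from_below = np.concatenate(([np.inf], prev[:-1])) + P1  # d - 1
        from_above = np.concatenate((prev[1:], [np.inf])) + P1   # d + 1
        L[x] = cost_row[x] + np.minimum.reduce(
            [prev, from_below, from_above, np.full(D, prev_min + P2)]
        ) - prev_min
    return L

# Ambiguity at x=1 is resolved toward the evidence at x=0 plus a P1 penalty:
L = sgm_aggregate_row(np.array([[0.0, 5.0], [5.0, 0.0]]))
```

In the full algorithm this recurrence runs along 8 or 16 directions and the per-direction aggregated costs are summed before the winner-takes-all disparity selection.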
Structure from Motion (SfM) recovers both 3D scene structure and camera poses from a collection of 2D images taken from different viewpoints. The classical SfM pipeline consists of several stages:
- Feature detection and description (e.g., SIFT keypoints) in each image
- Feature matching across image pairs, followed by geometric verification using epipolar constraints
- Camera pose estimation and triangulation of matched features into sparse 3D points
- Bundle adjustment, which jointly refines camera parameters and 3D points by minimizing reprojection error
SfM can be performed incrementally (adding one camera at a time) or globally (solving for all cameras simultaneously). While SfM primarily produces sparse 3D point clouds, it provides the camera parameters needed for dense reconstruction methods like Multi-View Stereo.
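One building block of this pipeline, triangulating a 3D point from two calibrated views, can be sketched with the Direct Linear Transform; the camera setup below is a toy example with identity intrinsics:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views with the Direct Linear
    Transform. P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel
    coordinates of the same point in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy rig: identity intrinsics, second camera shifted 1 unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = X_true[:2] / X_true[2]
x2 = (X_true - np.array([1.0, 0.0, 0.0]))[:2] / X_true[2]
X_est = triangulate_dlt(P1, P2, x1, x2)   # recovers X_true
```

Real pipelines triangulate after pose estimation and then refine everything jointly in bundle adjustment.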
The application of deep learning to monocular depth estimation has transformed the field. Neural networks can learn complex priors about scene geometry from large training datasets, enabling accurate depth prediction from a single image.
David Eigen, Christian Puhrsch, and Rob Fergus published "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" at NeurIPS 2014 (then called NIPS). This paper is widely regarded as the foundational work for deep learning-based monocular depth estimation. The architecture uses two stacked convolutional neural networks: a coarse-scale network that captures global scene structure from the full image, and a fine-scale network that refines the prediction using local details. The authors also introduced a scale-invariant loss function that focuses on relative depth relationships rather than absolute scale. On the NYU Depth V2 benchmark, the model achieved an AbsRel error of 0.215 and a delta-1 accuracy of 0.611, which represented a major improvement over prior non-learning methods. While these numbers are far below current standards, the paper demonstrated that a neural network could learn meaningful depth priors from data alone.
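The scale-invariant loss can be stated in a few lines. In the sketch below, d_i = log(pred_i) - log(gt_i) and lambda defaults to 0.5 as in the paper; with lambda = 1 the loss ignores global scale entirely:

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    """Scale-invariant MSE in log space (Eigen et al., 2014).

    With d_i = log(pred_i) - log(gt_i):
        loss = mean(d^2) - lam * mean(d)^2
    A global rescaling of the prediction shifts every d_i by the same
    constant, which the second term cancels completely when lam = 1.
    """
    d = np.log(pred) - np.log(gt)
    return np.mean(d ** 2) - lam * np.mean(d) ** 2

gt = np.array([1.0, 2.0, 4.0])
loss_exact = scale_invariant_loss(gt, gt)                # 0: perfect prediction
loss_scaled = scale_invariant_loss(2 * gt, gt, lam=1.0)  # ~0: scale fully ignored
```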
MiDaS (Monocular Depth estimation via a Single image) was developed by Rene Ranftl and colleagues at Intel Labs. The key insight behind MiDaS was that training on a diverse mixture of datasets produces models with strong zero-shot generalization. Prior monocular depth models trained on a single dataset (such as NYU or KITTI) often failed when applied to images from different domains.
The original MiDaS paper, "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer," was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in 2020. The authors trained on a combination of datasets including ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, and NYU Depth V2 (up to 12 datasets in later versions) using multi-objective optimization. Because these datasets use different depth representations and scales, MiDaS predicts relative inverse depth rather than absolute metric depth.
MiDaS v3.1, released in 2023, expanded the model zoo to include backbones based on BEiT, Swin, SwinV2, Next-ViT, and LeViT transformers, in addition to the original ViT backbone. The BEiT-based models achieved the highest depth estimation quality, while smaller backbones like LeViT enabled efficient inference for real-time applications.
DPT (Dense Prediction Transformers) was introduced by Ranftl, Bochkovskiy, and Koltun at ICCV 2021 in the paper "Vision Transformers for Dense Prediction." DPT replaced the convolutional backbone of MiDaS with a Vision Transformer (ViT) encoder. The architecture works by dividing the input image into non-overlapping patches, projecting them into token embeddings, and processing them through standard transformer encoder layers. Tokens from multiple transformer stages are then reassembled into image-like representations at multiple resolutions and fused through a convolutional decoder to produce dense depth predictions.
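The patchification step can be sketched in NumPy. With a 384 x 384 input and the typical patch size of 16, this yields 576 tokens; in the full model each token is then linearly projected to the embedding width and summed with a positional embedding:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches, flattening
    each patch into one token vector, as at the input of a ViT encoder.
    Returns an (N, patch_size * patch_size * C) array of tokens."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0
    return (image.reshape(H // p, p, W // p, p, C)
                 .transpose(0, 2, 1, 3, 4)   # group patch rows/cols together
                 .reshape(-1, p * p * C))

tokens = patchify(np.zeros((384, 384, 3)))   # 24 * 24 = 576 tokens, dim 768
```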
The key advantage of DPT over convolutional architectures is the transformer's global receptive field at every stage, which allows it to capture long-range spatial relationships. This produces depth maps with finer details and more globally coherent structure. DPT improved monocular depth estimation by up to 28% relative to the best convolutional approaches at the time and set new performance records on both the NYU Depth V2 and KITTI benchmarks.
Three variants were released: DPT-Base (ViT-B), DPT-Large (ViT-L), and DPT-Hybrid (using a ResNet-50 feature extractor combined with transformer layers).
ZoeDepth, published as "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth" (arXiv:2302.12288), addresses a fundamental limitation of models like MiDaS: they predict only relative depth and cannot provide measurements in real-world units. ZoeDepth bridges the gap between relative and metric depth estimation through a two-stage approach.
The model first pretrains on 12 datasets using the MiDaS framework to learn robust relative depth representations with strong generalization. It then fine-tunes on specific target domains (such as NYU Depth V2 for indoor scenes or KITTI for outdoor scenes) using a novel metric bins module appended to the decoder. This module predicts domain-specific depth bin centers and combines them with the relative depth features to produce metric depth output.
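The final step of a bins-based head, combining predicted bin centers with per-pixel probabilities, reduces to a weighted sum. A sketch with illustrative names and shapes (not ZoeDepth's actual code):

```python
import numpy as np

def bins_to_depth(bin_centers, logits):
    """Final depth as a probability-weighted sum of bin centers.

    bin_centers: (K,) candidate depths in meters.
    logits: (K, H, W) per-pixel scores over the K bins.
    Returns (H, W) metric depth: depth = sum_k p_k * c_k.
    """
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)           # softmax over bins
    return np.tensordot(bin_centers, probs, axes=1)    # contract the K axis

centers = np.array([1.0, 2.0, 4.0, 8.0])
logits = np.zeros((4, 2, 2))
logits[2] = 5.0            # every pixel strongly favors the 4 m bin
depth = bins_to_depth(centers, logits)   # values just under 4 m
```

Because depth is a soft combination rather than a hard bin assignment, the output remains continuous and the head stays differentiable end to end.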
The flagship model, ZoeD-M12-NK, was the first to jointly train on multiple datasets (NYU Depth V2 and KITTI) without significant performance degradation. During inference, a latent classifier automatically routes each input image to the appropriate domain-specific head. ZoeDepth achieved unprecedented zero-shot generalization performance across eight unseen datasets spanning both indoor and outdoor domains.
Depth Anything, published at CVPR 2024, represents a shift toward building foundation models for monocular depth estimation. The paper, "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," was authored by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao from the University of Hong Kong and TikTok (ByteDance).
The core idea is a semi-supervised training pipeline that leverages massive amounts of unlabeled data. The process works as follows:
1. A teacher model is trained on roughly 1.5 million labeled images from standard depth datasets.
2. The teacher generates pseudo depth labels for approximately 62 million unlabeled images.
3. A student model is trained on the combined data, with strong perturbations (color distortions and CutMix) applied to the unlabeled images so that the student must do more than simply reproduce the teacher.
4. An auxiliary feature alignment loss encourages the student's features to match those of a frozen DINOv2 encoder, preserving rich semantic priors.
The massive scale of training data significantly reduced generalization error, allowing Depth Anything to produce robust relative depth estimates across diverse scenes, from indoor rooms to outdoor landscapes, even in challenging conditions. The models are available in three sizes based on the DINOv2-ViT encoder.
| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything-S | ViT-S | 24.8M | 0.053 | 0.972 | 0.080 | 0.936 |
| Depth Anything-B | ViT-B | 97.5M | 0.046 | 0.979 | 0.080 | 0.939 |
| Depth Anything-L | ViT-L | 335.3M | 0.043 | 0.981 | 0.076 | 0.947 |
Notably, Depth Anything did not use NYU Depth V2 or KITTI data during pretraining; these numbers reflect fine-tuned metric depth performance. The model also demonstrated strong results when used as a depth prior for downstream tasks, achieving 86.2 mIoU on Cityscapes semantic segmentation and 59.4 mIoU on ADE20K when the encoder was fine-tuned.
Depth Anything V2, published at NeurIPS 2024, significantly improved upon its predecessor through three key changes:
1. Replacing all labeled real images with precise synthetic images when training the teacher model, avoiding the label noise of real depth sensors.
2. Scaling the teacher up to the largest available capacity (a DINOv2-Giant based model).
3. Training the student models on large-scale pseudo-labeled real images produced by the teacher, which bridges the synthetic-to-real domain gap.
This approach produced depth maps with finer details and greater robustness. On the DA-2K benchmark (a curated evaluation set), the V2-Large model achieved 97.1% accuracy. On the NYU Depth V2 and KITTI benchmarks with metric depth fine-tuning, V2 set new records.
| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything V2-S | ViT-S | 24.8M | 0.073 | 0.961 | 0.053 | 0.973 |
| Depth Anything V2-B | ViT-B | 97.5M | 0.063 | 0.977 | 0.048 | 0.979 |
| Depth Anything V2-L | ViT-L | 335.3M | 0.056 | 0.984 | 0.045 | 0.983 |
| Depth Anything V2-G | ViT-G | 1.3B | N/A | N/A | N/A | N/A |
Compared to diffusion-based depth estimation models like Marigold, Depth Anything V2 is more than 10 times faster while achieving comparable or superior accuracy.
Marigold, presented as an Oral paper and Best Paper Award Candidate at CVPR 2024, took a different approach by repurposing a pretrained diffusion model for depth estimation. The method fine-tunes the Stable Diffusion U-Net on synthetic depth data by encoding both the RGB image and its corresponding depth map into the latent space using the original VAE encoder, concatenating the two latent codes, and optimizing the standard diffusion denoising objective.
Despite being trained exclusively on synthetic data, Marigold achieves strong zero-shot transfer to real-world images. The model leverages the rich visual priors learned by Stable Diffusion during its large-scale pretraining on billions of images. Marigold produces affine-invariant (relative) depth predictions with high detail quality. A faster variant, Marigold-LCM, uses latent consistency distillation to reduce the number of required denoising steps.
A critical distinction in depth estimation is between metric depth and relative depth.
Relative depth captures the ordinal relationships between points in a scene: which objects are closer and which are farther. The depth values are in arbitrary units that can vary across images. Relative depth models (such as MiDaS and the base Depth Anything models) generalize well across diverse scenes because they do not need to learn scene-specific scale and shift. However, relative depth alone is insufficient for applications requiring precise measurements, such as navigation or 3D reconstruction with correct dimensions.
Metric depth provides depth values in real-world units, typically meters. Metric depth estimation is harder because the absolute scale of a scene is ambiguous from a single image. A photograph of a small model room and a photograph of a full-sized room can look nearly identical. Metric depth models (such as ZoeDepth and the fine-tuned Depth Anything variants) resolve this ambiguity by learning domain-specific scale priors during training. The tradeoff is that metric models often generalize less well across domains. A model fine-tuned on indoor scenes at 0 to 10 meters range may struggle with outdoor scenes at 0 to 80 meters range.
Recent work such as ZoeDepth and UniDepth attempts to bridge this gap by combining the generalization strength of relative depth models with domain-specific metric heads.
A major limitation of supervised depth estimation is the need for ground truth depth data, which requires expensive sensors (LiDAR, structured light) and careful calibration. Self-supervised approaches sidestep this requirement by using photometric consistency as the training signal.
Clement Godard, Oisin Mac Aodha, and Gabriel Brostow published "Unsupervised Monocular Depth Estimation with Left-Right Consistency" at CVPR 2017. The key idea is to train a network to predict depth from a single image by using stereo image pairs during training only. The predicted depth map is used to warp one image of the stereo pair to reconstruct the other, and the photometric reconstruction error serves as the loss function. Godard et al. introduced a left-right consistency constraint that enforces agreement between the disparity maps predicted from the left and right images, reducing artifacts around occlusion boundaries.
This approach achieved results on the KITTI benchmark that were competitive with, and in some metrics surpassed, fully supervised methods of the time.
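The training signal can be sketched in NumPy. The toy warp below uses nearest-neighbor sampling for clarity; the actual methods use differentiable bilinear sampling and combine the L1 term with SSIM and a disparity smoothness penalty:

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x)
    for every left pixel (rectified pair, nearest-neighbor sampling)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - np.round(disparity).astype(int)
    xs = np.clip(xs, 0, w - 1)
    return right[np.arange(h)[:, None], xs]

def photometric_loss(left, reconstructed):
    """Mean absolute (L1) photometric reconstruction error."""
    return np.abs(left - reconstructed).mean()
```

If the predicted disparity is correct, the warped right image matches the left image and the loss vanishes, so depth is learned without any ground truth depth maps.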
Monodepth2, published at ICCV 2019 as "Digging Into Self-Supervised Monocular Depth Estimation," extended the self-supervised framework in several important ways. Rather than requiring stereo pairs, Monodepth2 can also train using monocular video sequences. It jointly predicts depth and ego-motion (the camera's movement between frames) and uses photometric reprojection loss between adjacent frames.
Three key contributions improved over the original Monodepth:
1. A per-pixel minimum reprojection loss, which takes the minimum photometric error over the source frames instead of the average, handling occlusions more robustly.
2. Auto-masking of pixels whose appearance does not change between frames (e.g., objects moving at the same speed as the camera, or static frames), which would otherwise be assigned infinite depth.
3. Full-resolution multi-scale sampling: intermediate low-resolution depth maps are upsampled to the input resolution before computing the photometric loss, reducing texture-copy artifacts.
Monodepth2 demonstrated empirically that careful redesign of the loss function could outperform more complex architectures, simplifying the pipeline while improving results.
Depth estimation performance is measured using a standard set of error metrics and accuracy metrics. Given a predicted depth map D* and a ground truth depth map D, with N valid pixels (d_i and d_i* denote the ground-truth and predicted depth at pixel i), the following metrics are commonly used.
| Metric | Name | Formula | Description |
|---|---|---|---|
| AbsRel | Absolute Relative Error | (1/N) * sum of abs(d_i - d_i*) / d_i | Mean relative error, normalized by ground-truth depth |
| SqRel | Squared Relative Error | (1/N) * sum of (d_i - d_i*)^2 / d_i | Penalizes large errors more heavily than AbsRel |
| RMSE | Root Mean Square Error | sqrt((1/N) * sum of (d_i - d_i*)^2) | Standard deviation of prediction errors in absolute depth units |
| RMSE log | Log RMSE | sqrt((1/N) * sum of (log(d_i) - log(d_i*))^2) | RMSE computed in log-space, reducing sensitivity to absolute scale |
| SILog | Scale-Invariant Log Error | sqrt(mean(delta_log^2) - 0.5 * mean(delta_log)^2) | Measures depth accuracy independent of global scale |
| Metric | Name | Formula | Description |
|---|---|---|---|
| delta-1 | Threshold accuracy at 1.25 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.25 | Proportion of predictions within 25% of ground truth |
| delta-2 | Threshold accuracy at 1.25^2 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.5625 | Proportion within ~56% of ground truth |
| delta-3 | Threshold accuracy at 1.25^3 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.953 | Proportion within ~95% of ground truth |
AbsRel and delta-1 are the two most commonly reported metrics in recent papers. For relative depth models, predictions and ground truth are typically aligned in scale and shift for each image before computing error metrics.
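These headline metrics, plus a least-squares scale-and-shift alignment for relative predictions, can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative error: mean of |d_i - d_i*| / d_i over valid pixels."""
    return np.mean(np.abs(gt - pred) / gt)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(d_i/d_i*, d_i*/d_i) < threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < threshold)

def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t aligning a relative prediction
    to ground truth before evaluation: argmin_{s,t} ||s*pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t

gt = np.array([2.0, 4.0, 8.0])
pred = np.array([2.2, 3.8, 8.8])
err = abs_rel(pred, gt)          # relative errors 0.10, 0.05, 0.10
acc = delta_accuracy(pred, gt)   # every ratio is below 1.25
```

In real evaluations the metrics are computed only over valid ground-truth pixels (e.g., where LiDAR returns exist on KITTI) and often after cropping and depth capping.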
Progress in depth estimation has been driven by several benchmark datasets that provide paired RGB images and ground truth depth maps.
NYU Depth V2 was collected by Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, and released in 2012 alongside the paper "Indoor Segmentation and Support Inference from RGBD Images" at ECCV. The dataset was captured using a Microsoft Kinect sensor in 464 different indoor scenes across three cities, spanning bedrooms, kitchens, offices, living rooms, bookstores, cafes, and other environments.
| Property | Detail |
|---|---|
| Labeled pairs | 1,449 densely aligned RGB-depth pairs |
| Raw frames | 407,024 unlabeled RGB-depth frames |
| Resolution | 640 x 480 pixels |
| Depth sensor | Microsoft Kinect (structured light) |
| Depth range | 0.5 to 10 meters |
| Scene type | Indoor |
| Standard split | 249 scenes for training, 215 scenes for testing (654 test images) |
NYU Depth V2 remains the most widely used indoor benchmark for monocular depth estimation.
The KITTI Vision Benchmark Suite was introduced by Andreas Geiger, Philip Lenz, and Raquel Urtasun at CVPR 2012 in the paper "Are We Ready for Autonomous Driving?" The dataset was recorded from a car driving through Karlsruhe, Germany, using a stereo camera rig, a Velodyne HDL-64E LiDAR scanner, and a GPS/IMU system.
| Property | Detail |
|---|---|
| Stereo pairs | 389 stereo and optical flow image pairs |
| Image resolution | Approximately 1242 x 375 pixels (after rectification) |
| Ground truth depth | Velodyne LiDAR (sparse, projected onto image plane) |
| Depth range | 0 to approximately 80 meters |
| Scene type | Outdoor (urban driving) |
| Eigen split | 23,488 training images, 697 test images |
KITTI ground truth is sparse because LiDAR returns occupy only a fraction of the image pixels. Evaluation is performed only at pixels where LiDAR data is available. KITTI remains the primary outdoor driving benchmark for depth estimation.
ScanNet was published by Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner at CVPR 2017. It provides RGB-D video sequences of indoor scenes captured with commodity depth sensors.
| Property | Detail |
|---|---|
| Total views | 2.5 million RGB-D frames |
| Scenes | 1,513 indoor scenes |
| Annotations | 3D camera poses, surface reconstructions, instance-level semantic segmentations |
| Depth sensor | Structure Sensor (structured light) |
| Scene type | Indoor |
ScanNet has been used both for depth estimation evaluation and as a training data source, particularly for methods targeting indoor environments.
| Dataset | Year | Scene type | Key characteristics |
|---|---|---|---|
| Middlebury Stereo | 2002+ | Indoor/controlled | High-quality structured light ground truth for stereo evaluation |
| ETH3D | 2017 | Indoor/outdoor | High-resolution images with multi-view and LiDAR ground truth |
| Cityscapes | 2016 | Urban driving | 5,000 images with stereo pairs and disparity maps |
| DIODE | 2019 | Indoor/outdoor | Laser scanner ground truth with both indoor and outdoor scenes |
| TartanAir | 2020 | Synthetic | Photo-realistic rendered scenes with perfect ground truth |
| Hypersim | 2021 | Synthetic indoor | 77,400 synthetic images with perfect metric depth |
| Virtual KITTI | 2016/2020 | Synthetic driving | Synthetic clone of KITTI driving sequences |
The following table summarizes key monocular depth estimation models, ordered chronologically.
| Model | Year | Venue | Supervision | Depth type | Backbone | Key contribution |
|---|---|---|---|---|---|---|
| Eigen et al. | 2014 | NeurIPS | Supervised | Metric | CNN (custom) | First deep learning approach; coarse-to-fine multi-scale architecture |
| Monodepth | 2017 | CVPR | Self-supervised | Metric | ResNet | Left-right consistency loss; no ground truth depth needed |
| Monodepth2 | 2019 | ICCV | Self-supervised | Metric | ResNet-18 | Auto-masking, minimum reprojection loss, monocular video training |
| BTS | 2019 | arXiv | Supervised | Metric | DenseNet-161 | Local planar guidance layers for sharp boundaries |
| MiDaS | 2020 | TPAMI | Supervised | Relative | ResNeXt-101 | Multi-dataset training for zero-shot cross-dataset transfer |
| AdaBins | 2021 | CVPR | Supervised | Metric | EfficientNet-B5 | Adaptive bin-center prediction for fine-grained depth |
| DPT | 2021 | ICCV | Supervised | Relative | ViT-Large | Vision Transformer backbone for dense prediction |
| ZoeDepth | 2023 | arXiv | Supervised | Metric | BEiT-L (MiDaS) | Combines relative and metric depth; domain-specific heads |
| MiDaS v3.1 | 2023 | arXiv | Supervised | Relative | BEiT/Swin/ViT | Model zoo with multiple transformer backbones |
| Marigold | 2024 | CVPR | Supervised (synthetic) | Relative | Stable Diffusion U-Net | Repurposed diffusion model; synthetic-only training |
| Depth Anything | 2024 | CVPR | Semi-supervised | Relative/Metric | DINOv2-ViT | 62M+ unlabeled images; foundation model approach |
| Depth Anything V2 | 2024 | NeurIPS | Semi-supervised | Relative/Metric | DINOv2-ViT | Synthetic teacher training; models from 25M to 1.3B parameters |
Depth estimation is fundamental to self-driving vehicles. Perception systems need to understand the 3D layout of the road, detect obstacles, measure distances to other vehicles and pedestrians, and plan safe trajectories. While most autonomous vehicle systems use LiDAR as a primary depth sensor, monocular and stereo depth estimation from cameras serves as a complementary or backup modality. Tesla's camera-only approach relies entirely on vision-based depth estimation. Depth maps help with lane detection, free-space estimation, and collision avoidance.
AR applications need to understand scene geometry to place virtual objects realistically in the physical world. Depth estimation enables occlusion handling (virtual objects should be hidden behind real objects that are closer to the camera), surface detection (placing objects on tables, floors, and walls), and lighting estimation. Apple's ARKit and Google's ARCore frameworks use a combination of inertial measurement, visual odometry, and depth estimation to build real-time spatial maps. The LiDAR scanner on iPhone Pro models provides hardware depth sensing that complements software-based estimates.
Depth estimation is a core building block for 3D reconstruction pipelines. Given depth maps from multiple viewpoints along with camera poses, a 3D point cloud or mesh can be constructed through depth fusion (using algorithms such as TSDF fusion). Recent neural 3D reconstruction methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, benefit from depth priors produced by monocular estimation models. Depth Anything and similar models are increasingly used to provide dense depth supervision that accelerates NeRF training and improves reconstruction quality, especially in regions with limited multi-view coverage.
Robots navigating in unstructured environments need depth information for obstacle avoidance, path planning, and manipulation. While many robots carry dedicated depth sensors (Intel RealSense, stereo cameras), monocular depth estimation provides a lightweight alternative for drones, small robots, or scenarios where weight and power budgets are constrained. Self-supervised depth estimation is particularly relevant for robotics because robots can collect massive amounts of unlabeled video data during operation.
Smartphone portrait mode effects rely on depth estimation to separate the foreground subject from the background and apply synthetic bokeh (background blur). Early implementations (such as on the iPhone 7 Plus in 2016) used dual cameras in a stereo configuration. Modern smartphones with a single camera use deep learning-based monocular depth estimation to achieve similar effects. Depth maps also enable other computational photography features: refocusing after capture, relighting, and 3D photo effects. The LiDAR scanner on iPhone Pro models captures hardware depth that improves portrait mode accuracy, especially in low light.
Video Depth Anything, presented at CVPR 2025, extends the Depth Anything framework to video sequences by enforcing temporal consistency across frames. Temporally consistent depth is important for applications like video editing, visual effects, and augmented reality, where flickering or inconsistent depth maps between frames create visible artifacts.
Despite rapid progress, several challenges remain in depth estimation:
Scale ambiguity in monocular estimation: A single image provides no absolute scale information. While metric depth models attempt to learn scale priors, they often fail on scenes outside their training distribution. The depth estimation community currently lacks a widely accepted benchmarking standard for evaluating cross-domain generalization.
Textureless and reflective surfaces: Stereo matching and multi-view methods struggle with surfaces that lack visual texture (white walls, glass, water). Monocular methods can sometimes infer depth from context, but reflective and transparent surfaces remain difficult for all approaches.
Dynamic scenes: Moving objects violate the static-scene assumption used by many multi-view and self-supervised methods. Handling independently moving objects while estimating camera ego-motion remains an active research area.
Evaluation inconsistencies: Different papers use different evaluation protocols (crop ratios, depth caps, alignment procedures), making direct comparisons difficult. The field is working toward standardized evaluation practices.
Outdoor long-range depth: Estimating accurate metric depth at long range (beyond 80 meters) is challenging because small errors in disparity or predicted depth translate to large absolute errors at distance.