Depth estimation is the task of predicting per-pixel depth values from one or more images, producing a dense depth map that encodes how far each point in the scene is from the camera. It sits at the core of 3D computer vision and enables machines to reason about the spatial layout of scenes. Given an input image of size H x W, a depth estimation model outputs a depth map of the same spatial dimensions, where each pixel stores a scalar depth value (in meters for metric depth, or in arbitrary units for relative depth).
Depth information is essential for applications that require geometric understanding of the world, including autonomous driving, augmented reality, robotics, 3D reconstruction, and computational photography. While dedicated depth sensors such as LiDAR, structured light projectors, and time-of-flight cameras can capture depth directly, they add cost, weight, and power consumption to a system. Estimating depth from ordinary RGB cameras is therefore a long-standing research goal in computer vision.
Formally, depth estimation seeks a mapping f that takes an image I (or a set of images) and produces a depth map D, where D(u, v) represents the distance from the camera to the scene surface visible at pixel (u, v). The problem can be approached in several ways depending on the number of input views.
Monocular depth estimation predicts depth from a single RGB image. This is inherently an ill-posed problem because infinitely many 3D scenes can project onto the same 2D image. Humans resolve this ambiguity using learned priors: perspective cues, texture gradients, occlusion relationships, known object sizes, and atmospheric haze. Early computational approaches struggled with this task, but deep learning models trained on large datasets have learned to exploit these same cues effectively.
Stereo depth estimation uses a pair of images captured from two horizontally offset cameras (a stereo rig) to compute depth through triangulation. The core idea comes from binocular vision: by identifying corresponding points in the left and right images and measuring their horizontal displacement (disparity), the depth at each pixel can be calculated using the formula:
z = (f * B) / d
where z is depth, f is the focal length, B is the baseline (distance between cameras), and d is the disparity in pixels. Larger disparities correspond to closer objects, and smaller disparities correspond to objects further away.
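The conversion is a one-liner in practice. The sketch below uses hypothetical KITTI-like calibration values (f ≈ 721 px, B ≈ 0.54 m) purely as an example:

```python
import numpy as np

def disparity_to_depth(disparity, focal_length_px, baseline_m):
    """Convert a disparity map (pixels) to depth (meters): z = f * B / d.

    Zero-disparity pixels (no match / infinitely far) map to infinity.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    with np.errstate(divide="ignore"):
        return (focal_length_px * baseline_m) / disparity

# Hypothetical KITTI-like rig: f ~ 721 px, baseline ~ 0.54 m.
depth = disparity_to_depth(np.array([97.3, 9.7]), 721.0, 0.54)
# The larger disparity (97.3 px) gives roughly 4 m; the smaller (9.7 px) roughly 40 m.
```

Note the inverse relationship: depth precision degrades quadratically with distance, which is why stereo rigs are far more accurate up close.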
Multi-view depth estimation extends the stereo concept to more than two images. By observing a scene from multiple viewpoints, depth can be inferred with higher accuracy and completeness. Techniques such as Structure from Motion (SfM) and Multi-View Stereo (MVS) fall into this category.
Stereo matching algorithms find pixel correspondences between rectified stereo image pairs and produce a disparity map, which can be converted to depth. The process typically follows four steps: cost computation, cost aggregation, disparity optimization, and disparity refinement.
Local methods evaluate a small neighborhood around each pixel and select the disparity that minimizes a matching cost (such as the sum of absolute differences or the census transform). Local methods are fast but sensitive to textureless regions and occlusions.
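A minimal local matcher along these lines can be sketched in a few lines of NumPy. The function name and the integral-image box-filter aggregation are illustrative, not taken from any particular implementation:

```python
import numpy as np

def sad_block_match(left, right, max_disp, radius=2):
    """Winner-takes-all local stereo matching with a sum-of-absolute-
    differences (SAD) cost over (2*radius+1)^2 windows.

    Assumes a rectified grayscale pair, so matches lie on the same row
    and candidate shifts are purely horizontal.
    """
    h, w = left.shape
    k = 2 * radius + 1
    cost = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        # Per-pixel cost for disparity d: compare left pixels with
        # right pixels shifted d columns to the left.
        diff = np.abs(left[:, d:] - right[:, : w - d])
        # Box-filter aggregation over the window via an integral image.
        padded = np.pad(diff, radius, mode="edge")
        integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
        integral[1:, 1:] = padded.cumsum(0).cumsum(1)
        cost[d, :, d:] = (integral[k:, k:] - integral[:-k, k:]
                          - integral[k:, :-k] + integral[:-k, :-k])
    return cost.argmin(axis=0)  # per-pixel disparity with lowest cost
```

On a textureless wall the cost is nearly flat across disparities, which is exactly why these methods fail there.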
Global methods formulate stereo matching as an energy minimization problem over the entire image. They define a cost function that balances data fidelity (how well pixel intensities match) with smoothness (the assumption that neighboring pixels likely have similar depths). Graph cuts and belief propagation are classic optimization techniques for global stereo matching.
Semi-Global Matching (SGM), introduced by Heiko Hirschmuller in 2005, strikes a balance between local and global methods. SGM approximates a global 2D smoothness constraint by aggregating matching costs along multiple (typically 8 or 16) one-dimensional paths through the image. This yields near-global-quality results at a fraction of the computational cost. SGM has been widely adopted in real-time stereo applications, including robotics, advanced driver assistance systems, and satellite photogrammetry, because of its favorable accuracy-to-speed tradeoff and its suitability for parallel hardware implementations on FPGAs and GPUs.
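The heart of SGM is a simple 1-D recurrence. The sketch below aggregates costs along a single left-to-right path (P1 and P2 are the small and large smoothness penalties; names and default values are illustrative):

```python
import numpy as np

def sgm_aggregate_row(cost_row, P1=1.0, P2=8.0):
    """Aggregate matching costs along one scanline (left-to-right path).

    cost_row: (W, D) raw matching costs for one image row.
    Implements L(p, d) = C(p, d) + min(L(p-1, d),
                                       L(p-1, d-1) + P1,
                                       L(p-1, d+1) + P1,
                                       min_d' L(p-1, d') + P2)
                       - min_d' L(p-1, d')
    where the subtraction keeps the values from growing unboundedly.
    """
    W, D = cost_row.shape
    L = np.empty((W, D), dtype=np.float64)
    L[0] = cost_row[0]
    for x in range(1, W):
        prev = L[x - 1]
        prev_min = prev.min()
        # Candidates per disparity: stay, move +-1 with P1, jump with P2.
        from_below = np.concatenate(([np.inf], prev[:-1])) + P1  # d - 1
        from_above = np.concatenate((prev[1:], [np.inf])) + P1   # d + 1
        L[x] = cost_row[x] + np.minimum.reduce(
            [prev, from_below, from_above, np.full(D, prev_min + P2)]
        ) - prev_min
    return L

# Ambiguity at x=1 is resolved toward the evidence at x=0 plus a P1 penalty:
L = sgm_aggregate_row(np.array([[0.0, 5.0], [5.0, 0.0]]))
```

In the full algorithm this recurrence runs along 8 or 16 directions and the per-direction aggregated costs are summed before the winner-takes-all disparity selection.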
Structure from Motion (SfM) recovers both 3D scene structure and camera poses from a collection of 2D images taken from different viewpoints. The classical SfM pipeline consists of several stages:
- Feature detection and description (e.g., SIFT keypoints) in each image
- Feature matching across image pairs, followed by geometric verification using epipolar constraints
- Camera pose estimation and triangulation of matched features into sparse 3D points
- Bundle adjustment, which jointly refines camera parameters and 3D points by minimizing reprojection error
SfM can be performed incrementally (adding one camera at a time) or globally (solving for all cameras simultaneously). While SfM primarily produces sparse 3D point clouds, it provides the camera parameters needed for dense reconstruction methods like Multi-View Stereo.
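One building block of this pipeline, triangulating a 3D point from two calibrated views, can be sketched with the Direct Linear Transform; the camera setup below is a toy example with identity intrinsics:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Triangulate one 3D point from two views with the Direct Linear
    Transform. P1, P2: 3x4 projection matrices; x1, x2: (u, v) pixel
    coordinates of the same point in each view."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous solution is the right singular vector associated
    # with the smallest singular value of A.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]  # dehomogenize

# Toy rig: identity intrinsics, second camera shifted 1 unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])
x1 = X_true[:2] / X_true[2]
x2 = (X_true - np.array([1.0, 0.0, 0.0]))[:2] / X_true[2]
X_est = triangulate_dlt(P1, P2, x1, x2)   # recovers X_true
```

Real pipelines triangulate after pose estimation and then refine everything jointly in bundle adjustment.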
The application of deep learning to monocular depth estimation has transformed the field. Neural networks can learn complex priors about scene geometry from large training datasets, enabling accurate depth prediction from a single image.
David Eigen, Christian Puhrsch, and Rob Fergus published "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" at NeurIPS 2014 (then called NIPS). This paper is widely regarded as the foundational work for deep learning-based monocular depth estimation. The architecture uses two stacked convolutional neural networks: a coarse-scale network that captures global scene structure from the full image, and a fine-scale network that refines the prediction using local details. The authors also introduced a scale-invariant loss function that focuses on relative depth relationships rather than absolute scale. On the NYU Depth V2 benchmark, the model achieved an AbsRel error of 0.215 and a delta-1 accuracy of 0.611, which represented a major improvement over prior non-learning methods. While these numbers are far below current standards, the paper demonstrated that a neural network could learn meaningful depth priors from data alone.
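The scale-invariant loss can be stated in a few lines. In the sketch below, d_i = log(pred_i) - log(gt_i) and lambda defaults to 0.5 as in the paper; with lambda = 1 the loss ignores global scale entirely:

```python
import numpy as np

def scale_invariant_loss(pred, gt, lam=0.5):
    """Scale-invariant MSE in log space (Eigen et al., 2014).

    With d_i = log(pred_i) - log(gt_i):
        loss = mean(d^2) - lam * mean(d)^2
    A global rescaling of the prediction shifts every d_i by the same
    constant, which the second term cancels completely when lam = 1.
    """
    d = np.log(pred) - np.log(gt)
    return np.mean(d ** 2) - lam * np.mean(d) ** 2

gt = np.array([1.0, 2.0, 4.0])
loss_exact = scale_invariant_loss(gt, gt)                # 0: perfect prediction
loss_scaled = scale_invariant_loss(2 * gt, gt, lam=1.0)  # ~0: scale fully ignored
```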
MiDaS (Monocular Depth estimation via a Single image) was developed by Rene Ranftl and colleagues at Intel Labs. The key insight behind MiDaS was that training on a diverse mixture of datasets produces models with strong zero-shot generalization. Prior monocular depth models trained on a single dataset (such as NYU or KITTI) often failed when applied to images from different domains.
The original MiDaS paper, "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer," was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in 2020. The authors trained on a combination of datasets including ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, and NYU Depth V2 (up to 12 datasets in later versions) using multi-objective optimization. Because these datasets use different depth representations and scales, MiDaS predicts relative inverse depth rather than absolute metric depth.
MiDaS v3.1, released in 2023, expanded the model zoo to include backbones based on BEiT, Swin, SwinV2, Next-ViT, and LeViT transformers, in addition to the original ViT backbone. The BEiT-based models achieved the highest depth estimation quality, while smaller backbones like LeViT enabled efficient inference for real-time applications.
DPT (Dense Prediction Transformers) was introduced by Ranftl, Bochkovskiy, and Koltun at ICCV 2021 in the paper "Vision Transformers for Dense Prediction." DPT replaced the convolutional backbone of MiDaS with a Vision Transformer (ViT) encoder. The architecture works by dividing the input image into non-overlapping patches, projecting them into token embeddings, and processing them through standard transformer encoder layers. Tokens from multiple transformer stages are then reassembled into image-like representations at multiple resolutions and fused through a convolutional decoder to produce dense depth predictions.
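The patchification step can be sketched in NumPy. With a 384 x 384 input and the typical patch size of 16, this yields 576 tokens; in the full model each token is then linearly projected to the embedding width and summed with a positional embedding:

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an (H, W, C) image into non-overlapping patches, flattening
    each patch into one token vector, as at the input of a ViT encoder.
    Returns an (N, patch_size * patch_size * C) array of tokens."""
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0
    return (image.reshape(H // p, p, W // p, p, C)
                 .transpose(0, 2, 1, 3, 4)   # group patch rows/cols together
                 .reshape(-1, p * p * C))

tokens = patchify(np.zeros((384, 384, 3)))   # 24 * 24 = 576 tokens, dim 768
```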
The key advantage of DPT over convolutional architectures is the transformer's global receptive field at every stage, which allows it to capture long-range spatial relationships. This produces depth maps with finer details and more globally coherent structure. DPT improved monocular depth estimation by up to 28% relative to the best convolutional approaches at the time and set new performance records on both the NYU Depth V2 and KITTI benchmarks.
Three variants were released: DPT-Base (ViT-B), DPT-Large (ViT-L), and DPT-Hybrid (using a ResNet-50 feature extractor combined with transformer layers).
ZoeDepth, published as "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth" (arXiv:2302.12288), addresses a fundamental limitation of models like MiDaS: they predict only relative depth and cannot provide measurements in real-world units. ZoeDepth bridges the gap between relative and metric depth estimation through a two-stage approach.
The model first pretrains on 12 datasets using the MiDaS framework to learn robust relative depth representations with strong generalization. It then fine-tunes on specific target domains (such as NYU Depth V2 for indoor scenes or KITTI for outdoor scenes) using a novel metric bins module appended to the decoder. This module predicts domain-specific depth bin centers and combines them with the relative depth features to produce metric depth output.
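The final step of a bins-based head, combining predicted bin centers with per-pixel probabilities, reduces to a weighted sum. A sketch with illustrative names and shapes (not ZoeDepth's actual code):

```python
import numpy as np

def bins_to_depth(bin_centers, logits):
    """Final depth as a probability-weighted sum of bin centers.

    bin_centers: (K,) candidate depths in meters.
    logits: (K, H, W) per-pixel scores over the K bins.
    Returns (H, W) metric depth: depth = sum_k p_k * c_k.
    """
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)           # softmax over bins
    return np.tensordot(bin_centers, probs, axes=1)    # contract the K axis

centers = np.array([1.0, 2.0, 4.0, 8.0])
logits = np.zeros((4, 2, 2))
logits[2] = 5.0            # every pixel strongly favors the 4 m bin
depth = bins_to_depth(centers, logits)   # values just under 4 m
```

Because depth is a soft combination rather than a hard bin assignment, the output remains continuous and the head stays differentiable end to end.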
The flagship model, ZoeD-M12-NK, was the first to jointly train on multiple datasets (NYU Depth V2 and KITTI) without significant performance degradation. During inference, a latent classifier automatically routes each input image to the appropriate domain-specific head. ZoeDepth achieved unprecedented zero-shot generalization performance across eight unseen datasets spanning both indoor and outdoor domains.
Depth Anything, published at CVPR 2024, represents a shift toward building foundation models for monocular depth estimation. The paper, "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," was authored by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao from the University of Hong Kong and TikTok (ByteDance).
The core idea is a semi-supervised training pipeline that leverages massive amounts of unlabeled data. The process works as follows:
1. A teacher model is trained on roughly 1.5 million labeled images from standard depth datasets.
2. The teacher generates pseudo depth labels for approximately 62 million unlabeled images.
3. A student model is trained on the combined data, with strong perturbations (color distortions and CutMix) applied to the unlabeled images so that the student must do more than simply reproduce the teacher.
4. An auxiliary feature alignment loss encourages the student's features to match those of a frozen DINOv2 encoder, preserving rich semantic priors.
The massive scale of training data significantly reduced generalization error, allowing Depth Anything to produce robust relative depth estimates across diverse scenes, from indoor rooms to outdoor landscapes, even in challenging conditions. The models are available in three sizes based on the DINOv2-ViT encoder.
| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything-S | ViT-S | 24.8M | 0.053 | 0.972 | 0.080 | 0.936 |
| Depth Anything-B | ViT-B | 97.5M | 0.046 | 0.979 | 0.080 | 0.939 |
| Depth Anything-L | ViT-L | 335.3M | 0.043 | 0.981 | 0.076 | 0.947 |
Notably, Depth Anything did not use NYU Depth V2 or KITTI data during pretraining; these numbers reflect fine-tuned metric depth performance. The model also demonstrated strong results when used as a depth prior for downstream tasks, achieving 86.2 mIoU on Cityscapes semantic segmentation and 59.4 mIoU on ADE20K when the encoder was fine-tuned.
Depth Anything V2, published at NeurIPS 2024, significantly improved upon its predecessor through three key changes:
1. Replacing all labeled real images with precise synthetic images when training the teacher model, avoiding the label noise of real depth sensors.
2. Scaling the teacher up to the largest available capacity (a DINOv2-Giant based model).
3. Training the student models on large-scale pseudo-labeled real images produced by the teacher, which bridges the synthetic-to-real domain gap.
This approach produced depth maps with finer details and greater robustness. On the DA-2K benchmark (a curated evaluation set), the V2-Large model achieved 97.1% accuracy. On the NYU Depth V2 and KITTI benchmarks with metric depth fine-tuning, V2 set new records.
| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything V2-S | ViT-S | 24.8M | 0.073 | 0.961 | 0.053 | 0.973 |
| Depth Anything V2-B | ViT-B | 97.5M | 0.063 | 0.977 | 0.048 | 0.979 |
| Depth Anything V2-L | ViT-L | 335.3M | 0.056 | 0.984 | 0.045 | 0.983 |
| Depth Anything V2-G | ViT-G | 1.3B | N/A | N/A | N/A | N/A |
Compared to diffusion-based depth estimation models like Marigold, Depth Anything V2 is more than 10 times faster while achieving comparable or superior accuracy.
Marigold, presented as an Oral paper and Best Paper Award Candidate at CVPR 2024, took a different approach by repurposing a pretrained diffusion model for depth estimation. The method fine-tunes the Stable Diffusion U-Net on synthetic depth data by encoding both the RGB image and its corresponding depth map into the latent space using the original VAE encoder, concatenating the two latent codes, and optimizing the standard diffusion denoising objective.
Despite being trained exclusively on synthetic data, Marigold achieves strong zero-shot transfer to real-world images. The model leverages the rich visual priors learned by Stable Diffusion during its large-scale pretraining on billions of images. Marigold produces affine-invariant (relative) depth predictions with high detail quality. A faster variant, Marigold-LCM, uses latent consistency distillation to reduce the number of required denoising steps.
A critical distinction in depth estimation is between metric depth and relative depth.
Relative depth captures the ordinal relationships between points in a scene: which objects are closer and which are farther. The depth values are in arbitrary units that can vary across images. Relative depth models (such as MiDaS and the base Depth Anything models) generalize well across diverse scenes because they do not need to learn scene-specific scale and shift. However, relative depth alone is insufficient for applications requiring precise measurements, such as navigation or 3D reconstruction with correct dimensions.
Metric depth provides depth values in real-world units, typically meters. Metric depth estimation is harder because the absolute scale of a scene is ambiguous from a single image. A photograph of a small model room and a photograph of a full-sized room can look nearly identical. Metric depth models (such as ZoeDepth and the fine-tuned Depth Anything variants) resolve this ambiguity by learning domain-specific scale priors during training. The tradeoff is that metric models often generalize less well across domains. A model fine-tuned on indoor scenes at 0 to 10 meters range may struggle with outdoor scenes at 0 to 80 meters range.
Recent work such as ZoeDepth and UniDepth attempts to bridge this gap by combining the generalization strength of relative depth models with domain-specific metric heads.
A major limitation of supervised depth estimation is the need for ground truth depth data, which requires expensive sensors (LiDAR, structured light) and careful calibration. Self-supervised approaches sidestep this requirement by using photometric consistency as the training signal.
Clement Godard, Oisin Mac Aodha, and Gabriel Brostow published "Unsupervised Monocular Depth Estimation with Left-Right Consistency" at CVPR 2017. The key idea is to train a network to predict depth from a single image by using stereo image pairs during training only. The predicted depth map is used to warp one image of the stereo pair to reconstruct the other, and the photometric reconstruction error serves as the loss function. Godard et al. introduced a left-right consistency constraint that enforces agreement between the disparity maps predicted from the left and right images, reducing artifacts around occlusion boundaries.
This approach achieved results on the KITTI benchmark that were competitive with, and in some metrics surpassed, fully supervised methods of the time.
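The training signal can be sketched in NumPy. The toy warp below uses nearest-neighbor sampling for clarity; the actual methods use differentiable bilinear sampling and combine the L1 term with SSIM and a disparity smoothness penalty:

```python
import numpy as np

def warp_right_to_left(right, disparity):
    """Reconstruct the left view by sampling the right image at x - d(x)
    for every left pixel (rectified pair, nearest-neighbor sampling)."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - np.round(disparity).astype(int)
    xs = np.clip(xs, 0, w - 1)
    return right[np.arange(h)[:, None], xs]

def photometric_loss(left, reconstructed):
    """Mean absolute (L1) photometric reconstruction error."""
    return np.abs(left - reconstructed).mean()
```

If the predicted disparity is correct, the warped right image matches the left image and the loss vanishes, so depth is learned without any ground truth depth maps.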
Monodepth2, published at ICCV 2019 as "Digging Into Self-Supervised Monocular Depth Estimation," extended the self-supervised framework in several important ways. Rather than requiring stereo pairs, Monodepth2 can also train using monocular video sequences. It jointly predicts depth and ego-motion (the camera's movement between frames) and uses photometric reprojection loss between adjacent frames.
Three key contributions improved over the original Monodepth:
1. A per-pixel minimum reprojection loss, which takes the minimum photometric error over the source frames instead of the average, handling occlusions more robustly.
2. Auto-masking of pixels whose appearance does not change between frames (e.g., objects moving at the same speed as the camera, or static frames), which would otherwise be assigned infinite depth.
3. Full-resolution multi-scale sampling: intermediate low-resolution depth maps are upsampled to the input resolution before computing the photometric loss, reducing texture-copy artifacts.
Monodepth2 demonstrated empirically that careful redesign of the loss function could outperform more complex architectures, simplifying the pipeline while improving results.
Depth estimation performance is measured using a standard set of error metrics and accuracy metrics. Given a predicted depth map D* and a ground truth depth map D, with N valid pixels (d_i and d_i* denote the ground-truth and predicted depth at pixel i), the following metrics are commonly used.
| Metric | Name | Formula | Description |
|---|---|---|---|
| AbsRel | Absolute Relative Error | (1/N) * sum of abs(d_i - d_i*) / d_i | Mean relative error, normalized by ground-truth depth |
| SqRel | Squared Relative Error | (1/N) * sum of (d_i - d_i*)^2 / d_i | Penalizes large errors more heavily than AbsRel |
| RMSE | Root Mean Square Error | sqrt((1/N) * sum of (d_i - d_i*)^2) | Standard deviation of prediction errors in absolute depth units |
| RMSE log | Log RMSE | sqrt((1/N) * sum of (log(d_i) - log(d_i*))^2) | RMSE computed in log-space, reducing sensitivity to absolute scale |
| SILog | Scale-Invariant Log Error | sqrt(mean(delta_log^2) - 0.5 * mean(delta_log)^2) | Measures depth accuracy independent of global scale |
| Metric | Name | Formula | Description |
|---|---|---|---|
| delta-1 | Threshold accuracy at 1.25 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.25 | Proportion of predictions within 25% of ground truth |
| delta-2 | Threshold accuracy at 1.25^2 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.5625 | Proportion within ~56% of ground truth |
| delta-3 | Threshold accuracy at 1.25^3 | % of pixels where max(d_i/d_i*, d_i*/d_i) < 1.953 | Proportion within ~95% of ground truth |
AbsRel and delta-1 are the two most commonly reported metrics in recent papers. For relative depth models, predictions and ground truth are typically aligned in scale and shift for each image before computing error metrics.
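These headline metrics, plus a least-squares scale-and-shift alignment for relative predictions, can be sketched as follows (helper names are illustrative):

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute relative error: mean of |d_i - d_i*| / d_i over valid pixels."""
    return np.mean(np.abs(gt - pred) / gt)

def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(d_i/d_i*, d_i*/d_i) < threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return np.mean(ratio < threshold)

def align_scale_shift(pred, gt):
    """Least-squares scale s and shift t aligning a relative prediction
    to ground truth before evaluation: argmin_{s,t} ||s*pred + t - gt||^2."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, gt.ravel(), rcond=None)
    return s * pred + t

gt = np.array([2.0, 4.0, 8.0])
pred = np.array([2.2, 3.8, 8.8])
err = abs_rel(pred, gt)          # relative errors 0.10, 0.05, 0.10
acc = delta_accuracy(pred, gt)   # every ratio is below 1.25
```

In real evaluations the metrics are computed only over valid ground-truth pixels (e.g., where LiDAR returns exist on KITTI) and often after cropping and depth capping.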
Progress in depth estimation has been driven by several benchmark datasets that provide paired RGB images and ground truth depth maps.
NYU Depth V2 was collected by Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, and released in 2012 alongside the paper "Indoor Segmentation and Support Inference from RGBD Images" at ECCV. The dataset was captured using a Microsoft Kinect sensor in 464 different indoor scenes across three cities, spanning bedrooms, kitchens, offices, living rooms, bookstores, cafes, and other environments.
| Property | Detail |
|---|---|
| Labeled pairs | 1,449 densely aligned RGB-depth pairs |
| Raw frames | 407,024 unlabeled RGB-depth frames |
| Resolution | 640 x 480 pixels |
| Depth sensor | Microsoft Kinect (structured light) |
| Depth range | 0.5 to 10 meters |
| Scene type | Indoor |
| Standard split | 249 scenes for training, 215 scenes for testing (654 test images) |
NYU Depth V2 remains the most widely used indoor benchmark for monocular depth estimation.
The KITTI Vision Benchmark Suite was introduced by Andreas Geiger, Philip Lenz, and Raquel Urtasun at CVPR 2012 in the paper "Are We Ready for Autonomous Driving?" The dataset was recorded from a car driving through Karlsruhe, Germany, using a stereo camera rig, a Velodyne HDL-64E LiDAR scanner, and a GPS/IMU system.
| Property | Detail |
|---|---|
| Stereo pairs | 389 stereo and optical flow image pairs |
| Image resolution | Approximately 1242 x 375 pixels (after rectification) |
| Ground truth depth | Velodyne LiDAR (sparse, projected onto image plane) |
| Depth range | 0 to approximately 80 meters |
| Scene type | Outdoor (urban driving) |
| Eigen split | 23,488 training images, 697 test images |
KITTI ground truth is sparse because LiDAR returns occupy only a fraction of the image pixels. Evaluation is performed only at pixels where LiDAR data is available. KITTI remains the primary outdoor driving benchmark for depth estimation.
ScanNet was published by Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner at CVPR 2017. It provides RGB-D video sequences of indoor scenes captured with commodity depth sensors.
| Property | Detail |
|---|---|
| Total views | 2.5 million RGB-D frames |
| Scenes | 1,513 indoor scenes |
| Annotations | 3D camera poses, surface reconstructions, instance-level semantic segmentations |
| Depth sensor | Structure Sensor (structured light) |
| Scene type | Indoor |
ScanNet has been used both for depth estimation evaluation and as a training data source, particularly for methods targeting indoor environments.
| Dataset | Year | Scene type | Key characteristics |
|---|---|---|---|
| Middlebury Stereo | 2002+ | Indoor/controlled | High-quality structured light ground truth for stereo evaluation |
| ETH3D | 2017 | Indoor/outdoor | High-resolution images with multi-view and LiDAR ground truth |
| Cityscapes | 2016 | Urban driving | 5,000 images with stereo pairs and disparity maps |
| DIODE | 2019 | Indoor/outdoor | Laser scanner ground truth with both indoor and outdoor scenes |
| TartanAir | 2020 | Synthetic | Photo-realistic rendered scenes with perfect ground truth |
| Hypersim | 2021 | Synthetic indoor | 77,400 synthetic images with perfect metric depth |
| Virtual KITTI | 2016/2020 | Synthetic driving | Synthetic clone of KITTI driving sequences |
The following table summarizes key monocular depth estimation models, ordered chronologically.
| Model | Year | Venue | Supervision | Depth type | Backbone | Key contribution |
|---|---|---|---|---|---|---|
| Eigen et al. | 2014 | NeurIPS | Supervised | Metric | CNN (custom) | First deep learning approach; coarse-to-fine multi-scale architecture |
| Monodepth | 2017 | CVPR | Self-supervised | Metric | ResNet | Left-right consistency loss; no ground truth depth needed |
| Monodepth2 | 2019 | ICCV | Self-supervised | Metric | ResNet-18 | Auto-masking, minimum reprojection loss, monocular video training |
| BTS | 2019 | arXiv | Supervised | Metric | DenseNet-161 | Local planar guidance layers for sharp boundaries |
| MiDaS | 2020 | TPAMI | Supervised | Relative | ResNeXt-101 | Multi-dataset training for zero-shot cross-dataset transfer |
| AdaBins | 2021 | CVPR | Supervised | Metric | EfficientNet-B5 | Adaptive bin-center prediction for fine-grained depth |
| DPT | 2021 | ICCV | Supervised | Relative | ViT-Large | Vision Transformer backbone for dense prediction |
| ZoeDepth | 2023 | arXiv | Supervised | Metric | BEiT-L (MiDaS) | Combines relative and metric depth; domain-specific heads |
| MiDaS v3.1 | 2023 | arXiv | Supervised | Relative | BEiT/Swin/ViT | Model zoo with multiple transformer backbones |
| Marigold | 2024 | CVPR | Supervised (synthetic) | Relative | Stable Diffusion U-Net | Repurposed diffusion model; synthetic-only training |
| Depth Anything | 2024 | CVPR | Semi-supervised | Relative/Metric | DINOv2-ViT | 62M+ unlabeled images; foundation model approach |
| Depth Anything V2 | 2024 | NeurIPS | Semi-supervised | Relative/Metric | DINOv2-ViT | Synthetic teacher training; models from 25M to 1.3B parameters |
Depth estimation is fundamental to self-driving vehicles. Perception systems need to understand the 3D layout of the road, detect obstacles, measure distances to other vehicles and pedestrians, and plan safe trajectories. While most autonomous vehicle systems use LiDAR as a primary depth sensor, monocular and stereo depth estimation from cameras serves as a complementary or backup modality. Tesla's camera-only approach relies entirely on vision-based depth estimation. Depth maps help with lane detection, free-space estimation, and collision avoidance.
AR applications need to understand scene geometry to place virtual objects realistically in the physical world. Depth estimation enables occlusion handling (virtual objects should be hidden behind real objects that are closer to the camera), surface detection (placing objects on tables, floors, and walls), and lighting estimation. Apple's ARKit and Google's ARCore frameworks use a combination of inertial measurement, visual odometry, and depth estimation to build real-time spatial maps. The LiDAR scanner on iPhone Pro models provides hardware depth sensing that complements software-based estimates.
Depth estimation is a core building block for 3D reconstruction pipelines. Given depth maps from multiple viewpoints along with camera poses, a 3D point cloud or mesh can be constructed through depth fusion (using algorithms such as TSDF fusion). Recent neural 3D reconstruction methods, including Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, benefit from depth priors produced by monocular estimation models. Depth Anything and similar models are increasingly used to provide dense depth supervision that accelerates NeRF training and improves reconstruction quality, especially in regions with limited multi-view coverage.
Robots navigating in unstructured environments need depth information for obstacle avoidance, path planning, and manipulation. While many robots carry dedicated depth sensors (Intel RealSense, stereo cameras), monocular depth estimation provides a lightweight alternative for drones, small robots, or scenarios where weight and power budgets are constrained. Self-supervised depth estimation is particularly relevant for robotics because robots can collect massive amounts of unlabeled video data during operation.
Smartphone portrait mode effects rely on depth estimation to separate the foreground subject from the background and apply synthetic bokeh (background blur). Early implementations (such as on the iPhone 7 Plus in 2016) used dual cameras in a stereo configuration. Modern smartphones with a single camera use deep learning-based monocular depth estimation to achieve similar effects. Depth maps also enable other computational photography features: refocusing after capture, relighting, and 3D photo effects. The LiDAR scanner on iPhone Pro models captures hardware depth that improves portrait mode accuracy, especially in low light.
Video Depth Anything, presented at CVPR 2025, extends the Depth Anything framework to video sequences by enforcing temporal consistency across frames. Temporally consistent depth is important for applications like video editing, visual effects, and augmented reality, where flickering or inconsistent depth maps between frames create visible artifacts.
Despite rapid progress, several challenges remain in depth estimation:
Scale ambiguity in monocular estimation: A single image provides no absolute scale information. While metric depth models attempt to learn scale priors, they often fail on scenes outside their training distribution. The depth estimation community currently lacks a widely accepted benchmarking standard for evaluating cross-domain generalization.
Textureless and reflective surfaces: Stereo matching and multi-view methods struggle with surfaces that lack visual texture (white walls, glass, water). Monocular methods can sometimes infer depth from context, but reflective and transparent surfaces remain difficult for all approaches.
Dynamic scenes: Moving objects violate the static-scene assumption used by many multi-view and self-supervised methods. Handling independently moving objects while estimating camera ego-motion remains an active research area.
Evaluation inconsistencies: Different papers use different evaluation protocols (crop ratios, depth caps, alignment procedures), making direct comparisons difficult. The field is working toward standardized evaluation practices.
Outdoor long-range depth: Estimating accurate metric depth at long range (beyond 80 meters) is challenging because small errors in disparity or predicted depth translate to large absolute errors at distance.