# Depth estimation

> Source: https://aiwiki.ai/wiki/depth_estimation
> Updated: 2026-07-11
> Categories: Computer Vision, Deep Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

Depth estimation is the [computer vision](/wiki/computer_vision) task of predicting how far each surface in a scene is from the camera, producing a dense per-pixel depth map from one or more images. Given an input image of size $$H \times W$$, a depth estimation model outputs a depth map of the same dimensions in which each pixel stores a scalar distance, expressed in meters for metric depth or in arbitrary units for relative depth [1][4][8]. It can be solved from a single image (monocular), from two horizontally offset views (stereo), or from many views (multi-view), with monocular depth being fundamentally ill-posed because infinitely many 3D scenes project to the same 2D image [1][4].

Depth information is essential for applications that require geometric understanding of the world, including [autonomous driving](/wiki/autonomous_driving), [augmented reality](/wiki/augmented_reality), robotics, [3D reconstruction](/wiki/3d_reconstruction), and computational photography. While dedicated depth sensors such as LiDAR, structured light projectors, and time-of-flight cameras can capture depth directly, they add cost, weight, and power consumption to a system. Estimating depth from ordinary RGB cameras is therefore a long-standing research goal in computer vision, and since 2014 it has been driven almost entirely by [deep learning](/wiki/deep_learning) [1].

## What is depth estimation?

Formally, depth estimation seeks a mapping f that takes an image I (or a set of images) and produces a depth map D, where $$D(u, v)$$ represents the distance from the camera to the scene surface visible at pixel $$(u, v)$$. The problem can be approached in several ways depending on the number of input views.

### What is monocular depth estimation?

Monocular depth estimation predicts depth from a single RGB image. This is inherently an ill-posed problem because infinitely many 3D scenes can project onto the same 2D image. Humans resolve this ambiguity using learned priors: perspective cues, texture gradients, occlusion relationships, known object sizes, and atmospheric haze. Early computational approaches struggled with this task, but [deep learning](/wiki/deep_learning) models trained on large datasets have learned to exploit these same cues effectively [1][8].

### What is stereo depth estimation?

Stereo depth estimation uses a pair of images captured from two horizontally offset cameras (a stereo rig) to compute depth through triangulation. The core idea comes from binocular [stereo vision](/wiki/stereo_vision): by identifying corresponding points in the left and right images and measuring their horizontal displacement (disparity), the depth at each pixel can be calculated using the formula:

$$
z = \frac{f B}{d}
$$

where z is depth, f is the focal length in pixels, B is the baseline (the distance between the two camera centers), and d is the disparity in pixels, so z is expressed in the same units as B. Larger disparities correspond to closer objects, and smaller disparities correspond to objects farther away.
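As a minimal sketch of this formula, the following function converts a disparity map to depth. The function name, the `min_disparity` guard, and the rig values (720 px focal length, 0.54 m baseline) are illustrative assumptions, not parameters of any particular dataset or library.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline_m, min_disparity=1e-6):
    """Convert disparity (pixels) to depth (meters) using z = f * B / d."""
    disparity = np.asarray(disparity, dtype=np.float64)
    # Pixels with (near-)zero disparity are treated as infinitely far away.
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_px * baseline_m / disparity[valid]
    return depth

# Hypothetical rig: focal length 720 px, baseline 0.54 m.
disp = np.array([[72.0, 36.0], [7.2, 0.0]])
depth = disparity_to_depth(disp, focal_px=720.0, baseline_m=0.54)
```

Halving the disparity doubles the depth, which is why depth resolution degrades quickly for distant objects.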

### What is multi-view depth estimation?

Multi-view depth estimation extends the stereo concept to more than two images. By observing a scene from multiple viewpoints, depth can be inferred with higher accuracy and completeness. Techniques such as [Structure from Motion](/wiki/structure_from_motion) (SfM) and Multi-View Stereo (MVS) fall into this category.

## How did traditional depth estimation work?

### Stereo matching

Stereo matching algorithms find pixel correspondences between rectified stereo image pairs and produce a disparity map, which can be converted to depth. The process typically follows four steps: cost computation, cost aggregation, disparity optimization, and disparity refinement.

**Local methods** evaluate a small neighborhood around each pixel and select the disparity that minimizes a matching cost (such as the sum of absolute differences or the census transform). Local methods are fast but sensitive to textureless regions and occlusions.
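A local method can be sketched in a few lines, assuming rectified grayscale images, a sum-of-absolute-differences cost, and a square window. The function name and the `radius` parameter are hypothetical, and the loops favor readability over speed.

```python
import numpy as np

def sad_block_matching(left, right, max_disp, radius=1):
    """Winner-takes-all disparity using a windowed SAD matching cost."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    h, w = left.shape
    cost = np.full((max_disp + 1, h, w), np.inf)
    for d in range(min(max_disp, w - 1) + 1):
        # Left pixel (y, x) is compared with right pixel (y, x - d).
        diff = np.full((h, w), np.inf)  # columns x < d have no match
        diff[:, d:] = np.abs(left[:, d:] - right[:, : w - d])
        # Aggregate the per-pixel cost over a (2r+1) x (2r+1) window.
        padded = np.pad(diff, radius, mode="edge")
        agg = np.zeros((h, w))
        for dy in range(2 * radius + 1):
            for dx in range(2 * radius + 1):
                agg += padded[dy : dy + h, dx : dx + w]
        cost[d] = agg
    # Select, per pixel, the disparity with the lowest aggregated cost.
    return np.argmin(cost, axis=0)
```

On a textureless region every candidate disparity produces almost the same cost, so the `argmin` becomes arbitrary, which is the weakness noted above.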

**Global methods** formulate stereo matching as an energy minimization problem over the entire image. They define a cost function that balances data fidelity (how well pixel intensities match) with smoothness (the assumption that neighboring pixels likely have similar depths). Graph cuts and belief propagation are classic optimization techniques for global stereo matching.

**Semi-Global Matching (SGM)**, introduced by Heiko Hirschmuller in 2005, strikes a balance between local and global methods [11]. SGM approximates a global 2D smoothness constraint by aggregating matching costs along multiple (typically 8 or 16) one-dimensional paths through the image. This yields near-global-quality results at a fraction of the computational cost. SGM has been widely adopted in real-time stereo applications, including robotics, advanced driver assistance systems, and satellite photogrammetry, because of its favorable accuracy-to-speed tradeoff and its suitability for parallel hardware implementations on FPGAs and GPUs.
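The path aggregation can be illustrated for a single direction. This is a sketch of the commonly stated SGM recurrence with a small penalty `p1` for disparity changes of one pixel and a larger penalty `p2` for bigger jumps; the penalty values and the function name are assumptions chosen for illustration. A full implementation would sum the aggregated costs of 8 or 16 such paths before taking the per-pixel minimum.

```python
import numpy as np

def aggregate_left_to_right(cost, p1=1.0, p2=8.0):
    """Aggregate an (H, W, D) matching cost volume along the left-to-right path."""
    cost = np.asarray(cost, dtype=np.float64)
    h, w, _ = cost.shape
    agg = np.empty_like(cost)
    agg[:, 0] = cost[:, 0]
    for x in range(1, w):
        prev = agg[:, x - 1]                          # (H, D)
        prev_min = prev.min(axis=1, keepdims=True)    # (H, 1)
        padded = np.pad(prev, ((0, 0), (1, 1)), constant_values=np.inf)
        from_minus = padded[:, :-2] + p1              # previous cost at d - 1
        from_plus = padded[:, 2:] + p1                # previous cost at d + 1
        from_jump = prev_min + p2                     # any larger disparity change
        best = np.minimum(np.minimum(prev, from_jump),
                          np.minimum(from_minus, from_plus))
        # Subtracting prev_min keeps the values from growing along the path.
        agg[:, x] = cost[:, x] + best - prev_min
    return agg
```

A pixel whose own matching cost is ambiguous inherits the preference of its neighbors along the path, which is how the smoothness constraint is approximated without a full 2D optimization.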

### Structure from Motion

[Structure from Motion](/wiki/structure_from_motion) (SfM) recovers both 3D scene structure and camera poses from a collection of 2D images taken from different viewpoints. The classical SfM pipeline consists of several stages:

1. **Feature detection and description:** Local features (such as SIFT, SURF, or ORB keypoints) are detected in each image.
2. **Feature matching:** Correspondences between features in different images are established.
3. **Geometric verification:** Incorrect matches are filtered using geometric constraints such as the epipolar constraint and RANSAC.
4. **Camera pose estimation:** The relative positions and orientations of cameras are computed from verified correspondences.
5. **Triangulation:** 3D points are computed by intersecting viewing rays from two or more calibrated cameras.
6. **Bundle adjustment:** Camera parameters and 3D point positions are jointly refined by minimizing reprojection error.

SfM can be performed incrementally (adding one camera at a time) or globally (solving for all cameras simultaneously). While SfM primarily produces sparse 3D point clouds, it provides the camera parameters needed for dense reconstruction methods like Multi-View Stereo.
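The triangulation step can be sketched with the linear (DLT) method, assuming known 3x4 projection matrices. The function names and camera values in the usage lines are hypothetical; production pipelines typically refine this linear estimate, for example during bundle adjustment.

```python
import numpy as np

def triangulate_point(proj_mats, pixels):
    """Linear (DLT) triangulation of one 3D point from two or more views."""
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        # Each observation contributes two linear constraints on the point.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector
    # associated with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with a 3x4 projection matrix to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical two-camera setup sharing one intrinsic matrix K.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])
X_true = np.array([0.3, -0.1, 4.0])
X_est = triangulate_point([P1, P2], [project(P1, X_true), project(P2, X_true)])
```

The distance between `project(P, X_est)` and the observed pixel is the reprojection error that bundle adjustment minimizes over all cameras and points.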

## How does deep learning estimate depth from one image?

The application of [deep learning](/wiki/deep_learning) to monocular depth estimation has transformed the field. Neural networks can learn complex priors about scene geometry from large training datasets, enabling accurate depth prediction from a single image. The progression has moved from per-dataset convolutional models (2014-2019) to multi-dataset and transformer models (2020-2021), to large foundation models trained on tens of millions of images (2024).

### Eigen et al. (2014): the pioneering work

David Eigen, Christian Puhrsch, and Rob Fergus published "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network" at [NeurIPS](/wiki/neurips) 2014 (then called NIPS) [1]. This paper is widely regarded as the foundational work for deep learning-based monocular depth estimation. The architecture uses two stacked [convolutional neural networks](/wiki/convolutional_neural_network): a coarse-scale network that, in the authors' words, "makes a coarse global prediction based on the entire image," and a fine-scale network "that refines this prediction locally" [1]. The authors also introduced a scale-invariant loss function that focuses on relative depth relationships rather than absolute scale. On the NYU Depth V2 benchmark, the model achieved an AbsRel error of 0.215 and a delta-1 accuracy of 0.611, which represented a major improvement over prior non-learning methods [1]. While these numbers are far below current standards, the paper demonstrated that a neural network could learn meaningful depth priors from data alone.
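The scale-invariant idea can be illustrated with a short sketch. It follows the commonly cited form of the error, computed on log-depth differences, with a weight `lam` on the mean term: `lam = 1` gives a fully scale-invariant error, and the paper's training loss is usually described as using `lam = 0.5`. The function name and signature are assumptions.

```python
import numpy as np

def scale_invariant_log_error(pred, target, lam=1.0):
    """Mean squared log difference minus lam times the squared mean log difference."""
    g = np.log(np.asarray(pred, dtype=np.float64)) - np.log(
        np.asarray(target, dtype=np.float64))
    return np.mean(g ** 2) - lam * np.mean(g) ** 2

rng = np.random.default_rng(0)
target = rng.uniform(1.0, 10.0, size=100)
pred = 3.0 * target  # correct up to an unknown global scale factor

plain = scale_invariant_log_error(pred, target, lam=0.0)      # penalizes the scale error
invariant = scale_invariant_log_error(pred, target, lam=1.0)  # ignores the scale error
```

With `lam = 1`, a prediction that is wrong only by a constant factor incurs no penalty, so the error reflects the relative depth structure of the scene.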

### MiDaS (Ranftl et al., 2020)

MiDaS (Monocular Depth estimation via a Single image) was developed by Rene Ranftl and colleagues at Intel Labs [4]. The key insight behind MiDaS was that training on a diverse mixture of datasets produces models with strong zero-shot generalization. As the paper states, "The success of monocular depth estimation relies on large and diverse training sets," yet "acquiring dense ground-truth depth across different environments at scale" is difficult, so the authors "develop tools that enable mixing multiple datasets during training, even if their annotations are incompatible" [4]. Prior monocular depth models trained on a single dataset (such as NYU or KITTI) often failed when applied to images from different domains.

The original MiDaS paper, "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer," was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) in 2020 [4]. The authors trained on a combination of datasets including ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, and NYU Depth V2 (up to 12 datasets in later versions) using multi-objective optimization. Because these datasets use different depth representations and scales, MiDaS predicts relative inverse depth rather than absolute metric depth [4].

MiDaS v3.1, released in 2023, expanded the model zoo to include backbones based on BEiT, Swin, SwinV2, Next-ViT, and LeViT transformers, in addition to the original ViT backbone [7]. The BEiT-based models achieved the highest depth estimation quality, while smaller backbones like LeViT enabled efficient inference for real-time applications [7].

### DPT: Dense Prediction Transformers (Ranftl et al., 2021)

DPT (Dense Prediction Transformers) was introduced by Ranftl, Bochkovskiy, and Koltun at ICCV 2021 in the paper "Vision Transformers for Dense Prediction" [5]. DPT replaced the convolutional backbone of MiDaS with a [Vision Transformer](/wiki/vision_transformer) (ViT) encoder. The architecture works by dividing the input image into non-overlapping patches, projecting them into token embeddings, and processing them through standard [transformer](/wiki/transformer) encoder layers. Tokens from multiple transformer stages are then reassembled into image-like representations at multiple resolutions and fused through a convolutional decoder to produce dense depth predictions.
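The patch tokenization and the reassembly into an image-like feature map can be sketched as follows. This only illustrates the data layout: it omits position embeddings, the readout token, the transformer layers, and the convolutional decoder, and the random projection matrix stands in for a learned embedding. All names are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * c), (gh, gw)

def reassemble(tokens, grid):
    """Reshape (N, D) tokens back into a (D, gh, gw) image-like feature map."""
    gh, gw = grid
    return tokens.reshape(gh, gw, -1).transpose(2, 0, 1)

rng = np.random.default_rng(0)
image = rng.random((32, 48, 3))
patches, grid = patchify(image, patch=16)         # 6 patches of 768 values each
embedding = rng.standard_normal((16 * 16 * 3, 8)) # stand-in for a learned projection
tokens = patches @ embedding                      # (6, 8) token embeddings
features = reassemble(tokens, grid)               # (8, 2, 3) feature map
```

Because each token keeps its grid position, tokens taken from any transformer stage can be laid out spatially again and passed to a convolutional decoder.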

The key advantage of DPT over convolutional architectures is the transformer's global receptive field at every stage, which allows it to capture long-range spatial relationships. This produces depth maps with finer details and more globally coherent structure. DPT improved monocular depth estimation by over 28% compared to the best convolutional approaches at the time and set new performance records on both NYU Depth V2 and KITTI benchmarks [5].

Three variants were released: DPT-Base (ViT-B), DPT-Large (ViT-L), and DPT-Hybrid (using a [ResNet](/wiki/resnet)-50 feature extractor combined with transformer layers) [5].

### ZoeDepth (Bhat et al., 2023)

ZoeDepth, published as "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth" (arXiv:2302.12288), addresses a fundamental limitation of models like MiDaS: they predict only relative depth and cannot provide measurements in real-world units [6]. ZoeDepth bridges the gap between relative and metric depth estimation through a two-stage approach.

The model first pretrains on 12 datasets using the MiDaS framework to learn robust relative depth representations with strong generalization. It then fine-tunes on specific target domains (such as NYU Depth V2 for indoor scenes or KITTI for outdoor scenes) using a novel metric bins module appended to the decoder [6]. This module predicts domain-specific depth bin centers and combines them with the relative depth features to produce metric depth output.

The flagship model, ZoeD-M12-NK, was the first to jointly train on multiple datasets (NYU Depth V2 and KITTI) without significant performance degradation [6]. During inference, a latent classifier automatically routes each input image to the appropriate domain-specific head. ZoeDepth achieved unprecedented zero-shot generalization performance across eight unseen datasets spanning both indoor and outdoor domains [6].

### Depth Anything (Yang et al., 2024)

Depth Anything, published at CVPR 2024, represents a shift toward building foundation models for monocular depth estimation [8]. The paper, "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data," was authored by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao from the University of Hong Kong and TikTok (ByteDance) [8]. The authors describe it as "a highly practical solution for robust monocular depth estimation," the core of which is to "scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M)" so as to "significantly enlarge the data coverage and thus reduce the generalization error" [8].

The core idea is a semi-supervised training pipeline that leverages massive amounts of unlabeled data. The process works as follows:

1. A teacher model is trained on 1.5 million labeled images from six public datasets, using a [DINOv2](/wiki/dinov2) encoder as the backbone [8][17].
2. The teacher generates pseudo depth labels for approximately 62 million unlabeled images collected from eight large-scale datasets [8].
3. A student model is then trained on the combined labeled and pseudo-labeled data, with data augmentation designed to create challenging optimization targets [8].
4. An auxiliary self-supervised loss preserves semantic knowledge from the pretrained DINOv2 encoder [8][17].

In total, the released models were trained on 1.5 million labeled images and roughly 62 million unlabeled images jointly [8]. The massive scale of training data significantly reduced generalization error, allowing Depth Anything to produce robust relative depth estimates across diverse scenes, from indoor rooms to outdoor landscapes, even in challenging conditions. The models are available in three sizes based on the DINOv2-ViT encoder.

| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything-S | ViT-S | 24.8M | 0.053 | 0.972 | 0.080 | 0.936 |
| Depth Anything-B | ViT-B | 97.5M | 0.046 | 0.979 | 0.080 | 0.939 |
| Depth Anything-L | ViT-L | 335.3M | 0.043 | 0.981 | 0.076 | 0.947 |

Notably, Depth Anything did not use NYU Depth V2 or KITTI data during training, so the numbers above are zero-shot relative depth results, computed after aligning each prediction to the ground truth [8]. The model also demonstrated strong results when used as a depth prior for downstream tasks, achieving 86.2 mIoU on Cityscapes semantic segmentation and 59.4 mIoU on ADE20K when the encoder was fine-tuned [8].

### Depth Anything V2 (Yang et al., 2024)

Depth Anything V2, published at NeurIPS 2024, significantly improved upon its predecessor [9]. The authors summarize the upgrade plainly: the model "produces much finer and more robust depth predictions through three key practices: 1) replacing all labeled real images with synthetic images, 2) scaling up the capacity of our teacher model, and 3) teaching student models via the bridge of large-scale pseudo-labeled real images" [9]. Concretely:

1. **Synthetic training data for the teacher:** Instead of training the teacher on labeled real images, V2 trained the teacher exclusively on approximately 595,000 images from five precise synthetic datasets [9]. Synthetic data offers pixel-perfect ground truth depth without the noise and artifacts found in sensor-captured depth maps (from LiDAR, Kinect, etc.).
2. **Larger teacher model:** The teacher was scaled up to a DINOv2-Giant backbone (approximately 1.3 billion parameters) to produce higher-quality pseudo labels [9].
3. **Student-teacher distillation at scale:** Student models of various sizes were trained solely on more than 62 million pseudo-labeled real images, bridging the synthetic-to-real domain gap through the teacher's pseudo labels [9].

This approach produced depth maps with finer details and greater robustness. The authors also introduced DA-2K, a curated evaluation set of 2,000 annotated pixel pairs across 1,000 high-resolution images spanning indoor, outdoor, transparent, adverse-style, aerial, underwater, and object-centric scenes [9]. On DA-2K, the V2-Large model reached 97.1% accuracy, ahead of diffusion-based competitors such as Marigold [9]. With metric depth fine-tuning on the NYU Depth V2 and KITTI benchmarks, the V2 models reach the results listed below.

| Model variant | Encoder | Parameters | NYU AbsRel | NYU delta-1 | KITTI AbsRel | KITTI delta-1 |
|---|---|---|---|---|---|---|
| Depth Anything V2-S | ViT-S | 24.8M | 0.073 | 0.961 | 0.053 | 0.973 |
| Depth Anything V2-B | ViT-B | 97.5M | 0.063 | 0.977 | 0.048 | 0.979 |
| Depth Anything V2-L | ViT-L | 335.3M | 0.056 | 0.984 | 0.045 | 0.983 |
| Depth Anything V2-G | ViT-G | 1.3B | N/A | N/A | N/A | N/A |

Compared to diffusion-based depth estimation models like Marigold, the authors report that Depth Anything V2 is "more than 10x faster" while achieving comparable or superior accuracy [9].

### Marigold (Ke et al., 2024)

Marigold, presented as an Oral paper and Best Paper Award Candidate at CVPR 2024, took a different approach by repurposing a pretrained [diffusion model](/wiki/diffusion_model) for depth estimation [10]. The method fine-tunes the [Stable Diffusion](/wiki/stable_diffusion) [U-Net](/wiki/unet) on synthetic depth data by encoding both the RGB image and its corresponding depth map into the latent space using the original VAE encoder, concatenating the two latent codes, and optimizing the standard diffusion denoising objective [10].

Despite being trained exclusively on synthetic data, Marigold achieves strong zero-shot transfer to real-world images [10]. The model leverages the rich visual priors learned by Stable Diffusion during its large-scale pretraining on billions of images. Marigold produces affine-invariant (relative) depth predictions with high detail quality. A faster variant, Marigold-LCM, uses latent consistency distillation to reduce the number of required denoising steps.

## What is the difference between metric depth and relative depth?

A critical distinction in depth estimation is between metric depth and relative depth.

**Relative depth** captures the ordinal relationships between points in a scene: which objects are closer and which are farther. The depth values are in arbitrary units that can vary across images. Relative depth models (such as MiDaS and the base Depth Anything models) generalize well across diverse scenes because they do not need to learn scene-specific scale and shift [4][8]. However, relative depth alone is insufficient for applications requiring precise measurements, such as navigation or 3D reconstruction with correct dimensions.

**Metric depth** provides depth values in real-world units, typically meters. Metric depth estimation is harder because the absolute scale of a scene is ambiguous from a single image. A photograph of a small model room and a photograph of a full-sized room can look nearly identical. Metric depth models (such as ZoeDepth and the fine-tuned Depth Anything variants) resolve this ambiguity by learning domain-specific scale priors during training [6][8]. The tradeoff is that metric models often generalize less well across domains. A model fine-tuned on indoor scenes with depths of 0 to 10 meters may struggle with outdoor scenes that span 0 to 80 meters.

Recent work such as ZoeDepth and UniDepth attempts to bridge this gap by combining the generalization strength of relative depth models with domain-specific metric heads [6].
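The relationship between the two representations can be made concrete: a relative prediction differs from the metric quantity by an unknown per-image scale and shift, which can be recovered by least squares when a few metric reference values are available. The sketch below assumes the fit is done in whatever space the model predicts (inverse depth in the case of MiDaS); the function name and the `mask` argument are hypothetical.

```python
import numpy as np

def fit_scale_shift(pred, target, mask=None):
    """Least-squares scale s and shift t such that s * pred + t approximates target."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    target = np.asarray(target, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(pred.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()
    # Solve min over (s, t) of || [pred, 1] @ [s, t] - target ||^2 on valid pixels.
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target[mask], rcond=None)
    return s, t

rng = np.random.default_rng(0)
relative = rng.random((4, 5))      # arbitrary-unit prediction
metric = 2.5 * relative + 0.7      # synthetic reference values
s, t = fit_scale_shift(relative, metric)
aligned = s * relative + t
```

Two reference points are enough to determine both unknowns in principle; using more makes the fit robust to noise in the reference.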

## How does self-supervised depth estimation avoid ground-truth labels?

### Monodepth (Godard et al., 2017)

A major limitation of supervised depth estimation is the need for ground truth depth data, which requires expensive sensors (LiDAR, structured light) and careful calibration. Self-supervised approaches sidestep this requirement by using photometric consistency as the training signal.

Clement Godard, Oisin Mac Aodha, and Gabriel Brostow published "Unsupervised Monocular Depth Estimation with Left-Right Consistency" at CVPR 2017 [2]. The key idea is to train a network to predict depth from a single image by using stereo image pairs during training only. The predicted depth map is used to warp one image of the stereo pair to reconstruct the other, and the photometric reconstruction error serves as the loss function. Godard et al. introduced a left-right consistency constraint that enforces agreement between the disparity maps predicted from the left and right images, reducing artifacts around occlusion boundaries [2].
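The reconstruction signal can be sketched for rectified grayscale images. The left view is rebuilt by sampling the right image at `x - d(x, y)` with linear interpolation, and the mean absolute difference is used as the error. This is a simplification under stated assumptions: the sketch uses an L1 term only, does not implement the left-right consistency constraint, and all names are hypothetical.

```python
import numpy as np

def warp_right_to_left(right, disp_left):
    """Reconstruct the left view by sampling the right image at x - disparity."""
    h, w = right.shape
    xs = np.arange(w)[None, :] - disp_left  # sampling positions in the right image
    xs = np.clip(xs, 0, w - 1)
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    # Linear interpolation between the two nearest columns.
    return (1 - frac) * right[rows, x0] + frac * right[rows, x1]

def photometric_l1(left, right, disp_left):
    """Mean absolute difference between the left image and its reconstruction."""
    return np.mean(np.abs(left - warp_right_to_left(right, disp_left)))
```

In a training loop the disparity would come from the network, and the error would be minimized with respect to the network weights, so no ground truth depth is involved.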

This approach achieved results on the KITTI benchmark that were competitive with, and in some metrics surpassed, fully supervised methods of the time [2].

### Monodepth2 (Godard et al., 2019)

Monodepth2, published at ICCV 2019 as "Digging Into Self-Supervised Monocular Depth Estimation," extended the self-supervised framework in several important ways [3]. Rather than requiring stereo pairs, Monodepth2 can also train using monocular video sequences. It jointly predicts depth and ego-motion (the camera's movement between frames) and uses photometric reprojection loss between adjacent frames.

Three key contributions improved over the original Monodepth [3]:

1. **Minimum reprojection loss:** Instead of averaging the reprojection error across source views, the model takes the per-pixel minimum, which handles occlusions more gracefully.
2. **Auto-masking:** A binary mask filters out pixels whose photometric error under the identity mapping (no warping) is lower than the reprojection error. This removes pixels that do not change appearance between frames, such as those captured while the camera is stationary or those on objects moving at the same velocity as the camera.
3. **Multi-scale estimation:** Depth is predicted and supervised at multiple image resolutions to reduce texture-copying artifacts.

Monodepth2 demonstrated through empirical evidence that systematic redesign of loss functions could surpass performance gains from more complex architectures, simplifying the pipeline while improving results [3].
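The first two contributions can be sketched together, assuming the source frames have already been warped into the target view and using a plain absolute difference as the photometric error. The function name is hypothetical, and averaging over the unmasked pixels is a simplification chosen for this sketch.

```python
import numpy as np

def min_reprojection_with_automask(target, warped_sources, raw_sources):
    """Per-pixel minimum reprojection error with an identity-based auto-mask.

    target: (H, W) target frame.
    warped_sources: source frames warped into the target view.
    raw_sources: the same source frames without warping.
    """
    reproj = np.stack([np.abs(target - w) for w in warped_sources])
    identity = np.stack([np.abs(target - s) for s in raw_sources])
    # Minimum over source views instead of the average: a pixel occluded
    # in one source can still be explained by another.
    min_reproj = reproj.min(axis=0)
    min_identity = identity.min(axis=0)
    # Keep only pixels where warping explains the target better than no warping.
    mask = min_reproj < min_identity
    loss = (min_reproj * mask).sum() / max(int(mask.sum()), 1)
    return loss, mask
```

Pixels that look the same in the unwarped source and the target carry no usable motion signal, so the mask drops them instead of letting them push the predicted depth toward infinity.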

## How is depth estimation accuracy measured?

Depth estimation performance is measured using a standard set of error and accuracy metrics. Let $$d_i^*$$ be the predicted depth and $$d_i$$ the ground truth depth at pixel i, taken from the predicted depth map D* and the ground truth depth map D, with N valid pixels. The following metrics are commonly used [1][4].

### Error metrics (lower is better)

| Metric | Name | Formula | Description |
|---|---|---|---|
| AbsRel | Absolute Relative Error | $$\frac{1}{N} \sum \frac{\lvert d_i - d_i^* \rvert}{d_i}$$ | Average percentage difference between predicted and ground truth depth |
| SqRel | Squared Relative Error | $$\frac{1}{N} \sum \frac{(d_i - d_i^*)^2}{d_i}$$ | Penalizes large errors more heavily than AbsRel |
| RMSE | Root Mean Square Error | $$\sqrt{\frac{1}{N} \sum (d_i - d_i^*)^2}$$ | Square root of the mean squared error, in absolute depth units; dominated by large errors |
| RMSE log | Log RMSE | $$\sqrt{\frac{1}{N} \sum (\log(d_i) - \log(d_i^*))^2}$$ | RMSE computed in log-space, reducing sensitivity to absolute scale |
| SILog | Scale-Invariant Log Error | $$\sqrt{\frac{1}{N} \sum g_i^2 - \left(\frac{1}{N} \sum g_i\right)^2}$$ with $$g_i = \log(d_i) - \log(d_i^*)$$ | Log-space error that is unchanged when the prediction is multiplied by a global scale factor |

### Accuracy metrics (higher is better)

| Metric | Name | Formula | Description |
|---|---|---|---|
| delta-1 | Threshold accuracy at 1.25 | % of pixels where $$\max(d_i/d_i^*, d_i^*/d_i) < 1.25$$ | Proportion of predictions within a factor of 1.25 of ground truth |
| delta-2 | Threshold accuracy at $$1.25^2$$ | % of pixels where $$\max(d_i/d_i^*, d_i^*/d_i) < 1.5625$$ | Proportion within a factor of ~1.56 of ground truth |
| delta-3 | Threshold accuracy at $$1.25^3$$ | % of pixels where $$\max(d_i/d_i^*, d_i^*/d_i) < 1.953$$ | Proportion within a factor of ~1.95 of ground truth |

AbsRel and delta-1 are the two most commonly reported metrics in recent papers [8][9]. For relative depth models, predictions and ground truth are typically aligned in scale and shift for each image before computing error metrics [4].
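A minimal sketch of how these metrics are typically computed is shown below. It covers AbsRel, SqRel, RMSE, RMSE log, and the three threshold accuracies. The validity range (`min_depth`, `max_depth`) and the function name are assumptions; benchmarks differ in their depth caps and cropping conventions.

```python
import numpy as np

def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0):
    """Compute common depth metrics over pixels with valid ground truth."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Evaluate only where ground truth exists (e.g. sparse LiDAR returns).
    valid = (gt > min_depth) & (gt < max_depth)
    p = np.clip(pred[valid], min_depth, max_depth)
    g = gt[valid]
    ratio = np.maximum(p / g, g / p)
    log_diff = np.log(g) - np.log(p)
    return {
        "abs_rel": float(np.mean(np.abs(g - p) / g)),
        "sq_rel": float(np.mean((g - p) ** 2 / g)),
        "rmse": float(np.sqrt(np.mean((g - p) ** 2))),
        "rmse_log": float(np.sqrt(np.mean(log_diff ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }

# Toy example: the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 2.0, 5.2, 3.0])
metrics = depth_metrics(pred, gt)
```

For a relative depth model, the prediction would first be aligned to the ground truth in scale and shift, and the aligned result passed in as `pred`.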

## What datasets are used for depth estimation?

Progress in depth estimation has been driven by several benchmark datasets that provide paired RGB images and ground truth depth maps.

### NYU Depth V2

NYU Depth V2 was collected by Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus, and released in 2012 alongside the paper "Indoor Segmentation and Support Inference from RGBD Images" at ECCV [12]. The dataset was captured using a Microsoft Kinect sensor in 464 different indoor scenes across three cities, spanning bedrooms, kitchens, offices, living rooms, bookstores, cafes, and other environments.

| Property | Detail |
|---|---|
| Labeled pairs | 1,449 densely aligned RGB-depth pairs |
| Raw frames | 407,024 unlabeled RGB-depth frames |
| Resolution | 640 x 480 pixels |
| Depth sensor | Microsoft Kinect (structured light) |
| Depth range | 0.5 to 10 meters |
| Scene type | Indoor |
| Standard split | 249 scenes for training, 215 scenes for testing (654 test images) |

NYU Depth V2 remains the most widely used indoor benchmark for monocular depth estimation [12].

### KITTI

The KITTI Vision Benchmark Suite was introduced by Andreas Geiger, Philip Lenz, and Raquel Urtasun at CVPR 2012 in the paper "Are We Ready for Autonomous Driving?" [13]. The dataset was recorded from a car driving through Karlsruhe, Germany, using a stereo camera rig, a Velodyne HDL-64E LiDAR scanner, and a GPS/IMU system.

| Property | Detail |
|---|---|
| Stereo pairs | 389 stereo and optical flow image pairs |
| Image resolution | Approximately 1242 x 375 pixels (after rectification) |
| Ground truth depth | Velodyne LiDAR (sparse, projected onto image plane) |
| Depth range | 0 to approximately 80 meters |
| Scene type | Outdoor (urban driving) |
| Eigen split | 23,488 training images, 697 test images |

KITTI ground truth is sparse because LiDAR returns occupy only a fraction of the image pixels. Evaluation is performed only at pixels where LiDAR data is available. KITTI remains the primary outdoor driving benchmark for depth estimation [13].

### ScanNet

ScanNet was published by Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Niessner at CVPR 2017 [14]. It provides RGB-D video sequences of indoor scenes captured with commodity depth sensors.

| Property | Detail |
|---|---|
| Total views | 2.5 million RGB-D frames |
| Scans | 1,513 scans of indoor scenes |
| Annotations | 3D camera poses, surface reconstructions, instance-level semantic segmentations |
| Depth sensor | Structure Sensor (structured light) |
| Scene type | Indoor |

ScanNet has been used both for depth estimation evaluation and as a training data source, particularly for methods targeting indoor environments [14].

### Other notable datasets

| Dataset | Year | Scene type | Key characteristics |
|---|---|---|---|
| Middlebury Stereo | 2002+ | Indoor/controlled | High-quality structured light ground truth for stereo evaluation |
| ETH3D | 2017 | Indoor/outdoor | High-resolution images with multi-view and LiDAR ground truth |
| Cityscapes | 2016 | Urban driving | 5,000 images with stereo pairs and disparity maps |
| DIODE | 2019 | Indoor/outdoor | Laser scanner ground truth with both indoor and outdoor scenes |
| TartanAir | 2020 | Synthetic | Photo-realistic rendered scenes with perfect ground truth |
| Hypersim | 2021 | Synthetic indoor | 77,400 synthetic images with perfect metric depth |
| Virtual KITTI | 2016/2020 | Synthetic driving | Synthetic clone of KITTI driving sequences |

## How do the major depth estimation models compare?

The following table summarizes key monocular depth estimation models, ordered chronologically.

| Model | Year | Venue | Supervision | Depth type | Backbone | Key contribution |
|---|---|---|---|---|---|---|
| Eigen et al. | 2014 | NeurIPS | Supervised | Metric | CNN (custom) | First deep learning approach; coarse-to-fine multi-scale architecture |
| Monodepth | 2017 | CVPR | Self-supervised | Metric | ResNet | Left-right consistency loss; no ground truth depth needed |
| Monodepth2 | 2019 | ICCV | Self-supervised | Metric | ResNet-18 | Auto-masking, minimum reprojection loss, monocular video training |
| BTS | 2019 | arXiv | Supervised | Metric | DenseNet-161 | Local planar guidance layers for sharp boundaries |
| MiDaS | 2020 | TPAMI | Supervised | Relative | ResNeXt-101 | Multi-dataset training for zero-shot cross-dataset transfer |
| AdaBins | 2021 | CVPR | Supervised | Metric | EfficientNet-B5 | Adaptive bin-center prediction for fine-grained depth |
| DPT | 2021 | ICCV | Supervised | Relative | ViT-Large | Vision Transformer backbone for dense prediction |
| ZoeDepth | 2023 | arXiv | Supervised | Metric | BEiT-L (MiDaS) | Combines relative and metric depth; domain-specific heads |
| MiDaS v3.1 | 2023 | arXiv | Supervised | Relative | BEiT/Swin/ViT | Model zoo with multiple transformer backbones |
| Marigold | 2024 | CVPR | Supervised (synthetic) | Relative | Stable Diffusion U-Net | Repurposed diffusion model; synthetic-only training |
| Depth Anything | 2024 | CVPR | Semi-supervised | Relative/Metric | DINOv2-ViT | 62M+ unlabeled images; foundation model approach |
| Depth Anything V2 | 2024 | NeurIPS | Semi-supervised | Relative/Metric | DINOv2-ViT | Synthetic teacher training; models from 25M to 1.3B parameters |

## What is depth estimation used for?

### Autonomous driving

Depth estimation is fundamental to self-driving vehicles. Perception systems need to understand the 3D layout of the road, detect obstacles, measure distances to other vehicles and pedestrians, and plan safe trajectories. While most autonomous vehicle systems use LiDAR as a primary depth sensor, monocular and stereo depth estimation from cameras serves as a complementary or backup modality. Tesla's camera-only approach relies entirely on vision-based depth estimation. Depth maps help with lane detection, free-space estimation, and collision avoidance.

### Augmented reality

AR applications need to understand scene geometry to place virtual objects realistically in the physical world. Depth estimation enables occlusion handling (virtual objects should be hidden behind real objects that are closer to the camera), surface detection (placing objects on tables, floors, and walls), and lighting estimation. Apple's ARKit and Google's ARCore frameworks use a combination of inertial measurement, visual odometry, and depth estimation to build real-time spatial maps. The LiDAR scanner on iPhone Pro models provides hardware depth sensing that complements software-based estimates.

### 3D reconstruction

Depth estimation is a core building block for [3D reconstruction](/wiki/3d_reconstruction) pipelines. Given depth maps from multiple viewpoints along with camera poses, a 3D point cloud or mesh can be constructed through depth fusion (using algorithms such as TSDF fusion). Recent neural 3D reconstruction methods, including [Neural Radiance Fields](/wiki/nerf) ([NeRF](/wiki/nerf)) and [3D Gaussian Splatting](/wiki/gaussian_splatting), benefit from depth priors produced by monocular estimation models. Depth Anything and similar models are increasingly used to provide dense depth supervision that accelerates NeRF training and improves reconstruction quality, especially in regions with limited multi-view coverage [8].
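The first step of such a pipeline, turning a depth map into 3D points, can be sketched with the pinhole camera model. The sketch assumes that depth is measured along the optical axis (z-depth) and that the intrinsics `fx`, `fy`, `cx`, `cy` are known; the function name is hypothetical. Fusing the points from many views additionally requires transforming them with each camera pose.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project an (H, W) z-depth map to an (N, 3) point cloud in camera coordinates."""
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column and row indices
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    # Drop pixels without a valid (positive) depth value.
    return points[points[:, 2] > 0]

# Toy 2x2 depth map with one missing value, using made-up intrinsics.
depth = np.array([[2.0, 2.0], [2.0, 0.0]])
cloud = backproject(depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5)
```

With relative rather than metric depth, the resulting cloud has the right shape only up to the unknown scale and shift, which is one reason metric or aligned depth is preferred for reconstruction.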

### Robotics

Robots navigating in unstructured environments need depth information for obstacle avoidance, path planning, and manipulation. While many robots carry dedicated depth sensors (Intel RealSense, stereo cameras), monocular depth estimation provides a lightweight alternative for drones, small robots, or scenarios where weight and power budgets are constrained. Self-supervised depth estimation is particularly relevant for robotics because robots can collect massive amounts of unlabeled video data during operation [3].

### Computational photography

Smartphone portrait mode effects rely on depth estimation to separate the foreground subject from the background and apply synthetic bokeh (background blur). Early implementations (such as on the iPhone 7 Plus in 2016) used dual cameras in a stereo configuration. Modern smartphones with a single camera use deep learning-based monocular depth estimation to achieve similar effects. Depth maps also enable other computational photography features: refocusing after capture, relighting, and 3D photo effects. The LiDAR scanner on iPhone Pro models captures hardware depth that improves portrait mode accuracy, especially in low light.
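
The idea behind depth-driven background blur can be illustrated with a toy example. The sketch below blends a sharp image with a blurred copy, weighting the blur by each pixel's distance from a chosen focal plane. It is a simplification for illustration only: the function names, the box blur, and the `depth_range` parameter are assumptions, and production portrait modes use considerably more elaborate rendering.

```python
import numpy as np

def box_blur(image, radius):
    """Box blur of an (H, W, C) float image, using edge padding at the borders."""
    if radius == 0:
        return image.copy()
    k = 2 * radius + 1
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    # Sum all k*k shifted copies, then divide to get the window average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def synthetic_bokeh(image, depth, focus_depth, max_radius=4, depth_range=1.0):
    """Blend sharp and blurred copies based on distance from the focal plane."""
    image = image.astype(np.float64)
    blurred = box_blur(image, max_radius)
    # Weight is 0 at the focal plane and saturates at 1 once a pixel is
    # depth_range (same units as depth) or more away from it.
    weight = np.clip(np.abs(depth - focus_depth) / depth_range, 0.0, 1.0)[..., None]
    return (1.0 - weight) * image + weight * blurred

# Toy usage: subject on the left at 1 m, background on the right at 5 m.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
depth = np.where(np.arange(8)[None, :] < 4, 1.0, 5.0) * np.ones((8, 1))
result = synthetic_bokeh(image, depth, focus_depth=1.0)
```

Pixels at the focal depth are left untouched, while pixels far from it take on the blurred values, so errors in the depth map show up directly as misplaced blur around the subject.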

### Video depth and temporal consistency

Video Depth Anything, presented at CVPR 2025, extends the Depth Anything framework to video sequences by enforcing temporal consistency across frames [18]. Temporally consistent depth is important for applications like video editing, visual effects, and augmented reality, where flickering or inconsistent depth maps between frames create visible artifacts.

## What are the open problems in depth estimation?

Despite rapid progress, several challenges remain in depth estimation:

**Scale ambiguity in monocular estimation:** A single image provides no absolute scale information. While metric depth models attempt to learn scale priors, they often fail on scenes outside their training distribution [6]. The depth estimation community currently lacks a widely accepted benchmarking standard for evaluating cross-domain generalization [18].

**Textureless and reflective surfaces:** Stereo matching and multi-view methods struggle with surfaces that lack visual texture (white walls, glass, water). Monocular methods can sometimes infer depth from context, but reflective and transparent surfaces remain difficult for all approaches.

**Dynamic scenes:** Moving objects violate the static-scene assumption used by many multi-view and self-supervised methods. Handling independently moving objects while estimating camera ego-motion remains an active research area [3].

**Evaluation inconsistencies:** Different papers use different evaluation protocols (crop ratios, depth caps, alignment procedures), making direct comparisons difficult. The field is working toward standardized evaluation practices [18].
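
To make the effect of such protocol choices concrete, the following sketch computes two common metrics (absolute relative error and the δ < 1.25 accuracy) with a configurable depth cap and an optional least-squares scale-and-shift alignment. The function names and defaults are illustrative assumptions. The alignment here is done directly in depth space for simplicity; some protocols align in inverse-depth space instead, which is exactly the kind of difference that makes numbers hard to compare.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Fit scale s and shift t minimizing ||s * pred + t - gt||^2 over mask."""
    p = pred[mask]
    g = gt[mask]
    design = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(design, g, rcond=None)
    return s * pred + t

def depth_metrics(pred, gt, max_depth=80.0, align=False):
    """AbsRel and delta < 1.25 over pixels with valid ground truth."""
    # The depth cap decides which ground-truth pixels count at all.
    mask = (gt > 0) & (gt <= max_depth)
    if align:
        pred = align_scale_shift(pred, gt, mask)
    p = np.clip(pred[mask], 1e-3, max_depth)
    g = gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)
    return {"abs_rel": float(abs_rel), "delta1": float(delta1)}

# Toy usage: a prediction that is correct up to an unknown scale and shift.
gt = np.linspace(1.0, 50.0, 100).reshape(10, 10)
pred = 0.5 * gt + 2.0
raw = depth_metrics(pred, gt, align=False)
aligned = depth_metrics(pred, gt, align=True)
```

In this toy case the aligned prediction scores essentially perfectly while the raw one does not, which shows why results reported with and without alignment should not be compared directly.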

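**Outdoor long-range depth:** Estimating accurate metric depth at long range (beyond 80 meters) is challenging because small errors in disparity or predicted depth translate to large absolute errors at distance. In stereo, for example, depth is inversely proportional to disparity, so a fixed disparity error produces a depth error that grows with the square of the distance.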
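
The long-range problem can be made concrete for a rectified stereo pair, where depth follows $$Z = fB/d$$ for focal length $$f$$ (in pixels), baseline $$B$$, and disparity $$d$$. Differentiating gives the first-order error $$\delta Z \approx \frac{Z^2}{fB}\,\delta d$$. The sketch below evaluates this for illustrative values (a 721-pixel focal length, a 0.54 m baseline, and a quarter-pixel matching error), which are assumptions chosen to resemble a typical driving stereo rig.

```python
def stereo_depth_error(depth_m, focal_px, baseline_m, disparity_error_px):
    """First-order depth error (meters) caused by a disparity error.

    Uses dZ ~= Z^2 / (f * B) * dd, obtained by differentiating Z = f * B / d.
    """
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px

# Illustrative rig parameters (assumed, not taken from a specific dataset spec).
FOCAL_PX = 721.0
BASELINE_M = 0.54
DISPARITY_ERROR_PX = 0.25

for depth_m in (10.0, 20.0, 40.0, 80.0):
    err = stereo_depth_error(depth_m, FOCAL_PX, BASELINE_M, DISPARITY_ERROR_PX)
    print(f"{depth_m:5.1f} m -> about {err:.2f} m of depth error")
```

With these assumed numbers the same quarter-pixel error costs roughly 6 cm at 10 m but about 4 m at 80 m, a 64-fold increase for an 8-fold increase in distance.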

## ELI5: depth estimation in simple terms

When you close one eye and look at a room, you can still tell which things are near and which are far, because your brain has learned that closer things look bigger, objects in front cover up objects behind, and far-away things look hazier. Depth estimation teaches a computer to do the same thing: you give it a flat photo, and it colors in how far away every spot is, like a map where bright means close and dark means far. With two cameras it is easier, because the computer can compare the two slightly different pictures (the way your two eyes do) and measure how much each thing shifts between them. With only one camera it has to guess from experience, which is why modern systems study tens of millions of pictures first so their guesses get very good.

## See also

- [Computer vision](/wiki/computer_vision)
- [Stereo vision](/wiki/stereo_vision)
- [3D reconstruction](/wiki/3d_reconstruction)
- [Structure from Motion](/wiki/structure_from_motion)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [Vision Transformer](/wiki/vision_transformer)
- [Neural Radiance Fields](/wiki/nerf)
- [3D Gaussian Splatting](/wiki/gaussian_splatting)
- [Diffusion model](/wiki/diffusion_model)
- [DINOv2](/wiki/dinov2)

## References

1. Eigen, D., Puhrsch, C., & Fergus, R. (2014). "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network." *Advances in Neural Information Processing Systems (NeurIPS)*, 27. arXiv:1406.2283.
2. Godard, C., Mac Aodha, O., & Brostow, G.J. (2017). "Unsupervised Monocular Depth Estimation with Left-Right Consistency." *Proceedings of CVPR*, 270-279.
3. Godard, C., Mac Aodha, O., Firman, M., & Brostow, G.J. (2019). "Digging Into Self-Supervised Monocular Depth Estimation." *Proceedings of ICCV*, 3828-3838.
4. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., & Koltun, V. (2020). "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer." *IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)*, 44(3), 1623-1637. arXiv:1907.01341.
5. Ranftl, R., Bochkovskiy, A., & Koltun, V. (2021). "Vision Transformers for Dense Prediction." *Proceedings of ICCV*, 12179-12188.
6. Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., & Müller, M. (2023). "ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth." *arXiv:2302.12288*.
7. Birkl, R., Wofk, D., & Müller, M. (2023). "MiDaS v3.1: A Model Zoo for Robust Monocular Relative Depth Estimation." *arXiv:2307.14460*.
8. Yang, L., Kang, B., Huang, Z., Xu, X., Feng, J., & Zhao, H. (2024). "Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data." *Proceedings of CVPR*. arXiv:2401.10891.
9. Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., & Zhao, H. (2024). "Depth Anything V2." *Proceedings of NeurIPS*. arXiv:2406.09414.
10. Ke, B., Obukhov, A., Huang, S., Metzger, N., Daudt, R.C., & Schindler, K. (2024). "Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation" (Marigold). *Proceedings of CVPR*. arXiv:2312.02145.
11. Hirschmüller, H. (2005). "Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information." *Proceedings of CVPR*, 807-814.
12. Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). "Indoor Segmentation and Support Inference from RGBD Images." *Proceedings of ECCV*, 746-760.
13. Geiger, A., Lenz, P., & Urtasun, R. (2012). "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite." *Proceedings of CVPR*.
14. Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., & Nießner, M. (2017). "ScanNet: Richly-annotated 3D Reconstructions of Indoor Scenes." *Proceedings of CVPR*.
15. Lee, J.H., Han, M.K., Ko, D.W., & Suh, I.H. (2019). "From Big to Small: Multi-Scale Local Planar Guidance for Monocular Depth Estimation" (BTS). *arXiv:1907.10326*.
16. Bhat, S.F., Alhashim, I., & Wonka, P. (2021). "AdaBins: Depth Estimation Using Adaptive Bins." *Proceedings of CVPR*, 4009-4018.
17. Oquab, M. et al. (2023). "DINOv2: Learning Robust Visual Features without Supervision." *arXiv:2304.07193*.
18. Spencer, J. et al. (2024). "The Third Monocular Depth Estimation Challenge." *Proceedings of CVPR Workshops*.