# Depth

> Source: https://aiwiki.ai/wiki/depth
> Updated: 2026-06-27
> Categories: Machine Learning
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

In [machine learning](/wiki/machine_learning), depth is the number of sequential processing stages a model applies between its input and its output. In a [neural network](/wiki/neural_network) it means the number of [layers](/wiki/layer) the input passes through, and the word is the origin of the term [deep learning](/wiki/deep_learning): a model with three or more hidden layers is usually called deep, while a model with one hidden layer is called shallow. In a [decision tree](/wiki/decision_tree) it means the length of the longest path from the root node to any leaf. Both senses share one idea: depth is how many sequential decisions a model is allowed to make before committing to an answer.

## Introduction

Depth is a loaded word in machine learning. In a neural network, depth means the number of layers the input passes through on its way to becoming an output. In a decision tree, depth means the length of the longest path from the root node to any leaf. The two definitions live in different parts of the field, but they share a common idea: depth is the number of sequential decisions a model is allowed to make before committing to an answer.

The term carries weight because the phrase deep learning is built directly on it. When Geoffrey Hinton and his collaborators published "A Fast Learning Algorithm for Deep Belief Nets" in Neural Computation in 2006, the "deep" in the name referred to the architectural depth of the stack, and the paper helped reignite interest in training many-layer networks. [1] That naming choice stuck. Today a model with three or more hidden layers is usually called deep, and a model with one hidden layer is usually called shallow.

This article covers both senses: depth in neural networks (how it is counted, why it matters, famous examples like [AlexNet](/wiki/alexnet), [VGG](/wiki/vgg), [ResNet](/wiki/resnet), and [GPT-3](/wiki/gpt-3)) and depth in decision trees and tree ensembles (including the `max_depth` hyperparameter and pruning). It is distinct from [depth estimation](/wiki/depth_estimation), the computer vision task of predicting per-pixel distance from a camera, which is covered on its own page.

## What is the depth of a neural network?

A neural network is a stack of layers. Data enters at the [input layer](/wiki/input_layer), flows through a sequence of hidden layers, and exits at the [output layer](/wiki/output_layer). Each layer applies a transformation: a linear projection followed by a nonlinear [activation function](/wiki/activation_function). The depth of the network is the number of these transformations the data has to go through.
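The stack-of-transformations view can be sketched in a few lines of plain Python. The weights, the helper names (`linear`, `relu`, `forward`), and the choice to skip the activation on the final layer are illustrative assumptions, not part of any particular library.

```python
def relu(values):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, v) for v in values]


def linear(weights, biases, values):
    """Linear projection: one output per row of `weights`."""
    return [
        sum(w * v for w, v in zip(row, values)) + b
        for row, b in zip(weights, biases)
    ]


def forward(layers, values):
    """Apply each (weights, biases) layer in order.

    In this sketch every layer except the last is followed by ReLU.
    """
    for index, (weights, biases) in enumerate(layers):
        values = linear(weights, biases, values)
        if index < len(layers) - 1:
            values = relu(values)
    return values


# Toy network: 2 inputs -> 2 hidden units -> 1 output (made-up weights).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                    # output layer
]

depth = len(layers)  # number of transformations applied to the input
output = forward(layers, [2.0, 1.0])
print("depth:", depth, "output:", output)
```

The input list itself is not an entry in `layers`, which mirrors the usual habit of leaving the input layer out of the count.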

The input layer is usually not counted, because it does not actually transform anything. It just holds the input values. So a network described as having one input layer, three hidden layers, and one output layer is most often called a network of depth four, counting only the layers that do work.

Conventions vary. Some authors count only hidden layers; others count hidden plus output layers; others count weight layers (those with trainable parameters), excluding pooling and activation layers. Published deep learning papers usually count weight layers.
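To see how much the conventions can disagree, here is a small sketch that counts the depth of one hypothetical convolutional network three ways. The layer list, the `Layer` tuple, and the function names are invented for illustration; activations are treated as part of the layer they follow.

```python
from collections import namedtuple

# role: "input", "hidden", or "output"; trainable: has learnable weights.
Layer = namedtuple("Layer", ["name", "role", "trainable"])

# Hypothetical small CNN, listed from input to output.
NETWORK = [
    Layer("input", "input", False),
    Layer("conv1", "hidden", True),
    Layer("pool1", "hidden", False),
    Layer("conv2", "hidden", True),
    Layer("pool2", "hidden", False),
    Layer("dense1", "hidden", True),
    Layer("dense2", "output", True),
]


def count_hidden_only(layers):
    return sum(1 for layer in layers if layer.role == "hidden")


def count_hidden_plus_output(layers):
    return sum(1 for layer in layers if layer.role in ("hidden", "output"))


def count_weight_layers(layers):
    return sum(1 for layer in layers if layer.trainable)


counts = {
    "hidden_only": count_hidden_only(NETWORK),
    "hidden_plus_output": count_hidden_plus_output(NETWORK),
    "weight_layers": count_weight_layers(NETWORK),
}
print(counts)
```

The same network comes out as depth 5, 6, or 4 depending on the convention, which is why it helps to state the counting rule alongside any depth figure.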

### How is depth measured? The credit assignment path

A more formal version of depth is the credit assignment path (CAP). The CAP is the chain of transformations from input to output. For a feedforward network, the CAP length is the number of hidden layers plus one. For a [recurrent neural network](/wiki/recurrent_neural_network), the CAP can be much longer because it includes every unrolled time step, so an RNN run on a 1,000-step sequence has an effective depth of more than 1,000 even if the cell itself has only a few layers. [2]

Most researchers agree that deep learning involves a CAP depth greater than two, but there is no hard cutoff. In the 1980s a three-layer network was considered ambitious. By the late 2010s a 100-layer network was routine.
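The CAP arithmetic can be written out as a rough sketch. The feedforward rule comes straight from the definition above; the recurrent rule (time steps times layers per step, plus one read-out transformation) is a simplifying assumption made for this example, and both function names are hypothetical.

```python
def cap_length_feedforward(num_hidden_layers):
    """CAP length of a feedforward network: hidden layers plus the output layer."""
    return num_hidden_layers + 1


def cap_length_unrolled_rnn(num_time_steps, layers_per_step=1):
    """Rough CAP length of an RNN unrolled over a sequence.

    Assumes every time step contributes `layers_per_step` transformations
    and that one final read-out transformation produces the output.
    """
    return num_time_steps * layers_per_step + 1


print(cap_length_feedforward(3))                          # three hidden layers
print(cap_length_unrolled_rnn(1000))                      # single-layer cell, 1,000 steps
print(cap_length_unrolled_rnn(1000, layers_per_step=2))   # two-layer cell
```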

### Why does depth matter?

The argument for depth is hierarchical feature learning. Early layers of a [convolutional neural network](/wiki/convolutional_neural_network) trained on images tend to detect simple things like edges and color blobs. Middle layers combine those into textures and shapes. Late layers combine those into object parts and whole objects. Each layer builds on the abstractions of the one below. You cannot get that hierarchy out of a single layer, no matter how wide it is.

A classical result called the universal approximation theorem says that even a network with one hidden layer can approximate any reasonable function, given enough neurons. George Cybenko proved this for sigmoidal activations in 1989, and Kurt Hornik, Maxwell Stinchcombe, and Halbert White established a closely related result the same year for multilayer feedforward networks. [3][4] So in principle, you do not need depth at all. In practice, you do: the theorem says nothing about how many neurons are required, and for some function classes a shallow network needs exponentially more neurons than a deep one to reach the same accuracy. Results of this kind are called depth efficiency or depth separation results.

The theory behind this is concrete. Matus Telgarsky's 2016 paper "Benefits of Depth in Neural Networks" showed that there exist functions a deep ReLU network can represent with a number of layers on the order of k cubed and a constant number of nodes per layer, yet any network with depth on the order of k would need an exponential number of nodes, roughly 2 to the power k, to approximate them. [5] In other words, for some function classes, depth buys an exponential reduction in width.
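The flavor of this result can be illustrated with a sawtooth function, in the spirit of the construction used in depth separation proofs. The sketch below is a simplified illustration and not the theorem itself: it composes a two-unit ReLU "tent" layer `k` times and counts the linear pieces of the result on [0, 1]. The function names are hypothetical, and the grid of `2**k + 1` points is chosen so that it lines up with the breakpoints.

```python
def relu(x):
    return max(0.0, x)


def tent(x):
    """Tent map on [0, 1] built from two ReLU units."""
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)


def deep_sawtooth(x, k):
    """Compose the tent layer k times: a depth-k network, two units per layer."""
    for _ in range(k):
        x = tent(x)
    return x


def count_linear_pieces(k):
    """Count monotone linear pieces of the depth-k sawtooth on [0, 1]."""
    n = 2 ** k
    ys = [deep_sawtooth(i / n, k) for i in range(n + 1)]
    slopes = [b - a for a, b in zip(ys, ys[1:])]
    direction_changes = sum(1 for s0, s1 in zip(slopes, slopes[1:]) if s0 * s1 < 0)
    return direction_changes + 1


for k in (1, 2, 3, 5, 10):
    print(k, "layers ->", count_linear_pieces(k), "linear pieces")
```

Each extra layer doubles the number of pieces, so `k` layers of two units give `2**k` pieces. A one-hidden-layer ReLU network on a scalar input with `n` units has at most `n + 1` linear pieces, so matching this function exactly with a single hidden layer would take on the order of `2**k` units.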

### What does depth cost?

More layers mean more compute, more memory, and harder optimization. The most famous problem is the [vanishing gradient](/wiki/vanishing_gradient) problem. During [backpropagation](/wiki/backpropagation), the [gradient](/wiki/gradient) signal is repeatedly multiplied by each layer's weights and by the local derivatives of its activation. If those factors are consistently smaller than one, the product shrinks exponentially with depth, and the early layers receive almost no useful update. The opposite problem, exploding gradients, occurs when the factors are consistently larger than one.
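A scalar toy model makes the exponential behavior visible. The sketch below assumes a chain of identical sigmoid layers, each with a single weight and a pre-activation of zero (where the sigmoid derivative takes its maximum value of 0.25); the function names and default values are illustrative choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(depth, pre_activation=0.0, weight=1.0):
    """Product of per-layer factors (weight * activation derivative).

    This is the factor by which a gradient is scaled after passing
    backward through `depth` identical scalar layers.
    """
    scale = 1.0
    for _ in range(depth):
        scale *= weight * sigmoid_grad(pre_activation)
    return scale


for depth in (1, 10, 50):
    print(depth, "layers, weight 1:", gradient_scale(depth))

# With a large weight the per-layer factor exceeds one and the product explodes.
print("10 layers, weight 8:", gradient_scale(10, weight=8.0))
```

With a weight of 1 each layer multiplies the gradient by 0.25, so ten layers already shrink it by a factor of about a million. With a weight of 8 each layer multiplies it by 2, and the same ten layers amplify it by 1,024.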

The field has accumulated a toolkit for handling these issues. ReLU activations replaced sigmoid and tanh for most uses because they do not saturate on the positive side. [Batch normalization](/wiki/batch_normalization) and layer normalization keep activation distributions stable across layers. Residual or skip connections, introduced in ResNet, let the gradient flow around blocks instead of always passing through them. [6] Weight initialization schemes such as He and Xavier initialization keep variances stable as you go deeper.
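The effect of a skip connection on the gradient can be sketched with the same kind of scalar chain. For a plain layer `y = f(x)` the backward factor is `f'(x)`; for a residual layer `y = x + f(x)` it is `1 + f'(x)`. The local derivative of 0.01 and the function names below are made up for illustration.

```python
def plain_chain_gradient(local_grads):
    """Backward factor through plain layers y = f(x): product of f'(x)."""
    scale = 1.0
    for grad in local_grads:
        scale *= grad
    return scale


def residual_chain_gradient(local_grads):
    """Backward factor through residual layers y = x + f(x): product of 1 + f'(x)."""
    scale = 1.0
    for grad in local_grads:
        scale *= 1.0 + grad
    return scale


# Twenty layers whose learned branch has a tiny local derivative.
local_grads = [0.01] * 20

plain = plain_chain_gradient(local_grads)
residual = residual_chain_gradient(local_grads)
print("plain:", plain)
print("residual:", residual)
```

The identity path contributes the constant 1 to every factor, so the product stays close to one even when the learned branch passes almost no gradient.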

A second cost is overfitting. A very deep network has many parameters and can memorize the training set. This is countered with regularization, dropout, data augmentation, and larger datasets.

### What are some famous network depths?

The trajectory of depth in image classification shows how the field has scaled.

| Network | Year | Depth (weight layers) | Notes |
|---|---|---|---|
| LeNet-5 | 1998 | 7 | Yann LeCun's handwritten digit network. |
| AlexNet | 2012 | 8 | 5 convolutional plus 3 fully connected. Won ImageNet 2012 with a 15.3% top-5 error. [7] |
| VGG-16 | 2014 | 16 | 13 convolutional plus 3 fully connected. [8] |
| VGG-19 | 2014 | 19 | 16 convolutional plus 3 fully connected. [8] |
| GoogLeNet (Inception v1) | 2014 | 22 | Used inception modules to keep parameters down. |
| ResNet-50 | 2015 | 50 | First ResNet variant with bottleneck blocks. [6] |
| ResNet-101 | 2015 | 101 | [6] |
| ResNet-152 | 2015 | 152 | Deepest ImageNet model in the paper; an ensemble of residual nets won ImageNet 2015 with 3.57% top-5 error. [6] |
| DenseNet-201 | 2017 | 201 | Within a dense block, each layer connects to all subsequent layers. [9] |

ResNet was the breakthrough that made very deep networks practical. The authors observed that stacking more layers actually made networks perform worse on training data, not just on test data. They called this the degradation problem, noting that "with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly," and that "such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error." [6] Their fix was to let each block learn a residual function on top of an identity shortcut. The skip connection lets the gradient take a shorter path back, which addresses both vanishing gradients and the optimization difficulty. As the paper puts it, they evaluated "residual nets with a depth of up to 152 layers, 8x deeper than VGG nets but still having lower complexity," and "an ensemble of these residual nets achieves 3.57% error on the ImageNet test set," which won first place in the ILSVRC 2015 classification task. [6]

Language models pushed depth in a different direction. The original transformer paper from 2017 used 6 encoder and 6 decoder layers. By GPT-3 in 2020, the largest variant had 96 transformer decoder layers, 96 attention heads per layer, and 12,288-dimensional embeddings across its 175 billion parameters. [10] PaLM, released in 2022, had 118 layers and 48 attention heads in its 540-billion-parameter version. [11] Modern frontier models have not always grown deeper in lockstep with parameter count; some have grown wider instead.
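As a rough check on how depth and width combine into parameter count, the sketch below applies a common rule of thumb from the scaling-law literature: non-embedding parameters are about 12 × layers × width squared. The rule is an assumption brought in for this example, not something stated in the GPT-3 figures above. It assumes four width-by-width attention projections and a feed-forward block whose hidden width is four times the model width, and it ignores embeddings, biases, and normalization parameters. The function name is hypothetical.

```python
def approx_transformer_params(n_layers, d_model):
    """Rule-of-thumb non-embedding parameter count for a standard transformer."""
    attention = 4 * d_model * d_model     # query, key, value, and output projections
    feed_forward = 8 * d_model * d_model  # two matrices with hidden width 4 * d_model
    return n_layers * (attention + feed_forward)


gpt3_estimate = approx_transformer_params(n_layers=96, d_model=12288)
print(f"GPT-3 estimate: {gpt3_estimate / 1e9:.1f}B parameters")

# Doubling depth doubles the estimate; doubling width quadruples it.
base = approx_transformer_params(24, 1024)
print(approx_transformer_params(48, 1024) / base)
print(approx_transformer_params(24, 2048) / base)
```

The estimate for 96 layers at width 12,288 lands at roughly 174 billion, close to the reported 175 billion. Under this rule, parameters grow linearly with depth but quadratically with width.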

### How does depth differ from width?

At a fixed compute or parameter budget, designers face a tradeoff: add more layers, or add more neurons per layer? Both are forms of capacity.

Depth tends to win for hierarchical tasks. Theoretical results on ReLU networks show that, for certain function classes, the number of units a shallow network needs grows exponentially with the depth of a deep network that represents the same function, so the deep network can be exponentially smaller. [5] Empirically, the deep learning revolution has been driven mostly by depth.

Width has its own advantages. Wider layers can process more features in parallel, and wide networks are often easier to train, which is commonly attributed to a smoother loss landscape. In practice, the best architectures balance the two, often pairing moderate depth with substantial width and many residual or attention connections to keep gradients well behaved.
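To make the fixed-budget tradeoff concrete, here is a small sketch that counts the parameters of a fully connected network from its layer widths and compares a deep, narrow design with a shallow, wide one of similar size. The two configurations and the function name are made up for illustration.

```python
def mlp_param_count(widths):
    """Weights plus biases of a fully connected network.

    `widths` lists the layer sizes from input to output,
    for example [100, 64, 10] for 100 -> 64 -> 10.
    """
    return sum(
        fan_in * fan_out + fan_out
        for fan_in, fan_out in zip(widths, widths[1:])
    )


deep_narrow = [100, 64, 64, 64, 64, 10]  # 5 weight layers
shallow_wide = [100, 178, 10]            # 2 weight layers

for name, widths in (("deep_narrow", deep_narrow), ("shallow_wide", shallow_wide)):
    print(name, "depth:", len(widths) - 1, "parameters:", mlp_param_count(widths))
```

Both designs use a little under 20,000 parameters, so the choice between them is about how that capacity is arranged, not how much of it there is.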

## What is the depth of a decision tree?

A decision tree is a model that makes predictions by following a sequence of yes-or-no questions about the input features. The tree starts at a root node, each internal node splits the data based on one feature, and each leaf node holds a prediction. The depth of a decision tree is the length of the longest path from the root to any leaf, measured in edges. By that convention, a tree whose root is also a leaf has depth zero.

In a balanced binary tree, depth is roughly the base-2 logarithm of the number of leaves. In a heavily skewed tree, depth can grow as large as the number of leaves minus one.
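The edge-counting definition and the balanced-versus-skewed contrast can be checked with a short sketch. The nested-dictionary tree format and the function names are assumptions made for this example: an internal node is a dict with `"left"` and `"right"` children, and anything else is a leaf.

```python
def tree_depth(node):
    """Longest root-to-leaf path, measured in edges."""
    if not isinstance(node, dict):
        return 0  # a leaf contributes no further edges
    return 1 + max(tree_depth(node["left"]), tree_depth(node["right"]))


def min_depth_for_leaves(num_leaves):
    """Smallest depth of a binary tree with this many leaves (balanced case)."""
    return (num_leaves - 1).bit_length()  # exact ceil(log2(num_leaves))


def max_depth_for_leaves(num_leaves):
    """Largest depth of a binary tree with this many leaves (fully skewed case)."""
    return num_leaves - 1


# Two hypothetical trees with four leaves each.
balanced = {
    "left": {"left": "A", "right": "B"},
    "right": {"left": "C", "right": "D"},
}
skewed = {
    "left": "A",
    "right": {"left": "B", "right": {"left": "C", "right": "D"}},
}

print("balanced depth:", tree_depth(balanced))
print("skewed depth:", tree_depth(skewed))
print("single leaf depth:", tree_depth("A"))
```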

### How does depth cause overfitting in a tree?

Depth controls how much capacity a tree has. A shallow tree can make only a few splits, so it underfits anything except very simple problems. A deep tree can make a unique split for nearly every training example, which lets it memorize the training set and generalize poorly. The relationship is direct: deeper trees have lower bias and higher variance. The right depth depends on the size and complexity of the dataset, and it is usually chosen by cross-validation.

In scikit-learn, the `max_depth` parameter of `DecisionTreeClassifier` and `DecisionTreeRegressor` is the standard way to limit depth. Its default value is `None`, which grows the tree until all leaves are pure or contain fewer than `min_samples_split` samples. [12] The documentation advises: "Use `max_depth=3` as an initial tree depth to get a feel for how the tree is fitting to your data, and then increase the depth." [13] It also warns that "the number of samples required to populate the tree doubles for each additional level the tree grows to," which is why unconstrained trees overfit so quickly on smaller datasets. [13]
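A minimal sketch of this workflow, assuming scikit-learn is installed and using its bundled iris dataset as a stand-in for real data: fit one tree with `max_depth=3` and one with the default, compare their depths, then pick a depth by 5-fold cross-validation. The candidate depths and variable names are arbitrary choices for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A capped tree and an unconstrained tree on the same data.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
full_tree = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X, y)

print("depth with max_depth=3:", shallow_tree.get_depth())
print("depth with max_depth=None:", full_tree.get_depth())

# Mean 5-fold cross-validated accuracy for a handful of candidate depths.
cv_scores = {
    depth: cross_val_score(
        DecisionTreeClassifier(max_depth=depth, random_state=0), X, y, cv=5
    ).mean()
    for depth in (1, 2, 3, 4, 5, None)
}
best_depth = max(cv_scores, key=cv_scores.get)
print("best depth by cross-validation:", best_depth)
```

On a dataset this small the differences between depths are modest; the point is the pattern of starting shallow and letting validation scores decide how far to grow.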

### What is pre-pruning by depth?

Limiting tree depth before or during training is a form of pre-pruning, also called early stopping. The idea is to halt growth based on a heuristic before the tree becomes too complex. Common pre-pruning parameters include:

| Parameter | Effect |
|---|---|
| `max_depth` | Hard cap on tree depth. |
| `min_samples_split` | Minimum samples required to split an internal node. |
| `min_samples_leaf` | Minimum samples required in a leaf. |
| `max_leaf_nodes` | Hard cap on the total number of leaves. |
| `min_impurity_decrease` | A split must improve impurity by at least this much. |

Pre-pruning is fast and easy to apply, but it can stop too early. A split that looks bad at the moment of evaluation might have led to two excellent splits one level below. This is sometimes called the horizon effect: the algorithm cannot see past the immediate split.
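The parameters in the table can be combined on a single estimator. The sketch below assumes scikit-learn is installed and uses its bundled breast cancer dataset; the specific values are arbitrary illustrations, not recommended settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Every argument below is a stopping rule applied while the tree grows.
pre_pruned = DecisionTreeClassifier(
    max_depth=4,
    min_samples_split=20,
    min_samples_leaf=5,
    max_leaf_nodes=10,
    min_impurity_decrease=0.001,
    random_state=0,
).fit(X, y)

# Same data with no stopping rules, for comparison.
unpruned = DecisionTreeClassifier(random_state=0).fit(X, y)

print("pre-pruned:", pre_pruned.get_depth(), "levels,", pre_pruned.get_n_leaves(), "leaves")
print("unpruned:  ", unpruned.get_depth(), "levels,", unpruned.get_n_leaves(), "leaves")
```

Whichever limit is hit first stops growth along a branch, so in practice one or two of these parameters usually do most of the work.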

### What is post-pruning by depth?

Post-pruning grows the tree to its full depth first and then trims back branches that do not earn their keep. The most common technique in scikit-learn is minimal cost complexity pruning, parameterized by `ccp_alpha`. [14] The algorithm produces a series of progressively simpler trees by repeatedly removing the subtree whose pruning gives the smallest increase in error per leaf removed. The right value of `ccp_alpha` is usually chosen by cross-validation or by accuracy on a held-out validation set.
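One way to select `ccp_alpha`, sketched under the assumption that scikit-learn is installed: compute the candidate alphas with `cost_complexity_pruning_path`, score each by cross-validation, and refit with the best one. For brevity the path is computed on the full dataset, which a more careful experiment would avoid; clipping the alphas at zero is a defensive step against tiny negative values from floating-point rounding.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Grow the full tree, then ask for the sequence of effective alphas.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
candidate_alphas = sorted({max(0.0, float(alpha)) for alpha in path.ccp_alphas})


def mean_cv_accuracy(alpha):
    """Mean 5-fold accuracy of a tree pruned with this alpha."""
    model = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    return cross_val_score(model, X, y, cv=5).mean()


best_alpha = max(candidate_alphas, key=mean_cv_accuracy)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)

print("candidate alphas:", len(candidate_alphas))
print("best alpha:", best_alpha)
print("nodes before pruning:", full_tree.tree_.node_count)
print("nodes after pruning: ", pruned_tree.tree_.node_count)
```

Larger alphas prune more aggressively, and the largest value in the path collapses the tree to a single node, so the useful range is usually somewhere in the middle.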

Other post-pruning methods include reduced error pruning, where each internal node is tentatively replaced by a leaf labeled with the majority class, and the change is kept only if validation accuracy does not drop.

Post-pruning is more computationally expensive than pre-pruning but tends to produce better trees, because the algorithm sees the fully developed tree before deciding what to cut.

### How does depth work in tree ensembles?

Most serious applications of trees today use ensembles. The two dominant families are bagging methods like [random forest](/wiki/random_forest) and boosting methods like [gradient boosting](/wiki/gradient_boosting), XGBoost, and LightGBM. Depth plays a different role in each.

In a random forest, the individual trees are usually grown deep, often with no depth limit. The variance reduction comes from averaging many decorrelated trees, so overfitting at the individual tree level is acceptable. The defaults in scikit-learn's `RandomForestClassifier` set `max_depth=None`. [15]

In gradient boosting, the individual trees are usually shallow, often called weak learners. Depths of three to eight are typical. XGBoost defaults to `max_depth=6`. [16] The shallowness keeps each tree from explaining too much variance on its own, which is the point of boosting: many small corrections beat one large overfit model.
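The contrast shows up directly in the fitted trees. The sketch below assumes scikit-learn is installed and uses its `GradientBoostingClassifier` as the boosting example in place of XGBoost; scikit-learn's default there is `max_depth=3`, a different default from the XGBoost value quoted above. The dataset and the number of estimators are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
boosted = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

print("forest default max_depth: ", forest.get_params()["max_depth"])
print("boosted default max_depth:", boosted.get_params()["max_depth"])

# Depth of every individual tree inside each ensemble.
forest_depths = [tree.get_depth() for tree in forest.estimators_]
boosted_depths = [tree.get_depth() for tree in boosted.estimators_.ravel()]

print("forest tree depths: ", min(forest_depths), "to", max(forest_depths))
print("boosted tree depths:", min(boosted_depths), "to", max(boosted_depths))
```

The forest's trees grow until their leaves are pure, while the boosted trees stop at the depth cap, matching the deep-versus-shallow pattern described above.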

## Explain like I'm 5

Imagine a stack of pancakes. Each pancake takes the stuff on top, mixes it, and passes it up. A short stack of two pancakes can only do simple mixing. A tall stack of fifty pancakes can do something fancy, like turning a photo of a cat into the word "cat," because each pancake handles a small step. That stack height is depth.

Now imagine twenty questions. The depth of a decision tree is how many questions you can ask before you have to guess. Too few and you cannot narrow it down. Too many and you ask weird questions that only work for the people you have already met.

## References

1. Hinton, G., Osindero, S., Teh, Y., [A Fast Learning Algorithm for Deep Belief Nets](https://direct.mit.edu/neco/article/18/7/1527/7065/A-Fast-Learning-Algorithm-for-Deep-Belief-Nets), Neural Computation 18, 1527-1554, 2006
2. Schmidhuber, J., [Deep Learning in Neural Networks: An Overview](https://arxiv.org/abs/1404.7828), 2015 (credit assignment path)
3. Cybenko, G., Approximation by Superpositions of a Sigmoidal Function, Mathematics of Control, Signals and Systems, 1989
4. Hornik, K., Stinchcombe, M., White, H., Multilayer Feedforward Networks are Universal Approximators, Neural Networks, 1989; see also Wikipedia, [Universal approximation theorem](https://en.wikipedia.org/wiki/Universal_approximation_theorem)
5. Telgarsky, M., [Benefits of Depth in Neural Networks](https://arxiv.org/abs/1602.04485), COLT 2016
6. He, K., Zhang, X., Ren, S., Sun, J., [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385), CVPR 2016 (ResNet)
7. Krizhevsky, A., Sutskever, I., Hinton, G., ImageNet Classification with Deep Convolutional Neural Networks, NeurIPS 2012 (AlexNet); Wikipedia, [AlexNet](https://en.wikipedia.org/wiki/AlexNet)
8. Simonyan, K. and Zisserman, A., [Very Deep Convolutional Networks for Large-Scale Image Recognition](https://arxiv.org/abs/1409.1556), 2014 (VGG)
9. Huang, G., Liu, Z., van der Maaten, L., Weinberger, K., [Densely Connected Convolutional Networks](https://arxiv.org/abs/1608.06993), CVPR 2017 (DenseNet)
10. Brown, T. et al., [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165), NeurIPS 2020 (GPT-3)
11. Chowdhery, A. et al., [PaLM: Scaling Language Modeling with Pathways](https://arxiv.org/abs/2204.02311), 2022
12. scikit-learn documentation, [DecisionTreeClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)
13. scikit-learn documentation, [Decision Trees](https://scikit-learn.org/stable/modules/tree.html) (tips on practical use)
14. scikit-learn documentation, [Post pruning decision trees with cost complexity pruning](https://scikit-learn.org/stable/auto_examples/tree/plot_cost_complexity_pruning.html)
15. scikit-learn documentation, [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
16. XGBoost documentation, [XGBoost Parameters](https://xgboost.readthedocs.io/en/stable/parameter.html)
17. Wikipedia, [Deep learning](https://en.wikipedia.org/wiki/Deep_learning)
18. Wikipedia, [Decision tree pruning](https://en.wikipedia.org/wiki/Decision_tree_pruning)