Hidden Layer
See also: neural network, input layer, output layer, activation function
A hidden layer is a layer of artificial neurons in a neural network that sits between the input layer and the output layer. The term "hidden" refers to the fact that these layers are not directly exposed to the external environment: they do not receive raw data from outside the network (as the input layer does), nor do they produce the final result (as the output layer does). Instead, hidden layers operate internally, transforming inputs into intermediate representations that make it possible for the network to learn complex, nonlinear relationships in data.[1]
A feedforward neural network with one or more hidden layers is called a multilayer perceptron (MLP). Networks with many hidden layers are commonly referred to as deep neural networks, and the practice of training such networks is known as deep learning. The number of hidden layers in a network defines its "depth," while the number of neurons in each hidden layer defines its "width."[2]
Hidden layers are the locus of learned computation in a neural network. The weights and biases inside each hidden layer are adjusted during training (typically by backpropagation and gradient descent), so the hidden layer parameters carry essentially all of the model's learned knowledge. When practitioners refer to the "size" of a model in terms of parameter count, they are largely talking about the parameters held in hidden layers.
The name comes from the perspective of someone observing the network from the outside. During training and inference, a user can see the inputs fed into the network and the outputs it produces. However, the intermediate computations performed by the layers between input and output are not directly visible or interpretable without specialized tools. These internal layers are therefore "hidden" from view.[3]
Put another way, in a supervised learning setup, the training data provides explicit target values for the output layer and explicit feature values for the input layer. No such direct supervision exists for the intermediate layers; the network must figure out on its own what representations to build in these layers. This self-organized nature of the internal representations is another reason the layers are considered hidden.[4]
The term predates the modern deep learning era. It appears in Geoffrey Hinton's 1986 work and earlier in connectionist literature, where "hidden units" referred to processing elements that were neither sensory inputs nor motor outputs. The corresponding mathematical formulation distinguishes "visible variables" (inputs and outputs that are clamped to data values) from "hidden variables" (those whose states must be inferred), a distinction that goes back to Boltzmann machines in the early 1980s.
Each neuron in a hidden layer performs a simple computation. It receives a set of inputs (either from the input layer or from the previous hidden layer), multiplies each input by a corresponding weight, sums the results, adds a bias term, and then passes the sum through an activation function. The output of this computation is then sent forward to the next layer.
Mathematically, for a single neuron in hidden layer l, the output a is computed as:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = f(z)
where x₁, x₂, ..., xₙ are the inputs, w₁, w₂, ..., wₙ are the weights, b is the bias, and f is the activation function (such as ReLU, sigmoid, or tanh).
In matrix form, the entire layer can be written as a = f(Wx + b), where W is the weight matrix of the layer, x is the input vector from the previous layer, b is the bias vector, and f is applied element-wise. This vectorized form is what GPU and TPU accelerators are optimized to compute, since matrix multiplications dominate the runtime cost of neural network training and inference.
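As a concrete illustration, the following minimal NumPy sketch computes one hidden layer's forward pass in the vectorized form a = f(Wx + b). The layer sizes and the choice of ReLU here are arbitrary, for illustration only.

```python
import numpy as np

def relu(z):
    # Element-wise ReLU activation: max(0, z)
    return np.maximum(0.0, z)

def hidden_layer_forward(x, W, b):
    # One hidden layer: a = f(Wx + b), with f applied element-wise
    z = W @ x + b          # weighted sum plus bias for every neuron in the layer
    return relu(z)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # input vector from the previous layer (4 features)
W = rng.normal(size=(8, 4)) * 0.5      # weight matrix: 8 hidden neurons, 4 inputs each
b = np.zeros(8)                        # bias vector, one entry per hidden neuron
a = hidden_layer_forward(x, W, b)      # activations passed forward to the next layer
print(a.shape)                         # (8,)
```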
The activation function is critical because it introduces nonlinearity. Without it, stacking multiple layers would be mathematically equivalent to a single linear transformation, and the network would be no more powerful than a simple linear regression model.[5]
During the forward pass, activations flow from input through each hidden layer to the output. The result is a prediction. During the backward pass, backpropagation computes gradients of the loss function with respect to every parameter in every hidden layer, working from output back toward input by repeated application of the chain rule. An optimizer (often SGD, Adam, or AdamW) uses those gradients to nudge each weight and bias in a direction that should reduce the loss on the current batch.
All learning in a deep network happens through this loop. Hidden layer parameters are updated in step with the loss signal, while inputs are typically held fixed and outputs are matched to labels. Over millions of update steps, the hidden layer parameters settle into configurations that encode useful intermediate features.
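A minimal PyTorch sketch of this forward/backward loop is shown below. The two-hidden-layer architecture, layer sizes, learning rate, and synthetic regression data are illustrative assumptions, not a prescription.

```python
import torch
import torch.nn as nn

# A small MLP with two hidden layers; sizes are arbitrary for illustration.
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),   # first hidden layer
    nn.Linear(64, 64), nn.ReLU(),   # second hidden layer
    nn.Linear(64, 1),               # output layer
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Synthetic data standing in for a real dataset.
x = torch.randn(256, 10)
y = torch.randn(256, 1)

for step in range(100):
    pred = model(x)              # forward pass through every hidden layer
    loss = loss_fn(pred, y)      # compare prediction with labels
    optimizer.zero_grad()
    loss.backward()              # backpropagation: gradients for all hidden-layer parameters
    optimizer.step()             # nudge weights and biases to reduce the loss
```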
One of the most important functions of hidden layers is automatic feature extraction. Rather than relying on hand-engineered features, a neural network with hidden layers can learn to identify the relevant patterns in raw data on its own.
This process is hierarchical. In a network trained on images, for example:
| Layer | What It Learns | Example |
|---|---|---|
| First hidden layer | Low-level features | Edges, corners, color gradients |
| Second hidden layer | Mid-level features | Textures, simple shapes (circles, rectangles) |
| Third hidden layer | High-level features | Object parts (eyes, ears, wheels) |
| Deeper hidden layers | Abstract concepts | Entire objects, scene context |
Each successive hidden layer builds on the representations created by the previous layer, composing simple features into increasingly complex and abstract ones. This hierarchical feature learning is what gives deep neural networks their ability to handle tasks like image recognition, natural language processing, and speech recognition.[6]
Without hidden layers, a neural network can only capture linear relationships between input and output. A single hidden layer with a nonlinear activation function is sufficient, in theory, to approximate any continuous function (see below), but deeper networks tend to learn more efficient representations in practice.
In the language modeling setting, hidden layers play a similar but less geometric role. Early layers in a language transformer tend to capture token-level and surface syntactic features, middle layers carry rich syntactic and semantic information, and later layers specialize on output-related computations such as next-token prediction. Probing studies have shown that part-of-speech tags, parse structures, and even some forms of factual knowledge can be linearly recovered from intermediate hidden states.
The universal approximation theorem provides the theoretical foundation for using hidden layers. The first widely cited version was proved by George Cybenko in 1989 for sigmoid activation functions. Independently, Kurt Hornik, Maxwell Stinchcombe, and Halbert White published a result the same year covering a broader class of "squashing" activations. Hornik strengthened the result in 1991 by showing that the universality property does not depend on a specific choice of activation function; rather, it follows from the multilayer feedforward architecture itself, provided the activation is non-polynomial.[7]
In simple terms, the theorem says that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of ℝⁿ, given a non-polynomial activation function and enough hidden units.
This means that, in principle, even a single hidden layer is enough to represent any function. However, the theorem is an existence result: it guarantees that such a network exists but does not specify how many neurons are needed or how to find the right weights. In practice, the required number of neurons in a single-layer network can grow exponentially with the complexity of the target function, making deeper architectures far more practical.
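The sketch below illustrates the practical reading of the theorem: a network with a single hidden layer fit to a simple continuous target on a compact interval. The choice of sin(x) as the target, the 64-unit width, and the optimizer settings are arbitrary assumptions for demonstration.

```python
import torch
import torch.nn as nn

# One hidden layer with a nonlinear activation: the setting of the
# universal approximation theorem. Width 64 is an arbitrary choice.
net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

x = torch.linspace(-3.14, 3.14, 512).unsqueeze(1)
y = torch.sin(x)  # a continuous target function on a compact interval

for _ in range(2000):
    loss = nn.functional.mse_loss(net(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))  # typically small, i.e. a close fit on this interval
```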
Later research extended the theorem to show that universality can also be achieved by increasing depth (number of layers) while keeping width (neurons per layer) bounded. The first such result for the bounded depth and bounded width case was published by Maiorov and Pinkus in 1999, who constructed an analytic sigmoidal activation function that gives universal approximation with two hidden layers and a bounded number of units per layer. In 2018, Guliyev and Ismailov produced a smooth sigmoidal activation with a similar property and even smaller width.
These results provide theoretical support for the modern preference for deep networks. A network with many narrow layers can express functions that would require an exponentially larger number of units in a shallow network, especially when the target function has a compositional structure that mirrors the network's depth.
Designing a neural network involves choosing both the number of hidden layers (depth) and the number of neurons per layer (width). Research has shown that depth and width contribute to a network's capabilities in different ways.
| Aspect | Deeper networks (more layers) | Wider networks (more neurons per layer) |
|---|---|---|
| Feature hierarchy | Learn hierarchical, compositional features | Capture many features in parallel at each level |
| Parameter efficiency | Represent complex functions with fewer total parameters | May require many more parameters to match a deep network |
| Training difficulty | Susceptible to vanishing gradients; may need skip connections | Generally easier to train with standard methods |
| Expressive power | Can represent certain functions exponentially more efficiently | Equivalent expressiveness may require exponentially more neurons |
| Computational cost | Sequential layer computation can be slower | Parallelizes well within each layer |
| Memory footprint | Activations must be cached for backprop, growing with depth | Wider activations consume more memory per layer |
| Risk of overfitting | More parameters per added layer can increase overfitting risk | Wider layers also increase parameter count and overfitting risk |
A 2020 study by Nguyen, Raghu, and Kornblith at Google Research found that very deep and very wide networks develop different internal representations. Wide networks tend to produce more uniform representations across layers, while deep networks develop increasingly distinct representations at each layer, reflecting hierarchical feature extraction.[8]
In practice, modern architectures balance depth and width based on the task. Convolutional neural networks for image tasks tend to be deep (dozens to hundreds of layers), while large language models use wide hidden layers stacked tens to hundreds deep. Image classifiers like ResNet-152 use 152 layers; transformer language models like GPT-3 use 96 transformer blocks, each with a hidden dimension of 12,288.
Choosing the right architecture is one of the most common practical questions in neural network design. While no universal formula exists, several guidelines and rules of thumb have emerged.
| Hidden layers | Capability |
|---|---|
| 0 | Only capable of representing linearly separable functions |
| 1 | Can approximate any continuous function (universal approximation theorem) |
| 2 | Can represent arbitrary decision boundaries with rational activation functions |
| 3+ | Can learn complex hierarchical representations (automatic feature engineering) |
Before the deep learning era, most problems were solved with one or two hidden layers. Today, tasks like computer vision and natural language processing routinely use dozens or even hundreds of layers.
Common heuristics for setting the number of neurons include sizing each hidden layer somewhere between the size of the input layer and the size of the output layer, using roughly two-thirds of the input layer size plus the output layer size, and keeping hidden layers smaller than twice the size of the input layer.[9]
The best approach is to treat the layer count and neuron count as hyperparameters and use cross-validation or automated hyperparameter tuning to find the optimal configuration for a given problem. For very large models, exhaustive search becomes infeasible, and practitioners instead rely on scaling laws and architecture families that have been validated empirically.
For large language models, OpenAI's 2020 scaling-law work and DeepMind's 2022 Chinchilla paper showed that test loss decreases as a smooth power-law function of model size, dataset size, and compute. The Chinchilla finding (Hoffmann et al., 2022) was that for a fixed compute budget, models should be trained on roughly 20 tokens of data per parameter, a ratio that has shaped modern model sizing. These results let researchers project the gain from adding hidden layer parameters before training, rather than guessing.
The activation function in each hidden layer determines the nonlinearity of the network. The choice has changed substantially over time, driven by training stability, gradient behavior, and empirical performance.
| Activation | Formula | Range | Notable use | Year/origin |
|---|---|---|---|---|
| Sigmoid | 1 / (1 + e⁻ˣ) | (0, 1) | Early MLPs, output of binary classifiers | 1940s-1980s |
| Tanh | (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ) | (−1, 1) | Pre-ReLU MLPs, classical RNNs | 1990s |
| ReLU | max(0, x) | [0, ∞) | Most CNNs, MLPs since 2012 | Nair and Hinton, 2010 |
| Leaky ReLU | x if x > 0, else αx | (−∞, ∞) | Avoids "dying ReLU" | Maas et al., 2013 |
| ELU | x if x > 0, else α(eˣ − 1) | (−α, ∞) | Smooth alternative to ReLU | Clevert et al., 2015 |
| GELU | x · Φ(x) | (−∞, ∞) | BERT, GPT-2, GPT-3 | Hendrycks and Gimpel, 2016 |
| SiLU / Swish | x · σ(x) | (−∞, ∞) | EfficientNet, some LLMs | Ramachandran et al., 2017 |
| SwiGLU | Swish(xW) ⊙ (xV) | depends on inputs | LLaMA, PaLM, Mixtral, DeepSeek | Shazeer, 2020 |
| GeGLU | GELU(xW) ⊙ (xV) | depends on inputs | T5 v1.1, Gemma | Shazeer, 2020 |
Sigmoid and tanh dominated the early decades of neural network research but suffer from the vanishing gradient problem: for inputs far from zero, the derivative is close to zero, which causes gradients to shrink as they propagate through many layers. ReLU largely solved this issue by having a constant gradient of 1 for positive inputs, and Nair and Hinton's 2010 paper on rectified linear units in restricted Boltzmann machines was an early demonstration that ReLU outperforms saturating units.[10]
GELU was introduced by Dan Hendrycks and Kevin Gimpel in 2016 and weights inputs by their value rather than gating them by sign as ReLU does. It became the default in BERT, GPT-2, and GPT-3, and remains widely used in encoder-only and earlier decoder-only transformers.[11]
Gated linear unit variants (SwiGLU, GeGLU) were proposed by Noam Shazeer in 2020. They replace the standard two-matrix feedforward block with a three-matrix gated formulation. To keep the parameter count comparable, the inner hidden dimension is scaled to 2/3 of its usual size. SwiGLU is now the default feedforward activation in LLaMA 2, LLaMA 3, PaLM, Mistral 7B, Mixtral, and DeepSeek models.[12]
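A minimal PyTorch sketch of this gated formulation is shown below, following the three-matrix pattern described above. The dimensions are placeholders, and real models organize the gate and up projections in various ways; this is an illustration, not a reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated feedforward block: down( SiLU(x W_gate) * (x W_up) )."""
    def __init__(self, d_model: int, d_inner: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_inner, bias=False)  # gate projection
        self.w_up = nn.Linear(d_model, d_inner, bias=False)    # value projection
        self.w_down = nn.Linear(d_inner, d_model, bias=False)  # back to model width

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

# Example sizes: d_inner ≈ (8/3) * d_model keeps parameters roughly comparable
# to a standard 4x feedforward block that uses only two matrices.
block = SwiGLUFeedForward(d_model=512, d_inner=1376)
out = block(torch.randn(2, 16, 512))   # (batch, sequence, d_model)
```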
The values used to initialize the weights and biases of hidden layers strongly affect whether a deep network can train at all. Random initialization with the wrong scale either causes activations and gradients to vanish (when the variance shrinks layer by layer) or explode (when it grows). Two schemes have become standard.
Xavier (Glorot) initialization, introduced by Xavier Glorot and Yoshua Bengio in 2010, samples weights from a distribution with variance 2 / (n_in + n_out), where n_in and n_out are the number of input and output connections to the layer. The goal is to keep the variance of activations and gradients roughly constant across layers. It is well suited to sigmoid and tanh hidden layers.[13]
He initialization, introduced by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun in 2015, was designed for ReLU and its variants. Because ReLU zeros out negative inputs, roughly halving the variance, He initialization compensates by sampling from a distribution with variance 2 / n_in. It is the default for ReLU-based networks and is implemented in PyTorch as kaiming_normal_ and kaiming_uniform_.[14]
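A brief sketch showing both schemes applied to hidden layers with PyTorch's built-in initializers; the layer sizes below are arbitrary.

```python
import torch
import torch.nn as nn

tanh_layer = nn.Linear(256, 128)
relu_layer = nn.Linear(256, 128)

# Xavier/Glorot: variance 2 / (n_in + n_out), suited to sigmoid and tanh layers.
nn.init.xavier_normal_(tanh_layer.weight)

# He/Kaiming: variance 2 / n_in, compensating for ReLU zeroing negative inputs.
nn.init.kaiming_normal_(relu_layer.weight, nonlinearity="relu")

# Biases are commonly initialized to zero.
nn.init.zeros_(tanh_layer.bias)
nn.init.zeros_(relu_layer.bias)
```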
Normalization layers stabilize the distribution of activations across hidden layers, which lets networks train faster and at higher learning rates. The major techniques used inside hidden layers are summarized below.
| Technique | Normalizes over | Year | Original paper | Common in |
|---|---|---|---|---|
| Batch normalization | Across the mini-batch (per feature) | 2015 | Ioffe and Szegedy | CNNs, ResNet |
| Layer normalization | Across features within one example | 2016 | Ba, Kiros, Hinton | RNNs, original Transformer, BERT |
| RMSNorm | RMS of features within one example, no centering | 2019 | Zhang and Sennrich | LLaMA, T5, Mistral, Gemma |
| Group normalization | Groups of channels per example | 2018 | Wu and He | Vision tasks with small batches |
| Instance normalization | Per channel, per example | 2016 | Ulyanov, Vedaldi, Lempitsky | Style transfer, GANs |
Batch normalization was introduced by Sergey Ioffe and Christian Szegedy in 2015. It normalizes the activations of each layer using statistics computed over the current mini-batch, then applies learnable scale and shift parameters. The original motivation was the reduction of "internal covariate shift," though later work suggested the actual benefit is smoother loss landscapes.[15]
Layer normalization, introduced by Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey Hinton in 2016, normalizes across the feature dimension within a single training example. It works well for recurrent networks and was used in the original Transformer paper and BERT. RMSNorm, introduced by Biao Zhang and Rico Sennrich in 2019, removes the mean-centering step from layer normalization, leaving only division by root mean square. It is faster and uses fewer parameters; LLaMA, T5, Mistral, and Gemma all adopt RMSNorm in place of layer norm.[16]
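The difference is easiest to see in code. The following is a minimal sketch of RMSNorm as described by Zhang and Sennrich: divide by the root mean square of the features, with no mean subtraction. The epsilon value and dimensions are typical but arbitrary choices.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: scale by the RMS of the features, no centering."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # learnable per-feature scale
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)

norm = RMSNorm(512)
out = norm(torch.randn(4, 16, 512))  # normalized over the last (feature) dimension
```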
Normalization can be placed before each sublayer ("pre-norm") or after ("post-norm"). Pre-norm is now standard in deep transformer stacks because it makes deep networks more stable to train.
Large hidden layers can memorize training data instead of learning useful patterns. Several techniques constrain hidden-layer behavior to encourage better generalization.
Dropout, introduced by Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov in 2014, randomly sets a fraction p of the activations in a hidden layer to zero on each forward pass during training, and rescales the surviving activations by 1/(1−p). At inference time the full network is used. The effect is that no hidden unit can rely on any other specific unit being present, which prevents co-adaptation and acts as an implicit ensemble of subnetworks.[17]
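The inverted-dropout scaling described above can be sketched in a few lines. Frameworks already provide this (for example as a built-in dropout layer), so the function below is purely illustrative.

```python
import torch

def dropout_train(a, p=0.5):
    # Zero out a random fraction p of hidden activations and rescale the rest
    # by 1/(1-p), so the expected activation magnitude is unchanged.
    mask = (torch.rand_like(a) >= p).float()
    return a * mask / (1.0 - p)

hidden = torch.randn(4, 64)           # activations of one hidden layer
dropped = dropout_train(hidden, 0.5)  # training only; inference uses `hidden` unchanged
```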
L2 regularization (also called weight decay) penalizes the squared sum of all hidden-layer weights, pulling weights toward zero. L1 regularization penalizes the absolute sum and tends to produce sparse weight matrices. Both can be applied separately to each hidden layer.
Early stopping halts training once validation loss stops improving, which prevents the hidden layers from continuing to memorize the training data after their generalization error bottoms out.
Label smoothing, mixup, and cutout are data-side regularizers that indirectly affect hidden layer learning by changing the supervision or input statistics.
Hidden layers take different forms depending on the network architecture.
In a standard multilayer perceptron, hidden layers are fully connected: every neuron in one layer connects to every neuron in the next. These networks are the simplest and most general form of deep network, and they remain useful for tabular data, lightweight classifiers, and as components inside larger architectures.
In a CNN, hidden layers include convolutional layers, pooling layers, and fully connected layers. Convolutional hidden layers apply learned filters across spatial positions, which gives them translation equivariance and a far smaller parameter count than a fully connected equivalent. Early layers learn edges and textures, deeper layers learn object parts and whole objects. AlexNet (2012) used five convolutional hidden layers; ResNet-152 uses 152.
In an RNN, a hidden layer also acts as the model's memory. At each time step, the hidden layer receives both the current input and the hidden state from the previous time step, allowing the network to process sequential data like text and time series. Vanilla RNNs suffer from vanishing and exploding gradients over long sequences, which led to gated variants. LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Cho et al., 2014) add gating mechanisms to their hidden layers to better preserve long-range dependencies.
In a transformer, each hidden layer (commonly called a "block" or "layer") consists of a multi-head self-attention sublayer followed by a feedforward neural network sublayer, with residual connections and normalization around each. The attention sublayer mixes information across positions, while the feedforward sublayer processes each position independently. The feedforward sublayer is typically four times wider than the residual stream (or 8/3 times wider for SwiGLU variants). Models like GPT and BERT stack many such layers to achieve strong performance on language tasks.
The "hidden state" in a transformer refers to the residual stream vector at each token position, which has dimension d_model. This vector is updated by each transformer block and carries the contextualized representation of the token through the network.
In a graph neural network, a hidden layer aggregates messages from neighboring nodes, transforms them with a learned function, and produces an updated node representation. Stacking such hidden layers lets information flow further across the graph at each step.
The "hidden size" or d_model of a transformer language model refers to the width of the residual stream that passes through every block. The feedforward sublayer typically uses an inner dimension that is 4x or 8/3x larger. The table below lists representative hidden dimensions and layer counts for several well-known models.
| Model | Parameters | Hidden size (d_model) | Number of layers | FFN inner dim | Activation |
|---|---|---|---|---|---|
| GPT-2 small | 124M | 768 | 12 | 3,072 | GELU |
| GPT-2 medium | 355M | 1,024 | 24 | 4,096 | GELU |
| GPT-2 large | 774M | 1,280 | 36 | 5,120 | GELU |
| GPT-2 XL | 1.5B | 1,600 | 48 | 6,400 | GELU |
| BERT-base | 110M | 768 | 12 | 3,072 | GELU |
| BERT-large | 340M | 1,024 | 24 | 4,096 | GELU |
| GPT-3 175B | 175B | 12,288 | 96 | 49,152 | GELU |
| PaLM 540B | 540B | 18,432 | 118 | 73,728 | SwiGLU |
| LLaMA 2 7B | 7B | 4,096 | 32 | 11,008 | SwiGLU |
| LLaMA 2 70B | 70B | 8,192 | 80 | 28,672 | SwiGLU |
| LLaMA 3 8B | 8B | 4,096 | 32 | 14,336 | SwiGLU |
| LLaMA 3 70B | 70B | 8,192 | 80 | 28,672 | SwiGLU |
| Mistral 7B | 7B | 4,096 | 32 | 14,336 | SwiGLU |
| Mixtral 8x7B | 47B (12.9B active) | 4,096 | 32 | 14,336 (per expert) | SwiGLU |
For encoder-decoder models like T5, the d_model and FFN inner dim apply separately to encoder and decoder stacks. Mixture-of-experts models such as Mixtral and DeepSeek-V3 use multiple parallel feedforward experts in each layer; only a subset are active for any given token.
The number and size of hidden layers directly determine a neural network's capacity, which is its ability to fit a wide variety of functions.
Techniques to manage capacity include regularization (L1, L2), dropout, batch normalization, and early stopping. These methods allow practitioners to build larger networks while controlling the effective capacity during training.
A related modern observation is the "double descent" phenomenon, in which test error first rises (classical overfitting) and then falls again as model size grows past the interpolation threshold. This was documented by Belkin et al. (2019) and Nakkiran et al. (2020) and helps explain why very large neural networks with many hidden parameters can generalize well even when they have enough capacity to memorize the training set.
As networks grow deeper, training becomes more difficult due to the vanishing gradient problem: gradients shrink as they propagate backward through many layers, causing early layers to learn very slowly or not at all. Saturating activations like sigmoid make the problem worse, and even with ReLU, very deep networks degrade because the optimizer struggles to find good parameter settings.
Skip connections (also called residual connections) address this problem by creating shortcut paths that bypass one or more hidden layers. Introduced in the ResNet architecture by Kaiming He and colleagues in 2015, skip connections add the input of a block directly to its output:
y = F(x) + x
Instead of learning a complete transformation, each block only needs to learn the residual difference between its input and the desired output. This makes training much easier and allows networks to scale to hundreds or even thousands of layers. ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015 and demonstrated that very deep networks with skip connections consistently outperform shallower ones.[18]
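A minimal sketch of the y = F(x) + x pattern is shown below, with F standing in for any small stack of hidden layers; the dimensions and the two-layer F are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the block only has to learn the residual F."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(           # F(x): two hidden layers, for illustration
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.f(x) + x              # skip connection adds the input back in

block = ResidualBlock(256)
y = block(torch.randn(8, 256))
```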
Skip connections have since become a standard component in many architectures, including transformers, DenseNet, and U-Net. In transformers, the residual stream that flows through every block is the direct successor of the ResNet skip path, and the additive nature of these updates is part of why the residual stream view has become central to mechanistic interpretability research.
Because hidden layers operate as a black box, researchers have developed several techniques to understand what they learn, including probing classifiers trained on hidden activations, feature visualization and activation maximization, saliency and attribution maps, and dimensionality-reduction plots (such as t-SNE or PCA) of hidden representations.
These tools have shown that hidden layers in deep networks develop organized internal representations, with clear specialization emerging among neurons even without explicit instructions to do so.
A more recent line of work, often called mechanistic interpretability, tries to reverse-engineer the algorithms that hidden layers implement. The Anthropic interpretability team and the open Transformer Circuits Thread (transformer-circuits.pub) have published extended studies of attention heads, MLP neurons, and entire circuits inside production language models.
A central finding from this work is the phenomenon of superposition: hidden-layer neurons routinely represent more concepts than they have dimensions, by packing multiple features along overlapping directions in activation space. To recover human-interpretable units, researchers train sparse autoencoders (SAEs) on hidden-layer activations. The SAE projects the layer activation into a much higher-dimensional, sparse latent space where each dimension can correspond to a single nameable feature.
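A minimal sketch of this setup follows: project hidden activations into a much wider latent space and train with a reconstruction loss plus an L1 sparsity penalty. The expansion factor, penalty weight, and dimensions below are illustrative assumptions, not the configuration used in any published work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Encode hidden-layer activations into a wider, sparse feature space."""
    def __init__(self, d_hidden: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_hidden, d_features)  # d_features >> d_hidden
        self.decoder = nn.Linear(d_features, d_hidden)

    def forward(self, acts):
        features = F.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(features)
        return recon, features

sae = SparseAutoencoder(d_hidden=512, d_features=8192)    # 16x expansion, arbitrary
acts = torch.randn(64, 512)                                # stand-in for real hidden activations
recon, feats = sae(acts)
loss = F.mse_loss(recon, acts) + 1e-3 * feats.abs().mean()  # reconstruction + L1 sparsity
```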
Anthropic's 2024 paper Scaling Monosemanticity applied SAEs to the residual stream of Claude 3 Sonnet and extracted millions of features ranging from concrete entities (specific cities, people, code constructs) to abstract concepts (sycophancy, deception, security vulnerabilities). The features were largely shared between models, which suggests that hidden layers in different transformers converge on similar internal vocabularies.
In modern pretrained models, hidden layer activations are often valuable in their own right, separate from the model's task output. Practitioners use the hidden states from intermediate layers as feature vectors for downstream classifiers or similarity search.
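For example, the Hugging Face transformers library exposes per-layer hidden states directly. In the sketch below, the model name, the choice of layer 8, and mean pooling are arbitrary choices, not a recommendation.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Any pretrained encoder works here; bert-base-uncased is an arbitrary example.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hidden layers build internal representations.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Tuple of (num_layers + 1) tensors: the embeddings plus every hidden layer's output.
hidden_states = outputs.hidden_states
features = hidden_states[8].mean(dim=1)   # mean-pooled layer-8 activations as a feature vector
print(features.shape)                     # (1, 768) for BERT-base
```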
The practice of using a pretrained network's hidden layer as a feature extractor predates transformers. ImageNet-trained CNNs from the 2014 to 2018 era were widely used as fixed feature extractors for downstream vision tasks, with practitioners taking activations from the second-to-last fully connected layer.
The history of hidden layers is closely tied to the development of neural networks as a whole.
| Year | Milestone |
|---|---|
| 1958 | Frank Rosenblatt introduces the perceptron, originally proposed as a model with an input layer, a hidden layer with random non-learning weights, and learnable output connections |
| 1969 | Marvin Minsky and Seymour Papert publish Perceptrons, highlighting the limitations of single-layer networks and contributing to the first "AI winter" |
| 1986 | David Rumelhart, Geoffrey Hinton, and Ronald Williams publish their Nature paper on backpropagation, demonstrating that hidden layers can learn useful internal representations |
| 1989 | Yann LeCun et al. apply backpropagation to convolutional hidden layers for handwritten digit recognition |
| 1989 | Cybenko proves the universal approximation theorem for sigmoid networks; Hornik, Stinchcombe, and White publish a related result the same year |
| 1991 | Hornik shows the universality property follows from the multilayer feedforward architecture itself, not the specific activation function |
| 1997 | Sepp Hochreiter and Jürgen Schmidhuber introduce LSTM, with gated hidden state to handle long sequences |
| 2006 | Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh introduce deep belief networks with greedy layer-wise pretraining, reigniting interest in deep (multi-hidden-layer) networks |
| 2010 | Glorot and Bengio publish the Xavier initialization scheme; Nair and Hinton popularize ReLU activations |
| 2012 | AlexNet wins ImageNet with a deep CNN containing five convolutional hidden layers, launching the modern deep learning era |
| 2014 | Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov publish the dropout paper |
| 2015 | He et al. publish He initialization; Ioffe and Szegedy introduce batch normalization; ResNet introduces skip connections, enabling networks with over 100 hidden layers |
| 2016 | Ba, Kiros, and Hinton introduce layer normalization; Hendrycks and Gimpel introduce GELU |
| 2017 | Vaswani et al. publish Attention Is All You Need, introducing the transformer, which replaces recurrent hidden layers with self-attention |
| 2019 | Zhang and Sennrich introduce RMSNorm |
| 2020 | Shazeer publishes GLU Variants Improve Transformer, introducing SwiGLU and GeGLU |
| 2022 | Hoffmann et al. publish the Chinchilla scaling-law paper, refining how to size hidden layers under a compute budget |
| 2023-2024 | Sparse autoencoders applied to production LLMs reveal interpretable feature dictionaries inside hidden layers |
The 1986 backpropagation paper was particularly important for hidden layers. Rumelhart, Hinton, and Williams showed that when a network is trained with backpropagation, the hidden units come to represent important features of the task domain on their own, without being explicitly told what to learn. This finding established hidden layers as the engine of representation learning in neural networks.[4]
Imagine you are trying to decide if a picture shows a cat or a dog. Your eyes are the input layer: they see the picture. Your final answer ("cat" or "dog") is the output layer.
But between seeing the picture and giving your answer, your brain does a lot of work. First, you notice basic things like shapes and colors. Then you put those together to see ears, a nose, and a tail. Finally, you combine all of that to recognize the whole animal.
Those middle steps, where your brain is working things out before giving an answer, are like hidden layers. They are called "hidden" because nobody else can see what is happening inside your head. They only see the picture you looked at and the answer you gave. Everything in between is hidden.
A neural network works the same way. The hidden layers are the "thinking steps" between receiving the input and producing the output. More hidden layers let the network think in more steps, which helps it solve harder problems. Researchers use special tools to peek inside those thinking steps, like a doctor using an X-ray, but they have to be patient: the patterns in there can be tangled and surprising.