See also: Neural network, Input layer, Output layer, Activation function
A hidden layer is a layer of artificial neurons in a neural network that sits between the input layer and the output layer. The term "hidden" refers to the fact that these layers are not directly exposed to the external environment: they do not receive raw data from outside the network (as the input layer does), nor do they produce the final result (as the output layer does). Instead, hidden layers operate internally, transforming inputs into intermediate representations that make it possible for the network to learn complex, nonlinear relationships in data.[1]
A feedforward neural network with one or more hidden layers is called a multilayer perceptron (MLP). Networks with many hidden layers are commonly referred to as deep neural networks, and the practice of training such networks is known as deep learning. The number of hidden layers in a network defines its "depth," while the number of neurons in each hidden layer defines its "width."[2]
The name comes from the perspective of someone observing the network from the outside. During training and inference, a user can see the inputs fed into the network and the outputs it produces. However, the intermediate computations performed by the layers between input and output are not directly visible or interpretable without specialized tools. These internal layers are therefore "hidden" from view.[3]
Put another way, in a supervised learning setup, the training data provides explicit target values for the output layer and explicit feature values for the input layer. No such direct supervision exists for the intermediate layers; the network must figure out on its own what representations to build in them. This self-organized character of the internal representations, learned without any direct targets, is another reason the layers are considered hidden.[4]
Each neuron in a hidden layer performs a simple computation. It receives a set of inputs (either from the input layer or from the previous hidden layer), multiplies each input by a corresponding weight, sums the results, adds a bias term, and then passes the sum through an activation function. The output of this computation is then sent forward to the next layer.
Mathematically, for a single neuron in a hidden layer, the output a is computed as:
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = f(z)
where x₁, x₂, ..., xₙ are the inputs, w₁, w₂, ..., wₙ are the weights, b is the bias, and f is the activation function (such as ReLU, sigmoid, or tanh).
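As a concrete illustration, the following sketch computes this weighted sum and activation for a single hidden neuron using NumPy. The specific input values, weights, bias, and the choice of ReLU are arbitrary assumptions for the example, not values from any trained network.

```python
import numpy as np

def relu(z):
    # ReLU activation: f(z) = max(0, z)
    return np.maximum(0.0, z)

# Illustrative inputs, weights, and bias (arbitrary example values)
x = np.array([0.5, -1.2, 3.0])   # inputs x1..xn
w = np.array([0.4, 0.1, -0.6])   # weights w1..wn
b = 0.2                          # bias

z = np.dot(w, x) + b   # weighted sum: w1*x1 + ... + wn*xn + b
a = relu(z)            # activation output, sent forward to the next layer
print(z, a)            # z = -1.52, so ReLU outputs 0.0
```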
The activation function is critical because it introduces nonlinearity. Without it, stacking multiple layers would be mathematically equivalent to a single linear transformation, and the network would be no more powerful than a simple linear regression model.[5]
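This collapse of stacked linear layers can be verified directly: composing two weight matrices without an intervening activation is just one matrix product. A small NumPy check (with arbitrary matrix sizes and random values chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # "hidden" layer, no activation
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output layer
x = rng.normal(size=3)

# Two stacked linear layers...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...equal a single linear layer with weights W2 @ W1 and bias W2 @ b1 + b2
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, one_layer))  # True
```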
One of the most important functions of hidden layers is automatic feature extraction. Rather than relying on hand-engineered features, a neural network with hidden layers can learn to identify the relevant patterns in raw data on its own.
This process is hierarchical. In a network trained on images, for example:
| Layer | What It Learns | Example |
|---|---|---|
| First hidden layer | Low-level features | Edges, corners, color gradients |
| Second hidden layer | Mid-level features | Textures, simple shapes (circles, rectangles) |
| Third hidden layer | High-level features | Object parts (eyes, ears, wheels) |
| Deeper hidden layers | Abstract concepts | Entire objects, scene context |
Each successive hidden layer builds on the representations created by the previous layer, composing simple features into increasingly complex and abstract ones. This hierarchical feature learning is what gives deep neural networks their remarkable ability to handle tasks like image recognition, natural language processing, and speech recognition.[6]
Without hidden layers, a neural network can only capture linear relationships between input and output. A single hidden layer with a nonlinear activation function is sufficient, in theory, to approximate any continuous function (see below), but deeper networks tend to learn more efficient representations in practice.
The universal approximation theorem provides the theoretical foundation for using hidden layers. First proved by George Cybenko in 1989 for sigmoid activation functions, and independently by Kurt Hornik, Maxwell Stinchcombe, and Halbert White the same year, the theorem states that a feedforward neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function on a compact subset of Rⁿ to arbitrary accuracy. Later work extended the result to any non-polynomial activation function.[7]
This means that, in principle, even a single hidden layer is enough to represent any function. However, the theorem is an existence result: it guarantees that such a network exists but does not specify how many neurons are needed or how to find the right weights. In practice, the required number of neurons in a single-layer network can be astronomically large, making deeper architectures far more practical.
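As an informal demonstration (not a proof), a single-hidden-layer network can be fitted to a simple continuous function with scikit-learn. The target function, layer width, and training settings below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Target: a continuous function on a compact interval
X = np.linspace(-np.pi, np.pi, 500).reshape(-1, 1)
y = np.sin(X).ravel()

# One hidden layer of 50 tanh units (width chosen arbitrarily)
net = MLPRegressor(hidden_layer_sizes=(50,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0)
net.fit(X, y)

# The worst-case error over the training interval should be small
print(f"max |error| = {np.max(np.abs(net.predict(X) - y)):.3f}")
```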
Later research extended the theorem to show that universality can also be achieved by increasing depth (number of layers) while keeping width (neurons per layer) bounded. This provides theoretical support for the effectiveness of deep neural networks.
Designing a neural network involves choosing both the number of hidden layers (depth) and the number of neurons per layer (width). Research has shown that depth and width contribute to a network's capabilities in different ways.
| Aspect | Deeper Networks (More Layers) | Wider Networks (More Neurons Per Layer) |
|---|---|---|
| Feature hierarchy | Learn hierarchical, compositional features | Capture many features in parallel at each level |
| Parameter efficiency | Represent complex functions with fewer total parameters | May require many more parameters to match a deep network |
| Training difficulty | Susceptible to vanishing gradients; may need skip connections | Generally easier to train with standard methods |
| Expressive power | Can represent certain functions exponentially more efficiently | Equivalent expressiveness may require exponentially more neurons |
| Computational cost | Sequential layer computation can be slower | Parallelizes well within each layer |
| Risk of overfitting | More parameters per added layer can increase overfitting risk | Wider layers also increase parameter count and overfitting risk |
A 2020 study by Nguyen, Raghu, and Kornblith at Google Research found that very deep and very wide networks develop different internal representations. Wide networks tend to produce more uniform representations across layers, while deep networks develop increasingly distinct representations at each layer, reflecting hierarchical feature extraction.[8]
In practice, modern architectures balance depth and width based on the task. Convolutional neural networks for image tasks tend to be deep (dozens to hundreds of layers), while some language models use very wide hidden layers with moderate depth.
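To make the parameter-efficiency trade-off from the table concrete, the sketch below counts weights and biases for a deep, narrow fully connected network versus a shallow, wide one with the same input and output sizes. The specific layer sizes are made up for illustration:

```python
def mlp_param_count(layer_sizes):
    # Each fully connected layer has (inputs * outputs) weights + outputs biases
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))

# Same input (100) and output (10) sizes; hidden sizes chosen arbitrarily
deep_narrow = [100, 64, 64, 64, 64, 10]   # four hidden layers of width 64
shallow_wide = [100, 512, 10]             # one hidden layer of width 512

print(mlp_param_count(deep_narrow))   # 19594 parameters
print(mlp_param_count(shallow_wide))  # 56842 parameters
```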
Choosing the right architecture is one of the most common practical questions in neural network design. While no universal formula exists, several guidelines and rules of thumb have emerged.
| Hidden Layers | Capability |
|---|---|
| 0 | Only capable of representing linearly separable functions |
| 1 | Can approximate any continuous function (universal approximation theorem) |
| 2 | Can represent arbitrary decision boundaries with rational activation functions |
| 3+ | Can learn complex hierarchical representations (automatic feature engineering) |
Before the deep learning era, most problems were solved with one or two hidden layers. Today, tasks like computer vision and natural language processing routinely use dozens or even hundreds of layers.
Common heuristics for setting the number of neurons include:[9]

- Choosing a hidden-layer size between the size of the input layer and the size of the output layer
- Setting the hidden-layer size to roughly two-thirds of the input layer size, plus the output layer size
- Keeping the number of hidden neurons below twice the size of the input layer
The best approach is to treat the layer count and neuron count as hyperparameters and use cross-validation or automated hyperparameter tuning to find the optimal configuration for a given problem.
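A minimal sketch of that approach, assuming scikit-learn and a generic synthetic classification dataset; the candidate layer configurations in the grid are arbitrary examples:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Treat depth and width as hyperparameters searched by cross-validation
param_grid = {"hidden_layer_sizes": [(32,), (64,), (32, 32), (64, 32, 16)]}
search = GridSearchCV(MLPClassifier(max_iter=2000, random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)  # the best-scoring depth/width configuration
```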
The number and size of hidden layers directly determine a neural network's capacity, which is its ability to fit a wide variety of functions.
Techniques to manage capacity include regularization (L1, L2), dropout, batch normalization, and early stopping. These methods allow practitioners to build larger networks while controlling the effective capacity during training.
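A sketch of how two of these techniques are commonly wired up in PyTorch (the layer sizes, dropout rate, and weight decay value are arbitrary): dropout layers are inserted after the hidden activations, and L2 regularization is applied through the optimizer's weight_decay term.

```python
import torch
import torch.nn as nn

# Dropout between hidden layers limits effective capacity during training
model = nn.Sequential(
    nn.Linear(20, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 128), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(128, 2),
)

# weight_decay adds an L2 penalty on the weights; early stopping would be
# handled separately by monitoring validation loss in the training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```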
Hidden layers take different forms depending on the network architecture.
In a standard multilayer perceptron, hidden layers are fully connected: every neuron in one layer connects to every neuron in the next. These networks are the simplest and most general form of deep network.
In a CNN, hidden layers include convolutional layers, pooling layers, and fully connected layers. Convolutional hidden layers apply learned filters to detect spatial patterns, with early layers learning edges and textures and deeper layers learning complex object parts and whole objects.
In an RNN, hidden layers maintain a hidden state that carries information across time steps. At each step, the hidden layer receives both the current input and the hidden state from the previous time step, allowing the network to process sequential data like text and time series. Variants like LSTM and GRU add gating mechanisms to their hidden layers to better preserve long-range dependencies.
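The recurrence can be written out directly. This minimal NumPy sketch implements the classic update h_t = tanh(Wₓxₜ + Wₕhₜ₋₁ + b); all sizes and values are chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 5
W_x = rng.normal(size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(size=(n_hidden, n_hidden))  # hidden-to-hidden (recurrent) weights
b = np.zeros(n_hidden)

h = np.zeros(n_hidden)                  # initial hidden state
for x_t in rng.normal(size=(4, n_in)):  # a toy sequence of 4 time steps
    # The hidden layer sees both the current input and the previous state
    h = np.tanh(W_x @ x_t + W_h @ h + b)
print(h)  # final hidden state after the sequence
```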
In a Transformer, each hidden layer (commonly called a "block" or "layer") consists of a multi-head self-attention mechanism followed by a feed-forward network. The attention mechanism allows each position in the sequence to attend to every other position, while the feed-forward component processes each position independently. Models like GPT and BERT stack many such layers to achieve state-of-the-art performance on language tasks.
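A simplified sketch of one such block in PyTorch; the dimensions are illustrative, and real GPT- and BERT-style blocks differ in details such as normalization placement:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One hidden layer of a Transformer: self-attention plus feed-forward,
    each wrapped in a residual connection and layer normalization."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Every position attends to every other position in the sequence
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)      # residual connection around attention
        # Feed-forward component processes each position independently
        return self.norm2(x + self.ff(x))

x = torch.randn(2, 10, 64)                # (batch, sequence, features)
print(TransformerBlock()(x).shape)        # torch.Size([2, 10, 64])
```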
As networks grow deeper, training becomes more difficult due to the vanishing gradient problem: gradients shrink exponentially as they propagate backward through many layers, causing early layers to learn very slowly or not at all.
Skip connections (also called residual connections) address this problem by creating shortcut paths that bypass one or more hidden layers. Introduced in the ResNet architecture by Kaiming He and colleagues in 2015, skip connections add the input of a block directly to its output:
y = F(x) + x
Instead of learning a complete transformation, each block only needs to learn the residual difference between its input and the desired output. This makes training much easier and allows networks to scale to hundreds or even thousands of layers. ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015 and demonstrated that very deep networks with skip connections consistently outperform shallower ones.[10]
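A minimal residual block sketch in PyTorch, assuming matching input and output dimensions so the shortcut can be a plain addition (real ResNet blocks use convolutions and handle dimension changes with projection shortcuts):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = F(x) + x, where F is a small learned transformation."""
    def __init__(self, dim=64):
        super().__init__()
        self.F = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # The shortcut lets gradients flow around F during backpropagation
        return self.F(x) + x

x = torch.randn(8, 64)
print(ResidualBlock()(x).shape)  # torch.Size([8, 64])
```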
Skip connections have since become a standard component in many architectures, including Transformers, DenseNet, and U-Net.
Because hidden layers operate as a "black box," researchers have developed several techniques to understand what they learn, including feature visualization, activation maximization, probing classifiers, and saliency maps.
These tools have revealed that hidden layers in deep networks develop surprisingly organized internal representations, with clear specialization emerging among neurons even without explicit instructions to do so.
The history of hidden layers is closely tied to the development of neural networks as a whole.
| Year | Milestone |
|---|---|
| 1958 | Frank Rosenblatt introduces the Perceptron, a single-layer network with no hidden layers |
| 1969 | Minsky and Papert publish Perceptrons, highlighting the limitations of single-layer networks and contributing to the first "AI winter" |
| 1986 | David Rumelhart, Geoffrey Hinton, and Ronald Williams publish their landmark paper on backpropagation, demonstrating that hidden layers can learn useful internal representations |
| 1989 | Cybenko and Hornik et al. prove the universal approximation theorem, showing one hidden layer is theoretically sufficient for function approximation |
| 2006 | Hinton introduces deep belief networks with layer-wise pretraining, reigniting interest in deep (multi-hidden-layer) networks |
| 2012 | AlexNet wins ImageNet with a deep CNN containing five convolutional hidden layers, launching the modern deep learning era |
| 2015 | ResNet introduces skip connections, enabling networks with over 100 hidden layers |
| 2017 | The Transformer architecture replaces recurrent hidden layers with self-attention, revolutionizing NLP |
The 1986 backpropagation paper was particularly important for hidden layers. Rumelhart, Hinton, and Williams showed that when a network is trained with backpropagation, the hidden units come to represent important features of the task domain on their own, without being explicitly told what to learn. This finding established hidden layers as the engine of representation learning in neural networks.[4]
Imagine you are trying to decide if a picture shows a cat or a dog. Your eyes are the input layer: they see the picture. Your final answer ("cat" or "dog") is the output layer.
But between seeing the picture and giving your answer, your brain does a lot of work. First, you notice basic things like shapes and colors. Then you put those together to see ears, a nose, and a tail. Finally, you combine all of that to recognize the whole animal.
Those middle steps, where your brain is working things out before giving an answer, are like hidden layers. They are called "hidden" because nobody else can see what is happening inside your head. They only see the picture you looked at and the answer you gave. Everything in between is hidden.
A neural network works the same way. The hidden layers are the "thinking steps" between receiving the input and producing the output. More hidden layers let the network think in more steps, which helps it solve harder problems.