# Layer

> Source: https://aiwiki.ai/wiki/layer
> Updated: 2026-06-23
> Categories: Deep Learning, Machine Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

A **layer** is the fundamental building block of a [neural network](/wiki/neural_network): an organized group of [neurons](/wiki/neuron) (also called nodes or units) that together apply one mathematical transformation to their input and pass the result to the next layer. Each layer receives an input, computes a transformation such as a weighted sum followed by an [activation function](/wiki/activation_function), and produces an output. A network is defined by how many layers it stacks (its depth) and which types of layers it uses, and these two choices largely determine what patterns the model can learn.

Depth, the number of stacked layers, is what gives "deep learning" its name. Modern models range from a handful of layers to many hundreds: the 2015 ResNet architecture trained networks up to 152 layers deep, which was 8 times deeper than the earlier VGG networks, while keeping lower computational complexity.[2] A network is typically organized into three roles: an [input layer](/wiki/input_layer) that receives raw data, one or more [hidden layers](/wiki/hidden_layer) that extract features, and an [output layer](/wiki/output_layer) that produces the prediction. Understanding how individual layers operate is essential for designing, training, and debugging neural networks.[1]

## How does a layer work?

At the most basic level, a layer performs three steps:

1. **Linear transformation**: The layer multiplies its input vector by a weight matrix and adds a bias term. Mathematically, this is expressed as `z = Wx + b`, where `W` is the weight matrix, `x` is the input, and `b` is the bias vector.
2. **Activation**: The result of the linear transformation is passed through a non-linear [activation function](/wiki/activation_function) such as [ReLU](/wiki/rectified_linear_unit_relu), [sigmoid](/wiki/sigmoid_function), or [softmax](/wiki/softmax). This non-linearity allows the network to learn complex, non-linear relationships in the data.
3. **Output**: The activated values become the layer's output, which serves as input to the subsequent layer.
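
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a framework implementation: the function name `dense_forward`, the choice of ReLU, and the example weights are all assumptions made for the example.

```python
import numpy as np

def dense_forward(x, W, b):
    """One layer: linear transformation followed by a ReLU activation."""
    z = W @ x + b            # step 1: linear transformation z = Wx + b
    a = np.maximum(z, 0.0)   # step 2: non-linear activation (ReLU)
    return a                 # step 3: output passed to the next layer

# Hypothetical layer with 3 inputs and 2 units
W = np.array([[1.0, -2.0, 0.5],
              [0.0,  1.0, 1.0]])
b = np.array([0.5, -1.0])
x = np.array([1.0, 2.0, 2.0])

a = dense_forward(x, W, b)
print(a)  # [0. 3.]
```

Here the first unit's pre-activation is negative (-1.5), so ReLU clips it to zero, while the second unit passes its value of 3 through unchanged.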

During [training](/wiki/training), the weights and biases of each layer are adjusted through [backpropagation](/wiki/backpropagation) and an optimization algorithm such as [stochastic gradient descent](/wiki/stochastic_gradient_descent_sgd) to minimize a [loss function](/wiki/loss_function).[1]

## What are the main types of layers?

Neural networks use many different types of layers, each designed for specific data types and tasks. The table below summarizes the most common layer types.

| Layer Type | Description | Typical Use Case |
|---|---|---|
| [Input Layer](/wiki/input_layer) | The first layer that receives raw data and passes it to the network without performing learned transformations. Its size matches the dimensionality of the input features. | All neural networks |
| [Hidden Layer](/wiki/hidden_layer) | Any intermediate layer between the input and output layers. Hidden layers perform learned transformations and are responsible for extracting features and representations from the data. | All neural networks with depth > 1 |
| [Output Layer](/wiki/output_layer) | The final layer that produces the network's prediction or output. Its size and activation function depend on the task (e.g., softmax for classification, linear for regression). | All neural networks |
| [Dense (Fully Connected) Layer](/wiki/dense_layer) | Every neuron is connected to every neuron in the previous layer. Performs a full matrix multiplication between inputs and weights. | Classification heads, regression, tabular data |
| [Convolutional Layer](/wiki/convolutional_layer) | Applies learnable filters (kernels) that slide across spatial input to detect local features such as edges, textures, and shapes. Neurons connect only to a local region of the input, and filters share parameters across spatial positions. | Image recognition, computer vision, audio processing |
| Pooling Layer | Reduces the spatial dimensions of feature maps by applying an aggregation function (max or average) over local regions. Contains no learnable parameters. | Downsampling in CNNs |
| Recurrent Layer | Maintains a hidden state across time steps, allowing it to process sequential data by incorporating information from previous inputs. Variants include LSTM and GRU cells. | Time series, [natural language processing](/wiki/natural_language_processing), speech |
| [Attention Layer](/wiki/attention) | Computes dynamic weights that determine how much each element in a sequence should attend to every other element. Scaled dot-product attention and multi-head attention are the core mechanisms in [transformer](/wiki/transformer) architectures. | Large language models, machine translation, vision transformers |
| Normalization Layer | Normalizes activations within a layer to stabilize and accelerate training. Common variants include [batch normalization](/wiki/batch_normalization) and layer normalization. | Nearly all modern deep networks |
| [Embedding Layer](/wiki/embeddings) | Maps discrete categorical inputs (such as word indices) to dense, continuous vector representations. Functions as a trainable lookup table of size (vocabulary size x embedding dimension). | NLP, recommendation systems, categorical features |
| Dropout Layer | Randomly sets a fraction of input units to zero during training, forcing the network to learn redundant representations and reducing [overfitting](/wiki/overfitting). Inactive during inference. | Regularization in most architectures |

### Dense (Fully Connected) Layers

A [dense layer](/wiki/dense_layer) is the simplest and most general-purpose layer type. Every neuron in a dense layer is connected to every neuron in the preceding layer, meaning the layer computes a full matrix multiplication. Dense layers are sometimes called "fully connected" or "linear" layers. They are the standard building block of [multilayer perceptrons](/wiki/perceptron) and appear at the end of many architectures as classification or regression heads.

The number of learnable parameters in a dense layer is `(input_size x output_size) + output_size` (weights plus biases), which can grow quickly for large inputs. This is why dense layers are often combined with other layer types that reduce dimensionality first.
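
To see how quickly this count grows, the formula can be evaluated directly. The sketch below uses a hypothetical helper, `dense_param_count`, and an illustrative input (a flattened 224 x 224 RGB image feeding 1,000 units) that is not taken from the sources cited here.

```python
def dense_param_count(input_size, output_size):
    """Weights (input_size x output_size) plus one bias per output unit."""
    return input_size * output_size + output_size

# Illustrative case: a flattened 224 x 224 x 3 image connected to 1,000 units
flattened_image = 224 * 224 * 3          # 150,528 input values
n_params = dense_param_count(flattened_image, 1000)
print(f"{n_params:,}")  # 150,529,000
```

A single dense layer on raw pixels would already need about 150 million parameters in this case, which illustrates why dimensionality is usually reduced first.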

### Convolutional Layers

[Convolutional layers](/wiki/convolutional_layer) are the core component of [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs).[8] Instead of connecting every input to every output, a convolutional layer uses small learnable filters (kernels) that slide across the spatial dimensions of the input. Each filter computes a dot product with a local patch of the input, producing a feature map. Four key hyperparameters control a convolutional layer: the number of filters (K), the filter spatial size (F), the stride (S), and the amount of zero-padding (P). The output spatial size is calculated as `(W - F + 2P) / S + 1`, where W is the input width.[10]
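
The output-size formula is easy to check numerically. The helper below, `conv_output_size`, is a hypothetical name for this sketch; it raises an error when the hyperparameters do not tile the input evenly, which is one possible convention (many frameworks round down instead).

```python
def conv_output_size(w, f, p, s):
    """Spatial output size (W - F + 2P) / S + 1 for a square input."""
    span = w - f + 2 * p
    if span % s != 0:
        raise ValueError("filter, padding, and stride do not tile the input evenly")
    return span // s + 1

# 227-wide input, 11x11 filters, no padding, stride 4
print(conv_output_size(227, 11, 0, 4))  # 55

# 5x5 filters with padding 2 and stride 1 preserve the spatial size
print(conv_output_size(32, 5, 2, 1))    # 32
```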

A critical property of convolutional layers is **parameter sharing**: all neurons in a given feature map share the same weights. This dramatically reduces the number of parameters compared to a fully connected layer and encodes the assumption that a feature useful at one spatial location is likely useful at another.[10]
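
The saving from parameter sharing can be made concrete with a short calculation. The function names below are hypothetical, and the layer dimensions (96 filters of size 11 x 11 over a 3-channel input, producing a 55 x 55 output) are an illustrative configuration in the style of the CS231n notes.

```python
def conv_params_shared(num_filters, filter_size, in_depth):
    """Each filter has one set of weights plus one bias, reused at every position."""
    return num_filters * (filter_size * filter_size * in_depth + 1)

def conv_params_unshared(out_size, num_filters, filter_size, in_depth):
    """Hypothetical alternative: every output neuron owns its own weights and bias."""
    neurons = out_size * out_size * num_filters
    return neurons * (filter_size * filter_size * in_depth + 1)

shared = conv_params_shared(num_filters=96, filter_size=11, in_depth=3)
unshared = conv_params_unshared(out_size=55, num_filters=96, filter_size=11, in_depth=3)
print(f"{shared:,}")    # 34,944
print(f"{unshared:,}")  # 105,705,600
```

Under these assumptions, sharing reduces the parameter count by a factor of 55 x 55 = 3,025, the number of spatial positions in each feature map.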

### Pooling Layers

Pooling layers reduce the spatial dimensions of feature maps while retaining the most important information. The most common variant, **max pooling**, selects the maximum value within each local region (typically 2x2 with stride 2), discarding roughly 75% of activations.[10] **Average pooling** computes the mean of each region instead. Pooling layers contain no learnable parameters and operate independently on each depth slice of the input, leaving the depth dimension unchanged. They help reduce computation, memory usage, and overfitting in CNNs.
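
A minimal NumPy sketch of 2x2 max pooling with stride 2 on a single depth slice is shown below. The function name `max_pool_2x2` and the example values are assumptions for illustration; the reshape trick only works when both dimensions are even.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 over one 2-D depth slice."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even for this sketch")
    # Split into non-overlapping 2x2 blocks, then take the max of each block
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

fm = np.array([[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 0, 5, 6],
               [1, 2, 7, 8]])
pooled = max_pool_2x2(fm)
print(pooled)
# [[4 2]
#  [2 8]]
```

The 4x4 input has 16 activations and the output has 4, so three quarters of the values are discarded, matching the 75% figure above.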

### Recurrent Layers

Recurrent layers are designed for sequential data. Unlike feedforward layers that process each input independently, a recurrent layer maintains a hidden state that is updated at each time step, giving the network a form of memory. The basic recurrent layer suffers from the [vanishing gradient problem](/wiki/vanishing_gradient_problem) when processing long sequences, which led to the development of [Long Short-Term Memory (LSTM)](/wiki/long_short-term_memory_lstm) and Gated Recurrent Unit (GRU) cells. These variants use gating mechanisms to control the flow of information, allowing them to capture long-range dependencies more effectively.
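
The hidden-state update can be sketched as follows. The article does not give an update rule, so this example assumes the common basic formulation `h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)`; the function name, sizes, and random weights are all illustrative, and gated variants such as LSTM and GRU are more involved.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One time step of a basic recurrent layer (assumed tanh update rule)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

rng = np.random.default_rng(0)
input_size, hidden_size, steps = 3, 4, 5

# Randomly initialized weights stand in for learned parameters
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)

sequence = rng.normal(size=(steps, input_size))
h = np.zeros(hidden_size)            # initial hidden state
for x_t in sequence:
    # The same weights are reused at every step; only the hidden state changes
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

print(h.shape)  # (4,)
```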

### Attention Layers

[Attention](/wiki/attention) layers compute dynamic, input-dependent weights that determine how much each element in a sequence should focus on every other element. The most widely used form is **scaled dot-product attention**, where queries, keys, and values are derived from the input. The attention score between a query and a key is computed as their dot product divided by the square root of the key dimension, then passed through a softmax to produce weights that are applied to the values.[6]
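
A single-head version of this computation can be sketched in NumPy. In a real layer the queries, keys, and values come from learned projections of the input; here random matrices stand in for them, and the function names are chosen for this example.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for 2-D arrays of shape (seq_len, dim)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarity, scaled
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k, d_v = 4, 64, 64
Q = rng.normal(size=(seq_len, d_k))      # stand-ins for projected inputs
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_v))

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 64) (4, 4)
```

Each row of `weights` describes how strongly one position attends to every position in the sequence, and each row of `output` is the corresponding weighted average of the value vectors.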

**Multi-head attention** extends this by running multiple attention operations in parallel, each with its own learned projection matrices. In the original transformer, the base model used 8 parallel attention heads, each operating in a 64-dimensional subspace (d_k = d_v = 64), and concatenated their outputs.[6] This allows the layer to attend to information from different representation subspaces simultaneously. Multi-head attention is the central mechanism in the [transformer](/wiki/transformer) architecture introduced in the 2017 paper "Attention Is All You Need," and it forms the backbone of models such as [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/gpt_generative_pre-trained_transformer), and their successors.[6]

### Normalization Layers

Normalization layers stabilize training by normalizing the distribution of activations within the network. Without normalization, the distribution of inputs to each layer shifts as the parameters of preceding layers change during training. The batch normalization authors named this phenomenon **internal covariate shift**, observing that "the distribution of each layer's inputs changes during training, as the parameters of the previous layers change."[4] This makes training slower and more sensitive to hyperparameters such as learning rate.

| Normalization Type | What It Normalizes | Best Suited For |
|---|---|---|
| [Batch Normalization](/wiki/batch_normalization) | Across the batch dimension for each feature. Computes mean and variance from all examples in a mini-batch. | CNNs, feedforward networks with large batch sizes |
| Layer Normalization | Across all features for each individual example. Independent of batch size. | [Transformers](/wiki/transformer), [RNNs](/wiki/recurrent_neural_network), small-batch or online settings |
| Group Normalization | Across groups of channels for each example. A middle ground between batch and layer normalization. | Object detection, segmentation with small batches |
| Instance Normalization | Across spatial dimensions for each channel of each example individually. | Style transfer, image generation |

Batch normalization, proposed by Ioffe and Szegedy in 2015, was the first widely adopted normalization technique. It normalizes activations by subtracting the batch mean and dividing by the batch standard deviation, then applies learned scale and shift parameters.[4] The technique can sharply accelerate training: the authors reported that "Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin" on a state-of-the-art image classification model.[4] Layer normalization, proposed by Ba, Kiros, and Hinton in 2016, computes statistics across all features of a single example rather than across the batch. This makes it especially suitable for recurrent networks and transformers, where batch statistics may be unreliable.[5]
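
The difference between the two techniques comes down to the axis over which statistics are computed. The sketch below is a simplification: it omits the learned scale and shift parameters and the running statistics that batch normalization uses at inference time, and the helper name `normalize` and the `eps` value are assumptions.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 5))   # (batch, features)

batch_normed = normalize(x, axis=0)   # per feature, across the mini-batch
layer_normed = normalize(x, axis=1)   # per example, across its features

print(np.allclose(batch_normed.mean(axis=0), 0.0, atol=1e-6))  # True
print(np.allclose(layer_normed.mean(axis=1), 0.0, atol=1e-6))  # True
```

Because `layer_normed` never looks at other examples, it gives the same result for a batch of one, which is why it suits small-batch and sequential settings.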

### Embedding Layers

An [embedding layer](/wiki/embeddings) converts discrete, categorical inputs into dense, continuous vector representations. Rather than using sparse one-hot encodings (where a vocabulary of 50,000 words would require 50,000-dimensional vectors), an embedding layer maintains a trainable weight matrix of shape `(vocabulary_size x embedding_dimension)`. When given an input index, the layer simply looks up the corresponding row in this matrix. The embedding values are learned during training, just like weights in a dense layer, allowing the network to discover meaningful relationships between items. Words with similar meanings, for example, end up with similar embedding vectors.
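
The lookup can be illustrated with plain array indexing. In this sketch a randomly initialized matrix stands in for learned embeddings, and the sizes and token indices are arbitrary; the second half checks that the lookup matches multiplying a one-hot encoding by the same matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
vocabulary_size, embedding_dimension = 10, 4

# Trainable in a real network; random values here for illustration
embedding_matrix = rng.normal(size=(vocabulary_size, embedding_dimension))

token_ids = np.array([2, 7, 2])
vectors = embedding_matrix[token_ids]            # one row per input index

# Equivalent but wasteful: one-hot vectors times the embedding matrix
one_hot = np.eye(vocabulary_size)[token_ids]
via_matmul = one_hot @ embedding_matrix

print(vectors.shape)                      # (3, 4)
print(np.allclose(vectors, via_matmul))   # True
```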

### Dropout Layers

Dropout is a [regularization](/wiki/regularization) technique introduced by Srivastava et al. in 2014.[7] During training, a dropout layer randomly sets a fraction (determined by the dropout rate, typically 0.2 to 0.5) of its input units to zero. As the authors put it, the key idea is to "randomly drop units (along with their connections) from the neural network during training," which "prevents units from co-adapting too much."[7] In the "inverted dropout" formulation used by most modern frameworks, the remaining active units are scaled up by a factor of `1 / (1 - rate)` so that the expected sum of activations remains the same. Because no unit can rely on any particular other unit being present, the network is pushed toward more robust features.[7] During inference, dropout is disabled and all neurons are active. Networks that use dropout may require longer training times, but they typically generalize better to unseen data.
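
The masking and `1 / (1 - rate)` rescaling can be sketched as follows. The function signature is an assumption for this example and does not mirror any particular framework's API.

```python
import numpy as np

def dropout(x, rate, training, rng):
    """Zero a random fraction `rate` of units and rescale the survivors."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return x                                 # inference: layer is inactive
    keep_mask = rng.random(x.shape) >= rate      # True for units that survive
    return x * keep_mask / (1.0 - rate)          # keep the expected value unchanged

rng = np.random.default_rng(0)
x = np.ones(100_000)

train_out = dropout(x, rate=0.5, training=True, rng=rng)
eval_out = dropout(x, rate=0.5, training=False, rng=rng)

print(np.unique(train_out))   # [0. 2.]
```

With a rate of 0.5, surviving units are doubled, so the mean of `train_out` stays close to the mean of the input even though about half the units are zeroed.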

## How are layers composed into an architecture?

The way layers are composed determines the network's overall architecture. Several key design patterns have emerged.

### Sequential Composition

The simplest approach stacks layers one after another in a linear chain: the output of one layer feeds directly into the input of the next. This is how basic feedforward networks and early CNNs such as [VGG](/wiki/vgg) are organized. Frameworks provide convenient abstractions for this pattern (e.g., `torch.nn.Sequential` in PyTorch, `tf.keras.Sequential` in TensorFlow).
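
Conceptually, sequential composition is just function composition. The sketch below uses a hypothetical helper, `sequential`, with toy scalar "layers" to show the idea; it is not how the framework classes named above are implemented.

```python
from functools import reduce

def sequential(*layers):
    """Return a function that applies the given layers in order."""
    def forward(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    return forward

# Toy scalar "layers" chosen only for illustration
def scale(x):
    return 2.0 * x

def shift(x):
    return x - 3.0

def relu(x):
    return max(x, 0.0)

model = sequential(scale, shift, relu)
print(model(1.0))  # 0.0
print(model(4.0))  # 5.0
```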

### Residual (Skip) Connections

Residual connections, introduced in [ResNet](/wiki/resnet) by He et al. in 2015, add the input of a block directly to its output: `output = F(x) + x`. This "skip connection" creates a shortcut path that allows gradients to flow more easily during backpropagation, addressing the vanishing gradient problem in very deep networks. The authors provided "comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth."[2] Before residual connections, training networks deeper than roughly 20 layers was extremely difficult. An ensemble of ResNet models reached a 3.57% top-5 error on the ImageNet test set, winning first place in the ILSVRC 2015 classification task, and the technique has since become standard in transformers and many other architectures.[2]
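
The formula `output = F(x) + x` can be sketched directly. In this example `F` is a hypothetical one-layer transformation with small random weights, standing in for the convolutional stack of a real residual block.

```python
import numpy as np

def residual_block(x, F):
    """Add the block's input to the output of its transformation F."""
    return F(x) + x

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 4))   # small weights, as early in training

def F(v):
    # Illustrative transformation: linear map followed by ReLU
    return np.maximum(W @ v, 0.0)

x = np.array([1.0, -2.0, 0.5, 3.0])
y = residual_block(x, F)

# When F contributes nothing, the block reduces to the identity mapping
identity_out = residual_block(x, lambda v: np.zeros_like(v))
print(np.array_equal(identity_out, x))  # True
```

Because the derivative of `F(x) + x` with respect to `x` is the derivative of `F` plus the identity, the backward pass always has a direct path through the addition, regardless of how small the gradients through `F` become.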

Two design variants exist for residual blocks:

- **Post-activation**: The original ResNet design, where each weight layer follows the sequence convolution, batch normalization, then ReLU activation, and the block's final ReLU is applied after the addition of the skip connection.[2]
- **Pre-activation**: Proposed by He et al. in 2016, this reorders the block so that batch normalization and ReLU come before the convolution. This keeps the skip connection path as a clean identity mapping, which improves gradient flow and produces lower error rates, especially in very deep networks (200+ layers).[3]

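Skip connections are used across many modern architectures. [Transformers](/wiki/transformer) use additive residual connections around their sublayers, while U-Net and DenseNet use a related form of skip connection that concatenates earlier feature maps rather than adding them.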

### Why does network depth matter?

The number of layers in a network (its **depth**) is a critical design choice. Deeper networks can learn more hierarchical and abstract representations: early layers detect simple features (edges, textures), while deeper layers combine these into complex concepts (faces, objects, semantic meaning).[1] However, increasing depth introduces challenges:

- **Vanishing gradients**: Without mitigation strategies, gradients can shrink exponentially as they propagate backward through many layers. For example, the sigmoid derivative never exceeds 0.25, so after just five sigmoid layers the activation derivatives alone scale the gradient by at most 0.25^5, or approximately 0.001.
- **Computational cost**: Each additional layer increases training time, memory usage, and inference latency.
- **Overfitting risk**: Very deep networks with many parameters may memorize training data rather than learning generalizable patterns.
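
The vanishing-gradient arithmetic in the first item can be reproduced with the standard library. This sketch considers only the sigmoid derivative factor and ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0
max_derivative = sigmoid_derivative(0.0)
print(max_derivative)  # 0.25

# Upper bound on the product of activation derivatives across `depth` layers
bounds = {depth: max_derivative ** depth for depth in (1, 5, 10, 20)}
print(bounds[5])  # 0.0009765625
```

At a depth of 20 the same bound falls below one in a trillion, which is why deep sigmoid networks were hard to train without the mitigations described here.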

Solutions to these challenges include residual connections, normalization layers, careful weight initialization (e.g., He initialization, Xavier initialization), and modern activation functions like ReLU that maintain stronger gradients.

## Layer-Wise Training and Transfer Learning

Not all layers in a network need to be trained simultaneously or from scratch.

**Layer-wise pre-training** was an important technique in early deep learning. Hinton et al. (2006) showed that deep networks could be trained effectively by first training each layer as a restricted Boltzmann machine in an unsupervised manner, then fine-tuning the entire network with supervised learning.[9] While less common today due to advances in initialization and normalization, the concept influenced modern approaches.

**[Transfer learning](/wiki/transfer_learning)** leverages the fact that lower layers in a network learn general features (edges, textures, basic patterns) while higher layers learn task-specific features. A pretrained model's lower layers can be frozen (their weights kept fixed) while only the upper layers are [fine-tuned](/wiki/fine_tuning) on a new dataset. This is standard practice in both [computer vision](/wiki/computer_vision) and [natural language processing](/wiki/natural_language_processing), where foundation models trained on massive datasets serve as starting points for downstream tasks.

**Layer-wise learning rate schedules** assign different learning rates to different layers. Typically, lower layers receive smaller learning rates (since their general features need less adjustment), while higher layers receive larger learning rates to adapt to the new task.
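
One possible way to build such a schedule is a geometric decay from the top layer downward, combined with a learning rate of zero for frozen layers. The helper below is a hypothetical sketch: the function name, the decay factor, and the use of a zero rate to represent freezing are all assumptions (frameworks usually freeze layers by excluding their parameters from gradient computation instead).

```python
def layerwise_learning_rates(num_layers, top_lr, decay, num_frozen=0):
    """Per-layer learning rates, index 0 being the lowest layer.

    The top layer gets `top_lr`; each layer below it gets `decay` times the
    rate of the layer above. The lowest `num_frozen` layers get 0.0.
    """
    rates = [top_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]
    return [0.0 if i < num_frozen else lr for i, lr in enumerate(rates)]

rates = layerwise_learning_rates(num_layers=4, top_lr=1e-3, decay=0.5)
print(rates)   # [0.000125, 0.00025, 0.0005, 0.001]

frozen = layerwise_learning_rates(num_layers=4, top_lr=1e-3, decay=0.5, num_frozen=2)
print(frozen)  # [0.0, 0.0, 0.0005, 0.001]
```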

## How are layers implemented in software frameworks?

Modern deep learning frameworks provide layer abstractions that handle the underlying mathematics, memory management, and automatic differentiation.

| Framework | Layer Base Class | Example: Dense Layer | Sequential Model |
|---|---|---|---|
| [PyTorch](/wiki/pytorch) | `torch.nn.Module` | `nn.Linear(in_features, out_features)` | `nn.Sequential(layer1, layer2, ...)` |
| [TensorFlow](/wiki/tensorflow) / Keras | `tf.keras.layers.Layer` | `tf.keras.layers.Dense(units)` | `tf.keras.Sequential([layer1, layer2, ...])` |
| JAX / Flax | `flax.linen.Module` | `nn.Dense(features)` | Defined in the `__call__` method |

In PyTorch, every layer is a subclass of `nn.Module`. Custom layers are created by subclassing `nn.Module`, defining learnable parameters in the `__init__` method, and implementing the computation in the `forward` method. When you call a module as a function (e.g., `layer(input)`), it automatically invokes `forward` and integrates with PyTorch's autograd system for gradient computation.

In TensorFlow/Keras, layers subclass `tf.keras.layers.Layer`. The `build` method creates weights when the layer is first called with data, and the `call` method defines the forward computation. Keras also provides a high-level `Sequential` API for simple stacks of layers and a `Functional` API for more complex architectures with branching and merging.

```python
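# PyTorch: defining a simple custom layer
import torch
import torch.nn as nn


class CustomLayer(nn.Module):
    """Linear transformation, then ReLU, then layer normalization."""

    def __init__(self, in_features, out_features):
        super().__init__()
        # Submodules assigned as attributes are registered automatically,
        # so their weights are included in self.parameters().
        self.linear = nn.Linear(in_features, out_features)
        self.norm = nn.LayerNorm(out_features)

    def forward(self, x):
        # x: (..., in_features) -> (..., out_features)
        return self.norm(torch.relu(self.linear(x)))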
```

## Explain Like I'm 5 (ELI5)

Imagine you are building a tower out of different colored blocks. Each block does one simple job: one block sorts things by color, the next block sorts by size, and the top block puts everything together to make a decision.

A layer in a [neural network](/wiki/neural_network) is like one of those blocks. Information enters at the bottom of the tower (the [input layer](/wiki/input_layer)), and each block (layer) looks at the information and changes it a little bit, adding its own understanding. The blocks in the middle (the [hidden layers](/wiki/hidden_layer)) do most of the thinking. The block at the very top (the [output layer](/wiki/output_layer)) gives you the final answer.

The more blocks you stack, the smarter the tower can be, because each block can learn something new about the information. That is why really smart AI systems, called "deep" learning systems, have many, many layers stacked together.

## References

1. Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. https://www.deeplearningbook.org/
2. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Deep Residual Learning for Image Recognition." *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*. https://arxiv.org/abs/1512.03385
3. He, K., Zhang, X., Ren, S., & Sun, J. (2016). "Identity Mappings in Deep Residual Networks." *European Conference on Computer Vision (ECCV)*. https://arxiv.org/abs/1603.05027
4. Ioffe, S., & Szegedy, C. (2015). "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift." *Proceedings of the 32nd International Conference on Machine Learning (ICML)*. https://arxiv.org/abs/1502.03167
5. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). "Layer Normalization." *arXiv preprint arXiv:1607.06450*. https://arxiv.org/abs/1607.06450
6. Vaswani, A., et al. (2017). "Attention Is All You Need." *Advances in Neural Information Processing Systems (NeurIPS)*. https://arxiv.org/abs/1706.03762
7. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*, 15(56), 1929-1958. https://jmlr.org/papers/v15/srivastava14a.html
8. LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). "Gradient-Based Learning Applied to Document Recognition." *Proceedings of the IEEE*, 86(11), 2278-2324.
9. Hinton, G. E., Osindero, S., & Teh, Y. W. (2006). "A Fast Learning Algorithm for Deep Belief Nets." *Neural Computation*, 18(7), 1527-1554.
10. CS231n Convolutional Neural Networks for Visual Recognition. Stanford University. https://cs231n.github.io/convolutional-networks/

