# Fully Connected Layer

> Source: https://aiwiki.ai/wiki/fully_connected_layer
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **fully connected layer** (also called a **dense layer** or **linear layer**) is a layer in an [artificial neural network](/wiki/neural_network) in which every input value connects to every output value through a learned [weight](/wiki/weight). The layer computes $$\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b})$$: an affine transformation (a matrix multiply of the input vector $$\mathbf{x}$$ by a weight matrix $$\mathbf{W}$$, plus a learnable [bias](/wiki/bias_math_or_bias_term) vector $$\mathbf{b}$$), followed by an optional nonlinear [activation function](/wiki/activation_function) $$f$$. A fully connected layer with *n* inputs and *m* outputs holds $$m(n + 1)$$ learnable parameters (the $$n \times m$$ weights plus *m* biases), which is why these layers dominate the parameter count of classic networks: the three fully connected layers of VGGNet-16 alone account for about 123.6 million of its roughly 138 million parameters, or close to 90% [8]. Fully connected layers form the basis of the [multilayer perceptron](/wiki/multilayer_perceptron) (MLP) and appear as the final classification or regression stage in many [deep learning](/wiki/deep_neural_network) architectures, including [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) and hybrid models.

The term "fully connected" reflects the fact that every input value participates in the computation of every output value, in contrast to [convolutional layers](/wiki/convolutional_layer) (which operate on local spatial regions) or sparse layers (which connect only a subset of inputs to each output). In [Keras](/wiki/keras), the corresponding class is `Dense`, documented as "Just your regular densely-connected NN layer" [13]; in [PyTorch](/wiki/pytorch), it is `torch.nn.Linear`, which the documentation describes as a module that "applies an affine linear transformation to the incoming data: $$y = xA^\top + b$$" [14]; and in [TensorFlow](/wiki/tensorflow), it is `tf.keras.layers.Dense`. The page on the [dense layer](/wiki/dense_layer) covers the same construct under its alternate name.

## Explain like I'm 5 (ELI5)

Imagine a classroom where every student has a string connected to every student in the next classroom. Each string has a different thickness, representing how strong that connection is. When a student pulls on their strings, the students in the next room feel different amounts of pull depending on the string thickness. A fully connected layer works the same way: every input value is connected to every output value, and the network learns how strong each connection should be during training.

## What is the mathematical formula for a fully connected layer?

A fully connected layer performs an [affine transformation](https://en.wikipedia.org/wiki/Affine_transformation) on its input, followed by an optional nonlinear [activation function](/wiki/activation_function). Given an input vector $$\mathbf{x}$$ of dimension $$n$$, a weight matrix $$\mathbf{W}$$ of shape $$m \times n$$, and a bias vector $$\mathbf{b}$$ of dimension $$m$$, the output $$\mathbf{y}$$ of a fully connected layer is:

$$
\mathbf{y} = f(\mathbf{W}\mathbf{x} + \mathbf{b})
$$

where $$f$$ is the activation function. When no activation function is applied (or when $$f$$ is the identity function), the layer performs a purely linear transformation. The number of learnable parameters in a single fully connected layer is:

$$
\text{Parameters} = (n \times m) + m = m(n + 1)
$$

where $$n$$ is the number of input features and $$m$$ is the number of output neurons. The "+m" accounts for the bias vector. If bias is disabled, the parameter count reduces to $$n \times m$$. In framework terms this matches `nn.Linear(in_features, out_features)` in PyTorch, whose weight tensor has shape (out_features, in_features) and whose bias has length out_features, with bias enabled by default [14]; Keras expresses the same operation as `output = activation(dot(input, kernel) + bias)` [13].
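As a worked instance of the formula, consider a layer with $$n = 784$$ inputs and $$m = 128$$ outputs (the sizes are chosen only for illustration, matching a call such as `nn.Linear(784, 128)`):

$$
\text{Parameters} = 128 \times (784 + 1) = 100{,}352 + 128 = 100{,}480
$$

With the bias disabled, the same layer would hold only the $$100{,}352$$ weights.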

### Forward pass

During the forward pass, the layer computes the weighted sum of all inputs for each neuron, adds the bias, and applies the activation function. For a mini-batch of *B* input vectors (each of dimension *n*), the computation can be expressed as a matrix multiplication:

$$
\mathbf{Y} = f(\mathbf{X}\mathbf{W}^\top + \mathbf{b})
$$

where $$\mathbf{X}$$ is a $$B \times n$$ matrix, $$\mathbf{W}$$ is an $$m \times n$$ weight matrix, and $$\mathbf{b}$$ is broadcast across all samples in the batch. This formulation allows efficient computation on GPUs through parallelized [matrix operations](/wiki/tensor).
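The batched formula can be sketched in a few lines of NumPy. This is a minimal illustration rather than any framework's implementation; the function name `dense_forward` and the toy sizes are assumptions made for the example.

```python
import numpy as np

def dense_forward(X, W, b, activation=None):
    """Compute f(X @ W.T + b) for a batch X of shape (B, n).

    W has shape (m, n) and b has shape (m,), as in the text.
    """
    Z = X @ W.T + b  # (B, n) @ (n, m) -> (B, m); b is broadcast over the rows
    return Z if activation is None else activation(Z)

def relu(z):
    return np.maximum(0.0, z)

# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
B, n, m = 4, 3, 2
X = rng.normal(size=(B, n))
W = rng.normal(size=(m, n))
b = rng.normal(size=m)

Y = dense_forward(X, W, b, activation=relu)
print(Y.shape)  # (4, 2)
```

Each row of the result is the single-vector formula applied to the corresponding row of the batch.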

### Backward pass

During [backpropagation](/wiki/gradient), the gradients of the [loss function](/wiki/loss_function) with respect to the layer's weights and biases are computed using the [chain rule](https://en.wikipedia.org/wiki/Chain_rule). For a loss *L*, the gradients are:

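- Gradient with respect to weights: $$dL/d\mathbf{W} = \delta^\top \mathbf{X}$$, where $$\delta = dL/d\mathbf{Z}$$ is the $$B \times m$$ error signal at the pre-activations $$\mathbf{Z} = \mathbf{X}\mathbf{W}^\top + \mathbf{b}$$ (for an elementwise activation, the gradient arriving from the layer above multiplied elementwise by $$f'(\mathbf{Z})$$)
- Gradient with respect to biases: $$dL/d\mathbf{b}$$ = sum of $$\delta$$ over the batch dimension
- Gradient with respect to inputs (passed to previous layer): $$dL/d\mathbf{X} = \delta \mathbf{W}$$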

The auxiliary quantity $$\delta$$ at each layer is computed recursively from the layer above, enabling efficient gradient computation one layer at a time without redundant calculations [1].
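The three gradient expressions can be checked numerically. The sketch below assumes a toy loss $$L = \tfrac{1}{2}\sum Z^2$$ with no activation, so that the error signal is simply $$\delta = Z$$; the loss and the sizes are illustrative assumptions, not part of any particular training setup.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 5, 4, 3
X = rng.normal(size=(B, n))
W = rng.normal(size=(m, n))
b = rng.normal(size=m)

def loss(W, b, X):
    """Toy loss: half the sum of squared pre-activations."""
    Z = X @ W.T + b
    return 0.5 * np.sum(Z ** 2)

# Analytic gradients from the formulas in the list above.
Z = X @ W.T + b
delta = Z                # dL/dZ for this toy loss, shape (B, m)
dW = delta.T @ X         # shape (m, n), same as W
db = delta.sum(axis=0)   # shape (m,), summed over the batch
dX = delta @ W           # shape (B, n), passed to the previous layer

# Central finite difference on a single weight entry as a sanity check.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(W_plus, b, X) - loss(W_minus, b, X)) / (2 * eps)
print(abs(numeric - dW[1, 2]))  # small: analytic and numeric values agree
```

Note that each gradient has the same shape as the quantity it differentiates with respect to, which is a quick way to catch a transposed factor.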

## Activation functions

Fully connected layers are almost always paired with a nonlinear activation function. Without nonlinearity, stacking multiple fully connected layers would be equivalent to a single linear transformation, since the composition of linear functions is itself linear. The choice of activation function affects training dynamics, convergence speed, and the types of functions the network can approximate.
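The collapse of stacked linear layers can be verified directly. In this sketch (sizes and variable names are illustrative assumptions), two layers without an activation are merged into a single weight matrix and bias that produce the same output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=4)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)

# Two fully connected layers with no activation in between.
two_layers = W2 @ (W1 @ x + b1) + b2

# The same map as one layer: W = W2 W1 and b = W2 b1 + b2.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
one_layer = W_merged @ x + b_merged

print(np.allclose(two_layers, one_layer))  # True
```

Inserting a nonlinearity between the two layers breaks this identity, which is what gives depth its additional expressive power.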

| Activation function | Formula | Output range | Typical use case | Key property |
|---|---|---|---|---|
| [ReLU](/wiki/rectified_linear_unit_relu) | $$\max(0, x)$$ | $$[0, \infty)$$ | Hidden layers (default choice) | Constant gradient for positive inputs; mitigates [vanishing gradients](/wiki/vanishing_gradient_problem) |
| [Sigmoid](/wiki/sigmoid_function) | $$1 / (1 + e^{-x})$$ | $$(0, 1)$$ | Binary classification output | Outputs interpretable as probabilities |
| [Tanh](https://en.wikipedia.org/wiki/Hyperbolic_functions) | $$(e^x - e^{-x}) / (e^x + e^{-x})$$ | $$(-1, 1)$$ | Hidden layers (older networks) | Zero-centered output |
| [Softmax](/wiki/softmax) | $$e^{x_i} / \sum_j e^{x_j}$$ | $$(0, 1)$$, sums to 1 | Multi-class classification output | Produces a probability distribution |
| Leaky ReLU | $$x$$ if $$x > 0$$; $$0.01x$$ otherwise | $$(-\infty, \infty)$$ | Hidden layers | Avoids dying ReLU problem |
| ELU | $$x$$ if $$x > 0$$; $$\alpha(e^x - 1)$$ otherwise | $$(-\alpha, \infty)$$ | Hidden layers | Smooth transition at zero |
| [GELU](https://en.wikipedia.org/wiki/Activation_function#GELU) | $$x \Phi(x)$$ | approx $$(-0.17, \infty)$$ | [Transformer](/wiki/transformer) hidden layers | Used in [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/gpt_generative_pre-trained_transformer) |
| Swish (SiLU) | $$x \cdot \mathrm{sigmoid}(x)$$ | approx $$(-0.278, \infty)$$ | Hidden layers in modern architectures | Self-gated; smooth |

For hidden layers in modern networks, ReLU and its variants are the most common choices. For output layers, the activation function depends on the task: sigmoid for binary classification, softmax for multi-class classification, and linear (no activation) for [regression](/wiki/regression_model) [2].
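A few of the functions in the table can be written as short NumPy functions. This is a minimal sketch, not a library implementation; the max-subtraction in `softmax` is a common numerical-stability trick and is an addition to the formula shown in the table.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    # Subtracting the max leaves the result unchanged but avoids overflow.
    shifted = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

logits = np.array([2.0, -1.0, 0.0])
print(relu(logits))           # negative entries are clipped to 0
print(leaky_relu(logits))     # negative entries are scaled by 0.01
print(sigmoid(0.0))           # 0.5
print(softmax(logits).sum())  # sums to 1 (up to floating-point rounding)
```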

## Weight initialization

Proper [weight initialization](/wiki/weight) is essential for training fully connected layers effectively. Poor initialization can lead to [vanishing](/wiki/vanishing_gradient_problem) or [exploding gradients](/wiki/exploding_gradient_problem), causing the network to train slowly or fail to converge entirely. Two widely used initialization methods are designed to maintain stable activation and gradient magnitudes across layers.

| Initialization method | Formula (variance) | Best used with | Proposed by |
|---|---|---|---|
| Xavier (Glorot) uniform | $$\mathrm{Var}(W) = 2 / (n_{\text{in}} + n_{\text{out}})$$ | Sigmoid, Tanh | Glorot and Bengio, 2010 [3] |
| He (Kaiming) normal | $$\mathrm{Var}(W) = 2 / n_{\text{in}}$$ | ReLU, Leaky ReLU | He et al., 2015 [4] |

Xavier initialization draws weights from a distribution (uniform or normal) scaled so that the variance of activations remains approximately constant across layers during the forward pass, and the variance of gradients remains approximately constant during the backward pass. This is achieved by setting the variance to $$2 / (n_{\text{in}} + n_{\text{out}})$$, where $$n_{\text{in}}$$ is the number of input neurons (fan-in) and $$n_{\text{out}}$$ is the number of output neurons (fan-out) [3].

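He initialization adapts this approach to ReLU networks. Because ReLU zeroes out roughly half of the activations, it roughly halves the signal variance passed forward, so Xavier-scaled weights let activations shrink layer by layer. He initialization compensates by setting the variance to $$2 / n_{\text{in}}$$, double the $$1 / n_{\text{in}}$$ that would preserve forward variance for a linear unit [4].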

Bias vectors are typically initialized to zero, though some practitioners use small positive values (e.g., 0.01) for ReLU layers to ensure that most neurons are active at the start of training. In Keras, the `Dense` layer defaults to `kernel_initializer="glorot_uniform"` (Xavier) and `bias_initializer="zeros"` [13].
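The two variance targets can be sketched with NumPy samplers. The function names below are hypothetical helpers, not framework APIs, and framework implementations may differ in detail (for example, by truncating the normal distribution). The uniform limit $$\sqrt{6 / (n_{\text{in}} + n_{\text{out}})}$$ follows from the fact that $$U(-a, a)$$ has variance $$a^2 / 3$$.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Uniform weights with variance 2 / (n_in + n_out)."""
    limit = np.sqrt(6.0 / (n_in + n_out))  # Var of U(-a, a) is a^2 / 3
    return rng.uniform(-limit, limit, size=(n_out, n_in))

def he_normal(n_in, n_out, rng):
    """Normal weights with variance 2 / n_in."""
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 512, 256
W_glorot = glorot_uniform(n_in, n_out, rng)
W_he = he_normal(n_in, n_out, rng)

# Empirical variances should land close to the targets from the table.
print(W_glorot.var(), 2.0 / (n_in + n_out))
print(W_he.var(), 2.0 / n_in)
```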

## What role do fully connected layers play in neural network architectures?

### Multilayer perceptrons

The [multilayer perceptron](/wiki/multilayer_perceptron) (MLP) is the simplest and oldest [deep learning](/wiki/deep_neural_network) architecture, consisting entirely of fully connected layers. An MLP has an [input layer](/wiki/input_layer), one or more [hidden layers](/wiki/hidden_layer), and an [output layer](/wiki/output_layer). Each layer is fully connected to the next, and nonlinear activation functions are applied after each hidden layer. MLPs are universal function approximators: a single hidden layer with a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary accuracy, as proven by the [universal approximation theorem](/wiki/neural_network) (Cybenko, 1989; Hornik, 1991) [5][6]. Cybenko's 1989 result established that finite linear combinations built from a single hidden layer with "any continuous sigmoidal nonlinearity" can uniformly approximate any continuous function on the unit hypercube [5].

Despite this theoretical result, shallow networks may require an impractically large number of neurons. Deeper networks with multiple hidden layers can often represent the same function with far fewer total parameters, which is one motivation for using deep architectures.

### Convolutional neural networks

In [CNNs](/wiki/convolutional_neural_network), fully connected layers traditionally appear at the end of the network, after a series of [convolutional](/wiki/convolutional_layer) and [pooling](/wiki/pooling) layers have extracted spatial features from the input. The transition from convolutional to fully connected layers requires flattening the three-dimensional feature map (height x width x channels) into a one-dimensional vector.

A typical CNN architecture follows the pattern:

INPUT -> [CONV -> RELU]* -> POOL -> ... -> FLATTEN -> FC -> RELU -> FC -> OUTPUT

The fully connected layers at the end integrate information from all spatial locations and channels, combining learned features into a final prediction. In classification tasks, the last FC layer has as many neurons as there are classes, and a [softmax](/wiki/softmax) activation produces class probabilities [7].

### Why do fully connected layers dominate CNN parameter counts?

Fully connected layers often account for the majority of parameters in CNN architectures. The table below shows the parameter distribution in VGGNet-16, one of the classic CNN architectures. VGGNet-16 has 13 convolutional layers and 3 fully connected layers (4,096, 4,096, and 1,000 units), and over 138 million parameters in total [8].

| Layer | Input size | Output size | Parameters |
|---|---|---|---|
| FC-1 | 7 x 7 x 512 (25,088) | 4,096 | 102,764,544 |
| FC-2 | 4,096 | 4,096 | 16,781,312 |
| FC-3 (output) | 4,096 | 1,000 | 4,097,000 |
| **Total FC parameters** | | | **123,642,856** |
| Total network parameters | | | ~138,000,000 |
| FC share of total | | | ~89.6% |

This heavy parameter cost motivated the development of architectures that reduce or eliminate fully connected layers entirely.
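The figures in the table can be reproduced with a few lines of arithmetic. The helper name `dense_params` is an assumption for this sketch, and the network total uses the rounded 138 million figure quoted above rather than an exact count.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

# (inputs, outputs) of the three VGGNet-16 FC layers from the table.
fc_layers = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]
fc_counts = [dense_params(n_in, n_out) for n_in, n_out in fc_layers]
fc_total = sum(fc_counts)

approx_network_total = 138_000_000  # rounded total, as quoted in the text
share = fc_total / approx_network_total

print(fc_counts)        # [102764544, 16781312, 4097000]
print(fc_total)         # 123642856
print(f"{share:.1%}")   # 89.6%
```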

### AlexNet

[AlexNet](https://en.wikipedia.org/wiki/AlexNet) (Krizhevsky et al., 2012) was one of the first deep CNN architectures to demonstrate the power of fully connected layers combined with [dropout](/wiki/dropout_regularization). The published paper reports a network that "has 60 million parameters and 500,000 neurons" and "consists of five convolutional layers" plus three fully connected layers with a final 1000-way softmax [9]. Its three FC layers have 4,096, 4,096, and 1,000 neurons respectively. To control the cost of so many dense parameters, the authors wrote that "to reduce overfitting in the globally connected layers we employed a new regularization method that proved to be very effective": [dropout](/wiki/dropout_regularization) with probability 0.5, applied to the first two FC layers during training to prevent [overfitting](/wiki/overfitting) [9].

### How have modern architectures reduced fully connected layers?

Several modern architectures have reduced or removed fully connected layers in favor of alternatives that require fewer parameters.

**Global average pooling (GAP).** Introduced in the Network in Network paper (Lin et al., 2013), global average pooling computes the spatial average of each feature map in the final convolutional layer, producing one value per channel. These values are fed directly to a softmax classifier without any intermediate FC layers. The authors argued that, "with enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers" [10]. GAP also has no parameters to learn, which significantly reduces the model's parameter count [10].

**[ResNet](/wiki/resnet)** (He et al., 2015) uses global average pooling followed by a single FC layer for classification. ResNet-50 contains 49 convolutional layers and 1 fully connected layer, reducing the total parameter count to approximately 25.6 million (compared to VGGNet-16's roughly 138 million) [4].

**[GoogLeNet/Inception](/wiki/inception)** (Szegedy et al., 2015) also eliminates intermediate FC layers, relying on global average pooling and a single FC output layer.

**[Vision Transformers](/wiki/transformer)** (ViT) use an MLP head (one or two FC layers) on top of a [transformer](/wiki/transformer) encoder, applied to the classification token rather than to flattened feature maps.
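The global average pooling step described above can be sketched in NumPy. The 7 x 7 x 512 feature map and the 1,000-class linear classifier are assumptions chosen to mirror the VGGNet-16 numbers; the comparison shows how much smaller a classifier becomes when it sees one value per channel instead of the flattened feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 512))  # (height, width, channels)

# Global average pooling: one value per channel, no learnable parameters.
pooled = feature_map.mean(axis=(0, 1))
print(pooled.shape)  # (512,)

# Parameter count of a linear classifier placed on top of each representation.
num_classes = 1000
flatten_fc_params = feature_map.size * num_classes + num_classes
gap_fc_params = pooled.size * num_classes + num_classes
print(flatten_fc_params)  # 25089000
print(gap_fc_params)      # 513000
```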

## Regularization techniques for fully connected layers

Fully connected layers are prone to [overfitting](/wiki/overfitting) because of their large number of parameters. Several [regularization](/wiki/regularization) techniques have been developed to address this.

### Dropout

[Dropout](/wiki/dropout_regularization) is the most widely used regularization method for fully connected layers. During training, each neuron's output is set to zero with a specified probability *p* (commonly 0.5 for hidden FC layers). This prevents neurons from co-adapting too strongly and acts as an implicit form of model averaging across an exponential number of sub-networks. At inference time, dropout is turned off, and the weights are scaled by (1 - p) to maintain consistent expected output magnitudes [11].

Dropout was introduced by Hinton et al. (2012) and formalized by Srivastava et al. (2014), who demonstrated its effectiveness across vision, speech, and text domains [11].
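The train and inference behaviour can be sketched as follows. The first two functions follow the formulation described above (mask during training, scale by 1 - p at inference, applied here to activations as a stand-in for scaling the outgoing weights). The third shows a common variant, often called inverted dropout, that rescales during training instead; all function names are hypothetical.

```python
import numpy as np

def dropout_train(x, p, rng):
    """Zero each element with probability p, without rescaling."""
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask

def dropout_inference(x, p):
    """No masking at inference; scale by (1 - p) to match the training mean."""
    return x * (1.0 - p)

def inverted_dropout_train(x, p, rng):
    """Variant: rescale kept units by 1 / (1 - p) so inference needs no scaling."""
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
p = 0.5

train_mean = dropout_train(x, p, rng).mean()              # close to 0.5
inference_mean = dropout_inference(x, p).mean()           # exactly 0.5
inverted_mean = inverted_dropout_train(x, p, rng).mean()  # close to 1.0
print(train_mean, inference_mean, inverted_mean)
```

Both schemes keep the expected activation consistent between training and inference; they differ only in where the scaling factor is applied.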

### L1 and L2 regularization

[L1 regularization](/wiki/l1_regularization) adds the sum of absolute weight values to the loss function, encouraging sparsity (many weights become exactly zero). [L2 regularization](/wiki/l2_regularization) (often called weight decay, to which it is equivalent under plain stochastic gradient descent up to a rescaling of the coefficient) adds the sum of squared weight values, penalizing large weights and encouraging the network to distribute information across many connections rather than relying on a few strong ones.
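Both penalties are simple functions of the weight matrix. In the sketch below, the coefficient `lam`, the data loss value, and the weights are made-up numbers for illustration only.

```python
import numpy as np

def l1_penalty(W, lam):
    """lam times the sum of absolute weight values."""
    return lam * np.sum(np.abs(W))

def l2_penalty(W, lam):
    """lam times the sum of squared weight values."""
    return lam * np.sum(W ** 2)

W = np.array([[0.5, -2.0],
              [0.0, 1.0]])
data_loss = 0.3  # hypothetical loss on the training data
lam = 0.01       # hypothetical regularization strength

total_l1 = data_loss + l1_penalty(W, lam)  # 0.3 + 0.01 * 3.5
total_l2 = data_loss + l2_penalty(W, lam)  # 0.3 + 0.01 * 5.25
print(total_l1, total_l2)
```

The large weight (-2.0) contributes 4.0 of the 5.25 squared sum but only 2.0 of the 3.5 absolute sum, which illustrates why the L2 term penalizes large weights more heavily.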

### Batch normalization

[Batch normalization](/wiki/batch_normalization) normalizes the inputs to each layer by subtracting the batch mean and dividing by the batch standard deviation, then applies a learnable scale and shift. It can be applied to fully connected layers as well as convolutional layers. When batch normalization directly follows an FC layer, the layer's bias term is typically omitted: the mean subtraction cancels any constant bias, and the normalization step already includes a learnable shift parameter.
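The reason the bias becomes redundant can be seen numerically. The sketch below uses a simplified training-mode batch normalization (no running statistics), with names and sizes chosen for illustration; normalizing the pre-activations gives the same result with or without the FC bias.

```python
import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch, then scale and shift."""
    mean = Z.mean(axis=0)
    var = Z.var(axis=0)
    return gamma * (Z - mean) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))  # batch of 8 inputs with 4 features
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
gamma, beta = np.ones(3), np.zeros(3)

with_bias = batch_norm(X @ W.T + b, gamma, beta)
without_bias = batch_norm(X @ W.T, gamma, beta)

# The per-feature mean subtraction removes b, so the outputs coincide.
print(np.allclose(with_bias, without_bias))  # True
```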

## How do fully connected layers differ from convolutional layers?

The table below summarizes the main differences between fully connected and [convolutional layers](/wiki/convolutional_layer).

| Property | Fully connected layer | Convolutional layer |
|---|---|---|
| Connectivity | Every input connected to every output | Each output connected to a local region of the input |
| [Parameter](/wiki/parameter) sharing | No; each connection has a unique weight | Yes; the same filter weights are applied across all spatial positions |
| Spatial awareness | None; input is treated as a flat vector | Preserves spatial structure (height, width, channels) |
| Translation equivariance | No | Yes, through parameter sharing (pooling adds approximate translation invariance) |
| Parameter count | $$n_{\text{in}} \times n_{\text{out}} + n_{\text{out}}$$ | $$(\text{filter}_h \times \text{filter}_w \times \text{channels}_{\text{in}} + 1) \times \text{channels}_{\text{out}}$$ |
| Typical use | Classification heads, regression output, MLP hidden layers | Feature extraction from images, audio, sequences |
| Input format | 1D vector | 2D or 3D tensor |

Convolutional layers can be viewed as a special case of fully connected layers with two constraints: local connectivity and parameter sharing. Conversely, any fully connected layer can be expressed as a [convolutional layer](/wiki/convolutional_layer) with a filter size equal to the full spatial extent of the input. A fully connected layer that operates on a 1 x 1 spatial input is identical to a 1 x 1 convolution applied at a single location, which is why 1 x 1 convolutions are sometimes described as position-wise fully connected layers. This equivalence is used in practice to convert trained FC layers to convolutional layers for efficient sliding-window inference over larger images [7].

## FC-to-convolutional layer conversion

A trained fully connected layer can be converted to an equivalent convolutional layer by reshaping the weight matrix into a set of filters. For example, an FC layer that takes a 7 x 7 x 512 input and produces 4,096 outputs can be replaced by a convolutional layer with 4,096 filters of size 7 x 7 x 512. Both representations compute the same function on inputs of the original size.

The practical benefit of this conversion is efficiency when applying a classifier to images larger than the training input size. Instead of cropping the image into overlapping patches and running each through the network separately, the converted convolutional network can process the full image in a single forward pass, sharing computation across overlapping regions. This technique was described in the Stanford CS231n course materials and has been applied in object detection frameworks [7].
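The conversion can be sketched on a small example. The sizes (a 3 x 3 x 2 input and 4 outputs) and the helper `conv_valid` are assumptions for illustration; the loop-based convolution is written for clarity, not speed. The FC weight matrix is reshaped into filters, the two layers agree on an input of the original size, and the convolutional form yields a grid of outputs on a larger input.

```python
import numpy as np

rng = np.random.default_rng(0)
H, Wd, C, n_out = 3, 3, 2, 4
W_fc = rng.normal(size=(n_out, H * Wd * C))  # trained FC weights (random here)
b = rng.normal(size=n_out)

# Reshape each FC row into one filter covering the full input extent.
filters = W_fc.reshape(n_out, H, Wd, C)

def conv_valid(image, filters, bias):
    """Stride-1 'valid' cross-correlation; image is (height, width, channels)."""
    n_filters, fh, fw, _ = filters.shape
    out_h = image.shape[0] - fh + 1
    out_w = image.shape[1] - fw + 1
    out = np.empty((out_h, out_w, n_filters))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + fh, j:j + fw, :]
            out[i, j] = np.tensordot(filters, patch, axes=3) + bias
    return out

# Same function on an input of the original size.
x = rng.normal(size=(H, Wd, C))
fc_out = W_fc @ x.reshape(-1) + b
conv_out = conv_valid(x, filters, b)           # shape (1, 1, 4)
print(np.allclose(conv_out[0, 0], fc_out))     # True

# On a larger input, one pass yields the FC output at every window position.
big = rng.normal(size=(5, 5, C))
score_map = conv_valid(big, filters, b)
print(score_map.shape)  # (3, 3, 4)
```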

## Vanishing gradient problem

Deep networks composed of many fully connected layers historically suffered from the [vanishing gradient problem](/wiki/vanishing_gradient_problem). With sigmoid or [tanh](/wiki/activation_function) activation functions, the derivative is small: at most 0.25 for sigmoid and at most 1 for tanh, and close to 0 wherever the unit saturates. During backpropagation, these small factors are multiplied across many layers, causing the gradient signal to shrink exponentially as it propagates toward earlier layers. This makes it extremely difficult to update the weights of early layers, effectively preventing the network from learning [12].
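The exponential shrinkage can be illustrated with the sigmoid derivative alone. This sketch deliberately ignores the weight matrices (an assumption made to isolate the activation's contribution) and multiplies the largest possible sigmoid derivative across several depths:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The sigmoid derivative peaks at x = 0.
max_grad = sigmoid_grad(0.0)
print(max_grad)  # 0.25

# Best-case product of activation derivatives across `depth` layers.
for depth in (5, 10, 20):
    print(depth, max_grad ** depth)
```

Even in this best case, ten layers scale the gradient by less than one millionth; saturated units, whose derivatives are far below 0.25, make the effect stronger.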

Sepp Hochreiter formally identified this problem in his 1991 diploma thesis [12], and Bengio et al. (1994) provided further analysis showing that learning long-range dependencies with gradient descent is inherently difficult for deep networks with saturating activations [15].

Several innovations addressed this problem:

- **[ReLU](/wiki/rectified_linear_unit_relu) activation functions** provide a constant gradient of 1 for positive inputs, allowing gradients to flow without shrinking.
- **Proper weight initialization** (Xavier, He) keeps activation and gradient magnitudes stable across layers.
- **[Batch normalization](/wiki/batch_normalization)** reduces internal covariate shift, keeping activations in a stable range.
- **Residual connections** (as in [ResNets](/wiki/resnet)) provide shortcut paths that allow gradients to skip layers entirely.

## Implementation examples

### Keras (TensorFlow)

```python
import tensorflow as tf

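# MLP mapping 784 input features to probabilities over 10 classes
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                     # declare the input shape explicitly
    tf.keras.layers.Dense(128, activation='relu'),    # hidden FC layer, 784 -> 128
    tf.keras.layers.Dropout(0.5),                     # active only during training
    tf.keras.layers.Dense(64, activation='relu'),     # hidden FC layer, 128 -> 64
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')   # output layer: class probabilities
])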
```

In Keras, `Dense(128, activation='relu')` creates a fully connected layer with 128 neurons and ReLU activation. The `kernel_initializer` defaults to Glorot uniform (Xavier), and `bias_initializer` defaults to zeros [13].

### PyTorch

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Three-layer MLP: 784 inputs -> 128 -> 64 -> 10 output logits."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.5)  # active only in training mode

    def forward(self, x):
        x = self.dropout(self.relu(self.fc1(x)))
        x = self.dropout(self.relu(self.fc2(x)))
        x = self.fc3(x)  # raw logits; no softmax is applied here
        return x
```

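In PyTorch, `nn.Linear(784, 128)` creates a fully connected layer that maps 784 inputs to 128 outputs. The layer computes $$y = xW^\top + b$$, with bias enabled by default [14]. By default, both weights and biases are drawn from the uniform distribution $$U(-\sqrt{k}, \sqrt{k})$$ with $$k = 1/\text{in\_features}$$ [14]. The weights are sampled through `kaiming_uniform_` with a non-default `a=sqrt(5)` argument, so the resulting range is narrower than that of standard He initialization. Unlike the Keras example, this module returns raw logits without a softmax; in PyTorch the softmax is usually folded into the loss function, as in `nn.CrossEntropyLoss`.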

## Computational considerations

### Memory usage

Fully connected layers are memory-intensive because they store a dense weight matrix with $$n_{\text{in}} \times n_{\text{out}}$$ entries plus $$n_{\text{out}}$$ bias values. For a layer with 25,088 inputs and 4,096 outputs (as in VGGNet's first FC layer), the weight matrix alone requires approximately 411 MB (392 MiB) in 32-bit floating point. During training, additional memory is needed for storing activations (for backpropagation), gradients, and optimizer states (e.g., momentum, adaptive learning rate statistics).
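The memory figure follows from simple arithmetic on the layer sizes given above, at 4 bytes per 32-bit float. The result is 392 in binary units (MiB, $$2^{20}$$ bytes), which corresponds to about 411 in decimal megabytes:

```python
n_in, n_out = 25_088, 4_096
bytes_per_float32 = 4

weight_bytes = n_in * n_out * bytes_per_float32
print(weight_bytes)                            # 411041792
print(f"{weight_bytes / 1e6:.1f} MB")          # 411.0 MB (decimal megabytes)
print(f"{weight_bytes / 2**20:.1f} MiB")       # 392.0 MiB (binary mebibytes)
```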

### Computation cost

The forward pass of a fully connected layer is dominated by a matrix multiplication of complexity $$O(B \times n_{\text{in}} \times n_{\text{out}})$$, where B is the [batch size](/wiki/batch_size). Modern GPUs and hardware accelerators such as [TPUs](/wiki/tpu) are highly optimized for these dense matrix operations. For inference on edge devices, techniques such as [quantization](/wiki/quantization) (reducing weight precision from 32-bit to 8-bit or lower) and pruning (removing near-zero weights) can significantly reduce both memory and computation requirements.

### Comparison with convolutions

Although a convolutional layer may produce a higher-dimensional output than a fully connected layer, it typically has far fewer parameters due to weight sharing. A convolutional layer with 64 filters of size 3 x 3 x 3 has only 1,792 parameters, exactly as many as a fully connected layer mapping a single 27-dimensional patch to 64 outputs. The difference becomes dramatic for larger inputs: a fully connected layer on a 224 x 224 x 3 input with 64 outputs would require 9,633,856 parameters, while a 3 x 3 convolutional layer with 64 filters still requires only 1,792 parameters, because the same filters are reused at every spatial position.
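These counts can be reproduced from the two parameter-count formulas given in the comparison table. The helper names are assumptions for this sketch.

```python
def conv_params(filter_h, filter_w, channels_in, channels_out):
    """(filter_h * filter_w * channels_in + 1) weights and bias per output channel."""
    return (filter_h * filter_w * channels_in + 1) * channels_out

def dense_params(n_in, n_out):
    """n_in * n_out weights plus n_out biases."""
    return n_in * n_out + n_out

conv_count = conv_params(3, 3, 3, 64)           # independent of image size
dense_patch = dense_params(3 * 3 * 3, 64)       # FC layer on one 27-value patch
dense_image = dense_params(224 * 224 * 3, 64)   # FC layer on the whole image

print(conv_count)   # 1792
print(dense_patch)  # 1792
print(dense_image)  # 9633856
```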

## Applications

Fully connected layers serve several distinct roles depending on where they appear in a network.

| Application | Description | Example architecture |
|---|---|---|
| Classification head | Maps learned features to class probabilities via [softmax](/wiki/softmax) | [VGGNet](/wiki/vgg), [AlexNet](https://en.wikipedia.org/wiki/AlexNet), [ResNet](/wiki/resnet) |
| [Regression](/wiki/regression_model) output | Produces continuous-valued predictions (single neuron, linear activation) | MLP for price prediction, age estimation |
| Feature embedding | Maps inputs to a lower-dimensional [embedding](/wiki/embedding_vector) space | Siamese networks, face verification |
| Encoder/decoder bottleneck | Compresses representation in [autoencoders](/wiki/variational_autoencoder) or [variational autoencoders](/wiki/variational_autoencoder) | VAE latent space |
| MLP blocks in [transformers](/wiki/transformer) | Two-layer FC networks (expand then contract) applied position-wise | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/gpt_generative_pre-trained_transformer) |
| Reinforcement learning value/policy heads | Maps state representation to action values or policy distribution | [DQN](/wiki/deep_q-network_dqn), actor-critic networks |

## When were fully connected layers invented?

Fully connected layers trace their origins to the earliest work on artificial neural networks. Warren McCulloch and Walter Pitts proposed the first mathematical model of an artificial neuron in 1943. Frank Rosenblatt introduced the [perceptron](/wiki/perceptron) in 1958, a single-layer network capable of learning linearly separable functions. The limitations of single-layer perceptrons, notably their inability to learn the XOR function (as demonstrated by Minsky and Papert in 1969), led to reduced interest in neural networks during the first "AI winter."

The development of the backpropagation algorithm for training multi-layer networks, independently derived by several researchers and popularized by Rumelhart, Hinton, and Williams in 1986, revived interest in fully connected architectures. Backpropagation enabled efficient gradient computation across multiple FC layers, making it feasible to train deeper networks [1].

The [universal approximation theorem](/wiki/neural_network) (Cybenko, 1989; Hornik, 1991) provided theoretical justification for using fully connected networks, proving that a single hidden layer with enough neurons can approximate any continuous function. However, practical training of deep FC networks remained difficult due to vanishing gradients until the development of ReLU activations, proper initialization schemes, and dropout regularization in the 2010s [5][6].

Today, pure fully connected architectures (MLPs) have been largely replaced by specialized architectures for structured data types (CNNs for images, [RNNs](/wiki/recurrent_neural_network)/[transformers](/wiki/transformer) for sequences). However, fully connected layers remain a fundamental building block within these architectures, serving as classification heads, embedding layers, and MLP blocks.

## References

1. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
2. Goodfellow, I., Bengio, Y., and Courville, A. (2016). *Deep Learning*. MIT Press. Chapter 6: Deep Feedforward Networks.
3. Glorot, X. and Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." *Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS)*.
4. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Deep Residual Learning for Image Recognition." *arXiv:1512.03385*. (ResNet-50 parameter count.) See also He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *arXiv:1502.01852*.
5. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals, and Systems*, 2(4), 303-314.
6. Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." *Neural Networks*, 4(2), 251-257.
7. Karpathy, A. "CS231n Convolutional Neural Networks for Visual Recognition." Stanford University. https://cs231n.github.io/convolutional-networks/
8. Simonyan, K. and Zisserman, A. (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition." *arXiv:1409.1556*.
9. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 25. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
10. Lin, M., Chen, Q., and Yan, S. (2013). "Network In Network." *arXiv:1312.4400*. https://arxiv.org/abs/1312.4400
11. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). "Dropout: A Simple Way to Prevent Neural Networks from Overfitting." *Journal of Machine Learning Research*, 15(56), 1929-1958.
12. Hochreiter, S. (1991). "Untersuchungen zu dynamischen neuronalen Netzen." Diploma thesis, Technische Universität München.
13. Keras documentation. "Dense layer." https://keras.io/api/layers/core_layers/dense/
14. PyTorch documentation. "torch.nn.Linear." https://pytorch.org/docs/stable/generated/torch.nn.Linear.html
15. Bengio, Y., Simard, P., and Frasconi, P. (1994). "Learning long-term dependencies with gradient descent is difficult." *IEEE Transactions on Neural Networks*, 5(2), 157-166.