# Feedforward Neural Network (FFN)

> Source: https://aiwiki.ai/wiki/feedforward_neural_network_ffn
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

A **feedforward neural network** (FFN), also called a **multilayer perceptron** (MLP) when it has multiple layers, is a type of [artificial neural network](/wiki/neural_network) in which information flows in one direction only, from the input layer through any [hidden layers](/wiki/hidden_layer) to the [output layer](/wiki/output_layer), with no cycles, loops, or feedback connections in the graph of the network. This unidirectional flow distinguishes FFNs from [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs), which contain feedback loops that allow information from later processing stages to influence earlier ones.

Feedforward networks are among the oldest and most widely studied neural network architectures, and they remain central to modern AI: the position-wise feedforward sublayer inside every [transformer](/wiki/transformer) block is itself a small FFN, and these sublayers account for roughly two-thirds of the parameters in a transformer model.[16] The universal approximation theorem, first proved in 1989, shows that a feedforward network with a single hidden layer can approximate any continuous function on a compact set to arbitrary accuracy, given enough hidden units, which is why FFNs are described as universal approximators.[6] Despite the rise of more specialized architectures such as [convolutional neural networks](/wiki/convolutional_neural_network) (CNNs) and transformers, plain feedforward networks remain a workhorse for tabular data, function approximation, classification, and regression tasks.

## ELI5 (explain like I'm 5)

Imagine a toy factory with three rooms in a row. In the first room, workers receive raw materials (like plastic, paint, and screws). They pass those materials through a window into the second room, where a different team puts them together and paints them. Then the half-finished toy goes through another window into the third room, where the final team adds stickers, checks for problems, and boxes it up. Materials only move forward through the rooms; nobody sends anything backward.

A feedforward neural network works the same way. Data enters the first layer, gets transformed by the middle layers (the "hidden" layers), and comes out the other end as an answer. Each layer does its own small job, and the information only moves in one direction. During training, a supervisor checks the final answer, figures out what each room got wrong, and tells every room how to adjust. Over time, the factory learns to build exactly the right answer.

## When was the feedforward neural network invented?

The development of feedforward neural networks spans over eight decades, moving through periods of rapid progress and stagnation.

### McCulloch-Pitts neuron (1943)

Warren McCulloch and Walter Pitts published "A Logical Calculus of the Ideas Immanent in Nervous Activity" in 1943.[1] Their paper proposed a mathematical model of a biological neuron as a simple binary threshold unit. Each McCulloch-Pitts neuron receives binary inputs, computes a weighted sum, and fires (outputs 1) if the sum exceeds a threshold. McCulloch and Pitts showed that networks of these units can compute any Boolean function, establishing the theoretical link between neural networks and computation.[1]

### The perceptron (1958)

Frank Rosenblatt introduced the [perceptron](/wiki/perceptron) at the Cornell Aeronautical Laboratory in 1958.[2] Unlike the fixed-weight McCulloch-Pitts neuron, the perceptron had adjustable weights that could be learned from data through a supervised learning rule. The perceptron was a single-layer network capable of [binary classification](/wiki/binary_classification) for linearly separable patterns.[2] Rosenblatt's work generated significant excitement about the potential of neural networks.

### The XOR problem and the first AI winter (1969)

In 1969, Marvin Minsky and Seymour Papert published *Perceptrons*, a book that rigorously analyzed the limitations of single-layer perceptrons.[3] They proved that a single-layer perceptron cannot learn the XOR function because XOR is not linearly separable.[3] Although multilayer networks could in principle solve XOR, no effective training algorithm for multilayer networks was widely known at the time. The book contributed to a sharp decline in funding and interest in neural network research, a period often called the first "AI winter."

### Early backpropagation work (1970s)

Seppo Linnainmaa published the general method of automatic differentiation (reverse mode), which is the mathematical foundation of [backpropagation](/wiki/backpropagation), in his 1970 master's thesis. Paul Werbos described applying gradient descent to neural networks in his 1974 PhD thesis and further developed the idea in a 1982 publication. However, these contributions did not receive widespread attention at the time.

### Backpropagation breakthrough (1986)

David Rumelhart, Geoffrey Hinton, and Ronald Williams published "Learning Representations by Back-propagating Errors" in *Nature* in 1986.[4] Their paper clearly demonstrated that backpropagation could train multilayer feedforward networks to learn internal representations and solve problems like XOR that had defeated single-layer perceptrons.[4] This work reignited interest in neural networks and made multilayer perceptrons a practical tool for the first time.

### Universal approximation theorems (1989-1993)

George Cybenko proved in 1989 that a feedforward network with a single hidden layer of sigmoid neurons can approximate any continuous function on a compact subset of $$\mathbb{R}^n$$ to arbitrary accuracy, given enough hidden units.[5] In the same year, Kurt Hornik, Maxwell Stinchcombe, and Halbert White proved a broader version of this result, showing that standard multilayer feedforward networks are universal approximators.[6] Hornik followed up in 1991 by showing that it is the multilayer feedforward architecture itself, not the specific activation function, that provides the universal approximation capability.[7] Moshe Leshno, Vladimir Ya. Lin, Allan Pinkus, and Shimon Schocken established in 1993 that a feedforward network can approximate any continuous function if and only if its activation function is not a polynomial.[8] This necessary-and-sufficient condition unified previous results.

### Deep learning era (2006 onward)

Geoffrey Hinton and collaborators showed in 2006 that deep networks could be effectively trained using layer-wise pretraining with restricted Boltzmann machines. The introduction of the [rectified linear unit](/wiki/rectified_linear_unit_relu) (ReLU) activation function, [batch normalization](/wiki/batch_normalization), [residual connections](/wiki/resnet), [dropout](/wiki/dropout_regularization), and dramatically faster GPU hardware removed many of the obstacles that had previously limited deep feedforward networks. Today, feedforward layers are a component of nearly every neural network architecture, from standalone MLPs to the position-wise FFN sublayers within transformers.

| Year | Milestone | Key contributors |
|------|-----------|-------------------|
| 1943 | McCulloch-Pitts binary neuron model | Warren McCulloch, Walter Pitts |
| 1958 | [Perceptron](/wiki/perceptron) with learnable weights | Frank Rosenblatt |
| 1965 | Group Method of Data Handling (early deep learning) | Alexei Ivakhnenko, Valentin Lapa |
| 1967 | First multilayer network trained by [SGD](/wiki/stochastic_gradient_descent_sgd) | Shun'ichi Amari |
| 1969 | *Perceptrons* book; XOR limitation proved | Marvin Minsky, Seymour Papert |
| 1970 | Reverse-mode automatic differentiation | Seppo Linnainmaa |
| 1974 | [Backpropagation](/wiki/backpropagation) applied to neural networks (thesis) | Paul Werbos |
| 1986 | Backpropagation popularized for MLPs | David Rumelhart, Geoffrey Hinton, Ronald Williams |
| 1989 | Universal approximation theorem for sigmoid networks | George Cybenko |
| 1989 | Universal approximation for general activations | Kurt Hornik, Maxwell Stinchcombe, Halbert White |
| 1991 | Architecture, not activation choice, is key to universality | Kurt Hornik |
| 1993 | Non-polynomial activation is necessary and sufficient | Moshe Leshno, Allan Pinkus, et al. |
| 2006 | Deep pretraining with restricted Boltzmann machines | Geoffrey Hinton, Ruslan Salakhutdinov |
| 2010s | [ReLU](/wiki/rectified_linear_unit_relu), [batch normalization](/wiki/batch_normalization), [dropout](/wiki/dropout_regularization), [residual connections](/wiki/resnet) | Various researchers |

## How is a feedforward neural network structured?

A feedforward neural network is organized as a sequence of layers. Each layer is a collection of [neurons](/wiki/neuron) (also called units or nodes). Connections run from every neuron in one layer to every neuron in the next layer (in a fully connected, or "dense," network), but never within the same layer or backward to a previous layer.

### Input layer

The [input layer](/wiki/input_layer) receives the raw feature values of an input example. The number of neurons in this layer equals the dimensionality of the input data. No computation occurs in the input layer; it simply distributes values to the first hidden layer.

### Hidden layers

Hidden layers perform the actual computation. Each neuron in a hidden layer computes a weighted sum of its inputs, adds a bias term, and passes the result through a nonlinear [activation function](/wiki/activation_function). In mathematical notation, the output of neuron j in layer l is:

$$
a_j^{(l)} = f\!\left( \sum_i w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right)
$$

where:
- $$a_i^{(l-1)}$$ is the output of neuron i in the previous layer
- $$w_{ji}^{(l)}$$ is the weight connecting neuron i in layer l-1 to neuron j in layer l
- $$b_j^{(l)}$$ is the bias of neuron j in layer l
- $$f$$ is the activation function

In vector form for an entire layer:

$$
a^{(l)} = f\!\left( W^{(l)} a^{(l-1)} + b^{(l)} \right)
$$

A network may have one hidden layer (a "shallow" network) or many hidden layers (a "deep" network). Adding more hidden layers increases the network's depth, which generally allows it to learn more abstract, hierarchical representations of the data.
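The per-layer computation $$a^{(l)} = f(W^{(l)} a^{(l-1)} + b^{(l)})$$ can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation; the weights, bias, and input below are made-up values chosen only to make the arithmetic easy to follow.

```python
def relu(z):
    return max(0.0, z)

def dense_layer(W, b, a_prev, f):
    """Compute a = f(W a_prev + b) for one fully connected layer."""
    out = []
    for w_row, b_j in zip(W, b):
        # Weighted sum of the previous layer's outputs plus the bias.
        z_j = sum(w * a for w, a in zip(w_row, a_prev)) + b_j
        out.append(f(z_j))
    return out

# Hypothetical layer mapping 3 inputs to 2 hidden units.
W1 = [[0.5, -1.0, 0.0],
      [1.0,  1.0, 1.0]]
b1 = [0.0, -2.0]
x = [1.0, 2.0, 3.0]

a1 = dense_layer(W1, b1, x, relu)
print(a1)  # [0.0, 4.0]
```

The first unit has a negative pre-activation (-1.5), so ReLU clips it to zero; the second passes its pre-activation (4.0) through unchanged.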

### Output layer

The [output layer](/wiki/output_layer) produces the network's final prediction. Its structure depends on the task:

| Task | Output neurons | Activation function | Formula or example |
|------|---------------|---------------------|----------|
| [Binary classification](/wiki/binary_classification) | 1 | [Sigmoid](/wiki/sigmoid_function) | $$\sigma(z) = \frac{1}{1 + e^{-z}}$$ |
| Multi-class classification | One per class | [Softmax](/wiki/softmax) | $$\frac{e^{z_i}}{\sum_j e^{z_j}}$$ |
| [Regression](/wiki/regression_model) | One per output dimension | Linear (identity) | Price prediction |
| Multi-label classification | One per label | Sigmoid (per neuron) | Tag prediction |
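The output activations in the table can be sketched as follows. The logits are arbitrary illustrative values, and subtracting the maximum inside `softmax` is a common numerical-stability trick rather than part of the mathematical definition.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)  # shift by the max so exp() cannot overflow
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]

class_probs = softmax(logits)                # multi-class: one distribution
label_probs = [sigmoid(z) for z in logits]   # multi-label: independent probabilities

print(class_probs, sum(class_probs))
print(label_probs)
```

Softmax outputs compete with each other and sum to 1, while per-neuron sigmoid outputs are independent and need not sum to 1.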

### Neurons

An individual [neuron](/wiki/neuron) is the basic computational unit of the network. It performs two operations in sequence: (1) compute the weighted sum of its inputs plus a bias, and (2) apply a nonlinear activation function to that sum. The weights and biases are the learnable [parameters](/wiki/parameter) of the network, adjusted during training to minimize a [loss function](/wiki/loss_function).

## Activation functions

Activation functions introduce nonlinearity into the network. Without them, a multilayer network would collapse to a single linear transformation, regardless of depth. The choice of activation function affects training dynamics, convergence speed, and final performance.
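The collapse of stacked linear layers can be checked numerically. The sketch below, using made-up weights, composes two layers without an activation and compares the result with a single layer whose weight matrix is $$W^{(2)} W^{(1)}$$ and whose bias is $$W^{(2)} b^{(1)} + b^{(2)}$$.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 outputs, no activation.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
b2 = [0.5, 0.5]
x = [2.0, -1.0]

two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)  # both [9.5, 0.5]
```

Inserting a nonlinearity such as ReLU between the two layers breaks this equivalence, which is what gives depth its extra expressive power.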

| Activation function | Formula | Range | Advantages | Disadvantages | Typical use |
|---------------------|---------|-------|------------|---------------|-------------|
| [Sigmoid](/wiki/sigmoid_function) | $$\sigma(z) = \frac{1}{1 + e^{-z}}$$ | (0, 1) | Output interpretable as probability | Vanishing gradients; output not zero-centered | Binary classification output |
| Tanh | $$\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$ | (-1, 1) | Zero-centered; stronger gradients near zero | Vanishing gradients for large inputs | Hidden layers (older networks) |
| [ReLU](/wiki/rectified_linear_unit_relu) | $$\max(0, z)$$ | [0, inf) | Simple; efficient; mitigates vanishing gradients | Dying ReLU problem (neurons output zero permanently) | Hidden layers (most common default) |
| Leaky ReLU | $$\max(\alpha z, z), \; \alpha \approx 0.01$$ | (-inf, inf) | Avoids dying ReLU | Introduces extra hyperparameter | Hidden layers |
| Parametric ReLU (PReLU) | $$\max(\alpha z, z), \; \alpha \text{ learned}$$ | (-inf, inf) | Learns optimal negative slope | Slightly more parameters | Hidden layers |
| ELU | $$z \text{ if } z > 0; \; \alpha(e^z - 1) \text{ if } z \le 0$$ | (-alpha, inf) | Smooth; pushes mean activations toward zero | Slower to compute than ReLU | Hidden layers |
| [GELU](/wiki/activation_function) | $$z \cdot \Phi(z)$$, where $$\Phi$$ is the standard Gaussian CDF | approx (-0.17, inf) | Smooth, probabilistic gating; good gradient flow | More expensive than ReLU | [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT](/wiki/gpt_generative_pre-trained_transformer) |
| SiLU / Swish | $$z \cdot \sigma(z)$$ | approx (-0.28, inf) | Non-monotonic; smooth; self-gated | Slightly more expensive than ReLU | [EfficientNet](/wiki/efficientnet), vision models |
| [SwiGLU](/wiki/activation_function) | $$\text{Swish}(xW) \cdot (xV)$$ | varies | State-of-the-art for LLMs; gated linear unit variant | Requires two input projections ($$W$$ and $$V$$), so the FFN has three weight matrices instead of two | [LLaMA](/wiki/llama), [PaLM](/wiki/palm) |
| [Softmax](/wiki/softmax) | $$\frac{e^{z_i}}{\sum_j e^{z_j}}$$ | (0, 1), sums to 1 | Produces valid probability distribution | Only used at output layer | Multi-class classification output |

ReLU and its variants remain the most common choice for hidden layers in general-purpose feedforward networks. GELU is standard in transformer encoder models like BERT, while SiLU/Swish and SwiGLU are preferred in large decoder-only transformers such as LLaMA and PaLM.
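Several of the scalar activations in the table are short enough to write out directly. The sketch below uses only the Python standard library; GELU is written with the exact Gaussian CDF via `math.erf`, whereas many libraries also offer a tanh-based approximation. The grid search at the end is only a rough numerical check of the lower bounds quoted in the table.

```python
import math

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

def gelu(z):
    # z * Phi(z), with Phi the standard Gaussian CDF.
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def silu(z):
    # z * sigmoid(z), also called Swish.
    return z / (1.0 + math.exp(-z))

grid = [i / 1000.0 for i in range(-5000, 1)]  # z from -5 to 0
gelu_min = min(gelu(z) for z in grid)
silu_min = min(silu(z) for z in grid)
print(round(gelu_min, 2), round(silu_min, 2))  # -0.17 -0.28
```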

## Mathematical formulation

A feedforward neural network with L layers defines a function $$f: \mathbb{R}^{d_0} \to \mathbb{R}^{d_L}$$ that maps an input vector $$x \in \mathbb{R}^{d_0}$$ to an output $$y \in \mathbb{R}^{d_L}$$, where $$d_0$$ is the input dimension and $$d_L$$ is the output dimension.

For each layer l from 1 to L:

$$
z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)} \quad \text{(linear transformation)}
$$

$$
a^{(l)} = f_l(z^{(l)}) \quad \text{(activation function)}
$$

where:
- $$a^{(0)} = x$$ (the input)
- $$W^{(l)}$$ is a $$d_l \times d_{l-1}$$ weight matrix for layer l
- $$b^{(l)}$$ is a $$d_l$$-dimensional bias vector for layer l
- $$f_l$$ is the activation function for layer l
- $$z^{(l)}$$ is the pre-activation value
- $$a^{(l)}$$ is the post-activation (output) of layer l

The full network computes:

$$
y = f_L\!\left( W^{(L)} f_{L-1}\!\left( \cdots f_2\!\left( W^{(2)} f_1\!\left( W^{(1)} x + b^{(1)} \right) + b^{(2)} \right) \cdots \right) + b^{(L)} \right)
$$

The total number of learnable parameters in a fully connected feedforward network is:

$$
\sum_{l=1}^{L} \left( d_l \cdot d_{l-1} + d_l \right)
$$

where $$d_l \cdot d_{l-1}$$ counts the weights and $$d_l$$ counts the biases in layer l.
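As a worked example of the parameter formula, the sketch below counts weights and biases for a hypothetical network with layer widths 784, 128, 64, and 10 (the sizes are illustrative, not taken from any particular model).

```python
def count_parameters(layer_sizes):
    """Sum d_l * d_{l-1} + d_l over consecutive layer pairs."""
    return sum(d_out * d_in + d_out
               for d_in, d_out in zip(layer_sizes, layer_sizes[1:]))

sizes = [784, 128, 64, 10]  # d_0, d_1, d_2, d_3
total = count_parameters(sizes)
print(total)  # 109386
```

The three layers contribute 100,480, 8,256, and 650 parameters respectively, so the first layer dominates because it connects the widest pair of layers.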

## What is the universal approximation theorem?

The universal approximation theorem is one of the foundational theoretical results in neural network research. It provides a mathematical guarantee that feedforward networks have sufficient representational power to model a broad class of functions. Hornik, Stinchcombe, and White (1989) state the result directly: standard multilayer feedforward networks with as few as one hidden layer using arbitrary squashing functions "are capable of approximating any Borel measurable function from one finite dimensional space to another to any desired degree of accuracy, provided sufficiently many hidden units are available."[6]

### Statement

In its most general form (Leshno et al., 1993), the theorem states: a standard feedforward network with a single hidden layer using any locally bounded, piecewise continuous activation function can approximate any continuous function on a compact subset of $$\mathbb{R}^n$$ to any desired accuracy, if and only if the activation function is not a polynomial.[8]
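The theorem itself is an existence result, but in one dimension a concrete construction is easy to write down. The sketch below hand-builds (rather than trains) a single-hidden-layer ReLU network that equals the piecewise-linear interpolant of a target function, and shows the maximum error shrinking as hidden units are added. The target $$\sin(x)$$ on $$[0, \pi]$$, the evenly spaced knots, and the helper names are illustrative assumptions, and the example is not a proof of the general statement.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_hidden):
    """Return g(x) = f(a) + sum_k c_k * relu(x - t_k), one hidden layer of ReLUs."""
    h = (b - a) / n_hidden
    knots = [a + k * h for k in range(n_hidden)]          # hidden biases are -t_k
    slopes = [(f(a + (k + 1) * h) - f(a + k * h)) / h for k in range(n_hidden)]
    # Output weight of unit k is the change in slope at knot t_k.
    out_weights = [slopes[0]] + [slopes[k] - slopes[k - 1]
                                 for k in range(1, n_hidden)]
    bias = f(a)

    def network(x):
        return bias + sum(c * relu(x - t) for c, t in zip(out_weights, knots))

    return network

def max_error(f, g, a, b, samples=1000):
    xs = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - g(x)) for x in xs)

errors = {n: max_error(math.sin, build_relu_interpolant(math.sin, 0.0, math.pi, n),
                       0.0, math.pi)
          for n in (4, 16, 64)}
print(errors)
```

Each fourfold increase in hidden units reduces the error by roughly a factor of sixteen here, consistent with the quadratic error bound for piecewise-linear interpolation of a smooth function.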

### Key results by year

| Year | Authors | Result |
|------|---------|--------|
| 1989 | George Cybenko | A single hidden layer with [sigmoid](/wiki/sigmoid_function) activation can approximate any continuous function on a compact set |
| 1989 | Kurt Hornik, Maxwell Stinchcombe, Halbert White | Multilayer feedforward networks with as few as one hidden layer are universal approximators (broader class of activations) |
| 1991 | Kurt Hornik | The multilayer feedforward architecture itself, not the choice of activation function, gives networks the universal approximation property |
| 1993 | Moshe Leshno, V.Y. Lin, Allan Pinkus, S. Schocken | Non-polynomial activation is the necessary and sufficient condition for universal approximation |
| 2017 | Zhou Lu et al. | Networks of bounded width (n + 4 neurons per layer, with [ReLU](/wiki/rectified_linear_unit_relu)) can approximate any Lebesgue-integrable function if depth is unlimited |
| 2020 | Patrick Kidger, Terry Lyons | Extended depth results to activations like tanh and GELU |
| 2021 | Park et al. | Minimum width for universal approximation is $$\max(d_x + 1, d_y)$$, where $$d_x$$ and $$d_y$$ are input and output dimensions |

### Practical implications and limitations

The theorem is an existence result. It guarantees that a network with the right architecture and weights can approximate a target function, but it does not:

- Specify how many hidden neurons are needed for a given level of accuracy
- Provide a method for finding the correct weights (that is the job of training algorithms like backpropagation)
- Guarantee that [gradient descent](/wiki/gradient_descent) will converge to a good solution
- Address the sample complexity (how much training data is needed)

In practice, deeper networks (more layers with fewer neurons per layer) tend to approximate complex functions more efficiently than very wide, shallow networks. This observation, supported by theoretical work on the expressive power of depth, is one reason modern deep learning favors deep architectures.

## How is a feedforward neural network trained?

Training a feedforward neural network means adjusting its weights and biases to minimize a loss function that measures how far the network's predictions are from the true values.

### Forward pass

During the forward pass, input data propagates through the network layer by layer. Each layer computes its weighted sum, applies the activation function, and passes the result to the next layer. The final output is compared to the target value using a [loss function](/wiki/loss_function).

### Loss functions

The choice of loss function depends on the task:

| Task | Loss function | Formula |
|------|--------------|----------|
| [Regression](/wiki/regression_model) | [Mean squared error](/wiki/mean_squared_error_mse) (MSE) | $$\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$$ |
| [Binary classification](/wiki/binary_classification) | Binary [cross-entropy](/wiki/cross-entropy) | $$-\frac{1}{n} \sum_i \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$$ |
| Multi-class classification | Categorical cross-entropy | $$-\frac{1}{n} \sum_i \sum_c y_{ic} \log(\hat{y}_{ic})$$ |
| Regression (robust) | [Mean absolute error](/wiki/mean_absolute_error_mae) (MAE) | $$\frac{1}{n} \sum_i \lvert y_i - \hat{y}_i \rvert$$ |
| Maximum-margin classification (labels $$y_i \in \{-1, +1\}$$) | Hinge loss | $$\frac{1}{n} \sum_i \max(0, 1 - y_i \hat{y}_i)$$ |
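The mean squared error, mean absolute error, and cross-entropy formulas from the table translate directly into code. In this sketch the `eps` clamp is an added safeguard against `log(0)` and is not part of the formulas; the inputs are toy values.

```python
import math

def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true rows are one-hot vectors; y_pred rows are predicted distributions.
    total = 0.0
    for row_true, row_pred in zip(y_true, y_pred):
        total += sum(t * math.log(max(p, eps)) for t, p in zip(row_true, row_pred))
    return -total / len(y_true)

print(mse([1.0, 2.0], [1.0, 4.0]))                    # 2.0
print(binary_cross_entropy([1, 0], [0.5, 0.5]))       # log(2), about 0.693
print(categorical_cross_entropy([[0, 1, 0]], [[0.25, 0.5, 0.25]]))
```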

### Backpropagation

[Backpropagation](/wiki/backpropagation) is the algorithm used to compute the gradient of the loss function with respect to every weight and bias in the network. It works by applying the chain rule of calculus, propagating error signals backward from the output layer through each hidden layer to the input layer.

For a weight $$w_{ji}^{(l)}$$ connecting neuron i in layer l-1 to neuron j in layer l (here $$L$$ denotes the loss, not the number of layers):

$$
\frac{\partial L}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} \, a_i^{(l-1)}
$$

where $$\delta_j^{(l)}$$ is the error signal (local gradient) at neuron j in layer l, computed recursively from the output layer backward.

The computational cost of backpropagation scales linearly with the number of parameters in the network, making it efficient even for large models.
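To make the rule $$\partial L / \partial w_{ji}^{(l)} = \delta_j^{(l)} a_i^{(l-1)}$$ concrete, the sketch below backpropagates through a tiny hypothetical network (2 inputs, 2 tanh hidden units, 1 linear output, squared-error loss) and compares the analytic gradients with central finite differences. The architecture, weights, and data point are all illustrative assumptions.

```python
import math

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W1, b1)]
    a1 = [math.tanh(z) for z in z1]
    y_hat = sum(w * a for w, a in zip(W2, a1)) + b2
    return a1, y_hat

def loss(params, x, y):
    _, y_hat = forward(params, x)
    return 0.5 * (y_hat - y) ** 2

def backprop(params, x, y):
    W1, b1, W2, b2 = params
    a1, y_hat = forward(params, x)
    delta_out = y_hat - y                    # error signal at the linear output
    grad_W2 = [delta_out * a for a in a1]    # delta_j * a_i
    grad_b2 = delta_out
    # Hidden error: push delta_out back through W2, times tanh'(z) = 1 - a^2.
    delta_hidden = [delta_out * w * (1.0 - a * a) for w, a in zip(W2, a1)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_W2, grad_b2

params = ([[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1], [0.5, -0.6], 0.2)
x, y = [1.0, 2.0], 1.0
grad_W1, grad_b1, grad_W2, grad_b2 = backprop(params, x, y)

def numeric_grad_W1(j, i, eps=1e-6):
    """Central finite difference of the loss with respect to W1[j][i]."""
    W1, b1, W2, b2 = params
    def shifted(delta):
        W1_new = [row[:] for row in W1]
        W1_new[j][i] += delta
        return loss((W1_new, b1, W2, b2), x, y)
    return (shifted(eps) - shifted(-eps)) / (2 * eps)

for j in range(2):
    for i in range(2):
        print(j, i, grad_W1[j][i], numeric_grad_W1(j, i))
```

Finite differences need two forward passes per parameter, whereas backpropagation obtains every gradient from one forward and one backward pass, which is the efficiency advantage described above.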

Backpropagation has a complex history. Although most commonly associated with the 1986 paper by Rumelhart, Hinton, and Williams, earlier versions were developed by Henry Kelley (1960), Arthur Bryson (1961), Stuart Dreyfus (1962), and Seppo Linnainmaa (1970).[4] Paul Werbos described its application to neural networks in his 1974 PhD thesis and developed the idea further in a 1982 publication. The 1986 paper provided the clearest exposition and experimental validation that drove widespread adoption.[4]

### Optimization algorithms

Once gradients are computed, an optimization algorithm updates the weights. Several algorithms have been developed, each with different trade-offs between convergence speed, stability, and generalization.

| Optimizer | Key idea | Introduced by |
|-----------|----------|---------------|
| [SGD](/wiki/stochastic_gradient_descent_sgd) | Update weights using gradient of a mini-batch | Robbins and Monro, 1951 |
| SGD with [momentum](/wiki/momentum) | Accumulate past gradients to accelerate convergence in consistent directions | Polyak, 1964 |
| Nesterov momentum | Look-ahead gradient for better convergence | Nesterov, 1983 |
| AdaGrad | Adapt learning rate per parameter based on historical gradient magnitudes | Duchi et al., 2011 |
| RMSProp | Exponential moving average of squared gradients | Hinton (unpublished lecture), 2012 |
| Adam | Combines momentum (first moment) with adaptive learning rates (second moment) | Kingma and Ba, 2015 |
| AdamW | Decouples [weight decay](/wiki/l2_regularization) from the adaptive learning rate update | Loshchilov and Hutter, 2019 |

[Adam](/wiki/optimizer) is the most widely used optimizer in practice because it converges quickly and requires little hyperparameter tuning.[12] [SGD](/wiki/stochastic_gradient_descent_sgd) with momentum sometimes achieves better generalization in the final model, especially in computer vision tasks, but requires more careful learning rate scheduling.
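The update rules for SGD with momentum and Adam can be compared on a one-dimensional toy problem. The sketch below minimizes the hypothetical objective $$(w - 3)^2$$; the learning rates and step counts are arbitrary choices for illustration, while the Adam defaults for `beta1`, `beta2`, and `eps` follow the commonly used values.

```python
import math

def grad(w):
    # Gradient of the toy objective (w - 3)^2.
    return 2.0 * (w - 3.0)

def sgd_momentum(w, steps=300, lr=0.1, beta=0.9):
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(w)   # accumulate past gradients
        w = w - lr * v
    return w

def adam(w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g    # second moment (scale)
        m_hat = m / (1 - beta1 ** t)           # bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

print(sgd_momentum(0.0), adam(0.0))  # both end close to 3.0
```

Adam's step size is roughly bounded by `lr` regardless of gradient magnitude, which is one reason it tends to need less tuning than plain SGD.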

### Weight initialization

Proper weight initialization is important to prevent vanishing or exploding gradients at the start of training.

| Method | Designed for | Variance of weights |
|--------|-------------|---------------------|
| Xavier / Glorot (Glorot and Bengio, 2010) | [Sigmoid](/wiki/sigmoid_function) and tanh activations | $$2 / (\text{fan\_in} + \text{fan\_out})$$ |
| He / Kaiming (He et al., 2015) | [ReLU](/wiki/rectified_linear_unit_relu) activations | $$2 / \text{fan\_in}$$ |
| LeCun (LeCun et al., 1998) | Originally sigmoid-type (tanh) units; now the usual choice for SELU activations | $$1 / \text{fan\_in}$$ |

Xavier initialization keeps the variance of activations roughly constant across layers when using sigmoid or tanh activations.[15] He initialization uses twice the LeCun variance ($$2 / \text{fan\_in}$$ instead of $$1 / \text{fan\_in}$$) to compensate for the fact that ReLU zeros out roughly half of its inputs.
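The three variance formulas in the table can be turned into a small sampler. This sketch draws weights from a zero-mean Gaussian (uniform variants with the same variance are also common) and checks that the empirical variance of a hypothetical 256-to-128 layer matches the He target; the layer sizes and seed are arbitrary.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    return math.sqrt(2.0 / fan_in)

def lecun_std(fan_in):
    return math.sqrt(1.0 / fan_in)

def init_weights(fan_in, fan_out, std, rng):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)
fan_in, fan_out = 256, 128
W = init_weights(fan_in, fan_out, he_std(fan_in), rng)

flat = [w for row in W for w in row]
sample_var = sum(w * w for w in flat) / len(flat)
print(sample_var, 2.0 / fan_in)  # empirical vs. target variance
```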

## Deep vs. shallow networks

The universal approximation theorem shows that a single hidden layer is theoretically sufficient to approximate any continuous function. In practice, however, deeper networks (with more layers and fewer neurons per layer) offer several advantages over wider, shallow networks:

- **Parameter efficiency.** Deep networks can represent certain functions with exponentially fewer parameters than shallow networks. Theoretical results show that there exist functions computable by a depth-k network with a polynomial number of neurons that would require an exponential number of neurons in a depth-(k-1) network.
- **Hierarchical feature learning.** Each layer in a deep network builds on the representations learned by the previous layer, allowing the network to learn progressively more abstract features. In image processing, for example, early layers might detect edges, middle layers might detect textures and shapes, and later layers might detect objects.
- **Better generalization.** Empirically, deeper networks often generalize better than shallow networks with the same number of parameters, particularly on complex real-world tasks.

Deep networks also face challenges:

- **Vanishing gradients.** Gradients can shrink exponentially as they propagate backward through many layers, making it difficult for early layers to learn. Addressed by [ReLU](/wiki/rectified_linear_unit_relu) activations, [batch normalization](/wiki/batch_normalization), and [residual connections](/wiki/resnet).
- **Exploding gradients.** Gradients can grow exponentially, causing unstable weight updates. Addressed by [gradient clipping](/wiki/gradient_clipping) and careful initialization.
- **Increased computational cost.** More layers mean more computation per forward and backward pass.
- **Harder optimization.** The loss surface of a deep network is more complex, with more saddle points and local minima.
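The vanishing-gradient effect can be seen in a deliberately simplified setting: a chain of one-unit layers with weight 1. Because the sigmoid derivative never exceeds 0.25, the gradient through ten sigmoid layers is at most $$0.25^{10}$$, while a ReLU chain with a positive input passes the gradient through unchanged. The chain structure and the input value are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def chain_gradient(x, depth, act, act_prime, weight=1.0):
    """d(output)/d(input) for `depth` stacked one-unit layers a = act(weight * a_prev)."""
    a, grad = x, 1.0
    for _ in range(depth):
        z = weight * a
        grad *= act_prime(z) * weight   # chain rule, one layer at a time
        a = act(z)
    return grad

sigmoid_grad = chain_gradient(1.0, 10, sigmoid, sigmoid_prime)
relu_grad = chain_gradient(1.0, 10, relu, relu_prime)
print(sigmoid_grad, relu_grad)
```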

## Regularization techniques

[Overfitting](/wiki/overfitting) occurs when a network learns to fit the training data too closely, including its noise, and performs poorly on unseen data. Several regularization techniques have been developed to address this.

### L1 and L2 regularization

[L1 regularization](/wiki/l1_regularization) adds the sum of absolute values of weights to the loss function, encouraging sparsity. [L2 regularization](/wiki/l2_regularization) (weight decay) adds the sum of squared weights, discouraging large weight values. Both techniques penalize model complexity.
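Both penalties are simple functions of the weights that get added to the data loss. In the sketch below, `lam` is the regularization strength (a hyperparameter name chosen here for illustration), and the weights and data loss are toy values.

```python
def l1_penalty(weights, lam):
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    return lam * sum(w * w for w in weights)

weights = [1.0, -2.0, 3.0]
data_loss = 0.5  # hypothetical loss on the training batch

total_l1 = data_loss + l1_penalty(weights, lam=0.1)
total_l2 = data_loss + l2_penalty(weights, lam=0.1)
print(total_l1, total_l2)
```

The L2 penalty grows quadratically, so it punishes the largest weight (3.0) far more than the L1 penalty does, whereas L1 applies the same pull toward zero to every nonzero weight.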

### Dropout

Introduced by Srivastava, Hinton, Krizhevsky, Sutskever, and Salakhutdinov in 2014, [dropout](/wiki/dropout_regularization) randomly sets a fraction of neuron outputs to zero during each training step.[9] This prevents neurons from co-adapting and forces the network to learn redundant representations. At test time, all neurons are active and their outputs are scaled by the keep probability so that expected activations match those seen during training; many implementations instead scale up the surviving activations during training, a variant known as inverted dropout. Dropout can be interpreted as training an exponential number of "thinned" sub-networks and averaging their predictions.[9]
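A minimal sketch of dropout on a list of activations follows. It uses the inverted form, which rescales surviving activations during training so that nothing needs to change at test time; this is a common implementation choice and an assumption here, not necessarily the formulation in the original paper. The function name and arguments are illustrative.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` while training."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    # Survivors are divided by `keep` so the expected value is unchanged.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
acts = [1.0] * 10000

train_out = dropout(acts, rate=0.5, rng=rng, training=True)
eval_out = dropout(acts, rate=0.5, rng=rng, training=False)

print(sum(train_out) / len(train_out))  # close to 1.0 on average
print(eval_out[:3])                     # unchanged at evaluation time
```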

### Batch normalization

Proposed by Ioffe and Szegedy in 2015, [batch normalization](/wiki/batch_normalization) normalizes the input to each layer by subtracting the mini-batch mean and dividing by the mini-batch standard deviation.[10] It adds learnable scale and shift parameters. Batch normalization stabilizes training, allows higher learning rates, and provides a mild regularization effect.[10] A related technique, layer normalization (Ba, Kiros, and Hinton, 2016), normalizes across features rather than across the batch and is preferred in transformers and RNNs.
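The difference between the two normalizations is the axis over which statistics are computed. The sketch below shows training-mode behavior only (it omits the running averages that batch normalization uses at inference time), with `gamma` and `beta` standing for the learnable scale and shift; the batch is a made-up 2x2 example.

```python
import math

def normalize(values, eps=1e-5):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature (column) across the examples in the mini-batch."""
    columns = [normalize(col, eps) for col in zip(*batch)]
    return [[g * v + b for v, g, b in zip(row, gamma, beta)]
            for row in zip(*columns)]

def layer_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each example (row) across its own features."""
    return [[g * v + b for v, g, b in zip(normalize(row, eps), gamma, beta)]
            for row in batch]

batch = [[1.0, 10.0],
         [3.0, 30.0]]
gamma, beta = [1.0, 1.0], [0.0, 0.0]

print(batch_norm(batch, gamma, beta))  # approx [[-1, -1], [1, 1]]
print(layer_norm(batch, gamma, beta))  # approx [[-1, 1], [-1, 1]]
```

Layer normalization gives the same result for an example regardless of what else is in the batch, which is one reason it suits variable-length sequence models.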

### Early stopping

[Early stopping](/wiki/early_stopping) monitors the loss on a held-out validation set during training and halts training when the validation loss stops improving. It is a simple and effective form of regularization that limits the capacity of the network by restricting the number of training steps.
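One common way to decide when validation loss has "stopped improving" is a patience counter. The sketch below assumes a hypothetical list of per-epoch validation losses and stops once `patience` epochs pass without a new best; `patience` and `min_delta` are conventional names, not fixed by the technique.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (epoch at which training stops, epoch of the best validation loss)."""
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, best_epoch = loss, epoch       # new best: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch             # waited long enough: stop
    return len(val_losses) - 1, best_epoch

val_losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # 5 2
```

In practice the weights saved at `best_epoch` are usually restored, since the epochs after it only made validation loss worse.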

### Data augmentation

[Data augmentation](/wiki/data_augmentation) artificially increases the size of the training set by applying transformations (rotation, flipping, cropping, color jittering for images; synonym replacement, back-translation for text). This exposes the network to more variation and reduces overfitting.

### Label smoothing

Label smoothing replaces hard target labels (0 or 1) with soft targets (e.g., 0.1 and 0.9). This prevents the network from becoming overconfident and improves generalization, especially in classification tasks.
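A widely used formulation mixes the one-hot target with a uniform distribution over the $$K$$ classes: $$(1 - \epsilon)\, y + \epsilon / K$$. The sketch below assumes that formulation; with two classes and $$\epsilon = 0.2$$ it reproduces the 0.1 and 0.9 targets mentioned above.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Blend a one-hot target with the uniform distribution over classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * t + epsilon / k for t in one_hot]

binary = smooth_labels([0.0, 1.0], epsilon=0.2)
three_class = smooth_labels([0.0, 1.0, 0.0], epsilon=0.1)
print(binary)       # approximately [0.1, 0.9]
print(three_class)  # still sums to 1
```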

## How are feedforward networks used in transformers?

The [transformer](/wiki/transformer) architecture, introduced by Vaswani et al. in 2017 in "Attention Is All You Need," contains a position-wise feedforward network (FFN) as one of two main sublayers in each transformer block (the other being the [self-attention](/wiki/self-attention_also_called_self-attention_layer) sublayer).[13] The paper describes it as "a fully connected feed-forward network, which is applied to each position separately and identically."[13]

### Architecture

The standard transformer FFN applies two linear transformations with a nonlinear activation in between:

$$
\mathrm{FFN}(x) = W_2 \, \mathrm{activation}(W_1 x + b_1) + b_2
$$

In the original transformer, the activation function was [ReLU](/wiki/rectified_linear_unit_relu).[13] Modern transformers use [GELU](/wiki/activation_function) (in [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers), [GPT-2](/wiki/gpt2)) or SwiGLU (in [LLaMA](/wiki/llama), [PaLM](/wiki/palm)).
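The "applied to each position separately and identically" property can be sketched by running one small FFN over every token vector in a sequence. The dimensions ($$d_{\text{model}} = 2$$, $$d_{\text{ff}} = 4$$) and weights below are toy values, picked so that the ReLU FFN happens to compute the identity and the result is easy to verify by hand.

```python
def affine(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bj for row, bj in zip(W, b)]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 relu(W1 x + b1) + b2 for a single token vector."""
    hidden = [max(0.0, z) for z in affine(W1, b1, x)]
    return affine(W2, b2, hidden)

def position_wise_ffn(sequence, W1, b1, W2, b2):
    # The same weights are reused at every position; tokens do not interact.
    return [ffn(x, W1, b1, W2, b2) for x in sequence]

# Toy weights: relu(x) - relu(-x) = x, so this FFN is the identity map.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]

sequence = [[2.0, -3.0], [0.5, 0.5], [2.0, -3.0]]
outputs = position_wise_ffn(sequence, W1, b1, W2, b2)
print(outputs)
```

Positions 0 and 2 hold the same vector and therefore produce the same output, regardless of what sits between them.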

### Expansion factor

The hidden dimension of the FFN ($$d_{\text{ff}}$$) is typically four times the model dimension ($$d_{\text{model}}$$). For example, in the original transformer with $$d_{\text{model}} = 512$$, $$d_{\text{ff}} = 2048$$.[13] This 4x expansion allows the FFN to project token representations into a higher-dimensional space where nonlinear transformations can capture richer patterns, before projecting back down to $$d_{\text{model}}$$.

The FFN parameters typically account for about two-thirds of the total parameters in a transformer block.[16] In a model with $$d_{\text{model}} = 4096$$ and $$d_{\text{ff}} = 16384$$, each FFN sublayer has $$2 \cdot 4096 \cdot 16384 \approx 134$$ million parameters (ignoring biases).
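The two-thirds figure can be reproduced with simple arithmetic. The sketch below assumes the attention sublayer holds four $$d_{\text{model}} \times d_{\text{model}}$$ projection matrices (query, key, value, and output) and ignores biases, normalization parameters, and embeddings; those assumptions are mine, made to keep the count simple.

```python
def ffn_params(d_model, d_ff):
    # Two weight matrices: d_model x d_ff and d_ff x d_model.
    return 2 * d_model * d_ff

def attention_params(d_model):
    # Assumed layout: query, key, value, and output projections.
    return 4 * d_model * d_model

d_model, d_ff = 4096, 16384
ffn = ffn_params(d_model, d_ff)
attn = attention_params(d_model)
ffn_fraction = ffn / (ffn + attn)

print(ffn)           # 134217728
print(ffn_fraction)  # 0.666...
```

With the 4x expansion the FFN holds $$8 d_{\text{model}}^2$$ weights against $$4 d_{\text{model}}^2$$ for attention, so the ratio is 2/3 whatever the model width.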

### FFN activations in modern transformers

| Model | FFN activation | Year |
|-------|---------------|------|
| Original Transformer | [ReLU](/wiki/rectified_linear_unit_relu) | 2017 |
| [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) | GELU | 2018 |
| [GPT-2](/wiki/gpt2), GPT-3 | GELU | 2019, 2020 |
| [PaLM](/wiki/palm) | SwiGLU | 2022 |
| [LLaMA](/wiki/llama), LLaMA 2 | SwiGLU | 2023 |

SwiGLU, proposed by Noam Shazeer in 2020, incorporates a gating mechanism that controls information flow through the FFN.[14] It uses three weight matrices instead of two; Shazeer shrinks the hidden dimension by a factor of 2/3 so that parameter count and computation stay roughly equal to the two-matrix baseline, and reports lower perplexity than ReLU or GELU FFNs under that matched budget.[14]
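A SwiGLU feedforward block can be sketched for a single token vector as follows. The matrix names `W_gate`, `W_up`, and `W_down` are labels chosen here for the three matrices (the first two correspond to $$W$$ and $$V$$ in the formula $$\text{Swish}(xW) \cdot (xV)$$), biases are omitted, and the weights are toy values.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]   # Swish(xW)
    up = matvec(W_up, x)                          # xV
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)                 # project back to d_model

# Toy sizes: d_model = 2, hidden width = 3.
W_gate = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
W_up = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
W_down = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]

out = swiglu_ffn([1.0, 2.0], W_gate, W_up, W_down)
print(out)
```

The first hidden unit has a gate pre-activation of zero, so it contributes nothing to the output even though its "up" value is 3: the gate decides how much of each hidden feature passes through.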

### Role in transformer blocks

While the self-attention sublayer allows tokens to exchange information across positions in a sequence, the FFN sublayer processes each token independently and identically. Geva et al. (2021) showed that FFN sublayers operate as key-value memories, in which each key correlates with textual patterns in the training data and each value induces a distribution over the output vocabulary, with lower layers capturing shallow patterns and upper layers capturing more semantic ones.[16] The combination of attention (inter-token communication) and FFN (per-token transformation) gives transformers their representational power.

### Mixture of experts

In some modern architectures, the dense FFN sublayer is replaced by a mixture of experts (MoE) layer. In an MoE layer, multiple FFN "experts" exist in parallel, and a gating network routes each token to a small subset of experts. This allows the model to have many more total parameters while keeping the computation per token roughly constant.
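The routing step can be sketched for one token as follows. This is a simplified illustration under several assumptions: the router is a single linear map, the top-k experts are selected by logit, and the gate weights are a softmax over only the selected logits (one of several schemes in use). The "experts" here are trivial scaling functions standing in for full FFNs.

```python
import math

def softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(x, experts, router_weights, top_k=2):
    """Route token x to its top_k experts and mix their outputs."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    gates = softmax([logits[i] for i in top])
    out = [0.0] * len(x)
    for g, i in zip(gates, top):
        y = experts[i](x)                      # only selected experts run
        out = [o + g * yi for o, yi in zip(out, y)]
    return out, top

def make_expert(scale):
    # Hypothetical stand-in for an FFN expert: multiplies the input by a constant.
    return lambda x: [scale * xi for xi in x]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]

out, chosen = moe_layer([1.0, 0.0], experts, router_weights, top_k=2)
print(chosen, out)
```

Four experts exist but only two are evaluated for this token, which is how total parameter count can grow without a matching increase in per-token computation.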

## Variants of feedforward networks

Several specialized architectures build on the basic feedforward design.

### Radial basis function (RBF) networks

RBF networks use radial basis functions (typically Gaussian) as activation functions instead of sigmoid or ReLU. They have a single hidden layer where each neuron computes the distance between the input and a stored center vector, then applies a radial function. RBF networks train faster than MLPs for low-dimensional problems but scale poorly to high-dimensional data.
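A Gaussian RBF hidden unit and the resulting network output can be sketched as below. The shared `width` parameter, the centers, and the output weights are illustrative assumptions; real RBF networks typically fit centers and widths from data.

```python
import math

def rbf_unit(x, center, width=1.0):
    """Gaussian response that depends only on the distance from x to the center."""
    sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbf_network(x, centers, out_weights, bias=0.0, width=1.0):
    # Linear output layer on top of the radial hidden units.
    return bias + sum(w * rbf_unit(x, c, width)
                      for w, c in zip(out_weights, centers))

centers = [[0.0, 0.0], [1.0, 2.0]]
out_weights = [1.0, -1.0]

print(rbf_unit([1.0, 2.0], [1.0, 2.0]))            # 1.0 at the center
print(rbf_network([1.0, 2.0], centers, out_weights))
```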

### Autoencoders

An [autoencoder](/wiki/encoder) is a feedforward network trained to reconstruct its input. It consists of an encoder that compresses the input into a lower-dimensional representation and a decoder that reconstructs the original input from that representation. Autoencoders are used for dimensionality reduction, denoising, and feature learning.
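The encoder and decoder structure can be sketched in a few lines of PyTorch. The layer widths, the latent size, and the use of mean squared error as the reconstruction loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Fully connected autoencoder sketch with a narrow bottleneck."""

    def __init__(self, input_dim, latent_dim):
        super().__init__()
        # Encoder compresses the input to latent_dim features
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 32), nn.ReLU(), nn.Linear(32, latent_dim)
        )
        # Decoder maps the compressed code back to the input space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, input_dim)
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder(input_dim=20, latent_dim=4)
x = torch.randn(16, 20)
recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # the training target is the input itself
```

After training, `model.encoder(x)` would serve as the lower-dimensional representation.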

### Residual networks

[Residual networks](/wiki/resnet) (ResNets), introduced by Kaiming He et al. in 2015, add skip connections that allow the input to a layer to bypass one or more layers and be added directly to the output.[11] Formally, a residual block computes $$y = F(x) + x$$, where $$F$$ is the transformation learned by the skipped layers. Because the identity path gives gradients a direct route to earlier layers, residual connections make very deep networks easier to optimize and enable training of networks with over 100 layers.[11] He et al. won the ImageNet classification challenge in 2015 with an ensemble of residual networks up to 152 layers deep, achieving a 3.57% top-5 error rate.[11]
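The formula $$y = F(x) + x$$ translates directly into code. The sketch below uses fully connected layers for $$F$$ to stay within the feedforward setting; the original ResNets use convolutional layers, and the class name and sizes here are illustrative.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Fully connected residual block sketch: y = F(x) + x."""

    def __init__(self, dim):
        super().__init__()
        # F must preserve the feature dimension so it can be added to x
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.f(x) + x  # skip connection adds the input back

block = ResidualBlock(dim=16)
x = torch.randn(4, 16)
y = block(x)
```

If $$F$$ outputs zero, the block reduces to the identity, which is part of why stacking many such blocks does not make the network harder to fit than a shallower one.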

### Highway networks

Introduced by Srivastava, Greff, and Schmidhuber in 2015, highway networks use gating mechanisms to regulate information flow through skip connections. Unlike ResNets, which simply add the skip connection, highway networks learn a gating function that controls how much information flows through the transformation versus the skip path.
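A highway layer can be sketched as $$y = T(x) \cdot H(x) + (1 - T(x)) \cdot x$$, where $$H$$ is the transformation and $$T$$ is the learned gate. The code below assumes the coupled form in which the carry gate equals $$1 - T$$, and a negative initial gate bias so that the layer starts out mostly passing its input through; the class name, the ReLU transform, and the bias value are illustrative choices.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway layer sketch: y = T(x) * H(x) + (1 - T(x)) * x."""

    def __init__(self, dim, gate_bias=-2.0):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # H: candidate transformation
        self.gate = nn.Linear(dim, dim)       # T: transform gate
        # Negative bias biases the layer toward carrying x unchanged at first
        nn.init.constant_(self.gate.bias, gate_bias)

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))       # values in (0, 1)
        return t * h + (1.0 - t) * x

layer = HighwayLayer(dim=16)
x = torch.randn(4, 16)
y = layer(x)
```

When the gate saturates near 0 the layer copies its input, and when it saturates near 1 it behaves like an ordinary dense layer.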

## How does an FFN differ from RNNs, CNNs, and transformers?

| Feature | FFN / MLP | [RNN](/wiki/recurrent_neural_network) | [CNN](/wiki/convolutional_neural_network) | [Transformer](/wiki/transformer) |
|---------|-----------|-------|-------|-------------|
| Information flow | Unidirectional, no cycles | Contains feedback loops | Unidirectional with local receptive fields | Unidirectional with global [attention](/wiki/attention) |
| Parameter sharing | None (fully connected) | Weights shared across time steps | Weights shared across spatial positions | Weights shared across sequence positions |
| Parallelization | Fully parallel | Sequential (inherently) | Highly parallel | Highly parallel |
| Inductive bias | None (general purpose) | Sequential / temporal structure | Spatial locality and translation invariance | Pairwise interactions between all positions |
| Input type | Fixed-size vectors | Variable-length sequences | Grid-structured data (images) | Variable-length sequences |
| Typical applications | Tabular data, function approximation, classification | [Time series](/wiki/time_series_analysis), language modeling (legacy) | [Image recognition](/wiki/image_recognition), [object detection](/wiki/object_detection) | [NLP](/wiki/natural_language_understanding), [computer vision](/wiki/computer_vision), multimodal |
| Scalability | Quadratic in layer width | Limited by sequential processing | Scales well with spatial dimensions | Scales well; quadratic in sequence length |

## What are feedforward neural networks used for?

Feedforward neural networks are used across many domains:

- **Classification and regression.** MLPs are a standard baseline for [supervised learning](/wiki/supervised_machine_learning) on tabular data, where they compete with [gradient boosting](/wiki/gradient_boosting) and [random forests](/wiki/random_forest). They are used for tasks like credit scoring, medical diagnosis, and customer churn prediction.
- **Function approximation.** The universal approximation theorem guarantees that a sufficiently wide FFN can approximate any continuous function on a compact domain to arbitrary accuracy, making FFNs useful for modeling nonlinear relationships in scientific computing, engineering simulation, and control systems.
- **Recommender systems.** Deep feedforward networks power the ranking and scoring components of [recommendation systems](/wiki/recommender_system) at companies like Google, Netflix, and YouTube. The Wide & Deep model (Cheng et al., 2016) combines a linear model with a deep MLP for recommendation.
- **[Reinforcement learning](/wiki/reinforcement_learning_rl).** FFNs serve as function approximators for value functions and policy networks in [deep Q-networks](/wiki/deep_q-network_dqn) (DQN) and policy gradient methods.
- **Feature extraction.** FFN layers are used as components within larger architectures. Every transformer block contains an FFN sublayer. [CNN](/wiki/convolutional_neural_network) classifiers typically end with one or more fully connected (FFN) layers.
- **Natural language processing.** Beyond their role in transformers, standalone MLPs have been used for text classification, [sentiment analysis](/wiki/sentiment_analysis), and [named entity recognition](/wiki/named_entity_recognition).
- **Scientific computing.** Physics-informed neural networks (PINNs) use feedforward networks to solve partial differential equations by encoding physical laws as constraints in the loss function.
- **Healthcare.** FFNs are applied to disease prediction, medical image analysis, and drug interaction prediction.
- **Financial analysis.** FFNs are applied to portfolio optimization, credit risk assessment, and fraud detection.

## Advantages and limitations

### Advantages

- **Universal approximation.** With a non-polynomial activation function and enough hidden units, an FFN can in theory approximate any continuous function on a compact set.
- **Simplicity.** Straightforward to implement, understand, and debug compared to more complex architectures.
- **Flexibility.** Can handle any fixed-size input and produce any fixed-size output.
- **Full parallelism.** All computations within a layer can be parallelized, unlike the sequential nature of RNNs.
- **Well-understood theory.** Decades of theoretical and empirical research provide strong foundations for architecture design and training.

### Limitations

- **No built-in structure.** FFNs treat all input features as equally unstructured. They lack the inductive biases of CNNs (spatial locality) or RNNs (sequential structure), which can make them data-hungry for tasks where such structure matters.
- **Fixed input size.** Standard FFNs require a fixed-dimensional input vector, making them unsuitable for variable-length sequences or images of varying resolution without preprocessing.
- **Parameter count.** Fully connected layers have many parameters (proportional to the product of input and output dimensions), which can lead to overfitting on small datasets and high memory usage.
- **No memory.** FFNs process each input independently and have no mechanism for maintaining state across inputs, unlike RNNs, which carry a hidden state between time steps, or transformers, which attend over other positions in the context.
- **Black box nature.** The learned representations in hidden layers are difficult to interpret, making it challenging to understand why the network makes specific predictions.
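The parameter-count limitation can be made concrete with a quick calculation. The sketch below, using arbitrary illustrative layer sizes, compares the closed-form count for one fully connected layer (weights plus biases) with what PyTorch reports.

```python
import torch.nn as nn

def linear_param_count(in_dim, out_dim):
    """Weights (in_dim * out_dim) plus one bias per output unit."""
    return in_dim * out_dim + out_dim

layer = nn.Linear(1024, 4096)  # illustrative sizes
actual = sum(p.numel() for p in layer.parameters())
expected = linear_param_count(1024, 4096)
print(actual, expected)  # both are 4198400
```

A single layer of this size already holds about 4.2 million parameters, and doubling both dimensions would roughly quadruple that figure.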

## Implementation example

A simple feedforward neural network for binary classification can be defined in [PyTorch](/wiki/pytorch) as follows:

```python
import torch
import torch.nn as nn

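class FeedforwardNet(nn.Module):
    """MLP with one hidden layer and sigmoid outputs in (0, 1)."""

    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden_dim)   # input -> hidden
        self.relu = nn.ReLU()                            # hidden nonlinearity
        self.layer2 = nn.Linear(hidden_dim, output_dim)  # hidden -> output
        self.sigmoid = nn.Sigmoid()                      # squash to (0, 1)

    def forward(self, x):
        # x has shape (batch_size, input_dim)
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        x = self.sigmoid(x)
        return x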

# Example: 10 input features, 64 hidden neurons, 1 output
model = FeedforwardNet(input_dim=10, hidden_dim=64, output_dim=1)
```

This network has one hidden layer with 64 neurons using ReLU activation and a sigmoid output for binary classification. In practice, adding more hidden layers, applying [dropout](/wiki/dropout_regularization), and choosing an appropriate optimizer such as [Adam](/wiki/optimizer) can improve performance, depending on the dataset.
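A training loop along those lines might look like the following sketch. It makes several assumptions: the data is synthetic, the hyperparameters are arbitrary, and the model returns raw logits paired with `nn.BCEWithLogitsLoss` (which applies the sigmoid internally) instead of the explicit `nn.Sigmoid` layer used above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data (hypothetical): label is 1 when the features sum to more than 0
X = torch.randn(256, 10)
y = (X.sum(dim=1, keepdim=True) > 0).float()

# Two hidden layers with dropout; the final layer outputs raw logits
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

model.train()  # enables dropout
losses = []
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)  # full-batch step for simplicity
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

model.eval()  # disables dropout for evaluation
with torch.no_grad():
    probs = torch.sigmoid(model(X))
    accuracy = ((probs > 0.5).float() == y).float().mean().item()
```

Accuracy here is measured on the training data only; a real experiment would hold out a validation set and use mini-batches.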

## See also

- [Neural network](/wiki/neural_network)
- [Backpropagation](/wiki/backpropagation)
- [Activation function](/wiki/activation_function)
- [Perceptron](/wiki/perceptron)
- [Deep neural network](/wiki/deep_neural_network)
- [Recurrent neural network](/wiki/recurrent_neural_network)
- [Convolutional neural network](/wiki/convolutional_neural_network)
- [Transformer](/wiki/transformer)
- [Dropout](/wiki/dropout_regularization)
- [Batch normalization](/wiki/batch_normalization)

## References

1. McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. *Bulletin of Mathematical Biophysics*, 5(4), 115-133.
2. Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. *Psychological Review*, 65(6), 386-408.
3. Minsky, M., & Papert, S. (1969). *Perceptrons: An Introduction to Computational Geometry*. MIT Press.
4. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. *Nature*, 323(6088), 533-536.
5. Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. *Mathematics of Control, Signals and Systems*, 2(4), 303-314.
6. Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. *Neural Networks*, 2(5), 359-366.
7. Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. *Neural Networks*, 4(2), 251-257.
8. Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. *Neural Networks*, 6(6), 861-867.
9. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. *Journal of Machine Learning Research*, 15(1), 1929-1958.
10. Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. *Proceedings of the 32nd International Conference on Machine Learning (ICML)*, 448-456.
11. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 770-778.
12. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. *Proceedings of the 3rd International Conference on Learning Representations (ICLR)*.
13. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. *Advances in Neural Information Processing Systems (NeurIPS)*, 30, 5998-6008.
14. Shazeer, N. (2020). GLU variants improve transformer. *arXiv preprint arXiv:2002.05202*.
15. Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. *Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 249-256.
16. Geva, M., Schuster, R., Berant, J., & Levy, O. (2021). Transformer feed-forward layers are key-value memories. *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 5484-5495.