# Activation Function

> Source: https://aiwiki.ai/wiki/activation_function
> Updated: 2026-07-11
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

An **activation function** is a nonlinear mathematical function applied to the output of each neuron in an [artificial neural network](/wiki/neural_network), and it is what gives the network the ability to learn complex, nonlinear patterns rather than just linear ones. It takes the weighted sum of a neuron's inputs and maps that value through a nonlinear transformation that decides how strongly the neuron "fires." The most widely used activation function in modern deep learning is the [Rectified Linear Unit (ReLU)](/wiki/rectified_linear_unit_relu), $$f(x) = \max(0, x)$$, while [transformer](/wiki/transformer) models such as [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) and the GPT series default to the smoother GELU, and modern large language models such as [LLaMA](/wiki/llama) use the gated SwiGLU variant.

Without activation functions, a [deep neural network](/wiki/deep_neural_network) of any depth would reduce to a single linear transformation, because the composition of linear (affine) functions is itself linear. Inserting a nonlinear activation function after each linear operation lets the network approximate a very broad class of functions. Cybenko's 1989 universal approximation theorem proved that a feedforward network with a single hidden layer of sigmoidal units, given enough width, can approximate any continuous function on a compact subset of $$\mathbb{R}^n$$ (such as the unit hypercube) to arbitrary accuracy.[3][15][16]

## Explain like I'm 5 (ELI5)

Imagine you have a bunch of friends passing notes in a chain. Each friend reads the note, decides how excited they should be about it, and writes a new note based on their excitement level to pass to the next friend. The "excitement rule" each friend uses is like an activation function. If everyone just copied the note exactly (linear), the last friend would get pretty much the same message no matter how many friends were in the chain. But if each friend adds their own twist (nonlinear), the chain can come up with really creative and complicated messages. That twist is what an activation function provides to a neural network.

## Why does nonlinearity matter?

A single layer of a neural network computes a linear transformation of its inputs: $$y = Wx + b$$, where W is a weight matrix and b is a bias vector. Stacking two such layers without any activation function gives $$y = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + (W_2 b_1 + b_2)$$, which is still a linear function of x. No matter how many layers are added, the result is equivalent to one linear layer.

Nonlinear activation functions break this linearity. When a nonlinear function f is applied between layers, the composition $$f(W_2 f(W_1 x + b_1) + b_2)$$ is no longer linear. This allows each layer to learn a different nonlinear transformation of its input, giving the network the representational power to model complex relationships.[3] This property is central to the success of [backpropagation](/wiki/backpropagation) in training deep networks, since meaningful gradients can flow through layers that compute genuinely different transformations.[2]
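The collapse described above can be checked numerically. The following is a minimal plain-Python sketch; the 2x2 weights, biases, and input are made-up illustrative values, and the helper names (`matvec`, `matmul`, `add`) are hypothetical.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B (lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def relu(v):
    return [max(0.0, a) for a in v]

# Illustrative (made-up) parameters for two 2x2 layers.
W1, b1 = [[1.0, -2.0], [0.5, 1.0]], [0.0, -1.0]
W2, b2 = [[2.0, 1.0], [-1.0, 3.0]], [1.0, 0.0]
x = [1.0, 2.0]

# Two linear layers applied in sequence, with no activation in between.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 W1 and b = W2 b1 + b2.
W, b = matmul(W2, W1), add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)

# Inserting ReLU between the layers breaks the equivalence.
with_relu = add(matvec(W2, relu(add(matvec(W1, x), b1))), b2)

print(stacked, collapsed, with_relu)
```

For these values the stacked and collapsed outputs agree, while the ReLU version differs because the negative hidden unit is zeroed before the second layer sees it.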

## Historical evolution

The history of activation functions mirrors the broader development of neural network research.

### Step functions and perceptrons (1950s-1960s)

The earliest artificial neurons, including Frank Rosenblatt's Perceptron (1958), used binary step functions.[1] A step function outputs 1 if the weighted input exceeds a threshold and 0 otherwise. While simple and biologically motivated (neurons either fire or do not), a step function is not differentiable at the threshold and has zero gradient everywhere else, so it gives gradient-based optimization no signal to follow.

### Sigmoid and tanh era (1980s-2000s)

The development of the backpropagation algorithm in the 1980s required differentiable activation functions.[2] The sigmoid (logistic) function and the hyperbolic tangent (tanh) became standard choices. These smooth, bounded functions allowed gradient computation throughout the network. However, as networks grew deeper, researchers discovered that both sigmoid and tanh suffer from the [vanishing gradient problem](/wiki/vanishing_gradient_problem): for inputs of large magnitude, whether positive or negative, the gradient approaches zero, effectively halting learning in early layers.

### ReLU revolution (2010-2015)

Vinod Nair and Geoffrey Hinton popularized the [Rectified Linear Unit (ReLU)](/wiki/rectified_linear_unit_relu) in a 2010 paper at the 27th International Conference on Machine Learning, demonstrating that this simple function could improve the training of deep networks.[4] Because ReLU's gradient is a constant 1 for positive inputs, it does not saturate on that side the way sigmoid and tanh do, and it is cheap to compute. The success of ReLU in AlexNet (2012) for [image recognition](/wiki/image_recognition) cemented its dominance. Krizhevsky, Sutskever, and Hinton reported that a four-layer convolutional network with ReLUs reached a 25% training-error rate on CIFAR-10 "six times faster" than an equivalent network using tanh neurons. AlexNet went on to win the ILSVRC-2012 ImageNet challenge with a 15.3% top-5 error versus 26.2% for the runner-up.[17] Variants like Leaky ReLU and Parametric ReLU soon followed to address ReLU's limitations.[6][7]

### Modern smooth activations (2016-present)

The rise of [transformer](/wiki/transformer) architectures brought a new generation of activation functions. The Gaussian Error Linear Unit (GELU), proposed by Hendrycks and Gimpel in 2016, became the default in [BERT](/wiki/bert_bidirectional_encoder_representations_from_transformers) and GPT models.[9] Google Brain's automated search discovered Swish (SiLU) in 2017.[11] More recently, gated variants like SwiGLU have become standard in large language models such as [LLaMA](/wiki/llama) and PaLM.[14][18]

## Common activation functions

### Sigmoid (logistic)

The sigmoid function maps any real number to the range (0, 1):

**Formula:** $$\sigma(x) = \frac{1}{1 + e^{-x}}$$

**Derivative:** $$\sigma'(x) = \sigma(x)(1 - \sigma(x))$$

The sigmoid function was historically one of the most popular activation functions. Its output can be interpreted as a probability, making it natural for binary classification output layers. However, sigmoid has several drawbacks for hidden layers.[15] Its outputs are not zero-centered (they range from 0 to 1), which can cause zig-zagging during gradient updates. More critically, the gradient saturates for large positive or negative inputs, approaching zero. This saturation causes the vanishing gradient problem in deep networks, slowing or preventing learning in early layers.[5][15]
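A small sketch can make the saturation concrete. It assumes only the formula and derivative given above; the ten-layer bound ignores the weights entirely and is meant as an illustration of scale, not a statement about any particular network.

```python
import math

def sigmoid(x):
    """Logistic function, written so that large |x| cannot overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)        # the derivative is largest at x = 0
saturated = sigmoid_grad(10.0)  # far from zero the derivative is tiny

# Upper bound on the factor contributed by the activations alone when a
# gradient passes back through 10 sigmoid layers (weights ignored).
bound_10_layers = peak ** 10

print(peak, saturated, bound_10_layers)
```

Since the derivative never exceeds 0.25, each sigmoid layer can only shrink the activation part of the gradient, which is why early layers in deep sigmoid networks learn slowly.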

### Tanh (hyperbolic tangent)

The tanh function is a rescaled version of sigmoid that maps inputs to the range (-1, 1):

**Formula:** $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

**Derivative:** $$\tanh'(x) = 1 - \tanh(x)^2$$

**Relation to sigmoid:** $$\tanh(x) = 2\sigma(2x) - 1$$

Tanh has the advantage of being zero-centered, which generally leads to faster convergence during training compared to sigmoid. Despite this improvement, tanh still saturates at extreme values, so it shares the vanishing gradient limitation.[15] Tanh remains common in [recurrent neural networks](/wiki/recurrent_neural_network) (RNNs) and certain gating mechanisms such as those in LSTMs.
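The relation to sigmoid and the saturation of the derivative can be verified with a few lines of standard-library Python. This is a minimal check at a handful of arbitrarily chosen points, not a proof.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh written as a shifted, rescaled sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in points)

slope_at_zero = tanh_grad(0.0)   # peak slope, four times sigmoid's 0.25
slope_at_five = tanh_grad(5.0)   # already close to zero

print(max_gap, slope_at_zero, slope_at_five)
```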

### ReLU (Rectified Linear Unit)

ReLU is the most widely used activation function in modern deep learning:

**Formula:** $$f(x) = \max(0, x)$$

**Derivative:** $$f'(x) = \begin{cases} 0 & x < 0 \\ 1 & x > 0 \end{cases}$$ (undefined at $$x = 0$$, typically set to 0 or 1)

ReLU has several key advantages. It is computationally cheap, requiring only a comparison operation. For positive inputs the gradient is always 1, so active units do not saturate, which largely mitigates the vanishing gradient problem and enables effective training of very deep networks.[4][5] ReLU also induces sparsity: neurons with negative pre-activation values output exactly zero, meaning a subset of neurons is effectively inactive for any given input. This sparsity can improve computational efficiency and serve as a form of implicit regularization.[5]

The primary limitation is the "dying ReLU" problem.[15] If a neuron's weights are updated such that its pre-activation is negative for every input in the training data, the neuron outputs zero for all of them and receives zero gradient. Because its own weights then stop updating, the neuron is unlikely to recover. In practice, this can cause a significant fraction of neurons to become permanently inactive, especially with high learning rates.
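The mechanism can be illustrated with a single hypothetical neuron. In this sketch the weights, the strongly negative bias, the inputs, and the upstream gradient of 1 are all made-up values chosen so that every pre-activation is negative.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Convention used here: the derivative at z == 0 is taken to be 0.
    return 1.0 if z > 0 else 0.0

# Hypothetical neuron whose bias has been pushed far into the negatives.
w, b = [0.5, -0.3], -10.0
inputs = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0], [2.0, 2.0]]

outputs, weight_grads = [], []
for x in inputs:
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # pre-activation
    outputs.append(relu(z))
    upstream = 1.0                                  # pretend dLoss/dOutput = 1
    # Chain rule: dLoss/dw_i = upstream * relu'(z) * x_i
    weight_grads.append([upstream * relu_grad(z) * xi for xi in x])

print(outputs)
print(weight_grads)
```

Every output and every weight gradient is zero, so a gradient step leaves `w` and `b` unchanged and the neuron stays inactive on this data.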

### Leaky ReLU

Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient for negative inputs:

**Formula:** $$f(x) = \begin{cases} x & x > 0 \\ \alpha x & x \le 0 \end{cases}$$ (typically $$\alpha = 0.01$$)

**Derivative:** $$f'(x) = \begin{cases} 1 & x > 0 \\ \alpha & x \le 0 \end{cases}$$

By using a small positive slope (commonly 0.01) for negative inputs, Leaky ReLU ensures that every neuron always has a non-zero gradient. This means neurons cannot "die" as they can with standard ReLU.[6] The negative slope is a fixed hyperparameter that must be chosen before training.
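As a minimal sketch of the formula and derivative above, the following evaluates Leaky ReLU at a strongly negative pre-activation (an arbitrary value chosen for illustration) where standard ReLU would pass back no gradient.

```python
def leaky_relu(z, alpha=0.01):
    """Identity for positive inputs, slope alpha otherwise."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

z = -10.0                      # a pre-activation deep in the negative region
out = leaky_relu(z)            # small negative output instead of exactly 0
grad = leaky_relu_grad(z)      # small but non-zero gradient

print(out, grad)
```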

### Parametric ReLU (PReLU)

PReLU generalizes Leaky ReLU by making the negative slope a learnable parameter:

**Formula:** $$f(x) = \begin{cases} x & x \ge 0 \\ \alpha x & x < 0 \end{cases}$$ ($$\alpha$$ is learned during training)

Proposed by He et al. (2015), PReLU allows the network to learn the optimal negative slope through backpropagation.[7] This adds a small number of extra parameters (one per channel or one shared across all channels) but can improve performance. Combined with a tailored weight initialization scheme, PReLU achieved a 4.94% top-5 error on the ImageNet 2012 classification set, which the authors reported as the first result to surpass the estimated human-level performance of 5.1% and a 26% relative improvement over the ILSVRC-2014 winner GoogLeNet (6.66%).[7]
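Learning the slope works because the output depends on $$\alpha$$ only for negative inputs, where $$\partial f / \partial \alpha = x$$. The sketch below performs one plain gradient-descent step on a single shared $$\alpha$$; the starting value, learning rate, inputs, and upstream gradients are illustrative assumptions rather than settings taken from the paper.

```python
def prelu(x, alpha):
    return x if x >= 0 else alpha * x

def prelu_grad_alpha(x):
    """Partial derivative of PReLU(x) with respect to alpha."""
    return 0.0 if x >= 0 else x

alpha = 0.25                        # starting slope (illustrative)
lr = 0.1                            # illustrative learning rate
batch = [-2.0, -1.0, 0.5, 3.0]
upstream = [1.0, -1.0, 1.0, 1.0]    # made-up dLoss/dOutput per element

# dLoss/dalpha sums the contributions of every element that shares alpha.
grad_alpha = sum(g * prelu_grad_alpha(x) for g, x in zip(upstream, batch))
alpha_new = alpha - lr * grad_alpha

print(grad_alpha, alpha_new)
```

Only the two negative inputs contribute to the gradient; the positive inputs leave $$\alpha$$ untouched.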

### ELU (Exponential Linear Unit)

ELU uses an exponential function for negative inputs, providing a smooth transition:

**Formula:** $$f(x) = \begin{cases} x & x > 0 \\ \alpha(e^x - 1) & x \le 0 \end{cases}$$ (typically $$\alpha = 1.0$$)

**Derivative:** $$f'(x) = \begin{cases} 1 & x > 0 \\ f(x) + \alpha & x \le 0 \end{cases}$$

Proposed by Clevert, Unterthiner, and Hochreiter (2015), ELU combines the benefits of ReLU (no vanishing gradient for positive values) with negative outputs that push the mean activation closer to zero.[8] Unlike Leaky ReLU, ELU saturates for large negative values (approaching $$-\alpha$$), which can make the network more robust to noise. With $$\alpha = 1$$ the derivative is continuous at zero, so ELU avoids the sharp corner of ReLU and provides smoother gradient flow there.
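The derivative identity $$f'(x) = f(x) + \alpha$$ for $$x \le 0$$ can be checked against a finite-difference estimate. This is a rough numerical sanity check at a few arbitrary points; `numeric_grad` is a hypothetical helper.

```python
import math

def elu(x, alpha=1.0):
    # math.expm1(x) computes e^x - 1 accurately for small x.
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Analytic derivative: 1 for x > 0, f(x) + alpha otherwise."""
    return 1.0 if x > 0 else elu(x, alpha) + alpha

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

checks = [-3.0, -1.0, -0.1, 0.5, 2.0]
max_err = max(abs(elu_grad(x) - numeric_grad(elu, x)) for x in checks)

floor = elu(-50.0)   # saturates near -alpha for very negative inputs

print(max_err, floor)
```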

### SELU (Scaled Exponential Linear Unit)

SELU is a scaled version of ELU with specific parameter values that enable self-normalizing behavior:

**Formula:** $$f(x) = \begin{cases} \lambda x & x > 0 \\ \lambda \alpha (e^x - 1) & x \le 0 \end{cases}$$

where $$\lambda = 1.0507$$ and $$\alpha = 1.6733$$

Introduced by Klambauer et al. (2017), SELU was designed so that activations in a neural network automatically converge toward zero mean and unit variance during forward propagation. The authors proved this self-normalizing property using the Banach fixed-point theorem.[10] In fully connected architectures, SELU is intended to remove the need for explicit normalization layers like [batch normalization](/wiki/batch_normalization).[10] However, the guarantee depends on a specific weight initialization (LeCun normal), and SELU works best with standard feedforward networks rather than convolutional or recurrent architectures.
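One way to see why these particular constants matter is to check that zero mean and unit variance is (approximately) preserved by the activation. The sketch below assumes an idealized pre-activation $$z \sim \mathcal{N}(0, 1)$$ and estimates the mean and variance of $$\mathrm{SELU}(z)$$ by trapezoidal integration; it uses the rounded constants quoted above, so the result is only approximate, and `expect` is a hypothetical helper.

```python
import math

LAMBDA, ALPHA = 1.0507, 1.6733   # rounded constants quoted above

def selu(x):
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * math.expm1(x)

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expect(g, lo=-10.0, hi=10.0, n=20000):
    """Trapezoidal estimate of E[g(z)] for z ~ N(0, 1)."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) * normal_pdf(lo) + g(hi) * normal_pdf(hi))
    for i in range(1, n):
        z = lo + i * h
        total += g(z) * normal_pdf(z)
    return total * h

mean = expect(selu)
var = expect(lambda z: selu(z) ** 2) - mean ** 2

print(mean, var)
```

Under this assumption the output mean stays close to 0 and the variance close to 1, which is the fixed point the self-normalizing argument is built around.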

### GELU (Gaussian Error Linear Unit)

GELU has become the default activation function in [transformer](/wiki/transformer) models:

**Formula:** $$\mathrm{GELU}(x) = x\Phi(x) = \frac{x}{2}\left(1 + \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

where $$\Phi(x)$$ is the cumulative distribution function of the standard normal distribution.

**Approximation:** $$\mathrm{GELU}(x)$$ is approximately equal to $$0.5 x \left(1 + \tanh\left(\sqrt{2/\pi}(x + 0.044715 x^3)\right)\right)$$

Proposed by Hendrycks and Gimpel (2016), GELU weights each input by its value within a probabilistic framework.[9] As the authors put it, "The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs," and they reported "performance improvements across all considered computer vision, natural language processing, and speech tasks."[9] Instead of gating inputs with a hard threshold at zero (as ReLU does), GELU scales each input by $$\Phi(x)$$, the probability that a standard normal random variable is less than or equal to $$x$$. As x becomes large and positive, GELU approaches the identity function (like ReLU). As x becomes large and negative, it approaches zero. The transition between these two regimes is smooth and non-monotonic: GELU dips slightly below zero for small negative inputs, with a minimum value of approximately -0.17.[9]

GELU gained wide adoption as the activation function in BERT (2018) and the GPT series of models.[9] Its smoothness is believed to benefit optimization in transformer training.
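The exact formula and the tanh approximation above can be compared directly with the standard library. This is a minimal sketch over a grid on [-6, 6] (an arbitrary range chosen for illustration); it also locates the small negative dip numerically.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi written via the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation quoted above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 1000.0 for i in range(-6000, 6001)]

# Largest disagreement between the exact and approximate forms on the grid.
max_gap = max(abs(gelu(x) - gelu_tanh(x)) for x in grid)

# Grid point where the exact GELU is smallest (the non-monotonic dip).
x_min = min(grid, key=gelu)

print(max_gap, x_min, gelu(x_min))
```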

### Swish / SiLU (Sigmoid Linear Unit)

Swish was discovered through an automated search by Google Brain researchers:

**Formula:** $$f(x) = x\sigma(x) = \frac{x}{1 + e^{-x}}$$

where $$\sigma$$ is the sigmoid function.

**Derivative:** $$f'(x) = f(x) + \sigma(x)(1 - f(x))$$

Ramachandran, Zoph, and Le (2017) used a combination of exhaustive and reinforcement-learning-based search over a space of candidate activation functions and found that Swish, $$f(x) = x\sigma(\beta x)$$, consistently outperformed ReLU across tasks.[11] Setting $$\beta = 1$$ gives the SiLU form shown above. Simply replacing ReLU with Swish raised top-1 ImageNet classification accuracy by 0.9% on Mobile NASNet-A and by 0.6% on Inception-ResNet-v2.[11] Swish is self-gated: the sigmoid of the input acts as a soft gate on the linear component x, so no separate gating input is needed. Like GELU, Swish is smooth, non-monotonic, and unbounded above but bounded below (minimum approximately -0.278).[11] Swish has been adopted in architectures like [EfficientNet](/wiki/efficientnet) and, in its piecewise-linear "hard-swish" approximation, MobileNetV3. In PyTorch, Swish is available as `torch.nn.SiLU`.
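The derivative formula and the lower bound can be checked numerically. The sketch below is a plain-Python illustration for $$\beta = 1$$; the grid range and the `numeric_grad` helper are assumptions made for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    """Swish with gate sharpness beta; beta = 1 gives SiLU."""
    return x * sigmoid(beta * x)

def silu_grad(x):
    """Derivative for beta = 1: f(x) + sigmoid(x) * (1 - f(x))."""
    f = swish(x)
    return f + sigmoid(x) * (1.0 - f)

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

grid = [i / 1000.0 for i in range(-6000, 6001)]
x_min = min(grid, key=swish)   # grid point where SiLU is smallest

max_err = max(abs(silu_grad(x) - numeric_grad(swish, x))
              for x in (-4.0, -1.0, 0.0, 2.0))

print(x_min, swish(x_min), max_err)
```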

### Mish

Mish is a smooth, self-regularized, non-monotonic activation function:

**Formula:** $$f(x) = x\tanh(\mathrm{softplus}(x)) = x\tanh(\ln(1 + e^x))$$

Proposed by Diganta Misra (2019), Mish shares several properties with Swish, including smoothness, non-monotonicity, and unboundedness above.[13] Mish showed improvements over both ReLU and Swish in image classification benchmarks such as CIFAR-100 and ImageNet, as well as object detection with YOLOv4.[13] Its self-regularizing property comes from the smooth, continuous first derivative that avoids sharp transitions.
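Evaluating $$\ln(1 + e^x)$$ literally overflows for large $$x$$, so a practical Mish needs a rearranged softplus. The following is one possible sketch using a standard rearrangement; it is not taken from the Mish paper, and the grid range is arbitrary.

```python
import math

def softplus(x):
    """ln(1 + e^x), rearranged so that large |x| cannot overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def mish(x):
    return x * math.tanh(softplus(x))

grid = [i / 1000.0 for i in range(-8000, 8001)]
x_min = min(grid, key=mish)    # grid point where Mish is smallest

print(x_min, mish(x_min), mish(800.0))
```

For large positive inputs `tanh(softplus(x))` rounds to 1, so Mish behaves like the identity there, while the minimum sits just below -0.3.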

### Softmax

Softmax is a multi-input, multi-output activation function used primarily in the output layer for multi-class classification:

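**Formula:** $$\mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{K} e^{x_j}}$$ for $$i = 1, \dots, K$$, where $$K$$ is the number of classes and the sum in the denominator runs over all $$j$$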

Unlike other activation functions that operate element-wise, [softmax](/wiki/softmax) takes an entire vector of raw scores (logits) and converts them into a probability distribution. Each output is in the range (0, 1) and all outputs sum to 1. Softmax is the standard output layer activation for multi-class classification tasks and is used in the attention mechanism of transformer models to compute attention weights.[15]
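A direct implementation of the formula overflows when logits are large, so implementations commonly subtract the maximum logit first; this does not change the result because softmax is invariant to adding a constant to every logit. A minimal sketch with made-up logits:

```python
import math

def softmax(logits):
    """Softmax with the maximum subtracted first for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])

# Shifting every logit by the same constant leaves the output unchanged,
# and very large logits no longer overflow.
small = softmax([0.0, 1.0, 2.0])
big = softmax([1000.0, 1001.0, 1002.0])

print(probs)
print(small, big)
```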

### Softplus

Softplus is a smooth approximation of ReLU:

**Formula:** $$f(x) = \ln(1 + e^x)$$

**Derivative:** $$f'(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$$

Softplus produces only positive outputs and is infinitely differentiable, making it useful in contexts where a smooth, strictly positive output is needed. It approaches ReLU as inputs become large and positive but has a smooth curve near zero rather than a sharp corner. Softplus is used as a building block in other activation functions (it appears in the Mish formula) and in variational autoencoders where positive variance parameters are needed.[13]
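The two properties stated above, that the derivative of softplus is the sigmoid and that softplus approaches ReLU for large inputs, can be checked with a short sketch. The overflow-safe rearrangements and the `numeric_grad` helper are choices made for this example.

```python
import math

def softplus(x):
    """ln(1 + e^x), rearranged so that large |x| cannot overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    # Overflow-free form of 1 / (1 + e^-x).
    return 0.5 * (1.0 + math.tanh(0.5 * x))

def numeric_grad(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

grad_err = max(abs(numeric_grad(softplus, x) - sigmoid(x))
               for x in (-5.0, -1.0, 0.0, 1.0, 5.0))

at_zero = softplus(0.0)                # equals ln 2, not 0 as for ReLU
gap_to_relu = softplus(20.0) - 20.0    # tiny positive gap for large inputs

print(grad_err, at_zero, gap_to_relu)
```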

### GLU and SwiGLU (Gated Linear Units)

Gated Linear Units and their variants have become the dominant activation mechanism in modern large language models:

**GLU formula:** $$\mathrm{GLU}(x, W, V, b, c) = \sigma(xW + b) \odot (xV + c)$$

where $$\sigma$$ is the sigmoid function and $$\odot$$ denotes element-wise multiplication.

**SwiGLU formula:** $$\mathrm{SwiGLU}(x, W, V, b, c) = \mathrm{Swish}(xW + b) \odot (xV + c)$$

GLU was introduced by Dauphin et al. (2017) for language modeling.[12] The key idea is that one linear projection acts as a "gate" that controls information flow from the other projection. Shazeer (2020) proposed replacing the sigmoid gate with other activation functions, producing variants like SwiGLU (using Swish as the gate) and GeGLU (using GELU), and found that these gated variants achieved better perplexity than ReLU and GELU feed-forward layers when tested on the T5 text-to-text transformer.[14] The paper offered no theoretical justification, concluding only: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."[14]

SwiGLU has been adopted as the feed-forward network activation in LLaMA, Mistral, PaLM, and many other state-of-the-art large language models. Compared to standard GELU-based feed-forward layers, SwiGLU consistently improves perplexity and downstream task performance while requiring an adjusted hidden dimension (typically 2/3 of the original) to keep the parameter count equivalent.[14] The LLaMA paper states it plainly: "We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of $$\frac{2}{3}4d$$ instead of $$4d$$ as in PaLM."[18]
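The 2/3 factor follows from counting weights: a standard FFN has two $$d \times 4d$$ matrices, while a gated FFN has three matrices, so its hidden size must shrink to keep the total the same. The sketch below works through the arithmetic for an illustrative width of 4096 with biases ignored; `ffn_params` is a hypothetical helper, and real models may round the hidden size differently.

```python
def ffn_params(d_model, d_hidden, n_matrices):
    """Weight count of an FFN made of n matrices of size d_model x d_hidden."""
    return n_matrices * d_model * d_hidden

d = 4096                                   # illustrative model width

standard = ffn_params(d, 4 * d, 2)         # two matrices, hidden size 4d
gated_hidden = (2 * 4 * d) // 3            # 2/3 of 4d, rounded down
gated = ffn_params(d, gated_hidden, 3)     # three matrices in the gated FFN

ratio = gated / standard
print(standard, gated_hidden, gated, ratio)
```

The two counts match up to the rounding of the hidden size, which is why the comparison between gated and non-gated FFNs is treated as parameter-matched.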

## Comparison table

| Function | Formula | Output range | Smooth | Monotonic | Zero-centered | Common use cases |
|---|---|---|---|---|---|---|
| Sigmoid | $$\frac{1}{1 + e^{-x}}$$ | $$(0, 1)$$ | Yes | Yes | No | Binary classification output |
| Tanh | $$\frac{e^x - e^{-x}}{e^x + e^{-x}}$$ | $$(-1, 1)$$ | Yes | Yes | Yes | RNNs, LSTM gates |
| ReLU | $$\max(0, x)$$ | $$[0, \infty)$$ | No | Yes | No | CNN hidden layers, default |
| Leaky ReLU | $$\max(0.01x, x)$$ | $$(-\infty, \infty)$$ | No | Yes | No | Deep CNNs, GANs |
| PReLU | $$\max(\alpha x, x)$$, $$\alpha$$ learned | $$(-\infty, \infty)$$ | No | Yes | No | Image classification |
| ELU | $$x$$ if $$x > 0$$; $$\alpha(e^x - 1)$$ if $$x \le 0$$ | $$(-\alpha, \infty)$$ | Yes | Yes | Approximately | Deep feedforward networks |
| SELU | $$\lambda \cdot \mathrm{ELU}(x)$$ with fixed $$\lambda, \alpha$$ | $$(-\lambda\alpha, \infty)$$ | Yes | Yes | Approximately | Self-normalizing networks |
| GELU | $$x\Phi(x)$$ | $$(\sim -0.17, \infty)$$ | Yes | No | No | Transformers (BERT, GPT) |
| Swish/SiLU | $$x \cdot \mathrm{sigmoid}(x)$$ | $$(\sim -0.278, \infty)$$ | Yes | No | No | EfficientNet, MobileNet |
| Mish | $$x\tanh(\mathrm{softplus}(x))$$ | $$(\sim -0.31, \infty)$$ | Yes | No | No | YOLOv4, image classification |
| Softmax | $$\frac{e^{x_i}}{\sum_j e^{x_j}}$$ | $$(0, 1)$$, sums to 1 | Yes | N/A | No | Multi-class classification output |
| Softplus | $$\ln(1 + e^x)$$ | $$(0, \infty)$$ | Yes | Yes | No | Variance parameters, building block |
| SwiGLU | $$\mathrm{Swish}(xW+b) \odot (xV+c)$$ | $$(-\infty, \infty)$$ | Yes | No | Approximately | LLaMA, PaLM, modern LLMs |

## How do you choose an activation function?

Selecting an activation function depends on the network architecture, the task, and the specific layer within the network.

### Hidden layers

For most tasks, ReLU is a strong default for hidden layers, especially in convolutional neural networks.[15] If the dying ReLU problem is observed (many neurons producing only zeros), switching to Leaky ReLU or ELU can help.[6][8] For transformer architectures, GELU is the standard choice. If training a feedforward network without batch normalization, SELU with LeCun normal initialization is worth considering.[10]

### Output layers

The output layer activation should match the task:

| Task | Recommended output activation | Reason |
|---|---|---|
| Binary classification | Sigmoid | Outputs a single probability in (0, 1) |
| Multi-class classification | Softmax | Outputs a probability distribution over classes |
| Multi-label classification | Sigmoid (per output) | Each output independently predicts presence of a label |
| Regression | Linear (no activation) | Outputs can be any real number |
| Regression (positive only) | Softplus or ReLU | Constrains output to positive (Softplus) or non-negative (ReLU) values |

### Architecture-specific conventions

| Architecture | Typical activation | Notes |
|---|---|---|
| CNNs ([ResNet](/wiki/resnet), [VGG](/wiki/vgg)) | ReLU | Simple, efficient, well-studied |
| Transformers (encoder, e.g. BERT) | GELU | Standard since BERT (2018) |
| Transformers (decoder LLMs, e.g. LLaMA) | SwiGLU | Standard since LLaMA (2023) |
| RNNs / LSTMs | Tanh (hidden), Sigmoid (gates) | Traditional choices for recurrent models |
| [GANs](/wiki/generative_adversarial_network) | Leaky ReLU (discriminator), ReLU/Tanh (generator) | Leaky ReLU prevents dead neurons in discriminator |
| Self-normalizing networks | SELU | Requires LeCun normal initialization |

## How are activation functions used in transformers?

Transformer architectures use activation functions primarily in their feed-forward network (FFN) sublayers. In the original Transformer paper by Vaswani et al. (2017), the FFN used ReLU: $$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$.

BERT (2018) used GELU in place of ReLU in the FFN, and the same choice appears in the decoder-only GPT-2 and GPT-3 as well as in many subsequent encoder-based models.[9] The smoother gating behavior of GELU, where inputs near zero are partially rather than fully suppressed, is thought to benefit the optimization dynamics of transformer training.

More recent decoder-only language models have moved to gated FFN variants. LLaMA (2023) adopted SwiGLU in its FFN layers, as PaLM had before it, following Shazeer's (2020) findings that gated activations consistently improved transformer language modeling.[14][18] The SwiGLU FFN replaces the standard two-matrix FFN with a three-matrix gated structure: $$\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = (\mathrm{Swish}(xW_1) \odot xW_3)W_2$$, where $$\odot$$ is element-wise multiplication. This approach has been adopted by Mistral, Gemma, and numerous other model families.
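To make the three-matrix structure concrete, here is a dependency-free sketch of the gated FFN forward pass for a single token. The tiny weights (model width 2, hidden size 3), the input, and the helper `vecmat` are made up for illustration; a real implementation would use a tensor library and learned weights.

```python
import math

def swish(v):
    """Swish / SiLU with beta = 1: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def vecmat(x, W):
    """Row vector x times matrix W (a list of rows), i.e. x @ W."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x))
            for j in range(len(W[0]))]

def swiglu_ffn(x, W1, W3, W2):
    gate = [swish(v) for v in vecmat(x, W1)]        # Swish(x W1)
    value = vecmat(x, W3)                           # x W3
    hidden = [g * v for g, v in zip(gate, value)]   # element-wise product
    return vecmat(hidden, W2)                       # project back to d_model

# Made-up weights: d_model = 2, hidden size = 3.
W1 = [[1.0, 0.0, -1.0], [0.0, 1.0, 1.0]]
W3 = [[0.5, 1.0, 0.0], [1.0, 0.0, 2.0]]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

y = swiglu_ffn([1.0, 2.0], W1, W3, W2)
print(y)
```

The gate and value projections both read the same input, and the output has the same width as the input, so the block can replace a standard FFN sublayer.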

The attention mechanism in transformers also uses softmax to compute attention weights from query-key dot products, though this is typically considered part of the attention computation rather than an activation function in the traditional sense.

## Implementation examples

### PyTorch

```python
import torch
import torch.nn as nn

# Common activation functions
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
elu = nn.ELU(alpha=1.0)
selu = nn.SELU()
gelu = nn.GELU()
silu = nn.SiLU()  # Swish
mish = nn.Mish()
softmax = nn.Softmax(dim=-1)

# Using in a simple network
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

# Using GELU in a transformer-style FFN
class TransformerFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
```

### TensorFlow / Keras

```python
import tensorflow as tf
from tensorflow import keras

# Using activation functions in layers
model = keras.Sequential([
    keras.Input(shape=(784,)),  # declare the input shape explicitly
    keras.layers.Dense(256, activation='relu'),
    keras.layers.Dense(128, activation='elu'),
    keras.layers.Dense(10, activation='softmax')
])

# Available activations: 'relu', 'sigmoid', 'tanh', 'elu', 'selu',
# 'gelu', 'swish', 'mish', 'softmax', 'softplus'

# Custom activation function
@tf.function
def custom_leaky_relu(x, alpha=0.01):
    return tf.where(x > 0, x, alpha * x)
```

## Desirable properties of activation functions

Several mathematical and practical properties influence the effectiveness of an activation function:

- **Nonlinearity.** The core requirement. Without nonlinearity, deep networks collapse to a single linear layer.
- **Differentiability.** Gradient-based optimization (backpropagation) requires computing derivatives. Functions like ReLU that are not differentiable at a single point still work in practice because the non-differentiable point has measure zero, and implementations simply assign a fixed value (such as 0) to the derivative there.
- **Non-saturating gradients.** Functions whose gradients approach zero for large inputs (like sigmoid and tanh) cause the vanishing gradient problem. Non-saturating functions like ReLU maintain useful gradients across a wider input range.[4][5]
- **Computational efficiency.** Simpler functions (ReLU requires only a comparison) enable faster training, which matters at scale.
- **Zero-centered output.** Outputs centered around zero (as with tanh) can improve gradient flow and convergence speed compared to outputs that are always positive (as with sigmoid or ReLU).
- **Bounded or unbounded range.** Bounded outputs (sigmoid, tanh) can stabilize training but may limit representational capacity. Unbounded outputs (ReLU, GELU) allow for larger activations but may require careful initialization and normalization.
- **Monotonicity.** Traditional wisdom favored monotonic functions, but recent successful activations (GELU, Swish, Mish) are non-monotonic, suggesting monotonicity is not strictly necessary.[9][11][13]

## See also

- [Neural network](/wiki/neural_network)
- [Deep neural network](/wiki/deep_neural_network)
- [Backpropagation](/wiki/backpropagation)
- [Vanishing gradient problem](/wiki/vanishing_gradient_problem)
- [Rectified Linear Unit (ReLU)](/wiki/rectified_linear_unit_relu)
- [Sigmoid function](/wiki/sigmoid_function)
- [Softmax](/wiki/softmax)
- [Batch normalization](/wiki/batch_normalization)
- [Transformer](/wiki/transformer)
- [Loss function](/wiki/loss_function)

## References

1. Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain." *Psychological Review*, 65(6), 386-408.
2. Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). "Learning representations by back-propagating errors." *Nature*, 323(6088), 533-536.
3. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals and Systems*, 2(4), 303-314.
4. Nair, V. and Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." *Proceedings of the 27th International Conference on Machine Learning (ICML)*, 807-814.
5. Glorot, X., Bordes, A., and Bengio, Y. (2011). "Deep Sparse Rectifier Neural Networks." *Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)*, 315-323.
6. Maas, A. L., Hannun, A. Y., and Ng, A. Y. (2013). "Rectifier Nonlinearities Improve Neural Network Acoustic Models." *ICML Workshop on Deep Learning for Audio, Speech, and Language Processing*.
7. He, K., Zhang, X., Ren, S., and Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, 1026-1034.
8. Clevert, D.-A., Unterthiner, T., and Hochreiter, S. (2016). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." *Proceedings of the International Conference on Learning Representations (ICLR)*.
9. Hendrycks, D. and Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." *arXiv preprint arXiv:1606.08415*.
10. Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. (2017). "Self-Normalizing Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 30.
11. Ramachandran, P., Zoph, B., and Le, Q. V. (2017). "Searching for Activation Functions." *arXiv preprint arXiv:1710.05941*.
12. Dauphin, Y. N., Fan, A., Auli, M., and Grangier, D. (2017). "Language Modeling with Gated Convolutional Networks." *Proceedings of the 34th International Conference on Machine Learning (ICML)*, 933-941.
13. Misra, D. (2019). "Mish: A Self Regularized Non-Monotonic Activation Function." *arXiv preprint arXiv:1908.08681*.
14. Shazeer, N. (2020). "GLU Variants Improve Transformer." *arXiv preprint arXiv:2002.05202*.
15. Dubey, S. R., Singh, S. K., and Chaudhuri, B. B. (2022). "Activation Functions in Deep Learning: A Comprehensive Survey and Benchmark." *Neurocomputing*, 503, 92-108.
16. Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." *Mathematics of Control, Signals and Systems*, 2(4), 303-314. (Universal Approximation Theorem.)
17. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *Advances in Neural Information Processing Systems (NeurIPS)*, 25, 1097-1105.
18. Touvron, H., et al. (2023). "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint arXiv:2302.13971*.