An activation function is a mathematical function applied to the output of each neuron (or node) in an artificial neural network. It determines whether a neuron should be "activated" by computing a weighted sum of inputs and mapping the result through a nonlinear transformation. Activation functions are essential to deep learning because they introduce the nonlinearity that allows neural networks to learn complex patterns in data.
Without activation functions, a neural network would reduce to a single linear transformation, regardless of how many layers it contained. This is because the composition of linear functions is itself linear. By inserting a nonlinear activation function after each linear operation, neural networks gain the ability to approximate virtually any continuous function on a compact domain, a result formalized by the Universal Approximation Theorem.
Imagine you have a bunch of friends passing notes in a chain. Each friend reads the note, decides how excited they should be about it, and writes a new note based on their excitement level to pass to the next friend. The "excitement rule" each friend uses is like an activation function. If everyone just copied the note exactly (linear), the last friend would get pretty much the same message no matter how many friends were in the chain. But if each friend adds their own twist (nonlinear), the chain can come up with really creative and complicated messages. That twist is what an activation function provides to a neural network.
A single layer of a neural network computes a linear transformation of its inputs: y = Wx + b, where W is a weight matrix and b is a bias vector. Stacking two such layers without any activation function gives y = W2(W1x + b1) + b2 = W2W1x + (W2b1 + b2), which is still a linear function of x. No matter how many layers are added, the result is equivalent to one linear layer.
Nonlinear activation functions break this linearity. When a nonlinear function f is applied between layers, the composition f(W2 f(W1 x + b1) + b2) is no longer linear. This allows each layer to learn a different nonlinear transformation of its input, giving the network the representational power to model complex relationships. This property is central to the success of backpropagation in training deep networks, since meaningful gradients can flow through layers that compute genuinely different transformations.
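The collapse of stacked linear layers can be verified numerically; a minimal sketch using NumPy (layer sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacked layers: y = W2(W1 x + b1) + b2
y_stacked = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
y_single = (W2 @ W1) @ x + (W2 @ b1 + b2)

assert np.allclose(y_stacked, y_single)  # identical: depth added nothing
```

Inserting a nonlinearity such as `np.maximum(0, ...)` between the two layers breaks this equivalence, which is exactly the point of an activation function.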
The history of activation functions mirrors the broader development of neural network research.
The earliest artificial neurons, including Frank Rosenblatt's Perceptron (1958), used binary step functions. A step function outputs 1 if the weighted input exceeds a threshold and 0 otherwise. While simple and biologically motivated (neurons either fire or do not), step functions are not differentiable at the threshold, making them incompatible with gradient-based optimization.
The development of the backpropagation algorithm in the 1980s required differentiable activation functions. The sigmoid (logistic) function and the hyperbolic tangent (tanh) became standard choices. These smooth, bounded functions allowed gradient computation throughout the network. However, as networks grew deeper, researchers discovered that both sigmoid and tanh suffer from the vanishing gradient problem: for large or small input values, the gradient approaches zero, effectively halting learning in early layers.
Vinod Nair and Geoffrey Hinton popularized the Rectified Linear Unit (ReLU) in 2010, demonstrating that this simple function could dramatically improve training of deep networks. ReLU's constant gradient for positive inputs eliminated the vanishing gradient problem in one direction and provided computational efficiency. The success of ReLU in AlexNet (2012) for image recognition cemented its dominance. Variants like Leaky ReLU and Parametric ReLU soon followed to address ReLU's limitations.
The rise of transformer architectures brought a new generation of activation functions. The Gaussian Error Linear Unit (GELU), proposed by Hendrycks and Gimpel in 2016, became the default in BERT and GPT models. Google Brain's automated search discovered Swish (SiLU) in 2017. More recently, gated variants like SwiGLU have become standard in large language models such as LLaMA and PaLM.
The sigmoid function maps any real number to the range (0, 1):
Formula: sigma(x) = 1 / (1 + e^(-x))
Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
The sigmoid function was historically one of the most popular activation functions. Its output can be interpreted as a probability, making it natural for binary classification output layers. However, sigmoid has several drawbacks for hidden layers. Its outputs are not zero-centered (they range from 0 to 1), which can cause zig-zagging during gradient updates. More critically, the gradient saturates for large positive or negative inputs, approaching zero. This saturation causes the vanishing gradient problem in deep networks, slowing or preventing learning in early layers.
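The saturation behavior is easy to see by evaluating the derivative directly; a small illustration in pure Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient peaks at x = 0 with value 0.25 and collapses in the tails,
# which is the mechanism behind the vanishing gradient problem.
print(sigmoid_grad(0.0))    # 0.25
print(sigmoid_grad(10.0))   # ~4.5e-05
print(sigmoid_grad(-10.0))  # ~4.5e-05
```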
The tanh function is a rescaled version of sigmoid that maps inputs to the range (-1, 1):
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Derivative: tanh'(x) = 1 - tanh(x)^2
Relation to sigmoid: tanh(x) = 2 * sigma(2x) - 1
Tanh has the advantage of being zero-centered, which generally leads to faster convergence during training compared to sigmoid. Despite this improvement, tanh still saturates at extreme values, so it shares the vanishing gradient limitation. Tanh remains common in recurrent neural networks (RNNs) and certain gating mechanisms such as those in LSTMs.
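The identity relating tanh to sigmoid can be checked numerically; a quick sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# tanh(x) = 2 * sigma(2x) - 1: tanh is the sigmoid rescaled to (-1, 1)
for x in (-3.0, -0.5, 0.0, 1.0, 2.5):
    assert abs(math.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0)) < 1e-12
```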
ReLU is the most widely used activation function in modern deep learning:
Formula: f(x) = max(0, x)
Derivative: f'(x) = 0 if x < 0, 1 if x > 0 (undefined at x = 0, typically set to 0 or 1)
ReLU has several key advantages. It is computationally cheap, requiring only a comparison operation. For positive inputs, the gradient is always 1, which eliminates the vanishing gradient problem and enables effective training of very deep networks. ReLU also induces sparsity: neurons with negative pre-activation values output exactly zero, meaning a subset of neurons is effectively inactive for any given input. This sparsity can improve computational efficiency and serve as a form of implicit regularization.
The primary limitation is the "dying ReLU" problem. If a neuron's weights are updated such that its input is always negative, the neuron will output zero for every input and receive zero gradient. Once this happens, the neuron can never recover. In practice, this can cause a significant fraction of neurons to become permanently inactive, especially with high learning rates.
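Both the sparsity and the dead-neuron mechanism can be illustrated with a few lines of NumPy (sample size arbitrary):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(1)
pre = rng.normal(size=1000)  # zero-mean pre-activations
out = relu(pre)

# Roughly half of the outputs are exactly zero: ReLU-induced sparsity
sparsity = np.mean(out == 0.0)

# ReLU gradient: 1 where the input was positive, 0 elsewhere.
# A neuron whose pre-activation is negative for every input receives
# zero gradient everywhere and can never recover -- the "dying ReLU" problem.
grad = (pre > 0).astype(float)
```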
Leaky ReLU addresses the dying ReLU problem by allowing a small, non-zero gradient for negative inputs:
Formula: f(x) = x if x > 0, alpha * x if x <= 0 (typically alpha = 0.01)
Derivative: f'(x) = 1 if x > 0, alpha if x <= 0
By using a small positive slope (commonly 0.01) for negative inputs, Leaky ReLU ensures that every neuron always has a non-zero gradient. This means neurons cannot "die" as they can with standard ReLU. The negative slope is a fixed hyperparameter that must be chosen before training.
PReLU generalizes Leaky ReLU by making the negative slope a learnable parameter:
Formula: f(x) = x if x >= 0, alpha * x if x < 0 (alpha is learned during training)
Proposed by He et al. (2015), PReLU allows the network to learn the optimal negative slope through backpropagation. This adds a small number of extra parameters (one per channel or one shared across all channels) but can improve performance. The authors showed that PReLU achieved a then-record result on the ImageNet classification task, surpassing human-level accuracy.
ELU uses an exponential function for negative inputs, providing a smooth transition:
Formula: f(x) = x if x > 0, alpha * (e^x - 1) if x <= 0 (typically alpha = 1.0)
Derivative: f'(x) = 1 if x > 0, f(x) + alpha if x <= 0
Proposed by Clevert, Unterthiner, and Hochreiter (2015), ELU combines the benefits of ReLU (no vanishing gradient for positive values) with negative outputs that push the mean activation closer to zero. Unlike Leaky ReLU, ELU saturates for large negative values (approaching -alpha), which can make the network more robust to noise. ELU's smoothness at zero provides better gradient flow compared to the sharp corner in ReLU.
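The derivative identity for the negative branch (alpha * e^x = f(x) + alpha) can be verified against a finite-difference estimate; a small check:

```python
import math

ALPHA = 1.0

def elu(x, alpha=ALPHA):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=ALPHA):
    # For x <= 0: f'(x) = alpha * e^x, which equals f(x) + alpha
    return 1.0 if x > 0 else elu(x, alpha) + alpha

# Compare the closed form against a central finite difference
h = 1e-6
for x in (-3.0, -1.0, -0.1, 0.5, 2.0):
    numeric = (elu(x + h) - elu(x - h)) / (2.0 * h)
    assert abs(numeric - elu_grad(x)) < 1e-5
```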
SELU is a scaled version of ELU with specific parameter values that enable self-normalizing behavior:
Formula: f(x) = lambda * x if x > 0, lambda * alpha * (e^x - 1) if x <= 0
where lambda = 1.0507 and alpha = 1.6733
Introduced by Klambauer et al. (2017), SELU was designed so that activations in a neural network automatically converge toward zero mean and unit variance during forward propagation. The authors proved this self-normalizing property using the Banach fixed-point theorem. SELU eliminates the need for explicit normalization layers like batch normalization in fully connected architectures. However, SELU requires specific weight initialization (LeCun normal) and works best with standard feedforward networks rather than convolutional or recurrent architectures.
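The self-normalizing behavior can be observed empirically; a rough NumPy sketch, with width, depth, and tolerances chosen purely for illustration:

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.6733

def selu(x):
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

rng = np.random.default_rng(0)
dim = 512
x = rng.normal(size=(2000, dim))  # standardized inputs

# Dense layers with LeCun normal initialization: std = 1/sqrt(fan_in)
for _ in range(8):
    W = rng.normal(scale=1.0 / np.sqrt(dim), size=(dim, dim))
    x = selu(x @ W)

# Activations stay close to zero mean and unit variance without
# any explicit normalization layer
mean, var = x.mean(), x.var()
```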
GELU has become the default activation function in transformer models:
Formula: GELU(x) = x * Phi(x) = (x/2) * (1 + erf(x / sqrt(2)))
where Phi(x) is the cumulative distribution function of the standard normal distribution.
Approximation: GELU(x) is approximately equal to 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
Proposed by Hendrycks and Gimpel (2016), GELU weights inputs by their magnitude using a probabilistic framework. Instead of gating inputs with a hard threshold (as ReLU does), GELU scales each input x by Phi(x), the probability that a standard Gaussian random variable does not exceed x. As x becomes large and positive, GELU approaches the identity function (like ReLU). As x becomes large and negative, it approaches zero. The transition between these two regimes is smooth and non-monotonic; GELU has a slight dip below zero for small negative inputs (its minimum value is approximately -0.17).
GELU gained wide adoption as the activation function in BERT (2018) and the GPT series of models. Its smoothness is believed to benefit optimization in transformer training.
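The exact and tanh-approximated forms can be compared directly; a short check using `math.erf`:

```python
import math

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # tanh approximation used by many frameworks
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x**3)))

# The approximation tracks the exact form closely over typical ranges,
# and both show GELU's slight dip below zero for small negative inputs
for x in (-3.0, -1.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(gelu_exact(x) - gelu_tanh(x)) < 1e-2
assert gelu_exact(-0.5) < 0.0
```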
Swish was discovered through an automated search by Google Brain researchers:
Formula: f(x) = x * sigma(x) = x / (1 + e^(-x))
where sigma is the sigmoid function.
Derivative: f'(x) = f(x) + sigma(x) * (1 - f(x))
Ramachandran, Zoph, and Le (2017) used reinforcement learning to search over a space of candidate activation functions and found that Swish consistently outperformed ReLU across tasks. Swish is self-gated: the sigmoid component acts as a learned gate on the linear component x. Like GELU, Swish is smooth, non-monotonic, and unbounded above but bounded below (minimum approximately -0.278). Swish has been adopted in architectures like EfficientNet and MobileNetV3. In PyTorch, Swish is available as torch.nn.SiLU.
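The derivative formula for Swish can be checked against a finite difference; a brief sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # f'(x) = f(x) + sigma(x) * (1 - f(x))
    return swish(x) + sigmoid(x) * (1.0 - swish(x))

h = 1e-6
for x in (-4.0, -1.0, 0.0, 0.5, 2.0):
    numeric = (swish(x + h) - swish(x - h)) / (2.0 * h)
    assert abs(numeric - swish_grad(x)) < 1e-5
```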
Mish is a smooth, self-regularized, non-monotonic activation function:
Formula: f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
Proposed by Diganta Misra (2019), Mish shares several properties with Swish, including smoothness, non-monotonicity, and unboundedness above. Mish showed improvements over both ReLU and Swish in image classification benchmarks such as CIFAR-100 and ImageNet, as well as object detection with YOLOv4. Its self-regularizing property comes from the smooth, continuous first derivative that avoids sharp transitions.
Softmax is a multi-input, multi-output activation function used primarily in the output layer for multi-class classification:
Formula: softmax(x_i) = e^(x_i) / sum(e^(x_j)) for all j
Unlike other activation functions that operate element-wise, softmax takes an entire vector of raw scores (logits) and converts them into a probability distribution. Each output is in the range (0, 1) and all outputs sum to 1. Softmax is the standard output layer activation for multi-class classification tasks and is used in the attention mechanism of transformer models to compute attention weights.
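In practice softmax is implemented with a max-subtraction trick, since exponentiating large logits overflows; a minimal NumPy version:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max leaves the result unchanged (the factor cancels)
    # but keeps the exponentials in a safe numerical range
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
assert abs(probs.sum() - 1.0) < 1e-12  # a valid probability distribution

# Without the shift, logits like these would overflow to inf
stable = softmax(np.array([1000.0, 1001.0]))
assert np.all(np.isfinite(stable))
```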
Softplus is a smooth approximation of ReLU:
Formula: f(x) = ln(1 + e^x)
Derivative: f'(x) = sigma(x) = 1 / (1 + e^(-x))
Softplus produces only positive outputs and is infinitely differentiable, making it useful in contexts where a smooth, strictly positive output is needed. It approaches ReLU as inputs become large and positive but has a smooth curve near zero rather than a sharp corner. Softplus is used as a building block in other activation functions (it appears in the Mish formula) and in variational autoencoders where positive variance parameters are needed.
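Both properties (convergence to ReLU in the tails, sigmoid as derivative) are easy to confirm; a small check using a numerically stable form of softplus:

```python
import math

def softplus(x):
    # Stable evaluation: ln(1 + e^x) = max(x, 0) + log1p(e^(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Approaches ReLU for large |x|...
assert abs(softplus(20.0) - 20.0) < 1e-8
assert softplus(-20.0) < 1e-8

# ...and its derivative is the sigmoid
h = 1e-6
numeric = (softplus(1.0 + h) - softplus(1.0 - h)) / (2.0 * h)
assert abs(numeric - sigmoid(1.0)) < 1e-5
```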
Gated Linear Units and their variants have become the dominant activation mechanism in modern large language models:
GLU formula: GLU(x, W, V, b, c) = (xW + b) * sigma(xV + c)
where sigma is the sigmoid function and * denotes element-wise multiplication.
SwiGLU formula: SwiGLU(x, W, V, b, c) = (Swish(xW + b)) * (xV + c)
GLU was introduced by Dauphin et al. (2017) for language modeling. The key idea is that one linear projection acts as a "gate" that controls information flow from the other projection. Shazeer (2020) proposed replacing the sigmoid gate with other activation functions, producing variants like SwiGLU (using Swish as the gate) and GeGLU (using GELU).
SwiGLU has been adopted as the feed-forward network activation in LLaMA, Mistral, PaLM, and many other state-of-the-art large language models. Compared to standard GELU-based feed-forward layers, SwiGLU consistently improves perplexity and downstream task performance while requiring an adjusted hidden dimension (typically 2/3 of the original) to keep the parameter count equivalent.
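A SwiGLU feed-forward block can be sketched in a few lines of NumPy; the weight names W1/W3/W2 follow the LLaMA convention, and the toy dimensions are illustrative only:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def swiglu_ffn(x, W1, W3, W2):
    # One projection is passed through Swish and gates the other
    # elementwise; W2 projects back to the model dimension (biases omitted)
    return (swish(x @ W1) * (x @ W3)) @ W2

d_model, d_ff = 8, 16  # toy sizes; real models scale d_ff to keep parameters equal
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(d_model, d_ff))
W3 = rng.normal(scale=0.1, size=(d_model, d_ff))
W2 = rng.normal(scale=0.1, size=(d_ff, d_model))

out = swiglu_ffn(rng.normal(size=(4, d_model)), W1, W3, W2)
assert out.shape == (4, d_model)
```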
| Function | Formula | Output range | Smooth | Monotonic | Zero-centered | Common use cases |
|---|---|---|---|---|---|---|
| Sigmoid | 1 / (1 + e^(-x)) | (0, 1) | Yes | Yes | No | Binary classification output |
| Tanh | (e^x - e^(-x)) / (e^x + e^(-x)) | (-1, 1) | Yes | Yes | Yes | RNNs, LSTM gates |
| ReLU | max(0, x) | [0, infinity) | No | Yes | No | CNN hidden layers, default |
| Leaky ReLU | max(0.01x, x) | (-infinity, infinity) | No | Yes | No | Deep CNNs, GANs |
| PReLU | max(alpha*x, x), alpha learned | (-infinity, infinity) | No | Yes | No | Image classification |
| ELU | x if x>0; alpha*(e^x - 1) if x<=0 | (-alpha, infinity) | Yes | Yes | Approximately | Deep feedforward networks |
| SELU | lambda*ELU(x) with fixed lambda, alpha | (-lambda*alpha, infinity) | Yes | Yes | Approximately | Self-normalizing networks |
| GELU | x * Phi(x) | (~-0.17, infinity) | Yes | No | No | Transformers (BERT, GPT) |
| Swish/SiLU | x * sigmoid(x) | (~-0.278, infinity) | Yes | No | No | EfficientNet, MobileNet |
| Mish | x * tanh(softplus(x)) | (~-0.31, infinity) | Yes | No | No | YOLOv4, image classification |
| Softmax | e^(x_i) / sum(e^(x_j)) | (0, 1), sums to 1 | Yes | N/A | No | Multi-class classification output |
| Softplus | ln(1 + e^x) | (0, infinity) | Yes | Yes | No | Variance parameters, building block |
| SwiGLU | Swish(xW+b) * (xV+c) | (-infinity, infinity) | Yes | No | Approximately | LLaMA, PaLM, modern LLMs |
Selecting an activation function depends on the network architecture, the task, and the specific layer within the network.
For most tasks, ReLU is a strong default for hidden layers, especially in convolutional neural networks. If the dying ReLU problem is observed (many neurons producing only zeros), switching to Leaky ReLU or ELU can help. For transformer architectures, GELU is the standard choice. If training a feedforward network without batch normalization, SELU with LeCun normal initialization is worth considering.
The output layer activation should match the task:
| Task | Recommended output activation | Reason |
|---|---|---|
| Binary classification | Sigmoid | Outputs a single probability in (0, 1) |
| Multi-class classification | Softmax | Outputs a probability distribution over classes |
| Multi-label classification | Sigmoid (per output) | Each output independently predicts presence of a label |
| Regression | Linear (no activation) | Outputs can be any real number |
| Regression (positive only) | Softplus or ReLU | Constrains output to positive values |
| Architecture | Typical activation | Notes |
|---|---|---|
| CNNs (ResNet, VGG) | ReLU | Simple, efficient, well-studied |
| Transformers (encoder, e.g. BERT) | GELU | Standard since BERT (2018) |
| Transformers (decoder LLMs, e.g. LLaMA) | SwiGLU | Used in PaLM (2022); standard in LLaMA-family models |
| RNNs / LSTMs | Tanh (hidden), Sigmoid (gates) | Traditional choices for recurrent models |
| GANs | Leaky ReLU (discriminator), ReLU/Tanh (generator) | Leaky ReLU prevents dead neurons in discriminator |
| Self-normalizing networks | SELU | Requires LeCun normal initialization |
Transformer architectures use activation functions primarily in their feed-forward network (FFN) sublayers. In the original Transformer paper by Vaswani et al. (2017), the FFN used ReLU: FFN(x) = max(0, xW1 + b1)W2 + b2.
BERT (2018) replaced ReLU with GELU in the FFN, and this choice was carried forward into GPT-2, GPT-3, and many subsequent encoder-based models. The smoother gating behavior of GELU, where inputs near zero are partially rather than fully suppressed, is thought to benefit the optimization dynamics of transformer training.
More recent decoder-only language models have moved to gated FFN variants. Following Shazeer's (2020) finding that gated activations consistently improve transformer language modeling, PaLM and LLaMA adopted SwiGLU in their FFN layers. The SwiGLU FFN replaces the standard two-matrix FFN with a three-matrix gated structure: FFN_SwiGLU(x) = (Swish(xW1) * xW3)W2. This approach has been adopted by Mistral, Gemma, and numerous other model families.
The attention mechanism in transformers also uses softmax to compute attention weights from query-key dot products, though this is typically considered part of the attention computation rather than an activation function in the traditional sense.
```python
import torch
import torch.nn as nn

# Common activation functions
relu = nn.ReLU()
leaky_relu = nn.LeakyReLU(negative_slope=0.01)
elu = nn.ELU(alpha=1.0)
selu = nn.SELU()
gelu = nn.GELU()
silu = nn.SiLU()  # Swish
mish = nn.Mish()
softmax = nn.Softmax(dim=-1)

# Using activations in a simple network
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Using GELU in a transformer-style FFN
class TransformerFFN(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)
        self.linear2 = nn.Linear(d_ff, d_model)
        self.activation = nn.GELU()

    def forward(self, x):
        return self.linear2(self.activation(self.linear1(x)))
```
```python
import tensorflow as tf
from tensorflow import keras

# Using activation functions in layers
model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(784,)),
    keras.layers.Dense(128, activation='elu'),
    keras.layers.Dense(10, activation='softmax'),
])

# Available activation strings include: 'relu', 'sigmoid', 'tanh', 'elu',
# 'selu', 'gelu', 'swish', 'mish', 'softmax', 'softplus'

# Custom activation function
@tf.function
def custom_leaky_relu(x, alpha=0.01):
    return tf.where(x > 0, x, alpha * x)
```
Several mathematical and practical properties influence the effectiveness of an activation function: