ReLU (Rectified Linear Unit) is an activation function used in neural networks, defined by the formula f(x) = max(0, x). For any positive input, ReLU returns the input unchanged; for any negative input, it returns zero. Despite this simplicity, ReLU was the activation function that made modern deep learning practical. Its adoption in the early 2010s enabled the training of much deeper networks than had been possible with sigmoid or tanh activations, and it remains one of the most widely used activation functions today.
The ReLU function is defined as:
f(x) = max(0, x)
This can also be written piecewise:
f(x) = x if x > 0, else 0
The derivative of ReLU is:
f'(x) = 1 if x > 0, else 0 (the derivative is undefined at exactly x = 0; implementations conventionally use 0 or 1 there)
ReLU is piecewise linear: it consists of two linear segments joined at x = 0. This piecewise linearity is what makes it both computationally efficient and mathematically interesting. A network of ReLU neurons partitions its input space into regions, within each of which the network computes a different linear function.
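The function and the derivative convention above fit in a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 for x > 0, else 0 (using the common convention f'(0) = 0)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```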
The ReLU function was first applied to neural network dynamics by Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung in a 2000 paper published in Nature. Their work was motivated by neuroscience: biological neurons fire at a rate roughly proportional to their input current when that current exceeds a threshold, and do not fire at all below the threshold. The max(0, x) function captures this behavior. Hahnloser et al. showed that ReLU-type activations enable recurrent neural network dynamics to stabilize under weaker conditions than smooth activations like sigmoid.
The basic mathematical concept of the positive part function, x+ = max(0, x), has a much longer history in mathematics and was used in various contexts before neural networks. But the deliberate application to neural computation began with Hahnloser et al.
ReLU's breakthrough in practical deep learning came from two papers that demonstrated its advantages for training deep networks.
Vinod Nair and Geoffrey Hinton published "Rectified Linear Units Improve Restricted Boltzmann Machines" at ICML 2010. They compared ReLU against sigmoid activations in Restricted Boltzmann Machines (RBMs) and showed that ReLU led to better generative models, particularly on image data. Their theoretical argument centered on the observation that a ReLU unit can be approximated by an infinite sum of binary sigmoid units with shifted biases, giving it a much richer representational capacity.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio published "Deep Sparse Rectifier Neural Networks" at AISTATS 2011. This paper argued for ReLU in standard feedforward and supervised networks. Glorot et al. identified several advantages: sparse activations (many units output exactly zero), a non-saturating gradient of 1 for every active unit, very cheap computation, and a closer match to the firing behavior of biological neurons.
Critically, Glorot et al. demonstrated that deep networks with ReLU activations could be trained successfully with purely supervised learning, without the unsupervised pre-training that had been considered necessary for deep networks. This was a turning point: it meant that deep networks were practical for a much wider range of problems.
By 2012, ReLU was the default activation in convolutional neural networks. AlexNet (Krizhevsky, Sutskever, and Hinton, 2012), which won the ImageNet competition and is often cited as the catalyst for the modern deep learning era, used ReLU throughout. The authors noted that ReLU trained several times faster than an equivalent network with tanh activations.
From that point, ReLU became the standard activation for hidden layers in nearly all deep learning architectures, including VGGNet, GoogLeNet, and ResNet.
For any given input, a network with ReLU activations typically has a significant fraction of neurons outputting exactly zero. This sparsity has several benefits. It makes the representation more efficient (the network effectively selects a subset of features for each input), it acts as a form of implicit regularization (reducing the chance of overfitting), and it makes the computations faster (zero-valued activations require no further computation in subsequent layers).
Glorot et al. (2011) found that ReLU networks naturally learned representations where 50-80% of hidden units were inactive for a given input, compared to almost no inactive units with sigmoid or tanh.
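A hedged sketch of measuring this sparsity on a randomly initialized layer (the sizes and the scaled-Gaussian initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative random dense layer: 256 inputs -> 512 hidden units
W = rng.normal(0, np.sqrt(2 / 256), size=(256, 512))  # scaled Gaussian init
X = rng.normal(size=(64, 256))                        # batch of 64 inputs

H = np.maximum(0.0, X @ W)                            # ReLU hidden layer
sparsity = np.mean(H == 0.0)                          # fraction of exact zeros
print(f"fraction of inactive units: {sparsity:.2f}")  # roughly 0.5 here
```

With zero-mean inputs and weights, about half of all pre-activations are negative, so roughly half the units are exactly zero, consistent with the lower end of the Glorot et al. figure.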
The vanishing gradient problem was the main obstacle to training deep networks before ReLU. With sigmoid, the maximum gradient is 0.25, and with tanh, the gradient falls off steeply for inputs of large magnitude. When these gradients are multiplied across many layers during backpropagation, the signal reaching early layers can be negligibly small: even at sigmoid's maximum gradient, 20 layers attenuate the signal by a factor of at most 0.25^20, about 9 x 10^-13.
ReLU has a gradient of exactly 1 for all positive inputs. This means that for any neuron that is active (outputting a positive value), the gradient passes through without any reduction. This property allows gradients to flow through very deep networks without vanishing, which is what enabled the training of networks with 10, 50, or even 100+ layers.
ReLU requires only a comparison and a conditional assignment: if x > 0, output x; otherwise, output 0. Sigmoid requires computing an exponential function, and tanh requires computing two exponentials. On modern hardware, the difference per operation is small, but neural networks evaluate activation functions billions of times during training, so the cumulative savings are significant. Krizhevsky et al. (2012) reported that a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 six times faster than the same network with tanh.
The most significant drawback of ReLU is the dying ReLU problem. When a neuron's weighted input is negative for every example in the training set, the neuron outputs zero for all inputs. Because the gradient of ReLU is zero for negative inputs, the neuron's weights receive no gradient updates, and the neuron cannot recover. It is permanently "dead."
Several conditions can trigger dying ReLU: a learning rate set too high (a single large gradient update can push a neuron's weights into a regime where its pre-activation is negative for every input), a large negative bias, and unlucky weight initialization.
Lu et al. (2019) provided a theoretical analysis showing that the probability of dying ReLU increases with network width and depth, and that proper initialization is critical to preventing it. In practice, using He initialization (which sets the variance of initial weights to 2/fan_in) substantially reduces the risk.
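A minimal sketch of He initialization for a simple dense layer (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He initialization: zero-mean Gaussian with variance 2 / fan_in,
    # chosen so ReLU layers preserve activation variance in expectation
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = he_init(784, 256)
print(W1.std())  # close to sqrt(2/784), about 0.0505
```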
In some networks, over 40% of neurons can become dead during training, severely reducing the effective capacity of the model. Monitoring the fraction of dead neurons is a useful diagnostic during training.
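One way to implement that diagnostic, as a sketch (assumes post-ReLU activations are collected as a batch-by-units matrix):

```python
import numpy as np

def dead_fraction(H):
    # H: (batch, units) matrix of post-ReLU activations.
    # A unit at zero for one input is merely sparse; a unit at zero
    # for every input in the batch is a candidate dead neuron
    # (confirm over the full dataset before concluding it is dead).
    return np.mean(np.all(H == 0.0, axis=0))
```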
Several variants have been proposed to address the dying ReLU problem while retaining ReLU's advantages.
| Variant | Formula | Key difference from ReLU | Proposed by |
|---|---|---|---|
| Leaky ReLU | f(x) = x if x > 0, else 0.01*x | Small gradient for negative inputs prevents dead neurons | Maas et al., 2013 |
| PReLU (Parametric ReLU) | f(x) = x if x > 0, else a*x (a is learned) | Negative slope is a trainable parameter; achieved state-of-the-art on ImageNet | He et al., 2015 |
| ELU (Exponential Linear Unit) | f(x) = x if x > 0, else alpha*(e^x - 1) | Smooth exponential curve for negatives; pushes mean activation toward zero | Clevert et al., 2015 |
| SELU (Scaled ELU) | f(x) = lambda * ELU(x) with specific lambda, alpha | Self-normalizing; activations maintain zero mean and unit variance without batch normalization | Klambauer et al., 2017 |
Leaky ReLU replaces the flat zero region of ReLU with a small negative slope (typically 0.01). This means that even when the input is negative, the gradient is non-zero (0.01), so the neuron can still receive gradient updates and potentially recover. The fixed slope is a hyperparameter; values between 0.01 and 0.3 are common.
He et al. (2015) proposed making the negative slope a learnable parameter. In their ImageNet experiments, PReLU improved top-1 accuracy by about 1.1% over ReLU on a very deep model, achieving a top-5 error rate of 4.94%, which surpassed human-level performance (5.1%) on the ImageNet classification benchmark for the first time. The learned slopes varied across layers and tended to be larger in early layers (0.1-0.3) and smaller in later layers.
Clevert, Unterthiner, and Hochreiter (2015) introduced ELU, which uses an exponential curve for negative inputs that smoothly saturates at -alpha (typically alpha=1). The smooth transition at zero means the gradient is continuous, unlike ReLU's sharp corner. ELU's negative saturation pushes the mean activation closer to zero, which can reduce the bias shift effect and speed up learning.
Klambauer et al. (2017) showed that with specific values of lambda (approximately 1.0507) and alpha (approximately 1.6733), along with lecun_normal weight initialization, ELU-based networks have the self-normalizing property: activations converge to zero mean and unit variance through the network without explicit normalization layers. This property holds for fully connected networks but does not extend straightforwardly to convolutional or recurrent architectures, which has limited SELU's adoption.
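Each of these variants takes only a line or two; a NumPy sketch using the constants from the cited papers (the PReLU slope is shown as a plain argument rather than a trained parameter):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    # a is a learned per-channel parameter in He et al. (2015);
    # passed in as a scalar here for illustration
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # Constants from Klambauer et al. (2017), rounded
    return lam * elu(x, alpha)
```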
| Property | ReLU | Sigmoid | Tanh |
|---|---|---|---|
| Output range | [0, infinity) | (0, 1) | (-1, 1) |
| Zero-centered | No | No | Yes |
| Maximum gradient | 1 (constant for x > 0) | 0.25 (at x = 0) | 1.0 (at x = 0) |
| Vanishing gradient | No (for positive inputs) | Yes (severe) | Yes (moderate) |
| Computational cost | Very low (comparison only) | Moderate (exponential) | Moderate (two exponentials) |
| Sparsity | Yes (outputs exactly 0 for negatives) | No (always outputs positive values) | No (always outputs non-zero values) |
| Dying neuron problem | Yes | No | No |
| Typical use today | CNN hidden layers | Binary output layers, RNN gates | RNN state computations |
Sigmoid and tanh still have important roles. Sigmoid is the standard activation for binary classification output layers and for gate mechanisms in LSTM and GRU cells. Tanh is used for the cell state in LSTMs. But for hidden layers in feedforward and convolutional networks, ReLU and its variants have almost entirely replaced both.
ReLU remains the default activation for hidden layers in convolutional neural networks and many feedforward architectures. For computer vision tasks using established CNN architectures (ResNet, VGG, and their descendants), ReLU continues to perform well and is the standard choice.
However, in transformer-based models, ReLU has been largely replaced: encoder-style models such as BERT and the GPT-2/GPT-3 family use GELU, while most recent decoder-only LLMs (LLaMA, PaLM, Mistral) use SwiGLU.
The shift from ReLU to GELU/SiLU in transformers is not because ReLU fails catastrophically in that context. Rather, GELU and SiLU provide smoother gradients and slightly better performance on language tasks. The difference is often small (fractions of a percent in perplexity), but at the scale of modern LLM training, even small improvements justify the modest additional computational cost.
For practitioners choosing between ReLU and modern alternatives: if you are working with a CNN or a standard feedforward network, ReLU is still a perfectly good default. If you are building a transformer or fine-tuning a language model, follow the conventions of the model family (typically GELU or SwiGLU). If you encounter dying ReLU during training, switch to Leaky ReLU or ELU rather than immediately jumping to more complex activations.
The shift from ReLU to newer activation functions in transformer models deserves detailed examination, as these choices directly affect model quality and training stability at scale.
GELU was proposed by Hendrycks and Gimpel (2016) and is defined as:
GELU(x) = x * Phi(x)
where Phi(x) is the cumulative distribution function of the standard normal distribution. In practice, GELU is often approximated as:
GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
GELU can be interpreted as a smooth, stochastic version of ReLU. Where ReLU deterministically zeros out negative inputs, GELU probabilistically scales inputs based on how likely they are to be positive under a Gaussian distribution. For large positive inputs, GELU behaves like the identity function; for large negative inputs, it outputs near-zero values; and for inputs near zero, it provides a smooth transition.
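A small check of how close the tanh approximation is to the exact erf-based form, as a plain-Python sketch:

```python
from math import erf, sqrt, pi, tanh

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi computed via the error function
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh approximation (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + tanh(sqrt(2.0 / pi) * (x + 0.044715 * x ** 3)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
# The two forms agree to within roughly 1e-3 over this range
```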
GELU became the standard activation in encoder-style transformers after its adoption in BERT (2018). It is also used in GPT-2, GPT-3, RoBERTa, and many other models.
SiLU, also known as Swish, was discovered through automated search by Ramachandran, Zoph, and Le (2017) at Google. It is defined as:
SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
Like GELU, SiLU is a smooth, non-monotonic function that allows small negative values to pass through. The key difference is that SiLU uses the sigmoid function rather than the Gaussian CDF. In practice, SiLU and GELU produce very similar outputs, with the main difference occurring for inputs near -1 to -3.
SiLU is computationally cheaper than GELU because sigmoid is simpler to compute than the Gaussian CDF (even the tanh approximation). This small efficiency advantage matters at the scale of modern LLM training.
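A quick numerical check of that similarity, as a plain-Python sketch (the printed differences confirm the largest gap falls in the negative region):

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

for x in [-3.0, -2.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{x:+.1f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}  "
          f"diff={silu(x) - gelu(x):+.4f}")
```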
SwiGLU, proposed by Shazeer (2020), combines SiLU with a gated linear unit (GLU) mechanism. In a standard transformer feed-forward network (FFN), the computation is:
FFN(x) = W_2 * activation(W_1 * x + b_1) + b_2
With SwiGLU, this becomes:
SwiGLU(x) = SiLU(W_1 * x) .* (W_3 * x)
where .* denotes element-wise multiplication and W_3 is an additional weight matrix; the FFN's output projection W_2 is then applied to this gated product. The gating mechanism allows the network to learn which features to pass through, providing more expressive power than a simple pointwise activation.
SwiGLU has become the dominant FFN activation in modern LLMs. It was first used at scale in Google's PaLM (2022) and subsequently adopted by Meta's LLaMA family, Mistral, and many other open-weight models.
The tradeoff is that SwiGLU requires an extra weight matrix (W_3), which increases the total number of parameters. To compensate, the hidden dimension of the FFN is typically reduced by a factor of 2/3 (e.g., from 4 * d_model to 8/3 * d_model), keeping the total parameter count roughly the same.
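A minimal NumPy sketch of a SwiGLU feed-forward block using the weight names from the formulas above; the dimensions are illustrative, and biases are omitted as in most modern LLMs:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512
d_ff = int(8 / 3 * d_model)   # reduced from 4*d_model to offset the extra matrix

W1 = rng.normal(0, 0.02, size=(d_model, d_ff))  # gate projection
W3 = rng.normal(0, 0.02, size=(d_model, d_ff))  # value projection
W2 = rng.normal(0, 0.02, size=(d_ff, d_model))  # output projection

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x):
    # FFN(x) = W_2 (SiLU(W_1 x) .* (W_3 x))
    return (silu(x @ W1) * (x @ W3)) @ W2

x = rng.normal(size=(4, d_model))  # batch of 4 token embeddings
print(swiglu_ffn(x).shape)         # (4, 512)
```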
| Activation | Formula | Used in | Computational cost | Key advantage |
|---|---|---|---|---|
| ReLU | max(0, x) | ResNet, VGG, older CNNs | Lowest | Simplicity, sparsity |
| GELU | x * Phi(x) | BERT, GPT-2, GPT-3, RoBERTa | Moderate (requires tanh approx.) | Smooth gradients near zero |
| SiLU (Swish) | x * sigmoid(x) | EfficientNet, some LLMs | Moderate (sigmoid only) | Smooth, slightly cheaper than GELU |
| SwiGLU | SiLU(W_1 x) .* (W_3 x) | LLaMA, PaLM, Mistral, DeepSeek | Higher (extra weight matrix) | Gating mechanism improves expressivity |
| GeGLU | GELU(W_1 x) .* (W_3 x) | Some research models | Higher | GELU variant of gated mechanism |
| Smooth-SwiGLU | Modified SwiGLU for FP8 stability | Intel Gaudi FP8 training | Similar to SwiGLU | Avoids outlier amplification in low precision |
The choice of activation function depends on the architecture and task:
| Architecture / Use case | Recommended activation | Notes |
|---|---|---|
| CNN hidden layers (ResNet, VGG, etc.) | ReLU | Still the default; well-understood, fast |
| Vision transformer (ViT) | GELU | Standard since the original ViT paper |
| Encoder-only transformer (BERT-style) | GELU | Established convention |
| Decoder-only LLM (new training) | SwiGLU | Current best practice; used by most SOTA models |
| Fine-tuning a pre-trained LLM | Match the base model | Use whatever activation the pre-trained model uses |
| Binary output layer | Sigmoid | Standard for binary classification |
| Multi-class output layer | Softmax | Standard for multi-class classification |
| RNN/LSTM gates | Sigmoid and tanh | Required by the gating mechanism |
| FP8 / very low precision training | Smooth-SwiGLU or GELU | SwiGLU can amplify outliers at low precision |
Despite the dominance of GELU and SwiGLU in transformers, recent research has pushed back. The ICLR 2024 paper "ReLU Strikes Back" (Mirzadeh et al.) demonstrated that large language models using ReLU can match the quality of GELU- and SiLU-based counterparts, while ReLU's activation sparsity can be exploited to substantially reduce inference computation. The gap between ReLU and smoother activations is smaller than ReLU's near-total replacement in transformers might suggest, and ReLU's cheapness and sparsity remain attractive.
Imagine you're trying to learn a new skill, like playing soccer. Your brain has to figure out which moves work well and which ones do not. In machine learning, a similar process happens when a computer tries to learn something new. The ReLU function helps the computer decide which parts of its "brain" to use for learning. When the computer finds something important (a positive signal), the ReLU function keeps it. When it finds something unimportant or bad (a negative signal), it sets it to zero. This way, the computer can learn more efficiently and figure out the best way to complete a task.