ReLU (Rectified Linear Unit) is an activation function used in neural networks, defined by the formula f(x) = max(0, x). For any positive input, ReLU returns the input unchanged; for any negative input, it returns zero. Despite this simplicity, ReLU was the activation function that made modern deep learning practical. Its adoption in the early 2010s enabled the training of much deeper networks than had been possible with sigmoid or tanh activations, and it remains one of the most widely used activation functions today.
The ReLU function is defined as:
f(x) = max(0, x)
This can also be written piecewise:
f(x) = x if x > 0, else 0
The derivative of ReLU is:
f'(x) = 1 if x > 0, else 0 (the derivative is undefined at exactly x = 0; implementations conventionally use 0 or 1 there)
ReLU is piecewise linear: it consists of two linear segments joined at x = 0. This piecewise linearity is what makes it both computationally efficient and mathematically interesting. A network of ReLU neurons partitions its input space into regions, within each of which the network computes a different linear function.
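The function and the derivative convention above fit in a few lines of NumPy; a minimal sketch:

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 1 for x > 0, else 0 (using the common convention f'(0) = 0)
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```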
The ReLU function was first applied to neural network dynamics by Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung in a 2000 paper published in Nature. Their work was motivated by neuroscience: biological neurons fire at a rate roughly proportional to their input current when that current exceeds a threshold, and do not fire at all below the threshold. The max(0, x) function captures this behavior. Hahnloser et al. showed that ReLU-type activations enable recurrent neural network dynamics to stabilize under weaker conditions than smooth activations like sigmoid.
The basic mathematical concept of the positive part function, x+ = max(0, x), has a much longer history in mathematics and was used in various contexts before neural networks. But the deliberate application to neural computation began with Hahnloser et al.
ReLU's breakthrough in practical deep learning came from two papers that demonstrated its advantages for training deep networks.
Vinod Nair and Geoffrey Hinton published "Rectified Linear Units Improve Restricted Boltzmann Machines" at ICML 2010. They compared ReLU against sigmoid activations in Restricted Boltzmann Machines (RBMs) and showed that ReLU led to better generative models, particularly on image data. Their theoretical argument centered on the observation that a ReLU unit can be approximated by an infinite sum of binary sigmoid units with shifted biases, giving it a much richer representational capacity.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio published "Deep Sparse Rectifier Neural Networks" at AISTATS 2011. This paper argued for ReLU in standard feedforward and supervised networks. Glorot et al. identified several advantages: sparse activations (many units output exactly zero), a non-saturating gradient of 1 for every active unit, very cheap computation, and a closer match to the firing behavior of biological neurons.
Critically, Glorot et al. demonstrated that deep networks with ReLU activations could be trained successfully with purely supervised learning, without the unsupervised pre-training that had been considered necessary for deep networks. This was a turning point: it meant that deep networks were practical for a much wider range of problems.
By 2012, ReLU was the default activation in convolutional neural networks. AlexNet (Krizhevsky, Sutskever, and Hinton, 2012), which won the ImageNet competition and is often cited as the catalyst for the modern deep learning era, used ReLU throughout. The authors noted that ReLU trained several times faster than an equivalent network with tanh activations.
From that point, ReLU became the standard activation for hidden layers in nearly all deep learning architectures, including VGGNet, GoogLeNet, and ResNet.
For any given input, a network with ReLU activations typically has a significant fraction of neurons outputting exactly zero. This sparsity has several benefits. It makes the representation more efficient (the network effectively selects a subset of features for each input), it acts as a form of implicit regularization (reducing the chance of overfitting), and it makes the computations faster (zero-valued activations require no further computation in subsequent layers).
Glorot et al. (2011) found that ReLU networks naturally learned representations where 50-80% of hidden units were inactive for a given input, compared to almost no inactive units with sigmoid or tanh.
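A hedged sketch of measuring this sparsity on a randomly initialized layer (the sizes and the scaled-Gaussian initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative random dense layer: 256 inputs -> 512 hidden units
W = rng.normal(0, np.sqrt(2 / 256), size=(256, 512))  # scaled Gaussian init
X = rng.normal(size=(64, 256))                        # batch of 64 inputs

H = np.maximum(0.0, X @ W)                            # ReLU hidden layer
sparsity = np.mean(H == 0.0)                          # fraction of exact zeros
print(f"fraction of inactive units: {sparsity:.2f}")  # roughly 0.5 here
```

With zero-mean inputs and weights, about half of all pre-activations are negative, so roughly half the units are exactly zero, consistent with the lower end of the Glorot et al. figure.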
The vanishing gradient problem was the main obstacle to training deep networks before ReLU. With sigmoid, the maximum gradient is 0.25, and with tanh, the gradient falls off steeply for inputs of large magnitude. When these gradients are multiplied across many layers during backpropagation, the signal reaching early layers can be negligibly small: even at sigmoid's maximum gradient, 20 layers attenuate the signal by a factor of at most 0.25^20, about 9 x 10^-13.
ReLU has a gradient of exactly 1 for all positive inputs. This means that for any neuron that is active (outputting a positive value), the gradient passes through without any reduction. This property allows gradients to flow through very deep networks without vanishing, which is what enabled the training of networks with 10, 50, or even 100+ layers.
ReLU requires only a comparison and a conditional assignment: if x > 0, output x; otherwise, output 0. Sigmoid requires computing an exponential function, and tanh requires computing two exponentials. On modern hardware, the difference per operation is small, but neural networks evaluate activation functions billions of times during training, so the cumulative savings are significant. Krizhevsky et al. (2012) reported that a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 six times faster than the same network with tanh.
The most significant drawback of ReLU is the dying ReLU problem. When a neuron's weighted input is negative for every example in the training set, the neuron outputs zero for all inputs. Because the gradient of ReLU is zero for negative inputs, the neuron's weights receive no gradient updates, and the neuron cannot recover. It is permanently "dead."
Several conditions can trigger dying ReLU: a learning rate set too high (a single large gradient update can push a neuron's weights into a regime where its pre-activation is negative for every input), a large negative bias, and unlucky weight initialization.
Lu et al. (2019) provided a theoretical analysis showing that the probability of dying ReLU increases with network width and depth, and that proper initialization is critical to preventing it. In practice, using He initialization (which sets the variance of initial weights to 2/fan_in) substantially reduces the risk.
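A minimal sketch of He initialization for a simple dense layer (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    # He initialization: zero-mean Gaussian with variance 2 / fan_in,
    # chosen so ReLU layers preserve activation variance in expectation
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W1 = he_init(784, 256)
print(W1.std())  # close to sqrt(2/784), about 0.0505
```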
In some networks, over 40% of neurons can become dead during training, severely reducing the effective capacity of the model. Monitoring the fraction of dead neurons is a useful diagnostic during training.
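One way to implement that diagnostic, as a sketch (assumes post-ReLU activations are collected as a batch-by-units matrix):

```python
import numpy as np

def dead_fraction(H):
    # H: (batch, units) matrix of post-ReLU activations.
    # A unit at zero for one input is merely sparse; a unit at zero
    # for every input in the batch is a candidate dead neuron
    # (confirm over the full dataset before concluding it is dead).
    return np.mean(np.all(H == 0.0, axis=0))
```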
Several variants have been proposed to address the dying ReLU problem while retaining ReLU's advantages.
| Variant | Formula | Key difference from ReLU | Proposed by |
|---|---|---|---|
| Leaky ReLU | f(x) = x if x > 0, else 0.01*x | Small gradient for negative inputs prevents dead neurons | Maas et al., 2013 |
| PReLU (Parametric ReLU) | f(x) = x if x > 0, else a*x (a is learned) | Negative slope is a trainable parameter; achieved state-of-the-art on ImageNet | He et al., 2015 |
| ELU (Exponential Linear Unit) | f(x) = x if x > 0, else alpha*(e^x - 1) | Smooth exponential curve for negatives; pushes mean activation toward zero | Clevert et al., 2015 |
| SELU (Scaled ELU) | f(x) = lambda * ELU(x) with specific lambda, alpha | Self-normalizing; activations maintain zero mean and unit variance without batch normalization | Klambauer et al., 2017 |
Leaky ReLU replaces the flat zero region of ReLU with a small negative slope (typically 0.01). This means that even when the input is negative, the gradient is non-zero (0.01), so the neuron can still receive gradient updates and potentially recover. The fixed slope is a hyperparameter; values between 0.01 and 0.3 are common.
He et al. (2015) proposed making the negative slope a learnable parameter. In their ImageNet experiments, PReLU improved top-1 accuracy by about 1.1% over ReLU on a very deep model, achieving a top-5 error rate of 4.94%, which surpassed human-level performance (5.1%) on the ImageNet classification benchmark for the first time. The learned slopes varied across layers and tended to be larger in early layers (0.1-0.3) and smaller in later layers.
Clevert, Unterthiner, and Hochreiter (2015) introduced ELU, which uses an exponential curve for negative inputs that smoothly saturates at -alpha (typically alpha=1). The smooth transition at zero means the gradient is continuous, unlike ReLU's sharp corner. ELU's negative saturation pushes the mean activation closer to zero, which can reduce the bias shift effect and speed up learning.
Klambauer et al. (2017) showed that with specific values of lambda (approximately 1.0507) and alpha (approximately 1.6733), along with lecun_normal weight initialization, ELU-based networks have the self-normalizing property: activations converge to zero mean and unit variance through the network without explicit normalization layers. This property holds for fully connected networks but does not extend straightforwardly to convolutional or recurrent architectures, which has limited SELU's adoption.
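Each of these variants takes only a line or two; a NumPy sketch using the constants from the cited papers (the PReLU slope is shown as a plain argument rather than a trained parameter):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def prelu(x, a):
    # a is a learned per-channel parameter in He et al. (2015);
    # passed in as a scalar here for illustration
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # Constants from Klambauer et al. (2017), rounded
    return lam * elu(x, alpha)
```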
| Property | ReLU | Sigmoid | Tanh |
|---|---|---|---|
| Output range | [0, infinity) | (0, 1) | (-1, 1) |
| Zero-centered | No | No | Yes |
| Maximum gradient | 1 (constant for x > 0) | 0.25 (at x = 0) | 1.0 (at x = 0) |
| Vanishing gradient | No (for positive inputs) | Yes (severe) | Yes (moderate) |
| Computational cost | Very low (comparison only) | Moderate (exponential) | Moderate (two exponentials) |
| Sparsity | Yes (outputs exactly 0 for negatives) | No (always outputs positive values) | No (always outputs non-zero values) |
| Dying neuron problem | Yes | No | No |
| Typical use today | CNN hidden layers | Binary output layers, RNN gates | RNN state computations |
Sigmoid and tanh still have important roles. Sigmoid is the standard activation for binary classification output layers and for gate mechanisms in LSTM and GRU cells. Tanh is used for the cell state in LSTMs. But for hidden layers in feedforward and convolutional networks, ReLU and its variants have almost entirely replaced both.
ReLU remains the default activation for hidden layers in convolutional neural networks and many feedforward architectures. For computer vision tasks using established CNN architectures (ResNet, VGG, and their descendants), ReLU continues to perform well and is the standard choice.
However, in transformer-based models, ReLU has been largely replaced: encoder-style models such as BERT and the GPT-2/GPT-3 family use GELU, while most recent decoder-only LLMs (LLaMA, PaLM, Mistral) use SwiGLU.
The shift from ReLU to GELU/SiLU in transformers is not because ReLU fails catastrophically in that context. Rather, GELU and SiLU provide smoother gradients and slightly better performance on language tasks. The difference is often small (fractions of a percent in perplexity), but at the scale of modern LLM training, even small improvements justify the modest additional computational cost.
For practitioners choosing between ReLU and modern alternatives: if you are working with a CNN or a standard feedforward network, ReLU is still a perfectly good default. If you are building a transformer or fine-tuning a language model, follow the conventions of the model family (typically GELU or SwiGLU). If you encounter dying ReLU during training, switch to Leaky ReLU or ELU rather than immediately jumping to more complex activations.
The shift from ReLU to newer activation functions in transformer models deserves detailed examination, as these choices directly affect model quality and training stability at scale.
GELU was proposed by Hendrycks and Gimpel (2016) and is defined as:
GELU(x) = x * Phi(x)
where Phi(x) is the cumulative distribution function of the standard normal distribution. In practice, GELU is often approximated as:
GELU(x) ≈ 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
GELU can be interpreted as a smooth, stochastic version of ReLU. Where ReLU deterministically zeros out negative inputs, GELU probabilistically scales inputs based on how likely they are to be positive under a Gaussian distribution. For large positive inputs, GELU behaves like the identity function; for large negative inputs, it outputs near-zero values; and for inputs near zero, it provides a smooth transition.
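A small check of how close the tanh approximation is to the exact erf-based form, as a plain-Python sketch:

```python
from math import erf, sqrt, pi, tanh

def gelu_exact(x):
    # GELU(x) = x * Phi(x), with Phi computed via the error function
    return x * 0.5 * (1.0 + erf(x / sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh approximation (Hendrycks & Gimpel, 2016)
    return 0.5 * x * (1.0 + tanh(sqrt(2.0 / pi) * (x + 0.044715 * x ** 3)))

for x in [-3.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{x:+.1f}  exact={gelu_exact(x):+.6f}  approx={gelu_tanh(x):+.6f}")
# The two forms agree to within roughly 1e-3 over this range
```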
GELU became the standard activation in encoder-style transformers after its adoption in BERT (2018). It is also used in GPT-2, GPT-3, RoBERTa, and many other models.
SiLU, also known as Swish, was discovered through automated search by Ramachandran, Zoph, and Le (2017) at Google. It is defined as:
SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x))
Like GELU, SiLU is a smooth, non-monotonic function that allows small negative values to pass through. The key difference is that SiLU uses the sigmoid function rather than the Gaussian CDF. In practice, SiLU and GELU produce very similar outputs, with the main difference occurring for inputs near -1 to -3.
SiLU is computationally cheaper than GELU because sigmoid is simpler to compute than the Gaussian CDF (even the tanh approximation). This small efficiency advantage matters at the scale of modern LLM training.
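A quick numerical check of that similarity, as a plain-Python sketch (the printed differences confirm the largest gap falls in the negative region):

```python
import math

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

for x in [-3.0, -2.0, -1.0, 0.0, 1.0, 3.0]:
    print(f"{x:+.1f}  gelu={gelu(x):+.4f}  silu={silu(x):+.4f}  "
          f"diff={silu(x) - gelu(x):+.4f}")
```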
SwiGLU, proposed by Shazeer (2020), combines SiLU with a gated linear unit (GLU) mechanism. In a standard transformer feed-forward network (FFN), the computation is:
FFN(x) = W_2 * activation(W_1 * x + b_1) + b_2
With SwiGLU, this becomes:
SwiGLU(x) = SiLU(W_1 * x) .* (W_3 * x)
where .* denotes element-wise multiplication and W_3 is an additional weight matrix; the FFN's output projection W_2 is then applied to this gated product. The gating mechanism allows the network to learn which features to pass through, providing more expressive power than a simple pointwise activation.
SwiGLU has become the dominant FFN activation in modern LLMs. It was first used at scale in Google's PaLM (2022) and subsequently adopted by Meta's LLaMA family, Mistral, and many other open-weight models.
The tradeoff is that SwiGLU requires an extra weight matrix (W_3), which increases the total number of parameters. To compensate, the hidden dimension of the FFN is typically reduced by a factor of 2/3 (e.g., from 4 * d_model to 8/3 * d_model), keeping the total parameter count roughly the same.
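A minimal NumPy sketch of a SwiGLU feed-forward block using the weight names from the formulas above; the dimensions are illustrative, and biases are omitted as in most modern LLMs:

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 512
d_ff = int(8 / 3 * d_model)   # reduced from 4*d_model to offset the extra matrix

W1 = rng.normal(0, 0.02, size=(d_model, d_ff))  # gate projection
W3 = rng.normal(0, 0.02, size=(d_model, d_ff))  # value projection
W2 = rng.normal(0, 0.02, size=(d_ff, d_model))  # output projection

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x):
    # FFN(x) = W_2 (SiLU(W_1 x) .* (W_3 x))
    return (silu(x @ W1) * (x @ W3)) @ W2

x = rng.normal(size=(4, d_model))  # batch of 4 token embeddings
print(swiglu_ffn(x).shape)         # (4, 512)
```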
| Activation | Formula | Used in | Computational cost | Key advantage |
|---|---|---|---|---|
| ReLU | max(0, x) | ResNet, VGG, older CNNs | Lowest | Simplicity, sparsity |
| GELU | x * Phi(x) | BERT, GPT-2, GPT-3, RoBERTa | Moderate (requires tanh approx.) | Smooth gradients near zero |
| SiLU (Swish) | x * sigmoid(x) | EfficientNet, some LLMs | Moderate (sigmoid only) | Smooth, slightly cheaper than GELU |
| SwiGLU | SiLU(W_1 x) .* (W_3 x) | LLaMA, PaLM, Mistral, DeepSeek | Higher (extra weight matrix) | Gating mechanism improves expressivity |
| GeGLU | GELU(W_1 x) .* (W_3 x) | Some research models | Higher | GELU variant of gated mechanism |
| Smooth-SwiGLU | Modified SwiGLU for FP8 stability | Intel Gaudi FP8 training | Similar to SwiGLU | Avoids outlier amplification in low precision |
The choice of activation function depends on the architecture and task:
| Architecture / Use case | Recommended activation | Notes |
|---|---|---|
| CNN hidden layers (ResNet, VGG, etc.) | ReLU | Still the default; well-understood, fast |
| Vision transformer (ViT) | GELU | Standard since the original ViT paper |
| Encoder-only transformer (BERT-style) | GELU | Established convention |
| Decoder-only LLM (new training) | SwiGLU | Current best practice; used by most SOTA models |
| Fine-tuning a pre-trained LLM | Match the base model | Use whatever activation the pre-trained model uses |
| Binary output layer | Sigmoid | Standard for binary classification |
| Multi-class output layer | Softmax | Standard for multi-class classification |
| RNN/LSTM gates | Sigmoid and tanh | Required by the gating mechanism |
| FP8 / very low precision training | Smooth-SwiGLU or GELU | SwiGLU can amplify outliers at low precision |
Despite the dominance of GELU and SwiGLU in transformers, recent research has pushed back. The ICLR 2024 paper "ReLU Strikes Back" (Mirzadeh et al.) demonstrated that large language models using ReLU can match the quality of GELU- and SiLU-based counterparts, while ReLU's activation sparsity can be exploited to substantially reduce inference computation. The gap between ReLU and smoother activations is smaller than ReLU's near-total replacement in transformers might suggest, and ReLU's cheapness and sparsity remain attractive.
Imagine you're trying to learn a new skill, like playing soccer. Your brain has to figure out which moves work well and which ones do not. In machine learning, a similar process happens when a computer tries to learn something new. The ReLU function helps the computer decide which parts of its "brain" to use for learning. When the computer finds something important (a positive signal), the ReLU function keeps it. When it finds something unimportant or bad (a negative signal), it sets it to zero. This way, the computer can learn more efficiently and figure out the best way to complete a task.