See also: Activation function, Deep learning, Neural network
The Rectified Linear Unit (ReLU) is the most widely used activation function in deep learning. Defined mathematically as f(x) = max(0, x), it returns the input directly when positive and outputs zero otherwise. Despite its simplicity, ReLU was a breakthrough that helped make training deep neural networks practical, and its adoption in the early 2010s played a central role in the modern deep learning revolution.
ReLU belongs to a family of rectifier functions. Its name comes from an analogy with half-wave rectification in electrical engineering, where only the positive half of a signal passes through. In the context of neural networks, ReLU allows only positive activations to propagate while suppressing all negative inputs to zero. This straightforward behavior gives it several computational and mathematical advantages over earlier activation functions like the sigmoid function and hyperbolic tangent (tanh), both of which suffer from saturation and vanishing gradients in deep architectures.
The ReLU function is defined as:
f(x) = max(0, x)
This can be written equivalently in several forms, for example as f(x) = (x + |x|) / 2, or as f(x) = x · [x > 0], where [·] denotes the indicator function (1 when the condition holds, 0 otherwise).
The derivative (subgradient) of ReLU is equally simple:
f'(x) = 1 if x > 0; f'(x) = 0 if x < 0
At x = 0, ReLU is not differentiable in the strict mathematical sense. In practice, implementations typically assign the derivative at zero to be either 0 or 1, and this choice has negligible effect on training. The simplicity of both the function and its derivative is one of the key reasons ReLU is so computationally efficient: evaluating it requires only a comparison operation, and computing the gradient requires only checking the sign of the input.
The derivative of ReLU has an important training implication. Because f'(x) is either 0 or 1, the gradient signal during backpropagation is either completely passed through (for positive activations) or completely blocked (for negative activations). This binary behavior stands in contrast to sigmoid and tanh, whose derivatives are continuous values between 0 and 1, meaning they always attenuate the gradient to some degree. The pass-through property of ReLU's derivative is the primary mechanism by which it combats the vanishing gradient problem.
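This behavior is easy to express in code. The following NumPy sketch (an illustration written from the definitions above, not taken from any library) implements the forward pass and the subgradient used in backpropagation, taking the derivative at zero to be 0:

```python
import numpy as np

def relu(x):
    """Forward pass: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_backward(x, upstream_grad):
    """Backward pass: pass the upstream gradient through where x > 0,
    block it (multiply by 0) where x <= 0."""
    return upstream_grad * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                            # [0.  0.  0.  0.5 2. ]
print(relu_backward(x, np.ones_like(x)))  # [0. 0. 0. 1. 1.]
```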
A single ReLU neuron computes a "hinge" or "fold" in the input space. When multiple ReLU neurons are composed across layers of a neural network, the network creates a piecewise linear function that can approximate arbitrarily complex nonlinear mappings. Each neuron contributes a linear piece, and the combination of many such pieces across multiple layers can represent highly complex decision boundaries and function surfaces.
The mathematical function max(0, x) has a long history that predates its use in neural networks. In 1941, Alston Householder first applied it as a mathematical abstraction of biological neural networks in his work on the theory of neural computation. The function captures a basic property of biological neurons: they either fire (positive output) or remain silent (zero output), with no negative firing rate.
In 1969, Kunihiko Fukushima employed the rectifier function in the context of visual feature extraction in hierarchical neural networks. Fukushima referred to it as an "analog threshold element" and used it within architectures that would eventually evolve into the Cognitron (1975) and the Neocognitron (1979), which are considered precursors to modern convolutional neural networks.
In a landmark 2000 paper published in Nature, Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung provided both biological and mathematical justifications for the rectifier activation function. They argued that ReLU approximates the biological relationship between neural firing rates and input current, a property known as the f-I curve in neuroscience. Biological neurons in their main operating regime respond roughly linearly to increases in input current above a threshold, and produce zero output below that threshold. This matches the ReLU function closely. Additionally, Hahnloser et al. demonstrated that ReLU enables recurrent neural network dynamics to stabilize under weaker mathematical criteria than networks using other activation functions.
Before 2009, the sigmoid function and tanh were the dominant activation functions in neural networks. These functions saturate for large positive or negative inputs, causing their gradients to shrink toward zero. In deep networks with many layers, this saturation compounds through backpropagation, leading to the vanishing gradient problem that makes training extremely slow or effectively impossible for networks beyond a few layers.
In 2009, Jarrett, Kavukcuoglu, Ranzato, and LeCun showed that rectified activation was "the single most important factor" for achieving good performance in object recognition with convolutional neural networks. Their experiments demonstrated that rectification allowed average pooling to work effectively without the cancellation effects that occur with functions that produce both positive and negative outputs.
In 2010, Vinod Nair and Geoffrey Hinton made a theoretical argument in favor of the softplus function (a smooth approximation to ReLU) and found that ReLU activation allowed strong empirical performance in restricted Boltzmann machines. Their paper, "Rectified Linear Units Improve Restricted Boltzmann Machines," was influential in bringing ReLU to the attention of the broader deep learning community.
In 2011, Xavier Glorot, Antoine Bordes, and Yoshua Bengio published "Deep Sparse Rectifier Neural Networks," which provided a comprehensive argument for why ReLU should replace sigmoid and tanh as the default activation function. Glorot et al. highlighted four key advantages: ReLU is more similar to biological neurons in their main operating regime; it avoids vanishing gradients because the gradient is either 0 or 1; it is cheaper to compute than exponential-based functions; and it naturally produces sparse representations, because many hidden units output exactly zero for a given input.
The adoption of ReLU reached a turning point in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used it in AlexNet, their deep convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge. AlexNet achieved a top-5 error rate of 15.3%, dramatically outperforming the second-place entry at 26.2%. This margin of more than ten percentage points stunned the computer vision community. Yann LeCun described the result as "an unequivocal turning point in the history of computer vision."
The AlexNet paper reported that using ReLU activation allowed training to converge roughly six times faster than an equivalent network using tanh activation on the CIFAR-10 dataset. This speedup was critical because the sheer depth and size of AlexNet (60 million parameters across 650,000 neurons) made training with slower-converging activation functions impractical. After AlexNet, ReLU became the default activation function for deep neural networks across virtually every domain, and it held that position for much of the following decade.
The vanishing gradient problem was one of the most significant obstacles to training deep neural networks before ReLU. With sigmoid or tanh activations, the gradients of neurons in early layers become exponentially small as the number of layers increases, because each layer multiplies the gradient by the activation's derivative, which is at most 0.25 for the sigmoid and below 1 almost everywhere for tanh. This makes it nearly impossible for the network to learn useful representations in its earlier layers.
ReLU solves this problem in a direct way: for any positive input, the gradient of ReLU is exactly 1. This means that during backpropagation, the gradient passes through ReLU neurons unchanged (for positive activations). There is no multiplicative shrinkage of the signal, so gradients can flow through many layers without vanishing. This property was essential for making it possible to train networks with tens or even hundreds of layers.
When a network uses ReLU, any neuron receiving a negative input produces an output of exactly zero. In a randomly initialized network, roughly 50% of hidden units will be inactive (outputting zero) for any given input. This sparsity has several benefits: many activations are exactly zero and can be skipped in downstream computation, representations tend to be more disentangled because only a subset of units responds to each input, and the effective size of the representation can vary from example to example.
This contrasts sharply with sigmoid and tanh, where neurons are almost always producing nonzero outputs, leading to dense and computationally heavier representations.
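The degree of sparsity is easy to measure directly. The sketch below (layer sizes, batch size, and initialization are arbitrary illustrative choices) computes the fraction of zero activations in a randomly initialized ReLU layer and also checks for units that are zero on every input in the batch, i.e. candidates for "dead" neurons:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hidden = 256, 100, 512

X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_hidden)) * np.sqrt(2.0 / d_in)  # He-style init
b = np.zeros(d_hidden)

H = np.maximum(0.0, X @ W + b)            # ReLU activations

sparsity = np.mean(H == 0.0)              # fraction of zero activations (~0.5)
dead = np.mean(np.all(H == 0.0, axis=0))  # units that are zero for every input
print(f"zero activations: {sparsity:.2%}, dead units: {dead:.2%}")
```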
ReLU requires only a comparison operation (checking if the input is positive) and possibly an assignment (setting the output to zero). There are no exponential, logarithmic, or trigonometric operations involved. This makes both the forward pass and the gradient computation extremely fast. For large-scale models with millions or billions of parameters, this efficiency translates into meaningful reductions in training time and energy consumption.
ReLU satisfies the property that max(0, ax) = a * max(0, x) for any non-negative scalar a. This scale invariance means that scaling the weights of a ReLU neuron scales the output proportionally, a property that interacts usefully with weight initialization schemes and batch normalization. However, this property also introduces a form of parametric redundancy, since multiplying all incoming weights by a constant and dividing all outgoing weights by the same constant does not change the network's output.
One of ReLU's underappreciated strengths is its connection to biological neuroscience. The activation function of a biological neuron can be roughly characterized by the f-I (frequency-current) curve, which describes how a neuron's firing rate responds to injected current. In the main operating regime of cortical neurons, this relationship is approximately linear above a threshold and zero below it, closely matching the ReLU function.
Hahnloser et al. (2000) formally showed this connection in their Nature paper, demonstrating that neural circuits with half-wave rectification (equivalent to ReLU) exhibit stable dynamics and can perform analog computation. Glorot et al. (2011) further argued that the sparsity produced by ReLU mirrors the sparse coding observed in biological neural systems, where only a small fraction of neurons are active at any given time. Neuroscience research has shown that sparse activation patterns in the cortex improve energy efficiency and increase the representational capacity of neural populations.
The thresholding behavior of ReLU is also reminiscent of the all-or-nothing firing pattern of biological neurons, which require net excitatory synaptic input to surpass a certain threshold before generating an action potential. Below that threshold, the neuron produces no output. While the analogy is not perfect (biological neurons have much more complex dynamics including temporal coding, refractory periods, and Dale's principle constraining excitatory and inhibitory roles), ReLU captures the essential nonlinearity that makes neural computation powerful.
Despite its advantages, ReLU has a well-known failure mode called the "dying ReLU" problem. A neuron is said to have "died" when its inputs are consistently negative across all training examples, causing it to always output zero. Because the gradient of ReLU is zero for negative inputs, a dead neuron receives no gradient signal during backpropagation and can never recover. Effectively, it is permanently removed from the network's computation.
The dying ReLU problem typically arises from several causes:
Excessively high learning rates. Large gradient updates can push the weights of a neuron into a regime where the weighted sum of inputs is negative for every training example. Once this happens, the neuron's gradient is zero, and no further updates can move it back to a useful state.
Poor weight initialization. If the initial weights are set in a way that produces large negative biases or systematically negative pre-activation values, many neurons may start in a dead state and never contribute to learning. Symmetric probability distributions used in conventional initialization are particularly susceptible to this issue.
Large negative biases. A bias term that is too negative can shift the pre-activation value below zero for all inputs, killing the neuron from the start.
Data distribution shifts. If the distribution of inputs to a layer changes substantially during training (a phenomenon related to internal covariate shift), neurons that were previously active may find themselves consistently receiving negative inputs.
Several strategies can mitigate or prevent the dying ReLU problem: using a lower learning rate (or a warmup schedule) so that a single update cannot push a neuron permanently into the negative regime; initializing weights with a scheme designed for rectifiers, such as He initialization; initializing biases to a small positive value so that neurons start in the active region; applying batch normalization to keep pre-activation distributions centered; and switching to a variant such as Leaky ReLU, PReLU, or ELU that retains a nonzero gradient for negative inputs.
Beyond the dying ReLU problem, ReLU has several other limitations: its outputs are not zero-centered, which can bias gradient updates; it is unbounded above, so activations can grow very large; it is not differentiable at zero; and it discards all information carried by the magnitude of negative pre-activations.
The limitations of standard ReLU have motivated the development of numerous variants, each designed to address specific shortcomings while preserving ReLU's core advantages.
Introduced by Maas, Hannun, and Ng in 2013, Leaky ReLU allows a small, non-zero gradient when the input is negative:
f(x) = x if x > 0; f(x) = alpha * x if x <= 0
The slope parameter alpha is typically set to a small constant like 0.01 or 0.1. By maintaining a small gradient for negative inputs, Leaky ReLU prevents neurons from dying completely. The function remains computationally inexpensive, adding only a single multiplication for negative values.
Proposed by He, Zhang, Ren, and Sun in 2015 (the same paper that introduced He initialization), PReLU treats the negative slope alpha as a learnable parameter that is optimized during training along with the other network weights:
f(x) = x if x > 0; f(x) = alpha * x if x <= 0
The difference from Leaky ReLU is that alpha is not fixed but learned. This gives the network the flexibility to determine the optimal amount of information to pass through for negative inputs, which can vary across layers and neurons. PReLU was shown to improve classification accuracy on the ImageNet dataset compared to standard ReLU.
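Both functions are available as standard layers in major frameworks; the PyTorch sketch below (the slope values shown are just common defaults, chosen for illustration) contrasts the fixed slope of Leaky ReLU with the learnable slope of PReLU:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

leaky = nn.LeakyReLU(negative_slope=0.01)  # alpha is a fixed hyperparameter
prelu = nn.PReLU(init=0.25)                # alpha is a learnable parameter

print(leaky(x))  # tensor([-0.0200, -0.0050,  0.0000,  1.5000])
print(prelu(x))  # tensor([-0.5000, -0.1250,  0.0000,  1.5000])

# PReLU's alpha receives gradients during training like any other weight:
print(list(prelu.parameters()))
```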
Proposed by Clevert, Unterthiner, and Hochreiter in 2015, ELU uses an exponential curve for negative inputs:
f(x) = x if x > 0; f(x) = alpha * (e^x - 1) if x <= 0
The hyperparameter alpha (typically set to 1.0) controls the saturation value for negative inputs. ELU has several advantages over ReLU: it produces negative outputs, making its mean activation closer to zero (which helps gradient flow); it saturates smoothly to -alpha for large negative inputs, adding robustness to noise; and it is smooth and differentiable everywhere, including at zero. Experiments demonstrated that ELU networks achieved higher classification accuracy and converged faster than ReLU networks, particularly in deeper architectures.
Introduced by Klambauer, Unterthiner, Mayr, and Hochreiter in 2017, SELU is a scaled version of ELU with carefully chosen constants:
f(x) = lambda * x if x > 0; f(x) = lambda * alpha * (e^x - 1) if x <= 0
where lambda = 1.0507 and alpha = 1.6733 (approximately).
The specific values of lambda and alpha were derived using the Banach fixed-point theorem to guarantee that activations converge toward zero mean and unit variance as they propagate through layers. This "self-normalizing" property means that SELU networks do not need batch normalization to maintain stable activations. The authors proved that vanishing and exploding gradients are impossible when the self-normalizing conditions are met. However, SELU requires specific conditions to achieve its theoretical guarantees: the network must use fully connected layers, and the weights must be initialized with LeCun normal initialization (zero mean, variance of 1/n). SELU-based self-normalizing networks significantly outperformed competing methods on 121 UCI machine learning tasks.
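The following NumPy sketch implements ELU and SELU directly from the definitions above, using the rounded constants quoted in the text (illustrative only, not copied from any library):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # SELU is a scaled ELU with fixed constants chosen for self-normalization
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # approaches -1.0 for large negative inputs
print(selu(x))  # approaches -lam * alpha (about -1.76) for large negative inputs
```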
Introduced by Hendrycks and Gimpel in 2016, GELU takes a fundamentally different approach by incorporating a probabilistic element:
f(x) = x * Phi(x)
where Phi(x) is the cumulative distribution function of the standard normal distribution.
Rather than deterministically zeroing out negative values as ReLU does, GELU gates each input by its probability of being positive under a standard normal distribution. Inputs that are clearly positive pass through nearly unchanged, while inputs near zero are partially suppressed, and strongly negative inputs are driven close to zero. This creates a smooth, non-monotonic function with a small "bump" in the negative region. The GELU nonlinearity weights inputs by their value rather than gating inputs by their sign as in ReLU, combining the intuitions of dropout and zoneout.
GELU rose to prominence as the activation function used in BERT (2018) and the GPT series of models from OpenAI. Its smooth gradient flow and probabilistic gating proved particularly effective for transformer architectures. Common approximations used in practice include the tanh-based form 0.5 · x · (1 + tanh(sqrt(2/pi) · (x + 0.044715 · x^3))) and the simpler sigmoid-based form x · sigmoid(1.702 · x).
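A minimal NumPy sketch of the exact definition and the tanh approximation, written from the formulas above rather than taken from any framework, might look like this:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation used by many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```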
The Sigmoid Linear Unit (SiLU), independently discovered and also known as Swish, was proposed by Elfwing et al. in 2017 and popularized by Ramachandran, Zoph, and Le at Google Brain in 2017 through a neural architecture search:
f(x) = x * sigmoid(x) = x / (1 + e^(-x))
Swish was discovered through a combination of exhaustive and reinforcement learning-based search over possible activation function formulas. Like GELU, Swish is smooth and non-monotonic. It allows small negative values to pass through, which can help with information flow. Unlike ReLU, Swish is bounded below (its global minimum is approximately -0.28) and unbounded above. In experiments across a range of tasks, Swish consistently matched or outperformed ReLU, particularly in very deep networks. On ImageNet, replacing ReLUs with Swish improved top-1 classification accuracy by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2. Swish has lower computational cost than GELU while offering similar benefits.
ReLU6 is a clipped variant of ReLU defined as:
f(x) = min(max(0, x), 6)
First introduced in the context of convolutional deep belief networks, ReLU6 caps the maximum output value at 6. This seemingly simple modification serves an important practical purpose in mobile and embedded deep learning deployments. By limiting activations to a predefined range of [0, 6], ReLU6 makes neural networks more robust when using low-precision (fixed-point or quantized) arithmetic. Without an upper bound, large activation values can cause overflow or significant quantization errors in 8-bit or 16-bit representations.
ReLU6 gained prominence through its use in MobileNet (Howard et al., 2017) and MobileNetV2 (Sandler et al., 2018), where computational efficiency and compatibility with mobile hardware are critical. The value of 6 was chosen empirically to balance expressiveness with bit compression, making it suitable for fixed-point inference and efficient hardware implementation. In MobileNetV3, both ReLU6 and hard-swish (h-swish) are used depending on the layer.
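In code, ReLU6 is a one-line clamp. The sketch below also shows hard-swish, the ReLU6-based approximation of Swish used in MobileNetV3, written from the published formula x * ReLU6(x + 3) / 6 (illustrative only):

```python
import numpy as np

def relu6(x):
    # Clamp activations into the fixed range [0, 6]
    return np.minimum(np.maximum(0.0, x), 6.0)

def hard_swish(x):
    # MobileNetV3's piecewise-linear approximation of Swish, built on ReLU6
    return x * relu6(x + 3.0) / 6.0

x = np.array([-4.0, -1.0, 0.0, 3.0, 8.0])
print(relu6(x))       # [0. 0. 0. 3. 6.]
print(hard_swish(x))  # [-0.    -0.333  0.     3.     8.   ]
```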
Proposed by Misra in 2019, Mish is another smooth, non-monotonic activation function:
f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
Mish shares several properties with Swish, including non-monotonicity and smoothness, but uses the softplus function instead of the sigmoid. Experiments found that Mish frequently outperformed both ReLU and Swish across a range of tasks and architectures. The function exhibits a "self-regularizing" property attributed to a specific term in its first derivative, which helps prevent overfitting. Being unbounded above, Mish avoids saturation that would slow training due to near-zero gradients. Mish demonstrated strong results in object detection, outperforming Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone by 2.1% average precision on MS-COCO, and outperforming ReLU on ResNet-50 on ImageNet-1k in top-1 accuracy by approximately 1%.
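Both Swish/SiLU (above) and Mish are one-line functions in NumPy; the sketch below implements them from their formulas for a side-by-side comparison (illustrative only):

```python
import numpy as np

def swish(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def mish(x):
    # Mish: x * tanh(softplus(x)); log1p(exp(x)) is the softplus function
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(swish(x))  # small negative values pass through near zero
print(mish(x))
```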
SwiGLU (Swish-Gated Linear Unit), introduced by Shazeer in 2020, combines the Swish activation with a gating mechanism in the feed-forward layers of transformer models:
SwiGLU(x) = Swish(xW_1) * (xW_2)
where W_1 and W_2 are two separate weight matrices. Instead of a single linear transformation followed by an activation, SwiGLU uses two linear projections whose outputs are multiplied element-wise, with one passing through the Swish function as a gate. SwiGLU has become the standard activation in many modern large language models, including LLaMA, PaLM, and DeepSeek. It consistently achieves better perplexity and downstream task performance compared to GELU or ReLU in large-scale transformer training.
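A sketch of a SwiGLU feed-forward block in PyTorch might look as follows; the layer names, the bias-free projections, and the dimensions are illustrative assumptions rather than a reproduction of any particular model's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) multiplied element-wise with x W2, then projected back
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

block = SwiGLUFeedForward(d_model=64, d_hidden=256)
y = block(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
print(y.shape)
```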
Several additional ReLU variants have been proposed for specific use cases, including randomized Leaky ReLU (RReLU), which samples the negative slope randomly during training; concatenated ReLU (CReLU), which applies ReLU to both x and -x and concatenates the results; and softplus, the smooth approximation of ReLU mentioned earlier.
The following table summarizes the key properties and trade-offs of ReLU and its major variants:
| Activation Function | Formula | Smooth | Zero-Centered | Dying Neuron Risk | Computational Cost | Typical Use Case |
|---|---|---|---|---|---|---|
| ReLU | max(0, x) | No | No | Yes | Very low | CNNs, general deep learning |
| Leaky ReLU | max(alpha*x, x) | No | Approximately | No | Very low | CNNs when dying ReLU is a concern |
| PReLU | max(alpha*x, x), alpha learned | No | Approximately | No | Low | Image classification (e.g., ImageNet) |
| ELU | x if x>0; alpha*(e^x - 1) if x<=0 | Yes | Approximately | No | Moderate | Deep networks needing zero-centered output |
| SELU | lambda*ELU(x) with fixed constants | Yes | Yes (converges) | No | Moderate | Fully connected self-normalizing networks |
| ReLU6 | min(max(0, x), 6) | No | No | Yes | Very low | Mobile and embedded networks |
| GELU | x * Phi(x) | Yes | Approximately | No | Moderate to high | Transformers (BERT, GPT) |
| SiLU / Swish | x * sigmoid(x) | Yes | Approximately | No | Moderate | Deep CNNs, general deep learning |
| Mish | x * tanh(softplus(x)) | Yes | Approximately | No | Moderate to high | Image classification, object detection |
| SwiGLU | Swish(xW1) * xW2 | Yes | Approximately | No | Higher (two projections) | Large language models (LLaMA, PaLM) |
ReLU is widely recommended as the default activation function for hidden layers in neural networks. This recommendation, which has held since the early 2010s, comes from several practical observations: ReLU performs well across a wide range of architectures and tasks without tuning; it is extremely cheap to compute; it trains reliably when paired with standard tools such as He initialization and batch normalization; and its main failure mode, dying neurons, is well understood and straightforward to detect and mitigate.
For practitioners starting a new project, ReLU remains a strong first choice for hidden layers unless there is a specific reason to use an alternative (such as building a transformer model, where GELU or SwiGLU may be preferable).
While ReLU is an excellent default, there are situations where it is not the best choice: transformer models, where GELU or SwiGLU typically performs better; standard recurrent networks, where ReLU's unbounded output can contribute to exploding activations; networks in which many neurons die, where Leaky ReLU, PReLU, or ELU are safer; self-normalizing fully connected networks, which call for SELU; and quantized mobile deployments, where the bounded range of ReLU6 is preferable.
ReLU became the default activation function for CNNs following AlexNet's success in 2012. Subsequent architectures like VGGNet, GoogLeNet/Inception, and ResNet all used ReLU throughout their convolutional and fully connected layers. The combination of ReLU with batch normalization (Ioffe and Szegedy, 2015) and residual connections (He et al., 2015) enabled training networks with over 100 layers for the first time. ReLU remains commonly used in convolutional architectures to this day.
In recurrent neural networks (RNNs), the use of ReLU is more nuanced. Standard RNNs with ReLU can suffer from exploding gradients because the activations are unbounded and the same weight matrix is applied at each time step. However, carefully initialized ReLU-based RNNs (such as the IRNN proposed by Le, Jaitly, and Hinton in 2015) have been shown to work well for certain sequence tasks. In practice, gated architectures like LSTMs and GRUs use sigmoid and tanh for their gating mechanisms rather than ReLU.
The original Transformer architecture (Vaswani et al., 2017) used ReLU in its feed-forward network (FFN) sublayers. However, as transformer models grew in scale and sophistication, researchers found that smoother activation functions could yield meaningful improvements. BERT (2018) adopted GELU, and this choice was carried forward by most subsequent transformer-based language models. The GPT series from OpenAI uses GELU throughout its feed-forward layers. More recently, models like LLaMA and PaLM have moved to SwiGLU, which consistently outperforms both ReLU and GELU in large-scale language model training.
Despite these newer alternatives, ReLU and its variants remain relevant even in the transformer era. Some researchers have shown that replacing GELU or SiLU with ReLU in large language models can be beneficial for inference efficiency, because ReLU's exact sparsity (outputting precisely zero for negative inputs) enables hardware-level optimizations that smooth activation functions cannot exploit. The 2023 paper "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models" demonstrated that ReLU-based transformers can achieve comparable quality to GELU-based models while being significantly faster at inference time due to sparse matrix operations.
ReLU and its variants are widely used in generative models. Generative adversarial networks (GANs) typically use ReLU in the generator and Leaky ReLU in the discriminator, following the DCGAN guidelines established by Radford, Metz, and Chintala in 2015. Variational autoencoders (VAEs) and diffusion models also commonly employ ReLU or its variants in their network architectures.
The universal approximation theorem states that a feedforward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary precision, provided the activation function satisfies certain conditions. ReLU satisfies these conditions because it is non-polynomial.
For ReLU networks specifically, the mechanism of universal approximation is through piecewise linear functions. Each ReLU neuron contributes a "hinge" or breakpoint, and a network with n hidden neurons in a single layer can produce a piecewise linear function with up to n+1 linear pieces in one dimension. In higher dimensions, the picture is more complex: ReLU networks partition the input space into polytopes (higher-dimensional analogs of polygons), and the network computes a separate linear function within each polytope.
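In one dimension this is easy to check by hand: each hidden unit max(0, w_i * x + b_i) contributes a single breakpoint at x = -b_i / w_i, so a layer of n units yields at most n + 1 linear pieces. The sketch below (random weights, purely illustrative) constructs such a network and lists its breakpoints:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w = rng.standard_normal(n)  # input weights of the n hidden ReLU units
b = rng.standard_normal(n)  # biases
v = rng.standard_normal(n)  # output weights

def net(x):
    # Single-hidden-layer ReLU network mapping R -> R
    return np.maximum(0.0, np.outer(x, w) + b) @ v

# Each unit switches between "off" and "on" at x = -b_i / w_i; between
# consecutive breakpoints the network computes an affine function of x.
breakpoints = np.sort(-b / w)
print(breakpoints)                     # at most n breakpoints -> n + 1 pieces
print(net(np.linspace(-3.0, 3.0, 7)))  # piecewise linear outputs
```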
A key result is that the number of linear regions can grow exponentially with depth. A network with d layers of width w can create up to O(w^d) linear regions, whereas a single hidden layer with the same total number of neurons produces a number of regions that grows only polynomially in that count (for a fixed input dimension). This exponential growth in expressiveness with depth is a fundamental theoretical justification for using deep (many-layered) networks rather than wide (single-layer) networks.
Arora, Basu, Mianjy, and Mukherjee (2018) formally proved that deep ReLU networks can approximate any piecewise linear function with far fewer parameters than shallow networks, establishing depth efficiency as a mathematical fact rather than just an empirical observation.
ReLU is available in all major deep learning frameworks:
| Framework | ReLU Implementation |
|---|---|
| PyTorch | torch.nn.ReLU() or torch.nn.functional.relu(x) |
| TensorFlow / Keras | tf.keras.activations.relu or tf.nn.relu |
| JAX | jax.nn.relu |
| ONNX | Relu operator |
A minimal implementation in Python requires only one line:
def relu(x):
return max(0, x)
For NumPy arrays, this becomes np.maximum(0, x). In practice, ReLU is often applied as a separate layer rather than fused into the linear/convolutional layer, though some frameworks offer fused implementations for better performance.
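As a small usage illustration (the layer sizes are arbitrary), a ReLU activation in PyTorch typically sits between linear layers:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),        # applied element-wise to the hidden activations
    nn.Linear(8, 1),
)

y = model(torch.randn(32, 4))  # batch of 32 examples with 4 features each
print(y.shape)                 # torch.Size([32, 1])
```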
Imagine you have a bunch of blocks, some with positive numbers and some with negative numbers. The Rectified Linear Unit (ReLU) is like a magical filter that you use to sort these blocks. When a positive block goes through the filter, it stays the same. But when a negative block goes through the filter, it magically becomes zero. This simple trick helps computers learn complex things more easily and quickly.
Think of it like a water faucet. When the water pressure (the input) is positive, water flows out at exactly that rate. But when the pressure drops below zero, the faucet shuts off completely. No water comes out, and no water gets pushed back in. This "all or nothing" behavior for negative values, combined with a perfectly proportional response for positive values, turns out to be exactly what neural networks need to learn efficiently.
Why is this better than what came before? Older activation functions (like the sigmoid) are more like dimmer switches: they work, but they squish everything into a tiny range, and the switch gets harder and harder to turn in very deep networks. ReLU is like a simple on/off valve that never gets stuck, so even in a network with a hundred layers, the learning signal can flow all the way through.