See also: Activation function, Deep learning, Neural network
The Rectified Linear Unit (ReLU) is the most widely used activation function in deep learning. Defined mathematically as f(x) = max(0, x), it returns the input directly when positive and outputs zero otherwise. Despite its simplicity, ReLU was a breakthrough that helped make training deep neural networks practical, and its adoption in the early 2010s played a central role in the modern deep learning revolution.
ReLU belongs to a family of rectifier functions. Its name comes from an analogy with half-wave rectification in electrical engineering, where only the positive half of a signal passes through. In the context of neural networks, ReLU allows only positive activations to propagate while suppressing all negative inputs to zero. This straightforward behavior gives it several computational and mathematical advantages over earlier activation functions like the sigmoid function and hyperbolic tangent (tanh), both of which suffer from saturation and vanishing gradients in deep architectures.
The ReLU function is defined as:
f(x) = max(0, x)
This can be written equivalently in several forms, for example as f(x) = (x + |x|) / 2, or as f(x) = x · [x > 0], where [·] denotes the indicator function (1 when the condition holds, 0 otherwise).
The derivative (subgradient) of ReLU is equally simple:
f'(x) = 1 if x > 0; f'(x) = 0 if x < 0
At x = 0, ReLU is not differentiable in the strict mathematical sense. In practice, implementations typically assign the derivative at zero to be either 0 or 1, and this choice has negligible effect on training. The simplicity of both the function and its derivative is one of the key reasons ReLU is so computationally efficient: evaluating it requires only a comparison operation, and computing the gradient requires only checking the sign of the input.
The derivative of ReLU has an important training implication. Because f'(x) is either 0 or 1, the gradient signal during backpropagation is either completely passed through (for positive activations) or completely blocked (for negative activations). This binary behavior stands in contrast to sigmoid and tanh, whose derivatives are continuous values between 0 and 1, meaning they always attenuate the gradient to some degree. The pass-through property of ReLU's derivative is the primary mechanism by which it combats the vanishing gradient problem.
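This behavior is easy to express in code. The following NumPy sketch (an illustration written from the definitions above, not taken from any library) implements the forward pass and the subgradient used in backpropagation, taking the derivative at zero to be 0:

```python
import numpy as np

def relu(x):
    """Forward pass: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_backward(x, upstream_grad):
    """Backward pass: pass the upstream gradient through where x > 0,
    block it (multiply by 0) where x <= 0."""
    return upstream_grad * (x > 0)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))                            # [0.  0.  0.  0.5 2. ]
print(relu_backward(x, np.ones_like(x)))  # [0. 0. 0. 1. 1.]
```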
A single ReLU neuron computes a "hinge" or "fold" in the input space. When multiple ReLU neurons are composed across layers of a neural network, the network creates a piecewise linear function that can approximate arbitrarily complex nonlinear mappings. Each neuron contributes a linear piece, and the combination of many such pieces across multiple layers can represent highly complex decision boundaries and function surfaces.
The mathematical function max(0, x) has a long history that predates its use in neural networks. In 1941, Alston Householder first applied it as a mathematical abstraction of biological neural networks in his work on the theory of neural computation. The function captures a basic property of biological neurons: they either fire (positive output) or remain silent (zero output), with no negative firing rate.
In 1969, Kunihiko Fukushima employed the rectifier function in the context of visual feature extraction in hierarchical neural networks. Fukushima referred to it as an "analog threshold element" and used it within architectures that would eventually evolve into the Cognitron (1975) and the Neocognitron (1979), which are considered precursors to modern convolutional neural networks.
In a landmark 2000 paper published in Nature, Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung provided both biological and mathematical justifications for the rectifier activation function. They argued that ReLU approximates the biological relationship between neural firing rates and input current, a property known as the f-I curve in neuroscience. Biological neurons in their main operating regime respond roughly linearly to increases in input current above a threshold, and produce zero output below that threshold. This matches the ReLU function closely. Additionally, Hahnloser et al. demonstrated that ReLU enables recurrent neural network dynamics to stabilize under weaker mathematical criteria than networks using other activation functions.
Before 2009, the sigmoid function and tanh were the dominant activation functions in neural networks. These functions saturate for large positive or negative inputs, causing their gradients to shrink toward zero. In deep networks with many layers, this saturation compounds through backpropagation, leading to the vanishing gradient problem that makes training extremely slow or effectively impossible for networks beyond a few layers.
In 2009, Jarrett, Kavukcuoglu, Ranzato, and LeCun showed that rectified activation was "the single most important factor" for achieving good performance in object recognition with convolutional neural networks. Their experiments demonstrated that rectification allowed average pooling to work effectively without the cancellation effects that occur with functions that produce both positive and negative outputs.
In 2010, Vinod Nair and Geoffrey Hinton made a theoretical argument in favor of the softplus function (a smooth approximation to ReLU) and found that ReLU activation allowed strong empirical performance in restricted Boltzmann machines. Their paper, "Rectified Linear Units Improve Restricted Boltzmann Machines," was influential in bringing ReLU to the attention of the broader deep learning community.
In 2011, Xavier Glorot, Antoine Bordes, and Yoshua Bengio published "Deep Sparse Rectifier Neural Networks," which provided a comprehensive argument for why ReLU should replace sigmoid and tanh as the default activation function. Glorot et al. highlighted four key advantages: ReLU is more similar to biological neurons in their main operating regime; it avoids vanishing gradients because the gradient is either 0 or 1; it is cheaper to compute than exponential-based functions; and it naturally produces sparse representations, because many hidden units output exactly zero for a given input.
The adoption of ReLU reached a turning point in 2012 when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton used it in AlexNet, their deep convolutional neural network that won the ImageNet Large Scale Visual Recognition Challenge. AlexNet achieved a top-5 error rate of 15.3%, dramatically outperforming the second-place entry at 26.2%. This margin of more than ten percentage points stunned the computer vision community. Yann LeCun described the result as "an unequivocal turning point in the history of computer vision."
The AlexNet paper reported that using ReLU activation allowed training to converge roughly six times faster than an equivalent network using tanh activation on the CIFAR-10 dataset. This speedup was critical because the sheer depth and size of AlexNet (60 million parameters across 650,000 neurons) made training with slower-converging activation functions impractical. After AlexNet, ReLU became the default activation function for deep neural networks across virtually every domain, and it held that position for much of the following decade.
The vanishing gradient problem was one of the most significant obstacles to training deep neural networks before ReLU. With sigmoid or tanh activations, the gradients of neurons in early layers become exponentially small as the number of layers increases, because each layer multiplies the gradient by the activation's derivative, which is at most 0.25 for the sigmoid and below 1 almost everywhere for tanh. This makes it nearly impossible for the network to learn useful representations in its earlier layers.
ReLU solves this problem in a direct way: for any positive input, the gradient of ReLU is exactly 1. This means that during backpropagation, the gradient passes through ReLU neurons unchanged (for positive activations). There is no multiplicative shrinkage of the signal, so gradients can flow through many layers without vanishing. This property was essential for making it possible to train networks with tens or even hundreds of layers.
When a network uses ReLU, any neuron receiving a negative input produces an output of exactly zero. In a randomly initialized network, roughly 50% of hidden units will be inactive (outputting zero) for any given input. This sparsity has several benefits: many activations are exactly zero and can be skipped in downstream computation, representations tend to be more disentangled because only a subset of units responds to each input, and the effective size of the representation can vary from example to example.
This contrasts sharply with sigmoid and tanh, where neurons are almost always producing nonzero outputs, leading to dense and computationally heavier representations.
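The degree of sparsity is easy to measure directly. The sketch below (layer sizes, batch size, and initialization are arbitrary illustrative choices) computes the fraction of zero activations in a randomly initialized ReLU layer and also checks for units that are zero on every input in the batch, i.e. candidates for "dead" neurons:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hidden = 256, 100, 512

X = rng.standard_normal((batch, d_in))
W = rng.standard_normal((d_in, d_hidden)) * np.sqrt(2.0 / d_in)  # He-style init
b = np.zeros(d_hidden)

H = np.maximum(0.0, X @ W + b)            # ReLU activations

sparsity = np.mean(H == 0.0)              # fraction of zero activations (~0.5)
dead = np.mean(np.all(H == 0.0, axis=0))  # units that are zero for every input
print(f"zero activations: {sparsity:.2%}, dead units: {dead:.2%}")
```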
ReLU requires only a comparison operation (checking if the input is positive) and possibly an assignment (setting the output to zero). There are no exponential, logarithmic, or trigonometric operations involved. This makes both the forward pass and the gradient computation extremely fast. For large-scale models with millions or billions of parameters, this efficiency translates into meaningful reductions in training time and energy consumption.
ReLU satisfies the property that max(0, ax) = a * max(0, x) for any non-negative scalar a. This scale invariance means that scaling the weights of a ReLU neuron scales the output proportionally, a property that interacts usefully with weight initialization schemes and batch normalization. However, this property also introduces a form of parametric redundancy, since multiplying all incoming weights by a constant and dividing all outgoing weights by the same constant does not change the network's output.
One of ReLU's underappreciated strengths is its connection to biological neuroscience. The activation function of a biological neuron can be roughly characterized by the f-I (frequency-current) curve, which describes how a neuron's firing rate responds to injected current. In the main operating regime of cortical neurons, this relationship is approximately linear above a threshold and zero below it, closely matching the ReLU function.
Hahnloser et al. (2000) formally showed this connection in their Nature paper, demonstrating that neural circuits with half-wave rectification (equivalent to ReLU) exhibit stable dynamics and can perform analog computation. Glorot et al. (2011) further argued that the sparsity produced by ReLU mirrors the sparse coding observed in biological neural systems, where only a small fraction of neurons are active at any given time. Neuroscience research has shown that sparse activation patterns in the cortex improve energy efficiency and increase the representational capacity of neural populations.
The thresholding behavior of ReLU is also reminiscent of the all-or-nothing firing pattern of biological neurons, which require net excitatory synaptic input to surpass a certain threshold before generating an action potential. Below that threshold, the neuron produces no output. While the analogy is not perfect (biological neurons have much more complex dynamics including temporal coding, refractory periods, and Dale's principle constraining excitatory and inhibitory roles), ReLU captures the essential nonlinearity that makes neural computation powerful.
Despite its advantages, ReLU has a well-known failure mode called the "dying ReLU" problem. A neuron is said to have "died" when its inputs are consistently negative across all training examples, causing it to always output zero. Because the gradient of ReLU is zero for negative inputs, a dead neuron receives no gradient signal during backpropagation and can never recover. Effectively, it is permanently removed from the network's computation.
The dying ReLU problem typically arises from several causes:
Excessively high learning rates. Large gradient updates can push the weights of a neuron into a regime where the weighted sum of inputs is negative for every training example. Once this happens, the neuron's gradient is zero, and no further updates can move it back to a useful state.
Poor weight initialization. If the initial weights are set in a way that produces large negative biases or systematically negative pre-activation values, many neurons may start in a dead state and never contribute to learning. Symmetric probability distributions used in conventional initialization are particularly susceptible to this issue.
Large negative biases. A bias term that is too negative can shift the pre-activation value below zero for all inputs, killing the neuron from the start.
Data distribution shifts. If the distribution of inputs to a layer changes substantially during training (a phenomenon related to internal covariate shift), neurons that were previously active may find themselves consistently receiving negative inputs.
Several strategies can mitigate or prevent the dying ReLU problem: using a lower learning rate (or a warmup schedule) so that a single update cannot push a neuron permanently into the negative regime; initializing weights with a scheme designed for rectifiers, such as He initialization; initializing biases to a small positive value so that neurons start in the active region; applying batch normalization to keep pre-activation distributions centered; and switching to a variant such as Leaky ReLU, PReLU, or ELU that retains a nonzero gradient for negative inputs.
Beyond the dying ReLU problem, ReLU has several other limitations: its outputs are not zero-centered, which can bias gradient updates; it is unbounded above, so activations can grow very large; it is not differentiable at zero; and it discards all information carried by the magnitude of negative pre-activations.
The limitations of standard ReLU have motivated the development of numerous variants, each designed to address specific shortcomings while preserving ReLU's core advantages.
Introduced by Maas, Hannun, and Ng in 2013, Leaky ReLU allows a small, non-zero gradient when the input is negative:
f(x) = x if x > 0; f(x) = alpha * x if x <= 0
The slope parameter alpha is typically set to a small constant like 0.01 or 0.1. By maintaining a small gradient for negative inputs, Leaky ReLU prevents neurons from dying completely. The function remains computationally inexpensive, adding only a single multiplication for negative values.
Proposed by He, Zhang, Ren, and Sun in 2015 (the same paper that introduced He initialization), PReLU treats the negative slope alpha as a learnable parameter that is optimized during training along with the other network weights:
f(x) = x if x > 0; f(x) = alpha * x if x <= 0
The difference from Leaky ReLU is that alpha is not fixed but learned. This gives the network the flexibility to determine the optimal amount of information to pass through for negative inputs, which can vary across layers and neurons. PReLU was shown to improve classification accuracy on the ImageNet dataset compared to standard ReLU.
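Both functions are available as standard layers in major frameworks; the PyTorch sketch below (the slope values shown are just common defaults, chosen for illustration) contrasts the fixed slope of Leaky ReLU with the learnable slope of PReLU:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

leaky = nn.LeakyReLU(negative_slope=0.01)  # alpha is a fixed hyperparameter
prelu = nn.PReLU(init=0.25)                # alpha is a learnable parameter

print(leaky(x))  # tensor([-0.0200, -0.0050,  0.0000,  1.5000])
print(prelu(x))  # tensor([-0.5000, -0.1250,  0.0000,  1.5000])

# PReLU's alpha receives gradients during training like any other weight:
print(list(prelu.parameters()))
```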
Proposed by Clevert, Unterthiner, and Hochreiter in 2015, ELU uses an exponential curve for negative inputs:
f(x) = x if x > 0; f(x) = alpha * (e^x - 1) if x <= 0
The hyperparameter alpha (typically set to 1.0) controls the saturation value for negative inputs. ELU has several advantages over ReLU: it produces negative outputs, making its mean activation closer to zero (which helps gradient flow); it saturates smoothly to -alpha for large negative inputs, adding robustness to noise; and it is smooth and differentiable everywhere, including at zero. Experiments demonstrated that ELU networks achieved higher classification accuracy and converged faster than ReLU networks, particularly in deeper architectures.
Introduced by Klambauer, Unterthiner, Mayr, and Hochreiter in 2017, SELU is a scaled version of ELU with carefully chosen constants:
f(x) = lambda * x if x > 0; f(x) = lambda * alpha * (e^x - 1) if x <= 0
where lambda = 1.0507 and alpha = 1.6733 (approximately).
The specific values of lambda and alpha were derived using the Banach fixed-point theorem to guarantee that activations converge toward zero mean and unit variance as they propagate through layers. This "self-normalizing" property means that SELU networks do not need batch normalization to maintain stable activations. The authors proved that vanishing and exploding gradients are impossible when the self-normalizing conditions are met. However, SELU requires specific conditions to achieve its theoretical guarantees: the network must use fully connected layers, and the weights must be initialized with LeCun normal initialization (zero mean, variance of 1/n). SELU-based self-normalizing networks significantly outperformed competing methods on 121 UCI machine learning tasks.
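The following NumPy sketch implements ELU and SELU directly from the definitions above, using the rounded constants quoted in the text (illustrative only, not copied from any library):

```python
import numpy as np

def elu(x, alpha=1.0):
    # x for positive inputs, alpha * (exp(x) - 1) for negative inputs
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, lam=1.0507, alpha=1.6733):
    # SELU is a scaled ELU with fixed constants chosen for self-normalization
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(elu(x))   # approaches -1.0 for large negative inputs
print(selu(x))  # approaches -lam * alpha (about -1.76) for large negative inputs
```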
Introduced by Hendrycks and Gimpel in 2016, GELU takes a fundamentally different approach by incorporating a probabilistic element:
f(x) = x * Phi(x)
where Phi(x) is the cumulative distribution function of the standard normal distribution.
Rather than deterministically zeroing out negative values as ReLU does, GELU gates each input by its probability of being positive under a standard normal distribution. Inputs that are clearly positive pass through nearly unchanged, while inputs near zero are partially suppressed, and strongly negative inputs are driven close to zero. This creates a smooth, non-monotonic function with a small "bump" in the negative region. The GELU nonlinearity weights inputs by their value rather than gating inputs by their sign as in ReLU, combining the intuitions of dropout and zoneout.
GELU rose to prominence as the activation function used in BERT (2018) and the GPT series of models from OpenAI. Its smooth gradient flow and probabilistic gating proved particularly effective for transformer architectures. Common approximations used in practice include the tanh-based form 0.5 · x · (1 + tanh(sqrt(2/pi) · (x + 0.044715 · x^3))) and the simpler sigmoid-based form x · sigmoid(1.702 · x).
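A minimal NumPy sketch of the exact definition and the tanh approximation, written from the formulas above rather than taken from any framework, might look like this:

```python
import numpy as np
from scipy.special import erf

def gelu_exact(x):
    # x * Phi(x), with Phi the standard normal CDF expressed via erf
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh-based approximation used by many transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-4, 4, 9)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # small approximation error
```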
The Sigmoid Linear Unit (SiLU), independently discovered and also known as Swish, was proposed by Elfwing et al. in 2017 and popularized by Ramachandran, Zoph, and Le at Google Brain in 2017 through a neural architecture search:
f(x) = x * sigmoid(x) = x / (1 + e^(-x))
Swish was discovered through a combination of exhaustive and reinforcement learning-based search over possible activation function formulas. Like GELU, Swish is smooth and non-monotonic. It allows small negative values to pass through, which can help with information flow. Unlike ReLU, Swish is bounded below (its global minimum is approximately -0.28) and unbounded above. In experiments across a range of tasks, Swish consistently matched or outperformed ReLU, particularly in very deep networks. On ImageNet, replacing ReLUs with Swish improved top-1 classification accuracy by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2. Swish has lower computational cost than GELU while offering similar benefits.
ReLU6 is a clipped variant of ReLU defined as:
f(x) = min(max(0, x), 6)
First introduced in the context of convolutional deep belief networks, ReLU6 caps the maximum output value at 6. This seemingly simple modification serves an important practical purpose in mobile and embedded deep learning deployments. By limiting activations to a predefined range of [0, 6], ReLU6 makes neural networks more robust when using low-precision (fixed-point or quantized) arithmetic. Without an upper bound, large activation values can cause overflow or significant quantization errors in 8-bit or 16-bit representations.
ReLU6 gained prominence through its use in MobileNet (Howard et al., 2017) and MobileNetV2 (Sandler et al., 2018), where computational efficiency and compatibility with mobile hardware are critical. The value of 6 was chosen empirically to balance expressiveness with bit compression, making it suitable for fixed-point inference and efficient hardware implementation. In MobileNetV3, both ReLU6 and hard-swish (h-swish) are used depending on the layer.
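In code, ReLU6 is a one-line clamp. The sketch below also shows hard-swish, the ReLU6-based approximation of Swish used in MobileNetV3, written from the published formula x * ReLU6(x + 3) / 6 (illustrative only):

```python
import numpy as np

def relu6(x):
    # Clamp activations into the fixed range [0, 6]
    return np.minimum(np.maximum(0.0, x), 6.0)

def hard_swish(x):
    # MobileNetV3's piecewise-linear approximation of Swish, built on ReLU6
    return x * relu6(x + 3.0) / 6.0

x = np.array([-4.0, -1.0, 0.0, 3.0, 8.0])
print(relu6(x))       # [0. 0. 0. 3. 6.]
print(hard_swish(x))  # [-0.    -0.333  0.     3.     8.   ]
```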
Proposed by Misra in 2019, Mish is another smooth, non-monotonic activation function:
f(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
Mish shares several properties with Swish, including non-monotonicity and smoothness, but uses the softplus function instead of the sigmoid. Experiments found that Mish frequently outperformed both ReLU and Swish across a range of tasks and architectures. The function exhibits a "self-regularizing" property attributed to a specific term in its first derivative, which helps prevent overfitting. Being unbounded above, Mish avoids saturation that would slow training due to near-zero gradients. Mish demonstrated strong results in object detection, outperforming Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone by 2.1% average precision on MS-COCO, and outperforming ReLU on ResNet-50 on ImageNet-1k in top-1 accuracy by approximately 1%.
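Both Swish/SiLU (above) and Mish are one-line functions in NumPy; the sketch below implements them from their formulas for a side-by-side comparison (illustrative only):

```python
import numpy as np

def swish(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def mish(x):
    # Mish: x * tanh(softplus(x)); log1p(exp(x)) is the softplus function
    return x * np.tanh(np.log1p(np.exp(x)))

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(swish(x))  # small negative values pass through near zero
print(mish(x))
```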
SwiGLU (Swish-Gated Linear Unit), introduced by Shazeer in 2020, combines the Swish activation with a gating mechanism in the feed-forward layers of transformer models:
SwiGLU(x) = Swish(xW_1) * (xW_2)
where W_1 and W_2 are two separate weight matrices. Instead of a single linear transformation followed by an activation, SwiGLU uses two linear projections whose outputs are multiplied element-wise, with one passing through the Swish function as a gate. SwiGLU has become the standard activation in many modern large language models, including LLaMA, PaLM, and DeepSeek. It consistently achieves better perplexity and downstream task performance compared to GELU or ReLU in large-scale transformer training.
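A sketch of a SwiGLU feed-forward block in PyTorch might look as follows; the layer names, the bias-free projections, and the dimensions are illustrative assumptions rather than a reproduction of any particular model's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w2 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w3 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(x W1) multiplied element-wise with x W2, then projected back
        return self.w3(F.silu(self.w1(x)) * self.w2(x))

block = SwiGLUFeedForward(d_model=64, d_hidden=256)
y = block(torch.randn(2, 10, 64))  # (batch, sequence, d_model)
print(y.shape)
```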
Several additional ReLU variants have been proposed for specific use cases, including randomized Leaky ReLU (RReLU), which samples the negative slope randomly during training; concatenated ReLU (CReLU), which applies ReLU to both x and -x and concatenates the results; and softplus, the smooth approximation of ReLU mentioned earlier.
The following table summarizes the key properties and trade-offs of ReLU and its major variants:
| Activation Function | Formula | Smooth | Zero-Centered | Dying Neuron Risk | Computational Cost | Typical Use Case |
|---|---|---|---|---|---|---|
| ReLU | max(0, x) | No | No | Yes | Very low | CNNs, general deep learning |
| Leaky ReLU | max(alpha*x, x) | No | Approximately | No | Very low | CNNs when dying ReLU is a concern |
| PReLU | max(alpha*x, x), alpha learned | No | Approximately | No | Low | Image classification (e.g., ImageNet) |
| ELU | x if x>0; alpha*(e^x - 1) if x<=0 | Yes | Approximately | No | Moderate | Deep networks needing zero-centered output |
| SELU | lambda*ELU(x) with fixed constants | Yes | Yes (converges) | No | Moderate | Fully connected self-normalizing networks |
| ReLU6 | min(max(0, x), 6) | No | No | Yes | Very low | Mobile and embedded networks |
| GELU | x * Phi(x) | Yes | Approximately | No | Moderate to high | Transformers (BERT, GPT) |
| SiLU / Swish | x * sigmoid(x) | Yes | Approximately | No | Moderate | Deep CNNs, general deep learning |
| Mish | x * tanh(softplus(x)) | Yes | Approximately | No | Moderate to high | Image classification, object detection |
| SwiGLU | Swish(xW1) * xW2 | Yes | Approximately | No | Higher (two projections) | Large language models (LLaMA, PaLM) |
ReLU is widely recommended as the default activation function for hidden layers in neural networks. This recommendation, which has held since the early 2010s, comes from several practical observations: ReLU performs well across a wide range of architectures and tasks without tuning; it is extremely cheap to compute; it trains reliably when paired with standard tools such as He initialization and batch normalization; and its main failure mode, dying neurons, is well understood and straightforward to detect and mitigate.
For practitioners starting a new project, ReLU remains a strong first choice for hidden layers unless there is a specific reason to use an alternative (such as building a transformer model, where GELU or SwiGLU may be preferable).
While ReLU is an excellent default, there are situations where it is not the best choice: transformer models, where GELU or SwiGLU typically performs better; standard recurrent networks, where ReLU's unbounded output can contribute to exploding activations; networks in which many neurons die, where Leaky ReLU, PReLU, or ELU are safer; self-normalizing fully connected networks, which call for SELU; and quantized mobile deployments, where the bounded range of ReLU6 is preferable.
ReLU became the default activation function for CNNs following AlexNet's success in 2012. Subsequent architectures like VGGNet, GoogLeNet/Inception, and ResNet all used ReLU throughout their convolutional and fully connected layers. The combination of ReLU with batch normalization (Ioffe and Szegedy, 2015) and residual connections (He et al., 2015) enabled training networks with over 100 layers for the first time. ReLU remains commonly used in convolutional architectures to this day.
In recurrent neural networks (RNNs), the use of ReLU is more nuanced. Standard RNNs with ReLU can suffer from exploding gradients because the activations are unbounded and the same weight matrix is applied at each time step. However, carefully initialized ReLU-based RNNs (such as the IRNN proposed by Le, Jaitly, and Hinton in 2015) have been shown to work well for certain sequence tasks. In practice, gated architectures like LSTMs and GRUs use sigmoid and tanh for their gating mechanisms rather than ReLU.
The original Transformer architecture (Vaswani et al., 2017) used ReLU in its feed-forward network (FFN) sublayers. However, as transformer models grew in scale and sophistication, researchers found that smoother activation functions could yield meaningful improvements. BERT (2018) adopted GELU, and this choice was carried forward by most subsequent transformer-based language models. The GPT series from OpenAI uses GELU throughout its feed-forward layers. More recently, models like LLaMA and PaLM have moved to SwiGLU, which consistently outperforms both ReLU and GELU in large-scale language model training.
Despite these newer alternatives, ReLU and its variants remain relevant even in the transformer era. Some researchers have shown that replacing GELU or SiLU with ReLU in large language models can be beneficial for inference efficiency, because ReLU's exact sparsity (outputting precisely zero for negative inputs) enables hardware-level optimizations that smooth activation functions cannot exploit. The 2023 paper "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models" demonstrated that ReLU-based transformers can achieve comparable quality to GELU-based models while being significantly faster at inference time due to sparse matrix operations.
ReLU and its variants are widely used in generative models. Generative adversarial networks (GANs) typically use ReLU in the generator and Leaky ReLU in the discriminator, following the DCGAN guidelines established by Radford, Metz, and Chintala in 2015. Variational autoencoders (VAEs) and diffusion models also commonly employ ReLU or its variants in their network architectures.
The universal approximation theorem states that a feedforward neural network with a single hidden layer containing a sufficient number of neurons can approximate any continuous function on a compact domain to arbitrary precision, provided the activation function satisfies certain conditions. ReLU satisfies these conditions because it is non-polynomial.
For ReLU networks specifically, the mechanism of universal approximation is through piecewise linear functions. Each ReLU neuron contributes a "hinge" or breakpoint, and a network with n hidden neurons in a single layer can produce a piecewise linear function with up to n+1 linear pieces in one dimension. In higher dimensions, the picture is more complex: ReLU networks partition the input space into polytopes (higher-dimensional analogs of polygons), and the network computes a separate linear function within each polytope.
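In one dimension this is easy to check by hand: each hidden unit max(0, w_i * x + b_i) contributes a single breakpoint at x = -b_i / w_i, so a layer of n units yields at most n + 1 linear pieces. The sketch below (random weights, purely illustrative) constructs such a network and lists its breakpoints:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
w = rng.standard_normal(n)  # input weights of the n hidden ReLU units
b = rng.standard_normal(n)  # biases
v = rng.standard_normal(n)  # output weights

def net(x):
    # Single-hidden-layer ReLU network mapping R -> R
    return np.maximum(0.0, np.outer(x, w) + b) @ v

# Each unit switches between "off" and "on" at x = -b_i / w_i; between
# consecutive breakpoints the network computes an affine function of x.
breakpoints = np.sort(-b / w)
print(breakpoints)                     # at most n breakpoints -> n + 1 pieces
print(net(np.linspace(-3.0, 3.0, 7)))  # piecewise linear outputs
```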
A key result is that the number of linear regions can grow exponentially with depth. A network with d layers of width w can create up to O(w^d) linear regions, whereas a single hidden layer with the same total number of neurons produces a number of regions that grows only polynomially in that count (for a fixed input dimension). This exponential growth in expressiveness with depth is a fundamental theoretical justification for using deep (many-layered) networks rather than wide (single-layer) networks.
Arora, Basu, Mianjy, and Mukherjee (2018) formally proved that deep ReLU networks can approximate any piecewise linear function with far fewer parameters than shallow networks, establishing depth efficiency as a mathematical fact rather than just an empirical observation.
ReLU is available in all major deep learning frameworks:
| Framework | ReLU Implementation |
|---|---|
| PyTorch | torch.nn.ReLU() or torch.nn.functional.relu(x) |
| TensorFlow / Keras | tf.keras.activations.relu or tf.nn.relu |
| JAX | jax.nn.relu |
| ONNX | Relu operator |
A minimal implementation in Python requires only one line:
def relu(x):
return max(0, x)
For NumPy arrays, this becomes np.maximum(0, x). In practice, ReLU is often applied as a separate layer rather than fused into the linear/convolutional layer, though some frameworks offer fused implementations for better performance.
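As a small usage illustration (the layer sizes are arbitrary), a ReLU activation in PyTorch typically sits between linear layers:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),        # applied element-wise to the hidden activations
    nn.Linear(8, 1),
)

y = model(torch.randn(32, 4))  # batch of 32 examples with 4 features each
print(y.shape)                 # torch.Size([32, 1])
```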
Imagine you have a bunch of blocks, some with positive numbers and some with negative numbers. The Rectified Linear Unit (ReLU) is like a magical filter that you use to sort these blocks. When a positive block goes through the filter, it stays the same. But when a negative block goes through the filter, it magically becomes zero. This simple trick helps computers learn complex things more easily and quickly.
Think of it like a water faucet. When the water pressure (the input) is positive, water flows out at exactly that rate. But when the pressure drops below zero, the faucet shuts off completely. No water comes out, and no water gets pushed back in. This "all or nothing" behavior for negative values, combined with a perfectly proportional response for positive values, turns out to be exactly what neural networks need to learn efficiently.
Why is this better than what came before? Older activation functions (like the sigmoid) are more like dimmer switches: they work, but they squish everything into a tiny range, and the switch gets harder and harder to turn in very deep networks. ReLU is like a simple on/off valve that never gets stuck, so even in a network with a hundred layers, the learning signal can flow all the way through.