# ReLU

> Source: https://aiwiki.ai/wiki/relu
> Updated: 2026-07-12
> Categories: Deep Learning, Machine Learning, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

*See also: [Machine learning terms](/wiki/machine_learning_terms)*

**ReLU** (Rectified Linear Unit) is an [activation function](/wiki/activation_function) used in [neural networks](/wiki/neural_network), defined by the formula $$f(x) = \max(0, x)$$: it passes positive inputs through unchanged and outputs zero for any negative input. Introduced to deep learning by Vinod Nair and [Geoffrey Hinton](/wiki/geoffrey_hinton) in 2010 [2] and popularized by Xavier Glorot, Antoine Bordes, and [Yoshua Bengio](/wiki/yoshua_bengio) in 2011 [3], ReLU is the activation function that made modern [deep learning](/wiki/deep_learning) practical. It largely solved the [vanishing gradient problem](/wiki/vanishing_gradient_problem) that had blocked the training of deep networks, and in the 2012 [AlexNet](/wiki/alexnet) paper it trained roughly six times faster than the equivalent network using tanh [4]. ReLU remains one of the most widely used activation functions in convolutional networks today.

## Introduction

ReLU (Rectified Linear Unit) is an [activation function](/wiki/activation_function) used in [neural networks](/wiki/neural_network), defined by the formula $$f(x) = \max(0, x)$$. For any positive input, ReLU returns the input unchanged; for any negative input, it returns zero. Despite this simplicity, ReLU was the activation function that made modern [deep learning](/wiki/deep_learning) practical. Its adoption in the early 2010s enabled the training of much deeper networks than had been possible with sigmoid or tanh activations, and it remains one of the most widely used activation functions today.

## What is the formula for ReLU?

The ReLU function is defined as:

$$
f(x) = \max(0, x)
$$

This can also be written piecewise:

$$
f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}
$$

The derivative of ReLU is:

- $$f'(x) = 1$$, if $$x > 0$$
- $$f'(x) = 0$$, if $$x < 0$$
- $$f'(x)$$ is undefined at $$x = 0$$ (in practice, it is set to 0 or 1 by convention)

ReLU is piecewise linear: it consists of two linear segments joined at x = 0. This piecewise linearity is what makes it both computationally efficient and mathematically interesting. A network of ReLU neurons partitions its input space into regions, within each of which the network computes a different linear function.
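
The definition and derivative above translate almost directly into code. The following is a minimal scalar sketch in plain Python; the function names and the `at_zero` parameter (the conventional value chosen for the derivative at x = 0) are illustrative choices, not part of any particular library.

```python
def relu(x: float) -> float:
    """ReLU: pass positive inputs through, clamp everything else to zero."""
    return max(0.0, x)


def relu_grad(x: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU. `at_zero` is the conventional value used at x = 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [relu(x) for x in inputs]      # [0.0, 0.0, 0.0, 0.5, 2.0]
grads = [relu_grad(x) for x in inputs]   # [0.0, 0.0, 0.0, 1.0, 1.0]
```

Passing `at_zero=1.0` selects the other common convention for the derivative at the kink.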

## When was ReLU invented?

### Origins in neuroscience (2000)

The ReLU function was first applied to neural network dynamics by Hahnloser, Sarpeshkar, Mahowald, Douglas, and Seung in a 2000 paper published in *Nature* [1]. Their work was motivated by neuroscience: biological neurons fire at a rate roughly proportional to their input current when that current exceeds a threshold, and do not fire at all below the threshold. The $$\max(0, x)$$ function captures this behavior. Hahnloser et al. showed that ReLU-type activations enable recurrent neural network dynamics to stabilize under weaker conditions than smooth activations like sigmoid [1].

The basic mathematical concept of the positive part function, $$x^+ = \max(0, x)$$, has a much longer history in mathematics and was used in various contexts before neural networks. But the deliberate application to neural computation began with Hahnloser et al.

### Adoption in deep learning (2010-2011)

ReLU's breakthrough in practical deep learning came from two papers that demonstrated its advantages for training deep networks.

Vinod Nair and [Geoffrey Hinton](/wiki/geoffrey_hinton) published "Rectified Linear Units Improve Restricted Boltzmann Machines" at ICML 2010 (Haifa, June 21-24, 2010, pp. 807-814) [2]. They compared ReLU against sigmoid activations in [Restricted Boltzmann Machines](/wiki/restricted_boltzmann_machine) (RBMs) and showed that ReLU led to better generative models, particularly on image data, with the rectified units learning features that improved object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild (LFW) dataset [2]. Their theoretical argument centered on the observation that a ReLU unit can be approximated by an infinite sum of binary sigmoid units with shifted biases, giving it a much richer representational capacity [2].
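
The sum-of-sigmoids argument can be checked numerically. The sketch below assumes the units share one input and have biases shifted by 0.5, 1.5, 2.5, and so on, and compares their sum with the softplus function $$\log(1 + e^x)$$, which in turn stays close to $$\max(0, x)$$ away from zero. The helper names and the choice of 100 units are illustrative.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def shifted_sigmoid_sum(x: float, n_units: int = 100) -> float:
    """Sum of sigmoid units sharing input x, with biases -0.5, -1.5, -2.5, ..."""
    return sum(sigmoid(x - i + 0.5) for i in range(1, n_units + 1))


def softplus(x: float) -> float:
    """Smooth approximation of ReLU: log(1 + e^x)."""
    return math.log1p(math.exp(x))


for x in (-3.0, 0.0, 2.0, 5.0):
    print(f"x={x:+.1f}  sigmoid sum={shifted_sigmoid_sum(x):.3f}  "
          f"softplus={softplus(x):.3f}  relu={max(0.0, x):.3f}")
```

The sigmoid sum and softplus columns track each other closely, and both approach the ReLU column as x moves away from zero.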

Xavier Glorot, Antoine Bordes, and [Yoshua Bengio](/wiki/yoshua_bengio) published "Deep Sparse Rectifier Neural Networks" at AISTATS 2011 (Fort Lauderdale, April 11-13, 2011, pp. 315-323) [3]. This paper argued for ReLU in standard feedforward and supervised networks. Glorot et al. identified several advantages:

- **Biological plausibility:** ReLU more closely resembles the firing pattern of biological neurons.
- **Sparsity:** With typical data, roughly half of ReLU neurons output zero for a given input, creating sparse representations with true zeros.
- **No vanishing gradient (for positive inputs):** The derivative is 1 for all positive values, so gradients flow through without shrinking.
- **Computational efficiency:** Computing $$\max(0, x)$$ is far cheaper than computing exp() or tanh().

Critically, Glorot et al. demonstrated that deep networks with ReLU activations could be trained successfully with purely supervised learning, without the unsupervised pre-training that had been considered necessary for deep networks [3]. This was a turning point: it meant that deep networks were practical for a much wider range of problems.

### Widespread adoption (2012 onward)

In 2012, ReLU became the default activation in [convolutional neural networks](/wiki/convolutional_neural_network). [AlexNet](/wiki/alexnet) (Krizhevsky, Sutskever, and Hinton, 2012), which won the [ImageNet](/wiki/imagenet) competition and is often cited as the catalyst for the modern deep learning era, used ReLU throughout. The authors stated plainly that "Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units," and reported that a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 six times faster than the same network with tanh [4].

From that point, ReLU became the standard activation for hidden layers in nearly all [deep learning](/wiki/deep_learning) architectures, including VGGNet, GoogLeNet, and [ResNet](/wiki/resnet).

## Why does ReLU work so well?

### Sparse activation

For any given input, a network with ReLU activations typically has a significant fraction of neurons outputting exactly zero. This sparsity has several benefits. It makes the representation more efficient (the network effectively selects a subset of features for each input), it acts as a form of implicit [regularization](/wiki/regularization) (reducing the chance of [overfitting](/wiki/overfitting)), and it can make computation cheaper when the implementation exploits the zeros (a zero-valued activation contributes nothing to the weighted sums of the next layer).

Glorot et al. (2011) found that ReLU networks naturally learned representations where 50-80% of hidden units were inactive for a given input, compared to almost no inactive units with sigmoid or tanh [3].
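
A rough way to see where the sparsity comes from is to measure it on a randomly initialized layer. The sketch below is an assumption-laden toy (zero-mean Gaussian weights and inputs, no bias, arbitrary layer sizes), not a reproduction of the experiments in [3]: because the pre-activations are symmetric around zero, about half of the ReLU outputs are exactly zero. Sparsity in a trained network depends on the data and training setup.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

n_inputs, n_hidden, n_examples = 32, 64, 200  # illustrative sizes
weights = [[random.gauss(0.0, 1.0) for _ in range(n_inputs)]
           for _ in range(n_hidden)]

zero_count = 0
for _ in range(n_examples):
    x = [random.gauss(0.0, 1.0) for _ in range(n_inputs)]
    for w in weights:
        pre_activation = sum(wi * xi for wi, xi in zip(w, x))
        if max(0.0, pre_activation) == 0.0:  # ReLU output is exactly zero
            zero_count += 1

sparsity = zero_count / (n_hidden * n_examples)
print(f"fraction of zero activations: {sparsity:.2f}")
```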

### No vanishing gradient for positive inputs

The [vanishing gradient problem](/wiki/vanishing_gradient_problem) was the main obstacle to training deep networks before ReLU. With sigmoid, the maximum gradient is 0.25, and with tanh, the gradient drops off steeply as the input moves away from zero in either direction. When these gradients are multiplied across many layers during [backpropagation](/wiki/backpropagation), the signal reaching early layers can be negligibly small.

ReLU has a gradient of exactly 1 for all positive inputs. This means that for any neuron that is active (outputting a positive value), the gradient passes through without any reduction. This property allows gradients to flow through very deep networks without vanishing, and it was a key ingredient in training networks with 10, 50, or even 100+ layers.
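
The effect of multiplying activation derivatives across layers can be made concrete with a small calculation. The sketch below looks only at the activation-derivative factor along one path through the network and ignores the weight matrices, which also scale the gradient; the function names and the depth of 10 are illustrative.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z: float) -> float:
    """Derivative of ReLU, using 0 at z = 0 by convention."""
    return 1.0 if z > 0 else 0.0


def activation_grad_factor(grad_fn, pre_activations):
    """Product of activation derivatives along one path through the layers."""
    factor = 1.0
    for z in pre_activations:
        factor *= grad_fn(z)
    return factor


depth = 10
# Best case for sigmoid: every pre-activation sits exactly at 0.
sigmoid_best_case = activation_grad_factor(sigmoid_grad, [0.0] * depth)
# A path on which every ReLU unit is active.
relu_active_path = activation_grad_factor(relu_grad, [1.0] * depth)

print(sigmoid_best_case)  # 0.25 ** 10, roughly 9.5e-07
print(relu_active_path)   # 1.0
```

Even in the sigmoid's best case the factor shrinks by about six orders of magnitude over ten layers, while an all-active ReLU path leaves it at 1.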

### Computational efficiency

ReLU requires only a comparison and a conditional assignment: if x > 0, output x; otherwise, output 0. Sigmoid requires computing an exponential function, and tanh requires computing two exponentials. On modern hardware, the difference per operation is small, but neural networks evaluate activation functions billions of times during training. The cumulative savings are significant. Krizhevsky et al. (2012) reported that a four-layer convolutional network with ReLU reached 25% training error on CIFAR-10 six times faster than the same network with tanh [4].

## What is the dying ReLU problem?

The most significant drawback of ReLU is the dying ReLU problem. When a neuron's weighted input is negative for every example in the training set, the neuron outputs zero for all inputs. Because the gradient of ReLU is zero for negative inputs, the neuron's weights receive no gradient updates, and the neuron cannot recover. It is permanently "dead."

Several conditions can trigger dying ReLU:

- **High learning rates:** A large weight update can push a neuron's weights into a region where all inputs produce negative pre-activation values.
- **Poor initialization:** If initial weights happen to produce consistently negative pre-activations, the neuron starts dead and stays dead.
- **Large negative biases:** A bias term that is too negative offsets the weighted sum below zero for all inputs.

Lu et al. (2019) provided a theoretical analysis showing that the probability of dying ReLU increases with network depth and decreases with width, and that proper initialization is critical to preventing it [9]. In practice, using [He initialization](/wiki/he_initialization) (which sets the variance of initial weights to 2/fan_in) substantially reduces the risk.
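
He initialization with a normal distribution can be sketched in a few lines. The helper name `he_normal` and the layer sizes below are illustrative; deep learning frameworks ship their own initializers, which should be preferred in real code.

```python
import math
import random


def he_normal(fan_in: int, fan_out: int, rng: random.Random):
    """Sample a fan_out x fan_in weight matrix with variance 2 / fan_in."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)
weights = he_normal(fan_in=256, fan_out=128, rng=rng)

# Check the empirical variance against the 2 / fan_in target.
flat = [w for row in weights for w in row]
mean = sum(flat) / len(flat)
variance = sum((w - mean) ** 2 for w in flat) / len(flat)
print(f"target variance: {2.0 / 256:.5f}  empirical: {variance:.5f}")
```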

In some networks, over 40% of neurons can become dead during training, severely reducing the effective capacity of the model. Monitoring the fraction of dead neurons is a useful diagnostic during training.
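
One simple way to monitor this is to count the neurons that output zero for every example in a batch. The sketch below assumes the post-ReLU activations of one layer are available as a list of per-example lists; the function name and the toy batch are illustrative. A neuron that is silent on a single batch is only a candidate: confirming that it is dead requires checking the whole training set.

```python
def dead_fraction(activations):
    """Fraction of neurons whose post-ReLU output is zero for every example.

    `activations` is a list of examples, each a list with one value per neuron.
    """
    n_neurons = len(activations[0])
    dead = sum(
        1
        for j in range(n_neurons)
        if all(example[j] == 0.0 for example in activations)
    )
    return dead / n_neurons


# Toy batch: 3 examples, 4 neurons. Only neuron 0 is zero everywhere.
batch = [
    [0.0, 1.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.7],
    [0.0, 2.1, 0.5, 0.0],
]
print(dead_fraction(batch))  # 0.25
```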

## What are the ReLU variants?

Several variants have been proposed to address the dying ReLU problem while retaining ReLU's advantages.

| Variant | Formula | Key difference from ReLU | Proposed by |
|---|---|---|---|
| Leaky ReLU | $$f(x) = x$$ if $$x > 0$$, else $$0.01x$$ | Small gradient for negative inputs prevents dead neurons | Maas et al., 2013 |
| PReLU (Parametric ReLU) | $$f(x) = x$$ if $$x > 0$$, else $$ax$$ ($$a$$ is learned) | Negative slope is a trainable [parameter](/wiki/parameter); achieved state-of-the-art on ImageNet | He et al., 2015 |
| ELU (Exponential Linear Unit) | $$f(x) = x$$ if $$x > 0$$, else $$\alpha(e^x - 1)$$ | Smooth exponential curve for negatives; pushes mean activation toward zero | Clevert et al., 2015 |
| SELU (Scaled ELU) | $$f(x) = \lambda \cdot \mathrm{ELU}(x)$$ with specific $$\lambda, \alpha$$ | Self-normalizing; activations maintain zero mean and unit variance without [batch normalization](/wiki/batch_normalization) | Klambauer et al., 2017 |

### Leaky ReLU

Leaky ReLU replaces the flat zero region of ReLU with a small negative slope (typically 0.01) [5]. This means that even when the input is negative, the gradient is non-zero (0.01), so the neuron can still receive gradient updates and potentially recover. The fixed slope is a hyperparameter; values between 0.01 and 0.3 are common.

### PReLU

He et al. (2015) proposed making the negative slope a learnable parameter [6]. In their ImageNet experiments, a PReLU-based model achieved a top-5 error rate of 4.94%, a 26% relative improvement over the ILSVRC 2014 winner GoogLeNet (6.66%), and the first reported result to surpass the estimated human-level performance of 5.1% on the ImageNet classification benchmark [6]. The learned slopes varied across layers and tended to be larger in early layers (0.1-0.3) and smaller in later layers.

### ELU

Clevert, Unterthiner, and Hochreiter (2015) introduced ELU, which uses an exponential curve for negative inputs that smoothly saturates at -alpha (typically alpha=1) [7]. With alpha=1, the slope just below zero matches the slope of 1 above it, so the gradient is continuous at the origin, unlike ReLU's sharp corner. ELU's negative saturation pushes the mean activation closer to zero, which can reduce the bias shift effect and speed up learning.

### SELU

Klambauer et al. (2017) showed that with specific values of lambda (approximately 1.0507) and alpha (approximately 1.6733), along with lecun_normal weight initialization, ELU-based networks have the self-normalizing property: activations converge to zero mean and unit variance through the network without explicit normalization layers [8]. This property holds for fully connected networks but does not extend straightforwardly to convolutional or recurrent architectures, which has limited SELU's adoption.
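
The four variants in this section can be written side by side as scalar functions. This is a minimal sketch: the function names are illustrative, the PReLU slope `a` is passed in as a plain argument rather than learned, and the SELU constants are the rounded values quoted above.

```python
import math

SELU_LAMBDA = 1.0507  # approximate value
SELU_ALPHA = 1.6733   # approximate value


def leaky_relu(x: float, slope: float = 0.01) -> float:
    """Fixed small slope for negative inputs."""
    return x if x > 0 else slope * x


def prelu(x: float, a: float) -> float:
    """Same shape as Leaky ReLU, but `a` would be learned during training."""
    return x if x > 0 else a * x


def elu(x: float, alpha: float = 1.0) -> float:
    """Exponential curve for negatives, saturating at -alpha."""
    return x if x > 0 else alpha * math.expm1(x)


def selu(x: float) -> float:
    """ELU scaled by lambda, with the specific alpha used by SELU."""
    return SELU_LAMBDA * elu(x, SELU_ALPHA)


for x in (-3.0, -1.0, 0.0, 1.0):
    print(f"x={x:+.1f}  leaky={leaky_relu(x):+.4f}  prelu(a=0.25)={prelu(x, 0.25):+.4f}  "
          f"elu={elu(x):+.4f}  selu={selu(x):+.4f}")
```

All four agree with ReLU (up to SELU's scaling) for positive inputs and differ only in how they treat the negative side.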

## How does ReLU differ from sigmoid and tanh?

| Property | ReLU | Sigmoid | Tanh |
|---|---|---|---|
| Output range | $$[0, \infty)$$ | $$(0, 1)$$ | $$(-1, 1)$$ |
| Zero-centered | No | No | Yes |
| Maximum gradient | 1 (constant for x > 0) | 0.25 (at x = 0) | 1.0 (at x = 0) |
| Vanishing gradient | No (for positive inputs) | Yes (severe) | Yes (moderate) |
| Computational cost | Very low (comparison only) | Moderate (exponential) | Moderate (two exponentials) |
| Sparsity | Yes (outputs exactly 0 for negatives) | No (always outputs positive values) | No (outputs zero only at x = 0) |
| Dying neuron problem | Yes | No | No |
| Typical use today | CNN hidden layers | Binary output layers, RNN gates | RNN state computations |

Sigmoid and tanh still have important roles. Sigmoid is the standard activation for binary classification output layers and for gate mechanisms in [LSTM](/wiki/lstm) and [GRU](/wiki/recurrent_neural_network) cells. Tanh is used for the cell state in LSTMs. But for hidden layers in feedforward and convolutional networks, ReLU and its variants have almost entirely replaced both.

## When is ReLU still used and when has it been replaced?

ReLU remains the default activation for hidden layers in convolutional neural networks and many feedforward architectures. For [computer vision](/wiki/computer_vision) tasks using established CNN architectures (ResNet, [VGG](/wiki/vgg), and their descendants), ReLU continues to perform well and is the standard choice.

However, in [transformer](/wiki/transformer)-based models, ReLU has been largely replaced:

- **GELU** is the default in encoder-style transformers like [BERT](/wiki/bert) and [RoBERTa](/wiki/roberta), and in the decoder-only [GPT-2](/wiki/gpt)/GPT-3. Hendrycks and Gimpel (2016) proposed GELU, but it was the adoption by BERT (2018) and GPT-2 (2019) that established it as the transformer standard [10].
- **SiLU/Swish** is used on its own in vision models such as EfficientNet, and it serves as the gate nonlinearity inside SwiGLU, the feed-forward activation of Meta's [LLaMA](/wiki/llama) family and Google's [PaLM](/wiki/palm).
- **SwiGLU**, which combines SiLU with a gated linear unit mechanism (Shazeer, 2020), is the current state of the art for feed-forward layers in large language models [11]. LLaMA, [LLaMA 2](/wiki/llama_2), Mistral, and many recent models use SwiGLU.

The shift from ReLU to GELU/SiLU in transformers is not because ReLU fails catastrophically in that context. Rather, GELU and SiLU provide smoother gradients and slightly better performance on language tasks. The difference is often small (fractions of a percent in perplexity), but at the scale of modern LLM training, even small improvements justify the modest additional computational cost.

For practitioners choosing between ReLU and modern alternatives: if you are working with a CNN or a standard feedforward network, ReLU is still a perfectly good default. If you are building a transformer or fine-tuning a language model, follow the conventions of the model family (typically GELU or SwiGLU). If you encounter dying ReLU during training, switch to Leaky ReLU or ELU rather than immediately jumping to more complex activations.

## GELU, SiLU, and SwiGLU: the transformer-era activation functions

The shift from ReLU to newer activation functions in [transformer](/wiki/transformer) models deserves detailed examination, as these choices directly affect model quality and training stability at scale.

### GELU (Gaussian Error Linear Unit)

GELU was proposed by Hendrycks and Gimpel (2016) and is defined as:

$$
\mathrm{GELU}(x) = x \Phi(x)
$$

where $$\Phi(x)$$ is the cumulative distribution function of the standard normal distribution. In practice, GELU is often approximated as:

$$
\mathrm{GELU}(x) \approx 0.5 x (1 + \tanh(\sqrt{2/\pi}(x + 0.044715 x^3)))
$$

GELU can be interpreted as a smooth version of ReLU. As Hendrycks and Gimpel put it, "the GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs" [10]. Where ReLU zeros out every negative input, GELU multiplies each input by $$\Phi(x)$$, the probability that a standard normal variable falls below $$x$$; the function itself is deterministic. For large positive inputs, GELU behaves like the identity function; for large negative inputs, it outputs near-zero values; and for inputs near zero, it provides a smooth transition.
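
Both forms of GELU given above can be evaluated with the Python standard library, using the identity $$\Phi(x) = \tfrac{1}{2}(1 + \mathrm{erf}(x / \sqrt{2}))$$. The sketch below compares the exact form with the tanh approximation on a grid; the function names and the grid are illustrative.

```python
import math


def gelu_exact(x: float) -> float:
    """GELU(x) = x * Phi(x), with Phi written in terms of erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh-based approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


xs = [i / 10 for i in range(-50, 51)]  # grid from -5.0 to 5.0
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"largest gap between exact and tanh forms on the grid: {max_gap:.2e}")
```

On this grid the two forms stay very close, which is why the approximation is acceptable when an `erf` implementation is unavailable or slow.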

GELU became the standard activation in encoder-style transformers after its adoption in [BERT](/wiki/bert) (2018). It is also used in [GPT-2](/wiki/gpt), [GPT-3](/wiki/gpt-3), RoBERTa, and many other models.

### SiLU (Sigmoid Linear Unit) / Swish

SiLU, also known as Swish, was named by Hendrycks and Gimpel in the GELU paper [10] and later rediscovered through automated search by Ramachandran, Zoph, and Le (2017) at Google, who called it Swish [12]. It is defined as:

$$
\mathrm{SiLU}(x) = x \sigma(x) = \frac{x}{1 + e^{-x}}
$$

Like GELU, SiLU is a smooth, non-monotonic function that allows small negative values to pass through. The key difference is that SiLU uses the sigmoid function rather than the Gaussian CDF. In practice, SiLU and GELU have very similar shapes; the gap between them is largest at moderate magnitudes (roughly 1 to 3 in absolute value) and is most noticeable on the negative side, where both outputs are small.

SiLU is computationally cheaper than GELU because sigmoid is simpler to compute than the Gaussian CDF (even the tanh approximation). This small efficiency advantage matters at the scale of modern LLM training.
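
The non-monotonic shape of SiLU is easy to verify numerically. The sketch below implements SiLU in an overflow-safe way and scans a grid for its minimum; the function name, grid range, and step size are illustrative choices.

```python
import math


def silu(x: float) -> float:
    """SiLU / Swish: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


xs = [i / 100 for i in range(-500, 501)]  # grid from -5.00 to 5.00
x_min = min(xs, key=silu)                 # grid point with the lowest output
print(f"minimum on the grid: silu({x_min:.2f}) = {silu(x_min):.4f}")
```

The output dips slightly below zero for moderately negative inputs and then rises back toward zero as the input becomes more negative, which is what makes the function non-monotonic.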

### SwiGLU

SwiGLU, proposed by Shazeer (2020), combines SiLU with a **gated linear unit** (GLU) mechanism [11]. In a standard transformer feed-forward network (FFN), the computation is:

$$
\mathrm{FFN}(x) = W_2 \cdot \mathrm{activation}(W_1 x + b_1) + b_2
$$

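With SwiGLU, the pointwise activation is replaced by a gated product of two projections (the bias terms are typically omitted), followed by the same output projection $$W_2$$:

$$
\mathrm{FFN}_{\mathrm{SwiGLU}}(x) = W_2 \left( \mathrm{SiLU}(W_1 x) \odot (W_3 x) \right)
$$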

where $$\odot$$ denotes element-wise multiplication and $$W_3$$ is an additional weight matrix. The gating mechanism allows the network to learn which features to pass through, providing more expressive power than a simple pointwise activation. Shazeer reports that GLU variants such as GEGLU and SwiGLU, tested "in the feed-forward sublayers of the Transformer sequence-to-sequence model," "yield quality improvements over the typically-used ReLU or GELU activations" [11].

SwiGLU has become the dominant FFN activation in modern LLMs. It was first used at scale in Google's PaLM (2022) and subsequently adopted by Meta's [LLaMA](/wiki/llama) family, [Mistral](/wiki/mistral), and many other open-weight models [13].

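The tradeoff is that SwiGLU requires an extra weight matrix ($$W_3$$), which increases the total number of parameters. To compensate, the hidden dimension of the FFN is typically scaled by a factor of 2/3 (e.g., from $$4 d_{\text{model}}$$ to $$\tfrac{8}{3} d_{\text{model}}$$). Three matrices of size $$d_{\text{model}} \times \tfrac{8}{3} d_{\text{model}}$$ then hold about $$8 d_{\text{model}}^2$$ weights, the same as the two $$d_{\text{model}} \times 4 d_{\text{model}}$$ matrices of the standard FFN.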
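
The gated feed-forward computation and the parameter bookkeeping can both be sketched in plain Python. This is a toy under stated assumptions: a bias-free layer with an output projection $$W_2$$ applied to the gated product, matrices stored as lists of rows, and an illustrative model width of 768. The helper names are hypothetical and not taken from any model's codebase.

```python
import math


def silu(v: float) -> float:
    """SiLU / Swish: v * sigmoid(v), written to avoid overflow."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W1, W3, W2):
    """Bias-free gated FFN: W2 @ (SiLU(W1 @ x) * (W3 @ x))."""
    gate = [silu(v) for v in matvec(W1, x)]
    value = matvec(W3, x)
    hidden = [g * v for g, v in zip(gate, value)]  # element-wise product
    return matvec(W2, hidden)


def ffn_weight_count(d_model: int, d_hidden: int, n_matrices: int) -> int:
    """Number of weights in n_matrices matrices of shape d_model x d_hidden."""
    return n_matrices * d_model * d_hidden


d_model = 768  # illustrative width
standard = ffn_weight_count(d_model, 4 * d_model, n_matrices=2)       # W1, W2
gated = ffn_weight_count(d_model, 8 * d_model // 3, n_matrices=3)     # W1, W3, W2
print(standard, gated)  # equal when d_model is divisible by 3

# Tiny forward pass: 2 inputs, 1 hidden unit, 2 outputs.
W1 = [[1.0, 0.0]]    # gate projection reads the first input
W3 = [[0.0, 1.0]]    # value projection reads the second input
W2 = [[1.0], [2.0]]  # output projection
y = swiglu_ffn([2.0, 3.0], W1, W3, W2)
```

In the tiny example the single hidden unit equals `silu(2.0) * 3.0`, and the two outputs are that value scaled by 1 and 2.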

### Comparison table

| Activation | Formula | Used in | Computational cost | Key advantage |
|---|---|---|---|---|
| [ReLU](/wiki/relu) | $$\max(0, x)$$ | ResNet, VGG, older CNNs | Lowest | Simplicity, sparsity |
| GELU | $$x \Phi(x)$$ | BERT, GPT-2, GPT-3, RoBERTa | Moderate (erf or tanh approximation) | Smooth gradients near zero |
| SiLU (Swish) | $$x \sigma(x)$$ | EfficientNet, some LLMs | Moderate (sigmoid only) | Smooth, slightly cheaper than GELU |
| SwiGLU | $$\mathrm{SiLU}(W_1 x) \odot (W_3 x)$$ | LLaMA, PaLM, Mistral, DeepSeek | Higher (extra weight matrix) | Gating mechanism improves expressivity |
| GeGLU | $$\mathrm{GELU}(W_1 x) \odot (W_3 x)$$ | Some research models | Higher | GELU variant of gated mechanism |
| Smooth-SwiGLU | Modified SwiGLU for FP8 stability | Intel Gaudi FP8 training | Similar to SwiGLU | Avoids outlier amplification in low precision |

### Activation function selection guide (2025-2026)

The choice of activation function depends on the architecture and task:

| Architecture / Use case | Recommended activation | Notes |
|---|---|---|
| CNN hidden layers (ResNet, VGG, etc.) | ReLU | Still the default; well-understood, fast |
| Vision transformer (ViT) | GELU | Standard since the original ViT paper |
| Encoder-only transformer (BERT-style) | GELU | Established convention |
| Decoder-only LLM (new training) | SwiGLU | Current best practice; used by most SOTA models |
| Fine-tuning a pre-trained LLM | Match the base model | Use whatever activation the pre-trained model uses |
| Binary output layer | [Sigmoid](/wiki/sigmoid_function) | Standard for binary classification |
| Multi-class output layer | [Softmax](/wiki/softmax) | Standard for multi-class classification |
| RNN/LSTM gates | Sigmoid and tanh | Required by the gating mechanism |
| FP8 / very low precision training | Smooth-SwiGLU or GELU | SwiGLU can amplify outliers at low precision |

### ReLU strikes back

Despite the dominance of GELU and SwiGLU in transformers, recent research has renewed interest in ReLU for large language models. The paper "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models" by Mirzadeh, Alizadeh, and colleagues at Apple, presented as an oral at ICLR 2024, found that using ReLU has "negligible impact on convergence and performance while significantly reducing computation and weight transfer" [14]. By exploiting the natural sparsity that ReLU induces in the FFN activations, the authors showed that inference computation could be reduced by up to a factor of three, which is especially valuable in the memory-bound inference regime [14]. The result revived the practical case for ReLU sparsity at LLM scale, where it had been displaced by smoother but denser alternatives.
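
The basic idea behind exploiting this sparsity can be illustrated with a toy feed-forward layer: any hidden unit that ReLU sets to zero contributes nothing to the output, so the matching column of the second weight matrix never needs to be read. The sketch below is only a hedged illustration of that idea with hypothetical helper names and hand-picked weights; it is not the method or code of [14], and real speedups depend on kernels and hardware that can take advantage of the skipped work.

```python
def hidden_relu(x, W1):
    """Hidden activations of a bias-free ReLU layer."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def ffn_dense(x, W1, W2):
    """Standard computation: every hidden unit is multiplied, zero or not."""
    hidden = hidden_relu(x, W1)
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def ffn_sparse(x, W1, W2):
    """Skip hidden units that ReLU zeroed out; returns output and active count."""
    hidden = hidden_relu(x, W1)
    active = [j for j, h in enumerate(hidden) if h != 0.0]
    output = [sum(row[j] * hidden[j] for j in active) for row in W2]
    return output, len(active)


# Hand-picked weights so that exactly two of four hidden units are active.
W1 = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
W2 = [[1.0, 1.0, 1.0, 1.0], [0.5, -1.0, 2.0, 3.0]]
x = [1.0, 2.0]

dense_out = ffn_dense(x, W1, W2)
sparse_out, n_active = ffn_sparse(x, W1, W2)
print(dense_out, sparse_out, n_active)  # [3.0, 4.5] [3.0, 4.5] 2
```

Both paths produce the same output, but the sparse path touches only half of the columns of `W2` in this example.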

## Explain Like I'm 5 (ELI5)

Imagine you're trying to learn a new skill, like playing soccer. Your brain has to figure out which moves work well and which ones do not. In machine learning, a similar process happens when a computer tries to learn something new. The ReLU function helps the computer decide which parts of its "brain" to use for learning. When the computer finds something important (a positive signal), the ReLU function keeps it. When it finds something unimportant or bad (a negative signal), it sets it to zero. This way, the computer can learn more efficiently and figure out the best way to complete a task.

## References

1. Hahnloser, R. H. R., Sarpeshkar, R., Mahowald, M. A., Douglas, R. J., & Seung, H. S. (2000). "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit." *Nature*, 405, 947-951. [https://doi.org/10.1038/35016072](https://doi.org/10.1038/35016072)
2. Nair, V., & Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." *Proceedings of the 27th International Conference on Machine Learning (ICML)*, Haifa, Israel, pp. 807-814. [https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf](https://www.cs.toronto.edu/~fritz/absps/reluICML.pdf)
3. Glorot, X., Bordes, A., & Bengio, Y. (2011). "Deep Sparse Rectifier Neural Networks." *Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS)*, vol. 15, pp. 315-323. [https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf](https://proceedings.mlr.press/v15/glorot11a/glorot11a.pdf)
4. Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). "ImageNet Classification with Deep Convolutional Neural Networks." *[NeurIPS](/wiki/neurips) 2012*. [https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks)
5. Maas, A. L., Hannun, A. Y., & Ng, A. Y. (2013). "Rectifier Nonlinearities Improve Neural Network Acoustic Models." *ICML Workshop on Deep Learning for Audio, Speech, and Language Processing*.
6. He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification." *Proceedings of the IEEE International Conference on Computer Vision (ICCV)*, pp. 1026-1034. [https://arxiv.org/abs/1502.01852](https://arxiv.org/abs/1502.01852)
7. Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)." *arXiv preprint*. [https://arxiv.org/abs/1511.07289](https://arxiv.org/abs/1511.07289)
8. Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). "Self-Normalizing Neural Networks." *NeurIPS 2017*. [https://arxiv.org/abs/1706.02515](https://arxiv.org/abs/1706.02515)
9. Lu, L., Shin, Y., Su, Y., & Karniadakis, G. E. (2019). "Dying ReLU and Initialization: Theory and Numerical Examples." *arXiv preprint*. [https://arxiv.org/abs/1903.06733](https://arxiv.org/abs/1903.06733)
10. Hendrycks, D., & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." *arXiv preprint*. [https://arxiv.org/abs/1606.08415](https://arxiv.org/abs/1606.08415)
11. Shazeer, N. (2020). "GLU Variants Improve Transformer." *arXiv preprint*. [https://arxiv.org/abs/2002.05202](https://arxiv.org/abs/2002.05202)
12. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). "Searching for Activation Functions." *arXiv preprint*. [https://arxiv.org/abs/1710.05941](https://arxiv.org/abs/1710.05941)
13. Chowdhery, A., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint*. [https://arxiv.org/abs/2204.02311](https://arxiv.org/abs/2204.02311)
14. Mirzadeh, I., Alizadeh, K., Mehta, S., Del Mundo, C. C., Tuzel, O., Samei, G., Rastegari, M., & Farajtabar, M. (2024). "ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models." *ICLR 2024 (Oral)*. [https://arxiv.org/abs/2310.04564](https://arxiv.org/abs/2310.04564)