# GELU (Gaussian Error Linear Unit)

> Source: https://aiwiki.ai/wiki/gelu
> Updated: 2026-06-22
> Categories: Artificial Intelligence, Deep Learning, Neural Networks
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

The **Gaussian Error Linear Unit (GELU)** is a smooth, non-monotonic [activation function](/wiki/activation_function) defined as `GELU(x) = x · Φ(x)`, where `Φ(x)` is the cumulative distribution function of the standard normal distribution. It was introduced by [Dan Hendrycks](/wiki/dan_hendrycks) and [Kevin Gimpel](/wiki/kevin_gimpel) in the 2016 paper *Gaussian Error Linear Units (GELUs)*, posted to arXiv on 27 June 2016 as 1606.08415, and it is the default nonlinearity in the feed-forward layers of [BERT](/wiki/bert), the GPT series, and most [transformer](/wiki/transformer) models built between 2018 and 2022.[1] The paper's defining sentence is that the GELU "weights inputs by their value, rather than gates inputs by their sign as in ReLUs."[1]

Hendrycks was then at the University of Chicago (later UC Berkeley), and Gimpel was at the [Toyota Technological Institute at Chicago](/wiki/ttic). The earliest preprint was titled *Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units*, and the authors kept revising the manuscript through June 2020 (final revision v5).[1]

[BERT](/wiki/bert), [GPT-2](/wiki/gpt-2), [GPT-3](/wiki/gpt-3), [RoBERTa](/wiki/roberta), [ALBERT](/wiki/albert), [ELECTRA](/wiki/electra), the original GPT (often called GPT-1), and the original [Vision Transformer](/wiki/vision_transformer) all use GELU between the two linear layers of each [feed-forward](/wiki/feedforward_neural_network_ffn) block. [OpenAI](/wiki/openai)'s [Whisper](/wiki/whisper) speech-recognition model uses it as well.[11] Newer architectures like [PaLM](/wiki/palm) and [LLaMA](/wiki/llama) have moved to gated variants such as [SwiGLU](/wiki/swiglu), but GELU remains the activation that most production transformer codebases default to.

## Quick facts

| Property | Value |
|---|---|
| Introduced | June 2016 |
| Paper | *Gaussian Error Linear Units (GELUs)* |
| arXiv ID | 1606.08415 (versions v1 through v5) |
| Authors | Dan Hendrycks, Kevin Gimpel |
| Affiliations | University of Chicago / UC Berkeley; Toyota Technological Institute at Chicago |
| Exact formula | `GELU(x) = x · Φ(x) = (x / 2) · (1 + erf(x / √2))` |
| Tanh approximation | `0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))` |
| Sigmoid approximation | `x · σ(1.702 · x)` |
| Smooth | Yes (infinitely differentiable) |
| Monotonic | No (small dip near `x ≈ -0.75`) |
| Used in | BERT, GPT-1, GPT-2, GPT-3, RoBERTa, ALBERT, ELECTRA, T5 v1.1 (via GeGLU), Vision Transformer, Whisper |
| Frameworks | PyTorch, TensorFlow, JAX, Hugging Face Transformers |

## What problem does GELU solve?

Before GELU, the dominant nonlinearity in deep networks was the [rectified linear unit](/wiki/relu) (ReLU), introduced in its modern form by Nair and Hinton (2010) and popularised by AlexNet (Krizhevsky, Sutskever, Hinton, 2012). ReLU is `max(0, x)`. It is cheap, sparsity-inducing, and largely solves the vanishing-gradient problem that plagued sigmoidal networks. It also has a well-known failure mode usually called the *dead ReLU* problem: once a unit has a sufficiently negative pre-activation, its gradient is exactly zero, and standard gradient descent cannot recover it.

A family of patches followed. **Leaky ReLU** (Maas, Hannun, Ng, 2013) replaces the flat negative half with a small linear slope `αx`.[16] **PReLU** (He et al., 2015) makes that slope a learned parameter.[17] **ELU** (Clevert, Unterthiner, Hochreiter, 2015) replaces the negative half with `α · (eˣ - 1)`, giving a smooth saturating curve below zero.[14] **SELU** (Klambauer et al., 2017) adds a fixed multiplicative scale chosen so successive layers self-normalise their activation statistics.[15]

All of these are deterministic point-wise functions of the pre-activation. In parallel, regularisation work was producing increasingly sophisticated stochastic perturbations of the forward pass: [dropout](/wiki/dropout_regularization) (Srivastava et al., 2014), zoneout (Krueger et al., 2016), and stochastic depth (Huang et al., 2016). Hendrycks and Gimpel's contribution was to notice that a particular kind of stochastic gating gives rise, in expectation, to a smooth deterministic activation that outperforms ReLU and ELU.[1] The paper introduces what it calls the *weighted-input view*: an activation function can be read as a way to weight or gate its input by some function of itself. ReLU weights the input by `1[x > 0]`, ELU weights the negative half by `(eˣ - 1) / x`, and GELU weights it by the probability that a Gaussian random variable is below the input.[1]

## How is GELU defined?

The exact definition is

```
GELU(x) = x · Φ(x)
```

where `Φ` is the standard normal CDF. Equivalently, using the [error function](/wiki/error_function) `erf`:

```
GELU(x) = 0.5 · x · (1 + erf(x / √2))
```

The function is smooth (infinitely differentiable), non-monotonic, and approximately linear for large positive `x`. For negative `x` it stays small but never identically zero, so unlike [ReLU](/wiki/relu) it does not produce dead units that always output zero. Asymptotically, GELU coincides with ReLU: `GELU(x) → x` as `x → +∞` and `GELU(x) → 0` as `x → -∞`.

A few quick numerical values give a feel for the shape:

| Input `x` | `Φ(x)` | `GELU(x)` |
|---|---|---|
| -3 | 0.00135 | -0.00405 |
| -2 | 0.0228 | -0.0455 |
| -1 | 0.1587 | -0.1587 |
| -0.75 | 0.2266 | -0.170 |
| -0.5 | 0.3085 | -0.1543 |
| 0 | 0.5 | 0 |
| 0.5 | 0.6915 | 0.3457 |
| 1 | 0.8413 | 0.8413 |
| 2 | 0.9772 | 1.9545 |
| 3 | 0.99865 | 2.996 |

GELU has a small negative dip near `x ≈ -0.75`, where it bottoms out at roughly -0.170 (the global minimum). By `x ≈ 3` it is within 0.005 of `x`, and below `x ≈ -3` its magnitude is under 0.005. Figure 1 of the original paper plots GELU alongside ReLU and ELU and is the reference picture most readers carry in their heads.[1]
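The table values and the location of the dip can be reproduced with the standard library alone. The following is a minimal sketch; the helper names `phi_cdf`, `gelu`, and `gelu_argmin` are illustrative and not taken from any framework.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi_cdf(x)

def gelu_argmin(lo: float = -3.0, hi: float = 0.0, steps: int = 30000) -> float:
    """Grid search over [lo, hi] for the input where GELU is smallest."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return min(grid, key=gelu)

# Print a few rows in the same layout as the table above.
for x in (-3.0, -1.0, -0.75, 0.0, 1.0, 3.0):
    print(f"x={x:+.2f}  Phi={phi_cdf(x):.5f}  GELU={gelu(x):+.5f}")

x_min = gelu_argmin()
print(f"dip located at x={x_min:.4f} with GELU={gelu(x_min):.4f}")
```

A grid search is used here only because it is easy to read; solving `GELU'(x) = 0` with a root finder would locate the same point more precisely.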

The derivative of the exact form is

```
GELU'(x) = Φ(x) + x · φ(x)
```

where `φ(x) = (1 / √(2π)) · exp(-x² / 2)` is the standard normal probability density function. The autograd engines in [PyTorch](/wiki/pytorch), [TensorFlow](/wiki/tensorflow), and JAX all compute this derivative automatically. Because `GELU''(x) = φ(x) · (2 - x²)`, the derivative peaks at `x = √2 ≈ 1.41` with a value of about 1.13 and bottoms out at `x = -√2` with a value of about -0.13. It is therefore bounded, so GELU does not amplify gradients in the way some unbounded activations can.
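As a sanity check on the derivative formula, the sketch below compares it against a central finite difference and evaluates it at `x = ±√2`, where the second derivative `φ(x) · (2 - x²)` vanishes. The function names are illustrative, and the step size `h` is an arbitrary choice.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x: float) -> float:
    """Standard normal PDF."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x: float) -> float:
    return x * phi_cdf(x)

def gelu_grad(x: float) -> float:
    """Analytic derivative: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def central_difference(f, x: float, h: float = 1e-5) -> float:
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Largest gap between the analytic and numerical derivative on a few points.
max_gap = max(
    abs(gelu_grad(x) - central_difference(gelu, x))
    for x in (-2.0, -0.5, 0.0, 1.0, 3.0)
)

# GELU''(x) = phi(x) * (2 - x^2), so the derivative has its extrema at +/- sqrt(2).
grad_max = gelu_grad(math.sqrt(2.0))
grad_min = gelu_grad(-math.sqrt(2.0))
print(f"max gap={max_gap:.2e}  grad_max={grad_max:.4f}  grad_min={grad_min:.4f}")
```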

## Why a Gaussian gate? The stochastic regularizer interpretation

The original paper motivates GELU as a deterministic version of an input-dependent stochastic regularizer.[1] Suppose you multiply each input by a Bernoulli mask `m`, where `m = 1` with probability `Φ(x)` and `m = 0` otherwise. The expected value of the output is

```
E[m · x] = Φ(x) · x = GELU(x)
```

Under this view, GELU is the expected value of an input that is either kept (probability `Φ(x)`) or dropped (probability `1 - Φ(x)`), where the keep probability rises monotonically with the input. Hendrycks and Gimpel describe it as combining the gating behaviour of ReLU and the stochasticity of dropout and zoneout into a single deterministic nonlinearity.[1] Strongly negative inputs are almost always dropped, strongly positive inputs are almost always kept, and inputs near zero are kept about half the time.

This framing explains the slight negative bump: when `x` is moderately negative, the mask is unlikely to be 1, but if it is, the output is small and negative. The expected value reflects that. The paper also defines a *stochastic GELU* by drawing the mask `m ~ Bernoulli(Φ(x))` at training time and using the expected value at inference.[1] In practice the deterministic version is what everyone uses.
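The expectation argument can be checked empirically with a small Monte Carlo simulation. This is a sketch under stated assumptions: the function `stochastic_gate_mean`, the sample count, and the fixed seed are illustrative choices, not part of the paper.

```python
import math
import random

def phi_cdf(x: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    return x * phi_cdf(x)

def stochastic_gate_mean(x: float, n_samples: int, rng: random.Random) -> float:
    """Average of m * x over independent draws m ~ Bernoulli(Phi(x))."""
    keep_prob = phi_cdf(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < keep_prob)
    return x * kept / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
estimates = {}
for x in (-1.0, 0.5, 2.0):
    estimates[x] = stochastic_gate_mean(x, 100_000, rng)
    print(f"x={x:+.1f}  sampled mean={estimates[x]:+.4f}  GELU={gelu(x):+.4f}")
```

With 100,000 draws per input the sampling error is on the order of `1e-3`, so the sampled means should agree with `GELU(x)` to about two decimal places.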

The choice of the Gaussian CDF rather than the logistic sigmoid matches the assumption that pre-activations in well-initialised deep networks are approximately Gaussian. The sigmoid approximation `x · σ(1.702 · x)` substitutes the logistic CDF, with its argument rescaled by 1.702 so that `σ(1.702 · x)` stays within about 0.01 of `Φ(x)` for every `x`.

## What are the GELU approximations?

The exact form requires evaluating `erf`, which is more expensive than the elementary operations used by ReLU or the [sigmoid function](/wiki/sigmoid_function). The 2016 paper proposed two approximations that became standard practice for years.[1]

**Tanh approximation:**

```
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
```

The factor `√(2/π) ≈ 0.7978845608` makes the slope of the tanh term at the origin equal to that of `erf(x / √2)`, and the cubic coefficient `0.044715` was fitted empirically to tighten the match away from the origin.[1] This is the form used in the original BERT and GPT-2 reference implementations, and a great deal of pretrained-checkpoint code still uses it.[2][4]

**Sigmoid approximation:**

```
GELU(x) ≈ x · σ(1.702 · x)
```

where `σ` is the logistic sigmoid. This is cheaper than the tanh approximation but less accurate. It is rarely used in production but appears in the original paper as a convenient closed form.[1] The constant `1.702` is the scale at which the logistic CDF `σ(αx)` most closely tracks `Φ(x)` (the gap stays below about 0.01 everywhere), and it is also the factor that turns plain [SiLU](/wiki/silu) (`x · σ(x)`) into a GELU approximation.[1][13]

**Exact form** uses `erf` directly. GPU math libraries ship fast `erf` implementations, so on contemporary GPUs the exact form is competitive with the tanh approximation in wall-clock time. PyTorch exposes the choice through `torch.nn.functional.gelu(x, approximate='none')` (the default, which calls `erf`) and `approximate='tanh'` (the original BERT formulation).[20] TensorFlow exposes the same selection through `tf.nn.gelu` with an `approximate=True/False` flag.[21] JAX provides `jax.nn.gelu` with an `approximate` keyword argument that defaults to `True`.[22]

The maximum absolute deviation between the tanh approximation and the exact `erf` form is roughly `5e-4` over the range `[-5, 5]`. The sigmoid approximation deviates by about `2e-2` in the same range.
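The size of these deviations can be measured directly. The sketch below scans a uniform grid on `[-5, 5]`; the grid resolution and the helper `max_abs_error` are illustrative assumptions, and a finer grid would shift the reported maxima only slightly.

```python
import math

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x: float) -> float:
    return x / (1.0 + math.exp(-1.702 * x))

def max_abs_error(approx, lo: float = -5.0, hi: float = 5.0, steps: int = 10000) -> float:
    """Largest |approx(x) - gelu_exact(x)| over a uniform grid on [lo, hi]."""
    grid = (lo + (hi - lo) * i / steps for i in range(steps + 1))
    return max(abs(approx(x) - gelu_exact(x)) for x in grid)

tanh_err = max_abs_error(gelu_tanh)
sigmoid_err = max_abs_error(gelu_sigmoid)
print(f"tanh approximation:    max abs error = {tanh_err:.2e}")
print(f"sigmoid approximation: max abs error = {sigmoid_err:.2e}")
```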

## Properties

- **Smooth.** Continuous derivatives of all orders, unlike ReLU (kink at zero) or Leaky ReLU.
- **Non-monotonic.** Decreases on `(-∞, -0.75)` then increases, so it is not one-to-one on the negative axis.
- **Differentiable everywhere.** No special handling for `x = 0`.
- **Asymptotically ReLU.** `GELU(x) - max(0, x) → 0` as `|x| → ∞`.
- **Bounded below.** `GELU(x) ≥ -0.170` for all real `x`.
- **Approximately zero-centred outputs.** Easier to layer-normalise than ReLU.
- **Self-gated.** The function gates each input by some monotone function of itself, putting it in the same family as Swish/SiLU and Mish.

GELU's small negative bump means that two distinct inputs can produce the same output, which complicates some theoretical analyses but does not seem to hurt training in practice.
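To make the non-injectivity concrete, the sketch below takes one input to the right of the dip and finds a second input to the left of it with the same output, using bisection on the decreasing branch. The function `matching_input_left_of_dip`, the starting input `-0.3`, and the bracket `[-8, -0.7518]` are illustrative choices.

```python
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

DIP = -0.7518  # approximate location of the global minimum

def matching_input_left_of_dip(y: float, lo: float = -8.0, hi: float = DIP,
                               iters: int = 80) -> float:
    """Bisection on the branch x < DIP, where GELU is decreasing, for GELU(x) = y.

    Assumes y lies strictly between the minimum value (about -0.170) and 0.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gelu(mid) > y:
            lo = mid   # output still above the target: the solution is closer to the dip
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_a = -0.3                      # input on the increasing branch
x_b = matching_input_left_of_dip(gelu(x_a))
print(f"GELU({x_a}) = {gelu(x_a):.6f}")
print(f"GELU({x_b:.4f}) = {gelu(x_b):.6f}")
```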

## How does GELU differ from ReLU and other activations?

| Activation | Formula | Smooth | Non-monotonic | Notes |
|---|---|---|---|---|
| [ReLU](/wiki/relu) | `max(0, x)` | No | No | Cheapest. Standard before transformers. Has dead-unit problem. |
| Leaky ReLU | `max(αx, x)`, `α = 0.01` typical | No | No | Avoids dead units. Cheap. |
| PReLU | `max(αx, x)`, `α` learned | No | No | Adds parameters per channel. |
| [ELU](/wiki/elu) | `x` if `x > 0`, else `α(eˣ - 1)` | Yes | No | Smooth saturating negatives. |
| SELU | Scaled ELU with fixed `α`, `λ` | No (kink at zero) | No | Self-normalising. |
| [SiLU / Swish](/wiki/silu) | `x · σ(x)` | Yes | Yes | GELU's sigmoid form with the `1.702` coefficient set to 1. |
| GELU (exact) | `x · Φ(x)` | Yes | Yes | Gaussian gate. Min around -0.170. |
| GELU (tanh approx) | `0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))` | Yes | Yes | BERT/GPT-2 default. |
| [Mish](/wiki/mish) | `x · tanh(softplus(x))` | Yes | Yes | Min about -0.31. |

The key difference from ReLU is captured by the original abstract: GELU "weights inputs by their value, rather than gates inputs by their sign."[1] ReLU makes a hard binary decision based on the sign of the pre-activation; GELU applies a smooth, probabilistic weighting that keeps a small nonzero signal for negative inputs and so avoids dead units.

The function `x · σ(x)` appears in the GELU paper as the Sigmoid Linear Unit (SiLU), the logistic-CDF counterpart of GELU, and was rediscovered by Ramachandran, Zoph, and Le in 2017 under the name Swish.[1][13] The two names refer to the same function. GELU is sometimes described as Swish with a Gaussian gate instead of a sigmoid one. In practice GELU and SiLU produce similar shapes, but GELU's negative dip is shallower and sits closer to the origin: about -0.170 near `x ≈ -0.75`, against about -0.278 near `x ≈ -1.28` for SiLU.

## What did the original experiments show?

Hendrycks and Gimpel ran experiments on MNIST classification, MNIST autoencoding, CIFAR-10, CIFAR-100, and the TIMIT phone-recognition task, reporting "performance improvements across all considered computer vision, natural language processing, and speech tasks."[1] GELU outperformed ReLU and ELU on the median across runs in every setting they tested, with the largest gains on TIMIT. The differences were modest but consistent.[1]

- **MNIST classification.** Fully connected networks of depth 8 with widths 128 to 1024. GELU reached lower median test error than ReLU and ELU across batch sizes.
- **MNIST autoencoding.** Eight-layer autoencoder. GELU achieved lower mean reconstruction loss than ReLU or ELU.
- **CIFAR-10 and CIFAR-100.** Wide residual network (Wide ResNet 28-10) trained for 200 epochs. GELU narrowly ahead on both datasets.
- **TIMIT phone recognition.** Five-hidden-layer fully connected network. GELU beat ReLU and ELU by close to a percentage point in median phone error rate.

These benchmarks predate the transformer era. GELU caught on less because of the MNIST results than because the GPT and BERT teams adopted it for their FFN layers, after which most transformer papers simply inherited the activation used in the BERT codebase.

## Which transformer models use GELU?

GELU was not the first nonlinearity used inside a transformer. The original *Attention Is All You Need* paper (Vaswani et al., 2017) used ReLU in the position-wise feed-forward sublayer.[19] The shift to GELU happened with the next generation of pretrained models.

| Model | Year | FFN activation | Approximation in reference code |
|---|---|---|---|
| Original Transformer | 2017 | ReLU | n/a |
| GPT (GPT-1) | 2018 | GELU | Tanh approximation |
| [BERT](/wiki/bert) | 2018 | GELU | Tanh approximation |
| [GPT-2](/wiki/gpt-2) | 2019 | GELU | Tanh approximation |
| [RoBERTa](/wiki/roberta) | 2019 | GELU | Tanh approximation |
| XLNet | 2019 | GELU | Tanh approximation |
| [ALBERT](/wiki/albert) | 2019 | GELU | Tanh approximation |
| T5 (v1.0) | 2019 | ReLU | n/a |
| T5 (v1.1) | 2020 | GeGLU (uses GELU) | Exact |
| [ELECTRA](/wiki/electra) | 2020 | GELU | Tanh approximation |
| [GPT-3](/wiki/gpt-3) | 2020 | GELU | Tanh approximation |
| [Vision Transformer](/wiki/vision_transformer) | 2020 | GELU | Tanh approximation |
| [Whisper](/wiki/whisper) | 2022 | GELU | Tanh approximation |
| [PaLM](/wiki/palm) | 2022 | SwiGLU | n/a |
| [LLaMA](/wiki/llama) | 2023 | SwiGLU | n/a |

BERT (Devlin et al., October 2018) is the single largest cause of GELU's spread.[2] Its public TensorFlow reference implementation defined `gelu` using the tanh approximation, and every subsequent paper that built on BERT inherited that choice.[2] The OpenAI GPT-1 paper (Radford et al., June 2018) also used GELU, but BERT's open-source release was the catalyst that pushed it into nearly every transformer codebase between 2018 and 2022.[3]

## How does GELU compare to SwiGLU?

The activation question got more complicated in February 2020 with Noam Shazeer's paper *GLU Variants Improve Transformer* (arXiv:2002.05202), which proposed using gated linear units in place of the standard FFN.[12] Two variants became popular:

- **GeGLU**: `GeGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c)`. T5 v1.1 and the LaMDA family use this.[12]
- **SwiGLU**: same structure but with SiLU/Swish in place of GELU. PaLM and LLaMA adopted this, and the Mistral and Mixtral lines kept it.[12]

Gated FFNs use three weight matrices instead of two, so the inner dimension is shrunk to `2/3` of its usual size to keep the parameter count roughly constant.[12] They consistently produce small but reliable perplexity improvements over plain GELU FFNs. Most large open-weight models released after 2022 use SwiGLU. GELU is still the default in encoder-only and encoder-decoder models, in vision transformers, and in any transformer codebase matching the BERT lineage. Shazeer offers no theoretical account of the gains, concluding: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."[12]
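The parameter-count argument and the GeGLU formula can be illustrated with plain Python lists. This is a sketch rather than a production layer: the helpers `ffn_params`, `gated_ffn_params`, `matvec`, and `geglu` are hypothetical names, biases are ignored in the counts, and the sizes 768 and 3072 are illustrative.

```python
import math

def gelu(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a standard FFN: one up-projection and one down-projection."""
    return 2 * d_model * d_ff

def gated_ffn_params(d_model: int, d_ff: int) -> int:
    """Weights in a gated FFN: W and V project up, a third matrix projects down."""
    return 3 * d_model * d_ff

def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def geglu(x, w, v, b, c):
    """GeGLU(x, W, V, b, c) = GELU(xW + b) * (xV + c), multiplied elementwise."""
    gate = [gelu(h + bj) for h, bj in zip(matvec(x, w), b)]
    value = [h + cj for h, cj in zip(matvec(x, v), c)]
    return [g * u for g, u in zip(gate, value)]

# Shrinking the inner width to 2/3 offsets the extra matrix.
d_model, d_ff = 768, 3072
d_ff_gated = 2 * d_ff // 3
print(ffn_params(d_model, d_ff), gated_ffn_params(d_model, d_ff_gated))

# Tiny forward pass: identity gate weights, and a value branch that evaluates to 1.
out = geglu([1.0, -2.0],
            w=[[1.0, 0.0], [0.0, 1.0]],
            v=[[1.0, 1.0], [1.0, 1.0]],
            b=[0.0, 0.0],
            c=[2.0, 2.0])
print(out)
```

In the toy forward pass the value branch is `xV + c = [1, 1]`, so the output reduces to `[GELU(1), GELU(-2)]`, which makes the gating term easy to inspect.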

## Implementation

A reference PyTorch implementation of all three forms looks like this:

```python
import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + torch.tanh(c * (x + 0.044715 * x.pow(3))))

def gelu_sigmoid(x):
    return x * torch.sigmoid(1.702 * x)

# Library-provided versions, evaluated on a small example tensor
x = torch.linspace(-3.0, 3.0, steps=7)
y1 = F.gelu(x)                        # exact, uses erf
y2 = F.gelu(x, approximate='tanh')    # tanh approximation (PyTorch >= 1.12)
```

The library version is preferable in real code because it dispatches to fused kernels on supported hardware.

### Framework APIs

- **PyTorch.** `torch.nn.GELU(approximate='none' | 'tanh')` and `torch.nn.functional.gelu`. The `approximate='tanh'` flag was added in PyTorch 1.12 (June 2022); before that, only the exact form was available.[20]
- **TensorFlow.** `tf.keras.activations.gelu(x, approximate=False)`. The `approximate` flag has been present since TensorFlow 2.4.[21]
- **JAX.** `jax.nn.gelu(x, approximate=True)`. The default is the tanh approximation, the opposite of PyTorch's default.[22]
- **[Hugging Face](/wiki/hugging_face) Transformers.** The `transformers.activations` module exposes `gelu` (exact `erf`), `gelu_new` (BERT tanh approximation, kept for checkpoint compatibility), `gelu_fast` (cheaper tanh variant), `gelu_pytorch_tanh` (calls `F.gelu(x, approximate='tanh')`), `quick_gelu` (sigmoid approximation, used by CLIP), and `gelu_python` (a pure-Python reference). The names are historical artefacts.[23]

The `quick_gelu` variant is used in OpenAI's CLIP text encoder.[23] Loading CLIP weights with the wrong activation produces clearly degraded zero-shot accuracy, which is a recurring source of confusion.

## Numerical reproducibility pitfalls

Small numerical differences between the exact and tanh approximations can affect reproducibility. The most well-known case is the divergence between Hugging Face's PyTorch and TensorFlow ports of BERT and GPT-2 from roughly 2018 to 2020. The PyTorch port used a custom tanh-form implementation while the TensorFlow side called the framework's own `gelu`, which at one point was the exact `erf` form. Identical inputs produced outputs that differed in the fifth decimal, and downstream metrics like GLUE accuracy could drift by a fraction of a point. The Hugging Face team eventually unified the implementations and added the `gelu_pytorch_tanh` alias once PyTorch 1.12 added the matching flag.[23]

Practical rules:

- **Match the activation used at training time.** If a checkpoint was trained with the tanh approximation, evaluate it with the tanh approximation.
- **For new training runs, the exact `erf` form is the safer default.** It is framework-agnostic and avoids the historical baggage of competing approximations.
- **Test for numerical equivalence carefully.** Comparing forward passes between frameworks at a tolerance of `1e-6` will fail if one side uses the exact form and the other uses the tanh approximation; `1e-3` is more realistic.
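One way to apply these rules is to compare recorded reference outputs against each candidate form and see which ones match at a given tolerance. The sketch below is hypothetical: `identify_gelu_variant` and the `CANDIDATES` table are illustrative names, and the "recorded" outputs are generated locally as a stand-in for values dumped from a checkpoint's reference code.

```python
import math

def gelu_exact(x: float) -> float:
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x: float) -> float:
    return x / (1.0 + math.exp(-1.702 * x))

# Hypothetical lookup table of candidate forms, checked in this order.
CANDIDATES = {"exact": gelu_exact, "tanh": gelu_tanh, "sigmoid": gelu_sigmoid}

def identify_gelu_variant(inputs, reference_outputs, atol: float = 1e-6):
    """Return the names of the candidate forms that reproduce the reference within atol."""
    matches = []
    for name, fn in CANDIDATES.items():
        worst = max(abs(fn(x) - y) for x, y in zip(inputs, reference_outputs))
        if worst <= atol:
            matches.append(name)
    return matches

inputs = [i / 4.0 for i in range(-16, 17)]       # -4.0 to 4.0 in steps of 0.25
recorded = [gelu_tanh(x) for x in inputs]        # stand-in for reference outputs

strict = identify_gelu_variant(inputs, recorded)             # tight tolerance
loose = identify_gelu_variant(inputs, recorded, atol=1e-3)   # loose tolerance
print(strict, loose)
```

At the tight tolerance only the form that actually produced the outputs matches, while at `1e-3` the exact and tanh forms become indistinguishable, which is why a loose tolerance can hide an activation mismatch.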

A related issue affects half-precision training. The tanh approximation involves a cubic term `0.044715 · x³`; in FP16, whose largest finite value is 65504, `x³` overflows once `|x|` exceeds roughly 40, even though the final output is well-behaved. BF16 shares FP32's exponent range, so it does not overflow this way, though it carries fewer mantissa bits. Most framework implementations compute the cubic in higher precision to avoid this. The exact `erf` form is more numerically stable in low precision.
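The overflow threshold follows from simple arithmetic on the FP16 range. The sketch below works in ordinary double precision and only compares magnitudes against the FP16 maximum; it does not emulate FP16 rounding, and the helper `cube_overflows_fp16` is an illustrative name.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def cube_overflows_fp16(x: float) -> bool:
    """True if x**3 exceeds the FP16 range (rounding at the boundary is ignored)."""
    return abs(x) ** 3 > FP16_MAX

def gelu_tanh(x: float) -> float:
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

# Input magnitude beyond which the intermediate cube no longer fits in FP16.
threshold = FP16_MAX ** (1.0 / 3.0)
print(f"x**3 leaves the FP16 range once |x| exceeds about {threshold:.2f}")

for x in (40.0, 41.0):
    print(f"x={x}: cube={x ** 3:.0f}  overflows={cube_overflows_fp16(x)}  "
          f"final output={gelu_tanh(x)}")
```

The final output at `x = 41` is just `41`, comfortably inside the FP16 range; only the intermediate `x³` is the problem, which is why computing that term in higher precision is enough.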

## Computational considerations

On modern GPUs, the cost of `erf` is small next to the matrix multiplications and bias additions that surround it, so the activation choice is not the bottleneck for most transformer training. Profiling tools usually show GELU taking under 1% of total step time. The tanh approximation can be marginally faster on hardware with weak transcendental support, which is one reason older code paths still use it.

The more practical reason to keep using the tanh form is checkpoint compatibility. If you are running a pretrained model whose weights were tuned against the tanh approximation, switching to the exact form changes the function values slightly, and the model's downstream metrics will drift. The drift is small but reproducible. Derivatives are inexpensive in either form: the exact derivative is `Φ(x) + x · φ(x)`, and every autograd framework computes the tanh derivative correctly.

## See also

- [Activation Function](/wiki/activation_function)
- [ReLU](/wiki/relu)
- [SiLU / Swish](/wiki/silu)
- [SwiGLU](/wiki/swiglu)
- [GLU](/wiki/glu)
- [Transformer](/wiki/transformer)
- [BERT](/wiki/bert)

## References

1. Hendrycks, Dan; Gimpel, Kevin. *Gaussian Error Linear Units (GELUs)*. arXiv:1606.08415, 27 June 2016 (final revision v5, June 2020). https://arxiv.org/abs/1606.08415
2. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. arXiv:1810.04805, October 2018.
3. Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya. *Improving Language Understanding by Generative Pre-Training* (GPT-1), OpenAI, June 2018.
4. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya. *Language Models are Unsupervised Multitask Learners* (GPT-2), OpenAI, 2019.
5. Brown, Tom et al. *Language Models are Few-Shot Learners* (GPT-3). arXiv:2005.14165, May 2020.
6. Liu, Yinhan et al. *RoBERTa: A Robustly Optimized BERT Pretraining Approach*. arXiv:1907.11692, July 2019.
7. Lan, Zhenzhong et al. *ALBERT: A Lite BERT for Self-supervised Learning of Language Representations*. arXiv:1909.11942, September 2019.
8. Clark, Kevin; Luong, Minh-Thang; Le, Quoc V.; Manning, Christopher D. *ELECTRA*. arXiv:2003.10555, March 2020.
9. Raffel, Colin et al. *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (T5). arXiv:1910.10683, October 2019.
10. Dosovitskiy, Alexey et al. *An Image is Worth 16x16 Words* (Vision Transformer). arXiv:2010.11929, October 2020.
11. Radford, Alec et al. *Robust Speech Recognition via Large-Scale Weak Supervision* (Whisper). arXiv:2212.04356, December 2022.
12. Shazeer, Noam. *GLU Variants Improve Transformer*. arXiv:2002.05202, February 2020. https://arxiv.org/abs/2002.05202
13. Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. *Searching for Activation Functions* (Swish). arXiv:1710.05941, October 2017.
14. Clevert, Djork-Arne; Unterthiner, Thomas; Hochreiter, Sepp. *Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)*. arXiv:1511.07289, November 2015.
15. Klambauer, Gunter et al. *Self-Normalizing Neural Networks* (SELU). arXiv:1706.02515, June 2017.
16. Maas, Andrew L.; Hannun, Awni Y.; Ng, Andrew Y. *Rectifier Nonlinearities Improve Neural Network Acoustic Models* (Leaky ReLU). ICML 2013.
17. He, Kaiming et al. *Delving Deep into Rectifiers* (PReLU). arXiv:1502.01852, February 2015.
18. Misra, Diganta. *Mish: A Self Regularized Non-Monotonic Activation Function*. arXiv:1908.08681, August 2019.
19. Vaswani, Ashish et al. *Attention Is All You Need*. arXiv:1706.03762, June 2017.
20. PyTorch documentation, `torch.nn.functional.gelu`. https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html
21. TensorFlow documentation, `tf.keras.activations.gelu`. https://www.tensorflow.org/api_docs/python/tf/keras/activations/gelu
22. JAX documentation, `jax.nn.gelu`. https://jax.readthedocs.io/en/latest/_autosummary/jax.nn.gelu.html
23. Hugging Face Transformers, `transformers.activations` source. https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py

