GELU (Gaussian Error Linear Unit)
Last reviewed
May 2, 2026
Sources
23 citations
Review status
Source-backed
Revision
v2 · 3,491 words
The Gaussian Error Linear Unit, almost always shortened to GELU, is a smooth, non-monotonic activation function defined as GELU(x) = x · Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution. It was introduced by Dan Hendrycks (then at the University of Chicago, later UC Berkeley) and Kevin Gimpel of the Toyota Technological Institute at Chicago in the 2016 paper Gaussian Error Linear Units (GELUs), posted to arXiv on 27 June 2016 as 1606.08415. The earliest preprint was titled Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units. Hendrycks and Gimpel kept revising the manuscript through June 2020 (final revision v5).
GELU became the default activation in the feed-forward sublayers of most early transformer language models. BERT, GPT-2, GPT-3, RoBERTa, ALBERT, ELECTRA, the original GPT (often called GPT-1), and the original Vision Transformer all use GELU between the two linear layers of each feed-forward block. OpenAI's Whisper speech-recognition model uses it as well. Newer architectures like PaLM and LLaMA have moved to gated variants such as SwiGLU, but GELU is still ubiquitous and remains the activation that most production transformer codebases default to.
| Property | Value |
|---|---|
| Introduced | June 2016 |
| Paper | Gaussian Error Linear Units (GELUs) |
| arXiv ID | 1606.08415 (versions v1 through v5) |
| Authors | Dan Hendrycks, Kevin Gimpel |
| Affiliations | University of Chicago / UC Berkeley; Toyota Technological Institute at Chicago |
| Exact formula | GELU(x) = x · Φ(x) = (x / 2) · (1 + erf(x / √2)) |
| Tanh approximation | 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))) |
| Sigmoid approximation | x · σ(1.702 · x) |
| Smooth | Yes (infinitely differentiable) |
| Monotonic | No (small dip near x ≈ -0.75) |
| Used in | BERT, GPT-1, GPT-2, GPT-3, RoBERTa, ALBERT, ELECTRA, T5 v1.1 (via GeGLU), Vision Transformer, Whisper |
| Frameworks | PyTorch, TensorFlow, JAX, Hugging Face Transformers |
Before GELU, the dominant nonlinearity in deep networks was the rectified linear unit (ReLU), introduced in its modern form by Nair and Hinton (2010) and popularised by AlexNet (Krizhevsky, Sutskever, Hinton, 2012). ReLU is max(0, x). It is cheap, sparsity-inducing, and largely solves the vanishing-gradient problem that plagued sigmoidal networks. It also has a well-known failure mode usually called the dead ReLU problem: once a unit's pre-activation is negative for every input in the training set, its gradient is exactly zero on all of them, and standard gradient descent cannot recover it.
A family of patches followed. Leaky ReLU (Maas, Hannun, Ng, 2013) replaces the flat negative half with a small linear slope αx. PReLU (He et al., 2015) makes that slope a learned parameter. ELU (Clevert, Unterthiner, Hochreiter, 2015) replaces the negative half with α · (eˣ - 1), giving a smooth saturating curve below zero. SELU (Klambauer et al., 2017) adds a fixed multiplicative scale chosen so successive layers self-normalise their activation statistics.
All of these are deterministic point-wise functions of the pre-activation. In parallel, regularisation work was producing increasingly sophisticated stochastic perturbations of the forward pass: dropout (Srivastava et al., 2014), zoneout (Krueger et al., 2016), and stochastic depth (Huang et al., 2016). Hendrycks and Gimpel's contribution was to notice that a particular kind of stochastic gating gives rise, in expectation, to a smooth deterministic activation that outperforms ReLU and ELU. The paper introduces what it calls the weighted-input view: an activation function can be read as a way to weight or gate its input by some function of itself. ReLU weights the input by 1[x > 0], ELU weights the negative half by (eˣ - 1) / x, and GELU weights it by the probability that a Gaussian random variable is below the input.
The exact definition is
GELU(x) = x · Φ(x)
where Φ is the standard normal CDF. Equivalently, using the error function erf:
GELU(x) = 0.5 · x · (1 + erf(x / √2))
The function is smooth (infinitely differentiable), non-monotonic, and approximately linear for large positive x. For negative x it stays small but never identically zero, so unlike ReLU it does not produce dead units that always output zero. Asymptotically, GELU coincides with ReLU: GELU(x) → x as x → +∞ and GELU(x) → 0 as x → -∞.
A few quick numerical values give a feel for the shape:
| Input x | Φ(x) | GELU(x) |
|---|---|---|
| -3 | 0.00135 | -0.00405 |
| -2 | 0.0228 | -0.0455 |
| -1 | 0.1587 | -0.1587 |
| -0.75 | 0.2266 | -0.170 |
| -0.5 | 0.3085 | -0.1543 |
| 0 | 0.5 | 0 |
| 0.5 | 0.6915 | 0.3457 |
| 1 | 0.8413 | 0.8413 |
| 2 | 0.9772 | 1.9545 |
| 3 | 0.99865 | 2.996 |
GELU has a small negative dip near x ≈ -0.75, where it bottoms out at roughly -0.170 (the global minimum). For x above about 3 it is within 0.005 of x, and for x below about -3 it is within 0.005 of zero. Figure 1 of the original paper plots GELU alongside ReLU and ELU and is the reference picture most readers carry in their heads.
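The table and the location of the dip can be reproduced with the Python standard library alone. This is an illustrative sketch, not code from the paper: the helper names `phi_cdf`, `gelu`, and `argmin_golden` are made up for this example, and the search bracket [-2, 0] is an assumption based on the table above.

```python
import math

def phi_cdf(x: float) -> float:
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x)."""
    return x * phi_cdf(x)

def argmin_golden(f, lo: float, hi: float, iters: int = 100) -> float:
    """Golden-section search for the minimiser of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    for _ in range(iters):
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return 0.5 * (a + b)

# Reproduce the rows of the table.
for x in (-3, -2, -1, -0.75, -0.5, 0, 0.5, 1, 2, 3):
    print(f"{x:>5}  Phi={phi_cdf(x):.5f}  GELU={gelu(x):+.5f}")

# Locate the dip; GELU is decreasing then increasing on [-2, 0].
x_min = argmin_golden(gelu, -2.0, 0.0)
print(f"minimum near x = {x_min:.4f}, GELU = {gelu(x_min):.5f}")
```

Golden-section search is used here only because it needs no derivative; any one-dimensional minimiser would do.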
The derivative of the exact form is
GELU'(x) = Φ(x) + x · φ(x)
where φ(x) = (1 / √(2π)) · exp(-x² / 2) is the standard normal probability density function. The autograd engines in PyTorch, TensorFlow, and JAX all compute this derivative automatically. The derivative is bounded: the second derivative is φ(x) · (2 - x²), so the slope peaks at x = √2 ≈ 1.41 with a value of about 1.13 and reaches its minimum of about -0.13 at x = -√2. GELU therefore does not amplify gradients in the way some unbounded activations can.
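A quick way to sanity-check the derivative formula is to compare it with a central finite difference and then scan a grid for the extreme slopes. The sketch below uses only the standard library. The function names, the step size `h`, and the grid on [-6, 6] are choices made for this example.

```python
import math

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * phi_cdf(x)

def gelu_grad(x):
    """Closed-form derivative: Phi(x) + x * phi(x)."""
    return phi_cdf(x) + x * phi_pdf(x)

def central_diff(f, x, h=1e-5):
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

# Closed form versus numerical derivative at a few points.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  closed={gelu_grad(x):.6f}  numeric={central_diff(gelu, x):.6f}")

# Scan an evenly spaced grid for the largest and smallest slope.
grid = [i / 1000.0 for i in range(-6000, 6001)]
x_hi = max(grid, key=gelu_grad)
x_lo = min(grid, key=gelu_grad)
print(f"largest slope  {gelu_grad(x_hi):.4f} at x = {x_hi:.3f}")
print(f"smallest slope {gelu_grad(x_lo):.4f} at x = {x_lo:.3f}")
```

The grid scan is crude but adequate here, because the derivative is smooth and flattens out quickly away from the origin.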
The original paper motivates GELU as a deterministic version of an input-dependent stochastic regularizer. Suppose you multiply each input by a Bernoulli mask m, where m = 1 with probability Φ(x) and m = 0 otherwise. The expected value of the output is
E[m · x] = Φ(x) · x = GELU(x)
Under this view, GELU is the expected value of an input that is either kept (probability Φ(x)) or dropped (probability 1 - Φ(x)), where the keep probability rises with the input's magnitude on the positive side. Hendrycks and Gimpel describe it as combining the gating behaviour of ReLU and the stochasticity of dropout and zoneout into a single deterministic nonlinearity. Inputs that look more like noise (small in absolute value, especially negative) get attenuated more aggressively in expectation than inputs that are clearly signal.
This framing explains the slight negative bump: when x is moderately negative, the mask is unlikely to be 1, but if it is, the output is small and negative. The expected value reflects that. The paper also defines a stochastic GELU by drawing the mask m ~ Bernoulli(Φ(x)) at training time and using the expected value at inference. In practice the deterministic version is what everyone uses.
The choice of the Gaussian CDF rather than the logistic sigmoid matches the assumption that pre-activations in well-initialised deep networks are approximately Gaussian. The sigmoid approximation x · σ(1.702 · x) substitutes the logistic CDF, with its argument rescaled so that it stays close to Φ over the whole real line.
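The stochastic reading above can be checked by simulation: draw the Bernoulli mask many times and compare the sample mean of m · x with x · Φ(x). This is a minimal Monte Carlo sketch using the standard library. The function name `stochastic_gate_mean`, the sample count, and the seed are arbitrary choices for this example, not anything prescribed by the paper.

```python
import math
import random

def phi_cdf(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    return x * phi_cdf(x)

def stochastic_gate_mean(x, n_samples, rng):
    """Sample mean of m * x, where m ~ Bernoulli(Phi(x))."""
    keep_prob = phi_cdf(x)
    kept = sum(1 for _ in range(n_samples) if rng.random() < keep_prob)
    return x * kept / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable
for x in (-2.0, -0.75, 0.5, 1.5):
    estimate = stochastic_gate_mean(x, 200_000, rng)
    print(f"x={x:+.2f}  Monte Carlo={estimate:+.4f}  GELU={gelu(x):+.4f}")
```

With this many samples the two columns should agree to roughly two decimal places, and the gap shrinks as the sample count grows.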
The exact form requires evaluating erf, which is more expensive than the elementary operations used by ReLU or the sigmoid function. The 2016 paper proposed two approximations that became standard practice for years.
Tanh approximation:
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
The constant √(2/π) ≈ 0.7978845608 makes the argument of tanh agree with erf(x / √2) to first order near zero, and the cubic coefficient 0.044715 was chosen empirically to match the exact function closely. This is the form used in the original BERT and GPT-2 reference implementations, and a great deal of pretrained-checkpoint code still uses it.
Sigmoid approximation:
GELU(x) ≈ x · σ(1.702 · x)
where σ is the logistic sigmoid. This is faster than the tanh approximation but less accurate. It is rarely used in production but appears in the original paper as a convenient closed form. The constant 1.702 is the classical scaling for which σ(1.702 · x) approximates Φ(x) with the smallest worst-case error, a little under 0.01. It is not set by the slope at the origin: x · σ(αx) has slope 0.5 at x = 0 for every α, the same as x · Φ(x).
The exact form uses erf directly. Most modern hardware has fast erf instructions or libm implementations, so on contemporary GPUs the exact form is competitive with the tanh approximation in wall-clock time. PyTorch exposes the choice through torch.nn.functional.gelu(x, approximate='none') (the default, which calls erf) and approximate='tanh' (the original BERT formulation). TensorFlow exposes the same selection through tf.nn.gelu with an approximate flag that defaults to False. JAX provides jax.nn.gelu with an approximate keyword argument that defaults to True.
The maximum absolute deviation between the tanh approximation and the exact erf form is roughly 5e-4 over the range [-5, 5]. The sigmoid approximation deviates by about 2e-2 in the same range.
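These error figures are easy to measure directly. The sketch below evaluates all three forms in double precision on an evenly spaced grid over [-5, 5] and reports the worst deviation of each approximation. The helper `max_abs_error` and the grid resolution are assumptions made for this example.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    return x / (1.0 + math.exp(-1.702 * x))

def max_abs_error(approx, lo=-5.0, hi=5.0, steps=10_000):
    """Largest |approx(x) - gelu_exact(x)| on an evenly spaced grid, and where it occurs."""
    worst_err, worst_x = 0.0, lo
    for i in range(steps + 1):
        x = lo + (hi - lo) * i / steps
        err = abs(approx(x) - gelu_exact(x))
        if err > worst_err:
            worst_err, worst_x = err, x
    return worst_err, worst_x

for name, fn in (("tanh", gelu_tanh), ("sigmoid", gelu_sigmoid)):
    err, where = max_abs_error(fn)
    print(f"{name:>7}: max |error| = {err:.2e} at x = {where:+.2f}")
```

The printed maxima should be of the same order as the figures quoted above. The worst case sits a couple of units away from the origin rather than at zero, where both approximations are exact.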
Some properties of the exact function:

- Non-monotonic: GELU decreases on roughly (-∞, -0.75) then increases, so it is not one-to-one near the origin.
- Zero: GELU(x) = 0 only at x = 0.
- Asymptotically ReLU: GELU(x) - max(0, x) → 0 as |x| → ∞.
- Bounded below: GELU(x) ≥ -0.170 for all real x.

GELU's small negative bump means that two distinct inputs can produce the same output, which complicates some theoretical analyses but does not seem to hurt training in practice.
| Activation | Formula | Smooth | Non-monotonic | Notes |
|---|---|---|---|---|
| ReLU | max(0, x) | No | No | Cheapest. Standard before transformers. Has dead-unit problem. |
| Leaky ReLU | max(αx, x), α = 0.01 typical | No | No | Avoids dead units. Cheap. |
| PReLU | max(αx, x), α learned | No | No | Adds parameters per channel. |
| ELU | x if x > 0, else α(eˣ - 1) | Yes | No | Smooth saturating negatives. |
| SELU | Scaled ELU with fixed α, λ | Yes | No | Self-normalising. |
| SiLU / Swish | x · σ(x) | Yes | Yes | GELU's sigmoid form with the 1.702 coefficient set to 1. |
| GELU (exact) | x · Φ(x) | Yes | Yes | Gaussian gate. Min around -0.170. |
| GELU (tanh approx) | 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³))) | Yes | Yes | BERT/GPT-2 default. |
| Mish | x · tanh(softplus(x)) | Yes | Yes | Min about -0.31. |
The Swish/SiLU function x · σ(x) is the GELU sigmoid approximation with the 1.702 coefficient set to 1. Hendrycks and Gimpel named it the Sigmoid Linear Unit (SiLU) in the GELU paper, and Ramachandran, Zoph, and Le rediscovered it in 2017 under the name Swish. The two names refer to the same function, and PyTorch exposes it as torch.nn.SiLU. GELU is sometimes described as Swish with a Gaussian gate instead of a sigmoid one. The two curves have very similar shapes, but SiLU has the deeper and wider negative dip: its minimum is about -0.278 near x ≈ -1.28, against about -0.170 near x ≈ -0.75 for GELU.
Hendrycks and Gimpel ran experiments on MNIST classification, MNIST autoencoding, CIFAR-10, CIFAR-100, and the TIMIT phone-recognition task. GELU outperformed ReLU and ELU on the median across runs in every setting they tested, with the largest gains on TIMIT. The differences were modest but consistent.
These benchmarks predate the transformer era. The reason GELU caught on was not the MNIST result but the fact that the GPT and BERT teams adopted it for the FFN layers, and pretty much every transformer paper after that used whatever the BERT codebase used.
GELU was not the first nonlinearity used inside a transformer. The original Attention Is All You Need paper (Vaswani et al., 2017) used ReLU in the position-wise feed-forward sublayer. The shift to GELU happened with the next generation of pretrained models.
| Model | Year | FFN activation | Approximation in reference code |
|---|---|---|---|
| Original Transformer | 2017 | ReLU | n/a |
| GPT (GPT-1) | 2018 | GELU | Tanh approximation |
| BERT | 2018 | GELU | Tanh approximation |
| GPT-2 | 2019 | GELU | Tanh approximation |
| RoBERTa | 2019 | GELU | Exact |
| XLNet | 2019 | GELU | Tanh approximation |
| ALBERT | 2019 | GELU | Tanh approximation |
| T5 (v1.0) | 2019 | ReLU | n/a |
| T5 (v1.1) | 2020 | GeGLU (uses GELU) | Exact |
| ELECTRA | 2020 | GELU | Tanh approximation |
| GPT-3 | 2020 | GELU | Tanh approximation |
| Vision Transformer | 2020 | GELU | Tanh approximation |
| Whisper | 2022 | GELU | Exact |
| PaLM | 2022 | SwiGLU | n/a |
| LLaMA | 2023 | SwiGLU | n/a |
BERT (Devlin et al., October 2018) is the single largest cause of GELU's spread. Its public TensorFlow reference implementation defined gelu using the tanh approximation, and every subsequent paper that built on BERT inherited that choice. The OpenAI GPT-1 paper (Radford et al., June 2018) also used GELU, but BERT's open-source release was the catalyst that pushed it into nearly every transformer codebase between 2018 and 2022.
The activation question got more complicated in February 2020 with Noam Shazeer's paper GLU Variants Improve Transformer (arXiv:2002.05202), which proposed using gated linear units in place of the standard FFN. Two variants became popular:
- GeGLU: GeGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c). T5 v1.1 and the LaMDA family use this.
- SwiGLU: SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c). PaLM and LLaMA use this.

Gated FFNs use three weight matrices instead of two and shrink the inner dimension to 2/3 of its usual size to keep parameter count roughly constant. They consistently produce small but reliable perplexity improvements over plain GELU FFNs. Most large open-weight models released after 2022 use SwiGLU. GELU is still the default in encoder-only and encoder-decoder models, in vision transformers, and in any transformer codebase matching the BERT lineage. Shazeer offers no explanation for the gains: the conclusion of the GLU Variants paper attributes their success, "as all else, to divine benevolence."
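The parameter-count argument and the gating itself can be made concrete with a few lines of plain Python. This is a toy sketch, not production code: it follows the bias-free FFN form (GELU(xW) ⊗ xV) W_out, the helper names are invented for the example, the 768 / 3072 sizes are BERT-base-like values chosen purely for illustration, and the tiny weight matrices are hand-picked.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def matvec(vec, mat):
    """Row vector times matrix: vec has length n, mat has n rows and m columns."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]

def geglu_ffn(x, W, V, W_out):
    """Bias-free GeGLU feed-forward block: (GELU(xW) * (xV)) W_out, * elementwise."""
    gate = [gelu(h) for h in matvec(x, W)]
    value = matvec(x, V)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, W_out)

def ffn_params(d_model, d_ff):
    return 2 * d_model * d_ff   # input and output projections

def gated_ffn_params(d_model, d_ff):
    return 3 * d_model * d_ff   # gate, value, and output projections

d_model, d_ff = 768, 3072        # illustrative sizes only
d_ff_gated = (2 * d_ff) // 3     # 2048
print(ffn_params(d_model, d_ff), gated_ffn_params(d_model, d_ff_gated))

# Tiny forward pass: d_model = 2, inner size = 3.
x = [1.0, -2.0]
W = [[0.5, -1.0, 0.0], [0.25, 0.5, 1.0]]
V = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.5]]
W_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(geglu_ffn(x, W, V, W_out))
```

The two parameter counts printed on the first line are equal, which is the point of the 2/3 scaling: three matrices at two thirds of the width cost the same as two matrices at full width.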
A reference PyTorch implementation of all three forms looks like this:
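```python
import math

import torch
import torch.nn.functional as F

def gelu_exact(x):
    # Exact form: x * Phi(x), written with the error function.
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the original paper.
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + torch.tanh(c * (x + 0.044715 * x.pow(3))))

def gelu_sigmoid(x):
    # Sigmoid approximation: x * sigmoid(1.702 * x).
    return x * torch.sigmoid(1.702 * x)

# Library-provided versions
x = torch.linspace(-5.0, 5.0, steps=101)  # example input
y1 = F.gelu(x)                            # exact, uses erf
y2 = F.gelu(x, approximate='tanh')        # tanh approximation
```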
The library version is preferable in real code because it dispatches to fused kernels on supported hardware.
- PyTorch: torch.nn.GELU(approximate='none' | 'tanh') and torch.nn.functional.gelu. The approximate='tanh' flag was added in PyTorch 1.12 (June 2022); before that, only the exact form was available.
- TensorFlow: tf.keras.activations.gelu(x, approximate=False). The approximate flag has been present since TensorFlow 2.4.
- JAX: jax.nn.gelu(x, approximate=True). The default is the tanh approximation, the opposite of PyTorch's default.
- Hugging Face Transformers: the transformers.activations module exposes gelu (exact erf), gelu_new (BERT tanh approximation, kept for checkpoint compatibility), gelu_fast (cheaper tanh variant), gelu_pytorch_tanh (calls F.gelu(x, approximate='tanh')), quick_gelu (sigmoid approximation, used by CLIP), and gelu_python (a pure-Python reference). The names are historical artefacts.

The quick_gelu variant is used in OpenAI's original CLIP checkpoints. Loading CLIP weights with the wrong activation produces clearly degraded zero-shot accuracy, which is a recurring source of confusion.
Small numerical differences between the exact and tanh approximations can affect reproducibility. The most well-known case is the divergence between Hugging Face's PyTorch and TensorFlow ports of BERT and GPT-2 from roughly 2018 to 2020. The PyTorch port used a custom tanh-form implementation while the TensorFlow side called the framework's own gelu, which at one point was the exact erf form. Numerically equivalent inputs produced outputs that differed in the fifth decimal, and downstream metrics like GLUE accuracy could drift by a fraction of a point. The Hugging Face team eventually unified the implementations and added the gelu_pytorch_tanh alias once PyTorch 1.12 added the matching flag.
Practical rules:
- For new models, the exact erf form is the safer default. It is framework-agnostic and avoids the historical baggage of competing approximations.
- When comparing outputs across implementations, an absolute tolerance of 1e-6 will fail if one side uses the exact form and the other uses the tanh approximation; 1e-3 is more realistic.

A related issue affects half-precision (FP16/BF16) training. The tanh approximation involves a cubic term 0.044715 · x³, which can overflow in FP16 for large x even though the final output is well-behaved. Most framework implementations compute the cubic in higher precision to avoid this. The exact erf form is more numerically stable in low precision.
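Both the tolerance rule and the FP16 overflow point can be checked with a little arithmetic. The sketch below runs in ordinary double precision and only reasons about the FP16 range. It does not emulate half-precision rounding. `FP16_MAX` is the standard largest finite half-precision value, while `all_close` and the test grid are helpers invented for this example.

```python
import math

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def all_close(f, g, xs, atol):
    """True if |f(x) - g(x)| <= atol for every x in xs."""
    return all(abs(f(x) - g(x)) <= atol for x in xs)

# Exact versus tanh form on a grid over [-5, 5].
xs = [i / 100.0 for i in range(-500, 501)]
print("atol=1e-6:", all_close(gelu_exact, gelu_tanh, xs, 1e-6))
print("atol=1e-3:", all_close(gelu_exact, gelu_tanh, xs, 1e-3))

# Magnitude beyond which a naively computed x**3 leaves the FP16 range.
overflow_threshold = FP16_MAX ** (1.0 / 3.0)
print(f"x**3 exceeds the FP16 range once |x| > {overflow_threshold:.1f}")
```

The threshold comes out a little above 40, which is far outside the typical range of pre-activations but can be reached by outlier activations, so computing the cubic in higher precision is a cheap safeguard.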
On modern GPUs, the cost of erf is small next to the surrounding matmul and bias operations, so the activation choice is not the bottleneck for most transformer training. Profiling tools usually show GELU taking under 1% of total step time. The tanh approximation can be marginally faster on hardware with weak transcendental support, which is one reason older code paths still use it.
The more practical reason to keep using the tanh form is checkpoint compatibility. If you are running a pretrained model whose weights were tuned against the tanh approximation, switching to the exact form changes the function values slightly, and the model's downstream metrics will drift. The drift is small but reproducible. Derivatives are inexpensive in either form: the exact derivative is Φ(x) + x · φ(x), and every autograd framework computes the tanh derivative correctly.