GELU (Gaussian Error Linear Unit)

Updated Jun 22, 2026

The Gaussian Error Linear Unit (GELU) is a smooth, non-monotonic activation function defined as GELU(x) = x · Φ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution. It was introduced by Dan Hendrycks and Kevin Gimpel in the 2016 paper Gaussian Error Linear Units (GELUs), posted to arXiv on 27 June 2016 as 1606.08415, and it is the default nonlinearity in the feed-forward layers of BERT, the GPT series, and most transformer models built between 2018 and 2022.^[1] The paper's defining sentence is that the GELU "weights inputs by their value, rather than gates inputs by their sign as in ReLUs."^[1]

Hendrycks was then at the University of Chicago (later UC Berkeley), and Gimpel was at the Toyota Technological Institute at Chicago. The earliest preprint was titled Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units, and the authors kept revising the manuscript through June 2020 (final revision v5).^[1]

BERT, GPT-2, GPT-3, RoBERTa, ALBERT, ELECTRA, the original GPT (often called GPT-1), and the original Vision Transformer all use GELU between the two linear layers of each feed-forward block. OpenAI's Whisper speech-recognition model uses it as well.^[11] Newer architectures like PaLM and LLaMA have moved to gated variants such as SwiGLU, but GELU remains the activation that most production transformer codebases default to.

Quick facts


Introduced	June 2016
Paper	Gaussian Error Linear Units (GELUs)
arXiv ID	1606.08415 (versions v1 through v5)
Authors	Dan Hendrycks, Kevin Gimpel
Affiliations	University of Chicago / UC Berkeley; Toyota Technological Institute at Chicago
Exact formula	`GELU(x) = x · Φ(x) = (x / 2) · (1 + erf(x / √2))`
Tanh approximation	`0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))`
Sigmoid approximation	`x · σ(1.702 · x)`
Smooth	Yes (infinitely differentiable)
Monotonic	No (small dip near `x ≈ -0.75`)
Used in	BERT, GPT-1, GPT-2, GPT-3, RoBERTa, ALBERT, ELECTRA, T5 v1.1 (via GeGLU), Vision Transformer, Whisper
Frameworks	PyTorch, TensorFlow, JAX, Hugging Face Transformers

What problem does GELU solve?

Before GELU, the dominant nonlinearity in deep networks was the rectified linear unit (ReLU), introduced in its modern form by Nair and Hinton (2010) and popularised by AlexNet (Krizhevsky, Sutskever, Hinton, 2012). ReLU is max(0, x). It is cheap, sparsity-inducing, and largely solves the vanishing-gradient problem that plagued sigmoidal networks. It also has a well-known failure mode usually called the dead ReLU problem: the gradient is exactly zero for any negative pre-activation, so a unit whose pre-activation is negative on every training input receives no updates, and standard gradient descent cannot recover it.

A family of patches followed. Leaky ReLU (Maas, Hannun, Ng, 2013) replaces the flat negative half with a small linear slope αx.^[16] PReLU (He et al., 2015) makes that slope a learned parameter.^[17] ELU (Clevert, Unterthiner, Hochreiter, 2015) replaces the negative half with α · (eˣ - 1), giving a smooth saturating curve below zero.^[14] SELU (Klambauer et al., 2017) adds a fixed multiplicative scale chosen so successive layers self-normalise their activation statistics.^[15]

All of these are deterministic point-wise functions of the pre-activation. In parallel, regularisation work was producing increasingly sophisticated stochastic perturbations of the forward pass: dropout (Srivastava et al., 2014), zoneout (Krueger et al., 2016), and stochastic depth (Huang et al., 2016). Hendrycks and Gimpel's contribution was to notice that a particular kind of stochastic gating gives rise, in expectation, to a smooth deterministic activation that outperforms ReLU and ELU.^[1] The paper introduces what it calls the weighted-input view: an activation function can be read as a way to weight or gate its input by some function of itself. ReLU weights the input by 1[x > 0], ELU weights the negative half by (eˣ - 1) / x, and GELU weights it by the probability that a Gaussian random variable is below the input.^[1]
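The weighted-input view can be made concrete with a short sketch. The helper names below (`relu_weight`, `elu_weight`, `gelu_weight`) are illustrative and do not come from the paper; each returns the factor that multiplies x, and the ELU weight assumes α = 1 by default.

```python
import math

def std_normal_cdf(x):
    """Phi(x), the standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relu_weight(x):
    # ReLU keeps the input when it is positive and zeroes it otherwise.
    return 1.0 if x > 0 else 0.0

def elu_weight(x, alpha=1.0):
    # ELU(x) = x for x > 0 and alpha * (e^x - 1) otherwise;
    # dividing by x gives the multiplicative weight on the negative half.
    if x > 0:
        return 1.0
    if x == 0:
        return alpha  # limit of alpha * (e^x - 1) / x as x -> 0 from below
    return alpha * math.expm1(x) / x

def gelu_weight(x):
    # GELU weights the input by P(Z <= x) for a standard normal Z.
    return std_normal_cdf(x)

# Each activation is recovered as x * weight(x).
for x in (-2.0, -0.5, 0.5, 2.0):
    print(x, x * relu_weight(x), x * elu_weight(x), x * gelu_weight(x))
```

Only the GELU weight varies smoothly across zero; the ReLU weight jumps from 0 to 1 there.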

How is GELU defined?

The exact definition is

GELU(x) = x · Φ(x)

where Φ is the standard normal CDF. Equivalently, using the error function erf:

GELU(x) = 0.5 · x · (1 + erf(x / √2))

The function is smooth (infinitely differentiable), non-monotonic, and approximately linear for large positive x. For negative x it stays small but never identically zero, so unlike ReLU it does not produce dead units that always output zero. Asymptotically, GELU coincides with ReLU: GELU(x) → x as x → +∞ and GELU(x) → 0 as x → -∞.

A few quick numerical values give a feel for the shape:

Input `x`	`Φ(x)`	`GELU(x)`
-3	0.00135	-0.00405
-2	0.0228	-0.0455
-1	0.1587	-0.1587
-0.75	0.2266	-0.170
-0.5	0.3085	-0.1543
0	0.5	0
0.5	0.6915	0.3457
1	0.8413	0.8413
2	0.9772	1.9545
3	0.99865	2.996

GELU has a small negative dip near x ≈ -0.75, where it bottoms out at roughly -0.170 (the global minimum). Above x ≈ 3 it differs from x by less than 0.005, and below x ≈ -3 it is within 0.005 of zero. Figure 1 of the original paper plots GELU alongside ReLU and ELU and is the reference picture most readers carry in their heads.^[1]
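The values in the table can be reproduced with the standard library alone. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and the minimum is located with a plain grid search rather than a root finder.

```python
import math

def std_normal_cdf(x):
    """Phi(x) via the identity Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return x * std_normal_cdf(x)

# Print the same inputs that appear in the table.
for x in (-3, -2, -1, -0.75, -0.5, 0, 0.5, 1, 2, 3):
    print(f"{x:>5}  Phi={std_normal_cdf(x):.5f}  GELU={gelu(x):.4f}")

# Coarse grid search for the minimum on [-3, 0] with step 1e-4.
grid = [-3.0 + i * 1e-4 for i in range(30001)]
x_min = min(grid, key=gelu)
print(f"minimum near x = {x_min:.3f}, GELU = {gelu(x_min):.4f}")
```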

The derivative of the exact form is

GELU'(x) = Φ(x) + x · φ(x)

where φ(x) = (1 / √(2π)) · exp(-x² / 2) is the standard normal probability density function. The autograd engines in PyTorch, TensorFlow, and JAX all compute this derivative automatically. The derivative is bounded: because GELU''(x) = φ(x) · (2 - x²), it peaks at x = √2 ≈ 1.41 with a value of about 1.13 and reaches its minimum of about -0.13 at x = -√2, so GELU never scales a gradient by much more than 1.
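As a sanity check, the analytic derivative can be compared against a central finite difference, and its peak located on a grid. This is a sketch under simple assumptions: the step size `h` and the grid spacing are arbitrary choices, and the helper names are illustrative.

```python
import math

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gelu(x):
    return x * std_normal_cdf(x)

def gelu_grad(x):
    """Analytic derivative: Phi(x) + x * phi(x)."""
    return std_normal_cdf(x) + x * std_normal_pdf(x)

def central_difference(f, x, h=1e-5):
    """Second-order finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (-2.0, -0.75, 0.0, 1.0, 3.0):
    print(x, gelu_grad(x), central_difference(gelu, x))

# Locate the largest slope on [-6, 6] with a step of 1e-3.
grid = [-6.0 + i * 1e-3 for i in range(12001)]
x_peak = max(grid, key=gelu_grad)
print(f"largest slope {gelu_grad(x_peak):.4f} near x = {x_peak:.3f}")
```

Under this grid search the largest slope appears next to √2, which is where the second derivative φ(x) · (2 - x²) changes sign.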

Why a Gaussian gate? The stochastic regularizer interpretation

The original paper motivates GELU as a deterministic version of an input-dependent stochastic regularizer.^[1] Suppose you multiply each input by a Bernoulli mask m, where m = 1 with probability Φ(x) and m = 0 otherwise. The expected value of the output is

E[m · x] = Φ(x) · x = GELU(x)

Under this view, GELU is the expected value of an input that is either kept (probability Φ(x)) or dropped (probability 1 - Φ(x)), where the keep probability rises monotonically with x. Hendrycks and Gimpel describe it as combining the gating behaviour of ReLU and the stochasticity of dropout and zoneout into a single deterministic nonlinearity.^[1] Inputs that look more like noise (small in absolute value, especially negative) get attenuated more aggressively in expectation than inputs that are clearly signal.

This framing explains the slight negative bump: when x is moderately negative, the mask is unlikely to be 1, but if it is, the output is small and negative. The expected value reflects that. The paper also defines a stochastic GELU by drawing the mask m ~ Bernoulli(Φ(x)) at training time and using the expected value at inference.^[1] In practice the deterministic version is what everyone uses.
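The expectation argument is easy to check by simulation. The sketch below draws the Bernoulli mask many times for a fixed input and averages the result; the sample count and seed are arbitrary, and the function names are illustrative rather than taken from the paper.

```python
import math
import random

def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stochastic_gate(x, rng):
    """Keep x with probability Phi(x); otherwise output zero."""
    return x if rng.random() < std_normal_cdf(x) else 0.0

def monte_carlo_gelu(x, n_samples=100_000, seed=0):
    """Average of the stochastic gate over many independent mask draws."""
    rng = random.Random(seed)
    total = sum(stochastic_gate(x, rng) for _ in range(n_samples))
    return total / n_samples

# The sample mean should approach x * Phi(x) as n_samples grows.
for x in (-1.0, 0.5, 2.0):
    print(x, monte_carlo_gelu(x), x * std_normal_cdf(x))
```

For x = -1 the mask is on only about 16% of the time, and every kept sample contributes -1, which is where the negative expected value comes from.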

The choice of the Gaussian CDF rather than the logistic sigmoid matches the assumption that pre-activations in well-initialised deep networks are approximately Gaussian. The sigmoid approximation x · σ(1.702 · x) substitutes the logistic CDF, with its input rescaled by 1.702 so that it stays within about 0.01 of Φ for every x.

What are the GELU approximations?

The exact form requires evaluating erf, which is more expensive than the elementary operations used by ReLU or the sigmoid function. The 2016 paper proposed two approximations that became standard practice for years.^[1]

Tanh approximation:

GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))

The constant √(2/π) ≈ 0.7978845608 makes the slope of the tanh argument at the origin equal to the slope of erf(x / √2) there, so the two gates agree to first order; the cubic coefficient 0.044715 was chosen empirically to tighten the fit away from the origin.^[1] This is the form used in the original BERT and GPT-2 reference implementations, and a great deal of pretrained-checkpoint code still uses it.^[2]^[4]

Sigmoid approximation:

GELU(x) ≈ x · σ(1.702 · x)

where σ is the logistic sigmoid. This is faster than the tanh approximation but less accurate. It is less common in production (CLIP's quick_gelu is the best-known use) but appears in the original paper as a convenient closed form.^[1] The constant 1.702 is the scale at which the logistic curve σ(αx) tracks Φ(x) most closely, keeping the two within about 0.01 of each other for every x. Matching slopes at the origin cannot determine it, because x · σ(αx) has slope 0.5 at x = 0 for any α. It is also the factor that turns plain SiLU (x · σ(x)) into a GELU approximation.^[1]^[13]

Exact form uses erf directly. Most modern hardware has fast erf instructions or libm implementations, so on contemporary GPUs the exact form is competitive with the tanh approximation in wall-clock time. PyTorch exposes the choice through torch.nn.functional.gelu(x, approximate='none') (the default, which calls erf) and approximate='tanh' (the original BERT formulation).^[20] TensorFlow exposes the same selection through tf.nn.gelu with an approximate=True/False flag.^[21] JAX provides jax.nn.gelu with an approximate keyword argument that defaults to True.^[22]

The maximum absolute deviation between the tanh approximation and the exact erf form is on the order of 1e-4 over the range [-5, 5]. The sigmoid approximation deviates by closer to 1e-2 in the same range.
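The deviation figures above can be checked with a short script. This is a sketch that evaluates all three forms in double precision on a uniform grid over [-5, 5]; the grid spacing is an arbitrary choice and the function names are illustrative.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    # x * sigma(1.702 * x), with the logistic sigmoid written out explicitly.
    return x / (1.0 + math.exp(-1.702 * x))

# Uniform grid on [-5, 5] with step 1e-3.
grid = [-5.0 + i * 1e-3 for i in range(10001)]

tanh_err = max(abs(gelu_tanh(x) - gelu_exact(x)) for x in grid)
sigmoid_err = max(abs(gelu_sigmoid(x) - gelu_exact(x)) for x in grid)

print(f"max |tanh - exact|    = {tanh_err:.2e}")
print(f"max |sigmoid - exact| = {sigmoid_err:.2e}")
```

In single or half precision the measured gaps will differ somewhat, since rounding error is then comparable to the tanh approximation error itself.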

Properties

Smooth. Continuous derivatives of all orders, unlike ReLU (kink at zero) or Leaky ReLU.
Non-monotonic. Decreases on (-∞, -0.75) then increases, so it is not one-to-one on the negative half-line.
Differentiable everywhere. No special handling for x = 0.
Asymptotically ReLU. GELU(x) - max(0, x) → 0 as |x| → ∞.
Bounded below. GELU(x) ≥ -0.170 for all real x.
Approximately zero-centred outputs. Easier to layer-normalise than ReLU.
Self-gated. The function gates each input by some monotone function of itself, putting it in the same family as Swish/SiLU and Mish.

GELU's small negative bump means that two distinct inputs can produce the same output, which complicates some theoretical analyses but does not seem to hurt training in practice.
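To see the loss of injectivity concretely, the sketch below takes an input to the right of the minimum and uses bisection to find a second input to the left of it with the same output. The constant `X_MIN` is an assumed approximate location of the minimum, and the function names are illustrative.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

X_MIN = -0.7518  # assumed approximate location of the global minimum

def left_branch_preimage(target, lo=-8.0, hi=X_MIN, iters=100):
    """Find x in [lo, hi] with gelu(x) == target by bisection.

    GELU is decreasing on this interval, so gelu(lo) > target > gelu(hi)
    whenever target lies strictly between the minimum value and zero.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gelu(mid) > target:
            lo = mid  # output still above target: move toward the minimum
        else:
            hi = mid
    return 0.5 * (lo + hi)

x_right = -0.5                                 # lies to the right of the minimum
x_left = left_branch_preimage(gelu(x_right))   # lies to the left of it

print(x_left, gelu(x_left))
print(x_right, gelu(x_right))
```

Any output strictly between the minimum value and zero has two such preimages; outputs of zero or above have exactly one.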

How does GELU differ from ReLU and other activations?

Activation	Formula	Smooth	Non-monotonic	Notes
ReLU	`max(0, x)`	No	No	Cheapest. Standard before transformers. Has dead-unit problem.
Leaky ReLU	`max(αx, x)`, `α = 0.01` typical	No	No	Avoids dead units. Cheap.
PReLU	`max(αx, x)`, `α` learned	No	No	Adds parameters per channel.
ELU	`x` if `x > 0`, else `α(eˣ - 1)`	Yes	No	Smooth saturating negatives.
SELU	Scaled ELU with fixed `α`, `λ`	Yes	No	Self-normalising.
SiLU / Swish	`x · σ(x)`	Yes	Yes	GELU's sigmoid form with the `1.702` coefficient set to 1.
GELU (exact)	`x · Φ(x)`	Yes	Yes	Gaussian gate. Min around -0.170.
GELU (tanh approx)	`0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))`	Yes	Yes	BERT/GPT-2 default.
Mish	`x · tanh(softplus(x))`	Yes	Yes	Min about -0.31.

The key difference from ReLU is captured by the original abstract: GELU "weights inputs by their value, rather than gates inputs by their sign."^[1] ReLU makes a hard binary decision based on the sign of the pre-activation; GELU applies a smooth, probabilistic weighting that keeps a small nonzero signal for negative inputs and so avoids dead units.

The function x · σ(x) appears in Hendrycks and Gimpel's 2016 paper, which gave it the name SiLU (sigmoid linear unit), and it was rediscovered by Ramachandran, Zoph, and Le in 2017 under the name Swish.^[1]^[13] The two names refer to the same function, and the paper's sigmoid approximation x · σ(1.702 · x) is the same curve with a rescaled input. GELU is sometimes described as Swish with a Gaussian gate instead of a sigmoid one. In practice GELU and SiLU have very similar shapes, but SiLU has the deeper and wider negative dip: its minimum is about -0.278 near x ≈ -1.28, against about -0.170 near x ≈ -0.75 for GELU.
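The depth of each negative dip can be measured directly. The sketch below runs a grid search over [-4, 0] for GELU, SiLU, and Mish; the grid spacing is arbitrary, the names are illustrative, and the reported minima are only as precise as the grid.

```python
import math

def gelu(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    # x * sigma(x), also known as Swish
    return x / (1.0 + math.exp(-x))

def mish(x):
    # x * tanh(softplus(x)), with softplus(x) = log(1 + e^x)
    return x * math.tanh(math.log1p(math.exp(x)))

# Grid on [-4, 0] with step 1e-3; all three minima lie inside this range.
grid = [-4.0 + i * 1e-3 for i in range(4001)]

minima = {}
for name, fn in (("GELU", gelu), ("SiLU", silu), ("Mish", mish)):
    x_min = min(grid, key=fn)
    minima[name] = (x_min, fn(x_min))
    print(f"{name}: minimum {fn(x_min):.4f} at x = {x_min:.3f}")
```

Under this search GELU has the shallowest dip of the three, with SiLU and Mish reaching lower minima further from the origin.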

What did the original experiments show?

Hendrycks and Gimpel ran experiments on MNIST classification, MNIST autoencoding, CIFAR-10, CIFAR-100, and the TIMIT phone-recognition task, reporting "performance improvements across all considered computer vision, natural language processing, and speech tasks."^[1] GELU outperformed ReLU and ELU on the median across runs in every setting they tested, with the largest gains on TIMIT. The differences were modest but consistent.^[1]

MNIST classification. Fully connected networks of depth 8 with widths 128 to 1024. GELU reached lower median test error than ReLU and ELU across batch sizes.
MNIST autoencoding. Eight-layer autoencoder. GELU achieved lower mean reconstruction loss than ReLU or ELU.
CIFAR-10 and CIFAR-100. Wide residual network (Wide ResNet 28-10) trained for 200 epochs. GELU narrowly ahead on both datasets.
TIMIT phone recognition. Five-hidden-layer fully connected network. GELU beat ReLU and ELU by close to a percentage point in median phone error rate.

These benchmarks predate the transformer era. GELU caught on not because of the MNIST result but because the GPT and BERT teams adopted it for the FFN layers, and most later transformer papers kept whatever the BERT codebase used.

Which transformer models use GELU?

GELU was not the first nonlinearity used inside a transformer. The original Attention Is All You Need paper (Vaswani et al., 2017) used ReLU in the position-wise feed-forward sublayer.^[19] The shift to GELU happened with the next generation of pretrained models.

Model	Year	FFN activation	Approximation in reference code
Original Transformer	2017	ReLU	n/a
GPT (GPT-1)	2018	GELU	Tanh approximation
BERT	2018	GELU	Tanh approximation
GPT-2	2019	GELU	Tanh approximation
RoBERTa	2019	GELU	Tanh approximation
XLNet	2019	GELU	Tanh approximation
ALBERT	2019	GELU	Tanh approximation
T5 (v1.0)	2019	ReLU	n/a
T5 (v1.1)	2020	GeGLU (uses GELU)	Exact
ELECTRA	2020	GELU	Tanh approximation
GPT-3	2020	GELU	Tanh approximation
Vision Transformer	2020	GELU	Tanh approximation
Whisper	2022	GELU	Tanh approximation
PaLM	2022	SwiGLU	n/a
LLaMA	2023	SwiGLU	n/a

BERT (Devlin et al., October 2018) is the single largest cause of GELU's spread.^[2] Its public TensorFlow reference implementation defined gelu using the tanh approximation, and every subsequent paper that built on BERT inherited that choice.^[2] The OpenAI GPT-1 paper (Radford et al., June 2018) also used GELU, but BERT's open-source release was the catalyst that pushed it into nearly every transformer codebase between 2018 and 2022.^[3]

How does GELU compare to SwiGLU?

The activation question got more complicated in February 2020 with Noam Shazeer's paper GLU Variants Improve Transformer (arXiv:2002.05202), which proposed using gated linear units in place of the standard FFN.^[12] Two variants became popular:

GeGLU: GeGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c). T5 v1.1 and the LaMDA family use this.^[12]
SwiGLU: same structure but with SiLU/Swish in place of GELU. PaLM and LLaMA adopted this, and the Mistral and Mixtral lines kept it.^[12]

Gated FFNs use three weight matrices instead of two and shrink the inner dimension to two-thirds of its usual size to keep parameter count roughly constant.^[12] They consistently produce small but reliable perplexity improvements over plain GELU FFNs. Most large open-weight models released after 2022 use SwiGLU. GELU is still the default in encoder-only and encoder-decoder models, in vision transformers, and in any transformer codebase matching the BERT lineage. Shazeer offers no theoretical account of the gains, concluding: "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."^[12]
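The parameter-matching arithmetic is worth spelling out. The sketch below counts weights only and ignores biases; the dimensions 768 and 3072 are illustrative values chosen for the example, and the function names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Weights in a standard FFN: one up-projection and one down-projection."""
    return 2 * d_model * d_ff

def gated_ffn_params(d_model, d_ff_gated):
    """Weights in a gated FFN: two up-projections (W and V) and one down-projection."""
    return 3 * d_model * d_ff_gated

d_model, d_ff = 768, 3072      # illustrative sizes
d_ff_gated = (2 * d_ff) // 3   # two-thirds of the standard inner dimension

print("standard FFN weights:", ffn_params(d_model, d_ff))
print("gated FFN weights:   ", gated_ffn_params(d_model, d_ff_gated))
```

With three matrices of inner width 2/3 · d_ff, the total is 3 · d_model · (2/3 · d_ff) = 2 · d_model · d_ff, the same as the two-matrix layout. When d_ff is not divisible by 3 the match is only approximate.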

Implementation

A reference PyTorch implementation of all three forms looks like this:

import math
import torch
import torch.nn.functional as F

def gelu_exact(x):
    return 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + torch.tanh(c * (x + 0.044715 * x.pow(3))))

def gelu_sigmoid(x):
    return x * torch.sigmoid(1.702 * x)

# Library-provided versions, applied to a small example tensor
x = torch.linspace(-3.0, 3.0, steps=7)
y1 = F.gelu(x)                        # exact, uses erf
y2 = F.gelu(x, approximate='tanh')    # tanh approximation (PyTorch 1.12+)

The library version is preferable in real code because it dispatches to fused kernels on supported hardware.

Framework APIs

PyTorch. torch.nn.GELU(approximate='none' | 'tanh') and torch.nn.functional.gelu. The approximate='tanh' flag was added in PyTorch 1.12 (June 2022); before that, only the exact form was available.^[20]
TensorFlow. tf.keras.activations.gelu(x, approximate=False). The approximate flag has been present since TensorFlow 2.4.^[21]
JAX. jax.nn.gelu(x, approximate=True). The default is the tanh approximation, the opposite of PyTorch's default.^[22]
Hugging Face Transformers. The transformers.activations module exposes gelu (exact erf), gelu_new (BERT tanh approximation, kept for checkpoint compatibility), gelu_fast (cheaper tanh variant), gelu_pytorch_tanh (calls F.gelu(x, approximate='tanh')), quick_gelu (sigmoid approximation, used by CLIP), and gelu_python (a pure-Python reference). The names are historical artefacts.^[23]

The quick_gelu variant is used in OpenAI's CLIP text encoder.^[23] Loading CLIP weights with the wrong activation produces clearly degraded zero-shot accuracy, which is a recurring source of confusion.

Numerical reproducibility pitfalls

Small numerical differences between the exact and tanh approximations can affect reproducibility. The most well-known case is the divergence between Hugging Face's PyTorch and TensorFlow ports of BERT and GPT-2 from roughly 2018 to 2020. The PyTorch port used a custom tanh-form implementation while the TensorFlow side called the framework's own gelu, which at one point was the exact erf form. Numerically equivalent inputs produced outputs that differed in the fifth decimal, and downstream metrics like GLUE accuracy could drift by a fraction of a point. The Hugging Face team eventually unified the implementations and added the gelu_pytorch_tanh alias once PyTorch 1.12 added the matching flag.^[23]

Practical rules:

Match the activation used at training time. If a checkpoint was trained with the tanh approximation, evaluate it with the tanh approximation.
For new training runs, the exact erf form is the safer default. It is framework-agnostic and avoids the historical baggage of competing approximations.
Test for numerical equivalence carefully. Comparing forward passes between frameworks at a tolerance of 1e-6 will fail if one side uses the exact form and the other uses the tanh approximation; 1e-3 is more realistic.
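The tolerance rule can be illustrated without any framework. The sketch below compares the exact and tanh forms in double precision at two absolute tolerances; `all_close` is a hypothetical helper written for this example, not a library function.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

def all_close(xs, f, g, atol):
    """True if f and g agree to within atol at every point in xs."""
    return all(abs(f(x) - g(x)) <= atol for x in xs)

# Inputs on [-5, 5] with step 0.01.
inputs = [-5.0 + i * 0.01 for i in range(1001)]

strict_ok = all_close(inputs, gelu_exact, gelu_tanh, atol=1e-6)
loose_ok = all_close(inputs, gelu_exact, gelu_tanh, atol=1e-3)

print("agree at 1e-6:", strict_ok)
print("agree at 1e-3:", loose_ok)
```

This checks a single activation call. In a full model the mismatch passes through many layers, so an end-to-end comparison may need a looser tolerance than the per-call gap suggests.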

A related issue affects half-precision training. The tanh approximation involves a cubic term 0.044715 · x³, and in FP16 the intermediate x³ exceeds the largest finite value (65504) once |x| is above roughly 40, even though the final output is well-behaved. BF16 shares the exponent range of FP32, so it does not overflow here, although it carries fewer mantissa bits. Most framework implementations compute the cubic in higher precision to avoid this. The exact erf form is more numerically stable in low precision.
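The overflow threshold can be demonstrated with the standard library, which can pack a float into IEEE 754 half precision through the `struct` format code `e`. The sketch below only tests whether a double-precision value fits into FP16; it does not emulate FP16 arithmetic step by step, and `fits_in_fp16` is a hypothetical helper.

```python
import math
import struct

def fits_in_fp16(value):
    """True if a Python float can be rounded to a finite float16."""
    try:
        struct.pack("<e", value)  # raises OverflowError when out of range
        return True
    except OverflowError:
        return False

def gelu_tanh(x):
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

for x in (10.0, 40.0, 41.0, 100.0):
    cubic = x ** 3
    print(
        f"x={x:6.1f}  x^3={cubic:10.1f}  "
        f"x^3 fits FP16: {fits_in_fp16(cubic)}  "
        f"output fits FP16: {fits_in_fp16(gelu_tanh(x))}"
    )
```

The final output stays representable in every row because tanh saturates at 1; only the intermediate cube leaves the FP16 range.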

Computational considerations

On modern GPUs, the cost of erf is small compared with the surrounding matmul-plus-bias operations, so the activation choice is not the bottleneck for most transformer training. Profiling tools usually show GELU taking under 1% of total step time. The tanh approximation can be marginally faster on hardware with weak transcendental support, which is one reason older code paths still use it.

The more practical reason to keep using the tanh form is checkpoint compatibility. If you are running a pretrained model whose weights were tuned against the tanh approximation, switching to the exact form changes the function values slightly, and the model's downstream metrics will drift. The drift is small but reproducible. Derivatives are inexpensive in either form: the exact derivative is Φ(x) + x · φ(x), and every autograd framework computes the tanh derivative correctly.

References

1. Hendrycks, Dan; Gimpel, Kevin. *Gaussian Error Linear Units (GELUs)*. arXiv:1606.08415, 27 June 2016 (final revision v5, June 2020). https://arxiv.org/abs/1606.08415
2. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina. *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*. arXiv:1810.04805, October 2018.
3. Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya. *Improving Language Understanding by Generative Pre-Training* (GPT-1), OpenAI, June 2018.
4. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya. *Language Models are Unsupervised Multitask Learners* (GPT-2), OpenAI, 2019.
5. Brown, Tom et al. *Language Models are Few-Shot Learners* (GPT-3). arXiv:2005.14165, May 2020.
6. Liu, Yinhan et al. *RoBERTa: A Robustly Optimized BERT Pretraining Approach*. arXiv:1907.11692, July 2019.
7. Lan, Zhenzhong et al. *ALBERT: A Lite BERT for Self-supervised Learning of Language Representations*. arXiv:1909.11942, September 2019.
8. Clark, Kevin; Luong, Minh-Thang; Le, Quoc V.; Manning, Christopher D. *ELECTRA*. arXiv:2003.10555, March 2020.
9. Raffel, Colin et al. *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer* (T5). arXiv:1910.10683, October 2019.
10. Dosovitskiy, Alexey et al. *An Image is Worth 16x16 Words* (Vision Transformer). arXiv:2010.11929, October 2020.
11. Radford, Alec et al. *Robust Speech Recognition via Large-Scale Weak Supervision* (Whisper). arXiv:2212.04356, December 2022.
12. Shazeer, Noam. *GLU Variants Improve Transformer*. arXiv:2002.05202, February 2020. https://arxiv.org/abs/2002.05202
13. Ramachandran, Prajit; Zoph, Barret; Le, Quoc V. *Searching for Activation Functions* (Swish). arXiv:1710.05941, October 2017.
14. Clevert, Djork-Arne; Unterthiner, Thomas; Hochreiter, Sepp. *Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)*. arXiv:1511.07289, November 2015.
15. Klambauer, Gunter et al. *Self-Normalizing Neural Networks* (SELU). arXiv:1706.02515, June 2017.
16. Maas, Andrew L.; Hannun, Awni Y.; Ng, Andrew Y. *Rectifier Nonlinearities Improve Neural Network Acoustic Models* (Leaky ReLU). ICML 2013.
17. He, Kaiming et al. *Delving Deep into Rectifiers* (PReLU). arXiv:1502.01852, February 2015.
18. Misra, Diganta. *Mish: A Self Regularized Non-Monotonic Activation Function*. arXiv:1908.08681, August 2019.
19. Vaswani, Ashish et al. *Attention Is All You Need*. arXiv:1706.03762, June 2017.
20. PyTorch documentation, `torch.nn.functional.gelu`. https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html
21. TensorFlow documentation, `tf.keras.activations.gelu`. https://www.tensorflow.org/api_docs/python/tf/keras/activations/gelu
22. JAX documentation, `jax.nn.gelu`. https://jax.readthedocs.io/en/latest/_autosummary/jax.nn.gelu.html
23. Hugging Face Transformers, `transformers.activations` source. https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py
