SwiGLU

SwiGLU, short for Swish-Gated Linear Unit, is an activation function used inside the feed-forward sublayer of transformer models. It is a member of the GLU family, replacing the sigmoid gate of the original Gated Linear Unit with Swish (also known as SiLU). SwiGLU was introduced by Noam Shazeer in the February 2020 paper GLU Variants Improve Transformer (arXiv:2002.05202), where it and the closely related GEGLU produced the lowest held-out log-perplexity among the variants tested on a T5 base model (1.636 and 1.633 respectively). After PaLM adopted it in 2022 and LLaMA followed in early 2023, SwiGLU quickly became the default feed-forward activation in nearly every new large language model, including Mistral 7B, the Falcon 3 family, DeepSeek, Qwen, and OLMo.

The paper's recommendation was deliberately ambivalent. Shazeer found that several gated variants (GEGLU, SwiGLU, ReGLU, and even the activation-free Bilinear) all beat the standard ReLU and GELU baselines on perplexity, and the differences between them were within experimental noise. The community eventually settled on SwiGLU through a combination of PaLM's high-profile adoption, LLaMA's open-source release, and the resulting tooling momentum. Once Hugging Face Transformers, llama.cpp, vLLM, and the major training stacks were tuned for the SwiGLU shape, switching back to a non-gated FFN stopped being free. As of 2026 nearly every leading open-weight foundation model outside Google's Gemma line uses SwiGLU as its FFN nonlinearity.

quick facts


Introduced	February 2020
Paper	GLU Variants Improve Transformer
arXiv ID	2002.05202
Author	Noam Shazeer (Google)
Type	Gated activation, GLU family
Role in transformer	Replaces the single nonlinearity inside the position-wise FFN
Component activation	Swish (SiLU), usually with beta = 1
Hidden dimension factor	2/3 of the standard 4d (i.e. 8/3 times d) to keep parameter count comparable
Matrices in FFN	3 (gate W, value V, down W2), vs 2 in a standard FFN
Used in	PaLM, LLaMA 1/2/3, Mistral 7B, Mixtral, Falcon 3, DeepSeek (V2/V3), Qwen2/Qwen3, OLMo
PyTorch primitive	`torch.nn.functional.silu`
C4 log-perplexity in original paper (524k steps)	1.636 (vs 1.677 ReLU baseline)
GLUE average in original paper	84.36 (second to ReGLU's 84.67; SwiGLU's SuperGLUE average of 74.56 was the best of all tested variants)

background: the standard transformer FFN

A standard transformer block contains two sublayers: multi-head self-attention and a position-wise feed-forward network (FFN). The FFN, introduced in Attention Is All You Need (Vaswani et al., 2017), is a small two-layer multilayer perceptron applied independently to every position. In matrix form it is

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2

The inner dimension d_ff is conventionally four times the model dimension d_model, so for d_model = 512 the hidden layer has 2048 units. The original paper used ReLU. Later work swapped ReLU for GELU, giving

FFN_GELU(x) = GELU(xW_1 + b_1) W_2 + b_2

GELU was the default in BERT, GPT-2, GPT-3, and RoBERTa. The original T5 kept ReLU, and its codebase also dropped the bias terms, simplifying the formula to FFN_ReLU(x) = max(0, xW_1) W_2. The structure is the same in every case: project up to a wider hidden dimension, apply a pointwise nonlinearity, project back down. Two matrix multiplies, one elementwise nonlinearity. For most of the late 2010s this was the only FFN design anyone seriously considered.

The FFN is a surprisingly large fraction of the parameters and FLOPs of a transformer. With d_ff = 4 times d_model, the FFN sublayer holds roughly 8 times d_model squared parameters per layer, while the attention sublayer (with four projection matrices of size d_model by d_model) holds 4 times d_model squared. So the FFN is about two thirds of the non-embedding parameters, and a similar share of the compute whenever sequences are short enough that the attention-score matmuls are minor. Improving the FFN can therefore move the needle on overall model quality more than tweaking attention typically does.
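The two-thirds figure can be checked with a few lines of arithmetic. The sketch below counts only the weight matrices of one block (no biases, norms, or embeddings); the function name block_param_counts is made up for illustration.

```python
def block_param_counts(d_model: int, ffn_ratio: int = 4) -> dict:
    """Weight counts for one transformer block, ignoring biases and norms."""
    # Attention: query, key, value, and output projections, each d_model x d_model
    attention = 4 * d_model * d_model
    # Standard FFN: one up-projection and one down-projection
    ffn = 2 * d_model * (ffn_ratio * d_model)
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }


counts = block_param_counts(512)
print(counts)
```

With d_model = 512 the FFN share comes out at two thirds, and the ratio does not depend on d_model as long as d_ff stays at four times the model width.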

Gated variants change the FFN topology. They split the up-projection in two and use one half to gate the other. The first such variant in modern deep learning was the GLU, introduced for convolutional language modeling in 2016.

GLU: the original gated linear unit

The Gated Linear Unit was proposed by Yann Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research in Language Modeling with Gated Convolutional Networks (arXiv:1612.08083, December 2016). They were trying to make convolutional networks competitive with LSTMs on word-level language modeling, and the gating mechanism was their answer to the long-range information flow problem. The basic GLU is

GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c)

where ⊗ is elementwise (Hadamard) multiplication and σ is the sigmoid function. One linear projection produces the candidate values, the other produces a sigmoid gate that decides how much of each value passes through. The Dauphin paper argued that this gating gave a linear path for gradients (via the value branch) while keeping enough nonlinearity (via the gate) for expressivity. The model achieved state of the art on WikiText-103 and was competitive with LSTMs on the Google Billion Words benchmark, which at the time felt surprising for a non-recurrent architecture.

Dauphin et al. also defined the Bilinear layer, which is GLU without any gating activation:

Bilinear(x, W, V, b, c) = (xW + b) ⊗ (xV + c)

They credited this construction to Mnih and Hinton's 2007 paper Three new graphical models for statistical language modelling, which is the deepest historical root of the entire GLU family. The Bilinear layer is purely multiplicative interaction between two linear projections, and as Shazeer would later show, it is already strong enough to outperform a non-gated ReLU FFN.

The core idea is simple: gating turns a linear projection into a multiplicative interaction at the cost of one extra projection matrix. SwiGLU and its siblings just swap the sigmoid for a different gate activation.
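As a concrete illustration of the two formulas in this section, here is a minimal pure-Python sketch. The helper names (linear, glu, bilinear) and the tiny weight matrices are invented for the example; real implementations use batched tensor operations.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, weight, bias):
    """Compute xW + b for a row vector x; weight is stored as a list of rows."""
    n_out = len(weight[0])
    return [
        sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
        for j in range(n_out)
    ]


def glu(x, W, V, b, c):
    """(xW + b) * sigmoid(xV + c), elementwise, as in the GLU formula above."""
    values = linear(x, W, b)
    gates = [sigmoid(g) for g in linear(x, V, c)]
    return [v * g for v, g in zip(values, gates)]


def bilinear(x, W, V, b, c):
    """(xW + b) * (xV + c): the same layer with no activation on the gate."""
    return [v * o for v, o in zip(linear(x, W, b), linear(x, V, c))]


# Toy inputs chosen so the projections are easy to check by hand
x = [1.0, -2.0]
W = [[0.5, 1.0, -1.0], [0.25, 0.0, 2.0]]   # xW = [0, 1, -5]
V = [[0.0, 1.0, 1.0], [1.0, 0.5, 0.0]]     # xV = [-2, 0, 1]
b = [0.0, 0.0, 0.0]
c = [0.0, 0.0, 0.0]

glu_out = glu(x, W, V, b, c)
bilinear_out = bilinear(x, W, V, b, c)
print(glu_out, bilinear_out)
```

Because the sigmoid gate lies strictly between 0 and 1, each GLU output is a shrunken copy of its value entry, while the Bilinear output can be larger or change sign.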

Swish, SiLU, and the activation function naming muddle

The gate function used by SwiGLU is Swish, also called SiLU. The two names refer to the same function, with a small distinction over the beta parameter that almost nobody respects in practice.

SiLU (Sigmoid Linear Unit, also called sigmoid-weighted linear unit) was first defined by Stefan Elfwing, Eiichi Uchibe, and Kenji Doya at OIST in Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning (arXiv:1702.03118, February 2017). They proposed it for value function approximation in deep reinforcement learning, where the smooth derivative was useful for stable training. The function is

SiLU(x) = x · σ(x)

where σ is the sigmoid. Elfwing et al. also defined dSiLU, the derivative of SiLU, and proposed it as a smooth replacement for the sigmoid in RL value heads. The name SiLU is the one that ended up in modern PyTorch and JAX.

Even earlier, Dan Hendrycks and Kevin Gimpel had mentioned the same function as a close relative of GELU in Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units (arXiv:1606.08415, June 2016). They observed that x times σ(x) is a smooth gating function similar to the GELU they were proposing.

Swish was named in Searching for Activation Functions by Prajit Ramachandran, Barret Zoph, and Quoc V. Le of Google Brain (arXiv:1710.05941, October 2017). The team ran an automated architecture search over scalar activation functions. The best discovered candidate was

Swish_β(x) = x · σ(βx)

with β either fixed or learned. When β = 1 the function is identical to Elfwing's SiLU. The Ramachandran paper explicitly acknowledges this. When β is learned per-layer, the function can interpolate between near-linear behavior (small β) and a hard ReLU-like cutoff (large β).

In the SwiGLU paper Shazeer used Swish_1, i.e. SiLU. Every public LLM that says it uses SwiGLU is using SiLU as the gate. PyTorch exposes the function as torch.nn.functional.silu, and most LLaMA-style codebases call it SiLU even though the upstream paper trail uses Swish. In this article we use the two names loosely, preferring Swish where β is variable and SiLU for the β = 1 case used in production.
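The role of β is easy to see numerically. The following sketch evaluates Swish_β at a very small, a unit, and a very large β; the specific β values and sample points are arbitrary choices for illustration.

```python
import math


def swish(x: float, beta: float = 1.0) -> float:
    """Swish_beta(x) = x * sigmoid(beta * x); beta = 1 is SiLU."""
    return x / (1.0 + math.exp(-beta * x))


def relu(x: float) -> float:
    return max(0.0, x)


for x in (-3.0, -1.0, 0.5, 3.0):
    print(
        f"x={x:+.1f}  "
        f"beta=0.01: {swish(x, 0.01):+.4f}  "   # close to x / 2 (near-linear)
        f"beta=1: {swish(x, 1.0):+.4f}  "       # SiLU
        f"beta=50: {swish(x, 50.0):+.4f}  "     # close to ReLU
        f"relu: {relu(x):+.4f}"
    )
```

For small β the sigmoid is nearly constant at 1/2, so the function approaches x/2; for large β the sigmoid approaches a step and the function approaches ReLU.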

numerical properties of swish

Swish has a few quantitative properties worth keeping in mind because they show up in optimization arguments later:

It is C-infinity smooth (infinitely differentiable). Unlike ReLU, the derivative is well defined at every point.
It is non-monotonic. The function dips below zero for negative arguments, reaching its global minimum of approximately -0.2785 at x ≈ -1.2785, then asymptotically approaches zero as x goes to negative infinity.
For large positive x it asymptotes to the identity, like ReLU.
The derivative is Swish'(x) = σ(βx) + βx σ(βx)(1 - σ(βx)). At β = 1 this is bounded above by about 1.0998 (attained near x ≈ 2.4). The derivative is zero only at the single minimum near x ≈ -1.2785 rather than across the whole negative half-line, so units do not die the way pure ReLU units sometimes do.
Swish is self-gated: the input itself, after a sigmoid squashing, controls how much of the input passes through. This is conceptually similar to LSTM gating but uses no extra parameters.

The non-monotonic dip near x = -1.28 is what most theoretical accounts focus on, since it is the property that distinguishes Swish from a plain smoothed ReLU. The argument is that the dip lets the function represent locally non-monotonic behavior and supplies extra gradient information in the negative pre-activation regime, which can help models escape saddle regions during training.
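The constants quoted in the list above can be reproduced with a brute-force grid search. This is only a numerical sanity check for the β = 1 case; the grid range and spacing are arbitrary.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def silu(x: float) -> float:
    return x * sigmoid(x)


def silu_grad(x: float) -> float:
    """Derivative of SiLU: sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x))."""
    s = sigmoid(x)
    return s + x * s * (1.0 - s)


# Evenly spaced grid on [-6, 6] with step 1e-4
grid = [i / 10000.0 for i in range(-60000, 60001)]

x_min = min(grid, key=silu)             # location of the global minimum
x_steepest = max(grid, key=silu_grad)   # location of the largest slope

print("minimum:", x_min, silu(x_min))
print("max derivative:", x_steepest, silu_grad(x_steepest))
```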

SwiGLU formulation

Shazeer combined the GLU template with Swish and called the result SwiGLU; the bias terms were then dropped in the FFN variants he evaluated, since transformers usually run without biases inside the FFN. The general definition from the 2020 paper is

SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c)

which expands to

SwiGLU(x, W, V, b, c, β) = ((xW + b) · σ(β(xW + b))) ⊗ (xV + c)

In practice almost everyone uses β = 1 and omits the bias vectors. The version actually deployed in PaLM, LLaMA, and friends is

SwiGLU(x, W, V) = SiLU(xW) ⊗ xV

Note that SwiGLU as usually written is not really a pointwise activation. It is a small parameterised module: it owns two weight matrices W and V. To get a drop-in replacement for the FFN, you wrap it with a final down-projection W2. That gives the FFN_SwiGLU layer.

A useful way to read the formula is to call xW the gate pre-activation (what gets squashed by Swish to produce the gate) and xV the value branch (the linear path). The gate values are roughly in the range -0.28 to plus infinity, and they multiply the value branch elementwise. Where the gate is near zero, the value is suppressed. Where the gate is near one, the value passes through nearly unchanged. Where the gate is large, the value is amplified.
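The three regimes described above (suppress, pass through, amplify) show up directly in a small elementwise example. The gate pre-activations and value entries below are hypothetical numbers standing in for entries of xW and xV.

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def swiglu_elementwise(gate_pre, value):
    """SiLU(gate_pre) * value, applied entry by entry."""
    return [silu(g) * v for g, v in zip(gate_pre, value)]


gate_pre = [-6.0, -1.28, 0.0, 1.28, 6.0]   # hypothetical entries of xW
value = [2.0, 2.0, 2.0, 2.0, 2.0]          # hypothetical entries of xV

out = swiglu_elementwise(gate_pre, value)
for g, o in zip(gate_pre, out):
    print(f"gate_pre={g:+.2f}  gate={silu(g):+.4f}  output={o:+.4f}")
```

A strongly negative or zero pre-activation suppresses the value, a pre-activation around 1.28 gives a gate close to one, and a large positive pre-activation scales the value up. Around -1.28 the gate is slightly negative, so the value's sign is flipped.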

SwiGLU inside the transformer FFN

The full feed-forward block becomes

FFN_SwiGLU(x, W, V, W_2) = (Swish_1(xW) ⊗ xV) W_2

Three matrices instead of two. If you keep the inner width the same, the parameter count and FLOP count both increase by 50%, which is not a fair comparison against a baseline FFN. To control for this, Shazeer reduced the inner width by a factor of 2/3, so the SwiGLU FFN ends up with roughly the same parameter and compute budget as a ReLU or GELU FFN of width 4 times d_model.

This 2/3 factor is why LLaMA, Mistral, and most other SwiGLU models use an inner FFN width of 8/3 times d_model, then round to a convenient multiple (typically 256 or 128) for hardware efficiency. Touvron et al. write in the LLaMA paper, "We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3 4d instead of 4d as in PaLM." PaLM kept the inner width at 4 times d_model and ate the extra parameters; LLaMA chose to match the budget instead. Both are valid choices.

The parameter budget for one FFN block, ignoring biases, works out to roughly:

Variant	Inner width	Parameters per FFN
Standard FFN (ReLU/GELU)	4d	2 times d times 4d = 8d squared
SwiGLU FFN, equal width	4d	3 times d times 4d = 12d squared
SwiGLU FFN, 2/3 reduction	8d/3	3 times d times 8d/3 = 8d squared

So the LLaMA-style SwiGLU FFN has the same nominal parameter count as a baseline FFN, just split across three thinner matrices instead of two wider ones.
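The rows of the table can be verified directly. The sketch uses d = 768 because 8d/3 is then exactly 2048, matching the T5 base setup from Shazeer's experiments; ffn_params is an illustrative helper, not a library function.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count of one FFN block without biases."""
    n_matrices = 3 if gated else 2   # gated FFNs add a second up-projection
    return n_matrices * d_model * d_ff


d = 768
standard = ffn_params(d, 4 * d, gated=False)            # ReLU/GELU FFN, width 4d
swiglu_wide = ffn_params(d, 4 * d, gated=True)          # SwiGLU, width kept at 4d
swiglu_matched = ffn_params(d, 8 * d // 3, gated=True)  # SwiGLU, width 8d/3

print(standard, swiglu_wide, swiglu_matched)
print("equal-width overhead:", swiglu_wide / standard)
```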

For concrete numbers, here are the SwiGLU FFN dimensions used by several real models. Note that the convention varies: LLaMA rounds 8/3 times d_model up to a multiple of 256, Mistral and Llama 3 use 3.5 times d_model (which is already a multiple of both 128 and 1024), and PaLM keeps the unreduced 4d.

Model	d_model	Computed 8/3 times d	Actual d_ff	Multiple
LLaMA 7B	4096	10923	11008	256
LLaMA 13B	5120	13653	13824	256
LLaMA 65B	8192	21845	22016	256
Mistral 7B	4096	10923	14336	128
Mixtral 8x7B (per expert)	4096	10923	14336	128
Llama 3 8B	4096	10923	14336	1024
Llama 3 70B	8192	21845	28672	1024
PaLM 540B	18432	49152	73728	4d (no reduction)

Llama 3 and Mistral both use larger inner widths than the strict 8/3 rule predicts (3.5 times d_model in the rows above), partly to compensate for parameters saved by Grouped-Query Attention (GQA), and partly because d_ff is a hyperparameter that is cheap to tune. The 8/3 figure is the right starting point but not a sacred constant.
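The LLaMA rows of the table follow from a two-step rule: take two thirds of 4d, then round up to the next multiple. A minimal sketch, with llama_ffn_width as an illustrative name and the multiple of 256 taken from the table:

```python
def llama_ffn_width(d_model: int, multiple_of: int = 256) -> int:
    """Two thirds of 4 * d_model, rounded up to a multiple of `multiple_of`."""
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


for name, d_model in [("LLaMA 7B", 4096), ("LLaMA 13B", 5120), ("LLaMA 65B", 8192)]:
    print(name, d_model, llama_ffn_width(d_model))

# Ratio of actual d_ff to d_model for the Mistral 7B and Llama 3 70B rows
print(14336 / 4096, 28672 / 8192)
```

The last line shows why the Mistral and Llama 3 rows do not match the computed 8/3 column: their widths sit at 3.5 times d_model.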

performance results from Shazeer (2020)

Shazeer evaluated each variant on a T5 base configuration (12 encoder and 12 decoder layers, d_model = 768, d_ff = 3072 for non-gated variants, d_ff = 2048 for gated variants) trained on the C4 corpus for 524,288 steps. The held-out log-perplexity numbers (lower is better) reported in the paper are below.

FFN variant	Activation in gate	Log-perplexity at 65k steps	Log-perplexity at 524k steps
FFN_ReLU (baseline)	ReLU (no gate)	1.997	1.677
FFN_GELU	GELU (no gate)	1.983	1.679
FFN_Swish	Swish (no gate)	1.994	1.683
FFN_GLU	sigmoid	1.982	1.663
FFN_Bilinear	identity (no gate activation)	1.960	1.648
FFN_ReGLU	ReLU	1.953	1.645
FFN_GEGLU	GELU	1.942	1.633
FFN_SwiGLU	Swish	1.944	1.636

GEGLU and SwiGLU are essentially tied, with both clearly beating any of the non-gated baselines. The improvement over GELU on log-perplexity is small in absolute terms (around 0.04 nats), but consistent and free in compute terms once you apply the 2/3 width reduction.
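To put the gap in more familiar units, the log-perplexities from the table can be exponentiated. This assumes, as the "nats" above implies, that the reported values are natural-log perplexities; only four rows are copied here for brevity.

```python
import math

# Held-out log-perplexity at 524k steps, copied from the table above
log_ppl = {
    "FFN_ReLU": 1.677,
    "FFN_GELU": 1.679,
    "FFN_GEGLU": 1.633,
    "FFN_SwiGLU": 1.636,
}

gap = log_ppl["FFN_GELU"] - log_ppl["FFN_SwiGLU"]
perplexity = {name: math.exp(value) for name, value in log_ppl.items()}

print(f"GELU minus SwiGLU: {gap:.3f} nats")
for name, ppl in perplexity.items():
    print(f"{name}: perplexity {ppl:.3f}")
print("lowest:", min(log_ppl, key=log_ppl.get))
```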

Shazeer also reported downstream task scores from fine-tuning on GLUE, SuperGLUE, and SQuAD. The averages were:

FFN variant	GLUE avg	SuperGLUE avg	SQuAD F1
FFN_ReLU	83.80	72.76	90.87
FFN_GELU	83.86	72.98	90.79
FFN_Swish	83.60	72.40	90.76
FFN_GLU	84.20	73.95	90.69
FFN_Bilinear	83.79	73.81	91.06
FFN_ReGLU	84.67	73.66	91.18
FFN_GEGLU	84.12	73.96	91.12
FFN_SwiGLU	84.36	74.56	91.03
Raffel et al. 2019 (T5 reference)	83.28	71.36	88.81

SwiGLU posted the highest SuperGLUE average. ReGLU narrowly led on GLUE and SQuAD F1. The differences among the gated variants are of the same order as the inter-run standard deviations Raffel et al. reported (about 0.24 for GLUE, 0.42 for SuperGLUE, 0.23 for SQuAD F1), so the headline of the paper is really that the gated family as a whole tends to beat the non-gated baselines, not that any one gated variant is clearly best.

The paper does not actually argue for SwiGLU over GEGLU on principled grounds. Both worked, and the paper recommended neither over the other. Subsequent practice tilted toward SwiGLU because PaLM and LLaMA chose it, and the ecosystem followed.

adoption in major language models

SwiGLU went from an obscure 5-page tech report in early 2020 to the dominant FFN variant in production by 2024. The inflection point was Google's PaLM in April 2022, which used SwiGLU at 540B scale and justified the choice by citing Shazeer's compute-matched quality gains. Meta then chose SwiGLU for LLaMA in February 2023, and once the LLaMA weights were widely circulated, every fine-tuner and downstream researcher inherited the same FFN topology.

Model	Year	Activation	Notes
PaLM	2022	SwiGLU	Inner width kept at 4 times d_model; 540B parameters
LLaMA (1)	2023	SwiGLU	Inner width 2/3 times 4d, beta = 1, no biases
LLaMA 2	2023	SwiGLU	Same convention as LLaMA
LLaMA 3	2024	SwiGLU	Carried over unchanged; intermediate size grew with GQA
Mistral 7B	2023	SwiGLU	Inner width 14336 with d_model 4096
Mixtral 8x7B	2023	SwiGLU	Used inside each of the 8 experts
Falcon 3	2024	SwiGLU	Falcon 1/2 had used GELU
DeepSeek V2 / V3	2024	SwiGLU	DeepSeek-V3 caches SwiGLU input and recomputes output in the backward pass to save activation memory
Qwen 2 / Qwen 3	2024/2025	SwiGLU	Inner width varies by model size
OLMo	2024	SwiGLU	Hidden size set to roughly 8d/3 rounded up to a multiple of 128
StableLM 2	2024	SwiGLU	Stability AI open weights
Phi-3	2024	SwiGLU	Microsoft small model
Yi-34B	2023	SwiGLU	01.AI
Baichuan 2	2023	SwiGLU	Baichuan Inc.
InternLM 2	2024	SwiGLU	Shanghai AI Lab
Gemma 1/2	2024	GeGLU	Sibling of SwiGLU using GELU as the gate; Google's open-weight choice
GPT-2/GPT-3	2019/2020	GELU	Pre-SwiGLU; OpenAI never publicly switched
Pythia	2023	GELU	EleutherAI's interpretability suite stayed on GELU for comparability

Gemma is a useful counterexample: Google's open-weight model uses GeGLU rather than SwiGLU. The two perform almost identically in Shazeer's original benchmark, so this is not a serious quality difference, just a different bet. Gemma uses the tanh approximation of GELU that GPT-2 popularised rather than the exact form, which has caused occasional inference inconsistencies when third-party runtimes substitute the exact version.

OpenAI's GPT-2 and GPT-3 are publicly described as using GELU in the FFN; the company has not released architecture details for GPT-4 or its newer models. Anthropic has not published details of Claude's FFN, though common scuttlebutt is that it uses a SwiGLU or GeGLU variant like everyone else.

The following table summarises the family. All gated variants (GLU, ReGLU, GEGLU, SwiGLU, and Bilinear) require three matrices in the FFN block and double the up-projection parameters relative to a non-gated FFN of equal hidden width.

Activation	Formula	Gated	Smooth	Used in
ReLU	max(0, x)	no	no	Original Transformer, vanilla T5
GELU	x times Phi(x)	no	yes	BERT, GPT-2, GPT-3, RoBERTa
Swish / SiLU	x times σ(beta times x)	no	yes	EfficientNet (often), various RL agents
GLU	(xW + b) ⊗ σ(xV + c)	yes	partly	Dauphin et al. 2016 conv LM
ReGLU	max(0, xW + b) ⊗ (xV + c)	yes	no	Shazeer 2020
GEGLU	GELU(xW + b) ⊗ (xV + c)	yes	yes	T5 v1.1, mT5, Gemma
SwiGLU	Swish(xW + b) ⊗ (xV + c)	yes	yes	PaLM, LLaMA family, Mistral, Falcon 3, DeepSeek, Qwen, OLMo
Bilinear	(xW + b) ⊗ (xV + c)	yes (linear gate)	yes	Studied in Shazeer 2020; rare in production
Mish	x times tanh(softplus(x))	no	yes	YOLOv4 and several CV models
ELU	x if x>0 else alpha(exp(x)-1)	no	yes	Some early LM experiments

SwiGLU sits at the intersection of two design choices: gated (vs. pointwise), and smooth (vs. piecewise linear). GEGLU shares both properties; the only difference is which smooth nonlinearity does the gating.

why does it work?

Nobody really knows. The most-quoted line in Shazeer's paper is the mock-modest disclaimer in the conclusions section:

"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."

The academic literature has produced a few plausible stories. The most common one is that the multiplicative gate gives the network a low-cost way to represent conditional computation: the value branch carries information forward, and the gate branch (a smooth soft mask after Swish) decides which features matter for which input. This is essentially the same intuition Dauphin used to motivate the original GLU. Smooth gating activations like Swish and GELU avoid the dead-unit problem of ReGLU, and Swish's slight non-monotonic dip at small negative inputs may help with optimization.

A second story is expressivity. A gated FFN can represent products of features, not just additive combinations. Each output unit is a sum of inputs scaled by a learned mask, so it can selectively combine information across input dimensions in a way a single-projection ReLU cannot. The Bilinear baseline already enjoys this property, which is why its perplexity is closer to SwiGLU than to ReLU.

A more recent line of analysis comes from approximation theory. A 2026 preprint with the tongue-in-cheek title Divine Benevolence is an x-squared (arXiv:2602.14495) argued that GLU-style layers form piecewise quadratic approximators, while standard MLPs form piecewise linear approximators. Quadratic piecewise approximators have asymptotically better convergence rates for smooth target functions, which the authors offered as a partial explanation for the consistent perplexity advantage of GLU variants over plain MLPs. Whether this story holds up at very large scale remains open.

A fourth perspective is purely empirical: the gain over GELU on log-perplexity is small but the gain comes for free, the tooling is mature, and the inertia is enormous. Even practitioners who suspect the gain would disappear at trillion-token scale have little reason to fight the consensus.

Mirzadeh et al.'s ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models (arXiv:2310.04564, presented at ICLR 2024) showed that replacing the SiLU inside SwiGLU with a plain ReLU produces minimal quality loss while making activations highly sparse, which can be exploited to speed up inference by up to 3x in memory-bound regimes. The authors, at Apple, did not argue that ReLU is intrinsically better, only that the smoothness of SwiGLU may not be earning its keep, especially during inference. Subsequent work (e.g. ReLU squared, gated ReLU) has continued this line.

The current status is roughly: SwiGLU is clearly competitive, the difference from GEGLU is negligible, and the difference from a well-tuned GELU FFN is small but consistent enough that all the major labs default to SwiGLU now anyway. The decision is partly about momentum and tooling.

PyTorch implementation

A minimal LLaMA-style SwiGLU feed-forward block looks like this. PyTorch's F.silu is exactly Swish with beta = 1, which is what every SwiGLU model in the adoption table above uses.

from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU_FFN(nn.Module):
    """LLaMA-style feed-forward block: w_down(SiLU(w_gate(x)) * w_value(x))."""

    def __init__(self, d_model: int, d_ff: Optional[int] = None):
        super().__init__()
        # LLaMA convention: round 8/3 * d_model up to a multiple of 256
        if d_ff is None:
            d_ff = int(8 * d_model / 3)
            d_ff = ((d_ff + 255) // 256) * 256
        # Three bias-free projections: gate (W), value (V), and down (W2)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU on the gate branch, elementwise product with the value branch,
        # then project back down to d_model
        return self.w_down(F.silu(self.w_gate(x)) * self.w_value(x))

A few notes on the implementation in practice:

Bias terms are dropped, matching Shazeer's preferred form and every modern LLM that uses SwiGLU.
w_gate and w_value can be fused into a single 2 times d_ff projection and split, which is slightly faster on GPU. The official LLaMA reference code keeps them as separate matrices (w1 and w3, with w2 as the down-projection), while some serving stacks such as vLLM merge them into a single matrix called gate_up_proj.
For training memory, the activations after silu and the elementwise product are the largest tensors in the block. DeepSeek-V3 reportedly avoids storing them by recomputing silu(w_gate(x)) and the elementwise product during the backward pass, trading a small amount of compute for a meaningful memory saving.
PyTorch (as of 2.x) does not yet ship a fused swiglu kernel; community kernels like xFormers' SwiGLU op or Triton implementations are common in production training stacks. There is an open issue in the PyTorch repo (#128712) requesting a built-in version.
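The fused-projection note above amounts to concatenating the gate and value weights column-wise and splitting the result. The sketch below shows the equivalence in plain Python on toy matrices; the function names and weights are invented for illustration, and a real implementation would do the same thing with one tensor matmul and a split or chunk.

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(x, weight):
    """Row vector x times a matrix stored as a list of rows."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x)))
        for j in range(len(weight[0]))
    ]


def swiglu_separate(x, w_gate, w_value):
    """Two projections, then SiLU(gate) * value."""
    gate, value = matvec(x, w_gate), matvec(x, w_value)
    return [silu(g) * v for g, v in zip(gate, value)]


def swiglu_packed(x, w_packed, d_ff):
    """One projection of width 2 * d_ff, split into gate and value halves."""
    both = matvec(x, w_packed)
    gate, value = both[:d_ff], both[d_ff:]
    return [silu(g) * v for g, v in zip(gate, value)]


# Toy weights with d_model = 2 and d_ff = 3
w_gate = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
w_value = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]
# Packed layout: gate columns first, value columns second
w_packed = [g_row + v_row for g_row, v_row in zip(w_gate, w_value)]

x = [1.0, 2.0]
separate_out = swiglu_separate(x, w_gate, w_value)
packed_out = swiglu_packed(x, w_packed, d_ff=3)
print(separate_out, packed_out)
```

The column order inside the packed matrix is a convention; checkpoints saved in one layout need to be repacked before loading into code that expects another.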

In JAX-based stacks the equivalent uses jax.nn.silu and jax.numpy.einsum, with the same three-matrix structure.

fused kernels and hardware considerations

A naive PyTorch implementation of the SwiGLU FFN issues five separate GPU operations: two large matrix multiplications for the gate and value projections, one elementwise SiLU, one elementwise multiplication, and one large matrix multiplication for the down projection. The two elementwise operations between the projections are memory-bound on modern GPUs because they read and write the full d_ff-dimensional activation tensor without doing much arithmetic. Fusing them into a single kernel cuts the memory traffic of that part of the block from five tensor-sized reads and writes to three.

Several production stacks ship fused SwiGLU kernels:

xFormers from Meta provides xformers.ops.SwiGLU with a packed weight format that concatenates the gate and value projections as a single weight tensor w12. The forward and backward passes are written in CUTLASS templates, which lets the kernel fuse the activation with the surrounding matrix multiplications on H100 and A100 GPUs.
Liger Kernel, an open-source project from LinkedIn, provides a Triton implementation of SwiGLU forward and backward that the authors report can reduce LLM-training memory by up to 60% in conjunction with their other fused kernels (RMSNorm, RoPE, cross-entropy). Liger composes with PyTorch FSDP, DeepSpeed, and FlashAttention.
NVIDIA Transformer Engine ships fused SwiGLU as part of its FP8 training kernels for H100 and Blackwell.
llama.cpp and other inference engines (vLLM, TGI, MLX) implement SwiGLU as a single CUDA or Metal kernel that reads both projections and writes only the gated output, which reduces the memory traffic of the elementwise step.

The activation memory savings come from fusing the SiLU and the elementwise product into a single kernel that does not materialise the intermediate tensors. The key tensors are the gate pre-activation xW (size batch times sequence times d_ff), the value pre-activation xV (same size), and the gated output (same size). A non-fused implementation stores all three; a fused implementation can stream them through registers or shared memory and only write the output. For a Llama 3 70B forward pass at 8K context with batch 1, each of these tensors is a little under half a gigabyte in BF16, so fusion avoids materialising close to a gigabyte of intermediates per layer, which compounds across the 80 layers.
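A back-of-the-envelope check of the tensor sizes involved, assuming BF16 (2 bytes per element), a context of 8192 tokens, and the Llama 3 70B d_ff of 28672 from the earlier table. The pass counts only model the elementwise step between the projections, not the matmuls.

```python
def activation_bytes(batch: int, seq_len: int, d_ff: int, bytes_per_element: int = 2) -> int:
    """Size of one d_ff-wide activation tensor (BF16 assumed by default)."""
    return batch * seq_len * d_ff * bytes_per_element


tensor_bytes = activation_bytes(batch=1, seq_len=8192, d_ff=28672)
gib = tensor_bytes / 2 ** 30

# Unfused: SiLU reads xW and writes silu(xW); the multiply reads silu(xW)
# and xV and writes the product.
unfused_passes = (1 + 1) + (2 + 1)
# Fused: read xW and xV once, write the product once.
fused_passes = 2 + 1

print(f"one tensor: {gib:.4f} GiB")
print(f"three tensors kept (unfused): {3 * gib:.4f} GiB")
print(f"one tensor kept (fused): {gib:.4f} GiB")
print(f"elementwise traffic: {unfused_passes} vs {fused_passes} tensor passes")
```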

For inference, the weight traffic of SwiGLU is roughly 1.5 times that of a non-gated FFN of the same hidden width: three large weight matrices to load instead of two, and two d_ff-wide projection outputs to write instead of one before the final down projection. This is why SwiGLU LLMs are slightly more memory-bandwidth-bound than equivalently sized GELU LLMs, and why it is more important to colocate the gate and value projections in cache.

activation checkpointing and memory tricks

DeepSeek-V3's technical report (arXiv:2412.19437) describes a specific memory optimisation for SwiGLU during training: the inputs of the SwiGLU operator are cached, and its output is recomputed during the backward pass, so the gated product never has to be kept alive between the forward and backward passes. The recomputation is elementwise and therefore tiny relative to the matrix multiplications surrounding it, so this trade is essentially free in compute and saves substantial memory.

DeepSeek-V3 combines this trick with FP8 quantisation of the SwiGLU activations using their own fine-grained per-block scaling scheme. The combination is one of the design choices that lets DeepSeek-V3 train a 671B-parameter Mixture of Experts model with only 2.788M H800 GPU hours.

Generic activation checkpointing libraries (such as PyTorch's torch.utils.checkpoint or Megatron-LM's --recompute-activations flag) can apply the same trade automatically across the whole transformer block. When training memory is the bottleneck, recomputing SwiGLU forwards in the backward pass is one of the highest-value moves, since the FFN activations are the largest single tensor in a transformer block.
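The cache-inputs, recompute-output idea can be sketched without any framework. The version below treats the operator's inputs as the two projected tensors (gate pre-activation and value), which is an assumption about where the operator boundary sits; it is a hand-written illustration, not DeepSeek's or PyTorch's implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def swiglu_forward(gate_pre, value):
    """Return SiLU(gate_pre) * value and keep only the inputs for backward."""
    out = [g * sigmoid(g) * v for g, v in zip(gate_pre, value)]
    saved = (gate_pre, value)   # silu(gate_pre) and the product are not stored
    return out, saved


def swiglu_backward(saved, grad_out):
    """Recompute SiLU from the saved inputs, then apply the product rule."""
    gate_pre, value = saved
    grad_gate, grad_value = [], []
    for g, v, go in zip(gate_pre, value, grad_out):
        s = sigmoid(g)                     # recomputed in the backward pass
        silu_g = g * s                     # recomputed in the backward pass
        dsilu = s + g * s * (1.0 - s)      # derivative of SiLU at g
        grad_gate.append(go * v * dsilu)
        grad_value.append(go * silu_g)
    return grad_gate, grad_value


gate_pre = [-1.5, 0.3, 2.0]
value = [0.7, -1.2, 0.5]
out, saved = swiglu_forward(gate_pre, value)
grad_gate, grad_value = swiglu_backward(saved, [1.0, 1.0, 1.0])
print(out, grad_gate, grad_value)
```

The extra work in the backward pass is one sigmoid and a few multiplies per element, which is the sense in which the recomputation is cheap next to the surrounding matrix multiplications.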

relationship to mixture of experts

SwiGLU also appears inside the experts of Mixture of Experts (MoE) models. Mixtral 8x7B places eight independent SwiGLU FFNs in each transformer block; the router selects two of them per token, and only the selected experts are evaluated. DeepSeek-V2 and V3 use a finer-grained variant with a large pool of small SwiGLU experts, several of which are active per token. In each case the SwiGLU formula is identical to the dense case; what changes is how many of the FFN sublayers are run. The total parameter count grows linearly in the number of experts, but the active compute is dominated by the few experts the router picks.
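A toy version of top-k routing over SwiGLU experts makes the "same formula, fewer sublayers run" point concrete. Everything here is illustrative: the weights are generated from a sine function, the helper names are made up, and the gate weights are a softmax over the selected logits, which is an assumption modelled on the Mixtral description rather than a copy of any model's code.

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(x, weight):
    """Row vector x times a matrix stored as a list of rows."""
    return [sum(x[i] * weight[i][j] for i in range(len(x))) for j in range(len(weight[0]))]


def swiglu_ffn(x, expert):
    """Dense SwiGLU FFN: (SiLU(xW) * xV) W2."""
    w_gate, w_value, w_down = expert
    hidden = [silu(g) * v for g, v in zip(matvec(x, w_gate), matvec(x, w_value))]
    return matvec(hidden, w_down)


def moe_forward(x, router, experts, top_k=2):
    """Run only the top_k experts chosen by the router and mix their outputs."""
    logits = matvec(x, router)
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)
    weights = [math.exp(logits[e] - peak) for e in chosen]   # softmax over chosen logits
    total = sum(weights)
    out = [0.0] * len(x)
    for e, w in zip(chosen, weights):
        y = swiglu_ffn(x, experts[e])        # unselected experts are never evaluated
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, chosen


def make_matrix(rows, cols, seed):
    """Deterministic toy weights for the example."""
    return [[math.sin(seed + 3 * r + c) for c in range(cols)] for r in range(rows)]


# Four experts with d_model = 2 and d_ff = 3
experts = [(make_matrix(2, 3, e), make_matrix(2, 3, e + 10), make_matrix(3, 2, e + 20))
           for e in range(4)]
router = [[1.0, 0.0, -1.0, 0.5], [0.0, 3.0, 1.0, -0.5]]   # d_model x n_experts
x = [1.0, 0.5]

out, chosen = moe_forward(x, router, experts, top_k=2)
print("selected experts:", chosen)
print("output:", out)
```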

The choice of activation inside the experts is generally treated as a separate decision from the routing scheme, and SwiGLU is the default for the same reasons it is the default in dense models: it is the format the open-source training stacks are tuned for, and the empirical loss per parameter is at least as good as the alternatives.

variants and follow-up work

Researchers have continued to explore the design space around SwiGLU, though the empirical advantage of any single variant remains small.

GeGLU, the GELU-gated sibling, is the closest alternative and is used in T5 v1.1, mT5, and Gemma. The choice between SwiGLU and GeGLU is essentially aesthetic at this point.
Bilinear, the activation-free gated layer (no nonlinearity on the gate branch), already outperforms ReLU and GELU baselines on perplexity in Shazeer's table. It is rarely used in production but appears in a few research papers as a strong baseline.
ReGLU squared (ReLU-squared GLU), used in some sparse-LLM research, replaces SiLU with ReLU squared. The idea is to combine GLU's gating with the activation sparsity that pure ReLU networks enjoy.
Masked GLU (arXiv:2506.23225), a 2025 proposal, applies a learned discrete mask on top of the SwiGLU gate to recover ReLU-style sparsity while keeping the smooth optimization landscape. Reported gains are modest.
NoGLU is an informal label some researchers attach to non-gated FFN replacements. A related direction is the ReLU Strikes Back paper, which keeps the gate but swaps SiLU for ReLU inside SwiGLU, producing a model with similar quality but much sparser activations that can be exploited at inference time.
Beta-tuned SwiGLU would mean using Swish_beta with a learned beta per layer, as in the original Ramachandran paper. No major LLM appears to do this; everyone uses beta = 1.
GLU with bias, retaining the bias terms in the gate and value projections, is a minor option enabled by some training frameworks. It is rarely worth the small additional parameter count.

There has also been work on extending SwiGLU-style gating to attention. Several recent papers experiment with multiplicative gates inside the attention sublayer rather than replacing the FFN, with mixed results.

limitations and criticism

SwiGLU is not without drawbacks:

Three matrices, not two. Memory bandwidth for the FFN block is higher than a non-gated FFN at equal hidden width. The 2/3 reduction normalises parameters but not bandwidth.
No hard zero. Unlike ReLU, which outputs an exact zero for every negative input, Swish is exactly zero only at x = 0. This kills the activation sparsity that some inference techniques rely on. Research like ReLU Strikes Back shows that retrofitting ReLU-style sparsity into a SwiGLU model is possible but requires re-training or fine-tuning.
Numerical sensitivity. Mixed-precision (FP16 or BF16) training of SwiGLU sometimes shows larger activation magnitudes than GELU, since the gate and value can both be large positive numbers and their product can grow accordingly. This is not catastrophic but means SwiGLU layers can need slightly more careful loss-scaling and norm clipping during training. FP8 training typically uses per-block scaling for SwiGLU activations.
Limited theoretical understanding. As Shazeer himself wrote, no convincing first-principles explanation exists for why GLU variants outperform pointwise activations. The empirical advantage is robust but small, and most theoretical accounts are post hoc.
Not better at every scale. Some work suggests the perplexity gap between SwiGLU and GELU narrows as data and compute increase. Whether SwiGLU is still better at the trillion-parameter scale is hard to study because the major labs only train one architecture per scale and rarely run controlled ablations.
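The "no hard zero" point in the list above can be illustrated by counting exact zeros on a grid of gate pre-activations. The grid is an arbitrary stand-in for real pre-activation values, which would come from a trained model.

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    return max(0.0, z)


def count_exact_zeros(activation, inputs) -> int:
    """Number of inputs that the activation maps to exactly 0.0."""
    return sum(1 for z in inputs if activation(z) == 0.0)


# 101 evenly spaced pre-activations from -5.0 to 5.0
pre_activations = [(i - 50) / 10.0 for i in range(101)]

relu_zeros = count_exact_zeros(relu, pre_activations)
silu_zeros = count_exact_zeros(silu, pre_activations)
print("ReLU exact zeros:", relu_zeros, "of", len(pre_activations))
print("SiLU exact zeros:", silu_zeros, "of", len(pre_activations))
```

With a ReLU gate about half of the entries can be skipped outright; with SiLU the negative entries are small but nonzero, so a sparsity-aware kernel would need a threshold rather than an exact-zero test.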

None of these are dealbreakers, and they have not slowed adoption.
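The "no hard zero" point in the list above is easy to check numerically. The following sketch draws synthetic Gaussian pre-activations (a stand-in assumption, not measurements from any real model) and counts how many outputs are exactly zero under ReLU and under SiLU; the helper zero_fraction is hypothetical.

```python
import math
import random


def relu(x: float) -> float:
    return max(0.0, x)


def silu(x: float) -> float:
    # x * sigmoid(x), written as x / (1 + exp(-x)); fine for the moderate inputs used here.
    return x / (1.0 + math.exp(-x))


def zero_fraction(fn, xs) -> float:
    """Fraction of outputs that are exactly zero."""
    outputs = [fn(x) for x in xs]
    return sum(1 for y in outputs if y == 0.0) / len(outputs)


rng = random.Random(0)  # fixed seed so the run is repeatable
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

relu_sparsity = zero_fraction(relu, pre_activations)
silu_sparsity = zero_fraction(silu, pre_activations)

print(f"ReLU exact zeros: {relu_sparsity:.1%}")
print(f"SiLU exact zeros: {silu_sparsity:.1%}")
```

Roughly half of the ReLU outputs are exact zeros that a sparse kernel could skip, while SiLU maps negative inputs to small negative values instead, so nothing can be skipped without an approximation threshold.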

historical timeline

Year	Event
2007	Mnih and Hinton describe a bilinear product of two linear projections in Three new graphical models for statistical language modelling.
2016	Hendrycks and Gimpel propose GELU (arXiv:1606.08415); they also mention x times σ(x) in passing.
2016	Dauphin, Fan, Auli, and Grangier propose GLU for convolutional language modeling (arXiv:1612.08083).
2017	Elfwing, Uchibe, and Doya propose SiLU for RL (arXiv:1702.03118).
2017	Ramachandran, Zoph, and Le rediscover SiLU as Swish via neural architecture search (arXiv:1710.05941).
2017	Vaswani et al. publish Attention Is All You Need; the original transformer FFN uses ReLU.
2019	Raffel et al. release T5; FFN uses ReLU (no bias). T5 v1.1 later switches to GEGLU.
2020	Shazeer publishes GLU Variants Improve Transformer (arXiv:2002.05202), introducing SwiGLU.
2022	Google's PaLM (arXiv:2204.02311) uses SwiGLU at 540B scale.
2023	Meta's LLaMA (arXiv:2302.13971) uses SwiGLU with the 8/3 rule and no bias; the open-source release locks in the LLaMA-style FFN as the de facto standard.
2023	Llama 2, Mistral 7B, Mixtral 8x7B, Yi-34B, Baichuan 2 all ship with SwiGLU.
2024	Mirzadeh et al. publish ReLU Strikes Back (ICLR 2024), questioning whether SwiGLU is necessary.
2024	Llama 3, DeepSeek V2/V3, Qwen2, OLMo, Falcon 3, Phi-3, InternLM 2 all use SwiGLU; Gemma uses GEGLU.
2024	DeepSeek-V3 documents SwiGLU activation recomputation in FP8 training.
2025	Continued research on SwiGLU variants (Masked GLU, gated ReLU, NoGLU); SwiGLU remains the production default.
2026	SwiGLU is essentially universal in open-weight LLMs except for Google's Gemma line.

references

Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv:2002.05202. https://arxiv.org/abs/2002.05202
Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2016). Language Modeling with Gated Convolutional Networks. arXiv:1612.08083. https://arxiv.org/abs/1612.08083
Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for Activation Functions. arXiv:1710.05941. https://arxiv.org/abs/1710.05941
Elfwing, S., Uchibe, E., & Doya, K. (2017). Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning. arXiv:1702.03118. https://arxiv.org/abs/1702.03118
Hendrycks, D. & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415. https://arxiv.org/abs/1606.08415
Vaswani, A., et al. (2017). Attention Is All You Need. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
Raffel, C., et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683. https://arxiv.org/abs/1910.10683
Chowdhery, A., et al. (2022). PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311. https://arxiv.org/abs/2204.02311
Touvron, H., et al. (2023). LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971. https://arxiv.org/abs/2302.13971
Touvron, H., et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288. https://arxiv.org/abs/2307.09288
Llama Team, AI @ Meta (2024). The Llama 3 Herd of Models. arXiv:2407.21783. https://arxiv.org/abs/2407.21783
Jiang, A. Q., et al. (2023). Mistral 7B. arXiv:2310.06825. https://arxiv.org/abs/2310.06825
Jiang, A. Q., et al. (2024). Mixtral of Experts. arXiv:2401.04088. https://arxiv.org/abs/2401.04088
DeepSeek-AI (2024). DeepSeek-V3 Technical Report. arXiv:2412.19437. https://arxiv.org/abs/2412.19437
Yang, A., et al. (2024). Qwen2 Technical Report. arXiv:2407.10671. https://arxiv.org/abs/2407.10671
Gemma Team (2024). Gemma: Open Models Based on Gemini Research and Technology. arXiv:2403.08295. https://arxiv.org/abs/2403.08295
Mirzadeh, I., et al. (2024). ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models. ICLR 2024. arXiv:2310.04564. https://arxiv.org/abs/2310.04564
Mnih, A. & Hinton, G. (2007). Three new graphical models for statistical language modelling. ICML 2007.
Wang, A., et al. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv:1804.07461. https://arxiv.org/abs/1804.07461
Wang, A., et al. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. arXiv:1905.00537. https://arxiv.org/abs/1905.00537
PyTorch documentation, torch.nn.functional.silu. https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html
PyTorch issue #128712, Add SwiGLU activation function. https://github.com/pytorch/pytorch/issues/128712
xFormers documentation, SwiGLU operator. https://facebookresearch.github.io/xformers/components/ops.html
Liger Kernel (LinkedIn). https://github.com/linkedin/Liger-Kernel
