SwiGLU
Last reviewed
May 9, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v2 · 6,176 words
SwiGLU, short for Swish-Gated Linear Unit, is an activation function used inside the feed-forward sublayer of transformer models. It is a member of the GLU family, replacing the sigmoid gate of the original Gated Linear Unit with Swish (also known as SiLU). SwiGLU was introduced by Noam Shazeer in the February 2020 paper GLU Variants Improve Transformer (arXiv:2002.05202), where it and GEGLU produced the lowest validation perplexities among the variants tested on a T5 base model. After PaLM adopted it in 2022 and LLaMA followed in early 2023, SwiGLU quickly became the default feed-forward activation in most new large language models, including Mistral 7B, the Falcon 3 family, DeepSeek, Qwen, and OLMo.
The paper's recommendation was deliberately ambivalent. Shazeer found that several gated variants (GEGLU, SwiGLU, ReGLU, and even the activation-free Bilinear) all beat the standard ReLU and GELU baselines, and the differences between them were within experimental noise. The community eventually settled on SwiGLU through a combination of PaLM's high-profile adoption, LLaMA's open-source release, and the resulting tooling momentum. Once Hugging Face Transformers, llama.cpp, vLLM, and the major training stacks were tuned for the SwiGLU shape, switching back to a non-gated FFN stopped being free. As of 2026 every leading open-weight foundation model except for Google's Gemma line uses SwiGLU as its FFN nonlinearity.
| Introduced | February 2020 |
| Paper | GLU Variants Improve Transformer |
| arXiv ID | 2002.05202 |
| Author | Noam Shazeer (Google) |
| Type | Gated activation, GLU family |
| Role in transformer | Replaces the single nonlinearity inside the position-wise FFN |
| Component activation | Swish (SiLU), usually with beta = 1 |
| Hidden dimension factor | 2/3 of the standard 4d (i.e. 8/3 times d) to keep parameter count comparable |
| Matrices in FFN | 3 (gate W, value V, down W2), vs 2 in a standard FFN |
| Used in | PaLM, LLaMA 1/2/3, Mistral 7B, Mixtral, Falcon 3, DeepSeek (V2/V3), Qwen2/Qwen3, OLMo |
| PyTorch primitive | torch.nn.functional.silu |
| C4 log-perplexity in original paper (524k steps) | 1.636 (vs 1.677 ReLU baseline) |
| GLUE average in original paper | 84.36 (second to ReGLU's 84.67) |
A standard transformer block contains two sublayers: multi-head self-attention and a position-wise feed-forward network (FFN). The FFN, introduced in Attention Is All You Need (Vaswani et al., 2017), is a small two-layer multilayer perceptron applied independently to every position. In matrix form it is
FFN(x) = max(0, xW_1 + b_1) W_2 + b_2
The inner dimension d_ff is conventionally four times the model dimension d_model, so for d_model = 512 the hidden layer has 2048 units. The original paper used ReLU. Later work swapped ReLU for GELU, giving
FFN_GELU(x) = GELU(xW_1 + b_1) W_2 + b_2
GELU was the default in BERT, GPT-2, GPT-3, and RoBERTa. The original T5 kept ReLU but dropped the bias terms, simplifying the formula to FFN_ReLU(x) = max(0, xW_1) W_2. The structure is the same in every case: project up to a wider hidden dimension, apply a pointwise nonlinearity, project back down. Two matrix multiplies, one elementwise nonlinearity. For most of the late 2010s this was the only FFN design anyone seriously considered.
The FFN is a surprisingly large fraction of the parameters and FLOPs of a transformer. With d_ff = 4 times d_model, the FFN sublayer holds roughly 8 times d_model squared parameters per layer, while the attention sublayer (with four projection matrices of size d_model by d_model) holds 4 times d_model squared. So the FFN is about two thirds of the non-embedding compute and parameters. Improving the FFN therefore moves the needle on overall model quality more than tweaking attention typically does.
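The two-thirds figure follows directly from counting weight matrices. The sketch below is a back-of-the-envelope count that ignores biases, layer norms, and embeddings; the function name `block_param_counts` is a hypothetical helper, not part of any library.

```python
def block_param_counts(d_model: int, ffn_multiplier: int = 4) -> dict:
    """Count weight parameters in one transformer block (no biases, no norms)."""
    d_ff = ffn_multiplier * d_model
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    ffn = 2 * d_model * d_ff            # up projection plus down projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }


counts = block_param_counts(d_model=512)
print(counts)
```

With d_model = 512 the FFN holds twice as many weights as attention, so its share of the block is 8 / (8 + 4) = 2/3 regardless of d_model.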
Gated variants change the FFN topology. They split the up-projection in two and use one half to gate the other. The first such variant in modern deep learning was the GLU, introduced for convolutional language modeling in 2016.
The Gated Linear Unit was proposed by Yann Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research in Language Modeling with Gated Convolutional Networks (arXiv:1612.08083, December 2016). They were trying to make convolutional networks competitive with LSTMs on word-level language modeling, and the gating mechanism was their answer to the long-range information flow problem. The basic GLU is
GLU(x, W, V, b, c) = (xW + b) ⊗ σ(xV + c)
where ⊗ is elementwise (Hadamard) multiplication and σ is the sigmoid function. One linear projection produces the candidate values, the other produces a sigmoid gate that decides how much of each value passes through. The Dauphin paper argued that this gating gave a linear path for gradients (via the value branch) while keeping enough nonlinearity (via the gate) for expressivity. The model achieved state of the art on WikiText-103 and was competitive with LSTMs on the Google Billion Words benchmark, which at the time felt surprising for a non-recurrent architecture.
Dauphin et al. also defined the Bilinear layer, which is GLU with the gating activation removed:
Bilinear(x, W, V, b, c) = (xW + b) ⊗ (xV + c)
They credited this construction to Mnih and Hinton's 2007 paper Three new graphical models for statistical language modelling, which is the deepest historical root of the entire GLU family. The Bilinear layer is purely multiplicative interaction between two linear projections, and as Shazeer would later show, it is already strong enough to outperform a non-gated ReLU FFN.
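The GLU and Bilinear formulas differ only in whether a sigmoid is applied to the second projection. A minimal pure-Python sketch, assuming a row vector x and weight matrices stored as lists of rows (the helper names `linear`, `glu`, and `bilinear` are illustrative, not from any library):

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, weights, bias):
    """Compute xW + b for a row vector x and a matrix stored as rows."""
    return [
        sum(x[i] * weights[i][j] for i in range(len(x))) + bias[j]
        for j in range(len(bias))
    ]


def glu(x, W, V, b, c):
    """(xW + b) multiplied elementwise by sigmoid(xV + c)."""
    values = linear(x, W, b)
    gates = [sigmoid(g) for g in linear(x, V, c)]
    return [value * gate for value, gate in zip(values, gates)]


def bilinear(x, W, V, b, c):
    """(xW + b) multiplied elementwise by (xV + c), with no activation."""
    return [p * q for p, q in zip(linear(x, W, b), linear(x, V, c))]


x = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]

# A zero gate projection gives sigmoid(0) = 0.5, so every value is halved.
glu_out = glu(x, identity, zeros, no_bias, no_bias)
# With both projections set to the identity, Bilinear squares each input.
bilinear_out = bilinear(x, identity, identity, no_bias, no_bias)
print(glu_out, bilinear_out)
```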
The core idea is simple: gating turns a linear projection into a multiplicative interaction without changing the parameter count by much. SwiGLU and its siblings just swap the sigmoid for a different gate activation.
The gate function used by SwiGLU is Swish, also called SiLU. The two names refer to the same function, with a small distinction over the beta parameter that almost nobody respects in practice.
SiLU (Sigmoid Linear Unit, also called sigmoid-weighted linear unit) was first defined by Stefan Elfwing, Eiichi Uchibe, and Kenji Doya at OIST in Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning (arXiv:1702.03118, February 2017). They proposed it for value function approximation in deep reinforcement learning, where the smooth derivative was useful for stable training. The function is
SiLU(x) = x · σ(x)
where σ is the sigmoid. Elfwing et al. also defined dSiLU, the derivative of SiLU, and proposed it as a smooth replacement for the sigmoid in RL value heads. The name SiLU is the one that ended up in modern PyTorch and JAX.
Independently, Dan Hendrycks and Kevin Gimpel had already mentioned the same function as a close relative of GELU in Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units (arXiv:1606.08415, June 2016). They observed that x times σ(x) is a smooth gating function similar to the GELU they were proposing, with the logistic sigmoid standing in for the Gaussian CDF.
Swish was named in Searching for Activation Functions by Prajit Ramachandran, Barret Zoph, and Quoc V. Le of Google Brain (arXiv:1710.05941, October 2017). The team ran an automated architecture search over scalar activation functions. The best discovered candidate was
Swish_β(x) = x · σ(βx)
with β either fixed or learned. When β = 1 the function is identical to Elfwing's SiLU. The Ramachandran paper explicitly acknowledges this. When β is learned per-layer, the function can interpolate between near-linear behavior (small β) and a hard ReLU-like cutoff (large β).
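The role of β is easy to check numerically. The sketch below is illustrative only: the β values 1e-6 and 50 are arbitrary stand-ins for "small" and "large", and the function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    # Numerically stable for large negative and large positive inputs.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    return x * sigmoid(beta * x)


def silu(x: float) -> float:
    return x * sigmoid(x)


def relu(x: float) -> float:
    return max(0.0, x)


samples = [-3.0, -1.0, 0.5, 2.0]
for x in samples:
    print(
        f"x={x:+.1f}  beta=1: {swish(x, 1.0):+.4f}  "
        f"beta->0: {swish(x, 1e-6):+.4f} (x/2={x / 2:+.4f})  "
        f"beta=50: {swish(x, 50.0):+.4f} (relu={relu(x):+.4f})"
    )
```

At β = 1 the function coincides with SiLU, for very small β it approaches the linear function x/2, and for large β it approaches ReLU.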
In the SwiGLU paper Shazeer used Swish_1, i.e. SiLU. Every public LLM that says it uses SwiGLU is using SiLU as the gate. PyTorch exposes the function as torch.nn.functional.silu, and most LLaMA-style codebases call it SiLU even though the upstream paper trail uses Swish. This article reserves Swish for the general form with variable β and uses SiLU for the β = 1 case found in production.
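Swish has a few quantitative properties worth keeping in mind because they show up in optimization arguments later: it is smooth, unbounded above, bounded below at roughly -0.28, and non-monotonic for negative inputs.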
The non-monotonic dip near x = -1.28 is what most theoretical accounts focus on, since it is the property that distinguishes Swish from a plain smoothed ReLU. The argument is that the dip lets the function represent locally non-monotonic behavior and supplies extra gradient information in the negative pre-activation regime, which can help models escape saddle regions during training.
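The location and depth of the dip can be recovered numerically. The sketch below bisects on the derivative of SiLU, σ(x)(1 + x(1 − σ(x))), over the bracket [-2, -1]; the bracket and iteration count are choices made for this illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def silu(z: float) -> float:
    return z * sigmoid(z)


def silu_derivative(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))


def find_silu_minimum(lo: float = -2.0, hi: float = -1.0, iterations: int = 60) -> float:
    """Bisect on the derivative, which is negative at lo and positive at hi."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if silu_derivative(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


x_min = find_silu_minimum()
print(f"argmin ~ {x_min:.4f}, minimum value ~ {silu(x_min):.4f}")
```

The minimum sits near x = -1.28 with a value near -0.28, which is also the lower bound of the gate in SwiGLU.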
Shazeer combined the GLU template with Swish, dropped the bias terms (since transformers usually run without biases inside the FFN), and called the result SwiGLU. The general definition from the 2020 paper is
SwiGLU(x, W, V, b, c, β) = Swish_β(xW + b) ⊗ (xV + c)
which expands to
SwiGLU(x, W, V, b, c, β) = ((xW + b) · σ(β(xW + b))) ⊗ (xV + c)
In practice almost everyone uses β = 1 and omits the bias vectors. The version actually deployed in PaLM, LLaMA, and friends is
SwiGLU(x, W, V) = SiLU(xW) ⊗ xV
Note that SwiGLU as usually written is not really a pointwise activation. It is a small parameterised module: it owns two weight matrices W and V. To get a drop-in replacement for the FFN, you wrap it with a final down-projection W2. That gives the FFN_SwiGLU layer.
A useful way to read the formula is to call xW the gate pre-activation (what gets squashed by Swish to produce the gate) and xV the value branch (the linear path). The gate values are roughly in the range -0.28 to plus infinity, and they multiply the value branch elementwise. Where the gate is near zero, the value is suppressed. Where the gate is near one, the value passes through nearly unchanged. Where the gate is large, the value is amplified.
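The three gate regimes can be seen with a toy example. In the sketch below the weights are hand-picked (an assumption for illustration) so that the gate pre-activations are -6, about 1.28, and 5 while the value branch is 2 in every channel.

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(x, M):
    """Row vector times a matrix stored as a list of rows."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def swiglu(x, W, V):
    """SiLU(xW) multiplied elementwise by xV, with no biases."""
    gate = [silu(g) for g in matvec(x, W)]
    value = matvec(x, V)
    return [g * v for g, v in zip(gate, value)]


x = [1.0]
W = [[-6.0, 1.2785, 5.0]]   # gate pre-activations
V = [[2.0, 2.0, 2.0]]       # value branch, identical in every channel
out = swiglu(x, W, V)
print([round(o, 3) for o in out])
```

The first channel is suppressed to nearly zero, the second passes the value 2 through almost unchanged because SiLU(1.2785) is close to 1, and the third is amplified to nearly 10.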
The full feed-forward block becomes
FFN_SwiGLU(x, W, V, W_2) = (Swish_1(xW) ⊗ xV) W_2
Three matrices instead of two. If you keep the inner width the same, the parameter count and FLOP count both increase by 50%, which is not a fair comparison against a baseline FFN. To control for this, Shazeer reduced the inner width by a factor of 2/3, so the SwiGLU FFN ends up with roughly the same parameter and compute budget as a ReLU or GELU FFN of width 4 times d_model.
This 2/3 factor is why LLaMA, Mistral, and most other SwiGLU models use an inner FFN width of 8/3 times d_model, then round to a convenient multiple (typically 256 or 128) for hardware efficiency. Touvron et al. write in the LLaMA paper, "We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3 4d instead of 4d as in PaLM." PaLM kept the inner width at 4 times d_model and ate the extra parameters; LLaMA chose to match the budget instead. Both are valid choices.
The parameter budget for one FFN block, ignoring biases, works out to roughly:
| Variant | Inner width | Parameters per FFN |
|---|---|---|
| Standard FFN (ReLU/GELU) | 4d | 2 times d times 4d = 8d squared |
| SwiGLU FFN, equal width | 4d | 3 times d times 4d = 12d squared |
| SwiGLU FFN, 2/3 reduction | 8d/3 | 3 times d times 8d/3 = 8d squared |
So the LLaMA-style SwiGLU FFN has the same nominal parameter count as a baseline FFN, just split across three thinner matrices instead of two wider ones.
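The three rows of the table can be checked with a few lines of arithmetic. The sketch uses d = 768 purely because 8d/3 is then an integer (2048); `ffn_params` is a hypothetical helper.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count for one FFN block, ignoring biases."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff


d = 768
standard = ffn_params(d, 4 * d, gated=False)           # 8 d^2
swiglu_equal_width = ffn_params(d, 4 * d, gated=True)  # 12 d^2
swiglu_reduced = ffn_params(d, 8 * d // 3, gated=True) # 8 d^2

print(standard, swiglu_equal_width, swiglu_reduced)
print("equal-width overhead:", swiglu_equal_width / standard)
```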
For concrete numbers, here are the SwiGLU FFN dimensions used by several real models. Note that the convention varies: LLaMA rounds 8/3 times d_model up to a multiple of 256, Mistral 7B and Llama 3 8B use 14336 (3.5 times d_model) rather than the rounded 8/3 value, and PaLM keeps the unreduced 4d.
| Model | d_model | Computed 8/3 times d | Actual d_ff | Multiple |
|---|---|---|---|---|
| LLaMA 7B | 4096 | 10923 | 11008 | 256 |
| LLaMA 13B | 5120 | 13653 | 13824 | 256 |
| LLaMA 65B | 8192 | 21845 | 22016 | 256 |
| Mistral 7B | 4096 | 10923 | 14336 | 128 |
| Mixtral 8x7B (per expert) | 4096 | 10923 | 14336 | 128 |
| Llama 3 8B | 4096 | 10923 | 14336 | 1024 |
| Llama 3 70B | 8192 | 21845 | 28672 | 1024 |
| PaLM 540B | 18432 | 49152 | 73728 | 4d (no reduction) |
Llama 3 and Mistral both use noticeably larger inner widths than the strict 8/3 rule predicts (3.5 times d_model in the rows above), partly to compensate for parameters saved by Grouped-Query Attention (GQA), and partly because d_ff is a convenient hyperparameter to tune. The 8/3 figure is the right starting point but not a sacred constant.
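The LLaMA rows of the table can be reproduced with a short rounding helper. This is a sketch of the convention as described above (truncate 2/3 of 4d, then round up to a multiple); `llama_style_d_ff` is a hypothetical name, and the last lines simply compare the listed widths of the other models against d_model.

```python
def llama_style_d_ff(d_model: int, multiple_of: int = 256) -> int:
    """Truncate 2/3 of 4 * d_model, then round up to the next multiple."""
    raw = int(2 * (4 * d_model) / 3)
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)


for d_model in (4096, 5120, 8192):
    print(d_model, llama_style_d_ff(d_model))

# Widths listed in the table for Mistral 7B / Llama 3 8B and Llama 3 70B.
ratio_8b = 14336 / 4096
ratio_70b = 28672 / 8192
print(ratio_8b, ratio_70b)
```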
Shazeer evaluated each variant on a T5 base configuration (12 encoder and 12 decoder layers, d_model = 768, d_ff = 3072 for non-gated variants, d_ff = 2048 for gated variants) trained on the C4 corpus for 524,288 steps. The held-out log-perplexity numbers (lower is better) reported in the paper are below.
| FFN variant | Activation in gate | Log-perplexity at 65k steps | Log-perplexity at 524k steps |
|---|---|---|---|
| FFN_ReLU (baseline) | ReLU (no gate) | 1.997 | 1.677 |
| FFN_GELU | GELU (no gate) | 1.983 | 1.679 |
| FFN_Swish | Swish (no gate) | 1.994 | 1.683 |
| FFN_GLU | sigmoid | 1.982 | 1.663 |
| FFN_Bilinear | identity (no gate activation) | 1.960 | 1.648 |
| FFN_ReGLU | ReLU | 1.953 | 1.645 |
| FFN_GEGLU | GELU | 1.942 | 1.633 |
| FFN_SwiGLU | Swish | 1.944 | 1.636 |
GEGLU and SwiGLU are essentially tied, with both clearly beating any of the non-gated baselines. The improvement over GELU on log-perplexity is small in absolute terms (around 0.04 nats), but consistent and free in compute terms once you apply the 2/3 width reduction.
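To put the log-perplexity gap in more familiar units, the values can be exponentiated into perplexities. The sketch below uses only the 524k-step figures from the table; the dictionary name is an illustrative choice.

```python
import math

# Log-perplexity at 524k steps, copied from the table above (nats).
log_ppl_524k = {
    "FFN_ReLU": 1.677,
    "FFN_GELU": 1.679,
    "FFN_GEGLU": 1.633,
    "FFN_SwiGLU": 1.636,
}

perplexity = {name: math.exp(value) for name, value in log_ppl_524k.items()}
gap_vs_gelu = log_ppl_524k["FFN_GELU"] - log_ppl_524k["FFN_SwiGLU"]
relative_reduction = 1.0 - math.exp(-gap_vs_gelu)

for name, value in perplexity.items():
    print(f"{name}: perplexity ~ {value:.2f}")
print(f"SwiGLU vs GELU: {gap_vs_gelu:.3f} nats, {relative_reduction:.1%} lower perplexity")
```

A gap of about 0.04 nats corresponds to a perplexity that is roughly 4% lower.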
Shazeer also reported downstream task scores from fine-tuning on GLUE, SuperGLUE, and SQuAD. The averages were:
| FFN variant | GLUE avg | SuperGLUE avg | SQuAD F1 |
|---|---|---|---|
| FFN_ReLU | 83.80 | 72.76 | 90.87 |
| FFN_GELU | 83.86 | 72.98 | 90.79 |
| FFN_Swish | 83.60 | 72.40 | 90.76 |
| FFN_GLU | 84.20 | 73.95 | 90.69 |
| FFN_Bilinear | 83.79 | 73.81 | 91.06 |
| FFN_ReGLU | 84.67 | 73.66 | 91.18 |
| FFN_GEGLU | 84.12 | 73.96 | 91.12 |
| FFN_SwiGLU | 84.36 | 74.56 | 91.03 |
| Raffel et al. 2019 (T5 reference) | 83.28 | 71.36 | 88.81 |
SwiGLU posted the highest SuperGLUE average. ReGLU narrowly led on GLUE and SQuAD F1. The differences among the gated variants are comparable to the inter-run standard deviations Raffel et al. reported (about 0.24 for GLUE, 0.42 for SuperGLUE, 0.23 for SQuAD F1), so the headline of the paper is really that the entire gated family beats the non-gated baselines, not that any one gated variant is clearly best.
The paper does not actually argue for SwiGLU over GEGLU on principled grounds. Both worked, both were the recommendation. Subsequent practice tilted toward SwiGLU because PaLM and LLaMA chose it, and the ecosystem followed.
SwiGLU went from an obscure 5-page tech report in early 2020 to the dominant FFN variant in production by 2024. The inflection point was Google's PaLM in April 2022, which used SwiGLU at 540B scale and reported strong gains. Meta then chose SwiGLU for LLaMA in February 2023, and once the LLaMA weights were widely circulated, every fine-tuner and downstream researcher inherited the same FFN topology.
| Model | Year | Activation | Notes |
|---|---|---|---|
| PaLM | 2022 | SwiGLU | Inner width kept at 4 times d_model; 540B parameters |
| LLaMA (1) | 2023 | SwiGLU | Inner width 2/3 times 4d, beta = 1, no biases |
| LLaMA 2 | 2023 | SwiGLU | Same convention as LLaMA |
| LLaMA 3 | 2024 | SwiGLU | Carried over unchanged; intermediate size grew with GQA |
| Mistral 7B | 2023 | SwiGLU | Inner width 14336 with d_model 4096 |
| Mixtral 8x7B | 2023 | SwiGLU | Used inside each of the 8 experts |
| Falcon 3 | 2024 | SwiGLU | Falcon 1/2 had used GELU |
| DeepSeek V2 / V3 | 2024 | SwiGLU | DeepSeek-V3 caches SwiGLU input and recomputes output in the backward pass to save activation memory |
| Qwen 2 / Qwen 3 | 2024 | SwiGLU | Standard 8d/3 inner width with rounding |
| OLMo | 2024 | SwiGLU | Hidden size set to roughly 8d/3 rounded up to a multiple of 128 |
| StableLM 2 | 2024 | SwiGLU | Stability AI open weights |
| Phi-3 | 2024 | SwiGLU | Microsoft small model |
| Yi-34B | 2023 | SwiGLU | 01.AI |
| Baichuan 2 | 2023 | SwiGLU | Baichuan Inc. |
| InternLM 2 | 2024 | SwiGLU | Shanghai AI Lab |
| Gemma 1/2 | 2024 | GeGLU | Sibling of SwiGLU using GELU as the gate; Google's open-weight choice |
| GPT-2/GPT-3 | 2019/2020 | GELU | Pre-SwiGLU; OpenAI never publicly switched |
| Pythia | 2023 | GELU | EleutherAI's interpretability suite stayed on GELU for comparability |
Gemma is a useful counterexample: Google's open-weight model uses GeGLU rather than SwiGLU. The two perform almost identically in Shazeer's original benchmark, so this is not a serious quality difference, just a different bet. Gemma's gate uses the tanh approximation of GELU that GPT-2 popularised rather than the exact form, which has caused occasional inference inconsistencies when third-party runtimes substitute the exact version.
OpenAI's GPT-2 and GPT-3 are publicly documented as using GELU in the FFN; the company has not released architecture details for GPT-4 or its newer models. Anthropic has not published details of Claude's FFN either, so any claim about its activation function is speculation.
The following table summarises the family. All gated variants (GLU, ReGLU, GEGLU, SwiGLU, and Bilinear) require three matrices in the FFN block and double the up-projection parameters relative to a non-gated FFN of equal hidden width.
| Activation | Formula | Gated | Smooth | Used in |
|---|---|---|---|---|
| ReLU | max(0, x) | no | no | Original Transformer, vanilla T5 |
| GELU | x times Phi(x) | no | yes | BERT, GPT-2, GPT-3, RoBERTa |
| Swish / SiLU | x times σ(beta times x) | no | yes | EfficientNet (often), various RL agents |
| GLU | (xW + b) ⊗ σ(xV + c) | yes | yes | Dauphin et al. 2016 conv LM |
| ReGLU | max(0, xW + b) ⊗ (xV + c) | yes | no | Shazeer 2020 |
| GEGLU | GELU(xW + b) ⊗ (xV + c) | yes | yes | T5 v1.1, mT5, Gemma |
| SwiGLU | Swish(xW + b) ⊗ (xV + c) | yes | yes | PaLM, LLaMA family, Mistral, Falcon 3, DeepSeek, Qwen, OLMo |
| Bilinear | (xW + b) ⊗ (xV + c) | yes (linear gate) | yes | Studied in Shazeer 2020; rare in production |
| Mish | x times tanh(softplus(x)) | no | yes | YOLOv4 and several CV models |
| ELU | x if x>0 else alpha(exp(x)-1) | no | yes | Some early LM experiments |
SwiGLU sits at the intersection of two design choices: gated (vs. pointwise), and smooth (vs. piecewise linear). GEGLU shares both properties; the only difference is which smooth nonlinearity does the gating.
Nobody really knows why SwiGLU works. The most-quoted line in Shazeer's paper is the mock-modest disclaimer in the conclusions section:
"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."
The academic literature has produced a few plausible stories. The most common one is that the multiplicative gate gives the network a low-cost way to represent conditional computation: the value branch carries information forward, and the gate branch (a smooth, soft mask after Swish) decides which features matter for which input. This is essentially the same intuition Dauphin used to motivate the original GLU. Smooth gating activations like Swish and GELU avoid the dead-unit problem of ReGLU, and Swish's slight non-monotonic dip at small negative inputs may help with optimization.
A second story is expressivity. A gated FFN can represent products of features, not just additive combinations. Each output unit is a sum of inputs scaled by a learned mask, so it can selectively combine information across input dimensions in a way a single-projection ReLU cannot. The Bilinear baseline already enjoys this property, which is why its perplexity is closer to SwiGLU than to ReLU.
A more recent line of analysis comes from approximation theory. A 2026 preprint with the tongue-in-cheek title Divine Benevolence is an x-squared (arXiv:2602.14495) argued that GLU-style layers form piecewise quadratic approximators, while standard MLPs form piecewise linear approximators. Quadratic piecewise approximators have asymptotically better convergence rates for smooth target functions, which the authors offered as a partial explanation for the consistent perplexity advantage of GLU variants over plain MLPs. Whether this story holds up at very large scale remains open.
A fourth perspective is purely empirical: the gain over GELU on log-perplexity is small but the gain comes for free, the tooling is mature, and the inertia is enormous. Even practitioners who suspect the gain would disappear at trillion-token scale have little reason to fight the consensus.
Mirzadeh et al. published ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models (arXiv:2310.04564) at ICLR 2024. They showed that replacing the SiLU inside SwiGLU with a plain ReLU produces minimal quality loss while making activations highly sparse, which can be exploited to speed up inference by up to 3x in memory-bound regimes. The authors, at Apple, argued not that ReLU is intrinsically better but that the smoothness of SwiGLU may not be earning its keep, especially during inference. Subsequent work (e.g. ReLU2, gated ReLU) has continued this line.
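The sparsity argument rests on a simple property: a ReLU gate outputs exact zeros for every negative pre-activation, while a SiLU gate outputs small but nonzero values. The toy sketch below illustrates this on random Gaussian pre-activations, which is an assumption made for the illustration; it says nothing about the sparsity levels measured in trained models.

```python
import math
import random


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    return max(0.0, z)


def zero_fraction(values) -> float:
    """Fraction of entries that are exactly zero and could be skipped."""
    return sum(1 for v in values if v == 0.0) / len(values)


random.seed(0)
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_zero_fraction = zero_fraction([relu(z) for z in pre_activations])
silu_zero_fraction = zero_fraction([silu(z) for z in pre_activations])
print(f"ReLU gate zeros: {relu_zero_fraction:.1%}, SiLU gate zeros: {silu_zero_fraction:.1%}")
```

Every channel whose gate is exactly zero contributes nothing to the down projection, so the corresponding rows of the weight matrix do not need to be loaded at inference time.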
The current status is roughly: SwiGLU is clearly competitive, the difference from GEGLU is negligible, and the difference from a well-tuned GELU FFN is small but consistent enough that all the major labs default to SwiGLU now anyway. The decision is partly about momentum and tooling.
A minimal LLaMA-style SwiGLU feed-forward block looks like this. PyTorch's F.silu is exactly Swish with beta = 1, which is what every SwiGLU model in the table above uses.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SwiGLU_FFN(nn.Module):
        """LLaMA-style feed-forward block: down(SiLU(gate(x)) * value(x))."""

        def __init__(self, d_model: int, d_ff: int | None = None):
            super().__init__()
            # LLaMA convention: round 8/3 * d_model up to a multiple of 256
            if d_ff is None:
                d_ff = int(8 * d_model / 3)
                d_ff = ((d_ff + 255) // 256) * 256
            # Three bias-free projections: gate W, value V, and down W2
            self.w_gate = nn.Linear(d_model, d_ff, bias=False)
            self.w_value = nn.Linear(d_model, d_ff, bias=False)
            self.w_down = nn.Linear(d_ff, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # F.silu is Swish with beta = 1; the product gates the value branch
            return self.w_down(F.silu(self.w_gate(x)) * self.w_value(x))

A few notes on the implementation in practice:
- w_gate and w_value can be fused into a single 2 times d_ff projection and split afterwards, which is slightly faster on GPU. Several inference stacks do this and call the merged matrix gate_up_proj or similar.
- The outputs of silu and the elementwise product are the largest tensors in the block. DeepSeek-V3 reportedly avoids storing them by recomputing silu(w_gate(x)) and the elementwise product during the backward pass, trading a small amount of compute for a meaningful memory saving.
- PyTorch does not ship a built-in swiglu kernel; community kernels like xFormers' SwiGLU op or Triton implementations are common in production training stacks. There is an open issue in the PyTorch repo (#128712) requesting a built-in version.
- In JAX-based stacks the equivalent uses jax.nn.silu and jax.numpy.einsum, with the same three-matrix structure.
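Fusing the gate and value projections is purely a layout change: concatenating the two weight matrices column-wise and splitting the result gives the same numbers. A pure-Python sketch with tiny hand-written matrices (the names `W12`, `swiglu_separate`, and `swiglu_fused` are illustrative):

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(x, M):
    """Row vector times a matrix stored as a list of rows."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def swiglu_separate(x, W, V):
    gate = matvec(x, W)
    value = matvec(x, V)
    return [silu(g) * v for g, v in zip(gate, value)]


def swiglu_fused(x, W12, d_ff):
    # One projection of width 2 * d_ff, then split into gate and value halves.
    both = matvec(x, W12)
    gate, value = both[:d_ff], both[d_ff:]
    return [silu(g) * v for g, v in zip(gate, value)]


x = [1.0, -2.0]
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
V = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]
W12 = [w_row + v_row for w_row, v_row in zip(W, V)]  # shape d_model x (2 * d_ff)

separate_out = swiglu_separate(x, W, V)
fused_out = swiglu_fused(x, W12, d_ff=3)
print(separate_out)
print(fused_out)
```

On a GPU the benefit comes from launching one larger matrix multiplication instead of two and reading the input once.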
A naive PyTorch implementation of the SwiGLU FFN issues five separate GPU operations: two large matrix multiplications for the gate and value projections, one elementwise SiLU, one elementwise multiplication, and one large matrix multiplication for the down projection. The two elementwise operations between the projections are memory-bound on modern GPUs because they read and write the full d_ff-dimensional activation tensor without doing much arithmetic. Fusing them into a single kernel can roughly double or triple the throughput of that part of the block.
Several production stacks ship fused SwiGLU kernels:
- xFormers provides xformers.ops.SwiGLU with a packed weight format that concatenates the gate and value projections as a single weight tensor w12. The forward and backward passes are written in CUTLASS templates, which lets the kernel fuse the activation with the surrounding matrix multiplications on H100 and A100 GPUs.

The activation memory savings come from fusing the SiLU and the elementwise product into a single kernel that does not materialise the intermediate tensors. The key tensors are the gate pre-activation xW (size batch times sequence times d_ff), the value pre-activation xV (same size), and the gated output (same size). A non-fused implementation stores all three; a fused implementation can stream them through registers or shared memory and only write the output. For a Llama 3 70B forward pass at 8K context with batch 1, each of these tensors holds 8192 times 28672 elements, or about 0.47 GB in bf16, so skipping two of them saves close to 1 GB per layer, which compounds across the 80 layers.
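The size of these tensors is easy to estimate. The sketch below assumes 2-byte elements (bf16 or fp16), batch 1, an 8192-token context, and the Llama 3 70B inner width of 28672 listed earlier; it counts only the three d_ff-wide tensors named above.

```python
def activation_bytes(tokens: int, d_ff: int, bytes_per_element: int = 2) -> int:
    """Size of one (tokens x d_ff) activation tensor."""
    return tokens * d_ff * bytes_per_element


tokens = 1 * 8192   # batch 1, 8K context
d_ff = 28672        # Llama 3 70B inner width

per_tensor = activation_bytes(tokens, d_ff)
unfused = 3 * per_tensor   # gate pre-activation, value pre-activation, gated output
fused = 1 * per_tensor     # only the gated output is written
saved_per_layer = unfused - fused

gib = 2 ** 30
print(f"per tensor: {per_tensor / gib:.4f} GiB")
print(f"saved per layer: {saved_per_layer / gib:.4f} GiB")
print(f"saved across 80 layers: {80 * saved_per_layer / gib:.1f} GiB")
```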
For inference, the up-projection half of a SwiGLU FFN moves roughly twice the memory of a non-gated FFN of the same hidden width: two large weight matrices to load instead of one, and two d_ff-wide intermediate tensors to write instead of one, followed by the same down projection. Counting all three matrices, weight traffic is about 1.5 times that of the non-gated block at equal width. This is why SwiGLU LLMs that keep the full hidden width are more memory-bandwidth-bound than equivalent GELU LLMs, and why it helps to colocate the gate and value projections so that the input is read only once.
DeepSeek-V3's technical report (arXiv:2412.19437) describes a specific memory optimisation for SwiGLU during training. They cache only the input x to the SwiGLU operator and recompute both the gate activation silu(xW) and the gated product silu(xW) ⊗ xV during the backward pass. The cached tensor is small (d_model wide), while the recomputed tensors are large (d_ff wide, which is roughly 8/3 of d_model). The recomputation cost is tiny relative to the matrix multiplications surrounding it, so this trade is essentially free in compute and saves substantial memory.
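The recomputation idea can be sketched without any framework. The code below is a toy illustration, not DeepSeek's implementation: the forward pass keeps only x, the backward pass rebuilds the d_ff-wide intermediates from x, and a finite-difference check confirms the gradient. It computes only the gradient with respect to x and assumes the loss is the sum of the outputs; all function names are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def silu(z: float) -> float:
    return z * sigmoid(z)


def silu_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))


def matvec(x, M):
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def swiglu_forward(x, W, V):
    """Return the output and the only tensor kept for backward: the input x."""
    gate_pre = matvec(x, W)
    value = matvec(x, V)
    out = [silu(g) * v for g, v in zip(gate_pre, value)]
    return out, list(x)  # the d_ff-wide intermediates are dropped here


def swiglu_backward(saved_x, W, V, grad_out):
    """Recompute the intermediates from x, then return d(loss)/dx."""
    gate_pre = matvec(saved_x, W)
    value = matvec(saved_x, V)
    grad_gate_pre = [go * v * silu_grad(g) for go, v, g in zip(grad_out, value, gate_pre)]
    grad_value = [go * silu(g) for go, g in zip(grad_out, gate_pre)]
    return [
        sum(grad_gate_pre[j] * W[i][j] + grad_value[j] * V[i][j] for j in range(len(grad_out)))
        for i in range(len(saved_x))
    ]


x = [0.5, -1.0]
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5]]
V = [[1.0, 0.0, -1.0], [0.5, 2.0, 1.0]]

out, saved = swiglu_forward(x, W, V)
analytic = swiglu_backward(saved, W, V, grad_out=[1.0, 1.0, 1.0])

# Central finite differences on loss(x) = sum(SwiGLU(x)).
eps = 1e-6
numeric = []
for i in range(len(x)):
    plus = [xi + (eps if k == i else 0.0) for k, xi in enumerate(x)]
    minus = [xi - (eps if k == i else 0.0) for k, xi in enumerate(x)]
    numeric.append((sum(swiglu_forward(plus, W, V)[0]) - sum(swiglu_forward(minus, W, V)[0])) / (2 * eps))

print(analytic)
print(numeric)
```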
DeepSeek-V3 combines this trick with FP8 quantisation of the SwiGLU activations using their own fine-grained per-block scaling scheme. The combination is one of the design choices that lets DeepSeek-V3 train a 671B-parameter Mixture of Experts model with only 2.788M H800 GPU hours.
Generic activation checkpointing tools (such as PyTorch's torch.utils.checkpoint or Megatron-LM's --recompute-activations option) can apply the same trade automatically across the whole transformer block. When training memory is the bottleneck, recomputing SwiGLU forwards in the backward pass is one of the highest-value moves, since the FFN activations are the largest single tensor in a transformer block.
SwiGLU also appears inside the experts of Mixture of Experts (MoE) models. Mixtral 8x7B places eight independent SwiGLU FFNs in each transformer block; the router selects two of them per token, and only the selected experts are evaluated. DeepSeek-V2 and V3 use a finer-grained variant with hundreds of small SwiGLU experts and many active per token. In each case the SwiGLU formula is identical to the dense case; what changes is how many of the FFN sublayers are run. The total parameter count grows linearly in the number of experts, but the active compute is dominated by the few experts the router picks.
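A toy version of top-2 routing over SwiGLU experts is sketched below. It assumes the routing weights are a softmax over the selected logits only, uses four tiny hand-built experts, and the names `moe_forward` and `make_expert` are hypothetical. The closing arithmetic uses the Mixtral dimensions quoted earlier (d_model 4096, d_ff 14336, 8 experts, 2 active).

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(x, M):
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def swiglu_ffn(x, W, V, W2):
    hidden = [silu(g) * v for g, v in zip(matvec(x, W), matvec(x, V))]
    return matvec(hidden, W2)


def moe_forward(x, router, experts, top_k=2):
    """Evaluate only the top_k experts and mix their outputs."""
    logits = matvec(x, router)  # one logit per expert
    chosen = sorted(range(len(experts)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)
    exp_logits = {e: math.exp(logits[e] - peak) for e in chosen}
    total = sum(exp_logits.values())
    weights = {e: v / total for e, v in exp_logits.items()}
    out = [0.0] * len(x)
    for e in chosen:
        expert_out = swiglu_ffn(x, *experts[e])
        out = [o + weights[e] * y for o, y in zip(out, expert_out)]
    return out, chosen, weights


def make_expert(scale: float):
    """Toy expert with d_model = d_ff = 2: scaled identity W and V, identity W2."""
    scaled = [[scale, 0.0], [0.0, scale]]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    return scaled, scaled, identity


experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
router = [[0.1, 2.0, -1.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
x = [1.0, 0.5]

out, chosen, weights = moe_forward(x, router, experts)
print("chosen experts:", chosen, "weights:", weights)

# Parameter arithmetic for one Mixtral-style MoE layer.
per_expert_params = 3 * 4096 * 14336
total_expert_params = 8 * per_expert_params
active_expert_params = 2 * per_expert_params
active_fraction = active_expert_params / total_expert_params
print(per_expert_params, active_fraction)
```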
The choice of activation inside the experts is generally treated as a separate decision from the routing scheme, and SwiGLU is the default for the same reasons it is the default in dense models: it is the format the open-source training stacks are tuned for, and the empirical loss per parameter is at least as good as the alternatives.
Researchers have continued to explore the design space around SwiGLU, though the empirical advantage of any single variant remains small.
There has also been work on extending SwiGLU-style gating to attention. Several recent papers experiment with multiplicative gates inside the attention sublayer rather than replacing the FFN, with mixed results.
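SwiGLU is not without drawbacks: it needs a third weight matrix in every FFN, its elementwise stage is memory-bound unless fused, its activations are dense rather than sparse, and there is still no accepted explanation for why it works.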
None of these are dealbreakers, and they have not slowed adoption.
| Year | Event |
|---|---|
| 2007 | Mnih and Hinton describe a bilinear product of two linear projections in Three new graphical models for statistical language modelling. |
| 2016 | Hendrycks and Gimpel propose GELU (arXiv:1606.08415); they implicitly mention x times σ(x). |
| 2016 | Dauphin, Fan, Auli, and Grangier propose GLU for convolutional language modeling (arXiv:1612.08083). |
| 2017 | Elfwing, Uchibe, and Doya propose SiLU for RL (arXiv:1702.03118). |
| 2017 | Ramachandran, Zoph, and Le rediscover SiLU as Swish via neural architecture search (arXiv:1710.05941). |
| 2017 | Vaswani et al. publish Attention Is All You Need; the original transformer FFN uses ReLU. |
| 2019 | Raffel et al. release T5; FFN uses ReLU (no bias). T5 v1.1 later switches to GeGLU. |
| 2020 | Shazeer publishes GLU Variants Improve Transformer (arXiv:2002.05202), introducing SwiGLU. |
| 2022 | Google's PaLM (arXiv:2204.02311) uses SwiGLU at 540B scale. |
| 2023 | Meta's LLaMA (arXiv:2302.13971) uses SwiGLU with the 8/3 rule and no bias; the open-source release locks in the LLaMA-style FFN as the de facto standard. |
| 2023 | Llama 2, Mistral 7B, Mixtral 8x7B, Yi-34B, Baichuan 2 all ship with SwiGLU. |
| 2024 | Mirzadeh et al. publish ReLU Strikes Back (ICLR 2024), questioning whether SwiGLU is necessary. |
| 2024 | Llama 3, DeepSeek V2/V3, Qwen 2, OLMo, Falcon 3, Phi-3, InternLM 2 all use SwiGLU; Gemma uses GeGLU. |
| 2024 | DeepSeek-V3 documents SwiGLU activation recomputation in FP8 training. |
| 2025 | Continued research on SwiGLU variants (Masked GLU, gated ReLU, NoGLU); SwiGLU remains the production default. |
| 2026 | SwiGLU is essentially universal in open-weight LLMs except for Google's Gemma line. |
torch.nn.functional.silu. https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html