# SwiGLU

> Source: https://aiwiki.ai/wiki/swiglu
> Updated: 2026-07-11
> Categories: Deep Learning, Model Architecture, Neural Networks
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**SwiGLU** (Swish-Gated Linear Unit) is the [activation function](/wiki/activation_function) used inside the feed-forward sublayer of most modern [transformer](/wiki/transformers) large language models, including LLaMA, PaLM, Mistral, Qwen, and DeepSeek. It is a member of the GLU family that replaces the sigmoid gate of the original [Gated Linear Unit](/wiki/glu) with [Swish](/wiki/swish) (also known as [SiLU](/wiki/silu)), and it was introduced by [Noam Shazeer](/wiki/noam_shazeer) in the February 2020 paper *GLU Variants Improve Transformer* (arXiv:2002.05202), where it produced one of the two lowest validation perplexities among the eight variants tested on a [T5](/wiki/t5) base model.[1] In the LLaMA-style form deployed in production, SwiGLU is written as SiLU(xW) elementwise-times xV, followed by a down-projection, with the inner feed-forward width set to 8/3 of the model dimension so the parameter count matches a standard [ReLU](/wiki/relu) feed-forward block.[9]

After [PaLM](/wiki/palm) adopted it in April 2022 and [LLaMA](/wiki/llama) followed in February 2023, SwiGLU quickly became the default feed-forward activation in nearly every new large language model, including [Mistral](/wiki/mistral) 7B, the Falcon 3 family, [DeepSeek](/wiki/deepseek), [Qwen](/wiki/qwen), and [OLMo](/wiki/olmo).[8][9] As of 2026 every leading open-weight foundation model except for Google's [Gemma](/wiki/gemma) line uses SwiGLU as its feed-forward nonlinearity.[16]

The paper's recommendation was deliberately ambivalent. Shazeer found that several gated variants (GEGLU, SwiGLU, ReGLU, and even the activation-free Bilinear) all beat the standard ReLU and GELU baselines, and the differences between them were within experimental noise.[1] The community eventually settled on SwiGLU through a combination of PaLM's high-profile adoption, LLaMA's open-source release, and the resulting tooling momentum. Once Hugging Face Transformers, llama.cpp, vLLM, and the major training stacks were tuned for the SwiGLU shape, switching back to a non-gated feed-forward network stopped being free.

## Quick facts

| | |
|---|---|
| Introduced | February 2020 |
| Paper | *GLU Variants Improve Transformer* |
| arXiv ID | 2002.05202 |
| Author | Noam Shazeer (Google) |
| Type | Gated activation, GLU family |
| Role in transformer | Replaces the single nonlinearity inside the position-wise FFN |
| Component activation | Swish (SiLU), usually with $$\beta = 1$$ |
| Hidden dimension factor | 2/3 of the standard 4d (i.e. 8/3 times d) to keep parameter count comparable |
| Matrices in FFN | 3 (gate W, value V, down W2), vs 2 in a standard FFN |
| Used in | PaLM, LLaMA 1/2/3, Mistral 7B, Mixtral, Falcon 3, DeepSeek (V2/V3), Qwen2/Qwen3, OLMo |
| PyTorch primitive | `torch.nn.functional.silu` |
| C4 log-perplexity in original paper (524k steps) | 1.636 (vs 1.677 ReLU baseline) |
| GLUE average in original paper | 84.36 (second among tested variants, behind ReGLU at 84.67) |

## What is the standard transformer FFN?

A standard [transformer](/wiki/transformers) block contains two sublayers: multi-head self-attention and a position-wise feed-forward network (FFN). The FFN, introduced in *Attention Is All You Need* (Vaswani et al., 2017), is a small two-layer multilayer perceptron applied independently to every position.[6] In matrix form it is

$$
\mathrm{FFN}(x) = \max(0, xW_1 + b_1) W_2 + b_2
$$

The inner dimension d_ff is conventionally four times the model dimension d_model, so for d_model = 512 the hidden layer has 2048 units. The original paper used [ReLU](/wiki/relu).[6] Later work swapped ReLU for [GELU](/wiki/gelu), giving

$$
\mathrm{FFN}_{\mathrm{GELU}}(x) = \mathrm{GELU}(xW_1 + b_1) W_2 + b_2
$$

GELU was the default in BERT, GPT-2, GPT-3, and RoBERTa.[5] The original [T5](/wiki/t5) kept ReLU, and its codebase also dropped the bias terms, simplifying the formula to FFN_ReLU(x) = max(0, xW_1) W_2.[7] The structure is the same in every case: project up to a wider hidden dimension, apply a pointwise nonlinearity, project back down. Two matrix multiplies, one elementwise nonlinearity. For most of the late 2010s this was the only FFN design anyone seriously considered.

The FFN is a surprisingly large fraction of the parameters and FLOPs of a transformer. With d_ff = 4 times d_model, the FFN sublayer holds roughly 8 times d_model squared parameters per layer, while the attention sublayer (with four projection matrices of size d_model by d_model) holds 4 times d_model squared. So the FFN is about two thirds of the non-embedding compute and parameters. Improving the FFN therefore moves the needle on overall model quality more than tweaking attention typically does.
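The two-thirds figure can be checked with a few lines of arithmetic. The sketch below counts only the weight matrices named in the paragraph above and ignores biases, normalisation, and embeddings; the function name `block_param_counts` is made up for this illustration.

```python
def block_param_counts(d_model: int, ffn_mult: int = 4) -> dict:
    """Weight counts for one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model           # Q, K, V and output projections
    ffn = 2 * d_model * (ffn_mult * d_model)    # up-projection and down-projection
    return {
        "attention": attention,
        "ffn": ffn,
        "ffn_share": ffn / (attention + ffn),
    }

counts = block_param_counts(512)
print(counts)
```

For any d_model the share works out to 8 / (8 + 4), which is the two-thirds quoted above.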

Gated variants change the FFN topology. They split the up-projection in two and use one half to gate the other. The first such variant in modern deep learning was the [GLU](/wiki/glu), introduced for convolutional language modeling in 2016.[2]

## What is the original gated linear unit (GLU)?

The Gated Linear Unit was proposed by Yann Dauphin, Angela Fan, Michael Auli, and David Grangier of Facebook AI Research in *Language Modeling with Gated Convolutional Networks* (arXiv:1612.08083, December 2016).[2] They were trying to make convolutional networks competitive with LSTMs on word-level language modeling, and the gating mechanism was their answer to the long-range information flow problem. The basic GLU is

$$
\mathrm{GLU}(x, W, V, b, c) = (xW + b) \otimes \sigma(xV + c)
$$

where ⊗ is elementwise (Hadamard) multiplication and σ is the [sigmoid](/wiki/activation_function) function. One linear projection produces the candidate values, the other produces a sigmoid gate that decides how much of each value passes through. The Dauphin paper argued that this gating gave a linear path for gradients (via the value branch) while keeping enough nonlinearity (via the gate) for expressivity. The model achieved state of the art on WikiText-103 and was competitive with LSTMs on the Google Billion Words benchmark, which at the time felt surprising for a non-recurrent architecture.[2]

Dauphin et al. also defined the activation-free **Bilinear** layer, which is GLU without any gating activation (it still has both weight matrices):

$$
\mathrm{Bilinear}(x, W, V, b, c) = (xW + b) \otimes (xV + c)
$$

They credited this construction to Mnih and Hinton's 2007 paper *Three new graphical models for statistical language modelling*, which is the deepest historical root of the entire GLU family.[18] The Bilinear layer is purely multiplicative interaction between two linear projections, and as Shazeer would later show, it is already strong enough to outperform a non-gated ReLU FFN.[1]

The core idea is simple: gating turns a linear projection into a multiplicative interaction without changing the parameter count by much. SwiGLU and its siblings just swap the sigmoid for a different gate activation.
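As a rough illustration, the two formulas in this section can be written out in plain Python on small lists. This is a sketch for reading the equations, not how any framework implements them; the helper names `linear`, `glu`, and `bilinear` are chosen here for clarity.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def linear(x, W, b):
    """Row vector x (length n) times matrix W (n rows, m columns), plus bias b."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def glu(x, W, V, b, c):
    """GLU(x) = (xW + b) * sigmoid(xV + c), elementwise."""
    values = linear(x, W, b)
    gates = [sigmoid(z) for z in linear(x, V, c)]
    return [v * g for v, g in zip(values, gates)]

def bilinear(x, W, V, b, c):
    """Bilinear(x) = (xW + b) * (xV + c): same two projections, no gate activation."""
    return [p * q for p, q in zip(linear(x, W, b), linear(x, V, c))]

x = [1.0, -2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
no_bias = [0.0, 0.0]

# With V = 0 and c = 0 every gate is sigmoid(0) = 0.5, so GLU simply halves x.
half_open = glu(x, identity, zeros, no_bias, no_bias)
# With W = V = I the bilinear layer squares each coordinate.
squared = bilinear(x, identity, identity, no_bias, no_bias)
print(half_open, squared)
```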

## What is the difference between Swish and SiLU?

The gate function used by SwiGLU is **Swish**, also called **SiLU**. The two names refer to the same function, with a small distinction over the beta parameter that almost nobody respects in practice.

**SiLU** (Sigmoid Linear Unit, also called sigmoid-weighted linear unit) was first defined by Stefan Elfwing, Eiichi Uchibe, and Kenji Doya at OIST in *Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning* (arXiv:1702.03118, February 2017).[4] They proposed it for value function approximation in deep reinforcement learning, where the smooth derivative was useful for stable training. The function is

$$
\mathrm{SiLU}(x) = x \cdot \sigma(x)
$$

where σ is the sigmoid. Elfwing et al. also defined dSiLU, the derivative of SiLU, and proposed it as a smooth replacement for the sigmoid in RL value heads.[4] The name SiLU is the one that ended up in modern PyTorch and JAX.

Independently, Dan Hendrycks and Kevin Gimpel had already mentioned the same function as a special case of GELU in *Gaussian Error Linear Units (GELUs)* (arXiv:1606.08415, June 2016).[5] They observed that x times σ(x) is a smooth gating function similar to the GELU they were proposing.[5]

**Swish** was named in *Searching for Activation Functions* by Prajit Ramachandran, Barret Zoph, and Quoc V. Le of Google Brain (arXiv:1710.05941, October 2017).[3] The team ran an automated architecture search over scalar activation functions. The best discovered candidate was

$$
\mathrm{Swish}_\beta(x) = x \cdot \sigma(\beta x)
$$

with β either fixed or learned. When $$\beta = 1$$ the function is identical to Elfwing's SiLU. The Ramachandran paper explicitly acknowledges this.[3] When β is learned per-layer, the function can interpolate between near-linear behavior (small β) and a hard ReLU-like cutoff (large β).

In the SwiGLU paper Shazeer used Swish_1, i.e. SiLU.[1] Every public LLM that says it uses SwiGLU is using SiLU as the gate. PyTorch exposes the function as `torch.nn.functional.silu`, and most LLaMA-style codebases call it SiLU even though the upstream paper trail uses Swish.[21] In this article we use the two names interchangeably, with Swish reserved for the case where β is variable and SiLU for the $$\beta = 1$$ case used in production.
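The relationship between the two names is easy to see numerically. The following is a small sketch using only the standard library; `swish` and `silu` are plain Python functions written for this example, not the PyTorch ones.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(x: float, beta: float = 1.0) -> float:
    return x * sigmoid(beta * x)

def silu(x: float) -> float:
    return swish(x, beta=1.0)   # SiLU is Swish with beta fixed at 1

for beta in (1e-6, 1.0, 50.0):
    print(f"beta={beta:g}: swish(2)={swish(2.0, beta):.6f}  swish(-2)={swish(-2.0, beta):.6f}")
```

With a very small β the function behaves like x/2 (near-linear), and with a large β it is numerically indistinguishable from ReLU at these inputs.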

## Numerical properties of Swish

Swish has a few quantitative properties worth keeping in mind because they show up in optimization arguments later:

* It is C-infinity smooth (infinitely differentiable). Unlike ReLU, the derivative is well defined at every point.
* It is non-monotonic. The function dips below zero for negative arguments, reaching its global minimum of approximately -0.2785 at $$x \approx -1.2785$$, then asymptotically approaches zero as x goes to negative infinity.[3]
* For large positive x it asymptotes to the identity, like ReLU.
* The derivative is $$\mathrm{Swish}'(x) = \sigma(\beta x) + \beta x \, \sigma(\beta x)(1 - \sigma(\beta x))$$. At $$\beta = 1$$ this is bounded above by about 1.0998 (attained near $$x \approx 2.4$$). The derivative is zero only at the single point $$x \approx -1.2785$$ (the minimum) and is nonzero on either side of it, so units do not die the way pure ReLU units, whose gradient is exactly zero for every negative input, sometimes do.
* Swish is self-gated: the input itself, after a sigmoid squashing, controls how much of the input passes through. This is conceptually similar to LSTM gating but uses no extra parameters.[3]

The non-monotonic dip near $$x = -1.28$$ is what most theoretical accounts focus on, since it is the property that distinguishes Swish from a plain smoothed ReLU. The argument is that the dip lets the function represent locally non-monotonic behavior and supplies extra gradient information in the negative pre-activation regime, which can help models escape saddle regions during training.
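The constants quoted in the list above can be reproduced with a brute-force grid search. This is a sketch at β = 1; the grid range and step size are arbitrary choices made for this example.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def silu(x: float) -> float:
    return x * sigmoid(x)

def silu_grad(x: float) -> float:
    s = sigmoid(x)
    return s + x * s * (1.0 - s)   # the derivative formula above with beta = 1

# Grid on [-6, 6] with step 1e-4 (an arbitrary but sufficient resolution).
xs = [i / 10000 for i in range(-60000, 60001)]
x_min = min(xs, key=silu)          # location of the global minimum of SiLU
x_peak = max(xs, key=silu_grad)    # location of the largest slope

print(f"min SiLU  = {silu(x_min):.4f} at x = {x_min:.4f}")
print(f"max SiLU' = {silu_grad(x_peak):.4f} at x = {x_peak:.4f}")
```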

## What is the SwiGLU formula?

Shazeer combined the GLU template with Swish and called the result SwiGLU; the feed-forward layers built from it drop the bias terms, since transformers usually run without biases inside the FFN.[1] The general definition from the 2020 paper, with biases included, is

$$
\mathrm{SwiGLU}(x, W, V, b, c, \beta) = \mathrm{Swish}_\beta(xW + b) \otimes (xV + c)
$$

which expands to

$$
\mathrm{SwiGLU}(x, W, V, b, c, \beta) = ((xW + b) \cdot \sigma(\beta(xW + b))) \otimes (xV + c)
$$

In practice almost everyone uses β = 1 and omits the bias vectors. The version actually deployed in PaLM, LLaMA, and friends is

$$
\mathrm{SwiGLU}(x, W, V) = \mathrm{SiLU}(xW) \otimes xV
$$

Note that SwiGLU as usually written is not really a pointwise activation. It is a small parameterised module: it owns two weight matrices W and V. To get a drop-in replacement for the FFN, you wrap it with a final down-projection W2. That gives the FFN_SwiGLU layer.

A useful way to read the formula is to call xW the **gate pre-activation** (what gets squashed by Swish to produce the gate) and xV the **value branch** (the linear path). The gate values are roughly in the range -0.28 to plus infinity, and they multiply the value branch elementwise. Where the gate is near zero, the value is suppressed. Where the gate is near one, the value passes through nearly unchanged. Where the gate is large, the value is amplified.
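The three regimes described above (suppress, pass through, amplify) can be seen on hand-picked numbers. In this sketch the gate pre-activations and values are given directly rather than computed from x, and the inputs are chosen purely for illustration.

```python
import math

def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))

def swiglu_elementwise(gate_pre, value):
    """SiLU(gate_pre) * value, elementwise: the part of SwiGLU after the two projections."""
    return [silu(g) * v for g, v in zip(gate_pre, value)]

gate_pre = [-8.0, 0.0, 1.2785, 6.0]   # stands in for entries of xW
value = [3.0, 3.0, 3.0, 3.0]          # stands in for entries of xV
out = swiglu_elementwise(gate_pre, value)

for g, v, o in zip(gate_pre, value, out):
    print(f"gate_pre={g:7.4f}  gate={silu(g):7.4f}  value={v:.1f}  output={o:8.4f}")
```

The first two entries are suppressed to roughly zero, the third passes through almost unchanged because SiLU is close to 1 there, and the last is amplified.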

## How does SwiGLU fit inside the transformer FFN?

The full feed-forward block becomes

$$
\mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = (\mathrm{Swish}_1(xW) \otimes xV) W_2
$$

Three matrices instead of two. If you keep the inner width the same, the parameter count and FLOP count both increase by 50%, which is not a fair comparison against a baseline FFN. To control for this, Shazeer reduced the inner width by a factor of 2/3, so the SwiGLU FFN ends up with roughly the same parameter and compute budget as a ReLU or GELU FFN of width 4 times d_model.[1]

This 2/3 factor is why LLaMA, Mistral, and most other SwiGLU models use an inner FFN width of 8/3 times d_model, then round to a convenient multiple (typically 256 or 128) for hardware efficiency. Touvron et al. write in the LLaMA paper, "We replace the ReLU non-linearity by the SwiGLU activation function, introduced by Shazeer (2020) to improve the performance. We use a dimension of 2/3 4d instead of 4d as in PaLM."[9] PaLM kept the inner width at 4 times d_model and ate the extra parameters; LLaMA chose to match the budget instead.[8][9] Both are valid choices.

The parameter budget for one FFN block, ignoring biases, works out to roughly:

| Variant | Inner width | Parameters per FFN |
|---|---|---|
| Standard FFN ([ReLU](/wiki/relu)/[GELU](/wiki/gelu)) | $$4d$$ | $$2 \cdot d \cdot 4d = 8d^2$$ |
| [SwiGLU](/wiki/swiglu) FFN, equal width | $$4d$$ | $$3 \cdot d \cdot 4d = 12d^2$$ |
| [SwiGLU](/wiki/swiglu) FFN, 2/3 reduction | $$8d/3$$ | $$3 \cdot d \cdot 8d/3 = 8d^2$$ |

So the LLaMA-style SwiGLU FFN has the same nominal parameter count as a baseline FFN, just split across three thinner matrices instead of two wider ones.
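The table can be reproduced with integer arithmetic. The sketch below uses d = 768 only because it is divisible by 3, so the 8d/3 width is exact; `ffn_params` is a helper named for this example.

```python
def ffn_params(d_model: int, d_ff: int, gated: bool) -> int:
    """Weight count of one FFN block without biases."""
    n_matrices = 3 if gated else 2   # gate + value + down, or up + down
    return n_matrices * d_model * d_ff

d = 768
standard = ffn_params(d, 4 * d, gated=False)
swiglu_equal_width = ffn_params(d, 4 * d, gated=True)
swiglu_reduced = ffn_params(d, 8 * d // 3, gated=True)

print(standard, swiglu_equal_width, swiglu_reduced)
print("equal-width overhead:", swiglu_equal_width / standard)
```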

For concrete numbers, here are the SwiGLU FFN dimensions used by several real models. The conventions vary: LLaMA rounds 8/3 times d_model up to a multiple of 256, Mistral 7B and Llama 3 8B use 14336 for d_model = 4096 (3.5 times d_model, well above the 8/3 figure), and PaLM keeps the unreduced 4d.

| Model | d_model | Computed 8/3 times d | Actual d_ff | Multiple |
|---|---|---|---|---|
| LLaMA 7B | 4096 | 10923 | 11008 | 256 |
| LLaMA 13B | 5120 | 13653 | 13824 | 256 |
| LLaMA 65B | 8192 | 21845 | 22016 | 256 |
| Mistral 7B | 4096 | 10923 | 14336 | 128 |
| Mixtral 8x7B (per expert) | 4096 | 10923 | 14336 | 128 |
| Llama 3 8B | 4096 | 10923 | 14336 | 1024 |
| Llama 3 70B | 8192 | 21845 | 28672 | 1024 |
| PaLM 540B | 18432 | 49152 | 73728 | 4d (no reduction) |

Llama 3 and Mistral both use noticeably larger inner widths than the strict 8/3 rule predicts (3.5 times d_model, roughly 31% wider), partly to compensate for parameters saved by [Grouped-Query Attention](/wiki/grouped_query_attention) (GQA), and partly because tuning d_ff is a useful free hyperparameter.[11][12] The 8/3 figure is the right starting point but not a sacred constant.
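The LLaMA rows of the table follow from a short rounding rule. The sketch below mirrors the convention described in this section (truncate 8d/3, then round up to a multiple); the function name `swiglu_hidden_dim` is chosen for this example.

```python
def swiglu_hidden_dim(d_model: int, multiple_of: int = 256) -> int:
    """Truncate 8/3 * d_model, then round up to the next multiple of `multiple_of`."""
    raw = (8 * d_model) // 3
    return multiple_of * ((raw + multiple_of - 1) // multiple_of)

for d_model in (4096, 5120, 8192):
    print(d_model, swiglu_hidden_dim(d_model))

# The Mistral 7B and Llama 3 widths in the table are exactly 3.5 * d_model instead.
print(int(3.5 * 4096), int(3.5 * 8192))
```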

## How well did SwiGLU perform in the 2020 paper?

Shazeer evaluated each variant on a T5 base configuration (12 encoder and 12 decoder layers, d_model = 768, d_ff = 3072 for non-gated variants, d_ff = 2048 for gated variants) trained on the C4 corpus for 524,288 steps.[1] The held-out log-perplexity numbers (lower is better) reported in the paper are below.

| FFN variant | Activation in gate | Log-perplexity at 65k steps | Log-perplexity at 524k steps |
|---|---|---|---|
| FFN_ReLU (baseline) | ReLU (no gate) | 1.997 | 1.677 |
| FFN_GELU | GELU (no gate) | 1.983 | 1.679 |
| FFN_Swish | Swish (no gate) | 1.994 | 1.683 |
| FFN_GLU | sigmoid | 1.982 | 1.663 |
| FFN_Bilinear | identity (no gate activation) | 1.960 | 1.648 |
| FFN_ReGLU | ReLU | 1.953 | 1.645 |
| FFN_GEGLU | GELU | **1.942** | **1.633** |
| FFN_SwiGLU | Swish | 1.944 | 1.636 |

GEGLU and SwiGLU are essentially tied, with both clearly beating any of the non-gated baselines.[1] The improvement over GELU on log-perplexity is small in absolute terms (around 0.04 nats), but consistent and free in compute terms once you apply the 2/3 width reduction.

Shazeer also reported downstream task scores from fine-tuning on GLUE, SuperGLUE, and SQuAD.[1][19][20] The averages were:

| FFN variant | GLUE avg | SuperGLUE avg | SQuAD F1 |
|---|---|---|---|
| FFN_ReLU | 83.80 | 72.76 | 90.87 |
| FFN_GELU | 83.86 | 72.98 | 90.79 |
| FFN_Swish | 83.60 | 72.40 | 90.76 |
| FFN_GLU | 84.20 | 73.95 | 90.69 |
| FFN_Bilinear | 83.79 | 73.81 | 91.06 |
| FFN_ReGLU | 84.67 | 73.66 | 91.18 |
| FFN_GEGLU | 84.12 | 73.96 | 91.12 |
| FFN_SwiGLU | 84.36 | **74.56** | 91.03 |
| Raffel et al. 2019 (T5 reference) | 83.28 | 71.36 | 88.81 |

SwiGLU posted the highest SuperGLUE average. ReGLU narrowly led on GLUE and SQuAD F1.[1] The differences are within the inter-run standard deviations Raffel et al. reported (about 0.24 for GLUE, 0.42 for SuperGLUE, 0.23 for SQuAD F1), so the headline of the paper is really that the entire gated family beats the non-gated baselines, not that any one gated variant is clearly best.[7]

The paper does not actually argue for SwiGLU over GEGLU on principled grounds. Both worked, both were the recommendation.[1] Subsequent practice tilted toward SwiGLU because PaLM and LLaMA chose it, and the ecosystem followed.

## Which models use SwiGLU?

SwiGLU went from an obscure 5-page tech report in early 2020 to the dominant FFN variant in production by 2024. The inflection point was Google's PaLM in April 2022, which used SwiGLU at 540B scale and reported strong gains.[8] Meta then chose SwiGLU for [LLaMA](/wiki/llama) in February 2023, and once the LLaMA weights were widely circulated, every fine-tuner and downstream researcher inherited the same FFN topology.[9]

| Model | Year | Activation | Notes |
|---|---|---|---|
| [PaLM](/wiki/palm) | 2022 | SwiGLU | Inner width kept at 4 times d_model; 540B parameters |
| [LLaMA](/wiki/llama) (1) | 2023 | SwiGLU | Inner width $$\tfrac{2}{3} \cdot 4d$$, $$\beta = 1$$, no biases |
| [LLaMA 2](/wiki/llama_2) | 2023 | SwiGLU | Same convention as LLaMA |
| [LLaMA 3](/wiki/llama_3) | 2024 | SwiGLU | Carried over unchanged; intermediate size grew with GQA |
| [Mistral 7B](/wiki/mistral_7b) | 2023 | SwiGLU | Inner width 14336 with d_model 4096 |
| Mixtral 8x7B | 2023 | SwiGLU | Used inside each of the 8 experts |
| [Falcon](/wiki/falcon) 3 | 2024 | SwiGLU | Falcon 1/2 had used GELU |
| [DeepSeek](/wiki/deepseek) V2 / V3 | 2024 | SwiGLU | DeepSeek-V3 caches SwiGLU input and recomputes output in the backward pass to save activation memory |
| [Qwen](/wiki/qwen) 2 / Qwen 3 | 2024 | SwiGLU | Standard 8d/3 inner width with rounding |
| [OLMo](/wiki/olmo) | 2024 | SwiGLU | Hidden size set to roughly 8d/3 rounded up to a multiple of 128 |
| StableLM 2 | 2024 | SwiGLU | Stability AI open weights |
| Phi-3 | 2024 | SwiGLU | Microsoft small model |
| Yi-34B | 2023 | SwiGLU | 01.AI |
| Baichuan 2 | 2023 | SwiGLU | Baichuan Inc. |
| InternLM 2 | 2024 | SwiGLU | Shanghai AI Lab |
| [Gemma](/wiki/gemma) 1/2 | 2024 | GeGLU | Sibling of SwiGLU using GELU as the gate; Google's open-weight choice |
| GPT-2/GPT-3 | 2019/2020 | GELU | Pre-SwiGLU; OpenAI never publicly switched |
| Pythia | 2023 | GELU | EleutherAI's interpretability suite stayed on GELU for comparability |

Gemma is a useful counterexample: Google's open-weight model uses GeGLU rather than SwiGLU.[16] The two perform almost identically in Shazeer's original benchmark, so this is not a serious quality difference, just a different bet.[1] Gemma uses the tanh approximation of GELU that GPT-2 popularised rather than the exact erf-based form, which has caused occasional inference inconsistencies when third-party runtimes substitute the exact version.[16]

OpenAI's GPT-2 and GPT-3 are publicly documented as using GELU in the FFN; the company has not released architecture details for GPT-4 or later models. Anthropic has not published details of [Claude](/wiki/claude)'s FFN, though it is commonly assumed, without public confirmation, to use a SwiGLU or GeGLU variant like most other recent models.

## How does SwiGLU compare with related activations?

The following table summarises the family. All gated variants (GLU, ReGLU, GEGLU, SwiGLU, and Bilinear) require three matrices in the FFN block and double the up-projection parameters relative to a non-gated FFN of equal hidden width.

| Activation | Formula | Gated | Smooth | Used in |
|---|---|---|---|---|
| [ReLU](/wiki/relu) | $$\max(0, x)$$ | no | no | Original Transformer, vanilla T5 |
| [GELU](/wiki/gelu) | $$x \cdot \Phi(x)$$ | no | yes | BERT, GPT-2, GPT-3, RoBERTa |
| Swish / [SiLU](/wiki/silu) | $$x \cdot \sigma(\beta x)$$ | no | yes | EfficientNet (often), various RL agents |
| [GLU](/wiki/glu) | $$(xW + b) \otimes \sigma(xV + c)$$ | yes | yes | Dauphin et al. 2016 conv LM |
| ReGLU | $$\max(0, xW + b) \otimes (xV + c)$$ | yes | no | Shazeer 2020 |
| GEGLU | $$\mathrm{GELU}(xW + b) \otimes (xV + c)$$ | yes | yes | T5 v1.1, mT5, [Gemma](/wiki/gemma) |
| [SwiGLU](/wiki/swiglu) | $$\mathrm{Swish}(xW + b) \otimes (xV + c)$$ | yes | yes | [PaLM](/wiki/palm), [LLaMA](/wiki/llama) family, [Mistral](/wiki/mistral), Falcon 3, [DeepSeek](/wiki/deepseek), [Qwen](/wiki/qwen), [OLMo](/wiki/olmo) |
| Bilinear | $$(xW + b) \otimes (xV + c)$$ | yes (linear gate) | yes | Studied in Shazeer 2020; rare in production |
| Mish | $$x \cdot \tanh(\mathrm{softplus}(x))$$ | no | yes | YOLOv4 and several CV models |
| ELU | $$x \text{ if } x > 0 \text{ else } \alpha(\exp(x) - 1)$$ | no | yes | Some early LM experiments |

SwiGLU sits at the intersection of two design choices: gated (vs. pointwise), and smooth (vs. piecewise linear). GEGLU shares both properties; the only difference is which smooth nonlinearity does the gating.

## Why does SwiGLU work?

Nobody really knows. The most-quoted line in Shazeer's paper is the mock-modest disclaimer in the conclusions section:

> "We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence."[1]

The academic literature has produced a few plausible stories. The most common one is that the multiplicative gate gives the network a low-cost way to represent **conditional computation**: the value branch carries information forward, and the gate branch (a smooth soft mask after Swish) decides which features matter for which input. This is essentially the same intuition Dauphin used to motivate the original GLU.[2] Smooth gating activations like Swish and GELU avoid the dead-unit problem of ReGLU, and Swish's slight non-monotonic dip just below zero (around $$x \approx -1.28$$) may help with optimization.

A second story is **expressivity**. A gated FFN can represent products of features, not just additive combinations. Each output unit is a sum of inputs scaled by a learned mask, so it can selectively combine information across input dimensions in a way a single-projection ReLU cannot. The Bilinear baseline already enjoys this property, which is why its perplexity is closer to SwiGLU than to ReLU.[1]

A more recent line of analysis comes from approximation theory. A 2026 preprint titled *Divine Benevolence is an x-squared: GLUs scale asymptotically faster than MLPs* (arXiv:2602.14495) by Alejandro Francisco Queiruga argued that GLU-style layers form piecewise quadratic approximators, while standard MLPs form piecewise linear approximators.[25] On function-reconstruction problems the paper reports a loss-scaling slope of roughly $$L(P) \propto P^{-3}$$ for GLUs versus $$P^{-2}$$ for MLPs, where $$P$$ is the parameter count, and from this it derives a "Gated Quadratic Unit" with an even steeper slope.[25] Quadratic piecewise approximators have asymptotically better convergence rates for smooth target functions, which the author offered as a partial explanation for the consistent perplexity advantage of GLU variants over plain MLPs. Whether this story holds up at very large scale remains open.

A fourth perspective is purely empirical: the gain over GELU on log-perplexity is small but the gain comes for free, the tooling is mature, and the inertia is enormous. Even practitioners who suspect the gain would disappear at trillion-token scale have little reason to fight the consensus.

In 2024, Mirzadeh et al. published *ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models* (arXiv:2310.04564, ICLR 2024).[17] They showed that replacing the SiLU inside SwiGLU with a plain ReLU produces minimal quality loss while making activations highly sparse, which can be exploited to speed up inference by up to 3x in memory-bound regimes.[17] The paper's authors at Apple did not argue that ReLU is intrinsically better but that the exotic smoothness of SwiGLU may not be earning its keep, especially during inference. Subsequent work (e.g. ReLU2, gated ReLU) has continued this line.
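The sparsity argument can be illustrated on synthetic data. The sketch below draws random Gaussian pre-activations, which is an assumption made purely for illustration: it does not reproduce the paper's measurements, and sparsity in trained models depends on the learned weights.

```python
import math
import random

def relu(z: float) -> float:
    return max(0.0, z)

def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))

rng = random.Random(0)
n = 10_000
gate_pre = [rng.gauss(0.0, 1.0) for _ in range(n)]   # synthetic stand-in for xW
value = [rng.gauss(0.0, 1.0) for _ in range(n)]      # synthetic stand-in for xV

def zero_fraction(activation) -> float:
    """Fraction of gated hidden units that are exactly zero."""
    hidden = [activation(g) * v for g, v in zip(gate_pre, value)]
    return sum(1 for h in hidden if h == 0.0) / n

relu_sparsity = zero_fraction(relu)
silu_sparsity = zero_fraction(silu)
print(f"ReLU gate: {relu_sparsity:.1%} exact zeros, SiLU gate: {silu_sparsity:.1%}")
```

Exact zeros matter for inference because the corresponding rows of the down-projection never need to be read; a SiLU gate produces small values but essentially never an exact zero.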

The current status is roughly: SwiGLU is clearly competitive, the difference from GEGLU is negligible, and the difference from a well-tuned GELU FFN is small but consistent enough that all the major labs default to SwiGLU now anyway. The decision is partly about momentum and tooling.

## How is SwiGLU implemented in PyTorch?

A minimal LLaMA-style SwiGLU feed-forward block looks like this. PyTorch's `F.silu` is exactly Swish with beta = 1, which is what every public model in the table above uses.[21]

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU_FFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int | None = None):
        super().__init__()
        # LLaMA convention: round 8/3 * d_model up to a multiple of 256
        if d_ff is None:
            d_ff = int(8 * d_model / 3)
            d_ff = ((d_ff + 255) // 256) * 256
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_value = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_value(x))
```

A few notes on the implementation in practice:

* Bias terms are dropped, matching Shazeer's preferred form and every modern LLM that uses SwiGLU.[1]
* `w_gate` and `w_value` can be fused into a single 2 times d_ff projection and split, which is slightly faster on GPU. Meta's LLaMA reference code keeps three separate matrices (`w1`, `w2`, `w3`), while some serving stacks such as vLLM merge the two up-projections into one matrix named `gate_up_proj`.
* For training memory, the activations after `silu` and the elementwise product are the largest tensors in the block. DeepSeek-V3 reportedly avoids storing them by recomputing `silu(w_gate(x))` and the elementwise product during the backward pass, trading a small amount of compute for a meaningful memory saving.[14]
* PyTorch (as of 2.x) does not yet ship a fused `swiglu` kernel; community kernels like xFormers' `SwiGLU` op or Triton implementations are common in production training stacks. There is an open issue in the PyTorch repo (#128712) requesting a built-in version.[22][23]
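The merged-projection trick mentioned in the notes above is just column concatenation. The following sketch shows the equivalence in plain Python on a toy example; the matrix values are arbitrary and `matvec` is a helper written for this illustration.

```python
def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

x = [0.5, -1.0, 2.0]                                   # d_model = 3
W_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]         # d_model x d_ff
W_value = [[-0.7, 0.8], [0.9, 1.0], [-1.1, 1.2]]       # d_model x d_ff
d_ff = 2

# Concatenate along the output dimension: one d_model x (2 * d_ff) matrix.
W_fused = [g_row + v_row for g_row, v_row in zip(W_gate, W_value)]

fused_out = matvec(x, W_fused)                         # one matmul instead of two
gate_pre, value = fused_out[:d_ff], fused_out[d_ff:]   # split back into the halves

print(gate_pre, value)
```

The single larger multiplication reads x once and typically uses the GPU better than two smaller ones, which is the motivation for merging.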

In JAX-based stacks the equivalent uses `jax.nn.silu` and `jax.numpy.einsum`, with the same three-matrix structure.

## Fused kernels and hardware considerations

A naive PyTorch implementation of the SwiGLU FFN issues five separate GPU operations: two large matrix multiplications for the gate and value projections, one elementwise SiLU, one elementwise multiplication, and one large matrix multiplication for the down projection. The two elementwise operations between the projections are memory-bound on modern GPUs because they read and write the full d_ff-dimensional activation tensor without doing much arithmetic. Fusing them into a single kernel doubles or triples the throughput of that part of the block.

Several production stacks ship fused SwiGLU kernels:

* **xFormers** from Meta provides `xformers.ops.SwiGLU` with a packed weight format that concatenates the gate and value projections as a single weight tensor `w12`. The forward and backward passes are written in CUTLASS templates, which lets the kernel fuse the activation with the surrounding matrix multiplications on H100 and A100 GPUs.[23]
* **Liger Kernel**, an open-source project from LinkedIn, provides a Triton implementation of SwiGLU forward and backward that the authors report can reduce LLM-training memory by up to 60% in conjunction with their other fused kernels (RMSNorm, RoPE, cross-entropy). Liger composes with PyTorch FSDP, DeepSpeed, and FlashAttention.[24]
* **NVIDIA Transformer Engine** ships fused SwiGLU as part of its FP8 training kernels for H100 and Blackwell, which are the kernels Llama 3 405B and the public DeepSeek runs use in practice.[11][14]
* **llama.cpp** and other inference engines (vLLM, TGI, MLX) implement SwiGLU as a single CUDA or Metal kernel that reads both projections and writes only the gated output, halving the memory bandwidth requirement.

The activation memory savings come from fusing the SiLU and the elementwise product into a single kernel that does not materialise the intermediate tensors. The key tensors are the gate pre-activation xW (size batch times sequence times d_ff), the value pre-activation xV (same size), and the gated output (same size). A non-fused implementation stores all three; a fused implementation can stream them through registers or shared memory and only write the output. For Llama 3 70B at 8K context with batch 1 (d_ff = 28672), each of these tensors holds 8192 by 28672 elements, about 0.47 GB in bf16, so skipping two of them saves roughly 0.9 GB per layer, which compounds across the 80 layers.
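The size of these tensors is easy to estimate. The sketch below assumes 2-byte (bf16) activations and that a fused kernel writes only the gated output; both are assumptions of this example rather than properties of any particular kernel.

```python
def activation_bytes(batch: int, seq_len: int, d_ff: int, bytes_per_elem: int = 2) -> int:
    """Size of one (batch, seq_len, d_ff) activation tensor."""
    return batch * seq_len * d_ff * bytes_per_elem

# Llama 3 70B figures from the table earlier in the article: d_ff = 28672, 80 layers.
one_tensor = activation_bytes(batch=1, seq_len=8192, d_ff=28672)
unfused = 3 * one_tensor      # gate pre-activation, value pre-activation, gated output
fused = 1 * one_tensor        # gated output only
saved_per_layer = unfused - fused

GIB = 2 ** 30
print(f"one tensor:      {one_tensor / GIB:.4f} GiB")
print(f"saved per layer: {saved_per_layer / GIB:.3f} GiB")
print(f"saved, 80 layers: {80 * saved_per_layer / GIB:.1f} GiB")
```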

For inference, a SwiGLU FFN moves roughly 1.5 times the weight bytes of a non-gated FFN of the same hidden width: it loads two up-projection matrices instead of one, plus the same down projection, and produces two d_ff-wide intermediate vectors instead of one. This is why, at equal hidden width, SwiGLU LLMs are somewhat more memory-bandwidth-bound than GELU LLMs, and why it helps to store the gate and value projections contiguously so that both can be read in a single pass.

## Activation checkpointing and memory tricks

DeepSeek-V3's technical report (arXiv:2412.19437) describes a specific memory optimisation for SwiGLU during training.[14] They cache the inputs of the SwiGLU operator and recompute its output in the backward pass, so the gate activation `silu(xW)` and the gated product `silu(xW) ⊗ xV` are never stored.[14] The recomputed work is elementwise, so its cost is tiny relative to the matrix multiplications surrounding it; the trade is close to free in compute and saves substantial memory.

DeepSeek-V3 combines this trick with FP8 quantisation of the SwiGLU activations using their own fine-grained per-block scaling scheme. The combination is one of the design choices that lets DeepSeek-V3 train a 671B-parameter [Mixture of Experts](/wiki/mixture_of_experts) model with only 2.788M H800 GPU hours.[14]

Generic activation checkpointing libraries (such as PyTorch's `torch.utils.checkpoint` or Megatron-LM's `--recompute-activations` option) can apply the same trade automatically across the whole transformer block. When training memory is the bottleneck, recomputing SwiGLU forwards in the backward pass is one of the highest-value moves, since the FFN activations are the largest single tensor in a transformer block.
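The recompute-in-backward idea can be sketched without any framework. The example below assumes the saved tensors are the two pre-activations entering the elementwise SwiGLU step; caching the block input x instead would also require redoing the two projections. The function names are made up for this illustration and this is not DeepSeek's code.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def silu(z: float) -> float:
    return z * sigmoid(z)

def silu_grad(z: float) -> float:
    s = sigmoid(z)
    return s + z * s * (1.0 - s)

def swiglu_forward(gate_pre, value):
    """Return the gated output and keep only the operator's inputs for backward."""
    out = [silu(g) * v for g, v in zip(gate_pre, value)]
    saved = (gate_pre, value)          # silu(gate_pre) is deliberately NOT stored
    return out, saved

def swiglu_backward(saved, grad_out):
    """Recompute silu(gate_pre) here instead of reading it from memory."""
    gate_pre, value = saved
    grad_gate = [dy * v * silu_grad(g) for dy, g, v in zip(grad_out, gate_pre, value)]
    grad_value = [dy * silu(g) for dy, g in zip(grad_out, gate_pre)]
    return grad_gate, grad_value

gate_pre = [-2.0, -0.5, 0.0, 1.5]
value = [0.3, -1.2, 2.0, 0.7]
out, saved = swiglu_forward(gate_pre, value)
grad_gate, grad_value = swiglu_backward(saved, [1.0] * len(out))
print(grad_gate, grad_value)
```

The backward pass costs a few extra elementwise operations per element, in exchange for not keeping the activation and the product alive between the forward and backward passes.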

## How does SwiGLU relate to Mixture of Experts?

SwiGLU also appears inside the experts of [Mixture of Experts](/wiki/mixture_of_experts) (MoE) models. Mixtral 8x7B places eight independent SwiGLU FFNs in each transformer block; the router selects two of them per token, and only the selected experts are evaluated.[13] DeepSeek-V2 and V3 use a finer-grained variant with hundreds of small SwiGLU experts and many active per token.[14] In each case the SwiGLU formula is identical to the dense case; what changes is how many of the FFN sublayers are run. The total parameter count grows linearly in the number of experts, but the active compute is dominated by the few experts the router picks.
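A toy version of this routing can be written in plain Python. The sketch below uses random weights and one common formulation (softmax over the logits of the selected experts only); real routers add load balancing and batching, and the names `moe_forward` and `W_router` are invented for this example.

```python
import math
import random

def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def swiglu_ffn(x, W_gate, W_value, W_down):
    hidden = [silu(g) * v for g, v in zip(matvec(x, W_gate), matvec(x, W_value))]
    return matvec(hidden, W_down)

def moe_forward(x, experts, W_router, top_k=2):
    logits = matvec(x, W_router)                     # one routing logit per expert
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    weights = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(x)
    for i, w in zip(chosen, weights):                # only the chosen experts run
        y = swiglu_ffn(x, *experts[i])
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, chosen

rng = random.Random(0)
def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

d_model, d_ff, n_experts = 4, 6, 8
experts = [(rand_matrix(d_model, d_ff), rand_matrix(d_model, d_ff), rand_matrix(d_ff, d_model))
           for _ in range(n_experts)]
W_router = rand_matrix(d_model, n_experts)

x = [0.3, -1.0, 0.8, 0.1]
y, chosen = moe_forward(x, experts, W_router)
print("experts used:", chosen)
```

Each expert is an ordinary three-matrix SwiGLU FFN; the router only decides which of them run and how their outputs are mixed.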

The choice of activation inside the experts is generally treated as a separate decision from the routing scheme, and SwiGLU is the default for the same reasons it is the default in dense models: it is the format the open-source training stacks are tuned for, and the empirical loss per parameter is at least as good as the alternatives.

## Variants and follow-up work

Researchers have continued to explore the design space around SwiGLU, though the empirical advantage of any single variant remains small.

* **GeGLU**, the GELU-gated sibling, is the closest alternative and is used in T5 v1.1, mT5, and Gemma.[16] The choice between SwiGLU and GeGLU is essentially aesthetic at this point.
* **Bilinear**, the activation-free gated layer (no nonlinearity on the gate branch), already outperforms ReLU and GELU baselines in Shazeer's table.[1] It is rarely used in production but appears in a few research papers as a strong baseline.
* **ReGLU squared (ReLU-squared GLU)**, used in some sparse-LLM research, replaces SiLU with ReLU squared. The idea is to combine GLU's gating with the activation sparsity that pure ReLU networks enjoy.
* **Masked GLU** (Masked Gated Linear Unit, arXiv:2506.23225), a June 2025 proposal by Tajima et al., learns multiple binary masks on a single shared weight matrix (a Mixture of Element-wise Gating) to halve the memory reads of standard GLUs; its FlashMGLU kernel reports up to a 19.7x inference-time speed-up over a naive PyTorch MGLU, and the Swish-activated SwiMGLU variant matches or surpasses the SwiGLU baseline's downstream accuracy.[26]
* **NoGLU** is an informal label some researchers attach to non-gated FFN replacements. A related, though still gated, line of work is the *ReLU Strikes Back* paper (ICLR 2024), which demonstrated that swapping SiLU for ReLU inside SwiGLU produces a model with similar quality but much sparser activations, which can be exploited at inference time.[17]
* **Beta-tuned SwiGLU** would mean using Swish_beta with a learned beta per layer, as in the original Ramachandran paper.[3] No major LLM appears to do this; everyone uses $$\beta = 1$$.
* **GLU with bias**, retaining the bias terms in the gate and value projections, is a minor option enabled by some training frameworks. It is rarely worth the small additional parameter count.
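Most of the variants listed above differ only in the function applied to the gate branch. The sketch below makes that explicit for the elementwise step; the dictionary `GATE_ACTIVATIONS`, the helper `gated_unit`, and the sample values are hypothetical, and the projections and down-projection are left out. Setting `beta` to something other than 1 in `swish` gives the beta-tuned form.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z, beta=1.0):
    """Swish_beta(z) = z * sigmoid(beta * z); beta = 1 gives SiLU."""
    return z * sigmoid(beta * z)


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def relu(z):
    return max(z, 0.0)


# Function applied to the gate branch xW; the value branch xV stays linear.
GATE_ACTIVATIONS = {
    "GLU": sigmoid,
    "Bilinear": lambda z: z,
    "ReGLU": relu,
    "GeGLU": gelu,
    "SwiGLU": swish,
    "ReLU-squared GLU": lambda z: relu(z) ** 2,
}


def gated_unit(gate_pre, value, variant="SwiGLU"):
    """Elementwise act(gate_pre) * value for the chosen GLU variant."""
    act = GATE_ACTIVATIONS[variant]
    return [act(g) * v for g, v in zip(gate_pre, value)]


gate_pre = [-2.0, 0.0, 1.5]   # hypothetical xW entries
value = [0.5, -1.0, 2.0]      # hypothetical xV entries
for name in GATE_ACTIVATIONS:
    print(f"{name:>17}: {gated_unit(gate_pre, value, name)}")
```

For the negative gate entry, the ReLU-based variants return an exact zero while SwiGLU and GeGLU return small nonzero values, which is the sparsity difference discussed in the next section.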

There has also been work on extending SwiGLU-style gating to attention. Several recent papers experiment with multiplicative gates inside the attention sublayer rather than replacing the FFN, with mixed results.

## Limitations and criticism

SwiGLU is not without drawbacks:

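* **Three matrices, not two.** Memory bandwidth for the FFN block is higher than for a non-gated FFN at equal hidden width, because three weight matrices are read instead of two. The 2/3 width reduction equalises the parameter count but not the number of matrix multiplications or intermediate tensors.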
* **No hard zero.** Unlike ReLU, Swish does not map negative inputs to exactly zero; its output is zero only at x = 0. This removes the activation sparsity that some inference techniques rely on. Research like *ReLU Strikes Back* shows that retrofitting ReLU-style sparsity into a SwiGLU model is possible but requires re-training or fine-tuning.[17]
* **Numerical sensitivity.** Mixed-precision (FP16 or BF16) training of SwiGLU sometimes shows larger activation magnitudes than GELU, since the gate and value can both be large in magnitude and their product grows accordingly. This is not catastrophic, but it means SwiGLU layers can need slightly more careful loss scaling and gradient-norm clipping during training. FP8 training typically uses per-block scaling for SwiGLU activations.[14]
* **Limited theoretical understanding.** As Shazeer himself wrote, no convincing first-principles explanation exists for why GLU variants outperform pointwise activations.[1] The empirical advantage is robust but small, and most theoretical accounts are post hoc.
* **Not better at every scale.** Some work suggests the perplexity gap between SwiGLU and GELU narrows as data and compute increase. Whether SwiGLU is still better at the trillion-parameter scale is hard to study because the major labs only train one architecture per scale and rarely run controlled ablations.
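The first two points above can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names, a d_model of 3072 (chosen so that 8/3 of it is an integer), and standard-normal samples as a stand-in for real pre-activations; none of these come from a specific model.

```python
import math
import random


def silu(z):
    return z / (1.0 + math.exp(-z))


def ffn_weight_count(d_model, d_ff, gated):
    """Weights in one bias-free FFN: 3 matrices if gated, else 2."""
    return (3 if gated else 2) * d_model * d_ff


def exact_zero_fraction(activation, inputs):
    """Fraction of activation outputs that are exactly 0.0."""
    return sum(activation(z) == 0.0 for z in inputs) / len(inputs)


d_model = 3072
plain = ffn_weight_count(d_model, 4 * d_model, gated=False)      # width 4d
gated = ffn_weight_count(d_model, 8 * d_model // 3, gated=True)  # width 8d/3
print("non-gated weights:", plain)
print("SwiGLU weights:   ", gated)

rng = random.Random(0)
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
relu_zeros = exact_zero_fraction(lambda z: max(z, 0.0), pre_activations)
silu_zeros = exact_zero_fraction(silu, pre_activations)
print("ReLU exact zeros:", relu_zeros)
print("SiLU exact zeros:", silu_zeros)
```

The two weight counts are equal, which is the point of the 8/3 rule, while the matrix count stays at three. Roughly half of the ReLU outputs are exact zeros on symmetric inputs; SiLU yields an exact zero only where the input itself is zero.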

None of these are dealbreakers, and they have not slowed adoption.

## Historical timeline

| Year | Event |
|---|---|
| 2007 | Mnih and Hinton describe a bilinear product of two linear projections in *Three new graphical models for statistical language modelling*. |
| 2016 | Hendrycks and Gimpel propose [GELU](/wiki/gelu) (arXiv:1606.08415); the paper also mentions x · σ(x) in passing. |
| 2016 | Dauphin, Fan, Auli, and Grangier propose [GLU](/wiki/glu) for convolutional language modeling (arXiv:1612.08083). |
| 2017 | Elfwing, Uchibe, and Doya propose [SiLU](/wiki/silu) for RL (arXiv:1702.03118). |
| 2017 | Vaswani et al. publish *Attention Is All You Need*; the original [transformer](/wiki/transformers) FFN uses ReLU. |
| 2017 | Ramachandran, Zoph, and Le rediscover SiLU as [Swish](/wiki/swish) via neural architecture search (arXiv:1710.05941). |
| 2019 | Raffel et al. release [T5](/wiki/t5); FFN uses ReLU (no bias). T5 v1.1 later switches to GeGLU. |
| 2020 | Shazeer publishes *GLU Variants Improve Transformer* (arXiv:2002.05202), introducing SwiGLU. |
| 2022 | Google's [PaLM](/wiki/palm) (arXiv:2204.02311) uses SwiGLU at 540B scale. |
| 2023 | Meta's [LLaMA](/wiki/llama) (arXiv:2302.13971) uses SwiGLU with the 8/3 rule and no bias; the open-source release locks in the LLaMA-style FFN as the de facto standard. |
| 2023 | Llama 2, [Mistral 7B](/wiki/mistral_7b), Mixtral 8x7B, Yi-34B, Baichuan 2 all ship with SwiGLU. |
| 2024 | Mirzadeh et al. publish *ReLU Strikes Back* (ICLR 2024), questioning whether SwiGLU is necessary. |
| 2024 | Llama 3, DeepSeek V2/V3, Qwen 2, OLMo, Falcon 3, Phi-3, InternLM 2 all use SwiGLU; Gemma uses GeGLU. |
| 2024 | DeepSeek-V3 documents SwiGLU activation recomputation in FP8 training. |
| 2025 | Continued research on SwiGLU variants (Masked GLU, gated ReLU, NoGLU); SwiGLU remains the production default. |
| 2026 | SwiGLU is essentially universal in open-weight LLMs except for Google's Gemma line. |

## See also

* [GELU](/wiki/gelu)
* [ReLU](/wiki/relu)
* [Swish](/wiki/swish)
* [SiLU](/wiki/silu)
* [GLU](/wiki/glu)
* [Activation Function](/wiki/activation_function)
* [Transformer](/wiki/transformers)
* [Feedforward Neural Network (FFN)](/wiki/feedforward_neural_network_ffn)
* [LLaMA](/wiki/llama)
* [PaLM](/wiki/palm)
* [Mistral](/wiki/mistral)
* [DeepSeek](/wiki/deepseek)
* [Mixture of Experts](/wiki/mixture_of_experts)
* [Noam Shazeer](/wiki/noam_shazeer)

## References

1. Shazeer, N. (2020). *GLU Variants Improve Transformer*. arXiv:2002.05202. https://arxiv.org/abs/2002.05202
2. Dauphin, Y. N., Fan, A., Auli, M., & Grangier, D. (2016). *Language Modeling with Gated Convolutional Networks*. arXiv:1612.08083. https://arxiv.org/abs/1612.08083
3. Ramachandran, P., Zoph, B., & Le, Q. V. (2017). *Searching for Activation Functions*. arXiv:1710.05941. https://arxiv.org/abs/1710.05941
4. Elfwing, S., Uchibe, E., & Doya, K. (2017). *Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning*. arXiv:1702.03118. https://arxiv.org/abs/1702.03118
5. Hendrycks, D. & Gimpel, K. (2016). *Gaussian Error Linear Units (GELUs)*. arXiv:1606.08415. https://arxiv.org/abs/1606.08415
6. Vaswani, A., et al. (2017). *Attention Is All You Need*. arXiv:1706.03762. https://arxiv.org/abs/1706.03762
7. Raffel, C., et al. (2019). *Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer*. arXiv:1910.10683. https://arxiv.org/abs/1910.10683
8. Chowdhery, A., et al. (2022). *PaLM: Scaling Language Modeling with Pathways*. arXiv:2204.02311. https://arxiv.org/abs/2204.02311
9. Touvron, H., et al. (2023). *LLaMA: Open and Efficient Foundation Language Models*. arXiv:2302.13971. https://arxiv.org/abs/2302.13971
10. Touvron, H., et al. (2023). *Llama 2: Open Foundation and Fine-Tuned Chat Models*. arXiv:2307.09288. https://arxiv.org/abs/2307.09288
11. Llama Team, AI @ Meta (2024). *The Llama 3 Herd of Models*. arXiv:2407.21783. https://arxiv.org/abs/2407.21783
12. Jiang, A. Q., et al. (2023). *Mistral 7B*. arXiv:2310.06825. https://arxiv.org/abs/2310.06825
13. Jiang, A. Q., et al. (2024). *Mixtral of Experts*. arXiv:2401.04088. https://arxiv.org/abs/2401.04088
14. DeepSeek-AI (2024). *DeepSeek-V3 Technical Report*. arXiv:2412.19437. https://arxiv.org/abs/2412.19437
15. Yang, A., et al. (2024). *Qwen2 Technical Report*. arXiv:2407.10671. https://arxiv.org/abs/2407.10671
16. Gemma Team (2024). *Gemma: Open Models Based on Gemini Research and Technology*. arXiv:2403.08295. https://arxiv.org/abs/2403.08295
17. Mirzadeh, I., et al. (2024). *ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models*. ICLR 2024. arXiv:2310.04564. https://arxiv.org/abs/2310.04564
18. Mnih, A. & Hinton, G. (2007). *Three new graphical models for statistical language modelling*. ICML 2007.
19. Wang, A., et al. (2018). *GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding*. arXiv:1804.07461. https://arxiv.org/abs/1804.07461
20. Wang, A., et al. (2019). *SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems*. arXiv:1905.00537. https://arxiv.org/abs/1905.00537
21. PyTorch documentation, `torch.nn.functional.silu`. https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html
22. PyTorch issue #128712, *Add SwiGLU activation function*. https://github.com/pytorch/pytorch/issues/128712
23. xFormers documentation, *SwiGLU operator*. https://facebookresearch.github.io/xformers/components/ops.html
24. Liger Kernel (LinkedIn). https://github.com/linkedin/Liger-Kernel
25. Queiruga, A. F. (2026). *Divine Benevolence is an x^2: GLUs scale asymptotically faster than MLPs*. arXiv:2602.14495. https://arxiv.org/abs/2602.14495
26. Tajima, Y., Inoue, N., Sekikawa, Y., Sato, I., & Yokota, R. (2025). *Masked Gated Linear Unit*. arXiv:2506.23225. https://arxiv.org/abs/2506.23225