Differential Transformer
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,274 words
The Differential Transformer (often shortened to Diff Transformer or DIFF Transformer) is a decoder-only neural sequence architecture introduced by researchers at Microsoft Research and Tsinghua University in October 2024. It replaces the standard scaled dot-product attention block of the Transformer with a differential attention operator that computes two parallel softmax attention maps and outputs their weighted difference, controlled by a learnable scalar lambda. The subtraction is designed to cancel "common-mode" attention given to irrelevant tokens (what the authors call attention noise) while amplifying attention to genuinely informative context.[^1] The architecture was published as arXiv preprint 2410.05258 on 7 October 2024 and selected as an oral presentation at ICLR 2025.[^1][^2] Reference code is released by Microsoft inside the microsoft/unilm repository under the Diff-Transformer directory.[^3]
The work reports gains across language model pretraining loss, downstream zero-shot evaluation, long-context retrieval, many-shot in-context learning, hallucination mitigation in question answering and summarization, and robustness to low-bit quantization of activations.[^1][^4] A follow-up note titled Differential Transformer V2 by the same group, posted on Hugging Face in January 2026, simplifies the parameterization and removes the need for custom attention kernels.[^5]
| Attribute | Value |
|---|---|
| Architecture name | Differential Transformer (Diff Transformer) |
| Core operator | Differential attention (subtraction of two softmax maps) |
| Paper | arXiv:2410.05258, "Differential Transformer" |
| Authors | Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei |
| Affiliations | Microsoft Research; Tsinghua University |
| First posted | 7 October 2024 (v1); 7 April 2025 (v2) |
| Venue | ICLR 2025 (oral) |
| Reference implementation | microsoft/unilm/tree/master/Diff-Transformer |
| Model sizes evaluated | 830M, 1.4B, 2.8B, 3B, 6.8B, 13.1B parameters |
| Training tokens (3B) | 1 trillion |
| Successor | Differential Transformer V2 (January 2026) |
Standard Transformers compute attention as a single softmax over query-key inner products. Empirically, researchers have observed that this softmax often spreads non-trivial probability mass over tokens that are not relevant to the prediction, a phenomenon the Differential Transformer authors describe as attention noise. The mass concentrated on uninformative positions can crowd out the genuine signal, which becomes especially harmful when the input is long or when the relevant evidence is a small fraction of the context window.[^1][^4]
A separate and well-documented manifestation of this issue is the attention sink, where a disproportionate share of attention weight collapses onto a small set of tokens (often the first token of the sequence) regardless of semantic content. Attention sinks contribute to large activation outliers, complicate low-bit deployment, and limit the effective use of model capacity.[^6] The Differential Transformer paper frames differential attention as a way to suppress such common-mode noise in analogy to differential amplifiers in electrical engineering and to noise-cancelling headphones: two correlated signals are subtracted so that the shared noise cancels and the differential signal is amplified.[^1][^7]
This noise-cancellation framing motivated the authors to look for an attention operator whose output remains close to the standard softmax when the two maps disagree (genuine attention) but approaches zero when the two maps largely agree (noise).[^1] The result, differential attention, is otherwise designed to drop into a conventional decoder-only Transformer with minimal changes.
Reference code is available in microsoft/unilm (the Microsoft UniLM monorepo for foundation-model research), with both a naive PyTorch version and FlashAttention-compatible variants.[^3]

In a standard multi-head attention block, each head computes a single attention map by applying softmax to scaled query-key inner products. The Differential Transformer replaces this with two separate softmax maps per head, subtracted with a learnable scalar coefficient lambda.[^1]
Given an input sequence X with N tokens and model dimension d_model, each head projects X into queries, keys, and values. The query and key projections are doubled in width and then split into two halves: Q1, Q2 (each in R^{N x d}) and K1, K2 (each in R^{N x d}). The value projection V remains a single tensor in R^{N x 2d}. The differential attention output for one head is:
DiffAttn(X) = ( softmax(Q1 K1^T / sqrt(d)) - lambda * softmax(Q2 K2^T / sqrt(d)) ) V
The first softmax acts like a conventional attention map; the second softmax is interpreted as a "noise" map that captures the common-mode attention distributed over irrelevant positions. The subtraction cancels what the two maps share and amplifies what they disagree on. Because each head consumes twice the query/key budget but the same value budget, the authors recommend using models with a sufficiently large head count so that halving heads has little practical effect.[^1][^3]
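The formula can be traced on a toy input. The following is a minimal, dependency-free sketch of one differential attention head, not the reference implementation: the helper names and the toy matrices are illustrative assumptions, and the causal mask that a decoder-only model would apply is omitted for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    """softmax(Q K^T / sqrt(d)), row by row (no causal mask in this sketch)."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    return [softmax([s / math.sqrt(d) for s in row]) for row in scores]

def diff_attn(q1, k1, q2, k2, v, lam):
    """Return the signed attention weights and (A1 - lam * A2) V."""
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    weights = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return weights, matmul(weights, v)

# Toy example (illustrative values): N = 3 tokens, d = 2, values of width 2d = 4.
q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q2 = [[0.2, 0.1], [0.1, 0.2], [0.3, 0.3]]
k2 = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]
v = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

weights, out = diff_attn(q1, k1, q2, k2, v, lam=0.5)
row_sums = [sum(r) for r in weights]
```

Because each softmax row sums to one, every row of the signed weight matrix sums to 1 - lambda rather than to one, and individual entries may be negative. When the two maps are identical and lambda equals 1, the output is exactly zero, which is the common-mode cancellation described above.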
The scalar lambda is not learned directly as a free parameter. It is reparameterized as a difference of two exponentials of learned vector dot products plus a layer-dependent initialization constant. For a head with learned parameters lambda_q1, lambda_k1, lambda_q2, lambda_k2 (all vectors in R^d), the lambda used in the attention formula is:
lambda = exp( lambda_q1 . lambda_k1 ) - exp( lambda_q2 . lambda_k2 ) + lambda_init
This form keeps lambda differentiable and allows the network to deviate freely from the initialization while staying numerically well-behaved. The initial constant follows a layer-decaying schedule:
lambda_init(l) = 0.8 - 0.6 * exp( -0.3 * (l - 1) )
where l is the layer index starting at 1. Lower layers start near lambda_init = 0.2 and the value rises toward 0.8 in deeper layers. Ablation studies show that fixed constants such as 0.5 or 0.8 perform almost as well, with validation-loss variance under 0.005.[^1][^11]
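The two lambda formulas above are easy to evaluate directly. The sketch below is only an illustration of the arithmetic; the function names are assumptions made here, and the learned vectors are passed as plain lists rather than trained parameters.

```python
import math

def lambda_init(layer):
    """Layer-dependent initial value; `layer` is 1-based, as in the text."""
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def reparam_lambda(lq1, lk1, lq2, lk2, layer):
    """lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init(layer)."""
    return math.exp(dot(lq1, lk1)) - math.exp(dot(lq2, lk2)) + lambda_init(layer)

# Schedule for a 28-layer model (28 layers is the depth quoted for the 3B model).
schedule = [lambda_init(layer) for layer in range(1, 29)]

# If the two dot products are equal, the exponentials cancel and
# lambda falls back to the layer's initial value.
zeros = [0.0, 0.0, 0.0, 0.0]
lam_first_layer = reparam_lambda(zeros, zeros, zeros, zeros, layer=1)
```

The schedule starts at 0.2 for the first layer, increases monotonically, and stays below 0.8 at every finite depth.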
After differential attention, each head is passed through an independent normalization layer before the heads are concatenated and projected. The paper writes this as a per-head GroupNorm and implements it with RMSNorm applied to each head's output channels. The normalization stabilizes the diverse statistics that arise once the two-map subtraction begins to produce sparse and signed values, which would otherwise yield wildly different head magnitudes. A fixed multiplier of (1 - lambda_init) scales the normalized output so that the gradient flow at initialization matches that of a standard Transformer block.[^1] The authors report that removing this per-head normalization harms training stability with sparse attention patterns.[^11] In V2 the per-head RMSNorm was removed entirely after the team found that, at very large pretraining scales, it caused gradient-norm spikes; a different stabilization is used instead.[^5]
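As a rough illustration of the V1 normalization step, the sketch below normalizes each head's output independently and applies the fixed (1 - lambda_init) multiplier before concatenation. It is an assumption-laden simplification: the function names are hypothetical, a single token position is processed, and the learned gain that an RMSNorm layer normally carries is left out.

```python
import math

def rms_norm(vec, eps=1e-6):
    """Divide a vector by its root-mean-square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]

def normalize_heads(head_outputs, lam_init):
    """Normalize each head on its own, scale by (1 - lam_init), then concatenate.

    head_outputs: list of per-head output vectors for one token position.
    """
    scale = 1.0 - lam_init
    merged = []
    for head in head_outputs:
        merged.extend(scale * x for x in rms_norm(head))
    return merged

def rms(vec):
    return math.sqrt(sum(x * x for x in vec) / len(vec))

# Two heads with very different magnitudes (illustrative values).
heads = [[0.1, -0.2, 0.3, 0.0], [5.0, -7.0, 1.0, 2.0]]
merged = normalize_heads(heads, lam_init=0.2)
head_rms = [rms(merged[:4]), rms(merged[4:])]
```

After the step, both heads have an RMS of about 1 - lambda_init regardless of their original scale, which is the equalizing effect the paragraph describes.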
Apart from the attention block, the Differential Transformer is otherwise a conventional decoder-only Transformer. It uses rotary position embeddings (RoPE), pre-normalization with RMSNorm, SwiGLU feed-forward layers, and the same loss function as the baseline. The reference repository ships an example.py that instantiates a standard multi-head attention block and a multi-head differential attention block side by side so that they can be swapped at the layer level.[^3]
The paper reports a series of pretraining runs from 830M to 13.1B parameters, all trained on the same data mixture with matched hyperparameters between Diff Transformer and a Transformer baseline that uses the same modern improvements (RoPE, RMSNorm, SwiGLU).[^1]
The flagship 3B Diff Transformer uses 28 layers, hidden size 3072, and 12 attention heads, trained on 1 trillion tokens with a global batch size of 4M tokens. A 3B Transformer baseline of equivalent shape was trained for comparison (with the language-modeling-only comparisons in the paper using 350B tokens). On the language model evaluation harness across ARC-C, ARC-E, BoolQ, HellaSwag, OBQA, PIQA, and WinoGrande, the 3B Diff Transformer reaches roughly 60.6% average zero-shot accuracy versus roughly 56.8% for the Transformer baseline.[^1][^4]
In the size-scan, Diff Transformer reaches the same language-modeling loss as the Transformer baseline using approximately 65% of the parameters or training tokens. Concretely, the 7.8B Diff Transformer matches a 13.1B Transformer in the comparable configuration, which is about 60% of the parameters (7.8 / 13.1 ≈ 0.60), and matched loss is reached with about 36% fewer training tokens.[^1][^4]
Diff Transformer was further trained at 64K context length on a book corpus and evaluated with cumulative negative log-likelihood across context positions. The Diff Transformer improvement is consistent across positions, with the gap growing for tokens deep into the context.[^1]
In Needle in a Haystack (NIAH) benchmarks, distractor "needles" are inserted into long contexts and the model is asked to retrieve specific keys. In a multi-needle setting with N = 6 needles and R = 2 queries inside a 4K-token context, the Transformer baseline reaches roughly 55% retrieval accuracy while the Diff Transformer reaches roughly 85%, a 30-point gap. At 64K context length, Diff Transformer maintains stable accuracy as the needle depth varies, while the Transformer baseline degrades sharply. The paper highlights a 76-point accuracy improvement for needles positioned around the 25% depth mark of a 64K context.[^1][^4]
The authors evaluate many-shot in-context learning on TREC (6 classes), TREC-fine (50 classes), Banking-77, and Clinic-150, scaling demonstrations up to a 64K-token context. Diff Transformer improves over Transformer by an average of 5.2% to 21.6% across these datasets. Crucially, Diff Transformer is also far less sensitive to demonstration order. The accuracy standard deviation across random permutations of demonstrations drops to below 2%, versus more than 10% for the Transformer baseline.[^1][^4]
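The order-sensitivity figure is a standard deviation of accuracy over shuffled orderings of the same demonstrations. A minimal sketch of that measurement follows; the `evaluate` callback and both toy scoring functions are hypothetical stand-ins for a real model evaluation, not the paper's protocol or results.

```python
import random
import statistics

def order_sensitivity(demos, evaluate, n_permutations=10, seed=0):
    """Return (mean, std) of accuracy over random orderings of the same demonstrations."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        order = demos[:]      # copy so the caller's list is left untouched
        rng.shuffle(order)
        scores.append(evaluate(order))
    return statistics.mean(scores), statistics.pstdev(scores)

# Hypothetical stand-ins for "run the model with this demonstration order".
def order_insensitive(order):
    return 0.80

def order_sensitive(order):
    # Pretend accuracy depends on where one particular demonstration lands.
    return 0.60 + 0.30 * order.index("demo-0") / (len(order) - 1)

demos = [f"demo-{i}" for i in range(8)]
mean_a, std_a = order_insensitive_stats = order_sensitivity(demos, order_insensitive)
mean_b, std_b = order_sensitivity(demos, order_sensitive)
```

A model that ignores demonstration order yields a standard deviation of zero under this measurement, while an order-dependent one yields a positive value; the paper's comparison reports where each architecture falls on that scale.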
For hallucination mitigation, the paper measures the rate at which model outputs include claims not entailed by the context. On the XSum, CNN/DailyMail, and MultiNews summarization tasks, the rate of hallucination-free outputs improves by 9 to 19 percentage points. On the Qasper, HotpotQA, and 2WikiMultihopQA question-answering datasets, the accuracy gap reaches 8 to 11 percentage points using a GPT-4o-based evaluation protocol, with a notable 13-point accuracy gain on single-document QA and a 21-point gain on multi-document QA reported in secondary coverage.[^1][^4]
A surprising consequence of differential attention is a substantial reduction in activation outliers, which makes the network easier to quantize. Top-1 attention-logit values drop from about 318.0 in the Transformer baseline to about 38.8 in Diff Transformer, and top-1 hidden-state values drop from 3608.6 to 1688.2.[^1] When the same models are post-training quantized, the 4-bit Diff Transformer matches the HellaSwag zero-shot accuracy of the 6-bit Transformer baseline, and the 6-bit Diff Transformer is essentially indistinguishable from the 16-bit baseline.[^1][^4]
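Why smaller outliers help low-bit quantization can be seen with a toy calculation. The sketch below assumes symmetric absmax quantization, one common scheme that may differ from the paper's exact protocol; the "bulk" values are made up, and only the two outlier magnitudes are borrowed from the top-1 logit figures quoted above.

```python
def absmax_quantize(values, bits):
    """Symmetric absmax quantization: round to signed integers, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)
    return [round(v / scale) * scale for v in values]

def bulk_error(bulk, outlier, bits):
    """Mean absolute rounding error on `bulk` when `outlier` sets the quantization scale."""
    dequantized = absmax_quantize(bulk + [outlier], bits)
    return sum(abs(a - b) for a, b in zip(bulk, dequantized)) / len(bulk)

# Illustrative "ordinary" activations plus a single large outlier.
bulk = [-3.0, -1.2, -0.4, 0.3, 0.9, 2.1, 2.7]
err_small_outlier = bulk_error(bulk, 38.8, bits=6)
err_large_outlier = bulk_error(bulk, 318.0, bits=6)
```

With the larger outlier the quantization step is so coarse that every ordinary value rounds to zero, so the error equals the mean magnitude of the bulk; with the smaller outlier the step shrinks and most of the bulk survives rounding.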
The v2 version of the paper, posted in April 2025, adds a mathematical reasoning section. After 20B additional tokens of math-focused continued pretraining, the Diff Transformer is reported as 11.3% ahead of the Transformer baseline on a set of math benchmarks. Under o1-style extended chain-of-thought reasoning, the average accuracy gain across eight math benchmarks is roughly 7.5%.[^2]
The Microsoft unilm monorepo provides a Diff-Transformer/ directory with three attention implementations.[^3]
| File | What it provides |
|---|---|
| multihead_diffattn.py | A naive PyTorch implementation of multi-head differential attention. Useful as a readable reference. |
| multihead_flashdiff_1.py | A FlashAttention-based implementation that supports different query/key and value head dimensions; the authors recommend this variant. |
| multihead_flashdiff_2.py | A FlashAttention variant that does not require different query/key and value dimensions. |
| multihead_attention.py | A conventional multi-head attention reference for direct comparison. |
| example.py | Side-by-side instantiation of standard and differential attention. |
| rms_norm.py | RMSNorm utility shared by both reference blocks. |
| kernel/ | Additional kernel implementations supporting custom attention shapes. |
The multihead_flashdiff_1.py variant is recommended for new training runs. It relies on a FlashAttention build that accepts different head dimensions for query/key and value (d versus 2d in the notation above), which keeps parameter and FLOP budgets close to the Transformer baseline. The codebase also includes a Diff-Transformer-V2/ subdirectory matching the V2 blog.[^3][^5]
Community implementations exist on PyPI as differential-transformer and in standalone PyTorch repositories on GitHub, including nanowell/Differential-Transformer-PyTorch.[^12][^13] These independent reproductions follow the formulas above but vary in how they handle FlashAttention compatibility and the lambda re-initialization schedule.
In January 2026 the Microsoft team published a blog titled Differential Transformer V2 on Hugging Face, accompanied by code in the Diff-Transformer-V2/ subdirectory of microsoft/unilm. V2 makes three principal changes.[^5]
V2 is reported to match the baseline Transformer's decoding speed and to lower language-modeling loss by 0.02-0.03 nats at 1T training tokens on both dense and Mixture-of-Experts (MoE) configurations. It also further reduces activation-outlier magnitudes, with the team reporting that the "context RMS" of attention outputs is constrained to (0, sqrt(2)) rather than the (1/sqrt(n), 1) range of standard softmax outputs.[^5]
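The two ranges can be reproduced with a small calculation under one simplifying reading, which is an assumption here rather than a quotation from the V2 note: if value vectors have unit RMS and are mutually uncorrelated, the RMS of the attention output equals the L2 norm of the weight vector. The sketch evaluates that norm at the extreme cases.

```python
import math

def l2_norm(weights):
    return math.sqrt(sum(w * w for w in weights))

def diff_weights(a, b, lam):
    """Signed weights a - lam * b for two attention distributions a and b."""
    return [x - lam * y for x, y in zip(a, b)]

n = 8
uniform = [1.0 / n] * n                  # softmax spread evenly over n tokens
one_hot_a = [1.0] + [0.0] * (n - 1)      # limit of a sharply peaked softmax
one_hot_b = [0.0] * (n - 1) + [1.0]      # peaked on a different token

softmax_low = l2_norm(uniform)                               # equals 1 / sqrt(n)
softmax_high = l2_norm(one_hot_a)                            # equals 1
diff_low = l2_norm(diff_weights(one_hot_a, one_hot_a, 1.0))  # maps agree: 0
diff_high = l2_norm(diff_weights(one_hot_a, one_hot_b, 1.0)) # maps disagree: sqrt(2)
```

A single softmax cannot push the norm below 1/sqrt(n), whereas the difference of two maps can reach zero when they agree and approaches sqrt(2) when they peak on different tokens. The endpoints are limits, since a softmax is never exactly one-hot and a sigmoid-gated lambda never reaches 1.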
Yueyang Cang and collaborators proposed Shared DIFF Transformer in January 2025 (arXiv:2501.17900). They argue that the two independent Q/K projections in the original block are partly redundant. They instead share a base Q/K matrix and add a low-rank adjustment that produces the second softmax map. The architecture is reported to retain the noise-suppression behaviour of Diff Transformer while reducing parameters and improving efficiency for long-sequence modelling and key information retrieval.[^9]
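The parameter saving from sharing can be estimated with simple counting. The sketch below assumes the second projection is formed as a shared matrix plus a rank-r product, which is one plausible reading of "low-rank adjustment" and not necessarily the exact parameterization in that paper; the rank of 8 is an arbitrary illustrative choice, and d = 128 follows from the 3B configuration quoted earlier (3072 / (2 × 12)).

```python
def independent_qk_params(d_model, d_head):
    """Two full projections, one for each softmax map (e.g. for Q1 and Q2)."""
    return 2 * d_model * d_head

def shared_lowrank_qk_params(d_model, d_head, rank):
    """One shared projection W plus a rank-`rank` update A @ B for the second map.

    A has shape (d_model x rank) and B has shape (rank x d_head).
    """
    return d_model * d_head + rank * (d_model + d_head)

d_model, d_head, rank = 3072, 128, 8
independent = independent_qk_params(d_model, d_head)
shared = shared_lowrank_qk_params(d_model, d_head, rank)
saving = 1.0 - shared / independent
```

Under these assumptions the per-head query (or key) projection shrinks by a little under half, because the low-rank factors are small compared with a second full matrix.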
A 2025 arXiv preprint extends differential attention to a vision-language setting by inserting the differential attention block into the PaliGemma 3B model and fine-tuning with LoRA. Across noisy information-retrieval and visual question-answering tasks, the differential variant improves accuracy by up to roughly 30 percentage points in long-context multimodal scenarios.[^14]
A January 2026 preprint titled Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Long-Context Modeling (arXiv:2601.12145) proposes adding a learned threshold to differential attention to suppress residual attention sinks; this paper also discusses how differential attention interacts with attention-sink phenomena.[^15] A separate paper, Understanding Differential Transformer Unchains Pretrained Self-Attentions (arXiv:2505.16333), provides analytic decompositions of the operator and shows how differential attention can be inserted into pretrained Transformer weights.[^10]
Because differential attention is a drop-in replacement for the attention block, it has been picked up in several settings.
The reference implementation in microsoft/unilm and the V2 blog have been the canonical pointers for downstream adopters; multiple independent PyTorch implementations and a PyPI package wrap the original equations for users who do not need the FlashAttention-optimised kernels.[^3][^12][^13]
The Differential Transformer is significant for three reasons that go beyond its specific benchmark numbers.
First, it locates a concrete failure mode (the spread of softmax mass to uninformative tokens) and addresses it with a small architectural change rather than a training trick or a new objective. The change does not increase parameter count substantially, does not require new data, and does not require an alternative training regime, which makes it unusually easy to test.[^1]
Second, the reported reductions in activation outliers and the resulting quantization robustness connect the architecture to a parallel literature on outlier features and attention sinks, which had previously been treated as artifacts to be patched in post-training or via training-stability tricks. The Diff Transformer paper suggests that some of these artifacts emerge precisely because the single-softmax operator forces attention probability to remain a probability distribution that always sums to one, and that allowing signed attention through the subtraction relieves the pressure that produces outliers in the first place.[^1][^6]
Third, the benchmark gains on long-context language modeling, many-shot ICL, and contextual hallucination are correlated. A model that is better at ignoring irrelevant context tokens should be better at retrieving needles, less swayed by demonstration order, and less likely to copy unrelated phrasing into a summary. The Diff Transformer results offer evidence that a single architectural intervention can move all of these correlated capabilities together, which is consistent with the framing that they share a common root cause.[^1][^4][^7]
Reported and observed limitations include the following.
The table below contrasts the original Diff Transformer (V1) with a standard Transformer baseline using the same modern features and with the V2 redesign.
| Property | Standard Transformer | Diff Transformer V1 | Diff Transformer V2 |
|---|---|---|---|
| Attention operator | Single softmax | Two softmax maps, subtracted with scalar lambda | Two softmax maps from same GQA group, gated by sigmoid lambda |
| Lambda parameterization | n/a | exp(.) - exp(.) + lambda_init with layer-decay | Token/head sigmoid projection |
| Per-head normalization | none specific to attention | Per-head RMSNorm after differential attention | Removed (causes instability at scale) |
| FlashAttention support | Direct | Custom kernels recommended; FlashAttention variants provided | Direct, standard FlashAttention |
| Decoding speed | Baseline | Slightly slower | Matches baseline |
| Activation outliers | Large (top-1 logits ~318) | Sharply reduced (top-1 logits ~39) | Further reduced |
| 4-bit quantization accuracy on HellaSwag | Drops sharply | Matches 6-bit Transformer | Improves further |
| Needle-in-a-haystack (4K, N=6, R=2) | ~55% | ~85% | n/a (not yet reported) |
| Public reference | Many | microsoft/unilm/Diff-Transformer | Diff-Transformer-V2/ subdirectory |
It is also useful to contrast Diff Transformer with two other broad lines of attention research.
Differential attention sits inside a broader effort to characterize and improve the softmax attention operator. Relevant adjacent lines include: