Differential Transformer

Updated Jul 16, 2026

The Differential Transformer (often shortened to Diff Transformer or DIFF Transformer) is a decoder-only neural sequence architecture introduced by researchers at Microsoft Research and Tsinghua University in October 2024. It replaces the standard scaled dot-product attention block of the Transformer with a differential attention operator that computes two parallel softmax attention maps and outputs their weighted difference, controlled by a learnable scalar lambda. The subtraction is designed to cancel "common-mode" attention given to irrelevant tokens (what the authors call attention noise) while amplifying attention to genuinely informative context.^[1] The architecture was published as arXiv preprint 2410.05258 on 7 October 2024 and selected as an oral presentation at ICLR 2025.^[1]^[2] Reference code is released by Microsoft inside the microsoft/unilm repository under the Diff-Transformer directory.^[3]

The work reports gains across language model pretraining loss, downstream zero-shot evaluation, long-context retrieval, many-shot in-context learning, hallucination mitigation in question answering and summarization, and robustness to low-bit quantization of activations.^[1]^[4] A follow-up note titled Differential Transformer V2 by the same group, posted on Hugging Face in January 2026, simplifies the parameterization and removes the need for custom attention kernels.^[5]

Infobox

Attribute	Value
Architecture name	Differential Transformer (Diff Transformer)
Core operator	Differential attention (subtraction of two softmax maps)
Paper	arXiv:2410.05258, "Differential Transformer"
Authors	Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei
Affiliations	Microsoft Research; Tsinghua University
First posted	7 October 2024 (v1); 7 April 2025 (v2)
Venue	ICLR 2025 (oral)
Reference implementation	`microsoft/unilm/tree/master/Diff-Transformer`
Model sizes evaluated	830M, 1.4B, 2.8B, 3B, 6.8B, 13.1B parameters
Training tokens (3B)	1 trillion
Successor	Differential Transformer V2 (January 2026)

Background

Standard Transformers compute attention as a single softmax over query-key inner products. Empirically, researchers have observed that this softmax often spreads non-trivial probability mass over tokens that are not relevant to the prediction, a phenomenon the Differential Transformer authors describe as attention noise. The mass concentrated on uninformative positions can crowd out the genuine signal, which becomes especially harmful when the input is long or when the relevant evidence is a small fraction of the context window.^[1]^[4]

A separate and well-documented manifestation of this issue is the attention sink, where a disproportionate share of attention weight collapses onto a small set of tokens (often the first token of the sequence) regardless of semantic content. Attention sinks contribute to large activation outliers, complicate low-bit deployment, and limit the effective use of model capacity.^[6] The Differential Transformer paper frames differential attention as a way to suppress such common-mode noise in analogy to differential amplifiers in electrical engineering and to noise-cancelling headphones: two correlated signals are subtracted so that the shared noise cancels and the differential signal is amplified.^[1]^[7]

This noise-cancellation framing motivated the authors to look for an attention operator whose output remains close to the standard softmax when the two maps disagree (genuine attention) but approaches zero when the two maps largely agree (noise).^[1] The result, differential attention, is otherwise designed to drop into a conventional decoder-only Transformer with minimal changes.

History

7 October 2024, version 1 of the arXiv preprint 2410.05258 is posted by Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, and Furu Wei. Tianzhu Ye lists a Microsoft Research and Tsinghua University affiliation, Yi Zhu and Gao Huang are at Tsinghua, and the remaining authors are at Microsoft Research with Furu Wei as the corresponding author.^[1]
9 October 2024, popular technical press coverage appears, including a write-up by MarkTechPost and subsequent VentureBeat reporting that frames the work as a "noise-cancelling" architecture for large language models.^[4]^[7]
Late 2024, the reference implementation lands in microsoft/unilm (the Microsoft UniLM monorepo for foundation-model research), with both a naive PyTorch version and FlashAttention-compatible variants.^[3]
7 April 2025, the authors post version 2 of the arXiv preprint, adding mathematical reasoning evaluation and o1-style reasoning experiments.^[2]
April-May 2025, the paper is presented as an oral at ICLR 2025 in Singapore; slides and the conference page are published on the ICLR virtual site.^[8]
January-May 2025, follow-up papers cite Diff Transformer and propose variants such as Shared DIFF Transformer (arXiv:2501.17900) that introduce a shared base matrix with low-rank task-specific adjustments.^[9] A separate analysis paper, Understanding Differential Transformer Unchains Pretrained Self-Attentions (arXiv:2505.16333), examines how differential attention reshapes the function class of pretrained models.^[10]
January 2026, the Microsoft Research team publishes Differential Transformer V2 via the Hugging Face blog, replacing the global learnable lambda with token- and head-specific projected lambdas using a sigmoid gate, removing per-head RMSNorm, and aligning head dimensions with standard FlashAttention kernels.^[5]

How differential attention works

The two-map subtraction

In a standard multi-head attention block, each head computes a single attention map by applying softmax to scaled query-key inner products. The Differential Transformer replaces this with two separate softmax maps per head, subtracted with a learnable scalar coefficient lambda.^[1]

Given an input sequence $X$ with $N$ tokens and model dimension $d_{\mathrm{model}}$, each head projects $X$ into queries, keys, and values. The query and key projections are doubled in width and then split into two halves: $Q_1, Q_2$ (each in $\mathbb{R}^{N \times d}$) and $K_1, K_2$ (each in $\mathbb{R}^{N \times d}$). The value projection $V$ remains a single tensor in $\mathbb{R}^{N \times 2d}$. The differential attention output for one head is:

$$\mathrm{DiffAttn}(X) = \left(\mathrm{softmax}\left(\frac{Q_1 K_1^\top}{\sqrt{d}}\right) - \lambda \, \mathrm{softmax}\left(\frac{Q_2 K_2^\top}{\sqrt{d}}\right)\right) V$$

The first softmax acts like a conventional attention map; the second softmax is interpreted as a "noise" map that captures the common-mode attention distributed over irrelevant positions. The subtraction cancels what the two maps share and amplifies what they disagree on. Because each differential head carries two query/key pairs of width $d$ and a value of width $2d$, it occupies roughly the budget of two standard heads of width $d$, so the head count is halved to stay close to the baseline's size. The authors therefore recommend using models with a sufficiently large head count so that halving heads has little practical effect.^[1]^[3]
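The formula above can be sketched for a single head in plain Python. This is a minimal illustration, not the reference implementation: it omits the causal mask that a decoder-only model would apply, uses tiny hand-written matrices, and the names `attention_map` and `diff_attn` are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_map(Q, K):
    """Row-wise softmax(Q K^T / sqrt(d)) for Q and K given as N x d lists."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    return [softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
            for q in Q]

def diff_attn(Q1, K1, Q2, K2, V, lam):
    """(softmax(Q1 K1^T / sqrt(d)) - lam * softmax(Q2 K2^T / sqrt(d))) V."""
    A1 = attention_map(Q1, K1)
    A2 = attention_map(Q2, K2)
    # Signed weights: each row sums to 1 - lam instead of 1.
    W = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    out = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(len(V[0]))]
           for w_row in W]
    return out, W

# Toy inputs (assumed values): N = 3 tokens, d = 2, so V has width 2d = 4.
Q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q2 = [[0.2, 0.1], [0.1, 0.2], [0.3, 0.3]]
K2 = [[0.1, 0.0], [0.0, 0.1], [0.2, 0.2]]
V = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

out, W = diff_attn(Q1, K1, Q2, K2, V, lam=0.2)

# If both maps are identical and lam = 1, the common-mode signal cancels fully.
cancelled_out, cancelled_W = diff_attn(Q1, K1, Q1, K1, V, lam=1.0)
```

The second call shows the noise-cancellation idea in its extreme form: when the two maps agree exactly, nothing survives the subtraction.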

Lambda reparameterization

The scalar lambda is not learned directly as a free parameter. It is reparameterized as a difference of two exponentials of learned vector dot products plus a layer-dependent initialization constant. For a head with learned parameters $\lambda_{q1}, \lambda_{k1}, \lambda_{q2}, \lambda_{k2}$ (all vectors in $\mathbb{R}^d$), the lambda used by that head, during both training and inference, is:

$$\lambda = \exp(\lambda_{q1} \cdot \lambda_{k1}) - \exp(\lambda_{q2} \cdot \lambda_{k2}) + \lambda_{\mathrm{init}}$$

This form keeps lambda differentiable and allows the network to deviate freely from the initialization while staying numerically well-behaved. The initial constant follows a layer-decaying schedule:

$$\lambda_{\mathrm{init}}(l) = 0.8 - 0.6 \exp(-0.3(l - 1))$$

where $l$ is the layer index starting at 1. Lower layers start near $\lambda_{\mathrm{init}} = 0.2$ and the value rises toward 0.8 in deeper layers. Ablation studies show that fixed constants such as 0.5 or 0.8 perform almost as well, with validation-loss variance under 0.005.^[1]^[11]
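The schedule and the reparameterization can be evaluated directly. The sketch below uses only the two formulas in this section; the function names are hypothetical, and the zero-valued vectors are an assumption chosen so that the two exponentials cancel and lambda equals its initialization constant.

```python
import math

def lambda_init(layer):
    """Layer-dependent constant; `layer` is 1-indexed as in the text."""
    return 0.8 - 0.6 * math.exp(-0.3 * (layer - 1))

def reparam_lambda(lq1, lk1, lq2, lk2, layer):
    """exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init(layer)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return math.exp(dot(lq1, lk1)) - math.exp(dot(lq2, lk2)) + lambda_init(layer)

# Schedule for a 28-layer stack (the depth of the 3B model described later).
schedule = [lambda_init(l) for l in range(1, 29)]

# With all four vectors at zero, lambda is exactly lambda_init for that layer.
zeros = [0.0, 0.0, 0.0, 0.0]
lam_layer_1 = reparam_lambda(zeros, zeros, zeros, zeros, layer=1)
```

Because the two exponentials enter with opposite signs, lambda can move above or below its initial value during training, and it is not constrained to stay inside the interval between 0 and 1.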

Per-head GroupNorm and output scaling

After differential attention, each head is passed through an independent normalization layer before the heads are concatenated and projected. The paper writes this as a per-head GroupNorm to stress that every head is normalized separately; the normalization applied to each head is an RMSNorm. It stabilizes the diverse statistics that arise once the two-map subtraction begins to produce sparse and signed values, which would otherwise yield wildly different head magnitudes. A fixed multiplier of $(1 - \lambda_{\mathrm{init}})$ scales the normalized output so that the gradient flow at initialization matches the gradient flow of a standard Transformer block.^[1] Removing this per-head normalization was reported by the authors to harm training stability with sparse attention patterns.^[11] In V2 the per-head RMSNorm was removed entirely after the team found that, at very large pretraining scales, it caused gradient-norm spikes; a different stabilization is used instead.^[5]
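A minimal sketch of the normalize-scale-concatenate step for one token follows. It assumes an RMSNorm without a learned gain and leaves out the final output projection; `rms_norm` and `combine_heads` are hypothetical names, and the head outputs are made-up values chosen to differ by two orders of magnitude.

```python
import math

def rms_norm(x, eps=1e-5):
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def combine_heads(head_outputs, lam_init):
    """Normalize each head on its own, scale by (1 - lam_init), concatenate."""
    scale = 1.0 - lam_init
    merged = []
    for head in head_outputs:
        merged.extend(scale * v for v in rms_norm(head))
    return merged

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

# Two heads with very different magnitudes before normalization.
heads = [[100.0, -50.0, 25.0, 0.0], [0.5, -1.0, 0.25, 0.0]]
merged = combine_heads(heads, lam_init=0.2)

# After the step, both head slices have an RMS of about 1 - lam_init.
rms_head_0 = rms(merged[:4])
rms_head_1 = rms(merged[4:])
```

Normalizing per head, rather than across the concatenated vector, is what keeps a head with large signed outputs from dominating the others.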

Drop-in nature

Apart from the attention block, the Differential Transformer is otherwise a conventional decoder-only Transformer. It uses rotary position embeddings (RoPE), pre-normalization with RMSNorm, SwiGLU feed-forward layers, and the same loss function as the baseline. The reference repository ships an example.py that instantiates a standard multi-head attention block and a multi-head differential attention block side by side so that they can be swapped at the layer level.^[3]

Training setup and scaling experiments

The paper reports a series of pretraining runs from 830M to 13.1B parameters, all trained on the same data mixture with matched hyperparameters between Diff Transformer and a Transformer baseline that uses the same modern improvements (RoPE, RMSNorm, SwiGLU).^[1]

Headline 3B configuration

The flagship 3B Diff Transformer uses 28 layers, hidden size 3072, and 12 attention heads, trained on 1 trillion tokens with a global batch size of 4M tokens. A 3B Transformer baseline of equivalent shape was trained for comparison (with the language-modeling-only comparisons in the paper using 350B tokens). On the language model evaluation harness across ARC-C, ARC-E, BoolQ, HellaSwag, OBQA, PIQA, and WinoGrande, the 3B Diff Transformer reaches roughly 60.6% average zero-shot accuracy versus roughly 56.8% for the Transformer baseline.^[1]^[4]
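The head arithmetic behind this configuration can be checked with a short sketch. It assumes the head-halving convention from the attention section (two query/key halves of width $d$ and a value of width $2d$ per differential head) and ignores grouped-query attention. The 24-head baseline is an assumption that follows from that convention, not a figure quoted in this article, and `head_layout` is a hypothetical helper.

```python
def head_layout(d_model, n_heads, differential):
    """Per-head width d and total Q/K/V projection widths for one layer."""
    if differential:
        # Each head spends 2d on queries (Q1, Q2), 2d on keys (K1, K2), 2d on values.
        d = d_model // (2 * n_heads)
        per_head = 2 * d
    else:
        d = d_model // n_heads
        per_head = d
    total = n_heads * per_head
    return {"d": d, "q_width": total, "k_width": total, "v_width": total}

diff_3b = head_layout(d_model=3072, n_heads=12, differential=True)
base_3b = head_layout(d_model=3072, n_heads=24, differential=False)  # assumed head count
```

Under these assumptions both layouts use a width of 128 per softmax map and projection widths equal to the hidden size, which is why halving the head count keeps the two models comparable.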

Parameter and token efficiency

In the size scan, Diff Transformer reaches the same language-modeling loss as the Transformer baseline using approximately 65% of the parameters or training tokens. Concretely, the 7.8B Diff Transformer matches a 13.1B Transformer in the comparable configuration, which is roughly 60% of the parameters (7.8 / 13.1 ≈ 0.595), and the token comparison shows a reduction of about 36% in training tokens for matched loss.^[1]^[4]

Long-context training

Diff Transformer was further trained at 64K context length on a book corpus and evaluated with cumulative negative log-likelihood across context positions. The Diff Transformer improvement is consistent across positions, with the gap growing for tokens deep into the context.^[1]

Experimental results

Needle-in-a-haystack retrieval

In Needle in a Haystack (NIAH) benchmarks, distractor "needles" are inserted into long contexts and the model is asked to retrieve specific keys. In a multi-needle setting with $N = 6$ needles and $R = 2$ queries inside a 4K-token context, the Transformer baseline reaches roughly 55% retrieval accuracy while the Diff Transformer reaches roughly 85%, a 30-point gap. At 64K context length, Diff Transformer maintains stable accuracy as the needle depth varies, while the Transformer baseline degrades sharply. The paper highlights a 76-point accuracy improvement for needles positioned around the 25% depth mark of a 64K context.^[1]^[4]
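A multi-needle test of this kind can be sketched as follows. The needle wording, the filler text, and the substring-based scoring are assumptions made for illustration and may differ from the paper's protocol; `build_multi_needle_prompt` and `retrieval_accuracy` are hypothetical helpers.

```python
import random

def build_multi_needle_prompt(filler_sentences, needles, n_queries, seed=0):
    """Scatter N needle sentences through the filler and ask about R of them."""
    rng = random.Random(seed)
    doc = list(filler_sentences)
    for key, value in needles.items():
        position = rng.randrange(len(doc) + 1)
        doc.insert(position, f"The magic number for {key} is {value}.")
    queried = rng.sample(sorted(needles), n_queries)
    question = "What are the magic numbers for " + " and ".join(queried) + "?"
    return " ".join(doc) + "\n" + question, queried

def retrieval_accuracy(answer, needles, queried):
    """Fraction of queried needles whose value appears in the model's answer."""
    hits = sum(1 for key in queried if str(needles[key]) in answer)
    return hits / len(queried)

# N = 6 needles, R = 2 queries (values are made up).
needles = {"Paris": 4821937, "Cairo": 7350162, "Lima": 1968245,
           "Oslo": 5502718, "Hanoi": 3047596, "Quito": 8619403}
filler = ["The grass is green."] * 50
prompt, queried = build_multi_needle_prompt(filler, needles, n_queries=2)
```

The four needles that are not queried act as distractors: a model that attends to them as strongly as to the two queried ones is more likely to return the wrong number.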

Many-shot in-context learning

The authors evaluate many-shot in-context learning on TREC (6 classes), TREC-fine (50 classes), Banking-77, and Clinic-150, scaling demonstrations up to a 64K-token context. Diff Transformer improves over Transformer by an average of 5.2% to 21.6% across these datasets. Crucially, Diff Transformer is also far less sensitive to demonstration order. The accuracy standard deviation across random permutations of demonstrations drops to below 2%, versus more than 10% for the Transformer baseline.^[1]^[4]
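Order sensitivity of this kind can be measured by re-scoring the same demonstrations under several random permutations. The sketch below is a generic harness, not the paper's evaluation code: `evaluate` stands for any user-supplied function that maps an ordered list of demonstrations to an accuracy, and the two toy evaluators are stand-ins for real models.

```python
import random
import statistics

def order_sensitivity(demos, evaluate, n_permutations=10, seed=0):
    """Mean and population standard deviation of accuracy across random orders."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_permutations):
        shuffled = demos[:]
        rng.shuffle(shuffled)
        scores.append(evaluate(shuffled))
    return statistics.mean(scores), statistics.pstdev(scores)

demos = [("text a", "A"), ("text b", "B"), ("text c", "A"), ("text d", "B")]

def order_invariant(ordered_demos):
    """Toy model whose accuracy ignores demonstration order."""
    return 0.75

def order_sensitive(ordered_demos):
    """Toy model whose accuracy depends on the label of the first demonstration."""
    return 0.9 if ordered_demos[0][1] == "A" else 0.6

mean_inv, std_inv = order_sensitivity(demos, order_invariant)
mean_sens, std_sens = order_sensitivity(demos, order_sensitive, n_permutations=50)
```

A low standard deviation across permutations is the property the paper reports for Diff Transformer; the harness only quantifies it and says nothing about its cause.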

Contextual hallucination

For hallucination mitigation, the paper measures the rate at which model outputs include claims not entailed by the context. On the XSum, CNN/DailyMail, and MultiNews summarization tasks, the rate of hallucination-free outputs improves by 9 to 19 percentage points. On the Qasper, HotpotQA, and 2WikiMultihopQA question-answering datasets, the accuracy gap reaches 8 to 11 percentage points under a GPT-4o-based evaluation protocol. Secondary coverage cites larger figures: a 13-point accuracy gain on single-document QA and a 21-point gain on multi-document QA.^[1]^[4]

Activation outliers and quantization

A surprising consequence of differential attention is a substantial reduction in activation outliers, which makes the network easier to quantize. Top-1 attention-logit values drop from about 318.0 in the Transformer baseline to about 38.8 in Diff Transformer, and top-1 hidden-state values drop from 3608.6 to 1688.2.^[1] When the same models are post-training quantized, the 4-bit Diff Transformer matches the HellaSwag zero-shot accuracy of the 6-bit Transformer baseline, and the 6-bit Diff Transformer is essentially indistinguishable from the 16-bit baseline.^[1]^[4]
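Why a smaller outlier helps low-bit quantization can be seen with a toy example. The sketch assumes symmetric absmax quantization over a whole tensor, which may differ from the paper's exact setup; the "bulk" values are invented, and only the two outlier magnitudes (318.0 and 38.8) are taken from the figures above.

```python
def absmax_quantize(values, bits):
    """Quantize to signed integers with one shared absmax scale, then dequantize."""
    qmax = 2 ** (bits - 1) - 1              # 127 for 8-bit
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [c * scale for c in codes]

def error_on_bulk(bulk, outlier, bits):
    """Mean absolute error on the ordinary values when one outlier sets the scale."""
    restored = absmax_quantize(bulk + [outlier], bits)
    return sum(abs(b - r) for b, r in zip(bulk, restored)) / len(bulk)

bulk = [0.1, 0.5, -0.3, 1.0]                # made-up "ordinary" logits

err_large_outlier = error_on_bulk(bulk, 318.0, bits=8)
err_small_outlier = error_on_bulk(bulk, 38.8, bits=8)
```

With the larger outlier the shared scale is so coarse that every ordinary value rounds to zero, so shrinking the outlier restores resolution for the rest of the tensor.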

Reasoning evaluation (v2 addition)

The v2 version of the paper, posted in April 2025, adds a mathematical reasoning section. After 20B additional tokens of math-focused continued pretraining, the Diff Transformer is reported as 11.3% ahead of the Transformer baseline on a set of math benchmarks. Under o1-style extended chain-of-thought reasoning, the average accuracy gain across eight math benchmarks is roughly 7.5%.^[2]

Reference implementation

The Microsoft unilm monorepo provides a Diff-Transformer/ directory with three attention implementations.^[3]

File	What it provides
`multihead_diffattn.py`	A naive PyTorch implementation of multi-head differential attention. Useful as a readable reference.
`multihead_flashdiff_1.py`	A FlashAttention-based implementation that supports different query/key and value head dimensions; the authors recommend this variant.
`multihead_flashdiff_2.py`	A FlashAttention variant that does not require different query/key and value dimensions.
`multihead_attention.py`	A conventional multi-head attention reference for direct comparison.
`example.py`	Side-by-side instantiation of standard and differential attention.
`rms_norm.py`	RMSNorm utility shared by both reference blocks.
`kernel/`	Additional kernel implementations supporting custom attention shapes.

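The `multihead_flashdiff_1.py` variant is recommended for new training runs. It relies on an attention kernel that accepts a value head dimension different from the query/key head dimension, which in the formulation above is $2d$ for values versus $d$ for each query/key half; together with the halved head count this keeps parameter and FLOP budgets close to the Transformer baseline. The codebase also includes a Diff-Transformer-V2/ subdirectory matching the V2 blog.^[3]^[5]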

Community implementations exist on PyPI as differential-transformer and in standalone PyTorch repositories on GitHub, including nanowell/Differential-Transformer-PyTorch.^[12]^[13] These independent reproductions follow the formulas above but vary in how they handle FlashAttention compatibility and the lambda initialization schedule.

Variants and successor work

Differential Transformer V2

In January 2026 the Microsoft team published a blog titled Differential Transformer V2 on Hugging Face, accompanied by code in the Diff-Transformer-V2/ subdirectory of microsoft/unilm. V2 makes three principal changes.^[5]

Instead of splitting double-width Q and K projections into two halves and pairing them with a value head of width $2d$, V2 doubles the number of query heads but keeps the grouped-query key-value head count. The two halves of the differential pair are drawn from the same group, so the attention computation can be carried out with a single call to a standard FlashAttention kernel. No custom CUDA kernels are needed.
The global, exponential reparameterization of lambda is replaced by a token-specific, head-wise projection passed through a sigmoid. This eliminates the layer-decay schedule and the four lambda_* vectors per head.
The per-head RMSNorm after the differential attention is removed because it introduced gradient-norm spikes at production scale.

V2 is reported to match the baseline Transformer's decoding speed and to lower language-modeling loss by 0.02-0.03 nats at 1T training tokens on both dense and Mixture-of-Experts (MoE) configurations. It also further reduces activation-outlier magnitudes, with the team reporting that the "context RMS" of attention outputs is constrained to $(0, \sqrt{2})$ rather than the $(1/\sqrt{n}, 1)$ range of standard softmax outputs.^[5]
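One possible reading of the V2 changes, for a single token and a single differential pair, is sketched below. It is an interpretation of the description above rather than the released code: the projection vector `w_lambda`, the function names, and the choice to subtract the two attention outputs (not the two maps) are assumptions, and causal masking is omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, K, V):
    """Standard softmax attention for one query vector."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

def diff_v2_token(x, q1, q2, K, V, w_lambda):
    """Two query heads share K and V; a sigmoid-gated lambda mixes their outputs."""
    lam = sigmoid(sum(a * b for a, b in zip(x, w_lambda)))  # token- and head-specific
    o1 = attend(q1, K, V)
    o2 = attend(q2, K, V)
    return [a - lam * b for a, b in zip(o1, o2)], lam

# Toy values (assumed).
x = [0.5, -1.0]
w_lambda = [0.3, 0.1]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = [1.0, 0.5]

out, lam = diff_v2_token(x, q, q, K, V, w_lambda)
plain = attend(q, K, V)
```

Because both query heads attend over the same keys and values, the two outputs can come from one ordinary attention call with twice as many query heads, which is what makes a standard kernel sufficient.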

Shared DIFF Transformer

Yueyang Cang and collaborators proposed Shared DIFF Transformer in January 2025 (arXiv:2501.17900). They argue that the two independent Q/K projections in the original block are partly redundant. They instead share a base Q/K matrix and add a low-rank adjustment that produces the second softmax map. The architecture is reported to retain the noise-suppression behaviour of Diff Transformer while reducing parameters and improving efficiency for long-sequence modelling and key information retrieval.^[9]

Differential MultiModal Transformer (PaliGemma extension)

A 2025 arXiv preprint extends differential attention to a vision-language setting by inserting the differential attention block into the PaliGemma 3B model and fine-tuning with LoRA. Across noisy information-retrieval and visual question-answering tasks, the differential variant improves accuracy by up to roughly 30 percentage points in long-context multimodal scenarios.^[14]

Threshold differential attention and analysis work

A January 2026 preprint titled Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Long-Context Modeling (arXiv:2601.12145) proposes adding a learned threshold to differential attention to suppress residual attention sinks; this paper also discusses how differential attention interacts with attention-sink phenomena.^[15] A separate paper, Understanding Differential Transformer Unchains Pretrained Self-Attentions (arXiv:2505.16333), provides analytic decompositions of the operator and shows how differential attention can be inserted into pretrained Transformer weights.^[10]

Applications and adoption

Because differential attention is a drop-in replacement for the attention block, it has been picked up in several settings.

Pretraining of large language models in the size range from below 1B up to 13B parameters in the original paper, and extended in V2 to production-scale dense and MoE runs.^[1]^[5]
Long-context retrieval systems, where the NIAH-style robustness of differential attention is useful for retrieval-augmented generation pipelines that must locate small evidence spans in 32K-64K-token contexts.^[1]^[4]
In-context-learning-heavy applications where the answer depends on order-sensitive few-shot demonstrations. The reported reduction in permutation variance reduces an important source of evaluation noise.^[1]^[4]
Hallucination-sensitive deployments in summarization and grounded question answering, where the 9-19 percentage-point reduction in non-grounded outputs is directly relevant to faithfulness guarantees.^[1]
Edge and on-device inference, where the lower activation magnitudes allow 4-bit activation quantization to retain accuracy that would otherwise require 6-bit precision in a Transformer baseline.^[1]^[4]
Multimodal vision-language models through differential adaptations of PaliGemma and related backbones.^[14]
Hyperspectral image classification, where DiffFormer (a spatial-spectral Diff Transformer for hyperspectral data, arXiv:2412.17350) applies the differential operator to image bands.^[16]

The reference implementation in microsoft/unilm and the V2 blog have been the canonical pointers for downstream adopters; multiple independent PyTorch implementations and a PyPI package wrap the original equations for users who do not need the FlashAttention-optimised kernels.^[3]^[12]^[13]

Significance

The Differential Transformer is significant for three reasons that go beyond its specific benchmark numbers.

First, it locates a concrete failure mode (the spread of softmax mass to uninformative tokens) and addresses it with a small architectural change rather than a training trick or a new objective. The change does not increase parameter count substantially, does not require new data, and does not require an alternative training regime, which makes it unusually easy to test.^[1]

Second, the reported reductions in activation outliers and the resulting quantization robustness connect the architecture to a parallel literature on outlier features and attention sinks, which had previously been treated as artifacts to be patched in post-training or via training-stability tricks. The Diff Transformer paper suggests that some of these artifacts emerge precisely because the single-softmax operator forces attention probability to remain a probability distribution that always sums to one, and that allowing signed attention through the subtraction relieves the pressure that produces outliers in the first place.^[1]^[6]

Third, the benchmark gains on long-context language modeling, many-shot ICL, and contextual hallucination are correlated. A model that is better at ignoring irrelevant context tokens should be better at retrieving needles, less swayed by demonstration order, and less likely to copy unrelated phrasing into a summary. The Diff Transformer results offer evidence that a single architectural intervention can move all of these correlated capabilities together, which is consistent with the framing that they share a common root cause.^[1]^[4]^[7]

Limitations and criticisms

Reported and observed limitations include the following.

Throughput cost. The naive implementation of differential attention runs two softmax computations per head and applies an additional normalization. The original paper acknowledges a slight decrease in throughput compared with the Transformer baseline, and secondary commentary cites figures in the 5-30% range depending on the differential-attention wrapper and on whether custom kernels are used.^[1]^[6]
Need for custom kernels at V1. The recommended V1 variant uses different head dimensions for queries/keys and values, which means standard FlashAttention kernels do not directly apply. The reference repo ships custom kernels; the V2 redesign was motivated in part to remove this requirement.^[3]^[5]
From-scratch training assumption. The original paper trains all models from scratch. Inserting differential attention into a fully pretrained Transformer requires care, and analysis work has explicitly studied how the operator can be unchained from existing self-attentions without destroying their representations.^[10]
Sensitivity to lambda initialization at scale. While ablations show small validation-loss differences for several $\lambda_{\mathrm{init}}$ schedules in the 1.4B regime, V2 reports that the original exponential reparameterization produced gradient-norm instabilities at production-scale learning rates (6e-4 to 1e-3) and so was replaced with a simpler sigmoid gate.^[5]
Halving of effective heads. Because each differential head consumes a Q-K budget worth two regular heads, the architecture is recommended for models with a sufficiently large head count. Small-head configurations may not benefit as much.^[3]
Mechanistic understanding is still incomplete. Why differential attention generalizes the way it does, and how exactly the two maps decompose into "signal" and "noise" components inside trained networks, remains an area of active study, with analysis papers proposing specific decompositions and noting that the empirical gains are not yet fully explained.^[10]
Public-review concerns. The ICLR 2025 OpenReview record for this paper hosts the formal reviewer discussion. The paper was accepted as an oral, but the reviews contain specific critiques that interested readers can consult directly.^[11]

Comparison

The table below contrasts the original Diff Transformer (V1) with a standard Transformer baseline using the same modern features and with the V2 redesign.

Property	Standard Transformer	Diff Transformer V1	Diff Transformer V2
Attention operator	Single softmax	Two softmax maps, subtracted with scalar lambda	Two softmax maps from same GQA group, gated by sigmoid lambda
Lambda parameterization	n/a	$\exp(\cdot) - \exp(\cdot) + \lambda_{\mathrm{init}}$ with layer-decay	Token/head sigmoid projection
Per-head normalization	none specific to attention	Per-head RMSNorm after differential attention	Removed (causes instability at scale)
FlashAttention support	Direct	Custom kernels recommended; FlashAttention variants provided	Direct, standard FlashAttention
Decoding speed	Baseline	Slightly slower	Matches baseline
Activation outliers	Large (top-1 logits ~318)	Sharply reduced (top-1 logits ~39)	Further reduced
4-bit quantization accuracy on HellaSwag	Drops sharply	Matches 6-bit Transformer	Improves further
Needle-in-a-haystack (4K, $N=6$, $R=2$)	~55%	~85%	n/a (not yet reported)
Public reference	Many	`microsoft/unilm/Diff-Transformer`	`Diff-Transformer-V2/` subdirectory

It is also useful to contrast Diff Transformer with two other broad lines of attention research.

Compared with sparse-attention methods that explicitly mask or sub-sample positions (for example sliding-window attention or learned-pattern variants), Diff Transformer leaves the attention map dense but allows the two softmax maps to cancel mass in irrelevant positions. The resulting effective attention map is sparse without requiring a fixed sparsity pattern.^[1]
Compared with linear-recurrent alternatives such as RetNet and Mamba, which replace softmax attention altogether with subquadratic mechanisms, Diff Transformer preserves the quadratic-attention block and the standard training stack, paying its cost in a constant factor rather than in algorithmic complexity.^[1]

Related work

Differential attention sits inside a broader effort to characterize and improve the softmax attention operator. Relevant adjacent lines include:

Attention sinks. Survey and analysis work documents how a few tokens (often the first) absorb disproportionate attention mass and create activation outliers. Diff Transformer is one of several recent architectures cited as targeting this problem from the operator-design side.^[6]
Outlier features and quantization robustness. The lower outlier magnitudes of Diff Transformer make it directly relevant to work that designs training-time interventions to keep activations within ranges friendly to 4-bit and 8-bit deployment.^[1]
Long-context architectures. Long-context language models, including those that rely on RoPE extrapolation and on retrieval-augmented decoding, often degrade because of the same attention-noise phenomenon that Diff Transformer targets at the architectural level.^[1]^[4]
In-context-learning robustness. The reduced sensitivity to demonstration order observed for Diff Transformer connects with literature documenting the order sensitivity of in-context learning in standard Transformers.^[1]^[4]
Reasoning and chain-of-thought. The v2 reasoning evaluation places Diff Transformer in the conversation about whether architectural changes can improve reasoning at fixed compute, complementing data-side and training-objective interventions.^[2]

References

1. Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, "Differential Transformer", arXiv:2410.05258v1, 2024-10-07. https://arxiv.org/abs/2410.05258. Accessed 2026-05-20.
2. Tianzhu Ye, Li Dong, Yuqing Xia, Yutao Sun, Yi Zhu, Gao Huang, Furu Wei, "Differential Transformer (v2 revision)", arXiv:2410.05258v2, 2025-04-07. https://arxiv.org/html/2410.05258v2. Accessed 2026-05-20.
3. Microsoft Research, "Diff-Transformer reference implementation", GitHub `microsoft/unilm` repository, 2024-2026. https://github.com/microsoft/unilm/tree/master/Diff-Transformer. Accessed 2026-05-20.
4. MarkTechPost, "Differential Transformer: A Foundation Architecture for Large Language Models that Reduces Attention Noise and Achieves Significant Gains in Efficiency and Accuracy", 2024-10-09. https://www.marktechpost.com/2024/10/09/differential-transformer-a-foundation-architecture-for-large-language-models-that-reduces-attention-noise-and-achieves-significant-gains-in-efficiency-and-accuracy/. Accessed 2026-05-20.
5. Microsoft Research, "Differential Transformer V2", Hugging Face Blog, 2026-01-20. https://huggingface.co/blog/microsoft/diff-attn-v2. Accessed 2026-05-20.
6. Various authors, "Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation", arXiv:2604.10098, 2026. https://arxiv.org/html/2604.10098. Accessed 2026-05-20.
7. VentureBeat, "Microsoft's Differential Transformer cancels attention noise in LLMs", 2024-10. https://venturebeat.com/ai/microsofts-differential-transformer-cancels-attention-noise-in-llms. Accessed 2026-05-20.
8. ICLR 2025 program, "Differential Transformer (Oral)", International Conference on Learning Representations 2025. https://iclr.cc/virtual/2025/oral/31859. Accessed 2026-05-20.
9. Yueyang Cang, Yuhang Liu, Xiaoteng Zhang, Li Shi, Wenge Que, "Shared DIFF Transformer", arXiv:2501.17900, 2025-01-29 (revised 2025-12-16). https://arxiv.org/abs/2501.17900. Accessed 2026-05-20.
10. Authors of arXiv:2505.16333, "Understanding Differential Transformer Unchains Pretrained Self-Attentions", arXiv:2505.16333v3, 2025. https://arxiv.org/html/2505.16333v3. Accessed 2026-05-20.
11. OpenReview, "Differential Transformer (ICLR 2025 forum)", OpenReview submission OvoCm1gGhN. https://openreview.net/forum?id=OvoCm1gGhN. Accessed 2026-05-20.
12. PyPI, "differential-transformer", Python Package Index. https://pypi.org/project/differential-transformer/. Accessed 2026-05-20.
13. nanowell, "Differential-Transformer-PyTorch", GitHub repository. https://github.com/nanowell/Differential-Transformer-PyTorch. Accessed 2026-05-20.
14. Authors of arXiv:2507.15875, "Differential Multimodal Transformers", arXiv:2507.15875, 2025. https://arxiv.org/abs/2507.15875. Accessed 2026-05-20.
15. Authors of arXiv:2601.12145, "Threshold Differential Attention for Sink-Free, Ultra-Sparse, and Long-Context Modeling", arXiv:2601.12145, 2026. https://www.arxiv.org/pdf/2601.12145. Accessed 2026-05-20.
16. Authors of arXiv:2412.17350, "DiffFormer: a Differential Spatial-Spectral Transformer for Hyperspectral Image Classification", arXiv:2412.17350, 2024. https://arxiv.org/pdf/2412.17350. Accessed 2026-05-20.
