Lightning Attention
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,115 words
Lightning Attention is an IO-aware (input/output aware) implementation of linear attention that lets the method reach its theoretical linear-time complexity in practice, including in the causal (autoregressive) setting that language models require. It was developed by researchers at OpenNLPLab and the Shanghai AI Laboratory, led by Zhen Qin and Yiran Zhong, and grew out of their TransNormer line of work rather than originating at a single product company. Lightning Attention is best known as the engine behind MiniMax-01, released by MiniMax in January 2025, the first large-scale foundation model whose attention layers are predominantly linear rather than softmax. [1][2]
Standard transformer self-attention compares every token with every other token, so its cost grows quadratically, O(N^2), with sequence length N. Linear attention removes the softmax so the computation can be reordered into a form that scales linearly, O(N), with sequence length. In principle this makes very long contexts cheap, but for years naive implementations were slower than well-optimized quadratic attention, because the causal version of linear attention reduces to a sequential cumulative-sum (cumsum) recurrence that maps poorly onto GPU hardware. Lightning Attention is the kernel-level technique that closes this gap. It reorganizes the computation into a blockwise, tiled algorithm modeled on FlashAttention that keeps data in fast on-chip memory, avoids the slow cumsum, and delivers training and inference speed that stays essentially constant as the sequence grows. [2][4]
Two versions exist. Lightning Attention-1 was introduced in the 2023 TransNormerLLM report as an IO-aware acceleration of linear attention. Lightning Attention-2, published in January 2024, added the intra-block and inter-block decomposition that finally removed the cumsum bottleneck in the causal case, achieving genuinely constant speed regardless of context length. [2][3]
In conventional attention, an output is computed as softmax(QK^T)V, where Q, K and V are the query, key and value matrices for a sequence of length N and head dimension d. Forming the N by N score matrix QK^T costs O(N^2 d) time and O(N^2) memory, which dominates at long context. [6]
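As a point of reference, the quadratic computation can be sketched in NumPy. This is a minimal illustration rather than any library's kernel: the function name `softmax_attention` is hypothetical, and the 1/sqrt(d) scaling is an assumption taken from common practice, not from the formula as written above.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Plain softmax attention; materializes the full N x N score matrix."""
    d = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d)               # (N, N): the quadratic term
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the exponentials
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
N, d = 256, 64
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, weights = softmax_attention(Q, K, V)
print(weights.shape, out.shape)  # (256, 256) (256, 64)
```

The `weights` array is the object whose size grows as N^2; doubling N quadruples both its memory and the work needed to fill it.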
Linear attention, introduced for transformers by Katharopoulos and colleagues in 2020, drops the softmax and replaces it with a simple feature map applied to the queries and keys. Without the softmax tying them together, the matrix product becomes associative, so instead of computing (QK^T)V from left to right one can compute Q(K^T V) from right to left. The middle term K^T V is only a d by d matrix. Because d is fixed and small (often 64 or 128) while N can run into the millions, the right-product order costs O(N d^2) time and O(d^2) memory, that is, linear in sequence length. This reordering is often called the kernel trick or right-product associativity, and it is the source of linear attention's appeal. [9][4]
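The reordering can be checked numerically. The sketch below assumes an identity feature map and no normalization, which keeps it short but is a simplification of real linear-attention layers; the function names are illustrative only.

```python
import numpy as np

def left_product(Q, K, V):
    """(Q K^T) V: builds an N x N intermediate, O(N^2 d) work."""
    return (Q @ K.T) @ V

def right_product(Q, K, V):
    """Q (K^T V): builds only a d x d intermediate, O(N d^2) work."""
    return Q @ (K.T @ V)

rng = np.random.default_rng(0)
N, d = 512, 16
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))

out_left = left_product(Q, K, V)
out_right = right_product(Q, K, V)
print((K.T @ V).shape)                   # (16, 16)
print(np.allclose(out_left, out_right))
```

Both orders give the same result up to floating-point rounding, but only the left product ever allocates an array whose size depends on N squared.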
The catch is causality. A language model may only let position t attend to positions up to t, so the simple global product K^T V is not allowed: each query must see a different running prefix of the keys and values. The exact linear-attention output then requires maintaining a running state, S_t = S_{t-1} + k_t^T v_t, optionally with a decay factor so older tokens fade, and reading it at every step. This prefix sum, or cumsum, is a sequential scan along the sequence. It has linear floating-point-operation count but is memory-bandwidth bound and inherently serial, so it cannot keep the GPU's matrix-multiply units busy. In practice a model built this way trained slower than an ordinary quadratic transformer accelerated with FlashAttention, which is why "linear attention is fast" remained largely theoretical. Removing this bottleneck is exactly what Lightning Attention set out to do. [2][4]
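A small NumPy sketch makes the trade-off concrete. It assumes a single scalar decay factor (called `decay` here, a hypothetical parameter name) and an identity feature map. The first function is the token-by-token running state described above; the second is the equivalent masked left product, which is parallel but quadratic.

```python
import numpy as np

def causal_linear_attention_recurrent(Q, K, V, decay=1.0):
    """Sequential scan: S_t = decay * S_{t-1} + k_t^T v_t, o_t = q_t S_t."""
    n, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.empty((n, V.shape[1]))
    for t in range(n):                      # inherently serial over tokens
        S = decay * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def causal_linear_attention_masked(Q, K, V, decay=1.0):
    """Parallel but quadratic: mask M[t, s] = decay^(t - s) for s <= t."""
    n = Q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    mask = np.where(diff >= 0, decay ** np.maximum(diff, 0), 0.0)
    return ((Q @ K.T) * mask) @ V

rng = np.random.default_rng(0)
N, d = 64, 8
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
same = np.allclose(
    causal_linear_attention_recurrent(Q, K, V, decay=0.9),
    causal_linear_attention_masked(Q, K, V, decay=0.9),
)
print(same)
```

The two functions agree numerically, which is the point: the recurrence is linear in N but serial, while the masked form is parallel but gives back the N by N cost.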
Lightning Attention keeps the linear-attention math but changes the order of operations so the hardware is used efficiently. It splits the sequence into contiguous blocks (tiles) and computes the output of each block as the sum of two parts, an intra-block term and an inter-block term. [2]
The intra-block term handles interactions inside a single block. Because a block is small, Lightning Attention computes it with the conventional left product, the same way ordinary attention works: it forms the small block score matrix Q_i K_i^T, applies a within-block causal mask M that also carries the per-position decay, and multiplies by V_i. Compactly, the intra-block output is [(Q_i K_i^T) masked by M] V_i. This part is quadratic in the block size, but the block size is a small fixed constant, so the cost per token stays bounded, and it runs as dense matrix multiplies that GPUs execute very fast. [2][4]
The inter-block term handles everything that came before the current block. Here Lightning Attention uses the right product. It maintains a single d by d key-value state matrix, written KV, that summarizes all previous blocks, and the contribution to the current block is Q_i (KV). After each block the state is updated: the old KV is scaled by the decay accumulated across the block, and the block's own key-value product K_i^T V_i is added. When decay is used, the rows of Q_i and K_i also carry per-position decay weights, so that each past token is attenuated by its exact distance from the query; without decay the update is simply the old KV plus K_i^T V_i. Combining the two parts, the output of block i is O_i = [(Q_i K_i^T) masked by M] V_i + Q_i (KV from the previous blocks). [2]
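The intra-block and inter-block split can be sketched in NumPy as follows. This is an illustrative reference under stated assumptions, not the released kernels: it uses a single scalar `decay`, an identity feature map, and a sequence length divisible by `block_size`, and all names are hypothetical. The per-position decay weights on the queries and keys are derived here from the token-level recurrence so that the block form reproduces it exactly.

```python
import numpy as np

def reference_recurrent(Q, K, V, decay):
    """Token-by-token baseline: S_t = decay * S_{t-1} + k_t^T v_t."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.empty((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = decay * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def blockwise_linear_attention(Q, K, V, block_size=8, decay=1.0):
    """Causal linear attention as intra-block plus inter-block terms."""
    n, d = Q.shape
    B = block_size
    if n % B:
        raise ValueError("sketch assumes n is a multiple of block_size")
    r = np.arange(B)
    diff = r[:, None] - r[None, :]
    M = np.where(diff >= 0, decay ** np.maximum(diff, 0), 0.0)  # causal + decay
    q_decay = decay ** (r + 1)       # distance from previous block's last token
    k_decay = decay ** (B - 1 - r)   # distance to this block's last token
    KV = np.zeros((d, V.shape[1]))   # carried state, independent of n
    out = np.empty((n, V.shape[1]))
    for start in range(0, n, B):     # sequential only across blocks
        Qi, Ki, Vi = (X[start:start + B] for X in (Q, K, V))
        intra = ((Qi @ Ki.T) * M) @ Vi             # left product, B x B scores
        inter = (q_decay[:, None] * Qi) @ KV       # right product with state
        out[start:start + B] = intra + inter
        KV = decay ** B * KV + (k_decay[:, None] * Ki).T @ Vi
    return out

rng = np.random.default_rng(0)
N, d = 32, 4
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
blocked = blockwise_linear_attention(Q, K, V, block_size=8, decay=0.95)
print(np.allclose(blocked, reference_recurrent(Q, K, V, decay=0.95)))
```

The Python loop runs N / block_size times rather than N times, and every operation inside it is a dense matrix product, which is the property the GPU kernels exploit.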
This block recurrence is the key idea. Instead of a token-by-token cumsum, Lightning Attention carries one compact state from block to block, so the only sequential dependency is across the relatively few blocks, not across every token. Within each block the heavy work is dense matrix multiplication. Like FlashAttention, the kernels are written to be IO-aware: they load each tile into the GPU's fast on-chip SRAM, do the intra-block computation and the state update there, and write back only the small results, minimizing traffic to slower high-bandwidth memory. The same tiling is applied to the backward pass for training. The result is a method with O(N d^2) time complexity, constant memory in the sequence length, and, crucially, training and inference throughput that stays flat as context grows from a few thousand to hundreds of thousands of tokens, in contrast to the quadratic growth of softmax attention even when accelerated by FlashAttention. [2][4]
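A rough operation count illustrates why throughput per token stays flat. The sketch below counts multiply-accumulates for the matrix products only, ignoring masking, softmax, and memory traffic; the block size of 256 and head dimension of 128 are assumed values chosen for illustration, not figures taken from the papers.

```python
def quadratic_attention_macs(n, d):
    """Q K^T (n*n*d) plus scores times V (n*n*d)."""
    return 2 * n * n * d

def blockwise_linear_macs(n, d, block):
    """Per block: intra-block left product, inter-block read, state update."""
    num_blocks = n // block
    intra = 2 * block * block * d   # Q_i K_i^T and (masked scores) V_i
    inter = block * d * d           # Q_i (KV)
    update = block * d * d          # K_i^T V_i
    return num_blocks * (intra + inter + update)

d, block = 128, 256  # assumed values for illustration
for n in (1024, 4096, 16384):
    quad = quadratic_attention_macs(n, d) // n
    lin = blockwise_linear_macs(n, d, block) // n
    print(f"n={n:6d}  quadratic per token={quad:9d}  blockwise per token={lin:6d}")
```

Under these assumptions the blockwise cost per token is 2·block·d + 2·d^2 regardless of n, while the quadratic cost per token grows in proportion to n.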
The difference between the two versions lies here. Lightning Attention-1, from TransNormerLLM, already used FlashAttention-style tiling and cut memory roughly fourfold, but it had not fully escaped the causal cumsum, so its advantage eroded at long sequence lengths. Lightning Attention-2 introduced the explicit intra-block plus inter-block split with the carried KV state, which is what makes the causal case truly linear and constant-speed. [2][3]
Lightning Attention moved from research to large scale with MiniMax-01, a family open-sourced by MiniMax on January 14, 2025, comprising the MiniMax-Text-01 language model and the MiniMax-VL-01 vision-language model. MiniMax-Text-01 is a mixture-of-experts model with 456 billion total parameters, of which about 45.9 billion are activated per token, spread over 32 experts and 80 layers, with each expert's feed-forward network using a hidden dimension of 9,216. [1][7]
Rather than make every layer linear, MiniMax-01 uses a hybrid attention stack. In each group of eight layers, seven use Lightning Attention and one uses ordinary softmax attention, a 7 to 1 ratio; equivalently, a softmax block is inserted after every seven Lightning Attention blocks. The linear layers provide cheap, near-linear scaling over the bulk of the network, while the periodic softmax layers preserve the precise, content-based retrieval that pure linear attention can struggle with. The model uses 64 heads of dimension 128, and its softmax layers employ grouped-query attention with a group size of 8. Positional information is supplied mainly through those softmax layers, which apply rotary position embeddings (RoPE) to half of each head's dimension with a base frequency of 10,000, an arrangement chosen to support length extrapolation. [1][7]
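The layer schedule described here can be written out in a few lines. This is a sketch of the 7 to 1 pattern only, with hypothetical names; it does not reflect how MiniMax's own configuration files express it.

```python
def hybrid_layer_types(num_layers=80, period=8):
    """Every `period`-th layer is softmax attention; the rest are Lightning Attention."""
    return [
        "softmax" if (i + 1) % period == 0 else "lightning"
        for i in range(num_layers)
    ]

layers = hybrid_layer_types()
print(layers[:8])
print(layers.count("lightning"), layers.count("softmax"))  # 70 10
```

With 80 layers and a period of eight, the stack contains 70 Lightning Attention layers and 10 softmax layers, the latter at layers 8, 16, and so on up to 80.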
This design let MiniMax train with context lengths up to 1 million tokens and extrapolate to 4 million tokens at inference, roughly 20 to 32 times the window of contemporaneous models, while keeping compute close to linear in the input length. MiniMax later reused the same Lightning-Attention hybrid in its MiniMax-M1 reasoning model, released in June 2025, which the company described as scaling test-time compute efficiently thanks to the linear backbone. [1][8]
The headline efficiency result for Lightning Attention itself is constant speed. In the Lightning Attention-2 experiments, training and inference throughput stayed essentially flat as sequence length increased, whereas both a vanilla PyTorch attention and a FlashAttention-2 baseline showed the expected quadratic slowdown; the linear-attention language model also reported large inference-throughput gains over a comparable softmax transformer. Memory use stayed constant in the sequence length rather than growing with it. [2][4]
At model scale, MiniMax-Text-01 demonstrated that a predominantly linear-attention model can be competitive with strong softmax models. It scored about 88.5 percent on MMLU and remained near the top on long-context benchmarks, holding RULER scores in the low-to-mid 0.9 range out to a 1-million-token context and reaching 100 percent on the 4-million-token needle-in-a-haystack retrieval test. MiniMax reported overall quality on par with frontier models such as GPT-4o and Claude 3.5 Sonnet while offering a far larger context window. [1]
| Milestone | Source | Year | Key contribution |
|---|---|---|---|
| TransNormer | The Devil in Linear Transformer (EMNLP 2022) | 2022 | Fixed gradient and normalization problems in linear attention |
| Lightning Attention-1 | TransNormerLLM report | 2023 | IO-aware tiled linear attention, about 4x less memory |
| Lightning Attention-2 | A Free Lunch for Unlimited Sequence Lengths | 2024 | Intra-block and inter-block split removes the cumsum; constant speed |
| MiniMax-01 | Scaling Foundation Models with Lightning Attention | 2025 | 456B-parameter hybrid model; 1M-token training, 4M inference |
Lightning Attention sits at the intersection of three lines of work. From FlashAttention it borrows the IO-aware, blockwise-tiling philosophy of keeping tiles in SRAM and minimizing high-bandwidth-memory traffic; its own kernels are written in Triton, whereas the reference FlashAttention kernels are written in CUDA. The more important difference is what is being tiled: FlashAttention computes exact softmax attention with the full O(N^2) arithmetic, just without ever materializing the score matrix, whereas Lightning Attention computes linear attention with only O(N) arithmetic. FlashAttention makes quadratic attention fast; Lightning Attention makes linear attention fast. [2][6]
As an implementation of linear attention, Lightning Attention does not invent a new approximation of softmax; it inherits the kernel-trick reformulation and focuses entirely on making the causal computation hardware-efficient. Its block-recurrent form is closely related to the chunkwise algorithms used by other modern efficient-sequence architectures, including the Retentive Network (RetNet), gated linear attention, and the state space model family such as Mamba. All of these share the same underlying duality: a causal linear-attention or state-space layer can be written either as a parallel form for fast training or as a recurrence for fast generation, and the chunked, block-by-block computation is what links the two. Lightning Attention's intra-block parallel term plus inter-block recurrent state is the attention-flavored instance of that idea. The kinship extends to RWKV and other linear-time recurrent language models, which likewise trade the softmax for a decaying linear state. [2][4]
Lightning Attention is therefore a different route to sub-quadratic attention than the trainable sparse-attention methods that appeared around the same time, such as DeepSeek's Native Sparse Attention and Moonshot's Mixture of Block Attention, which keep the softmax but compute it over a learned subset of tokens. Lightning Attention instead changes the attention function itself into a linear one and then engineers the kernels so the theoretical linear scaling shows up as real wall-clock speed. [1][2]