Lightning Attention

Updated Jul 12, 2026

Lightning Attention is an IO-aware (input/output aware) implementation of linear attention that lets the method reach its theoretical linear-time complexity in practice, including in the causal (autoregressive) setting that language models require. Its time cost grows linearly, $O(N)$, with sequence length $N$ rather than quadratically, $O(N^2)$, and the recurrent state it carries stays a fixed size as the context grows, so per-token training and inference speed remains essentially flat from a few thousand tokens to millions. It was developed by researchers at OpenNLPLab and the Shanghai AI Laboratory, led by Zhen Qin and Yiran Zhong, and grew out of their TransNormer line of work rather than originating at a single product company. Lightning Attention is best known as the engine behind MiniMax-01, released by MiniMax in January 2025, the first large-scale foundation model whose attention layers are predominantly linear rather than softmax. ^[1]^[2]

What is Lightning Attention?

Standard transformer self-attention compares every token with every other token, so its cost grows quadratically, $O(N^2)$, with sequence length $N$. Linear attention removes the softmax so the computation can be reordered into a form that scales linearly, $O(N)$, with sequence length. In principle this makes very long contexts cheap, but for years naive implementations were slower than well-optimized quadratic attention, because the causal version of linear attention reduces to a sequential cumulative-sum (cumsum) recurrence that maps poorly onto GPU hardware. Lightning Attention is the kernel-level technique that closes this gap. It reorganizes the computation into a blockwise, tiled algorithm modeled on FlashAttention that keeps data in fast on-chip memory, avoids the slow cumsum, and delivers training and inference speed that stays essentially constant as the sequence grows. The authors of Lightning Attention-2 summarize the result directly: the method "retains consistent training and inference speed regardless of input sequence length and is significantly faster than other attention mechanisms." ^[2]^[4]

Two versions exist. Lightning Attention-1 was introduced in the 2023 TransNormerLLM report as an IO-aware acceleration of linear attention. Lightning Attention-2, published in January 2024, added the intra-block and inter-block decomposition that finally removed the cumsum bottleneck in the causal case, achieving genuinely constant speed regardless of context length. ^[2]^[3]

What problem does linear attention have that Lightning Attention fixes?

In conventional attention, the output is computed as $\mathrm{softmax}(QK^\top)V$, where $Q$, $K$ and $V$ are the query, key and value matrices for a sequence of length $N$ and head dimension $d$. Forming the $N$ by $N$ score matrix $QK^\top$ costs $O(N^2 d)$ time and $O(N^2)$ memory, which dominates at long context. ^[6]

Linear attention, introduced for transformers by Katharopoulos and colleagues in 2020, drops the softmax and replaces it with a simple feature map applied to the queries and keys. Without the softmax tying them together, the matrix product becomes associative, so instead of computing $(QK^\top)V$ from left to right one can compute $Q(K^\top V)$ from right to left. The middle term $K^\top V$ is only a $d$ by $d$ matrix. Because $d$ is fixed and small (often 64 or 128) while $N$ can run into the millions, the right-product order costs $O(N d^2)$ time and $O(d^2)$ memory, that is, linear in sequence length. This reordering is often called the kernel trick or right-product associativity, and it is the source of linear attention's appeal. ^[9]^[4]
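The reordering can be checked numerically. The following is a minimal NumPy sketch, not code from any of the cited papers: the sizes `N` and `d` are illustrative, and `Q` and `K` stand in for queries and keys that have already passed through a feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 512, 16  # illustrative sequence length and head dimension

# Stand-ins for feature-mapped queries/keys and values; no softmax involved.
Q = rng.standard_normal((N, d))
K = rng.standard_normal((N, d))
V = rng.standard_normal((N, d))

# Left product: materializes the full N x N score matrix first.
scores = Q @ K.T          # shape (N, N)
out_left = scores @ V     # shape (N, d)

# Right product: only a d x d summary of keys and values is ever formed.
kv = K.T @ V              # shape (d, d)
out_right = Q @ kv        # shape (N, d)

# Associativity guarantees both orders agree up to floating-point error.
orders_agree = np.allclose(out_left, out_right)
```

The two orders differ only in the size of the intermediate: at $N = 1{,}000{,}000$ and $d = 128$ the score matrix would hold $10^{12}$ entries, while $K^\top V$ holds 16,384. Note that this sketch is non-causal; every query sees every key.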

The catch is causality. A language model may only let position $t$ attend to positions up to $t$, so the simple global product $K^\top V$ is not allowed: each query must see a different running prefix of the keys and values. The exact linear-attention output then requires maintaining a running state, $S_t = S_{t-1} + k_t^\top v_t$, optionally with a decay factor so older tokens fade, and reading it at every step. This prefix sum, or cumsum, is a sequential scan along the sequence. It has a linear floating-point-operation count but is memory-bandwidth bound and inherently serial, so it cannot keep the GPU's matrix-multiply units busy. In practice a model built this way trained more slowly than an ordinary quadratic transformer accelerated with FlashAttention, which is why "linear attention is fast" remained largely theoretical. Removing this bottleneck is exactly what Lightning Attention set out to do. ^[2]^[4]
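To make the two causal formulations concrete, here is a small NumPy sketch that computes the same output both ways: as a masked quadratic product and as the token-by-token recurrence $S_t = S_{t-1} + k_t^\top v_t$. The function names are hypothetical, decay is omitted for simplicity, and the Python loop only illustrates the serial dependency; it is not how a GPU kernel would be written.

```python
import numpy as np

def causal_linear_attention_masked(Q, K, V):
    """Quadratic reference: mask the N x N scores, then multiply by V."""
    N = Q.shape[0]
    mask = np.tril(np.ones((N, N)))   # position t sees only positions <= t
    return ((Q @ K.T) * mask) @ V

def causal_linear_attention_recurrent(Q, K, V):
    """Sequential form: S_t = S_{t-1} + k_t^T v_t, then o_t = q_t S_t."""
    N, d = Q.shape
    S = np.zeros((d, V.shape[1]))     # running key-value state
    out = np.empty((N, V.shape[1]))
    for t in range(N):                # each step depends on the previous one
        S = S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
same = np.allclose(causal_linear_attention_masked(Q, K, V),
                   causal_linear_attention_recurrent(Q, K, V))
```

The recurrent version does linear work overall, but every iteration is a tiny rank-one update that must wait for the one before it, which is the hardware mismatch described above.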

How does Lightning Attention achieve linear complexity?

Lightning Attention keeps the linear-attention math but changes the order of operations so the hardware is used efficiently. As the Lightning Attention-2 paper puts it, "we leverage the thought of tiling, separately handling the intra-block and inter-block components in linear attention calculation." It splits the sequence into contiguous blocks (tiles) and computes the output of each block as the sum of two parts, an intra-block term and an inter-block term. ^[2]

The intra-block term handles interactions inside a single block. Because a block is small, Lightning Attention computes it with the conventional left product, the same way ordinary attention works: it forms the small block score matrix $Q_i K_i^\top$, applies a within-block causal mask $M$ that also carries the per-position decay, and multiplies by $V_i$. Compactly, the intra-block output is $[(Q_i K_i^\top) \text{ masked by } M] V_i$. This part is quadratic in the block size, but the block size is a small fixed constant, so the cost per token stays bounded, and it runs as dense matrix multiplies that GPUs execute very fast. ^[2]^[4]

The inter-block term handles everything that came before the current block. Here Lightning Attention uses the right product. It maintains a single $d$ by $d$ key-value state matrix, written $KV$, that summarizes all previous blocks, and the contribution to the current block is $Q_i (KV)$; when decay is used, each row is additionally scaled by the decay accumulated since the block boundary. After each block the state is updated with that block's own key-value product: the new $KV$ equals the old $KV$, decayed across the length of the block, plus $K_i^\top V_i$, whose rows are likewise weighted by the decay from their position to the end of the block. Combining the two parts, and leaving those inter-block decay weights implicit, the output of block $i$ is $O_i = [(Q_i K_i^\top) \text{ masked by } M] V_i + Q_i (KV \text{ from the previous blocks})$. ^[2]
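The intra-block plus inter-block split can be sketched in NumPy and checked against a quadratic masked reference. This is an illustrative sketch under stated assumptions, not the authors' Triton kernel: the function names are hypothetical, a single scalar `decay` is used per head, the sequence length is assumed to be a multiple of the block size, and the per-row decay weights (`q_scale`, `k_scale`) are written out explicitly so that the blockwise result matches the reference exactly.

```python
import numpy as np

def decay_mask(n, decay):
    """Lower-triangular mask with M[r, c] = decay**(r - c) for c <= r, else 0."""
    diff = np.arange(n)[:, None] - np.arange(n)[None, :]
    return np.where(diff >= 0, decay ** np.maximum(diff, 0), 0.0)

def masked_reference(Q, K, V, decay):
    """Quadratic reference: one N x N masked left product."""
    return ((Q @ K.T) * decay_mask(Q.shape[0], decay)) @ V

def blockwise_linear_attention(Q, K, V, block_size, decay):
    """Intra-block left product plus inter-block right product with a carried state."""
    N, d = Q.shape
    B = block_size
    assert N % B == 0, "this sketch assumes N is a multiple of the block size"
    M = decay_mask(B, decay)
    pos = np.arange(B)
    q_scale = decay ** (pos + 1)       # decay from the previous block's end to each row
    k_scale = decay ** (B - 1 - pos)   # decay from each row to this block's end
    KV = np.zeros((d, V.shape[1]))     # state summarizing all earlier blocks
    out = np.empty((N, V.shape[1]))
    for start in range(0, N, B):       # sequential only across blocks
        Qi, Ki, Vi = (X[start:start + B] for X in (Q, K, V))
        intra = ((Qi @ Ki.T) * M) @ Vi            # within-block, masked left product
        inter = (Qi * q_scale[:, None]) @ KV      # earlier blocks, read from the state
        out[start:start + B] = intra + inter
        KV = decay ** B * KV + (Ki * k_scale[:, None]).T @ Vi
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
blockwise = blockwise_linear_attention(Q, K, V, block_size=16, decay=0.9)
reference = masked_reference(Q, K, V, decay=0.9)
matches = np.allclose(blockwise, reference)
```

Setting `decay=1.0` recovers plain causal linear attention, in which case the state update reduces to adding $K_i^\top V_i$. The loop runs once per block rather than once per token, and everything inside it is a dense matrix product.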

This block recurrence is the key idea. Instead of a token-by-token cumsum, Lightning Attention carries one compact state from block to block, so the only sequential dependency is across the relatively few blocks, not across every token. Within each block the heavy work is dense matrix multiplication. Like FlashAttention, the kernels are written to be IO-aware: they load each tile into the GPU's fast on-chip SRAM, do the intra-block computation and the state update there, and write back only the small results, minimizing traffic to slower high-bandwidth memory. The same tiling is applied to the backward pass for training. The result is a method with $O(N d^2)$ time complexity, a carried state whose size does not depend on the sequence length, and, crucially, training and inference throughput that stays flat as context grows from a few thousand to hundreds of thousands of tokens, in contrast to the quadratic growth of softmax attention even when accelerated by FlashAttention. ^[2]^[4]
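A rough operation count shows where the linear bound comes from. This is a back-of-the-envelope sketch rather than a figure from the papers: it counts each multiply-add as two floating-point operations, ignores masking and other element-wise work, and introduces $B$ for the block size, a symbol the text above does not name.

$$
\begin{aligned}
\text{intra-block, per block:}\quad & \underbrace{2B^2 d}_{Q_i K_i^\top} + \underbrace{2B^2 d}_{(\cdot)\,V_i} = 4B^2 d \\
\text{inter-block, per block:}\quad & \underbrace{2B d^2}_{Q_i\,(KV)} + \underbrace{2B d^2}_{K_i^\top V_i} = 4B d^2 \\
\text{all } N/B \text{ blocks:}\quad & \frac{N}{B}\left(4B^2 d + 4B d^2\right) = 4Nd\,(B + d) \\
\text{unblocked left product:}\quad & 2N^2 d + 2N^2 d = 4N^2 d
\end{aligned}
$$

Under these assumptions the blocked form does $N/(B+d)$ times less arithmetic than the full left product. With illustrative values $d = 128$ and $B = 256$, a 100,000-token sequence gives a ratio of $100{,}000 / 384 \approx 260$, and the ratio keeps growing in proportion to $N$ because $B$ and $d$ are fixed.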

The difference between the two versions lies here. Lightning Attention-1, from TransNormerLLM, already used FlashAttention-style tiling and cut memory roughly fourfold, but it had not fully escaped the causal cumsum, so its advantage eroded at long sequence lengths. Lightning Attention-2 introduced the explicit intra-block plus inter-block split with the carried KV state, which is what makes the causal case truly linear and constant-speed. ^[2]^[3]

How is Lightning Attention used in MiniMax-01?

Lightning Attention moved from research to large scale with MiniMax-01, a family open-sourced by MiniMax on January 14, 2025, comprising the MiniMax-Text-01 language model and the MiniMax-VL-01 vision-language model. MiniMax-Text-01 is a mixture-of-experts model with 456 billion total parameters, of which about 45.9 billion are activated per token. It has 80 layers and 32 experts, each expert with a hidden dimension of 9,216. MiniMax states plainly that for this model "the core lies in lightning attention and its efficient scaling." ^[1]^[7]

Rather than make every layer linear, MiniMax-01 uses a hybrid attention stack. In each group of eight layers, seven use Lightning Attention and one uses ordinary softmax attention, a 7 to 1 ratio; equivalently, a softmax block is inserted after every seven Lightning Attention blocks. The linear layers provide cheap, near-linear scaling over the bulk of the network, while the periodic softmax layers preserve the precise, content-based retrieval that pure linear attention can struggle with. The model uses 64 heads of dimension 128, and its softmax layers employ grouped-query attention with a group size of 8. Positional information is supplied mainly through those softmax layers, which apply rotary position embeddings (RoPE) to half of each head's dimension with a base frequency of 10,000, an arrangement chosen to support length extrapolation. ^[1]^[7]
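The 7 to 1 pattern can be written down as a simple layer schedule. The helper below is a hypothetical illustration, not code from the MiniMax release; it assumes the 80-layer depth quoted above and places the softmax layer last in each group of eight, following the "after every seven Lightning Attention blocks" description.

```python
def hybrid_layer_types(num_layers=80, group_size=8):
    """Label each layer 'lightning', except the last layer of every group."""
    return ["softmax" if (i + 1) % group_size == 0 else "lightning"
            for i in range(num_layers)]

layers = hybrid_layer_types()
num_softmax = layers.count("softmax")      # one per group of eight
num_lightning = layers.count("lightning")  # the remaining seven per group
```

With 80 layers this yields 10 softmax layers and 70 Lightning Attention layers, so only one layer in eight pays the quadratic cost.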

This design let MiniMax train with context lengths up to 1 million tokens and extrapolate to 4 million tokens at inference, roughly 20 to 32 times the window of contemporaneous models, while keeping compute close to linear in the input length. MiniMax later reused the same Lightning-Attention hybrid in its MiniMax-M1 reasoning model, released in June 2025, which the company described as scaling test-time compute efficiently thanks to the linear backbone. ^[1]^[8]

What results does Lightning Attention deliver?

The headline efficiency result for Lightning Attention itself is constant speed. In the Lightning Attention-2 experiments, training and inference throughput stayed essentially flat as sequence length increased, whereas both a vanilla PyTorch attention and a FlashAttention-2 baseline showed the expected quadratic slowdown; the linear-attention language model also reported large inference-throughput gains over a comparable softmax transformer. Memory use stayed constant in the sequence length rather than growing with it. ^[2]^[4]

At model scale, MiniMax-Text-01 demonstrated that a predominantly linear-attention model can be competitive with strong softmax models. It scored about 88.5 percent on MMLU and remained near the top on long-context benchmarks, holding RULER scores in the low-to-mid 0.9 range out to a 1-million-token context and reaching 100 percent on the 4-million-token needle-in-a-haystack retrieval test. MiniMax reported that the MiniMax-01 models "match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering 20-32 times longer context window." ^[1]

| Work | Source | Year | Key contribution |
| --- | --- | --- | --- |
| TransNormer | The Devil in Linear Transformer (EMNLP 2022) | 2022 | Fixed gradient and normalization problems in linear attention |
| Lightning Attention-1 | TransNormerLLM report | 2023 | IO-aware tiled linear attention, about 4x less memory |
| Lightning Attention-2 | A Free Lunch for Unlimited Sequence Lengths | 2024 | Intra-block and inter-block split removes the cumsum; constant speed |
| MiniMax-01 | Scaling Foundation Models with Lightning Attention | 2025 | 456B-parameter hybrid model; 1M-token training, 4M inference |

How does Lightning Attention differ from FlashAttention and other methods?

Lightning Attention sits at the intersection of three lines of work. From FlashAttention it borrows the IO-aware, blockwise-tiling philosophy of keeping tiles in SRAM and minimizing high-bandwidth-memory traffic; its own kernels are implemented in Triton. The difference is what is being tiled: FlashAttention computes exact softmax attention with the full $O(N^2)$ arithmetic, just without ever materializing the score matrix, whereas Lightning Attention computes linear attention with only $O(N)$ arithmetic. FlashAttention makes quadratic attention fast; Lightning Attention makes linear attention fast. ^[2]^[6]

As an implementation of linear attention, Lightning Attention does not invent a new approximation of softmax; it inherits the kernel-trick reformulation and focuses entirely on making the causal computation hardware-efficient. Its block-recurrent form is closely related to the chunkwise algorithms used by other modern efficient-sequence architectures, including the Retentive Network (RetNet), gated linear attention, and the state space model family such as Mamba. All of these share the same underlying duality: a causal linear-attention or state-space layer can be written either as a parallel form for fast training or as a recurrence for fast generation, and the chunked, block-by-block computation is what links the two. Lightning Attention's intra-block parallel term plus inter-block recurrent state is the attention-flavored instance of that idea. The kinship extends to RWKV and other linear-time recurrent language models, which likewise trade the softmax for a decaying linear state. ^[2]^[4]

Lightning Attention is therefore a different route to sub-quadratic attention than the trainable sparse-attention methods that appeared around the same time, such as DeepSeek's Native Sparse Attention and Moonshot's Mixture of Block Attention, which keep the softmax but compute it over a learned subset of tokens. Lightning Attention instead changes the attention function itself into a linear one and then engineers the kernels so the theoretical linear scaling shows up as real wall-clock speed. ^[1]^[2]

References

1. MiniMax. "MiniMax-01: Scaling Foundation Models with Lightning Attention." arXiv:2501.08313, January 14, 2025. https://arxiv.org/abs/2501.08313
2. Qin, Zhen; Sun, Weigao; Li, Dong; Shen, Xuyang; Sun, Weixuan; Zhong, Yiran. "Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models." arXiv:2401.04658, January 9, 2024. https://arxiv.org/abs/2401.04658
3. Qin, Zhen; et al. "TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer." arXiv:2307.14995, July 2023. https://arxiv.org/abs/2307.14995
4. Qin, Zhen; Sun, Weigao; Li, Dong; Shen, Xuyang; Sun, Weixuan; Zhong, Yiran. "Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention." Proceedings of the 41st International Conference on Machine Learning (ICML 2024). arXiv:2405.17381. https://arxiv.org/abs/2405.17381
5. Qin, Zhen; Han, Xiaodong; Sun, Weixuan; Li, Dongxu; Kong, Lingpeng; Barnes, Nick; Zhong, Yiran. "The Devil in Linear Transformer." Proceedings of EMNLP 2022. arXiv:2210.10340. https://arxiv.org/abs/2210.10340
6. Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri; Ré, Christopher. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness." arXiv:2205.14135, May 2022. https://arxiv.org/abs/2205.14135
7. MiniMax. "MiniMax-Text-01" (model card and code). Hugging Face and GitHub, January 2025. https://huggingface.co/MiniMaxAI/MiniMax-Text-01
8. MiniMax. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention." arXiv:2506.13585, June 2025. https://arxiv.org/abs/2506.13585
9. Katharopoulos, Angelos; Vyas, Apoorv; Pappas, Nikolaos; Fleuret, François. "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention." Proceedings of ICML 2020. arXiv:2006.16236. https://arxiv.org/abs/2006.16236
