Linear Attention
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,170 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 5,170 words
Add missing citations, update stale details, or suggest a clearer explanation.
Linear attention is a family of sub-quadratic attention mechanisms that replaces the softmax dot-product operation of standard Transformer self-attention with a feature-map-based factorization. By writing attention as phi(Q) (phi(K)^T V), where phi is a non-negative kernel feature map, the computation can be reordered to scale linearly rather than quadratically with sequence length, reducing complexity from O(N^2 d) to O(N d^2).[1] The formulation also reveals a natural recurrent state representation that allows autoregressive decoding in constant time and memory per token, blurring the line between Transformers and recurrent neural networks.[1] First popularized in the 2020 ICML paper "Transformers are RNNs" by Katharopoulos, Vyas, Pappas and Fleuret, linear attention has since been refined through gating, decay, delta-rule updates and structured state space duality, with variants such as RWKV, RetNet, Mamba-2, GLA and Gated DeltaNet narrowing the historical quality gap to softmax attention.[1][2][3][4][5][6]
The history of linear attention can be traced through three overlapping waves of research. The earliest precursors come from the 1990s, when Schmidhuber proposed fast-weight programmers as an alternative to standard recurrent networks. In a 1992 Neural Computation paper, a slow controller network was trained by gradient descent to write context-dependent fast weights into a second feedforward network via additive outer products of self-generated key and value patterns.[9] This is, up to renaming of variables, the same operation that linear attention performs at each token, although the connection to attention as understood in modern Transformers was not made explicit until decades later.[3] An earlier 1991 technical report by Schmidhuber describes the same idea under a different name.[9]
The second wave coincided with the rise of the Transformer. After Vaswani et al. introduced softmax self-attention in 2017, the quadratic cost of long-sequence training became a recognized bottleneck.[7] Several teams in 2019 and 2020 proposed efficient-attention mechanisms that broadly fell into three families: sparse approximations of the attention matrix, low-rank projections, and kernel-based linearizations.[8] In June 2020 Wang and collaborators released Linformer, which projected keys and values to a low-rank subspace; later the same month Katharopoulos, Vyas, Pappas and Fleuret submitted the seminal "Transformers are RNNs" paper introducing the modern ELU+1 feature-map linear attention and the recurrent state interpretation.[1][16] In September 2020 Choromanski and collaborators released the Performer paper, providing the first kernel approximation of softmax with provable error bounds.[10]
The third wave, from 2021 onward, focused on closing the quality gap. Schlag, Irie and Schmidhuber formalized the equivalence to fast weight programmers and introduced the delta-rule update.[3] In 2023 a series of more expressive variants appeared: RWKV applied linear attention with per-channel exponential decay at large scale, and RetNet introduced explicit decay and a triple parallel/recurrent/chunkwise computation form for industrial-scale training.[4][17] GLA in late 2023 added data-dependent gates and a hardware-aware kernel.[2] In 2024 Mamba 2 unified linear attention with selective state space models through the SSD framework, and a wave of delta-rule variants culminated in Gated DeltaNet, which combines gating with the delta update.[5][6][14] By 2025, hybrid architectures interleaving linear-attention or SSM layers with periodic softmax layers had been adopted at the foundation-model scale by teams releasing MiniMax-01, Kimi Linear and certain Qwen variants, signalling broader industrial uptake of the approach.[18]
Standard self-attention in the transformer architecture computes outputs as softmax(QK^T / sqrt(d_k)) V, where the N x N matrix QK^T is materialized for every layer and every head.[7] This makes time and memory scale as O(N^2 d), which becomes prohibitive when the sequence length N reaches tens or hundreds of thousands of tokens.[1][7] Many efficient-attention proposals appeared between 2019 and 2021, including sparse, low-rank and locality-sensitive approximations, but most retained an explicit softmax over a reduced set of token pairs.[8]
Linear attention takes a different route. Instead of approximating the softmax, it removes the softmax altogether and replaces the exponential kernel exp(q . k) with an explicit feature map phi chosen so that sim(q, k) = phi(q)^T phi(k) is non-negative.[1] Because the new similarity is bilinear, the order of matrix products can be changed: rather than computing (QK^T) V, the algorithm computes phi(K)^T V first (an outer-product sum of size d x d), and then multiplies by phi(Q). The cost of building the intermediate is O(N d^2) instead of O(N^2 d), which is favorable whenever N >> d.[1]
The conceptual roots of this trick go back further. In 1992 Jürgen Schmidhuber introduced "fast weight programmers", in which a slow network learns by gradient descent to write key-value outer products into a separate fast-weight matrix that another network reads from.[3][9] Schlag, Irie and Schmidhuber later showed in 2021 that linearized self-attention is mathematically equivalent to that classical construction, recasting linear Transformers as fast weight memory systems and motivating a richer class of update rules.[3]
Let Q, K, V be the query, key and value matrices of shape N x d. The i-th output of softmax attention is
y_i = sum_j exp(q_i^T k_j) / sum_l exp(q_i^T k_l) * v_j
Katharopoulos et al. generalize the exponential kernel to any non-negative similarity sim(q, k), giving
y_i = sum_j sim(q_i, k_j) v_j / sum_j sim(q_i, k_j)
If sim factorizes as sim(q, k) = phi(q)^T phi(k) for some feature map phi: R^d -> R^r with non-negative outputs, the numerator and denominator both rearrange:
y_i = phi(q_i)^T ( sum_j phi(k_j) v_j^T ) / phi(q_i)^T ( sum_j phi(k_j) )
The two sums no longer depend on i, so they can be precomputed once per layer at cost O(N r d) and reused for every query.[1] The complexity drops from O(N^2 d) to O(N r d), which equals O(N d^2) for the natural choice r = d.[1]
For autoregressive language modeling the sum must respect the causal mask, so that position i only sees positions j <= i. Define the cumulative running statistics
S_i = S_{i-1} + phi(k_i) v_i^T
z_i = z_{i-1} + phi(k_i)
y_i = (phi(q_i)^T S_i) / (phi(q_i)^T z_i)
S_i is a d x d (or r x d) matrix and z_i is a d-vector; both are constant in size with respect to sequence length.[1] These updates make linear attention behave like a recurrent neural network whose hidden state is the matrix S_i, which is why Katharopoulos et al. titled their paper "Transformers are RNNs".[1] At inference time each new token costs O(d^2) arithmetic and O(d^2) memory, independent of how many tokens have already been generated, in contrast to the linearly growing KV cache of softmax Transformers.[1]
The same paper reports speedups of up to 4000 times for autoregressive prediction on very long sequences compared to a standard Transformer of comparable size, while attaining comparable likelihood on tasks tested in the paper.[1]
Several feature maps appear repeatedly in the literature:
| Feature map | Definition | Source |
|---|---|---|
| ELU+1 | phi(x) = elu(x) + 1, applied elementwise | Katharopoulos et al. 2020[1] |
| Positive random features (FAVOR+) | `phi(x) = exp(- | |
| Deterministic Parameter-Free Projection (DPFP) | Concatenation of products of ReLU shifts of x | Schlag, Irie, Schmidhuber 2021[3] |
| Polynomial features | phi(x) = (1, x, x (x) x, ...) truncated to a fixed order | Kacham et al., Arora et al.[11][12] |
| Cosine / sine projections | phi(x) = [cos(W x); sin(W x)] for orthogonal W | Performer variants[10] |
| Learned MLP | phi(x) = MLP(x), trained to mimic softmax spikes | Hedgehog (Zhang et al. 2024)[11] |
The ELU+1 choice was motivated by the need for a strictly positive output without requiring a particular kernel approximation; it is cheap, deterministic and the default for many later studies of pure linear attention.[1] The Performer paper instead derives a positive random-feature decomposition of the exact softmax kernel, giving an unbiased estimator of softmax attention rather than an unrelated similarity.[10] DPFP is a deterministic finite-dimensional approximation that the Schlag et al. authors found to balance simplicity and effectiveness on language modeling and translation tasks.[3]
The core algebraic step that makes linear attention work is the reassociation of three matrix products. In softmax attention, computing softmax(QK^T) V requires materializing QK^T, an N x N matrix, before normalizing across rows; this matrix cannot be avoided because the softmax couples all entries in a row through the normalizer. In linear attention, the absence of a softmax makes the bilinear similarity phi(q)^T phi(k) a true inner product, and matrix multiplication is associative, so (phi(Q) phi(K)^T) V = phi(Q) (phi(K)^T V). Choosing the right-associated form means the algorithm builds a r x d matrix phi(K)^T V rather than an N x N matrix, eliminating the quadratic intermediate.[1] This is a special case of the kernel trick familiar from kernel methods in classical machine learning, where an inner product in a feature space replaces an explicit pairwise similarity computation in input space.[3][10]
The kernel-trick view also clarifies what Performer adds over generic linear attention. The Performer feature map is not arbitrary: it is constructed so that E[phi(q)^T phi(k)] equals the softmax kernel exp(q^T k) for random orthogonal projection matrices, with the expectation taken over those random matrices.[10] As a result, FAVOR+ is an unbiased Monte Carlo estimator of softmax attention rather than a different attention mechanism, and the approximation error is controlled by the number of random features. Generic linear attentions such as ELU+1 or DPFP are not estimators of softmax; they define their own kernel and let training reshape the network around it.[1][3]
A pure recurrence is hardware-unfriendly because it serializes the time dimension. The chunkwise parallel form, popularized by GLA, RetNet, Mamba-2 and DeltaNet, splits the sequence of length L into L/C chunks of size C.[2][4][5][13] Within each chunk the quadratic form phi(Q) phi(K)^T V is used (allowing fused GEMM kernels), while between chunks only the final hidden state S is propagated via the recurrence.[2] This interpolates between the parallel form (one big chunk) and the recurrent form (chunk size one), allowing implementations to choose chunk size based on the hardware and sequence length.[2] The Gated Linear Attention paper develops a hardware-efficient instantiation called FlashLinearAttention that the authors report to be faster than Flash Attention-2 as a standalone layer even on short sequences such as 1K tokens.[2]
The recurrence S_i = S_{i-1} + phi(k_i) v_i^T makes the equivalence to a particular kind of recurrent neural network explicit: instead of a fixed-size vector hidden state as in an LSTM or GRU, the state is a matrix that grows by an outer product at every step.[1][3] This view connects to the older fast-weight literature in two ways. First, Schmidhuber's 1992 Neural Computation paper proposed fast weight memories updated by exactly this kind of additive outer product, with a slow controller producing the keys and values.[9] Second, Schlag, Irie and Schmidhuber proved in 2021 that the additive update is identical, up to renaming, to a linearized self-attention layer, and they then proposed a delta rule extension that replaces the additive write with a Widrow-Hoff style error-correcting write.[3]
The delta-rule update is
S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T
where beta_t is a per-token learning-rate that the network predicts.[3][14] This is exactly one step of gradient descent on the per-key reconstruction error || S k_t - v_t ||^2, which makes the matrix S behave like an associative memory trained on the fly during the forward pass.[14] Yang et al. later showed how to parallelize this update over sequence length by writing the recurrence using generalized Householder transformations, giving DeltaNet a chunkwise parallel training algorithm with the same hardware properties as plain linear attention.[14]
The recurrent perspective also clarifies why linear attention sits between Transformers and classical RNNs. Each new token writes an outer product into a finite memory whose capacity is bounded by d^2; this contrasts with softmax attention, which retains every prior token in the KV cache and can address it exactly through the softmax. Linear attention thus trades a growing exact memory for a fixed-size compressed memory, with cost O(1) per token at inference but a constant-bandwidth bottleneck on how much can be remembered.[1][15]
The Performer architecture, by Choromanski and collaborators at Google Research, was introduced at ICLR 2021 and approximates the softmax kernel with positive orthogonal random features (FAVOR+, Fast Attention Via positive Orthogonal Random features).[10] Unlike the ELU+1 map, FAVOR+ provides an unbiased estimator of softmax attention with provable uniform convergence and low variance, which made it the first widely-adopted linear-attention variant that could approximate, rather than replace, the original Transformer kernel.[10] Performer is fully compatible with pretrained Transformer weights, allowing softmax models to be converted to a linear-time form post-hoc.[10]
Linformer, by Wang and collaborators in mid-2020, also achieves linear complexity but through a different mechanism.[16] Rather than removing the softmax, Linformer projects the N x d key and value matrices through learned k x N matrices to produce k x d proxies, then applies standard softmax attention on the reduced length-k sequence.[16] It is therefore strictly speaking a low-rank attention method rather than a feature-map linear attention, but it is often grouped with linear-attention work because it brings the cost to O(N k d).[16] One consequence is that Linformer is non-causal by default and not as readily reformulated as a recurrence.[16]
RWKV (Receptance Weighted Key Value) was introduced in 2023 by Peng and a large open-source collaboration, and combines a transformer-style attention block with an RNN-style hidden state.[17] Its time-mixing block uses a variant of linear attention in which the addition of new key-value outer products is multiplied by an exponential decay along the time dimension, controlled by a learnable per-channel decay vector w and a bonus vector u for the current token.[17] The result is a recurrence that the authors describe as a linear-cost RNN with transformer-level performance, and they trained models up to 14 billion parameters which the paper describes as the largest dense RNN trained at the time of publication.[17] RWKV-7 is a more recent generation in the same family.
RetNet (Retentive Network), proposed in July 2023 by Sun and collaborators at Microsoft Research and Tsinghua, presents the retention operator as a linear attention with explicit exponential decay along the sequence.[4] Each retention layer admits three mathematically equivalent computation modes: a parallel form for training (similar in cost to softmax attention), a recurrent form for O(1) token-by-token inference, and a chunkwise recurrent form that interpolates between the two for long-sequence inference.[4] The authors position the model as a successor architecture for large language models and report better inference throughput than softmax Transformers paired with Flash Attention, with competitive language-modeling perplexity.[4]
GLA, introduced by Yang, Wang, Shen, Panda and Kim in December 2023 and published at ICML 2024, augments the standard linear-attention recurrence with a data-dependent gate G_t that multiplies the running state before each update.[2] The recurrence becomes
S_t = G_t (.) S_{t-1} + k_t v_t^T
where (.) denotes a structured (typically diagonal or outer-product) decay matrix produced from the input.[2] Compared to RetNet's fixed exponential decay, this lets the model selectively forget context based on the current token. The accompanying FlashLinearAttention kernel is a hardware-aware implementation that the authors found to outperform Flash Attention-2 even on short sequences.[2] GLA was reported to be competitive with the LLaMA-architecture Transformer, RetNet and Mamba in moderate-scale language modeling experiments, with strong length-generalization behavior when extending from 2K to over 20K tokens.[2]
Yang, Wang, Zhang, Shen and Kim published in June 2024 a hardware-efficient algorithm for training DeltaNet, the delta-rule linear attention of Schlag et al., parallelized across sequence length via a reparameterization using Householder transformations.[14] At the 1.3B parameter scale, the authors reported that DeltaNet trained on 100 billion tokens outperformed Mamba and GLA on perplexity and zero-shot downstream tasks, and that DeltaNet-attention hybrid models outperformed Transformer baselines.[14]
Gated DeltaNet by Yang, Kautz and Hatamizadeh, published at ICLR 2025, combines DeltaNet's delta-rule update with GLA-style gating to produce a recurrence that can both forget memory rapidly (via the gate) and update specific keys precisely (via the delta rule).[6] The paper reports consistent improvement over Mamba 2 and DeltaNet on language modeling, common-sense reasoning, in-context retrieval, length extrapolation and long-context understanding benchmarks, with hybrid architectures pairing Gated DeltaNet layers with sliding window attention or Mamba 2 layers performing the best.[6]
Mamba 2, published at ICML 2024 by Dao and Gu under the title "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality", established a precise theoretical bridge between selective state space models and linear attention.[5] The State Space Duality (SSD) framework expresses both families as decompositions of structured semiseparable matrices, with linear attention corresponding to the dual quadratic form and selective SSMs corresponding to the dual linear (matrix-multiply-free) form.[5] Under this view, the two computation modes are exactly the same model evaluated by different algorithms, in the same way that linear attention has both a recurrent and a parallel form.[5] Mamba 2's core layer is described in the paper as a refinement of Mamba's selective SSM that is two to eight times faster while remaining competitive with Transformers on language modeling.[5]
In February 2024, Arora and collaborators at Stanford released BASED (Building Attention Systems for Efficient Decoding), a linear-attention model designed explicitly to address the associative-recall weakness of prior linear-attention and SSM models.[12] The authors observed that on synthetic recall benchmarks and in-context-learning tasks, models such as Mamba, RWKV, Hyena and RetNet underperform standard Transformers, even when their language-modeling perplexity is comparable.[12] BASED combines a sliding-window softmax attention with a Taylor-series linear attention (a polynomial-feature approximation of softmax) and interleaves these layers throughout the network.[12] At inference time the model decodes without a KV cache for the linear-attention component, which Arora et al. report yields up to 24 times higher throughput than a Transformer of comparable parameter count using Flash Attention 2.[12] The paper frames the result as a tunable tradeoff between recall (favored by softmax) and throughput (favored by linear attention), with hybrid models occupying intermediate points on a Pareto frontier.[12]
In February 2024, Zhang, Bhatia, Kumbong and Re proposed Hedgehog at ICLR 2024 to attack the quality gap from a different angle.[11] They identified two properties of softmax that prior linear-attention feature maps lack: low-entropy ("spiky") attention weights that concentrate mass on a small number of relevant tokens, and dot-product monotonicity (attention weights should be monotone in q . k).[11] Hedgehog replaces the fixed feature map by a small trainable MLP, trained either from scratch or to imitate the softmax of a teacher Transformer.[11] The authors reported recovering more than 99 percent of standard Transformer quality, including converting pretrained GPT-2 and Llama-2 7B models into linear-attention form with substantially smaller quality drops than prior linear attentions.[11]
| Variant | Year | Mechanism | Decay/gate | Distinctive feature |
|---|---|---|---|---|
| Linear Transformer | 2020 | phi(x) = elu(x) + 1 | None | Original O(N d^2) formulation[1] |
| Performer | 2020/21 | Positive random features | None | Unbiased softmax estimator[10] |
| DeltaNet (Schlag) | 2021 | DPFP kernel + delta rule | None | Error-correcting writes[3] |
| RWKV | 2023 | Linear attention with channel decay | Fixed exponential | Open-source RNN-Transformer hybrid[17] |
| RetNet | 2023 | Retention with explicit decay | Fixed exponential | Three computation modes[4] |
| GLA | 2023 | Linear attention with data-dependent gate | Learned per-token | FlashLinearAttention kernel[2] |
| Mamba 2 | 2024 | Selective SSM (SSD dual) | Learned diagonal | 2-8x faster than Mamba[5] |
| DeltaNet (Yang) | 2024 | Delta rule, parallelized | None | Householder reparameterization[14] |
| Hedgehog | 2024 | Learned MLP feature map | None | Mimics softmax spikiness[11] |
| Gated DeltaNet | 2024/25 | Delta rule + gate | Learned per-token | Combines both ideas[6] |
Open-source implementations of linear attention exist in several frameworks. The original Linear Transformer is published with reference code by the EPFL Idiap group as the fast-transformers library, which includes the ELU+1 feature map and a CUDA implementation of the causal cumulative-sum kernel.[1] Performer has a maintained PyTorch port (performer-pytorch) and a JAX implementation that follows the FAVOR+ paper.[10] DeltaNet and GLA are distributed by the flash-linear-attention project on GitHub, which packages chunkwise kernels and Triton implementations of GLA, DeltaNet, RetNet and related recurrences.[2][14] Gated DeltaNet's official implementation is released by NVIDIA Labs.[6]
Linear-attention layers also appear inside larger publicly released models. RWKV is the most prominent example of a model family trained from scratch on a pure linear-attention recurrence, with checkpoints released openly by the RWKV project.[17] MiniMax-Text-01 interleaves linear-attention blocks (described by the authors as a Lightning Attention variant) with periodic softmax-attention blocks, in a hybrid model. RetNet was released with checkpoints by Microsoft Research. Recent industrial-scale models such as Kimi Linear and certain Qwen variants are also reported to use hybrid architectures combining linear attention or selective SSM layers with conventional attention for short-context fidelity.[18]
Linear attention is most attractive when sequence length is the dominant cost driver. Long-document modeling, code repositories spanning hundreds of thousands of tokens, audio sequences sampled at high frame rates, genomic sequences, and streaming inference workloads all stress the quadratic scaling of softmax attention.[1][4] In these regimes, the constant-time-per-token recurrent form of linear attention removes the linearly growing KV cache that otherwise dominates serving memory in autoregressive decoding.[1][4][17]
A second area is on-device or low-memory inference. Because the per-token state is O(d^2) rather than O(N d), a 7B linear-attention model can in principle decode tokens with memory bounded independent of context length, which is valuable for mobile and embedded deployment.[17] RWKV in particular has been used as a research vehicle for this kind of deployment.[17]
A third use is as a building block in hybrid architectures, where most layers are linear-attention or SSM layers and a small minority are softmax attention. Such hybrids combine the long-range efficiency of linear recurrences with the precise retrieval capacity of softmax attention, and have been adopted in recent industrial systems.[6][14][18] DeltaNet, Gated DeltaNet and Mamba 2 hybrids with full attention or sliding window attention are reported in their respective papers to match or exceed pure Transformer baselines on language modeling benchmarks at moderate scale.[6][14]
Beyond language modeling, linear attention has been applied to vision (e.g. linear-attention vision transformers), speech, time-series modeling and reinforcement learning settings where the sequence length or rollout length is the primary cost concern.[8]
A practical lens on the tradeoff is what Arora et al. call the recall-throughput frontier: any model must trade some amount of in-context recall capacity for some amount of decoding throughput.[12] Softmax attention occupies the high-recall, low-throughput end of this frontier because it stores every past token exactly but pays a linearly growing KV cache. Pure linear attention with a small state occupies the high-throughput, low-recall end because per-token cost is constant but the matrix state can store only a bounded number of key-value pairs without interference. Variants with larger feature dimensions, decay, gating or the delta rule move along this frontier; hybrid models combine layers from different points on the frontier and let the network decide where to allocate softmax-style precision. The pace of recent improvements suggests that hybrid linear-attention architectures, rather than pure replacements for softmax, will be the dominant form of efficient long-context modeling in the near term.[6][12][14][18]
Despite the recent improvements, linear attention historically lagged softmax attention on language modeling perplexity and on retrieval-heavy tasks, and the quality gap remains a major topic of study.[11][14] Several specific weaknesses have been characterized in the literature:
q . k, breaking an invariant that softmax preserves and that downstream layers appear to rely on.[11]These limitations explain why early linear-attention models, despite excellent asymptotic complexity, often failed to match same-size Transformers in perplexity and downstream accuracy. The most recent variants (GLA, DeltaNet, Mamba 2 and Gated DeltaNet) close most of the gap on moderate-scale benchmarks and outperform Transformers on length-extrapolation tasks, but pure linear-attention models at very large scale are still less battle-tested than softmax Transformers, and most production systems that adopt linear attention do so in a hybrid with at least some softmax layers.[2][6][14][18]
A related point of contrast is the attention sink phenomenon described by Xiao and collaborators in late 2023.[19] In softmax Transformers, a large fraction of attention heads systematically place a disproportionate share of their attention mass on a small set of early tokens (often the first token), regardless of those tokens' semantic content.[19] This behavior is thought to arise because softmax forces attention weights to sum to one, so heads that have nothing useful to attend to must place their mass somewhere; the model learns to park unused attention on tokens with low-magnitude value vectors.[19] StreamingLLM, the system Xiao et al. propose, exploits this by keeping the first few tokens together with a sliding window during streaming decoding, allowing softmax models trained on a fixed window to generalize to multi-million-token streams.[19]
Linear attention does not impose the sum-to-one constraint, and empirical studies report that linear-attention models and other unnormalized variants such as sigmoid-attention models do not develop the same kind of attention sinks even at the hundreds-of-millions-of-parameters scale.[20] This is sometimes cited as evidence that the sink is a softmax-specific artifact rather than a property of attention as such, and as one of the structural differences that linear-attention designers must keep in mind: the failure mode that the sink relieves in softmax models (heads with nothing to attend to) instead manifests in linear-attention models as state-overwriting interference, which gating and the delta rule attempt to address.[15][20]
Linear attention is one strand of a broader effort to escape the quadratic cost of softmax attention in the Transformer. Adjacent approaches include:
N x N matrix with a low-rank approximation, sharing linear scaling but retaining a softmax.[16]