Positional encoding is a technique used to inject information about token order into transformer models. Because transformers process all tokens in a sequence simultaneously through self-attention, they have no built-in sense of which token comes before or after another. Without some form of position signal, a transformer would treat the sentence "the cat sat on the mat" identically to "mat the on sat cat the." Positional encoding solves this by adding or otherwise integrating position-dependent signals into the model's representations, allowing it to distinguish token order and learn patterns that depend on sequence structure.
The concept was introduced alongside the original transformer architecture by Vaswani et al. in 2017 and has since evolved into a rich family of methods. Different approaches offer different trade-offs in terms of generalization to unseen sequence lengths, computational overhead, and the ability to capture relative versus absolute positions.
Recurrent neural networks (RNNs) and LSTMs process tokens one at a time in sequence order, so position information is implicit in the computation itself. Convolutional models similarly operate over local windows in a fixed order. Transformers, by contrast, compute attention scores between all pairs of tokens in parallel. The attention mechanism is permutation-equivariant: if you shuffle the input tokens, the output gets shuffled in exactly the same way, with no change in the attention weights themselves.
This parallelism is what makes transformers fast and scalable, but it also means the model cannot tell position 0 from position 500 unless something external provides that information. Positional encoding fills this gap. Without it, tasks like language modeling (where the next token depends heavily on word order), machine translation, and virtually any sequential task would break down.
More formally, consider the standard scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. This function depends only on the content of the query, key, and value vectors, not on the positions from which those vectors originated. If two tokens are swapped in the input, the corresponding rows of Q, K, and V simply swap, and the output follows suit. There is no mechanism that says "token A came before token B." Positional encoding breaks this symmetry by making Q, K, or V depend on position, either by adding position-dependent vectors to the input embeddings, rotating the query and key vectors, or biasing the attention scores directly.
The original transformer paper, "Attention Is All You Need," introduced a deterministic positional encoding based on sine and cosine functions at different frequencies. For a token at position pos in the sequence and dimension index i in the embedding vector of dimension d_model, the encoding is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Each dimension of the positional encoding corresponds to a sinusoidal wave with a different wavelength, ranging from 2pi to 10000 * 2pi. The positional encoding vector is added element-wise to the token embedding before being fed into the transformer layers.
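The table construction can be sketched in a few lines of NumPy; the function name and shapes are illustrative, not from the original paper:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Build the (max_len, d_model) table of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = positions / base ** (dims / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=16)
```

The resulting rows are what get added to the token embeddings; every entry lies in [-1, 1].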
The base value of 10,000 was chosen by Vaswani et al. as a practical default. It gives the lowest-frequency component a wavelength of 10000 * 2pi (roughly 63,000) positions, so it varies only slowly across a sequence, while the highest-frequency dimension oscillates with a wavelength of just 2pi. This spread of wavelengths means that nearby positions differ primarily in their high-frequency components, while distant positions differ across all frequencies. The choice of base has since become a tunable hyperparameter in modern RoPE-based models (see below).
Several properties made this design attractive:
Bounded values. The sine and cosine functions produce values in the range [-1, 1], which is compatible with the scale of typical token embeddings.
Unique representation. Each position receives a distinct encoding vector, so the model can in principle distinguish any two positions.
Relative position through linear transformation. Vaswani et al. noted that for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos). This means the model can potentially learn to attend to relative positions ("three tokens to the left") rather than only absolute positions. Specifically, the relationship PE(pos + k) = T_k * PE(pos) holds, where T_k is a rotation matrix that depends only on the offset k and the frequency index, not on the absolute position.
No learned parameters. The sinusoidal encoding is entirely deterministic and adds zero trainable parameters to the model.
Potential for length generalization. Because the encoding is defined for any positive integer position, it can in principle generalize to sequences longer than those seen during training, though in practice this generalization is limited.
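The linear-transformation property from the list above can be verified numerically: for each frequency pair, a fixed 2x2 rotation that depends only on the offset k maps PE(pos) onto PE(pos + k). A small NumPy check (variable names illustrative):

```python
import numpy as np

d_model, base, pos, k = 16, 10000.0, 37, 5
freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)   # one per dim pair

def pe_vec(p: int) -> np.ndarray:
    """Sinusoidal encoding of position p as interleaved (sin, cos) pairs."""
    out = np.empty(d_model)
    out[0::2] = np.sin(p * freqs)
    out[1::2] = np.cos(p * freqs)
    return out

# T_k depends only on the offset k and the frequency, never on pos itself.
shifted = np.empty(d_model)
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    T_k = np.array([[c, s], [-s, c]])
    shifted[2 * i : 2 * i + 2] = T_k @ pe_vec(pos)[2 * i : 2 * i + 2]

assert np.allclose(shifted, pe_vec(pos + k))
```

The assertion holds for any pos and k by the angle-addition identities for sine and cosine.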
The sinusoidal approach was used in the original transformer for machine translation and remained common in early transformer variants. However, later work showed that learned positional embeddings often performed comparably or better on many tasks.
Rather than using a fixed mathematical formula, learned positional embeddings treat each position's encoding as a trainable parameter vector. The model maintains a lookup table of shape (max_sequence_length, d_model), and during training, the embedding for each position is optimized via backpropagation just like any other model parameter.
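The mechanism is just a second embedding lookup, indexed by position instead of token id. A minimal NumPy sketch (sizes and initialization are illustrative; in a real model both tables would be updated by the optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, d_model = 1000, 512, 64

# Both tables are trainable parameters; here randomly initialized for illustration.
tok_emb = rng.normal(0.0, 0.02, (vocab, d_model))
pos_emb = rng.normal(0.0, 0.02, (max_len, d_model))  # one learned row per position

token_ids = np.array([17, 42, 7])                    # a toy three-token input
x = tok_emb[token_ids] + pos_emb[: len(token_ids)]   # add position rows element-wise
```

Note that positions beyond max_len simply have no row in pos_emb, which is exactly the length limitation discussed below.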
BERT (Devlin et al., 2019) used learned positional embeddings with a maximum sequence length of 512 tokens. GPT and GPT-2 also used learned positional embeddings, with GPT-2 supporting up to 1,024 positions. The original GPT-3 likewise used learned positional embeddings with a context window of 2,048 tokens.
Research by Gehring et al. (2017) on convolutional sequence-to-sequence models also employed learned positional embeddings, predating BERT. In practice, learned positional embeddings tend to converge to smooth, structured patterns during training that often resemble sinusoidal functions, suggesting that the two approaches capture similar information when sufficient training data is available.
The primary limitation is that the model cannot generalize beyond the maximum position seen during training. If a model is trained with a maximum sequence length of 512, it has no embedding for position 513. This makes learned embeddings inherently bounded, which became a significant constraint as researchers pushed toward longer context windows. Additionally, learned embeddings add parameters proportional to the maximum sequence length, and those extra parameters scale linearly with the model's hidden dimension as well.
Positional encoding methods can be broadly classified into two families: absolute and relative.
Absolute positional encoding assigns a fixed representation to each position in the sequence, independent of the content at that position or the positions of other tokens. Both sinusoidal encoding and learned positional embeddings fall into this category. The model receives information like "this token is at position 47" and must learn to derive relative relationships ("these two tokens are 5 positions apart") from the absolute signals.
Relative positional encoding directly encodes the distance or relationship between pairs of tokens rather than their absolute positions. This family of methods has gained favor because many language tasks depend more on the relative ordering of tokens ("the adjective is just before the noun") than on absolute positions ("the adjective is at position 12").
Shaw, Uszkoreit, and Vaswani introduced one of the first explicit relative position representations for transformers in their 2018 paper "Self-Attention with Relative Position Representations." Their method modifies the self-attention mechanism by adding learned relative position embeddings to the key and value computations. For each pair of positions (i, j), the model uses a learned embedding vector a_ij that depends on the relative distance (i - j) between the two positions.
The attention logit between positions i and j becomes:
e_ij = (x_i * W_Q) * (x_j * W_K + a_ij^K)^T / sqrt(d_z)

where a_ij^K is a learned relative position embedding for keys and d_z is the per-head dimension. A similar term a_ij^V is added to the value computation.
A key design choice is clipping the maximum relative distance to a value k, so that all relative positions beyond k or below -k share the same embedding. This clipping serves two purposes: it reduces the number of parameters (only 2k+1 embeddings are needed regardless of sequence length), and it embodies the hypothesis that precise relative position information is not useful beyond a certain distance. Shaw et al. used k=16 in their experiments, finding this sufficient for translation tasks.
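The clipped indexing can be expressed compactly; the sketch below maps each query-key pair to one of the 2k+1 shared embedding slots (function name illustrative):

```python
import numpy as np

def relative_indices(seq_len: int, k: int) -> np.ndarray:
    """Map every pair (i, j) to one of 2k+1 embedding slots via clipped distance."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]     # rel[i, j] = j - i
    return np.clip(rel, -k, k) + k        # shift from [-k, k] into [0, 2k]

idx = relative_indices(seq_len=6, k=2)    # only 2*2+1 = 5 distinct slots
```

All pairs farther apart than k collapse onto the boundary slots, so the parameter count is independent of sequence length.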
Transformer-XL reformulated the attention score to decompose it into four terms: content-to-content, content-to-position, position-to-content (using a global query bias), and position-to-position interactions. This decomposition enabled the model to handle longer sequences through segment-level recurrence while maintaining coherent position information across segments.
The T5 model (Raffel et al., 2020) simplified relative position encoding by using scalar biases added to attention logits, with the biases depending only on the distance between query and key positions. T5 bucketed distances into a set of discrete bins using a logarithmic spacing scheme: nearby positions are represented with fine granularity while distant positions are grouped into broader bins. This bucketing reduces the number of parameters needed and provides a form of inductive bias: the model treats small distance differences as meaningful but treats large distances as roughly equivalent. T5's relative bias is shared across layers but differs per attention head.
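A simplified scalar version of the unidirectional (decoder-side) bucketing scheme conveys the idea: half the buckets cover exact small distances, the rest are log-spaced up to a maximum distance. This is a sketch following the published T5 scheme, not a drop-in replacement for the library implementation:

```python
import math

def t5_bucket(rel_pos: int, num_buckets: int = 32, max_distance: int = 128) -> int:
    """Bucket a relative position rel_pos = key_pos - query_pos (<= 0 for causal)."""
    n = -rel_pos                              # distance looking back to the key
    max_exact = num_buckets // 2              # first half: one bucket per distance
    if n < max_exact:
        return n
    # Second half: logarithmically spaced buckets up to max_distance.
    log_bucket = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact) * max_exact
    )
    return min(log_bucket, num_buckets - 1)   # everything farther shares one bucket
```

Distances 0 through 15 each get their own bucket, while distances from 16 to beyond max_distance are squeezed into the remaining 16 log-spaced buckets.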
Relative methods have generally shown better performance on tasks requiring understanding of local structure and have proven more amenable to length generalization, since they do not depend on absolute position indices.
Rotary Position Embeddings (RoPE) were introduced by Su et al. in 2021 in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding." RoPE has since become the dominant positional encoding method in modern large language models, adopted by LLaMA, Mistral, Qwen, PaLM, DeepSeek, Falcon, Yi, Phi, and many other model families.
Unlike earlier methods that add positional vectors to the input embeddings, RoPE encodes position by rotating the query and key vectors in the attention computation. This multiplicative approach has several desirable theoretical properties and integrates both absolute and relative position information in a unified framework.
RoPE operates on pairs of dimensions in the query and key vectors. Consider a query vector q at position m and a key vector k at position n. The d-dimensional vectors are split into d/2 pairs of consecutive dimensions. Each pair (q_{2i}, q_{2i+1}) is treated as a 2D vector and rotated by an angle that depends on the position m and the frequency index i.
The rotation for the i-th dimension pair at position m is given by the 2x2 rotation matrix:
R(m, theta_i) = [[cos(m * theta_i), -sin(m * theta_i)], [sin(m * theta_i), cos(m * theta_i)]]
where theta_i = 10000^(-2i/d) is the frequency for dimension pair i. The full rotation is applied block-diagonally across all d/2 dimension pairs:
RoPE(q, m) = R_d(m) * q
where R_d(m) is a block-diagonal matrix composed of the 2x2 rotation matrices for each pair.
The rotation can be equivalently expressed using complex arithmetic. If we treat each dimension pair (q_{2i}, q_{2i+1}) as a complex number z_i = q_{2i} + j * q_{2i+1}, then the RoPE operation is simply multiplication by a complex exponential:
RoPE(z_i, m) = z_i * e^(j * m * theta_i)
This complex multiplication rotates z_i by the angle m * theta_i in the complex plane. The use of different frequencies theta_i for different dimension pairs creates a multi-scale representation: high-frequency dimensions (small i) encode fine-grained position differences, while low-frequency dimensions (large i) capture broader positional structure.
The critical property of RoPE is that when computing the dot product between a rotated query at position m and a rotated key at position n, the result depends only on the relative distance (m - n):
(R_d(m) * q)^T * (R_d(n) * k) = q^T * R_d(n - m) * k
This follows from the fact that rotation matrices satisfy R(m)^T * R(n) = R(n - m). The attention score between two tokens therefore encodes their relative position automatically, without any explicit relative position embeddings. This property gives RoPE its name and its theoretical appeal.
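The rotation and its relative-position property are easy to check numerically. A minimal sketch, assuming the pairing and frequency conventions described above (function name illustrative):

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles m * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # theta_i = base^(-2i/d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]                    # components of each 2D pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score depends only on the relative offset: both pairs below differ by 4.
s1 = rope(q, 7) @ rope(k, 3)
s2 = rope(q, 104) @ rope(k, 100)
assert np.isclose(s1, s2)
```

Shifting both positions by the same amount leaves every attention score unchanged, which is exactly the relative-position property derived above.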
RoPE has several other favorable characteristics: it adds no trainable parameters; it is applied at every attention layer rather than only at the input, keeping position information present throughout the network; and, as Su et al. showed, attention scores exhibit a long-term decay with relative distance, a useful inductive bias for language.
The base value theta_base = 10,000 in the original formulation determines the spread of frequencies across dimensions. Increasing the base stretches the wavelengths of all frequency components, effectively making the encoding less sensitive to position and better able to handle longer sequences. Many modern models have adjusted this hyperparameter:
| Model | RoPE base (theta) | Context length |
|---|---|---|
| LLaMA 1, LLaMA 2 | 10,000 | 2K, 4K |
| Code Llama | 1,000,000 | 16K (100K with fine-tuning) |
| LLaMA 3, LLaMA 3.1 | 500,000 | 8K (128K with scaling) |
| Qwen 2.5 | 1,000,000 | 32K (128K with YaRN) |
| Mistral 7B | 10,000 | 8K (32K with sliding window) |
Increasing the base alone is a simple form of context extension, but it has diminishing returns and can degrade performance on short sequences. More sophisticated scaling techniques (described below) have proven more effective.
Attention with Linear Biases (ALiBi) was introduced by Press, Smith, and Lewis in 2021 in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation," published at ICLR 2022. ALiBi takes a radically simple approach: it does not add any positional embeddings to the input at all. Instead, it adds a static, linear penalty to attention scores based on the distance between query and key positions.
Specifically, before the softmax operation in attention, ALiBi adds a bias of -m * |q_pos - k_pos| to each attention score, where m is a head-specific slope that is fixed (not learned) and set before training. The slopes are set as a geometric sequence. For a model with n attention heads, the slopes follow the pattern:
m_1 = 2^(-8/n), m_2 = 2^(-16/n), ..., m_n = 2^(-8)
For example, with 8 attention heads, the slopes are: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. Heads with steeper slopes (larger m) focus more on nearby tokens, while heads with gentler slopes can attend over longer distances. This creates a natural multi-scale attention pattern where different heads specialize in different distance ranges.
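The slopes and the resulting per-head bias matrix can be computed directly (function names illustrative):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric sequence of head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Per-head bias -m * |i - j| added to attention logits before softmax."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])          # (seq_len, seq_len)
    return -alibi_slopes(n_heads)[:, None, None] * dist  # (n_heads, L, L)

bias = alibi_bias(seq_len=4, n_heads=8)
```

Because the bias is a pure function of distance, it extends to any sequence length without retraining, which is the source of ALiBi's extrapolation ability.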
ALiBi demonstrated strong length extrapolation: a 1.3 billion parameter model trained on sequences of length 1,024 could extrapolate to sequences of length 2,048, matching the perplexity of a sinusoidal model trained on the longer length while using 11% less memory and training 11% faster. The method also introduces an inductive bias toward recency, which aligns well with the locality patterns common in natural language.
ALiBi was adopted by models including BLOOM (BigScience, 2022), MPT (MosaicML, 2023), and StarCoder. However, as the field shifted toward RoPE, ALiBi's adoption in newer models has decreased, though it remains influential as a conceptual approach. One limitation observed in practice is that ALiBi's linear bias may not capture the non-linear attention patterns needed for certain tasks, and its extrapolation ability, while better than absolute methods, still degrades at very large context extensions.
As applications demanded longer context windows, researchers developed several techniques to extend models beyond their original training length. Most of these techniques target RoPE-based models because of RoPE's mathematical properties and its dominance in modern architectures.
Position Interpolation (PI) was introduced by Chen, Wong, Chen, and Tian at Meta in 2023. The core idea is straightforward: instead of extrapolating RoPE to unseen positions (which causes catastrophically high attention scores), PI linearly down-scales position indices so that the extended context fits within the original position range.
If a model was trained with a maximum context length L and needs to operate at an extended length L', PI replaces the position index m with m * (L / L'). For example, if a model trained on 2,048 positions needs to handle 8,192 positions, each position is scaled by 2048/8192 = 0.25, mapping the range [0, 8192] back to [0, 2048].
Formally, PI modifies the RoPE frequency function by applying a scaling factor s = L'/L:
g(m) = m / s, h(theta_d) = theta_d

where g transforms the position index fed into the rotation and h transforms each frequency component (left unchanged under PI).
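The rescaling itself is a one-liner; the sketch below (function name illustrative) shows how an 8K range of extended positions maps back inside a 2K trained range:

```python
import numpy as np

def interpolate_positions(positions: np.ndarray, trained_len: int, target_len: int) -> np.ndarray:
    """Rescale position indices so target_len positions fit inside [0, trained_len)."""
    s = target_len / trained_len        # scaling factor s = L'/L
    return positions / s                # PI: divide every position index by s

pos = np.arange(8192)                   # positions in the extended context
scaled = interpolate_positions(pos, trained_len=2048, target_len=8192)
assert scaled.max() < 2048              # every index lands in the trained range
```

Note that the scaled indices are fractional; RoPE accepts non-integer positions since the rotation angle is a continuous function of m.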
Chen et al. showed that the interpolation bound is at least 600 times smaller than the extrapolation bound, providing a theoretical justification for why interpolation is far more stable than extrapolation. In practice, only about 1,000 fine-tuning steps were needed to adapt a LLaMA model from 2K to 8K context, or even to 32K and 65K contexts with similarly modest fine-tuning.
The main drawback of PI is that it uniformly compresses all frequency components. High-frequency dimensions (which encode fine-grained local positions) get squashed just as much as low-frequency dimensions (which encode global position). This can cause slight perplexity increases on short sequences after fine-tuning, since the model loses some resolution at the local level.
NTK-aware interpolation was proposed by the pseudonymous researcher bloc97 in a Reddit post in 2023. The method addresses a fundamental flaw in Position Interpolation: not all frequency dimensions should be scaled equally.
The insight comes from Neural Tangent Kernel (NTK) theory, which shows that deep neural networks struggle to learn high-frequency information if the input embeddings lack high-frequency components. In the context of RoPE, high-frequency dimensions encode fine-grained, local position differences (distinguishing adjacent tokens), while low-frequency dimensions encode coarse, global position information. PI's uniform scaling compresses high-frequency components unnecessarily, damaging the model's ability to distinguish nearby positions.
NTK-aware interpolation spreads the scaling pressure unevenly across dimensions: high frequencies are scaled less (preserving local resolution), while low frequencies are scaled more (accommodating the extended context). This is achieved by modifying the RoPE base frequency:
b' = b * s^(d / (d - 2))
where b is the original base (typically 10,000), s is the scaling factor, and d is the embedding dimension. This base adjustment implicitly rescales each frequency dimension by a different amount.
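The base adjustment is a single expression; the numbers below are illustrative (a typical head dimension of 128 and a 4x extension):

```python
def ntk_base(base: float, scale: float, d: int) -> float:
    """NTK-aware base adjustment: b' = b * s^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

b_new = ntk_base(10000.0, scale=4.0, d=128)
# The highest-frequency pair (theta_0 = b'^0 = 1) is untouched by the base
# change, while the lowest-frequency pair is slowed by roughly the full
# factor s, concentrating the interpolation where it is least harmful.
```

This is why the method preserves local resolution: changing the base leaves the i = 0 frequency exactly at 1 and affects dimensions progressively more as i grows.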
NTK-aware scaling showed notably better zero-shot performance on longer sequences than PI, enabling models to extend to 8K or even 16K contexts without any fine-tuning. A subsequent refinement, "Dynamic NTK" scaling, adjusts the scaling factor at inference time based on the actual sequence length, applying stronger scaling only when the current sequence exceeds the training length. This dynamic variant works particularly well without fine-tuning.
YaRN (Yet another RoPE extensioN) was introduced by Peng, Quesnelle, Fan, and Shippole in 2023 and published at ICLR 2024. YaRN represents the most complete and theoretically grounded context extension method for RoPE-based models, combining several innovations into a unified approach.
YaRN consists of two main components:
NTK-by-parts interpolation. Rather than applying a single scaling strategy to all frequency dimensions, YaRN partitions the RoPE frequency spectrum into three regions using a ramp function, based on each dimension's wavelength relative to the training context: dimensions whose wavelength is much shorter than the context are left unscaled, preserving local resolution; dimensions whose wavelength exceeds the context are fully interpolated as in PI; and dimensions in between receive a smooth blend of the two.
The ramp function gamma determines the interpolation ratio for each dimension:
h(theta_d) = (1 - gamma) * theta_d / s + gamma * theta_d
where gamma transitions smoothly from 0 (full interpolation) to 1 (no interpolation) based on the wavelength of each frequency component.
Attention temperature scaling. YaRN introduces a temperature factor t that modifies the attention computation:
softmax(q_m^T * k_n / (t * sqrt(d)))
This temperature compensates for the distribution shift in attention scores caused by longer input sequences. Without it, the entropy of the attention distribution changes at extended lengths, degrading performance. The optimal temperature follows the empirical formula:
sqrt(1/t) = 0.1 * ln(s) + 1
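The two components can be sketched as small helper functions. This is a simplified illustration: the ramp here is parameterized directly by wavelength thresholds (low, high), whereas the paper parameterizes it via the ratio of context length to wavelength with two tuned boundaries; the function names are illustrative:

```python
import math

def yarn_gamma(wavelength: float, low: float, high: float) -> float:
    """Ramp from 0 (full interpolation) to 1 (no interpolation) by wavelength."""
    if wavelength >= high:      # longer than the context: interpolate fully
        return 0.0
    if wavelength <= low:       # much shorter than the context: leave untouched
        return 1.0
    return (high - wavelength) / (high - low)     # smooth blend in between

def yarn_theta(theta: float, s: float, gamma: float) -> float:
    """NTK-by-parts: h(theta) = (1 - gamma) * theta / s + gamma * theta."""
    return (1.0 - gamma) * theta / s + gamma * theta

def yarn_temperature(s: float) -> float:
    """Solve sqrt(1/t) = 0.1 * ln(s) + 1 for the attention temperature t."""
    return 1.0 / (0.1 * math.log(s) + 1.0) ** 2
```

At s = 1 (no extension) the temperature is exactly 1 and every theta is unchanged, so the method reduces to plain RoPE.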
YaRN is highly compute-efficient, requiring roughly 10 times fewer tokens and 2.5 times fewer training steps than Position Interpolation to achieve comparable results. It has been widely adopted: Qwen 2.5 uses YaRN to extend from 32K to 128K context, DeepSeek-V3 uses YaRN to extend from 4K to 128K through a two-stage process, and many other models leverage YaRN through Hugging Face's rope_scaling API.
| Method | Year | Fine-tuning required | Approach | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Position Interpolation (PI) | 2023 | Yes (minimal, ~1000 steps) | Uniform downscaling of position indices | Simple, theoretically motivated, effective | Compresses high frequencies, slight short-context degradation |
| NTK-aware scaling | 2023 | Optional | Non-uniform base frequency adjustment | Better zero-shot performance, preserves local resolution | Heuristic derivation, less effective than YaRN after fine-tuning |
| Dynamic NTK | 2023 | No | Runtime scaling based on sequence length | Works without fine-tuning, adaptive | Performance still below fine-tuned methods |
| YaRN | 2023 | Yes (very minimal) | Frequency-wise piecewise scaling + temperature | Best overall performance, compute-efficient | More complex implementation |
Beyond the methods already discussed, several additional approaches have been proposed:
Relative position bias (RPB), used in models like DeBERTa (He et al., 2021), disentangles content and position by using separate attention matrices for content-to-content and content-to-position interactions. This approach showed improvements on many NLU benchmarks.
KERPLE (Chi et al., 2022) generalizes relative position embeddings using conditionally positive definite kernels, offering a principled framework for learning position-dependent biases with improved length extrapolation.
FIRE (Li et al., 2023) uses a small MLP to learn functional interpolation over relative positions. It can theoretically represent several other methods (including T5's RPE, ALiBi, and KERPLE) and demonstrated strong performance on length generalization benchmarks.
CoPE (Golovneva et al., 2024), or Contextual Position Encoding, from Meta, computes positions based on the content of tokens rather than their indices, allowing the model to count specific types of tokens (like sentences or words) rather than raw positions. This content-dependent approach represents a departure from purely index-based methods.
iRoPE (Meta, 2025), introduced with LLaMA 4, interleaves RoPE layers with NoPE (No Positional Encoding) layers. Three out of every four decoder layers use standard RoPE with chunked attention (chunk size of 8,192), while every fourth layer uses no positional encoding at all but attends over the full causal context. The NoPE layers use a scaled softmax (temperature tuning) to prevent attention probability scores from fading as sequence length increases. This hybrid approach enabled LLaMA 4 Scout to achieve a 10 million token context window.
| Method | Type | Added parameters | Length extrapolation | Relative position | Key model users |
|---|---|---|---|---|---|
| Sinusoidal (Vaswani et al., 2017) | Absolute | None | Limited | Indirect (via linear property) | Original Transformer |
| Learned embeddings | Absolute | O(L * d) | None (fixed max length) | No | BERT, GPT-2, GPT-3 |
| Shaw et al. (2018) | Relative | O(K * d) | Moderate (via clipping) | Yes (learned pairwise) | Various research models |
| Transformer-XL (Dai et al., 2019) | Relative | O(L * d) | Good | Yes (decomposed scores) | Transformer-XL, XLNet |
| T5 RPE (Raffel et al., 2020) | Relative | O(B * H) | Moderate | Yes (scalar bias, bucketed) | T5, Flan-T5 |
| RoPE (Su et al., 2021) | Relative (via rotation) | None | Good (with scaling) | Yes (rotation encodes difference) | LLaMA, Mistral, Qwen, DeepSeek |
| ALiBi (Press et al., 2021) | Relative | None | Strong | Yes (linear distance penalty) | BLOOM, MPT, StarCoder |
| KERPLE (Chi et al., 2022) | Relative | Small (kernel params) | Strong | Yes (kernel-based) | Research models |
| RoPE + PI (Chen et al., 2023) | Relative (via rotation) | None | Strong (with fine-tuning) | Yes | Extended LLaMA models |
| RoPE + YaRN (Peng et al., 2023) | Relative (via rotation) | None | Very strong | Yes | Qwen 2.5, DeepSeek-V3, many others |
| FIRE (Li et al., 2023) | Relative | Small (MLP) | Strong | Yes (functional interpolation) | Research models |
| iRoPE (Meta, 2025) | Hybrid (RoPE + NoPE) | None | Very strong | Yes (in RoPE layers) | LLaMA 4 |
The evolution of positional encoding has been one of the primary drivers behind the dramatic expansion of context windows in large language models. From the original transformer's 512-token contexts to today's multi-million-token windows, positional encoding innovations have been central to each generational leap.
| Year | Model | Context length | Positional encoding method |
|---|---|---|---|
| 2017 | Original Transformer | 512 tokens | Sinusoidal |
| 2018 | BERT | 512 tokens | Learned embeddings |
| 2019 | GPT-2 | 1,024 tokens | Learned embeddings |
| 2020 | GPT-3 | 2,048 tokens | Learned embeddings |
| 2022 | BLOOM | 2,048 tokens | ALiBi |
| 2023 | LLaMA 2 | 4,096 tokens | RoPE |
| 2023 | Claude 2 | 100,000 tokens | Undisclosed |
| 2023 | GPT-4 Turbo | 128,000 tokens | Undisclosed |
| 2024 | LLaMA 3.1 | 128,000 tokens | RoPE (theta=500K, with scaling) |
| 2024 | Qwen 2.5 | 128,000 tokens | RoPE + YaRN |
| 2024 | DeepSeek-V3 | 128,000 tokens | RoPE + YaRN (decoupled) |
| 2024 | Gemini 1.5 Pro | 2,000,000 tokens | Undisclosed |
| 2025 | LLaMA 4 Scout | 10,000,000 tokens | iRoPE |
| 2025 | Claude (Opus/Sonnet) | 1,000,000 tokens | Undisclosed |
| 2025 | GPT-5 | 1,000,000 tokens | Undisclosed |
Extending context to millions of tokens introduces several challenges beyond positional encoding:
Attention score distribution. As context length grows, the softmax distribution in attention becomes increasingly flat, making it harder for the model to focus on relevant tokens. Temperature scaling (as in YaRN and iRoPE) partially addresses this, but very long contexts still suffer from diluted attention. Research into sparse attention patterns, retrieval-augmented attention, and hierarchical attention mechanisms complements positional encoding work.
Training data requirements. Models need to be trained (or at least fine-tuned) on sequences of comparable length to their target context. Generating and processing such long sequences is computationally expensive. The efficiency of methods like YaRN, which requires only a few hundred fine-tuning steps, has been critical to making long-context models practical.
Evaluation difficulty. Standard benchmarks such as perplexity on short texts do not capture a model's ability to use long context effectively. Specialized benchmarks like "needle in a haystack" retrieval, long-document QA, and multi-document summarization have been developed to assess long-context performance.
Effective vs. advertised context. Research suggests that effective utilization of context typically reaches 60-70% of the advertised maximum. Models may accept very long inputs but fail to retrieve or reason over information in the middle of the context (the "lost in the middle" phenomenon). Better positional encoding can help, but it is not the sole factor in effective long-context use.
As of early 2026, RoPE remains the dominant positional encoding method in large language models. After Meta adopted RoPE for the LLaMA family, virtually every major open-source LLM followed: Mistral, Gemma, Qwen, DeepSeek, Falcon 2, Yi, and Phi all use RoPE. The entire inference ecosystem, including FlashAttention, vLLM, and Hugging Face's rope_scaling API, has been optimized around it.
For encoder-only models used in classification and retrieval tasks, learned positional embeddings remain common, in part because these models typically operate on shorter sequences where length extrapolation is less of a concern.
Some recent architectures have begun experimenting with hybrid approaches. Command R7B (Cohere, 2024) combines RoPE layers with layers that have no positional encoding at all (NoPE). Gemma 3 uses RoPE with different base frequencies for local attention (theta = 10,000) and global attention (theta = 1,000,000) within the same model. LLaMA 4's iRoPE architecture (described above) takes this further by systematically interleaving RoPE and NoPE layers.
There is also growing interest in models that use no positional encoding at all. Research has shown that some decoder-only models can learn implicit position information from causal attention masks alone, though this approach has not yet been widely adopted in production models.
Positional encoding is not limited to language models. Vision transformers (ViT) (Dosovitskiy et al., 2021) split images into patches and treat them as a sequence, using either learned 1D positional embeddings (treating patches in raster order) or learned 2D embeddings that reflect the spatial grid structure. Subsequent work explored relative position biases for vision, including the Swin Transformer (Liu et al., 2021), which uses relative position biases within local attention windows.
Multimodal models that combine text and image tokens must reconcile different positional encoding strategies for different modalities, which remains an active area of research. Qwen2-VL introduced Multimodal RoPE (MRoPE), which unifies positional encoding for text and visual tokens within a single framework. Newer work on Multi-Head RoPE (MHRoPE) and MRoPE-Interleave has shown further improvements across both general and fine-grained multimodal understanding benchmarks.
When implementing positional encoding, several practical considerations arise:
Addition vs. concatenation. Most methods add the positional encoding to the token embedding. An alternative is to concatenate the two vectors, which enlarges the input dimension. Addition is strongly preferred in practice because it avoids growing the model dimension.
Where to apply. Sinusoidal and learned encodings are typically applied once at the input layer. RoPE, by contrast, is applied at every attention layer, rotating query and key vectors before computing attention scores. ALiBi modifies the attention logits directly. The layer at which position information is injected affects how the model uses it.
Scaling for long contexts. When extending a model to longer contexts than it was trained on, position encodings often need adjustment. For RoPE, techniques like Position Interpolation, NTK-aware scaling, and YaRN have been developed (see above). For learned embeddings, the only option is typically to fine-tune with longer sequences. For ALiBi, the linear bias naturally extends to longer sequences without modification, which was one of its original selling points.
Computational cost. Sinusoidal encoding and ALiBi have negligible computational cost. RoPE requires a rotation operation at each layer but can be implemented efficiently using element-wise operations rather than matrix multiplication (exploiting the block-diagonal structure of the rotation). Learned embeddings require a lookup but no computation. Methods like FIRE that use MLPs incur a small additional cost.
Interaction with FlashAttention. Modern fused attention kernels like FlashAttention can incorporate RoPE rotations and ALiBi biases directly into the attention computation, avoiding the need to materialize the full attention matrix. This integration is important for practical efficiency, especially at long context lengths where the attention matrix would be prohibitively large.