Positional encoding is a technique used to inject information about token order into transformer models. Because transformers process all tokens in a sequence simultaneously through self-attention, they have no built-in sense of which token comes before or after another. Without some form of position signal, a transformer would treat the sentence "the cat sat on the mat" identically to "mat the on sat cat the." Positional encoding solves this by adding or otherwise integrating position-dependent signals into the model's representations, allowing it to distinguish token order and learn patterns that depend on sequence structure.
The concept was introduced alongside the original transformer architecture by Vaswani et al. in 2017 and has since evolved into a rich family of methods. Different approaches offer different trade-offs in terms of generalization to unseen sequence lengths, computational overhead, and the ability to capture relative versus absolute positions.
Recurrent neural networks (RNNs) and LSTMs process tokens one at a time in sequence order, so position information is implicit in the computation itself. Convolutional models similarly operate over local windows in a fixed order. Transformers, by contrast, compute attention scores between all pairs of tokens in parallel. The attention mechanism is permutation-equivariant: if you shuffle the input tokens, the output gets shuffled in exactly the same way, with no change in the attention weights themselves.
This parallelism is what makes transformers fast and scalable, but it also means the model cannot tell position 0 from position 500 unless something external provides that information. Positional encoding fills this gap. Without it, tasks like language modeling (where the next token depends heavily on word order), machine translation, and virtually any sequential task would break down.
More formally, consider the standard scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V. This function depends only on the content of the query, key, and value vectors, not on the positions from which those vectors originated. If two tokens are swapped in the input, the corresponding rows of Q, K, and V simply swap, and the output follows suit. There is no mechanism that says "token A came before token B." Positional encoding breaks this symmetry by making Q, K, or V depend on position, either by adding position-dependent vectors to the input embeddings, rotating the query and key vectors, or biasing the attention scores directly.
The original transformer paper, "Attention Is All You Need," introduced a deterministic positional encoding based on sine and cosine functions at different frequencies. For a token at position pos in the sequence and dimension index i in the embedding vector of dimension d_model, the encoding is defined as:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Each dimension of the positional encoding corresponds to a sinusoidal wave with a different wavelength, ranging from 2pi to 10000 * 2pi. The positional encoding vector is added element-wise to the token embedding before being fed into the transformer layers.
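The table construction can be sketched in a few lines of NumPy; the function name and shapes are illustrative, not from the original paper:

```python
import numpy as np

def sinusoidal_encoding(max_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Build the (max_len, d_model) table of sinusoidal position encodings."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = positions / base ** (dims / d_model)     # pos / 10000^(2i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cosine
    return pe

pe = sinusoidal_encoding(max_len=128, d_model=16)
```

The resulting rows are what get added to the token embeddings; every entry lies in [-1, 1].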
The base value of 10,000 was chosen by Vaswani et al. as a practical default. It gives the lowest-frequency component a wavelength of 10000 * 2pi (roughly 63,000) positions, so it varies only slowly across a sequence, while the highest-frequency dimension oscillates with a wavelength of just 2pi. This spread of wavelengths means that nearby positions differ primarily in their high-frequency components, while distant positions differ across all frequencies. The choice of base has since become a tunable hyperparameter in modern RoPE-based models (see below).
Several properties made this design attractive:
Bounded values. The sine and cosine functions produce values in the range [-1, 1], which is compatible with the scale of typical token embeddings.
Unique representation. Each position receives a distinct encoding vector, so the model can in principle distinguish any two positions.
Relative position through linear transformation. Vaswani et al. noted that for any fixed offset k, PE(pos + k) can be expressed as a linear function of PE(pos). This means the model can potentially learn to attend to relative positions ("three tokens to the left") rather than only absolute positions. Specifically, the relationship PE(pos + k) = T_k * PE(pos) holds, where T_k is a rotation matrix that depends only on the offset k and the frequency index, not on the absolute position.
No learned parameters. The sinusoidal encoding is entirely deterministic and adds zero trainable parameters to the model.
Potential for length generalization. Because the encoding is defined for any positive integer position, it can in principle generalize to sequences longer than those seen during training, though in practice this generalization is limited.
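The linear-transformation property from the list above can be verified numerically: for each frequency pair, a fixed 2x2 rotation that depends only on the offset k maps PE(pos) onto PE(pos + k). A small NumPy check (variable names illustrative):

```python
import numpy as np

d_model, base, pos, k = 16, 10000.0, 37, 5
freqs = 1.0 / base ** (np.arange(0, d_model, 2) / d_model)   # one per dim pair

def pe_vec(p: int) -> np.ndarray:
    """Sinusoidal encoding of position p as interleaved (sin, cos) pairs."""
    out = np.empty(d_model)
    out[0::2] = np.sin(p * freqs)
    out[1::2] = np.cos(p * freqs)
    return out

# T_k depends only on the offset k and the frequency, never on pos itself.
shifted = np.empty(d_model)
for i, w in enumerate(freqs):
    c, s = np.cos(k * w), np.sin(k * w)
    T_k = np.array([[c, s], [-s, c]])
    shifted[2 * i : 2 * i + 2] = T_k @ pe_vec(pos)[2 * i : 2 * i + 2]

assert np.allclose(shifted, pe_vec(pos + k))
```

The assertion holds for any pos and k by the angle-addition identities for sine and cosine.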
The sinusoidal approach was used in the original transformer for machine translation and remained common in early transformer variants. However, later work showed that learned positional embeddings often performed comparably or better on many tasks.
Rather than using a fixed mathematical formula, learned positional embeddings treat each position's encoding as a trainable parameter vector. The model maintains a lookup table of shape (max_sequence_length, d_model), and during training, the embedding for each position is optimized via backpropagation just like any other model parameter.
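The mechanism is just a second embedding lookup, indexed by position instead of token id. A minimal NumPy sketch (sizes and initialization are illustrative; in a real model both tables would be updated by the optimizer):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, d_model = 1000, 512, 64

# Both tables are trainable parameters; here randomly initialized for illustration.
tok_emb = rng.normal(0.0, 0.02, (vocab, d_model))
pos_emb = rng.normal(0.0, 0.02, (max_len, d_model))  # one learned row per position

token_ids = np.array([17, 42, 7])                    # a toy three-token input
x = tok_emb[token_ids] + pos_emb[: len(token_ids)]   # add position rows element-wise
```

Note that positions beyond max_len simply have no row in pos_emb, which is exactly the length limitation discussed below.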
BERT (Devlin et al., 2019) used learned positional embeddings with a maximum sequence length of 512 tokens. GPT and GPT-2 also used learned positional embeddings, with GPT-2 supporting up to 1,024 positions. The original GPT-3 likewise used learned positional embeddings with a context window of 2,048 tokens.
Research by Gehring et al. (2017) on convolutional sequence-to-sequence models also employed learned positional embeddings, predating BERT. In practice, learned positional embeddings tend to converge to smooth, structured patterns during training that often resemble sinusoidal functions, suggesting that the two approaches capture similar information when sufficient training data is available.
The primary limitation is that the model cannot generalize beyond the maximum position seen during training. If a model is trained with a maximum sequence length of 512, it has no embedding for position 513. This makes learned embeddings inherently bounded, which became a significant constraint as researchers pushed toward longer context windows. Additionally, learned embeddings add parameters proportional to the maximum sequence length, and those extra parameters scale linearly with the model's hidden dimension as well.
Positional encoding methods can be broadly classified into two families: absolute and relative.
Absolute positional encoding assigns a fixed representation to each position in the sequence, independent of the content at that position or the positions of other tokens. Both sinusoidal encoding and learned positional embeddings fall into this category. The model receives information like "this token is at position 47" and must learn to derive relative relationships ("these two tokens are 5 positions apart") from the absolute signals.
Relative positional encoding directly encodes the distance or relationship between pairs of tokens rather than their absolute positions. This family of methods has gained favor because many language tasks depend more on the relative ordering of tokens ("the adjective is just before the noun") than on absolute positions ("the adjective is at position 12").
Shaw, Uszkoreit, and Vaswani introduced one of the first explicit relative position representations for transformers in their 2018 paper "Self-Attention with Relative Position Representations." Their method modifies the self-attention mechanism by adding learned relative position embeddings to the key and value computations. For each pair of positions (i, j), the model uses a learned embedding vector a_ij that depends on the relative distance (i - j) between the two positions.
The attention logit between positions i and j becomes:
e_ij = (x_i * W_Q) * (x_j * W_K + a_ij^K)^T / sqrt(d_z)

where a_ij^K is a learned relative position embedding for keys and d_z is the per-head dimension. A similar term a_ij^V is added to the value computation.
A key design choice is clipping the maximum relative distance to a value k, so that all relative positions beyond k or below -k share the same embedding. This clipping serves two purposes: it reduces the number of parameters (only 2k+1 embeddings are needed regardless of sequence length), and it embodies the hypothesis that precise relative position information is not useful beyond a certain distance. Shaw et al. used k=16 in their experiments, finding this sufficient for translation tasks.
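The clipped indexing can be expressed compactly; the sketch below maps each query-key pair to one of the 2k+1 shared embedding slots (function name illustrative):

```python
import numpy as np

def relative_indices(seq_len: int, k: int) -> np.ndarray:
    """Map every pair (i, j) to one of 2k+1 embedding slots via clipped distance."""
    pos = np.arange(seq_len)
    rel = pos[None, :] - pos[:, None]     # rel[i, j] = j - i
    return np.clip(rel, -k, k) + k        # shift from [-k, k] into [0, 2k]

idx = relative_indices(seq_len=6, k=2)    # only 2*2+1 = 5 distinct slots
```

All pairs farther apart than k collapse onto the boundary slots, so the parameter count is independent of sequence length.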
Transformer-XL reformulated the attention score to decompose it into four terms: content-to-content, content-to-position, position-to-content (using a global query bias), and position-to-position interactions. This decomposition enabled the model to handle longer sequences through segment-level recurrence while maintaining coherent position information across segments.
The T5 model (Raffel et al., 2020) simplified relative position encoding by using scalar biases added to attention logits, with the biases depending only on the distance between query and key positions. T5 bucketed distances into a set of discrete bins using a logarithmic spacing scheme: nearby positions are represented with fine granularity while distant positions are grouped into broader bins. This bucketing reduces the number of parameters needed and provides a form of inductive bias: the model treats small distance differences as meaningful but treats large distances as roughly equivalent. T5's relative bias is shared across layers but differs per attention head.
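A simplified scalar version of the unidirectional (decoder-side) bucketing scheme conveys the idea: half the buckets cover exact small distances, the rest are log-spaced up to a maximum distance. This is a sketch following the published T5 scheme, not a drop-in replacement for the library implementation:

```python
import math

def t5_bucket(rel_pos: int, num_buckets: int = 32, max_distance: int = 128) -> int:
    """Bucket a relative position rel_pos = key_pos - query_pos (<= 0 for causal)."""
    n = -rel_pos                              # distance looking back to the key
    max_exact = num_buckets // 2              # first half: one bucket per distance
    if n < max_exact:
        return n
    # Second half: logarithmically spaced buckets up to max_distance.
    log_bucket = max_exact + int(
        math.log(n / max_exact) / math.log(max_distance / max_exact) * max_exact
    )
    return min(log_bucket, num_buckets - 1)   # everything farther shares one bucket
```

Distances 0 through 15 each get their own bucket, while distances from 16 to beyond max_distance are squeezed into the remaining 16 log-spaced buckets.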
Relative methods have generally shown better performance on tasks requiring understanding of local structure and have proven more amenable to length generalization, since they do not depend on absolute position indices.
Rotary Position Embeddings (RoPE) were introduced by Su et al. in 2021 in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding." RoPE has since become the dominant positional encoding method in modern large language models, adopted by LLaMA, Mistral, Qwen, PaLM, DeepSeek, Falcon, Yi, Phi, and many other model families.
Unlike earlier methods that add positional vectors to the input embeddings, RoPE encodes position by rotating the query and key vectors in the attention computation. This multiplicative approach has several desirable theoretical properties and integrates both absolute and relative position information in a unified framework.
RoPE operates on pairs of dimensions in the query and key vectors. Consider a query vector q at position m and a key vector k at position n. The d-dimensional vectors are split into d/2 pairs of consecutive dimensions. Each pair (q_{2i}, q_{2i+1}) is treated as a 2D vector and rotated by an angle that depends on the position m and the frequency index i.
The rotation for the i-th dimension pair at position m is given by the 2x2 rotation matrix:
R(m, theta_i) = [[cos(m * theta_i), -sin(m * theta_i)], [sin(m * theta_i), cos(m * theta_i)]]
where theta_i = 10000^(-2i/d) is the frequency for dimension pair i. The full rotation is applied block-diagonally across all d/2 dimension pairs:
RoPE(q, m) = R_d(m) * q
where R_d(m) is a block-diagonal matrix composed of the 2x2 rotation matrices for each pair.
The rotation can be equivalently expressed using complex arithmetic. If we treat each dimension pair (q_{2i}, q_{2i+1}) as a complex number z_i = q_{2i} + j * q_{2i+1}, then the RoPE operation is simply multiplication by a complex exponential:
RoPE(z_i, m) = z_i * e^(j * m * theta_i)
This complex multiplication rotates z_i by the angle m * theta_i in the complex plane. The use of different frequencies theta_i for different dimension pairs creates a multi-scale representation: high-frequency dimensions (small i) encode fine-grained position differences, while low-frequency dimensions (large i) capture broader positional structure.
The critical property of RoPE is that when computing the dot product between a rotated query at position m and a rotated key at position n, the result depends only on the relative distance (m - n):
(R_d(m) * q)^T * (R_d(n) * k) = q^T * R_d(n - m) * k
This follows from the fact that rotation matrices satisfy R(m)^T * R(n) = R(n - m). The attention score between two tokens therefore encodes their relative position automatically, without any explicit relative position embeddings. This property gives RoPE its name and its theoretical appeal.
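The rotation and its relative-position property are easy to check numerically. A minimal sketch, assuming the pairing and frequency conventions described above (function name illustrative):

```python
import numpy as np

def rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive dimension pairs of x by angles m * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)    # theta_i = base^(-2i/d)
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    x1, x2 = x[0::2], x[1::2]                    # components of each 2D pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin              # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score depends only on the relative offset: both pairs below differ by 4.
s1 = rope(q, 7) @ rope(k, 3)
s2 = rope(q, 104) @ rope(k, 100)
assert np.isclose(s1, s2)
```

Shifting both positions by the same amount leaves every attention score unchanged, which is exactly the relative-position property derived above.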
RoPE has several other favorable characteristics: it adds no trainable parameters; it is applied at every attention layer rather than only at the input, keeping position information present throughout the network; and, as Su et al. showed, attention scores exhibit a long-term decay with relative distance, a useful inductive bias for language.
The base value theta_base = 10,000 in the original formulation determines the spread of frequencies across dimensions. Increasing the base stretches the wavelengths of all frequency components, effectively making the encoding less sensitive to position and better able to handle longer sequences. Many modern models have adjusted this hyperparameter:
| Model | RoPE base (theta) | Context length |
|---|---|---|
| LLaMA 1, LLaMA 2 | 10,000 | 2K, 4K |
| Code Llama | 1,000,000 | 16K (100K with fine-tuning) |
| LLaMA 3, LLaMA 3.1 | 500,000 | 8K (128K with scaling) |
| Qwen 2.5 | 1,000,000 | 32K (128K with YaRN) |
| Mistral 7B | 10,000 | 8K (32K with sliding window) |
Increasing the base alone is a simple form of context extension, but it has diminishing returns and can degrade performance on short sequences. More sophisticated scaling techniques (described below) have proven more effective.
Attention with Linear Biases (ALiBi) was introduced by Press, Smith, and Lewis in 2021 in the paper "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation," published at ICLR 2022. ALiBi takes a radically simple approach: it does not add any positional embeddings to the input at all. Instead, it adds a static, linear penalty to attention scores based on the distance between query and key positions.
Specifically, before the softmax operation in attention, ALiBi adds a bias of -m * |q_pos - k_pos| to each attention score, where m is a head-specific slope that is fixed (not learned) and set before training. The slopes are set as a geometric sequence. For a model with n attention heads, the slopes follow the pattern:
m_1 = 2^(-8/n), m_2 = 2^(-16/n), ..., m_n = 2^(-8)
For example, with 8 attention heads, the slopes are: 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128, 1/256. Heads with steeper slopes (larger m) focus more on nearby tokens, while heads with gentler slopes can attend over longer distances. This creates a natural multi-scale attention pattern where different heads specialize in different distance ranges.
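The slopes and the resulting per-head bias matrix can be computed directly (function names illustrative):

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric sequence of head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    return 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)

def alibi_bias(seq_len: int, n_heads: int) -> np.ndarray:
    """Per-head bias -m * |i - j| added to attention logits before softmax."""
    pos = np.arange(seq_len)
    dist = np.abs(pos[:, None] - pos[None, :])          # (seq_len, seq_len)
    return -alibi_slopes(n_heads)[:, None, None] * dist  # (n_heads, L, L)

bias = alibi_bias(seq_len=4, n_heads=8)
```

Because the bias is a pure function of distance, it extends to any sequence length without retraining, which is the source of ALiBi's extrapolation ability.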
ALiBi demonstrated strong length extrapolation: a 1.3 billion parameter model trained on sequences of length 1,024 could extrapolate to sequences of length 2,048, matching the perplexity of a sinusoidal model trained on the longer length while using 11% less memory and training 11% faster. The method also introduces an inductive bias toward recency, which aligns well with the locality patterns common in natural language.
ALiBi was adopted by models including BLOOM (BigScience, 2022), MPT (MosaicML, 2023), and StarCoder. However, as the field shifted toward RoPE, ALiBi's adoption in newer models has decreased, though it remains influential as a conceptual approach. One limitation observed in practice is that ALiBi's linear bias may not capture the non-linear attention patterns needed for certain tasks, and its extrapolation ability, while better than absolute methods, still degrades at very large context extensions.
As applications demanded longer context windows, researchers developed several techniques to extend models beyond their original training length. Most of these techniques target RoPE-based models because of RoPE's mathematical properties and its dominance in modern architectures.
Position Interpolation (PI) was introduced by Chen, Wong, Chen, and Tian at Meta in 2023. The core idea is straightforward: instead of extrapolating RoPE to unseen positions (which causes catastrophically high attention scores), PI linearly down-scales position indices so that the extended context fits within the original position range.
If a model was trained with a maximum context length L and needs to operate at an extended length L', PI replaces the position index m with m * (L / L'). For example, if a model trained on 2,048 positions needs to handle 8,192 positions, each position is scaled by 2048/8192 = 0.25, mapping the range [0, 8192] back to [0, 2048].
Formally, PI modifies the RoPE frequency function by applying a scaling factor s = L'/L:
g(m) = m / s, h(theta_d) = theta_d

where g transforms the position index fed into the rotation and h transforms each frequency component (left unchanged under PI).
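The rescaling itself is a one-liner; the sketch below (function name illustrative) shows how an 8K range of extended positions maps back inside a 2K trained range:

```python
import numpy as np

def interpolate_positions(positions: np.ndarray, trained_len: int, target_len: int) -> np.ndarray:
    """Rescale position indices so target_len positions fit inside [0, trained_len)."""
    s = target_len / trained_len        # scaling factor s = L'/L
    return positions / s                # PI: divide every position index by s

pos = np.arange(8192)                   # positions in the extended context
scaled = interpolate_positions(pos, trained_len=2048, target_len=8192)
assert scaled.max() < 2048              # every index lands in the trained range
```

Note that the scaled indices are fractional; RoPE accepts non-integer positions since the rotation angle is a continuous function of m.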
Chen et al. showed that the interpolation bound is at least 600 times smaller than the extrapolation bound, providing a theoretical justification for why interpolation is far more stable than extrapolation. In practice, only about 1,000 fine-tuning steps were needed to adapt a LLaMA model from 2K to 8K context, or even to 32K and 65K contexts with similarly modest fine-tuning.
The main drawback of PI is that it uniformly compresses all frequency components. High-frequency dimensions (which encode fine-grained local positions) get squashed just as much as low-frequency dimensions (which encode global position). This can cause slight perplexity increases on short sequences after fine-tuning, since the model loses some resolution at the local level.
NTK-aware interpolation was proposed by the pseudonymous researcher bloc97 in a Reddit post in 2023. The method addresses a fundamental flaw in Position Interpolation: not all frequency dimensions should be scaled equally.
The insight comes from Neural Tangent Kernel (NTK) theory, which shows that deep neural networks struggle to learn high-frequency information if the input embeddings lack high-frequency components. In the context of RoPE, high-frequency dimensions encode fine-grained, local position differences (distinguishing adjacent tokens), while low-frequency dimensions encode coarse, global position information. PI's uniform scaling compresses high-frequency components unnecessarily, damaging the model's ability to distinguish nearby positions.
NTK-aware interpolation spreads the scaling pressure unevenly across dimensions: high frequencies are scaled less (preserving local resolution), while low frequencies are scaled more (accommodating the extended context). This is achieved by modifying the RoPE base frequency:
b' = b * s^(d / (d - 2))
where b is the original base (typically 10,000), s is the scaling factor, and d is the embedding dimension. This base adjustment implicitly rescales each frequency dimension by a different amount.
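The base adjustment is a single expression; the numbers below are illustrative (a typical head dimension of 128 and a 4x extension):

```python
def ntk_base(base: float, scale: float, d: int) -> float:
    """NTK-aware base adjustment: b' = b * s^(d / (d - 2))."""
    return base * scale ** (d / (d - 2))

b_new = ntk_base(10000.0, scale=4.0, d=128)
# The highest-frequency pair (theta_0 = b'^0 = 1) is untouched by the base
# change, while the lowest-frequency pair is slowed by roughly the full
# factor s, concentrating the interpolation where it is least harmful.
```

This is why the method preserves local resolution: changing the base leaves the i = 0 frequency exactly at 1 and affects dimensions progressively more as i grows.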
NTK-aware scaling showed notably better zero-shot performance on longer sequences than PI, enabling models to extend to 8K or even 16K contexts without any fine-tuning. A subsequent refinement, "Dynamic NTK" scaling, adjusts the scaling factor at inference time based on the actual sequence length, applying stronger scaling only when the current sequence exceeds the training length. This dynamic variant works particularly well without fine-tuning.
YaRN (Yet another RoPE extensioN) was introduced by Peng, Quesnelle, Fan, and Shippole in 2023 and published at ICLR 2024. YaRN represents the most complete and theoretically grounded context extension method for RoPE-based models, combining several innovations into a unified approach.
YaRN consists of two main components:
NTK-by-parts interpolation. Rather than applying a single scaling strategy to all frequency dimensions, YaRN partitions the RoPE frequency spectrum into three regions using a ramp function, based on each dimension's wavelength relative to the training context: dimensions whose wavelength is much shorter than the context are left unscaled, preserving local resolution; dimensions whose wavelength exceeds the context are fully interpolated as in PI; and dimensions in between receive a smooth blend of the two.
The ramp function gamma determines the interpolation ratio for each dimension:
h(theta_d) = (1 - gamma) * theta_d / s + gamma * theta_d
where gamma transitions smoothly from 0 (full interpolation) to 1 (no interpolation) based on the wavelength of each frequency component.
Attention temperature scaling. YaRN introduces a temperature factor t that modifies the attention computation:
softmax(q_m^T * k_n / (t * sqrt(d)))
This temperature compensates for the distribution shift in attention scores caused by longer input sequences. Without it, the entropy of the attention distribution changes at extended lengths, degrading performance. The optimal temperature follows the empirical formula:
sqrt(1/t) = 0.1 * ln(s) + 1
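The two components can be sketched as small helper functions. This is a simplified illustration: the ramp here is parameterized directly by wavelength thresholds (low, high), whereas the paper parameterizes it via the ratio of context length to wavelength with two tuned boundaries; the function names are illustrative:

```python
import math

def yarn_gamma(wavelength: float, low: float, high: float) -> float:
    """Ramp from 0 (full interpolation) to 1 (no interpolation) by wavelength."""
    if wavelength >= high:      # longer than the context: interpolate fully
        return 0.0
    if wavelength <= low:       # much shorter than the context: leave untouched
        return 1.0
    return (high - wavelength) / (high - low)     # smooth blend in between

def yarn_theta(theta: float, s: float, gamma: float) -> float:
    """NTK-by-parts: h(theta) = (1 - gamma) * theta / s + gamma * theta."""
    return (1.0 - gamma) * theta / s + gamma * theta

def yarn_temperature(s: float) -> float:
    """Solve sqrt(1/t) = 0.1 * ln(s) + 1 for the attention temperature t."""
    return 1.0 / (0.1 * math.log(s) + 1.0) ** 2
```

At s = 1 (no extension) the temperature is exactly 1 and every theta is unchanged, so the method reduces to plain RoPE.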
YaRN is highly compute-efficient, requiring roughly 10 times fewer tokens and 2.5 times fewer training steps than Position Interpolation to achieve comparable results. It has been widely adopted: Qwen 2.5 uses YaRN to extend from 32K to 128K context, DeepSeek-V3 uses YaRN to extend from 4K to 128K through a two-stage process, and many other models leverage YaRN through Hugging Face's rope_scaling API.
| Method | Year | Fine-tuning required | Approach | Strengths | Weaknesses |
|---|---|---|---|---|---|
| Position Interpolation (PI) | 2023 | Yes (minimal, ~1000 steps) | Uniform downscaling of position indices | Simple, theoretically motivated, effective | Compresses high frequencies, slight short-context degradation |
| NTK-aware scaling | 2023 | Optional | Non-uniform base frequency adjustment | Better zero-shot performance, preserves local resolution | Heuristic derivation, less effective than YaRN after fine-tuning |
| Dynamic NTK | 2023 | No | Runtime scaling based on sequence length | Works without fine-tuning, adaptive | Performance still below fine-tuned methods |
| YaRN | 2023 | Yes (very minimal) | Frequency-wise piecewise scaling + temperature | Best overall performance, compute-efficient | More complex implementation |
Beyond the methods already discussed, several additional approaches have been proposed:
Relative position bias (RPB), used in models like DeBERTa (He et al., 2021), disentangles content and position by using separate attention matrices for content-to-content and content-to-position interactions. This approach showed improvements on many NLU benchmarks.
KERPLE (Chi et al., 2022) generalizes relative position embeddings using conditionally positive definite kernels, offering a principled framework for learning position-dependent biases with improved length extrapolation.
FIRE (Li et al., 2023) uses a small MLP to learn functional interpolation over relative positions. It can theoretically represent several other methods (including T5's RPE, ALiBi, and KERPLE) and demonstrated strong performance on length generalization benchmarks.
CoPE (Golovneva et al., 2024), or Contextual Position Encoding, from Meta, computes positions based on the content of tokens rather than their indices, allowing the model to count specific types of tokens (like sentences or words) rather than raw positions. This content-dependent approach represents a departure from purely index-based methods.
iRoPE (Meta, 2025), introduced with LLaMA 4, interleaves RoPE layers with NoPE (No Positional Encoding) layers. Three out of every four decoder layers use standard RoPE with chunked attention (chunk size of 8,192), while every fourth layer uses no positional encoding at all but attends over the full causal context. The NoPE layers use a scaled softmax (temperature tuning) to prevent attention probability scores from fading as sequence length increases. This hybrid approach enabled LLaMA 4 Scout to achieve a 10 million token context window.
| Method | Type | Added parameters | Length extrapolation | Relative position | Key model users |
|---|---|---|---|---|---|
| Sinusoidal (Vaswani et al., 2017) | Absolute | None | Limited | Indirect (via linear property) | Original Transformer |
| Learned embeddings | Absolute | O(L * d) | None (fixed max length) | No | BERT, GPT-2, GPT-3 |
| Shaw et al. (2018) | Relative | O(K * d) | Moderate (via clipping) | Yes (learned pairwise) | Various research models |
| Transformer-XL (Dai et al., 2019) | Relative | O(L * d) | Good | Yes (decomposed scores) | Transformer-XL, XLNet |
| T5 RPE (Raffel et al., 2020) | Relative | O(B * H) | Moderate | Yes (scalar bias, bucketed) | T5, Flan-T5 |
| RoPE (Su et al., 2021) | Relative (via rotation) | None | Good (with scaling) | Yes (rotation encodes difference) | LLaMA, Mistral, Qwen, DeepSeek |
| ALiBi (Press et al., 2021) | Relative | None | Strong | Yes (linear distance penalty) | BLOOM, MPT, StarCoder |
| KERPLE (Chi et al., 2022) | Relative | Small (kernel params) | Strong | Yes (kernel-based) | Research models |
| RoPE + PI (Chen et al., 2023) | Relative (via rotation) | None | Strong (with fine-tuning) | Yes | Extended LLaMA models |
| RoPE + YaRN (Peng et al., 2023) | Relative (via rotation) | None | Very strong | Yes | Qwen 2.5, DeepSeek-V3, many others |
| FIRE (Li et al., 2023) | Relative | Small (MLP) | Strong | Yes (functional interpolation) | Research models |
| iRoPE (Meta, 2025) | Hybrid (RoPE + NoPE) | None | Very strong | Yes (in RoPE layers) | LLaMA 4 |
The evolution of positional encoding has been one of the primary drivers behind the dramatic expansion of context windows in large language models. From the original transformer's 512-token contexts to today's multi-million-token windows, positional encoding innovations have been central to each generational leap.
| Year | Model | Context length | Positional encoding method |
|---|---|---|---|
| 2017 | Original Transformer | 512 tokens | Sinusoidal |
| 2018 | BERT | 512 tokens | Learned embeddings |
| 2019 | GPT-2 | 1,024 tokens | Learned embeddings |
| 2020 | GPT-3 | 2,048 tokens | Learned embeddings |
| 2022 | BLOOM | 2,048 tokens | ALiBi |
| 2023 | LLaMA 2 | 4,096 tokens | RoPE |
| 2023 | Claude 2 | 100,000 tokens | Undisclosed |
| 2023 | GPT-4 Turbo | 128,000 tokens | Undisclosed |
| 2024 | LLaMA 3.1 | 128,000 tokens | RoPE (theta=500K, with scaling) |
| 2024 | Qwen 2.5 | 128,000 tokens | RoPE + YaRN |
| 2024 | DeepSeek-V3 | 128,000 tokens | RoPE + YaRN (decoupled) |
| 2024 | Gemini 1.5 Pro | 2,000,000 tokens | Undisclosed |
| 2025 | LLaMA 4 Scout | 10,000,000 tokens | iRoPE |
| 2025 | Claude (Opus/Sonnet) | 1,000,000 tokens | Undisclosed |
| 2025 | GPT-5 | 1,000,000 tokens | Undisclosed |
Extending context to millions of tokens introduces several challenges beyond positional encoding:
Attention score distribution. As context length grows, the softmax distribution in attention becomes increasingly flat, making it harder for the model to focus on relevant tokens. Temperature scaling (as in YaRN and iRoPE) partially addresses this, but very long contexts still suffer from diluted attention. Research into sparse attention patterns, retrieval-augmented attention, and hierarchical attention mechanisms complements positional encoding work.
Training data requirements. Models need to be trained (or at least fine-tuned) on sequences of comparable length to their target context. Generating and processing such long sequences is computationally expensive. The efficiency of methods like YaRN, which requires only a few hundred fine-tuning steps, has been critical to making long-context models practical.
Evaluation difficulty. Standard benchmarks such as perplexity on short texts do not capture a model's ability to use long context effectively. Specialized benchmarks like "needle in a haystack" retrieval, long-document QA, and multi-document summarization have been developed to assess long-context performance.
Effective vs. advertised context. Research suggests that effective utilization of context typically reaches 60-70% of the advertised maximum. Models may accept very long inputs but fail to retrieve or reason over information in the middle of the context (the "lost in the middle" phenomenon). Better positional encoding can help, but it is not the sole factor in effective long-context use.
As of early 2026, RoPE remains the dominant positional encoding method in large language models. After Meta adopted RoPE for the LLaMA family, virtually every major open-source LLM followed: Mistral, Gemma, Qwen, DeepSeek, Falcon 2, Yi, and Phi all use RoPE. The entire inference ecosystem, including FlashAttention, vLLM, and Hugging Face's rope_scaling API, has been optimized around it.
For encoder-only models used in classification and retrieval tasks, learned positional embeddings remain common, in part because these models typically operate on shorter sequences where length extrapolation is less of a concern.
Some recent architectures have begun experimenting with hybrid approaches. Command R7B (Cohere, 2024) combines RoPE layers with layers that have no positional encoding at all (NoPE). Gemma 3 uses RoPE with different base frequencies for local attention (theta = 10,000) and global attention (theta = 1,000,000) within the same model. LLaMA 4's iRoPE architecture (described above) takes this further by systematically interleaving RoPE and NoPE layers.
There is also growing interest in models that use no positional encoding at all. Research has shown that some decoder-only models can learn implicit position information from causal attention masks alone, though this approach has not yet been widely adopted in production models.
Positional encoding is not limited to language models. Vision transformers (ViT) (Dosovitskiy et al., 2021) split images into patches and treat them as a sequence, using either learned 1D positional embeddings (treating patches in raster order) or learned 2D embeddings that reflect the spatial grid structure. Subsequent work explored relative position biases for vision, including the Swin Transformer (Liu et al., 2021), which uses relative position biases within local attention windows.
Multimodal models that combine text and image tokens must reconcile different positional encoding strategies for different modalities, which remains an active area of research. Qwen2-VL introduced Multimodal RoPE (MRoPE), which unifies positional encoding for text and visual tokens within a single framework. Newer work on Multi-Head RoPE (MHRoPE) and MRoPE-Interleave has shown further improvements across both general and fine-grained multimodal understanding benchmarks.
When implementing positional encoding, several practical considerations arise:
Addition vs. concatenation. Most methods add the positional encoding to the token embedding. An alternative is to concatenate the two vectors, which enlarges the input dimension. Addition is strongly preferred in practice because it avoids growing the model dimension.
Where to apply. Sinusoidal and learned encodings are typically applied once at the input layer. RoPE, by contrast, is applied at every attention layer, rotating query and key vectors before computing attention scores. ALiBi modifies the attention logits directly. The layer at which position information is injected affects how the model uses it.
Scaling for long contexts. When extending a model to longer contexts than it was trained on, position encodings often need adjustment. For RoPE, techniques like Position Interpolation, NTK-aware scaling, and YaRN have been developed (see above). For learned embeddings, the only option is typically to fine-tune with longer sequences. For ALiBi, the linear bias naturally extends to longer sequences without modification, which was one of its original selling points.
Computational cost. Sinusoidal encoding and ALiBi have negligible computational cost. RoPE requires a rotation operation at each layer but can be implemented efficiently using element-wise operations rather than matrix multiplication (exploiting the block-diagonal structure of the rotation). Learned embeddings require a lookup but no computation. Methods like FIRE that use MLPs incur a small additional cost.
Interaction with FlashAttention. Modern fused attention kernels like FlashAttention can incorporate RoPE rotations and ALiBi biases directly into the attention computation, avoiding the need to materialize the full attention matrix. This integration is important for practical efficiency, especially at long context lengths where the attention matrix would be prohibitively large.