Rotary Position Embedding
Last reviewed
May 8, 2026
Sources
22 citations
Review status
Source-backed
Revision
v6 · 5,225 words
Rotary Position Embedding (RoPE) is a positional encoding method for transformer models that encodes absolute position by rotating query and key vectors in two-dimensional subspaces. Introduced by Jianlin Su and colleagues at Zhuiyi Technology in 2021 in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding," RoPE has become the dominant positional encoding scheme in modern large language models. Its core insight is that multiplying query and key vectors by rotation matrices causes their dot product to depend only on relative position, not absolute position. This gives the model relative position information without requiring explicit relative position embeddings or any additional learnable parameters.
The underlying idea first appeared in a series of Chinese-language blog posts on Su's site Kexue.fm in late 2020 and early 2021 before being formalized in the RoFormer paper. EleutherAI later picked up the technique, implemented it in GPT-J and GPT-NeoX, and wrote an influential English-language explainer that helped popularize RoPE in the broader open-source community.
RoPE is used in virtually all major open-source LLMs released since 2023, including LLaMA, LLaMA 2, LLaMA 3, Llama 4, Mistral, Mixtral, Qwen, Gemma, DeepSeek, Yi, Phi, and Falcon 2. Google's PaLM also adopted RoPE. The method's combination of simplicity, parameter-free design, and compatibility with efficient inference (particularly KV caching) has made it the de facto standard.
Before RoPE, transformer models typically used either sinusoidal positional encodings (Vaswani et al., 2017) or learned positional embeddings (BERT, GPT-2). Both are absolute positional methods: they assign a fixed vector to each position in the sequence, which is added to the token embedding. While these approaches work, they have notable limitations.
Absolute positional encodings do not directly encode the distance between tokens. If a model needs to learn that "a verb typically follows its subject by one or two positions," it must extract this relative information indirectly from absolute position vectors. Learned embeddings also impose a hard maximum sequence length, since there is no embedding for positions beyond the training maximum.
Relative positional methods like those in Shaw et al. (2018), Transformer-XL (Dai et al., 2019), and T5 (Raffel et al., 2020) address the relative position problem by adding bias terms that depend on the distance between query and key positions. ALiBi (Press et al., 2022) takes a similar route with linear attention biases that decay with distance. However, these methods typically require additional parameters or modify the attention computation in ways that can complicate efficient implementation.
Su et al. sought a method that would encode absolute position (so each token knows where it is), naturally produce relative position information in the attention scores (so the model knows how far apart tokens are), require no additional parameters, and work efficiently with standard transformer implementations. RoPE belongs to a different family from earlier methods: it does not add anything to the residual stream. Instead, it transforms the query and key vectors right before the attention dot product, so the dot product itself becomes position-aware.
The central idea of RoPE is to encode position by rotating the query and key vectors. A rotation in two dimensions is a well-understood geometric operation: you take a 2D vector and spin it by some angle. The rotation preserves the vector's magnitude while changing its direction.
RoPE exploits a specific property of rotations: when you compute the dot product between two rotated vectors, the result depends only on the angle difference between the two rotations, not on the individual rotation angles. If vector A is rotated by angle alpha and vector B is rotated by angle beta, their dot product depends on (alpha - beta), not on alpha and beta individually. Since self-attention is fundamentally a dot product between query and key vectors, this property means that rotating queries and keys by position-dependent angles automatically injects relative position information into the attention scores.
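The angle-difference property can be checked numerically. The following is a minimal sketch in plain Python; the helper names `rotate2d` and `dot` and the sample vectors are illustrative assumptions, not part of any library.

```python
import math

def rotate2d(v, angle):
    """Rotate the 2D vector v = (x, y) counter-clockwise by `angle` radians."""
    x, y = v
    c, s = math.cos(angle), math.sin(angle)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

a, b = (1.0, 2.0), (0.5, -1.5)

# Two different pairs of absolute angles with the same difference (0.3 radians).
score_1 = dot(rotate2d(a, 0.7), rotate2d(b, 0.4))
score_2 = dot(rotate2d(a, 5.3), rotate2d(b, 5.0))

# The two scores agree up to floating-point error.
print(score_1, score_2)
```

Only the gap between the two angles enters the result, which is the property RoPE relies on when the angles are tied to token positions.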
Pair up dimensions. Given a query or key vector of dimension d, RoPE groups the dimensions into d/2 consecutive pairs: (x_0, x_1), (x_2, x_3), (x_4, x_5), and so on. Each pair defines a 2D subspace.
Assign rotation frequencies. Each pair i (where i ranges from 0 to d/2 - 1) is assigned a base frequency: theta_i = 10000^(-2i/d). This follows the same frequency schedule as the original sinusoidal positional encoding. Lower-indexed pairs rotate at higher frequencies (capturing fine-grained position differences), while higher-indexed pairs rotate at lower frequencies (capturing coarser position information).
Compute rotation angles. For a token at position m, the rotation angle for pair i is simply m * theta_i. Tokens at later positions get rotated by larger angles.
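Apply 2D rotation to each pair. For each dimension pair (x_{2i}, x_{2i+1}) of a token at position m, the rotation is applied as:
x'_{2i} = x_{2i} * cos(m * theta_i) - x_{2i+1} * sin(m * theta_i)
x'_{2i+1} = x_{2i} * sin(m * theta_i) + x_{2i+1} * cos(m * theta_i)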
This is the standard 2D rotation matrix applied to each pair independently.
Compute attention as usual. After rotation, the query and key vectors are used in the standard dot-product attention computation. No other changes to the attention mechanism are needed.
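The steps above can be sketched as a short reference function. This is a minimal, unoptimized illustration in plain Python, assuming the consecutive-pair convention and base 10,000 described in the text; the names `rope` and `dot` and the sample vectors are hypothetical.

```python
import math

def rope(x, position, base=10000.0):
    """Apply RoPE to one vector x (even length d) at an integer position."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = [0.0] * d
    for i in range(d // 2):
        theta_i = base ** (-2.0 * i / d)   # frequency assigned to pair i
        angle = position * theta_i         # rotation angle at this position
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]    # the consecutive pair
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.1, -0.6, 0.9, 0.4, -0.3, 0.7, 0.2]

# Same offset (7 positions) at two different absolute locations.
score_a = dot(rope(q, 3), rope(k, 10))
score_b = dot(rope(q, 103), rope(k, 110))
print(score_a, score_b)
```

Shifting both positions by the same amount leaves the score unchanged up to rounding, and position 0 leaves the vector as it was.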
More formally, let x_m be the embedding of the token at position m, and let W_q and W_k be the query and key projection matrices. The query and key vectors are:
q_m = R(theta, m) * W_q * x_m
k_n = R(theta, n) * W_k * x_n
where R(theta, m) is a block-diagonal rotation matrix. Each 2x2 block along the diagonal is:
[ cos(m * theta_i) -sin(m * theta_i) ]
[ sin(m * theta_i) cos(m * theta_i) ]
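The attention score between positions m and n is:
q_m^T * k_n = (W_q * x_m)^T * R(theta, m)^T * R(theta, n) * (W_k * x_n)
Because rotation matrices are orthogonal, R(theta, m)^T = R(theta, -m), and because rotations in the same plane compose by adding their angles, R(theta, m)^T * R(theta, n) = R(theta, n - m). The score therefore depends on the relative position (n - m) rather than on m and n individually:
q_m^T * k_n = (W_q * x_m)^T * R(theta, n - m) * (W_k * x_n)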
This is the core mathematical property that makes RoPE work: absolute rotations produce relative position information in the dot product.
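For a single 2x2 block, the identity can be verified directly with the angle-subtraction formulas. The derivation below is a worked check in LaTeX notation, using the same symbols as the text (theta_i for the pair frequency, m and n for the positions):

```latex
R(\theta_i, m)^{\top} R(\theta_i, n)
= \begin{pmatrix} \cos m\theta_i & \sin m\theta_i \\ -\sin m\theta_i & \cos m\theta_i \end{pmatrix}
  \begin{pmatrix} \cos n\theta_i & -\sin n\theta_i \\ \sin n\theta_i & \cos n\theta_i \end{pmatrix}
= \begin{pmatrix} \cos (n-m)\theta_i & -\sin (n-m)\theta_i \\ \sin (n-m)\theta_i & \cos (n-m)\theta_i \end{pmatrix}
= R(\theta_i, n - m)
```

Since the full matrix is block-diagonal, the same identity holds block by block for all d/2 pairs.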
RoPE has several properties that contributed to its widespread adoption:
RoPE is sometimes described as a "bridge" between absolute and relative positional encoding. Each token is encoded with its absolute position (via the rotation angle), but the attention mechanism naturally computes relative positions (via the angle difference in the dot product). This gives the model both types of information without requiring separate mechanisms.
RoPE adds zero learnable parameters to the model. The rotation angles are deterministic functions of position and dimension index. This means RoPE has exactly the same parameter count as a model with no positional encoding at all, and the same count as the original sinusoidal encoding. Fused into the attention kernel, RoPE costs roughly 1 to 3 percent extra runtime.
Su et al. showed that the inner product between RoPE-encoded vectors exhibits a natural decay as the relative distance increases. Tokens that are close together tend to have higher attention scores (all else being equal) than tokens far apart. This aligns with the empirical observation that local context is generally more relevant than distant context in natural language. The authors framed this as a soft inductive bias toward locality that is consistent with the empirical behavior of language.
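The trend can be illustrated numerically for one special case. The sketch below assumes q = k = an all-ones vector with d = 128 and base 10,000, so it shows the behavior for that choice only and is not a general proof; the function names are hypothetical.

```python
import math

def rope_similarity(distance, d=128, base=10000.0):
    """Score between two all-ones vectors whose positions differ by `distance`.

    Each 2D pair contributes 2 * cos(distance * theta_i), so the value at
    distance 0 is exactly d.
    """
    return sum(2.0 * math.cos(distance * base ** (-2.0 * i / d))
               for i in range(d // 2))

def window_mean(start, width=50):
    """Average score over distances start .. start + width - 1."""
    return sum(rope_similarity(r) for r in range(start, start + width)) / width

near = window_mean(1)      # distances 1..50
far = window_mean(1000)    # distances 1000..1049
print(near, far)
```

Averaging over a window smooths out the oscillation of individual distances, so the comparison reflects the overall envelope rather than any single offset.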
During autoregressive generation, transformers cache the key and value vectors from previous tokens to avoid recomputing them. Because RoPE applies the rotation to query and key vectors before they enter the attention computation, the rotation can be applied once when a token is first processed, and the rotated key can be cached directly. This is straightforward to implement and does not interfere with standard KV caching strategies, which is an important practical consideration for inference efficiency.
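A toy single-head cache makes the point concrete. This is a minimal sketch in plain Python, assuming the consecutive-pair convention; `RotatedKeyCache` and its methods are hypothetical names, and values are omitted because RoPE does not touch them.

```python
import math

def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position * theta_i."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = position * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

class RotatedKeyCache:
    """Toy cache: each key is rotated once, when it is inserted."""

    def __init__(self):
        self.keys = []

    def append(self, k):
        # The new token's position is its index in the cache.
        self.keys.append(rope(k, len(self.keys)))

    def logits(self, q):
        """Unnormalized scores of the newest token's query against all cached keys."""
        q_rot = rope(q, len(self.keys) - 1)
        return [sum(a * b for a, b in zip(q_rot, k)) for k in self.keys]

raw_keys = [[1.0, 0.5, -0.2, 0.3], [0.1, -0.4, 0.9, 0.6], [-0.7, 0.2, 0.4, -1.0]]
query = [0.3, 0.8, -0.5, 0.1]

cache = RotatedKeyCache()
for k in raw_keys:
    cache.append(k)
cached_logits = cache.logits(query)
print(cached_logits)
```

The cached keys never need to be re-rotated as the sequence grows, because each key's rotation depends only on its own position.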
A further advantage is compatibility with linear attention. Because RoPE multiplies queries and keys by orthogonal rotation matrices and does not introduce a softmax-dependent bias, it can be combined with kernelized or linear attention variants (such as Performer-style attention) where additive relative biases are difficult to apply.
Because RoPE's rotation angles are continuous functions of position, they are defined for any positive integer position. A model trained with a maximum sequence length of 4,096 will still produce well-defined rotations for position 8,192. Whether the model actually performs well at those extended positions depends on other factors, but the positional encoding itself does not have a hard boundary. This property has made RoPE the foundation for several context-length extension techniques.
| Property | Sinusoidal | Learned embeddings | ALiBi | RoPE | NoPE |
|---|---|---|---|---|---|
| Type | Absolute | Absolute | Relative (bias) | Relative (via rotation) | None |
| Additional parameters | None | O(L * d) | None (fixed slopes per head) | None | None |
| Where applied | Input layer (additive) | Input layer (additive) | Attention logits (additive bias) | Query/key at each layer (multiplicative) | Not applied |
| Relative position signal | Indirect | No | Direct (linear penalty) | Direct (rotation difference) | Implicit via causal mask |
| Length extrapolation | Limited | None (hard max) | Strong | Moderate (strong with PI/NTK/YaRN scaling) | Surprisingly strong on some reasoning tasks |
| Adopted by | Original Transformer | BERT, GPT-2 | BLOOM, MPT | LLaMA family, PaLM, Mistral, Mixtral, Qwen, DeepSeek, GPT-J, GPT-NeoX, Phi-3 | Some research models, partial use in Llama 4 (iRoPE) |
| Distance decay | No built-in decay | No built-in decay | Strong (linear) | Moderate (oscillating) | None |
Kazemnejad et al. (2023), in The Impact of Positional Encoding on Length Generalization in Transformers, found that decoder-only transformers trained without any positional encoding (NoPE) often generalize to longer sequences better than RoPE, ALiBi, or learned absolute encodings on certain reasoning tasks. The result is one reason researchers continue to question whether RoPE is universally optimal even though it dominates in production systems.
The widespread adoption of RoPE can be traced to Meta's release of LLaMA in February 2023. LLaMA's architecture choices, including RoPE, became a template that nearly every subsequent open-source LLM followed. The timeline of adoption includes:
| Model | Organization | Release | RoPE variant |
|---|---|---|---|
| GPT-J | EleutherAI | June 2021 | Partial RoPE on 25% of head dimensions |
| GPT-NeoX | EleutherAI | February 2022 | Partial RoPE on 25% of head dimensions (GPT-NeoX-20B) |
| PaLM | Google | April 2022 | Standard RoPE |
| LLaMA | Meta | February 2023 | Standard RoPE |
| LLaMA 2 | Meta | July 2023 | Standard RoPE |
| Code Llama | Meta | August 2023 | RoPE with NTK scaling |
| Mistral 7B | Mistral AI | September 2023 | Standard RoPE with sliding-window attention |
| Qwen | Alibaba | September 2023 | Standard RoPE with NTK-aware scaling |
| Yi | 01.AI | November 2023 | Standard RoPE |
| Phi-2 | Microsoft | December 2023 | Standard RoPE |
| Mixtral 8x7B / 8x22B | Mistral AI | December 2023 / April 2024 | Standard RoPE |
| Gemma | Google | February 2024 | Standard RoPE |
| Mistral Large | Mistral AI | February 2024 | Standard RoPE |
| LLaMA 3 | Meta | April 2024 | RoPE with extended theta (500,000) |
| Phi-3 | Microsoft | April 2024 | RoPE with Su-scaled or YaRN-style scaling for 128k variant |
| DeepSeek-V2 | DeepSeek | May 2024 | Decoupled RoPE with Multi-head Latent Attention |
| Qwen 2.5 | Alibaba | September 2024 | RoPE with YaRN scaling |
| Llama 4 | Meta | 2025 | iRoPE (interleaved RoPE and NoPE layers) |
| Qwen 3 | Alibaba | 2025 | RoPE |
| DeepSeek-V3 | DeepSeek | 2024 to 2025 | Decoupled RoPE with MLA |
The infrastructure ecosystem followed this adoption wave. FlashAttention, vLLM, and Hugging Face Transformers all provide optimized RoPE implementations. Hugging Face's rope_scaling configuration parameter allows users to specify different scaling strategies (linear, dynamic, YaRN) directly in model configuration files.
A few notable variants deserve mention. GPT-J applies RoPE to only 25 percent of head dimensions (64 out of 256), leaving the remaining dimensions untouched, and GPT-NeoX-20B likewise rotates only the first 25 percent of each head. DeepSeek-V2 introduced Decoupled RoPE, in which the rotation is applied to a separate small portion of the key and query vectors so the rest can be compressed by Multi-head Latent Attention. Llama 4 introduces iRoPE, an interleaved scheme that alternates standard RoPE attention layers with NoPE layers in roughly a 3:1 ratio to support very long contexts.
The base value 10,000 in theta_i = 10000^(-2i/d) is a hyperparameter, not a fundamental constant. Su et al. (2021) inherited it from the sinusoidal encoding for direct continuity with the Transformer's original encoding scheme. Larger bases produce slower rotations across all dimensions, which means the same numerical position m corresponds to a smaller phase angle. This effectively spreads positional information over a longer range and is the foundation of NTK-aware context extension.
The d/2 frequencies span a wide spectrum. The first few pairs rotate quickly and complete many full revolutions even within a short context window; they encode fine-grained, high-frequency positional information. The last few pairs rotate very slowly and typically complete only a fraction of one revolution within the entire training context; they encode coarse, low-frequency, long-range positional information.
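The spread of frequencies is easier to see as wavelengths, meaning the number of tokens a pair needs for one full revolution (2 * pi / theta_i). The sketch below assumes a head dimension of 128 and a 4,096-token context purely for illustration, and compares base 10,000 with the 500,000 value the article cites for LLaMA 3; the function name `wavelengths` is hypothetical.

```python
import math

def wavelengths(d, base):
    """Tokens needed for each pair to complete one full revolution."""
    return [2.0 * math.pi / base ** (-2.0 * i / d) for i in range(d // 2)]

d, context = 128, 4096  # illustrative assumptions
for base in (10_000.0, 500_000.0):
    lam = wavelengths(d, base)
    slow = sum(1 for w in lam if w > context)  # pairs that never finish a turn
    print(f"base={base:>9.0f}  shortest={lam[0]:.2f}  longest={lam[-1]:.0f}  "
          f"pairs slower than context={slow}")
```

A larger base leaves the fastest pair unchanged and stretches every other wavelength, so more pairs stay within a single revolution over the same context.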
This two-end behavior matters for context extension. High-frequency dimensions saturate quickly and add little useful signal at long range; low-frequency dimensions carry the long-range signal but are sparsely sampled. Several context-extension techniques exploit this asymmetry by treating different frequencies differently.
One of RoPE's most active areas of development is extending the context length of pre-trained models beyond their original training length. Vanilla RoPE generalizes poorly when a model trained on, for example, 2,048 tokens is asked to attend over 8,192 tokens. The attention scores at the new positions involve rotation angles the model never saw during training, and perplexity rises sharply. Research starting in mid-2023 produced a sequence of techniques to extend the effective context window of pretrained RoPE models with little or no fine-tuning.
Chen et al. (2023) proposed position interpolation, a simple technique: instead of extrapolating RoPE to positions beyond the training range, interpolate by dividing all position indices by a scaling factor s. If a model was trained with max length 4,096 and you want to extend to 32,768, set s = 8 and divide all positions by 8. This maps the extended range [0, 32,768] back into [0, 4,096], where the model has seen positions during training. A small amount of fine-tuning (around 1,000 steps) is needed to adapt the model to the compressed position space. The authors argued that interpolation has an upper-bound error roughly 600x smaller than direct extrapolation, which is the main reason it works so well with such little training.
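The arithmetic of position interpolation can be sketched in a few lines. This is an illustrative plain-Python example using the 4,096 to 32,768 extension from the text; the function `rope_angles` and its `scale` parameter are hypothetical names, not a library API.

```python
def rope_angles(position, d=128, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 applies position interpolation."""
    return [(position / scale) * base ** (-2.0 * i / d) for i in range(d // 2)]

trained_len, target_len = 4096, 32768
s = target_len / trained_len  # scaling factor, 8.0 in this example

# The last position of the extended range lands on the last trained position.
extended = rope_angles(target_len, scale=s)
original = rope_angles(trained_len)
print(extended == original)
```

Every extended position maps to a (possibly fractional) position inside the trained range, which is why the model needs a short fine-tuning run to adapt to the denser spacing.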
NTK-aware (Neural Tangent Kernel-aware) interpolation was developed through community experimentation on the r/LocalLLaMA Reddit community in mid-2023, originating in a post by a user named bloc97. The insight is that position interpolation treats all frequency dimensions equally, but this is suboptimal. High-frequency dimensions (which capture fine-grained position differences) lose resolution when compressed, while low-frequency dimensions (which capture coarse position information) barely need compression at all.
NTK-aware scaling addresses this by modifying the base frequency parameter. Instead of scaling position indices, it scales the base value (10,000) in the frequency formula: the new base is b' = b * s^(d / (d - 2)), where s is the desired length-extension factor. This stretches the low-frequency components more than the high-frequency ones, spreading the interpolation pressure across dimensions. The approach can be applied without any fine-tuning at all, achieving 2x to 4x extension zero-shot, though fine-tuning improves quality.
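The effect of the base change on each frequency can be checked directly from the formula. The sketch below is plain Python with hypothetical function names, assuming d = 128 and s = 8 for illustration.

```python
def thetas(d, base):
    """Per-pair frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def ntk_aware_base(base, s, d):
    """New base b' = b * s^(d / (d - 2)) for a length-extension factor s."""
    return base * s ** (d / (d - 2))

d, base, s = 128, 10000.0, 8.0
old = thetas(d, base)
new = thetas(d, ntk_aware_base(base, s, d))

# Ratio new/old per pair: 1 for the fastest pair, 1/s for the slowest.
ratios = [n / o for n, o in zip(new, old)]
print(ratios[0], ratios[-1])
```

The exponent d / (d - 2) is chosen so that the slowest pair is slowed by exactly the factor s, matching position interpolation there, while the fastest pair is left untouched.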
NTK-aware scaling was quickly adopted by popular inference frameworks including llama.cpp and text-generation-webui, and was incorporated into several Hugging Face model implementations. A related variant called Dynamic NTK, which emerged from the same Reddit and EleutherAI community discussions, applies NTK-aware scaling adaptively at inference time based on the current sequence length, so the model uses no scaling at short lengths (preserving its trained behavior) and smoothly increases scaling as the sequence grows.
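The adaptive behavior can be sketched as follows. This is a deliberately simplified illustration that sets the extension factor to the ratio of current to trained length and reuses the NTK-aware base formula; it is an assumption made for clarity, and library implementations may use a different expression for the factor. The function name is hypothetical.

```python
def dynamic_ntk_base(seq_len, train_len, d, base=10000.0):
    """Pick the RoPE base from the current sequence length (simplified sketch)."""
    # No scaling until the sequence exceeds the trained length.
    s = max(1.0, seq_len / train_len)
    return base * s ** (d / (d - 2))

for seq_len in (1024, 4096, 8192, 16384):
    print(seq_len, round(dynamic_ntk_base(seq_len, train_len=4096, d=128), 1))
```

One practical consequence is that keys cached under an earlier base were rotated with different frequencies than later ones, so implementations have to decide whether to recompute or accept the mismatch.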
Peng et al. (2023) introduced YaRN in the first formal academic paper to rigorously analyze and improve upon the community NTK-aware scaling methods. The authors are Bowen Peng and Jeffrey Quesnelle (Nous Research), Honglu Fan (EleutherAI and University of Geneva), and Enrico Shippole. YaRN's key innovation is recognizing that different frequency ranges require fundamentally different scaling strategies. It addresses this with two mechanisms:
NTK-by-parts (ramp function for dimension-wise interpolation). A ramp function decides whether each dimension is fully interpolated, not interpolated, or partially interpolated, depending on its wavelength relative to the original context length. For LLaMA-family models the authors used parameters alpha = 1, beta = 32. Low-frequency dimensions receive more interpolation, while high-frequency dimensions receive less.
Temperature scaling for attention logits. Extending context length causes a distributional shift in attention scores because longer sequences spread probability mass over more tokens. YaRN applies a temperature factor t to attention logits to compensate for this shift, with sqrt(1/t) = 0.1 * ln(s) + 1, where s is the extension factor.
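Both mechanisms can be sketched compactly. The following plain-Python example is a minimal illustration under stated assumptions: the ramp is taken to be linear in the number of full rotations a pair completes within the original context, clamped between alpha and beta, and the function names are hypothetical rather than taken from any released code.

```python
import math

def yarn_thetas(d, base, s, train_len, alpha=1.0, beta=32.0):
    """NTK-by-parts: blend interpolated and original frequencies per pair."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        # Full rotations this pair completes within the original context.
        rotations = train_len * theta / (2.0 * math.pi)
        # Ramp: 0 means fully interpolated, 1 means left unchanged.
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1.0 - gamma) * theta / s + gamma * theta)
    return out

def yarn_qk_scale(s):
    """The factor sqrt(1/t) = 0.1 * ln(s) + 1 from the text."""
    return 0.1 * math.log(s) + 1.0

d, base, s, train_len = 128, 10000.0, 16.0, 4096
scaled = yarn_thetas(d, base, s, train_len)
print(scaled[0], scaled[-1], yarn_qk_scale(s))
```

With these settings the fastest pair keeps its original frequency and the slowest pair is divided by s, with a smooth transition in between.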
YaRN achieves the same extended-context perplexity as PI with roughly 10x fewer training tokens and 2.5x fewer training steps, while requiring fine-tuning on less than 0.1% of the original pre-training data. It enables models to extrapolate beyond even the fine-tuning context length. LLaMA models extended with YaRN have been shown to handle context lengths of 64K and 128K tokens effectively, and YaRN became the standard recipe for context extension in many open-source LLM releases through 2024.
Ding et al. (2024) at Microsoft introduced LongRoPE, which uses an evolutionary search to find a non-uniform, per-frequency rescaling that further reduces the gap between original and extended context performance. LongRoPE extends pretrained LLMs to 2,048k tokens with up to 1,000 fine-tuning steps at 256k length, while preserving short-context quality. The method underpins the 128k-context variants of Phi-3.
| Method | Date and source | Core idea | Fine-tuning needed | Typical extension |
|---|---|---|---|---|
| Position Interpolation (PI) | Chen et al., June 2023 | Linearly downscale positions by factor s = L'/L so all positions fall in trained range | About 1,000 steps | 2x to 16x |
| NTK-aware scaling | bloc97 on Reddit r/LocalLLaMA, June 2023 | Increase base theta instead of uniformly scaling positions; preserves high-frequency components | None (zero-shot) or minimal | 2x to 4x with no fine-tuning |
| Dynamic NTK | Reddit / EleutherAI work, mid-2023 | Adjust base on the fly during inference based on current sequence length | None | Smooth degradation past trained length |
| YaRN | Peng et al., August 2023 | Combine NTK-by-parts with attention temperature scaling | About 10x fewer tokens than PI | 32x to 64x with brief fine-tuning |
| LongRoPE | Ding et al. (Microsoft), February 2024 | Search for non-uniform per-frequency scaling factors with evolutionary search | Up to 1,000 steps at 256k length | Up to 2,048k tokens (used in Phi-3-128k) |
| Extended theta | Standard pre-training | Increase base from 10,000 to 500,000 or more | Yes (full pre-training) | Long contexts trained from scratch |
Some models have adopted the simpler approach of just increasing the base theta value during pre-training. LLaMA 3 used a base theta of 500,000 (compared to the standard 10,000), which inherently allows the model to handle longer sequences because the rotation frequencies are lower. This approach requires training with the extended theta from the start (or extensive continued pre-training) but avoids the need for post-hoc scaling.
In practice, RoPE is implemented efficiently without explicitly constructing the full rotation matrices. The rotation of each 2D pair can be computed with simple element-wise multiplications and additions:
q_rotated[2i] = q[2i] * cos(m * theta_i) - q[2i+1] * sin(m * theta_i)
q_rotated[2i+1] = q[2i] * sin(m * theta_i) + q[2i+1] * cos(m * theta_i)
The cosine and sine values for each position and dimension pair can be precomputed and cached as a lookup table, so the runtime cost of RoPE is just two element-wise multiplications and one addition per vector element (four multiplications and two additions per dimension pair), per token, per layer. This is negligible compared to the cost of the attention computation itself.
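The lookup-table approach can be sketched as below. This is a minimal plain-Python illustration with hypothetical function names; production kernels store the table as a tensor and fuse the rotation into attention.

```python
import math

def build_rope_table(max_positions, d, base=10000.0):
    """Precompute cos and sin of m * theta_i for every position m and pair i."""
    thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [[math.cos(m * t) for t in thetas] for m in range(max_positions)]
    sin = [[math.sin(m * t) for t in thetas] for m in range(max_positions)]
    return cos, sin

def apply_rope(x, cos_row, sin_row):
    """Rotate consecutive pairs of x using one precomputed table row."""
    out = []
    for i, (c, s) in enumerate(zip(cos_row, sin_row)):
        x0, x1 = x[2 * i], x[2 * i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

cos, sin = build_rope_table(max_positions=16, d=8)
x = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
rotated = apply_rope(x, cos[5], sin[5])  # token at position 5
print(rotated)
```

The table depends only on position and dimension, so it can be shared across all heads, layers, and batch elements.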
A common compact form, used in modern codebases, expresses the rotation elementwise. Letting cos(m * theta) and sin(m * theta) be the vectors of d/2 per-pair values, tiled to the full head dimension d, the rotated query is:
q_rotated = q * cos(m * theta) + rotate_half(q) * sin(m * theta)
where rotate_half swaps the two halves of the vector and negates the half that moves to the front, mapping (x_a, x_b) to (-x_b, x_a). This is the form used in the apply_rotary_pos_emb function in the Hugging Face transformers library, in src/transformers/models/llama/modeling_llama.py. The rotation table is built once per forward pass by LlamaRotaryEmbedding, which exposes hooks for PI, NTK-aware, dynamic NTK, YaRN, and LongRoPE scaling through configuration flags.
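The compact form can be illustrated with plain Python lists. This is a sketch written for this article, not the Hugging Face code itself; it assumes the half-split pairing in which element j is paired with element j + d/2, and the name `apply_rope_half` is hypothetical.

```python
import math

def rotate_half(x):
    """Map (x_a, x_b) to (-x_b, x_a), where x_a and x_b are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def apply_rope_half(x, position, base=10000.0):
    """q_rotated = q * cos + rotate_half(q) * sin, with cos/sin tiled to length d."""
    d = len(x)
    angles = [position * base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(a) for a in angles] * 2  # d/2 values repeated to length d
    sin = [math.sin(a) for a in angles] * 2
    rot = rotate_half(x)
    return [xv * c + rv * s for xv, rv, c, s in zip(x, rot, cos, sin)]

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.1, -0.6, 0.9, 0.4, -0.3, 0.7, 0.2]
score_a = sum(a * b for a, b in zip(apply_rope_half(q, 3), apply_rope_half(k, 10)))
score_b = sum(a * b for a, b in zip(apply_rope_half(q, 53), apply_rope_half(k, 60)))
print(score_a, score_b)
```

Expressing the rotation as two elementwise products and a sum maps well onto vectorized tensor operations, which is the main reason this form is popular.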
An equivalent formulation, often used in implementations, restructures the computation using complex number multiplication. If each pair (x_{2i}, x_{2i+1}) is treated as a complex number x_{2i} + i * x_{2i+1}, then the rotation is simply multiplication by the complex number cos(m * theta_i) + i * sin(m * theta_i) = e^(i * m * theta_i). This view makes the mathematical structure clearer and can be implemented efficiently using complex number support in frameworks like PyTorch.
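The complex-number view can be sketched with Python's built-in complex type. This is a minimal illustration using the consecutive-pair convention; the function name `rope_complex` is hypothetical.

```python
import cmath

def rope_complex(x, position, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by multiplying by e^(i * m * theta_i)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        z = complex(x[2 * i], x[2 * i + 1])
        z *= cmath.exp(1j * position * base ** (-2.0 * i / d))
        out += [z.real, z.imag]
    return out

# With d = 2 the only frequency is theta_0 = 1, so position 1 rotates by 1 radian.
print(rope_complex([1.0, 0.0], 1))
```

Tensor frameworks with complex dtypes can apply the same idea to whole batches by viewing the last dimension as complex numbers and multiplying by a precomputed table of unit-modulus values.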
A practical detail worth noting: there are two equivalent encodings for the pairs. The original GPT-J style interleaves consecutive even and odd dimensions, while the GPT-NeoX and LLaMA style splits the head dimension into two halves and uses a rotate_half helper. Both are mathematically equivalent up to a permutation of dimensions, but checkpoints converted between formats need a one-time reordering of the corresponding weights.
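The equivalence up to a permutation can be checked numerically. The sketch below is plain Python with hypothetical function names; it reorders a single vector, whereas converting a checkpoint applies the same reordering to the output rows of the query and key projections.

```python
import math

def rope_interleaved(x, m, base=10000.0):
    """Pairs are (x[2i], x[2i+1])."""
    d, out = len(x), list(x)
    for i in range(d // 2):
        a = m * base ** (-2.0 * i / d)
        c, s = math.cos(a), math.sin(a)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def rope_half_split(x, m, base=10000.0):
    """Pairs are (x[i], x[i + d/2])."""
    d, out = len(x), list(x)
    h = d // 2
    for i in range(h):
        a = m * base ** (-2.0 * i / d)
        c, s = math.cos(a), math.sin(a)
        out[i] = x[i] * c - x[i + h] * s
        out[i + h] = x[i] * s + x[i + h] * c
    return out

def interleaved_to_half(x):
    """Reorder [x0, x1, x2, x3, ...] into [x0, x2, ..., x1, x3, ...]."""
    return x[0::2] + x[1::2]

x = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
left = rope_half_split(interleaved_to_half(x), 9)
right = interleaved_to_half(rope_interleaved(x, 9))
print(left, right)
```

Because queries and keys are permuted the same way, their dot product is unchanged, so attention scores are identical under either layout.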
RoPE is applied to queries and keys only, not to values, and is applied independently at each attention layer rather than just at the input. This means the position information is refreshed at every layer, which helps maintain the position signal throughout the network depth. This contrasts with additive methods (sinusoidal, learned embeddings) that apply the position signal only at the input and rely on the network to propagate it through subsequent layers.
For researchers building from scratch, the cleanest reference implementations live in the EleutherAI GPT-NeoX codebase and in Su Jianlin's original RoFormer release at ZhuiyiTechnology/roformer on GitHub.
RoPE can be viewed as a way to inject the relative-position term R(theta, n - m) into the attention dot product without adding it as a separate scalar. T5's relative position bias adds a learned scalar b(n - m) directly to the pre-softmax logit; RoPE multiplies the query and key vectors by complementary rotation matrices so the same relative-position structure emerges from the dot product itself. The two approaches share a goal but differ in mechanism: T5's bias is additive and learned per relative-distance bucket, while RoPE's rotation is multiplicative and parameter-free.
This shared spirit explains some of RoPE's observed properties. The decay of the dot product with relative distance, for example, is qualitatively similar to ALiBi's hand-designed linear penalty, even though RoPE arrives at it through phase mismatch in high-frequency dimensions rather than an explicit subtraction.
Several extensions to the basic RoPE formulation have been explored:
2D and 3D RoPE extends the rotation approach to two-dimensional and three-dimensional position grids, which is useful for vision transformers and other models that operate on spatial data. Instead of a single position index, each token has (x, y) or (x, y, z) coordinates, and the rotation angles are defined over these multi-dimensional positions.
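One simple way to build a 2D variant is the axial construction: split the vector in half, rotate the first half by the row index and the second half by the column index. The sketch below assumes that construction for illustration only; other multi-dimensional schemes exist, and the function names are hypothetical.

```python
import math

def rope_1d(x, position, base=10000.0):
    """Standard RoPE on consecutive pairs of x."""
    d = len(x)
    out = []
    for i in range(d // 2):
        a = position * base ** (-2.0 * i / d)
        c, s = math.cos(a), math.sin(a)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def rope_2d(x, row, col, base=10000.0):
    """Axial 2D RoPE: first half encodes the row, second half the column."""
    assert len(x) % 4 == 0, "need an even number of dimensions per axis"
    half = len(x) // 2
    return rope_1d(x[:half], row, base) + rope_1d(x[half:], col, base)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1, 0.2, -0.4]
k = [1.0, 0.1, -0.6, 0.9, 0.4, -0.3, 0.7, 0.2]

# Same (row, col) offset of (3, 4) at two different grid locations.
score_a = dot(rope_2d(q, 1, 2), rope_2d(k, 4, 6))
score_b = dot(rope_2d(q, 11, 22), rope_2d(k, 14, 26))
print(score_a, score_b)
```

The score depends only on the row and column offsets between the two tokens, which is the 2D analogue of the relative-position property.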
Dynamic NTK adjusts the scaling factor at inference time based on the actual sequence length, rather than using a fixed scaling factor. If the input sequence is shorter than the training length, no scaling is applied; scaling kicks in only when the sequence exceeds the training length.
LongRoPE (Ding et al., 2024) further refines dimension-wise scaling by searching for optimal scaling factors for each dimension independently, achieving effective context lengths of up to 2 million tokens with minimal fine-tuning.
Decoupled RoPE (DeepSeek-V2) applies the rotation to a separate small portion of query and key vectors, while the larger compressed portion is handled by Multi-head Latent Attention. This resolves a conflict between RoPE and low-rank KV compression.
iRoPE (Llama 4) interleaves standard RoPE attention layers with NoPE (no positional encoding) layers in roughly a 3:1 ratio, drawing on findings that pure-NoPE layers can generalize better at extreme context lengths.
While RoPE has proven highly effective, it is not without limitations.
The most cited issue is poor length generalization. Without modification, a RoPE model trained on 4k tokens degrades quickly past 5k or 6k tokens. This is the central motivation for PI, NTK, YaRN, and LongRoPE. Even with these techniques, attention over very long contexts often shows weaker recall than a model trained natively at the longer length.
A second issue is the asymmetry between high- and low-frequency dimensions. The high-frequency dimensions saturate after a small number of token positions and contribute little extra information; the low-frequency dimensions carry the long-range signal but are sparsely sampled, so the effective resolution of long-range positional information is lower than the head dimension would suggest. Recent work has even argued that some high-frequency dimensions become wasteful in long-context retrieval.
A third limitation is the oscillating nature of the distance decay. Unlike ALiBi, which provides a monotonic decrease in attention bias with distance, RoPE's distance sensitivity oscillates due to the periodic nature of rotation. Very distant tokens can occasionally receive higher attention scores than moderately distant tokens, though in practice the model learns to handle this.
A fourth issue is that RoPE is not always optimal. ALiBi sometimes wins on raw extrapolation in language modeling perplexity, and Kazemnejad et al. (2023) showed that NoPE can outperform RoPE on certain reasoning tasks that require generalizing to longer sequences than seen during training. Llama 4's iRoPE design takes this finding seriously by deliberately interleaving NoPE layers with RoPE layers.
A final practical limitation is that RoPE is only applied to queries and keys, not values. Some research has investigated whether extending a similar rotation to values, or using completely different per-frequency treatments, can yield gains, but no such variant has displaced standard RoPE in mainstream LLM training. The interaction of RoPE with alternative attention mechanisms (such as state-space models) is also less well understood than its behavior in standard multi-head attention.
When extending a pretrained RoPE model to a longer context window, the choice between PI, NTK-aware, YaRN, and LongRoPE depends on the target length and the available compute.
In all cases, evaluating on a needle-in-a-haystack or passkey retrieval benchmark at the target length is essential, since perplexity alone can hide failures of long-range attention.