# Rotary Position Embedding

> Source: https://aiwiki.ai/wiki/rotary_position_embedding
> Updated: 2026-07-11
> Categories: Deep Learning, Large Language Models, Model Architecture, Transformer Models
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**Rotary Position Embedding** (RoPE) is a [positional encoding](/wiki/positional_encoding) method for [transformer](/wiki/transformer) models that encodes a token's absolute position by rotating its query and key vectors in two-dimensional subspaces, so that the attention dot product depends only on the relative distance between tokens. Introduced by Jianlin Su and colleagues at Zhuiyi Technology in the 2021 paper "RoFormer: Enhanced Transformer with Rotary Position Embedding" (submitted to arXiv on 20 April 2021 and later published in Neurocomputing, Volume 568, 2024), RoPE has become the dominant positional encoding scheme in modern [large language models](/wiki/large_language_model).[1] In the authors' words, "the proposed RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation."[1] This gives the model relative position information without requiring explicit relative position embeddings or any additional learnable parameters.[1]

The RoFormer abstract names three properties that the rotation scheme provides: "the flexibility of sequence length, decaying inter-token dependency with increasing relative distances, and the capability of equipping the linear self-attention with relative position encoding."[1] The underlying idea first appeared in a series of Chinese-language blog posts on Su's site Kexue.fm in late 2020 and early 2021 before being formalized in the RoFormer paper.[2] EleutherAI later picked up the technique, implemented it in [GPT-J](/wiki/gpt) and GPT-NeoX, and wrote an influential English-language explainer that described RoPE as "a new type of position encoding that unifies absolute and relative approaches" and that "either matches or surpasses all other methods currently available for injecting positional information into transformers."[19]

RoPE is used in virtually all major open-source LLMs released since 2023, including [LLaMA](/wiki/llama), [LLaMA 2](/wiki/llama_2), [LLaMA 3](/wiki/llama_3), [Llama 4](/wiki/llama_4), [Mistral](/wiki/mistral), Mixtral, [Qwen](/wiki/qwen), [Gemma](/wiki/gemma), [DeepSeek](/wiki/deepseek), Yi, Phi, and [Falcon](/wiki/falcon) 2.[9][11] Google's [PaLM](/wiki/palm) also adopted RoPE.[15] The method's combination of simplicity, parameter-free design, and compatibility with efficient inference (particularly KV caching) has made it the de facto standard.[1]

## Background and motivation

Before RoPE, transformer models typically used either sinusoidal positional encodings (Vaswani et al., 2017) or learned positional embeddings ([BERT](/wiki/bert), [GPT-2](/wiki/gpt-2)).[3] Both are absolute positional methods: they assign a fixed vector to each position in the sequence, which is added to the token embedding. While these approaches work, they have notable limitations.

Absolute positional encodings do not directly encode the distance between tokens. If a model needs to learn that "a verb typically follows its subject by one or two positions," it must extract this relative information indirectly from absolute position vectors. Learned embeddings also impose a hard maximum sequence length, since there is no embedding for positions beyond the training maximum.

Relative positional methods like those in Shaw et al. (2018), Transformer-XL (Dai et al., 2019), and [T5](/wiki/t5) (Raffel et al., 2020) address the relative position problem by adding bias terms that depend on the distance between query and key positions.[4][12][13] ALiBi (Press et al., 2022) takes a similar route with linear attention biases that decay with distance.[5] However, these methods typically require additional parameters or modify the attention computation in ways that can complicate efficient implementation.

Su et al. sought a method that would encode absolute position (so each token knows where it is), naturally produce relative position information in the attention scores (so the model knows how far apart tokens are), require no additional parameters, and work efficiently with standard transformer implementations.[1] RoPE belongs to a different family from earlier methods: it does not add anything to the residual stream. Instead, it transforms the query and key vectors right before the attention dot product, so the dot product itself becomes position-aware.[1]

## How does RoPE work?

The central idea of RoPE is to encode position by rotating the query and key vectors.[1] A rotation in two dimensions is a well-understood geometric operation: you take a 2D vector and spin it by some angle. The rotation preserves the vector's magnitude while changing its direction.

RoPE exploits a specific property of rotations: when you compute the dot product between two rotated vectors, the result depends only on the angle difference between the two rotations, not on the individual rotation angles.[1] If vector A is rotated by angle alpha and vector B is rotated by angle beta, their dot product depends on (alpha - beta), not on alpha and beta individually. Since [self-attention](/wiki/self_attention) is fundamentally a dot product between query and key vectors, this property means that rotating queries and keys by position-dependent angles automatically injects relative position information into the attention scores.[1]

### Step-by-step mechanism

1. **Pair up dimensions.** Given a query or key vector of dimension $$d$$, RoPE groups the dimensions into $$d/2$$ consecutive pairs: $$(x_0, x_1), (x_2, x_3), (x_4, x_5)$$, and so on (indices are 0-based, so pair $$i$$ is $$(x_{2i}, x_{2i+1})$$). Each pair defines a 2D subspace.

2. **Assign rotation frequencies.** Each pair $$i$$ (where $$i$$ ranges from 0 to $$d/2 - 1$$) is assigned a base frequency: $$\theta_i = 10000^{-2i/d}$$. This follows the same frequency schedule as the original sinusoidal positional encoding.[3] Lower-indexed pairs rotate at higher frequencies (capturing fine-grained position differences), while higher-indexed pairs rotate at lower frequencies (capturing coarser position information).

3. **Compute rotation angles.** For a token at position $$m$$, the rotation angle for pair $$i$$ is simply $$m \theta_i$$. Tokens at later positions get rotated by larger angles.

4. **Apply 2D rotation to each pair.** For each dimension pair $$(x_{2i}, x_{2i+1})$$, the rotation is applied as:
   - $$x'_{2i} = x_{2i} \cos(m \theta_i) - x_{2i+1} \sin(m \theta_i)$$
   - $$x'_{2i+1} = x_{2i} \sin(m \theta_i) + x_{2i+1} \cos(m \theta_i)$$

   This is the standard 2D rotation matrix applied to each pair independently.

5. **Compute attention as usual.** After rotation, the query and key vectors are used in the standard dot-product attention computation. No other changes to the attention mechanism are needed.
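The five steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `rope_frequencies` and `apply_rope` are hypothetical, indices are 0-based to match $$x_{2i}, x_{2i+1}$$, and a single unbatched vector stands in for one query or key head.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Step 2: theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, m, base=10000.0):
    """Steps 1, 3 and 4: rotate each pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    out = [0.0] * d
    for i, theta in enumerate(rope_frequencies(d, base)):
        angle = m * theta                      # step 3: angle grows with position
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]        # step 1: one 2D subspace
        out[2 * i] = x0 * c - x1 * s           # step 4: 2D rotation
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Step 5: rotate the query and key, then take the usual dot product.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.2, 0.4, -0.1, 0.3]
logit = dot(apply_rope(q, m=3), apply_rope(k, m=1))
print(logit)
```

Because each pair is only rotated, the vector's length is unchanged; only its direction carries the position.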

### Mathematical formulation

More formally, let $$x_m$$ be the embedding of the token at position $$m$$, and let $$W_q$$ and $$W_k$$ be the query and key projection matrices. The query and key vectors are:

- $$q_m = R(\theta, m) W_q x_m$$
- $$k_n = R(\theta, n) W_k x_n$$

where $$R(\theta, m)$$ is a block-diagonal rotation matrix. Each $$2 \times 2$$ block along the diagonal is:

$$
\begin{bmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{bmatrix}
$$

The attention score between positions $$m$$ and $$n$$ is:

- $$\text{score}(m, n) = q_m^\top k_n = (W_q x_m)^\top R(\theta, m)^\top R(\theta, n) (W_k x_n)$$

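Because rotation matrices are orthogonal, $$R(\theta, m)^\top = R(\theta, -m)$$, and because rotations in the same plane compose by adding their angles, $$R(\theta, m)^\top R(\theta, n) = R(\theta, n - m)$$. The score therefore depends on the relative position $$(n - m)$$ rather than on $$m$$ and $$n$$ individually: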

- $$\text{score}(m, n) = (W_q x_m)^\top R(\theta, n - m) (W_k x_n)$$

This is the core mathematical property that makes RoPE work: absolute rotations produce relative position information in the dot product.[1]
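The identity can be checked numerically. The sketch below (with hypothetical helpers `rotate` and `score`, and arbitrary example vectors standing in for $$W_q x_m$$ and $$W_k x_n$$) shifts both positions by the same offset and compares the resulting scores.

```python
import math

def rotate(x, m, base=10000.0):
    """Apply R(theta, m) to x, pair by pair."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def score(q, k, m, n):
    """Dot product of a query rotated for position m and a key rotated for position n."""
    return sum(a * b for a, b in zip(rotate(q, m), rotate(k, n)))

q = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9]
k = [1.1, 0.2, -0.6, 0.8, 0.5, -0.3]

same_gap_near = score(q, k, m=2, n=7)       # n - m = 5
same_gap_far = score(q, k, m=102, n=107)    # n - m = 5, both shifted by 100
other_gap = score(q, k, m=2, n=9)           # n - m = 7

print(same_gap_near, same_gap_far, other_gap)
```

The first two values agree up to floating-point error, while the third differs, which is what the formula predicts.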

## What are the properties and advantages of RoPE?

RoPE has several properties that contributed to its widespread adoption:

### Relative position through absolute encoding

RoPE is sometimes described as a "bridge" between absolute and relative positional encoding.[19] Each token is encoded with its absolute position (via the rotation angle), but the attention mechanism naturally computes relative positions (via the angle difference in the dot product). This gives the model both types of information without requiring separate mechanisms.

### No additional parameters

RoPE adds zero learnable parameters to the model.[1] The rotation angles are deterministic functions of position and dimension index. This means RoPE has exactly the same parameter count as a model with no positional encoding at all, and the same count as the original sinusoidal encoding. Fused into the attention kernel, RoPE costs roughly 1 to 3 percent extra runtime.

### Decaying distance sensitivity

Su et al. showed that the inner product between RoPE-encoded vectors exhibits a natural decay as the relative distance increases.[1] Tokens that are close together tend to have higher attention scores (all else being equal) than tokens far apart. This aligns with the empirical observation that local context is generally more relevant than distant context in natural language. The authors described this property in the abstract as "decaying inter-token dependency with increasing relative distances," and framed it as a soft inductive bias toward locality that is consistent with the empirical behavior of language.[1]
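A rough way to see the trend is to fix the content and vary only the distance. The sketch below assumes the simplest possible content, all-ones query and key vectors, for which each pair contributes $$2\cos(\Delta \theta_i)$$ at relative distance $$\Delta$$. It illustrates the tendency only and is not the bound derived in the paper; the function name and the choice $$d = 128$$ are illustrative.

```python
import math

def rope_score_all_ones(distance, d=128, base=10000.0):
    """q^T R(theta, distance) k for q = k = all-ones vectors of dimension d.

    For one pair, (1, 1) . R(phi) (1, 1) = 2 cos(phi), so the total is a sum of cosines.
    """
    return sum(2.0 * math.cos(distance * base ** (-2.0 * i / d))
               for i in range(d // 2))

for dist in (0, 1, 4, 16, 64, 256, 1024):
    print(dist, round(rope_score_all_ones(dist), 2))
```

The score is largest at distance 0, where every pair is perfectly aligned, and shrinks as more pairs drift out of phase. The decline is not strictly monotonic because each term is periodic.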

### Compatibility with KV caching and linear attention

During autoregressive generation, transformers cache the key and value vectors from previous tokens to avoid recomputing them. Because RoPE applies the rotation to query and key vectors before they enter the attention computation, the rotation can be applied once when a token is first processed, and the rotated key can be cached directly. This is straightforward to implement and does not interfere with standard KV caching strategies, which is an important practical consideration for inference efficiency.
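A toy sketch of this caching pattern follows. `RotatedKeyCache` is a hypothetical class written for illustration: it omits values, batching, multiple heads, the $$1/\sqrt{d}$$ scaling and the softmax, and it assumes one token is appended per step.

```python
import math

def rotate(x, m, base=10000.0):
    """Rotate x as a token at position m."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

class RotatedKeyCache:
    """Stores each key already rotated for the position at which it was appended."""

    def __init__(self):
        self.keys = []

    def append(self, k):
        position = len(self.keys)
        self.keys.append(rotate(k, position))   # rotated once, never again

    def logits(self, q):
        """Unscaled scores of a query at the newest position against all cached keys."""
        q_rot = rotate(q, len(self.keys) - 1)
        return [sum(a * b for a, b in zip(q_rot, k)) for k in self.keys]

raw_keys = [[1.0, 0.0, 0.5, 0.5], [0.2, -0.3, 0.9, 0.1], [0.0, 1.0, -0.4, 0.6]]
cache = RotatedKeyCache()
for k in raw_keys:
    cache.append(k)

q = [0.7, 0.1, -0.2, 0.3]
print(cache.logits(q))
```

Cached keys never need to be re-rotated when new tokens arrive, because a key's rotation depends only on its own position.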

A further advantage is compatibility with linear attention. Because RoPE multiplies queries and keys by orthogonal rotation matrices and does not introduce a softmax-dependent bias, it can be combined with kernelized or linear attention variants (such as Performer-style attention) where additive relative biases are difficult to apply.[1] The RoFormer authors specifically highlighted "the capability of equipping the linear self-attention with relative position encoding" as a distinguishing property.[1]

### Natural extension to longer sequences

Because RoPE's rotation angles are continuous functions of position, they are defined for any positive integer position. A model trained with a maximum sequence length of 4,096 will still produce well-defined rotations for position 8,192. Whether the model actually performs well at those extended positions depends on other factors, but the positional encoding itself does not have a hard boundary. This property has made RoPE the foundation for several context-length extension techniques.

## How does RoPE differ from other positional encoding methods?

| Property | Sinusoidal | Learned embeddings | ALiBi | RoPE | NoPE |
|---|---|---|---|---|---|
| Type | Absolute | Absolute | Relative (bias) | Relative (via rotation) | None |
| Additional parameters | None | $$O(L \cdot d)$$ | None (fixed slopes per head) | None | None |
| Where applied | Input layer (additive) | Input layer (additive) | Attention logits (additive bias) | Query/key at each layer (multiplicative) | Not applied |
| Relative position signal | Indirect | No | Direct (linear penalty) | Direct (rotation difference) | Implicit via causal mask |
| Length extrapolation | Limited | None (hard max) | Strong | Moderate (strong with PI/NTK/YaRN scaling) | Surprisingly strong on some reasoning tasks |
| Adopted by | Original Transformer | [BERT](/wiki/bert), [GPT-2](/wiki/gpt-2) | BLOOM, MPT | [LLaMA](/wiki/llama) family, [PaLM](/wiki/palm), [Mistral](/wiki/mistral), Mixtral, [Qwen](/wiki/qwen), [DeepSeek](/wiki/deepseek), GPT-J, GPT-NeoX, Phi-3 | Some research models, partial use in Llama 4 (iRoPE) |
| Distance decay | No built-in decay | No built-in decay | Strong (linear) | Moderate (oscillating) | None |

Kazemnejad et al. (2023), in *The Impact of Positional Encoding on Length Generalization in Transformers*, found that decoder-only transformers trained without any positional encoding (NoPE) often generalize to longer sequences better than RoPE, ALiBi, or learned absolute encodings on certain reasoning tasks.[16] The result is one reason researchers continue to question whether RoPE is universally optimal even though it dominates in production systems.

## Which LLMs use RoPE?

The widespread adoption of RoPE can be traced to Meta's release of [LLaMA](/wiki/llama) in February 2023.[9] LLaMA's architecture choices, including RoPE, became a template that nearly every subsequent open-source LLM followed. The timeline of adoption includes:

| Model | Organization | Release | RoPE variant |
|---|---|---|---|
| GPT-J | EleutherAI | June 2021 | Partial RoPE on 25% of head dimensions |
| GPT-NeoX | EleutherAI | February 2022 | RoPE on full head dimension |
| PaLM | Google | April 2022 | Standard RoPE |
| LLaMA | Meta | February 2023 | Standard RoPE |
| LLaMA 2 | Meta | July 2023 | Standard RoPE |
| Code Llama | Meta | August 2023 | RoPE with NTK scaling |
| Mistral 7B | Mistral AI | September 2023 | Standard RoPE with sliding-window attention |
| Qwen | Alibaba | September 2023 | Standard RoPE with NTK-aware scaling |
| Yi | 01.AI | November 2023 | Standard RoPE |
| Phi-2 | Microsoft | December 2023 | Standard RoPE |
| Mixtral 8x7B / 8x22B | Mistral AI | December 2023 / April 2024 | Standard RoPE |
| Gemma | Google | February 2024 | Standard RoPE |
| Mistral Large | Mistral AI | February 2024 | Standard RoPE |
| LLaMA 3 | Meta | April 2024 | RoPE with extended theta (500,000) |
| Phi-3 | Microsoft | April 2024 | RoPE with Su-scaled or YaRN-style scaling for 128k variant |
| DeepSeek-V2 | DeepSeek | May 2024 | Decoupled RoPE with Multi-head Latent Attention |
| Qwen 2.5 | Alibaba | September 2024 | RoPE with YaRN scaling |
| Llama 4 | Meta | 2025 | iRoPE (interleaved RoPE and NoPE layers) |
| Qwen 3 | Alibaba | 2025 | RoPE |
| DeepSeek-V3 | DeepSeek | 2024 to 2025 | Decoupled RoPE with MLA |

The infrastructure ecosystem followed this adoption wave. [FlashAttention](/wiki/flash_attention), [vLLM](/wiki/vllm), and [Hugging Face](/wiki/hugging_face) Transformers all provide optimized RoPE implementations.[21] Hugging Face's `rope_scaling` configuration parameter allows users to specify different scaling strategies (linear, dynamic, YaRN) directly in model configuration files.[21]

A few notable variants deserve mention. GPT-J applies RoPE to only 25 percent of head dimensions (64 out of 256), leaving the remaining dimensions untouched, while GPT-NeoX applies RoPE to the full head dimension.[19] DeepSeek-V2 introduced *Decoupled RoPE*, in which the rotation is applied to a separate small portion of the key and query vectors so the rest can be compressed by [Multi-head Latent Attention](/wiki/multi-head_latent_attention).[17] Llama 4 introduces *iRoPE*, an interleaved scheme that alternates standard RoPE attention layers with NoPE layers in roughly a 3:1 ratio to support very long contexts.

## The base parameter and frequency spectrum

The base value 10,000 in $$\theta_i = 10000^{-2i/d}$$ is a hyperparameter, not a fundamental constant. Su et al. (2021) inherited it from the sinusoidal encoding for direct continuity with the Transformer's original encoding scheme.[1][3] Larger bases produce slower rotations across all dimensions, which means the same numerical position $$m$$ corresponds to a smaller phase angle. This effectively spreads positional information over a longer range and is the foundation of NTK-aware context extension.

The d/2 frequencies span a wide spectrum. The first few pairs rotate quickly and complete many full revolutions even within a short context window; they encode fine-grained, high-frequency positional information. The last few pairs rotate very slowly and barely complete a full revolution within the entire training context; they encode coarse, low-frequency, long-range positional information.

This two-end behavior matters for context extension. High-frequency dimensions saturate quickly and add little useful signal at long range; low-frequency dimensions carry the long-range signal but are sparsely sampled. Several context-extension techniques exploit this asymmetry by treating different frequencies differently.
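The spread of the spectrum is easier to see as wavelengths, meaning the number of positions a pair needs for one full revolution, $$2\pi / \theta_i$$. The sketch below assumes a head dimension of 128 and a 4,096-token context purely for illustration; `rope_wavelengths` is a hypothetical helper.

```python
import math

def rope_wavelengths(d, base=10000.0):
    """Tokens per full revolution for each pair i: 2*pi / theta_i."""
    return [2.0 * math.pi / (base ** (-2.0 * i / d)) for i in range(d // 2)]

d, context = 128, 4096
wavelengths = rope_wavelengths(d)

# Pairs whose wavelength exceeds the context never complete a revolution within it.
slow_pairs = [i for i, w in enumerate(wavelengths) if w > context]

print(f"fastest pair: {wavelengths[0]:.2f} tokens per revolution")
print(f"slowest pair: {wavelengths[-1]:.0f} tokens per revolution")
print(f"{len(slow_pairs)} of {d // 2} pairs do not complete a revolution in {context} tokens")
```

Raising `base` lengthens every wavelength except that of pair 0, which stays at $$2\pi$$.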

## How is RoPE used to extend context length?

One of RoPE's most active areas of development is extending the context length of pre-trained models beyond their original training length. Vanilla RoPE generalizes poorly when a model trained on, for example, 2,048 tokens is asked to attend over 8,192 tokens. The attention scores at the new positions involve rotation angles the model never saw during training, and perplexity rises sharply. Research starting in mid-2023 produced a sequence of techniques to extend the effective context window of pretrained RoPE models with little or no fine-tuning.

### Position interpolation (PI)

Chen et al. (2023) proposed position interpolation, a simple technique: instead of extrapolating RoPE to positions beyond the training range, interpolate by dividing all position indices by a scaling factor $$s$$.[6] If a model was trained with max length 4,096 and you want to extend to 32,768, set $$s = 8$$ and divide all positions by 8. This maps the extended range [0, 32,768] back into [0, 4,096], where the model has seen positions during training. The authors reported that this "extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps)."[6] They argued that "the upper bound of interpolation is at least ~600x smaller than that of extrapolation," which is the main reason it works so well with so little training.[6]
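In code, position interpolation amounts to a single division before the angles are computed. The sketch below uses a hypothetical helper `rope_angles` with a `scale` argument and an illustrative head dimension of 64; it checks that the last position of the extended window lands on the same angles as the last position of the original window.

```python
def rope_angles(position, d, base=10000.0, scale=1.0):
    """Rotation angles m * theta_i for one position.

    scale = 1 is vanilla RoPE; scale = s > 1 divides the position by s,
    which is position interpolation.
    """
    return [(position / scale) * base ** (-2.0 * i / d) for i in range(d // 2)]

train_len, target_len = 4096, 32768
s = target_len / train_len                      # extension factor

extended = rope_angles(target_len, d=64, scale=s)
original = rope_angles(train_len, d=64)

print(s, extended[:3], original[:3])
```

The cost of this mapping is resolution: adjacent tokens are now separated by $$1/s$$ of a position rather than a whole one, which is why a short fine-tuning run is still used.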

### NTK-aware scaling

NTK-aware (Neural Tangent Kernel-aware) interpolation was developed through community experimentation in the r/LocalLLaMA Reddit community in mid-2023, originating in a post by a user named bloc97.[7] The insight is that position interpolation compresses all frequency dimensions by the same factor, which is suboptimal. High-frequency dimensions (which capture fine-grained position differences) lose resolution when compressed, even though their short wavelengths mean the model has already seen their full range of angles during training. Low-frequency dimensions (which capture coarse position information) are the ones that reach unseen angles at longer contexts, so they are where interpolation is actually needed.

NTK-aware scaling addresses this by modifying the base frequency parameter. Instead of scaling position indices, it scales the base value (10,000) in the frequency formula: the new base is $$b' = b \cdot s^{d / (d - 2)}$$, where $$s$$ is the desired length-extension factor.[7] This stretches the low-frequency components more than the high-frequency ones, spreading the interpolation pressure across dimensions. The approach can be applied without any fine-tuning at all, achieving 2x to 4x extension zero-shot, though fine-tuning improves quality.[7]
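The effect of the base change on each frequency can be checked directly. In the sketch below (hypothetical helper names, with $$d = 128$$ and $$s = 4$$ chosen for illustration), the ratio of new to old frequency is computed for every pair.

```python
def ntk_scaled_base(base, s, d):
    """New base b' = b * s^(d / (d - 2))."""
    return base * s ** (d / (d - 2))

def frequencies(d, base):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d, s = 128, 4.0
old = frequencies(d, 10000.0)
new = frequencies(d, ntk_scaled_base(10000.0, s, d))

# ratio 1.0 means the pair is left untouched; ratio 1/s means it is fully interpolated
ratios = [n / o for n, o in zip(new, old)]
print(ratios[0], ratios[len(ratios) // 2], ratios[-1])
```

The exponent $$d/(d-2)$$ is what makes the slowest pair come out at exactly $$1/s$$, matching position interpolation for that pair, while the fastest pair keeps its original frequency.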

NTK-aware scaling was quickly adopted by popular inference frameworks including [llama.cpp](/wiki/llama_cpp) and text-generation-webui, and was incorporated into several Hugging Face model implementations.[21] A related variant called *Dynamic NTK*, also developed in the EleutherAI community, applies NTK-aware scaling adaptively at inference time based on the current sequence length, so the model uses no scaling at short lengths (preserving its trained behavior) and smoothly increases scaling as the sequence grows.[20]
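One simple way to realize the dynamic behavior is to recompute the extension factor from the current sequence length. The rule $$s = \max(1, \ell / L)$$ used below, where $$\ell$$ is the current length and $$L$$ the training length, is an assumption made for this sketch; library implementations may use a different schedule, and `dynamic_ntk_base` is a hypothetical name.

```python
def dynamic_ntk_base(seq_len, train_len, d, base=10000.0):
    """Pick the RoPE base for the current sequence length.

    Assumed schedule: s = max(1, seq_len / train_len), then the
    NTK-aware base b' = b * s^(d / (d - 2)).
    """
    s = max(1.0, seq_len / train_len)
    return base * s ** (d / (d - 2))

for seq_len in (1024, 4096, 8192, 16384):
    print(seq_len, round(dynamic_ntk_base(seq_len, train_len=4096, d=128)))
```

Within the training length the base is unchanged, so short prompts see exactly the rotations the model was trained with.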

### YaRN (Yet another RoPE extensioN)

Peng et al. (2023) introduced YaRN, the first formal academic paper that rigorously analyzed and improved upon the community NTK-aware scaling methods.[8] The authors are Bowen Peng and Jeffrey Quesnelle (Nous Research), Honglu Fan (EleutherAI and University of Geneva), and Enrico Shippole.[8] YaRN's key innovation is recognizing that different frequency ranges require fundamentally different scaling strategies. It addresses this with two mechanisms:

1. **NTK-by-parts (ramp function for dimension-wise interpolation).** A ramp function decides whether each dimension is fully interpolated, not interpolated, or partially interpolated, depending on its wavelength relative to the original context length. For LLaMA-family models the authors used parameters $$\alpha = 1$$, $$\beta = 32$$. Low-frequency dimensions receive more interpolation, while high-frequency dimensions receive less.[8]

2. **[Temperature](/wiki/temperature_sampling) scaling for attention logits.** Extending context length causes a distributional shift in attention scores because longer sequences spread probability mass over more tokens. YaRN applies a temperature factor $$t$$ to attention logits to compensate for this shift, with $$\sqrt{1/t} = 0.1 \cdot \ln(s) + 1$$, where $$s$$ is the extension factor.[8]
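Both mechanisms can be sketched compactly. The version below is an illustrative reading rather than the authors' code. It assumes the ramp is driven by the ratio $$r$$ of the original context length to each pair's wavelength, that pairs with $$r < \alpha$$ are fully interpolated (frequency divided by $$s$$), that pairs with $$r > \beta$$ are left unchanged, and that the two are blended linearly in between. All function names are hypothetical.

```python
import math

def yarn_ramp(r, alpha=1.0, beta=32.0):
    """0 = fully interpolate, 1 = do not interpolate, linear in between."""
    if r < alpha:
        return 0.0
    if r > beta:
        return 1.0
    return (r - alpha) / (beta - alpha)

def yarn_frequencies(d, s, train_len, base=10000.0, alpha=1.0, beta=32.0):
    """Per-pair frequencies after NTK-by-parts scaling (assumed blending rule)."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        wavelength = 2.0 * math.pi / theta
        gamma = yarn_ramp(train_len / wavelength, alpha, beta)
        out.append((1.0 - gamma) * theta / s + gamma * theta)
    return out

def yarn_logit_scale(s):
    """Returns sqrt(1/t) = 0.1 * ln(s) + 1 for extension factor s."""
    return 0.1 * math.log(s) + 1.0

freqs = yarn_frequencies(d=128, s=16.0, train_len=4096)
print(freqs[0], freqs[-1], yarn_logit_scale(16.0))
```

With these settings the fastest pair keeps its original frequency and the slowest pair is divided by $$s$$, which is the per-dimension behavior described in the first mechanism.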

The YaRN authors reported that the method reaches the same extended-context quality while "requiring 10x less tokens and 2.5x less training steps than previous methods," and that it works with fine-tuning on less than 0.1% of the original pre-training data.[8] It enables models to extrapolate beyond even the fine-tuning context length. LLaMA models extended with YaRN have been shown to handle context lengths of 64K and 128K tokens effectively, and YaRN became the standard recipe for context extension in many open-source LLM releases through 2024.[8]

### LongRoPE

Ding et al. (2024) at Microsoft introduced *LongRoPE*, which uses an evolutionary search to find a non-uniform, per-frequency rescaling that further reduces the gap between original and extended context performance.[14] LongRoPE extends pretrained LLMs to 2,048k tokens with up to 1,000 fine-tuning steps at 256k length, while preserving short-context quality.[14] The method underpins the 128k-context variants of Phi-3.[18]

### Comparison of RoPE scaling methods

| Method | Date and source | Core idea | Fine-tuning needed | Typical extension |
|---|---|---|---|---|
| Position Interpolation (PI) | Chen et al., June 2023 | Linearly downscale positions by factor $$s = L'/L$$ so all positions fall in trained range | About 1,000 steps | 2x to 16x |
| NTK-aware scaling | bloc97 on Reddit r/LocalLLaMA, June 2023 | Increase base theta instead of uniformly scaling positions; preserves high-frequency components | None (zero-shot) or minimal | 2x to 4x with no fine-tuning |
| Dynamic NTK | Reddit / EleutherAI work, mid-2023 | Adjust base on the fly during inference based on current sequence length | None | Smooth degradation past trained length |
| YaRN | Peng et al., August 2023 | Combine NTK-by-parts with attention temperature scaling | About 10x fewer tokens than PI | 16x to 32x with brief fine-tuning |
| LongRoPE | Ding et al. (Microsoft), February 2024 | Search for non-uniform per-frequency scaling factors with evolutionary search | Up to 1,000 steps at 256k length | Up to 2,048k tokens (used in Phi-3-128k) |
| Extended theta | Standard pre-training | Increase base from 10,000 to 500,000 or more | Yes (full pre-training) | Long contexts trained from scratch |

Some models have adopted the simpler approach of just increasing the base theta value during pre-training. [LLaMA 3](/wiki/llama_3) used a base theta of 500,000 (compared to the standard 10,000), which inherently allows the model to handle longer sequences because the rotation frequencies are lower. This approach requires training with the extended theta from the start (or extensive continued pre-training) but avoids the need for post-hoc scaling.

## Implementation details

In practice, RoPE is implemented efficiently without explicitly constructing the full rotation matrices. The rotation of each 2D pair can be computed with simple element-wise multiplications and additions:

```
q_rotated[2i]   = q[2i] * cos(m * theta_i) - q[2i+1] * sin(m * theta_i)
q_rotated[2i+1] = q[2i] * sin(m * theta_i) + q[2i+1] * cos(m * theta_i)
```

The cosine and sine values for each position and dimension pair can be precomputed and cached as a lookup table, so the runtime cost of RoPE is just four element-wise multiplications and two additions per dimension pair (two multiplications and one addition per output element), per token, per layer. This is negligible compared to the cost of the attention computation itself.
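A minimal sketch of the lookup-table approach in plain Python follows. The names `build_rope_table` and `apply_rope_cached` are hypothetical, and real implementations would store the tables as tensors and apply them to whole batches at once.

```python
import math

def build_rope_table(max_len, d, base=10000.0):
    """Precompute cos(m * theta_i) and sin(m * theta_i) for all positions and pairs."""
    thetas = [base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [[math.cos(m * t) for t in thetas] for m in range(max_len)]
    sin = [[math.sin(m * t) for t in thetas] for m in range(max_len)]
    return cos, sin

def apply_rope_cached(x, m, cos, sin):
    """Rotate x for position m using table lookups only (no trig calls)."""
    out = []
    for i in range(len(x) // 2):
        c, s = cos[m][i], sin[m][i]
        out.append(x[2 * i] * c - x[2 * i + 1] * s)
        out.append(x[2 * i] * s + x[2 * i + 1] * c)
    return out

cos, sin = build_rope_table(max_len=16, d=8)    # built once, reused by every layer
x = [1.0, 2.0, -1.0, 0.5, 0.0, 3.0, 2.0, -2.0]
print(apply_rope_cached(x, 5, cos, sin))
```

The table depends only on the maximum length, the head dimension and the base, so it can be shared across layers and heads that use the same settings.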

A common compact form, used in modern codebases, expresses the rotation elementwise. Letting cos(m * theta) and sin(m * theta) be the length-d/2 vectors of cosines and sines, each repeated twice to match the head dimension d, the rotated query is:

```
q_rotated = q * cos(m * theta) + rotate_half(q) * sin(m * theta)
```

where `rotate_half` splits the vector into two halves and returns the negated second half followed by the first half. This is the form used in the `apply_rotary_pos_emb` function in the Hugging Face `transformers` library, in `src/transformers/models/llama/modeling_llama.py`.[21] The rotation table is built once per forward pass by `LlamaRotaryEmbedding`, which exposes hooks for PI, NTK-aware, dynamic NTK, YaRN, and LongRoPE scaling through configuration flags.[21]

An equivalent formulation, often used in implementations, restructures the computation using complex number multiplication. If each pair $$(x_{2i}, x_{2i+1})$$ is treated as a complex number $$x_{2i} + i x_{2i+1}$$, then the rotation is simply multiplication by the complex number $$\cos(m \theta_i) + i \sin(m \theta_i) = e^{i m \theta_i}$$.[1] This view makes the mathematical structure clearer and can be implemented efficiently using complex number support in frameworks like [PyTorch](/wiki/pytorch).
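The complex view translates directly into code. The sketch below uses Python's built-in `complex` type and `cmath` in place of a tensor library, and compares the result with the explicit sine and cosine form; both function names are hypothetical.

```python
import cmath
import math

def apply_rope_complex(x, m, base=10000.0):
    """Treat each pair as x[2i] + 1j * x[2i+1] and multiply by exp(1j * m * theta_i)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        z = complex(x[2 * i], x[2 * i + 1])
        z *= cmath.exp(1j * m * base ** (-2.0 * i / d))
        out += [z.real, z.imag]
    return out

def apply_rope_explicit(x, m, base=10000.0):
    """The same rotation written with cos and sin, for comparison."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

x = [1.0, 0.0, 0.0, 1.0]
print(apply_rope_complex(x, m=1))
print(apply_rope_explicit(x, m=1))
```

The two outputs agree up to floating-point rounding, since multiplying $$a + ib$$ by $$\cos\phi + i\sin\phi$$ yields exactly the real and imaginary parts given by the 2D rotation.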

A practical detail worth noting: there are two equivalent encodings for the pairs. The original GPT-J style interleaves consecutive even and odd dimensions, while the GPT-NeoX and LLaMA style splits the head dimension into two halves and uses a `rotate_half` helper.[21] Both are mathematically equivalent up to a permutation of dimensions, but checkpoints converted between formats need a one-time reordering of the corresponding weights.
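The equivalence of the two layouts can be demonstrated with a small sketch. The function names below are hypothetical, and the code works on a single vector rather than on projection weights, but the permutation `interleaved_to_half_split` is the same reordering that would be applied to the rows of the query and key projections when converting a checkpoint.

```python
import math

def rope_interleaved(x, m, base=10000.0):
    """Interleaved layout: pair i is (x[2i], x[2i+1])."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def rotate_half(x):
    """Negated second half followed by the first half."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rope_half_split(x, m, base=10000.0):
    """Half-split layout: pair i is (x[i], x[i + d/2]); uses x*cos + rotate_half(x)*sin."""
    d = len(x)
    angles = [m * base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(a) for a in angles] * 2     # repeated to length d
    sin = [math.sin(a) for a in angles] * 2
    rh = rotate_half(x)
    return [x[j] * cos[j] + rh[j] * sin[j] for j in range(d)]

def interleaved_to_half_split(x):
    """Reorder [x0, x1, x2, x3, ...] into [x0, x2, ..., x1, x3, ...]."""
    return x[0::2] + x[1::2]

x = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9, 0.2, 0.5]
a = interleaved_to_half_split(rope_interleaved(x, m=11))
b = rope_half_split(interleaved_to_half_split(x), m=11)
print(a)
print(b)
```

Rotating and then permuting gives the same result as permuting and then rotating in the half-split layout, so attention scores are unchanged as long as queries and keys use the same layout.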

RoPE is applied to queries and keys only, not to values, and is applied independently at each attention layer rather than just at the input.[1] This means the position information is refreshed at every layer, which helps maintain the position signal throughout the network depth. This contrasts with additive methods (sinusoidal, learned embeddings) that apply the position signal only at the input and rely on the network to propagate it through subsequent layers.

For researchers building from scratch, the cleanest reference implementations live in the EleutherAI GPT-NeoX codebase and in Su Jianlin's original RoFormer release at ZhuiyiTechnology/roformer on GitHub.[22]

## Connection to relative position bias

RoPE can be viewed as a way to inject the relative-position term $$R(\theta, n - m)$$ into the attention dot product without adding it as a separate scalar.[1] T5's relative position bias adds a learned scalar $$b(n - m)$$ directly to the pre-softmax logit; RoPE multiplies the query and key vectors by complementary rotation matrices so the same relative-position structure emerges from the dot product itself.[13] The two approaches share a goal but differ in mechanism: T5's bias is additive and learned per relative-distance bucket, while RoPE's rotation is multiplicative and parameter-free.

This shared spirit explains some of RoPE's observed properties. The decay of the dot product with relative distance, for example, is qualitatively similar to ALiBi's hand-designed linear penalty, even though RoPE arrives at it through phase mismatch in high-frequency dimensions rather than an explicit subtraction.[5]

## Extensions and variants

Several extensions to the basic RoPE formulation have been explored:

**2D and 3D RoPE** extends the rotation approach to two-dimensional and three-dimensional position grids, which is useful for [vision transformers](/wiki/vision_transformer) and other models that operate on spatial data. Instead of a single position index, each token has $$(x, y)$$ or $$(x, y, z)$$ coordinates, and the rotation angles are defined over these multi-dimensional positions.
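One common construction for the 2D case, assumed here for illustration, splits each vector into two halves and applies ordinary 1D RoPE to the first half using the $$x$$ coordinate and to the second half using the $$y$$ coordinate. The function names, the base of 10,000 and the requirement that the dimension be divisible by 4 are all choices made for this sketch, not details from the source.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Ordinary RoPE on one vector for a scalar position."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[2 * i] * c - vec[2 * i + 1] * s,
                vec[2 * i] * s + vec[2 * i + 1] * c]
    return out

def rope_2d(vec, x_pos, y_pos, base=10000.0):
    """Assumed axial scheme: first half encodes x, second half encodes y."""
    half = len(vec) // 2
    return rope_1d(vec[:half], x_pos, base) + rope_1d(vec[half:], y_pos, base)

def score_2d(q, q_xy, k, k_xy):
    """Dot product of a query and key rotated for their grid coordinates."""
    q_rot = rope_2d(q, *q_xy)
    k_rot = rope_2d(k, *k_xy)
    return sum(a * b for a, b in zip(q_rot, k_rot))

q = [0.3, -1.2, 0.7, 0.1, -0.4, 0.9, 0.2, 0.5]
k = [1.1, 0.2, -0.6, 0.8, 0.5, -0.3, 0.4, -0.9]

near = score_2d(q, (1, 2), k, (4, 6))        # offset (3, 4)
far = score_2d(q, (11, 22), k, (14, 26))     # same offset, shifted grid location
print(near, far)
```

As in the 1D case, the score depends only on the offset between the two grid positions, here $$(3, 4)$$ in both calls.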

**Dynamic NTK** adjusts the scaling factor at inference time based on the actual sequence length, rather than using a fixed scaling factor.[20] If the input sequence is shorter than the training length, no scaling is applied; scaling kicks in only when the sequence exceeds the training length.

**LongRoPE** (Ding et al., 2024) further refines dimension-wise scaling by searching for optimal scaling factors for each dimension independently, achieving effective context lengths of up to 2 million tokens with minimal fine-tuning.[14]

**Decoupled RoPE** (DeepSeek-V2) applies the rotation to a separate small portion of query and key vectors, while the larger compressed portion is handled by Multi-head Latent Attention.[17] This resolves a conflict between RoPE and low-rank KV compression.

**iRoPE** (Llama 4) interleaves standard RoPE attention layers with NoPE (no positional encoding) layers in roughly a 3:1 ratio, drawing on findings that pure-NoPE layers can generalize better at extreme context lengths.[16]

## What are the limitations of RoPE?

While RoPE has proven highly effective, it is not without limitations.

The most cited issue is poor length generalization. Without modification, a RoPE model trained on 4k tokens degrades quickly past 5k or 6k tokens. This is the central motivation for PI, NTK, YaRN, and LongRoPE. Even with these techniques, attention over very long contexts often shows weaker recall than a model trained natively at the longer length.

A second issue is the asymmetry between high- and low-frequency dimensions. The high-frequency dimensions saturate after a small number of token positions and contribute little extra information; the low-frequency dimensions carry the long-range signal but are sparsely sampled, so the effective resolution of long-range positional information is lower than the head dimension would suggest. Recent work has even argued that some high-frequency dimensions become wasteful in long-context retrieval.

A third limitation is the oscillating nature of the distance decay. Unlike ALiBi, which provides a monotonic decrease in attention bias with distance, RoPE's distance sensitivity oscillates due to the periodic nature of rotation.[5] Very distant tokens can occasionally receive higher attention scores than moderately distant tokens, though in practice the model learns to handle this.

A fourth issue is that RoPE is not always optimal. ALiBi sometimes wins on raw extrapolation in language modeling perplexity, and Kazemnejad et al. (2023) showed that NoPE can outperform RoPE on certain reasoning tasks that require generalizing to longer sequences than seen during training.[16] Llama 4's iRoPE design takes this finding seriously by deliberately interleaving NoPE layers with RoPE layers.

A final practical limitation is that RoPE is only applied to queries and keys, not values. Some research has investigated whether extending a similar rotation to values, or using completely different per-frequency treatments, can yield gains, but no such variant has displaced standard RoPE in mainstream LLM training. The interaction of RoPE with alternative attention mechanisms (such as state-space models) is also less well understood than its behavior in standard [multi-head attention](/wiki/multi-head_self-attention).

## Practical tips for context extension

When extending a pretrained RoPE model to a longer context window, the choice between PI, NTK-aware, YaRN, and LongRoPE depends on the target length and the available compute.

- For modest extensions (2x to 4x) with no fine-tuning, NTK-aware scaling is the simplest option and works well as a zero-shot deployment trick.
- For modest extensions with a small fine-tuning budget, PI is the original recipe and remains a reasonable baseline.
- For larger extensions (8x to 32x) with a brief fine-tuning run, YaRN gives the best balance of quality and cost. Many open-source long-context releases through 2024 used YaRN for this regime.
- For extreme extensions (above 256k tokens) and when search compute is available, LongRoPE produces the highest quality at the cost of running an evolutionary search over per-frequency scaling factors.

In all cases, evaluating on a needle-in-a-haystack or passkey retrieval benchmark at the target length is essential, since perplexity alone can hide failures of long-range attention.
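The two simplest options in the list differ only in how they modify the rotation frequencies, which a short sketch can show. It assumes the commonly quoted forms of each method: PI divides every position (equivalently, every frequency) by the extension factor `scale`, and NTK-aware scaling replaces the base b with b·scale^(d/(d−2)). The function names are hypothetical, and YaRN and LongRoPE are omitted because they need per-frequency logic beyond a one-line rule.

```python
import math

def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE frequencies theta_i = base^(-2i/d), one per 2-D pair."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def pi_inv_freq(head_dim, scale, base=10000.0):
    """Position Interpolation: compressing positions by `scale` is the same
    as dividing every frequency by `scale`."""
    return [f / scale for f in rope_inv_freq(head_dim, base)]

def ntk_inv_freq(head_dim, scale, base=10000.0):
    """NTK-aware scaling (assumed form): enlarge the base so the slowest pair
    is stretched by `scale` while the fastest pair is left unchanged."""
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_inv_freq(head_dim, new_base)

std = rope_inv_freq(128)
pi = pi_inv_freq(128, scale=4.0)
ntk = ntk_inv_freq(128, scale=4.0)
print("fastest pair  std / PI / NTK:", std[0], pi[0], ntk[0])
print("slowest pair  std / PI / NTK:", std[-1], pi[-1], ntk[-1])
```

With these definitions PI slows every pair uniformly, whereas NTK-aware scaling leaves the fastest pair untouched and matches PI only at the slowest pair. This is one way to see why NTK-aware scaling can preserve local resolution without fine-tuning.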

## See also

- [MEGABYTE](/wiki/megabyte)
- [Positional encoding](/wiki/positional_encoding)
- [Transformer](/wiki/transformer)
- [Self-attention](/wiki/self_attention)
- [Attention](/wiki/attention)
- [Embeddings](/wiki/embeddings)
- [LLaMA](/wiki/llama)
- [LLaMA 2](/wiki/llama_2)
- [LLaMA 3](/wiki/llama_3)
- [Llama 4](/wiki/llama_4)
- [PaLM](/wiki/palm)
- [Multi-head Latent Attention](/wiki/multi-head_latent_attention)
- [Flash Attention](/wiki/flash_attention)

## References

1. Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." Submitted to arXiv 20 April 2021; published in *Neurocomputing, Volume 568, 2024* (article 127063). https://arxiv.org/abs/2104.09864

2. Su, J. (2021). Original Kexue.fm blog posts introducing rotary position embedding. https://kexue.fm

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "[Attention Is All You Need](/wiki/attention_is_all_you_need)." *[NeurIPS](/wiki/neurips) 2017*. https://arxiv.org/abs/1706.03762

4. Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). "Self-Attention with Relative Position Representations." *NAACL 2018*. https://arxiv.org/abs/1803.02155

5. Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." *ICLR 2022*. https://arxiv.org/abs/2108.12409

6. Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). "Extending Context Window of Large Language Models via Positional Interpolation." *arXiv preprint*. https://arxiv.org/abs/2306.15595

7. bloc97 (2023). "NTK-Aware Scaled RoPE allows LLaMA Models to have Extended (8k+) Context Size without any Fine-Tuning and Minimal Perplexity Degradation." *Reddit r/LocalLLaMA*, June 2023.

8. Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." *ICLR 2024*. https://arxiv.org/abs/2309.00071

9. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint*. https://arxiv.org/abs/2302.13971

10. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." *arXiv preprint*. https://arxiv.org/abs/2307.09288

11. Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). "Mistral 7B." *arXiv preprint*. https://arxiv.org/abs/2310.06825

12. Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." *ACL 2019*. https://arxiv.org/abs/1901.02860

13. Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *JMLR 21*. https://arxiv.org/abs/1910.10683

14. Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., & Yang, M. (2024). "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens." *ICML 2024*. https://arxiv.org/abs/2402.13753

15. Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint*. https://arxiv.org/abs/2204.02311

16. Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., & Reddy, S. (2023). "The Impact of Positional Encoding on Length Generalization in Transformers." *NeurIPS 2023*. https://arxiv.org/abs/2305.19466

17. DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." *arXiv preprint*. https://arxiv.org/abs/2405.04434

18. Microsoft (2024). "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." *arXiv preprint*. https://arxiv.org/abs/2404.14219

19. [EleutherAI](/wiki/eleutherai) (2021). "Rotary Embeddings: A Relative Revolution." *EleutherAI Blog*. https://blog.eleuther.ai/rotary-embeddings/

20. EleutherAI (2023). "Extending the RoPE." *EleutherAI Blog*. https://blog.eleuther.ai/yarn/

21. Hugging Face Transformers. `apply_rotary_pos_emb` and `LlamaRotaryEmbedding` in `src/transformers/models/llama/modeling_llama.py`. https://github.com/huggingface/transformers

22. ZhuiyiTechnology. "RoFormer reference implementation." https://github.com/ZhuiyiTechnology/roformer