Rotary Position Embedding

Rotary Position Embedding (RoPE) is a positional encoding method for transformer models that encodes absolute position by rotating query and key vectors in two-dimensional subspaces. Introduced by Jianlin Su and colleagues at Zhuiyi Technology in 2021 in the paper "RoFormer: Enhanced Transformer with Rotary Position Embedding," RoPE has become the dominant positional encoding scheme in modern large language models. Its core insight is that multiplying query and key vectors by position-dependent rotation matrices makes their dot product depend on position only through the relative offset between the two tokens, not through either absolute position. This gives the model relative position information without requiring explicit relative position embeddings or any additional learnable parameters.

The underlying idea first appeared in a series of Chinese-language blog posts on Su's site Kexue.fm in late 2020 and early 2021 before being formalized in the RoFormer paper. EleutherAI later picked up the technique, implemented it in GPT-J and GPT-NeoX, and wrote an influential English-language explainer that helped popularize RoPE in the broader open-source community.

RoPE is used in virtually all major open-source LLMs released since 2023, including LLaMA, LLaMA 2, LLaMA 3, Llama 4, Mistral, Mixtral, Qwen, Gemma, DeepSeek, Yi, Phi, and Falcon 2. Google's PaLM also adopted RoPE. The method's combination of simplicity, parameter-free design, and compatibility with efficient inference (particularly KV caching) has made it the de facto standard.

Background and motivation

Before RoPE, transformer models typically used either sinusoidal positional encodings (Vaswani et al., 2017) or learned positional embeddings (BERT, GPT-2). Both are absolute positional methods: they assign a fixed vector to each position in the sequence, which is added to the token embedding. While these approaches work, they have notable limitations.

Absolute positional encodings do not directly encode the distance between tokens. If a model needs to learn that "a verb typically follows its subject by one or two positions," it must extract this relative information indirectly from absolute position vectors. Learned embeddings also impose a hard maximum sequence length, since there is no embedding for positions beyond the training maximum.

Relative positional methods like those in Shaw et al. (2018), Transformer-XL (Dai et al., 2019), and T5 (Raffel et al., 2020) address the relative position problem by adding bias terms that depend on the distance between query and key positions. ALiBi (Press et al., 2022) takes a similar route with linear attention biases that decay with distance. However, these methods typically require additional parameters or modify the attention computation in ways that can complicate efficient implementation.

Su et al. sought a method that would encode absolute position (so each token knows where it is), naturally produce relative position information in the attention scores (so the model knows how far apart tokens are), require no additional parameters, and work efficiently with standard transformer implementations. RoPE belongs to a different family from earlier methods: it does not add anything to the residual stream. Instead, it transforms the query and key vectors right before the attention dot product, so the dot product itself becomes position-aware.

How RoPE works

The central idea of RoPE is to encode position by rotating the query and key vectors. A rotation in two dimensions is a well-understood geometric operation: you take a 2D vector and spin it by some angle. The rotation preserves the vector's magnitude while changing its direction.

RoPE exploits a specific property of rotations: when you compute the dot product between two rotated vectors, the result depends only on the angle difference between the two rotations, not on the individual rotation angles. If vector A is rotated by angle alpha and vector B is rotated by angle beta, their dot product depends on (alpha - beta), not on alpha and beta individually. Since self-attention is fundamentally a dot product between query and key vectors, this property means that rotating queries and keys by position-dependent angles automatically injects relative position information into the attention scores.

Step-by-step mechanism

Pair up dimensions. Given a query or key vector of even dimension d, RoPE groups the dimensions into d/2 consecutive pairs: (x_0, x_1), (x_2, x_3), (x_4, x_5), and so on. Each pair defines a 2D subspace.
Assign rotation frequencies. Each pair i (where i ranges from 0 to d/2 - 1) is assigned a base frequency: theta_i = 10000^(-2i/d). This follows the same frequency schedule as the original sinusoidal positional encoding. Lower-indexed pairs rotate at higher frequencies (capturing fine-grained position differences), while higher-indexed pairs rotate at lower frequencies (capturing coarser position information).
Compute rotation angles. For a token at position m, the rotation angle for pair i is simply m * theta_i. Tokens at later positions get rotated by larger angles.
Apply 2D rotation to each pair. For each dimension pair (x_{2i}, x_{2i+1}), the rotation is applied as:
- x'_{2i} = x_{2i} * cos(m * theta_i) - x_{2i+1} * sin(m * theta_i)
- x'_{2i+1} = x_{2i} * sin(m * theta_i) + x_{2i+1} * cos(m * theta_i)
This is the standard 2D rotation matrix applied to each pair independently.
Compute attention as usual. After rotation, the query and key vectors are used in the standard dot-product attention computation. No other changes to the attention mechanism are needed.
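
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names `rope_frequencies` and `apply_rope` are hypothetical, the vector is a plain list, and real codebases operate on batched tensors instead.

```python
import math

def rope_frequencies(d, base=10000.0):
    """Return theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def apply_rope(x, m, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by the angle m * theta_i."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even dimension"
    out = [0.0] * d
    for i, theta in enumerate(rope_frequencies(d, base)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def norm(v):
    return math.sqrt(sum(t * t for t in v))

q = [1.0, 0.0, 0.5, -0.5, 2.0, 1.0, 0.0, 3.0]   # toy query vector, d = 8
q_rot = apply_rope(q, m=7)                       # the same query at position 7
print(norm(q), norm(q_rot))  # the two norms agree up to floating-point error
```

Because every pair is only rotated, the vector's length is unchanged; position is carried purely by direction.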

Mathematical formulation

More formally, let x_m be the embedding of the token at position m, and let W_q and W_k be the query and key projection matrices. The query and key vectors are:

q_m = R(theta, m) * W_q * x_m
k_n = R(theta, n) * W_k * x_n

where R(theta, m) is a block-diagonal rotation matrix. Each 2x2 block along the diagonal is:

[ cos(m * theta_i)   -sin(m * theta_i) ]
[ sin(m * theta_i)    cos(m * theta_i) ]

The attention score between positions m and n is:

score(m, n) = q_m^T * k_n = (W_q * x_m)^T * R(theta, m)^T * R(theta, n) * (W_k * x_n)

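Because rotation matrices are orthogonal, R(theta, m)^T = R(theta, -m), and because rotations within the same plane compose by adding their angles, R(theta, m)^T * R(theta, n) = R(theta, n - m). The score therefore depends on the relative position (n - m) rather than on m and n individually: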

score(m, n) = (W_q * x_m)^T * R(theta, n - m) * (W_k * x_n)

This is the core mathematical property that makes RoPE work: absolute rotations produce relative position information in the dot product.
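
A quick numerical check makes the property concrete. The sketch below assumes toy query and key vectors that have already been projected (so W_q and W_k are omitted), and the helper names are hypothetical. Shifting both positions by the same amount should leave the score unchanged, while changing the offset should not.

```python
import math

def apply_rope(x, m, base=10000.0):
    """Rotate consecutive pairs of x by m * theta_i (interleaved layout)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def score(q, k, m, n):
    """Dot product of a query rotated to position m and a key rotated to position n."""
    return sum(a * b for a, b in zip(apply_rope(q, m), apply_rope(k, n)))

q = [0.3, -1.2, 0.8, 0.5, -0.7, 1.1]
k = [1.0, 0.4, -0.6, 0.9, 0.2, -1.5]

same_offset_a = score(q, k, m=3, n=10)      # n - m = 7
same_offset_b = score(q, k, m=103, n=110)   # n - m = 7, shifted by 100
other_offset = score(q, k, m=3, n=11)       # n - m = 8

print(same_offset_a, same_offset_b, other_offset)
```

The first two values match to floating-point precision; the third differs because the relative offset changed.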

Properties and advantages

RoPE has several properties that contributed to its widespread adoption:

Relative position through absolute encoding

RoPE is sometimes described as a "bridge" between absolute and relative positional encoding. Each token is encoded with its absolute position (via the rotation angle), but the attention mechanism naturally computes relative positions (via the angle difference in the dot product). This gives the model both types of information without requiring separate mechanisms.

No additional parameters

RoPE adds zero learnable parameters to the model. The rotation angles are deterministic functions of position and dimension index. This means RoPE has exactly the same parameter count as a model with no positional encoding at all, and the same count as the original sinusoidal encoding. Fused into the attention kernel, RoPE costs roughly 1 to 3 percent extra runtime.

Decaying distance sensitivity

Su et al. showed that the inner product between RoPE-encoded vectors exhibits a natural decay as the relative distance increases. Tokens that are close together tend to have higher attention scores (all else being equal) than tokens far apart. This aligns with the empirical observation that local context is generally more relevant than distant context in natural language. The authors framed this as a soft inductive bias toward locality that is consistent with the empirical behavior of language.
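
The trend can be illustrated with a deliberately simple case. The sketch below assumes an all-ones query and key, so each pair contributes 2 * cos(r * theta_i) at relative distance r. This choice, the head dimension of 128, and the function name are assumptions made for illustration, not a reproduction of the paper's analysis.

```python
import math

def rope_score_all_ones(r, d=128, base=10000.0):
    """Score between an all-ones query and key separated by r positions.

    For an all-ones pair the rotated dot product reduces to 2 * cos(r * theta_i).
    """
    return sum(2.0 * math.cos(r * base ** (-2.0 * i / d)) for i in range(d // 2))

scores = [rope_score_all_ones(r) for r in range(257)]

near = sum(scores[1:17]) / 16      # average over distances 1..16
far = sum(scores[200:257]) / 57    # average over distances 200..256
print(f"distance 0: {scores[0]:.1f}, near average: {near:.1f}, far average: {far:.1f}")
```

The score is largest at distance zero and the far-range average sits well below the near-range average, although individual values oscillate rather than falling monotonically.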

Compatibility with KV caching and linear attention

During autoregressive generation, transformers cache the key and value vectors from previous tokens to avoid recomputing them. Because RoPE applies the rotation to query and key vectors before they enter the attention computation, the rotation can be applied once when a token is first processed, and the rotated key can be cached directly. This is straightforward to implement and does not interfere with standard KV caching strategies, which is an important practical consideration for inference efficiency.
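
A minimal sketch of this caching pattern follows. The class `RotatedKeyCache` is hypothetical and stores keys only (values are cached unrotated and are omitted here). It compares scores from cached rotated keys against scores obtained by re-rotating every key from scratch.

```python
import math

def apply_rope(x, m, base=10000.0):
    """Rotate consecutive pairs of x by m * theta_i (interleaved layout)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

class RotatedKeyCache:
    """Hypothetical minimal cache: each key is rotated once, at insertion time."""

    def __init__(self):
        self.keys = []

    def append(self, k):
        position = len(self.keys)              # position of the token being added
        self.keys.append(apply_rope(k, position))

    def logits(self, q):
        """Scores of a query at the newest position against every cached key."""
        q_rot = apply_rope(q, len(self.keys) - 1)
        return [dot(q_rot, k) for k in self.keys]

raw_keys = [[1.0, 0.0, 0.5, 0.5], [0.2, -0.3, 1.0, 0.1], [0.7, 0.7, -0.4, 0.9]]
cache = RotatedKeyCache()
for k in raw_keys:
    cache.append(k)

q = [0.5, 1.0, -1.0, 0.25]
cached = cache.logits(q)
# Reference: rotate every key again from scratch for this decoding step.
recomputed = [dot(apply_rope(q, 2), apply_rope(k, n)) for n, k in enumerate(raw_keys)]
print(cached)
print(recomputed)
```

A cached key never needs to be touched again because its rotation depends only on its own position, not on the position of any later query.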

A further advantage is compatibility with linear attention. Because RoPE multiplies queries and keys by orthogonal rotation matrices and does not introduce a softmax-dependent bias, it can be combined with kernelized or linear attention variants (such as Performer-style attention) where additive relative biases are difficult to apply.

Natural extension to longer sequences

Because RoPE's rotation angles are continuous functions of position, they are defined for any non-negative position, however large. A model trained with a maximum sequence length of 4,096 will still produce well-defined rotations for position 8,192. Whether the model actually performs well at those extended positions depends on other factors, but the positional encoding itself does not have a hard boundary. This property has made RoPE the foundation for several context-length extension techniques.

Comparison with other positional encoding methods

Property	Sinusoidal	Learned embeddings	ALiBi	RoPE	NoPE
Type	Absolute	Absolute	Relative (bias)	Relative (via rotation)	None
Additional parameters	None	O(L * d)	None (fixed slopes per head)	None	None
Where applied	Input layer (additive)	Input layer (additive)	Attention logits (additive bias)	Query/key at each layer (multiplicative)	Not applied
Relative position signal	Indirect	No	Direct (linear penalty)	Direct (rotation difference)	Implicit via causal mask
Length extrapolation	Limited	None (hard max)	Strong	Moderate (strong with PI/NTK/YaRN scaling)	Surprisingly strong on some reasoning tasks
Adopted by	Original Transformer	BERT, GPT-2	BLOOM, MPT	LLaMA family, PaLM, Mistral, Mixtral, Qwen, DeepSeek, GPT-J, GPT-NeoX, Phi-3	Some research models, partial use in Llama 4 (iRoPE)
Distance decay	No built-in decay	No built-in decay	Strong (linear)	Moderate (oscillating)	None

Kazemnejad et al. (2023), in "The Impact of Positional Encoding on Length Generalization in Transformers," found that decoder-only transformers trained without any positional encoding (NoPE) often generalize to longer sequences better than RoPE, ALiBi, or learned absolute encodings on certain reasoning tasks. The result is one reason researchers continue to question whether RoPE is universally optimal even though it dominates in production systems.

Adoption in modern LLMs

The widespread adoption of RoPE can be traced to Meta's release of LLaMA in February 2023. LLaMA's architecture choices, including RoPE, became a template that nearly every subsequent open-source LLM followed. The timeline of adoption includes:

Model	Organization	Release	RoPE variant
GPT-J	EleutherAI	June 2021	Partial RoPE on 25% of head dimensions
GPT-NeoX	EleutherAI	February 2022	Partial RoPE on 25% of head dimensions (GPT-NeoX-20B)
PaLM	Google	April 2022	Standard RoPE
LLaMA	Meta	February 2023	Standard RoPE
LLaMA 2	Meta	July 2023	Standard RoPE
Code Llama	Meta	August 2023	RoPE with base theta raised to 1,000,000 during long-context fine-tuning
Mistral 7B	Mistral AI	September 2023	Standard RoPE with sliding-window attention
Qwen	Alibaba	September 2023	Standard RoPE with NTK-aware scaling
Yi	01.AI	November 2023	Standard RoPE
Phi-2	Microsoft	December 2023	Standard RoPE
Mixtral 8x7B / 8x22B	Mistral AI	December 2023 / April 2024	Standard RoPE
Gemma	Google	February 2024	Standard RoPE
Mistral Large	Mistral AI	February 2024	Standard RoPE
LLaMA 3	Meta	April 2024	RoPE with extended theta (500,000)
Phi-3	Microsoft	April 2024	RoPE with Su-scaled or YaRN-style scaling for 128k variant
DeepSeek-V2	DeepSeek	May 2024	Decoupled RoPE with Multi-head Latent Attention
Qwen 2.5	Alibaba	September 2024	RoPE with YaRN scaling
DeepSeek-V3	DeepSeek	December 2024	Decoupled RoPE with MLA
Llama 4	Meta	2025	iRoPE (interleaved RoPE and NoPE layers)
Qwen 3	Alibaba	2025	RoPE

The infrastructure ecosystem followed this adoption wave. FlashAttention, vLLM, and Hugging Face Transformers all provide optimized RoPE implementations. Hugging Face's rope_scaling configuration parameter allows users to specify different scaling strategies (linear, dynamic, YaRN) directly in model configuration files.

A few notable variants deserve mention. GPT-J applies RoPE to only 25 percent of head dimensions (64 out of 256), leaving the remaining dimensions untouched; GPT-NeoX-20B likewise rotates only the first 25 percent of each head's dimensions, with the fraction exposed as a configuration setting. DeepSeek-V2 introduced Decoupled RoPE, in which the rotation is applied to a separate small portion of the key and query vectors so the rest can be compressed by Multi-head Latent Attention. Llama 4 introduces iRoPE, an interleaved scheme that alternates standard RoPE attention layers with NoPE layers in roughly a 3:1 ratio to support very long contexts.

The base parameter and frequency spectrum

The base value 10,000 in theta_i = 10000^(-2i/d) is a hyperparameter, not a fundamental constant. Su et al. (2021) inherited it from the sinusoidal encoding for direct continuity with the Transformer's original encoding scheme. Larger bases produce slower rotations in every pair except the first (whose frequency is 1 regardless of the base), which means the same numerical position m corresponds to a smaller phase angle in those pairs. This effectively spreads positional information over a longer range and is the foundation of NTK-aware context extension.

The d/2 frequencies span a wide spectrum. The first few pairs rotate quickly and complete many full revolutions even within a short context window; they encode fine-grained, high-frequency positional information. The last few pairs rotate very slowly and barely complete a full revolution within the entire training context; they encode coarse, low-frequency, long-range positional information.
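
The spread is easy to quantify through each pair's wavelength, the number of positions needed for one full revolution (2 * pi / theta_i). The sketch below assumes a head dimension of 128 and a 4,096-token training context; both values and the function name are illustrative choices.

```python
import math

def wavelengths(d, base=10000.0):
    """Positions needed for pair i to complete one full revolution: 2*pi / theta_i."""
    return [2.0 * math.pi * base ** (2.0 * i / d) for i in range(d // 2)]

d, context = 128, 4096      # assumed head dimension and training context length
lam = wavelengths(d)
full_turns = sum(1 for w in lam if w <= context)

print(f"shortest wavelength: {lam[0]:.2f} positions")
print(f"longest wavelength:  {lam[-1]:.0f} positions")
print(f"{full_turns} of {d // 2} pairs complete a revolution within {context} tokens")
```

Under these assumptions the fastest pair completes a turn roughly every six tokens, while the slowest pairs never finish a single turn inside the training window.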

This two-end behavior matters for context extension. High-frequency dimensions saturate quickly and add little useful signal at long range; low-frequency dimensions carry the long-range signal but are sparsely sampled. Several context-extension techniques exploit this asymmetry by treating different frequencies differently.

Context length extension with RoPE

One of RoPE's most active areas of development is extending the context length of pre-trained models beyond their original training length. Vanilla RoPE generalizes poorly when a model trained on, for example, 2,048 tokens is asked to attend over 8,192 tokens. The attention scores at the new positions involve rotation angles the model never saw during training, and perplexity rises sharply. Research starting in mid-2023 produced a sequence of techniques to extend the effective context window of pretrained RoPE models with little or no fine-tuning.

Position interpolation (PI)

Chen et al. (2023) proposed position interpolation, a simple technique: instead of extrapolating RoPE to positions beyond the training range, interpolate by dividing all position indices by a scaling factor s. If a model was trained with max length 4,096 and you want to extend to 32,768, set s = 8 and divide all positions by 8. This maps the extended range [0, 32,768] back into [0, 4,096], where the model has seen positions during training. A small amount of fine-tuning (around 1,000 steps) is needed to adapt the model to the compressed position space. The authors argued that interpolation has an upper-bound error roughly 600x smaller than direct extrapolation, which is the main reason it works so well with so little training.
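
The 4,096 to 32,768 example can be written out directly. This is a sketch of the position rescaling only (fine-tuning is not shown). The head dimension of 64 and the function names are assumptions.

```python
def rope_angles(position, d, base=10000.0):
    """Rotation angle m * theta_i for every pair at the given (possibly fractional) position."""
    return [position * base ** (-2.0 * i / d) for i in range(d // 2)]

def interpolated_angles(position, d, scale, base=10000.0):
    """Position interpolation: divide the position index by the scale factor s."""
    return rope_angles(position / scale, d, base)

trained_len, target_len = 4096, 32768
s = target_len / trained_len              # 8.0
d = 64                                    # assumed head dimension

last = interpolated_angles(target_len - 1, d, s)
print(f"largest angle at the final extended position: {last[0]:.3f}")
print(f"largest angle seen in training is below:      {trained_len}")
```

Every extended position now lands on a (fractional) position inside the trained range, so the model interpolates between angles it has seen rather than extrapolating to new ones.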

NTK-aware scaling

NTK-aware (Neural Tangent Kernel-aware) interpolation was developed through community experimentation in the r/LocalLLaMA Reddit community in mid-2023, originating in a post by a user named bloc97. The insight is that position interpolation treats all frequency dimensions equally, but this is suboptimal. High-frequency dimensions (which capture fine-grained position differences) lose resolution when compressed, and they do not need compression in the first place because they already sweep through many full rotations within the training context. Low-frequency dimensions (which capture coarse position information) are the ones that reach unseen angles at longer lengths, so they are where interpolation is actually needed.

NTK-aware scaling addresses this by modifying the base frequency parameter. Instead of scaling position indices, it scales the base value (10,000) in the frequency formula: the new base is b' = b * s^(d / (d - 2)), where s is the desired length-extension factor. This stretches the low-frequency components more than the high-frequency ones, spreading the interpolation pressure across dimensions. The approach can be applied without any fine-tuning at all, achieving 2x to 4x extension zero-shot, though fine-tuning improves quality.
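
The effect of the base change on each pair can be checked numerically. The sketch below applies the formula b' = b * s^(d / (d - 2)) and compares the old and new frequencies. The head dimension of 128, the factor s = 4, and the function names are illustrative assumptions.

```python
def ntk_scaled_base(base, s, d):
    """NTK-aware base: b' = b * s^(d / (d - 2))."""
    return base * s ** (d / (d - 2))

def frequencies(d, base):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

d, base, s = 128, 10000.0, 4.0
original = frequencies(d, base)
scaled = frequencies(d, ntk_scaled_base(base, s, d))

# Effective slowdown per pair: 1.0 means untouched, s means fully interpolated.
ratios = [o / n for o, n in zip(original, scaled)]
print(f"new base: {ntk_scaled_base(base, s, d):.0f}")
print(f"slowdown of fastest pair: {ratios[0]:.3f}")
print(f"slowdown of slowest pair: {ratios[-1]:.3f}")
```

The exponent d / (d - 2) is chosen so that the slowest pair is slowed by exactly s (matching position interpolation) while the fastest pair is left untouched, with a smooth progression in between.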

NTK-aware scaling was quickly adopted by popular inference frameworks including llama.cpp and text-generation-webui, and was incorporated into several Hugging Face model implementations. A related variant called Dynamic NTK, which grew out of the same Reddit and EleutherAI community work, applies NTK-aware scaling adaptively at inference time based on the current sequence length, so the model uses no scaling at short lengths (preserving its trained behavior) and smoothly increases scaling as the sequence grows.
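
One simple way to realize the dynamic behavior is to derive the scale factor from the current sequence length and feed it into the NTK-aware base formula. The schedule below (s = max(1, current length / trained length)) and the function name are assumptions made for illustration; inference libraries may use a different schedule.

```python
def dynamic_ntk_base(seq_len, trained_len, d, base=10000.0):
    """Hypothetical dynamic schedule: no scaling up to trained_len, NTK-aware beyond it."""
    s = max(1.0, seq_len / trained_len)
    return base * s ** (d / (d - 2))

trained_len, d = 4096, 128    # assumed training context and head dimension
for seq_len in (1024, 4096, 8192, 16384):
    print(seq_len, round(dynamic_ntk_base(seq_len, trained_len, d)))
```

Because the base changes as the sequence grows, keys cached under an earlier base were rotated with different frequencies than keys added later, which is a practical wrinkle when combining this kind of schedule with KV caching.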

YaRN (Yet another RoPE extensioN)

Peng et al. (2023) introduced YaRN in the first formal academic paper to rigorously analyze and improve upon the community NTK-aware scaling methods. The authors are Bowen Peng and Jeffrey Quesnelle (Nous Research), Honglu Fan (EleutherAI and University of Geneva), and Enrico Shippole. YaRN's key innovation is recognizing that different frequency ranges require fundamentally different scaling strategies. It addresses this with two mechanisms:

NTK-by-parts (ramp function for dimension-wise interpolation). A ramp function decides whether each dimension is fully interpolated, not interpolated, or partially interpolated, depending on its wavelength relative to the original context length. For LLaMA-family models the authors used parameters alpha = 1, beta = 32. Low-frequency dimensions receive more interpolation, while high-frequency dimensions receive less.
Temperature scaling for attention logits. Extending context length causes a distributional shift in attention scores because longer sequences spread probability mass over more tokens. YaRN applies a temperature factor t to attention logits to compensate for this shift, with sqrt(1/t) = 0.1 * ln(s) + 1, where s is the extension factor.
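
Both mechanisms can be sketched compactly. The code below assumes the ramp is a linear blend between "fully interpolated" (theta / s) and "untouched" (theta), keyed on how many full rotations a pair completes within the original context, with alpha = 1 and beta = 32 as quoted above. The function names, the head dimension of 128, and the 4,096-token original context are illustrative assumptions.

```python
import math

def yarn_frequencies(d, s, trained_len, base=10000.0, alpha=1.0, beta=32.0):
    """NTK-by-parts: blend interpolated and original frequencies per pair."""
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        rotations = trained_len * theta / (2.0 * math.pi)   # full turns in the original context
        # Ramp: 0 below alpha (interpolate fully), 1 above beta (leave untouched).
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        out.append((1.0 - gamma) * theta / s + gamma * theta)
    return out

def yarn_logit_scale(s):
    """sqrt(1/t) = 0.1 * ln(s) + 1; applying it to both q and k scales logits by 1/t."""
    return 0.1 * math.log(s) + 1.0

d, s, trained_len = 128, 16.0, 4096
original = [10000.0 ** (-2.0 * i / d) for i in range(d // 2)]
scaled = yarn_frequencies(d, s, trained_len)

untouched = sum(1 for o, n in zip(original, scaled) if n == o)
print(f"{untouched} of {d // 2} pairs are left untouched")
print(f"sqrt(1/t) for s = {s:.0f}: {yarn_logit_scale(s):.4f}")
```

In contrast to NTK-aware scaling, which slows every pair except the first by some amount, the ramp leaves the fast pairs exactly as trained and confines interpolation to the pairs that need it.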

YaRN achieves the same extended-context perplexity as PI with roughly 10x fewer training tokens and 2.5x fewer training steps, while requiring fine-tuning on less than 0.1% of the original pre-training data. It enables models to extrapolate beyond even the fine-tuning context length. LLaMA models extended with YaRN have been shown to handle context lengths of 64K and 128K tokens effectively, and YaRN became the standard recipe for context extension in many open-source LLM releases through 2024.

LongRoPE

Ding et al. (2024) at Microsoft introduced LongRoPE, which uses an evolutionary search to find a non-uniform, per-frequency rescaling that further reduces the gap between original and extended context performance. LongRoPE extends pretrained LLMs to 2,048k tokens with up to 1,000 fine-tuning steps at 256k length, while preserving short-context quality. The method underpins the 128k-context variants of Phi-3.

Comparison of RoPE scaling methods

Method	Date and source	Core idea	Fine-tuning needed	Typical extension
Position Interpolation (PI)	Chen et al., June 2023	Linearly downscale positions by factor s = L'/L so all positions fall in trained range	About 1,000 steps	2x to 16x
NTK-aware scaling	bloc97 on Reddit r/LocalLLaMA, June 2023	Increase base theta instead of uniformly scaling positions; preserves high-frequency components	None (zero-shot) or minimal	2x to 4x with no fine-tuning
Dynamic NTK	Reddit / EleutherAI work, mid-2023	Adjust base on the fly during inference based on current sequence length	None	Smooth degradation past trained length
YaRN	Peng et al., August 2023	Combine NTK-by-parts with attention temperature scaling	About 10x fewer tokens than PI	16x to 32x with brief fine-tuning (64k and 128k from a 4k base)
LongRoPE	Ding et al. (Microsoft), February 2024	Search for non-uniform per-frequency scaling factors with evolutionary search	Up to 1,000 steps at 256k length	Up to 2,048k tokens (used in Phi-3-128k)
Extended theta	Standard pre-training	Increase base from 10,000 to 500,000 or more	Yes (full pre-training)	Long contexts trained from scratch

Some models have adopted the simpler approach of just increasing the base theta value during pre-training. LLaMA 3 used a base theta of 500,000 (compared to the standard 10,000), which inherently allows the model to handle longer sequences because the rotation frequencies are lower. This approach requires training with the extended theta from the start (or extensive continued pre-training) but avoids the need for post-hoc scaling.

Implementation details

In practice, RoPE is implemented efficiently without explicitly constructing the full rotation matrices. The rotation of each 2D pair can be computed with simple element-wise multiplications and additions:

q_rotated[2i]   = q[2i] * cos(m * theta_i) - q[2i+1] * sin(m * theta_i)
q_rotated[2i+1] = q[2i] * sin(m * theta_i) + q[2i+1] * cos(m * theta_i)

The cosine and sine values for each position and dimension pair can be precomputed and cached as a lookup table, so the runtime cost of RoPE is just two element-wise multiplications and one addition per vector dimension (four multiplications and two additions per dimension pair), per token, per layer. This is negligible compared to the cost of the attention computation itself.

A common compact form, used in modern codebases, expresses the rotation elementwise. Letting cos(m * theta) and sin(m * theta) be vectors of length d (the d/2 per-pair values repeated so that they cover both halves of the head dimension), the rotated query is:

q_rotated = q * cos(m * theta) + rotate_half(q) * sin(m * theta)

where rotate_half maps a vector split into halves (x1, x2) to (-x2, x1): it swaps the two halves and negates the half that moves to the front. This is the form used in the apply_rotary_pos_emb function in the Hugging Face transformers library, in src/transformers/models/llama/modeling_llama.py. The rotation table is built once per forward pass by LlamaRotaryEmbedding, which exposes hooks for PI, NTK-aware, dynamic NTK, YaRN, and LongRoPE scaling through configuration flags.
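
A plain-Python sketch of this compact form is shown below. It mirrors the structure described above but is not the library's code: the helper names `cos_sin_table` and `apply_rope_half` are hypothetical, and the vectors are lists rather than tensors.

```python
import math

def rotate_half(x):
    """Map (x1, x2) -> (-x2, x1), where x1 and x2 are the two halves of x."""
    h = len(x) // 2
    return [-v for v in x[h:]] + x[:h]

def cos_sin_table(m, d, base=10000.0):
    """cos and sin of m * theta_i, each repeated twice to reach full length d."""
    angles = [m * base ** (-2.0 * i / d) for i in range(d // 2)]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    return cos, sin

def apply_rope_half(q, m, base=10000.0):
    """q * cos + rotate_half(q) * sin, evaluated element by element."""
    cos, sin = cos_sin_table(m, len(q), base)
    rotated = rotate_half(q)
    return [qv * c + rv * s for qv, rv, c, s in zip(q, rotated, cos, sin)]

q = [1.0, 2.0, 3.0, 4.0]
out = apply_rope_half(q, m=5)
print(rotate_half(q))   # [-3.0, -4.0, 1.0, 2.0]
print(out)
```

In this layout pair i consists of elements i and i + d/2 rather than two neighbouring elements, which is why the cosine and sine vectors are repeated across the two halves.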

An equivalent formulation, often used in implementations, restructures the computation using complex number multiplication. Writing j for the imaginary unit (to avoid a clash with the pair index i), if each pair (x_{2i}, x_{2i+1}) is treated as the complex number x_{2i} + j * x_{2i+1}, then the rotation is simply multiplication by the complex number cos(m * theta_i) + j * sin(m * theta_i) = e^(j * m * theta_i). This view makes the mathematical structure clearer and can be implemented efficiently using complex number support in frameworks like PyTorch.
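
The equivalence is easy to confirm with Python's built-in complex numbers. The sketch below compares the complex-multiplication form against the explicit sine and cosine form on a toy vector; the function names are hypothetical.

```python
import cmath
import math

def apply_rope_complex(x, m, base=10000.0):
    """Treat each pair as a complex number and multiply by e^(j * m * theta_i)."""
    d = len(x)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        z = complex(x[2 * i], x[2 * i + 1]) * cmath.exp(1j * m * theta)
        out += [z.real, z.imag]
    return out

def apply_rope_explicit(x, m, base=10000.0):
    """Reference: the 2x2 rotation written out with cos and sin."""
    d = len(x)
    out = []
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out += [x[2 * i] * c - x[2 * i + 1] * s,
                x[2 * i] * s + x[2 * i + 1] * c]
    return out

x = [0.5, -1.0, 2.0, 0.25, -0.75, 1.5]
a = apply_rope_complex(x, m=42)
b = apply_rope_explicit(x, m=42)
max_diff = max(abs(u - v) for u, v in zip(a, b))
print(f"largest difference between the two forms: {max_diff:.2e}")
```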

A practical detail worth noting: there are two equivalent encodings for the pairs. The original GPT-J style interleaves consecutive even and odd dimensions, while the GPT-NeoX and LLaMA style splits the head dimension into two halves and uses a rotate_half helper. Both are mathematically equivalent up to a permutation of dimensions, but checkpoints converted between formats need a one-time reordering of the corresponding weights.
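
The permutation relating the two layouts can be made explicit. In the sketch below, `to_half_layout` (a hypothetical helper) moves even-indexed elements to the first half and odd-indexed elements to the second half. Rotating in the half-split layout after permuting should give the same result as rotating in the interleaved layout and permuting afterwards.

```python
import math

def rope_interleaved(x, m, base=10000.0):
    """GPT-J style: pair i is (x[2i], x[2i+1])."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out

def rope_half_split(x, m, base=10000.0):
    """GPT-NeoX / LLaMA style: pair i is (x[i], x[i + d/2])."""
    d = len(x)
    h = d // 2
    out = [0.0] * d
    for i in range(h):
        angle = m * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + h] * s
        out[i + h] = x[i] * s + x[i + h] * c
    return out

def to_half_layout(x):
    """Permute an interleaved vector so that pair members sit d/2 apart."""
    return x[0::2] + x[1::2]

x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
left = rope_half_split(to_half_layout(x), m=9)
right = to_half_layout(rope_interleaved(x, m=9))
print(max(abs(a - b) for a, b in zip(left, right)))
```

When converting a checkpoint, the same permutation is applied to the output rows of the query and key projections for each head, so the model produces vectors that are already in the target layout.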

RoPE is applied to queries and keys only, not to values, and is applied independently at each attention layer rather than just at the input. This means the position information is refreshed at every layer, which helps maintain the position signal throughout the network depth. This contrasts with additive methods (sinusoidal, learned embeddings) that apply the position signal only at the input and rely on the network to propagate it through subsequent layers.

For researchers building from scratch, the cleanest reference implementations live in the EleutherAI GPT-NeoX codebase and in Su Jianlin's original RoFormer release at ZhuiyiTechnology/roformer on GitHub.

Connection to relative position bias

RoPE can be viewed as a way to inject the relative-position term R(theta, n - m) into the attention dot product without adding it as a separate scalar. T5's relative position bias adds a learned scalar b(n - m) directly to the pre-softmax logit; RoPE multiplies the query and key vectors by complementary rotation matrices so the same relative-position structure emerges from the dot product itself. The two approaches share a goal but differ in mechanism: T5's bias is additive and learned per relative-distance bucket, while RoPE's rotation is multiplicative and parameter-free.

This shared spirit explains some of RoPE's observed properties. The decay of the dot product with relative distance, for example, is qualitatively similar to ALiBi's hand-designed linear penalty, even though RoPE arrives at it through phase mismatch in high-frequency dimensions rather than an explicit subtraction.

Extensions and variants

Several extensions to the basic RoPE formulation have been explored:

2D and 3D RoPE extends the rotation approach to two-dimensional and three-dimensional position grids, which is useful for vision transformers and other models that operate on spatial data. Instead of a single position index, each token has (x, y) or (x, y, z) coordinates, and the rotation angles are defined over these multi-dimensional positions.

Dynamic NTK adjusts the scaling factor at inference time based on the actual sequence length, rather than using a fixed scaling factor. If the input sequence is shorter than the training length, no scaling is applied; scaling kicks in only when the sequence exceeds the training length.

LongRoPE (Ding et al., 2024) further refines dimension-wise scaling by searching for optimal scaling factors for each dimension independently, achieving effective context lengths of up to 2,048k (about 2 million) tokens with minimal fine-tuning.

Decoupled RoPE (DeepSeek-V2) applies the rotation to a separate small portion of query and key vectors, while the larger compressed portion is handled by Multi-head Latent Attention. This resolves a conflict between RoPE and low-rank KV compression.

iRoPE (Llama 4) interleaves standard RoPE attention layers with NoPE (no positional encoding) layers in roughly a 3:1 ratio, drawing on findings that pure-NoPE layers can generalize better at extreme context lengths.

Limitations

While RoPE has proven highly effective, it is not without limitations.

The most cited issue is poor length generalization. Without modification, a RoPE model trained on 4k tokens degrades quickly past 5k or 6k tokens. This is the central motivation for PI, NTK, YaRN, and LongRoPE. Even with these techniques, attention over very long contexts often shows weaker recall than a model trained natively at the longer length.

A second issue is the asymmetry between high- and low-frequency dimensions. The high-frequency dimensions saturate after a small number of token positions and contribute little extra information; the low-frequency dimensions carry the long-range signal but are sparsely sampled, so the effective resolution of long-range positional information is lower than the head dimension would suggest. Recent work has even argued that some high-frequency dimensions become wasteful in long-context retrieval.

A third limitation is the oscillating nature of the distance decay. Unlike ALiBi, which provides a monotonic decrease in attention bias with distance, RoPE's distance sensitivity oscillates due to the periodic nature of rotation. Very distant tokens can occasionally receive higher attention scores than moderately distant tokens, though in practice the model learns to handle this.

A fourth issue is that RoPE is not always optimal. ALiBi sometimes wins on raw extrapolation in language modeling perplexity, and Kazemnejad et al. (2023) showed that NoPE can outperform RoPE on certain reasoning tasks that require generalizing to longer sequences than seen during training. Llama 4's iRoPE design takes this finding seriously by deliberately interleaving NoPE layers with RoPE layers.

A final practical limitation is that RoPE is only applied to queries and keys, not values. Some research has investigated whether extending a similar rotation to values, or using completely different per-frequency treatments, can yield gains, but no such variant has displaced standard RoPE in mainstream LLM training. The interaction of RoPE with alternative attention mechanisms (such as state-space models) is also less well understood than its behavior in standard multi-head attention.

Practical tips for context extension

When extending a pretrained RoPE model to a longer context window, the choice between PI, NTK-aware, YaRN, and LongRoPE depends on the target length and the available compute.

For modest extensions (2x to 4x) with no fine-tuning, NTK-aware scaling is the simplest option and works well as a zero-shot deployment trick.
For modest extensions with a small fine-tuning budget, PI is the original recipe and remains a reasonable baseline.
For larger extensions (8x to 32x) with a brief fine-tuning run, YaRN gives the best balance of quality and cost. Most open-source long-context releases through 2024 used YaRN for this regime.
For extreme extensions (above 256k tokens) and when search compute is available, LongRoPE produces the highest quality at the cost of running an evolutionary search over per-frequency scaling factors.

In all cases, evaluating on a needle-in-a-haystack or passkey retrieval benchmark at the target length is essential, since perplexity alone can hide failures of long-range attention.

References

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). "RoFormer: Enhanced Transformer with Rotary Position Embedding." *Neurocomputing, Volume 568, 2024*. https://arxiv.org/abs/2104.09864
Su, J. (2021). Original Kexue.fm blog posts introducing rotary position embedding. https://kexue.fm
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). "Attention Is All You Need." *NeurIPS 2017*. https://arxiv.org/abs/1706.03762
Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). "Self-Attention with Relative Position Representations." *NAACL 2018*. https://arxiv.org/abs/1803.02155
Press, O., Smith, N. A., & Lewis, M. (2022). "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation." *ICLR 2022*. https://arxiv.org/abs/2108.12409
Chen, S., Wong, S., Chen, L., & Tian, Y. (2023). "Extending Context Window of Large Language Models via Positional Interpolation." *arXiv preprint*. https://arxiv.org/abs/2306.15595
bloc97 (2023). "NTK-Aware Scaled RoPE allows LLaMA Models to have Extended (8k+) Context Size without any Fine-Tuning and Minimal Perplexity Degradation." *Reddit r/LocalLLaMA*, June 2023.
Peng, B., Quesnelle, J., Fan, H., & Shippole, E. (2023). "YaRN: Efficient Context Window Extension of Large Language Models." *ICLR 2024*. https://arxiv.org/abs/2309.00071
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). "LLaMA: Open and Efficient Foundation Language Models." *arXiv preprint*. https://arxiv.org/abs/2302.13971
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." *arXiv preprint*. https://arxiv.org/abs/2307.09288
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. (2023). "Mistral 7B." *arXiv preprint*. https://arxiv.org/abs/2310.06825
Dai, Z., Yang, Z., Yang, Y., Carbonell, J., Le, Q. V., & Salakhutdinov, R. (2019). "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context." *ACL 2019*. https://arxiv.org/abs/1901.02860
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *JMLR 21*. https://arxiv.org/abs/1910.10683
Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., & Yang, M. (2024). "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens." *ICML 2024*. https://arxiv.org/abs/2402.13753
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). "PaLM: Scaling Language Modeling with Pathways." *arXiv preprint*. https://arxiv.org/abs/2204.02311
Kazemnejad, A., Padhi, I., Natesan Ramamurthy, K., Das, P., & Reddy, S. (2023). "The Impact of Positional Encoding on Length Generalization in Transformers." *NeurIPS 2023*. https://arxiv.org/abs/2305.19466
DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." *arXiv preprint*. https://arxiv.org/abs/2405.04434
Microsoft (2024). "Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone." *arXiv preprint*. https://arxiv.org/abs/2404.14219
EleutherAI. "Rotary Embeddings: A Relative Revolution." *EleutherAI Blog*. https://blog.eleuther.ai/rotary-embeddings/
EleutherAI (2023). "Extending the RoPE." *EleutherAI Blog*. https://blog.eleuther.ai/yarn/
Hugging Face Transformers. `apply_rotary_pos_emb` and `LlamaRotaryEmbedding` in `src/transformers/models/llama/modeling_llama.py`. https://github.com/huggingface/transformers
ZhuiyiTechnology. "RoFormer reference implementation." https://github.com/ZhuiyiTechnology/roformer