Position Interpolation (PI)

Updated Jun 8, 2026

Position Interpolation (PI) is a method for extending the context window of a pretrained large language model that uses rotary position embedding (RoPE). Rather than asking the model to operate at position indices larger than any it encountered during training, PI linearly down-scales the position indices of a longer input so that they fall back inside the original trained range, after which the model is briefly fine-tuned. The technique was introduced in June 2023 by Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian at Meta in the paper "Extending Context Window of Large Language Models via Positional Interpolation." ^[1] Using PI, the authors extended LLaMA models, which were originally trained with a 2,048-token window, to context lengths of 8,192, 16,384, and 32,768 tokens with no more than 1,000 steps of fine-tuning, while preserving quality on the original window. PI is widely regarded as the foundational method of the RoPE-scaling family and a direct precursor to NTK-aware interpolation and YaRN. ^[1]^[4]

Overview

Transformer language models cannot natively process sequences longer than the maximum length seen during pretraining, because their position-encoding mechanism has only been exposed to a bounded range of positions. The naive fix, simply feeding in a longer sequence and letting RoPE produce position indices beyond the trained range (extrapolation), fails dramatically: the model's output quality collapses, with language-modeling perplexity often exploding into the thousands. ^[1]

Position Interpolation reframes the problem. Instead of pushing positions outward into unseen territory, it pulls a longer sequence inward, compressing its position indices so that the largest index in a sequence of the new target length equals the largest index seen in training. The attention computation therefore always operates on in-distribution position values. A short round of fine-tuning then lets the model adapt to the finer spacing between adjacent positions. Because the rescaled positions never leave the trained range, PI is far more stable than extrapolation and needs only a fraction of the compute that training a long-context model from scratch would demand. ^[1]

Background: RoPE and the extrapolation failure

Rotary position embedding, introduced by Su et al. in the RoFormer architecture, encodes position by rotating the query and key vectors in a set of two-dimensional subspaces, where the rotation angle is proportional to the token's absolute position. ^[2] For a model with head dimension d, RoPE applies to position m a set of rotation frequencies theta_j = base^(-2j/d) for j = 0 to d/2 - 1, with the base constant conventionally set to 10,000. The dot product between a rotated query at position m and a rotated key at position n depends only on their relative offset m - n, which is what makes RoPE attractive: relative position is encoded implicitly through rotation. ^[1]^[2]
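The relative-offset property can be checked numerically. The following is a minimal sketch, not the RoFormer or LLaMA implementation: it assumes the pairing convention in which consecutive coordinates (2j, 2j+1) form each two-dimensional subspace, and the helper names and toy vectors are made up for illustration.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """Return theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each pair (vec[2j], vec[2j+1]) by the angle position * theta_j."""
    out = []
    for j, theta in enumerate(rope_frequencies(len(vec), base)):
        angle = position * theta
        x, y = vec[2 * j], vec[2 * j + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors with head dimension 8 (hypothetical values).
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# The same relative offset (m - n = 5) at two different absolute positions.
score_near = dot(apply_rope(q, 10), apply_rope(k, 5))
score_far = dot(apply_rope(q, 105), apply_rope(k, 100))
print(abs(score_near - score_far) < 1e-9)
```

Both scores agree up to floating-point error because only the difference m - n survives in the dot product of the two rotated vectors.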

The low-index dimensions rotate quickly (high frequency, short wavelength) and the high-index dimensions rotate slowly (low frequency, long wavelength). When a sequence is extended beyond the trained length L, the relative offsets m - n grow larger than any value seen in training. The rotation angles (m - n) * theta_j then take values, and combinations of values across dimensions, that the model has never observed. The effect is most pronounced in the slow, low-frequency dimensions, whose wavelengths can exceed L, so they never complete a full turn during training. The resulting attention logits can become very large and erratic. Chen et al. showed that the attention score, viewed as a function of relative distance, behaves well only inside the trained interval and that its extrapolated values can grow large enough to overwhelm the softmax. This is the mechanism behind the catastrophic perplexity blow-up seen when RoPE models are run past their training length. ^[1]
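To make the spread of rotation speeds concrete, the sketch below computes the wavelength 2*pi / theta_j of each rotary pair and counts how many pairs are too slow to complete a full turn inside the trained window. The head dimension of 128 and the 2,048-token window match the LLaMA configuration discussed in this article; the function name is illustrative.

```python
import math

def rope_wavelengths(head_dim, base=10000.0):
    """Wavelength in tokens of each rotary pair: 2*pi / theta_j = 2*pi * base^(2j/d)."""
    return [2.0 * math.pi * base ** (2.0 * j / head_dim) for j in range(head_dim // 2)]

trained_length = 2048
wavelengths = rope_wavelengths(128)

# Pairs whose wavelength exceeds the trained window never see a full rotation in training.
incomplete = [j for j, w in enumerate(wavelengths) if w > trained_length]

print(f"shortest wavelength: {wavelengths[0]:.2f} tokens")
print(f"longest wavelength:  {wavelengths[-1]:.0f} tokens")
print(f"{len(incomplete)} of {len(wavelengths)} pairs have a wavelength above {trained_length} tokens")
```

The fastest pair repeats every 2*pi tokens, while the slowest pairs have wavelengths far longer than the trained window, so positions beyond it push those pairs to angles that were never seen during training.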

How Position Interpolation works

PI modifies the position index before it enters the RoPE rotation. Let L be the original context window and L' the desired longer window, and define the scale factor s = L' / L. RoPE applied to a feature vector x at position m is written f(x, m). Position Interpolation replaces this with

f'(x, m) = f(x, m * L / L') = f(x, m / s).

In words, every position index is multiplied by L / L' (equivalently, divided by the extension factor s), so a sequence of length L' has its indices squeezed from the range [0, L') down into [0, L). For example, extending a 2,048-token model to 32,768 tokens uses s = 16, and position 32,767 in the long sequence is presented to RoPE as if it were position 2,047.9375, just inside the trained range. The interpolation is linear and uniform: every rotation angle, and therefore effectively every rotary frequency, is scaled by the same factor. ^[1]
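The rescaling is a one-line computation. The sketch below works through the 2,048 to 32,768 example; the function name is illustrative and not taken from the paper's code.

```python
def interpolated_position(m, trained_len, target_len):
    """Rescaled index m * L / L' that is fed to RoPE in place of m."""
    return m * trained_len / target_len

L, L_prime = 2048, 32768
s = L_prime / L

last = interpolated_position(L_prime - 1, L, L_prime)
step = interpolated_position(1, L, L_prime)

print(s)     # 16.0
print(last)  # 2047.9375
print(step)  # 0.0625
```

The last value is the spacing between adjacent tokens after rescaling: neighbors that were one position apart in training are now 1/16 of a position apart, which is the finer spacing the subsequent fine-tuning has to accommodate.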

Because the rescaling places neighboring tokens closer together in rotation-angle space than the model originally saw, a brief fine-tune is needed so the network can resolve the finer spacing. The paper fine-tuned on the Pile corpus using the next-token language-modeling objective for 1,000 gradient steps (compared with 10,000 steps for the direct fine-tuning baseline on extrapolated positions), with AdamW, a learning rate of 2e-5 for the 7B and 13B models and 1e-5 for the 33B and 65B models, and a global batch size of 64 or 128 depending on the configuration. ^[1] This procedure is often called "linear" RoPE scaling, and mainstream libraries such as Hugging Face Transformers adopted it as the "linear" type of their rope_scaling setting.
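As an illustration of how the scale factor is usually passed to a library, the sketch below builds a settings dictionary in the shape of the rope_scaling option. The key names "type" and "factor" follow the convention used by Hugging Face Transformers Llama configs as commonly documented, but this is an assumption: newer releases may name the first key differently, so check the documentation of the installed version. The helper itself is hypothetical and does not call the library.

```python
def linear_rope_scaling(trained_len, target_len):
    """Build a rope_scaling-style dict for linear (PI) scaling with factor s = L' / L."""
    if target_len < trained_len:
        raise ValueError("target length must not be shorter than the trained length")
    # Key names are an assumption modeled on the rope_scaling convention.
    return {"type": "linear", "factor": target_len / trained_len}

settings = linear_rope_scaling(2048, 8192)
print(settings)  # {'type': 'linear', 'factor': 4.0}
```

Setting the factor only changes how positions are computed; the short fine-tune described above is still needed for the model to make good use of the rescaled positions.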

Theoretical justification

The central theoretical result of the paper is a bound on how far the interpolated attention score can deviate from a smooth, well-behaved function. Chen et al. proved that the upper bound on the attention score under interpolation is roughly 600 times smaller than the corresponding bound under extrapolation, for the LLaMA configuration (head dimension 128, base 10,000). ^[1] Intuitively, interpolation only ever evaluates the attention function at points between values it already fits well, so the deviation is controlled, whereas extrapolation evaluates it outside that region, where no such guarantee holds. This much tighter bound is the formal explanation for why interpolated positions keep attention scores in-distribution and why PI needs so little fine-tuning to recover full performance. ^[1]

Results

Chen et al. applied PI to the full LLaMA 1 model family (7B, 13B, 33B, and 65B parameters), extending the original 2,048-token window to 8,192, 16,384, and 32,768 tokens. The extended models showed continued improvement in language-modeling perplexity as more context was made available, evaluated on long-document corpora including PG-19 and arXiv math proofs, indicating that they genuinely used the additional tokens rather than ignoring them. ^[1] On tasks that fit inside the original 2,048-token window, the PI-extended models stayed close to the base models, with degradation on standard benchmarks reported as small (on the order of a couple of percent), demonstrating that the extension does not badly harm short-context behavior. ^[1]

The most striking contrast appeared on the passkey retrieval task, a synthetic test in which a random pass code is hidden at a varying depth inside a long document and the model must recover it. A PI-extended model recovered the passkey across the entire target context after only about 200 fine-tuning steps. A model fine-tuned by direct extrapolation, by contrast, could barely extend its effective range beyond the original 2,048 tokens even after more than 10,000 steps. ^[1] The combination of strong results with minimal fine-tuning is what made PI immediately practical and widely copied.
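A passkey test case of the kind described above can be generated in a few lines. This is a minimal sketch, not the paper's evaluation harness: the filler sentence, the prompt wording, the five-digit key, and the function names are all assumptions made for illustration.

```python
import random

def make_passkey_prompt(n_filler, depth, rng):
    """Hide a random 5-digit passkey at a fractional depth (0.0 to 1.0) inside filler text."""
    passkey = rng.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow."  # hypothetical filler
    lines = [filler] * n_filler
    insert_at = int(depth * n_filler)
    lines.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    prompt = "\n".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey

def is_correct(model_output, passkey):
    """Score a model response: the key must appear verbatim in the output."""
    return str(passkey) in model_output

rng = random.Random(0)
prompt, passkey = make_passkey_prompt(n_filler=200, depth=0.5, rng=rng)
print(len(prompt.splitlines()), "lines; key hidden at line", prompt.splitlines().index(
    f"The pass key is {passkey}. Remember it."))
```

Sweeping n_filler to vary the document length and depth to vary the key's location gives the grid of test cases from which an effective context length can be read off.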

Relationship to NTK-aware scaling, YaRN, and ABF

PI's one apparent weakness, that it scales every rotary frequency by the same factor, motivated a rapid sequence of refinements. Uniform interpolation crowds the fast, high-frequency dimensions especially hard, which can blur the model's ability to distinguish nearby tokens and erase fine-grained local position information. ^[3]^[4]

NTK-aware interpolation, proposed in mid-2023 by the pseudonymous developer "bloc97" shortly after the PI paper, addresses this by not scaling the frequencies uniformly. Instead of rescaling positions, it increases the RoPE base constant, which, because the frequencies fall off exponentially with dimension index, has the effect of barely touching the high-frequency dimensions while strongly interpolating the low-frequency ones. This spreads the interpolation pressure unevenly so that local resolution is largely preserved, and it can extend context to a moderate degree even without any fine-tuning. A "Dynamic NTK" variant adjusts the scale on the fly as the sequence grows. ^[3] A further refinement, NTK-by-parts, interpolates only the frequency bands where it helps and leaves the highest frequencies untouched. ^[4]
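The uneven effect of a base change can be seen by comparing per-dimension frequency ratios. The sketch below assumes the commonly cited NTK-aware rule base' = base * s^(d / (d - 2)), which is not spelled out in this article and should be treated as an assumption; the scale factor of 4 and the head dimension of 128 are example values.

```python
def rope_frequencies(head_dim, base):
    """Return theta_j = base^(-2j/d) for j = 0 .. d/2 - 1."""
    return [base ** (-2.0 * j / head_dim) for j in range(head_dim // 2)]

def ntk_aware_base(base, scale, head_dim):
    """Assumed NTK-aware rule: base' = base * s^(d / (d - 2))."""
    return base * scale ** (head_dim / (head_dim - 2))

d, base, s = 128, 10000.0, 4.0
original = rope_frequencies(d, base)
ntk = rope_frequencies(d, ntk_aware_base(base, s, d))
# PI divides positions by s, which is equivalent to dividing every frequency by s.
pi = [theta / s for theta in original]

for j in (0, d // 4, d // 2 - 1):
    print(f"pair {j:2d}: NTK ratio {ntk[j] / original[j]:.4f}, PI ratio {pi[j] / original[j]:.4f}")
```

Under this rule the fastest pair keeps its original frequency, the slowest pair is slowed by exactly the factor s, and the pairs in between are scaled by intermediate amounts, whereas PI applies the factor 1/s everywhere.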

YaRN (Yet another RoPE extensioN), published in August 2023 by Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole of Nous Research and collaborators, builds directly on these ideas. ^[4] It combines NTK-by-parts interpolation with an attention "temperature" adjustment that rescales the attention logits before the softmax, which compensates for the way longer contexts spread attention more thinly. The authors reported that YaRN needs roughly ten times fewer tokens and about 2.5 times fewer training steps than previous context-extension methods, and they released Llama 2 checkpoints fine-tuned to 64k and 128k tokens. ^[4] It was accepted at ICLR 2024.
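The temperature adjustment amounts to multiplying the attention logits by a factor that grows slowly with the scale factor s. The sketch below assumes the fit sqrt(1/t) = 0.1 * ln(s) + 1 that is usually quoted from the YaRN paper for LLaMA-family models; the constant 0.1 and the function name are assumptions here, not details given in this article.

```python
import math

def yarn_logit_multiplier(scale, coeff=0.1):
    """Return 1/t, the factor applied to attention logits, where sqrt(1/t) = coeff * ln(s) + 1."""
    if scale <= 1.0:
        return 1.0  # no context extension, so no adjustment
    return (coeff * math.log(scale) + 1.0) ** 2

for s in (1, 4, 16, 32):
    print(f"s = {s:2d}: logits multiplied by {yarn_logit_multiplier(s):.4f}")
```

A multiplier above 1 sharpens the softmax, counteracting the flattening of attention over a longer context. In practice the same effect can be obtained by scaling the rotary embeddings by sqrt(1/t), so the attention code itself does not need to change.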

A parallel line of work from Meta, the "adjusted base frequency" (ABF) method of Xiong et al. in "Effective Long-Context Scaling of Foundation Models" (the Llama 2 Long work, September 2023), increases the RoPE base from 10,000 to 500,000 and continues pretraining on long sequences, extending Llama 2 to a 32,768-token window. ^[5] The same base-increase strategy appeared in Code Llama, which raised the base to 1,000,000 and fine-tuned on 16,384-token code sequences, ^[6] and in later Llama 3 models, which use a base of 500,000. These methods and PI sit on a spectrum: PI rescales positions, ABF rescales the frequencies through the base, and NTK-aware and YaRN apply non-uniform combinations of the two.
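The effect of raising the base can be seen in the wavelength of the slowest rotary pair. The sketch below compares the three base values mentioned above, assuming a head dimension of 128 throughout (the LLaMA value quoted earlier in this article); the function name is illustrative.

```python
import math

def longest_wavelength(head_dim, base):
    """Wavelength in tokens of the slowest rotary pair, j = d/2 - 1."""
    slowest_theta = base ** (-2.0 * (head_dim // 2 - 1) / head_dim)
    return 2.0 * math.pi / slowest_theta

for base in (10_000, 500_000, 1_000_000):
    wavelength = longest_wavelength(128, base)
    print(f"base {base:>9,}: slowest pair repeats every {wavelength:,.0f} tokens")
```

The fastest pair has theta_0 = 1 regardless of the base, so a larger base stretches the slow end of the spectrum while leaving local, high-frequency resolution untouched, in contrast to PI's uniform compression.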

Method	Core mechanism	Non-uniform across frequencies	Fine-tuning
Position Interpolation (PI)	Linearly down-scale position indices by L/L'	No (uniform)	Short (about 1,000 steps)
NTK-aware	Increase RoPE base to scale low frequencies more, high frequencies less	Yes	Optional / none for modest extension
YaRN	NTK-by-parts interpolation plus attention temperature scaling	Yes	Short, very token-efficient
ABF (adjusted base frequency)	Raise RoPE base (10,000 to 500,000) and continue pretraining	Yes	Longer continued pretraining

Later training-free or search-based methods extended the lineage further. Self-Extend (Jin et al., 2024) groups distant positions to reuse trained relative distances without any fine-tuning, and LongRoPE (Microsoft, 2024) uses evolutionary search to find a non-uniform, per-dimension interpolation schedule, reaching context windows beyond two million tokens.

Limitations

Position Interpolation has several recognized limitations. It still requires gradient updates: although the fine-tuning budget is small, PI is not a fully training-free method, unlike the zero-shot regime that NTK-aware scaling can sometimes achieve. Its uniform scaling is the main technical shortcoming, because compressing all frequencies by the same factor degrades the high-frequency dimensions that encode local, short-range position, which is precisely what NTK-aware and YaRN were designed to correct. ^[3]^[4] PI also introduces a small but measurable quality cost on tasks that fit within the original window, since the model's effective positional resolution is reduced after rescaling. ^[1] Finally, the method is specific to rotary embeddings: it does not directly apply to models with absolute learned positions or additive schemes such as ALiBi, and the achievable extension factor is bounded in practice because very aggressive compression eventually packs adjacent positions too tightly to distinguish even after fine-tuning. Despite these constraints, PI remains a standard baseline and a conceptual foundation for the RoPE context-extension family.

References

1. Chen, Shouyuan; Wong, Sherman; Chen, Liangjian; Tian, Yuandong. "Extending Context Window of Large Language Models via Positional Interpolation." arXiv:2306.15595, June 27, 2023. https://arxiv.org/abs/2306.15595
2. Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv:2104.09864, April 2021. https://arxiv.org/abs/2104.09864
3. bloc97. "NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation." Reddit r/LocalLLaMA, mid-2023. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/
4. Peng, Bowen; Quesnelle, Jeffrey; Fan, Honglu; Shippole, Enrico. "YaRN: Efficient Context Window Extension of Large Language Models." arXiv:2309.00071, August 31, 2023 (ICLR 2024). https://arxiv.org/abs/2309.00071
5. Xiong, Wenhan; et al. "Effective Long-Context Scaling of Foundation Models." arXiv:2309.16039, September 2023. https://arxiv.org/abs/2309.16039
6. Roziere, Baptiste; et al. "Code Llama: Open Foundation Models for Code." arXiv:2308.12950, August 2023. https://arxiv.org/abs/2308.12950
