YaRN (Yet another RoPE extensioN) is a compute-efficient method for extending the context window of large language models that use Rotary Position Embeddings (RoPE). Introduced by Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole in a paper submitted to arXiv on August 31, 2023 (arXiv:2309.00071), YaRN combines selective frequency interpolation with an attention temperature scaling mechanism to enable models to process sequences far longer than their original pre-training context window while requiring only a small amount of additional fine-tuning. The paper was accepted at the International Conference on Learning Representations (ICLR 2024). The approach was subsequently adopted as the primary context extension method in several leading open-weight models, including Llama 3.1, Qwen 3, Mistral Large, and DeepSeek V3.
The name is a self-aware reference to the proliferation of RoPE scaling techniques in 2023; at the time of publication, the field had already seen Position Interpolation, NTK-aware scaling, and several community variants.
Transformer models have no inherent notion of token order. Position information must be injected explicitly, either as learned embeddings added to token representations, or through mechanisms that modify the attention computation itself. Early large language models such as GPT-2 used learned absolute positional embeddings, which do not generalize beyond the maximum training length. Sinusoidal positional encodings (Vaswani et al., 2017) generalize slightly better but still degrade sharply beyond training length. ALiBi (Attention with Linear Biases, Press et al., 2022), used in models such as BLOOM, adds a fixed, head-specific linear bias to attention logits and extrapolates more gracefully, but it is architecturally incompatible with RoPE.
Rotary Position Embeddings (RoPE), introduced by Su et al. in the RoFormer paper (arXiv:2104.09864), became the dominant positional encoding scheme in open-weight foundation models, including the entire LLaMA family, Mistral, Qwen, DeepSeek, and others. RoPE applies a two-dimensional rotation to consecutive pairs of dimensions in the query and key vectors before the attention dot product is computed. Each pair of dimensions d is rotated by angle m * theta_d, where m is the token position and theta_d = base^(-2d/D). Here D is the embedding dimension and base is typically 10,000. The inner product between a rotated query at position m and a rotated key at position n depends only on the relative displacement (m - n), naturally encoding relative position.
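The rotation can be written in a few lines. The sketch below is a minimal NumPy illustration of the formulas above, assuming the interleaved pairing of adjacent dimensions from the original RoFormer formulation (production implementations often use a split-half layout instead):

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # theta_d = base^(-2d/D) for each dimension pair d = 0 .. D/2 - 1
    d = np.arange(dim // 2)
    theta = base ** (-2.0 * d / dim)
    # angle m * theta_d for every (position, dimension-pair) combination
    return np.outer(positions, theta)        # shape: (seq_len, dim // 2)

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive dimension pairs of query/key vectors x."""
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]      # the two halves of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin     # standard 2-D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

A quick sanity check of the relative-position property: for fixed vectors q and k, `apply_rope(q, [m]) @ apply_rope(k, [n]).T` yields the same value for any pair (m, n) with equal offset m - n.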
Because the rotation angle depends on both the position m and the frequency theta_d, RoPE creates a multi-scale positional representation. Lower-indexed dimension pairs have large theta_d values and therefore rotate rapidly as positions increase (high frequency, encoding local short-range structure). Higher-indexed dimension pairs have small theta_d values and rotate slowly (low frequency, encoding global long-range structure). This hierarchical frequency structure is central to why selective scaling across frequencies -- as in YaRN -- works better than uniform scaling.
Despite their advantages, RoPE-based models generalize poorly beyond their pre-training context length L_train. When a model encounters token positions m greater than L_train, the rotary angles m * theta_d for those positions lie entirely outside the range seen during training. The attention mechanism relies on inner products between rotated queries and keys; at unseen angles, the inner products become unpredictable and the model's ability to relate distant tokens breaks down. In practice this manifests as a sharp perplexity spike and loss of coherent output when sequence length exceeds the training maximum by more than a modest amount.
This limitation is significant for production use cases. A model trained with a 4K or 8K context window cannot reliably summarize long documents, process entire codebases, maintain coherence across book-length inputs, or handle many-turn extended conversations without expensive re-pre-training from scratch on longer sequences. As instruction-tuned models became widely deployed in 2023, the gap between typical training context lengths (4K to 8K tokens) and user demand (10K to 128K or more) became a pressing practical problem.
The first widely adopted solution was Position Interpolation (PI), introduced by Chen et al. (arXiv:2306.15595) at Meta in June 2023. The core idea is to scale down position indices at inference time: rather than feeding position m directly into RoPE, use m * L_train / L_target. The longest position in the extended context thus maps to the same rotary angle as the longest position seen during training, ensuring all angles remain within the trained distribution. With roughly 1,000 fine-tuning steps, PI allowed LLaMA models to extend to 32,768 tokens.
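In code, PI is a one-line transformation of the position indices fed into RoPE. A minimal sketch (function name illustrative):

```python
def pi_positions(positions, L_train, L_target):
    # Compress indices so that position L_target maps to the same
    # rotary angle that position L_train did during training.
    return positions * (L_train / L_target)

# e.g. extending a 4K model to 32K: position 32768 maps to
# 32768 * (4096 / 32768) = 4096, the edge of the trained range.
```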
However, PI applies uniform compression to all position indices and all frequency dimensions. At a scaling factor of s (target context = s times original), adjacent tokens are 1/s apart in the compressed position space. High-frequency RoPE dimensions, which rely on the difference between nearby positions to encode fine-grained local relationships, are particularly harmed: at 8x extension, the distance between adjacent tokens in position space becomes 1/8 of what the model was trained on, effectively making it much harder to distinguish nearby tokens in those dimensions. This degradation grows with the extension factor. Perplexity at standard context lengths measurably increases in PI-extended models, reflecting the loss of local positional precision.
NTK-aware scaling (also called NTK-RoPE), which circulated as a community contribution in mid-2023, takes a different approach: instead of compressing position indices, it increases the RoPE base. Setting base_new = base * alpha^(D/(D-2)) for a chosen scale factor alpha slows the rotation of every dimension, with an effect that grows from negligible in the highest-frequency dimension to a full factor of alpha in the lowest, keeping rotary angles at extended positions close to the trained range. The connection to Neural Tangent Kernel theory is the observation that deep networks struggle to represent high-frequency information, so the rescaling is designed to preserve the encoding's high-frequency components rather than compress them uniformly.
NTK-aware scaling works as a zero-shot approach, requiring no fine-tuning, because it largely avoids placing token positions at entirely unseen angles, and it achieves lower zero-shot perplexity than PI in many settings. The limitation is that the base change is governed by a single global parameter: high-frequency dimensions are still slowed somewhat even though their original frequencies were already well-suited to the training context length, and low-frequency dimensions may not be slowed enough at very large extension ratios.
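A sketch of the base adjustment, under the formula quoted above (names illustrative):

```python
def ntk_scaled_base(base, alpha, dim):
    # base' = base * alpha^(D / (D - 2)); each theta_d = base'^(-2d/D)
    # is slowed relative to the original, most strongly at high d.
    return base * alpha ** (dim / (dim - 2))

# e.g. D = 128, alpha = 8: 10000 * 8^(128/126) ≈ 82,685
```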
Both PI and NTK-aware scaling apply a single scalar to all frequency dimensions despite the fact that different dimensions serve fundamentally different roles. The YaRN authors identified this as the central failure mode: high-frequency dimensions should not be interpolated (they already generalize locally), while low-frequency dimensions must be interpolated to bring extended positions into the trained angular range. An approach that could apply extrapolation to high frequencies and interpolation to low frequencies, with a smooth transition between the two regimes, would outperform either uniform strategy. YaRN provides exactly this, combined with a principled correction for the change in attention score distribution caused by any form of RoPE modification.
The four authors represent a collaboration between Nous Research and EleutherAI. Bowen Peng, the lead author, is affiliated with both Nous Research and Université de Montréal. Jeffrey Quesnelle is co-founder and CEO of Nous Research, an open-source AI research organization focused on agent reliability, context-length extension, and novel training techniques. Honglu Fan contributed from EleutherAI, the open research collective known for the GPT-NeoX and Pythia model series. Enrico Shippole is also affiliated with Nous Research. The collaboration brought together practical open-source model development expertise (Nous Research's fine-tuning work on the LLaMA family) with fundamental research capabilities (EleutherAI's infrastructure and evaluation tooling).
The initial version of the paper was submitted on August 31, 2023. A revised version was submitted in late 2023 and accepted at ICLR 2024, and the arXiv version received a further update in February 2024. The GitHub repository (github.com/jquesnelle/yarn), maintained by Quesnelle, was publicly released alongside the paper and received approximately 1,700 stars and 130 forks, reflecting strong community interest. The repository includes training scripts, DeepSpeed configuration files, evaluation scripts built on lm-evaluation-harness, and links to trained model weights published on Hugging Face under appropriate licenses.
The paper's primary claims are:

- a frequency-aware interpolation scheme (NTK-by-parts) that outperforms uniform position-index or base scaling;
- an attention temperature correction that compensates for the distribution shift introduced by modifying RoPE;
- context extension with roughly an order of magnitude less fine-tuning compute than Position Interpolation;
- a dynamic variant that provides meaningful context extension at inference time with no fine-tuning at all.
YaRN's first technical contribution is NTK-by-parts interpolation, a piecewise blending strategy that applies different scaling to different frequency dimensions based on each dimension's characteristic wavelength relative to the target context length.
The wavelength of dimension pair d is the sequence length at which that dimension completes one full 2pi rotation cycle: lambda_d = 2pi / theta_d. Dimensions with small index (high theta_d) have short wavelengths and complete many full rotations within any non-trivial sequence. Dimensions with large index (low theta_d) have long wavelengths, potentially much longer than the training context.
The ramp function gamma(r) maps a rotation-count ratio r to a blending coefficient between 0 and 1, defined piecewise with two tunable boundaries alpha and beta:

gamma(r) = 0 for r < alpha; gamma(r) = (r - alpha) / (beta - alpha) for alpha <= r <= beta; gamma(r) = 1 for r > beta

Here r is the ratio of the original training context length to the dimension's wavelength, r = L_train / lambda_d, i.e. the number of full rotations that dimension completes within the training context. The blended modified frequency for dimension d is then:
h(theta_d) = (1 - gamma(r)) * (theta_d / s) + gamma(r) * theta_d
where s is the extension factor (target length divided by original length). When gamma = 0, the frequency is divided by s (full interpolation, as in PI). When gamma = 1, the original frequency theta_d is preserved unchanged (pure extrapolation). In the transition zone, a weighted blend is applied.
For LLaMA-family models, the authors found through empirical experimentation that alpha = 1 and beta = 32 worked well. This means dimensions whose wavelength is longer than the training context (fewer than one full rotation, r < 1) receive full interpolation, dimensions that complete more than 32 rotations within the context (wavelength shorter than 1/32 of the context length) receive no interpolation, and a smooth ramp connects the two regimes. The parameters can be adapted to other model families; for Mistral 7B, different values were used. A minimal sketch of the whole computation follows below.
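Putting the pieces together, the blended frequencies can be computed as in this sketch, which follows the plain-text formulas above (function and parameter names are illustrative, not the reference implementation's):

```python
import numpy as np

def yarn_by_parts_theta(dim, L_train, s, base=10000.0, alpha=1.0, beta=32.0):
    d = np.arange(dim // 2)
    theta = base ** (-2.0 * d / dim)
    wavelength = 2 * np.pi / theta
    r = L_train / wavelength      # rotations completed within training context
    gamma = np.clip((r - alpha) / (beta - alpha), 0.0, 1.0)   # ramp function
    # gamma = 0 (r < alpha): full interpolation, theta / s
    # gamma = 1 (r > beta): frequency left unchanged (extrapolation)
    return (1.0 - gamma) * theta / s + gamma * theta
```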
Modifying RoPE frequencies changes the distribution of the dot products between rotated queries and keys. When frequencies are compressed (interpolated), the inner products between query-key pairs at similar positions become smaller in magnitude, leading to a flatter attention distribution that treats all tokens more equally. When frequencies are extrapolated, the opposite effect can occur. Either way, the attention softmax distribution diverges from what the model was trained on, leading to suboptimal behavior.
YaRN's second contribution addresses this directly with an attention temperature correction. The standard scaled dot-product attention computes softmax(q^T k / sqrt(D)). YaRN introduces a temperature parameter t that further scales the denominator, computing softmax(q^T k / (t * sqrt(D))): t > 1 softens (flattens) the attention distribution, while t < 1 sharpens it.
The key finding of the paper is an empirical formula for the optimal temperature as a function of the extension factor s:
sqrt(1/t) = 0.1 * ln(s) + 1
This logarithmic relationship was found to hold across Llama 2 models of all sizes (7B, 13B, 70B) without requiring model-specific tuning of the temperature. For example, at s = 16 (a 4K model extended to 64K), sqrt(1/t) = 0.1 * ln(16) + 1 ≈ 1.277, giving t ≈ 0.613. The attention logits are effectively multiplied by 1/t ≈ 1.63, sharpening the distribution to compensate for the flattening caused by the compressed RoPE frequencies.
An important implementation detail: the temperature scaling can be absorbed into the query and key normalization step, adding zero overhead at inference time. Rather than modifying the softmax computation, implementations scale the query and key vectors by sqrt(1/t) before computing the dot product, which is mathematically equivalent and compatible with optimized attention kernels including Flash Attention.
The mscale (attention scaling multiplier) in model config files corresponds to sqrt(1/t) and is set using this same formula, with the value depending on the scale factor of the deployed model.
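A sketch of the correction as typically folded into query/key scaling (names illustrative):

```python
import math

def yarn_mscale(s):
    # sqrt(1/t) = 0.1 * ln(s) + 1; no correction within the trained window
    return 0.1 * math.log(s) + 1.0 if s > 1.0 else 1.0

# q and k are each multiplied by the mscale before the dot product, so
# the attention logits are multiplied by 1/t with no change to softmax:
#   q *= yarn_mscale(s);  k *= yarn_mscale(s)
# e.g. s = 16: mscale ≈ 1.277, logits scaled by ≈ 1.63
```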
A third contribution of the paper is the dynamic extension variant, which requires no fine-tuning. In Dynamic YaRN, the scale factor s is not fixed during deployment but computed on the fly for each forward pass as s = max(1, l / L_train), where l is the length of the current input and L_train is the training context length. If the current input fits within the original training length, s = 1 and no scaling is applied, preserving original model behavior. As the sequence grows longer, s increases and the ramp-blended interpolation is applied progressively.
Dynamic scaling achieves more than 2x context extension with zero fine-tuning. It is less accurate at extreme lengths than the fine-tuned static variant, but it provides a practical drop-in upgrade for any deployed RoPE model without requiring weight changes. Dynamic YaRN was subsequently integrated into llama.cpp, vLLM, Hugging Face Transformers, and other inference frameworks as a configurable rope_scaling option.
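Dynamic YaRN is a thin wrapper around the static computation. A sketch, reusing the hypothetical yarn_by_parts_theta from above:

```python
def dynamic_yarn_theta(dim, L_train, seq_len, base=10000.0):
    # Recompute the scale factor per forward pass from the current length;
    # s = 1 inside the trained window leaves the model unmodified.
    s = max(1.0, seq_len / L_train)
    return yarn_by_parts_theta(dim, L_train, s, base)
```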
The following table summarizes how YaRN compares to the primary competing approaches for extending context in RoPE-based language models:
| Method | Fine-tuning | Frequency treatment | Attention correction | Typical extension | Notes |
|---|---|---|---|---|---|
| Position Interpolation | Required | Uniform (position scaling) | None | Up to ~8x; degrades beyond | Degrades local precision |
| NTK-Aware Scaling | Optional | Uniform (base rescaling) | None | 2-4x zero-shot | Better than PI zero-shot |
| NTK-by-Parts | Optional | Non-uniform (piecewise) | None | 4-8x | Partial solution |
| YaRN (static) | Light (400-600 steps) | Non-uniform (ramp) | Yes (temperature) | 8-32x reliably | Best perplexity fine-tuned |
| Dynamic YaRN | None | Non-uniform (adaptive) | Yes | 2x+ reliably | Best zero-shot |
| LongRoPE | Optional | Per-dimension search | None | Up to 64x+ zero-shot | Automated, no heuristics |
A central empirical result from the paper is that YaRN significantly outperforms Position Interpolation on extended perplexity while also avoiding PI's degradation on short contexts. On Llama 2 7B at 32K context with 400 fine-tuning steps, YaRN achieves a perplexity of 2.77 versus PI's 3.57. NTK-aware scaling, despite being reasonable zero-shot, actually performs poorly after fine-tuning (perplexity 8.49), suggesting that its uniform scaling creates a poor initialization for fine-tuning. NTK-by-parts alone (without temperature scaling) achieves 2.81, slightly above YaRN's 2.77, indicating that the temperature correction contributes a small but consistent additional improvement.
PI fine-tuned models show measurably higher perplexity on standard-length inputs compared to the original model. YaRN's frequency partitioning avoids this regression by preserving high-frequency dimensions unchanged, which are the most important for short-range token discrimination.
ALiBi takes a fundamentally different architectural approach: it encodes position through a linear bias added to attention logits rather than modifying query and key representations. ALiBi extrapolates gracefully to contexts somewhat longer than training and requires no position-specific fine-tuning. However, it is architecturally incompatible with RoPE: models already trained with RoPE cannot adopt ALiBi without structural changes and re-training. YaRN specifically targets the RoPE ecosystem, where the majority of open-weight foundation models live.
The paper evaluates perplexity using a sliding window approach with stride S = 256 on the PG-19 book corpus and GovReport datasets. At 32K context length with 400 fine-tuning steps, YaRN-extended Llama 2 7B achieves perplexity of 2.77, substantially below both PI (3.57) and NTK-aware (8.49). At 128K context with scale factor s = 32, Llama 2 7B achieves perplexity of 2.37, compared to 2.71 for Code Llama (which uses an NTK-based extension).
On the GovReport benchmark across 50 long government documents at a 32K sliding window, YaRN at s = 16 achieves perplexity of 3.59 and at s = 32 achieves 3.64, versus Code Llama's 4.44.
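The sliding-window protocol scores each token with a bounded amount of preceding context. A schematic sketch, where nll_fn is a hypothetical wrapper returning per-token negative log-likelihoods from the model under evaluation:

```python
import numpy as np

def sliding_window_ppl(token_ids, nll_fn, window=32768, stride=256):
    # Slide a fixed-size window over the sequence; score only the final
    # `stride` tokens of each step, so every scored token is conditioned
    # on up to (window - stride) tokens of preceding context.
    losses = []
    for start in range(0, len(token_ids) - window + 1, stride):
        chunk = token_ids[start:start + window]
        losses.extend(nll_fn(chunk)[-stride:])
    return float(np.exp(np.mean(losses)))
```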
PassKey retrieval, a synthetic task, tests whether a model can locate a short numeric passkey planted at a random position within a long document of irrelevant filler text. It is a simple diagnostic for whether the model can attend to arbitrary positions in a long context rather than defaulting to a recency bias. YaRN Llama 2 7B and 13B extended to 128K context both achieved 99.4% accuracy on PassKey retrieval, indicating effective attention across the full context window without positional blind spots.
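A common construction of the task looks like the following sketch (the filler sentences and prompt wording are illustrative, not the paper's exact template):

```python
import random

def make_passkey_doc(total_sentences, passkey):
    filler = "The grass is green. The sky is blue. The sun is yellow."
    parts = [filler] * total_sentences
    # Plant the passkey at a uniformly random position in the filler.
    position = random.randrange(total_sentences)
    parts.insert(position, f"The pass key is {passkey}. Remember it.")
    return " ".join(parts) + " What is the pass key?"
```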
To verify that context extension does not degrade standard language understanding capabilities, the paper evaluates YaRN-extended Llama 2 7B on ARC-Challenge, HellaSwag, MMLU, and TruthfulQA. The results show minimal regression:
| Benchmark | Baseline | YaRN s=16 | YaRN s=32 |
|---|---|---|---|
| ARC-Challenge | 53.1 | 52.3 | 52.1 |
| HellaSwag | 77.8 | 78.8 | 78.4 |
| MMLU | 43.8 | 42.5 | 41.7 |
| TruthfulQA | 39.0 | 38.2 | 37.3 |
The small regressions on MMLU and TruthfulQA, and the slight improvement on HellaSwag, indicate that YaRN's temperature-corrected interpolation preserves the model's general knowledge and reasoning capabilities through context extension.
YaRN's fine-tuning requirements are modest by the standards of context extension research. Extending Llama 2 7B to 64K tokens requires 128 to 384 GPU-hours on A100 hardware, using 400 training steps at s = 16 and approximately 200 additional steps to reach s = 32 for 128K. The paper reports that comparable results via Position Interpolation require 640 to 6,400 GPU-hours. The training dataset consists of pg19 books and Long-Data-Collections, both available on Hugging Face, tokenized into chunks of 65,536 tokens. Fine-tuning with YaRN on less than 0.1% of the original pre-training data volume achieves the reported perplexity numbers.
The YaRN repository published fine-tuned model variants for:

- Llama 2 7B at 64K and 128K context
- Llama 2 13B at 64K and 128K context
- Mistral 7B v0.1 at 64K and 128K context
All models were published on Hugging Face under appropriate licenses and the repository linked to their checkpoints.
Llama 3.1, released by Meta in July 2024, extended the context window from the 8K of Llama 3 to 128K tokens. Meta's approach used a multi-stage long-context pre-training process with six progressive stages from 8K to 128K across 800 billion training tokens. The RoPE base frequency was dramatically increased to 500,000 (versus 10,000 in the original LLaMA) to provide a better initialized positional encoding for extended contexts. YaRN-compatible NTK-by-parts scaling and temperature correction informed the adaptation strategy at each stage. Llama 3.1 became the first widely adopted open-weight model series with a production-grade 128K context window, and its technical report explicitly referenced YaRN as central to the context extension methodology.
DeepSeek V3, the large mixture-of-experts model released by DeepSeek in December 2024 (arXiv:2412.19437), applies YaRN explicitly in two post-pre-training context extension phases. After initial pre-training, the model undergoes two sequential 1,000-step fine-tuning phases: in the first, the sequence length target is 32K tokens with a batch size of 1,920; in the second, it is extended to 128K tokens with a batch size of 480. Both phases use the YaRN temperature formula sqrt(1/t) = 0.1 * ln(s) + 1, applied to the decoupled shared key within DeepSeek's Multi-Head Latent Attention (MLA) architecture. The same YaRN configuration was inherited from DeepSeek V2, indicating that the technique's stability enabled direct reuse across architectural generations of the model series.
Alibaba's Qwen family uses YaRN as an optional context extension mechanism. The native context length of most Qwen 3 models is 32,768 tokens. Applying YaRN rope scaling with factor 4.0 and original_max_position_embeddings = 32,768 extends the effective context to 131,072 tokens. The standard config entry is rope_scaling: {factor: 4.0, original_max_position_embeddings: 32768, type: "yarn"}. Qwen 2.5 followed the same pattern. The design choice to make YaRN optional -- off by default, activated via configuration -- reflects a practical trade-off: standard-length inputs benefit from the full pre-trained perplexity without context compression artifacts, while users needing longer contexts can activate the extension.
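Assuming a recent version of Hugging Face Transformers, which accepts config overrides as from_pretrained keyword arguments, activation looks roughly like this (the checkpoint name is an example; the rope_scaling keys follow the Qwen-documented entry quoted above):

```python
from transformers import AutoModelForCausalLM

# Sketch: override rope_scaling at load time rather than editing
# config.json; extends a 32K-native Qwen model to 131,072 tokens.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    rope_scaling={
        "type": "yarn",
        "factor": 4.0,
        "original_max_position_embeddings": 32768,
    },
)
```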
The YaRN paper included extended Mistral 7B v0.1 variants at 64K and 128K context, demonstrating that the technique transferred across model families from LLaMA to Mistral. The Mistral architecture uses Grouped Query Attention and (in v0.1) a sliding window attention pattern, which required some adaptation. Mistral 7B v0.2, released by Mistral AI in December 2023, adopted a 32K context window with a substantially increased rope_theta of 1,000,000 and removed sliding window attention entirely. This architecture was designed with long-context compatibility in mind, informed in part by the YaRN findings on Mistral 7B.
Beyond the major commercial model families, YaRN found wide adoption in the independent fine-tuning community. Because the technique requires very little compute, individual researchers and smaller organizations could apply it to any RoPE-based model without the GPU clusters required for full pre-training. The YaRN repository's training scripts (built on DeepSpeed and Hugging Face accelerate), together with the publicly available tokenized training datasets, made long-context fine-tuning accessible at the scale of a single multi-GPU workstation. Community-extended model variants appeared for Falcon, CodeLlama, and other base models within weeks of the paper's release. Inference framework integration -- llama.cpp, vLLM, Hugging Face Transformers -- followed shortly after, further democratizing deployment.
LongRoPE, introduced by Ding et al. at Microsoft Research (arXiv:2402.13753, accepted at ICML 2024), extended the YaRN insight while addressing its primary limitation: the reliance on human-designed heuristics (alpha and beta) for frequency partitioning. LongRoPE replaces the ramp function with an evolutionary search algorithm that discovers per-dimension rescaling factors lambda_i automatically, optimized to minimize perplexity on the target context length. It also introduces a second type of non-uniformity: initial token positions receive less aggressive interpolation than later positions, preserving the strong attention to sequence beginnings that is important for instruction following and context framing.
LongRoPE was integrated into Microsoft's Phi-3 model family and demonstrated context extension to 2 million tokens. LongRoPE's advantages over YaRN are most pronounced in zero-shot settings (no fine-tuning) and at extreme extension factors of 64x or more. At moderate extension factors with fine-tuning, the practical performance gap narrows. The progression from PI to NTK to NTK-by-parts to YaRN to LongRoPE illustrates the field's movement from uniform scaling to hand-crafted non-uniform rules to automated per-dimension optimization.
LongRoPE2 (arXiv:2502.20082) further refined the automated search process and reported consistent improvements over both YaRN and original LongRoPE across all evaluated context lengths. The paper also clarified when YaRN's human-designed heuristics perform sub-optimally, noting that at very large extension ratios YaRN can underperform simpler methods like PI on certain benchmarks if the heuristic parameters are not well-tuned for the specific model and extension factor.
The dynamic scaling concept from the YaRN paper became a standard configuration option in production inference frameworks. llama.cpp, vLLM, and Hugging Face Transformers all support rope_scaling.type = "yarn" or "dynamic" options that implement dynamic YaRN at inference time. Configuration files for widely used models, including DeepSeek V3 and Phi-3, specify YaRN parameters that the inference backend reads to apply the appropriate frequency blending and temperature correction at runtime without modifying model weights.
Heuristic parameter selection. The ramp function parameters alpha and beta must be chosen empirically for each model family. Although alpha = 1 and beta = 32 work well for LLaMA-family models, they are not theoretically derived and may be sub-optimal for models with significantly different architectures, base frequencies, or pre-training distributions. LongRoPE's automated search was directly motivated by this limitation.
Perplexity degradation on standard benchmarks. Even with minimal regression, extending context introduces a small but measurable quality decrease on short-context benchmarks at higher extension factors. At s = 32, MMLU drops from 43.8 to 41.7 and TruthfulQA from 39.0 to 37.3. Users who do not need long context should use model variants without YaRN applied, as the temperature and frequency modifications have a small but nonzero effect on standard-length inference.
Extrapolation ceiling. YaRN enables fine-tuning at one context length and inference at a somewhat longer length, but this extrapolation is not unlimited. Very large extension ratios (32x or more) require fine-tuning at or near the target length. The "train short, test long" capability is most reliable for extensions up to roughly 1.5-2x the fine-tuning length. Beyond that, significant perplexity degradation occurs.
RoPE specificity. YaRN applies only to RoPE-based models. Models using learned absolute positional embeddings, sinusoidal encodings, ALiBi, or other positional schemes require entirely different approaches. While the RoPE ecosystem covers most major open-weight models as of 2024, this limits YaRN's applicability to that subset.
Inference memory cost unchanged. YaRN dramatically reduces the training cost of context extension but has no effect on inference memory cost. Processing a 128K token context still requires the full key-value cache for all 128K positions. In standard attention, KV cache memory scales linearly with sequence length (quadratic attention computation is addressed by Flash Attention, but memory remains linear per layer). Practical deployment of YaRN-extended models at very long contexts still requires careful KV cache management, quantized caches, or architectural techniques such as sliding window attention or grouped key-value sharing.
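To make the linear scaling concrete, a back-of-envelope calculation, assuming Llama 2 7B-like shapes (32 layers, 32 KV heads, head dimension 128) and an fp16 cache:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2 covers keys and values; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# 128K context with Llama-2-7B-like shapes: 64 GiB for a single sequence
print(kv_cache_bytes(131072, 32, 32, 128) / 2**30)   # 64.0
```

This is why grouped key-value sharing (fewer KV heads) and quantized caches, mentioned above, matter so much at YaRN-extended lengths.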
Static versus dynamic trade-off. The static (fine-tuned) and dynamic variants serve different deployment needs. Static YaRN gives the best quality but requires a separate trained model variant for each target context length. Dynamic YaRN avoids fine-tuning but shows higher perplexity at very long contexts. Some inference frameworks, including vLLM in 2024, initially supported only static YaRN, requiring operators to pre-specify the target context length at model load time rather than adapting per-request.
Temperature formula generalization. The empirical formula sqrt(1/t) = 0.1 * ln(s) + 1 was validated on Llama 2 model sizes from 7B to 70B. For model families with substantially different pre-training data, tokenization, or architecture, the optimal temperature may differ. DeepSeek V2 and V3 used the same formula, suggesting it generalizes reasonably, but the theoretical basis for this generalization is not fully established.