LongRoPE
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,518 words
LongRoPE is a context-window extension technique for large language models (LLMs) that use rotary position embeddings (RoPE). Introduced by researchers at Microsoft Research in February 2024, it uses an evolutionary search algorithm to find non-uniform rescaling factors for each RoPE dimension and for different token position ranges, then briefly fine-tunes the model so that its effective context window can be expanded far beyond the length seen during pre-training.[^1][^2] The original paper, "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens," reported extending Llama 2 7B from a 4k (4,096-token) window to 2,048k tokens (a 512x increase), with up to roughly 1,000 fine-tuning steps performed at sequence lengths capped at 256k.[^1][^3] LongRoPE was accepted at the 41st International Conference on Machine Learning (ICML 2024) and was subsequently used to produce the 128k-context members of the Phi-3 family, including Phi-3-mini-128k-instruct.[^2][^4][^5] A follow-up paper, LongRoPE2 (February 2025, ICML 2025), refined the search using a "needle-driven" perplexity signal and introduced a mixed-context-window training scheme, reaching 128k on Llama-3 8B while retaining over 98.5% of the original short-context performance with only 10 billion training tokens.[^6][^7][^8]
| Property | Value |
|---|---|
| Type | RoPE rescaling and context-extension method |
| Original creators | Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, Mao Yang |
| Organization | Microsoft Research |
| First paper | arXiv:2402.13753, February 21, 2024 |
| Venue | ICML 2024 (PMLR vol. 235, pp. 11091 to 11104) |
| Code | github.com/microsoft/LongRoPE (MIT License) |
| Reference models tested | Llama 2 7B, Mistral 7B |
| Maximum reported context | 2,048k tokens (Llama 2 7B) |
| Follow-up | LongRoPE2 (arXiv:2502.20082, February 27, 2025; ICML 2025) |
| Notable downstream use | Phi-3-mini/small/medium/vision 128k-instruct variants |
RoPE, the rotary position embedding introduced by Su and colleagues for the RoFormer architecture, is the dominant positional-encoding scheme in modern decoder-only transformers, including the LLaMA, Mistral, and Phi model families.[^1][^9] RoPE encodes the absolute position of a token by rotating each pair of query and key dimensions by an angle equal to the token's position index multiplied by a per-pair inverse frequency derived from a fixed base. Self-attention then expresses relative-position information through the inner product of rotated queries and keys: the attention score between two tokens depends only on their relative offset and not on their absolute positions, even though the embedding itself is computed from absolute positions.[^1][^9]
A query vector q at position m and a key vector k at position n are split into d/2 two-dimensional pairs; each pair (i = 0, 1, ..., d/2 - 1) is rotated by an angle m * theta_i and n * theta_i respectively, with inverse frequencies theta_i = base^(-2i/d) and base typically set to 10,000. The inner product of the rotated vectors then depends only on (m - n) and on the theta_i, which is the relative-position property that lets RoPE generalize to many positional offsets.[^1][^9] In practice, however, transformers trained at a maximum length L (for example, L = 4,096 for Llama 2) generalize poorly beyond L. Two failure modes drive this drop in quality. First, attention scores become noisy and out-of-distribution because positions beyond L have never been observed during training; second, the model has not learned to discriminate among positions in the extrapolated range. Naive extrapolation causes catastrophic spikes in attention values and rapid degradation of perplexity once the input length exceeds L.[^1][^3][^9]
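The relative-position property described above can be checked numerically. The following is a minimal sketch in plain Python, assuming the adjacent-pair layout (dimensions 2i and 2i+1 form pair i); some implementations pair dimensions differently. The vectors `q` and `k` and the helper names are illustrative, not taken from any library.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair of `vec` by position * theta_i,
    with theta_i = base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        angle = position * theta
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary example vectors with head dimension d = 8.
q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.5, 0.7]

# Same offset (m - n = 3) at two very different absolute positions.
score_near = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_far = dot(rope_rotate(q, 1010), rope_rotate(k, 1007))
print(score_near, score_far)  # equal up to floating-point error
```

The two scores agree because rotating q by m * theta_i and k by n * theta_i leaves an inner product that only involves (m - n) * theta_i in each pair.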
A line of work attempted to address this by interpolating or rescaling RoPE so that an extended sequence of length L' > L maps back into the model's familiar position range. The earliest method, Position Interpolation (PI) by Chen and colleagues, divides every RoPE angle by the scaling factor s = L'/L, treating all dimensions identically.[^1][^9] PI requires a few thousand fine-tuning steps at L' to converge but reliably extends Llama-2 from 4k to 32k or 100k.[^1] NTK-aware interpolation, proposed in community blog posts on the LocalLLaMA forum, instead rescales the base frequency so that low-frequency (high-index) dimensions are interpolated while high-frequency (low-index) dimensions remain close to extrapolation; the reasoning is that high-frequency dimensions carry local short-range information that the model already knows how to use and should not be compressed.[^1][^9] YaRN, by Peng and colleagues, refined the idea further: it partitions the RoPE dimensions into three frequency groups and applies pure extrapolation to the highest-frequency group, NTK interpolation to the middle group, and linear (PI-style) interpolation to the lowest-frequency group, with a temperature correction in the attention softmax to compensate for the reduced spread of dot products at longer contexts.[^1][^9]
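To make the difference between PI and NTK-aware scaling concrete, the sketch below computes the effective per-dimension interpolation factor each one implies. The NTK base adjustment used here, base * s^(d/(d-2)), is the form commonly quoted in community write-ups and is an assumption of this example rather than something stated in this article; the YaRN grouping and temperature term are omitted.

```python
def inv_freqs(d, base=10000.0):
    """Original RoPE inverse frequencies theta_i = base^(-2i/d)."""
    return [base ** (-2.0 * i / d) for i in range(d // 2)]

def pi_freqs(d, s, base=10000.0):
    """Position Interpolation: every frequency is divided by the same s."""
    return [theta / s for theta in inv_freqs(d, base)]

def ntk_freqs(d, s, base=10000.0):
    """NTK-aware scaling: enlarge the base instead of dividing positions.
    Assumed form: new_base = base * s^(d / (d - 2))."""
    new_base = base * s ** (d / (d - 2))
    return inv_freqs(d, new_base)

d, s = 128, 8.0
orig = inv_freqs(d)
pi_factors = [o / p for o, p in zip(orig, pi_freqs(d, s))]
ntk_factors = [o / n for o, n in zip(orig, ntk_freqs(d, s))]

print(pi_factors[0], pi_factors[-1])    # uniform: s everywhere
print(ntk_factors[0], ntk_factors[-1])  # about 1 at i = 0, about s at the last pair
```

Under this form, the highest-frequency pair is left essentially untouched while the lowest-frequency pair is interpolated by the full factor s, which is the behaviour the paragraph above describes.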
LongLoRA (Chen and colleagues, 2023) takes an orthogonal route by combining shifted-sparse attention with LoRA adapters during fine-tuning, reducing the cost of long-context fine-tuning by an order of magnitude, but it still depends on a base RoPE-rescaling step (typically PI or NTK). PoSE (Zhu and colleagues, 2024) uses positional skipping to simulate longer contexts during fine-tuning without actually feeding ultra-long sequences: short segments are concatenated with manipulated position indices so that the model sees the full range of positions in a long context without having to process all the tokens.[^1][^9]
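The positional-skipping idea can be illustrated with a simplified two-chunk sketch: a short training window is split once, and the second chunk's position indices are shifted by a random skip so that indices anywhere in the target range are seen. The function name and the way the split and skip are sampled are assumptions for illustration and do not reproduce the exact sampling used by PoSE.

```python
import random

def pose_position_ids(train_len, target_len, rng):
    """Return `train_len` position ids drawn from [0, target_len).
    The first chunk keeps its ids; the second chunk is shifted by a skip."""
    split = rng.randint(1, train_len - 1)   # boundary between the two chunks
    max_skip = target_len - train_len       # keeps the last id below target_len
    skip = rng.randint(0, max_skip)
    first = list(range(split))
    second = [p + skip for p in range(split, train_len)]
    return first + second

rng = random.Random(0)
ids = pose_position_ids(train_len=16, target_len=128, rng=rng)
print(ids)
```

The model still processes only 16 tokens, but the position ids it receives can reach close to the 128-token target.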
The LongRoPE authors observed two limitations shared by PI, NTK, YaRN, and PoSE. The first is that the rescaling functions they apply are uniform across token positions and only partially differentiated across RoPE dimensions; in particular, the higher-frequency dimensions that carry fine-grained local structure are under-utilized because uniform interpolation compresses them along with the lower-frequency ones. The second is that the choice of rescaling factors is set by closed-form heuristics rather than searched empirically against the actual model, which leaves performance on the table when targeting extreme extension ratios.[^1][^3][^9] The authors framed this observation as a "complex non-uniform information entropy" in the transformer's use of positional embeddings, arguing that the optimal rescaling depends both on the RoPE dimension index and on the specific position range, and that no closed-form heuristic could capture this dependence as well as an empirical search guided by the model's own perplexity.[^1][^3]
A third practical constraint, also acknowledged in the introduction of the LongRoPE paper, is the scarcity and cost of training data at extreme lengths. Open long-text corpora contain few documents longer than a few hundred thousand tokens, and the GPU-memory and compute requirements of fine-tuning on million-token sequences are prohibitive. Any method that aspires to extend a model from 4k to 2,048k must therefore avoid training on sequences anywhere near 2,048k, which rules out direct PI-style fine-tuning at the target length and motivates LongRoPE's combination of search plus short fine-tuning at intermediate lengths.[^1][^3]
LongRoPE is built on three intertwined ideas: a non-uniform rescaling parametrization, an evolutionary search to find good rescaling factors per dimension and per position range, and a progressive two-stage extension procedure followed by short-context recovery.[^1][^3]
RoPE assigns each pair of dimensions an inverse frequency theta_i = base^(-2i/d), where d is the per-head dimension and i indexes dimension pairs. PI rescales the rotation angle for position m and dimension i from m * theta_i to (m/s) * theta_i for a single scalar s. LongRoPE replaces the scalar s with a vector of per-dimension rescaling factors lambda_i, so the rotation angle becomes (m / lambda_i) * theta_i.[^1][^3]
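As a small illustration of this parametrization, the sketch below computes the rotation angles (m / lambda_i) * theta_i for one position. The function name and the example lambda vectors are made up for illustration; PI appears as the special case where every lambda_i equals s.

```python
def rescaled_angles(position, lambdas, d, base=10000.0):
    """Rotation angle per dimension pair: (position / lambda_i) * theta_i."""
    assert len(lambdas) == d // 2
    return [(position / lam) * base ** (-2.0 * i / d)
            for i, lam in enumerate(lambdas)]

d = 8
pi_like = [4.0] * (d // 2)          # PI with s = 4: one scalar for all pairs
non_uniform = [1.0, 1.5, 3.0, 4.0]  # illustrative per-dimension factors

angles_pi = rescaled_angles(100, pi_like, d)
angles_nu = rescaled_angles(100, non_uniform, d)
print(angles_pi[0], angles_nu[0])   # 25.0 versus 100.0 for the first pair
```

With the non-uniform vector, the highest-frequency pair keeps its original angle while the lowest-frequency pair is compressed as much as PI would compress it.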
On top of this, LongRoPE introduces a small set of initial token positions, of size n-hat, that are left un-interpolated. The motivation is empirical: initial tokens in a sequence receive disproportionately high attention weights (a phenomenon related to "attention sinks"), so distorting their position embeddings is more harmful than distorting the embeddings of later tokens. The number n-hat is itself one of the parameters that the search optimizes, and it tends to grow with the target extension ratio.[^1][^3]
The full search space therefore consists of d/2 continuous rescaling factors lambda_i, constrained to be monotonically non-decreasing in i (so that high-index, low-frequency dimensions are not interpolated less than lower-index, higher-frequency dimensions), plus the discrete choice of n-hat. The authors report search ranges of lambda_i in [1.0, 1.25 * s] with step 0.01 and n-hat drawn from a discrete set such as {0, 2, 4, 8, 16, ..., 256}.[^3]
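A minimal sketch of this search space follows, assuming that positions below n-hat use the original angle and all later positions use the rescaled one. The helper names are hypothetical, sorting is just one simple way to satisfy the monotonicity constraint, and the n-hat choices passed in the example are an illustrative subset of the set quoted above.

```python
import random

def is_valid(lambdas, s):
    """Check the range [1, 1.25*s] and the non-decreasing constraint."""
    in_range = all(1.0 <= lam <= 1.25 * s for lam in lambdas)
    monotone = all(a <= b for a, b in zip(lambdas, lambdas[1:]))
    return in_range and monotone

def angle(position, i, lambdas, n_hat, d, base=10000.0):
    """Rotation angle for pair i; the first n_hat positions are not interpolated."""
    theta = base ** (-2.0 * i / d)
    if position < n_hat:
        return position * theta
    return (position / lambdas[i]) * theta

def sample_candidate(d, s, rng, n_hat_choices):
    """Draw factors on a 0.01 grid in [1, 1.25*s], sorted to be non-decreasing."""
    steps = int(round((1.25 * s - 1.0) * 100))
    raw = [(100 + rng.randint(0, steps)) / 100 for _ in range(d // 2)]
    return sorted(raw), rng.choice(n_hat_choices)

rng = random.Random(0)
choices = (0, 2, 4, 8, 16, 256)   # illustrative subset only
lambdas, n_hat = sample_candidate(d=8, s=8.0, rng=rng, n_hat_choices=choices)
print(lambdas, n_hat, is_valid(lambdas, 8.0))
```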
Because the search space is exponential in the number of dimensions, exhaustive enumeration is infeasible. LongRoPE uses an evolutionary algorithm with population size P = 64 (halved to 32 for extensions beyond 512k), mutation pool of size 16, crossover pool of size 16, mutation probability 0.3, and up to 40 iterations. The initial population is seeded with the rescaling vectors that PI, NTK, and YaRN would prescribe, plus random perturbations of those seeds. Each candidate is scored by its perplexity on a held-out validation set sampled from PG-19 at the target context length; the lowest-perplexity candidates survive to the next generation.[^3][^10]
A monotonicity constraint on lambda_i (lambda_i is less than or equal to lambda_{i+1}) prunes the search space and prevents nonsensical configurations in which a higher-frequency dimension would be interpolated more aggressively than a lower-frequency one. The constraint reflects the empirical finding, also visible in the YaRN derivation, that low-frequency dimensions tolerate more interpolation while high-frequency dimensions need less.[^1][^3]
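The search loop can be sketched as follows. This is a toy under stated assumptions, not the released implementation: `toy_fitness` is a hypothetical stand-in for perplexity at the target length (a real run would evaluate the model), the interpretation of the mutation and crossover pools is one plausible reading of the hyperparameters above, and the monotonicity constraint is enforced by sorting after each operation.

```python
import random

def repair(lambdas, s):
    """Clamp each factor to [1, 1.25*s] and sort so lambda_i <= lambda_{i+1}."""
    return sorted(min(max(lam, 1.0), 1.25 * s) for lam in lambdas)

def mutate(parent, s, rng, prob=0.3):
    """Resample each factor with probability `prob`, then repair."""
    child = [rng.uniform(1.0, 1.25 * s) if rng.random() < prob else lam
             for lam in parent]
    return repair(child, s)

def crossover(a, b, s, rng):
    """Take each factor from one of two parents at random, then repair."""
    return repair([x if rng.random() < 0.5 else y for x, y in zip(a, b)], s)

def evolve(fitness, seeds, s, rng, population=64, n_mut=16, n_cross=16,
           iterations=40):
    pop = [repair(seed, s) for seed in seeds]
    while len(pop) < population:               # fill with perturbed seeds
        pop.append(mutate(rng.choice(seeds), s, rng))
    for _ in range(iterations):
        pop.sort(key=fitness)                  # lower score is better
        parents = pop[:max(n_mut, 2)]
        children = [mutate(rng.choice(parents), s, rng) for _ in range(n_mut)]
        children += [crossover(*rng.sample(parents, 2), s, rng)
                     for _ in range(n_cross)]
        pop = sorted(pop + children, key=fitness)[:population]
    return pop[0]

# Hypothetical stand-in for "perplexity on the validation set".
TARGET = [1.0, 1.2, 2.0, 3.5, 5.0, 6.5, 7.5, 8.0]

def toy_fitness(lambdas):
    return sum((a - b) ** 2 for a, b in zip(lambdas, TARGET))

s = 8.0
seeds = [[s] * 8,                              # PI-like: uniform factor
         [s ** (i / 7) for i in range(8)]]     # NTK-like: 1 rising to s
best = evolve(toy_fitness, seeds, s, random.Random(0))
print(best, toy_fitness(best))
```

Because the best candidates are carried over unchanged between generations, the result is never worse than the best seed, which mirrors the role of the PI, NTK, and YaRN seeds in the initial population.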
Direct evolutionary search for extreme extensions (such as 512x to 2,048k) is unstable because no fine-tuning has prepared the model for any long-range positions. LongRoPE addresses this with a progressive two-stage procedure.[^1][^3]
In the first stage, the search is run on the pre-trained 4k-context base model with a target of 256k. The resulting rescaling vector is applied to the model and used during a short fine-tune (the paper reports 400 steps at sequence length 128k followed by 600 steps at sequence length 256k, totaling at most around 1,000 optimizer steps). The training data is drawn from RedPajama for Llama-2 and from Together AI's long-data collection for Mistral.[^1][^3]
In the second stage, the 256k-fine-tuned model is treated as a new base, and a second evolutionary search is run directly on it with a target of 2,048k. Because the model has already adapted to 256k positions, an additional 8x rescaling is within reach without any further fine-tuning. The final model thus supports a 2,048k context window even though gradient updates were only ever taken on sequences of at most 256k.[^1][^3]
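The extension ratios in the two stages can be worked out directly. The sketch below assumes that "k" denotes 1,024 tokens, which is what makes the overall ratio come out to 512x; the step counts are those quoted above.

```python
K = 1024                       # assumption: "k" means 1,024 tokens
base_len = 4 * K               # pre-training window
stage1_len = 256 * K           # first search target and fine-tuning cap
stage2_len = 2048 * K          # second search target, no further fine-tuning

ratio_stage1 = stage1_len // base_len    # 64x, searched on the base model
ratio_stage2 = stage2_len // stage1_len  # 8x, searched on the fine-tuned model
ratio_total = stage2_len // base_len     # 512x overall
total_steps = 400 + 600                  # fine-tuning steps at 128k and 256k

print(ratio_stage1, ratio_stage2, ratio_total, total_steps)
```

Splitting the 512x extension into 64x and 8x keeps each individual search within a range the model can handle at that point in the procedure.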
Aggressive interpolation harms short-context behaviour because rotation angles for early positions are distorted relative to the pre-training distribution. To recover original quality on inputs of a few thousand tokens, the authors run a third evolutionary search at 4k and 8k targets, this time with a tighter upper bound on lambda_i so that the interpolation is gentle. At inference time, the model dynamically selects which rescaling vector to use based on the input length, applying the long-context configuration only when the prompt exceeds the original training window.[^1][^3][^10]
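One plausible way to express the inference-time switching is to pick the configuration with the smallest target length that still covers the input. The sketch below is an assumption about how such a rule could look, with placeholder configuration names; it is not taken from the paper's code.

```python
def select_config(seq_len, configs):
    """Pick the configuration with the smallest window that covers seq_len.
    `configs` is a list of (max_len, config) pairs."""
    for max_len, config in sorted(configs, key=lambda item: item[0]):
        if seq_len <= max_len:
            return config
    raise ValueError(f"sequence length {seq_len} exceeds every configured window")

# Placeholder names standing in for searched rescaling vectors.
configs = [
    (4 * 1024, "short-4k"),
    (8 * 1024, "short-8k"),
    (2048 * 1024, "long-2048k"),
]

print(select_config(1000, configs))     # short-4k
print(select_config(6000, configs))     # short-8k
print(select_config(100000, configs))   # long-2048k
```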
The LongRoPE paper evaluates on two base models: Llama 2 7B and Mistral 7B. The principal evaluations are perplexity on Books3 at increasing evaluation lengths, passkey retrieval at lengths up to 2,048k, and standard short-context benchmarks (ARC-Challenge, HellaSwag, MMLU, TruthfulQA) run on the same models after extension.[^1][^3]
On Books3 at a 256k evaluation length, LongRoPE-2048k (with stage-one fine-tuning at 256k) reports a perplexity of 1.87, compared to 99.64 for YaRN tuned for 128k and 246.45 for PI tuned for 100k.[^3] The gap is even larger at longer evaluation lengths: prior methods diverge to triple-digit perplexity once the evaluation length exceeds their fine-tune length, whereas LongRoPE maintains low single-digit perplexity throughout the 256k to 2,048k range. The paper also presents perplexity sweeps on Proof-Pile and PG-19 at lengths from 4k up to 2,048k, with the qualitative picture matching the Books3 numbers.[^1][^3][^4]
The passkey retrieval task, originally introduced for evaluating long-context Transformers, embeds a short random secret inside an otherwise irrelevant long document and asks the model to retrieve it. On this task, LongRoPE-Llama-2-2048k maintains accuracy above 90% from 4k to 2,048k tokens, while the Mistral-based variant maintains 100% accuracy up to roughly 1,800k and falls to about 60% at 2,048k.[^3] These numbers are the basis for the claim that the technique delivers a "usable" 2M-token context window on Llama-2 7B, rather than only matching baseline perplexity at long lengths.[^1][^3]
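A passkey prompt of the kind described above can be built as in the following sketch. The filler text and instruction wording are approximations of the commonly used template and should be treated as assumptions; the function names are illustrative, and scoring a real model would replace the string passed to `is_correct`.

```python
import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def build_passkey_prompt(n_filler, depth, rng):
    """Hide a random 5-digit passkey at relative `depth` (0.0 to 1.0)
    inside `n_filler` repetitions of irrelevant text."""
    passkey = rng.randint(10000, 99999)
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    insert_at = int(n_filler * depth)
    parts = [FILLER] * insert_at + [needle] + [FILLER] * (n_filler - insert_at)
    prompt = ("There is important info hidden inside a lot of irrelevant text. "
              "Find it and memorize it.\n"
              + "".join(parts)
              + "\nWhat is the pass key? The pass key is")
    return prompt, passkey

def is_correct(model_output, passkey):
    """Count a trial as correct if the passkey appears in the model output."""
    return str(passkey) in model_output

prompt, passkey = build_passkey_prompt(n_filler=200, depth=0.5, rng=random.Random(0))
print(len(prompt), passkey)
```

Accuracy at a given context length is then the fraction of trials, across several depths and random passkeys, for which `is_correct` returns true.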
On standard 4k-window benchmarks, the extended Llama-2 LongRoPE model scores in the range of 51.0 to 52.9 on ARC-Challenge, 75.3 to 76.5 on HellaSwag, 39.6 to 43.4 on MMLU, and 37.3 to 38.8 on TruthfulQA; the Mistral LongRoPE model lands in the range of 59.0 to 59.2 on ARC-Challenge, 80.9 to 81.2 on HellaSwag, 61.1 to 61.3 on MMLU, and 42.2 to 43.1 on TruthfulQA, in some cases slightly above the original short-context baseline.[^3] The bottom of those ranges corresponds to applying the long-context rescaling at short input length without the dedicated short-context recovery search; the top corresponds to the dynamically swapped configuration described in the previous section.[^1][^3]
These results were highlighted in independent third-party summaries. The Graphcore "Papers of the Month" review from February 2024 noted that LongRoPE "perplexity tends to be better compared to other extension methods" up to 256k tokens, while also flagging that the short-context degradation can be substantial in some configurations (cited as a drop from 46.6% to 39.6% on MMLU with Llama).[^10] The same review credited the method's evolutionary search as a "smarter scaling discovery" relative to YaRN's hand-set frequency groups, and identified the protection of initial-token positions and the dynamic per-length switching as the two practical refinements that distinguish LongRoPE most clearly from earlier work.[^10] Hugging Face's paper page for LongRoPE has aggregated discussions and reproduced figures from the paper, and the OpenReview-style materials accompanying the ICML 2024 publication document a similar comparison set.[^9]
The following table summarizes the principal differences among PI, NTK, YaRN, PoSE, LongLoRA, and LongRoPE on RoPE-based LLMs.
| Method | Rescaling granularity | Position-range awareness | Search procedure | Fine-tuning requirement |
|---|---|---|---|---|
| Position Interpolation (PI) | Uniform scalar s = L'/L across all dimensions[^1][^9] | None[^1] | Closed form[^1] | Fine-tune at L' for a few thousand steps[^1] |
| NTK-aware (NTK) | Per-dimension via rescaled base frequency[^1] | None[^1] | Closed form[^1] | Often zero-shot or short fine-tune[^1] |
| YaRN | Three frequency groups with extrapolation, NTK, linear[^1] | None (attention temperature adjustment)[^1] | Closed form per group[^1] | Fine-tune at L' for a few thousand steps[^1] |
| PoSE | Same as base RoPE rescaling | Indirect (positional skipping)[^1] | Closed form | Fine-tune at short lengths with skipping[^1] |
| LongLoRA | Combined with shifted-sparse attention and LoRA adapters | None[^1] | Closed form for RoPE part | Fine-tune at L' with LoRA[^1] |
| LongRoPE | Per-dimension lambda_i, monotonic[^1][^3] | n-hat un-interpolated initial tokens[^1][^3] | Evolutionary search vs. perplexity[^3] | <=1k steps; multi-stage[^1] |
The Graphcore review describes LongRoPE's contribution succinctly as three refinements over YaRN: smarter scaling discovery via evolutionary search, protected initial tokens, and dynamic per-length rescaling at inference.[^10] None of these ideas requires a new architecture; the model's weights and self-attention code paths remain identical to the base transformer, and only the RoPE module is modified.[^1][^4]
The most prominent production use of LongRoPE has been in Microsoft's Phi-3 family. The Phi-3 technical report describes Phi-3-mini-128K as a long-context variant of the 3.8B-parameter Phi-3-mini built by applying LongRoPE on top of the standard short-context Phi-3-mini, achieving a 128k effective context window "while maintaining performance on par with the 4K version" of the same base model.[^5][^11] The companion 128k-instruct variants for the small (Phi-3-small-128k-instruct), medium (Phi-3-medium-128k-instruct), and vision (Phi-3-vision-128k-instruct) sizes followed the same recipe, making the Phi-3 family one of the few sub-10B-parameter open-weight model families to ship with 128k as a default option at release.[^4][^5] Phi-3-mini-128k-instruct was distributed under an MIT license through Hugging Face and was trained on 4.9 trillion tokens of mixed natural-language and code data on 512 H100-80G GPUs over roughly ten days.[^12]
Practically, the 128k-instruct variants are loaded with the same tokenizer and generation interface as their 4k-instruct counterparts; the only model-config difference is the RoPE configuration block, which specifies the per-dimension rescaling vector and the inference-time switching rule. This is what makes LongRoPE attractive for production deployment: existing serving stacks built on top of the Hugging Face Transformers library, vLLM, or other inference engines can serve a 128k Phi-3 with only a config change rather than a code rewrite, because the rescaling enters as a transformation of the position-dependent rotation angles inside the unchanged RoPE kernel.[^4][^12]
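As a rough illustration of what such a configuration block carries, the sketch below models it as a Python dictionary with a short and a long factor vector and a length threshold. The field names follow what published Phi-3 128k configurations appear to use, but they should be treated as assumptions and checked against the actual `config.json` of the model being served; the factor values are shortened placeholders, not real searched values.

```python
# Illustrative fragment only: real factor vectors have one entry per
# dimension pair, and these numbers are placeholders.
phi3_like_config = {
    "max_position_embeddings": 131072,
    "original_max_position_embeddings": 4096,
    "rope_scaling": {
        "type": "longrope",
        "short_factor": [1.0, 1.0, 1.05, 1.1],
        "long_factor": [1.0, 1.3, 4.0, 32.0],
    },
}

def pick_factors(seq_len, config):
    """Use the long-context factors only beyond the original window."""
    scaling = config["rope_scaling"]
    if seq_len > config["original_max_position_embeddings"]:
        return scaling["long_factor"]
    return scaling["short_factor"]

print(pick_factors(2000, phi3_like_config))
print(pick_factors(50000, phi3_like_config))
```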
The original LongRoPE codebase, hosted at github.com/microsoft/LongRoPE under MIT license, ships reference implementations for the evolutionary search and for applying the resulting rescaling vectors to Llama 2 7B, Mistral 7B, and the Phi-3 family. The repository's evaluation scripts use PG-19 as the search validation set and Proof-Pile as one of the perplexity evaluation sets. The released code documents the same default hyperparameters reported in the paper (population size 64, mutation pool 16, crossover pool 16, 40 generations, mutation probability 0.3) and the two-stage 128k-then-256k fine-tuning schedule.[^4]
Beyond Microsoft's own model family, the LongRoPE recipe has been replicated and adapted by community contributors for other RoPE-based open-weight models. Because the only model-side ingredient is a vector of rescaling factors and a small fine-tune, third parties can apply LongRoPE to any pre-trained RoPE-based transformer without retraining from scratch, which has made it a common reference in subsequent long-context-extension write-ups.[^10]
A follow-up paper, "LongRoPE2: Near-Lossless LLM Context Window Scaling" (arXiv:2502.20082), was submitted on February 27, 2025 and accepted as a poster at ICML 2025.[^6][^7][^8] The authors (Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang) are largely overlapping with the original LongRoPE team, again from Microsoft.[^6][^8]
LongRoPE2 starts from a diagnostic claim: prior RoPE-extension methods, including LongRoPE, still leave the higher-index (low-frequency) RoPE dimensions effectively under-trained at extended lengths, which produces residual out-of-distribution behaviour even after fine-tuning. The paper argues that two things keep earlier methods from correcting this: (i) closed-form rescaling heuristics implicitly treat those dimensions as fully trained and therefore rescale them too little, and (ii) ordinary token-averaged perplexity is an insensitive signal for the rare tokens that actually require long-range retrieval.[^6][^7]
LongRoPE2 introduces three changes on top of the original method:[^6][^7]
1. Needle-driven perplexity for the search fitness. Rather than averaging perplexity across all tokens of a long passage, the evolutionary search scores candidate rescaling vectors on a needle-in-a-haystack-style dataset where specific answer tokens depend on long-range context. This pushes the search toward configurations that improve true long-range retrieval rather than only the easier short-range continuations that dominate token-averaged perplexity.[^6][^7]
2. NTK-style scaling for low dimensions and search-optimized scaling for high dimensions. The paper retains NTK rescaling on the lower-index (high-frequency) dimensions because the closed-form NTK formula already handles them well, and uses evolutionary search only for the higher-index (low-frequency) dimensions that the diagnostic identified as problematic.[^7]
3. Mixed-context-window training. Fine-tuning interleaves short sequences (using the original RoPE) with long sequences (using the rescaled RoPE), so that the model preserves short-context behaviour while learning to use the rescaled embeddings for long inputs. At inference, the model dynamically switches RoPE configurations based on input length, as in LongRoPE.[^6][^7]
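
The difference between token-averaged and needle-driven perplexity can be shown with a small sketch. The log-probabilities below are invented for illustration: candidate A is slightly better on the many easy tokens but poor on the two answer ("needle") tokens, and candidate B is the reverse.

```python
import math

def perplexity(log_probs):
    """exp of the mean negative log-probability over the given tokens."""
    return math.exp(-sum(log_probs) / len(log_probs))

def needle_perplexity(log_probs, needle_mask):
    """Perplexity restricted to tokens flagged as needle (answer) tokens."""
    selected = [lp for lp, is_needle in zip(log_probs, needle_mask) if is_needle]
    return perplexity(selected)

# 98 easy tokens followed by 2 needle tokens (made-up numbers).
mask = [False] * 98 + [True] * 2
candidate_a = [-0.1] * 98 + [-5.0] * 2   # good locally, fails retrieval
candidate_b = [-0.2] * 98 + [-0.5] * 2   # slightly worse locally, retrieves

print(perplexity(candidate_a), perplexity(candidate_b))
print(needle_perplexity(candidate_a, mask), needle_perplexity(candidate_b, mask))
```

In this toy case the token-averaged score slightly prefers candidate A, while the needle-restricted score strongly prefers candidate B, which is the kind of reversal the needle-driven fitness is meant to expose.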
On Llama-3 8B extended to 128k, the paper reports retaining over 98.5% of the original short-context performance using only 10 billion fine-tuning tokens, which it presents as roughly 80x fewer tokens than Meta's published recipe for Llama-3.1's 128k context.[^6][^7] On the RULER benchmark at 128k, third-party summaries cite scores of 82.03 for LongRoPE2 on Llama-3 8B, against 73.40 for the original LongRoPE and 49.39 for YaRN under matched conditions; on Phi-3-mini-3.8B at 128k, LongRoPE2 reports 58.81 against 49.37 for NTK.[^13] Independent coverage in Microsoft Research-focused outlets described LongRoPE2 as a "near-lossless" extension method that recovers most short-context quality lost by prior methods.[^13][^14]
The most direct application of LongRoPE is producing long-context versions of small RoPE-based language models for tasks such as document question answering, multi-document summarization, long-form code understanding, and retrieval-augmented generation over very long contexts. The 128k context window unlocked by LongRoPE in the Phi-3 family is sufficient for processing book-length documents, large code repositories, and long multi-turn conversation histories within a single forward pass, eliminating the need for sliding-window chunking in many practical workflows.[^5][^11][^12]
A second class of applications is research on long-context evaluation itself. Because LongRoPE-extended models exist at lengths well beyond what most benchmarks were designed for, they have served as test subjects for the development of benchmarks like RULER, Needle in a Haystack, LongBench, and InfiniteBench, which probe whether a claimed long-context model actually uses its window or merely tolerates long inputs without crashing. LongRoPE2 explicitly evaluates on this newer generation of benchmarks rather than relying solely on Books3 perplexity.[^7][^13]
A third application is in-context learning at scale. The LongRoPE paper notes that one motivation for extending the context window is to support in-context learning with many examples, where the prompt itself can contain thousands of demonstrations. With a 128k or larger window, downstream users can fit entire few-shot training sets, conversation logs, or document corpora into the prompt without resorting to retrieval pipelines.[^1][^3]
LongRoPE made two practical contributions to the long-context-extension literature. First, it demonstrated empirically that a single small model (Llama 2 7B, at the time one of the open-weight models most commonly fine-tuned in academic settings) could be extended to a 2,048k effective context window with only short and inexpensive fine-tuning, at a moment when state-of-the-art commercial systems were near 128k. This shifted the question of "how long can RoPE go" from one of training-time scaling to one of post-training search and short fine-tuning.[^1][^3]
Second, by treating per-dimension and per-position rescaling factors as variables to be searched against the target model's own perplexity, LongRoPE made it normal to optimize positional-encoding parameters as a separate hyperparameter problem rather than fixing them by closed form. LongRoPE2's needle-driven perplexity extends the same idea to a more discriminative fitness signal, and several follow-on works have adopted variants of the same search-then-fine-tune methodology.[^6][^7][^10]
In the production setting, the most concrete impact has been on the Phi-3 long-context variants, which are widely deployed via Hugging Face, Azure AI Studio, and on-device inference stacks. By preserving short-context quality (Phi-3-mini-128k matches Phi-3-mini-4k on standard benchmarks per the Phi-3 technical report), the technique made 128k context a default option rather than a trade-off for that model family. LongRoPE2 takes the same logic further on Llama-3 8B, recovering near-original short-context behaviour while reaching 128k effective length with roughly one-eightieth of the fine-tuning tokens used by Meta's published recipe for Llama-3.1.[^5][^6][^7][^11][^12]
Several limitations of LongRoPE and LongRoPE2 are acknowledged in the original papers or in third-party reviews.
Short-context degradation is the most discussed. The Graphcore review highlighted a documented case where Llama-2 LongRoPE drops from 46.6% to 39.6% on MMLU compared to the unmodified base.[^10] The authors mitigate this by running an additional short-context evolutionary search and by dynamically swapping rescaling configurations at inference, but the workaround adds engineering complexity and does not always fully close the gap.[^1][^10]
Diminishing returns at the extreme of 2,048k are also visible. The Mistral-based variant of LongRoPE falls from 100% passkey retrieval at 1,800k to about 60% at 2,048k, indicating that the 512x extension ratio is near the limit of what the technique can support cleanly on a 7B model.[^3]
LongRoPE2's own framing implicitly criticizes the original LongRoPE for not accounting for the under-trained higher-index (low-frequency) RoPE dimensions and for relying on token-averaged perplexity in its search. The follow-up paper presents needle-driven perplexity and mixed-context training as direct fixes for those weaknesses, suggesting that the original method's gains beyond a few hundred thousand tokens were partly an artifact of the fitness function rather than a reflection of true long-range understanding.[^6][^7]
Finally, the technique is specific to RoPE-based LLMs and does not transfer directly to models using ALiBi, absolute sinusoidal encodings, or learned positional embeddings; alternative families such as Mamba and RWKV do not need it because they do not use explicit positional encodings, though they have their own long-context trade-offs.[^9]
LongRoPE belongs to a cluster of methods that extend RoPE-based transformers without changing their architecture. Closely related techniques include Position Interpolation, NTK-aware interpolation, YaRN, LongLoRA, and PoSE, all summarized in the comparison table above.[^1][^9] Orthogonal approaches change the attention mechanism rather than the positional encoding: sliding window attention in early Mistral, Ring Attention for distributed long-context training, FlashAttention for kernel-level efficiency, Infini-Attention for compressive memory, and PagedAttention for serving long-context KV caches at scale. Architectural alternatives such as Mamba, Mamba 2, and RWKV avoid quadratic-attention costs entirely.[^9]
The recent RULER benchmark and Needle in a Haystack test suite, both used in LongRoPE2's evaluation, have become the standard tools for measuring whether a claimed long-context model actually retrieves information from far back in its window rather than just keeping perplexity low.[^7][^13]