Self-Extend
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,894 words
Self-Extend (written SelfExtend in the original paper) is a training-free technique that lets a pretrained large language model process inputs much longer than the context window it was trained on. It requires no fine-tuning, only a small change to the attention computation at inference time. It was introduced in the paper "LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning" by Hongye Jin and colleagues at Texas A&M University and Rice University, first posted to arXiv on January 2, 2024, and presented as a spotlight paper at the International Conference on Machine Learning (ICML) in July 2024 [1].
The method rests on a simple claim: a model that fails on long inputs is usually not missing the ability to model long-range dependencies; it is confused by position values it never saw during training. Self-Extend remaps those unfamiliar positions back into the range the model already understands. It does this with a bi-level attention scheme that fuses two components: grouped attention, which applies integer (floor) division to position indices for tokens that are far apart, and ordinary neighbor attention, which keeps exact positions for nearby tokens. Because the change touches only how relative positions are assigned, it adds no parameters and needs no gradient updates, so in principle any rotary position embedding (RoPE) model can be extended with a few lines of code [1].
Modern decoder-only Transformer language models encode token order with relative position information, most commonly through RoPE [4]. RoPE rotates the query and key vectors at each position by an angle proportional to that position, so the attention score between a query and a key depends on their positions only through the distance between them, not through their absolute indices. A model trained on sequences up to length L therefore learns to interpret relative distances in the range 0 to L - 1.
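The relative-position property can be checked numerically. The following is a minimal sketch rather than code from any cited paper: it models a single two-dimensional RoPE feature pair as a complex number, and the names `rotate`, `score`, and the frequency `theta` are illustrative assumptions.

```python
import cmath

def rotate(vec: complex, pos: int, theta: float = 0.1) -> complex:
    """Rotate one 2-D feature pair (stored as a complex number) by pos * theta."""
    return vec * cmath.exp(1j * pos * theta)

def score(q: complex, k: complex, q_pos: int, k_pos: int) -> float:
    """Dot product of the rotated query and key: Re(q_rot * conj(k_rot))."""
    return (rotate(q, q_pos) * rotate(k, k_pos).conjugate()).real

# Hypothetical query and key features.
q, k = complex(0.3, -1.2), complex(0.7, 0.5)

near = score(q, k, q_pos=10, k_pos=4)        # offset 6 near the start
far = score(q, k, q_pos=5000, k_pos=4994)    # same offset, much later in the sequence
other = score(q, k, q_pos=10, k_pos=5)       # offset 5, a different relative distance
```

Here `near` and `far` agree up to floating-point error because both pairs are six positions apart, while `other` differs because its offset is different.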
The failure at inference time is a distribution-shift problem. When an input is longer than L, pairs of tokens can be separated by distances larger than any seen during pretraining. RoPE produces rotation angles for these distances that the model never learned to read, so the attention logits become unreliable. In practice perplexity climbs steeply once the input passes the training length, and tasks that require reading information from far back in the sequence collapse [1]. The authors frame this as a positional out-of-distribution (O.O.D.) problem: the parameters can still represent long-range relationships, but the unseen large relative positions push the model outside the regime it was optimized for.
This points to a remedy that needs no new training: since every relative position inside 0 to L - 1 is already understood, the fix is to fold the large, unseen distances back into that trusted interval. Self-Extend does this with a discrete mapping that reuses the original positions, rather than rescaling RoPE frequencies.
The core operation is a floor division applied to position indices. For a chosen group size G, each token at position p is reassigned the grouped position floor(p / G). Adjacent groups of G tokens collapse onto a single position value, so two tokens that are far apart in the raw sequence end up with grouped indices whose difference falls back inside the trained range. Concretely, a raw relative distance d becomes roughly floor(d / G), shrinking the maximum distance the model must interpret by a factor of G. With G = 8, a Llama-2 model trained on 4,096 tokens sees grouped distances that stay within its learned window even when the real input is tens of thousands of tokens long [1].
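The grouping step can be illustrated in a few lines. This is a sketch under stated assumptions, not the reference implementation: the function name `grouped_position` and the sample positions are made up for illustration, and G = 8 with a 4,096-token training length mirrors the example in the text.

```python
def grouped_position(p: int, group_size: int) -> int:
    """Map a raw position index to its grouped index via floor division."""
    return p // group_size

G = 8             # group size from the example above
TRAIN_LEN = 4096  # trained context length of the example model

positions = [0, 7, 8, 15, 16, 20000]
grouped = [grouped_position(p, G) for p in positions]
# Positions 0-7 share index 0, positions 8-15 share index 1, and so on.

# Largest grouped distance in a 20,000-token input (first vs. last token).
max_grouped_distance = grouped_position(19999, G) - grouped_position(0, G)
```

With these values the largest grouped distance is 2,499, which lies inside the trained range of 0 to 4,095 even though the raw distance is 19,999.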
The cost of grouping is precision: all G tokens in a group share one position, so the model can no longer tell them apart by order. For distant context this is usually acceptable, since coarse resolution at long range still lets the model locate relevant passages, but it would badly damage the local fluency that depends on knowing exactly which token comes next.
To preserve local precision, Self-Extend keeps a second, ungrouped attention path. Within a neighbor window of size w_n, tokens use their normal, exact relative positions, exactly as the unmodified model would. Outside that window, tokens use the grouped positions. The two are merged into a single attention map per layer and head: for a query and key separated by fewer than w_n tokens the precise relative position is used, and for pairs farther apart the floor-divided grouped position is used. A constant shift is added to the grouped branch so the position values line up continuously at the boundary where the neighbor window ends and grouping begins [1].
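The merge of the two branches can be sketched for a single query-key pair. This is an illustrative sketch, not the released code: `merged_relative_position` is a hypothetical name, and the specific shift value, w_n - floor(w_n / G), is an assumption chosen so that the grouped branch continues from where the neighbor window ends.

```python
def merged_relative_position(q_pos: int, k_pos: int,
                             group_size: int, neighbor_window: int) -> int:
    """Relative position fed to RoPE for one causal query-key pair."""
    distance = q_pos - k_pos
    if distance < 0:
        raise ValueError("causal attention: the key must not come after the query")
    if distance < neighbor_window:
        # Neighbor branch: exact relative position, as in the unmodified model.
        return distance
    # Grouped branch: floor-divided positions plus a constant shift so the
    # values continue from the end of the neighbor window (assumed shift).
    shift = neighbor_window - neighbor_window // group_size
    return (q_pos // group_size + shift) - k_pos // group_size

G, W_N = 4, 8  # small illustrative values
row = [merged_relative_position(40, k, G, W_N) for k in (39, 33, 32, 0)]
# row == [1, 7, 8, 16]
```

For the query at position 40, keys at raw distances 1 and 7 keep their exact distance, the key at distance 8 lands on 8 through the grouped branch, and the key at raw distance 40 is compressed to 16.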
The result is fine-grained positions for the recent context that matters most for next-token prediction and compressed positions for the long tail, all while staying inside the trained range of distances. No weights change; only the integers fed into RoPE are altered, so the modification is purely a re-indexing at attention time.
The two hyperparameters set how far the window can be pushed. The maximum supported length is approximately (L - w_n) × G + w_n, where L is the original context length. A larger G extends the reach but loses positional resolution; a larger w_n preserves more exact local context but leaves less of the trained range available for the grouped portion, so there is a direct trade-off. For Llama-2 the authors report group sizes from 2 to 64 and neighbor windows from 512 to 1,536 tokens as reasonable operating points [1].
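The length formula can be worked through for concrete settings. The sketch below is only arithmetic on the formula above; the function name `max_extended_length` and the particular combinations of L, G, and w_n are illustrative choices, not settings recommended by the paper.

```python
def max_extended_length(train_len: int, group_size: int, neighbor_window: int) -> int:
    """Approximate maximum input length: (L - w_n) * G + w_n."""
    return (train_len - neighbor_window) * group_size + neighbor_window

L = 4096  # trained context length

wide_reach = max_extended_length(L, group_size=8, neighbor_window=1024)
# (4096 - 1024) * 8 + 1024 = 25600

more_local = max_extended_length(L, group_size=8, neighbor_window=1536)
# (4096 - 1536) * 8 + 1536 = 22016

no_grouping = max_extended_length(L, group_size=1, neighbor_window=1024)
# With G = 1 the formula reduces to the original length, 4096.
```

Raising w_n from 1,024 to 1,536 at the same G lowers the reachable length from 25,600 to 22,016, which is the trade-off described above.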
Across language modeling and retrieval tasks, Self-Extend recovers most of the long-context ability that the base model lacks out of the box. On the PG19 long-document corpus, the perplexity of an unmodified model diverges once the input exceeds its training length, while Self-Extend keeps perplexity low and stable well past that point [1]. On the passkey retrieval test, which hides a short secret deep inside a long filler document and asks the model to recover it, Self-Extend achieves near-perfect retrieval at depths and lengths far beyond the native window, where the base model scores near zero [1].
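To make the passkey setup concrete, a test prompt of this kind can be assembled as follows. This is a hedged sketch: the filler sentence, the wording of the passkey line and question, and the helper name `build_passkey_prompt` are assumptions for illustration and may differ from the exact prompt used in the paper.

```python
def build_passkey_prompt(n_filler: int, depth: float, passkey: int) -> str:
    """Hide a passkey line at a relative depth inside repeated filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    filler = "The grass is green. The sky is blue. The sun is yellow."
    lines = [filler] * n_filler
    insert_at = int(depth * n_filler)  # 0.0 = start of document, 1.0 = end
    lines.insert(insert_at, f"The pass key is {passkey}. Remember it.")
    lines.append("What is the pass key?")
    return "\n".join(lines)

def is_correct(model_answer: str, passkey: int) -> bool:
    """Score a response by checking whether it contains the passkey digits."""
    return str(passkey) in model_answer

prompt = build_passkey_prompt(n_filler=100, depth=0.5, passkey=12345)
```

Sweeping `n_filler` controls the document length and sweeping `depth` controls how far back the secret sits, which gives the grid of lengths and depths the test reports.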
On LongBench, a multi-task long-context benchmark, Self-Extend applied at inference matched or surpassed several baselines that required dedicated fine-tuning, while leaving short-context performance essentially unchanged, since inputs that fit inside the neighbor window are processed normally. The paper validates the method on Llama-2-7B and 13B, Mistral-7B-instruct-v0.1 (which itself uses sliding window attention), Phi-2, and SOLAR-10.7B, extending their windows several-fold, for instance taking Llama-2's 4,096-token window into the 16,000 to 25,000 token range [1].
Self-Extend sits in a crowded field of context-extension techniques, most of which modify how RoPE encodes position. It is not the only training-free option, since position interpolation and NTK-aware (dynamic NTK) scaling can also be applied without tuning. Two things distinguish it: its mechanism, which groups positions discretely rather than rescaling frequencies continuously, and its finding that a training-free method can reach quality comparable to approaches that fine-tune.
Position interpolation (PI), proposed by Chen and colleagues at Meta in 2023, linearly downscales all position indices so the maximum stays inside the trained range; it works well but generally needs a short fine-tuning run to recover quality [2]. NTK-aware scaling and YaRN instead adjust the per-dimension RoPE frequencies, with YaRN adding an attention temperature term; both are efficient and YaRN in particular is strong, though best results typically still involve fine-tuning [3]. LongLoRA takes a different route, using a cheap shifted sparse attention to make the fine-tuning itself affordable rather than avoiding it [6].
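The difference between linear downscaling and discrete grouping shows up directly in the position values each produces. The sketch below is illustrative only: the function names and the 4,096-to-16,384 extension are assumptions, and the PI function reflects the linear rescaling described above rather than any particular library's code.

```python
def interpolated_position(p: int, train_len: int, target_len: int) -> float:
    """Position interpolation: scale every index by train_len / target_len."""
    return p * train_len / target_len

def grouped_position(p: int, group_size: int) -> int:
    """Self-Extend grouping: floor-divide the index by the group size."""
    return p // group_size

raw = list(range(5))
pi_positions = [interpolated_position(p, 4096, 16384) for p in raw]
# [0.0, 0.25, 0.5, 0.75, 1.0]  -> fractional positions between trained integers
grouped_positions = [grouped_position(p, 4) for p in raw]
# [0, 0, 0, 0, 1]              -> only integer positions the model has seen
```

Both compress the range by a factor of four here, but PI keeps tokens distinguishable at fractional positions the model was not trained on, while grouping reuses trained integer positions at the cost of merging neighbors.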
A separate family, including StreamingLLM and LM-Infinite, is also training-free but solves a different problem: they keep a rolling window plus a few attention sink tokens so the model can stream over unbounded text with bounded memory, but they discard the middle of the sequence and so cannot integrate information across the whole input [5]. Self-Extend, by contrast, keeps all tokens in view and aims to genuinely extend the usable context, which is why it can solve passkey retrieval across the full length where streaming methods cannot.
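The eviction pattern of the streaming family can be sketched as a simple index selection. This is a simplified illustration, not the StreamingLLM or LM-Infinite code: `streaming_kept_positions`, `n_sink`, and `window` are hypothetical names, and real implementations manage a key-value cache rather than a list of indices.

```python
def streaming_kept_positions(seq_len: int, n_sink: int, window: int) -> list[int]:
    """Token indices still visible under a sink-plus-rolling-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))          # nothing has been evicted yet
    sinks = list(range(n_sink))              # first few tokens are always kept
    recent = list(range(seq_len - window, seq_len))  # most recent tokens
    return sinks + recent

kept = streaming_kept_positions(seq_len=20, n_sink=4, window=6)
# kept == [0, 1, 2, 3, 14, 15, 16, 17, 18, 19]
```

Positions 4 through 13 are no longer attended to, so a fact placed there cannot be retrieved, whereas Self-Extend keeps every position visible through the grouped branch.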
| Method | Year | Training-free | Core mechanism |
|---|---|---|---|
| Position Interpolation (PI) | 2023 | Usually fine-tuned | Linearly downscale position indices [2] |
| NTK-aware / dynamic NTK | 2023 | Yes (degrades) | Rescale RoPE base frequency [1][3] |
| YaRN | 2023 | Usually fine-tuned | NTK-by-parts plus attention temperature [3] |
| LongLoRA | 2023 | No | Cheap fine-tuning via shifted sparse attention [6] |
| StreamingLLM / LM-Infinite | 2023 | Yes | Sliding window plus attention sinks (streaming, not full context) [5] |
| Self-Extend | 2024 | Yes | Grouped (floor-division) plus neighbor attention [1] |
Self-Extend's reach is bounded. The reachable length is fixed by the group size and neighbor window, and pushing G higher eventually coarsens positions enough that quality falls, so the method cannot extend a model indefinitely [1]. Grouping is inherently lossy: tokens that share a grouped position lose their exact ordering at long range, which can hurt tasks that need fine positional discrimination across distant spans.
The technique is tied to relative position encodings, specifically RoPE, and does not transfer directly to models built on absolute position embeddings or on ALiBi. It also adds inference cost: a naive implementation computes attention over the full set of query-key pairs and merges the two branches, so realizing it efficiently requires custom or blocked attention kernels. The released code supports a FlashAttention path, but the method is not a drop-in for the standard fused kernel. Finally, the group size and neighbor window must be chosen per model and target length; the authors give an empirical rule but note it is not universally optimal, so some tuning of these inference-time hyperparameters is expected [1]. Self-Extend elicits latent ability rather than teaching new ability, so it cannot exceed what the original weights, viewed through coarser positions, can support.
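The naive two-branch merge mentioned above can be sketched with plain lists to show where the extra cost comes from. This is an assumption-laden illustration, not the released implementation: `merge_branches` is a hypothetical helper, and the score matrices stand in for attention logits computed once with exact positions and once with grouped positions.

```python
NEG_INF = float("-inf")

def merge_branches(neighbor_scores, grouped_scores, neighbor_window):
    """Pick, per causal pair, the logit from the neighbor or the grouped branch."""
    n = len(neighbor_scores)
    merged = [[NEG_INF] * n for _ in range(n)]  # future keys stay masked
    for i in range(n):              # query index
        for j in range(i + 1):      # key index, causal: j <= i
            if i - j < neighbor_window:
                merged[i][j] = neighbor_scores[i][j]
            else:
                merged[i][j] = grouped_scores[i][j]
    return merged

n = 4
neighbor = [[1.0] * n for _ in range(n)]  # placeholder logits, exact positions
grouped = [[2.0] * n for _ in range(n)]   # placeholder logits, grouped positions
merged = merge_branches(neighbor, grouped, neighbor_window=2)
# merged[3] == [2.0, 2.0, 1.0, 1.0]
```

Both full score matrices are computed before the selection, which is why a naive version roughly doubles the attention-score work and why an efficient version needs a kernel that applies the two position schemes block by block.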
The reference implementation is released as the open-source LongLM repository, with wrappers for Llama, Mistral, and Phi models [8].