LongLoRA
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,093 words
LongLoRA is a parameter-efficient fine-tuning technique that extends the context window of pre-trained large language models with substantially lower computation than full fine-tuning. It was introduced by Yukang Chen and collaborators at the Chinese University of Hong Kong (CUHK) and the MIT Han Lab in a paper first posted to arXiv on 21 September 2023 and later accepted as an oral presentation at the International Conference on Learning Representations (ICLR) in 2024.[^1][^2] LongLoRA combines an improved form of LoRA (in which embedding and normalization layers are unfrozen alongside the usual low-rank adapters) with a novel training-time approximation called Shifted Sparse Attention, or S2-Attn, that approximates the dense self-attention pattern with shifted local windows.[^1] Using these two ingredients together, the authors extended Llama 2 7B from a 4,096-token context to 100,000 tokens, and a 70B variant to 32,768 tokens, on a single eight-GPU NVIDIA A100 node.[^1] Code, model weights and a companion long-context instruction dataset called LongAlpaca were released openly on GitHub and Hugging Face shortly after the paper appeared, and the design influenced a wave of subsequent context-extension recipes.[^3][^4]
| Attribute | Value |
|---|---|
| First arXiv version | 21 September 2023 (v1)[^2] |
| Latest arXiv version | 8 March 2024 (v3)[^2] |
| Venue | ICLR 2024, Oral[^5] |
| Primary authors | Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia[^1] |
| Affiliations | CUHK; Massachusetts Institute of Technology Han Lab[^1] |
| Base models studied | Llama 2 7B, 13B, 70B[^1] |
| Maximum context extended | 100K (7B), 64K (13B), 32K (70B) on 8× A100[^1][^3] |
| Code license | Apache 2.0[^3] |
| Weight and data license | CC-BY-NC 4.0 (research use)[^3][^4] |
| Companion dataset | LongAlpaca-12k (9K long QA + 3K Alpaca short QA)[^3][^4] |
| Code repository | github.com/dvlab-research/LongLoRA[^3] |
LongLoRA's contribution is not a new transformer architecture: it modifies how an existing pre-trained Transformer is fine-tuned so that the same architecture can be deployed at a longer context length than it was originally trained for. The trained models therefore retain their original self-attention at inference and are compatible with deployment tools that expect a standard Llama 2 layout, including FlashAttention-2 kernels.[^1][^3]
By mid-2023, a tension had emerged between user-facing demand for very long-context LLMs and the cost of producing them. Pre-training a Llama-style model with a context of 32,768 tokens was reported to require thousands of NVIDIA A100 GPU hours, and naive full fine-tuning at the same length was almost as expensive because attention cost grows quadratically with sequence length and activation memory grows linearly.[^1] Several lighter-weight alternatives had appeared, most notably Position Interpolation (PI) from Meta, which linearly rescales rotary positional indices (RoPE) so that a 4K-trained model can attend over a longer range with little additional training, and the NTK-aware scaling variants that became popular in the open-source community.[^6][^7] These methods adjust positional encodings but still require some fine-tuning pass over long sequences, and that fine-tuning was typically done with full-parameter training.
LoRA, introduced in 2021 by Edward Hu and colleagues, freezes the original weights of a pre-trained model and inserts trainable low-rank update matrices into the attention projections, training only a tiny fraction of the parameters. The natural question, and the one LongLoRA set out to answer, was whether LoRA-style adaptation could be the workhorse for long-context fine-tuning, replacing full fine-tuning entirely. The authors report that a plain LoRA configuration fails: when LoRA is applied only to the query, key, value and output projections in attention, perplexity at long context lengths is significantly worse than full fine-tuning, even at higher LoRA ranks.[^1] LongLoRA was designed to close that gap while remaining far cheaper than full fine-tuning, and to do so at a sequence length the underlying model has not seen during pre-training.
A second motivation came from cost engineering. The team reported that fine-tuning Llama 2 7B from 4,096 to 65,536 tokens with full attention on the same 8× A100 node took roughly 32 hours; with their method the same target context could be reached in roughly half that time and with about half the peak GPU memory, opening the door to long-context fine-tuning on a single workstation rather than a multi-node cluster.[^1] The work therefore positioned itself as both an algorithmic and a systems contribution.
LongLoRA's recipe has two parts that are usually used together but can in principle be combined separately with other methods. The first part is a training-time attention approximation; the second part is a slightly relaxed version of LoRA that exposes a few extra parameter groups for learning.
In a standard self-attention layer over a sequence of length L, each token attends to every other token, so the per-layer compute and activation memory scale as O(L^2). When L grows from 4K to 100K, this is the dominant cost. S2-Attn replaces this dense attention pattern during training with a partitioned, shifted local pattern that approximates the full attention while being much cheaper.[^1]
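As a rough illustration of why the quadratic term dominates, the sketch below counts query-key score entries per head for dense attention and for attention confined to equal-sized groups. The function names are hypothetical, and the group size of one quarter of the sequence is an assumption chosen for illustration rather than a value stated here.

```python
def dense_score_entries(seq_len: int) -> int:
    """Query-key pairs scored by one dense attention head (causal masking ignored)."""
    return seq_len * seq_len


def grouped_score_entries(seq_len: int, group_size: int) -> int:
    """Query-key pairs scored when attention is confined to equal-sized groups."""
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    num_groups = seq_len // group_size
    return num_groups * group_size * group_size  # equals seq_len * group_size


short_ctx, long_ctx = 4_096, 32_768
growth = dense_score_entries(long_ctx) / dense_score_entries(short_ctx)

# Assumed group size: one quarter of the sequence (illustrative choice only).
group = long_ctx // 4
saving = dense_score_entries(long_ctx) / grouped_score_entries(long_ctx, group)

print(f"dense cost growth from 4K to 32K: {growth:.0f}x")
print(f"dense vs grouped entries at 32K:  {saving:.0f}x")
```

Under these assumptions the grouped cost is the sequence length times the group size, so the saving over dense attention equals the number of groups.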
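S2-Attn works in three conceptual steps inside each training-time self-attention call:
1. Split the attention heads into two halves.
2. For one half of the heads, shift the tokens along the sequence dimension by half the group size.
3. Partition the tokens into fixed-size groups and compute attention independently within each group, then undo the shift on the outputs of the shifted heads.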
Because the second half of the heads sees a shifted partition, information that lives near a group boundary in the unshifted view ends up in the interior of a group in the shifted view, and vice versa. Across the whole layer, every pair of nearby tokens is therefore in the same group in at least one of the two head subsets, so gradient information can still propagate across boundaries. The intuition is closely related to the shifted-window scheme used in the Swin Transformer for vision, which the LongLoRA paper cites as inspiration.[^1] The result, on the authors' benchmarks, is that fine-tuning with S2-Attn produces a model whose perplexity nearly matches a full-attention fine-tune, while cutting per-step attention compute and memory from O(L^2) to roughly O(L·G) for a group size G.[^1]
The paper highlights an important property: S2-Attn is used only during training. At inference time, the model is loaded with its standard dense self-attention layers and behaves like any other Llama 2 derivative, which keeps it compatible with FlashAttention-2 inference kernels and the wider Llama deployment ecosystem.[^1][^3] The authors describe the training kernel itself as implementable in "two lines of code" on top of a normal attention call, since the shift is a tensor roll plus a reshape into groups.[^1][^8]
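The shift and grouping described above can be sketched at the level of token indices. This is a toy illustration in plain Python, not the authors' kernel: `s2_groups` and `share_group` are hypothetical helpers, and the sequence length and group size are tiny values chosen so the groups can be read by eye.

```python
def s2_groups(seq_len: int, group_size: int, shifted: bool) -> list[list[int]]:
    """Token indices that attend to each other for one half of the heads."""
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    order = list(range(seq_len))
    if shifted:
        shift = group_size // 2
        # Roll the token axis by half a group; tokens wrap around at the ends.
        order = order[shift:] + order[:shift]
    # Reshape the (possibly rolled) token axis into consecutive groups.
    return [order[i:i + group_size] for i in range(0, seq_len, group_size)]


def share_group(groups: list[list[int]], a: int, b: int) -> bool:
    """True if tokens a and b fall in the same attention group."""
    return any(a in g and b in g for g in groups)


plain = s2_groups(8, 4, shifted=False)   # first half of the heads
rolled = s2_groups(8, 4, shifted=True)   # second half of the heads

print(plain)
print(rolled)
print(share_group(plain, 3, 4), share_group(rolled, 3, 4))
```

Tokens 3 and 4 sit on opposite sides of a boundary in the unshifted view but share a group in the shifted view. In this toy version the last shifted group wraps around and pairs the final tokens with the first ones.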
In an ablation table, the authors compared S2-Attn against alternative sparse training patterns: pure block-sparse local attention without shifting, dilated attention and stride patterns. After fine-tuning to a 32K target and evaluating with standard full attention, S2-Attn obtained 8.12 perplexity on their evaluation slice, compared with 8.39 for block-sparse and 9.70 for a dilated variant, supporting their argument that the shift is what closes most of the gap to full attention.[^1]
The second ingredient is a slight, deliberate extension of standard LoRA. In the original LoRA setup, only low-rank update matrices A and B are trained inside the linear projections of attention layers; the embedding lookup table and the layer-norm scale and bias are kept frozen at their pre-trained values. LongLoRA reports that this configuration is insufficient for long-context adaptation. With only attention projections adapted, perplexity at 32K context remains far worse than full fine-tuning, and adding LoRA to the feed-forward MLP weights does not fix the problem.[^1]
The remedy is empirical but small. The authors unfreeze the embedding lookup table and the RMSNorm parameters that precede each attention and MLP block, training them densely while leaving everything else governed by LoRA's low-rank updates. Together, the embedding and normalization layers represent roughly 2% of Llama 2 7B's parameters, so the resulting setup is still firmly in the parameter-efficient regime, but the additional flexibility lets the model learn the new statistics induced by very long sequences. The authors refer to this variant as "improved LoRA" or sometimes LoRA+; it has nothing to do with the unrelated optimizer-side LoRA+ proposal from other groups.[^1] In the paper's ablations, this small change accounts for the bulk of the gap between vanilla LoRA and full fine-tuning at 32K context.[^1]
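The "roughly 2%" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented Llama 2 7B shape (vocabulary 32,000, hidden size 4,096, 32 layers, MLP width 11,008) and counts only the input embedding table. The LoRA rank of 8 is a hypothetical value used purely for comparison.

```python
# Llama 2 7B shape values (assumed from the public model configuration).
VOCAB, HIDDEN, LAYERS, MLP_DIM = 32_000, 4_096, 32, 11_008

embed = VOCAB * HIDDEN                 # input embedding lookup table
lm_head = VOCAB * HIDDEN               # untied output projection
attn = 4 * HIDDEN * HIDDEN             # q, k, v, o projections per layer
mlp = 3 * HIDDEN * MLP_DIM             # gate, up, down projections per layer
norms = (2 * LAYERS + 1) * HIDDEN      # two RMSNorms per layer plus the final one

total = embed + lm_head + LAYERS * (attn + mlp) + norms

# Hypothetical LoRA rank; each adapted projection adds one A and one B matrix.
RANK = 8
lora = LAYERS * 4 * (2 * HIDDEN * RANK)

print(f"total parameters:     {total:,}")
print(f"embedding share:      {embed / total:.2%}")
print(f"normalization share:  {norms / total:.4%}")
print(f"LoRA rank-{RANK} share:    {lora / total:.2%}")
```

Under these assumptions almost all of the extra trainable parameters come from the embedding table, while the normalization weights are negligible in count.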
Putting the two ingredients together, a LongLoRA fine-tune of Llama 2 looks like this. The original model is loaded with frozen pre-trained weights. LoRA adapters of a chosen rank are injected into the query, key, value and output projections. The embedding lookup table and the RMSNorm parameters are unfrozen. The model is then fine-tuned on a long-context corpus using S2-Attn in place of full attention. Optimization is done with AdamW under a DeepSpeed ZeRO-3 setup; the authors used FlashAttention-2 to accelerate the in-group attention.[^1][^3] At the end of training, the LoRA matrices can be merged back into the base weights, producing a single dense checkpoint that uses standard attention at inference.
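The freezing pattern in this recipe amounts to a filter over parameter names. The following is a minimal sketch, assuming Hugging Face style Llama parameter names and a `lora_` marker for adapter weights. The keyword list and the `is_trainable` helper are hypothetical and are not taken from the paper or its repository.

```python
# Keywords are assumptions for this sketch: adapters, embeddings and norms train,
# everything else stays frozen.
TRAINABLE_KEYWORDS = ("embed", "norm", "lora_")


def is_trainable(param_name: str) -> bool:
    """Return True for LoRA adapters, embedding tables and normalization weights."""
    return any(key in param_name for key in TRAINABLE_KEYWORDS)


# Example names in the Hugging Face Llama layout (adapter name is illustrative).
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
]

trainable = [name for name in example_names if is_trainable(name)]
frozen = [name for name in example_names if not is_trainable(name)]

print("trainable:", trainable)
print("frozen:   ", frozen)
```

With a PyTorch model, the same predicate could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name)`.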
The recipe is also composable with positional-encoding adjustments. In practice, the LongLoRA fine-tunes apply Position Interpolation to the RoPE position indices before training, so the model has a well-conditioned positional encoding at the new context length when the LoRA pass starts; this stacks naturally with S2-Attn.[^1][^6]
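Position Interpolation itself is a one-line rescaling of position indices. The sketch below assumes the standard RoPE angle formula with base 10,000 and a head dimension of 128, which are the usual Llama 2 defaults. The helper names are hypothetical.

```python
def interpolate_position(position: int, trained_len: int, target_len: int) -> float:
    """Linearly rescale a position index into the range seen during pre-training."""
    return position * trained_len / target_len


def rope_angles(position: float, head_dim: int, base: float = 10_000.0) -> list[float]:
    """Rotation angles for one position: position * base^(-2i / head_dim)."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


TRAINED_LEN, TARGET_LEN = 4_096, 32_768

# The last position of the extended window lands just inside the trained range.
last = interpolate_position(TARGET_LEN - 1, TRAINED_LEN, TARGET_LEN)
angles = rope_angles(last, head_dim=128)

print(last)
print(len(angles))
```

Every position in the 32K window is mapped to a fractional index below 4,096, so the rotation angles stay within the range the base model saw during pre-training.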
The LongLoRA paper evaluates the recipe on language modeling perplexity, long-document retrieval and downstream instruction-following.
The main perplexity results in Table 3 of the paper evaluate Llama 2 fine-tuned with LongLoRA at several target context lengths on the Proof-pile mathematical text corpus, a benchmark used in earlier long-context work.[^1]
| Base model | Fine-tune target | Proof-pile perplexity |
|---|---|---|
| Llama 2 7B | 8,192 | 2.66[^1] |
| Llama 2 7B | 16,384 | 2.51[^1] |
| Llama 2 7B | 32,768 | 2.50[^1] |
| Llama 2 13B | 8,192 | 2.53[^1] |
| Llama 2 13B | 16,384 | 2.40[^1] |
| Llama 2 13B | 32,768 | 2.33[^1] |
For both model sizes, the perplexity number falls monotonically as the training context length is increased, which the authors take as evidence that the long-context fine-tuning is doing real work rather than just memorizing a fixed window. They also report consistent results on the PG-19 long-document language-modeling benchmark, where LongLoRA's 7B model at 32K context achieves lower perplexity than the corresponding 4K Llama 2 baseline.[^1]
The paper also reports a comparison against full fine-tuning at the same target lengths and finds the perplexity gap closes to within a few hundredths of a perplexity point, while compute and memory drop by roughly half.[^1]
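For readers less familiar with the metric, perplexity is the exponential of the mean negative log-likelihood per token. The sketch below shows the calculation on made-up log-probabilities. It is not the paper's evaluation script, and the `perplexity` helper is hypothetical.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Exponential of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Made-up values: a model that assigns probability 0.25 to every token.
uniform_over_four = [math.log(0.25)] * 10
print(perplexity(uniform_over_four))
```

A model that spreads probability evenly over four choices at every step has a perplexity of about 4, so values near 2.5 correspond to considerably sharper predictions.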
The cost comparison most often cited from the paper is Table 12, which reports the wall-clock time and peak GPU memory for fine-tuning Llama 2 7B to 8,192 tokens on a single 8× A100 node.[^1]
| Method | Time (hours) | Peak memory (GB per GPU) |
|---|---|---|
| Full fine-tune | 7.4 | 46.3 |
| Vanilla LoRA | 6.0 | 25.7 |
| LongLoRA (S2-Attn + improved LoRA) | 5.2 | 25.6 |
At the more extreme 65,536-token target, full fine-tuning is reported as roughly 32 hours; LongLoRA reaches the same target in about half the time and with about half the peak memory, which is what makes the single-node 100K result for the 7B model feasible.[^1] The 70B fine-tune to 32K context also fits on the same 8× A100 node when LongLoRA is combined with DeepSpeed sharding, whereas the corresponding full-attention fine-tune would not fit in 8 × 80 GB of GPU memory.[^1][^3]
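The relative savings at the 8,192-token target follow directly from the table above. The short calculation below uses only those reported figures, and `relative_saving` is a hypothetical helper.

```python
# (hours, peak GB per GPU) for Llama 2 7B at an 8,192-token target, as tabulated above.
costs = {
    "full": (7.4, 46.3),
    "lora": (6.0, 25.7),
    "longlora": (5.2, 25.6),
}


def relative_saving(baseline: float, value: float) -> float:
    """Fraction of the baseline that is saved, e.g. 0.25 means 25% less."""
    return 1.0 - value / baseline


full_hours, full_mem = costs["full"]
ll_hours, ll_mem = costs["longlora"]

time_saving = relative_saving(full_hours, ll_hours)
memory_saving = relative_saving(full_mem, ll_mem)

print(f"time saved vs full fine-tune:   {time_saving:.0%}")
print(f"memory saved vs full fine-tune: {memory_saving:.0%}")
```

At this short target the time saving is well under one half. The halving quoted for 65,536 tokens reflects the growing share of attention cost at longer sequences.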
To test whether the extended context is actually usable, the authors run a passkey retrieval test in which a five-digit secret is inserted at a controlled position inside a long stream of distractor text and the model is prompted to recover it. They report that a 7B LongLoRA model fine-tuned at 32,768 tokens retrieves the passkey with near-perfect accuracy at positions up to its trained context, with accuracy degrading sharply only beyond about 33,000 to 34,000 tokens.[^1] By further increasing the maximum position embeddings with Position Interpolation and without any additional training, the same model can retain reasonable retrieval accuracy out to roughly 48K tokens.[^1] These results helped establish passkey retrieval as a standard sanity check for subsequent long-context releases.
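A passkey prompt of this kind is easy to construct. The sketch below is a minimal version in which the filler sentence, the key sentence and the question are all hypothetical wording, so the paper's exact prompt may differ. The `build_passkey_prompt` helper is likewise illustrative.

```python
import random


def build_passkey_prompt(n_filler: int, insert_at: int, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, passkey) with the key sentence placed after `insert_at` filler lines."""
    if not 0 <= insert_at <= n_filler:
        raise ValueError("insert_at must lie between 0 and n_filler")
    rng = random.Random(seed)
    passkey = str(rng.randint(10_000, 99_999))  # five-digit secret

    # Hypothetical distractor and key sentences, chosen only for illustration.
    filler = "The grass is green. The sky is blue. The sun is yellow."
    key_line = f"The pass key is {passkey}. Remember it."

    lines = [filler] * n_filler
    lines.insert(insert_at, key_line)
    question = "What is the pass key?"
    return "\n".join(lines + [question]), passkey


prompt, passkey = build_passkey_prompt(n_filler=200, insert_at=50)
print(len(prompt.split("\n")), passkey)
```

Sweeping `insert_at` and `n_filler` gives a grid of key depths and total lengths, and accuracy is then the fraction of prompts for which the model's answer contains the passkey.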
To make the extended-context base usable for downstream chat tasks, the authors built a companion instruction-tuning dataset called LongAlpaca-12k.[^1][^4] It contains 12,000 examples, of which 9,000 are newly collected long-context question-answer pairs (covering tasks such as long-paper summarization, book chapter QA and structured analysis of long documents) and 3,000 are short instruction samples taken from the original Stanford Alpaca dataset. The mix is intentional: the authors report that omitting the short examples causes the model to degrade on conventional short-instruction prompts after fine-tuning, while including a small fraction preserves general instruction tuning behavior.[^1][^4] The dataset is hosted on Hugging Face under a non-commercial CC-BY-NC 4.0 license, with the instructions in Alpaca-style JSON.[^4]
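The long and short mix can be pictured as a simple concatenate-and-shuffle over Alpaca-style records. In the sketch below the records are placeholders and the `make_record` helper is hypothetical. The field names follow the original Stanford Alpaca format, and the released LongAlpaca file may use a slightly different field set.

```python
import json
import random

LONG_COUNT, SHORT_COUNT = 9_000, 3_000  # split reported for LongAlpaca-12k


def make_record(instruction: str, output: str) -> dict:
    """Alpaca-style record; the exact fields in the released file may differ."""
    return {"instruction": instruction, "input": "", "output": output}


# Placeholder examples standing in for real long and short samples.
long_pool = [
    make_record(f"<long document {i}> Summarize the paper.", "placeholder answer")
    for i in range(LONG_COUNT)
]
short_pool = [
    make_record(f"Short instruction {i}", "placeholder answer")
    for i in range(SHORT_COUNT)
]

mixed = long_pool + short_pool
random.Random(0).shuffle(mixed)  # fixed seed so the order is reproducible

short_share = SHORT_COUNT / len(mixed)
print(json.dumps(mixed[0], indent=2))
print(f"short-instruction share: {short_share:.0%}")
```

With the reported counts, one quarter of the training examples are short instructions.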
A second small dataset called LongQA, with several thousand long-context QA pairs and used internally for evaluation, is also released alongside the main weights.[^3]
The dvlab-research/LongLoRA repository hosts the official code and pointers to all released model weights.[^3] Two families of weights were published:
The repository's README documents that LongLoRA is built on FlashAttention-2 for efficient kernels and DeepSpeed for sharding, and that the code is released under Apache 2.0 while the weights and data inherit the more restrictive CC-BY-NC 4.0 used by Llama 2 derivatives. A QLoRA integration was added in the first month after release, allowing the fine-tune to run in 4-bit quantization for further memory savings.[^3]
The release timeline as documented in the repository and on the MIT HAN Lab project page is approximately:
| Date | Event |
|---|---|
| 21 September 2023 | Paper posted to arXiv (v1)[^2] |
| Late September 2023 | Initial code and base model weights released on GitHub and Hugging Face[^3] |
| 8 October 2023 | LongAlpaca-12k dataset and LongAlpaca instruction-tuned 7B/13B/70B models released[^3][^4] |
| October 2023 | QLoRA integration added for 4-bit fine-tuning[^3] |
| 5 December 2023 | arXiv v2 with additional experiments[^2] |
| 16 January 2024 | Accepted as oral presentation at ICLR 2024[^5] |
| 8 March 2024 | arXiv v3, the final conference version[^2] |
By the time LongLoRA appeared, the open-source community had several distinct approaches to context extension for Llama-family models. LongLoRA does not replace these methods so much as combine and cheapen them.
LongLoRA's distinguishing place in this landscape is that it isolates the fine-tuning cost as the bottleneck and provides a recipe in which both the parameter update (improved LoRA) and the per-step attention cost (S2-Attn) are simultaneously made cheap, while keeping the deployed model architecturally identical to a standard Llama 2.
The LongLoRA repository was widely adopted within months of its release. Two notable follow-on projects make the influence concrete.
The most direct successor is LongQLoRA, posted to arXiv in November 2023 by Jianxin Yang. LongQLoRA explicitly composes QLoRA-style 4-bit quantized fine-tuning with LongLoRA's shifted sparse attention and Position Interpolation, allowing context extension on a single 32 GB V100 instead of an 8× A100 node. The paper reports competitive PG-19 and Proof-pile perplexity numbers and credits LongLoRA's shifted attention as one of its three core building blocks.[^12]
The MIT HAN Lab's later LongVA and LongVILA work on long-context vision-language models reuses the LongLoRA fine-tuning recipe to extend the language backbone of multimodal systems, illustrating that the approach is not limited to text models.[^13] In the broader community, the LongAlpaca dataset has been adopted as a long-context instruction-tuning benchmark and supervision source for subsequent open releases on Hugging Face, and the LongBench suite is now a common downstream evaluation pairing.[^3]
The Hugging Face PEFT library and ecosystem tools added trainable-embedding and trainable-normalization options around the same period, in part to support recipes such as LongLoRA where additional dense parameter groups outside the LoRA adapter are necessary for stable long-context training.[^14]
The LongLoRA paper acknowledges several limitations of its approach.
LongLoRA was one of the first papers to demonstrate that a credible 100K-token context window for a 7B model could be obtained by fine-tuning on a single workstation-class server, rather than by re-pretraining a model. The work made two specific technical points that have outlasted the specific 2023 Llama 2 setting: that an attention approximation can be confined to training without changing the deployed model, and that for long-context adaptation, treating embedding and normalization layers as parameter-efficient extras is more important than carefully tuning LoRA ranks. These findings have been reused by later parameter-efficient long-context methods and adopted by parts of the multimodal community for extending vision-language models.[^11][^13]
LongLoRA also helped normalize the open release of long-context model weights at all three Llama 2 sizes (7B, 13B, 70B), at a time when most open releases capped out around 4K to 8K tokens. The accompanying LongAlpaca-12k dataset, although small, became a frequently cited example of how to blend long and short instruction examples to avoid regression on standard tasks, an approach later echoed in larger instruction-tuning mixes.[^4]