# LongLoRA

> Source: https://aiwiki.ai/wiki/longlora
> Updated: 2026-06-07
> Categories: Large Language Models, Training & Optimization
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**LongLoRA** is a parameter-efficient fine-tuning technique that extends the context window of pre-trained large language models at substantially lower compute cost than full fine-tuning. It was introduced by Yukang Chen and collaborators at the Chinese University of Hong Kong (CUHK) and the MIT HAN Lab in a paper first posted to arXiv on 21 September 2023 and later accepted as an oral presentation at the International Conference on Learning Representations (ICLR) in 2024.[^1][^2] LongLoRA combines an improved form of [LoRA](/wiki/lora) (in which embedding and normalization layers are unfrozen alongside the usual low-rank adapters) with Shifted Sparse Attention, or S2-Attn, a training-time scheme that approximates dense self-attention with shifted local windows.[^1] Using these two ingredients together, the authors extended [Llama 2](/wiki/llama_2) 7B from a 4,096-token context to 100,000 tokens, and a 70B variant to 32,768 tokens, on a single eight-GPU [NVIDIA A100](/wiki/nvidia_a100) node.[^1] Code, model weights and a companion long-context instruction dataset called LongAlpaca were released openly on GitHub and Hugging Face shortly after the paper appeared, and the design influenced a wave of subsequent context-extension recipes.[^3][^4]

## Overview

| Attribute | Value |
|-----------|-------|
| First arXiv version | 21 September 2023 (v1)[^2] |
| Latest arXiv version | 8 March 2024 (v3)[^2] |
| Venue | ICLR 2024, Oral[^5] |
| Primary authors | Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia[^1] |
| Affiliations | CUHK; [Massachusetts Institute of Technology](/wiki/mit) HAN Lab[^1] |
| Base models studied | Llama 2 7B, 13B, 70B[^1] |
| Maximum context extended | 100K (7B), 64K (13B), 32K (70B) on 8× A100[^1][^3] |
| Code license | Apache 2.0[^3] |
| Weight and data license | CC-BY-NC 4.0 (research use)[^3][^4] |
| Companion dataset | LongAlpaca-12k (9K long QA + 3K Alpaca short QA)[^3][^4] |
| Code repository | github.com/dvlab-research/LongLoRA[^3] |

LongLoRA's contribution is not a new transformer architecture: it modifies how an existing pre-trained [Transformer](/wiki/transformer) is fine-tuned so that the same architecture can be deployed at a longer context length than it was originally trained for. The trained models therefore retain their original [self-attention](/wiki/self_attention) at inference and are compatible with deployment tools that expect a standard Llama 2 layout, including [FlashAttention](/wiki/flashattention)-2 kernels.[^1][^3]

## Background and motivation

By mid-2023, a tension had emerged between user-facing demand for very long-context [LLM](/wiki/llm)s and the cost of producing them. Pre-training a Llama-style model with a context of 32,768 tokens was reported to require thousands of [NVIDIA A100](/wiki/nvidia_a100) GPU hours, and naive full fine-tuning at the same length was almost as expensive because attention cost grows quadratically with sequence length and activation memory grows linearly.[^1] Several lighter-weight alternatives had appeared, most notably Position Interpolation (PI) from Meta, which linearly rescales rotary positional indices ([RoPE](/wiki/rope)) so that a 4K-trained model can attend over a longer range with little additional training, and the NTK-aware scaling variants that became popular in the open-source community.[^6][^7] These methods adjust positional encodings but still require some fine-tuning pass over long sequences, and that fine-tuning was typically done with full-parameter training.

[LoRA](/wiki/lora), introduced in 2021 by Edward Hu and colleagues, freezes the original weights of a pre-trained model and inserts trainable low-rank update matrices into the attention projections, training only a tiny fraction of the parameters. The natural question, and the one LongLoRA set out to answer, was whether LoRA-style adaptation could be the workhorse for long-context fine-tuning, replacing full fine-tuning entirely. The authors report that a plain LoRA configuration fails: when LoRA is applied only to the query, key, value and output projections in attention, perplexity at long context lengths is significantly worse than full fine-tuning, even at higher LoRA ranks.[^1] LongLoRA was designed to close that gap while remaining far cheaper than full fine-tuning, and to do so at a sequence length the underlying model has not seen during pre-training.
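
The low-rank update that LoRA adds can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' code: the helper names (`lora_forward`, `merge`), the `scale` argument and the tiny 3×3 example are assumptions chosen for readability. It shows a frozen weight `W` plus a trainable rank-1 pair `A`, `B`, and checks that folding `B·A` into `W` gives the same output.

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


def merge(W, A, B, scale=1.0):
    """Fold the low-rank update into a single dense matrix W + scale * B A."""
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)] for w_row, u_row in zip(W, BA)]


# Hypothetical toy sizes: d_in = d_out = 3, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight
A = [[1.0, 2.0, 3.0]]                                    # r x d_in
B = [[0.5], [0.0], [-1.0]]                               # d_out x r
x = [1.0, 1.0, 1.0]

y = lora_forward(W, A, B, x)
print(y)                           # [4.0, 1.0, -5.0]
print(matvec(merge(W, A, B), x))   # same values after merging
```

With `B` initialised to zeros the adapted layer reproduces the frozen layer exactly, which is why a LoRA fine-tune starts from the pre-trained behaviour.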

A second motivation came from cost engineering. The team reported that fine-tuning Llama 2 7B from 4,096 to 65,536 tokens with full attention on the same 8× A100 node took roughly 32 hours; with their method the same target context could be reached in roughly half that time and with about half the peak GPU memory, opening the door to long-context fine-tuning on a single node rather than a multi-node cluster.[^1] The work therefore positioned itself as both an algorithmic and a systems contribution.

## The two ingredients of LongLoRA

LongLoRA's recipe has two parts that are usually used together but can in principle be combined separately with other methods. The first part is a training-time attention approximation; the second part is a slightly relaxed version of LoRA that exposes a few extra parameter groups for learning.

### Shifted Sparse Attention (S2-Attn)

In a standard self-attention layer over a sequence of length L, each token attends to every other token, so the per-layer compute and activation memory scale as O(L^2). When L grows from 4K to 100K, this is the dominant cost. S2-Attn replaces this dense attention pattern during training with a partitioned, shifted local pattern that approximates the full attention while being much cheaper.[^1]

S2-Attn works in three conceptual steps inside each training-time self-attention call:

1. **Group partition.** The L tokens are split along the sequence axis into G equal-size groups. The authors report that a group size of roughly one quarter of the target context length, for example 8,192 tokens when fine-tuning to 32,768 tokens, gives the best trade-off in their experiments.[^1]
2. **Split heads in two.** The attention heads inside the layer are split into two halves. The first half computes group-local attention exactly as if each group were an independent sequence. The second half does the same, but on a copy of the tokens that has been circularly shifted along the sequence axis by half a group size before being grouped.[^1][^8]
3. **Merge.** The two halves are concatenated back along the head dimension, restoring the original head count.
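
The three steps above can be sketched as a grouping rule: which tokens may attend to each other in a given head. This is a minimal pure-Python illustration under stated assumptions, not the authors' kernel: the function names are hypothetical, the causal mask is ignored, and the shift direction (rolling left by half a group) is one possible convention.

```python
def s2_group_ids(seq_len: int, group_size: int, shifted: bool) -> list[int]:
    """Return the attention-group id of every token position.

    shifted=False: plain partition into consecutive groups.
    shifted=True:  tokens are circularly rolled by half a group before grouping.
    """
    if seq_len % group_size != 0:
        raise ValueError("seq_len must be a multiple of group_size")
    offset = group_size // 2 if shifted else 0
    # After rolling left by `offset`, position p sits in slot (p - offset) mod seq_len.
    return [((p - offset) % seq_len) // group_size for p in range(seq_len)]


def head_uses_shift(head: int, num_heads: int) -> bool:
    """First half of the heads use the plain partition, second half the shifted one."""
    return head >= num_heads // 2


def can_attend(query: int, key: int, head: int, *, seq_len: int,
               group_size: int, num_heads: int) -> bool:
    """True if query and key share a group for this head (causal mask ignored)."""
    ids = s2_group_ids(seq_len, group_size, head_uses_shift(head, num_heads))
    return ids[query] == ids[key]


plain = s2_group_ids(8, 4, shifted=False)
shifted = s2_group_ids(8, 4, shifted=True)
print(plain)    # [0, 0, 0, 0, 1, 1, 1, 1]
print(shifted)  # [1, 1, 0, 0, 0, 0, 1, 1]
```

In this toy case tokens 3 and 4 sit on opposite sides of a boundary in the plain partition but share a group in the shifted one. The circular roll also places the first and last tokens in the same shifted group, a wrap-around that a causal implementation has to handle with masking.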

Because the second half of the heads sees a shifted partition, information that lives near a group boundary in the unshifted view ends up in the interior of a group in the shifted view, and vice versa. Across the whole layer, any two tokens that lie within half a group size of each other therefore share a group in at least one of the two head subsets, so information and gradients can still propagate across boundaries. The intuition is closely related to the shifted-window scheme used in the [Swin Transformer](/wiki/swin_transformer) for vision, which the LongLoRA paper cites as inspiration.[^1] The result, on the authors' benchmarks, is that fine-tuning with S2-Attn produces a model whose perplexity nearly matches a full-attention fine-tune, while per-step attention compute and memory scale with the group size rather than with the full sequence length.[^1]
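
As a rough worked estimate (a back-of-the-envelope count of attention-score entries per head, not a figure from the paper), splitting L tokens into G groups changes the cost as follows. The causal mask, which roughly halves both counts, and all non-attention work are ignored.

```latex
\text{dense: } L^2
\qquad
\text{grouped: } G \cdot \left(\frac{L}{G}\right)^2 = \frac{L^2}{G}

% Example: L = 32{,}768 with a group size of 8{,}192, so G = 4
32{,}768^2 = 1{,}073{,}741{,}824
\qquad
4 \cdot 8{,}192^2 = 268{,}435{,}456
```

The attention term alone shrinks by a factor of G in this estimate. End-to-end savings are smaller, because the MLP blocks and other per-token work are unaffected by the grouping.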

The paper highlights an important property: S2-Attn is used **only during training**. At inference time, the model is loaded with its standard dense self-attention layers and behaves like any other Llama 2 derivative, which keeps it compatible with [FlashAttention](/wiki/flashattention)-2 inference kernels and the wider Llama deployment ecosystem.[^1][^3] The authors describe the training kernel itself as implementable in "two lines of code" on top of a normal attention call, since the shift is a tensor roll plus a reshape into groups.[^1][^8]
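
The "roll plus reshape" idea can be illustrated on a plain list of token indices. This is a sketch under stated assumptions rather than the released kernel: `roll_left` and `into_groups` are hypothetical helpers standing in for a tensor roll along the sequence axis and a reshape into groups, and only the shifted half of the heads would take this path.

```python
def roll_left(tokens: list, shift: int) -> list:
    """Circularly shift a sequence left by `shift` positions."""
    shift %= len(tokens)
    return tokens[shift:] + tokens[:shift]


def into_groups(tokens: list, group_size: int) -> list[list]:
    """Cut a sequence into consecutive groups of equal size."""
    return [tokens[i:i + group_size] for i in range(0, len(tokens), group_size)]


tokens = list(range(8))
group_size = 4

# The two steps for the shifted heads: roll by half a group, then group.
rolled = roll_left(tokens, group_size // 2)
groups = into_groups(rolled, group_size)
print(groups)  # [[2, 3, 4, 5], [6, 7, 0, 1]]

# After group-local attention the outputs are rolled back to the original order.
restored = roll_left(rolled, -(group_size // 2))
print(restored)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each inner list would be processed as an independent short sequence by an ordinary attention call, which is why an unmodified attention kernel can be reused inside the groups.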

In an ablation table, the authors compared S2-Attn against alternative sparse training patterns: pure block-sparse local attention without shifting, dilated attention and stride patterns. After fine-tuning to a 32K target and evaluating with standard full attention, S2-Attn obtained 8.12 perplexity on their evaluation slice, compared with 8.39 for block-sparse and 9.70 for a dilated variant, supporting their argument that the shift is what closes most of the gap to full attention.[^1]

### LoRA with trainable embeddings and norms

The second ingredient is a slight, deliberate extension of standard LoRA. In the original LoRA setup, only low-rank update matrices A and B are trained inside the linear projections of attention layers; the embedding lookup table and the normalization-layer weights are kept frozen at their pre-trained values. LongLoRA reports that this configuration is insufficient for long-context adaptation. With only attention projections adapted, perplexity at 32K context remains far worse than full fine-tuning, and adding LoRA to the feed-forward MLP weights does not fix the problem.[^1]

The remedy is empirical but small. The authors unfreeze the embedding lookup table and the [RMSNorm](/wiki/rmsnorm) parameters that precede each attention and MLP block, training them densely while leaving everything else governed by LoRA's low-rank updates. Together, the embedding and normalization layers represent roughly 2% of Llama 2 7B's parameters, so the resulting setup is still firmly in the parameter-efficient regime, but the additional flexibility lets the model learn the new statistics induced by very long sequences. The authors refer to this variant as "improved LoRA" or sometimes LoRA+; it has nothing to do with the unrelated optimizer-side LoRA+ proposal from other groups.[^1] In the paper's ablations, this small change accounts for the bulk of the gap between vanilla LoRA and full fine-tuning at 32K context.[^1]
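
The "roughly 2%" figure can be checked with simple arithmetic. The sketch below assumes the publicly documented Llama 2 7B shape (vocabulary 32,000, hidden size 4,096, MLP width 11,008, 32 layers, untied output head) and a hypothetical LoRA rank of 8 on the four attention projections; none of these numbers come from this article.

```python
# Assumed Llama 2 7B configuration (public config values, not from the article).
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 32_000, 4_096, 11_008, 32

embed = VOCAB * HIDDEN                      # input embedding table
norms = LAYERS * 2 * HIDDEN + HIDDEN        # two RMSNorm vectors per block + final norm
attn = LAYERS * 4 * HIDDEN * HIDDEN         # q, k, v, o projections
mlp = LAYERS * 3 * HIDDEN * INTERMEDIATE    # gate, up, down projections
lm_head = VOCAB * HIDDEN                    # untied output projection
total = embed + norms + attn + mlp + lm_head


def lora_params(rank: int) -> int:
    """Adapter size when q, k, v, o each gain A (rank x d) and B (d x rank)."""
    return LAYERS * 4 * 2 * HIDDEN * rank


print(f"total parameters : {total:,}")           # 6,738,415,616
print(f"embedding table  : {embed:,}")           # 131,072,000
print(f"norm weights     : {norms:,}")           # 266,240
print(f"LoRA, rank 8     : {lora_params(8):,}")  # 8,388,608
print(f"embedding share  : {embed / total:.2%}")
```

Under these assumptions the embedding table is by far the largest of the extra trainable groups, while the normalization weights are negligible in size.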

### Combining the two

Putting the two ingredients together, a LongLoRA fine-tune of Llama 2 looks like this. The original model is loaded with frozen pre-trained weights. LoRA adapters of a chosen rank are injected into the query, key, value and output projections. The embedding lookup table and the RMSNorm parameters are unfrozen. The model is then fine-tuned on a long-context corpus using S2-Attn in place of full attention. Optimization is done with [AdamW](/wiki/adamw) under a [DeepSpeed](/wiki/deepspeed) ZeRO-3 setup; the authors used FlashAttention-2 to accelerate the in-group attention.[^1][^3] At the end of training, the LoRA matrices can be merged back into the base weights, producing a single dense checkpoint that uses standard attention at inference.
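
The freezing rule in this recipe can be sketched as a filter over parameter names. The names below are illustrative, modeled on the naming commonly seen in Hugging Face Llama checkpoints with LoRA adapters attached, and `is_trainable` is a hypothetical helper; the official training script may select parameters differently.

```python
def is_trainable(name: str) -> bool:
    """Sketch of the LongLoRA rule: train LoRA adapters, embeddings and norms only."""
    return any(key in name for key in ("lora_", "embed", "norm"))


# Illustrative parameter names for one transformer block plus the shared modules.
names = [
    "model.embed_tokens.weight",
    "model.layers.0.input_layernorm.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.q_proj.lora_A.weight",
    "model.layers.0.self_attn.q_proj.lora_B.weight",
    "model.layers.0.post_attention_layernorm.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]

trainable = [n for n in names if is_trainable(n)]
frozen = [n for n in names if not is_trainable(n)]

for n in trainable:
    print("train :", n)
for n in frozen:
    print("freeze:", n)
```

In a PyTorch training loop the same decision would typically be applied by setting `requires_grad` on each parameter. The base projection weights, MLP weights and output head stay frozen in this sketch.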

The recipe is also composable with positional-encoding adjustments. In practice, the LongLoRA fine-tunes apply Position Interpolation to the RoPE position indices before training, so the model has a well-conditioned positional encoding at the new context length when the LoRA pass starts; this stacks naturally with S2-Attn.[^1][^6]
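
Position Interpolation itself is a one-line rescaling of the position index before the rotary angles are computed. The sketch below is a minimal illustration with hypothetical function names; the RoPE base of 10,000 and head dimension of 128 are assumptions matching common Llama 2 settings.

```python
def rope_angles(position: float, head_dim: int, base: float = 10_000.0) -> list[float]:
    """Rotation angle of each of the head_dim/2 frequency pairs at a given position."""
    return [position * base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def interpolate_position(position: int, trained_len: int, target_len: int) -> float:
    """Position Interpolation: squeeze [0, target_len) linearly into [0, trained_len)."""
    return position * trained_len / target_len


trained_len, target_len = 4_096, 32_768

# The last position of a 32K sequence is mapped back inside the trained 4K range.
p = interpolate_position(32_767, trained_len, target_len)
print(p)  # 4095.875

angles = rope_angles(p, head_dim=128)
print(len(angles), angles[0])  # 64 4095.875
```

Every rescaled index stays below the original 4,096-token limit, so the model only sees rotation angles inside the range it met during pre-training, at the price of positions being packed eight times more densely.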

## Experimental results

The LongLoRA paper evaluates the recipe on language modeling perplexity, long-document retrieval and downstream instruction-following.

### Language modeling perplexity

The main perplexity results in Table 3 of the paper evaluate Llama 2 fine-tuned with LongLoRA at several target context lengths on the Proof-pile mathematical text corpus, a benchmark used in earlier long-context work.[^1]

| Base model | Fine-tune target | Proof-pile perplexity |
|------------|------------------|-----------------------|
| Llama 2 7B | 8,192 | 2.66[^1] |
| Llama 2 7B | 16,384 | 2.51[^1] |
| Llama 2 7B | 32,768 | 2.50[^1] |
| Llama 2 13B | 8,192 | 2.53[^1] |
| Llama 2 13B | 16,384 | 2.40[^1] |
| Llama 2 13B | 32,768 | 2.33[^1] |

For both model sizes, the perplexity number falls monotonically as the training context length is increased, which the authors take as evidence that the long-context fine-tuning is doing real work rather than just memorizing a fixed window. They also report consistent results on the PG-19 long-document language-modeling benchmark, where LongLoRA's 7B model at 32K context achieves lower perplexity than the corresponding 4K Llama 2 baseline.[^1]

The paper also reports a comparison against full fine-tuning at the same target lengths and finds the perplexity gap closes to within a few hundredths of a [perplexity](/wiki/perplexity) point, while compute and memory drop by roughly half.[^1]
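
For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per predicted token. The helper below is a generic sketch of that definition, not the paper's evaluation script, and the input values are made up for illustration.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative natural-log probability over predicted tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.5 to every token has perplexity 2.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))

# Conversely, a perplexity of 2.50 corresponds to a mean loss of ln(2.5) nats per token.
print(math.log(2.5))
```

Because the mapping is exponential, a difference of a few hundredths in perplexity at values around 2.5 corresponds to a difference of roughly a hundredth of a nat in average loss.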

### Training cost

The cost comparison most often cited from the paper is Table 12, which reports the wall-clock time and peak GPU memory for fine-tuning Llama 2 7B to 8,192 tokens on a single 8× A100 node.[^1]

| Method | Time (hours) | Peak memory (GB per GPU) |
|--------|--------------|---------------------------|
| Full fine-tune | 7.4 | 46.3 |
| Vanilla LoRA | 6.0 | 25.7 |
| LongLoRA (S2-Attn + improved LoRA) | 5.2 | 25.6 |

At the more extreme 65,536-token target, full fine-tuning is reported as roughly 32 hours; LongLoRA reaches the same target in about half the time and with about half the peak memory, which is what makes the single-node 100K result for the 7B model feasible.[^1] The 70B fine-tune to 32K context also fits on the same 8× A100 node when LongLoRA is combined with DeepSpeed sharding, whereas the corresponding full-attention fine-tune would not fit in 8 × 80 GB of GPU memory.[^1][^3]
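
The relative savings in the 8,192-token table can be derived directly from its rows. The snippet below only re-expresses the numbers quoted in that table as ratios; the function names are hypothetical.

```python
# (hours, peak GB per GPU) as quoted in the 8,192-token cost table.
ROWS = {
    "Full fine-tune": (7.4, 46.3),
    "Vanilla LoRA": (6.0, 25.7),
    "LongLoRA": (5.2, 25.6),
}


def speedup(method: str, baseline: str = "Full fine-tune") -> float:
    """How many times faster `method` is than the baseline."""
    return ROWS[baseline][0] / ROWS[method][0]


def memory_fraction(method: str, baseline: str = "Full fine-tune") -> float:
    """Peak memory of `method` as a fraction of the baseline's peak memory."""
    return ROWS[method][1] / ROWS[baseline][1]


for method in ROWS:
    print(f"{method:15s} speedup {speedup(method):.2f}x, "
          f"memory {memory_fraction(method):.0%} of full fine-tune")
```

At this short target length LongLoRA is about 1.4 times faster than full fine-tuning and needs a little over half the memory. Nearly all of the memory saving is already delivered by plain LoRA (25.7 GB versus 25.6 GB), while the time advantage over plain LoRA comes from the cheaper attention.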

### Passkey retrieval

To test whether the extended context is actually usable, the authors run a passkey retrieval test in which a five-digit secret is inserted at a controlled position inside a long stream of distractor text and the model is prompted to recover it. They report that a 7B LongLoRA model fine-tuned at 32,768 tokens retrieves the passkey with near-perfect accuracy at positions up to its trained context, with accuracy degrading sharply only beyond about 33,000 to 34,000 tokens.[^1] By further increasing the maximum position embeddings with Position Interpolation and without any additional training, the same model can retain reasonable retrieval accuracy out to roughly 48K tokens.[^1] These results helped establish passkey retrieval as a standard sanity check for subsequent long-context releases.
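
A passkey test case is easy to construct programmatically. The sketch below is an illustrative generator, not the authors' evaluation code: the filler sentence, the prompt wording and the helper names are assumptions modeled on the general format of such tests.

```python
import random

# Hypothetical distractor sentence, repeated to pad the context.
FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")


def make_passkey_prompt(n_filler: int, insert_at: int, rng: random.Random) -> tuple[str, int]:
    """Build a prompt with a five-digit passkey hidden after `insert_at` filler blocks."""
    if not 0 <= insert_at <= n_filler:
        raise ValueError("insert_at must lie between 0 and n_filler")
    passkey = rng.randint(10_000, 99_999)
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    body = [FILLER] * insert_at + [needle] + [FILLER] * (n_filler - insert_at)
    prompt = (
        "There is important information hidden in a lot of irrelevant text. "
        "Find it and memorize it. "
        + "".join(body)
        + "What is the pass key? The pass key is"
    )
    return prompt, passkey


def is_correct(model_output: str, passkey: int) -> bool:
    """Count a retrieval as correct if the passkey digits appear in the output."""
    return str(passkey) in model_output


prompt, passkey = make_passkey_prompt(n_filler=100, insert_at=40, rng=random.Random(0))
print(len(prompt.split()), "words, passkey hidden after 40 filler blocks")
```

Sweeping `n_filler` controls the total context length and sweeping `insert_at` controls the depth of the passkey, which together give the accuracy-versus-position curves that such evaluations report.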

### Instruction tuning with LongAlpaca

To make the extended-context base usable for downstream chat tasks, the authors built a companion instruction-tuning dataset called LongAlpaca-12k.[^1][^4] It contains 12,000 examples, of which 9,000 are newly collected long-context question-answer pairs (covering tasks such as long-paper summarization, book chapter QA and structured analysis of long documents) and 3,000 are short instruction samples taken from the original Stanford Alpaca dataset. The mix is intentional: the authors report that omitting the short examples causes the model to degrade on conventional short-instruction prompts after fine-tuning, while including a small fraction preserves general [instruction tuning](/wiki/instruction_tuning) behavior.[^1][^4] The dataset is hosted on [Hugging Face](/wiki/hugging_face) under a non-commercial CC-BY-NC 4.0 license, with the instructions in Alpaca-style JSON.[^4]
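
The long/short blend can be pictured as a simple concatenate-and-shuffle step. The sketch below uses made-up records in the Stanford Alpaca field layout (`instruction`, `input`, `output`) at a scaled-down 9:3 ratio; the exact schema and sampling procedure of LongAlpaca-12k may differ.

```python
import random


def mix_instruction_data(long_examples: list[dict], short_examples: list[dict],
                         seed: int = 0) -> list[dict]:
    """Concatenate long- and short-context records and shuffle them reproducibly."""
    mixed = list(long_examples) + list(short_examples)
    random.Random(seed).shuffle(mixed)
    return mixed


# Hypothetical placeholder records; real entries would carry full documents and answers.
long_examples = [
    {"instruction": f"Summarize long document {i}.", "input": "", "output": "(summary)", "kind": "long"}
    for i in range(9)
]
short_examples = [
    {"instruction": f"Answer short question {i}.", "input": "", "output": "(answer)", "kind": "short"}
    for i in range(3)
]

mixed = mix_instruction_data(long_examples, short_examples)
short_share = sum(r["kind"] == "short" for r in mixed) / len(mixed)
print(len(mixed), short_share)  # 12 0.25
```

The `kind` field is added here only to make the ratio easy to inspect. At the dataset's real scale the same 3:1 proportion corresponds to 9,000 long and 3,000 short examples.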

A second, smaller dataset called LongQA, which contains several thousand long-context QA pairs and was used internally for evaluation, was also released alongside the main weights.[^3]

## Released artifacts

The dvlab-research/LongLoRA repository hosts the official code and pointers to all released model weights.[^3] Two families of weights were published:

1. **LongLoRA base models**, which are the result of running the LongLoRA recipe on a Llama 2 base. The headline checkpoints are Llama-2-7B-LongLoRA-100k, Llama-2-13B-LongLoRA-64k and Llama-2-70B-LongLoRA-32k, each hosted at huggingface.co/Yukang.[^3][^9][^10]
2. **LongAlpaca instruction-tuned models**, which take a LongLoRA base and apply supervised fine-tuning on LongAlpaca-12k. Released checkpoints include LongAlpaca-7B, LongAlpaca-13B and LongAlpaca-70B, all at 32K context. The team noted at release that the LongAlpaca-70B-chat checkpoint was, to their knowledge, the first openly released 70B long-context chat model.[^3]

The repository's README documents that LongLoRA is built on FlashAttention-2 for efficient kernels and DeepSpeed for sharding, and that the code is released under Apache 2.0 while the weights and data are released under the more restrictive CC-BY-NC 4.0. A QLoRA integration was added in the first month after release, allowing the fine-tune to run in 4-bit quantization for further memory savings.[^3]

The release timeline as documented in the repository and on the MIT HAN Lab project page is approximately:

| Date | Event |
|------|-------|
| 21 September 2023 | Paper posted to arXiv (v1)[^2] |
| Late September 2023 | Initial code and base model weights released on GitHub and Hugging Face[^3] |
| 8 October 2023 | LongAlpaca-12k dataset and LongAlpaca instruction-tuned 7B/13B/70B models released[^3][^4] |
| October 2023 | [QLoRA](/wiki/qlora) integration added for 4-bit fine-tuning[^3] |
| 5 December 2023 | arXiv v2 with additional experiments[^2] |
| 16 January 2024 | Accepted as oral presentation at ICLR 2024[^5] |
| 8 March 2024 | arXiv v3, the final conference version[^2] |

## How LongLoRA fits into the long-context landscape

By the time LongLoRA appeared, the open-source community had several distinct approaches to context extension for Llama-family models. LongLoRA does not replace these methods so much as combine and cheapen them.

* **Position Interpolation (PI).** Meta's PI paper, released in mid-2023, showed that linearly interpolating the angular indices of [RoPE](/wiki/rope) lets a 4K-trained model attend over much longer sequences after a brief fine-tune. PI changes positional encoding but not how the model is fine-tuned; LongLoRA almost always stacks on top of PI to give the model a well-conditioned positional space at the new context length.[^6]
* **NTK-aware scaling and [YaRN](/wiki/yarn).** These methods, the latter from Bowen Peng and collaborators, refine PI with frequency-aware rescaling and ramped attention temperature. They reduce or sometimes eliminate the need for additional training; they are orthogonal to how parameters are updated, and a YaRN-rescaled model can also be fine-tuned with the LongLoRA recipe if further adaptation is desired.[^7]
* **Sliding-window attention.** Approaches such as the original [sliding window attention](/wiki/sliding_window_attention) used in Longformer, and later [Mistral 7B](/wiki/mistral_7b), change the attention pattern permanently, both at train and inference time. They typically require retraining from scratch or substantial fine-tuning, and they trade a strict attention budget for reduced cost. LongLoRA differs by using a similar local pattern only at training time.[^1]
* **[Ring Attention](/wiki/ring_attention).** Ring attention parallelizes a full-attention computation across many devices by streaming key-value blocks around a ring of GPUs, scaling long-context training horizontally rather than approximating attention. Where ring attention attacks the same problem with more hardware, LongLoRA attacks it with cheaper math on the same hardware. The two are complementary and have been combined in subsequent work.[^11]

LongLoRA's distinguishing place in this landscape is that it isolates the fine-tuning cost as the bottleneck and provides a recipe in which both the parameter update (improved LoRA) and the per-step attention cost (S2-Attn) are simultaneously made cheap, while keeping the deployed model architecturally identical to a standard Llama 2.

## Adoption and downstream work

The LongLoRA repository was widely adopted within months of its release. Two notable follow-on projects make the influence concrete.

The most direct successor is **LongQLoRA**, posted to arXiv in November 2023 by Jianxin Yang. LongQLoRA explicitly composes [QLoRA](/wiki/qlora)-style 4-bit quantized fine-tuning with LongLoRA's shifted sparse attention and Position Interpolation, allowing context extension on a single 32 GB V100 instead of an 8× A100 node. The paper reports competitive PG-19 and Proof-pile perplexity numbers and credits LongLoRA's shifted attention as one of its three core building blocks.[^12]

The MIT HAN Lab's later **LongVA** and **LongVILA** work on long-context vision-language models reuses the LongLoRA fine-tuning recipe to extend the language backbone of multimodal systems, illustrating that the approach is not limited to text models.[^13] In the broader community, the LongAlpaca dataset has been adopted as a long-context instruction-tuning benchmark and supervision source for subsequent open releases on Hugging Face, and the [LongBench](/wiki/longbench) suite is now a common downstream evaluation pairing.[^3]

The Hugging Face PEFT library and ecosystem tools added trainable-embedding and trainable-normalization options around the same period, in part to support recipes such as LongLoRA where additional dense parameter groups outside the LoRA adapter are necessary for stable long-context training.[^14]

## Limitations

The paper and later independent studies identify several limits of the approach.

* **The trained context is the usable context.** Like other fine-tuning-based extensions, a LongLoRA model is reliable up to roughly the context length it was trained at; passkey retrieval accuracy drops sharply beyond that length unless additional positional tricks (such as more aggressive Position Interpolation) are layered on top.[^1] The 100K headline number is achievable on a single node but does not by itself guarantee perfect retrieval at 100K. Independent later work, including the LongBench-v2 benchmark, has shown that high-context perplexity does not automatically translate into reliable retrieval or reasoning at those lengths.[^15]
* **S2-Attn is a training-time approximation, not a free lunch.** Because attention at inference time is the original dense pattern, the per-token inference cost still scales with the full context length. LongLoRA reduces the cost of *producing* a long-context model, not the cost of *running* it.[^1]
* **Improved LoRA changes the parameter-efficient story.** Unfreezing embeddings and norms means a LongLoRA fine-tune updates significantly more parameters than a strict LoRA adapter, although still only about 2% of the model. Storing and serving multiple LongLoRA variants therefore requires either merging back into the base weights or shipping the embedding table alongside the adapter, which is heavier than a pure LoRA adapter.[^1]
* **License of weights and data.** Like other Llama-derived artifacts, the LongLoRA and LongAlpaca weights are released under CC-BY-NC 4.0 and inherit the Llama 2 community license, so they are restricted to research and non-commercial uses.[^3][^4]
* **Comparisons to fully attended fine-tuning.** Although the authors close most of the gap, they acknowledge a small residual perplexity gap between LongLoRA and full fine-tuning at very long contexts. A 2024 controlled study of context-extension methods reports that the gap varies across evaluations and that LongLoRA, like other methods, can show degradation on some downstream tasks relative to full fine-tuning.[^15]
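
To put the inference-cost point above in numbers, the key-value cache that a dense-attention model must hold grows linearly with the context. The estimate below assumes the public Llama 2 7B shape (32 layers, 32 key-value heads of dimension 128) and 16-bit cache entries; the function name is hypothetical and real serving stacks may quantize or page the cache.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Memory for cached keys and values of one sequence (the factor 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
print(per_token)                                # 524288 bytes, i.e. 0.5 MiB per token
print(kv_cache_bytes(4_096) / 2**30, "GiB")     # 2.0 GiB at the original 4K context
print(kv_cache_bytes(100_000) / 1e9, "GB")      # 52.4288 GB at 100K tokens
```

Under these assumptions a single 100K-token sequence needs roughly 52 GB of cache on top of the model weights, which illustrates why cheaper fine-tuning does not by itself make long-context serving cheap.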

## Significance

LongLoRA was one of the first papers to demonstrate that a credible 100K-token context window for a 7B model could be obtained by fine-tuning on a single workstation-class server, rather than by re-pretraining a model. The work made two technical points that have outlasted the 2023 Llama 2 setting: that an attention approximation can be confined to training without changing the deployed model, and that, for long-context adaptation, making the embedding and normalization layers trainable matters more than carefully tuning LoRA ranks. These findings have been reused by later parameter-efficient long-context methods and adopted by parts of the multimodal community for extending vision-language models.[^11][^13]

LongLoRA also helped normalize the open release of long-context model weights at all three Llama 2 sizes (7B, 13B, 70B), at a time when most open releases capped out around 4K to 8K tokens. The accompanying LongAlpaca-12k dataset, although small, became a frequently cited example of how to blend long and short instruction examples to avoid regression on standard tasks, an approach later echoed in larger instruction-tuning mixes.[^4]

## Related work

* [LoRA](/wiki/lora) is the parameter-efficient method that LongLoRA extends.
* [QLoRA](/wiki/qlora) adds 4-bit quantization to the LoRA backbone and has been composed with LongLoRA in follow-ons such as LongQLoRA.
* [DoRA](/wiki/dora) is a later weight-decomposed LoRA variant that addresses a related parameter-efficiency question.
* [Rotary position embedding (RoPE)](/wiki/rope) is the positional encoding that Llama 2 and LongLoRA use; Position Interpolation manipulates its angular indices.
* [YaRN](/wiki/yarn) is a competing context-extension method targeting the positional encoding rather than the fine-tuning recipe.
* [FlashAttention](/wiki/flashattention) and its successors provide the efficient attention kernels that LongLoRA relies on.
* [Sliding window attention](/wiki/sliding_window_attention) and [sparse attention](/wiki/sparse_attention) more generally are the design family from which S2-Attn borrows its local-window structure.
* [Ring Attention](/wiki/ring_attention) is a complementary, parallelism-oriented approach to long-context training.

## See also

* [LoRA (Low-Rank Adaptation)](/wiki/lora)
* [QLoRA](/wiki/qlora)
* [DoRA (Weight-Decomposed Low-Rank Adaptation)](/wiki/dora)
* [PEFT](/wiki/peft)
* [HuggingFace PEFT](/wiki/huggingface_peft)
* [Llama 2](/wiki/llama_2)
* [LLaMA](/wiki/llama)
* [Rotary position embedding (RoPE)](/wiki/rope)
* [YaRN](/wiki/yarn)
* [FlashAttention](/wiki/flashattention)
* [Ring Attention](/wiki/ring_attention)
* [Sparse attention](/wiki/sparse_attention)
* [Sliding window attention](/wiki/sliding_window_attention)
* [Swin Transformer](/wiki/swin_transformer)
* [Context window](/wiki/context_window)
* [LongBench](/wiki/longbench)
* [RedPajama](/wiki/red_pajama)
* [Instruction Tuning](/wiki/instruction_tuning)
* [RMSNorm](/wiki/rmsnorm)
* [Layer normalization](/wiki/layer_normalization)
* [DeepSpeed](/wiki/deepspeed)
* [AdamW](/wiki/adamw)
* [NVIDIA A100](/wiki/nvidia_a100)
* [Hugging Face](/wiki/hugging_face)
* [International Conference on Learning Representations](/wiki/iclr)
* [Vicuna (language model)](/wiki/vicuna)
* [Mistral 7B](/wiki/mistral_7b)

## References

[^1]: Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, Jiaya Jia, "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models", arXiv (v3), 2024-03-08. https://arxiv.org/abs/2309.12307. Accessed 2026-05-20.
[^2]: arXiv listing, "[2309.12307] LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models, submission history", arXiv, 2024-03-08. https://arxiv.org/abs/2309.12307. Accessed 2026-05-20.
[^3]: dvlab-research, "LongLoRA: Code and documents of LongLoRA and LongAlpaca (ICLR 2024 Oral)", GitHub README, 2024. https://github.com/dvlab-research/LongLoRA. Accessed 2026-05-20.
[^4]: Yukang Chen et al., "LongAlpaca-12k dataset card", Hugging Face Datasets, 2023-10-08. https://huggingface.co/datasets/Yukang/LongAlpaca-12k. Accessed 2026-05-20.
[^5]: ICLR 2024 program, "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (Oral)", iclr.cc, 2024. https://iclr.cc/virtual/2024/oral/19790. Accessed 2026-05-20.
[^6]: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian, "Extending Context Window of Large Language Models via Positional Interpolation", arXiv, 2023-06-27. https://arxiv.org/abs/2306.15595. Accessed 2026-05-20.
[^7]: Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, "YaRN: Efficient Context Window Extension of Large Language Models", arXiv, 2023-08-31. https://arxiv.org/abs/2309.00071. Accessed 2026-05-20.
[^8]: MIT HAN Lab, "LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models (project page)", hanlab.mit.edu, 2024. https://hanlab.mit.edu/projects/longlora. Accessed 2026-05-20.
[^9]: Yukang Chen, "Llama-2-7b-longlora-100k-ft model card", Hugging Face, 2023. https://huggingface.co/Yukang/Llama-2-7b-longlora-100k-ft. Accessed 2026-05-20.
[^10]: Yukang Chen, "Llama-2-70b-longlora-32k model card", Hugging Face, 2023. https://huggingface.co/Yukang/Llama-2-70b-longlora-32k. Accessed 2026-05-20.
[^11]: Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-20.
[^12]: Jianxin Yang, "LongQLoRA: Efficient and Effective Method to Extend Context Length of Large Language Models", arXiv, 2023-11-08. https://arxiv.org/abs/2311.04879. Accessed 2026-05-20.
[^13]: Yukang Chen, "Yukang Chen, Research Scientist (NVIDIA / MIT HAN Lab), Long AI research summary", yukangchen.com, 2025. https://yukangchen.com/. Accessed 2026-05-20.
[^14]: Hugging Face, "PEFT: State-of-the-art parameter-efficient fine-tuning methods", GitHub README, 2024. https://github.com/huggingface/peft. Accessed 2026-05-20.
[^15]: Yi Lu et al., "A Controlled Study on Long Context Extension and Generalization in LLMs", arXiv, 2024-09-18. https://arxiv.org/abs/2409.12181. Accessed 2026-05-20.

