Longformer
Last reviewed
May 31, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,683 words
Longformer is a transformer architecture for processing long documents, introduced by Iz Beltagy, Matthew E. Peters, and Arman Cohan of the Allen Institute for AI (AI2) in the April 2020 paper "Longformer: The Long-Document Transformer" (arXiv:2004.05150).[1] Its defining feature is a sparse attention mechanism that scales linearly with sequence length, O(n), in place of the quadratic O(n²) cost of standard self-attention. This lets a single model read sequences of several thousand tokens at once: the most widely used checkpoint, allenai/longformer-base-4096 on Hugging Face, accepts up to 4,096 tokens, eight times the 512-token limit of BERT-style models.[2][3]
Rather than design a new pretraining recipe from scratch, the authors built Longformer as a "drop-in replacement" for the dense attention inside an existing pretrained model. The released encoder is initialized from a RoBERTa checkpoint, has its position embeddings extended from 512 to 4,096, and is then given a short round of continued masked-language-model pretraining.[1][4] On long-document tasks such as multi-hop and open-domain question answering, document classification, and coreference resolution, Longformer consistently outperforms a RoBERTa baseline that has to break the input into 512-token chunks, and at the time of release it set new state-of-the-art results on the WikiHop and TriviaQA question-answering benchmarks and on the text8 and enwik8 character-level language-modeling benchmarks.[1] The paper also introduced a sequence-to-sequence variant, the Longformer-Encoder-Decoder (LED), built on top of BART for generative tasks like long-document summarization.[1][5]
Longformer arrived during a wave of "efficient transformer" research in 2020 alongside closely related models such as ETC, Reformer, and Google's BigBird, all attacking the same quadratic-attention bottleneck. Of these it became one of the most widely adopted in practice: the base checkpoint records well over a million downloads per month on Hugging Face, has hundreds of community fine-tunes, and remains a common baseline and production choice for long-text encoding even after the rise of long-context decoder-only large language models.[2][3]
A standard Transformer computes attention between every pair of positions in its input. For a sequence of length n, the attention score matrix QKᵀ has n² entries, so both the compute and the memory needed grow with the square of the sequence length. This is why the original BERT and RoBERTa models cap their input at 512 tokens: doubling the length roughly quadruples the cost, and long inputs quickly exhaust GPU memory. Practitioners working with long documents historically dealt with this in one of three ways, all of them lossy: truncate the document to the first 512 tokens, split it into independent chunks and process each separately, or build a two-stage pipeline that first retrieves a small relevant passage and then runs a short-context model on it.[1] Each approach either discards information or introduces complex machinery to stitch chunk-level results back together.
Longformer's premise is that for most long-document tasks, full all-to-all attention is unnecessary. Local context, the words immediately surrounding a given token, carries most of the signal, and a small number of special positions need to see the whole sequence. Longformer therefore replaces dense attention with a fixed sparse pattern that combines a windowed local attention with a task-specific global attention. Because each token attends to only a constant-size neighborhood plus a handful of global tokens, the total number of attention computations grows linearly with sequence length rather than quadratically. The result is a model that can ingest an entire document in a single forward pass, build contextual representations across the whole input through stacked attention layers, and avoid both the information loss of truncation and the architectural complexity of chunking.[1]
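The scaling argument above can be made concrete with a rough count of attention scores. The sketch below is illustrative only: the function names are made up for this example, and the sparse count ignores sequence-boundary effects and any overlap between the local window and the global tokens, so it should be read as an estimate rather than an exact figure.

```python
def dense_attention_pairs(n: int) -> int:
    """Number of query-key scores in full self-attention: one per pair."""
    return n * n


def sparse_attention_pairs(n: int, window: int, num_global: int) -> int:
    """Rough count of scores in a local-plus-global pattern.

    Every token scores about `window` neighbours plus `num_global` global
    tokens, and every global token additionally scores all n positions.
    """
    local_and_to_global = n * (window + num_global)
    from_global = num_global * n
    return local_and_to_global + from_global


for n in (1024, 4096, 16384):
    dense = dense_attention_pairs(n)
    sparse = sparse_attention_pairs(n, window=512, num_global=1)
    print(f"n={n:>6}  dense={dense:>12,}  sparse~{sparse:>11,}")
```

Doubling `n` quadruples the dense count but only doubles the sparse estimate, which is the linear-versus-quadratic difference described in the text.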
The key engineering claim is that this sparse attention is a faithful substitute for the dense version inside a pretrained model. Because Longformer-base uses a local window of 512 (matching RoBERTa's full context width) and inherits RoBERTa's weights, the lower layers behave almost identically to the original model while the stacked windows give upper layers an effectively global receptive field. Continued pretraining then adapts the weights to the longer context with only a modest number of gradient updates.[4]
Longformer's attention pattern is built from three components, illustrated in the paper as sparse variants of the full n² attention matrix.[1]
The core pattern is a fixed-size sliding window. Given a window width w, each token attends to the ½w tokens on either side of it, for a total of w neighbors. The cost of this pattern is O(n × w): linear in sequence length because w is a fixed constant independent of n. A single windowed layer can only see a local neighborhood, but stacking layers expands the reach. With ℓ stacked layers of window size w, the receptive field at the top layer is ℓ × w, analogous to how stacked convolutions in a CNN build up a large receptive field from small local filters. With enough layers, the model can in principle propagate information across the entire sequence even though no single layer attends globally.[1]
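As a minimal sketch of the sliding window, the helper below lists the positions a token attends to and evaluates the ℓ × w receptive-field estimate. The function names are hypothetical, and including the token itself in its own window is an assumption of this example (it mirrors ordinary self-attention).

```python
from typing import List


def window_positions(i: int, n: int, w: int) -> List[int]:
    """Positions attended by token i: w // 2 on each side plus i itself,
    clipped to the valid range [0, n)."""
    half = w // 2
    return list(range(max(0, i - half), min(n, i + half + 1)))


def receptive_field(num_layers: int, w: int) -> int:
    """Top-layer receptive field estimate for stacked windows: l * w."""
    return num_layers * w


print(window_positions(10, 100, 4))   # a token in the middle of the sequence
print(window_positions(0, 100, 4))    # a token at the left boundary
print(receptive_field(12, 512))       # 12 layers with a 512-token window
```

With 12 layers and a window of 512, the estimate is 6,144 positions, which already exceeds the 4,096-token input of the released encoder.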
The authors found it helpful to vary the window size across layers rather than fixing it. In their language-modeling experiments, lower layers use small windows to capture fine local structure, and window sizes increase toward the upper layers so that higher levels of the network can build representations over larger spans. An ablation showed that increasing window size from the bottom (32) to the top (512) layers produced the best bits-per-character (BPC) score, beating both the reverse arrangement and a fixed average window.[1]
To widen the receptive field further without adding computation, the window can be "dilated," leaving gaps of size d between attended positions, directly analogous to dilated convolutions. With dilation d across ℓ layers of window w, the receptive field grows to ℓ × d × w, which can reach tens of thousands of tokens even for small dilation values. Because attention is multi-headed, different heads can use different dilation settings: some heads attend without dilation to focus tightly on local context, while others use dilation to reach distant tokens. The authors used dilation only in the autoregressive language-modeling setting (on a small number of heads in the upper layers) and reported it gave a small improvement; they did not use it in the pretrained encoder, since dilation is not compatible with the pretrained RoBERTa weights.[1]
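The dilated pattern can be sketched the same way. This example assumes that d = 1 means no dilation and that attended positions are spaced d apart; the function names are illustrative and not taken from the paper's code.

```python
from typing import List


def dilated_positions(i: int, n: int, w: int, d: int) -> List[int]:
    """Positions attended by token i with window w and dilation d.

    Offsets run from -w // 2 to +w // 2 in steps of d positions;
    anything outside [0, n) is dropped.
    """
    half = w // 2
    candidates = (i + k * d for k in range(-half, half + 1))
    return [j for j in candidates if 0 <= j < n]


def dilated_receptive_field(num_layers: int, w: int, d: int) -> int:
    """Receptive field estimate from the text: l * d * w."""
    return num_layers * d * w


print(dilated_positions(10, 100, 4, 1))      # no dilation
print(dilated_positions(10, 100, 4, 3))      # same number of scores, wider reach
print(dilated_receptive_field(12, 512, 4))   # small d, large receptive field
```

The number of scores per token is unchanged by dilation; only the span they cover grows, which is why the receptive field widens at no extra cost.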
Local windows alone cannot learn task-specific full-sequence representations, so Longformer adds global attention on a few pre-selected positions chosen according to the task. Global attention is symmetric: a token with global attention attends to every other token in the sequence, and every token in the sequence attends back to it. Because the number of such global tokens is small and independent of n, adding them keeps the overall complexity at O(n). Global attention is the mechanism by which Longformer injects task-specific inductive bias into an otherwise generic local pattern. The choice of which tokens are global depends on the downstream task:[1][4]
For classification, global attention is placed on the [CLS] token (the <s> token in RoBERTa), so the pooled representation can aggregate the whole document.
In the Hugging Face implementation this is controlled by a global_attention_mask tensor passed alongside the input, where a value of 1 marks a position as global and 0 marks it as local. Setting global attention correctly for the task is the user's responsibility and is important for good performance.[3]
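To illustrate the 1/0 convention and the symmetry of global attention, here is a dependency-free sketch that uses plain Python lists in place of tensors. The helper names are hypothetical; in the Hugging Face library the mask would be a tensor with the same shape as the input IDs.

```python
from typing import List


def classification_global_mask(seq_len: int) -> List[int]:
    """Mask with global attention (1) on the first token only, local (0) elsewhere."""
    mask = [0] * seq_len
    mask[0] = 1
    return mask


def can_attend(i: int, j: int, global_mask: List[int], w: int) -> bool:
    """True if token i attends to token j under the local-plus-global pattern.

    Global attention is symmetric: a global token sees everything and
    everything sees it. Otherwise only the local window applies.
    """
    if global_mask[i] or global_mask[j]:
        return True
    return abs(i - j) <= w // 2


mask = classification_global_mask(8)
print(mask)
print(can_attend(7, 0, mask, w=2))  # distant token reaching the global token
print(can_attend(0, 7, mask, w=2))  # global token reaching a distant token
print(can_attend(7, 3, mask, w=2))  # two distant local tokens
```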
Standard attention computes queries, keys, and values from input via three linear projections Q, K, V. Longformer instead uses two separate sets of projection matrices: Q_s, K_s, V_s for the sliding-window attention and Q_g, K_g, V_g for the global attention. The extra projections give the model the flexibility to model the two kinds of attention differently, which the authors show is critical for best downstream performance. The global projections are initialized to copies of the sliding-window projections at the start of fine-tuning.[1]
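The two sets of projections and their copy-based initialization can be sketched with tiny matrices. This is a simplified illustration, not the library's implementation: it shows only the query side, uses plain lists instead of tensors, and all names are made up for the example.

```python
import copy
from typing import Dict, List

Matrix = List[List[float]]


def matvec(weights: Matrix, x: List[float]) -> List[float]:
    """Multiply a matrix by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init_global_from_local(local: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Start the global projections as independent copies of the local ones."""
    return copy.deepcopy(local)


def query_for(x: List[float], is_global: bool,
              local: Dict[str, Matrix], global_: Dict[str, Matrix]) -> List[float]:
    """Project x with the global query matrix for global tokens, else the local one."""
    weights = global_["Q"] if is_global else local["Q"]
    return matvec(weights, x)


identity = [[1.0, 0.0], [0.0, 1.0]]
local_proj = {"Q": copy.deepcopy(identity), "K": copy.deepcopy(identity), "V": copy.deepcopy(identity)}
global_proj = init_global_from_local(local_proj)

x = [2.0, 3.0]
print(query_for(x, True, local_proj, global_proj) == query_for(x, False, local_proj, global_proj))

# Simulate a fine-tuning update that touches only the global projection.
global_proj["Q"][0][0] = 2.0
print(query_for(x, True, local_proj, global_proj), query_for(x, False, local_proj, global_proj))
```

Right after initialization both paths give identical outputs, so the pretrained behavior is preserved; the two sets are then free to diverge during training.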
Implementing the sparse pattern efficiently is non-trivial: it requires a form of "banded" matrix multiplication that computes only the diagonals of QKᵀ that fall inside the window, which is not natively supported by libraries such as PyTorch or TensorFlow. The paper describes three implementations: loop, a memory-efficient but very slow PyTorch version used only for testing; chunks, a vectorized version that supports the non-dilated case and is used for pretraining and fine-tuning; and cuda, a custom CUDA kernel written with TVM and used for the language-modeling experiments. Figure 1 of the paper shows that Longformer's memory use rises linearly with sequence length while full self-attention rises quadratically and runs out of memory on long inputs, with the vectorized chunked implementation being the fastest in practice.[1]
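The idea of a banded product can be shown with a deliberately slow loop version, similar in spirit to the reference "loop" variant described above but not taken from the authors' code. It computes only the entries of QKᵀ that fall inside the window and never materializes the full n × n matrix; the names and the dictionary output format are choices made for this sketch.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def banded_scores(Q: List[Vector], K: List[Vector], w: int) -> Dict[Tuple[int, int], float]:
    """Compute Q[i] . K[j] only where |i - j| <= w // 2.

    Returns a mapping from (i, j) to the score, so memory grows with
    n * w rather than n * n.
    """
    n = len(Q)
    half = w // 2
    scores: Dict[Tuple[int, int], float] = {}
    for i in range(n):
        for j in range(max(0, i - half), min(n, i + half + 1)):
            scores[(i, j)] = dot(Q[i], K[j])
    return scores


# Toy inputs chosen so that Q[i] . K[j] equals i + j.
Q = [[float(i), 1.0] for i in range(6)]
K = [[1.0, float(j)] for j in range(6)]
band = banded_scores(Q, K, w=2)
print(len(band), "scores instead of", len(Q) * len(K))
```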
Longformer is an encoder-only Transformer. The released checkpoints come in two sizes that mirror RoBERTa-base and RoBERTa-large, differing only in the attention layers and the maximum sequence length.[3]
| Specification | longformer-base-4096 | longformer-large-4096 |
|---|---|---|
| Initialized from | RoBERTa-base | RoBERTa-large |
| Transformer layers | 12 | 24 |
| Hidden size | 768 | 1024 |
| Attention heads | 12 | 16 |
| Local attention window | 512 | 512 |
| Max sequence length | 4,096 tokens | 4,096 tokens |
| Vocabulary | byte-level BPE (RoBERTa) | byte-level BPE (RoBERTa) |
| Pretraining objective | masked language modeling | masked language modeling |
| License | Apache 2.0 | Apache 2.0 |
Because it descends from RoBERTa, Longformer does not use token_type_ids (segment embeddings); when two segments need to be combined, such as a question and a context passage, they are concatenated and separated by the </s> separator token rather than BERT's [SEP]/segment-id scheme.[3]
RoBERTa uses learned absolute position embeddings with a maximum position of 512. To support 4,096 tokens, Longformer adds new position embeddings and, crucially, initializes them not randomly but by copying RoBERTa's 512 position embeddings repeatedly to fill the longer matrix. The authors note that BERT-style models show a strong learned bias toward attending to local positions (such as the immediately preceding or following token), and copying preserves this local structure everywhere except at the boundaries between copied blocks. This simple trick let pretraining converge with very few gradient updates.[4]
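The position-copying initialization amounts to tiling the original table. The sketch below uses plain lists and a hypothetical helper name, and it ignores implementation details such as reserved padding positions in the real embedding matrix.

```python
from typing import List

Vector = List[float]


def extend_position_embeddings(original: List[Vector], new_len: int) -> List[Vector]:
    """Fill a longer position table by repeating the original rows in order.

    Row p of the result is a copy of row p % len(original), so the local
    ordering of positions is preserved inside each copied block.
    """
    k = len(original)
    return [list(original[p % k]) for p in range(new_len)]


# Toy 512-row table with one-dimensional "embeddings" for readability.
short_table = [[float(p)] for p in range(512)]
long_table = extend_position_embeddings(short_table, 4096)

print(len(long_table))               # number of rows in the extended table
print(long_table[512] == long_table[0])  # first row of the second block repeats row 0
```

Going from 512 to 4,096 positions uses eight copies of the original table, so seven block boundaries are the only places where neighbouring positions no longer carry neighbouring embeddings.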
The Hugging Face Transformers library exposes Longformer through a LongformerModel base class and task-specific heads including LongformerForMaskedLM, LongformerForSequenceClassification, LongformerForQuestionAnswering, LongformerForTokenClassification, and LongformerForMultipleChoice, mirroring the standard BERT/RoBERTa head set. The attention_window configuration parameter sets the local window size (default 512) and may be specified per layer.[3]
The original Longformer is encoder-only and therefore suited to understanding tasks (classification, QA, tagging) but not to generation. To extend its linearly scaling sparse attention to sequence-to-sequence problems such as summarization, the authors introduced the Longformer-Encoder-Decoder (LED) in version 2 of the paper (December 2020).[1][5]
LED follows the encoder-decoder structure of the original Transformer. Its encoder replaces full self-attention with Longformer's local-plus-global pattern, so it scales linearly with the length of the input document; its decoder uses ordinary full self-attention over the (much shorter) output sequence and full cross-attention to the encoded input. Because pretraining a seq2seq model from scratch is expensive, LED is initialized from BART and follows BART's exact architecture in number of layers and hidden sizes. The only change needed to handle long inputs is extending BART's position embeddings from 1,024 to 16,384 (16K) tokens, again by repeatedly copying the original 1K embeddings.[1][5] Two sizes were released: LED-base and LED-large, with 6 and 12 layers respectively in both the encoder and decoder stacks, distributed on Hugging Face as allenai/led-base-16384 and allenai/led-large-16384.[5][6]
LED was evaluated on the arXiv long-document summarization dataset, where the 90th-percentile document length is about 14,500 tokens, well beyond any 512- or 1,024-token model. The encoder uses a window of 1,024 and global attention on the first <s> token; the decoder attends fully and is trained with teacher forcing, using beam search at inference. Despite having no summarization-specific pretraining and being only initialized from BART, LED-large at 16,384 tokens achieved state-of-the-art ROUGE on arXiv, slightly edging out the concurrently published BigBird summarization model (which is pretrained specifically for summarization on top of Pegasus). The paper also demonstrated that increasing the input length from 1K to 4K to 16K monotonically improved ROUGE scores, underscoring the value of reading the whole document.[1]
| arXiv summarization model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Pegasus (2020) | 44.21 | 16.95 | 38.83 |
| LED-large (seqlen 4,096) | 44.40 | 17.94 | 39.76 |
| BigBird (seqlen 4,096) | 46.63 | 19.02 | 41.77 |
| LED-large (seqlen 16,384) | 46.63 | 19.62 | 41.83 |
Because the recipe (take a pretrained encoder, swap in windowed attention, extend positions, continue MLM pretraining) is general, the community has produced many domain-specific Longformers. Examples include clinical and biomedical variants such as Clinical-Longformer and a Longformer-based version of AI2's own SciBERT for scientific text, legal-domain Longformers, and numerous task-specific fine-tunes; the base checkpoint alone lists well over a hundred fine-tuned descendants on Hugging Face.[2][3] AI2 also published a conversion notebook so that users can turn an arbitrary RoBERTa- or BART-style checkpoint into a long-context model using the same position-copying procedure.[4]
The authors first validated the attention pattern on autoregressive character-level language modeling, the standard test bed for long-sequence transformers at the time, using the text8 and enwik8 benchmarks measured in bits per character (BPC, lower is better). They used a staged training schedule that started with a short sequence length (2,048) and small window and progressively doubled both over five phases up to a final sequence length of 23,040, which made training fast while reserving the expensive long-sequence work for the end.[1]
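The sequence-length side of the staged schedule can be reproduced with simple arithmetic. This sketch assumes that the length doubles each phase and is capped at the stated final value in the last phase, which is one way to reconcile "doubling" with a final length of 23,040; the window sizes are not reproduced here, and the function name is hypothetical.

```python
from typing import List


def staged_sequence_lengths(start: int = 2048, final: int = 23040, phases: int = 5) -> List[int]:
    """Sequence length per phase: double each time, never exceeding `final`."""
    return [min(start * 2 ** phase, final) for phase in range(phases)]


print(staged_sequence_lengths())
```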
A "small" Longformer of 41M parameters set new state-of-the-art results, reaching 1.10 BPC on text8 and 1.00 BPC on enwik8, beating comparable models such as Transformer-XL, BP-Transformer, and Sukhbaatar et al.'s Adaptive Span at similar parameter counts. A "large" 102M-parameter Longformer reached 0.99 BPC on enwik8, outperforming the comparable 88M Transformer-XL, matching the Sparse Transformer and the 277M Transformer-XL, and doing so with far fewer parameters than the very largest models.[1]
After continued MLM pretraining, Longformer was fine-tuned on six long-document tasks and compared head-to-head against a strong RoBERTa-base baseline that breaks long inputs into the longest possible 512-token segments and concatenates their activations. Longformer-base outperformed RoBERTa-base on every task, with the largest gains on the tasks that most depend on long-range context.[1]
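The chunking baseline can be sketched as follows. The `encode_segment` function is a hypothetical stand-in for a 512-token encoder (it returns one dummy activation per token), and the sketch ignores the special tokens a real pipeline would add to each segment.

```python
from typing import List


def chunk_tokens(token_ids: List[int], max_len: int = 512) -> List[List[int]]:
    """Split a token sequence into consecutive segments of at most max_len."""
    return [token_ids[s:s + max_len] for s in range(0, len(token_ids), max_len)]


def encode_segment(segment: List[int]) -> List[float]:
    """Hypothetical encoder stand-in: one dummy activation per token."""
    return [float(t) for t in segment]


def encode_long_input(token_ids: List[int], max_len: int = 512) -> List[float]:
    """Encode each segment independently and concatenate the activations."""
    activations: List[float] = []
    for segment in chunk_tokens(token_ids, max_len):
        activations.extend(encode_segment(segment))
    return activations


tokens = list(range(1300))
print([len(c) for c in chunk_tokens(tokens)])
print(len(encode_long_input(tokens)))
```

Because each segment is encoded on its own, no token in one segment can attend to a token in another, which is the limitation Longformer's single-pass encoding removes.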
| Task (metric) | RoBERTa-base | Longformer-base |
|---|---|---|
| WikiHop (accuracy) | 72.4 | 75.0 |
| TriviaQA (F1) | 74.3 | 75.2 |
| HotpotQA (joint F1) | 63.5 | 64.4 |
| OntoNotes coreference (avg F1) | 78.4 | 78.6 |
| IMDB (accuracy) | 95.3 | 95.7 |
| Hyperpartisan news (F1) | 87.4 | 94.8 |
The pattern is informative. On WikiHop, a multi-hop QA dataset that requires combining facts scattered across documents, and on Hyperpartisan, whose documents are long, the gains are large (roughly 2.6 and 7.4 points). On TriviaQA, IMDB, and OntoNotes the improvements are smaller, because in those datasets the local context near an answer is often sufficient or most documents are short. The authors emphasize that these gains come from the attention mechanism itself, not from extra pretraining: when Longformer is configured exactly like RoBERTa-base (512 sequence length, full n² attention) it performs slightly worse than RoBERTa-base, confirming the long-context attention is doing the work.[1]
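The per-task gains quoted above follow directly from the results table; the short script below recomputes them from those figures.

```python
# (RoBERTa-base, Longformer-base) scores copied from the table above.
results = {
    "WikiHop": (72.4, 75.0),
    "TriviaQA": (74.3, 75.2),
    "HotpotQA": (63.5, 64.4),
    "OntoNotes coreference": (78.4, 78.6),
    "IMDB": (95.3, 95.7),
    "Hyperpartisan news": (87.4, 94.8),
}

# Round to one decimal place to avoid floating-point noise.
gains = {task: round(longformer - roberta, 1)
         for task, (roberta, longformer) in results.items()}

for task, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
    print(f"{task:<24} +{gain}")
```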
The larger Longformer-large pushed the question-answering results further, achieving leaderboard state-of-the-art at submission time (May 2020) with 81.9 accuracy on WikiHop and 77.3 F1 on TriviaQA, improving on the previous best by large margins of roughly 3.6 and 4 points. On HotpotQA it placed near the top of the published leaderboard, trailing graph-neural-network-based systems that encode an explicit entity graph.[1]
Longformer is designed for any natural-language task where the input is longer than the 512-token window of conventional encoders and where truncation or chunking would lose important information. Typical applications include:[1][3]
Long-document classification, with global attention on the [CLS]/<s> token.
Beyond the original NLP benchmarks, Longformer and LED have been used in clinical text processing, legal document analysis, scientific-literature mining, and as a long-context encoder backbone in larger systems. Practically, the model is straightforward to adopt because it is a drop-in replacement for RoBERTa or BART in the Hugging Face ecosystem, requiring only that the user supply a global_attention_mask appropriate to the task.[3]
Longformer's design involves real trade-offs. The sparse attention pattern is fixed and hand-designed rather than learned, so the window size and the placement of global tokens are hyperparameters the practitioner must choose; setting global attention incorrectly for a task degrades performance, and the optimal choice is not always obvious. Because the windowed pattern is local, information between two distant tokens must propagate through several stacked layers rather than in a single attention step, which can in principle limit the modeling of certain long-range dependencies compared with full attention.[1][3]
The efficient attention also requires custom implementation work: the banded matrix multiplication is not natively supported by mainstream deep-learning frameworks, so the full-featured path, including dilation, historically relied on a custom CUDA/TVM kernel, while the portable chunked PyTorch implementation covers only the non-dilated case. The maximum length of 4,096 tokens, while large in 2020, is modest by the standards of later models; extending further requires re-extending position embeddings and additional pretraining.[1] The original Longformer also predates rotary and relative position-embedding schemes that became standard later, relying instead on copied absolute embeddings.
Finally, Longformer is an architecture from 2020. The subsequent shift toward very large decoder-only language models with native context windows of tens of thousands to millions of tokens, enabled by techniques such as FlashAttention, rotary embeddings, and other efficient-attention methods, means that for generative use cases practitioners often reach for a modern long-context LLM instead. Longformer nonetheless remains widely used as an efficient, comparatively small encoder for long-text classification, retrieval, and extraction, where a full LLM would be overkill.[2]
Longformer was developed at the Allen Institute for AI in Seattle by Iz Beltagy, Matthew E. Peters, and Arman Cohan, with all three listed as equal contributors. The first version of the paper was posted to arXiv on April 10, 2020 (arXiv:2004.05150), and the code and pretrained weights were released open-source at github.com/allenai/longformer under the Apache 2.0 license.[1] Matthew Peters was a co-creator of ELMo, and the team's broader interest in long-document modeling at AI2 also produced SciBERT and the SciDocs/Specter line of work, situating Longformer within a sustained AI2 effort on scientific and long-form text.
The model appeared amid an intense burst of "efficient transformer" research in 2020. Reformer (using locality-sensitive hashing), the Sparse Transformer (fixed sparse patterns), ETC (a closely related local-plus-global scheme that uses relative position embeddings and a separate global memory), and Google's BigBird (which combines windowed, global, and random attention and proved theoretically that such sparse transformers are universal approximators) all targeted the quadratic-attention bottleneck within months of one another. Longformer's ETC-like local-plus-global formulation and BigBird's pattern are direct cousins; the papers cite and benchmark against each other.[1]
In December 2020 the authors posted version 2 of the paper, adding the Longformer-Encoder-Decoder (LED) and its arXiv-summarization results, and released the led-base-16384 and led-large-16384 checkpoints.[5] Longformer and LED were integrated into the Hugging Face Transformers library, where they became standard, well-documented model types. Over the following years the base checkpoint grew into one of the most-downloaded long-document encoders on the Hugging Face Hub, accumulating well over a million downloads per month and spawning a large family of domain-specific and task-specific descendants. Although newer long-context architectures have since surpassed its raw context length, Longformer's central idea, that combining a cheap local window with a few global tokens recovers most of the value of full attention at a fraction of the cost, has remained influential in the design of efficient sequence models.[2][3]