Longformer

Updated Jul 13, 2026

Longformer is a transformer architecture for processing long documents, introduced by Iz Beltagy, Matthew E. Peters, and Arman Cohan of the Allen Institute for AI (AI2) in the April 2020 paper "Longformer: The Long-Document Transformer" (arXiv:2004.05150).^[1] Its defining feature is a sparse attention mechanism that scales linearly with sequence length, $O(n)$, in place of the quadratic $O(n^2)$ cost of standard self-attention. In the authors' words, "Longformer's attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention."^[1] This lets a single model read sequences of several thousand tokens at once: the most widely used checkpoint, allenai/longformer-base-4096 on Hugging Face, accepts up to 4,096 tokens, eight times the 512-token limit of BERT-style models.^[2]^[3]

Rather than design a new pretraining recipe from scratch, the authors built Longformer as a drop-in replacement for the dense attention inside an existing pretrained model. The released encoder is initialized from a RoBERTa checkpoint, has its position embeddings extended from 512 to 4,096, and is then given a short round of continued masked-language-model pretraining.^[1]^[4] On long-document tasks such as multi-hop and open-domain question answering, document classification, and coreference resolution, Longformer consistently outperforms a RoBERTa baseline that has to break the input into 512-token chunks. The paper reports that "our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA," and at the time of release it also set new state-of-the-art results on the text8 and enwik8 character-level language-modeling benchmarks.^[1] The paper further introduced a sequence-to-sequence variant, the Longformer-Encoder-Decoder (LED), built on top of BART for generative tasks like long-document summarization.^[1]^[5]

Longformer arrived during a wave of "efficient transformer" research in 2020 alongside closely related models such as ETC, Reformer, and Google's BigBird, all attacking the same quadratic-attention bottleneck. Of these it became one of the most widely adopted in practice: the base checkpoint records well over a million downloads per month on Hugging Face, has hundreds of community fine-tunes, and remains a common baseline and production choice for long-text encoding even after the rise of long-context decoder-only large language models.^[2]^[3]

What problem does Longformer solve?

A standard Transformer computes attention between every pair of positions in its input. For a sequence of length $n$, the attention score matrix $QK^\top$ has $n^2$ entries, so both the compute and the memory needed grow with the square of the sequence length. As the paper puts it, "Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length."^[1] This is why the original BERT and RoBERTa models cap their input at 512 tokens: doubling the length roughly quadruples the cost, and long inputs quickly exhaust GPU memory. Practitioners working with long documents historically dealt with this in one of three ways, all of them lossy: truncate the document to the first 512 tokens, split it into independent chunks and process each separately, or build a two-stage pipeline that first retrieves a small relevant passage and then runs a short-context model on it.^[1] Each approach either discards information or introduces complex machinery to stitch chunk-level results back together.

Longformer's premise is that for most long-document tasks, full all-to-all attention is unnecessary. Local context, the words immediately surrounding a given token, carries most of the signal, and a small number of special positions need to see the whole sequence. Longformer therefore replaces dense attention with a fixed sparse pattern that combines a windowed local attention with a task-specific global attention. Because each token attends to only a constant-size neighborhood plus a handful of global tokens, the total number of attention computations grows linearly with sequence length rather than quadratically. The authors describe the goal directly: "we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer."^[1] The result is a model that can ingest an entire document in a single forward pass, build contextual representations across the whole input through stacked attention layers, and avoid both the information loss of truncation and the architectural complexity of chunking.^[1]
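The scaling argument above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, the window of 512 and a single global token are assumed values, and boundary effects at the ends of the sequence are ignored.

```python
def dense_attention_scores(n: int) -> int:
    """Number of query-key scores in full self-attention: n * n."""
    return n * n


def sparse_attention_scores(n: int, window: int, num_global: int) -> int:
    """Approximate score count for a local window plus symmetric global tokens.

    Each token scores `window` neighbors; each global token attends to all n
    tokens and is attended to by all n tokens (boundary effects ignored).
    """
    local = n * window
    global_part = 2 * n * num_global
    return local + global_part


for n in (512, 4096, 16384):
    dense = dense_attention_scores(n)
    sparse = sparse_attention_scores(n, window=512, num_global=1)
    print(f"n={n:>6}  dense={dense:>12,}  sparse={sparse:>12,}  ratio={dense / sparse:.1f}")
```

Doubling `n` doubles the sparse count but quadruples the dense one, which is the linear-versus-quadratic gap the paragraph describes.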

The key engineering claim is that this sparse attention is a faithful substitute for the dense version inside a pretrained model. Because Longformer-base uses a local window of 512 (matching RoBERTa's full context width) and inherits RoBERTa's weights, the lower layers behave almost identically to the original model while the stacked windows give upper layers an effectively global receptive field. Continued pretraining then adapts the weights to the longer context with only a modest number of gradient updates.^[4]

How does Longformer attention work?

Longformer's attention pattern is built from three components, illustrated in the paper as sparse variants of the full $n^2$ attention matrix.^[1]

Sliding window (local) attention

The core pattern is a fixed-size sliding window. Given a window width $w$, each token attends to $\tfrac{1}{2}w$ tokens on each side of it, for a total of $w$ neighbors. The cost of this pattern is $O(n \times w)$: linear in sequence length because $w$ is a fixed constant independent of $n$. A single windowed layer can only see a local neighborhood, but stacking layers expands the reach. With $L$ stacked layers of window size $w$, the receptive field at the top layer is $L \times w$, analogous to how stacked convolutions in a CNN build up a large receptive field from small local filters. With enough layers, the model can in principle propagate information across the entire sequence even though no single layer attends globally.^[1]
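As a rough illustration of the windowed pattern, the following sketch lists which positions one token attends to and computes the top-layer receptive field $L \times w$. The helper names are hypothetical, and including the token itself in its own window is an assumption made here for simplicity.

```python
from typing import List


def window_neighbors(i: int, n: int, w: int) -> List[int]:
    """Positions token i attends to: w // 2 on each side plus itself, clipped to [0, n)."""
    half = w // 2
    return list(range(max(0, i - half), min(n, i + half + 1)))


def receptive_field(num_layers: int, w: int) -> int:
    """Top-layer receptive field of stacked windowed layers: L * w."""
    return num_layers * w


print(window_neighbors(5, n=20, w=4))          # [3, 4, 5, 6, 7]
print(receptive_field(num_layers=12, w=512))   # 6144
```

With 12 layers and a window of 512, the stacked receptive field (6,144) already exceeds a 4,096-token input.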

The authors found it helpful to vary the window size across layers rather than fixing it. In their language-modeling experiments, lower layers use small windows to capture fine local structure, and window sizes increase toward the upper layers so that higher levels of the network can build representations over larger spans. An ablation showed that increasing window size from bottom (32) to top (512) layers produced the best perplexity, beating both the reverse arrangement and a fixed average window.^[1]

Dilated sliding window

To widen the receptive field further without adding computation, the window can be "dilated," leaving gaps of size $d$ between attended positions, directly analogous to dilated convolutions. With dilation $d$ across $L$ layers of window $w$, the receptive field grows to $L \times d \times w$, which can reach tens of thousands of tokens even for small dilation values. Because attention is multi-headed, different heads can use different dilation settings: some heads attend without dilation to focus tightly on local context, while others use dilation to reach distant tokens. The authors used dilation only in the autoregressive language-modeling setting (on a small number of heads in the upper layers) and reported it gave a small improvement; they did not use it in the pretrained encoder, where adding dilation hurt performance, likely because it is not compatible with the pretrained RoBERTa weights.^[1]
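A dilated window can be sketched the same way. The code below assumes the convention that $d = 1$ means no dilation and that attended positions are spaced $d$ apart; this matches the $L \times d \times w$ formula above but is an assumption about indexing, and the function names are hypothetical.

```python
from typing import List


def dilated_neighbors(i: int, n: int, w: int, d: int) -> List[int]:
    """Positions token i attends to with window w and dilation d (d=1: no dilation)."""
    half = w // 2
    return [i + k * d for k in range(-half, half + 1) if 0 <= i + k * d < n]


def dilated_receptive_field(num_layers: int, w: int, d: int) -> int:
    """Top-layer receptive field with dilation: L * d * w."""
    return num_layers * d * w


print(dilated_neighbors(10, n=100, w=4, d=1))   # [8, 9, 10, 11, 12]
print(dilated_neighbors(10, n=100, w=4, d=3))   # [4, 7, 10, 13, 16]
print(dilated_receptive_field(num_layers=12, w=512, d=4))   # 24576
```

The number of attended positions is the same for both dilation settings; only their spread changes, which is why dilation widens the receptive field at no extra cost.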

Global attention

Local windows alone cannot learn task-specific full-sequence representations, so Longformer adds global attention on a few pre-selected positions chosen according to the task. Global attention is symmetric: a token with global attention attends to every other token in the sequence, and every token in the sequence attends back to it. Because the number of such global tokens is small and independent of $n$, adding them keeps the overall complexity at $O(n)$. Global attention is the mechanism by which Longformer injects task-specific inductive bias into an otherwise generic local pattern. The choice of which tokens are global depends on the downstream task:^[1]^[4]

For classification, global attention is placed on the special [CLS] token (the <s> token in RoBERTa), so the pooled representation can aggregate the whole document.
For question answering, global attention is placed on all of the question tokens, letting the model compare the question against every part of the document.
For masked language modeling, the model relies on local context to predict masked words, so little or no global attention is needed.

In the Hugging Face implementation this is controlled by a global_attention_mask tensor passed alongside the input, where a value of 1 marks a position as global and 0 marks it as local. Setting global attention correctly for the task is the user's responsibility and is important for good performance.^[3]
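The task-specific choices above can be sketched as plain 0/1 masks. This is a minimal illustration using Python lists rather than tensors; the helper names and the example token sequence are hypothetical, and treating everything before the first `</s>` as the question is an assumption about the input layout. In practice the mask would be converted to a tensor and passed as `global_attention_mask`.

```python
from typing import List


def classification_mask(tokens: List[str]) -> List[int]:
    """Global attention on the first (<s>) token only."""
    return [1 if i == 0 else 0 for i in range(len(tokens))]


def question_answering_mask(tokens: List[str], sep: str = "</s>") -> List[int]:
    """Global attention on every position before the first separator (assumed layout)."""
    first_sep = tokens.index(sep)
    return [1 if i < first_sep else 0 for i in range(len(tokens))]


# Hypothetical, already-tokenized question + context pair.
tokens = ["<s>", "Who", "wrote", "it", "?", "</s>", "</s>",
          "The", "paper", "was", "written", "by", "AI2", ".", "</s>"]

print(classification_mask(tokens))
print(question_answering_mask(tokens))
```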

Separate projections and implementation

Standard attention computes queries, keys, and values from the input via three linear projections $Q, K, V$. Longformer instead uses two separate sets of projection matrices: $Q_s, K_s, V_s$ for the sliding-window attention and $Q_g, K_g, V_g$ for the global attention. The extra projections give the model the flexibility to model the two kinds of attention differently, which the authors show is critical for best downstream performance. The global projections are initialized to copies of the sliding-window projections at the start of fine-tuning.^[1]

Implementing the sparse pattern efficiently is non-trivial: it requires a form of "banded" matrix multiplication that computes only the diagonals of $QK^\top$ that fall inside the window, which is not natively supported by libraries such as PyTorch or TensorFlow. The paper describes three implementations: loop, a memory-efficient but very slow PyTorch version used only for testing; chunks, a vectorized version that supports the non-dilated case and is used for pretraining and fine-tuning; and cuda, a custom CUDA kernel written with TVM and used for the language-modeling experiments. Figure 1 of the paper shows that Longformer's memory use rises linearly with sequence length while full self-attention rises quadratically and runs out of memory on long inputs, with the vectorized chunked implementation being the fastest in practice.^[1]
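To make the idea of banded multiplication concrete, here is a deliberately naive pure-Python sketch in the spirit of the loop variant: it walks one diagonal of $QK^\top$ at a time and never touches entries outside the window. It is not the authors' code; the function name and the tiny example matrices are made up for illustration.

```python
from typing import Dict, List, Tuple


def banded_scores(Q: List[List[float]], K: List[List[float]], w: int) -> Dict[Tuple[int, int], float]:
    """Compute q_i . k_j only where |i - j| <= w // 2, one diagonal at a time."""
    n = len(Q)
    half = w // 2
    scores: Dict[Tuple[int, int], float] = {}
    for offset in range(-half, half + 1):      # each offset is one diagonal of QK^T
        for i in range(n):
            j = i + offset
            if 0 <= j < n:
                scores[(i, j)] = sum(a * b for a, b in zip(Q[i], K[j]))
    return scores


# Tiny made-up example: 6 tokens, 2-dimensional queries and keys.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
K = Q
scores = banded_scores(Q, K, w=2)
print(len(scores), "scores computed instead of", len(Q) * len(K))
```

The dictionary holds only the in-window entries, so storage grows with $n \times w$ rather than $n^2$; the Python loops are what make this style slow compared with a vectorized or kernel-based implementation.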

What is the Longformer architecture?

Longformer is an encoder-only Transformer. The released checkpoints come in two sizes that mirror RoBERTa-base and RoBERTa-large, differing only in the attention layers and the maximum sequence length.^[3]

Specification	longformer-base-4096	longformer-large-4096
Initialized from	RoBERTa-base	RoBERTa-large
Transformer layers	12	24
Hidden size	768	1024
Attention heads	12	16
Local attention window	512	512
Max sequence length	4,096 tokens	4,096 tokens
Vocabulary	byte-level BPE (RoBERTa)	byte-level BPE (RoBERTa)
Pretraining objective	masked language modeling	masked language modeling
License	Apache 2.0	Apache 2.0

Because it descends from RoBERTa, Longformer does not use token_type_ids (segment embeddings); when two segments need to be combined, such as a question and a context passage, they are concatenated and separated by the </s> separator token rather than BERT's [SEP]/segment-id scheme.^[3] RoBERTa uses learned absolute position embeddings with a maximum position of 512. To support 4,096 tokens, Longformer adds new position embeddings and, crucially, initializes them not randomly but by copying RoBERTa's 512 position embeddings repeatedly to fill the longer matrix. The authors note that BERT-style models show a strong learned bias toward attending to local positions (such as the immediately preceding or following token), and copying preserves this local structure everywhere except at the boundaries between copied blocks. This simple trick let pretraining converge with very few gradient updates.^[4]
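The position-copying step can be sketched in a few lines. This is an illustration under simplifying assumptions, not the authors' conversion code: embeddings are plain Python lists, the function name is hypothetical, and any reserved or padding rows a real checkpoint may carry are ignored.

```python
from typing import List


def extend_position_embeddings(pretrained: List[List[float]], new_max_pos: int) -> List[List[float]]:
    """Tile the pretrained position embeddings until new_max_pos rows are filled."""
    old_max_pos = len(pretrained)
    if new_max_pos % old_max_pos != 0:
        raise ValueError("new_max_pos is assumed to be a multiple of the original length")
    # Row i of the extended matrix is a copy of row (i mod old_max_pos).
    return [list(pretrained[i % old_max_pos]) for i in range(new_max_pos)]


# Stand-in for 512 learned embeddings (1-dimensional here for readability).
pretrained = [[float(i)] for i in range(512)]
extended = extend_position_embeddings(pretrained, 4096)
print(len(extended), extended[512] == extended[0])
```

Within each copied block the relative layout of nearby positions is unchanged, which is why the local attention bias survives; only the seams between blocks (for example rows 511 and 512) are new to the model.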

The Hugging Face Transformers library exposes Longformer through a LongformerModel base class and task-specific heads including LongformerForMaskedLM, LongformerForSequenceClassification, LongformerForQuestionAnswering, LongformerForTokenClassification, and LongformerForMultipleChoice, mirroring the standard BERT/RoBERTa head set. The attention_window configuration parameter sets the local window size (default 512) and may be specified per layer.^[3]

What is LED (Longformer-Encoder-Decoder)?

Longformer-Encoder-Decoder (LED)

The original Longformer is encoder-only and therefore suited to understanding tasks (classification, QA, tagging) but not to generation. To extend the linear-attention idea to sequence-to-sequence problems such as summarization, the authors introduced the Longformer-Encoder-Decoder (LED) in version 2 of the paper (December 2020), described as "a Longformer variant for supporting long document generative sequence-to-sequence tasks."^[1]^[5]

LED follows the encoder-decoder structure of the original Transformer. Its encoder replaces full self-attention with Longformer's local-plus-global pattern, so it scales linearly with the length of the input document; its decoder uses ordinary full self-attention over the (much shorter) output sequence and full cross-attention to the encoded input. Because pretraining a seq2seq model from scratch is expensive, LED is initialized from BART and follows BART's exact architecture in number of layers and hidden sizes. As the released model card states, "led-large-16384 was initialized from bart-large since both models share the exact same architecture," and "bart-large's position embedding matrix was simply copied 16 times" to handle 16,384 (16K) tokens.^[1]^[5] Two sizes were released: LED-base and LED-large, with 6 and 12 layers respectively in both the encoder and decoder stacks, distributed on Hugging Face as allenai/led-base-16384 and allenai/led-large-16384.^[5]^[6]

LED was evaluated on the arXiv long-document summarization dataset, where the 90th-percentile document length is about 14,500 tokens, well beyond any 512- or 1,024-token model. The encoder uses a window of 1,024 and global attention on the first <s> token; the decoder attends fully and is trained with teacher forcing, using beam search at inference. Despite having no summarization-specific pretraining and being only initialized from BART, LED-large at 16,384 tokens achieved state-of-the-art ROUGE on arXiv, slightly edging out the concurrently published BigBird summarization model (which is pretrained specifically for summarization on top of Pegasus). The paper also demonstrated that increasing the input length from 1K to 4K to 16K monotonically improved ROUGE scores, underscoring the value of reading the whole document.^[1]

arXiv summarization model	ROUGE-1	ROUGE-2	ROUGE-L
Pegasus (2020)	44.21	16.95	38.83
LED-large (seqlen 4,096)	44.40	17.94	39.76
BigBird (seqlen 4,096)	46.63	19.02	41.77
LED-large (seqlen 16,384)	46.63	19.62	41.83

Domain-specific and community variants

Because the recipe (take a pretrained encoder, swap in windowed attention, extend positions, continue MLM pretraining) is general, the community has produced many domain-specific Longformers. Examples include clinical and biomedical variants such as Clinical-Longformer and a Longformer-based version of AI2's own SciBERT for scientific text, legal-domain Longformers, and numerous task-specific fine-tunes; the base checkpoint alone lists well over a hundred fine-tuned descendants on Hugging Face.^[2]^[3] AI2 also published a conversion notebook so that users can turn an arbitrary RoBERTa- or BART-style checkpoint into a long-context model using the same position-copying procedure.^[4]

How well does Longformer perform?

Character-level language modeling

The authors first validated the attention pattern on autoregressive character-level language modeling, the standard test bed for long-sequence transformers at the time, using the text8 and enwik8 benchmarks measured in bits per character (BPC, lower is better). They used a staged training schedule that started with a short sequence length (2,048) and small window and progressively doubled both over five phases up to a final sequence length of 23,040, which made training fast while reserving the expensive long-sequence work for the end.^[1]

A "small" Longformer of 41M parameters set new state-of-the-art results, reaching 1.10 BPC on text8 and 1.00 BPC on enwik8, beating comparable models such as Transformer-XL, BP-Transformer, and Sukhbaatar et al.'s Adaptive Span at similar parameter counts. A "large" 102M-parameter Longformer reached 0.99 BPC on enwik8, outperforming the comparable 88M Transformer-XL configuration and matching both the Sparse Transformer and the much larger 277M Transformer-XL, while using far fewer parameters than the very largest models.^[1]

Downstream NLP tasks

After continued MLM pretraining, Longformer was fine-tuned on six long-document tasks and compared head-to-head against a strong RoBERTa-base baseline that breaks long inputs into the longest possible 512-token segments and concatenates their activations. Longformer-base outperformed RoBERTa-base on every task, with the largest gains on the tasks that most depend on long-range context.^[1]

Task (metric)	RoBERTa-base	Longformer-base
WikiHop (accuracy)	72.4	75.0
TriviaQA (F1)	74.3	75.2
HotpotQA (joint F1)	63.5	64.4
OntoNotes coreference (avg F1)	78.4	78.6
IMDB (accuracy)	95.3	95.7
Hyperpartisan news (F1)	87.4	94.8

The pattern is informative. On WikiHop, a multi-hop QA dataset that requires combining facts scattered across documents, and on Hyperpartisan, whose documents are long, the gains are large (roughly 2.6 and 7.4 points). On TriviaQA, IMDB, and OntoNotes the improvements are smaller, because in those datasets the local context near an answer is often sufficient or most documents are short. The authors emphasize that these gains come from the attention mechanism itself, not from extra pretraining: when Longformer is configured exactly like RoBERTa-base (512 sequence length, full $n^2$ attention) it performs slightly worse than RoBERTa-base, confirming the long-context attention is doing the work.^[1]

The larger Longformer-large pushed the question-answering results further, achieving leaderboard state-of-the-art at submission time (May 2020) with 81.9 accuracy on WikiHop and 77.3 F1 on TriviaQA, improving on the previous best by large margins of roughly 3.6 and 4 points. On HotpotQA it placed near the top of the published leaderboard, trailing graph-neural-network-based systems that encode an explicit entity graph.^[1]

What is Longformer used for?

Longformer is designed for any natural-language task where the input is longer than the 512-token window of conventional encoders and where truncation or chunking would lose important information. Typical applications include:^[1]^[3]

Long-document question answering, especially multi-hop and open-domain settings where the evidence is spread across a long context (WikiHop, TriviaQA, HotpotQA).
Document classification, including sentiment analysis of long reviews and detection tasks over full articles, with global attention on the [CLS]/<s> token.
Coreference resolution over full documents, where mentions and their antecedents may be far apart.
Token-level tasks such as named-entity recognition over long texts.
Long-document summarization and other generation via the LED variant, used for scientific papers, news articles, meeting transcripts, and similar long-form inputs.
Embedding and retrieval of long passages, where the whole document must be encoded into a single representation without splitting.

Beyond the original NLP benchmarks, Longformer and LED have been used in clinical text processing, legal document analysis, scientific-literature mining, and as a long-context encoder backbone in larger systems. Practically, the model is straightforward to adopt because it is a drop-in replacement for RoBERTa or BART in the Hugging Face ecosystem, requiring only that the user supply a global_attention_mask appropriate to the task.^[3]

What are Longformer's limitations?

Longformer's design involves real trade-offs. The sparse attention pattern is fixed and hand-designed rather than learned, so the window size and the placement of global tokens are hyperparameters the practitioner must choose; setting global attention incorrectly for a task degrades performance, and the optimal choice is not always obvious. Because the windowed pattern is local, information between two distant tokens must propagate through several stacked layers rather than in a single attention step, which can in principle limit the modeling of certain long-range dependencies compared with full attention.^[1]^[3]

The efficient attention also requires custom implementation work: the banded matrix multiplication is not natively supported by mainstream deep-learning frameworks, so dilated attention historically relied on a custom CUDA/TVM kernel, while the portable chunked PyTorch implementation, though fast, supports only the non-dilated case. The maximum length of 4,096 tokens, while large in 2020, is modest by the standards of later models; extending further requires re-extending position embeddings and additional pretraining.^[1] The original Longformer also predates rotary and relative position-embedding schemes that became standard later, relying instead on copied absolute embeddings.

Finally, Longformer is an architecture from 2020. The subsequent shift toward very large decoder-only language models with native context windows of tens of thousands to millions of tokens, enabled by techniques such as FlashAttention, rotary embeddings, and other efficient-attention methods, means that for generative use cases practitioners often reach for a modern long-context LLM instead. Longformer nonetheless remains widely used as an efficient, comparatively small encoder for long-text classification, retrieval, and extraction, where a full LLM would be overkill.^[2]

When was Longformer released?

Longformer was developed at the Allen Institute for AI in Seattle by Iz Beltagy, Matthew E. Peters, and Arman Cohan, with all three listed as equal contributors. The first version of the paper was posted to arXiv on April 10, 2020 (arXiv:2004.05150), and the code and pretrained weights were released open-source at github.com/allenai/longformer under the Apache 2.0 license.^[1] Matthew Peters was a co-creator of ELMo, and the team's broader interest in long-document modeling at AI2 also produced SciBERT and the SciDocs/Specter line of work, situating Longformer within a sustained AI2 effort on scientific and long-form text.

The model appeared amid an intense burst of "efficient transformer" research in 2020. Reformer (using locality-sensitive hashing), the Sparse Transformer (fixed sparse patterns), ETC (a closely related local-plus-global scheme that uses relative position embeddings and a separate global memory), and Google's BigBird (which combines windowed, global, and random attention and proved theoretically that such sparse transformers are universal approximators) all targeted the quadratic-attention bottleneck within months of one another. Longformer's ETC-like local-plus-global formulation and BigBird's pattern are direct cousins; the papers cite and benchmark against each other.^[1]

In December 2020 the authors posted version 2 of the paper, adding the Longformer-Encoder-Decoder (LED) and its arXiv-summarization results, and released the led-base-16384 and led-large-16384 checkpoints.^[5] Longformer and LED were integrated into the Hugging Face Transformers library, where they became standard, well-documented model types. Over the following years the base checkpoint grew into one of the most-downloaded long-document encoders on the Hugging Face Hub, accumulating well over a million downloads per month and spawning a large family of domain-specific and task-specific descendants. Although newer long-context architectures have since surpassed its raw context length, Longformer's central idea, that combining a cheap local window with a few global tokens recovers most of the value of full attention at a fraction of the cost, has remained influential in the design of efficient sequence models.^[2]^[3]

ELI5: Longformer in plain terms

Imagine reading a very long book but only being allowed to remember a few hundred words at a time, the way ordinary BERT works. To understand the whole book you would have to chop it into chunks and read each one separately, forgetting what came before. Longformer fixes this by changing how the model "pays attention." Instead of having every word look at every other word (which gets impossibly expensive for long texts, growing with the square of the length), each word mostly looks only at its near neighbors through a sliding window, like reading with your finger and seeing a few words on each side. A few special words, such as the question you are trying to answer, are allowed to look at the entire document and be looked at by everything else. Stacking many of these windowed layers lets information travel across the whole book, so the model can read about 4,000 words at once (eight times more than BERT) without running out of memory. The same trick, applied to a writing model called BART, gives LED, which can read up to roughly 16,000 words and write a summary.

References

1. Beltagy, I.; Peters, M. E.; Cohan, A. (April 10, 2020; revised December 2, 2020). "Longformer: The Long-Document Transformer." arXiv:2004.05150. https://arxiv.org/abs/2004.05150
2. Allen Institute for AI. "allenai/longformer-base-4096." Hugging Face model card (Apache 2.0; downloads and fine-tune counts). https://huggingface.co/allenai/longformer-base-4096
3. Hugging Face. "Longformer." Transformers documentation (self-attention mechanism, attention_window, global_attention_mask, model classes, base/large checkpoints). https://huggingface.co/docs/transformers/model_doc/longformer
4. Beltagy, I.; Peters, M. E.; Cohan, A. "Longformer." GitHub repository (RoBERTa initialization, position-embedding copying, conversion notebook, Apache 2.0 license). https://github.com/allenai/longformer
5. Beltagy, I.; Peters, M. E.; Cohan, A. "Longformer paper, Section 7: Longformer-Encoder-Decoder (LED)" (initialized from BART, 16K position embeddings, arXiv summarization ROUGE results). arXiv:2004.05150v2. https://arxiv.org/pdf/2004.05150
6. Allen Institute for AI. "allenai/led-base-16384" and "allenai/led-large-16384." Hugging Face model cards (LED initialized from bart-base/bart-large, 16,384-token support, long-document summarization and QA). https://huggingface.co/allenai/led-large-16384
7. Zaheer, M.; et al. (2020). "Big Bird: Transformers for Longer Sequences." arXiv:2007.14062 (contemporaneous sparse-attention model benchmarked against Longformer). https://arxiv.org/abs/2007.14062
8. Ainslie, J.; et al. (2020). "ETC: Encoding Long and Structured Inputs in Transformers." Proceedings of EMNLP 2020 (closely related local-plus-global attention scheme). https://aclanthology.org/2020.emnlp-main.19/
