Long-context language models
Long-context language models are large language models engineered to accept and reason over inputs that are far larger than the few-thousand-token windows used by early transformer systems. The frontier has moved from 512 tokens in BERT (2018) to 2,048 tokens in GPT-3 (2020), to 100K tokens in Claude 2 (2023), to one million or more tokens in Gemini 1.5 Pro (2024) and the Claude Opus 4.7 family (2026).[1][2][3][4] Extending the context window requires solutions across several layers of the stack: positional encoding schemes that extrapolate beyond training length, sub-quadratic or IO-aware attention kernels, KV cache memory management, evaluation benchmarks designed for very long inputs, and inference systems that can serve such requests at acceptable cost and latency. Long-context capability is a frontier of model design because effective context length determines what tasks a single model call can perform without external retrieval, indexing, or chunking pipelines.
Background and motivation
The original Transformer architecture proposed in 2017 employed dot-product self-attention with full O(N^2) cost in the number of tokens N. Early systems trained on relatively short sequences. BERT was pretrained with a maximum sequence length of 512 wordpieces, which is roughly two or three paragraphs of natural text. The 512 limit was a practical compromise: BERT used learned absolute positional embeddings that did not generalize beyond the training length, and the quadratic attention cost made longer pretraining expensive.[5] GPT-3, released in May 2020, used a context window of 2,048 tokens, double that of GPT-2 (1,024 tokens) but still small relative to most documents that users wanted to process.[6] Many subsequent open models stayed in the 2K to 8K range through 2022.
A single forward pass of standard attention requires constructing an N by N similarity matrix between queries and keys, so doubling N quadruples the FLOPs and memory traffic of the attention layer. During autoregressive generation, each new token must attend to all previous tokens, which would force the model to recompute attention over the entire prefix at every decoding step. The KV cache avoids that recomputation by storing the keys and values produced for past tokens and reusing them, reducing per-token decoding cost from quadratic to linear, but it shifts the bottleneck to memory.[7] KV cache memory grows roughly linearly with sequence length, number of layers, head dimension, and batch size. For LLaMA2-7B with batch size 8 and 32K context, the KV cache occupies on the order of 128 GB, which exceeds the memory of a single high-end accelerator and motivates aggressive quantization, paging, and eviction strategies.[7]
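The 128 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes the publicly documented LLaMA2-7B shape (32 layers, 32 key/value heads, head dimension 128) and 16-bit cache entries; the function name `kv_cache_bytes` is illustrative and not taken from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # The leading factor of 2 covers one tensor for keys and one for values.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed LLaMA2-7B shape, fp16 cache, batch size 8, 32K-token context.
total = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                       seq_len=32 * 1024, batch_size=8)
print(total / 2**30)  # cache size in GiB
```

Under these assumptions each token costs 0.5 MiB of cache, so one 32K sequence needs 16 GiB and a batch of eight needs 128 GiB.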
Beyond raw compute and memory, simply training on longer sequences is statistically inefficient. The number of independent long documents in typical pretraining corpora is small relative to the number of token positions a long-context model has to learn, and most documents do not contain genuine long-range dependencies. As a result, long-context capability is often added through a combination of architectural choices, dedicated long-context fine-tuning, and inference-time extrapolation tricks.
Three distinct cost surfaces interact in a long-context system. Prefill cost is the work required to ingest the prompt. Its linear-layer component grows linearly in N, while exact attention grows quadratically in N when measured as raw FLOPs; FlashAttention-style kernels do not change that FLOP count, but they avoid materializing the N by N matrix, which reduces the attention memory footprint from quadratic to linear and substantially cuts HBM traffic. Decode cost is the per-output-token work, which is dominated by reading the entire KV cache from memory and is therefore memory-bandwidth-bound rather than compute-bound; longer contexts make decoding slower without making it more parallelizable. Storage cost is the KV cache footprint, which competes with model weights for accelerator memory and forces design decisions such as grouped-query attention (GQA) and multi-head latent attention (MLA) that shrink the cache by sharing key and value heads across query heads or projecting to a lower-rank representation.
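A rough sketch of how these cost surfaces scale is given below. It assumes the common approximation of about 4·N²·d FLOPs per layer for the two attention matrix products (ignoring causal masking, softmax, and projections), a LLaMA2-7B-like shape, and a hypothetical GQA variant with 8 key/value heads; the function names are illustrative.

```python
def attention_prefill_flops(seq_len, d_model, n_layers):
    """Approximate FLOPs for QK^T and (softmax)V over a whole prompt."""
    # Each of the two matrix products costs about 2 * N^2 * d FLOPs per layer.
    return 4 * seq_len ** 2 * d_model * n_layers


def kv_bytes_read_per_decode_step(seq_len, n_layers, n_kv_heads, head_dim,
                                  bytes_per_value=2):
    """Bytes of cached keys and values read to produce one output token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Doubling the prompt length quadruples the attention FLOPs during prefill.
flops_ratio = (attention_prefill_flops(65_536, 4096, 32)
               / attention_prefill_flops(32_768, 4096, 32))

# Decode traffic is linear in N; sharing KV heads shrinks it proportionally.
mha_bytes = kv_bytes_read_per_decode_step(32_768, 32, n_kv_heads=32, head_dim=128)
gqa_bytes = kv_bytes_read_per_decode_step(32_768, 32, n_kv_heads=8, head_dim=128)

print(flops_ratio, mha_bytes / gqa_bytes)
```

With these assumed shapes both ratios are 4: prefill attention grows with the square of N, while the per-step cache read shrinks in proportion to the number of key/value heads.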
Historical progression
The growth of context windows since 2018 has been roughly exponential, with several distinct phases driven by both training tricks and engineering improvements.
| Year | Model | Context (tokens) | Notes |
|---|---|---|---|
| 2018 | BERT | 512 | Learned absolute positional embeddings, encoder-only.[5] |
| 2019 | GPT-2 | 1,024 | Decoder-only, absolute positions.[6] |
| 2020 | GPT-3 | 2,048 | Doubled GPT-2 window.[6] |
| 2023 | GPT-4 (initial) | 8,192 / 32,768 | Two variants at launch. |
| 2023 | Claude 2 | 100,000 | Released July 11, 2023, by Anthropic.[1] |
| 2023 | GPT-4 Turbo | 128,000 | Announced at OpenAI DevDay, Nov 6, 2023.[8] |
| 2023 | Claude 2.1 | 200,000 | Released Nov 21, 2023.[9] |
| 2024 | Gemini 1.5 Pro | 1,000,000 (10M demonstrated) | February 2024; technical report claims 99% recall to 10M.[2][10] |
| 2024 | Gemini 1.5 Pro (GA) | 2,000,000 | 2M token window opened to developers in 2024.[11] |
| 2025 | Gemini 2.0 Pro | 2,000,000 | Released February 2025 as part of the Gemini 2.0 family.[12] |
| 2025 | Claude Sonnet 4 | 1,000,000 | 1M beta announced August 2025.[13] |
| 2025 | Gemini 3 Pro | 1,000,000 | Announced Nov 18, 2025; up to 64K output tokens.[14] |
| 2026 | Claude Opus 4.7 | 1,000,000 input / 128,000 output | Released April 16, 2026.[4] |
Several milestones in this trajectory deserve attention. Anthropic's expansion of Claude from a 9K window to 100K tokens in May 2023 was the first time a frontier commercial model crossed the symbolic six-figure threshold; the announcement showed Claude detecting a one-line edit inside The Great Gatsby in 22 seconds.[1] The Gemini 1.5 technical report demonstrated near-perfect recall on a multi-modal Needle-in-a-Haystack at up to 10 million tokens of text, video, and audio, although the 10M setting was a research demonstration rather than a publicly available product.[2] Claude Sonnet 4's 1M extension in 2025 was significant because Anthropic offered the long window without a per-token surcharge for prompt caching reuse, narrowing the operational gap between Anthropic and Google for long-document workloads.[13][15]
It is worth noting that "context window" itself is a marketing-laden number. Two models advertising the same context can have very different real-world capabilities depending on training data composition, positional encoding, and inference-time KV cache handling. The progression in the table above tracks nominal capability; the effective fraction usable on hard tasks lags it substantially, as shown by RULER and BABILong scores discussed below. The trajectory also masks differences in modality: Gemini 1.5 demonstrated 10M token contexts that combined text, audio, and video, while Claude Opus 4.7 supports text plus images at 1M tokens, with distinct cost models for each modality.
Technical details
A transformer needs some way of distinguishing tokens by position. Several schemes have been used, with different consequences for length generalization.
- Sinusoidal and learned absolute embeddings: the original Attention Is All You Need paper used fixed sinusoidal positions, while BERT and GPT-2 learned an embedding per position up to the training length. Learned absolute positions do not extrapolate; the model has never seen position N+1 during training.
- Rotary position embedding (RoPE): rotates queries and keys by a position-dependent angle so that their dot product depends on relative offset. RoPE generalizes somewhat beyond the training length but accuracy still degrades sharply.
- ALiBi (Press, Smith, and Lewis, 2021): replaces explicit position embeddings with a per-head linear penalty added to attention scores, proportional to the query-key distance and scaled by a fixed head-specific slope. For H heads the slopes form a geometric sequence, m_h = 2^(-8h/H) for head h = 1, ..., H. Training a 1.3B model on length 1024 inputs allowed extrapolation to length 2048 with perplexity matching a sinusoidal model trained at 2048, and ALiBi keeps reasonable performance at 5 to 10x the training length.[16]
- NTK-aware scaling: proposed by practitioners in 2023, it rescales the RoPE frequency base rather than the position indices, so that high-frequency dimensions are left almost unchanged while low-frequency dimensions are stretched, pushing trained models to longer windows with limited fine-tuning.
- YaRN (Peng, Quesnelle, Fan, and Shippole, 2023): partitions RoPE frequencies into regions and combines NTK-by-parts interpolation with a temperature-scaled softmax. YaRN extends LLaMA2 to 64K with only 400 fine-tuning steps and supports 128K+ contexts.[17]
- LongRoPE (Microsoft, 2024): identifies non-uniformities in the rotary frequency spectrum, uses an evolutionary search to find good interpolation parameters, and progressively extends to 2,048K tokens with up to 1K fine-tuning steps; the technique was incorporated into Phi-3.[18]
- Position Interpolation (PI): linearly rescales the position indices used by RoPE so that the model effectively sees positions in the range it was trained on, even when the actual sequence is much longer. PI provided the first widely used recipe for extending LLaMA from 2K to 32K with modest fine-tuning.
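The relative-offset property of RoPE and the index rescaling used by Position Interpolation can be illustrated with a few lines of plain Python. This is a minimal sketch, assuming the usual base of 10000 and pairing adjacent vector components; `rope_angles`, `apply_rope`, and the 2,048 and 8,192 lengths are illustrative choices, not values from a specific model.

```python
import math


def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angle per 2-D pair; scale < 1 applies Position Interpolation."""
    return [(position * scale) * base ** (-2 * i / head_dim)
            for i in range(head_dim // 2)]


def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate each adjacent pair of components by its position-dependent angle."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [0.3, -1.2, 0.7, 0.1, -0.5, 0.9, 1.1, -0.4]
k = [1.0, 0.2, -0.6, 0.8, 0.4, -0.3, 0.5, 0.7]

# The score depends only on the offset (7 here), not on absolute positions.
near = dot(apply_rope(q, 10), apply_rope(k, 3))
far = dot(apply_rope(q, 107), apply_rope(k, 100))

# Position Interpolation: squeeze an 8,192-token sequence into the 2,048
# positions seen in training by scaling indices with 2048 / 8192.
pi_scale = 2048 / 8192
interpolated = rope_angles(8000, 8, scale=pi_scale)
in_range = rope_angles(2000, 8)
```

Here `near` and `far` agree up to floating-point error, and position 8000 under interpolation receives the same angles as position 2000 did during training, which is why the model stays inside the range it has seen.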
The trend in positional encoding research has been toward designs that decouple the training length from the deployment length. ALiBi, NTK-aware RoPE, YaRN, and LongRoPE all reflect the same insight: explicit absolute position information becomes brittle outside the training range, while relative or implicit position signals can be made to extrapolate further if the underlying spectrum is handled carefully.
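As a concrete instance of a relative position signal, the ALiBi bias can be written down directly. The sketch below assumes the slope schedule described in the ALiBi paper for head counts that are powers of two (a geometric sequence starting at 2^(-8/H)); the function names are illustrative.

```python
def alibi_slopes(n_heads):
    """Head-specific slopes: 2^(-8/H), 2^(-16/H), ..., 2^(-8)."""
    return [2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(slope, seq_len):
    """Causal bias matrix: -slope * distance for past keys, -inf for future ones."""
    return [[-slope * (i - j) if j <= i else float("-inf")
             for j in range(seq_len)]
            for i in range(seq_len)]


slopes = alibi_slopes(8)          # 1/2, 1/4, ..., 1/256
bias = alibi_bias(slopes[0], 4)   # added to the attention scores before softmax
```

Because the penalty depends only on the distance i - j, the same matrix pattern extends to any sequence length without new parameters, which is the property that lets a model trained at one length be evaluated at a longer one.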
Attention efficiency
Even with a stable positional scheme, the per-layer attention cost has to be controlled. Long-context systems have converged on a handful of techniques.
- FlashAttention (Dao, Fu, Ermon, Rudra, and Re, 2022): an IO-aware exact attention kernel that tiles queries, keys, and values in on-chip SRAM rather than materializing the N by N attention matrix in HBM. It made training at longer sequences feasible for a broad swath of models and is now standard in LLaMA, Falcon, MPT, and many others.[19] FlashAttention-3 specializes the algorithm for Hopper GPUs and FP8.
- Ring Attention (Liu, Zaharia, and Abbeel, October 2023): distributes long sequences across many devices using a ring topology and overlaps the communication of KV blocks with blockwise attention computation. Ring Attention raises the trainable sequence length roughly in proportion to the device count, which makes multi-million-token training runs feasible in principle.[20]
- Sliding window attention: each token attends only to a fixed-size window of recent tokens, giving linear cost in N. The pattern was popularized by Longformer and is used in Mistral 7B. Combining sliding windows with periodic dense layers preserves global communication.
- Sparse attention: patterns such as strided, block-sparse, or content-dependent attention reduce work to a sub-quadratic budget. Production systems blend sparse patterns with dense layers and routing heuristics.
- Linear Attention and kernel-based variants reformulate attention as a matrix product with a feature map so that the cost scales linearly in N at the price of expressivity.
- Blockwise transformers (Liu and Abbeel, 2023): compute attention block by block over a tiled sequence, never materializing the full attention matrix in HBM, and feed naturally into Ring Attention's distributed schedule.
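The idea shared by FlashAttention and blockwise transformers, computing exact softmax attention one key/value block at a time with a running maximum and normalizer, can be sketched for a single query in plain Python. This is an illustration of the online-softmax recurrence only, not the tiled GPU kernel; it omits the 1/sqrt(d) scaling, and all names are illustrative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, keys, values):
    """Reference: materialize every score, then softmax and mix the values."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def blockwise_attention(q, keys, values, block_size):
    """Same result, but only one block of scores exists at any time."""
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [dot(q, k) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_blk))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / denom for a in acc]


# Deterministic toy data: 10 keys and values of dimension 4.
q = [0.5, -0.25, 1.0, 0.75]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j) for j in range(4)] for i in range(10)]

full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=3)
```

The two outputs match up to floating-point error, which is the sense in which these kernels are exact rather than approximate: only the memory access pattern changes, not the result.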
The convention in many modern systems is to use FlashAttention for the bulk of the layers, switch some layers to sliding-window or sparse patterns, and accept the resulting trade between expressivity and throughput. GQA further reduces KV memory and memory traffic by sharing keys and values across groups of query heads. The combination of GQA, Multi-head Latent Attention in models like DeepSeek V2 and V3, and FlashAttention is what makes 1M-token contexts feasible to serve at all on current accelerators.
Alternative architectures
Several non-attention architectures aim to match transformer quality while scaling to very long sequences.
- Mamba and Mamba 2: selective state space models. Mamba was introduced by Albert Gu and Tri Dao in December 2023, with Mamba 2 following in 2024. Mamba makes SSM parameters input-dependent so the model can selectively forget or propagate state. Inference throughput is reported as roughly 5x that of a similar-size Transformer, and compute scales linearly in sequence length.[21]
- Hyena: a sub-quadratic drop-in replacement for attention by Poli, Massaroli, and collaborators (ICML 2023) using implicitly parametrized long convolutions interleaved with data-controlled gating. Hyena is reported to be 100x faster than optimized attention at 64K context.[22]
- RWKV and RWKV-7 (Goose): linear-attention recurrent networks designed for long contexts and inference-time efficiency.
Most commercial frontier systems remain transformer-based but borrow ideas, for example combining attention layers with linear-attention layers in a hybrid stack.
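What linear-attention layers buy is a fixed-size recurrent state in place of a growing KV cache. The sketch below compares the quadratic causal form with the equivalent recurrent form; it assumes the elu(x) + 1 feature map from the linear-transformer literature, which is one common choice and not necessarily what any model named above uses. All names are illustrative.

```python
import math


def phi(x):
    """Positive feature map elu(x) + 1, applied element-wise."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear_attention_quadratic(qs, ks, vs):
    """Causal linear attention computed by revisiting every past token."""
    outs = []
    for t, q in enumerate(qs):
        fq = phi(q)
        weights = [dot(fq, phi(ks[s])) for s in range(t + 1)]
        total = sum(weights)
        outs.append([sum(w * vs[s][d] for s, w in enumerate(weights)) / total
                     for d in range(len(vs[0]))])
    return outs


def linear_attention_recurrent(qs, ks, vs):
    """Same outputs from a constant-size state updated once per token."""
    dk, dv = len(ks[0]), len(vs[0])
    state = [[0.0] * dv for _ in range(dk)]   # running sum of phi(k) v^T
    norm = [0.0] * dk                         # running sum of phi(k)
    outs = []
    for q, k, v in zip(qs, ks, vs):
        fk = phi(k)
        for i in range(dk):
            norm[i] += fk[i]
            for d in range(dv):
                state[i][d] += fk[i] * v[d]
        fq = phi(q)
        z = dot(fq, norm)
        outs.append([sum(fq[i] * state[i][d] for i in range(dk)) / z
                     for d in range(dv)])
    return outs


# Deterministic toy sequence: 6 tokens, key dimension 3, value dimension 2.
qs = [[math.sin(t + i) for i in range(3)] for t in range(6)]
ks = [[math.cos(t * (i + 1)) for i in range(3)] for t in range(6)]
vs = [[float(t), float(t % 2)] for t in range(6)]

quadratic = linear_attention_quadratic(qs, ks, vs)
recurrent = linear_attention_recurrent(qs, ks, vs)
```

The recurrent version keeps only a dk-by-dv matrix and a dk-vector regardless of sequence length, so per-token cost and memory stay constant as N grows, at the price of the reduced expressivity noted for linear attention.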
KV cache management
Once a long-context model is trained, serving it requires careful management of the KV cache. Key techniques include:
- PagedAttention in vLLM (Kwon et al., September 2023): partitions the per-sequence KV cache into fixed-size blocks and uses an OS-style page table to map logical token positions to physical memory blocks. PagedAttention achieves near-zero memory waste and supports flexible sharing across requests, improving throughput 2 to 4x over prior systems at the same latency.[23]
- Eviction policies: long generations cannot keep every token in the cache. Heuristics range from simple sliding windows to attention-score-based eviction.
- Quantization of KV: 8-bit, 4-bit, and even lower precision representations for keys and values cut memory by 2 to 8x with minimal accuracy loss in many configurations.
- Attention sinks (Xiao, Tian, Chen, Han, and Lewis, September 2023): the observation that LLMs route large attention mass to the first few tokens, regardless of their semantic content, because the softmax must distribute weight somewhere. Preserving the KV of those initial tokens, in combination with a sliding window over later tokens, lets the StreamingLLM framework run Llama 2, MPT, Falcon, and Pythia at up to 4 million tokens without fine-tuning, with 22x speedup over a sliding-window recomputation baseline.[24]
- Continuous batching and shared cache reuse via prompt caching cut both latency and cost when many requests share a long prefix, for example a code base or a corpus of contracts.[15][25]
- Disaggregated prefill and decode: production serving stacks increasingly run the prefill phase (large prompt ingestion, compute-bound) on a different pool of accelerators from the decode phase (per-token generation, memory-bandwidth-bound). This separation matters more as context length grows because the two phases have radically different resource profiles.
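The page-table idea behind PagedAttention can be sketched as a small allocator that maps logical token positions to fixed-size physical blocks. This is a toy illustration of the bookkeeping only; the class `PagedKVAllocator` and its methods are hypothetical and are not vLLM's API.

```python
class PagedKVAllocator:
    """Toy block table: logical token index -> (physical block, slot)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve a slot for one more token, allocating a block on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:   # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return self.locate(seq_id, length)

    def locate(self, seq_id, token_index):
        """Translate a logical position into its physical block and slot."""
        block = self.block_tables[seq_id][token_index // self.block_size]
        return block, token_index % self.block_size

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("request-a")
blocks_used = len(allocator.block_tables["request-a"])
```

Seventeen tokens occupy two 16-slot blocks, so at most one partially filled block per sequence is wasted, in contrast to reserving the maximum context length up front.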
KV cache management is now a research subfield in its own right. The recent literature on budgeted KV allocation, layer-condensed caches, and sparsity-aware KV caching reflects the observation that not every token in a million-token prompt deserves the same retention; carefully designed eviction or compression can reduce effective KV memory by an order of magnitude with minimal task degradation.
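The simplest retention rule of this kind is the attention-sink policy: keep the first few tokens plus a sliding window of recent ones. The sketch below shows only the index selection; the function name is illustrative, the default of four sink tokens follows the StreamingLLM paper's description, and the window size is an arbitrary assumption.

```python
def positions_to_keep(seq_len, n_sink=4, window=1020):
    """Indices of cached tokens retained under a sink-plus-window policy."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))          # nothing needs to be evicted yet
    recent = range(seq_len - window, seq_len)
    return list(range(n_sink)) + list(recent)


kept = positions_to_keep(seq_len=10, n_sink=2, window=3)
print(kept)  # [0, 1, 7, 8, 9]
```

The cache size is bounded by `n_sink + window` no matter how long generation runs, which is what allows streaming far past the training length, although evicted middle tokens can no longer be attended to.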
Training-time techniques
Building a model that genuinely uses a long window typically follows a multi-stage recipe. Pretraining occurs at a moderate length, for example 8K, on the bulk of the data. The model is then continued on a smaller mixture of long documents at progressively larger windows, with the positional encoding rescaled at each stage; this is the "progressive extension" pattern used in LongRoPE and many open releases.[18] Context distillation transfers a teacher's behavior on long contexts into a student model without requiring matching memory at training time, reducing peak memory by up to 60% in some configurations.[26] Specialized long-document datasets such as LongCrawl64, a corpus of 6.66 million pre-tokenized documents of 65,536 tokens each, are designed to make long-context training feasible by providing pre-shuffled, length-uniform batches.[27]
Evaluation
A model that accepts 1M tokens is not necessarily a model that uses 1M tokens. Evaluation has matured from simple retrieval probes to multi-task suites.
- Needle in a Haystack (NIAH) (Greg Kamradt, 2023): inserts a single fact (the needle) at varying depths inside Paul Graham essays (the haystack) and asks the model to retrieve it. NIAH became the de facto smoke test for new context-length claims because it is cheap to run and easy to visualize as a heat map of accuracy versus depth and length.
- RULER (Hsieh et al., NVIDIA, April 2024): extends NIAH with 13 task types in four categories (retrieval, multi-hop tracing, aggregation, question answering) at flexible sequence lengths. Although every tested model passed vanilla NIAH at 32K, only about half maintained satisfactory RULER scores at that length.[28]
- LongBench (Bai et al., 2023, published at ACL 2024): a bilingual (English and Chinese) suite of 21 datasets across six categories including single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code, averaging 6,711 English words or 13,386 Chinese characters per instance.[29]
- InfiniteBench (OpenBMB, February 2024): 12 tasks spanning novels, code, math, and dialogue at 100K+ token inputs in English and Chinese, designed to require genuine long-range dependency understanding rather than retrieval alone.[30]
- BABILong (Kuratov et al., June 2024, NeurIPS 2024): hides facts from the 20 bAbI reasoning tasks inside large blocks of irrelevant text and supports lengths up to 10 million tokens. Popular LLMs were found to use effectively only 10 to 20 percent of available context, with accuracy collapsing as reasoning complexity grew.[31]
- ZeroSCROLLS (Shaham, Ivgi, and collaborators, EMNLP 2023): six tasks adapted from SCROLLS plus four new tasks including aggregation, all in a zero-shot setting with no training set.[32]
These benchmarks have converged on the message that nominal context length significantly overstates effective context length. Few public models score above 90 on RULER at 32K, and most degrade further by 128K.[28]
The community has also developed specialized evaluation patterns that are not full benchmarks but are nevertheless useful. Multi-needle Needle-in-a-Haystack variants insert multiple disjoint facts and require the model to retrieve and combine them. Multimodal NIAH inserts an image, audio clip, or video segment as the needle. Passkey retrieval evaluates the ability of a model to find a five-digit code that has been hidden in a noise document of varying length; LongRoPE reports more than 90% passkey accuracy up to 2,048K tokens.[18] These probes complement aggregate benchmark scores by exposing specific failure modes.
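A passkey-retrieval probe of the kind described above is straightforward to construct. The sketch below builds one prompt with the code hidden at a chosen relative depth; the filler sentence, the prompt wording, and the function names are arbitrary assumptions rather than the exact format of any published benchmark, and the model call itself is left out.

```python
import random


def build_passkey_prompt(n_filler, depth, seed=0):
    """Hide a five-digit passkey at a relative depth (0.0 = start, 1.0 = end)."""
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))
    filler = ["The grass is green. The sky is blue. The sun is yellow."] * n_filler
    insert_at = round(depth * n_filler)
    needle = f"The pass key is {passkey}. Remember it."
    lines = filler[:insert_at] + [needle] + filler[insert_at:]
    prompt = "\n".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey


def is_correct(model_answer, passkey):
    """Lenient scoring: the answer only has to contain the right digits."""
    return passkey in model_answer


prompt, passkey = build_passkey_prompt(n_filler=100, depth=0.5)
```

Sweeping `n_filler` and `depth` over a grid and plotting accuracy gives the familiar length-versus-depth heat map used for Needle-in-a-Haystack results.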
Failure modes
Several failure modes have been documented well enough to have named effects.
- Lost in the middle (Liu, Lin, Hewitt, Paranjape, Bevilacqua, Petroni, and Liang, July 2023): performance on multi-document QA and key-value retrieval is highest when the relevant span is at the beginning or end of the input and degrades sharply when it sits in the middle. The effect appears even in models marketed as long-context.[33]
- Attention dilution: as N grows, each individual key-query interaction must compete with more candidates, and the softmax tends to flatten. Many low-relevance tokens can collectively divert mass away from the genuinely useful token.
- Recency bias: in decoder-only causal models, recent tokens receive larger gradients during training and tend to dominate predictions, which can suppress information embedded much earlier in the prompt.
- Distractor sensitivity: long-context models are often brittle to plausible-looking but incorrect text inserted between the question and the answer; BABILong scores drop sharply as the ratio of distractor to signal grows.[31]
- Position extrapolation failure: models trained with absolute or insufficiently rescaled positions can produce degenerate output beyond the training length, manifesting as repetitive or off-topic text.
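Attention dilution in particular follows from simple softmax arithmetic. The sketch below assumes an idealized setting in which one relevant key has a fixed logit advantage over N - 1 distractors that all score zero; the function name and the margin of 5 are illustrative.

```python
import math


def relevant_token_weight(n_tokens, logit_margin):
    """Softmax mass on one relevant key among n_tokens - 1 zero-logit distractors."""
    signal = math.exp(logit_margin)
    return signal / (signal + (n_tokens - 1))


for n in (1_000, 32_000, 1_000_000):
    print(n, relevant_token_weight(n, logit_margin=5.0))
```

With a margin of 5 the relevant token receives about 13% of the attention mass at 1,000 tokens but only about 0.015% at one million, so the margin has to grow roughly with log N just to hold the weight constant.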
Practical considerations
Long context shifts cost from external retrieval infrastructure into the model call itself. Several practical patterns have emerged.
- Cost: pricing per token is typically flat across context length on the major Anthropic and Google APIs, so a 1M token request can be 1,000x more expensive than a 1K token request. For Claude Opus 4.7, input is priced at $5 per million tokens and output at $25 per million, with a 1M context window included at standard pricing.[4][15] Gemini 1.5 Pro's 2M window is similarly included at base rates.[11]
- Latency: prefill (processing the prompt) is compute-bound. Its cost grows at least linearly in N, and the quadratic attention term dominates at very long lengths even with IO-aware kernels. A 1M token request can take many seconds to tens of seconds before the first generated token. KV reuse through prompt caching removes the prefill for a shared prefix, while speculative decoding and PagedAttention mainly improve decode latency and serving throughput.[7][23]
- Prompt Caching: storing the KV state of a frequently reused prefix lets subsequent calls skip its prefill. Anthropic offers cache writes at 1.25x base input rate (5-minute TTL) or 2x (1-hour TTL), and cache reads at 0.1x of the base input rate, giving up to 90% input-cost reduction when applicable.[15]
- Retrieval-Augmented Generation vs long context: at sufficient model scale, long context often outperforms RAG on the same task when ample budget is available, but RAG remains far cheaper per query. Studies in 2024 and 2025 report that long-context 1M requests can run 30 to 60x slower than a comparable RAG pipeline at hundreds of times the per-query cost, while RAG retains advantages on dynamic corpora.[34] Hybrid systems route between the two approaches based on query characteristics.
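The effect of prompt caching on cost can be worked through with the rates quoted above ($5 and $25 per million input and output tokens, 1.25x for a 5-minute cache write, 0.1x for a cache read). The sketch below is plain arithmetic; the function `call_cost` and the workload sizes are hypothetical.

```python
INPUT_PER_MTOK = 5.00      # USD per million input tokens
OUTPUT_PER_MTOK = 25.00    # USD per million output tokens
PREFIX_MULTIPLIER = {
    "uncached": 1.00,      # prefix billed as ordinary input
    "write": 1.25,         # first call writes the prefix to the cache
    "read": 0.10,          # later calls read it back
}


def call_cost(prefix_tokens, fresh_tokens, output_tokens, prefix_state):
    """Cost in USD of one call with a shared prefix in the given cache state."""
    billed_input = prefix_tokens * PREFIX_MULTIPLIER[prefix_state] + fresh_tokens
    return (billed_input * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1e6


# Hypothetical workload: ten questions over the same 900K-token corpus.
args = dict(prefix_tokens=900_000, fresh_tokens=1_000, output_tokens=2_000)
without_cache = 10 * call_cost(prefix_state="uncached", **args)
with_cache = (call_cost(prefix_state="write", **args)
              + 9 * call_cost(prefix_state="read", **args))
print(round(without_cache, 3), round(with_cache, 3))
```

For this assumed workload the total falls from about $45.55 to about $10.23, and the saving approaches the 90% ceiling as the number of reuses grows and the one-time write premium is amortized.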
Applications
Long-context language models support several application classes that are hard to handle by other means. The applications below are not exhaustive but represent the families that practitioners report as most impacted by the move from 32K to 1M-plus contexts.
- Single-call multi-document question answering: a model with a 1M window can ingest hundreds of pages, multiple reports, or a small code base in one call, and answer cross-document questions without an external retriever.
- Code understanding: large repositories, sometimes the entire codebase or a representative slice, can be loaded directly. Gemini 3 Pro markets the ability to ingest about 1,500 pages of text or 50,000 lines of code per call.[14]
- Agentic tasks: persistent agents accumulate state, tool outputs, and intermediate plans in their context. A 1M window lets a Claude or Gemini agent stay coherent across many turns of computer use, file edits, or tool calls without aggressive summarization.
- Long video and audio: Gemini 1.5 ingests video and audio interleaved with text, supporting tasks like multi-hour video question answering and long-form transcript reasoning.[2]
- In-context learning of rare distributions: the Gemini 1.5 report demonstrated learning to translate English to Kalamang, a language with fewer than 200 speakers, by ingesting an entire grammar manual at test time.[2]
- Compliance and analysis on long documents: regulatory filings, contracts, depositions, scientific corpora, and clinical records often exceed shorter windows. Loading the full document removes the chunking step that often loses context boundaries in shorter-window pipelines.
- Few-shot in-context learning at scale: with a million-token window, dozens or hundreds of high-quality demonstrations can be packed into the prompt, sometimes substituting for fine-tuning on small tasks. This is sometimes called "many-shot in-context learning" and was used to evaluate Gemini 1.5 on translation, transliteration, and rare-language tasks.[2]
- Process and trace replay: agents that have accumulated thousands of tool calls, model outputs, and observations can be replayed in a single long-context call for debugging, post-hoc analysis, or distillation into shorter prompts.
Limitations
The honest summary is that nominal and effective context length diverge.
- Effective context less than nominal: BABILong showed popular LLMs effectively use 10 to 20 percent of the context they accept.[31] RULER results at 32K are well below NIAH numbers for the same models.[28]
- Cost growth: even with IO-aware or sparse attention kernels, the KV cache grows linearly in N and prefill compute grows at least linearly in N. A 1M context can dominate the cost of an application.
- No free lunch on training: long-context fine-tuning requires real long documents with real long-range dependencies, which are scarce. Models can pass NIAH at lengths well beyond what they robustly reason at.
- Position bias persists: even with RoPE, YaRN, or ALiBi, the lost-in-the-middle effect has not been fully eliminated; placement matters for accuracy.[33]
- Hardware fragility: serving 1M-token requests requires substantial accelerator memory and network bandwidth, particularly when Ring Attention is used to shard the sequence across devices.[20]
- Tokenizer changes invalidate prior measurements: Claude Opus 4.7 shipped with a new tokenizer that can produce up to 35% more tokens for the same input text relative to its predecessor; like-for-like cost comparisons therefore require careful accounting.[4]
- Mode-collapse on degenerate prompts: at very long contexts, repeated or near-duplicate spans can drive the model into repetition loops or topical drift that is harder to detect than in shorter prompts.
Engineering and economics
The economics of long-context serving are distinctive enough to deserve their own discussion. A 1M-token Claude Opus 4.7 call at base rates costs $5 for input, plus up to about $3.20 for output if the model generates the full 128K output tokens at $25 per million, but realistic cost depends heavily on caching and on how many of those input tokens are actually unique to the request.[4] On Anthropic's API, a prefix that is written to the cache once and reused repeatedly before the cache entry expires incurs the cache-write premium only on the first call; subsequent calls pay only the 0.1x cache-read rate for the shared prefix and the base rate for new tokens, which can swing total cost by an order of magnitude in code-assistant or document-analytics workloads.[15]
Latency budgets also change. Prefill of one million tokens, even with FlashAttention and tensor parallelism, takes seconds to tens of seconds before the first output token, which is incompatible with interactive chat unless a streaming UI is designed around it. For agentic applications that already operate on multi-second turn budgets, this is acceptable; for end-user chat it is not, which is one reason that products often expose long context as an opt-in mode rather than the default.
Network and storage architectures matter too. Ring Attention requires a high-bandwidth ring topology (NVLink, InfiniBand, or equivalent) because per-iteration latency depends on the slowest link in the ring; serving long context on inferior interconnects pays a substantial throughput penalty.[20] PagedAttention, by contrast, can be implemented on a single accelerator and is more attractive for inference-only deployments.[23]
Comparison with alternative approaches
The main alternative to long context is retrieval. A long-context call sends the full corpus to the model, while a RAG system uses an index and a retriever to send only a small relevant subset.
- When long context wins: small to medium corpora, static content where caching dominates, multi-hop reasoning across the entire corpus, and tasks that cannot be cleanly chunked without losing structure.
- When RAG wins: very large or dynamic corpora, dialogue-style retrieval over a knowledge base, latency-sensitive query paths, and budgets that cannot absorb 1M-token prefills.
- Hybrid systems: techniques such as GraphRAG and Agentic RAG interleave retrieval with longer in-context reasoning, while long-context models reduce the need for fine-grained retrieval inside any given hop.[34]
The choice is also influenced by scaling laws; for a fixed compute budget, more parameters and tokens in pretraining can substitute for longer windows, and vice versa.
Long-context modeling intersects with several adjacent research threads.
- Attention efficiency: FlashAttention, Ring Attention, sparse attention, sliding window attention, and linear attention variants.
- Positional encoding: RoPE, ALiBi, YaRN, rotary scaling families.
- Memory and serving: KV cache, PagedAttention, Continuous Batching, Prompt Caching, vLLM, Speculative Decoding.
- Non-transformer architectures: Mamba, Mamba 2, Hyena, RWKV, state space models.
- Benchmarks: Needle in a Haystack, RULER, LongBench.
- Retrieval and orchestration: Retrieval-Augmented Generation, Agentic RAG, GraphRAG.
References
- Anthropic, "Introducing 100K Context Windows", Anthropic, 2023-05-11. https://www.anthropic.com/news/100k-context-windows. Accessed 2026-05-20.
- Gemini Team, Google, "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context", arXiv:2403.05530, 2024-03-08. https://arxiv.org/abs/2403.05530. Accessed 2026-05-20.
- Google, "Introducing Gemini 1.5, Google's next-generation AI model", Google Blog, 2024-02-15. https://blog.google/innovation-and-ai/products/google-gemini-next-generation-model-february-2024/. Accessed 2026-05-20.
- Anthropic / AWS, "Introducing Anthropic's Claude Opus 4.7 model in Amazon Bedrock", AWS News Blog, 2026-04-16. https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/. Accessed 2026-05-20.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805, 2018-10-11. https://arxiv.org/abs/1810.04805. Accessed 2026-05-20.
- Tom B. Brown et al., "Language Models are Few-Shot Learners", arXiv:2005.14165, 2020-05-28. https://arxiv.org/abs/2005.14165. Accessed 2026-05-20.
- Pierre Lienhart, "LLM Inference Series: 4. KV caching, a deeper look", Medium, 2024-01-08. https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8. Accessed 2026-05-20.
- OpenAI, "New models and developer products announced at DevDay", OpenAI, 2023-11-06. https://openai.com/index/new-models-and-developer-products-announced-at-devday/. Accessed 2026-05-20.
- Anthropic, "Introducing Claude 2.1", Anthropic, 2023-11-21. https://www.anthropic.com/news/claude-2-1. Accessed 2026-05-20.
- Petko Georgiev et al. (Gemini Team), "Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (Technical Report v1.5)", Google DeepMind, 2024-03-08. https://storage.googleapis.com/deepmind-media/gemini/gemini_v1_5_report.pdf. Accessed 2026-05-20.
- Google Developers, "Gemini 1.5 Pro 2M context window, code execution capabilities, and Gemma 2 are available today", Google Developers Blog, 2024-06-27. https://developers.googleblog.com/en/new-features-for-the-gemini-api-and-google-ai-studio/. Accessed 2026-05-20.
- Google, "Gemini 2.0 model updates: 2.0 Flash, Flash-Lite, Pro Experimental", Google Blog, 2025-02-05. https://blog.google/innovation-and-ai/models-and-research/google-deepmind/gemini-model-updates-february-2025/. Accessed 2026-05-20.
- The New Stack, "Anthropic's Claude Sonnet 4 Model Gets a 1M Token Context Window", The New Stack, 2025-08-12. https://thenewstack.io/anthropics-claude-sonnet-4-model-gets-a-1m-token-context-window/. Accessed 2026-05-20.
- Google Cloud, "Gemini 3 Pro", Google Cloud Documentation, 2025-11-18. https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-pro. Accessed 2026-05-20.
- Anthropic, "Prompt caching", Claude API Docs, 2026-01-15. https://platform.claude.com/docs/en/build-with-claude/prompt-caching. Accessed 2026-05-20.
- Ofir Press, Noah A. Smith, Mike Lewis, "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", arXiv:2108.12409, 2021-08-27. https://arxiv.org/abs/2108.12409. Accessed 2026-05-20.
- Bowen Peng, Jeffrey Quesnelle, Honglu Fan, Enrico Shippole, "YaRN: Efficient Context Window Extension of Large Language Models", arXiv:2309.00071, 2023-08-31. https://arxiv.org/abs/2309.00071. Accessed 2026-05-20.
- Yiran Ding et al., "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens", arXiv:2402.13753, 2024-02-21. https://arxiv.org/abs/2402.13753. Accessed 2026-05-20.
- Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Re, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", arXiv:2205.14135, 2022-05-27. https://arxiv.org/abs/2205.14135. Accessed 2026-05-20.
- Hao Liu, Matei Zaharia, Pieter Abbeel, "Ring Attention with Blockwise Transformers for Near-Infinite Context", arXiv:2310.01889, 2023-10-03. https://arxiv.org/abs/2310.01889. Accessed 2026-05-20.
- Albert Gu, Tri Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces", arXiv:2312.00752, 2023-12-01. https://arxiv.org/abs/2312.00752. Accessed 2026-05-20.
- Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen A. Baccus, Yoshua Bengio, Stefano Ermon, Christopher Re, "Hyena Hierarchy: Towards Larger Convolutional Language Models", arXiv:2302.10866, 2023-02-21. https://arxiv.org/abs/2302.10866. Accessed 2026-05-20.
- Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica, "Efficient Memory Management for Large Language Model Serving with PagedAttention", arXiv:2309.06180, 2023-09-12. https://arxiv.org/abs/2309.06180. Accessed 2026-05-20.
- Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis, "Efficient Streaming Language Models with Attention Sinks", arXiv:2309.17453, 2023-09-29. https://arxiv.org/abs/2309.17453. Accessed 2026-05-20.
- vLLM Team, "vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention", vLLM Blog, 2023-06-20. https://blog.vllm.ai/2023/06/20/vllm.html. Accessed 2026-05-20.
- Rajesh Kanna Selvaraj et al., "Efficient LLM Context Distillation", arXiv:2409.01930, 2024-09-03. https://arxiv.org/abs/2409.01930. Accessed 2026-05-20.
- Manifest AI, "LongCrawl64: A Long-Context Natural-Language Dataset", Manifest AI, 2024-07-15. https://manifestai.com/articles/longcrawl64/. Accessed 2026-05-20.
- Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, Boris Ginsburg, "RULER: What's the Real Context Size of Your Long-Context Language Models?", arXiv:2404.06654, 2024-04-09. https://arxiv.org/abs/2404.06654. Accessed 2026-05-20.
- Yushi Bai et al., "LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding", arXiv:2308.14508, 2023-08-28. https://arxiv.org/abs/2308.14508. Accessed 2026-05-20.
- Xinrong Zhang et al. (OpenBMB), "∞Bench: Extending Long Context Evaluation Beyond 100K Tokens", arXiv:2402.13718, 2024-02-21. https://arxiv.org/abs/2402.13718. Accessed 2026-05-20.
- Yuri Kuratov et al., "BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack", arXiv:2406.10149, 2024-06-14. https://arxiv.org/abs/2406.10149. Accessed 2026-05-20.
- Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy, "ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding", arXiv:2305.14196, 2023-05-23. https://arxiv.org/abs/2305.14196. Accessed 2026-05-20.
- Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, Percy Liang, "Lost in the Middle: How Language Models Use Long Contexts", arXiv:2307.03172, 2023-07-06. https://arxiv.org/abs/2307.03172. Accessed 2026-05-20.
- Xinze Li, Yixin Cao, Yubo Ma, Aixin Sun, "Long Context vs. RAG for LLMs: An Evaluation and Revisits", arXiv:2501.01880, 2025-01-03. https://arxiv.org/abs/2501.01880. Accessed 2026-05-20.