PagedAttention is a memory management algorithm for serving large language models that borrows the concept of virtual memory paging from operating systems to eliminate fragmentation in the key-value (KV) cache. Introduced by Woosuk Kwon and colleagues at UC Berkeley, Stanford University, and UC San Diego in 2023, PagedAttention became the foundational technique behind vLLM, one of the most widely deployed open-source LLM serving systems. The algorithm delivers 2 to 4 times higher throughput compared to prior serving approaches by allowing KV cache blocks to reside in non-contiguous GPU memory locations, enabling far more requests to share available memory at any given time.
The work was published as "Efficient Memory Management for Large Language Model Serving with PagedAttention" at the 29th ACM Symposium on Operating Systems Principles (SOSP 2023) and received the Best Paper Award at that conference. The paper's arXiv identifier is 2309.06180.
To understand why PagedAttention matters, it helps to understand how autoregressive language model inference works. When a transformer processes a sequence of tokens, the self-attention mechanism computes key and value vectors for every token at every layer. During autoregressive generation, the model produces one new token per step, but it needs access to the keys and values from all previous tokens to compute attention over the full context. Recomputing these vectors from scratch at every step would be prohibitively slow, so serving systems store them in a structure called the KV cache.
The KV cache grows as generation proceeds. For a sequence of length $T$ with $L$ transformer layers, $H$ attention heads, and head dimension $d$, the cache holds $2 \times L \times T \times H \times d$ floating-point values (one tensor each for keys and values). On a model like LLaMA-13B running at fp16 precision, a single sequence can require roughly 1.7 GB of GPU memory for its KV cache at the maximum supported context length. For a model serving hundreds of concurrent requests, this memory consumption dominates the GPU budget, far exceeding what model parameters alone require.
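As a concrete check of these numbers, the sketch below evaluates the formula with commonly cited LLaMA-13B dimensions (40 layers, 40 heads, head dimension 128) at a 2048-token context in fp16; the model dimensions are assumptions used for illustration.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of one sequence's KV cache: 2 (keys and values) x L x T x H x d values."""
    return 2 * num_layers * seq_len * num_heads * head_dim * bytes_per_value

# Assumed LLaMA-13B dimensions: 40 layers, 40 heads, head_dim 128, fp16 (2 bytes/value).
per_token = kv_cache_bytes(40, 40, 128, 1)       # ~0.8 MB of cache per generated token
full_seq = kv_cache_bytes(40, 40, 128, 2048)     # ~1.7 GB at a 2048-token context
print(f"{per_token / 1e6:.1f} MB per token, {full_seq / 1e9:.2f} GB per sequence")
```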
Prior to PagedAttention, serving systems handled KV cache memory with static pre-allocation. When a new request arrived, the system reserved a contiguous region of GPU memory large enough to hold the KV cache for that request's maximum possible output length. This created three distinct forms of waste:
Reserved but unused slots. Slots for future token positions were claimed the moment a request was admitted, so memory that would not be written for hundreds of decoding steps sat idle in the meantime.
Internal fragmentation. Because the output length cannot be known in advance, every allocation had to span from token 0 to the maximum supported length. The portion beyond where generation actually stops is never used at all: a request that finishes at token 100 still holds the slab sized for the full context.
External fragmentation. As requests completed and freed their allocations, the freed regions were scattered across the address space. A new request requiring a large contiguous allocation might find that there is technically enough free memory in aggregate, but none of the free regions are large enough individually. This is classic external fragmentation.
The original PagedAttention paper measured the consequence: in existing systems, only 20.4% to 38.2% of allocated KV cache memory actually stored useful token states. Between 60% and 80% of that memory was wasted. This waste directly limited the number of requests that could run concurrently, which in turn limited throughput.
The solution PagedAttention applies is borrowed directly from how operating systems manage RAM. In a modern OS, physical RAM is divided into fixed-size pages (typically 4 KB). When a process needs memory, it does not receive a large contiguous physical allocation. Instead, it receives virtual address space that the OS maps onto physical pages wherever they happen to be free. A page table maintained per-process records which virtual pages map to which physical pages. Pages for the same process can be scattered throughout physical RAM, but the process sees a clean contiguous virtual address space.
This design solves both internal and external fragmentation. Internal fragmentation is bounded to at most one page's worth of wasted space (the partially filled last page). External fragmentation is eliminated because any free physical page can satisfy any request; there is no need for the free pages to be adjacent.
PagedAttention applies this exact idea to GPU memory for KV cache. The KV cache for a sequence is divided into fixed-size logical blocks. Each block stores the keys and values for a fixed number of consecutive tokens (the block size, defaulting to 16 tokens in the reference implementation). The system maintains a block table per sequence, mapping each logical block index to a physical block anywhere in the GPU's memory pool. The physical blocks for a single sequence do not need to be contiguous.
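A minimal sketch of the translation step, with names invented for illustration rather than taken from vLLM: token position t lives in logical block t // block_size at slot t % block_size, and the per-sequence table maps that logical block to a physical block anywhere in the pool.

```python
BLOCK_SIZE = 16  # tokens per block, the reference implementation's default

class SequenceBlockTable:
    """Per-sequence mapping from logical block index to physical block number."""

    def __init__(self):
        self.physical_blocks: list[int] = []  # index = logical block, value = physical block

    def physical_location(self, token_idx: int) -> tuple[int, int]:
        """Translate a token position into (physical block, slot within that block)."""
        logical_block, slot = divmod(token_idx, BLOCK_SIZE)
        return self.physical_blocks[logical_block], slot
```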
During inference, attention is computed across this paged layout using a custom GPU kernel that follows each sequence's block table to gather the relevant key and value data. The paper and implementation include specialized CUDA kernels for this block-based attention computation. The default kernel layout assigns one thread block per attention head per sequence, with 32-thread warps iterating over logical blocks via the block table.
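To show what the kernel has to do, without claiming anything about the CUDA implementation itself, here is a NumPy sketch that gathers one head's keys and values by walking a block table and then computes ordinary scaled dot-product attention over the gathered context:

```python
import numpy as np

def paged_attention_single_head(query, key_cache, value_cache, block_table,
                                context_len, block_size=16):
    """query: (head_dim,); key_cache, value_cache: (num_physical_blocks, block_size, head_dim).

    Gathers this sequence's keys/values by following its block table, then computes
    softmax(q @ K^T / sqrt(d)) @ V over the first `context_len` cached tokens.
    """
    head_dim = query.shape[0]
    keys, values = [], []
    for logical_idx, physical_block in enumerate(block_table):
        start = logical_idx * block_size
        n = min(block_size, context_len - start)   # the last block may be partially filled
        if n <= 0:
            break
        keys.append(key_cache[physical_block, :n])
        values.append(value_cache[physical_block, :n])
    K = np.concatenate(keys)                        # (context_len, head_dim)
    V = np.concatenate(values)
    scores = K @ query / np.sqrt(head_dim)          # (context_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                              # (head_dim,)
```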
The block manager in vLLM maintains a global pool of physical KV cache blocks. When a new request begins, the block manager allocates physical blocks on demand as tokens are generated. Only one physical block at a time is open for writing for any given sequence; previous blocks are treated as read-only once they are full.
This on-demand allocation means that a sequence using 80 tokens occupies only the memory for those 80 tokens, not a pre-reserved slab sized for the maximum context length. A request that completes early frees its physical blocks immediately, making them available for new requests without any compaction or defragmentation step. Internal fragmentation is bounded to at most block_size - 1 tokens per sequence (the partially filled last block), compared to the entire maximum-length pre-allocation in earlier designs.
External fragmentation is eliminated because any free physical block is interchangeable with any other. The block table handles the address translation, so no contiguity constraint exists.
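A hedged sketch of this allocator, with hypothetical class and method names rather than vLLM's internals: a sequence receives a new physical block only when it crosses a block boundary, and returns every block it holds the moment it finishes.

```python
class BlockManager:
    """A global pool of physical KV cache blocks handed out on demand (illustrative)."""

    def __init__(self, num_physical_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}   # sequence id -> block table

    def append_token(self, seq_id: int, tokens_so_far: int) -> None:
        """Allocate a fresh physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if tokens_so_far % self.block_size == 0:       # previous block is full, or first token
            if not self.free_blocks:
                raise MemoryError("pool exhausted: the scheduler must preempt a sequence")
            table.append(self.free_blocks.pop())       # any free block works; no contiguity needed

    def free_sequence(self, seq_id: int) -> None:
        """Return a finished sequence's blocks to the pool immediately, with no compaction."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
```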
The tradeoff in choosing block size involves a balance between two effects. Smaller blocks reduce internal fragmentation but increase the size of block tables and add more indirection overhead in attention kernels. Larger blocks amortize the table overhead but waste more memory on partially filled trailing blocks. The paper evaluates block sizes from 2 to 128 tokens on the ShareGPT and Alpaca datasets, finding that block sizes in the range of 16 to 128 perform well, with 16 tokens as the default.
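The internal-fragmentation side of this tradeoff is easy to quantify: a sequence of T tokens wastes only the unfilled slots of its trailing block, as the illustrative calculation below shows.

```python
import math

def trailing_waste(seq_len: int, block_size: int) -> int:
    """Token slots allocated but never written in the partially filled last block."""
    return math.ceil(seq_len / block_size) * block_size - seq_len

# For a 100-token sequence: no waste at block size 2, 12 slots at 16, 28 slots at 64 or 128.
print({bs: trailing_waste(100, bs) for bs in (2, 16, 64, 128)})
```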
The memory layout for the key cache in vLLM's CUDA kernels is [num_blocks, num_kv_heads, head_size/x, block_size, x], where x is a hardware-dependent vectorization factor. The value cache layout is [num_blocks, num_kv_heads, head_size, block_size]. These specific layouts are chosen to allow coalesced memory access within the attention kernel: neighboring threads read adjacent memory addresses, maximizing memory bandwidth utilization.
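As one way to see why this layout supports coalesced reads, the sketch below computes the flat element offset implied by the key layout (the helper and its arguments are illustrative, not vLLM code): the x lanes of one token's dimension group are adjacent, and consecutive tokens within a block are only x elements apart.

```python
def key_cache_offset(block, head, dim, slot, *, num_kv_heads, head_size, block_size, x):
    """Flat element offset of key_cache[block, head, dim // x, slot, dim % x]
    under the [num_blocks, num_kv_heads, head_size/x, block_size, x] layout."""
    d_outer, d_inner = divmod(dim, x)
    return ((((block * num_kv_heads + head) * (head_size // x) + d_outer)
             * block_size + slot) * x + d_inner)

# Adjacent vector lanes of one token are contiguous, and consecutive tokens in a
# block are only x elements apart, which is what lets a warp's loads coalesce.
a = key_cache_offset(3, 1, 0, 5, num_kv_heads=8, head_size=128, block_size=16, x=8)
b = key_cache_offset(3, 1, 1, 5, num_kv_heads=8, head_size=128, block_size=16, x=8)
c = key_cache_offset(3, 1, 0, 6, num_kv_heads=8, head_size=128, block_size=16, x=8)
assert b - a == 1 and c - a == 8
```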
Beyond eliminating fragmentation, the block table abstraction enables another optimization: sharing physical blocks across multiple sequences. If two sequences share an identical prefix of tokens, their block tables can point to the same physical blocks for that prefix. The shared physical blocks hold the KV cache data computed once for the common prefix, and both sequences read from them without duplication.
This sharing is managed through reference counting. Each physical block tracks how many sequences' block tables point to it. As long as multiple sequences reference a block, it is read-only. When a sequence needs to write a new token into a shared block (or perform any modification), the system applies copy-on-write semantics: it allocates a new physical block, copies the contents of the shared block into the new block, and updates the requesting sequence's block table to point to the new copy. The original shared block remains unmodified, so other sequences still referencing it are unaffected.
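A sketch of the reference-counting and copy-on-write bookkeeping just described, with hypothetical names rather than vLLM's internals:

```python
class CowBlockPool:
    """Physical block pool with reference counts for copy-on-write sharing (illustrative)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))
        self.refcount = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def fork(self, parent_table: list[int]) -> list[int]:
        """Share the parent's blocks with a new sequence: copy only the table, bump refcounts."""
        for block in parent_table:
            self.refcount[block] += 1
        return list(parent_table)

    def ensure_writable(self, table: list[int], logical_idx: int, copy_kv) -> None:
        """Copy-on-write: before writing into a shared block, give this sequence a private copy."""
        block = table[logical_idx]
        if self.refcount[block] > 1:
            new_block = self.allocate()
            copy_kv(src=block, dst=new_block)   # copy the cached keys/values to the new block
            self.refcount[block] -= 1
            table[logical_idx] = new_block
```

Here `fork` mirrors how parallel samples share a prompt's blocks, and `ensure_writable` pays the copy only when a shared block is actually about to be mutated.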
This mechanism is directly useful in several inference scenarios:
Parallel sampling. A single prompt generating $n$ output sequences in parallel (for best-of-$n$ selection) would traditionally require $n$ separate copies of the prompt's KV cache. With copy-on-write, all $n$ sequences share the same physical blocks for the prompt. Independent physical blocks are allocated only as the sequences diverge in their generated tokens. The paper measures memory savings of 6.1% to 9.8% from this sharing on parallel sampling workloads.
Beam search. Beam search maintains $k$ candidate sequences at each step, keeping the top-$k$ by score. These candidates share not just the prompt but also portions of the decoded sequence history where beams have not yet diverged. With copy-on-write, beams can share physical blocks as long as they point to the same token history, and new blocks are allocated only when beams diverge. Memory savings from beam search sharing range from 37.6% to 55.2% in the paper's evaluation.
Shared system prompts. Many production deployments prepend the same system prompt or few-shot examples to every user request. With block sharing, the KV cache for that common prefix is computed once and shared across all concurrent requests using it. This is essentially automatic prefix caching at the block level, and it became the foundation for vLLM's later Automatic Prefix Caching (APC) feature, which extends block sharing to use a hash of block contents to identify reusable blocks from a global cache.
PagedAttention's memory efficiency enables a scheduler that works at the iteration level rather than the request level. At every decoding step, vLLM's scheduler decides which sequences to include in the next forward pass. Because physical memory blocks are allocated and freed on demand, the scheduler can add new sequences mid-generation whenever free blocks become available and remove completed sequences immediately when they finish.
The scheduler maintains three queues: a waiting queue of requests that have arrived but not yet started, a running queue of sequences included in the current decoding batch, and a swapped queue of preempted sequences whose KV blocks have been moved to CPU memory until they can be re-admitted.
When the GPU memory pool runs low, the scheduler must preempt some running sequences to make room for new ones. vLLM uses first-come-first-served (FCFS) ordering; the most recently admitted requests are preempted first, preserving progress on requests that have already generated significant output. Preemption operates on entire sequence groups at once rather than individual sequences, because sequences within a group (for example, multiple parallel samples from one request) may share physical blocks.
Two recovery strategies are available after preemption:
Swapping. The preempted sequence's physical KV cache blocks are serialized to CPU DRAM over PCIe and released on the GPU. When the sequence is re-admitted, the blocks are copied back from CPU to GPU. This is efficient for larger block sizes where the PCIe transfer cost is amortized over more data, but it adds latency proportional to the amount of data swapped and the PCIe bandwidth available.
Recomputation. The preempted sequence's KV cache blocks are simply discarded. When the sequence is re-admitted, the prefill phase is re-run from the beginning to regenerate the KV cache. For block sizes of 64 tokens or fewer, recomputation is often faster than the round-trip swap cost. The paper reports that recomputation overhead never exceeds 20% of the swapping overhead across the tested configurations.
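To make the admission/preemption interplay concrete, here is a deliberately simplified, self-contained sketch; the data representation (request id plus block count, a single integer block pool) is invented for illustration and omits swapping, sequence groups, and real block-size accounting:

```python
from collections import deque

def schedule_iteration(running: deque, waiting: deque, free_blocks: int):
    """One iteration-level scheduling pass over (request_id, blocks_held) entries.

    A toy model, not vLLM's scheduler: it pessimistically assumes every running
    sequence may need one new block during the next decoding step.
    """
    # Preempt from the tail (most recently admitted) until the next step can fit.
    while running and len(running) > free_blocks:
        request_id, blocks_held = running.pop()
        free_blocks += blocks_held              # its blocks are swapped out or discarded
        waiting.appendleft((request_id, 0))     # restored (swap-in or recompute) when re-admitted

    # Admit waiting requests while spare blocks remain beyond the running set's needs.
    while waiting and free_blocks > len(running):
        running.append(waiting.popleft())
    return running, waiting, free_blocks
```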
The paper evaluates vLLM against two primary baselines: Orca and FasterTransformer. Orca is a prior iteration-level scheduling system without PagedAttention's memory management. FasterTransformer is NVIDIA's high-performance transformer inference library.
Evaluations use OPT and LLaMA models ranging from 13B to 175B parameters, tested on ShareGPT and Alpaca prompt datasets. The ShareGPT dataset has much longer sequences (8.4 times longer input and 5.8 times longer output than Alpaca), making KV cache memory pressure more acute.
| System | vs. FasterTransformer | vs. Orca (Oracle) | vs. Orca (Max) |
|---|---|---|---|
| vLLM (PagedAttention) | up to 22x | 1.7x to 2.7x | 2.3x to 4.3x |
"Oracle" refers to Orca with perfect foreknowledge of output lengths to eliminate over-reservation. The fact that vLLM outperforms even the oracle variant by 1.7x to 2.7x shows that the gains come from more than just eliminating over-reservation; the paged allocation and sharing provide additional benefits.
| System | KV cache utilization | Memory waste |
|---|---|---|
| FasterTransformer / Orca | 20.4% to 38.2% | 60% to 80% |
| vLLM (PagedAttention) | over 96% | under 4% |
| Scenario | Memory savings |
|---|---|
| Parallel sampling | 6.1% to 9.8% |
| Beam search (width 6) | 37.6% to 55.2% |
| Shared system prefix | up to 30% in some configurations |
The vLLM blog post reports even larger gains relative to unoptimized inference code: compared to HuggingFace Transformers running on the same hardware, vLLM achieves up to 24 times higher throughput, and it outperforms HuggingFace Text Generation Inference (TGI) by up to 3.5 times.
These larger multipliers reflect the baseline inefficiency of HuggingFace Transformers for serving use cases, rather than the isolated gain from PagedAttention specifically.
The table below summarizes the conceptual differences between naive KV cache management (as used in early serving systems) and PagedAttention:
| Property | Naive KV cache | PagedAttention |
|---|---|---|
| Memory allocation | Pre-allocated contiguous slab per request | On-demand fixed-size blocks, any location |
| Internal fragmentation | Up to full max-length reservation | At most block_size - 1 tokens |
| External fragmentation | Yes, grows as requests complete | None (any free block fills any request) |
| Memory utilization | 20% to 38% | Over 96% |
| Prefix sharing | Not supported | Supported via block table aliasing |
| Copy-on-write | Not applicable | Supported via reference counting |
| Maximum concurrent batch | Limited by fragmentation | Near-optimal packing |
| Kernel complexity | Simple contiguous tensor access | Block table indirection, custom CUDA kernels |
| Per-kernel overhead | Baseline | Approximately 20% to 26% higher |
The per-kernel overhead is real: looking up block table entries and accessing non-contiguous memory is slower than a simple contiguous tensor read. However, the overhead is far outweighed by the ability to run 2 to 4 times more requests concurrently. Throughput, which is what matters for a serving system under load, improves substantially even though individual kernel latency is slightly higher.
vLLM is the system that introduced PagedAttention and remains its primary home. Released by UC Berkeley researchers in June 2023, vLLM quickly became the most widely used open-source LLM serving framework. PagedAttention is the core of its memory management layer. Subsequent vLLM development has extended PagedAttention with Automatic Prefix Caching (APC), which uses content hashes to identify and reuse blocks across requests without explicit prefix matching; chunked prefill, which breaks long prefill sequences into smaller chunks to reduce time-to-first-token variance; and speculative decoding, which uses a smaller draft model to propose multiple tokens for verification by the main model.
By 2024, vLLM supported over 100 model architectures and integrated multiple attention kernel backends including Flash Attention, FlashInfer, and Triton-based custom kernels. The implementation described in the original paper spans approximately 8,500 lines of Python and 2,000 lines of C++ and CUDA; the codebase has grown far beyond that since.
NVIDIA's TensorRT-LLM inference framework adopted a blocked KV cache design that is similar in spirit to PagedAttention. Rather than pre-allocating contiguous per-sequence buffers, TensorRT-LLM manages a grid of KV chunks, freeing and reusing them as requests complete. The internal memory layout and kernel implementation differ from vLLM's design, and TensorRT-LLM incorporates additional NVIDIA-specific optimizations such as FP8 quantization, INT8 smooth quantization, and XQA kernels for grouped-query attention. TensorRT-LLM is typically used when squeezing maximum performance from NVIDIA hardware, particularly for production deployments on A100 and H100 GPUs.
SGLang is a serving framework developed at UC Berkeley that takes the prefix caching idea further with a data structure called RadixAttention. Where PagedAttention achieves prefix sharing by aliasing block tables when the serving system knows requests share a prefix, RadixAttention organizes the entire KV cache as a radix tree (trie) indexed by token sequences. Any request whose token prefix matches a prefix already in the tree can immediately reuse that portion of the KV cache, with no need for the operator to configure explicit prefix groups. SGLang reports 29% higher throughput than vLLM on workloads with significant prefix sharing: the radix tree finds much of the same reuse that vLLM's block-hash APC can, but with lower lookup overhead in some configurations.
HuggingFace's Text Generation Inference (TGI) server adopted paged memory management as well, incorporating a blocked KV cache design in later versions after vLLM's publication demonstrated the approach's merits. TGI remains popular among HuggingFace model users for its tight integration with the HuggingFace Hub.
Shanghai AI Laboratory's LMDeploy framework includes its own paged KV cache implementation inside its TurboMind inference engine, targeting deployment of models such as InternLM. It follows the same conceptual model of non-contiguous physical blocks with logical-to-physical mapping.
PagedAttention is not without costs and constraints.
Attention kernel overhead. The custom attention kernels that gather KV data from non-contiguous blocks are slower per operation than kernels operating on contiguous tensors. The paper measures approximately 20% to 26% kernel-level overhead. At batch size 1 (low-concurrency scenarios), this overhead is not offset by memory gains, and a contiguous-memory kernel such as Flash Attention may have lower latency. PagedAttention's advantage is in high-concurrency, high-throughput scenarios.
Block table management complexity. The system must maintain per-sequence block tables and a global block allocator. Reference counting for copy-on-write adds additional bookkeeping. This complexity is manageable in practice but adds implementation overhead compared to simple contiguous allocation.
PCIe bandwidth for swapping. When the scheduler preempts sequences by swapping their KV blocks to CPU memory, the PCIe link between the GPU and host RAM becomes a bottleneck. PCIe 4.0 x16 provides roughly 32 GB/s in each direction, but a large swap event can still add hundreds of milliseconds of latency for a request waiting to be readmitted. Recomputation avoids this but burns GPU compute instead.
Internal fragmentation from block granularity. The block size is a fixed parameter for the entire serving instance. If most requests produce short sequences and the block size is large, internal fragmentation re-emerges. Choosing an appropriate block size for a deployment's workload characteristics matters.
Distributed inference complexity. PagedAttention was originally designed for single-GPU or tensor-parallel setups. Pipeline parallelism across nodes introduces additional complexity because different stages must coordinate block table state. vLLM's distributed implementation uses Megatron-LM-style tensor parallelism, where each GPU holds a slice of each physical block, but pipeline-parallel block management is more involved.
Interaction with speculative decoding. Speculative decoding generates multiple candidate tokens using a draft model and verifies them with the main model. Managing block allocation for speculative tokens that may be rejected requires care; blocks allocated for rejected tokens must be rolled back or discarded efficiently.
vLLM's APC feature, added after the initial PagedAttention paper, extends block sharing to work automatically without explicit prefix groups. When a physical block is filled and becomes read-only, vLLM hashes its contents along with its position-dependent prefix hash. New requests with matching prefixes find their logical blocks remapped to existing physical blocks from the hash table, skipping recomputation. This turns PagedAttention's copy-on-write mechanism into a general-purpose KV cache that persists across requests.
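A sketch of how such content-hash chaining could look; the hash construction and dictionary cache below are illustrative stand-ins, not vLLM's actual APC code:

```python
import hashlib

def block_hash(prefix_hash: str, block_tokens: tuple[int, ...]) -> str:
    """Hash a full block's tokens together with the hash of everything before it,
    so identical token blocks at different positions in a prompt stay distinct."""
    h = hashlib.sha256()
    h.update(prefix_hash.encode())
    h.update(",".join(map(str, block_tokens)).encode())
    return h.hexdigest()

# Global cache that outlives individual requests: content hash -> physical block number.
cached_blocks: dict[str, int] = {}

def lookup_or_register(prefix_hash: str, block_tokens: tuple[int, ...], physical_block: int) -> int:
    """Reuse an existing physical block with the same content hash if one exists;
    otherwise register the newly filled block for future requests."""
    key = block_hash(prefix_hash, block_tokens)
    return cached_blocks.setdefault(key, physical_block)
```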
SGLang's RadixAttention organizes the KV cache as a radix tree where each edge represents a sequence of tokens and each node represents a point where multiple prefixes share a common ancestor. This structure makes prefix matching for new requests an efficient tree traversal, avoiding the per-block hashing of vLLM's APC. RadixAttention is particularly effective for workloads with deep, branching prefix structures such as multi-turn conversations, few-shot examples, or retrieval-augmented generation with shared retrieved contexts.
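The lookup can be sketched with a per-token trie; a real radix tree compresses runs of single-child nodes, and this is an illustration of the matching idea rather than SGLang's data structure:

```python
class TrieNode:
    """One node per token (a radix tree would merge chains of single-child nodes)."""

    def __init__(self):
        self.children: dict[int, "TrieNode"] = {}
        self.kv_handle = None        # reference to the cached KV data for this prefix

def longest_cached_prefix(root: TrieNode, tokens: list[int]):
    """Walk the tree and return (number of reusable prefix tokens, last cache handle)."""
    node, matched, handle = root, 0, None
    for tok in tokens:
        nxt = node.children.get(tok)
        if nxt is None:
            break
        node, matched = nxt, matched + 1
        if nxt.kv_handle is not None:
            handle = nxt.kv_handle
    return matched, handle

def insert_prefix(root: TrieNode, tokens: list[int], kv_handle) -> None:
    """Record that the KV cache for this token prefix now exists."""
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())
    node.kv_handle = kv_handle
```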
FlashInfer is an attention kernel library that provides PagedAttention-compatible kernels with additional optimizations, including support for grouped-query attention (GQA), sliding window attention, and attention with custom softmax modifications. FlashInfer 0.2 integrated Flash Attention 3 templates and a JIT compiler that generates specialized kernels for attention variants, reducing the need to maintain a separate CUDA kernel for every attention configuration. vLLM adopted FlashInfer as an optional kernel backend.
Chunked prefill divides long input sequences into smaller chunks processed in separate iterations. Combined with PagedAttention, this reduces peak memory pressure during the prefill phase, which would otherwise require allocating the full KV cache for a long prompt in one step. Chunked prefill also reduces time-to-first-token variance by preventing a single long prefill from blocking all decoding iterations. The two techniques are orthogonal: PagedAttention manages spatial fragmentation in memory while chunked prefill manages temporal concentration of compute.
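The chunking itself is straightforward; a small sketch of splitting a prompt into fixed-size prefill chunks, with the chunk size as an illustrative parameter:

```python
def prefill_chunks(prompt_tokens: list[int], chunk_size: int = 512):
    """Yield (start, chunk) pairs so each scheduler iteration prefills one chunk
    instead of the whole prompt, interleaving with decode steps of other requests."""
    for start in range(0, len(prompt_tokens), chunk_size):
        yield start, prompt_tokens[start:start + chunk_size]

# A 2,000-token prompt becomes four prefill iterations of at most 512 tokens each.
chunks = list(prefill_chunks(list(range(2000))))
assert [len(c) for _, c in chunks] == [512, 512, 512, 464]
```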
PagedEviction (arXiv 2509.04377) extends PagedAttention with a structured pruning strategy that selectively evicts KV cache blocks for less important tokens when memory pressure is high, rather than evicting entire sequences. This allows longer context windows to fit in fixed GPU memory at the cost of some approximation in attention over evicted tokens.
The vLLM V1 refactor (introduced in 2024-2025) redesigned the runtime architecture to separate the scheduler from the workers more cleanly, enable CUDAGraph capture for PagedAttention-based decoding, and improve compatibility with chunked prefill and speculative decoding. The V1 architecture maintains PagedAttention as its core memory abstraction while improving scheduling throughput and reducing Python overhead in the control plane.