RadixAttention is a KV cache management technique introduced in SGLang that uses a radix tree data structure to automatically share and reuse cached key-value tensors across inference requests. Developed by Lianmin Zheng, Ying Sheng, and colleagues at UC Berkeley and Stanford University, RadixAttention was described in the paper "Efficiently Programming Large Language Models using SGLang" (arXiv:2312.07104) and publicly released in January 2024. The technique enables prefix sharing across requests that begin with the same token sequence, eliminating redundant computation for shared system prompts, few-shot examples, document contexts, and multi-turn conversation histories.
Before RadixAttention, inference systems discarded the KV cache of each request after generation completed, even when subsequent requests would recompute identical attention tensors. RadixAttention retains these tensors in a persistent, tree-structured cache and matches new requests against it at scheduling time. In benchmark evaluations, the technique delivered 2x to 5.6x higher throughput compared to baseline systems across tasks including few-shot learning, document summarization, and ReAct agent execution.
In transformer-based language models, the attention mechanism requires key and value matrices for every token in the context. Recomputing them from scratch at each decoding step scales quadratically for a sequence of length $n$. The KV cache optimization resolves this by storing the keys and values computed for prior tokens, so each new decoding step only computes keys and values for the newest token and attends over the cached entries rather than reprocessing the full sequence. This reduces per-step cost from $O(n^2)$ to $O(n)$, at the expense of memory proportional to context length.
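The mechanics can be illustrated with a toy single-head decode loop. The sketch below uses NumPy, and all dimensions and projection weights are placeholder assumptions rather than any real model's parameters; it shows only that each step computes keys and values for one new token and attends over the cached entries.

```python
# A toy single-head decode loop with a KV cache, in NumPy. The dimensions and
# random projection weights (Wq, Wk, Wv) are placeholders for illustration only.
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
k_cache, v_cache = [], []        # grows by one entry per decoded token

def decode_step(x_new):
    """Attend the newest token's query over all previously cached keys/values."""
    q = x_new @ Wq                               # computed for one token only
    k_cache.append(x_new @ Wk)                   # cached instead of recomputed later
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)  # (n, d): the n cached entries
    scores = K @ q / np.sqrt(d)                  # O(n*d) per step instead of O(n^2*d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

for _ in range(8):                               # toy decode loop
    out = decode_step(rng.standard_normal(d))
```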
In conventional serving systems circa 2023, including early versions of vLLM, HuggingFace TGI, and LMQL, the KV cache was ephemeral per request. When a request finished, the allocated GPU memory was freed. This created a systematic inefficiency for workloads where requests shared a common prefix: the same system prompt, the same few-shot examples, or an accumulated multi-turn history had to be re-prefilled from scratch on every request.
The SGLang authors identified this as one of the primary sources of wasted compute in LLM serving. Their insight was that the KV cache, rather than being request-scoped, should be treated as a shared cache keyed by token sequence, with the same design philosophy as CPU caches or virtual memory systems.
A radix tree (also called a Patricia trie or compressed prefix tree) is a space-efficient tree where each node represents a shared prefix among its children. Unlike a standard trie where each edge corresponds to a single character or token, a radix tree compresses chains of nodes with only one child into a single edge labeled with a sequence. This compression reduces both memory consumption and the number of pointer traversals required to find a prefix match.
In SGLang's implementation, the radix tree maps token sequences (as keys) to KV cache tensors (as values). The root of the tree represents the empty prefix. Each edge is labeled with a contiguous subsequence of tokens. Each internal node represents a prefix shared by at least two distinct cached sequences, and each leaf node represents the endpoint of a fully cached sequence.
The KV cache tensors associated with each node are stored on-device (GPU) in a non-contiguous paged layout, where each page holds the keys and values for a fixed number of tokens (typically one token per page at the finest granularity, or aligned to page boundaries of 8, 16, or 32 tokens). The tree structure itself is maintained on CPU with negligible memory overhead.
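The split between CPU-side tree metadata and GPU-side paged tensors can be sketched roughly as follows. The field names, toy pool dimensions, and PAGE_SIZE constant are assumptions for illustration, not SGLang's actual TreeNode or RadixCache definitions.

```python
# Illustrative sketch: tree metadata lives in ordinary Python objects (CPU), while
# KV tensors sit in a preallocated paged pool addressed by page indices (GPU).
from dataclasses import dataclass, field
import torch

PAGE_SIZE = 1   # tokens per page: 1 gives token-level matching, 8/16/32 coarser hits

@dataclass
class TreeNode:
    tokens: list                    # edge label: token IDs covered by this node
    kv_page_indices: list           # indices into the paged KV pool, one per page
    children: dict = field(default_factory=dict)  # first token of child edge -> child
    ref_count: int = 0              # active requests currently pinning this node
    last_access: float = 0.0        # timestamp consulted by LRU eviction

# Paged KV pool: indexed by page id rather than by request, so cached pages can be
# shared between requests and need not be contiguous. Toy sizes; a real server
# allocates the pool on the GPU (device="cuda") and sizes it to fill free memory.
num_layers, num_pages, num_heads, head_dim = 4, 1024, 8, 128
kv_pool = torch.empty(num_layers, 2, num_pages, PAGE_SIZE, num_heads, head_dim,
                      dtype=torch.float16)

root = TreeNode(tokens=[], kv_page_indices=[])    # the empty prefix
```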
Three operations on the radix tree are central to RadixAttention:
Prefix lookup. When a new request arrives, the scheduler traverses the radix tree from the root, following edges whose token labels match the leading tokens of the incoming prompt. The traversal continues as long as matches exist. The result is the longest cached prefix found, along with a pointer to the corresponding KV tensors. Lookups run in $O(k)$ time, where $k$ is the length of the matching prefix.
Insertion. After a request completes generation, its full token sequence (the prompt plus the generated output) is inserted into the tree. If the new sequence shares a prefix with an existing node but diverges at some point, the existing node is split: a new internal node is created to represent the shared prefix, and two children are created for the diverging suffixes. This dynamic splitting is what allows the tree to represent arbitrary overlapping prefixes without pre-specifying them.
Eviction. When GPU memory is insufficient to admit a new request, the system must free cached tensors. The eviction policy operates on leaf nodes only: a non-leaf node is referenced by at least one child and cannot be evicted without first evicting that child. Each node carries a reference counter tracking the number of active requests using it. Nodes with a non-zero reference count are ineligible for eviction regardless of recency. Among eligible leaf nodes, the system evicts the least recently used (LRU) node, then checks whether its parent has become a leaf and is also evictable, and continues recursively up the tree.
This recursive LRU policy has an important property: it frees complete cold subtrees rather than stopping at a single leaf. Once a node's children have all been evicted, it becomes a leaf itself, and if it too is unreferenced and stale, the cascading check reclaims it in the same pass, so entire unused branches are released at once.
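The three operations can be put together in a small, self-contained toy model. Everything below is a sketch under simplifying assumptions (no KV pages attached to nodes, eviction counted in nodes rather than bytes) and does not mirror SGLang's RadixCache code.

```python
# A toy model of the three operations above: prefix lookup, insertion with node
# splitting, and cascading LRU eviction. Class and function names are illustrative;
# real nodes would also carry indices into the paged KV pool.
import time

class Node:
    def __init__(self, tokens=None, parent=None):
        self.tokens = list(tokens or [])  # edge label leading into this node
        self.children = {}                # first token of a child's edge -> child
        self.parent = parent
        self.ref_count = 0                # active requests pinning this node
        self.last_access = time.monotonic()

def match_prefix(root, prompt):
    """Return the number of leading prompt tokens found in the cache (O(k))."""
    node, i = root, 0
    while i < len(prompt):
        child = node.children.get(prompt[i])
        if child is None:
            break
        j = 0                             # walk along the compressed edge label
        while (j < len(child.tokens) and i + j < len(prompt)
               and child.tokens[j] == prompt[i + j]):
            j += 1
        child.last_access = time.monotonic()
        i += j
        if j < len(child.tokens):         # diverged (or prompt ended) mid-edge;
            break                         # a real cache would split the node here
        node = child
    return i

def insert(root, tokens):
    """Insert a full token sequence, splitting an existing edge where it diverges."""
    node, i = root, 0
    while i < len(tokens):
        child = node.children.get(tokens[i])
        if child is None:                 # no shared edge: hang the suffix here
            node.children[tokens[i]] = Node(tokens[i:], parent=node)
            return
        j = 0
        while (j < len(child.tokens) and i + j < len(tokens)
               and child.tokens[j] == tokens[i + j]):
            j += 1
        if j < len(child.tokens):         # split: new internal node for the shared part
            mid = Node(child.tokens[:j], parent=node)
            node.children[tokens[i]] = mid
            child.tokens = child.tokens[j:]
            child.parent = mid
            mid.children[child.tokens[0]] = child
            child = mid
        i += j
        node = child

def _all_nodes(node):
    yield node
    for child in node.children.values():
        yield from _all_nodes(child)

def evict(root, num_nodes):
    """Cascading LRU eviction: free unreferenced leaves, oldest first. After a
    victim is removed, its parent may itself become an evictable leaf and is
    picked up on the next iteration, so whole cold subtrees drain bottom-up."""
    freed = 0
    while freed < num_nodes:
        leaves = [n for n in _all_nodes(root)
                  if n is not root and not n.children and n.ref_count == 0]
        if not leaves:
            break
        victim = min(leaves, key=lambda n: n.last_access)
        del victim.parent.children[victim.tokens[0]]   # KV pages freed here in reality
        freed += 1
    return freed
```

In this toy model, inserting `[1, 2, 3, 4]` and then `[1, 2, 9]` splits the first edge at the shared prefix `[1, 2]`, yielding one internal node with two children, exactly the divergence handling described above; a subsequent `match_prefix(root, [1, 2, 9, 7])` then returns 3.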
The operational flow for a request under RadixAttention is as follows:
1. When the request arrives, the scheduler matches its prompt tokens against the radix tree and locates the longest cached prefix.
2. The reference counts of the matched nodes are incremented so that their KV tensors cannot be evicted while the request is in flight.
3. Prefill computes KV tensors only for the uncached suffix of the prompt; the cached tensors are supplied to the attention backend by index.
4. Decoding proceeds as usual, appending newly computed KV entries to the paged pool.
5. When generation completes, the prompt plus the generated tokens are inserted into the tree and the reference counts are decremented.
This workflow integrates naturally with FlashAttention and FlashInfer kernels. SGLang passes cache indices (pointers to cached KV states) directly to the attention backend, which combines new and retrieved values in a single fused kernel call. The backend handles the fact that the cached and newly computed tensors may reside in non-contiguous memory pages.
Importantly, RadixAttention requires no changes to the model architecture or attention algorithm. It operates entirely at the serving layer by managing which KV tensors are computed versus retrieved, and it is compatible with continuous batching and tensor parallelism.
A naive scheduler processes requests in FIFO order, which can reduce cache utilization by interleaving requests that share different prefixes. RadixAttention includes a cache-aware scheduling policy that sorts incoming requests by the length of their matching prefix: requests with a longer cache hit are admitted to a batch first. This greedy heuristic increases the probability that cache-warm requests run while the relevant KV tensors are still resident in GPU memory, reducing the rate at which recently inserted nodes are evicted before they can be reused.
The scheduler also avoids admitting a new request that would force eviction of a node currently referenced by an in-progress request, preventing memory hazards.
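A toy version of the heuristic might look like the sketch below, which assumes a match_prefix helper like the one in the toy model above and a simple per-batch token budget; both are illustrative assumptions rather than SGLang's scheduler logic.

```python
# Toy cache-aware batching: admit requests with the longest cached prefix first,
# skipping any request whose uncached portion would overflow the batch token budget.

def schedule_batch(waiting, tree_root, match_prefix, token_budget):
    scored = []
    for req in waiting:
        hit = match_prefix(tree_root, req["prompt_tokens"])   # cached prefix length
        scored.append((hit, req))
    scored.sort(key=lambda pair: pair[0], reverse=True)       # longest cache hit first

    batch, used = [], 0
    for hit, req in scored:
        uncached = len(req["prompt_tokens"]) - hit            # tokens still to prefill
        if used + uncached > token_budget:
            continue
        batch.append(req)
        used += uncached
    return batch
```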
The LRU eviction policy in RadixAttention follows standard cache management principles with adaptations for the tree structure. Each node records its last access timestamp, updated whenever a request performs a cache hit on that node or any of its descendants. When memory pressure triggers eviction, the system selects the leaf node with the oldest access timestamp, frees its KV tensors, removes it from the tree, and checks whether its parent has become a new leaf eligible for eviction.
Several invariants govern the eviction process:
- A node with a non-zero reference count is never evicted, regardless of how stale its last-access timestamp is.
- Only leaf nodes are evicted directly; an internal node becomes eligible only after all of its children have been evicted.
- Eviction cascades upward: when a victim's removal turns its parent into an unreferenced leaf, the parent is considered next, so complete cold subtrees are reclaimed rather than leaving stranded branches.
Later versions of SGLang extended the eviction framework to support alternative policies including LFU (least frequently used, which retains popular prefixes regardless of recency), FIFO (first-in-first-out), and priority-based eviction for multi-tenant deployments where different tenants or request types carry explicit priority levels.
RadixAttention was designed with several recurring workload patterns in mind, each of which produces a different reuse topology in the radix tree.
Most production deployments prepend a system prompt before user input. System prompts can range from a few dozen tokens to several hundred tokens describing the model's persona, capabilities, and constraints. When every user request begins with the same system prompt, RadixAttention caches the system prompt KV tensors on the first request and reuses them for all subsequent ones. The cache hit on the system prompt prefix is effectively 100% once the first request warms the cache, and the reuse persists across the lifetime of the server process.
Few-shot prompting prepends $k$ labeled examples before the actual query. In benchmark evaluation workloads such as MMLU, HellaSwag, and GSM-8K, every query shares the same $k$ examples. This creates a star topology in the radix tree: a single internal node representing the shared examples, with one leaf per distinct query. In the SGLang paper's evaluation, this pattern yielded 4.4x higher throughput than vLLM for few-shot workloads and cache hit rates of 85 to 95%.
In RAG pipelines, documents retrieved from a vector store are concatenated into the prompt before a user question. When multiple questions are asked about the same document (or overlapping document sets), the document tokens can be cached and reused. This pattern is common in enterprise document Q&A systems where a user session involves multiple questions about the same contract, report, or manual. Cache hit rates for RAG workloads typically fall in the 80 to 95% range depending on how frequently document chunks repeat across requests.
In chat applications, each successive turn appends a new user message and assistant response to the accumulated history. The radix tree naturally represents this as a chain: turn 1 prefix, then turn 2 prefix (which includes turn 1), then turn 3 (which includes turns 1 and 2), and so on. When a new turn arrives, the scheduler finds the longest matching prefix (all prior turns), computes attention only over the new user message, and extends the cached chain. This eliminates the redundant reprocessing of prior turns that occurred in systems without prefix caching. Chat workloads typically see cache hit rates of 60 to 80%, with higher rates for longer histories and fewer new tokens per turn.
In agentic frameworks such as ReAct (Reasoning and Acting), the model interleaves reasoning traces, tool calls, and observations in a single growing prompt. Each agent step appends several hundred tokens before the model generates the next action. Because each step extends the prefix from the previous step, the radix tree grows linearly with the agent's execution trace, and each new step reuses the entire prior trace from cache. In the SGLang paper's ReAct benchmark, this produced 5.6x higher throughput than vLLM and reduced latency to roughly 13% of vLLM's for single-instance execution.
Tree-of-thought (ToT) prompting and self-consistency decoding both require branching from a common prefix. In ToT, the model generates multiple candidate reasoning paths from the same problem statement. In self-consistency, the same question is sampled multiple times to aggregate answers. Both patterns produce a tree-shaped reuse topology: a shared root (the problem statement), with multiple branches for different samples or paths. RadixAttention handles this natively because the radix tree can represent arbitrary branching, unlike flat FIFO caches.
The SGLang paper evaluated RadixAttention against vLLM (v0.1.7), Guidance, and HuggingFace TGI across six benchmark tasks using the Vicuna-13B and Llama2-70B models.
| Benchmark | SGLang throughput (req/s) | vs vLLM speedup |
|---|---|---|
| MMLU (5-shot) | 4,250 | 3.0x |
| Few-shot learning | 2,850 | 4.4x |
| HellaSwag | varies | 2.0x |
| GSM-8K | varies | 4.5x |
| ReAct agent | varies | 5.6x |
| Long document JSON | varies | 2.9x |
Caching latency overhead in the absence of cache hits was measured at 0.5% of total inference time for 128 concurrent requests, indicating that the radix tree traversal and insertion impose negligible cost when cache misses occur.
Cache hit rates reported in the paper and subsequent community benchmarks vary significantly by workload:
| Workload type | Typical cache hit rate |
|---|---|
| System prompt only | ~100% (after warmup) |
| Few-shot (shared examples) | 85-95% |
| RAG (document reuse) | 80-95% |
| Multi-turn chat | 60-80% |
| Agent execution loops | 70-90% |
| Diverse single-turn queries | 20-40% |
First-token latency (TTFT) improves roughly in proportion to the length of the cache hit: a request where 80% of the prompt is cached skips roughly 80% of the prefill computation, and TTFT falls by a corresponding factor.
PagedAttention, introduced in the original vLLM paper (arXiv:2309.06180) by Woosuk Kwon, Zhuohan Li, Lianmin Zheng, and colleagues at UC Berkeley, addresses a different problem than RadixAttention. PagedAttention solves memory fragmentation: transformer KV caches traditionally required contiguous GPU memory allocations, but GPU memory is not always available in contiguous blocks large enough for long sequences. PagedAttention maps logical KV cache blocks to non-contiguous physical memory pages, analogous to virtual memory paging in operating systems. This reduces memory waste to under 4% of total KV cache allocation and enables more concurrent requests to fit in GPU memory.
RadixAttention, by contrast, solves the reuse problem: how to avoid recomputing KV tensors that two or more requests would produce identically. The two techniques address orthogonal dimensions of serving efficiency.
| Property | PagedAttention | RadixAttention |
|---|---|---|
| Primary goal | Eliminate memory fragmentation | Eliminate redundant KV computation |
| Mechanism | Non-contiguous physical paging | Radix tree prefix sharing |
| Memory layout | Fixed-size logical-to-physical block table | Variable-length tree nodes, paged storage |
| Granularity | Block-level (fixed block size) | Token-level (or page-aligned) |
| Reuse scope | Within a single request (copy-on-write for beam search) | Across requests (arbitrary prefix sharing) |
| Eviction policy | LRU at block level | LRU at tree node level, cascading |
| Overhead when no sharing | Minimal | Less than 1% of forward pass time |
| Introduced | June 2023 (vLLM v0.1) | January 2024 (SGLang) |
The two techniques are complementary, and both SGLang and later versions of vLLM use them together. SGLang adopted paged KV cache storage (with page size configurable from 1 to 32 tokens) while layering the radix tree structure on top to manage which pages are shared across requests.
vLLM's own prefix caching differs architecturally from RadixAttention. vLLM uses a hash-based approach: each fixed-size KV cache block is assigned a hash derived from the block's token IDs and the hash of the preceding block, forming a hash chain. When a new request arrives, vLLM hashes its prefix blocks and looks them up in a hash table; a match constitutes a cache hit. This approach is compatible with vLLM's existing block table infrastructure and scales well, but operates at block granularity (typically 16 or 32 tokens per block) rather than token granularity, meaning a partial block match produces no hit. SGLang's radix tree can match at the individual token level (or at any page boundary), giving it finer-grained hit detection for prefixes that do not align to block boundaries.
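A minimal sketch of the hash-chain idea follows; it uses Python's built-in hash purely for illustration and is not vLLM's block-hashing code.

```python
# Each full block's identity depends on its own token IDs and on the hash of the
# preceding block, so two requests share a cached block only when their entire
# prefixes up to that block are identical.
BLOCK_SIZE = 16

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """One hash per *complete* block; a trailing partial block yields no hash,
    which is why a prefix match that ends mid-block produces no cache hit there."""
    hashes, prev = [], None
    for start in range(0, len(token_ids) - block_size + 1, block_size):
        block = tuple(token_ids[start:start + block_size])
        prev = hash((prev, block))        # chain in the previous block's hash
        hashes.append(prev)
    return hashes
```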
| Feature | SGLang RadixAttention | vLLM prefix caching |
|---|---|---|
| Data structure | Radix tree | Hash table over paged blocks |
| Matching granularity | Token-level (configurable) | Block-level (fixed block size) |
| Prefix split handling | Dynamic node splitting | Partial block = no hit |
| Eviction | LRU tree-node eviction, cascading | LRU block eviction |
| Cache-aware scheduling | Built-in (longest prefix first) | Available in V1 |
| Security isolation | Priority eviction (multi-tenant) | Cache salt per request |
| Default since | SGLang v0.1 (always on) | vLLM v0.8.0 (V1 architecture) |
RadixAttention was built into SGLang from its initial public release and is enabled by default. The codebase is hosted at github.com/sgl-project/sglang under the LMSYS organization. As of 2025, SGLang has been deployed in production at xAI (serving Grok 3), on Microsoft Azure for DeepSeek R1 on AMD GPUs, and on Google Cloud, AWS, and Oracle Cloud infrastructure. The project reports daily token generation in the trillions across deployments running on more than 400,000 GPUs.
SGLang v0.4 (December 2024) introduced a cache-aware load balancer for distributed deployments, which routes incoming requests to the replica most likely to produce a cache hit based on prefix matching against per-replica cache state. This extended RadixAttention's benefits to multi-node serving by minimizing cross-replica cache cold starts. The load balancer was reported to achieve up to 1.9x higher throughput and 3.8x higher cache hit rates in distributed configurations.
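A simplified sketch of such routing is shown below; it assumes each replica exposes an approximate view of its cached prefixes and a match_prefix helper like the toy one above, and it is not the actual SGLang router.

```python
# Toy cache-aware routing: prefer the replica with the longest warm prefix,
# falling back to the least-loaded replica when no replica has a cache hit.

def route(prompt_tokens, replicas, match_prefix):
    """replicas: list of dicts holding a per-replica prefix 'tree' and current 'load'."""
    best, best_hit = None, -1
    for replica in replicas:
        hit = match_prefix(replica["tree"], prompt_tokens)
        if hit > best_hit:
            best, best_hit = replica, hit
    if best_hit == 0:                       # cold prefix everywhere: balance by load
        best = min(replicas, key=lambda r: r["load"])
    best["load"] += 1
    return best
```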
SGLang joined the PyTorch ecosystem in 2024, and the SGLang team has become active in the broader LLM inference research community, with follow-on work including HiCache (hierarchical multi-tier caching across GPU HBM, CPU DRAM, and NVMe SSD), disaggregated prefill-decode, and speculative decoding integration.
vLLM's initial release in June 2023 did not include automatic prefix caching. The vLLM team added experimental automatic prefix caching (APC) in v0.4.0 in 2024, using the hash-based block approach. In the vLLM V1 architecture (introduced with v0.8.0 in early 2025 and made the default), prefix caching was redesigned as a core feature rather than an optional flag. The vLLM documentation notes that V1 prefix caching incurs less than 1% throughput overhead even at a 0% cache hit rate, and multiplies throughput significantly at high hit rates. The V1 design also supports cache salting per request for security isolation in multi-tenant deployments.
The SGLang paper is cited in the vLLM documentation as prior work on prefix caching for LLM serving.
Prompt caching features offered by OpenAI (from October 2024), Anthropic (from August 2024), and Google are conceptually similar to RadixAttention in that they cache the KV tensors for repeated prompt prefixes and charge reduced rates for cache hits. These implementations are proprietary and their exact mechanisms differ, but the underlying motivation, shared prefix reuse, is the same.
SGLang's RadixAttention implementation is contained primarily in the RadixCache class and the TreeNode class within the codebase. Key design choices include:
Paged KV storage. KV tensors are stored in pages on GPU. The default page size is 1 token, allowing token-level prefix matching, but larger page sizes (8, 16, 32 tokens) reduce tree overhead at the cost of coarser hit detection. A configurable --page-size flag controls this tradeoff.
CPU tree, GPU tensors. The tree structure (node metadata, child pointers, token labels) lives on CPU memory. Only the KV tensors themselves occupy GPU memory. This separation means the overhead of tree operations does not consume GPU compute cycles during the forward pass.
Reference counting. Each TreeNode carries an integer reference count incremented when a request starts using that node's cached tensors and decremented when the request completes. The eviction policy checks reference counts before freeing any node, ensuring active requests are never preempted.
Chunked prefill compatibility. When a request's uncached portion is long, SGLang can split the prefill across multiple micro-batches using chunked prefill (default chunk size 2048 tokens). The cached portion is processed in a single attention call to the backend; only the chunked uncached portion is spread across multiple steps.
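The chunking step itself is simple; the sketch below assumes a hypothetical prefill_step callback that stands in for one forward pass over a micro-batch, and the cached prefix is not re-chunked.

```python
# Only the uncached suffix is split into fixed-size micro-batches; `prefill_step`
# is a hypothetical stand-in for one forward pass of the model.
CHUNK_SIZE = 2048   # the default chunk size mentioned above

def chunked_prefill(uncached_tokens, prefill_step):
    for start in range(0, len(uncached_tokens), CHUNK_SIZE):
        prefill_step(uncached_tokens[start:start + CHUNK_SIZE])
```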
FlashInfer integration. SGLang uses FlashInfer as its attention backend for most configurations. FlashInfer accepts sparse attention indices that include both cached and newly computed KV positions, handling the non-contiguous memory layout transparently. This allows RadixAttention to reuse arbitrarily arranged cached tensors without copying them into contiguous buffers before the attention call.
Distributed serving. In tensor-parallel configurations, each GPU holds a shard of the KV tensors for each tree node. The radix tree structure is replicated on each node's CPU, but the actual KV tensors are sharded. Cache hits are detected using the shared tree structure, and each GPU loads its shard of the cached tensors independently.
Reasoning model cache pollution. When serving reasoning models such as DeepSeek-R1 or QwQ, the model produces a chain-of-thought trace as part of its output. SGLang inserts both the prompt and the generated tokens (including the thinking trace) into the radix tree on request completion. However, the SGLang chat API drops thinking tokens from the assistant history when constructing multi-turn prompts, following the models' recommended API behavior. This creates orphaned subtrees: the radix tree contains branches ending with thinking tokens that no subsequent request will ever match. These dead branches occupy GPU memory until LRU eviction claims them, and at typical thinking token counts of around 5,000 tokens per turn, a modest number of concurrent sessions can strand tens of gigabytes of GPU memory in unreachable cache entries. This issue was reported and tracked as a known problem in the SGLang GitHub repository (issue #22373 as of 2025).
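A back-of-the-envelope calculation illustrates the scale; every model dimension below is an assumed example rather than a measured figure for any particular model.

```python
# Rough estimate of memory stranded by orphaned thinking-token branches.
# All model dimensions here are illustrative assumptions.
layers, kv_heads, head_dim = 60, 8, 128          # hypothetical model configuration
bytes_per_elem = 2                                # fp16 / bf16
kv_bytes_per_token = layers * 2 * kv_heads * head_dim * bytes_per_elem  # K and V
# ~0.23 MiB per token under these assumptions

thinking_tokens_per_turn = 5_000
stranded_per_turn = thinking_tokens_per_turn * kv_bytes_per_token        # ~1.1 GiB
sessions, turns_each = 8, 4
total = sessions * turns_each * stranded_per_turn
print(f"{total / 2**30:.1f} GiB of unreachable cache entries")           # ~36.6 GiB
```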
Workload dependence. RadixAttention provides no benefit when requests are entirely unique (no shared prefixes). For workloads with high entropy in the prompt, such as single-turn creative generation with diverse user inputs, cache hit rates approach zero and the radix tree overhead is a pure cost, albeit a small one (under 1%).
Tree contention under high concurrency. The radix tree is a shared mutable data structure. Under very high request arrival rates, tree operations (lookup, insertion, eviction) must be serialized to avoid race conditions. SGLang's scheduler is single-threaded on CPU, which bounds the tree update rate. At extreme scale, this can become a bottleneck relative to GPU throughput, particularly for workloads with short cached prefixes that require frequent insertions.
Memory overhead from long unique outputs. If requests produce long generated sequences that are subsequently cached, the tree can accumulate large leaf nodes representing rarely reused generation outputs. This is a common pattern when generation diversity is high (temperature > 0, no identical completions), and the resulting tree contains many leaf nodes competing for eviction with more valuable internal nodes representing shared prefixes.
Cross-request isolation in multi-tenant settings. Without explicit configuration, requests from different tenants can read each other's cached KV tensors if they share a prefix. This is not a correctness problem (the KV tensors encode only the attention over the prompt, not private user data per se), but it may violate privacy or compliance requirements in shared infrastructure. vLLM addresses this with cache salting; SGLang offers priority-based eviction that can be used to segregate tenant caches, but does not have native cryptographic cache isolation as of 2025.
Multimodal extension. While RadixAttention can cache image tokens by treating them like token IDs, image preprocessing (encoding patches into embeddings) still occurs outside the tree lookup. Partial image matches, where two requests share some but not all image patches, are not handled as efficiently as text prefix sharing.