Disaggregated serving is an LLM inference architecture that physically separates the prefill phase and the decode phase of text generation onto different sets of GPU hardware. Also called prefill-decode disaggregation or P/D disaggregation, the technique recognizes that the two phases have fundamentally different computational profiles and therefore perform best on different hardware configurations. Keeping them on the same GPUs forces a compromise that satisfies neither phase well. By routing each request through dedicated prefill workers first and dedicated decode workers second, operators can tune each pool independently, eliminating the interference that degrades latency and throughput in conventional colocated systems.
The approach gained traction as a research topic in 2024 with papers from Peking University (DistServe), Microsoft Research (Splitwise), and Moonshot AI (Mooncake). By 2025 it had moved from research into production: vLLM, SGLang, NVIDIA TensorRT-LLM, and NVIDIA Dynamo all ship disaggregated serving as a supported mode. DeepSeek reports running disaggregated serving in production across thousands of nodes.
Every autoregressive LLM inference request passes through two sequential stages.
The prefill stage processes the entire input prompt in a single forward pass. Because all input tokens are known in advance, the computation is structured as large matrix multiplications over the full sequence length. This is compute-bound work: the GPU tensor cores are the bottleneck, not memory bandwidth. A prompt of 4,096 tokens on an H100 achieves arithmetic intensity in the range of 200 to 400 floating-point operations per byte of memory accessed, at or beyond the ridge point of the GPU's roofline, where compute rather than memory bandwidth limits performance.
The decode stage generates one output token at a time. Each step requires reading the full KV cache for all previous tokens plus the model weights from high-bandwidth memory (HBM), performing a forward pass, and sampling the next token. Because only one token is generated per step and the full KV cache must be reread on every step, decode is memory-bandwidth-bound. Arithmetic intensity drops to roughly 60 to 80 operations per byte, and GPU utilization falls to 20 to 40 percent. The tensor cores finish quickly and then idle while waiting for the next memory read.
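A back-of-the-envelope roofline check makes the distinction concrete. The sketch below uses rough public H100 figures (illustrative approximations, not exact specifications) to compute the ridge point, the arithmetic intensity above which a kernel becomes compute-bound:

```python
# Rough roofline check for prefill vs. decode on an H100-class GPU.
# Peak figures are approximate public numbers, used only for illustration.
PEAK_FLOPS = 990e12   # dense BF16, FLOP/s
PEAK_BW = 3.35e12     # HBM bandwidth, bytes/s

ridge_point = PEAK_FLOPS / PEAK_BW   # FLOP per byte needed to saturate compute
print(f"ridge point: {ridge_point:.0f} FLOP/byte")  # ~295

def regime(arithmetic_intensity):
    """Classify a kernel by where it sits relative to the ridge point."""
    return "compute-bound" if arithmetic_intensity >= ridge_point else "memory-bandwidth-bound"

print("prefill @ ~300 FLOP/byte:", regime(300))  # compute-bound
print("decode  @  ~70 FLOP/byte:", regime(70))   # memory-bandwidth-bound
```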
These different bottlenecks have practical consequences. Compute-bound prefill benefits from a low tensor-parallel (TP) degree, so that each model instance spans few GPUs, combined with large batch sizes that amortize per-request overhead and keep the tensor cores saturated. Memory-bandwidth-bound decode benefits from configurations that maximize aggregate HBM throughput, often a higher TP degree or more replicas processing smaller batches. No single configuration optimizes both.
When prefill and decode share GPU resources, they interfere with each other in two ways.
First, continuous batching (the standard scheduling technique used in vLLM and similar frameworks) interleaves prefill jobs with ongoing decode iterations. Each time a new request arrives, its prefill computation displaces one or more decode steps. This causes inter-token latency (ITL) spikes in all active decode sequences. On a heavily loaded server, decode requests get interrupted every time a prefill job enters the system. Users experience this as irregular token streaming: the model pauses, generates a burst, pauses again.
Second, resource allocation is coupled. The GPU memory capacity, parallelism strategy, and batch size must serve both workloads simultaneously. The optimal configuration for prefill (large batch, compute-optimized) conflicts with the optimal configuration for decode (smaller batch, memory-bandwidth-optimized). Operators must pick a single compromise setting.
The result is that colocated systems cannot simultaneously maintain tight service-level objectives (SLOs) on both time-to-first-token (TTFT) and inter-token latency (ITL or TPOT, time per output token). Tightening TTFT requires dedicating more compute to prefill, which increases ITL jitter. Reducing ITL requires protecting decode batches from interruption, which allows TTFT to rise.
Disaggregated serving resolves this tension by routing each request through two separate worker pools.
A prefill worker (sometimes called a context worker) receives the prompt, runs the full prefill forward pass, and produces the initial KV cache. It then transfers the KV cache tensors over a high-speed interconnect to a decode worker and moves on to its next prefill request.
A decode worker receives the KV cache from the prefill worker and runs the autoregressive generation loop, producing one token per step until a stop condition is met. It never runs prefill computation, so its decode iterations are never interrupted.
A router or load balancer sits in front of both pools and coordinates request dispatch. When a request arrives, the router sends it to an available prefill worker. When prefill completes, the router arranges KV cache transfer to an appropriate decode worker.
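A minimal sketch of this request lifecycle follows. The router, the worker objects, and their methods are hypothetical placeholders that illustrate the control flow, not any framework's actual API:

```python
async def serve_request(prompt_tokens, router):
    """Hedged sketch of one request's path through a disaggregated deployment.
    `router` and the worker objects it returns are hypothetical placeholders."""
    # 1. Dispatch to an available prefill (context) worker.
    prefill_worker = await router.pick_prefill_worker()
    kv_handle, first_token = await prefill_worker.prefill(prompt_tokens)

    # 2. Arrange the KV cache transfer to a suitable decode worker.
    decode_worker = await router.pick_decode_worker(kv_handle.size_bytes)
    await router.transfer_kv(kv_handle, src=prefill_worker, dst=decode_worker)

    # 3. The autoregressive loop runs entirely on the decode worker,
    #    so its iterations are never interrupted by new prefills.
    async for token in decode_worker.decode(kv_handle, first_token):
        yield token
```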
Because the two pools are independent, each can be sized and configured differently. Prefill workers can use lower TP degree (fewer GPUs per model instance) and larger batch sizes to maximize compute throughput. Decode workers can use higher TP degree or more replicas, with batch sizes tuned for memory bandwidth. The ratio of prefill workers to decode workers can be adjusted based on the workload: prompts that are long relative to output benefit from more prefill capacity; short prompts with long outputs shift the balance toward decode.
Traditional inference benchmarks report raw throughput in tokens per second or requests per second without regard for whether latency SLOs are met. The DistServe paper introduced goodput as a more relevant measure: the maximum request rate the system can sustain while meeting specified TTFT and TPOT objectives. A system that processes 1,000 requests per second but delivers 30 percent of responses late has goodput significantly below 1,000. Disaggregated serving improves goodput by making SLO attainment reliable on both dimensions rather than trading one against the other.
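As a sketch of how goodput differs from raw throughput, the function below checks SLO attainment at a single request rate; sweeping over rates and taking the largest rate that passes yields the goodput. The SLO thresholds and 90 percent attainment target are illustrative values, not DistServe's published settings:

```python
def goodput_at_rate(samples, request_rate, ttft_slo_s=0.5, tpot_slo_s=0.05, target=0.9):
    """Return `request_rate` if at least `target` of requests met both the
    TTFT and TPOT SLOs at that rate, else 0. `samples` holds (ttft_s, tpot_s)
    pairs measured while the system was driven at `request_rate` req/s."""
    met = sum(1 for ttft, tpot in samples if ttft <= ttft_slo_s and tpot <= tpot_slo_s)
    return request_rate if met / len(samples) >= target else 0.0

# Raw throughput counts every completed request; goodput counts only rates at
# which enough of them were on time, so a rate with 30% late responses scores 0.
```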
DistServe, published by Yinmin Zhong and colleagues from Peking University, UC San Diego, and StepFun, appeared at the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024) as one of the earliest formal treatments of disaggregated LLM serving.
The paper's central observation is that colocated serving creates interference in two directions: prefill computation delays ongoing decode steps, and the need to serve both phases forces suboptimal resource allocation for each. DistServe addresses this by assigning prefill and decode to separate GPUs and solving an optimization problem to find the best resource allocation and parallelism strategy for each phase independently, given the cluster's inter-node bandwidth budget.
The system introduces a placement algorithm that considers available bandwidth when deciding how to route KV cache transfers between prefill and decode nodes. If the cluster has sufficient interconnect bandwidth, the KV cache transfer latency is small compared to the decode step latency, and the transfer can be hidden within normal operation. If bandwidth is constrained, the algorithm adjusts placement to reduce transfer volume.
Evaluated on three representative workloads, DistServe reported significant improvements in goodput over colocated baselines:
| Workload | Goodput improvement over vLLM |
|---|---|
| Chatbot | 2.0x to 3.41x |
| Code completion | 3.2x |
| Summarization | 4.48x |
At the system level, DistServe achieved up to 7.4 times higher request throughput and SLOs up to 12.6 times tighter than state-of-the-art colocated systems while keeping over 90 percent of requests within latency constraints.
The DistServe code was released on GitHub under the LLMServe organization and directly influenced the disaggregated serving implementations in subsequent frameworks.
Splitwise, authored by Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Inigo Goiri, Saeed Maleki, and Ricardo Bianchini from Microsoft Research, appeared at the International Symposium on Computer Architecture (ISCA) 2024.
Splitwise takes a hardware-oriented view of the same problem. The paper characterizes prefill and decode as distinct workloads with different hardware utilization signatures and argues that they are naturally suited to different GPU generations or tiers.
Prefill, being compute-bound, benefits from the high FLOPS of the latest-generation accelerators. Decode, being memory-bandwidth-bound, benefits most from GPUs with large HBM capacity and high memory bandwidth, which are often cheaper than the newest compute-optimized chips. Splitwise proposes running prefill on a pool of high-FLOPS GPUs and decode on a separate pool of cost-effective high-memory-bandwidth GPUs. This heterogeneous hardware strategy can reduce total cost of ownership while maintaining or improving performance.
A key engineering contribution in Splitwise is its approach to KV cache transfer. Rather than waiting until the entire prefill is complete before sending the KV cache, Splitwise overlaps transfer with prefill computation. As each transformer layer completes its prefill pass, the KV vectors for that layer begin transferring asynchronously to the decode GPU. By the time the final layer's prefill is done, the earlier layers' KV tensors have already arrived. This layer-wise pipelined transfer can hide a large fraction of the transfer latency, reducing the effective overhead of disaggregation.
Splitwise reported approximately 1.4 times throughput improvement over colocated baselines with about 20 percent reduction in cost per token, achieved by matching workload characteristics to hardware capabilities.
Mooncake is the serving platform behind Kimi, Moonshot AI's flagship LLM service, and was described in a paper by Ruoyu Qin and colleagues from Tsinghua University and Moonshot AI, released in June 2024 (arXiv:2407.00079).
Where DistServe and Splitwise treat disaggregation primarily as a GPU scheduling problem, Mooncake takes a KVCache-centric view of the entire serving stack. Its central argument is that KV cache placement and movement should drive all other scheduling decisions, including how requests are routed and which GPUs handle which phases.
Mooncake's architecture separates prefill and decode clusters and builds a distributed KV cache store on top of the underutilized CPU DRAM, local SSDs, and NVMe storage already present in typical GPU cluster nodes. Instead of deleting KV caches after a request completes, Mooncake retains them in this tiered storage hierarchy for potential reuse by future requests with overlapping prefixes (prefix caching). When a new request arrives that shares a prefix with a cached sequence, the cached KV tensors can be loaded from storage rather than recomputed, reducing TTFT.
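A hedged sketch of how such prefix reuse can be indexed follows. The block size, chained hashing, and flat cache index are generic prefix-caching conventions used for illustration, not Mooncake's actual data structures:

```python
import hashlib

BLOCK = 16  # tokens per KV block; illustrative, real systems vary

def block_keys(token_ids):
    """Hash prefixes at block granularity. Each key is chained, so it depends
    on every earlier block: a block can only match when the entire prefix in
    front of it also matches."""
    keys, h = [], hashlib.sha256()
    usable = len(token_ids) - len(token_ids) % BLOCK
    for start in range(0, usable, BLOCK):
        h.update(repr(token_ids[start:start + BLOCK]).encode())
        keys.append(h.hexdigest())
    return keys

def cached_prefix_tokens(token_ids, cache_index):
    """Count how many prompt tokens already have KV in the tiered store.
    `cache_index` maps a block key to its storage tier and location."""
    hits = 0
    for key in block_keys(token_ids):
        if key not in cache_index:
            break
        hits += BLOCK
    return hits
```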
The Transfer Engine, open-sourced in November 2024, is the core communication layer. It uses RDMA where available and supports topology-aware path selection to route transfers over the fastest available link (NVLink, InfiniBand, RoCE, or Ethernet). Multi-card bandwidth aggregation allows a single transfer to use multiple NICs simultaneously.
In production at Moonshot AI, Mooncake processes over 100 billion tokens daily across thousands of nodes. Internal benchmarks using real Kimi traffic traces showed 59 to 498 percent improvement in effective request capacity compared to baseline methods. The vLLM project added official support for the Mooncake Transfer Engine in December 2024, making it available as a connector option for vLLM disaggregated prefill deployments.
Mooncake's broader architecture also extends KV cache sharing across requests within a data center and has been described as a step toward "Prefill-as-a-Service", where prefill computation becomes a pooled resource independent of any specific model replica.
Transferring KV cache tensors between prefill and decode nodes is the central engineering challenge of disaggregated serving. The KV cache for a single request can be substantial: a 70B parameter model processing a 4,096-token prompt with a batch of 8 requests produces roughly 40 to 80 GB of KV cache data, depending on precision and model configuration. Transferring this at low latency requires specialized network infrastructure.
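The size is straightforward to estimate from the model configuration. The sketch below uses illustrative 70B-class parameters rather than any specific model's published config, and shows why attention variant and precision move the total so much:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Per-request KV cache size: keys plus values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens

# Full multi-head attention (80 layers, 64 KV heads of dim 128, FP16):
per_request = kv_cache_bytes(tokens=4096, layers=80, kv_heads=64, head_dim=128)
print(f"{per_request / 2**30:.0f} GiB per request, "
      f"{8 * per_request / 2**30:.0f} GiB for a batch of 8")   # 10 GiB / 80 GiB
# Grouped-query attention with 8 KV heads cuts both figures by 8x.
```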
Remote direct memory access (RDMA) allows one machine's network interface card to read or write the memory of another machine without involving the CPU or operating system of either side. This zero-copy, low-latency transfer is standard in high-performance computing clusters. Modern GPU clusters typically use InfiniBand (IB) or RDMA over Converged Ethernet (RoCE) at 200 Gbps or 400 Gbps per link.
For disaggregated LLM serving, RDMA means KV cache tensors can be sent directly from GPU HBM on the prefill node to GPU HBM on the decode node in a single operation. For a 70B model served over 200 Gbps InfiniBand HDR, a 10 GB KV cache batch transfers in approximately 400 milliseconds, which is acceptable for serving scenarios with long inputs. For shorter contexts and faster interconnects (400 Gbps InfiniBand NDR or NVLink), the transfer time drops substantially.
Production disaggregated deployments typically require InfiniBand or RoCE at 200 Gbps or faster between prefill and decode nodes. TCP-based Ethernet at 100 Gbps or below is too slow for most latency-sensitive workloads.
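A simple wire-time estimate illustrates why; the figures below are idealized line-rate lower bounds, not measured results:

```python
def transfer_ms(kv_bytes, link_gbps):
    """Idealized wire time for a KV cache transfer over a single link.
    Real links deliver somewhat less than line rate, so treat as lower bounds."""
    return kv_bytes * 8 / (link_gbps * 1e9) * 1e3

kv = 10e9  # the ~10 GB batch from the HDR example above
for name, gbps in [("100 Gbps Ethernet", 100), ("200 Gbps HDR", 200), ("400 Gbps NDR", 400)]:
    print(f"{name:>18}: {transfer_ms(kv, gbps):.0f} ms")
# 800 ms, 400 ms, 200 ms -- only the faster links fit typical TTFT budgets,
# and multi-NIC aggregation or NVLink shrinks these further.
```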
NVIDIA released NIXL (NVIDIA Inference Xfer Library) as an open-source, vendor-agnostic communication library designed specifically for disaggregated inference workloads. NIXL provides a unified API for point-to-point data movement across heterogeneous memory types (GPU HBM, CPU DRAM, NVMe SSD) and heterogeneous network fabrics (NVLink, InfiniBand, RoCE, Ethernet).
NIXL supports non-blocking asynchronous transfers, dynamic metadata exchange (so nodes can discover each other's memory layout at runtime), and multi-path routing that can aggregate bandwidth across multiple NICs. It integrates with NVIDIA Dynamo, TensorRT-LLM, and vLLM. When RDMA is available, NIXL achieves sub-millisecond transfer initiation latency. When RDMA is unavailable, it falls back to TCP-based transfers with higher latency.
The library was designed to hide the complexity of managing heterogeneous interconnects from inference framework developers, providing a single call interface regardless of whether the underlying transfer uses NVLink, InfiniBand, or a storage system.
Moonshot AI's Transfer Engine (part of Mooncake, open-sourced November 2024) takes a similar approach but was developed independently based on production experience with Kimi's serving infrastructure. It supports RDMA over InfiniBand and RoCE, topology-aware path selection (choosing the lowest-latency path between any two nodes in a multi-rack cluster), and multi-NIC bandwidth aggregation. The vLLM MooncakeConnector uses the Transfer Engine as the KV transfer backend, reporting up to 25 percent lower mean TTFT compared to TCP-based transport.
Both Splitwise and several production implementations pipeline KV cache transfer with prefill computation at the transformer layer granularity. After a prefill worker completes a transformer layer's forward pass and writes the layer's KV vectors to HBM, it immediately initiates an asynchronous transfer of those vectors to the decode worker while the next layer's computation proceeds. By the time all layers complete, most of the KV cache has already arrived on the decode node. This overlapping of compute and communication reduces the end-to-end latency overhead of disaggregation.
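A hedged sketch of this overlap in PyTorch follows. The per-layer blocks and the `send_kv` callable (which would wrap an RDMA write or a `torch.distributed.isend`) are placeholders; the point is the dedicated side CUDA stream, which lets each layer's transfer run concurrently with the next layer's compute:

```python
import torch

def prefill_with_layerwise_transfer(layers, hidden, send_kv):
    """Run prefill layer by layer, overlapping KV transfer with compute.
    `layers` yield (hidden_states, kv_for_layer); `send_kv(i, kv)` starts an
    asynchronous send and returns a handle with .wait(). Both are placeholders."""
    transfer_stream = torch.cuda.Stream()   # side stream dedicated to transfers
    pending = []
    for i, layer in enumerate(layers):
        hidden, kv = layer(hidden)          # compute-bound prefill for layer i
        kv_ready = torch.cuda.Event()
        kv_ready.record()                   # KV for layer i is now in HBM
        with torch.cuda.stream(transfer_stream):
            transfer_stream.wait_event(kv_ready)   # don't send before it is written
            pending.append(send_kv(i, kv))         # overlaps with layer i+1 compute
    torch.cuda.current_stream().wait_stream(transfer_stream)
    for handle in pending:
        handle.wait()                       # all layers' KV has reached the decode node
    return hidden
```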
vLLM added experimental support for disaggregated prefilling in late 2024. The implementation runs two separate vLLM instances, one dedicated to prefill and one to decode, connected by a configurable KV transfer connector. All disaggregated prefilling code lives under vllm/distributed/kv_transfer. Three connectors are available: NixlConnector (supporting fully asynchronous send and receive), P2pNcclConnector, and MooncakeConnector.
The vLLM disaggregated mode allows different parallelism strategies for each instance. For example, a prefill instance might run with tensor parallelism 1 (single-GPU model copies) to maximize throughput per token on short prompts, while decode instances might use higher TP to speed up token generation for longer sequences. Without disaggregated prefilling, vLLM inserts prefill jobs during active decode batches, producing inter-token latency spikes; disaggregated mode eliminates these by ensuring decode instances never run prefill.
In 2025, Meta, LinkedIn, Mistral, and HuggingFace reported running vLLM with disaggregated serving in production. Meta's implementation uses larger KV cache block sizes (128 or 256 tokens per block, versus vLLM's default of 16) to improve transfer efficiency.
SGLang implements P/D disaggregation with three components: a proxy server that handles external traffic, a prefill server, and a decode server. The proxy routes each request to an available prefill server, which computes the KV cache and transfers it to an assigned decode server. The decode server polls for arriving KV tensors while the prefill transfer is in progress, overlapping transfer with setup.
SGLang supports RDMA-based transfer via Mooncake and NIXL backends, enabling zero-copy GPU-to-GPU transfer. The framework runs in production deployments that collectively span more than 400,000 GPUs. In May 2025, LMSYS published results for DeepSeek-R1 served with P/D disaggregation on 96 H100 GPUs, achieving 52,300 input tokens per second and 22,300 output tokens per second per node.
AMD has validated SGLang's P/D disaggregation on Instinct MI300X GPUs using Mooncake for KV transfer, demonstrating that the pattern is not specific to NVIDIA hardware.
NVIDIA announced Dynamo at GTC 2025 as an open-source, datacenter-scale distributed inference framework with disaggregated serving as a core design feature. Dynamo introduces four coordinated components:
The Dynamo Planner monitors TTFT and inter-token latency targets and dynamically allocates GPUs between prefill and decode pools at runtime, adjusting the ratio based on queue lengths and transfer costs. Unlike static pool configurations, the planner can shift GPUs between pools in response to changing workload characteristics.
The Smart Router tracks the distributed KV cache across all nodes using a radix tree structure (similar to RadixAttention) and routes incoming requests to minimize KV cache recomputation by maximizing prefix overlap with existing cached sequences.
The Distributed KV Cache Manager offloads less-frequently accessed KV tensors from GPU HBM to CPU DRAM, local SSDs, or networked object storage, implementing hierarchical caching across the cluster.
NIXL serves as the transfer layer for all inter-node data movement.
When serving DeepSeek-R1 671B on NVIDIA GB200 NVL72 hardware with disaggregated serving, NVIDIA reported up to 30 times more requests served than baseline configurations. For Llama 70B on Hopper GPUs, Dynamo with disaggregation more than doubled throughput.
Dynamo is compatible with vLLM, SGLang, and TensorRT-LLM backends and is available on GitHub under the ai-dynamo organization.
NVIDIA's TensorRT-LLM added disaggregated serving support, calling the two phases context (prefill) and generation (decode) to match its naming conventions. The framework provides three deployment options: OpenAI-compatible REST servers with a separate orchestrator, Triton Inference Server integration using a Python BLS backend as orchestrator, and Dynamo integration.
KV cache transfer in TensorRT-LLM is modular, with the exchange layer decoupled from both the KV cache manager and the underlying communication library. Supported transports include MPI, UCX, and NIXL. The framework supports running context and generation executors on the same physical node or different nodes. Layer-wise asynchronous KV cache transfer, where transfer begins as each transformer layer finishes rather than waiting for the full prefill, is under active development.
DeepSeek was an early adopter of P/D disaggregation as a core part of its inference stack, deploying it in production for DeepSeek-V3 and subsequent models. The reported architecture uses a 3:9 ratio of prefill nodes to decode nodes (each node with eight Hopper-class GPUs), reflecting the relative computational demands of the two phases for DeepSeek's workload mix.
Prefill and decode phases use different parallelism configurations. Prefill runs with smaller expert-parallel (EP) and data-parallel (DP) degrees to handle large batch processing efficiently. Decode uses a much wider expert parallelism (approximately EP=256) and high data parallelism (DP=8 to 16), which maximizes GroupGEMM utilization in DeepSeek's mixture-of-experts architecture. This divergence in parallelism strategy is only possible because the two phases operate independently.
For KV cache transfer, DeepSeek built 3FS, a distributed file system that aggregates the throughput of thousands of SSDs and the bandwidth of hundreds of storage nodes, allowing KV cache data to move through storage in a locality-agnostic way.
Disaggregated serving shifts the hardware planning problem from "buy GPUs with a balance of compute and memory bandwidth" to "buy compute-optimized GPUs for prefill and memory-bandwidth-optimized GPUs for decode."
Prefill pools benefit from high FLOPS per GPU, which means compute-dense chips like NVIDIA H100 SXM or H200 SXM. Batch size flexibility matters: larger batches amortize the cost of prefill over more requests simultaneously. Model parallelism configuration for prefill typically uses lower TP degree so that each GPU runs a full model copy processing a large batch, rather than splitting a smaller batch across many GPUs.
Decode pools benefit from GPUs with large HBM capacity and high HBM bandwidth. Models larger than single-GPU memory require tensor parallelism, but the primary bottleneck is memory bandwidth rather than compute. Chips with a favorable bandwidth-to-FLOPS ratio, including some designs optimized for inference rather than training, are well-suited for decode pools. Splitwise's observation that older-generation high-bandwidth GPUs can serve decode efficiently at lower cost than the latest compute-optimized chips has attracted interest from cloud providers evaluating TCO.
Network infrastructure between prefill and decode nodes is a hard requirement. Without InfiniBand or high-speed RoCE (200 Gbps or faster), the KV cache transfer time dominates the latency budget and negates the benefits of disaggregation. For a 70B model processing 4,096-token prompts in batches of 8, KV cache sizes run to tens of gigabytes; transferring this over 100 Gbps Ethernet takes hundreds of milliseconds, which is longer than most TTFT SLOs. Production deployments use 200 Gbps InfiniBand HDR at a minimum, with 400 Gbps InfiniBand NDR preferred for large clusters.
Within-node NVLink can make single-node pseudo-disaggregation viable for medium-sized models: some implementations run prefill and decode on separate GPUs connected via NVLink within the same DGX node, achieving the scheduling benefits of disaggregation without requiring inter-node RDMA. This works for smaller models or when the cluster has very-high-bandwidth NVLink fabrics like the NVL72 configuration.
Published results from the major disaggregated serving systems:
| System | Source | Metric | Result vs. baseline |
|---|---|---|---|
| DistServe | OSDI 2024 | Max goodput (chatbot) | 2.0x to 3.41x |
| DistServe | OSDI 2024 | Max goodput (summarization) | 4.48x |
| DistServe | OSDI 2024 | Request throughput at SLO | up to 7.4x |
| Splitwise | ISCA 2024 | Throughput | 1.4x with 20% lower cost |
| Mooncake | Moonshot AI 2024 | Effective request capacity | 59% to 498% |
| Mooncake + vLLM | vLLM 2024 | Mean TTFT | 25% lower vs. TCP |
| NVIDIA Dynamo | GTC 2025 | Requests served (DeepSeek-R1) | up to 30x |
| NVIDIA Dynamo | GTC 2025 | Throughput (Llama 70B, Hopper) | 2x+ |
| SGLang (LMSYS) | May 2025 | DeepSeek-R1 on 96 H100s | 52,300 input / 22,300 output tok/s per node (absolute) |
Note that improvements are highly workload-dependent. Long-prompt, high-concurrency workloads with strict TTFT and ITL SLOs see the largest gains. Short prompts with high prefix cache hit rates and loose latency requirements may see little benefit or even performance regression.
Disaggregated serving delivers clear benefits when:
The workload has long or highly variable prompt lengths (2,000 tokens or more). Long prompts produce large KV caches that are expensive to transfer but whose compute cost is also high, so the prefill phase dominates latency and benefits most from dedicated compute.
The application has strict, differentiated SLOs for TTFT and ITL. Disaggregation allows each metric to be optimized independently, which is impossible in colocated systems.
The cluster already has InfiniBand or high-speed RoCE interconnects. High-performance computing clusters and major cloud providers' GPU instances typically meet this requirement.
The deployment is at sufficient scale (roughly 16 to 32 GPUs or more). The routing and transfer overhead has fixed costs that only amortize at larger deployment sizes.
Several scenarios favor colocated or chunked-prefill approaches instead:
Small deployments on a single node or a few nodes cannot take advantage of separate hardware pools. The transfer overhead and coordination complexity reduce performance rather than improving it.
Short prompts with high prefix cache hit rates. If a large fraction of requests can be served from cached KV tensors (see PagedAttention), the prefill phase is already short or absent, and disaggregation adds overhead without meaningful benefit.
Clusters without InfiniBand or high-speed RoCE. TCP-based KV transfer at 10 to 100 Gbps is too slow for most production workloads with long contexts.
Loose latency requirements where chunked prefill provides sufficient ITL control. Chunked prefill, which splits long prompts into smaller chunks and interleaves them with decode steps, can reduce ITL spikes in colocated systems and is simpler to operate than full disaggregation.
Under-tuned deployments can actually perform worse. If the prefill-to-decode worker ratio is misconfigured for the workload, one pool becomes a bottleneck while the other sits idle, and throughput can drop 20 to 30 percent versus a well-tuned colocated system.
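A rough starting point for that ratio is to balance average per-request work across the two pools. The sketch below uses placeholder per-GPU token rates; real values would have to be measured for the actual model, hardware, and traffic mix:

```python
def prefill_gpu_fraction(avg_prompt_tokens, avg_output_tokens,
                         prefill_tok_per_gpu_s, decode_tok_per_gpu_s):
    """Fraction of GPUs to devote to prefill so neither pool bottlenecks,
    by balancing average per-request time spent in each phase."""
    prefill_time = avg_prompt_tokens / prefill_tok_per_gpu_s
    decode_time = avg_output_tokens / decode_tok_per_gpu_s
    return prefill_time / (prefill_time + decode_time)

# 4,096-token prompts, 512-token outputs, hypothetical per-GPU throughputs:
frac = prefill_gpu_fraction(4096, 512, prefill_tok_per_gpu_s=20_000, decode_tok_per_gpu_s=1_500)
print(f"~{frac:.0%} of GPUs on prefill")   # ~38% with these placeholder numbers
```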
Disaggregated serving introduces new operational challenges compared to colocated deployment.
Two independent worker pools must be provisioned, monitored, and scaled separately. Failures in the prefill pool or decode pool have different cascading effects, requiring different alerting and recovery procedures.
The router must handle KV cache transfer coordination, including retries, timeouts, and load balancing across both pools simultaneously. If a prefill worker completes but the assigned decode worker is unavailable, the KV cache must be held temporarily or the request must be restarted.
Debugging latency issues requires reasoning about which pool is the bottleneck and whether transfer time is contributing. Standard single-node profiling tools provide incomplete visibility.
Dynamic pool sizing, as implemented in NVIDIA Dynamo's Planner, is one approach to managing these trade-offs: rather than statically partitioning GPUs into fixed pools, the planner adjusts allocation at runtime based on observed queue depths and latency metrics. This reduces the configuration burden but adds another layer of system behavior to understand and debug.
Chunked prefill is the main alternative to disaggregation for controlling ITL in colocated systems. It splits a long prompt into smaller chunks and processes one chunk per decode iteration, keeping each iteration's compute cost roughly constant and preventing single long prefill operations from blocking all decode steps.
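A minimal sketch of the scheduling idea, generic rather than any engine's actual scheduler:

```python
def chunked_prefill_iterations(prompt_tokens, chunk_size, active_decode_seqs):
    """Build per-iteration work items: each iteration advances every active
    decode sequence by one token and processes one bounded chunk of the new
    prompt, so no single iteration's compute cost balloons."""
    iterations = []
    for start in range(0, len(prompt_tokens), chunk_size):
        iterations.append({
            "prefill_chunk": prompt_tokens[start:start + chunk_size],
            "decode_steps": list(active_decode_seqs),  # one token each
        })
    return iterations

# An 8,192-token prompt with 512-token chunks spreads its prefill over 16
# iterations instead of stalling all decodes for one long iteration.
print(len(chunked_prefill_iterations(list(range(8192)), 512, ["seq0", "seq1"])))
```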
Research published in 2025 (TaiChi, DuetServe) found that the two approaches are complementary rather than competing: PD aggregation (colocated with chunked prefill) performs better when TTFT constraints are tight and TPOT is relaxed, while disaggregation excels when TPOT must be strictly controlled and TTFT can be somewhat relaxed. Systems that need both metrics simultaneously satisfied tend to favor disaggregation. Some proposals combine the two, using chunked prefill within prefill workers to improve batching efficiency while still separating prefill and decode pools.