Chunked prefill
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 2,008 words
Chunked prefill is a scheduling technique for large language model serving that splits the processing of a long input prompt (the prefill) into smaller, fixed-size token chunks, and combines each chunk with the single-token decode steps of other in-flight requests inside one batched forward pass. Combining a compute-heavy prefill chunk with many lightweight decodes in the same batch is often called "piggybacking" decodes onto the prefill. The goal is to keep the GPU busy during the otherwise underutilized decode phase while preventing long prefills from stalling the token stream of requests that are already generating, which smooths inter-token latency [1][2].
The method was introduced in 2023 in the paper "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" by Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee, a group centered at Microsoft Research India (Agrawal contributed as an intern and is a PhD student at Georgia Tech) [1]. A follow-up serving system, Sarathi-Serve, was published at the USENIX Symposium on Operating Systems Design and Implementation (OSDI) in 2024 and generalized the idea into "stall-free" scheduling [2]. Essentially the same technique was developed independently and concurrently by Microsoft's DeepSpeed team under the name "Dynamic SplitFuse" [5]. Chunked prefill is now a default or standard option in major serving stacks, including vLLM and NVIDIA TensorRT-LLM [3][4].
LLM inference with an autoregressive transformer proceeds in two distinct phases with very different hardware profiles. In the prefill phase, the model processes every token of the input prompt in a single parallel forward pass, building the KV cache for the whole prompt. Because all prompt tokens are computed at once, prefill performs large matrix-matrix multiplications and is compute-bound: even a single request with a moderately long prompt can saturate the floating-point units of a modern GPU [1].
In the decode phase, the model generates output one token at a time. Each decode step is a forward pass over a single new token per request, so the matrix multiplications collapse into much smaller matrix-vector operations, while the model weights and the growing KV cache must still be read from high-bandwidth memory on every step. Decode is therefore memory-bandwidth-bound and has low arithmetic intensity: the GPU spends most of its time moving data rather than computing, leaving a large fraction of its compute idle [1][2].
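The gap between the two phases can be made concrete with a rough roofline-style estimate. The sketch below counts FLOPs and bytes moved for a single linear layer; the 4096 x 4096 layer shape and 16-bit precision are illustrative assumptions, and the model ignores attention and KV-cache traffic entirely.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for one linear layer applied to num_tokens tokens."""
    # (num_tokens x d_in) @ (d_in x d_out): one multiply and one add per term.
    flops = 2 * num_tokens * d_in * d_out
    # Weights are read once per pass; input and output activations scale with tokens.
    elems_moved = d_in * d_out + num_tokens * d_in + num_tokens * d_out
    return flops / (bytes_per_elem * elems_moved)


# Hypothetical 4096 x 4096 projection in 16-bit precision.
decode_ai = arithmetic_intensity(num_tokens=1, d_in=4096, d_out=4096)
prefill_ai = arithmetic_intensity(num_tokens=512, d_in=4096, d_out=4096)
print(f"decode:  {decode_ai:.2f} FLOPs/byte")
print(f"prefill: {prefill_ai:.2f} FLOPs/byte")
```

Under these assumptions a single-token pass performs about 1 FLOP per byte moved, while a 512-token pass performs about 410, which is why the same weights can be reused for many more tokens before compute becomes the limit.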
This mismatch creates a scheduling problem. Continuous batching, introduced by the Orca serving system, schedules at the granularity of individual iterations rather than whole requests, so completed requests can leave a batch and new ones can join on every step, which already improves throughput over static batching [8]. But continuous batching still has to decide what to do when a new request arrives and needs its full prefill. Two naive policies both fail in different ways. A decode-prioritizing schedule delays incoming prefills until current requests finish, inflating time-to-first-token (TTFT) for new arrivals. A prefill-prioritizing schedule, used by early versions of vLLM, pauses all ongoing decodes to run a new prompt's prefill to completion; because a long prefill can take far longer than a decode step, every active request suffers a "generation stall" and a large spike in time-between-tokens (TBT). Mixing a full prefill and decodes in one batch ("hybrid batching") has the same effect, since the long prefill dominates the iteration's latency [2]. Naive batching thus forces a tradeoff between throughput and latency.
Chunked prefill resolves this tradeoff with two ideas from the SARATHI paper: chunked prefills and decode-maximal batching [1].
First, instead of processing a long prompt in one shot, the scheduler divides the prefill into equal-sized chunks of a fixed number of tokens and processes one chunk per iteration over several successive forward passes. Each chunk attends to its own tokens plus the KV cache accumulated by all earlier chunks, so the final result is identical to a single full prefill; only the work is spread across iterations.
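The equivalence between chunked and single-shot prefill can be checked on a toy example. The following is a minimal sketch of single-head causal attention in plain Python, not the SARATHI implementation: the function names are hypothetical, and queries, keys, and values are supplied directly rather than produced by a model.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention output for one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, key)) for key in keys]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    dim = len(values[0])
    return [sum(e / norm * v[i] for e, v in zip(exps, values)) for i in range(dim)]


def chunked_prefill(queries, keys, values, chunk_size):
    """Process the prompt chunk by chunk, growing a KV cache as it goes."""
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(queries), chunk_size):
        end = min(start + chunk_size, len(queries))
        # Append this chunk's keys and values to the cache left by earlier chunks.
        k_cache.extend(keys[start:end])
        v_cache.extend(values[start:end])
        for pos in range(start, end):
            # Causal mask: the token at `pos` sees cache entries 0..pos only.
            outputs.append(attend(queries[pos], k_cache[: pos + 1], v_cache[: pos + 1]))
    return outputs


rng = random.Random(0)
rows = lambda n, dim: [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
q, k, v = rows(10, 4), rows(10, 4), rows(10, 4)
full = chunked_prefill(q, k, v, chunk_size=10)  # one chunk covers the whole prompt
split = chunked_prefill(q, k, v, chunk_size=3)  # four chunks of 3, 3, 3, 1 tokens
same = all(math.isclose(a, b, abs_tol=1e-12) for x, y in zip(full, split) for a, b in zip(x, y))
print("outputs match:", same)
```

Because each token only ever attends to positions at or before its own, it makes no difference whether later tokens are processed in the same pass or a later one, as long as the cache from earlier chunks is available.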
Second, decode-maximal batching constructs each batch from one prefill chunk plus as many ready decode tokens as possible in the remaining slots. The key insight is that a decode-only batch is heavily memory-bound and leaves compute idle, so there is "arithmetic intensity slack" to absorb extra prefill work almost for free: adding a prefill chunk to a decode batch fills the idle compute, and the piggybacked decodes cost up to an order of magnitude less than running them in a separate decode-only iteration [1]. Because the prefill chunk is bounded in size, no single iteration is dominated by a long prompt, so the decodes riding along do not experience a large stall.
A concrete way to see the chunking effect: prefill throughput rises with the number of tokens processed per pass only until the operation becomes compute-bound, which on an NVIDIA A100 happens at roughly 500 to 1,000 tokens. Beyond that point, adding more prefill tokens to one pass yields little extra efficiency but does increase that pass's latency. Choosing a chunk size near this saturation point keeps each iteration short while retaining most of prefill's compute efficiency [1][2].
Sarathi-Serve, presented at OSDI 2024 by Amey Agrawal and Alexey Tumanov (Georgia Tech) together with Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee (Microsoft Research India), turned chunked prefill into a complete scheduler built on two mechanisms: stall-free batching and a token budget [2].
Stall-free batching admits new requests without ever pausing running decodes. On each iteration the scheduler first packs all in-flight decodes into the batch, then adds any partially completed prefill, and only then admits new requests, fitting their prefill chunks into whatever capacity remains. Because decodes are never evicted to make room for a prefill, ongoing requests never suffer a generation stall [2].
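The admission order described above can be sketched as a single scheduling function. This is a simplified illustration and not Sarathi-Serve's actual code: the `Request` class and `schedule_iteration` are hypothetical names, the sketch assumes the number of running decodes never exceeds the token budget, and it omits request completion and KV-cache memory checks.

```python
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens processed so far

    @property
    def in_decode(self):
        return self.prefilled >= self.prompt_len


def schedule_iteration(running, waiting, token_budget):
    """Return [(rid, num_tokens)] for one forward pass; updates the request lists."""
    batch, used = [], 0
    # 1. Every in-flight decode is packed first, one token each.
    for req in running:
        if req.in_decode:
            batch.append((req.rid, 1))
            used += 1
    # 2. Partially completed prefills continue with whatever budget remains.
    for req in running:
        if not req.in_decode and used < token_budget:
            chunk = min(req.prompt_len - req.prefilled, token_budget - used)
            req.prefilled += chunk
            batch.append((req.rid, chunk))
            used += chunk
    # 3. Only then are new requests admitted, chunked to fit the leftover budget.
    while waiting and used < token_budget:
        req = waiting.pop(0)
        chunk = min(req.prompt_len, token_budget - used)
        req.prefilled = chunk
        running.append(req)
        batch.append((req.rid, chunk))
        used += chunk
    return batch


running = [Request("a", 10, 10), Request("b", 10, 10), Request("c", 1000, 256)]
waiting = [Request("d", 300)]
for _ in range(3):
    print(schedule_iteration(running, waiting, token_budget=512))
```

In this run the two decoding requests appear in every batch, so neither ever skips an iteration, while the long prompt of request "c" and the newly arrived request "d" are absorbed in pieces by the leftover budget.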
The capacity that bounds each iteration is the token budget, denoted τ: the maximum number of tokens (prefill plus decode) processed per forward pass. The budget is derived from a desired TBT service-level objective (SLO): a smaller τ yields shorter, more uniform iterations and tighter tail latency, while a larger τ allows bigger prefill chunks and higher throughput. Sarathi-Serve frames serving as meeting a TBT SLO, distinguishing strict targets (on the order of five times a single decode iteration's runtime, for interactive chat) from relaxed targets (around twenty-five times, for batch workloads), and sizes τ accordingly [2]. Uniform iteration sizes also shrink "pipeline bubbles," the idle gaps that appear under pipeline parallelism when consecutive micro-batches have unequal runtimes, which is why the largest end-to-end gains appear in multi-GPU pipeline-parallel deployments [1][2].
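One way to picture how a token budget follows from a TBT target is to pick the largest budget whose predicted iteration time still meets the SLO. The sketch below is only an illustration: `pick_token_budget` is a hypothetical helper, and the linear cost model with its constants is invented for the example, whereas a real deployment would use profiled iteration times for its own model and hardware.

```python
def pick_token_budget(slo_ms, iteration_ms, candidates):
    """Largest candidate budget whose predicted iteration time meets the SLO."""
    feasible = [tokens for tokens in candidates if iteration_ms(tokens) <= slo_ms]
    if not feasible:
        raise ValueError("no candidate budget satisfies the SLO")
    return max(feasible)


def toy_iteration_ms(num_tokens):
    # Made-up cost model: fixed overhead plus a per-token cost, in milliseconds.
    return 20.0 + 0.1 * num_tokens


candidates = [128, 256, 512, 1024, 2048, 4096]
decode_ms = toy_iteration_ms(32)  # a decode-only iteration with 32 requests
strict = pick_token_budget(5 * decode_ms, toy_iteration_ms, candidates)
relaxed = pick_token_budget(25 * decode_ms, toy_iteration_ms, candidates)
print("strict budget:", strict)
print("relaxed budget:", relaxed)
```

With these invented numbers the strict target admits a budget of 512 tokens and the relaxed target 4096, which mirrors the qualitative tradeoff in the text: a tighter TBT target forces smaller iterations.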
The two papers report the following representative results.
| System | Model and hardware | Reported improvement |
|---|---|---|
| SARATHI [1] | LLaMA-13B, A6000 GPU | Up to 10x decode throughput; 1.33x end-to-end throughput |
| SARATHI [1] | LLaMA-33B, A100 GPU | Up to 4.25x decode throughput; 1.25x end-to-end throughput |
| SARATHI [1] | GPT-3 (175B), pipeline parallel | 6.29x reduction in pipeline bubbles; 1.91x end-to-end throughput |
| Sarathi-Serve [2] | Mistral-7B, one A100 | Up to 2.6x higher serving capacity vs vLLM under SLO |
| Sarathi-Serve [2] | Yi-34B, two A100s | Up to 3.7x higher serving capacity vs vLLM under SLO |
| Sarathi-Serve [2] | Falcon-180B, pipeline parallel | Up to 5.6x higher serving capacity vs vLLM under SLO |
Chunked prefill is now standard in production serving frameworks. vLLM added it in version 0.4.0 (2024) as an opt-in flag, enable_chunked_prefill, with a default per-iteration budget (max_num_batched_tokens) of 512 tokens; its scheduler prioritizes decodes and chunks any prefill that does not fit in the remaining budget. In the rewritten vLLM V1 engine, chunked prefill is enabled by default whenever applicable and cannot be turned off, reflecting how central the technique has become [3]. NVIDIA TensorRT-LLM exposes the same idea as "Chunked Context" (also called chunked prefill), and uses dynamic chunk sizing both to raise GPU utilization by batching more decode tokens alongside prefill and to decouple activation-memory use from prompt length, which removes hard limits on input length and simplifies engine builds [4]. The concurrent DeepSpeed-FastGen system ships the same mechanism as Dynamic SplitFuse, decomposing long prompts into uniform chunks and composing them with other requests' generation steps to run at a consistent forward-pass size [5].
The technique is not free. The main cost is that the KV cache for earlier chunks must be re-read from memory when later chunks of the same prompt compute attention: if a prompt is split into N chunks, the first chunk's keys and values are read roughly N-1 additional times, so very small chunks add measurable attention overhead and slow prefill (raising TTFT) relative to a single full prefill. In practice the SARATHI authors found this attention overhead modest because attention is a small fraction of total prefill compute, but they also identified a sharper pitfall: because of tile-quantization effects on the GPU, chunk sizes should align to hardware tile boundaries. Crossing a tile boundary, for example using 257 tokens instead of 256, can inflate a prefill's runtime by about 32 percent [1]. The chunk size (or token budget) is therefore a tuning knob that trades TTFT against TBT and throughput, and serving systems expose it as such.
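Both costs can be illustrated with simple counting. The sketch below is a back-of-the-envelope model, not a measurement: the function names are hypothetical, the tile width of 128 is an assumed value, and the padding figure shows only how much extra work a rounded-up dimension implies, not the 32 percent runtime figure reported in the paper.

```python
def kv_tokens_read(prompt_len, chunk_size):
    """Total cached key/value positions read by attention across all chunks."""
    total = 0
    for start in range(0, prompt_len, chunk_size):
        end = min(start + chunk_size, prompt_len)
        total += end  # this chunk attends to cache positions [0, end)
    return total


def padded_tokens(num_tokens, tile=128):
    """Round a token count up to the next multiple of the (assumed) tile width."""
    return (num_tokens + tile - 1) // tile * tile


for chunk_size in (4096, 512, 256):
    print(chunk_size, kv_tokens_read(4096, chunk_size))
print(padded_tokens(256), padded_tokens(257))  # 256 384
```

For a 4,096-token prompt, one full prefill reads 4,096 cached positions, eight 512-token chunks read 18,432, and sixteen 256-token chunks read 34,816, so the re-read traffic grows roughly in proportion to the number of chunks. The second function shows why one extra token past a tile boundary is disproportionately expensive: the hardware does work for the whole next tile.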
Chunked prefill is one of two dominant strategies for handling prefill-decode interference, and it is useful to contrast it with the other. Prefill-decode disaggregation, embodied by systems such as Splitwise, DistServe, and TetriInfer, runs the two phases on physically separate GPU pools (or "replicas") so they never share a batch [6][7]. This eliminates interference outright and lets each pool be provisioned and parallelized for its own bottleneck, but it requires transferring each request's KV cache over the interconnect from the prefill GPUs to the decode GPUs when prefill finishes, and it can underutilize the memory of the prefill pool [2][6][7].
Chunked prefill takes the opposite, co-located approach: it keeps both phases on the same GPUs and resolves interference in software through stall-free scheduling, avoiding any KV-cache migration. The two ideas address the same underlying problem from different directions, and they are not mutually exclusive. Disaggregated systems still apply chunked prefill within their prefill pool, and a growing body of work tries to unify the two, switching between aggregation and disaggregation based on load to keep both TTFT and TBT within target. Chunked prefill is thus best understood as a foundational building block of modern LLM serving rather than a single product feature: it is the mechanism that makes mixed prefill-and-decode batches latency-friendly, on top of which continuous batching, PagedAttention memory management, and disaggregation are layered [2][3][6].