Chunked prefill

Updated Jun 28, 2026

Chunked prefill is a scheduling technique for large language model serving that splits the processing of a long input prompt (the prefill) into smaller, fixed-size token chunks and combines each chunk with the single-token decode steps of other in-flight requests inside one batched forward pass. By piggybacking lightweight decodes onto a compute-heavy prefill chunk, it keeps the GPU busy during the otherwise underutilized decode phase while preventing long prompts from stalling the token stream of requests that are already generating. Introduced in the 2023 SARATHI paper and generalized by Sarathi-Serve at OSDI 2024, the method is now standard in major serving stacks including vLLM and NVIDIA TensorRT-LLM ^[1]^[2]^[3]^[4].

What is chunked prefill?

Chunked prefill breaks the prefill of a long prompt into fixed-size token chunks and runs each chunk in the same batched forward pass as the single-token decode steps of other in-flight requests. Combining a compute-heavy prefill chunk with many lightweight decodes in one batch is often called "piggybacking" decodes onto the prefill. The GPU stays busy during the otherwise underutilized decode phase, and long prefills no longer stall the token stream of requests that are already generating, which smooths inter-token latency ^[1]^[2].

The method was introduced in 2023 in the paper "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills" by Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee, a group centered at Microsoft Research India (Agrawal contributed as an intern and is a PhD student at Georgia Tech) ^[1]. A follow-up serving system, Sarathi-Serve, was published at the USENIX Symposium on Operating Systems Design and Implementation (OSDI) in 2024 and generalized the idea into "stall-free" scheduling ^[2]. Essentially the same technique was developed independently and concurrently by Microsoft's DeepSpeed team under the name "Dynamic SplitFuse" ^[5]. Chunked prefill is now a default or standard option in major serving stacks, including vLLM and NVIDIA TensorRT-LLM ^[3]^[4].

What is the difference between prefill and decode?

LLM inference with an autoregressive transformer proceeds in two distinct phases with very different hardware profiles. The SARATHI authors frame the motivation directly: "While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request." ^[1]

In the prefill phase, the model processes every token of the input prompt in a single parallel forward pass, building the KV cache for the whole prompt. Because all prompt tokens are computed at once, prefill performs large matrix-matrix multiplications and is compute-bound: even a single request with a moderately long prompt can saturate the floating-point units of a modern GPU ^[1].

In the decode phase, the model generates output one token at a time. Each decode step is a forward pass over a single new token per request, so the matrix multiplications collapse into much smaller matrix-vector operations, while the model weights and the growing KV cache must still be read from high-bandwidth memory on every step. Decode is therefore memory-bandwidth-bound and has low arithmetic intensity: the GPU spends most of its time moving data rather than computing, leaving a large fraction of its compute idle ^[1]^[2].
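
The contrast can be made concrete with a rough arithmetic-intensity estimate for a single linear layer. The sketch below is a back-of-the-envelope model, not a measurement: it assumes 16-bit weights and activations, counts one weight matrix only, and ignores attention and on-chip caching. The function name and the 5,120-wide layer are illustrative choices, not values from the sources.

```python
def linear_layer_intensity(num_tokens: int, d_in: int, d_out: int,
                           bytes_per_value: int = 2) -> float:
    """Approximate FLOPs per byte moved for y = x @ W, with x of shape (num_tokens, d_in)."""
    # One multiply and one add per weight, per token.
    flops = 2 * num_tokens * d_in * d_out
    # The weight matrix is read once per forward pass, however many tokens share it.
    weight_bytes = d_in * d_out * bytes_per_value
    # Inputs are read and outputs are written once per token.
    activation_bytes = num_tokens * (d_in + d_out) * bytes_per_value
    return flops / (weight_bytes + activation_bytes)


if __name__ == "__main__":
    for n in (1, 16, 256, 1024):
        print(f"{n:>5} tokens: {linear_layer_intensity(n, 5120, 5120):8.1f} FLOPs/byte")
```

With one token in the pass, the layer performs about one FLOP per byte read, which is the memory-bound regime of decode; with several hundred tokens the same weights are reused for hundreds of FLOPs per byte, which is the compute-bound regime of prefill.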

Why is naive scheduling a problem?

The prefill-versus-decode mismatch creates a scheduling problem. Continuous batching, introduced by the Orca serving system, schedules at the granularity of individual iterations rather than whole requests, so completed requests can leave a batch and new ones can join on every step, which already improves throughput over static batching ^[8]. But continuous batching still has to decide what to do when a new request arrives and needs its full prefill. Two naive policies both fail in different ways. A decode-prioritizing schedule delays incoming prefills until current requests finish, inflating time-to-first-token (TTFT) for new arrivals. A prefill-prioritizing schedule, used by early versions of vLLM, pauses all ongoing decodes to run a new prompt's prefill to completion; because a long prefill can take far longer than a decode step, every active request suffers a "generation stall" and a large spike in time-between-tokens (TBT). Mixing a full prefill and decodes in one batch ("hybrid batching") has the same effect, since the long prefill dominates the iteration's latency ^[2]. Naive batching thus forces a tradeoff between throughput and latency.
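
A toy latency model illustrates the size of the stall. Everything numeric here is hypothetical and chosen only for illustration (a fixed 20 ms cost per iteration plus 0.1 ms per prefill token); the function names are likewise made up for this sketch. It compares the longest wait a running decode can see when a prompt is prefilled in one pass with the wait when the prompt is capped to a fixed chunk per pass.

```python
def iteration_ms(prefill_tokens: int, overhead_ms: float = 20.0,
                 ms_per_prefill_token: float = 0.1) -> float:
    """Hypothetical linear cost model for one forward pass."""
    return overhead_ms + ms_per_prefill_token * prefill_tokens


def worst_tbt_full_prefill(prompt_tokens: int) -> float:
    """Running decodes wait while the whole prompt is prefilled in one pass."""
    return iteration_ms(prompt_tokens)


def worst_tbt_chunked(prompt_tokens: int, chunk_tokens: int) -> float:
    """Running decodes wait for at most one chunk-sized pass."""
    return iteration_ms(min(prompt_tokens, chunk_tokens))


if __name__ == "__main__":
    print(worst_tbt_full_prefill(8192))   # one long pass
    print(worst_tbt_chunked(8192, 512))   # one chunk-sized pass
```

Under these assumed costs, an 8,192-token prompt holds up running decodes for roughly 840 ms when prefilled in one pass, against about 71 ms per iteration when the prompt is capped at 512 tokens per pass.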

How does chunked prefill work?

Chunked prefill resolves this tradeoff with two ideas from the SARATHI paper: chunked prefills and decode-maximal batching ^[1].

First, instead of processing a long prompt in one shot, the scheduler divides the prefill into equal-sized chunks of a fixed number of tokens and processes one chunk per iteration over several successive forward passes. Each chunk attends to its own tokens plus the KV cache accumulated by all earlier chunks, so the final result is identical to a single full prefill; only the work is spread across iterations.
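
The equivalence claim can be checked numerically on a small example. The sketch below is a simplified, single-head, pure-Python model: it assumes the query, key, and value vectors are given directly (a real model derives them from hidden states in every layer), and all function names are illustrative. Each chunk appends its keys and values to a cache and then attends causally over the cache.

```python
import math
import random


def attend(queries, keys, values, first_query_pos):
    """Causal attention: the query at absolute position p sees keys 0..p."""
    outputs = []
    dim = len(values[0])
    for i, q in enumerate(queries):
        visible = first_query_pos + i + 1
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
                  for k in keys[:visible]]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values[:visible])) / total
                        for d in range(dim)])
    return outputs


def full_prefill(q, k, v):
    """Process the whole prompt in one pass."""
    return attend(q, k, v, first_query_pos=0)


def chunked_prefill(q, k, v, chunk_size):
    """Process the prompt chunk by chunk, reusing the KV cache of earlier chunks."""
    k_cache, v_cache, outputs = [], [], []
    for start in range(0, len(q), chunk_size):
        end = start + chunk_size
        k_cache.extend(k[start:end])  # this chunk's keys join the cache
        v_cache.extend(v[start:end])
        outputs.extend(attend(q[start:end], k_cache, v_cache, first_query_pos=start))
    return outputs


def random_vectors(n, dim, rng):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
```

Calling `full_prefill` and `chunked_prefill` on the same inputs yields matching outputs for any chunk size, including ones that do not divide the prompt length evenly.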

Second, decode-maximal batching constructs each batch from one prefill chunk plus as many ready decode tokens as possible in the remaining slots. The key insight is that a decode-only batch is heavily memory-bound and leaves compute idle, so there is "arithmetic intensity slack" to absorb extra prefill work almost for free: adding a prefill chunk to a decode batch fills the idle compute, and the piggybacked decodes cost up to an order of magnitude less than running them in a separate decode-only iteration ^[1]. Because the prefill chunk is bounded in size, no single iteration is dominated by a long prompt, so the decodes riding along do not experience a large stall.
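
The batch composition described above can be sketched as follows. This is an illustrative model rather than the SARATHI implementation: the `Batch` type, the function names, and the notion of a fixed `max_batch_size` in request slots are assumptions made for the example, and the same decodes are assumed to stay ready for every iteration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Batch:
    prefill_request: str
    prefill_range: Tuple[int, int]  # [start, end) prompt positions covered by the chunk
    decode_requests: List[str]


def prefill_chunks(prompt_len: int, chunk_size: int) -> List[Tuple[int, int]]:
    """Split a prompt into equal-sized chunks; the last one may be shorter."""
    return [(start, min(start + chunk_size, prompt_len))
            for start in range(0, prompt_len, chunk_size)]


def decode_maximal_batches(prefill_request: str, prompt_len: int, chunk_size: int,
                           ready_decodes: List[str], max_batch_size: int) -> List[Batch]:
    """One prefill chunk per iteration, with decodes filling the remaining slots."""
    batches = []
    for chunk in prefill_chunks(prompt_len, chunk_size):
        # One slot is taken by the prefill chunk; decodes ride along in the rest.
        riders = ready_decodes[: max_batch_size - 1]
        batches.append(Batch(prefill_request, chunk, list(riders)))
    return batches
```

For a 1,000-token prompt with 256-token chunks, this produces four iterations, the last covering positions 768 to 1,000, and each carries as many decodes as the batch has free slots.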

How is the chunk size chosen?

The chunk size is bounded from above by the point at which prefill saturates the GPU. Prefill throughput rises with the number of tokens processed per pass only until the operation becomes compute-bound, which on an NVIDIA A100 happens at roughly 500 to 1,000 tokens. Beyond that point, adding more prefill tokens to a pass yields little extra efficiency but lengthens the pass. Choosing a chunk size near this saturation point keeps each iteration short while retaining most of prefill's compute efficiency ^[1]^[2].

Chunk size must also respect GPU tile boundaries. GPUs compute matrix multiplications by partitioning the matrices into fixed-size tiles, so a token count that is not a multiple of the tile dimension forces some thread blocks to do wasted work, a phenomenon SARATHI calls tile-quantization. The effect is sharp: "using a chunk size of 257 can increase prefill time by 32% compared to that with chunk size 256 due to tile-quantization effects." ^[1] Practical chunk sizes are therefore kept on tile-aligned values such as 256, 512, or 1,024.
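
The padding arithmetic behind tile quantization is simple to reproduce. The sketch below assumes a tile dimension of 128 tokens purely for illustration; real tile shapes depend on the kernel and the GPU, and the function names are hypothetical. It models only the padding, not the measured slowdown.

```python
import math


def padded_tokens(num_tokens: int, tile: int = 128) -> int:
    """Token count rounded up to a whole number of tiles."""
    return math.ceil(num_tokens / tile) * tile


def wasted_fraction(num_tokens: int, tile: int = 128) -> float:
    """Share of the padded work that is spent on padding rather than real tokens."""
    padded = padded_tokens(num_tokens, tile)
    return (padded - num_tokens) / padded


if __name__ == "__main__":
    for n in (256, 257, 512):
        print(n, padded_tokens(n), round(wasted_fraction(n), 3))
```

Under the assumed 128-token tile, 257 tokens occupy three tiles (384 token slots) where 256 occupy two, so about a third of the padded pass is wasted. The 32% figure quoted above is the paper's measurement and is not derived from this model.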

What is Sarathi-Serve?

Sarathi-Serve, presented at OSDI 2024 by Amey Agrawal and Alexey Tumanov (Georgia Tech) together with Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee (Microsoft Research India), turned chunked prefill into a complete scheduler built on two mechanisms: stall-free batching and a token budget ^[2].

Stall-free batching admits new requests without ever pausing running decodes. On each iteration the scheduler first packs all in-flight decodes into the batch, then adds any partially completed prefill, and only then admits new requests, fitting their prefill chunks into whatever capacity remains. Because decodes are never evicted to make room for a prefill, ongoing requests never suffer a generation stall ^[2].

The capacity that bounds each iteration is the token budget, written as the Greek letter tau, the maximum number of tokens (prefill plus decode) processed per forward pass. The budget is derived from a desired TBT service-level objective (SLO): a smaller tau yields shorter, more uniform iterations and tighter tail latency, while a larger tau allows bigger prefill chunks and higher throughput. Sarathi-Serve frames serving as meeting a TBT SLO, distinguishing strict targets (on the order of five times a single decode iteration's runtime, for interactive chat) from relaxed targets (around twenty-five times, for batch workloads), and sizes tau accordingly ^[2]. Uniform iteration sizes also shrink "pipeline bubbles," the idle gaps that appear under pipeline parallelism when consecutive micro-batches have unequal runtimes, which is why the largest end-to-end gains appear in multi-GPU pipeline-parallel deployments ^[1]^[2].
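
The ordering rule and the token budget can be combined in a short sketch of one scheduling step. This is a simplified illustration, not Sarathi-Serve's code: the `Request` type and function name are hypothetical, the number of running decodes is assumed not to exceed the budget, and moving requests between queues after the step is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    name: str
    prompt_len: int
    prefilled: int = 0  # prompt tokens whose KV cache entries already exist

    @property
    def remaining_prefill(self) -> int:
        return self.prompt_len - self.prefilled


def build_stall_free_batch(decoding: List[Request], partial: List[Request],
                           waiting: List[Request],
                           token_budget: int) -> List[Tuple[str, int]]:
    """Return (request name, tokens scheduled) pairs for one iteration."""
    batch = []
    budget = token_budget
    # 1. Every running decode gets its single token first, so none is stalled.
    for req in decoding:
        batch.append((req.name, 1))
        budget -= 1
    # 2. Continue partially prefilled prompts, then 3. admit new requests,
    #    each limited to whatever budget is left.
    for req in partial + waiting:
        if budget <= 0:
            break
        chunk = min(req.remaining_prefill, budget)
        if chunk > 0:
            req.prefilled += chunk
            batch.append((req.name, chunk))
            budget -= chunk
    return batch
```

With a budget of 512 tokens, two running decodes, one prompt that still needs 488 tokens, and one new 300-token prompt, the step schedules 1 + 1 + 488 + 22 tokens: the new request receives only the 22 tokens that remain, and the rest of its prefill waits for later iterations.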

How much does chunked prefill improve performance?

The two papers report the following representative results.

System	Model and hardware	Reported improvement
SARATHI ^[1]	LLaMA-13B, A6000 GPU	Up to 10x decode throughput; 1.33x end-to-end throughput
SARATHI ^[1]	LLaMA-33B, A100 GPU	Up to 4.25x decode throughput; 1.25x end-to-end throughput
SARATHI ^[1]	GPT-3 (175B), pipeline parallel	6.29x reduction in pipeline bubbles; 1.91x end-to-end throughput
Sarathi-Serve ^[2]	Mistral-7B, one A100	Up to 2.6x higher serving capacity vs vLLM under SLO
Sarathi-Serve ^[2]	Yi-34B, two A100s	Up to 3.7x higher serving capacity vs vLLM under SLO
Sarathi-Serve ^[2]	Falcon-180B, pipeline parallel	Up to 5.6x higher serving capacity vs vLLM under SLO

The SARATHI abstract states that "for the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x." ^[1] The concurrent DeepSpeed-FastGen system reports up to 2.3x higher effective throughput, 2x lower average latency, and up to 3.7x lower tail latency than vLLM using the same chunking idea under the Dynamic SplitFuse name ^[5].

Which serving frameworks use chunked prefill?

Chunked prefill is now standard in production serving frameworks. vLLM added it in version 0.4.0 (2024) as an opt-in flag, enable_chunked_prefill, with a default per-iteration budget (max_num_batched_tokens) of 512 tokens; its scheduler prioritizes decodes and chunks any prefill that does not fit in the remaining budget. In the rewritten vLLM V1 engine, chunked prefill is enabled by default whenever applicable and cannot be turned off, reflecting how central the technique has become; the default token budget was also raised in later releases (for example, 2,048 tokens in vLLM 0.8.x) ^[3]. NVIDIA TensorRT-LLM exposes the same idea as "Chunked Context" (also called chunked prefill), and uses dynamic chunk sizing both to raise GPU utilization by batching more decode tokens alongside prefill and to decouple activation-memory use from prompt length, which removes hard limits on input length and simplifies engine builds ^[4]. The concurrent DeepSpeed-FastGen system ships the same mechanism as Dynamic SplitFuse, decomposing long prompts into uniform chunks and composing them with other requests' generation steps to run at a consistent forward-pass size ^[5].

What are the tradeoffs and costs of chunked prefill?

The technique is not free. The main cost is that the KV cache for earlier chunks must be re-read from memory when later chunks of the same prompt compute attention: if a prompt is split into N chunks, the first chunk's keys and values are read roughly N-1 additional times, so very small chunks add measurable attention overhead and slow prefill (raising TTFT) relative to a single full prefill. In practice the SARATHI authors found this attention overhead modest because attention is a small fraction of total prefill compute, but they also identified a sharper pitfall: tile-quantization effects on the GPU mean chunk sizes should align to hardware tile boundaries. Crossing a tile boundary, for example using 257 tokens instead of 256, can inflate a prefill's runtime by about 32 percent ^[1]. The chunk size (or token budget) is therefore a tuning knob that trades TTFT against TBT and throughput, and serving systems expose it as such.
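
The re-read cost can be counted directly. The sketch below tallies, per layer, how many previously cached token positions are read back across all chunks of one prompt; it counts token positions rather than bytes and ignores kernel-level details, and the function name is illustrative.

```python
def prior_kv_tokens_read(prompt_len: int, chunk_size: int) -> int:
    """Total cached token positions re-read across all chunks of one prompt."""
    total = 0
    for start in range(0, prompt_len, chunk_size):
        # The chunk beginning at `start` attends to every earlier token,
        # so it reads the keys and values of `start` cached positions.
        total += start
    return total


if __name__ == "__main__":
    for chunk in (4096, 512, 128):
        print(chunk, prior_kv_tokens_read(4096, chunk))
```

For a 4,096-token prompt, a single full prefill re-reads nothing, 512-token chunks re-read 14,336 cached positions, and 128-token chunks re-read 63,488, which shows how the overhead grows as chunks shrink.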

How does chunked prefill compare to prefill-decode disaggregation?

Chunked prefill is one of two dominant strategies for handling prefill-decode interference, and it is useful to contrast it with the other. Prefill-decode disaggregation, embodied by systems such as Splitwise, DistServe, and TetriInfer, runs the two phases on physically separate GPU pools (or "replicas") so they never share a batch ^[6]^[7]. This eliminates interference outright and lets each pool be provisioned and parallelized for its own bottleneck, but it requires transferring each request's KV cache over the interconnect from the prefill GPUs to the decode GPUs when prefill finishes, and it can underutilize the memory of the prefill pool ^[2]^[6]^[7].
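
The size of the KV cache that disaggregation must move can be estimated with the standard per-token formula. The model shape used below (40 layers, 40 key-value heads, head dimension 128, 16-bit values) is an illustrative assumption, not a configuration taken from the cited papers, and the function name is hypothetical.

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one request: keys and values, for every layer and head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return num_tokens * per_token


if __name__ == "__main__":
    size = kv_cache_bytes(4096, num_layers=40, num_kv_heads=40, head_dim=128)
    print(f"{size / 2**30:.3f} GiB")
```

Under these assumptions a 4,096-token prompt produces about 3.1 GiB of cache, all of which has to cross the interconnect before the decode pool can start generating.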

Chunked prefill takes the opposite, co-located approach: it keeps both phases on the same GPUs and resolves interference in software through stall-free scheduling, avoiding any KV-cache migration. The two ideas address the same underlying problem from different directions, and they are not mutually exclusive. Disaggregated systems can still apply chunked prefill within their prefill pool, and a growing body of work tries to unify the two, switching between aggregation and disaggregation based on load to keep both TTFT and TBT within target. Chunked prefill is thus best understood as a foundational building block of modern LLM serving rather than a single product feature: it is the mechanism that makes mixed prefill-and-decode batches latency-friendly, and it composes with continuous batching, PagedAttention memory management, and disaggregation ^[2]^[3]^[6].

References

1. Agrawal, A., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., and Ramjee, R. "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills." arXiv:2308.16369, August 31, 2023. https://arxiv.org/abs/2308.16369
2. Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., Tumanov, A., and Ramjee, R. "Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve." Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2024). arXiv:2403.02310. https://arxiv.org/abs/2403.02310
3. vLLM. "Optimization and Tuning" (chunked prefill configuration and V1 default behavior). vLLM Documentation. https://docs.vllm.ai/en/latest/configuration/optimization/
4. NVIDIA. "Streamlining AI Inference Performance and Deployment with NVIDIA TensorRT-LLM Chunked Prefill." NVIDIA Technical Blog, November 2024. https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/
5. Holmes, C., Tanaka, M., Wyatt, M., et al. "DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference." arXiv:2401.08671, January 2024. https://arxiv.org/abs/2401.08671
6. Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." OSDI 2024. arXiv:2401.09670. https://arxiv.org/abs/2401.09670
7. Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, I., Maleki, S., and Bianchini, R. "Splitwise: Efficient Generative LLM Inference Using Phase Splitting." Proceedings of the 51st International Symposium on Computer Architecture (ISCA 2024). arXiv:2311.18677. https://arxiv.org/abs/2311.18677
8. Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. "Orca: A Distributed Serving System for Transformer-Based Generative Models." Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2022). https://www.usenix.org/conference/osdi22/presentation/yu
