Continuous Batching
Last reviewed
May 7, 2026
Sources
10 citations
Review status
Source-backed
Revision
v2 · 5,126 words
Continuous batching is a scheduling technique for large language model (LLM) inference servers that inserts new requests into a running batch at the granularity of individual model iterations, rather than waiting for every request in a batch to complete before accepting new work. Also called iteration-level scheduling or in-flight batching (and sometimes, loosely, dynamic batching, a term that more precisely denotes a request-level technique), it was formally described and evaluated in the paper "Orca: A Distributed Serving System for Transformer-Based Generative Models" by Yu et al., presented at USENIX OSDI 2022. Systems that implement continuous batching, including vLLM, HuggingFace Text Generation Inference (TGI), NVIDIA TensorRT-LLM, and SGLang, typically report throughput improvements of 10x to 23x over naive static batching under realistic request arrival patterns.
Transformer-based autoregressive language models generate output one token at a time. In order to produce a response of $N$ tokens, a model must execute $N$ sequential forward passes (ignoring speculative decoding), each of which takes the previously generated tokens as context. This sequential dependency means that the time to complete a request is directly proportional to the number of output tokens it generates.
The fundamental challenge for inference serving is that different requests produce different numbers of output tokens. A request asking for a short factual answer might finish in 20 tokens. A request asking the model to write a 500-word essay might require 600 or more tokens. An instruction-following request might involve a chain-of-thought that is unpredictable in length. The output length of any individual request cannot be known in advance; it depends on what the model generates, which in turn depends on the specific input and the model's learned behavior.
GPUs perform best when many operations are fused into a single large matrix multiplication. Serving a single request at a time is GPU-inefficient: the batch dimension of all matrix multiplications is 1, leaving most of the arithmetic units idle. Early inference systems addressed this by grouping multiple requests into a batch and processing them together in a single forward pass. When requests in the same batch each produce one token per step, a single GPU kernel handles all of them simultaneously, substantially improving hardware utilization.
The simplest batching strategy, called static batching or request-level batching, waits until it has collected a full batch of requests and then runs that batch through the model until all requests in the batch complete. The batch is treated as an atomic unit: a new request cannot join the batch while it is running, and the batch does not release completed requests until every member has finished generating.
This approach has a severe efficiency problem rooted in the variable-length nature of language model outputs. In any realistic workload, some requests finish much earlier than others. If a batch of 16 requests has one request that needs 500 tokens and all others need 20 tokens, the 15 short requests finish in 20 iterations but sit idle occupying batch slots for the remaining 480 iterations while the long request completes. GPU memory allocated for the short requests' KV caches is held hostage. New requests waiting in the queue cannot be admitted even though 15 out of 16 batch slots are sitting idle.
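The waste in this example can be checked with a few lines of arithmetic. The sketch below is illustrative only: `static_batch_utilization` is a hypothetical helper, and it assumes every iteration costs the same and that a slot counts as useful only while its request is still generating.

```python
def static_batch_utilization(output_lengths):
    """Fraction of slot-iterations doing useful work under static batching.

    The batch runs for max(output_lengths) iterations, and every slot is
    held for the whole run, even after its request has finished.
    """
    total_slot_iterations = len(output_lengths) * max(output_lengths)
    useful_slot_iterations = sum(output_lengths)
    return useful_slot_iterations / total_slot_iterations


# The example from the text: 15 short requests and one long one.
lengths = [20] * 15 + [500]
print(f"{static_batch_utilization(lengths):.1%}")  # 10.0%
```

Only 800 of the 8,000 slot-iterations produce a token, so the batch is 10% utilized.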
This problem was analyzed in detail by Yu et al. in the Orca paper. They measured the "batch slot waste" in static batching systems and found it to be the primary bottleneck preventing high GPU utilization. Real workload traces show that output length distributions are highly skewed: many requests are short, but a long tail of requests produces very long outputs. Static batching, which must wait for the longest-running request in the batch before admitting new work, is especially harmed by this skew.
The Orca paper quantified this problem using traces from real LLM deployments. They measured the average number of "wasted" forward passes per request under static batching: passes during which a completed request still occupies a batch slot and prevents new work from entering. For a batch where one request is 10 times longer than the median, the median request sits idle for roughly 9 times its own generation length. Translating this to throughput: a static batching system serving requests from a realistic length distribution might achieve only 5% to 15% of the maximum throughput achievable if GPU resources were continuously utilized.
The paper "Orca: A Distributed Serving System for Transformer-Based Generative Models" by Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun was published at USENIX OSDI (Operating Systems Design and Implementation) 2022. OSDI is one of the top venues in systems research, with acceptance rates typically below 20%.
Orca's central contribution was the observation that the natural scheduling granularity for autoregressive decoding is the iteration, not the request. At every iteration, the model executes one forward pass and produces one new token for each sequence in the batch. From the hardware's perspective, what matters is that the batch dimension is large. Whether two sequences in that batch started at different times is irrelevant to the GPU; it sees the same matrix multiplications either way.
The authors proposed iteration-level scheduling: after every forward pass, the scheduler checks whether any sequences in the batch have completed (i.e., generated an end-of-sequence token or reached the maximum length). Completed sequences are immediately removed from the batch, freeing their slots. New requests waiting in the queue are immediately inserted to fill those slots, up to the maximum batch size the GPU can accommodate. The batch composition changes at every iteration, but the batch size remains at or near maximum throughout the run.
The Orca paper also introduced the concept of selective batching, which recognizes that not all operations in a transformer forward pass support variable-length inputs equally. Attention operations are the most complex because the attention computation for each sequence depends on that sequence's own KV cache, which grows to different lengths for different sequences. The paper proposed handling this by treating the self-attention layers differently from the feed-forward layers: feed-forward layers can be naively batched across all sequences simultaneously (since they operate per-token), while attention layers require more careful handling of variable-length KV sequences.
Evaluation in the Orca paper showed 36.9x higher throughput and 29.7x lower tail latency compared to a request-level batching baseline (NVIDIA FasterTransformer) on a GPT-3 175B workload. These numbers apply to workloads with high request arrival rates and variable output lengths; workloads with uniform short outputs see more modest gains.
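The core loop of a continuous batching inference server operates as follows:

1. Incoming requests are placed in a waiting queue.
2. The scheduler selects requests from the queue to form the running batch, up to the maximum batch size the GPU can accommodate.
3. Newly admitted sequences have their input prompts processed (prefill).
4. The model executes one forward pass, producing one new token for each sequence in the batch.
5. The scheduler checks whether any sequence has completed (generated an end-of-sequence token or reached the maximum length).
6. Completed sequences are removed from the batch, freeing their slots.
7. New requests from the waiting queue are inserted to fill the freed slots, and the loop returns to step 3.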
The key difference from static batching is steps 5 through 7. In static batching, these steps only happen after all sequences in the batch have completed. In continuous batching, they happen at every single iteration.
Because the batch is modified between iterations, each iteration may process a slightly different set of sequences. The sequences from the previous iteration that have not yet completed carry forward their KV caches (already computed and cached), so only the new token needs to be processed for continuing sequences. New sequences joining the batch at step 7 must have their entire input prompt processed (the prefill phase) before they can join the main decoding loop.
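The difference between the two schedules can be made concrete with a small simulation. This is a sketch under simplifying assumptions, not any system's actual scheduler: the function names are hypothetical, each request is reduced to the (positive) number of tokens it will emit, prefill cost is ignored, and every iteration is assumed to take the same time.

```python
from collections import deque


def static_iterations(output_lengths, batch_size):
    """Forward passes needed when each batch runs until its longest member ends."""
    total = 0
    for start in range(0, len(output_lengths), batch_size):
        total += max(output_lengths[start:start + batch_size])
    return total


def continuous_iterations(output_lengths, batch_size):
    """Forward passes needed when slots are refilled after every iteration."""
    waiting = deque(output_lengths)   # tokens each queued request will emit
    running = []                      # tokens still to emit per active sequence
    iterations = 0
    while waiting or running:
        # Admit queued requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass: every running sequence emits one token.
        iterations += 1
        # Drop sequences that just emitted their last token, freeing their slots.
        running = [left - 1 for left in running if left > 1]
    return iterations


lengths = [500, 20, 20, 20, 500, 20, 20, 20]
print(static_iterations(lengths, batch_size=4))      # 1000
print(continuous_iterations(lengths, batch_size=4))  # 520
```

With static batching the second long request cannot start until the first batch drains after 500 iterations. With iteration-level scheduling it is admitted at iteration 21, as soon as the first short requests finish.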
Autoregressive generation has two distinct phases with different computational profiles:
Prefill (prompt processing): The model ingests the entire input prompt in a single forward pass, computing KV cache entries for every token in the prompt simultaneously. This phase is compute-bound: it involves large matrix multiplications where the sequence dimension equals the prompt length. A long prompt of 2,048 tokens produces 2,048 KV entries per layer in a single iteration. Prefill is fast per token but requires a lot of compute in aggregate.
Decode (token generation): After prefill, the model generates one new token per iteration. Each decoding iteration involves a forward pass over just the single new token, but the attention computation must attend over the full KV cache accumulated so far. The decoding phase is memory-bandwidth-bound rather than compute-bound: the bottleneck is loading KV cache data from GPU HBM.
Continuous batching must manage both phases simultaneously. New requests entering the batch need to run prefill; existing requests already in the decode phase need to run decode. In simple implementations, prefill and decode are handled in the same forward pass: the prefill request's tokens and the decode requests' single new tokens are concatenated and processed together. This works but can cause interference: a large prefill can dominate the computation in an iteration and stall the decode requests that share the batch, increasing their inter-token latency.
Chunked prefill is an extension to continuous batching that addresses the prefill-decode interference problem. Instead of processing an entire prompt in one iteration, chunked prefill breaks the prompt into fixed-size chunks (e.g., 512 or 1,024 tokens per chunk) and processes one chunk per iteration. This keeps the compute load of any single iteration bounded, preventing long prefills from stalling the decode phase.
With chunked prefill, new requests enter the batch in a partially computed state, contributing only a chunk's worth of prefill compute per iteration until their prompt is fully processed, at which point they transition to the decode phase. The scheduler interleaves prefill chunks and decode tokens within each iteration, maintaining more consistent latency for already-running requests while still admitting new requests efficiently.
vLLM versions from 0.4 onward implement chunked prefill as a configurable option. The chunk size is a tunable parameter that trades off prefill throughput (larger chunks finish faster) against inter-token latency stability for running requests (smaller chunks are less disruptive to concurrent decode requests).
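How a scheduler might interleave decode tokens and prefill chunks within one iteration can be sketched as follows. The function `plan_iteration` and its parameters are hypothetical, and the decode-first policy is an assumption chosen for illustration rather than a description of any particular system.

```python
def plan_iteration(decode_seqs, prefill_remaining, token_budget):
    """Pick the tokens to run in one iteration under a fixed token budget.

    decode_seqs: ids of sequences in the decode phase (one token each).
    prefill_remaining: dict of sequence id -> prompt tokens not yet processed.
    Returns (decode ids scheduled, dict of sequence id -> prefill chunk size).
    """
    # Decode tokens go first so running sequences are not stalled.
    scheduled_decodes = list(decode_seqs)[:token_budget]
    budget_left = token_budget - len(scheduled_decodes)

    chunks = {}
    for seq_id, remaining in prefill_remaining.items():
        if budget_left == 0:
            break
        if remaining <= 0:
            continue
        # Give this prompt as much of the leftover budget as it can use.
        chunk = min(remaining, budget_left)
        chunks[seq_id] = chunk
        budget_left -= chunk
    return scheduled_decodes, chunks


decodes, chunks = plan_iteration(["a", "b", "c"], {"p": 2048}, token_budget=512)
print(decodes, chunks)  # ['a', 'b', 'c'] {'p': 509}
```

With a budget of 512 tokens, the three decode sequences take one token each and the 2,048-token prompt advances by 509 tokens, so it needs several iterations to finish prefill while decode continues undisturbed.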
Continuous batching imposes specific requirements on memory management. Because the number of tokens each request will generate is unknown, the serving system cannot pre-allocate a fixed block of GPU memory per request. Allocating the maximum possible context length for every request would be prohibitively wasteful.
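A rough calculation shows how wasteful worst-case reservation is. The sketch below uses a hypothetical helper and an assumed 13B-class configuration (40 layers, 40 KV heads of dimension 128, 16-bit values); the prompt and answer lengths are illustrative, not measurements.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed 13B-class configuration: 40 layers, 40 KV heads of size 128, FP16.
per_token = kv_cache_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128)
reserved = per_token * 2048          # reserve for a 2,048-token maximum
used = per_token * (100 + 20)        # 100-token prompt, 20-token answer
print(per_token)                     # 819200 bytes, i.e. 800 KiB per token
print(reserved / 2**30)              # 1.5625 GiB reserved
print(used / reserved)               # 0.05859375, under 6% of the reservation
```

Under these assumptions a short request would use less than 6% of a worst-case reservation, which is why on-demand allocation matters for keeping the batch full.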
In early implementations of continuous batching, including the Orca paper's prototype, memory management was handled with fixed-length padded blocks or simple watermark-based admission control: new requests were admitted only when the currently running batch was estimated to have enough KV cache space to accommodate them. This approach was conservative and left GPU memory underutilized.
The introduction of PagedAttention by Kwon et al. (SOSP 2023) solved this problem by applying virtual memory concepts to KV cache allocation. PagedAttention divides GPU memory into fixed-size physical blocks and maintains a per-sequence logical-to-physical block mapping. KV cache blocks are allocated on demand as tokens are generated, and released immediately when a sequence completes. This allows the maximum possible number of sequences to fit in GPU memory at any given time, complementing continuous batching's goal of keeping the batch full.
The combination of continuous batching (for iteration-level scheduling) and PagedAttention (for efficient memory management) is what powers the high throughput of modern serving systems like vLLM. The two techniques are orthogonal and complementary: continuous batching determines when requests enter and leave the batch, while PagedAttention determines how memory is allocated and shared among the requests currently in the batch.
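The block-table idea can be illustrated with a toy allocator. This is a minimal sketch with hypothetical names (`PagedKVCache`, `append_token`, `free`), not vLLM's block manager: it tracks only block ownership and omits the attention kernel, block sharing, and preemption.

```python
class PagedKVCache:
    """Toy block allocator: a free list plus a per-sequence block table."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # seq id -> [physical ids]
        self.num_tokens = {}                        # seq id -> tokens stored

    def append_token(self, seq_id):
        """Record one more token, allocating a block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        count = self.num_tokens.get(seq_id, 0)
        if count == len(table) * self.block_size:   # last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(17):
    cache.append_token("seq-0")
print(len(cache.block_tables["seq-0"]))  # 2
cache.free("seq-0")
print(len(cache.free_blocks))            # 4
```

A sequence holds only as many blocks as its current length requires, and those blocks return to the pool the moment the scheduler removes it from the batch.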
The terms static batching, dynamic batching, and continuous batching are sometimes used interchangeably or inconsistently in the literature. The following table clarifies the distinctions as used in most systems research:
| Property | Static batching | Dynamic batching | Continuous batching |
|---|---|---|---|
| Scheduling granularity | Request (entire batch runs until all complete) | Request with variable wait window | Iteration (every forward pass) |
| New request admission | Only between batches | Only between batches, but batches assembled dynamically | At every iteration |
| Handling of variable output lengths | Batch waits for longest request | Batch waits for longest request | Completed requests immediately released |
| GPU slot waste | High (proportional to length skew) | Reduced vs static (larger batches) | Minimal |
| Implementation complexity | Low | Low to moderate | Moderate to high |
| Typical throughput improvement | Baseline | 1.5x to 3x over single-request | 10x to 23x over static batching |
| Latency for short requests | Can be delayed by long co-batch members | Can be delayed by long co-batch members | Nearly optimal: released as soon as done |
Dynamic batching, as implemented in early serving frameworks like NVIDIA Triton Inference Server and TensorFlow Serving, refers to assembling a batch dynamically as requests arrive within a configurable time window or until a target batch size is reached. The key distinction from continuous batching is that dynamic batching still treats the assembled batch as an atomic unit once execution begins: no new requests are admitted until the entire batch finishes. Dynamic batching improves throughput over single-request serving by achieving larger matrix multiplications, but it still suffers from the idle-slot problem for long-tailed output distributions.
Continuous batching's defining characteristic is that the batch composition can change between every pair of consecutive model iterations. This is what enables close to 100% batch slot utilization regardless of output length variance.
The throughput improvements from continuous batching relative to static batching depend heavily on the workload characteristics, particularly the variance in output lengths. The following table summarizes reported gains from key papers and systems:
| Source | Baseline | Workload | Improvement |
|---|---|---|---|
| Orca paper (Yu et al., OSDI 2022) | NVIDIA FasterTransformer (request-level batching) | GPT-3 175B, variable-length outputs | 36.9x throughput, 29.7x tail latency |
| vLLM blog post (Kwon et al., 2023) | HuggingFace Transformers | LLaMA-13B on ShareGPT traces | 14x to 24x throughput |
| vLLM blog post (Kwon et al., 2023) | vLLM vs Orca (w/ PagedAttention) | LLaMA-13B on ShareGPT traces | 2.3x to 4.3x above Orca alone |
| TGI benchmarks (HuggingFace) | Static batching baseline | Llama-2-70B, mixed workload | 10x to 18x throughput |
| NVIDIA TensorRT-LLM documentation | Static batch baseline | GPT-J-6B, A100 80GB | Approximately 10x to 15x throughput |
It is important to note that these figures are not purely from continuous batching alone. Modern systems combine continuous batching with PagedAttention, optimized attention kernels (Flash Attention, FlashInfer), quantization, and hardware-specific tuning. Isolating the contribution of continuous batching alone is difficult because it is almost always deployed together with these other optimizations.
The Orca paper's 36.9x figure is one of the largest reported and reflects a comparison against an unoptimized static batching baseline without any of the memory management improvements that came later. The vLLM figures showing 2.3x to 4.3x above Orca represent the additional gain from PagedAttention's memory efficiency on top of iteration-level scheduling.
In workloads with uniform output lengths (all requests produce the same number of tokens), continuous batching provides little throughput benefit over static batching because there is no idle-slot problem to solve. The gain is specific to the variable-length case, which describes virtually all production LLM deployments.
vLLM, developed at UC Berkeley and released in June 2023, is the most widely used open-source LLM serving framework as of 2025. It implements continuous batching as its default scheduling strategy, combined with PagedAttention for memory management. vLLM's scheduler maintains a running batch that is updated at every iteration: completed sequences are removed, preempted sequences are swapped to CPU memory or recomputed, and new sequences are admitted from the waiting queue.
vLLM's implementation of continuous batching is more sophisticated than the original Orca description in several respects. The scheduler handles sequence groups (multiple parallel completions from one request), supports chunked prefill to limit the impact of long prompts on the latency of running requests, and integrates with speculative decoding pipelines where accepted tokens advance sequences by multiple positions per iteration. vLLM also supports prefix caching via Automatic Prefix Caching (APC), which allows the prefill of repeated prompt prefixes to be skipped by reusing previously computed KV cache blocks.
The vLLM V1 architecture (released in 2024-2025) refactored the scheduler to reduce Python overhead in the control path and enable CUDAGraph capture for faster iteration dispatch, while preserving the continuous batching model.
HuggingFace's Text Generation Inference server implemented continuous batching in version 0.9 (2023), following the Orca paper and the vLLM release. TGI calls its implementation "continuous batching" explicitly in documentation and uses the same iteration-level scheduling model: requests are added to the running batch as slots become available, and completed requests are released immediately without waiting for batch-mates.
TGI targets tight integration with HuggingFace Hub model weights and the Transformers library ecosystem. It supports a range of model architectures and provides a REST API compatible with the OpenAI API format. TGI's continuous batching implementation uses a waterfall model for managing KV cache memory, similar in spirit to PagedAttention but with some architectural differences in block management.
In benchmark comparisons published by HuggingFace, TGI with continuous batching achieves 10x to 18x higher throughput than naive sequential inference for typical chat workloads. TGI is widely deployed at companies using HuggingFace-hosted model weights and is the backend for HuggingFace's Inference Endpoints product.
NVIDIA's TensorRT-LLM library provides highly optimized inference on NVIDIA hardware (particularly A100, H100, and H200 GPUs) and implements iteration-level scheduling under the name "in-flight batching" in its documentation. TensorRT-LLM's executor API (introduced in TensorRT-LLM v0.8, 2024) provides fine-grained control over in-flight batching behavior, including configurable maximum batch sizes, KV cache size limits, and chunked context (equivalent to chunked prefill).
TensorRT-LLM combines in-flight batching with optimizations targeted at NVIDIA hardware: FP8 quantization on H100 Tensor Cores, INT8 KV cache quantization, attention kernels tuned for specific GPU generations (including XQA kernels for multi-query and grouped-query attention in the generation phase), and TensorRT's engine compilation pipeline, which fuses operations at the graph level. The combination of in-flight batching and these hardware optimizations produces state-of-the-art throughput on NVIDIA hardware.
NVIDIA also provides a higher-level serving framework, Triton Inference Server, which can host TensorRT-LLM backends and add request queuing, load balancing, and multi-model management on top of TensorRT-LLM's in-flight batching.
SGLang, developed by researchers at UC Berkeley and Stanford University, implements continuous batching as part of its runtime and extends it with RadixAttention, which uses a radix tree (a compressed trie) to manage the KV cache. RadixAttention enables automatic prefix reuse across all requests without requiring explicit configuration: any two requests sharing a common prefix will automatically share the corresponding KV cache blocks. This allows continuous batching to coexist with aggressive prefix sharing in workloads involving repeated system prompts, few-shot examples, or multi-turn conversation history.
SGLang's scheduler also implements chunked prefill and automatic batching of requests across multiple concurrent programs (structured generation pipelines). In benchmarks on workloads with significant prefix sharing, SGLang reports 29% to 50% higher throughput than vLLM due to the more aggressive prefix reuse enabled by RadixAttention.
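The prefix-reuse idea can be sketched with a plain token-level trie. This is a simplification for illustration, with hypothetical class names: a real radix tree compresses runs of tokens into single edges, and a production cache also stores KV block references and evicts unused entries, none of which is modeled here.

```python
class PrefixNode:
    def __init__(self):
        self.children = {}   # token id -> PrefixNode


class PrefixCache:
    """Token-level trie recording which prompt prefixes are already cached."""

    def __init__(self):
        self.root = PrefixNode()

    def match_and_insert(self, tokens):
        """Return how many leading tokens were already cached, then cache the rest."""
        node, matched = self.root, 0
        for token in tokens:
            child = node.children.get(token)
            if child is None:
                # First miss: everything from here on is a new branch.
                child = node.children[token] = PrefixNode()
            else:
                matched += 1
            node = child
        return matched


cache = PrefixCache()
system_prompt = [1, 2, 3, 4]
print(cache.match_and_insert(system_prompt + [10, 11]))  # 0
print(cache.match_and_insert(system_prompt + [20]))      # 4
```

The second request shares the four-token system prompt with the first, so only its one new token would need prefill.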
NVIDIA's Triton Inference Server added a "dynamic batching" feature long before the Orca paper, but this earlier dynamic batching was request-level, not iteration-level. Triton assembled batches from queued requests and dispatched them atomically. After the Orca paper and the vLLM/TGI releases demonstrated the superiority of iteration-level scheduling, NVIDIA added in-flight batching support to Triton when using TensorRT-LLM backends.
Ollama, a lightweight local LLM serving tool, began adding support for parallel request handling in version 0.1.33 (2024). Ollama's parallel inference uses a simpler batching model than production systems and does not implement full iteration-level scheduling for all backend configurations, but it demonstrates the adoption of continuous batching ideas even in consumer-facing tools.
Shanghai AI Laboratory's LMDeploy framework implements continuous batching in its TurboMind engine, which targets deployment of InternLM and other models. TurboMind uses iteration-level scheduling combined with a paged KV cache. LMDeploy reports throughput competitive with vLLM on supported model architectures.
Continuous batching and PagedAttention address different aspects of the same throughput problem and are highly complementary.
Continuous batching solves the scheduling problem: how to keep the batch as full as possible given that requests complete at different times. Without continuous batching, even a system with perfect memory management would suffer from idle batch slots as short requests complete and wait for long ones.
PagedAttention solves the memory management problem: how to allocate GPU memory for KV caches without fragmentation or over-reservation. Without PagedAttention, even a system with iteration-level scheduling would be constrained by how many requests can fit in GPU memory, because each request would need to reserve memory for its worst-case output length.
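Together, they create a two-layer optimization: PagedAttention maximizes how many sequences fit in GPU memory at once, and continuous batching keeps those slots filled at every iteration.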
The vLLM system demonstrated this synergy empirically. Compared to Orca (iteration-level scheduling without PagedAttention), vLLM (iteration-level scheduling with PagedAttention) achieves 2.3x to 4.3x higher throughput. The additional gain comes from PagedAttention's ability to pack more concurrent sequences into the same amount of GPU memory, giving the continuous batching scheduler more sequences to work with.
As mentioned above, chunked prefill is a natural extension of continuous batching that addresses the interference between prefill (processing long input prompts) and decode (generating tokens one by one). Without chunked prefill, a request with a very long prompt would monopolize the GPU for an entire iteration, increasing the latency of all concurrently decoding requests by the duration of that prefill.
Chunked prefill was proposed in the SARATHI paper (Agrawal et al., 2023, arXiv 2308.16369), developed further in the follow-up Sarathi-Serve work, and subsequently implemented in vLLM, SGLang, and TensorRT-LLM. The core idea is to divide the prefill phase into chunks of at most $C$ tokens (the chunk size) and process one chunk per iteration alongside the normal decode tokens. This caps the per-iteration overhead from any single prefill at $C$ tokens, maintaining more predictable iteration latency.
The Sarathi-Serve paper reports that chunked prefill reduces time-to-first-token variance by 2x to 5x and reduces P99 decode latency by up to 40% compared to continuous batching without chunked prefill, while preserving the same or better overall throughput.
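Setting the chunk size $C$ involves a tradeoff: larger chunks finish a prompt's prefill in fewer iterations, improving prefill throughput, while smaller chunks keep each iteration short and disturb concurrent decode requests less.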
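In practice, chunk sizes of 512 to 2,048 tokens are commonly used. In vLLM, the relevant setting is the max_num_batched_tokens parameter, which caps the total number of tokens (prefill chunks plus decode tokens) processed in one iteration.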
A more aggressive architectural response to the prefill-decode interference problem is prefill-decode disaggregation, sometimes called disaggregated serving. In this approach, dedicated prefill machines (or GPU pools) handle only the prompt processing phase, and completed KV caches are transferred over a high-bandwidth interconnect (InfiniBand, NVLink) to dedicated decode machines that handle token generation. Continuous batching applies within each pool independently.
Disaggregated serving eliminates the fundamental tension between prefill and decode by running them on entirely separate hardware, but at the cost of added system complexity and the latency and bandwidth cost of KV cache transfer. Projects including DistServe (Zhong et al., 2024), Mooncake (Qin et al., 2024, arXiv 2407.00079), and Splitwise (Patel et al., 2024) have demonstrated disaggregated serving at scale.
Speculative decoding is a technique that uses a smaller, faster draft model to propose multiple candidate tokens per step, which are then verified in parallel by the larger target model. When the verification accepts $k$ tokens in a single iteration, a sequence advances by $k$ token positions instead of one. This interacts directly with continuous batching in two ways.
First, sequences that accept multiple speculative tokens in one iteration effectively "sprint" ahead of sequences using standard decoding. This means different sequences in the same continuous batch can be at very different positions in their generation, making KV cache management more complex.
Second, speculative tokens that are rejected must be rolled back: the sequence's position and KV cache must be reset to the last accepted token. In a PagedAttention-based system, this means releasing blocks allocated for rejected positions. The implementation must handle this efficiently to avoid overhead that negates speculative decoding's gains.
vLLM's speculative decoding implementation (added in 2024) maintains compatibility with continuous batching by handling acceptance and rejection at the block manager level, treating rejected speculative tokens as if the sequence had simply not generated them. The scheduler continues to operate at the iteration level regardless of whether speculation is active for individual sequences.
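The acceptance and rollback steps can be sketched as below. This is a simplified illustration with hypothetical function names, not vLLM's implementation: it assumes greedy verification (a draft token is accepted only if it equals the target model's own choice), whereas sampling-based speculative decoding uses a probabilistic acceptance rule, and it assumes blocks were reserved for every drafted position.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification: accept draft tokens until the first mismatch.

    target_tokens[i] is the target model's own choice at draft position i,
    with one extra entry for the position after the last draft token.
    Returns the tokens to append to the sequence this iteration.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            accepted.append(target)   # replace the first wrong token and stop
            return accepted
        accepted.append(draft)
    accepted.append(target_tokens[len(draft_tokens)])  # all accepted: bonus token
    return accepted


def blocks_to_release(tokens_before, num_drafted, num_accepted, block_size):
    """Blocks reserved for drafted positions that were rejected."""
    def blocks_needed(num_tokens):
        return -(-num_tokens // block_size)   # ceiling division
    reserved = blocks_needed(tokens_before + num_drafted)
    kept = blocks_needed(tokens_before + num_accepted)
    return reserved - kept


print(verify_draft([5, 6, 7, 8], [5, 6, 9, 3, 1]))  # [5, 6, 9]
print(blocks_to_release(30, 4, 1, block_size=16))    # 1
```

In the example, two draft tokens are accepted and the third is replaced, so the sequence advances by three positions in one iteration. In the second call, rejecting three of four drafted tokens lets the sequence fall back from three blocks to two.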
Without chunked prefill, a single large prefill can spike iteration time substantially. If a request with a 4,096-token prompt enters the running batch, the iteration that processes its prefill pushes 4,096 tokens through every layer instead of one token per sequence, so it can take many times as long as a standard decode iteration. The attention portion of that cost grows roughly quadratically with prompt length for standard attention: a 4,096-token prompt needs roughly 16 times the attention compute of a 1,024-token prompt. All other requests in the batch experience a corresponding latency spike. This is the primary motivation for chunked prefill.
When GPU memory is nearly full, the scheduler may be forced to pause admission of new requests even though some batch slots are technically available. This creates a form of head-of-line blocking: the front of the request queue cannot be admitted even if the batch has room, because there is not enough free memory for its KV cache. The scheduler must balance admission control against memory pressure, sometimes preempting running sequences to make room. Preemption via KV cache swapping to CPU memory adds latency for the preempted request.
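The head-of-line effect is easy to reproduce with a first-come, first-served admission check. The sketch below is an assumption-laden illustration: `admit`, the block-based accounting, and the optional watermark are hypothetical, and real schedulers also weigh preemption and output growth.

```python
def admit(waiting, free_blocks, block_size, watermark_blocks=0):
    """Admit queued requests in FIFO order while their prompts fit in memory.

    waiting: list of (request id, prompt length) in arrival order.
    Stops at the first request that does not fit, which is what produces
    head-of-line blocking: later, smaller requests are not considered.
    """
    admitted = []
    for request_id, prompt_len in waiting:
        needed = -(-prompt_len // block_size)  # ceiling division
        if free_blocks - needed < watermark_blocks:
            break
        free_blocks -= needed
        admitted.append(request_id)
    return admitted, free_blocks


queue = [("a", 100), ("b", 4000), ("c", 50)]
print(admit(queue, free_blocks=64, block_size=16))  # (['a'], 57)
```

Request "c" needs only 4 blocks and 57 are free, but it stays queued behind "b", which needs 250.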
Requests with extremely long context (100K tokens or more) are challenging for continuous batching systems. A single very long sequence consumes enormous amounts of KV cache memory, limiting how many other sequences can run concurrently. The throughput benefit of continuous batching is reduced when most of the batch budget is consumed by a few very long sequences.
Context-length extension techniques, including ring attention, sequence parallelism, and KV cache compression methods, address this at the model and parallelism level rather than the scheduling level.
Large models deployed with tensor parallelism (splitting the model's weight matrices across multiple GPUs) require all GPUs to participate in every forward pass. This means the continuous batching scheduler must coordinate across all tensor-parallel ranks simultaneously: every rank must agree on which sequences are in the batch at each iteration. This is straightforward in synchronous setups but adds coordination overhead.
Pipeline parallelism (distributing model layers across GPUs in a pipeline) is more difficult to combine with continuous batching than tensor parallelism. In a pipeline, different layers of the model are on different GPUs, and data flows from one stage to the next. Changing the batch composition between iterations requires careful synchronization to ensure all pipeline stages are updated consistently. Naive continuous batching with pipeline parallelism can lead to "bubble" inefficiencies at pipeline stage boundaries when batch compositions change. Research on pipeline-parallel continuous batching (including work on micro-batching and online scheduling within pipeline stages) is an active area.
Many deployment scenarios impose per-request or per-user token budgets (maximum input plus output tokens). Enforcing these budgets correctly in a continuous batching system requires the scheduler to track per-sequence token counts and terminate sequences that exceed their budget, which adds bookkeeping complexity. Production serving systems handle this via per-sequence termination conditions in the scheduler.
Continuous batching is one of several scheduling innovations for LLM inference. The following table places it in context:
| Technique | What it optimizes | How it interacts with continuous batching |
|---|---|---|
| Continuous batching | Batch slot utilization over time | Foundation; all other techniques build on or extend it |
| PagedAttention | GPU memory fragmentation | Orthogonal; maximizes number of sequences in memory |
| Chunked prefill | Prefill-decode latency interference | Extension to continuous batching scheduler |
| Speculative decoding | Per-sequence tokens-per-second | Requires custom handling in continuous batch scheduler |
| Prefix caching (APC, RadixAttention) | Redundant prefill computation | Implemented within the continuous batching memory layer |
| Disaggregated serving | Prefill-decode resource contention | Alternative architecture that avoids prefill-decode sharing |
| Multi-query / grouped-query attention | Attention KV cache size | Reduces memory pressure, indirectly benefits batching |