Lookahead Decoding
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,243 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 ยท 4,243 words
Add missing citations, update stale details, or suggest a clearer explanation.
Lookahead Decoding is a parallel decoding algorithm for accelerating inference in large language models, introduced in November 2023 by Yichao Fu, Peter Bailis, Ion Stoica, and Hao Zhang from the Hao AI Lab at the University of California, San Diego, in collaboration with LMSYS.[^1][^2] The method breaks the strict left-to-right token-by-token dependency of autoregressive decoding by reframing generation as solving a non-linear system of equations through fixed-point Jacobi iteration, then caching and verifying the n-grams that emerge along iteration trajectories.[^1][^2] Unlike speculative decoding, lookahead decoding requires no separate draft model, no auxiliary data store, and no additional training, while still producing output that is identical to greedy autoregressive decoding under the same sampling settings.[^1][^2] The authors reported speedups of roughly 1.5x to 2.3x on a single GPU across MT-Bench, HumanEval, and GSM8K, and up to 4x on code completion when distributed across eight GPUs using a multi-device extension called Lookahead Parallelism.[^1][^2][^3]
Autoregressive decoding generates one token per forward pass through a large model, and each step's input depends on the previous step's output. This sequential dependency is the dominant latency bottleneck for large language model serving, because at small batch sizes the per-step computation is memory-bandwidth bound on modern accelerators rather than compute bound.[^1][^4] A single forward pass on an NVIDIA A100 uses only a tiny fraction of the device's available floating-point throughput, leaving room to trade extra arithmetic work for fewer total steps.[^2][^4] On commodity inference workloads, decoders idle for most of each step waiting on weights and the KV cache to be streamed from high-bandwidth memory into compute units, which is why nominally extreme arithmetic overheads (tens or even hundreds of times the FLOPs of vanilla decoding) translate into only modest wall-clock penalties at low batch sizes.[^2][^4]
Several lines of research had tried to exploit this surplus arithmetic before lookahead decoding's release. Speculative decoding, described by Leviathan, Kalman, and Matias at Google and independently by Chen and colleagues at DeepMind in 2023, runs a small draft model to propose candidate continuations that the target model verifies in parallel.[^2][^4] Medusa, from researchers at Together AI and others, attaches multiple lightweight prediction heads to the target model to forecast several future tokens at once.[^5] Both approaches eliminate some sequential steps, but they introduce engineering burdens: speculative decoding requires a well-aligned draft model that is hard to obtain for new architectures, and Medusa requires training new heads on top of the base model.[^2] EAGLE, which appeared in early 2024, refined the speculative-decoding family by performing draft prediction at the feature level rather than the token level, again at the cost of training a new lightweight component.[^13]
A second strand of work, parallel decoding via Jacobi iteration, was introduced by Santilli and colleagues in 2023 for machine translation.[^6] That technique reformulates greedy decoding as the solution to a fixed-point system: starting from a random sequence of guesses, the model is repeatedly applied in parallel, and tokens at converged positions are accepted. In principle this can finish in fewer steps than the sequence length. In practice, however, vanilla Jacobi decoding shows only marginal benefit on large language models, often averaging only about 1.05x speedup, because Transformer attention rarely produces the correct token when its left context is itself incorrect.[^4] Lookahead decoding's contribution is to make Jacobi iteration practical for LLMs by harvesting the partial n-grams produced along the way and verifying them in the same step.[^1][^2]
The technique was first announced publicly in the LMSYS blog post "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding" on November 21, 2023.[^2] The post arrived during a period in which LMSYS (the umbrella under which the Vicuna language model and the MT-Bench evaluation set were released) was publishing a steady cadence of inference-systems contributions, and the blog format made the work immediately accessible to practitioners. A full paper, arXiv:2402.02057, appeared on February 3, 2024, and was accepted as a poster at the 41st International Conference on Machine Learning (ICML 2024), where it was published as pages 14060 to 14079 of Proceedings of Machine Learning Research, volume 235.[^1][^7][^8]
The named authors of the paper and the LMSYS blog post are Yichao Fu and Hao Zhang of the Hao AI Lab at the University of California, San Diego, together with Peter Bailis (Stanford University and Sisu Data) and Ion Stoica of UC Berkeley.[^1][^2] Yichao Fu is a Ph.D. student in the Hao AI Lab whose published work focuses on LLM inference algorithms and systems, including subsequent papers such as Lookahead Reasoning.[^12] Hao Zhang, who founded the Hao AI Lab and supervised the project, is an assistant professor of computer science and engineering at UC San Diego with a research focus on LLM inference and serving systems.[^12]
Lookahead decoding starts from the observation that greedy autoregressive decoding is mathematically a fixed-point problem.[^1][^2][^6] Let f denote one forward pass of the model: given a prefix and a sequence of future-position tokens, f returns the model's top-1 prediction at every position. A sequence y is the greedy continuation of a prompt x if and only if applying f to (x, y) returns y again at every position. Jacobi iteration solves this fixed point by starting from random tokens at the future positions and repeatedly applying f in parallel until the output stops changing.[^1][^2][^6]
Each iteration costs one forward pass but can move many positions at once. The catch is the "incorrect context" problem: if position k in the current guess is wrong, the model's output at position k+1 is also usually wrong, so positions only converge from left to right at roughly one token per iteration in the worst case.[^4] As a result, vanilla Jacobi decoding on a Transformer LLM does not meaningfully reduce step count, and is often slightly slower than greedy decoding once per-step overheads are counted.[^4]
The key insight of lookahead decoding is that the wasted Jacobi work is not wasted: even when a position has not converged, the predicted token there typically agrees with the eventual output of vanilla decoding several iterations later, because the model is computing a smoothly improving guess. The iteration trajectory therefore contains a stream of partial n-grams that, if remembered, can be replayed as candidate continuations once the prefix catches up to them.[^1][^2]
To extract more value per step, lookahead decoding maintains a two-dimensional window of past Jacobi iteration states. The window has two parameters:
A single forward pass therefore evaluates W positions, and after N iterations the lookahead branch has produced a sliding trajectory of W times (N minus 1) candidate tokens. Sliding diagonally across this 2D buffer yields candidate n-grams: short strings of length N that the model has been gradually refining at each position.[^1][^2] Conceptually the window is a rolling tape: at each step a new row is appended on top, the oldest row at the bottom is consumed into the n-gram pool, and the cells along each anti-diagonal of the buffer correspond to one candidate n-gram aligned to a particular future position.[^1][^2]
Completed n-grams are deposited into an in-memory pool, keyed by their first token. The pool acts as a small per-request cache of candidate continuations that the model has previously been led to believe in.[^1][^2] When the current decoding tip emits a token t, the verification branch consults the pool for n-grams whose first token equals t and proposes them as candidate next-N tokens. The pool is therefore both the product of one branch (lookahead) and the input to the other (verification), enabling the two branches to amplify each other across steps.[^1][^2]
A lookahead decoding step has two parallel branches, executed inside a single forward pass thanks to a specially constructed attention mask.[^1][^2]
The lookahead branch performs one Jacobi step over the W times (N minus 1) future positions, extending the trajectory by one row. Newly completed n-grams are added to the n-gram pool keyed by their first token.[^1][^2]
The verification branch looks up n-grams in the pool whose first token matches the model's most recent committed token, then asks the model to score those candidates in the same forward pass. The number of candidates verified per step is capped at a third parameter, the guess set size G, typically chosen equal to W.[^2] The longest prefix of any candidate that matches the model's own top-1 prediction at every verified position is then accepted as committed output, so multiple tokens can be added in one step.[^1][^2] When no n-gram matches, the algorithm gracefully falls back to ordinary single-token decoding, paying only the overhead of the lookahead branch's extra positions.[^2]
Both branches share the same KV cache and run under one attention mask whose structure separates them: tokens in the verification branch cannot see tokens in the lookahead branch, and within each branch the mask is causal as in standard decoding.[^2] This design lets the algorithm hide most of the lookahead overhead behind the same forward pass that performs verification, taking advantage of the device's spare arithmetic.[^2]
Because the verification branch always checks each candidate token against the model's greedy top-1 prediction at the corresponding position, the committed output is identical to what plain greedy decoding would produce from the same prompt.[^1][^2] The authors describe the algorithm as "exact" in this sense, and note compatibility with sampling methods without changing the output distribution.[^1] In contrast to draft-model speculative decoding, which preserves the target distribution only under a specific acceptance-and-rejection rule, lookahead decoding's preservation is direct: it never commits a token unless the target model itself would have produced that token next.[^1][^2]
The verification mask uses a block-diagonal structure that is straightforward to combine with FlashAttention kernels.[^3] The released implementation packages a patched FlashAttention 2.3.3 (the "flash_attn_lade" wheel) that yields roughly 20 percent additional speedup over the pure-Python version.[^3] Compatibility with FlashAttention also implies straightforward composition with memory-efficient KV-cache layouts such as PagedAttention in vLLM, since the underlying attention math is unchanged.[^10]
The three tuning knobs trade arithmetic per step for steps:
| Parameter | Symbol | Role |
|---|---|---|
| Window size | W | Future positions explored each step by the lookahead branch.[^2] |
| N-gram size | N | Depth of Jacobi history kept; length of candidate n-grams.[^2] |
| Guess set size | G | Maximum number of n-gram candidates verified per step (often G equals W).[^2] |
In the LMSYS blog announcement, the authors gave recommended A100 configurations for greedy decoding of Llama 2-Chat models:[^2]
| Model | W | N | Reported per-step overhead |
|---|---|---|---|
| 7B | 15 | 5 | about 120x FLOPs per step |
| 13B | 10 | 5 | about 80x FLOPs per step |
| 33B | 7 | 5 | about 56x FLOPs per step |
Despite these large multipliers on raw arithmetic, end-to-end latency drops because each forward pass at small batch sizes already underuses the GPU's floating-point pipeline, and the saved steps outweigh the per-step cost.[^2][^4] The authors caution that picking W and N too large can flip the trade-off and slow things down, especially on models or hardware where each step is closer to compute-bound.[^2]
The authors also observed a scaling regime: when N is sufficiently large (the blog cites N around 11), exponentially increasing the future-token guesses can linearly reduce the number of decoding steps. This is the basis for the multi-GPU "Lookahead Parallelism" extension.[^2][^4]
The reference implementation is released as the LADE library at the GitHub repository hao-ai-lab/LookaheadDecoding, under the Apache 2.0 license.[^3] It can be installed via pip install lade or built from source; the entry point is a small monkey-patch over Hugging Face Transformers that swaps the generate decoding loop for the lookahead version with a few additional environment variables (LEVEL for N, WINDOW_SIZE for W, and GUESS_SET_SIZE for G).[^3] The repository's core algorithm lives in decoding.py, with model-specific glue in models/llama.py.[^3] As released, the library supports LLaMA-family models, including Llama 2 and Code Llama (broader support depends on community ports of the same monkey-patch pattern).[^3]
A CUDA path that fuses the lookahead and verification branches with a customized FlashAttention kernel is available as a prebuilt wheel.[^3] On a single A100, the maintainers report speedups of roughly 1.5x to 2.3x across the supported models and tasks; the precise figure depends on workload and parameters, and the library's README advises users to tune W, N, and G based on their hardware's compute-to-memory ratio.[^3]
For workloads with extra GPU budget, the paper's "Lookahead Parallelism" (LP) extension distributes the lookahead branch across multiple devices, so that the window can be widened to a much larger W without prohibitive single-device cost.[^1] When N is large enough (the blog cites a regime starting around N equal to 11), increasing the number of parallel future-token guesses exponentially translates into a roughly linear reduction in the number of decoding steps.[^1][^2] LP exploits this by running each segment of the lookahead window on its own GPU and merging the resulting candidates before verification, yielding the reported 4x speedup on 8-GPU code completion.[^1]
The original blog and paper evaluated lookahead decoding on Llama 2-Chat, Code Llama, and CodeLLaMA-Instruct across multiple model sizes, on A100 GPUs.[^1][^2]
| Workload | Model | Reported speedup |
|---|---|---|
| MT-Bench (multi-turn chat) | Llama 2-Chat 7B/13B/33B | about 1.5x[^2] |
| HumanEval (code completion) | Code Llama 7B/13B | about 2.3x[^2] |
| GSM8K (math word problems) | Code Llama-Instruct | about 1.8x[^2] |
| Code completion, 8-GPU Lookahead Parallelism | Code Llama | up to 4x[^1] |
The arXiv paper aggregates these into headline numbers of up to 1.8x on MT-Bench and up to 4x with multi-GPU strong scaling on code completion.[^1]
Memory overhead is small: the blog notes that even at decoding length 128, lookahead's auxiliary buffers raise GPU memory consumption by only about 0.6 percent compared with vanilla decoding.[^2]
The lookahead decoding library is built to drop into Hugging Face Transformers: applying a one-line monkey-patch over a Hugging Face model replaces the standard generate call with the lookahead variant, with no architecture changes required.[^3] A community feature request to upstream lookahead decoding directly into Transformers' assisted-generation API was opened in November 2023 as issue #27649.[^9] Hugging Face separately added related parallel-decoding tooling such as 4D attention masks, which simplify integrations like lookahead decoding that need block-structured masks.[^9]
A request to add lookahead decoding to vLLM was filed as issue #1742 on November 21, 2023, the same day as the LMSYS blog post, citing the reported 1.5x to 2x speedup without a draft model as motivation for integration.[^10] A second proposal, issue #1754, requested formal support for the technique a few days later. vLLM's broader speculative-decoding subsystem subsequently absorbed n-gram-based and tree-verification approaches alongside its draft-model and Medusa-style options, with lookahead-style n-gram speculation surfacing in the engine's mainline configurations.[^10]
SGLang, the inference engine that emerged from the LMSYS ecosystem and shares contributors with vLLM and the Hao AI Lab, added an n-gram-style speculative decoding mode and tracks lookahead-style approaches; its documentation enumerates EAGLE 2 and 3, multi-token prediction (MTP), classic draft-model speculation, and an n-gram variant among its supported speculative paths.[^11] A January 2025 GitHub issue (sgl-project/sglang #2772) proposed an explicit lookahead speculative-decoding mode for SGLang based on the N-gram / trie-retrieval pattern from lookahead decoding, motivated by RAG-style workloads where prompt and history share long substrings with the generation.[^11]
TensorRT-LLM, NVIDIA's production inference toolkit, exposes a family of speculative-sampling modes including draft-model speculation, Medusa, and EAGLE; lookahead decoding is widely referenced in this literature as the canonical training-free, draft-free baseline against which those methods are compared.[^11][^13]
Several follow-up lines of research build on or compare against lookahead decoding.
Consistency Large Language Models (CLLMs), introduced by Kou, Hu, He, Deng, and Zhang in March 2024, fine-tune the target model so that Jacobi iteration converges to the fixed point in many fewer steps than the untrained baseline.[^4] CLLMs preserve the same algorithmic shape as lookahead decoding (parallel Jacobi updates plus verification) but pay for their speedup in offline training rather than online FLOPs. The CLLM paper of March 2024 lists lookahead decoding as one of its principal baselines and reports that, with appropriate fine-tuning, the technique can deliver speedups competitive with draft-model speculative decoding while remaining draft-free at inference time.[^4]
Lookahead Parallelism (LP) is a distributed CUDA implementation, included in the ICML paper, that parallelizes both the lookahead and verification branches across multiple GPUs.[^1] It is the source of the headline 4x speedup on 8-GPU code completion. The parallelism is "strong scaling" in the parallel-computing sense: a fixed problem (one decoding request) is divided across more devices to finish faster, rather than throughput-style "weak scaling" of more requests at once.[^1]
Lookahead Reasoning, introduced by members of the same lab in 2025, extends the same trade-step-for-arithmetic principle to reasoning-trace generation: instead of parallelizing tokens, it parallelizes semantic reasoning steps and verifies them against the main model.[^12] The technique applies to chain-of-thought style models where each reasoning step needs to be semantically valid but not literally identical to the autoregressive output, and is explicitly framed as a complement to existing token-level speculative-decoding approaches.[^12]
N-gram-based speculative decoding in production engines, including the n-gram path inside SGLang and the lookup-decoding mode discussed in vLLM, borrows the n-gram-pool-plus-verification structure of lookahead decoding while sometimes simplifying or removing the Jacobi branch.[^10][^11] These engine-level n-gram methods are especially effective on workloads with high prompt-to-output substring overlap, such as retrieval-augmented question answering and code editing.[^11]
| Method | Auxiliary model required | Extra training | Output preserved | Source of parallel candidates |
|---|---|---|---|---|
| Standard autoregressive | No | No | n/a | n/a |
| Draft-model speculative decoding | Yes (draft model) | Sometimes (alignment) | Yes (exact under standard rules) | Draft model rollouts[^4] |
| Medusa | No, but extra heads on the base model | Yes (head training) | Yes (with tree verification) | Multiple prediction heads[^5] |
| EAGLE | No separate model, but a feature-level draft head | Yes (feature head training) | Yes | Feature-level draft head[^13] |
| Lookahead decoding | No | No | Yes (matches greedy decoding) | Jacobi iteration trajectory + n-gram cache[^1][^2] |
The defining trait of lookahead decoding within this landscape is that it is both training-free and draft-model-free, while still preserving the exact greedy output.[^1][^2] It pays for that with extra arithmetic per step, which works in its favor on memory-bandwidth-bound workloads but can become a liability when batch sizes are large enough to saturate the GPU compute pipeline.[^2]
Lookahead decoding sharpened a broader practical message about LLM serving: at typical inference batch sizes, decoding is bottlenecked by memory bandwidth rather than arithmetic, so substantial speedups are available by spending unused FLOPs to remove sequential steps.[^1][^4] It demonstrated that this could be done with no auxiliary model, no fine-tuning, and no change to the output distribution, lowering the engineering barrier for deployment in inference optimization stacks.[^1][^2]
Several systems-level conclusions followed. First, the technique made the case that the FLOP budget at small batch sizes is essentially free, an observation that has since been re-used in trained acceleration methods such as Medusa and CLLM.[^4][^5] Second, by relying on n-gram caching it showed that meaningful structure exists in Jacobi iteration trajectories, refuting the implicit assumption in earlier work that intermediate Jacobi states were merely transient noise.[^4] Third, by composing cleanly with FlashAttention and KV-cache layouts like PagedAttention, it demonstrated that exotic decoding loops could be retrofitted to optimized inference stacks without rewriting the underlying kernels.[^2][^3]
The technique has since become a standard reference baseline in the parallel-decoding literature and is enumerated alongside speculative decoding, Medusa, and EAGLE in surveys and engine documentation.[^4][^11] Its emphasis on trading per-step arithmetic for fewer total steps also influenced reasoning-stage parallelism work such as Lookahead Reasoning, which extends the same idea from tokens to reasoning steps.[^12]
Lookahead decoding is most beneficial in latency-sensitive, single-stream LLM serving scenarios, where one user or one request occupies most of the available GPU resources. Reported gains track these properties closely.[^1][^2]
By extension, the same setup applies to retrieval-augmented generation pipelines, where prompts and outputs frequently share substrings, and to local-inference workflows where a single user runs one stream at a time on a single device. In contrast, multi-tenant batched serving with large simultaneous batch sizes already saturates the GPU's compute pipeline, and the FLOP overhead of lookahead decoding becomes a real cost rather than free arithmetic; this is the regime in which production engines tend to fall back to draft-model speculation or trained heads.[^2][^11]
The blog and paper are forthright about the trade-offs.[^1][^2] The main cost is computational overhead: at the recommended A100 settings, the 7B configuration uses roughly 120x the per-step FLOPs of standard decoding, the 13B configuration uses about 80x, and the 33B configuration about 56x.[^2] Because LLM decoding at low batch sizes is memory-bandwidth bound, these multipliers translate into only modest per-step wall-clock cost; if W and N are set too high, however, the per-step cost can rise faster than the step count falls, undoing the benefit.[^2]
A second limitation is workload sensitivity. Speedup figures vary materially between MT-Bench, HumanEval, and GSM8K, and the optimal W, N, G triple depends on the model, the task, and the GPU.[^2] On hardware or batch regimes where the model is closer to compute-bound, the FLOP overhead becomes a real cost rather than free.
Third, vanilla greedy decoding is what is preserved by construction; other sampling distributions are supported but require the same verification logic to be applied.[^1] The reference implementation targets LLaMA-family decoders, and broader model-architecture support depends on community ports.[^3]
Finally, although the method needs no draft model or training, it still requires nontrivial tuning of W, N, and G per model and per task to realize the headline speedups.[^2]