Prompt lookup decoding
Last reviewed
Jun 8, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,025 words
Improve this article
Add missing citations, update stale details, or suggest a clearer explanation.
Last reviewed
Jun 8, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 2,025 words
Add missing citations, update stale details, or suggest a clearer explanation.
Prompt lookup decoding (PLD), also called n-gram speculative decoding, is an inference acceleration method for large language models that speeds up text generation without changing the model or its outputs. It is a variant of speculative decoding in which the usual small "draft" model is replaced by a simple string search over the prompt. At each step the method takes the last few tokens the model has produced, looks for an earlier occurrence of that same n-gram in the input context, and, if it finds one, proposes the tokens that followed that occurrence as a block of candidate continuation tokens. The large target model then verifies the whole block in a single forward pass, accepting as many leading tokens as match what it would have generated anyway. [1]
The technique was proposed by Apoorv Saxena (full name Apoorv Umang Saxena) in November 2023 and released as a small open-source repository rather than a formal paper. [1][10] It spread quickly through the practitioner community and was integrated into the major inference stacks within a few months: it became the prompt_lookup_num_tokens option in Hugging Face Transformers in January 2024, the ngram speculative-decoding method in vLLM, and the "NGram" speculative-decoding mode in NVIDIA TensorRT-LLM. [7][8][9] Its appeal is that it requires no separate draft model, no training, and no external datastore, yet delivers reported speedups of roughly 2x to 4x on input-grounded tasks such as summarization, document question answering, retrieval-augmented generation, and code editing, with output identical to standard decoding. [1]
Autoregressive generation is slow because each new token requires a full forward pass of the model, and those passes are dominated by the memory cost of streaming the model weights and the KV cache rather than by arithmetic. Generating N tokens therefore means N sequential, memory-bound passes.
Speculative decoding, proposed independently by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research (arXiv, November 2022; ICML 2023) and by Charlie Chen and colleagues at Google DeepMind (February 2023), attacks this bottleneck by separating drafting from verification. [2][3] A cheap "drafter" proposes a block of several candidate tokens. The expensive target model then processes all of those candidate positions in one forward pass, because a transformer can score many positions in parallel almost as cheaply as it scores one. A verification rule decides how many of the drafted tokens to keep. Under greedy decoding the model keeps each drafted token that matches its own argmax and stops at the first mismatch; under sampling, a modified rejection-sampling step preserves the target model's exact output distribution. Either way the result is guaranteed to be identical, in distribution, to what the target model would have produced on its own, so the speedup is lossless. The classic formulation uses a smaller language model of the same family as the drafter, which yields 2x to 3x latency reductions but requires hosting and running that second model. [2][3]
The key insight that PLD exploits is that the drafter does not have to be a neural network. Anything that can cheaply guess the next several tokens, and is right often enough, will produce a net speedup, because every accepted token is a token the target model did not have to generate sequentially.
PLD replaces the draft model with a lookup in the text the model has already seen. The procedure at each generation step is:
max_ngram_size to 3), and falls back to shorter n-grams if no match is found.num_pred_tokens to 10) and propose them as the candidate continuation. If no match is found at any n-gram length, propose nothing and fall back to one ordinary decoding step.Because the drafted tokens are an exact copy from context and the target model still verifies every one of them, the generated text is bit-for-bit identical to standard greedy decoding, and distribution-identical under sampling. The drafting cost is a string search rather than a neural forward pass, so it is effectively free compared with running a draft model. The method works with any decoder-only model out of the box, with both greedy and sampling decoding, and needs no fine-tuning. [1]
Saxena benchmarked the original implementation on Mistral-7B-Instruct-v0.1 (see Mistral) on a single A100 40GB GPU under greedy decoding, reporting roughly a 2.4x average speedup on summarization and context-grounded question answering. [1] On open-ended chat (MT-Bench) the gains were smaller and uneven: the first turn, where the only context is a short user question, offers little to copy, whereas later turns can copy from the assistant's own earlier replies, and coding conversations in particular showed large gains because the model reproduces long spans of previously written code. [1]
The method rests on a simple empirical observation: in many real applications the output reuses large verbatim chunks of the input. [1][4] A summarizer lifts named entities, numbers, and whole phrases from the source; a retrieval-augmented generation system quotes the passages it was given; a code assistant editing a file reproduces most of the surrounding code and changes only a few lines; document question answering copies spans from the supporting text. In all of these cases there is high n-gram overlap between input and output, so when the model is about to emit a span that already appears in context, the lookup finds it and supplies the whole span as a draft that the target model accepts in one pass.
The economics follow directly from speculative decoding: one forward pass that verifies a block of ten tokens costs only marginally more than a pass that produces one, because both are limited by weight and cache movement rather than by arithmetic, so every accepted copied block yields those tokens at a fraction of their normal cost. The more the output echoes the input, the higher the acceptance rate and the larger the speedup, which is why PLD pairs naturally with long-context workloads where a large prompt is heavily reused. [1]
PLD belongs to a family of methods that keep the verification half of speculative decoding but obtain draft tokens by retrieval or copying instead of from a model. Its closest ancestor is LLMA, introduced as "Inference with Reference" by Nan Yang and colleagues at Microsoft in April 2023, which copies token spans from a reference text (such as a retrieved document) and verifies them in parallel for over 2x speedup. [4] PLD can be seen as a stripped-down, drop-in form of the same idea that needs no separate reference: it searches the prompt itself. REST (Retrieval-Based Speculative Decoding) by Zhenyu He and colleagues (NAACL 2024) instead retrieves draft continuations from an external datastore built over a corpus, trading a memory footprint for the ability to draft text that is not present in the current prompt. [5] Lookahead decoding by Yichao Fu and colleagues (ICML 2024) is also draft-model-free but generates candidate n-grams on the fly using parallel Jacobi iteration rather than copying them from context. [6] The table below summarizes the contrast.
| Method | Source of draft tokens | Needs draft model | Needs external store | Best case |
|---|---|---|---|---|
| Classic speculative decoding [2][3] | Smaller LLM of the same family | Yes | No | General generation |
| LLMA / Inference with Reference [4] | Copy spans from a given reference text | No | No (reference supplied) | Output overlaps a reference |
| REST [5] | Retrieve continuations from a corpus datastore | No | Yes | Common phrasings in the corpus |
| Lookahead decoding [6] | n-grams generated by parallel Jacobi decoding | No | No | General generation |
| Prompt lookup decoding [1] | Copy spans from the prompt and prior output | No | No | Output overlaps the input |
All of these methods share speculative decoding's lossless guarantee, since the target model always verifies the draft. They differ only in how the draft is produced. PLD is the simplest member of the group: a plain n-gram match against text already in the context window. This simplicity is why it was adopted so widely. In Transformers it is the prompt_lookup_num_tokens argument to generate, where it shares the assisted-generation code path with model-based speculation. [7] In vLLM it is the ngram method, configured with num_speculative_tokens plus minimum and maximum lookup lengths (prompt_lookup_min and prompt_lookup_max). [8] In TensorRT-LLM the "NGram" mode is documented explicitly as an implementation of prompt lookup decoding, maintaining a map from token prefixes to candidate draft sequences drawn from the prompt and prior outputs. [9]
The method's strength is also its boundary. Because the only drafts it can offer are verbatim copies from context, it provides little or no speedup when the output does not reuse the input. [1] Open-ended writing, brainstorming, translation, and chain-of-thought reasoning that introduces new tokens all have low n-gram overlap, so lookups usually fail and generation falls back to one token per pass; the first turn of a short-prompt conversation is a common example with nothing to copy. When lookups fail there is a small constant overhead for the search and for verifying rejected tokens, so on copy-free workloads the net effect ranges from neutral to slightly negative, though this is typically minor.
Other constraints follow from the copying mechanism. The drafter can only propose an exact token sequence that already occurred, so it captures verbatim reuse but not the paraphrases a learned draft model could sometimes guess, and because matching is at the token level, a phrase that re-tokenizes differently in its new position may not match. The basic algorithm also follows the first match it finds and proposes a single linear continuation rather than a branching tree, so one wrong token ends acceptance for that step, though production implementations add tree-style verification and tuned defaults. The two hyperparameters trade off as well: a larger maximum n-gram makes matches more precise but rarer, and a longer proposed block helps on long copied spans but wastes compute when acceptance is short. None of this affects correctness, since verification preserves the target model's exact output; it affects only how much speedup is realized. [1][7][8]