Prompt lookup decoding

AI Infrastructure Machine Learning

10 min read

Updated Jun 8, 2026

Suggest edit History Talk

RawGraph

Last edited

Jun 8, 2026

Fact-checked

In review queue

Sources

10 citations

Revision

v1 · 2,025 words

Fact-checks are independent of edits: a reviewer re-verifies the article against its sources and stamps the date. How we verify

Overview

Prompt lookup decoding (PLD), also called n-gram speculative decoding, is an inference acceleration method for large language models that speeds up text generation without changing the model or its outputs. It is a variant of speculative decoding in which the usual small "draft" model is replaced by a simple string search over the prompt. At each step the method takes the last few tokens the model has produced, looks for an earlier occurrence of that same n-gram in the input context, and, if it finds one, proposes the tokens that followed that occurrence as a block of candidate continuation tokens. The large target model then verifies the whole block in a single forward pass, accepting as many leading tokens as match what it would have generated anyway. ^[1]

The technique was proposed by Apoorv Saxena (full name Apoorv Umang Saxena) in November 2023 and released as a small open-source repository rather than a formal paper. ^[1]^[10] It spread quickly through the practitioner community and was integrated into the major inference stacks within a few months: it became the prompt_lookup_num_tokens option in Hugging Face Transformers in January 2024, the ngram speculative-decoding method in vLLM, and the "NGram" speculative-decoding mode in NVIDIA TensorRT-LLM. ^[7]^[8]^[9] Its appeal is that it requires no separate draft model, no training, and no external datastore, yet delivers reported speedups of roughly 2x to 4x on input-grounded tasks such as summarization, document question answering, retrieval-augmented generation, and code editing, with output identical to standard decoding. ^[1]

Background: speculative decoding

Autoregressive generation is slow because each new token requires a full forward pass of the model, and those passes are dominated by the memory cost of streaming the model weights and the KV cache rather than by arithmetic. Generating N tokens therefore means N sequential, memory-bound passes.

Speculative decoding, proposed independently by Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research (arXiv, November 2022; ICML 2023) and by Charlie Chen and colleagues at Google DeepMind (February 2023), attacks this bottleneck by separating drafting from verification. ^[2]^[3] A cheap "drafter" proposes a block of several candidate tokens. The expensive target model then processes all of those candidate positions in one forward pass, because a transformer can score many positions in parallel almost as cheaply as it scores one. A verification rule decides how many of the drafted tokens to keep. Under greedy decoding the model keeps each drafted token that matches its own argmax and stops at the first mismatch; under sampling, a modified rejection-sampling step preserves the target model's exact output distribution. Either way the result is guaranteed to be identical, in distribution, to what the target model would have produced on its own, so the speedup is lossless. The classic formulation uses a smaller language model of the same family as the drafter, which yields 2x to 3x latency reductions but requires hosting and running that second model. ^[2]^[3]

The key insight that PLD exploits is that the drafter does not have to be a neural network. Anything that can cheaply guess the next several tokens, and is right often enough, will produce a net speedup, because every accepted token is a token the target model did not have to generate sequentially.

How prompt lookup decoding works

PLD replaces the draft model with a lookup in the text the model has already seen. The procedure at each generation step is:

Take the last k tokens of the current sequence (the prompt plus everything generated so far) as a search n-gram. The method tries the longest n-gram first, up to a configurable maximum (the original implementation defaults max_ngram_size to 3), and falls back to shorter n-grams if no match is found.
Scan the whole context for an earlier occurrence of that n-gram, using exact token matching over a sliding window.
If a match is found, copy the next several tokens that followed it (the original implementation defaults num_pred_tokens to 10) and propose them as the candidate continuation. If no match is found at any n-gram length, propose nothing and fall back to one ordinary decoding step.
Run the target model once over the current sequence plus the proposed block, then accept the longest prefix of the block that agrees with the model's own predictions. As in any speculative scheme, at least one new token is always produced per pass, so the method never does worse than ordinary decoding by more than the small cost of the failed lookup. ^[1]

Because the drafted tokens are an exact copy from context and the target model still verifies every one of them, the generated text is bit-for-bit identical to standard greedy decoding, and distribution-identical under sampling. The drafting cost is a string search rather than a neural forward pass, so it is effectively free compared with running a draft model. The method works with any decoder-only model out of the box, with both greedy and sampling decoding, and needs no fine-tuning. ^[1]

Saxena benchmarked the original implementation on Mistral-7B-Instruct-v0.1 (see Mistral) on a single A100 40GB GPU under greedy decoding, reporting roughly a 2.4x average speedup on summarization and context-grounded question answering. ^[1] On open-ended chat (MT-Bench) the gains were smaller and uneven: the first turn, where the only context is a short user question, offers little to copy, whereas later turns can copy from the assistant's own earlier replies, and coding conversations in particular showed large gains because the model reproduces long spans of previously written code. ^[1]

Why it works for grounded tasks

The method rests on a simple empirical observation: in many real applications the output reuses large verbatim chunks of the input. ^[1]^[4] A summarizer lifts named entities, numbers, and whole phrases from the source; a retrieval-augmented generation system quotes the passages it was given; a code assistant editing a file reproduces most of the surrounding code and changes only a few lines; document question answering copies spans from the supporting text. In all of these cases there is high n-gram overlap between input and output, so when the model is about to emit a span that already appears in context, the lookup finds it and supplies the whole span as a draft that the target model accepts in one pass.

The economics follow directly from speculative decoding: one forward pass that verifies a block of ten tokens costs only marginally more than a pass that produces one, because both are limited by weight and cache movement rather than by arithmetic, so every accepted copied block yields those tokens at a fraction of their normal cost. The more the output echoes the input, the higher the acceptance rate and the larger the speedup, which is why PLD pairs naturally with long-context workloads where a large prompt is heavily reused. ^[1]

Relationship to other draft-free methods

PLD belongs to a family of methods that keep the verification half of speculative decoding but obtain draft tokens by retrieval or copying instead of from a model. Its closest ancestor is LLMA, introduced as "Inference with Reference" by Nan Yang and colleagues at Microsoft in April 2023, which copies token spans from a reference text (such as a retrieved document) and verifies them in parallel for over 2x speedup. ^[4] PLD can be seen as a stripped-down, drop-in form of the same idea that needs no separate reference: it searches the prompt itself. REST (Retrieval-Based Speculative Decoding) by Zhenyu He and colleagues (NAACL 2024) instead retrieves draft continuations from an external datastore built over a corpus, trading a memory footprint for the ability to draft text that is not present in the current prompt. ^[5] Lookahead decoding by Yichao Fu and colleagues (ICML 2024) is also draft-model-free but generates candidate n-grams on the fly using parallel Jacobi iteration rather than copying them from context. ^[6] The table below summarizes the contrast.

Method	Source of draft tokens	Needs draft model	Needs external store	Best case
Classic speculative decoding ^[2]^[3]	Smaller LLM of the same family	Yes	No	General generation
LLMA / Inference with Reference ^[4]	Copy spans from a given reference text	No	No (reference supplied)	Output overlaps a reference
REST ^[5]	Retrieve continuations from a corpus datastore	No	Yes	Common phrasings in the corpus
Lookahead decoding ^[6]	n-grams generated by parallel Jacobi decoding	No	No	General generation
Prompt lookup decoding ^[1]	Copy spans from the prompt and prior output	No	No	Output overlaps the input

All of these methods share speculative decoding's lossless guarantee, since the target model always verifies the draft. They differ only in how the draft is produced. PLD is the simplest member of the group: a plain n-gram match against text already in the context window. This simplicity is why it was adopted so widely. In Transformers it is the prompt_lookup_num_tokens argument to generate, where it shares the assisted-generation code path with model-based speculation. ^[7] In vLLM it is the ngram method, configured with num_speculative_tokens plus minimum and maximum lookup lengths (prompt_lookup_min and prompt_lookup_max). ^[8] In TensorRT-LLM the "NGram" mode is documented explicitly as an implementation of prompt lookup decoding, maintaining a map from token prefixes to candidate draft sequences drawn from the prompt and prior outputs. ^[9]

Limitations

The method's strength is also its boundary. Because the only drafts it can offer are verbatim copies from context, it provides little or no speedup when the output does not reuse the input. ^[1] Open-ended writing, brainstorming, translation, and chain-of-thought reasoning that introduces new tokens all have low n-gram overlap, so lookups usually fail and generation falls back to one token per pass; the first turn of a short-prompt conversation is a common example with nothing to copy. When lookups fail there is a small constant overhead for the search and for verifying rejected tokens, so on copy-free workloads the net effect ranges from neutral to slightly negative, though this is typically minor.

Other constraints follow from the copying mechanism. The drafter can only propose an exact token sequence that already occurred, so it captures verbatim reuse but not the paraphrases a learned draft model could sometimes guess, and because matching is at the token level, a phrase that re-tokenizes differently in its new position may not match. The basic algorithm also follows the first match it finds and proposes a single linear continuation rather than a branching tree, so one wrong token ends acceptance for that step, though production implementations add tree-style verification and tuned defaults. The two hyperparameters trade off as well: a larger maximum n-gram makes matches more precise but rarer, and a longer proposed block helps on long copied spans but wastes compute when acceptance is short. None of this affects correctness, since verification preserves the target model's exact output; it affects only how much speedup is realized. ^[1]^[7]^[8]

References

Apoorv Saxena. "Prompt Lookup Decoding." GitHub, November 2023. https://github.com/apoorvumang/prompt-lookup-decoding ↩
Yaniv Leviathan, Matan Kalman, Yossi Matias. "Fast Inference from Transformers via Speculative Decoding." ICML 2023. arXiv:2211.17192, November 2022. https://arxiv.org/abs/2211.17192 ↩
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper. "Accelerating Large Language Model Decoding with Speculative Sampling." arXiv:2302.01318, February 2023. https://arxiv.org/abs/2302.01318 ↩
Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei. "Inference with Reference: Lossless Acceleration of Large Language Models." arXiv:2304.04487, April 2023. https://arxiv.org/abs/2304.04487 ↩
Zhenyu He, Zexuan Zhong, Tianle Cai, Jason D. Lee, Di He. "REST: Retrieval-Based Speculative Decoding." NAACL 2024. arXiv:2311.08252, November 2023. https://arxiv.org/abs/2311.08252 ↩
Yichao Fu, Peter Bailis, Ion Stoica, Hao Zhang. "Break the Sequential Dependency of LLM Inference Using Lookahead Decoding." ICML 2024. arXiv:2402.02057, February 2024. https://arxiv.org/abs/2402.02057 ↩
Hugging Face. "Generation" (text generation, prompt_lookup_num_tokens) and "Assisted Generation." Transformers documentation, January 2024. https://huggingface.co/docs/transformers/main_classes/text_generation ↩
vLLM. "Speculative Decoding" (ngram method, prompt_lookup_min / prompt_lookup_max). vLLM Documentation, 2024. https://docs.vllm.ai/en/stable/features/speculative_decoding/ ↩
NVIDIA. "Speculative Decoding" (NGram method as an implementation of Prompt Lookup Decoding). TensorRT-LLM documentation, 2024. https://github.com/NVIDIA/TensorRT-LLM/blob/main/docs/source/features/speculative-decoding.md ↩
Simon Willison. "Prompt lookup decoding." Simon Willison's Weblog, January 23, 2024. https://simonwillison.net/2024/Jan/23/prompt-lookup-decoding/ ↩

Improve this article

Add missing citations, update stale details, or suggest a clearer explanation. Every suggestion is reviewed for sourcing before it goes live.

Suggest edit

What links here

Self-speculative decoding (LayerSkip)

Overview

Background: speculative decoding

How prompt lookup decoding works

Why it works for grounded tasks

Relationship to other draft-free methods

Limitations

References

Improve this article

Related Articles

Cloud TPU

Data Parallelism

Machine learning terms/Google Cloud

Model Parallelism

TPU Pod

TPU Node