See also: transformer, large language model, inference, KV cache, autoregressive model, knowledge distillation
Speculative decoding is an inference acceleration technique for autoregressive transformer models that generates multiple tokens per forward pass of the target model while provably preserving the output distribution. The technique was independently proposed by Leviathan, Kalman, and Matias at Google Research (November 2022) and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind (February 2023). Both papers demonstrated 2x to 3x latency reductions on large language models without retraining, changing model architectures, or altering the quality of generated text.
The core idea draws from speculative execution in computer architecture: a small, fast draft model generates a sequence of candidate tokens ("speculations"), and the larger target model verifies all of them in a single forward pass. Correct speculations are accepted, while incorrect ones are rejected and resampled using a modified rejection sampling scheme. Because the target model processes the speculated tokens in parallel rather than sequentially, the wall-clock time per accepted token decreases significantly. A mathematical proof guarantees that the output distribution is identical to that of the target model alone, making speculative decoding a lossless optimization.
Since its introduction, speculative decoding has become a standard component of production LLM serving systems. It is supported by major inference frameworks including vLLM, NVIDIA TensorRT-LLM, and Hugging Face Transformers. Multiple variants have been developed, including Medusa, EAGLE, self-speculative decoding, and lookahead decoding, each exploring different approaches to draft token generation.
Imagine you are writing a story, but you can only write one word at a time, and after each word you have to ask your teacher if it is the right word. This is very slow because you have to wait for the teacher after every single word.
Speculative decoding is like having a really fast friend who guesses the next several words for you all at once. Your friend writes down five words quickly, then you show all five to the teacher at the same time. The teacher checks them and says "the first three are correct, but the fourth one is wrong." You keep the three correct words (saving a lot of time) and fix only the wrong one. Your friend is not as smart as the teacher, but they are much faster, and many of their guesses turn out to be right. In the end, the story is exactly the same as if the teacher had written every word, but it gets finished much more quickly.
Autoregressive language models generate text one token at a time. At each step, the model takes all previously generated tokens as input, performs a full forward pass to compute a probability distribution over the vocabulary, and samples the next token. This process is inherently sequential: generating K tokens requires K serial forward passes through the model.
For modern LLMs with billions of parameters, each forward pass involves loading the entire set of model weights from GPU high-bandwidth memory (HBM) into the compute units. On an NVIDIA A100 GPU with 80 GB of HBM and 2 TB/s memory bandwidth, loading a 14 GB model (7B parameters in FP16) takes approximately 7 ms. The actual matrix multiplications for a single token take only about 0.5 ms. This means the GPU spends roughly 93% of each decoding step waiting for data to arrive from memory and only 7% performing useful computation.
The ratio of arithmetic operations to memory accesses is captured by the concept of arithmetic intensity, measured in FLOPs per byte. Modern GPUs like the A100 have a compute-to-bandwidth ratio of roughly 150:1 (312 TFLOPS vs. 2 TB/s). To keep the GPU's compute units fully utilized, a workload must perform at least 150 floating-point operations for every byte read from memory.
During autoregressive decoding with batch size 1, each forward pass processes a single token. The model weights must be loaded in full, but only a small number of operations are performed per weight element. The arithmetic intensity is far below the 150 FLOP/byte threshold, placing the workload squarely in the memory-bound regime. The GPU's massive compute capacity sits largely idle.
| Phase | Arithmetic intensity | Bottleneck | GPU compute utilization |
|---|---|---|---|
| Prefill (prompt processing) | High (many tokens processed simultaneously) | Compute | 60-80% |
| Autoregressive decode (batch size 1) | Very low (one token, full weight load) | Memory bandwidth | 5-10% |
| Autoregressive decode (large batch) | Moderate to high | Transitioning to compute | 30-70% |
Speculative decoding exploits this imbalance. By verifying K draft tokens in a single forward pass, the target model performs (K+1) times more useful computation for essentially the same memory access cost. This increases the arithmetic intensity and moves the workload closer to the compute-bound regime, making better use of available hardware.
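The effect can be seen with a rough calculation. The following Python sketch uses the approximate figures quoted above (a 7B-parameter FP16 model and roughly 2 FLOPs per parameter per generated token, both illustrative assumptions rather than measurements) to compare the arithmetic intensity of plain decoding against verifying K+1 = 6 tokens in one pass.

```python
# Back-of-the-envelope arithmetic intensity, assuming a 7B FP16 model and
# roughly 2 FLOPs per parameter per generated token (illustrative values only).
params = 7e9
bytes_per_param = 2                      # FP16 weights
flops_per_token = 2 * params             # one multiply-accumulate per parameter

def intensity(tokens_per_pass):
    # FLOPs performed per byte of weights streamed from HBM in one forward pass
    return tokens_per_pass * flops_per_token / (params * bytes_per_param)

print(intensity(1))   # ~1 FLOP/byte: far below the ~150 FLOP/byte needed to saturate compute
print(intensity(6))   # verifying K+1 = 6 tokens: ~6x more useful work per byte loaded
```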
Increasing the batch size is another way to improve GPU utilization during decoding, but it addresses throughput (total tokens per second across all requests) rather than latency (time per individual request). For interactive applications like chatbots and code assistants, users care about per-request latency. Speculative decoding reduces the latency of a single sequence's generation without requiring additional concurrent requests.
Speculative decoding involves two models: a small, fast draft model M_q that proposes candidate tokens, and the large target model M_p whose output distribution must be preserved exactly.
The algorithm proceeds in rounds. Each round consists of a drafting phase and a verification phase; a minimal code sketch of one round follows the steps below.
Draft generation: The draft model M_q autoregressively generates K candidate tokens (x_1, x_2, ..., x_K), sampling each from q(x_i | x_{<i}). This requires K sequential forward passes of the small model, which are fast due to the model's small size.
Parallel verification: The target model M_p processes the full sequence (original context plus all K draft tokens) in a single forward pass, computing the target distribution p(x_i | x_{<i}) at each of the K draft positions plus one additional position.
Token-by-token acceptance: For each draft token x_i in order (i = 1, 2, ..., K), accept x_i with probability min(1, p(x_i)/q(x_i)). If x_i is rejected, discard it and all subsequent draft tokens, sample a replacement token from the residual distribution max(0, p(x) - q(x)) normalized over the vocabulary, and end the round.
Bonus token: If all K draft tokens are accepted, sample one additional token from the target model's distribution at position K+1. This ensures that every round produces at least one new token verified by the target model.
Repeat: Return to step 1, continuing from the last accepted position.
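The following Python sketch puts these steps together for one round. It is a minimal illustration rather than a production implementation: draft_model and target_model are assumed to be callables returning normalized probability distributions over the vocabulary, and KV-cache handling, batching, and sampling parameters are omitted.

```python
import numpy as np

def speculative_round(target_model, draft_model, context, K, rng):
    """One round of drafting, verification, accept/reject, and bonus sampling."""
    # 1. Drafting: sample K candidate tokens autoregressively from the draft model.
    seq = list(context)
    draft_tokens, draft_dists = [], []
    for _ in range(K):
        q = draft_model(seq)                     # distribution over the vocabulary
        tok = rng.choice(len(q), p=q)
        draft_tokens.append(tok)
        draft_dists.append(q)
        seq.append(tok)

    # 2. Verification: one target forward pass over context + all K draft tokens,
    #    giving the target distribution at each draft position plus one extra position.
    target_dists = target_model(context, draft_tokens)   # list of K + 1 distributions

    # 3. Token-by-token acceptance.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                 # accept the draft token
        else:
            # Reject: resample from the normalized residual max(0, p - q)
            # and discard every draft token after this position.
            residual = np.maximum(p - q, 0.0)
            accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
            return accepted

    # 4. Bonus token: all K drafts were accepted, so sample one more token
    #    from the target distribution at the extra position.
    accepted.append(rng.choice(len(target_dists[K]), p=target_dists[K]))
    return accepted
```

Every token returned by such a round is either verified against or sampled directly from the target distribution, which is what underlies the losslessness guarantee discussed below.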
The acceptance probability for a single token depends on how well the draft distribution q(x) matches the target distribution p(x). Define the acceptance rate alpha as:
alpha = 1 - (1/2) * sum_x |p(x) - q(x)| = sum_x min(p(x), q(x))
Equivalently, alpha is one minus the total variation distance between p and q. When the two distributions are identical (alpha = 1), every draft token is accepted; when they differ substantially, the acceptance rate drops.
With K draft tokens and acceptance rate alpha, the expected number of tokens generated per round is:
E[tokens per round] = (1 - alpha^(K+1)) / (1 - alpha)
For alpha = 0.8 and K = 5, this gives approximately 3.7 tokens per round. The speedup over standard decoding depends on the relative costs of the draft and target model forward passes.
Let c be the ratio of the draft model's forward pass time to the target model's forward pass time (typically c is between 0.01 and 0.1 for a much smaller draft model). The expected speedup factor S is approximately:
S = (1 - alpha^(K+1)) / ((1 - alpha) * (c * K + 1))
For alpha = 0.8, K = 5, and c = 0.05, this yields S approximately equal to 3.0x. In practice, reported speedups range from 2x to 3.5x depending on the model pair, task, and hardware.
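These two formulas are easy to evaluate directly; the short Python check below reproduces the example figures above (the values of alpha and c are the assumed ones from the text, not measurements).

```python
def expected_tokens_per_round(alpha, K):
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def expected_speedup(alpha, K, c):
    return expected_tokens_per_round(alpha, K) / (c * K + 1)

print(expected_tokens_per_round(0.8, 5))   # ~3.69 tokens per round
print(expected_speedup(0.8, 5, 0.05))      # ~2.95x over standard decoding
```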
The defining property of speculative decoding is that it produces samples from exactly the same distribution as standard autoregressive sampling from the target model. This is not an approximation; it is a mathematical identity.
Consider a single position where the draft model proposes token x sampled from q(x). The speculative decoding procedure produces a token from the following combined distribution:
P(output = x) = P(accept x) * q(x) + P(reject) * p'(x)
where P(accept x) = min(1, p(x)/q(x)) and p'(x) = max(0, p(x) - q(x)) / Z with Z = sum_x max(0, p(x) - q(x)).
The probability of rejection is:
P(reject) = sum_x q(x) * max(0, 1 - p(x)/q(x)) = sum_x max(0, q(x) - p(x)) = Z
The final equality holds because p and q each sum to 1, so sum_x max(0, q(x) - p(x)) = sum_x max(0, p(x) - q(x)).
The total probability of outputting token x is:
P(output = x) = q(x) * min(1, p(x)/q(x)) + Z * max(0, p(x) - q(x)) / Z
Case 1: p(x) >= q(x). Then min(1, p(x)/q(x)) = 1, so the accept term is q(x). The resample term is p(x) - q(x). Total: q(x) + p(x) - q(x) = p(x).
Case 2: p(x) < q(x). Then min(1, p(x)/q(x)) = p(x)/q(x), so the accept term is q(x) * p(x)/q(x) = p(x). The resample term is max(0, p(x) - q(x)) = 0. Total: p(x) + 0 = p(x).
In both cases, P(output = x) = p(x), confirming that the output distribution matches the target model exactly.
The acceptance criterion acts as a filter. When the draft model assigns too much probability to a token relative to the target model (q(x) > p(x)), the token is accepted only with probability p(x)/q(x), trimming the excess. The probability mass removed by rejection is redistributed via the resample distribution p'(x), which captures precisely the tokens that the target model favors more than the draft model. These two corrections cancel out perfectly, restoring the target distribution.
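The cancellation can also be checked numerically. The short sketch below draws an arbitrary target distribution p and draft distribution q and confirms that the acceptance mass plus the resampling mass reproduces p exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                                        # toy vocabulary size
p = rng.random(V); p /= p.sum()              # target distribution
q = rng.random(V); q /= q.sum()              # draft distribution

accept_mass = q * np.minimum(1.0, p / q)     # probability of proposing x and accepting it
residual = np.maximum(p - q, 0.0)            # unnormalized resample distribution
Z = residual.sum()                           # total rejection probability
# P(output = x) = accept_mass(x) + Z * residual(x) / Z = accept_mass(x) + residual(x)
assert np.allclose(accept_mass + residual, p)
```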
The choice of draft model significantly affects the speed and acceptance rate of speculative decoding. Several strategies have been developed.
| Strategy | Draft source | Training required | Extra memory | Typical speedup |
|---|---|---|---|---|
| Independent small model | Separate smaller model (e.g., 68M for a 7B target) | No (use existing model) | Yes (load second model) | 2-3x |
| Fine-tuned draft model | Small model distilled from target | Yes | Yes | 2.5-3.5x |
| Self-speculative (layer skipping) | Target model with layers skipped | No | No (shared weights) | 1.5-2x |
| Medusa heads | Extra prediction heads on target model | Yes (heads only) | Minimal | 2-3.6x |
| EAGLE | Lightweight feature-level predictor | Yes (predictor only) | Minimal | 2.7-6.5x |
| N-gram / prompt lookup | N-gram matching from prompt context | No | No | 1.5-2.5x (input-heavy tasks) |
The simplest approach uses a pre-existing smaller model from the same family as the target to serve as the draft. For example, LLaMA 68M can serve as a draft model for LLaMA 7B, or T5-Small (60M) for T5-XXL (11B). The key requirement is that both models share the same tokenizer and vocabulary.
Leviathan et al. (2023) used this approach to achieve 2x to 3x speedups on T5-XXL. Chen et al. (2023) demonstrated 2x to 2.5x speedups on Chinchilla 70B in a distributed setting. The main drawback is that the draft model must be loaded into GPU memory alongside the target model, consuming additional resources.
Training a draft model specifically for the target model via knowledge distillation can improve the acceptance rate. The draft model learns to approximate the target model's distribution more closely than a generic small model would. Google reported using distilled draft models in production for AI Overviews in Google Search, achieving faster response times while maintaining quality.
For tasks where the output heavily overlaps with the input (such as text editing, code completion with context, or summarization), draft tokens can be extracted directly from the input prompt using n-gram matching. The system looks for n-grams in the prompt that match the most recently generated tokens and uses the continuation as the draft. This approach requires no additional model, no training, and no extra memory, but its effectiveness depends on the input-output overlap.
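A minimal sketch of this idea is shown below; the function name and parameters are illustrative rather than any particular framework's API.

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, num_draft_tokens=5):
    """Propose draft tokens by matching the latest generated n-gram against the prompt."""
    pattern = generated_ids[-ngram_size:]
    if len(pattern) < ngram_size:
        return []                                          # not enough generated context yet
    # Scan backwards so the most recent occurrence in the prompt wins.
    for start in range(len(prompt_ids) - ngram_size, -1, -1):
        if prompt_ids[start:start + ngram_size] == pattern:
            end = start + ngram_size
            return prompt_ids[end:end + num_draft_tokens]  # continuation becomes the draft
    return []                                              # no match: fall back to plain decoding
```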
Medusa, developed by Cai et al. (2024), eliminates the need for a separate draft model by adding multiple prediction heads to the target model itself. Each head predicts a token at a different future position. For example, head 1 predicts the token at position t+1, head 2 predicts the token at position t+2, and so on.
During generation, each head produces a set of top-k candidate tokens, and these candidates are combined into a tree of possible continuations. All candidates in the tree are verified simultaneously using a tree attention mechanism. The target model processes the tree in a single forward pass, and the longest correct path through the tree is accepted.
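The sketch below illustrates the general shape of such heads in PyTorch: each head is a small residual block over the target model's final hidden state, producing logits for one future position. The exact head architecture, number of heads, and tree construction follow the Medusa paper; the dimensions and names here are placeholders.

```python
import torch
import torch.nn as nn

class SpeculationHead(nn.Module):
    """One Medusa-style head: a small residual MLP plus an output projection."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.out = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden):                  # hidden: (batch, hidden_size)
        return self.out(hidden + self.act(self.proj(hidden)))

# Head k predicts the token at position t + k; top-k candidates from each head
# are combined into a tree and verified by the target model in one forward pass.
heads = nn.ModuleList(SpeculationHead(hidden_size=4096, vocab_size=32000) for _ in range(4))
last_hidden = torch.randn(1, 4096)              # final hidden state of the target model
candidates = [head(last_hidden).topk(k=5, dim=-1).indices for head in heads]
```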
Medusa comes in two variants: Medusa-1 trains only the added heads on top of a frozen backbone, leaving the target model's output unchanged, while Medusa-2 fine-tunes the heads jointly with the backbone for higher speedup at the cost of exact losslessness.
A key advantage of Medusa is that it does not require a separate draft model, simplifying deployment and avoiding the memory overhead of loading two models. The added heads are small (typically a single linear layer or a small MLP per head) and add negligible compute and memory overhead.
Medusa was published at ICML 2024.
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), proposed by Li et al. (2024), takes a different approach by performing autoregression at the feature level rather than the token level. The key insight is that predicting the next token's hidden representation (specifically, the second-to-top-layer features) is easier and more reliable than predicting the next token directly.
EAGLE adds a lightweight autoregressive head that takes the target model's hidden states as input and predicts the feature vector for the next position. By also incorporating the token embedding shifted forward by one time step, EAGLE resolves the inherent uncertainty in feature-level prediction.
The EAGLE family has progressed through three versions: EAGLE-1 introduced the feature-level autoregressive draft head; EAGLE-2 added a dynamic draft tree whose shape adapts to the draft head's confidence; and EAGLE-3 combined multi-level feature fusion with the dynamic tree, further raising acceptance rates and speedups.
Self-speculative decoding uses the target model itself as both the drafter and the verifier, eliminating the need for any additional model or trainable parameters. The drafting phase uses an approximate version of the target model, created by skipping a subset of intermediate attention and feed-forward layers. The verification phase runs the full model.
Zhang et al. (2024) proposed this approach in "Draft and Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding," published at ACL 2024. By selectively skipping layers during drafting, the model generates tokens faster (fewer layers means fewer computations) at slightly lower quality. The verification step then ensures that the final output matches the full model's distribution exactly.
A related approach is LayerSkip (Elhoushi et al., 2024), which trains the model with early exit objectives so that intermediate layers produce useful predictions. During inference, the early layers generate draft tokens, and the remaining layers verify them. LayerSkip achieves up to 1.99x speedup on LLaMA-2 variants and enables sharing the KV cache between the draft and verification phases, avoiding redundant computation.
The main advantage of self-speculative methods is that they are plug-and-play: no extra model needs to be trained, stored, or loaded. The main limitation is that the speedup is typically lower than with a dedicated draft model, because skipping layers provides a less significant speed advantage than using a much smaller model.
Lookahead decoding, proposed by Fu et al. (2024) and published at ICML 2024, takes a fundamentally different approach that does not use a draft model at all. Instead, it reformulates autoregressive decoding as solving a system of nonlinear equations and applies the Jacobi iteration method to generate multiple tokens in parallel.
The algorithm maintains two concurrent operations: a lookahead branch that runs Jacobi iteration steps to generate candidate n-grams in parallel, and a verification branch that checks previously collected n-grams against the model's output and accepts those that match.
Lookahead decoding is exact (lossless) and does not require any auxiliary model, training data, or data store. It achieves 1.5x to 2.3x speedup, with higher gains on longer generation tasks. Its main advantage is simplicity of deployment; its main limitation is that the speedup is generally lower than methods that use a dedicated draft model or learned prediction heads.
SpecInfer, proposed by Miao et al. (2024) and published at ASPLOS 2024, extends speculative decoding to use multiple small draft models that collectively generate a token tree of candidate sequences. Rather than a single draft sequence, SpecInfer organizes speculations into a tree structure where each node represents a candidate token and each path from root to leaf represents a possible continuation.
The target model verifies all paths in the tree simultaneously using a tree-based parallel decoding mechanism. This increases the probability that at least one path matches the target model's output, improving the effective acceptance rate. SpecInfer achieved 1.5x to 2.8x speedup for distributed LLM inference and 2.6x to 3.5x for offloading-based inference.
Several speculative decoding variants (Medusa, EAGLE, SpecInfer) use tree-structured speculation rather than a single linear draft sequence. Instead of generating one sequence of K tokens, the draft model generates a tree where each node branches into multiple candidate continuations.
Tree-based verification processes all branches simultaneously using a modified attention mask (tree attention). The target model evaluates every path in the tree in a single forward pass, and the longest path that passes verification is accepted. This approach trades additional computation for a higher probability of accepting long sequences of tokens.
The width and depth of the draft tree can be configured to balance between exploration (more branches, higher acceptance probability) and efficiency (fewer branches, lower computational cost per verification step).
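A simple way to see how tree attention works is to build the mask explicitly: each speculated node may attend to itself and its ancestors (plus the shared prefix, omitted here), while sibling branches remain invisible to each other. The sketch below constructs such a mask for an illustrative five-node tree.

```python
import numpy as np

def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent within the tree, or -1 for a root."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        node = i
        while node != -1:            # walk up to the root, allowing attention to ancestors
            mask[i, node] = True
            node = parents[node]
    return mask

# Two root candidates (0 and 1); nodes 2 and 3 continue node 0, node 4 continues node 1.
print(tree_attention_mask([-1, -1, 0, 0, 1]).astype(int))
```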
Choosing an appropriate draft model involves balancing speed and accuracy: a smaller draft proposes tokens more cheaply but matches the target distribution less well, while a larger draft raises the acceptance rate at the cost of slower drafting.
Empirical studies suggest that the draft model should be approximately 10x to 100x smaller than the target model for optimal results.
The number of draft tokens K per round is a tunable hyperparameter. Larger K means more potential tokens per round but also more wasted computation if early tokens are rejected (since all tokens after a rejection are discarded). The optimal K depends on the acceptance rate alpha:
| Acceptance rate (alpha) | Optimal K | Expected tokens per round |
|---|---|---|
| 0.5 | 3-4 | 1.9-2.0 |
| 0.7 | 4-6 | 2.8-3.3 |
| 0.8 | 5-8 | 3.6-4.5 |
| 0.9 | 8-12 | 5.5-7.5 |
| 0.95 | 12-20 | 8.0-12.0 |
Some implementations (such as EAGLE-2) dynamically adjust K based on the draft model's confidence, using more draft tokens when predictions are confident and fewer when they are uncertain.
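In the simplest setting, a reasonable starting point for K can be obtained from the expected-speedup formula given earlier by sweeping candidate draft lengths; the sketch below does exactly that, with alpha and c treated as values that must be measured for the actual model pair.

```python
def expected_speedup(alpha, K, c):
    return (1 - alpha ** (K + 1)) / ((1 - alpha) * (c * K + 1))

def best_draft_length(alpha, c, max_K=32):
    # Pick the K that maximizes the expected speedup for this (alpha, c) pair.
    return max(range(1, max_K + 1), key=lambda K: expected_speedup(alpha, K, c))

print(best_draft_length(alpha=0.8, c=0.05))   # -> 8 for these illustrative values
```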
Speculative decoding is implemented in all major LLM serving frameworks:
| Framework | Supported methods | Notes |
|---|---|---|
| vLLM | Draft model, EAGLE, EAGLE-3, Medusa, n-gram, MLPSpeculator | Integrated into the PagedAttention-based serving engine |
| TensorRT-LLM | Draft model, EAGLE, Medusa, n-gram (prompt lookup) | NVIDIA-optimized CUDA kernels; up to 3.6x throughput boost reported |
| Hugging Face Transformers | Draft model (assisted generation) | Available via model.generate(assistant_model=...); see the usage sketch below |
| SGLang | Draft model, EAGLE | Supports RadixAttention for efficient prefix sharing |
| llama.cpp | Draft model, prompt lookup | CPU and GPU inference with GGUF models |
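As a concrete example of the assisted-generation path in the table above, the snippet below loads a target model and a smaller draft from the same family and passes the draft via the assistant_model argument. The OPT checkpoints are only an illustrative pair that shares a tokenizer, and argument names may differ slightly across library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and draft must share a tokenizer; facebook/opt-1.3b and facebook/opt-125m do.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```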
Speculative decoding provides the largest benefits under specific conditions: small batch sizes where decoding is memory-bandwidth-bound, latency-sensitive interactive workloads, and model pairs whose draft distribution closely matches the target. Outside these conditions, several practical limitations apply.
Extending speculative decoding to batched inference introduces the "ragged tensor" problem. Different sequences in the same batch may accept different numbers of draft tokens, resulting in variable-length outputs that break the regular tensor shapes required for efficient GPU computation. Sequences must be padded or processed with custom attention masks, adding implementation complexity and potentially reducing throughput.
When using an independent draft model, both the draft and target models must reside in GPU memory simultaneously. For memory-constrained deployments, this can be prohibitive. Methods like self-speculative decoding and Medusa address this by reusing the target model's weights.
The speedup of speculative decoding depends heavily on how well the draft model's distribution matches the target model's distribution. If the two models are trained on different data or have significantly different capabilities, the acceptance rate may be too low to provide meaningful acceleration. Fine-tuning or distilling the draft model specifically for the target model can improve alignment but adds training cost.
Paradoxically, making the draft model more capable (and therefore more accurate) often makes it slower, reducing the speed ratio between draft and target models. There is an inherent tension between draft quality (high acceptance rate) and draft speed (low cost per draft token). Finding the optimal balance point requires empirical experimentation for each target model and task.
Implementing speculative decoding efficiently requires careful management of the KV cache (draft tokens may need to be evicted upon rejection), dynamic sequence lengths, and synchronization between draft and target model execution. Production-quality implementations in frameworks like vLLM and TensorRT-LLM represent significant engineering effort.
| Method | Year | Venue | Draft source | Training needed | Extra memory | Speedup range | Lossless |
|---|---|---|---|---|---|---|---|
| Speculative Decoding (Leviathan et al.) | 2023 | ICML | Independent small model | No | Yes | 2-3x | Yes |
| Speculative Sampling (Chen et al.) | 2023 | arXiv | Independent small model | No | Yes | 2-2.5x | Yes |
| SpecInfer (Miao et al.) | 2024 | ASPLOS | Multiple small models (tree) | No | Yes | 1.5-3.5x | Yes |
| Medusa (Cai et al.) | 2024 | ICML | Extra prediction heads | Yes (heads) | Minimal | 2.2-3.6x | Medusa-1: Yes; Medusa-2: Approximate |
| EAGLE-1 (Li et al.) | 2024 | ICML | Feature-level predictor | Yes (predictor) | Minimal | 2.7-3.5x | Yes |
| EAGLE-2 (Li et al.) | 2024 | EMNLP | Feature-level predictor + dynamic tree | Yes (predictor) | Minimal | 3.05-4.26x | Yes |
| EAGLE-3 (Li et al.) | 2025 | NeurIPS | Multi-level feature fusion + dynamic tree | Yes (predictor) | Minimal | 3.0-6.5x | Yes |
| Self-Speculative (Zhang et al.) | 2024 | ACL | Layer-skipped target model | No | None | 1.5-2x | Yes |
| LayerSkip (Elhoushi et al.) | 2024 | ACL | Early exit from target model | Yes (early exit training) | None | 1.5-2x | Yes |
| Lookahead Decoding (Fu et al.) | 2024 | ICML | Jacobi iteration (no model) | No | None | 1.5-2.3x | Yes |
| N-gram / Prompt Lookup | 2023 | Open source | Input text n-gram matching | No | None | 1.5-2.5x | Yes |
The concept of speculative decoding was introduced in two independent papers that developed the same core idea concurrently.
Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research posted "Fast Inference from Transformers via Speculative Decoding" on arXiv on November 30, 2022. The paper was subsequently accepted as an oral presentation at ICML 2023. They demonstrated 2x to 3x speedups on T5-XXL (11B parameters) using a T5-Small (60M parameters) draft model, with no change to the output quality.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper at DeepMind published "Accelerating Large Language Model Decoding with Speculative Sampling" on arXiv on February 2, 2023. They demonstrated 2x to 2.5x speedups on Chinchilla (70B parameters) in a distributed computing environment.
Both papers independently arrived at the same modified rejection sampling scheme for preserving the target distribution. In a May 2023 revision, Leviathan et al. acknowledged Chen et al.'s work as a concurrent and independent development.
Google reported in 2024 that speculative decoding had been deployed in production for AI Overviews in Google Search, using distilled draft models to generate faster responses. The technique has since been adopted broadly across the industry, with inference frameworks, cloud providers, and hardware vendors all incorporating support for it.