See also: transformer, large language model, inference, KV cache, autoregressive model, knowledge distillation
Speculative decoding is an inference acceleration technique for autoregressive transformer models that generates multiple tokens per forward pass of the target model while provably preserving the output distribution. The technique was independently proposed by Leviathan, Kalman, and Matias at Google Research (November 2022) and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind (February 2023). Both papers demonstrated 2x to 3x latency reductions on large language models without retraining, changing model architectures, or altering the quality of generated text.
The core idea draws from speculative execution in computer architecture: a small, fast draft model generates a sequence of candidate tokens ("speculations"), and the larger target model verifies all of them in a single forward pass. Correct speculations are accepted, while incorrect ones are rejected and resampled using a modified rejection sampling scheme. Because the target model processes the speculated tokens in parallel rather than sequentially, the wall-clock time per accepted token decreases significantly. A mathematical proof guarantees that the output distribution is identical to that of the target model alone, making speculative decoding a lossless optimization.
Since its introduction, speculative decoding has become a standard component of production LLM serving systems. It is supported by major inference frameworks including vLLM, NVIDIA TensorRT-LLM, and Hugging Face Transformers. Multiple variants have been developed, including Medusa, EAGLE, self-speculative decoding, and lookahead decoding, each exploring different approaches to draft token generation.
Imagine you are writing a story, but you can only write one word at a time, and after each word you have to ask your teacher if it is the right word. This is very slow because you have to wait for the teacher after every single word.
Speculative decoding is like having a really fast friend who guesses the next several words for you all at once. Your friend writes down five words quickly, then you show all five to the teacher at the same time. The teacher checks them and says "the first three are correct, but the fourth one is wrong." You keep the three correct words (saving a lot of time) and fix only the wrong one. Your friend is not as smart as the teacher, but they are much faster, and many of their guesses turn out to be right. In the end, the story is exactly the same as if the teacher had written every word, but it gets finished much more quickly.
Autoregressive language models generate text one token at a time. At each step, the model takes all previously generated tokens as input, performs a full forward pass to compute a probability distribution over the vocabulary, and samples the next token. This process is inherently sequential: generating K tokens requires K serial forward passes through the model.
For modern LLMs with billions of parameters, each forward pass involves loading the entire set of model weights from GPU high-bandwidth memory (HBM) into the compute units. On an NVIDIA A100 GPU with 80 GB of HBM and 2 TB/s memory bandwidth, loading a 14 GB model (7B parameters in FP16) takes approximately 7 ms. The actual matrix multiplications for a single token take only about 0.5 ms. This means the GPU spends roughly 93% of each decoding step waiting for data to arrive from memory and only 7% performing useful computation.
The ratio of arithmetic operations to memory accesses is captured by the concept of arithmetic intensity, measured in FLOPs per byte. Modern GPUs like the A100 have a compute-to-bandwidth ratio of roughly 150:1 (312 TFLOPS vs. 2 TB/s). To keep the GPU's compute units fully utilized, a workload must perform at least 150 floating-point operations for every byte read from memory.
During autoregressive decoding with batch size 1, each forward pass processes a single token. The model weights must be loaded in full, but only a small number of operations are performed per weight element. The arithmetic intensity is far below the 150 FLOP/byte threshold, placing the workload squarely in the memory-bound regime. The GPU's massive compute capacity sits largely idle.
| Phase | Arithmetic intensity | Bottleneck | GPU compute utilization |
|---|---|---|---|
| Prefill (prompt processing) | High (many tokens processed simultaneously) | Compute | 60-80% |
| Autoregressive decode (batch size 1) | Very low (one token, full weight load) | Memory bandwidth | 5-10% |
| Autoregressive decode (large batch) | Moderate to high | Transitioning to compute | 30-70% |
Speculative decoding exploits this imbalance. By verifying K draft tokens in a single forward pass, the target model performs (K+1) times more useful computation for essentially the same memory access cost. This increases the arithmetic intensity and moves the workload closer to the compute-bound regime, making better use of available hardware.
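The effect can be seen with a rough calculation. The following Python sketch uses the approximate figures quoted above (a 7B-parameter FP16 model and roughly 2 FLOPs per parameter per generated token, both illustrative assumptions rather than measurements) to compare the arithmetic intensity of plain decoding against verifying K+1 = 6 tokens in one pass.

```python
# Back-of-the-envelope arithmetic intensity, assuming a 7B FP16 model and
# roughly 2 FLOPs per parameter per generated token (illustrative values only).
params = 7e9
bytes_per_param = 2                      # FP16 weights
flops_per_token = 2 * params             # one multiply-accumulate per parameter

def intensity(tokens_per_pass):
    # FLOPs performed per byte of weights streamed from HBM in one forward pass
    return tokens_per_pass * flops_per_token / (params * bytes_per_param)

print(intensity(1))   # ~1 FLOP/byte: far below the ~150 FLOP/byte needed to saturate compute
print(intensity(6))   # verifying K+1 = 6 tokens: ~6x more useful work per byte loaded
```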
Increasing the batch size is another way to improve GPU utilization during decoding, but it addresses throughput (total tokens per second across all requests) rather than latency (time per individual request). For interactive applications like chatbots and code assistants, users care about per-request latency. Speculative decoding reduces the latency of a single sequence's generation without requiring additional concurrent requests.
Speculative decoding involves two models: a small, fast draft model M_q that proposes candidate tokens, and the large target model M_p whose output distribution must be preserved exactly.
The algorithm proceeds in rounds. Each round consists of a drafting phase and a verification phase; a minimal code sketch of one round follows the steps below.
Draft generation: The draft model M_q autoregressively generates K candidate tokens (x_1, x_2, ..., x_K), sampling each from q(x_i | x_{<i}). This requires K sequential forward passes of the small model, which are fast due to the model's small size.
Parallel verification: The target model M_p processes the full sequence (original context plus all K draft tokens) in a single forward pass, computing the target distribution p(x_i | x_{<i}) at each of the K draft positions plus one additional position.
Token-by-token acceptance: For each draft token x_i in order (i = 1, 2, ..., K), accept x_i with probability min(1, p(x_i)/q(x_i)). If x_i is rejected, discard it and all subsequent draft tokens, sample a replacement token from the residual distribution max(0, p(x) - q(x)) normalized over the vocabulary, and end the round.
Bonus token: If all K draft tokens are accepted, sample one additional token from the target model's distribution at position K+1. This ensures that every round produces at least one new token verified by the target model.
Repeat: Return to step 1, continuing from the last accepted position.
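The following Python sketch puts these steps together for one round. It is a minimal illustration rather than a production implementation: draft_model and target_model are assumed to be callables returning normalized probability distributions over the vocabulary, and KV-cache handling, batching, and sampling parameters are omitted.

```python
import numpy as np

def speculative_round(target_model, draft_model, context, K, rng):
    """One round of drafting, verification, accept/reject, and bonus sampling."""
    # 1. Drafting: sample K candidate tokens autoregressively from the draft model.
    seq = list(context)
    draft_tokens, draft_dists = [], []
    for _ in range(K):
        q = draft_model(seq)                     # distribution over the vocabulary
        tok = rng.choice(len(q), p=q)
        draft_tokens.append(tok)
        draft_dists.append(q)
        seq.append(tok)

    # 2. Verification: one target forward pass over context + all K draft tokens,
    #    giving the target distribution at each draft position plus one extra position.
    target_dists = target_model(context, draft_tokens)   # list of K + 1 distributions

    # 3. Token-by-token acceptance.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_dists[i], draft_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)                 # accept the draft token
        else:
            # Reject: resample from the normalized residual max(0, p - q)
            # and discard every draft token after this position.
            residual = np.maximum(p - q, 0.0)
            accepted.append(rng.choice(len(residual), p=residual / residual.sum()))
            return accepted

    # 4. Bonus token: all K drafts were accepted, so sample one more token
    #    from the target distribution at the extra position.
    accepted.append(rng.choice(len(target_dists[K]), p=target_dists[K]))
    return accepted
```

Every token returned by such a round is either verified against or sampled directly from the target distribution, which is what underlies the losslessness guarantee discussed below.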
The acceptance probability for a single token depends on how well the draft distribution q(x) matches the target distribution p(x). Define the acceptance rate alpha as:
alpha = 1 - (1/2) * sum_x |p(x) - q(x)| = sum_x min(p(x), q(x))
Equivalently, alpha is one minus the total variation distance between p and q. When the two distributions are identical (alpha = 1), every draft token is accepted; when they differ substantially, the acceptance rate drops.
With K draft tokens and acceptance rate alpha, the expected number of tokens generated per round is:
E[tokens per round] = (1 - alpha^(K+1)) / (1 - alpha)
For alpha = 0.8 and K = 5, this gives approximately 3.7 tokens per round. The speedup over standard decoding depends on the relative costs of the draft and target model forward passes.
Let c be the ratio of the draft model's forward pass time to the target model's forward pass time (typically c is between 0.01 and 0.1 for a much smaller draft model). The expected speedup factor S is approximately:
S = (1 - alpha^(K+1)) / ((1 - alpha) * (c * K + 1))
For alpha = 0.8, K = 5, and c = 0.05, this yields S approximately equal to 3.0x. In practice, reported speedups range from 2x to 3.5x depending on the model pair, task, and hardware.
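These two formulas are easy to evaluate directly; the short Python check below reproduces the example figures above (the values of alpha and c are the assumed ones from the text, not measurements).

```python
def expected_tokens_per_round(alpha, K):
    return (1 - alpha ** (K + 1)) / (1 - alpha)

def expected_speedup(alpha, K, c):
    return expected_tokens_per_round(alpha, K) / (c * K + 1)

print(expected_tokens_per_round(0.8, 5))   # ~3.69 tokens per round
print(expected_speedup(0.8, 5, 0.05))      # ~2.95x over standard decoding
```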
The defining property of speculative decoding is that it produces samples from exactly the same distribution as standard autoregressive sampling from the target model. This is not an approximation; it is a mathematical identity.
Consider a single position where the draft model proposes token x sampled from q(x). The speculative decoding procedure produces a token from the following combined distribution:
P(output = x) = P(accept x) * q(x) + P(reject) * p'(x)
where P(accept x) = min(1, p(x)/q(x)) and p'(x) = max(0, p(x) - q(x)) / Z with Z = sum_x max(0, p(x) - q(x)).
The probability of rejection is:
P(reject) = sum_x q(x) * max(0, 1 - p(x)/q(x)) = sum_x max(0, q(x) - p(x)) = Z
The final equality holds because p and q each sum to 1, so sum_x max(0, q(x) - p(x)) = sum_x max(0, p(x) - q(x)).
The total probability of outputting token x is:
P(output = x) = q(x) * min(1, p(x)/q(x)) + Z * max(0, p(x) - q(x)) / Z
Case 1: p(x) >= q(x). Then min(1, p(x)/q(x)) = 1, so the accept term is q(x). The resample term is p(x) - q(x). Total: q(x) + p(x) - q(x) = p(x).
Case 2: p(x) < q(x). Then min(1, p(x)/q(x)) = p(x)/q(x), so the accept term is q(x) * p(x)/q(x) = p(x). The resample term is max(0, p(x) - q(x)) = 0. Total: p(x) + 0 = p(x).
In both cases, P(output = x) = p(x), confirming that the output distribution matches the target model exactly.
The acceptance criterion acts as a filter. When the draft model assigns too much probability to a token relative to the target model (q(x) > p(x)), the token is accepted only with probability p(x)/q(x), trimming the excess. The probability mass removed by rejection is redistributed via the resample distribution p'(x), which captures precisely the tokens that the target model favors more than the draft model. These two corrections cancel out perfectly, restoring the target distribution.
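The cancellation can also be checked numerically. The short sketch below draws an arbitrary target distribution p and draft distribution q and confirms that the acceptance mass plus the resampling mass reproduces p exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8                                        # toy vocabulary size
p = rng.random(V); p /= p.sum()              # target distribution
q = rng.random(V); q /= q.sum()              # draft distribution

accept_mass = q * np.minimum(1.0, p / q)     # probability of proposing x and accepting it
residual = np.maximum(p - q, 0.0)            # unnormalized resample distribution
Z = residual.sum()                           # total rejection probability
# P(output = x) = accept_mass(x) + Z * residual(x) / Z = accept_mass(x) + residual(x)
assert np.allclose(accept_mass + residual, p)
```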
The choice of draft model significantly affects the speed and acceptance rate of speculative decoding. Several strategies have been developed.
| Strategy | Draft source | Training required | Extra memory | Typical speedup |
|---|---|---|---|---|
| Independent small model | Separate smaller model (e.g., 68M for a 7B target) | No (use existing model) | Yes (load second model) | 2-3x |
| Fine-tuned draft model | Small model distilled from target | Yes | Yes | 2.5-3.5x |
| Self-speculative (layer skipping) | Target model with layers skipped | No | No (shared weights) | 1.5-2x |
| Medusa heads | Extra prediction heads on target model | Yes (heads only) | Minimal | 2-3.6x |
| EAGLE | Lightweight feature-level predictor | Yes (predictor only) | Minimal | 2.7-6.5x |
| N-gram / prompt lookup | N-gram matching from prompt context | No | No | 1.5-2.5x (input-heavy tasks) |
The simplest approach uses a pre-existing smaller model from the same family as the target to serve as the draft. For example, LLaMA 68M can serve as a draft model for LLaMA 7B, or T5-Small (60M) for T5-XXL (11B). The key requirement is that both models share the same tokenizer and vocabulary.
Leviathan et al. (2023) used this approach to achieve 2x to 3x speedups on T5-XXL. Chen et al. (2023) demonstrated 2x to 2.5x speedups on Chinchilla 70B in a distributed setting. The main drawback is that the draft model must be loaded into GPU memory alongside the target model, consuming additional resources.
Training a draft model specifically for the target model via knowledge distillation can improve the acceptance rate. The draft model learns to approximate the target model's distribution more closely than a generic small model would. Google reported using distilled draft models in production for AI Overviews in Google Search, achieving faster response times while maintaining quality.
For tasks where the output heavily overlaps with the input (such as text editing, code completion with context, or summarization), draft tokens can be extracted directly from the input prompt using n-gram matching. The system looks for n-grams in the prompt that match the most recently generated tokens and uses the continuation as the draft. This approach requires no additional model, no training, and no extra memory, but its effectiveness depends on the input-output overlap.
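A minimal sketch of this idea is shown below; the function name and parameters are illustrative rather than any particular framework's API.

```python
def prompt_lookup_draft(prompt_ids, generated_ids, ngram_size=3, num_draft_tokens=5):
    """Propose draft tokens by matching the latest generated n-gram against the prompt."""
    pattern = generated_ids[-ngram_size:]
    if len(pattern) < ngram_size:
        return []                                          # not enough generated context yet
    # Scan backwards so the most recent occurrence in the prompt wins.
    for start in range(len(prompt_ids) - ngram_size, -1, -1):
        if prompt_ids[start:start + ngram_size] == pattern:
            end = start + ngram_size
            return prompt_ids[end:end + num_draft_tokens]  # continuation becomes the draft
    return []                                              # no match: fall back to plain decoding
```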
Medusa, developed by Cai et al. (2024), eliminates the need for a separate draft model by adding multiple prediction heads to the target model itself. Each head predicts a token at a different future position. For example, head 1 predicts the token at position t+1, head 2 predicts the token at position t+2, and so on.
During generation, each head produces a set of top-k candidate tokens, and these candidates are combined into a tree of possible continuations. All candidates in the tree are verified simultaneously using a tree attention mechanism. The target model processes the tree in a single forward pass, and the longest correct path through the tree is accepted.
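The sketch below illustrates the general shape of such heads in PyTorch: each head is a small residual block over the target model's final hidden state, producing logits for one future position. The exact head architecture, number of heads, and tree construction follow the Medusa paper; the dimensions and names here are placeholders.

```python
import torch
import torch.nn as nn

class SpeculationHead(nn.Module):
    """One Medusa-style head: a small residual MLP plus an output projection."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()
        self.out = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden):                  # hidden: (batch, hidden_size)
        return self.out(hidden + self.act(self.proj(hidden)))

# Head k predicts the token at position t + k; top-k candidates from each head
# are combined into a tree and verified by the target model in one forward pass.
heads = nn.ModuleList(SpeculationHead(hidden_size=4096, vocab_size=32000) for _ in range(4))
last_hidden = torch.randn(1, 4096)              # final hidden state of the target model
candidates = [head(last_hidden).topk(k=5, dim=-1).indices for head in heads]
```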
Medusa comes in two variants: Medusa-1 trains only the added heads on top of a frozen backbone, leaving the target model's output unchanged, while Medusa-2 fine-tunes the heads jointly with the backbone for higher speedup at the cost of exact losslessness.
A key advantage of Medusa is that it does not require a separate draft model, simplifying deployment and avoiding the memory overhead of loading two models. The added heads are small (typically a single linear layer or a small MLP per head) and add negligible compute and memory overhead.
Medusa was published at ICML 2024.
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), proposed by Li et al. (2024), takes a different approach by performing autoregression at the feature level rather than the token level. The key insight is that predicting the next token's hidden representation (specifically, the second-to-top-layer features) is easier and more reliable than predicting the next token directly.
EAGLE adds a lightweight autoregressive head that takes the target model's hidden states as input and predicts the feature vector for the next position. By also incorporating the token embedding shifted forward by one time step, EAGLE resolves the inherent uncertainty in feature-level prediction.
The EAGLE family has progressed through three versions: EAGLE-1 introduced the feature-level autoregressive draft head; EAGLE-2 added a dynamic draft tree whose shape adapts to the draft head's confidence; and EAGLE-3 combined multi-level feature fusion with the dynamic tree, further raising acceptance rates and speedups.
Self-speculative decoding uses the target model itself as both the drafter and the verifier, eliminating the need for any additional model or trainable parameters. The drafting phase uses an approximate version of the target model, created by skipping a subset of intermediate attention and feed-forward layers. The verification phase runs the full model.
Zhang et al. (2024) proposed this approach in "Draft and Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding," published at ACL 2024. By selectively skipping layers during drafting, the model generates tokens faster (fewer layers means fewer computations) at slightly lower quality. The verification step then ensures that the final output matches the full model's distribution exactly.
A related approach is LayerSkip (Elhoushi et al., 2024), which trains the model with early exit objectives so that intermediate layers produce useful predictions. During inference, the early layers generate draft tokens, and the remaining layers verify them. LayerSkip achieves up to 1.99x speedup on LLaMA-2 variants and enables sharing the KV cache between the draft and verification phases, avoiding redundant computation.
The main advantage of self-speculative methods is that they are plug-and-play: no extra model needs to be trained, stored, or loaded. The main limitation is that the speedup is typically lower than with a dedicated draft model, because skipping layers provides a less significant speed advantage than using a much smaller model.
Lookahead decoding, proposed by Fu et al. (2024) and published at ICML 2024, takes a fundamentally different approach that does not use a draft model at all. Instead, it reformulates autoregressive decoding as solving a system of nonlinear equations and applies the Jacobi iteration method to generate multiple tokens in parallel.
The algorithm maintains two concurrent operations: a lookahead branch that runs Jacobi iteration steps to generate candidate n-grams in parallel, and a verification branch that checks previously collected n-grams against the model's output and accepts those that match.
Lookahead decoding is exact (lossless) and does not require any auxiliary model, training data, or data store. It achieves 1.5x to 2.3x speedup, with higher gains on longer generation tasks. Its main advantage is simplicity of deployment; its main limitation is that the speedup is generally lower than methods that use a dedicated draft model or learned prediction heads.
SpecInfer, proposed by Miao et al. (2024) and published at ASPLOS 2024, extends speculative decoding to use multiple small draft models that collectively generate a token tree of candidate sequences. Rather than a single draft sequence, SpecInfer organizes speculations into a tree structure where each node represents a candidate token and each path from root to leaf represents a possible continuation.
The target model verifies all paths in the tree simultaneously using a tree-based parallel decoding mechanism. This increases the probability that at least one path matches the target model's output, improving the effective acceptance rate. SpecInfer achieved 1.5x to 2.8x speedup for distributed LLM inference and 2.6x to 3.5x for offloading-based inference.
Several speculative decoding variants (Medusa, EAGLE, SpecInfer) use tree-structured speculation rather than a single linear draft sequence. Instead of generating one sequence of K tokens, the draft model generates a tree where each node branches into multiple candidate continuations.
Tree-based verification processes all branches simultaneously using a modified attention mask (tree attention). The target model evaluates every path in the tree in a single forward pass, and the longest path that passes verification is accepted. This approach trades additional computation for a higher probability of accepting long sequences of tokens.
The width and depth of the draft tree can be configured to balance between exploration (more branches, higher acceptance probability) and efficiency (fewer branches, lower computational cost per verification step).
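A simple way to see how tree attention works is to build the mask explicitly: each speculated node may attend to itself and its ancestors (plus the shared prefix, omitted here), while sibling branches remain invisible to each other. The sketch below constructs such a mask for an illustrative five-node tree.

```python
import numpy as np

def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent within the tree, or -1 for a root."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        node = i
        while node != -1:            # walk up to the root, allowing attention to ancestors
            mask[i, node] = True
            node = parents[node]
    return mask

# Two root candidates (0 and 1); nodes 2 and 3 continue node 0, node 4 continues node 1.
print(tree_attention_mask([-1, -1, 0, 0, 1]).astype(int))
```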
Choosing an appropriate draft model involves balancing speed and accuracy: a smaller draft proposes tokens more cheaply but matches the target distribution less well, while a larger draft raises the acceptance rate at the cost of slower drafting.
Empirical studies suggest that the draft model should be approximately 10x to 100x smaller than the target model for optimal results.
The number of draft tokens K per round is a tunable hyperparameter. Larger K means more potential tokens per round but also more wasted computation if early tokens are rejected (since all tokens after a rejection are discarded). The optimal K depends on the acceptance rate alpha:
| Acceptance rate (alpha) | Optimal K | Expected tokens per round |
|---|---|---|
| 0.5 | 3-4 | 1.9-2.0 |
| 0.7 | 4-6 | 2.8-3.3 |
| 0.8 | 5-8 | 3.6-4.5 |
| 0.9 | 8-12 | 5.5-7.5 |
| 0.95 | 12-20 | 8.0-12.0 |
Some implementations (such as EAGLE-2) dynamically adjust K based on the draft model's confidence, using more draft tokens when predictions are confident and fewer when they are uncertain.
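In the simplest setting, a reasonable starting point for K can be obtained from the expected-speedup formula given earlier by sweeping candidate draft lengths; the sketch below does exactly that, with alpha and c treated as values that must be measured for the actual model pair.

```python
def expected_speedup(alpha, K, c):
    return (1 - alpha ** (K + 1)) / ((1 - alpha) * (c * K + 1))

def best_draft_length(alpha, c, max_K=32):
    # Pick the K that maximizes the expected speedup for this (alpha, c) pair.
    return max(range(1, max_K + 1), key=lambda K: expected_speedup(alpha, K, c))

print(best_draft_length(alpha=0.8, c=0.05))   # -> 8 for these illustrative values
```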
Speculative decoding is implemented in all major LLM serving frameworks:
| Framework | Supported methods | Notes |
|---|---|---|
| vLLM | Draft model, EAGLE, EAGLE-3, Medusa, n-gram, MLPSpeculator | Integrated into the PagedAttention-based serving engine |
| TensorRT-LLM | Draft model, EAGLE, Medusa, n-gram (prompt lookup) | NVIDIA-optimized CUDA kernels; up to 3.6x throughput boost reported |
| Hugging Face Transformers | Draft model (assisted generation) | Available via model.generate(assistant_model=...); see the usage sketch below |
| SGLang | Draft model, EAGLE | Supports RadixAttention for efficient prefix sharing |
| llama.cpp | Draft model, prompt lookup | CPU and GPU inference with GGUF models |
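As a concrete example of the assisted-generation path in the table above, the snippet below loads a target model and a smaller draft from the same family and passes the draft via the assistant_model argument. The OPT checkpoints are only an illustrative pair that shares a tokenizer, and argument names may differ slightly across library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Target and draft must share a tokenizer; facebook/opt-1.3b and facebook/opt-125m do.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-1.3b")
target = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")
draft = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

inputs = tokenizer("Speculative decoding speeds up inference by", return_tensors="pt")
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```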
Speculative decoding provides the largest benefits under specific conditions: small batch sizes where decoding is memory-bandwidth-bound, latency-sensitive interactive workloads, and model pairs whose draft distribution closely matches the target. Outside these conditions, several practical limitations apply.
Extending speculative decoding to batched inference introduces the "ragged tensor" problem. Different sequences in the same batch may accept different numbers of draft tokens, resulting in variable-length outputs that break the regular tensor shapes required for efficient GPU computation. Sequences must be padded or processed with custom attention masks, adding implementation complexity and potentially reducing throughput.
When using an independent draft model, both the draft and target models must reside in GPU memory simultaneously. For memory-constrained deployments, this can be prohibitive. Methods like self-speculative decoding and Medusa address this by reusing the target model's weights.
The speedup of speculative decoding depends heavily on how well the draft model's distribution matches the target model's distribution. If the two models are trained on different data or have significantly different capabilities, the acceptance rate may be too low to provide meaningful acceleration. Fine-tuning or distilling the draft model specifically for the target model can improve alignment but adds training cost.
Paradoxically, making the draft model more capable (and therefore more accurate) often makes it slower, reducing the speed ratio between draft and target models. There is an inherent tension between draft quality (high acceptance rate) and draft speed (low cost per draft token). Finding the optimal balance point requires empirical experimentation for each target model and task.
Implementing speculative decoding efficiently requires careful management of the KV cache (draft tokens may need to be evicted upon rejection), dynamic sequence lengths, and synchronization between draft and target model execution. Production-quality implementations in frameworks like vLLM and TensorRT-LLM represent significant engineering effort.
| Method | Year | Venue | Draft source | Training needed | Extra memory | Speedup range | Lossless |
|---|---|---|---|---|---|---|---|
| Speculative Decoding (Leviathan et al.) | 2023 | ICML | Independent small model | No | Yes | 2-3x | Yes |
| Speculative Sampling (Chen et al.) | 2023 | arXiv | Independent small model | No | Yes | 2-2.5x | Yes |
| SpecInfer (Miao et al.) | 2024 | ASPLOS | Multiple small models (tree) | No | Yes | 1.5-3.5x | Yes |
| Medusa (Cai et al.) | 2024 | ICML | Extra prediction heads | Yes (heads) | Minimal | 2.2-3.6x | Medusa-1: Yes; Medusa-2: Approximate |
| EAGLE-1 (Li et al.) | 2024 | ICML | Feature-level predictor | Yes (predictor) | Minimal | 2.7-3.5x | Yes |
| EAGLE-2 (Li et al.) | 2024 | EMNLP | Feature-level predictor + dynamic tree | Yes (predictor) | Minimal | 3.05-4.26x | Yes |
| EAGLE-3 (Li et al.) | 2025 | NeurIPS | Multi-level feature fusion + dynamic tree | Yes (predictor) | Minimal | 3.0-6.5x | Yes |
| Self-Speculative (Zhang et al.) | 2024 | ACL | Layer-skipped target model | No | None | 1.5-2x | Yes |
| LayerSkip (Elhoushi et al.) | 2024 | ACL | Early exit from target model | Yes (early exit training) | None | 1.5-2x | Yes |
| Lookahead Decoding (Fu et al.) | 2024 | ICML | Jacobi iteration (no model) | No | None | 1.5-2.3x | Yes |
| N-gram / Prompt Lookup | 2023 | Open source | Input text n-gram matching | No | None | 1.5-2.5x | Yes |
The concept of speculative decoding was introduced in two independent papers that developed the same core idea concurrently.
Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research posted "Fast Inference from Transformers via Speculative Decoding" on arXiv on November 30, 2022. The paper was subsequently accepted as an oral presentation at ICML 2023. They demonstrated 2x to 3x speedups on T5-XXL (11B parameters) using a T5-Small (60M parameters) draft model, with no change to the output quality.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper at DeepMind published "Accelerating Large Language Model Decoding with Speculative Sampling" on arXiv on February 2, 2023. They demonstrated 2x to 2.5x speedups on Chinchilla (70B parameters) in a distributed computing environment.
Both papers independently arrived at the same modified rejection sampling scheme for preserving the target distribution. In a May 2023 revision, Leviathan et al. acknowledged Chen et al.'s work as a concurrent and independent development.
Google reported in 2024 that speculative decoding had been deployed in production for AI Overviews in Google Search, using distilled draft models to generate faster responses. The technique has since been adopted broadly across the industry, with inference frameworks, cloud providers, and hardware vendors all incorporating support for it.