Speculative Decoding
Last reviewed
May 17, 2026
Sources
22 citations
Review status
Source-backed
Revision
v4 · 6,424 words
See also: transformer, large language model, inference, KV cache, autoregressive model, knowledge distillation
Speculative decoding is an inference acceleration technique for autoregressive transformer models that generates multiple tokens per forward pass of the target model while provably preserving the output distribution. The technique was independently proposed by Leviathan, Kalman, and Matias at Google Research (November 2022) and by Chen, Borgeaud, Irving, Lespiau, Sifre, and Jumper at DeepMind (February 2023). Both papers demonstrated 2x to 3x latency reductions on large language models without retraining, changing model architectures, or altering the quality of generated text.
The core idea draws from speculative execution in computer architecture: a small, fast draft model generates a sequence of candidate tokens ("speculations"), and the larger target model verifies all of them in a single forward pass. Correct speculations are accepted, while incorrect ones are rejected and resampled using a modified rejection sampling scheme. Because the target model processes the speculated tokens in parallel rather than sequentially, the wall-clock time per accepted token decreases significantly. A mathematical proof guarantees that the output distribution is identical to that of the target model alone, making speculative decoding a lossless optimization.
Since its introduction, speculative decoding has become a standard component of production LLM serving systems. It is supported by major inference frameworks including vLLM, NVIDIA TensorRT-LLM, SGLang, and Hugging Face Transformers. Multiple variants have been developed, including Medusa, EAGLE-1/2/3, ReDrafter, Speculative Streaming, self-speculative decoding, and lookahead decoding, each exploring different approaches to draft token generation. By 2025, the multi-token prediction (MTP) heads released with DeepSeek-V3 made speculative decoding integral to frontier open-weight model deployment.
Imagine you are writing a story, but you can only write one word at a time, and after each word you have to ask your teacher if it is the right word. This is very slow because you have to wait for the teacher after every single word.
Speculative decoding is like having a really fast friend who guesses the next several words for you all at once. Your friend writes down five words quickly, then you show all five to the teacher at the same time. The teacher checks them and says "the first three are correct, but the fourth one is wrong." You keep the three correct words (saving a lot of time) and fix only the wrong one. Your friend is not as smart as the teacher, but they are much faster, and many of their guesses turn out to be right. In the end, the story is exactly the same as if the teacher had written every word, but it gets finished much more quickly.
Autoregressive language models generate text one token at a time. At each step, the model takes all previously generated tokens as input, performs a full forward pass to compute a probability distribution over the vocabulary, and samples the next token. This process is inherently sequential: generating K tokens requires K serial forward passes through the model.
For modern LLMs with billions of parameters, each forward pass involves loading the entire set of model weights from GPU high-bandwidth memory (HBM) into the compute units. On an NVIDIA A100 GPU with 80 GB of HBM and 2 TB/s memory bandwidth, loading a 14 GB model (7B parameters in FP16) takes approximately 7 ms. The actual matrix multiplications for a single token take only about 0.5 ms. This means the GPU spends roughly 93% of each decoding step waiting for data to arrive from memory and only 7% performing useful computation.
The ratio of arithmetic operations to memory accesses is captured by the concept of arithmetic intensity, measured in FLOPs per byte. Modern GPUs like the A100 have a compute-to-bandwidth ratio of roughly 150:1 (312 TFLOPS vs. 2 TB/s). To keep the GPU's compute units fully utilized, a workload must perform at least 150 floating-point operations for every byte read from memory.
During autoregressive decoding with batch size 1, each forward pass processes a single token. The model weights must be loaded in full, but only a small number of operations are performed per weight element. The arithmetic intensity is far below the 150 FLOP/byte threshold, placing the workload squarely in the memory-bound regime. The GPU's massive compute capacity sits largely idle.
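The figures in this section can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and the "2 FLOPs per parameter per token" rule is a common approximation that the text above does not state. It also ignores KV-cache and activation traffic.

```python
# Back-of-the-envelope roofline numbers for batch-size-1 decoding on an A100.
# Hardware figures are the ones quoted in the text; the 2-FLOPs-per-parameter
# rule of thumb is an assumption added for illustration.

PEAK_FLOPS = 312e12      # FP16 peak, FLOP/s
HBM_BANDWIDTH = 2e12     # bytes/s
PARAMS = 7e9             # 7B-parameter model
BYTES_PER_PARAM = 2      # FP16


def weight_load_time_s(params: float, bytes_per_param: int, bandwidth: float) -> float:
    """Time to stream every weight from HBM once."""
    return params * bytes_per_param / bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: int) -> float:
    """FLOPs per byte of weights read, assuming ~2 FLOPs per parameter per token."""
    return 2 * tokens_per_pass / bytes_per_param


ridge_point = PEAK_FLOPS / HBM_BANDWIDTH  # FLOP/byte needed to saturate compute
load_ms = weight_load_time_s(PARAMS, BYTES_PER_PARAM, HBM_BANDWIDTH) * 1e3

print(f"weight load time: {load_ms:.1f} ms")         # 7.0 ms
print(f"ridge point: {ridge_point:.0f} FLOP/byte")   # 156 FLOP/byte
print(f"1 token per pass:  {arithmetic_intensity(1, BYTES_PER_PARAM):.0f} FLOP/byte")
print(f"6 tokens per pass: {arithmetic_intensity(6, BYTES_PER_PARAM):.0f} FLOP/byte")
```

Under these assumptions a single-token pass sits two orders of magnitude below the ridge point, which is why processing several tokens per weight load costs little extra time.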
| Phase | Arithmetic intensity | Bottleneck | GPU compute utilization |
|---|---|---|---|
| Prefill (prompt processing) | High (many tokens processed simultaneously) | Compute | 60-80% |
| Autoregressive decode (batch size 1) | Very low (one token, full weight load) | Memory bandwidth | 5-10% |
| Autoregressive decode (large batch) | Moderate to high | Transitioning to compute | 30-70% |
Speculative decoding exploits this imbalance. By verifying K draft tokens in a single forward pass, the target model performs roughly K+1 times as much useful computation for essentially the same memory access cost. This increases the arithmetic intensity and moves the workload closer to the compute-bound regime, making better use of available hardware.
Increasing the batch size is another way to improve GPU utilization during decoding, but it addresses throughput (total tokens per second across all requests) rather than latency (time per individual request). For interactive applications like chatbots and code assistants, users care about per-request latency. Speculative decoding reduces the latency of a single sequence's generation without requiring additional concurrent requests.
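Speculative decoding involves two models: a large target model M_p, whose output distribution p must be preserved, and a small, fast draft model M_q, which proposes candidate tokens from its own distribution q.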
The algorithm proceeds in rounds. Each round consists of a drafting phase and a verification phase.
1. Draft generation: The draft model M_q autoregressively generates K candidate tokens (x_1, x_2, ..., x_K), sampling each from q(x_i | x_{<i}). This requires K sequential forward passes of the small model, which are fast due to the model's small size.
2. Parallel verification: The target model M_p processes the full sequence (original context plus all K draft tokens) in a single forward pass, computing the target distribution p(x_i | x_{<i}) at each of the K draft positions plus one additional position.
3. Token-by-token acceptance: For each draft token x_i in order (i = 1, 2, ..., K), accept x_i with probability min(1, p(x_i | x_{<i}) / q(x_i | x_{<i})). At the first rejection, discard x_i and all later draft tokens, sample a replacement from the residual distribution proportional to max(0, p(x) - q(x)), and end the round.
4. Bonus token: If all K draft tokens are accepted, sample one additional token from the target model's distribution at position K+1. Together with the replacement token drawn on rejection, this means every round produces between 1 and K+1 new tokens.
5. Repeat: Return to step 1, continuing from the last accepted position.
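One round of the procedure above can be sketched in plain Python. This is a toy under stated assumptions, not a production implementation: the "models" are hypothetical functions that map a token prefix to a probability vector over a three-token vocabulary, and the target is queried in a loop where a real system would use one batched forward pass.

```python
import random
from typing import Callable, List, Sequence

Dist = List[float]                       # probability vector over the vocabulary
Model = Callable[[Sequence[int]], Dist]  # maps a token prefix to a next-token distribution


def sample(weights: Dist, rng: random.Random) -> int:
    """Draw an index in proportion to (possibly unnormalized) weights."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_round(prefix: List[int], draft: Model, target: Model,
                      k: int, rng: random.Random) -> List[int]:
    """Run one draft-then-verify round and return the newly generated tokens."""
    # Draft generation: propose k tokens autoregressively with the small model.
    ctx, drafted, q_dists = list(prefix), [], []
    for _ in range(k):
        q = draft(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # Verification: target distributions at all k + 1 positions.
    p_dists = [target(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # Acceptance: walk the draft in order and stop at the first rejection.
    out: List[int] = []
    for tok, q, p in zip(drafted, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the residual max(0, p - q); choices() normalizes it.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out

    # Bonus token: every draft was accepted, so sample once more from the target.
    out.append(sample(p_dists[k], rng))
    return out


# Hypothetical toy models whose distributions depend on the parity of the last token.
def toy_target(prefix: Sequence[int]) -> Dist:
    return [0.6, 0.3, 0.1] if prefix[-1] % 2 == 0 else [0.2, 0.5, 0.3]


def toy_draft(prefix: Sequence[int]) -> Dist:
    return [0.5, 0.3, 0.2] if prefix[-1] % 2 == 0 else [0.3, 0.4, 0.3]


new_tokens = speculative_round([0], toy_draft, toy_target, k=4, rng=random.Random(0))
print(new_tokens)  # between 1 and 5 token ids
```

Division by q[tok] is safe here because a token can only be drafted if the draft model gave it nonzero probability.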
The acceptance probability for a single token depends on how well the draft distribution q(x) matches the target distribution p(x). Define the acceptance rate alpha as:
alpha = 1 - (1/2) * sum_x |p(x) - q(x)| = sum_x min(p(x), q(x))
Equivalently, alpha is one minus the total variation distance between p and q. When the two distributions are identical (alpha = 1), every draft token is accepted. When they differ substantially, the acceptance rate drops.
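As a small numerical check of the identity above, the following sketch computes alpha both ways for one pair of made-up distributions. The function names and the example vectors are illustrative assumptions.

```python
from typing import Sequence


def acceptance_rate(p: Sequence[float], q: Sequence[float]) -> float:
    """alpha = sum_x min(p(x), q(x)) for a single context."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def total_variation(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance: half the L1 distance between p and q."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


p = [0.6, 0.3, 0.1]  # target distribution (made up)
q = [0.5, 0.3, 0.2]  # draft distribution (made up)

alpha = acceptance_rate(p, q)
print(round(alpha, 6))                      # 0.9
print(round(1 - total_variation(p, q), 6))  # 0.9
```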
With K draft tokens and acceptance rate alpha, the expected number of tokens generated per round is:
E[tokens per round] = (1 - alpha^(K+1)) / (1 - alpha)
For alpha = 0.8 and K = 5, this gives (1 - 0.8^6) / 0.2, or approximately 3.7 tokens per round. The speedup over standard decoding depends on the relative costs of the draft and target model forward passes.
Let c be the ratio of the draft model's forward pass time to the target model's forward pass time (typically c is between 0.01 and 0.1 for a much smaller draft model). The expected speedup factor S is approximately:
S = (1 - alpha^(K+1)) / ((1 - alpha) * (c * K + 1))
For alpha = 0.8, K = 5, and c = 0.05, this yields (1 - 0.8^6) / (0.2 * 1.25), so S is approximately 2.95x. In practice, reported speedups range from 2x to 3.5x for classical draft-model setups and 3x to 6x for feature-level methods like EAGLE-3.
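The two formulas above are easy to evaluate directly. The sketch below does so for the worked example; the function names are illustrative, and the model assumes, as the formulas do, that each draft token is accepted independently with the same probability alpha.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per round: (1 - alpha^(K+1)) / (1 - alpha)."""
    if alpha == 1.0:
        return float(k + 1)  # limit of the geometric sum when every draft is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Speedup S: expected tokens per round divided by the round cost (c * K + 1)."""
    return expected_tokens(alpha, k) / (c * k + 1)


print(f"{expected_tokens(0.8, 5):.2f}")         # 3.69
print(f"{expected_speedup(0.8, 5, 0.05):.2f}")  # 2.95
```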
The defining property of speculative decoding is that it produces samples from exactly the same distribution as standard autoregressive sampling from the target model. This is not an approximation; it is a mathematical identity.
Consider a single position where the draft model proposes token x sampled from q(x). The speculative decoding procedure produces a token from the following combined distribution:
P(output = x) = P(accept x) * q(x) + P(reject) * p'(x)
where P(accept x) = min(1, p(x)/q(x)) and p'(x) = max(0, p(x) - q(x)) / Z with Z = sum_x max(0, p(x) - q(x)).
The probability of rejection is:
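P(reject) = sum_x q(x) * max(0, 1 - p(x)/q(x)) = sum_x max(0, q(x) - p(x)) = Z
The last equality holds because p and q each sum to 1, so sum_x (p(x) - q(x)) = 0: the total mass where q exceeds p equals the total mass where p exceeds q.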
The total probability of outputting token x is:
P(output = x) = q(x) * min(1, p(x)/q(x)) + Z * max(0, p(x) - q(x)) / Z
Case 1: p(x) >= q(x). Then min(1, p(x)/q(x)) = 1, so the accept term is q(x). The resample term is p(x) - q(x). Total: q(x) + p(x) - q(x) = p(x).
Case 2: p(x) < q(x). Then min(1, p(x)/q(x)) = p(x)/q(x), so the accept term is q(x) * p(x)/q(x) = p(x). The resample term is max(0, p(x) - q(x)) = 0. Total: p(x) + 0 = p(x).
In both cases, P(output = x) = p(x), confirming that the output distribution matches the target model exactly.
The acceptance criterion acts as a filter. When the draft model assigns too much probability to a token relative to the target model (q(x) > p(x)), the token is accepted only with probability p(x)/q(x), trimming the excess. The probability mass removed by rejection is redistributed via the resample distribution p'(x), which captures precisely the tokens that the target model favors more than the draft model. These two corrections cancel out perfectly, restoring the target distribution.
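The case analysis above can also be checked numerically. The sketch below computes the accept term and the resample term exactly for one pair of made-up distributions and confirms that they sum to p(x) for every token. The function name and the example vectors are assumptions for illustration.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept-or-resample step."""
    p_reject = sum(max(0.0, qi - pi) for pi, qi in zip(p, q))  # mass trimmed from q
    z = sum(max(0.0, pi - qi) for pi, qi in zip(p, q))         # normalizer of p'
    out = []
    for pi, qi in zip(p, q):
        accept_term = qi * min(1.0, pi / qi) if qi > 0 else 0.0
        resample_term = p_reject * max(0.0, pi - qi) / z if z > 0 else 0.0
        out.append(accept_term + resample_term)
    return out


p = [0.6, 0.3, 0.1]  # target distribution (made up)
q = [0.2, 0.3, 0.5]  # draft distribution (made up)

result = output_distribution(p, q)
print([round(v, 6) for v in result])  # [0.6, 0.3, 0.1]
```

Computing the rejection probability and the normalizer Z separately, rather than cancelling them by hand, means the check also exercises the claim that the two are equal.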
The choice of draft model significantly affects the speed and acceptance rate of speculative decoding. Several strategies have been developed.
| Strategy | Draft source | Training required | Extra memory | Typical speedup |
|---|---|---|---|---|
| Independent small model | Separate smaller model (e.g., 68M for a 7B target) | No (use existing model) | Yes (load second model) | 2-3x |
| Fine-tuned draft model | Small model distilled from target | Yes | Yes | 2.5-3.5x |
| Self-speculative (layer skipping) | Target model with layers skipped | No | No (shared weights) | 1.5-2x |
| Medusa heads | Extra prediction heads on target model | Yes (heads only) | Minimal | 2-3.6x |
| EAGLE | Lightweight feature-level predictor | Yes (predictor only) | Minimal | 2.7-6.5x |
| MTP heads (DeepSeek-V3 style) | Trained-in multi-token prediction modules | Yes (during pretraining) | Small (additional transformer block per depth) | 1.6-1.8x for V3 (alpha ~ 0.85) |
| N-gram / prompt lookup | N-gram matching from prompt context | No | No | 1.5-2.5x (input-heavy tasks) |
The simplest approach uses a pre-existing smaller model from the same model family as the draft model. For example, LLaMA 68M can serve as a draft model for LLaMA 7B, or T5-Small (60M) for T5-XXL (11B). The key requirement is that both models share the same tokenizer and vocabulary.
Leviathan et al. (2023) used this approach to achieve 2x to 3x speedups on T5-XXL. Chen et al. (2023) demonstrated 2x to 2.5x speedups on Chinchilla 70B in a distributed setting. The main drawback is that the draft model must be loaded into GPU memory alongside the target model, consuming additional resources.
Training a draft model specifically for the target model via knowledge distillation can improve the acceptance rate. The draft model learns to approximate the target model's distribution more closely than a generic small model would. Google reported using distilled draft models in production for AI Overviews in Google Search, achieving faster response times while maintaining quality.
For tasks where the output heavily overlaps with the input (such as text editing, code completion with context, or summarization), draft tokens can be extracted directly from the input prompt using n-gram matching. The system looks for n-grams in the prompt that match the most recently generated tokens and uses the continuation as the draft. This approach requires no additional model, no training, and no extra memory, but its effectiveness depends on the input-output overlap.
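A minimal sketch of n-gram drafting is shown below. It is not the implementation used by any particular framework: the function name and parameters are hypothetical, and for simplicity it searches the whole token sequence (prompt plus generated text) for the most recent earlier occurrence of the trailing n-gram.

```python
from typing import List, Sequence


def prompt_lookup_draft(tokens: Sequence[int], ngram_size: int = 3,
                        num_draft: int = 5) -> List[int]:
    """Propose draft tokens by copying what followed an earlier copy of the trailing n-gram."""
    if len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan candidate start positions from most recent to oldest,
    # excluding the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + num_draft])
    return []  # no match: fall back to ordinary decoding for this step


history = [5, 6, 7, 8, 9, 1, 2, 5, 6, 7]
print(prompt_lookup_draft(history, ngram_size=3, num_draft=3))  # [8, 9, 1]
```

The drafted tokens still go through the usual verification step, so a poor match costs only wasted draft positions and never changes the output.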
Medusa, developed by Cai et al. (2024), eliminates the need for a separate draft model by adding multiple prediction heads to the target model itself. Each head predicts a token at a different future position. For example, head 1 predicts the token at position t+1, head 2 predicts the token at position t+2, and so on.
During generation, each head produces a set of top-k candidate tokens, and these candidates are combined into a tree of possible continuations. All candidates in the tree are verified simultaneously using a tree attention mechanism. The target model processes the tree in a single forward pass, and the longest correct path through the tree is accepted.
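Medusa comes in two variants: Medusa-1, which trains only the added heads on top of a frozen backbone and therefore leaves the base model's outputs unchanged, and Medusa-2, which fine-tunes the heads jointly with the backbone for higher speedups at the cost of only approximately preserving the original distribution.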
A key advantage of Medusa is that it does not require a separate draft model, simplifying deployment and avoiding the memory overhead of loading two models. The added heads are small (typically a single linear layer or a small MLP per head) and add negligible compute and memory overhead.
Medusa was published at ICML 2024.
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), proposed by Li et al. (2024), takes a different approach by performing autoregression at the feature level rather than the token level. The key insight is that predicting the next token's hidden representation (specifically, the second-to-top-layer features) is easier and more reliable than predicting the next token directly.
EAGLE adds a lightweight autoregressive head that takes the target model's hidden states as input and predicts the feature vector for the next position. By also incorporating the token embedding shifted forward by one time step, EAGLE resolves the inherent uncertainty in feature-level prediction.
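The EAGLE family has progressed through three versions: EAGLE-1 (ICML 2024) introduced the feature-level predictor and reported 2.7x to 3.5x speedups; EAGLE-2 (EMNLP 2024) added a dynamic draft tree that adapts to the drafter's confidence, reaching 3.05x to 4.26x; and EAGLE-3 (NeurIPS 2025) added multi-level feature fusion on top of the dynamic tree, reaching 3.0x to 6.5x.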
EAGLE-3 has become the de facto choice for production speculative decoding on open-weight models. By late 2025 vLLM, SGLang, and TensorRT-LLM all shipped first-class support, including CUDA graphs and per-position acceptance-rate telemetry, and the SGLang project released SpecForge, a training pipeline that fits an EAGLE-3 head for an arbitrary target model in a few hours on a single GPU node. A parallel variant, P-EAGLE (AWS, 2025), decouples the drafter from the verifier so that drafting runs concurrently with target-model verification on a separate stream, reducing the end-to-end critical path.
The DeepSeek-V3 technical report (December 2024) introduced multi-token prediction (MTP) as a pretraining objective and showed that the resulting modules double as a high-quality speculative-decoding drafter. MTP attaches D sequential prediction modules to the main transformer; each module contains a transformer block, a projection matrix, and shared embedding and output heads, and module d predicts the token at offset d+1 conditioned on the main model's hidden state and the embedding of the previous token.
During pretraining, MTP densifies the supervision signal so every hidden state must support both next-token and farther-ahead predictions, which encourages the model to plan ahead and tends to improve downstream accuracy. At inference the MTP modules can either be discarded (recovering a vanilla model) or repurposed as drafters. DeepSeek-V3 reports an acceptance rate above 80% for the first MTP head and an end-to-end generation speedup of roughly 1.8x; LMSYS measurements on SGLang showed up to 60% higher output throughput with no loss in quality. MTP-based speculation has since been adopted in NVIDIA's Megatron-Bridge stack and vLLM's Ascend backend, making it one of the few speculative-decoding methods baked into the base model during pretraining rather than added afterward.
Apple has contributed three complementary methods. ReDrafter (2024) replaces Medusa's parallel heads with a small recurrent neural network that consumes the target model's hidden state and emits a beam of candidate continuations, which are then organized into a dynamic tree for parallel verification. Apple and NVIDIA jointly integrated ReDrafter into TensorRT-LLM in 2025, reporting up to 3.5 accepted tokens per generation step. Speculative Streaming (EMNLP 2025) collapses drafting into the target model by replacing the standard next-token training objective with a future n-gram prediction objective and a multi-stream attention pattern, achieving 1.8x to 3.1x speedup without any separate drafter or auxiliary heads. Mirror-SD (2025) co-schedules drafter and target on heterogeneous accelerators (typically GPU plus NPU on Apple Silicon) so that the draft proposes forward continuations while the target simultaneously proposes correction paths for the drafter, converting speculation into two complementary execution streams rather than a strictly sequential draft-then-verify pattern.
Sequoia (Chen et al., 2024) is a tree-based system that optimizes the shape of the speculation tree as a function of hardware bandwidth and the drafter's acceptance distribution. It reports up to 9.5x speedups over incremental decoding in offloading scenarios where target-model weights spill to CPU memory or disk, because tree-structured speculation amortizes the very high per-step cost of paging weights back into the GPU. Sequoia popularized the use of explicit acceptance-rate modeling to compute the optimal tree topology rather than hand-tuning it.
Self-speculative decoding uses the target model itself as both the drafter and the verifier, eliminating the need for any additional model or trainable parameters. The drafting phase uses an approximate version of the target model, created by skipping a subset of intermediate attention and feed-forward layers. The verification phase runs the full model.
Zhang et al. (2024) proposed this approach in "Draft and Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding," published at ACL 2024. By selectively skipping layers during drafting, the model generates tokens faster (fewer layers means fewer computations) at slightly lower quality. The verification step then ensures that the final output matches the full model's distribution exactly.
A related approach is LayerSkip (Elhoushi et al., 2024), which trains the model with early exit objectives so that intermediate layers produce useful predictions. During inference, the early layers generate draft tokens, and the remaining layers verify them. LayerSkip achieves up to 1.99x speedup on LLaMA-2 variants and enables sharing the KV cache between the draft and verification phases, avoiding redundant computation.
The main advantage of self-speculative methods is that they are plug-and-play: no extra model needs to be trained, stored, or loaded. The main limitation is that the speedup is typically lower than with a dedicated draft model, because skipping layers provides a less significant speed advantage than using a much smaller model.
Lookahead decoding, proposed by Fu et al. (2024) and published at ICML 2024, takes a fundamentally different approach that does not use a draft model at all. Instead, it reformulates autoregressive decoding as solving a system of nonlinear equations and applies the Jacobi iteration method to generate multiple tokens in parallel.
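The algorithm maintains two concurrent operations within each forward pass: a lookahead branch, which runs Jacobi iterations over a fixed-size window to generate candidate n-grams, and a verification branch, which checks promising candidate n-grams against the model's own predictions and accepts those that match.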
Lookahead decoding is exact (lossless) and does not require any auxiliary model, training data, or data store. It achieves 1.5x to 2.3x speedup, with higher gains on longer generation tasks. Its main advantage is simplicity of deployment; its main limitation is that the speedup is generally lower than methods that use a dedicated draft model or learned prediction heads.
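The Jacobi iteration underlying lookahead decoding can be illustrated with a toy. The sketch below covers only the fixed-point idea, not the full lookahead algorithm: it has no n-gram pool, and the "model" is a hypothetical deterministic function standing in for greedy decoding.

```python
from typing import Callable, List, Sequence, Tuple

GreedyModel = Callable[[Sequence[int]], int]  # prefix -> greedy next token


def greedy_decode(prefix: List[int], model: GreedyModel, n: int) -> List[int]:
    """Ordinary sequential decoding, used here as the reference result."""
    out = list(prefix)
    for _ in range(n):
        out.append(model(out))
    return out[len(prefix):]


def jacobi_decode(prefix: List[int], model: GreedyModel, n: int,
                  pad: int = 0) -> Tuple[List[int], int]:
    """Decode n tokens by fixed-point iteration; returns (tokens, iterations)."""
    guess = [pad] * n
    iterations = 0
    while True:
        iterations += 1
        # Each position is recomputed from the previous iterate only, so a real
        # implementation can evaluate all n updates in one parallel forward pass.
        new = [model(prefix + guess[:i]) for i in range(n)]
        if new == guess:  # fixed point reached: this is the greedy continuation
            return new, iterations
        guess = new


def toy_model(ctx: Sequence[int]) -> int:
    """Hypothetical stand-in for a language model's greedy next-token choice."""
    return (ctx[-1] * 3 + 1) % 7


tokens, iterations = jacobi_decode([2], toy_model, n=5)
print(tokens == greedy_decode([2], toy_model, 5))  # True
print(iterations <= 5 + 1)                         # True
```

After t iterations at least the first t positions are correct, so the loop needs at most n updates plus one pass to detect the fixed point. Any speedup comes from iterations in which several positions become correct at once.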
SpecInfer, proposed by Miao et al. (2024) and published at ASPLOS 2024, extends speculative decoding to use multiple small draft models that collectively generate a token tree of candidate sequences. Rather than a single draft sequence, SpecInfer organizes speculations into a tree structure where each node represents a candidate token and each path from root to leaf represents a possible continuation.
The target model verifies all paths in the tree simultaneously using a tree-based parallel decoding mechanism. This increases the probability that at least one path matches the target model's output, improving the effective acceptance rate. SpecInfer achieved 1.5x to 2.8x speedup for distributed LLM inference and 2.6x to 3.5x for offloading-based inference.
Several speculative decoding variants (Medusa, EAGLE, SpecInfer, Sequoia, ReDrafter) use tree-structured speculation rather than a single linear draft sequence. Instead of generating one sequence of K tokens, the draft model generates a tree where each node branches into multiple candidate continuations.
Tree-based verification processes all branches simultaneously using a modified attention mask (tree attention). The target model evaluates every path in the tree in a single forward pass, and the longest path that passes verification is accepted. This approach trades additional computation for a higher probability of accepting long sequences of tokens.
The width and depth of the draft tree can be configured to balance between exploration (more branches, higher acceptance probability) and efficiency (fewer branches, lower computational cost per verification step).
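The tree attention mask mentioned above can be sketched as follows. The parent-pointer encoding and the function name are assumptions chosen for illustration. In a real system every draft node would also attend to the whole shared prefix, which is omitted here.

```python
from typing import List, Optional


def tree_attention_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """Return mask[i][j], True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or None for a first-level node.
    A node sees itself and its ancestors, never its siblings or other branches.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j: Optional[int] = i
        while j is not None:  # walk up toward the root, marking each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


# Two candidates for the first position (nodes 0 and 1); node 0 has two
# children (nodes 2 and 3) and node 1 has one child (node 4).
mask = tree_attention_mask([None, None, 0, 0, 1])
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Because each branch sees only its own ancestors, the target model can score all root-to-leaf paths in one pass as if each had been fed separately.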
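Choosing an appropriate draft model involves balancing speed and accuracy: a smaller draft model costs less per drafted token but is accepted less often, while a larger one is accepted more often but narrows the speed gap with the target.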
Empirical studies suggest that the draft model should be approximately 10x to 100x smaller than the target model for optimal results.
The number of draft tokens K per round is a tunable hyperparameter. Larger K means more potential tokens per round but also more wasted computation if early tokens are rejected (since all tokens after a rejection are discarded). The optimal K depends on the acceptance rate alpha:
| Acceptance rate (alpha) | Optimal K | Expected tokens per round |
|---|---|---|
| 0.5 | 3-4 | 1.9 |
| 0.7 | 4-6 | 2.8-3.1 |
| 0.8 | 5-8 | 3.7-4.3 |
| 0.9 | 8-12 | 6.1-7.5 |
| 0.95 | 12-20 | 9.7-13.2 |
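Under the simple cost model S = (1 - alpha^(K+1)) / ((1 - alpha) * (c * K + 1)), the best K can be found by a direct search. This is a sketch with illustrative function names. The value c = 0.05 is just an example, and real deployments typically tune K empirically because acceptance is not independent across positions.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """S = (1 - alpha^(K+1)) / ((1 - alpha) * (c * K + 1)), for alpha < 1."""
    return (1 - alpha ** (k + 1)) / ((1 - alpha) * (c * k + 1))


def best_k(alpha: float, c: float, k_max: int = 64) -> int:
    """Draft length in 1..k_max that maximizes the modeled speedup."""
    return max(range(1, k_max + 1), key=lambda k: expected_speedup(alpha, k, c))


for alpha in (0.5, 0.7, 0.8, 0.9, 0.95):
    k = best_k(alpha, c=0.05)
    print(f"alpha={alpha}: best K={k}, modeled speedup={expected_speedup(alpha, k, 0.05):.2f}x")
```

The optimum shifts toward larger K as the acceptance rate rises or as the draft model becomes cheaper relative to the target.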
Some implementations (such as EAGLE-2 and EAGLE-3) dynamically adjust K based on the draft model's confidence, using more draft tokens when predictions are confident and fewer when they are uncertain.
Speculative decoding is implemented in all major LLM serving frameworks. The following table reflects the state of support as of mid-2025.
| Framework | Supported methods | Notes |
|---|---|---|
| vLLM | Draft model, EAGLE-1/2/3, P-EAGLE, Medusa, MTP, n-gram, MLPSpeculator | Integrated into the PagedAttention-based serving engine; per-position acceptance metrics exposed |
| TensorRT-LLM | Draft model, EAGLE, Medusa, ReDrafter, n-gram (prompt lookup) | NVIDIA-optimized CUDA kernels; up to 3.6x throughput boost reported, ReDrafter via Apple-NVIDIA integration |
| Hugging Face Transformers | Draft model (assisted generation), prompt lookup | Available via model.generate(assistant_model=...) |
| SGLang | Draft model, EAGLE-1/2/3, MTP | Supports RadixAttention for efficient prefix sharing; SpecForge training pipeline for EAGLE-3 |
| llama.cpp | Draft model, prompt lookup | CPU and GPU inference with GGUF models |
| Megatron-Bridge | MTP training and inference | NVIDIA training-side support for DeepSeek-style MTP |
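Speculative decoding provides the largest benefits under specific conditions: decoding is memory-bound (small batch sizes), the application is latency-sensitive rather than throughput-oriented, and the draft distribution closely tracks the target, as on low-entropy outputs such as code and math.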
Extending speculative decoding to batched inference introduces the "ragged tensor" problem. Different sequences in the same batch may accept different numbers of draft tokens, resulting in variable-length outputs that break the regular tensor shapes required for efficient GPU computation. Sequences must be padded or processed with custom attention masks, adding implementation complexity and potentially reducing throughput.
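The ragged-tensor problem can be seen in a few lines. The sketch below right-pads the tokens accepted by each sequence in a batch and reports how many slots are padding. The function name, the pad id, and the example batch are all made up for illustration.

```python
from typing import List, Tuple


def pad_ragged(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-pad variable-length rows; return (padded rows, validity mask)."""
    width = max(len(row) for row in batch)
    padded = [row + [pad_id] * (width - len(row)) for row in batch]
    mask = [[col < len(row) for col in range(width)] for row in batch]
    return padded, mask


# Tokens accepted in one round by three sequences in the same batch.
accepted = [[11, 12, 13, 14], [21], [31, 32]]
padded, mask = pad_ragged(accepted, pad_id=-1)
wasted = sum(not valid for row in mask for valid in row)

print(padded)  # [[11, 12, 13, 14], [21, -1, -1, -1], [31, 32, -1, -1]]
print(f"{wasted} of {len(mask) * len(mask[0])} slots are padding")  # 5 of 12 slots are padding
```

The validity mask is what downstream attention and position bookkeeping must respect; keeping it consistent across rounds is where batched implementations tend to get complicated.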
A 2025 analysis ("Batch Speculative Decoding Done Right," Zhang et al.) documented that many widely deployed batch implementations silently desynchronize position IDs, attention masks, and KV-cache state when sequences accept different numbers of tokens, producing repetitive or nonsensical outputs without raising an error. The authors proposed EQSPEC, which guarantees output equivalence at up to 40% alignment overhead, and EXSPEC, which dynamically groups same-length sequences across batches to recover most of that overhead. They reported up to 3x batch-8 throughput improvement on Spec-Bench across Vicuna, Qwen3, and GLM-4 model pairs while restoring near-perfect decoding equivalence with the non-batched baseline.
When using an independent draft model, both the draft and target models must reside in GPU memory simultaneously. For memory-constrained deployments, this can be prohibitive. Methods like self-speculative decoding, Medusa, EAGLE, and Speculative Streaming address this by reusing the target model's weights or attaching only small heads.
The speedup of speculative decoding depends heavily on how well the draft model's distribution matches the target model's distribution. If the two models are trained on different data or have significantly different capabilities, the acceptance rate may be too low to provide meaningful acceleration. Fine-tuning or distilling the draft model specifically for the target model can improve alignment but adds training cost.
Paradoxically, making the draft model more capable (and therefore more accurate) often makes it slower, reducing the speed ratio between draft and target models. There is an inherent tension between draft quality (high acceptance rate) and draft speed (low cost per draft token). Finding the optimal balance point requires empirical experimentation for each target model and task.
Implementing speculative decoding efficiently requires careful management of the KV cache (draft tokens may need to be evicted upon rejection), dynamic sequence lengths, and synchronization between draft and target model execution. Production-quality implementations in frameworks like vLLM and TensorRT-LLM represent significant engineering effort.
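The KV-cache rollback mentioned above amounts to truncating the cache to the accepted length after each round. The toy class below is hypothetical and stores token ids where a real cache would hold key and value tensors per layer.

```python
from typing import List


class ToyKVCache:
    """Hypothetical per-sequence cache with one entry per token position."""

    def __init__(self) -> None:
        self.positions: List[int] = []  # token ids stand in for key/value tensors

    def extend(self, tokens: List[int]) -> None:
        """Append cache entries for newly processed tokens."""
        self.positions.extend(tokens)

    def truncate(self, length: int) -> None:
        """Drop every cached position at index >= length."""
        del self.positions[length:]


cache = ToyKVCache()
cache.extend([101, 102, 103])    # prompt
prefix_len = len(cache.positions)

cache.extend([7, 8, 9, 10])      # verification wrote entries for K = 4 draft tokens
num_accepted = 2                 # suppose the third draft token was rejected

cache.truncate(prefix_len + num_accepted)  # evict entries for rejected drafts
print(cache.positions)           # [101, 102, 103, 7, 8]
```

The replacement or bonus token is not in the cache yet in this sketch; its entry would be written when it is fed to the model at the start of the next round.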
Standardized evaluation emerged with Spec-Bench (Xia et al., 2024), which measures wall-clock acceleration across six tasks (multi-turn chat, translation, summarization, question answering, math reasoning, and retrieval-augmented generation) and reports both mean accepted length per round and end-to-end speedup. Spec-Bench has two known limitations: a small, low-diversity prompt corpus and the conflation of pure-algorithm gains with kernel-level optimizations. SPEED-Bench (NVIDIA, 2026) addresses both by expanding the prompt pool by roughly an order of magnitude, stratifying by entropy (low-entropy domains like code and math vs. high-entropy domains like open-ended chat), and pinning the kernel stack so that algorithmic changes can be isolated from runtime differences. Early measurements confirmed that EAGLE-3-class methods consistently lead on low-entropy domains and that the gap shrinks but does not close on creative writing. A separate effort, SpecDecode-Bench ("Speculative Decoding: Performance or Illusion?"), audited several published implementations and found that some reported speedups had been inflated by comparing against unrealistically slow baselines or by ignoring tokenization overhead.
| Method | Year | Venue | Draft source | Training needed | Extra memory | Speedup range | Lossless |
|---|---|---|---|---|---|---|---|
| Speculative Decoding (Leviathan et al.) | 2023 | ICML | Independent small model | No | Yes | 2-3x | Yes |
| Speculative Sampling (Chen et al.) | 2023 | arXiv | Independent small model | No | Yes | 2-2.5x | Yes |
| SpecInfer (Miao et al.) | 2024 | ASPLOS | Multiple small models (tree) | No | Yes | 1.5-3.5x | Yes |
| Medusa (Cai et al.) | 2024 | ICML | Extra prediction heads | Yes (heads) | Minimal | 2.2-3.6x | Medusa-1: Yes; Medusa-2: Approximate |
| EAGLE-1 (Li et al.) | 2024 | ICML | Feature-level predictor | Yes (predictor) | Minimal | 2.7-3.5x | Yes |
| EAGLE-2 (Li et al.) | 2024 | EMNLP | Feature-level predictor + dynamic tree | Yes (predictor) | Minimal | 3.05-4.26x | Yes |
| EAGLE-3 (Li et al.) | 2025 | NeurIPS | Multi-level feature fusion + dynamic tree | Yes (predictor) | Minimal | 3.0-6.5x | Yes |
| P-EAGLE (AWS) | 2025 | AWS blog | EAGLE drafter, parallel execution | Yes (predictor) | Minimal | 3-5x | Yes |
| Sequoia (Chen et al.) | 2024 | NeurIPS | Optimized speculation tree | No | Yes (drafter) | Up to 9.5x (offload) | Yes |
| ReDrafter (Apple) | 2024 | arXiv | RNN drafter with beam search | Yes (RNN) | Modest | 2.5-3.5x | Yes |
| Speculative Streaming (Apple) | 2025 | EMNLP | Future n-gram heads on target | Yes (target fine-tune) | None | 1.8-3.1x | Approximate |
| Mirror-SD (Apple) | 2025 | arXiv | Drafter on GPU, target on NPU (or vice versa) | Yes | Modest | 2-3x w/ heterogeneous overlap | Yes |
| MTP (DeepSeek-V3) | 2024 | arXiv | Multi-token prediction heads from pretraining | Yes (during pretraining) | Small | 1.6-1.8x; up to 60% throughput gain | Yes |
| Self-Speculative (Zhang et al.) | 2024 | ACL | Layer-skipped target model | No | None | 1.5-2x | Yes |
| LayerSkip (Elhoushi et al.) | 2024 | ACL | Early exit from target model | Yes (early exit training) | None | 1.5-2x | Yes |
| Lookahead Decoding (Fu et al.) | 2024 | ICML | Jacobi iteration (no model) | No | None | 1.5-2.3x | Yes |
| N-gram / Prompt Lookup | 2023 | Open source | Input text n-gram matching | No | None | 1.5-2.5x | Yes |
The concept of speculative decoding was introduced in two independent papers that developed the same core idea concurrently.
Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google Research posted "Fast Inference from Transformers via Speculative Decoding" on arXiv on November 30, 2022. The paper was subsequently accepted as an oral presentation at ICML 2023. They demonstrated 2x to 3x speedups on T5-XXL (11B parameters) using a T5-Small (60M parameters) draft model, with no change to the output quality.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper at DeepMind published "Accelerating Large Language Model Decoding with Speculative Sampling" on arXiv on February 2, 2023. They demonstrated 2x to 2.5x speedups on Chinchilla (70B parameters) in a distributed computing environment.
Both papers independently arrived at the same modified rejection sampling scheme for preserving the target distribution. In a May 2023 revision, Leviathan et al. acknowledged Chen et al.'s work as a concurrent and independent development.
Google reported in 2024 that speculative decoding had been deployed in production for AI Overviews in Google Search, using distilled draft models to generate faster responses. The technique has since been adopted broadly across the industry, with inference frameworks, cloud providers, and hardware vendors all incorporating support for it.
Three developments in 2024-2025 reshaped the field. First, the EAGLE series demonstrated that feature-level autoregression with dynamic trees could break the 4x speedup barrier that had constrained classical draft-model approaches, and EAGLE-3 became the default speculative-decoding method in most production frameworks by the end of 2025. Second, DeepSeek-V3 mainstreamed the practice of training multi-token prediction heads directly into the base model, turning speculative decoding from a deployment-time add-on into a pretraining decision. Third, the maturation of batch-aware algorithms such as EXSPEC made speculative decoding viable for the high-concurrency serving regimes that had previously rejected it.