Speculative decoding is an inference optimization technique for large language models (LLMs) that accelerates text generation by using a smaller, faster draft model to propose candidate tokens, which are then verified in parallel by the larger target model. The technique typically achieves 2-3x speedups while producing output that is mathematically identical to what the target model would generate on its own. Since its introduction in 2022, speculative decoding has become one of the most widely adopted inference acceleration methods, deployed in production systems including Google Search's AI Overviews [1].
Autoregressive text generation in transformer-based language models is inherently sequential: each token depends on all previously generated tokens. For a model with billions of parameters, generating a single token requires loading the full model weights from GPU memory, performing a forward pass, and sampling from the output distribution. Because modern GPUs have far more compute capacity than memory bandwidth, this process is heavily memory-bound during the decoding phase. The GPU spends most of its time waiting for weights to be loaded from memory rather than performing arithmetic operations.
This memory-bound bottleneck means that generating 100 tokens takes roughly 100 times as long as generating a single token, regardless of how powerful the hardware is. The arithmetic intensity (ratio of computation to memory access) during autoregressive decoding is extremely low, leaving most of the GPU's computational resources idle. Speculative decoding addresses this inefficiency by converting some of the wasted compute capacity into useful work.
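The memory-bound ceiling can be estimated with back-of-the-envelope arithmetic. The numbers below (model size, weight precision, HBM bandwidth) are illustrative assumptions rather than measurements of any particular system:

```python
# Rough upper bound on autoregressive decode speed when memory-bound:
# every generated token requires streaming all model weights from HBM.
# Illustrative numbers, not measurements of a specific system.

params = 70e9            # 70B-parameter target model (assumed)
bytes_per_param = 2      # fp16/bf16 weights
hbm_bandwidth = 3.35e12  # ~3.35 TB/s, an H100-class GPU (assumed)

bytes_per_token = params * bytes_per_param          # 140 GB per decode step
max_tokens_per_s = hbm_bandwidth / bytes_per_token
print(f"memory-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s")
```

Under these assumptions the ceiling is roughly 24 tokens per second regardless of how many FLOPs the GPU can deliver, which is exactly the idle compute capacity speculative decoding tries to reclaim.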
The concept of speculative decoding was introduced independently by two research groups in late 2022 and early 2023.
Yaniv Leviathan, Matan Kalman, and Yossi Matias at Google published "Fast Inference from Transformers via Speculative Decoding" in November 2022 [2]. The paper drew an analogy to speculative execution in CPU design, where processors execute instructions ahead of time on the prediction that a branch will go a certain way, discarding the work if the prediction is wrong. Applied to language model inference, the authors proposed using a smaller "approximation model" to generate multiple candidate tokens, then verifying them against the target model in a single parallel forward pass. They demonstrated 2-3x speedups on T5-XXL models for translation and summarization tasks without any change to model outputs. The paper was published at ICML 2023.
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper at DeepMind published "Accelerating Large Language Model Decoding with Speculative Sampling" in February 2023 [3]. Working independently from the Google team, they arrived at the same core idea. They benchmarked their approach using Chinchilla, a 70 billion parameter model, and demonstrated 2-2.5x decoding speedups in a distributed inference setup. Their formulation emphasized the modified rejection sampling scheme that guarantees distributional equivalence.
Both papers established the foundational principle: speculative decoding achieves exact lossless acceleration by exploiting the gap between the draft model's speed and the target model's verification capacity.
Speculative decoding operates in a draft-then-verify loop. Each iteration proceeds through several well-defined steps.
A small, fast draft model (also called the approximation model) generates K tokens autoregressively. Because the draft model has far fewer parameters than the target model, each token generation step is much faster. For example, if the target model is a 70B parameter model and the draft model is a 7B parameter model, the draft model can produce tokens roughly 10x faster per step. The draft model generates tokens one at a time, producing a sequence of K candidate tokens along with their associated probability distributions q(x) at each position.
Typical values of K range from 3 to 8 tokens, depending on the alignment between the draft and target models.
The target model processes the entire sequence of K draft tokens in a single forward pass. Because transformer models can evaluate all positions in parallel during a forward pass (unlike the sequential nature of generation), this verification step takes roughly the same time as generating just one token. The target model produces its own probability distributions p(x) at each of the K positions.
This is the key insight: verifying K tokens costs about the same as generating one token, because the memory bandwidth bottleneck is dominated by loading model weights, which happens once per forward pass regardless of sequence length.
Each draft token is evaluated from left to right using a modified rejection sampling scheme. A draft token x at position i is accepted with probability min(1, p(x)/q(x)): tokens to which the target model assigns at least as much probability as the draft model are always kept, while tokens the draft model was overconfident about are kept only proportionally.
Once a token is rejected, all subsequent draft tokens in that batch are discarded. A replacement token is sampled at the rejection position from the corrected residual distribution, proportional to max(0, p(x') − q(x')) over the vocabulary.
If all K draft tokens are accepted, the target model's forward pass also provides logits for position K+1, allowing an additional token to be sampled at no extra cost. This means each iteration produces between 1 and K+1 tokens.
The process then repeats: the draft model generates a new batch of K candidate tokens starting from the last accepted position.
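The loop above can be sketched end to end on a toy vocabulary. Both "models" here are fixed probability tables standing in for transformer forward passes; only the control flow — draft K tokens, verify at K+1 positions, accept/reject left to right, resample on rejection, take a bonus token on full acceptance — mirrors the real algorithm:

```python
import random

random.seed(0)
VOCAB = range(4)
K = 4

def draft_probs(prefix):
    # Stand-in for the small draft model's next-token distribution q(.)
    return [0.4, 0.3, 0.2, 0.1]

def target_probs(prefix):
    # Stand-in for the large target model's next-token distribution p(.)
    return [0.5, 0.25, 0.15, 0.1]

def sample(probs):
    return random.choices(list(VOCAB), weights=probs)[0]

def speculative_step(prefix):
    # 1. Draft: generate K candidate tokens autoregressively with q.
    drafts, qs = [], []
    for _ in range(K):
        q = draft_probs(prefix + drafts)
        t = sample(q)
        drafts.append(t)
        qs.append(q)
    # 2. Verify: one (conceptual) target forward pass yields p at K+1 positions.
    ps = [target_probs(prefix + drafts[:i]) for i in range(K + 1)]
    # 3. Accept/reject left to right.
    accepted = []
    for i, t in enumerate(drafts):
        p, q = ps[i], qs[i]
        if random.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            # Resample from the normalized residual max(0, p - q)
            # and discard all remaining draft tokens.
            resid = [max(0.0, pj - qj) for pj, qj in zip(p, q)]
            z = sum(resid)
            accepted.append(sample([r / z for r in resid]))
            return accepted
    # 4. All K accepted: bonus token from the (K+1)-th target distribution.
    accepted.append(sample(ps[K]))
    return accepted

out = speculative_step([])
print(out)  # between 1 and K+1 tokens per iteration
```

A real implementation replaces the probability tables with batched model calls and manages a KV cache, but the accept/resample logic is exactly this.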
The most important property of speculative decoding is that it produces output from exactly the same probability distribution as standard autoregressive decoding from the target model. This is not an approximation; it is mathematically exact.
The proof relies on the structure of the acceptance-rejection scheme. For any token x at a given position, the probability of it being the final output can be decomposed as:
P(output = x) = q(x) · min(1, p(x)/q(x)) + β · p′(x)
where q(x) is the draft model's probability, min(1, p(x)/q(x)) is the acceptance probability, β = 1 − Σ_x′ q(x′) min(1, p(x′)/q(x′)) is the total rejection probability, and p′(x) ∝ max(0, p(x) − q(x)) is the resampling distribution after rejection. The first term simplifies to min(q(x), p(x)), and β equals Σ_x′ max(0, p(x′) − q(x′)), so the second term simplifies to max(0, p(x) − q(x)); their sum is min(q(x), p(x)) + max(0, p(x) − q(x)) = p(x), the target model's probability, for every possible token x [2][3].
This guarantee holds for both greedy decoding and stochastic sampling. It means speculative decoding introduces no quality degradation whatsoever. The speedup is entirely free in terms of output quality.
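The identity can be checked exactly for a single position with arbitrary toy distributions (the values of p and q below are illustrative; any valid pair works):

```python
# Exact check of the losslessness identity at one position:
# q(x)*min(1, p/q) + beta*p'(x) must recover p(x) for every token x.

p = [0.5, 0.25, 0.15, 0.10]   # target distribution (illustrative)
q = [0.2, 0.40, 0.30, 0.10]   # draft distribution (illustrative)

accept = [min(qi, pi) for qi, pi in zip(q, p)]        # q(x) * min(1, p/q)
beta = 1.0 - sum(accept)                              # total rejection prob
resid = [max(0.0, pi - qi) for qi, pi in zip(q, p)]
p_prime = [r / sum(resid) for r in resid]             # resampling dist

out = [a + beta * pp for a, pp in zip(accept, p_prime)]
for oi, pi in zip(out, p):
    assert abs(oi - pi) < 1e-12
print("recovered target distribution:", out)
```

The check passes for any pair of distributions, which is the sense in which the scheme is exact rather than approximate.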
The speedup from speculative decoding depends on several factors.
| Factor | Effect on Speedup |
|---|---|
| Draft model accuracy | Higher acceptance rate leads to more tokens per iteration |
| Draft model latency | Faster draft models reduce overhead per iteration |
| Number of draft tokens (K) | Higher K means more potential tokens but diminishing returns |
| Target model size | Larger models benefit more due to greater memory-boundedness |
| Batch size | Smaller batches (especially batch size 1) benefit most |
| Sequence length | Longer KV caches keep decoding memory-bound even at larger batches |
In practice, speedups of 2-3x are typical for single-request inference scenarios [1][2]. The acceptance rate of the draft model is the primary determinant: if the draft model matches the target model on 70-80% of tokens, the expected number of accepted tokens per iteration is high enough to overcome the overhead of running both models.
The theoretical speedup can be expressed in terms of the average acceptance length L (the expected number of tokens produced per iteration) and the per-iteration cost. If the draft model runs at c times the speed of the target model per token and K tokens are drafted each iteration, then one iteration costs roughly 1 + K/c target-model steps and yields L tokens, so the wallclock speedup over standard decoding is approximately L / (1 + K/c).
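Plugging illustrative numbers into this iteration-cost model (K draft steps, each costing 1/c of a target step, plus one target verification pass, yielding L tokens on average — the parameter values below are assumptions, not benchmarks):

```python
# Back-of-the-envelope speedup estimate from the iteration cost model.
# One iteration: K draft steps (each 1/c of a target step) + 1 target
# verification pass, producing L accepted tokens on average.

K = 5      # draft tokens per iteration (assumed)
c = 10.0   # draft model ~10x faster per token (assumed)
L = 3.5    # avg tokens accepted per iteration, incl. bonus token (assumed)

speedup = L / (1 + K / c)
print(f"estimated speedup: {speedup:.2f}x")  # ~2.33x
```

This simple model reproduces the 2-3x range reported in practice and makes the trade-off visible: raising K increases the potential L but also inflates the 1 + K/c denominator, which is why returns diminish.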
Since the original proposals, numerous variants have been developed to address different scenarios and improve upon the basic framework.
Medusa, developed by researchers at Together AI, takes a different approach to draft generation [4]. Instead of using a separate draft model, Medusa adds multiple prediction heads to the target model itself. Each head predicts a token at a different future position: head 1 predicts the next token, head 2 predicts two tokens ahead, and so on. These predictions are combined into candidate sequences using a tree-structured attention mechanism.
Medusa-1 freezes the base model and only trains the additional heads. Medusa-2 extends this with a recipe for jointly fine-tuning the base model and heads using self-distillation. Reported speedups range from 2.2-3.6x depending on the model and task. One limitation of Medusa is that, because it modifies the model architecture, its theoretical guarantee of distributional equivalence is weaker than methods that leave the target model untouched.
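Medusa's candidate construction can be illustrated in miniature: take the top-k tokens proposed by each head and combine them into candidate continuations, which the tree attention then verifies jointly. The per-head token lists below are fixed tables standing in for trained prediction heads:

```python
from itertools import product

# Hypothetical top-2 predictions from three prediction heads:
# head i proposes tokens for the position i steps ahead.
head_topk = [
    ["the", "a"],        # head 1: next token
    ["cat", "dog"],      # head 2: two tokens ahead
    ["sat", "ran"],      # head 3: three tokens ahead
]

# The Cartesian product of per-head candidates gives the candidate
# continuations that tree attention verifies in one forward pass.
candidates = [list(c) for c in product(*head_topk)]
print(len(candidates), "candidates, e.g.", candidates[0])
```

With k candidates per head and H heads, this yields k^H candidate sequences (8 here); the tree-structured attention mask lets the target model score all of them in a single pass without redundant prefix computation.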
EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a family of speculative decoding methods developed by the SafeAI Lab [5]. EAGLE uses an auxiliary model that operates on the feature level (hidden states) of the target model rather than on token-level outputs. This approach rethinks the draft process as feature extrapolation rather than language modeling.
The family has evolved through three iterations:
| Version | Venue | Key Innovation |
|---|---|---|
| EAGLE-1 | ICML 2024 | Feature-level extrapolation for draft generation |
| EAGLE-2 | EMNLP 2024 | Dynamic draft trees that adapt to input difficulty |
| EAGLE-3 | NeurIPS 2025 | Training-time rollouts to better mimic decoding behavior |
EAGLE-3 achieves 2-6x speedups depending on model size and batch configuration. Because EAGLE does not modify the target model at all, it preserves the output distribution exactly for both greedy and non-greedy sampling.
Self-speculative decoding eliminates the need for a separate draft model entirely [6]. The core idea is to use the target model itself as the draft model by skipping certain intermediate layers during the draft phase. During the drafting stage, a subset of attention and feed-forward layers is bypassed, producing faster but less accurate predictions. During verification, the full model with all layers is used.
This approach has two notable advantages: it requires no additional model training, and it adds no extra memory overhead since only one set of weights is loaded. Benchmarks with LLaMA-2 variants have demonstrated speedups up to 1.99x. Related techniques include LayerSkip (Meta, 2024), which trains models specifically to support early exit and self-speculative decoding [7].
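A minimal sketch of the layer-skipping idea, with transformer layers modeled as plain functions on a hidden state: the draft pass runs only a subset of layers, while the verify pass runs all of them. Which layers to skip is a tuned choice in the real methods; the selection below is arbitrary:

```python
# Toy model: each "layer" is a function on a hidden state (here an int).
layers = [lambda h, i=i: h + i for i in range(8)]

def forward(h, skip=frozenset()):
    for i, layer in enumerate(layers):
        if i not in skip:
            h = layer(h)
    return h

# Draft pass: skip half the layers -> cheaper, approximate output.
SKIP = frozenset({1, 3, 5, 7})     # arbitrary illustrative choice
draft_out = forward(0, skip=SKIP)

# Verify pass: the full model, all layers.
full_out = forward(0)
print(draft_out, full_out)
```

Because draft and verify share one set of weights, the only extra memory cost is bookkeeping, which is precisely the advantage the paragraph above describes.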
SpecInfer, published at ASPLOS 2024, introduces tree-based speculative inference and verification [8]. Rather than generating a single linear sequence of draft tokens, SpecInfer uses multiple small speculative models to collectively construct a token tree. Each branch of the tree represents a different possible continuation. The target model then verifies all branches in parallel using a tree-based parallel decoding mechanism.
This tree structure increases the probability that at least one path through the tree matches the target model's preferred continuation. SpecInfer demonstrated 1.5-2.8x speedups for distributed LLM inference and 2.6-3.5x for offloading-based inference.
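The tree-verification idea can be sketched with greedy acceptance: given a token tree of drafted continuations and the target model's preferred token after each prefix (here a fixed lookup standing in for the single tree-attention forward pass), find the longest root path whose tokens all match. This simplification covers only greedy decoding; SpecInfer's full scheme also handles stochastic sampling:

```python
# A token tree: each node maps a drafted token to its subtree.
tree = {
    "the": {"cat": {"sat": {}}, "dog": {"ran": {}}},
    "a":   {"bird": {}},
}

# Stand-in for the target model's greedy choice after a given prefix
# (a real system reads these out of one tree-attention forward pass).
target_next = {
    (): "the",
    ("the",): "dog",
    ("the", "dog"): "ran",
}

def longest_accepted(tree, prefix=()):
    # Walk the tree, following branches the target model agrees with.
    want = target_next.get(prefix)
    if want is not None and want in tree:
        return [want] + longest_accepted(tree[want], prefix + (want,))
    return []

print(longest_accepted(tree))  # ['the', 'dog', 'ran']
```

A single linear draft containing "the cat sat" would have stalled after one token here; the tree's second branch lets the iteration accept three.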
Staged speculative decoding extends the concept to a cascade of models. Instead of a single draft model and a single target model, it uses a hierarchy: a very small model drafts tokens, a medium model verifies those, and the large target model performs final verification. This multi-stage approach can be particularly effective when there is a large gap between the smallest available model and the target model.
Several additional approaches have expanded the design space of speculative decoding:
| Variant | Key Idea |
|---|---|
| SpecTr | Uses optimal transport theory for multi-token speculative decoding |
| Lookahead Decoding | Uses n-gram generation from Jacobi iteration without a draft model |
| SuffixDecoding | Retrieval-based (model-free) drafting using suffix automata |
| Online Speculative Decoding (OSD) | Continuously adapts draft models to evolving query distributions |
| REST | Retrieval-based speculative decoding using a datastore of n-grams |
| Mirror Speculative Decoding | Apple research breaking the serial barrier in LLM inference |
Speculative decoding is not universally beneficial. Its effectiveness depends on the inference regime.
Memory-bound inference. The technique is most effective when the target model's decoding is memory-bound, meaning the GPU's compute units are underutilized while waiting for memory transfers. This is the typical situation for single-request or small-batch inference with large models. At batch size 1, the arithmetic intensity of autoregressive decoding is very low, and the GPU has ample spare compute capacity to run both the draft and target models.
Long text generation. Tasks that require generating many tokens (long-form writing, code generation, detailed explanations) benefit proportionally more, since the per-token cost reduction compounds over many generation steps.
High draft model alignment. When the draft model closely approximates the target model's distribution for the given domain, acceptance rates are high and speedups approach their theoretical maximum. Domain-specific fine-tuning of the draft model can significantly boost performance for particular use cases.
Long context scenarios. Recent research has shown that speculative decoding remains effective even at larger batch sizes when context lengths are long, because the large KV cache keeps the attention kernels in the memory-bound regime [9].
Despite its benefits, speculative decoding has several practical limitations.
Draft model overhead. Running a draft model consumes additional GPU memory and compute. On a single GPU, the draft model's memory footprint can reduce the space available for the target model's KV cache, potentially limiting the maximum context length or batch size.
Diminishing returns at large batch sizes. At large batch sizes without long contexts, autoregressive decoding becomes more compute-bound, and the GPU is already well-utilized. In this regime, the overhead of running the draft model may outweigh the speedup from parallel verification. However, recent research from Together AI (2025) has shown that speedups of up to 2x are still achievable at batch size 256 in long-context settings [9].
Draft model quality matters. If the draft model's predictions diverge significantly from the target model, most tokens will be rejected, and each iteration will produce only one or two tokens. In this case, speculative decoding can actually be slower than standard decoding due to the overhead of running both models. The choice and alignment of the draft model is critical.
Sensitivity to hyperparameters. The number of draft tokens K, the choice of draft model, and system-level configurations all affect performance. There is no universal optimal configuration; tuning is required for each target model, hardware setup, and workload.
Implementation complexity. Integrating speculative decoding into serving systems requires careful engineering. KV cache management becomes more complex because draft tokens that are rejected must have their cache entries invalidated. Batching multiple requests with different acceptance lengths adds scheduling complexity.
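The KV-cache bookkeeping mentioned above amounts to rolling the cache back to the last accepted position whenever drafts are rejected. A minimal sketch, with each layer's cache modeled as a list of per-position (key, value) placeholders:

```python
# Minimal KV-cache rollback sketch. Each layer's cache is a list of
# (key, value) entries, one per token position (placeholders here).
cache = {layer: [] for layer in range(2)}

def append_positions(cache, n):
    for entries in cache.values():
        entries.extend(("k", "v") for _ in range(n))

def rollback(cache, n_valid_total):
    # Invalidate cache entries written for rejected draft tokens.
    for layer in cache:
        del cache[layer][n_valid_total:]

append_positions(cache, 10)   # 10 tokens already decoded
append_positions(cache, 5)    # K=5 draft tokens verified this iteration
rollback(cache, 12)           # only 2 drafts accepted -> keep 12 entries
print([len(v) for v in cache.values()])
```

Production systems do the same invalidation over paged GPU cache blocks rather than Python lists, and must additionally coordinate it across requests in a batch whose acceptance lengths differ.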
By 2025, all major LLM inference frameworks have incorporated production-ready speculative decoding support.
vLLM, the popular open-source serving framework, supports multiple speculative decoding methods including draft model-based speculation and EAGLE [10]. Starting from version 0.16.0, vLLM integrated P-EAGLE (Parallel EAGLE), developed in collaboration with AWS, which improves upon standard EAGLE with parallelized draft generation. Enabling speculative decoding in vLLM is straightforward via command-line flags:
```
--speculative-method eagle --num-speculative-tokens 8
```
vLLM's implementation achieves up to 2.5x speedup and handles the complex bookkeeping of KV cache management, rejection sampling, and batched verification automatically.
NVIDIA's TensorRT-LLM framework supports speculative decoding with both internal and external draft models [11]. NVIDIA has demonstrated 3.6x throughput improvements on H200 GPUs. The framework integrates speculative decoding with other optimization techniques such as quantization and in-flight batching. NVIDIA's Model Optimizer library provides tools for training and optimizing draft models specifically for use with TensorRT-LLM.
SGLang, a high-performance serving framework developed at UC Berkeley, supports speculative decoding as part of its runtime [12]. SGLang has shown competitive performance, particularly at moderate concurrency levels. The framework also supports SpecForge, a tool for efficiently training draft models tailored to specific target models and workloads.
| Framework | Speculative Methods Supported | Notable Features |
|---|---|---|
| vLLM | Draft model, EAGLE, P-EAGLE, Medusa | Automatic KV cache management, broad hardware support |
| TensorRT-LLM | Draft model (internal/external), EAGLE | Tight NVIDIA GPU optimization, in-flight batching integration |
| SGLang | Draft model, EAGLE, SpecForge | High concurrency performance, draft model training tools |
Speculative decoding has moved beyond research into large-scale production use. Google disclosed in December 2024 that speculative decoding powers AI Overviews in Google Search, reducing latency for billions of queries while maintaining the same quality of responses [1]. This deployment validated the technique's robustness at extreme scale.
Cloud AI platforms including AWS, Google Cloud, and Azure offer speculative decoding as a configurable option in their managed LLM inference endpoints. The technique has become a standard component of the modern LLM serving stack alongside continuous batching, KV cache optimization, and model quantization.
As of early 2026, speculative decoding continues to evolve along several fronts.
Adaptive and online methods. Online Speculative Decoding (OSD), introduced in a 2025 UC Berkeley dissertation, continuously adapts draft models to the evolving distribution of queries during serving via online knowledge distillation. This addresses the problem of static draft models becoming misaligned with shifting workloads over time.
System-level co-optimization. TurboSpec and similar systems treat speculative decoding as a control problem, using offline profiling and online feedback to dynamically adjust parameters such as K and the draft tree structure at runtime. This closed-loop approach robustly optimizes performance across diverse and changing workloads.
Rethinking draft model design. Recent large-scale benchmarks have revealed that a draft model's language modeling accuracy does not correlate strongly with its speculative decoding throughput. Draft model latency is a far stronger determinant of end-to-end performance, suggesting that very small, fast models may outperform larger, more accurate draft models in practice.
Hardware co-design. As speculative decoding becomes a standard inference technique, hardware designers are beginning to consider its requirements. The technique's need for efficient parallel verification and fast draft generation may influence future GPU and accelerator architectures.
Speculative Speculative Decoding. Published at ICLR 2026, this work explores applying speculative execution principles recursively, speculatively generating the drafts themselves to further reduce latency [13].