EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) is a family of speculative decoding algorithms designed to accelerate inference in large language models (LLMs) without altering their output distributions. Developed by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang -- researchers at Peking University, the University of Waterloo, the Vector Institute, and Microsoft Research -- EAGLE rethinks where speculation happens: rather than operating at the token level like classic speculative decoding or at the output-head level like Medusa, it speculates at the second-to-top-layer feature level. This shift sidesteps a fundamental source of uncertainty, introduced by sampling, that limits the accuracy of token-level drafting.
The original EAGLE paper (arXiv:2401.15077) was published in January 2024 and accepted at ICML 2024. It was followed by EAGLE-2 (arXiv:2406.16858, EMNLP 2024), which introduced dynamic draft trees calibrated by confidence scores, and EAGLE-3 (arXiv:2503.01840, NeurIPS 2025), which abandoned feature prediction in favor of direct token prediction with multi-layer semantic fusion. Across the three iterations, the peak measured speedup over vanilla autoregressive decoding grew from roughly 3x to 5.6x on standard 13B-parameter models, with EAGLE-3 reaching 6.5x on some configurations. EAGLE is integrated into vLLM, SGLang, NVIDIA TensorRT-LLM, MLC-LLM, AMD ROCm, and AWS NeuronX.
Speculative decoding is a lossless inference acceleration technique first described independently by Leviathan et al. and Chen et al. in 2022-2023. The core idea is to use a cheap draft model to propose several tokens at once, then verify the entire bundle with the target model in a single forward pass. If the target model accepts a proposed token, it is kept at no additional cost; rejected tokens trigger resampling from the target distribution, preserving losslessness. The practical speedup depends on two quantities: how fast the draft model is relative to the target, and how often the target model accepts the draft tokens.
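The accept/resample rule at the heart of this verification step is compact enough to sketch directly. The snippet below is a minimal illustration in PyTorch, not any particular library's implementation; the tensor names and shapes are assumptions made for the sketch (k drafted positions, with the target distribution available for one extra position to supply the bonus token).

```python
import torch

def verify_draft(draft_tokens, draft_probs, target_probs):
    """Accept/reject a linear chain of drafted tokens (speculative sampling).

    draft_tokens: (k,)          token ids proposed by the draft model
    draft_probs:  (k, vocab)    draft distribution at each drafted position
    target_probs: (k+1, vocab)  target distribution at those positions plus one extra
    Returns the accepted prefix plus one token sampled from the target,
    so every cycle emits at least one token and stays lossless.
    """
    out = []
    for j, t in enumerate(draft_tokens):
        p, q = target_probs[j, t], draft_probs[j, t]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):   # accept with prob min(1, p/q)
            out.append(int(t))
        else:
            # rejected: resample from the residual distribution max(p - q, 0), renormalized
            residual = torch.clamp(target_probs[j] - draft_probs[j], min=0.0)
            out.append(int(torch.multinomial(residual / residual.sum(), 1)))
            return out                                      # a rejection ends the cycle
    # all drafts accepted: take a "bonus" token from the next target distribution
    out.append(int(torch.multinomial(target_probs[len(draft_tokens)], 1)))
    return out
```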
Vanilla speculative decoding requires a separate, smaller model that shares vocabulary and architecture with the target. Finding or training such a model is a practical burden: the draft model must be compact enough to be substantially faster than the target, yet accurate enough to produce acceptable proposals. For a 70B-parameter target model, a suitable draft model might be 7B parameters, which still requires memory and routing overhead. Furthermore, the relationship between draft and target distributions is difficult to optimize because the two models are trained independently.
Medusa, introduced in 2024 by Cai et al., takes a different approach. It attaches multiple additional prediction heads directly to the base model's final layer. Each head predicts a future token position independently. Because the heads run in parallel on top of features that the base model has already computed, no separate model is needed, and the draft overhead is small. However, Medusa's heads predict tokens independently without conditioning on each other, which limits their accuracy. Typical draft acceptance rates for Medusa are around 0.6, compared to roughly 0.8 for EAGLE. On MT-Bench with a 13B model, Medusa achieves approximately 2.1x speedup; EAGLE-1 achieves 3.0x on the same benchmark.
Both vanilla speculative decoding and Medusa operate at the token level: each head or draft model outputs a probability distribution over the vocabulary and proposes the most likely next token. EAGLE's core insight is that prediction at the feature level is an easier problem.
In a standard transformer LLM, computation proceeds from token embeddings through N transformer layers to a final hidden state -- the feature, which the EAGLE papers call the second-to-top-layer feature, counting the LM head as the top layer -- which is then projected through the linear LM head to produce logit scores over the vocabulary. Each feature encodes not just the most probable next token but a rich probability distribution over many possible continuations. When a draft model tries to predict the next token directly, it discards this distributional richness in favor of a single discrete choice.
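A rough illustration of this pipeline, with generic tensor shapes rather than any specific model's API: the final hidden state already defines a full next-token distribution, and sampling collapses it to a single choice.

```python
import torch
import torch.nn.functional as F

hidden = torch.randn(10, 4096)        # (seq_len, d_model) final hidden states ("features")
lm_head = torch.randn(32000, 4096)    # (vocab, d_model) frozen output projection

logits = hidden @ lm_head.T               # one full distribution per position
probs = F.softmax(logits[-1], dim=-1)     # rich distribution over all continuations
next_token = torch.multinomial(probs, 1)  # sampling keeps only one discrete outcome
```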
More importantly, sampling injects randomness between positions: the feature at a given position is determined jointly by the token actually sampled at that position and everything that came before, not by the preceding features alone. If the sampling step picks "always" instead of "am", the feature at that position diverges along a completely different trajectory. A draft predictor that is not told which token was sampled must implicitly guess which branch was taken, and this ambiguity degrades accuracy.
EAGLE proposes speculating at this feature level rather than the token level. Feature sequences are more regular than token sequences and therefore easier to extrapolate, but they carry the one source of randomness just described: the next feature depends on which token was actually sampled from the current position's distribution. EAGLE removes this ambiguity by giving the draft model both the preceding feature sequence and the token sequence advanced by one time step -- that is, the token sampled from position i's distribution (t_{i+1}) is provided as input when predicting the feature at position i+1. With the sampling outcome supplied explicitly, the uncertainty evaporates: the draft model knows which branch was taken.
Formally, the EAGLE draft model predicts the feature f_{i+1} given the context (f_{1:i}, t_{2:i+1}), where t_{2:i+1} is the token sequence shifted one position forward. Because the token at position i+1 will eventually be drawn from the distribution induced by f_{i+1} (through the LM head), conditioning on the one-step-ahead token resolves the sampling ambiguity that plagued earlier feature-level predictors.
The EAGLE draft model is lightweight. It reuses the embedding layer and LM head from the frozen target model (no additional parameters) and adds a single auto-regression head consisting of a fully connected (FC) layer followed by one transformer decoder layer. Depending on the target model size, the auto-regression head introduces between 0.24 billion and 0.99 billion additional parameters.
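A minimal sketch of such a head, assuming the interface described above; module names, shapes, and the use of a generic PyTorch encoder layer as a stand-in for a LLaMA-style decoder layer are illustrative, not the reference implementation.

```python
import torch
import torch.nn as nn

class EagleDraftHead(nn.Module):
    """Sketch of EAGLE's auto-regression head: the target model's embedding table
    and LM head are reused (frozen); only the FC fusion layer and one decoder-style
    transformer layer are new."""

    def __init__(self, embed_tokens, lm_head, d_model, n_heads=32):
        super().__init__()
        self.embed_tokens = embed_tokens             # shared with the target model, frozen
        self.lm_head = lm_head                       # shared with the target model, frozen
        self.fuse = nn.Linear(2 * d_model, d_model)  # FC layer merging feature + token embedding
        self.block = nn.TransformerEncoderLayer(     # stand-in for one LLaMA-style decoder layer
            d_model, n_heads, batch_first=True)

    def forward(self, feats, shifted_tokens):
        # feats:          (batch, seq, d_model)  features f_1 .. f_i
        # shifted_tokens: (batch, seq)           tokens t_2 .. t_{i+1}, advanced one step
        x = self.fuse(torch.cat([feats, self.embed_tokens(shifted_tokens)], dim=-1))
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        next_feats = self.block(x, src_mask=causal)  # predicted f_2 .. f_{i+1}
        return next_feats, self.lm_head(next_feats)  # draft features and draft logits
```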
The training pipeline computes features from the frozen target model on a text corpus (68,000 ShareGPT dialogue rounds in the original work), then trains the auto-regression head to minimize a combined loss:
The combined loss is L = L_reg + 0.1 * L_cls. Small uniform noise drawn from U(-0.1, 0.1) is injected into features during training to mitigate error accumulation during multi-step drafting. Training on a 70B model completes in 1-2 days on a cluster of 8x A100-40G GPUs. For smaller models, consumer hardware (8x RTX 3090) suffices.
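A compact sketch of this objective with illustrative shapes; the regression term uses Smooth L1 as in the paper, and the noise helper shows the U(-0.1, 0.1) injection.

```python
import torch
import torch.nn.functional as F

def eagle_draft_loss(pred_feats, target_feats, pred_logits, target_tokens, w_cls=0.1):
    """L = L_reg + 0.1 * L_cls: Smooth L1 regression toward the frozen target
    model's features, plus cross-entropy against the target model's next tokens."""
    l_reg = F.smooth_l1_loss(pred_feats, target_feats)
    l_cls = F.cross_entropy(pred_logits.flatten(0, 1), target_tokens.flatten())
    return l_reg + w_cls * l_cls

def add_training_noise(feats, scale=0.1):
    """Uniform noise from U(-0.1, 0.1) added to input features during training,
    making the head robust to the imprecise features it produces at draft time."""
    return feats + (torch.rand_like(feats) * 2 - 1) * scale
```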
Linear draft sequences are inefficient because a rejection early in the chain wastes all subsequent drafts. EAGLE drafts a tree-structured candidate set instead. Beginning from the current context, the draft model generates a branching tree of possible token continuations. At the root the top-k (default k=4) highest-probability tokens are expanded; at subsequent levels k=3 or k=2 tokens are expanded per branch. A tree of depth 5 can therefore represent more than 10 candidate tokens, all of which are verified by the target model in a single forward pass using a modified attention mask that enforces the tree structure.
During verification, the target model processes the entire tree in one call by carefully masking attention so that each candidate node attends only to its own ancestors. This is sometimes called "tree attention" or "multi-candidate attention." The target accepts or rejects each node using the standard speculative decoding acceptance criterion: a token t at position j is accepted with probability min(1, p_j(t) / p_hat_j(t)), where p_j is the target distribution and p_hat_j is the draft distribution.
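Constructing that mask is straightforward once the tree is flattened. The helper below is an illustrative sketch assuming the tree is given as a parent-index array; real implementations also prepend the already-committed prompt, which every node may attend to.

```python
import torch

def tree_attention_mask(parents):
    """Boolean attention mask for single-pass tree verification.

    parents[i] is the index of node i's parent in the flattened draft tree
    (-1 for a root-level node). Node i may attend to node j only if j is i
    itself or one of i's ancestors, so each root-to-leaf path behaves like an
    ordinary causal sequence while sibling branches stay invisible to each other.
    """
    n = len(parents)
    allowed = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            allowed[i, j] = True
            j = parents[j]
    return allowed

# A small tree: node 0 has children 1 and 2; node 1 has child 3.
mask = tree_attention_mask([-1, 0, 0, 1])
# mask[3] allows positions {0, 1, 3} but not 2: node 3 never sees its sibling branch.
```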
The average acceptance length tau -- the expected number of tokens accepted per drafting-verification cycle -- is a key efficiency metric. On LLaMA2-Chat 13B at temperature 0, EAGLE-1 achieves tau = 3.90: each cycle yields nearly four tokens for the cost of one target forward pass plus the much cheaper draft passes, roughly 1.3 target-pass equivalents in total. Standard speculative decoding with a separate 7B draft model achieves tau of around 2.0 on the same target.
EAGLE-1 was evaluated on multiple LLM families (Vicuna, LLaMA2-Chat, and Mixtral 8x7B, among others) across six tasks: dialogue (MT-Bench), code generation (HumanEval), mathematical reasoning (GSM8K), instruction following (Alpaca), summarization (CNN/DailyMail), and question answering (Natural Questions). All experiments used greedy decoding unless noted otherwise.
On MT-Bench with LLaMA2-Chat 13B, EAGLE-1 reached roughly 3.0x over vanilla autoregressive decoding (3.03x; see the summary table below).
Code generation produced the highest speedups (up to 3.76x on HumanEval with LLaMA2-Chat 13B), consistent with the higher predictability of code syntax. Mathematical reasoning tasks (GSM8K) achieved 3.03x-3.20x; instruction-following tasks (Alpaca) achieved 2.78x-3.03x. Summarization and QA tasks, which involve less predictable continuation structure, fell in the 2.5x-2.8x range.
For the 70B model (LLaMA2-Chat 70B), EAGLE-1 achieved 2.7x-3.5x speedup across tasks, slightly lower than the 13B results because the relative overhead of the draft step is larger when the target model is bigger. On Mixtral 8x7B (a mixture-of-experts architecture), the speedup was approximately 1.5x -- lower because expert routing during verification reduces the parallelism available within a single forward pass.
At temperature=1 (non-greedy sampling), speedups dropped to 2.66x-2.89x due to reduced token predictability. At batch sizes above 1, memory bandwidth becomes less of a bottleneck and the marginal benefit of speculative decoding decreases; nevertheless, EAGLE-1 delivered approximately 2.0x throughput improvement even at maximum batch capacity.
The paper confirmed that EAGLE is lossless: the output distribution of EAGLE matches that of the base model exactly, verified empirically by measuring KL divergence between output distributions.
EAGLE-2 (arXiv:2406.16858), published in June 2024 and accepted at EMNLP 2024, identified and fixed a key inefficiency in EAGLE-1: the draft tree structure was static, using the same branching pattern for every input regardless of how predictable the continuation actually was.
In EAGLE-1, the tree shape (branching factors and depth) is fixed as a hyperparameter before inference and never changes. This is suboptimal because token predictability is highly context-dependent: highly structured text (code, templated responses, mathematical expressions) has predictable continuations and benefits from deeper, wider trees, whereas open-ended dialogue or creative text is far less predictable, so a shallower tree wastes less compute on nodes that are likely to be rejected.
EAGLE-2's core observation is that the draft model's own confidence scores are well-calibrated proxies for actual token acceptance rates. The paper empirically shows that draft tokens with a confidence score below 0.05 have an actual acceptance rate of approximately 0.04, while tokens with confidence above 0.95 have an acceptance rate of approximately 0.98. This near-perfect monotonic relationship means the draft model's softmax outputs can substitute for expensive oracle acceptance rate estimation.
From this calibration, EAGLE-2 constructs a value V_i for each candidate node in the draft tree as the product of confidence scores along the path from the root to that node. This approximates the probability that every ancestor of node i in the tree will be accepted, making V_i a good estimate of the expected utility of including node i in the tree.
At each drafting step, EAGLE-2 replaces the static tree with a dynamically grown tree built by a two-phase process:
Expansion: Starting from the current root, the draft model expands the top-k nodes (by V_i) at each depth level. The tree grows greedily toward the branches most likely to be accepted.
Reranking: All candidate nodes across all depths are ranked by their V_i values, and the top-m nodes (where m is the budget of tokens to verify) are selected. Because a node's value can never exceed its parent's -- each value is the parent's value multiplied by a confidence no greater than 1 -- the top-m set automatically forms a connected subtree; ties are broken in favor of shallower nodes.
The selected nodes are flattened into a 1D sequence with an associated attention mask encoding the tree structure. This sequence is passed to the target model for single-pass verification, exactly as in EAGLE-1.
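The two phases can be sketched compactly. The code below is an illustrative outline, not the paper's implementation: `draft_step` is a hypothetical stand-in for one draft-model forward pass returning (token, confidence, new_context) triples, and the default k, depth, and m values are arbitrary examples.

```python
def grow_and_rerank(draft_step, root_context, k=8, depth=6, m=60):
    """Sketch of EAGLE-2's dynamic draft tree: greedy expansion by path value,
    then global reranking. A node's value is the product of draft confidences
    along its root path, so a child's value never exceeds its parent's."""
    nodes = []                               # (value, depth, token, parent_index)
    frontier = [(1.0, -1, root_context)]     # (path value, index of originating node, context)
    for d in range(depth):
        # Expansion: only the k most valuable frontier nodes are expanded further.
        frontier.sort(key=lambda x: -x[0])
        next_frontier = []
        for value, parent, ctx in frontier[:k]:
            for token, conf, new_ctx in draft_step(ctx):
                nodes.append((value * conf, d, token, parent))
                next_frontier.append((value * conf, len(nodes) - 1, new_ctx))
        frontier = next_frontier
    # Reranking: keep the m most valuable nodes overall, ties broken toward shallower depth.
    order = sorted(range(len(nodes)), key=lambda i: (-nodes[i][0], nodes[i][1]))
    return [nodes[i] for i in order[:m]]
```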
Critically, EAGLE-2 requires no additional training beyond EAGLE-1. The dynamic tree construction is a runtime inference algorithm that uses the already-trained draft model's own outputs. This makes upgrading from EAGLE-1 to EAGLE-2 a zero-cost change for existing deployments.
EAGLE-2 was evaluated on the same six tasks as EAGLE-1 using Vicuna and LLaMA2-Chat in 7B and 13B sizes. The improvement over EAGLE-1 was consistent across all configurations.
Speedup comparisons on MT-Bench (temperature=0):
| Model | EAGLE-1 | EAGLE-2 | Gain |
|---|---|---|---|
| Vicuna 13B | 3.07x | 4.26x | +39% |
| LLaMA2-Chat 13B | 3.03x | 4.21x | +39% |
| Vicuna 7B | 2.90x | 3.62x | +25% |
| LLaMA2-Chat 7B | 2.78x | 3.43x | +23% |
Average acceptance length comparisons (tau):
| Model | EAGLE-1 | EAGLE-2 |
|---|---|---|
| Vicuna 13B | 3.98 | 4.83 |
| LLaMA2-Chat 13B | 3.90 | 4.75 |
| Vicuna 7B | 3.94 | 4.98 |
| LLaMA2-Chat 7B | 3.62 | 4.70 |
Task-specific results for Vicuna 13B showed the highest gains in code generation (HumanEval: 4.96x) and instruction following (Alpaca: 4.25x), while summarization (CNN/DM: 3.40x) and QA (Natural Questions: 3.13x) lagged due to lower predictability.
Ablation studies confirmed that both the confidence-value scoring and the reranking phase contribute meaningfully. Removing both drops the speedup from 3.62x to 2.81x on Vicuna 7B; restoring value-based scoring alone raises it to 3.21x; restoring value plus reranking achieves the full 3.62x.
EAGLE-3 (arXiv:2503.01840), published in March 2025 and accepted at NeurIPS 2025, addresses a scaling ceiling that EAGLE-1 and EAGLE-2 share: adding more training data produces diminishing returns. The paper shows that the root cause is the feature prediction objective itself, which acts as an architectural constraint that limits the expressiveness of the draft model.
In EAGLE-1 and EAGLE-2, the draft model is trained to predict the next feature vector (the target model's final hidden state) via a regression loss. This is a proxy task: what the system actually needs is for the draft model to produce token distributions that the target model will accept. The regression loss on feature vectors can be satisfied by predictions that are directionally correct but imprecise, and the imprecision grows as more difficult or diverse training data is added -- the draft model cannot represent all modes simultaneously under the regression objective.
EAGLE-3 removes the feature regression loss entirely. The draft model is trained only with a token-level classification (cross-entropy) loss. This frees the model to focus entirely on what matters for acceptance: producing the right token distributions.
Removing the feature prediction target creates a new problem: the draft model during training always conditions on exact ground-truth features from the frozen target model, but at inference time it must condition on its own previously generated features (since exact target features are not available ahead of time). This train-inference mismatch limits generalization.
EAGLE-3 resolves this with a technique called training-time test. During training, the system simulates the actual drafting process: the draft model generates several draft steps in sequence, and its own outputs from earlier steps -- rather than ground-truth features from the frozen target model -- are fed back as inputs to the later steps, mirroring exactly what happens at test time. Because the training loop exposes the draft model to the same distribution of inputs it will encounter during inference, the mismatch is removed. The approach is analogous to scheduled sampling in sequence-to-sequence models, applied here to the draft model's feature inputs rather than to token inputs.
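A heavily simplified schematic of the idea, assuming a draft head with the interface sketched earlier; label alignment across steps is glossed over, and gradient handling through earlier steps is a training detail not shown here.

```python
import torch.nn.functional as F

def training_time_test_loss(draft_head, target_feats, shifted_tokens, labels, steps=3):
    """Simulate multi-step drafting during training. Step 0 conditions on real
    target-model features, but every later step consumes the draft head's own
    predicted features, matching what the head will see at inference time.
    Only a token-level cross-entropy is used: no feature regression term."""
    loss = 0.0
    feats = target_feats                       # (batch, seq, d_model) from the frozen target
    for _ in range(steps):
        feats, logits = draft_head(feats, shifted_tokens)
        loss = loss + F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        # `feats` now holds the head's own outputs, which the next step consumes
    return loss
```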
EAGLE-1 and EAGLE-2 use only the target model's final-hidden-layer features as context. EAGLE-3 additionally incorporates features from earlier layers, providing the draft model with multi-level semantic information.
Concretely, EAGLE-3 concatenates k-dimensional feature vectors from three selected layers into a 3k-dimensional vector, then passes it through a fully connected layer to reduce it back to k dimensions. This fused representation provides richer contextual signal than any single layer alone.
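A minimal sketch of that fusion step, with illustrative module names and an unspecified choice of which three layers to tap.

```python
import torch
import torch.nn as nn

class MultiLayerFusion(nn.Module):
    """Concatenate features from three selected target-model layers (low, mid,
    high) into a 3k-dimensional vector and project back to k dimensions."""

    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(3 * d_model, d_model)

    def forward(self, low, mid, high):
        # each input: (batch, seq, d_model) hidden states from one target layer
        return self.proj(torch.cat([low, mid, high], dim=-1))
```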
The combination of training-time test and multi-layer fusion also enables data scaling: EAGLE-3 shows proportional improvements as training data increases, unlike EAGLE-1 and EAGLE-2 which plateau. This makes EAGLE-3 amenable to continued improvement simply by training on more data.
EAGLE-3 was evaluated on chat models (Vicuna, LLaMA-Instruct) and reasoning models (DeepSeek-R1 distillations) across multiple scales. The dynamic draft tree from EAGLE-2 is retained.
Speedup comparisons (temperature=0, MT-Bench or equivalent):
| Model | EAGLE-2 | EAGLE-3 | Gain |
|---|---|---|---|
| Vicuna 13B | 4.26x | 5.58x | +31% |
| LLaMA-3.1-8B Instruct | 3.16x | 4.40x | +39% |
| LLaMA-3.3-70B Instruct | 2.83x | 4.11x | +45% |
| DeepSeek-R1-Distill-LLaMA-8B | 2.92x | 4.05x | +39% |
Peak measured speedup reached 5.6x on Vicuna 13B (some configurations reached 6.5x). In the SGLang serving framework at batch size 64, EAGLE-3 delivered a 1.38x throughput improvement over EAGLE-2, a significant gain at scale.
The following table summarizes measured speedups across the EAGLE family and key competing methods on MT-Bench at temperature=0 with greedy decoding. All results are lossless (output distributions preserved).
| Method | Vicuna 7B | LLaMA2-Chat 13B | LLaMA 3.1 8B | LLaMA 3.3 70B |
|---|---|---|---|---|
| Vanilla autoregressive | 1.00x | 1.00x | 1.00x | 1.00x |
| Lookahead decoding | ~1.4x | ~1.5x | -- | -- |
| Medusa | ~1.8x | ~2.1x | -- | -- |
| Standard spec. decoding | ~1.9x | ~1.9x | -- | -- |
| EAGLE-1 (ICML 2024) | 2.90x | 3.03x | -- | -- |
| EAGLE-2 (EMNLP 2024) | 3.62x | 4.21x | 3.16x | 2.83x |
| EAGLE-3 (NeurIPS 2025) | -- | -- | 4.40x | 4.11x |
Task-specific speedups for EAGLE-2 on Vicuna 13B (temperature=0):

| Task | Speedup |
|---|---|
| HumanEval (code) | 4.96x |
| MT-Bench (dialogue) | 4.26x |
| Alpaca (instruction) | 4.25x |
| GSM8K (math) | 4.22x |
| CNN/DM (summarization) | 3.40x |
| Natural Questions (QA) | 3.13x |
Medusa appends independent prediction heads to the last transformer layer of the base model. Each head predicts a future token position without conditioning on the other heads' predictions, which limits the joint distribution the heads can represent. The heads are trained on top of the base model (with the backbone frozen in Medusa-1, or jointly fine-tuned in Medusa-2) to minimize a cross-entropy loss at each future position. Because there is no dependency between heads, the tree they generate is essentially a Cartesian product of per-position predictions, and the acceptance probability of a sequence of k tokens is approximately p_1 * p_2 * ... * p_k, where each p_i is the marginal accuracy at that position. For k=4 tokens with p_i = 0.6, this product is about 0.13 -- meaning most 4-token sequences are rejected.
EAGLE's auto-regression head conditions each step on the previously predicted feature (and, in EAGLE-1/2, on the token sequence shifted forward), enabling it to track which branch of the distribution was actually taken. This yields acceptance rates around 0.77 when the input contains no self-predicted features, and 0.69 even when one imprecise, self-predicted feature is already in the context (the "1-alpha" rate), compared to Medusa's roughly 0.6 first-position acceptance rate.
Vanilla speculative decoding with a separate smaller model produces competitive acceptance rates when the draft model is carefully matched to the target. However, it requires storing and routing to two separate models, which imposes memory and scheduling overhead. EAGLE's auto-regression head is much smaller than even a 7B draft model (0.24B-0.99B parameters) and shares the embedding and LM head weights with the target, reducing memory overhead substantially.
The following qualitative comparison summarizes the architectural tradeoffs:
| Property | Vanilla spec. decoding | Medusa | EAGLE |
|---|---|---|---|
| Separate draft model needed | Yes (7B-13B) | No | No (small head) |
| Speculation level | Token | Token (parallel heads) | Feature (penultimate layer) |
| Head/draft size | 7B-13B parameters | Several lightweight heads | 0.24B-0.99B parameters |
| Conditions on draft history | Yes | No | Yes |
| Dynamic draft tree | No | No | Yes (EAGLE-2 onward) |
| Lossless | Yes | Yes | Yes |
| Typical speedup (13B, greedy) | ~2x | ~2.1x | 3x-5.6x |
| Training data needed | Draft model training | Head fine-tuning | ~70K conversations |
vLLM is the most widely deployed open-source LLM serving framework and one of the first to integrate EAGLE. EAGLE support in vLLM is activated through the --speculative-method eagle flag combined with --speculative-model pointing to the EAGLE draft model checkpoint. The vLLM integration supports EAGLE-2 dynamic draft trees and is compatible with vLLM's continuous batching and PagedAttention memory management. In October 2024, the vLLM team published benchmarks showing speculative decoding in vLLM delivering up to 2.8x speedup in production-like serving scenarios.
In 2025, AWS and the vLLM team published P-EAGLE (arXiv:2602.01469), an extension of EAGLE to parallel drafting that was merged into vLLM starting from v0.16.0. P-EAGLE allows multiple draft sequences to be generated in parallel across hardware, further compressing latency at the cost of slightly more compute.
SGLang is a structured generation language and serving runtime developed at UC Berkeley. It natively supports EAGLE and EAGLE-3, with the EAGLE-3 integration benchmarked in the original EAGLE-3 paper. The 1.38x throughput improvement at batch size 64 reported in that paper was measured within SGLang. SGLang's RadixAttention memory management interacts well with EAGLE's tree attention, as both require tracking variable-length prefix structures efficiently.
The SpecForge framework (arXiv:2603.18567), a training harness for speculative decoding, recommends SGLang as the primary deployment backend for EAGLE-3 models trained through its pipeline.
NVIDIA's TensorRT-LLM inference framework supports external draft model speculative decoding, which is the mode used to deploy EAGLE. The configuration flag --speculative_decoding_mode draft_tokens_external with --max_draft_len 8 enables EAGLE-style draft integration. NVIDIA's technical documentation cites 3.6x throughput improvements on H200 GPUs for speculative decoding workloads. The NVIDIA Model Optimizer library also supports EAGLE as part of its broader suite of optimization techniques including quantization and pruning.
MLC-LLM, a cross-platform compiler and runtime for LLMs developed at Carnegie Mellon University, integrates EAGLE for CPU, GPU, and mobile deployment targets. AMD's ROCm inference stack and Intel's optimization libraries have also added EAGLE support, reflecting its adoption as a de facto standard for single-sequence latency optimization. The EAGLE repository lists official pretrained draft model checkpoints for Vicuna-13B v1.3, LLaMA-3.1-8B and 3.3-70B Instruct, DeepSeek-R1-Distill-LLaMA-8B, and the Qwen3 series from 1.7B through 235B parameters, as well as numerous community-contributed checkpoints.
The speculative decoding literature has grown rapidly since 2023. EAGLE belongs to the "modified draft model" subclass: methods that learn a specialized draft model from the target model rather than using an off-the-shelf smaller model. Other approaches in this landscape, such as Medusa's parallel heads and Lookahead decoding, were discussed above.
EAGLE's distinctive contributions relative to this landscape are the feature-level speculation (which increases acceptance rates) and, in EAGLE-2+, the dynamic draft tree (which adapts the draft budget to context predictability).
EAGLE's speedup is most pronounced in single-sequence, low-batch-size scenarios. As batch size increases, the memory bandwidth bottleneck that speculative decoding exploits -- the target model running below peak arithmetic throughput on small batches -- diminishes. At batch sizes above 32, the speedup advantage over vanilla autoregressive decoding shrinks significantly, though EAGLE-3 retains meaningful gains (1.38x throughput at batch size 64 in SGLang).
Mixture-of-experts models such as Mixtral 8x7B exhibit lower speedups (approximately 1.5x for EAGLE-1) because the expert routing step during target model verification creates irregular compute patterns that reduce the efficiency of single-pass tree verification.
Training an EAGLE draft model requires a significant but tractable effort: a GPU cluster (8x A100 or 8x RTX 3090), 1-2 days of compute, and a representative text corpus. For novel or proprietary models without published checkpoints, users must train their own draft heads. The SafeAILab repository provides training scripts and a list of community-contributed checkpoints, but coverage is not universal.
EAGLE is inherently tied to the architecture of a specific base model. A draft model trained on LLaMA-3.1-8B cannot be used with Qwen3-8B, even if both have similar parameter counts, because the internal feature spaces differ. This means each new base model requires a new draft model.
Finally, EAGLE's losslessness guarantee is distributional: the acceptance-rejection procedure preserves the target model's output distribution, not any particular sampled sequence. In practice, floating-point precision differences across hardware and kernel configurations can introduce negligible drift between nominally identical distributions, though this caveat is not unique to EAGLE.