EAGLE-2
Last reviewed
May 21, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,258 words
EAGLE-2 ("Faster Inference of Language Models with Dynamic Draft Trees") is the second generation of the EAGLE family of speculative decoding methods for accelerating large language model inference, introduced by Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang in an arXiv preprint posted on 24 June 2024 and published at EMNLP 2024.[^1][^2] The system replaces the static draft tree used in the original EAGLE with a context aware dynamic draft tree that expands and reranks candidate token paths according to the draft model's per step confidence, exploiting the empirical observation that EAGLE's draft head is well calibrated against the target model's acceptance rate.[^1] On six standard generation tasks across three model families, EAGLE-2 reports speedups of 3.05x to 4.26x over vanilla autoregressive decoding, roughly 20% to 40% faster than its predecessor while remaining lossless with respect to the target model's output distribution.[^1][^3] EAGLE-2 has since been packaged in the SafeAILab/EAGLE reference repository alongside EAGLE-1 and EAGLE-3, and integrated into mainstream serving stacks including vLLM, SGLang, and NVIDIA TensorRT-LLM.[^4][^5][^6]
| Field | Value |
|---|---|
| Full title | EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees |
| Authors | Yuhui Li, Fangyun Wei, Chao Zhang, Hongyang Zhang |
| arXiv identifier | 2406.16858 |
| Initial arXiv submission | 24 June 2024 (v1) |
| Venue | EMNLP 2024 (Main Conference, pages 7421 to 7432) |
| Reference implementation | SafeAILab/EAGLE on GitHub |
| License | Apache 2.0 |
| Reported peak MT-bench speedup | 4.26x (Vicuna 13B, temperature 0) |
| Output guarantee | Lossless (same distribution as target model) |
Bibliographic information for the EMNLP version is mirrored on the ACL Anthology, which assigns the paper the DOI 10.18653/v1/2024.emnlp-main.422.[^2]
The EAGLE series is the work of a four person team based across Peking University (Yuhui Li and Chao Zhang), Microsoft Research (Fangyun Wei), and the University of Waterloo with the Vector Institute for AI (Hongyang Zhang, a tenure track assistant professor and the team's principal investigator).[^5][^13] The project page was first published on 8 December 2023, ahead of the EAGLE-1 arXiv preprint on 26 January 2024.[^5][^7] EAGLE-1 appeared at ICML 2024.[^4][^8]
EAGLE-2 followed five months later. Yuhui Li and collaborators posted version 1 of arXiv:2406.16858 on 24 June 2024 and a small revision (v2) on 30 June 2024; the paper was accepted at EMNLP 2024 and appears in the proceedings at pages 7421 to 7432 under DOI 10.18653/v1/2024.emnlp-main.422.[^1][^2] Hongyang Zhang gave an invited talk on the two systems titled "The EAGLE Series: Lossless Inference Acceleration for LLMs" at the MLSys@UCSD seminar on 6 March 2025, summarizing EAGLE-1 as a feature level drafter delivering "2.1x to 3.8x" acceleration and EAGLE-2 as adding dynamic draft trees for "2.5x to 5.0x" total acceleration, a "20% to 40% speed boost" over EAGLE-1.[^13]
The third generation, EAGLE-3, was posted on arXiv on 3 March 2025 as 2503.01840 and was accepted at NeurIPS 2025.[^4][^9] As of mid 2026 the SafeAILab/EAGLE repository's main branch implements EAGLE-2 and EAGLE-3 by default, with an explicit v1 branch retained for users who need to reproduce the original EAGLE-1 paper exactly.[^4]
Modern autoregressive LLMs generate one token per forward pass. At typical batch sizes each pass is memory bound: the full set of weights must be read from memory to produce a single token, leaving most of the accelerator's compute idle. Speculative decoding, popularized by Leviathan et al. and Chen et al. in 2023, mitigates this bottleneck by having a small draft model propose several tokens, which a larger target model verifies in parallel in a single forward pass, followed by a rejection sampling step that preserves the target distribution exactly.[^7]
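The rejection sampling step can be sketched for a single drafted token as follows. This is a minimal illustration of the standard speculative sampling rule, not code from the EAGLE repository; the function name `verify_token` and the dictionary representation of the two distributions are assumptions made for the example.

```python
import random

def verify_token(token, p_target, q_draft, rng=random):
    """Accept or replace one drafted token.

    p_target and q_draft map token -> probability under the target and draft
    models at this position. Returns (token, accepted_flag).
    """
    # Accept with probability min(1, p(token) / q(token)).
    accept_prob = min(1.0, p_target.get(token, 0.0) / q_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # On rejection, resample from the residual max(0, p - q); random.choices
    # normalizes the weights, so the result follows the target distribution.
    residual = {t: max(0.0, p - q_draft.get(t, 0.0)) for t, p in p_target.items()}
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0], False
```

When the two distributions agree, every drafted token is accepted; when the draft puts mass on a token the target never produces, that token is always rejected and replaced by a sample from the residual.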
EAGLE-1, posted on arXiv as 2401.15077 on 26 January 2024 and presented at ICML 2024, refined this template with two structural choices.[^7][^8] First, instead of drafting at the token level it performs autoregression on the second to top layer hidden state of the target model, using a single transformer decoder layer plus a small fully connected projection (the "auto regression head") that reuses the target model's frozen embedding and language modeling head.[^7] Second, it conditions the next feature prediction on a token sequence shifted forward by one position, which the authors argue resolves the uncertainty introduced by stochastic sampling at the previous step.[^7] EAGLE-1 reports MT-bench speedups in the range 2.7x to 3.5x for LLaMA2-Chat 70B and roughly 3.0x for 13B class models, faster than Medusa and Lookahead Decoding under the same evaluation harness.[^7][^4] The project page lists authors from Peking University, the University of Waterloo, the Vector Institute, and Microsoft Research.[^5]
A second contribution of EAGLE-1 was the use of a fixed draft tree rather than a linear chain. By running the auto regression head a small number of steps and keeping a handful of top candidates at each depth, the drafter produces a tree of tokens that the target verifies with tree attention in one parallel call. The tree shape in EAGLE-1, however, is identical for every input and every position, encoding an implicit assumption that the per node acceptance rate depends only on the node's depth and position within the tree, not on the surrounding context.[^1]
The EAGLE-2 paper opens by challenging that static tree assumption. The authors plot per token acceptance rate as a function of the draft model's softmax confidence and report a near monotonic relationship: tokens whose draft confidence falls below 0.05 are accepted only about 4% of the time, whereas tokens with confidence above 0.95 are accepted approximately 98% of the time.[^1] Because the draft head is itself a small neural network calibrated against the target LLM, its emitted probabilities track the verification acceptance rate closely with only small calibration errors, and so confidence can substitute for an expensive simulation of the verification step.[^1]
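A calibration check of this kind can be approximated by bucketing logged draft tokens by confidence and measuring the acceptance rate in each bucket. The sketch below assumes a hypothetical log of `(draft_confidence, was_accepted)` pairs; the function name and the sample records are made up for illustration and are not data from the paper.

```python
def acceptance_by_confidence(records, num_bins=10):
    """Group (confidence, was_accepted) pairs into equal-width confidence bins.

    Returns a list of (bin_low, bin_high, acceptance_rate) for non-empty bins.
    """
    counts = [0] * num_bins
    accepted = [0] * num_bins
    for confidence, was_accepted in records:
        # Confidence 1.0 falls into the last bin rather than overflowing.
        b = min(int(confidence * num_bins), num_bins - 1)
        counts[b] += 1
        accepted[b] += int(was_accepted)
    return [
        (b / num_bins, (b + 1) / num_bins, accepted[b] / counts[b])
        for b in range(num_bins)
        if counts[b]
    ]

# Made-up records: low-confidence drafts rejected, high-confidence mostly accepted.
toy_records = [(0.02, False), (0.03, False), (0.97, True), (0.99, True), (0.96, False)]
calibration = acceptance_by_confidence(toy_records)
```

A well calibrated drafter would show acceptance rates that sit close to the midpoint of each confidence bin.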
The same plot also shows that two tokens at the same depth in the tree can have widely different acceptance rates depending on which prefix preceded them. A static tree that always allocates k slots to depth d therefore wastes verification bandwidth on unlikely branches in some contexts and starves promising branches in others. EAGLE-2 exploits these two observations to make the tree adaptive: it allocates more nodes where confidence is high (and the marginal benefit of looking deeper is large) and prunes nodes where confidence collapses.[^1]
EAGLE-2 inherits the draft model architecture from EAGLE-1 unchanged: one transformer decoder layer plus a linear projection, reusing the target model's embedding layer and language modeling head.[^7] The only difference from EAGLE-1 lies in how draft tokens are generated and selected for verification each step. The dynamic tree construction proceeds in two phases that the paper calls expansion and reranking.[^1]
Beginning from the root (the most recently verified token), the drafter runs a forward pass to produce a probability distribution over next tokens. At each step, EAGLE-2 selects from the current frontier the top k nodes ranked by the global acceptance probability of the path that leads to them. Concretely, for a node t_i reached by a path from the root, the value V_i is defined as the product of the draft confidences c_j of the nodes along that path, t_i included. V_i approximates the probability that the node and all of its ancestors will be accepted by the target model under the rejection sampling rule.[^1]
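Written as a formula, the value described above is the following. The symbol $\mathrm{Path}(\mathrm{root}, t_i)$ is notation assumed here for the set of nodes on the path from the root down to $t_i$, with $t_i$ itself included.

```latex
V_i \;=\; \prod_{t_j \,\in\, \mathrm{Path}(\mathrm{root},\, t_i)} c_j
```

As an illustration with made-up numbers, a node reached through draft confidences 0.9, 0.8, and 0.5 would have $V_i = 0.9 \times 0.8 \times 0.5 = 0.36$.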
The top k nodes are expanded by another draft step, generating new children. This continues until a depth bound (six in the released configurations) is reached. Crucially, k is a single budget shared across the whole frontier rather than a quota per parent node, so a context with one extremely promising branch can devote all of a step's expansions to that branch's descendants rather than spreading them across less likely siblings.[^1]
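The expansion phase can be sketched as a layer by layer loop. This is a simplified illustration rather than the reference implementation: `draft_step` is a hypothetical stand-in for one forward pass of the draft head, `toy_draft_step` returns made-up confidences, and nodes are represented as `(value, path)` tuples for readability.

```python
import heapq

def expand_tree(draft_step, root_token, top_k, depth):
    """Grow a draft tree and return every generated node as (value, path).

    draft_step(path) -> {token: confidence} is a hypothetical callable that
    stands in for running the draft head on the tokens in `path`.
    """
    nodes = []
    frontier = [(1.0, (root_token,))]  # (product of confidences, tokens on path)
    for _ in range(depth):
        candidates = []
        for value, path in frontier:
            # Each expanded node contributes at most top_k children.
            children = heapq.nlargest(
                top_k, draft_step(path).items(), key=lambda kv: kv[1]
            )
            for token, confidence in children:
                candidates.append((value * confidence, path + (token,)))
        nodes.extend(candidates)
        # Global selection: the k best paths in this layer, whichever parent
        # they come from, are the ones expanded in the next step.
        frontier = heapq.nlargest(top_k, candidates, key=lambda n: n[0])
    return nodes

def toy_draft_step(path):
    # Made-up confidences: the drafter is confident only after "the".
    if path[-1] == "the":
        return {"cat": 0.9, "dog": 0.1}
    return {"the": 0.6, "a": 0.4}

tree_nodes = expand_tree(toy_draft_step, "<root>", top_k=2, depth=2)
```

With these toy confidences the tree holds six candidate nodes after two steps, and the highest valued depth two path runs through "the" and "cat".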
After expansion finishes, the candidate tree may contain more tokens than the verification budget allows. The reranking step collects all generated tokens, sorts them by their value V_i, and keeps the top m tokens for verification. Because every confidence is at most 1, a node's value never exceeds its parent's, and the paper breaks ties in favor of shallower nodes. The selected subset therefore stays connected to the root (a node is never selected without its ancestors), and verification capacity goes to the most likely full paths.[^1]
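Reranking reduces to a sort with a depth based tie break. The sketch below uses the same illustrative `(value, path)` representation, with made-up values; `rerank` and `is_connected` are hypothetical helper names, and the connectivity check is included only to show that the selected subset forms a tree.

```python
def rerank(nodes, budget):
    """Keep the `budget` nodes with the highest value.

    Ties go to the shallower node, so a parent always sorts ahead of a child
    that has the same value.
    """
    ranked = sorted(nodes, key=lambda n: (-n[0], len(n[1])))
    return ranked[:budget]

def is_connected(selected):
    """True if every selected node's parent is the root or is also selected."""
    paths = {path for _, path in selected}
    # A path of length 2 hangs directly off the root; longer ones need a parent.
    return all(len(p) == 2 or p[:-1] in paths for p in paths)

# Made-up candidate tree: values are products of confidences along each path.
candidates = [
    (0.60, ("r", "the")),
    (0.40, ("r", "a")),
    (0.54, ("r", "the", "cat")),
    (0.06, ("r", "the", "dog")),
    (0.24, ("r", "a", "the")),
    (0.16, ("r", "a", "a")),
]
kept = rerank(candidates, budget=4)
```

Here the four retained nodes are the two depth one tokens and the best child of each, which together form a connected subtree.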
To make this work inside the target's single forward pass, EAGLE-2 builds an attention mask that lets each selected token attend only to its ancestors in the tree, the same tree attention idiom used by SpecInfer, Sequoia, and EAGLE-1.[^1] Because the resulting set still encodes a valid tree, the rejection sampling proof from Leviathan et al. carries over and the algorithm is exactly distribution preserving, never altering the output sampling distribution of the underlying target model.[^1][^7]
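An ancestor only attention mask can be derived from parent pointers. The following is a minimal sketch assuming the selected tokens are stored as a flat list with a `parents` array (index of each node's parent, or -1 when the parent is the root); that layout and the function name are assumptions for the example, not the repository's data structures.

```python
def tree_attention_mask(parents):
    """Build mask[i][j] = True when node i may attend to node j.

    Node i attends to itself and to each of its ancestors in the draft tree.
    parents[i] is the index of node i's parent, or -1 if the parent is the root.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up from the node to the root
            mask[i][j] = True
            j = parents[j]
    return mask

# Toy tree: nodes 0 and 1 hang off the root, 2 and 3 are children of 0,
# and 4 is a child of 1.
mask = tree_attention_mask([-1, -1, 0, 0, 1])
```

In a real verification pass each drafted token would also attend to the already verified prefix; the sketch covers only the tree portion of the mask.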
The EAGLE-2 reference implementation ships with default hyperparameters that vary with target model size. For 7B and 8B class models, the configuration uses 60 total draft tokens, expansion depth 6, and top 10 frontier nodes per expansion. For 13B models the budget shrinks to 50 tokens, and for 70B models to 48 tokens, with depth 6 and top 10 retained across all sizes.[^3] The numbers reflect a tradeoff between verification batch size (which raises target forward pass cost) and average tokens accepted per step.
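The defaults described above can be captured in a small lookup table. The class name `DraftTreeConfig`, its field names, and the size keys are hypothetical and chosen for this sketch; only the numeric values come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DraftTreeConfig:
    total_tokens: int  # draft tokens sent to the target for verification
    depth: int         # maximum number of expansion steps
    top_k: int         # frontier nodes expanded per step

# Values as described in the text; the keys are illustrative labels.
DEFAULT_TREE_CONFIGS = {
    "7B/8B": DraftTreeConfig(total_tokens=60, depth=6, top_k=10),
    "13B": DraftTreeConfig(total_tokens=50, depth=6, top_k=10),
    "70B": DraftTreeConfig(total_tokens=48, depth=6, top_k=10),
}
```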
A practically important property of EAGLE-2 is that it does not introduce a new training procedure. The dynamic tree algorithm uses the same draft model checkpoint that EAGLE-1 produces; only the inference time tree construction code differs. Consequently, every drafter already published under the SafeAILab project (for example the Hugging Face checkpoints under the yuhuili/EAGLE-* namespace) can be swapped from EAGLE-1 to EAGLE-2 by changing a flag at runtime, with no additional GPU hours spent.[^1][^4] EAGLE-2 also requires no extra trainable parameters and does not learn a separate model to predict the tree structure: the confidence proxy is read directly off the existing draft head.[^1]
EAGLE-2 inherits its draft model training procedure from EAGLE-1. The objective is to teach the auto regression head (one transformer decoder layer plus a linear projection) to predict the next second to top hidden state of the target LLM given the previous hidden states and a token sequence shifted forward by one step.[^7] The loss combines a smooth L1 regression term on the predicted feature against the target hidden state, and a cross entropy term on the resulting logits after the target model's frozen language modeling head is applied to that predicted feature.[^7]
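The two part loss can be sketched in plain Python for a single position. Several details here are assumptions made for illustration: the weighting factor `cls_weight` and its default value are hypothetical (the text above does not state how the terms are weighted), the classification target is passed in as a probability vector, and `lm_head` is any callable mapping a feature vector to logits.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 (Huber-style) distance between two feature vectors."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def cross_entropy(logits, target_probs):
    """Cross entropy between softmax(logits) and a target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(q * (x - log_z) for x, q in zip(logits, target_probs))

def drafter_loss(pred_feature, target_feature, lm_head, target_probs, cls_weight=0.1):
    """Regression term on the feature plus a weighted classification term."""
    regression = smooth_l1(pred_feature, target_feature)
    # The frozen LM head turns the predicted feature into token logits.
    classification = cross_entropy(lm_head(pred_feature), target_probs)
    return regression + cls_weight * classification

# Toy call with an identity "LM head" over a 2-dimensional feature.
loss = drafter_loss([0.0, 0.0], [0.5, 2.0], lambda f: f, [1.0, 0.0])
```

In the toy call the regression term is 0.8125 and the classification term is ln 2, so the combined loss is 0.8125 + 0.1 × ln 2.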
Training data for the original EAGLE-1 and EAGLE-2 drafters comes from the ShareGPT dialogue corpus (approximately 68 thousand conversations).[^7] An ablation in the EAGLE-1 paper compares drafters trained on the off the shelf corpus to drafters trained on responses generated by the target LLM itself, and reports only a 3.6% acceptance length difference, evidence that EAGLE drafters are not very sensitive to the source of dialogue data.[^7] The recommended hardware is an 8x RTX 3090 node, which produces a usable drafter in one to two days; offline pipelines that pre extract hidden states allow training on a single GPU but with substantial disk usage.[^4][^11]
Because the drafter is tied to a specific target model's hidden representation, a separate drafter is trained per target. The Hugging Face yuhuili/EAGLE-* collection therefore lists distinct checkpoints for each supported base model rather than a single universal drafter.[^12] Newer EAGLE-3 drafters under the yuhuili/EAGLE3-* namespace require a different training objective (direct token prediction with multi layer feature fusion) and are not interchangeable with EAGLE-1 or EAGLE-2 checkpoints.[^9][^12]
EAGLE-2 is evaluated on six tasks: MT-bench (multi turn conversation), HumanEval (code generation), GSM8K (mathematical reasoning), Alpaca (instruction following), CNN/Daily Mail (summarization), and Natural Questions (open domain QA). It is benchmarked across three model series, namely Vicuna v1.3 (7B and 13B), LLaMA2-Chat (7B, 13B, 70B), and LLaMA-3 Instruct (8B, 70B), at both temperature 0 and temperature 1 sampling settings.[^1][^3]
The headline numbers, all measured against vanilla autoregressive decoding on the same hardware, are reproduced below for selected configurations at temperature 0.
| Model | Task | EAGLE-2 speedup | EAGLE-1 speedup |
|---|---|---|---|
| Vicuna 13B | MT-bench | 4.26x | 3.07x |
| Vicuna 13B | HumanEval | 4.96x | 3.58x |
| LLaMA2-Chat 13B | MT-bench | 4.21x | 3.03x |
| LLaMA2-Chat 13B | HumanEval | 5.00x | 3.76x |
| Vicuna 7B | MT-bench | 3.62x | 2.90x |
| LLaMA2-Chat 70B | MT-bench | 3.51x | 3.01x |
| LLaMA-3 Instruct 8B | MT-bench | 3.46x | 2.72x |
| LLaMA-3 Instruct 70B | MT-bench | 3.29x | 2.83x |
Source: EAGLE-2 arXiv tables 1 and 2.[^1] The paper summarizes the improvement over EAGLE-1 as 20% to 40%; in the rows above, the relative gain ranges from about 16% on the 70B targets to about 39% on the 13B targets. Absolute speedups are highest on coding tasks, where draft confidence is highest and templates are more predictable.[^1][^3]
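The relative gain of EAGLE-2 over EAGLE-1 for each row of the table can be recomputed directly from the two speedup columns. The snippet below only restates the table's figures; the variable and function names are illustrative.

```python
# (model, task, EAGLE-2 speedup, EAGLE-1 speedup), copied from the table.
rows = [
    ("Vicuna 13B", "MT-bench", 4.26, 3.07),
    ("Vicuna 13B", "HumanEval", 4.96, 3.58),
    ("LLaMA2-Chat 13B", "MT-bench", 4.21, 3.03),
    ("LLaMA2-Chat 13B", "HumanEval", 5.00, 3.76),
    ("Vicuna 7B", "MT-bench", 3.62, 2.90),
    ("LLaMA2-Chat 70B", "MT-bench", 3.51, 3.01),
    ("LLaMA-3 Instruct 8B", "MT-bench", 3.46, 2.72),
    ("LLaMA-3 Instruct 70B", "MT-bench", 3.29, 2.83),
]

def relative_gain_percent(eagle2, eagle1):
    """Percentage by which the EAGLE-2 speedup exceeds the EAGLE-1 speedup."""
    return (eagle2 / eagle1 - 1.0) * 100.0

gains = {
    (model, task): round(relative_gain_percent(e2, e1), 1)
    for model, task, e2, e1 in rows
}
for (model, task), gain in gains.items():
    print(f"{model:22s} {task:10s} +{gain}%")
```

For these rows the gains fall between 16.3% (LLaMA-3 Instruct 70B) and 38.9% (LLaMA2-Chat 13B on MT-bench).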
A hardware independent measure of drafting quality is tau, the average number of tokens accepted per draft and verify cycle. On MT-bench at temperature 0, EAGLE-2 raises tau from approximately 3.9 to roughly 4.8 on Vicuna 13B and from 3.6 to 4.7 on LLaMA2-Chat 7B. HumanEval pushes tau above 5 tokens per cycle for 13B class models with EAGLE-2.[^1] The paper notes that this is approximately twice the acceptance length of standard speculative sampling with an independent draft model and substantially above Medusa under matched conditions.[^1]
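As a small illustration of the metric, tau is just the mean number of tokens produced per draft and verify cycle. The per cycle log below is made up, and the function name is hypothetical.

```python
def average_acceptance_length(tokens_per_cycle):
    """Mean number of tokens produced per draft-and-verify cycle (tau)."""
    if not tokens_per_cycle:
        raise ValueError("need at least one cycle")
    return sum(tokens_per_cycle) / len(tokens_per_cycle)

# Made-up log: tokens produced in each of five consecutive cycles.
cycle_log = [5, 3, 6, 4, 6]
tau = average_acceptance_length(cycle_log)  # 24 tokens / 5 cycles = 4.8
```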
The EAGLE-2 paper ablates each phase. Removing reranking, or replacing dynamic expansion with the EAGLE-1 static tree of comparable size, both reduce tau by roughly 0.5 to 0.8 tokens per cycle and erase most of the wall clock gain over EAGLE-1, confirming that adaptivity, not raw tree size, is responsible for the improvement.[^1] The authors also verify that the draft model trained for EAGLE-1 can be reused for EAGLE-2 without retraining: only the inference time tree construction logic changes.[^1]
The SafeAILab/EAGLE repository hosts the official implementations of three generations of the method, each with its own venue, arXiv identifier, and architectural footprint. The line is summarized in the table below.[^4][^5]
| Version | arXiv | First submitted | Venue | Core idea | Reported peak speedup |
|---|---|---|---|---|---|
| EAGLE (EAGLE-1) | 2401.15077 | 26 Jan 2024 | ICML 2024 | Feature level drafting at the second to top hidden layer; tokens shifted one step; static draft tree | ~3.5x on LLaMA2-Chat 70B[^7] |
| EAGLE-2 | 2406.16858 | 24 Jun 2024 | EMNLP 2024 | Same draft head; dynamic draft tree built from expansion plus reranking using draft confidence | up to 4.26x on MT-bench, Vicuna 13B[^1] |
| EAGLE-3 | 2503.01840 | 3 Mar 2025 | NeurIPS 2025 | Drops feature prediction in favor of direct token prediction; multi layer feature fusion via "training time test" | up to 6.5x; 1.4x over EAGLE-2[^9] |
EAGLE-1 was introduced as "EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty."[^7] Its main contributions are the second to top layer feature drafter (rather than token level), the time shifted token input that resolves which token was actually sampled at the previous step, and a small static draft tree consumed in one tree attention pass. It established the SafeAILab/EAGLE codebase and the calibrated draft head that EAGLE-2 then leverages without modification.[^4][^7]
EAGLE-3, posted on arXiv 3 March 2025 as 2503.01840, is the most recent member of the family. Its abstract argues that scaling up training data yields limited improvements for EAGLE-1 and EAGLE-2 because the feature prediction objective imposes a structural ceiling: the draft head is forced to reproduce a specific hidden state, which becomes harder to fit as data grows. EAGLE-3 therefore abandons feature prediction in favor of direct token prediction and replaces reliance on a single top layer feature with a fusion of low, mid, and high level features from the target model, trained with what the authors call a training time test procedure that simulates multi step drafting during training.[^9]
EAGLE-3 reports speedups up to 6.5x over vanilla decoding (roughly 1.4x above EAGLE-2) on five evaluation tasks across both chat and reasoning models, and a 1.38x throughput improvement at batch size 64 inside SGLang.[^9] The SafeAILab repository lists EAGLE-3 drafters for models including Vicuna-13B, LLaMA-3.1-8B-Instruct, LLaMA-3.3-70B-Instruct, DeepSeek-R1-Distill-LLaMA-8B, and several sizes of Qwen3 from 1.7B up to 235B parameters, alongside support for Llama 4 Scout and Maverick.[^4]
EAGLE-2 has migrated quickly into production inference stacks. The SafeAILab/EAGLE repository is the canonical reference; it is licensed Apache 2.0 and contains training scripts (data generation, auto regression head fitting, and a modeling_basemodelname.py adapter pattern), with recommended training on 8x RTX 3090 GPUs for one to two days for a single drafter.[^4][^5]
The drafters used by EAGLE-2 are distributed on the Hugging Face Hub under the username yuhuili, the GitHub handle of the lead author. The catalog covers EAGLE-1 and EAGLE-2 checkpoints for Vicuna 7B, 13B, and 33B (v1.3), LLaMA2-Chat 7B/13B/70B, LLaMA-3 Instruct 8B and 70B, Mixtral 8x7B Instruct, and Qwen2 7B and 72B Instruct, and EAGLE-3 checkpoints (prefix EAGLE3-) for the same target families plus newer ones such as DeepSeek-R1-Distill-LLaMA-8B and the Qwen3 series. Each repository's model card cross references the three EAGLE arXiv papers and notes that the drafter can be used with Hugging Face Transformers, vLLM, or SGLang via a one line target plus draft launch command.[^12] A typical drafter for a 70B target has about 2 billion parameters stored in fp16, roughly 35 times fewer than the target and small enough to be co located on the same GPU.[^12]
vLLM added support for EAGLE based speculative decoding in its V1 engine. The vLLM documentation exposes a speculative_config dictionary in which users set method to eagle for EAGLE-1 or EAGLE-2 style drafters or eagle3 for the newer variant, pass model to a Hugging Face draft checkpoint such as yuhuili/EAGLE-LLaMA3-Instruct-8B, and set draft_tensor_parallel_size: 1 (EAGLE draft models do not currently shard across GPUs even when the target model does).[^10] Red Hat reports that EAGLE-1 and EAGLE-3 became available in vLLM as of version 0.8.5, with CUDA graph support for speculative decoding added in 0.9.1; the same article measures up to 1.8x latency reduction on Llama 3.1 8B and 1.6x on Llama 3.3 70B at low request rates, and up to 2.5x on RAG and math reasoning workloads.[^6]
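A configuration along these lines might look as follows. This is a sketch rather than an excerpt from the vLLM documentation: the helper `eagle_speculative_config` is hypothetical, the `num_speculative_tokens` value is an arbitrary choice for the example, and the target model name in the commented usage is an assumption. Exact keys may differ between vLLM versions.

```python
def eagle_speculative_config(draft_model, method="eagle", num_speculative_tokens=2):
    """Assemble a speculative_config dictionary for an EAGLE drafter.

    method: "eagle" for EAGLE-1/EAGLE-2 style drafters, "eagle3" for EAGLE-3.
    """
    if method not in ("eagle", "eagle3"):
        raise ValueError(f"unsupported method: {method!r}")
    return {
        "method": method,
        "model": draft_model,
        # The draft model is kept on a single GPU even if the target is sharded.
        "draft_tensor_parallel_size": 1,
        "num_speculative_tokens": num_speculative_tokens,
    }

config = eagle_speculative_config("yuhuili/EAGLE-LLaMA3-Instruct-8B")

# With vLLM installed and a GPU available, the dictionary is passed to the engine:
#   from vllm import LLM
#   llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", speculative_config=config)
```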
SGLang also supports EAGLE based speculative decoding, and the EAGLE-3 paper specifically reports throughput numbers measured inside SGLang at batch size 64.[^9] In July 2025 the LMSYS team published SpecForge, an open source training framework optimized for producing EAGLE-3 drafters that plug directly into SGLang's runtime. SpecForge integrates natively with SGLang, leverages PyTorch FSDP and tensor parallelism, and supports MoE targets including Llama 4 Scout and Maverick using a 320K dialogue dataset assembled from ShareGPT and UltraChat; it can train EAGLE drafters in either an online mode (target model run alongside the drafter) or an offline mode that precomputes hidden states and can run on as little as a single GPU.[^11] SpecForge reports a 2.18x MT-Bench speedup for Llama 4 Maverick and 2.0x for Scout when paired with the trained drafters.[^11]
The SafeAILab README lists at least 14 mainstream frameworks with EAGLE integrations, including NVIDIA TensorRT-LLM, AWS NeuronX, Intel Extension for Transformers, AMD ROCm, and MLC-LLM. It also describes EAGLE as "combinable" with other acceleration techniques such as DeepSpeed, Mamba, FlashAttention, and various forms of quantization.[^4]
Because speculative decoding preserves the target model's output distribution exactly, EAGLE-2 is a drop in accelerator for any deployment of a supported chat or instruct model: the user receives the same samples (in distribution) at a fraction of the wall clock latency. The most pronounced gains appear on workloads where draft confidence is high. The EAGLE-2 paper measures the largest per task speedups on HumanEval, where code completions have many predictable templates, and lower (though still substantial) speedups on conversational and open question answering tasks.[^1] Common deployment scenarios include interactive chat services, batch inference for evaluation pipelines, code copilots, and structured generation tasks such as JSON or SQL emission where context is highly predictable.
A second application area is reasoning workloads. The EAGLE-3 paper reports speedups on both chat models and "reasoning models" (including a distilled DeepSeek-R1 variant) and Red Hat's vLLM benchmark notes that math reasoning gains in particular reach 2.1x.[^9][^6] Although these later numbers come from EAGLE-3, EAGLE-2 supplies the dynamic tree primitive that EAGLE-3 builds on, and the relative dynamic versus static tree gap is similar across the two versions.
EAGLE-2 inherits the practical constraints of feature level speculative decoding. The draft head is bound to the specific target model used during training: a drafter trained for LLaMA-3 Instruct 8B cannot be reused for LLaMA-3 Instruct 70B or for a different model family. The SafeAILab repository therefore distributes a separate drafter per target, and end users who fine tune a target model typically need to retrain the drafter on the new model's hidden states to keep the calibration tight.[^4]
Throughput gains compress at larger batch sizes. Speculative decoding spends extra compute per query to save latency, and once the target model's forward pass saturates the device's compute, that extra compute starts to cost more than it saves. Red Hat's measurements on Llama 3.3 70B note degraded gains at higher request rates, and the EAGLE-3 paper reports a throughput improvement of only 1.38x at batch size 64 inside SGLang, well below the per request latency speedups.[^6][^9] vLLM's documentation also notes that observed EAGLE speedups are presently lower than the reference numbers in the original paper and that the discrepancy is under investigation.[^10]
EAGLE based drafters also require non trivial training data and infrastructure. The reference recipe uses the ShareGPT dialogue corpus and recommends an 8x RTX 3090 cluster for one to two days per drafter.[^7][^4] Finally, EAGLE-2 still uses a static depth bound (six by default) and a fixed total token budget per step; the adaptivity is in how the budget is spent, not in whether the algorithm halts early, which leaves a residual class of inputs (very low confidence drafting contexts) on which EAGLE-2 does no better than EAGLE-1.[^1]
EAGLE-2 sits alongside several other tree based or adaptive speculative decoding methods that emerged in the same period.
| System | Mechanism | Relationship to EAGLE-2 |
|---|---|---|
| Speculative sampling (Leviathan, Chen, 2023) | Small independent draft model verified by rejection sampling | Foundational; EAGLE-2 inherits the rejection sampling guarantee |
| Medusa (Cai et al., 2023) | Multiple MLP heads predict tokens at offsets in parallel | Same tree attention idiom, but Medusa relaxes acceptance criteria in non greedy settings (lossy) |
| SpecInfer (Miao et al., 2024) | Several small models drafting in parallel, merged into a token tree | Introduced tree attention for speculative inference |
| Sequoia (Chen, May et al., 2024) | Dynamic programming optimization over static tree shapes for a fixed budget | Still position only; EAGLE-2 paper cites Sequoia as making the same implicit assumption it disputes |
| Lookahead Decoding (Fu et al., 2023) | Jacobi style n gram lookahead without draft model | EAGLE-1 reports being roughly 2x faster than Lookahead on MT-bench |
| Dynamic Depth Decoding (2024) | Adapts draft tree depth per step | Independent line that, like EAGLE-2, makes the drafting structure context dependent |
The EAGLE-2 contribution is specifically the use of the draft model's own confidence (rather than an offline optimization or a hand picked schedule) as a cheap, calibrated proxy for acceptance probability, and the two phase expansion plus reranking algorithm that converts that proxy into a connected verification tree.[^1]