Self-speculative decoding (LayerSkip)
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,930 words
Self-speculative decoding is a family of speculative decoding methods that accelerate large language model (LLM) inference by using the target model itself, run in a cheaper reduced-depth mode, as its own draft model. Instead of pairing the large model with a separate small drafter, the method produces candidate tokens from a subset of the model's own layers, by skipping intermediate layers or by exiting early, and then verifies those candidates with the full model in a single forward pass. Because the same weights and the same KV cache serve both the drafting and the verification stages, there is no second model to host and no second cache to maintain, and the output matches what ordinary decoding would have produced: token for token under greedy decoding, and in distribution under sampling. [1][2]
The term was introduced by Jun Zhang and colleagues in "Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding" (arXiv, September 2023; published at ACL 2024), which drafts by selectively skipping layers of an unmodified model at inference time and needs no extra training. [1] A second, training-based line is Meta AI's LayerSkip (Elhoushi et al., arXiv, April 2024; ACL 2024), which trains the model with layer dropout and an early-exit loss so that the first few layers can draft while the remaining layers verify, sharing computation between the two stages. [2] Equating self-speculative decoding with Meta's LayerSkip is therefore incomplete: LayerSkip is one instance, and the technique and its name originate with Zhang et al., whose preprint appeared about seven months earlier.
Autoregressive generation is slow because each new token requires a full forward pass of the model, and those passes are dominated by the memory cost of streaming the weights and the KV cache rather than by arithmetic. Producing N tokens therefore means N sequential, memory-bound passes. [3]
Speculative decoding, introduced by Yaniv Leviathan, Matan Kalman, and Yossi Matias (arXiv, November 2022; ICML 2023) and by Charlie Chen and colleagues (arXiv, February 2023), breaks this bottleneck by separating drafting from verification. [3][4] A cheap drafter proposes a block of several candidate tokens; the expensive target model then scores all of those positions in one forward pass, since a transformer can evaluate many positions in parallel almost as cheaply as one. A verification rule decides how many drafted tokens to keep: under greedy decoding the target keeps each token that matches its own argmax and stops at the first mismatch, and under sampling a modified rejection sampling step preserves the target's exact output distribution. Either way the result is, in distribution, identical to standard decoding, so the speedup is lossless. [3][4]
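The verification rule described above can be sketched in a few lines. This is a minimal illustration, not code from either paper: the function names are hypothetical, tokens are plain integers, and the target model's outputs are assumed to have been computed already by one parallel forward pass.

```python
from typing import List, Sequence


def greedy_verify(draft: Sequence[int], target_argmax: Sequence[int]) -> List[int]:
    """Tokens kept after one verification pass under greedy decoding.

    draft:         k tokens proposed by the drafter.
    target_argmax: k + 1 tokens; entry i is the target model's argmax at
                   position i, given the prefix followed by draft[:i].
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")
    kept: List[int] = []
    for i, tok in enumerate(draft):
        if tok != target_argmax[i]:
            # First mismatch: drop the rest and emit the target's own token.
            kept.append(target_argmax[i])
            return kept
        kept.append(tok)
    # Every draft token matched, so the same pass yields one extra token.
    kept.append(target_argmax[len(draft)])
    return kept


def accept_probability(p_target: float, q_draft: float) -> float:
    """Sampling case: chance of keeping a draft token x, given p(x) and q(x)."""
    if q_draft <= 0.0:
        raise ValueError("a sampled draft token must have q(x) > 0")
    return min(1.0, p_target / q_draft)


def residual_distribution(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Sampling case: after a rejection, resample from normalized max(0, p - q)."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        # p and q coincide, so a rejection cannot occur; fall back to p.
        return list(p)
    return [r / total for r in raw]
```

For example, `greedy_verify([5, 7, 9], [5, 7, 2, 4])` keeps the two matching tokens and replaces the third with the target's choice, so each verification pass always emits at least one token.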
The classic formulation uses a smaller LLM from the same family as the drafter. This is the practical pain point that motivates self-speculation. A suitable draft model must exist, must be well aligned with the target so that its proposals are accepted often, and must be hosted alongside the target, where it occupies additional memory and runs its own separate KV cache. For many models no small sibling exists, and training or distilling one is costly. Self-speculative decoding removes this dependency by drawing the draft from the target model itself, operated in a less expensive configuration. [1][2]
The original method, by Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, and Gang Chen of Zhejiang University and Sharad Mehrotra of the University of California, Irvine, drafts by skipping a fixed subset of the model's intermediate attention and feed-forward sublayers, then verifies with the complete, unmodified model. [1] Two design choices make it work.
First, which layers to skip is chosen offline by Bayesian optimization. The skipped set is encoded as a binary mask over the layers, and the optimizer searches for the mask that minimizes the average inference time per verified token, balancing a faster draft (skip more) against a higher rejection rate (skip too much). The paper runs roughly 1,000 search iterations, taking about 2.5 hours for a 13B model, and the resulting mask is reused for all subsequent generation. [1]
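The shape of this search can be illustrated with a small sketch. Two loud caveats: the paper uses Bayesian optimization, whereas the code below substitutes plain random search to stay dependency-free, and `toy_time_per_token` is a made-up cost model standing in for the measured wall-clock time per verified token. All names and constants are hypothetical.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = skip this layer while drafting, 0 = keep it


def toy_time_per_token(mask: Mask, draft_len: int = 4) -> float:
    """Illustrative objective (an assumption, not a measurement).

    Skipping layers makes each draft step cheaper but lowers the chance
    that a drafted token is accepted.
    """
    n = len(mask)
    skipped = sum(mask)
    draft_cost = (n - skipped) / n            # relative to one full pass
    accept = 1.0 - (skipped / n) ** 2         # invented acceptance curve
    # Expected tokens per cycle, counting the token verification always emits.
    expected_tokens = sum(accept ** i for i in range(draft_len + 1))
    return (draft_len * draft_cost + 1.0) / expected_tokens


def search_skip_mask(
    objective: Callable[[Mask], float],
    n_layers: int,
    iters: int = 200,
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Random-search stand-in for the paper's Bayesian optimization."""
    rng = random.Random(seed)
    best_mask = [0] * n_layers                # start from "skip nothing"
    best_score = objective(best_mask)
    for _ in range(iters):
        candidate = [rng.randint(0, 1) for _ in range(n_layers)]
        score = objective(candidate)
        if score < best_score:
            best_mask, best_score = candidate, score
    return best_mask, best_score
```

Because the search starts from the no-skip mask, the returned mask is never worse than the baseline under the chosen objective. In this toy cost model the optimum lies at an intermediate number of skipped layers, which mirrors the trade-off the paragraph describes.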
Second, an adaptive draft-exiting mechanism decides how long to keep drafting before handing off to verification. The drafter continues emitting tokens only while the predicted probability of each draft token stays above a confidence threshold; once a token's probability falls below the threshold, drafting stops and the full model verifies the block so far. The threshold itself is updated on the fly to keep the empirical acceptance rate near a target value, so the draft length adapts to how predictable the current text is. [1]
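A minimal sketch of such a mechanism follows. The exact update rule, the smoothing factors, the step size, and the choice to exclude the first low-confidence token from the block are all assumptions made for illustration; the paper should be consulted for the precise formulation.

```python
from typing import Sequence, Tuple


def confident_prefix_length(
    token_probs: Sequence[float], threshold: float, max_len: int = 12
) -> int:
    """Number of draft tokens emitted before confidence drops below threshold.

    Assumption: the first token whose probability is below the threshold
    is not included in the block handed to verification.
    """
    n = 0
    for p in token_probs[:max_len]:
        if p < threshold:
            break
        n += 1
    return n


def update_threshold(
    threshold: float,
    accept_ema: float,
    cycle_accept_rate: float,
    target: float = 0.9,   # desired acceptance rate (hypothetical default)
    step: float = 0.01,    # size of each nudge (hypothetical default)
    beta1: float = 0.5,    # smoothing of the acceptance estimate
    beta2: float = 0.9,    # smoothing of the threshold itself
) -> Tuple[float, float]:
    """Nudge the threshold so the smoothed acceptance rate tracks the target."""
    ema = beta1 * accept_ema + (1.0 - beta1) * cycle_accept_rate
    # Too many rejections: be stricter and stop drafting sooner.
    # Acceptance above target: relax and allow longer drafts.
    proposed = threshold + step if ema <= target else threshold - step
    new_threshold = beta2 * threshold + (1.0 - beta2) * proposed
    return min(1.0, max(0.0, new_threshold)), ema
```

With probabilities `[0.9, 0.8, 0.3, 0.95]` and a threshold of 0.5, drafting stops after two tokens. Calling `update_threshold` after each verification step then raises the threshold when acceptance is running low and lowers it when acceptance exceeds the target.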
The scheme is entirely plug and play: it adds no parameters, requires no additional training, and uses no extra memory beyond the original model, since the drafter is just the same network with some layers bypassed. The verification step guarantees that the output is token-for-token identical to standard generation, making the acceleration lossless. [1] Reference code was released as the open-source repository dilab-zju/self-speculative-decoding. [8]
LayerSkip, by Mostafa Elhoushi, Akshat Shrivastava, Carole-Jean Wu, Beidi Chen, and ten other authors at Meta, turns self-speculation into a property the model is trained for, rather than a configuration discovered at inference time. [2] It has three parts.
The training recipe applies layer dropout with a schedule that uses low dropout rates for earlier layers and higher rates for later ones, combined with an early-exit loss. The early-exit loss connects the hidden state at every layer to one shared language-model head, so that the same output projection can decode a prediction from any depth. A curriculum gradually introduces the early-exit terms during training. Together these encourage the early layers to produce representations that are already good enough to read out a plausible next token, while leaving the full-depth accuracy of the model intact. [2]
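The two ingredients can be sketched in plain Python. This is an illustration under stated assumptions rather than the released training code: the exponential shape of the dropout schedule, the `p_max` value, and the depth-proportional loss weights are choices made here for concreteness, the curriculum is not modeled, and all function names are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def layer_dropout_rates(n_layers: int, p_max: float = 0.2) -> List[float]:
    """Per-layer dropout rates rising from 0 at the first layer to p_max at the last."""
    if n_layers == 1:
        return [0.0]
    scale = math.log(2.0) / (n_layers - 1)
    return [p_max * (math.exp(layer * scale) - 1.0) for layer in range(n_layers)]


def shared_head_logits(hidden: Sequence[float], head: Sequence[Sequence[float]]) -> List[float]:
    """Apply the single shared output projection (vocab x dim) to a hidden state."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of the target token, via a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def early_exit_loss(
    hiddens: Sequence[Sequence[float]],   # one hidden state per layer
    head: Sequence[Sequence[float]],      # the same head is reused at every depth
    target: int,
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted average of the per-layer losses read out through the shared head."""
    if weights is None:
        # Assumption: deeper layers get proportionally more weight.
        weights = [float(i + 1) for i in range(len(hiddens))]
    total_w = sum(weights)
    losses = [cross_entropy(shared_head_logits(h, head), target) for h in hiddens]
    return sum(w * l for w, l in zip(weights, losses)) / total_w
```

The key structural point is that `head` appears once and is applied to every layer's hidden state, so no per-layer output projection is introduced.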
At inference, self-speculative decoding then drafts with the first E transformer layers followed by the shared head, producing several tokens autoregressively from this shallow sub-model, and verifies by running the remaining layers over those draft tokens to complete the full model in a single pass, keeping the matching prefix. [2]
The distinctive efficiency of LayerSkip comes from sharing computation between the two stages. Because the draft's E layers are exactly the first E layers of the verifier, the two stages use a single shared KV cache rather than the two separate caches a draft-model scheme would need. An "exit query cache" stores the query vector at the exit layer so that verification can resume directly at layer E and continue to the final layer without recomputing the early layers it already ran for drafting. This shared compute and shared cache give LayerSkip a smaller memory footprint than conventional speculative decoding while preserving the lossless verification guarantee. [2] Meta released six LayerSkip checkpoints based on Llama models on Hugging Face, and the training recipe was integrated into torchtune in December 2024 and into Hugging Face TRL in March 2025. [2][7]
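The draft-then-resume structure can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: a "hidden state" is a single integer, each layer is a function on integers, the shared head is the identity, and there is no attention or real KV cache. The sketch only demonstrates that reusing the exit-layer states for verification gives the same output as running the full stack greedily.

```python
from typing import Callable, List, Sequence

Layer = Callable[[int], int]


def run(layers: Sequence[Layer], h: int) -> int:
    """Apply a stack of layers to a (toy) hidden state."""
    for f in layers:
        h = f(h)
    return h


def full_greedy(layers: Sequence[Layer], last_token: int, n: int) -> List[int]:
    """Baseline: every token costs a pass through all layers."""
    out: List[int] = []
    tok = last_token
    for _ in range(n):
        tok = run(layers, tok)
        out.append(tok)
    return out


def self_speculative(
    layers: Sequence[Layer], exit_layer: int, last_token: int, n: int, k: int = 3
) -> List[int]:
    """Draft with layers[:exit_layer], verify by resuming at exit_layer."""
    early, late = layers[:exit_layer], layers[exit_layer:]
    out: List[int] = []
    tok = last_token
    while len(out) < n:
        # Draft k tokens with the shallow sub-model, caching exit-layer states.
        exit_states: List[int] = []
        draft: List[int] = []
        cur = tok
        for _ in range(k):
            h = run(early, cur)
            exit_states.append(h)
            cur = h                      # toy shared head: identity
            draft.append(cur)
        # Verify: continue from the cached states through the remaining layers.
        # (A real model does this for all positions in one batched pass.)
        for h, d in zip(exit_states, draft):
            tok = run(late, h)
            out.append(tok)
            if tok != d:
                break                    # reject the rest of the draft
    return out[:n]


# Hypothetical toy layers over a vocabulary of 11 token ids.
TOY_LAYERS: List[Layer] = [
    lambda h: (3 * h + 1) % 11,
    lambda h: (h + 2) % 11,
    lambda h: h if h % 3 else (h + 1) % 11,   # the only layer the draft skips
]
```

Calling `self_speculative(TOY_LAYERS, 2, 1, 12)` returns the same sequence as `full_greedy(TOY_LAYERS, 1, 12)`. The early layers are never re-run during verification, because a cached exit state is only used while all preceding draft tokens have been accepted.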
Both methods report lossless speedups of roughly 1.5x to 2.2x, with the largest gains on input-grounded tasks such as summarization and structured generation, where many tokens are easy to draft.
| Method | Model | Task | Reported speedup |
|---|---|---|---|
| Draft and Verify [1] | LLaMA-2 13B | CNN/DM summarization | ~1.57x |
| Draft and Verify [1] | Code Llama 13B | HumanEval code generation | ~1.46x |
| Draft and Verify [1] | LLaMA-2 (best case) | text generation | up to 1.99x |
| LayerSkip [2] | Llama 2 7B (continual pretraining) | CNN/DM summarization | 1.86x |
| LayerSkip [2] | Llama 2 13B (continual pretraining) | CNN/DM summarization | 1.81x |
| LayerSkip [2] | Llama (from scratch) | one-shot summarization | up to 2.16x |
| LayerSkip [2] | Llama (code finetuning) | HumanEval code generation | 1.82x |
| LayerSkip [2] | Llama (task finetuning) | TOPv2 semantic parsing | 2.0x |
The numbers are not directly comparable across the two papers because they use different model sizes, hardware, and benchmark setups, but they agree on the qualitative picture: self-speculation recovers a substantial fraction of the speedup of draft-model speculative decoding while removing the separate draft model entirely. The gains track the acceptance rate of the shallow drafter, so tasks whose tokens are highly predictable from the input, such as extractive summarization or semantic parsing, see the highest acceleration. [1][2]
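The dependence on acceptance rate can be made concrete with the standard estimate from the speculative decoding analysis of Leviathan et al. [3]. Under the simplifying assumption that each drafted token is accepted independently with probability $\alpha$, and with a fixed draft length $\gamma$ (both symbols are introduced here for illustration), the expected number of tokens produced per verification pass is:

$$
\mathbb{E}[\text{tokens per pass}] \;=\; \sum_{i=0}^{\gamma} \alpha^{i} \;=\; \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha}, \qquad 0 \le \alpha < 1 .
$$

For instance, with $\alpha = 0.8$ and $\gamma = 4$ this gives $(1 - 0.8^{5}) / 0.2 \approx 3.36$ tokens per pass, while $\alpha = 0.5$ with the same draft length gives about $1.94$. The realized speedup is lower than these figures because drafting itself has a cost.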
Self-speculative decoding sits within the broader speculative-decoding taxonomy as a self-drafting method, meaning the draft comes from the target model rather than an independent network.
Relative to early exit, the connection is direct. A plain early-exit decoder stops at a shallow layer and accepts that shallow prediction, trading output quality for speed, so it is lossy. Self-speculative decoding uses the same shallow forward pass only to draft, then verifies with the full model, which restores exactness. LayerSkip can thus be read as converting a lossy early-exit accelerator into a lossless one by adding verification. [2]
Medusa and EAGLE are the two best-known self-drafting alternatives, and they differ from self-speculative decoding in where the draft comes from. Medusa (Cai et al., 2024) attaches several extra decoding heads on top of the model's final hidden state, each predicting a token a few positions ahead, and verifies the resulting candidates with tree attention. [5] EAGLE (Li et al., 2024) adds a lightweight autoregressive module that predicts the next layer's features rather than tokens, also expanding candidates into a draft tree verified by tree attention. [6] Both add new trained parameters and draft from the model's top-layer features. Self-speculative decoding instead reuses the model's existing intermediate layers as the drafter, adding no new modules: Draft and Verify adds nothing at all and trains nothing, while LayerSkip changes only the training recipe and shares the one language-model head. A further distinction is that Medusa's default typical-acceptance criterion is not strictly distribution-preserving, whereas Draft and Verify and LayerSkip self-speculation are exact for greedy decoding by construction.
The method is also complementary to schemes that obtain candidates without any draft network, such as prompt lookup decoding, which copies n-grams from the context without a model pass, and lookahead decoding, which generates candidates by parallel Jacobi iteration on the target model itself. Those approaches draft by retrieval or by fixed-point iteration; self-speculative decoding drafts by computing with fewer layers. All of them share the same verification half of speculative decoding and the same lossless guarantee, differing only in how the draft tokens are produced. [1][2][3]
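For contrast with layer-skipping drafts, the retrieval-style drafting used by prompt lookup decoding can be sketched as follows. This is a simplified illustration with hypothetical names and defaults, not the reference implementation: it matches the last few tokens of the context against earlier text and proposes whatever followed the most recent match.

```python
from typing import List, Sequence


def prompt_lookup_draft(
    context: Sequence[int], ngram: int = 2, max_draft: int = 5
) -> List[int]:
    """Propose draft tokens by copying the continuation of a repeated n-gram."""
    if ngram <= 0 or len(context) < ngram:
        return []
    tail = list(context[-ngram:])
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(context) - ngram - 1, -1, -1):
        if list(context[start:start + ngram]) == tail:
            continuation = list(context[start + ngram:start + ngram + max_draft])
            if continuation:
                return continuation
    return []  # no repeat found: fall back to ordinary decoding
```

For the context `[1, 2, 3, 4, 1, 2]`, the trailing bigram `[1, 2]` also occurs at the start, so the draft is the tokens that followed it there. The proposed tokens would then go through the same verification step as any other draft.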