Multi-token prediction
Last reviewed
May 17, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 5,180 words
Multi-token prediction (often abbreviated MTP) is a language modeling training objective in which the model is trained to predict several future tokens at each context position rather than only the next token. The standard autoregressive objective, next-token prediction, asks the model to maximize the likelihood of the immediately following token given the prefix. MTP generalizes this by attaching n parallel prediction heads on top of a shared transformer trunk, one for each of the next n tokens, and adding the auxiliary cross-entropy losses to the main objective during training [1]. The technique was popularized as a deliberate training objective by a Meta AI paper in April 2024 ("Better and Faster Large Language Models via Multi-token Prediction" by Fabian Gloeckle and colleagues), and was adopted at scale in DeepSeek-V3 in December 2024, where the same MTP heads doubled as drafters for speculative decoding at inference time [1][2].
MTP is attractive for two distinct reasons. First, it provides denser supervision per training step, because each hidden state has to encode information useful for predicting several future positions rather than just the next one. This appears to improve sample efficiency and downstream performance on generative tasks, with the gains growing larger at bigger model scales [1]. Second, the auxiliary heads can be reused at inference time as a self-contained drafter for speculative decoding, delivering roughly 1.8 to 3 times faster generation without quality loss [1][2][3]. By 2026 the technique has spread from research papers into production-grade open models including DeepSeek-V3, DeepSeek-R1, Qwen3-Next, and Gemma 4, and into mainstream inference engines such as vLLM, SGLang, and llama.cpp [3][4].
For most of the modern history of large language models, the workhorse training objective has been maximum likelihood under an autoregressive factorization. The model receives a sequence of tokens, transforms it through stacked self-attention and feedforward blocks, and produces a probability distribution over the vocabulary at each position. The loss compares that distribution to the actual next token. Stacking this across billions of positions gives the familiar next-token cross-entropy objective.
This recipe has known weaknesses. Each position contributes only a single token's worth of supervision per step, which makes pretraining data-hungry. The signal is also myopic: the gradient only ever encourages the model to be right about the immediately following token, so any longer-range planning, such as choosing a function name early in a code completion so that calls many tokens later remain coherent, has to emerge implicitly. Researchers have long argued that this is one reason why language models can be locally fluent but globally incoherent over many tokens.
In parallel, the inference side of language modeling has its own bottleneck. Autoregressive decoding is inherently sequential: each new token requires a full forward pass through the network, and there is no parallelism over the generated tokens themselves. On modern accelerators this is wasteful, because the matrix multiplications in a single-token forward pass leave most of the GPU's compute idle, and most of the time is spent waiting on memory bandwidth. Various forms of parallel decoding have been proposed to break this bottleneck, with speculative decoding being the dominant family.
MTP sits at the intersection of these two threads. By predicting several tokens during training, it provides extra supervision that the model can use to learn richer representations. By exposing the auxiliary heads at inference, it provides a built-in drafter for speculative decoding. The same architectural addition that helps training also helps inference, which is unusual in the literature on language model efficiency.
The idea of predicting multiple tokens at once is not new. Stern, Shazeer, and Uszkoreit introduced blockwise parallel decoding at NeurIPS 2018 as an inference speedup for sequence-to-sequence models [5]. Their approach added k auxiliary prediction heads to a Transformer, trained them by knowledge distillation from the original next-token head, and used them at inference to propose a block of k candidate tokens that the base model could then verify in parallel. The accepted prefix advanced the output, and any rejected tail was recomputed normally. They reported wall-clock speedups of around 3 times on machine translation. This is essentially the same recipe used by modern MTP drafters, although the 2018 paper did not frame the auxiliary heads as a tool for improving the base model itself.
For several years after, the parallel decoding line of work stayed focused on inference. Speculative decoding in its now-canonical form, with a separate draft model whose proposals are verified by a larger target model, was introduced by Leviathan et al. and Chen et al. in 2022 and 2023, and quickly became standard in production serving stacks. The training objective itself, however, remained almost universally next-token prediction.
The modern MTP literature changed this by treating multi-token prediction as a training objective rather than an inference trick. Gloeckle et al.'s Meta paper in April 2024 was the watershed moment: it demonstrated that adding several auxiliary heads at training time, with no extra training-time cost, both improved downstream performance and yielded a usable drafter for free at inference [1]. DeepSeek-V3 then operationalized the idea in a flagship open model later in 2024 [2], and a string of follow-up papers in 2025 and 2026 refined the technique with self-distillation targets, register tokens, leap prediction, and shared drafter heads.
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Synnaeve published "Better and Faster Large Language Models via Multi-token Prediction" at ICML 2024 as a Meta FAIR contribution [1]. The paper trained transformer LMs of up to 13 billion parameters on hundreds of billions of tokens of code and natural text, with each model duplicated under two training objectives: standard next-token prediction and 4-token prediction.
The Meta architecture keeps a single shared transformer trunk and adds n independent output heads on top of it. Each head is itself a transformer layer followed by an unembedding, and head i is responsible for predicting the token at position t + i given the prefix up to position t. The heads share the input embedding matrix and the unembedding matrix of the main model, so the auxiliary cost is only the small per-head transformer layer plus n softmaxes. With n = 4, the additional parameter count is in the low single-digit percent of the trunk.
Given a shared trunk that produces a hidden state z_t at position t, the multi-token prediction loss factorizes the joint probability of the next n tokens through n independent head distributions:
$$L_n = -\sum_{t} \sum_{i=1}^{n} \log P_\theta(x_{t+i} \mid z_t)$$
In this factorization the n head distributions are conditionally independent given z_t. Because the latent z_t has to support all n predictions simultaneously, the trunk is pushed toward representations that encode information about several future tokens, not just the next one. The total training loss is the sum of the per-head cross-entropy losses, often averaged and scaled by a constant.
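As a rough illustration of this objective, the sketch below computes the summed per-head cross-entropy for one sequence in plain Python. The names `log_softmax` and `mtp_loss`, and the layout of `head_logits`, are assumptions made for the example rather than anything taken from the paper's code; the logits are treated as given, so the trunk and heads themselves are not modeled.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mtp_loss(head_logits, tokens):
    """Summed cross-entropy: -sum_t sum_i log P(x_{t+i} | z_t).

    head_logits[i - 1][t] is assumed to hold the logits that head i
    produces from the trunk state z_t. Positions where t + i runs past
    the end of the sequence have no target and are skipped.
    """
    total = 0.0
    for i, per_position in enumerate(head_logits, start=1):
        for t, logits in enumerate(per_position):
            if t + i < len(tokens):
                total -= log_softmax(logits)[tokens[t + i]]
    return total

# Toy usage: 4 tokens, vocabulary of 3, n = 2 heads with uniform logits.
tokens = [0, 1, 2, 1]
uniform = [[[0.0, 0.0, 0.0] for _ in tokens] for _ in range(2)]
print(mtp_loss(uniform, tokens))
```

Head 1 has three valid targets and head 2 has two, so the toy loss is five times log 3. In practice the sum is often averaged or rescaled, as the paragraph above notes.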
A naive implementation would materialize n logit tensors of shape (batch, sequence, vocabulary), which at modern vocabulary sizes is the dominant memory cost of training. Gloeckle et al. propose a sequential forward and backward over the heads: for each head, the trunk hidden state is fed in, the logits and loss are computed, the backward pass is run, the gradients are accumulated at the trunk, and the logits and their gradient buffers are released before moving on to the next head [1]. This reduces the peak memory requirement from O(nV + d) to O(V + d), where V is the vocabulary size and d is the hidden dimension, and brings the per-step memory of MTP training back to roughly the same as next-token training despite the n heads.
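The sequential scheme can be sketched with hand-written gradients. This is a simplified illustration, not the authors' implementation: each head is assumed to be a single linear map to the vocabulary (the paper's heads are transformer layers followed by a shared unembedding), only one position is processed, and the function names are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def head_loss_and_grad(weight, z, target):
    """Cross-entropy of one linear head and its gradient with respect to z."""
    logits = [sum(w * x for w, x in zip(row, z)) for row in weight]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # dL/dlogits = probs - onehot(target)
    dlogits = [p - (1.0 if v == target else 0.0) for v, p in enumerate(probs)]
    # dL/dz = W^T dlogits
    grad_z = [sum(weight[v][j] * dlogits[v] for v in range(len(weight)))
              for j in range(len(z))]
    return loss, grad_z

def sequential_heads(weights, z, targets):
    """Process heads one at a time, accumulating the gradient at the trunk output."""
    total_loss = 0.0
    grad_z = [0.0] * len(z)
    for weight, target in zip(weights, targets):
        loss, g = head_loss_and_grad(weight, z, target)
        total_loss += loss
        grad_z = [a + b for a, b in zip(grad_z, g)]
        # The vocabulary-sized buffers for this head are no longer referenced
        # here, so only one head's logits need to be alive at any time.
    return total_loss, grad_z
```

Only the d-sized accumulator `grad_z` persists across heads, which is the source of the O(V + d) peak described above; the trunk's backward pass would then run once from the accumulated gradient.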
Gloeckle et al. report several findings. On code generation benchmarks, 13 billion parameter models trained with 4-token prediction solved about 12 percent more HumanEval problems and 17 percent more MBPP problems than next-token baselines trained on the same data [1]. On natural language summarization (CNN / DailyMail, XSum, and other ROUGE-evaluated benchmarks), the MTP models showed consistent ROUGE gains after the same finetuning recipes. On byte-level models, where the natural prediction horizon is longer because each byte carries less information than a BPE token, 8-byte prediction was the optimal setting and gave around 6.4 times faster inference. The gains grew with model size, suggesting that MTP is more useful for larger models than for small ones, which is the opposite of many regularization-style techniques.
At inference, the same 4-head models reached roughly 3 times higher decoding throughput when the auxiliary heads were used for self-speculative decoding, with no degradation in output quality because the main head was still used as the verifier. The paper also argued that MTP helps the model develop induction heads earlier in training and improves algorithmic reasoning on toy multi-step arithmetic tasks, suggesting that part of the gain is qualitative rather than only sample-efficiency-driven.
DeepSeek-V3 was the first widely deployed flagship model to ship with multi-token prediction baked in at scale. The DeepSeek-V3 technical report describes an MTP design that is similar in spirit to Gloeckle et al. but differs in several important details [2].
DeepSeek-V3 uses D sequential MTP modules to predict the next D tokens, where each MTP module is a full transformer block of the same flavor as the main trunk (with Multi-head Latent Attention and a Mixture-of-Experts feedforward) [2]. The base trunk and the D MTP modules share the input embedding and the output head, so the modules add at most a handful of percent to the parameter count. In the released checkpoints, D = 1 was used, meaning DeepSeek-V3 predicts two tokens at each position during training: the next one from the main head, and the second-next one from a single MTP module sitting one block deeper than the trunk.
The critical architectural difference from Gloeckle et al. is that DeepSeek-V3 keeps the predictions causally chained. The k-th MTP module takes as input the embedding of the k-th future token together with the hidden state output by the previous module (the main trunk when k = 1), normalizes each with RMSNorm, concatenates them, and projects the result back to the trunk hidden dimension via a learned matrix M_k. The output of this MTP transformer block is then fed to the shared unembedding to score the (k+1)-th future token, so for k = 1 the module predicts the second-next token. Predicting tokens 2, 3, and beyond therefore uses a longer causal chain rather than n independent parallel heads, which the authors argue preserves richer dependencies between predictions [2].
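The input combination step of one MTP module can be sketched as follows. The sketch assumes the formulation in the DeepSeek-V3 report, in which the previous hidden state and the token embedding are each RMS-normalized before being concatenated and projected; the learnable RMSNorm gain is omitted, and `rms_norm`, `mtp_module_input`, and the toy projection are illustrative names and values, not DeepSeek code.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS normalization without a learnable gain (a simplification)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def mtp_module_input(prev_hidden, token_embedding, projection):
    """Build the input to one MTP transformer block.

    prev_hidden and token_embedding both have length d; projection plays
    the role of M_k and has shape (d, 2d), given as a list of d rows.
    """
    combined = rms_norm(prev_hidden) + rms_norm(token_embedding)  # length 2d
    return [sum(w * c for w, c in zip(row, combined)) for row in projection]

# Toy usage with d = 2: the projection keeps one coordinate from each half.
projection = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]]
print(mtp_module_input([3.0, 4.0], [1.0, 1.0], projection))
```

The result would then pass through the module's transformer block and the shared unembedding, neither of which is modeled here.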
The total MTP loss is the average of the cross-entropy losses across the D modules, scaled by a weighting factor lambda:
$$L_{\text{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} L^{k}_{\text{MTP}}$$
The total training loss combines the main next-token loss with L_MTP. In the DeepSeek-V3 recipe, lambda starts at 0.3 for the first 10 trillion training tokens and is reduced to 0.1 for the final stages of pretraining, to taper the MTP contribution as the main head becomes accurate [2]. The total compute overhead of MTP at training is small because D = 1 only adds one transformer block per forward pass, the embedding and unembedding are shared, and the memory-efficient sequential head implementation from Gloeckle et al. is applied.
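A minimal sketch of this weighting, using the 0.3 and 0.1 values and the 10 trillion token switch point quoted above. The function names and the hard step between the two values are assumptions for illustration; the source does not say how the transition is implemented.

```python
def mtp_lambda(tokens_seen, switch_at=10e12, early=0.3, late=0.1):
    """MTP loss weight: `early` for the first `switch_at` tokens, then `late`."""
    return early if tokens_seen < switch_at else late

def total_loss(main_loss, mtp_losses, lam):
    """Main next-token loss plus the lambda-weighted average of the D MTP losses."""
    depth = len(mtp_losses)
    return main_loss + (lam / depth) * sum(mtp_losses)

# Toy usage with D = 1: one MTP module loss of 3.0 on top of a main loss of 2.0.
print(total_loss(2.0, [3.0], mtp_lambda(1e12)))
```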
DeepSeek-V3 ablations report that adding the 1-depth MTP module improves performance across reasoning, code, and math benchmarks while leaving training throughput roughly unchanged [2]. The acceptance rate of the second-token prediction by the MTP module ranges between 85 and 90 percent across diverse generation tasks, which is what enables the inference speedup described below.
DeepSeek-V3 supports two inference modes. In the discard-MTP mode, the MTP module is dropped and the model behaves as a standard next-token autoregressive transformer. In the MTP speculative mode, the MTP module drafts one extra candidate token per step, and the main head verifies it; if the verified token agrees with the draft, the model effectively advances two tokens per forward pass. At an 85 to 90 percent acceptance rate, this delivers approximately a 1.8 times speedup in tokens per second [2]. AMD ROCm benchmarks of SGLang with MTP on Instinct GPUs reported 1.25 to 2.11 times speedup on Random workloads and 1.36 to 1.80 times speedup on ShareGPT, broadly confirming the technical report numbers [3].
The two designs aim at the same goal but make different trade-offs. The table below summarizes the contrasts.
| Aspect | Meta MTP (Gloeckle et al., 2024) | DeepSeek-V3 MTP (2024) |
|---|---|---|
| Training depth | n = 4 heads in parallel | D = 1 module in series (predicts 2nd token) |
| Head architecture | n independent transformer layers, parallel | Sequential transformer blocks, causally chained |
| Embedding sharing | Shared with main model | Shared with main model |
| Unembedding sharing | Shared with main model | Shared with main model |
| Loss aggregation | Sum over heads | Lambda-weighted average over depth |
| Lambda weighting | Implicit, uniform | Explicit lambda, decayed during training |
| Inference role | Self-speculative decoding | Speculative draft, main head verifies |
| Reported speedup | About 3x with 4-token heads, about 6.4x on 8-byte models | About 1.8x in production |
| Discardable at inference | Yes | Yes |
| Scale of evaluation | Up to 13B | 671B total, 37B activated (MoE) |
MTP heads are a natural drafter for speculative decoding, the most widely used inference-time acceleration for autoregressive models. In standard speculative decoding, a small draft model produces several candidate tokens, and the larger target model verifies them in a single parallel forward pass. Any prefix of the draft that matches what the target would have sampled is accepted; the first disagreement triggers a fresh sample from the target's distribution and the rest of the draft is discarded.
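The draft-then-verify loop can be sketched for the greedy case, where "what the target would have sampled" is simply its argmax token. This is a simplification of the sampling-based acceptance rule described above, and the two models are stand-in callables (`draft_next`, `target_next`) that map a token list to the next token; none of these names come from a real library.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One greedy draft-then-verify step; returns the tokens to append."""
    # Draft k tokens autoregressively with the cheap model.
    draft = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))
    # Verify. A real system scores all k + 1 positions in one batched
    # target forward pass; here the target is called position by position.
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(token)
    # Every draft matched, so the same pass also yields one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted

def generate(prefix, draft_next, target_next, k, length):
    """Generate `length` tokens after `prefix` using speculative steps."""
    out = list(prefix)
    while len(out) - len(prefix) < length:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + length]
```

Because every emitted token is either confirmed or supplied by the target, the output is identical to plain greedy decoding with the target alone; the drafter only changes how many target passes are needed.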
The ceiling on speedup is set by the average accepted run length. With perfect drafting, k drafted tokens are accepted per call and the user sees up to a (k+1)-fold throughput improvement, modulo verification overhead. In practice, draft acceptance is well below perfect, and the speedup is closer to 1.5 to 3 times for modern systems.
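To make the ceiling concrete, the sketch below estimates tokens emitted per target pass under a simplifying assumption that is not made in the source: each drafted token is accepted independently with the same probability. The function name is illustrative, and verification overhead is ignored.

```python
def expected_tokens_per_pass(acceptance, k):
    """Expected tokens emitted per target pass with k drafted tokens.

    The target always contributes one token (a correction or a bonus),
    and the j-th draft counts only if all j drafts up to it were accepted,
    which happens with probability acceptance ** j.
    """
    return sum(acceptance ** j for j in range(k + 1))

# One drafted token at 85 and 90 percent acceptance, then perfect drafting with k = 4.
print(expected_tokens_per_pass(0.85, 1))
print(expected_tokens_per_pass(0.90, 1))
print(expected_tokens_per_pass(1.0, 4))
```

With perfect acceptance the estimate reaches k + 1, matching the ceiling stated above. With a single drafted token it is 1 plus the acceptance rate, which is an upper bound on the realized speedup before drafting and verification costs are subtracted.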
MTP makes drafting cheap and naturally aligned. Because the drafter heads are trained jointly with the target model on the same data, they predict from the same distribution that the target was trained to match, which gives high acceptance rates. Because they share the embedding and unembedding with the main model, the drafter is essentially free in storage. And because they sit on top of the same hidden state as the main head, the verification pass is just the existing trunk forward, with the small MTP block as the only additional compute. The result is a self-contained drafter that does not require a separate small model and avoids the calibration mismatch that often hurts external drafters.
Reported speedups across recent MTP-equipped systems include:
| System | Setting | Reported speedup | Source |
|---|---|---|---|
| Meta 4-token model | Self-speculative decoding, batch=1 | About 3x | Gloeckle et al., 2024 [1] |
| Meta 8-byte model | Self-speculative decoding | About 6.4x | Gloeckle et al., 2024 [1] |
| DeepSeek-V3 | MTP speculative, 85 to 90 percent acceptance | 1.8x TPS | DeepSeek-AI, 2024 [2] |
| DeepSeek-V3 on SGLang (ROCm) | Various workloads | 1.25 to 2.11x | AMD ROCm Blogs, 2025 [3] |
| Qwen3-Next on vLLM | Single-stream code generation | 2 to 3x | vLLM team, 2025 [4] |
| Gemma 4 MTP drafter | Various tasks | Up to 3x | Google AI, 2026 [6] |
| FastMTP | Trained MTP drafter with shared weights | 2.03x average | Yu et al., 2025 [7] |
MTP-style drafters are now the default in several inference frameworks. The vLLM project added native MTP support for DeepSeek-V3 and Qwen3-Next in 2025. SGLang implemented MTP drafting with support for both AMD ROCm and NVIDIA CUDA GPUs. The llama.cpp project merged beta MTP support in 2026, giving local users roughly 1.5 to 2 times decode speedups on compatible models without running a separate draft model [4].
The MTP heads need a training target. In the simplest setting, the target is the actual next-token sequence in the training data, treated as a hard label. Several extensions instead use the main model's predictive distribution as a soft target, in a form of self-distillation.
Work by Kirchenbauer and colleagues (2026) shows that converting a pretrained next-token model into an MTP model using online self-distillation recovers most of the benefits of training MTP from scratch, at a small fraction of the cost [8]. The student MTP head is trained to clone the next-token distribution of the frozen base model at the corresponding future position, rather than fitting the raw training data. Because the final use of MTP is to draft future tokens that the base model would itself generate, distilling from the base model is more aligned with the inference objective than fitting the pretraining corpus.
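The difference between a hard label and a soft self-distillation target can be sketched for a single position. The choice of KL divergence from teacher to student is an assumption made for illustration; the text above only says the head is trained to clone the frozen base model's distribution and does not state the exact loss used in [8].

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def hard_label_loss(student_logits, target):
    """Cross-entropy against the actual token from the training data."""
    return -math.log(softmax(student_logits)[target])

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student), with the frozen base model as the teacher."""
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(teacher, student) if p > 0.0)

# Toy usage: the student is penalized for differing from the teacher's
# distribution at the future position, not for missing a specific token.
print(distillation_loss([0.0, 0.0, 0.0], [2.0, 1.0, 0.0]))
```

The distillation loss is zero exactly when the MTP head reproduces the base model's distribution, which is the situation in which its drafts would be accepted most often.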
A related family of methods, sometimes labelled MTP-D for "distilled MTP," uses the top-N logits of the main head as a teacher signal for the MTP heads, with the original next-token cross-entropy left untouched. This minimizes interference with the main head and has been reported to raise MTP head acceptance by about 7.5 percent on standard benchmarks while leaving the base model's accuracy essentially unchanged.
Gerontopoulos et al. (May 2025) proposed adding dedicated register tokens to the context whose only job is to carry information for the auxiliary heads [9]. The intuition is that the trunk's hidden states are already overloaded with information needed for the next-token head, and asking them to also encode several future positions degrades the main task. Register tokens act as private scratch space for MTP, letting the heads pull from them without disturbing the rest of the residual stream. The technique improves MTP gains on a range of tasks without changing the main objective.
L-MTP (Liu et al., NeurIPS 2025) drops the assumption that MTP must predict adjacent future tokens [10]. Instead, the MTP heads predict tokens at strategically chosen non-adjacent positions, skipping over intermediate tokens to extend the effective prediction horizon at the same compute cost. The decoder is then modified to consume these non-sequential predictions for accelerated generation. Reported gains include both improved language model quality and faster decoding compared to vanilla MTP.
FastMTP (Yu et al., September 2025) replaces the per-depth MTP modules with a single MTP head that is reused across multiple draft positions through weight sharing [7]. A short finetuning stage teaches the shared head to draft k tokens autoregressively while remaining compatible with EAGLE-style speculative decoding. FastMTP also adds a language-aware dynamic vocabulary compression step that drops rare logits from the drafter to lower its compute cost. On seven diverse benchmarks the system delivers an average 2.03 times speedup over next-token decoding, outperforming the vanilla DeepSeek-V3-style MTP drafter by 82 percent.
A more recent line of work (Beyond Multi-Token Prediction, 2025) trains the model to predict a learned summary of the next k tokens rather than the tokens themselves. The summary is computed by a small auxiliary network and serves as a compressed teacher signal, on the theory that exactly predicting many future tokens is unreasonably hard and that compressing them first gives a better target.
The self-distillation-based MTP-D approach includes a looped extension that reuses the same MTP head at varying depths through iterative prediction, which reduces the number of additional parameters and extends the effective prediction horizon. Reported results include a further 220 percent inference speedup improvement for a one-head MTP model combined with the looped extension [11].
As of 2026, MTP has moved out of research papers into production systems.
| Model / system | Year | MTP variant | Reported impact |
|---|---|---|---|
| Blockwise parallel decoding (Stern et al.) | 2018 | k auxiliary heads, distilled | About 3x speedup on translation |
| Gloeckle et al. (Meta) | 2024 | 4 parallel heads, 8 for bytes | 12 to 17 percent gains on code, 3x to 6.4x speedup |
| DeepSeek-V3 | 2024 | D = 1 sequential module | 1.8x decoding speedup |
| DeepSeek-R1 | 2025 | Inherits MTP from V3 base | Same drafter usable for reasoning trace generation |
| Qwen3-Next | 2025 | MTP head in checkpoint | 2 to 3x speedup in vLLM, above 80 percent acceptance |
| vLLM | 2025 | Native MTP drafter support | Production inference path |
| SGLang on AMD ROCm | 2025 | DeepSeek-V3 MTP | 1.25 to 2.11x throughput |
| llama.cpp | 2026 | Beta MTP support | 1.5 to 2x decode speed for compatible models |
| Gemma 4 | 2026 | MTP drafter (Google) | Up to 3x faster inference |
| Set Block Decoding | 2025 | MTP-style block drafting | Various |
Llama 3, Mistral, and the earlier Gemma releases ship without MTP heads, so users running these models on MTP-aware engines see no speedup until a future checkpoint is trained with the new objective. This asymmetry has become a quiet point in favor of newer open models like Qwen3-Next and DeepSeek-V3 in 2026 inference-throughput benchmarks.
The Meta paper offered several hypotheses for why predicting more tokens improves the base model itself, not just inference [1]:
Not all of these effects are equally well established. Subsequent work has questioned whether the algorithmic-reasoning improvements survive at larger scale, and the optimal value of n depends on tokenization, with larger n preferred for byte-level models and smaller n for BPE tokenizers.
MTP is not a universal improvement. The Meta paper itself shows several settings where it neither helps nor hurts:
Most production deployments use n = 1 or n = 2 to keep the auxiliary cost minimal while still gaining the inference benefits. The DeepSeek-V3 choice of D = 1 reflects a pragmatic balance between training cost and inference utility.
Implementing MTP in a training stack involves a few non-obvious choices.
The unembedding matrix (also called the output projection or LM head) is typically among the largest single tensors in a transformer model: with a vocabulary of 100,000 tokens or more and a hidden dimension in the thousands, it holds hundreds of millions to around a billion parameters. Sharing it across the main head and the MTP heads is essential to keep memory and parameter count manageable. Both the Meta paper and DeepSeek-V3 share it.
The key memory optimization is to compute the loss and gradient for each head sequentially rather than materializing all n logit tensors at once. This brings the peak memory back to roughly the same as next-token training [1].
The weighting factor lambda controls how much the MTP objective influences the trunk. Too small and the gains vanish; too large and the main head suffers. DeepSeek-V3 used a curriculum, starting at 0.3 and decaying to 0.1 [2]. Other reports suggest 0.1 to 0.3 is a reasonable range for D = 1 to D = 4.
The Meta paper uses parallel heads conditionally independent given z_t. DeepSeek-V3 chains the predictions causally, with each MTP module conditioning on the previous module's output and the next token's embedding. The causal version is more compute-intensive at depth greater than one but is reported to preserve richer dependencies between predicted tokens. For D = 1, the two approaches are functionally similar.
For MTP heads to be useful at inference, the serving stack must support speculative decoding with the model's own MTP modules as the drafter. Reference implementations exist in vLLM, SGLang, TensorRT-LLM, and llama.cpp as of 2026. The main difficulty is bookkeeping: KV-cache management, prefix-acceptance logic, and graceful fallback when the MTP draft fails repeatedly all need to be plumbed through. Engineering improvements over 2025 have made this largely transparent for end users.
MTP overlaps with several other ideas in language model training and inference. The table summarizes how it compares.
| Technique | Trains the base model? | Speeds up inference? | Needs separate draft model? | Notes |
|---|---|---|---|---|
| Next-token prediction | Yes | No | n/a | Standard objective |
| Blockwise parallel decoding (2018) | No (distilled aux heads) | Yes | No | First MTP-style drafter |
| Speculative decoding (Leviathan, Chen, 2022 to 2023) | No | Yes | Yes (small draft) | Generic inference acceleration |
| Medusa | No | Yes | No (extra heads on frozen base) | MTP-like heads added post-training |
| EAGLE | No | Yes | Yes (lightweight feature-level drafter) | State-of-the-art draft model |
| Multi-token prediction (Gloeckle, 2024) | Yes | Yes | No | Joint train + draft |
| DeepSeek-V3 MTP | Yes | Yes | No | Production deployment |
| Set Block Decoding | Yes | Yes | No | Block-level parallel decoding |
| Future summaries | Yes | No (training only) | n/a | Predicts compressed summary, not raw tokens |
| Self-distillation MTP | Yes (from frozen base) | Yes | No | Converts next-token model to MTP |
Medusa and EAGLE are particularly close in spirit to MTP at inference. Medusa adds extra heads to a frozen base model and trains them on top, while MTP heads are trained jointly with the base. EAGLE uses a tiny separate drafter that reads the base model's hidden states, which often achieves higher acceptance than Medusa but adds external state. MTP is roughly competitive with Medusa on speedup while also helping the base model's quality, which neither Medusa nor EAGLE does.
As of 2026, several questions about MTP remain open: