Multi-token prediction

Updated May 17, 2026

Multi-token prediction (often abbreviated MTP) is a language modeling training objective in which the model is trained to predict several future tokens at each context position rather than only the next token. The standard autoregressive objective, next-token prediction, asks the model to maximize the likelihood of the immediately following token given the prefix. MTP generalizes this by attaching n parallel prediction heads on top of a shared transformer trunk, one for each of the next n tokens, and adding the auxiliary cross-entropy losses to the main objective during training ^[1]. The technique was popularized as a deliberate training objective by a Meta AI paper in April 2024 ("Better and Faster Large Language Models via Multi-token Prediction" by Fabian Gloeckle and colleagues), and was adopted at scale in DeepSeek-V3 in December 2024, where the same MTP heads doubled as drafters for speculative decoding at inference time ^[1]^[2].

MTP is attractive for two distinct reasons. First, it provides denser supervision per training step, because each hidden state has to encode information useful for predicting several future positions rather than just the next one. This appears to improve sample efficiency and downstream performance on generative tasks, with the gains growing larger at bigger model scales ^[1]. Second, the auxiliary heads can be reused at inference time as a self-contained drafter for speculative decoding, delivering roughly 1.8 to 3 times faster generation without quality loss ^[1]^[2]^[3]. By 2026 the technique has spread from research papers into production-grade open models including DeepSeek-V3, DeepSeek-R1, Qwen3-Next, and Gemma 4, and into mainstream inference engines such as vLLM, SGLang, and llama.cpp ^[3]^[4].

Background and motivation

For most of the modern history of large language models, the workhorse training objective has been maximum likelihood under an autoregressive factorization. The model receives a sequence of tokens, transforms it through stacked self-attention and feedforward blocks, and produces a probability distribution over the vocabulary at each position. The loss compares that distribution to the actual next token. Stacking this across billions of positions gives the familiar next-token cross-entropy objective.

This recipe has known weaknesses. Each position supplies only a single target token of supervision per step, which makes pretraining data-hungry. The signal is also myopic: the gradient only ever encourages the model to be right about the immediately following token, so any longer-range planning, such as choosing a function name early in a code completion so that calls many tokens later remain coherent, has to emerge implicitly. Researchers have long argued that this is one reason why language models can be locally fluent but globally incoherent over many tokens.

In parallel, the inference side of language modeling has its own bottleneck. Autoregressive decoding is inherently sequential: each new token requires a full forward pass through the network, and there is no parallelism over the generated tokens themselves. On modern accelerators this is wasteful, because the matrix multiplications in a single-token forward pass come nowhere near saturating the GPU's compute units, and most of the time is spent waiting on memory bandwidth. Various forms of parallel decoding have been proposed to break this bottleneck, with speculative decoding being the dominant family.

MTP sits at the intersection of these two threads. By predicting several tokens during training, it provides extra supervision that the model can use to learn richer representations. By exposing the auxiliary heads at inference, it provides a built-in drafter for speculative decoding. The same architectural addition that helps training also helps inference, which is unusual in the literature on language model efficiency.

Origins

The idea of predicting multiple tokens at once is not new. Stern, Shazeer, and Uszkoreit introduced blockwise parallel decoding at NeurIPS 2018 as an inference speedup for sequence-to-sequence models ^[5]. Their approach added k auxiliary prediction heads to a Transformer, trained them by knowledge distillation from the original next-token head, and used them at inference to propose a block of k candidate tokens that the base model could then verify in parallel. The accepted prefix advanced the output, and any rejected tail was recomputed normally. They reported wall-clock speedups of around 3 times on machine translation. This is essentially the same recipe used by modern MTP drafters, although the 2018 paper did not frame the auxiliary heads as a tool for improving the base model itself.

For several years after, the parallel decoding line of work stayed focused on inference. Speculative decoding in its now-canonical form, in which a small separate draft model proposes tokens and a larger target model verifies them, was introduced by Leviathan et al. and Chen et al. in 2022 and 2023, and quickly became standard in production serving stacks. The training objective itself, however, remained almost universally next-token prediction.

The modern MTP literature changed this by treating multi-token prediction as a training objective rather than an inference trick. Gloeckle et al.'s Meta paper in April 2024 was the watershed moment: it demonstrated that adding several auxiliary heads at training time, with no extra training-time cost, both improved downstream performance and yielded a usable drafter for free at inference ^[1]. DeepSeek-V3 then operationalized the idea in a flagship open model later in 2024 ^[2], and a string of follow-up papers in 2025 and 2026 refined the technique with self-distillation targets, register tokens, leap prediction, and shared drafter heads.

The Meta multi-token prediction paper

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Roziere, David Lopez-Paz, and Gabriel Synnaeve published "Better and Faster Large Language Models via Multi-token Prediction" at ICML 2024 as a Meta FAIR contribution ^[1]. The paper trained transformer LMs of up to 13 billion parameters on hundreds of billions of tokens of code and natural text, with each model duplicated under two training objectives: standard next-token prediction and 4-token prediction.

Architecture

The Meta architecture keeps a single shared transformer trunk and adds n independent output heads on top of it. Each head is itself a transformer layer followed by an unembedding, and head i is responsible for predicting the token at position t plus i given the prefix up to position t. The heads share the input embedding matrix and the unembedding matrix of the main model, so the auxiliary cost is only the small per-head transformer layer plus n softmaxes. With n = 4, the additional parameter count is in the low single-digit percent of the trunk.

Loss formulation

Given a shared trunk that produces a hidden state z_t at position t, the multi-token prediction loss factorizes the joint probability of the next n tokens through n independent head distributions:

$$L_n = -\sum_{t} \sum_{i=1}^{n} \log P_\theta(x_{t+i} \mid z_t)$$

In this factorization the n head distributions are conditionally independent given z_t. Because the latent z_t has to support all n predictions simultaneously, the trunk is pushed toward representations that encode information about several future tokens, not just the next one. The total training loss is the sum of the per-head cross-entropy losses, often averaged and scaled by a constant.
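
The summed objective can be sketched in plain Python as follows. This is a minimal illustration, not the paper's code: the function names and the nested-list layout of `head_logits` are assumptions made for the example, and positions whose target would fall past the end of the sequence are simply skipped.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mtp_loss(head_logits, tokens):
    """Sum of per-head cross-entropies over all valid positions.

    head_logits[i][t] holds the logits of head i+1 at position t, i.e. its
    scores for token x_{t+i+1}. The layout is an assumption of this sketch.
    """
    total = 0.0
    for i, per_position in enumerate(head_logits):
        offset = i + 1
        for t, logits in enumerate(per_position):
            target_index = t + offset
            if target_index >= len(tokens):
                break  # no target exists this far ahead
            total -= log_softmax(logits)[tokens[target_index]]
    return total
```

With a four-token sequence, two heads, and uniform logits over a three-word vocabulary, head 1 has three valid targets and head 2 has two, so the loss is five times log 3.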

Memory-efficient implementation

A naive implementation would materialize n logit tensors of shape (batch, sequence, vocabulary), which at modern vocabulary sizes is the dominant memory cost of training. Gloeckle et al. propose a sequential forward and backward over the heads: for each head, the trunk hidden state is fed in, the logits and loss are computed, the backward pass is run, the gradients are accumulated at the trunk, and the logits and their gradient buffers are released before moving on to the next head ^[1]. This reduces the peak memory requirement from O(nV + d) to O(V + d), where V is the vocabulary size and d is the hidden dimension, and brings the per-step memory of MTP training back to roughly the same as next-token training despite the n heads.
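
The sequential pattern can be illustrated for a single position with hand-written gradients. This is only a sketch under simplifying assumptions: each head is reduced to a hypothetical linear map (the paper's heads are transformer layers), and the function name is invented for the example. The point it shows is that only one vocabulary-sized buffer is live at a time while the gradient at the trunk output accumulates across heads.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sequential_heads_backward(z, head_weights, targets):
    """Return (total_loss, grad_z), processing one head at a time.

    z: trunk hidden state for one position (length d).
    head_weights[i]: V x d matrix for a hypothetical linear head i.
    targets[i]: index of the token head i should predict.
    """
    grad_z = [0.0] * len(z)
    total_loss = 0.0
    for weights, target in zip(head_weights, targets):
        # Forward for this head only: a single V-sized logit buffer is live.
        logits = [sum(w * x for w, x in zip(row, z)) for row in weights]
        probs = softmax(logits)
        total_loss -= math.log(probs[target])
        # d(loss)/d(logits) for softmax cross-entropy is probs - one_hot.
        grad_logits = [p - (1.0 if v == target else 0.0)
                       for v, p in enumerate(probs)]
        # Accumulate this head's contribution at the trunk output.
        for row, g in zip(weights, grad_logits):
            for j, w in enumerate(row):
                grad_z[j] += g * w
        # Release the per-head buffers before the next head runs.
        del logits, probs, grad_logits
    return total_loss, grad_z
```

The accumulated `grad_z` is what would then be backpropagated once through the shared trunk.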

Key results

Gloeckle et al. report several findings. On code generation benchmarks, 13 billion parameter models trained with 4-token prediction solved about 12 percent more HumanEval problems and 17 percent more MBPP problems than next-token baselines trained on the same data ^[1]. On natural language summarization (CNN / DailyMail, XSum, and other ROUGE-evaluated benchmarks), the MTP models showed consistent ROUGE gains after the same finetuning recipes. On byte-level models, where the natural prediction horizon is longer because each byte carries less information than a BPE token, 8-byte prediction was the optimal setting and gave around 6.4 times faster inference. The gains grew with model size, suggesting that MTP is more useful for larger models than for small ones, which is the opposite of many regularization-style techniques.

At inference, the same 4-head models reached roughly 3 times higher decoding throughput when the auxiliary heads were used for self-speculative decoding, with no degradation in output quality because the main head was still used as the verifier. The paper also argued that MTP helps the model develop induction heads earlier in training and improves algorithmic reasoning on toy multi-step arithmetic tasks, suggesting that part of the gain is qualitative rather than only sample-efficiency-driven.

DeepSeek-V3 multi-token prediction

DeepSeek-V3 was the first widely deployed flagship model to ship with multi-token prediction baked in at scale. The DeepSeek-V3 technical report describes an MTP design that is similar in spirit to Gloeckle et al. but differs in several important details ^[2].

Architecture

DeepSeek-V3 uses D sequential MTP modules to predict the next D tokens, where each MTP module is a full transformer block of the same flavor as the main trunk (with Multi-head Latent Attention and a Mixture-of-Experts feedforward) ^[2]. The base trunk and the D MTP modules share the input embedding and the output head, so the modules add at most a handful of percent to the parameter count. In the released checkpoints, D = 1 was used, meaning DeepSeek-V3 predicts two tokens at each position during training: the next one from the main head, and the second-next one from a single MTP module sitting one block deeper than the trunk.

The critical architectural difference from Gloeckle et al. is that DeepSeek-V3 keeps the predictions causally chained. For input position i, the k-th MTP module takes the hidden state produced for that position by the previous depth (the main trunk when k = 1) and the embedding of the token k positions ahead, t_{i+k}. It applies RMSNorm to each, concatenates the two normalized vectors, and projects the result back to the trunk hidden dimension via a learned matrix M_k. The output of the MTP transformer block is then fed to the shared output head to score the token k + 1 positions ahead, t_{i+k+1}. Predicting tokens 2, 3, and beyond therefore uses a longer causal chain rather than n independent parallel heads, which the authors argue preserves richer dependencies between predictions ^[2].
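
The input-combination step of one MTP module can be sketched as below, following the formula in the technical report as read here: each input is normalized separately, the results are concatenated, and M_k projects back to the hidden size. The function names are illustrative, the learned RMSNorm gain is omitted, and the transformer block and shared output head that follow are not shown.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain (omitted for brevity in this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def mtp_module_input(h_prev, future_token_embedding, projection):
    """Combine the previous-depth hidden state with a future-token embedding.

    h_prev: hidden state of position i from depth k-1 (length d).
    future_token_embedding: embedding of token t_{i+k} (length d).
    projection: the d x 2d matrix M_k, given as a list of rows.
    """
    # Normalize each input on its own, then concatenate to length 2d.
    combined = rms_norm(h_prev) + rms_norm(future_token_embedding)
    # Project back down to the trunk hidden dimension d.
    return [sum(w * v for w, v in zip(row, combined)) for row in projection]
```

The returned vector is what the module's transformer block would consume before the shared output head scores token t_{i+k+1}.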

Loss formulation

The total MTP loss is the average of the cross-entropy losses across the D modules, scaled by a weighting factor lambda:

$$L_{\mathrm{MTP}} = \frac{\lambda}{D} \sum_{k=1}^{D} L^{k}_{\mathrm{MTP}}$$

The total training loss combines the main next-token loss with L_MTP. In the DeepSeek-V3 recipe, lambda starts at 0.3 for the first 10 trillion training tokens and is reduced to 0.1 for the final stages of pretraining, to taper the MTP contribution as the main head becomes accurate ^[2]. The total compute overhead of MTP at training is small because D = 1 only adds one transformer block per forward pass, the embedding and unembedding are shared, and the memory-efficient sequential head implementation from Gloeckle et al. is applied.
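
The weighting schedule and the combined objective can be written down in a few lines. This is a sketch, not DeepSeek's code: the function names are invented, the main loss and L_MTP are assumed to be added, and the treatment of the exact 10-trillion-token boundary is an assumption.

```python
def mtp_lambda(tokens_seen, switch_at=10e12, early=0.3, late=0.1):
    """MTP loss weight as a step function of training tokens consumed."""
    return early if tokens_seen < switch_at else late

def total_loss(main_loss, mtp_losses, tokens_seen):
    """Main next-token loss plus the lambda-weighted mean of per-depth MTP losses."""
    depth = len(mtp_losses)  # D in the formula above
    return main_loss + mtp_lambda(tokens_seen) / depth * sum(mtp_losses)
```

With D = 1, a main loss of 2.0 and an MTP loss of 3.0 combine to 2.9 early in training and 2.3 after the switch.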

Training results

DeepSeek-V3 ablations report that adding the 1-depth MTP module improves performance across reasoning, code, and math benchmarks while leaving training throughput roughly unchanged ^[2]. The acceptance rate of the second-token prediction by the MTP module ranges between 85 and 90 percent across diverse generation tasks, which is what enables the inference speedup described below.

Inference use

DeepSeek-V3 supports two inference modes. In the discard-MTP mode, the MTP module is dropped and the model behaves as a standard next-token autoregressive transformer. In the MTP speculative mode, the MTP module drafts one extra candidate token per step, and the main head verifies it; if the verified token agrees with the draft, the model effectively advances two tokens per forward pass. At an 85 to 90 percent acceptance rate, this delivers approximately a 1.8 times speedup in tokens per second ^[2]. AMD ROCm benchmarks of SGLang with MTP on Instinct GPUs reported 1.25 to 2.11 times speedup on Random workloads and 1.36 to 1.80 times speedup on ShareGPT, broadly confirming the technical report numbers ^[3].

How DeepSeek-V3 differs from the Meta paper

The two designs aim at the same goal but make different trade-offs. The table below summarizes the contrasts.

Aspect	Meta MTP (Gloeckle et al., 2024)	DeepSeek-V3 MTP (2024)
Training depth	n = 4 heads in parallel	D = 1 module in series (predicts 2nd token)
Head architecture	n independent transformer layers, parallel	Sequential transformer blocks, causally chained
Embedding sharing	Shared with main model	Shared with main model
Unembedding sharing	Shared with main model	Shared with main model
Loss aggregation	Sum over heads	Lambda-weighted average over depth
Lambda weighting	Implicit, uniform	Explicit lambda, decayed during training
Inference role	Self-speculative decoding	Speculative draft, main head verifies
Reported speedup	About 3x with 4-token models, about 6.4x with 8-byte models	About 1.8x in production
Discardable at inference	Yes	Yes
Scale of evaluation	Up to 13B	671B total, 37B activated (MoE)

Inference acceleration and speculative decoding

MTP heads are a natural drafter for speculative decoding, the most widely used inference-time acceleration for autoregressive models. In standard speculative decoding, a small draft model produces several candidate tokens, and the larger target model verifies them in a single parallel forward pass. Any prefix of the draft that matches what the target would have sampled is accepted; the first disagreement triggers a fresh sample from the target's distribution and the rest of the draft is discarded.
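
For greedy decoding, the accept-or-reject step reduces to a prefix comparison, sketched below. The function name and argument layout are assumptions for illustration; sampling-based speculative decoding uses a probabilistic acceptance rule that is not shown here.

```python
def verify_draft(draft_tokens, target_tokens):
    """Greedy verification of a draft against the target's own choices.

    draft_tokens: the k tokens proposed by the drafter.
    target_tokens: the k+1 tokens the target picks at the same positions,
        computed in one parallel pass; the last entry is the extra token
        that follows a fully accepted draft.
    Returns the tokens to append to the output.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score one position more than the draft")
    accepted = []
    for drafted, wanted in zip(draft_tokens, target_tokens):
        if drafted != wanted:
            # First disagreement: keep the target's token, drop the rest.
            accepted.append(wanted)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the target's extra token is also valid.
    accepted.append(target_tokens[-1])
    return accepted
```

Every call emits at least one token, so verification never makes less progress per target pass than ordinary decoding.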

The ceiling on speedup is set by the average accepted run length. With perfect drafting, k drafted tokens are accepted per call and the user sees up to a (k+1)-fold throughput improvement, modulo verification overhead. In practice, draft acceptance is well below perfect, and the speedup is closer to 1.5 to 3 times for modern systems.
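
A rough way to relate acceptance rate to this ceiling is to assume each drafted token is accepted independently with the same probability, which is a simplification. Under that assumption the expected number of tokens emitted per target pass is a geometric sum, as in the following sketch (the function name is illustrative).

```python
def expected_tokens_per_call(acceptance, num_drafted):
    """Expected tokens emitted per target forward pass.

    Assumes independent, identically distributed acceptance, so the result
    equals (1 - a**(k+1)) / (1 - a) for acceptance a < 1 and k drafts.
    """
    if not 0.0 <= acceptance <= 1.0:
        raise ValueError("acceptance must be a probability")
    # Token j+1 is emitted only if the first j drafts were all accepted.
    return sum(acceptance ** j for j in range(num_drafted + 1))
```

With one drafted token and an acceptance rate of 0.85 to 0.9 this gives 1.85 to 1.9 tokens per pass, before accounting for the cost of drafting and verification.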

MTP makes drafting cheap and naturally aligned. Because the drafter heads are trained jointly with the target model on the same data, they predict from the same distribution that the target was trained to match, which gives high acceptance rates. Because they share the embedding and unembedding with the main model, the drafter is essentially free in storage. And because they sit on top of the same hidden state as the main head, the verification pass is just the existing trunk forward, with the small MTP block as the only additional compute. The result is a self-contained drafter that does not require a separate small model and avoids the calibration mismatch that often hurts external drafters.

Reported speedups across recent MTP-equipped systems include:

System	Setting	Reported speedup	Source
Meta 4-token model	Self-speculative decoding, batch=1	About 3x	Gloeckle et al., 2024 ^[1]
Meta 8-byte model	Self-speculative decoding	About 6.4x	Gloeckle et al., 2024 ^[1]
DeepSeek-V3	MTP speculative, 85 to 90 percent acceptance	1.8x TPS	DeepSeek-AI, 2024 ^[2]
DeepSeek-V3 on SGLang (ROCm)	Various workloads	1.25 to 2.11x	AMD ROCm Blogs, 2025 ^[3]
Qwen3-Next on vLLM	Single-stream code generation	2 to 3x	vLLM team, 2025 ^[4]
Gemma 4 MTP drafter	Various tasks	Up to 3x	Google AI, 2026 ^[6]
FastMTP	Trained MTP drafter with shared weights	2.03x average	Yu et al., 2025 ^[7]

MTP-style drafters are now the default in several inference frameworks. The vLLM project added native MTP support for DeepSeek-V3 and Qwen3-Next in 2025. SGLang implemented MTP drafting with support for both AMD ROCm and NVIDIA CUDA GPUs. The llama.cpp project merged beta MTP support in 2026, allowing local users to see roughly 1.5 to 2 times decode speedups on compatible models without running a separate draft model ^[4].

Self-distillation and other extensions

The MTP heads need a training target. In the simplest setting, the target is the actual next-token sequence in the training data, treated as a hard label. Several extensions instead use the main model's predictive distribution as a soft target, in a form of self-distillation.

MTP via self-distillation

Work by Kirchenbauer and colleagues (2026) shows that converting a pretrained next-token model into an MTP model using online self-distillation recovers most of the benefits of training MTP from scratch, at a small fraction of the cost ^[8]. The student MTP head is trained to clone the next-token distribution of the frozen base model at the corresponding future position, rather than fitting the raw training data. Because the final use of MTP is to draft future tokens that the base model would itself generate, distilling from the base model is more aligned with the inference objective than fitting the pretraining corpus.
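
One common way to express "cloning" a teacher distribution is a KL-divergence loss between the frozen base model's distribution and the MTP head's distribution at the same future position. The sketch below assumes that choice; the source does not specify the exact divergence, and the function names are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student) for one future position.

    teacher_logits: the frozen base model's next-token logits there.
    student_logits: the MTP head's logits for the same position.
    """
    teacher = softmax(teacher_logits)
    student = softmax(student_logits)
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(teacher, student) if p > 0.0)
```

The loss is zero when the head reproduces the base model's distribution exactly and grows as the two diverge; only the student's parameters would receive gradients.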

A related family of methods, sometimes labelled MTP-D for "distilled MTP," uses the top-N logits of the main head as a teacher signal for the MTP heads, with the original next-token cross-entropy left untouched. This minimizes interference with the main head and has been reported to raise MTP head acceptance by about 7.5 percent on standard benchmarks while leaving the base model's accuracy essentially unchanged.

MTP needs registers

Gerontopoulos et al. (May 2025) proposed adding dedicated register tokens to the context whose only job is to carry information for the auxiliary heads ^[9]. The intuition is that the trunk's hidden states are already overloaded with information needed for the next-token head, and asking them to also encode several future positions degrades the main task. Register tokens act as private scratch space for MTP, letting the heads pull from them without disturbing the rest of the residual stream. The technique improves MTP gains on a range of tasks without changing the main objective.

Leap multi-token prediction

L-MTP (Liu et al., NeurIPS 2025) drops the assumption that MTP must predict adjacent future tokens ^[10]. Instead, the MTP heads predict tokens at strategically chosen non-adjacent positions, skipping over intermediate tokens to extend the effective prediction horizon at the same compute cost. The decoder is then modified to consume these non-sequential predictions for accelerated generation. Reported gains include both improved language model quality and faster decoding compared to vanilla MTP.

FastMTP

FastMTP (Yu et al., September 2025) replaces the per-depth MTP modules with a single MTP head that is reused across multiple draft positions through weight sharing ^[7]. A short finetuning stage teaches the shared head to draft k tokens autoregressively while remaining compatible with EAGLE-style speculative decoding. FastMTP also adds a language-aware dynamic vocabulary compression step that drops rare logits from the drafter to lower its compute cost. On seven diverse benchmarks the system delivers an average 2.03 times speedup over next-token decoding, outperforming the vanilla DeepSeek-V3-style MTP drafter by 82 percent.

Future-summary pretraining

A more recent line of work (Beyond Multi-Token Prediction, 2025) trains the model to predict a learned summary of the next k tokens rather than the tokens themselves. The summary is computed by a small auxiliary network and serves as a compressed teacher signal, on the theory that exactly predicting many future tokens is unreasonably hard and that compressing them first gives a better target.

MTP-D and looped extensions

Self-distillation-based MTP-D includes a looped extension that reuses the same MTP head at varying depths through iterative prediction, which reduces the number of additional parameters and extends the effective prediction horizon. Further inference speedups have been reported for this combination, with a one-head MTP improving by about 220 percent when looped ^[11].

Adoption and deployment

As of 2026, MTP has moved out of research papers into production systems.

Model / system	Year	MTP variant	Reported impact
Blockwise parallel decoding (Stern et al.)	2018	k auxiliary heads, distilled	About 3x speedup on translation
Gloeckle et al. (Meta)	2024	4 parallel heads, 8 for bytes	12 to 17 percent gains on code, 3x to 6.4x speedup
DeepSeek-V3	2024	D = 1 sequential module	1.8x decoding speedup
DeepSeek-R1	2025	Inherits MTP from V3 base	Same drafter usable for reasoning trace generation
Qwen3-Next	2025	MTP head in checkpoint	2 to 3x speedup in vLLM, above 80 percent acceptance
vLLM	2025	Native MTP drafter support	Production inference path
SGLang on AMD ROCm	2025	DeepSeek-V3 MTP	1.25 to 2.11x throughput
llama.cpp	2026	Beta MTP support	1.5 to 2x decode speed for compatible models
Gemma 4	2026	MTP drafter (Google)	Up to 3x faster inference
Set Block Decoding	2025	MTP-style block drafting	Various

Llama 3, Mistral, and the earlier Gemma releases ship without MTP heads, so users running these models on MTP-aware engines see no speedup until a future checkpoint is trained with the new objective. This asymmetry has become a quiet point in favor of newer open models like Qwen3-Next and DeepSeek-V3 in 2026 inference-throughput benchmarks.

Why MTP helps training

The Meta paper offered several hypotheses for why predicting more tokens improves the base model itself, not just inference ^[1]:

Denser supervision per token. Each hidden state z_t is used n times in the loss rather than once. The signal-per-FLOP ratio at training time is higher, which is consistent with the observed improvement in sample efficiency on code benchmarks.
Forces lookahead. Because the trunk has to encode information about future tokens, it is pushed to develop representations with longer effective horizons. The paper showed that MTP-trained models develop induction heads earlier in training and reach better algorithmic-reasoning scores on a synthetic multi-step arithmetic task.
Implicit credit assignment. Predicting tokens 2 through n at once extends teacher forcing over a longer horizon, which provides gradient signal that the standard one-step objective never supplies. This is loosely analogous to multi-step backpropagation through time (BPTT) in recurrent networks.
Better behavior on generative tasks. The gains are largest on benchmarks that require longer output generation (HumanEval, MBPP, summarization). On short discriminative benchmarks, gains are smaller. This is consistent with MTP teaching the model to generate longer coherent sequences rather than to score isolated tokens better.

Not all of these effects are equally well established. Subsequent work has questioned whether the algorithmic-reasoning improvements survive at larger scale, and the optimal value of n depends on tokenization, with larger n preferred for byte-level models and smaller n for BPE tokenizers.

When MTP does not help

MTP is not a universal improvement. The Meta paper itself shows several settings where the benefit shrinks, disappears, or turns slightly negative:

Small models (less than 1 billion parameters) sometimes see neutral or slightly negative effects on standard benchmarks. The gains scale with model size, which is unusual.
Models trained for one epoch on very small corpora can lose from MTP because the auxiliary loss adds variance to the gradient without enough data to absorb it.
Discriminative benchmarks (classification, single-token completion) show smaller gains than generative benchmarks.
The optimal n depends on the tokenizer. With BPE tokenizers around 32K to 200K vocabulary, n = 4 is a sweet spot. With byte-level tokenizers, n = 8 is better.

Most production deployments predict only one or two extra tokens (D = 1 or D = 2 in DeepSeek-V3's notation) to keep the auxiliary cost minimal while still gaining the inference benefits. The DeepSeek-V3 choice of D = 1 reflects a pragmatic balance between training cost and inference utility.

Implementation considerations

Implementing MTP in a training stack involves a few non-obvious choices.

Sharing the unembedding

The unembedding matrix (also called the output projection or LM head) is typically among the largest single tensors in a transformer model: roughly a billion parameters, or a couple of gigabytes in 16-bit precision, for a flagship LLM with a vocabulary above 100K tokens. Sharing it across the main head and the MTP heads is essential to keep memory and parameter count manageable. Both the Meta paper and DeepSeek-V3 share it.

Sequential forward and backward over heads

The key memory optimization is to compute the loss and gradient for each head sequentially rather than materializing all n logit tensors at once. This brings the peak memory back to roughly the same as next-token training ^[1].
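
The saving is easy to estimate with back-of-the-envelope arithmetic. The helper names and the example shapes below are illustrative assumptions, and the count covers logit tensors only, ignoring their gradients and all other activations.

```python
def logits_bytes(batch, seq_len, vocab, bytes_per_value=4):
    """Size of one (batch, sequence, vocabulary) logit tensor in bytes."""
    return batch * seq_len * vocab * bytes_per_value

def peak_logit_memory(num_heads, batch, seq_len, vocab, sequential):
    """Peak bytes held in logit tensors for n heads."""
    # Sequential processing keeps one tensor live; the naive path keeps all n.
    live_tensors = 1 if sequential else num_heads
    return live_tensors * logits_bytes(batch, seq_len, vocab)
```

For an assumed batch of 8 sequences of 4,096 tokens with a 32,000-entry vocabulary in 32-bit floats, one logit tensor is about 4.2 GB, so four heads held at once would need about 16.8 GB against 4.2 GB when processed sequentially.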

Lambda weighting

The weighting factor lambda controls how much the MTP objective influences the trunk. Too small and the gains vanish; too large and the main head suffers. DeepSeek-V3 used a curriculum, starting at 0.3 and decaying to 0.1 ^[2]. Other reports suggest 0.1 to 0.3 is a reasonable range for D = 1 to D = 4.

Causal chain versus parallel heads

The Meta paper uses parallel heads conditionally independent given z_t. DeepSeek-V3 chains the predictions causally, with each MTP module conditioning on the previous module's output and the next token's embedding. The causal version is more compute-intensive at depth greater than one but is reported to preserve richer dependencies between predicted tokens. For D = 1, the two approaches are functionally similar.

Drafter compatibility at inference

For MTP heads to be useful at inference, the serving stack must support speculative decoding with the model's own MTP modules as the drafter. Reference implementations exist in vLLM, SGLang, TensorRT-LLM, and llama.cpp as of 2026. The main difficulty is bookkeeping: KV-cache management, prefix-acceptance logic, and graceful fallback when the MTP draft fails repeatedly all need to be plumbed through. Engineering improvements over 2025 have made this largely transparent for end users.
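
The fallback piece of that bookkeeping can be pictured as a small gate that watches recent draft outcomes. The class below is purely hypothetical and does not correspond to the API of any of the engines named above; the window size and threshold are arbitrary illustrative defaults.

```python
from collections import deque

class DraftGate:
    """Hypothetical helper that pauses drafting when acceptance is poor."""

    def __init__(self, window=32, min_acceptance=0.5):
        self.history = deque(maxlen=window)  # 1 = accepted, 0 = rejected
        self.min_acceptance = min_acceptance

    def record(self, accepted):
        """Store the outcome of one verified draft token."""
        self.history.append(1 if accepted else 0)

    def should_draft(self):
        # Keep drafting until the window is full, then require the rolling
        # acceptance rate to stay at or above the threshold.
        if len(self.history) < self.history.maxlen:
            return True
        return sum(self.history) / len(self.history) >= self.min_acceptance
```

A real serving stack would also need some way to probe the drafter again after pausing it, since this sketch only updates its history while drafts are being issued.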

Relationship to other techniques

MTP overlaps with several other ideas in language model training and inference. The table summarizes how it compares.

Technique	Trains the base model?	Speeds up inference?	Needs separate draft model?	Notes
Next-token prediction	Yes	No	n/a	Standard objective
Blockwise parallel decoding (2018)	No (distilled aux heads)	Yes	No	First MTP-style drafter
Speculative decoding (Leviathan, Chen, 2022 to 2023)	No	Yes	Yes (small draft)	Generic inference acceleration
Medusa	No	Yes	No (extra heads on frozen base)	MTP-like heads added post-training
EAGLE	No	Yes	Yes (lightweight feature-level drafter)	State-of-the-art draft model
Multi-token prediction (Gloeckle, 2024)	Yes	Yes	No	Joint train + draft
DeepSeek-V3 MTP	Yes	Yes	No	Production deployment
Set Block Decoding	Yes	Yes	No	Block-level parallel decoding
Future summaries	Yes	No (training only)	n/a	Predicts compressed summary, not raw tokens
Self-distillation MTP	Yes (from frozen base)	Yes	No	Converts next-token model to MTP

Medusa and EAGLE are particularly close in spirit to MTP at inference. Medusa adds extra heads to a frozen base model and trains them on top, while MTP heads are trained jointly with the base. EAGLE uses a tiny separate drafter that reads the base model's hidden states, which often achieves higher acceptance than Medusa but adds external state. MTP is roughly competitive with Medusa on speedup while also helping the base model's quality, which neither Medusa nor EAGLE does.

Open questions and current research

As of 2026, several questions about MTP remain open:

Optimal n. The Meta paper found n = 4 was best for BPE and n = 8 for byte models, but newer work suggests that with self-distillation or register tokens the curve shifts. There is no consensus default for production training.
Training cost trade-off. Large auxiliary head counts pay off in inference speed but make pretraining slightly more expensive. The frontier appears to be small n at training (D = 1 to 4) with shared-weight inference-time expansion via methods like FastMTP.
Quality versus speed trade-off. MTP heads improve the base model at the same step count, but if compute is held fixed, the increase in per-step cost has to be paid back via faster convergence. DeepSeek-V3 reports that this trade-off is favorable at scale, but the picture is less clear for smaller models.
Interaction with reasoning training. Models trained for long chain-of-thought reasoning often produce many tokens per query, so MTP-driven speedups are particularly valuable. Whether MTP also improves reasoning quality is less clear, although DeepSeek-R1 inherits MTP from V3 and benefits from the same drafter at inference.
Generalization to non-text modalities. MTP for diffusion models, discrete diffusion language models, and multimodal inputs is still early-stage. Some 2025 papers show that parallel-decoding style techniques transfer naturally to diffusion LLMs, suggesting that future work may unify these objectives.
Long-context behavior. Whether MTP helps or hurts long-context understanding is an active area. The auxiliary heads can in principle force the trunk to track longer-range dependencies, but they also burn capacity that might otherwise serve long-context attention.
Compatibility with reinforcement learning. Post-training with RLHF or GRPO-style methods on MTP-trained base models has been done in practice (Qwen3, DeepSeek-R1), but the details of how MTP interacts with policy-gradient updates are not yet well documented.

References

[1] Gloeckle, F., Youbi Idrissi, B., Roziere, B., Lopez-Paz, D., and Synnaeve, G. (2024). "Better and Faster Large Language Models via Multi-token Prediction." Proceedings of ICML 2024, PMLR 235:15706-15734. https://arxiv.org/abs/2404.19737
[2] DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." https://arxiv.org/abs/2412.19437
[3] AMD ROCm Blogs (2025). "Efficient LLM Serving with MTP: DeepSeek V3 and SGLang on AMD Instinct GPUs." https://rocm.blogs.amd.com/software-tools-optimization/mtp/README.html
[4] llama.cpp Project (2026). "Multi-Token Prediction (MTP) Beta Support." GitHub Discussions and PR. https://github.com/ggml-org/llama.cpp/discussions/11455
[5] Stern, M., Shazeer, N., and Uszkoreit, J. (2018). "Blockwise Parallel Decoding for Deep Autoregressive Models." Proceedings of NeurIPS 2018. https://proceedings.neurips.cc/paper/2018/hash/c4127b9194fe8562c64dc0f5bf2c93bc-Abstract.html
[6] Google AI (2026). "Accelerating Gemma 4: Faster Inference with Multi-Token Prediction Drafters." https://blog.google/innovation-and-ai/technology/developers-tools/multi-token-prediction-gemma-4/
[7] Yu, Y. et al. (2025). "FastMTP: Accelerating LLM Inference with Enhanced Multi-Token Prediction." arXiv:2509.18362. https://arxiv.org/abs/2509.18362
[8] Kirchenbauer, J. et al. (2026). "Multi-Token Prediction via Self-Distillation." arXiv:2602.06019. https://arxiv.org/abs/2602.06019
[9] Gerontopoulos, A. et al. (2025). "Multi-Token Prediction Needs Registers." arXiv:2505.10518. https://arxiv.org/abs/2505.10518
[10] Liu, X. et al. (2025). "L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models." Proceedings of NeurIPS 2025. https://arxiv.org/abs/2505.17505
[11] "Evolving LLMs from Next-Token Prediction to Multi-Token Prediction via Self-Distillation." MDPI Electronics, Vol. 15, Issue 7 (2026). https://www.mdpi.com/2079-9292/15/7/1533
[12] Leviathan, Y., Kalman, M., and Matias, Y. (2023). "Fast Inference from Transformers via Speculative Decoding." Proceedings of ICML 2023. https://arxiv.org/abs/2211.17192
[13] Chen, C., Borgeaud, S., Irving, G., et al. (2023). "Accelerating Large Language Model Decoding with Speculative Sampling." https://arxiv.org/abs/2302.01318
