Mixture of Depths (MoD) is a technique for dynamically allocating computation to individual tokens within transformer-based language models. Introduced by David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro at Google DeepMind in April 2024, MoD challenges the conventional design of transformers, which apply the same amount of computation uniformly to every token in a sequence. Instead, MoD uses a learned router at each layer to decide which tokens should receive full processing through self-attention and feed-forward network (FFN) blocks and which tokens should skip the layer entirely via a residual connection. The result is a model that can match or exceed the performance of a standard transformer while using significantly fewer floating-point operations (FLOPs) per forward pass.
Standard transformer models apply an identical sequence of operations to every token at every layer. Whether a token is a common function word like "the" or a rare technical term carrying heavy semantic weight, the model spends exactly the same computational resources on both. This uniform allocation is wasteful because not all tokens contribute equally to the final prediction. Some tokens may be easy to predict after only a few layers of processing, while others require the full depth of the network.
Prior work on adaptive computation had explored the idea that different inputs deserve different amounts of processing. Alex Graves introduced Adaptive Computation Time (ACT) in 2016, allowing recurrent neural networks to learn how many computational steps to take for each input. The Universal Transformer (Dehghani et al., 2019) extended this idea to transformer architectures by applying a per-position halting mechanism. However, these approaches used dynamic computation graphs with variable tensor sizes, making them difficult to implement efficiently on modern hardware such as GPUs and TPUs.
Mixture of Depths addresses this problem differently. Rather than varying the number of steps with dynamic graph structures, it enforces a fixed compute budget by capping the number of tokens that participate in each layer's computation. This keeps tensor sizes static and predictable, which is critical for efficient hardware utilization, while still allowing the model to decide on a per-token basis where to spend its compute.
At the heart of MoD is a lightweight routing mechanism. For each transformer layer designated as a "routing layer," a linear projection produces a scalar weight for every token in the sequence. Formally, the router weight for the i-th token at layer l is computed as:
r_i^l = w_θ^T x_i^l
where w_θ is a learnable weight vector and x_i^l is the token's representation at layer l. This scalar weight expresses the router's assessment of how much that token would benefit from full computation at that layer.
Once router weights are computed for all tokens in the sequence, a top-k operation selects the k tokens with the highest weights. Only these selected tokens pass through the full self-attention and MLP computations. The remaining tokens bypass the layer entirely through a residual connection, meaning their representations are carried forward unchanged and at zero additional computational cost.
The number k is determined by a capacity ratio C, defined as k = C * S, where S is the sequence length. For example, with a sequence length of 2,048 and a capacity ratio of 12.5%, only 256 tokens would be processed at that layer, while 1,792 tokens would skip it.
A key advantage of this design is that the computation graph remains static. Because k is set before training begins, tensor sizes at every point in the network are known in advance. This stands in contrast to earlier adaptive computation methods that required dynamic computation graphs and variable-sized tensors, which impose significant overhead on modern parallel hardware. The identities of the k selected tokens are fluid and context-dependent, but the total amount of computation is entirely predictable.
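The routing scheme described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function name, the toy `block_fn`, and the capacity default are illustrative choices.

```python
import numpy as np

def mod_layer(x, w_theta, block_fn, capacity_ratio=0.125):
    """One Mixture-of-Depths routing layer (expert-choice top-k), as a sketch.

    x:        (S, D) token representations at this layer
    w_theta:  (D,)   learnable router weight vector
    block_fn: stand-in for the layer's attention + MLP computation,
              here any (k, D) -> (k, D) map
    """
    S, D = x.shape
    k = int(capacity_ratio * S)        # fixed before training: static tensor shapes

    r = x @ w_theta                    # (S,) scalar router weight per token
    top_k = np.argsort(r)[-k:]         # indices of the k highest-weight tokens

    out = x.copy()                     # skipped tokens: pure residual pass-through
    # Selected tokens: the block output is scaled by the router weight, which
    # is what puts w_theta on the gradient path during training.
    out[top_k] = x[top_k] + r[top_k, None] * block_fn(x[top_k])
    return out

# Toy usage: 16 tokens, 25% capacity -> exactly 4 tokens go through the block.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w = rng.standard_normal(8)
y = mod_layer(x, w, block_fn=np.tanh, capacity_ratio=0.25)
changed = int(np.sum(np.any(y != x, axis=1)))
print(changed)  # 4: the routed tokens; the other 12 rows are unchanged
```

Note that the shapes of every tensor are fixed once the capacity ratio is chosen, which is the property that makes the computation graph static.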
The paper investigates two routing strategies borrowed from the Mixture of Experts (MoE) literature: token-choice routing and expert-choice routing.
In token-choice routing, each token independently decides whether to participate in the layer's computation or to skip it. The router produces a probability distribution for each token, and the token selects its preferred path. This approach suffers from load-balancing problems: many tokens may choose the same path, leading to uneven utilization. Auxiliary balancing losses are typically required to mitigate this issue.
In expert-choice routing, the computational path itself selects which tokens to process. The layer picks the top-k tokens with the highest router weights, guaranteeing a perfectly balanced distribution of tokens across computational paths. This eliminates the need for auxiliary balancing losses and ensures that the most critical tokens (those with the highest router weights) are always selected.
The paper finds that expert-choice routing is the better strategy for MoD. It offers natural load balancing, no need for additional loss terms during training, and the ability to let relative router weights determine which tokens benefit most from computation.
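The practical difference between the two strategies can be illustrated with hypothetical router logits (a sketch; a real router scores learned projections of token representations):

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(16)          # hypothetical router logits for 16 tokens
k = 4                                # capacity of the compute path

# Token-choice: each token independently takes the compute path when its
# routing probability exceeds 0.5. How many do so varies with the input,
# which is why auxiliary losses are needed to keep the load balanced.
token_choice = np.flatnonzero(1 / (1 + np.exp(-r)) > 0.5)

# Expert-choice: the compute path picks its own top-k tokens, so exactly k
# are processed every time and no balancing loss is required.
expert_choice = np.argsort(r)[-k:]

print(len(token_choice), len(expert_choice))
```

The first count depends entirely on the logits; the second is k by construction.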
Mixture of Depths sits within a broader family of techniques that aim to vary the amount of computation per input. The following table summarizes the key approaches and how they differ.
| Approach | Mechanism | Granularity | Computation Graph | Year |
|---|---|---|---|---|
| Adaptive Computation Time (ACT) | Learned halting probability per step | Per-token | Dynamic | 2016 |
| Universal Transformer | Recurrent transformer with per-position halting | Per-token | Dynamic | 2019 |
| Early Exit (DeeBERT, PABEE) | Tokens exit at intermediate layers based on confidence | Per-sample or per-token | Dynamic | 2020 |
| Token Dropping | Drop unimportant tokens during training | Per-token | Semi-static | 2022 |
| Mixture of Depths | Router selects top-k tokens per layer; others skip via residual | Per-token, per-layer | Static | 2024 |
Early exit approaches, such as DeeBERT (Xin et al., 2020) and PABEE (Zhou et al., 2020), allow tokens or entire inputs to stop processing at intermediate layers when a confidence threshold is met. A key limitation is that once a token exits, it cannot be updated by self-attention interactions with tokens that continue processing through deeper layers. In MoD, a token can skip a middle layer but still participate in later layers, maintaining the ability to interact with fully processed tokens at subsequent depths.
Token dropping methods, such as the approach by Hou et al. (2022) for BERT pretraining, drop unimportant tokens at intermediate layers to save computation. Dropped tokens are typically recovered at the final layer so the model still outputs full-length sequences. MoD differs in that it learns which tokens to process or skip at each individual layer through an explicit routing mechanism, and tokens that skip a layer still pass through via the residual connection rather than being removed from the computation entirely.
ACT (Graves, 2016) allows models to learn how many computational steps to take per input by introducing a halting probability and a ponder cost. While ACT operates at the level of repeated recurrent steps, MoD operates at the level of transformer layers, selectively engaging or bypassing standard transformer blocks. MoD also avoids the dynamic computation graph that ACT requires.
Mixture of Depths is frequently compared with Mixture of Experts (MoE) because both use routing mechanisms, but they address different dimensions of computational efficiency.
| Feature | Standard Transformer | Mixture of Experts (MoE) | Mixture of Depths (MoD) |
|---|---|---|---|
| Compute per token | Uniform across all layers | Uniform total FLOPs; routed to different experts | Variable; some tokens skip layers entirely |
| What the router decides | N/A | Which expert(s) process each token | Whether a token is processed or skipped |
| Type of sparsity | None (dense) | Width sparsity (multiple experts, only some activated) | Depth sparsity (layers selectively skipped) |
| Total FLOPs per forward pass | Fixed | Fixed (same as dense for active parameters) | Reduced (tokens skip computation) |
| Parameter count vs. active parameters | Equal | Parameters > active parameters per token | Parameters approximately equal to active parameters |
| Load balancing | N/A | Requires auxiliary losses or expert-choice routing | Naturally balanced via top-k selection |
| Hardware efficiency | High (static graph) | Moderate (routing overhead, communication) | High (static graph, reduced FLOPs) |
| Inference speed benefit | Baseline | Moderate (fewer active parameters per token) | Significant (fewer FLOPs per forward pass) |
In a standard MoE transformer, each token is routed to one or more expert sub-networks (typically specialized FFN modules), but every token still goes through some expert. The total FLOPs per token remain roughly constant; the benefit comes from having more total parameters (capacity) without proportionally increasing compute. In MoD, there is a single set of standard transformer computations, and the routing decision is binary: a token either goes through the full block or skips it entirely. This means MoD directly reduces the total FLOPs per forward pass.
The paper also explores combining MoD with MoE into a unified framework called Mixture of Depths and Experts (MoDE). Two integration strategies are investigated:
In the staged approach, the MoD routing decision is made first. Tokens that are selected for processing are then routed to different experts using standard MoE routing. Tokens that are not selected skip the entire block. This is a two-stage process: first decide whether to compute, then decide which expert to use.
In the integrated approach, the MoD and MoE routing decisions are made simultaneously through a single routing operation. The set of available "experts" is expanded to include a "no-op" expert that performs no computation (equivalent to a residual skip). A single router assigns each token to either a real expert or the no-op expert, unifying both decisions.
Experiments showed that the integrated approach outperformed simply reducing the capacity of a standard MoE model. The performance improvements from MoD and MoE compound rather than overlap, suggesting that depth sparsity and width sparsity address complementary aspects of computational waste.
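The integrated variant can be sketched as a single routing decision over the real experts plus a no-op path. The hard argmax routing below is a simplification for illustration, and all names are mine rather than the paper's:

```python
import numpy as np

def mode_integrated_route(x, W_router, experts):
    """Integrated MoDE routing (sketch): one router over E real experts plus
    a no-op expert (index 0) that leaves the residual stream unchanged.

    x:        (S, D) token representations
    W_router: (D, E+1) router projection; column 0 scores the no-op path
    experts:  list of E callables, each mapping (D,) -> (D,)
    """
    logits = x @ W_router                  # (S, E+1)
    choice = logits.argmax(axis=1)         # one routing decision per token
    out = x.copy()                         # default: no-op expert (skip)
    for i, e in enumerate(choice):
        if e > 0:                          # real expert: compute, add residual
            out[i] = x[i] + experts[e - 1](x[i])
    return out, choice

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 4))
W = rng.standard_normal((4, 3))            # 2 real experts + 1 no-op path
experts = [np.tanh, lambda h: np.maximum(h, 0.0)]
out, choice = mode_integrated_route(x, W, experts)
```

A single argmax (or top-k) over the expanded expert set decides both whether to compute and which expert to use, which is what "integrated" means here.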
MoD models are trained end-to-end using the standard language modeling objective (next-token prediction). Before a processed token's block output is added back to the residual stream, it is multiplied by that token's router weight; this places the router parameters directly on the gradient path, so they are learned through standard backpropagation as part of the language modeling loss even though the top-k selection itself is a hard, non-differentiable operation. No separate routing loss is needed during training when using expert-choice routing.
The experiments used models ranging from 60 million to 3 billion parameters. All models were trained with a batch size of 128, a sequence length of 2,048, and cosine learning rate schedules. Routing layers were interleaved with standard (non-routing) transformer layers. The best-performing configuration applied routing at every other layer with a capacity ratio of 12.5%, meaning that at routing layers, only 12.5% of the sequence's tokens were processed by self-attention and MLPs, while 87.5% bypassed the block.
The top-k routing mechanism used during training is non-causal: it requires examining all tokens in the sequence to determine which ones fall in the top-k. This poses a challenge for autoregressive generation, where tokens are produced one at a time and future tokens are not yet available.
To address this, the paper proposes two solutions:
Auxiliary binary classifier: A binary cross-entropy loss is added to train the router to predict, for each token independently, whether it would be in the top-k. This introduces a small performance degradation of roughly 0.2 to 0.3%.
Auxiliary MLP predictor: A small auxiliary MLP (with a stop-gradient on its inputs) is trained alongside the main model to predict whether each token will be routed to computation or will skip. This predictor achieves over 97% accuracy early in training and reaches approximately 99% accuracy by the end, with negligible impact on step speed.
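The two-part setup, non-causal top-k targets during training plus a causal per-token predictor for sampling, might be sketched as follows. A single logistic unit stands in for the paper's small auxiliary MLP, and all names are illustrative:

```python
import numpy as np

def topk_labels(X, w_theta, k):
    """Non-causal training targets: is each token in the sequence's top-k
    under the MoD router weights? Requires seeing the whole sequence."""
    r = X @ w_theta
    labels = np.zeros(len(X), dtype=bool)
    labels[np.argsort(r)[-k:]] = True
    return labels

def causal_route(x, w_aux, b_aux):
    """Per-token routing decision for autoregressive sampling (sketch).
    The auxiliary predictor is trained on stop-gradient inputs to match the
    labels above, then queried at inference on each new token's
    representation alone, so no future tokens are needed."""
    p = 1 / (1 + np.exp(-(x @ w_aux + b_aux)))   # predicted P(token in top-k)
    return p > 0.5

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 8))
w = rng.standard_normal(8)
labels = topk_labels(X, w, k=4)                  # training-time targets
decisions = causal_route(X, w, b_aux=0.0)        # illustrative: reuse w
```

During training the predictor's loss is kept off the main gradient path (the stop-gradient), so it observes the router without influencing it.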
The paper conducts isoFLOP comparisons, which hold the total training FLOPs constant and vary model size and configuration to find the best-performing model. Experiments were run at total training budgets of 6 x 10^18, 2 x 10^19, and 1 x 10^20 FLOPs, with model sizes ranging from 60 million to 3 billion parameters.
Key findings include:
A 220-million-parameter MoD model was found to outperform the isoFLOP-optimal baseline (also 220 million parameters) while being over 60% faster per step during post-training autoregressive sampling. More broadly, MoD models were found to step upwards of 50% faster during sampling than equivalent vanilla transformers.
This speed advantage comes from two sources. First, tokens that skip layers require zero FLOPs at those layers, directly reducing per-step computation. Second, because isoFLOP-optimal MoD models can be larger (more parameters) for the same training cost, they can match baseline performance with a model that has more capacity but lower per-step FLOPs.
The paper observes that the router learns meaningful patterns about which tokens need more computation. Tokens that are routed through computation more frequently tend to correlate with output predictions that have higher entropy, meaning the model finds those tokens harder to predict. Conversely, tokens that frequently skip layers tend to be associated with more confident predictions. This suggests the router learns to allocate compute where it matters most.
Speculative decoding is a separate technique for accelerating autoregressive inference. In speculative decoding, a smaller "draft" model generates candidate tokens quickly, and a larger "target" model verifies them in parallel, accepting or rejecting each candidate. This exploits the observation that many tokens in a sequence are predictable and do not require the full capacity of the large model.
MoD and speculative decoding share a philosophical motivation: both recognize that not all tokens require the same amount of computation. However, they operate at different levels. Speculative decoding works at the system level, coordinating two separate models during inference. MoD works at the architectural level, building adaptive computation directly into a single model's forward pass.
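A toy greedy variant of speculative decoding illustrates the draft-and-verify loop. Real implementations accept or reject by comparing probability distributions rather than exact token matches, and run the target's checks in a single parallel pass:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding, as a toy sketch.

    draft_next / target_next: functions mapping a token sequence to the next
    token, standing in for a small draft model and a large target model.
    The draft proposes k tokens; the target keeps the agreed prefix plus its
    own token at the first disagreement.
    """
    proposal, ctx = [], list(prefix)
    for _ in range(k):                 # cheap: k sequential draft-model calls
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposal:                 # in the real method these target-model
        target_t = target_next(ctx)    # evaluations happen in parallel
        if target_t == t:
            accepted.append(t)         # draft guessed right: keep the token
            ctx.append(t)
        else:
            accepted.append(target_t)  # disagreement: take the target's token
            break                      # and end the round
    return accepted

# Toy "models": the draft echoes the last token, the target counts upward.
draft = lambda seq: seq[-1]
target = lambda seq: seq[-1] + 1
out = speculative_step([0], draft, target, k=4)
print(out)  # [1]: the target rejects the draft's first guess
```

When draft and target agree, a whole block of k tokens is accepted for roughly the cost of one target step, which is where the speed-up comes from.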
The two approaches are potentially complementary. MoD's observation that tokens engaging with more layers correlate with higher-entropy predictions could inform speculative decoding strategies, helping draft models identify which tokens are likely to be accepted by the target model without verification. Early exit methods, which are related to MoD, have already been explored in combination with speculative decoding for faster drafting.
Several works have built upon the Mixture of Depths framework since its introduction.
Despite its promising results, Mixture of Depths has several limitations.
Mixture of Depths introduces a simple yet effective method for making transformer computation conditional at the token level. By using a learned router and a top-k selection mechanism, MoD allows models to spend their computational budget where it matters most, skipping unnecessary computation for tokens that can be handled by a residual connection alone. The approach maintains static computation graphs for hardware efficiency, achieves significant FLOP reductions per forward pass, and can be combined with Mixture of Experts for further gains. It represents an important step toward more efficient transformer architectures that allocate computation adaptively rather than uniformly.