Mixture of Depths (MoD) is a technique for dynamically allocating computation to individual tokens within transformer-based language models. Introduced by David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro at Google DeepMind in April 2024, MoD challenges the conventional design of transformers, which apply the same amount of computation uniformly to every token in a sequence. Instead, MoD uses a learned router at each layer to decide which tokens should receive full processing through self-attention and feed-forward network (FFN) blocks and which tokens should skip the layer entirely via a residual connection. The result is a model that can match or exceed the performance of a standard transformer while using significantly fewer floating-point operations (FLOPs) per forward pass.
Standard transformer models apply an identical sequence of operations to every token at every layer. Whether a token is a common function word like "the" or a rare technical term carrying heavy semantic weight, the model spends exactly the same computational resources on both. This uniform allocation is wasteful because not all tokens contribute equally to the final prediction. Some tokens may be easy to predict after only a few layers of processing, while others require the full depth of the network.
Prior work on adaptive computation had explored the idea that different inputs deserve different amounts of processing. Alex Graves introduced Adaptive Computation Time (ACT) in 2016, allowing recurrent neural networks to learn how many computational steps to take for each input. The Universal Transformer (Dehghani et al., 2019) extended this idea to transformer architectures by applying a per-position halting mechanism. However, these approaches used dynamic computation graphs with variable tensor sizes, making them difficult to implement efficiently on modern hardware such as GPUs and TPUs.
Mixture of Depths addresses this problem differently. Rather than varying the number of steps with dynamic graph structures, it enforces a fixed compute budget by capping the number of tokens that participate in each layer's computation. This keeps tensor sizes static and predictable, which is critical for efficient hardware utilization, while still allowing the model to decide on a per-token basis where to spend its compute.
At the heart of MoD is a lightweight routing mechanism. For each transformer layer designated as a "routing layer," a linear projection produces a scalar weight for every token in the sequence. Formally, the router weight for the i-th token at layer l is computed as:
r_i^l = w_θ^T x_i^l
where w_θ is a learnable weight vector and x_i^l is the token's representation at layer l. This scalar weight expresses the router's assessment of how much that token would benefit from full computation at that layer.
Once router weights are computed for all tokens in the sequence, a top-k operation selects the k tokens with the highest weights. Only these selected tokens pass through the full self-attention and MLP computations. The remaining tokens bypass the layer entirely through a residual connection, meaning their representations are carried forward unchanged and at zero additional computational cost.
The number k is determined by a capacity ratio C, defined as k = C * S, where S is the sequence length. For example, with a sequence length of 2,048 and a capacity ratio of 12.5%, only 256 tokens would be processed at that layer, while 1,792 tokens would skip it.
A key advantage of this design is that the computation graph remains static. Because k is set before training begins, tensor sizes at every point in the network are known in advance. This stands in contrast to earlier adaptive computation methods that required dynamic computation graphs and variable-sized tensors, which impose significant overhead on modern parallel hardware. The identities of the k selected tokens are fluid and context-dependent, but the total amount of computation is entirely predictable.
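The routing scheme described above can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function name, the toy `block_fn`, and the capacity default are illustrative choices.

```python
import numpy as np

def mod_layer(x, w_theta, block_fn, capacity_ratio=0.125):
    """One Mixture-of-Depths routing layer (expert-choice top-k), as a sketch.

    x:        (S, D) token representations at this layer
    w_theta:  (D,)   learnable router weight vector
    block_fn: stand-in for the layer's attention + MLP computation,
              here any (k, D) -> (k, D) map
    """
    S, D = x.shape
    k = int(capacity_ratio * S)        # fixed before training: static tensor shapes

    r = x @ w_theta                    # (S,) scalar router weight per token
    top_k = np.argsort(r)[-k:]         # indices of the k highest-weight tokens

    out = x.copy()                     # skipped tokens: pure residual pass-through
    # Selected tokens: the block output is scaled by the router weight, which
    # is what puts w_theta on the gradient path during training.
    out[top_k] = x[top_k] + r[top_k, None] * block_fn(x[top_k])
    return out

# Toy usage: 16 tokens, 25% capacity -> exactly 4 tokens go through the block.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w = rng.standard_normal(8)
y = mod_layer(x, w, block_fn=np.tanh, capacity_ratio=0.25)
changed = int(np.sum(np.any(y != x, axis=1)))
print(changed)  # 4: the routed tokens; the other 12 rows are unchanged
```

Note that the shapes of every tensor are fixed once the capacity ratio is chosen, which is the property that makes the computation graph static.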
The paper investigates two routing strategies borrowed from the Mixture of Experts (MoE) literature: token-choice routing and expert-choice routing.
In token-choice routing, each token independently decides whether to participate in the layer's computation or to skip it. The router produces a probability distribution for each token, and the token selects its preferred path. This approach suffers from load-balancing problems: many tokens may choose the same path, leading to uneven utilization. Auxiliary balancing losses are typically required to mitigate this issue.
In expert-choice routing, the computational path itself selects which tokens to process. The layer picks the top-k tokens with the highest router weights, guaranteeing a perfectly balanced distribution of tokens across computational paths. This eliminates the need for auxiliary balancing losses and ensures that the most critical tokens (those with the highest router weights) are always selected.
The paper finds that expert-choice routing is the better strategy for MoD. It offers natural load balancing, no need for additional loss terms during training, and the ability to let relative router weights determine which tokens benefit most from computation.
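The practical difference between the two strategies can be illustrated with hypothetical router logits (a sketch; a real router scores learned projections of token representations):

```python
import numpy as np

rng = np.random.default_rng(1)
r = rng.standard_normal(16)          # hypothetical router logits for 16 tokens
k = 4                                # capacity of the compute path

# Token-choice: each token independently takes the compute path when its
# routing probability exceeds 0.5. How many do so varies with the input,
# which is why auxiliary losses are needed to keep the load balanced.
token_choice = np.flatnonzero(1 / (1 + np.exp(-r)) > 0.5)

# Expert-choice: the compute path picks its own top-k tokens, so exactly k
# are processed every time and no balancing loss is required.
expert_choice = np.argsort(r)[-k:]

print(len(token_choice), len(expert_choice))
```

The first count depends entirely on the logits; the second is k by construction.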
Mixture of Depths sits within a broader family of techniques that aim to vary the amount of computation per input. The following table summarizes the key approaches and how they differ.
| Approach | Mechanism | Granularity | Computation Graph | Year |
|---|---|---|---|---|
| Adaptive Computation Time (ACT) | Learned halting probability per step | Per-token | Dynamic | 2016 |
| Universal Transformer | Recurrent transformer with per-position halting | Per-token | Dynamic | 2019 |
| Early Exit (DeeBERT, PABEE) | Tokens exit at intermediate layers based on confidence | Per-sample or per-token | Dynamic | 2020 |
| Token Dropping | Drop unimportant tokens during training | Per-token | Semi-static | 2022 |
| Mixture of Depths | Router selects top-k tokens per layer; others skip via residual | Per-token, per-layer | Static | 2024 |
Early exit approaches, such as DeeBERT (Xin et al., 2020) and PABEE (Zhou et al., 2020), allow tokens or entire inputs to stop processing at intermediate layers when a confidence threshold is met. A key limitation is that once a token exits, it cannot be updated by self-attention interactions with tokens that continue processing through deeper layers. In MoD, a token can skip a middle layer but still participate in later layers, maintaining the ability to interact with fully processed tokens at subsequent depths.
Token dropping methods, such as the approach by Hou et al. (2022) for BERT pretraining, drop unimportant tokens at intermediate layers to save computation. Dropped tokens are typically recovered at the final layer so the model still outputs full-length sequences. MoD differs in that it learns which tokens to process or skip at each individual layer through an explicit routing mechanism, and tokens that skip a layer still pass through via the residual connection rather than being removed from the computation entirely.
ACT (Graves, 2016) allows models to learn how many computational steps to take per input by introducing a halting probability and a ponder cost. While ACT operates at the level of repeated recurrent steps, MoD operates at the level of transformer layers, selectively engaging or bypassing standard transformer blocks. MoD also avoids the dynamic computation graph that ACT requires.
Mixture of Depths is frequently compared with Mixture of Experts (MoE) because both use routing mechanisms, but they address different dimensions of computational efficiency.
| Feature | Standard Transformer | Mixture of Experts (MoE) | Mixture of Depths (MoD) |
|---|---|---|---|
| Compute per token | Uniform across all layers | Uniform total FLOPs; routed to different experts | Variable; some tokens skip layers entirely |
| What the router decides | N/A | Which expert(s) process each token | Whether a token is processed or skipped |
| Type of sparsity | None (dense) | Width sparsity (multiple experts, only some activated) | Depth sparsity (layers selectively skipped) |
| Total FLOPs per forward pass | Fixed | Fixed (same as dense for active parameters) | Reduced (tokens skip computation) |
| Parameter count vs. active parameters | Equal | Parameters > active parameters per token | Parameters approximately equal to active parameters |
| Load balancing | N/A | Requires auxiliary losses or expert-choice routing | Naturally balanced via top-k selection |
| Hardware efficiency | High (static graph) | Moderate (routing overhead, communication) | High (static graph, reduced FLOPs) |
| Inference speed benefit | Baseline | Moderate (fewer active parameters per token) | Significant (fewer FLOPs per forward pass) |
In a standard MoE transformer, each token is routed to one or more expert sub-networks (typically specialized FFN modules), but every token still goes through some expert. The total FLOPs per token remain roughly constant; the benefit comes from having more total parameters (capacity) without proportionally increasing compute. In MoD, there is a single set of standard transformer computations, and the routing decision is binary: a token either goes through the full block or skips it entirely. This means MoD directly reduces the total FLOPs per forward pass.
The paper also explores combining MoD with MoE into a unified framework called Mixture of Depths and Experts (MoDE). Two integration strategies are investigated:
In the staged approach, the MoD routing decision is made first. Tokens that are selected for processing are then routed to different experts using standard MoE routing. Tokens that are not selected skip the entire block. This is a two-stage process: first decide whether to compute, then decide which expert to use.
In the integrated approach, the MoD and MoE routing decisions are made simultaneously through a single routing operation. The set of available "experts" is expanded to include a "no-op" expert that performs no computation (equivalent to a residual skip). A single router assigns each token to either a real expert or the no-op expert, unifying both decisions.
Experiments showed that the integrated approach outperformed simply reducing the capacity of a standard MoE model. The performance improvements from MoD and MoE compound rather than overlap, suggesting that depth sparsity and width sparsity address complementary aspects of computational waste.
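The integrated variant can be sketched as a single routing decision over the real experts plus a no-op path. The hard argmax routing below is a simplification for illustration, and all names are mine rather than the paper's:

```python
import numpy as np

def mode_integrated_route(x, W_router, experts):
    """Integrated MoDE routing (sketch): one router over E real experts plus
    a no-op expert (index 0) that leaves the residual stream unchanged.

    x:        (S, D) token representations
    W_router: (D, E+1) router projection; column 0 scores the no-op path
    experts:  list of E callables, each mapping (D,) -> (D,)
    """
    logits = x @ W_router                  # (S, E+1)
    choice = logits.argmax(axis=1)         # one routing decision per token
    out = x.copy()                         # default: no-op expert (skip)
    for i, e in enumerate(choice):
        if e > 0:                          # real expert: compute, add residual
            out[i] = x[i] + experts[e - 1](x[i])
    return out, choice

rng = np.random.default_rng(2)
x = rng.standard_normal((8, 4))
W = rng.standard_normal((4, 3))            # 2 real experts + 1 no-op path
experts = [np.tanh, lambda h: np.maximum(h, 0.0)]
out, choice = mode_integrated_route(x, W, experts)
```

A single argmax (or top-k) over the expanded expert set decides both whether to compute and which expert to use, which is what "integrated" means here.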
MoD models are trained end-to-end using the standard language modeling objective (next-token prediction). Before a processed token's block output is added back to the residual stream, it is multiplied by that token's router weight; this places the router parameters directly on the gradient path, so they are learned through standard backpropagation as part of the language modeling loss even though the top-k selection itself is a hard, non-differentiable operation. No separate routing loss is needed during training when using expert-choice routing.
The experiments used models ranging from 60 million to 3 billion parameters. All models were trained with a batch size of 128, a sequence length of 2,048, and cosine learning rate schedules. Routing layers were interleaved with standard (non-routing) transformer layers. The best-performing configuration applied routing at every other layer with a capacity ratio of 12.5%, meaning that at routing layers, only 12.5% of the sequence's tokens were processed by self-attention and MLPs, while 87.5% bypassed the block.
The top-k routing mechanism used during training is non-causal: it requires examining all tokens in the sequence to determine which ones fall in the top-k. This poses a challenge for autoregressive generation, where tokens are produced one at a time and future tokens are not yet available.
To address this, the paper proposes two solutions:
Auxiliary binary classifier: A binary cross-entropy loss is added to train the router to predict, for each token independently, whether it would be in the top-k. This introduces a small performance degradation of roughly 0.2 to 0.3%.
Auxiliary MLP predictor: A small auxiliary MLP (with a stop-gradient on its inputs) is trained alongside the main model to predict whether each token will be routed to computation or will skip. This predictor achieves over 97% accuracy early in training and reaches approximately 99% accuracy by the end, with negligible impact on step speed.
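The two-part setup, non-causal top-k targets during training plus a causal per-token predictor for sampling, might be sketched as follows. A single logistic unit stands in for the paper's small auxiliary MLP, and all names are illustrative:

```python
import numpy as np

def topk_labels(X, w_theta, k):
    """Non-causal training targets: is each token in the sequence's top-k
    under the MoD router weights? Requires seeing the whole sequence."""
    r = X @ w_theta
    labels = np.zeros(len(X), dtype=bool)
    labels[np.argsort(r)[-k:]] = True
    return labels

def causal_route(x, w_aux, b_aux):
    """Per-token routing decision for autoregressive sampling (sketch).
    The auxiliary predictor is trained on stop-gradient inputs to match the
    labels above, then queried at inference on each new token's
    representation alone, so no future tokens are needed."""
    p = 1 / (1 + np.exp(-(x @ w_aux + b_aux)))   # predicted P(token in top-k)
    return p > 0.5

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 8))
w = rng.standard_normal(8)
labels = topk_labels(X, w, k=4)                  # training-time targets
decisions = causal_route(X, w, b_aux=0.0)        # illustrative: reuse w
```

During training the predictor's loss is kept off the main gradient path (the stop-gradient), so it observes the router without influencing it.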
The paper conducts isoFLOP comparisons, which hold the total training FLOPs constant and vary model size and configuration to find the best-performing model. Experiments were run at total training budgets of 6 x 10^18, 2 x 10^19, and 1 x 10^20 FLOPs, with model sizes ranging from 60 million to 3 billion parameters.
Key findings include:
A 220-million-parameter MoD model was found to outperform the isoFLOP-optimal baseline (also 220 million parameters) while being over 60% faster per step during post-training autoregressive sampling. More broadly, MoD models were found to step upwards of 50% faster during sampling than equivalent vanilla transformers.
This speed advantage comes from two sources. First, tokens that skip layers require zero FLOPs at those layers, directly reducing per-step computation. Second, because isoFLOP-optimal MoD models can be larger (more parameters) for the same training cost, they can match baseline performance with a model that has more capacity but lower per-step FLOPs.
The paper observes that the router learns meaningful patterns about which tokens need more computation. Tokens that are routed through computation more frequently tend to correlate with output predictions that have higher entropy, meaning the model finds those tokens harder to predict. Conversely, tokens that frequently skip layers tend to be associated with more confident predictions. This suggests the router learns to allocate compute where it matters most.
Speculative decoding is a separate technique for accelerating autoregressive inference. In speculative decoding, a smaller "draft" model generates candidate tokens quickly, and a larger "target" model verifies them in parallel, accepting or rejecting each candidate. This exploits the observation that many tokens in a sequence are predictable and do not require the full capacity of the large model.
MoD and speculative decoding share a philosophical motivation: both recognize that not all tokens require the same amount of computation. However, they operate at different levels. Speculative decoding works at the system level, coordinating two separate models during inference. MoD works at the architectural level, building adaptive computation directly into a single model's forward pass.
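A toy greedy variant of speculative decoding illustrates the draft-and-verify loop. Real implementations accept or reject by comparing probability distributions rather than exact token matches, and run the target's checks in a single parallel pass:

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding, as a toy sketch.

    draft_next / target_next: functions mapping a token sequence to the next
    token, standing in for a small draft model and a large target model.
    The draft proposes k tokens; the target keeps the agreed prefix plus its
    own token at the first disagreement.
    """
    proposal, ctx = [], list(prefix)
    for _ in range(k):                 # cheap: k sequential draft-model calls
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    accepted, ctx = [], list(prefix)
    for t in proposal:                 # in the real method these target-model
        target_t = target_next(ctx)    # evaluations happen in parallel
        if target_t == t:
            accepted.append(t)         # draft guessed right: keep the token
            ctx.append(t)
        else:
            accepted.append(target_t)  # disagreement: take the target's token
            break                      # and end the round
    return accepted

# Toy "models": the draft echoes the last token, the target counts upward.
draft = lambda seq: seq[-1]
target = lambda seq: seq[-1] + 1
out = speculative_step([0], draft, target, k=4)
print(out)  # [1]: the target rejects the draft's first guess
```

When draft and target agree, a whole block of k tokens is accepted for roughly the cost of one target step, which is where the speed-up comes from.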
The two approaches are potentially complementary. MoD's observation that tokens engaging with more layers correlate with higher-entropy predictions could inform speculative decoding strategies, helping draft models identify which tokens are likely to be accepted by the target model without verification. Early exit methods, which are related to MoD, have already been explored in combination with speculative decoding for faster drafting.
Several works have built upon the Mixture of Depths framework since its introduction.
Despite its promising results, Mixture of Depths has several limitations.
Mixture of Depths introduces a simple yet effective method for making transformer computation conditional at the token level. By using a learned router and a top-k selection mechanism, MoD allows models to spend their computational budget where it matters most, skipping unnecessary computation for tokens that can be handled by a residual connection alone. The approach maintains static computation graphs for hardware efficiency, achieves significant FLOP reductions per forward pass, and can be combined with Mixture of Experts for further gains. It represents an important step toward more efficient transformer architectures that allocate computation adaptively rather than uniformly.