Mixture-of-Recursions (MoR)
Last reviewed
Jun 8, 2026
Sources
8 citations
Review status
Source-backed
Revision
v1 · 1,823 words
Mixture-of-Recursions (MoR) is a Transformer architecture, introduced in 2025, that unifies two previously separate strategies for building efficient language models: parameter sharing through recursion, and adaptive per-token computation. Rather than stacking many distinct layers, an MoR model reuses a single shared block of layers several times in sequence, and a small learned router decides, for each token, how many times that token should pass through the shared block. Tokens the model judges to be simple can exit after one or two passes, while tokens that need more processing recurse deeper. A model with a small set of unique parameters can therefore emulate a much deeper network, and it can concentrate its compute on the tokens that benefit most. [1]
The method was presented in "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation" by Sangmin Bae, Yujin Kim, Reza Bayat and colleagues, posted to arXiv on July 14, 2025 and accepted at the NeurIPS 2025 conference. [1][2] The authors are affiliated with KAIST AI, Mila and the Université de Montréal, Google DeepMind, and Google Cloud. Across models ranging from 135M to 1.7B parameters trained from scratch, MoR matched or improved on standard Transformers at equal training compute while using roughly half as many unique parameters, and it delivered up to about 2x higher inference throughput. [1]
MoR sits at the intersection of two research lines: recursive (weight-tied) Transformers and adaptive computation.
A standard Transformer gives every layer its own weights. Recursive, or looped, Transformers instead tie weights so that one block is applied repeatedly. The Universal Transformer (Dehghani et al., 2018) showed that repeatedly applying a single shared block can match the representational power of a deep stack of distinct layers, and it paired this with a halting mechanism to vary the number of steps per position. [4] Weight tying cuts the parameter count and memory footprint, but naive tying typically costs some accuracy relative to an untied model of equal depth.
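As a rough illustration of weight tying, the sketch below loops one shared block several times. The names `unique_layer_count` and `apply_looped` are hypothetical, and the "layers" are plain Python functions standing in for Transformer layers, so this shows only the control flow, not a real model.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def unique_layer_count(effective_depth: int, num_recursions: int) -> int:
    """Number of distinct layers needed to reach `effective_depth` applied
    layers when one shared block is looped `num_recursions` times."""
    if effective_depth % num_recursions != 0:
        raise ValueError("effective_depth must be a multiple of num_recursions")
    return effective_depth // num_recursions


def apply_looped(x: float, block: List[Layer], num_recursions: int) -> float:
    """Run the same block of layers `num_recursions` times in sequence."""
    for _ in range(num_recursions):
        for layer in block:
            x = layer(x)
    return x


# Toy stand-ins for layers: two functions reused three times give six
# applied layers from only two sets of "weights".
toy_block: List[Layer] = [lambda v: v + 1.0, lambda v: v * 2.0]
looped_output = apply_looped(1.0, toy_block, num_recursions=3)
```

Here a 12-layer effective depth with three recursions would need only `unique_layer_count(12, 3)` distinct layers, which is the source of the parameter savings described above.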
A direct predecessor of MoR, "Relaxed Recursive Transformers" (Bae et al., October 2024), narrowed that gap by adding small layer-wise LoRA adapters, which let each iteration of the shared block behave slightly differently while keeping the model compact. That paper also introduced continuous depth-wise batching, an inference scheme that MoR later reuses. [3] Several of its authors, including lead author Sangmin Bae, went on to co-author MoR.
The second idea is that not every token needs the same amount of work. Adaptive computation time (Graves, 2016) added a learned halting unit to recurrent networks so they could take a variable number of steps per input. [5] Mixture of depths (Raposo et al., 2024) applied per-token routing inside a Transformer: a router selects which tokens a given layer processes and which bypass it, capping how many tokens take the expensive path. [6] Early-exit methods such as Confident Adaptive Language Modeling (Schuster, Fisch et al., 2022) let the model stop computing for a token once an intermediate prediction is confident. [7]
Applied to autoregressive decoding, these adaptive methods share a practical difficulty. If a token exits early and never computes its key and value vectors at the deeper layers, later tokens that attend back to it find those entries missing from the key-value cache, which forces either recomputation or approximation. One of MoR's contributions is to combine parameter sharing and adaptive depth in a single model trained end to end, and to address this missing-KV problem with caching schemes designed for recursion. [1]
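The missing-KV difficulty can be made concrete with a small bookkeeping sketch. It assumes a simplified picture in which each token computes keys and values at every depth up to its exit depth and nowhere deeper; the function name `missing_kv_positions` and the example depths are hypothetical.

```python
from typing import Dict, List


def missing_kv_positions(exit_depths: List[int], query_pos: int) -> Dict[int, List[int]]:
    """For the token at `query_pos`, list the earlier positions that have no
    key-value entry at each depth the query itself reaches.

    exit_depths[t] is the last depth (1-based) at which token t computed
    keys and values before exiting.
    """
    gaps: Dict[int, List[int]] = {}
    for depth in range(1, exit_depths[query_pos] + 1):
        absent = [t for t in range(query_pos) if exit_depths[t] < depth]
        if absent:
            gaps[depth] = absent
    return gaps


# Token 0 exits after depth 1, token 1 runs to depth 3, token 2 to depth 2.
example_gaps = missing_kv_positions([1, 3, 2], query_pos=1)
```

In this example token 1 finds no entry for token 0 at depths 2 and 3, which is exactly the gap that has to be recomputed, approximated, or designed away.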
An MoR model has three coupled components: a shared recursion block, a router that assigns recursion depth, and a recursion-aware key-value cache.
The unique layers of the model are collected into one block that is applied up to Nr times, where Nr is the maximum recursion depth (a small integer such as 2, 3 or 4). The authors compare several ways to map a model's layers onto this shared block. In the Cycle strategy the shared layers are applied in repeating cyclic order across recursions; in the Sequence strategy each layer is repeated consecutively before moving on. Two further variants, Middle-Cycle and Middle-Sequence, keep the first and last layers as unique full-capacity layers and share only the middle layers. The authors report that Middle-Cycle, which preserves distinct input and output layers while tying the middle, is the most effective and gives the lowest validation loss across model sizes. [1]
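The four sharing strategies can be read as different maps from an applied layer's position to the index of the unique parameter set it uses. The sketch below is one plausible reading of the description above, not the authors' code; the function name `layer_to_param_index` and the strategy strings are hypothetical.

```python
from typing import List


def layer_to_param_index(num_layers: int, num_recursions: int, strategy: str) -> List[int]:
    """For each of `num_layers` applied layers, return the index of the
    unique parameter set it reuses under the given sharing strategy."""

    def cycle(n: int, r: int) -> List[int]:
        block = n // r                      # size of the shared block
        return [i % block for i in range(n)]

    def sequence(n: int, r: int) -> List[int]:
        return [i // r for i in range(n)]   # each layer repeated r times in a row

    if strategy in ("cycle", "sequence"):
        if num_layers % num_recursions != 0:
            raise ValueError("num_layers must be a multiple of num_recursions")
        fn = cycle if strategy == "cycle" else sequence
        return fn(num_layers, num_recursions)

    if strategy in ("middle-cycle", "middle-sequence"):
        middle = num_layers - 2             # first and last layers stay unique
        if middle <= 0 or middle % num_recursions != 0:
            raise ValueError("num_layers - 2 must be a positive multiple of num_recursions")
        fn = cycle if strategy == "middle-cycle" else sequence
        inner = fn(middle, num_recursions)
        last_index = middle // num_recursions + 1
        return [0] + [i + 1 for i in inner] + [last_index]

    raise ValueError(f"unknown strategy: {strategy}")


cycle_map = layer_to_param_index(6, 3, "cycle")
sequence_map = layer_to_param_index(6, 3, "sequence")
middle_cycle_map = layer_to_param_index(8, 3, "middle-cycle")
```

With eight applied layers and three recursions, the Middle-Cycle map uses four unique parameter sets: one for the first layer, two shared cyclically across the six middle positions, and one for the last layer.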
At the core of MoR is a lightweight router, typically a small linear projection over the token's hidden state, that determines how many times each token traverses the shared block. Because the block is reused, the recursion index plays the role that layer depth plays in a normal network, so the router is effectively choosing a per-token depth between 1 and Nr. The router is trained jointly with the rest of the model from scratch, so the model learns how to allocate compute during pretraining rather than having an exit rule bolted on afterward. [1]
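A minimal sketch of such a router follows, written with the standard library only. The sigmoid squashing, the weight values, and the name `router_scores` are assumptions made for illustration; the text above specifies only a small linear projection over the hidden state.

```python
import math
from typing import List, Sequence


def router_scores(hidden_states: Sequence[Sequence[float]],
                  router_weights: Sequence[float]) -> List[float]:
    """Score each token with a linear projection followed by a sigmoid.

    The sigmoid is an assumption of this sketch; any monotonic squashing
    function would serve the same illustrative purpose.
    """
    scores = []
    for hidden in hidden_states:
        logit = sum(h * w for h, w in zip(hidden, router_weights))
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Three toy tokens with 2-dimensional hidden states.
toy_scores = router_scores([[0.0, 0.0], [2.0, 1.0], [-2.0, -1.0]], [1.0, 1.0])
```

A higher score would mark a token as a stronger candidate for another pass through the shared block; how scores turn into depths depends on the routing scheme.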
MoR offers two key-value caching strategies that trade memory against compute. In recursion-wise caching, only the tokens that are still active at a given recursion step compute and store key-value pairs at that depth, and attention at that step is restricted to those locally cached tokens. This shrinks the cache and reduces attention computation, and because a token writes its KV entries only while it is active, the entries other tokens attend to stay consistent, which avoids the missing-KV problem. In recursive KV sharing, all tokens pass through the first recursion, key-value pairs are computed and cached only at that first step, and every deeper recursion reuses them. Sharing guarantees that every position has a cached entry and lowers prefill cost and memory, at the price of less depth-specific keys and values. [1]
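To compare the two strategies, the sketch below counts cache entries per shared-block layer for a given assignment of recursion depths, and lists which keys a query can see under recursion-wise caching. It is bookkeeping only, under the simplifying assumption that one entry is stored per token per recursion step; `kv_cache_entries` and `visible_keys` are hypothetical names.

```python
from typing import Dict, List


def kv_cache_entries(depths: List[int], max_depth: int) -> Dict[str, int]:
    """Count cached key-value entries under three policies.

    depths[t] is the number of recursion steps token t takes (1..max_depth).
    """
    active_per_step = [sum(1 for d in depths if d >= step)
                       for step in range(1, max_depth + 1)]
    return {
        # Every token cached at every step (no adaptive depth).
        "all_depths": len(depths) * max_depth,
        # Only tokens still active at a step write entries at that step.
        "recursion_wise": sum(active_per_step),
        # Entries written once at the first step and reused afterwards.
        "kv_sharing": len(depths),
    }


def visible_keys(depths: List[int], query_pos: int, step: int) -> List[int]:
    """Positions a causal query can attend to at `step` under recursion-wise
    caching: earlier-or-equal positions that are still active at that step."""
    return [t for t in range(query_pos + 1) if depths[t] >= step]


toy_depths = [2, 1, 3, 1]
toy_counts = kv_cache_entries(toy_depths, max_depth=3)
toy_visible = visible_keys(toy_depths, query_pos=2, step=2)
```

For these toy depths the counts are 12, 7, and 4 entries respectively, which matches the ordering described above: sharing is the smallest cache, recursion-wise caching sits in between.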
MoR borrows two routing schemes from the mixture-of-experts literature, here applied along the recursion axis rather than across parallel expert networks.
| Variant | How it routes | Strengths | Weaknesses |
|---|---|---|---|
| Expert-choice | Each recursion depth acts as an "expert" that selects the top-k highest-scoring tokens to continue; lower-scoring tokens exit | Fixed, predictable compute budget and near-perfect load balance | Top-k selection during training compares scores across positions, including later ones, which breaks causality and leaks information; an auxiliary router or loss is needed to compensate |
| Token-choice | Each token receives a single up-front assignment to a full recursion depth (1 to Nr) by argmax over the router scores | No information leakage; one decision per token | Load can become imbalanced across depths, usually requiring a balancing loss |
Expert-choice routing fixes the amount of compute per step and is straightforward to batch, but because choosing the top-k tokens at a step depends on the other tokens present at that step, it can leak future information during training and needs care to remain causal. Token-choice routing avoids that leakage by committing each token to its full recursion path at the outset, but it can overload some depths and underuse others, so it benefits from load-balancing regularization. [1]
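The two schemes can be contrasted in a few lines. This is a simplified sketch under stated assumptions: router scores are supplied as plain lists, every token takes the first pass, and the per-step capacities in `keep` are hypothetical values rather than settings from the paper. The function names are also illustrative.

```python
from typing import List


def expert_choice_depths(scores_per_step: List[List[float]], keep: List[int]) -> List[int]:
    """Per-token depth when each step keeps its top-k active tokens.

    scores_per_step[r][t] is token t's router score after pass r + 1, and
    keep[r] is how many active tokens continue to the next pass.
    """
    num_tokens = len(scores_per_step[0])
    depths = [1] * num_tokens                 # assumption: all tokens take pass 1
    active = list(range(num_tokens))
    for step_scores, k in zip(scores_per_step, keep):
        ranked = sorted(active, key=lambda t: step_scores[t], reverse=True)
        active = sorted(ranked[:k])           # ranking looks at the whole sequence
        for t in active:
            depths[t] += 1
    return depths


def token_choice_depths(depth_scores: List[List[float]]) -> List[int]:
    """Per-token depth chosen once, by argmax over that token's own scores.

    depth_scores[t][d] is the score token t gives to recursion depth d + 1.
    """
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in depth_scores]


ec_depths = expert_choice_depths(
    [[0.9, 0.1, 0.8, 0.3], [0.2, 0.0, 0.7, 0.0]], keep=[2, 1])
tc_depths = token_choice_depths(
    [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.2, 0.6]])
```

The expert-choice function makes the leakage visible: whether a token continues depends on the scores of every other active token, including later ones. The token-choice function uses only each token's own scores, but nothing in it stops most tokens from picking the same depth.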
The authors pretrain MoR models alongside vanilla and recursive baselines from scratch at scales from 135M to 1.7B parameters and compare them under matched training compute. MoR establishes a new compute-versus-accuracy Pareto frontier: at equal training FLOPs and with fewer unique parameters, it reaches lower validation perplexity and higher few-shot accuracy than a standard Transformer. In one comparison at a matched compute budget, an MoR model with two recursions and expert-choice routing used about 167M unique parameters and reached a lower validation negative log-likelihood (about 2.75) and 43.1% average few-shot accuracy, against roughly 315M parameters and 42.3% accuracy for the vanilla baseline. That is comparable or better quality with close to half the unique parameters. [1]
Because tokens that exit early skip the remaining passes through the shared block, MoR is also cheaper to train and run. For a two-recursion configuration trained on the same number of tokens, the authors report roughly 25% fewer training FLOPs, about 19% less wall-clock training time, and about 25% lower peak memory than the vanilla model. At inference, the recursion structure enables continuous depth-wise batching, which keeps the hardware busy by grouping together tokens that sit at different recursion depths instead of waiting for the slowest token in a batch. This yields up to a 2.06x throughput speedup at the 360M scale for a four-recursion model, with the gain growing as more tokens exit early. [1][8]
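The scheduling idea behind continuous depth-wise batching can be illustrated with a toy step counter. It treats each work item as independent and ignores the sequential dependencies of real autoregressive decoding, so it is only a sketch of why refilling freed slots helps; `static_steps` and `continuous_steps` are hypothetical names and the depths are made up.

```python
from collections import deque
from typing import List


def static_steps(depths: List[int], batch_size: int) -> int:
    """Block invocations when each batch runs until its deepest item finishes."""
    return sum(max(depths[i:i + batch_size])
               for i in range(0, len(depths), batch_size))


def continuous_steps(depths: List[int], batch_size: int) -> int:
    """Block invocations when a freed slot is refilled immediately.

    Refilling is possible because every pass, at any depth, runs the same
    shared block, so items at different depths can share one batch.
    """
    waiting = deque(depths)
    in_flight: List[int] = []          # remaining passes for items in the batch
    steps = 0
    while waiting or in_flight:
        while waiting and len(in_flight) < batch_size:
            in_flight.append(waiting.popleft())
        steps += 1                     # one invocation of the shared block
        in_flight = [d - 1 for d in in_flight if d > 1]
    return steps


toy_workload = [3, 1, 1, 1, 2, 1]
static_total = static_steps(toy_workload, batch_size=2)
continuous_total = continuous_steps(toy_workload, batch_size=2)
```

On this toy workload the static schedule needs six invocations and the continuous one needs five, and the gap widens as more items finish after a single pass.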
MoR can be read as transplanting mixture-of-experts routing from the width dimension to the depth dimension. A mixture-of-experts layer routes each token to one or a few of several parallel expert sub-networks, adding parameters to raise capacity at a fixed per-token FLOP count. MoR instead routes each token across repeated applications of one shared block, so it adds depth and computation while holding the parameter count fixed; its expert-choice and token-choice routers are the same mechanisms used in MoE work, reinterpreted with recursion steps as the experts. [1]
Relative to the Universal Transformer and other recursive or looped models, MoR keeps the weight-tied shared block but replaces a uniform or halting-based step count with a learned per-token router, and it adds the recursion-aware KV caching needed for efficient decoding. Relative to mixture of depths, which routes tokens to skip or process layers within a fixed, untied stack, MoR routes over recursions of a shared block, which lets a parameter-light model reach an effective depth beyond its physical layer count. Relative to early-exit and adaptive-depth methods, MoR learns its routing during pretraining rather than fitting an exit rule after the fact, and its caching design removes the missing-KV obstacle that complicates early exit in autoregressive models. The MoR author list includes researchers behind both the Relaxed Recursive Transformer parameter-sharing work and the CALM early-exit work, and the paper positions MoR as a synthesis of those threads. [1][3][7]