Latent reasoning via recurrent depth (Huginn)
Last reviewed: Jun 8, 2026
Sources: 5 citations
Review status: Source-backed
Revision: v1 · 1,900 words
Latent reasoning via recurrent depth is an approach to scaling a language model's test-time computation by iterating a recurrent transformer block in latent (hidden) space, rather than by generating additional chain-of-thought tokens. It was introduced in the February 2025 paper "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" by Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein, a collaboration of the ELLIS Institute and the Max Planck Institute for Intelligent Systems in Tübingen, the University of Maryland, College Park, and Lawrence Livermore National Laboratory [1][2]. The accompanying proof-of-concept model, a 3.5-billion-parameter depth-recurrent transformer trained on roughly 800 billion tokens, is nicknamed Huginn-0125, after Huginn, one of the god Odin's two ravens in Norse mythology, whose name means "thought" [1][3].
The central claim is that a model can "think" more by performing more internal recurrence on a fixed input, increasing its effective computational depth at inference time without externalizing intermediate steps as words. The authors argue this is a distinct axis of test-time compute scaling from the token-based reasoning used by models such as OpenAI's o1 and DeepSeek-R1 [1].
Most reasoning models scale test-time compute by producing longer outputs: the model emits a chain of intermediate tokens (a "scratchpad" or chain of thought), and accuracy on hard problems tends to rise with the length of that visible reasoning trace. Latent reasoning via recurrent depth instead increases the number of times a shared block of transformer layers is applied to the hidden state for each token. Because each additional iteration is a full pass through the recurrent block, spending more compute corresponds to unrolling the network to a greater effective depth, all of which happens internally before any token is emitted [1].
Geiping et al. report three properties they present as advantages over the token-based approach [1]:
The amount of recurrence is adjustable at test time, so a single trained model can trade accuracy for compute on a per-query basis, and the authors show it can even adapt the amount of computation per token.
The dominant paradigm for improving reasoning, popularized by chain-of-thought prompting and reinforced by reasoning-tuned models, scales the amount of computation by scaling the number of generated tokens. Producing more intermediate tokens lets the model commit partial results to its context and condition later steps on them, but it ties the "thinking budget" to output length, requires the reasoning to be expressible in language, and grows the context and KV cache that must be stored and attended to [1].
Latent test-time scaling decouples the depth of computation from the number of emitted tokens. Conceptually, it returns to the idea of recurrent and weight-tied networks, in which the same parameters are applied repeatedly. The recurrent-depth work positions itself alongside earlier efforts on adaptive or unbounded computation, including the Universal Transformer of Dehghani et al. (2019), which applied a shared transformer layer recurrently with the explicit aim of Turing-complete, universal computation, and Graves's Adaptive Computation Time, which let a network learn how many steps to ponder before halting [1]. Huginn's novelty is to apply this recurrent-depth idea to a modern decoder-only language model trained at scale and to use the recurrence specifically as a knob for test-time compute.
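The model is organized into three functional components, applied in sequence [1]: a prelude (P) that embeds the input into latent space, a recurrent core (R) whose layers are iterated to perform the latent computation, and a coda (C) that decodes the final latent state into token logits.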
This (2, 4, 2) configuration yields about 3.5B parameters in total, allocated roughly as 1.5B in the non-recurrent prelude and coda, 1.5B in the recurrent core, and 0.5B in embeddings [3]. Because only the 4-layer core is repeated, running it 32 times unrolls the network to an effective depth of 2 + 4 × 32 + 2 = 132 layers, far deeper than the 8 physical layers [3].
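The effective-depth arithmetic can be written as a one-line function. This is a minimal sketch: the function name `effective_depth` is hypothetical, and the default layer counts are simply the (2, 4, 2) configuration described above.

```python
def effective_depth(
    num_iterations: int,
    prelude_layers: int = 2,
    core_layers: int = 4,
    coda_layers: int = 2,
) -> int:
    """Number of layers traversed when the core is applied num_iterations times."""
    if num_iterations < 0:
        raise ValueError("num_iterations must be non-negative")
    # The prelude and coda run once; only the core is repeated.
    return prelude_layers + core_layers * num_iterations + coda_layers


for r in (1, 4, 32, 64):
    print(f"r={r:>2}  effective depth={effective_depth(r)}")
```

Depth grows linearly in the iteration count while the parameter count stays fixed, which is what makes the iteration count usable as a test-time knob.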
A key training choice makes the recurrence count flexible at inference time. During training, the number of core iterations is sampled randomly per step from a log-normal Poisson distribution with a mean of about 32 iterations and a heavy tail, so the model usually sees a moderate number of iterations but is occasionally trained at much higher counts [1][3]. To keep memory bounded, gradients are backpropagated only through the final 8 iterations (truncated backpropagation through the recurrence), independent of how many forward iterations were taken [1]. Training on a random, variable number of steps is what allows the model to generalize to more iterations at test time than it typically saw during training [1].
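The sampling scheme and the truncation rule can be sketched with the standard library alone. This is an illustrative sketch, not the released training code: the spread parameter `sigma = 0.5`, the `+ 1` shift that guarantees at least one iteration, and all function names are assumptions made here. The Poisson sampler uses Knuth's multiplication method, which is adequate for the moderate rates involved.

```python
import math
import random
from typing import Optional, Tuple


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw from Poisson(lam) with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_num_iterations(
    mean_r: float = 32.0,
    sigma: float = 0.5,  # assumed spread; controls how heavy the tail is
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a core iteration count from a log-normal Poisson distribution."""
    rng = rng or random.Random()
    # Centre the log-normal so the Poisson rate has mean (mean_r - 1);
    # the final + 1 then brings the overall mean to mean_r.
    mu = math.log(mean_r - 1.0) - 0.5 * sigma ** 2
    rate = math.exp(rng.gauss(mu, sigma))
    return sample_poisson(rate, rng) + 1


def split_for_truncated_backprop(
    num_iterations: int, backprop_window: int = 8
) -> Tuple[int, int]:
    """Return (iterations run without gradients, iterations run with gradients)."""
    with_grad = min(num_iterations, backprop_window)
    return num_iterations - with_grad, with_grad


rng = random.Random(0)
for _ in range(5):
    r = sample_num_iterations(rng=rng)
    print(r, split_for_truncated_backprop(r))
```

However many forward iterations are drawn, at most `backprop_window` of them carry gradients, so the memory used by backpropagation does not grow with the sampled count.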
| Component | Huginn configuration | Role |
|---|---|---|
| Prelude (P) | 2 transformer layers | Embed input into latent space |
| Core (R) | 4 transformer layers, iterated | Recurrent latent computation (the depth knob) |
| Coda (C) | 2 transformer layers | Decode latent state to token logits |
| Total | ~3.5B parameters | (2, 4, 2) configuration |
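The data flow through the three components in the table can be sketched as follows. This is a toy illustration under stated assumptions, not the released implementation: the random initial state and the re-injection of the embedded input at every core step are assumptions about the design, and the `toy_*` functions are hypothetical stand-ins for real transformer layers.

```python
import random
from typing import Callable, List

Vector = List[float]


def recurrent_depth_forward(
    x: Vector,
    prelude: Callable[[Vector], Vector],
    core: Callable[[Vector, Vector], Vector],
    coda: Callable[[Vector], Vector],
    num_iterations: int,
    rng: random.Random,
) -> Vector:
    """Apply prelude once, the shared core num_iterations times, then the coda."""
    e = prelude(x)                        # embed the input once
    s = [rng.gauss(0.0, 1.0) for _ in e]  # assumed: random initial latent state
    for _ in range(num_iterations):
        s = core(e, s)                    # same weights on every iteration
    return coda(s)


# Toy stand-ins: the core is a contraction that pulls the state toward e.
def toy_prelude(x: Vector) -> Vector:
    return [2.0 * v for v in x]


def toy_core(e: Vector, s: Vector) -> Vector:
    return [0.5 * si + 0.5 * ei for si, ei in zip(s, e)]


def toy_coda(s: Vector) -> Vector:
    return list(s)


out = recurrent_depth_forward(
    [1.0, -1.0], toy_prelude, toy_core, toy_coda, 32, random.Random(0)
)
print(out)
```

Changing `num_iterations` changes the amount of computation without touching any parameters; in this toy the state simply converges toward the embedded input as the count grows.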
Huginn-0125 was trained on roughly 795 to 800 billion tokens, weighted toward code and mathematical data, using a 65,536-token vocabulary [1][3]. The authors report that the pretraining run used 4,096 AMD MI250X GPUs on the Frontier supercomputer at Oak Ridge National Laboratory [1][3]. The model and its training and inference code were released publicly on Hugging Face and GitHub [2][4].
The headline empirical finding is that accuracy on reasoning benchmarks rises as the number of test-time recurrences increases, without any additional training. On GSM8K with chain-of-thought prompting, the paper reports accuracy climbing with more iterations: at 32 recurrences Huginn scores between the high 30s and low 40s in percent (about 38 percent flexible-match accuracy, for example, and higher with a tuned system prompt), and with weight averaging and 64 recurrences it reaches roughly 47 percent flexible match [1]. On other tasks the model saturates at lower iteration counts, behaving as if easier problems require less "thinking" [1]. The authors summarize the scaling behavior by noting that, on reasoning-heavy tasks, the recurrent model's inference-time compute can rise to a load equivalent to that of a roughly 50-billion-parameter non-recurrent model, while the network itself still has only 3.5B parameters [1][2].
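One back-of-the-envelope way to read the 50-billion-parameter figure is to count the core's parameters once per iteration. The sketch below uses the approximate parameter split given in the architecture description; the function name is hypothetical, and treating "compute load" as unrolled parameter count is an assumption of this illustration, not necessarily the authors' exact accounting.

```python
def unrolled_parameter_equivalent(
    num_iterations: int,
    non_recurrent_b: float = 1.5,  # prelude and coda, in billions
    core_b: float = 1.5,           # recurrent core, in billions
    embedding_b: float = 0.5,      # embeddings, in billions
) -> float:
    """Parameters (in billions) of a non-recurrent model with the same unrolled depth."""
    # Only the core is applied repeatedly, so only its parameters are multiplied.
    return non_recurrent_b + core_b * num_iterations + embedding_b


for r in (1, 8, 32):
    print(f"r={r:>2}  ~{unrolled_parameter_equivalent(r):.1f}B parameter-equivalents")
```

At 32 iterations this gives 1.5 + 1.5 × 32 + 0.5 = 50, consistent with the order of magnitude quoted above.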
A controlled comparison underscores that the gains come from the recurrence itself: at an early checkpoint (180 billion tokens), the recurrent model substantially outperformed a fixed-depth, non-recurrent baseline on GSM8K chain-of-thought, a gap the authors describe as on the order of a fivefold improvement [1].
The paper also documents several emergent latent behaviors, such as latent orbits, observed by inspecting the trajectories of the hidden state across iterations [1].
Because the recurrence produces a sequence of increasingly refined latent states, the authors note that the model supports zero-shot, training-free inference features such as early exit on easy tokens via a KL-divergence threshold, a smaller or shared KV cache by reusing latent states, and self-speculative decoding using low-recurrence drafts [1].
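Early exit on a KL-divergence threshold can be sketched as a loop that stops once successive output distributions stop changing. This is a hedged illustration: the direction of the KL term, the threshold value `1e-4`, and the toy `step` function are all assumptions made here, and the released inference code may define the criterion differently.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def kl_divergence(p: Vector, q: Vector) -> float:
    """KL(p || q) for discrete distributions, guarding against zeros in q."""
    return sum(
        pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0.0
    )


def iterate_with_early_exit(
    state: Vector,
    step: Callable[[Vector], Vector],
    readout: Callable[[Vector], Vector],
    max_iterations: int = 64,
    kl_threshold: float = 1e-4,  # hypothetical value
) -> Tuple[Vector, int]:
    """Iterate step until the decoded distribution changes by less than the threshold."""
    prev = softmax(readout(state))
    for used in range(1, max_iterations + 1):
        state = step(state)
        cur = softmax(readout(state))
        if kl_divergence(cur, prev) < kl_threshold:
            return state, used  # the prediction has stabilised; stop early
        prev = cur
    return state, max_iterations


# Toy dynamics: each step moves the state halfway toward a fixed point.
target = [2.0, 0.0, -1.0]
final_state, used = iterate_with_early_exit(
    [0.0, 0.0, 0.0],
    step=lambda s: [0.5 * a + 0.5 * t for a, t in zip(s, target)],
    readout=lambda s: s,
)
print(used, final_state)
```

Tokens whose latent state settles quickly exit after a few iterations, while harder tokens use more of the budget, which is how per-token adaptive compute arises without retraining.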
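Latent reasoning via recurrent depth sits at the intersection of several lines of work, including weight-tied recurrent architectures such as the Universal Transformer, adaptive computation in the spirit of Adaptive Computation Time, and test-time compute scaling [1].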
The work is significant primarily because it demonstrates, at a non-trivial scale, a test-time compute axis that is independent of output length. Instead of "reasoning by writing more," a depth-recurrent model can "reason by iterating more," spending additional FLOPs internally and only then producing an answer. It concentrates compute in repeated passes over a small, weight-shared block, which can improve the ratio of useful computation to memory traffic [1].
The released Huginn-0125 model also serves as an open testbed for studying how reasoning is represented in continuous latent space, including phenomena like latent orbits and per-token adaptive depth that have no direct analogue in token-based chains of thought. While the absolute benchmark scores of a 3.5B-parameter model trained on roughly 800B tokens are modest compared with frontier reasoning systems, the result is presented as evidence that latent, recurrent test-time scaling is a viable and underexplored direction, complementary to the dominant chain-of-thought paradigm [1][2].