Latent reasoning via recurrent depth (Huginn)

Updated Jun 8, 2026

Latent reasoning via recurrent depth is an approach to scaling a language model's test-time computation by iterating a recurrent transformer block in latent (hidden) space, rather than by generating additional chain-of-thought tokens. It was introduced in the February 2025 paper "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach" by Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein, a collaboration between the ELLIS Institute and the Max Planck Institute for Intelligent Systems in Tübingen, the University of Maryland, College Park, and Lawrence Livermore National Laboratory ^[1]^[2]. The accompanying proof-of-concept model, a 3.5-billion-parameter depth-recurrent transformer trained on roughly 800 billion tokens, is named Huginn-0125 after Huginn, one of the god Odin's two ravens in Norse mythology, whose name means "thought" ^[1]^[3].

The central claim is that a model can "think" more by performing more internal recurrence on a fixed input, increasing its effective computational depth at inference time without externalizing intermediate steps as words. The authors argue this is a distinct axis of test-time compute scaling from the token-based reasoning used by models such as OpenAI's o1 and DeepSeek-R1 ^[1].

Overview

Most reasoning models scale test-time compute by producing longer outputs: the model emits a chain of intermediate tokens (a "scratchpad" or chain of thought), and accuracy on hard problems tends to rise with the length of that visible reasoning trace. Latent reasoning via recurrent depth instead increases the number of times a shared block of transformer layers is applied to the hidden state for each token. Because each additional iteration is a full pass through the recurrent block, spending more compute corresponds to unrolling the network to a greater effective depth, all of which happens internally before any token is emitted ^[1].

Geiping et al. report three properties they present as advantages over the token-based approach ^[1]:

It requires no specialized reasoning or chain-of-thought training data; the model is trained with ordinary next-token prediction.
It can operate within small context windows, because the "reasoning" is not written out as tokens and therefore does not consume context length or grow the key-value (KV) cache.
It can, in principle, capture reasoning that is not easily verbalized, since the computation is carried out in a continuous latent space rather than in discrete words.

The amount of recurrence is adjustable at test time, so a single trained model can trade accuracy for compute on a per-query basis, and the authors show it can even adapt the amount of computation per token.

Background: token versus latent test-time scaling

The dominant paradigm for improving reasoning, popularized by chain-of-thought prompting and reinforced by reasoning-tuned models, scales the amount of computation by scaling the number of generated tokens. Producing more intermediate tokens lets the model commit partial results to its context and condition later steps on them, but it ties the "thinking budget" to output length, requires the reasoning to be expressible in language, and grows the context and KV cache that must be stored and attended to ^[1].

Latent test-time scaling decouples the depth of computation from the number of emitted tokens. Conceptually, it returns to the idea of recurrent and weight-tied networks, in which the same parameters are applied repeatedly. The recurrent-depth work positions itself alongside earlier efforts on adaptive or unbounded computation, including the Universal Transformer of Dehghani et al. (2019), which applied a shared transformer layer recurrently with the explicit aim of universal, Turing-complete computation, and Adaptive Computation Time (Graves, 2016), which let a network learn how many steps to ponder before halting ^[1]. Huginn's contribution is to apply this recurrent-depth idea to a modern decoder-only language model trained at scale and to use the recurrence specifically as a knob for test-time compute.

The recurrent-depth architecture

The model is organized into three functional components, applied in sequence ^[1]:

Prelude (P): a small stack of transformer layers that embeds the input tokens into the latent space. In Huginn this is 2 layers.
Core recurrent block (R): a shared block of layers that is applied repeatedly. Each iteration takes the current latent state together with the prelude's embedded input and produces an updated latent state. In Huginn the core is 4 layers, and it is this block whose iteration count is varied. At the start of each forward pass the latent state is initialized with a random sample from a Gaussian distribution and is then refined iteratively by the core ^[1].
Coda (C): a small stack of layers that decodes the final latent state back to token probabilities. In Huginn this is 2 layers.
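The three-stage data flow described above can be sketched in a few lines. This is a toy illustration in plain Python, not the released implementation: `prelude`, `core_step`, and `coda` are hypothetical stand-ins that operate on short lists of floats instead of transformer layers, and the initialization scale `sigma` is an assumed parameter.

```python
import random

def prelude(tokens):
    """Hypothetical stand-in for P: map token ids to an embedded input."""
    return [float(t % 7) / 7.0 for t in tokens]

def core_step(state, embedded):
    """Hypothetical stand-in for R: one update of the latent state.

    Like the core described above, it sees both the current latent state
    and the prelude's embedded input on every iteration.
    """
    return [0.5 * s + 0.5 * e for s, e in zip(state, embedded)]

def coda(state):
    """Hypothetical stand-in for C: decode the final latent state."""
    return sum(state)

def forward(tokens, num_iterations, sigma=1.0, rng=None):
    rng = rng or random.Random()
    embedded = prelude(tokens)                         # P runs once
    state = [rng.gauss(0.0, sigma) for _ in embedded]  # random Gaussian initial state
    for _ in range(num_iterations):                    # R runs num_iterations times
        state = core_step(state, embedded)
    return coda(state)                                 # C runs once

# Same input and same weights, different amounts of latent computation.
shallow = forward([3, 1, 4], num_iterations=1, rng=random.Random(0))
deep = forward([3, 1, 4], num_iterations=32, rng=random.Random(0))
```

In this toy the update is a simple contraction, so extra iterations only pull the state toward a fixed point determined by the input. The real core is a learned transformer block with richer dynamics; the sketch is meant only to show where the iteration count enters the forward pass.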

This (2, 4, 2) configuration yields about 3.5B parameters total, allocated roughly as 1.5B in the prelude and head, 1.5B in the recurrent core, and 0.5B in embeddings ^[3]. Because only the 4-layer core is repeated, running it 32 times unrolls the network to an effective depth of roughly 132 layers, far deeper than the physical layer count ^[3].
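The effective-depth figure follows directly from the layer counts: the prelude and coda each run once, and the core's layers run once per iteration. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def effective_depth(num_iterations, prelude_layers=2, core_layers=4, coda_layers=2):
    """Layers traversed in one forward pass when the core is iterated num_iterations times."""
    return prelude_layers + core_layers * num_iterations + coda_layers

for r in (1, 4, 32, 64):
    print(r, effective_depth(r))
```

With the (2, 4, 2) configuration, 32 iterations give 2 + 4 × 32 + 2 = 132 layers, matching the figure quoted above.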

A key training choice makes the recurrence count flexible at inference time. During training, the number of core iterations is sampled randomly per step from a log-normal Poisson distribution with a mean of about 32 iterations and a heavy tail, so the model usually sees a moderate number of iterations but is occasionally trained at much higher counts ^[1]^[3]. To keep memory bounded, gradients are backpropagated only through the final 8 iterations (truncated backpropagation through the recurrence), independent of how many forward iterations were taken ^[1]. Training on a random, variable number of steps is what allows the model to generalize to more iterations at test time than it typically saw during training ^[1].
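The sampling scheme and the truncation rule can be sketched as follows. The parameterization is an assumption made for illustration: a log-normal rate with spread `sigma = 0.5` whose expectation equals the target mean, a Poisson draw at that rate, and a `+ 1` offset so that at least one iteration always runs (which puts the sample mean of this sketch near 33). The function names are hypothetical, and the Poisson sampler is a simple standard-library substitute rather than whatever the released training code uses.

```python
import math
import random

def sample_num_iterations(mean_recurrence=32, sigma=0.5, rng=random):
    """Draw an iteration count from a log-normal Poisson distribution (assumed parameterization)."""
    # Log-normal rate; the -0.5 * sigma**2 shift makes E[rate] equal mean_recurrence.
    tau = rng.gauss(math.log(mean_recurrence) - 0.5 * sigma ** 2, sigma)
    rate = math.exp(tau)
    # Knuth's multiplication method for Poisson sampling; adequate for moderate rates.
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count + 1  # offset so that at least one iteration is always run

def split_for_truncated_backprop(num_iterations, backprop_window=8):
    """Return (iterations run without gradients, iterations run with gradients)."""
    with_grad = min(num_iterations, backprop_window)
    return num_iterations - with_grad, with_grad

rng = random.Random(0)
counts = [sample_num_iterations(rng=rng) for _ in range(20000)]
print(sum(counts) / len(counts), max(counts))
print(split_for_truncated_backprop(32))
```

In a training loop, the first element of the split would be executed without tracking gradients and only the final window would be backpropagated through, which keeps activation memory independent of the sampled count.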

Component	Huginn configuration	Role
Prelude (P)	2 transformer layers	Embed input into latent space
Core (R)	4 transformer layers, iterated	Recurrent latent computation (the depth knob)
Coda (C)	2 transformer layers	Decode latent state to token logits
Total	~3.5B parameters	(2, 4, 2) configuration

Huginn results

Huginn-0125 was trained on roughly 795 to 800 billion tokens, weighted toward code and mathematical data, using a 65,536-token vocabulary ^[1]^[3]. The authors report that the pretraining run used 4,096 AMD MI250X GPUs on the Frontier supercomputer at Oak Ridge National Laboratory ^[1]^[3]. The model and its training and inference code were released publicly on Hugging Face and GitHub ^[2]^[4].

The headline empirical finding is that accuracy on reasoning benchmarks rises as the number of test-time recurrences increases, without any additional training. On GSM8K with chain-of-thought prompting, the paper reports performance climbing with more iterations: at 32 recurrences Huginn scores between the high 30s and low 40s in percent (for example, about 38 percent flexible-match accuracy, higher with a tuned system prompt), and with weight averaging and 64 recurrences it reaches roughly 47 percent flexible-match accuracy ^[1]. On other tasks the model saturates at lower iteration counts, as if easier problems required less "thinking" ^[1]. The authors summarize the scaling behavior by noting that, on reasoning-heavy tasks, the inference-time compute of the 3.5B-parameter recurrent model can rise to a load equivalent to that of a roughly 50-billion-parameter non-recurrent model ^[1]^[2].
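One plausible way to see where a figure near 50 billion could come from is a back-of-the-envelope count of the parameters exercised per token. The sketch below rests on assumptions that are not taken from the paper's own accounting: compute is taken to scale with the parameters touched, the 1.5B non-recurrent parameters are counted once, the 1.5B core parameters are counted once per iteration, and the 0.5B embedding parameters are ignored as a lookup. The helper name is hypothetical.

```python
def compute_equivalent_params_b(num_iterations, non_recurrent_b=1.5, core_b=1.5):
    """Rough count, in billions, of parameters exercised per token (assumed cost model)."""
    return non_recurrent_b + core_b * num_iterations

for r in (1, 32, 64):
    print(r, compute_equivalent_params_b(r))
```

Under these assumptions, 32 iterations correspond to 1.5 + 1.5 × 32 = 49.5 billion parameters' worth of computation per token, which is close to the quoted figure.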

A controlled comparison underscores that the gains come from the recurrence itself: at an early checkpoint (180 billion tokens), the recurrent model substantially outperformed a fixed-depth, non-recurrent baseline on GSM8K chain-of-thought, a gap the authors describe as on the order of a fivefold improvement ^[1].

The paper also documents several emergent latent behaviors observed by inspecting the trajectories of the hidden state across iterations ^[1]:

Orbiting: for tokens that require computation, such as numbers and certain action verbs, the latent state can trace circular ("orbiting") trajectories rather than collapsing to a single point.
Convergence and per-token compute: many tokens reach an approximate fixed point quickly, while harder tokens take more iterations to converge, so different tokens effectively consume different amounts of compute.
Sliders: some directions in latent space exhibit a steady linear drift across iterations, which the authors suggest may track progress or iteration count.
Path independence: the authors report that, in the sense of Anil et al. (2022), the model tends to follow similar trajectories and reach similar outcomes even when re-initialized from different random starting states, indicating the learned dynamics are stable rather than chaotic ^[1].

Because the recurrence produces a sequence of increasingly refined latent states, the authors note that the model supports zero-shot, training-free inference features such as early exit on easy tokens via a KL-divergence threshold, a smaller or shared KV cache by reusing latent states, and self-speculative decoding using low-recurrence drafts ^[1].
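The early-exit idea can be illustrated with a toy loop that stops iterating once the decoded distribution stops changing. This is a sketch under stated assumptions, not the authors' procedure: the exit rule compares the output distributions of two successive iterations, the threshold value is arbitrary, and `make_core_step` and `decode` are hypothetical stand-ins for the core block and coda.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def iterate_with_early_exit(state, core_step, decode, max_iterations=64, kl_threshold=1e-4):
    """Iterate core_step until successive output distributions nearly coincide.

    Returns (final distribution, number of iterations used).
    """
    previous = softmax(decode(state))
    current = previous
    for step in range(1, max_iterations + 1):
        state = core_step(state)
        current = softmax(decode(state))
        if kl_divergence(current, previous) < kl_threshold:
            return current, step
        previous = current
    return current, max_iterations

def make_core_step(target, rate):
    """Toy core: move the state toward target; a larger rate means slower convergence."""
    def step(state):
        return [rate * s + (1.0 - rate) * t for s, t in zip(state, target)]
    return step

def decode(state):
    """Toy coda: treat the latent state itself as the logits."""
    return state

start = [0.0, 0.0, 0.0]
target = [2.0, 0.0, -1.0]
_, easy_steps = iterate_with_early_exit(start, make_core_step(target, 0.2), decode)
_, hard_steps = iterate_with_early_exit(start, make_core_step(target, 0.9), decode)
print(easy_steps, hard_steps)
```

The fast-converging "easy" case exits after far fewer iterations than the slow-converging "hard" case, which is the per-token compute saving that such a threshold is meant to capture.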

Relationship to other methods

Latent reasoning via recurrent depth sits at the intersection of several lines of work ^[1]:

Chain-of-thought and token-based reasoning: the recurrent-depth approach is presented as an orthogonal, latent-space alternative to verbalized reasoning. The authors note the two are not mutually exclusive and could be combined, with latent recurrence handling computation that is hard to put into words.
Coconut (continuous chain of thought): the work is closely related to Coconut by Hao et al. (2024), which feeds a model's last hidden state back as a continuous "thought" instead of decoding a token. The authors distinguish Huginn by pretraining the recurrence from scratch on a large corpus, rather than fine-tuning an existing model into a continuous-thought mode ^[1].
Universal Transformers and adaptive computation: as noted above, the architecture revives the recurrent-depth and weight-tying ideas of the Universal Transformer and adaptive computation time, applied as a test-time scaling mechanism at the multi-billion-parameter scale ^[1].
Implicit and weight-tied reasoning: the method belongs to a broader family of "latent" or "implicit" reasoning approaches that perform extra computation inside the network rather than in the output stream, and it has been cited as a foundational example in subsequent surveys of latent chain-of-thought reasoning ^[5].

Significance

The work is significant primarily because it demonstrates, at a non-trivial scale, a test-time compute axis that is independent of output length. Instead of "reasoning by writing more," a depth-recurrent model can "reason by iterating more," spending additional FLOPs internally and only then producing an answer. It concentrates compute in repeated passes over a small, weight-shared block, which can improve the ratio of useful computation to memory traffic ^[1].

The released Huginn-0125 model also serves as an open testbed for studying how reasoning is represented in continuous latent space, including phenomena like latent orbits and per-token adaptive depth that have no direct analogue in token-based chains of thought. While the absolute benchmark scores of a 3.5B-parameter model trained on roughly 800B tokens are modest compared with frontier reasoning systems, the result is presented as evidence that latent, recurrent test-time scaling is a viable and underexplored direction, complementary to the dominant chain-of-thought paradigm ^[1]^[2].

References

1. Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., & Goldstein, T. (2025). "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." arXiv:2502.05171. https://arxiv.org/abs/2502.05171
2. Hugging Face Papers. "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." https://huggingface.co/papers/2502.05171
3. Huginn-0125 model card, Hugging Face. https://huggingface.co/tomg-group-umd/huginn-0125
4. seal-rg, "recurrent-pretraining: Pretraining and inference code for a large-scale depth-recurrent language model," GitHub. https://github.com/seal-rg/recurrent-pretraining
5. "Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning" (2025). arXiv:2505.16782. https://arxiv.org/abs/2505.16782
