Coconut (Chain of Continuous Thought)
Last reviewed
Jun 8, 2026
Sources
3 citations
Review status
Source-backed
Revision
v1 · 1,899 words
Coconut (Chain of Continuous Thought) is a method for training a large language model to perform multi-step reasoning inside a continuous latent space rather than by writing out intermediate steps as discrete word tokens. It was introduced in the December 2024 paper "Training Large Language Models to Reason in a Continuous Latent Space" by Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian, a collaboration between Meta's Fundamental AI Research (FAIR) lab and the University of California, San Diego [1].
The central idea is to take the model's last-layer hidden state, which Coconut calls a "continuous thought," and feed it back into the model as the next input embedding directly, instead of first decoding it into a token and then re-embedding that token. This creates a reasoning loop that operates entirely on continuous vectors. A continuous thought is not constrained to express a single word, so it can in principle hold a superposition of several candidate next steps at once. The authors report that this lets the model behave somewhat like a breadth-first search (BFS), keeping multiple reasoning paths alive instead of committing prematurely to one, which helps on logical-reasoning problems that require planning and search [1].
Coconut is positioned as an alternative to standard chain-of-thought (CoT) prompting and as an instance of the broader category of latent reasoning. The paper was later accepted to the Conference on Language Modeling (COLM) 2025 [1].
In chain-of-thought reasoning, a language model solves a hard problem by generating a sequence of intermediate natural-language steps before producing its final answer. Each step is sampled as discrete tokens, and those tokens are appended to the context and fed back in, so the model conditions on its own externalized reasoning. CoT substantially improves performance on arithmetic, commonsense, and symbolic tasks, and it underlies much of the recent work on test-time reasoning [1].
The Coconut authors argue that forcing reasoning to pass through language tokens is a poor fit for two reasons. First, most tokens in a written rationale exist mainly to keep the text grammatical and fluent and carry little reasoning content, so generating them wastes compute. Second, a few tokens at each step encode genuinely difficult choices, and committing to one specific word at that moment discards the model's uncertainty: once a token is sampled, the information passed forward is limited to the meaning of that single word [1].
This is the "token bottleneck." A discrete token is a one-hot choice from the vocabulary, whereas the hidden state that produced it is a high-dimensional vector that can encode a distribution over many continuations, and decoding to a token throws that richer representation away. There is also an architectural mismatch: a transformer uses a fixed number of layers per generated token, so a problem that needs more sequential computation can only get it by generating more tokens. Coconut instead keeps the reasoning state in the continuous vector domain [1].
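The information loss described above can be made concrete with a small calculation. The following is a minimal sketch, not code from the paper: the four-word vocabulary and the logits are hypothetical, and Shannon entropy is used here only as a convenient proxy for how much uncertainty a representation still carries.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; 0 means no uncertainty remains."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def one_hot_argmax(probs):
    """Greedy decoding: keep only the most likely token, discard the rest."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == best else 0.0 for i in range(len(probs))]

# Hypothetical vocabulary and scores with two near-tied candidates.
vocab = ["left", "right", "up", "down"]
logits = [2.0, 1.9, -1.0, -1.0]

distribution = softmax(logits)            # what the hidden state "knows"
committed = one_hot_argmax(distribution)  # what survives decoding to a token

bits_before = entropy_bits(distribution)  # more than 1 bit of uncertainty
bits_after = entropy_bits(committed)      # zero: the near-tie is gone

print(vocab[committed.index(1.0)], round(bits_before, 2), bits_after)
```

In this toy case the two leading candidates are almost equally likely, yet only "left" is passed forward once a token is chosen.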
Coconut alternates between two modes. In normal "language mode" the model behaves like a standard autoregressive LLM: it maps the last hidden state to a probability distribution over the vocabulary, samples a token, and embeds that token as the next input. In "latent mode" it skips the decode-and-re-embed step entirely: the last-layer hidden state from the current position is used directly as the input embedding for the next position. Because no token is produced, a continuous thought is never decoded into words [1].
Two special tokens mark the boundary of the latent segment: a begin-of-thought token (<bot>) and an end-of-thought token (<eot>). At inference, <bot> opens the latent segment, the model runs in latent mode for a chosen number of continuous-thought steps, and <eot> then closes the segment and returns the model to language mode to produce the final answer. Each continuous thought is the output vector of a single forward pass and becomes the input to the next one, so a chain of continuous thoughts is built up sequentially in the latent space [1].
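The inference loop can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' implementation: `forward`, `decode`, `WEIGHTS`, and `EMBEDDINGS` are hypothetical stand-ins, and the "model" is a single fixed matrix applied to the mean of the inputs rather than a transformer with attention.

```python
import math

WIDTH = 3
# Hypothetical fixed weights standing in for a trained transformer.
WEIGHTS = [[0.5, -0.2, 0.1],
           [0.3, 0.8, -0.5],
           [-0.4, 0.1, 0.9]]
# Hypothetical embedding table: one vector per vocabulary entry.
EMBEDDINGS = {"<bot>": [1.0, 0.0, 0.0],
              "<eot>": [0.0, 1.0, 0.0],
              "yes": [0.0, 0.0, 1.0],
              "no": [0.0, 0.0, -1.0]}

def forward(inputs):
    """Toy forward pass: last-position hidden state computed from all inputs so far."""
    mean = [sum(vec[i] for vec in inputs) / len(inputs) for i in range(WIDTH)]
    return [math.tanh(sum(w * m for w, m in zip(row, mean))) for row in WEIGHTS]

def decode(hidden):
    """Language mode: pick the answer token whose embedding best matches the hidden state."""
    def score(token):
        return sum(h * e for h, e in zip(hidden, EMBEDDINGS[token]))
    return max(("yes", "no"), key=score)

def answer_with_latent_thoughts(question_embeddings, num_thoughts):
    inputs = list(question_embeddings) + [EMBEDDINGS["<bot>"]]
    thoughts = []
    for _ in range(num_thoughts):
        hidden = forward(inputs)  # one forward pass per continuous thought
        thoughts.append(hidden)
        inputs.append(hidden)     # latent mode: hidden state is the next input, no decoding
    inputs.append(EMBEDDINGS["<eot>"])
    return decode(forward(inputs)), thoughts  # language mode resumes for the answer

answer, thoughts = answer_with_latent_thoughts([[0.2, 0.4, 0.6]], num_thoughts=3)
print(answer, len(thoughts))
```

The point of the sketch is the control flow: inside the loop nothing is sampled or re-embedded, and each pass must wait for the previous one to finish.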
A model does not pick up latent reasoning on its own, so Coconut trains it with a multi-stage curriculum. The curriculum is inspired by the implicit-chain-of-thought (iCoT) "stepwise internalization" approach of Deng et al. (2024), which gradually removes written steps and forces the model to internalize them [1].
Training starts from data that contains a full written chain of thought. At stage 0 the model is trained on the complete language reasoning chain. At each subsequent stage k, the first k reasoning steps in the written rationale are removed and replaced by continuous thoughts, while the remaining later steps stay as text. A hyperparameter c sets how many continuous thoughts stand in for one removed language step; for the math experiments the authors used c = 2. As training advances through the stages, more and more of the written reasoning is replaced by continuous thoughts, until the model performs most of its reasoning in latent space [1].
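The stage-by-stage replacement can be illustrated with a small helper. This is a sketch under assumptions: `build_stage_example` and the `<latent>` placeholder are hypothetical names, and the exact placement of <bot> and <eot> around the latent slots is assumed for illustration rather than taken from the text above.

```python
def build_stage_example(question, steps, answer, stage, c=2):
    """Lay out one training example for curriculum stage `stage`.

    The first `stage` written steps are replaced by `c` latent slots each;
    later steps and the answer stay as text.
    """
    removed = min(stage, len(steps))
    # "<latent>" is a hypothetical placeholder for one continuous thought.
    latent_slots = ["<latent>"] * (removed * c)
    return [question, "<bot>", *latent_slots, "<eot>", *steps[removed:], answer]

steps = ["step 1", "step 2", "step 3"]
for stage in range(len(steps) + 1):
    print(stage, build_stage_example("Q", steps, "A", stage))
```

With c = 2, stage 2 yields four latent slots followed by "step 3" and the answer, and stage 3 leaves no written steps at all.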
The objective remains the standard language-modeling (next-token) loss, but it is applied only to the visible tokens: the loss on the question and on the latent thoughts is masked out, since the continuous thoughts have no target token to predict. Gradients still flow back through the continuous thoughts because each one is fed forward as an input embedding, so the chain of latent steps is differentiable end to end [1].
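The masking rule can be shown with a toy calculation. In this minimal sketch, `masked_nll` and the position labels are hypothetical, and each probability stands for the likelihood the model assigns to the correct next token at that position; the numbers are invented for illustration.

```python
import math

def masked_nll(positions):
    """Average negative log-likelihood over positions that have a text target.

    Question and latent positions are skipped, mirroring the masked loss.
    """
    supervised = [prob for kind, prob in positions if kind == "text"]
    return -sum(math.log(prob) for prob in supervised) / len(supervised)

positions = [
    ("question", 0.10),  # masked: question tokens are not trained on
    ("question", 0.20),  # masked
    ("latent", None),    # masked: a continuous thought has no target token
    ("latent", None),    # masked
    ("text", 0.50),      # remaining written step: contributes to the loss
    ("text", 0.25),      # answer token: contributes to the loss
]

loss = masked_nll(positions)
print(round(loss, 4))  # 1.0397, i.e. (ln 2 + ln 4) / 2
```

Only the last two positions affect the value, although in a real model their probabilities would depend on the latent positions before them, which is how gradients reach the continuous thoughts.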
A consequence is that the latent steps cannot be computed in parallel during training: each continuous thought depends on the previous forward pass, so the curriculum requires multiple sequential forward passes per example. The authors note this hurts training efficiency and flag it as an open problem [1].
The experiments fine-tuned a pretrained GPT-2-scale model and evaluated it against a chain-of-thought baseline (the same model trained to produce written rationales) and a no-reasoning baseline. Three datasets were used: GSM8K for grade-school math word problems, and ProntoQA and ProsQA for logical reasoning. ProsQA is a harder graph-structured benchmark constructed by the authors to demand more planning and search than ProntoQA [1].
The headline result is that Coconut beats chain-of-thought on the planning-heavy logical tasks while generating far fewer reasoning tokens. The table below summarizes the reported accuracy and the average reasoning length (continuous thoughts for Coconut, text tokens for the CoT baseline) [1].
| Task | CoT accuracy | Coconut accuracy | CoT reasoning tokens | Coconut continuous thoughts |
|---|---|---|---|---|
| ProntoQA (logic) | 98.8% | 99.8% | 92.5 | 9 |
| ProsQA (harder logic) | 77.5% | 97.0% | 49.4 | 14.2 |
| GSM8K (math) | 42.9% | 34.1% | 25.0 | 8.2 |
On ProsQA, the gap is large: Coconut reaches 97.0 percent versus 77.5 percent for CoT while using a fraction of the reasoning length. On GSM8K, however, Coconut underperformed the chain-of-thought baseline (34.1 versus 42.9 percent), although it used fewer steps. The authors interpret this split as evidence that the latent approach helps most where reasoning requires substantial search and planning, and less on tasks where a mostly linear arithmetic derivation already suits CoT [1].
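The size of these differences is easy to compute from the table. The short script below uses only the figures reported above; the function name `summarize` and the dictionary keys are illustrative.

```python
# Figures copied from the results table above.
results = {
    "ProntoQA": {"cot_acc": 98.8, "coconut_acc": 99.8, "cot_tokens": 92.5, "coconut_thoughts": 9.0},
    "ProsQA": {"cot_acc": 77.5, "coconut_acc": 97.0, "cot_tokens": 49.4, "coconut_thoughts": 14.2},
    "GSM8K": {"cot_acc": 42.9, "coconut_acc": 34.1, "cot_tokens": 25.0, "coconut_thoughts": 8.2},
}

def summarize(task):
    """Return the accuracy gap (Coconut minus CoT) and the CoT-to-Coconut length ratio."""
    row = results[task]
    gap = row["coconut_acc"] - row["cot_acc"]
    ratio = row["cot_tokens"] / row["coconut_thoughts"]
    return f"{task}: {gap:+.1f} points, {ratio:.1f}x shorter"

for task in results:
    print(summarize(task))
# ProntoQA: +1.0 points, 10.3x shorter
# ProsQA: +19.5 points, 3.5x shorter
# GSM8K: -8.8 points, 3.0x shorter
```

The reasoning is shorter on all three tasks, but the accuracy gain appears only on the two logic benchmarks.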
The key qualitative finding concerns what the continuous thoughts represent. Because a continuous thought is a vector rather than a committed token, it can encode several alternative next steps at the same time. By probing the latent states, the authors argue that Coconut implicitly explores a tree of possibilities and prunes wrong branches over successive thoughts, behaving like a breadth-first search rather than the depth-first, single-path commitment of CoT. This lets the model defer hard choices instead of locking in an early mistake [1]. An ablation that keeps the curriculum but removes the actual continuous thoughts ("w/o thought") performs worse than full Coconut, indicating that the latent vectors, not merely the staged training, drive the improvement [1].
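The contrast between committing to one path and keeping several alive can be illustrated on a toy graph. This sketch is an analogy only: Coconut does not run an explicit search procedure, and the graph, `greedy_path`, and `bfs_path` are hypothetical constructions chosen so that the first available branch is a dead end.

```python
from collections import deque

# Hypothetical graph: from "A", the branch through "B" dead-ends,
# while the branch through "C" reaches "GOAL".
GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "D": [], "E": ["GOAL"], "GOAL": []}

def greedy_path(start, goal):
    """Commit to the first listed successor at every step, never revisiting a choice."""
    path = [start]
    while path[-1] != goal and GRAPH[path[-1]]:
        path.append(GRAPH[path[-1]][0])
    return path if path[-1] == goal else None

def bfs_path(start, goal):
    """Keep every frontier path alive and expand them level by level."""
    frontier = deque([[start]])
    visited = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH[path[-1]]:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None

print(greedy_path("A", "GOAL"))  # None: the early commitment to "B" cannot be undone
print(bfs_path("A", "GOAL"))     # ['A', 'C', 'E', 'GOAL']
```

The greedy walk fails because of its first choice, which mirrors the "locking in an early mistake" failure mode attributed to single-path reasoning.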
Coconut sits within a family of "latent" or "implicit" reasoning techniques that try to move some of the reasoning off the visible token stream.
Implicit chain of thought (iCoT), by Deng et al. (2024), trains a model to internalize its reasoning by progressively deleting the written steps during training. Coconut borrows this stepwise-internalization curriculum but replaces the deleted steps with explicit continuous-thought vectors rather than relying on the network to absorb them silently [1].
Pause tokens and filler tokens. Goyal et al. (2023) insert learnable "pause" tokens to give the model extra forward passes before answering, and Pfau et al. (2024) study "filler" tokens such as repeated dots that provide hidden computation without conveying content. Both are still discrete tokens, typically a single added symbol repeated, whereas a Coconut continuous thought is a full-width hidden-state vector that varies with the input. The Coconut authors observe that pause-style tokens suit highly parallelizable problems, while feeding back the continuous state helps more when each reasoning step depends heavily on the previous one [1].
Quiet-STaR (Zelikman et al., 2024) trains a model to generate internal rationales at every token to improve next-token prediction. It still reasons in token space, but it shares the idea of rationales as a latent aid to prediction [1].
Recurrent-depth latent reasoning. A closely related but architecturally distinct approach is the "Huginn" model of Geiping et al. (February 2025), introduced in "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach." Instead of feeding a hidden state back as a new input position, Huginn iterates a shared recurrent transformer block in place, unrolling to arbitrary depth at test time so that extra computation does not require extra tokens. The authors report that the 3.5-billion-parameter model, trained on 800 billion tokens, can match the reasoning performance of a much larger model by adding recurrence iterations. Unlike Coconut, whose curriculum is seeded with written reasoning chains, it needs no written chain-of-thought data, and it can capture reasoning not easily put into words [2]. Both methods reason in continuous space; Coconut extends the sequence with latent positions, whereas recurrent depth deepens the computation per position. Geiping et al.'s approach is sometimes referred to as recurrent-depth reasoning.
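The structural difference between the two approaches can be sketched with toy functions. This is a deliberately simplified illustration and not either paper's architecture: `block` is a hypothetical stand-in for the shared computation, attention across positions is ignored, and the function names are invented.

```python
import math

def block(state):
    """Hypothetical shared computation block standing in for transformer layers."""
    return [math.tanh(0.9 * x + 0.1) for x in state]

def coconut_style(inputs, num_thoughts):
    """Extend the sequence: each new hidden state is appended as a new input position."""
    sequence = list(inputs)
    for _ in range(num_thoughts):
        sequence.append(block(sequence[-1]))
    return sequence

def recurrent_depth_style(inputs, num_iterations):
    """Deepen in place: iterate the same block on every position; length is unchanged."""
    sequence = list(inputs)
    for _ in range(num_iterations):
        sequence = [block(state) for state in sequence]
    return sequence

prompt = [[0.1, 0.2], [0.3, 0.4]]
print(len(coconut_style(prompt, 4)))          # 6: four latent positions were added
print(len(recurrent_depth_style(prompt, 4)))  # 2: same positions, more computation each
```

Extra compute shows up as extra sequence length in the first function and as extra iterations per position in the second.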
Coconut has also prompted follow-up theoretical work analyzing whether its continuous thoughts genuinely hold a "superposition" of reasoning paths [1].
Interpretability is the most prominent limitation: because continuous thoughts are never decoded into words, a Coconut chain is far harder to read, audit, or debug than a written chain-of-thought, which weakens the transparency that made CoT attractive for high-stakes use. Training is more complex and less efficient, since the multi-stage curriculum and the sequential dependence between continuous thoughts prevent the parallel computation that ordinary language-model training enjoys, and the authors call out training efficiency as future work. The method also requires source data with explicit reasoning chains to seed the curriculum. Finally, the benefit is task-dependent: the advantage is clearest on search-heavy logical reasoning, and on GSM8K-style math Coconut lagged the chain-of-thought baseline, so latent reasoning is not yet a uniform replacement for token-level reasoning [1].