Coconut (Chain of Continuous Thought)
Last reviewed
May 18, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 4,494 words
Coconut (Chain of Continuous Thought) is a reasoning paradigm for large language models introduced by researchers at FAIR at Meta and UC San Diego in December 2024. Rather than producing intermediate reasoning steps as discrete language tokens, as in Chain-of-Thought prompting, a Coconut model performs intermediate "thoughts" entirely in continuous latent space by feeding the final hidden state of its last forward pass back as the next input embedding. The paper "Training Large Language Models to Reason in a Continuous Latent Space" (arXiv:2412.06769), by Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian, was submitted on 9 December 2024 and accepted to COLM 2025.[^1][^2]
Coconut is positioned as a counterpoint to the prevailing assumption that explicit, token-by-token deliberation is necessary for advanced LLM reasoning. By keeping reasoning representations in the model's high-dimensional embedding space, Coconut allows a single continuous thought vector to encode a probability distribution over multiple candidate next steps. The authors argue that this enables an implicit breadth-first search through reasoning states, which is particularly helpful for problems that require backtracking and planning. Empirically, Coconut outperforms standard CoT on two logical reasoning benchmarks, by a wide margin on ProsQA and narrowly on ProntoQA, while trailing CoT on the open-domain math benchmark GSM8K.[^1][^3]
The release of Coconut helped catalyze a broader research program on "latent reasoning" that grew rapidly through 2025 and into 2026, including theoretical work on why continuous thoughts can be more expressive than discrete ones and practical follow-ups such as CODI, SIM-CoT, and parallel continuous-CoT methods.
Standard Chain-of-Thought reasoning, popularized by Wei et al. (2022), asks a language model to produce a textual sequence of intermediate steps before its final answer. Each step is a sequence of discrete word tokens drawn from the model's vocabulary, and each token is selected by sampling or greedy decoding from the model's probability distribution over the vocabulary. This procedure has driven much of the recent progress in reasoning-capable LLMs and is a key ingredient in test-time compute scaling.[^1]
The Coconut authors identify several structural limitations of token-level CoT that motivated their work.[^1][^4]
First, most word tokens are not load-bearing for reasoning. In natural-language reasoning traces, the majority of tokens exist primarily to maintain textual coherence (articles, conjunctions, punctuation, grammatical glue) rather than to carry algorithmic information. Yet the model expends a full forward pass and a full slot of its test-time compute budget on each one.
Second, discrete tokens form an information bottleneck. The hidden state of a transformer at any given position is a high-dimensional continuous vector, often with thousands of dimensions. Projecting that vector through the language-model head onto a categorical distribution over the vocabulary, and then sampling a single token, collapses a rich, multi-modal representation down to a single discrete choice. The Coconut paper argues that this projection is wasteful when the model is reasoning rather than communicating with a human.[^1]
Third, committing to one token forces a single search path. In a problem requiring search or planning, the model must commit to one reasoning step at a time, even when several plausible continuations exist. Standard CoT therefore performs something like a greedy depth-first walk through the space of reasoning trajectories. Methods such as Tree of Thoughts address this by sampling multiple branches externally, but they multiply the inference cost.[^5]
Fourth, there is a neuroscientific intuition. The authors note that neuroscientific evidence suggests human reasoning does not always proceed in words; they offer this as a soft motivation rather than a strict justification.[^1]
These considerations led the authors to ask whether the entire chain-of-thought process can be moved into the model's continuous representation space, sidestepping the token-level bottleneck while preserving the structural benefits of multi-step deliberation.
The core of Coconut is a simple but consequential modification to how the model is fed its own output during reasoning. A Coconut model operates in two alternating modes.[^1][^4]
Language mode is identical to a standard autoregressive transformer language model. At each step, the model reads a sequence of token embeddings, runs a forward pass, projects the final-layer hidden state through the LM head, and produces a probability distribution over the vocabulary from which the next token is sampled or argmax-decoded.
Latent (or "thought") mode is the new behavior. In this mode the model still performs a forward pass, but the final hidden state is not projected to the vocabulary. Instead, the hidden state vector is taken directly as the input embedding at the next position. No discrete token is produced; the LM head is bypassed entirely for those positions. Crucially, the continuous thought is not constrained to lie near any particular token embedding; it can occupy regions of embedding space that do not correspond to any word.[^1]
Switching between the two modes is signaled by two special tokens added to the vocabulary: <bot> (beginning-of-thought) and <eot> (end-of-thought). A typical Coconut trace has the form
[input question tokens] <bot> [k continuous thoughts] <eot> [answer tokens]
Between <bot> and <eot>, the model is in latent mode. For positions i < t < j (where i and j are the positions of <bot> and <eot>), the input at position t is the final hidden state from position t-1, not a token embedding. After <eot> the model returns to language mode and emits a natural-language answer.[^1]
Formally, if M denotes the model, x denotes the input sequence of tokens with embeddings e(x), and h denotes the final-layer hidden state, then within the latent span the input embedding sequence becomes [e(x_1), ..., e(x_i), h_i, h_{i+1}, ..., h_{t-1}]. The training objective is the standard next-token cross-entropy loss applied only at language-mode positions, since the latent positions have no corresponding ground-truth token to predict.[^1]
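The feedback loop described above can be illustrated with a minimal sketch. Everything here is a toy stand-in rather than the authors' implementation: `embed`, `toy_last_hidden`, `run_latent_span`, and the hidden size `HIDDEN` are hypothetical names, and the "forward pass" is a simple function of the inputs instead of a real transformer. The only point is the data flow: in latent mode the last hidden state is appended to the input sequence in place of a token embedding.

```python
import math

HIDDEN = 4  # toy hidden size (assumption, chosen only for illustration)

def embed(token_id):
    """Hypothetical embedding lookup: a deterministic toy vector per token id."""
    return [math.sin(token_id + d) for d in range(HIDDEN)]

def toy_last_hidden(inputs):
    """Stand-in for a transformer forward pass.

    Returns a 'final-layer hidden state' for the last position as a
    function of every input vector seen so far (causal dependence).
    """
    n = len(inputs)
    mean = [sum(vec[d] for vec in inputs) / n for d in range(HIDDEN)]
    return [math.tanh(m + 0.1 * n) for m in mean]

def run_latent_span(question_ids, bot_id, num_thoughts):
    """Build [e(x_1), ..., e(x_i), h_i, h_{i+1}, ...] step by step."""
    inputs = [embed(t) for t in question_ids] + [embed(bot_id)]
    thoughts = []
    for _ in range(num_thoughts):
        h = toy_last_hidden(inputs)  # forward pass; no LM head, no sampling
        thoughts.append(h)
        inputs.append(h)             # hidden state becomes the next input embedding
    return inputs, thoughts

inputs, thoughts = run_latent_span([11, 12, 13], bot_id=0, num_thoughts=2)
print(len(inputs), len(thoughts))
```

Note that the fed-back vectors have the same dimensionality as the token embeddings, so no projection is needed between the output and input sides.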
A central point is that a continuous thought vector lives in the same high-dimensional space the transformer uses to internally mix probability mass over many tokens. Because softmax over the vocabulary is not applied within the latent span, the representation can encode a superposition over multiple candidate next reasoning steps in a way that a single discrete token cannot. Case studies in the paper suggest that the latent thought after <bot> in a graph-reachability problem implicitly represents multiple frontier nodes simultaneously, enabling implicit parallel search.[^1][^6]
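As a loose analogy for the parallel-frontier reading, the sketch below contrasts a set-valued frontier that expands one hop per step with a walker that commits to a single successor at each step. It is not a model of the network itself: the set stands in for a continuous thought that keeps several candidate nodes alive, and the function names and the example graph are hypothetical.

```python
def frontier_search(edges, start, target, max_thoughts):
    """Expand a set-valued frontier one hop per step.

    Returns the number of steps needed to reach `target`, or None.
    """
    frontier = {start}
    visited = {start}
    for step in range(1, max_thoughts + 1):
        frontier = {dst for src in frontier for dst in edges.get(src, [])} - visited
        visited |= frontier
        if target in frontier:
            return step
        if not frontier:
            return None
    return None

def greedy_walk(edges, start, target, max_steps):
    """Commit to the first listed successor at every step, with no backtracking."""
    node = start
    for step in range(1, max_steps + 1):
        successors = edges.get(node, [])
        if not successors:
            return None  # dead end: the early commitment cannot be undone
        node = successors[0]
        if node == target:
            return step
    return None

# Hypothetical graph: the path to F runs through C, but B is listed first.
edges = {"A": ["B", "C"], "B": ["D"], "C": ["E"], "E": ["F"]}
print(frontier_search(edges, "A", "F", max_thoughts=5))  # 3
print(greedy_walk(edges, "A", "F", max_steps=5))         # None
```

The frontier version reaches the target in as many steps as the shortest path is long, while the committed walker fails as soon as its first choice leads to a dead end.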
The hyperparameter c controls how many continuous thoughts replace one discrete reasoning step. The authors use c = 2 for GSM8K and c = 1 for the logical reasoning benchmarks. Increasing c gives the model more latent forward passes per reasoning step at the cost of more compute.[^1]
A naive approach of training a fresh model to produce continuous thoughts end-to-end would lack supervision: there is no ground-truth latent thought to imitate. The Coconut authors instead propose a multi-stage curriculum, inspired by the implicit Chain-of-Thought (iCoT) work of Deng et al. (2024), that progressively converts discrete reasoning steps into continuous ones.[^1][^7]
The curriculum is organized into stages indexed by k = 0, 1, 2, ..., K, where K is the maximum number of continuous-thought stages.
Stage 0 trains the model with full standard CoT supervision. Every reasoning step is presented as language tokens, and the model is fine-tuned with the usual next-token loss on the entire input-CoT-answer sequence.
Stage k > 0 replaces the first k reasoning steps in each training example with k * c continuous thoughts. Concretely, the language tokens for the first k steps of the CoT are removed and substituted with a <bot> marker, k * c latent positions (which carry no ground-truth tokens), and an <eot> marker. The remaining reasoning steps and the final answer are left as language tokens. Loss is computed only on the language-mode positions.
Final stage replaces all explicit reasoning steps with continuous thoughts so that, at inference, the model sees the question, switches to latent mode for the entire reasoning span, and then emits only the final answer in language mode.
The optimizer state is reset between stages, and the data is also mixed across stages with a uniform_prob parameter so that the model does not catastrophically forget earlier behaviors. The published configuration uses a learning rate of 1e-4, a batch size of 128, and bf16 mixed precision.[^4]
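The stage construction can be sketched as a small data-preparation function. This is an illustration under stated assumptions, not the released code: the `<latent>` placeholder string, the function name `build_stage_example`, the choice to exclude the question and the `<bot>`/`<eot>` markers from the loss, and the clamping of k to the number of available steps are all assumptions made for the example.

```python
BOT, EOT = "<bot>", "<eot>"
LATENT = "<latent>"  # hypothetical placeholder for a position filled by a hidden state

def build_stage_example(question, steps, answer, k, c):
    """Return (sequence, loss_mask) for curriculum stage k.

    question, answer: lists of tokens; steps: list of token lists.
    loss_mask[i] is True where the next-token loss would be applied.
    """
    k = min(k, len(steps))               # assumption: cannot replace more steps than exist
    sequence = list(question)
    loss_mask = [False] * len(question)  # assumption: no loss on the question
    if k > 0:
        latent_span = [BOT] + [LATENT] * (k * c) + [EOT]
        sequence += latent_span
        loss_mask += [False] * len(latent_span)  # no ground-truth token to predict
    for step in steps[k:]:               # remaining steps stay as language tokens
        sequence += step
        loss_mask += [True] * len(step)
    sequence += answer
    loss_mask += [True] * len(answer)
    return sequence, loss_mask

question = ["Q1", "Q2"]
steps = [["s1a", "s1b"], ["s2a"], ["s3a"]]
answer = ["ans"]

stage0, mask0 = build_stage_example(question, steps, answer, k=0, c=2)
stage2, mask2 = build_stage_example(question, steps, answer, k=2, c=2)
print(stage2)
```

With k = 2 and c = 2, the first two steps are replaced by four latent positions, and only the third step and the answer remain supervised.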
Two observations from the paper are worth emphasizing. First, the curriculum is not optional: variants trained without curriculum supervision (i.e., directly attempting to learn latent thoughts from scratch) underperform substantially, comparable to no-CoT baselines. The continuous-thought capacity must be bootstrapped from explicit language supervision. Second, the curriculum itself contributes significantly to performance, independent of the continuous thoughts. An ablation that keeps the staged removal of language steps but does not insert any latent thoughts (the "w/o thought" variant) still recovers part of the gain, suggesting that the structured curriculum drives some of Coconut's improvement.[^1][^3]
The Coconut paper evaluates on three datasets covering math reasoning and logical reasoning.[^1]
GSM8K is a benchmark of grade-school math word problems, each requiring two to eight arithmetic reasoning steps. The Coconut authors use 385,620 training examples (augmented), 500 validation, and 1,319 test problems. GSM8K is treated as an open-domain reasoning task because both the surface form and the required arithmetic steps vary substantially across examples.
ProntoQA is a synthetic logical reasoning dataset that asks the model to determine whether a target proposition follows from a small set of premises. The Coconut paper uses a 5-hop version with 9,000 training, 200 validation, and 800 test examples. ProntoQA tasks have a relatively narrow structure compared with GSM8K.
ProsQA is a new dataset introduced in the Coconut paper. It generates random directed acyclic graphs (DAGs) over concept nodes and asks the model to answer reachability-style queries that require planning a path through the DAG. ProsQA was designed specifically to stress planning and backtracking; the paper uses 17,886 training, 300 validation, and 500 test examples.[^1][^8]
For all three datasets the base model is a pre-trained GPT-2. The paper does not specify which size variant, but the official implementation defaults to the openai-community/gpt2 124M-parameter checkpoint. Fine-tuning is performed independently per benchmark.[^1][^4]
The main baselines reported are: CoT (standard supervised CoT fine-tuning), No-CoT (direct answer fine-tuning), iCoT (implicit CoT via Stepwise Internalization), and Pause Token (Goyal et al., 2024, which inserts learned <pause> tokens). The paper also reports several Coconut ablations: w/o curriculum (Coconut without the multi-stage schedule), w/o thought (curriculum schedule but no inserted latent thoughts), and Pause as thought (continuous thoughts replaced by repeated <pause> tokens with equivalent compute).[^1]
The headline results compare Coconut and standard CoT on each benchmark.[^1][^3]
On ProsQA, the planning-intensive DAG reasoning task, Coconut reaches 97.0% accuracy versus 77.5% for CoT, a gap of nearly 20 percentage points. Coconut also produces a shorter reasoning trace, averaging 14.2 generated tokens at inference compared with 49.4 for CoT.
On ProntoQA, Coconut achieves 99.8% accuracy with about 9.0 tokens, edging out CoT's 98.8% at 92.5 tokens. Both methods nearly saturate the benchmark, but Coconut does so with roughly an order of magnitude fewer thinking tokens.
On GSM8K, Coconut underperforms CoT: 34.1% accuracy for Coconut versus 42.9% for CoT, an 8.8-point gap. However, Coconut uses an average of only 8.2 tokens per problem compared with CoT's 25.0, and the iCoT baseline (which produces almost no tokens) reaches only 30.0%.
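For readers who want the three comparisons side by side, the following short calculation derives the accuracy gap (Coconut minus CoT, in percentage points) and the token ratio (CoT tokens divided by Coconut tokens) from the figures quoted above. The dictionary layout and the function name `summarize` are illustrative choices; no numbers beyond those already stated are introduced.

```python
# (accuracy in %, average generated tokens), as quoted in the text above
results = {
    "ProsQA":   {"coconut": (97.0, 14.2), "cot": (77.5, 49.4)},
    "ProntoQA": {"coconut": (99.8, 9.0),  "cot": (98.8, 92.5)},
    "GSM8K":    {"coconut": (34.1, 8.2),  "cot": (42.9, 25.0)},
}

def summarize(results):
    """Compute accuracy gap and token ratio per benchmark, rounded to one decimal."""
    summary = {}
    for name, row in results.items():
        (acc_coconut, tok_coconut), (acc_cot, tok_cot) = row["coconut"], row["cot"]
        summary[name] = {
            "accuracy_gap": round(acc_coconut - acc_cot, 1),
            "token_ratio": round(tok_cot / tok_coconut, 1),
        }
    return summary

summary = summarize(results)
for name, stats in summary.items():
    print(name, stats)
```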
Three patterns emerge across the results.[^1][^3]
First, Coconut excels at planning-intensive reasoning. The paper's case studies on ProsQA suggest that the continuous thought encodes a probability distribution over multiple candidate next nodes simultaneously, enabling a kind of implicit breadth-first search through the reasoning graph. CoT, by contrast, must commit to a single next step at each token, and on harder ProsQA instances it tends to commit early to a wrong path and cannot backtrack.
Second, Coconut is more token-efficient than CoT in all cases, sometimes dramatically. This is a structural property: a single forward pass with no decoding constraint can carry more information than a single sampled token.
Third, Coconut underperforms on open-domain math reasoning. The authors attribute this partly to the diversity of reasoning patterns in GSM8K and partly to the difficulty of preserving precise arithmetic intermediate quantities in a continuous representation. The paper does not provide a fully mechanistic explanation for the GSM8K gap.[^3]
An additional finding is a chaining effect: holding everything else fixed, accuracy increases as the number of continuous thoughts per reasoning step (c) is raised from 0 to 2. This suggests that latent reasoning preserves some of the scaling-with-thinking-tokens property that has made test-time compute valuable for CoT models.[^1]
Coconut is intentionally minimal at the architecture level. It introduces no new neural network components, no auxiliary heads, and no architectural modifications to the underlying transformer. The model is a standard pre-trained GPT-2 with two added special tokens (<bot> and <eot>) and a modified input pipeline that, during latent positions, substitutes the previous step's final hidden state for the usual token embedding.[^1][^4]
A few technical points are worth highlighting.
The substitution happens at the input embedding layer. The hidden state h_t (the output of the last transformer block at position t) is fed in as the input vector at position t+1, in place of an embedding lookup from the vocabulary. The dimensionality therefore matches the model's hidden size automatically, with no projection.[^1]
Because each continuous thought depends on the hidden state computed at the previous position, the latent thoughts must be computed sequentially; they cannot be processed in parallel across positions. This makes training somewhat less compute-efficient than standard CoT training, where the entire sequence can be processed in one pass with a causal mask. Coconut effectively requires n + 1 sequential forward passes for an example containing n latent thoughts, a property the paper flags as a limitation for scale-up.[^1]
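One way to see where the extra passes come from is to split a training sequence at each latent position: a pass can only run up to the position just before a latent input, because that input is not known until the preceding hidden state has been computed. The sketch below is a hypothetical scheduling helper (`schedule_passes` is not a function from the released code), and it assumes earlier positions are reused between passes, for example through a key-value cache.

```python
def schedule_passes(seq_len, latent_positions):
    """Split positions 0..seq_len-1 into sequential forward passes.

    latent_positions: indices whose input is the previous position's
    hidden state. Returns half-open (start, end) ranges, one per pass.
    """
    positions = sorted(latent_positions)
    if positions and (positions[0] < 1 or positions[-1] >= seq_len):
        raise ValueError("latent positions must lie in 1..seq_len-1")
    passes = []
    start = 0
    for pos in positions:
        passes.append((start, pos))  # stop before the latent input is needed
        start = pos
    passes.append((start, seq_len))  # final pass covers the remaining tokens
    return passes

# Ten positions, with latent thoughts at positions 4, 5, and 6.
passes = schedule_passes(10, [4, 5, 6])
print(passes)  # [(0, 4), (4, 5), (5, 6), (6, 10)]
```

Three latent positions yield four passes, and a sequence with no latent positions collapses to the single pass of ordinary fine-tuning.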
The next-token loss is masked out at latent positions because no ground-truth language token exists there. The cross-entropy loss is computed only at supervised language-mode positions, namely any remaining language reasoning steps and the final answer; the question tokens are typically excluded from the loss as well, depending on configuration. Gradients still flow back through the latent hidden states because they participate in producing later language-mode predictions.[^1]
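The masking can be illustrated with a few lines of plain Python. This is a sketch of the averaging only, under the assumption that the loss is a mean over supervised positions; it does not model gradient flow, and the function name and example values are hypothetical.

```python
import math

def masked_cross_entropy(target_probs, loss_mask):
    """Average negative log-likelihood over positions where loss_mask is True.

    target_probs[i] is the probability the model assigned to the
    ground-truth token at position i; it is ignored when masked.
    """
    kept = [-math.log(p) for p, keep in zip(target_probs, loss_mask) if keep]
    if not kept:
        raise ValueError("at least one position must be supervised")
    return sum(kept) / len(kept)

# Question token, two latent positions (no target exists), then two answer tokens.
target_probs = [0.9, None, None, 0.5, 0.25]
loss_mask = [False, False, False, True, True]
loss = masked_cross_entropy(target_probs, loss_mask)
print(round(loss, 4))
```

Only the last two positions contribute, so the latent positions affect the loss solely through their influence on those later predictions.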
The released implementation supports loading any Hugging Face causal LM as a base; the published experiments use GPT-2, but the framework is in principle agnostic to model family.[^4]
The paper and subsequent analyses identify several limitations.[^1][^3][^9]
Dependence on language supervision. Coconut's curriculum requires high-quality language CoT data to bootstrap. Variants trained without curriculum supervision perform at no-CoT levels, indicating that the continuous-thought representations are learned through distillation from language steps rather than from scratch. This limits Coconut's potential to discover novel reasoning strategies absent from the training distribution.
Open-domain underperformance. Coconut substantially trails CoT on GSM8K. The paper does not establish that Coconut scales to broader reasoning domains.
Sequential training inefficiency. Each continuous thought requires its own forward pass, and the curriculum requires multiple training stages, making Coconut more expensive to train than vanilla CoT fine-tuning.
Length scaling. Follow-up analyses observe that Coconut's accuracy on synthetic logical reasoning degrades sharply as the required step count grows, dropping from near-100% at two steps to roughly 38% at five steps in some configurations. The capacity of the continuous-thought representation to maintain long reasoning chains appears bounded.[^3]
Interpretability loss. Perhaps the most discussed downside is that Coconut replaces a human-readable reasoning trace with an inscrutable high-dimensional vector. This is a serious concern for the alignment and oversight research community, which had come to view explicit CoT as a partial window into model deliberation. Several commentators have argued that wide adoption of latent reasoning would undermine recent progress on monitoring chain-of-thought for misuse and on auditing model reasoning through its visible trace.[^9][^10]
Shortcut dependence. A December 2025 analysis (arXiv:2512.21711) reports that Coconut models exhibit a strong tendency to exploit surface-level shortcuts (answer patterns and contextual cues correlated with the label) rather than reasoning through the problem, raising doubts about whether the latent thoughts represent genuine reasoning or cached associations on the curriculum's narrow distribution.[^11]
Mode-switching heuristics. The paper relies on fixed heuristics (a hard-coded number of latent thoughts per step) rather than letting the model adaptively choose when to switch modes. Adaptive mode-switching is left as future work.
The Coconut paper sparked a fast-growing body of follow-up work that came to be known under the umbrella term latent reasoning. The 2025 survey "A Survey on Latent Reasoning" by Zhu et al. (arXiv:2507.06203) organizes this literature into several methodological families, of which Coconut sits in the "hidden state propagation" branch.[^12]
Notable directly-related work includes:
CODI (Shen et al., 2025; arXiv:2502.21074): Replaces Coconut's explicit curriculum with a self-distillation framework in which a single model jointly trains an explicit-CoT "teacher" task and an implicit-CoT "student" task, aligning hidden activations between them. CODI reports matching explicit CoT on GSM8K at the GPT-2 scale while compressing reasoning by 3.1x.[^13]
Reasoning by Superposition (Zhu, Hao, Hu, Jiao, Russell, and Tian, 2025; arXiv:2505.12514): Provides theoretical grounding for Coconut by proving that a two-layer transformer with D continuous-thought steps can solve directed-graph reachability, where D is the diameter of the graph, whereas the best known result for constant-depth transformers with discrete CoT requires O(n^2) decoding steps for n vertices. The paper formalizes the "superposition" hypothesis that continuous thoughts encode multiple search frontiers in parallel.[^6]
Continuous Chain of Thought Enables Parallel Exploration and Reasoning (Gozeten et al., 2025; arXiv:2505.23648): Quantifies the level of achievable parallelism in continuous CoT and provides algorithms that exploit it for logical-reasoning search.[^14]
SIM-CoT (Supervised Implicit Chain-of-Thought, 2025; arXiv:2509.20317): Adds supervision signals to the implicit reasoning steps themselves, addressing the unsupervised-latent problem that Coconut handled via curriculum.[^15]
Parallel Continuous Chain-of-Thought with Jacobi Iteration (EMNLP 2025): Speeds up continuous-CoT inference by replacing sequential latent forward passes with parallel Jacobi-style fixed-point iteration, partially mitigating Coconut's sequential-inference cost.[^16]
LightThinker (2025): Trains models to compress reasoning traces into shorter "gist" tokens, occupying a middle ground between language tokens and Coconut's fully-continuous vectors.[^12]
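The Jacobi-style parallelization mentioned in the list above can be illustrated on a toy recurrence. The sketch below is a generic Jacobi fixed-point iteration, not the algorithm from the cited paper: `next_thought` is a hypothetical scalar stand-in for one latent forward pass that depends on all earlier thoughts, and the point is only that updating every position from the previous iterate reproduces the sequential result after at most as many sweeps as there are thoughts.

```python
import math

def next_thought(prefix):
    """Toy stand-in for one latent forward pass over all earlier thoughts."""
    return math.tanh(0.5 * sum(prefix) / len(prefix) + 0.3)

def sequential_thoughts(h0, num_thoughts):
    """Reference: compute each thought only after the previous one is known."""
    thoughts = [h0]
    for _ in range(num_thoughts):
        thoughts.append(next_thought(thoughts))
    return thoughts

def jacobi_thoughts(h0, num_thoughts, iterations):
    """Update all positions from the previous iterate in each sweep.

    Within a sweep the updates do not depend on one another, so they
    could run in parallel; here they are written as a list comprehension.
    """
    guess = [h0] * (num_thoughts + 1)  # initial guess for every position
    for _ in range(iterations):
        guess = [h0] + [next_thought(guess[:t]) for t in range(1, num_thoughts + 1)]
    return guess

reference = sequential_thoughts(0.1, num_thoughts=4)
parallel = jacobi_thoughts(0.1, num_thoughts=4, iterations=4)
print(max(abs(a - b) for a, b in zip(reference, parallel)))
```

After k sweeps the first k thoughts are already exact, so any speed-up in practice depends on the iteration converging to a useful approximation in fewer sweeps than there are thoughts.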
The survey by Zhu et al. (2507.06203) also discusses an "infinite-depth" branch that connects latent reasoning to masked diffusion language models, where iterative denoising can be interpreted as repeated latent reasoning over a fixed-length canvas. A parallel survey by Sui et al. (2025) provides a complementary taxonomy along token-wise versus structural dimensions.[^12]
A causal-and-adversarial analysis paper from December 2025 (arXiv:2512.21711) probes whether the latent tokens in Coconut actually "think" or instead serve as differentiable computational scaffolding that does not encode the claimed search semantics. The paper presents evidence that Coconut's latent tokens are more brittle and shortcut-dependent than the original paper suggests, and proposes adversarial perturbation tests that future latent-reasoning systems should pass.[^11]
Coconut and standard CoT differ along several axes that are useful to make explicit.
Representation of intermediate steps. CoT represents each intermediate reasoning step as a sequence of natural-language tokens drawn from the model's vocabulary. Coconut represents each step (or sub-step) as a single high-dimensional continuous vector that does not correspond to any token. CoT's representations are interpretable and bounded by the vocabulary's discrete combinatorics; Coconut's are inscrutable but far higher-bandwidth per slot.
Search structure. CoT performs something like a greedy depth-first search through the space of reasoning trajectories, committing to one token per step. Coconut performs implicit parallel exploration: a single continuous thought can encode a superposition over candidate next steps, and later thoughts can sharpen or discard those candidates. This is the proposed structural explanation for Coconut's advantage on the logical reasoning benchmarks, most visibly on ProsQA, where the gap over CoT is large.[^1][^6]
Training requirements. CoT requires only standard supervised fine-tuning on (input, reasoning, answer) triples. Coconut requires a multi-stage curriculum that gradually replaces language steps with latent ones, optimizer-state resets between stages, and additional hyperparameters (c, K, mixing probabilities). Without that curriculum, Coconut fails.
Inference cost. Coconut produces far fewer total tokens at inference (8 to 14 versus 25 to 92 in the reported experiments), but each latent forward pass is roughly as expensive as a token forward pass. Total wall-clock and FLOPs costs depend on the ratio of latent to language steps but tend to favor Coconut on planning-heavy problems.
Compatibility with reward learning. CoT integrates straightforwardly with reward modeling techniques such as process reward models and reinforcement learning methods like GRPO, which score individual reasoning steps written in natural language. Coconut's latent thoughts have no surface form, which complicates standard step-level reward modeling and is an active area of research.
Interpretability. CoT yields a (claimed) human-auditable reasoning trace; Coconut does not. The trade-off has direct implications for AI safety: an organization relying on chain-of-thought monitoring as part of its oversight stack would lose that signal if Coconut-style latent reasoning displaced it.[^9][^10]
Coconut is one of several proposed mechanisms for performing reasoning "internally" rather than via explicit token generation. Distinguishing these is important because the term implicit chain-of-thought has been used loosely.[^7][^15]
iCoT (Implicit Chain-of-Thought via Stepwise Internalization), Deng et al. (2024). iCoT also uses a curriculum, but its goal is to remove the language reasoning steps from the model's output entirely so that the model produces the answer directly while having internalized the reasoning during fine-tuning. iCoT does not introduce any new latent-mode mechanism: every forward pass still emits a discrete token; only some of those tokens are dropped during training to encourage internalization. Coconut, by contrast, explicitly carves out latent positions where the LM head is bypassed and the hidden state is fed back as input. Coconut therefore creates new "thinking slots" between input and answer, while iCoT compresses reasoning into the answer-token positions themselves.[^7]
Pause Tokens (Goyal et al., 2024). The pause-token method inserts learned, but discrete, <pause> tokens that give the model extra forward passes before producing the next answer token. Pause tokens occupy a single fixed vector per slot rather than a context-dependent hidden state. Coconut's "Pause as thought" ablation, which replaces continuous thoughts with pause tokens at matched compute, underperforms full Coconut on GSM8K while remaining competitive on the logical reasoning benchmarks, suggesting that the context-dependence of the hidden state matters most on the open-domain task.[^1]
Quiet-STaR / STaR family. These methods train models to generate explicit rationales token-by-token but score them with answer-based rewards. They remain firmly in the discrete-token regime and address a different question (how to generate rationales without labeled rationale data) than Coconut.
Latent diffusion language models. Masked or absorbing diffusion language models iteratively refine a sequence by denoising over multiple steps. Each denoising step can be viewed as a kind of latent reasoning pass over the same canvas, related conceptually to Coconut's iterative latent thoughts though architecturally distinct.[^12]
Activation-based recurrence / universal transformers. Architectures that loop a transformer block over the same hidden state for a variable number of iterations implement a form of latent computation that does not produce intermediate tokens. Coconut differs by performing latent reasoning at the sequence level rather than the layer level: each continuous thought is a new sequence position rather than an extra pass through the existing layers.[^12]
If Coconut-style latent reasoning matures into a practical alternative or supplement to CoT, it would have several non-trivial implications for the design of future language models.
Decoupling reasoning compute from token budgets. Modern reasoning models are heavily constrained by their effective token budget: longer answers cost proportionally more. Latent reasoning would let providers separate "thinking compute" (latent forward passes) from "communication compute" (decoded tokens), enabling more compact user-visible outputs without sacrificing internal deliberation.
Architectural neutrality. Coconut's mechanism (feeding the last hidden state back as input) is agnostic to the underlying sequence model. It works on a standard transformer but could in principle be applied to mixture-of-experts transformers or to alternative architectures such as Mamba, where reasoning-as-recurrence may interact with the model's own selective state mechanism in interesting ways.
Pressure on interpretability and oversight. The same property that makes Coconut attractive (richer per-slot information) makes it harder to monitor. As of 2026, much AI safety practice relies on the assumption that production reasoning models think in legible English. Widespread latent reasoning would weaken that assumption and shift more of the burden to interpretability tools that operate on hidden representations directly, including mechanistic interpretability techniques that aim to extract structure from latent vectors.[^9][^10]
Theory-experiment loop. The pairing of Coconut with formal results on continuous-thought expressiveness (Zhu et al. 2505.12514, Gozeten et al. 2505.23648) is an unusually clean example of empirical and theoretical work informing one another rapidly. It is one of the first cases where a theoretical separation between two reasoning paradigms (discrete vs. continuous CoT) has been backed by both an architectural proposal and matching empirical gains on a benchmark designed to exhibit the separation.[^6][^14]
Future direction: pretraining with latent thoughts. All of Coconut's experiments fine-tune from a pre-trained checkpoint. The authors flag pre-training with latent-thought objectives as a promising direction, since end-to-end pre-training might unlock latent reasoning strategies that curriculum-driven fine-tuning cannot reach. Several 2025-2026 follow-ups explore this, though results at scale have not yet superseded standard CoT-based reasoning models on broad benchmarks.[^1][^12]
Coconut should therefore be understood not as a finished alternative to chain-of-thought but as a proof of concept that token-level reasoning is not the only viable mode for reasoning-capable language models. Its lasting contribution is the demonstration that a minimal architectural change (route the last hidden state back as the next input embedding) is sufficient to unlock a qualitatively different kind of search behavior, at least on planning-intensive synthetic benchmarks, and to seed a rapidly developing latent-reasoning literature.