Quiet-STaR
Last reviewed
Jun 8, 2026
Sources
10 citations
Review status
Source-backed
Revision
v1 · 1,910 words
Quiet-STaR is a self-supervised training method that teaches a large language model to generate short, token-level internal "thoughts," or rationales, that help it predict the text that follows, rather than producing reasoning only when answering an explicit question. It generalizes STaR, the Self-Taught Reasoner, from curated question-answering datasets to arbitrary web text: after every token in a passage the model writes a brief rationale, uses it to sharpen its next-token prediction, and is rewarded by a reinforcement learning signal whenever that rationale makes the real continuation more likely. [1]
The technique was introduced by Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, and Noah Goodman. Zelikman, Shao, Haber, and Goodman were at Stanford University; Harik and Jayasiri were at the startup Notbad AI Inc. [1] The paper, "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking," was released as an arXiv preprint in March 2024 and presented at the Conference on Language Modeling (COLM) later that year. [1][3] The name reflects running STaR "quietly" in the background of ordinary language modeling, training the model to think before it speaks. Applied to a Mistral 7B base model with no task-specific fine-tuning, Quiet-STaR raised zero-shot accuracy on the GSM8K grade-school math benchmark and the CommonsenseQA reasoning benchmark purely as a side effect of continued pretraining on web text. [1]
STaR, introduced by Zelikman and colleagues in 2022, showed that a model can bootstrap its own reasoning on a question-answering dataset: it samples chain-of-thought rationales, keeps the ones that reach the known correct answer, fine-tunes on them, and repeats the loop on progressively harder problems. [2] The method works, but it inherits two structural limits. It needs a curated dataset of questions paired with checkable answers, and the reasoning it learns only ever covers the narrow slice of tasks those datasets contain. High-quality reasoning corpora are expensive to build and will always be incomplete. [1][2]
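The filtering half of the STaR loop described above can be illustrated with a toy sketch. Everything here is hypothetical: `star_filter`, `toy_sampler`, and the addition problems stand in for a real language model and dataset, and the fine-tuning step that would follow is omitted.

```python
import random

def star_filter(problems, sample_rationale, n_samples=4, seed=0):
    """One STaR-style pass: keep rationales whose final answer matches the label."""
    rng = random.Random(seed)
    kept = []
    for question, gold in problems:
        for _ in range(n_samples):
            rationale, answer = sample_rationale(question, rng)
            if answer == gold:
                kept.append((question, rationale, answer))
                break  # one correct rationale per question is enough for this sketch
    return kept

def toy_sampler(question, rng):
    """Hypothetical stand-in for a model: adds two numbers, sometimes off by one."""
    a, b = question
    guess = a + b + rng.choice([-1, 0, 0, 1])
    return f"{a} plus {b} is {guess}", guess

problems = [((2, 3), 5), ((10, 7), 17), ((4, 4), 8)]
fine_tune_set = star_filter(problems, toy_sampler)
for question, rationale, answer in fine_tune_set:
    print(question, "->", rationale)
```

The filter only works because each question has a checkable label, which is the dependence on curated data that the paragraph above identifies.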
Quiet-STaR removes the dependence on labeled questions. Its authors observe that reasoning is implicit in almost all text: the unstated steps between the lines of a proof, the motive behind a line of dialogue, the arithmetic a financial report takes for granted. If a model could infer those hidden rationales, every document would become a reasoning lesson rather than just a string to memorize. [1] The paper frames this as a direct extension of the view that "language models are unsupervised multitask learners." Instead of specializing on mathematical question answering, the model learns to reason in whatever way helps it predict the next stretch of ordinary internet text. [1]
A close predecessor is "pause token" training (Goyal et al., 2023), in which a model is given extra blank tokens so it can perform silent computation before committing to an answer. Quiet-STaR can be read as a richer version in which the inserted tokens carry actual generated content and are explicitly optimized to be useful. [1][8]
Quiet-STaR recasts the two-step STaR loop, generate then fine-tune on correct rationales, as a continuous three-part procedure applied during pretraining. The authors label the three parts think, talk, and learn. [1]
Generating a separate rationale after each of the thousands of tokens in a document would be hopelessly slow if done one at a time. Quiet-STaR instead produces all of the thoughts in parallel, advancing every thought by one token per batched forward pass. The trick is a custom diagonal attention mask: each token inside a thought may attend to the real text that precedes it and to the earlier tokens of its own thought, but not to the tokens of the other "counterfactual" thoughts being generated at neighboring positions. Forward passes are cached so that a single inference call extends every thought simultaneously. [1]
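The attention rule above can be written down as a boolean mask. The sketch below is an illustration only, not the released implementation: the function names and the flat layout (all text tokens first, then each position's thought tokens) are assumptions made for readability.

```python
def build_thought_mask(num_text, thought_len):
    """Return (labels, mask); mask[q][k] is True when query q may attend to key k.

    labels[n] is ("text", i) for real token i, or ("thought", i, j) for the
    j-th token of the thought generated after real token i.
    """
    labels = [("text", i) for i in range(num_text)]
    labels += [("thought", i, j) for i in range(num_text) for j in range(thought_len)]

    def allowed(q, k):
        if q[0] == "text":
            # Real text uses ordinary causal attention and never sees thoughts.
            return k[0] == "text" and k[1] <= q[1]
        if k[0] == "text":
            # A thought sees the real text up to and including its own position.
            return k[1] <= q[1]
        # A thought sees only earlier tokens of the same thought, so the
        # thought-to-thought part of the mask is diagonal across positions.
        return k[1] == q[1] and k[2] <= q[2]

    mask = [[allowed(q, k) for k in labels] for q in labels]
    return labels, mask

def visible_keys(labels, mask, query):
    """List the labels that a given query token is allowed to attend to."""
    row = mask[labels.index(query)]
    return [label for label, ok in zip(labels, row) if ok]

labels, mask = build_thought_mask(num_text=3, thought_len=2)
print(visible_keys(labels, mask, ("thought", 1, 1)))
```

For the second token of the thought at position 1, the visible set is the text at positions 0 and 1 plus both tokens of that same thought; the thoughts at positions 0 and 2 stay hidden.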
Two learnable meta-tokens, written as a start-of-thought token and an end-of-thought token, bracket each rationale and tell the model when to begin reasoning and when to resume ordinary prediction. Learning these embeddings from scratch is difficult because tokens are discrete, so the start-of-thought embedding is initialized from the embedding of the em dash token, a punctuation mark that in natural text often precedes a pause or an elaboration. To help these few parameters converge quickly, their gradients are scaled up by a large factor (a weight of 1e2 in the released configuration). [1]
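As a rough illustration of the two tricks in this paragraph, the sketch below copies an existing embedding to initialize the start-of-thought vector and applies a gradient weight of 1e2 in an update step. The tiny embedding table, the token id, and the use of plain SGD are all assumptions for the example; the source does not say how the weight interacts with the actual optimizer.

```python
def init_start_of_thought(embedding_table, em_dash_id):
    """Start the new meta-token from a copy of the em dash token's embedding."""
    return list(embedding_table[em_dash_id])

def sgd_update(params, grads, lr, grad_weight=1.0):
    """Plain SGD step with the gradient multiplied by grad_weight (assumed optimizer)."""
    return [p - lr * grad_weight * g for p, g in zip(params, grads)]

# Hypothetical 3-dimensional embedding table; id 1 plays the role of the em dash.
embedding_table = {0: [0.1, -0.2, 0.3], 1: [0.5, 0.5, -0.5]}
EM_DASH_ID = 1

start_embedding = init_start_of_thought(embedding_table, EM_DASH_ID)
grads = [0.2, -0.1, 0.0]

plain_step = sgd_update(start_embedding, grads, lr=1e-3)
boosted_step = sgd_update(start_embedding, grads, lr=1e-3, grad_weight=1e2)
print(plain_step, boosted_step)
```

Under these assumptions the boosted update moves the embedding one hundred times as far per step, which is the intended effect of giving a handful of new parameters a larger gradient weight than the rest of the model.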
Once a thought ends, the model holds two predictions for the upcoming token: one conditioned on the thought and one from the base model that ignores it. A small mixing head interpolates between them. It is a three-layer MLP with ReLU activations that takes the concatenated hidden states of the with-thought and without-thought paths and outputs a single scalar weight, which blends the two sets of next-token logits. This "talk" step lets the model rely on a thought only when it helps and otherwise fall back to its ordinary prediction. That keeps training stable in the early phase, before the thoughts have learned to be useful, and limits the distribution shift introduced by inserting thoughts into the stream. [1]
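A minimal sketch of such a mixing head follows, in plain Python with random weights. Two details are assumptions not stated above: the scalar is passed through a sigmoid so it lies between 0 and 1, and a weight of 1 means "use the with-thought logits only". The layer sizes and helper names are also illustrative.

```python
import math
import random

def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]

def relu(x):
    return [max(0.0, v) for v in x]

def make_layer(rng, n_in, n_out):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

def mixing_weight(h_base, h_thought, layers):
    """Three-layer MLP over the concatenated hidden states, returning one scalar."""
    (W1, b1), (W2, b2), (W3, b3) = layers
    x = h_base + h_thought               # list concatenation
    x = relu(linear(x, W1, b1))
    x = relu(linear(x, W2, b2))
    (z,) = linear(x, W3, b3)
    return 1.0 / (1.0 + math.exp(-z))    # assumption: sigmoid squashing

def mix_logits(logits_base, logits_thought, w):
    """Interpolate the two sets of next-token logits with weight w."""
    return [(1.0 - w) * lb + w * lt for lb, lt in zip(logits_base, logits_thought)]

rng = random.Random(0)
hidden = 4
layers = [make_layer(rng, 2 * hidden, 8), make_layer(rng, 8, 8), make_layer(rng, 8, 1)]
h_base = [rng.uniform(-1, 1) for _ in range(hidden)]
h_thought = [rng.uniform(-1, 1) for _ in range(hidden)]

w = mixing_weight(h_base, h_thought, layers)
mixed = mix_logits([2.0, 0.0, -1.0], [0.5, 1.5, -1.0], w)
print(w, mixed)
```

With a weight of 0 the blend reduces exactly to the base logits, which is the fallback behavior the paragraph describes for thoughts that do not help.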
Because there is no labeled answer to filter on, Quiet-STaR cannot simply keep "correct" rationales the way STaR does. Instead it scores each thought by how much it improves prediction of the genuine continuation and optimizes the thoughts with REINFORCE, a policy-gradient algorithm. At each position the model samples several thoughts; a thought's reward is the log-probability it assigns to the true future tokens minus the average over the sampled thoughts at that position, so only thoughts that beat their peers are reinforced. The authors found it helped to discard negative rewards entirely. The loss is deliberately non-myopic: rather than scoring a thought on the single next token, a teacher-forcing trick evaluates it against several upcoming tokens (four tokens ahead in the main runs), which rewards rationales that set up a longer span of text and, qualitatively, produced more coherent reasoning. The gradient updates both the base model weights and the start-of-thought and end-of-thought embeddings. [1]
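The reward described above can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's code: the log-probabilities of the lookahead tokens are assumed to be summed into one number per thought, "discarding negative rewards" is modeled as clamping them to zero, and the function names are hypothetical.

```python
def thought_rewards(future_logprobs):
    """Reward per sampled thought at one position.

    future_logprobs[k] is the (assumed summed) log-probability of the true
    upcoming tokens when the model conditions on thought k.
    """
    baseline = sum(future_logprobs) / len(future_logprobs)
    # Assumption: negative rewards are dropped by clamping at zero.
    return [max(0.0, lp - baseline) for lp in future_logprobs]

def reinforce_loss(future_logprobs, thought_logprobs):
    """REINFORCE surrogate loss: minus reward times log-prob of each thought.

    In an autograd framework the rewards would be treated as constants, so
    the gradient flows only through thought_logprobs.
    """
    rewards = thought_rewards(future_logprobs)
    return -sum(r * lp for r, lp in zip(rewards, thought_logprobs))

# Three sampled thoughts at one position (made-up numbers).
future_logprobs = [-4.0, -6.0, -5.0]    # how well each thought predicts the real text
thought_logprobs = [-2.0, -3.0, -1.0]   # log-prob of sampling each thought

print(thought_rewards(future_logprobs))                   # [1.0, 0.0, 0.0]
print(reinforce_loss(future_logprobs, thought_logprobs))  # 2.0
```

Only the first thought beats the average of its peers, so it alone contributes to the loss; minimizing that loss raises the probability of sampling it again.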
The main configuration is summarized below.
| Component | Choice in the paper |
|---|---|
| Base model | Mistral 7B (base, no instruction tuning) |
| Training corpora | OpenWebMath (primary), C4 |
| Thought length | 12 tokens (OpenWebMath runs), 16 tokens (C4 runs); gains grew with longer thoughts |
| Lookahead in the loss | 4 true tokens ahead |
| Thoughts sampled per position | 2 to 4 |
| Start and end-of-thought tokens | learnable meta-tokens; start token initialized from the em dash token; gradient weight 1e2 |
| Mixing head | three-layer MLP with ReLU; input is the concatenation of with-thought and without-thought hidden states; output is one scalar weight |
| Objective | REINFORCE policy gradient, non-myopic, negative rewards excluded |
| Hardware | one node of eight 80GB H100 GPUs |
Every headline gain is zero-shot. The model is never fine-tuned on GSM8K or CommonsenseQA; the improvement comes only from continued pretraining on web text and is then measured on the held-out benchmarks. Trained on OpenWebMath, a math-heavy web crawl chosen because the authors expected it to contain a high density of tokens that reward reasoning, the Mistral 7B base model's direct-answer accuracy rose on both benchmarks. Training on the more general C4 corpus produced smaller but still positive gains. [1][5][6][7]
| Benchmark (Mistral 7B, zero-shot) | Base model | Quiet-STaR (OpenWebMath) | Quiet-STaR (C4) |
|---|---|---|---|
| GSM8K | 5.9% | 10.9% | 8.1% |
| CommonsenseQA | 36.3% | 47.2% | 42.6% |
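The percentage-point differences implied by the table can be recomputed directly. The dictionary below simply transcribes the table; the variable and function names are illustrative.

```python
# Zero-shot accuracies (percent) transcribed from the table above.
scores = {
    "GSM8K": {"base": 5.9, "openwebmath": 10.9, "c4": 8.1},
    "CommonsenseQA": {"base": 36.3, "openwebmath": 47.2, "c4": 42.6},
}

def point_gains(scores):
    """Percentage-point gain of each Quiet-STaR run over the base model."""
    return {
        bench: {
            run: round(acc - row["base"], 1)
            for run, acc in row.items()
            if run != "base"
        }
        for bench, row in scores.items()
    }

gains = point_gains(scores)
for bench, row in gains.items():
    print(bench, row)
```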
With OpenWebMath training, the gains are 5.0 percentage points on GSM8K and 10.9 on CommonsenseQA, from continued pretraining alone. [1] The authors report two further findings. First, accuracy grows steadily with the length of the thoughts the model is allowed to generate, suggesting that more deliberate internal reasoning translates into better direct answers. Second, the thoughts disproportionately help on tokens that are otherwise hard to predict, which is exactly where reasoning should matter, and they lower perplexity on those difficult tokens in natural text. [1] By comparison, pause-token training under a similar setup gave only marginal gains on CommonsenseQA and actually hurt GSM8K, which the authors read as evidence that multi-token, content-bearing rationales reason more effectively than blank filler tokens. [1][8]
Quiet-STaR was, to its authors' knowledge, the first method to train a language model to reason from general unstructured text rather than from curated reasoning tasks. [1] It appeared in March 2024, several months before "think before answering" became central to commercial systems. When OpenAI released o1 in September 2024 and the wave of reasoning models that followed made test-time compute a primary scaling axis, Quiet-STaR and STaR were frequently cited as conceptual precursors of the idea that a model should spend tokens thinking before it commits to an answer. [10] Press coverage at the time described the result as giving language models an "inner monologue." [9] Its lead author, Eric Zelikman, and senior author, Noah Goodman, also wrote the original STaR paper. [1][2]
The paper is candid about its limitations. The dominant one is compute. Generating a multi-token rationale after every input token multiplies the cost of both inference and training, and the authors present compute-adjusted plots precisely because the raw overhead is large. [1] Quiet-STaR was demonstrated only on a single 7B model and was not trained from scratch; the authors expect, but did not show, that the same technique would help larger and stronger models more, as gains from reasoning often scale with model capability. [1] The implementation also thinks after every token indiscriminately, with no learned mechanism to decide when a thought is worth generating or how long it should run. The authors suggest moving the mixing head before the prediction so its weight could gate computation dynamically, but note that judging a thought's usefulness in advance is harder than judging it after the fact. [1] Finally, as with STaR, there is no guarantee that a verbalized thought faithfully reflects the model's internal computation, and nothing prevents the model from reinforcing biased or unsound reasoning if it happens to improve prediction. [1] Several of these threads, namely dynamic thinking budgets, scaling to larger models, and rewarding reasoning that improves prediction, recur in the reasoning-model research that followed.