STaR (Self-Taught Reasoner)

Updated Jul 23, 2026

STaR (Self-Taught Reasoner) is a self-training method that teaches a large language model to reason from its own output. The model generates chain-of-thought rationales, keeps only the rationales that lead to correct answers, and is fine-tuned on them; repeating this loop lets it bootstrap progressively stronger reasoning from a small number of worked examples plus a larger set of problems for which only the final answers are known. It was introduced in the March 2022 paper "STaR: Bootstrapping Reasoning with Reasoning" by Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman, primarily at Stanford University, and published at NeurIPS 2022. ^[1]^[2]

In the paper's words, STaR "relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat." ^[1] The central idea is that high-quality reasoning traces are expensive to collect by hand, but a model that can already produce some correct rationales can be used to manufacture many more. STaR turns a handful of demonstration rationales into a large, self-generated training set, filtered for correctness using the known answers. The technique is now widely regarded as a conceptual ancestor of the reinforcement-learning-on-reasoning paradigm behind later reasoning models such as OpenAI o1 and DeepSeek-R1. ^[9]

What problem was STaR designed to solve?

By early 2022, chain-of-thought prompting had shown that asking a model to produce intermediate reasoning steps before its final answer substantially improves performance on arithmetic, commonsense, and symbolic tasks. ^[3] Two practical problems remained, however. First, few-shot chain-of-thought relies on the frozen model's existing ability and does not let it learn from experience. Second, the fine-tuning alternatives each have a cost: fine-tuning on human-written rationales requires large hand-annotated datasets, which are expensive and exist for very few tasks, while fine-tuning a model to output only the final answer, with no rationale, tends to underperform on multi-step problems. As the paper puts it, inducing rationale generation "currently requires either constructing massive rationale datasets or sacrificing accuracy by using only few-shot inference." ^[1]

STaR was designed to capture the benefits of rationale-based fine-tuning without a large human-annotated rationale corpus. It assumes access to a dataset of problems labeled only with correct final answers, plus a very small seed set of worked examples that include rationales. The correct final answers act as a cheap, automatic verifier: any self-generated rationale can be accepted or rejected by checking whether it arrives at the known answer.
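The "answers as verifier" idea can be illustrated with a minimal Python sketch. The `Answer:` marker, the function names, and the case-insensitive comparison are assumptions made for this example, not details taken from the paper.

```python
import re
from typing import Optional


def extract_final_answer(generation: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none.

    The 'Answer:' convention is an assumption of this sketch.
    """
    matches = re.findall(r"Answer:\s*(.+)", generation)
    return matches[-1].strip() if matches else None


def is_accepted(generation: str, gold_answer: str) -> bool:
    """Accept a self-generated rationale only if it ends at the known answer."""
    predicted = extract_final_answer(generation)
    if predicted is None:
        return False
    return predicted.lower() == gold_answer.strip().lower()
```

Note that the check looks only at the final answer and never at the reasoning steps, which is what makes it cheap and also what makes it imperfect.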

How does STaR work?

STaR is an iterative, or bootstrapping, loop. Each iteration performs the following steps. ^[1]

Rationale generation. Using a small set of few-shot examples that contain rationales, the current model is prompted to generate a step-by-step rationale and a final answer for every problem in the dataset.
Filtering. The generated answer is compared against the known correct answer. Rationales that produce the correct answer are kept; the rest are discarded.
Rationalization. For the problems the model got wrong, a second pass that supplies the correct answer as a hint recovers training signal (described below).
Fine-tuning. The model is fine-tuned on the collected set of correct rationales.
Repeat. The improved model regenerates rationales in the next iteration, and the loop continues until performance stops improving.

A subtle but important design choice is that each round of fine-tuning starts from the original pretrained model rather than continuing from the previous iteration's fine-tuned checkpoint. Restarting from the base model each time helps avoid overfitting to the growing self-generated dataset. ^[1]
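The loop, including the restart from the base model, might be sketched as follows. This is a schematic under stated assumptions, not the authors' code: `generate` and `fine_tune` are hypothetical callables supplied by the caller, and a fixed iteration count stands in for the "until performance stops improving" stopping rule.

```python
from typing import Callable, List, Optional, Tuple

Example = Tuple[str, str]     # (question, known correct answer)
Trace = Tuple[str, str, str]  # (question, rationale, answer)


def star(
    base_model: object,
    problems: List[Example],
    generate: Callable[[object, str, Optional[str]], Tuple[str, str]],
    fine_tune: Callable[[object, List[Trace]], object],
    n_iterations: int,
) -> object:
    """Schematic STaR loop; `generate` and `fine_tune` are hypothetical hooks."""
    model = base_model
    for _ in range(n_iterations):
        kept: List[Trace] = []
        for question, gold in problems:
            # Rationale generation without a hint.
            rationale, answer = generate(model, question, None)
            if answer != gold:
                # Rationalization: retry with the correct answer as a hint.
                rationale, answer = generate(model, question, gold)
            if answer == gold:
                # Filtering: keep only traces that reach the known answer.
                # The trace stores the plain question, so the hint is dropped.
                kept.append((question, rationale, answer))
        # Each round fine-tunes the ORIGINAL base model, not the last checkpoint.
        model = fine_tune(base_model, kept)
    return model


# Toy stand-ins so the sketch runs: a "model" is the set of questions it can solve.
def toy_generate(model, question, hint):
    a, b = map(int, question.split("+"))
    if hint is not None or question in model:
        return f"{a} + {b} = {a + b}", str(a + b)
    return "a guess", "0"


def toy_fine_tune(base, traces):
    return set(base) | {question for question, _, _ in traces}
```

In the toy setup, `star(set(), [("1+2", "3"), ("2+2", "4")], toy_generate, toy_fine_tune, 2)` returns a model that "knows" both questions, because rationalization supplies a trace for each one in the first round.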

In the original experiments the base model was GPT-J, a publicly available 6 billion parameter model, chosen because its checkpoint and fine-tuning code were openly available and it was large enough to produce non-trivial rationales worth bootstrapping from. ^[1]

Conceptually, STaR can be read as a simple reinforcement learning procedure: sampling rationales is like sampling actions from a policy, the 0/1 check on final-answer correctness is the reward, and fine-tuning on the successful samples approximates a policy-gradient update that raises the probability of reasoning paths that work. This places STaR in the same family as rejection sampling fine-tuning and other "learn from your own correct samples" methods. ^[1]^[6]
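The analogy can be written out with the standard score-function (REINFORCE) identity. The notation here is introduced for illustration: $p_M$ is the model's distribution, $x_i$ a problem, $y_i$ its known answer, and $(\hat{r}_i, \hat{y}_i)$ a sampled rationale and answer.

```latex
J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \, \mathbb{1}(\hat{y}_i = y_i)

\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \left[ \mathbb{1}(\hat{y}_i = y_i) \,
             \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]
```

The indicator removes the gradient contribution of every sample that misses the known answer, which is the same effect that filtering has on the fine-tuning data.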

What is rationalization in STaR?

A pure generate-and-filter loop has a structural weakness: the model can only ever learn from problems it already solves. Hard problems that it never gets right contribute no training data, so the self-generated dataset becomes biased toward easy examples and improvement stalls.

Rationalization addresses this. For each problem the model failed, STaR re-prompts it with the correct answer supplied as a hint and asks it to produce a rationale that reaches that answer. Because the goal is given, the model can reason backward to a plausible justification. If this hinted rationale reaches the correct answer, it is added to the training set, but with the hint removed from the prompt, as though the model had produced the reasoning unaided. ^[1]
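A small sketch of the hint handling, assuming a simple `Q:`/`A:` prompt template; the template, the wording of the hint, and the function names are illustrative and not taken from the paper.

```python
from typing import Dict, Optional


def build_prompt(question: str, hint: Optional[str] = None) -> str:
    """Format a question, optionally revealing the correct answer as a hint."""
    if hint is None:
        return f"Q: {question}\nA:"
    return f"Q: {question} (the correct answer is {hint})\nA:"


def rationalized_example(question: str, hinted_generation: str) -> Dict[str, str]:
    """Pair a rationale produced WITH the hint with a prompt WITHOUT it."""
    return {
        "prompt": build_prompt(question),  # hint removed for training
        "completion": hinted_generation,
    }
```

The model would be sampled with `build_prompt(question, hint=gold)`, but the stored training example uses the hint-free prompt, so at fine-tuning time the rationale looks as though it had been produced unaided.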

Rationalization expands coverage to harder problems and accelerates learning. Its effect is most dramatic on tasks the model otherwise cannot crack: in the arithmetic experiments, the paper reports that few-shot accuracy on 2-digit addition was "less than 1%," but that with rationalization, "after one fine-tuning iteration on the model's generated scratchpads, 2-digit addition improves to 32%." ^[1]

What results did STaR achieve?

STaR was evaluated on symbolic n-digit arithmetic, the GSM8K grade-school math word problem benchmark, and the CommonsenseQA multiple-choice commonsense benchmark, all using the 6B GPT-J base model. ^[1]^[4]^[5] On n-digit addition, STaR reached 89.5% overall accuracy after 16 iterations. On the two reasoning benchmarks it consistently beat both few-shot chain-of-thought and direct fine-tuning. ^[1]

Method (GPT-J 6B base)	CommonsenseQA	GSM8K
Few-shot chain-of-thought	36.6%	3.1%
Direct fine-tuning (answer only)	60.0%	5.8%
STaR without rationalization	68.8%	10.1%
STaR with rationalization	72.5%	10.7%
Fine-tuned GPT-3 (about 30x larger), for reference	73.0%	n/a

On CommonsenseQA, STaR with rationalization reached 72.5%, an improvement of about 35.9 percentage points over the few-shot chain-of-thought baseline, and it essentially matched a fine-tuned GPT-3 (about 175 billion parameters) that scored 73.0% despite being roughly 30 times larger. ^[1] The abstract summarizes this as performing "comparably to fine-tuning a 30x larger state-of-the-art language model on CommensenseQA." ^[1] On GSM8K the absolute numbers stayed low, reflecting the modest size of the 6B base model, but STaR still nearly doubled direct fine-tuning, going from 5.8% to 10.7%. ^[1]
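The comparisons in this paragraph can be checked directly from the table. The snippet below only redoes that arithmetic; the parameter counts (6 and 175 billion) are the approximate figures quoted above.

```python
# Accuracy figures (percent) copied from the table above.
cqa_few_shot_cot = 36.6
cqa_star = 72.5
cqa_gpt3 = 73.0
gsm8k_direct = 5.8
gsm8k_star = 10.7

# Gain over few-shot chain-of-thought on CommonsenseQA, in percentage points.
cqa_gain = round(cqa_star - cqa_few_shot_cot, 1)

# Remaining gap to the fine-tuned GPT-3 reference, in percentage points.
cqa_gap = round(cqa_gpt3 - cqa_star, 1)

# Relative improvement over direct fine-tuning on GSM8K.
gsm8k_ratio = round(gsm8k_star / gsm8k_direct, 2)

# Approximate size ratio between GPT-3 (175B) and GPT-J (6B).
size_ratio = round(175 / 6, 1)

print(cqa_gain, cqa_gap, gsm8k_ratio, size_ratio)  # 35.9 0.5 1.84 29.2
```

A ratio of 1.84 is why the text says "nearly doubled" rather than "doubled".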

What came after STaR, and how does it relate to o1-style reasoning?

STaR has been influential out of proportion to its modest headline accuracies, because it demonstrated a general recipe: use a model to generate candidate reasoning, filter by an automatic correctness signal, and train on what survives. The paper's own framing is that "STaR lets a model improve itself by learning from its own generated reasoning." ^[1] Several lines of work build directly on it.

ReST-EM. "Beyond Human Data" (Singh et al., 2023) recasts the STaR loop as an expectation-maximization procedure for reinforcement learning: an E-step samples multiple solutions per problem and filters them by a binary reward, and an M-step fine-tunes the base model on the survivors. Using PaLM-2 on the MATH and APPS benchmarks, it showed the approach scales with model size and can surpass training on human data alone. ^[6]
V-STaR. "Training Verifiers for Self-Taught Reasoners" (Hosseini et al., COLM 2024) keeps both the correct and the incorrect generations and uses them to train a separate verifier with DPO. The verifier then ranks candidate solutions at inference time, yielding 4 to 17 point gains over STaR-style self-improvement on math and code benchmarks with Llama-2 models. ^[7]
Quiet-STaR. "Language Models Can Teach Themselves to Think Before Speaking" (Zelikman et al., 2024) generalizes STaR from question answering to arbitrary text, training the model to generate short token-level "thoughts" that help it predict upcoming tokens, rather than only producing rationales for labeled questions. ^[8]
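As a rough illustration of the V-STaR idea described above, the sketch below shows how correct and incorrect samples for one problem could be turned into preference pairs, and how a scoring function could pick among candidates at inference time. The function names and the toy scorer are assumptions; no actual DPO training is shown.

```python
from itertools import product
from typing import Callable, List, Tuple


def preference_pairs(samples: List[Tuple[str, bool]]) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from (solution, is_correct) samples."""
    correct = [solution for solution, ok in samples if ok]
    wrong = [solution for solution, ok in samples if not ok]
    return list(product(correct, wrong))


def best_of_k(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the (hypothetical) verifier scores highest."""
    return max(candidates, key=score)
```

The contrast with plain STaR is that the incorrect samples are no longer thrown away: they become the "rejected" half of each pair.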

More broadly, STaR is frequently cited as a conceptual precursor to the reinforcement-learning-on-chain-of-thought paradigm that powers modern reasoning models. Systems such as OpenAI o1 (2024) and DeepSeek-R1 (2025) scale up the same underlying intuition, training on self-generated reasoning that is rewarded for reaching correct answers, while adding large-scale reinforcement learning, search, and substantial test-time compute. ^[9] Related ideas appear in rejection-sampling fine-tuning, self-consistency decoding, and self-play methods such as SPIN.

What are the limitations of STaR?

STaR has several well-understood limitations. ^[1]

It requires an automatic correctness signal. Filtering depends on knowing the correct final answer, or having a reliable verifier, so STaR applies cleanly only to tasks with checkable answers such as math, code, or multiple-choice questions, and less directly to open-ended generation.
Right answer, wrong reasoning. Filtering on the final answer accepts any rationale that happens to reach it. On multiple-choice tasks like CommonsenseQA, where random guessing succeeds one time in five, some accepted rationales are unfaithful or contain invalid steps that nonetheless land on the correct choice, so the reasoning the model learns may be partly spurious.
Rationalization can teach post-hoc justification. Because rationalization supplies the answer in advance, the resulting traces can be after-the-fact justifications rather than genuine derivations, which risks reinforcing the appearance of reasoning over its substance.
Cold start and capability floor. Bootstrapping only works if the base model already solves some problems; below a certain capability the loop has nothing to build on, and gains are bounded by the base model and the difficulty distribution of the data.
Self-reinforcement and cost. Training repeatedly on a model's own outputs can amplify its existing biases, and each iteration's generation plus fine-tuning is computationally expensive.
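The "right answer, wrong reasoning" concern can be made concrete with a back-of-the-envelope model. It assumes, purely for illustration, that the model either reasons its way to the correct answer or guesses uniformly among the choices; real models are not this clean.

```python
def spurious_fraction(p_reasoned: float, n_choices: int = 5) -> float:
    """Share of accepted rationales that are lucky guesses under a toy model.

    p_reasoned: probability that the model genuinely solves a problem.
    Otherwise it is assumed to guess uniformly among n_choices options.
    """
    lucky = (1.0 - p_reasoned) / n_choices
    return lucky / (p_reasoned + lucky)
```

Under these assumptions, a model that genuinely solves half of a five-choice benchmark would still have about one in six of its accepted rationales come from guesses, since `spurious_fraction(0.5)` is 0.1 / 0.6.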

References

[1] Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman. "STaR: Bootstrapping Reasoning with Reasoning." arXiv:2203.14465, March 2022. https://arxiv.org/abs/2203.14465
[2] Advances in Neural Information Processing Systems 35 (NeurIPS 2022), conference proceedings. https://proceedings.neurips.cc/paper_files/paper/2022/file/639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf
[3] Jason Wei et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv:2201.11903, 2022. https://arxiv.org/abs/2201.11903
[4] Karl Cobbe et al. "Training Verifiers to Solve Math Word Problems" (GSM8K). arXiv:2110.14168, 2021. https://arxiv.org/abs/2110.14168
[5] Alon Talmor et al. "CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge." arXiv:1811.00937, 2019. https://arxiv.org/abs/1811.00937
[6] Avi Singh et al. "Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models" (ReST-EM). Transactions on Machine Learning Research, 2023. arXiv:2312.06585. https://arxiv.org/abs/2312.06585
[7] Arian Hosseini, Xingdi Yuan, Nikolay Malkin, Aaron Courville, Alessandro Sordoni, Rishabh Agarwal. "V-STaR: Training Verifiers for Self-Taught Reasoners." COLM 2024. arXiv:2402.06457. https://arxiv.org/abs/2402.06457
[8] Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman. "Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking." arXiv:2403.09629, 2024. https://arxiv.org/abs/2403.09629
[9] OpenAI. "Learning to Reason with LLMs" (o1). September 2024. https://openai.com/index/learning-to-reason-with-llms/
