STaR (Self-Taught Reasoner)
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,627 words
STaR (Self-Taught Reasoner) is a self-training method that improves the reasoning ability of a large language model by having it generate its own chain-of-thought rationales, keeping the rationales that lead to correct answers, and fine-tuning on them. The procedure repeats, so the model bootstraps progressively stronger reasoning from a small number of worked examples plus a larger dataset of problems for which only the final answers are known. It was introduced in the 2022 paper "STaR: Bootstrapping Reasoning with Reasoning" by Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman, primarily at Stanford University, and published at NeurIPS 2022. [1][2]
The central idea is that high-quality reasoning traces are expensive to collect by hand, but a model that can already produce some correct rationales can be used to manufacture many more. STaR turns a handful of demonstration rationales into a large, self-generated training set, filtered for correctness using the known answers. The technique is now widely regarded as a conceptual ancestor of the reinforcement-learning-on-reasoning paradigm behind later "reasoning models" such as OpenAI o1 and DeepSeek-R1. [9]
By early 2022, chain-of-thought prompting had shown that asking a model to produce intermediate reasoning steps before its final answer substantially improves performance on arithmetic, commonsense, and symbolic tasks. [3] Two practical problems remained, however. First, few-shot chain-of-thought relies on the frozen model's existing ability and does not let it learn from experience. Second, the alternative of fine-tuning on human-written rationales requires large hand-annotated datasets, which are costly and exist for very few tasks. Fine-tuning a model to output only the final answer, with no rationale, avoids the annotation cost but tends to underperform on multi-step problems.
STaR was designed to capture the benefits of rationale-based fine-tuning without a large human-annotated rationale corpus. It assumes access to a dataset of problems labeled only with correct final answers, plus a very small seed set of worked examples that include rationales. The correct final answers act as a cheap, automatic verifier: any self-generated rationale can be accepted or rejected by checking whether it arrives at the known answer.
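STaR is an iterative, or bootstrapping, loop. In each iteration, the model is prompted with the few worked examples and generates a rationale and final answer for every problem in the dataset; rationales that reach the known correct answer are kept and the rest are discarded; for problems the model failed, it retries with the correct answer supplied as a hint (rationalization, described below); and the model is then fine-tuned on the kept rationales. [1]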
A subtle but important design choice is that each round of fine-tuning starts from the original pretrained model rather than continuing from the previous iteration's fine-tuned checkpoint. Restarting from the base model each time helps avoid overfitting to the growing self-generated dataset. [1]
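The generate-and-filter loop, including the restart from the base model, can be sketched as follows. This is a minimal illustration and not the authors' code: the `star` function, the `generate` and `fine_tune` callables, and the toy "skill level" model are all hypothetical stand-ins, and rationalization is omitted for brevity.

```python
from typing import Callable, List, Tuple

Problem = Tuple[str, str]        # (question, known final answer)
Example = Tuple[str, str, str]   # (question, rationale, answer)


def star(base_model, problems: List[Problem], generate: Callable,
         fine_tune: Callable, n_iterations: int):
    """Generate-and-filter loop; returns the final model and kept counts."""
    model = base_model
    history = []
    for _ in range(n_iterations):
        kept: List[Example] = []
        for question, answer in problems:
            rationale, predicted = generate(model, question)
            # The known answer acts as a cheap automatic verifier.
            if predicted == answer:
                kept.append((question, rationale, answer))
        history.append(len(kept))
        # Each round fine-tunes from the base model, not the last checkpoint.
        model = fine_tune(base_model, kept)
    return model, history


# Toy stand-ins (assumptions for illustration only): the "model" is an
# integer skill level that can add numbers with at most that many digits.
def toy_generate(skill: int, question: str) -> Tuple[str, str]:
    a, b = question.split("+")
    if max(len(a), len(b)) <= skill:
        total = int(a) + int(b)
        return f"{a} plus {b} is {total}", str(total)
    return "unsure", "?"


def toy_fine_tune(base_skill: int, data: List[Example]) -> int:
    # Pretend training generalizes one digit beyond the hardest kept example.
    if not data:
        return base_skill
    hardest = max(max(len(part) for part in q.split("+")) for q, _, _ in data)
    return max(base_skill, hardest + 1)


problems = [("2+3", "5"), ("12+30", "42"), ("123+456", "579")]
model, history = star(1, problems, toy_generate, toy_fine_tune, n_iterations=3)
print(history)  # number of rationales kept in each iteration
```

In a real setting `generate` would sample from a language model with few-shot prompts and `fine_tune` would run gradient-based training; the toy versions exist only so the control flow can be executed end to end.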
In the original experiments the base model was GPT-J, a 6-billion-parameter model, chosen because its checkpoint and fine-tuning code were openly available and because it was large enough to produce non-trivial rationales worth bootstrapping from. [1]
Conceptually, STaR can be read as a simple reinforcement learning procedure: sampling rationales is like sampling actions from a policy, the 0/1 check on final-answer correctness is the reward, and fine-tuning on the successful samples approximates a policy-gradient update that raises the probability of reasoning paths that work. This places STaR in the same family as rejection sampling fine-tuning and other "learn from your own correct samples" methods. [1][6]
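This correspondence can be written out explicitly. In the sketch below the notation is chosen for illustration: $M$ is the model, $x_i$ a problem, $y_i$ its known answer, and $\hat{r}_i$, $\hat{y}_i$ a sampled rationale and answer. The objective is the expected 0/1 reward, and its gradient follows from the standard log-derivative identity.

```latex
J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \, \mathbb{1}(\hat{y}_i = y_i)

\nabla J(M, X, Y) = \sum_i \mathbb{E}_{\hat{r}_i, \hat{y}_i \sim p_M(\cdot \mid x_i)}
             \left[ \mathbb{1}(\hat{y}_i = y_i) \cdot
             \nabla \log p_M(\hat{y}_i, \hat{r}_i \mid x_i) \right]
```

The indicator zeroes out the gradient for every sample that misses the known answer, which is what filtering does. Fine-tuning on the surviving samples with a log-likelihood loss then follows this gradient only approximately, since a single sample per problem stands in for the expectation and several optimization steps are taken on the same batch of samples.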
A pure generate-and-filter loop has a structural weakness: the model can only ever learn from problems it already solves. Hard problems that it never gets right contribute no training data, so the self-generated dataset becomes biased toward easy examples and improvement stalls.
Rationalization addresses this. For each problem the model failed, STaR re-prompts it with the correct answer supplied as a hint and asks it to produce a rationale that reaches that answer. Because the goal is given, the model can reason backward to a plausible justification. If this hinted rationale reaches the correct answer, it is added to the training set, but with the hint removed from the prompt, as though the model had produced the reasoning unaided. [1]
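A minimal sketch of the rationalization step is shown below. The `rationalize` helper, the wording of the hint, and the toy generator are hypothetical and chosen only for illustration; the point is that the hint appears in the prompt used for generation but not in the stored training example.

```python
from typing import Callable, Optional, Tuple

Example = Tuple[str, str, str]   # (question, rationale, answer)


def rationalize(generate: Callable[[str], Tuple[str, str]],
                question: str, answer: str) -> Optional[Example]:
    """Retry a failed problem with the answer as a hint; None if it still fails."""
    # Hypothetical hint format; the generator sees the correct answer.
    hinted_prompt = f"{question}\n(Hint: the correct answer is {answer}.)"
    rationale, predicted = generate(hinted_prompt)
    if predicted != answer:
        return None
    # Store the example under the ORIGINAL question, with the hint removed,
    # as though the rationale had been produced unaided.
    return (question, rationale, answer)


# Toy generator (an assumption): it can only answer when given a hint.
def toy_generate(prompt: str) -> Tuple[str, str]:
    marker = "the correct answer is "
    if marker in prompt:
        hinted = prompt.split(marker, 1)[1].split(".", 1)[0]
        return f"Working backward, the result must be {hinted}", hinted
    return "unsure", "?"


example = rationalize(toy_generate, "123+456", "579")
print(example)
```

Examples returned this way would be appended to the same training set as the rationales that passed the unhinted filter.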
Rationalization expands coverage to harder problems and accelerates learning. Its effect is most dramatic on tasks the model otherwise cannot crack: in the arithmetic experiments, adding rationalization lifted two-digit addition accuracy from under 1% to 32%. [1]
STaR was evaluated on symbolic n-digit arithmetic, the GSM8K grade-school math word problem benchmark, and the CommonsenseQA multiple-choice commonsense benchmark, all using the 6B GPT-J base model. [1][4][5] On n-digit addition, STaR reached 89.5% overall accuracy after 16 iterations. On the two reasoning benchmarks it consistently beat both few-shot chain-of-thought and direct fine-tuning.
| Method (GPT-J 6B base) | CommonsenseQA | GSM8K |
|---|---|---|
| Few-shot chain-of-thought | 36.6% | 3.1% |
| Direct fine-tuning (answer only) | 60.0% | 5.8% |
| STaR without rationalization | 68.8% | 10.1% |
| STaR with rationalization | 72.5% | 10.7% |
| Fine-tuned GPT-3 (about 30x larger), for reference | 73.0% | n/a |
On CommonsenseQA, STaR with rationalization reached 72.5%, an improvement of 35.9 percentage points over the few-shot chain-of-thought baseline. That essentially matched the 73.0% of a fine-tuned GPT-3, which at about 175 billion parameters is roughly 30 times larger than GPT-J. [1] On GSM8K the absolute numbers stayed low, reflecting the modest size of the 6B base model, but STaR still nearly doubled the accuracy of direct fine-tuning, from 5.8% to 10.7%. [1]
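The comparisons in this paragraph follow directly from the table and can be checked with a few lines of arithmetic. The variable names below are illustrative; the accuracy figures and parameter counts are the ones quoted above.

```python
# Accuracies (percent) copied from the results table.
commonsenseqa = {"few_shot_cot": 36.6, "direct_ft": 60.0,
                 "star_no_rationalization": 68.8, "star": 72.5, "gpt3_ft": 73.0}
gsm8k = {"few_shot_cot": 3.1, "direct_ft": 5.8,
         "star_no_rationalization": 10.1, "star": 10.7}

# Percentage-point gain of STaR over few-shot chain-of-thought.
cqa_gain = round(commonsenseqa["star"] - commonsenseqa["few_shot_cot"], 1)

# Remaining gap to the much larger fine-tuned GPT-3.
cqa_gap = round(commonsenseqa["gpt3_ft"] - commonsenseqa["star"], 1)

# Relative improvement of STaR over direct fine-tuning on GSM8K.
gsm8k_ratio = round(gsm8k["star"] / gsm8k["direct_ft"], 2)

# Approximate size ratio: about 175B parameters versus 6B.
size_ratio = round(175 / 6, 1)

print(cqa_gain, cqa_gap, gsm8k_ratio, size_ratio)
```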
STaR has been influential out of proportion to its modest headline accuracies, because it demonstrated a general recipe: use a model to generate candidate reasoning, filter by an automatic correctness signal, and train on what survives. Several lines of work build directly on it.
More broadly, STaR is frequently cited as a conceptual precursor to the reinforcement-learning-on-chain-of-thought paradigm that powers modern reasoning models. Systems such as OpenAI o1 (2024) and DeepSeek-R1 (2025) scale up the same underlying intuition, training on self-generated reasoning that is rewarded for reaching correct answers, while adding large-scale reinforcement learning, search, and substantial test-time compute. [9] Related ideas appear in rejection-sampling fine-tuning, self-consistency decoding, and self-play methods such as SPIN.
STaR has several well-understood limitations. [1]