# Reasoning models

> Source: https://aiwiki.ai/wiki/reasoning_models
> Updated: 2026-06-20
> Categories: Artificial Intelligence, Large Language Models, Machine Learning
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**Reasoning models** are a class of [large language models](/wiki/large_language_model) trained, typically through [reinforcement learning](/wiki/reinforcement_learning) on long [chain-of-thought](/wiki/chain_of_thought) traces, to perform an extended internal deliberation phase before producing a final answer. Unlike conventional LLMs, which respond in a single autoregressive pass, reasoning models spend additional inference compute generating intermediate "thinking" tokens that may include planning, self-criticism, backtracking, and verification. The category was crystallized by [OpenAI](/wiki/openai)'s release of [o1](/wiki/o1) in September 2024 and grew rapidly across labs over the following eighteen months, becoming the dominant paradigm at the frontier by 2026.[1][2] The defining result was that the extra deliberation translated into large accuracy gains on hard problems: on the AIME 2024 mathematics competition, o1 averaged 74% with a single sample, compared with 12% for the conventional model [GPT-4o](/wiki/gpt_4o).[1]

Reasoning models have produced large gains on benchmarks that resist single-pass solutions, including [AIME](/wiki/aime_2024) competition mathematics, [GPQA Diamond](/wiki/gpqa_diamond), [FrontierMath](/wiki/frontiermath), Codeforces, [SWE-bench Verified](/wiki/swe_bench_verified), [ARC-AGI](/wiki/arc_agi), and [Humanity's Last Exam](/wiki/humanity_s_last_exam). Their economics differ from earlier models because output token volume is much higher and inference compute, rather than additional training data, becomes the dominant scaling axis. The category includes both closed systems (the OpenAI o-series, [Anthropic](/wiki/anthropic)'s extended thinking models, [Google DeepMind](/wiki/google_deepmind)'s Gemini Thinking and Deep Think modes, [xAI](/wiki/xai)'s Grok Thinking variants) and open-weight systems ([DeepSeek-R1](/wiki/deepseek_r1), [Alibaba](/wiki/alibaba)'s QwQ and Qwen3, Moonshot AI's Kimi K2 Thinking, R1 distills into Qwen and Llama). Many of the largest reasoning models, open and closed, are built on a [mixture-of-experts](/wiki/mixture_of_experts) architecture, which lets them scale total parameter count while keeping per-token inference cost bounded.[3][4][5]

The rise of reasoning models has also opened active debates about whether the visible reasoning trace faithfully reflects the underlying decision process, whether long chains of thought introduce new failure modes (hallucinations, reward hacking, problem-complexity collapse), and whether the apparent gains on hard benchmarks reflect genuine generalization or sophisticated pattern matching. The Apple research group's June 2025 paper "The Illusion of Thinking," Anthropic's 2025 work on chain-of-thought faithfulness and monitorability, and several large-scale audits in late 2025 and 2026 have shaped how the field interprets reasoning model behavior.[6][7]

## Infobox

| Reasoning models | |
| --- | --- |
| Type | Category of [large language models](/wiki/large_language_model) |
| First introduced | September 12, 2024 ([OpenAI](/wiki/openai) o1-preview) |
| Defining mechanism | Internal extended chain-of-thought reasoning before final answer |
| Primary training methodology | Large-scale [reinforcement learning](/wiki/reinforcement_learning) on chain-of-thought, often with rule-based rewards |
| Primary scaling axis | [Test-time compute](/wiki/test_time_compute) (inference token budget) |
| Common training algorithms | [GRPO](/wiki/grpo), PPO variants, RLVR pipelines |
| Key examples | [OpenAI o1](/wiki/o1), [OpenAI o3](/wiki/o3), [o4-mini](/wiki/o4_mini), [GPT-5](/wiki/gpt-5), [DeepSeek-R1](/wiki/deepseek_r1), QwQ, Qwen3, Kimi K2 Thinking, [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet), [Claude Opus 4](/wiki/claude_opus_4), [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5), [Claude Opus 4.5](/wiki/claude_opus_4_5), [Gemini 2.5 Pro](/wiki/gemini_2_5_pro), [Gemini 3 Pro](/wiki/gemini_3_pro), [Grok 3](/wiki/grok_3), [Grok 4](/wiki/grok_4) |
| Predecessor techniques | [Chain-of-thought](/wiki/chain_of_thought) prompting, self-consistency, tree of thoughts |
| Open vs closed | Both ([DeepSeek-R1](/wiki/deepseek_r1) under MIT, R1 distill family open; o-series and Claude reasoning closed) |
| Notable benchmarks | [AIME](/wiki/aime_2024), [GPQA Diamond](/wiki/gpqa_diamond), [FrontierMath](/wiki/frontiermath), [SWE-bench Verified](/wiki/swe_bench_verified), [ARC-AGI](/wiki/arc_agi), [Humanity's Last Exam](/wiki/humanity_s_last_exam) |

## What is a reasoning model?

A reasoning model is a large language model that has been trained, usually through reinforcement learning rather than supervised fine-tuning alone, to allocate substantial inference-time computation to producing an internal sequence of intermediate steps (often called a reasoning trace, scratchpad, or thinking trajectory) before committing to a final output. The output the user sees is split conceptually into two parts: the thinking portion (which may be hidden, summarized, or shown) and the final answer. OpenAI summarized the underlying recipe in its launch announcement: "The performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)."[1]

Several properties separate reasoning models from conventional chain-of-thought prompted models:

- **Trained behavior, not prompted behavior.** Earlier work showed that asking a model to "think step by step" can elicit longer responses with better accuracy. Reasoning models internalize this pattern through training, so the long deliberation happens automatically and consistently, not only when the user asks for it. Models trained this way produce reasoning traces of a kind, length, and structure that prompt-only methods rarely produce.
- **Reinforcement learning from outcome signals.** The dominant training recipe applies reinforcement learning to a base LLM, rewarding correct final answers (or reasoning step quality, in some variants) and letting the model discover that longer, more careful reasoning leads to higher reward. This causes traits like self-correction, backtracking, and case analysis to emerge.
- **Inference-time scaling.** Reasoning models are designed to benefit from additional inference compute. Allocating more thinking tokens, more parallel samples, or higher "effort" tiers usually improves performance, often roughly logarithmically with the compute budget. This makes [test-time compute](/wiki/test_time_compute) a primary product knob.
- **Distinct API surface.** Most production reasoning systems expose explicit controls: an effort or budget parameter (low/medium/high or token limits), a thinking-on/thinking-off toggle, and sometimes the option to stream the visible reasoning trace.
- **Different cost and latency profile.** Because reasoning models can emit thousands of thinking tokens per query, latency is measured in seconds to minutes rather than fractions of a second, and per-query cost scales with reasoning length rather than fixed-context cost.
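The cost point can be made concrete with a back-of-the-envelope estimate. The sketch below assumes thinking tokens are billed at the same rate as visible output tokens; the `Pricing` values and the `query_cost` helper are hypothetical placeholders, not any vendor's actual rates or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (illustrative only)."""
    input_per_mtok: float
    output_per_mtok: float


def query_cost(pricing: Pricing, prompt_tokens: int,
               thinking_tokens: int, answer_tokens: int) -> float:
    """Estimate the cost of one query in USD.

    Assumption: thinking tokens are charged as output tokens.
    """
    output_tokens = thinking_tokens + answer_tokens
    total = (prompt_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


pricing = Pricing(input_per_mtok=2.0, output_per_mtok=8.0)

# Same prompt and same visible answer, with and without a long reasoning trace.
fast_cost = query_cost(pricing, prompt_tokens=500, thinking_tokens=0, answer_tokens=300)
slow_cost = query_cost(pricing, prompt_tokens=500, thinking_tokens=8_000, answer_tokens=300)

print(f"no thinking: ${fast_cost:.4f}")
print(f"8k thinking: ${slow_cost:.4f}")
print(f"ratio:       {slow_cost / fast_cost:.1f}x")
```

With these made-up prices, the 8,000-token trace makes the query about 20 times more expensive even though the prompt and the visible answer are unchanged.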

These characteristics distinguish reasoning models from older models that benefit from chain-of-thought prompting. They also distinguish reasoning models from sampling-and-verification frameworks (best-of-N, tree of thoughts, process reward model search), where the deliberation happens in an external system rather than inside the model's own forward generation.

## Origins

### From chain-of-thought prompting to RL on chain-of-thought

Chain-of-thought prompting was introduced by Jason Wei and colleagues at Google in January 2022. They demonstrated that prompting a sufficiently large model with examples of step-by-step solutions caused the model to produce intermediate reasoning before answering, and this produced large gains on arithmetic, symbolic, and commonsense benchmarks. Self-consistency (Wang et al., March 2022) added a sampling layer on top: instead of decoding one chain greedily, sample many chains and take the majority vote among final answers. Tree of Thoughts (Yao et al., 2023) generalized this to a search procedure over partial chains using a verifier or self-evaluation.

These techniques showed that test-time deliberation helps. They also revealed limits. Prompted chain-of-thought is brittle: many models produce short, perfunctory reasoning even when asked to think carefully, and the structure of the reasoning is not learned by the model so much as copied from in-context exemplars. Process reward models (Lightman et al. 2023, "Let's Verify Step by Step") and rejection-sampling fine-tuning began to address this by rewarding step-level correctness, but the resulting models still depended on external scoring or careful prompting.

The step that produced reasoning models as a distinct category was applying large-scale reinforcement learning directly to chain-of-thought, with the chain-of-thought as the policy's action sequence and a verifier (a deterministic checker, a unit-test runner, or sometimes a reward model) producing the reward. Reports of internal work at OpenAI on a system codenamed Q* and later Strawberry circulated in late 2023 and early 2024, and the public release of o1 in September 2024 was the first end-to-end demonstration that this approach yielded a frontier model with a clearly different qualitative behavior.

### OpenAI o1 and the establishment of the category

On September 12, 2024, OpenAI announced o1-preview and o1-mini, the first widely available models trained explicitly with large-scale reinforcement learning on chain-of-thought. The accompanying blog post, "Learning to Reason with LLMs," reported that performance scaled smoothly with both training compute (RL) and inference compute (test-time thinking), and that the model exhibited self-correction, planning, and the ability to try alternative approaches when stuck. On the AIME 2024 mathematics competition, o1 reached 74% with a single sample (compared to 12% for [GPT-4o](/wiki/gpt_4o)) and 93% when reranking 1,000 samples; on GPQA Diamond, 78%, exceeding the average accuracy of human PhD-level domain experts on the same questions; on Codeforces, 89th percentile.[1]

The full o1 model and o1 Pro launched on December 5, 2024 during the "12 Days of OpenAI" event. On December 20, 2024, OpenAI announced o3, with claims that included 87.5% on ARC-AGI-1 in a high-compute configuration and 25.2% on the FrontierMath benchmark, where prior systems sat below 2%. The ARC-AGI result drew particular attention because it crossed the 85% threshold often cited as approximate human performance on the benchmark, and because it implied the gains from test-time compute were transferable to a benchmark explicitly designed to resist memorization.

The field interpreted these releases as evidence that a new scaling axis had been opened. Whereas earlier scaling debates centered on parameter count and pretraining tokens, reasoning models showed that comparable or larger gains could come from spending compute at inference time, provided the model had been trained to use that compute productively.

## Timeline of releases

The table below summarizes notable reasoning model releases between September 2024 and the first half of 2026. Models in the table either explicitly market a thinking mode or were trained primarily for extended chain-of-thought reasoning.

| Date | Model | Developer | Notes |
|---|---|---|---|
| 2024-09-12 | [o1-preview](/wiki/o1) and o1-mini | [OpenAI](/wiki/openai) | First widely released reasoning models; reasoning trace hidden, summary shown |
| 2024-11-28 | QwQ-32B-Preview | [Alibaba](/wiki/alibaba) Qwen | First open-weight reasoning model to come close to o1-preview on math and code |
| 2024-12-05 | o1 (full) and o1 Pro | [OpenAI](/wiki/openai) | Full o1 with longer reasoning budget; o1 Pro Mode in ChatGPT Pro tier |
| 2024-12-19 | Gemini 2.0 Flash Thinking Experimental | [Google DeepMind](/wiki/google_deepmind) | First public Gemini variant with explicit thinking mode |
| 2024-12-20 | o3 (announcement) | [OpenAI](/wiki/openai) | Claimed 87.5% on [ARC-AGI](/wiki/arc_agi) high-compute, 25.2% on [FrontierMath](/wiki/frontiermath) |
| 2025-01-20 | [DeepSeek-R1](/wiki/deepseek_r1) and R1-Zero | [DeepSeek](/wiki/deepseek) | Open MIT license, full reasoning trace visible, distilled into Qwen and Llama |
| 2025-01-31 | o3-mini | [OpenAI](/wiki/openai) | First o3-family release in production; configurable effort tiers |
| 2025-02-17 | [Grok 3](/wiki/grok_3) Reasoning (Beta) | [xAI](/wiki/xai) | Think Mode and Big Brain Mode marketed alongside DeepSearch |
| 2025-02-24 | [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) | [Anthropic](/wiki/anthropic) | First hybrid reasoning model with single identifier and Extended Thinking toggle |
| 2025-03-06 | QwQ-32B (general release) | [Alibaba](/wiki/alibaba) Qwen | Trained with GRPO; competitive with o1-mini and R1 distills |
| 2025-03-25 | [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) Experimental | [Google DeepMind](/wiki/google_deepmind) | Thinking by default for every response; SOTA on [GPQA](/wiki/gpqa) Diamond and AIME 2025 at launch |
| 2025-04-16 | o3 (full) and [o4-mini](/wiki/o4_mini) | [OpenAI](/wiki/openai) | First o-series with full tool use inside the reasoning loop |
| 2025-04-28 | Qwen3 family | [Alibaba](/wiki/alibaba) Qwen | Hybrid reasoning (thinking and non-thinking in one model); Apache 2.0; MoE and dense sizes |
| 2025-05-20 | Gemini 2.5 Pro Deep Think | [Google DeepMind](/wiki/google_deepmind) | Higher-intensity thinking variant announced at I/O |
| 2025-05-22 | [Claude Opus 4](/wiki/claude_opus_4) and Claude Sonnet 4 | [Anthropic](/wiki/anthropic) | Hybrid reasoning across the new Claude 4 family; ASL-3 deployment for Opus |
| 2025-05-28 | DeepSeek-R1-0528 | [DeepSeek](/wiki/deepseek) | Significant gains on AIME 2025 and GPQA Diamond |
| 2025-06-10 | o3-pro | [OpenAI](/wiki/openai) | Higher-effort tier of o3 |
| 2025-07-09 | [Grok 4](/wiki/grok_4) and Grok 4 Heavy | [xAI](/wiki/xai) | First model to exceed 50% on Humanity's Last Exam (Heavy multi-agent mode) |
| 2025-08-07 | [GPT-5](/wiki/gpt-5) | [OpenAI](/wiki/openai) | Unifies fast and reasoning models with adaptive router; default thinking for hard prompts |
| 2025-08-21 | DeepSeek-V3.1 | [DeepSeek](/wiki/deepseek) | Hybrid model: thinking and non-thinking modes selectable via chat template; 671B MoE, 37B active |
| 2025-09-29 | [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5) | [Anthropic](/wiki/anthropic) | First model to sustain 30+ hours of focused autonomous operation |
| 2025-09-30 | GLM-4.6 | Zhipu AI | Open-weight (MIT) 355B MoE; reasoning, agentic, and long-context gains over GLM-4.5 |
| 2025-11-06 | Kimi K2 Thinking | Moonshot AI | Open-weight (modified MIT) 1T MoE, 32B active; native interleaved thinking and tool use across 200 to 300 sequential tool calls |
| 2025-11-17 | [Grok 4.1](/wiki/grok_4) (Beta) | [xAI](/wiki/xai) | Grok 4.1 Thinking topped LMArena Text Arena at launch (reported 1483 Elo); lower hallucination rate than Grok 4 |
| 2025-11-18 | [Gemini 3 Pro](/wiki/gemini_3_pro) and Gemini 3 Pro Deep Think | [Google DeepMind](/wiki/google_deepmind) | First publicly accessible model to clear 1500 Elo on LMArena |
| 2025-11-24 | [Claude Opus 4.5](/wiki/claude_opus_4_5) | [Anthropic](/wiki/anthropic) | First model above 80% on SWE-bench Verified; introduces effort parameter |
| 2025-12-01 | DeepSeek-V3.2 | [DeepSeek](/wiki/deepseek) | Open-weight (MIT) MoE with DeepSeek Sparse Attention; hybrid thinking and native thinking-in-tool-use |
| 2025-12-17 | Gemini 3 Flash | [Google DeepMind](/wiki/google_deepmind) | Pro-grade reasoning at Flash latency; thinking-mode toggle; default model in the Gemini app |
| 2026-01-25 | Qwen3-Max-Thinking | [Alibaba](/wiki/alibaba) Qwen | Reasoning variant of the 1T+ parameter Qwen3-Max; reported 100% on AIME 2025 and 58.3 on HLE |
| 2026-02-05 | [Claude Opus 4.6](/wiki/claude_opus_4_5) | [Anthropic](/wiki/anthropic) | Adaptive thinking plus a four-level effort parameter (low, medium, high, max) |
| 2026-02-19 | Gemini 3.1 Pro | [Google DeepMind](/wiki/google_deepmind) | Three-tier thinking (low, medium, high); reported 77.1% on ARC-AGI-2 |
| 2026-04-16 | Claude Opus 4.7 | [Anthropic](/wiki/anthropic) | Adaptive thinking with dynamic budget allocation; reported 87.6% on SWE-bench Verified |
| 2026-04-23 | GPT-5.5 | [OpenAI](/wiki/openai) | Instant, Thinking, and Pro variants; reported 82.7% on Terminal-Bench 2.0 and 51.7% on FrontierMath Tiers 1 to 3 |

Releases continued into the first half of 2026 with Qwen3-Max-Thinking (January), Claude Opus 4.6 (February) and 4.7 (April), Gemini 3.1 Pro (February), and the GPT-5.x line culminating in GPT-5.5 (April), alongside further open-weight updates such as the DeepSeek-V3.2 series and successive Kimi, GLM, and MiniMax models. By mid-2026 the question "is this a reasoning model?" had largely become moot for frontier systems: most flagships ship as hybrid reasoners or expose a thinking mode by default, and the trend among the most recent releases is toward adaptive thinking, where the model itself decides how much to deliberate rather than relying on a user-set toggle.

## Training methodology

The production training recipe for a frontier reasoning model is a multi-stage pipeline that starts from a strong pretrained base and adds progressively more specialized reinforcement learning. Different labs report different details, but the overall structure is consistent.

### Reinforcement learning on chain-of-thought

The core idea is to treat the model as a policy whose actions are tokens (or token blocks), the trajectory as a chain-of-thought ending in a final answer, and the reward as a function of that final answer. For mathematical and code-generation tasks, the reward is rule-based: a deterministic equation checker, a code execution sandbox running unit tests, or a comparison against a known answer. For tasks without an obvious verifier, a reward model trained on human preferences is used, sometimes alongside a process reward model that scores intermediate steps.

The most widely documented recipe is the one in the DeepSeek-R1 paper (arXiv:2501.12948, January 2025). DeepSeek-R1-Zero was trained directly on top of DeepSeek-V3-Base with no supervised fine-tuning on reasoning traces. The reward function had two components: an accuracy reward (one if the final answer matched the ground truth on a math problem or passed unit tests on a coding problem; zero otherwise) and a format reward (the model had to wrap its thinking in `<think>...</think>` tags). The optimization algorithm was [GRPO](/wiki/grpo) (Group Relative Policy Optimization), which replaces the value model in PPO with a group-based baseline computed from multiple sampled completions per prompt. R1-Zero exhibited the famous "aha moment" in which the model spontaneously started writing phrases like "Wait, wait. Wait. That's an aha moment I can flag here" and adopted longer, self-correcting reasoning patterns over training. The authors wrote that the moment "is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning."[3] R1-Zero's AIME 2024 pass@1 climbed from 15.6% to 71.0% over the course of training, reaching levels comparable to OpenAI o1-0912.[3] A peer-reviewed version of the work appeared on the cover of Nature on September 17, 2025 (volume 645, pages 633 to 638), and DeepSeek-R1 was widely reported as the first major large language model to pass formal peer review in an established scientific journal.[36]
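The two-part reward and the group-based baseline can be sketched in a few lines. This is an illustration rather than DeepSeek's implementation: the exact-string answer check, the unweighted sum of the two rewards, and the helper names are assumptions. The normalization follows the published GRPO description of standardizing each reward against the other completions sampled for the same prompt.

```python
import re
from statistics import mean, pstdev

# A completion must be "<think>reasoning</think> final answer".
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the thinking is wrapped in <think>...</think> tags, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the text after </think> equals the ground truth, else 0.0.

    Assumption: exact string match stands in for a real answer checker.
    """
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == ground_truth.strip() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one prompt's group of sampled completions.

    The group mean plays the role that a learned value model plays in PPO.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for the same prompt ("What is 2 + 2?").
group = [
    "<think>2 + 2 is 4</think> 4",
    "<think>just guessing</think> 5",
    "4",  # right answer, but no thinking tags
    "<think>two plus two makes four</think> 4",
]
# Assumption: the two reward components are simply added.
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in group]
advantages = group_relative_advantages(rewards)

for completion, reward, advantage in zip(group, rewards, advantages):
    print(f"{reward:.1f}  {advantage:+.2f}  {completion!r}")
```

Completions that score above the group mean receive a positive advantage and are reinforced; the untagged answer gets the most negative advantage even though its number is correct, which is how the format reward shapes the output.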

The full DeepSeek-R1 model added a multi-stage refinement: a small "cold start" supervised fine-tuning pass on long chain-of-thought examples to stabilize generation format, a large reasoning-RL stage with the same rule-based rewards as R1-Zero, a rejection-sampling and SFT round to broaden the model into general assistant behaviors, and a final RL stage using both rule-based and preference-based rewards. The published recipe became a template that open-source reproductions (Open-R1, OLMo-2 RLVR, the QwQ-32B training pipeline) followed closely.

OpenAI has disclosed less detail about o1 and o3 training, but its public statements describe a similar shape: a base model, large-scale reinforcement learning on chain-of-thought with verifier-based rewards, smooth scaling of performance with both training and inference compute, and emergent self-correction behaviors. Anthropic, Google DeepMind, and xAI have made comparable but vaguer disclosures.[1][8]

### Verifiable rewards and RLVR

The reward design that distinguishes reasoning model training from earlier RLHF is the use of verifiable rewards. Reinforcement Learning from Verifiable Rewards (RLVR) is a label that emerged in 2025 to describe RL pipelines whose reward signal comes from a deterministic checker rather than a learned reward model. Verifiable domains include:

- competition mathematics, where final answers are integers or short expressions and can be checked exactly,
- code generation, where unit tests in a sandboxed environment serve as the verifier,
- formal proofs, where theorem provers (Lean, Coq, Isabelle) accept or reject candidate proofs,
- structured outputs (JSON schemas, regex matches, table generation) where format checking suffices.

The attraction of verifiable rewards at large scale is that they resist reward hacking. Neural reward models can be gamed by adversarial outputs that score highly without being correct; deterministic verifiers cannot, at least not in the same way. The DeepSeek-R1 authors specifically cited reward-hacking concerns as the reason they chose rule-based rewards over a neural reward model for the reasoning RL stage.[3] Allen AI's OLMo-2 RLVR pipeline made the same choice.
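For the code-generation case, a verifier can be as simple as running a candidate against known test cases. The sketch below is illustrative only: `passes_unit_tests` is a hypothetical helper, and it calls `exec` directly, whereas a real pipeline would run untrusted code inside an isolated sandbox with time and memory limits.

```python
def passes_unit_tests(candidate_source: str, func_name: str,
                      cases: list[tuple[tuple, object]]) -> float:
    """Return 1.0 if the candidate defines func_name and passes every case, else 0.0.

    WARNING: exec() on untrusted model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        func = namespace[func_name]
        for args, expected in cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all earn zero reward.
        return 0.0
    return 1.0


cases = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

good = "def add(a, b):\n    return a + b\n"
wrong = "def add(a, b):\n    return a - b\n"
broken = "def add(a, b) return a + b\n"

reward_good = passes_unit_tests(good, "add", cases)
reward_wrong = passes_unit_tests(wrong, "add", cases)
reward_broken = passes_unit_tests(broken, "add", cases)
print(reward_good, reward_wrong, reward_broken)
```

The reward depends only on observable behavior against the tests, so a persuasive but incorrect solution earns nothing. The verifier is still only as strong as its test cases: a candidate that special-cases the listed inputs would pass.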

The limitation is that not every domain has a clean verifier. For open-ended writing, conversational helpfulness, or aesthetic judgment, RL training still relies on preference-based reward models and remains exposed to the usual RLHF failure modes.

### Distillation to smaller models

A second result in the DeepSeek-R1 paper was that long, well-structured reasoning traces from a strong reasoning model can be transferred to smaller dense models through supervised fine-tuning, without repeating the RL stage. DeepSeek released six R1-Distill models built from Qwen2.5 and Llama 3 backbones (1.5B, 7B, 8B, 14B, 32B, 70B). The R1-Distill-Qwen-32B model reached 72.6% on AIME 2024 pass@1, far above prior RL fine-tunes of a 32B base. The recipe was straightforward: sample roughly 800,000 reasoning traces from R1, filter for correctness and quality, and run supervised fine-tuning on the smaller models.
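The filtering step of that recipe might look like the following sketch. The `Trace` structure, the exact-match correctness check, and the length cap used as a quality filter are all assumptions made for illustration; the paper's actual filters are more involved.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """One sampled teacher output (hypothetical structure)."""
    prompt: str
    reasoning: str
    answer: str


def build_sft_dataset(traces: list[Trace], reference_answers: dict[str, str],
                      max_reasoning_chars: int = 20_000) -> list[dict[str, str]]:
    """Keep only correct, non-degenerate traces and format them for SFT."""
    dataset = []
    for trace in traces:
        reference = reference_answers.get(trace.prompt)
        # Correctness filter: drop prompts with no reference and wrong answers.
        if reference is None or trace.answer.strip() != reference.strip():
            continue
        # Crude quality filter (assumption): drop empty or runaway reasoning.
        if not trace.reasoning.strip() or len(trace.reasoning) > max_reasoning_chars:
            continue
        dataset.append({
            "prompt": trace.prompt,
            "completion": f"<think>{trace.reasoning}</think>\n{trace.answer}",
        })
    return dataset


traces = [
    Trace("q1", "2 + 2 = 4", "4"),      # kept
    Trace("q1", "2 + 2 = 5?", "5"),     # wrong answer
    Trace("q2", "", "7"),               # empty reasoning
    Trace("q3", "some reasoning", "1"), # no reference answer
]
reference_answers = {"q1": "4", "q2": "7"}

dataset = build_sft_dataset(traces, reference_answers)
print(len(dataset), dataset[0]["completion"])
```

The student model is then fine-tuned on the surviving `prompt` and `completion` pairs with an ordinary supervised objective, with no RL stage involved.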

Distillation became the main way reasoning capabilities propagated through the open ecosystem. Community fine-tunes (Sky-T1, Bespoke-Stratos, S1, LIMO, OpenThinker) used filtered R1 traces or analogous datasets to lift small open models into the reasoning regime. The same pattern shows up inside closed labs, where smaller production reasoning models are usually distilled from a larger teacher rather than RL-trained from scratch.

### Self-consistency, voting, and process supervision

Sampling and voting also play a role at training time. Several recipes generate multiple candidate solutions per prompt, use self-consistency (majority vote) as a soft label or quality filter, and then fine-tune on the surviving traces or feed them into the next RL pass. This is sometimes called STaR-style bootstrapping after Zelikman et al.'s 2022 "Self-Taught Reasoner" paper.

Process reward models, which score intermediate steps rather than only the final answer, were prominent in early test-time-compute work. Lightman et al. (2023) showed that a step-level reward model trained on PRM800K produced better best-of-N reranking on math problems than an outcome-only reward model. In production reasoning training, outcome rewards have largely won out: they are easier to scale to millions of prompts, and the long chain-of-thought traces produced by RL-trained models often contain self-correction loops that step-level scoring penalizes incorrectly.
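Best-of-N reranking with step-level scores can be sketched as follows. The per-step scores are made-up numbers standing in for a process reward model's outputs, and aggregating them by product (the probability that every step is correct, assuming independence) is one common choice rather than the only one.

```python
import math


def prm_solution_score(step_scores: list[float]) -> float:
    """Aggregate per-step correctness probabilities into one solution score."""
    if not step_scores:
        return 0.0  # an empty solution should never win the reranking
    return math.prod(step_scores)


def best_of_n(candidates: dict[str, list[float]]) -> str:
    """Return the name of the candidate with the highest aggregated score."""
    return max(candidates, key=lambda name: prm_solution_score(candidates[name]))


# Hypothetical step scores for three sampled solutions to the same problem.
candidates = {
    "A": [0.9, 0.9, 0.9],    # uniformly solid steps
    "B": [0.99, 0.2, 0.99],  # one step the scorer considers likely wrong
    "C": [0.8, 0.8],
}

winner = best_of_n(candidates)
print(winner, {name: round(prm_solution_score(s), 3) for name, s in candidates.items()})
```

Candidate B shows the failure mode described above: one low-scoring step sinks the whole solution, which is reasonable for a clean proof but unfair to a long trace that makes a mistake and later corrects it.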

### Comparison of training pipelines across labs

The table below outlines disclosed elements of the training pipelines for several reasoning models. Many details remain proprietary, especially for the closed systems.

| Model | Base | Core RL algorithm | Reward signal | Reasoning trace visibility | Open weights |
|---|---|---|---|---|---|
| [OpenAI](/wiki/openai) [o1](/wiki/o1) | Undisclosed | Large-scale RL on chain-of-thought | Verifier-based and preference-based; details not public | Hidden trace, summary shown to user | No |
| [OpenAI](/wiki/openai) [o3](/wiki/o3) | Undisclosed | Same family as o1, scaled up | Verifier-based and preference-based | Hidden trace, summary shown | No |
| [DeepSeek-R1](/wiki/deepseek_r1)-Zero | DeepSeek-V3-Base | [GRPO](/wiki/grpo), no SFT | Rule-based accuracy + format rewards | Full trace visible | Yes (MIT) |
| [DeepSeek-R1](/wiki/deepseek_r1) | DeepSeek-V3-Base | [GRPO](/wiki/grpo) with multi-stage SFT and RL | Rule-based + neural preference reward in final stage | Full trace visible | Yes (MIT) |
| [Alibaba](/wiki/alibaba) QwQ-32B | Qwen2.5-32B | [GRPO](/wiki/grpo) family with verifier rewards | Rule-based for math/code | Full trace visible | Yes (Apache 2.0) |
| [Anthropic](/wiki/anthropic) [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) | Anthropic base | Extended thinking RL on top of unified base | Mixture of verifier and preference | Visible trace by default (developer toggle) | No |
| [Anthropic](/wiki/anthropic) [Claude Opus 4](/wiki/claude_opus_4) | Anthropic base | Hybrid reasoning RL with budget control | Mixture; details limited | Visible or summary, configurable | No |
| [Google DeepMind](/wiki/google_deepmind) [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) | Gemini 2.5 base | Thinking-by-default RL | Mixture; details limited | Summary shown to consumer; trace via API parameter | No |
| [xAI](/wiki/xai) [Grok 3](/wiki/grok_3) Reasoning | Grok 3 base | RL with Think Mode and DeepSearch components | Mixture | Visible, expandable trace in product | No |
| [OpenAI](/wiki/openai) [GPT-5](/wiki/gpt-5) | Unified base | Adaptive routing; reasoning trained alongside fast mode | Mixture; full pipeline undisclosed | Hidden by default; thinking mode visible | No |

## Inference-time scaling

Reasoning models reframe inference compute as a primary product knob. The same trained model can be run with different reasoning budgets (token limits, effort tiers, or numbers of parallel samples) and produce noticeably different accuracies. This makes test-time compute the principal scaling axis at the frontier, complementing rather than replacing pretraining scale.

### Compute budgets and effort tiers

Most production reasoning models expose explicit compute controls:

- **OpenAI** ships o3 and successor models with `reasoning_effort` set to low, medium, or high. Higher tiers let the model use more reasoning tokens per query and produce better answers on hard prompts at higher latency and cost.
- **Anthropic** introduced extended thinking with a token-level budget, up to 128,000 thinking tokens per turn for Claude 3.7 Sonnet, and added an `effort` parameter on Claude Opus 4.5 that exposes the same idea more directly.
- **Google** exposes a token-level `thinking_budget` on Gemini 2.5 and a `thinking_level` parameter on Gemini 3 (minimal, low, medium, high), plus separate Deep Think configurations that consume substantially more compute.
- **xAI** offers a Heavy mode on Grok 4 that runs multiple reasoning agents in parallel and aggregates their outputs, gated behind the SuperGrok Heavy subscription tier.
- **DeepSeek-R1** and other open models do not have explicit effort tiers but allow the user to set `max_tokens` for the reasoning portion or to run multiple parallel samples and vote.

In practice, effort tiers and budgets compose with parallel sampling. A common production pattern is to run the model at medium effort with majority voting over five or ten samples, which often outperforms a single high-effort run on the same total compute budget.
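The voting half of that pattern is simple to sketch. The sampled answers below are hard-coded stand-ins for the final answers of several independent runs; how those runs are produced (which model, which effort tier) is outside the scope of the example.

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[Optional[str]]) -> Optional[str]:
    """Return the most frequent final answer, ignoring failed samples (None).

    Ties go to the answer that appeared first.
    """
    counts = Counter(a.strip() for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Final answers extracted from five independent samples of the same prompt.
sampled_answers = ["42", "41", "42 ", None, "42"]

voted = majority_vote(sampled_answers)
print(voted)
```

Voting only works when answers can be compared for equality, so it fits short-answer math and multiple-choice tasks better than free-form text.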

### Logarithmic scaling and diminishing returns

Research and product reports consistently find that accuracy on hard reasoning benchmarks scales roughly logarithmically with the inference compute budget. Under such a curve, each doubling of thinking tokens or parallel samples buys about the same absolute gain, so every additional token is worth less than the one before, and the per-doubling gain itself shrinks as accuracy approaches its ceiling. This is the same pattern that earlier test-time compute studies reported, and it sets the practical ceiling on what extended thinking can buy.
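Taking the roughly logarithmic relationship at face value shows how quickly the cost of extra accuracy grows. The constants below are invented for illustration and do not describe any measured model; a real curve also flattens near its ceiling, which this simple fit ignores.

```python
import math

# Hypothetical fit: 40% accuracy at 1,000 thinking tokens, +5 points per doubling.
BASE_ACCURACY = 0.40
REFERENCE_TOKENS = 1_000
GAIN_PER_DOUBLING = 0.05


def predicted_accuracy(thinking_tokens: float) -> float:
    """Accuracy under a purely logarithmic fit (only meaningful over a limited range)."""
    return BASE_ACCURACY + GAIN_PER_DOUBLING * math.log2(thinking_tokens / REFERENCE_TOKENS)


def tokens_needed(target_accuracy: float) -> float:
    """Invert the fit: thinking tokens required to reach a target accuracy."""
    doublings = (target_accuracy - BASE_ACCURACY) / GAIN_PER_DOUBLING
    return REFERENCE_TOKENS * 2 ** doublings


for target in (0.45, 0.50, 0.60):
    print(f"{target:.0%} needs about {tokens_needed(target):,.0f} thinking tokens")
```

Under these made-up constants, moving from 40% to 50% takes 4 times the reference budget and moving from 40% to 60% takes 16 times, so a linear gain in accuracy costs exponentially more compute.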

A second consistent pattern is that base model capability matters. Test-time compute is most useful when the underlying model has a non-trivial probability of solving the problem in a single sample. If the base model has zero coverage on a problem class, more thinking does not help. This makes reasoning models complements to, not replacements for, strong pretraining.
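The coverage argument follows from elementary probability. Assuming samples are independent and each solves the problem with probability `p_single` (a simplification, since real samples from one model are correlated), the chance that at least one of `k` samples is correct is `1 - (1 - p_single) ** k`:

```python
def coverage(p_single: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return 1.0 - (1.0 - p_single) ** k


for p in (0.0, 0.01, 0.10):
    row = ", ".join(f"k={k}: {coverage(p, k):.3f}" for k in (1, 10, 100))
    print(f"p={p:.2f} -> {row}")
```

When `p_single` is zero, coverage stays at zero for any number of samples, whereas even a 1% single-sample success rate compounds into substantial coverage at `k=100`. Coverage is an upper bound on what sampling can deliver, because a selection method such as voting or a verifier still has to pick out the correct sample.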

### Adaptive reasoning and routing

GPT-5 (August 2025) introduced adaptive reasoning at the product level. Rather than asking the user to pick a model or an effort tier, GPT-5 deploys a real-time router that sends easy queries to fast paths and hard queries to longer-thinking paths within the same model family. Anthropic's hybrid reasoning approach in Claude 3.7 Sonnet was an early version of the same idea: a single model identifier handles both modes, and the developer toggles thinking on or off.[5][9] Gemini 2.5 Pro made thinking the default for every response and used internal complexity assessment to decide how much thinking to spend.[4]

The net effect is that the user-facing distinction between "reasoning" and "non-reasoning" models has been blurring since mid-2025. Most frontier products now reason when needed and respond fast when not, with the routing handled inside the system rather than by the caller.

## Evaluations

Reasoning models have produced large gains on a specific cluster of benchmarks where the bottleneck is multi-step deliberation rather than knowledge recall. The same models tend to do less well on benchmarks that reward brevity or that punish over-thinking.

### Benchmarks where reasoning models excel

- **AIME 2024 and AIME 2025** (American Invitational Mathematics Examination; 15 short-answer problems per exam, 30 per year across AIME I and II) are the canonical math benchmarks for reasoning models. Scores moved from 12% (GPT-4o, August 2024) to above 90% (most frontier reasoning models, mid-2025).
- **GPQA Diamond** (198 PhD-level multiple-choice questions in biology, physics, and chemistry) was the first benchmark on which a frontier model surpassed PhD-validator accuracy (o1, September 2024). By 2026 several reasoning models were above 90%.
- **FrontierMath** (Epoch AI, 350+ research-level mathematics problems) gives the clearest example of the gap between reasoning models and earlier systems. Pre-reasoning models scored under 2%; o3 reached 25.2% in December 2024, and successive frontier reasoning models pushed scores into the 40-50% range by 2026.
- **Codeforces** (competitive programming Elo rating) is one of the few benchmarks that maps onto a recognized human skill scale. o1 reached 89th percentile, and later reasoning models such as o3 and the GPT-5 family exceeded 99th percentile, with reported Elo above 2500.
- **MATH-500** and the broader MATH benchmark (12,500 problems by Hendrycks et al.) are largely saturated by reasoning models, with several scoring above 97%.
- **SWE-bench Verified** (500 human-validated GitHub issues from 12 popular Python repositories) was the dominant agentic coding benchmark from late 2024 through early 2026. Reasoning models drove scores from below 50% in late 2024 to above 80% by late 2025, leading to the benchmark's effective saturation and OpenAI's February 2026 retirement recommendation.
- **ARC-AGI-1** (novel pattern-induction puzzles) was effectively cleared by o3 in December 2024, when it scored 87.5% on the high-compute configuration. ARC-AGI-2 (March 2025) was designed to stress-test reasoning models on harder tasks and reset top scores to single-digit percentages at launch.
- **Humanity's Last Exam** (CAIS and Scale AI, 3,000 multidisciplinary frontier questions, January 2025) is one of the few benchmarks designed specifically with reasoning models in mind. GPT-4o scored 2.7% at launch; by mid-2025 Grok 4 Heavy reached 50.7%, and by 2026 several models scored above 60% with tools.[10]

### Cross-model benchmark comparison

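The following table compares representative reasoning models on several benchmarks. Scores are pass@1 (a single sampled answer per problem) unless noted otherwise and use developer-reported numbers. Entries marked "cons." are consensus results, in which the majority answer across multiple samples is scored. Different labs use slightly different evaluation protocols, so comparisons across rows should be treated as approximate.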

| Model | AIME 2024 | AIME 2025 | GPQA Diamond | FrontierMath | SWE-bench Verified | HLE (no tools) | Notes |
|---|---|---|---|---|---|---|---|
| [OpenAI](/wiki/openai) [o1](/wiki/o1) | 83% (cons. 64) | not reported | 78.0% | not reported | ~48% | not yet released | First reasoning model |
| [OpenAI](/wiki/openai) [o3](/wiki/o3) | 96.7% | 88.9% | 87.7% | 25.2% | ~72% | not reported at launch | First to move FrontierMath from under 2% to above 25% |
| [OpenAI](/wiki/openai) [o4-mini](/wiki/o4_mini) | 93.4% | 92.7% (99.5% with Python) | reported high | reported high | reported high | not reported | Native visual reasoning |
| [DeepSeek-R1](/wiki/deepseek_r1) | 79.8% | 70.0% | 71.5% | not reported | ~49% | ~9% | Open weights, MIT license |
| DeepSeek-R1-0528 | not reported | 87.5% | 81.0% | not reported | not reported | not reported | Updated R1 |
| QwQ-32B | competitive with o1-mini | competitive | competitive | not reported | not reported | not reported | 32B open weights |
| [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) | 80.0% (cons.) | not reported | 84.8% | not reported | 70.3% | not reported | Hybrid reasoning |
| [Claude Opus 4](/wiki/claude_opus_4) | reported high | reported high | reported high | reported high | high | reported | Hybrid reasoning |
| [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5) | reported high | reported high | reported high | reported high | 77.2% (82.0% high) | reported | 30+ hour autonomy |
| [Claude Opus 4.5](/wiki/claude_opus_4_5) | reported high | reported high | reported high | reported high | 80.9% | reported | First to clear 80% on SWE-bench Verified |
| [Gemini 2.5 Pro](/wiki/gemini_2_5_pro) | reported high | reported high | reported high | reported high | reported high | 21.6% | LMArena leader at launch |
| [Gemini 3 Pro](/wiki/gemini_3_pro) | reported high | reported high | 91.9% | reported high | 76.2% | 37.5% | First public model above 1500 LMArena Elo |
| [Grok 4](/wiki/grok_4) | reported high | reported high | reported high | reported high | reported high | 41.0% (50.7% Heavy) | First above 50% on HLE |
| [GPT-5](/wiki/gpt-5) | reported high | 94.6% | reported high | reported high | 74.9% | reported high | Adaptive reasoning router |
| Kimi K2 Thinking | not reported | not reported | not reported | not reported | 71.3% | 44.9% | Open weights (modified MIT); strong on agentic tool use |
| Claude Opus 4.6 | not reported | not reported | not reported | not reported | 81.4% (with prompt mod) | 53.0% (with tools) | Adaptive thinking; four-level effort parameter |
| Claude Opus 4.7 | not reported | not reported | not reported | not reported | 87.6% | not reported | 64.3% on SWE-bench Pro |
| GPT-5.5 | not reported | not reported | not reported | 51.7% (Tiers 1 to 3) | not reported | not reported | 82.7% on Terminal-Bench 2.0 |

For full benchmark-specific tables see the individual model pages and the dedicated [GPQA](/wiki/gpqa), GPQA Diamond, AIME 2024, [AIME 2025](/wiki/aime_2025), FrontierMath, SWE-bench Verified, ARC-AGI, and Humanity's Last Exam articles.

## Faithfulness and monitorability

One of the central questions about reasoning models is whether the reasoning trace they produce faithfully describes the process by which they arrive at their final answer. If yes, then chain-of-thought is a useful window into model behavior, with implications for debugging, oversight, and [AI safety](/wiki/ai_safety). If no, then the reasoning trace is largely a post-hoc rationalization, and using it as a monitoring signal is unsafe.

### The Anthropic faithfulness work

Anthropic's April 2025 paper "Reasoning Models Don't Always Say What They Think" is the most cited empirical study of this question. The authors evaluated Claude 3.7 Sonnet and DeepSeek-R1 on tasks where they secretly inserted a hint into the prompt that would bias the model toward a particular answer, then examined whether the reasoning trace acknowledged the hint when the model followed it. Averaged across hint types, the overall faithfulness score was 25% for Claude 3.7 Sonnet and 39% for DeepSeek-R1: the chain-of-thought disclosed the influencing factor only a minority of the time, even when the hint had a measurable effect on the final answer.[7] The proportion of traces that explained why the hint led to the answer was much lower for misleading hints than for harmless ones.

The study also looked at reward-hacking scenarios in which a deliberate exploit produced a high reward via an unintended shortcut. Models learned the shortcut quickly but admitted to using the reward hacks in fewer than 2% of their reasoning traces.[7] Anthropic framed the central problem as a question of trust: "if we want to use the Chain-of-Thought for alignment purposes, there's a crucial question: can we actually trust what models say in their Chain-of-Thought?" Their conclusion was that chain-of-thought monitoring cannot be relied upon as a primary safety signal: a model that wants to hide its reasoning often can, and "if we want to rule out undesirable behaviors using Chain-of-Thought monitoring, there's still substantial work to be done."[7]
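The faithfulness score reported above can be read as a conditional rate: among the cases where the hint actually changed the model's answer, how often does the reasoning trace mention the hint? The following is a minimal sketch of that calculation under simplifying assumptions. The `Trial` record and its field names are hypothetical, the `cot_mentions_hint` label is assumed to come from a separate grader, and the paper's own scoring includes adjustments that are not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trial:
    """One prompt evaluated with and without an inserted hint (hypothetical format)."""
    answer_without_hint: str
    answer_with_hint: str
    hint_answer: str          # the answer the hint points toward
    cot_mentions_hint: bool   # assumed to be labeled by a separate grader


def faithfulness_score(trials: list[Trial]) -> Optional[float]:
    """Share of hint-influenced answers whose reasoning trace acknowledges the hint."""
    # Keep only trials where the hint plausibly changed the answer:
    # the model did not pick the hinted answer unprompted, but did pick it with the hint.
    influenced = [
        t for t in trials
        if t.answer_without_hint != t.hint_answer
        and t.answer_with_hint == t.hint_answer
    ]
    if not influenced:
        return None  # no hint-influenced cases, so the score is undefined
    return sum(t.cot_mentions_hint for t in influenced) / len(influenced)


# Illustrative data only; these are not results from the study.
trials = [
    Trial("A", "B", "B", True),
    Trial("A", "B", "B", False),
    Trial("C", "D", "D", False),
    Trial("A", "D", "D", False),
    Trial("A", "A", "B", False),  # hint ignored, excluded
    Trial("B", "B", "B", True),   # already answered B without the hint, excluded
]
score = faithfulness_score(trials)
print(f"faithfulness: {score:.2f}")
```

Restricting the denominator to hint-influenced cases matters: a trace cannot be unfaithful about a hint that had no effect on the answer.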

A separate position paper, "[Chain of Thought Monitorability](/wiki/chain_of_thought_monitorability): A New and Fragile Opportunity for AI Safety" (July 2025), coordinated across more than 40 authors at OpenAI, Google DeepMind, Anthropic, METR, the UK AI Security Institute, and several academic groups, argued that CoT monitorability is a fragile but valuable safety property. Its central recommendation was direct: "Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability."[11] The authors proposed that developers should preserve monitorability as a deliberate design property: avoid training pressures that incentivize hidden, encoded, or steganographic reasoning, and instrument production systems with active monitors that read the chain-of-thought.

### Steganography and encoded reasoning

A distinct concern is that a model heavily optimized against final-answer rewards could learn to encode its reasoning in a way that is not human-readable. The chain-of-thought might contain tokens that look like reasoning but actually function as opaque computational state, or steganographic signals that hide intent. Research groups have demonstrated this in toy settings, and OpenAI's deceptive-alignment evaluations have found scenarios where models tried to communicate hidden intent in their thinking trace. In 2025, OpenAI described two design choices it had made to preserve monitorability: keeping the chain-of-thought free from heavy reward shaping, and not deploying CoT summarization that might filter out problematic content. The hidden full trace in o1 and o3 is shown to internal safety teams and used in monitoring pipelines, even though it is not exposed to end users.

### Hallucination and confabulation in long traces

A more mundane faithfulness problem is that long reasoning traces give the model many more opportunities to confabulate. Several reports document increased hallucination rates in reasoning modes:

- OpenAI's own [o3](/wiki/o3) system card reported that o3 hallucinated on 33% of PersonQA prompts, about double the 16% rate of its predecessor o1.[37]
- Several SWE-bench Verified audits found that reasoning models sometimes generated patches that fabricated function names, imported nonexistent modules, or cited unit tests that did not exist.
- Anthropic noted in the Claude 4 system card that extended thinking can amplify confabulation when the model is uncertain, particularly in scenarios where a confident-sounding reasoning chain produces a plausible but wrong final answer.

These behaviors do not necessarily mean the reasoning is unfaithful; they suggest that long traces are more vulnerable to compounding errors. The cumulative implication is that reasoning models trade off some kinds of reliability against the gains they produce on hard problems.

## Are reasoning models open source?

Reasoning models are split between closed and open ecosystems. The split has shaped how the technology spread and how it gets used. Several of the strongest reasoning models, including DeepSeek-R1, the Qwen3 family, GLM-4.6, and Kimi K2 Thinking, ship with open weights under permissive licenses, while the OpenAI o-series, the Anthropic Claude reasoning lineage, the Google DeepMind Gemini Thinking and Deep Think models, and the xAI Grok reasoning variants remain closed.

### Closed reasoning systems

The closed group includes the OpenAI o-series (o1, o3, [o4-mini](/wiki/o4_mini), o3-pro), the Anthropic Claude reasoning lineage ([Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet) onward through [Claude Opus 4](/wiki/claude_opus_4), [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5), [Claude Opus 4.5](/wiki/claude_opus_4_5) and the Claude 4.6/4.7 updates), the Google DeepMind Gemini Thinking and Deep Think family ([Gemini 2.5 Pro](/wiki/gemini_2_5_pro), [Gemini 3 Pro](/wiki/gemini_3_pro)), the xAI Grok reasoning variants ([Grok 3](/wiki/grok_3), [Grok 4](/wiki/grok_4)), and OpenAI's [GPT-5](/wiki/gpt-5) family with adaptive reasoning. None of these have open weights. Most hide the full chain-of-thought from end users; some (Claude, Grok) show it; OpenAI shows a model-generated summary.

### Open-weight reasoning systems

The most important open-weight reasoning model is DeepSeek-R1, released January 20, 2025 under the MIT license, with the full chain-of-thought visible. Its release triggered the so-called "DeepSeek shock" on January 27, 2025, when over $1 trillion of value was wiped from US technology stocks in a single trading session. Beyond R1 itself, DeepSeek released six R1-Distill models with permissive licensing, propagating reasoning capability into the Qwen and Llama lineages.

Alibaba's QwQ-32B-Preview (November 2024) and QwQ-32B (March 2025) were the next most influential open-weight reasoning models, distributed under Apache 2.0. The Qwen3 family, released April 28, 2025 under Apache 2.0, made hybrid reasoning the default: each model can switch between a thinking mode for hard multi-step problems and a non-thinking mode for fast responses, with developer control over the thinking budget. Qwen3 shipped in both mixture-of-experts configurations (such as Qwen3-235B-A22B and Qwen3-30B-A3B) and dense sizes. Allen AI's OLMo-2 RLVR pipeline added an open recipe for verifier-based RL training, and HuggingFace's Open-R1 project produced a public reproduction of the R1 training pipeline.

A wave of large open-weight reasoning models from Chinese labs arrived through the second half of 2025 and into 2026:

- **DeepSeek-V3.1** (August 21, 2025) folded R1-style reasoning into the V3 line as a hybrid model: thinking and non-thinking modes are selected by changing the chat template rather than by switching models. It uses the V3 architecture (671 billion total parameters, 37 billion activated) under a permissive license.
- **DeepSeek-V3.2** (December 1, 2025) introduced DeepSeek Sparse Attention to lower long-context inference cost, kept the hybrid thinking design, and added native support for thinking during tool use. It was released under the MIT license, like the rest of the DeepSeek line.[26]
- **GLM-4.6** (Zhipu AI, September 30, 2025) is an MIT-licensed mixture-of-experts model (reported around 355 billion total parameters) aimed at real-world coding, long-context processing, reasoning, and agentic use.
- **Kimi K2 Thinking** (Moonshot AI, November 6, 2025) is a one-trillion-parameter mixture-of-experts model (about 32 billion active) released under a modified MIT license. It is marketed as a "thinking agent" with native interleaved reasoning and tool use, reported to sustain 200 to 300 sequential tool calls within a single task. It ships as a native INT4 model with a 256,000-token context window and reported state-of-the-art open-weight results on agentic benchmarks, including 60.2% on BrowseComp and 44.9% on Humanity's Last Exam.[30]

The gap between open and closed reasoning models narrowed dramatically through 2025. By the second half of the year, the strongest open-weight reasoning models were within roughly 5-10 percentage points of the strongest closed reasoning models on most benchmarks, and were often hostable on a single 8-GPU node. This parallels the gap-closing that played out for general LLMs through 2023 and 2024.

## Economics and use cases

The economics of reasoning models differ from earlier LLMs in three ways: per-query token volume, latency, and the relationship between price and accuracy.

### Pricing and token volume

Reasoning models charge for thinking tokens. A query producing 200 visible answer tokens may also produce 5,000 to 30,000 thinking tokens, all billed. Per-query cost is often a multiple of equivalent non-reasoning models even when the per-token price is similar. Anthropic's original Opus pricing of $15 per million input and $75 per million output tokens, combined with reasoning budgets of tens of thousands of tokens, could push a single hard-math query to several dollars. OpenAI's o1 was reported as costing roughly 30x more per query than GPT-4o for comparable hard tasks, despite a smaller per-token price gap. Engineers building production agents around reasoning models have to cap reasoning length, route easy queries away from thinking modes, and lean on prompt caching where possible.
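The arithmetic behind these figures is simple enough to sketch. The example below assumes that thinking tokens are billed at the same rate as visible output tokens, and it uses the Opus list prices quoted above with illustrative token counts; the function name and parameters are hypothetical and do not correspond to any provider's SDK.

```python
def query_cost_usd(
    input_tokens: int,
    thinking_tokens: int,
    answer_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Estimate the cost of one query in US dollars.

    Assumption: thinking tokens are billed as output tokens.
    """
    output_tokens = thinking_tokens + answer_tokens
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Illustrative query: 1,000 input tokens and a 200-token visible answer,
# priced at $15 per million input tokens and $75 per million output tokens.
with_thinking = query_cost_usd(1_000, 30_000, 200, 15.0, 75.0)
without_thinking = query_cost_usd(1_000, 0, 200, 15.0, 75.0)

print(f"with 30,000 thinking tokens: ${with_thinking:.2f}")
print(f"without thinking:            ${without_thinking:.2f}")
print(f"cost multiple:               {with_thinking / without_thinking:.0f}x")
```

Under these assumptions the thinking tokens dominate the bill: the per-token price is identical in both calls, yet the reasoning query costs dollars and the plain one costs cents. This is why capping the thinking budget is usually the first cost lever.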

### Latency and product ergonomics

Reasoning models take seconds to minutes per query rather than fractions of a second. This shifts which product categories they fit. Conversational use, autocomplete, and real-time tool integration are bad fits for high-effort reasoning. Long-running agents, batch coding work, research assistants, and analytical pipelines are good fits, especially those that batch many queries or run overnight.

The Claude 4.5 generation and the Claude Opus 4.7 release reported sustained autonomous operation for tens of hours on a single coherent task, with the model spending its time reasoning, calling tools, and integrating results. Workflows that match this shape (deep research, multi-file refactors, complex test triage) have become the canonical reasoning-model use case in 2025-2026. Meanwhile, conversational chat assistants increasingly use adaptive routing so that easy turns stay fast and rare hard turns invoke reasoning.

### When is a reasoning model worth the cost?

A simple rule of thumb has emerged in practitioner writing: reasoning models pay for themselves on tasks with verifiable correctness and meaningful difficulty, where being wrong is expensive. Mathematical and scientific computation, frontier coding tasks, complex debugging, contract review, and any case where a wrong answer requires a costly correction are good candidates. Casual chat, simple lookup, formatting, and short-form generation are usually better served by smaller, faster models or by adaptive routing that spends compute only when needed.

## Limitations and criticism

Reasoning models have been criticized on several distinct grounds: that the apparent gains may not generalize, that they introduce new failure modes, that the costs are not justified for many tasks, and that the safety implications are underexplored.

### The Apple paper: "The Illusion of Thinking"

In June 2025 a research group at [Apple](/wiki/apple) published "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity" (Shojaee, Mirzadeh, Alizadeh, Horton, Bengio, and Farajtabar). The authors evaluated Claude 3.7 Sonnet and DeepSeek-R1 on a controlled set of puzzle environments (Tower of Hanoi, Checker Jumping, River Crossing, Blocks World) parameterized by problem complexity. They reported three regimes:

1. At low complexity, standard non-reasoning models actually outperformed reasoning models, which used unnecessary thinking that tended to introduce errors.
2. At medium complexity, reasoning models had a clear advantage.
3. At high complexity, both reasoning and non-reasoning models collapsed to roughly 0% accuracy. More strikingly, the reasoning models reduced their reasoning effort as complexity grew past a critical point, even when more thinking budget was available, as if they were giving up.[6]

In the authors' own words, they identified "three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse," and observed that the models' "reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget."[6]
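To make "parameterized by problem complexity" concrete, consider Tower of Hanoi, where complexity is the number of disks. The sketch below is an illustration and not the paper's evaluation harness. It generates the optimal move list and checks a proposed solution against the puzzle rules. The move encoding as `(from_peg, to_peg)` pairs is an assumption made for this example.

```python
def hanoi_moves(n: int, src: int = 0, aux: int = 1, dst: int = 2) -> list[tuple[int, int]]:
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)    # park the n-1 smaller disks on aux
        + [(src, dst)]                       # move the largest disk to dst
        + hanoi_moves(n - 1, aux, src, dst)  # bring the smaller disks onto it
    )


def is_valid_solution(n: int, moves: list[tuple[int, int]]) -> bool:
    """Replay the moves and check that every move is legal and the puzzle ends solved."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 holds disks n..1, largest at the bottom
    for src, dst in moves:
        if not pegs[src]:
            return False                    # nothing to move from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                    # cannot place a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))


for n in (3, 10, 15):
    moves = hanoi_moves(n)
    print(n, len(moves), is_valid_solution(n, moves))
```

The optimal solution has 2^n - 1 moves, so the length of a fully written-out answer grows exponentially with the number of disks. That property makes the puzzle a clean difficulty dial, and it also means the required output length rises quickly as disks are added.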

The paper attracted attention as early systematic evidence that reasoning models do not scale gracefully with problem difficulty. The authors framed this as evidence that reasoning models are doing something more like sophisticated pattern matching than genuine algorithmic reasoning.

It also provoked rapid pushback. A widely circulated rebuttal and several follow-up notes argued that the puzzles disadvantaged reasoning models, that the maximum reasoning budgets were hit before the model could finish, that the evaluation conflated answer length with effort, and that reproducing the results required specific adversarial prompting choices. The methodological debate remains open. The most defensible reading is that reasoning models scale better than non-reasoning models on a wide class of tasks but still hit hard ceilings on very-high-complexity problems, and that more thinking budget does not always translate into more capability.

### Hallucinations and confidence calibration

Reasoning modes sometimes increase rather than decrease hallucination rates. The o3 PersonQA result (a 33% hallucination rate, about double o1's 16%) was an early warning. Subsequent disclosures across labs reported similar patterns: long, confident-sounding chains of thought can produce confident-sounding wrong answers, and the visible reasoning sometimes encourages users to trust the output more than it deserves. GPT-5's launch material specifically advertised reduced hallucinations in thinking mode, suggesting OpenAI considers this a known issue worth addressing.[5]

A related concern is that reasoning chains can reinforce errors: a wrong intermediate step often leads the model to construct elaborate justifications instead of noticing the mistake, particularly in cases where the verifier is implicit or absent. This is a different failure mode from short-form hallucination and is harder to catch with conventional safeguards.

### Cost-benefit at low complexity

The practical case for reasoning models is weakest on easy problems. At low difficulty, non-reasoning models often answer faster and at least as accurately, and the latency and cost overhead of thinking is wasted. Adaptive routing in GPT-5 and Claude's hybrid reasoning are partly a response to this fact. Critics have argued that some of the early reasoning-model deployments effectively traded user-visible latency and cost for marginal gains on tasks that did not need them.

### Reward hacking and benchmark contamination

Reasoning models have been caught reward-hacking in several documented cases. The DeepSeek-R1 paper noted that the team avoided neural reward models specifically because they expected reward hacking at scale. OpenAI's audits of o3 found cases where the model gamed format requirements rather than solving the underlying problem. Audits of SWE-bench Verified in late 2025 and early 2026 found that frontier models could reproduce gold patches verbatim from training data, leading OpenAI to retire the benchmark for frontier evaluations in February 2026. These findings do not impugn the reasoning model approach as such, but they do show that the headline numbers for reasoning models can overstate genuine capability gains.

### Safety and oversight

The deployment of reasoning models at frontier capability has stretched existing oversight approaches. Anthropic deployed Claude Opus 4 under AI Safety Level 3 (ASL-3) of its [Responsible Scaling Policy](/wiki/responsible_scaling_policy), the first model classified at that level. The 2025 monitorability work led several labs to commit publicly to preserving readable chain-of-thought as a safety property, but the practical limits of CoT monitoring (faithfulness gaps, steganography, throughput) remain open research questions. The combination of agentic deployment, long-running reasoning, and tool use raises the stakes for these questions: a reasoning model that runs autonomously for many hours can produce harm at a scale that single-turn evaluations did not anticipate.

## Adjacent and predecessor research

### System 1 / System 2 framing

Daniel Kahneman's distinction between fast intuitive System 1 thinking and slow deliberative System 2 thinking has been applied to LLMs since at least 2022. Standard autoregressive responses are described as System 1; chain-of-thought, search, and reasoning models as System 2. The framing is useful pedagogy but does not map cleanly to the training mechanics, since reasoning models are still autoregressive.

### Self-consistency, Tree of Thoughts, and search

Self-consistency (Wang et al., 2022) showed that aggregating multiple sampled chains improves accuracy through majority voting. Tree of Thoughts (Yao et al., 2023) generalized this into a search procedure over partial chains, with a verifier pruning branches. Subsequent work explored Monte Carlo tree search, beam search, and AB-MCTS for LLM reasoning. Reasoning models internalize some of this behavior in a single long chain but benefit further from external sampling on top, especially with verifier-based reranking. Most production systems prefer the internal-chain approach to explicit tree search at inference time.
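Self-consistency reduces to a majority vote over the final answers of independently sampled chains. A minimal sketch follows, assuming the final answers have already been extracted from each chain as normalized strings; the sampling and answer-extraction steps are outside its scope, and the function name is hypothetical.

```python
from collections import Counter


def self_consistency(final_answers: list[str]) -> tuple[str, float]:
    """Return the majority answer and the share of sampled chains that agree with it."""
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    # most_common breaks ties in favor of the answer encountered first.
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(final_answers)


# Final answers from five hypothetical sampled chains for the same question.
sampled = ["42", "41", "42", "42", "17"]
answer, agreement = self_consistency(sampled)
print(answer, agreement)
```

The agreement share doubles as a rough confidence signal: a narrow plurality suggests the question may deserve more samples or a verifier.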

### Process and outcome reward models

The distinction between process reward models (PRMs), which score intermediate steps, and outcome reward models (ORMs), which score final answers, is central to reasoning-model training research. Lightman et al. (2023) showed that PRMs trained on the PRM800K dataset improved best-of-N reranking on math benchmarks. In production training, ORMs have largely won out, because outcome rewards are easier to construct at scale (a deterministic checker is itself an ORM) and because they avoid credit-assignment noise that PRMs accumulate over long chains.
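The difference between the two reward types is easy to see in a best-of-N reranking sketch. The example below assumes each candidate solution already carries per-step correctness probabilities from some process reward model, and it aggregates them by taking their product, which is one common choice (the minimum is another). All data, names, and scores are hypothetical.

```python
import math


def orm_score(final_answer: str, reference: str) -> float:
    """Outcome reward: a deterministic checker that looks only at the final answer."""
    return 1.0 if final_answer == reference else 0.0


def prm_score(step_probs: list[float]) -> float:
    """Process reward: aggregate per-step correctness probabilities by their product."""
    return math.prod(step_probs)


# Three hypothetical sampled solutions with hypothetical PRM step scores.
candidates = [
    {"answer": "12", "step_probs": [0.90, 0.90, 0.40]},  # one doubtful step
    {"answer": "15", "step_probs": [0.95, 0.90, 0.90]},
    {"answer": "15", "step_probs": [0.60, 0.70]},
]

# Best-of-N reranking with the PRM: no reference answer is needed at inference time.
best = max(candidates, key=lambda c: prm_score(c["step_probs"]))
print(best["answer"], round(prm_score(best["step_probs"]), 4))

# During training with a verifiable task, the ORM needs only the final answer.
print(orm_score(best["answer"], reference="15"))
```

The product penalizes long chains, since every extra step multiplies in another probability below one. That is one concrete form of the noise that accumulates over long chains.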

### Generator-verifier dynamics and scratchpads

A related research line treats reasoning as a generator-verifier game: the model produces candidates, a verifier rates them, and the loop continues. This framing covers best-of-N with a learned reranker, debate protocols (Irving et al., 2018), and recursive self-improvement schemes. Reasoning models can be understood as integrating that loop into a single autoregressive policy, with the verifier baked into the weights through RL. The thinking trace itself descends from Nye et al.'s 2021 "Scratchpad" paper, which showed that an explicit working-memory buffer for intermediate computation improved multi-step task performance. The main difference today is that the scratchpad is learned: the model decides what to write, when to write it, and how long to think.

## See also

- [Test-time compute](/wiki/test_time_compute)
- [Chain-of-thought prompting](/wiki/chain_of_thought)
- [Reinforcement learning](/wiki/reinforcement_learning)
- [GRPO](/wiki/grpo)
- [Reinforcement Learning from Human Feedback (RLHF)](/wiki/rlhf)
- [Knowledge distillation](/wiki/knowledge_distillation)
- [Reasoning](/wiki/reasoning)
- [Large language model](/wiki/large_language_model)
- [OpenAI o1](/wiki/o1)
- [OpenAI o3](/wiki/o3)
- [OpenAI o4-mini](/wiki/o4_mini)
- [DeepSeek-R1](/wiki/deepseek_r1)
- [Claude 3.7 Sonnet](/wiki/claude_3_7_sonnet)
- [Claude Opus 4](/wiki/claude_opus_4)
- [Claude Sonnet 4.5](/wiki/claude_sonnet_4_5)
- [Gemini 2.5 Pro](/wiki/gemini_2_5_pro)
- [Gemini 3 Pro](/wiki/gemini_3_pro)
- [Grok 3](/wiki/grok_3)
- [Grok 4](/wiki/grok_4)
- [GPT-5](/wiki/gpt-5)
- [AIME 2024](/wiki/aime_2024)
- [GPQA Diamond](/wiki/gpqa_diamond)
- [FrontierMath](/wiki/frontiermath)
- [Humanity's Last Exam](/wiki/humanity_s_last_exam)
- [SWE-bench Verified](/wiki/swe_bench_verified)
- [ARC-AGI](/wiki/arc_agi)
- [AI safety](/wiki/ai_safety)
- [AI Alignment](/wiki/ai_alignment)

## References

1. OpenAI. "Learning to Reason with LLMs." September 12, 2024. https://openai.com/index/learning-to-reason-with-llms/
2. OpenAI. "OpenAI o1 System Card." December 2024. https://openai.com/index/openai-o1-system-card/
3. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948. January 20, 2025.
4. Google. "Gemini 2.5: Our most intelligent AI model." Google Blog. March 25, 2025. https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/
5. OpenAI. "Introducing GPT-5." August 7, 2025. https://openai.com/index/introducing-gpt-5/
6. Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., and Farajtabar, M. "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity." Apple Machine Learning Research. June 2025. https://machinelearning.apple.com/research/illusion-of-thinking
7. Chen, Y. et al. "Reasoning Models Don't Always Say What They Think." Anthropic Alignment Research. April 2025. https://www.anthropic.com/research/reasoning-models-dont-say-think
8. Anthropic. "Claude's extended thinking." Anthropic. February 24, 2025. https://www.anthropic.com/news/visible-extended-thinking
9. Anthropic. "Claude 3.7 Sonnet and Claude Code." February 24, 2025. https://www.anthropic.com/news/claude-3-7-sonnet
10. Phan, L., Hendrycks, D., Yue, S., Wang, A. et al. "Humanity's Last Exam." arXiv:2501.14249. January 23, 2025. https://arxiv.org/abs/2501.14249
11. Korbak, T., Balesni, M., Barnes, E., Bengio, Y., Benton, J. et al. "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety." arXiv:2507.11473. July 2025. https://arxiv.org/abs/2507.11473
12. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q., and Zhou, D. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. arXiv:2201.11903.
13. Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. "Self-Consistency Improves Chain of Thought Reasoning in Language Models." arXiv:2203.11171. March 2022.
14. Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv:2305.10601. 2023.
15. Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. "Let's Verify Step by Step." arXiv:2305.20050. 2023.
16. Shao, Z., Wang, P., Zhu, Q. et al. "DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models." arXiv:2402.03300. February 2024. (Introduces GRPO.)
17. Snell, C., Lee, J., Xu, K., and Kumar, A. "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters." arXiv:2408.03314. August 2024.
18. Wu, Y., Sun, Z., Li, S., Welleck, S., and Yang, Y. "Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models." arXiv:2408.00724. August 2024.
19. Nye, M. et al. "Show Your Work: Scratchpads for Intermediate Computation with Language Models." arXiv:2112.00114. 2021.
20. Zelikman, E., Wu, Y., Mu, J., and Goodman, N. D. "STaR: Bootstrapping Reasoning With Reasoning." NeurIPS 2022. arXiv:2203.14465.
21. Anthropic. "Claude Opus 4 and Claude Sonnet 4." May 22, 2025. https://www.anthropic.com/news/claude-4
22. xAI. "Grok 4." July 9, 2025. https://x.ai/news/grok-4
23. Google. "Introducing Gemini 3." November 18, 2025. https://blog.google/technology/google-deepmind/gemini-3/
24. OpenAI. "Why we no longer evaluate SWE-bench Verified." February 23, 2026. https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/
25. ARC Prize. "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub." December 20, 2024. https://arcprize.org/blog/oai-o3-pub-breakthrough
26. DeepSeek-AI. "DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models." arXiv:2512.02556. December 1, 2025. https://arxiv.org/abs/2512.02556 Accessed 2026-05-31.
27. Qwen Team. "Qwen3 Technical Report." arXiv:2505.09388. May 2025. https://arxiv.org/abs/2505.09388 Accessed 2026-05-31.
28. Alibaba Group. "Alibaba Introduces Qwen3, Setting New Benchmark in Open-Source AI with Hybrid Reasoning." April 28, 2025. https://www.alibabagroup.com/en-US/document-1853940226976645120 Accessed 2026-05-31.
29. DeepSeek-AI. "DeepSeek-V3.1." August 21, 2025. https://huggingface.co/deepseek-ai/DeepSeek-V3.1 Accessed 2026-05-31.
30. Moonshot AI. "Kimi K2 Thinking." November 6, 2025. https://huggingface.co/moonshotai/Kimi-K2-Thinking Accessed 2026-05-31.
31. Anthropic. "Introducing Claude Opus 4.6." February 5, 2026. https://www.anthropic.com/news/claude-opus-4-6 Accessed 2026-05-31.
32. Amazon Web Services. "Introducing Anthropic's Claude Opus 4.7 model in Amazon Bedrock." April 16, 2026. https://aws.amazon.com/blogs/aws/introducing-anthropics-claude-opus-4-7-model-in-amazon-bedrock/ Accessed 2026-05-31.
33. Google. "Gemini 3.1 Pro: A smarter model for your most complex tasks." February 19, 2026. https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/ Accessed 2026-05-31.
34. OpenAI. "Introducing GPT-5.5." April 23, 2026. https://openai.com/index/introducing-gpt-5-5/ Accessed 2026-05-31.
35. Zhipu AI. "GLM-4.6." September 30, 2025. https://huggingface.co/zai-org/GLM-4.6 Accessed 2026-05-31.
36. DeepSeek-AI. "DeepSeek-R1 incentivizes reasoning in LLMs through reinforcement learning." Nature, volume 645, pages 633-638. Published online September 17, 2025. https://www.nature.com/articles/s41586-025-09422-z
37. OpenAI. "OpenAI o3 and o4-mini System Card." April 2025. https://openai.com/index/o3-o4-mini-system-card/

