Inference-time scaling is the practice of improving an AI model's output quality by allocating more computational resources during inference rather than during training. Instead of a single forward pass, a model using inference-time scaling generates extended reasoning chains, explores multiple candidate solutions, applies search algorithms, or runs verification steps before committing to an answer. The more compute a model is allowed to spend at inference time, the better it tends to perform on difficult tasks.
The concept gained wide attention after OpenAI released o1 in September 2024, the first widely-deployed commercial model explicitly trained to exploit inference-time compute. Researchers and practitioners now commonly use "inference-time scaling" interchangeably with test-time compute, though there is a subtle framing difference: test-time compute emphasizes the algorithmic mechanisms (how additional tokens or steps are generated), while inference-time scaling emphasizes the compute-performance relationship (how performance improves as more FLOPs are devoted to inference). Both refer to the same underlying phenomenon.
For most of the deep learning era, progress in language models came primarily from scaling three training-time quantities: model size, dataset size, and total training compute. The 2020 Kaplan et al. scaling laws paper from OpenAI showed that language model loss follows smooth power-law relationships with these quantities across more than seven orders of magnitude. The 2022 Chinchilla paper from DeepMind refined this understanding, showing that model size and training tokens should scale in equal proportion and that many existing models were undertrained relative to their parameter counts.
Both of these frameworks assumed that inference was cheap and fixed: a model processes a prompt and returns one output in a single pass. The only way to get a smarter answer was to train a larger or better model.
By the early 2020s, the costs and practical limits of that approach were becoming clear. Training a frontier model required tens of thousands of GPUs running for months. High-quality training data was an increasingly binding constraint; some researchers observed that the stock of useful publicly available internet text was being consumed faster than new text could be collected or synthesized. Larger models produced better outputs on average but could not reliably carry out deep multi-step reasoning on the hardest problems. Benchmark improvements from adding more parameters were real but diminishing on tasks requiring long chains of logical deduction.
This created a practical opening for a different strategy: rather than always training larger models, could a fixed model reason its way to better answers if it was given more time to think?
The insight that generation length correlates with answer quality predates o1. Jason Wei et al. at Google published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" in 2022, showing that prompting models to write out their reasoning step by step before giving a final answer dramatically improved accuracy on arithmetic, commonsense, and symbolic tasks. The mechanism was straightforward: decomposing a multi-step problem into intermediate steps makes each individual step easier to get right, allowing the model to handle problems it could not solve in one shot.
That same year, Xuezhi Wang et al. introduced self-consistency, sampling many independent reasoning chains and selecting the answer with the most votes across them. On a range of math and reasoning benchmarks, majority voting over diverse reasoning paths outperformed any single greedy chain. This was an early demonstration that throwing more tokens at a problem, even naively, helped.
The 2023 Tree of Thoughts paper by Yao et al. generalized this further, proposing a structured search over reasoning steps rather than independent samples. Instead of generating complete solutions in parallel, Tree of Thoughts builds a tree where each node is an intermediate reasoning state, and a separate evaluation function prunes low-quality branches before expanding further. This allowed models to backtrack and explore alternatives in a way that linear chain-of-thought prompting cannot.
These methods worked as prompting techniques applied on top of base models. None required the model itself to be trained to reason this way. That would come later.
Chain-of-thought prompting is the simplest inference-time scaling method. The model generates explicit intermediate reasoning steps before producing a final answer. Each step consumes tokens, so a chain-of-thought response costs more compute than a direct answer. In exchange, performance on multi-step problems improves substantially. This approach requires no search or external verifier; the model generates a single reasoning trajectory.
The key limitation is that a wrong turn early in the chain compounds: if the model makes an error in step 3 of a 10-step derivation, all subsequent steps build on that mistake. Chain-of-thought alone does not give the model any mechanism to detect or correct its own errors.
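In code, the pattern is just a prompt that elicits intermediate steps plus a parser that pulls out the final line. A minimal sketch, where `complete` is a hypothetical stand-in for whatever LLM completion API is in use (the function name and prompt wording are illustrative, not from any particular paper):

```python
def complete(prompt: str) -> str:
    """Hypothetical placeholder for a real LLM completion call."""
    raise NotImplementedError

def chain_of_thought_answer(question: str) -> str:
    # Ask for explicit intermediate reasoning before the final answer.
    # The reasoning tokens are the extra inference-time compute.
    prompt = (
        f"Question: {question}\n"
        "Work through the problem step by step, then give the final "
        "answer on a line starting with 'Answer:'.\n"
    )
    response = complete(prompt)
    # Return only the final answer; the reasoning stays internal.
    for line in reversed(response.splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return response.strip()  # fallback if the model ignored the format
```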
Best-of-N (BoN) sampling generates N independent responses to the same prompt and selects the best one using a scoring function. The scoring function can be as simple as majority voting (pick the answer that appears most often), or it can be a learned reward model that scores each candidate. Because samples are independent, they can be generated in parallel, which reduces wall-clock latency compared to sequential search. BoN scales predictably with N: performance improves roughly log-linearly in the number of samples, though gains plateau when the model's errors are systematic rather than random, since a mistake shared by every sample cannot be voted or scored away.
Self-consistency is a specific form of best-of-N that applies majority voting over chain-of-thought reasoning paths. The model generates multiple complete reasoning trajectories and then marginalizes over them by voting on the final answers. It is effective because different reasoning paths may reach the correct answer via different routes, and errors tend to be idiosyncratic while correct answers converge.
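Self-consistency is short enough to sketch end to end. A minimal version, assuming a hypothetical `sample_completion` wrapper around a temperature-sampled LLM call and a task-specific `extract_answer` parser (both placeholders, not a real API):

```python
from collections import Counter

def sample_completion(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical placeholder for a non-greedy (sampled) LLM call."""
    raise NotImplementedError

def extract_answer(response: str) -> str:
    # Task-specific parsing; here we assume the answer is the last line.
    return response.splitlines()[-1].strip()

def self_consistency(prompt: str, n: int = 16) -> str:
    # Best-of-N with majority voting: sample n independent reasoning
    # chains (embarrassingly parallel), then vote on the final answers.
    answers = [extract_answer(sample_completion(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Replacing the vote with `max(candidates, key=reward_model)` turns the same skeleton into generic reward-scored best-of-N.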
A process reward model (PRM) assigns a score not just to the final answer but to every individual reasoning step. Trained on datasets where human annotators (or automated pipelines) label each step as correct or incorrect, a PRM can be used as a verifier during search: the model generates candidate next steps, the PRM scores them, and the search algorithm selects the highest-scoring path to expand. PRMs enable more fine-grained inference-time compute allocation than outcome-based verifiers, which only signal success or failure at the end. The August 2024 paper by Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters," showed that searching with a PRM under a compute budget matched or exceeded models 14 times larger on certain problem classes.
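One way a PRM can steer search is a stepwise beam search, sketched below. `propose_steps` and `prm_score` are hypothetical placeholders for the generator and verifier models, and the beam-search framing is a simplification of the search variants studied in the paper:

```python
def propose_steps(trace: str, k: int) -> list[str]:
    """Hypothetical placeholder: sample k candidate next reasoning steps."""
    raise NotImplementedError

def prm_score(trace: str, step: str) -> float:
    """Hypothetical placeholder: PRM's correctness score for one step."""
    raise NotImplementedError

def prm_beam_search(question: str, beam_width: int = 4,
                    expand: int = 4, max_steps: int = 10) -> str:
    # Each beam entry is (cumulative PRM score, partial reasoning trace).
    beams = [(0.0, question)]
    for _ in range(max_steps):
        candidates = []
        for score, trace in beams:
            for step in propose_steps(trace, expand):
                candidates.append(
                    (score + prm_score(trace, step), trace + "\n" + step))
        if not candidates:
            break
        # The PRM prunes weak partial traces before more compute is
        # spent extending them -- step-level credit assignment.
        beams = sorted(candidates, key=lambda c: c[0],
                       reverse=True)[:beam_width]
    return beams[0][1]  # highest-scoring trace
```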
Monte Carlo Tree Search (MCTS) applies a structured tree search to language model reasoning. The algorithm builds a tree of reasoning states, uses random rollouts to estimate the value of each state, and progressively focuses sampling on promising branches. MCTS is well-suited to problems with long reasoning chains and a large branching factor, but it is expensive: getting useful signal from rollouts requires many samples, and MCTS typically demands far more tokens than simpler methods like BoN at the same answer quality. Research has found that for many problem types, the performance gains from MCTS over simpler search strategies do not justify the additional compute cost in practice.
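A compact version of the loop is sketched below, reusing the same hypothetical `propose_steps` placeholder plus a `rollout_and_verify` stub that stands in for completing a trace and checking the answer; real implementations add many refinements this sketch omits:

```python
import math
import random

def propose_steps(trace: str, k: int) -> list[str]:
    """Hypothetical placeholder: sample k candidate next reasoning steps."""
    raise NotImplementedError

def rollout_and_verify(trace: str) -> float:
    """Hypothetical placeholder: finish the reasoning from `trace` and
    return 1.0 if the final answer verifies, else 0.0."""
    raise NotImplementedError

class Node:
    def __init__(self, trace: str, parent=None):
        self.trace, self.parent = trace, parent
        self.children: list["Node"] = []
        self.visits, self.value = 0, 0.0

def uct(node: Node, c: float = 1.4) -> float:
    # Upper Confidence bound for Trees: balance exploiting high-value
    # children against exploring rarely-visited ones.
    if node.visits == 0:
        return float("inf")
    return (node.value / node.visits
            + c * math.sqrt(math.log(node.parent.visits) / node.visits))

def mcts(question: str, iterations: int = 200) -> str:
    root = Node(question)
    for _ in range(iterations):
        node = root
        while node.children:                       # 1. selection
            node = max(node.children, key=uct)
        for step in propose_steps(node.trace, 3):  # 2. expansion
            node.children.append(Node(node.trace + "\n" + step, node))
        if node.children:
            node = random.choice(node.children)
        reward = rollout_and_verify(node.trace)    # 3. rollout
        while node is not None:                    # 4. backpropagation
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).trace
```

Every iteration pays for at least one rollout, which is itself a full generation; this is where the token bill comes from.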
A recurring finding in the inference scaling literature is that not all prompts need the same amount of compute. Simple questions can be answered correctly in a short chain; hard questions benefit from deep search. Adaptive approaches try to route prompts to the appropriate compute level, allocating more tokens only when necessary. This reduces average costs considerably compared to always running maximum-depth search. Commercial reasoning APIs expose this control to users through configurable reasoning-effort settings and thinking-token budgets.
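A minimal router, building on the `sample_completion`, `extract_answer`, and `self_consistency` placeholders from the self-consistency sketch above; the agreement heuristic and thresholds are illustrative assumptions, not a published method:

```python
def route_compute(question: str) -> str:
    # Cheap first pass: a few quick samples. Agreement among them is a
    # common proxy for "this question is easy".
    quick = [extract_answer(sample_completion(question)) for _ in range(3)]
    if len(set(quick)) == 1:
        return quick[0]                       # easy: return the cheap answer
    return self_consistency(question, n=32)  # hard: escalate the budget
```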
On September 12, 2024, OpenAI publicly released o1-preview and o1-mini, the first models in what it called the "o" series of reasoning models. OpenAI described them as trained to spend more time thinking through problems before responding, using a private chain of thought that the model refines iteratively through reinforcement learning.
The core innovation was not just prompting a model to produce chain-of-thought outputs. OpenAI trained o1 with reinforcement learning to generate and refine its own chains of thought, rewarding correct final answers and penalizing incorrect ones. Through this training process, the model learned to try different approaches, check its work, and abandon unpromising paths without having human-written reasoning traces to imitate. The internal reasoning process is not shown to users; they see only the final answer and a brief summary.
The performance gains compared to GPT-4o were substantial on tasks requiring multi-step reasoning. On a qualifying exam for the International Mathematical Olympiad, GPT-4o solved 13% of problems while o1 solved 83%. On the American Invitational Mathematics Examination (AIME) 2024, o1 scored 74.4% with a single sample, compared to roughly 12% for GPT-4o. In competitive programming, o1 reached the 89th percentile on Codeforces. These improvements came specifically from giving the model more compute at inference time; performance scaled further when o1 was allowed more reasoning tokens.
The o1 release established a concrete commercial demonstration that inference-time scaling was viable at production scale. It also showed that the performance of a single trained model is not a fixed ceiling: the same weights, given more compute at inference time, produce qualitatively different and better outputs.
OpenAI later released o3, announced in December 2024 and released in April 2025, which extended the paradigm further. o3 achieved 91.6% on AIME 2024 and 88.9% on AIME 2025; in July 2025, an experimental OpenAI reasoning model went further still, solving five of six problems at the International Mathematical Olympiad, enough points for a gold medal. o3 exposed compute scaling to users more directly, offering different "reasoning effort" settings with different cost and latency profiles.
DeepSeek-R1, released by the Chinese AI company DeepSeek in January 2025, demonstrated that the inference-time scaling approach pioneered by o1 could be replicated outside OpenAI and released openly. The 671-billion-parameter model matched or exceeded o1 performance on several benchmarks while being released under the MIT license, making it freely available for research and commercial deployment.
DeepSeek-R1's training process used Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm introduced in the DeepSeekMath paper. Unlike Proximal Policy Optimization (PPO), which requires a separate value network, GRPO estimates the baseline from a group of sampled outputs for the same input, normalizing rewards within the group. This reduces memory and compute requirements compared to PPO while preserving the core RL training dynamic.
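The group-relative advantage itself is a few lines. A minimal sketch of the outcome-reward form described in the DeepSeekMath paper (the surrounding PPO-style clipped policy update is omitted):

```python
from statistics import mean, pstdev

def grpo_advantages(rewards: list[float]) -> list[float]:
    # GRPO baseline: standardize each sampled output's reward against
    # the other outputs for the same prompt. This replaces PPO's
    # learned value network as the advantage estimator.
    mu = mean(rewards)
    sigma = pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mu) / sigma for r in rewards]

# Example: four sampled answers to one prompt, reward 1.0 if correct.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```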
A notable intermediate result was DeepSeek-R1-Zero, trained with pure RL from a base model without any supervised fine-tuning on human-written reasoning traces. R1-Zero spontaneously developed extended chain-of-thought behavior and improved its AIME 2024 accuracy from 15.6% (pass@1) to 71.0% through extended reasoning, reaching 86.7% with majority voting. This demonstrated that RL training with verifiable rewards is sufficient to elicit reasoning behavior without requiring hand-labeled reasoning demonstrations.
DeepSeek-R1 improved on R1-Zero by adding cold-start supervised fine-tuning before RL training and additional alignment stages. The final model matched o1 on most benchmarks while offering API access at a small fraction of o1's per-token price. DeepSeek also released six smaller distilled models (1.5B, 7B, 8B, 14B, 32B, and 70B parameters) derived from R1's reasoning traces, demonstrating that reasoning capability could be transferred to smaller models through imitation learning.
The R1 release accelerated open-source research into reasoning models significantly. Within weeks of release, other organizations began training their own reasoning models using similar RL-with-verifiable-rewards pipelines, and DeepSeek's distilled models appeared in many downstream fine-tuning projects.
The terms "inference-time scaling" and test-time compute are used largely interchangeably in the literature, but the framing differs in emphasis. Test-time compute is the older term (predating the o1 release) and tends to be used in academic papers that discuss the compute budget and algorithmic mechanisms: how many tokens or samples are generated, what search strategy is used, and how efficiency varies across problem difficulty. Inference-time scaling is the more common framing in industry and popular writing, and it emphasizes the analogy to training-time scaling laws: just as performance scales with training compute, it also scales with inference compute.
The distinction sometimes matters. Test-time compute research focuses partly on methods that work with base models through clever prompting or external search, while inference-time scaling often refers specifically to models trained to reason more deeply when given more compute budget. In practice, most discussions treat the two as synonymous, and the technical content is the same.
The clearest evidence for inference-time scaling comes from mathematical and programming benchmarks, where answers can be verified automatically and the difficulty spectrum is well-characterized.
AIME (American Invitational Mathematics Examination): AIME problems require multi-step algebraic, combinatorial, and number-theoretic problem solving, with integer answers that can be checked automatically. GPT-4o scored around 9-12% on AIME 2024. DeepSeek-R1 improved this to 79.8%. o1 scored 74.4% and o3 reached 91.6%. The gains came from models trained to exploit extended reasoning at inference time, not from simply scaling up pretraining.
MATH-500: A benchmark of 500 competition math problems. Standard LLMs plateau around 70-75% at scale. Reasoning models cross 90-97%.
Codeforces: Competitive programming rankings. GPT-4o sits around the 11th percentile. o1 reached the 89th percentile. o3 continued improving further.
ARC-AGI: A general reasoning benchmark designed to be resistant to memorization. Standard LLMs score near zero. o3 with high-compute settings scored 87.5%, raising significant discussion about what the benchmark was actually measuring.
The Wu et al. paper "Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models" (ICLR 2025) studied scaling behavior formally across multiple models and tasks. They found that inference-compute scaling follows smooth, predictable curves: performance rises with compute budget, with diminishing returns whose rate depends on task difficulty and model capability. The optimal allocation strategy (how many samples versus how deep to search) depends on the specific task and budget.
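Curves like these are typically traced from a fixed pool of samples per problem. One standard tool is the unbiased pass@k estimator of Chen et al. (2021), shown below; this is a common way to measure sample-scaling behavior, not necessarily the exact methodology of Wu et al.:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator of pass@k: the probability that at least one of
    # k samples, drawn from n generated solutions of which c are correct,
    # solves the problem (Chen et al., 2021).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Sweeping k traces an accuracy-vs-samples curve for one problem set:
for k in (1, 4, 16, 64):
    print(k, round(pass_at_k(n=100, c=10, k=k), 3))
```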
Extended reasoning takes time. A non-reasoning model answers a typical query in under a second. An o1-class model thinking deeply may generate thousands of internal reasoning tokens before producing output, adding several seconds to tens of seconds of latency. For many applications, this is acceptable or even desirable (users tolerate longer waits for hard problems), but latency-sensitive deployments such as real-time conversation or live code suggestions are poorly served by maximum-depth reasoning.
Adaptive compute budgets are one response: the model uses shallow reasoning for simple queries and deep reasoning for hard ones. Research by groups at Stanford and elsewhere has shown that latency-aware test-time scaling can reduce average token generation while maintaining accuracy close to full-compute baselines.
Reasoning models generate many more tokens than standard models. The internal chain of thought for a hard math problem might run to 10,000-100,000 tokens, compared to a few hundred for a direct answer. Since API pricing for most commercial models is proportional to tokens, this multiplies cost substantially. OpenAI's 2024 inference spending reportedly reached approximately $2.3 billion, roughly 15 times what it spent on training GPT-4.5.
For users, the implication is that reasoning models are not a free upgrade. A query that costs $0.01 with GPT-4o might cost $0.10-$1.00 with o1-class reasoning at high compute settings. DeepSeek-R1's open weights and efficient inference implementation reduced these costs significantly, and competition across providers has driven down API prices.
Inference-time scaling works well for tasks with verifiable, structured answers: mathematics, formal logic, programming. It works less well for knowledge-intensive retrieval tasks where the bottleneck is factual recall rather than reasoning. A 2025 study posted on OpenReview found that test-time scaling in reasoning models is not yet effective for knowledge-intensive tasks and can actually increase hallucination rates in some settings, because extended reasoning encourages the model to generate confident-sounding but unsupported claims rather than acknowledging uncertainty.
This asymmetry has practical implications. Deploying a reasoning model on a customer support chatbot does not guarantee accuracy improvements and may increase costs without benefit. Reasoning models are best matched to tasks with clear solution criteria.
Extended reasoning introduces failure modes specific to long chain-of-thought generation. Overthinking occurs when a model spends excessive tokens on a simple problem, generating unnecessary complexity without improving the answer. Task drift occurs when a long internal reasoning chain leads the model away from the original prompt constraints, arriving at an answer to a subtly different question than the one asked. These problems are active areas of research, with approaches including reasoning length penalties in RL training and self-evaluation modules that check final answers against original instructions.
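As one illustration, a reasoning-length penalty amounts to shaping the RL reward so that tokens beyond a target budget cost something. The shape below is a hypothetical sketch; the budget and penalty coefficients are made up for illustration:

```python
def shaped_reward(correct: bool, reasoning_tokens: int,
                  budget: int = 4000,
                  penalty_per_token: float = 1e-4) -> float:
    # Full credit for a correct answer, minus a linear charge on
    # reasoning tokens spent beyond the target budget. Discourages
    # overthinking without punishing hard problems that stay under budget.
    overage = max(0, reasoning_tokens - budget)
    return (1.0 if correct else 0.0) - penalty_per_token * overage
```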
The shift toward inference-time scaling is reshaping AI infrastructure. Training workloads are episodic: a large cluster runs flat out for weeks or months, then the job ends. Inference workloads are continuous, running at whatever throughput the user base demands. Reasoning models amplify this because each query consumes far more tokens than a standard model would.
According to Deloitte, by 2026 inference had surpassed training as the larger source of AI data center revenue for the first time. Analysts project inference will claim 75% of total AI compute by 2030, with inference electricity demand growing from roughly 21 GW in 2024 to 93 GW by 2030 (a 35% compound annual growth rate), while training grows at a 22% CAGR over the same period.
This demand profile has influenced hardware design. NVIDIA's Blackwell GPU architecture was designed with inference efficiency as a primary objective alongside training throughput. Purpose-built inference accelerators, such as Groq's Language Processing Unit (LPU), target the specific bottlenecks of token-by-token sequential generation. Google's TPU v5e and similar chips optimize for sustained inference throughput rather than peak training FLOPs. High-bandwidth memory (HBM) capacity and memory bandwidth, which limit how fast tokens can be generated, became major constraints as reasoning model deployments scaled.
For smaller deployments, the distilled reasoning models (7B-70B parameter variants from DeepSeek and others) make inference-time scaling accessible on a single GPU or workstation, broadening who can deploy reasoning-capable systems.
Several fundamental questions about inference-time scaling remain unresolved.
Ceiling effects. It is not clear whether inference-time scaling will continue to produce measurable gains once models saturate the available benchmarks. AIME and competitive programming problems served as clean test beds partly because frontier models had room to grow. As o3-class models approach perfect scores on existing benchmarks, new, harder evaluation suites will be needed to measure further progress.
Generalization beyond verifiable tasks. The training signal for reasoning models typically comes from verifiable rewards: the answer is either right or wrong. This works cleanly for math and code. For tasks like writing, analysis, or judgment under ambiguity, there is no reliable automatic verifier, which makes RL training with verifiable rewards inapplicable without additional assumptions. Extending inference-time scaling to open-ended tasks is an active research problem.
Reasoning vs. retrieval. The current evidence suggests that inference-time scaling improves reasoning capability but does not improve factual retrieval. A model that cannot recall a fact in a single pass does not reliably recall it through extended reasoning. This means that reasoning models and retrieval-augmented generation systems address complementary failure modes, and the two approaches are often combined.
Compute efficiency. MCTS and similar deep search strategies produce strong results but at high compute cost. Finding inference strategies that are both accurate and efficient across diverse problem types remains an open engineering challenge.
Training the reasoner. The RL training procedures that produce good reasoning models are still being refined. Questions about reward shaping, handling sparse rewards, avoiding reward hacking, and whether models trained on math and code generalize to other domains are all active areas of research.