Test-time compute (also called inference-time compute scaling or test-time scaling) refers to the practice of allocating additional computational resources during the inference phase of a large language model (LLM), rather than relying solely on computation invested during training. Instead of producing an answer in a single forward pass, a model using test-time compute generates multiple reasoning steps, explores alternative solution paths, or samples and verifies several candidate responses before returning a final output. This paradigm has emerged as one of the most significant developments in AI research since 2024, powering reasoning-focused models such as OpenAI o1, o3, DeepSeek-R1, and others that achieve substantial performance gains on difficult mathematical, scientific, and coding benchmarks.
For most of the modern deep learning era, improvements in language model performance have come from scaling up three factors at training time: model size (number of parameters), dataset size (number of training tokens), and training compute (total floating-point operations). Two landmark studies formalized this relationship.
In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that language model loss follows predictable power-law relationships with model size, dataset size, and training compute. Their findings showed that these trends span more than seven orders of magnitude and that architectural details such as network width or depth have minimal effects within a wide range. The paper also suggested that larger models are significantly more sample-efficient, meaning that compute-optimal training involves building very large models trained on relatively modest amounts of data and stopping well before convergence.
In March 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models," which challenged the prevailing approach of scaling model size while keeping training data roughly constant. By training over 400 models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that model size and training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double. Their compute-optimal model, Chinchilla (70 billion parameters trained on roughly 4x as much data as Gopher), outperformed the much larger Gopher (280 billion parameters), GPT-3 (175 billion parameters), and Megatron-Turing NLG (530 billion parameters) across a wide range of evaluation tasks.
Both of these studies focused exclusively on training-time compute. The core assumption was that once a model finishes training, inference is cheap and fixed: you send in a prompt and receive a single forward-pass output. Test-time compute scaling challenges this assumption by asking a different question: what if a model could spend more computation thinking about a problem at inference time, and how much performance could that additional thinking buy?
The paper that most directly crystallized the test-time compute paradigm was "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" by Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar, published in August 2024. The authors studied the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?
Snell et al. analyzed two primary mechanisms for scaling test-time computation:

- Searching against a verifier: sampling multiple candidate solutions or solution steps and selecting among them using a dense, process-based verifier reward model.
- Adaptively updating the response distribution: having the model sequentially revise its own answer at test time, conditioning each attempt on its previous ones.
Their key findings were striking. A "compute-optimal" scaling strategy that adaptively allocates test-time compute per prompt improved the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. In a FLOPs-matched evaluation, on problems where a smaller base model achieved somewhat non-trivial success rates, test-time compute could be used to outperform a model 14x larger. Critically, the effectiveness of different approaches varied depending on the difficulty of the prompt, suggesting that adaptive allocation is essential.
A concurrent study by Yangzhen Wu and colleagues, "Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models" (August 2024), examined cost-performance trade-offs across inference strategies including greedy search, majority voting, best-of-N, weighted voting, and tree search algorithms. Their central finding was that scaling inference compute with advanced strategies can be more computationally efficient than scaling model parameters. Specifically, a Llemma-7B model paired with a tree search algorithm consistently outperformed the Llemma-34B model across all tested inference strategies on the MATH benchmark, despite being roughly one-fifth the size.
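Of the aggregation strategies Wu et al. compared, weighted voting is easy to state concretely: instead of counting each sample equally (majority voting), each candidate answer accumulates the reward-model scores of the samples that produced it. A minimal sketch (the helper name and the scores below are illustrative, not taken from the paper):

```python
from collections import defaultdict

def weighted_vote(scored_samples):
    """scored_samples: list of (final_answer, reward_score) pairs.
    Sum reward mass per distinct answer and return the heaviest one,
    so several moderately-scored agreeing samples can outweigh a
    single confident outlier."""
    mass = defaultdict(float)
    for answer, score in scored_samples:
        mass[answer] += score
    return max(mass, key=mass.get)

# Two moderately-scored samples agreeing on "42" outweigh one
# higher-scoring sample that says "41".
print(weighted_vote([("42", 0.5), ("41", 0.9), ("42", 0.5)]))  # → 42
```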
Test-time compute encompasses a family of techniques that increase the amount of computation a model performs between receiving a prompt and producing a final answer. These techniques can be broadly divided into two categories: internal (the model generates a longer chain of reasoning tokens) and external (the system generates multiple candidate outputs and selects among them using verification).
Chain-of-thought (CoT) prompting, introduced by Jason Wei and colleagues at Google in January 2022, was an early demonstration that generating intermediate reasoning steps before a final answer improves performance on complex tasks. In CoT prompting, a model is shown a few examples of step-by-step reasoning and then asked to produce its own intermediate steps. This approach boosted performance on arithmetic, commonsense reasoning, and symbolic reasoning benchmarks.
Chain-of-thought reasoning represents the simplest form of test-time compute scaling: by generating more tokens (the reasoning steps), the model spends more computation per problem. Modern reasoning models like OpenAI o1 and DeepSeek-R1 internalize this pattern. Rather than requiring the user to prompt for step-by-step reasoning, these models are trained via reinforcement learning to automatically produce extended chains of thought, sometimes generating hundreds or thousands of reasoning tokens before arriving at a final answer.
Self-consistency, proposed by Xuezhi Wang and colleagues in March 2022, extends chain-of-thought prompting with a sampling-based approach. Instead of generating a single chain of thought (greedy decoding), the method samples multiple diverse reasoning paths from the model and then selects the most common final answer through majority voting. The intuition is that a complex reasoning problem typically admits multiple valid solution paths leading to the same correct answer, and aggregating across these paths filters out errors that appear in individual samples.
Self-consistency boosted chain-of-thought performance by significant margins on arithmetic and commonsense reasoning benchmarks, including +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, and +6.4% on StrategyQA. This technique established a core principle of test-time compute: generating more samples and aggregating results can systematically improve accuracy.
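The self-consistency procedure fits in a few lines. In this illustrative Python sketch, `sample_answer` is a stand-in for sampling one chain of thought from an LLM at nonzero temperature and extracting its final answer; a real implementation would call a model API instead:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    # Toy stand-in for one sampled chain of thought: correct ("42")
    # 60% of the time, otherwise a scattered wrong answer. Errors
    # rarely agree with each other, which is why voting filters them.
    if rng.random() < 0.6:
        return "42"
    return str(rng.randint(0, 99))

def self_consistency(problem, n_samples=20, seed=0):
    """Sample diverse reasoning paths and majority-vote the answers."""
    rng = random.Random(seed)
    answers = [sample_answer(problem, rng) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples

answer, agreement = self_consistency("What is 6 * 7?")
print(answer)
```

Even though any single sample is wrong 40% of the time here, the majority answer is almost always correct, because the wrong answers scatter while the right one repeats.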
Best-of-N sampling (also called rejection sampling) is one of the most straightforward test-time compute strategies. The model generates N candidate responses to a given prompt, each response is scored by a reward model, and the highest-scoring response is returned. The reward model may be an outcome reward model (ORM) that evaluates only the final answer, or a process reward model (PRM) that evaluates each intermediate reasoning step.
Best-of-N sampling is simple to implement, requires no changes to how the model is trained, and scales linearly with N. Its main limitation is that it offers diminishing returns as N grows, since random sampling may repeatedly explore similar solution paths.
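The strategy reduces to a one-line selection over sampled candidates. In this hedged sketch, `generate` and `score` are toy stand-ins for an LLM sampler and an outcome reward model (a real system would call a model API and a trained verifier):

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate responses and return the one the reward
    model scores highest (best-of-N / rejection sampling)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: `generate` is a noisy numeric solver and `score` is
# an outcome reward model preferring answers near the true value 42.
rng = random.Random(0)
generate = lambda prompt: rng.gauss(42, 10)
score = lambda answer: -abs(answer - 42)

best = best_of_n("What is 6 * 7?", generate, score, n=16)
print(round(best, 1))
```

The linear scaling in N is visible directly: each extra sample costs one more `generate` call, while the selection step is negligible.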
Process reward models (PRMs) provide feedback at each step of a multi-step reasoning trace, rather than only scoring the final outcome. A PRM is typically a language model fine-tuned to evaluate whether each reasoning step is correct and productive. By providing step-level feedback, PRMs enable more effective search over reasoning paths, since errors can be identified and pruned early rather than discovered only at the end.
PRMs are particularly useful when combined with tree search methods. At each step of reasoning, the model generates several candidate next steps, the PRM scores each candidate, and only the most promising branches are expanded further. This approach is far more efficient than generating complete responses and scoring them after the fact.
Recent research has nuanced the comparison between PRMs and outcome reward models (ORMs). While PRMs offer better credit assignment in mathematical reasoning, some studies have found that generative outcome reward models can be more robust across diverse domains, since step-wise PRM scoring can accumulate labeling noise over long reasoning trajectories.
Tree search methods adapt classical search algorithms to the problem of generating text. In a tree search framework for LLM reasoning:

- The root node is the prompt, and each node below it represents a partial reasoning trace.
- Branching corresponds to sampling several candidate next steps from the model.
- A verifier (often a process reward model) scores nodes, so promising branches are expanded and weak ones are pruned.
- A complete path from root to leaf constitutes one full candidate solution.
Monte Carlo tree search (MCTS) is a particularly effective variant that balances exploration (trying new reasoning paths) with exploitation (extending promising ones). MCTS has been shown to be the most effective strategy when ample computational resources are available, while best-of-N sampling offers a more practical alternative under resource constraints due to its simplicity and speed.
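A plain MCTS loop over reasoning traces can be sketched as follows. This is a generic UCT-style sketch, not any specific published system: `expand` and `rollout_reward` are toy stand-ins for sampling candidate steps from the model and scoring a trace with a verifier.

```python
import math
import random

class Node:
    def __init__(self, trace, parent=None):
        self.trace = trace          # partial reasoning trace (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of rewards backed up through this node

def ucb(node, c=1.4):
    # Upper-confidence bound: exploitation term + exploration bonus.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_trace, expand, rollout_reward, iters=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_trace)
    for _ in range(iters):
        # 1. Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add candidate next reasoning steps.
        for step in expand(node.trace):
            node.children.append(Node(node.trace + [step], parent=node))
        if node.children:
            node = rng.choice(node.children)
        # 3. Evaluation: score the (partial) trace with the verifier.
        reward = rollout_reward(node.trace)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).trace

# Toy problem: at each of 3 steps the model can take a "good" or "bad"
# step, and the verifier reward counts the good steps taken.
def expand(trace):
    return [] if len(trace) >= 3 else ["good", "bad"]

def rollout_reward(trace):
    return sum(step == "good" for step in trace)

print(mcts([], expand, rollout_reward)[0])  # the most-visited first step
```

Visits concentrate on the branch whose descendants score best, which is exactly the exploration/exploitation balance described above.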
A 2025 extension called Adaptive Branching Monte Carlo Tree Search (AB-MCTS) dynamically decides whether to "go wider" (expanding new candidate responses) or "go deeper" (revisiting and extending existing ones) based on external feedback signals.
Beam search maintains a fixed number of the most promising partial solutions (the "beam width") at each step, expanding only the top candidates at each reasoning step. It is less computationally expensive than full tree search but more directed than pure random sampling. In the context of test-time compute, beam search guided by a process reward model offers a middle ground between simple best-of-N sampling and exhaustive tree search.
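PRM-guided beam search can be sketched compactly. Here `propose_steps` and `prm_score` are toy stand-ins: a real system would sample candidate steps from the LLM and score them with a trained process reward model.

```python
def beam_search(question, propose_steps, prm_score, beam_width=4, depth=3):
    """Keep only the beam_width highest-scoring partial reasoning
    traces (per a process reward model) at each reasoning step."""
    beams = [[]]  # each beam is a partial trace: a list of steps
    for _ in range(depth):
        expansions = [beam + [step]
                      for beam in beams
                      for step in propose_steps(question, beam)]
        expansions.sort(key=lambda t: prm_score(question, t), reverse=True)
        beams = expansions[:beam_width]
    return beams[0]

# Toy stand-ins for model sampling and PRM scoring.
TARGET = ["parse", "solve", "check"]

def propose_steps(question, trace):
    return ["parse", "solve", "check", "guess"]

def prm_score(question, trace):
    # +1 per step that matches the intended solution path, so wrong
    # steps stop accumulating reward as soon as they appear.
    return sum(got == want for got, want in zip(trace, TARGET))

print(beam_search("toy question", propose_steps, prm_score))
```

Because pruning happens at every step, the cost per level is bounded by `beam_width` times the branching factor, rather than growing exponentially as in exhaustive search.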
OpenAI released o1-preview (the model initially codenamed "Strawberry") in September 2024, followed by the full o1 in December 2024, as the first widely available models explicitly designed around test-time compute. When a user submits a prompt to o1, the model internally generates a chain of thought before producing its final response. This thinking process is a genuine computational step in which the model explores different approaches, checks its own work, and refines its reasoning.
OpenAI trained o1 using a large-scale reinforcement learning algorithm that teaches the model how to use its chain of thought productively. Through RL, o1 learns to recognize and correct its mistakes, break down complex steps into simpler ones, and try different approaches when one is not working. The model's performance improves both with more reinforcement learning (training-time compute) and with more time spent thinking (test-time compute).
On key benchmarks, o1 achieved:

- 74% on AIME 2024 with a single sample, rising to 83% with 64-sample consensus (versus 12% for GPT-4o)
- 78.0% on GPQA Diamond, the first result to exceed PhD-expert accuracy
- Competitive-programming performance around the 89th percentile on Codeforces
OpenAI announced o3 in December 2024 as a successor to o1 with further improvements in reasoning capability and efficiency. o3 and its smaller variant o3-mini offer configurable reasoning effort, with three tiers (low, medium, high) that control how many thinking cycles the model uses.
On benchmarks, o3 achieved:

- 96.7% on AIME 2024
- 87.7% on GPQA Diamond
- 87.5% on ARC-AGI in high-compute mode
- 25.2% on the EpochAI FrontierMath benchmark
Despite outperforming o1 on most tasks, o3-mini is reported to be 63% cheaper to run than o1-mini for comparable usage, demonstrating that improvements in reasoning efficiency can offset the cost of additional inference computation.
DeepSeek-R1, released in January 2025 by the Chinese AI lab DeepSeek, is an open-weight reasoning model that demonstrated that test-time compute scaling could be elicited through a relatively straightforward training recipe. Rather than relying on complex search procedures at inference time, DeepSeek trained the model primarily through reinforcement learning to produce extended reasoning traces.
During RL training, DeepSeek-R1-Zero (a precursor trained without supervised fine-tuning) naturally acquired the ability to solve increasingly complex tasks by generating hundreds to thousands of reasoning tokens. Sophisticated behaviors such as self-reflection (revisiting and reevaluating previous steps) and exploration of alternative approaches emerged spontaneously without explicit programming.
DeepSeek-R1 achieved:

- 79.8% on AIME 2024
- 71.5% on GPQA Diamond
- 97.3% on MATH-500
An updated version, DeepSeek-R1-0528, further improved scores to 87.5% on AIME 2025 and 81.0% on GPQA Diamond.
The significance of DeepSeek-R1 lies in its open-weight release, which made reasoning model capabilities accessible to the broader research community and enabled distillation of reasoning abilities into smaller models.
Anthropic released Claude 3.7 Sonnet in February 2025 with an "extended thinking" mode that represents a hybrid approach to test-time compute. Users and developers can toggle extended thinking on or off and set a "thinking budget" controlling how many tokens Claude spends reasoning about a problem, up to 128,000 tokens.
Anthropic designed this capability with the philosophy that reasoning should be an integrated capability of frontier models rather than requiring a separate model. The model's accuracy on math questions improves logarithmically with the number of thinking tokens it is allowed to sample.
Claude 3.7 Sonnet achieved 80.0% on AIME 2024 in parallel extended thinking mode with a 64,000-token thinking budget, and 70.3% on SWE-bench Verified. Unlike OpenAI o1, Claude 3.7 Sonnet fully displays its reasoning tokens to users.
Alibaba Cloud's Qwen team released QwQ-32B-Preview in November 2024 as an open-source reasoning model leveraging test-time compute. Despite having only 32.5 billion parameters, QwQ demonstrated performance competitive with o1-preview and o1-mini on certain benchmarks. The model reasons through tasks by planning ahead and performing a series of self-checking actions, with the trade-off being longer response times. A later release, QwQ-32B (March 2025), further improved performance to compete with DeepSeek-R1 and o1-mini.
Qwen3, released in 2025, introduced a "Thinking Mode" where the model reasons step by step before delivering a final answer, with performance scaling smoothly with the computational reasoning budget allocated.
Google introduced thinking capabilities in its Gemini model line, beginning with Gemini 2.0 Flash Thinking Experimental in December 2024. These models use an internal thinking process that improves reasoning and multi-step planning. Developers can control the level of internal reasoning through a thinking_level parameter (minimal, low, medium, or high), balancing response quality against latency and cost.
| Model | Developer | Release | Parameters | Approach | AIME 2024 | GPQA Diamond | Open Weights |
|---|---|---|---|---|---|---|---|
| OpenAI o1 | OpenAI | Sep 2024 | Undisclosed | RL-trained CoT, internal reasoning | 83% (64-sample consensus) | 78.0% | No |
| OpenAI o3 | OpenAI | Dec 2024 | Undisclosed | RL-trained CoT, configurable effort | 96.7% | 87.7% | No |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (MoE) | RL-trained extended reasoning | 79.8% | 71.5% | Yes |
| DeepSeek-R1-0528 | DeepSeek | May 2025 | 671B (MoE) | RL-trained extended reasoning | 87.5% (AIME 2025) | 81.0% | Yes |
| Claude 3.7 Sonnet | Anthropic | Feb 2025 | Undisclosed | Extended thinking with budget control | 80.0% | N/A | No |
| QwQ-32B | Alibaba Qwen | Mar 2025 | 32.5B | RL-trained CoT reasoning | Competitive with o1-mini | Competitive with o1-mini | Yes |
| Gemini 2.0 Flash Thinking | Google DeepMind | Dec 2024 | Undisclosed | Internal thinking process, configurable | N/A | N/A | No |
Internal test-time scaling refers to the model generating a longer sequence of reasoning tokens before producing its final answer. This is the approach used by o1, DeepSeek-R1, and Claude's extended thinking. The model is trained (typically via reinforcement learning) to produce useful intermediate reasoning that helps it arrive at better answers.
The advantage of internal scaling is simplicity at inference time: no external verifier or search algorithm is needed, since the reasoning ability is baked into the model's weights. The disadvantage is that the model must learn when to think longer and when to stop, and there is no external check on the quality of intermediate steps.
External test-time scaling involves generating multiple candidate outputs in parallel and selecting among them. This category includes best-of-N sampling, majority voting, and tree search with verification. External scaling requires either an external reward model or a self-evaluation mechanism but can be applied to any base model without retraining.
The advantage of external scaling is that it provides an independent verification signal, which can catch errors that the model itself would not detect. The disadvantage is the overhead of running a separate verifier and the cost of generating many candidate responses.
In practice, the most effective systems combine internal and external scaling. For example, a reasoning model that generates an extended chain of thought (internal) can also be sampled multiple times with majority voting or reranking (external). OpenAI's reported AIME results for o1 illustrate this: single-sample performance was 74%, but consensus among 64 samples raised it to 83%, and reranking 1,000 samples pushed it to 93%.
A central question in test-time compute research is how to allocate a fixed inference budget optimally. Not all problems benefit equally from additional thinking, and spending excessive computation on easy problems wastes resources.
Snell et al. (2024) found that the effectiveness of different test-time strategies depends critically on problem difficulty. On easy problems, simple sampling and advanced tree search produce comparable results. On harder problems, tree search demonstrates a significant advantage. This suggests that the optimal strategy is difficulty-adaptive: allocate more inference compute to harder problems while stopping early on easier ones.
Practical systems implement this through complexity assessment mechanisms that gauge whether a question is easy or hard and trigger deep reasoning only when needed. This approach optimizes both latency and cost by avoiding excessive computation on simple queries while still applying full reasoning power to difficult ones.
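The routing logic described above can be sketched in a few lines. The keyword heuristic below is purely illustrative (real systems would use a learned difficulty classifier or the model's own uncertainty), and `fast_model` / `reasoning_model` are placeholders for actual model calls:

```python
def route(query, fast_model, reasoning_model, is_hard):
    """Spend extra test-time compute only when the query looks hard."""
    if is_hard(query):
        return reasoning_model(query)   # slow, expensive, more accurate
    return fast_model(query)            # single fast forward pass

# Hypothetical difficulty heuristic: long queries or math-flavored
# keywords trigger the reasoning model.
def is_hard(query):
    return len(query) > 200 or any(
        kw in query.lower() for kw in ("prove", "integral", "optimize"))

fast_model = lambda q: "fast answer"
reasoning_model = lambda q: "carefully reasoned answer"

print(route("What is the capital of France?", fast_model, reasoning_model, is_hard))
print(route("Prove that sqrt(2) is irrational.", fast_model, reasoning_model, is_hard))
```

The design choice is where the classifier sits: a cheap pre-check like this adds negligible latency to easy queries while reserving the expensive path for the minority of queries that benefit from it.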
Research has identified several principles for compute-optimal inference:

- Allocate compute adaptively: estimate prompt difficulty and spend more inference compute on harder problems.
- Match the strategy to the difficulty: simple sampling (best-of-N, majority voting) suffices on easy problems, while verifier-guided search pays off on hard ones.
- Exploit inference scaling when it is cheaper per FLOP: a smaller model with a strong inference strategy can outperform a much larger model run once.
- Respect diminishing returns: accuracy grows roughly logarithmically with samples or thinking tokens, so per-query budgets should be capped.
Test-time compute presents a three-way trade-off between response latency, computational cost, and output quality.
Reasoning models take significantly longer to respond than standard models. OpenAI o1-preview's latency can extend beyond 10 seconds and reach 30 seconds on complex prompts, compared to GPT-4o's typical 2 to 4 second response time. For applications like interactive chat assistants, coding copilots, and real-time customer support, this additional latency may be unacceptable. For tasks like document analysis, research assistance, and complex problem-solving, users are often willing to wait for higher-quality answers.
The cost impact of test-time compute goes beyond per-token pricing differences. While o1's per-token price is roughly 6x higher than GPT-4o ($15 per million input tokens and $60 per million output tokens for o1, versus $2.50 and $10 for GPT-4o), the actual cost per query can be 30x or more because reasoning models generate substantially more tokens per response. The extended chain-of-thought tokens, even when hidden from the user, still consume compute resources and count toward billing.
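The arithmetic behind the 30x figure is worth making explicit. Using the published per-million-token prices above, with illustrative token counts (the 4,000 output tokens stand in for hidden reasoning tokens billed as output; actual counts vary widely by query):

```python
def query_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

# Published prices: o1 at $15/$60, GPT-4o at $2.50/$10 per million
# input/output tokens. Token counts below are illustrative only.
gpt4o = query_cost(in_tokens=500, out_tokens=500, in_price=2.50, out_price=10)
o1 = query_cost(in_tokens=500, out_tokens=4000, in_price=15, out_price=60)
print(round(o1 / gpt4o, 1))  # → 39.6 for these illustrative token counts
```

Even a modest reasoning trace thus multiplies per-query cost far beyond the 6x per-token price gap, which is why output-token volume dominates the cost analysis for reasoning models.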
For best-of-N sampling and tree search approaches applied externally, the compute cost scales roughly linearly with the number of samples or search iterations.
The quality improvements from test-time compute are most pronounced on tasks that require multi-step reasoning, mathematical problem-solving, code generation, and scientific analysis. On straightforward factual questions or creative writing tasks, the benefit of additional reasoning may be minimal, and the added latency and cost may not be justified.
| Factor | Standard Inference | Test-Time Compute Scaling |
|---|---|---|
| Latency | 1 to 5 seconds typical | 10 to 60+ seconds for complex queries |
| Cost per query | Lower (single forward pass) | Higher (extended reasoning, multiple samples) |
| Math/reasoning accuracy | Moderate | Substantially higher |
| Simple task performance | Good | Similar (minimal benefit from extra compute) |
| User experience | Fast, responsive | Slower, but higher quality on hard tasks |
| Compute scaling | Fixed per query | Adjustable per query difficulty |
Test-time compute has produced dramatic improvements on benchmarks that were previously considered extremely difficult for AI systems.
AIME is a challenging mathematics competition designed for the brightest high school students in the United States. Before reasoning models, GPT-4o solved only 12% (1.8 out of 15) of AIME 2024 problems. With test-time compute, o1 raised this to 74% with a single sample and 93% with reranking. By late 2024, o3 achieved 96.7% on AIME 2024. On AIME 2025, top reasoning models routinely score above 90%, vastly surpassing historical human averages.
GPQA Diamond tests PhD-level knowledge in physics, biology, and chemistry. Human domain experts with PhDs achieve approximately 65 to 70% accuracy on these questions, while non-expert validators with unrestricted web access reach only 34%. OpenAI o1 was the first model to exceed expert-level performance at 78.0%. o3 pushed this further to 87.7%. Between 2023 and 2024, AI performance on GPQA improved by 48.9 percentage points.
ARC-AGI is designed to test novel task adaptation, a capability considered central to general intelligence. The benchmark resisted AI progress for years: GPT-3 scored 0% in 2020, and GPT-4o managed only 5% by 2024. OpenAI o3, using high-compute test-time scaling, scored 87.5% in December 2024, surpassing the 85% threshold often cited as approximate human-level performance. This result sparked widespread discussion about whether the gains reflect genuine reasoning advances or sophisticated pattern matching at scale.
The EpochAI Frontier Math benchmark contains exceptionally difficult mathematical problems where most AI systems score below 2%. OpenAI o3 achieved 25.2%, demonstrating that test-time compute enables meaningful progress even on problems well beyond current AI capabilities.
Test-time compute and training-time compute represent two complementary approaches to improving model performance, each with distinct characteristics.
| Dimension | Training-Time Scaling | Test-Time Scaling |
|---|---|---|
| When compute is spent | Before deployment (offline) | During each query (online) |
| What scales | Model parameters, training data, training FLOPs | Reasoning tokens, number of samples, search depth |
| Cost structure | Large upfront investment, low marginal cost per query | Lower upfront cost, higher marginal cost per query |
| Flexibility | Fixed after training | Adjustable per query and per difficulty level |
| Diminishing returns | Power-law scaling with compute | Roughly logarithmic scaling with compute |
| Knowledge breadth | Improves general capabilities across all tasks | Primarily improves reasoning on individual problems |
| Latency impact | None (compute spent before deployment) | Increases response time per query |
| Accessibility | Requires massive GPU clusters for training | Can improve existing models without retraining |
Training-time scaling produces models with broader knowledge and general capabilities, while test-time scaling produces deeper reasoning on specific problems. The two approaches are not mutually exclusive. The strongest systems, such as o3, combine large-scale pretraining with test-time compute to achieve the best results.
Snell et al. (2024) found that in some regimes, spending additional compute at inference time is more effective per FLOP than spending it on training a larger model. This has implications for how organizations allocate their compute budgets: rather than always building bigger models, it may sometimes be more cost-effective to deploy smaller models with sophisticated inference-time strategies.
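The "diminishing returns" row of the table can be written schematically. This is a stylized sketch, not fitted to any particular model: $L$ is pretraining loss, $C_{\text{train}}$ is training compute, $n$ is the number of samples or thinking tokens, and $\alpha$, $a$, $b$, $C_0$ are constants.

```latex
\underbrace{L(C_{\text{train}}) \approx \left(\frac{C_0}{C_{\text{train}}}\right)^{\alpha}}_{\text{training-time: power law in compute}}
\qquad
\underbrace{\mathrm{Acc}(n) \approx a + b \log n}_{\text{test-time: roughly logarithmic in samples or tokens}}
```

Both curves flatten, but they flatten on different axes, which is why the two budgets can be traded off against each other rather than substituted one-for-one.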
Test-time compute represents a shift in how the AI research community thinks about scaling. The traditional approach focused almost entirely on making models bigger and training them on more data. The emerging paradigm recognizes that inference-time computation is a separate, independently scalable axis that can produce large performance gains, especially on reasoning-heavy tasks.
This shift has practical implications. Organizations that cannot afford to train frontier-scale models can still achieve competitive performance by applying test-time compute techniques to smaller, openly available models. Wu et al. (2024) demonstrated that a 7B-parameter model with advanced inference strategies can match or exceed a 34B-parameter model on mathematical problem-solving.
The release of open-weight reasoning models like DeepSeek-R1 and QwQ has made test-time compute capabilities accessible beyond the largest AI laboratories. DeepSeek-R1's open release enabled researchers to study reasoning model behavior, distill reasoning capabilities into even smaller models, and develop new inference-time techniques. This contrasts with the closed nature of OpenAI's o-series models, where the internal chain-of-thought reasoning is hidden from users.
Test-time compute enables a new form of resource management where computational expenditure scales with problem difficulty. Simple factual queries can be answered quickly and cheaply with minimal reasoning, while complex mathematical proofs or scientific analyses receive substantially more computation. This difficulty-adaptive approach is more economically efficient than applying the same level of computation to every query.
The integration of verification mechanisms (reward models, self-consistency checks, tree search with pruning) into the inference pipeline improves the reliability of model outputs. By generating multiple solutions and selecting among them based on verification, test-time compute systems can catch and correct errors that would persist in a single-pass generation. This has particular value in high-stakes applications such as medical reasoning, legal analysis, and scientific research.
Several open research directions remain active:

- Making extended reasoning more token-efficient, so that models avoid overthinking easy problems
- Building verifiers and process reward models that remain reliable beyond mathematics and code
- Determining how much of the benchmark gains reflects genuine reasoning versus pattern matching at scale
- Combining internal reasoning with external search and verification in a principled, compute-optimal way
- Distilling test-time reasoning capabilities into smaller, cheaper models