Test-time compute
Last reviewed
May 17, 2026
Sources
24 citations
Review status
Source-backed
Revision
v6 · 6,344 words
Test-time compute (also called inference-time compute scaling or test-time scaling) refers to the practice of allocating additional computational resources during the inference phase of a large language model (LLM), rather than relying solely on computation invested during training. Instead of producing an answer in a single forward pass, a model using test-time compute generates multiple reasoning steps, explores alternative solution paths, or samples and verifies several candidate responses before returning a final output. This paradigm has emerged as one of the most significant developments in AI research since 2024. It powers reasoning-focused models such as OpenAI o1 and o3, DeepSeek-R1, Google DeepMind's Gemini Deep Think, and Anthropic's Claude Mythos, which achieve substantial performance gains on difficult mathematical, scientific, and coding benchmarks.
For most of the modern deep learning era, improvements in language model performance have come from scaling up three factors at training time: model size (number of parameters), dataset size (number of training tokens), and training compute (total floating-point operations). Two landmark studies formalized this relationship.
In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that language model loss follows predictable power-law relationships with model size, dataset size, and training compute. Their findings showed that these trends span more than seven orders of magnitude and that architectural details such as network width or depth have minimal effects within a wide range. The paper also suggested that larger models are significantly more sample-efficient, meaning that compute-optimal training involves building very large models trained on relatively modest amounts of data and stopping well before convergence.
In March 2022, Jordan Hoffmann and colleagues at Google DeepMind published "Training Compute-Optimal Large Language Models," which challenged the prevailing approach of scaling model size while keeping training data roughly constant. By training over 400 models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that model size and training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double. Their compute-optimal model, Chinchilla (70 billion parameters, trained with the same compute budget as Gopher but on 4x more data), outperformed the much larger Gopher (280 billion parameters), GPT-3 (175 billion parameters), and Megatron-Turing NLG (530 billion parameters) across a wide range of evaluation tasks.
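The equal-scaling result can be written compactly. The sketch below assumes the common approximation that training compute C is about 6ND FLOPs for N parameters and D tokens; that approximation is not stated above and is used here only for illustration, and the exponents are rounded to one half.

```latex
C \approx 6\,N\,D, \qquad
N_{\mathrm{opt}}(C) \propto C^{1/2}, \qquad
D_{\mathrm{opt}}(C) \propto C^{1/2}
```

Under these assumptions, quadrupling the training budget C doubles both the compute-optimal parameter count and the compute-optimal token count, which matches the rule that every doubling of model size should be paired with a doubling of training tokens.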
Both of these studies focused exclusively on training-time compute. The core assumption was that once a model finishes training, inference is cheap and fixed: you send in a prompt and receive a single forward-pass output. Test-time compute scaling challenges this assumption by asking a different question: what if a model could spend more computation thinking about a problem at inference time, and how much performance could that additional thinking buy?
The paper that most directly crystallized the test-time compute paradigm was "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" by Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar, published in August 2024. The authors studied the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?
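Snell et al. analyzed two primary mechanisms for scaling test-time computation: searching against dense, process-based verifier reward models, and adaptively updating the model's distribution over a response given the prompt at test time (for example, by having the model revise its own answers).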
Their key findings were striking. A "compute-optimal" scaling strategy that adaptively allocates test-time compute per prompt improved the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. In a FLOPs-matched evaluation, on problems where a smaller base model achieved somewhat non-trivial success rates, test-time compute could be used to outperform a model 14x larger. Critically, the effectiveness of different approaches varied depending on the difficulty of the prompt, suggesting that adaptive allocation is essential.
A concurrent study by Yangzhen Wu and colleagues, "Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models" (August 2024), examined cost-performance trade-offs across inference strategies including greedy search, majority voting, best-of-N, weighted voting, and tree search algorithms. Their central finding was that scaling inference compute with advanced strategies can be more computationally efficient than scaling model parameters. Specifically, a Llemma-7B model paired with a tree search algorithm consistently outperformed the Llemma-34B model across all tested inference strategies on the MATH benchmark, despite being roughly one-fifth the size.
A third influential study from the same period was "Large Language Monkeys: Scaling Inference Compute with Repeated Sampling" by Bradley Brown and colleagues at Stanford. The authors studied coverage (the fraction of problems solved by at least one of N samples) as N varied over four orders of magnitude. Across tasks and base models, coverage scaled smoothly with log-N, fitting an exponentiated power law. In verifiable domains such as competitive coding and formal proofs, this translates directly into accuracy, showing that even a weak base model becomes competitive when paired with many candidates and a reliable verifier.
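Coverage as defined here can be estimated from a finite pool of samples. The following is a minimal sketch using the unbiased pass@k estimator that is widely used in code-generation evaluation; the function names `coverage_at_k` and `mean_coverage` are illustrative and not taken from the study, and the estimator assumes each sample has already been labeled correct or incorrect by a verifier.

```python
from math import comb


def coverage_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) for one problem.

    n: total samples actually drawn, c: how many of them were correct,
    k: the sample budget to estimate coverage for (k <= n).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so every k-subset has a hit.
        return 1.0
    # 1 minus the probability that a random k-subset is entirely incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_coverage(results: list[tuple[int, int]], k: int) -> float:
    """Average coverage over problems, given one (n, c) pair per problem."""
    return sum(coverage_at_k(n, c, k) for n, c in results) / len(results)
```

Evaluating `mean_coverage` at increasing values of k on the same pool of samples gives the kind of coverage curve the paragraph describes.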
Test-time compute encompasses a family of techniques that increase the amount of computation a model performs between receiving a prompt and producing a final answer. These techniques can be broadly divided into two categories: internal (the model generates a longer chain of reasoning tokens) and external (the system generates multiple candidate outputs and selects among them using verification).
Chain-of-thought (CoT) prompting, introduced by Jason Wei and colleagues at Google in January 2022, was an early demonstration that generating intermediate reasoning steps before a final answer improves performance on complex tasks. In CoT prompting, a model is shown a few examples of step-by-step reasoning and then asked to produce its own intermediate steps. This approach boosted performance on arithmetic, commonsense reasoning, and symbolic reasoning benchmarks.
Chain-of-thought reasoning represents the simplest form of test-time compute scaling: by generating more tokens (the reasoning steps), the model spends more computation per problem. Modern reasoning models like OpenAI o1 and DeepSeek-R1 internalize this pattern. Rather than requiring the user to prompt for step-by-step reasoning, these models are trained via reinforcement learning to automatically produce extended chains of thought, sometimes generating hundreds or thousands of reasoning tokens before arriving at a final answer.
Self-consistency, proposed by Xuezhi Wang and colleagues in March 2022, extends chain-of-thought prompting with a sampling-based approach. Instead of generating a single chain of thought (greedy decoding), the method samples multiple diverse reasoning paths from the model and then selects the most common final answer through majority voting. The intuition is that a complex reasoning problem typically admits multiple valid solution paths leading to the same correct answer, and aggregating across these paths filters out errors that appear in individual samples.
Self-consistency boosted chain-of-thought performance by significant margins on arithmetic and commonsense reasoning benchmarks, including +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, and +6.4% on StrategyQA. This technique established a core principle of test-time compute: generating more samples and aggregating results can systematically improve accuracy.
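The voting step of self-consistency is small enough to sketch directly. In the example below, `sample_answer` is a hypothetical callable standing in for "sample one chain of thought at non-zero temperature and extract its final answer"; it is an assumption of this sketch, not an API from the paper.

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def self_consistency(
    sample_answer: Callable[[str], Optional[str]],
    prompt: str,
    n: int = 10,
) -> Tuple[Optional[str], float]:
    """Sample n reasoning paths and return (majority answer, vote share)."""
    answers = [sample_answer(prompt) for _ in range(n)]
    # Paths that produced no parseable final answer are returned as None
    # by the hypothetical sampler and are left out of the vote.
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / n
```

The vote share doubles as a rough confidence signal: a low share suggests the sampled reasoning paths disagree and the problem may deserve more samples.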
Best-of-N sampling (also called rejection sampling) is one of the most straightforward test-time compute strategies. The model generates N candidate responses to a given prompt, each response is scored by a reward model, and the highest-scoring response is returned. The reward model may be an outcome reward model (ORM) that evaluates only the final answer, or a process reward model (PRM) that evaluates each intermediate reasoning step.
Best-of-N sampling is simple to implement and requires no changes to how the model is trained; its compute cost scales linearly with N. Its main limitation is diminishing returns as N grows, since independent random samples may repeatedly explore similar solution paths.
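A minimal sketch of best-of-N follows. Both `generate` (one sampled response per call) and `score` (a reward model returning a scalar for a prompt and response) are hypothetical placeholders; a real system would back them with a language model and an ORM or PRM.

```python
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompt: str,
    n: int = 8,
) -> Tuple[str, float]:
    """Draw n candidates and return the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    # One reward-model call per candidate, so cost grows linearly with n.
    scored = [(score(prompt, cand), cand) for cand in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score
```

Because every candidate is generated independently, the n generation calls can run in parallel, which is one reason the method is popular in practice.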
Process reward models (PRMs) provide feedback at each step of a multi-step reasoning trace, rather than only scoring the final outcome. A PRM is typically a language model fine-tuned to evaluate whether each reasoning step is correct and productive. By providing step-level feedback, PRMs enable more effective search over reasoning paths, since errors can be identified and pruned early rather than discovered only at the end.
PRMs are particularly useful when combined with tree search methods. At each step of reasoning, the model generates several candidate next steps, the PRM scores each candidate, and only the most promising branches are expanded further. This approach is far more efficient than generating complete responses and scoring them after the fact.
Recent research has nuanced the comparison between PRMs and outcome reward models (ORMs). While PRMs offer better credit assignment in mathematical reasoning, some studies have found that generative outcome reward models can be more robust across diverse domains, since step-wise PRM scoring can accumulate labeling noise over long reasoning trajectories.
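Tree search methods adapt classical search algorithms to the problem of generating text. In a tree search framework for LLM reasoning, each node is a partial reasoning trace, each branch is a candidate next step, and a verifier such as a PRM scores the candidates so that promising branches are expanded and weak ones are pruned.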
Monte Carlo tree search (MCTS) is a particularly effective variant that balances exploration (trying new reasoning paths) with exploitation (extending promising ones). MCTS has been shown to be the most effective strategy when ample computational resources are available, while best-of-N sampling offers a more practical alternative under resource constraints due to its simplicity and speed.
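The balance between exploration and exploitation in MCTS is commonly implemented with the UCT selection rule. The sketch below shows only that selection step, under the assumption that each child node tracks a visit count and a summed value (for example, accumulated verifier scores); the dictionary layout and the constant `c = 1.4` are illustrative choices, not details from the studies cited here.

```python
import math
from typing import Dict, List


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Upper-confidence score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # always try an unvisited branch first
    exploit = total_value / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children: List[Dict[str, float]]) -> int:
    """Return the index of the child with the highest UCT score.

    Each child is a dict with 'value' (summed reward) and 'visits'.
    """
    parent_visits = max(1, sum(int(ch["visits"]) for ch in children))
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i]["value"], int(children[i]["visits"]), parent_visits),
    )
```

A larger `c` pushes the search toward rarely visited reasoning paths, while a smaller `c` concentrates compute on branches that already look good.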
A 2025 extension called Adaptive Branching Monte Carlo Tree Search (AB-MCTS) dynamically decides whether to "go wider" (expanding new candidate responses) or "go deeper" (revisiting and extending existing ones) based on external feedback signals.
Beam search maintains a fixed number of the most promising partial solutions (the "beam width") at each step, expanding only the top candidates at each reasoning step. It is less computationally expensive than full tree search but more directed than pure random sampling. In the context of test-time compute, beam search guided by a process reward model offers a middle ground between simple best-of-N sampling and exhaustive tree search.
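The following is a minimal sketch of step-level beam search guided by a process reward model. The three callables are hypothetical stand-ins: `propose_steps` for the language model proposing next reasoning steps, `score_prefix` for a PRM scoring a partial trace, and `is_complete` for detecting a finished solution.

```python
from typing import Callable, List, Sequence, Tuple

Steps = Tuple[str, ...]


def prm_beam_search(
    propose_steps: Callable[[Steps], Sequence[str]],
    score_prefix: Callable[[Steps], float],
    is_complete: Callable[[Steps], bool],
    beam_width: int = 2,
    max_depth: int = 8,
) -> Steps:
    """Keep the beam_width best partial traces at every reasoning step."""
    beam: List[Steps] = [()]
    for _ in range(max_depth):
        expanded: List[Steps] = []
        for prefix in beam:
            if is_complete(prefix):
                expanded.append(prefix)  # carry finished traces forward
                continue
            for step in propose_steps(prefix):
                expanded.append(prefix + (step,))
        if not expanded:
            break
        # Prune: only the highest-scoring prefixes survive to the next step.
        expanded.sort(key=score_prefix, reverse=True)
        beam = expanded[:beam_width]
        if all(is_complete(p) for p in beam):
            break
    return max(beam, key=score_prefix)
```

Setting `beam_width` to 1 reduces this to greedy step-by-step decoding against the PRM, while a very large width approaches exhaustive tree search.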
Sequential refinement involves the model generating an initial answer and repeatedly editing or critiquing it across multiple passes. Each new candidate is conditioned on the previous attempt and feedback about its flaws. Self-Refine (2023) is an early example of this pattern. SETS (Self-Enhanced Test-Time Scaling, 2025) combines sequential revision with parallel sampling using self-verification, and reported accuracy improvements of up to 10.9% over pure parallel scaling on planning, math, and coding benchmarks.
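The draft, critique, and revise cycle can be sketched as a short loop. The callables `draft_fn`, `critique_fn`, and `revise_fn` are hypothetical placeholders for model calls, and the convention that the critic returns `None` when it has no objections is an assumption of this sketch rather than a detail of Self-Refine or SETS.

```python
from typing import Callable, Optional


def refine(
    draft_fn: Callable[[str], str],
    critique_fn: Callable[[str, str], Optional[str]],
    revise_fn: Callable[[str, str, str], str],
    prompt: str,
    max_rounds: int = 3,
) -> str:
    """Iteratively revise an answer until the critic is satisfied."""
    answer = draft_fn(prompt)
    for _ in range(max_rounds):
        feedback = critique_fn(prompt, answer)
        if feedback is None:  # critic found nothing to fix: stop early
            break
        # Each revision is conditioned on the previous attempt and its critique.
        answer = revise_fn(prompt, answer, feedback)
    return answer
```

The `max_rounds` cap is the sequential compute budget: unlike parallel sampling, each round must wait for the previous one, so latency grows with the number of rounds.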
Researchers increasingly view test-time compute as a generator-verifier problem. Stanford's Weaver framework (2025) combines weak verifiers into a unified score via weighted ensembles, shrinking the generation-verification gap by an average of 14.5% on GPQA Diamond. DeepSeekMath-V2 (2025) extends the idea to theorem proving by training a verifier on self-generated proof critiques, producing a model that can both write and check its own proofs.
OpenAI released o1 (initially codenamed "Strawberry") in September 2024 as the first widely available model explicitly designed around test-time compute. When a user submits a prompt to o1, the model internally generates a chain of thought before producing its final response. This thinking process is a genuine computational step where the model explores different approaches, checks its own work, and refines its reasoning.
OpenAI trained o1 using a large-scale reinforcement learning algorithm that teaches the model how to use its chain of thought productively. Through RL, o1 learns to recognize and correct its mistakes, break down complex steps into simpler ones, and try different approaches when one is not working. The model's performance improves both with more reinforcement learning (training-time compute) and with more time spent thinking (test-time compute).
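On key benchmarks, OpenAI reported that o1 achieved 74% on AIME 2024 with a single sample, 83% with consensus among 64 samples, and 93% when reranking 1,000 samples, along with 78.0% on GPQA Diamond.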
OpenAI announced o3 in December 2024 as a successor to o1 with further improvements in reasoning capability and efficiency. o3 and its smaller variant o3-mini offer configurable reasoning effort, with three tiers (low, medium, high) that control how many thinking cycles the model uses.
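On benchmarks, o3 achieved 96.7% on AIME 2024, 87.7% on GPQA Diamond, 87.5% on ARC-AGI in its high-compute configuration, and 25.2% on FrontierMath.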
Despite outperforming o1 on most tasks, o3-mini is reported to be 63% cheaper to run than o1-mini for comparable usage, demonstrating that improvements in reasoning efficiency can offset the cost of additional inference computation.
DeepSeek-R1, released in January 2025 by the Chinese AI lab DeepSeek, is an open-weight reasoning model that demonstrated test-time compute could be achieved through a relatively straightforward training approach. Rather than relying on complex search procedures at inference time, DeepSeek trained its model primarily through reinforcement learning to produce extended reasoning traces.
During RL training, DeepSeek-R1-Zero (a precursor trained without supervised fine-tuning) naturally acquired the ability to solve increasingly complex tasks by generating hundreds to thousands of reasoning tokens. Sophisticated behaviors such as self-reflection (revisiting and reevaluating previous steps) and exploration of alternative approaches emerged spontaneously without explicit programming.
DeepSeek-R1 achieved 79.8% on AIME 2024 and 71.5% on GPQA Diamond.
An updated version, DeepSeek-R1-0528, further improved scores to 87.5% on AIME 2025 and 81.0% on GPQA Diamond.
The significance of DeepSeek-R1 lies in its open-weight release, which made reasoning model capabilities accessible to the broader research community and enabled distillation of reasoning abilities into smaller models.
Anthropic released Claude 3.7 Sonnet in February 2025 with an "extended thinking" mode that represents a hybrid approach to test-time compute. Users and developers can toggle extended thinking on or off and set a "thinking budget" controlling how many tokens Claude spends reasoning about a problem, up to 128,000 tokens.
Anthropic designed this capability with the philosophy that reasoning should be an integrated capability of frontier models rather than requiring a separate model. The model's accuracy on math questions improves logarithmically with the number of thinking tokens it is allowed to sample.
Claude 3.7 Sonnet achieved 80.0% on AIME 2024 in parallel extended thinking mode with a 64,000-token thinking budget, and 70.3% on SWE-bench Verified. Unlike OpenAI o1, Claude 3.7 Sonnet fully displays its reasoning tokens to users.
Anthropic's Claude Mythos Preview, announced in April 2026, extends the extended-thinking paradigm into long-running agentic workflows. Mythos is positioned as a frontier model for cybersecurity, autonomous coding, and multi-step agents. It supports a 1 million token context, up to 128,000 output tokens, and an "adaptive" thinking mode that allocates inference compute dynamically based on task difficulty. In Anthropic's internal evaluations, Mythos reproduced security vulnerabilities and produced working exploits on the first attempt in over 83% of cases. Citing offensive capabilities, Anthropic restricted release through Project Glasswing. Mythos illustrates how test-time compute is moving beyond single-shot questions toward multi-hour agent runs that may use millions of reasoning tokens per task.
Alibaba Cloud's Qwen team released QwQ-32B-Preview in November 2024 as an open-source reasoning model leveraging test-time compute. Despite having only 32.5 billion parameters, QwQ demonstrated performance competitive with o1-preview and o1-mini on certain benchmarks. The model reasons through tasks by planning ahead and performing a series of self-checking actions, with the trade-off being longer response times. A later release, QwQ-32B (March 2025), further improved performance to compete with DeepSeek-R1 and o1-mini.
Qwen3, released in 2025, introduced a "Thinking Mode" where the model reasons step by step before delivering a final answer, with performance scaling smoothly with the computational reasoning budget allocated.
Google introduced thinking capabilities in its Gemini line beginning with Gemini 2.0 Flash Thinking Experimental in December 2024. In more recent Gemini models, developers control the level of internal reasoning via a thinking_level parameter (minimal, low, medium, or high).
In 2025, Google DeepMind extended this with Gemini 2.5 Deep Think, a parallel-reasoning variant that explores many candidate paths simultaneously. In July 2025, an advanced internal version reached the IMO gold-medal threshold by solving five of six 2025 International Mathematical Olympiad problems with full marks (35 of 42 points). The public Gemini 2.5 Deep Think reaches bronze-medal performance on the same problems. On IMO-ProofBench Advanced, Deep Think's accuracy climbs toward 90% as the inference-compute budget increases. Deep Think combines internal extended thinking with explicit parallel branch exploration, making it a clear production example of a hybrid system.
| Model | Developer | Release | Parameters | Approach | AIME 2024 | GPQA Diamond | Open weights |
|---|---|---|---|---|---|---|---|
| OpenAI o1 | OpenAI | Sep 2024 | Undisclosed | RL-trained CoT, internal reasoning | 83% (64-sample consensus) | 78.0% | No |
| OpenAI o3 | OpenAI | Dec 2024 | Undisclosed | RL-trained CoT, configurable effort | 96.7% | 87.7% | No |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (MoE) | RL-trained extended reasoning | 79.8% | 71.5% | Yes |
| DeepSeek-R1-0528 | DeepSeek | May 2025 | 671B (MoE) | RL-trained extended reasoning | 87.5% (AIME 2025) | 81.0% | Yes |
| Claude 3.7 Sonnet | Anthropic | Feb 2025 | Undisclosed | Extended thinking with budget control | 80.0% | N/A | No |
| Claude Mythos Preview | Anthropic | Apr 2026 | Undisclosed | Adaptive thinking, agentic | N/A (security-focused) | N/A | No (Project Glasswing) |
| QwQ-32B | Alibaba Qwen | Mar 2025 | 32.5B | RL-trained CoT reasoning | Competitive with o1-mini | Competitive with o1-mini | Yes |
| Gemini 2.0 Flash Thinking | Google DeepMind | Dec 2024 | Undisclosed | Internal thinking process, configurable | N/A | N/A | No |
| Gemini 2.5 Deep Think | Google DeepMind | Aug 2025 | Undisclosed | Parallel branch exploration, hybrid | 2025 IMO gold (internal); bronze (public) | Strong on IMO-ProofBench | No |
Internal test-time scaling refers to the model generating a longer sequence of reasoning tokens before producing its final answer. This is the approach used by o1, DeepSeek-R1, and Claude's extended thinking. The model is trained (typically via reinforcement learning) to produce useful intermediate reasoning that helps it arrive at better answers.
The advantage of internal scaling is simplicity at inference time: no external verifier or search algorithm is needed, since the reasoning ability is baked into the model's weights. The disadvantage is that the model must learn when to think longer and when to stop, and there is no external check on the quality of intermediate steps.
External test-time scaling involves generating multiple candidate outputs in parallel and selecting among them. This category includes best-of-N sampling, majority voting, and tree search with verification. External scaling requires either an external reward model or a self-evaluation mechanism but can be applied to any base model without retraining.
The advantage of external scaling is that it provides an independent verification signal, which can catch errors that the model itself would not detect. The disadvantage is the overhead of running a separate verifier and the cost of generating many candidate responses.
In practice, the most effective systems combine internal and external scaling. For example, a reasoning model that generates an extended chain of thought (internal) can also be sampled multiple times with majority voting or reranking (external). OpenAI's reported AIME results for o1 illustrate this: single-sample performance was 74%, but consensus among 64 samples raised it to 83%, and reranking 1,000 samples pushed it to 93%. Gemini Deep Think pushes the hybrid pattern further by running many parallel reasoning branches inside a single "thinking" call, with the model itself responsible for aggregating across branches.
The table below summarizes the main families under a fixed compute budget.
| Strategy | Verifier required | Adaptive to difficulty | Best suited to |
|---|---|---|---|
| Greedy chain-of-thought | No | No | Easy and medium reasoning |
| Self-consistency (majority vote) | No (uses answer matching) | Weak | Math, multi-choice |
| Best-of-N with ORM/PRM reranker | Yes | Weak | Coding, math, factual QA |
| Beam search with PRM | Yes (PRM) | Moderate | Multi-step math, planning |
| Monte Carlo tree search | Yes (often PRM) | Strong | Olympiad math, theorem proving |
| Sequential refinement (Self-Refine, SETS) | Optional (self-critique) | Strong | Coding, writing, agentic tasks |
| Internal RL-trained CoT (o1, R1) | No (internal) | Strong (learned) | General reasoning, deployment |
| Parallel-branch thinking (Deep Think) | No (internal) | Strong (learned) | Olympiad math, hard proofs |
A central question in test-time compute research is how to allocate a fixed inference budget optimally. Not all problems benefit equally from additional thinking, and spending excessive computation on easy problems wastes resources.
Snell et al. (2024) found that the effectiveness of different test-time strategies depends critically on problem difficulty. On easy problems, simple sampling and advanced tree search produce comparable results. On harder problems, tree search demonstrates a significant advantage. This suggests that the optimal strategy is difficulty-adaptive: allocate more inference compute to harder problems while stopping early on easier ones.
Practical systems implement this through complexity assessment mechanisms that gauge whether a question is easy or hard and trigger deep reasoning only when needed. This approach optimizes both latency and cost by avoiding excessive computation on simple queries while still applying full reasoning power to difficult ones.
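One simple way to approximate difficulty-adaptive allocation is to use answer agreement as the difficulty signal: draw a few pilot samples, stop if they mostly agree, and keep sampling otherwise. The sketch below is an illustrative heuristic, not the compute-optimal procedure from Snell et al.; `sample_answer`, the pilot size, and the agreement threshold are all assumptions.

```python
from collections import Counter
from typing import Callable, Tuple


def adaptive_vote(
    sample_answer: Callable[[str], str],
    prompt: str,
    pilot: int = 4,
    max_samples: int = 32,
    agree_threshold: float = 0.75,
) -> Tuple[str, int]:
    """Return (majority answer, samples used), spending more on hard prompts."""
    if pilot < 1 or max_samples < pilot:
        raise ValueError("need 1 <= pilot <= max_samples")
    answers = [sample_answer(prompt) for _ in range(pilot)]
    while True:
        top, count = Counter(answers).most_common(1)[0]
        agreed = count / len(answers) >= agree_threshold
        if agreed or len(answers) >= max_samples:
            return top, len(answers)
        # Disagreement suggests a hard prompt: double the sample pool,
        # without exceeding the overall budget.
        extra = min(len(answers), max_samples - len(answers))
        answers.extend(sample_answer(prompt) for _ in range(extra))
```

Easy prompts exit after the pilot samples, so average cost across a mixed workload stays close to the pilot budget while hard prompts can use the full allowance.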
Research has identified several principles for compute-optimal inference:
Despite the dramatic gains above, test-time compute is not unbounded. A 2025 study, "Scaling over Scaling: Exploring Test-Time Scaling Plateau in Large Reasoning Models," found a consistent pattern across benchmarks and strategies: rapid initial improvement followed by a plateau. The authors formalized this as the Test-Time Scaling Performance Model (TTSPM), defining a saturation point N* where incremental benefit drops below a chosen threshold. The plateau appears on parallel and sequential strategies and on AIME, MATH-500, and GPQA. A 2025 analysis of machine translation found that, outside math and coding where verification is easy, scaling provides only small initial gains before plateauing. "More thinking" is most effective where verification is cheap and the base model already has meaningful coverage.
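The idea of a saturation point can be illustrated with a deliberately simplified model. The sketch below is not the TTSPM from the cited study; it assumes independent samples that each succeed with probability p and a perfect verifier, so that coverage after N samples is 1 - (1 - p)^N and the gain from one more sample is p(1 - p)^N.

```python
def saturation_point(p: float, epsilon: float, max_n: int = 100_000) -> int:
    """Smallest N at which one more sample adds less than epsilon coverage.

    Assumes independent samples with success probability p and a perfect
    verifier (an illustrative simplification).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if epsilon <= 0.0:
        raise ValueError("epsilon must be positive")
    n = 1
    while n < max_n:
        marginal_gain = p * (1.0 - p) ** n  # coverage(n + 1) - coverage(n)
        if marginal_gain < epsilon:
            return n
        n += 1
    return max_n
```

In this toy model, harder problems (smaller p) saturate later, which is consistent with the observation that extra samples pay off longest where single-sample success is low but not zero.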
A complementary line studies test-time compute through the generation-verification gap: the difference between the best answer the generator can produce in N tries and the answer the verifier selects. The "Trust but Verify" survey (2025) catalogs the design space of verifiers, distinguishing process-level versus outcome-level checks and weak versus strong policies that decide when to defer to costly checks like human review.
Test-time compute presents a three-way trade-off between response latency, computational cost, and output quality.
Reasoning models take significantly longer to respond than standard models. OpenAI o1-preview's latency can extend beyond 10 seconds and reach 30 seconds on complex prompts, compared to GPT-4o's typical 2 to 4 second response time. For applications like interactive chat assistants, coding copilots, and real-time customer support, this additional latency may be unacceptable. For tasks like document analysis, research assistance, and complex problem-solving, users are often willing to wait for higher-quality answers.
The cost impact of test-time compute goes beyond per-token pricing differences. While o1's per-token price is roughly 6x higher than GPT-4o ($15 per million input tokens and $60 per million output tokens for o1, versus $2.50 and $10 for GPT-4o), the actual cost per query can be 30x or more because reasoning models generate substantially more tokens per response. The extended chain-of-thought tokens, even when hidden from the user, still consume compute resources and count toward billing.
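A short worked example makes the gap between per-token price and per-query cost concrete. The prices are the ones quoted above; the token counts (1,000 input tokens, 500 visible output tokens, and 3,000 hidden reasoning tokens billed as output) are hypothetical values chosen for illustration.

```python
def query_cost(
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
) -> float:
    """Dollar cost of one query given token counts and per-million prices."""
    total = input_tokens * price_in_per_million + output_tokens * price_out_per_million
    return total / 1_000_000


# Hypothetical workload: same prompt and same visible answer length.
INPUT_TOKENS = 1_000
VISIBLE_OUTPUT_TOKENS = 500
REASONING_TOKENS = 3_000  # hidden chain of thought, billed as output

gpt4o_cost = query_cost(INPUT_TOKENS, VISIBLE_OUTPUT_TOKENS, 2.50, 10.0)
o1_cost = query_cost(INPUT_TOKENS, VISIBLE_OUTPUT_TOKENS + REASONING_TOKENS, 15.0, 60.0)

ratio = o1_cost / gpt4o_cost
print(f"GPT-4o: ${gpt4o_cost:.4f}  o1: ${o1_cost:.4f}  ratio: {ratio:.1f}x")
```

With these assumed token counts the per-query cost ratio comes out at about 30x even though the per-token price ratio is 6x, because the reasoning tokens multiply the billed output length by seven.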
For best-of-N sampling and tree search approaches applied externally, the compute cost scales roughly linearly with the number of samples or search iterations.
The quality improvements from test-time compute are most pronounced on tasks that require multi-step reasoning, mathematical problem-solving, code generation, and scientific analysis. On straightforward factual questions or creative writing tasks, the benefit of additional reasoning may be minimal, and the added latency and cost may not be justified.
| Factor | Standard inference | Test-time compute scaling |
|---|---|---|
| Latency | 1 to 5 seconds typical | 10 to 60+ seconds for complex queries |
| Cost per query | Lower (single forward pass) | Higher (extended reasoning, multiple samples) |
| Math/reasoning accuracy | Moderate | Substantially higher |
| Simple task performance | Good | Similar (minimal benefit from extra compute) |
| User experience | Fast, responsive | Slower, but higher quality on hard tasks |
| Compute scaling | Fixed per query | Adjustable per query difficulty |
Test-time compute has produced dramatic improvements on benchmarks that were previously considered extremely difficult for AI systems.
AIME is a challenging mathematics competition designed for the brightest high school students in the United States. Before reasoning models, GPT-4o solved on average only 12% (1.8 out of 15) of AIME 2024 problems. With test-time compute, o1 raised this to 74% with a single sample and 93% when reranking 1,000 samples. In late 2024, o3 achieved 96.7% on AIME 2024. On AIME 2025, top reasoning models routinely score above 90%, vastly surpassing historical human averages.
The International Mathematical Olympiad has long been a demanding test of reasoning because each problem requires a multi-page proof. Test-time compute changed this in 2025: Gemini Deep Think reached the IMO gold-medal threshold on the 2025 problems, and DeepSeekMath-V2 (November 2025) demonstrated self-verifiable proof generation in which a trained verifier rewards a generator for finding and fixing flaws in its own proofs. These results show test-time scaling delivering progress on tasks where verification cannot rely on simple answer matching.
GPQA Diamond tests PhD-level knowledge in physics, biology, and chemistry. Human domain experts with PhDs achieve approximately 65 to 70% accuracy on these questions, while non-expert validators with unrestricted web access reach only 34%. OpenAI o1 was the first model to exceed expert-level performance at 78.0%. o3 pushed this further to 87.7%. Between 2023 and 2024, AI performance on GPQA improved by 48.9 percentage points.
ARC-AGI is designed to test novel task adaptation, a capability considered central to general intelligence. The benchmark resisted AI progress for years: GPT-3 scored 0% in 2020, and GPT-4o managed only 5% by 2024. OpenAI o3, using high-compute test-time scaling, scored 87.5% in December 2024, surpassing the 85% threshold often cited as approximate human-level performance. This result sparked widespread discussion about whether the gains reflect genuine reasoning advances or sophisticated pattern matching at scale.
The Epoch AI FrontierMath benchmark contains exceptionally difficult mathematical problems on which most AI systems score below 2%. OpenAI o3 achieved 25.2%, demonstrating that test-time compute enables meaningful progress even on problems that had been far out of reach for earlier systems.
Not every task benefits equally. Test-time scaling applied to machine translation, summarization, and creative writing shows only modest gains before plateauing. The shared characteristic is a weak verifier signal: with no automatic check for translation quality or narrative engagement, the system cannot reliably pick the best of many candidates.
A rapidly growing application of test-time compute is in agentic systems, where a model is given a multi-step task such as fixing a bug across a codebase or auditing a system for security flaws. Agentic settings stretch the paradigm: the agent plans and revises across hundreds of actions; the verifier is often the environment itself (code that compiles, tests that pass); and the inference budget runs into millions of reasoning tokens per task.
Several 2025 lines of work specialize test-time compute for agents. Confidence-aware test-time scaling (CATTS) allocates extra compute when the agent's step-level uncertainty rises. Agentic verifier approaches treat the verifier as its own agent that executes candidate code and proposes discriminative test inputs; on competitive coding benchmarks this yielded 10 to 15 percentage point gains in Best@K over execution-based baselines. When each sample is itself a long agentic run, simple best-of-N with a strong verifier remains a competitive baseline. Anthropic's Mythos is a production-scale example, combining an adaptive thinking budget with a 1 million token context and long-horizon agentic loops.
Test-time compute and training-time compute represent two complementary approaches to improving model performance, each with distinct characteristics.
| Dimension | Training-time scaling | Test-time scaling |
|---|---|---|
| When compute is spent | Before deployment (offline) | During each query (online) |
| What scales | Model parameters, training data, training FLOPs | Reasoning tokens, number of samples, search depth |
| Cost structure | Large upfront investment, low marginal cost per query | Lower upfront cost, higher marginal cost per query |
| Flexibility | Fixed after training | Adjustable per query and per difficulty level |
| Diminishing returns | Power-law scaling with compute | Roughly logarithmic scaling with compute |
| Knowledge breadth | Improves general capabilities across all tasks | Primarily improves reasoning on individual problems |
| Latency impact | None (compute spent before deployment) | Increases response time per query |
| Accessibility | Requires massive GPU clusters for training | Can improve existing models without retraining |
Training-time scaling produces models with broader knowledge and general capabilities, while test-time scaling produces deeper reasoning on specific problems. The two approaches are not mutually exclusive. The strongest systems, such as o3 and Gemini Deep Think, combine large-scale pretraining with test-time compute to achieve the best results.
Snell et al. (2024) found that in some regimes, spending additional compute at inference time is more effective per FLOP than spending it on training a larger model. This has implications for how organizations allocate their compute budgets: rather than always building bigger models, it may sometimes be more cost-effective to deploy smaller models with sophisticated inference-time strategies. The economic break-even depends on expected query volume; a system serving billions of queries benefits more from a bigger pretrained model, while a research workflow that runs only a few thousand expensive queries can profitably push more compute into thinking time.
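The break-even argument can be made concrete with a toy cost model. The sketch assumes two options that reach the same quality: a larger model that costs extra to train but is cheap per query, and a smaller model that needs no extra training but spends more per query on thinking. All parameter names and numbers are hypothetical.

```python
def break_even_queries(
    extra_training_cost: float,
    per_query_small_with_thinking: float,
    per_query_large: float,
) -> float:
    """Query volume above which training the larger model is cheaper overall."""
    saving_per_query = per_query_small_with_thinking - per_query_large
    if saving_per_query <= 0:
        # The small model is no more expensive per query, so the extra
        # training cost is never recovered.
        return float("inf")
    return extra_training_cost / saving_per_query


# Hypothetical figures: $1M of extra training versus 4 cents saved per query.
volume = break_even_queries(1_000_000, 0.05, 0.01)
print(f"break-even at roughly {volume:,.0f} queries")
```

Below the break-even volume the thinking-heavy small model is cheaper in total; above it, the upfront training investment pays for itself.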
Test-time compute represents a shift in how the AI research community thinks about scaling. The traditional approach focused almost entirely on making models bigger and training them on more data. The emerging paradigm recognizes that inference-time computation is a separate, independently scalable axis that can produce large performance gains, especially on reasoning-heavy tasks.
This shift has practical implications. Organizations that cannot afford to train frontier-scale models can still achieve competitive performance by applying test-time compute techniques to smaller, openly available models. Wu et al. (2024) demonstrated that a 7B-parameter model with advanced inference strategies can match or exceed a 34B-parameter model on mathematical problem-solving.
The release of open-weight reasoning models like DeepSeek-R1 and QwQ has made test-time compute capabilities accessible beyond the largest AI laboratories. DeepSeek-R1's open release enabled researchers to study reasoning model behavior, distill reasoning capabilities into even smaller models, and develop new inference-time techniques. This contrasts with the closed nature of OpenAI's o-series models, where the internal chain-of-thought reasoning is hidden from users.
Test-time compute enables a new form of resource management where computational expenditure scales with problem difficulty. Simple factual queries can be answered quickly and cheaply with minimal reasoning, while complex mathematical proofs or scientific analyses receive substantially more computation. This difficulty-adaptive approach is more economically efficient than applying the same level of computation to every query.
The integration of verification mechanisms (reward models, self-consistency checks, tree search with pruning) into the inference pipeline improves the reliability of model outputs. By generating multiple solutions and selecting among them based on verification, test-time compute systems can catch and correct errors that would persist in a single-pass generation. This has particular value in high-stakes applications such as medical reasoning, legal analysis, and scientific research.
Test-time compute has also reshaped the AI safety discussion. Longer reasoning traces make behavior more transparent in principle, since each step is written out as auditable tokens, which motivates Claude's visible extended thinking. Hidden internal chains of thought (as in the o-series and Mythos) raise oversight concerns, and long-running agents can be misused, which is why Anthropic restricted Mythos through Project Glasswing.
Several open research directions remain active: