Test-time compute (also called inference-time compute scaling or test-time scaling) refers to the practice of allocating additional computational resources during the inference phase of a large language model (LLM), rather than relying solely on computation invested during training. Instead of producing an answer in a single forward pass, a model using test-time compute generates multiple reasoning steps, explores alternative solution paths, or samples and verifies several candidate responses before returning a final output. This paradigm has emerged as one of the most significant developments in AI research since 2024, powering reasoning-focused models such as OpenAI o1, o3, DeepSeek-R1, and others that achieve substantial performance gains on difficult mathematical, scientific, and coding benchmarks.
For most of the modern deep learning era, improvements in language model performance have come from scaling up three factors at training time: model size (number of parameters), dataset size (number of training tokens), and training compute (total floating-point operations). Two landmark studies formalized this relationship.
In January 2020, Jared Kaplan and colleagues at OpenAI published "Scaling Laws for Neural Language Models," demonstrating that language model loss follows predictable power-law relationships with model size, dataset size, and training compute. Their findings showed that these trends span more than seven orders of magnitude and that architectural details such as network width or depth have minimal effects within a wide range. The paper also suggested that larger models are significantly more sample-efficient, meaning that compute-optimal training involves building very large models trained on relatively modest amounts of data and stopping well before convergence.
In March 2022, Jordan Hoffmann and colleagues at DeepMind published "Training Compute-Optimal Large Language Models," which challenged the prevailing approach of scaling model size while keeping training data roughly constant. By training over 400 models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that model size and training tokens should be scaled equally: for every doubling of model size, the number of training tokens should also double. Their compute-optimal model, Chinchilla (70 billion parameters trained on roughly 4x as much data as Gopher), outperformed the much larger Gopher (280 billion parameters), GPT-3 (175 billion parameters), and Megatron-Turing NLG (530 billion parameters) across a wide range of evaluation tasks.
Both of these studies focused exclusively on training-time compute. The core assumption was that once a model finishes training, inference is cheap and fixed: you send in a prompt and receive a single forward-pass output. Test-time compute scaling challenges this assumption by asking a different question: what if a model could spend more computation thinking about a problem at inference time, and how much performance could that additional thinking buy?
The paper that most directly crystallized the test-time compute paradigm was "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters" by Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar, published in August 2024. The authors studied the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt?
Snell et al. analyzed two primary mechanisms for scaling test-time computation:

- Searching against a verifier: sampling multiple candidate solutions or solution steps and selecting among them using a dense, process-based verifier reward model.
- Adaptively updating the response distribution: having the model sequentially revise its own answer at test time, conditioning each attempt on its previous ones.
Their key findings were striking. A "compute-optimal" scaling strategy that adaptively allocates test-time compute per prompt improved the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. In a FLOPs-matched evaluation, on problems where a smaller base model achieved somewhat non-trivial success rates, test-time compute could be used to outperform a model 14x larger. Critically, the effectiveness of different approaches varied depending on the difficulty of the prompt, suggesting that adaptive allocation is essential.
A concurrent study by Yangzhen Wu and colleagues, "Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models" (August 2024), examined cost-performance trade-offs across inference strategies including greedy search, majority voting, best-of-N, weighted voting, and tree search algorithms. Their central finding was that scaling inference compute with advanced strategies can be more computationally efficient than scaling model parameters. Specifically, a Llemma-7B model paired with a tree search algorithm consistently outperformed the Llemma-34B model across all tested inference strategies on the MATH benchmark, despite being roughly one-fifth the size.
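Of the aggregation strategies Wu et al. compared, weighted voting is easy to state concretely: instead of counting each sample equally (majority voting), each candidate answer accumulates the reward-model scores of the samples that produced it. A minimal sketch (the helper name and the scores below are illustrative, not taken from the paper):

```python
from collections import defaultdict

def weighted_vote(scored_samples):
    """scored_samples: list of (final_answer, reward_score) pairs.
    Sum reward mass per distinct answer and return the heaviest one,
    so several moderately-scored agreeing samples can outweigh a
    single confident outlier."""
    mass = defaultdict(float)
    for answer, score in scored_samples:
        mass[answer] += score
    return max(mass, key=mass.get)

# Two moderately-scored samples agreeing on "42" outweigh one
# higher-scoring sample that says "41".
print(weighted_vote([("42", 0.5), ("41", 0.9), ("42", 0.5)]))  # → 42
```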
Test-time compute encompasses a family of techniques that increase the amount of computation a model performs between receiving a prompt and producing a final answer. These techniques can be broadly divided into two categories: internal (the model generates a longer chain of reasoning tokens) and external (the system generates multiple candidate outputs and selects among them using verification).
Chain-of-thought (CoT) prompting, introduced by Jason Wei and colleagues at Google in January 2022, was an early demonstration that generating intermediate reasoning steps before a final answer improves performance on complex tasks. In CoT prompting, a model is shown a few examples of step-by-step reasoning and then asked to produce its own intermediate steps. This approach boosted performance on arithmetic, commonsense reasoning, and symbolic reasoning benchmarks.
Chain-of-thought reasoning represents the simplest form of test-time compute scaling: by generating more tokens (the reasoning steps), the model spends more computation per problem. Modern reasoning models like OpenAI o1 and DeepSeek-R1 internalize this pattern. Rather than requiring the user to prompt for step-by-step reasoning, these models are trained via reinforcement learning to automatically produce extended chains of thought, sometimes generating hundreds or thousands of reasoning tokens before arriving at a final answer.
Self-consistency, proposed by Xuezhi Wang and colleagues in March 2022, extends chain-of-thought prompting with a sampling-based approach. Instead of generating a single chain of thought (greedy decoding), the method samples multiple diverse reasoning paths from the model and then selects the most common final answer through majority voting. The intuition is that a complex reasoning problem typically admits multiple valid solution paths leading to the same correct answer, and aggregating across these paths filters out errors that appear in individual samples.
Self-consistency boosted chain-of-thought performance by significant margins on arithmetic and commonsense reasoning benchmarks, including +17.9% on GSM8K, +11.0% on SVAMP, +12.2% on AQuA, and +6.4% on StrategyQA. This technique established a core principle of test-time compute: generating more samples and aggregating results can systematically improve accuracy.
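The self-consistency procedure fits in a few lines. In this illustrative Python sketch, `sample_answer` is a stand-in for sampling one chain of thought from an LLM at nonzero temperature and extracting its final answer; a real implementation would call a model API instead:

```python
import random
from collections import Counter

def sample_answer(problem, rng):
    # Toy stand-in for one sampled chain of thought: correct ("42")
    # 60% of the time, otherwise a scattered wrong answer. Errors
    # rarely agree with each other, which is why voting filters them.
    if rng.random() < 0.6:
        return "42"
    return str(rng.randint(0, 99))

def self_consistency(problem, n_samples=20, seed=0):
    """Sample diverse reasoning paths and majority-vote the answers."""
    rng = random.Random(seed)
    answers = [sample_answer(problem, rng) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples

answer, agreement = self_consistency("What is 6 * 7?")
print(answer)
```

Even though any single sample is wrong 40% of the time here, the majority answer is almost always correct, because the wrong answers scatter while the right one repeats.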
Best-of-N sampling (also called rejection sampling) is one of the most straightforward test-time compute strategies. The model generates N candidate responses to a given prompt, each response is scored by a reward model, and the highest-scoring response is returned. The reward model may be an outcome reward model (ORM) that evaluates only the final answer, or a process reward model (PRM) that evaluates each intermediate reasoning step.
Best-of-N sampling is simple to implement, requires no changes to how the model is trained, and scales linearly with N. Its main limitation is that it offers diminishing returns as N grows, since random sampling may repeatedly explore similar solution paths.
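The strategy reduces to a one-line selection over sampled candidates. In this hedged sketch, `generate` and `score` are toy stand-ins for an LLM sampler and an outcome reward model (a real system would call a model API and a trained verifier):

```python
import random

def best_of_n(prompt, generate, score, n=8):
    """Sample n candidate responses and return the one the reward
    model scores highest (best-of-N / rejection sampling)."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins: `generate` is a noisy numeric solver and `score` is
# an outcome reward model preferring answers near the true value 42.
rng = random.Random(0)
generate = lambda prompt: rng.gauss(42, 10)
score = lambda answer: -abs(answer - 42)

best = best_of_n("What is 6 * 7?", generate, score, n=16)
print(round(best, 1))
```

The linear scaling in N is visible directly: each extra sample costs one more `generate` call, while the selection step is negligible.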
Process reward models (PRMs) provide feedback at each step of a multi-step reasoning trace, rather than only scoring the final outcome. A PRM is typically a language model fine-tuned to evaluate whether each reasoning step is correct and productive. By providing step-level feedback, PRMs enable more effective search over reasoning paths, since errors can be identified and pruned early rather than discovered only at the end.
PRMs are particularly useful when combined with tree search methods. At each step of reasoning, the model generates several candidate next steps, the PRM scores each candidate, and only the most promising branches are expanded further. This approach is far more efficient than generating complete responses and scoring them after the fact.
Recent research has nuanced the comparison between PRMs and outcome reward models (ORMs). While PRMs offer better credit assignment in mathematical reasoning, some studies have found that generative outcome reward models can be more robust across diverse domains, since step-wise PRM scoring can accumulate labeling noise over long reasoning trajectories.
Tree search methods adapt classical search algorithms to the problem of generating text. In a tree search framework for LLM reasoning:

- The root node is the prompt, and each node below it represents a partial reasoning trace.
- Branching corresponds to sampling several candidate next steps from the model.
- A verifier (often a process reward model) scores nodes, so promising branches are expanded and weak ones are pruned.
- A complete path from root to leaf constitutes one full candidate solution.
Monte Carlo tree search (MCTS) is a particularly effective variant that balances exploration (trying new reasoning paths) with exploitation (extending promising ones). MCTS has been shown to be the most effective strategy when ample computational resources are available, while best-of-N sampling offers a more practical alternative under resource constraints due to its simplicity and speed.
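A plain MCTS loop over reasoning traces can be sketched as follows. This is a generic UCT-style sketch, not any specific published system: `expand` and `rollout_reward` are toy stand-ins for sampling candidate steps from the model and scoring a trace with a verifier.

```python
import math
import random

class Node:
    def __init__(self, trace, parent=None):
        self.trace = trace          # partial reasoning trace (list of steps)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0            # sum of rewards backed up through this node

def ucb(node, c=1.4):
    # Upper-confidence bound: exploitation term + exploration bonus.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_trace, expand, rollout_reward, iters=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_trace)
    for _ in range(iters):
        # 1. Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2. Expansion: add candidate next reasoning steps.
        for step in expand(node.trace):
            node.children.append(Node(node.trace + [step], parent=node))
        if node.children:
            node = rng.choice(node.children)
        # 3. Evaluation: score the (partial) trace with the verifier.
        reward = rollout_reward(node.trace)
        # 4. Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).trace

# Toy problem: at each of 3 steps the model can take a "good" or "bad"
# step, and the verifier reward counts the good steps taken.
def expand(trace):
    return [] if len(trace) >= 3 else ["good", "bad"]

def rollout_reward(trace):
    return sum(step == "good" for step in trace)

print(mcts([], expand, rollout_reward)[0])  # the most-visited first step
```

Visits concentrate on the branch whose descendants score best, which is exactly the exploration/exploitation balance described above.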
A 2025 extension called Adaptive Branching Monte Carlo Tree Search (AB-MCTS) dynamically decides whether to "go wider" (expanding new candidate responses) or "go deeper" (revisiting and extending existing ones) based on external feedback signals.
Beam search maintains a fixed number of the most promising partial solutions (the "beam width") at each step, expanding only the top candidates at each reasoning step. It is less computationally expensive than full tree search but more directed than pure random sampling. In the context of test-time compute, beam search guided by a process reward model offers a middle ground between simple best-of-N sampling and exhaustive tree search.
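PRM-guided beam search can be sketched compactly. Here `propose_steps` and `prm_score` are toy stand-ins: a real system would sample candidate steps from the LLM and score them with a trained process reward model.

```python
def beam_search(question, propose_steps, prm_score, beam_width=4, depth=3):
    """Keep only the beam_width highest-scoring partial reasoning
    traces (per a process reward model) at each reasoning step."""
    beams = [[]]  # each beam is a partial trace: a list of steps
    for _ in range(depth):
        expansions = [beam + [step]
                      for beam in beams
                      for step in propose_steps(question, beam)]
        expansions.sort(key=lambda t: prm_score(question, t), reverse=True)
        beams = expansions[:beam_width]
    return beams[0]

# Toy stand-ins for model sampling and PRM scoring.
TARGET = ["parse", "solve", "check"]

def propose_steps(question, trace):
    return ["parse", "solve", "check", "guess"]

def prm_score(question, trace):
    # +1 per step that matches the intended solution path, so wrong
    # steps stop accumulating reward as soon as they appear.
    return sum(got == want for got, want in zip(trace, TARGET))

print(beam_search("toy question", propose_steps, prm_score))
```

Because pruning happens at every step, the cost per level is bounded by `beam_width` times the branching factor, rather than growing exponentially as in exhaustive search.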
OpenAI released o1-preview (the model initially codenamed "Strawberry") in September 2024, followed by the full o1 in December 2024, as the first widely available models explicitly designed around test-time compute. When a user submits a prompt to o1, the model internally generates a chain of thought before producing its final response. This thinking process is a genuine computational step in which the model explores different approaches, checks its own work, and refines its reasoning.
OpenAI trained o1 using a large-scale reinforcement learning algorithm that teaches the model how to use its chain of thought productively. Through RL, o1 learns to recognize and correct its mistakes, break down complex steps into simpler ones, and try different approaches when one is not working. The model's performance improves both with more reinforcement learning (training-time compute) and with more time spent thinking (test-time compute).
On key benchmarks, o1 achieved:

- 74% on AIME 2024 with a single sample, rising to 83% with 64-sample consensus (versus 12% for GPT-4o)
- 78.0% on GPQA Diamond, the first result to exceed PhD-expert accuracy
- Competitive-programming performance around the 89th percentile on Codeforces
OpenAI announced o3 in December 2024 as a successor to o1 with further improvements in reasoning capability and efficiency. o3 and its smaller variant o3-mini offer configurable reasoning effort, with three tiers (low, medium, high) that control how many thinking cycles the model uses.
On benchmarks, o3 achieved:

- 96.7% on AIME 2024
- 87.7% on GPQA Diamond
- 87.5% on ARC-AGI in high-compute mode
- 25.2% on the EpochAI FrontierMath benchmark
Despite outperforming o1 on most tasks, o3-mini is reported to be 63% cheaper to run than o1-mini for comparable usage, demonstrating that improvements in reasoning efficiency can offset the cost of additional inference computation.
DeepSeek-R1, released in January 2025 by the Chinese AI lab DeepSeek, is an open-weight reasoning model that demonstrated that test-time compute scaling could be elicited through a relatively straightforward training recipe. Rather than relying on complex search procedures at inference time, DeepSeek trained the model primarily through reinforcement learning to produce extended reasoning traces.
During RL training, DeepSeek-R1-Zero (a precursor trained without supervised fine-tuning) naturally acquired the ability to solve increasingly complex tasks by generating hundreds to thousands of reasoning tokens. Sophisticated behaviors such as self-reflection (revisiting and reevaluating previous steps) and exploration of alternative approaches emerged spontaneously without explicit programming.
DeepSeek-R1 achieved:

- 79.8% on AIME 2024
- 71.5% on GPQA Diamond
- 97.3% on MATH-500
An updated version, DeepSeek-R1-0528, further improved scores to 87.5% on AIME 2025 and 81.0% on GPQA Diamond.
The significance of DeepSeek-R1 lies in its open-weight release, which made reasoning model capabilities accessible to the broader research community and enabled distillation of reasoning abilities into smaller models.
Anthropic released Claude 3.7 Sonnet in February 2025 with an "extended thinking" mode that represents a hybrid approach to test-time compute. Users and developers can toggle extended thinking on or off and set a "thinking budget" controlling how many tokens Claude spends reasoning about a problem, up to 128,000 tokens.
Anthropic designed this capability with the philosophy that reasoning should be an integrated capability of frontier models rather than requiring a separate model. The model's accuracy on math questions improves logarithmically with the number of thinking tokens it is allowed to sample.
Claude 3.7 Sonnet achieved 80.0% on AIME 2024 in parallel extended thinking mode with a 64,000-token thinking budget, and 70.3% on SWE-bench Verified. Unlike OpenAI o1, Claude 3.7 Sonnet fully displays its reasoning tokens to users.
Alibaba Cloud's Qwen team released QwQ-32B-Preview in November 2024 as an open-source reasoning model leveraging test-time compute. Despite having only 32.5 billion parameters, QwQ demonstrated performance competitive with o1-preview and o1-mini on certain benchmarks. The model reasons through tasks by planning ahead and performing a series of self-checking actions, with the trade-off being longer response times. A later release, QwQ-32B (March 2025), further improved performance to compete with DeepSeek-R1 and o1-mini.
Qwen3, released in 2025, introduced a "Thinking Mode" where the model reasons step by step before delivering a final answer, with performance scaling smoothly with the computational reasoning budget allocated.
Google introduced thinking capabilities in its Gemini model line, beginning with Gemini 2.0 Flash Thinking Experimental in December 2024. These models use an internal thinking process that improves reasoning and multi-step planning. Developers can control the level of internal reasoning through a thinking_level parameter (minimal, low, medium, or high), balancing response quality against latency and cost.
| Model | Developer | Release | Parameters | Approach | AIME 2024 | GPQA Diamond | Open Weights |
|---|---|---|---|---|---|---|---|
| OpenAI o1 | OpenAI | Sep 2024 | Undisclosed | RL-trained CoT, internal reasoning | 83% (64-sample consensus) | 78.0% | No |
| OpenAI o3 | OpenAI | Dec 2024 | Undisclosed | RL-trained CoT, configurable effort | 96.7% | 87.7% | No |
| DeepSeek-R1 | DeepSeek | Jan 2025 | 671B (MoE) | RL-trained extended reasoning | 79.8% | 71.5% | Yes |
| DeepSeek-R1-0528 | DeepSeek | May 2025 | 671B (MoE) | RL-trained extended reasoning | 87.5% (AIME 2025) | 81.0% | Yes |
| Claude 3.7 Sonnet | Anthropic | Feb 2025 | Undisclosed | Extended thinking with budget control | 80.0% | N/A | No |
| QwQ-32B | Alibaba Qwen | Mar 2025 | 32.5B | RL-trained CoT reasoning | Competitive with o1-mini | Competitive with o1-mini | Yes |
| Gemini 2.0 Flash Thinking | Google DeepMind | Dec 2024 | Undisclosed | Internal thinking process, configurable | N/A | N/A | No |
Internal test-time scaling refers to the model generating a longer sequence of reasoning tokens before producing its final answer. This is the approach used by o1, DeepSeek-R1, and Claude's extended thinking. The model is trained (typically via reinforcement learning) to produce useful intermediate reasoning that helps it arrive at better answers.
The advantage of internal scaling is simplicity at inference time: no external verifier or search algorithm is needed, since the reasoning ability is baked into the model's weights. The disadvantage is that the model must learn when to think longer and when to stop, and there is no external check on the quality of intermediate steps.
External test-time scaling involves generating multiple candidate outputs in parallel and selecting among them. This category includes best-of-N sampling, majority voting, and tree search with verification. External scaling requires either an external reward model or a self-evaluation mechanism but can be applied to any base model without retraining.
The advantage of external scaling is that it provides an independent verification signal, which can catch errors that the model itself would not detect. The disadvantage is the overhead of running a separate verifier and the cost of generating many candidate responses.
In practice, the most effective systems combine internal and external scaling. For example, a reasoning model that generates an extended chain of thought (internal) can also be sampled multiple times with majority voting or reranking (external). OpenAI's reported AIME results for o1 illustrate this: single-sample performance was 74%, but consensus among 64 samples raised it to 83%, and reranking 1,000 samples pushed it to 93%.
A central question in test-time compute research is how to allocate a fixed inference budget optimally. Not all problems benefit equally from additional thinking, and spending excessive computation on easy problems wastes resources.
Snell et al. (2024) found that the effectiveness of different test-time strategies depends critically on problem difficulty. On easy problems, simple sampling and advanced tree search produce comparable results. On harder problems, tree search demonstrates a significant advantage. This suggests that the optimal strategy is difficulty-adaptive: allocate more inference compute to harder problems while stopping early on easier ones.
Practical systems implement this through complexity assessment mechanisms that gauge whether a question is easy or hard and trigger deep reasoning only when needed. This approach optimizes both latency and cost by avoiding excessive computation on simple queries while still applying full reasoning power to difficult ones.
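The routing logic described above can be sketched in a few lines. The keyword heuristic below is purely illustrative (real systems would use a learned difficulty classifier or the model's own uncertainty), and `fast_model` / `reasoning_model` are placeholders for actual model calls:

```python
def route(query, fast_model, reasoning_model, is_hard):
    """Spend extra test-time compute only when the query looks hard."""
    if is_hard(query):
        return reasoning_model(query)   # slow, expensive, more accurate
    return fast_model(query)            # single fast forward pass

# Hypothetical difficulty heuristic: long queries or math-flavored
# keywords trigger the reasoning model.
def is_hard(query):
    return len(query) > 200 or any(
        kw in query.lower() for kw in ("prove", "integral", "optimize"))

fast_model = lambda q: "fast answer"
reasoning_model = lambda q: "carefully reasoned answer"

print(route("What is the capital of France?", fast_model, reasoning_model, is_hard))
print(route("Prove that sqrt(2) is irrational.", fast_model, reasoning_model, is_hard))
```

The design choice is where the classifier sits: a cheap pre-check like this adds negligible latency to easy queries while reserving the expensive path for the minority of queries that benefit from it.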
Research has identified several principles for compute-optimal inference:

- Allocate compute adaptively: estimate prompt difficulty and spend more inference compute on harder problems.
- Match the strategy to the difficulty: simple sampling (best-of-N, majority voting) suffices on easy problems, while verifier-guided search pays off on hard ones.
- Exploit inference scaling when it is cheaper per FLOP: a smaller model with a strong inference strategy can outperform a much larger model run once.
- Respect diminishing returns: accuracy grows roughly logarithmically with samples or thinking tokens, so per-query budgets should be capped.
Test-time compute presents a three-way trade-off between response latency, computational cost, and output quality.
Reasoning models take significantly longer to respond than standard models. OpenAI o1-preview's latency can extend beyond 10 seconds and reach 30 seconds on complex prompts, compared to GPT-4o's typical 2 to 4 second response time. For applications like interactive chat assistants, coding copilots, and real-time customer support, this additional latency may be unacceptable. For tasks like document analysis, research assistance, and complex problem-solving, users are often willing to wait for higher-quality answers.
The cost impact of test-time compute goes beyond per-token pricing differences. While o1's per-token price is roughly 6x higher than GPT-4o ($15 per million input tokens and $60 per million output tokens for o1, versus $2.50 and $10 for GPT-4o), the actual cost per query can be 30x or more because reasoning models generate substantially more tokens per response. The extended chain-of-thought tokens, even when hidden from the user, still consume compute resources and count toward billing.
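The arithmetic behind the 30x figure is worth making explicit. Using the published per-million-token prices above, with illustrative token counts (the 4,000 output tokens stand in for hidden reasoning tokens billed as output; actual counts vary widely by query):

```python
def query_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

# Published prices: o1 at $15/$60, GPT-4o at $2.50/$10 per million
# input/output tokens. Token counts below are illustrative only.
gpt4o = query_cost(in_tokens=500, out_tokens=500, in_price=2.50, out_price=10)
o1 = query_cost(in_tokens=500, out_tokens=4000, in_price=15, out_price=60)
print(round(o1 / gpt4o, 1))  # → 39.6 for these illustrative token counts
```

Even a modest reasoning trace thus multiplies per-query cost far beyond the 6x per-token price gap, which is why output-token volume dominates the cost analysis for reasoning models.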
For best-of-N sampling and tree search approaches applied externally, the compute cost scales roughly linearly with the number of samples or search iterations.
The quality improvements from test-time compute are most pronounced on tasks that require multi-step reasoning, mathematical problem-solving, code generation, and scientific analysis. On straightforward factual questions or creative writing tasks, the benefit of additional reasoning may be minimal, and the added latency and cost may not be justified.
| Factor | Standard Inference | Test-Time Compute Scaling |
|---|---|---|
| Latency | 1 to 5 seconds typical | 10 to 60+ seconds for complex queries |
| Cost per query | Lower (single forward pass) | Higher (extended reasoning, multiple samples) |
| Math/reasoning accuracy | Moderate | Substantially higher |
| Simple task performance | Good | Similar (minimal benefit from extra compute) |
| User experience | Fast, responsive | Slower, but higher quality on hard tasks |
| Compute scaling | Fixed per query | Adjustable per query difficulty |
Test-time compute has produced dramatic improvements on benchmarks that were previously considered extremely difficult for AI systems.
AIME is a challenging mathematics competition designed for the brightest high school students in the United States. Before reasoning models, GPT-4o solved only 12% (1.8 out of 15) of AIME 2024 problems. With test-time compute, o1 raised this to 74% with a single sample and 93% with reranking. By late 2024, o3 achieved 96.7% on AIME 2024. On AIME 2025, top reasoning models routinely score above 90%, vastly surpassing historical human averages.
GPQA Diamond tests PhD-level knowledge in physics, biology, and chemistry. Human domain experts with PhDs achieve approximately 65 to 70% accuracy on these questions, while non-expert validators with unrestricted web access reach only 34%. OpenAI o1 was the first model to exceed expert-level performance at 78.0%. o3 pushed this further to 87.7%. Between 2023 and 2024, AI performance on GPQA improved by 48.9 percentage points.
ARC-AGI is designed to test novel task adaptation, a capability considered central to general intelligence. The benchmark resisted AI progress for years: GPT-3 scored 0% in 2020, and GPT-4o managed only 5% by 2024. OpenAI o3, using high-compute test-time scaling, scored 87.5% in December 2024, surpassing the 85% threshold often cited as approximate human-level performance. This result sparked widespread discussion about whether the gains reflect genuine reasoning advances or sophisticated pattern matching at scale.
The EpochAI Frontier Math benchmark contains exceptionally difficult mathematical problems where most AI systems score below 2%. OpenAI o3 achieved 25.2%, demonstrating that test-time compute enables meaningful progress even on problems well beyond current AI capabilities.
Test-time compute and training-time compute represent two complementary approaches to improving model performance, each with distinct characteristics.
| Dimension | Training-Time Scaling | Test-Time Scaling |
|---|---|---|
| When compute is spent | Before deployment (offline) | During each query (online) |
| What scales | Model parameters, training data, training FLOPs | Reasoning tokens, number of samples, search depth |
| Cost structure | Large upfront investment, low marginal cost per query | Lower upfront cost, higher marginal cost per query |
| Flexibility | Fixed after training | Adjustable per query and per difficulty level |
| Diminishing returns | Power-law scaling with compute | Roughly logarithmic scaling with compute |
| Knowledge breadth | Improves general capabilities across all tasks | Primarily improves reasoning on individual problems |
| Latency impact | None (compute spent before deployment) | Increases response time per query |
| Accessibility | Requires massive GPU clusters for training | Can improve existing models without retraining |
Training-time scaling produces models with broader knowledge and general capabilities, while test-time scaling produces deeper reasoning on specific problems. The two approaches are not mutually exclusive. The strongest systems, such as o3, combine large-scale pretraining with test-time compute to achieve the best results.
Snell et al. (2024) found that in some regimes, spending additional compute at inference time is more effective per FLOP than spending it on training a larger model. This has implications for how organizations allocate their compute budgets: rather than always building bigger models, it may sometimes be more cost-effective to deploy smaller models with sophisticated inference-time strategies.
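The "diminishing returns" row of the table can be written schematically. This is a stylized sketch, not fitted to any particular model: $L$ is pretraining loss, $C_{\text{train}}$ is training compute, $n$ is the number of samples or thinking tokens, and $\alpha$, $a$, $b$, $C_0$ are constants.

```latex
\underbrace{L(C_{\text{train}}) \approx \left(\frac{C_0}{C_{\text{train}}}\right)^{\alpha}}_{\text{training-time: power law in compute}}
\qquad
\underbrace{\mathrm{Acc}(n) \approx a + b \log n}_{\text{test-time: roughly logarithmic in samples or tokens}}
```

Both curves flatten, but they flatten on different axes, which is why the two budgets can be traded off against each other rather than substituted one-for-one.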
Test-time compute represents a shift in how the AI research community thinks about scaling. The traditional approach focused almost entirely on making models bigger and training them on more data. The emerging paradigm recognizes that inference-time computation is a separate, independently scalable axis that can produce large performance gains, especially on reasoning-heavy tasks.
This shift has practical implications. Organizations that cannot afford to train frontier-scale models can still achieve competitive performance by applying test-time compute techniques to smaller, openly available models. Wu et al. (2024) demonstrated that a 7B-parameter model with advanced inference strategies can match or exceed a 34B-parameter model on mathematical problem-solving.
The release of open-weight reasoning models like DeepSeek-R1 and QwQ has made test-time compute capabilities accessible beyond the largest AI laboratories. DeepSeek-R1's open release enabled researchers to study reasoning model behavior, distill reasoning capabilities into even smaller models, and develop new inference-time techniques. This contrasts with the closed nature of OpenAI's o-series models, where the internal chain-of-thought reasoning is hidden from users.
Test-time compute enables a new form of resource management where computational expenditure scales with problem difficulty. Simple factual queries can be answered quickly and cheaply with minimal reasoning, while complex mathematical proofs or scientific analyses receive substantially more computation. This difficulty-adaptive approach is more economically efficient than applying the same level of computation to every query.
The integration of verification mechanisms (reward models, self-consistency checks, tree search with pruning) into the inference pipeline improves the reliability of model outputs. By generating multiple solutions and selecting among them based on verification, test-time compute systems can catch and correct errors that would persist in a single-pass generation. This has particular value in high-stakes applications such as medical reasoning, legal analysis, and scientific research.
Several open research directions remain active:

- Making extended reasoning more token-efficient, so that models avoid overthinking easy problems
- Building verifiers and process reward models that remain reliable beyond mathematics and code
- Determining how much of the benchmark gains reflects genuine reasoning versus pattern matching at scale
- Combining internal reasoning with external search and verification in a principled, compute-optimal way
- Distilling test-time reasoning capabilities into smaller, cheaper models