Reasoning in artificial intelligence refers to the ability of AI systems to draw inferences, solve problems, and make decisions through structured thought processes. It encompasses a broad set of cognitive capabilities, from following logical rules and recognizing patterns to forming hypotheses and navigating uncertain information. Reasoning has been a central goal of AI research since the field's inception in the 1950s, and it remains one of the most actively studied and debated topics in modern AI, particularly in the context of large language models (LLMs).
Since late 2024, a new class of so-called "reasoning models" has emerged. These systems, including OpenAI o1, o3, DeepSeek R1, and others, use techniques like internal chain-of-thought processing and reinforcement learning to spend additional compute at inference time, yielding substantial improvements on mathematical, scientific, and coding benchmarks. Whether these models genuinely reason or perform sophisticated pattern matching remains an open and consequential question.
Reasoning in AI can be categorized into several distinct types, each reflecting a different aspect of human cognitive ability. These categories are not mutually exclusive; real-world problem-solving often requires combining multiple forms of reasoning simultaneously.
Deductive reasoning moves from general premises to specific conclusions. If the premises are true and the logic is valid, the conclusion is guaranteed to be true. For example: "All mammals are warm-blooded. A dog is a mammal. Therefore, a dog is warm-blooded." Classical symbolic AI systems and logic programming languages like Prolog were built around deductive reasoning.
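The deductive pattern is mechanical enough to capture in a few lines. The sketch below implements a minimal forward-chaining loop in Python; the facts and rules are illustrative, not drawn from any particular system:

```python
# Minimal forward chaining: repeatedly apply if-then rules to a set of
# facts until no rule derives anything new.
def forward_chain(facts, rules):
    """facts: set of (predicate, subject) pairs.
    rules: list of (premise_predicate, conclusion_predicate) pairs,
    read as "if X is a <premise>, then X is a <conclusion>"."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, subj in list(derived):
                if pred == premise and (conclusion, subj) not in derived:
                    derived.add((conclusion, subj))
                    changed = True
    return derived

# "All mammals are warm-blooded. A dog is a mammal."
facts = {("mammal", "dog")}
rules = [("mammal", "warm_blooded")]
print(forward_chain(facts, rules))
# Derives ("warm_blooded", "dog") alongside the original fact.
```

Prolog generalizes this idea with variables, unification, and backward chaining, but the guarantee is the same: if the premises are true, the derived conclusions follow necessarily.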
Inductive reasoning works in the opposite direction, drawing general conclusions from specific observations. A system that observes thousands of spam emails and learns to identify common patterns is performing induction. Most machine learning algorithms rely heavily on inductive reasoning, generalizing from training data to make predictions on unseen inputs.
Abductive reasoning involves inferring the most likely explanation for a set of observations. When a doctor considers symptoms and arrives at a diagnosis, that process is abductive. This form of reasoning is inherently uncertain; the conclusion is a best guess rather than a guaranteed truth. It plays a significant role in diagnostic systems and natural language understanding.
Analogical reasoning solves new problems by finding similarities to previously solved problems. If a system knows how to navigate one city's road network, it might apply similar strategies to a different city. This type of reasoning is central to transfer learning and few-shot learning in modern AI.
Causal reasoning goes beyond correlation to understand cause-and-effect relationships. Rather than merely noting that umbrella sales and rain tend to co-occur, a causally reasoning system would understand that rain causes people to buy umbrellas, not the reverse. Judea Pearl's work on causal inference has been influential in formalizing this type of reasoning for AI systems [1].
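The asymmetry between correlation and causation can be made concrete with a toy structural model in which rain causes umbrella purchases. Intervening on the cause shifts the effect, while intervening on the effect leaves the cause untouched; the probabilities below are arbitrary illustrations:

```python
import random

def simulate(n=10_000, force_rain=None, force_umbrellas=None, seed=0):
    """Toy structural model: rain -> umbrella sales.
    The force_* arguments implement interventions (Pearl's do-operator)."""
    rng = random.Random(seed)
    rain_days = umbrella_days = 0
    for _ in range(n):
        rain = force_rain if force_rain is not None else rng.random() < 0.3
        if force_umbrellas is not None:
            umbrellas = force_umbrellas
        else:
            umbrellas = rng.random() < (0.8 if rain else 0.1)
        rain_days += rain
        umbrella_days += umbrellas
    return rain_days / n, umbrella_days / n

print(simulate())                      # observational rates
print(simulate(force_rain=True))       # do(rain): umbrella rate jumps to ~0.8
print(simulate(force_umbrellas=True))  # do(umbrellas): rain rate stays ~0.3
```

A purely correlational model would treat the two interventions symmetrically; the structural model does not, which is precisely the distinction causal reasoning captures.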
Commonsense reasoning involves applying the vast body of everyday knowledge that humans take for granted. Knowing that a glass will break if dropped, that people feel hungry before meals, or that objects do not float upward without a force acting on them are examples. This has historically been one of the hardest challenges for AI, since commonsense knowledge is enormous in scope and difficult to formalize.
Mathematical reasoning involves manipulating numerical and symbolic expressions, constructing proofs, and solving equations. It requires precision, multi-step logic, and the ability to apply abstract rules correctly. Mathematical reasoning has become a primary benchmark for evaluating modern reasoning models.
Spatial reasoning deals with understanding and manipulating the positions, shapes, and relationships of objects in space. It is critical for robotics, computer vision, and navigation tasks. Spatial reasoning also plays a role in understanding language that describes physical arrangements.
The pursuit of machine reasoning is as old as the field of artificial intelligence itself. The approaches taken have shifted dramatically over the decades, from hand-coded logic to statistical methods to the neural network systems dominant today.
The earliest AI programs were built on the premise that intelligence could be captured through formal logic and symbol manipulation. In 1955, Allen Newell and Herbert A. Simon, with the assistance of J.C. Shaw, created the Logic Theorist, widely considered the first AI program. The Logic Theorist proved 38 of the first 52 theorems in Bertrand Russell and Alfred North Whitehead's Principia Mathematica, and in some cases discovered proofs more elegant than the originals [2].
The 1956 Dartmouth Workshop, organized by John McCarthy and Marvin Minsky along with Nathaniel Rochester and Claude Shannon, formally established AI as an academic discipline. In the years that followed, researchers developed systems like the General Problem Solver (1957), which attempted to encode general-purpose reasoning strategies. McCarthy's development of Lisp in 1958 and later work on situation calculus provided programming tools and formal frameworks for reasoning about actions and change.
This era was dominated by what is now called "Good Old-Fashioned AI" (GOFAI) or symbolic AI. The core assumption was that intelligence consists of manipulating symbolic representations according to logical rules. Systems could perform impressive feats of deduction within narrow domains, but they struggled with the messiness of real-world knowledge and the combinatorial explosion of possible inferences.
Expert systems represented the first major commercial application of AI reasoning. These systems encoded the knowledge of human domain experts as collections of if-then rules. MYCIN (1976), developed at Stanford, could diagnose bacterial infections and recommend antibiotics with accuracy comparable to human specialists. DENDRAL (1969) could determine molecular structures from mass spectrometry data.
Expert systems demonstrated that narrow, domain-specific reasoning could be practically useful. However, they were brittle: they could not handle situations outside their programmed rules, they required enormous effort to build and maintain, and they lacked the ability to learn from new data. By the early 1990s, interest in expert systems had waned considerably.
The limitations of purely symbolic reasoning led to a shift toward statistical methods. Bayesian networks, hidden Markov models, and other probabilistic frameworks allowed AI systems to reason under uncertainty, a capability that rule-based systems lacked.
This era also saw the rise of machine learning as the dominant paradigm. Rather than hand-coding reasoning rules, systems learned patterns from data. Support vector machines, decision trees, and eventually deep learning models demonstrated that statistical pattern recognition could solve problems that symbolic methods could not, from speech recognition to image classification.
The deep learning revolution, beginning around 2012 with the success of AlexNet on ImageNet, initially focused on perception tasks such as image recognition and speech processing. But researchers quickly began exploring whether neural networks could also perform reasoning.
Early milestones included DeepMind's AlphaGo (2016), which defeated the world champion at Go through a combination of deep learning and tree search, demonstrating that neural approaches could handle complex strategic reasoning. The introduction of the Transformer architecture in 2017 [3] and subsequent development of large language models opened a new chapter in AI reasoning, as these models began to show unexpected capabilities in logical, mathematical, and commonsense reasoning tasks.
The emergence of large language models such as GPT-3, GPT-4, and Claude revealed that models trained primarily on next-token prediction could exhibit reasoning-like behavior. This was surprising to many researchers, since the training objective (predicting the next word) does not explicitly require reasoning.
A landmark paper by Jason Wei and colleagues at Google, published in January 2022, introduced chain-of-thought (CoT) prompting [4]. The key insight was simple but powerful: by including intermediate reasoning steps in the prompt examples given to a large language model, the model could be induced to generate its own step-by-step reasoning before arriving at an answer.
Experiments showed that chain-of-thought prompting dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. A 540-billion-parameter PaLM model prompted with just eight chain-of-thought examples achieved state-of-the-art accuracy on the GSM8K benchmark of grade-school math word problems [4]. The technique worked best on larger models; smaller models did not benefit as much, suggesting that reasoning capabilities emerge at scale.
Chain-of-thought prompting spawned numerous variants, including zero-shot CoT (simply adding "Let's think step by step" to the prompt), self-consistency (sampling multiple reasoning paths and taking a majority vote), and tree-of-thought prompting (exploring branching reasoning paths).
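Of these variants, self-consistency is the easiest to sketch: sample several independent reasoning paths, extract each path's final answer, and return the most common one. In the Python sketch below the sampler is passed in as a callable, with a stub standing in for repeated LLM calls:

```python
from collections import Counter

def self_consistency(sample_answer, n_paths=5):
    """Sample n_paths final answers (one per reasoning path) and
    return the majority answer together with its vote count."""
    answers = [sample_answer() for _ in range(n_paths)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes

# Stub standing in for an LLM sampled at nonzero temperature:
# most reasoning paths reach 42, one path makes an arithmetic slip.
paths = iter([42, 42, 41, 42, 42])
answer, votes = self_consistency(lambda: next(paths), n_paths=5)
print(answer, votes)  # 42 4
```

The intuition is that independent reasoning errors tend to scatter across different wrong answers, while correct paths converge on the same one.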
While chain-of-thought prompting showed that LLMs could produce reasoning traces, the reasoning was often unreliable and inconsistent. The next major step was to train models specifically to reason, rather than relying solely on prompting tricks. This led to the development of dedicated reasoning models, sometimes called "large reasoning models" (LRMs), which use reinforcement learning and other techniques to internalize chain-of-thought reasoning as a core capability.
Several major reasoning models have been released since late 2024, representing a new paradigm in AI development. The following table summarizes the most significant systems.
| Model | Developer | Release Date | Key Characteristics |
|---|---|---|---|
| OpenAI o1 | OpenAI | September 12, 2024 (preview); December 5, 2024 (full) | First major reasoning model; uses internal chain-of-thought trained via reinforcement learning; scored 83.3% on the 2024 AIME and 78% on GPQA Diamond [5] |
| OpenAI o3-mini | OpenAI | January 31, 2025 | Smaller, faster reasoning model; 80% on AIME 2024 at significantly lower cost than o1 [6] |
| DeepSeek R1 | DeepSeek | January 20, 2025 | Open-source (MIT License); trained with pure reinforcement learning; performance comparable to o1 at roughly 96% lower cost; open-sourced with distilled variants [7] |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | February 25, 2025 | Hybrid model with toggleable extended thinking mode; developers can set a "thinking budget" up to 128K tokens; accuracy improves logarithmically with thinking tokens [8] |
| Gemini 2.0 Flash Thinking | Google DeepMind | December 19, 2024 | Experimental reasoning variant of Gemini 2.0 Flash; introduced thinking traces for improved multi-step reasoning [9] |
| OpenAI o3 | OpenAI | April 16, 2025 | Multimodal reasoning model with tool use; 91.6% on AIME 2024, 83.3% on GPQA Diamond; can browse the web and execute code [10] |
| OpenAI o4-mini | OpenAI | April 16, 2025 | High-throughput reasoning model; delivers over 90% of o3 performance at half the compute cost [10] |
| QwQ-32B | Alibaba Cloud | November 2024 (preview); March 2025 (full) | Open-source reasoning model built on Qwen2.5; performance comparable to DeepSeek R1 on AIME and LiveCodeBench benchmarks [11] |
| OpenAI o3-pro | OpenAI | June 10, 2025 | Described as OpenAI's most capable reasoning model at the time of release; designed for maximum accuracy on the hardest tasks [12] |
| Gemini Deep Think | Google DeepMind | August 1, 2025 (GA) | Parallel hypothesis exploration; gold-medal performance on IMO, IPhO, and IChO written sections; 84.6% on ARC-AGI-2 [13] |
| Qwen3 | Alibaba Cloud | April 29, 2025 | Hybrid thinking/non-thinking modes; trained on 36 trillion tokens; significant improvements over QwQ on math, code, and logical reasoning [14] |
Reasoning models represent a fundamental shift in how AI systems are built and deployed. Rather than simply predicting the next token as quickly as possible, these models are designed to "think" before responding, allocating additional computational resources at inference time to work through difficult problems.
Traditional LLMs follow a paradigm of scaling train-time compute: performance improves by training larger models on more data. Reasoning models introduce a complementary approach called test-time compute scaling (also known as inference-time compute scaling). The core idea is that a model can improve its answers by spending more time "thinking" at inference time, generating internal reasoning tokens before producing a final response [15].
This creates a new dimension for improving AI performance. Rather than building an ever-larger model, developers can take a smaller model and give it more time to reason through a problem. Research has shown that scaling inference compute with appropriate reasoning strategies can be more computationally efficient than scaling model parameters alone [16]. In some cases, a smaller model with extended reasoning outperforms a larger model that answers immediately.
The implications for deployment are significant. Analysts project that inference will account for 75% of total AI compute by 2030, driven in part by the growing adoption of reasoning models, which consume 10 to 100 times more tokens per query than standard models [15].
Reasoning models generate an internal chain of thought before producing their final answer. Unlike chain-of-thought prompting, where the user explicitly asks the model to show its reasoning, the internal chain of thought is a trained behavior. The model automatically breaks down complex problems into steps, considers multiple approaches, checks its own work, and refines its answer.
In OpenAI's o1, for example, the internal chain of thought is hidden from the user and only a summary is shown. The model might spend hundreds or thousands of tokens reasoning through a math problem internally before producing a concise final answer. DeepSeek R1, being open-source, made its reasoning traces more visible, revealing patterns such as self-reflection ("Wait, let me reconsider..."), verification ("Let me check this step..."), and dynamic strategy adaptation [7].
Anthropic's approach with Claude's extended thinking mode is somewhat different: developers can set a "thinking budget" that controls how many tokens Claude is allowed to use for internal reasoning, providing fine-grained control over the trade-off between reasoning depth and response speed [8].
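Stripped of vendor specifics, a thinking budget is just a cap on how many reasoning tokens are drawn before the answering step runs. The sketch below shows only the control pattern; `generate_step` and `final_answer` are hypothetical placeholders, not any provider's API:

```python
def reason_with_budget(generate_step, final_answer, budget_tokens):
    """Accumulate internal reasoning up to budget_tokens, then answer."""
    trace, used = [], 0
    while used < budget_tokens:
        step = generate_step(trace)   # next chunk of internal reasoning
        if step is None:              # model signals it has reasoned enough
            break
        trace.append(step)
        used += len(step.split())     # crude token count, for illustration
    return final_answer(trace)

# Toy run: two reasoning steps, then the model stops on its own.
steps = iter(["add 2 and 3", "double the sum", None])
result = reason_with_budget(lambda t: next(steps),
                            lambda t: f"{len(t)} steps used",
                            budget_tokens=50)
print(result)  # 2 steps used
```

A larger budget lets the loop run longer on hard problems; a budget of zero skips the loop entirely, reducing the system to a standard immediate-answer model.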
The training process for reasoning models relies heavily on reinforcement learning (RL). The general approach involves training a base language model to produce reasoning traces, then using RL to reward traces that lead to correct answers and penalize those that do not.
OpenAI described o1 as using "a large-scale reinforcement learning algorithm" that teaches the model to "refine its chain of thought" through a learning process that leverages training-time compute [5]. DeepSeek's R1 took this further by demonstrating that pure RL, without any supervised fine-tuning on human-written reasoning examples, could produce strong reasoning capabilities. DeepSeek-R1-Zero, trained via large-scale RL from scratch, spontaneously developed advanced reasoning patterns including self-verification and backtracking [7].
The RL training process typically uses a reward signal based on the correctness of final answers (for math and coding tasks, where answers can be verified automatically) or on preference judgments from human evaluators or AI judges.
While RLHF was initially developed to align language models with human preferences for helpfulness and safety, similar techniques have been adapted for reasoning. Some reasoning models use a combination of outcome-based reward (did the model get the right answer?) and process-based reward (did the model follow sound reasoning steps?). Process reward models, which evaluate each step of a reasoning chain rather than just the final answer, have shown promise in improving reasoning reliability.
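The two reward styles can be contrasted in a few lines. In this toy Python scorer, the outcome reward looks only at the final answer, while the process reward averages per-step scores from a judge; the judge here is a stub, where a real system would use a trained process reward model:

```python
def outcome_reward(final_answer, correct_answer):
    """1.0 if the final answer is right, else 0.0 -- ignores the steps."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(trace, score_step):
    """Mean per-step score from a judge of individual reasoning steps."""
    scores = [score_step(step) for step in trace]
    return sum(scores) / len(scores) if scores else 0.0

# Toy trace with one flawed step; the stub judge flags it.
trace = ["let x = 5", "then 2x = 11", "so x + 2x = 16"]
judge = lambda step: 0.0 if "2x = 11" in step else 1.0

print(outcome_reward(16, 15))                  # 0.0: wrong final answer
print(round(process_reward(trace, judge), 2))  # 0.67: one bad step of three
```

Outcome reward is cheap to compute when answers are automatically verifiable but gives no credit for partially sound reasoning; process reward localizes the error to the faulty step.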
Evaluating reasoning capabilities requires specialized benchmarks that test multi-step problem-solving, domain expertise, and the ability to handle novel challenges. The following table summarizes the major benchmarks used to evaluate reasoning models.
| Benchmark | Domain | Description | Difficulty Level | Notable Scores |
|---|---|---|---|---|
| GSM8K | Mathematics | 8,500 grade-school math word problems requiring multi-step arithmetic reasoning; introduced by OpenAI researchers in 2021 [17] | Moderate | Largely saturated by 2024; top models exceed 95% |
| MATH | Mathematics | 12,500 competition-level high school math problems covering algebra, geometry, number theory, and precalculus [18] | Hard | o1 scored 94.8% on the full set; most frontier models now exceed 90% |
| AIME | Mathematics | Problems from the American Invitational Mathematics Examination, a prestigious high school math competition; tests creative multi-step problem-solving [19] | Very Hard | o1: 83.3% (2024); o3: 91.6% (2024), 88.9% (2025) |
| GPQA Diamond | Science | 198 graduate-level multiple-choice questions in physics, chemistry, and biology written by domain experts; designed to be "Google-proof" [20] | Very Hard | o1: 78%; o3: 83.3%; top models in late 2025 exceed 90% |
| ARC-AGI | Abstract reasoning | Tests the ability to identify patterns in novel visual grids; designed to measure fluid intelligence and generalization [21] | Very Hard | Gemini 3 Pro: 31.1% (45.1% with Deep Think); Gemini Deep Think: 84.6% on ARC-AGI-2 |
| Humanity's Last Exam | Cross-domain | Extremely difficult questions across many academic disciplines, designed to be the hardest AI benchmark; created by a coalition of experts [22] | Extreme | Top models in late 2025 score 35-50%; considered far from saturated |
A recurring challenge in AI reasoning evaluation is benchmark saturation. GSM8K, once considered a meaningful test of mathematical reasoning, is now solved at near-perfect accuracy by most frontier models. The MATH benchmark followed a similar trajectory, going from below 10% accuracy in 2021 to above 90% by 2024. This rapid saturation has driven the creation of progressively harder benchmarks like AIME, GPQA Diamond, and Humanity's Last Exam.
The pattern suggests that static benchmarks may not be reliable long-term measures of reasoning ability. Models can improve on specific benchmarks through targeted training, making it difficult to distinguish genuine reasoning improvement from benchmark-specific optimization.
Traditional scaling laws, such as those described by Kaplan et al. (2020) and Hoffmann et al. (2022, the "Chinchilla" paper), focused on the relationship between model size, training data, and training compute [23]. Inference-time scaling laws extend this framework to the compute spent during generation.
A key 2024 paper by Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters," demonstrated that for reasoning tasks, allocating additional compute at inference time can yield better performance-per-FLOP than simply training a larger model [16]. This finding has profound implications for the economics of AI deployment.
OpenAI reported that o1's performance improves consistently with both more training compute and more time spent thinking at test time [5]. Anthropic observed that Claude's accuracy on math questions improves logarithmically with the number of thinking tokens it is permitted to sample [8]. These results suggest a smooth, predictable relationship between inference-time compute and reasoning accuracy, analogous to the scaling laws observed for training.
The practical consequence is a new trade-off in model design. A smaller, cheaper model that reasons for longer can sometimes match or exceed a much larger model that answers immediately. This has led to a proliferation of model sizes within reasoning families (for instance, o3 versus o4-mini, or DeepSeek R1 versus its smaller distilled variants), allowing users to choose the appropriate balance of cost, speed, and accuracy for their use case.
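A back-of-the-envelope calculation illustrates why spending inference compute on extra samples can pay off. If each independently sampled answer is correct with probability p > 0.5, the chance that a majority vote over n samples is correct rises toward 1 as n grows; the per-sample accuracy of 0.6 below is an arbitrary illustration:

```python
from math import comb

def majority_correct(p, n):
    """P(majority of n independent samples is correct) for odd n,
    with per-sample accuracy p (a binomial tail sum)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

for n in (1, 5, 25, 101):
    print(n, round(majority_correct(0.6, n), 3))
# Accuracy climbs from 0.6 at n=1 toward 1.0 as n grows.
```

The assumption of independent, identically accurate samples is optimistic; in practice model errors are correlated, which is one reason real-world gains from repeated sampling eventually flatten out.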
Despite their impressive benchmark scores, reasoning models have significant limitations that constrain their reliability and applicability.
One of the most concerning findings is that the reasoning traces produced by these models can be "unfaithful," meaning the stated reasoning does not accurately reflect the process that led to the model's answer. A model might arrive at a correct answer through pattern matching or memorization while generating a plausible-sounding but fabricated chain of reasoning to justify it. Conversely, a model might produce logically sound reasoning steps but arrive at an incorrect conclusion due to an error in one step that the subsequent steps fail to catch.
Research published at ICLR 2025 demonstrated that even when language models produce correct-looking reasoning chains, a significant portion of the inference patterns may represent misleading or irrelevant logic [24]. This undermines one of the key proposed benefits of reasoning models: that visible reasoning traces would make AI systems more interpretable and trustworthy.
Reasoning models remain susceptible to confabulation (also called hallucination), where the model generates plausible but factually incorrect information. Extended reasoning chains can sometimes make this worse rather than better: a model may "reason" its way into an incorrect conclusion with great confidence, constructing an elaborate justification for a wrong answer. The longer the reasoning chain, the more opportunities there are for errors to compound.
While reasoning models perform impressively on established benchmarks, they can be surprisingly brittle when faced with genuinely novel problems or problems that require reasoning patterns not well-represented in their training data. Studies have shown that LLMs, including reasoning models, struggle with tasks requiring flexible reasoning in unfamiliar contexts. For example, state-of-the-art reasoning models performed poorly compared to physicians on clinical reasoning tasks that required commonsense medical reasoning and adaptation to unusual patient presentations [25].
Small changes to problem formatting or the introduction of irrelevant information can cause dramatic drops in performance, suggesting that at least some of what appears to be reasoning is actually pattern matching on surface features of problems.
Reasoning models are significantly more expensive to run than standard LLMs. Because they generate many more tokens per query (often 10 to 100 times more), they require proportionally more compute, memory, and time. This makes them impractical for many real-time applications and raises questions about the environmental and economic sustainability of deploying reasoning models at scale.
The question of whether large language models genuinely reason or merely perform sophisticated pattern matching is one of the most actively debated topics in AI research. The answer has significant implications for AI safety, trustworthiness, and the trajectory of future development.
Proponents of the view that LLMs can reason point to several lines of evidence. First, modern reasoning models can solve novel mathematical problems at competition level, including problems from the International Mathematical Olympiad that were published after the model's training data cutoff. Solving such problems requires applying learned techniques to unfamiliar situations, which seems to go beyond simple memorization or pattern matching.
Second, LLMs demonstrate the ability to combine knowledge from different domains in ways that suggest some form of internal representation. They can draw analogies, transfer solution strategies between domains, and generate creative approaches to problems they have not seen before.
Third, the scaling behavior of reasoning models, where performance improves smoothly with more thinking time, parallels human cognition. Humans also reason better when given more time to think, and this similarity suggests that the underlying process may share some functional characteristics with human reasoning.
Skeptics argue that what appears to be reasoning is actually very sophisticated statistical pattern matching over the training data. Several observations support this view.
LLMs struggle with tasks that require persistent state tracking or that present familiar concepts in unfamiliar formats. A model might correctly explain a concept in one context but fail to apply that same concept when the surface presentation changes, which would not happen if the model had a genuine understanding of the underlying principle [26].
Performance on tasks requiring rigorous proof generation, as opposed to producing numerical answers, remains significantly weaker. This suggests a "reasoning illusion" where success on benchmarks involving numerical answers might stem partly from pattern matching on problem types rather than genuine mathematical insight [27].
Additionally, LLMs can be easily fooled by problems that contain misleading surface features. If a problem looks like a standard type but has a twist that changes the solution approach, models frequently apply the standard approach regardless, suggesting they are matching patterns rather than truly understanding the problem structure.
Many researchers have adopted a more nuanced position. LLMs may implement something that functions as reasoning within certain domains and contexts, even if the underlying mechanism (statistical pattern matching over distributed representations) is fundamentally different from human reasoning. The question may ultimately be less about whether LLMs "truly" reason in a philosophical sense and more about understanding the specific conditions under which their reasoning-like behavior is reliable and the conditions under which it breaks down.
This pragmatic framing has practical value: rather than debating definitions, it focuses research on characterizing the boundaries of LLM reasoning capabilities, which directly informs decisions about where these systems can be safely deployed.
The reasoning model landscape has evolved rapidly since OpenAI's release of o1 in September 2024. Several trends define the current state of the field.
DeepSeek R1's January 2025 release demonstrated that competitive reasoning capabilities could be achieved at a fraction of the cost of proprietary models and released under an open-source license. This triggered a wave of open-source reasoning model development. Alibaba's QwQ and Qwen3, along with various distilled versions of R1, have made reasoning capabilities accessible to a much broader set of developers and researchers [7] [14].
Rather than having separate "reasoning" and "non-reasoning" models, the trend has moved toward hybrid systems that can toggle reasoning on and off. Anthropic's extended thinking mode for Claude, Qwen3's thinking/non-thinking modes, and the varied reasoning effort settings in OpenAI's models all reflect this approach. Users and developers can choose when to invest the additional compute required for deep reasoning and when a quick response suffices [8].
Reasoning capabilities have expanded beyond text. OpenAI's o3 can reason about images, charts, and graphics [10]. Google's Gemini Deep Think applies its parallel hypothesis exploration to scientific diagrams and visual problem-solving [13]. This represents a significant expansion of what reasoning models can do, moving them closer to the kind of multi-modal reasoning that humans perform naturally.
Reasoning models are increasingly being integrated into agentic workflows where they can use tools, browse the web, execute code, and interact with external systems as part of their reasoning process. OpenAI's o3, for example, can agentically combine web search, Python code execution, and image analysis within a single reasoning chain [10]. This integration of reasoning with action represents a convergence of two major threads in AI research.
Despite rapid progress, significant challenges remain. Humanity's Last Exam, where even the best models score only 35 to 50%, serves as a reminder that current reasoning systems fall far short of human expert-level reasoning across diverse domains [22]. Clinical reasoning, legal reasoning, and other domains requiring deep contextual understanding and commonsense knowledge remain difficult. The faithfulness of reasoning traces is an unsolved problem that has implications for AI safety and interpretability.
The field continues to evolve quickly. New models and techniques are released regularly, benchmarks are created and saturated in increasingly short cycles, and the theoretical understanding of what drives reasoning capabilities in neural networks remains incomplete. What is clear is that reasoning has become the central frontier of AI capability development, with major labs and open-source communities alike investing heavily in pushing the boundaries of what these systems can do.