Reasoning in artificial intelligence refers to the ability of AI systems to draw inferences, solve problems, and make decisions through structured thought processes. It encompasses a broad set of cognitive capabilities, from following logical rules and recognizing patterns to forming hypotheses and navigating uncertain information. Reasoning has been a central goal of AI research since the field's inception in the 1950s, and it remains one of the most actively studied and debated topics in modern AI, particularly in the context of large language models (LLMs).
Since late 2024, a new class of so-called "reasoning models" has emerged. These systems, including OpenAI o1, o3, DeepSeek R1, and others, use techniques like internal chain-of-thought processing and reinforcement learning to spend additional compute at inference time, yielding substantial improvements on mathematical, scientific, and coding benchmarks. Whether these models genuinely reason or perform sophisticated pattern matching remains an open and consequential question.
Reasoning in AI can be categorized into several distinct types, each reflecting a different aspect of human cognitive ability. These categories are not mutually exclusive; real-world problem-solving often requires combining multiple forms of reasoning simultaneously.
Deductive reasoning moves from general premises to specific conclusions. If the premises are true and the logic is valid, the conclusion is guaranteed to be true. For example: "All mammals are warm-blooded. A dog is a mammal. Therefore, a dog is warm-blooded." Classical symbolic AI systems and logic programming languages like Prolog were built around deductive reasoning.
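The deductive pattern is mechanical enough to capture in a few lines. The sketch below implements a minimal forward-chaining loop in Python; the facts and rules are illustrative, not drawn from any particular system:

```python
# Minimal forward chaining: repeatedly apply if-then rules to a set of
# facts until no rule derives anything new.
def forward_chain(facts, rules):
    """facts: set of (predicate, subject) pairs.
    rules: list of (premise_predicate, conclusion_predicate) pairs,
    read as "if X is a <premise>, then X is a <conclusion>"."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            for pred, subj in list(derived):
                if pred == premise and (conclusion, subj) not in derived:
                    derived.add((conclusion, subj))
                    changed = True
    return derived

# "All mammals are warm-blooded. A dog is a mammal."
facts = {("mammal", "dog")}
rules = [("mammal", "warm_blooded")]
print(forward_chain(facts, rules))
# Derives ("warm_blooded", "dog") alongside the original fact.
```

Prolog generalizes this idea with variables, unification, and backward chaining, but the guarantee is the same: if the premises are true, the derived conclusions follow necessarily.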
Inductive reasoning works in the opposite direction, drawing general conclusions from specific observations. A system that observes thousands of spam emails and learns to identify common patterns is performing induction. Most machine learning algorithms rely heavily on inductive reasoning, generalizing from training data to make predictions on unseen inputs.
Abductive reasoning involves inferring the most likely explanation for a set of observations. When a doctor considers symptoms and arrives at a diagnosis, that process is abductive. This form of reasoning is inherently uncertain; the conclusion is a best guess rather than a guaranteed truth. It plays a significant role in diagnostic systems and natural language understanding.
Analogical reasoning solves new problems by finding similarities to previously solved problems. If a system knows how to navigate one city's road network, it might apply similar strategies to a different city. This type of reasoning is central to transfer learning and few-shot learning in modern AI.
Causal reasoning goes beyond correlation to understand cause-and-effect relationships. Rather than merely noting that umbrella sales and rain tend to co-occur, a causally reasoning system would understand that rain causes people to buy umbrellas, not the reverse. Judea Pearl's work on causal inference has been influential in formalizing this type of reasoning for AI systems [1].
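The asymmetry between correlation and causation can be made concrete with a toy structural model in which rain causes umbrella purchases. Intervening on the cause shifts the effect, while intervening on the effect leaves the cause untouched; the probabilities below are arbitrary illustrations:

```python
import random

def simulate(n=10_000, force_rain=None, force_umbrellas=None, seed=0):
    """Toy structural model: rain -> umbrella sales.
    The force_* arguments implement interventions (Pearl's do-operator)."""
    rng = random.Random(seed)
    rain_days = umbrella_days = 0
    for _ in range(n):
        rain = force_rain if force_rain is not None else rng.random() < 0.3
        if force_umbrellas is not None:
            umbrellas = force_umbrellas
        else:
            umbrellas = rng.random() < (0.8 if rain else 0.1)
        rain_days += rain
        umbrella_days += umbrellas
    return rain_days / n, umbrella_days / n

print(simulate())                      # observational rates
print(simulate(force_rain=True))       # do(rain): umbrella rate jumps to ~0.8
print(simulate(force_umbrellas=True))  # do(umbrellas): rain rate stays ~0.3
```

A purely correlational model would treat the two interventions symmetrically; the structural model does not, which is precisely the distinction causal reasoning captures.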
Commonsense reasoning involves applying the vast body of everyday knowledge that humans take for granted. Knowing that a glass will break if dropped, that people feel hungry before meals, or that objects do not float upward without a force acting on them are examples. This has historically been one of the hardest challenges for AI, since commonsense knowledge is enormous in scope and difficult to formalize.
Mathematical reasoning involves manipulating numerical and symbolic expressions, constructing proofs, and solving equations. It requires precision, multi-step logic, and the ability to apply abstract rules correctly. Mathematical reasoning has become a primary benchmark for evaluating modern reasoning models.
Spatial reasoning deals with understanding and manipulating the positions, shapes, and relationships of objects in space. It is critical for robotics, computer vision, and navigation tasks. Spatial reasoning also plays a role in understanding language that describes physical arrangements.
The pursuit of machine reasoning is as old as the field of artificial intelligence itself. The approaches taken have shifted dramatically over the decades, from hand-coded logic to statistical methods to the neural network systems dominant today.
The earliest AI programs were built on the premise that intelligence could be captured through formal logic and symbol manipulation. In 1955, Allen Newell and Herbert A. Simon, with the assistance of J.C. Shaw, created the Logic Theorist, widely considered the first AI program. The Logic Theorist proved 38 of the first 52 theorems in Bertrand Russell and Alfred North Whitehead's Principia Mathematica, and in some cases discovered proofs more elegant than the originals [2].
The 1956 Dartmouth Workshop, organized by John McCarthy and Marvin Minsky along with Nathaniel Rochester and Claude Shannon, formally established AI as an academic discipline. In the years that followed, researchers developed systems like the General Problem Solver (1957), which attempted to encode general-purpose reasoning strategies. McCarthy's development of Lisp in 1958 and later work on situation calculus provided programming tools and formal frameworks for reasoning about actions and change.
This era was dominated by what is now called "Good Old-Fashioned AI" (GOFAI) or symbolic AI. The core assumption was that intelligence consists of manipulating symbolic representations according to logical rules. Systems could perform impressive feats of deduction within narrow domains, but they struggled with the messiness of real-world knowledge and the combinatorial explosion of possible inferences.
Expert systems represented the first major commercial application of AI reasoning. These systems encoded the knowledge of human domain experts as collections of if-then rules. MYCIN (1976), developed at Stanford, could diagnose bacterial infections and recommend antibiotics with accuracy comparable to human specialists. DENDRAL (1969) could determine molecular structures from mass spectrometry data.
Expert systems demonstrated that narrow, domain-specific reasoning could be practically useful. However, they were brittle: they could not handle situations outside their programmed rules, they required enormous effort to build and maintain, and they lacked the ability to learn from new data. By the early 1990s, interest in expert systems had waned considerably.
The limitations of purely symbolic reasoning led to a shift toward statistical methods. Bayesian networks, hidden Markov models, and other probabilistic frameworks allowed AI systems to reason under uncertainty, a capability that rule-based systems lacked.
This era also saw the rise of machine learning as the dominant paradigm. Rather than hand-coding reasoning rules, systems learned patterns from data. Support vector machines, decision trees, and eventually deep learning models demonstrated that statistical pattern recognition could solve problems that symbolic methods could not, from speech recognition to image classification.
The deep learning revolution, beginning around 2012 with the success of AlexNet on ImageNet, initially focused on perception tasks such as image recognition and speech processing. But researchers quickly began exploring whether neural networks could also perform reasoning.
Early milestones included DeepMind's AlphaGo (2016), which defeated the world champion at Go through a combination of deep learning and tree search, demonstrating that neural approaches could handle complex strategic reasoning. The introduction of the Transformer architecture in 2017 [3] and subsequent development of large language models opened a new chapter in AI reasoning, as these models began to show unexpected capabilities in logical, mathematical, and commonsense reasoning tasks.
The emergence of large language models such as GPT-3, GPT-4, and Claude revealed that models trained primarily on next-token prediction could exhibit reasoning-like behavior. This was surprising to many researchers, since the training objective (predicting the next word) does not explicitly require reasoning.
A landmark paper by Jason Wei and colleagues at Google, published in January 2022, introduced chain-of-thought (CoT) prompting [4]. The key insight was simple but powerful: by including intermediate reasoning steps in the prompt examples given to a large language model, the model could be induced to generate its own step-by-step reasoning before arriving at an answer.
Experiments showed that chain-of-thought prompting dramatically improved performance on arithmetic, commonsense, and symbolic reasoning tasks. A 540-billion-parameter PaLM model prompted with just eight chain-of-thought examples achieved state-of-the-art accuracy on the GSM8K benchmark of grade-school math word problems [4]. The technique worked best on larger models; smaller models did not benefit as much, suggesting that reasoning capabilities emerge at scale.
Chain-of-thought prompting spawned numerous variants, including zero-shot CoT (simply adding "Let's think step by step" to the prompt), self-consistency (sampling multiple reasoning paths and taking a majority vote), and tree-of-thought prompting (exploring branching reasoning paths).
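Of these variants, self-consistency is the easiest to sketch: sample several independent reasoning paths, extract each path's final answer, and return the most common one. In the Python sketch below the sampler is passed in as a callable, with a stub standing in for repeated LLM calls:

```python
from collections import Counter

def self_consistency(sample_answer, n_paths=5):
    """Sample n_paths final answers (one per reasoning path) and
    return the majority answer together with its vote count."""
    answers = [sample_answer() for _ in range(n_paths)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes

# Stub standing in for an LLM sampled at nonzero temperature:
# most reasoning paths reach 42, one path makes an arithmetic slip.
paths = iter([42, 42, 41, 42, 42])
answer, votes = self_consistency(lambda: next(paths), n_paths=5)
print(answer, votes)  # 42 4
```

The intuition is that independent reasoning errors tend to scatter across different wrong answers, while correct paths converge on the same one.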
While chain-of-thought prompting showed that LLMs could produce reasoning traces, the reasoning was often unreliable and inconsistent. The next major step was to train models specifically to reason, rather than relying solely on prompting tricks. This led to the development of dedicated reasoning models, sometimes called "large reasoning models" (LRMs), which use reinforcement learning and other techniques to internalize chain-of-thought reasoning as a core capability.
Several major reasoning models have been released since late 2024, representing a new paradigm in AI development. The following table summarizes the most significant systems.
| Model | Developer | Release Date | Key Characteristics |
|---|---|---|---|
| OpenAI o1 | OpenAI | September 12, 2024 (preview); December 5, 2024 (full) | First major reasoning model; uses internal chain-of-thought trained via reinforcement learning; scored 83.3% on the 2024 AIME and 78% on GPQA Diamond [5] |
| OpenAI o3-mini | OpenAI | January 31, 2025 | Smaller, faster reasoning model; 80% on AIME 2024 at significantly lower cost than o1 [6] |
| DeepSeek R1 | DeepSeek | January 20, 2025 | Open-source (MIT License); trained with pure reinforcement learning; performance comparable to o1 at roughly 96% lower cost; open-sourced with distilled variants [7] |
| Claude 3.7 Sonnet (extended thinking) | Anthropic | February 25, 2025 | Hybrid model with toggleable extended thinking mode; developers can set a "thinking budget" up to 128K tokens; accuracy improves logarithmically with thinking tokens [8] |
| Gemini 2.0 Flash Thinking | Google DeepMind | December 19, 2024 | Experimental reasoning variant of Gemini 2.0 Flash; introduced thinking traces for improved multi-step reasoning [9] |
| OpenAI o3 | OpenAI | April 16, 2025 | Multimodal reasoning model with tool use; 91.6% on AIME 2024, 83.3% on GPQA Diamond; can browse the web and execute code [10] |
| OpenAI o4-mini | OpenAI | April 16, 2025 | High-throughput reasoning model; delivers over 90% of o3 performance at half the compute cost [10] |
| QwQ-32B | Alibaba Cloud | November 2024 (preview); March 2025 (full) | Open-source reasoning model built on Qwen2.5; performance comparable to DeepSeek R1 on AIME and LiveCodeBench benchmarks [11] |
| OpenAI o3-pro | OpenAI | June 10, 2025 | Described as OpenAI's most capable reasoning model at the time of release; designed for maximum accuracy on the hardest tasks [12] |
| Gemini Deep Think | Google DeepMind | August 1, 2025 (GA) | Parallel hypothesis exploration; gold-medal performance on IMO, IPhO, and IChO written sections; 84.6% on ARC-AGI-2 [13] |
| Qwen3 | Alibaba Cloud | April 29, 2025 | Hybrid thinking/non-thinking modes; trained on 36 trillion tokens; significant improvements over QwQ on math, code, and logical reasoning [14] |
Reasoning models represent a fundamental shift in how AI systems are built and deployed. Rather than simply predicting the next token as quickly as possible, these models are designed to "think" before responding, allocating additional computational resources at inference time to work through difficult problems.
Traditional LLMs follow a paradigm of scaling train-time compute: performance improves by training larger models on more data. Reasoning models introduce a complementary approach called test-time compute scaling (also known as inference-time compute scaling). The core idea is that a model can improve its answers by spending more time "thinking" at inference time, generating internal reasoning tokens before producing a final response [15].
This creates a new dimension for improving AI performance. Rather than building an ever-larger model, developers can take a smaller model and give it more time to reason through a problem. Research has shown that scaling inference compute with appropriate reasoning strategies can be more computationally efficient than scaling model parameters alone [16]. In some cases, a smaller model with extended reasoning outperforms a larger model that answers immediately.
The implications for deployment are significant. Analysts project that inference will account for 75% of total AI compute by 2030, driven in part by the growing adoption of reasoning models, which consume 10 to 100 times more tokens per query than standard models [15].
Reasoning models generate an internal chain of thought before producing their final answer. Unlike chain-of-thought prompting, where the user explicitly asks the model to show its reasoning, the internal chain of thought is a trained behavior. The model automatically breaks down complex problems into steps, considers multiple approaches, checks its own work, and refines its answer.
In OpenAI's o1, for example, the internal chain of thought is hidden from the user and only a summary is shown. The model might spend hundreds or thousands of tokens reasoning through a math problem internally before producing a concise final answer. DeepSeek R1, being open-source, made its reasoning traces more visible, revealing patterns such as self-reflection ("Wait, let me reconsider..."), verification ("Let me check this step..."), and dynamic strategy adaptation [7].
Anthropic's approach with Claude's extended thinking mode is somewhat different: developers can set a "thinking budget" that controls how many tokens Claude is allowed to use for internal reasoning, providing fine-grained control over the trade-off between reasoning depth and response speed [8].
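Stripped of vendor specifics, a thinking budget is just a cap on how many reasoning tokens are drawn before the answering step runs. The sketch below shows only the control pattern; `generate_step` and `final_answer` are hypothetical placeholders, not any provider's API:

```python
def reason_with_budget(generate_step, final_answer, budget_tokens):
    """Accumulate internal reasoning up to budget_tokens, then answer."""
    trace, used = [], 0
    while used < budget_tokens:
        step = generate_step(trace)   # next chunk of internal reasoning
        if step is None:              # model signals it has reasoned enough
            break
        trace.append(step)
        used += len(step.split())     # crude token count, for illustration
    return final_answer(trace)

# Toy run: two reasoning steps, then the model stops on its own.
steps = iter(["add 2 and 3", "double the sum", None])
result = reason_with_budget(lambda t: next(steps),
                            lambda t: f"{len(t)} steps used",
                            budget_tokens=50)
print(result)  # 2 steps used
```

A larger budget lets the loop run longer on hard problems; a budget of zero skips the loop entirely, reducing the system to a standard immediate-answer model.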
The training process for reasoning models relies heavily on reinforcement learning (RL). The general approach involves training a base language model to produce reasoning traces, then using RL to reward traces that lead to correct answers and penalize those that do not.
OpenAI described o1 as using "a large-scale reinforcement learning algorithm" that teaches the model to "refine its chain of thought" through a learning process that leverages training-time compute [5]. DeepSeek's R1 took this further by demonstrating that pure RL, without any supervised fine-tuning on human-written reasoning examples, could produce strong reasoning capabilities. DeepSeek-R1-Zero, trained via large-scale RL from scratch, spontaneously developed advanced reasoning patterns including self-verification and backtracking [7].
The RL training process typically uses a reward signal based on the correctness of final answers (for math and coding tasks, where answers can be verified automatically) or on preference judgments from human evaluators or AI judges.
While RLHF was initially developed to align language models with human preferences for helpfulness and safety, similar techniques have been adapted for reasoning. Some reasoning models use a combination of outcome-based reward (did the model get the right answer?) and process-based reward (did the model follow sound reasoning steps?). Process reward models, which evaluate each step of a reasoning chain rather than just the final answer, have shown promise in improving reasoning reliability.
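The two reward styles can be contrasted in a few lines. In this toy Python scorer, the outcome reward looks only at the final answer, while the process reward averages per-step scores from a judge; the judge here is a stub, where a real system would use a trained process reward model:

```python
def outcome_reward(final_answer, correct_answer):
    """1.0 if the final answer is right, else 0.0 -- ignores the steps."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_reward(trace, score_step):
    """Mean per-step score from a judge of individual reasoning steps."""
    scores = [score_step(step) for step in trace]
    return sum(scores) / len(scores) if scores else 0.0

# Toy trace with one flawed step; the stub judge flags it.
trace = ["let x = 5", "then 2x = 11", "so x + 2x = 16"]
judge = lambda step: 0.0 if "2x = 11" in step else 1.0

print(outcome_reward(16, 15))                  # 0.0: wrong final answer
print(round(process_reward(trace, judge), 2))  # 0.67: one bad step of three
```

Outcome reward is cheap to compute when answers are automatically verifiable but gives no credit for partially sound reasoning; process reward localizes the error to the faulty step.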
Evaluating reasoning capabilities requires specialized benchmarks that test multi-step problem-solving, domain expertise, and the ability to handle novel challenges. The following table summarizes the major benchmarks used to evaluate reasoning models.
| Benchmark | Domain | Description | Difficulty Level | Notable Scores |
|---|---|---|---|---|
| GSM8K | Mathematics | 8,500 grade-school math word problems requiring multi-step arithmetic reasoning; introduced by OpenAI researchers in 2021 [17] | Moderate | Largely saturated by 2024; top models exceed 95% |
| MATH | Mathematics | 12,500 competition-level high school math problems covering algebra, geometry, number theory, and precalculus [18] | Hard | o1 scored 94.8% on the full set; most frontier models now exceed 90% |
| AIME | Mathematics | Problems from the American Invitational Mathematics Examination, a prestigious high school math competition; tests creative multi-step problem-solving [19] | Very Hard | o1: 83.3% (2024); o3: 91.6% (2024), 88.9% (2025) |
| GPQA Diamond | Science | 198 graduate-level multiple-choice questions in physics, chemistry, and biology written by domain experts; designed to be "Google-proof" [20] | Very Hard | o1: 78%; o3: 83.3%; top models in late 2025 exceed 90% |
| ARC-AGI | Abstract reasoning | Tests the ability to identify patterns in novel visual grids; designed to measure fluid intelligence and generalization [21] | Very Hard | Gemini 3 Pro: 31.1% (45.1% with Deep Think); Gemini Deep Think: 84.6% on ARC-AGI-2 |
| Humanity's Last Exam | Cross-domain | Extremely difficult questions across many academic disciplines, designed to be the hardest AI benchmark; created by a coalition of experts [22] | Extreme | Top models in late 2025 score 35-50%; considered far from saturated |
A recurring challenge in AI reasoning evaluation is benchmark saturation. GSM8K, once considered a meaningful test of mathematical reasoning, is now solved at near-perfect accuracy by most frontier models. The MATH benchmark followed a similar trajectory, going from below 10% accuracy in 2021 to above 90% by 2024. This rapid saturation has driven the creation of progressively harder benchmarks like AIME, GPQA Diamond, and Humanity's Last Exam.
The pattern suggests that static benchmarks may not be reliable long-term measures of reasoning ability. Models can improve on specific benchmarks through targeted training, making it difficult to distinguish genuine reasoning improvement from benchmark-specific optimization.
Traditional scaling laws, such as those described by Kaplan et al. (2020) and Hoffmann et al. (2022, the "Chinchilla" paper), focused on the relationship between model size, training data, and training compute [23]. Inference-time scaling laws extend this framework to the compute spent during generation.
A key 2024 paper by Snell et al., "Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters," demonstrated that for reasoning tasks, allocating additional compute at inference time can yield better performance-per-FLOP than simply training a larger model [16]. This finding has profound implications for the economics of AI deployment.
OpenAI reported that o1's performance improves consistently with both more training compute and more time spent thinking at test time [5]. Anthropic observed that Claude's accuracy on math questions improves logarithmically with the number of thinking tokens it is permitted to sample [8]. These results suggest a smooth, predictable relationship between inference-time compute and reasoning accuracy, analogous to the scaling laws observed for training.
The practical consequence is a new trade-off in model design. A smaller, cheaper model that reasons for longer can sometimes match or exceed a much larger model that answers immediately. This has led to a proliferation of model sizes within reasoning families (for instance, o3 versus o4-mini, or DeepSeek R1 versus its smaller distilled variants), allowing users to choose the appropriate balance of cost, speed, and accuracy for their use case.
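A back-of-the-envelope calculation illustrates why spending inference compute on extra samples can pay off. If each independently sampled answer is correct with probability p > 0.5, the chance that a majority vote over n samples is correct rises toward 1 as n grows; the per-sample accuracy of 0.6 below is an arbitrary illustration:

```python
from math import comb

def majority_correct(p, n):
    """P(majority of n independent samples is correct) for odd n,
    with per-sample accuracy p (a binomial tail sum)."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(need, n + 1))

for n in (1, 5, 25, 101):
    print(n, round(majority_correct(0.6, n), 3))
# Accuracy climbs from 0.6 at n=1 toward 1.0 as n grows.
```

The assumption of independent, identically accurate samples is optimistic; in practice model errors are correlated, which is one reason real-world gains from repeated sampling eventually flatten out.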
Despite their impressive benchmark scores, reasoning models have significant limitations that constrain their reliability and applicability.
One of the most concerning findings is that the reasoning traces produced by these models can be "unfaithful," meaning the stated reasoning does not accurately reflect the process that led to the model's answer. A model might arrive at a correct answer through pattern matching or memorization while generating a plausible-sounding but fabricated chain of reasoning to justify it. Conversely, a model might produce logically sound reasoning steps but arrive at an incorrect conclusion due to an error in one step that the subsequent steps fail to catch.
Research published at ICLR 2025 demonstrated that even when language models produce correct-looking reasoning chains, a significant portion of the inference patterns may represent misleading or irrelevant logic [24]. This undermines one of the key proposed benefits of reasoning models: that visible reasoning traces would make AI systems more interpretable and trustworthy.
Reasoning models remain susceptible to confabulation (also called hallucination), where the model generates plausible but factually incorrect information. Extended reasoning chains can sometimes make this worse rather than better: a model may "reason" its way into an incorrect conclusion with great confidence, constructing an elaborate justification for a wrong answer. The longer the reasoning chain, the more opportunities there are for errors to compound.
While reasoning models perform impressively on established benchmarks, they can be surprisingly brittle when faced with genuinely novel problems or problems that require reasoning patterns not well-represented in their training data. Studies have shown that LLMs, including reasoning models, struggle with tasks requiring flexible reasoning in unfamiliar contexts. For example, state-of-the-art reasoning models performed poorly compared to physicians on clinical reasoning tasks that required commonsense medical reasoning and adaptation to unusual patient presentations [25].
Small changes to problem formatting or the introduction of irrelevant information can cause dramatic drops in performance, suggesting that at least some of what appears to be reasoning is actually pattern matching on surface features of problems.
Reasoning models are significantly more expensive to run than standard LLMs. Because they generate many more tokens per query (often 10 to 100 times more), they require proportionally more compute, memory, and time. This makes them impractical for many real-time applications and raises questions about the environmental and economic sustainability of deploying reasoning models at scale.
The question of whether large language models genuinely reason or merely perform sophisticated pattern matching is one of the most actively debated topics in AI research. The answer has significant implications for AI safety, trustworthiness, and the trajectory of future development.
Proponents of the view that LLMs can reason point to several lines of evidence. First, modern reasoning models can solve novel mathematical problems at competition level, including problems from the International Mathematical Olympiad that were published after the model's training data cutoff. Solving such problems requires applying learned techniques to unfamiliar situations, which seems to go beyond simple memorization or pattern matching.
Second, LLMs demonstrate the ability to combine knowledge from different domains in ways that suggest some form of internal representation. They can draw analogies, transfer solution strategies between domains, and generate creative approaches to problems they have not seen before.
Third, the scaling behavior of reasoning models, where performance improves smoothly with more thinking time, parallels human cognition. Humans also reason better when given more time to think, and this similarity suggests that the underlying process may share some functional characteristics with human reasoning.
Skeptics argue that what appears to be reasoning is actually very sophisticated statistical pattern matching over the training data. Several observations support this view.
LLMs struggle with tasks that require persistent state tracking or that present familiar concepts in unfamiliar formats. A model might correctly explain a concept in one context but fail to apply that same concept when the surface presentation changes, which would not happen if the model had a genuine understanding of the underlying principle [26].
Performance on tasks requiring rigorous proof generation, as opposed to producing numerical answers, remains significantly weaker. This suggests a "reasoning illusion" where success on benchmarks involving numerical answers might stem partly from pattern matching on problem types rather than genuine mathematical insight [27].
Additionally, LLMs can be easily fooled by problems that contain misleading surface features. If a problem looks like a standard type but has a twist that changes the solution approach, models frequently apply the standard approach regardless, suggesting they are matching patterns rather than truly understanding the problem structure.
Many researchers have adopted a more nuanced position. LLMs may implement something that functions as reasoning within certain domains and contexts, even if the underlying mechanism (statistical pattern matching over distributed representations) is fundamentally different from human reasoning. The question may ultimately be less about whether LLMs "truly" reason in a philosophical sense and more about understanding the specific conditions under which their reasoning-like behavior is reliable and the conditions under which it breaks down.
This pragmatic framing has practical value: rather than debating definitions, it focuses research on characterizing the boundaries of LLM reasoning capabilities, which directly informs decisions about where these systems can be safely deployed.
The reasoning model landscape has evolved rapidly since OpenAI's release of o1 in September 2024. Several trends define the current state of the field.
DeepSeek R1's January 2025 release demonstrated that competitive reasoning capabilities could be achieved at a fraction of the cost of proprietary models and released under an open-source license. This triggered a wave of open-source reasoning model development. Alibaba's QwQ and Qwen3, along with various distilled versions of R1, have made reasoning capabilities accessible to a much broader set of developers and researchers [7] [14].
Rather than having separate "reasoning" and "non-reasoning" models, the trend has moved toward hybrid systems that can toggle reasoning on and off. Anthropic's extended thinking mode for Claude, Qwen3's thinking/non-thinking modes, and the varied reasoning effort settings in OpenAI's models all reflect this approach. Users and developers can choose when to invest the additional compute required for deep reasoning and when a quick response suffices [8].
Reasoning capabilities have expanded beyond text. OpenAI's o3 can reason about images, charts, and graphics [10]. Google's Gemini Deep Think applies its parallel hypothesis exploration to scientific diagrams and visual problem-solving [13]. This represents a significant expansion of what reasoning models can do, moving them closer to the kind of multi-modal reasoning that humans perform naturally.
Reasoning models are increasingly being integrated into agentic workflows where they can use tools, browse the web, execute code, and interact with external systems as part of their reasoning process. OpenAI's o3, for example, can agentically combine web search, Python code execution, and image analysis within a single reasoning chain [10]. This integration of reasoning with action represents a convergence of two major threads in AI research.
Despite rapid progress, significant challenges remain. Humanity's Last Exam, where even the best models score only 35 to 50%, serves as a reminder that current reasoning systems fall far short of human expert-level reasoning across diverse domains [22]. Clinical reasoning, legal reasoning, and other domains requiring deep contextual understanding and commonsense knowledge remain difficult. The faithfulness of reasoning traces is an unsolved problem that has implications for AI safety and interpretability.
The field continues to evolve quickly. New models and techniques are released regularly, benchmarks are created and saturated in increasingly short cycles, and the theoretical understanding of what drives reasoning capabilities in neural networks remains incomplete. What is clear is that reasoning has become the central frontier of AI capability development, with major labs and open-source communities alike investing heavily in pushing the boundaries of what these systems can do.