Q* OpenAI
Last reviewed
May 11, 2026
Sources
18 citations
Review status
Source-backed
Revision
v3 · 3,097 words
Q* (pronounced Q-Star) was a closely guarded internal project at OpenAI that became public knowledge in November 2023, when Reuters reported that several staff researchers had sent a warning letter to the company's board days before the brief firing of CEO Sam Altman. The leak claimed the project demonstrated a new kind of mathematical reasoning capability that some inside the company believed could be a step toward artificial general intelligence (AGI). Q* was never released as a product under that name. By mid-2024 it had been renamed Project Strawberry, and on September 12, 2024 OpenAI publicly launched its first reasoning model line as o1-preview, which most observers (including OpenAI insiders quoted by The Information and Bloomberg) treat as the production descendant of the Q* research thread.[1][2][3]
The story of Q* matters less for any single technical artifact than for what it marks: the moment when reinforcement-learning-based reasoning, inference-time search, and process supervision became the dominant frontier in large language model research. The o1, o3, and o4-mini families that followed all share the broad design philosophy that the Q* leak hinted at: spend more compute at inference time on a hidden chain of thought, train the model with reinforcement learning on verifiable problems, and, according to outside analyses that OpenAI has not confirmed, use a step-level reward signal rather than a single end-of-answer score.[4][5]
| Item | Detail |
|---|---|
| Codename | Q* (pronounced Q-Star), later succeeded by Strawberry |
| Organization | OpenAI |
| First public reporting | Reuters, November 22, 2023 |
| Trigger event | Letter from staff researchers to the board, days before Sam Altman's November 17, 2023 firing |
| Reported capability | Solving certain grade-school level math problems with novel reasoning |
| Speculated technique | Hybrid of Q-learning, A* search, process reward models, and tree-style inference search |
| Public successor | o1-preview (Sep 12, 2024); full o1 (Dec 5, 2024); o3 (Apr 16, 2025) |
| Status | Officially never confirmed by OpenAI as a standalone product; widely treated as the research lineage behind the o-series |
On November 22, 2023, five days after Sam Altman was abruptly fired by the OpenAI board, Reuters published an exclusive citing two unnamed people familiar with the matter. They reported that several staff researchers had sent the board a letter warning of a powerful AI discovery that they believed could threaten humanity, and that the letter, together with concerns about the pace of commercialization, was one of the catalysts for the board's decision to remove Altman. The letter referenced an internal model named Q*.[1]
The Reuters story said that, given vast computing resources, Q* had been able to solve certain math problems. The performance level was modest in absolute terms (the problems were described as roughly grade-school in difficulty), but one of the sources said researchers were excited because the system was reasoning through problems rather than recalling answers, and because performance was scaling with compute.[1][2] Fortune followed up with additional sourcing, reporting that CTO Mira Murati had acknowledged the existence of the project and the letter in an internal note to staff, while declining to confirm details about its capabilities.[6]
Reuters made clear it had not seen a copy of the letter. OpenAI did not publicly comment on Q* at the time. In a subsequent interview with The Verge, Sam Altman addressed the topic with deliberate vagueness, calling the leak "unfortunate" and saying he had no particular comment, while reiterating that OpenAI's research progress had been consistently rapid.[7]
| Reported by Reuters / Fortune | Speculation by outside researchers |
|---|---|
| A letter from staff researchers existed | The letter explicitly named the project as a path to AGI |
| Q* solved certain math problems | The math involved Olympiad-level proofs |
| Mira Murati acknowledged the project internally | OpenAI confirmed safety risks publicly |
| Capability scaled with compute | The model could rewrite its own code or self-improve |
The asterisk in Q* deliberately evokes A*, the classic heuristic search algorithm introduced by Peter Hart, Nils Nilsson and Bertram Raphael in 1968 at SRI. A* searches a graph of possible states by combining the actual cost so far with a heuristic estimate of the remaining cost, and it is one of the foundational algorithms in any artificial intelligence curriculum.[8]
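The cost-plus-heuristic idea behind A* can be made concrete with a short sketch. The code below is an illustrative implementation on a small grid, not anything attributed to OpenAI; the grid layout, the `WALLS` set, and the helper names `grid_neighbors` and `manhattan` are assumptions chosen for the example.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return a cheapest path from start to goal, or None if unreachable."""
    # Each frontier entry is (f, g, node, path) with f = g + h:
    # g is the actual cost so far, h the heuristic estimate of remaining cost.
    frontier = [(heuristic(start, goal), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was already found
        for nxt, step_cost in neighbors(node):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                new_f = new_g + heuristic(nxt, goal)
                heapq.heappush(frontier, (new_f, new_g, nxt, path + [nxt]))
    return None

# Hypothetical 4x4 grid with a wall that forces a detour.
SIZE = 4
WALLS = {(1, 0), (1, 1), (1, 2)}

def grid_neighbors(cell):
    x, y = cell
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < SIZE and 0 <= ny < SIZE and (nx, ny) not in WALLS:
            yield (nx, ny), 1  # every move costs 1

def manhattan(a, b):
    # Never overestimates the true cost on a unit grid, so A* stays optimal.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

path = a_star((0, 0), (3, 0), grid_neighbors, manhattan)
```

In the speculated Q* setting, the "nodes" would be partial reasoning traces rather than grid cells, and the heuristic would be learned rather than hand-written.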
The Q comes from Q-learning, a value-based reinforcement learning method introduced by Christopher Watkins in his 1989 Cambridge PhD thesis. Q-learning estimates a function Q(state, action) that captures the expected long-term reward of taking a particular action in a particular state. It is the same family of techniques that produced DeepMind's Deep Q-Network (DQN), which learned to play Atari games at superhuman level in 2013, and that contributed to the value head used in AlphaGo.[9]
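As a rough illustration of what estimating Q(state, action) looks like in practice, here is a minimal tabular Q-learning sketch. The five-state chain environment, the reward of 1 at the right-hand end, and the constants `GAMMA` and `ALPHA` are all assumptions made for the example; nothing here reflects how any OpenAI system is trained.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA, ALPHA = 0.9, 0.5

def step(state, action):
    """Deterministic toy environment: returns (next_state, reward, done)."""
    nxt = max(state - 1, 0) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so a uniformly random behaviour policy suffices here.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus discounted best next value.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q

q = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy picks "right" in every non-terminal state, and the learned values decay by a factor of `GAMMA` for each extra step away from the reward.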
Q* in classical reinforcement learning refers specifically to the optimal action-value function: the Q-function under the optimal policy. Combining that name with A*-style search produces a fairly obvious hint at the architecture researchers suspected: a learned value function that scores partial reasoning steps, paired with a search procedure that explores a tree of possible next steps, all wrapped around a large language model generator.
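For readers who want the textbook definition, the optimal action-value function can be written as follows. The symbols are standard reinforcement learning notation rather than anything from the leak: $s$ and $a$ are the state and action, $r_t$ the reward at step $t$, $\gamma \in [0, 1)$ a discount factor, and $s'$ the next state.

$$
Q^*(s, a) = \max_{\pi} \; \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} r_t \;\middle|\; s_0 = s,\; a_0 = a,\; \pi \right]
$$

It satisfies the Bellman optimality equation:

$$
Q^*(s, a) = \mathbb{E}\left[ r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]
$$

Acting greedily with respect to $Q^*$, that is, choosing $\arg\max_a Q^*(s, a)$ in every state, gives an optimal policy.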
Three threads of academic work made the speculation about Q* feel grounded rather than fanciful.
In May 2023, OpenAI researchers Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever and Karl Cobbe published "Let's Verify Step by Step" on arXiv (paper id 2305.20050). The paper compared two ways to train reward models on the MATH benchmark: outcome supervision, which only rewards the final answer, and process supervision, which rewards each individual reasoning step. Process supervision substantially outperformed outcome supervision, with the best process reward model (PRM) solving 78% of problems in a representative subset of MATH. The team also released PRM800K, a dataset of 800,000 step-level human correctness labels.[10]
The paper is the closest publicly available analogue of what Q* was rumored to do. It establishes that step-level reward signals, applied across long chains of reasoning, are a tractable way to train language models to do mathematics.
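The difference between the two kinds of supervision can be sketched in a few lines. The example below assumes that a solution's process score is the product of per-step correctness probabilities, which is one natural way to compute "the probability that every step is correct"; the candidate solutions and their step probabilities are made-up numbers, and `outcome_score`, `process_score`, and `best_of_n` are hypothetical helper names.

```python
from math import prod

def outcome_score(final_answer, reference):
    """Outcome supervision: a single signal for the whole solution."""
    return 1.0 if final_answer == reference else 0.0

def process_score(step_probs):
    """Process supervision: probability that every step is correct,
    assumed here to be the product of per-step probabilities."""
    return prod(step_probs)

def best_of_n(candidates):
    """Pick the candidate whose reasoning the step-level scorer trusts most."""
    return max(candidates, key=lambda c: process_score(c["step_probs"]))

# Two hypothetical solutions that reach the same final answer.
candidates = [
    {"id": "A", "answer": "42", "step_probs": [0.9, 0.2, 0.9]},  # one dubious step
    {"id": "B", "answer": "42", "step_probs": [0.9, 0.8, 0.9]},  # sound throughout
]

outcome = [outcome_score(c["answer"], "42") for c in candidates]
chosen = best_of_n(candidates)
```

Both candidates get the same outcome score, so outcome supervision cannot tell them apart, while the step-level score prefers the solution without the dubious step.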
In March 2022, Eric Zelikman, Yuhuai Wu, Jesse Mu and Noah Goodman of Stanford published "STaR: Bootstrapping Reasoning with Reasoning" (arXiv 2203.14465). STaR uses a small number of seed examples with worked-out reasoning to generate rationales for a much larger set of problems, fine-tunes the model on the rationales that produced correct answers, and iterates. A key innovation, called rationalization, lets the model generate post-hoc rationales for problems it initially got wrong, given the correct answer. STaR achieved performance close to a 30 times larger model on CommonsenseQA and showed that a self-improving loop on reasoning was practically achievable.[11]
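The shape of the STaR loop (generate, filter by correctness, rationalize failures with the answer as a hint, fine-tune, repeat) can be shown with a deliberately artificial sketch. There is no language model here: `toy_generate` and `toy_fine_tune` are hypothetical stand-ins, and the integer `skill` is an assumed proxy for model ability, so the example illustrates only the control flow.

```python
def toy_generate(skill, problem, hint=None):
    """Hypothetical stand-in for sampling a rationale and answer from a model."""
    difficulty, gold = problem
    # Assumption: seeing the correct answer makes slightly harder problems solvable.
    reach = skill + 1 if hint is not None else skill
    if difficulty <= reach:
        return f"worked steps for difficulty {difficulty}", gold
    return "flawed steps", None

def toy_fine_tune(skill, kept):
    """Hypothetical stand-in for fine-tuning: skill rises to the hardest kept example."""
    return max([skill] + [difficulty for (difficulty, _), _ in kept])

def star(problems, skill=1, rounds=3):
    for _ in range(rounds):
        kept = []
        for problem in problems:
            _, gold = problem
            rationale, answer = toy_generate(skill, problem)
            if answer != gold:
                # Rationalization: retry with the correct answer supplied as a hint.
                rationale, answer = toy_generate(skill, problem, hint=gold)
            if answer == gold:
                kept.append((problem, rationale))  # keep only rationales that led to the right answer
        skill = toy_fine_tune(skill, kept)
    return skill

problems = [(d, f"answer-{d}") for d in range(1, 6)]
final_skill = star(problems)
```

In this toy setup each round lets the model reach one difficulty level further, which mirrors the bootstrapping idea: rationales recovered with a hint become training data for solving the same problems unaided.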
In May 2023, the same month as the process supervision paper, Princeton and Google DeepMind researchers published "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., arXiv 2305.10601). Tree of Thoughts (ToT) lets a model expand multiple reasoning branches, evaluate intermediate states, and backtrack, in the spirit of classic AI search. On the Game of 24 task, ToT raised GPT-4's success rate from around 4% with chain-of-thought prompting to 74%.[12]
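The expand, evaluate, and backtrack pattern can be sketched on the Game of 24 itself. In the paper a language model both proposes next steps and rates intermediate states; in the sketch below those roles are played by the hypothetical functions `propose` and `evaluate`, and the evaluator is trivial (it only judges finished states), so the search is effectively exhaustive rather than guided.

```python
from fractions import Fraction
from itertools import combinations

def propose(state):
    """Expand a state: combine any two remaining numbers with one operation."""
    for i, j in combinations(range(len(state)), 2):
        (a, ea), (b, eb) = state[i], state[j]
        rest = [state[k] for k in range(len(state)) if k not in (i, j)]
        options = [(a + b, f"({ea}+{eb})"), (a * b, f"({ea}*{eb})"),
                   (a - b, f"({ea}-{eb})"), (b - a, f"({eb}-{ea})")]
        if b != 0:
            options.append((a / b, f"({ea}/{eb})"))
        if a != 0:
            options.append((b / a, f"({eb}/{ea})"))
        for value, expr in options:
            yield rest + [(value, expr)]

def evaluate(state):
    """Toy stand-in for a model's judgement of a state: 'solved', 'dead', or 'maybe'."""
    if len(state) == 1:
        return "solved" if state[0][0] == 24 else "dead"
    return "maybe"

def solve(state):
    verdict = evaluate(state)
    if verdict == "solved":
        return state[0][1]
    if verdict == "dead":
        return None  # abandon this branch and backtrack
    for child in propose(state):
        found = solve(child)
        if found is not None:
            return found
    return None

def game_of_24(numbers):
    # Fractions keep division exact, so 24 is matched without rounding error.
    return solve([(Fraction(n), str(n)) for n in numbers])

expression = game_of_24([4, 9, 10, 13])
```

A stronger evaluator that marks unpromising intermediate states as dead would prune most of the tree early, which is where a learned judge earns its keep.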
Taken together, these three lines of work describe almost exactly the recipe that outside observers guessed Q* was following: generate candidate reasoning steps, score each step with a learned verifier, search over branches, and use the resulting traces to fine-tune the generator with reinforcement learning.
Meta's chief AI scientist Yann LeCun was one of the few senior researchers to publicly comment on Q* during the November 2023 storm. Writing on X (formerly Twitter), he urged people to ignore the wave of speculation and noted that essentially every top lab was working on combining language models with planning. He singled out OpenAI's hiring of Noam Brown as the most concrete evidence of where the work was heading.[13]
Noam Brown joined OpenAI in mid-2023 after years at Carnegie Mellon and Meta's FAIR lab, where he co-built Libratus and Pluribus (which beat top professional poker players) and Cicero (which played Diplomacy at human level by combining a language model with strategic search). On joining OpenAI, Brown said publicly that he wanted to make these planning and self-play methods truly general, with the goal of producing language models far more capable than GPT-4 at reasoning. The combination of his hire and the Lightman process supervision paper made it reasonable to assume OpenAI was working on something resembling AlphaGo-style search bolted onto a language model.[13][14]
Q* never shipped as a product. The next time the same internal research thread surfaced was in mid-2024, under a different codename.
In July 2024, Reuters and Bloomberg reported that OpenAI was internally testing a model under the codename Strawberry. According to the Reuters report, an internal document described Strawberry as a project aimed at letting OpenAI's models plan ahead, navigate the internet autonomously, and perform what the company called "deep research." The Information added that Strawberry was a successor to Q* and was being trained with reinforcement learning to follow long chains of reasoning.[3][15]
On September 12, 2024, OpenAI publicly released o1-preview and o1-mini to ChatGPT Plus and Team users. The accompanying blog post described a model trained with reinforcement learning to perform a long internal chain of thought before answering, with accuracy improving roughly linearly in the logarithm of inference-time compute. The full o1 model followed on December 5, 2024, alongside the launch of ChatGPT Pro at $200 per month.[4]
| Model | Public release | Notable detail |
|---|---|---|
| o1-preview | September 12, 2024 | First public OpenAI reasoning model; 83% on AIME 2024 vs 13% for GPT-4o |
| o1-mini | September 12, 2024 | About 80% cheaper than o1-preview; tuned for STEM and code |
| o1 (full) | December 5, 2024 | Released with ChatGPT Pro; image input |
| o1-pro | March 2025 (API) | $150 / $600 per million input/output tokens |
| o3-mini | January 31, 2025 | Three reasoning effort levels |
| o3 | April 16, 2025 | Roughly 3x o1 accuracy on ARC-AGI; 71.7% on SWE-bench Verified |
| o4-mini | April 16, 2025 | Multimodal reasoning, tool use, image generation |
| o3-pro | June 10, 2025 | Extended deliberation tier for o3 |
The shared design pattern across this family, sometimes called inference-time scaling or test-time compute, is the public-facing answer to the question of what Q* really was: a research direction, not a single model.[5]
When the leak hit, the AI research community produced a flurry of blog posts trying to reconstruct what Q* might be doing under the hood. None of these were confirmed, but they share a remarkably consistent picture and they line up well with what OpenAI later described publicly for o1.
At the heart of the speculation is the idea that a language model, instead of producing one linear answer, generates a tree of possible reasoning paths. Each branch is expanded into one or more next steps, and partial branches can be pruned or extended depending on how promising they look. A separately trained process reward model gives a score to each step rather than to the entire answer, exactly as Lightman et al. demonstrated on MATH in 2023. A PRM can be used in three closely related ways: as a verifier for best-of-N sampling, as a guide for tree search, and as a reward signal for reinforcement learning fine-tuning of the generator.
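The second of those uses, a step scorer guiding a search over partial reasoning paths, can be sketched as a small beam search. Everything here is a toy: the "reasoning steps" are arithmetic moves, and `toy_prm` is a hypothetical hand-written scorer standing in for a trained process reward model. It is meant to show the search pattern outside observers speculated about, not a confirmed OpenAI design.

```python
# Toy "reasoning steps": each one transforms a running value.
STEPS = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
    "-1": lambda x: x - 1,
}

def toy_prm(value, target):
    """Hypothetical step scorer in (0, 1]: higher when the running value is nearer the target."""
    return 1.0 / (1.0 + abs(target - value))

def beam_search(start, target, beam_width=2, max_depth=6):
    """Keep only the best-scoring partial traces at each depth."""
    beam = [(toy_prm(start, target), start, [])]  # (score, value, steps so far)
    for _ in range(max_depth):
        candidates = []
        for _, value, steps in beam:
            if value == target:
                return steps
            for name, fn in STEPS.items():
                new_value = fn(value)
                candidates.append((toy_prm(new_value, target), new_value, steps + [name]))
        # Prune: discard every branch outside the top beam_width.
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    for _, value, steps in beam:
        if value == target:
            return steps
    return None

steps = beam_search(start=1, target=10)
```

With a beam width of 2, the search finds `["+3", "+3", "+3"]` while expanding only a handful of the possible branches; widening the beam trades more compute for a lower chance of pruning the right path.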
A tree search guided by a PRM produces a large number of reasoning traces labeled with step-level scores. Those traces can be filtered down to high-quality solutions and used to fine-tune the generator with offline reinforcement learning, similar in spirit to STaR but with a much richer reward signal. This loop, where the model produces its own training data and a verifier curates it, is sometimes called self-play for reasoning, and it is the closest thing in modern machine learning to what AlphaGo did for the game of Go.
Many researchers, including Tim Lee at Understanding AI and Nathan Lambert in his Interconnects newsletter, framed Q* as an attempt to bring an AlphaGo-style architecture to a language model: a generator policy that proposes moves (reasoning steps), a value network that scores positions (the PRM), and a search procedure (something like Monte Carlo tree search or beam search) that combines the two.[16]
OpenAI has never publicly described a product or paper named "Q*." The company has, however, said quite a lot about the o-series that is consistent with the rumored Q* architecture, and almost nothing that contradicts it.[4][5]
| Claim | Confirmed by OpenAI |
|---|---|
| A model called Q* exists | Indirectly, via Mira Murati's internal note acknowledging media reports |
| Q* solved grade-school math problems | No public confirmation |
| The o-series uses RL over a hidden chain of thought | Yes; stated in the o1 release post and system card |
| Inference-time compute scaling is a deliberate goal | Yes; o1 documentation describes log-linear accuracy scaling with thinking time |
| Process reward models or tree search are used | Not confirmed by name; OpenAI keeps the specific algorithm proprietary |
| Q* is the same project as Strawberry / o1 | Not confirmed officially, but reported by Reuters, Bloomberg and The Information using internal sources |
The Q* leak landed in the middle of one of the most chaotic weeks in OpenAI's history. Sam Altman was removed by the board on November 17, 2023, with the board citing a loss of confidence in his leadership. Within hours, Greg Brockman resigned in protest, hundreds of OpenAI employees signed a letter threatening to leave for Microsoft unless Altman was reinstated, and on November 21, 2023, Altman returned with a new initial board. The Reuters story on Q* appeared the following day.[1][6]
The framing of the staff researchers' letter, that the new capability could threaten humanity, became one of the public anchors for the safety side of OpenAI's internal debate. Whether Q*'s actual capabilities deserved that framing is contested. Gary Marcus, in a Substack post titled "About that OpenAI 'breakthrough,'" argued that the entire episode had been overhyped and that solving grade-school math problems is a long way from any plausible AGI threshold. MIT Technology Review's Will Douglas Heaven made a similar point, quoting Wenda Li of Edinburgh and Katie Collins of Cambridge to argue that current systems still lack the architecture needed to reason about mathematics in any robust way.[17][18]
In hindsight, Q* is best read as the public's first glimpse of a research bet that paid off. The lineage runs cleanly from process supervision in May 2023, through the November 2023 leak, through the Strawberry codename in mid-2024, into o1 in September 2024 and o3 in April 2025. As far as public information goes, every model in that chain has the same basic shape: a language model trained with reinforcement learning to think for a long time before answering and rewarded for getting the final answer right. Outside observers widely assume that a verifier also scores each step, although OpenAI has not confirmed that detail.[4][5][10]
The practical impact has been concrete. The o-series substantially raised the bar on math benchmarks (o1-preview hit 83% on AIME 2024 vs 13% for GPT-4o), on competitive programming (o3 reached an Elo of 2727 on Codeforces vs 1891 for o1), and on agentic software tasks (o3 scored 71.7% on SWE-bench Verified vs 48.9% for o1). The same techniques have spread industry-wide, with Anthropic, Google DeepMind, DeepSeek, Alibaba and others all releasing reasoning-tuned variants of their flagship models within a year of o1.[5]
Whether or not Q* ever existed as a discrete model, the name has become a kind of shorthand for the moment when the AI field stopped trying to scale only the size of transformer models and started seriously scaling how long they think.