Q* OpenAI
Last reviewed
Sources
18 citations
Review status
Source-backed
Revision
v4 · 3,671 words
Q* (pronounced Q-Star) is the reported, never-officially-detailed OpenAI research project that surfaced in news reports during the November 2023 leadership crisis, when Reuters reported that several staff researchers had warned the board about a powerful AI advance. According to that reporting, Q* had, given vast computing resources, solved certain grade-school-level math problems, a result some inside the company viewed as a possible step toward artificial general intelligence (AGI). OpenAI has never published a product or paper named Q*, so the specific math-solving and architecture claims should be treated as reported or rumored rather than confirmed. By mid-2024 reporters tied the same internal research thread to the codename Strawberry, and on September 12, 2024 OpenAI publicly launched its first reasoning model line as o1-preview, which Reuters, Bloomberg and The Information (citing internal sources) described as the production descendant of that research.[1][2][3]
The story of Q* matters less because of any single technical artifact and more because it marks the moment when reinforcement-learning-based reasoning, inference-time search, and process supervision became the dominant frontier in large language model research. The o1, o3, and o4-mini families that followed all share the broad design philosophy that the Q* leak hinted at: spend more compute at inference time on a hidden chain of thought, and train the model with reinforcement learning on verifiable problems. Outside observers add a third ingredient, a step-level reward signal rather than a single end-of-answer score. OpenAI has confirmed reinforcement learning over a hidden chain of thought and inference-time scaling for the o-series, but it has not confirmed the use of step-level rewards, and it has not confirmed any of this for Q* by name.[4][5]
| Item | Detail |
|---|---|
| Codename | Q* (pronounced Q-Star), reportedly succeeded by Strawberry |
| Organization | OpenAI |
| First public reporting | Reuters, November 22, 2023 |
| Trigger event | Reported letter from staff researchers to the board, days before Sam Altman's November 17, 2023 firing |
| Reported capability | Solving certain grade-school level math problems with novel reasoning (reported, unverified) |
| Speculated technique | Hybrid of Q-learning, A* search, process reward models, and tree-style inference search (speculation, not confirmed) |
| Reported successor | o1-preview (Sep 12, 2024); full o1 (Dec 5, 2024); o3 (Apr 16, 2025) |
| Status | Never confirmed by OpenAI as a standalone product; widely reported as the research lineage behind the o-series |
Q* is best understood not as a confirmed model but as a reported internal research effort whose name became public during OpenAI's November 2023 turmoil. The Reuters report described it as a project that some staff believed could be a breakthrough in the search for superintelligence, while cautioning that the news agency could not independently verify the capabilities the researchers claimed.[1][2] OpenAI itself has never released documentation, benchmarks, or a paper under the name Q*, which is why the math-solving result and any description of its architecture remain in the category of reported or speculated rather than established fact.
What is documented is the lineage framing that grew up around the name. Reporters at Reuters, Bloomberg and The Information later connected the same internal reasoning research to the Strawberry codename and then to the publicly released o1 reasoning model, and English Wikipedia summarizes the consensus reading as: o1 "was formerly known within OpenAI as Q*, and later as Strawberry."[3] OpenAI has neither confirmed nor denied that specific chain of codenames.
On November 22, 2023, five days after Sam Altman was abruptly removed by the OpenAI board, Reuters published an exclusive citing two unnamed people familiar with the matter. The sources said that several staff researchers had sent the board a letter warning of a powerful AI discovery that they believed could threaten humanity, and that the letter, alongside concerns about the pace of commercialization, was one of the catalysts for the board's decision to remove Altman. The letter referenced an internal model named Q*.[1] Fortune, summarizing the same Reuters reporting, wrote that "several staff researchers wrote a letter to the organization's board warning of a discovery that could potentially threaten the human race."[6]
The Reuters story said that, given vast computing resources, Q* had been able to solve certain math problems. The performance level was modest in absolute terms, with Reuters describing the system as "performing math on the level of grade-school students," but the source said researchers were excited because the system appeared to be reasoning through problems rather than recalling answers, and because performance was scaling with compute.[1][2] Crucially, Reuters stated that it could not independently verify the capabilities of Q* claimed by the researchers, a caveat that is often dropped when the story is retold.[1]
Reuters also reported that, after it contacted OpenAI, chief technology officer Mira Murati acknowledged in an internal memo to employees the existence of the Q* project as well as the letter sent to the board, while not confirming the accuracy of the media reports about its capabilities.[6] In a subsequent interview with The Verge, Altman addressed the topic with deliberate vagueness, calling the leak "unfortunate" and saying he had no particular comment, while reiterating that OpenAI's research progress had been consistently rapid.[7]
| Reported by Reuters / Fortune (sourced but unverified) | Speculation by outside commentators (not reported as fact) |
|---|---|
| A letter from staff researchers existed | The letter explicitly named the project as a path to AGI |
| Q* solved certain grade-school math problems | The math involved Olympiad-level proofs |
| Mira Murati acknowledged the project internally | OpenAI confirmed safety risks publicly |
| Capability reportedly scaled with compute | The model could rewrite its own code or self-improve |
The name itself is widely read as a hint, though OpenAI never explained it. The asterisk in Q* evokes A*, the classic heuristic search algorithm introduced by Peter Hart, Nils Nilsson and Bertram Raphael in 1968 at SRI. A* searches a graph of possible states by combining the actual cost so far with a heuristic estimate of the remaining cost, and it is one of the foundational algorithms in any artificial intelligence curriculum.[8]
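As a concrete reference point for the classic algorithm only (not for anything OpenAI has described), here is a minimal sketch of A* on a small grid. The grid size, the wall positions, and the helper names `grid_neighbors` and `manhattan` are illustrative assumptions.

```python
import heapq

def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for the cheapest path found, or None if unreachable."""
    # Frontier entries are ordered by f = g (cost so far) + h (estimated cost to go).
    frontier = [(heuristic(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was already found
        for nxt, step_cost in neighbors(node):
            new_g = g + step_cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(
                    frontier, (new_g + heuristic(nxt), new_g, nxt, path + [nxt])
                )
    return None

# Hypothetical 4x4 grid with two blocked cells; every move costs 1.
def grid_neighbors(node, size=4, walls=frozenset({(1, 1), (1, 2)})):
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < size and 0 <= ny < size and (nx, ny) not in walls:
            yield (nx, ny), 1

GOAL = (3, 3)

def manhattan(node):
    # Admissible heuristic on a unit-cost grid: never overestimates the remaining cost.
    return abs(node[0] - GOAL[0]) + abs(node[1] - GOAL[1])

cost, path = a_star((0, 0), GOAL, grid_neighbors, manhattan)
```

Because the heuristic never overestimates, the first time the goal is popped from the queue its cost is optimal, which is the property that makes A* attractive as a template for guided search.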
The Q comes from Q-learning, a value-based reinforcement learning method introduced by Christopher Watkins in his 1989 Cambridge PhD thesis. Q-learning estimates a function Q(state, action) that captures the expected long-term reward of taking a particular action in a particular state. It is the technique behind DeepMind's Deep Q-Network (DQN), first shown learning Atari games directly from pixels in 2013 and reported at human-level performance across dozens of games in 2015. The value network in AlphaGo is a related idea, a learned estimate of how promising a position is, although it was not itself trained with Q-learning.[9]
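To make the Q(state, action) idea concrete, the following is a minimal tabular Q-learning sketch on a toy five-state corridor. The environment, the reward of 1.0 at the right-hand end, and the hyperparameters are all assumptions chosen for illustration; nothing here reflects a reported detail of Q*.

```python
import random

N_STATES, GOAL = 5, 4   # states 0..4; reaching state 4 ends the episode
ACTIONS = (0, 1)        # 0 = move left, 1 = move right

def step(state, action):
    """Toy deterministic corridor: reward 1.0 only on arriving at GOAL."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so a uniformly random behaviour policy suffices here.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus discounted best next value.
            target = reward if done else reward + gamma * max(Q[nxt])
            Q[state][action] += alpha * (target - Q[state][action])
            state = nxt
    return Q

Q = q_learning()
# Greedy policy read off the learned table for each non-terminal state.
greedy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(GOAL)]
```

In this toy setting the learned table favours moving right in every state, even though the data came from random behaviour. DQN replaces the table with a neural network but keeps the same bootstrapped target.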
Q* in classical reinforcement learning refers specifically to the optimal action-value function: the Q-function under the optimal policy. Combining that name with A*-style search produces a fairly obvious hint at the architecture researchers suspected: a learned value function that scores partial reasoning steps, paired with a search procedure that explores a tree of possible next steps, all wrapped around a large language model generator. This reading is inference, not an OpenAI statement.
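In standard textbook notation (not taken from any OpenAI document), the optimal action-value function satisfies the Bellman optimality equation, and the optimal policy acts greedily with respect to it. Here $s'$ is the next state, $r$ the immediate reward, and $\gamma \in [0, 1)$ a discount factor:

```latex
Q^{*}(s, a) \;=\; \mathbb{E}\!\left[\, r + \gamma \max_{a'} Q^{*}(s', a') \;\middle|\; s, a \,\right],
\qquad
\pi^{*}(s) \;=\; \arg\max_{a} Q^{*}(s, a)
```

Read against the speculation above, "state" would be a partial chain of reasoning and "action" the next reasoning step, but that mapping is the commentators' inference rather than something OpenAI has stated.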
Three threads of academic work made the speculation about Q* feel grounded rather than fanciful.
In May 2023, OpenAI researchers Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever and Karl Cobbe published "Let's Verify Step by Step" on arXiv (paper id 2305.20050). The paper compared two ways to train reward models on the MATH benchmark: outcome supervision, which only rewards the final answer, and process supervision, which rewards each individual reasoning step. Process supervision substantially outperformed outcome supervision, with the best process reward model (PRM) solving 78% of problems in a representative subset of MATH. The team also released PRM800K, a dataset of 800,000 step-level human correctness labels.[10]
The paper is the closest publicly available analogue of what Q* was rumored to do. It establishes that step-level reward signals, applied across long chains of reasoning, are a tractable way to train language models to do mathematics.
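The difference between the two supervision signals can be shown with a toy example. The sketch below assumes a solution is a list of steps, each with a correctness probability from a hypothetical PRM, and scores the whole solution as the product of those probabilities, which is how the paper describes aggregating step scores. The step texts and probabilities are invented for illustration.

```python
from math import prod

def outcome_score(final_answer, reference):
    """Outcome supervision: one signal, based only on the final answer."""
    return 1.0 if final_answer == reference else 0.0

def process_score(step_probs):
    """Process supervision: probability that every step is correct (product of step scores)."""
    return prod(step_probs)

# Hypothetical solution that lands on the right answer through a flawed middle step.
solution = {
    "steps": ["expand the bracket", "cancel terms (invalid)", "read off x = 12"],
    "step_probs": [0.95, 0.10, 0.90],  # made-up PRM outputs
    "answer": "12",
}

orm = outcome_score(solution["answer"], reference="12")
prm = process_score(solution["step_probs"])
```

The outcome signal rewards this solution fully, while the process signal penalizes it heavily for the bad step, which is the kind of "right answer, wrong reasoning" case that step-level labels are meant to catch.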
In March 2022, Eric Zelikman, Yuhuai Wu, Jesse Mu and Noah Goodman, of Stanford and Google Research, published "STaR: Bootstrapping Reasoning with Reasoning" (arXiv 2203.14465). STaR uses a small number of seed examples with worked-out reasoning to generate rationales for a much larger set of problems, fine-tunes the model on the rationales that produced correct answers, and iterates. A key innovation, called rationalization, lets the model generate post-hoc rationales for problems it initially got wrong, given the correct answer. STaR reached performance comparable to a fine-tuned model 30 times larger on CommonsenseQA and showed that a self-improving loop on reasoning was practically achievable.[11]
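One round of the loop described above can be sketched as follows. The `generate` and `finetune` callables are hypothetical placeholders for a real language model and training step; the toy versions here only memorize answers so that the control flow can be run end to end.

```python
def star_round(problems, generate, finetune, model):
    """One STaR-style iteration: sample rationales, keep the correct ones, fine-tune."""
    kept = []
    for question, answer in problems:
        rationale, predicted = generate(model, question, hint=None)
        if predicted != answer:
            # Rationalization: retry with the correct answer supplied as a hint.
            rationale, predicted = generate(model, question, hint=answer)
        if predicted == answer:
            kept.append((question, rationale, answer))
    return finetune(model, kept), kept

# Toy stand-ins (assumptions for illustration, not a real model API).
def toy_generate(model, question, hint=None):
    answer = hint if hint is not None else model.get(question, "unknown")
    return f"reasoning that ends in {answer}", answer

def toy_finetune(model, examples):
    updated = dict(model)
    updated.update({question: answer for question, _, answer in examples})
    return updated

problems = [("2+2", "4"), ("3*3", "9")]
model = {"2+2": "4"}  # the toy model initially knows only one answer
model, kept = star_round(problems, toy_generate, toy_finetune, model)
```

The design point is the filter: only rationales that end in the correct answer become training data, so the final-answer check acts as a cheap verifier for the whole chain.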
Also in May 2023, Princeton and Google DeepMind researchers published "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., arXiv 2305.10601). Tree of Thoughts (ToT) lets a model expand multiple reasoning branches, evaluate intermediate states, and backtrack, in the spirit of classic AI search. On the Game of 24 task, ToT raised GPT-4 success from around 4% with chain-of-thought prompting to 74%.[12]
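The expand, evaluate, and prune cycle can be sketched as a small breadth-first search with a fixed beam. In the paper both the proposer and the evaluator are language-model calls; here they are plain functions on integers, and the toy task (reach 14 from 1 using "+3" or "×2") is an assumption made up for illustration.

```python
def tree_of_thoughts(root, propose, evaluate, is_solution, beam_width=2, max_depth=5):
    """Breadth-first search over partial 'thoughts', keeping the best few per level."""
    frontier = [[root]]
    for _ in range(max_depth):
        # Expand every surviving path by each proposed next state.
        candidates = [path + [nxt] for path in frontier for nxt in propose(path[-1])]
        for path in candidates:
            if is_solution(path[-1]):
                return path
        # Keep only the most promising branches; the rest are pruned.
        candidates.sort(key=lambda p: evaluate(p[-1]), reverse=True)
        frontier = candidates[:beam_width]
    return None

TARGET = 14
path = tree_of_thoughts(
    root=1,
    propose=lambda n: [n + 3, n * 2],       # candidate next steps
    evaluate=lambda n: -abs(TARGET - n),    # heuristic: closer to the target is better
    is_solution=lambda n: n == TARGET,
)
```

Keeping more than one branch per level matters in this example: the branch that scores best at depth two (8) cannot reach the target in one more step, while the runner-up (7) can.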
Taken together, these three lines of work describe almost exactly the recipe that outside observers guessed Q* was following: generate candidate reasoning steps, score each step with a learned verifier, search over branches, and use the resulting traces to fine-tune the generator with reinforcement learning. None of this was confirmed to be Q*, but it made the speculation technically plausible.
Meta's chief AI scientist Yann LeCun was one of the few senior researchers to publicly comment on Q* during the November 2023 storm. Writing on X (formerly Twitter), he urged people to ignore the wave of speculation and noted that essentially every top lab was working on combining language models with planning. He singled out OpenAI's hiring of Noam Brown as the most concrete evidence of where the work was heading.[13]
Noam Brown joined OpenAI in mid-2023 after years at Carnegie Mellon and Meta's FAIR lab, where he co-built Libratus and Pluribus (which beat top professional poker players) and Cicero (which played Diplomacy at human level by combining a language model with strategic search). On joining OpenAI on July 6, 2023, Brown wrote that his work on reasoning and self-play in poker and Diplomacy had motivated him to "make these methods truly general," adding that if the research succeeded "we may one day see LLMs that are 1,000x better than GPT-4." The combination of his hire and the Lightman process supervision paper made it reasonable to assume OpenAI was working on something resembling AlphaGo-style search bolted onto a language model, though Brown was describing his research goals, not Q* specifically.[13][14]
Q* never shipped as a product under that name. The next time the same internal research thread surfaced in reporting was in mid-2024, under a different codename.
In July 2024, Reuters and Bloomberg reported that OpenAI was internally testing a model under the codename Strawberry. According to the Reuters report, an internal document described Strawberry as a project aimed at letting OpenAI's models plan ahead, navigate the internet autonomously, and perform what the company called "deep research." The Information added that Strawberry was a successor to Q* and was being trained with reinforcement learning to follow long chains of reasoning.[3][15]
On September 12, 2024, OpenAI publicly released o1-preview and o1-mini to ChatGPT Plus and Team users. The accompanying blog post described a model trained with reinforcement learning to perform a long internal chain of thought before answering, with performance scaling logarithmically with the amount of inference-time compute. The full o1 model followed on December 5, 2024, alongside the launch of ChatGPT Pro at $200 per month.[4] OpenAI did not use the name Q* in any of this documentation; the Q*-to-Strawberry-to-o1 chain is a media reconstruction built from internal sources, not an official OpenAI statement.
| Model | Public release | Notable detail |
|---|---|---|
| o1-preview | September 12, 2024 | First public OpenAI reasoning model; launch materials cited 83% on AIME 2024 (o1, consensus over 64 samples) vs 13% for GPT-4o |
| o1-mini | September 12, 2024 | About 80% cheaper than o1-preview; tuned for STEM and code |
| o1 (full) | December 5, 2024 | Released with ChatGPT Pro; image input |
| o1-pro | March 2025 (API) | $150 / $600 per million input/output tokens |
| o3-mini | January 31, 2025 | Three reasoning effort levels |
| o3 | April 16, 2025 | Roughly 3x o1 accuracy on ARC-AGI; 71.7% on SWE-bench Verified (figures from the December 2024 preview) |
| o4-mini | April 16, 2025 | Multimodal reasoning, tool use, image generation |
| o3-pro | June 10, 2025 | Extended deliberation tier for o3 |
The shared design pattern across this family, sometimes called inference-time scaling or test-time compute, is the public-facing answer to the question of what Q* really pointed at: a research direction, not a single confirmed model.[5]
When the leak hit, the AI research community produced a flurry of blog posts trying to reconstruct what Q* might be doing under the hood. None of these were confirmed by OpenAI, but they share a remarkably consistent picture and they line up well with what OpenAI later described publicly for o1.
At the heart of the speculation is the idea that a language model, instead of producing one linear answer, generates a tree of possible reasoning paths. Each branch is expanded into one or more next steps, and partial branches can be pruned or extended depending on how promising they look. A separately trained process reward model gives a score to each step rather than to the entire answer, exactly as Lightman et al. demonstrated on MATH in 2023. A PRM can be used in three closely related ways: as a verifier for best-of-N sampling, as a guide for tree search, and as a reward signal for reinforcement learning fine-tuning of the generator.
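The first of those three uses, best-of-N sampling with a PRM as verifier, is simple enough to sketch. The `toy_prm` below is a hand-written arithmetic checker standing in for a learned model, and the candidate solutions are invented; a real PRM would output a learned probability per step.

```python
from math import prod

def toy_prm(step):
    """Stand-in for a learned PRM: 0.9 if 'a op b = c' is arithmetically true, else 0.1."""
    lhs, rhs = step.split("=")
    a, op, b = lhs.split()
    value = {"+": int(a) + int(b), "*": int(a) * int(b)}[op]
    return 0.9 if value == int(rhs) else 0.1

def best_of_n(candidates, prm):
    """Pick the sampled solution whose steps are jointly most likely to be correct."""
    return max(candidates, key=lambda steps: prod(prm(s) for s in steps))

# Three hypothetical samples for "compute (3 + 4) * 2".
candidates = [
    ["3 + 4 = 7", "7 * 2 = 15"],   # slip in the last step
    ["3 + 4 = 7", "7 * 2 = 14"],   # fully correct
    ["3 + 4 = 8", "8 * 2 = 16"],   # slip in the first step
]
best = best_of_n(candidates, toy_prm)
```

The same per-step scores could instead rank branches during a tree search or serve as rewards during fine-tuning, which is why a single PRM supports all three uses.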
A tree search guided by a PRM produces a large number of reasoning traces labeled with step-level scores. Those traces can be filtered down to high-quality solutions and used to fine-tune the generator with offline reinforcement learning, similar in spirit to STaR but with a much richer reward signal. This loop, where the model produces its own training data and a verifier curates it, is sometimes called self-play for reasoning, and it is the closest thing in modern machine learning to what AlphaGo did for the game of Go.
Many commentators, including Tim Lee at Understanding AI and Nathan Lambert in his Interconnects newsletter, framed Q* as an attempt to bring an AlphaGo-style architecture to a language model: a generator policy that proposes moves (reasoning steps), a value network that scores positions (the PRM), and a search procedure (something like Monte Carlo tree search or beam search) that combines the two. To be explicit: this is informed speculation by outside observers, not a description OpenAI has endorsed.[16]
OpenAI has never publicly described a product or paper named "Q*." The company has, however, said quite a lot about the o-series that is consistent with the rumored Q* architecture, and almost nothing that contradicts it.[4][5]
| Claim | Confirmed by OpenAI |
|---|---|
| A model called Q* exists | Indirectly, via Mira Murati's internal note acknowledging media reports |
| Q* solved grade-school math problems | No public confirmation; Reuters could not independently verify it |
| The o-series uses RL over a hidden chain of thought | Yes; stated in the o1 release post and system card |
| Inference-time compute scaling is a deliberate goal | Yes; o1 documentation describes log-linear accuracy scaling with thinking time |
| Process reward models or tree search are used | Not confirmed by name; OpenAI keeps the specific algorithm proprietary |
| Q* is the same project as Strawberry / o1 | Not confirmed officially; reported by Reuters, Bloomberg and The Information using internal sources |
The Q* leak landed in the middle of one of the most chaotic weeks in OpenAI's history. Sam Altman was removed by the board on November 17, 2023, with the board citing a loss of confidence in his leadership. Within hours, Greg Brockman resigned in protest. Over the following days, hundreds of OpenAI employees signed a letter threatening to leave for Microsoft unless Altman was reinstated, and on November 21, 2023, OpenAI announced that Altman would return with a new initial board. The Reuters story on Q* appeared the following day.[1][6]
The framing of the staff researchers' letter, that the new capability could threaten humanity, became one of the public anchors for the safety side of OpenAI's internal debate. Whether Q*'s actual capabilities deserved that framing is contested. Gary Marcus, in a Substack post titled "About that OpenAI 'breakthrough,'" argued that the entire episode had been overhyped and that solving grade-school math problems is a long way from any plausible AGI threshold. MIT Technology Review's Will Douglas Heaven made a similar point, quoting Wenda Li of Edinburgh and Katie Collins of Cambridge to argue that current systems still lack the architecture needed to reason about mathematics in any robust way.[17][18]
In hindsight, Q* is best read as the public's first glimpse of a research bet that paid off. The reported lineage runs from process supervision in May 2023, through the November 2023 leak, through the Strawberry codename in mid-2024, into o1 in September 2024 and o3 in April 2025. Every model in that public chain has the same basic shape: a language model trained with reinforcement learning to think for a long time before answering. Outside reconstructions add a verifier that scores each step alongside a reward for the final answer, but OpenAI has not confirmed that part of the recipe.[4][5][10]
The practical impact has been concrete. The o-series substantially raised the bar on math benchmarks (OpenAI reported 83% on AIME 2024 for o1 with consensus over 64 samples, vs 13% for GPT-4o), on competitive programming (the o3 version previewed in December 2024 reached an Elo of 2727 on Codeforces vs 1891 for o1), and on agentic software tasks (the same o3 preview scored 71.7% on SWE-bench Verified vs 48.9% for o1). The same techniques have spread industry-wide, with Anthropic, Google DeepMind, DeepSeek, Alibaba and others all releasing reasoning-tuned variants of their flagship models within a year of o1.[5]
Whether or not Q* ever existed as a discrete model, the name has become a kind of shorthand for the moment when the AI field stopped trying to scale only the size of transformer models and started seriously scaling how long they think.
Imagine a very smart student who, instead of blurting out the first answer that comes to mind, quietly works through a problem step by step on scratch paper, tries a few different approaches, checks each step, and only then writes down the final answer. Q* was the rumored OpenAI project about teaching a chatbot to do exactly that, using a separate "grader" that scores each step of the thinking and a search that tries many lines of reasoning. People got excited (and a little scared) in November 2023 because reporters said it could solve simple math problems on its own and might be a step toward much smarter AI. OpenAI never explained Q* directly, but the same idea later became the real, public reasoning models o1 and o3.