Strawberry (OpenAI codename)
Last reviewed
May 20, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,276 words
Strawberry was the internal codename used at [[openai|OpenAI]] for the research program that produced the o-series of reasoning models, most notably [[o1|OpenAI o1]]. The project was the direct successor to an earlier internal effort known as Q* (pronounced "Q-star"), and the existence of the rebranded program became public on July 12, 2024, when Reuters reported on internal OpenAI documents describing a system aimed at large-scale reasoning, autonomous web browsing, and "deep research."[^1] Strawberry shipped under the production name "o1-preview" on September 12, 2024.[^2] As a milestone, the project is widely cited as the moment when OpenAI made test-time compute, rather than pure pretraining scale, the centerpiece of its frontier roadmap.[^3]
| Field | Value |
|---|---|
| Codename | Strawberry (formerly Q*) |
| Owner | [[openai|OpenAI]] |
| First public report | 2024-07-12 (Reuters) |
| Product launch | o1-preview, o1-mini on 2024-09-12 |
| Full release | o1 (GA) and ChatGPT Pro on 2024-12-05 |
| Successor models | o3 (announced 2024-12-20), o4-mini (2025-04-16) |
| Paradigm | Reinforcement learning on chain-of-thought traces; test-time compute |
| Sources | Reuters, Bloomberg, OpenAI launch posts |
The first signal that OpenAI was building a model with novel reasoning capabilities arrived on November 22, 2023, days after the board of [[openai|OpenAI]] had abruptly removed [[sam_altman|Sam Altman]] as CEO. Reuters reported that, prior to Altman's dismissal, several staff researchers had sent a letter to the board warning of a "powerful artificial intelligence discovery" associated with an internal project called Q*.[^4] According to people familiar with the matter, the model had solved certain mathematical problems at roughly the level of grade-school students, and the speed of its progress led some inside the company to interpret it as a step toward general intelligence.[^4] OpenAI's then-CTO, [[mira_murati|Mira Murati]], confirmed the project's existence to staff in an internal memo and indicated that a letter about it had been sent to the board.[^4]
The Q* story circulated alongside, but did not by itself explain, Altman's removal. Reuters described the letter as "one factor among a longer list of grievances" cited by the board, and later reporting attributed the firing to broader concerns about candor and governance rather than the model alone.[^4][^5] Altman returned as CEO on November 22, 2023, after employees signed an open letter demanding his reinstatement, and the board was reconstituted shortly afterward.[^5]
Q* itself did not vanish. It was renamed. By mid-2024, OpenAI staff and internal documentation referred to the same line of work as "Strawberry."[^1] Reporters and analysts later noted that the name was, in part, an inside joke about a familiar failure mode of large language models: when asked how many letter "r"s appear in the word "strawberry," earlier models such as GPT-4 typically miscounted because of how byte-pair tokenization split the word into pieces.[^6][^7] The fact that the new reasoning system could in principle solve this kind of letter-counting task by spelling the word out in its private scratchpad became a small piece of community folklore around the launch.[^6]
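The letter-counting anecdote can be made concrete with a toy example. The sketch below contrasts a word seen as a few subword pieces with the same word spelled out character by character, which is roughly what a scratchpad allows. The subword split is invented for illustration and is not the output of any real OpenAI tokenizer; `count_letter_by_spelling` is a hypothetical helper.

```python
# Toy illustration of the letter-counting problem.
# ASSUMPTION: this subword split is made up for illustration; it is not
# the actual output of any OpenAI tokenizer.
hypothetical_tokens = ["str", "aw", "berry"]


def count_letter_by_spelling(word: str, letter: str) -> int:
    """Count a letter by walking the word one character at a time,
    the way a model might spell it out in a scratchpad."""
    count = 0
    for ch in word:
        if ch == letter:
            count += 1
    return count


# A model operating on tokens sees three opaque pieces, not ten letters.
word = "".join(hypothetical_tokens)
r_count = count_letter_by_spelling(word, "r")
print(word, r_count)  # strawberry 3
```

Working at the character level makes the task trivial; the difficulty for earlier models came from never seeing the individual letters in the first place.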
A separate wiki entry on this site covers the Q* phase specifically; see [[q_openai|Q* OpenAI]].
On July 12, 2024, Reuters reporter Anna Tong, with Katie Paul reporting from New York, published an exclusive describing a project "code-named Strawberry, according to a person familiar with the matter and internal documentation reviewed by Reuters."[^1] The article reported that Strawberry was the new internal name for what had been called Q*, and that the company had developed a "specialized way of post-training" its generative AI models, adapting base models after pretraining to perform multi-step reasoning, autonomous web browsing, and what the document referred to as "deep research."[^1]
Reuters described an internal OpenAI document, dated to May 2024, that laid out plans for using Strawberry-trained models to plan ahead, navigate the internet through "CUAs" (computer-using agents), and "perform 'deep research' on users' behalf."[^1] The wire service was careful about what it could and could not verify. It noted that it had reviewed a copy of the document but could not independently confirm the dates referenced inside it or the capabilities ascribed to the project.[^1]
The story landed one day after a separate Bloomberg report by Rachel Metz, published July 11, 2024, which said that OpenAI executives had told staff at an internal all-hands meeting that the company would track its progress toward [[agi|AGI]] using a five-tier capability framework.[^8] Bloomberg also reported that, at that same meeting, OpenAI had demonstrated a research project that, executives said, gave a [[gpt_4o|GPT-4]]-class model skills similar to human reasoning; Bloomberg could not confirm whether the demo was part of Strawberry, but the timing led most observers to assume the two were linked.[^8]
The framework Bloomberg described slotted current chatbots into "Level 1" and placed Strawberry-style systems into "Level 2."[^8] As reported, the five levels were:
| Level | Label | Description |
|---|---|---|
| 1 | Chatbots | Conversational AI capable of natural-language interaction, exemplified by then-current [[chatgpt|ChatGPT]][^8] |
| 2 | Reasoners | Systems able to solve problems at the level of a person with a doctorate-level education, without access to external tools[^8] |
| 3 | Agents | Systems that can take autonomous actions on a user's behalf over extended periods (days)[^8] |
| 4 | Innovators | Systems that can contribute to original inventions and discoveries[^8] |
| 5 | Organizations | Systems capable of doing the work of an entire organization[^8] |
According to Bloomberg's sources, OpenAI executives told staff at the July 11, 2024 meeting that the company believed it was operating at Level 1 but was "on the cusp" of Level 2, which the company labeled "Reasoners."[^8] The framework was not published externally as a formal document; the Bloomberg article, and Axios coverage three days later, remain the canonical references for it.[^8][^9]
OpenAI launched the production version of the Strawberry project on September 12, 2024, under the name o1, releasing two models simultaneously: o1-preview, aimed at general reasoning, and o1-mini, a smaller variant optimized for code and math tasks.[^2][^10] Both were initially available to ChatGPT Plus and Team subscribers and via the OpenAI API for selected developers.[^2] OpenAI's launch blog post described the model as having "learn[ed] to refine its thinking process, try different strategies, and recognize its mistakes" through large-scale reinforcement learning.[^11]
The full "o1" (without the "-preview" suffix) shipped as part of OpenAI's "12 Days of OpenAI" event on December 5, 2024, alongside a new $200/month tier called ChatGPT Pro that gave subscribers access to a higher-compute variant labeled o1 pro mode.[^12][^13] OpenAI published an o1 system card on the same date.[^14] Microsoft integrated o1 into its Copilot product in January 2025, and the o1-pro API was released to developers in March 2025 at $150 per million input tokens and $600 per million output tokens, the most expensive model OpenAI had offered at the time.[^10]
On December 20, 2024, the final day of the same shipping event, OpenAI announced o3 and o3-mini, the next reasoning generation, with a system card and benchmark results but no immediate public availability.[^15] o3-mini was released to all ChatGPT users (including the free tier) on January 31, 2025; the full [[o3|OpenAI o3]] and a successor lightweight model, [[o4_mini|o4-mini]], shipped together on April 16, 2025, adding native tool use, web browsing, image input, and image generation to the reasoning pipeline.[^16][^17] A higher-compute [[o3-pro|o3-pro]] variant followed on the ChatGPT Pro tier.
OpenAI has not released a paper or weights for o1, and the company's published material is deliberately limited. What is publicly disclosed comes from the September 12, 2024 launch posts, the December 5, 2024 system card, statements by OpenAI researchers, and reporting on the project.
OpenAI's launch blog states that o1 was "trained with reinforcement learning" to produce a private [[chain_of_thought|chain-of-thought]] before generating a final answer.[^11] OpenAI research scientist Noam Brown, who joined the company in mid-2023 and led parts of the reasoning effort, said in interviews that the team had discovered "strong test-time compute scaling laws," meaning that performance on reasoning benchmarks continued to improve as the model was allowed to think for longer at inference.[^18] OpenAI's own materials phrase the same observation more cautiously, noting "a correlation between accuracy and the logarithm of the amount of compute spent thinking before answering."[^10]
The training approach itself is closer to [[reinforcement_learning|reinforcement learning]] over self-generated reasoning trajectories than to standard [[rlhf|RLHF]] preference modeling. OpenAI has not described the algorithm in detail, but independent commentary has connected it to the 2022 Stanford "Self-Taught Reasoner" (STaR) line of work, in which a model is rewarded for producing chains of thought that reach verified correct answers.[^7] Reuters reported in July 2024 that internal OpenAI documents described a similar approach for Strawberry: fine-tuning base models on a curated "deep research" dataset and rewarding successful multi-step reasoning.[^1]
The most visible architectural choice is that o1 dedicates explicit, billable tokens to internal deliberation before producing a user-visible answer. OpenAI's API documentation describes these as "reasoning tokens," and bills for them at the same rate as output tokens even though they are not returned to the caller.[^10] The model thus has two budgets: the visible response and a hidden "thinking" budget that scales with the difficulty of the problem.
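The billing consequence of hidden reasoning tokens can be shown with a short calculation. This is a minimal sketch: the `Usage` class and `request_cost_usd` function are hypothetical names, the token counts are invented, and the prices are the o1-pro API rates quoted earlier in this article ($150 and $600 per million input and output tokens).

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical container for one request's token counts."""
    input_tokens: int
    reasoning_tokens: int        # hidden deliberation, not returned to the caller
    visible_output_tokens: int   # the answer the caller actually sees


def request_cost_usd(usage: Usage,
                     input_price_per_m: float,
                     output_price_per_m: float) -> float:
    """Cost of one request when reasoning tokens are billed as output tokens."""
    billed_output = usage.reasoning_tokens + usage.visible_output_tokens
    total = (usage.input_tokens * input_price_per_m
             + billed_output * output_price_per_m)
    return total / 1_000_000


# ASSUMPTION: invented token counts for a single hard question.
example = Usage(input_tokens=1_000, reasoning_tokens=4_000, visible_output_tokens=500)
cost = request_cost_usd(example, input_price_per_m=150.0, output_price_per_m=600.0)
visible_only = request_cost_usd(
    Usage(example.input_tokens, 0, example.visible_output_tokens), 150.0, 600.0
)
print(f"with reasoning: ${cost:.2f}, visible tokens only: ${visible_only:.2f}")
```

In this made-up case the hidden reasoning accounts for most of the bill, which is why the thinking budget matters to API users even though they never see it.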
OpenAI hid the raw chain-of-thought from end users for a combination of reasons that the company described in its launch post: it wanted the model to be able to "express its thoughts in unaltered form" without policy-trained smoothing, and it wanted to preserve the option of monitoring those thoughts for signs of deception or misalignment.[^11] The company also stated that it would consider penalizing or revoking access for users who attempted to extract the hidden trace via prompt injection.[^10]
This design directly implements [[test_time_compute|test-time compute]] (also called [[inference_time_scaling|inference-time scaling]]) as a first-class product surface, rather than as a research curiosity. The o3 generation extended the idea further: o3 exposes a controllable "reasoning effort" setting (low, medium, high) that adjusts how many reasoning tokens the model uses on a given query.[^15]
The Strawberry/o1 approach has clear intellectual precursors in academic work that predates the project by two to three years. [[chain_of_thought|Chain-of-thought]] prompting was introduced by Google researchers in early 2022 as a prompting strategy that elicited step-by-step reasoning from large pretrained models. [[tree_of_thoughts|Tree of Thoughts]], introduced in 2023, generalized this to branching search over reasoning steps. [[reflexion|Reflexion]], also published in 2023, added a self-critique loop in which an agent revises its own outputs after observing failure signals. o1's contribution is not the chain-of-thought structure itself but the use of large-scale reinforcement learning to make the model produce useful chains of thought reliably, without prompting tricks.[^11][^18]
On the AIME 2024 mathematics competition (see [[aime|AIME]]), OpenAI reported that o1 solved 74% of problems with a single sample per problem (pass@1) and 83% with consensus among 64 samples, compared to roughly 13% for [[gpt_4o|GPT-4o]].[^11] On [[gpqa_diamond|GPQA Diamond]], a benchmark of graduate-level science questions designed to be Google-proof, o1 scored at or near the level of human experts in the relevant fields.[^11] These figures were widely cited at launch and were a substantial part of the case OpenAI made that o1 was a qualitative step beyond the GPT-4 generation.
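Benchmark figures of this kind depend on the sampling protocol used. The sketch below contrasts two common protocols on invented data: single-sample accuracy (pass@1) and majority vote over several samples (often called consensus or self-consistency). The function names and sample answers are hypothetical and do not reproduce OpenAI's evaluation code.

```python
from collections import Counter


def pass_at_1(samples: list[str], correct: str) -> float:
    """Fraction of individual samples that match the correct answer,
    i.e. the expected accuracy when only one sample is drawn."""
    return sum(s == correct for s in samples) / len(samples)


def consensus_answer(samples: list[str]) -> str:
    """Most common final answer among the samples (majority vote)."""
    return Counter(samples).most_common(1)[0][0]


# ASSUMPTION: invented final answers for one problem whose correct answer is "204".
samples = ["204", "204", "117", "204", "96"]
correct = "204"

single = pass_at_1(samples, correct)
voted = consensus_answer(samples)
print(single, voted == correct)
```

Here only three of five samples are right, yet the vote recovers the correct answer, which is why consensus scores tend to sit above single-sample scores for the same model.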
The o3 generation pushed the same numbers further. On December 20, 2024, OpenAI reported that o3 reached 96.7% on AIME 2024, 87.7% on GPQA Diamond, 71.7% on [[swe_bench_verified|SWE-bench Verified]], and 25.2% on Epoch AI's FrontierMath benchmark, where no other model at the time had cleared 2%.[^15] On [[arc_agi|ARC-AGI]], o3 scored between 75.7% and 87.5% in its tuned configurations, comparable to the average human score on the same private set, though the high-compute run cost was reported in the hundreds of dollars per task.[^15]
"Strawberry" is one entry in a longer pattern of internal OpenAI codenames that have surfaced in press reports without ever being acknowledged in marketing copy.
| Codename | Reported product | First public reference |
|---|---|---|
| Q* | Reasoning research that became Strawberry | Reuters, 2023-11-22[^4] |
| Strawberry | Reasoning project that became o1 | Reuters, 2024-07-12[^1] |
| Orion | Internal name for [[gpt-4.5|GPT-4.5]], reported as the last "pretraining-scaling" frontier model | TechCrunch and Fortune reporting[^19][^20] |
[[gpt-4.5|GPT-4.5]], released on February 27, 2025, was reported by TechCrunch and Fortune to have been known internally as "Orion" and to be the last model OpenAI built primarily using the dense-pretraining-scale paradigm that produced GPT-3 and GPT-4.[^19][^20] In the same coverage, two former employees told Fortune that Orion had originally been positioned as a potential GPT-5 but had not delivered the across-the-board capability jump that the company expected from a "5", and so was relabeled GPT-4.5 and superseded by reasoning-centric work.[^20] [[gpt-5|GPT-5]] subsequently launched in August 2025, retiring GPT-4.5 from the consumer tiers.[^20]
The shift from "Orion" to "Strawberry" thus reads, in retrospect, as the inflection point at which the company's headline ambition moved from a single huge pretraining run to a smaller base model trained heavily on reasoning, a pattern that continued into [[gpt_5_codex|GPT-5 Codex]] and the o3/o4 generations.
The Strawberry project is most often described, by both OpenAI staff and outside commentators, as the point at which the production AI ecosystem moved from pretraining-scaling to test-time-compute-scaling as the dominant performance lever.
The 2022 Chinchilla paper had reframed the [[scaling_laws|scaling laws]] of the GPT-3 era by showing that, for a fixed compute budget, optimal pretraining required more tokens per parameter than the GPT-3-style runs had used (see [[chinchilla_scaling|Chinchilla scaling laws]]). That argument concerned training-time compute. Strawberry and its successors introduced a parallel axis: even with a fixed trained model, accuracy on hard reasoning tasks could be improved further by paying more compute at inference.[^3][^18] OpenAI's own launch post pointed to "a correlation between accuracy and the logarithm of the amount of compute spent thinking before answering," and external commentators frequently described this as a new scaling law.[^11][^3]
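A correlation between accuracy and the logarithm of compute means that, over some range, each multiplicative increase in thinking compute buys a roughly constant additive gain in accuracy. The sketch below fits a straight line to accuracy against log10(compute) using ordinary least squares. The data points are invented for illustration and are not OpenAI measurements.

```python
import math

# ASSUMPTION: invented (relative compute, accuracy) pairs, not real benchmark data.
points = [(1, 0.20), (10, 0.35), (100, 0.50), (1000, 0.65)]

xs = [math.log10(compute) for compute, _ in points]
ys = [accuracy for _, accuracy in points]

# Ordinary least squares for accuracy = intercept + slope * log10(compute).
n = len(points)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(f"accuracy = {intercept:.2f} + {slope:.2f} * log10(compute)")
```

In this toy data every tenfold increase in compute adds the same fixed amount of accuracy. Because accuracy cannot exceed 100%, a trend like this can only hold over a limited range.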
The connection to [[bitter_lesson|The Bitter Lesson]], Richard Sutton's 2019 essay, is one that AI researchers including Noam Brown and Jim Fan have drawn explicitly. Sutton argued that, over the long arc of AI research, the methods that succeed are those that scale with compute, and that he saw two such methods: learning and search.[^21] Strawberry's design adds search (sampling many reasoning paths at inference, then selecting or refining among them) on top of large-scale learning, in a way that earlier post-GPT systems did not.[^18][^21]
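The "sample many paths, then select" form of search can be sketched as a best-of-N loop. This is an illustration only: OpenAI has not published how o1 uses search, and `propose`, `score`, and the toy square-root problem are hypothetical stand-ins for a model sampler and a verifier.

```python
import random

TARGET = 1369  # toy problem: find an integer x with x * x == TARGET


def propose(rng: random.Random) -> int:
    """Stand-in for sampling one reasoning path; returns its final answer."""
    return rng.randint(1, 60)


def score(answer: int) -> int:
    """Stand-in for a verifier: higher is better, 0 means exactly right."""
    return -abs(answer * answer - TARGET)


def best_of_n(n: int, seed: int = 0) -> int:
    """Sample n candidate answers and keep the one the verifier prefers."""
    rng = random.Random(seed)
    candidates = [propose(rng) for _ in range(n)]
    return max(candidates, key=score)


few = best_of_n(4)
many = best_of_n(64)
print(few, score(few), many, score(many))
```

Because both runs share a seed, the 64-sample run includes the 4-sample run's candidates, so its best score can never be worse. That is the basic sense in which extra inference compute buys accuracy under this scheme.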
For the broader industry, the visible effect was that competing labs accelerated their own reasoning programs. By mid-2025, every major frontier lab had shipped or announced a "thinking" mode that explicitly allocated extra inference compute, and the design pattern of charging users for hidden reasoning tokens became standard. The Strawberry rebrand also coincided with renewed willingness, inside and outside OpenAI, to talk publicly about the company's [[agi|AGI]] roadmap, with the Bloomberg "five levels" framework providing a vocabulary that quickly entered industry discussion.[^8][^9]
The public record about Strawberry rests almost entirely on reporting from two outlets (Reuters in November 2023 and July 2024, Bloomberg in July 2024) and the production materials that followed.[^4][^1][^8] OpenAI has not published the internal documents, has not released a research paper describing the o1 training pipeline, and has explicitly hidden o1's chain of thought from users and developers.[^10] The connection between the November 2023 Q* leak and the July 2024 Strawberry leak is asserted by Reuters and by subsequent secondary coverage, but the precise relationship between the two research efforts has not been documented by OpenAI itself.[^1][^4]
Several details that have circulated in popular reporting are not directly supported by primary sources. OpenAI has not, for example, published a confirmed reason for the choice of the name "Strawberry"; the connection to the "how many r's are in 'strawberry'" tokenization joke is reported and widely repeated, but it is presented in secondary sources as folklore rather than as an officially confirmed naming rationale.[^6][^7]
It is also worth noting that the "five levels of AGI" framework was reported by Bloomberg and Axios as an internal taxonomy shared at a company meeting, not as an externally published OpenAI document.[^8][^9] Other versions of the same idea, with different labels and a different number of stages, have appeared in OpenAI communications since, so the table above should be read as a snapshot of how the framework was reported in July 2024 rather than as a permanent classification.