OpenThoughts
Last reviewed
Jun 8, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 1,441 words
OpenThoughts is an open-source initiative and a series of datasets of verified reasoning traces created to train open reasoning models. Launched on January 28, 2025, by Bespoke Labs and the DataComp community, the project set out to build the best fully open datasets and data recipes for reasoning, because the strongest reasoning models of the period relied on proprietary training data that was not publicly documented [1][2]. The effort produced a progression of increasingly large datasets, beginning with OpenThoughts-114k and culminating in the flagship OpenThoughts3-1.2M, alongside a family of OpenThinker models trained on that data [3].
The datasets are built primarily through knowledge distillation: questions in mathematics, code, and science are collected, filtered, and answered by a strong "teacher" reasoning model, whose chain-of-thought traces are then verified and used to fine-tune smaller "student" models. The work was a flagship of the 2025 open-reasoning-data wave that followed DeepSeek-R1, which demonstrated that a few hundred thousand reasoning demonstrations could substantially improve a model's reasoning ability [1][4]. The accompanying technical report, "OpenThoughts: Data Recipes for Reasoning Models" (arXiv 2506.04178, submitted June 4, 2025), documents the methodology and was authored by a collaboration of roughly 50 researchers across several universities and labs [4]. All datasets, model weights, and code are released openly through the project at openthoughts.ai [3][4].
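The collect, filter, annotate, and verify loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: `build_sft_dataset`, `toy_teacher`, and `toy_verify` are hypothetical names, and the "teacher" is a toy arithmetic function standing in for a reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Example:
    question: str
    reasoning: str  # the teacher's chain-of-thought trace
    answer: str


def build_sft_dataset(
    questions: Iterable[str],
    keep_question: Callable[[str], bool],
    teacher: Callable[[str], Tuple[str, str]],
    verify: Callable[[str, str], bool],
) -> List[Example]:
    """Filter questions, annotate them with a teacher, keep verified traces."""
    dataset = []
    for question in questions:
        if not keep_question(question):
            continue  # question filtering
        reasoning, answer = teacher(question)  # teacher annotation
        if verify(question, answer):  # answer verification
            dataset.append(Example(question, reasoning, answer))
    return dataset


# Toy stand-ins (assumptions for illustration only).
def toy_teacher(question: str) -> Tuple[str, str]:
    a, b = (int(t) for t in question.split("+"))
    # Deliberately wrong when a == 9, so the verifier has something to reject.
    result = a + b + 1 if a == 9 else a + b
    return f"Add {a} and {b} step by step.", str(result)


def toy_verify(question: str, answer: str) -> bool:
    a, b = (int(t) for t in question.split("+"))
    return answer == str(a + b)


questions = ["1+2", "9+9", "", "40+2"]
data = build_sft_dataset(questions, bool, toy_teacher, toy_verify)
for ex in data:
    print(ex.question, "->", ex.answer)
```

In this toy run the empty question is filtered out and the incorrect trace for `9+9` is rejected by the verifier, leaving two examples that would be passed on to student fine-tuning.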
In late 2024 and early 2025, reasoning models that produce long internal chains of thought before answering, exemplified by OpenAI's o1 and then DeepSeek-R1, rapidly advanced performance on competition mathematics, programming, and scientific reasoning. DeepSeek-R1, released in January 2025, showed that distilling its reasoning traces into smaller dense models (the DeepSeek-R1-Distill series) produced strong open-weight reasoners, and it indicated that a relatively modest number of high-quality reasoning demonstrations could meaningfully lift a base model's capabilities [1][4].
A gap remained, however: the training data and the precise recipes used to produce frontier reasoning models were largely closed, leaving the open community without a documented, reproducible pipeline. Several concurrent efforts moved to fill this gap. UC Berkeley's NovaSky group released Sky-T1, a reasoning model trained for a few hundred dollars; Bespoke Labs released the Bespoke-Stratos models and the Bespoke-Stratos-17k dataset, distilled from DeepSeek-R1 using a pipeline adapted from Sky-T1; and Stanford researchers released s1 and the s1K dataset, emphasizing that a small, carefully chosen sample could teach reasoning [2][5]. OpenThoughts was conceived as a community-scale extension of this line of work, aiming to systematically curate the best open reasoning datasets rather than a single recipe [2].
OpenThoughts evolved through several dataset generations, each larger and more carefully curated than the last [3][4]:
| Dataset | Approx. size (examples) | Teacher model | Notes |
|---|---|---|---|
| Bespoke-Stratos-17k | 17,000 | DeepSeek-R1 | Predecessor from Bespoke Labs, adapted from the Sky-T1 pipeline |
| OpenThoughts-114k | 114,000 | DeepSeek-R1 | First OpenThoughts release; math, code, science, puzzles |
| OpenThoughts2-1M | 1,000,000 | DeepSeek-R1 | Combines OpenThoughts-114k with OpenR1-Math plus new math and code data |
| OpenThoughts3-1.2M | 1,200,000 | QwQ-32B | Flagship; built from a large systematic ablation study |
The first release, OpenThoughts-114k, contains roughly 114,000 high-quality examples spanning mathematics, science, code, and puzzles, with reasoning traces generated by DeepSeek-R1 and verified for correctness. Questions were drawn from a curated mix of existing datasets, including AI-MO/NuminaMath-CoT for math and codeparrot/apps and BAAI/TACO for code, and the data was produced using Bespoke Labs' Curator framework. It is released under the Apache 2.0 license [3][6].
OpenThoughts2-1M scaled the corpus to roughly one million examples by combining OpenThoughts-114k with the OpenR1-Math data and additional newly generated math and code reasoning traces, broadening domain coverage and improving the resulting models [3][7].
OpenThoughts3-1.2M, the flagship release, comprises 1.2 million examples: approximately 850,000 in math, 250,000 in code, and 100,000 in science, with reasoning annotations generated by Alibaba's QwQ-32B model. Unlike the earlier datasets, OpenThoughts3 was not assembled ad hoc but designed as the output of a large controlled ablation study over the entire data pipeline [4][8].
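As a quick check on the composition figures above, the approximate per-domain counts can be summed and converted to shares. The counts are the rounded values quoted in the text, so the percentages are approximate as well.

```python
# Approximate per-domain example counts quoted for OpenThoughts3-1.2M.
counts = {"math": 850_000, "code": 250_000, "science": 100_000}

total = sum(counts.values())
shares = {domain: round(100 * n / total, 1) for domain, n in counts.items()}

print(total)
print(shares)
```

On these figures the three domains add up to 1.2 million examples, with math making up roughly seven tenths of the dataset.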
The central methodological contribution of OpenThoughts3 was a systematic study of what makes reasoning-distillation data effective. The team ran more than 1,000 controlled experiments across math, code, and science, isolating each stage of the pipeline to find the best individual choices, then composing the winning choices into a final recipe [4][8]. The pipeline stages examined included:
Several findings were notable and, in some cases, counterintuitive [4][8]:
Composing these choices produced the OpenThoughts3 recipe, which was then scaled to 1.2 million examples [4][8].
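The "isolate each stage, then compose the winners" procedure can be sketched as follows. This is a simplified illustration, assuming each stage is varied on its own while the others stay at a default; the report's actual experimental protocol may differ. The stage names, option names, and scores are all hypothetical placeholders, not results from the study.

```python
from typing import Callable, Dict, List

Recipe = Dict[str, str]


def select_recipe(
    stages: Dict[str, List[str]],
    defaults: Recipe,
    score: Callable[[Recipe], float],
) -> Recipe:
    """For each stage, vary only that stage and keep its best-scoring option."""
    chosen: Recipe = {}
    for stage, options in stages.items():
        chosen[stage] = max(
            options, key=lambda opt: score({**defaults, stage: opt})
        )
    return chosen


# Made-up score contributions; a real ablation would train and evaluate
# a model for each candidate recipe instead of looking up a number.
TOY_GAINS = {
    ("question_source", "source_a"): 0.0,
    ("question_source", "source_b"): 1.5,
    ("answer_sampling", "one_per_question"): 0.0,
    ("answer_sampling", "many_per_question"): 2.0,
    ("teacher", "teacher_x"): 0.0,
    ("teacher", "teacher_y"): 0.5,
}


def toy_score(recipe: Recipe) -> float:
    return 40.0 + sum(TOY_GAINS[(s, o)] for s, o in recipe.items())


stages = {
    "question_source": ["source_a", "source_b"],
    "answer_sampling": ["one_per_question", "many_per_question"],
    "teacher": ["teacher_x", "teacher_y"],
}
defaults = {stage: options[0] for stage, options in stages.items()}

final_recipe = select_recipe(stages, defaults, toy_score)
print(final_recipe, toy_score(final_recipe))
```

Selecting per stage keeps the number of experiments linear in the number of options rather than exponential, at the cost of assuming that the stages do not interact strongly.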
The datasets were used to train the OpenThinker family of models, fine-tuned from Qwen2.5-Instruct base models [3][4]:
On benchmarks, OpenThinker3-7B reportedly scored 53% on AIME 2025, 51% on LiveCodeBench (covering the June 2024 to January 2025 window), and 54% on GPQA Diamond. The report frames these as improvements of 15.3, 17.2, and 20.5 percentage points respectively over DeepSeek-R1-Distill-Qwen-7B, the comparable distilled baseline at the same scale [4][8].
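The quoted scores and improvements imply approximate baseline scores for DeepSeek-R1-Distill-Qwen-7B, which can be recovered by subtraction. Because the headline scores above are rounded to whole percentages, the results below should be read as accurate only to within about half a point, not as figures taken from the report.

```python
# (reported OpenThinker3-7B score, reported improvement in percentage points)
reported = {
    "AIME 2025": (53.0, 15.3),
    "LiveCodeBench": (51.0, 17.2),
    "GPQA Diamond": (54.0, 20.5),
}

# Implied baseline = score minus improvement, rounded to one decimal place.
implied_baseline = {
    bench: round(score - gain, 1) for bench, (score, gain) in reported.items()
}

for bench, value in implied_baseline.items():
    print(f"{bench}: ~{value}%")
```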
OpenThoughts is significant as one of the most thorough public efforts to make reasoning-model training data and its construction transparent and reproducible. Where earlier open reasoning work such as Sky-T1, Bespoke-Stratos, and s1 demonstrated that open distillation was feasible, OpenThoughts contributed an at-scale, ablation-driven account of which data choices actually matter, releasing not only the final datasets and models but the recipe behind them [2][4]. Its finding that a weaker-scoring teacher (QwQ-32B) could yield a stronger student challenged the intuition that distillation quality tracks teacher benchmark performance [4][8].
The project is fully open: the datasets (Bespoke-Stratos-17k, OpenThoughts-114k, OpenThoughts2-1M, and OpenThoughts3-1.2M), the OpenThinker model weights, and the data-generation and evaluation code are publicly available, with OpenThoughts-114k under the Apache 2.0 license [3][4][6]. The initiative was led by Bespoke Labs and the DataComp community, with contributing researchers from institutions including Stanford, the University of California, Berkeley, the University of Texas at Austin, the University of Washington, UCLA, the University of North Carolina, the Toyota Research Institute, and LAION [2][4]. By documenting a complete, reproducible recipe and releasing every artifact, OpenThoughts became a widely cited reference point in the open-source AI reasoning ecosystem of 2025 [3][4].