OpenThoughts

Updated Jun 8, 2026

Overview

OpenThoughts is an open-source initiative and a series of datasets of verified reasoning traces created to train open reasoning models. Launched on January 28, 2025 by Bespoke Labs and the DataComp community, the project set out to build the best fully open datasets and data recipes for reasoning, because the strongest reasoning models of the period relied on proprietary training data that was not publicly documented ^[1]^[2]. The effort produced a progression of increasingly large datasets, beginning with OpenThoughts-114k and culminating in the flagship OpenThoughts3-1.2M, alongside a family of OpenThinker models trained on that data ^[3].

The datasets are built primarily through knowledge distillation: questions in mathematics, code, and science are collected, filtered, and answered by a strong "teacher" reasoning model, whose chain-of-thought traces are then verified and used to fine-tune smaller "student" models. The work was a prominent part of the wave of open reasoning datasets that followed DeepSeek-R1 in 2025; DeepSeek-R1 had demonstrated that a few hundred thousand reasoning demonstrations could substantially improve a model's reasoning ability ^[1]^[4]. The accompanying technical report, "OpenThoughts: Data Recipes for Reasoning Models" (arXiv 2506.04178, submitted June 4, 2025), documents the methodology and was authored by a collaboration of roughly 50 researchers across several universities and labs ^[4]. All datasets, model weights, and code are released openly through the project at openthoughts.ai ^[3]^[4].
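The distillation loop described above (teacher answers a question, the trace is verified, verified traces become fine-tuning examples) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the project's pipeline: `toy_teacher`, `extract_final_answer`, the `Question` type, and the "Answer:" line convention are all hypothetical, and a real teacher would be a reasoning model rather than a lookup table.

```python
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    reference_answer: str  # ground truth used for verification


def toy_teacher(question: Question) -> str:
    """Hypothetical stand-in for a teacher model: returns a reasoning trace."""
    canned = {
        "What is 2 + 3?": "Add the two numbers: 2 + 3 = 5.\nAnswer: 5",
        # Deliberately wrong trace, so the verifier has something to reject.
        "What is 7 * 6?": "Multiply: 7 * 6 = 48.\nAnswer: 48",
    }
    return canned[question.text]


def extract_final_answer(trace: str) -> str:
    """Assumes (for this sketch) that a trace ends with 'Answer: <value>'."""
    last_line = trace.strip().splitlines()[-1]
    return last_line.split("Answer:", 1)[-1].strip()


def build_sft_examples(questions):
    """Keep only traces whose final answer matches the reference answer."""
    examples = []
    for q in questions:
        trace = toy_teacher(q)
        if extract_final_answer(trace) == q.reference_answer:
            examples.append({"prompt": q.text, "completion": trace})
    return examples


questions = [
    Question("What is 2 + 3?", "5"),
    Question("What is 7 * 6?", "42"),
]
sft_examples = build_sft_examples(questions)
print(len(sft_examples), "verified example(s) kept")
```

Only the first question survives verification here; the surviving prompt and completion pairs are what a student model would be fine-tuned on.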

Context: open reasoning after DeepSeek-R1

In late 2024 and early 2025, reasoning models that produce long internal chains of thought before answering, exemplified by OpenAI's o1 and then DeepSeek-R1, rapidly advanced performance on competition mathematics, programming, and scientific reasoning. DeepSeek-R1, released in January 2025, showed that distilling its reasoning traces into smaller dense models (the DeepSeek-R1-Distill series) produced strong open-weight reasoners, and it indicated that a relatively modest number of high-quality reasoning demonstrations could meaningfully lift a base model's capabilities ^[1]^[4].

A gap remained, however: the training data and the precise recipes used to produce frontier reasoning models were largely closed, leaving the open community without a documented, reproducible pipeline. Several concurrent efforts moved to fill this gap. UC Berkeley's NovaSky group released Sky-T1, a reasoning model trained for a few hundred dollars; Bespoke Labs released the Bespoke-Stratos models and the Bespoke-Stratos-17k dataset, distilled from DeepSeek-R1 using a pipeline adapted from Sky-T1; and Stanford researchers released s1 and the s1K dataset, emphasizing that a small, carefully chosen sample could teach reasoning ^[2]^[5]. OpenThoughts was conceived as a community-scale extension of this line of work, aiming to systematically curate the best open reasoning datasets rather than a single recipe ^[2].

The OpenThoughts datasets (114k to 1.2M)

OpenThoughts evolved through several dataset generations, each larger and more carefully curated than the last ^[3]^[4]:

Dataset	Approx. size (examples)	Teacher model	Notes
Bespoke-Stratos-17k	17,000	DeepSeek-R1	Predecessor from Bespoke Labs, adapted from the Sky-T1 pipeline
OpenThoughts-114k	114,000	DeepSeek-R1	First OpenThoughts release; math, code, science, puzzles
OpenThoughts2-1M	~1,000,000	DeepSeek-R1	Combines OpenThoughts-114k with OpenR1-Math plus new math and code data
OpenThoughts3-1.2M	1,200,000	QwQ-32B	Flagship; built from a large systematic ablation study

The first release, OpenThoughts-114k, contains roughly 114,000 high-quality examples spanning mathematics, science, code, and puzzles, with reasoning traces generated by DeepSeek-R1 and verified for correctness. Questions were drawn from a curated mix of existing datasets, including sources such as AI-MO/NuminaMath-CoT for math and codeparrot/apps and BAAI/TACO for code, and the data was produced using Bespoke Labs' Curator framework. It is released under the Apache 2.0 license ^[3]^[6].

OpenThoughts2-1M scaled the corpus to roughly one million examples by combining OpenThoughts-114k with the OpenR1-Math data and additional newly generated math and code reasoning traces, broadening domain coverage and improving the resulting models ^[3]^[7].

OpenThoughts3-1.2M, the flagship release, comprises 1.2 million examples: approximately 850,000 in math, 250,000 in code, and 100,000 in science, with reasoning annotations generated by Alibaba's QwQ-32B model. Unlike the earlier datasets, OpenThoughts3 was not assembled ad hoc but designed as the output of a large controlled ablation study over the entire data pipeline ^[4]^[8].
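As a quick check on the composition quoted above, the three domain counts can be summed and converted to shares. The counts are the approximate figures from the text; the variable names are illustrative only.

```python
# Approximate per-domain example counts quoted for OpenThoughts3-1.2M.
composition = {"math": 850_000, "code": 250_000, "science": 100_000}

total = sum(composition.values())
shares = {domain: count / total for domain, count in composition.items()}

print(f"total: {total:,}")
for domain, share in shares.items():
    print(f"{domain}: {share:.1%}")
```

The counts add up to 1.2 million, with math making up roughly 71% of the examples, code about 21%, and science about 8%.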

Methodology: the data recipe ablations

The central methodological contribution of OpenThoughts3 was a systematic study of what makes reasoning-distillation data effective. The team ran more than 1,000 controlled experiments across math, code, and science, isolating each stage of the pipeline to find the best individual choices, then composing the winning choices into a final recipe ^[4]^[8]. The pipeline stages examined included:

Question sourcing: where to draw raw questions from.
Question mixing: how to combine sources and domains.
Question filtering: selecting questions by difficulty and other criteria.
Deduplication: removing near-duplicate math and science questions.
Answer generation: sampling multiple reasoning traces per question from a teacher model.
Answer (trace) filtering: selecting which generated traces to keep.
Teacher model selection: choosing which model produces the traces.
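The stages listed above can be pictured as a chain of small functions, each of which can be swapped out independently in an ablation. The sketch below is a toy under loud assumptions: every function name, the sample questions, the word-count "difficulty" proxy, and the exact-match deduplication are invented for illustration and are not the project's code or criteria.

```python
# Toy question pool keyed by source; names and contents are made up.
SOURCES = {
    "math_source": [
        "Solve x + 2 = 5.",
        "Solve x + 2 = 5. ",  # duplicate up to whitespace
        "Prove that sqrt(2) is irrational.",
    ],
    "code_source": ["Reverse a linked list.", "Print hello."],
}


def mix(sources, chosen):
    """Question sourcing and mixing: concatenate the chosen sources."""
    return [q for name in chosen for q in sources[name]]


def deduplicate(questions):
    """Exact-match dedup after case/whitespace normalization (a simplification)."""
    seen, kept = set(), []
    for q in questions:
        key = " ".join(q.lower().split())
        if key not in seen:
            seen.add(key)
            kept.append(q)
    return kept


def filter_questions(questions, min_words):
    """Placeholder question filter: word count stands in for difficulty."""
    return [q for q in questions if len(q.split()) >= min_words]


def generate_answers(questions, teacher, samples_per_question):
    """Answer generation: several traces per question from the chosen teacher."""
    return [(q, teacher(q, i)) for q in questions for i in range(samples_per_question)]


def filter_answers(pairs, max_chars):
    """Placeholder trace filter: drop traces longer than max_chars."""
    return [(q, a) for q, a in pairs if len(a) <= max_chars]


def toy_teacher(question, sample_index):
    """Hypothetical teacher; teacher selection means swapping this callable."""
    return f"[sample {sample_index}] reasoning about: {question}"


questions = mix(SOURCES, ["math_source", "code_source"])
questions = deduplicate(questions)
questions = filter_questions(questions, min_words=3)
dataset = filter_answers(
    generate_answers(questions, toy_teacher, samples_per_question=2),
    max_chars=200,
)
print(len(questions), "questions ->", len(dataset), "training pairs")
```

Because each stage takes and returns plain lists, one stage can be changed while the others are held fixed, which is the shape of experiment the ablation study relies on.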

Several findings were notable and, in some cases, counterintuitive ^[4]^[8]:

QwQ-32B outperformed DeepSeek-R1 as a teacher for distillation, even though QwQ-32B scores lower on some benchmarks itself, indicating that a teacher's own benchmark score is not the sole determinant of distillation quality.
Concentrating on a small number of high-quality question sources beat maximizing diversity across many sources.
Sampling multiple answers per question from the teacher, rather than one each, was an effective way to grow the dataset: the study reports that it can expand a data source by at least 16x while preserving quality.
LLM-based filters for question difficulty and response length outperformed embedding-based filtering methods.
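Two of these findings, filtering on teacher response length and multi-sampling, lend themselves to a small numeric sketch. The token counts and question IDs below are hypothetical, and the rule "keep the questions with the longest responses" is an assumption made for illustration; the text above says only that response length was used as a filter signal.

```python
SAMPLES_PER_QUESTION = 16  # the "at least 16x" factor quoted above

# Hypothetical token counts of one teacher response per question.
response_lengths = {
    "q1": 420,
    "q2": 3100,
    "q3": 8800,
    "q4": 150,
    "q5": 5600,
}


def keep_longest(lengths, keep):
    """Rank questions by response length and keep the top `keep` of them."""
    ranked = sorted(lengths, key=lengths.get, reverse=True)
    return ranked[:keep]


selected = keep_longest(response_lengths, keep=3)
expanded_size = len(selected) * SAMPLES_PER_QUESTION

print(selected)
print(expanded_size, "traces from", len(selected), "questions")
```

In this toy, three selected questions yield 48 traces, which shows how a small, filtered question pool can still produce a large training set once several answers are sampled per question.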

Composing these choices produced the OpenThoughts3 recipe, which was then scaled to 1.2 million examples ^[4]^[8].

Results: the OpenThinker models

The datasets were used to train the OpenThinker family of models, fine-tuned from Qwen2.5-Instruct models ^[3]^[4]:

OpenThinker-7B and OpenThinker-32B were trained on OpenThoughts-114k. OpenThinker-7B is a fine-tune of Qwen2.5-7B-Instruct, and the 32B model, released February 12, 2025, was presented as a leading open-data reasoning model at its scale ^[3]^[9].
OpenThinker2-32B, trained on OpenThoughts2-1M, was described as the first model trained on publicly available reasoning data to match DeepSeek-R1-Distill-Qwen-32B on standard reasoning benchmarks such as AIME and LiveCodeBench ^[4].
OpenThinker3-7B, trained on OpenThoughts3-1.2M, is the flagship result. The report presents it as the best open-data reasoning model at the 7B scale, regardless of whether models are optimized with supervised fine-tuning, reinforcement learning, or both ^[4]^[8].

On benchmarks, OpenThinker3-7B reportedly scored 53% on AIME 2025, 51% on LiveCodeBench (covering the June 2024 to January 2025 window), and 54% on GPQA Diamond. The report frames these as improvements of 15.3, 17.2, and 20.5 percentage points respectively over DeepSeek-R1-Distill-Qwen-7B, the comparable distilled baseline at the same scale ^[4]^[8].

Significance and open release

OpenThoughts is significant as one of the most thorough public efforts to make reasoning-model training data and its construction transparent and reproducible. Where earlier open reasoning work such as Sky-T1, Bespoke-Stratos, and s1 demonstrated that open distillation was feasible, OpenThoughts contributed an at-scale, ablation-driven account of which data choices actually matter, releasing not only the final datasets and models but the recipe behind them ^[2]^[4]. Its finding that a weaker-scoring teacher (QwQ-32B) could yield a stronger student challenged the intuition that distillation quality tracks teacher benchmark performance ^[4]^[8].

The project is fully open: the datasets (Bespoke-Stratos-17k, OpenThoughts-114k, OpenThoughts2-1M, and OpenThoughts3-1.2M), the OpenThinker model weights, and the data-generation and evaluation code are publicly available, with OpenThoughts-114k under the Apache 2.0 license ^[3]^[4]^[6]. The initiative was led by Bespoke Labs and the DataComp community, with contributing researchers from institutions including Stanford, the University of California, Berkeley, the University of Texas at Austin, the University of Washington, UCLA, the University of North Carolina, the Toyota Research Institute, and LAION ^[2]^[4]. By documenting a complete, reproducible recipe and releasing every artifact, OpenThoughts became a widely cited reference point in the open-source AI reasoning ecosystem of 2025 ^[3]^[4].

References

1. DeepSeek-AI. "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning." arXiv:2501.12948, January 2025.
2. OpenThoughts / Bespoke Labs. "Launching the OpenThoughts Project." open-thoughts.ai/blog/launch, January 28, 2025. https://www.open-thoughts.ai/blog/launch
3. open-thoughts. "open-thoughts: Fully open data curation for reasoning models." GitHub repository. https://github.com/open-thoughts/open-thoughts
4. Guha, E., Marten, R., Keh, S., Raoof, N., Smyrnis, G., et al. "OpenThoughts: Data Recipes for Reasoning Models." arXiv:2506.04178, June 2025. https://arxiv.org/abs/2506.04178
5. Muennighoff, N., et al. "s1: Simple test-time scaling." arXiv:2501.19393, January 2025.
6. open-thoughts. "OpenThoughts-114k." Hugging Face Datasets. https://huggingface.co/datasets/open-thoughts/OpenThoughts-114k
7. open-thoughts. "OpenThoughts2-1M." Hugging Face Datasets. https://huggingface.co/datasets/open-thoughts/OpenThoughts2-1M
8. OpenThoughts. "OpenThoughts3: A New SOTA Reasoning Data Recipe." openthoughts.ai/blog/ot3, June 2025. https://www.openthoughts.ai/blog/ot3
9. MarkTechPost. "Open Thoughts: An Open Source Initiative Advancing AI Reasoning with High-Quality Datasets and Models Like OpenThoughts-114k and OpenThinker-7B." January 30, 2025. https://www.marktechpost.com/2025/01/30/open-thoughts-an-open-source-initiative-advancing-ai-reasoning/
