Cosmopedia
Last reviewed
May 31, 2026
Sources
9 citations
Review status
Source-backed
Revision
v1 · 2,184 words
Cosmopedia is an open synthetic pretraining dataset released by Hugging Face in February 2024, made up of textbooks, blog posts, stories, and WikiHow-style articles written entirely by a large language model. The first version contains roughly 25 billion tokens spread across more than 30 million files, which made it the largest open synthetic data corpus of its kind at the time of release [1][2]. All of the text was generated by Mixtral-8x7B-Instruct-v0.1, an open model, and the prompts were seeded from web and educational sources so that the output would cover a wide span of topics. The project was led by Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra [2], and it was conceived as an open reproduction of the recipe behind Microsoft's Phi models.
Cosmopedia later spawned a second iteration, Cosmopedia v2, which became the synthetic backbone of the SmolLM family of small language models. The dataset and the code used to build it were both released under permissive terms, partly as a response to the fact that the Phi reports described their approach but never shared the data [1].
The motivation traces back to a 2023 paper from Microsoft Research titled "Textbooks Are All You Need," which introduced phi-1, a 1.3 billion parameter code model trained on about 6 billion tokens of filtered web text plus roughly 1 billion tokens of synthetic textbooks and exercises produced by GPT-3.5 [3]. Despite its small size, phi-1 reached 50.6 percent pass@1 on HumanEval, a score that beat many models several times its size. A follow-up, phi-1.5, extended the idea to general reasoning and showed similar gains [4]. The takeaway that drew so much attention was that data quality, not just scale, drives what a model learns, and that carefully written "textbook quality" content could substitute for a much larger pile of raw web pages.
The catch was reproducibility. Microsoft published strong numbers and a general description of the method, but it released neither the synthetic datasets nor the exact prompts behind them. Phi-2 became one of the most downloaded and most liked models on the Hugging Face Hub, yet nobody outside Microsoft could rebuild the data that made it work [1]. Cosmopedia set out to close that gap: take the textbooks hypothesis, run it with a fully open generator model, and publish everything so the community could study, criticize, and improve it.
The hardest part of Cosmopedia was not the compute. It was the prompts. As Ben Allal put it in the launch writeup, most of the effort went into prompt engineering rather than into orchestrating GPUs, because keeping the output diverse gets much harder as the volume grows [1]. A model asked the same kind of question over and over will produce near-duplicates, and a pretraining set full of near-duplicates teaches very little. So the pipeline is really a strategy for manufacturing varied prompts at scale.
The team split the prompt sources into two broad buckets. The first was curated educational material, which is high quality but limited in quantity: course outlines from Stanford, units from OpenStax, lessons from Khan Academy, and article titles scraped from WikiHow. To stretch a small set of topics into many prompts, each topic was crossed with four target audiences (young children, high school students, college students, and researchers) and three generation styles (textbook, blog post, and WikiHow article). That combination yields up to twelve different prompts from a single seed topic, and asking for the same subject at a child's level versus a researcher's level produces genuinely different text [1].
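The crossing of topics with audiences and styles can be sketched in a few lines. This is a minimal illustration, not the project's actual prompt code: the audience and style lists come from the description above, while the function name `build_prompts` and the template wording are hypothetical.

```python
from itertools import product

# Audiences and styles as listed in the text above.
AUDIENCES = ["young children", "high school students", "college students", "researchers"]
STYLES = ["textbook", "blog post", "WikiHow article"]


def build_prompts(topic: str) -> list[str]:
    """Cross one seed topic with every (audience, style) pair.

    The prompt wording is a placeholder; the real templates were longer
    and tuned per style.
    """
    return [
        f"Write a {style} about '{topic}' for an audience of {audience}."
        for audience, style in product(AUDIENCES, STYLES)
    ]


prompts = build_prompts("photosynthesis")
print(len(prompts))  # 12
```

Four audiences times three styles gives the twelve prompts per seed topic mentioned above, so a few thousand curated topics can yield tens of thousands of distinct prompts.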
The second and larger bucket was web data, which supplied more than 80 percent of all prompts. Here the team clustered millions of web samples drawn from a RefinedWeb-style corpus into 145 clusters, then used Mixtral to read ten random samples from each cluster and name the shared topic. After filtering out low-quality categories such as explicit material and celebrity gossip, 112 topics remained. Each prompt used a web page as a "seed sample" to ground the generation and was conditioned on the cluster topic about half the time, which kept outputs anchored to real-world knowledge while still varying the framing [1]. Mathematical content was added through the AutoMathText dataset, and a stories split was seeded from instruction-tuning data, namely the "questions about the world" subset of UltraChat and parts of OpenHermes2.5, to inject the everyday common-sense knowledge that formal textbooks tend to skip.
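A rough sketch of how a web-seeded prompt might be assembled follows. It assumes a simple string template; the function name `build_web_prompt`, the `max_chars` truncation, and the prompt wording are all hypothetical. Only the idea of including the cluster topic about half the time comes from the description above.

```python
import random


def build_web_prompt(
    seed_text: str,
    cluster_topic: str,
    rng: random.Random,
    topic_probability: float = 0.5,  # "about half the time", per the text
    max_chars: int = 1000,           # assumed truncation length for the extract
) -> str:
    """Build a generation prompt grounded in a web extract.

    The cluster topic is included only with probability `topic_probability`,
    so some prompts are anchored to the topic and others only to the extract.
    """
    extract = seed_text[:max_chars]
    prompt = f"Here is an extract from a web page:\n{extract}\n\n"
    if rng.random() < topic_probability:
        prompt += (
            "Write an educational piece related to this extract, "
            f"in the context of '{cluster_topic}'."
        )
    else:
        prompt += "Write an educational piece related to this extract."
    return prompt


rng = random.Random(0)
print(build_web_prompt("Sourdough starters rely on wild yeast...", "Cooking and Baking", rng))
```

Passing an explicit `random.Random` instance keeps the topic choice reproducible, which matters when a prompt set has to be regenerated or audited later.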
Generation ran through Hugging Face's llm-swarm library, which managed many parallel Mixtral-8x7B-Instruct-v0.1 instances served with Text Generation Inference on H100 GPUs from the company's science cluster. The full run took more than 10,000 GPU hours [1]. Afterward the text was decontaminated against common evaluation benchmarks in two steps. The team first flagged any sample that shared a 10-gram with a benchmark example, then verified the match with Python's difflib SequenceMatcher and dropped the sample when the matched text covered more than half of the benchmark item. That pass removed contaminated rows tied to ARC, BoolQ, HellaSwag, PIQA, and several others [1].
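The two-stage decontamination check can be approximated as below. This is a sketch under stated assumptions rather than the released pipeline: it tokenizes by lowercased whitespace splitting and compares token lists, and the helper names `ngrams` and `is_contaminated` are hypothetical. The 10-gram prefilter, the use of `difflib.SequenceMatcher`, and the one-half threshold follow the description above.

```python
from difflib import SequenceMatcher


def ngrams(tokens: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Return the set of all n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample: str, benchmark_item: str,
                    n: int = 10, threshold: float = 0.5) -> bool:
    """Flag a sample that contains more than `threshold` of a benchmark item."""
    sample_tokens = sample.lower().split()
    bench_tokens = benchmark_item.lower().split()

    # Stage 1: cheap prefilter. No shared n-gram means no further check.
    if not ngrams(sample_tokens, n) & ngrams(bench_tokens, n):
        return False

    # Stage 2: measure how much of the benchmark item is matched in the sample.
    matcher = SequenceMatcher(None, sample_tokens, bench_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(bench_tokens) > threshold


question = " ".join(f"w{i}" for i in range(30))    # stand-in benchmark item
leaky = "intro text " + " ".join(f"w{i}" for i in range(20)) + " more text"
print(is_contaminated(leaky, question))            # True: 20 of 30 tokens matched
```

The n-gram prefilter is cheap and set-based, so the more expensive sequence alignment only runs on the small fraction of pairs that already share a long exact span.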
Cosmopedia v0.1 is organized into eight splits, each named for the seed source behind its prompts. The two largest come from web samples and together account for roughly three quarters of the rows.
| Split | Rows | Seed source |
|---|---|---|
| web_samples_v1 | 12.4M | Internal RefinedWeb-style web dataset |
| web_samples_v2 | 10.3M | Internal web dataset, refined prompts |
| stories | 4.99M | UltraChat and OpenHermes2.5 |
| auto_math_text | 1.95M | AutoMathText |
| stanford | 1.02M | Stanford course outlines |
| wikihow | 179k | WikiHow titles |
| openstax | 126k | OpenStax course outlines |
| khanacademy | 24.1k | Khan Academy course outlines |
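As a quick check on the "roughly three quarters" figure, the rounded row counts from the table can be summed directly. The dictionary below simply transcribes the table; because the counts are rounded, the total is approximate.

```python
# Rounded row counts transcribed from the table above.
SPLIT_ROWS = {
    "web_samples_v1": 12_400_000,
    "web_samples_v2": 10_300_000,
    "stories": 4_990_000,
    "auto_math_text": 1_950_000,
    "stanford": 1_020_000,
    "wikihow": 179_000,
    "openstax": 126_000,
    "khanacademy": 24_100,
}

total_rows = sum(SPLIT_ROWS.values())
web_rows = SPLIT_ROWS["web_samples_v1"] + SPLIT_ROWS["web_samples_v2"]
web_share = web_rows / total_rows

print(f"approximate total rows: {total_rows:,}")
print(f"web-seeded share of rows: {web_share:.1%}")
```

The web-seeded splits come out at a little over 73 percent of rows, which is consistent with the three-quarters description.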
The full release holds 31,064,744 rows, totaling about 25 billion tokens, and ships under the Apache 2.0 license [2]. Each row records the prompt, the generated text, the token length, the seed dataset, the format (textbook, blog post, story, and so on), and the intended audience. The team also published a 100,000-row sample called cosmopedia-100k for quick experiments, and trained a 1.8 billion parameter model, Cosmo-1B, on the data to test it. Cosmo-1B beat TinyLlama 1.1B on ARC-easy, ARC-challenge, OpenBookQA, and MMLU, and was competitive with Qwen-1.5-1B on some of those benchmarks. It still trailed Phi-1.5, a gap the authors attributed to the strength of the generator model, topic coverage, and prompt design rather than to anything fundamental about the method [1].
The second version, built a few months later for the SmolLM project, reworked the weakest parts of the pipeline. Instead of clustering web pages to discover topics, v2 used a predefined list of about 34,000 topics built on the BISAC book classification, a standard publishing taxonomy that is broad and education-oriented. The team began with 5,000 topics across 51 categories and asked Mixtral to expand them into subtopics, which produced the final list [5]. The audience mix was rebalanced toward the levels that mattered most for a general model: 40 percent of the content aimed at middle school students, 30 percent at college students, and the remaining 30 percent a blend of other audiences and styles, including stories and Stanford-based textbooks carried over from v1 [5]. The team also generated 1 billion tokens of code textbooks seeded from Python samples in AutoMathText, so the corpus would carry some programming signal.
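One simple way to realize a target audience mix like this is weighted sampling when assigning an audience to each prompt. The sketch below assumes that approach; the text does not say how the proportions were enforced, and the names `AUDIENCE_MIX` and `sample_audiences` are hypothetical.

```python
import random
from collections import Counter

# Target proportions from the text; the third bucket is a blend in practice.
AUDIENCE_MIX = {
    "middle school students": 0.40,
    "college students": 0.30,
    "other audiences and styles": 0.30,
}


def sample_audiences(n_prompts: int, seed: int = 0) -> list[str]:
    """Assign an audience to each prompt by weighted random sampling."""
    rng = random.Random(seed)
    return rng.choices(
        list(AUDIENCE_MIX),
        weights=list(AUDIENCE_MIX.values()),
        k=n_prompts,
    )


counts = Counter(sample_audiences(10_000))
for audience, count in counts.items():
    print(f"{audience}: {count / 10_000:.1%}")
```

With enough prompts the realized proportions land close to the targets; a pipeline that needs exact quotas would allocate fixed counts per audience instead of sampling.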
The result was about 39 million documents totaling 28 billion tokens, still generated by Mixtral-8x7B-Instruct-v0.1 [5]. Cosmopedia v2 then became one of three ingredients in the SmolLM-Corpus, alongside Python-Edu (4 billion tokens of educational Python from The Stack) and FineWeb-Edu (220 billion tokens of deduplicated educational web pages). The SmolLM models were trained on this mixture: the 135M and 360M versions on 600 billion tokens and the 1.7B version on 1 trillion tokens [5].
| Aspect | Cosmopedia v0.1 | Cosmopedia v2 |
|---|---|---|
| Release | February 2024 | July 2024 |
| Documents | ~31 million | ~39 million |
| Tokens | ~25 billion | ~28 billion |
| Generator | Mixtral-8x7B-Instruct-v0.1 | Mixtral-8x7B-Instruct-v0.1 |
| Topic selection | Web clustering (145 clusters, 112 topics kept) | BISAC taxonomy (~34,000 topics) |
| Audience strategy | 4 audiences x 3 styles | 40% middle school, 30% college, 30% mixed |
| Code content | None (math via AutoMathText) | ~1B tokens of Python textbooks |
| Primary use | Cosmo-1B | SmolLM family |
The appeal of a corpus like Cosmopedia is that it is dense with explanatory, well-structured prose. A scraped web crawl is mostly noise: boilerplate, ads, navigation menus, and shallow content, with the genuinely instructive material thinly spread. Synthetic textbooks invert that ratio. Every document is written to teach something, which is the proposed explanation for why a small model trained on them can rival a much larger model trained on raw text. The flip side is coverage. Web data, for all its mess, reflects the actual distribution of human knowledge and language, including the rare facts, odd phrasings, and edge cases that a generator tends to smooth over. This is why both Cosmopedia and the SmolLM corpus pair synthetic data with filtered web data rather than relying on synthetic text alone. Cosmopedia v2 contributes 28 billion synthetic tokens against 220 billion tokens of FineWeb-Edu, so the synthetic portion sharpens the mixture rather than dominating it.
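To put those sizes in proportion, the share of each SmolLM-Corpus component by token count can be computed directly. The figures are the ones quoted in this article. The result describes the size of each component, not the sampling weights used during training, which are not stated here.

```python
# Component sizes in billions of tokens, as quoted in the article.
CORPUS_TOKENS_BILLIONS = {
    "Cosmopedia v2 (synthetic)": 28,
    "Python-Edu": 4,
    "FineWeb-Edu": 220,
}


def token_shares(sizes: dict[str, float]) -> dict[str, float]:
    """Return each component's fraction of the total token count."""
    total = sum(sizes.values())
    return {name: tokens / total for name, tokens in sizes.items()}


shares = token_shares(CORPUS_TOKENS_BILLIONS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# Cosmopedia v2 (synthetic): 11.1%
# Python-Edu: 1.6%
# FineWeb-Edu: 87.3%
```

By this measure the synthetic portion is about one ninth of the corpus, with filtered web text making up most of the rest.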
Using model-generated text to train the next model is contested. The central worry is captured by research on model collapse: in a 2024 Nature paper, Shumailov and colleagues showed that when a generative model is trained on data produced by earlier models, generation after generation, the tails of the original distribution fade and the model degrades in ways that are hard to reverse [6]. The rare events vanish first, then the variety thins out, until the model produces a narrow, repetitive slice of what it started with. As AI-generated text fills more of the public web, some researchers fear this kind of feedback loop could quietly poison future training sets.
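The loss of tails under recursive training can be illustrated with a toy simulation. This is not the experiment from the Nature paper: it stands in for a "model" with a plain categorical distribution over tokens, and each generation refits that distribution to a finite sample drawn from the previous one. The function names, vocabulary size, and sample size are all illustrative assumptions.

```python
import random
from collections import Counter


def next_generation(probs: dict[str, float], n_samples: int,
                    rng: random.Random) -> dict[str, float]:
    """Sample from the current distribution, then refit by empirical frequency."""
    items = list(probs)
    draws = rng.choices(items, weights=[probs[i] for i in items], k=n_samples)
    counts = Counter(draws)
    # Tokens that were never drawn get no probability and cannot return.
    return {item: count / n_samples for item, count in counts.items()}


def simulate(vocab_size: int = 200, n_samples: int = 500,
             generations: int = 10, seed: int = 0) -> list[int]:
    """Return the number of surviving tokens after each generation."""
    rng = random.Random(seed)
    # Long-tailed starting distribution: weight proportional to 1 / rank.
    raw = {f"tok{i}": 1.0 / (i + 1) for i in range(vocab_size)}
    norm = sum(raw.values())
    probs = {token: weight / norm for token, weight in raw.items()}

    support = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        support.append(len(probs))
    return support


print(simulate())
```

The printed list never increases: once a rare token fails to appear in one generation's sample, no later generation can recover it. That one-way loss is the mechanism the collapse literature describes, shown here in its simplest form.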
Cosmopedia sits inside that debate but does not fit the doomsday version of it. Model collapse, as studied, comes from the indiscriminate and recursive reuse of model output, where each generation feeds on the last with no fresh grounding. Cosmopedia is different in two ways. First, the synthetic text is seeded from real web pages and curated human sources, so it is anchored to genuine knowledge rather than spun from the model's own prior outputs. Second, it is generated once by Mixtral and then mixed with large amounts of human-written data, not fed back in a loop. The practical results from Phi, Cosmo-1B, and SmolLM suggest that carefully constructed synthetic data can help when the seeding and curation are done well. Whether the broader trend of training on uncurated AI text proves benign is a separate and still open question.
The quality of any generated corpus is capped by the model that writes it. Mixtral-8x7B-Instruct-v0.1 is a capable open model, but it is not as strong as the frontier systems some closed datasets use, and its errors, biases, and stylistic habits propagate into every document. The Cosmopedia authors named this as one reason Cosmo-1B trailed Phi-1.5 [1]. Coverage is bounded by the seed topics, so subjects that are absent or thin in the seeds are absent or thin in the output. The text can also be repetitive in tone, since a single instruction-tuned model produces all of it, and v2's move to a large fixed taxonomy was partly an attempt to widen that range. Decontamination reduces benchmark leakage but cannot guarantee none, and the dataset reflects whatever the generator knew as of its training cutoff, so it does not capture anything more recent. None of this undercuts the project's main contribution, which was to make the textbook-quality approach open and reproducible, but it does mean Cosmopedia works best as one component of a mixture rather than as a standalone source of truth.