Cosmopedia

Updated May 31, 2026

Cosmopedia is an open synthetic pretraining dataset released by Hugging Face in February 2024, made up of textbooks, blog posts, stories, and WikiHow-style articles written entirely by a large language model. The first version contains roughly 25 billion tokens spread across more than 30 million files, which made it the largest open synthetic data corpus of its kind at the time of release ^[1]^[2]. All of the text was generated by Mixtral-8x7B-Instruct-v0.1, an open model, and the prompts were seeded from web and educational sources so that the output would cover a wide span of topics. The project was led by Loubna Ben Allal, Anton Lozhkov, Guilherme Penedo, Thomas Wolf, and Leandro von Werra ^[2], and it was conceived as an open reproduction of the recipe behind Microsoft's Phi models.

Cosmopedia later spawned a second iteration, Cosmopedia v2, which became the synthetic backbone of the SmolLM family of small language models. The dataset and the code used to build it were both released under permissive terms, partly as a response to the fact that the Phi reports described their approach but never shared the data ^[1].

Why it was built

The motivation traces back to a 2023 paper from Microsoft Research titled "Textbooks Are All You Need," which introduced phi-1, a 1.3 billion parameter code model trained on about 6 billion tokens of filtered web data plus roughly 1 billion tokens of synthetic textbooks and exercises produced by GPT-3.5 ^[3]. Despite its small size, phi-1 reached 50.6 percent pass@1 on HumanEval, a score competitive with models many times larger. A follow-up, phi-1.5, extended the idea from code to common-sense reasoning and general language tasks and reported similar gains ^[4]. The takeaway that drew so much attention was that data quality, not just scale, drives what a model learns, and that carefully written "textbook quality" content could substitute for a much larger pile of raw web pages.

The catch was reproducibility. Microsoft published strong numbers and a general description of the method, but it released neither the synthetic datasets nor the exact prompts behind them. Phi-2 became one of the most downloaded and most liked models on the Hugging Face Hub, yet nobody outside Microsoft could rebuild the data that made it work ^[1]. Cosmopedia set out to close that gap: take the textbooks hypothesis, run it with a fully open generator model, and publish everything so the community could study, criticize, and improve it.

The generation pipeline

The hardest part of Cosmopedia was not the compute. It was the prompts. As the authors put it in the launch writeup, most of the effort went into prompt engineering rather than into orchestrating GPUs, because keeping the output diverse gets much harder as the volume grows ^[1]. A model asked the same kind of question over and over will produce near-duplicates, and a pretraining set full of near-duplicates teaches very little. So the pipeline is really a strategy for manufacturing varied prompts at scale.

The team split the prompt sources into two broad buckets. The first was curated educational material, which is high quality but limited in quantity: course outlines from Stanford, units from OpenStax, lessons from Khan Academy, and article titles scraped from WikiHow. To stretch a small set of topics into many prompts, each topic was crossed with four target audiences (young children, high school students, college students, and researchers) and three generation styles (textbook, blog post, and WikiHow article). That combination yields up to twelve different prompts from a single seed topic, and asking for the same subject at a child's level versus a researcher's level produces genuinely different text ^[1].
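The crossing of audiences and styles can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the audience and style lists come from the description above, while the function name `build_prompts` and the template wording are hypothetical.

```python
from itertools import product

# Audience and style lists are taken from the text; the template wording is hypothetical.
AUDIENCES = ["young children", "high school students", "college students", "researchers"]
STYLES = ["textbook", "blog post", "WikiHow article"]


def build_prompts(topic: str) -> list[str]:
    """Return one prompt per (audience, style) pair for a single seed topic."""
    return [
        f"Write a {style} about '{topic}' for an audience of {audience}."
        for audience, style in product(AUDIENCES, STYLES)
    ]


prompts = build_prompts("photosynthesis")
print(len(prompts))  # 4 audiences x 3 styles = 12
```

Each seed topic yields twelve distinct prompts, which is the upper bound the text mentions; in practice some combinations may be skipped for a given source.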

The second and larger bucket was web data, which supplied more than 80 percent of all prompts. Here the team clustered millions of web samples drawn from a RefinedWeb-style corpus into 145 clusters, then used Mixtral to read ten random samples from each cluster and name the shared topic. After filtering out low-quality categories such as explicit material and celebrity gossip, 112 topics remained. Each prompt included a web page as a "seed sample" to ground the generation, and was conditioned on the cluster topic about half the time, which kept outputs anchored to real-world knowledge while still varying the framing ^[1]. Mathematical content was added through the AutoMathText dataset. A stories split was seeded from instruction-tuning data, namely the "questions about the world" subset of UltraChat and parts of OpenHermes2.5, to inject the everyday common-sense knowledge that formal textbooks tend to skip.
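A rough sketch of the half-the-time topic conditioning follows. Only the 50 percent rule comes from the text; the function name `build_web_prompt`, the template wording, and the use of a seeded `random.Random` are assumptions made for illustration.

```python
import random


def build_web_prompt(seed_text: str, topic: str, rng: random.Random) -> str:
    """Build a prompt grounded in a web extract, naming the cluster topic about half the time."""
    # Hypothetical template wording; only the 50% conditioning comes from the text.
    prompt = f"Here is an extract from a web page:\n{seed_text}\n\n"
    if rng.random() < 0.5:
        prompt += f"Write an educational piece related to this extract, focusing on the topic '{topic}'."
    else:
        prompt += "Write an educational piece related to this extract."
    return prompt


rng = random.Random(0)
prompts = [build_web_prompt("Some web text.", "Astronomy", rng) for _ in range(1000)]
share_with_topic = sum("Astronomy" in p for p in prompts) / len(prompts)
print(f"{share_with_topic:.2f}")
```

Leaving the topic out of roughly half the prompts lets the seed extract alone steer the generation, which is one plausible way to keep outputs from converging on the same framing for every sample in a cluster.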

Generation ran through Hugging Face's llm-swarm library, which managed many parallel Mixtral-8x7B-Instruct-v0.1 instances served with Text Generation Inference on H100 GPUs from the company's science cluster. The full run took more than 10,000 GPU hours ^[1]. Afterward the text was decontaminated against common evaluation benchmarks in two passes. The team first flagged any sample that shared a 10-gram with a benchmark example, then compared each flagged sample with the benchmark item using Python's difflib SequenceMatcher and dropped it when the matched substrings covered more than half of the benchmark item. This procedure removed contaminated rows tied to ARC, BoolQ, HellaSwag, PIQA, and several others ^[1].
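The decontamination check can be approximated as follows. This is a sketch under stated assumptions rather than the released code: it tokenizes by lowercased whitespace splitting and compares word sequences, and the helper names `ngrams` and `is_contaminated` are hypothetical. The 10-gram filter, the use of `difflib.SequenceMatcher`, and the 0.5 threshold follow the description above.

```python
from difflib import SequenceMatcher


def ngrams(tokens: list[str], n: int = 10) -> set[tuple[str, ...]]:
    """Return the set of n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample: str, benchmark_item: str, n: int = 10, threshold: float = 0.5) -> bool:
    """Flag a sample that shares an n-gram with a benchmark item and contains more than `threshold` of it."""
    # Assumption: simple whitespace tokenization on lowercased text.
    sample_tokens = sample.lower().split()
    bench_tokens = benchmark_item.lower().split()
    # Cheap first pass: skip pairs with no shared n-gram.
    if not ngrams(sample_tokens, n) & ngrams(bench_tokens, n):
        return False
    # Second pass: total length of matching token runs, relative to the benchmark item.
    matcher = SequenceMatcher(None, sample_tokens, bench_tokens, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(bench_tokens) > threshold


benchmark = "which gas do plants absorb from the air during photosynthesis in order to make sugar"
leaked = "Quiz time: " + benchmark + " The answer is carbon dioxide."
clean = "Plants use sunlight to turn water and carbon dioxide into sugar and oxygen."

print(is_contaminated(leaked, benchmark))
print(is_contaminated(clean, benchmark))
```

The n-gram pass is cheap and rules out most pairs, so the more expensive sequence alignment only runs on the few samples that already look suspicious.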

What is inside v0.1

Cosmopedia v0.1 is organized into eight splits, each named for the seed source behind its prompts. The largest two come from web samples and account for roughly three quarters of the data combined.

Split	Rows	Seed source
web_samples_v1	12.4M	Internal RefinedWeb-style web dataset
web_samples_v2	10.3M	Internal web dataset, refined prompts
stories	4.99M	UltraChat and OpenHermes2.5
auto_math_text	1.95M	AutoMathText
stanford	1.02M	Stanford course outlines
wikihow	179k	WikiHow titles
openstax	126k	OpenStax course outlines
khanacademy	24.1k	Khan Academy course outlines
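
The "roughly three quarters" figure for the two web splits can be checked directly from the rounded row counts in the table. The dictionary name below is chosen for this sketch only.

```python
# Row counts as listed in the table above (rounded, in millions).
SPLIT_ROWS_M = {
    "web_samples_v1": 12.4,
    "web_samples_v2": 10.3,
    "stories": 4.99,
    "auto_math_text": 1.95,
    "stanford": 1.02,
    "wikihow": 0.179,
    "openstax": 0.126,
    "khanacademy": 0.0241,
}

total_m = sum(SPLIT_ROWS_M.values())
web_share = (SPLIT_ROWS_M["web_samples_v1"] + SPLIT_ROWS_M["web_samples_v2"]) / total_m

print(f"total rows: about {total_m:.1f}M")
print(f"web share: {web_share:.0%}")
```

With these rounded inputs the total comes to about 31.0 million rows and the web splits to about 73 percent of them.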

The full release holds 31,064,744 rows, totaling about 25 billion tokens, and ships under the Apache 2.0 license ^[2]. Each row records the prompt, the generated text, the token length, the seed dataset, the format (textbook, blog post, story, and so on), and the intended audience. The team also published a 100,000-row sample called cosmopedia-100k for quick experiments, and trained a 1.8 billion parameter model, Cosmo-1B, on the data to test it. Cosmo-1B beat TinyLlama 1.1B on ARC-easy, ARC-challenge, OpenBookQA, and MMLU, and was competitive with Qwen-1.5-1B on some of those, though it still trailed Phi-1.5, a gap the authors attributed to the strength of the generator model, topic coverage, and prompt design rather than to anything fundamental about the method ^[1].

Cosmopedia v2 and SmolLM

The second version, built a few months later for the SmolLM project, reworked the weakest parts of the pipeline. Instead of clustering web pages to discover topics, v2 started from a predefined list of about 34,000 topics drawn from the BISAC book classification, a standard publishing taxonomy that is broad and education-oriented. The team began with 5,000 topics across 51 categories and asked Mixtral to expand them into subtopics ^[5]. The audience mix was rebalanced toward the levels that mattered most for a general model: 40 percent of the content aimed at middle school students, 30 percent at college students, and the remaining 30 percent a blend of other audiences and styles, including stories and Stanford-based textbooks carried over from v1 ^[5]. The team also generated 1 billion tokens of code textbooks seeded from Python samples in AutoMathText, so the corpus would carry some programming signal.
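One simple way to realize a 40/30/30 audience mix is weighted sampling when assigning an audience to each prompt. The sketch below assumes that approach; the source does not say how the proportions were enforced, and the bucket labels and the function name `sample_audiences` are hypothetical.

```python
import random
from collections import Counter

# Shares are from the text; the bucket labels are shorthand chosen for this sketch.
AUDIENCE_MIX = {"middle school": 0.40, "college": 0.30, "other audiences and styles": 0.30}


def sample_audiences(n: int, seed: int = 0) -> Counter:
    """Draw n audience labels according to AUDIENCE_MIX."""
    rng = random.Random(seed)
    labels = list(AUDIENCE_MIX)
    weights = list(AUDIENCE_MIX.values())
    return Counter(rng.choices(labels, weights=weights, k=n))


counts = sample_audiences(10_000)
for label, count in counts.most_common():
    print(f"{label}: {count / 10_000:.1%}")
```

Sampling gives the target mix only approximately; a pipeline that needs exact proportions could instead allocate fixed quotas per bucket.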

The result was about 39 million documents totaling 28 billion tokens, still generated by Mixtral-8x7B-Instruct-v0.1 ^[5]. Cosmopedia v2 then became one of three ingredients in the SmolLM-Corpus, alongside Python-Edu (4 billion tokens of educational Python from The Stack) and FineWeb-Edu (220 billion tokens of deduplicated educational web pages). The SmolLM models trained on this mixture: the 135M and 360M versions on 600 billion tokens and the 1.7B version on 1 trillion tokens ^[5].

Aspect	Cosmopedia v0.1	Cosmopedia v2
Release	February 2024	July 2024
Documents	~31 million	~39 million
Tokens	~25 billion	~28 billion
Generator	Mixtral-8x7B-Instruct-v0.1	Mixtral-8x7B-Instruct-v0.1
Topic selection	Web clustering (145 to 112 topics)	BISAC taxonomy (~34,000 topics)
Audience strategy	4 audiences x 3 styles	40% middle school, 30% college, 30% mixed
Code content	None (math via AutoMathText)	~1B tokens of Python textbooks
Primary use	Cosmo-1B	SmolLM family

How synthetic text compares with web data

The appeal of a corpus like Cosmopedia is that it is dense with explanatory, well-structured prose. A scraped web crawl is mostly noise: boilerplate, ads, navigation menus, and shallow content, with the genuinely instructive material thinly spread. Synthetic textbooks invert that ratio. Every document is written to teach something, which is the argument for why a small model trained on them can compete with a much larger model trained on raw text. The flip side is coverage. Web data, for all its mess, reflects the actual distribution of human knowledge and language, including the rare facts, odd phrasings, and edge cases that a generator tends to smooth over. This is why the SmolLM corpus pairs Cosmopedia v2 with filtered web data rather than relying on synthetic text alone. Cosmopedia v2 contributes 28 billion synthetic tokens against 220 billion tokens of FineWeb-Edu, so the synthetic portion sharpens the mixture rather than dominating it.
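The balance between synthetic and web tokens can be made concrete with the corpus sizes quoted in this article. The figures describe the composition of the SmolLM-Corpus, not the sampling weights used during training, which the article does not give.

```python
# Token counts in billions, as quoted in this article for the SmolLM-Corpus.
CORPUS_TOKENS_B = {"cosmopedia_v2": 28, "fineweb_edu": 220, "python_edu": 4}

total_b = sum(CORPUS_TOKENS_B.values())
synthetic_share = CORPUS_TOKENS_B["cosmopedia_v2"] / total_b
web_to_synthetic = CORPUS_TOKENS_B["fineweb_edu"] / CORPUS_TOKENS_B["cosmopedia_v2"]

print(f"total: {total_b}B tokens")                                   # total: 252B tokens
print(f"synthetic share: {synthetic_share:.1%}")                     # synthetic share: 11.1%
print(f"web tokens per synthetic token: {web_to_synthetic:.1f}")     # web tokens per synthetic token: 7.9
```

By token count, Cosmopedia v2 is about one ninth of the corpus, with nearly eight FineWeb-Edu tokens for every synthetic one.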

The debate over synthetic data

Using model-generated text to train the next model is contested. The central worry is captured by research on model collapse: in a 2024 Nature paper, Shumailov and colleagues showed that when a generative model is trained on data produced by earlier models, generation after generation, the tails of the original distribution fade and the model degrades in ways that are hard to reverse ^[6]. The rare events vanish first, then the variety thins out, until the model produces a narrow, repetitive slice of what it started with. As AI-generated text fills more of the public web, some researchers fear this kind of feedback loop could quietly poison future training sets.
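The loss of rare events can be illustrated with a toy resampling loop. This is not the experiment from the Nature paper: it stands in for "training" by taking the empirical frequencies of a finite sample, and the function name, vocabulary, and sample sizes are invented for illustration. It shows only one mechanism, namely that a token that fails to appear in one generation's sample can never reappear in later ones.

```python
import random
from collections import Counter


def resample_generations(weights: dict[str, float], n_samples: int, generations: int, seed: int = 0) -> list[int]:
    """Return the number of distinct tokens surviving after each generation."""
    rng = random.Random(seed)
    support_sizes = []
    for _ in range(generations):
        tokens = list(weights)
        draws = rng.choices(tokens, weights=[weights[t] for t in tokens], k=n_samples)
        counts = Counter(draws)
        # "Retrain": the next model is just the empirical frequencies of the sample.
        weights = {token: count / n_samples for token, count in counts.items()}
        support_sizes.append(len(weights))
    return support_sizes


# A Zipf-like vocabulary: a few common tokens and a long tail of rare ones.
vocab = {f"tok{i}": 1 / (i + 1) for i in range(200)}
sizes = resample_generations(vocab, n_samples=500, generations=10)
print(sizes)
```

The printed list never increases from one generation to the next, because each generation can only draw tokens that survived the previous one; the long tail is what disappears first.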

Cosmopedia sits inside that debate but does not fit the doomsday version of it. Model collapse, as studied, comes from the indiscriminate and recursive reuse of model output, where each generation feeds on the last with no fresh grounding. Cosmopedia is different in two ways. First, the synthetic text is seeded from real web pages and curated human sources, so it is anchored to genuine knowledge rather than spun from the model's own prior outputs. Second, it is generated once by Mixtral and then mixed with large amounts of human-written data, not fed back in a loop. The practical results from Phi, Cosmo-1B, and SmolLM suggest that carefully constructed synthetic data can help when the seeding and curation are done well. Whether the broader trend of training on uncurated AI text proves benign is a separate and still open question.

Limitations

The quality of any generated corpus is capped by the model that writes it. Mixtral-8x7B-Instruct-v0.1 is a capable open model, but it is not as strong as the frontier systems some closed datasets use, and its errors, biases, and stylistic habits propagate into every document. The Cosmopedia authors named generator strength as one reason Cosmo-1B trailed Phi-1.5 ^[1]. Coverage is bounded by the seed topics, so subjects that are absent or thin in the seeds are absent or thin in the output. The text can also be repetitive in tone, since a single instruction-tuned model produces all of it, and v2's move to a large fixed taxonomy was partly an attempt to widen that range. Decontamination reduces benchmark leakage but cannot guarantee none, and the dataset reflects whatever the generator knew as of its training cutoff, so it does not capture anything more recent. None of this undercuts the project's main contribution, which was to make the textbook-quality approach open and reproducible, but it does mean Cosmopedia works best as one component of a mixture rather than as a standalone source of truth.

References

1. Ben Allal, L., Lozhkov, A., & van Strien, D. (2024). Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models. Hugging Face Blog. https://huggingface.co/blog/cosmopedia
2. Ben Allal, L., Lozhkov, A., Penedo, G., Wolf, T., & von Werra, L. (2024). Cosmopedia (dataset card). Hugging Face. https://huggingface.co/datasets/HuggingFaceTB/cosmopedia
3. Gunasekar, S., et al. (2023). Textbooks Are All You Need. arXiv:2306.11644. https://arxiv.org/abs/2306.11644
4. Li, Y., et al. (2023). Textbooks Are All You Need II: phi-1.5 technical report. arXiv:2309.05463. https://arxiv.org/abs/2309.05463
5. Ben Allal, L., et al. (2024). SmolLM: blazingly fast and remarkably powerful. Hugging Face Blog. https://huggingface.co/blog/smollm
6. Shumailov, I., Shumaylov, Z., Zhao, Y., Papernot, N., Anderson, R., & Gal, Y. (2024). AI models collapse when trained on recursively generated data. Nature 631, 755-759. https://www.nature.com/articles/s41586-024-07566-y
7. Hugging Face. cosmopedia (GitHub repository). https://github.com/huggingface/cosmopedia
8. Hugging Face Smol Models Research. SmolLM-Corpus (dataset card). https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
9. Mehta, S. (2024). Hugging Face Introduces Cosmopedia, the Largest Open Synthetic Dataset. Analytics India Magazine. https://analyticsindiamag.com/huggingface-introduces-cosmopedia-the-largest-open-synthetic-dataset/
