Common Pile
Last reviewed
May 16, 2026
Sources
13 citations
Review status
Source-backed
Revision
v1 · 3,498 words
Common Pile v0.1 is an 8 terabyte corpus of openly licensed and public domain text published on June 5, 2025, by EleutherAI and a consortium of more than two dozen academic and industry collaborators. Compiled over roughly two years, the dataset draws from 30 sources spanning research papers, source code, government records, public domain books, encyclopedias, educational resources, audio transcripts, and licensed web text. Its authors describe it as the largest collection of openly licensed text released to date, and it was designed to show that competitive large language models can be trained without recourse to copyrighted material of uncertain provenance.
Alongside the dataset, EleutherAI released two companion models, Comma v0.1-1T and Comma v0.1-2T. Both are 7 billion parameter decoder-only transformers, trained on 1 trillion and 2 trillion tokens respectively. According to the accompanying paper, the Comma models are competitive with LLaMA 1 7B and LLaMA 2 7B on standard benchmarks at similar compute budgets, supporting the project's central thesis that the gap between models trained on openly licensed text and on unfiltered web scrapes is driven largely by filtering quality rather than by the inclusion of copyrighted text. Common Pile v0.1 is hosted on Hugging Face, and its contents are restricted to material whose licenses satisfy the Open Definition 2.1.
Common Pile v0.1 is the spiritual successor to The Pile, the 825 GiB diverse text corpus that EleutherAI released in December 2020. The original Pile was the first large pretraining corpus to be both fully described in an accompanying paper and made available for download, and it underpinned a generation of open models including GPT-Neo, GPT-J, GPT-NeoX-20B, Pythia, RWKV, and Cerebras-GPT. However, several of The Pile's twenty-two components were collected without explicit permission from rights holders. The most prominent, Books3, contained roughly 197,000 full text books scraped from the private torrent tracker Bibliotik. After the Danish anti-piracy group Rights Alliance issued a takedown request in August 2023, Books3 was pulled from canonical hosting, and Hugging Face and EleutherAI removed it from their distribution channels.
During the same period, a cascade of copyright lawsuits filed against OpenAI, Anthropic, Meta, Microsoft, and other AI companies argued that training on unlicensed copyrighted text constituted infringement. The New York Times, Sarah Silverman, Richard Kadrey, and the Authors Guild were among the plaintiffs. In an interview accompanying the release of Common Pile, EleutherAI executive director Stella Biderman observed that these lawsuits had "drastically decreased the transparency companies engage in," pushing closed labs to disclose less about their training data. Biderman also pushed back on a widely held industry assumption, arguing that "the common idea that unlicensed text drives performance is unjustified."
Common Pile was conceived as a direct response to both pressures: a license clean replacement for corpora that academic and independent groups had been quietly assembling from web scrapes of mixed provenance, and empirical evidence that open licensing does not condemn a model to second tier performance. The project marks an explicit step back from the permissive interpretations of fair use that characterized the 2020 to 2022 era of LLM pretraining, in favor of a license safety thesis that limits training data to works whose creators have granted permission to copy, modify, and redistribute.
Common Pile was assembled by more than two dozen researchers spread across fourteen institutions, with the University of Toronto and Vector Institute taking the lead. The participating institutions and their roles follow.
| Institution | Role |
|---|---|
| EleutherAI | Coordinating organization, infrastructure, evaluation, modeling |
| University of Toronto | Co-lead authorship and dataset curation |
| Vector Institute | Compute and research coordination (Toronto co-lead) |
| Hugging Face | Hosting, tokenization, distribution |
| Poolside | Industry partner; code and infrastructure |
| US Library of Congress | Public domain book digitization |
| Internet Archive | Public domain book hosting and digitization |
| Allen Institute for AI | Dataset curation, evaluation methodology |
| Massachusetts Institute of Technology | Research collaboration |
| Carnegie Mellon University | Research collaboration |
| Cornell University | License analysis and curation |
| University of Maryland, College Park | Modeling, training infrastructure |
| Lawrence Livermore National Laboratory | Compute and engineering |
| Teraflop AI | Pipeline engineering and curation |
| Lila Sciences | Curation and benchmarking |
The paper's co-leads are Nikhil Kandpal and Brian Lester (Toronto and Vector), with Colin Raffel as senior author across Toronto, Vector, and Hugging Face. Stella Biderman, Sebastian Majstorovic, Baber Abbasi, Aviya Skowron, John Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, and Lintang Sutawika contributed from EleutherAI, while Luca Soldaini and Tyler Murray represented the Allen Institute for AI. Additional contributors include A. Feder Cooper and Aaron Gokaslan (Cornell), John Kirchenbauer and Tom Goldstein (Maryland), Brian Bartoldson and Bhavya Kailkhura (Lawrence Livermore), Shayne Longpre (MIT), Alon Albalak (Lila Sciences), Enrico Shippole (Teraflop AI), and Guilherme Penedo, Loubna Ben Allal, and Elie Bakouch (Hugging Face).
Common Pile is built around a strict interpretation of what it means for content to be "openly licensed." The authors anchor their definition in version 2.1 of the Open Definition maintained by the Open Knowledge Foundation, which holds that a work is open if anyone is free to use, study, modify, and redistribute it for any purpose, including commercial use, subject only to requirements such as attribution and share-alike.
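Under that framing, the license families admitted elsewhere in this article are public domain dedications and CC0, Creative Commons Attribution (CC BY), Creative Commons Attribution-ShareAlike (CC BY-SA), and the permissive software licenses recognized by the Blue Oak Council.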
The authors excluded Creative Commons Non-Commercial and No-Derivatives variants because they do not satisfy the Open Definition's requirement that any purpose be permitted. They also excluded datasets such as OpenAlex, YouTube Commons, and the Hacker News Kaggle dataset on the grounds that their license metadata was unreliable. The paper coins the term "license laundering" to describe situations where a curator labels a collection with an open license that its constituent documents do not actually carry, and the pipelines were designed to avoid that pitfall.
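The exclusion rule above can be illustrated with a minimal sketch. The allowlist, the SPDX-style identifiers, and the function names are assumptions chosen for illustration; they are not the project's actual configuration or tooling.

```python
# Hypothetical allowlist of SPDX-style identifiers, for illustration only.
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0", "MIT", "Apache-2.0"}


def is_openly_licensed(license_id: str) -> bool:
    """Return True only for allowlisted licenses with no NC or ND clause."""
    normalized = license_id.strip()
    parts = normalized.upper().split("-")
    # Non-Commercial (NC) and No-Derivatives (ND) variants restrict purpose
    # or modification, so they are rejected before the allowlist check.
    if "NC" in parts or "ND" in parts:
        return False
    return normalized in OPEN_LICENSES


def filter_documents(docs):
    """Keep documents whose 'license' field passes the check above."""
    return [d for d in docs if is_openly_licensed(d.get("license", ""))]
```

Documents with a missing or unrecognized license are dropped rather than kept, which mirrors the conservative stance the paragraph describes. It does nothing about license laundering, where the metadata itself is wrong.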
The paper is candid about residual risk. The authors acknowledge that automatic license detection has limited recall, that licenses can change after collection, and that public domain documents sometimes quote copyrighted material. They release the dataset under the explicit caveat that "despite our best efforts at due diligence, data with restrictive licensing terms may have still ended up in our dataset."
Common Pile v0.1 aggregates 30 distinct sources grouped into ten thematic categories. After license filtering, deduplication, and quality filtering, the dataset shrinks from a raw 8 TB to roughly 7.6 TB. Source code is the single largest contributor, accounting for more than half of the unfiltered total.
| Category | Sources | Notable size or scope |
|---|---|---|
| Source code | Stack V2 (open licensed subset), Python Enhancement Proposals | About 130 billion tokens across 69.6 million documents in code; Stack V2 alone supplies more than half of the raw 8 TB |
| Scientific and scholarly text | peS2o, PubMed Central, ArXiv papers, ArXiv abstracts | peS2o contributes about 273.9 billion tokens across 6.1 million documents; ArXiv contributes more than 2.4 million papers |
| Government and legal text | Caselaw Access Project, Court Listener, US Government Publishing Office, US Patent and Trademark Office (1782 to present), UK Hansard, Regulations.gov | Caselaw Access Project alone contains roughly 40 million pages of US court decisions; Court Listener adds about 900,000 cases |
| Public domain books | Pre-1929 books from the Internet Archive and HathiTrust, Library of Congress digitized titles, Project Gutenberg, Biodiversity Heritage Library | About 300,000 public domain books in total |
| Open educational resources | Directory of Open Access Books, PressBooks, OERCommons, LibreTexts | DOAB contributes more than 94,000 peer reviewed open access books; LibreTexts adds roughly 3,000 open textbooks; PressBooks adds about 8,000 books |
| Online discussion | StackExchange (CC BY-SA), Ubuntu IRC public logs since 2004 | Q&A and chat content licensed under share-alike or public domain |
| Wikis and encyclopedias | Wikimedia projects (English), Wikiteam dumps | Encyclopedic content, manuals, and structured reference text |
| Transcribed audio | Creative Commons licensed YouTube | More than 1.1 million CC BY videos comprising over 470,000 hours of audio, transcribed using OpenAI's Whisper |
| Web text | Creative Commons Common Crawl (CCCC) | 52 Common Crawl snapshots filtered for documents declaring a Creative Commons license |
| Curated tasks and other | Data Provenance Initiative collections, Foodista recipes, CC licensed news, Public Domain Review | Domain specialty content not captured by the broader buckets |
Stack V2 is filtered down to repositories that ScanCode and BigCode tooling identified as carrying Blue Oak Council permissive licenses. The arXiv subset includes full text papers screened for CC BY, CC BY-SA, CC0, or arXiv's permissive licenses, plus abstracts released under CC0. The CCCC subset retains only HTML pages from Common Crawl that emit a machine readable Creative Commons tag and is the only portion of the dataset derived from a generic web scrape.
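As a rough illustration of what "a machine readable Creative Commons tag" can look like, the sketch below scans an HTML page for `rel="license"` links pointing at creativecommons.org. This is an assumed, simplified detection rule built on Python's standard `html.parser`; the class and function names are hypothetical and the actual CCCC pipeline may use different signals.

```python
from html.parser import HTMLParser


class LicenseLinkFinder(HTMLParser):
    """Collect hrefs of <a> or <link> tags that declare rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # Attribute values can be None for valueless attributes.
        rel_tokens = (attr.get("rel") or "").lower().split()
        href = attr.get("href") or ""
        if "license" in rel_tokens and "creativecommons.org" in href:
            self.license_urls.append(href)


def declared_cc_licenses(html: str) -> list[str]:
    """Return the Creative Commons license URLs a page declares, if any."""
    finder = LicenseLinkFinder()
    finder.feed(html)
    return finder.license_urls
```

A page that returns an empty list would be discarded under this rule. A page that returns a URL would still need its license variant checked, since NC and ND licenses also live under creativecommons.org.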
The transcribed audio component is unusual among major pretraining corpora. The team identified more than a million YouTube videos tagged with a Creative Commons Attribution license, downloaded the audio, and ran it through Whisper to produce transcripts. The resulting text adds spoken style content from lectures, tutorials, conference talks, podcasts, and interviews, partially offsetting the scarcity of conversational data elsewhere in the corpus.
The Common Pile pipeline applies uniform preprocessing on top of each source's ingestion logic. Documents are deduplicated within each source using locality sensitive hashing on MinHash signatures, and cross source deduplication is applied where appropriate. Language identification is performed with fastText, and only English documents are retained for the main release.
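The deduplication step can be sketched in pure Python as follows. This is a toy version of MinHash with LSH banding, not the project's implementation: the shingle size, number of hash functions, band count, and all function names are assumptions, and a production pipeline would use an optimized library.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=5):
    """Split text into overlapping k-word shingles."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_perm=64, k=5):
    """One minimum per salted hash function approximates a random permutation."""
    doc_shingles = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def lsh_candidate_pairs(signatures, bands=16):
    """Documents sharing any identical band become candidate duplicates."""
    rows = len(signatures[0]) // bands
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add((ids[i], ids[j]))
    return pairs
```

Banding trades precision for recall. More bands with fewer rows each catch lower-similarity pairs, so candidates are normally verified by comparing full signatures before a document is dropped.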
Filtering is performed per source. The CCCC subset receives the heaviest treatment using quality heuristics inspired by the FineWeb and Dolma pipelines, while higher trust sources such as PubMed Central, ArXiv, and the Library of Congress are subjected to lighter filtering aimed at fixing PDF extraction artifacts. Each source is used to train a small probe model, and the probe's downstream benchmark performance sets the mixing weight for that source in the final training mixture. Smaller, higher quality sources may repeat up to six times across a 1 trillion token training run, while the largest sources (Stack V2 and CCCC) contribute closer to a single pass. A cool down phase mixture is defined for the final tens of billions of tokens, skewing toward scholarly, encyclopedic, and educational sources, mirroring the approach popularized by OLMo.
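The relationship between mixing weights and repetition can be made concrete with a little arithmetic. The sketch below assumes the mixing weights are already given; how the probe models' benchmark scores are turned into weights is not specified here, and the function name and toy numbers are hypothetical.

```python
def epochs_per_source(source_tokens, mix_weights, budget_tokens):
    """Number of passes over each source implied by a training mixture.

    source_tokens: tokens available in each source.
    mix_weights:   relative sampling weights (need not sum to 1).
    budget_tokens: total tokens consumed by the training run.
    """
    total_weight = sum(mix_weights.values())
    return {
        name: (mix_weights[name] / total_weight) * budget_tokens / source_tokens[name]
        for name in source_tokens
    }


# Toy numbers: a small source and a large one sampled with equal weight.
example = epochs_per_source(
    source_tokens={"small_source": 10, "large_source": 100},
    mix_weights={"small_source": 1.0, "large_source": 1.0},
    budget_tokens=120,
)
```

With equal weights, the small source in this toy example is repeated six times while the large one is seen for less than a single pass, which is the same pattern the paragraph describes for the real mixture.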
To demonstrate that Common Pile is sufficient on its own to train competitive models, EleutherAI released two checkpoints under the Comma name. Both are decoder-only transformers in the LLaMA architectural family.
| Specification | Comma v0.1-1T | Comma v0.1-2T |
|---|---|---|
| Parameters | 7 billion | 7 billion |
| Training tokens | 1 trillion | 2 trillion |
| Context length | 4,096 | 4,096 |
| Vocabulary | 64,000 BPE (custom) | 64,000 BPE (custom) |
| Batch size | 512 sequences | 2,048 sequences |
| Optimizer | AdamW, weight decay 0.2 | AdamW, weight decay 0.2 |
| Peak learning rate | 1e-3 | 2e-3 |
| Minimum learning rate | 1e-9 | 2e-9 |
| Schedule | Cosine with cool down | Cosine with cool down |
| Stage 1 steps | 460,000 (plus 2,000 warmup) | 230,000 |
| Stage 2 (cool down) steps | 18,000 | 9,000 |
| Final checkpoint | Average of 10 cool down checkpoints | Average of cool down checkpoints |
| License | Open weights | Open weights |
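The step counts in the table can be checked against the stated token budgets. The sketch below assumes every sequence is packed to the full 4,096 token context and that the 2T run's listed steps already include any warmup; both are assumptions, since the table does not say.

```python
def tokens_seen(steps, batch_sequences, context_length):
    """Total tokens processed, assuming fully packed sequences."""
    return steps * batch_sequences * context_length


# Comma v0.1-1T: stage 1 + warmup + cool down, 512 sequences per batch.
comma_1t = tokens_seen(460_000 + 2_000 + 18_000, 512, 4_096)

# Comma v0.1-2T: stage 1 + cool down, 2,048 sequences per batch.
comma_2t = tokens_seen(230_000 + 9_000, 2_048, 4_096)

print(f"{comma_1t:,}")  # 1,006,632,960,000
print(f"{comma_2t:,}")  # 2,004,877,312,000
```

Both totals land within one percent of the nominal 1 trillion and 2 trillion token budgets, so the batch sizes and step counts in the table are mutually consistent.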
Comma v0.1-1T is positioned as the budget matched comparison point against LLaMA 1 7B, MPT-7B, RPJ-INCITE-7B, StableLM-7B, and OpenLLaMA-7B. The paper reports that it outperforms these baselines on more than half of the benchmarks tested. The suite covers ARC-Easy, ARC-Challenge, MMLU, BoolQ, HellaSwag, OpenBookQA, CommonsenseQA, PIQA, SIQA, HumanEval, and MBPP. Comma is particularly strong on knowledge intensive tasks (MMLU, ARC-Challenge) and on the code benchmarks HumanEval and MBPP, an advantage attributed to the heavy weighting of Stack V2.
Comma v0.1-2T is benchmarked against OLMo-7B-Twin-2T, LLaMA 2 7B, DeepSeek LLM 7B, and Qwen3 8B (the last included as a higher compute upper bound, trained on roughly 36 trillion tokens). Comma v0.1-2T is described as "competitive with OLMo, LLaMA 2, and DeepSeek LLM," trailing on commonsense reasoning benchmarks such as HellaSwag and PIQA but leading or matching on MMLU, ARC, and code. The authors note that the 2T run repeats some smaller sources up to 16 times and characterize it as "likely not a best case" execution, leaving room for higher quality 2T runs in future revisions.
The headline conclusion: "to the authors' knowledge," Common Pile v0.1 constitutes "the largest collection of openly licensed text to date," and the models trained on it "attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as LLaMA 1 and 2 7B." The authors interpret this as direct empirical refutation of the view that copyrighted text is a load bearing ingredient in frontier capability.
The paper reports a controlled experiment in which 1.7 billion parameter models were trained on 28 billion tokens drawn from a range of competing pretraining corpora, holding architecture and tokenizer constant. The following table summarizes how Common Pile compares to its peers, including other large open recipes published since 2023.
| Dataset | Curator | Approximate size | Licensing posture | Reported relative performance vs. Common Pile |
|---|---|---|---|---|
| The Pile | EleutherAI (2020) | 825 GiB | Mixed; later partial takedowns (Books3) | Pile model outperformed on HellaSwag and PIQA; Common Pile leads on most other benchmarks |
| Open License Corpus (OLC) | Min et al., 2023 | 0.85 TB across 12 sources | Open licensed | Common Pile clearly outperforms across all benchmarks |
| Common Corpus | Pleias, 2024 | About 7.4 TB | Open licensed | Common Pile clearly outperforms across all benchmarks |
| KL3M | 2024 | About 3 TB, mostly government | Open licensed | Common Pile clearly outperforms across all benchmarks |
| FineWeb | Hugging Face, 2024 | About 15 trillion tokens | Web scrape, unlicensed | FineWeb leads on most benchmarks, attributed to its much larger initial pool enabling aggressive filtering |
| Dolma | Allen Institute for AI, 2024 | About 3 trillion tokens | Mixed web scrape and curated, unlicensed | Used as a methodological reference for filtering; not directly compared head to head |
| RedPajama | Together AI, 2023 | About 1.2 trillion tokens (v1) and 30 trillion (v2) | Web scrape, unlicensed | Comma compared against RPJ-INCITE-7B; Common Pile leads on the majority of benchmarks at matched compute |
| OSCAR | Common Crawl derivative, 2019 onward | Multilingual web scrape | Web scrape, unlicensed | OSCAR model leads on HellaSwag and PIQA; Common Pile leads on most other benchmarks |
| Nemotron-CC | NVIDIA, 2024 | About 6.3 trillion tokens | Web scrape, unlicensed | Cited as a contemporary high quality recipe; not used as a direct head to head training source in the controlled study |
| DCLM | DataComp consortium, 2024 | Benchmark spanning 240 trillion token pool | Web scrape, unlicensed | Referenced as a state of the art benchmark for filtering recipes |
The paper's interpretation is that Common Pile cleanly dominates other openly licensed corpora and approaches the performance of unlicensed web scrapes. The remaining gap to FineWeb is attributed to FineWeb having a much larger raw input pool (more than ten times the Common Pile inputs), which allows its pipeline to discard a higher fraction of low quality text. The authors argue this gap should narrow as Common Pile grows in subsequent versions and as more publishers, archives, and educational repositories adopt machine readable open licensing.
Common Pile was greeted as a landmark release in the open source AI community. TechCrunch covered the launch on June 6, 2025, focusing on the contrast with The Pile's earlier copyright issues. Coverage in The Decoder and Gigazine emphasized the two year curation effort and the involvement of the US Library of Congress. Commentators at Open Future placed Common Pile within an emerging policy conversation about AI commons, alongside the Data Provenance Initiative, Pleias's Common Corpus, and the Allen Institute's Dolma, discussing it in tandem with the contemporaneous Institutional Books dataset from Harvard Law.
The project has drawn pointed critique. Some observers argued that even Common Pile cannot fully resolve the ambiguity of training on works such as US court opinions, which often quote copyrighted briefs, or transcribed audio, which can capture copyrighted music. The paper engages with these objections in its caveats and treats them as known limitations to be addressed in future revisions.
Common Pile v0.1 is hosted on Hugging Face under the common-pile organization, with each of the 30 component sources released as a separate dataset. The Comma v0.1-1T and Comma v0.1-2T model weights are released openly on Hugging Face, alongside training and evaluation code on GitHub. The paper is available as arXiv preprint 2506.05209.
EleutherAI has stated that Common Pile is intended to be the first in a continuing series of open licensed dataset releases. The v0.1 designation signals planned future revisions that will expand coverage, refine license verification, add multilingual sources, and address quality issues identified by the community. Biderman has indicated the organization will release "open datasets more frequently going forward," framing this cadence as a strategy to push back against the post 2023 retreat from data transparency at frontier labs.
The name Common Pile is a deliberate echo of EleutherAI's first major data release, but the two corpora reflect very different moments in the history of open AI. The Pile emerged in late 2020 when open replication of GPT-3 was the central goal of EleutherAI, and training data licensing was widely treated as fair use until proven otherwise. Common Pile was produced in a post Books3, post lawsuit environment in which the legal and reputational costs of unlicensed training data had become impossible to ignore. By restricting itself to works whose creators have explicitly granted broad redistribution rights, Common Pile accepts a smaller raw input pool in exchange for legal defensibility.
This approach is not without tradeoffs. The dataset is heavily weighted toward source code, government records, scholarly publications, and pre 1929 books, all of which carry distributional biases. Spoken conversational text remains underrepresented despite the YouTube transcript subset, and contemporary fiction is almost entirely absent. Where The Pile demonstrated that diverse curated text could outperform raw web crawl at small scale, Common Pile is intended to demonstrate that license cleanliness need not foreclose competitive scale either. Whether subsequent versions can close the remaining gap to FineWeb and similar web derived corpora is one of the central open questions for the open source AI ecosystem.
See also: EleutherAI, The Pile, Common Corpus, FineWeb, Dolma, Nemotron-CC, DCLM, RedPajama, Allen Institute for AI, Library of Congress, Large language models