Common Pile

Updated Jun 27, 2026

Common Pile v0.1 is an 8 terabyte corpus of openly licensed and public domain text, released on June 5, 2025, by EleutherAI and a consortium of more than two dozen academic and industry collaborators. Its authors describe it as the largest collection of openly licensed text assembled to date, drawing on 30 sources that span research papers, source code, government records, public domain books, encyclopedias, educational resources, audio transcripts, and licensed web text. ^[1] The project was designed as a copyright-conscious successor to The Pile, intended to demonstrate that a competitive large language model can be trained without recourse to copyrighted material of uncertain provenance.

Alongside the dataset, EleutherAI released two companion models, Comma v0.1-1T and Comma v0.1-2T. Both are 7 billion parameter decoder-only transformers, trained on 1 trillion and 2 trillion tokens respectively. According to the accompanying paper, the Comma models are competitive with LLaMA 1 7B and LLaMA 2 7B on standard benchmarks: the abstract states that "both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B." ^[1] This supports the project's central thesis that the gap between models trained on openly licensed text and on unfiltered web scrapes is driven largely by filtering quality rather than by the inclusion of copyrighted text. Common Pile v0.1 is hosted on Hugging Face, and its contents were selected to satisfy version 2.1 of the Open Definition. ^[1]^[2]

What is the Common Pile?

The Common Pile v0.1 is an 8 terabyte pretraining corpus built entirely from text whose licenses permit reuse, modification, and redistribution. Compiled over roughly two years, the dataset aggregates 30 distinct sources into ten thematic categories, ranging from peer-reviewed scientific papers and permissively licensed source code to pre-1929 public domain books and transcribed Creative Commons audio. ^[1]^[2] EleutherAI frames the release as evidence that openly licensed data is sufficient for training capable models: the abstract of the paper describes Common Pile v0.1 as, "to our knowledge, the largest corpus of openly licensed and public domain text suitable for LLM pretraining." ^[1]

Background and motivation

Common Pile v0.1 is the spiritual successor to The Pile, the 825 GiB diverse text corpus that EleutherAI released in December 2020. The original Pile was the first large pretraining corpus to be both fully described in an accompanying paper and made available for download, and it underpinned a generation of open models including GPT-Neo, GPT-J, GPT-NeoX-20B, Pythia, RWKV, and Cerebras-GPT. ^[9] However, several of The Pile's twenty-two components were collected without explicit permission from rights holders. The most prominent, Books3, contained roughly 197,000 full text books scraped from the private torrent tracker Bibliotik. After the Danish anti-piracy group Rights Alliance issued a takedown request in August 2023, Books3 was pulled from canonical hosting, and Hugging Face and EleutherAI removed it from their distribution channels. ^[10]

During the same period, a cascade of copyright lawsuits filed against OpenAI, Anthropic, Meta, Microsoft, and other AI companies argued that training on unlicensed copyrighted text constituted infringement. The New York Times, Sarah Silverman, Richard Kadrey, and the Authors Guild were among the plaintiffs. In an interview accompanying the release of Common Pile, EleutherAI executive director Stella Biderman observed that these lawsuits "have not meaningfully changed data sourcing practices in [model] training, but they have drastically decreased the transparency companies engage in," pushing closed labs to disclose less about their training data. ^[4] Biderman also pushed back on a widely held industry assumption, arguing, "in general, we think that the common idea that unlicensed text drives performance is unjustified." ^[4]

Common Pile was conceived as a direct response to both pressures: a license clean replacement for corpora that academic and independent groups had been quietly assembling from web scrapes of mixed provenance, and empirical evidence that open licensing does not condemn a model to second tier performance. The project marks an explicit step back from the permissive interpretations of fair use that characterized the 2020 to 2022 era of LLM pretraining, in favor of a license safety thesis that limits training data to works whose creators have granted permission to copy, modify, and redistribute. ^[1]

Who built the Common Pile?

Common Pile was assembled by roughly two dozen researchers spread across more than a dozen institutions, with the University of Toronto and Vector Institute taking the lead. The participating institutions and their roles are listed below.

Institution	Role
EleutherAI	Coordinating organization, infrastructure, evaluation, modeling
University of Toronto	Co-lead authorship and dataset curation
Vector Institute	Compute and research coordination (Toronto co-lead)
Hugging Face	Hosting, tokenization, distribution
Poolside	Industry partner; code and infrastructure
US Library of Congress	Public domain book digitization
Internet Archive	Public domain book hosting and digitization
Allen Institute for AI	Dataset curation, evaluation methodology
Massachusetts Institute of Technology	Research collaboration
Carnegie Mellon University	Research collaboration
Cornell University	License analysis and curation
University of Maryland, College Park	Modeling, training infrastructure
Lawrence Livermore National Laboratory	Compute and engineering
Teraflop AI	Pipeline engineering and curation
Lila Sciences	Curation and benchmarking

The paper's co-leads are Nikhil Kandpal and Brian Lester (Toronto and Vector), with Colin Raffel as senior author across Toronto, Vector, and Hugging Face. Stella Biderman, Sebastian Majstorovic, Baber Abbasi, Aviya Skowron, John Pressman, Honglu Fan, Dashiell Stander, Guangyu Song, and Lintang Sutawika contributed from EleutherAI, while Luca Soldaini and Tyler Murray represented the Allen Institute for AI. Additional contributors include A. Feder Cooper and Aaron Gokaslan (Cornell), John Kirchenbauer and Tom Goldstein (Maryland), Brian Bartoldson and Bhavya Kailkhura (Lawrence Livermore), Shayne Longpre (MIT), Alon Albalak (Lila Sciences), Enrico Shippole (Teraflop AI), and Guilherme Penedo, Loubna Ben Allal, and Elie Bakouch (Hugging Face). ^[1]

What counts as "openly licensed"?

Common Pile is built around a strict interpretation of what it means for content to be "openly licensed." The authors anchor their definition in version 2.1 of the Open Definition maintained by the Open Knowledge Foundation, which holds that a work is open if anyone is free to use, study, modify, and redistribute it for any purpose, including commercial use, subject only to requirements such as attribution and share-alike. ^[11]

Under that framing, the permissible license families included are:

Creative Commons CC0, CC BY, and CC BY-SA for prose, scholarly works, encyclopedias, and audio transcripts.
Public domain works, whether by copyright expiration (US books published before 1929), statutory exclusion (US federal government works under 17 U.S.C. section 105), or an explicit public domain designation (Creative Commons Public Domain Mark).
Permissive open source licenses listed by the Blue Oak Council, including MIT, BSD, and Apache 2.0. License detection on source code was performed using ScanCode Toolkit and BigCode tooling on the Software Heritage archive. ^[12]
Government open licenses such as the UK Open Parliament Licence for Hansard, and equivalent permissive terms for US Patent and Trademark Office filings.

The authors excluded Creative Commons Non-Commercial and No-Derivatives variants because they do not satisfy the Open Definition's requirement that any purpose be permitted. They also excluded datasets such as OpenAlex, YouTube Commons, and the Hacker News Kaggle dataset on the grounds that their license metadata was unreliable. The paper uses the term "license laundering" to describe situations where a curator labels a collection with an open license that its constituent documents do not actually carry, and the pipelines were designed to avoid that pitfall. ^[1]
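The inclusion and exclusion rules above amount to an allowlist check on each document's declared license. The following is a minimal sketch of that idea, not the project's actual pipeline: the SPDX-style identifiers, the `license` field name, and the helper names are all assumptions made for illustration.

```python
# Illustrative allowlist. The SPDX-style identifiers below are chosen for this
# sketch and are not taken from the Common Pile configuration.
OPEN_LICENSES = {
    "CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0",   # Creative Commons families kept
    "MIT", "BSD-3-Clause", "Apache-2.0",      # permissive software licenses
}


def is_openly_licensed(license_id: str) -> bool:
    """Return True only for identifiers on the allowlist."""
    return license_id.strip() in OPEN_LICENSES


def filter_documents(docs):
    """Keep documents whose declared license is on the allowlist.

    Documents with no `license` field (a hypothetical field name) are dropped
    rather than assumed open, which is the conservative reading of the text.
    """
    return [d for d in docs if is_openly_licensed(d.get("license", ""))]


if __name__ == "__main__":
    sample = [
        {"id": 1, "license": "CC-BY-4.0"},
        {"id": 2, "license": "CC-BY-NC-4.0"},  # Non-Commercial: excluded
        {"id": 3},                             # no metadata: excluded
    ]
    print([d["id"] for d in filter_documents(sample)])
```

An allowlist fails closed: a license string the curator has never seen is rejected by default, which matters when unreliable metadata is the main risk.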

The paper is candid about residual risk. The authors acknowledge that automatic license detection has limited recall, that licenses can change after collection, and that public domain documents sometimes quote copyrighted material. They release the dataset under the explicit caveat that "despite our best efforts at due diligence, data with restrictive licensing terms may have still ended up in our dataset." ^[1]

What sources are in the Common Pile?

The Common Pile v0.1 aggregates 30 distinct sources grouped into ten thematic categories. After license filtering, deduplication, and quality filtering, the dataset shrinks from a raw 8 TB to roughly 7.6 TB. Source code is the single largest contributor, accounting for more than half of the unfiltered total. ^[1]^[2]

Category	Sources	Notable size or scope
Source code	Stack V2 (open licensed subset), Python Enhancement Proposals	About 130 billion tokens across 69.6 million documents in code; Stack V2 alone supplies more than half of the raw 8 TB
Scientific and scholarly text	peS2o, PubMed Central, ArXiv papers, ArXiv abstracts	peS2o contributes about 273.9 billion tokens across 6.1 million documents; ArXiv contributes more than 2.4 million papers
Government and legal text	Caselaw Access Project, Court Listener, US Government Publishing Office, US Patent and Trademark Office (1782 to present), UK Hansard, Regulations.gov	Caselaw Access Project alone contains roughly 40 million pages of US court decisions; Court Listener adds about 900,000 cases
Public domain books	Pre-1929 books from the Internet Archive and HathiTrust, Library of Congress digitized titles, Project Gutenberg, Biodiversity Heritage Library	About 300,000 public domain books in total
Open educational resources	Directory of Open Access Books, PressBooks, OERCommons, LibreTexts	DOAB contributes more than 94,000 peer reviewed open access books; LibreTexts adds roughly 3,000 open textbooks; PressBooks adds about 8,000 books
Online discussion	StackExchange (CC BY-SA), Ubuntu IRC public logs since 2004	Q&A and chat content licensed under share-alike terms or in the public domain
Wikis and encyclopedias	Wikimedia projects (English), Wikiteam dumps	Encyclopedic content, manuals, and structured reference text
Transcribed audio	Creative Commons licensed YouTube	More than 1.1 million CC BY videos comprising over 470,000 hours of audio, transcribed using OpenAI's Whisper
Web text	Creative Commons Common Crawl (CCCC)	52 Common Crawl snapshots filtered for documents declaring a Creative Commons license
Curated tasks and other	Data Provenance Initiative collections, Foodista recipes, CC licensed news, Public Domain Review	Domain specialty content not captured by the broader buckets

Stack V2 is filtered down to repositories that ScanCode and BigCode tooling identified as carrying Blue Oak Council permissive licenses. The arXiv subset includes full text papers screened for CC BY, CC BY-SA, CC0, or arXiv's permissive licenses, plus abstracts released under CC0. The CCCC subset retains only HTML pages from Common Crawl that emit a machine readable Creative Commons tag and is the only portion of the dataset derived from a generic web scrape. ^[1]
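To make the idea of a machine readable Creative Commons tag concrete, here is a rough sketch that looks for `rel="license"` links pointing at an open Creative Commons deed. It uses only the Python standard library. The URL pattern and the choice to accept only CC BY, CC BY-SA, and CC0 are assumptions based on the license families listed earlier; the project's actual detection logic may differ.

```python
import re
from html.parser import HTMLParser

# Accept CC BY, CC BY-SA, and CC0 deed URLs. NC and ND variants do not match.
_OPEN_CC = re.compile(
    r"^https?://creativecommons\.org/"
    r"(licenses/(by|by-sa)/\d\.\d|publicdomain/zero/1\.0)(/|$)"
)


class LicenseLinkParser(HTMLParser):
    """Collect href values from <a> and <link> tags that carry rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "license" in rels and attr.get("href"):
            self.license_urls.append(attr["href"])


def declared_open_cc_license(html: str):
    """Return the first open CC license URL declared in the page, else None."""
    parser = LicenseLinkParser()
    parser.feed(html)
    for url in parser.license_urls:
        if _OPEN_CC.match(url.lower()):
            return url
    return None
```

A page that declares `https://creativecommons.org/licenses/by-nc/4.0/` returns `None` here, mirroring the exclusion of Non-Commercial content.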

The transcribed audio component is unusual among major pretraining corpora. The team identified more than a million YouTube videos tagged with a Creative Commons Attribution license, downloaded the audio, and ran it through Whisper to produce transcripts. The resulting text adds spoken style content from lectures, tutorials, conference talks, podcasts, and interviews, partially offsetting the scarcity of conversational data elsewhere in the corpus. ^[1]^[4]

How is the data filtered and quality controlled?

The Common Pile pipeline applies uniform preprocessing on top of each source's ingestion logic. Documents are deduplicated within each source using locality sensitive hashing on MinHash signatures, and cross source deduplication is applied where appropriate. Language identification is performed with fastText, and only English documents are retained for the main release. ^[1]
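As a rough illustration of near-duplicate detection with MinHash signatures and locality sensitive hashing, the sketch below uses only the standard library. The signature length, band count, and shingle size are arbitrary values picked for the example, and the function names are hypothetical; none of them are taken from the Common Pile code.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed for this sketch)
BANDS = 16     # LSH bands; each band covers NUM_PERM // BANDS rows
SHINGLE = 5    # word n-gram size (assumed for this sketch)


def shingles(text, n=SHINGLE):
    """Return the set of word n-grams in the text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash standing in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    toks = shingles(text)
    return tuple(min(_hash(seed, t) for t in toks) for seed in range(NUM_PERM))


def near_duplicate_pairs(docs):
    """docs maps id -> text. Returns candidate pairs that share at least one band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Banding trades precision for recall: more bands with fewer rows each catch more loosely similar pairs, at the cost of more false candidates that a second exact-similarity pass would need to reject.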

Filtering is performed per source. The CCCC subset receives the heaviest treatment using quality heuristics inspired by the FineWeb and Dolma pipelines, while higher trust sources such as PubMed Central, ArXiv, and the Library of Congress are subjected to lighter filtering aimed at fixing PDF extraction artifacts. Each source is used to train a small probe model, and the probe's downstream benchmark performance sets the mixing weight for that source in the final training mixture. Smaller, higher quality sources may repeat up to six times across a 1 trillion token training run, while the largest sources (Stack V2 and CCCC) contribute closer to a single pass. A cool down phase mixture is defined for the final tens of billions of tokens, skewing toward scholarly, encyclopedic, and educational sources, mirroring the approach popularized by OLMo. ^[1]
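The relationship between repeat counts and mixture weights can be worked through with a small calculation. The sketch below assumes the per-source repeat counts are already known; how they are derived from probe model results is not reproduced here, and the source names and token counts in the usage example are made-up toy values.

```python
def mixture_from_repeats(source_tokens, repeats):
    """Turn unique token counts and repeat counts into sampling weights.

    source_tokens: dict of source name -> unique tokens available
    repeats:       dict of source name -> number of passes over that source
    Returns (weights, total), where weights sum to 1 and total is the number
    of tokens the mixture supplies.
    """
    effective = {name: source_tokens[name] * repeats[name] for name in source_tokens}
    total = sum(effective.values())
    weights = {name: eff / total for name, eff in effective.items()}
    return weights, total


if __name__ == "__main__":
    # Toy numbers, in billions of tokens. Not the real Common Pile figures.
    tokens = {"code": 400, "web": 300, "wiki": 50}
    passes = {"code": 1, "web": 1, "wiki": 6}
    weights, total = mixture_from_repeats(tokens, passes)
    print(total, weights)
```

In this toy case the small source ends up with the same share of the mixture as a source six times its size, which is the effect upweighting by repetition is meant to achieve.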

What are the Comma models?

To demonstrate that Common Pile is sufficient on its own to train competitive models, EleutherAI released two checkpoints under the Comma name. Both are decoder-only transformers in the LLaMA architectural family. ^[1]^[2]

Specification	Comma v0.1-1T	Comma v0.1-2T
Parameters	7 billion	7 billion
Training tokens	1 trillion	2 trillion
Context length	4,096	4,096
Vocabulary	64,000 BPE (custom)	64,000 BPE (custom)
Batch size	512 sequences	2,048 sequences
Optimizer	AdamW, weight decay 0.2	AdamW, weight decay 0.2
Peak learning rate	1e-3	2e-3
Minimum learning rate	1e-9	2e-9
Schedule	Cosine with cool down	Cosine with cool down
Stage 1 steps	460,000 (plus 2,000 warmup)	230,000
Stage 2 (cool down) steps	18,000	9,000
Final checkpoint	Average of 10 cool down checkpoints	Average of cool down checkpoints
License	Open weights	Open weights
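The step counts and batch sizes in the table can be checked against the stated token budgets with simple arithmetic. The sketch below assumes every sequence is packed to the full 4,096-token context and leaves out the 2,000 warmup steps listed for the 1T run, since the table does not say whether they count toward the budget.

```python
CONTEXT_LENGTH = 4096  # tokens per sequence, from the table


def tokens_seen(batch_sequences, steps, context_length=CONTEXT_LENGTH):
    """Tokens processed, assuming every sequence fills the whole context."""
    return batch_sequences * context_length * steps


# Stage 1 plus cool down steps for each model, taken from the table.
one_t = tokens_seen(512, 460_000 + 18_000)
two_t = tokens_seen(2048, 230_000 + 9_000)

if __name__ == "__main__":
    print(f"Comma v0.1-1T: {one_t / 1e12:.3f} trillion tokens")
    print(f"Comma v0.1-2T: {two_t / 1e12:.3f} trillion tokens")
```

Under these assumptions both runs land within about half a percent of their nominal 1 trillion and 2 trillion token budgets, so the table's step counts and batch sizes are mutually consistent.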

Comma v0.1-1T is positioned as the budget matched comparison point against LLaMA 1 7B, MPT-7B, RPJ-INCITE-7B, StableLM-7B, and OpenLLaMA-7B. The paper reports that it outperforms these baselines on more than half of the benchmarks tested. The suite covers ARC-Easy, ARC-Challenge, MMLU, BoolQ, HellaSwag, OpenBookQA, CommonsenseQA, PIQA, SIQA, HumanEval, and MBPP. Comma is particularly strong on knowledge intensive tasks (MMLU, ARC-Challenge) and on the code benchmarks HumanEval and MBPP, an advantage attributed to the heavy weighting of Stack V2. ^[1]

Comma v0.1-2T is benchmarked against OLMo-7B-Twin-2T, LLaMA 2 7B, DeepSeek LLM 7B, and Qwen3 8B (the latter included as a higher compute upper bound, trained on roughly 36 trillion tokens). Comma v0.1-2T is described as "competitive with OLMo, LLaMA 2, and DeepSeek LLM," trailing on commonsense reasoning benchmarks such as HellaSwag and PIQA but leading or matching on MMLU, ARC, and code. The authors note the 2T run repeats some smaller sources up to 16 times and characterize it as "likely not a best case" execution, leaving room for higher quality 2T runs in future revisions. ^[1]

The paper's headline conclusion is that models trained on Common Pile v0.1 "attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B." ^[1] The authors interpret this as empirical evidence against the view that unlicensed copyrighted text is a necessary ingredient for competitive performance at the 7 billion parameter scale.

How does the Common Pile compare to other corpora?

The paper reports a controlled experiment in which 1.7 billion parameter models were trained on 28 billion tokens drawn from a range of competing pretraining corpora, holding architecture and tokenizer constant. The following table summarizes how Common Pile compares to its peers, including other large open recipes published since 2023. ^[1]

Dataset	Curator	Approximate size	Licensing posture	Reported relative performance vs. Common Pile
The Pile	EleutherAI (2020)	825 GiB	Mixed; later partial takedowns (Books3)	Pile model outperformed on HellaSwag and PIQA; Common Pile leads on most other benchmarks
Open License Corpus (OLC)	Min et al., 2023	0.85 TB across 12 sources	Open licensed	Common Pile clearly outperforms across all benchmarks
Common Corpus	Pleias, 2024	About 7.4 TB	Open licensed	Common Pile clearly outperforms across all benchmarks
KL3M	2024	About 3 TB, mostly government	Open licensed	Common Pile clearly outperforms across all benchmarks
FineWeb	Hugging Face, 2024	About 15 trillion tokens	Web scrape, unlicensed	FineWeb leads on most benchmarks, attributed to its much larger initial pool enabling aggressive filtering
Dolma	Allen Institute for AI, 2024	About 3 trillion tokens	Mixed web scrape and curated, unlicensed	Used as a methodological reference for filtering; not directly compared head to head
RedPajama	Together AI, 2023	About 1.2 trillion tokens (v1) and 30 trillion (v2)	Web scrape, unlicensed	Comma compared against RPJ-INCITE-7B; Common Pile leads on the majority of benchmarks at matched compute
OSCAR	Common Crawl derivative, 2019 onward	Multilingual web scrape	Web scrape, unlicensed	OSCAR model leads on HellaSwag and PIQA; Common Pile leads on most other benchmarks
Nemotron-CC	NVIDIA, 2024	About 6.3 trillion tokens	Web scrape, unlicensed	Cited as a contemporary high quality recipe; not used as a direct head to head training source in the controlled study
DCLM	DataComp consortium, 2024	Benchmark spanning 240 trillion token pool	Web scrape, unlicensed	Referenced as a state of the art benchmark for filtering recipes

The paper's interpretation is that Common Pile cleanly dominates other openly licensed corpora and approaches the performance of unlicensed web scrapes. The remaining gap to FineWeb is attributed to FineWeb having a much larger raw input pool (more than ten times the Common Pile inputs), which allows its pipeline to discard a higher fraction of low quality text. The authors argue this gap should narrow as Common Pile grows in subsequent versions and as more publishers, archives, and educational repositories adopt machine readable open licensing. ^[1]

Reception and discussion

Common Pile was greeted as a landmark release in the open source AI community. TechCrunch covered the launch on June 6, 2025, focusing on the contrast with The Pile's earlier copyright issues. ^[4] Coverage in The Decoder and Gigazine emphasized the two year curation effort and the involvement of the US Library of Congress. ^[5]^[7] Commentators at Open Future placed Common Pile within an emerging policy conversation about AI commons, alongside the Data Provenance Initiative, Pleias's Common Corpus, and the Allen Institute's Dolma, discussing it in tandem with the contemporaneous Institutional Books dataset from Harvard Law. ^[13]

The project has drawn pointed critique. Some observers argued that even Common Pile cannot fully resolve the ambiguity of training on works such as US court opinions, which often quote copyrighted briefs, or transcribed audio, which can capture copyrighted music. The paper engages with these objections in its caveats and treats them as known limitations to be addressed in future revisions. ^[1]

Where can you download the Common Pile?

Common Pile v0.1 is hosted on Hugging Face under the common-pile organization, with each of the 30 component sources released as a separate dataset. ^[8] The Comma v0.1-1T and Comma v0.1-2T model weights are released openly on Hugging Face, alongside training and evaluation code on GitHub. The paper is available as arXiv preprint 2506.05209. ^[1]
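For readers who want to inspect a component source without downloading it in full, a minimal sketch using the Hugging Face `datasets` library in streaming mode might look like the following. The organization name comes from the text above; the `train` split and any specific source name passed in are assumptions that should be checked against the organization page before use.

```python
from itertools import islice

ORG = "common-pile"  # organization name given in the article


def repo_id(source_name: str) -> str:
    """Build a Hugging Face repository id for one component source."""
    return f"{ORG}/{source_name}"


def preview(source_name: str, n: int = 3):
    """Stream the first n records of a source.

    Requires `pip install datasets` and network access. The "train" split
    name is an assumption; individual repositories may be laid out differently.
    """
    from datasets import load_dataset  # imported lazily so the module loads without it

    ds = load_dataset(repo_id(source_name), split="train", streaming=True)
    return list(islice(ds, n))
```

A call such as `preview("arxiv_papers")` would then return the first few records, where `arxiv_papers` is a placeholder name used only for illustration. Streaming avoids materializing a multi-terabyte corpus on local disk.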

EleutherAI has stated that Common Pile is intended to be the first in a continuing series of open licensed dataset releases. The v0.1 designation signals planned future revisions that will expand coverage, refine license verification, add multilingual sources, and address quality issues identified by the community. ^[2] Biderman has indicated the organization will release "open datasets more frequently going forward," framing this cadence as a strategy to push back against the post 2023 retreat from data transparency at frontier labs. ^[4]

How is it different from The Pile?

The name Common Pile is a deliberate echo of EleutherAI's first major data release, but the two corpora reflect very different moments in the history of open AI. The Pile emerged in late 2020 when open replication of GPT-3 was the central goal of EleutherAI, and training data licensing was widely treated as fair use until proven otherwise. Common Pile was produced in a post Books3, post lawsuit environment in which the legal and reputational costs of unlicensed training data had become impossible to ignore. By restricting itself to works whose creators have explicitly granted broad redistribution rights, Common Pile accepts a smaller raw input pool in exchange for legal defensibility. ^[1]^[4]

This approach is not without tradeoffs. The dataset is heavily weighted toward source code, government records, scholarly publications, and pre-1929 books, all of which carry distributional biases. Spoken conversational text remains underrepresented despite the YouTube transcript subset, and contemporary fiction is almost entirely absent. Where The Pile demonstrated that diverse curated text could outperform raw web crawl at small scale, Common Pile is intended to demonstrate that license cleanliness need not foreclose competitive scale either. Whether subsequent versions can close the remaining gap to FineWeb and similar web derived corpora is one of the central open questions for the open source AI ecosystem. ^[1]

ELI5: Explain it like I'm 5

Imagine you want to teach a robot to read by giving it a giant pile of books, websites, and notes. The problem is that a lot of those books belong to other people who never said you could copy them, and they can get upset. The Common Pile is a huge pile of reading material (about 8 terabytes, which is like millions of books) where every single piece comes with a note that says "you are allowed to use and share me." EleutherAI and their friends spent about two years gathering only this allowed material. Then they used it to teach two robots, called Comma, and those robots turned out to be just as smart as other robots that learned from the "you did not ask permission" pile. That shows you can build a smart AI politely, without taking things you are not supposed to.

References

1. Kandpal, Nikhil; Lester, Brian; Raffel, Colin; et al. "The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text." arXiv preprint arXiv:2506.05209, June 5, 2025. https://arxiv.org/abs/2506.05209
2. EleutherAI Blog. "The Common Pile v0.1." June 5, 2025. https://blog.eleuther.ai/common-pile/
3. EleutherAI. "Common Pile v0.1." Project page, June 2025. https://www.eleuther.ai/news/common-pile-v01
4. Wiggers, Kyle. "EleutherAI releases massive AI training dataset of licensed and open domain text." TechCrunch, June 6, 2025. https://techcrunch.com/2025/06/06/eleutherai-releases-massive-ai-training-dataset-of-licensed-and-open-domain-text/
5. The Decoder. "Researchers build massive AI training dataset using only openly licensed sources." June 2025. https://the-decoder.com/researchers-build-massive-ai-training-dataset-using-only-openly-licensed-sources/
6. Biderman, Stella. "The Common Pile v0.1." Hugging Face blog, June 2025. https://huggingface.co/blog/stellaathena/common-pile
7. Gigazine. "AI research institute EleutherAI releases 'Common Pile v0.1', a massive dataset of about 8TB consisting only of public domain and open license content." June 9, 2025. https://gigazine.net/gsc_news/en/20250609-eleutherai-common-pile-v-0-1/
8. common-pile organization page on Hugging Face. https://huggingface.co/common-pile
9. Gao, Leo; Biderman, Stella; Black, Sid; et al. "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv preprint arXiv:2101.00027, December 2020. https://arxiv.org/abs/2101.00027
10. TorrentFreak. "Anti-Piracy Group Takes Prominent AI Training Dataset 'Books3' Offline." August 16, 2023. https://torrentfreak.com/anti-piracy-group-takes-prominent-ai-training-dataset-books3-offline-230816/
11. Open Knowledge Foundation. "Open Definition 2.1." https://opendefinition.org/od/2.1/en/
12. Blue Oak Council. "License List." https://blueoakcouncil.org/list
13. Open Future. "Common Pile and Institutional Books datasets chart two pathways for AI Commons." June 2025. https://openfuture.eu/blog/common-pile-and-institutional-books-datasets-chart-two-pathways-for-ai-commons/
