The Pile is an 825.18 GiB (approximately 886 GB) English text corpus designed for training large language models. Created by EleutherAI, a grassroots collective of AI researchers, The Pile aggregates 22 diverse, high-quality subsets drawn from academic, professional, internet, literary, and miscellaneous sources. The dataset was publicly released on December 31, 2020, and the accompanying paper, "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," was authored by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy.
The Pile became one of the most widely used open-source training datasets in the history of natural language processing. It served as the foundation for numerous influential language models, including GPT-Neo, GPT-J, GPT-NeoX-20B, the Pythia suite, RWKV, and Cerebras-GPT. Its release marked a turning point in democratizing access to high-quality training data for the broader research community.
Before The Pile, most publicly available text datasets for language model training consisted primarily of web-scraped data from Common Crawl. While Common Crawl provided massive scale, its content was heavily skewed toward general web pages and lacked representation from specialized domains such as scientific literature, legal documents, source code, and published books. OpenAI's GPT-3 was trained on a curated mix of data sources, but the specifics of its dataset were never publicly released, leaving the open-source community without a comparable resource.
EleutherAI formed in July 2020 as a Discord-based collective of volunteer researchers with the goal of replicating GPT-3 in an open-source setting. The group recognized that data quality and diversity were just as important as model architecture and compute. Rather than relying solely on Common Crawl, the team set out to build a training corpus that combined web data with curated, domain-specific text from a wide range of fields.
The key insight behind The Pile was that training on a more diverse mixture of text would produce models with broader capabilities. The authors hypothesized that exposure to scientific papers, legal opinions, source code, mathematical problems, and other specialized content during pre-training would give language models a richer understanding of language and reasoning. Their experiments confirmed this: a GPT-2-sized model trained on The Pile outperformed models of the same size trained on raw Common Crawl or the CC-100 dataset across multiple benchmarks.
The Pile is constructed from 22 component datasets spanning five broad categories: academic writing, internet content, prose and literature, dialogue, and miscellaneous sources. Each component was selected to contribute a distinct type of text that would be underrepresented in a pure web crawl. The following table provides a complete breakdown of all 22 components.
| Component | Raw Size (GiB) | Weight (%) | Epochs | Effective Size (GiB) | Documents | Description |
|---|---|---|---|---|---|---|
| Pile-CC | 227.12 | 18.11 | 1.0 | 227.12 | 54,953,117 | Filtered web pages from Common Crawl, extracted using jusText |
| PubMed Central | 90.27 | 14.40 | 2.0 | 180.55 | 3,098,931 | Full-text biomedical and life science research articles |
| Books3 | 100.96 | 12.07 | 1.5 | 151.44 | 196,640 | Collection of published books sourced from Bibliotik |
| OpenWebText2 | 62.77 | 10.01 | 2.0 | 125.54 | 17,103,059 | Web pages linked from high-scoring Reddit posts |
| ArXiv | 56.21 | 8.96 | 2.0 | 112.42 | 1,264,405 | Scientific preprints converted from LaTeX to Markdown |
| GitHub | 95.16 | 7.59 | 1.0 | 95.16 | 19,021,454 | Source code files from public repositories |
| FreeLaw | 51.15 | 6.12 | 1.5 | 76.73 | 3,562,015 | Legal opinions from the CourtListener database |
| Stack Exchange | 32.20 | 5.13 | 2.0 | 64.39 | 15,622,475 | Questions and top-voted answers from the Stack Exchange network |
| USPTO Backgrounds | 22.90 | 3.65 | 2.0 | 45.81 | 5,883,037 | Background sections of United States patent applications |
| PubMed Abstracts | 19.26 | 3.07 | 2.0 | 38.53 | 15,518,009 | Abstracts from biomedical research papers indexed in PubMed |
| Gutenberg (PG-19) | 10.88 | 2.17 | 2.5 | 27.19 | 28,602 | Public-domain books from Project Gutenberg |
| OpenSubtitles | 12.98 | 1.55 | 1.5 | 19.47 | 446,612 | Movie and television subtitles in English |
| Wikipedia (en) | 6.38 | 1.53 | 3.0 | 19.13 | 6,033,151 | English Wikipedia articles |
| DM Mathematics | 7.75 | 1.24 | 2.0 | 15.49 | 1,014,997 | Algorithmically generated math problems from DeepMind |
| Ubuntu IRC | 5.52 | 0.88 | 2.0 | 11.03 | 10,605 | Chat logs from the Ubuntu support IRC channel |
| BookCorpus2 | 6.30 | 0.75 | 1.5 | 9.45 | 17,868 | Collection of freely available books |
| EuroParl | 4.59 | 0.73 | 2.0 | 9.17 | 69,814 | Proceedings of the European Parliament |
| HackerNews | 3.90 | 0.62 | 2.0 | 7.80 | 831,198 | Comments and discussions from Hacker News |
| YouTube Subtitles | 3.73 | 0.60 | 2.0 | 7.47 | 173,651 | Automatically generated and human-written video captions |
| PhilPapers | 2.38 | 0.38 | 2.0 | 4.76 | 33,990 | Academic papers in philosophy |
| NIH ExPORTER | 1.89 | 0.30 | 2.0 | 3.79 | 939,668 | Abstracts from NIH-funded research grants |
| Enron Emails | 0.88 | 0.14 | 2.0 | 1.76 | 517,401 | Emails from the Enron Corporation, released during federal investigation |
| Total | 825.18 | 100.00 | | 1,254.20 | ~211 million | |
The "Weight" column indicates what fraction of the final training mixture each component represents. The "Epochs" column shows how many times each component is seen during a single pass through The Pile. Smaller but high-quality datasets such as Wikipedia and PubMed Central are upsampled (repeated more than once), while larger datasets like Pile-CC and GitHub are seen only once. This upsampling strategy ensures that the model receives adequate exposure to specialized content without being overwhelmed by web text.
The "Effective Size" column represents the total amount of data the model actually processes for each component, accounting for upsampling. The effective total of 1,254.20 GiB means that a single epoch through The Pile involves reading substantially more data than the raw 825.18 GiB on disk.
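The arithmetic linking these columns is simple to verify; the sketch below derives effective size and mixture weight from raw size and epoch count for three representative components, with figures taken from the table above.

```python
# Effective size = raw size x epochs; weight = effective size / effective total.
# (GiB, epochs) pairs taken from the component table above.
components = {
    "Pile-CC":        (227.12, 1.0),
    "PubMed Central": (90.27,  2.0),
    "Wikipedia (en)": (6.38,   3.0),
}
EFFECTIVE_TOTAL_GIB = 1254.20

for name, (raw_gib, epochs) in components.items():
    effective = raw_gib * epochs
    weight_pct = 100.0 * effective / EFFECTIVE_TOTAL_GIB
    print(f"{name}: {effective:.2f} GiB effective, {weight_pct:.2f}% of mixture")
```

The small discrepancies against the table (e.g. 19.14 vs. 19.13 GiB for Wikipedia) are rounding in the published figures.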
The construction of The Pile involved extensive processing, filtering, and deduplication steps tailored to each component dataset.
The largest single component, Pile-CC, was derived from Common Crawl data. Rather than using the pre-extracted WET files that Common Crawl provides, the team processed the raw WARC (Web ARChive) files using the jusText extraction tool. This approach produced higher-quality text by more effectively separating article content from boilerplate, navigation menus, and other non-content elements.
After text extraction, the pipeline applied language detection with pycld2 to retain English-language content. A fastText classifier, trained to distinguish high-quality text (using OpenWebText2 as positive examples) from unfiltered Common Crawl content, then scored each document; documents were discarded stochastically, with each document's score compared against a threshold drawn from a Pareto distribution, following the filtering approach described in the GPT-3 paper. Document-level deduplication was also performed on Pile-CC.
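The GPT-3 paper describes this style of stochastic quality filter: a document is kept when a Pareto-distributed draw exceeds one minus its classifier score, so high-scoring documents are almost always retained while low-scoring ones survive only occasionally. A minimal sketch, where the shape parameter and function name are illustrative rather than taken from the Pile codebase:

```python
import random

def keep_document(quality_score: float, alpha: float = 9.0) -> bool:
    """Stochastic Pareto-threshold filter in the style described by the
    GPT-3 paper: keep a document when a Pareto(alpha) draw exceeds
    1 - score. The alpha value here is an illustrative choice."""
    # random.paretovariate has support [1, inf); shift it to [0, inf).
    pareto_draw = random.paretovariate(alpha) - 1.0
    return pareto_draw > 1.0 - quality_score
```

A score of 1.0 makes the right-hand side zero, so such documents always pass; a score of 0.0 passes only when the draw exceeds 1, which happens rarely for a heavy-tailed but concentrated distribution like Pareto(9).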
OpenWebText2 extended the original OpenWebText dataset by extracting URLs from Reddit submissions with a score of 3 or higher, then scraping the linked pages with the Newspaper library. Deduplication was performed at the document level using an in-memory MinHashLSH index from the DataSketch library.
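MinHash estimates the Jaccard similarity of two documents' shingle sets by comparing per-permutation minimum hashes; an LSH index over these signatures then makes near-duplicate lookup sublinear. A self-contained sketch of the signature step (shingle size and permutation count are illustrative; the actual pipeline used the datasketch library rather than this hand-rolled version):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Overlapping n-word shingles of a document."""
    words = text.split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_perm: int = 64) -> list:
    """One minimum hash per 'permutation', simulated by salting the hash."""
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold are treated as duplicates and one copy is dropped.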
The ArXiv component was built by downloading LaTeX source files directly from the ArXiv S3 bulk access interface. These LaTeX sources were then converted to Markdown using pandoc, preserving the mathematical and structural content of research papers while removing LaTeX-specific formatting.
The GitHub component targeted public repositories, filtering for files under 100 KB to exclude binary files and extremely large generated files. This component contributed source code in a wide range of programming languages.
For each question on the Stack Exchange network, the top three upvoted answers (with a minimum of 3 upvotes) were extracted and formatted alongside the original question. This filtering ensured that only substantive, community-vetted answers were included.
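That selection rule can be sketched as follows, assuming answers arrive as (score, text) pairs; the "Q:"/"A:" formatting is illustrative rather than the Pile's exact template.

```python
def select_answers(answers, top_k=3, min_score=3):
    """Keep the top_k highest-scoring answers with at least min_score upvotes."""
    eligible = [a for a in answers if a[0] >= min_score]
    return sorted(eligible, key=lambda a: a[0], reverse=True)[:top_k]

def format_qa(question, answers):
    """Render a question plus its vetted answers as one training document."""
    parts = [f"Q: {question}"]
    parts += [f"A: {text}" for score, text in select_answers(answers)]
    return "\n\n".join(parts)
```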
To prevent data leakage between the training set and evaluation benchmarks, the authors applied 13-gram overlap filtering, consistent with the methodology used in the GPT-3 paper. This process identified and removed training documents that contained 13-gram sequences overlapping with standard evaluation datasets.
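In outline, the check builds the set of 13-grams appearing in the evaluation benchmarks and flags any training document that shares one. A minimal set-based sketch (a production pipeline would hash n-grams to bound memory):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_benchmark(doc_tokens, benchmark_ngrams, n=13):
    """True if the document shares any n-gram with the benchmark set."""
    return not ngrams(doc_tokens, n).isdisjoint(benchmark_ngrams)
```

Documents for which `overlaps_benchmark` returns true are removed from the training split.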
The Pile uses a carefully designed sampling strategy rather than simply concatenating all components in proportion to their raw sizes. The authors observed that naively mixing data by size would cause the model to spend the vast majority of its training on web-scraped text while seeing very little of the smaller, specialized datasets.
To address this, smaller high-quality datasets are upsampled. For example, English Wikipedia (6.38 GiB raw) is repeated 3 times per epoch, while PubMed Central (90.27 GiB raw) is repeated twice. Conversely, very large datasets like Pile-CC and GitHub are seen only once per epoch. The weights were chosen through a combination of heuristics and preliminary experiments, aiming to balance breadth of coverage against the risk of overfitting on small components.
The Pile paper presented benchmark results comparing a 1.3-billion-parameter GPT-2-style model trained on The Pile against equivalent models trained on CC-100 (a filtered, monolingual Common Crawl dataset) and raw Common Crawl. All models were trained on 40 GB subsamples to ensure a fair, size-controlled comparison.
| Metric | The Pile | CC-100 | Raw CC |
|---|---|---|---|
| Pile BPB (validation) | 0.9281 | 1.3143 | 1.1180 |
| Pile BPB (test) | 0.9433 | 1.3293 | 1.1275 |
| WikiText perplexity | 5.59 | 8.27 | 11.75 |
| LAMBADA perplexity | 12.78 | 11.78 | 19.84 |
| LAMBADA accuracy | 50.1% | 49.7% | 43.8% |
BPB stands for "bits per byte," a metric that measures how well the model predicts text (lower is better). The Pile-trained model achieved the best BPB scores on The Pile's own validation and test sets and the best WikiText perplexity, while also achieving competitive or superior LAMBADA accuracy. The authors noted that the Pile-trained model showed "significant improvements over both Raw CC and CC-100 on all components," demonstrating that data diversity translates directly into better generalization.
The Pile became the default training dataset for EleutherAI's family of open-source language models and was adopted by numerous other organizations. The following table lists notable models trained primarily or entirely on The Pile.
| Model | Organization | Parameters | Year | Notes |
|---|---|---|---|---|
| GPT-Neo | EleutherAI | 125M, 1.3B, 2.7B | 2021 | First open-source GPT-3-style models |
| GPT-J-6B | EleutherAI | 6B | 2021 | Trained using Mesh Transformer JAX; largest open model at the time |
| GPT-NeoX-20B | EleutherAI | 20B | 2022 | Largest dense open-source model at time of release |
| Pythia | EleutherAI | 70M to 12B (8 sizes) | 2023 | Suite of models for interpretability research; trained on both The Pile and a deduplicated version |
| RWKV-4 | BlinkDL | 169M to 14B | 2022-2023 | RNN architecture with transformer-level performance |
| Cerebras-GPT | Cerebras Systems | 111M to 13B (7 sizes) | 2023 | Compute-optimal models following Chinchilla scaling laws |
| Stable LM | Stability AI | Various | 2023 | Built upon an experimental dataset derived from The Pile |
EleutherAI's first models trained on The Pile were the GPT-Neo series, released in March 2021 in sizes of 125M, 1.3B, and 2.7B parameters. These were followed in June 2021 by GPT-J-6B, a 6-billion-parameter model that became the largest publicly available GPT-3-style model at the time. GPT-J was trained using Ben Wang's Mesh Transformer JAX library and employed Rotary Position Embedding (RoPE). Both model families demonstrated that open-source models trained on a diverse, publicly available dataset could approach the performance of proprietary systems.
Released in February 2022, GPT-NeoX-20B was a 20-billion-parameter autoregressive language model trained on The Pile. At the time, it was the largest dense, publicly available language model in the world. The model was trained using the GPT-NeoX library, which extended the Megatron and DeepSpeed frameworks for efficient distributed training.
The Pythia suite, described in a paper by Biderman et al. (2023), consists of 16 models: 8 sizes (70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B parameters) each trained on two versions of The Pile (standard and globally deduplicated). All models at each size were trained on the exact same data in the exact same order, with 154 intermediate checkpoints saved per model. This controlled experimental setup was specifically designed to support research on interpretability and training dynamics. Each model was trained for approximately 300 billion tokens.
RWKV is a novel architecture that combines the training efficiency of transformers with the inference efficiency of recurrent neural networks. The RWKV-4 Pile series included models at 169M, 430M, 3B, 7B, and 14B parameters, all trained on The Pile for approximately 332 billion tokens each. RWKV demonstrated that non-transformer architectures could achieve competitive language modeling performance when trained on high-quality, diverse data.
Cerebras Systems released a family of seven GPT models (111M to 13B parameters) in March 2023, all trained on The Pile using Cerebras wafer-scale hardware. These models followed DeepMind's Chinchilla scaling laws, training each model on approximately 20 tokens per parameter. Cerebras-GPT was released under the Apache 2.0 license.
The most significant controversy surrounding The Pile centers on its Books3 component, which contained 196,640 books in plaintext format, totaling about 101 GiB. The books were originally collected from Bibliotik, a private BitTorrent tracker known for hosting pirated e-books. Shawn Presser, who compiled and shared the Books3 dataset, argued that the books were necessary for creating diverse and unbiased language models.
In August 2023, the Danish anti-piracy organization Rights Alliance issued a DMCA takedown notice targeting the Books3 dataset, which had been hosted on a website called The Eye. Rights Alliance identified approximately 150 titles belonging to its member publishers within the dataset. The Eye complied with the takedown request, and the primary download link for Books3 was taken offline.
Rights Alliance also contacted Hugging Face and EleutherAI, both of which had hosted links to Books3. Maria Fredenslund, director of Rights Alliance, stated that it was "absolutely crucial" to prevent AI from being trained on illegally obtained content. Bloomberg also confirmed to Rights Alliance that it would not use Books3 to train future versions of its BloombergGPT model.
In July 2023, authors Richard Kadrey, Sarah Silverman, and Christopher Golden filed copyright infringement lawsuits against both OpenAI and Meta, alleging that these companies used Books3 (as part of The Pile) to train their language models without permission. The plaintiffs argued that the use of copyrighted material for AI training constituted infringement.
Meta acknowledged in court proceedings that it had used The Pile, including Books3, to train its LLaMA models. In June 2025, a federal court in California issued its decision in Kadrey v. Meta, ruling on summary judgment that Meta's copying of books for AI training qualified as fair use. The ruling represented a significant legal precedent for the AI industry, although appeals and related cases continue.
Despite the takedown of the original download link, copies of the Books3 dataset persisted on the Internet Archive's Wayback Machine and through alternative distribution channels, making complete removal effectively impossible.
The Pile itself was released under the MIT License. However, the individual component datasets carry their own licensing terms, and some components contain copyrighted material. The creators acknowledged that Books3 contained copyrighted books and that substantial portions of ArXiv papers and PhilPapers articles were also under copyright. The Pile's authors asserted that use of the dataset fell under fair use for research purposes, though they recommended consulting intellectual property attorneys for jurisdiction-specific guidance.
A formal "Datasheet for the Pile" was published by Biderman et al. in January 2022 (arXiv:2201.07311), providing detailed documentation of each component's provenance, collection methodology, and potential ethical concerns. Among the considerations it noted: all data in The Pile was collected before September 1, 2020.
The Pile was released at a time when few large-scale, open training datasets existed. In the years that followed, several alternative and successor datasets emerged, each taking a different approach to data curation.
| Dataset | Size (Tokens) | Year | Source | Approach |
|---|---|---|---|---|
| The Pile | ~1.35 trillion | 2020 | 22 diverse sources | Curated multi-domain mixture |
| C4 | ~156 billion | 2019 | Single Common Crawl snapshot | Filtered web text |
| RefinedWeb | ~5 trillion | 2023 | Common Crawl | Heavily filtered and deduplicated web data |
| FineWeb | ~15 trillion | 2024 | 96 Common Crawl snapshots | Large-scale filtered web text |
| The Common Pile | ~2.1 trillion | 2025 | 30 openly licensed sources | Public domain and openly licensed text only |
C4, created by Google for training the T5 model, was derived from a single April 2019 Common Crawl snapshot. At approximately 156 billion tokens, it was substantially smaller than The Pile and consisted entirely of filtered web text. C4 applied heuristic filters to extract natural language and performed deduplication, but it lacked the domain diversity that The Pile provided through its curated component structure.
RefinedWeb, created by the Technology Innovation Institute (TII) for the Falcon language model, demonstrated that sufficiently well-filtered web data could match or exceed the performance of curated datasets like The Pile. The RefinedWeb paper (Penedo et al., 2023) applied aggressive filtering and multi-level deduplication to Common Crawl, producing approximately 5 trillion tokens of high-quality English text. Models trained on RefinedWeb performed comparably to those trained on The Pile across standard benchmarks, challenging the assumption that domain-diverse curation was strictly necessary.
Notably, the RefinedWeb analysis found that The Pile exhibited 40 to 60 percent duplicate content, suggesting that more aggressive deduplication could have improved its quality.
FineWeb, released by Hugging Face in 2024, processed 96 Common Crawl snapshots to produce approximately 15 trillion tokens of cleaned English text. Models trained on FineWeb outperformed those trained on The Pile, C4, and several other datasets on aggregate benchmark tasks. FineWeb represented a shift toward massive-scale, web-only datasets with sophisticated filtering pipelines.
In response to the copyright controversies surrounding The Pile, EleutherAI spent two years developing a successor dataset called The Common Pile. Released in June 2025, The Common Pile v0.1 is an 8 TB corpus comprising text from 30 distinct sources, all of which are either in the public domain or released under open licenses that explicitly permit use for AI training.
The Common Pile was developed in collaboration with partners including the University of Toronto, the Vector Institute, Hugging Face, and the Allen Institute for Artificial Intelligence. The dataset follows the Open Knowledge Foundation's Open Definition 2.1 and draws on sources including over 300,000 public-domain books digitized by the Library of Congress and the Internet Archive.
To validate the dataset's quality, EleutherAI trained two 7-billion-parameter language models called Comma v0.1-1T and Comma v0.1-2T on 1 trillion and 2 trillion tokens respectively. Both models achieved competitive performance compared to models trained on unlicensed text with similar computational budgets, such as LLaMA 1 and LLaMA 2 at the 7B scale. This result demonstrated that it is possible to build high-performing language models using only openly licensed data.
The Pile's significance extends well beyond its use as a training dataset. Its release in late 2020 helped establish several important precedents for the open-source AI ecosystem.
First, The Pile demonstrated that a volunteer-driven research collective could produce a training dataset competitive with those used by well-funded corporate labs. Before The Pile, the assumption was that only organizations with access to proprietary data pipelines could train state-of-the-art language models. EleutherAI's work showed that careful curation of publicly available data could close much of that gap.
Second, The Pile popularized the concept of domain-diverse training mixtures. Rather than treating all text as interchangeable, the dataset's 22-component structure highlighted the importance of including specialized content from science, law, code, and other fields. This approach influenced subsequent dataset designs, even those that ultimately relied on web-only sources.
Third, The Pile catalyzed a broader conversation about data transparency in AI. By publishing detailed documentation about the dataset's composition, processing pipeline, and known limitations, EleutherAI set a standard for openness that contrasted with the opacity of proprietary training datasets. The "Datasheet for the Pile" publication further advanced this transparency.
Finally, the copyright disputes around Books3 forced the AI research community to confront difficult questions about the legal and ethical boundaries of training data collection. The resulting lawsuits, takedowns, and policy discussions contributed to a growing consensus that future datasets would need to take copyright and licensing more seriously, as reflected in The Common Pile's focus on openly licensed content.
The Pile is distributed as a collection of compressed JSONL (JSON Lines) files. Each line contains a JSON object with two fields: "text" (the document content) and "meta" (metadata including the source dataset name). The dataset is split into training, validation, and test sets.
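Reading the format is straightforward; the sketch below parses a small inline sample. The "pile_set_name" metadata key shown here is assumed from the released files rather than stated above, so verify it against an actual shard.

```python
import io
import json

# Two documents in the Pile's JSONL layout. The "pile_set_name" key is
# an assumption about the source-dataset label; check a real shard.
sample = io.StringIO(
    '{"text": "def main():\\n    pass", "meta": {"pile_set_name": "Github"}}\n'
    '{"text": "The court held that...", "meta": {"pile_set_name": "FreeLaw"}}\n'
)

docs = [json.loads(line) for line in sample]
for doc in docs:
    print(doc["meta"]["pile_set_name"], len(doc["text"]), "chars")
```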
Using the GPT-2 tokenizer, The Pile contains approximately 1.35 trillion tokens across its effective size of 1,254.20 GiB. The raw 825.18 GiB on disk yields fewer tokens, but the upsampling of smaller components increases the total token count seen during training.
The Pile was originally hosted on The Eye and later made available through Hugging Face at the EleutherAI/pile repository. Following the Books3 controversy and associated legal concerns, access to the complete dataset became restricted. A deduplicated version was also available at EleutherAI/the_pile_deduplicated on Hugging Face.
EleutherAI also produced several specialized derivatives and related datasets: