The Pile is an 825.18 GiB (approximately 886 GB) English text corpus designed for training large language models. Created by EleutherAI, a grassroots collective of AI researchers, The Pile aggregates 22 diverse, high-quality subsets drawn from academic, professional, internet, literary, and miscellaneous sources. The dataset was publicly released on December 31, 2020, and the accompanying paper, "The Pile: An 800GB Dataset of Diverse Text for Language Modeling," was authored by Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy.
The Pile became one of the most widely used open-source training datasets in the history of natural language processing. It served as the foundation for numerous influential language models, including GPT-Neo, GPT-J, GPT-NeoX-20B, the Pythia suite, RWKV, and Cerebras-GPT. Its release marked a turning point in democratizing access to high-quality training data for the broader research community.
Before The Pile, most publicly available text datasets for language model training consisted primarily of web-scraped data from Common Crawl. While Common Crawl provided massive scale, its content was heavily skewed toward general web pages and lacked representation from specialized domains such as scientific literature, legal documents, source code, and published books. OpenAI's GPT-3 was trained on a curated mix of data sources, but the specifics of its dataset were never publicly released, leaving the open-source community without a comparable resource.
EleutherAI formed in July 2020 as a Discord-based collective of volunteer researchers with the goal of replicating GPT-3 in an open-source setting. The group recognized that data quality and diversity were just as important as model architecture and compute. Rather than relying solely on Common Crawl, the team set out to build a training corpus that combined web data with curated, domain-specific text from a wide range of fields.
The key insight behind The Pile was that training on a more diverse mixture of text would produce models with broader capabilities. The authors hypothesized that exposure to scientific papers, legal opinions, source code, mathematical problems, and other specialized content during pre-training would give language models a richer understanding of language and reasoning. Their experiments confirmed this: a GPT-2-sized model trained on The Pile outperformed models of the same size trained on raw Common Crawl or the CC-100 dataset across multiple benchmarks.
The Pile is constructed from 22 component datasets spanning five broad categories: academic writing, internet content, prose and literature, dialogue, and miscellaneous sources. Each component was selected to contribute a distinct type of text that would be underrepresented in a pure web crawl. The following table provides a complete breakdown of all 22 components.
| Component | Raw Size (GiB) | Weight (%) | Epochs | Effective Size (GiB) | Documents | Description |
|---|---|---|---|---|---|---|
| Pile-CC | 227.12 | 18.11 | 1.0 | 227.12 | 54,953,117 | Filtered web pages from Common Crawl, extracted using jusText |
| PubMed Central | 90.27 | 14.40 | 2.0 | 180.55 | 3,098,931 | Full-text biomedical and life science research articles |
| Books3 | 100.96 | 12.07 | 1.5 | 151.44 | 196,640 | Collection of published books sourced from Bibliotik |
| OpenWebText2 | 62.77 | 10.01 | 2.0 | 125.54 | 17,103,059 | Web pages linked from high-scoring Reddit posts |
| ArXiv | 56.21 | 8.96 | 2.0 | 112.42 | 1,264,405 | Scientific preprints converted from LaTeX to Markdown |
| GitHub | 95.16 | 7.59 | 1.0 | 95.16 | 19,021,454 | Source code files from public repositories |
| FreeLaw | 51.15 | 6.12 | 1.5 | 76.73 | 3,562,015 | Legal opinions from the CourtListener database |
| Stack Exchange | 32.20 | 5.13 | 2.0 | 64.39 | 15,622,475 | Questions and top-voted answers from the Stack Exchange network |
| USPTO Backgrounds | 22.90 | 3.65 | 2.0 | 45.81 | 5,883,037 | Background sections of United States patent applications |
| PubMed Abstracts | 19.26 | 3.07 | 2.0 | 38.53 | 15,518,009 | Abstracts from biomedical research papers indexed in PubMed |
| Gutenberg (PG-19) | 10.88 | 2.17 | 2.5 | 27.19 | 28,602 | Public-domain books from Project Gutenberg |
| OpenSubtitles | 12.98 | 1.55 | 1.5 | 19.47 | 446,612 | Movie and television subtitles in English |
| Wikipedia (en) | 6.38 | 1.53 | 3.0 | 19.13 | 6,033,151 | English Wikipedia articles |
| DM Mathematics | 7.75 | 1.24 | 2.0 | 15.49 | 1,014,997 | Algorithmically generated math problems from DeepMind |
| Ubuntu IRC | 5.52 | 0.88 | 2.0 | 11.03 | 10,605 | Chat logs from the Ubuntu support IRC channel |
| BookCorpus2 | 6.30 | 0.75 | 1.5 | 9.45 | 17,868 | Collection of freely available books |
| EuroParl | 4.59 | 0.73 | 2.0 | 9.17 | 69,814 | Proceedings of the European Parliament |
| HackerNews | 3.90 | 0.62 | 2.0 | 7.80 | 831,198 | Comments and discussions from Hacker News |
| YouTube Subtitles | 3.73 | 0.60 | 2.0 | 7.47 | 173,651 | Automatically generated and human-written video captions |
| PhilPapers | 2.38 | 0.38 | 2.0 | 4.76 | 33,990 | Academic papers in philosophy |
| NIH ExPORTER | 1.89 | 0.30 | 2.0 | 3.79 | 939,668 | Abstracts from NIH-funded research grants |
| Enron Emails | 0.88 | 0.14 | 2.0 | 1.76 | 517,401 | Emails from the Enron Corporation, released during federal investigation |
| Total | 825.18 | 100.00 | | 1,254.20 | ~211 million | |
The "Weight" column indicates what fraction of the final training mixture each component represents. The "Epochs" column shows how many times each component is seen during a single pass through The Pile. Smaller but high-quality datasets such as Wikipedia and PubMed Central are upsampled (repeated more than once), while larger datasets like Pile-CC and GitHub are seen only once. This upsampling strategy ensures that the model receives adequate exposure to specialized content without being overwhelmed by web text.
The "Effective Size" column represents the total amount of data the model actually processes for each component, accounting for upsampling. The effective total of 1,254.20 GiB means that a single epoch through The Pile involves reading substantially more data than the raw 825.18 GiB on disk.
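The arithmetic linking these columns is simple to verify; the sketch below derives effective size and mixture weight from raw size and epoch count for three representative components, with figures taken from the table above.

```python
# Effective size = raw size x epochs; weight = effective size / effective total.
# (GiB, epochs) pairs taken from the component table above.
components = {
    "Pile-CC":        (227.12, 1.0),
    "PubMed Central": (90.27,  2.0),
    "Wikipedia (en)": (6.38,   3.0),
}
EFFECTIVE_TOTAL_GIB = 1254.20

for name, (raw_gib, epochs) in components.items():
    effective = raw_gib * epochs
    weight_pct = 100.0 * effective / EFFECTIVE_TOTAL_GIB
    print(f"{name}: {effective:.2f} GiB effective, {weight_pct:.2f}% of mixture")
```

The small discrepancies against the table (e.g. 19.14 vs. 19.13 GiB for Wikipedia) are rounding in the published figures.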
The construction of The Pile involved extensive processing, filtering, and deduplication steps tailored to each component dataset.
The largest single component, Pile-CC, was derived from Common Crawl data. Rather than using the pre-extracted WET files that Common Crawl provides, the team processed the raw WARC (Web ARChive) files using the jusText extraction tool. This approach produced higher-quality text by more effectively separating article content from boilerplate, navigation menus, and other non-content elements.
After text extraction, the pipeline applied language detection with pycld2 to retain English-language content. A fastText classifier, trained to distinguish high-quality text (using OpenWebText2 as positive examples) from unfiltered Common Crawl content, then scored each document; documents were discarded stochastically, with each document's score compared against a threshold drawn from a Pareto distribution, following the filtering approach described in the GPT-3 paper. Document-level deduplication was also performed on Pile-CC.
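The GPT-3 paper describes this style of stochastic quality filter: a document is kept when a Pareto-distributed draw exceeds one minus its classifier score, so high-scoring documents are almost always retained while low-scoring ones survive only occasionally. A minimal sketch, where the shape parameter and function name are illustrative rather than taken from the Pile codebase:

```python
import random

def keep_document(quality_score: float, alpha: float = 9.0) -> bool:
    """Stochastic Pareto-threshold filter in the style described by the
    GPT-3 paper: keep a document when a Pareto(alpha) draw exceeds
    1 - score. The alpha value here is an illustrative choice."""
    # random.paretovariate has support [1, inf); shift it to [0, inf).
    pareto_draw = random.paretovariate(alpha) - 1.0
    return pareto_draw > 1.0 - quality_score
```

A score of 1.0 makes the right-hand side zero, so such documents always pass; a score of 0.0 passes only when the draw exceeds 1, which happens rarely for a heavy-tailed but concentrated distribution like Pareto(9).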
OpenWebText2 extended the original OpenWebText dataset by extracting URLs from Reddit submissions with a score of 3 or higher, then scraping the linked pages with the Newspaper library. Deduplication was performed at the document level using an in-memory MinHashLSH index from the DataSketch library.
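MinHash estimates the Jaccard similarity of two documents' shingle sets by comparing per-permutation minimum hashes; an LSH index over these signatures then makes near-duplicate lookup sublinear. A self-contained sketch of the signature step (shingle size and permutation count are illustrative; the actual pipeline used the datasketch library rather than this hand-rolled version):

```python
import hashlib

def shingles(text: str, n: int = 5) -> set:
    """Overlapping n-word shingles of a document."""
    words = text.split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text: str, num_perm: int = 64) -> list:
    """One minimum hash per 'permutation', simulated by salting the hash."""
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles(text))
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature positions ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold are treated as duplicates and one copy is dropped.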
The ArXiv component was built by downloading LaTeX source files directly from the ArXiv S3 bulk access interface. These LaTeX sources were then converted to Markdown using pandoc, preserving the mathematical and structural content of research papers while removing LaTeX-specific formatting.
The GitHub component targeted public repositories, filtering for files under 100 KB to exclude binary files and extremely large generated files. This component contributed source code in a wide range of programming languages.
For each question on the Stack Exchange network, the top three upvoted answers (with a minimum of 3 upvotes) were extracted and formatted alongside the original question. This filtering ensured that only substantive, community-vetted answers were included.
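That selection rule can be sketched as follows, assuming answers arrive as (score, text) pairs; the "Q:"/"A:" formatting is illustrative rather than the Pile's exact template.

```python
def select_answers(answers, top_k=3, min_score=3):
    """Keep the top_k highest-scoring answers with at least min_score upvotes."""
    eligible = [a for a in answers if a[0] >= min_score]
    return sorted(eligible, key=lambda a: a[0], reverse=True)[:top_k]

def format_qa(question, answers):
    """Render a question plus its vetted answers as one training document."""
    parts = [f"Q: {question}"]
    parts += [f"A: {text}" for score, text in select_answers(answers)]
    return "\n\n".join(parts)
```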
To prevent data leakage between the training set and evaluation benchmarks, the authors applied 13-gram overlap filtering, consistent with the methodology used in the GPT-3 paper. This process identified and removed training documents that contained 13-gram sequences overlapping with standard evaluation datasets.
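In outline, the check builds the set of 13-grams appearing in the evaluation benchmarks and flags any training document that shares one. A minimal set-based sketch (a production pipeline would hash n-grams to bound memory):

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token sequence, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlaps_benchmark(doc_tokens, benchmark_ngrams, n=13):
    """True if the document shares any n-gram with the benchmark set."""
    return not ngrams(doc_tokens, n).isdisjoint(benchmark_ngrams)
```

Documents for which `overlaps_benchmark` returns true are removed from the training split.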
The Pile uses a carefully designed sampling strategy rather than simply concatenating all components in proportion to their raw sizes. The authors observed that naively mixing data by size would cause the model to spend the vast majority of its training on web-scraped text while seeing very little of the smaller, specialized datasets.
To address this, smaller high-quality datasets are upsampled. For example, English Wikipedia (6.38 GiB raw) is repeated 3 times per epoch, while PubMed Central (90.27 GiB raw) is repeated twice. Conversely, very large datasets like Pile-CC and GitHub are seen only once per epoch. The weights were chosen through a combination of heuristics and preliminary experiments, aiming to balance breadth of coverage against the risk of overfitting on small components.
The Pile paper presented benchmark results comparing a 1.3-billion-parameter GPT-2-style model trained on The Pile against equivalent models trained on CC-100 (a filtered, monolingual Common Crawl dataset) and raw Common Crawl. All models were trained on 40 GB subsamples to ensure a fair, size-controlled comparison.
| Metric | The Pile | CC-100 | Raw CC |
|---|---|---|---|
| Pile BPB (validation) | 0.9281 | 1.3143 | 1.1180 |
| Pile BPB (test) | 0.9433 | 1.3293 | 1.1275 |
| WikiText perplexity | 5.59 | 8.27 | 11.75 |
| LAMBADA perplexity | 12.78 | 11.78 | 19.84 |
| LAMBADA accuracy | 50.1% | 49.7% | 43.8% |
BPB stands for "bits per byte," a metric that measures how well the model predicts text (lower is better). The Pile-trained model achieved the best BPB scores on The Pile's own validation and test sets and the best WikiText perplexity, while also achieving competitive or superior LAMBADA accuracy. The authors noted that the Pile-trained model showed "significant improvements over both Raw CC and CC-100 on all components," demonstrating that data diversity translates directly into better generalization.
The Pile became the default training dataset for EleutherAI's family of open-source language models and was adopted by numerous other organizations. The following table lists notable models trained primarily or entirely on The Pile.
| Model | Organization | Parameters | Year | Notes |
|---|---|---|---|---|
| GPT-Neo | EleutherAI | 125M, 1.3B, 2.7B | 2021 | First open-source GPT-3-style models |
| GPT-J-6B | EleutherAI | 6B | 2021 | Trained using Mesh Transformer JAX; largest open model at the time |
| GPT-NeoX-20B | EleutherAI | 20B | 2022 | Largest dense open-source model at time of release |
| Pythia | EleutherAI | 70M to 12B (8 sizes) | 2023 | Suite of models for interpretability research; trained on both The Pile and a deduplicated version |
| RWKV-4 | BlinkDL | 169M to 14B | 2022-2023 | RNN architecture with transformer-level performance |
| Cerebras-GPT | Cerebras Systems | 111M to 13B (7 sizes) | 2023 | Compute-optimal models following Chinchilla scaling laws |
| Stable LM | Stability AI | Various | 2023 | Built upon an experimental dataset derived from The Pile |
EleutherAI's first models trained on The Pile were the GPT-Neo series, released in March 2021 in sizes of 125M, 1.3B, and 2.7B parameters. These were followed in June 2021 by GPT-J-6B, a 6-billion-parameter model that became the largest publicly available GPT-3-style model at the time. GPT-J was trained using Ben Wang's Mesh Transformer JAX library and employed Rotary Position Embedding (RoPE). Both model families demonstrated that open-source models trained on a diverse, publicly available dataset could approach the performance of proprietary systems.
Released in February 2022, GPT-NeoX-20B was a 20-billion-parameter autoregressive language model trained on The Pile. At the time, it was the largest dense, publicly available language model in the world. The model was trained using the GPT-NeoX library, which extended the Megatron and DeepSpeed frameworks for efficient distributed training.
The Pythia suite, described in a paper by Biderman et al. (2023), consists of 16 models: 8 sizes (70M, 160M, 410M, 1B, 1.4B, 2.8B, 6.9B, and 12B parameters) each trained on two versions of The Pile (standard and globally deduplicated). All models at each size were trained on the exact same data in the exact same order, with 154 intermediate checkpoints saved per model. This controlled experimental setup was specifically designed to support research on interpretability and training dynamics. Each model was trained for approximately 300 billion tokens.
RWKV is a novel architecture that combines the training efficiency of transformers with the inference efficiency of recurrent neural networks. The RWKV-4 Pile series included models at 169M, 430M, 3B, 7B, and 14B parameters, all trained on The Pile for approximately 332 billion tokens each. RWKV demonstrated that non-transformer architectures could achieve competitive language modeling performance when trained on high-quality, diverse data.
Cerebras Systems released a family of seven GPT models (111M to 13B parameters) in March 2023, all trained on The Pile using Cerebras wafer-scale hardware. These models followed DeepMind's Chinchilla scaling laws, training each model on approximately 20 tokens per parameter. Cerebras-GPT was released under the Apache 2.0 license.
The most significant controversy surrounding The Pile centers on its Books3 component, which contained 196,640 books in plaintext format, totaling about 101 GiB. The books were originally collected from Bibliotik, a private BitTorrent tracker known for hosting pirated e-books. Shawn Presser, who compiled and shared the Books3 dataset, argued that the books were necessary for creating diverse and unbiased language models.
In August 2023, the Danish anti-piracy organization Rights Alliance issued a DMCA takedown notice targeting the Books3 dataset, which had been hosted on a website called The Eye. Rights Alliance identified approximately 150 titles belonging to its member publishers within the dataset. The Eye complied with the takedown request, and the primary download link for Books3 was taken offline.
Rights Alliance also contacted Hugging Face and EleutherAI, both of which had hosted links to Books3. Maria Fredenslund, director of Rights Alliance, stated that it was "absolutely crucial" to prevent AI from being trained on illegally obtained content. Bloomberg also confirmed to Rights Alliance that it would not use Books3 to train future versions of its BloombergGPT model.
In July 2023, authors Richard Kadrey, Sarah Silverman, and Christopher Golden filed copyright infringement lawsuits against both OpenAI and Meta, alleging that these companies used Books3 (as part of The Pile) to train their language models without permission. The plaintiffs argued that the use of copyrighted material for AI training constituted infringement.
Meta acknowledged in court proceedings that it had used The Pile, including Books3, to train its LLaMA models. In June 2025, a federal court in California issued its decision in Kadrey v. Meta, ruling on summary judgment that Meta's copying of books for AI training qualified as fair use. The ruling represented a significant legal precedent for the AI industry, although appeals and related cases continue.
Despite the takedown of the original download link, copies of the Books3 dataset persisted on the Internet Archive's Wayback Machine and through alternative distribution channels, making complete removal effectively impossible.
The Pile itself was released under the MIT License. However, the individual component datasets carry their own licensing terms, and some components contain copyrighted material. The creators acknowledged that Books3 contained copyrighted books and that substantial portions of ArXiv papers and PhilPapers articles were also under copyright. The Pile's authors asserted that use of the dataset fell under fair use for research purposes, though they recommended consulting intellectual property attorneys for jurisdiction-specific guidance.
A formal "Datasheet for the Pile" was published by Biderman et al. in January 2022 (arXiv:2201.07311), providing detailed documentation of each component's provenance, collection methodology, and potential ethical concerns. Among the considerations it noted: all data in The Pile was collected before September 1, 2020.
The Pile was released at a time when few large-scale, open training datasets existed. In the years that followed, several alternative and successor datasets emerged, each taking a different approach to data curation.
| Dataset | Size (Tokens) | Year | Source | Approach |
|---|---|---|---|---|
| The Pile | ~1.35 trillion | 2020 | 22 diverse sources | Curated multi-domain mixture |
| C4 | ~156 billion | 2019 | Single Common Crawl snapshot | Filtered web text |
| RefinedWeb | ~5 trillion | 2023 | Common Crawl | Heavily filtered and deduplicated web data |
| FineWeb | ~15 trillion | 2024 | 96 Common Crawl snapshots | Large-scale filtered web text |
| The Common Pile | ~2.1 trillion | 2025 | 30 openly licensed sources | Public domain and openly licensed text only |
C4, created by Google for training the T5 model, was derived from a single April 2019 Common Crawl snapshot. At approximately 156 billion tokens, it was substantially smaller than The Pile and consisted entirely of filtered web text. C4 applied heuristic filters to extract natural language and performed deduplication, but it lacked the domain diversity that The Pile provided through its curated component structure.
RefinedWeb, created by the Technology Innovation Institute (TII) for the Falcon language model, demonstrated that sufficiently well-filtered web data could match or exceed the performance of curated datasets like The Pile. The RefinedWeb paper (Penedo et al., 2023) applied aggressive filtering and multi-level deduplication to Common Crawl, producing approximately 5 trillion tokens of high-quality English text. Models trained on RefinedWeb performed comparably to those trained on The Pile across standard benchmarks, challenging the assumption that domain-diverse curation was strictly necessary.
Notably, the RefinedWeb analysis found that The Pile exhibited 40 to 60 percent duplicate content, suggesting that more aggressive deduplication could have improved its quality.
FineWeb, released by Hugging Face in 2024, processed 96 Common Crawl snapshots to produce approximately 15 trillion tokens of cleaned English text. Models trained on FineWeb outperformed those trained on The Pile, C4, and several other datasets on aggregate benchmark tasks. FineWeb represented a shift toward massive-scale, web-only datasets with sophisticated filtering pipelines.
In response to the copyright controversies surrounding The Pile, EleutherAI spent two years developing a successor dataset called The Common Pile. Released in June 2025, The Common Pile v0.1 is an 8 TB corpus comprising text from 30 distinct sources, all of which are either in the public domain or released under open licenses that explicitly permit use for AI training.
The Common Pile was developed in collaboration with partners including the University of Toronto, the Vector Institute, Hugging Face, and the Allen Institute for Artificial Intelligence. The dataset follows the Open Knowledge Foundation's Open Definition 2.1 and draws on sources including over 300,000 public-domain books digitized by the Library of Congress and the Internet Archive.
To validate the dataset's quality, EleutherAI trained two 7-billion-parameter language models called Comma v0.1-1T and Comma v0.1-2T on 1 trillion and 2 trillion tokens respectively. Both models achieved competitive performance compared to models trained on unlicensed text with similar computational budgets, such as LLaMA 1 and LLaMA 2 at the 7B scale. This result demonstrated that it is possible to build high-performing language models using only openly licensed data.
The Pile's significance extends well beyond its use as a training dataset. Its release in late 2020 helped establish several important precedents for the open-source AI ecosystem.
First, The Pile demonstrated that a volunteer-driven research collective could produce a training dataset competitive with those used by well-funded corporate labs. Before The Pile, the assumption was that only organizations with access to proprietary data pipelines could train state-of-the-art language models. EleutherAI's work showed that careful curation of publicly available data could close much of that gap.
Second, The Pile popularized the concept of domain-diverse training mixtures. Rather than treating all text as interchangeable, the dataset's 22-component structure highlighted the importance of including specialized content from science, law, code, and other fields. This approach influenced subsequent dataset designs, even those that ultimately relied on web-only sources.
Third, The Pile catalyzed a broader conversation about data transparency in AI. By publishing detailed documentation about the dataset's composition, processing pipeline, and known limitations, EleutherAI set a standard for openness that contrasted with the opacity of proprietary training datasets. The "Datasheet for the Pile" publication further advanced this transparency.
Finally, the copyright disputes around Books3 forced the AI research community to confront difficult questions about the legal and ethical boundaries of training data collection. The resulting lawsuits, takedowns, and policy discussions contributed to a growing consensus that future datasets would need to take copyright and licensing more seriously, as reflected in The Common Pile's focus on openly licensed content.
The Pile is distributed as a collection of compressed JSONL (JSON Lines) files. Each line contains a JSON object with two fields: "text" (the document content) and "meta" (metadata including the source dataset name). The dataset is split into training, validation, and test sets.
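Reading the format is straightforward; the sketch below parses a small inline sample. The "pile_set_name" metadata key shown here is assumed from the released files rather than stated above, so verify it against an actual shard.

```python
import io
import json

# Two documents in the Pile's JSONL layout. The "pile_set_name" key is
# an assumption about the source-dataset label; check a real shard.
sample = io.StringIO(
    '{"text": "def main():\\n    pass", "meta": {"pile_set_name": "Github"}}\n'
    '{"text": "The court held that...", "meta": {"pile_set_name": "FreeLaw"}}\n'
)

docs = [json.loads(line) for line in sample]
for doc in docs:
    print(doc["meta"]["pile_set_name"], len(doc["text"]), "chars")
```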
Using the GPT-2 tokenizer, The Pile contains approximately 1.35 trillion tokens across its effective size of 1,254.20 GiB. The raw 825.18 GiB on disk yields fewer tokens, but the upsampling of smaller components increases the total token count seen during training.
The Pile was originally hosted on The Eye and later made available through Hugging Face at the EleutherAI/pile repository. Following the Books3 controversy and associated legal concerns, access to the complete dataset became restricted. A deduplicated version was also available at EleutherAI/the_pile_deduplicated on Hugging Face.
EleutherAI also produced several specialized derivatives and related datasets: