C4 (Colossal Clean Crawled Corpus)
Last reviewed
Apr 30, 2026
Sources
No citations yet
Review status
Needs citations
Revision
v1 · 3,052 words
C4, short for Colossal Clean Crawled Corpus, is a large-scale, cleaned, English-language web text dataset derived from a single monthly snapshot of Common Crawl. It was introduced by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu (Google) in the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (arXiv:1910.10683, JMLR 2020), where it served as the pretraining corpus for T5. The April 2019 Common Crawl dump (about 1.4 trillion tokens of raw HTML-stripped text) was filtered down to roughly 750 GB of "reasonably clean and natural English text," containing on the order of 365 million documents and roughly 156 billion tokens after deduplication and quality filters (Raffel et al. 2020; Dodge et al. 2021).
C4 became one of the most widely cited examples of a "filtered Common Crawl" corpus and is the direct ancestor of nearly every later open web pretraining dataset, including mC4, RedPajama, RefinedWeb, FineWeb and Dolma. It also became a touchstone for dataset documentation work after Dodge et al. (2021) audited it at EMNLP and showed that several of its heuristic filters, especially the so-called bad-words blocklist, removed substantially more text from minority and dialect-rich web pages than from White American English pages (Dodge et al. 2021).
The T5 project at Google Research was an ablation study of pretraining choices for transfer learning in language models. To compare pretraining objectives, model sizes, training durations and data mixtures fairly, the authors needed a single large, English, web-scale corpus that they could control end-to-end. Existing corpora at the time were either small and curated (Wikipedia, BookCorpus), proprietary (WebText, used for GPT-2), or unfiltered Common Crawl (which contained large quantities of menus, error messages, machine-generated boilerplate and non-English text). C4 was built to fill that gap (Raffel et al. 2020, section 2.2).
C4 was first an internal Google artifact, generated through TensorFlow Datasets (tfds) build scripts that ran the cleaning pipeline directly against Common Crawl on Cloud Dataflow. Because the build was expensive (on the order of thousands of CPU hours) and the result was very large, the team initially distributed only the build recipe rather than the data itself. In 2021 the Allen Institute for AI (AI2) released a redistributable copy on the Hugging Face Hub at allenai/c4, which is now the most commonly used distribution (AllenAI 2021). The TensorFlow Datasets catalog page for c4 documents the official variants and remains the canonical reference for sizes and splits (TensorFlow Datasets, c4 catalog).
The T5 paper, section 2.2 ("The Colossal Clean Crawled Corpus"), describes the cleaning pipeline applied to the April 2019 Common Crawl dump. Common Crawl provides web text in two formats: raw HTML (WARC) and HTML-stripped plain text (WET). C4 starts from the WET extracts and applies a sequence of heuristic filters that target obvious junk while keeping the pipeline cheap enough to run at the scale of trillions of tokens.
| Step | Filter | Purpose |
|---|---|---|
| 1 | Keep only lines ending in a terminal punctuation mark (., ?, !, or a closing quote) | Drops navigation menus, error messages, list scaffolding |
| 2 | Discard pages with fewer than 3 sentences; keep only lines that contain at least 5 words | Removes very short pages and stub lines |
| 3 | Remove any page containing the placeholder text "lorem ipsum" | Drops template and design-stub pages |
| 4 | Remove any page containing a curly brace { or } | Drops most code, JSON and templating leftovers |
| 5 | Remove any page containing any word from the "List of Dirty, Naughty, Obscene or Otherwise Bad Words" list | Intended to remove pornography, hate speech and obscene content |
| 6 | Language identification with langdetect: keep pages classified as English with probability at least 0.99 | Restrict to English |
| 7 | Three-sentence span deduplication: keep one occurrence and discard all others of any three-sentence span that appears more than once across the corpus | Drops boilerplate and duplicate articles |
Steps 1 to 6 are the per-page filters; step 7 is a global pass that uses a hash table over rolling three-sentence spans (Raffel et al. 2020). Importantly, the deduplication operates on three-sentence spans, not whole documents, so two near-identical articles that never share three consecutive identical sentences can both survive. Dodge et al. (2021) showed that this leaves substantial document-level near-duplicate content in the final corpus.
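The per-page heuristics and the span-level deduplication described above can be sketched in a few lines of Python. This is a simplified illustration, not the TFDS build code: the function names, the regex-based sentence splitter and the in-memory hash set are all assumptions made for the example, the thresholds are passed in as parameters rather than hard-coded, and the blocklist and language-identification steps are omitted.

```python
import hashlib
import re

TERMINAL_MARKS = (".", "?", "!", '"')


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ?, ! or a closing quote."""
    parts = re.split(r'(?<=[.?!"])\s+', text.strip())
    return [p for p in parts if p]


def clean_page(text, min_words_per_line, min_sentences):
    """Apply simplified versions of steps 1-4 to one page.

    Returns the cleaned text, or None if the whole page is rejected.
    """
    # Steps 3 and 4: reject the page on placeholder text or curly braces.
    if "lorem ipsum" in text.lower() or "{" in text or "}" in text:
        return None

    kept = []
    for line in text.splitlines():
        line = line.strip()
        # Step 1: keep only lines that end in terminal punctuation.
        if not line.endswith(TERMINAL_MARKS):
            continue
        # Step 2 (line part): drop lines with too few words.
        if len(line.split()) < min_words_per_line:
            continue
        kept.append(line)

    cleaned = "\n".join(kept)
    # Step 2 (page part): drop pages with too few sentences.
    if len(split_sentences(cleaned)) < min_sentences:
        return None
    return cleaned


def dedup_three_sentence_spans(pages):
    """Step 7, single-process sketch: keep the first occurrence of each
    three-sentence span and drop the sentences of any repeated span."""
    seen = set()
    output = []
    for page in pages:
        sentences = split_sentences(page)
        drop = set()
        for i in range(len(sentences) - 2):
            span = " ".join(sentences[i:i + 3]).lower()
            digest = hashlib.sha1(span.encode("utf-8")).hexdigest()
            if digest in seen:
                drop.update(range(i, i + 3))
            else:
                seen.add(digest)
        output.append(" ".join(s for j, s in enumerate(sentences) if j not in drop))
    return output
```

Because only exact three-sentence matches are hashed, a page that differs from another by one word in every third sentence passes through untouched, which is the weakness the paragraph above describes.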
After all filters, the canonical English variant (c4/en) contains roughly 365 million documents and about 156 billion tokens, distributed as 806.87 GiB on disk in TensorFlow Datasets format and approximately 305 GB compressed JSONL on the AllenAI Hugging Face mirror. The TensorFlow Datasets catalog reports 364,613,570 train and 364,724 validation documents for c4/en version 3.1.0; the Hugging Face redistribution reports closely matching but not identical numbers (about 364.87M train and 364.6K validation) because the two builds use different Beam runs (TensorFlow Datasets, c4 catalog; AllenAI 2021).
The c4 family on TensorFlow Datasets and on allenai/c4 ships several variants that exist precisely so researchers can study the effect of the cleaning pipeline.
| Variant | Description | Approx size | Documents |
|---|---|---|---|
| c4/en | English with the full T5 cleaning pipeline | 806.87 GiB (TFDS) / 305 GB (HF) | 364.6M train |
| c4/en.noclean | English with no heuristic cleaning, only language ID and a light dedup | 6.21 TiB | 1.06B train |
| c4/en.noblocklist | English without the bad-words blocklist (other filters intact) | 380 GB | 393.4M train |
| c4/realnewslike | Documents whose URLs are in the RealNews domain list of Zellers et al. (2020) | 36.91 GiB | 13.8M train |
| c4/webtextlike | Documents whose URLs overlap OpenWebText URLs | 17.93 GiB | 4.49M train |
| c4/multilingual (mC4) | 101 languages from 71 to 86 Common Crawl dumps, used to train mT5 | ~27 to 38 TiB | 6.6B pages |
The realnewslike variant overlaps with, but is not the same as, the RealNews dataset of Zellers et al. (2020) "Defending Against Neural Fake News." RealNews is itself a Common Crawl-derived 120 GB news corpus restricted to roughly 5,000 news domains indexed by Google News; the realnewslike C4 variant simply intersects that domain list with the C4 build (Zellers et al. 2020).
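Intersecting a corpus with a domain list amounts to a URL filter. The sketch below shows one plausible way to do it; the matching rule (exact host or any subdomain), the function name and the example domains are assumptions for illustration, since the article does not describe how the actual build matches URLs.

```python
from urllib.parse import urlsplit


def filter_by_domain(records, allowed_domains):
    """Yield records whose URL host is an allowed domain or one of its subdomains.

    `records` is any iterable of dicts with a "url" key (the C4 record shape).
    """
    allowed = {d.lower() for d in allowed_domains}
    for record in records:
        # urlsplit lowercases the hostname; fall back to "" for malformed URLs.
        host = urlsplit(record["url"]).hostname or ""
        if any(host == d or host.endswith("." + d) for d in allowed):
            yield record
```

Requiring a leading dot for the subdomain case keeps a host such as notexample-news.com from matching example-news.com by suffix alone.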
The mC4 variant was released alongside mT5 (Xue et al. 2021, NAACL) and is the multilingual analogue of C4. It generalizes the C4 pipeline to 101 languages, swaps langdetect for CLD3 for language identification, and pulls from many more monthly Common Crawl dumps to get enough data for low-resource languages. The mC4 release reports about 6.6 billion pages and 6.3 trillion tokens across all languages.
C4 is most famous as the pretraining corpus for the original T5 family of models (small, base, large, 3B and 11B, plus the LM-adapted variants). The T5 paper used the c4/en variant for its main results. In its data set comparison, c4/en outperformed the unfiltered variant, which the authors took as evidence that heuristic cleaning helps even when the source pool is enormous; the smaller domain-restricted variants, such as the RealNews-like and WebText-like sets, were competitive and sometimes better on tasks close to their own domain (Raffel et al. 2020).
Beyond T5 itself, C4 (or a derivative) has appeared in the pretraining mix of:
By 2024, frontier-scale pretraining had moved to multi-trillion-token mixes such as FineWeb (about 15 trillion tokens) and RedPajama-V2 (about 30 trillion tokens), and C4's roughly 156 billion tokens had become small in absolute terms. C4 is still routinely used as a baseline in dataset papers and as a small, cheap pretraining corpus for academic ablations and for small language models.
The most important secondary work on C4 is Dodge, Sap, Marasović, Agnew, Ilharco, Groeneveld, Mitchell and Gardner, "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus" (EMNLP 2021, arXiv:2104.08758). This paper performed a systematic audit of the c4/en corpus and is widely cited in dataset documentation work.
Key findings:
The most heavily represented domain is patents.google.com, followed by Wikipedia, the New York Times, Los Angeles Times, the Guardian, PLOS, Forbes, the Huffington Post and Patreon. A non-trivial fraction of the corpus is from US government and military domains and from machine-generated patent text.
The Dodge et al. paper has become a standard reference for what cleaning pipelines actually do to a web corpus, and it is one of the main reasons that more recent corpora (RefinedWeb, FineWeb, Dolma) explicitly avoid the C4 blocklist or replace it with classifier-based approaches.
The original distribution path is tensorflow_datasets.load('c4', ...). TensorFlow Datasets does not host the cleaned data directly; instead the user runs a Beam pipeline (typically on Cloud Dataflow) that downloads the relevant Common Crawl WET files and applies the cleaning steps locally. This is why the catalog page describes c4 as requiring "manual preparation." The latest TFDS version is 3.1.0 (TensorFlow Datasets, c4 catalog).
The Allen Institute for AI runs the most-used redistribution at https://huggingface.co/datasets/allenai/c4. Files are gzipped JSONL with three fields per record (url, text, timestamp); the English train split is sharded into 1,024 files and the validation split into 8 files. The redistribution is licensed under ODC-BY 1.0 (Open Data Commons Attribution License 1.0), with the explicit note that users are also bound by the Common Crawl terms of use for the underlying content (AllenAI 2021).
A single record looks like:
{
"url": "https://example.com/article",
"text": "Article content here...",
"timestamp": "2019-04-25T12:57:54Z"
}
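Records in this shape can be read with the Python standard library alone. The following is a minimal sketch that assumes a shard has already been downloaded to a local path; the function names are hypothetical, and `write_shard` exists only so the example can build a small local sample without network access.

```python
import gzip
import json


def iter_c4_records(path):
    """Yield one dict per non-empty line of a gzipped JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def write_shard(path, records):
    """Write records as gzipped JSONL (used here only to create a sample file)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
```

Streaming line by line keeps memory use flat regardless of shard size, which matters when iterating over all 1,024 training shards.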
C4 spawned a small industry of "better-than-C4" web corpora. Most of them either keep C4's broad shape (filter Common Crawl with heuristics, deduplicate, ship JSONL) and add classifier-based quality scoring, or scale up to more dumps and more languages.
| Corpus | Year | Source | Approx size | Tokens | Deduplication | Quality filters | License | Key paper |
|---|---|---|---|---|---|---|---|---|
| C4 (en) | 2019 / 2021 | April 2019 CC | ~750 GB / ~305 GB compressed | ~156 B | 3-sentence spans | Heuristic + bad-words list | ODC-BY 1.0 (AllenAI) | Raffel et al. 2020 |
| mC4 | 2021 | 71 to 86 CC dumps | ~27 to 38 TiB | ~6.3 T | 3-sentence spans | C4-style + CLD3 | ODC-BY 1.0 | Xue et al. 2021 |
| RealNews | 2019 | CC, Dec 2016 to Mar 2019 | 120 GB | ~30 B | URL + content hash | News domain whitelist | Apache-2.0 | Zellers et al. 2020 |
| The Pile | 2020 | 22 sub-corpora (CC + Books + GitHub + arXiv + ...) | 825 GiB | ~300 B | per-source | mostly source-side | Mixed | Gao et al. 2020 |
| RedPajama-V1 | 2023 | LLaMA recipe (CC + C4 + GitHub + arXiv + Books + Wikipedia + StackExchange) | 1.2 TB | 1.2 T | per-source | per-source | Apache-2.0 | Together Computer 2023 |
| SlimPajama | 2023 | Cleaned RedPajama-V1 | ~895 GB | 627 B | global MinHash | Cerebras filters | Apache-2.0 | Soboleva et al. 2023 |
| RefinedWeb | 2023 | CC, all dumps to 2023 | 2.8 TB | ~600 B (5T avail.) | URL + MinHash | Heuristic, no blocklist | ODC-BY 1.0 | Penedo et al. 2023 |
| Dolma | 2024 | CC + Books + arXiv + GitHub + Reddit + ... | ~7 TB | 3 T | per-source + global | classifier-based | ODC-BY 1.0 | Soldaini et al. 2024 |
| FineWeb | 2024 | 96 CC dumps, 2013 to Apr 2024 | ~44 TB | 15 T (18.5 T extended) | global MinHash | Heuristic + classifier | ODC-BY 1.0 | Penedo et al. 2024 |
| FineWeb-Edu | 2024 | FineWeb filtered with educational classifier | ~3.5 TB | 1.3 T | inherited | Llama-3 educational classifier | ODC-BY 1.0 | Penedo et al. 2024 |
| RedPajama-V2 | 2024 | 84 CC dumps | ~270 TB raw | ~30 T | global MinHash + Bloom | quality annotations | Apache-2.0 | Together AI 2024 |
A few patterns are visible in this table. First, most of the later web-scale corpora (SlimPajama, RefinedWeb, FineWeb, RedPajama-V2 and, in part, Dolma) use global, cross-document deduplication, typically MinHash with locality-sensitive hashing, instead of C4's three-sentence span dedup, which leaves document-level near-duplicates in place. Second, the bad-words blocklist disappeared after Dodge et al. (2021); RefinedWeb, FineWeb and Dolma all explicitly replace it with classifier-based or URL-based filtering. Third, the absolute scale of frontier pretraining data has grown by roughly two orders of magnitude since C4 was released, from about 156 billion tokens to 15 to 30 trillion tokens.
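To make the contrast with span-level dedup concrete, the sketch below estimates document similarity with MinHash signatures. It is an illustrative toy under stated assumptions (word 5-gram shingles, 64 salted BLAKE2b hash functions, hypothetical function names) and does not reproduce any particular corpus's pipeline; the locality-sensitive hashing step that makes this scale to billions of documents is left out.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of lowercase word k-grams in a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(text, num_hashes=64, k=5):
    """For each salted hash function, keep the minimum hash over all shingles."""
    items = [s.encode("utf-8") for s in shingles(text, k)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt acts as a distinct hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(item, digest_size=8, salt=salt).digest(), "big"
                )
                for item in items
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Two articles that differ by a few edited words share most of their shingles and therefore most signature positions, so they are flagged as near-duplicates even when no three-sentence span matches exactly.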
C4 set three durable conventions for open pretraining data: (1) start from Common Crawl, (2) apply a documented cleaning pipeline, and (3) ship the full text rather than just the URL list. Before C4 the dominant open web corpus was OpenWebText, which was a URL-list reproduction of OpenAI's WebText for GPT-2; after C4 the dominant pattern is a directly downloadable JSONL corpus with a well-documented build script.
The bad-words filter in particular has become a cautionary example in the dataset documentation literature, cited in datasheets for almost every later open corpus (RedPajama, RefinedWeb, FineWeb, Dolma) as the canonical example of a well-intentioned filter with severe demographic side effects. The phrase "C4 blocklist" is now a shorthand in the field for that mistake (Dodge et al. 2021).
For researchers training small or academic-scale models, C4 remains attractive because it is well-documented, license-clean (ODC-BY 1.0), modest in size relative to modern frontier corpora, and directly comparable to a long line of published baselines. As of 2026 it is still one of the most downloaded text datasets on the Hugging Face Hub.