# C4 (Colossal Clean Crawled Corpus)

> Source: https://aiwiki.ai/wiki/c4_dataset
> Updated: 2026-06-22
> Categories: Data & Datasets, Natural Language Processing
> From AI Wiki (https://aiwiki.ai), a free encyclopedia of artificial intelligence. Quote with attribution.

**C4 (Colossal Clean Crawled Corpus)** is a roughly 750 GB, cleaned, English-language web text [dataset](/wiki/dataset) of about 365 million documents and 156 billion tokens that Google created from the April 2019 [Common Crawl](/wiki/common_crawl) snapshot to pretrain the [T5](/wiki/t5) model. It was introduced by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li and Peter J. Liu (Google) in the 2020 paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (arXiv:1910.10683, JMLR 2020).[1] The raw April 2019 Common Crawl dump (about 1.4 trillion tokens of HTML-stripped text) was filtered down to roughly 750 GB of "reasonably clean and natural English text" using a documented set of heuristic rules, then redistributed by the Allen Institute for AI on the Hugging Face Hub.[1][14]

C4 became one of the most widely cited examples of a "filtered Common Crawl" corpus and is the direct ancestor of nearly every later open web pretraining dataset, including mC4, RedPajama, RefinedWeb, FineWeb and Dolma. It also became a touchstone for dataset documentation work after Dodge et al. (2021) audited it at EMNLP and showed that several of its heuristic filters, especially the so-called bad-words blocklist, removed substantially more text from minority and dialect-rich web pages than from White American English pages (Dodge et al. 2021).[2] In 2023 a Washington Post investigation, built on an AllenAI replica of C4, mapped the 15 million websites inside the corpus and found problematic sources alongside the patents and encyclopedia text.[15]

## Who made C4 and why?

The T5 project at Google Research was an ablation study of pretraining choices for transfer learning in [language models](/wiki/language_model). To compare pretraining objectives, model sizes, training durations and data mixtures fairly, the authors needed a single large, English, web-scale corpus that they could control end-to-end. Existing corpora at the time were either small and curated (Wikipedia, BookCorpus), proprietary (WebText, used for [GPT-2](/wiki/gpt-2)), or unfiltered Common Crawl (which contained large quantities of menus, error messages, machine-generated boilerplate and non-English text). C4 was built to fill that gap (Raffel et al. 2020, section 2.2).[1]

C4 was first an internal Google artifact, generated through TensorFlow Datasets (`tfds`) build scripts that ran the cleaning pipeline directly against Common Crawl on Cloud Dataflow. Because the build was expensive (on the order of thousands of CPU hours) and the result was very large, the team initially distributed only the build recipe rather than the data itself. In 2021 the Allen Institute for AI (AI2) released a redistributable copy on the Hugging Face Hub at `allenai/c4`, which is now the most commonly used distribution (AllenAI 2021).[14] The TensorFlow Datasets catalog page for `c4` documents the official variants and remains the canonical reference for sizes and splits (TensorFlow Datasets, c4 catalog).[13]

## How is C4 constructed?

The T5 paper, section 2.2 ("The Colossal Clean Crawled Corpus"), describes the cleaning pipeline applied to the April 2019 Common Crawl dump.[1] Common Crawl provides web text in two formats: raw HTML (WARC) and HTML-stripped plain text (WET). C4 starts from the WET extracts and applies a sequence of heuristic filters that target obvious junk while keeping the pipeline cheap enough to run at the scale of trillions of tokens.

### Heuristic filters

| Step | Filter | Purpose |
|------|--------|---------|
| 1 | Keep only lines ending in a terminal punctuation mark (`.`, `?`, `!`, or a closing quote) | Drops navigation menus, error messages, list scaffolding |
| 2 | Discard pages with fewer than 5 sentences; discard sentences with fewer than 3 words | Removes very short pages and stub lines |
| 3 | Remove any page containing the placeholder text "lorem ipsum" | Drops template and design-stub pages |
| 4 | Remove any page containing a curly brace `{` or `}` | Drops most code, JSON and templating leftovers |
| 5 | Remove any page containing any word from the "List of Dirty, Naughty, Obscene or Otherwise Bad Words" list | Intended to remove pornography, hate speech and obscene content |
| 6 | Language identification with `langdetect`: keep pages classified as English with probability at least 0.99 | Restrict to English |
| 7 | Three-sentence span deduplication: keep only one occurrence of any three-sentence span that appears more than once across the corpus | Drops boilerplate and duplicate articles |

Steps 1 to 6 are per-page filters; step 7 is a global pass that uses a hash table over rolling three-sentence spans (Raffel et al. 2020).[1] Because deduplication operates on three-sentence spans rather than on whole documents, two near-identical articles whose shared text never lines up into an identical three-sentence span can both survive. Dodge et al. (2021) showed that this leaves substantial document-level near-duplicate content in the final corpus.[2]
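
The pipeline described above can be sketched in a few dozen lines of Python. This is a minimal illustration, not the T5 authors' code: the function names, the naive regex sentence splitter and the caller-supplied blocklist are all assumptions, the thresholds simply mirror the filter table, and language identification (step 6) is left out because it needs an external library.

```python
import hashlib
import re

# Thresholds mirror the filter table; everything else here is illustrative.
MIN_SENTENCES = 5
MIN_WORDS_PER_LINE = 3
TERMINAL = (".", "?", "!", '"', "\u201d")  # assumption: straight or curly closing quote


def split_sentences(text):
    """Naive splitter on ., ? and ! (an assumption, not a real tokenizer)."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def clean_page(text, blocklist=frozenset()):
    """Steps 1-5 for one page: return the cleaned text, or None if the page is dropped."""
    lowered = text.lower()
    if "lorem ipsum" in lowered or "{" in text or "}" in text:  # steps 3 and 4
        return None
    if set(re.findall(r"[a-z']+", lowered)) & set(blocklist):  # step 5
        return None
    kept = [
        line.strip()
        for line in text.splitlines()
        if line.strip().endswith(TERMINAL)  # step 1
        and len(line.split()) >= MIN_WORDS_PER_LINE  # step 2, applied per line
    ]
    cleaned = "\n".join(kept)
    if len(split_sentences(cleaned)) < MIN_SENTENCES:  # step 2, page level
        return None
    return cleaned


def dedup_spans(pages, span=3):
    """Step 7: drop three-sentence spans already seen, keeping the first copy."""
    seen = set()
    result = []
    for page in pages:
        sentences = split_sentences(page)
        keep = [True] * len(sentences)
        for i in range(len(sentences) - span + 1):
            window = " ".join(sentences[i:i + span]).lower()
            key = hashlib.sha1(window.encode("utf-8")).hexdigest()
            if key in seen:
                keep[i:i + span] = [False] * span
            else:
                seen.add(key)
        result.append(" ".join(s for s, k in zip(sentences, keep) if k))
    return result
```

For example, `dedup_spans(["A one. B two. C three. D four.", "X zero. A one. B two. C three."])` trims the second page to `"X zero."`, while a later page that shares only two consecutive sentences with an earlier one passes through unchanged, which is the gap the paragraph above describes.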

### How large is C4?

After all filters, the canonical English variant (`c4/en`) contains roughly 365 million documents and about 156 billion tokens, distributed as 806.87 GiB on disk in TensorFlow Datasets format and approximately 305 GB compressed JSONL on the AllenAI Hugging Face mirror. The TensorFlow Datasets catalog reports 364,613,570 train and 364,724 validation documents for `c4/en` version 3.1.0; the Hugging Face redistribution reports closely matching but not identical numbers (about 364.87M train and 364.6K validation) because the two builds use different Beam runs (TensorFlow Datasets, c4 catalog; AllenAI 2021).[13][14]
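
A quick back-of-the-envelope calculation puts these figures in perspective. The sketch below uses only the numbers quoted in this section and assumes, for simplicity, that the approximate token count covers both splits; the variable names are illustrative.

```python
# Figures quoted above for c4/en (TFDS version 3.1.0).
TRAIN_DOCS = 364_613_570
VALIDATION_DOCS = 364_724
TOTAL_TOKENS = 156e9      # approximate token count
ON_DISK_GIB = 806.87      # TFDS on-disk size

total_docs = TRAIN_DOCS + VALIDATION_DOCS

# Average document length in tokens.
avg_tokens_per_doc = TOTAL_TOKENS / total_docs

# Share of documents held out for validation.
validation_share = VALIDATION_DOCS / total_docs

# Average on-disk footprint per document, in bytes (1 GiB = 2**30 bytes).
bytes_per_doc = ON_DISK_GIB * 2**30 / total_docs

print(f"{avg_tokens_per_doc:.0f} tokens per document on average")
print(f"{validation_share:.2%} of documents are in the validation split")
print(f"{bytes_per_doc:.0f} bytes per document on disk")
```

Under these assumptions the average document is a little over 400 tokens long, and the validation split is roughly one document in a thousand.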

## What variants of C4 exist?

The `c4` family on TensorFlow Datasets and on `allenai/c4` ships several variants that exist precisely so researchers can study the effect of the cleaning pipeline.

| Variant | Description | Approx size | Documents |
|---------|-------------|-------------|-----------|
| `c4/en` | English with the full T5 cleaning pipeline | 806.87 GiB (TFDS) / 305 GB (HF) | 364.6M train |
| `c4/en.noclean` | English with no heuristic cleaning, only language ID and a light dedup | 6.21 TiB | 1.06B train |
| `c4/en.noblocklist` | English without the bad-words blocklist (other filters intact) | 380 GB | 393.4M train |
| `c4/realnewslike` | Documents whose URLs are in the RealNews domain list of Zellers et al. (2019) | 36.91 GiB | 13.8M train |
| `c4/webtextlike` | Documents whose URLs overlap OpenWebText URLs | 17.93 GiB | 4.49M train |
| `c4/multilingual` (mC4) | 101 languages from 71 to 86 Common Crawl dumps, used to train mT5 | ~38 TiB | 6.6B pages |

The `realnewslike` variant overlaps with, but is not the same as, the **RealNews** dataset of Zellers et al. (2019), "Defending Against Neural Fake News." RealNews is itself a Common Crawl-derived 120 GB news corpus restricted to roughly 5,000 news domains indexed by Google News; the `realnewslike` C4 variant simply intersects that domain list with the C4 build (Zellers et al. 2019).[4]

The **mC4** variant was released alongside mT5 (Xue et al. 2021, NAACL) and is the multilingual analogue of C4.[3] It generalizes the C4 pipeline to 101 languages, swaps `langdetect` for CLD3 for language identification, and pulls from many more monthly Common Crawl dumps to get enough data for low-resource languages. The mC4 release reports about 6.6 billion pages and 6.3 trillion tokens across all languages.[3]
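
Because the English variants differ mainly in which filters are applied, the document counts in the variant table above give a rough measure of what each stage removes. The sketch below is arithmetic on those table figures only; it assumes the variants are otherwise comparable builds of the same snapshot, and the helper name is hypothetical.

```python
# Approximate train document counts, in millions, from the variant table.
TRAIN_DOCS_MILLIONS = {
    "c4/en": 364.6,
    "c4/en.noclean": 1060.0,
    "c4/en.noblocklist": 393.4,
    "c4/realnewslike": 13.8,
    "c4/webtextlike": 4.49,
}


def removed_fraction(before, after):
    """Fraction of documents in `before` that are absent from `after`."""
    return 1 - TRAIN_DOCS_MILLIONS[after] / TRAIN_DOCS_MILLIONS[before]


# Effect of the bad-words blocklist alone (other filters held fixed).
blocklist_effect = removed_fraction("c4/en.noblocklist", "c4/en")

# Effect of the full heuristic pipeline relative to the uncleaned variant.
full_pipeline_effect = removed_fraction("c4/en.noclean", "c4/en")

print(f"blocklist removes about {blocklist_effect:.1%} of documents")
print(f"full cleaning removes about {full_pipeline_effect:.1%} of documents")
```

On these figures the blocklist accounts for roughly 7% of documents, while the full pipeline removes about two thirds of the uncleaned English pages.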

## What is C4 used to train?

C4 is most famous as the pretraining corpus for the original T5 family of models (small, base, large, 3B and 11B, plus the LM-adapted variants). The T5 paper used the `c4/en` variant exclusively for the main results and reported small but real improvements over both unfiltered Common Crawl and over the WebText-like variant, which the authors took as evidence that domain filtering does help even when the source pool is enormous (Raffel et al. 2020).[1]

Beyond T5 itself, C4 (or a derivative) has appeared in the pretraining mix of:

- **mT5** (Xue et al. 2021), trained on mC4 across 101 languages.[3]
- **Switch Transformer** (Fedus, Zoph and Shazeer, JMLR 2022), a 1.6 trillion parameter sparse Mixture-of-Experts model whose authors specifically chose C4 to keep results comparable to T5.[5]
- **Open-weights models** including [GPT-J](/wiki/gpt-j)-style replications, MPT (MosaicML), Cerebras-GPT and Pythia variants that used C4 as one component of a larger mixture, and [Falcon](/wiki/falcon), which drew on C4 only indirectly through the design of RefinedWeb.
- **LLaMA** (Meta), whose first release used C4 as a 15% slice of its pretraining mix alongside Common Crawl, GitHub, Wikipedia and books; the Llama 2 report does not publish a comparable breakdown of its data mix (Touvron et al. 2023).[12]

By 2024, frontier-scale pretraining had moved to multi-trillion-token mixes such as FineWeb (about 15 trillion tokens) and RedPajama-V2 (about 30 trillion tokens), and C4's roughly 156 billion tokens had become small in absolute terms.[10] C4 is still routinely used as a baseline in dataset papers and as a small, cheap pretraining corpus for academic ablations and for [small language models](/wiki/small_language_model).

## What did Dodge et al. (2021) find inside C4?

The most important secondary work on C4 is Dodge, Sap, Marasovic, Agnew, Ilharco, Groeneveld, Mitchell and Gardner, "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus" (EMNLP 2021, arXiv:2104.08758). This paper performed a systematic audit of the `c4/en` corpus and is widely cited in dataset documentation work.[2]

Key findings:

- **Where the data comes from.** The single largest source domain in C4 is `patents.google.com`, followed by Wikipedia, the New York Times, Los Angeles Times, the Guardian, PLOS, Forbes, the Huffington Post and Patreon. A non-trivial fraction of the corpus is from US government and military domains and from machine-generated patent text.[2]
- **The bad-words blocklist disproportionately removes minority content.** Dodge et al. measured the effect of the blocklist on text written in different English dialects. Documents in African-American English were removed at a rate of 42% and documents in Hispanic-aligned English at 32%, compared with 6.2% for White American English. Pages from LGBTQ+ communities were also disproportionately filtered, often because non-pornographic terms like "twink" or "gay" appear on the blocklist.[2]
- **Geographic and demographic skew.** The United States is heavily over-represented; many countries with large English-speaking populations (India, Pakistan, Nigeria, the Philippines) are under-represented relative to their share of internet users.[2]
- **PII and benchmark contamination.** The audit found personally identifiable information at scale and substantial test-set contamination from popular evaluation benchmarks present in the training data.[2]
- **Document-level duplicates.** Because deduplication operates on three-sentence spans, entire near-duplicate documents (translations of the same press release, mirror sites, syndicated news) are common in the final corpus.[2]
- **Sentiment skew.** Dodge et al. measured "identity term" sentiment in C4 and found, for example, that the term "Jewish" appeared in positive-sentiment contexts 67.1% of the time, while "Arab" appeared in positive-sentiment contexts only 37.0% of the time.[2]

The Dodge et al. paper has become a standard reference for what cleaning pipelines actually do to a web corpus, and it is one of the main reasons that more recent corpora (RefinedWeb, FineWeb, Dolma) explicitly avoid the C4 blocklist or replace it with classifier-based approaches.[2]

## What did the Washington Post find in C4?

On April 19, 2023 the Washington Post published an investigation, "Inside the secret list of websites that make AI like ChatGPT sound smart," that worked with researchers at the Allen Institute for AI to categorize the websites inside an AllenAI replica of C4.[15] The Post described C4 as "a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs" and built an interactive tool letting readers search whether a given site appeared in the corpus.[15]

The investigation's main quantitative findings:

- The team ranked the top 10 million websites in C4 by token volume. The largest sources were `patents.google.com`, Wikipedia and the digital library Scribd, with the New York Times and the academic publisher PLOS ranking fourth and fifth.[15]
- Copyrighted material was widespread: the copyright symbol appeared more than 200 million times in the corpus.[15]
- Despite the "clean" in the name and the bad-words blocklist, problematic sources survived. The Post reported toxic and hateful text scraped from sites such as the white-supremacist forum Stormfront, the doxxing board Kiwi Farms and the message board 4chan, plus more than 72,000 instances of the word "swastika," one of the supposedly banned terms.[15]
- The corpus also contained sites hosting personal data, including voter-registration databases, raising privacy concerns about web-scraped pretraining data.[15]

The Post investigation translated the academic findings of Dodge et al. into a public-facing account of what a single, widely used training set actually contains, and it became one of the most-cited pieces of journalism on AI training data.

## How does C4 distribute and license its data?

### TensorFlow Datasets

The original distribution path is `tensorflow_datasets.load('c4', ...)`. TensorFlow Datasets does not host the cleaned data; instead the user runs a Beam pipeline (typically on Cloud Dataflow) that downloads the relevant Common Crawl WET files and applies the cleaning steps in the user's own environment. This is why the catalog page describes `c4` as requiring "manual preparation." The latest TFDS version is 3.1.0 (TensorFlow Datasets, c4 catalog).[13]
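
Once a build has been prepared, reading it back is a single `tfds.load` call. The sketch below is a hedged illustration: the helper names and the `data_dir` argument are assumptions, and `download=False` is used so that the call only reads an existing build rather than starting the expensive Beam preparation.

```python
def c4_tfds_name(variant="en", version="3.1.0"):
    """Build a TFDS dataset name such as 'c4/en:3.1.0'."""
    return f"c4/{variant}:{version}"


def load_prepared_c4(data_dir, variant="en", split="validation"):
    """Read an already-prepared C4 build from data_dir.

    Raises an error if the build is missing instead of trying to prepare it.
    """
    # Imported lazily because tensorflow_datasets is a heavy optional dependency.
    import tensorflow_datasets as tfds

    return tfds.load(
        c4_tfds_name(variant),
        split=split,
        data_dir=data_dir,
        download=False,
    )
```

A call such as `load_prepared_c4("/path/to/tfds_data")` would return a `tf.data.Dataset` over the validation split, where the path is a placeholder for wherever the prepared build lives.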

### AllenAI redistribution on Hugging Face

The Allen Institute for AI runs the most-used redistribution at `https://huggingface.co/datasets/allenai/c4`. Files are gzipped JSONL with three fields per record (`url`, `text`, `timestamp`); the English `train` split is sharded into 1,024 files and the `validation` split into 8 files. The redistribution is licensed under **ODC-BY 1.0** (Open Data Commons Attribution License 1.0), with the explicit note that users are also bound by the Common Crawl terms of use for the underlying content (AllenAI 2021).[14]
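
In practice most users stream the corpus rather than download every shard. The following is a minimal sketch using the Hugging Face `datasets` library; the helper names are hypothetical, and the shard naming pattern is an assumption based on the repository layout rather than something the dataset card guarantees.

```python
def shard_names(split, num_shards, variant="en"):
    """Yield shard paths, assuming the '<variant>/c4-<split>.NNNNN-of-NNNNN.json.gz' pattern."""
    for index in range(num_shards):
        yield f"{variant}/c4-{split}.{index:05d}-of-{num_shards:05d}.json.gz"


def stream_c4(variant="en", split="validation"):
    """Return an iterable dataset that fetches records lazily over the network."""
    # Imported lazily so that shard_names() works without the library installed.
    from datasets import load_dataset

    return load_dataset("allenai/c4", variant, split=split, streaming=True)


if __name__ == "__main__":
    train_shards = list(shard_names("train", 1024))
    print(len(train_shards), train_shards[0])
    # Requires network access and the `datasets` package:
    # first = next(iter(stream_c4()))
    # print(first["url"], first["timestamp"])
```

Streaming avoids materializing the roughly 305 GB of compressed JSONL on local disk, at the cost of sequential access only.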

### File format

A single record looks like:

```json
{
  "url": "https://example.com/article",
  "text": "Article content here...",
  "timestamp": "2019-04-25T12:57:54Z"
}
```
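
Records in this shape can be read with the standard library alone. The sketch below assumes a locally downloaded gzipped JSONL shard with one record per line; the function names are illustrative.

```python
import gzip
import json
from datetime import datetime, timezone


def read_records(path_or_file):
    """Yield one dict per non-empty line of a gzipped JSONL shard."""
    with gzip.open(path_or_file, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


def parse_timestamp(value):
    """Parse a timestamp such as '2019-04-25T12:57:54Z' into a UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)
```

Iterating `read_records("shard.json.gz")` yields dictionaries with the `url`, `text` and `timestamp` keys shown above, where the file name is a placeholder.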

## What corpora succeeded C4?

C4 spawned a small industry of "better-than-C4" web corpora. Most of them either keep C4's broad shape (filter Common Crawl with heuristics, deduplicate, ship JSONL) and add classifier-based quality scoring, or scale up to more dumps and more languages.

| Corpus | Year | Source | Approx size | Tokens | Deduplication | Quality filters | License | Key paper |
|--------|------|--------|-------------|--------|---------------|-----------------|---------|-----------|
| C4 (en) | 2019 / 2021 | April 2019 CC | ~750 GB / ~305 GB compressed | ~156 B | 3-sentence spans | Heuristic + bad-words list | ODC-BY 1.0 (AllenAI) | Raffel et al. 2020 |
| mC4 | 2021 | 71 to 86 CC dumps | ~38 TiB | ~6.3 T | 3-sentence spans | C4-style + CLD3 | ODC-BY 1.0 | Xue et al. 2021 |
| RealNews | 2019 | CC, Dec 2016 to Mar 2019 | 120 GB | ~30 B | URL + content hash | News domain whitelist | Apache-2.0 | Zellers et al. 2019 |
| [The Pile](/wiki/the_pile) | 2020 | 22 sub-corpora (CC + Books + GitHub + arXiv + ...) | 825 GiB | ~300 B | per-source | mostly source-side | Mixed | Gao et al. 2020 |
| RedPajama-V1 | 2023 | LLaMA recipe (CC + C4 + GitHub + arXiv + Books + Wikipedia + StackExchange) | 1.2 TB | 1.2 T | per-source | per-source | Apache-2.0 | Together Computer 2023 |
| SlimPajama | 2023 | Cleaned RedPajama-V1 | ~895 GB | 627 B | global MinHash | Cerebras filters | Apache-2.0 | Soboleva et al. 2023 |
| RefinedWeb | 2023 | CC, all dumps to 2023 | 2.8 TB | ~600 B public (~5 T full) | URL + MinHash | Heuristic, no blocklist | ODC-BY 1.0 | Penedo et al. 2023 |
| Dolma | 2024 | CC + Books + arXiv + GitHub + Reddit + ... | ~7 TB | 3 T | per-source + global | classifier-based | ODC-BY 1.0 | Soldaini et al. 2024 |
| FineWeb | 2024 | 96 CC dumps, 2013 to Apr 2024 | ~44 TB | 15 T (18.5 T extended) | global MinHash | Heuristic + classifier | ODC-BY 1.0 | Penedo et al. 2024 |
| FineWeb-Edu | 2024 | FineWeb filtered with educational classifier | ~3.5 TB | 1.3 T | inherited | Llama-3 educational classifier | ODC-BY 1.0 | Penedo et al. 2024 |
| RedPajama-V2 | 2024 | 84 CC dumps | ~270 TB raw | ~30 T | global MinHash + Bloom | quality annotations | Apache-2.0 | Together AI 2024 |

A few patterns are visible in this table. First, most later web corpora (SlimPajama, RefinedWeb, Dolma, FineWeb, RedPajama-V2) use some form of global, cross-document deduplication, typically MinHash with locality-sensitive hashing, instead of C4's three-sentence span dedup, which addresses the document-level duplicates that Dodge et al. flagged.[2] Second, the bad-words blocklist fell out of use after Dodge et al. (2021); RefinedWeb, FineWeb and Dolma all replace it with classifier-based or URL-based filtering.[2] Third, the absolute scale of frontier pretraining data has grown by roughly two orders of magnitude since C4 was released, from about 156 billion tokens to 15 to 30 trillion tokens.
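
To make the contrast with span-level deduplication concrete, here is a minimal MinHash sketch in pure Python. It is an illustration only: the shingle size, the number of hash functions and the use of salted BLAKE2b as the hash family are assumptions, and the locality-sensitive hashing step that lets real pipelines scale to billions of documents is omitted.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of overlapping k-word shingles in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text, num_perm=64, k=5):
    """One minimum per salted hash function; similar documents share many minima."""
    items = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "big",
                )
                for s in items
            )
        )
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions, which estimates Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    original = " ".join(f"word{i}" for i in range(40))
    near_copy = original.rsplit(" ", 1)[0] + " changed"
    unrelated = " ".join(f"other{i}" for i in range(40))
    sig = minhash_signature(original)
    print(estimated_jaccard(sig, minhash_signature(near_copy)))
    print(estimated_jaccard(sig, minhash_signature(unrelated)))
```

Unlike exact span matching, the signature comparison gives a graded similarity score, so a pipeline can drop documents above a chosen threshold even when no three-sentence span matches exactly.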

## Why does C4 matter?

C4 set three durable conventions for open pretraining data: (1) start from Common Crawl, (2) apply a documented cleaning pipeline, and (3) ship the full text rather than just the URL list. Before C4 the dominant open web corpus was OpenWebText, which was a URL-list reproduction of OpenAI's WebText for [GPT-2](/wiki/gpt-2); after C4 the dominant pattern is a directly downloadable JSONL corpus with a well-documented build script.

The **bad-words filter** in particular has become a cautionary example in the dataset documentation literature, cited in datasheets for almost every later open corpus (RedPajama, RefinedWeb, FineWeb, Dolma) as the canonical example of a well-intentioned filter with severe demographic side effects. The phrase "C4 blocklist" is now a shorthand in the field for that mistake (Dodge et al. 2021).[2]

For researchers training small or academic-scale models, C4 remains attractive because it is well-documented, license-clean (ODC-BY 1.0), modest in size relative to modern frontier corpora, and directly comparable to a long line of published baselines. As of 2026 it is still one of the most downloaded text datasets on the Hugging Face Hub.

## References

1. Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." *Journal of Machine Learning Research* 21 (140): 1 to 67. arXiv:1910.10683.
2. Dodge, Jesse; Sap, Maarten; Marasovic, Ana; Agnew, William; Ilharco, Gabriel; Groeneveld, Dirk; Mitchell, Margaret; Gardner, Matt (2021). "Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus." In *Proceedings of EMNLP 2021*, pp. 1286 to 1305. arXiv:2104.08758.
3. Xue, Linting; Constant, Noah; Roberts, Adam; Kale, Mihir; Al-Rfou, Rami; Siddhant, Aditya; Barua, Aditya; Raffel, Colin (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer." In *Proceedings of NAACL 2021*. arXiv:2010.11934.
4. Zellers, Rowan; Holtzman, Ari; Rashkin, Hannah; Bisk, Yonatan; Farhadi, Ali; Roesner, Franziska; Choi, Yejin (2019). "Defending Against Neural Fake News." In *Proceedings of NeurIPS 2019*. arXiv:1905.12616.
5. Fedus, William; Zoph, Barret; Shazeer, Noam (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." *Journal of Machine Learning Research* 23 (120): 1 to 39.
6. Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling." arXiv:2101.00027.
7. Together Computer (2023). "RedPajama: An Open Source Recipe to Reproduce LLaMA Training Dataset." Together AI blog and dataset card on Hugging Face.
8. Soboleva, Daria; Al-Khateeb, Faisal; Myers, Robert; Steeves, Jacob R.; Hestness, Joel; Dey, Nolan (2023). "SlimPajama: A 627B token, cleaned and deduplicated version of RedPajama." Cerebras blog post.
9. Penedo, Guilherme; Malartic, Quentin; Hesslow, Daniel; Cojocaru, Ruxandra; Cappelli, Alessandro; Alobeidli, Hamza; Pannier, Baptiste; Almazrouei, Ebtesam; Launay, Julien (2023). "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only." arXiv:2306.01116.
10. Penedo, Guilherme; Kydlicek, Hynek; Lozhkov, Anton; Mitchell, Margaret; Raffel, Colin; Werra, Leandro von; Wolf, Thomas (2024). "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557.
11. Soldaini, Luca; Kinney, Rodney; Bhagia, Akshita; Schwenk, Dustin; Atkinson, David; Authur, Russell; Bogin, Ben; Chandu, Khyathi; Dumas, Jennifer; Elazar, Yanai; et al. (2024). "Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research." In *Proceedings of ACL 2024*. arXiv:2402.00159.
12. Touvron, Hugo; Martin, Louis; Stone, Kevin; et al. (2023). "Llama 2: Open Foundation and Fine-Tuned Chat Models." arXiv:2307.09288.
13. TensorFlow Datasets, "c4" catalog page. https://www.tensorflow.org/datasets/catalog/c4
14. Allen Institute for AI (2021). `allenai/c4` dataset card on the Hugging Face Hub. https://huggingface.co/datasets/allenai/c4
15. Schaul, Kevin; Chen, Szu Yu; Tiku, Nitasha (2023). "Inside the secret list of websites that make AI like ChatGPT sound smart." *The Washington Post*, April 19, 2023. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
16. Common Crawl Foundation, monthly crawl archives. https://commoncrawl.org

