Common Crawl is a nonprofit 501(c)(3) organization that maintains a free, open repository of web crawl data. Founded in 2007 by Gil Elbaz, the organization conducts regular large-scale crawls of the public web and makes the resulting datasets available at no cost. The archive contains over 300 billion captured web pages spanning more than a decade of crawling, stored in petabytes of compressed data on Amazon Web Services. Common Crawl has become one of the most widely used data sources in natural language processing and machine learning research, with its data forming the backbone of training corpora for the majority of modern large language models.
Gil Elbaz founded the Common Crawl Foundation in 2007 after leaving his position as Engineering Director at Google, where he had worked from 2003 to 2007. Before Google, Elbaz had co-founded Applied Semantics, the company behind AdSense, which Google acquired in 2003. His vision for Common Crawl grew from the idea of "neutral data companies," open nonprofit infrastructure projects that would crawl the web and provide the resulting data free of charge to researchers and businesses alike. Elbaz was concerned about the concentration of web data in the hands of a few large search engine companies, and he believed that making a comparable crawl freely available would stimulate competition and new research.
The organization began its first crawls in 2008. The inaugural dataset, designated CC-MAIN-2008-2009, captured approximately 1.8 billion web pages and was stored in the ARC archive format, a predecessor to the WARC standard adopted later. A second crawl, CC-MAIN-2009-2010, expanded coverage to roughly 2.9 billion pages. By 2011, the foundation began releasing full-scale crawls to the public on a regular schedule.
For most of its existence, Common Crawl operated on a minimal budget with Elbaz personally financing the operation. As of 2023, the organization had only three employees. Lisa Green served as director from 2011 to 2015, bringing experience from her prior role as Chief of Staff at Creative Commons. Rich Skrenta, the founder of the Open Directory Project and the search engine Blekko, later joined as executive director. Elbaz has continued to finance and chair the organization since its founding.
Common Crawl's importance grew substantially after 2018 as the transformer architecture and large-scale pretraining became standard practice in NLP. A 2024 study by the Mozilla Foundation found that at least 64% of text-generating LLMs published between 2019 and October 2023 (30 out of 47 analyzed models) used at least one filtered version of Common Crawl data for pretraining. The organization has been cited in more than 10,000 academic studies.
Common Crawl operates its own web crawler, identified by the user-agent string CCBot/2.0. The crawler respects the robots.txt protocol; website owners can block CCBot by adding a Disallow directive for the CCBot user-agent in their robots.txt file. The crawler also honors the Crawl-delay parameter, allowing site owners to throttle request rates.
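For example, a site owner who wants to opt out of future crawls entirely could publish robots.txt rules like the following (a minimal illustration of the mechanism described above):

```text
# Block Common Crawl's crawler from the entire site:
User-agent: CCBot
Disallow: /
```

Alternatively, a Crawl-delay directive under the same User-agent line throttles CCBot without blocking it; the delay value in seconds is chosen by the site owner.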
The crawling process uses a URL database (CrawlDB) that contained roughly 25 billion URLs as of August 2023. To decide which pages to fetch, the system ranks URLs using harmonic centrality scoring, a method that prioritizes frequently linked domains. About half of each crawl consists of previously fetched URLs being refreshed, while the other half represents newly discovered pages.
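Harmonic centrality sums the reciprocals of shortest-path distances from all other nodes, so pages reachable by short incoming-link paths from many places score highest. The sketch below (pure Python, toy node names, not Common Crawl's actual implementation) computes it with a breadth-first search over reversed links:

```python
from collections import deque

def harmonic_centrality(graph, node):
    """Sum of 1/d(u, node) over all u with a path to `node`.

    `graph` maps each page to the set of pages it links to; we walk
    the reversed graph so distances follow incoming links.
    """
    reversed_graph = {u: set() for u in graph}
    for u, targets in graph.items():
        for v in targets:
            reversed_graph.setdefault(v, set()).add(u)

    # BFS outward from `node` along reversed edges.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in reversed_graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(1.0 / d for d in dist.values() if d > 0)

# Toy link graph: most pages link to "hub", so it scores highest.
links = {
    "a": {"hub"}, "b": {"hub"}, "c": {"a"}, "hub": {"a"},
}
```

On this toy graph, "hub" (distance 1 from a and b, distance 2 from c) scores 2.5, while "c", which nothing links to, scores 0.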
Since 2017, each crawl cycle has processed 3 to 5 billion URLs, typically yielding 2.4 to 3.0 billion successfully captured web pages per monthly snapshot. The resulting data for a single crawl ranges from roughly 350 to 470 TiB of uncompressed content. For example, the January 2025 crawl encompassed approximately 3.0 billion pages totaling 460 TiB, while the August 2025 crawl added 2.42 billion pages across 419 TiB.
The cumulative archive exceeded 9.5 petabytes by mid-2023, with hundreds of terabytes added each month in subsequent releases.
Common Crawl snapshots follow the naming pattern CC-MAIN-YYYY-WW, where YYYY is the year and WW is the ISO week number when the crawl was released. For example, CC-MAIN-2024-10 refers to the snapshot released in the tenth week of 2024.
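Because the labels follow a fixed pattern, they can be parsed mechanically; a small helper (illustrative, not an official Common Crawl API) might look like:

```python
import re

def parse_snapshot_id(snapshot):
    """Split a CC-MAIN-YYYY-WW label into (year, iso_week)."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", snapshot)
    if match is None:
        raise ValueError(f"not a crawl snapshot label: {snapshot!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"ISO week out of range: {week}")
    return year, week

print(parse_snapshot_id("CC-MAIN-2024-10"))  # (2024, 10)
```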
Common Crawl distributes its data in four complementary formats, each designed for different use cases. The WARC standard was adopted in summer 2013 with crawl CC-MAIN-2013-20, replacing the earlier ARC format.
| Format | Full name | Contents | Primary use case |
|---|---|---|---|
| WARC | Web ARChive | Complete HTTP responses, including headers, HTML content, and crawl metadata | Full-fidelity analysis of raw web pages |
| WET | WARC Encapsulated Text | Extracted plain text without HTML markup | Text mining, language model training |
| WAT | Web Archive Transformation | Metadata from WARC files in JSON format, including HTTP headers, links, and HTML metadata | Link analysis, web graph construction, metadata research |
| CDX | Capture/Crawl inDeX | Index records mapping URLs to their positions within WARC files | Targeted retrieval of specific pages from the archive |
The WARC (Web ARChive) format is an ISO standard (ISO 28500) for archiving web content. Common Crawl's WARC files contain three record types: response records with the actual HTTP responses and page content, request records documenting how pages were fetched, and metadata records capturing information about the crawl process itself. Each record includes headers such as WARC-Date, WARC-Target-URI, and Content-Length. The raw HTTP response is stored in its entirety, preserving both headers and the full HTML body. Because WARC files preserve the original HTML, researchers can re-extract text using their own parsers, which often produces higher-quality results than the pre-extracted WET files.
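As a concrete illustration of the record layout, the sketch below parses the header block of a single abridged, hypothetical response record once decompressed; real pipelines typically use a library such as warcio rather than hand-rolled parsing:

```python
# A single uncompressed WARC response record (contents are made up).
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2024-03-01T12:00:00Z\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 75\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

def parse_warc_record(record):
    """Return (version, headers dict, payload) for one WARC record."""
    # WARC headers end at the first blank line; the payload is the
    # raw HTTP response, which contains its own headers and body.
    head, _, payload = record.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]                      # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return version, headers, payload

version, headers, payload = parse_warc_record(SAMPLE_RECORD)
```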
WET files contain only the extracted plain text from crawled pages, stripped of all HTML markup. The extraction is performed automatically by Common Crawl's processing pipeline. Each WET record includes WARC-style metadata headers (URL, content length, date) followed by the converted plain text. WET files are convenient for NLP tasks because researchers can skip the HTML parsing step, but the extraction quality has drawn criticism: several research groups have found that applying their own text extraction tools (such as jusText or trafilatura) to the raw WARC files produces cleaner output. Nonetheless, the early CCNet pipeline operated primarily on WET files.
WAT files hold computed metadata extracted from each WARC record, stored as compact JSON with whitespace removed. This metadata includes HTTP response headers, HTML metadata (title, scripts, meta tags), and all links found on the page. WAT files are the basis for constructing web graph datasets. For instance, the November 2025 through January 2026 web graph data comprised 279.4 million host-level nodes with 13.4 billion edges and 122.3 million domain-level nodes with 6.1 billion edges.
CDX files serve as an index that maps URLs to their positions within WARC files. This allows users to look up specific URLs across crawl archives without downloading entire snapshots, enabling targeted retrieval of individual pages.
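The CDX index can be queried over HTTP via Common Crawl's public index server at index.commoncrawl.org, which returns one JSON object per matching capture. The sketch below builds such a query URL and then parses a sample response line offline; the filename, offset, and length values are illustrative, not real index entries:

```python
import json
from urllib.parse import urlencode

def cdx_query_url(snapshot, url_pattern):
    """Build a query against the public CDX index server."""
    base = f"https://index.commoncrawl.org/{snapshot}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

print(cdx_query_url("CC-MAIN-2024-10", "example.com/*"))

# Each result line maps a capture to its byte range inside a WARC
# file, so one page can be fetched with a single HTTP Range request
# instead of downloading the whole multi-gigabyte file.
sample_line = json.dumps({
    "url": "https://example.com/", "status": "200",
    "filename": "crawl-data/CC-MAIN-2024-10/segments/0000/warc/x.warc.gz",
    "offset": "1234", "length": "5678",
})
record = json.loads(sample_line)
start = int(record["offset"])
end = start + int(record["length"]) - 1
range_header = f"bytes={start}-{end}"
```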
All Common Crawl data is hosted by Amazon Web Services (AWS) under its Open Data Sponsorship Program, which covers the storage costs. The data resides in the S3 bucket s3://commoncrawl/ in the US-East-1 (Northern Virginia) region. Users can access the data through three URL schemes:
- s3://commoncrawl/[path]
- https://ds5q9oxwqwsfj.cloudfront.net/[path]
- https://data.commoncrawl.org/[path]

For cloud-based processing, Common Crawl recommends running compute workloads in the us-east-1 region to avoid inter-region data transfer fees and to minimize latency. The data can also be downloaded directly over HTTPS for local processing. Common Crawl is listed in the AWS Registry of Open Data and is available through the AWS Marketplace.
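The same crawl file is therefore reachable under any of the three prefixes; a small helper (illustrative, not an official utility) maps a crawl-relative path to each access URL:

```python
def access_urls(path):
    """Return the S3, CloudFront, and HTTPS URLs for one crawl file."""
    path = path.lstrip("/")
    return {
        "s3": f"s3://commoncrawl/{path}",
        "cloudfront": f"https://ds5q9oxwqwsfj.cloudfront.net/{path}",
        "https": f"https://data.commoncrawl.org/{path}",
    }

# e.g. the manifest of WET files for one snapshot:
urls = access_urls("crawl-data/CC-MAIN-2024-10/wet.paths.gz")
```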
Each crawl snapshot consists of tens of thousands of individual WARC, WET, and WAT files, typically compressed with gzip. Individual files are roughly 1 GB compressed. Due to the size of the data, most large-scale processing is done using distributed computing frameworks such as Apache Spark or custom MapReduce jobs running on cloud infrastructure.
Raw Common Crawl data requires substantial filtering and cleaning before it is useful for training language models. Several processing pipelines have been developed by different research groups, each taking a slightly different approach to extracting high-quality text from the archive.
CCNet, developed by Facebook AI Research (now Meta AI) and described in a 2020 paper at LREC, is one of the most widely adopted pipelines for processing Common Crawl. It operates on WET files and performs three main steps:
1. Deduplication: WET files are grouped into 5 GB fragments. Duplicate paragraphs, which account for roughly 70% of the raw text, are removed by normalizing each paragraph (lowercasing, replacing numbers with placeholders, and stripping Unicode punctuation and diacritical marks) and computing a hash code from the first 64 bits of its SHA-1 digest.
2. Language identification: Each document is classified by language, allowing extraction of monolingual subsets.
3. Quality filtering: A KenLM 5-gram language model trained on Wikipedia assigns perplexity scores to each document. Documents with lower perplexity (closer to Wikipedia-quality prose) are retained, while boilerplate content such as navigation menus, cookie warnings, and contact information is filtered out.
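The deduplication step's normalization and hashing can be sketched in pure Python. The 64-bit SHA-1 truncation follows the paper's description; the exact normalization details below are a simplification:

```python
import hashlib
import re
import unicodedata

def normalize_paragraph(text):
    """Lowercase, replace digit runs, strip punctuation and diacritics."""
    text = text.lower()
    text = re.sub(r"\d+", "0", text)           # numbers -> placeholder
    text = unicodedata.normalize("NFD", text)  # split base + accent
    return "".join(
        ch for ch in text
        if not unicodedata.combining(ch)        # drop diacritical marks
        and unicodedata.category(ch)[0] != "P"  # drop punctuation
    )

def paragraph_hash(text):
    """First 64 bits of SHA-1 over the normalized paragraph."""
    digest = hashlib.sha1(normalize_paragraph(text).encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Two superficially different paragraphs collapse to the same hash,
# so one of them would be discarded as a near-duplicate.
a = paragraph_hash("Prices rose 5% in 2023.")
b = paragraph_hash("PRICES ROSE 7% IN 2021")
```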
CCNet serves as the foundation for several derived datasets. RedPajama-V2 processed 84 Common Crawl snapshots through the CCNet pipeline, and the data used to train XLM-RoBERTa was prepared using CCNet's monolingual extraction approach.
The MacroData Refinement pipeline, used to create RefinedWeb, takes a different approach by operating on raw WARC files rather than pre-extracted WET text. Working with HTML preserves structural information lost in the WET conversion and allows for better text extraction through trafilatura. The pipeline includes URL filtering, document-level deduplication using MinHash, and line-level heuristics to remove boilerplate.
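MinHash deduplication estimates the Jaccard similarity of two documents' shingle sets without comparing them directly. A minimal pure-Python version is sketched below; the shingle size and permutation count are illustrative, and real pipelines add LSH banding so candidate pairs can be found without all-pairs comparison:

```python
import hashlib

def shingles(text, n=3):
    """The set of word n-grams ("shingles") in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function; similar sets share minima."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
```

Since the two documents differ by a single word, most signature slots agree and the pair would be flagged as near-duplicates above a typical threshold.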
Hugging Face's FineWeb pipeline processes 96 Common Crawl snapshots (spanning 2013 through 2024) and applies aggressive deduplication, heuristic filtering, and PII (Personally Identifiable Information) removal. It uses the datatrove library for parallel processing. The pipeline also includes a variant called FineWeb-Edu that further filters for educational content using a classifier trained on annotations from Llama 3 70B Instruct.
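Regex-based PII masking of this kind can be approximated as below. The patterns are simplified illustrations, not FineWeb's actual rules, and a production pipeline would handle far more formats and edge cases:

```python
import re

# Deliberately simple patterns: emails and phone-number-like spans.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text):
    """Replace emails and phone-number-like spans with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

masked = mask_pii("Contact jane.doe@example.com or +1 555 123 4567.")
```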
While specific pipelines differ in implementation, the general approach to processing Common Crawl follows a common pattern:
| Stage | Purpose | Common tools | Typical data reduction |
|---|---|---|---|
| Text extraction | Convert HTML or WET to clean plain text | trafilatura, jusText, resiliparse | Varies |
| Language identification | Label documents by language; discard non-target languages | fastText, CLD3, langdetect | 50-80% for English-only datasets |
| Heuristic filtering | Remove short, low-quality, or boilerplate documents | Custom rules (line length, punctuation ratio, repetition) | 20-40% |
| Deduplication | Remove exact and near-duplicate documents or paragraphs | MinHash + LSH, SHA-1 hashing, URL dedup | 45-75% |
| Quality scoring | Rank documents by similarity to reference corpus | KenLM perplexity, trained classifiers | Varies by threshold |
| Safety filtering | Remove toxic, pornographic, or PII-containing content | Blocklists, toxicity classifiers, regex PII masking | 5-15% |
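The heuristic-filtering stage in the table above typically combines a handful of cheap document statistics. The sketch below is representative of the genre; the specific thresholds are illustrative, not any particular pipeline's values:

```python
def passes_heuristics(text, min_words=50, max_symbol_ratio=0.1,
                      max_dup_line_ratio=0.3):
    """Cheap quality checks applied before expensive model scoring."""
    words = text.split()
    if len(words) < min_words:
        return False                      # too short to be a document
    symbols = sum(ch in "#{}<>|" for ch in text)
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False                      # likely leftover markup or code
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup = 1 - len(set(lines)) / len(lines)
        if dup > max_dup_line_ratio:
            return False                  # repeated boilerplate lines
    return True

good = "Common Crawl data is filtered before training language models. " * 10
boiler = "\n".join(["Accept cookies to continue"] * 80)
```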
Common Crawl has given rise to numerous curated datasets that serve as pretraining corpora for language models, multimodal models, and other AI systems. The table below lists the most widely used derived datasets.
| Dataset | Year | Creator | Size | Source snapshots | Key details | Models trained |
|---|---|---|---|---|---|---|
| C4 (Colossal Clean Crawled Corpus) | 2019 | Google Research | ~750 GB, ~156B tokens | April 2019 snapshot | Filtered from 1.4T raw tokens; 99% English via langdetect; deduplication and heuristic quality filters | T5, LaMDA |
| mC4 (Multilingual C4) | 2020 | Google Research | Varies by language | 86 Common Crawl dumps | Multilingual extension of C4 covering 101+ languages using CLD3 for language identification | mT5 |
| The Pile (Pile-CC component) | 2020 | EleutherAI | 825 GiB total (22 subsets) | Multiple snapshots | Pile-CC uses jusText on WARC files for improved extraction quality; diverse corpus with academic and professional sources | GPT-Neo, GPT-J, GPT-NeoX |
| OSCAR | 2019 | INRIA (ALMAnaCH) | Varies by version | Per-snapshot extraction | Multilingual corpus covering 168 languages, extracted from WET files with fastText language classification | Various multilingual models |
| CCNet | 2019 | Meta AI Research | Varies by language | Per-snapshot processing | Pipeline using paragraph deduplication, language ID, and KenLM perplexity scoring against Wikipedia | XLM-RoBERTa |
| LAION-5B | 2022 | LAION e.V. | 5.85B image-text pairs | Multiple snapshots | Image-text pairs from WAT files (HTML IMG tags with alt-text); filtered by CLIP ViT-B/32 (threshold 0.28 English, 0.26 other) | Stable Diffusion |
| LAION-400M | 2021 | LAION e.V. | 400M image-text pairs | Multiple snapshots | English-only predecessor to LAION-5B | Various image generation models |
| RefinedWeb | 2023 | Technology Innovation Institute | ~5T tokens (600B public) | Multiple snapshots | Built from WARC files using MacroData Refinement; heavy MinHash deduplication | Falcon |
| RedPajama v1 | 2023 | Together AI | 1.2T tokens | Multiple snapshots | Open reproduction of LLaMA training mix; Common Crawl is the largest component | MPT-7B, OpenLLaMA |
| RedPajama v2 | 2023 | Together AI | 30T tokens (deduplicated) from 100B+ documents | 84 Common Crawl snapshots | Processed through CCNet pipeline; 40+ quality annotations; five languages (English, French, Spanish, German, Italian) | Various open models |
| SlimPajama | 2023 | Cerebras | 627B tokens | Derived from RedPajama v1 | Deduplicated version of RedPajama v1; removed 49.6% of bytes through global MinHash | Cerebras-GPT |
| FineWeb | 2024 | Hugging Face | ~15T tokens, 44 TB | 96 snapshots (2013-2024) | Aggressive filtering/dedup with PII removal; outperforms C4, Dolma, and RedPajama on benchmarks; ODC-By 1.0 license | SmolLM |
| FineWeb-Edu | 2024 | Hugging Face | 1.3T tokens (score >= 3); 5.4T tokens (score >= 2) | Subset of FineWeb | Educational content filtered by classifier trained on Llama 3 70B annotations; 92% of FineWeb removed at strict threshold; 6,000 H100 GPU hours for classification | SmolLM, various |
| Dolma | 2023 | AI2 (Allen Institute for AI) | 3T tokens (v1); 2.3T tokens (v1.7) | Multiple snapshots | Mixed corpus: Common Crawl, academic papers, code, books, Wikipedia; ODC-By license | OLMo |
Common Crawl data, in various filtered forms, constitutes the single largest component of training data for most open and commercial large language models.
GPT-3, released by OpenAI in 2020, drew approximately 60% of its weighted training tokens from a filtered version of Common Crawl. By raw volume, Common Crawl represented 82% of the dataset (410 billion of roughly 500 billion tokens). The remaining sampling weight came from WebText2 (22%), two book corpora (16% combined), and Wikipedia (3%). Common Crawl was intentionally downsampled during training, so its share of the sampled training mix was 60% rather than 82%.
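The relationship between raw size and sampled weight can be made concrete. Taking the figures above (410 billion filtered Common Crawl tokens, a 60% sampling weight) together with the roughly 300 billion tokens GPT-3 saw during training, the effective number of passes over the corpus follows directly. This is a back-of-the-envelope check, not a reproduction of OpenAI's exact accounting:

```python
raw_cc_tokens = 410e9          # filtered Common Crawl corpus size
cc_weight = 0.60               # share of the sampled training mix
total_training_tokens = 300e9  # approximate tokens seen in training

effective_passes = cc_weight * total_training_tokens / raw_cc_tokens
print(f"Common Crawl effective passes: {effective_passes:.2f}")
# Below 1.0: the corpus was downsampled, not fully consumed.
```

The result, about 0.44 passes, shows that most Common Crawl tokens were seen less than once, while smaller, higher-weighted sources like Wikipedia were repeated several times.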
LLaMA, released by Meta in 2023, used two separate filtered versions of Common Crawl: one processed through the CCNet pipeline and another through C4. The authors found that using diverse Common Crawl processing pipelines improved model performance. C4 contributed about 15% of LLaMA's training tokens.
The Falcon family of models, developed by the Technology Innovation Institute, was trained almost exclusively on RefinedWeb, a dataset derived entirely from Common Crawl. This demonstrated that a single, well-filtered web source could match or exceed the performance of models trained on more diverse multi-source corpora.
T5 (Text-to-Text Transfer Transformer), released by Google Research in 2019, was trained on C4, which was derived from a single April 2019 Common Crawl snapshot. The C4 dataset was created specifically for the T5 paper and started with 1.4 trillion raw tokens from Common Crawl, filtered down to approximately 156 billion tokens (750 GB).
Stable Diffusion, developed by Stability AI and released in 2022, was trained on LAION-5B, whose 5.85 billion image-text pairs were all extracted from Common Crawl's WAT and WARC files. The LAION team parsed HTML IMG tags with alt-text attributes and then filtered the resulting pairs using CLIP embeddings, removing approximately 90% of the initial 50+ billion candidate pairs.
Common Crawl's role as a data intermediary between the open web and AI training pipelines has placed it at the center of several ongoing controversies.
Because Common Crawl archives pages from across the public web without entering into licensing agreements with content creators, questions about copyright infringement have intensified as AI companies increasingly rely on the data. As of early 2026, over thirty copyright lawsuits related to AI training data were pending in courts worldwide, with Common Crawl's archives frequently cited as a data source.
In November 2024, The Atlantic published an investigation revealing that despite publishers submitting takedown requests (including The New York Times in July 2023), Common Crawl's archives still contained the content those publishers had requested be removed. Because CCBot does not execute JavaScript, on sites whose paywalls are enforced client-side it retrieves the full article HTML before any subscription check can run. Critics argue this effectively bypasses paywalls, granting AI companies access to premium journalism without compensation.
In December 2023, The New York Times sued OpenAI and Microsoft, alleging that copyrighted Times articles were included in Common Crawl data used to train GPT models. In Denmark, the Rights Alliance (RettighedsAlliancen) pressured Common Crawl to remove content from Danish media houses; Common Crawl's attorney stated in December 2024 that approximately 50% of the requested content had been removed, more than six months after the initial request.
Whether training AI models on copyrighted material constitutes fair use under U.S. law remains an open question, with no definitive court ruling as of early 2026. The copyright question is further complicated by Common Crawl's position as an intermediary: the organization itself does not train AI models but simply provides the raw data.
Website owners can block CCBot through their robots.txt file. However, once content has been captured in a crawl snapshot, it persists in the archive even if the site later adds a block. Common Crawl has stated that it processes removal requests, but the effectiveness of this process has been questioned.
Blocking rates for CCBot have risen sharply since 2023. Analysis of over 1,100 news websites found that approximately 48% blocked Common Crawl's crawler. Among the 50 largest news publishers, CCBot was the least permitted AI crawler, with only nine of those sites allowing it. A secondary wave of restrictions appeared after August 2024, correlating with the enforcement of the EU Artificial Intelligence Act.
Common Crawl's archives contain personal information that was publicly accessible at the time of crawling: names, email addresses, phone numbers, and other identifying details embedded in web pages. When this data is used to train language models, there is a risk that the models memorize and reproduce personal information. Privacy regulations such as the GDPR impose requirements on how personal data is collected and processed, and it is not clear how bulk web archiving and subsequent use for AI training interacts with these frameworks.
Several of the processing pipelines built on Common Crawl (notably FineWeb) include PII removal steps using regular expressions and classifiers, but the raw Common Crawl data itself does not undergo such filtering.
Common Crawl explicitly states that it does not attempt to cover the entire web. The archive skews toward English-language content due to its U.S.-based infrastructure and the dominance of English on the web. This language bias propagates into downstream models trained on Common Crawl derivatives. Content from digitally underrepresented communities, smaller websites, and non-English languages appears less frequently in the archive.
Automated filtering pipelines also struggle with hate speech, pornographic content, and other problematic material. While deduplication and quality classifiers remove much of this content, no filtering pipeline catches everything, and some forms of bias are amplified through the filtering process itself (for example, perplexity-based filters trained on Wikipedia may disproportionately remove non-standard English dialects).
In addition to page content, Common Crawl publishes web graph datasets derived from the link structure in its WAT files. These graphs are available at both the host level and the domain level.
The May through July 2025 web graph release contained 481.6 million host-level nodes with 3.4 billion edges and 209.5 million domain-level nodes with 2.6 billion edges. The November 2025 through January 2026 release grew to 279.4 million host-level nodes with 13.4 billion edges and 122.3 million domain-level nodes with 6.1 billion edges. The smaller node count but larger edge count in the latter release reflects denser link coverage during that period.
These web graph datasets are used for research in link analysis, search engine development, and network science. They provide an independent alternative to web graph data controlled by commercial search engines.
One of Common Crawl's most significant effects has been lowering the barrier to entry for AI research and development. Before Common Crawl, building a language model required either running your own web crawler (which demands significant infrastructure and engineering resources) or paying for access to proprietary data. Google, Microsoft, and other search engine companies had a structural advantage because they already operated web crawlers at scale. Common Crawl removed this advantage by providing a comparable dataset for free.
This has had concrete results. EleutherAI, a grassroots collective of volunteer researchers, used Common Crawl data (via The Pile) to train GPT-NeoX-20B, one of the first open-source 20-billion-parameter models. Together AI built RedPajama to create an open replication of LLaMA's training data. Hugging Face created FineWeb to give the research community access to the largest clean web text dataset available. None of these efforts would have been possible without Common Crawl's freely available archives.
The Mozilla Foundation's 2024 report characterized Common Crawl as enabling "more LLM research and development to take place beyond well-resourced leading AI companies." At the same time, because the training data is publicly available, researchers can audit it for bias, toxicity, and other quality issues in ways that are impossible with proprietary training datasets.
The scale of Common Crawl's operation is notable for a nonprofit with a small staff. Key figures include:
| Metric | Value |
|---|---|
| Total archive size (as of mid-2023) | 9.5+ petabytes |
| URLs in CrawlDB (as of August 2023) | ~25 billion |
| Pages per monthly crawl | 2.4 to 3.0 billion |
| Uncompressed data per monthly crawl | ~350 to 470 TiB |
| Archive format | WARC (since summer 2013; ARC before) |
| Hosting | AWS S3, US-East-1 region |
| Access cost | Free (AWS Open Data Sponsorship) |
| Academic citations | 10,000+ |
Common Crawl is sometimes compared with the Internet Archive's Wayback Machine, but the two projects have different goals. The Wayback Machine is a historical archive designed to preserve web pages over time, with a focus on allowing users to view past versions of specific URLs. Common Crawl is designed as a research dataset: each monthly crawl is a broad snapshot of the web at a point in time, optimized for bulk download and computational analysis rather than individual page lookup.
Common Crawl's scope is also narrower in some respects. It focuses on publicly accessible HTML pages and does not systematically archive images, videos, or other media files (though these may appear in the raw HTML). The Wayback Machine, by contrast, archives a broader range of content types.