Common Crawl is a nonprofit 501(c)(3) organization that maintains a free, open repository of web crawl data. Founded in 2007 by Gil Elbaz, the organization conducts regular large-scale crawls of the public web and makes the resulting datasets available at no cost. The archive contains over 300 billion captured web pages spanning more than a decade of crawling, stored in petabytes of compressed data on Amazon Web Services. Common Crawl has become one of the most widely used data sources in natural language processing and machine learning research, with its data forming the backbone of training corpora for the majority of modern large language models.
Gil Elbaz founded the Common Crawl Foundation in 2007 after leaving his position as Engineering Director at Google, where he had worked from 2003 to 2007. Before Google, Elbaz had co-founded Applied Semantics, the company behind AdSense, which Google acquired in 2003. His vision for Common Crawl grew from the idea of "neutral data companies," open nonprofit infrastructure projects that would crawl the web and provide the resulting data free of charge to researchers and businesses alike. Elbaz was concerned about the concentration of web data in the hands of a few large search engine companies, and he believed that making a comparable crawl freely available would stimulate competition and new research.
The organization began its first crawls in 2008. The inaugural dataset, designated CC-MAIN-2008-2009, captured approximately 1.8 billion web pages and was stored in the ARC archive format, a predecessor to the WARC standard adopted later. A second crawl, CC-MAIN-2009-2010, expanded coverage to roughly 2.9 billion pages. By 2011, the foundation began releasing full-scale crawls to the public on a regular schedule.
For most of its existence, Common Crawl operated on a minimal budget with Elbaz personally financing the operation. As of 2023, the organization had only three employees. Lisa Green served as director from 2011 to 2015, bringing experience from her prior role as Chief of Staff at Creative Commons. Rich Skrenta, the founder of the Open Directory Project and the search engine Blekko, later joined as executive director. Elbaz has continued to finance and chair the organization since its founding.
Common Crawl's importance grew substantially after 2018 as the transformer architecture and large-scale pretraining became standard practice in NLP. A 2024 study by the Mozilla Foundation found that at least 64% of text-generating LLMs published between 2019 and October 2023 (30 out of 47 analyzed models) used at least one filtered version of Common Crawl data for pretraining. The organization has been cited in more than 10,000 academic studies.
Common Crawl operates its own web crawler, identified by the user-agent string CCBot/2.0. The crawler respects the robots.txt protocol; website owners can block CCBot by adding a Disallow directive for the CCBot user-agent in their robots.txt file. The crawler also honors the Crawl-delay parameter, allowing site owners to throttle request rates.
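For example, a site owner who wants to opt out of future crawls entirely could publish robots.txt rules like the following (a minimal illustration of the mechanism described above):

```text
# Block Common Crawl's crawler from the entire site:
User-agent: CCBot
Disallow: /
```

Alternatively, a Crawl-delay directive under the same User-agent line throttles CCBot without blocking it; the delay value in seconds is chosen by the site owner.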
The crawling process uses a URL database (CrawlDB) that contained roughly 25 billion URLs as of August 2023. To decide which pages to fetch, the system ranks URLs using harmonic centrality scoring, a method that prioritizes frequently linked domains. About half of each crawl consists of previously fetched URLs being refreshed, while the other half represents newly discovered pages.
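Harmonic centrality sums the reciprocals of shortest-path distances from all other nodes, so pages reachable by short incoming-link paths from many places score highest. The sketch below (pure Python, toy node names, not Common Crawl's actual implementation) computes it with a breadth-first search over reversed links:

```python
from collections import deque

def harmonic_centrality(graph, node):
    """Sum of 1/d(u, node) over all u with a path to `node`.

    `graph` maps each page to the set of pages it links to; we walk
    the reversed graph so distances follow incoming links.
    """
    reversed_graph = {u: set() for u in graph}
    for u, targets in graph.items():
        for v in targets:
            reversed_graph.setdefault(v, set()).add(u)

    # BFS outward from `node` along reversed edges.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in reversed_graph.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(1.0 / d for d in dist.values() if d > 0)

# Toy link graph: most pages link to "hub", so it scores highest.
links = {
    "a": {"hub"}, "b": {"hub"}, "c": {"a"}, "hub": {"a"},
}
```

On this toy graph, "hub" (distance 1 from a and b, distance 2 from c) scores 2.5, while "c", which nothing links to, scores 0.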
Since 2017, each crawl cycle has processed 3 to 5 billion URLs, typically yielding 2.4 to 3.0 billion successfully captured web pages per monthly snapshot. The resulting data for a single crawl ranges from roughly 350 to 470 TiB of uncompressed content. For example, the January 2025 crawl encompassed approximately 3.0 billion pages totaling 460 TiB, while the August 2025 crawl added 2.42 billion pages across 419 TiB.
The cumulative archive exceeded 9.5 petabytes by mid-2023, with hundreds of terabytes added each month in subsequent releases.
Common Crawl snapshots follow the naming pattern CC-MAIN-YYYY-WW, where YYYY is the year and WW is the ISO week number when the crawl was released. For example, CC-MAIN-2024-10 refers to the snapshot released in the tenth week of 2024.
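Because the labels follow a fixed pattern, they can be parsed mechanically; a small helper (illustrative, not an official Common Crawl API) might look like:

```python
import re

def parse_snapshot_id(snapshot):
    """Split a CC-MAIN-YYYY-WW label into (year, iso_week)."""
    match = re.fullmatch(r"CC-MAIN-(\d{4})-(\d{2})", snapshot)
    if match is None:
        raise ValueError(f"not a crawl snapshot label: {snapshot!r}")
    year, week = int(match.group(1)), int(match.group(2))
    if not 1 <= week <= 53:
        raise ValueError(f"ISO week out of range: {week}")
    return year, week

print(parse_snapshot_id("CC-MAIN-2024-10"))  # (2024, 10)
```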
Common Crawl distributes its data in four complementary formats, each designed for different use cases. The WARC standard was adopted in summer 2013 with crawl CC-MAIN-2013-20, replacing the earlier ARC format.
| Format | Full name | Contents | Primary use case |
|---|---|---|---|
| WARC | Web ARChive | Complete HTTP responses, including headers, HTML content, and crawl metadata | Full-fidelity analysis of raw web pages |
| WET | WARC Encapsulated Text | Extracted plain text without HTML markup | Text mining, language model training |
| WAT | Web Archive Transformation | Metadata from WARC files in JSON format, including HTTP headers, links, and HTML metadata | Link analysis, web graph construction, metadata research |
| CDX | Capture/Crawl inDeX | Index records mapping URLs to their positions within WARC files | Targeted retrieval of specific pages from the archive |
The WARC (Web ARChive) format is an ISO standard (ISO 28500) for archiving web content. Common Crawl's WARC files contain three record types: response records with the actual HTTP responses and page content, request records documenting how pages were fetched, and metadata records capturing information about the crawl process itself. Each record includes headers such as WARC-Date, WARC-Target-URI, and Content-Length. The raw HTTP response is stored in its entirety, preserving both headers and the full HTML body. Because WARC files preserve the original HTML, researchers can re-extract text using their own parsers, which often produces higher-quality results than the pre-extracted WET files.
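As a concrete illustration of the record layout, the sketch below parses the header block of a single abridged, hypothetical response record once decompressed; real pipelines typically use a library such as warcio rather than hand-rolled parsing:

```python
# A single uncompressed WARC response record (contents are made up).
SAMPLE_RECORD = (
    "WARC/1.0\r\n"
    "WARC-Type: response\r\n"
    "WARC-Date: 2024-03-01T12:00:00Z\r\n"
    "WARC-Target-URI: https://example.com/\r\n"
    "Content-Type: application/http; msgtype=response\r\n"
    "Content-Length: 75\r\n"
    "\r\n"
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><body>Hello</body></html>"
)

def parse_warc_record(record):
    """Return (version, headers dict, payload) for one WARC record."""
    # WARC headers end at the first blank line; the payload is the
    # raw HTTP response, which contains its own headers and body.
    head, _, payload = record.partition("\r\n\r\n")
    lines = head.split("\r\n")
    version = lines[0]                      # e.g. "WARC/1.0"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(": ")
        headers[name] = value
    return version, headers, payload

version, headers, payload = parse_warc_record(SAMPLE_RECORD)
```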
WET files contain only the extracted plain text from crawled pages, stripped of all HTML markup. The extraction is performed automatically by Common Crawl's processing pipeline. Each WET record includes WARC-style metadata headers (URL, content length, date) followed by the converted plain text. WET files are convenient for NLP tasks because researchers can skip the HTML parsing step, but the extraction quality has drawn criticism: several research groups have found that applying their own text extraction tools (such as jusText or trafilatura) to the raw WARC files produces cleaner output. Nonetheless, the early CCNet pipeline operated primarily on WET files.
WAT files hold computed metadata extracted from each WARC record, stored as compact JSON with whitespace removed. This metadata includes HTTP response headers, HTML metadata (title, scripts, meta tags), and all links found on the page. WAT files are the basis for constructing web graph datasets. For instance, the November 2025 through January 2026 web graph data comprised 279.4 million host-level nodes with 13.4 billion edges and 122.3 million domain-level nodes with 6.1 billion edges.
CDX files serve as an index that maps URLs to their positions within WARC files. This allows users to look up specific URLs across crawl archives without downloading entire snapshots, enabling targeted retrieval of individual pages.
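The CDX index can be queried over HTTP via Common Crawl's public index server at index.commoncrawl.org, which returns one JSON object per matching capture. The sketch below builds such a query URL and then parses a sample response line offline; the filename, offset, and length values are illustrative, not real index entries:

```python
import json
from urllib.parse import urlencode

def cdx_query_url(snapshot, url_pattern):
    """Build a query against the public CDX index server."""
    base = f"https://index.commoncrawl.org/{snapshot}-index"
    return base + "?" + urlencode({"url": url_pattern, "output": "json"})

print(cdx_query_url("CC-MAIN-2024-10", "example.com/*"))

# Each result line maps a capture to its byte range inside a WARC
# file, so one page can be fetched with a single HTTP Range request
# instead of downloading the whole multi-gigabyte file.
sample_line = json.dumps({
    "url": "https://example.com/", "status": "200",
    "filename": "crawl-data/CC-MAIN-2024-10/segments/0000/warc/x.warc.gz",
    "offset": "1234", "length": "5678",
})
record = json.loads(sample_line)
start = int(record["offset"])
end = start + int(record["length"]) - 1
range_header = f"bytes={start}-{end}"
```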
All Common Crawl data is hosted by Amazon Web Services (AWS) under its Open Data Sponsorship Program, which covers the storage costs. The data resides in the S3 bucket s3://commoncrawl/ in the US-East-1 (Northern Virginia) region. Users can access the data through three URL schemes:
- s3://commoncrawl/[path]
- https://ds5q9oxwqwsfj.cloudfront.net/[path]
- https://data.commoncrawl.org/[path]

For cloud-based processing, Common Crawl recommends running compute workloads in the us-east-1 region to avoid inter-region data transfer fees and to minimize latency. The data can also be downloaded directly over HTTPS for local processing. Common Crawl is listed in the AWS Registry of Open Data and is available through the AWS Marketplace.
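The same crawl file is therefore reachable under any of the three prefixes; a small helper (illustrative, not an official utility) maps a crawl-relative path to each access URL:

```python
def access_urls(path):
    """Return the S3, CloudFront, and HTTPS URLs for one crawl file."""
    path = path.lstrip("/")
    return {
        "s3": f"s3://commoncrawl/{path}",
        "cloudfront": f"https://ds5q9oxwqwsfj.cloudfront.net/{path}",
        "https": f"https://data.commoncrawl.org/{path}",
    }

# e.g. the manifest of WET files for one snapshot:
urls = access_urls("crawl-data/CC-MAIN-2024-10/wet.paths.gz")
```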
Each crawl snapshot consists of tens of thousands of individual WARC, WET, and WAT files, typically compressed with gzip. Individual files are roughly 1 GB compressed. Due to the size of the data, most large-scale processing is done using distributed computing frameworks such as Apache Spark or custom MapReduce jobs running on cloud infrastructure.
Raw Common Crawl data requires substantial filtering and cleaning before it is useful for training language models. Several processing pipelines have been developed by different research groups, each taking a slightly different approach to extracting high-quality text from the archive.
CCNet, developed by Facebook AI Research (now Meta AI) and described in a 2020 paper at LREC, is one of the most widely adopted pipelines for processing Common Crawl. It operates on WET files and performs three main steps:
1. Deduplication: WET files are grouped into 5 GB fragments. Duplicate paragraphs, which account for roughly 70% of the raw text, are removed by normalizing each paragraph (lowercasing, replacing numbers with placeholders, and stripping Unicode punctuation and diacritical marks) and computing a hash code from the first 64 bits of its SHA-1 digest.
2. Language identification: Each document is classified by language, allowing extraction of monolingual subsets.
3. Quality filtering: A KenLM 5-gram language model trained on Wikipedia assigns perplexity scores to each document. Documents with lower perplexity (closer to Wikipedia-quality prose) are retained, while boilerplate content such as navigation menus, cookie warnings, and contact information is filtered out.
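The deduplication step's normalization and hashing can be sketched in pure Python. The 64-bit SHA-1 truncation follows the paper's description; the exact normalization details below are a simplification:

```python
import hashlib
import re
import unicodedata

def normalize_paragraph(text):
    """Lowercase, replace digit runs, strip punctuation and diacritics."""
    text = text.lower()
    text = re.sub(r"\d+", "0", text)           # numbers -> placeholder
    text = unicodedata.normalize("NFD", text)  # split base + accent
    return "".join(
        ch for ch in text
        if not unicodedata.combining(ch)        # drop diacritical marks
        and unicodedata.category(ch)[0] != "P"  # drop punctuation
    )

def paragraph_hash(text):
    """First 64 bits of SHA-1 over the normalized paragraph."""
    digest = hashlib.sha1(normalize_paragraph(text).encode()).digest()
    return int.from_bytes(digest[:8], "big")

# Two superficially different paragraphs collapse to the same hash,
# so one of them would be discarded as a near-duplicate.
a = paragraph_hash("Prices rose 5% in 2023.")
b = paragraph_hash("PRICES ROSE 7% IN 2021")
```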
CCNet serves as the foundation for several derived datasets. RedPajama-V2 processed 84 Common Crawl snapshots through the CCNet pipeline, and the data used to train XLM-RoBERTa was prepared using CCNet's monolingual extraction approach.
The MacroData Refinement pipeline, used to create RefinedWeb, takes a different approach by operating on raw WARC files rather than pre-extracted WET text. Working with HTML preserves structural information lost in the WET conversion and allows for better text extraction through trafilatura. The pipeline includes URL filtering, document-level deduplication using MinHash, and line-level heuristics to remove boilerplate.
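MinHash deduplication estimates the Jaccard similarity of two documents' shingle sets without comparing them directly. A minimal pure-Python version is sketched below; the shingle size and permutation count are illustrative, and real pipelines add LSH banding so candidate pairs can be found without all-pairs comparison:

```python
import hashlib

def shingles(text, n=3):
    """The set of word n-grams ("shingles") in a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(items, num_perm=64):
    """One minimum per seeded hash function; similar sets share minima."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(),
                                digest_size=8).digest(), "big")
            for s in items))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river shore"
sig1 = minhash_signature(shingles(doc1))
sig2 = minhash_signature(shingles(doc2))
```

Since the two documents differ by a single word, most signature slots agree and the pair would be flagged as near-duplicates above a typical threshold.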
Hugging Face's FineWeb pipeline processes 96 Common Crawl snapshots (spanning 2013 through 2024) and applies aggressive deduplication, heuristic filtering, and PII (Personally Identifiable Information) removal. It uses the datatrove library for parallel processing. The pipeline also includes a variant called FineWeb-Edu that further filters for educational content using a classifier trained on annotations from Llama 3 70B Instruct.
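Regex-based PII masking of this kind can be approximated as below. The patterns are simplified illustrations, not FineWeb's actual rules, and a production pipeline would handle far more formats and edge cases:

```python
import re

# Deliberately simple patterns: emails and phone-number-like spans.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def mask_pii(text):
    """Replace emails and phone-number-like spans with placeholders."""
    text = EMAIL.sub("<EMAIL>", text)
    text = PHONE.sub("<PHONE>", text)
    return text

masked = mask_pii("Contact jane.doe@example.com or +1 555 123 4567.")
```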
While specific pipelines differ in implementation, the general approach to processing Common Crawl follows a common pattern:
| Stage | Purpose | Common tools | Typical data reduction |
|---|---|---|---|
| Text extraction | Convert HTML or WET to clean plain text | trafilatura, jusText, resiliparse | Varies |
| Language identification | Label documents by language; discard non-target languages | fastText, CLD3, langdetect | 50-80% for English-only datasets |
| Heuristic filtering | Remove short, low-quality, or boilerplate documents | Custom rules (line length, punctuation ratio, repetition) | 20-40% |
| Deduplication | Remove exact and near-duplicate documents or paragraphs | MinHash + LSH, SHA-1 hashing, URL dedup | 45-75% |
| Quality scoring | Rank documents by similarity to reference corpus | KenLM perplexity, trained classifiers | Varies by threshold |
| Safety filtering | Remove toxic, pornographic, or PII-containing content | Blocklists, toxicity classifiers, regex PII masking | 5-15% |
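The heuristic-filtering stage in the table above typically combines a handful of cheap document statistics. The sketch below is representative of the genre; the specific thresholds are illustrative, not any particular pipeline's values:

```python
def passes_heuristics(text, min_words=50, max_symbol_ratio=0.1,
                      max_dup_line_ratio=0.3):
    """Cheap quality checks applied before expensive model scoring."""
    words = text.split()
    if len(words) < min_words:
        return False                      # too short to be a document
    symbols = sum(ch in "#{}<>|" for ch in text)
    if symbols / max(len(text), 1) > max_symbol_ratio:
        return False                      # likely leftover markup or code
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines:
        dup = 1 - len(set(lines)) / len(lines)
        if dup > max_dup_line_ratio:
            return False                  # repeated boilerplate lines
    return True

good = "Common Crawl data is filtered before training language models. " * 10
boiler = "\n".join(["Accept cookies to continue"] * 80)
```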
Common Crawl has given rise to numerous curated datasets that serve as pretraining corpora for language models, multimodal models, and other AI systems. The table below lists the most widely used derived datasets.
| Dataset | Year | Creator | Size | Source snapshots | Key details | Models trained |
|---|---|---|---|---|---|---|
| C4 (Colossal Clean Crawled Corpus) | 2019 | Google Research | ~750 GB, ~156B tokens | April 2019 snapshot | Filtered from 1.4T raw tokens; 99% English via langdetect; deduplication and heuristic quality filters | T5, LaMDA |
| mC4 (Multilingual C4) | 2020 | Google Research | Varies by language | 86 Common Crawl dumps | Multilingual extension of C4 covering 101+ languages using CLD3 for language identification | mT5 |
| The Pile (Pile-CC component) | 2020 | EleutherAI | 825 GiB total (22 subsets) | Multiple snapshots | Pile-CC uses jusText on WARC files for improved extraction quality; diverse corpus with academic and professional sources | GPT-Neo, GPT-J, GPT-NeoX |
| OSCAR | 2019 | INRIA (ALMAnaCH) | Varies by version | Per-snapshot extraction | Multilingual corpus covering 168 languages, extracted from WET files with fastText language classification | Various multilingual models |
| CCNet | 2019 | Meta AI Research | Varies by language | Per-snapshot processing | Pipeline using paragraph deduplication, language ID, and KenLM perplexity scoring against Wikipedia | XLM-RoBERTa |
| LAION-5B | 2022 | LAION e.V. | 5.85B image-text pairs | Multiple snapshots | Image-text pairs from WAT files (HTML IMG tags with alt-text); filtered by CLIP ViT-B/32 (threshold 0.28 English, 0.26 other) | Stable Diffusion |
| LAION-400M | 2021 | LAION e.V. | 400M image-text pairs | Multiple snapshots | English-only predecessor to LAION-5B | Various image generation models |
| RefinedWeb | 2023 | Technology Innovation Institute | ~5T tokens (600B public) | Multiple snapshots | Built from WARC files using MacroData Refinement; heavy MinHash deduplication | Falcon |
| RedPajama v1 | 2023 | Together AI | 1.2T tokens | Multiple snapshots | Open reproduction of LLaMA training mix; Common Crawl is the largest component | MPT-7B, OpenLLaMA |
| RedPajama v2 | 2023 | Together AI | 30T tokens (deduplicated) from 100B+ documents | 84 Common Crawl snapshots | Processed through CCNet pipeline; 40+ quality annotations; five languages (English, French, Spanish, German, Italian) | Various open models |
| SlimPajama | 2023 | Cerebras | 627B tokens | Derived from RedPajama v1 | Deduplicated version of RedPajama v1; removed 49.6% of bytes through global MinHash | Cerebras-GPT |
| FineWeb | 2024 | Hugging Face | ~15T tokens, 44 TB | 96 snapshots (2013-2024) | Aggressive filtering/dedup with PII removal; outperforms C4, Dolma, and RedPajama on benchmarks; ODC-By 1.0 license | SmolLM |
| FineWeb-Edu | 2024 | Hugging Face | 1.3T tokens (score >= 3); 5.4T tokens (score >= 2) | Subset of FineWeb | Educational content filtered by classifier trained on Llama 3 70B annotations; 92% of FineWeb removed at strict threshold; 6,000 H100 GPU hours for classification | SmolLM, various |
| Dolma | 2023 | AI2 (Allen Institute for AI) | 3T tokens (v1); 2.3T tokens (v1.7) | Multiple snapshots | Mixed corpus: Common Crawl, academic papers, code, books, Wikipedia; ODC-By license | OLMo |
Common Crawl data, in various filtered forms, constitutes the single largest component of training data for most open and commercial large language models.
GPT-3, released by OpenAI in 2020, drew approximately 60% of its weighted training tokens from a filtered version of Common Crawl. By raw volume, Common Crawl represented 82% of the dataset (410 billion of roughly 500 billion tokens). The remaining sampling weight came from WebText2 (22%), two book corpora (16% combined), and Wikipedia (3%). Common Crawl was intentionally downsampled during training, so its share of the sampled training mix was 60% rather than 82%.
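The relationship between raw size and sampled weight can be made concrete. Taking the figures above (410 billion filtered Common Crawl tokens, a 60% sampling weight) together with the roughly 300 billion tokens GPT-3 saw during training, the effective number of passes over the corpus follows directly. This is a back-of-the-envelope check, not a reproduction of OpenAI's exact accounting:

```python
raw_cc_tokens = 410e9          # filtered Common Crawl corpus size
cc_weight = 0.60               # share of the sampled training mix
total_training_tokens = 300e9  # approximate tokens seen in training

effective_passes = cc_weight * total_training_tokens / raw_cc_tokens
print(f"Common Crawl effective passes: {effective_passes:.2f}")
# Below 1.0: the corpus was downsampled, not fully consumed.
```

The result, about 0.44 passes, shows that most Common Crawl tokens were seen less than once, while smaller, higher-weighted sources like Wikipedia were repeated several times.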
LLaMA, released by Meta in 2023, used two separate filtered versions of Common Crawl: one processed through the CCNet pipeline and another through C4. The authors found that using diverse Common Crawl processing pipelines improved model performance. C4 contributed about 15% of LLaMA's training tokens.
The Falcon family of models, developed by the Technology Innovation Institute, was trained almost exclusively on RefinedWeb, a dataset derived entirely from Common Crawl. This demonstrated that a single, well-filtered web source could match or exceed the performance of models trained on more diverse multi-source corpora.
T5 (Text-to-Text Transfer Transformer), released by Google Research in 2019, was trained on C4, which was derived from a single April 2019 Common Crawl snapshot. The C4 dataset was created specifically for the T5 paper and started with 1.4 trillion raw tokens from Common Crawl, filtered down to approximately 156 billion tokens (750 GB).
Stable Diffusion, developed by Stability AI and released in 2022, was trained on LAION-5B, whose 5.85 billion image-text pairs were all extracted from Common Crawl's WAT and WARC files. The LAION team parsed HTML IMG tags with alt-text attributes and then filtered the resulting pairs using CLIP embeddings, removing approximately 90% of the initial 50+ billion candidate pairs.
Common Crawl's role as a data intermediary between the open web and AI training pipelines has placed it at the center of several ongoing controversies.
Because Common Crawl archives pages from across the public web without entering into licensing agreements with content creators, questions about copyright infringement have intensified as AI companies increasingly rely on the data. As of early 2026, over thirty copyright lawsuits related to AI training data were pending in courts worldwide, with Common Crawl's archives frequently cited as a data source.
In November 2024, The Atlantic published an investigation revealing that despite publishers submitting takedown requests (including The New York Times in July 2023), Common Crawl's archives still contained the content those publishers had requested be removed. Because CCBot does not execute JavaScript, on sites whose paywalls are enforced client-side it retrieves the full article HTML before any subscription check can run. Critics argue this effectively bypasses paywalls, granting AI companies access to premium journalism without compensation.
In December 2023, The New York Times sued OpenAI and Microsoft, alleging that copyrighted Times articles were included in Common Crawl data used to train GPT models. In Denmark, the Rights Alliance (RettighedsAlliancen) pressured Common Crawl to remove content from Danish media houses; Common Crawl's attorney stated in December 2024 that approximately 50% of the requested content had been removed, more than six months after the initial request.
Whether training AI models on copyrighted material constitutes fair use under U.S. law remains an open question, with no definitive court ruling as of early 2026. The copyright question is further complicated by Common Crawl's position as an intermediary: the organization itself does not train AI models but simply provides the raw data.
Website owners can block CCBot through their robots.txt file. However, once content has been captured in a crawl snapshot, it persists in the archive even if the site later adds a block. Common Crawl has stated that it processes removal requests, but the effectiveness of this process has been questioned.
Blocking rates for CCBot have risen sharply since 2023. Analysis of over 1,100 news websites found that approximately 48% blocked Common Crawl's crawler. Among the 50 largest news publishers, CCBot was the least permitted AI crawler, with only nine of those sites allowing it. A secondary wave of restrictions appeared after August 2024, correlating with the enforcement of the EU Artificial Intelligence Act.
Common Crawl's archives contain personal information that was publicly accessible at the time of crawling: names, email addresses, phone numbers, and other identifying details embedded in web pages. When this data is used to train language models, there is a risk that the models memorize and reproduce personal information. Privacy regulations such as the GDPR impose requirements on how personal data is collected and processed, and it is not clear how bulk web archiving and subsequent use for AI training interacts with these frameworks.
Several of the processing pipelines built on Common Crawl (notably FineWeb) include PII removal steps using regular expressions and classifiers, but the raw Common Crawl data itself does not undergo such filtering.
Common Crawl explicitly states that it does not attempt to cover the entire web. The archive skews toward English-language content due to its U.S.-based infrastructure and the dominance of English on the web. This language bias propagates into downstream models trained on Common Crawl derivatives. Content from digitally underrepresented communities, smaller websites, and non-English languages appears less frequently in the archive.
Automated filtering pipelines also struggle with hate speech, pornographic content, and other problematic material. While deduplication and quality classifiers remove much of this content, no filtering pipeline catches everything, and some forms of bias are amplified through the filtering process itself (for example, perplexity-based filters trained on Wikipedia may disproportionately remove non-standard English dialects).
In addition to page content, Common Crawl publishes web graph datasets derived from the link structure in its WAT files. These graphs are available at both the host level and the domain level.
The May through July 2025 web graph release contained 481.6 million host-level nodes with 3.4 billion edges and 209.5 million domain-level nodes with 2.6 billion edges. The November 2025 through January 2026 release grew to 279.4 million host-level nodes with 13.4 billion edges and 122.3 million domain-level nodes with 6.1 billion edges. The smaller node count but larger edge count in the latter release reflects denser link coverage during that period.
These web graph datasets are used for research in link analysis, search engine development, and network science. They provide an independent alternative to web graph data controlled by commercial search engines.
One of Common Crawl's most significant effects has been lowering the barrier to entry for AI research and development. Before Common Crawl, building a language model required either running your own web crawler (which demands significant infrastructure and engineering resources) or paying for access to proprietary data. Google, Microsoft, and other search engine companies had a structural advantage because they already operated web crawlers at scale. Common Crawl removed this advantage by providing a comparable dataset for free.
This has had concrete results. EleutherAI, a grassroots collective of volunteer researchers, used Common Crawl data (via The Pile) to train GPT-NeoX-20B, one of the first open-source 20-billion-parameter models. Together AI built RedPajama to create an open replication of LLaMA's training data. Hugging Face created FineWeb to give the research community access to the largest clean web text dataset available. None of these efforts would have been possible without Common Crawl's freely available archives.
The Mozilla Foundation's 2024 report characterized Common Crawl as enabling "more LLM research and development to take place beyond well-resourced leading AI companies." At the same time, because the training data is publicly available, researchers can audit it for bias, toxicity, and other quality issues in ways that are impossible with proprietary training datasets.
The scale of Common Crawl's operation is notable for a nonprofit with a small staff. Key figures include:
| Metric | Value |
|---|---|
| Total archive size (as of mid-2023) | 9.5+ petabytes |
| URLs in CrawlDB (as of August 2023) | ~25 billion |
| Pages per monthly crawl | 2.4 to 3.0 billion |
| Uncompressed data per monthly crawl | ~350 to 470 TiB |
| Archive format | WARC (since summer 2013; ARC before) |
| Hosting | AWS S3, US-East-1 region |
| Access cost | Free (AWS Open Data Sponsorship) |
| Academic citations | 10,000+ |
Common Crawl is sometimes compared with the Internet Archive's Wayback Machine, but the two projects have different goals. The Wayback Machine is a historical archive designed to preserve web pages over time, with a focus on allowing users to view past versions of specific URLs. Common Crawl is designed as a research dataset: each monthly crawl is a broad snapshot of the web at a point in time, optimized for bulk download and computational analysis rather than individual page lookup.
Common Crawl's scope is also narrower in some respects. It focuses on publicly accessible HTML pages and does not systematically archive images, videos, or other media files (though these may appear in the raw HTML). The Wayback Machine, by contrast, archives a broader range of content types.