RefinedWeb

Updated Jun 23, 2026

RefinedWeb is a large-scale English pretraining dataset for large language models, built from filtered and deduplicated Common Crawl web data alone and released in June 2023 by the Technology Innovation Institute (TII) in Abu Dhabi to train the Falcon models. Its central finding, stated in the abstract, is that "properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile" ^[1]. The full corpus contains about five trillion tokens, of which a 600-billion-token extract was released publicly on Hugging Face under the ODC-By 1.0 license ^[1]^[4].

RefinedWeb was introduced in the paper The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only by Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei and Julien Launay, submitted to arXiv on 1 June 2023 ^[1]. It is built entirely from Common Crawl snapshots, with no curated sources such as books, Wikipedia, academic papers or code repositories. The paper made a strong empirical claim that ran against the conventional wisdom of the time: web data alone, if filtered and deduplicated aggressively enough, can produce models that match or beat models trained on traditionally curated mixes such as The Pile. Models trained on RefinedWeb performed on par with the original GPT-3 series within the authors' evaluation harness, despite using no books, no Wikipedia and no scholarly articles ^[1]. The dataset is the main pretraining corpus behind the Falcon family of models, including Falcon-7B, Falcon-40B and the 180-billion-parameter Falcon-180B, which was for a brief period the largest openly available LLM ^[2]^[3].

Together with two smaller reference models (Falcon-RW-1B and Falcon-RW-7B), the release served as a new open baseline for web pretraining data and has since influenced datasets such as FineWeb, Dolma and RedPajama v2.

What problem was RefinedWeb built to solve?

By early 2023, frontier language models were widely believed to depend on two ingredients: massive web crawl data plus a careful blend of curated, single-domain corpora such as books, Wikipedia, social media conversations, scientific papers and code ^[1]. The web crawl part provided scale; the curated part was thought to provide quality. As the paper put it, "this curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities" ^[1]. Most public datasets at the time, including The Pile (~340GT, where GT means billions of tokens), C4 (~360GT) and OSCAR (~370GT), reflected one half of that recipe but rarely both at the scale needed for models past 100B parameters ^[1].

Two problems were converging. The Chinchilla scaling laws of Hoffmann et al. (2022) suggested that compute-optimal training of a GPT-3 sized model required on the order of 3.5 trillion tokens, roughly twice the size of the largest pretraining datasets demonstrated at that time and ten times the size of the largest open English corpora ^[1]. Several papers also warned that the supply of high-quality, properly licensed text might run out within a few years at current scaling rates; the paper framed the open question as "whether we will run out of unique high-quality data soon" ^[1]. The Falcon team's response was to challenge the assumption that curated data was needed at all. If web crawls could be cleaned and deduplicated thoroughly enough, scale could substitute for curation, and the dataset could in principle be extended indefinitely as new Common Crawl dumps appeared.
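The 3.5-trillion-token figure can be sanity-checked with the commonly quoted Chinchilla rule of thumb of roughly 20 training tokens per parameter. That ratio is an approximation brought in here for illustration, not a number taken from the RefinedWeb paper, and the function name is hypothetical.

```python
# Rule-of-thumb sketch: compute-optimal tokens ~= 20 x parameters.
# TOKENS_PER_PARAM is an assumed approximation of the Chinchilla result.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float,
                           tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return an approximate compute-optimal training token count."""
    return n_params * tokens_per_param


gpt3_params = 175e9  # a GPT-3 sized model
tokens = compute_optimal_tokens(gpt3_params)
print(f"{tokens / 1e12:.1f} trillion tokens")  # prints "3.5 trillion tokens"
```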

The paper does not pretend that filtering is a neutral act. Excessive filtering, especially with ML-based quality classifiers, can introduce its own biases and disproportionately remove minority dialects or non-mainstream sources ^[1]. The authors stuck to "neutral filtering": rule-based heuristics and URL blocklists, with no ML quality classifier beyond fastText for language identification. This is a different philosophy from C4, which uses a stop-word and bad-word list, or from later datasets such as FineWeb-Edu, which trains an explicit classifier to score educational quality.

Who built RefinedWeb? TII and the Falcon program

The Technology Innovation Institute is the applied research arm of Abu Dhabi's Advanced Technology Research Council (ATRC), founded in May 2020 as part of a broader UAE strategy to build domestic capacity in advanced technologies. TII's AI and Digital Science Research Center is responsible for the Falcon program. The dataset was developed in collaboration with LightOn, a French AI infrastructure company, listed as the affiliation for Penedo, Hesslow, Cappelli, Pannier and Launay in the paper ^[1]. Falcon-40B was released in May 2023, with Falcon-7B shortly after. The RefinedWeb paper appeared on arXiv on 1 June 2023, alongside the public 600-billion-token data extract. Falcon-180B followed in September 2023, trained primarily on a larger internal version of RefinedWeb. The paper was later peer-reviewed and accepted at NeurIPS 2023 Datasets and Benchmarks.

What is the MacroData Refinement (MDR) pipeline?

The processing pipeline is called MacroData Refinement, abbreviated MDR. It runs in three broad phases: document preparation, filtering, and deduplication. The pipeline starts from raw WARC files (the original HTTP responses, including HTML markup) rather than the cleaner WET files that Common Crawl also publishes, because the team found WET files contained too many leftover navigation menus, ads and footers ^[1].

The authors state three explicit design principles ^[1]^[4]:

Scale first. The target size is 3 to 6 trillion tokens of English, enough to train models in the 40B to 200B parameter range to optimality under Chinchilla scaling.
Strict deduplication. Both fuzzy (MinHash) and exact (suffix array) deduplication are applied, with thresholds tuned to remove far more than other public pipelines.
Neutral filtering. No ML quality classifier is used. Filtering relies on URL blocklists, simple heuristics and language identification.

All stages combined remove about 90% of the documents in the original Common Crawl input. Measured against the original document count, roughly half is lost by the end of language identification (most of Common Crawl is not English), about a quarter more at quality filtering, and about 12% more at deduplication ^[1].

Pipeline stages

| Stage | Method | Tools / references |
| --- | --- | --- |
| URL filtering | 4.6M-domain blocklist plus a URL scoring system based on weighted word lists; common high-quality sources (Wikipedia, arXiv, etc.) are excluded so they can be added back as a curated mix | Custom blocklist, Appendix G.1 of ^[1] |
| Text extraction | Read raw WARC files with `warcio`; extract main content with trafilatura | Barbaresi (2021) |
| Language identification | fastText character-n-gram classifier from CCNet, threshold of 0.65 on the top language; only English kept for the main release | Wenzek et al. (2020) |
| Repetition removal | Drop documents with excessive repeated lines, paragraphs or n-grams | MassiveWeb heuristics, Rae et al. (2021) |
| Document-wise filtering | Quality heuristics on length, mean word length, symbol-to-word ratio, fraction of bullet points, ellipsis lines, etc. | MassiveWeb, Gopher heuristics from Rae et al. (2021) |
| Line-wise corrections | Strip menu items, like counters, share buttons, navigation lines; if more than 5% of a document is removed by the filter, drop the document | Custom, Appendix G.2 of ^[1] |
| Fuzzy deduplication | MinHash with 9,000 hashes per document over 5-grams, divided into 20 buckets of 450 hashes each | Broder (1997), tuned setup |
| Exact deduplication | Suffix-array exact substring removal of any matching span longer than 50 consecutive tokens | Lee et al. (2022) implementation |
| URL deduplication | Track URLs already kept in earlier dump partitions and drop revisits in later partitions | Custom |
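The document-wise and line-wise rows of the table can be illustrated with a minimal sketch. Only the 5% drop rule comes from the table; the marker list, the three document thresholds, and the choice to measure removal in words are placeholders assumed for this example, not the paper's values.

```python
from typing import Optional

# Hypothetical UI-residue markers, for illustration only.
UI_MARKERS = ("share this", "sign in", "read more")


def passes_document_filters(text: str, min_words: int = 50,
                            max_mean_word_len: float = 10.0,
                            max_symbol_ratio: float = 0.1) -> bool:
    """Document-wise check; all three thresholds are placeholders."""
    words = text.split()
    if len(words) < min_words:
        return False
    mean_word_len = sum(len(w) for w in words) / len(words)
    if mean_word_len > max_mean_word_len:
        return False
    # Symbol-to-word ratio over hash signs and ellipses.
    symbols = text.count("#") + text.count("...")
    return symbols / len(words) <= max_symbol_ratio


def apply_line_corrections(text: str, max_removed: float = 0.05) -> Optional[str]:
    """Strip residue lines; return None if more than 5% of the words were removed."""
    lines = text.splitlines()
    kept = [ln for ln in lines if not any(m in ln.lower() for m in UI_MARKERS)]
    total_words = sum(len(ln.split()) for ln in lines)
    removed_words = total_words - sum(len(ln.split()) for ln in kept)
    if total_words and removed_words / total_words > max_removed:
        return None  # too much of the page was residue: drop the document
    return "\n".join(kept)
```

A page that is mostly body text with one stray "Share this" line is returned with that line stripped, while a page made up mostly of such lines is dropped outright.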

The RW-Raw checkpoint refers to the data after URL filtering, text extraction and language ID, which keeps about 48% of the original CommonCrawl documents. After document and line filtering, the RW-Filtered set retains about 23% of the original. After deduplication, the final RefinedWeb is roughly 11% of the original document count ^[1].
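The retention figures above can be turned into per-stage removal rates with a few lines of arithmetic. The sketch below uses only the approximate percentages quoted in this section; the function and variable names are illustrative.

```python
from typing import Dict, List, Tuple

# Cumulative share of the original CommonCrawl documents kept at each checkpoint.
RETAINED = {"CommonCrawl": 1.00, "RW-Raw": 0.48, "RW-Filtered": 0.23, "RefinedWeb": 0.11}


def stage_report(retained: Dict[str, float]) -> List[Tuple[str, float, float]]:
    """For each checkpoint, return (name, points of original lost, % of stage input removed)."""
    names = list(retained)
    report = []
    for prev, cur in zip(names, names[1:]):
        lost_points = (retained[prev] - retained[cur]) * 100
        stage_rate = (1 - retained[cur] / retained[prev]) * 100
        report.append((cur, lost_points, stage_rate))
    return report


for name, lost, rate in stage_report(RETAINED):
    print(f"{name}: -{lost:.0f} points of the original, {rate:.0f}% of stage input removed")
```

Read this way, each of the three phases removes roughly half of whatever reaches it, even though the losses look very different when expressed as shares of the original crawl.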

Why do the MinHash settings matter?

Most prior public datasets used relatively coarse MinHash signatures: The Pile used about 10 hashes per document. RefinedWeb uses 9,000, computed over 5-grams and grouped into 20 bands, or buckets, of 450 hashes each; two documents are flagged as candidate duplicates if any one band matches exactly ^[1]. Scaling up the signature recovers far more near-duplicates than the looser settings used elsewhere. After MinHash catches whole-document templates and SEO boilerplate, the exact substring step uses suffix arrays to remove any span of more than 50 consecutive tokens shared between two documents. The two methods are complementary: MinHash removes near-duplicate pages, while exact substring matching removes shared snippets such as licenses, disclaimers or copy-pasted paragraphs that survive inside otherwise distinct pages.
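The banding mechanics can be sketched in plain Python. This is a toy illustration rather than the MDR implementation: whitespace tokenization, the seeded BLAKE2 hash family, and the scaled-down demo settings (40 hashes in 20 bands) are all assumptions, and a pure-Python version would be far too slow at 9,000 hashes per document.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of the text (whitespace tokenization is an assumption)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def _hash(shingle: Tuple[str, ...], seed: int) -> int:
    """One member of a seeded 64-bit hash family."""
    data = f"{seed}:{' '.join(shingle)}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text: str, num_hashes: int, n: int = 5) -> List[int]:
    """Signature = minimum hash value over all shingles, once per hash function."""
    grams = shingles(text, n)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def band_keys(signature: List[int], bands: int) -> List[Tuple[int, Tuple[int, ...]]]:
    """Split the signature into equal bands; each band becomes one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def is_candidate_pair(sig_a: List[int], sig_b: List[int], bands: int) -> bool:
    """Two documents are candidates if any one band matches exactly."""
    return any(ka == kb for ka, kb in zip(band_keys(sig_a, bands), band_keys(sig_b, bands)))


# Scaled-down demo settings (hypothetical): 40 hashes in 20 bands of 2.
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
sig = minhash_signature(doc, num_hashes=40)
print(is_candidate_pair(sig, minhash_signature(doc, num_hashes=40), bands=20))
```

In a real pipeline the band keys would be used to bucket documents so that only documents sharing a bucket are ever compared, instead of testing pairs one at a time as this sketch does.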

Running deduplication across all of CommonCrawl in one pass is infeasible. The team split CommonCrawl into 100 partitions and deduplicated each independently, which still catches large global clusters because they tend to appear in multiple partitions. URL-level bookkeeping then prevents the same page from being kept twice when CommonCrawl revisits it in later dumps.
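The URL bookkeeping step can be pictured as a running set of URLs carried from one partition to the next. The sketch below is a simplified illustration: it assumes exact string matching with no URL normalization, processes partitions in a fixed order, and uses a hypothetical function name.

```python
from typing import Dict, Iterable, Iterator, List, Set


def dedup_urls_across_partitions(
    partitions: Iterable[List[Dict[str, str]]],
) -> Iterator[Dict[str, str]]:
    """Yield documents whose URL was not already kept in an earlier partition."""
    seen: Set[str] = set()
    for part in partitions:
        # Compare only against URLs kept in earlier partitions.
        kept = [doc for doc in part if doc["url"] not in seen]
        seen.update(doc["url"] for doc in kept)
        yield from kept


parts = [
    [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
    [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}],
]
print([d["url"] for d in dedup_urls_across_partitions(parts)])
```

Duplicates inside a single partition are deliberately left alone here, on the assumption that the MinHash and exact substring steps have already handled them.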

How big is RefinedWeb, and what is in each record?

The full internal RefinedWeb corpus contains approximately five trillion English tokens ^[1]. The publicly released extract contains about 968 million documents, totaling around 2.8TB of plain text uncompressed (about 500GB compressed for download), and tokenizes to roughly 500 to 650 billion tokens depending on the tokenizer. The dataset card describes it as 600GT for short ^[4].

Each document carries six fields: content (cleaned plain text), url, timestamp of the crawl, dump (the CommonCrawl dump), segment (within that dump), and image_urls (a list of [image_url, alt_text] pairs found in the page). The image_urls field is unusual for a text-only pretraining set; the dataset card states that RefinedWeb is also "multimodal-friendly: it contains links and alt texts for images in processed samples," since alt text and image URL pairs let downstream researchers reconstruct image-text training data without rerunning the crawl ^[4]. No canonical train/validation split is provided. The authors recommend using validation loss only as an upstream sanity check, since perplexity does not always track downstream zero-shot accuracy ^[1].
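As a rough illustration of how the six fields might be consumed, the sketch below pulls image and alt-text pairs out of one record. The sample record is entirely made up, and the value types (for example a string timestamp, or a missing alt text represented as `None`) are assumptions rather than the dataset's documented schema.

```python
from typing import Any, Dict, List, Tuple

EXPECTED_FIELDS = {"content", "url", "timestamp", "dump", "segment", "image_urls"}

# Hypothetical record shaped like the six fields listed above; all values are invented.
sample: Dict[str, Any] = {
    "content": "Example page text.",
    "url": "https://example.com/post",
    "timestamp": "2023-01-01T00:00:00Z",
    "dump": "CC-MAIN-2023-06",
    "segment": "0000000000.00",
    "image_urls": [
        ["https://example.com/cat.jpg", "A cat on a sofa"],
        ["https://example.com/spacer.gif", None],
    ],
}


def image_text_pairs(record: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (image_url, alt_text) pairs, skipping images without alt text."""
    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        raise KeyError(f"record is missing fields: {sorted(missing)}")
    return [(url, alt) for url, alt in record["image_urls"] if alt]


print(image_text_pairs(sample))
```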

Which models were trained on RefinedWeb?

RefinedWeb was used as the main pretraining corpus, in different mixtures, for the Falcon family.

| Model | Parameters | Training tokens | Mixture | Released |
| --- | --- | --- | --- | --- |
| Falcon-RW-1B | 1.3B | 350B | RefinedWeb only | June 2023 ^[4] |
| Falcon-RW-7B | 7.5B | 350B | RefinedWeb only | June 2023 ^[4] |
| Falcon-7B | 7B | 1,500B | RefinedWeb plus curated corpora | May 2023 ^[2] |
| Falcon-40B | 40B | 1,000B | RefinedWeb plus curated corpora | May 2023 ^[3] |
| Falcon-180B | 180B | 3,500B | ~85% RefinedWeb, ~12% curated, ~3% code | September 2023 ^[5] |

Falcon-RW-1B and Falcon-RW-7B are the dataset's reference models, described on the dataset card as "two models trained on 350 billion tokens of RefinedWeb alone to demonstrate its quality compared to curated corpora" ^[4]. With no curated mix-in, they serve as a controlled comparison against GPT-3, Pythia and Cerebras-GPT at similar compute ^[1]. They are not the strongest models in the Falcon family; the production Falcon-7B and Falcon-40B both add a curated portion on top of RefinedWeb. Falcon-180B uses a smaller curated fraction (about 12%) and a slice of code (about 3%), and was trained on up to 4,096 GPUs through Amazon SageMaker for roughly 7 million GPU hours ^[5]. In all three production releases, RefinedWeb is the bulk of the training data.
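The Falcon-180B figures can be unpacked with some back-of-the-envelope arithmetic. The sketch below uses only the approximate numbers quoted above; because the GPU count is stated as "up to" 4,096, the wall-clock estimate is a lower bound that assumes every GPU was busy for the whole run.

```python
# Approximate Falcon-180B training mixture, from the figures quoted above.
TOTAL_TOKENS = 3_500e9
MIX = {"RefinedWeb": 0.85, "curated": 0.12, "code": 0.03}

tokens_by_source = {name: TOTAL_TOKENS * share for name, share in MIX.items()}
for name, n_tokens in tokens_by_source.items():
    print(f"{name}: ~{n_tokens / 1e9:.0f}B tokens")

# Lower bound on wall-clock time if all 4,096 GPUs ran for the full duration.
GPU_HOURS = 7e6
MAX_GPUS = 4096
min_wall_clock_days = GPU_HOURS / MAX_GPUS / 24
print(f"at least ~{min_wall_clock_days:.0f} days of training")
```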

How well do RefinedWeb models perform?

The central empirical result is shown in Figure 1 of the original paper: at matched compute, models trained on RefinedWeb match the GPT-3 series in zero-shot accuracy and substantially outperform open models trained on The Pile, including the authors' own internal Pile-trained 1B baseline ^[1]. Both findings ran counter to the conventional view at the time. As the abstract summarizes, "despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl" ^[1].

The authors evaluate on four task aggregates totaling 18 zero-shot tasks. Tasks include HellaSwag, LAMBADA, Winogrande, PIQA, ARC, OpenBookQA, BoolQ, COPA, CB, RTE, ReCoRD, ANLI, LogiQA, HeadQA, MathQA, PROST, PubMedQA and SciQ. Evaluation runs through the EleutherAI evaluation harness for all internal models so that comparisons are apples-to-apples within the harness; original GPT-3 paper results are flagged separately ^[1].

In small-scale ablations (1B parameters, 27B tokens; 3B parameters, 60B tokens), RefinedWeb beats both popular web datasets (C4, OSCAR-21.09, OSCAR-22.01) and The Pile. The comparison shows that filtering and deduplication independently contribute: RW-Raw underperforms RW-Filtered, which in turn underperforms the fully deduplicated final dataset ^[1]. When the MDR pipeline's filtering and deduplication are applied to other datasets, deduplication delivers a steady boost across the board, while filtering provides smaller and source-dependent gains. The Pile and OSCAR-22.01 turn out to have very high duplicate rates (45% and 60% respectively), which the paper reads as evidence that aggressive deduplication is generally underused.

The larger-scale comparison uses 1B and 7B parameter models trained on 350GT, stacked against GPT-3 babbage and curie (via the OpenAI API), GPT-Neo, GPT-J, GPT-NeoX-20B, OPT, the Pythia suite, Cerebras-GPT, Aleph Alpha Luminous and PaLM-8B. RefinedWeb-trained models outperform every open Pile-based model and match the GPT-3 series within the harness, despite RefinedWeb excluding the curated sources (Wikipedia, books, technical papers) that GPT-3 itself was trained on ^[1]. The authors do not compare against contemporaneous LLaMA models, since even LLaMA-7B was trained on about 2.5x more compute than the largest RefinedWeb model in the paper, which would have made the comparison unfair.

The paper does not claim that web data is intrinsically better than curated data, only that adequately processed web data can match curated data and that scale plus stringent deduplication can substitute for hand curation. Later work (FineWeb-Edu, Dolma's quality-classifier ablations) has shown that ML-based quality filters can improve over rule-based filtering when applied carefully, so "web is enough" is best read as an existence proof rather than a final verdict on data quality.

How does RefinedWeb compare to other open pretraining corpora?

The table below shows differences in origin, scale and deduplication between RefinedWeb and several widely used open corpora.

| Dataset | Released | Approx. size | Sources | Deduplication |
| --- | --- | --- | --- | --- |
| C4 | 2019 (T5 paper) | ~360GT (English) | Single CommonCrawl snapshot, NSFW word blocklist | Exact, spans of 3 sentences |
| The Pile | 2020 (EleutherAI) | ~340GT | 22 curated sources, ~18% web | MinHash with ~10 hashes (~26% removed) |
| RedPajama v1 | 2023 (Together AI) | ~1.2T tokens | Open reproduction of LLaMA mix (CommonCrawl, C4, GitHub, books, arXiv, Wikipedia, StackExchange) | Inherited from sources |
| RefinedWeb (public extract) | June 2023 (TII) | ~600GT (full ~5T) | CommonCrawl only | MinHash with 9,000 hashes plus exact substring (~50% combined removal at dedup stage) |
| Dolma v1 | 2023 (AI2) | ~3T tokens | CommonCrawl, The Stack, peS2o, Reddit, Wikipedia, Project Gutenberg | Bloom filter exact, MinHash fuzzy |
| RedPajama v2 | October 2023 | ~30T tokens | 84 CommonCrawl snapshots in 5 languages, with quality signals | Hash-based |
| FineWeb | April 2024 (Hugging Face) | ~15T tokens | 96 CommonCrawl snapshots | MinHash per snapshot, with iterative URL filters |

C4 is a single-snapshot, line-level filtering dataset with only weak deduplication; RefinedWeb is multi-snapshot and applies far more aggressive deduplication. The Pile is curated and small by 2024 standards, and the RefinedWeb paper showed that its specific curated mix did not transfer well to the authors' evaluation harness even after dedup. RedPajama v1 reproduces LLaMA's mix and is heterogeneous; RedPajama v2 is closer in spirit to RefinedWeb but at much larger scale and multilingual. Dolma comes from AI2 and was designed with reproducibility and open processing tools in mind. FineWeb is the most direct successor: it was led by Guilherme Penedo (the lead author of the RefinedWeb paper) after he moved to Hugging Face, and it scales the same general philosophy (heuristics plus aggressive deduplication on raw CommonCrawl) to 96 snapshots and 15 trillion tokens.

Is RefinedWeb open source, and how do you access it?

The public 600-billion-token extract is released under the Open Data Commons Attribution license (ODC-By 1.0), which permits commercial use with attribution. Users are also bound by the CommonCrawl terms of use. The dataset is hosted on Hugging Face at tiiuae/falcon-refinedweb ^[4]. Removal requests go to falconllm@tii.ae, and the underlying crawl respects robots.txt opt-out signals.
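One plausible way to look at the public extract without downloading all of it is to stream it with the Hugging Face `datasets` library. The sketch below assumes that package is installed and that the repository exposes a split named `train`; the split name and the helper functions are assumptions, while the `url` and `content` field names follow the record description earlier in this article.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator

DATASET_ID = "tiiuae/falcon-refinedweb"


def open_stream():
    """Open the public extract lazily (needs the `datasets` package and network access)."""
    from datasets import load_dataset  # imported lazily so the helper below works without it
    # split="train" is an assumption about how the repository is laid out.
    return load_dataset(DATASET_ID, split="train", streaming=True)


def preview(records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80) -> Iterator[str]:
    """Yield one short line per record: the URL and the start of the cleaned text."""
    for rec in islice(records, n):
        yield f"{rec['url']}: {rec['content'][:width]!r}"
```

Calling `preview(open_stream())` would then print the first few records as they arrive over the network, which is usually enough to check field names before committing to the full download.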

The full ~5T-token internal dataset is not public, and TII has not announced plans to release it. The only way to reproduce the full corpus from outside TII is to rerun MDR (or a similar pipeline) on raw CommonCrawl dumps, which is what FineWeb effectively did.

Limitations

The paper itself flags several limitations ^[1]. The toxicity profile, measured with the Perspective API, is roughly comparable to The Pile. The Perspective definition of toxicity is narrow ("content that is rude or disrespectful") and does not capture social bias or harm in any deep sense, so the comparison is suggestive rather than definitive. RefinedWeb is built on publicly available web pages and may contain personal information such as emails, phone numbers and IP addresses; deduplication probably reduces but does not eliminate PII ^[4].

The public dataset is English-only, even though the MDR pipeline can be applied to other languages. The Falcon-180B announcement mentions an internal RefinedWeb-Europe version, but it has not been published, which makes RefinedWeb less useful as a multilingual baseline than RedPajama v2 or FineWeb-2.

The "neutral filtering" choice is itself a tradeoff. Without an ML quality classifier, RefinedWeb keeps a long tail of pages that more aggressive pipelines such as FineWeb-Edu would discard. Whether that tail helps or hurts a given downstream task is empirical; later work suggests classifier-based filtering can produce stronger reasoning models for the same token budget. RefinedWeb's strength is that it does not bake a particular notion of "good text" into the data.

The authors also report that combining MDR with C4's stricter span-level deduplication and stop-word filtering produced subsets that slightly outperformed RefinedWeb itself, but at rejection rates so high they were not viable for trillion-token training ^[1]. More aggressive cleaning helps zero-shot scores up to a point, then cuts the token budget too far for large models. Deduplication research has not converged either: Biderman et al.'s Pythia work found relatively limited gains from deduplicating The Pile, while RefinedWeb finds substantial gains from deduplicating web data. The authors read this as evidence that deduplication is more important for web-heavy datasets than for curated mixes.

Influence and successor datasets

RefinedWeb is widely cited in the open dataset literature that followed it. It popularized running both fuzzy and exact deduplication with MinHash settings far more aggressive than The Pile and earlier datasets used. Dolma, RedPajama v2 and FineWeb all build on this lesson, and FineWeb's ablations show clear improvements from adopting RefinedWeb-style MinHash thresholds. Before RefinedWeb the default open recipe was a curated mix in the spirit of The Pile; after, several major open datasets, FineWeb most prominently, were defined as cleaned CommonCrawl with no curated mix-in. The Falcon-RW-1B and Falcon-RW-7B reference models are the empirical case for that recipe.

FineWeb (April 2024), led by Penedo at Hugging Face, scales the same general approach to 15T tokens and includes the FineWeb-Edu subset, which uses an ML quality classifier trained on Llama 3 70B annotations to filter for educational content ^[11]. Dolma adds explicit code, scientific paper and Reddit slices. RedPajama v2 multilingualizes the recipe and adds quality signals as document metadata, leaving filtering decisions to downstream users. Each can be read as a direct response to one of RefinedWeb's design choices.

Related work

The RefinedWeb pipeline draws explicitly on several prior data-cleaning efforts. CCNet (Wenzek et al., 2020) provides the fastText language identification classifier. MassiveWeb and Gopher (Rae et al., 2021) contributed document-level repetition removal and the rule-based quality filters (mean word length, fraction of bullets, fraction of stop words, ellipsis lines) that RefinedWeb adopts wholesale. Lee et al. (2022) provided the suffix-array exact substring deduplication implementation. Broder (1997) introduced the original MinHash sketch. The HTML-to-text extractor trafilatura (Barbaresi, 2021) was chosen on the basis of Lopukhin's 2019 benchmarking, which found it the best non-commercial library for blog and news content. The paper cites Hoffmann et al. (2022) on Chinchilla as motivation for needing trillions of tokens, and Dodge et al. (2021) and Welbl et al. (2021) on filter-induced bias as motivation for the "neutral filtering" stance.

References

1. Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). *The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only*. arXiv:2306.01116. Accepted at NeurIPS 2023 Datasets and Benchmarks Track. https://arxiv.org/abs/2306.01116
2. TII. *Falcon-7B model card*. Hugging Face. https://huggingface.co/tiiuae/falcon-7b
3. TII. *Falcon-40B model card*. Hugging Face. https://huggingface.co/tiiuae/falcon-40b
4. TII. *Falcon RefinedWeb dataset card*. Hugging Face. https://huggingface.co/datasets/tiiuae/falcon-refinedweb
5. Almazrouei, E. et al. *Spread Your Wings: Falcon 180B is here.* Hugging Face Blog, 6 September 2023. https://huggingface.co/blog/falcon-180b
6. Rae, J. W. et al. (2021). *Scaling Language Models: Methods, Analysis & Insights from Training Gopher*. arXiv:2112.11446.
7. Wenzek, G. et al. (2020). *CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data*. LREC 2020. arXiv:1911.00359.
8. Lee, K. et al. (2022). *Deduplicating Training Data Makes Language Models Better*. ACL 2022. arXiv:2107.06499.
9. Hoffmann, J. et al. (2022). *Training Compute-Optimal Large Language Models*. arXiv:2203.15556.
10. Barbaresi, A. (2021). *Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction*. ACL System Demonstrations.
11. Penedo, G. et al. (2024). *The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale*. arXiv:2406.17557. https://arxiv.org/abs/2406.17557
12. Soldaini, L. et al. (2024). *Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research*. arXiv:2402.00159.
