RefinedWeb
Last reviewed
May 2, 2026
Sources
12 citations
Review status
Source-backed
Revision
v1 · 3,461 words
RefinedWeb is a large-scale English pretraining dataset for large language models, built and released by the Technology Innovation Institute (TII) in Abu Dhabi. It was introduced in the June 2023 paper "The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only" by Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei and Julien Launay [1]. RefinedWeb is built entirely from filtered and deduplicated Common Crawl snapshots, with no curated sources such as books, Wikipedia, academic papers or code repositories.
The paper made a strong empirical claim that ran against the conventional wisdom of the time: web data alone, if filtered and deduplicated aggressively enough, can produce models that match or beat models trained on traditionally curated mixes such as The Pile. Models trained on RefinedWeb performed on par with the original GPT-3 series within the authors' evaluation harness, despite using no books, no Wikipedia and no scholarly articles [1]. The dataset is the main pretraining corpus behind the Falcon family of models, including Falcon-7B, Falcon-40B and the 180-billion-parameter Falcon-180B, which was for a brief period the largest openly available LLM [2][3].
The full RefinedWeb corpus is roughly five trillion tokens of English text, but the public version is a 600-billion-token extract published on Hugging Face under the ODC-By 1.0 license [4]. Together with two smaller reference models (Falcon-RW-1B and Falcon-RW-7B), the release served as a new open baseline for web pretraining data and has since influenced datasets such as FineWeb, Dolma and RedPajama v2.
By early 2023, frontier language models were widely believed to depend on two ingredients: massive web crawl data plus a careful blend of curated, single-domain corpora such as books, Wikipedia, social media conversations, scientific papers and code [1]. The web crawl part provided scale; the curated part was thought to provide quality. Most public datasets at the time, including The Pile (~340GT, where GT stands for gigatokens, i.e. billions of tokens), C4 (~360GT) and OSCAR (~370GT), reflected one half of that recipe but rarely both, and none reached the scale needed for models past 100B parameters [1].
Two problems were converging. The Chinchilla scaling laws of Hoffmann et al. (2022) suggested that compute-optimal training of a GPT-3 sized model required on the order of 3.5 trillion tokens, roughly twice the size of the largest pretraining datasets demonstrated at that time and ten times the size of the largest publicly available English corpora [1]. Several papers also warned that the supply of high-quality, properly licensed text might run out within a few years at current scaling rates. The Falcon team's response was to challenge the assumption that curated data was needed at all. If web crawls could be cleaned and deduplicated thoroughly enough, scale could substitute for curation, and the dataset could in principle be extended indefinitely as new Common Crawl dumps appeared.
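The 3.5-trillion figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes the common rule-of-thumb reading of the Chinchilla results, roughly 20 training tokens per parameter; that ratio is an assumption brought in for illustration and is not stated in this article.

```python
# Back-of-the-envelope check of the token budget quoted above.
# ASSUMPTION: ~20 training tokens per parameter, a rule-of-thumb reading of
# the Chinchilla results (not a figure taken from this article).
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: float, ratio: float = TOKENS_PER_PARAM) -> float:
    """Return the approximate compute-optimal number of training tokens."""
    return n_params * ratio


gpt3_params = 175e9  # GPT-3 has 175 billion parameters
budget = compute_optimal_tokens(gpt3_params)
print(f"{budget / 1e12:.1f} trillion tokens")  # prints "3.5 trillion tokens"
```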
The paper does not pretend that filtering is a neutral act. Excessive filtering, especially with ML-based quality classifiers, can introduce its own biases and disproportionately remove minority dialects or non-mainstream sources [1]. The authors stuck to "neutral filtering": rule-based heuristics and URL blocklists, with no ML quality classifier beyond fastText for language identification. This is a different philosophy from C4, which uses a stop-word and bad-word list, or from later datasets such as FineWeb-Edu, which trains an explicit classifier to score educational quality.
The Technology Innovation Institute is the applied research arm of Abu Dhabi's Advanced Technology Research Council (ATRC), founded in May 2020 as part of a broader UAE strategy to build domestic capacity in advanced technologies. TII's AI and Digital Science Research Center is responsible for the Falcon program. The dataset was developed in collaboration with LightOn, a French AI infrastructure company, listed as the affiliation for Penedo, Hesslow, Cappelli, Pannier and Launay in the paper [1]. Falcon-40B was released in May 2023, with Falcon-7B shortly after. The RefinedWeb paper appeared on arXiv on 1 June 2023, alongside the public 600-billion-token data extract. Falcon-180B followed in September 2023, trained primarily on a larger internal version of RefinedWeb. The paper was later peer-reviewed and accepted at NeurIPS 2023 Datasets and Benchmarks.
The processing pipeline is called Macrodata Refinement, abbreviated MDR. It runs in three broad phases: document preparation, filtering, and deduplication. The pipeline starts from raw WARC files (the original HTTP responses, including HTML markup) rather than the cleaner WET files that Common Crawl also publishes, because the team found WET files contained too many leftover navigation menus, ads and footers [1].
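The authors state three explicit design principles [1][4]: scale first (target trillions of tokens from CommonCrawl rather than relying on labour-intensive human curation), strict deduplication (combine exact and fuzzy deduplication with unusually aggressive settings), and neutral filtering (avoid ML-based filtering beyond language identification, relying instead on simple rules and heuristics).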
All stages combined remove about 90% of the documents in the original CommonCrawl input. Measured against the original document count, roughly half is removed up to and including language identification (most of CommonCrawl is not English), about a quarter by quality filtering, and about 12% by deduplication [1].
| Stage | Method | Tools / references |
|---|---|---|
| URL filtering | 4.6M-domain blocklist plus a URL scoring system based on weighted word lists; common high-quality sources (Wikipedia, arXiv, etc.) are excluded so they can be added back as a curated mix | Custom blocklist, Appendix G.1 of [1] |
| Text extraction | Read raw WARC files with warcio; extract main content with trafilatura | Barbaresi (2021) |
| Language identification | fastText character-n-gram classifier from CCNet, threshold of 0.65 on the top language; only English kept for the main release | Wenzek et al. (2020) |
| Repetition removal | Drop documents with excessive repeated lines, paragraphs or n-grams | MassiveWeb heuristics, Rae et al. (2021) |
| Document-wise filtering | Quality heuristics on length, mean word length, symbol-to-word ratio, fraction of bullet points, ellipsis lines, etc. | MassiveWeb, Gopher heuristics from Rae et al. (2021) |
| Line-wise corrections | Strip menu items, like counters, share buttons, navigation lines; if more than 5% of a document is removed by the filter, drop the document | Custom, Appendix G.2 of [1] |
| Fuzzy deduplication | MinHash with 9,000 hashes per document over 5-grams, divided into 20 buckets of 450 hashes each | Broder (1997), tuned setup |
| Exact deduplication | Suffix-array exact substring removal of any matching span longer than 50 consecutive tokens | Lee et al. (2022) implementation |
| URL deduplication | Track URLs already kept in earlier dump partitions and drop revisits in later partitions | Custom |
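The line-wise correction row in the table can be illustrated with a small sketch. The regular expression below is a hypothetical stand-in for the paper's actual rules (which are in Appendix G.2 of [1] and are not reproduced here), and measuring the removed fraction in characters is likewise an assumption; only the 5% drop threshold comes from the table.

```python
import re
from typing import Optional

# HYPOTHETICAL patterns for illustration only; the real rules live in
# Appendix G.2 of the paper and are more extensive.
UNWANTED_LINE = re.compile(
    r"^\s*(home|menu|share|sign in|log in|read more|\d+\s+likes?)\s*$",
    re.IGNORECASE,
)
MAX_REMOVED_FRACTION = 0.05  # the 5% threshold from the table above


def linewise_correct(text: str) -> Optional[str]:
    """Strip unwanted lines; return None if too much of the document is removed."""
    lines = text.split("\n")
    kept = [ln for ln in lines if not UNWANTED_LINE.match(ln)]
    total_chars = sum(len(ln) for ln in lines)
    removed_chars = total_chars - sum(len(ln) for ln in kept)
    # ASSUMPTION: the removed fraction is measured in characters.
    if total_chars == 0 or removed_chars / total_chars > MAX_REMOVED_FRACTION:
        return None  # the document is mostly boilerplate, so drop it entirely
    return "\n".join(kept)
```

A long article with a stray "Share" line comes back with that line stripped, while a page made up mostly of navigation entries returns `None` and is discarded.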
The RW-Raw checkpoint refers to the data after URL filtering, text extraction and language ID, which keeps about 48% of the original CommonCrawl documents. After document and line filtering, the RW-Filtered set retains about 23% of the original. After deduplication, the final RefinedWeb is roughly 11% of the original document count [1].
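The checkpoint figures above can be turned into per-stage losses with a few lines of arithmetic. This is only a worked calculation on the approximate percentages quoted in the text; the function name is made up for the example.

```python
# Fraction of the original CommonCrawl documents remaining after each
# checkpoint, using the approximate figures quoted above.
remaining = {
    "CommonCrawl input": 1.00,
    "RW-Raw": 0.48,
    "RW-Filtered": 0.23,
    "RefinedWeb": 0.11,
}


def stage_losses(remaining: dict) -> list:
    """Return (stage, loss as share of original, loss as share of stage input)."""
    names = list(remaining)
    losses = []
    for prev, cur in zip(names, names[1:]):
        absolute = remaining[prev] - remaining[cur]
        relative = absolute / remaining[prev]
        losses.append((cur, absolute, relative))
    return losses


for name, absolute, relative in stage_losses(remaining):
    print(f"{name:12s} removes {absolute:.0%} of the original, "
          f"{relative:.0%} of what reached it")
```

Read this way, each of the three phases discards roughly half of the documents that reach it, even though the later stages look small when measured against the original input.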
Most prior public datasets used relatively coarse MinHash signatures: The Pile used about 10 hashes per document. RefinedWeb uses 9,000, computed over 5-grams and grouped into 20 bands of 450 hashes (two documents are flagged as candidate duplicates if any one band matches exactly) [1]. Scaling up the signature recovers far more near-duplicates than the looser settings used elsewhere. After MinHash catches whole-document templates and SEO boilerplate, the exact substring step removes any 50-token span shared between two documents using suffix arrays. The two methods are complementary: MinHash kills near-duplicate pages, exact substring kills shared snippets like licenses, disclaimers or copy-pasted paragraphs that survive inside otherwise distinct pages.
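The banding scheme described above can be sketched in pure Python. This is a minimal illustration, not the MDR implementation: the salted BLAKE2 hash family, the whitespace tokenization and the helper names are all assumptions, and the demo uses far fewer hashes than 9,000 so that it runs quickly.

```python
import hashlib
import struct


def shingles(text: str, n: int = 5) -> set:
    """Return the set of word n-grams (5-grams by default) in a document."""
    words = text.split()  # ASSUMPTION: whitespace tokenization
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_hashes: int) -> list:
    """Keep the minimum hash value of the shingle set for each seeded hash."""
    grams = shingles(text)
    signature = []
    for seed in range(num_hashes):
        salt = struct.pack("<Q", seed)  # one 8-byte salt per hash function
        signature.append(min(
            hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest()
            for g in grams
        ))
    return signature


def band_keys(signature: list, bands: int) -> list:
    """Split the signature into equal bands; each band is one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]


def is_candidate_pair(sig_a: list, sig_b: list, bands: int) -> bool:
    """Two documents are candidate duplicates if any one band matches exactly."""
    return any(x == y for x, y in zip(band_keys(sig_a, bands), band_keys(sig_b, bands)))


# Scaled-down demo: 60 hashes in 20 bands of 3. The settings described in
# the text would be num_hashes=9000 with bands=20 (450 hashes per band).
doc = "the quick brown fox jumps over the lazy dog near the river bank today"
sig = minhash_signature(doc, num_hashes=60)
print(is_candidate_pair(sig, minhash_signature(doc, num_hashes=60), bands=20))
```

In a real pipeline the band keys would be used to bucket documents so that only documents sharing a bucket are compared, rather than testing every pair directly as `is_candidate_pair` does here.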
Running deduplication across all of CommonCrawl in one pass is infeasible. The team split CommonCrawl into 100 partitions and deduplicated each independently, which still catches large global clusters because they tend to appear in multiple partitions. URL-level bookkeeping then prevents the same page from being kept twice when CommonCrawl revisits it in later dumps.
The full internal RefinedWeb corpus contains approximately five trillion English tokens [1]. The publicly released extract contains about 968 million documents, totaling around 2.8TB of plain text uncompressed (about 500GB compressed for download), and tokenizes to roughly 500 to 650 billion tokens depending on the tokenizer. The dataset card describes it as 600GT for short [4].
Each document carries six fields: content (cleaned plain text), url, timestamp of the crawl, dump (the CommonCrawl dump), segment (within that dump), and image_urls (a list of [image_url, alt_text] pairs found in the page). The image_urls field is unusual for a text-only pretraining set; the authors describe it as making RefinedWeb "multimodal-friendly," since alt text and image URL pairs let downstream researchers reconstruct image-text training data without rerunning the crawl [4]. No canonical train/validation split is provided. The authors recommend using validation loss only as an upstream sanity check, since perplexity does not always track downstream zero-shot accuracy [1].
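As an illustration of the multimodal-friendly layout, the sketch below pulls usable image and alt-text pairs out of a single record. The record is hand-written for the example (its timestamp, dump and segment values are placeholders, not real data), and the helper name is an assumption; only the six field names come from the description above.

```python
from typing import Iterator, Mapping, Tuple


def image_text_pairs(record: Mapping) -> Iterator[Tuple[str, str]]:
    """Yield (image_url, alt_text) pairs that carry non-empty alt text."""
    for pair in record.get("image_urls") or []:
        if len(pair) != 2:
            continue  # skip malformed entries
        image_url, alt_text = pair
        if image_url and alt_text and alt_text.strip():
            yield image_url, alt_text.strip()


# Hand-written example record with the six fields described above.
# All values are placeholders chosen for illustration.
record = {
    "content": "A short page about falcons.",
    "url": "https://example.com/falcons",
    "timestamp": "2023-01-01T00:00:00Z",
    "dump": "CC-MAIN-2023-06",
    "segment": "0000000000.00",
    "image_urls": [
        ["https://example.com/falcon.jpg", "A falcon in flight"],
        ["https://example.com/spacer.gif", ""],
    ],
}

print(list(image_text_pairs(record)))
# prints [('https://example.com/falcon.jpg', 'A falcon in flight')]
```

Real records could presumably be streamed with the Hugging Face `datasets` library, for example `load_dataset("tiiuae/falcon-refinedweb", streaming=True)`, and passed through the same helper one at a time.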
RefinedWeb was used as the main pretraining corpus, in different mixtures, for the Falcon family.
| Model | Parameters | Training tokens | Mixture | Released |
|---|---|---|---|---|
| Falcon-RW-1B | 1.3B | 350B | RefinedWeb only | June 2023 [4] |
| Falcon-RW-7B | 7.5B | 350B | RefinedWeb only | June 2023 [4] |
| Falcon-7B | 7B | 1,500B | RefinedWeb plus curated corpora | May 2023 [2] |
| Falcon-40B | 40B | 1,000B | RefinedWeb plus curated corpora | May 2023 [3] |
| Falcon-180B | 180B | 3,500B | ~85% RefinedWeb, ~12% curated, ~3% code | September 2023 [5] |
Falcon-RW-1B and Falcon-RW-7B are the dataset's reference models. They are trained on 350 billion tokens drawn purely from RefinedWeb, with no curated mix-in, and are intended as a controlled comparison against GPT-3, Pythia and Cerebras-GPT at similar compute [1]. They are not the strongest models in the Falcon family; the production Falcon-7B and Falcon-40B both add a curated portion on top of RefinedWeb. Falcon-180B uses a smaller curated fraction (about 12%) and a slice of code (about 3%), and was trained on up to 4,096 GPUs through Amazon SageMaker for roughly 7 million GPU hours [5]. In all three production releases, RefinedWeb is the bulk of the training data.
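For a sense of scale, the Falcon-180B figures can be converted into per-source token counts and a minimum wall-clock time. This is plain arithmetic on the approximate numbers quoted above; because the text says "up to" 4,096 GPUs, the duration is a lower bound rather than a reported training time.

```python
# Approximate Falcon-180B figures quoted above.
total_tokens = 3_500e9
mixture = {"RefinedWeb": 0.85, "curated": 0.12, "code": 0.03}

# Tokens contributed by each source under the approximate mixture.
tokens = {source: total_tokens * share for source, share in mixture.items()}
for source, count in tokens.items():
    print(f"{source:10s} ~{count / 1e9:,.0f}B tokens")

# Lower bound on wall-clock time if all 4,096 GPUs ran for the whole job.
gpu_hours = 7e6
max_gpus = 4096
min_days = gpu_hours / max_gpus / 24
print(f"at least ~{min_days:.0f} days of training")
```

Under these assumptions the RefinedWeb share comes to roughly 3 trillion tokens, which fits within the roughly five-trillion-token internal corpus.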
The central empirical result is shown in Figure 1 of the original paper: at matched compute, models trained on RefinedWeb match the GPT-3 series in zero-shot accuracy and substantially outperform open models trained on The Pile, including the authors' own internal Pile-trained 1B baseline [1]. Both findings ran counter to the conventional view at the time.
The authors evaluate on four task aggregates totaling 18 zero-shot tasks. Tasks include HellaSwag, LAMBADA, Winogrande, PIQA, ARC, OpenBookQA, BoolQ, COPA, CB, RTE, ReCoRD, ANLI, LogiQA, HeadQA, MathQA, PROST, PubMedQA and SciQ. Evaluation runs through the EleutherAI evaluation harness for all internal models so that comparisons are apples-to-apples within the harness; original GPT-3 paper results are flagged separately [1].
In small-scale ablations (1B parameters, 27B tokens; 3B parameters, 60B tokens), RefinedWeb beats both popular web datasets (C4, OSCAR-21.09, OSCAR-22.01) and The Pile. The comparison shows that filtering and deduplication independently contribute: RW-Raw underperforms RW-Filtered, which in turn underperforms the fully deduplicated final dataset [1]. When the MDR pipeline's filtering and deduplication are applied to other datasets, deduplication delivers a steady boost across the board, while filtering provides smaller and source-dependent gains. The Pile and OSCAR-22.01 turn out to have very high duplicate rates (45% and 60% respectively), which the paper reads as evidence that aggressive deduplication is generally underused.
The larger-scale comparison uses 1B and 7B parameter models trained on 350GT, stacked against GPT-3 babbage and curie (via the OpenAI API), GPT-Neo, GPT-J, GPT-NeoX-20B, OPT, the Pythia suite, Cerebras-GPT, Aleph Alpha Luminous and PaLM-8B. RefinedWeb-trained models outperform every open Pile-based model and match the GPT-3 series within the harness, despite RefinedWeb excluding the curated sources (Wikipedia, books, technical papers) that GPT-3 itself was trained on [1]. The authors do not compare against contemporaneous LLaMA models, since even LLaMA-7B was trained on about 2.5x more compute than the largest RefinedWeb model in the paper, which would have made the comparison unfair.
The paper does not claim that web data is intrinsically better than curated data, only that adequately processed web data can match curated data and that scale plus stringent deduplication can substitute for hand curation. Later work (FineWeb-Edu, Dolma's quality-classifier ablations) has shown that ML-based quality filters can improve over rule-based filtering when applied carefully, so "web is enough" is best read as an existence proof rather than a final verdict on data quality.
The table below shows differences in origin, scale and deduplication between RefinedWeb and several widely used open corpora.
| Dataset | Released | Approx. size | Sources | Deduplication |
|---|---|---|---|---|
| C4 | 2019 (T5 paper) | ~360GT (English) | Single CommonCrawl snapshot, NSFW word blocklist | Exact, spans of 3 sentences |
| The Pile | 2020 (EleutherAI) | ~340GT | 22 curated sources, ~18% web | MinHash with ~10 hashes (~26% removed) |
| RedPajama v1 | 2023 (Together AI) | ~1.2T tokens | Open reproduction of LLaMA mix (CommonCrawl, C4, GitHub, books, arXiv, Wikipedia, StackExchange) | Inherited from sources |
| RefinedWeb (public extract) | June 2023 (TII) | ~600GT (full ~5T) | CommonCrawl only | MinHash with 9,000 hashes plus exact substring (~50% combined removal at dedup stage) |
| Dolma v1 | 2023 (AI2) | ~3T tokens | CommonCrawl, The Stack, peS2o, Reddit, Wikipedia, Project Gutenberg | Bloom filter exact, MinHash fuzzy |
| RedPajama v2 | October 2023 | ~30T tokens | 84 CommonCrawl snapshots in 5 languages, with quality signals | Hash-based |
| FineWeb | April 2024 (Hugging Face) | ~15T tokens | 96 CommonCrawl snapshots | MinHash per snapshot, with iterative URL filters |
C4 is a single-snapshot, line-level filtering dataset with only weak deduplication; RefinedWeb is multi-snapshot and applies far more aggressive deduplication. The Pile is curated and small by 2024 standards, and the RefinedWeb paper showed that its specific curated mix did not transfer well to the authors' evaluation harness even after dedup. RedPajama v1 reproduces LLaMA's mix and is heterogeneous; RedPajama v2 is closer in spirit to RefinedWeb but at much larger scale and multilingual. Dolma comes from AI2 and was designed with reproducibility and open processing tools in mind. FineWeb is the most direct successor: it was led by Guilherme Penedo (the lead author of the RefinedWeb paper) after he moved to Hugging Face, and it scales the same general philosophy (heuristics plus aggressive deduplication on raw CommonCrawl) to 96 snapshots and 15 trillion tokens.
The public 600-billion-token extract is released under the Open Data Commons Attribution license (ODC-By 1.0), which permits commercial use with attribution. Users are also bound by the CommonCrawl terms of use. The dataset is hosted on Hugging Face at tiiuae/falcon-refinedweb [4]. Removal requests go to falconllm@tii.ae, and the underlying crawl respects robots.txt opt-out signals.
The full ~5T-token internal dataset is not public, and TII has not announced plans to release it. The only way to reproduce the full corpus from outside TII is to rerun MDR (or a similar pipeline) on raw CommonCrawl dumps, which is what FineWeb effectively did.
The paper itself flags several limitations [1]. The toxicity profile, measured with the Perspective API, is roughly comparable to The Pile. The Perspective definition of toxicity is narrow ("content that is rude or disrespectful") and does not capture social bias or harm in any deep sense, so the comparison is suggestive rather than definitive. RefinedWeb is built on publicly available web pages and may contain personal information such as emails, phone numbers and IP addresses; deduplication probably reduces but does not eliminate PII [4].
The public dataset is English-only, even though the MDR pipeline can be applied to other languages. The Falcon-180B announcement mentions an internal RefinedWeb-Europe version, but it has not been published, which makes RefinedWeb less useful as a multilingual baseline than RedPajama v2 or FineWeb-2.
The "neutral filtering" choice is itself a tradeoff. Without an ML quality classifier, RefinedWeb keeps a long tail of pages that more aggressive pipelines such as FineWeb-Edu would discard. Whether that tail helps or hurts a given downstream task is empirical; later work suggests classifier-based filtering can produce stronger reasoning models for the same token budget. RefinedWeb's strength is that it does not bake a particular notion of "good text" into the data.
The authors also report that combining MDR with C4's stricter span-level deduplication and stop-word filtering produced subsets that slightly outperformed RefinedWeb itself, but at rejection rates so high they were not viable for trillion-token training [1]. More aggressive cleaning helps zero-shot scores up to a point, then cuts the token budget too far for large models. Deduplication research has not converged either: Biderman et al.'s Pythia work found relatively limited gains from deduplicating The Pile, while RefinedWeb finds substantial gains from deduplicating web data. The authors read this as evidence that deduplication is more important for web-heavy datasets than for curated mixes.
RefinedWeb is widely cited in the open dataset literature that followed it. It popularized running both fuzzy and exact deduplication with MinHash settings far more aggressive than The Pile and earlier datasets used. Dolma, RedPajama v2 and FineWeb all build on this lesson, and FineWeb's ablations show clear improvements from adopting RefinedWeb-style MinHash thresholds. Before RefinedWeb the default open recipe was a curated mix in the spirit of The Pile; after, several major open datasets, FineWeb most prominently, were defined as cleaned CommonCrawl with no curated mix-in. The Falcon-RW-1B and Falcon-RW-7B reference models are the empirical case for that recipe.
FineWeb (April 2024), led by Penedo at Hugging Face, scales the same general approach to 15T tokens and includes the FineWeb-Edu subset, which uses an ML quality classifier trained on Llama 3 70B annotations to filter for educational content. Dolma adds explicit code, scientific paper and Reddit slices. RedPajama v2 multilingualizes the recipe and adds quality signals as document metadata, leaving filtering decisions to downstream users. Each can be read as a direct response to one of RefinedWeb's design choices.
The RefinedWeb pipeline draws explicitly on several prior data-cleaning efforts. CCNet (Wenzek et al., 2020) provides the fastText language identification classifier. MassiveWeb and Gopher (Rae et al., 2021) contributed document-level repetition removal and the rule-based quality filters (mean word length, fraction of bullets, fraction of stop words, ellipsis lines) that RefinedWeb adopts wholesale. Lee et al. (2022) provided the suffix-array exact substring deduplication implementation. Broder (1997) introduced the original MinHash sketch. The HTML-to-text extractor trafilatura (Barbaresi, 2021) was chosen on the basis of Lopukhin's 2019 benchmarking, which found it the best non-commercial library for blog and news content. The paper cites Hoffmann et al. (2022) on Chinchilla as motivation for needing trillions of tokens, and Dodge et al. (2021) and Welbl et al. (2021) on filter-induced bias as motivation for the "neutral filtering" stance.