# RefinedWeb

> Source: https://aiwiki.ai/wiki/refinedweb
> Updated: 2026-06-23
> Categories: Data & Datasets, Large Language Models, Open Source AI
> License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
> From AI Wiki (https://aiwiki.ai), the free encyclopedia of artificial intelligence. Reuse freely with attribution to "AI Wiki (aiwiki.ai)".

**RefinedWeb** is a large-scale English [pretraining](/wiki/pretraining) [dataset](/wiki/dataset) for [large language models](/wiki/large_language_model), built from filtered and deduplicated [Common Crawl](/wiki/common_crawl) web data alone and released in June 2023 by the [Technology Innovation Institute](/wiki/tii) (TII) in Abu Dhabi to train the [Falcon](/wiki/falcon) models. Its central finding, stated in the abstract, is that "properly filtered and deduplicated web data alone can lead to powerful models; even significantly outperforming models from the state-of-the-art trained on The Pile" [1]. The full corpus contains about five trillion tokens, of which a 600-billion-token extract was released publicly on [Hugging Face](/wiki/hugging_face) under the ODC-By 1.0 license [1][4].

RefinedWeb was introduced in the paper *The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only* by Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei and Julien Launay, submitted to arXiv on 1 June 2023 [1]. It is built entirely from Common Crawl snapshots, with no curated sources such as books, Wikipedia, academic papers or code repositories. The paper made a strong empirical claim that ran against the conventional wisdom of the time: web data alone, if filtered and deduplicated aggressively enough, can produce models that match or beat models trained on traditionally curated mixes such as [The Pile](/wiki/the_pile). Models trained on RefinedWeb performed on par with the original [GPT-3](/wiki/gpt-3) series within the authors' evaluation harness, despite using no books, no Wikipedia and no scholarly articles [1]. The dataset is the main pretraining corpus behind the [Falcon](/wiki/falcon) family of models, including Falcon-7B, Falcon-40B and the 180-billion-parameter Falcon-180B, which was for a brief period the largest openly available [LLM](/wiki/llm) [2][3].

Together with two smaller reference models (Falcon-RW-1B and Falcon-RW-7B), the release served as a new open baseline for web pretraining data and has since influenced datasets such as [FineWeb](/wiki/fineweb), Dolma and [RedPajama](/wiki/red_pajama) v2.

## What problem was RefinedWeb built to solve?

By early 2023, frontier language models were widely believed to depend on two ingredients: massive web crawl data plus a careful blend of curated, single-domain corpora such as books, Wikipedia, social media conversations, scientific papers and code [1]. The web crawl part provided scale; the curated part was thought to provide quality. As the paper put it, "this curation process is believed to be necessary to produce performant models with broad zero-shot generalization abilities" [1]. Most public datasets at the time, including The Pile (~340GT, where GT stands for gigatokens, i.e. billions of tokens), [C4](/wiki/c4_dataset) (~360GT) and OSCAR (~370GT), reflected one half of that recipe or the other, but rarely both at the scale needed for models past 100B parameters [1].

Two problems were converging. The [Chinchilla](/wiki/chinchilla_scaling) scaling laws of Hoffmann et al. (2022) suggested that compute-optimal training of a GPT-3-sized model required on the order of 3.5 trillion tokens, roughly twice the size of the largest pretraining datasets demonstrated at that time and ten times the size of the largest publicly available English corpora [1]. Several papers also warned that the supply of high-quality, properly licensed text might run out within a few years at current scaling rates; the paper framed the open question as "whether we will run out of unique high-quality data soon" [1]. The Falcon team's response was to challenge the assumption that curated data was needed at all. If web crawls could be cleaned and deduplicated thoroughly enough, scale could substitute for curation, and the dataset could in principle be extended indefinitely as new Common Crawl dumps appeared.
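
The 3.5-trillion-token figure can be reproduced with a common rule-of-thumb reading of Hoffmann et al. (2022): roughly 20 training tokens per parameter at the compute-optimal point. The ratio of 20 and the function name below are assumptions of this sketch, not values quoted from the RefinedWeb paper.

```python
# Rule-of-thumb reading of the Chinchilla result: about 20 training tokens
# per model parameter. This ratio is an approximation assumed for the sketch.
TOKENS_PER_PARAM = 20


def compute_optimal_tokens(n_params: int, ratio: int = TOKENS_PER_PARAM) -> int:
    """Return the approximate compute-optimal token count for a model size."""
    return n_params * ratio


gpt3_params = 175_000_000_000  # GPT-3 has 175B parameters
tokens = compute_optimal_tokens(gpt3_params)
print(f"{tokens / 1e12:.1f} trillion tokens")
```

For a 175B-parameter model this gives 3.5 trillion tokens, matching the order of magnitude cited above.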

The paper does not pretend that filtering is a neutral act. Excessive filtering, especially with ML-based quality classifiers, can introduce its own biases and disproportionately remove minority dialects or non-mainstream sources [1]. The authors stuck to "neutral filtering": rule-based heuristics and URL blocklists, with no ML quality classifier beyond [fastText](/wiki/fasttext) for language identification. This is a different philosophy from C4, which uses a stop-word and bad-word list, or from later datasets such as FineWeb-Edu, which trains an explicit classifier to score educational quality.

## Who built RefinedWeb? TII and the Falcon program

The Technology Innovation Institute is the applied research arm of Abu Dhabi's Advanced Technology Research Council (ATRC), founded in May 2020 as part of a broader UAE strategy to build domestic capacity in advanced technologies. TII's AI and Digital Science Research Center is responsible for the Falcon program. The dataset was developed in collaboration with LightOn, a French AI infrastructure company, listed as the affiliation for Penedo, Hesslow, Cappelli, Pannier and Launay in the paper [1]. Falcon-40B was released in May 2023, with Falcon-7B shortly after. The RefinedWeb paper appeared on arXiv on 1 June 2023, alongside the public 600-billion-token data extract. Falcon-180B followed in September 2023, trained primarily on a larger internal version of RefinedWeb. The paper was later peer-reviewed and accepted at NeurIPS 2023 Datasets and Benchmarks.

## What is the MacroData Refinement (MDR) pipeline?

The processing pipeline is called **MacroData Refinement**, abbreviated MDR. It runs in three broad phases: document preparation, filtering, and deduplication. The pipeline starts from raw WARC files (the original HTTP responses, including HTML markup) rather than the cleaner WET files that Common Crawl also publishes, because the team found WET files contained too many leftover navigation menus, ads and footers [1].

The authors state three explicit design principles [1][4]:

1. **Scale first.** The target size is 3 to 6 trillion tokens of English, enough to train models in the 40B to 200B parameter range to optimality under Chinchilla scaling.
2. **Strict deduplication.** Both fuzzy (MinHash) and exact (suffix array) deduplication are applied, with thresholds tuned to remove far more than other public pipelines.
3. **Neutral filtering.** No ML quality classifier is used. Filtering relies on URL blocklists, simple heuristics and language identification.

All stages combined remove about 90% of the documents in the original Common Crawl input. Measured against the original document count, roughly half is dropped by the stages up to and including language identification (most of Common Crawl is not English), about a quarter by quality filtering, and about 12% by deduplication [1].

### Pipeline stages

| Stage | Method | Tools / references |
|-------|--------|---------------------|
| URL filtering | 4.6M-domain blocklist plus a URL scoring system based on weighted word lists; common high-quality sources (Wikipedia, arXiv, etc.) are excluded so they can be added back as a curated mix | Custom blocklist, Appendix G.1 of [1] |
| Text extraction | Read raw WARC files with `warcio`; extract main content with [trafilatura](/wiki/trafilatura) | Barbaresi (2021) |
| Language identification | fastText character-n-gram classifier from CCNet, threshold of 0.65 on the top language; only English kept for the main release | Wenzek et al. (2020) |
| Repetition removal | Drop documents with excessive repeated lines, paragraphs or n-grams | MassiveWeb heuristics, Rae et al. (2021) |
| Document-wise filtering | Quality heuristics on length, mean word length, symbol-to-word ratio, fraction of bullet points, ellipsis lines, etc. | MassiveWeb, [Gopher](/wiki/gopher) heuristics from Rae et al. (2021) |
| Line-wise corrections | Strip menu items, like counters, share buttons, navigation lines; if more than 5% of a document is removed by the filter, drop the document | Custom, Appendix G.2 of [1] |
| Fuzzy deduplication | MinHash with 9,000 hashes per document over 5-grams, divided into 20 buckets of 450 hashes each | Broder (1997), tuned setup |
| Exact deduplication | Suffix-array exact substring removal of any matching span longer than 50 consecutive tokens | Lee et al. (2022) implementation |
| URL deduplication | Track URLs already kept in earlier dump partitions and drop revisits in later partitions | Custom |
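
The line-wise correction row can be sketched as follows. This is a minimal illustration, not the paper's implementation: the regular expressions are hypothetical stand-ins for the rules in Appendix G.2 of [1], and measuring the removed share in words is an assumption of this sketch. Only the 5% threshold comes from the table.

```python
import re

# Hypothetical boilerplate patterns, for illustration only.
BOILERPLATE_PATTERNS = [
    re.compile(r"^\s*(home|menu|sign in|log in|share|tweet)\s*$", re.IGNORECASE),
    re.compile(r"^\s*\d+\s+(likes?|shares?|comments?)\s*$", re.IGNORECASE),
]

MAX_REMOVED_FRACTION = 0.05  # the 5% rule from the table above


def clean_document(text: str):
    """Strip boilerplate lines; return None if too much of the document was removed."""
    lines = text.splitlines()
    kept = [ln for ln in lines if not any(p.match(ln) for p in BOILERPLATE_PATTERNS)]

    # Assumption: the removed share is measured in whitespace-separated words.
    total_words = sum(len(ln.split()) for ln in lines)
    kept_words = sum(len(ln.split()) for ln in kept)
    if total_words == 0:
        return None

    removed_fraction = (total_words - kept_words) / total_words
    if removed_fraction > MAX_REMOVED_FRACTION:
        return None  # mostly boilerplate: drop the whole document
    return "\n".join(kept)
```

A long article with a single stray "Share" line survives with that line stripped, while a page that is mostly menu items is dropped entirely.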

The `RW-Raw` checkpoint refers to the data after URL filtering, text extraction and language ID, which keeps about 48% of the original CommonCrawl documents. After document and line filtering, the `RW-Filtered` set retains about 23% of the original. After deduplication, the final RefinedWeb is roughly 11% of the original document count [1].
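
The three checkpoints can be turned into per-stage losses with a few lines of arithmetic. The sketch below uses only the rounded retention figures quoted above; the function and variable names are illustrative.

```python
# Fraction of the original CommonCrawl documents kept at each checkpoint,
# using the approximate figures quoted in the text.
KEPT = {"RW-Raw": 0.48, "RW-Filtered": 0.23, "RefinedWeb": 0.11}


def stage_losses(kept):
    """Return (stage, points removed of the original, share of stage input removed)."""
    rows = []
    previous = 1.0
    for stage, fraction in kept.items():
        points = round((previous - fraction) * 100)            # percentage points of the original
        relative = round((previous - fraction) / previous, 2)  # share of what entered the stage
        rows.append((stage, points, relative))
        previous = fraction
    return rows


for stage, points, relative in stage_losses(KEPT):
    print(f"{stage}: -{points} points of the original, {relative:.0%} of its input")
```

With these rounded inputs the losses are 52, 25 and 12 percentage points of the original, and each group of stages removes roughly half of the documents that reach it.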

### Why do the MinHash settings matter?

Most prior public datasets used relatively coarse MinHash signatures: The Pile used about 10 hashes per document. RefinedWeb uses 9,000, computed over 5-grams and grouped into 20 bands of 450 hashes (two documents are flagged as candidate duplicates if any one band matches exactly) [1]. Scaling up the signature recovers far more near-duplicates than the looser settings used elsewhere. After MinHash catches whole-document templates and SEO boilerplate, the exact substring step uses suffix arrays to remove any span of more than 50 consecutive tokens shared between two documents. The two methods are complementary: MinHash removes near-duplicate pages, while exact substring matching removes shared snippets such as licenses, disclaimers or copy-pasted paragraphs that survive inside otherwise distinct pages.
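
The banding mechanism can be shown with a toy MinHash. This sketch assumes word-level 5-grams, salted BLAKE2b as the hash family, and a deliberately tiny signature (20 hashes in 4 bands of 5) so it runs instantly; none of these choices are taken from the paper, whose settings are 9,000 hashes in 20 bands of 450.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set:
    """Word-level n-grams of a document (short texts yield a single shingle)."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(text: str, num_hashes: int = 20, n: int = 5) -> list:
    """One minimum per salted hash function, taken over all shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def band_keys(signature: list, bands: int) -> list:
    """Split the signature into equal-sized bands, each used as one bucket key."""
    rows = len(signature) // bands
    return [tuple(signature[b * rows:(b + 1) * rows]) for b in range(bands)]


def is_candidate_pair(a: str, b: str, num_hashes: int = 20, bands: int = 4) -> bool:
    """Flag two documents if any one band of their signatures matches exactly."""
    keys_a = band_keys(minhash_signature(a, num_hashes), bands)
    keys_b = band_keys(minhash_signature(b, num_hashes), bands)
    return any(x == y for x, y in zip(keys_a, keys_b))
```

Identical documents always collide in every band, and documents with no shared 5-gram essentially never do. In a real pipeline the band keys would be used to bucket documents rather than compared pairwise.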

Running deduplication across all of CommonCrawl in one pass is infeasible. The team split CommonCrawl into 100 partitions and deduplicated each independently, which still catches large global clusters because they tend to appear in multiple partitions. URL-level bookkeeping then prevents the same page from being kept twice when CommonCrawl revisits it in later dumps.
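
The URL-level bookkeeping can be sketched as a single pass over partitions in order. This is a simplified illustration: the in-memory set and the record layout (a dict with a `url` key) are assumptions of the sketch, not details from the paper.

```python
from typing import Iterable, Iterator


def drop_revisited_urls(partitions: Iterable[Iterable[dict]]) -> Iterator[dict]:
    """Yield documents whose URL was not already kept in an earlier partition."""
    seen_earlier = set()
    for partition in partitions:  # processed in order, earliest first
        kept_here = set()
        for doc in partition:
            if doc["url"] in seen_earlier:
                continue  # the crawler revisited a page that was already kept
            kept_here.add(doc["url"])
            yield doc
        # URLs only count as "earlier" once their partition is finished.
        seen_earlier |= kept_here


partitions = [
    [{"url": "https://example.com/a"}, {"url": "https://example.com/b"}],
    [{"url": "https://example.com/b"}, {"url": "https://example.com/c"}],
]
print([doc["url"] for doc in drop_revisited_urls(partitions)])
```

Here the second visit to `https://example.com/b` is dropped because it was kept in the first partition.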

## How big is RefinedWeb, and what is in each record?

The full internal RefinedWeb corpus contains approximately five trillion English tokens [1]. The publicly released extract contains about 968 million documents, totaling around 2.8TB of plain text uncompressed (about 500GB compressed for download), and tokenizes to roughly 500 to 650 billion tokens depending on the [tokenizer](/wiki/tokenization). The dataset card describes it as 600GT for short [4].

Each document carries six fields: `content` (cleaned plain text), `url`, `timestamp` of the crawl, `dump` (the CommonCrawl dump), `segment` (within that dump), and `image_urls` (a list of `[image_url, alt_text]` pairs found in the page). The `image_urls` field is unusual for a text-only pretraining set; the dataset card states that RefinedWeb is also "multimodal-friendly: it contains links and alt texts for images in processed samples," since alt text and image URL pairs let downstream researchers reconstruct image-text training data without rerunning the crawl [4]. No canonical train/validation split is provided. The authors recommend using validation loss only as an upstream sanity check, since perplexity does not always track downstream zero-shot accuracy [1].
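
As an illustration of the record layout, the sketch below pulls usable image and alt-text pairs out of one record. The sample record is invented for the example (its URL, timestamp, dump and segment values are hypothetical); only the six field names and the `[image_url, alt_text]` pair shape come from the description above.

```python
def image_text_pairs(record: dict) -> list:
    """Return (image_url, alt_text) pairs that have a non-empty alt text."""
    pairs = []
    for entry in record.get("image_urls") or []:
        if len(entry) != 2:
            continue  # skip malformed entries
        image_url, alt_text = entry
        if image_url and alt_text and alt_text.strip():
            pairs.append((image_url, alt_text.strip()))
    return pairs


# Hypothetical record, for illustration only.
sample = {
    "content": "A short article about cats.",
    "url": "https://example.com/cats",
    "timestamp": "2023-01-01T00:00:00Z",
    "dump": "CC-MAIN-0000-00",
    "segment": "0000000000000.00",
    "image_urls": [
        ["https://example.com/cat.jpg", "A cat on a sofa"],
        ["https://example.com/spacer.gif", ""],
    ],
}
print(image_text_pairs(sample))
```

The spacer image is skipped because its alt text is empty, which is the kind of filtering a downstream image-text dataset would likely need.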

## Which models were trained on RefinedWeb?

RefinedWeb was used as the main pretraining corpus, in different mixtures, for the Falcon family.

| Model | Parameters | Training tokens | Mixture | Released |
|-------|------------|-----------------|---------|----------|
| Falcon-RW-1B | 1.3B | 350B | RefinedWeb only | June 2023 [4] |
| Falcon-RW-7B | 7.5B | 350B | RefinedWeb only | June 2023 [4] |
| [Falcon-7B](/wiki/falcon) | 7B | 1,500B | RefinedWeb plus curated corpora | May 2023 [2] |
| [Falcon-40B](/wiki/falcon) | 40B | 1,000B | RefinedWeb plus curated corpora | May 2023 [3] |
| [Falcon-180B](/wiki/falcon) | 180B | 3,500B | ~85% RefinedWeb, ~12% curated, ~3% code | September 2023 [5] |

Falcon-RW-1B and Falcon-RW-7B are the dataset's reference models, described on the dataset card as "two models trained on 350 billion tokens of RefinedWeb alone to demonstrate its quality compared to curated corpora" [4]. They are trained on 350 billion tokens drawn purely from RefinedWeb, with no curated mix-in, and are intended as a controlled comparison against GPT-3, Pythia and Cerebras-GPT at similar compute [1]. They are not the strongest models in the Falcon family; the production Falcon-7B and Falcon-40B both add a curated portion on top of RefinedWeb. Falcon-180B uses a smaller curated fraction (about 12%) and a slice of code (about 3%), and was trained on up to 4,096 GPUs through Amazon SageMaker for roughly 7 million GPU hours [5]. In all three production releases, RefinedWeb is the bulk of the training data.
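
The Falcon-180B mixture can be converted into approximate token counts per source. The sketch uses the rounded percentages from the table, so the results are indicative only; the names are illustrative.

```python
FALCON_180B_TOKENS_B = 3_500  # training tokens, in billions
MIXTURE_PERCENT = {"RefinedWeb": 85, "curated": 12, "code": 3}  # rounded shares


def tokens_by_source(total_b: int, mixture: dict) -> dict:
    """Split a token budget (in billions) by integer percentage shares."""
    return {name: total_b * pct // 100 for name, pct in mixture.items()}


for name, tokens_b in tokens_by_source(FALCON_180B_TOKENS_B, MIXTURE_PERCENT).items():
    print(f"{name}: ~{tokens_b}B tokens")
```

Under these rounded shares, roughly 2,975B of the 3,500B training tokens come from RefinedWeb.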

## How well do RefinedWeb models perform?

The central empirical result is shown in Figure 1 of the original paper: at matched compute, models trained on RefinedWeb match the GPT-3 series in zero-shot accuracy and substantially outperform open models trained on The Pile, including the authors' own internal Pile-trained 1B baseline [1]. Both findings ran counter to the conventional view at the time. As the abstract summarizes, "despite extensive filtering, the high-quality data we extract from the web is still plentiful, and we are able to obtain five trillion tokens from CommonCrawl" [1].

The authors evaluate on four task aggregates totaling 18 zero-shot tasks. Tasks include HellaSwag, LAMBADA, Winogrande, PIQA, ARC, OpenBookQA, BoolQ, COPA, CB, RTE, ReCoRD, ANLI, LogiQA, HeadQA, MathQA, PROST, PubMedQA and SciQ. Evaluation runs through the EleutherAI evaluation harness for all internal models so that comparisons are apples-to-apples within the harness; original GPT-3 paper results are flagged separately [1].

In small-scale ablations (1B parameters, 27B tokens; 3B parameters, 60B tokens), RefinedWeb beats both popular web datasets (C4, OSCAR-21.09, OSCAR-22.01) and The Pile. The comparison shows that filtering and deduplication independently contribute: `RW-Raw` underperforms `RW-Filtered`, which in turn underperforms the fully deduplicated final dataset [1]. When the MDR pipeline's filtering and deduplication are applied to other datasets, deduplication delivers a steady boost across the board, while filtering provides smaller and source-dependent gains. The Pile and OSCAR-22.01 turn out to have very high duplicate rates (45% and 60% respectively), which the paper reads as evidence that aggressive deduplication is generally underused.

The larger-scale comparison uses 1B and 7B parameter models trained on 350GT, stacked against GPT-3 babbage and curie (via the OpenAI API), GPT-Neo, GPT-J, GPT-NeoX-20B, OPT, the Pythia suite, Cerebras-GPT, Aleph Alpha Luminous and PaLM-8B. RefinedWeb-trained models outperform every open Pile-based model and match the GPT-3 series within the harness, despite RefinedWeb excluding the curated sources (Wikipedia, books, technical papers) that GPT-3 itself was trained on [1]. The authors do not compare against contemporaneous [LLaMA](/wiki/llama) models, since even LLaMA-7B was trained on about 2.5x more compute than the largest RefinedWeb model in the paper, which would have made the comparison unfair.

The paper does not claim that web data is intrinsically better than curated data, only that adequately processed web data can match curated data and that scale plus stringent deduplication can substitute for hand curation. Later work (FineWeb-Edu, Dolma's quality-classifier ablations) has shown that ML-based quality filters can improve over rule-based filtering when applied carefully, so "web is enough" is best read as an existence proof rather than a final verdict on data quality.

## How does RefinedWeb compare to other open pretraining corpora?

The table below shows differences in origin, scale and deduplication between RefinedWeb and several widely used open corpora.

| Dataset | Released | Approx. size | Sources | Deduplication |
|---------|----------|--------------|---------|----------------|
| [C4](/wiki/c4_dataset) | 2019 (T5 paper) | ~360GT (English) | Single CommonCrawl snapshot, NSFW word blocklist | Exact, spans of 3 sentences |
| [The Pile](/wiki/the_pile) | 2020 ([EleutherAI](/wiki/eleutherai)) | ~340GT | 22 curated sources, ~18% web | MinHash with ~10 hashes (~26% removed) |
| [RedPajama](/wiki/red_pajama) v1 | 2023 ([Together AI](/wiki/together_ai)) | ~1.2T tokens | Open reproduction of LLaMA mix (CommonCrawl, C4, GitHub, books, arXiv, Wikipedia, StackExchange) | Inherited from sources |
| RefinedWeb (public extract) | June 2023 (TII) | ~600GT (full ~5T) | CommonCrawl only | MinHash with 9,000 hashes plus exact substring (~50% combined removal at dedup stage) |
| Dolma v1 | 2023 (AI2) | ~3T tokens | CommonCrawl, The Stack, peS2o, Reddit, Wikipedia, Project Gutenberg | Bloom filter exact, MinHash fuzzy |
| RedPajama v2 | October 2023 | ~30T tokens | 84 CommonCrawl snapshots in 5 languages, with quality signals | Hash-based |
| [FineWeb](/wiki/fineweb) | April 2024 ([Hugging Face](/wiki/hugging_face)) | ~15T tokens | 96 CommonCrawl snapshots | MinHash per snapshot, with iterative URL filters |

C4 is a single-snapshot, line-level filtering dataset with only weak deduplication; RefinedWeb is multi-snapshot and applies far more aggressive deduplication. The Pile is curated and small by 2024 standards, and the RefinedWeb paper showed that its specific curated mix did not transfer well to the authors' evaluation harness even after dedup. RedPajama v1 reproduces LLaMA's mix and is heterogeneous; RedPajama v2 is closer in spirit to RefinedWeb but at much larger scale and multilingual. Dolma comes from AI2 and was designed with reproducibility and open processing tools in mind. FineWeb is the most direct successor: it was led by Guilherme Penedo (the lead author of the RefinedWeb paper) after he moved to Hugging Face, and it scales the same general philosophy (heuristics plus aggressive deduplication on raw CommonCrawl) to 96 snapshots and 15 trillion tokens.

## Is RefinedWeb open source, and how do you access it?

The public 600-billion-token extract is released under the [Open Data Commons Attribution license](https://opendatacommons.org/licenses/by/1-0/) (ODC-By 1.0), which permits commercial use with attribution. Users are also bound by the CommonCrawl terms of use. The dataset is hosted on Hugging Face at `tiiuae/falcon-refinedweb` [4]. Removal requests go to `falconllm@tii.ae`, and the underlying crawl respects `robots.txt` opt-out signals.
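
A minimal way to inspect a few records without downloading the whole extract is to stream it with the Hugging Face `datasets` library. This sketch assumes the package is installed, network access is available, and the split is named `train`; `summarize` and `preview_refinedweb` are illustrative helper names.

```python
from itertools import islice


def summarize(record: dict, max_chars: int = 80) -> dict:
    """Keep the URL, the dump name and a short preview of the text."""
    return {
        "url": record["url"],
        "dump": record["dump"],
        "preview": record["content"][:max_chars],
    }


def preview_refinedweb(n: int = 3) -> list:
    """Stream the first n records from the Hub (requires network access)."""
    # Imported lazily so this module loads even without `datasets` installed.
    from datasets import load_dataset

    stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
    return [summarize(record) for record in islice(stream, n)]


if __name__ == "__main__":
    for row in preview_refinedweb():
        print(row)
```

Streaming avoids the multi-terabyte local footprint of the unpacked extract, at the cost of sequential access.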

The full ~5T-token internal dataset is not public, and TII has not announced plans to release it. The only way to reproduce the full corpus from outside TII is to rerun MDR (or a similar pipeline) on raw CommonCrawl dumps, which is what FineWeb effectively did.

## Limitations

The paper itself flags several limitations [1]. The toxicity profile, measured with the Perspective API, is roughly comparable to The Pile. The Perspective definition of toxicity is narrow ("content that is rude or disrespectful") and does not capture social bias or harm in any deep sense, so the comparison is suggestive rather than definitive. RefinedWeb is built on publicly available web pages and may contain personal information such as emails, phone numbers and IP addresses; deduplication probably reduces but does not eliminate PII [4].

The public dataset is English-only, even though the MDR pipeline can be applied to other languages. The Falcon-180B announcement mentions an internal RefinedWeb-Europe version, but it has not been published, which makes RefinedWeb less useful as a multilingual baseline than RedPajama v2 or [FineWeb-2](/wiki/fineweb_2).

The "neutral filtering" choice is itself a tradeoff. Without an ML quality classifier, RefinedWeb keeps a long tail of pages that more aggressive pipelines such as FineWeb-Edu would discard. Whether that tail helps or hurts a given downstream task is empirical; later work suggests classifier-based filtering can produce stronger reasoning models for the same token budget. RefinedWeb's strength is that it does not bake a particular notion of "good text" into the data.

The authors also report that combining MDR with C4's stricter span-level deduplication and stop-word filtering produced subsets that slightly outperformed RefinedWeb itself, but at rejection rates so high they were not viable for trillion-token training [1]. More aggressive cleaning helps zero-shot scores up to a point, then cuts the token budget too far for large models. Deduplication research has not converged either: Biderman et al.'s Pythia work found relatively limited gains from deduplicating The Pile, while RefinedWeb finds substantial gains from deduplicating web data. The authors read this as evidence that deduplication is more important for web-heavy datasets than for curated mixes.

## Influence and successor datasets

RefinedWeb is widely cited in the open dataset literature that followed it. It popularized running both fuzzy and exact deduplication with MinHash settings far more aggressive than The Pile and earlier datasets used. Dolma, RedPajama v2 and FineWeb all build on this lesson, and FineWeb's ablations show clear improvements from adopting RefinedWeb-style MinHash thresholds. Before RefinedWeb the default open recipe was a curated mix in the spirit of The Pile; after, several major open datasets, FineWeb most prominently, were defined as cleaned CommonCrawl with no curated mix-in. The Falcon-RW-1B and Falcon-RW-7B reference models are the empirical case for that recipe.

FineWeb (April 2024), led by Penedo at Hugging Face, scales the same general approach to 15T tokens and includes the FineWeb-Edu subset, which uses an ML quality classifier trained on Llama 3 70B annotations to filter for educational content [11]. Dolma adds explicit code, scientific paper and Reddit slices. RedPajama v2 multilingualizes the recipe and adds quality signals as document metadata, leaving filtering decisions to downstream users. Each can be read as a direct response to one of RefinedWeb's design choices.

## Related work

The RefinedWeb pipeline draws explicitly on several prior data-cleaning efforts. CCNet (Wenzek et al., 2020) provides the fastText language identification classifier. MassiveWeb and [Gopher](/wiki/gopher) (Rae et al., 2021) contributed document-level repetition removal and the rule-based quality filters (mean word length, fraction of bullets, fraction of stop words, ellipsis lines) that RefinedWeb adopts wholesale. Lee et al. (2022) provided the suffix-array exact substring deduplication implementation. Broder (1997) introduced the original MinHash sketch. The HTML-to-text extractor [trafilatura](/wiki/trafilatura) (Barbaresi, 2021) was chosen on the basis of Lopukhin's 2019 benchmarking, which found it the best non-commercial library for blog and news content. The paper cites Hoffmann et al. (2022) on Chinchilla as motivation for needing trillions of tokens, and Dodge et al. (2021) and Welbl et al. (2021) on filter-induced bias as motivation for the "neutral filtering" stance.

## See also

- [Falcon (language model)](/wiki/falcon)
- [FineWeb](/wiki/fineweb)
- [The Pile](/wiki/the_pile)
- [C4 (Colossal Clean Crawled Corpus)](/wiki/c4_dataset)
- [RedPajama](/wiki/red_pajama)
- [Common Crawl](/wiki/common_crawl)
- [Gopher (language model)](/wiki/gopher)
- [Trafilatura](/wiki/trafilatura)

## References

[1] Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. (2023). *The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only*. arXiv:2306.01116. Accepted at NeurIPS 2023 Datasets and Benchmarks Track. https://arxiv.org/abs/2306.01116

[2] TII. *Falcon-7B model card*. Hugging Face. https://huggingface.co/tiiuae/falcon-7b

[3] TII. *Falcon-40B model card*. Hugging Face. https://huggingface.co/tiiuae/falcon-40b

[4] TII. *Falcon RefinedWeb dataset card*. Hugging Face. https://huggingface.co/datasets/tiiuae/falcon-refinedweb

[5] Almazrouei, E. et al. *Spread Your Wings: Falcon 180B is here.* Hugging Face Blog, 6 September 2023. https://huggingface.co/blog/falcon-180b

[6] Rae, J. W. et al. (2021). *Scaling Language Models: Methods, Analysis & Insights from Training Gopher*. arXiv:2112.11446.

[7] Wenzek, G. et al. (2020). *CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data*. LREC 2020. arXiv:1911.00359.

[8] Lee, K. et al. (2022). *Deduplicating Training Data Makes Language Models Better*. ACL 2022. arXiv:2107.06499.

[9] Hoffmann, J. et al. (2022). *Training Compute-Optimal Large Language Models*. arXiv:2203.15556.

[10] Barbaresi, A. (2021). *Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction*. ACL System Demonstrations.

[11] Penedo, G. et al. (2024). *The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale*. arXiv:2406.17557. https://arxiv.org/abs/2406.17557

[12] Soldaini, L. et al. (2024). *Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research*. arXiv:2402.00159.