FineWeb is a large-scale, open pretraining dataset for large language models (LLMs) created by Hugging Face. Released in April 2024, it contains approximately 15 trillion tokens extracted and cleaned from 96 Common Crawl snapshots spanning from the summer of 2013 to April 2024. At roughly 44 terabytes of disk space, FineWeb is the largest publicly available, cleaned English web corpus built specifically for LLM pretraining. The dataset was introduced by Guilherme Penedo, Hynek Kydlicek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf in a paper titled "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," which was accepted to the NeurIPS 2024 Datasets and Benchmarks Track. FineWeb is released under the Open Data Commons Attribution License (ODC-By) v1.0.
Alongside the base FineWeb dataset, the authors released FineWeb-Edu, a 1.3 trillion token subset filtered for educational content using a classifier trained on annotations generated by Llama 3-70B-Instruct. Models pretrained on FineWeb-Edu outperform those trained on all other publicly available web datasets across knowledge-intensive and reasoning benchmarks, including MMLU, ARC, and OpenBookQA.
Both datasets are freely available on the Hugging Face Hub.
The quality of pretraining data has a direct effect on the downstream performance of large language models. While Common Crawl provides petabytes of raw web data, the unprocessed crawl contains enormous amounts of boilerplate text, duplicate content, low-quality pages, and non-English material. Converting this raw data into a corpus suitable for LLM pretraining requires extensive filtering, cleaning, and deduplication.
Before FineWeb, several open datasets attempted to address this problem. C4 (Colossal Clean Crawled Corpus), introduced by Raffel et al. (2020) for training the T5 model, applied heuristic filters to a single Common Crawl snapshot and produced roughly 175 billion tokens. RefinedWeb, released by the Technology Innovation Institute (TII) in 2023, used more sophisticated filtering and deduplication to produce a 600 billion token public English web corpus (the full internal version was estimated at 5 trillion tokens). The Pile, assembled by EleutherAI, combined web data with curated sources like books and academic papers to reach approximately 340 billion tokens. RedPajama v2, created by Together AI, provided over 20 trillion raw tokens (before quality filtering). Dolma, released by the Allen Institute for AI (AI2) for the OLMo project, provided roughly 3 trillion tokens in its version 1.6 Common Crawl portion.
Despite this progress, existing datasets had clear limitations. Some were too small for modern training runs, others used outdated or opaque filtering methods, and few documented their curation decisions in enough detail for the research community to reproduce or improve upon them. FineWeb was designed to address all of these shortcomings by building the largest open web corpus with a fully documented, empirically validated processing pipeline.
FineWeb started as an effort to openly replicate the RefinedWeb dataset, which was used to train the Falcon family of models. During development, the Hugging Face team found that by carefully adding extra filtering and deduplication steps, they could push model performance well above what RefinedWeb achieved. The project grew into a full-scale empirical study of data curation practices, with over 70 ablation models trained to test different processing choices. The motivation was twofold: to lower the barrier for open-source model training by providing a high-quality dataset that would otherwise cost hundreds of thousands of dollars in compute to produce from scratch, and to document every design decision and release the full curation codebase so other researchers could reproduce and extend the work.
FineWeb draws its raw material from Common Crawl, a nonprofit organization that regularly crawls the web and makes its archives publicly available. Common Crawl publishes data in three formats:
| Format | Full name | Contents |
|---|---|---|
| WARC | Web ARChive | Full HTTP response including HTML source code |
| WET | WARC Encapsulated Text | Pre-extracted plain text from each page |
| WAT | Web Archive Transformation | Metadata about crawled pages |
Most prior datasets (including C4 and parts of RedPajama) relied on WET files for convenience, since the text has already been extracted. However, the FineWeb team discovered that WET files retain too much boilerplate content, including navigation menus, sidebar text, and cookie notices, which degrades downstream model performance.
A core innovation of FineWeb was processing the raw WARC files directly and using the open-source trafilatura library for text extraction. Trafilatura is specifically designed to identify and extract the main content of a web page while discarding boilerplate elements. In ablation experiments, models trained on trafilatura-extracted text consistently outperformed those trained on the default WET text.
FineWeb processes 96 Common Crawl snapshots covering web crawls from mid-2013 (beginning with CC-MAIN-2013-20) through April 2024, giving the dataset broad temporal coverage.
The FineWeb processing pipeline was built using datatrove, an open-source data processing library developed by Hugging Face. The pipeline applies a sequential series of steps, each empirically validated through ablation experiments. The overall approach was primarily empirical: the team performed "data ablation" experiments to test different methods at each stage, training models and measuring benchmark performance to determine which steps helped and which did not.
Raw HTML from Common Crawl WARC files is processed using the trafilatura library. Trafilatura identifies the main content region of each web page and extracts clean text while removing navigation bars, headers, footers, advertisements, and other boilerplate. This approach produces notably cleaner text than the pre-extracted WET files provided by Common Crawl. Ablation experiments confirmed that trafilatura WARC extraction substantially outperformed default WET extraction in terms of downstream model performance.
Documents originating from malicious or adult content websites are removed using a URL blocklist. This blocklist contains known domains hosting pornographic material, malware, and spam. The filtering uses both exact domain matching and pattern detection in URLs.
A fastText language classifier is applied to each document. Only English-language documents with a confidence score of 0.65 or higher are retained. This threshold was chosen to balance recall (keeping documents that are primarily English) against precision (excluding non-English or mixed-language content).
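The threshold logic of this stage can be sketched as follows. The real pipeline runs fastText's language identification model; `classify` results here are stubbed tuples so the filtering rule can be shown self-contained.

```python
# Language-filter sketch: keep only documents classified as English with
# confidence >= 0.65. The (label, score) pairs below are stand-ins for the
# output of fastText's language identification model.
THRESHOLD = 0.65

def keep_document(label: str, score: float) -> bool:
    """Return True if the document passes the language filter."""
    return label == "en" and score >= THRESHOLD

# Stubbed predictions in place of a real fastText model's output:
docs = [
    ("en", 0.98),   # clearly English -> kept
    ("en", 0.50),   # mixed-language page, low confidence -> dropped
    ("fr", 0.99),   # French -> dropped
]
kept = [d for d in docs if keep_document(*d)]
print(len(kept))  # 1
```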
FineWeb applies the quality and repetition filters from MassiveText (Rae et al., 2021), originally developed for training DeepMind's Gopher model. These filters target common patterns in low-quality web text:
| Filter category | Description |
|---|---|
| Word count | Documents with too few or too many words are removed |
| Mean word length | Documents with unusually short or long average word lengths are filtered |
| Symbol-to-word ratio | Documents with excessive special characters are removed |
| Stop word presence | Documents lacking common English stop words are discarded |
| Line-level repetition | Documents where a high fraction of lines are duplicates are removed |
| Paragraph-level repetition | Documents with excessive duplicate paragraphs are removed |
| N-gram repetition | Documents with high proportions of repeated 2-grams, 3-grams, or 4-grams are filtered |
| Bullet point ratio | Documents dominated by bullet points or list markers are removed |
| Ellipsis ratio | Documents with excessive ellipsis usage are filtered |
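A few of these checks can be illustrated with a short sketch. The thresholds below (50 to 100,000 words, symbol ratio 0.1, at least two distinct stop words) are the commonly cited MassiveText values; the exact values and word-tokenization details in FineWeb's implementation may differ.

```python
import re

# Illustrative re-implementation of three Gopher-style quality checks:
# word count, symbol-to-word ratio, and stop-word presence.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_gopher_filters(text: str) -> bool:
    words = re.findall(r"\S+", text.lower())
    n = len(words)
    if not 50 <= n <= 100_000:                 # word-count filter
        return False
    symbols = text.count("#") + text.count("...")
    if symbols / n > 0.1:                      # symbol-to-word ratio
        return False
    if len(STOP_WORDS & set(words)) < 2:       # stop-word presence
        return False
    return True

print(passes_gopher_filters("the cat and the dog " * 15))  # True
print(passes_gopher_filters("short text"))                 # False
```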
Deduplication is one of the most consequential steps in the pipeline. FineWeb uses MinHash-based locality-sensitive hashing to identify and remove near-duplicate documents efficiently at web scale.
The specific parameters are:
| Parameter | Value |
|---|---|
| Shingle size | 5-grams (using English word tokenizer) |
| Number of hash functions | 112 |
| Band configuration | 14 buckets of 8 hashes each |
| Similarity threshold | 75% |
| Match probability at 75% similarity | ~77% |
Each 5-gram in a document is hashed with each of the 112 hash functions, and the document's signature is the minimum hash value across all its 5-grams for each function. If two documents' signatures agree on all 8 hash values in at least one of the 14 buckets, they are flagged as near-duplicates and one copy is removed.
A notable finding from the FineWeb research is that individual per-snapshot deduplication outperformed global deduplication across all snapshots. Global deduplication removed roughly 90% of content from older snapshots, which turned out to hurt model performance. The authors hypothesized that deduplication of clusters with a small number of duplicates can actually harm data quality, since these documents are often legitimately similar (for example, different editions of reference pages) rather than true spam duplicates. Per-snapshot deduplication retained approximately 20 trillion tokens before additional filtering brought the total to 15 trillion.
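The MinHash scheme described above can be sketched in a few lines. This toy version uses `hashlib` for self-containment (real pipelines use fast vectorized hashing), but it follows FineWeb's parameters: 5-gram shingles, 112 hash functions, and 14 buckets of 8 hashes. It also checks the banding scheme's theoretical match probability, 1 - (1 - s^r)^b, against the "~77%" figure in the table.

```python
import hashlib

# Toy MinHash/LSH sketch with FineWeb's parameters: 5-gram shingles and
# 112 hash functions arranged as 14 buckets x 8 hashes each.
NUM_HASHES, BUCKETS, ROWS = 112, 14, 8

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    # For each hash function (seed), keep the minimum hash over all shingles.
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles(text)))
    return sig

def is_near_duplicate(sig_a, sig_b):
    # Flag if the signatures agree on all 8 hashes in any of the 14 buckets.
    for b in range(BUCKETS):
        band = slice(b * ROWS, (b + 1) * ROWS)
        if sig_a[band] == sig_b[band]:
            return True
    return False

# Probability that two documents with Jaccard similarity s share a bucket:
# 1 - (1 - s**ROWS)**BUCKETS; at s = 0.75 this is ~0.77.
p_match = 1 - (1 - 0.75 ** ROWS) ** BUCKETS
print(round(p_match, 2))  # 0.77
```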
After deduplication, heuristic filters originally developed for the C4 dataset are applied. These include checks for curly brackets (common in code or template artifacts), lorem ipsum placeholder text, JavaScript mentions (often indicating error pages), and policy/cookie statement boilerplate. However, the terminal punctuation filter from C4 (which requires sentences to end with standard punctuation marks) was deliberately excluded because ablation experiments showed it removed approximately 30% of all tokens without a corresponding improvement in model quality.
The Hugging Face team evaluated over fifty candidate filtering heuristics from prior work and selected three custom filters based on their impact during ablation studies:
| Filter | Threshold | Tokens removed |
|---|---|---|
| Fraction of lines ending with punctuation | Remove if <= 0.12 | 10.14% |
| Fraction of characters in duplicated lines | Remove if >= 0.1 | 12.47% |
| Fraction of lines shorter than 30 characters | Remove if >= 0.67 | 3.73% |
These filters target list-heavy documents, documents with repeated boilerplate lines, and documents with corrupted line formatting. Together, the additional heuristic filters removed roughly 22% of tokens while improving aggregate benchmark scores.
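The three custom filters can be sketched directly from the thresholds in the table. The definition of "terminal punctuation" and the exact character-counting rules here are simplifications; the authoritative versions live in the released datatrove pipeline code.

```python
def passes_fineweb_filters(text: str) -> bool:
    """Apply the three custom heuristics with the thresholds listed above
    (illustrative simplification of the datatrove implementation)."""
    lines = [l for l in text.split("\n") if l.strip()]
    if not lines:
        return False
    # 1. Fraction of lines ending with terminal punctuation must exceed 0.12.
    punct = sum(l.rstrip().endswith((".", "!", "?", '"')) for l in lines)
    if punct / len(lines) <= 0.12:
        return False
    # 2. Fraction of characters in duplicated lines must stay below 0.1.
    seen, dup_chars, total_chars = set(), 0, 0
    for l in lines:
        total_chars += len(l)
        if l in seen:
            dup_chars += len(l)
        seen.add(l)
    if dup_chars / total_chars >= 0.1:
        return False
    # 3. Fraction of lines shorter than 30 characters must stay below 0.67.
    return sum(len(l) < 30 for l in lines) / len(lines) < 0.67

good = "\n".join([
    "The first sentence is comfortably longer than thirty characters.",
    "The second sentence is also longer than thirty characters.",
    "The third sentence likewise exceeds thirty characters in length.",
])
print(passes_fineweb_filters(good))                # True
print(passes_fineweb_filters("item\nitem\nitem"))  # False
```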
As a final step, basic personally identifiable information (PII) removal is applied. Email addresses are replaced with anonymized placeholders and public IP addresses are masked. The authors acknowledge this is not a comprehensive PII scrubbing solution.
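A minimal sketch of this kind of PII masking is shown below. The placeholder strings and regular expressions are illustrative, not the exact patterns used to build the released dataset.

```python
import re

# Basic PII scrubbing sketch: replace email addresses with a fixed
# placeholder and mask IPv4 addresses. Patterns are intentionally simple.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL_RE.sub("email@example.com", text)  # anonymize emails
    text = IPV4_RE.sub("0.0.0.0", text)             # mask IPv4 addresses
    return text

print(scrub_pii("Contact alice@foo.org from 192.168.1.5"))
# Contact email@example.com from 0.0.0.0
```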
FineWeb-Edu is a 1.3 trillion token subset of FineWeb that retains only web pages with high educational value. The motivation behind FineWeb-Edu was the observation that filtering for educational content can dramatically improve model performance on knowledge-intensive and reasoning benchmarks, even when the resulting dataset is much smaller than the original.
To generate educational quality annotations, Llama 3-70B-Instruct was prompted to rate approximately 460,000 randomly sampled FineWeb documents on a scale from 0 to 5 based on their educational quality:
| Score | Meaning |
|---|---|
| 0 | Not educational at all |
| 1 | Minimally educational |
| 2 | Somewhat educational |
| 3 | Moderately educational, suitable for learning |
| 4 | Highly educational with clear explanations |
| 5 | Outstanding educational quality, textbook-level |
The annotation prompt was designed to focus on grade-school and middle-school level knowledge, intentionally avoiding bias toward highly technical content like arXiv papers or academic submissions. Of the total annotations, 410,000 were used for training and 50,000 were held out for validation.
Rather than using the LLM directly to score all 15 trillion tokens (which would be prohibitively expensive), the team trained a lightweight classifier to replicate the LLM's annotations at scale:
| Component | Detail |
|---|---|
| Base embedding model | Snowflake Arctic Embed M (snowflake-arctic-embed-m), 109M parameters |
| Classification head | Single linear regression layer (score 0-5) |
| Training samples | 410,000 (with 50,000 held out for validation) |
| Training procedure | Embedding layers frozen; only regression head trained |
| Epochs | 20 |
| Learning rate | 3e-4 |
| Inference compute | ~6,000 H100 GPU hours for full FineWeb |
When evaluated as a binary classifier (keeping documents with score >= 3 versus discarding them), the model achieved an F1 score of 82% on the held-out validation set.
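The binarization and the F1 metric behind that 82% figure can be illustrated with toy data (the scores and labels below are invented for the example, not drawn from the real validation set).

```python
# Binarize continuous educational scores at the threshold of 3 and compute
# the F1 score against integer labels, as in the reported evaluation.
def f1_at_threshold(scores, labels, threshold=3.0):
    preds = [s >= threshold for s in scores]
    gold = [l >= threshold for l in labels]
    tp = sum(p and g for p, g in zip(preds, gold))
    fp = sum(p and not g for p, g in zip(preds, gold))
    fn = sum(g and not p for p, g in zip(preds, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [4.2, 2.8, 3.1, 0.5, 3.9]   # toy classifier outputs
labels = [4,   3,   3,   1,   2]     # toy LLM-annotated targets
print(round(f1_at_threshold(scores, labels), 2))  # 0.67
```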
Documents scoring below 3 were removed. This threshold eliminated roughly 92% of the original FineWeb data, leaving 1.3 trillion tokens of educational content. The resulting FineWeb-Edu dataset is notable for its efficiency: models trained on FineWeb-Edu can match the MMLU performance of models trained on roughly 10 times as many tokens from other datasets like C4 or Dolma.
Each document in FineWeb-Edu includes both the continuous score and a rounded integer score (int_score), allowing users to apply their own thresholds. A variant called FineWeb-Edu Score 2 retains documents with scores of 2 or higher, providing a larger but less selective alternative.
The trained classifier is publicly released as HuggingFaceFW/fineweb-edu-classifier on the Hugging Face Hub, enabling researchers to apply educational quality scoring to their own datasets.
Analysis of FineWeb-Edu showed that educational filtering reduced some categories of bias found in the broader FineWeb dataset. While FineWeb contained problematic associations between religious terms and online dating terminology (based on TF-IDF analysis), FineWeb-Edu exhibited more educationally appropriate associations, such as "woman" co-occurring with "pregnancy" and "man" with "king." However, common web data biases persist: the word "man" appears significantly more often than "woman" or "non-binary," and "Christian" is the dominant religious term by frequency.
All evaluation during FineWeb's development used a consistent setup to ensure fair comparisons across datasets.
| Parameter | Value |
|---|---|
| Architecture | Llama-style transformer |
| Total parameters | 1.82 billion (including embeddings) |
| Sequence length | 2,048 tokens |
| Tokenizer | GPT-2 tokenizer |
| Global batch size | ~2 million tokens |
| Short ablation runs | ~28 billion tokens |
| Long confirmation runs | 350 billion tokens |
| Evaluation framework | lighteval (Hugging Face) |
This model size was chosen as a practical compromise: large enough to show meaningful differences between training datasets, but small enough to train many variants within a reasonable compute budget.
Eight benchmarks were used for evaluation:
| Benchmark | Category |
|---|---|
| MMLU | Multitask knowledge (57 subjects, truncated to 1,000 samples) |
| ARC (AI2 Reasoning Challenge) | Grade-school and high-school science questions |
| HellaSwag | Commonsense sentence completion |
| OpenBookQA | Elementary science with open-book reasoning |
| PIQA | Physical intuition and reasoning |
| SIQA (Social IQa) | Social interaction reasoning |
| WinoGrande | Pronoun resolution and commonsense |
| CommonsenseQA | General commonsense reasoning |
All evaluations were run using the lighteval framework.
Over 70 models were trained during the development process. Filtering ablations used approximately 28 billion tokens of training data, while deduplication and final confirmation ablations used 350 billion tokens. The total compute for all ablation experiments was estimated at 80,000 H100 GPU hours.
FineWeb was benchmarked against several widely used open pretraining datasets. For fair comparison, a 1.82 billion parameter model was trained on 350 billion tokens sampled from each dataset (or the full dataset when its size was smaller than 350 billion tokens).
| Dataset | Organization | Tokens (approx.) | Source | Year | License |
|---|---|---|---|---|---|
| FineWeb | Hugging Face | 15T | 96 Common Crawl snapshots | 2024 | ODC-By 1.0 |
| FineWeb-Edu | Hugging Face | 1.3T | Educational subset of FineWeb | 2024 | ODC-By 1.0 |
| C4 | Google | 175B | Single Common Crawl snapshot | 2019 | ODC-By 1.0 |
| RefinedWeb (public) | TII | 600B | Common Crawl | 2023 | Apache 2.0 |
| Dolma v1.6 (CC portion) | AI2 | ~3T | Mixed sources | 2024 | AI2 ImpACT |
| Dolma v1.7 | AI2 | ~1.2T | Mixed (improved filtering) | 2024 | AI2 ImpACT |
| The Pile | EleutherAI | 340B | 22 diverse sources | 2020 | MIT |
| SlimPajama | Cerebras | 627B | Deduplicated RedPajama | 2023 | Apache 2.0 |
| RedPajama-V2 (dedup) | Together AI | 20T | Common Crawl + other | 2023 | Apache 2.0 |
At the 350 billion token training scale, models trained on FineWeb outperformed models trained on C4, Dolma v1.6, The Pile, SlimPajama, and RedPajama-V2 on the aggregate benchmark score across all eight tasks. FineWeb's aggregate score was competitive with or exceeded RefinedWeb on most individual benchmarks.
FineWeb-Edu showed even stronger results, particularly on knowledge-intensive and reasoning benchmarks:
| Benchmark | FineWeb (350B tokens) | FineWeb-Edu (350B tokens) | Relative improvement |
|---|---|---|---|
| MMLU | ~33% | ~37% | ~12% relative |
| ARC | ~46% | ~57% | ~24% relative |
| Aggregate (all 8 tasks) | Baseline | ~+2 percentage points | Best among all open datasets |
On HellaSwag, models trained on cleaned FineWeb variants reached higher accuracy up to 32% faster compared to less-filtered alternatives, even when using 25% less pretraining data.
The most striking finding concerns data efficiency. FineWeb-Edu can match the final MMLU performance of models trained on roughly 10 times as many tokens from C4 or Dolma. A model trained on approximately 35 billion FineWeb-Edu tokens could match the MMLU performance of a model trained on 350 billion C4 tokens. This suggests that filtering for educational quality is one of the highest-leverage interventions available for improving pretraining data.
Machine classifiers can distinguish FineWeb text from text produced by competing datasets with up to 87% accuracy in binary classification, indicating that FineWeb has a distinctive data distribution that differs meaningfully from other web-scraped corpora.
The entire FineWeb pipeline was implemented using datatrove, Hugging Face's open-source Python library for large-scale data processing. Datatrove provides modular, composable pipeline blocks:
| Block type | Function |
|---|---|
| Readers | Load data from various formats (WARC, Parquet, JSONL) |
| Extractors | Extract text from raw formats (HTML via trafilatura) |
| Filters | Remove documents based on configurable criteria |
| Writers | Save processed data to disk or cloud storage |
| Deduplicators | Perform MinHash or exact deduplication |
Datatrove is designed to scale across computing clusters using frameworks like Slurm, making it practical to process the hundreds of terabytes of raw Common Crawl data. The FineWeb pipeline code is publicly available in the datatrove repository as a reference implementation.
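The composable-block design can be illustrated in plain Python, with each block as a generator that consumes and yields documents. This is only a conceptual sketch; the real datatrove API uses pipeline-block classes run by an executor rather than bare generators.

```python
# Conceptual sketch of composable pipeline blocks: reader -> filter -> writer,
# each a generator over document dicts.
def reader(paths):
    for p in paths:                      # stand-in for a WARC/JSONL reader
        yield {"text": f"contents of {p}", "id": p}

def min_length_filter(docs, n=10):
    for d in docs:                       # drop documents shorter than n chars
        if len(d["text"]) >= n:
            yield d

def writer(docs, out):
    for d in docs:                       # stand-in for a Parquet/JSONL writer
        out.append(d)
        yield d

out = []
pipeline = writer(min_length_filter(reader(["a.warc", "b.warc"])), out)
for _ in pipeline:                       # drive the generator chain
    pass
print(len(out))  # 2
```

Because each block only touches one document at a time, the same composition streams over arbitrarily large inputs and can be sharded across workers.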
Processing 96 Common Crawl snapshots through the full pipeline required substantial resources. Hugging Face reported spending an estimated $500,000 in GPU compute to process and filter approximately 38,000 TB of raw Common Crawl data down to the final 44 TB FineWeb dataset. The educational quality classification step alone required approximately 6,000 H100 GPU hours. All ablation model training added an estimated 80,000 H100 GPU hours.
Each document in FineWeb is stored as a Parquet record with the following fields:
| Field | Type | Description |
|---|---|---|
| text | string | Extracted and cleaned web page content |
| id | string | Unique document identifier |
| dump | string | Common Crawl snapshot identifier (e.g., CC-MAIN-2024-10) |
| url | string | Source URL of the web page |
| date | string | Date the page was crawled |
| file_path | string | Internal file path within the dataset |
| language | string | Detected language code |
| language_score | float64 | fastText language identification confidence score |
| token_count | int64 | Number of tokens (GPT-2 tokenizer) |
FineWeb-Edu includes two additional fields:
| Field | Type | Description |
|---|---|---|
| score | float64 | Continuous educational quality score (0.0 to 5.0) |
| int_score | int64 | Rounded integer score (0 to 5) |
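The combined schema from the two tables above can be expressed as a single typed record for illustration; the field values in the example instance are invented, and `score`/`int_score` are optional because they exist only in FineWeb-Edu.

```python
from dataclasses import dataclass
from typing import Optional

# Per-document schema from the tables above, as an illustrative dataclass.
@dataclass
class FineWebDocument:
    text: str
    id: str
    dump: str
    url: str
    date: str
    file_path: str
    language: str
    language_score: float
    token_count: int
    score: Optional[float] = None      # FineWeb-Edu only
    int_score: Optional[int] = None    # FineWeb-Edu only

# Hypothetical example record (values are made up):
doc = FineWebDocument(
    text="Example page text.", id="doc-0001", dump="CC-MAIN-2024-10",
    url="https://example.com", date="2024-02-20T00:00:00Z",
    file_path="path/to/shard.parquet", language="en", language_score=0.97,
    token_count=5, score=3.6, int_score=4,
)
print(doc.int_score)  # 4
```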
FineWeb and FineWeb-Edu have been widely adopted since their release, with downstream effects across multiple areas of the open-source AI ecosystem.
Before FineWeb, researchers training open LLMs often had to choose between smaller curated datasets (like C4 or RefinedWeb) or larger but noisier options (like raw RedPajama v2). FineWeb offered both scale and quality in a single package, reducing the need for each research group to build its own data pipeline from scratch. The release of all processing code, the educational classifier, the LLM annotations, and all 70+ ablation model checkpoints enabled full reproducibility.
FineWeb-Edu has served as a foundation for multilingual training initiatives. The TransWeb-Edu project produced balanced document-level corpora from FineWeb-Edu, and the resulting CuatroLLM and TransWebLLM models demonstrated that a high-quality English educational seed corpus can match or exceed the performance of models like Llama 3.2, Gemma, and Qwen 2.5 with far less training data.
The dataset's curation methodology influenced several other projects. Stack-Edu adapted the educational filtering approach for code data. FineWeb-Edu-Chinese applied similar techniques to Chinese web text. The Zyda-2 dataset incorporated FineWeb as a component, and Ultra-FineWeb explored additional quality verification methods on top of FineWeb's pipeline. Domain-specific subsets have also been created; OnlySports, for instance, extracted approximately 600 billion tokens of sports-related content from FineWeb.
The idea that filtering for educational content could so dramatically improve benchmark performance (while discarding over 90% of the data) challenged the common assumption that more data is always better for LLM training. This shifted attention in the field toward data quality over data quantity. The synthetic annotation approach (using an LLM to generate labels, then distilling into a lightweight classifier for efficient inference) has been widely adopted for other data filtering and scoring tasks.
In 2025, Hugging Face released FineWeb-2, a multilingual successor to FineWeb. FineWeb-2 extends the curation pipeline to over 1,000 languages using nearly 100 Common Crawl snapshots. The resulting dataset contains approximately 20 TB of text across roughly 5 billion documents, covering 1,868 language-script pairs (1,226 of which have more than 100 documents). FineWeb-2 was built from the portion of Common Crawl that did not pass FineWeb's English language identification threshold.
FineWeb-2 adapts the pipeline by tuning individual thresholds and stop-word lists for each language. In benchmark evaluations, models trained on FineWeb-2 outperformed those trained on prior multilingual datasets (CC-100, mC4, CulturaX, HPLT, HPLT2, and raw Common Crawl) on 11 out of 14 tested languages.
The authors acknowledge several limitations of FineWeb: