FineWeb is a large-scale, open pretraining dataset for large language models (LLMs) created by Hugging Face. Released in April 2024, it contains approximately 15 trillion tokens extracted and cleaned from 96 Common Crawl snapshots spanning from the summer of 2013 to April 2024. At roughly 44 terabytes of disk space, FineWeb is the largest publicly available, cleaned English web corpus built specifically for LLM pretraining. The dataset was introduced by Guilherme Penedo, Hynek Kydlicek, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf in a paper titled "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale," which was accepted to the NeurIPS 2024 Datasets and Benchmarks Track. FineWeb is released under the Open Data Commons Attribution License (ODC-By) v1.0.
Alongside the base FineWeb dataset, the authors released FineWeb-Edu, a 1.3 trillion token subset filtered for educational content using a classifier trained on annotations generated by Llama 3-70B-Instruct. Models pretrained on FineWeb-Edu outperform those trained on all other publicly available web datasets across knowledge-intensive and reasoning benchmarks, including MMLU, ARC, and OpenBookQA.
Both datasets are freely available on the Hugging Face Hub.
The quality of pretraining data has a direct effect on the downstream performance of large language models. While Common Crawl provides petabytes of raw web data, the unprocessed crawl contains enormous amounts of boilerplate text, duplicate content, low-quality pages, and non-English material. Converting this raw data into a corpus suitable for LLM pretraining requires extensive filtering, cleaning, and deduplication.
Before FineWeb, several open datasets attempted to address this problem. C4 (Colossal Clean Crawled Corpus), introduced by Raffel et al. (2020) for training the T5 model, applied heuristic filters to a single Common Crawl snapshot and produced roughly 175 billion tokens. RefinedWeb, released by the Technology Innovation Institute (TII) in 2023, used more sophisticated filtering and deduplication to produce a 600 billion token public English web corpus (the full internal version was estimated at 5 trillion tokens). The Pile, assembled by EleutherAI, combined web data with curated sources like books and academic papers to reach approximately 340 billion tokens. RedPajama v2, created by Together AI, provided over 20 trillion raw tokens (before quality filtering). Dolma, released by the Allen Institute for AI (AI2) for the OLMo project, provided roughly 3 trillion tokens in its version 1.6 Common Crawl portion.
Despite this progress, existing datasets had clear limitations. Some were too small for modern training runs, others used outdated or opaque filtering methods, and few documented their curation decisions in enough detail for the research community to reproduce or improve upon them. FineWeb was designed to address all of these shortcomings by building the largest open web corpus with a fully documented, empirically validated processing pipeline.
FineWeb started as an effort to openly replicate the RefinedWeb dataset, which was used to train the Falcon family of models. During development, the Hugging Face team found that by carefully adding extra filtering and deduplication steps, they could push model performance well above what RefinedWeb achieved. The project grew into a full-scale empirical study of data curation practices, with over 70 ablation models trained to test different processing choices. The motivation was twofold: to lower the barrier for open-source model training by providing a high-quality dataset that would otherwise cost hundreds of thousands of dollars in compute to produce from scratch, and to document every design decision and release the full curation codebase so other researchers could reproduce and extend the work.
FineWeb draws its raw material from Common Crawl, a nonprofit organization that regularly crawls the web and makes its archives publicly available. Common Crawl publishes data in three formats:
| Format | Full name | Contents |
|---|---|---|
| WARC | Web ARChive | Full HTTP response including HTML source code |
| WET | WARC Encapsulated Text | Pre-extracted plain text from each page |
| WAT | Web Archive Transformation | Metadata about crawled pages |
Most prior datasets (including C4 and parts of RedPajama) relied on WET files for convenience, since the text has already been extracted. However, the FineWeb team discovered that WET files retain too much boilerplate content, including navigation menus, sidebar text, and cookie notices, which degrades downstream model performance.
A core innovation of FineWeb was processing the raw WARC files directly and using the open-source trafilatura library for text extraction. Trafilatura is specifically designed to identify and extract the main content of a web page while discarding boilerplate elements. In ablation experiments, models trained on trafilatura-extracted text consistently outperformed those trained on the default WET text.
FineWeb processes 96 Common Crawl snapshots covering web crawls from mid-2013 (beginning with CC-MAIN-2013-20) through April 2024, giving the dataset broad temporal coverage.
The FineWeb processing pipeline was built using datatrove, an open-source data processing library developed by Hugging Face. The pipeline applies a sequential series of steps, each empirically validated through ablation experiments. The overall approach was primarily empirical: the team performed "data ablation" experiments to test different methods at each stage, training models and measuring benchmark performance to determine which steps helped and which did not.
Raw HTML from Common Crawl WARC files is processed using the trafilatura library. Trafilatura identifies the main content region of each web page and extracts clean text while removing navigation bars, headers, footers, advertisements, and other boilerplate. This approach produces notably cleaner text than the pre-extracted WET files provided by Common Crawl. Ablation experiments confirmed that trafilatura WARC extraction substantially outperformed default WET extraction in terms of downstream model performance.
Documents originating from malicious or adult content websites are removed using a URL blocklist. This blocklist contains known domains hosting pornographic material, malware, and spam. The filtering uses both exact domain matching and pattern detection in URLs.
A fastText language classifier is applied to each document. Only English-language documents with a confidence score of 0.65 or higher are retained. This threshold was chosen to balance recall (keeping documents that are primarily English) against precision (excluding non-English or mixed-language content).
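The threshold logic of this stage can be sketched as follows. The real pipeline runs fastText's language identification model; `classify` results here are stubbed tuples so the filtering rule can be shown self-contained.

```python
# Language-filter sketch: keep only documents classified as English with
# confidence >= 0.65. The (label, score) pairs below are stand-ins for the
# output of fastText's language identification model.
THRESHOLD = 0.65

def keep_document(label: str, score: float) -> bool:
    """Return True if the document passes the language filter."""
    return label == "en" and score >= THRESHOLD

# Stubbed predictions in place of a real fastText model's output:
docs = [
    ("en", 0.98),   # clearly English -> kept
    ("en", 0.50),   # mixed-language page, low confidence -> dropped
    ("fr", 0.99),   # French -> dropped
]
kept = [d for d in docs if keep_document(*d)]
print(len(kept))  # 1
```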
FineWeb applies the quality and repetition filters from MassiveText (Rae et al., 2021), originally developed for training DeepMind's Gopher model. These filters target common patterns in low-quality web text:
| Filter category | Description |
|---|---|
| Word count | Documents with too few or too many words are removed |
| Mean word length | Documents with unusually short or long average word lengths are filtered |
| Symbol-to-word ratio | Documents with excessive special characters are removed |
| Stop word presence | Documents lacking common English stop words are discarded |
| Line-level repetition | Documents where a high fraction of lines are duplicates are removed |
| Paragraph-level repetition | Documents with excessive duplicate paragraphs are removed |
| N-gram repetition | Documents with high proportions of repeated 2-grams, 3-grams, or 4-grams are filtered |
| Bullet point ratio | Documents dominated by bullet points or list markers are removed |
| Ellipsis ratio | Documents with excessive ellipsis usage are filtered |
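A few of these checks can be illustrated with a short sketch. The thresholds below (50 to 100,000 words, symbol ratio 0.1, at least two distinct stop words) are the commonly cited MassiveText values; the exact values and word-tokenization details in FineWeb's implementation may differ.

```python
import re

# Illustrative re-implementation of three Gopher-style quality checks:
# word count, symbol-to-word ratio, and stop-word presence.
STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}

def passes_gopher_filters(text: str) -> bool:
    words = re.findall(r"\S+", text.lower())
    n = len(words)
    if not 50 <= n <= 100_000:                 # word-count filter
        return False
    symbols = text.count("#") + text.count("...")
    if symbols / n > 0.1:                      # symbol-to-word ratio
        return False
    if len(STOP_WORDS & set(words)) < 2:       # stop-word presence
        return False
    return True

print(passes_gopher_filters("the cat and the dog " * 15))  # True
print(passes_gopher_filters("short text"))                 # False
```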
Deduplication is one of the most consequential steps in the pipeline. FineWeb uses MinHash-based locality-sensitive hashing to identify and remove near-duplicate documents efficiently at web scale.
The specific parameters are:
| Parameter | Value |
|---|---|
| Shingle size | 5-grams (using English word tokenizer) |
| Number of hash functions | 112 |
| Band configuration | 14 buckets of 8 hashes each |
| Similarity threshold | 75% |
| Match probability at 75% similarity | ~77% |
Each 5-gram in a document is hashed with each of the 112 hash functions, and the document's signature is the minimum hash value across all its 5-grams for each function. If two documents' signatures agree on all 8 hash values in at least one of the 14 buckets, they are flagged as near-duplicates and one copy is removed.
A notable finding from the FineWeb research is that individual per-snapshot deduplication outperformed global deduplication across all snapshots. Global deduplication removed roughly 90% of content from older snapshots, which turned out to hurt model performance. The authors hypothesized that deduplication of clusters with a small number of duplicates can actually harm data quality, since these documents are often legitimately similar (for example, different editions of reference pages) rather than true spam duplicates. Per-snapshot deduplication retained approximately 20 trillion tokens before additional filtering brought the total to 15 trillion.
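The MinHash scheme described above can be sketched in a few lines. This toy version uses `hashlib` for self-containment (real pipelines use fast vectorized hashing), but it follows FineWeb's parameters: 5-gram shingles, 112 hash functions, and 14 buckets of 8 hashes. It also checks the banding scheme's theoretical match probability, 1 - (1 - s^r)^b, against the "~77%" figure in the table.

```python
import hashlib

# Toy MinHash/LSH sketch with FineWeb's parameters: 5-gram shingles and
# 112 hash functions arranged as 14 buckets x 8 hashes each.
NUM_HASHES, BUCKETS, ROWS = 112, 14, 8

def shingles(text, n=5):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(text):
    # For each hash function (seed), keep the minimum hash over all shingles.
    sig = []
    for seed in range(NUM_HASHES):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(),
                                           digest_size=8).digest(), "big")
            for s in shingles(text)))
    return sig

def is_near_duplicate(sig_a, sig_b):
    # Flag if the signatures agree on all 8 hashes in any of the 14 buckets.
    for b in range(BUCKETS):
        band = slice(b * ROWS, (b + 1) * ROWS)
        if sig_a[band] == sig_b[band]:
            return True
    return False

# Probability that two documents with Jaccard similarity s share a bucket:
# 1 - (1 - s**ROWS)**BUCKETS; at s = 0.75 this is ~0.77.
p_match = 1 - (1 - 0.75 ** ROWS) ** BUCKETS
print(round(p_match, 2))  # 0.77
```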
After deduplication, heuristic filters originally developed for the C4 dataset are applied. These include checks for curly brackets (common in code or template artifacts), lorem ipsum placeholder text, JavaScript mentions (often indicating error pages), and policy/cookie statement boilerplate. However, the terminal punctuation filter from C4 (which requires sentences to end with standard punctuation marks) was deliberately excluded because ablation experiments showed it removed approximately 30% of all tokens without a corresponding improvement in model quality.
The Hugging Face team evaluated over fifty candidate filtering heuristics from prior work and selected three custom filters based on their impact during ablation studies:
| Filter | Threshold | Tokens removed |
|---|---|---|
| Fraction of lines ending with punctuation | Remove if <= 0.12 | 10.14% |
| Fraction of characters in duplicated lines | Remove if >= 0.1 | 12.47% |
| Fraction of lines shorter than 30 characters | Remove if >= 0.67 | 3.73% |
These filters target list-heavy documents, documents with repeated boilerplate lines, and documents with corrupted line formatting. Together, the additional heuristic filters removed roughly 22% of tokens while improving aggregate benchmark scores.
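The three custom filters can be sketched directly from the thresholds in the table. The definition of "terminal punctuation" and the exact character-counting rules here are simplifications; the authoritative versions live in the released datatrove pipeline code.

```python
def passes_fineweb_filters(text: str) -> bool:
    """Apply the three custom heuristics with the thresholds listed above
    (illustrative simplification of the datatrove implementation)."""
    lines = [l for l in text.split("\n") if l.strip()]
    if not lines:
        return False
    # 1. Fraction of lines ending with terminal punctuation must exceed 0.12.
    punct = sum(l.rstrip().endswith((".", "!", "?", '"')) for l in lines)
    if punct / len(lines) <= 0.12:
        return False
    # 2. Fraction of characters in duplicated lines must stay below 0.1.
    seen, dup_chars, total_chars = set(), 0, 0
    for l in lines:
        total_chars += len(l)
        if l in seen:
            dup_chars += len(l)
        seen.add(l)
    if dup_chars / total_chars >= 0.1:
        return False
    # 3. Fraction of lines shorter than 30 characters must stay below 0.67.
    return sum(len(l) < 30 for l in lines) / len(lines) < 0.67

good = "\n".join([
    "The first sentence is comfortably longer than thirty characters.",
    "The second sentence is also longer than thirty characters.",
    "The third sentence likewise exceeds thirty characters in length.",
])
print(passes_fineweb_filters(good))                # True
print(passes_fineweb_filters("item\nitem\nitem"))  # False
```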
As a final step, basic personally identifiable information (PII) removal is applied. Email addresses are replaced with anonymized placeholders and public IP addresses are masked. The authors acknowledge this is not a comprehensive PII scrubbing solution.
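A minimal sketch of this kind of PII masking is shown below. The placeholder strings and regular expressions are illustrative, not the exact patterns used to build the released dataset.

```python
import re

# Basic PII scrubbing sketch: replace email addresses with a fixed
# placeholder and mask IPv4 addresses. Patterns are intentionally simple.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def scrub_pii(text: str) -> str:
    text = EMAIL_RE.sub("email@example.com", text)  # anonymize emails
    text = IPV4_RE.sub("0.0.0.0", text)             # mask IPv4 addresses
    return text

print(scrub_pii("Contact alice@foo.org from 192.168.1.5"))
# Contact email@example.com from 0.0.0.0
```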
FineWeb-Edu is a 1.3 trillion token subset of FineWeb that retains only web pages with high educational value. The motivation behind FineWeb-Edu was the observation that filtering for educational content can dramatically improve model performance on knowledge-intensive and reasoning benchmarks, even when the resulting dataset is much smaller than the original.
To generate educational quality annotations, Llama 3-70B-Instruct was prompted to rate approximately 460,000 randomly sampled FineWeb documents on a scale from 0 to 5 based on their educational quality:
| Score | Meaning |
|---|---|
| 0 | Not educational at all |
| 1 | Minimally educational |
| 2 | Somewhat educational |
| 3 | Moderately educational, suitable for learning |
| 4 | Highly educational with clear explanations |
| 5 | Outstanding educational quality, textbook-level |
The annotation prompt was designed to focus on grade-school and middle-school level knowledge, intentionally avoiding bias toward highly technical content like arXiv papers or academic submissions. Of the total annotations, 410,000 were used for training and 50,000 were held out for validation.
Rather than using the LLM directly to score all 15 trillion tokens (which would be prohibitively expensive), the team trained a lightweight classifier to replicate the LLM's annotations at scale:
| Component | Detail |
|---|---|
| Base embedding model | Snowflake Arctic Embed M (snowflake-arctic-embed-m), 109M parameters |
| Classification head | Single linear regression layer (score 0-5) |
| Training samples | 410,000 (with 50,000 held out for validation) |
| Training procedure | Embedding layers frozen; only regression head trained |
| Epochs | 20 |
| Learning rate | 3e-4 |
| Inference compute | ~6,000 H100 GPU hours for full FineWeb |
When evaluated as a binary classifier (keeping documents with score >= 3 versus discarding them), the model achieved an F1 score of 82% on the held-out validation set.
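The binarization and the F1 metric behind that 82% figure can be illustrated with toy data (the scores and labels below are invented for the example, not drawn from the real validation set).

```python
# Binarize continuous educational scores at the threshold of 3 and compute
# the F1 score against integer labels, as in the reported evaluation.
def f1_at_threshold(scores, labels, threshold=3.0):
    preds = [s >= threshold for s in scores]
    gold = [l >= threshold for l in labels]
    tp = sum(p and g for p, g in zip(preds, gold))
    fp = sum(p and not g for p, g in zip(preds, gold))
    fn = sum(g and not p for p, g in zip(preds, gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

scores = [4.2, 2.8, 3.1, 0.5, 3.9]   # toy classifier outputs
labels = [4,   3,   3,   1,   2]     # toy LLM-annotated targets
print(round(f1_at_threshold(scores, labels), 2))  # 0.67
```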
Documents scoring below 3 were removed. This threshold eliminated roughly 92% of the original FineWeb data, leaving 1.3 trillion tokens of educational content. The resulting FineWeb-Edu dataset is notable for its efficiency: models trained on FineWeb-Edu can match the MMLU performance of models trained on roughly 10 times as many tokens from other datasets like C4 or Dolma.
Each document in FineWeb-Edu includes both the continuous score and a rounded integer score (int_score), allowing users to apply their own thresholds. A variant called FineWeb-Edu Score 2 retains documents with scores of 2 or higher, providing a larger but less selective alternative.
The trained classifier is publicly released as HuggingFaceFW/fineweb-edu-classifier on the Hugging Face Hub, enabling researchers to apply educational quality scoring to their own datasets.
Analysis of FineWeb-Edu showed that educational filtering reduced some categories of bias found in the broader FineWeb dataset. While FineWeb contained problematic associations between religious terms and online dating terminology (based on TF-IDF analysis), FineWeb-Edu exhibited more educationally appropriate associations, such as "woman" co-occurring with "pregnancy" and "man" with "king." However, common web data biases persist: the word "man" appears significantly more often than "woman" or "non-binary," and "Christian" is the dominant religious term by frequency.
All evaluation during FineWeb's development used a consistent setup to ensure fair comparisons across datasets.
| Parameter | Value |
|---|---|
| Architecture | Llama-style transformer |
| Total parameters | 1.82 billion (including embeddings) |
| Sequence length | 2,048 tokens |
| Tokenizer | GPT-2 tokenizer |
| Global batch size | ~2 million tokens |
| Short ablation runs | ~28 billion tokens |
| Long confirmation runs | 350 billion tokens |
| Evaluation framework | lighteval (Hugging Face) |
This model size was chosen as a practical compromise: large enough to show meaningful differences between training datasets, but small enough to train many variants within a reasonable compute budget.
Eight benchmarks were used for evaluation:
| Benchmark | Category |
|---|---|
| MMLU | Multitask knowledge (57 subjects, truncated to 1,000 samples) |
| ARC (AI2 Reasoning Challenge) | Grade-school and high-school science questions |
| HellaSwag | Commonsense sentence completion |
| OpenBookQA | Elementary science with open-book reasoning |
| PIQA | Physical intuition and reasoning |
| SIQA (Social IQa) | Social interaction reasoning |
| WinoGrande | Pronoun resolution and commonsense |
| CommonsenseQA | General commonsense reasoning |
All evaluations were run using the lighteval framework.
Over 70 models were trained during the development process. Filtering ablations used approximately 28 billion tokens of training data, while deduplication and final confirmation ablations used 350 billion tokens. The total compute for all ablation experiments was estimated at 80,000 H100 GPU hours.
FineWeb was benchmarked against several widely used open pretraining datasets. For fair comparison, a 1.82 billion parameter model was trained on 350 billion tokens sampled from each dataset (or the full dataset when its size was smaller than 350 billion tokens).
| Dataset | Organization | Tokens (approx.) | Source | Year | License |
|---|---|---|---|---|---|
| FineWeb | Hugging Face | 15T | 96 Common Crawl snapshots | 2024 | ODC-By 1.0 |
| FineWeb-Edu | Hugging Face | 1.3T | Educational subset of FineWeb | 2024 | ODC-By 1.0 |
| C4 | Google | 175B | Single Common Crawl snapshot | 2019 | ODC-By 1.0 |
| RefinedWeb (public) | TII | 600B | Common Crawl | 2023 | Apache 2.0 |
| Dolma v1.6 (CC portion) | AI2 | ~3T | Mixed sources | 2024 | AI2 ImpACT |
| Dolma v1.7 | AI2 | ~1.2T | Mixed (improved filtering) | 2024 | AI2 ImpACT |
| The Pile | EleutherAI | 340B | 22 diverse sources | 2020 | MIT |
| SlimPajama | Cerebras | 627B | Deduplicated RedPajama | 2023 | Apache 2.0 |
| RedPajama-V2 (dedup) | Together AI | 20T | Common Crawl + other | 2023 | Apache 2.0 |
At the 350 billion token training scale, models trained on FineWeb outperformed models trained on C4, Dolma v1.6, The Pile, SlimPajama, and RedPajama-V2 on the aggregate benchmark score across all eight tasks. FineWeb's aggregate score was competitive with or exceeded RefinedWeb on most individual benchmarks.
FineWeb-Edu showed even stronger results, particularly on knowledge-intensive and reasoning benchmarks:
| Benchmark | FineWeb (350B tokens) | FineWeb-Edu (350B tokens) | Relative improvement |
|---|---|---|---|
| MMLU | ~33% | ~37% | ~12% relative |
| ARC | ~46% | ~57% | ~24% relative |
| Aggregate (all 8 tasks) | Baseline | ~+2 percentage points | Best among all open datasets |
On HellaSwag, models trained on cleaned FineWeb variants reached higher accuracy up to 32% faster compared to less-filtered alternatives, even when using 25% less pretraining data.
The most striking finding concerns data efficiency. FineWeb-Edu can match the final MMLU performance of models trained on roughly 10 times as many tokens from C4 or Dolma. A model trained on approximately 35 billion FineWeb-Edu tokens could match the MMLU performance of a model trained on 350 billion C4 tokens. This suggests that filtering for educational quality is one of the highest-leverage interventions available for improving pretraining data.
Machine classifiers can distinguish FineWeb text from text produced by competing datasets with up to 87% accuracy in binary classification, indicating that FineWeb has a distinctive data distribution that differs meaningfully from other web-scraped corpora.
The entire FineWeb pipeline was implemented using datatrove, Hugging Face's open-source Python library for large-scale data processing. Datatrove provides modular, composable pipeline blocks:
| Block type | Function |
|---|---|
| Readers | Load data from various formats (WARC, Parquet, JSONL) |
| Extractors | Extract text from raw formats (HTML via trafilatura) |
| Filters | Remove documents based on configurable criteria |
| Writers | Save processed data to disk or cloud storage |
| Deduplicators | Perform MinHash or exact deduplication |
Datatrove is designed to scale across computing clusters using frameworks like Slurm, making it practical to process the hundreds of terabytes of raw Common Crawl data. The FineWeb pipeline code is publicly available in the datatrove repository as a reference implementation.
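The composable-block design can be illustrated in plain Python, with each block as a generator that consumes and yields documents. This is only a conceptual sketch; the real datatrove API uses pipeline-block classes run by an executor rather than bare generators.

```python
# Conceptual sketch of composable pipeline blocks: reader -> filter -> writer,
# each a generator over document dicts.
def reader(paths):
    for p in paths:                      # stand-in for a WARC/JSONL reader
        yield {"text": f"contents of {p}", "id": p}

def min_length_filter(docs, n=10):
    for d in docs:                       # drop documents shorter than n chars
        if len(d["text"]) >= n:
            yield d

def writer(docs, out):
    for d in docs:                       # stand-in for a Parquet/JSONL writer
        out.append(d)
        yield d

out = []
pipeline = writer(min_length_filter(reader(["a.warc", "b.warc"])), out)
for _ in pipeline:                       # drive the generator chain
    pass
print(len(out))  # 2
```

Because each block only touches one document at a time, the same composition streams over arbitrarily large inputs and can be sharded across workers.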
Processing 96 Common Crawl snapshots through the full pipeline required substantial resources. Hugging Face reported spending an estimated $500,000 in GPU compute to process and filter approximately 38,000 TB of raw Common Crawl data down to the final 44 TB FineWeb dataset. The educational quality classification step alone required approximately 6,000 H100 GPU hours. All ablation model training added an estimated 80,000 H100 GPU hours.
Each document in FineWeb is stored as a Parquet record with the following fields:
| Field | Type | Description |
|---|---|---|
| text | string | Extracted and cleaned web page content |
| id | string | Unique document identifier |
| dump | string | Common Crawl snapshot identifier (e.g., CC-MAIN-2024-10) |
| url | string | Source URL of the web page |
| date | string | Date the page was crawled |
| file_path | string | Internal file path within the dataset |
| language | string | Detected language code |
| language_score | float64 | fastText language identification confidence score |
| token_count | int64 | Number of tokens (GPT-2 tokenizer) |
FineWeb-Edu includes two additional fields:
| Field | Type | Description |
|---|---|---|
| score | float64 | Continuous educational quality score (0.0 to 5.0) |
| int_score | int64 | Rounded integer score (0 to 5) |
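The combined schema from the two tables above can be expressed as a single typed record for illustration; the field values in the example instance are invented, and `score`/`int_score` are optional because they exist only in FineWeb-Edu.

```python
from dataclasses import dataclass
from typing import Optional

# Per-document schema from the tables above, as an illustrative dataclass.
@dataclass
class FineWebDocument:
    text: str
    id: str
    dump: str
    url: str
    date: str
    file_path: str
    language: str
    language_score: float
    token_count: int
    score: Optional[float] = None      # FineWeb-Edu only
    int_score: Optional[int] = None    # FineWeb-Edu only

# Hypothetical example record (values are made up):
doc = FineWebDocument(
    text="Example page text.", id="doc-0001", dump="CC-MAIN-2024-10",
    url="https://example.com", date="2024-02-20T00:00:00Z",
    file_path="path/to/shard.parquet", language="en", language_score=0.97,
    token_count=5, score=3.6, int_score=4,
)
print(doc.int_score)  # 4
```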
FineWeb and FineWeb-Edu have been widely adopted since their release, with downstream effects across multiple areas of the open-source AI ecosystem.
Before FineWeb, researchers training open LLMs often had to choose between smaller curated datasets (like C4 or RefinedWeb) or larger but noisier options (like raw RedPajama v2). FineWeb offered both scale and quality in a single package, reducing the need for each research group to build its own data pipeline from scratch. The release of all processing code, the educational classifier, the LLM annotations, and all 70+ ablation model checkpoints enabled full reproducibility.
FineWeb-Edu has served as a foundation for multilingual training initiatives. The TransWeb-Edu project produced balanced document-level corpora from FineWeb-Edu, and the resulting CuatroLLM and TransWebLLM models demonstrated that a high-quality English educational seed corpus can match or exceed the performance of models like Llama 3.2, Gemma, and Qwen 2.5 with far less training data.
The dataset's curation methodology influenced several other projects. Stack-Edu adapted the educational filtering approach for code data. FineWeb-Edu-Chinese applied similar techniques to Chinese web text. The Zyda-2 dataset incorporated FineWeb as a component, and Ultra-FineWeb explored additional quality verification methods on top of FineWeb's pipeline. Domain-specific subsets have also been created; OnlySports, for instance, extracted approximately 600 billion tokens of sports-related content from FineWeb.
The idea that filtering for educational content could so dramatically improve benchmark performance (while discarding over 90% of the data) challenged the common assumption that more data is always better for LLM training. This shifted attention in the field toward data quality over data quantity. The synthetic annotation approach (using an LLM to generate labels, then distilling into a lightweight classifier for efficient inference) has been widely adopted for other data filtering and scoring tasks.
In 2025, Hugging Face released FineWeb-2, a multilingual successor to FineWeb. FineWeb-2 extends the curation pipeline to over 1,000 languages using nearly 100 Common Crawl snapshots. The resulting dataset contains approximately 20 TB of text across roughly 5 billion documents, covering 1,868 language-script pairs (1,226 of which have more than 100 documents). FineWeb-2 was built from the portion of Common Crawl that did not pass FineWeb's English language identification threshold.
FineWeb-2 adapts the pipeline by tuning individual thresholds and stop-word lists for each language. In benchmark evaluations, models trained on FineWeb-2 outperformed those trained on prior multilingual datasets (CC-100, mC4, CulturaX, HPLT, HPLT2, and raw Common Crawl) on 11 out of 14 tested languages.
The authors acknowledge several limitations of FineWeb: