FineWeb-Edu

Updated May 31, 2026

FineWeb-Edu is an open, English-language pretraining dataset of roughly 1.3 trillion tokens, built by filtering the much larger FineWeb web corpus down to the documents that an automatic classifier judges to be the most educational. It was released in 2024 by a Hugging Face team led by Guilherme Penedo and Hynek Kydlíček, and it is documented alongside FineWeb in the paper "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale" (arXiv:2406.17557), which appeared at the NeurIPS 2024 Datasets and Benchmarks track ^[1]^[2]. The dataset is distributed on the Hugging Face Hub as HuggingFaceFW/fineweb-edu under the permissive Open Data Commons Attribution License (ODC-By v1.0) ^[3].

The central idea is to put data quality ahead of raw quantity. In the authors' ablations, a large language model trained on FineWeb-Edu matches the knowledge and reasoning benchmark accuracy of models trained on generic web text while seeing far fewer tokens, roughly 10 times fewer on MMLU against the strongest baseline ^[1]. That result helped make educational filtering a standard ingredient in open pretraining recipes through 2024 and 2025.

Motivation

By the time FineWeb appeared in 2024, it was widely accepted that what a model learns is shaped at least as much by what it reads as by how big it is. The pretraining corpora behind the strongest open models of the time, including Llama 3 and Mixtral, were not released, and very little was published about how they had been cleaned or filtered ^[1]. FineWeb set out to close that gap with a fully documented 15-trillion-token corpus drawn from 96 Common Crawl snapshots, with each filtering decision ablated in the open ^[1]^[2].

FineWeb-Edu took the argument a step further. Microsoft's Phi models had popularized the "textbooks are all you need" hypothesis: that training on clean, instructional, textbook-like material can teach reasoning more efficiently than training on the open web, even with far less data. The Phi work leaned heavily on synthetic textbooks and never fully disclosed its data. The FineWeb team asked a related question: instead of generating textbooks, could you simply find the textbook-like pages that already exist on the web and keep only those? FineWeb-Edu is the answer to that question, and it pairs naturally with synthetic data efforts like Cosmopedia, which generate instructional text rather than mining it.

The educational-quality classifier

The filter at the heart of FineWeb-Edu is a small classifier trained on annotations from a much larger model. The team sampled roughly 460,000 documents from FineWeb and prompted Llama-3-70B-Instruct to rate each one for educational value on an additive scale from 0 to 5, where a 0 is not educational at all and a 5 is a clear, self-contained piece suitable for teaching at a primary or grade-school level ^[1]^[3]. Using a strong instruction-tuned model as the annotator was the key trick: it produced hundreds of thousands of consistent quality judgments far more cheaply than human raters could.
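
The annotation step only becomes training data once each free-text reply is reduced to a number. The sketch below shows one way that could be done. It assumes, purely for illustration, that the annotator is asked to end its reply with a line of the form "Educational score: N"; the pattern, the function name, and the handling of unparsable replies are all hypothetical and are not taken from the released pipeline.

```python
import re
from typing import Optional

# Assumed reply format (hypothetical): the annotator ends with
# "Educational score: <integer>".
SCORE_PATTERN = re.compile(r"Educational score:\s*(\d+)", re.IGNORECASE)


def parse_educational_score(reply: str, max_score: int = 5) -> Optional[int]:
    """Return the last rating found in an annotator reply, or None.

    None is returned when no rating is present or when the number falls
    outside the 0..max_score scale, so the caller can decide to drop it.
    """
    matches = SCORE_PATTERN.findall(reply)
    if not matches:
        return None
    score = int(matches[-1])  # use the final statement if several appear
    return score if score <= max_score else None


example_reply = "The page explains photosynthesis clearly.\nEducational score: 4"
parsed = parse_educational_score(example_reply)
```

Returning None rather than a default score keeps malformed replies out of the regression targets instead of silently labelling them as 0.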

Those scores then trained a lightweight predictor that could be run over the whole corpus. The classifier is a single linear regression head placed on top of the frozen Snowflake-arctic-embed-m text-embedding model, trained on about 450,000 of the annotations for 20 epochs at a learning rate of 3e-4 ^[1]. On a held-out validation split of roughly 45,000 to 50,000 examples, treating a predicted score of 3 or higher as the positive class, the regressor reached an F1 score of about 82 percent, which the authors judged good enough to drive large-scale filtering ^[1]^[3].
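
To make the evaluation concrete, the sketch below turns continuous regression outputs into a binary "educational" label at a threshold of 3 and computes F1 against binarized annotations. The scores are made-up toy values, and applying the threshold to the raw, unrounded prediction is an assumption; the function names are illustrative only.

```python
from typing import Sequence


def binarize(scores: Sequence[float], threshold: float = 3.0) -> list[bool]:
    """Label a document positive when its score meets the threshold."""
    return [score >= threshold for score in scores]


def f1_at_threshold(
    predicted: Sequence[float],
    annotated: Sequence[float],
    threshold: float = 3.0,
) -> float:
    """F1 of the positive class after binarizing both score lists."""
    pred = binarize(predicted, threshold)
    gold = binarize(annotated, threshold)
    true_pos = sum(p and g for p, g in zip(pred, gold))
    false_pos = sum(p and not g for p, g in zip(pred, gold))
    false_neg = sum(g and not p for p, g in zip(pred, gold))
    if true_pos == 0:
        return 0.0  # avoids division by zero when nothing is matched
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): regressor outputs versus annotator ratings.
predicted_scores = [3.4, 2.1, 4.0, 2.9, 3.2]
annotated_scores = [3, 2, 4, 3, 2]
toy_f1 = f1_at_threshold(predicted_scores, annotated_scores)
```

In this toy case two documents are correctly kept, one is wrongly kept, and one is wrongly dropped, so precision, recall, and F1 all equal 2/3.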

Running this classifier across all 15 trillion tokens of FineWeb cost roughly 6,000 H100 GPU-hours, a tiny fraction of what training a model on the data would cost ^[1]. Keeping only documents that scored 3 or above removed about 92 percent of FineWeb and left the 1.3 trillion educational tokens that make up the main release ^[1]^[3].
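
The figures in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below uses only the rounded numbers quoted above, so its results are approximate, and the variable names are illustrative.

```python
# Rounded figures quoted in the text above.
FINEWEB_TOKENS = 15e12         # about 15 trillion tokens scored
FINEWEB_EDU_TOKENS = 1.3e12    # about 1.3 trillion tokens kept
CLASSIFIER_GPU_HOURS = 6_000   # H100 GPU-hours spent on scoring

retained_fraction = FINEWEB_EDU_TOKENS / FINEWEB_TOKENS
removed_fraction = 1 - retained_fraction
tokens_per_gpu_hour = FINEWEB_TOKENS / CLASSIFIER_GPU_HOURS

print(f"retained: {retained_fraction:.1%}")   # about 8.7%
print(f"removed:  {removed_fraction:.1%}")    # about 91.3%
print(f"scored per GPU-hour: {tokens_per_gpu_hour:.2e} tokens")  # 2.50e+09
```

From these rounded inputs the removed share comes out near 91 percent by token count, close to the quoted figure of about 92 percent; the small gap is presumably due to rounding in the inputs or to a different counting basis.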

Sizes, variants, and licensing

FineWeb-Edu ships in two main flavors that differ only in how strict the quality cutoff is. The headline dataset keeps documents scoring 3 or higher. A second release, HuggingFaceFW/fineweb-edu-score-2, lowers the bar to a score of 2 and so keeps far more material, about 5.4 trillion tokens ^[1]^[4]. The looser variant trades some average quality for the larger unique-token budget that long training runs need, while the stricter 1.3T set gives the cleanest signal per token.

Variant	Score threshold	Approx. tokens	Hugging Face ID	License
FineWeb-Edu	score ≥ 3	~1.3 trillion	HuggingFaceFW/fineweb-edu	ODC-By v1.0
FineWeb-Edu-score-2	score ≥ 2	~5.4 trillion	HuggingFaceFW/fineweb-edu-score-2	ODC-By v1.0
Source corpus (FineWeb)	none	~15 trillion	HuggingFaceFW/fineweb	ODC-By v1.0

Both variants are English-only, stored as Parquet, and carry per-document metadata including the original URL, the Common Crawl dump it came from, a token count, and the predicted educational score. The team also published the annotation set (fineweb-edu-llama3-annotations) and the trained classifier (fineweb-edu-classifier) so the whole pipeline can be reproduced or retargeted ^[3]^[8]. The ODC-By license permits commercial use with attribution, and use is also subject to the Common Crawl terms of service.
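
Because each record carries its own score, dump, and token count, simple per-snapshot statistics can be computed without re-running the classifier. The sketch below tallies retained tokens per Common Crawl dump over a few in-memory records. The field names `score`, `dump`, `token_count`, and `url` are assumptions about the schema, and the sample records are invented for illustration.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def tokens_per_dump(
    records: Iterable[Mapping[str, object]],
    min_score: float = 3.0,
) -> dict[str, int]:
    """Sum token counts per dump for records at or above min_score."""
    totals: dict[str, int] = defaultdict(int)
    for record in records:
        if float(record["score"]) >= min_score:
            totals[str(record["dump"])] += int(record["token_count"])
    return dict(totals)


# Invented sample records using the assumed field names.
sample_records = [
    {"url": "https://example.org/a", "dump": "CC-MAIN-2024-10", "token_count": 800, "score": 3.6},
    {"url": "https://example.org/b", "dump": "CC-MAIN-2024-10", "token_count": 500, "score": 2.4},
    {"url": "https://example.org/c", "dump": "CC-MAIN-2023-50", "token_count": 1200, "score": 4.1},
]

strict_totals = tokens_per_dump(sample_records, min_score=3.0)
loose_totals = tokens_per_dump(sample_records, min_score=2.0)
```

Lowering `min_score` from 3 to 2 in this sketch mirrors the difference between the two released variants: the looser cutoff admits the second record and raises that dump's total.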

Ablation results

The case for FineWeb-Edu rests on controlled comparisons in which the only thing that changes is the training data. Across these ablations, models trained on FineWeb-Edu beat models trained on FineWeb itself and on every other public web dataset the team tested, including C4, Dolma, RefinedWeb, The Pile, SlimPajama, and RedPajama2, on knowledge and reasoning benchmarks such as MMLU, ARC, and OpenBookQA ^[1]^[2].

The size of the jump is what drew attention. Substituting FineWeb-Edu for FineWeb raised MMLU accuracy from about 33 percent to about 37 percent and ARC from about 46 percent to about 57 percent at a matched token budget ^[1]^[2]. The efficiency angle is even more striking: the authors report that FineWeb-Edu matches the MMLU accuracy of the strongest baseline using roughly 10 times fewer tokens, and reaches scores that competing web datasets like C4 and Dolma need far more data to approach ^[1]. In one widely cited curve, a 1.8 billion parameter model trained on only 350 billion FineWeb-Edu tokens already outperforms the same model trained on the full FineWeb set ^[1].
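
It helps to separate absolute from relative gains when reading these numbers. The sketch below computes both from the rounded accuracies quoted above; the helper name is illustrative, and the results inherit the rounding of the inputs.

```python
def accuracy_gain(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in points, gain relative to the baseline)."""
    absolute = after - before
    relative = absolute / before
    return absolute, relative


# Rounded accuracies (percent) quoted in the text: FineWeb -> FineWeb-Edu.
mmlu_abs, mmlu_rel = accuracy_gain(33.0, 37.0)
arc_abs, arc_rel = accuracy_gain(46.0, 57.0)

print(f"MMLU: +{mmlu_abs:.0f} points ({mmlu_rel:.0%} relative)")  # +4 points (12% relative)
print(f"ARC:  +{arc_abs:.0f} points ({arc_rel:.0%} relative)")    # +11 points (24% relative)
```

On these figures the ARC improvement is roughly twice the MMLU improvement in relative terms.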

The filtering is not free of trade-offs. Pushing the score threshold above 3 keeps squeezing out gains on knowledge-heavy benchmarks but starts to hurt performance on HellaSwag and PIQA, which reward commonsense and everyday language more than academic prose ^[1]^[3]. A score of 3 was chosen as the balance point that helps reasoning without gutting the broader distribution.
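
The data-volume side of this trade-off can be illustrated with a threshold sweep: each step up in the cutoff shrinks the pool of tokens available for training. The sketch below computes the retained token fraction at each integer threshold over a handful of invented (score, token count) pairs; it does not model benchmark accuracy, only how quickly the corpus narrows.

```python
from typing import Iterable, Sequence


def retained_token_fraction(
    scored_docs: Sequence[tuple[float, int]],
    thresholds: Iterable[int] = (1, 2, 3, 4, 5),
) -> dict[int, float]:
    """Fraction of all tokens kept when filtering at each threshold."""
    total_tokens = sum(tokens for _, tokens in scored_docs)
    if total_tokens == 0:
        raise ValueError("scored_docs must contain at least one token")
    return {
        threshold: sum(
            tokens for score, tokens in scored_docs if score >= threshold
        ) / total_tokens
        for threshold in thresholds
    }


# Invented (predicted score, token count) pairs for illustration.
toy_docs = [(0.8, 100), (2.2, 300), (3.1, 200), (3.9, 250), (4.6, 150)]
sweep = retained_token_fraction(toy_docs)
```

For this toy set the sweep keeps 90 percent of tokens at thresholds 1 and 2, 60 percent at 3, 15 percent at 4, and nothing at 5.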

Adoption

FineWeb-Edu quickly became a default high-quality component in open pretraining mixes. Hugging Face's own SmolLM models were trained on the SmolLM-Corpus, which pairs a deduplicated 220 billion token slice of FineWeb-Edu with the synthetic Cosmopedia v2 textbooks (about 28 billion tokens generated by Mixtral-8x7B-Instruct) and a Python-Edu code subset ^[5]^[6]. That pairing is a clean illustration of the two routes to instructional data sitting side by side: mined educational web pages from FineWeb-Edu and generated educational text from synthetic data pipelines.

The influence spread beyond Hugging Face. Allen Institute for AI's OLMo 2 reused the FineWeb-Edu classifier to score and select high-quality web text for the later, quality-focused stage of its training mix ^[7]. The classifier itself, a regression head on top of Arctic Embed M trained on about 450,000 Llama-3-70B-Instruct labels, became a reusable artifact that other groups apply to their own corpora ^[8]. The dataset has also been kept current: later releases (v1.2.0 in early 2025 and v1.4.0 in mid-2025) folded in additional Common Crawl snapshots covering 2024 and the first half of 2025 ^[3]. Follow-on research such as Ultra-FineWeb has tried to refine the same model-based filtering recipe further ^[9].

Limitations

The quality of FineWeb-Edu is only as good as the judgments baked into its classifier, and several caveats follow from that. The labels come from a single annotator model, Llama-3-70B-Instruct, so any biases that model holds about what counts as educational are inherited by the filter ^[3]. At the high end of the scale the classifier tends to reward content that simply looks academic, which can favor formal, polished prose over equally useful but plainer material ^[3].

The notion of "educational" used here is also fairly narrow. The annotation prompt was oriented toward primary and grade-school learning, so the classifier can be less reliable on advanced, higher-education, or specialized technical content, and on text that differs from the FineWeb distribution it was trained on ^[3]. Each page is scored on its own, without any view of broader context, which can misjudge documents that only make sense as part of a larger whole.

Two structural limits round out the picture. The dataset is English-only, so it does nothing for multilingual training on its own, a gap the separate FineWeb2 effort was created to address ^[1]. And because the headline gains are measured on academic benchmarks like MMLU and ARC, there is a real risk of optimizing the data toward those tests; the observed drop on HellaSwag and PIQA at higher thresholds is a concrete reminder that aggressive educational filtering narrows the distribution as well as cleaning it ^[1].

References

1. Penedo, Guilherme; Kydlíček, Hynek; Ben Allal, Loubna; Lozhkov, Anton; Mitchell, Margaret; Raffel, Colin; von Werra, Leandro; Wolf, Thomas. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." arXiv:2406.17557, 2024. https://arxiv.org/abs/2406.17557
2. Penedo, Guilherme; et al. "The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale." Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Datasets and Benchmarks Track. https://papers.neurips.cc/paper_files/paper/2024/file/370df50ccfdf8bde18f8f9c2d9151bda-Paper-Datasets_and_Benchmarks_Track.pdf
3. "HuggingFaceFW/fineweb-edu." Hugging Face Datasets. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
4. "HuggingFaceFW/fineweb-edu-score-2." Hugging Face Datasets. https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-score-2
5. Ben Allal, Loubna; et al. "SmolLM: blazingly fast and remarkably powerful." Hugging Face blog, 2024. https://huggingface.co/blog/smollm
6. Ben Allal, Loubna; et al. "Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models." Hugging Face blog, 2024. https://huggingface.co/blog/cosmopedia
7. OLMo Team. "2 OLMo 2 Furious." arXiv:2501.00656, 2025. https://arxiv.org/abs/2501.00656
8. "HuggingFaceFW/fineweb-edu-classifier." Hugging Face. https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier
9. Wang, Yudong; et al. "Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data." arXiv:2505.05427, 2025. https://arxiv.org/abs/2505.05427
